


† These authors contributed equally to this work.
This work was partly supported by the Agency for Defense Development and by a Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0020536, HRD Program for Industrial Innovation).


Corresponding author: Inwook Shim (e-mail: iwshim@inha.ac.kr).

Self-Supervised 3D Traversability Estimation with Proxy Bank Guidance

JIHWAN BAE¹, JUNWON SEO¹, TAEKYUNG KIM¹, HAE-GON JEON², KIHO KWAK¹, AND INWOOK SHIM³
¹Agency for Defense Development, Daejeon 34186, Republic of Korea (e-mail: mierojihan1008@gmail.com, junwon.vision@gmail.com, tkkim.robot@gmail.com, kkwak.add@gmail.com)
²AI Graduate School and the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju 61005, Korea (e-mail: haegonj@gist.ac.kr)
³Department of Smart Mobility Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Korea (e-mail: iwshim@inha.ac.kr)
Abstract

Traversability estimation for mobile robots in off-road environments requires more than the conventional semantic segmentation used in constrained environments such as on-road conditions. Recently, approaches that learn traversability from past driving experiences in a self-supervised manner have been emerging, as they can significantly reduce human labeling costs and labeling errors. However, self-supervised data only provide supervision for the regions actually traversed, inducing epistemic uncertainty due to the scarcity of negative information. Negative data are rarely harvested, as the system can be severely damaged while logging such data. To mitigate this uncertainty, we introduce a deep metric learning-based method that incorporates unlabeled data with a few positive and negative prototypes, jointly learning semantic segmentation and traversability regression. To rigorously evaluate the proposed framework, we introduce a new evaluation metric that comprehensively assesses both the segmentation and the regression. Additionally, we construct a driving dataset, ‘Dtrail’, in off-road environments with a mobile robot platform, which contains a wide variety of negative data. We examine our method on Dtrail as well as the publicly available SemanticKITTI dataset.

Index Terms:
deep metric learning, mobile robots, autonomous driving

I Introduction

Estimating traversability for mobile robots is an important task for autonomous driving and machine perception. However, the majority of the relevant works focus on constrained road environments such as paved roads, which are well covered by public datasets [1, 2, 3]. In urban scenes, road detection with semantic segmentation is sufficient [4, 5], but in unconstrained environments such as off-road areas, semantic segmentation is insufficient because the environment can be highly complex and rough [6], as shown in Fig. 1a. Several works in the robotics field have proposed methods to estimate the traversability cost in unconstrained environments [7, 8, 9, 10] and to infer a probabilistic traversability map from visual information such as images [11] and 3D LiDAR [6].

Figure 1: Illustration of the motivation of our framework. The color map for traversability indicates that higher values (green) correspond to regions that are easier to traverse, while lower values (purple) correspond to regions that are harder to traverse. Self-supervised traversability estimation should minimize the epistemic uncertainty by filtering out non-traversable regions for which no traversability supervision is given.

Actual physical state changes that a vehicle undergoes can give meaningful information on where it can traverse and how difficult the traversal would be. Such physical changes that the vehicle experiences itself are called self-supervised data. Accordingly, self-supervised traversability estimation can offer more robot-oriented prediction [11, 12, 13]. Fig. 1b shows an example of self-supervised traversability data. Previously, haptic inspection [13, 11] has been examined as a traversability signal in self-supervised approaches. These works demonstrate that learning from self-supervised data is a promising approach for traversability estimation, but they are confined to the proprioceptive sensor domain or the image domain. Additionally, supervision from the self-supervised data is limited to the regions actually traversed, as depicted in Fig. 1b, thereby inducing epistemic uncertainty when inferring the traversability of non-traversed regions. An example of such epistemic uncertainty is illustrated in Fig. 1c. Trees that are impossible to drive over are regressed with high traversability, which means they are easy to traverse.

In this paper, we propose a self-supervised framework for traversability estimation on 3D point cloud data in unconstrained environments, concentrating on alleviating epistemic uncertainty. We jointly learn semantic segmentation along with traversability regression via deep metric learning to filter out the non-traversable regions (see Fig. 1d). Also, to harness the unlabeled data from the non-traversed area, we introduce an unsupervised loss similar to clustering methods [14]. To better evaluate our task, we develop a new evaluation metric that evaluates both the segmentation and the regression, while highlighting the false-positive ratio for reliable estimation. To test our method on more realistic data, we build an off-road robot driving dataset named ‘Dtrail.’ Experimental results are shown for both the Dtrail and SemanticKITTI [15] datasets. Ablations and comparisons with other metric learning-based methods show that our method yields quantitatively and qualitatively robust results. Our contributions in this work are fivefold:

  • We introduce a self-supervised traversability estimation framework on 3D point clouds that mitigates the uncertainty.

  • We adopt a deep metric learning-based method that jointly learns the semantic segmentation and the traversability estimation.

  • We propose the unsupervised loss to utilize the unlabeled data in the current self-supervised settings.

  • We devise a new metric to evaluate the suggested framework properly.

  • We present a new 3D point cloud dataset for off-road mobile robot driving in unconstrained environments that includes IMU data synchronized with LiDAR.

II Related Works

II-A Traversability Estimation

Traversability estimation is a crucial component of mobile robotics platforms for deciding where the robot should go. In the case of paved road conditions, the traversability estimation task can be regarded as a subset of road detection [5, 16] and semantic segmentation [17]. However, the human-supervised approach is clearly limited in estimating traversability for unconstrained environments such as off-road areas. Given the diversity of road conditions, it is hard to determine the traversability of a mobile robot in advance with man-made predefined rules.

Self-supervised approaches [18, 19, 20, 13] have been suggested in the robotics literature to estimate traversability using proprioceptive sensors such as inertial measurement and force-torque sensors [13]. Since these approaches only measure traversability in the proprioceptive-sensor domain, they cannot inform the robot’s future driving direction. To solve this problem, a study that predicts terrain properties by combining image information with the robot’s self-supervision has been proposed [11]. The authors identify terrain properties from haptic interaction and associate them with the image to facilitate self-supervised learning. This work demonstrates promising outputs for traversability estimation, but it does not take into account the epistemic uncertainty that necessarily exists in the self-supervised data. Furthermore, image-based learning approaches remain vulnerable to illumination changes that can reduce the performance of the algorithms. Therefore, range sensors such as 3D LiDAR can be a strong alternative [21].

To overcome such limitations, we propose self-supervised traversability estimation in unconstrained environments that alleviates the inherent uncertainty using 3D point cloud data.

(a) Query
(b) Support
Figure 2: Examples of our task data settings: query and support data. (a) is an example of query data. Unlabeled points are colored black, and non-black points indicate the robot’s traversable region. Traversability is mapped only onto the positive points. (b) is an example of support data. Traversable and non-traversable regions are manually labeled in red and blue, respectively. Only evident regions are labeled and used for training in the support data.

II-B Deep Metric Learning

One of the biggest challenges in learning with few labeled data is epistemic uncertainty. To handle this problem, researchers proposed deep metric learning (DML) [22], which learns embedding spaces and classifies an unseen sample in the learned space. Several works adopt the sampled mini-batches called episodes during training, which mimics the task with few labeled data to facilitate DML [23, 24, 25, 26]. These methods with episodic training strategies epitomize labeled data of each class as a single vector, referred to as a prototype [27, 28, 29, 30, 17]. The prototypes generated by these works require non-parametric procedures and insufficiently represent unlabeled data.

Other works [31, 32, 33, 34, 35, 36] develop loss functions to learn an embedding space where similar examples are attracted and dissimilar examples are repelled. Recently, proxy-based losses have been proposed [37]. Proxies are representative vectors of the training data in the learned embedding spaces, which are obtained in a parametric way [38, 39]. Using proxies leads to better convergence as they reflect the entire distribution of the training data [37]. The majority of works [38, 39] provide a single proxy for each class, whereas the SoftTriple loss [40] adopts multiple proxies per class. We adopt the SoftTriple loss, as traversable and non-traversable regions are represented as multiple clusters rather than a single one on unstructured driving surfaces, according to their complexity and roughness.

Figure 3: Illustration of our task definition. Given the point cloud data, the initial traversability map and the semantic segmentation mask are produced. The final output is obtained by masking out the non-traversable regions of the initial traversability map.

III Methods

III-A Overview

Our self-supervised framework aims to learn a mapping from point clouds to traversability. We call the input data containing the traversability information ‘query’ data. The traversable regions are referred to as the ‘positive’ class, and the non-traversable regions as the ‘negative’ class in this work. In query data, only positive points are labeled, along with their traversability; the rest remain unlabeled. Non-black points in Fig. 2(a) indicate positive regions, and black points indicate unlabeled regions.

However, a limitation is that the query data are devoid of any supervision for negative regions. With query data only, the results would be unreliable, as negative regions can be regressed as highly traversable due to the epistemic uncertainty (Fig. 1c). Consequently, our task aims to learn semantic segmentation along with traversability regression to mask out the negative regions, thereby mitigating the epistemic uncertainty. Accordingly, we utilize a very small number of hand-labeled point cloud scenes, which we call ‘support’ data. In support data, traversable and non-traversable regions are manually annotated as positive and negative, respectively. Manually labeling entire scenes can be biased by human intuition. Therefore, only evident regions are labeled and used for training. Fig. 2(b) shows an example of labeled support data.

The overall schema of our task is illustrated in Fig. 3. When the input point cloud data is given, a segmentation mask is applied to the initial version of the traversability regression map, producing a masked traversability map as a final output. For training, we form an episode composed of queries and randomly sampled support data. We can optimize our network over both query and relatively small support data with the episodic strategy [17]. Also, to properly evaluate the proposed framework, we introduce a new metric that comprehensively measures the segmentation and the regression, while highlighting the nature of the traversability estimation task with the epistemic uncertainty.

III-B Baseline Method

Let the query data, consisting of positive and unlabeled data, be denoted as $Q=\{Q_P, Q_U\}$, and the support data, consisting of positive and negative data, as $S=\{S_P, S_N\}$. Let $P_i\in\mathbb{R}^3$ denote the 3D point, $a_i\in\mathbb{R}$ the traversability, and $y_i\in\{0,1\}$ the class of each point. Accordingly, data from $Q_P$, $Q_U$, $S_P$, and $S_N$ take the forms $\{P_i, a_i, y_i\}$, $\{P_i\}$, $\{P_i, y_i\}$, and $\{P_i, y_i\}$, respectively. Let $f_\theta$ denote a feature encoding backbone, where $\theta$ indicates the network parameters, $x_i\in\mathbb{R}^d$ the encoded feature extracted from $P_i$, and $h_\theta$ the multi-layer perceptron (MLP) head for the traversability regression. $g_\theta$ denotes the MLP head for the segmentation that distinguishes traversable from non-traversable regions. The encoded feature domains for each data set are denoted as $\mathbb{Q}_P$, $\mathbb{Q}_U$, $\mathbb{S}_P$, and $\mathbb{S}_N$.

A baseline solution learns the network with labeled data only. $Q_P$ is used for the traversability regression, and both $Q_P$ and $S$ are used for the segmentation. We obtain the traversability map $t_i=h(x_i)$, $t_i\in\mathbb{R}$, and the segmentation map $s_i=g(x_i)$, $s_i\in\{0,1\}$. The final masked traversability map $T_i$ is represented as the element-wise multiplication $T_i=t_i\odot s_i$. The regression loss $L^{reg}$ is computed on $\mathbb{Q}_P$ and is based on a mean squared error loss as in Eq. (1), where $x_i$ is the $i$-th element of $\mathbb{Q}_P$.

$L^{reg}(x_i)=(h(x_i)-a_i)^2.$ (1)

For the segmentation loss $L^{seg}$, the binary cross-entropy loss is used in the supervised setting as in Eq. (2), where $x_i$ refers to the $i$-th element of either $\mathbb{Q}_P$ or $\mathbb{S}$. Both the positive query and the support data can be used for the segmentation loss as follows:

$L^{seg}(x_i)=-\big(y_i\log(g(x_i))+(1-y_i)\log(1-g(x_i))\big).$ (2)

Combining the regression and the segmentation, the traversability estimation loss in the supervised setting is defined as follows:

$L^{Supervised}(\mathbb{Q}_P,\mathbb{S})=\frac{1}{|\mathbb{Q}_P|}\sum_{x_i\in\mathbb{Q}_P}\big(L^{reg}(x_i)+L^{seg}(x_i)\big)+\frac{1}{|\mathbb{S}|}\sum_{x_i\in\mathbb{S}}L^{seg}(x_i).$ (3)
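As a rough illustration, the following sketch shows how the supervised baseline of Eqs. (1)-(3) could be computed; it is not the authors' code, and the names h, g, x_qp, a_qp, x_s, and y_s are placeholder names for the regression head, segmentation head, and encoded query/support batches.

```python
# Minimal sketch of the supervised baseline (Eqs. (1)-(3)); assumes per-point features
# were already encoded by the backbone f_theta and that h/g are small MLP heads.
import torch
import torch.nn.functional as F

def supervised_loss(h, g, x_qp, a_qp, x_s, y_s):
    """x_qp: features of positive query points, a_qp: their traversability targets,
    x_s / y_s: support features and binary labels (1 = traversable)."""
    # Eq. (1): mean squared error on traversability for positive query points.
    l_reg = F.mse_loss(h(x_qp).squeeze(-1), a_qp)
    # Eq. (2): binary cross-entropy for segmentation; positive query points have label 1.
    y_qp = torch.ones(x_qp.shape[0], device=x_qp.device)
    l_seg_q = F.binary_cross_entropy_with_logits(g(x_qp).squeeze(-1), y_qp)
    l_seg_s = F.binary_cross_entropy_with_logits(g(x_s).squeeze(-1), y_s.float())
    # Eq. (3): sum of the per-set averages of the regression and segmentation terms.
    return l_reg + l_seg_q + l_seg_s
```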

Nonetheless, this baseline does not fully take advantage of data captured on various driving surfaces. Since the learning is limited to the labeled data, it cannot capture the full characteristics of the training data. This drawback hinders the capability of a traversability estimator trained in a supervised manner.

III-C Metric learning method

We adopt metric learning to overcome the limitation of the fully-supervised solution. The objectives of metric learning are to learn an embedding space and find representations that epitomize the training data in that space. To jointly optimize the embedding network and the representations, we adopt a proxy-based loss [39]. The embedding network is updated based on the positions of the proxies, and the proxies are adjusted by the updated embedding network iteratively. The proxies can be regarded as representations that abstract the training data. We refer to this set of proxies as the ‘proxy bank,’ denoted as $\mathbb{B}=\{\mathbb{B}_P, \mathbb{B}_N\}$, where $\mathbb{B}_P$ and $\mathbb{B}_N$ indicate the set of proxies for each class. The segmentation map is inferred based on the similarity between feature vectors and the proxies of each class, as $s_i=g(\mathbb{B}, x_i)$.

Figure 4: Illustration of the effect of adopting the unlabeled data. Red and blue nodes are embedding vectors of positive and negative data. Gray nodes with a question mark indicate the unlabeled data, and the black ones indicate proxies. The background color and lines indicate decision boundaries in the embedding space. The embedded vectors (non-black nodes) assigned to the proxies are connected to each other with solid lines. (a) Without unlabeled data, proxies and decision boundaries are optimized only with labeled data. (b) With unlabeled data, the optimization exploits the broader context of the training data, resulting in a more precise and discriminative decision boundary.

The representations of traversable and non-traversable regions exhibit large intra-class variations, with numerous sub-classes in each class: flat ground or gravel roads for positive, and rocks, trees, or bushes for negative. For the segmentation, we use the SoftTriple loss [40], which utilizes multiple proxies per class. The similarity between $x_i$ and class $c$, denoted as $S_{i,c}$, is defined by a weighted sum of the cosine similarities between $x_i$ and $\mathbb{B}_c=\{p_c^1,\ldots,p_c^K\}$, where $c$ denotes positive or negative, $K$ is the number of proxies per class, and $p_c^k$ is the $k$-th proxy in the proxy bank. The weight given to each cosine similarity is proportional to its value. $S_{i,c}$ is defined as follows:

$S_{i,c}=\sum_k\frac{\exp(\frac{1}{T}x_i^\top p_c^k)}{\sum_k\exp(\frac{1}{T}x_i^\top p_c^k)}\,x_i^\top p_c^k,$ (4)

where $T$ is a temperature parameter that controls the softness of the assignments. Soft assignments reduce the sensitivity to the choice among multiple centers. Note that the $l_2$ norm is applied to the embedding vectors so that their magnitudes do not diverge. The SoftTriple loss is then defined as follows:

$L^{SoftTriple}(x_i)=-\log\frac{\exp(\lambda(S_{i,y_i}-\delta))}{\exp(\lambda(S_{i,y_i}-\delta))+\sum_{j\neq y_i}\exp(\lambda S_{i,j})},$ (5)

where $\lambda$ is a hyperparameter for the smoothing effect and $\delta$ is a margin. The segmentation loss using the proxy bank can be reformulated with the SoftTriple loss as Eq. (6), and the traversability estimation loss using the proxy bank is defined as Eq. (7).

$L^{seg}(x_i,\mathbb{B})=-\log\frac{\exp(\lambda(S_{i,y_i}-\delta))}{\exp(\lambda(S_{i,y_i}-\delta))+\exp(\lambda S_{i,1-y_i})}.$ (6)
$L^{Proxy}(\mathbb{Q}_P,\mathbb{S},\mathbb{B})=\frac{1}{|\mathbb{Q}_P|}\sum_{x_i\in\mathbb{Q}_P}\big(L^{reg}(x_i)+L^{seg}(x_i,\mathbb{B})\big)+\frac{1}{|\mathbb{S}|}\sum_{x_i\in\mathbb{S}}L^{seg}(x_i,\mathbb{B}).$ (7)
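The following sketch illustrates how the multi-proxy similarity of Eq. (4) and the proxy-bank segmentation loss of Eqs. (5)-(6) could be computed; it is only our reading of the losses, and the tensor shapes (e.g., a proxy bank of shape [num_classes, K, d]) are assumptions, not the authors' implementation.

```python
# Minimal sketch of Eqs. (4)-(6): multi-proxy similarity and SoftTriple-style loss.
# Assumes features x [N, d] and proxies [num_classes, K, d] are l2-normalized.
import torch
import torch.nn.functional as F

def class_similarity(x, proxies, T=0.05):
    """Eq. (4): soft assignment of each point to the K proxies of each class."""
    sims = torch.einsum('nd,ckd->nck', x, proxies)   # cosine similarities x^T p_c^k
    weights = F.softmax(sims / T, dim=-1)            # softmax over the K proxies
    return (weights * sims).sum(dim=-1)              # S_{i,c}, shape [N, num_classes]

def softtriple_loss(x, labels, proxies, lam=20.0, delta=0.01, T=0.05):
    """Eqs. (5)-(6): cross-entropy over lambda-scaled similarities with margin delta."""
    S = class_similarity(x, proxies, T)                                  # [N, num_classes]
    margin = torch.zeros_like(S).scatter_(1, labels.view(-1, 1), delta)  # delta only at y_i
    return F.cross_entropy(lam * (S - margin), labels)
```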

Unlabeled data, which are abundant in self-supervised traversability data, have not been considered in previous works. To enhance the supervision we can extract from the data, we utilize the unlabeled portion of the query data in the learning process. The problem is that the segmentation loss cannot be applied to $\mathbb{Q}_U$ because no class label $y_i$ exists for these points. We therefore assign an auxiliary target to each unlabeled point, as in clustering [41]. The pseudo class $\hat{y_i}$ of the $i$-th sample is assigned based on the class of the nearest proxy in the embedding space as $\hat{y_i}=\operatorname{argmax}_{c\in\{P,N\}}S_{i,c}$.

The unsupervised loss for the segmentation, denoted as $L^U$, is defined in Eq. (8) using the pseudo-class, where $x_i$ is the embedding of the $i$-th sample in $\mathbb{Q}_U$.

$L^{U}(x_i,\mathbb{B})=-\log\frac{\exp(\lambda(S_{i,\hat{y_i}}-\delta))}{\exp(\lambda(S_{i,\hat{y_i}}-\delta))+\exp(\lambda S_{i,1-\hat{y_i}})}$ (8)
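A minimal sketch of the unsupervised term is given below, reusing the class_similarity and softtriple_loss helpers sketched above; the pseudo-class is simply the class of the most similar proxy set, which is our assumed reading of Eq. (8).

```python
# Minimal sketch of Eq. (8): pseudo-labels for unlabeled query points.
import torch

def unsupervised_loss(x_unlabeled, proxies, lam=20.0, delta=0.01, T=0.05):
    with torch.no_grad():
        S = class_similarity(x_unlabeled, proxies, T)
        pseudo = S.argmax(dim=1)   # y_hat_i = argmax_c S_{i,c}
    # Reuse the proxy-based segmentation loss with the pseudo-labels as targets.
    return softtriple_loss(x_unlabeled, pseudo, proxies, lam, delta, T)
```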

Fig. 4 illustrates the benefit of incorporating the unlabeled loss. The embedding network learns to capture a broader distribution of the data, and the learned proxies represent the training data better. When unlabeled data features are assigned to the proxies (Fig. 4a), the embedding space and proxies are updated as in Fig. 4b, exhibiting more precise decision boundaries.

Combining the aforementioned objectives, we define our final objective, the ‘Traverse Loss,’ as Eq. (9). The overall high-level schema of the learning procedure is depicted in Fig. 5.

$L^{Traverse}(\mathbb{Q},\mathbb{S},\mathbb{B})=L^{Proxy}(\mathbb{Q}_P,\mathbb{S},\mathbb{B})+\frac{1}{|\mathbb{Q}_U|}\sum_{x_i\in\mathbb{Q}_U}L^{U}(x_i,\mathbb{B})$ (9)
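Putting the pieces together, a minimal sketch of the Traverse Loss in Eq. (9) could look as follows; it relies on the helpers sketched above, assumes class index 1 denotes the positive (traversable) class, and the variable names (x_qp, x_qu, x_s, ...) are placeholders rather than the authors' API.

```python
# Minimal sketch of Eq. (9): supervised proxy terms plus the unsupervised term.
import torch

def traverse_loss(h, x_qp, a_qp, x_qu, x_s, y_s, proxies):
    # Eq. (7): regression on positive queries + proxy-based segmentation on queries/support.
    l_reg = torch.mean((h(x_qp).squeeze(-1) - a_qp) ** 2)
    pos = torch.ones(x_qp.shape[0], dtype=torch.long, device=x_qp.device)  # class 1 = traversable
    l_proxy = l_reg + softtriple_loss(x_qp, pos, proxies) + softtriple_loss(x_s, y_s, proxies)
    # Eq. (8): pseudo-labeled loss on unlabeled query points.
    return l_proxy + unsupervised_loss(x_qu, proxies)
```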
Figure 5: Illustration of the learning procedure of the proposed framework with deep metric learning.

III-D Re-initialization to avoid trivial solutions

Our metric learning method can suffer from sub-optimal solutions, which are induced by empty proxies. Empty proxies indicate the proxies to which none of the data are assigned. Such empty proxies should be redeployed to be a good representation of training data. Otherwise, the model might lose the discriminative power and the bank might include semantically poor representations.

Our intuitive idea to circumvent an empty proxy is to re-initialize it with support data features. By updating the empty proxies with support data, the proxy bank can reflect training data that was not effectively captured beforehand. In order to obtain representative feature vectors without noise, $M$ prototype feature vectors, denoted as $\mu^+=\{\mu_m^+, m=1,\ldots,M\}$ and $\mu^-=\{\mu_m^-, m=1,\ldots,M\}$, are estimated using an Expectation-Maximization (EM) algorithm [42]. The prototype vectors are the cluster centers of the support features. We randomly choose prototype vectors, add small perturbations, and use them as re-initialized proxies. Algorithm 1 summarizes the overall training procedure of our method.
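A minimal sketch of this re-initialization step is given below; it assumes the per-proxy assignment counts are tracked during the epoch and uses a Gaussian mixture fit (EM) over the support features of one class, with M, the noise scale, and all variable names being illustrative assumptions.

```python
# Minimal sketch of re-initializing empty proxies from support-feature cluster centers.
import torch
from sklearn.mixture import GaussianMixture

def reinit_empty_proxies(proxies, counts, support_feats, M=4, noise=1e-2):
    """proxies: [K, d] proxies of one class, counts: [K] assignment counts,
    support_feats: [S, d] support features of the same class."""
    gmm = GaussianMixture(n_components=M).fit(support_feats.detach().cpu().numpy())
    centers = torch.as_tensor(gmm.means_, dtype=proxies.dtype, device=proxies.device)
    for k in torch.nonzero(counts == 0).flatten():
        m = torch.randint(0, M, (1,)).item()                 # pick a random prototype center
        proxies.data[k] = centers[m] + noise * torch.randn_like(centers[m])
    return proxies
```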

III-E Traversability Precision Error

We devise a new metric for the proposed framework, the ‘Traversability Precision Error’ (TPE). The new metric should comprehensively evaluate the segmentation and the regression while taking the critical aspect of traversability estimation into account. One of the most important aspects of traversability estimation is to avoid false positives of the traversable region, i.e., regions that are impossible to traverse but inferred as traversable. If such a region is estimated as traversable, a robot will likely go over it, resulting in undesirable movements. The impact of a false positive decreases if the region is estimated as less traversable. TPE therefore measures the false positives of the traversable region while extenuating their impact with the traversability $t_i$. TPE is defined in Eq. (10), where $TN$, $FP$, and $FN$ denote the number of true negative, false positive, and false negative points of the traversable region, respectively.

$\text{Traversability Precision Error (TPE)}=\frac{TN}{TN+FP\,(1-t_i)+FN}$ (10)
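A minimal sketch of TPE as we read Eq. (10) is shown below: false positives of the traversable class are down-weighted by how low their predicted traversability is; the array names are placeholders.

```python
# Minimal sketch of the TPE metric in Eq. (10).
import numpy as np

def traversability_precision_error(pred_pos, true_pos, trav):
    """pred_pos / true_pos: boolean per-point arrays (predicted / ground-truth traversable),
    trav: predicted traversability in [0, 1] for every point."""
    tn = np.sum(~pred_pos & ~true_pos)
    fn = np.sum(~pred_pos & true_pos)
    fp_weighted = np.sum((pred_pos & ~true_pos) * (1.0 - trav))  # FP extenuated by t_i
    return tn / (tn + fp_weighted + fn)
```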
Input: Query data $Q=\{Q_P, Q_U\}$ and support data $S=\{S_P, S_N\}$, where $|Q|\gg|S|$
Output: Network $f$ with parameters $\theta$, proxy bank $\mathbb{B}=\{\mathbb{B}_P, \mathbb{B}_N\}$
for each query data do
       Randomly sample support data from $S$
       Feed query and support data to $f_\theta$ and get embedding features $x_i$
       Calculate the similarity between $x_i$ and $\mathbb{B}$
       Estimate pseudo-class $\hat{y_i}$ for $x_i\in Q_U$
       Calculate $L^{Traverse}$
       Update $\theta$ and $\mathbb{B}$
end for
Calculate the membership of each proxy
if an empty proxy exists then
       Feed $S$ to $f_\theta$ and get embedding features
       Estimate $M$ cluster centers for each class, $\mu=\{\mu^+, \mu^-\}$, by the EM algorithm
       Re-initialize each empty proxy to $\mu$ with a small random perturbation
end if
Algorithm 1: Single epoch of traversability estimation with metric learning

IV Experiments

|S|/|Q|                  Dtrail mIoU               Dtrail TPE                SemanticKITTI mIoU
                         4%      2%      1%        4%      2%      1%        5%      1%      0.5%    0.1%
ProtoNet [26]            0.8033  0.7515  0.5049    0.7129  0.5624  0.3249    0.8009  0.8040  0.7993  0.7798
MPTI [17]                0.6992  0.6936  0.6390    0.6202  0.5466  0.4995    0.8586  0.8108  0.7531  0.7663
Ours (supervised)        0.9238  0.8857  0.7779    0.8896  0.8447  0.7345    0.8405  0.8376  0.8338  0.8201
Ours (w.o. unlabeled)    0.8864  0.8529  0.8461    0.8434  0.8121  0.8164    0.8124  0.7896  0.8049  0.7994
Ours (w.o. re-init)      0.8970  0.8771  0.7935    0.8649  0.8163  0.7517    0.8058  0.7895  0.8058  0.7895
Ours                     0.9338  0.9151  0.9005    0.9067  0.8776  0.8636    0.8652  0.8402  0.8473  0.8973
TABLE I: Comparison results on the Dtrail and SemanticKITTI datasets under different support-to-query ratios |S|/|Q|. Our method with different objectives is annotated as follows. Ours (supervised): Eq. (3), trained in a supervised manner. Ours (w.o. unlabeled): Eq. (7), which does not leverage unlabeled data. Ours (w.o. re-init): Eq. (9) without the re-initialization step. Ours: Eq. (9).

In this section, our method is evaluated on the Dtrail dataset for traversability estimation in off-road environments, along with the SemanticKITTI [15] dataset. Our method is compared to other metric learning methods based on episodic training strategies. Furthermore, we conduct various ablation studies to show the benefits of our method.

IV-A Datasets

Figure 6: Dtrail dataset. (a) Our mobile robot platform with one 32-layer and two 16-layer LiDARs. (b) Images of the mountain trail scenes where we construct the dataset.

IV-A1 Dtrail: Off-road terrain dataset

In order to thoroughly examine the validity of our method, we build the Dtrail dataset, a real mobile robot driving dataset of high-resolution LiDAR point clouds from mountain trail scenes. We collect point clouds using one 32-layer and two 16-layer LiDAR sensors mounted on our caterpillar-type mobile robot platform, shown in Fig. 6(a). Our dataset consists of 119 point cloud scenes, and each scene contains approximately 4 million points. Sample camera images corresponding to the point cloud scenes are shown in Fig. 6(b). For the experiments, we split 98 scenes into the query set, 4 scenes into the support set, and 17 scenes into the evaluation set. For the traversability, the magnitude of the z-acceleration from the Inertial Measurement Unit (IMU) of the mobile robot is re-scaled from 0 to 1 and mapped to the points that the robot actually traversed. Also, for data augmentation, a small perturbation is added along the z-axis to some positive points.
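As a small illustration of this labeling step, the sketch below min-max rescales the z-acceleration magnitude to [0, 1]; how the rescaled value is oriented (e.g., whether rougher motion maps to lower traversability) and how points are associated with IMU samples are details not specified here and are left out.

```python
# Minimal sketch of deriving per-point traversability labels from the IMU z-acceleration.
import numpy as np

def traversability_from_imu(z_acc):
    """z_acc: z-acceleration magnitudes associated with the traversed points (assumed given)."""
    mag = np.abs(z_acc)
    return (mag - mag.min()) / (mag.max() - mag.min() + 1e-8)  # rescale to [0, 1]
```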

Figure 7: Qualitative results on the Dtrail dataset. (a) Camera image of each scene. (b) Support data. (c)-(d) Segmentation results of the supervised baseline and of our method. A red point indicates a traversable region, a blue point indicates a non-traversable region, and a black point is an unlabeled region. (e) The final output of our traversability estimation; the traversability of non-traversable regions is masked out using the segmentation result.
(a) Ground Truth
(b) Supervised
(c) Ours
Figure 8: Qualitative Results for SemanticKITTI. A red-colored point indicates a traversable region, a blue-colored point indicates a non-traversable region, and a black point is an unlabeled region.

IV-A2 SemanticKITTI

We also evaluate our method on the SemanticKITTI [15] dataset, an urban outdoor-scene dataset for point cloud segmentation. Since it does not provide any attribute for traversability, we conduct experiments on segmentation only. It contains 11 training sequences (00 to 10) with 23,210 point clouds and 28 classes. We split 5 sequences (00, 02, 05, 08, 09) with 17,625 point clouds for training and the rest, with 5,576 point clouds, for evaluation. We define the ‘road’, ‘parking’, ‘sidewalk’, ‘other-ground’, and ‘terrain’ classes as positive and the remaining classes as negative. For query data, only the ‘road’ class is labeled as positive, and the other classes are left unlabeled. We expect the model to learn the other positive regions from the unlabeled data without direct supervision.
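A minimal sketch of this binary remapping is shown below; it works on class names as listed in the text, and the query-time behavior (leaving every non-'road' class unlabeled) follows our reading of the setup.

```python
# Minimal sketch of the binary class remapping for SemanticKITTI described above.
POSITIVE_CLASSES = {'road', 'parking', 'sidewalk', 'other-ground', 'terrain'}

def to_binary_label(class_name, is_query=False):
    """Returns 1 (positive), 0 (negative), or None (unlabeled) for a SemanticKITTI class."""
    if is_query:
        # Query data: only 'road' is labeled positive; everything else stays unlabeled.
        return 1 if class_name == 'road' else None
    return 1 if class_name in POSITIVE_CLASSES else 0
```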

IV-B Evaluation metric

We evaluate the performance of our method with TPE, the new criterion designed for the traversability estimation task, which assesses segmentation and regression quality simultaneously. Additionally, we evaluate the segmentation quality with the mean Intersection over Union [43] (mIoU). For each class, the IoU is calculated as $IoU=\frac{TP}{TP+FP+FN}$, where $TP$, $FP$, and $FN$ denote the number of true positive, false positive, and false negative points of each class, respectively.
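For completeness, a minimal sketch of the per-class IoU and its mean over classes is given below.

```python
# Minimal sketch of per-class IoU and mIoU over integer-labeled points.
import numpy as np

def mean_iou(pred, gt, num_classes=2):
    """pred / gt: per-point integer class labels."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        ious.append(tp / (tp + fp + fn + 1e-8))
    return float(np.mean(ious))
```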

IV-C Implementation Details

IV-C1 Embedding network

RandLA-Net [4] is fixed as the backbone embedding network for every method for a fair comparison. Specifically, we use 2 down-sampling layers in the backbone and exclude the global $(x, y, z)$ positions in the local spatial encoding layer, which helps the network embed local geometric patterns explicitly. The embedding vectors are normalized with the $l_2$ norm and compared with cosine similarity.

IV-C2 Training

We train the model and proxies with the Adam optimizer and exponential learning rate decay for 50 epochs. The initial learning rate is set to 1e-4. For query and support data, the K-nearest neighbors (KNN) of a randomly picked point are sampled at each training step. We ensure that positive and negative points are evenly present in the sampled points of the support data.
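A minimal sketch of this sampling step is shown below; the patch size k is an illustrative choice, not a value reported in the text.

```python
# Minimal sketch of sampling a KNN patch around a randomly picked center point.
import numpy as np

def sample_knn_patch(points, k=4096):
    """points: [N, 3] array of xyz coordinates; returns indices of the k nearest
    neighbors of a randomly chosen center point."""
    center = points[np.random.randint(len(points))]
    dists = np.linalg.norm(points - center, axis=1)
    return np.argsort(dists)[:k]
```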

IV-C3 Hyperparameter setting

For learning stability, proxies are updated exclusively during the initial 5 epochs. The number of proxies $K$ is set to 128 for each class, and the proxies are initialized from a normal distribution. We set a small margin $\delta$ of 0.01, $\lambda$ of 20, and a temperature parameter $T$ of 0.05 for handling multiple proxies.

Figure 9: Proxy visualization of the scene in our Dtrail dataset. The color of each point represents the proxy assigned to the point. It shows that the learned proxies are well-clustered, and mapped with the semantic features on the point cloud scene.

IV-D Results

IV-D1 Comparison

We compare the performance to ProtoNet [26], which uses a single prototype, and MPTI [17], which adopts multiple prototypes for few-shot 3D segmentation. We also compare against our supervised variant, denoted as ‘Ours (supervised).’ Table I summarizes the results of the experiments. Our method shows a significant margin in terms of mIoU and TPE compared to ProtoNet and MPTI, demonstrating that generating prototypes in a non-parametric approach does not represent the whole data effectively. Moreover, it is notable that our metric learning method outperforms the supervised setting designed for our task, which verifies that it can reduce epistemic uncertainty by incorporating unlabeled data through the unsupervised loss. For SemanticKITTI, the observation is similar to that on the Dtrail dataset. Even though the SemanticKITTI dataset is based on urban scenes, our method outperforms the other few-shot learning methods by 6% and the supervised variant by 2%.

K          1      2      4      8      16     32     64     128    256    512
mIoU (↑)   0.890  0.894  0.880  0.906  0.911  0.883  0.920  0.934  0.931  0.924
TPE (↑)    0.847  0.868  0.840  0.862  0.881  0.845  0.888  0.906  0.898  0.895
TABLE II: Ablation study on Dtrail according to the number of proxies K.

IV-D2 Ablation studies

We repeat the experiments with varying support-to-query ratios ($|\mathbb{S}|/|\mathbb{Q}|$) to evaluate robustness with respect to the amount of support data. Table I shows that our metric learning method is much more robust to performance degradation than the others when the support-to-query ratio decreases. When the ratio decreases from 4% to 1% on the Dtrail dataset, the TPE of our metric learning method decreases by only about 4%, while the TPE of the others drops significantly: 39% for ProtoNet, 13% for MPTI, and 16% for Ours (supervised). This verifies that our method can robustly reduce epistemic uncertainty with a small amount of labeled data.

Moreover, we observe that the performance increases by 6% on average in TPE when adopting the re-initialization step, confirming that re-initialization helps avoid trivial solutions. Also, adopting the unsupervised loss boosts the performance by up to 6% on average, which verifies that the unlabeled loss can provide rich supervision without explicit labels. Furthermore, as shown in Table II, an increasing number of proxies boosts the performance until it converges once the number exceeds 32, which demonstrates the advantage of multiple proxies.

IV-D3 Qualitative Results

Fig. 7 shows the traversability estimation results of our supervised and metric learning-based methods on the Dtrail dataset. We can see that our metric learning-based method performs better than the supervised one. In particular, our method yields better results on regions that are not labeled in the training data. We compare example segmentation results on the SemanticKITTI dataset in Fig. 8. The first column shows the ground truth, and the other columns show the segmentation results of the supervised learning-based method and our method. Evidently, our method shows better results on unlabeled regions, which confirms that our metric learning-based method reduces epistemic uncertainty.

Fig. 9 shows the visualization of the proxies assigned to the point cloud scenes. For better visualization, proxies are clustered into three representations. We observe that the learned proxies successfully represent the various semantic features. Leaves, grounds, and tree trunks are mostly colored green, black, and blue, respectively.

V Conclusion

We propose a self-supervised traversability estimation framework on 3D point cloud data that mitigates epistemic uncertainty. Self-supervised traversability estimation suffers from uncertainty that arises from the limited supervision given by the data. We tackle the epistemic uncertainty by concurrently learning semantic segmentation along with traversability estimation, eventually masking out the non-traversable regions. We start from the fully-supervised setting and develop a deep metric learning method with an unsupervised loss that harnesses the unlabeled data. To properly evaluate the framework, we also devise a new evaluation metric according to the task’s settings, underlining the important criteria of traversability estimation. We build our own off-road terrain dataset with a mobile robot platform in unconstrained environments for realistic testing. Various experimental results show that our framework is promising.

References

  • [1] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, pp. 1231 – 1237, 2013.
  • [2] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11 618–11 628, 2020.
  • [3] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. M. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov, “Scalability in perception for autonomous driving: Waymo open dataset,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2443–2451, 2020.
  • [4] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, A. Trigoni, and A. Markham, “Randla-net: Efficient semantic segmentation of large-scale point clouds,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11 105–11 114, 2020.
  • [5] Z. Chen, J. Zhang, and D. Tao, “Progressive lidar adaptation for road detection,” IEEE/CAA Journal of Automatica Sinica, vol. 6, pp. 693–702, 2019.
  • [6] J. Sock, J. H. Kim, J. Min, and K. H. Kwak, “Probabilistic traversability map generation using 3d-lidar and camera,” IEEE International Conference on Robotics and Automation, pp. 5631–5637, 2016.
  • [7] J. Ahtiainen, T. Stoyanov, and J. Saarinen, “Normal distributions transform traversability maps: Lidar‐only approach for traversability mapping in outdoor environments,” Journal of Field Robotics, vol. 34, 2017.
  • [8] S. Matsuzaki, J. Miura, and H. Masuzawa, “Semantic-aware plant traversability estimation in plant-rich environments for agricultural mobile robots,” ArXiv, vol. abs/2108.00759, 2021.
  • [9] T. Guan, Z. He, D. Manocha, and L. Zhang, “Ttm: Terrain traversability mapping for autonomous excavator navigation in unstructured environments,” ArXiv, vol. abs/2109.06250, 2021.
  • [10] H. Roncancio, M. Becker, A. Broggi, and S. Cattani, “Traversability analysis using terrain mapping and online-trained terrain type classifier,” IEEE Intelligent Vehicles Symposium Proceedings, pp. 1239–1244, 2014.
  • [11] L. Wellhausen, A. Dosovitskiy, R. Ranftl, K. Walas, C. Cadena, and M. Hutter, “Where should i walk? predicting terrain properties from images via self-supervised learning,” IEEE Robotics and Automation Letters, vol. 4, pp. 1509–1516, 2019.
  • [12] M. Wermelinger, P. Fankhauser, R. Diethelm, P. Krüsi, R. Y. Siegwart, and M. Hutter, “Navigation planning for legged robots in challenging terrain,” IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1184–1189, 2016.
  • [13] H. Kolvenbach, C. Bärtschi, L. Wellhausen, R. Grandia, and M. Hutter, “Haptic inspection of planetary soils with legged robots,” IEEE Robotics and Automation Letters, vol. 4, pp. 1626–1632, 2019.
  • [14] W. Van Gansbeke, S. Vandenhende, S. Georgoulis, M. Proesmans, and L. Van Gool, “Scan: Learning to classify images without labels,” in European Conference on Computer Vision, 2020, pp. 268–285.
  • [15] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “Semantickitti: A dataset for semantic scene understanding of lidar sequences,” IEEE/CVF International Conference on Computer Vision, pp. 9296–9306, 2019.
  • [16] A. Valada, J. Vertens, A. Dhall, and W. Burgard, “Adapnet: Adaptive semantic segmentation in adverse environmental conditions,” IEEE International Conference on Robotics and Automation, pp. 4644–4651, 2017.
  • [17] N. Zhao, T.-S. Chua, and G. H. Lee, “Few-shot 3d point cloud semantic segmentation,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8869–8878, 2021.
  • [18] P. Dallaire, K. Walas, P. Giguère, and B. Chaib-draa, “Learning terrain types with the pitman-yor process mixtures of gaussians for a legged robot,” IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3457–3463, 2015.
  • [19] L. Ding, H. Gao, Z. Deng, J. Song, Y. Liu, G. Liu, and K. Iagnemma, “Foot–terrain interaction mechanics for legged robots: Modeling and experimental validation,” The International Journal of Robotics Research, vol. 32, pp. 1585 – 1606, 2013.
  • [20] W. Bosworth, J. Whitney, S. Kim, and N. Hogan, “Robot locomotion on hard and soft ground: Measuring stability and ground properties in-situ,” IEEE International Conference on Robotics and Automation, pp. 3582–3589, 2016.
  • [21] G. G. Waibel, T. Löw, M. Nass, D. Howard, T. Bandyopadhyay, and P. V. K. Borges, “How rough is the path? terrain traversability estimation for local and global path planning,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, p. 16462–16473, sep 2022. [Online]. Available: https://doi.org/10.1109/TITS.2022.3150328
  • [22] G. Koch, R. Zemel, R. Salakhutdinov et al., “Siamese neural networks for one-shot image recognition,” in ICML deep learning workshop, vol. 2.   Lille, 2015, p. 0.
  • [23] V. G. Satorras and J. Bruna, “Few-shot learning with graph neural networks,” ArXiv, vol. abs/1711.04043, 2018.
  • [24] O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in Advances in Neural Information Processing Systems, 2016.
  • [25] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1199–1208, 2018.
  • [26] J. Snell, K. Swersky, and R. S. Zemel, “Prototypical networks for few-shot learning,” ArXiv, vol. abs/1703.05175, 2017.
  • [27] C. Chen, O. Li, A. J. Barnett, J. Su, and C. Rudin, “This looks like that: deep learning for interpretable image recognition,” ArXiv, vol. abs/1806.10574, 2019.
  • [28] J. Deuschel, D. Firmbach, C. I. Geppert, M. Eckstein, A. Hartmann, V. Bruns, P. Kuritcyn, J. Dexl, D. Hartmann, D. Perrin, T. Wittenberg, and M. Benz, “Multi-prototype few-shot learning in histopathology,” IEEE/CVF International Conference on Computer Vision Workshops, pp. 620–628, 2021.
  • [29] K. R. Allen, E. Shelhamer, H. Shin, and J. B. Tenenbaum, “Infinite mixture prototypes for few-shot learning,” ArXiv, vol. abs/1902.04552, 2019.
  • [30] B. Yang, C. Liu, B. Li, J. Jiao, and Q. Ye, “Prototype mixture models for few-shot semantic segmentation,” ArXiv, vol. abs/2008.03898, 2020.
  • [31] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.
  • [32] M. Wang and W. Deng, “Deep face recognition: A survey,” Neurocomputing, vol. 429, pp. 215–244, 2021.
  • [33] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 539–546 vol. 1, 2005.
  • [34] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1735–1742, 2006.
  • [35] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, “Learning fine-grained image similarity with deep ranking,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386–1393, 2014.
  • [36] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric learning via lifted structured feature embedding,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012, 2016.
  • [37] Y. Movshovitz-Attias, A. Toshev, T. Leung, S. Ioffe, and S. Singh, “No fuss distance metric learning using proxies,” IEEE International Conference on Computer Vision, pp. 360–368, 2017.
  • [38] E. W. Teh, T. Devries, and G. W. Taylor, “Proxynca++: Revisiting and revitalizing proxy neighborhood component analysis,” ArXiv, vol. abs/2004.01113, 2020.
  • [39] S. Kim, D. Kim, M. Cho, and S. Kwak, “Proxy anchor loss for deep metric learning,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3235–3244, 2020.
  • [40] Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin, “Softtriple loss: Deep metric learning without triplet sampling,” IEEE/CVF International Conference on Computer Vision, pp. 6449–6457, 2019.
  • [41] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in European Conference on Computer Vision, 2018.
  • [42] T. K. Moon, “The expectation-maximization algorithm,” IEEE Signal processing magazine, vol. 13, no. 6, pp. 47–60, 1996.
  • [43] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision, vol. 88, pp. 303–338, 2009.