


† These authors contributed equally to this work.
This work was partly supported by the Agency for Defense Development and by a Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0020536, HRD Program for Industrial Innovation).


Corresponding author: Inwook Shim (e-mail: iwshim@inha.ac.kr).

Self-Supervised 3D Traversability Estimation with Proxy Bank Guidance

JIHWAN BAE¹, JUNWON SEO¹, TAEKYUNG KIM¹, HAE-GON JEON², KIHO KWAK¹, AND INWOOK SHIM³
¹Agency for Defense Development, Daejeon 34186, Republic of Korea (e-mail: mierojihan1008@gmail.com, junwon.vision@gmail.com, tkkim.robot@gmail.com, kkwak.add@gmail.com)
²AI Graduate School and the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju 61005, Korea (e-mail: haegonj@gist.ac.kr)
³Department of Smart Mobility Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Korea (e-mail: iwshim@inha.ac.kr)
Abstract

Traversability estimation for mobile robots in off-road environments requires more than the conventional semantic segmentation used in constrained environments such as on-road conditions. Recently, approaches that learn traversability from past driving experiences in a self-supervised manner have been emerging, as they can significantly reduce human labeling costs and labeling errors. However, self-supervised data only provide supervision for the regions actually traversed, inducing epistemic uncertainty due to the scarcity of negative information. Negative data are rarely harvested, as the system can be severely damaged while logging such data. To mitigate this uncertainty, we introduce a deep metric learning-based method that incorporates unlabeled data with a few positive and negative prototypes, jointly learning semantic segmentation and traversability regression. To rigorously evaluate the proposed framework, we introduce a new evaluation metric that comprehensively assesses both the segmentation and the regression. Additionally, we construct a driving dataset, ‘Dtrail’, in off-road environments with a mobile robot platform, which contains a wide variety of negative data. We examine our method on Dtrail as well as the publicly available SemanticKITTI dataset.

Index Terms:
deep metric learning, mobile robots, autonomous driving

I Introduction

Estimating traversability for mobile robots is an important task for autonomous driving and machine perception. However, the majority of the relevant works focus on constrained road environments such as paved roads, which are well covered by public datasets [1, 2, 3]. In urban scenes, road detection with semantic segmentation is sufficient [4, 5], but in unconstrained environments such as off-road areas, semantic segmentation is insufficient because the environment can be highly complex and rough [6], as shown in Fig. 1a. Several works in the robotics field have proposed methods to estimate the traversability cost in unconstrained environments [7, 8, 9, 10] and to infer a probabilistic traversability map from visual information such as images [11] and 3D LiDAR [6].

Figure 1: Illustration of the motivation of our framework. The color map for traversability indicates that higher values (green) correspond to regions that are easier to traverse, while lower values (purple) correspond to regions that are harder to traverse. Self-supervised traversability estimation should minimize the epistemic uncertainty by filtering out non-traversable regions for which no traversability supervision is given.

Actual physical state changes that a vehicle undergoes can give meaningful information on where it can traverse and how difficult the traversal would be. Such physical changes that the vehicle experiences itself are called self-supervised data. Accordingly, self-supervised traversability estimation can offer more robot-oriented prediction [11, 12, 13]. Fig. 1b shows an example of self-supervised traversability data. Previously, haptic inspection [13, 11] has been examined as a traversability signal in self-supervised approaches. These works demonstrate that learning from self-supervised data is a promising approach for traversability estimation, but they are confined to the proprioceptive sensor domain or the image domain. Additionally, supervision from the self-supervised data is limited to the regions actually traversed, as depicted in Fig. 1b, thereby inducing epistemic uncertainty when inferring the traversability of non-traversed regions. An example of such epistemic uncertainty is illustrated in Fig. 1c. Trees that are impossible to drive over are regressed with high traversability, which means they are easy to traverse.

In this paper, we propose a self-supervised framework for traversability estimation on 3D point cloud data in unconstrained environments, concentrating on alleviating epistemic uncertainty. We jointly learn semantic segmentation along with traversability regression via deep metric learning to filter out the non-traversable regions (see Fig. 1d). Also, to harness the unlabeled data from the non-traversed area, we introduce an unsupervised loss similar to clustering methods [14]. To better evaluate our task, we develop a new evaluation metric that evaluates both the segmentation and the regression, while highlighting the false-positive ratio for reliable estimation. To test our method on more realistic data, we build an off-road robot driving dataset named ‘Dtrail.’ Experimental results are shown for both the Dtrail and SemanticKITTI [15] datasets. Ablations and comparisons with other metric learning-based methods show that our method yields quantitatively and qualitatively robust results. Our contributions in this work are fivefold:

  • We introduce a self-supervised traversability estimation framework on 3D point clouds that mitigates the uncertainty.

  • We adopt a deep metric learning-based method that jointly learns the semantic segmentation and the traversability estimation.

  • We propose the unsupervised loss to utilize the unlabeled data in the current self-supervised settings.

  • We devise a new metric to evaluate the suggested framework properly.

  • We present a new 3D point cloud dataset for off-road mobile robot driving in unconstrained environments that includes IMU data synchronized with LiDAR.

II Related Works

II-A Traversability Estimation

Traversability estimation is a crucial component of mobile robotics platforms for deciding where the robot should go. In the case of paved road conditions, the traversability estimation task can be regarded as a subset of road detection [5, 16] and semantic segmentation [17]. However, the human-supervised approach is clearly limited in estimating traversability for unconstrained environments such as off-road areas. Given the diversity of road conditions, it is hard to determine the traversability of a mobile robot in advance with man-made predefined rules.

Self-supervised approaches [18, 19, 20, 13] have been suggested in the robotics literature to estimate traversability using proprioceptive sensors such as inertial measurement and force-torque sensors [13]. Since these approaches only measure traversability in the proprioceptive-sensor domain, they cannot inform the robot’s future driving direction. To solve this problem, a study that predicts terrain properties by combining image information with the robot’s self-supervision has been proposed [11]. The authors identify terrain properties from haptic interaction and associate them with the image to facilitate self-supervised learning. This work demonstrates promising outputs for traversability estimation, but it does not take into account the epistemic uncertainty that necessarily exists in the self-supervised data. Furthermore, image-based learning approaches remain vulnerable to illumination changes that can reduce the performance of the algorithms. Therefore, range sensors such as 3D LiDAR can be a strong alternative [21].

To overcome such limitations, we propose self-supervised traversability estimation in unconstrained environments that alleviates the inherent uncertainty using 3D point cloud data.

(a) Query
(b) Support
Figure 2: Examples of our task data settings: query and support data. (a) is an example of query data. Unlabeled points are colored black, and non-black points indicate the robot’s traversable region. Traversability is mapped only onto the positive points. (b) is an example of support data. Traversable and non-traversable regions are manually labeled in red and blue, respectively. Only evident regions are labeled and used for training in the support data.

II-B Deep Metric Learning

One of the biggest challenges in learning with few labeled data is epistemic uncertainty. To handle this problem, researchers proposed deep metric learning (DML) [22], which learns embedding spaces and classifies an unseen sample in the learned space. Several works adopt the sampled mini-batches called episodes during training, which mimics the task with few labeled data to facilitate DML [23, 24, 25, 26]. These methods with episodic training strategies epitomize labeled data of each class as a single vector, referred to as a prototype [27, 28, 29, 30, 17]. The prototypes generated by these works require non-parametric procedures and insufficiently represent unlabeled data.

Other works [31, 32, 33, 34, 35, 36] develop loss functions to learn an embedding space where similar examples are attracted and dissimilar examples are repelled. Recently, proxy-based losses have been proposed [37]. Proxies are representative vectors of the training data in the learned embedding spaces, which are obtained in a parametric way [38, 39]. Using proxies leads to better convergence as they reflect the entire distribution of the training data [37]. The majority of works [38, 39] provide a single proxy for each class, whereas the SoftTriple loss [40] adopts multiple proxies per class. We adopt the SoftTriple loss, as traversable and non-traversable regions are represented as multiple clusters rather than a single one on unstructured driving surfaces, according to their complexity and roughness.

Figure 3: Illustration of our task definition. Given the point cloud data, the initial traversability map and the semantic segmentation mask are produced. The final output is obtained by masking out the non-traversable regions of the initial traversability map.

III Methods

III-A Overview

Our self-supervised framework aims to learn a mapping from point clouds to traversability. We call the input data containing the traversability information ‘query’ data. The traversable regions are referred to as the ‘positive’ class, and the non-traversable regions as the ‘negative’ class in this work. In query data, only positive points are labeled, along with their traversability; the rest remain unlabeled. Non-black points in Fig. 2(a) indicate positive regions, and black points indicate unlabeled regions.

However, a limitation is that the query data are devoid of any supervision for negative regions. With query data only, the results would be unreliable, as negative regions can be regressed as highly traversable due to the epistemic uncertainty (Fig. 1c). Consequently, our task aims to learn semantic segmentation along with traversability regression to mask out the negative regions, thereby mitigating the epistemic uncertainty. Accordingly, we utilize a very small number of hand-labeled point cloud scenes, which we call ‘support’ data. In support data, traversable and non-traversable regions are manually annotated as positive and negative, respectively. Manually labeling entire scenes can be biased by human intuition. Therefore, only evident regions are labeled and used for training. Fig. 2(b) shows an example of labeled support data.

The overall schema of our task is illustrated in Fig. 3. When the input point cloud data is given, a segmentation mask is applied to the initial version of the traversability regression map, producing a masked traversability map as a final output. For training, we form an episode composed of queries and randomly sampled support data. We can optimize our network over both query and relatively small support data with the episodic strategy [17]. Also, to properly evaluate the proposed framework, we introduce a new metric that comprehensively measures the segmentation and the regression, while highlighting the nature of the traversability estimation task with the epistemic uncertainty.

III-B Baseline Method

Let the query data, consisting of positive and unlabeled data, be denoted as $Q=\{Q_P, Q_U\}$, and the support data, consisting of positive and negative data, as $S=\{S_P, S_N\}$. Let $P_i\in\mathbb{R}^3$ denote the 3D point, $a_i\in\mathbb{R}$ the traversability, and $y_i\in\{0,1\}$ the class of each point. Accordingly, data from $Q_P$, $Q_U$, $S_P$, and $S_N$ take the forms $\{P_i, a_i, y_i\}$, $\{P_i\}$, $\{P_i, y_i\}$, and $\{P_i, y_i\}$, respectively. Let $f_\theta$ denote a feature encoding backbone, where $\theta$ indicates the network parameters, $x_i\in\mathbb{R}^d$ the encoded feature extracted from $P_i$, and $h_\theta$ the multi-layer perceptron (MLP) head for the traversability regression. $g_\theta$ denotes the MLP head for the segmentation that distinguishes traversable from non-traversable regions. The encoded feature domains for each data set are denoted as $\mathbb{Q}_P$, $\mathbb{Q}_U$, $\mathbb{S}_P$, and $\mathbb{S}_N$.

A baseline solution learns the network with labeled data only. $Q_P$ is used for the traversability regression, and both $Q_P$ and $S$ are used for the segmentation. We obtain the traversability map $t_i=h(x_i)$, $t_i\in\mathbb{R}$, and the segmentation map $s_i=g(x_i)$, $s_i\in\{0,1\}$. The final masked traversability map $T_i$ is represented as the element-wise multiplication $T_i=t_i\odot s_i$. The regression loss $L^{reg}$ is computed on $\mathbb{Q}_P$ and is based on a mean squared error loss as in Eq. (1), where $x_i$ is the $i$-th element of $\mathbb{Q}_P$.

$L^{reg}(x_i)=(h(x_i)-a_i)^2.$ (1)

For the segmentation loss $L^{seg}$, the binary cross-entropy loss is used in the supervised setting as in Eq. (2), where $x_i$ refers to the $i$-th element of either $\mathbb{Q}_P$ or $\mathbb{S}$. Both the positive query and the support data can be used for the segmentation loss as follows:

$L^{seg}(x_i)=-\big(y_i\log(g(x_i))+(1-y_i)\log(1-g(x_i))\big).$ (2)

Combining the regression and the segmentation, the traversability estimation loss in the supervised setting is defined as follows:

$L^{Supervised}(\mathbb{Q}_P,\mathbb{S})=\frac{1}{|\mathbb{Q}_P|}\sum_{x_i\in\mathbb{Q}_P}\big(L^{reg}(x_i)+L^{seg}(x_i)\big)+\frac{1}{|\mathbb{S}|}\sum_{x_i\in\mathbb{S}}L^{seg}(x_i).$ (3)
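As a rough illustration, the following sketch shows how the supervised baseline of Eqs. (1)-(3) could be computed; it is not the authors' code, and the names h, g, x_qp, a_qp, x_s, and y_s are placeholder names for the regression head, segmentation head, and encoded query/support batches.

```python
# Minimal sketch of the supervised baseline (Eqs. (1)-(3)); assumes per-point features
# were already encoded by the backbone f_theta and that h/g are small MLP heads.
import torch
import torch.nn.functional as F

def supervised_loss(h, g, x_qp, a_qp, x_s, y_s):
    """x_qp: features of positive query points, a_qp: their traversability targets,
    x_s / y_s: support features and binary labels (1 = traversable)."""
    # Eq. (1): mean squared error on traversability for positive query points.
    l_reg = F.mse_loss(h(x_qp).squeeze(-1), a_qp)
    # Eq. (2): binary cross-entropy for segmentation; positive query points have label 1.
    y_qp = torch.ones(x_qp.shape[0], device=x_qp.device)
    l_seg_q = F.binary_cross_entropy_with_logits(g(x_qp).squeeze(-1), y_qp)
    l_seg_s = F.binary_cross_entropy_with_logits(g(x_s).squeeze(-1), y_s.float())
    # Eq. (3): sum of the per-set averages of the regression and segmentation terms.
    return l_reg + l_seg_q + l_seg_s
```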

Nonetheless, this baseline does not fully take advantage of data captured on various driving surfaces. Since the learning is limited to the labeled data, it cannot capture the full characteristics of the training data. This drawback hinders the capability of a traversability estimator trained in a supervised manner.

III-C Metric learning method

We adopt metric learning to overcome the limitation of the fully-supervised solution. The objectives of metric learning are to learn an embedding space and find representations that epitomize the training data in that space. To jointly optimize the embedding network and the representations, we adopt a proxy-based loss [39]. The embedding network is updated based on the positions of the proxies, and the proxies are adjusted by the updated embedding network iteratively. The proxies can be regarded as representations that abstract the training data. We refer to this set of proxies as the ‘proxy bank,’ denoted as $\mathbb{B}=\{\mathbb{B}_P, \mathbb{B}_N\}$, where $\mathbb{B}_P$ and $\mathbb{B}_N$ indicate the set of proxies for each class. The segmentation map is inferred based on the similarity between feature vectors and the proxies of each class, as $s_i=g(\mathbb{B}, x_i)$.

Figure 4: Illustration of the effect of adopting the unlabeled data. Red and blue nodes are embedding vectors of positive and negative data. Gray nodes with a question mark indicate the unlabeled data, and the black ones indicate proxies. The background color and lines indicate decision boundaries in the embedding space. The embedded vectors (non-black nodes) assigned to the proxies are connected to each other with solid lines. (a) Without unlabeled data, proxies and decision boundaries are optimized only with labeled data. (b) With unlabeled data, the optimization exploits the broader context of the training data, resulting in a more precise and discriminative decision boundary.

The representations of traversable and non-traversable regions exhibit large intra-class variations, with numerous sub-classes in each class: flat ground or gravel roads for positive, and rocks, trees, or bushes for negative. For the segmentation, we use the SoftTriple loss [40], which utilizes multiple proxies per class. The similarity between $x_i$ and class $c$, denoted as $S_{i,c}$, is defined by a weighted sum of the cosine similarities between $x_i$ and $\mathbb{B}_c=\{p_c^1,\ldots,p_c^K\}$, where $c$ denotes positive or negative, $K$ is the number of proxies per class, and $p_c^k$ is the $k$-th proxy in the proxy bank. The weight given to each cosine similarity is proportional to its value. $S_{i,c}$ is defined as follows:

$S_{i,c}=\sum_k\frac{\exp(\frac{1}{T}x_i^\top p_c^k)}{\sum_k\exp(\frac{1}{T}x_i^\top p_c^k)}\,x_i^\top p_c^k,$ (4)

where $T$ is a temperature parameter that controls the softness of the assignments. Soft assignments reduce the sensitivity to the choice among multiple centers. Note that the $l_2$ norm is applied to the embedding vectors so that their magnitudes do not diverge. The SoftTriple loss is then defined as follows:

$L^{SoftTriple}(x_i)=-\log\frac{\exp(\lambda(S_{i,y_i}-\delta))}{\exp(\lambda(S_{i,y_i}-\delta))+\sum_{j\neq y_i}\exp(\lambda S_{i,j})},$ (5)

where $\lambda$ is a hyperparameter for the smoothing effect and $\delta$ is a margin. The segmentation loss using the proxy bank can be reformulated with the SoftTriple loss as Eq. (6), and the traversability estimation loss using the proxy bank is defined as Eq. (7).

$L^{seg}(x_i,\mathbb{B})=-\log\frac{\exp(\lambda(S_{i,y_i}-\delta))}{\exp(\lambda(S_{i,y_i}-\delta))+\exp(\lambda S_{i,1-y_i})}.$ (6)
$L^{Proxy}(\mathbb{Q}_P,\mathbb{S},\mathbb{B})=\frac{1}{|\mathbb{Q}_P|}\sum_{x_i\in\mathbb{Q}_P}\big(L^{reg}(x_i)+L^{seg}(x_i,\mathbb{B})\big)+\frac{1}{|\mathbb{S}|}\sum_{x_i\in\mathbb{S}}L^{seg}(x_i,\mathbb{B}).$ (7)
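The following sketch illustrates how the multi-proxy similarity of Eq. (4) and the proxy-bank segmentation loss of Eqs. (5)-(6) could be computed; it is only our reading of the losses, and the tensor shapes (e.g., a proxy bank of shape [num_classes, K, d]) are assumptions, not the authors' implementation.

```python
# Minimal sketch of Eqs. (4)-(6): multi-proxy similarity and SoftTriple-style loss.
# Assumes features x [N, d] and proxies [num_classes, K, d] are l2-normalized.
import torch
import torch.nn.functional as F

def class_similarity(x, proxies, T=0.05):
    """Eq. (4): soft assignment of each point to the K proxies of each class."""
    sims = torch.einsum('nd,ckd->nck', x, proxies)   # cosine similarities x^T p_c^k
    weights = F.softmax(sims / T, dim=-1)            # softmax over the K proxies
    return (weights * sims).sum(dim=-1)              # S_{i,c}, shape [N, num_classes]

def softtriple_loss(x, labels, proxies, lam=20.0, delta=0.01, T=0.05):
    """Eqs. (5)-(6): cross-entropy over lambda-scaled similarities with margin delta."""
    S = class_similarity(x, proxies, T)                                  # [N, num_classes]
    margin = torch.zeros_like(S).scatter_(1, labels.view(-1, 1), delta)  # delta only at y_i
    return F.cross_entropy(lam * (S - margin), labels)
```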

Unlabeled data, which are abundant in self-supervised traversability data, have not been considered in previous works. To enhance the supervision we can extract from the data, we utilize the unlabeled portion of the query data in the learning process. The problem is that the segmentation loss cannot be applied to $\mathbb{Q}_U$ because no class label $y_i$ exists for these points. We therefore assign an auxiliary target to each unlabeled point, as in clustering [41]. The pseudo class $\hat{y_i}$ of the $i$-th sample is assigned based on the class of the nearest proxy in the embedding space as $\hat{y_i}=\operatorname{argmax}_{c\in\{P,N\}}S_{i,c}$.

The unsupervised loss for the segmentation, denoted as $L^U$, is defined in Eq. (8) using the pseudo-class, where $x_i$ is the embedding of the $i$-th sample in $\mathbb{Q}_U$.

$L^{U}(x_i,\mathbb{B})=-\log\frac{\exp(\lambda(S_{i,\hat{y_i}}-\delta))}{\exp(\lambda(S_{i,\hat{y_i}}-\delta))+\exp(\lambda S_{i,1-\hat{y_i}})}$ (8)
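A minimal sketch of the unsupervised term is given below, reusing the class_similarity and softtriple_loss helpers sketched above; the pseudo-class is simply the class of the most similar proxy set, which is our assumed reading of Eq. (8).

```python
# Minimal sketch of Eq. (8): pseudo-labels for unlabeled query points.
import torch

def unsupervised_loss(x_unlabeled, proxies, lam=20.0, delta=0.01, T=0.05):
    with torch.no_grad():
        S = class_similarity(x_unlabeled, proxies, T)
        pseudo = S.argmax(dim=1)   # y_hat_i = argmax_c S_{i,c}
    # Reuse the proxy-based segmentation loss with the pseudo-labels as targets.
    return softtriple_loss(x_unlabeled, pseudo, proxies, lam, delta, T)
```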

Fig. 4 illustrates the benefit of incorporating the unlabeled loss. The embedding network learns to capture a broader distribution of the data, and the learned proxies represent the training data better. When unlabeled data features are assigned to the proxies (Fig. 4a), the embedding space and proxies are updated as in Fig. 4b, exhibiting more precise decision boundaries.

Combining the aforementioned objectives, we define our final objective, the ‘Traverse Loss,’ as Eq. (9). The overall high-level schema of the learning procedure is depicted in Fig. 5.

$L^{Traverse}(\mathbb{Q},\mathbb{S},\mathbb{B})=L^{Proxy}(\mathbb{Q}_P,\mathbb{S},\mathbb{B})+\frac{1}{|\mathbb{Q}_U|}\sum_{x_i\in\mathbb{Q}_U}L^{U}(x_i,\mathbb{B})$ (9)
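Putting the pieces together, a minimal sketch of the Traverse Loss in Eq. (9) could look as follows; it relies on the helpers sketched above, assumes class index 1 denotes the positive (traversable) class, and the variable names (x_qp, x_qu, x_s, ...) are placeholders rather than the authors' API.

```python
# Minimal sketch of Eq. (9): supervised proxy terms plus the unsupervised term.
import torch

def traverse_loss(h, x_qp, a_qp, x_qu, x_s, y_s, proxies):
    # Eq. (7): regression on positive queries + proxy-based segmentation on queries/support.
    l_reg = torch.mean((h(x_qp).squeeze(-1) - a_qp) ** 2)
    pos = torch.ones(x_qp.shape[0], dtype=torch.long, device=x_qp.device)  # class 1 = traversable
    l_proxy = l_reg + softtriple_loss(x_qp, pos, proxies) + softtriple_loss(x_s, y_s, proxies)
    # Eq. (8): pseudo-labeled loss on unlabeled query points.
    return l_proxy + unsupervised_loss(x_qu, proxies)
```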
Figure 5: Illustration of the learning procedure of the proposed framework with deep metric learning.

III-D Re-initialization to avoid trivial solutions

Our metric learning method can suffer from sub-optimal solutions, which are induced by empty proxies. Empty proxies indicate the proxies to which none of the data are assigned. Such empty proxies should be redeployed to be a good representation of training data. Otherwise, the model might lose the discriminative power and the bank might include semantically poor representations.

Our intuitive idea to circumvent an empty proxy is to re-initialize it with support data features. By updating the empty proxies with support data, the proxy bank can reflect training data that was not effectively captured beforehand. In order to obtain representative feature vectors without noise, $M$ prototype feature vectors, denoted as $\mu^+=\{\mu_m^+, m=1,\ldots,M\}$ and $\mu^-=\{\mu_m^-, m=1,\ldots,M\}$, are estimated using an Expectation-Maximization (EM) algorithm [42]. The prototype vectors are the cluster centers of the support features. We randomly choose prototype vectors, add small perturbations, and use them as re-initialized proxies. Algorithm 1 summarizes the overall training procedure of our method.
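A minimal sketch of this re-initialization step is given below; it assumes the per-proxy assignment counts are tracked during the epoch and uses a Gaussian mixture fit (EM) over the support features of one class, with M, the noise scale, and all variable names being illustrative assumptions.

```python
# Minimal sketch of re-initializing empty proxies from support-feature cluster centers.
import torch
from sklearn.mixture import GaussianMixture

def reinit_empty_proxies(proxies, counts, support_feats, M=4, noise=1e-2):
    """proxies: [K, d] proxies of one class, counts: [K] assignment counts,
    support_feats: [S, d] support features of the same class."""
    gmm = GaussianMixture(n_components=M).fit(support_feats.detach().cpu().numpy())
    centers = torch.as_tensor(gmm.means_, dtype=proxies.dtype, device=proxies.device)
    for k in torch.nonzero(counts == 0).flatten():
        m = torch.randint(0, M, (1,)).item()                 # pick a random prototype center
        proxies.data[k] = centers[m] + noise * torch.randn_like(centers[m])
    return proxies
```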

III-E Traversability Precision Error

We devise a new metric for the proposed framework, the ‘Traversability Precision Error’ (TPE). The new metric should comprehensively evaluate the segmentation and the regression while taking the critical aspect of traversability estimation into account. One of the most important aspects of traversability estimation is to avoid false positives of the traversable region, i.e., regions that are impossible to traverse but inferred as traversable. If such a region is estimated as traversable, a robot will likely go over it, resulting in undesirable movements. The impact of a false positive decreases if the region is estimated as less traversable. TPE therefore measures the false positives of the traversable region while extenuating their impact with the traversability $t_i$. TPE is defined in Eq. (10), where $TN$, $FP$, and $FN$ denote the number of true negative, false positive, and false negative points of the traversable region, respectively.

$\text{Traversability Precision Error (TPE)}=\frac{TN}{TN+FP\,(1-t_i)+FN}$ (10)
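A minimal sketch of TPE as we read Eq. (10) is shown below: false positives of the traversable class are down-weighted by how low their predicted traversability is; the array names are placeholders.

```python
# Minimal sketch of the TPE metric in Eq. (10).
import numpy as np

def traversability_precision_error(pred_pos, true_pos, trav):
    """pred_pos / true_pos: boolean per-point arrays (predicted / ground-truth traversable),
    trav: predicted traversability in [0, 1] for every point."""
    tn = np.sum(~pred_pos & ~true_pos)
    fn = np.sum(~pred_pos & true_pos)
    fp_weighted = np.sum((pred_pos & ~true_pos) * (1.0 - trav))  # FP extenuated by t_i
    return tn / (tn + fp_weighted + fn)
```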
Input: Query data $Q=\{Q_P, Q_U\}$ and support data $S=\{S_P, S_N\}$, where $|Q|\gg|S|$
Output: Network $f$ with parameters $\theta$, proxy bank $\mathbb{B}=\{\mathbb{B}_P, \mathbb{B}_N\}$
for each query data do
       Randomly sample support data from $S$
       Feed query and support data to $f_\theta$ and get embedding features $x_i$
       Calculate the similarity between $x_i$ and $\mathbb{B}$
       Estimate pseudo-class $\hat{y_i}$ for $x_i\in Q_U$
       Calculate $L^{Traverse}$
       Update $\theta$ and $\mathbb{B}$
end for
Calculate the membership of each proxy
if an empty proxy exists then
       Feed $S$ to $f_\theta$ and get embedding features
       Estimate $M$ cluster centers for each class, $\mu=\{\mu^+, \mu^-\}$, by the EM algorithm
       Re-initialize each empty proxy to $\mu$ with a small random perturbation
end if
Algorithm 1: Single epoch of traversability estimation with metric learning

IV Experiments

|S|/|Q|                  Dtrail mIoU               Dtrail TPE                SemanticKITTI mIoU
                         4%      2%      1%        4%      2%      1%        5%      1%      0.5%    0.1%
ProtoNet [26]            0.8033  0.7515  0.5049    0.7129  0.5624  0.3249    0.8009  0.8040  0.7993  0.7798
MPTI [17]                0.6992  0.6936  0.6390    0.6202  0.5466  0.4995    0.8586  0.8108  0.7531  0.7663
Ours (supervised)        0.9238  0.8857  0.7779    0.8896  0.8447  0.7345    0.8405  0.8376  0.8338  0.8201
Ours (w.o. unlabeled)    0.8864  0.8529  0.8461    0.8434  0.8121  0.8164    0.8124  0.7896  0.8049  0.7994
Ours (w.o. re-init)      0.8970  0.8771  0.7935    0.8649  0.8163  0.7517    0.8058  0.7895  0.8058  0.7895
Ours                     0.9338  0.9151  0.9005    0.9067  0.8776  0.8636    0.8652  0.8402  0.8473  0.8973
TABLE I: Comparison results on the Dtrail and SemanticKITTI datasets under different support-to-query ratios |S|/|Q|. Our method with different objectives is annotated as follows. Ours (supervised): Eq. (3), trained in a supervised manner. Ours (w.o. unlabeled): Eq. (7), which does not leverage unlabeled data. Ours (w.o. re-init): Eq. (9) without the re-initialization step. Ours: Eq. (9).

In this section, our method is evaluated on the Dtrail dataset for traversability estimation in off-road environments, along with the SemanticKITTI [15] dataset. Our method is compared to other metric learning methods based on episodic training strategies. Furthermore, we conduct various ablation studies to show the benefits of our method.

IV-A Datasets

Figure 6: Dtrail dataset. (a) Our mobile robot platform with one 32-layer and two 16-layer LiDARs. (b) Images of the mountain trail scenes where we construct the dataset.

IV-A1 Dtrail: Off-road terrain dataset

In order to thoroughly examine the validity of our method, we build the Dtrail dataset, a real mobile robot driving dataset of high-resolution LiDAR point clouds from mountain trail scenes. We collect point clouds using one 32-layer and two 16-layer LiDAR sensors mounted on our caterpillar-type mobile robot platform, shown in Fig. 6(a). Our dataset consists of 119 point cloud scenes, and each scene contains approximately 4 million points. Sample camera images corresponding to the point cloud scenes are shown in Fig. 6(b). For the experiments, we split 98 scenes into the query set, 4 scenes into the support set, and 17 scenes into the evaluation set. For the traversability, the magnitude of the z-acceleration from the Inertial Measurement Unit (IMU) of the mobile robot is re-scaled from 0 to 1 and mapped to the points that the robot actually traversed. Also, for data augmentation, a small perturbation is added along the z-axis to some positive points.
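As a small illustration of this labeling step, the sketch below min-max rescales the z-acceleration magnitude to [0, 1]; how the rescaled value is oriented (e.g., whether rougher motion maps to lower traversability) and how points are associated with IMU samples are details not specified here and are left out.

```python
# Minimal sketch of deriving per-point traversability labels from the IMU z-acceleration.
import numpy as np

def traversability_from_imu(z_acc):
    """z_acc: z-acceleration magnitudes associated with the traversed points (assumed given)."""
    mag = np.abs(z_acc)
    return (mag - mag.min()) / (mag.max() - mag.min() + 1e-8)  # rescale to [0, 1]
```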

Figure 7: Qualitative results on the Dtrail dataset. (a) Camera image of each scene. (b) Support data. (c)-(d) Segmentation results of the supervised baseline and of our method. A red point indicates a traversable region, a blue point indicates a non-traversable region, and a black point is an unlabeled region. (e) The final output of our traversability estimation; the traversability of non-traversable regions is masked out using the segmentation result.
(a) Ground Truth
(b) Supervised
(c) Ours
Figure 8: Qualitative Results for SemanticKITTI. A red-colored point indicates a traversable region, a blue-colored point indicates a non-traversable region, and a black point is an unlabeled region.

IV-A2 SemanticKITTI

We also evaluate our method on the SemanticKITTI [15] dataset, an urban outdoor-scene dataset for point cloud segmentation. Since it does not provide any attribute for traversability, we conduct experiments on segmentation only. It contains 11 training sequences (00 to 10) with 23,210 point clouds and 28 classes. We split 5 sequences (00, 02, 05, 08, 09) with 17,625 point clouds for training and the rest, with 5,576 point clouds, for evaluation. We define the ‘road’, ‘parking’, ‘sidewalk’, ‘other-ground’, and ‘terrain’ classes as positive and the remaining classes as negative. For query data, only the ‘road’ class is labeled as positive, and the other classes are left unlabeled. We expect the model to learn the other positive regions from the unlabeled data without direct supervision.
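A minimal sketch of this binary remapping is shown below; it works on class names as listed in the text, and the query-time behavior (leaving every non-'road' class unlabeled) follows our reading of the setup.

```python
# Minimal sketch of the binary class remapping for SemanticKITTI described above.
POSITIVE_CLASSES = {'road', 'parking', 'sidewalk', 'other-ground', 'terrain'}

def to_binary_label(class_name, is_query=False):
    """Returns 1 (positive), 0 (negative), or None (unlabeled) for a SemanticKITTI class."""
    if is_query:
        # Query data: only 'road' is labeled positive; everything else stays unlabeled.
        return 1 if class_name == 'road' else None
    return 1 if class_name in POSITIVE_CLASSES else 0
```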

IV-B Evaluation metric

We evaluate the performance of our method with TPE, the new criterion designed for the traversability estimation task, which assesses segmentation and regression quality simultaneously. Additionally, we evaluate the segmentation quality with the mean Intersection over Union [43] (mIoU). For each class, the IoU is calculated as $IoU=\frac{TP}{TP+FP+FN}$, where $TP$, $FP$, and $FN$ denote the number of true positive, false positive, and false negative points of each class, respectively.
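For completeness, a minimal sketch of the per-class IoU and its mean over classes is given below.

```python
# Minimal sketch of per-class IoU and mIoU over integer-labeled points.
import numpy as np

def mean_iou(pred, gt, num_classes=2):
    """pred / gt: per-point integer class labels."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        ious.append(tp / (tp + fp + fn + 1e-8))
    return float(np.mean(ious))
```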

IV-C Implementation Details

IV-C1 Embedding network

RandLA-Net [4] is fixed as the backbone embedding network for every method for a fair comparison. Specifically, we use 2 down-sampling layers in the backbone and exclude the global $(x, y, z)$ positions in the local spatial encoding layer, which helps the network embed local geometric patterns explicitly. The embedding vectors are normalized with the $l_2$ norm and compared with cosine similarity.

IV-C2 Training

We train the model and proxies with the Adam optimizer and exponential learning rate decay for 50 epochs. The initial learning rate is set to 1e-4. For query and support data, the K-nearest neighbors (KNN) of a randomly picked point are sampled at each training step. We ensure that positive and negative points are evenly present in the sampled points of the support data.
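A minimal sketch of this sampling step is shown below; the patch size k is an illustrative choice, not a value reported in the text.

```python
# Minimal sketch of sampling a KNN patch around a randomly picked center point.
import numpy as np

def sample_knn_patch(points, k=4096):
    """points: [N, 3] array of xyz coordinates; returns indices of the k nearest
    neighbors of a randomly chosen center point."""
    center = points[np.random.randint(len(points))]
    dists = np.linalg.norm(points - center, axis=1)
    return np.argsort(dists)[:k]
```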

IV-C3 Hyperparameter setting

For learning stability, proxies are updated exclusively during the initial 5 epochs. The number of proxies $K$ is set to 128 for each class, and the proxies are initialized from a normal distribution. We set a small margin $\delta$ of 0.01, $\lambda$ of 20, and a temperature parameter $T$ of 0.05 for handling multiple proxies.

Figure 9: Proxy visualization of the scene in our Dtrail dataset. The color of each point represents the proxy assigned to the point. It shows that the learned proxies are well-clustered, and mapped with the semantic features on the point cloud scene.

IV-D Results

IV-D1 Comparison

We compare the performance to ProtoNet [26], which uses a single prototype, and MPTI [17], which adopts multiple prototypes for few-shot 3D segmentation. We also compare against our supervised variant, denoted as ‘Ours (supervised).’ Table I summarizes the results of the experiments. Our method shows a significant margin in terms of mIoU and TPE compared to ProtoNet and MPTI, demonstrating that generating prototypes in a non-parametric approach does not represent the whole data effectively. Moreover, it is notable that our metric learning method outperforms the supervised setting designed for our task, which verifies that it can reduce epistemic uncertainty by incorporating unlabeled data through the unsupervised loss. For SemanticKITTI, the observation is similar to that on the Dtrail dataset. Even though the SemanticKITTI dataset is based on urban scenes, our method outperforms the other few-shot learning methods by 6% and the supervised variant by 2%.

K          1      2      4      8      16     32     64     128    256    512
mIoU (↑)   0.890  0.894  0.880  0.906  0.911  0.883  0.920  0.934  0.931  0.924
TPE (↑)    0.847  0.868  0.840  0.862  0.881  0.845  0.888  0.906  0.898  0.895
TABLE II: Ablation study on Dtrail according to the number of proxies K.

IV-D2 Ablation studies

We repeat the experiments with varying support-to-query ratios ($|\mathbb{S}|/|\mathbb{Q}|$) to evaluate robustness with respect to the amount of support data. Table I shows that our metric learning method is much more robust to performance degradation than the others when the support-to-query ratio decreases. When the ratio decreases from 4% to 1% on the Dtrail dataset, the TPE of our metric learning method decreases by only about 4%, while the TPE of the others drops significantly: 39% for ProtoNet, 13% for MPTI, and 16% for Ours (supervised). This verifies that our method can robustly reduce epistemic uncertainty with a small amount of labeled data.

Moreover, we observe that the performance increases by 6% on average in TPE when adopting the re-initialization step, confirming that re-initialization helps avoid trivial solutions. Also, adopting the unsupervised loss boosts the performance by up to 6% on average, which verifies that the unlabeled loss can provide rich supervision without explicit labels. Furthermore, as shown in Table II, an increasing number of proxies boosts the performance until it converges once the number exceeds 32, which demonstrates the advantage of multiple proxies.

IV-D3 Qualitative Results

Fig. 7 shows the traversability estimation results of our supervised and metric learning-based methods on the Dtrail dataset. We can see that our metric learning-based method performs better than the supervised one. In particular, our method yields better results on regions that are not labeled in the training data. We compare example segmentation results on the SemanticKITTI dataset in Fig. 8. The first column shows the ground truth, and the other columns show the segmentation results of the supervised learning-based method and our method. Evidently, our method shows better results on unlabeled regions, which confirms that our metric learning-based method reduces epistemic uncertainty.

Fig. 9 shows the visualization of the proxies assigned to the point cloud scenes. For better visualization, proxies are clustered into three representations. We observe that the learned proxies successfully represent the various semantic features. Leaves, grounds, and tree trunks are mostly colored green, black, and blue, respectively.

V Conclusion

We propose a self-supervised traversability estimation framework on 3D point cloud data that mitigates epistemic uncertainty. Self-supervised traversability estimation suffers from uncertainty that arises from the limited supervision given by the data. We tackle the epistemic uncertainty by concurrently learning semantic segmentation along with traversability estimation, eventually masking out the non-traversable regions. We start from the fully-supervised setting and develop a deep metric learning method with an unsupervised loss that harnesses the unlabeled data. To properly evaluate the framework, we also devise a new evaluation metric according to the task’s settings, underlining the important criteria of traversability estimation. We build our own off-road terrain dataset with a mobile robot platform in unconstrained environments for realistic testing. Various experimental results show that our framework is promising.

References

  • [1] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, pp. 1231 – 1237, 2013.
  • [2] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11 618–11 628, 2020.
  • [3] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. M. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov, “Scalability in perception for autonomous driving: Waymo open dataset,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2443–2451, 2020.
  • [4] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, A. Trigoni, and A. Markham, “Randla-net: Efficient semantic segmentation of large-scale point clouds,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11 105–11 114, 2020.
  • [5] Z. Chen, J. Zhang, and D. Tao, “Progressive lidar adaptation for road detection,” IEEE/CAA Journal of Automatica Sinica, vol. 6, pp. 693–702, 2019.
  • [6] J. Sock, J. H. Kim, J. Min, and K. H. Kwak, “Probabilistic traversability map generation using 3d-lidar and camera,” IEEE International Conference on Robotics and Automation, pp. 5631–5637, 2016.
  • [7] J. Ahtiainen, T. Stoyanov, and J. Saarinen, “Normal distributions transform traversability maps: Lidar‐only approach for traversability mapping in outdoor environments,” Journal of Field Robotics, vol. 34, 2017.
  • [8] S. Matsuzaki, J. Miura, and H. Masuzawa, “Semantic-aware plant traversability estimation in plant-rich environments for agricultural mobile robots,” ArXiv, vol. abs/2108.00759, 2021.
  • [9] T. Guan, Z. He, D. Manocha, and L. Zhang, “Ttm: Terrain traversability mapping for autonomous excavator navigation in unstructured environments,” ArXiv, vol. abs/2109.06250, 2021.
  • [10] H. Roncancio, M. Becker, A. Broggi, and S. Cattani, “Traversability analysis using terrain mapping and online-trained terrain type classifier,” IEEE Intelligent Vehicles Symposium Proceedings, pp. 1239–1244, 2014.
  • [11] L. Wellhausen, A. Dosovitskiy, R. Ranftl, K. Walas, C. Cadena, and M. Hutter, “Where should i walk? predicting terrain properties from images via self-supervised learning,” IEEE Robotics and Automation Letters, vol. 4, pp. 1509–1516, 2019.
  • [12] M. Wermelinger, P. Fankhauser, R. Diethelm, P. Krüsi, R. Y. Siegwart, and M. Hutter, “Navigation planning for legged robots in challenging terrain,” IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1184–1189, 2016.
  • [13] H. Kolvenbach, C. Bärtschi, L. Wellhausen, R. Grandia, and M. Hutter, “Haptic inspection of planetary soils with legged robots,” IEEE Robotics and Automation Letters, vol. 4, pp. 1626–1632, 2019.
  • [14] W. Van Gansbeke, S. Vandenhende, S. Georgoulis, M. Proesmans, and L. Van Gool, “Scan: Learning to classify images without labels,” in European Conference on Computer Vision, 2020, pp. 268–285.
  • [15] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “Semantickitti: A dataset for semantic scene understanding of lidar sequences,” IEEE/CVF International Conference on Computer Vision, pp. 9296–9306, 2019.
  • [16] A. Valada, J. Vertens, A. Dhall, and W. Burgard, “Adapnet: Adaptive semantic segmentation in adverse environmental conditions,” IEEE International Conference on Robotics and Automation, pp. 4644–4651, 2017.
  • [17] N. Zhao, T.-S. Chua, and G. H. Lee, “Few-shot 3d point cloud semantic segmentation,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8869–8878, 2021.
  • [18] P. Dallaire, K. Walas, P. Giguère, and B. Chaib-draa, “Learning terrain types with the pitman-yor process mixtures of gaussians for a legged robot,” IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3457–3463, 2015.
  • [19] L. Ding, H. Gao, Z. Deng, J. Song, Y. Liu, G. Liu, and K. Iagnemma, “Foot–terrain interaction mechanics for legged robots: Modeling and experimental validation,” The International Journal of Robotics Research, vol. 32, pp. 1585 – 1606, 2013.
  • [20] W. Bosworth, J. Whitney, S. Kim, and N. Hogan, “Robot locomotion on hard and soft ground: Measuring stability and ground properties in-situ,” IEEE International Conference on Robotics and Automation, pp. 3582–3589, 2016.
  • [21] G. G. Waibel, T. Löw, M. Nass, D. Howard, T. Bandyopadhyay, and P. V. K. Borges, “How rough is the path? terrain traversability estimation for local and global path planning,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, p. 16462–16473, sep 2022. [Online]. Available: https://doi.org/10.1109/TITS.2022.3150328
  • [22] G. Koch, R. Zemel, R. Salakhutdinov et al., “Siamese neural networks for one-shot image recognition,” in ICML deep learning workshop, vol. 2.   Lille, 2015, p. 0.
  • [23] V. G. Satorras and J. Bruna, “Few-shot learning with graph neural networks,” ArXiv, vol. abs/1711.04043, 2018.
  • [24] O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in Advances in Neural Information Processing Systems, 2016.
  • [25] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1199–1208, 2018.
  • [26] J. Snell, K. Swersky, and R. S. Zemel, “Prototypical networks for few-shot learning,” ArXiv, vol. abs/1703.05175, 2017.
  • [27] C. Chen, O. Li, A. J. Barnett, J. Su, and C. Rudin, “This looks like that: deep learning for interpretable image recognition,” ArXiv, vol. abs/1806.10574, 2019.
  • [28] J. Deuschel, D. Firmbach, C. I. Geppert, M. Eckstein, A. Hartmann, V. Bruns, P. Kuritcyn, J. Dexl, D. Hartmann, D. Perrin, T. Wittenberg, and M. Benz, “Multi-prototype few-shot learning in histopathology,” IEEE/CVF International Conference on Computer Vision Workshops, pp. 620–628, 2021.
  • [29] K. R. Allen, E. Shelhamer, H. Shin, and J. B. Tenenbaum, “Infinite mixture prototypes for few-shot learning,” ArXiv, vol. abs/1902.04552, 2019.
  • [30] B. Yang, C. Liu, B. Li, J. Jiao, and Q. Ye, “Prototype mixture models for few-shot semantic segmentation,” ArXiv, vol. abs/2008.03898, 2020.
  • [31] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.
  • [32] M. Wang and W. Deng, “Deep face recognition: A survey,” Neurocomputing, vol. 429, pp. 215–244, 2021.
  • [33] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 539–546 vol. 1, 2005.
  • [34] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1735–1742, 2006.
  • [35] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, “Learning fine-grained image similarity with deep ranking,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386–1393, 2014.
  • [36] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric learning via lifted structured feature embedding,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012, 2016.
  • [37] Y. Movshovitz-Attias, A. Toshev, T. Leung, S. Ioffe, and S. Singh, “No fuss distance metric learning using proxies,” IEEE International Conference on Computer Vision, pp. 360–368, 2017.
  • [38] E. W. Teh, T. Devries, and G. W. Taylor, “Proxynca++: Revisiting and revitalizing proxy neighborhood component analysis,” ArXiv, vol. abs/2004.01113, 2020.
  • [39] S. Kim, D. Kim, M. Cho, and S. Kwak, “Proxy anchor loss for deep metric learning,” IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3235–3244, 2020.
  • [40] Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin, “Softtriple loss: Deep metric learning without triplet sampling,” IEEE/CVF International Conference on Computer Vision, pp. 6449–6457, 2019.
  • [41] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in European Conference on Computer Vision, 2018.
  • [42] T. K. Moon, “The expectation-maximization algorithm,” IEEE Signal processing magazine, vol. 13, no. 6, pp. 47–60, 1996.
  • [43] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision, vol. 88, pp. 303–338, 2009.