\UseRawInputEncoding

Unsupervised Domain Adaptation for Point Cloud Semantic Segmentation via Graph Matching

Yikai Bian, Le Hui, Jianjun Qian^∗ and Jin Xie^∗ PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. Email:{yikai.bian, le.hui, csjqian, csjxie}@njust.edu.cn.^∗The author responsible for the correspondence of this paper.

Abstract

Unsupervised domain adaptation for point cloud semantic segmentation has attracted great attention due to its effectiveness in learning with unlabeled data. Most of existing methods use global-level feature alignment to transfer the knowledge from the source domain to the target domain, which may cause the semantic ambiguity of the feature space. In this paper, we propose a graph-based framework to explore the local-level feature alignment between the two domains, which can reserve semantic discrimination during adaptation. Specifically, in order to extract local-level features, we first dynamically construct local feature graphs on both domains and build a memory bank with the graphs from the source domain. In particular, we use optimal transport to generate the graph matching pairs. Then, based on the assignment matrix, we can align the feature distributions between the two domains with the graph-based local feature loss. Furthermore, we consider the correlation between the features of different categories and formulate a category-guided contrastive loss to guide the segmentation model to learn discriminative features on the target domain. Extensive experiments on different synthetic-to-real and real-to-real domain adaptation scenarios demonstrate that our method can achieve state-of-the-art performance. Our code is available at https://github.com/BianYikai/PointUDA.

I INTRODUCTION

Deep learning methods [1, 2] for point cloud semantic segmentation have shown dramatic success in recent years. However, most of these methods focus on fully supervised learning for point cloud segmentation with a large number of manually annotated labels. Although there are several public datasets providing large amounts of annotation data, it is difficult to directly apply the model trained on a labeled source domain to another unlabeled target domain. The reason lies in that the data collected by different 3D sensors have a huge discrepancy in appearance and sparsity, which results in the domain shift problem. Therefore, how to generalize a well-trained model to another unlabeled domain is a challenging but valuable problem in point cloud semantic segmentation.

Unsupervised domain adaptation can alleviate the domain shift problem by transferring the knowledge from the labeled source domain to the unlabeled target domain. Recent advances on unsupervised point cloud domain adaptation tasks mainly focus on reducing the domain gap between the inputs. For example, Yi et al. [3] build a point cloud completion network with sequences of point clouds to bridge the domain gap between LiDAR sensors with different beams. ePointDA [4] and SqueezeSegV2 [5] use auxiliary rendering networks to render dropout noises or intensity on the synthetic dataset, which translate the point clouds from the source domain similar to the target domain. Furthermore, these methods use a series of feature alignment methods to increase the consistency of feature distributions, such as higher-order moment matching [6], and geodesic correlation alignment [7]. However, these methods mainly consider the overall distributions of two domains to form the global-level feature alignment, which ignores the local geometric differences between the domains.

In this paper, we propose a domain adaptation framework for unsupervised point cloud segmentation with the local-level feature alignment. Compared with the global-level feature alignment, our framework can focus on the correlation between the similar local structures of point from the two domains, so that the reliable feature alignment can be performed to guide the discriminative semantic feature learning of the target domain. Specifically, through the farthest point sampling, we select a set of centroid points and construct the dynamic local feature graph for each centroid point to capture its local geometry information. Then, in order to enrich the graph of the source domain, we construct a feature graph memory bank to store the generated source-domain feature graphs during the training phase. After that, inspired by the point cloud matching [8], we adopt the optimal-transport cost to measure the graph similarities between the memory bank and target domain, so that a reliable assignment matrix can be obtained to guide the knowledge transferring from the source domain to the target domain. Particularly, in order to further extract the discriminative target-domain feature, we consider the category-wise correlation between the source domain and the target domain, and exploit the contrastive learning to increase category-level discrimination of target graphs. Such category-guided contrastive loss can effectively help cluster and distinguish the feature-graph distributions of different categories. Extensive experiments demonstrate the effectiveness of our framework, where we not only focus on the synthetic-to-real domain adaptation scenarios (vKITTI to SemanticPOSS), but also pay attention to the indoor (S3DIS to ScanNet) and the outdoor (SemanticKITTI to nuScenes) real-to-real domain adaptation scenarios.

Our contributions can be summarized as follows:

•

We propose a novel graph-based framework for local-level feature alignment for unsupervised domain adaptive point cloud semantic segmentation.
•

We construct feature graphs to capture the local geometry information of point clouds and use a local feature loss based on an assignment matrix for the alignment of feature graphs.
•

We develop a category-guided contrastive loss to guide the segmentation model to learn the discriminative features on the target domain.

II RELATED WORK

Point Cloud Semantic Segmentation. Recent progress on point cloud semantic segmentation is mainly divided into several categories according to different representations of data. Volumetric-based methods require a preprocessing stage to voxelize the original point cloud. SparseConvNet [9] proposes a submanifold sparse convolution network to deal with spatially-sparse voxel data. MinkowskiNet [10] creates Minkowski space on sparse representation data and proposes a powerful 4-dimensional convolutional neural network to deal with 3D videos. Projection-based methods need to project the point cloud into an image before feeding data into the network. SqueezeSegV2 [5] uses a context aggregation module to improve the robustness to dropout noise on projected 2D LiDAR image. SqueezeSegV3 [11] proposes an efficient spatially-adaptive convolution to deal with the discrepancy of data distribution of different LiDAR image locations. Point-based methods directly use unordered point clouds for semantic segmentation. However, due to the heavy computation, most of the methods first split the point cloud into blocks before training and inferring. PointNet [1] uses multi-layer perceptron and a mini-network (T-Net) to extract features from unordered point clouds. In order to strengthen the local information for point-level segmentation, GACNet [2] and PointWeb [12] use different attention modules to dynamically assign weight to local features. In this work, we leverage PointWeb as our segmentation network because of its efficiency in processing unordered point clouds.

Unsupervised Domain Adaptation. Unsupervised domain adaptation (UDA) aims to train the model in the labeled source domain and generalize the knowledge to the target domain through the unsupervised methods. Recent advances on domain adaptation for 3D point cloud mainly study aligning the distributions by input-level and feature-level alignment. Saleh et al. [13] use CycleGAN [14] to translate the synthetic bird’s eye view point cloud image to the real point cloud image for domain adaptive vehicle detection. ePointDA [4] use a dropout noise rendering network to achieve uniformity of data distribution between domains and adopt a higher-order moment matching loss for feature-level alignment. Yi et al. [3] use a completion network to complete the point cloud with sequences data, so that they can recover the 3D surfaces from different LiDAR data and transfer knowledge between different LiDAR sensors. However, the input-level adaptation methods lead to extra challenges and training costs due to the variable geometric structures in different domains. Besides, recent works on 2D UDA [15, 16, 17] are also quite applicable to 3D UDA, where they use the different losses to decrease the domain shift problem, $e.g.$ maximum squares loss, entropy loss, and adversarial loss. Furthermore, self-training is also an effective technique for UDA. [18] proposes a self-supervised task for target domain to learn its useful representations. ST3D [19] proposes a quality-aware triplet memory bank to generate high-quality 3D detection pseudo labels for self-training. xMUDA [20] and DsCML [21] propose cross-modal constraint to retain the advantages of 2D images and 3D point clouds for domain adaptation. In this paper, considering the input-level methods cannot handle complex domain adaptation scenarios, we develop a general uni-modal 3D UDA framework with feature-level alignment.

Refer to caption — Figure 1: Illustration of our framework. Point clouds in the source and target domains are fed into the model to construct dynamic feature graphs. The source graphs are used to build a memory bank and the target graphs are aligned by our graph-based local feature alignment method.

III OUR METHOD

III-A Overview

In unsupervised domain adaptive point cloud semantic segmentation, we are able to access the source domain $\mathcal{X}_{s}=\{\mathbf{x}^{s}_{i}\}^{M^{s}}_{i=1}$ of ${M^{s}}$ point clouds with its segmentation labels $\mathcal{Y}_{s}=\{\mathbf{y}^{s}_{i}\}^{M^{s}}_{i=1}$ and the unlabeled target domain $\mathcal{X}_{t}=\{\mathbf{x}^{t}_{j}\}^{M^{t}}_{j=1}$ of ${M^{t}}$ point clouds, where $\mathbf{x}^{s}_{i}\in\mathbb{R}^{{N^{s}_{i}}\times 3}$ is the set of ${N^{s}_{i}}$ points and $\mathbf{x}^{t}_{j}\in\mathbb{R}^{{N^{t}_{j}}\times 3}$ is the set of ${N^{t}_{j}}$ points. Given data $\mathcal{X}_{s}$ , $\mathcal{Y}_{s}$ and $\mathcal{X}_{t}$ , our goal is to train a model which can precisely categorize the point of target data into one of the common semantic categories in the source data, and to alleviate the performance drop problem caused by the domain gap at the same time. As illustrated in Fig.1, we first construct the local feature distributions of the source and target domains with the proposed dynamic feature graphs. Then, by building a source-domain feature graph memory bank, we employ graph matching to obtain graph pairs between the graphs in the memory bank and the target graphs. Finally, with the obtained matching association, we utilize the designed losses for local-level feature alignment.

III-B Dynamic Feature Graph

Different from the global-level feature alignment methods [6, 7, 15, 16, 17] roughly aligning two domains, we consider the differences of local neighborhood context in the target domain for the fine alignment. The main idea of our method is to use the learned dynamic local feature graphs to capture the multi-level features in different neighborhoods of point clouds. Then, based on the local-graph similarity, the correlation between local neighborhoods from two domains can be viewed as the knowledge transferring from the labeled source domain to the unlabeled target domain.

We leverage PointWeb [12] as our semantic segmentation backbone, which contains a classifier and a feature extractor with an encoder and a decoder. We use the feature extractor to extract the multi-level features and then build dynamic feature graphs on the sampled centroid points by the feature similarity. Specifically, given a point cloud sample $\mathbf{x}\in\mathbb{R}^{{N}\times 3}$ with $N$ points, we first extract its local-context feature using the feature extractor. Then, we select ${N/64}$ centroid points using the farthest point sampling for three iterations, where the centroid points are then as the kernel points of graphs for feature aggregation at each level. In detail, for each kernel point at different levels, we gather its k-nearest neighbors in the feature space ( $k$ is different at each level). Thereby, a feature graph can be constructed by setting the k-NN features as its vertices $\{\mathbf{v}_{i}^{k_{1}},\mathbf{v}_{i}^{k_{2}},\mathbf{v}_{i}^{k_{3}},\mathbf{v}_{i}^{k_{4}}\}$ and the k-NN feature distances as the edges $\{\mathbf{e}_{i}^{k_{1}},\mathbf{e}_{i}^{k_{2}},\mathbf{e}_{i}^{k_{3}},\mathbf{e}_{i}^{k_{4}}\}$ , where $i$ is the index of centroid points and $\{k_{j}\}_{j=1}^{4}$ is the different values in k-NN. We obtain ${N/64}$ dynamically updated local feature graphs to represent the local neighborhood context of the given point cloud, which can be formulated as,

\displaystyle\mathcal{G}={\{\mathbf{g}_{i}\}}^{N_{c}}_{i=1}=\{\{\mathbf{f}_{i}^{k_{j}}\}^{4}_{j=1}\}^{N_{c}}_{i=1}=\{\{\mathbf{v}_{i}^{k_{j}},\mathbf{e}_{i}^{k_{j}}\}^{4}_{j=1}\}^{N_{c}}_{i=1},

(1)

where $N_{c}$ denotes the number of ${N/64}$ centroid points, and $\mathbf{f}_{i}^{k_{j}}$ represents the feature embedding containing the vertice and the edge information. As a result, given a source sample and a target sample, the generated graphs are represented as $\mathcal{G}^{s}={\{\mathbf{g}^{s}_{i}\}}^{N_{c}}_{i=1}$ and $\mathcal{G}^{t}={\{\mathbf{g}^{t}_{i}\}}^{N_{c}}_{i=1}$ .

III-C Graph-based Local Feature Alignment

In this section, based on the constructed dynamic graphs above, we aim to find the intrinsic correlation between the source domain and the target domain. We use the graphs from the source domain to guide the model to extract semantic discriminative features on the target domain. However, at the training stage, the sample category in a batch is limited and their graph patterns tend to present significant structure differences, which may potentially introduce the alignment bias. To address the issue above, we build up a feature graph memory bank $\mathcal{G}^{b}$ and store a graph $\mathbf{g}^{s}_{i}$ into the bank according to the corresponding category of the centroid point. Therefore, benefiting from such memory bank mechanism, we can sufficiently mine the rich source information from it for reliable target-domain feature learning. The memory bank provides the same capacity $B$ for each category of graphs. Once the number of graphs exceeds the capacity $B$ of the corresponding category, we will update the memory bank by replacing the oldest graphs with the new ones.

Given the graphs $\mathcal{G}^{t}={\{\mathbf{g}^{t}_{j}\}}^{N_{c}}_{j=1}$ from target domain, we consider finding the most similar graph from $\mathcal{G}^{b}$ to each graph in $\mathcal{G}^{t}$ for feature alignment. In particular, we use optimal transport for graph matching. Specifically, the total transport cost of optimal transport is used to measure the similarity between two graphs, and the assignment matrix $\mathbf{A}\in\mathbb{R}^{K\times K}$ is used to find the point-level correspondences for $K$ nodes in graphs. In the graph matching formulation, we first compute the distance matrix $\mathbf{D}\in\mathbb{R}^{K\times K}$ , where the element $\mathbf{D}_{(m,n)}$ indicates the distance between the point $m$ in one graph and the point $n$ in the other graph. Here, we use the squared Euclidean distance in the feature space to measure the pairwise distance between points in graphs, where the points are composed of the corresponding features with edge and vertice information. Once we obtain the distance matrix $\mathbf{D}$ , we apply the Sinkhorn algorithm [22] to obtain the final assignment matrix $\mathbf{A}$ and the total transport cost through solving the optimal transport problem. In this way, we can measure the relevance between each target graph with all the graphs in the memory bank. As a result, we can find the most similar graph $\mathbf{g}^{b}_{(a,l)}\in\mathcal{G}^{b}$ for the target graph $\mathbf{g}^{t}_{j}$ according to the sorting result of the transport cost for knowledge transferring, where $a$ indicates the category and $l$ indicates the index in the memory bank.

Based on the generated graph pairs, we formulate a local feature loss based on the assignment matrix for the local-level feature alignment. Given a target graph $\mathbf{g}^{t}_{j}$ , we first select the most similar graph $\mathbf{g}^{b}_{(a,l)}$ from the memory bank. At the same time, we are able to access the corresponding assignment matrix $\mathbf{A}_{j}\in\mathbb{R}^{K\times K}$ , which can decode the point-level corresponding feature assignment between two graphs. In detail, for each point in target graph $\mathbf{g}^{t}_{j}$ , we can obtain the corresponding transport weights for every point in graph $\mathbf{g}^{b}_{(a,l)}$ from the assignment matrix $\mathbf{A}_{j}$ . Then, we perform a weighted sum of $\mathbf{g}^{b}_{(a,l)}\in\mathbb{R}^{K\times D}$ to guide the learning of $\mathbf{g}^{t}_{j}\in\mathbb{R}^{K\times D}$ , where $D$ is the number of feature channels. The key point is that the local neighborhood areas with similar semantic contexts need to have similar feature distributions. In this way, we can effectively align the indiscriminate feature distributions of the unlabeled target domain to the source domain. Therefore, we propose the following assignment matrix based local feature loss for feature graph learning in the target domain:

\displaystyle\mathcal{L}_{loc}=\frac{1}{N_{c}}\sum_{j=1}^{N_{c}}{\left\|\mathbf{g}^{t}_{j}-\mathbf{A}_{j}\mathbf{g}^{b}_{(a,l)}\right\|}_{1}.

(2)

Owing to a variety of feature graphs from different categories in our memory bank, we can further exploit the contrastive learning for more discriminative target-domain feature learning. Here, we select the category $a$ of the matched graph $\mathbf{g}^{b}_{(a,l)}$ as the positive category and the other categories as the negative categories. In order to obtain the representative features of each category, all graphs in the memory bank are used for calculating the feature representations. Although we have achieved the assignment matrix for each positive or negative pair, it is meaningless for point-level adaptation on unmatched pairs. Therefore, we use the mean of all features in the graphs with the same category to represent the feature representation of the corresponding category. It is worth noting that our graph is composed of multi-level features, so we calculate the mean features of different levels separately and then concatenate them as the final feature representation. The positive and negative features for graph $\mathbf{g}^{t}_{j}$ can be formulated as,

\displaystyle\mathbf{f}^{+}_{j}=\frac{1}{B}\sum_{b=1}^{B}\sum_{c=1}^{C}\mathbb{I}_{[c=a]}\Theta\left({\mathbf{f}_{(b,c)}^{k_{1}},\mathbf{f}_{(b,c)}^{k_{2}},\mathbf{f}_{(b,c)}^{k_{3}},\mathbf{f}_{(b,c)}^{k_{4}}}\right),

(3)

\displaystyle\mathbf{f}^{-}_{j}=\frac{1}{BC}\sum_{b=1}^{B}\sum_{c=1}^{C}\mathbb{I}_{[c\neq a]}\Theta\left({\mathbf{f}_{(b,c)}^{k_{1}},\mathbf{f}_{(b,c)}^{k_{2}},\mathbf{f}_{(b,c)}^{k_{3}},\mathbf{f}_{(b,c)}^{k_{4}}}\right),

(4)

where $C$ is the number of categories and $B$ is the capacity size of each category. The indicator function $\mathbb{I}$ returns 1 if the condition is satisfied or returns 0 if unsatisfied. We use $\Theta$ to represent the mean and concatenation operators.

Then, with the generated positive and negative features, we formulate the following contrastive loss for increasing the intra-category compactness and inter-category separability between the target graph $\mathbf{g}^{t}_{j}$ and the graphs in $\mathcal{G}^{b}$ .

\displaystyle\mathcal{L}_{con}=\frac{1}{N_{c}}\sum_{j=1}^{N_{c}}\left[{\left\|\mathbf{f}^{t}_{j}-\mathbf{f}^{+}_{j}\right\|}_{1}-{\left\|\mathbf{f}^{t}_{j}-\mathbf{f}^{-}_{j}\right\|}_{1}+\alpha\right]_{+},

(5)

where $\alpha$ is the margin of the contrastive loss and $\mathbf{f}^{t}_{j}$ is the mean feature of graph $\mathbf{g}^{t}_{j}$ . Therefore, with the proposed local feature loss and the contrastive loss, we consider the feature alignment from two complementary perspectives, which can significantly reduce the domain discrepancy in feature space.

III-D Domain Adaptation Scheme

For the unsupervised domain adaptation, the core challenge is how to learn the discriminative target features without labels. First of all, for the source domain, we use the standard cross-entropy loss for supervised training:

\displaystyle\mathcal{L}_{seg}=-\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C}\mathbf{y}^{s}_{(n,c)}\log\hat{\mathbf{y}}^{s}_{(n,c)},

(6)

where $\mathbf{y}^{s}\in\mathbb{R}^{N\times C}$ is the semantic labels for $N$ points with $C$ semantic categories and $\hat{\mathbf{y}}^{s}$ is the outputs from the model.

In addition, in our framework, in order to identify the relationship of local features between the source domain and the target domain, we construct dynamic feature graphs and the generated graph pairs based on the graph matching to find correspondences between the two domains. With our developed local feature loss based on the assignment matrix and the category-guided contrastive loss, we can effectively align the local features between the two domains. The overall loss can be formulated as:

\begin{aligned} \mathcal{L}_{all}=\mathcal{L}_{seg}+\lambda_{1}\mathcal{L}_{loc}+\lambda_{2}\mathcal{L}_{con}\end{aligned},

(7)

where $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters balancing the proposed losses with semantic segmentation loss.

Furthermore, our framework can be extended into a two-stage method in a self-training manner, where we follow Jaritz et al. [20] to use a pseudo-label training strategy. We first use our framework to train a model with the loss in Eq. 7, where the source data $\mathcal{X}_{s}$ , $\mathcal{Y}_{s}$ and the target data $\mathcal{X}_{t}$ are available. Then we fix the parameters of the model and generate pseudo labels $\mathcal{\hat{Y}}_{s}$ for target data. After that, the supervised semantic segmentation loss with pseudo labels is used on the target domain.

IV EXPERIMENT

IV-A Datasets

vKITTI to SemanticPOSS. The synthetic dataset vKITTI [23] contains 6 sequences of outdoor scenes in urban settings, where the point cloud are generated from the synthetic 2D depth images. The SemanticPOSS [24] dataset was obtained in dynamic driving scenarios. It is composed of 6 sequences of scenes with a total of 2988 LiDAR scans. Therefore, there is a large gap in data distribution between the vKITTI and the real-world SemanticPOSS. For the domain adaptation scenario from vKITTI to SemanticPOSS, we select 6 semantic categories for domain adaptation: plants, building, road, traffic sign, pole, and car. The point clouds are sampled into blocks of 15m $\times$ 15m, and each block contains 4096 points.

S3DIS to ScanNet. The S3DIS [25] dataset is an indoor point cloud dataset containing 6 areas with 271 rooms and the ScanNet [26] dataset contains 1513 indoor point cloud scenes annotated. For the domain adaptation scenario from S3DIS to ScanNet, we use 8 semantic categories for domain adaptation: floor, wall, window, door, table, chair, sofa, and bookshelf. Due to the sparsity and scene incompleteness of ScanNet, there is a huge domain gap between the datasets. We divide the point clouds into blocks of size 1.5m $\times$ 1.5m, and each block contains 8192 points.

SemanticKITTI to nuScenes. The SemanticKITTI [27] dataset and the nuScenes [28] dataset are real-world datasets. However, the SemanticKITTI dataset is obtained by the 64-beam LiDAR scanner, while the nuScenes dataset is obtained by the 32-beam LiDAR scanner. Thus, there is a large gap of data sparsity in the SemanticKITTI-to-nuScenes domain adaptation scenario. We focus on the 10 categories for domain adaptation: car, bicycle, motorcycle, truck, other vehicle, pedestrian, drivable, sidewalk, terrain, and vegetation. The point clouds are sampled into blocks of 10m $\times$ 10m, and each block contains 4096 points.

IV-B Implementation Details

We use the official PyTorch implementation for PointWeb as our segmentation backbone. Stochastic Gradient Descent (SGD) optimizer is selected for training with the momentum 0.9 and the weight decay 0.0001, respectively. Also, we apply the weight decay to the learning rate, where the drop factor is 0.1 and the step size is 30. The initial learning rate for indoor and outdoor scenarios are 0.05 and 0.005. The capacity size $B$ of the memory bank is set to 16. The parameters $\lambda_{1}$ and $\lambda_{2}$ are set to 1.0 and 0.1. The margin $\alpha$ of the contrastive loss is set to 0.4. The values in k-NN for different levels are set to 1, 4, 16 and 64. To train and test our model, we use a single TITAN RTX GPU and the batch size is set to 4.

TABLE I: The performance comparison of unsupervised domain adaptation methods. All the results are reported by mIoU.

Model	vKITTI to SemanticPOSS	S3DIS to ScanNet	SemanticKITTI to nuScenes
Supervised	65.8	66.4	46.8
Source Only	44.6	43.2	26.7
MinEnt	45.9	44.3	32.3
MaxSquare	46.3	43.6	32.8
ADDA	49.8	42.5	31.2
PL	51.0	46.4	29.8
3DGCA	47.1	43.1	33.7
SQSGV2	-	-	10.1
C&L	-	-	31.6
Ours	54.9	53.8	37.3

IV-C Performance Comparison

We report the performance of point cloud semantic segmentation with mean Intersection-over-Union (mIoU). Tab.I shows the quantitative results of the comparison between our method with other domain adaptation methods. As shown in the table, our method achieves the highest performance, which shows the effective domain transferability of our method. Specifically, the Supervised means the model of PointWeb is trained on the target domain with semantic labels. The Source Only means the model trained with the source domain and directly tested at the target domain. Due to the domain gap, the performance of Source Only has a significant drop compared to the Supervised, which shows the necessity of unsupervised domain adaptation.

In order to verify the effectiveness of our method, we compare our method with a series of general unsupervised domain adaptation methods: MinEnt [16], MaxSquare [15], and ADDA [17]. For a fair comparison, these methods are reproduced with the same setting in our framework, where the hyperparameters are adjusted to obtain the best performance on all domain adaptation scenarios. The PL is the same Pseudo-Label training strategy in the [20] with our framework, which is a two-stage method with extra training cost. Furthermore, we introduce the geodesic correlation alignment used in [5] into our segmentation framework to construct additional comparison method 3DGCA. It can be observed from the Tab. I that although the above methods can alleviate the domain discrepancy, they are not efficient for point cloud semantic segmentation. Especially with the global-level feature alignment methods, the model produces the confused semantic information in the feature space. In the S3DIS to ScanNet scenario, these methods even produce negative effects on domain adaptation.

Furthermore, we compare our method with the 3D unsupervised domain adaptation methods: SqueezeSegV2 (SQSGV2) [5] and Complete & Label (C&L) [3]. Because these methods use different point cloud semantic segmentation backbones, we directly use the results of SQSGV2 and C&L reported in [3] for comparison. Since SQSGV2 requires spherical projection and C&L requires sequences of point cloud for the completion network, it is limited to reproduce in the vKITTI to SemanticPOSS and S3DIS to ScanNet domain adaptation scenarios. As shown in the Tab. I, compared with the above methods, our method achieves state-of-the-art performance on three domain adaptation scenarios.

IV-D Ablation Studies and Analysis

In order to further verify the effects of each module of our method and the effectiveness of the proposed assignment matrix based local feature loss, we conduct ablation studies on the vKITTI to SemanticPOSS scenario.

As shown in Tab. II, we first report the performance improvement brought by each proposed loss, where the quantitative results of each loss can show its effective domain transferability. Particularly, the integration of the two losses can benefit the overall domain adaptation framework, and further improve the domain adaptation performance. It can be observed that our framework can benefit from a simple pseudo-label training strategy with additional 3.0% improvement, and they play a complementary role in unsupervised domain adaptation.

Secondly, in order to verify that the feature distributions of different categories has been separated, we draw t-SNE [29] visualization with the features from the target domain to show qualitative results. As shown in Fig. 3, our proposed method can effectively enhance the discrimination of features from the target domain.

Thirdly, we conduct the ablation study without using the assignment matrix named $L_{loc}$ w/o $A$ . In this case, we use the mean features mentioned in Sec.III-C to represent the local feature graphs, and directly select the nearest neighbor from the memory bank to find the relationship between the graphs from the source domain and the target domain. Instead of using the assignment matrix for the alignment, we use the mean features to directly align the two graph features. As shown in Tab. II, the proposed assignment matrix based local feature loss can achieve a better performance.

At last, we show the visualization of point cloud semantic segmentation results to qualitatively illustrate the effectiveness of our method. It can be clearly observed in Fig. 2, compared with the Source Only, only a few noise predictions are produced in our method, which shows the proposed framework can effectively alleviate the domain gap problem and significantly improve the segmentation performance.

TABLE II: Ablation study of adapting vKITTI to SemanticPOSS

Model	plants	building	road	trafficsign	pole	car	mIoU
Source Only	57.4	58.2	75.3	16.5	17.7	42.5	44.6
$L_{loc}$ w/o $A$	61.5	71.3	76.9	10.7	18.1	41.5	46.7
$L_{loc}$	62.1	73.6	79.9	9.7	25.1	44.2	49.1
$L_{con}$	60.0	72.0	78.1	15.4	27.1	45.8	49.7
$L_{loc}$ + $L_{con}$	63.2	74.8	81.9	12.6	28.8	50.0	51.9
$L_{loc}$ + $L_{con}$ +PL	63.9	76.9	84.1	16.6	36.4	51.5	54.9

V CONCLUSION

In this paper, we proposed an unsupervised domain adaptive point cloud semantic segmentation framework based on feature graph matching. With the proposed assignment matrix based local feature loss and category-guided contrastive loss, we can align the local-level feature distributions of the source domain and the target domain more accurately in a meticulous way and guide the segmentation model to learn discriminative features on the target domain. Extensive experiments on different synthetic-to-real and real-to-real domain adaptation scenarios have demonstrated the superiority of our method.

References

[1] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in CVPR, 2017.
[2] L. Wang, Y. Huang, Y. Hou, S. Zhang, and J. Shan, “Graph attention convolution for point cloud semantic segmentation,” in CVPR, 2019.
[3] L. Yi, B. Gong, and T. Funkhouser, “Complete & Label: A domain adaptation approach to semantic segmentation of LiDAR point clouds,” in CVPR, 2021.
[4] S. Zhao, Y. Wang, B. Li, B. Wu, Y. Gao, P. Xu, T. Darrell, and K. Keutzer, “ePointDA: An end-to-end simulation-to-real domain adaptation framework for LiDAR point cloud segmentation,” in AAAI, 2021.
[5] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer, “SqueezeSegV2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud,” in ICRA, 2019.
[6] C. Chen, Z. Fu, Z. Chen, S. Jin, Z. Cheng, X. Jin, and X.-S. Hua, “HoMM: Higher-order moment matching for unsupervised domain adaptation,” in AAAI, 2020.
[7] P. Morerio, J. Cavazza, and V. Murino, “Minimal-entropy correlation alignment for unsupervised deep domain adaptation,” in ICLR, 2018.
[8] Z. J. Yew and G. H. Lee, “RPM-Net: robust point matching using learned features,” in CVPR, 2020.
[9] B. Graham, M. Engelcke, and L. Van Der Maaten, “3D semantic segmentation with submanifold sparse convolutional networks,” in CVPR, 2018.
[10] C. Choy, J. Gwak, and S. Savarese, “4D spatio-temporal ConvNets: Minkowski convolutional neural networks,” in CVPR, 2019.
[11] C. Xu, B. Wu, Z. Wang, W. Zhan, P. Vajda, K. Keutzer, and Tomizuka, “SqueezeSegV3: Spatially-adaptive convolution for efficient point-cloud segmentation,” in ECCV, 2020.
[12] H. Zhao, L. Jiang, C.-W. Fu, and J. Jia, “PointWeb: Enhancing local neighborhood features for point cloud processing,” in CVPR, 2019.
[13] K. Saleh, A. Abobakr, M. Attia, J. Iskander, D. Nahavandi, M. Hossny, and S. Nahvandi, “Domain adaptation for vehicle detection from bird’s eye view LiDAR point cloud data,” in ICCV Workshops, 2019.
[14] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017.
[15] M. Chen, H. Xue, and D. Cai, “Domain adaptation for semantic segmentation with maximum squares loss,” in ICCV, 2019.
[16] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez, “ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation,” in CVPR, 2019.
[17] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in CVPR, 2017.
[18] I. Achituve, H. Maron, and G. Chechik, “Self-supervised learning for domain adaptation on point clouds,” in WACV, 2021.
[19] J. Yang, S. Shi, Z. Wang, H. Li, and X. Qi, “ST3D: Self-training for unsupervised domain adaptation on 3D object detection,” in CVPR, 2021.
[20] M. Jaritz, T.-H. Vu, R. d. Charette, E. Wirbel, and P. Pérez, “xMUDA: Cross-modal unsupervised domain adaptation for 3D semantic segmentation,” in CVPR, 2020.
[21] D. Peng, Y. Lei, W. Li, P. Zhang, and Y. Guo, “Sparse-to-dense feature matching: Intra and inter domain cross-modal learning in domain adaptation for 3D semantic segmentation,” in ICCV, 2021.
[22] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” NeurIPS, 2013.
[23] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual worlds as proxy for multi-object tracking analysis,” in CVPR, 2016.
[24] Y. Pan, B. Gao, J. Mei, S. Geng, C. Li, and H. Zhao, “SemanticPOSS: A point cloud dataset with large quantity of dynamic instances,” in IV, 2020.
[25] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese, “3D semantic parsing of large-scale indoor spaces,” in CVPR, 2016.
[26] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “ScanNet: Richly-annotated 3D reconstructions of indoor scenes,” in CVPR, 2017.
[27] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences,” in ICCV, 2019.
[28] W. K. Fong, R. Mohan, J. V. Hurtado, L. Zhou, H. Caesar, O. Beijbom, and A. Valada, “Panoptic nuScenes: A large-scale benchmark for LiDAR panoptic segmentation and tracking,” arXiv preprint arXiv:2109.03805, 2021.
[29] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” JMLR, 2008.