RCVLab, Dept. of Electrical and Computer Engineering, Ingenuity Labs,
Queen’s University, Kingston, Ontario, Canada
Email: {y.wu, michael.greenspan}@queensu.ca

Pseudo-keypoint RKHS Learning for Self-supervised 6DoF Pose Estimation

Yangzheng Wu (ORCID 0000-0001-8893-0672), Michael Greenspan (ORCID 0000-0001-6054-8770)

Abstract

We address the simulation-to-real domain gap in six degree-of-freedom pose estimation (6DoF PE), and propose a novel self-supervised keypoint voting-based 6DoF PE framework, effectively narrowing this gap using a learnable kernel in RKHS. We formulate this domain gap as a distance in high-dimensional feature space, distinct from previous iterative matching methods. We propose an adapter network, which is pre-trained on purely synthetic data with synthetic ground truth poses, and which evolves the network parameters from this source synthetic domain to the target real domain. Importantly, the real data training only uses pseudo-poses estimated by pseudo-keypoints, and thereby requires no real groundtruth data annotations. Our proposed method is called RKHSPose, and achieves state-of-the-art performance among self-supervised methods on three commonly used 6DoF PE datasets including LINEMOD ( $+4.2\%$ ), Occlusion LINEMOD ( $+2\%$ ), and YCB-Video ( $+3\%$ ). It also compares favorably to fully supervised methods on all six applicable BOP core datasets, achieving within $-11.3\%$ to $+0.2\%$ of the top fully supervised results.

Keywords:

pose estimation, self-supervision, domain adaptation, keypoint estimation

1 Introduction

RGB-D Six Degree-of-Freedom Pose Estimation (6DoF PE) is a problem being actively explored in Computer Vision (CV). Given an RGB image and its associated depth map, the task is to detect objects and estimate their poses comprising 3DoF rotational angles and 3DoF translational offsets in the camera reference frame. This task enables many applications such as augmented reality [hinterstoisser2012model, posecnn, hodan2018bop, kaskman2019homebreweddb], robotic bin picking [doumanoglou2016recovering, hodan2017tless, kleeberger2019large], autonomous driving [ma2019accurate, xiao2019identity] and image-guided surgeries [gadwe2018real, greene2023dvpose].

As with other machine learning (ML) tasks, fully supervised 6DoF PE requires large annotated datasets. This requirement is particularly challenging for 6DoF PE, as the annotations comprise not only the identity of the objects in the scene, but also their 6DoF pose, which makes the data relatively expensive to annotate compared to related tasks such as classification, detection and segmentation. This is because humans are not able to qualitatively or intuitively estimate 6DoF pose, which therefore requires additional instrumentation of the scene at data collection, and/or more sophisticated user annotation tools [hinterstoisser2012model, posecnn]. Consequently, synthetic 6DoF PE datasets [hodan2018bop, kleeberger2019large] have been introduced, either as an additional complement to real datasets, or as standalone purely synthetic datasets. Annotated synthetic datasets are of course trivially inexpensive to collect, simply because precise synthetic pose annotation in a simulated environment is fully automatic. A known challenge in using synthetic data, however, is that there typically exists a domain gap between the real and synthetic data, which makes results less accurate when inferring on real data using models trained on purely synthetic datasets. Expectations of the potential benefit of synthetic datasets have led to the exploration of a rich set of Domain Adaptation (DA) methods, which specifically aim to reduce the domain gap [bozorgtabar2020exprada, zhang2017curriculum, pinheiro2019domain] using inexpensive synthetic data for a wide variety of tasks, recently including 6DoF PE [guo2023knowledge, lee2022uda, li2021sd].

Early methods [wu2022vote, kleeberger2020single] ignored the simulation-to-real (sim2real) domain gap and nevertheless improved performance by training on both synthetic and real annotated data, effectively augmenting the real images with the synthetic. However, these methods still required real labels to be sufficiently accurate and robust for practical applications, partially due to the domain gap. As shown in Fig. LABEL:fig:teaser, the rendered synthetic objects (left image) have a slightly different appearance than the real objects (right image). The details of the CAD models, both geometry and texture, are not precise, as can be seen for the can object (which lacks a mouth and shadows), the benchvise (which is missing a handle), and the holepuncher (which has coarse geometric resolution). Several recent methods [kleeberger2021investigations, sundermeyer2018implicit, tan2023smoc, chen2023texpose, wang2020self6d] have started to address the sim2real gap for 6DoF PE by first training on labeled synthetic data and then fine-tuning on unlabeled real data. These methods, commonly known as self-supervised methods, reduce the domain gap by adding extra supervision using features extracted from real images without requiring real groundtruth (GT) labels. The majority of these methods are viewpoint/template matching-based, and the self-supervision commonly matches 2D masks or 3D point clouds iteratively [kleeberger2021investigations, sundermeyer2018implicit, wang2020self6d].

While the above-mentioned self-supervision works have shown promise, there exists a wealth of DA techniques that can be brought to bear to improve performance further for this task. One such technique is Reproducing Kernel Hilbert Space (RKHS), which is a kernel method that has been shown to be effective for DA [khosravi2023existence, al2021learning, bietti2019kernel]. RKHS was initially used to create decision boundaries for non-separable data [cortes1995support, pearson1901liii], and has been shown to be effective at reducing the domain gap for various tasks and applications [zhang2018aligning, shan2023unsupervised, chen2020homm]. The reproducing kernel guarantees that the domain gap can be statistically measured, allowing network parameters trained on synthetic data to be effectively adapted to the real data, using specifically tailored metrics.

To address this sim2real domain gap in 6DoF PE, we propose RKHSPose, which is a keypoint-based method [wu2022vote, wu2022keypoint] trained on a mixture of a large collection of labeled synthetic data, and a small handful of unlabeled real data. RKHSPose estimates the intermediate radial voting quantity, which has been shown to be effective for estimating keypoints [wu2022vote], by first training a modified FCN-ResNet-18 on purely synthetic data, with automatically labeled synthetic GT poses. The radial quantity is a map of the distance from each image pixel to each keypoint. Next, real images are passed through the synthetically trained network, resulting in a set of pseudo-keypoints. The real images and their corresponding pseudo-keypoints are used to render a set of ‘synthetic-over-real’ (synth/real) images, by first estimating the pseudo-pose from the pseudo-keypoints, and then overlaying the synthetic object, rendered with the pseudo-pose, onto the real image. The network training then continues on the synth/real images, invoking an RKHS network module with a trainable linear product kernel, which minimizes the Maximum Mean Discrepancy (MMD) loss. At the front end, a proposed keypoint radial voting network learns to cast votes to estimate keypoints from the backend-generated radial maps. The final pose is then determined using ePnP [lepetit2009ep] based on the estimated and corresponding object keypoints.

The main contributions of this work are:

• A novel learnable RKHS-based Adapter backend network architecture to minimize the sim2real domain gap in 6DoF PE;

• A novel CNN-based frontend network for keypoint radial voting;

• A self-supervised keypoint-based 6DoF PE method, RKHSPose, which is shown to have state-of-the-art (SOTA) performance, based on our experiments and several ablation studies.

2 Related Work

2.1 6DoF PE

ML-based 6DoF PE methods [posecnn, oberweger2018making, pvnet] all train a network to regress quantities, such as keypoints and camera viewpoints, as have been used in classical algorithms [hinterstoisser2012model]. ML-based methods, which initially became popular for the general object detection task, have started to dominate the 6DoF PE literature due to their accuracy and efficiency.

There are two main categories of ML-based fully supervised methods: feature matching-based [posecnn, oberweger2018making, su2022zebrapose, haugaard2022surfemb, hai2023rigidity], and keypoint-based methods [pvnet, pvn3d, he2021ffb6d, yang2023object]. Feature matching-based methods make use of the structures from general object detection networks most directly. The network encodes and matches features and estimates pose by either regressing elements of the pose (e.g. the transformation matrix [tan2023smoc], rotational angles and translational offsets [posecnn] or 3D vertices [su2022zebrapose]) directly, or by regressing some intermediate feature-matching representations, such as viewpoints or segments.

In contrast, keypoint-based methods encode features to estimate keypoints which are predefined within the reference frame of an object’s CAD model. These methods then use (modified) classical algorithms such as PnP [p3p, lepetit2009ep], Horn’s method [horn1988closed], and ICP [besl1992icp] to estimate the final pose from corresponding image and model keypoints. Unlike feature-matching methods, keypoint-based methods are typically more accurate due to redundancies encountered through voting schemes [pvnet, wu2022vote, pvn3d] and by generating confidence hypotheses of keypoints [he2021ffb6d, yang2023object].

Recently, self-supervised 6DoF PE methods have been explored in order to reduce the reliance on labeled real data, which is expensive to acquire. As summarized in Table LABEL:tab:self_supervised_6DoF_PE, these methods commonly use real images without GT labels. Some methods use pure synthetic data and CAD models only, with the exception of LatentFusion [park2020latentfusion] which trained the model using only synthetic data. The majority of these methods [he2022fs6d, shugurov2022osop, sundermeyer2020multi, su2015render, wang2020self6d, deng2020self, wang2021occlusion, chen2023texpose, tan2023smoc, sock2020introducing, lin2022category] are inspired by fully supervised feature-matching methods, except DPODv2 [shugurov2021dpodv2] and DSC [xiao2019pose], in which keypoint correspondences are rendered and matched.

A few methods [shugurov2022osop, deng2020self] fine-tuned the pose trivially by iteratively matching the template/viewpoint, whereas others [xiao2019pose, kleeberger2021investigations, park2020latentfusion, he2022fs6d] augmented the training data by adding noise [xiao2019pose, kleeberger2021investigations], rendering textures [he2022fs6d, shugurov2021dpodv2] and creating a latent space [park2020latentfusion]. Some methods [sundermeyer2018implicit, sundermeyer2020multi, Manhardt_2019_ICCV] also implemented DA techniques such as encoding a codebook [sundermeyer2020multi], Principal Component Analysis (PCA) [sundermeyer2020multi] and symmetric Bingham distributions [Manhardt_2019_ICCV]. Most methods [shugurov2022osop, su2015render, wang2020self6d, wang2021occlusion, sock2020introducing, xiao2019pose] used rendering techniques to render and match a template. There are a few methods that combined 3D reconstruction techniques, such as Neural Radiance Fields (NeRF) [mildenhall2021nerf] and Structure from Motion (SfM) [ullman1979interpretation]. TexPose [chen2023texpose] matched CAD models to segments generated by NeRF, and SMOC-Net [tan2023smoc] used SfM to create the 3D segment and matched it with the CAD model.

2.2 Kernel Methods and Deep Learning

While Deep Learning (DL) is the most common ML technique within the CV literature, kernel methods [cortes1995support, pearson1901liii, sejdinovic2013equivalence] have also been actively explored. Kernel methods are typically in RKHS space [szafraniec2000reproducing] with reproducing properties that facilitate solving non-linear problems by mapping input data into high dimensional spaces where they can be linearly separated [ghojogh2021reproducing, corcoran2020end]. Well-known early kernel methods that have been applied to CV are Support Vector Machines [cortes1995support] (SVMs) and PCA [pearson1901liii]. A recent method [sejdinovic2013equivalence] linked energy distance and MMD in RKHS, and showed the effectiveness of kernels in statistical hypothesis testing.

More recent studies compare kernel methods with DL networks [khosravi2023existence, al2021learning, bietti2019kernel, bietti2019group]. RKHS is found to perform better on classification tasks than one single block of a CNN comprising convolution, pooling and downsampling (linear) layers [khosravi2023existence]. RKHS can also help with CNN generalization by meta-regularization on image registration problems [al2021learning]. Similarly, norms (magnitude of trainable weights) defined in RKHS help with CNN regularization [bietti2019kernel, bietti2019group]. Further, discriminant information in label distributions in unsupervised DA is addressed and an RKHS-based Conditional Kernel Bures metric is proposed [luo2021conditional]. Lastly, the connection between the Neural Tangent Kernel and MMD is established, and an efficient MMD for two-sample tests is developed [cheng2021neural].

Inspired by the previous work, RKHSPose applies concepts of kernel learning to keypoint-based 6DoF PE, to provide an effective means to self-supervise a synthetically trained network on unlabeled real data.

3 Method

3.1 Network Overview

As shown in Fig. LABEL:fig:diagram, RKHSPose is made up of two networks, one main network $M_{rv}$ for keypoint regression and classification, and one Adapter network $M_{A}$ . The input of $M_{rv}$ with shape $W\!\times\!H\!\times\!4$ , is the concatenation of an RGB image $I$ and its corresponding depth map $D$ . This input can be synthetically generated with an arbitrary background ( $I_{syn}$ and $D_{syn}$ ), a synthetic mask overlaid on a real background ( $I_{syn/real}$ and $D_{syn/real}$ ), or real ( $I_{real}$ and $D_{real}$ ) images. The outputs are $n$ projected 2D keypoints $K$ , along with corresponding classification labels $C$ and confidence scores $S$ . $K$ are organized into instance sets based on $C$ and geometric constraints.

$M_{rv}$ comprises two sub-networks, regression network $M_{r}$ and voting network $M_{v}$ . Inspired by recent voting techniques [pvnet, pvn3d, wu2022vote, wu2022keypoint, zhou2023deep], $M_{r}$ estimates an intermediate voting quantity, which is a radial distance map $V_{r}$ [wu2022vote], by using a modified Fully Convolutional ResNet-18 (FCN-ResNet-18). The radial voting map $V_{r}$ , with shape $W\!\times\!H$ , stores the Euclidean distance from each object point to each keypoint in the 3D camera world reference frame. The voting network $M_{v}$ (described in Sec. 3.3) then takes $V_{r}$ as input, accumulates votes, and detects the peak to estimate $K$ , $C$ and $S$ .

The Adapter network $M_{A}$ consists of a series of CNNs which encode pairs of feature maps from $M_{rv}$ , and are trained on both synthetic overlayed and pure real data. $M_{A}$ encodes feature map pairs $(f_{syn/real},f_{real})$ into corresponding high-dimensional feature maps $(h_{syn/real},h_{real})$ . The input data, $(I_{syn/real},I_{real})$ and $(D_{syn/real},D_{real})$ , are also treated as $(f_{syn/real},f_{real})$ during the learning of $M_{A}$ . Each of these networks creates a high-dimensional latent space, essentially the Reproducing Kernel Hilbert Space (RKHS) [aronszajn1950theory] and contributes to the learning of both $M_{rv}$ and $M_{A}$ by calculating the MMD [gretton2012kernel]. While RKHS has been applied effectively to other DA tasks, to our knowledge, this is the first time that it has been applied to 6DoF PE, and the adapter network architecture is novel.

The loss function $\mathcal{L}$ is made up of five elements: Radial regression loss $\mathcal{L}_{r}$ for $V_{r}$ ; Keypoint projection loss $\mathcal{L}_{k}$ for $K$ ; Classification loss $\mathcal{L}_{c}$ for $C$ ; Confidence loss $\mathcal{L}_{s}$ for $S$ , and finally; Adapter loss $\mathcal{L}_{M_{A}}$ for the comparison of intermediate feature maps. The regression losses $\mathcal{L}_{r}$ , $\mathcal{L}_{k}$ , and $\mathcal{L}_{s}$ all use the smooth L1 metric, whereas classification loss $\mathcal{L}_{c}$ uses the cross-entropy metric $H(\cdot)$ . RKHSPose losses can then be denoted as:

\begin{align}
\mathcal{L}_{r} &= smooth_{\mathcal{L}_{1}}(V_{r},\widehat{V_{r}}) \tag{1}\\
\mathcal{L}_{k} &= smooth_{\mathcal{L}_{1}}(K,\widehat{K}) \tag{2}\\
\mathcal{L}_{c} &= H(p(C),\widehat{p(C)}) \tag{3}\\
\mathcal{L}_{s} &= smooth_{\mathcal{L}_{1}}(S,\widehat{S}) \tag{4}\\
\mathcal{L}_{M_{A}} &= MMD(\widehat{f_{syn/real}},\widehat{f_{real}}) \tag{5}\\
\mathcal{L} &= \lambda_{r}\mathcal{L}_{r}+\lambda_{k}\mathcal{L}_{k}+\lambda_{c}\mathcal{L}_{c}+\lambda_{s}\mathcal{L}_{s}+\lambda_{D}\mathcal{L}_{M_{A}} \tag{6}
\end{align}

where $\lambda_{r}$ , $\lambda_{k}$ , $\lambda_{c}$ , $\lambda_{s}$ , and $\lambda_{D}$ are weights for adjustment during training, and all non-hatted quantities are GT values. At inference, $M_{rv}$ takes the $I_{real}$ and $D_{real}$ as input, and outputs $K$ , $C$ and $S$ . The keypoints $K$ are ranked and grouped by $S$ and $C$ , and are then forwarded into the ePnP [lepetit2009ep] algorithm which estimates 6DoF pose values.
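The loss terms of Eqs. 1 to 6 can be illustrated with a minimal pure-Python sketch. It is not the training code of RKHSPose: the function names are hypothetical, the smooth L1 transition point `beta=1.0` is an assumption (the value is not stated in the text), and the default weights are simply the early-epoch values quoted in Sec. 4.2, with the adapter term switched off.

```python
import math


def smooth_l1(target, pred, beta=1.0):
    """Mean smooth-L1 loss over two equal-length sequences (beta is an assumed value)."""
    total = 0.0
    for t, p in zip(target, pred):
        d = abs(t - p)
        # Quadratic for small residuals, linear for large ones.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(target)


def cross_entropy(true_probs, pred_probs, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i) for one discrete distribution."""
    return -sum(p * math.log(q + eps) for p, q in zip(true_probs, pred_probs))


def total_loss(l_r, l_k, l_c, l_s, l_ma,
               lam_r=0.4, lam_k=0.4, lam_c=0.6, lam_s=0.6, lam_d=0.0):
    """Weighted sum of Eq. 6; l_ma is the adapter (MMD) loss of Eq. 5."""
    return (lam_r * l_r + lam_k * l_k + lam_c * l_c
            + lam_s * l_s + lam_d * l_ma)
```

For example, `smooth_l1([0.0], [0.5])` falls in the quadratic regime, whereas `smooth_l1([0.0], [2.0])` falls in the linear one.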

3.2 Convolutional RKHS Adapter

A Reproducing Kernel Hilbert Space $\mathcal{H}$ is a commonly used vector space for Domain Adaptation [pan2010domain, venkateswara2017deep]. A Hilbert space is a complete inner product space, i.e. a vector space equipped with an inner product in which every Cauchy sequence of points converges to a limit within the space. For a non-empty set of data $\mathcal{X}$ , a function $K_{\mathcal{X}}\!:\!\mathcal{X}\!\times\!\mathcal{X}\!\rightarrow\!\mathbb{R}$ is a reproducing kernel if:

\begin{cases}&k(\cdot,x)\in\mathcal{H}\;\;\;\forall\;\;\;x\in\mathcal{X}\\ &\left\langle f(\cdot),k(\cdot,x)\right\rangle=f(x)\;\;\;\forall\;\;\;x\in\mathcal{X},f\in\mathcal{H}\end{cases}

(7)

where $\left\langle a,b\right\rangle$ denotes the inner product of two vectors $a$ and $b$ , $k(\cdot,x)=K_{\mathcal{X}}$ for each $x\!\in\!\mathcal{X}$ , and $f$ is a function in $\mathcal{H}$ . The second equation in Eq. 7 is an expression of the reproducing property of $\mathcal{H}$ .
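As a small worked illustration of Eq. 7 (not taken from the original derivation), consider the linear kernel on $\mathbb{R}^{d}$ . The symbols $f_{w}$ and $f_{v}$ are introduced here only for this example; they denote the linear functionals that make up the corresponding $\mathcal{H}$ .

```latex
% Linear kernel and its RKHS of linear functionals
k(x, y) = \langle x, y \rangle, \qquad
f_{w}(\cdot) = \langle w, \cdot \rangle \in \mathcal{H}, \qquad
\langle f_{w}, f_{v} \rangle_{\mathcal{H}} = \langle w, v \rangle

% Since k(\cdot, x) = f_{x}, the reproducing property follows directly
\langle f_{w}, k(\cdot, x) \rangle_{\mathcal{H}}
  = \langle f_{w}, f_{x} \rangle_{\mathcal{H}}
  = \langle w, x \rangle
  = f_{w}(x)
```

Evaluating a function at $x$ is thus the same as taking its inner product with the kernel section $k(\cdot,x)$ , which is the property that the second line of Eq. 7 expresses.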

In order to utilize $\mathcal{H}$ for DA, we expand the kernel definition to two sets of data $\mathcal{X}$ and $\mathcal{Y}$ . The reproducing kernel can then be defined as:

K(\mathcal{X},\mathcal{Y})=\left\langle K_{\mathcal{X}},K_{\mathcal{Y}}\right\rangle_{H}

(8)

Here, $K_{\mathcal{X}}$ and $K_{\mathcal{Y}}$ are themselves inner product kernels that map $\mathcal{X}$ and $\mathcal{Y}$ respectively into their own Hilbert spaces, and $K(\mathcal{X},\mathcal{Y})$ is the joint Hilbert space kernel of $\mathcal{X}$ and $\mathcal{Y}$ . Note that $K(\mathcal{X},\mathcal{Y})$ is not the kernel commonly defined for CNNs. Rather, it is the similarity function defined in kernel methods, such as is used by SVM techniques to calculate similarity measurements. Some recent methods [misiakiewicz2022learning, mairal2014convolutional, mairal2016end, chen2020convolutional] named it a Convolutional Kernel Network (CKN) in order to distinguish it from CNNs.

Various kernels of CKN, such as the Gaussian Kernel [mairal2014convolutional], the RBF Kernel [mairal2016end] and the Inner Product Kernel [misiakiewicz2022learning], have been shown to be comparable to shallow CNNs for various tasks, especially for Domain Adaptation. The most intuitive inner product kernel is mathematically similar to a fully connected layer, where trainable weights are multiplied by the input feature map. To allow the application of RKHS methods to our CNN-based Adapter network $M_{A}$ , trainable weights are added to $K(\mathcal{X},\mathcal{Y})$ [liu2020learning]. The trainable Kernel $K_{w}(\mathcal{X},\mathcal{Y})$ can then be denoted as:

K_{w}(\mathcal{X},\mathcal{Y})=\left\langle\left\langle\mathcal{X},W_{\mathcal{X}}\right\rangle,\left\langle\mathcal{Y},W_{\mathcal{Y}}\right\rangle\right\rangle_{H}

(9)

where $W_{\mathcal{X}}$ and $W_{\mathcal{Y}}$ are trainable weights. By adding $W_{\mathcal{X}}$ and $W_{\mathcal{Y}}$ , $K_{w}(\mathcal{X},\mathcal{Y})$ still satisfies the RKHS constraints.
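A minimal sketch of a single element of the trainable kernel in Eq. 9 is given below, under the assumption that $W_{\mathcal{X}}$ and $W_{\mathcal{Y}}$ act as plain weight matrices on flattened feature vectors. The helper names are hypothetical, and the actual Adapter operates on convolutional feature maps rather than on short lists.

```python
def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))


def trainable_linear_kernel(x, y, w_x, w_y):
    """k_w(x, y) = <W_x x, W_y y>: project each input, then take the inner product."""
    return dot(matvec(w_x, x), matvec(w_y, y))
```

With identity weights the function reduces to the ordinary inner product kernel. When both inputs share the same weights $W$ , it equals $x^{\top}W^{\top}Wy$ , which is symmetric and positive semi-definite.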

The sim2real domain gap of feature maps $f$ being trained in $M_{rv}$ (with few real images and no real GT labels) is hard to measure using trivial distance metrics. In contrast, RKHS can be a more accurate and robust space for comparison, since it is known to be capable of handling high-dimensional data with a low number of samples [ghorbani2020neural, misiakiewicz2022learning]. To compare $(f_{syn/real},f_{real})$ in RKHS, a series of CNN layers encodes $(f_{syn/real},f_{real})$ into higher-dimensional features $(h_{syn/real},h_{real})$ , followed by the trainable $K_{w}(\mathcal{X},\mathcal{Y})$ . Once mapped into RKHS, $h_{syn/real}\!=\!\{sr_{i}\}_{i=1}^{m}$ and $h_{real}\!=\!\{r_{i}\}_{i=1}^{m}$ can then be compared using the Maximum Mean Discrepancy (MMD), a common DA measurement [gretton2012kernel, liu2020learning], which is the distance between the kernel embeddings of the two sets [gretton2006kernel]:

\begin{aligned}
MMD(h_{syn/real},h_{real})=\Bigg[\frac{1}{m^{2}}\Bigg(&\Big(\sum_{i=1}^{m}\sum_{j=1}^{m}k_{w}(sr_{i},sr_{j})-\sum_{i=1}^{m}k_{w}(sr_{i},sr_{i})\Big)\\
-&\Big(\sum_{i=1}^{m}\sum_{j=1}^{m}k_{w}(sr_{i},r_{j})-\sum_{i=1}^{m}k_{w}(sr_{i},r_{i})\Big)\\
+&\Big(\sum_{i=1}^{m}\sum_{j=1}^{m}k_{w}(r_{i},r_{j})-\sum_{i=1}^{m}k_{w}(r_{i},r_{i})\Big)\Bigg)\Bigg]^{0.5}
\end{aligned}

(10)

where $k_{w}()$ is the feature element of $K_{w}()$ .
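To give a feel for how an MMD value behaves, the sketch below uses the standard biased empirical estimator of Gretton et al. [gretton2012kernel]. It is deliberately swapped in for illustration and is not a literal transcription of Eq. 10: it keeps the diagonal terms and weights the cross term by two. The function names are hypothetical, and the clamp to zero before the square root is an added numerical safeguard.

```python
import math


def dot(a, b):
    """Linear (inner product) kernel on two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))


def mmd_biased(xs, ys, kernel=dot):
    """Standard biased MMD estimate between two sample sets.

    MMD^2 = mean k(x, x') - 2 * mean k(x, y) + mean k(y, y').
    """
    m, n = len(xs), len(ys)
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(k_xx - 2.0 * k_xy + k_yy, 0.0))
```

With the linear kernel this estimate equals the Euclidean distance between the two sample means, so it is zero for identical sets and grows as the feature distributions drift apart.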

In summary, the Adapter $M_{A}$ , shown in Fig. LABEL:fig:diagram, measures MMD for each $(h_{syn/real},h_{real})$ in RKHS by increasing the $f_{syn/real}$ and $f_{real}$ dimension using a CNN and thereby constructing a learnable kernel $K_{w}$ . The outputs are the feature maps $h_{syn/real}$ and $h_{real}$ which are supervised by loss $\mathcal{L}_{M_{A}}$ during the real data training epochs. Based on the experiments in Sec. 5.2, our trainable inner product kernel is shown to be more accurate for our task than the other known kernels [mairal2014convolutional, mairal2016end, misiakiewicz2022learning] used in such kernel methods that we tested.

3.3 Keypoint Radial Voting Network

The network $M_{v}$ votes for keypoints using a CNN architecture, taking the radial voting quantity $V_{r}$ resulting from $M_{r}$ as input. VoteNet [qi2019deep] previously used a CNN approach to vote for object centers, whereas other keypoint-based techniques have implemented GPU-based parallel RANSAC [pvn3d, pvnet] methods for offset and vector quantities. The radial quantity is known to be more accurate but less efficient than the vector or offset quantities, and has been previously implemented with a CPU-based parallel accumulator space method [wu2022vote, wu2022keypoint].

Given its superior accuracy, $M_{v}$ implements radial voting using a CNN to improve efficiency. Given a 2D radial map $\widehat{V_{r}}$ estimated by $M_{r}$ and supervised by GT radial maps $V_{r}$ , the task is to accumulate votes, find the peak, and estimate the keypoint location. The $V_{r}$ foreground pixels (which lie on the target object) store the Euclidean distance from these pixels to each of the keypoints, with background (non-object) pixels set to value $-1$ .
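The structure of a GT radial map for one keypoint can be sketched as follows. This is an illustration only: it assumes that a 3D point in the camera frame is already available for every pixel (for example by back-projecting the depth map, which is not shown), and `radial_map` is a hypothetical helper name.

```python
import math


def radial_map(points, mask, keypoint):
    """Build a radial map for a single keypoint.

    points:   H x W grid of (x, y, z) tuples in the camera frame
    mask:     H x W grid of booleans, True for object (foreground) pixels
    keypoint: (x, y, z) keypoint position in the same frame
    Foreground pixels store the Euclidean distance to the keypoint and
    background pixels are set to -1.
    """
    return [
        [math.dist(p, keypoint) if fg else -1.0 for p, fg in zip(p_row, m_row)]
        for p_row, m_row in zip(points, mask)
    ]
```

One such map is produced per keypoint, so a set of four keypoints gives four radial maps for each image.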

The estimated radial map $\widehat{V_{r}}$ is in effect an inverse heat map of the candidate keypoints’ locations, distributed in a radial pattern centered at the keypoints. To forward $\widehat{V_{r}}$ into a CNN voting module, it is inversely normalized so that it becomes a heat map. Suppose that $v^{max}_{r}$ and $v^{min}_{r}$ are the maximum and minimum global radial distances for all objects in a dataset, which can be calculated by iterating through all GT radial maps, or alternatively generated from the object CAD models. An inverse radial map $\widehat{V^{-1}_{r}}$ can then be denoted as:

\widehat{V^{-1}_{r}}=(v^{max}_{r}\!-\!\widehat{V_{r}})/(v^{max}_{r}\!-\!v^{min}_{r})

(11)

Thus, $M_{v}$ takes $\widehat{V^{-1}_{r}}$ as the input and generates the accumulated vote map by a series of convolution, ReLU, and batch normalization layers. The complete network architecture is provided in the Supplementary material. The background pixels are filtered out by a ReLU layer, and only foreground pixels contribute to voting. The accumulated vote map is then max-pooled (peak extraction) and reshaped using a fully connected layer into an $n\!\times\!4$ output. The output represents $n$ keypoints and comprises $n\!\times\!2$ projected 2D keypoints $K$ , $n$ classification labels $C$ , and $n$ confidence scores $S$ . The labels $C$ indicate which object the corresponding keypoint belongs to, and $S$ ranks the confidence level of keypoints before being forwarded into ePnP [lepetit2009ep] for pose estimation. $M_{v}$ is supervised by $\mathcal{L}_{k}$ , $\mathcal{L}_{c}$ , and $\mathcal{L}_{s}$ , and is trained end-to-end with $M_{r}$ .
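The inverse normalization of Eq. 11 is simple enough to sketch directly. The helper name is hypothetical, the map is represented as nested lists rather than a tensor, and the sketch applies Eq. 11 elementwise to the values it is given without any special treatment of background pixels.

```python
def inverse_radial_map(v_hat, v_max, v_min):
    """Apply Eq. 11 elementwise: (v_max - v) / (v_max - v_min).

    Small radial distances (pixels close to the keypoint) map to values
    near 1, and the largest distances map to values near 0.
    """
    scale = v_max - v_min
    return [[(v_max - v) / scale for v in row] for row in v_hat]
```

For instance, with $v^{max}_{r}=10$ and $v^{min}_{r}=0$ , radial distances of 0, 5 and 10 become 1, 0.5 and 0, so pixels nearest to the keypoint carry the strongest response.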

4 Experiments

4.1 Datasets and Evaluation Metrics

RKHSPose uses BOP Procedural Blender [Denninger2023] (PBR) synthetic images [hodan2018bop, hodavn2020bop, sundermeyer2023bop, denninger2020blenderproc] for the synthetic training phase. The images are generated by dropping synthetic objects (CAD models) onto a plane in a simulated environment using PyBullet [coumans2021], and then rendering them with synthetic textures. All objects in the synthetic images are thus automatically labeled with precise GT poses.

We evaluated RKHSPose on the six BOP [hodavn2020bop, sundermeyer2023bop] core datasets (LM (LMO) [hinterstoisser2012model], YCB-V (YCB) [posecnn], TLESS [hodan2017tless], TUDL [hodan2018bop], ITODD [drost2017introducing], and HB [kaskman2019homebreweddb]), all except IC-BIN [doumanoglou2016recovering], which does not include any real training and validation images and is therefore not applicable. ITODD and HB have no real images in the training set, and so for training we instead used the real images in their validation sets, which were disjoint from their test sets.

Our main results are evaluated with the ADD(S) [hinterstoisser2012model] metric for the LM and LMO datasets, and the ADD(S) AUC [posecnn] metric for the YCB dataset. These are the standard metrics commonly used to compare self-supervised 6DoF PE methods. ADD(S) is based on the mean distance (minimum distance for symmetric objects) between the object surfaces for GT and estimated poses, whereas ADD(S) AUC plots a curve formed by ADD(S) for various object diameter thresholds. We use the BOP average recall ( $AR$ ) metrics for our ablation studies. The $AR$ metric, based on the original ADD(S) [hinterstoisser2012model], evaluates three aspects, including Visible Surface Discrepancy ( $AR_{VSD}$ ), Maximum Symmetry-Aware Surface Distance ( $AR_{MSSD}$ ), and Maximum Symmetry-Aware Projection Distance ( $AR_{MSPD}$ ).
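The ADD and ADD-S distances described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used for the reported results: the function names are hypothetical, poses are given as a 3x3 rotation matrix and a translation vector, and the 10% of object diameter acceptance threshold in `is_correct` is the value commonly used with this metric, which is assumed here rather than quoted from the text.

```python
import math


def transform(rot, trans, point):
    """Apply a rigid transform R * p + t to one 3D point."""
    return tuple(
        sum(rot[i][j] * point[j] for j in range(3)) + trans[i] for i in range(3)
    )


def add_metric(points, rot_gt, t_gt, rot_est, t_est):
    """ADD: mean distance between corresponding model points under both poses."""
    gt = [transform(rot_gt, t_gt, p) for p in points]
    est = [transform(rot_est, t_est, p) for p in points]
    return sum(math.dist(a, b) for a, b in zip(gt, est)) / len(points)


def add_s_metric(points, rot_gt, t_gt, rot_est, t_est):
    """ADD-S: mean distance to the closest estimated point (symmetric objects)."""
    gt = [transform(rot_gt, t_gt, p) for p in points]
    est = [transform(rot_est, t_est, p) for p in points]
    return sum(min(math.dist(a, b) for b in est) for a in gt) / len(points)


def is_correct(error, diameter, fraction=0.1):
    """A pose counts as correct if the error is below a fraction of the diameter."""
    return error < fraction * diameter
```

For an object that is symmetric under a half turn, an estimate that is off by exactly that half turn has a large ADD but a zero ADD-S, which is why the symmetric variant is used for such objects.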

4.2 Implementation Details

RKHSPose is trained on a server with an Intel Xeon 5218 CPU and 2 RTX6000 GPUs with a batch size of 32. The Adam optimizer is used for the training of $M_{rv}$ , on both synthetic and real data, and $M_{A}$ is optimized by SGD. Both of the optimizers have an initial learning rate of $lr\!=\!1e-3$ and weight decay $1e-4$ for $80$ and $20$ epochs respectively.

The input of the network is normalized before training, as follows. The images $I$ are normalized and standardized using ImageNet [Imagenet] specifications. The depth maps $D$ are each individually normalized by their local minima and maxima, to lie within a range of 0 to 1. The radial distances in radial map $V_{r}$ and 2D projected keypoints are both normalized by the width and height of $I$ .
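The three normalization steps can be sketched in a few lines. The ImageNet channel means and standard deviations are the widely used values for inputs scaled to $[0,1]$ ; the function names, the nested-list image representation, and the handling of a constant depth map are assumptions made for this illustration.

```python
# Widely used ImageNet per-channel statistics for RGB values scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_rgb_pixel(rgb):
    """Scale an 8-bit (r, g, b) pixel to [0, 1], then standardize per channel."""
    return tuple(
        (c / 255.0 - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


def normalize_depth(depth):
    """Min-max normalize one depth map (list of rows) to the range [0, 1]."""
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # Assumed behaviour for a constant map: return all zeros.
        return [[0.0 for _ in row] for row in depth]
    return [[(d - lo) / span for d in row] for row in depth]


def normalize_keypoint(u, v, width, height):
    """Express a projected 2D keypoint as fractions of the image size."""
    return u / width, v / height
```

Because each depth map is normalized by its own minimum and maximum, the network sees relative rather than absolute depth within an image.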

A single set of four keypoints is chosen by KeyGNet [wu2023learning] for the set of all objects in each dataset. One extra “background” class is added to $C$ in order to filter out the redundant background points in $K$ . $M_{rv}$ is first trained for $120$ epochs on synthetic data, during which $M_{A}$ remains frozen. Following this, training proceeds for an additional $80$ epochs which alternate between real and synthetic data. When training on real data, both $M_{A}$ and $M_{rv}$ weights are learned, whereas $M_{A}$ is frozen for the alternating synthetic data training.

During real data epochs, $M_{rv}$ initially estimates pseudo-keypoints for each real image. These pseudo-keypoints are then forwarded into ePnP for pseudo-pose estimation. Each estimated pseudo-pose is then augmented into a set of poses $P_{aug}$ , by applying arbitrary rotational and translational perturbations. Rotational perturbation $\Delta_{R}$ and translational perturbation $\Delta_{T}$ have respective ranges of $[-\frac{\pi}{18},\frac{\pi}{18}]$ radians and $[-0.1,0.1]$ along three axes within the normalized model frame, which is defined using the largest object in the dataset. The set of hybrid synthetic/real images are rendered by overlaying onto the real image the CAD model of each object using each augmented pose value in $P_{aug}$ . The cardinality of $P_{aug}$ is set to be one less than the batch size, and the adapter $M_{A}$ is trained on a mini-batch of the hybrid images resulting from $P_{aug}$ , plus the image resulting from the original estimated pseudo-pose.
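The construction of $P_{aug}$ can be sketched as sampling one rotational and one translational offset per augmented pose. The sketch assumes uniform sampling and per-axis rotation angles, neither of which is specified above, and the function name is hypothetical; only the perturbation ranges and the set size (batch size minus one) come from the text.

```python
import math
import random

ROT_RANGE = math.pi / 18   # rotational perturbation bound in radians
TRANS_RANGE = 0.1          # translational bound in the normalized model frame


def sample_perturbations(batch_size=32, seed=0):
    """Sample (delta_R, delta_T) offsets for the augmented pose set.

    Returns batch_size - 1 pairs so that, together with the original
    pseudo-pose, one mini-batch of hybrid images can be rendered.
    """
    rng = random.Random(seed)
    perturbations = []
    for _ in range(batch_size - 1):
        d_rot = tuple(rng.uniform(-ROT_RANGE, ROT_RANGE) for _ in range(3))
        d_trans = tuple(rng.uniform(-TRANS_RANGE, TRANS_RANGE) for _ in range(3))
        perturbations.append((d_rot, d_trans))
    return perturbations
```

With the batch size of 32 used in training, this yields 31 perturbed poses for each estimated pseudo-pose.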

Initially $M_{A}$ is frozen, and for the first $80$ epochs, the loss is set to emphasize the classification and the visibility score regression, i.e. $\lambda_{c}\!=\!\lambda_{s}\!=\!0.6$ and $\lambda_{r}\!=\!\lambda_{k}\!=\!0.4$ . Following this, up to epoch $200$ , the scales of losses are then exchanged to fine-tune the localization of the keypoints, i.e. $\lambda_{c}\!=\!\lambda_{s}\!=\!0.4$ and $\lambda_{r}\!=\!\lambda_{k}\!=\!0.6$ . After epoch $120$ , $M_{A}$ is unfrozen each alternating epoch, and $\lambda_{D}$ is set to 1 during the remaining $M_{A}$ training epochs. This training strategy, shown in Fig. LABEL:fig:diagram, minimizes the synthetic-to-real gap without any real image GT labels, and using very few (320) real images.
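The schedule described above can be summarized as a small lookup function. Several details are assumptions made for this sketch: epochs are counted from zero, the first alternating epoch after the synthetic phase is taken to be a real-data epoch, and $\lambda_{D}$ is set to zero whenever $M_{A}$ is frozen. The function name and the dictionary layout are hypothetical.

```python
def schedule(epoch):
    """Return loss weights, data source and adapter state for a 0-indexed epoch."""
    if epoch < 80:
        # Emphasize classification and score regression early on.
        weights = {"r": 0.4, "k": 0.4, "c": 0.6, "s": 0.6}
    else:
        # Swap the emphasis to fine-tune keypoint localization.
        weights = {"r": 0.6, "k": 0.6, "c": 0.4, "s": 0.4}
    # After 120 synthetic epochs, real and synthetic epochs alternate.
    real_epoch = epoch >= 120 and (epoch - 120) % 2 == 0
    weights["D"] = 1.0 if real_epoch else 0.0
    return {
        "weights": weights,
        "data": "real" if real_epoch else "synthetic",
        "adapter_frozen": not real_epoch,
    }
```

Under these assumptions, 40 of the 200 epochs are real-data epochs in which both $M_{A}$ and $M_{rv}$ are updated.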

4.3 Results

The results are summarized in Tables LABEL:tab:mainresults and LABEL:tab:bop_main. To the best of our knowledge, RKHSPose outperforms all existing self-supervised 6DoF PE methods. In Table LABEL:tab:mainresults, on LM and LMO, our ADD(S) is $+4.2\%$ and $+2\%$ better than the second best method TexPose [chen2023texpose]. We compared our performance to Self6D++ [wang2021occlusion], which is the only other method that evaluated using YCB, and achieved a $3\%$ improvement after ICP on ADD(S) AUC. Last but not least, our performance evaluated on the other four BOP core datasets is comparable to several SOTA methods from the BOP leaderboard, which are fully supervised with real labels (Table LABEL:tab:bop_main). Some test scenes with RKHSPose results are shown in Fig. LABEL:fig:results.

RKHSPose runs at 34 fps on an Intel i7 2.5GHz CPU and an RTX 3090 GPU with 24 GB of VRAM. It takes on average $8.7$ ms for loading data, $4.5$ ms for forward inference through $M_{rv}$ ( $10\times$ faster than the analytical radial voting in RCVPose), and $16.2$ ms for ePnP.
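The reported frame rate is consistent with the sum of the three per-frame stages, as the following short calculation shows. The symbol $t_{frame}$ is introduced here only for this check.

```latex
t_{frame} = 8.7 + 4.5 + 16.2 = 29.4\ \text{ms},
\qquad
\frac{1000\ \text{ms/s}}{29.4\ \text{ms}} \approx 34.0\ \text{fps}
```

Pose estimation by ePnP accounts for more than half of the per-frame time, while the forward pass through $M_{rv}$ is the smallest of the three components.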

5 Ablation Studies

5.1 Dense Vs. Sparse Adapter

The Adapter $M_{A}$ densely matches the intermediate feature maps, whereas the majority of other methods [tan2023smoc, deng2020self, wang2020self6d] only compare the final output. In order to show the benefits of dense comparison, we conduct an experiment with different variations of $M_{A}$ . A sparse matching $M^{s}_{A}$ network is trained on synthetic and real data comparing only a single feature map, which is the intermediate radial map. $M^{s}_{A}$ has the exact same overall learning capacity (number of parameters) as the dense matching $M_{A}$ described in Sec. 3.2. The results in Table LABEL:tab:discriminator_desntiy show that $M_{A}$ surpasses $M^{s}_{A}$ on all six datasets tested. Specifically, on ITODD, $M_{A}$ is $12.1\%$ more accurate than $M^{s}_{A}$ . This experiment shows the effectiveness of our densely matched $M_{A}$ .

5.2 Adapter Kernels and Metrics

We use a linear (dot product) kernel and MMD in RKHS to measure the domain gap. Various other kernels and similarity measures can be implemented in RKHS, as described in Sec. 3.2. First, we add trainable weights to the radial basis function (RBF) kernel in a similar manner to $K_{w}$ defined in Eq. 9. The trainable RBF kernel on two sets of data $X$ and $Y$ is denoted as:

K_{rbf}(X,Y)=\exp\left(-w\left\|X-Y\right\|^{2}\right)

(12)

where $w$ are the trainable weights, which replace the original adjustable parameter in the classical RBF kernel. For comparison, we also experiment with the classical RKHS kernel functions without trainable weights, including the inner product kernel and the original RBF kernel [vert2004primer]. Further, we experiment with other commonly used distance measures, including Kullback-Leibler Divergence (KL Div, i.e. relative entropy) and Wasserstein (Wass) Distance.
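A minimal sketch of the kernel in Eq. 12 is given below, assuming a single non-negative scalar weight $w$ and evaluating the kernel pairwise over the two sets to form a Gram matrix; both choices are simplifying assumptions, since the exact shape of the trainable weights is not restated here. The function names are hypothetical. The derivative with respect to $w$ is included only to show why $w$ can be learned by gradient descent.

```python
import math
from typing import List, Sequence


def squared_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance ||x - y||^2 between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def rbf_kernel(x: Sequence[float], y: Sequence[float], w: float) -> float:
    """K_rbf(x, y) = exp(-w * ||x - y||^2), with w treated as a scalar."""
    return math.exp(-w * squared_distance(x, y))


def rbf_kernel_grad_w(x: Sequence[float], y: Sequence[float], w: float) -> float:
    """dK/dw = -||x - y||^2 * K_rbf(x, y); this is what makes w trainable."""
    return -squared_distance(x, y) * rbf_kernel(x, y, w)


def rbf_gram(X: Sequence[Sequence[float]],
             Y: Sequence[Sequence[float]],
             w: float) -> List[List[float]]:
    """Pairwise kernel values between every sample in X and every sample in Y."""
    return [[rbf_kernel(x, y, w) for y in Y] for x in X]
```

Fixing $w$ to a constant recovers the classical RBF kernel used as the non-trainable baseline.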

In Table LABEL:tab:kernels, the RBF kernel performs similarly to the linear (dot product) kernel, with a slight performance dip. In Table LABEL:tab:metrics, MMD minimizes the domain gap better than the Wass Distance, followed by the KL Div metric, leading to a better overall performance on $AR$ . Based on these results, we use the linear (dot product) kernel with trainable weights, and choose MMD as the main loss metric in our final structure.
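To make the chosen metric concrete, the sketch below computes the biased empirical squared MMD between two sets of feature vectors under a plain linear kernel. The trainable weights of $K_{w}$ are deliberately omitted, since their form is not restated in this section, so this is the unweighted special case and not the exact loss used for training. The function names are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def linear_kernel(x: Vector, y: Vector) -> float:
    """Plain dot-product kernel k(x, y) = <x, y> (no trainable weights)."""
    return sum(a * b for a, b in zip(x, y))


def mean_kernel(A: Sequence[Vector], B: Sequence[Vector]) -> float:
    """Average kernel value over all pairs (a, b) with a in A and b in B."""
    total = sum(linear_kernel(a, b) for a in A for b in B)
    return total / (len(A) * len(B))


def mmd2_linear(X: Sequence[Vector], Y: Sequence[Vector]) -> float:
    """Biased estimate: E[k(x, x')] - 2 E[k(x, y)] + E[k(y, y')]."""
    return mean_kernel(X, X) - 2.0 * mean_kernel(X, Y) + mean_kernel(Y, Y)


# Toy stand-ins for synthetic and real feature vectors.
synthetic = [[0.0, 0.0], [2.0, 0.0]]
real = [[1.0, 1.0], [1.0, 3.0]]
gap = mmd2_linear(synthetic, real)
```

With a linear kernel this quantity reduces to the squared distance between the two sample means, here $\|(1,0)-(1,2)\|^{2}=4$, which is why a richer (for example, weighted) kernel is needed to compare more than first-order statistics.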

5.3 Syn/Real Synchronized Training

When training RKHSPose, real epochs are alternated with synthetic epochs. In contrast, some other methods [chen2023texpose, wang2020self6d, wang2021occlusion, tan2023smoc] separate the synthetic and real training into sequential stages. We conducted an experiment to compare these two training strategies, the results of which are shown in Table LABEL:tab:train_strategy. The alternating training performs slightly better ( $+2.5\%$ on average) than the sequential training, possibly because early access to real scenes helps to avoid local minima.
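The difference between the two strategies comes down to the order in which the domains are visited. The sketch below generates both orderings, assuming a strict one-to-one alternation that starts with a synthetic epoch; the starting domain and the ratio are assumptions, and the function names are hypothetical.

```python
from typing import List


def alternating_schedule(num_epochs: int) -> List[str]:
    """Interleave domains epoch by epoch (assumed to start with synthetic)."""
    return ["synthetic" if e % 2 == 0 else "real" for e in range(num_epochs)]


def sequential_schedule(num_synthetic: int, num_real: int) -> List[str]:
    """Run all synthetic epochs first, then all real epochs."""
    return ["synthetic"] * num_synthetic + ["real"] * num_real


def first_real_epoch(schedule: List[str]) -> int:
    """Index of the first epoch in which real images are seen."""
    return schedule.index("real")


alt = alternating_schedule(6)
seq = sequential_schedule(3, 3)
```

Both schedules contain the same number of epochs per domain, but `first_real_epoch(alt)` is 1 whereas `first_real_epoch(seq)` is 3, which illustrates the earlier exposure to real scenes under alternation.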

5.4 Number of Real Images and Real Labels

The objective of RKHSPose is to reduce real data usage and train without any real GT labels. To show the effectiveness of the approach, we conducted an experiment by training on different numbers of real images, the results of which are shown in Fig. LABEL:fig:noofreal. We used up to 640 real images in all cases, except for ITODD, which contains only 357 real images. The $AR$ of all datasets except YCB saturates at 160 images. For YCB, the further improvement beyond 160 images is only $+0.1\%$ , and it saturates after 320 real images. We nevertheless conservatively use $320$ real unlabeled images for our main results. This experiment also showed that adding an equal number of real labeled images beyond 320 did little to improve performance.

6 Conclusion

In summary, we propose RKHSPose, a novel self-supervised keypoint radial voting-based 6DoF PE method using RGB-D data. RKHSPose fine-tunes a pose estimation network pre-trained on synthetic data by densely matching features with a learnable kernel in RKHS, using real data but without any real GT poses. By applying this DA technique in feature space, RKHSPose achieved SOTA performance on six BOP core datasets, surpassing the performance of all other self-supervised methods. Notably, the performance of RKHSPose closely approaches that of several fully supervised methods, which indicates the strength of the approach at reducing the sim2real domain gap for this problem.

Acknowledgements:

Thanks to Bluewrist Inc. and NSERC for their support of this work.