Object-Pose Estimation With Neural Population Codes
Abstract
Robotic assembly tasks require object-pose estimation, particularly for tasks that avoid costly mechanical constraints. Object symmetry complicates the direct mapping of sensory input to object rotation, as the rotation becomes ambiguous and lacks a unique training target. Some proposed solutions involve evaluating multiple pose hypotheses against the input or predicting a probability distribution, but these approaches suffer from significant computational overhead. Here, we show that representing object rotation with a neural population code overcomes these limitations, enabling a direct mapping to rotation and end-to-end learning. As a result, population codes facilitate fast and accurate pose estimation. On the T-LESS dataset, we achieve inference in 3.2 milliseconds on an Apple M1 CPU and a Maximum Symmetry-Aware Surface Distance accuracy of 84.7% using only gray-scale image input, compared to 69.7% accuracy when directly mapping to pose.
I INTRODUCTION
Robotic manipulation requires object pose estimation; particularly, for assembly, we need accurate object orientation for part alignment. Traditionally, part alignment has been solved using mechanical constraints, where the orientation of a part is physically restricted at the pick-up location or where the gripper is engineered in a way to force a certain part orientation. These mechanical constraints, however, increase setup time and costs. Moreover, they cannot prevent failure, e.g., due to part jamming.
Therefore, optical sensing of object pose is preferred. Pose consists of both object orientation and location. Here, orientation presents the primary challenge, as in many manufacturing settings, the mathematical representation of object rotation is not unique. For symmetric parts, multiple rotation matrices or quaternions can represent the same pose. This ambiguity creates a problem for learning algorithms, such as those mapping images to poses, since a neural feedforward network requires a unique target for each input and fails when encountering multiple possible outputs [1].
As a solution, some works proposed to learn pose candidates, which are then down-selected based on the sensory data [2, 3]. Alternatively, the data can be mapped onto a feature space, e.g., with an autoencoder [4], which is then compared against the feature-space representations of a dense set of possible poses. But these indirect mappings also have disadvantages: they are slower, the algorithms are more complex, and they are less flexible in terms of mapping auxiliary information onto pose.
Here, we propose a representation for object orientation that allows a direct mapping onto pose and end-to-end learning. We take inspiration from information encoding in the mammalian cortex. There, many variables, such as object orientation [5], have been found to be encoded by a population of neurons, where each neuron responds maximally to a preferred value, and the activation of the group of neurons resembles a probability distribution of the encoded variable [6]. The population code can represent object symmetry and the corresponding pose ambiguity as multiple activation peaks in the code.
For each input image, we predict the entire population code of the object rotation. We tested our population code on the T-LESS dataset and compared against alternative pose representations using the same network architecture and training data. As a result, we observe significantly improved rotation accuracies with fast inference (3.2 msec on an Apple M1 CPU).
II RELATED WORK
Related to our work are methods that predict a probability distribution of pose or predict multiple possible pose outcomes.
II-A Probability Over Rotations
Pose ambiguity can be dealt with by estimating a probability distribution over the group of rotations in 3D [7, 8]. The distribution may be represented parametrically, e.g., with a Gaussian mixture model [7], or non-parametrically [8]. The former has the disadvantage of being hard to train [7], e.g., due to local minima [1].
In the non-parametric method by Murphy et al., the distribution is predicted implicitly, i.e., a neural network computes an unnormalized probability for any rotation matrix and input image [8]. The disadvantage of this strategy is that this probability has to be normalized across a large sample of rotation matrices, which all have to be fed through the network, resulting in computationally expensive and slow training.
Given a normalized probability $p(R \mid x)$ for image $x$ and a ground-truth rotation matrix $R_{gt}$, the loss function is defined as follows [8]:

$$\mathcal{L} = -\log p(R_{gt} \mid x). \tag{1}$$
For inference, the rotation with maximum probability is chosen from a set $\mathcal{R}$ of centers of an equivolumetric partitioning of SO(3) [8],

$$\hat{R} = \operatorname*{argmax}_{R_i \in \mathcal{R}} \, p(R_i \mid x). \tag{2}$$
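For concreteness, the following minimal sketch illustrates this training and inference logic; the function names are ours, and the softmax over the grid stands in for the normalization over SO(3), ignoring the volume factor used in [8].

```python
import torch

def nll_over_grid(scores: torch.Tensor, gt_index: int) -> torch.Tensor:
    """Negative log-likelihood of the ground-truth rotation, cf. eq. (1).

    `scores` are unnormalized network outputs for a grid of rotations that
    includes the ground truth at position `gt_index`; the softmax over the
    grid approximates the normalization over SO(3).
    """
    log_p = torch.log_softmax(scores, dim=-1)
    return -log_p[gt_index]

def predict_rotation(scores: torch.Tensor, grid_rotations: torch.Tensor) -> torch.Tensor:
    """Pick the grid rotation with the highest probability, cf. eq. (2)."""
    return grid_rotations[torch.argmax(scores)]

# toy usage with 500 placeholder grid rotations
scores = torch.randn(500)
grid_rotations = torch.eye(3).repeat(500, 1, 1)
loss = nll_over_grid(scores, gt_index=42)
R_hat = predict_rotation(scores, grid_rotations)
```

Note that every grid rotation has to be scored by the network, which is the source of the computational overhead discussed above.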
II-B Multiple Pose Hypotheses
Manhardt et al. train a network to predict multiple pose hypotheses to learn alternative (ambiguous) poses [9]. The network output is a fixed number $M$ of possible rotations. For training, the loss function combines a minimum loss $\mathcal{L}_{\min}$, the minimum mean squared error (MSE) between the ground truth and the $M$ outcomes, with an average loss $\mathcal{L}_{\mathrm{avg}}$, which is the average MSE across the $M$ outcomes,

$$\mathcal{L} = (1 - \epsilon)\,\mathcal{L}_{\min} + \epsilon\,\mathcal{L}_{\mathrm{avg}}, \tag{3}$$

where the parameter $\epsilon$ is gradually reduced from 0.05 to 0.01.
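A minimal sketch of this loss as we read eq. (3) follows; the exact weighting scheme in [9] may differ slightly, and the helper name is ours.

```python
import torch

def mhp_loss(pred: torch.Tensor, target: torch.Tensor, eps: float) -> torch.Tensor:
    """Relaxed winner-takes-all loss in the spirit of eq. (3).

    pred:   (M, 4) predicted quaternion hypotheses
    target: (4,)   ground-truth quaternion
    eps:    weight of the average term, annealed from 0.05 to 0.01
    """
    per_hyp = ((pred - target) ** 2).mean(dim=1)   # MSE of each hypothesis
    l_min = per_hyp.min()                          # best hypothesis
    l_avg = per_hyp.mean()                         # all hypotheses
    return (1.0 - eps) * l_min + eps * l_avg

# toy usage with M = 8 hypotheses
pred = torch.randn(8, 4)
target = torch.tensor([1.0, 0.0, 0.0, 0.0])
loss = mhp_loss(pred, target, eps=0.05)
```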
Inference is more complicated since we do not know a probability for each of the $M$ outcomes. Instead, the outcomes are clustered, and the median of each cluster is computed. The results can be compared against the input image to find the best fit [9]. This strategy, however, again involves an extra selection step, as mentioned above.
III METHODOLOGY
The key element of our new method is to represent object orientation with a neural population code. This code can be the last layer in any neural network architecture. In the following, we describe the population code and the specific architecture that we used in our real-world experiments.
III-A Orientation Code
To encode the orientation with a neural population code, we first transform the rotation matrix into a rotation axis and a rotation angle. For the axis, the preferred values of each neuron are arranged on a sphere. To obtain a near uniform distribution of axes, we compute a Fibonacci lattice on a sphere [10]. For the rotation angle, the preferred values are arranged in a circle. To combine axis with angle, our population code has for each point on the Fibonacci sphere a circle of neurons for the rotation angle. So, the space to represent the orientation is a direct product of a sphere and a circle.
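For illustration, a minimal sketch of these preferred-value grids follows; the neuron counts here are illustrative placeholders, and the actual numbers are chosen as described in Section III-B.

```python
import numpy as np

def fibonacci_sphere(n: int) -> np.ndarray:
    """Near-uniform points on the unit sphere via a Fibonacci lattice [10]."""
    i = np.arange(n)
    z = 1.0 - 2.0 * (i + 0.5) / n               # uniform spacing in z
    theta = np.pi * (3.0 - np.sqrt(5.0)) * i    # golden-angle increments
    r = np.sqrt(1.0 - z ** 2)
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

# illustrative sizes, not the values used in the paper
axes = fibonacci_sphere(512)                                  # preferred rotation axes
angles = np.linspace(0.0, 2.0 * np.pi, 36, endpoint=False)    # preferred rotation angles
```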
Given a rotation axis and angle, we activate all neurons using Gaussian tuning curves,

$$a_{ij} = \exp\!\left(-\,\frac{\Delta\alpha_i^2 + \Delta\phi_j^2}{2\sigma^2}\right), \tag{4}$$

where $\Delta\alpha_i$ is the angle between the encoded axis and the preferred axis on the sphere for neuron $i$, $\Delta\phi_j$ the angle between the encoded rotation angle and the preferred angle of neuron $j$, and $\sigma$ is the tuning width.
To compute the target population code from a ground-truth rotation matrix, $R$, we compute activations for all symmetry transformations of $R$,

$$t_{ij} = \sum_{S \in \mathcal{S}} a_{ij}(R\,S), \tag{5}$$

where $\mathcal{S}$ is the set of symmetry transformations for a given object (in a manufacturing setting, the symmetry of an object is known a priori). That is, the target population code is the sum of the activations over all symmetry transformations (Fig. 1).
The transformation from rotation matrix to axis/angle, $R \to (\vec{n}, \theta)$, is not unique since $-\vec{n}$ and $-\theta$ form an equivalent axis/angle combination. Therefore, for each symmetry transformation, we also compute the tuning-curve activations for $-\vec{n}$ and $-\theta$ and add those to the target activations (resulting in four peaks in Fig. 1).
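The following sketch, building on the lattice sketch above, illustrates our reading of eqs. (4) and (5), including the sign-flipped axis/angle duplicates; the helper names, the wrapped-angle handling, and the tuning-width value are our own choices and not taken from the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def axis_angle(R: np.ndarray):
    """Rotation matrix -> (unit axis, angle)."""
    rotvec = Rotation.from_matrix(R).as_rotvec()
    angle = float(np.linalg.norm(rotvec))
    axis = rotvec / angle if angle > 1e-8 else np.array([0.0, 0.0, 1.0])
    return axis, angle

def tuning_activation(axes, angles, axis, angle, sigma):
    """Gaussian tuning curves, eq. (4): one activation per (axis, angle) neuron."""
    d_axis = np.arccos(np.clip(axes @ axis, -1.0, 1.0))     # angle to each preferred axis
    d_ang = np.angle(np.exp(1j * (angles - angle)))         # wrapped angle difference
    return np.exp(-(d_axis[:, None] ** 2 + d_ang[None, :] ** 2) / (2.0 * sigma ** 2))

def target_code(R_gt, symmetries, axes, angles, sigma=0.1):
    """Target code, eq. (5): sum of activations over all symmetry-equivalent
    rotations, plus the equivalent (-axis, -angle) representation of each.
    The value of sigma here is a placeholder."""
    t = np.zeros((len(axes), len(angles)))
    for S in symmetries:                    # symmetry set, including the identity
        n, th = axis_angle(R_gt @ S)
        t += tuning_activation(axes, angles, n, th, sigma)
        t += tuning_activation(axes, angles, -n, -th, sigma)
    return t

# usage with the lattice from the previous sketch, e.g. for a 2-fold symmetry:
# t = target_code(np.eye(3), [np.eye(3), Rotation.from_euler('z', np.pi).as_matrix()], axes, angles)
```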
For objects with a symmetry axis, which would result in infinitely many equivalent poses, we compute a variant of our population code: since the population code for the rotation angle would show uniform activations, we simplify the code and omit the rotation angle neurons and just use the neurons encoding the rotation axis on the Fibonacci sphere. Here, the code is computed simply as
$$t_i = \exp\!\left(-\,\frac{\Delta\alpha_i^2}{2\sigma^2}\right), \tag{6}$$
without having to superimpose activations for symmetries. So, our population code can leverage the rotation symmetry.
For inference, we predict the entire population code, $\hat{t}$, and choose the neuron with the highest activation,

$$(i^*, j^*) = \operatorname*{argmax}_{i,j} \, \hat{t}_{ij}. \tag{7}$$

Neuron $(i^*, j^*)$ corresponds to a rotation axis and angle, which we convert into a rotation matrix. For objects with a symmetry axis, we pick a random angle. This inference is similar to (2) from the probabilistic method [8], with the crucial difference that we only need to evaluate our network once, resulting in a faster pose estimate.
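A minimal sketch of this decoding step (eq. (7)) follows, using SciPy for the axis/angle-to-matrix conversion; the function name is ours.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def decode_pose(code: np.ndarray, axes: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Pick the most active neuron (eq. (7)) and convert its preferred
    axis/angle into a rotation matrix."""
    i, j = np.unravel_index(np.argmax(code), code.shape)
    return Rotation.from_rotvec(axes[i] * angles[j]).as_matrix()
```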
III-B Network Architecture

Our neural network maps gray-scale images of size 128x128 pixels onto a population code, which is a vector of size $N_a \times N_\phi$, where $N_a$ is the number of preferred axes and $N_\phi$ the number of preferred angles. We chose these numbers such that the total matches the number of rotation candidates used in [4].
The images are fed through four convolutional layers (see Fig. 2 regarding the kernel size and number of features). The first three convolutional layers are followed by batch normalization and ReLU activation functions. This architectural choice was inspired by the Augmented Autoencoder [4]. The fourth convolutional layer with kernel size 1x1 is followed by a GeLU non-linear function. This architectural element was inspired by [11], which demonstrated its benefit.
After the convolutional layers, the features are mapped through three hidden linear layers onto the population code, which forms the linear output layer. Each hidden layer is followed by a Leaky ReLU activation function.
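A sketch of this architecture in PyTorch is shown below; the kernel sizes, channel counts, strides, and hidden-layer widths are placeholders, since the actual values are given in Fig. 2, and the flattened feature size depends on these choices.

```python
import torch
import torch.nn as nn

class PopCodeNet(nn.Module):
    """Sketch of the architecture in Fig. 2 (all sizes are placeholders)."""
    def __init__(self, n_axes: int, n_angles: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 128, 1), nn.GELU(),        # 1x1 convolution followed by GeLU
            nn.Flatten(),
        )
        self.head = nn.Sequential(                     # three hidden layers + linear output
            nn.Linear(128 * 16 * 16, 512), nn.LeakyReLU(),
            nn.Linear(512, 512), nn.LeakyReLU(),
            nn.Linear(512, 512), nn.LeakyReLU(),
            nn.Linear(512, n_axes * n_angles),         # population code
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

net = PopCodeNet(n_axes=300, n_angles=36)              # illustrative sizes
out = net(torch.zeros(1, 1, 128, 128))                 # 128x128 gray-scale input
```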
IV EVALUATION
We evaluated our method on synthetic and real-world data.
IV-A Synthetic-Data Experiment
The synthetic-data experiment demonstrates the benefit of the population code. Here, we predict the one-dimensional orientation of shapes inside 64x64-pixel images (Fig. 3). As training data, we used sets of randomly generated bar or arrow images (800/200 train/test split).
Fig. 3: Example bar and arrow images.
For the synthetic data, we used a simplified neural network comprising three convolutional layers (feature sizes: 32, 64, and 128) and four linear layers. All layers except for the output layer were followed by a ReLU activation function, and for the convolutional layers, we added max pooling (2x2).
For the output, we used either a population code vector of 36 neurons or a scalar orientation angle. The population code had preferred angles uniformly distributed on a circle in increments of 10° and Gaussian tuning curves of fixed width. For the one-hot vector, we used the same output layer as for the population code.
We trained the network with the MSE loss for the single variable and the population code. For the one-hot vector, we tested both MSE and Cross-Entropy loss. In addition, for training the population code, we compared two strategies: 1) having a single activation peak at the ground-truth angle and 2) having two activation peaks 180° apart for shapes with symmetry.
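A minimal sketch of these two target variants follows; the helper name is ours, and the tuning width is a placeholder.

```python
import numpy as np

def target_1d(angle_deg: float, symmetric: bool, n: int = 36, sigma_deg: float = 10.0):
    """Target population code for a 1-D orientation; sigma_deg is a placeholder."""
    preferred = np.arange(n) * 360.0 / n                          # 10-degree spacing
    sigma = np.deg2rad(sigma_deg)
    diff = np.angle(np.exp(1j * np.deg2rad(preferred - angle_deg)))
    t = np.exp(-diff ** 2 / (2.0 * sigma ** 2))
    if symmetric:                                                 # bars: second peak 180 deg away
        diff2 = np.angle(np.exp(1j * np.deg2rad(preferred - angle_deg - 180.0)))
        t += np.exp(-diff2 ** 2 / (2.0 * sigma ** 2))
    return t
```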
We repeated train and test runs 10 times, including new train/test splits for each run. As a result, learning the population code with symmetry (two peaks) had the lowest error (Tab. I). Omitting symmetry during training resulted in a slightly larger error, but was not catastrophic. In contrast, learning failed when mapping directly onto the orientation angle for the symmetric bar.
Method | Arrows | Bars
---|---|---
Single Variable | |
One Hot (MSE) | |
One Hot (Cross Entropy) | |
Population Code | |
Population Code (Sym) | N/A |
IV-B Evaluation Metrics
For the real-world data, we computed six different metrics to evaluate the accuracy of the estimated pose; specifically, we evaluated the rotation error, assuming a perfect translation estimate. We chose metrics that are suitable for symmetric objects. The first three metrics are used by the BOP: Benchmark for 6D Object Pose Estimation [12]. The remaining three have also been used in the literature.
Maximum Symmetry-Aware Surface Distance (MSSD)
Given a set of model vertices, $V_M$, the MSSD error is computed as

$$e_{\mathrm{MSSD}} = \min_{S \in S_M} \, \max_{\mathbf{x} \in V_M} \left\lVert \hat{P}\mathbf{x} - \bar{P}S\mathbf{x} \right\rVert_2, \tag{8}$$

where $\hat{P}$ and $\bar{P}$ are the estimated and ground-truth poses, and $S_M$ is a set of symmetry transformations, s.th. the transformed model is indistinguishable from the original. The MSSD accuracy is the mean recall rate of $e_{\mathrm{MSSD}} < \theta$ with $\theta$ ranging from 5% to 50% of the object diameter in steps of 5% [12].
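A minimal sketch of this computation follows; the function name is ours, the symmetry transformations are given here as 3x3 rotation matrices (including the identity), whereas BOP symmetry transforms may also carry a translational offset.

```python
import numpy as np

def mssd(R_est, t_est, R_gt, t_gt, vertices, symmetries):
    """Maximum Symmetry-Aware Surface Distance, cf. eq. (8): maximum vertex
    displacement, minimized over the object's symmetry transformations."""
    est = vertices @ R_est.T + t_est
    errs = []
    for S in symmetries:                       # include the identity in `symmetries`
        gt = vertices @ (R_gt @ S).T + t_gt
        errs.append(np.linalg.norm(est - gt, axis=1).max())
    return min(errs)
```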
Maximum Symmetry-Aware Projection Distance (MSPD)
The MSPD error is similar to the MSSD error, except that it takes into account only the projection onto the image plane,

$$e_{\mathrm{MSPD}} = \min_{S \in S_M} \, \max_{\mathbf{x} \in V_M} \left\lVert \mathrm{proj}(\hat{P}\mathbf{x}) - \mathrm{proj}(\bar{P}S\mathbf{x}) \right\rVert_2, \tag{9}$$

where $\mathrm{proj}$ is the projection onto the image plane. The MSPD accuracy is the average recall rate of $e_{\mathrm{MSPD}} < \theta$ with $\theta$ ranging from $5w/640$ to $50w/640$ in steps of $5w/640$, where $w$ is the image width in pixels [12].
Visible Surface Discrepancy (VSD)
The VSD error measures the mismatch in appearance between the model in the ground-truth pose and the same model in the estimated pose,

$$e_{\mathrm{VSD}} = \underset{p \,\in\, \hat{V} \cup \bar{V}}{\mathrm{avg}} \begin{cases} 0 & \text{if } p \in \hat{V} \cap \bar{V} \,\wedge\, \lvert \hat{D}(p) - \bar{D}(p) \rvert < \tau \\ 1 & \text{otherwise,} \end{cases} \tag{10}$$

where $\hat{V}$ and $\bar{V}$ are the projected shapes of the model onto the image plane for the estimate and the ground truth, and $\hat{D}(p)$ and $\bar{D}(p)$ are the estimated and ground-truth depths of the model at pixel $p$. The VSD accuracy is the average recall rate of $e_{\mathrm{VSD}} < \theta$ with the tolerance $\tau$ ranging from 5% to 50% of the object diameter in steps of 5% and $\theta$ ranging from 0.05 to 0.5 in steps of 0.05 [12].
In this variant, the accuracy is the ratio of errors $e_{\mathrm{VSD}} < 0.3$ with $\tau = 20$ mm. We adopt this metric since it was used by Sundermeyer et al. for the Augmented Autoencoder [4]. In the result tables, we denote this variant as VSD < 0.3.
Visual Surface Similarity (VSS)
VSS is a variant of VSD in which the depth term is dropped, i.e., the error in depth is ignored. This metric has been used for the multiple pose hypotheses method [9].
Average Distance of Model Points - Indistinguishable Views (ADI)
ADI is a variant of the Average Distance of Model Points metric, adapted for symmetric objects [13]. The minimization takes ambiguous poses into account:

$$e_{\mathrm{ADI}} = \underset{\mathbf{x}_1 \in V_M}{\mathrm{avg}} \, \min_{\mathbf{x}_2 \in V_M} \left\lVert (\bar{R}\mathbf{x}_1 + \bar{\mathbf{t}}) - (\hat{R}\mathbf{x}_2 + \hat{\mathbf{t}}) \right\rVert_2. \tag{11}$$

To compute the ADI accuracy, we computed the ratio of $e_{\mathrm{ADI}} < \theta$ with $\theta$ equal to 5% of the object diameter. Here, we used a different threshold than in [13], who used 15%, because with 15% of the diameter, all our experiments would show a 100% accuracy.
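A short sketch of the ADI error, using SciPy's KD-tree for the nearest-neighbor search, follows; the function name is ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def adi(R_est, t_est, R_gt, t_gt, vertices):
    """ADI error, cf. eq. (11): average distance of each ground-truth model
    point to its nearest estimated model point, tolerating indistinguishable views."""
    est = vertices @ R_est.T + t_est
    gt = vertices @ R_gt.T + t_gt
    dists, _ = cKDTree(est).query(gt, k=1)
    return dists.mean()
```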
Sensitivity to rotation error
We found large differences in sensitivity to rotation error between metrics (Fig. 4). Particularly, for roughly spherical objects most metrics failed to reflect large errors. Moreover, only MSSD and ADI are invariant to the viewpoint since the other metrics compute a projection onto the image plane. To conclude, we found MSSD to be the most suitable metric to estimate rotation accuracy.

IV-C Comparison Methods
On the real-world data, we compared our method with three other methods. All of them used the same network architecture, as described in Section III, and the same data augmentation. This ensures a direct comparison of the impact of the representations. In contrast, results across publications are difficult to compare because architecture and data augmentation significantly affect the outcomes but are often hard to replicate due to a lack of detailed information. This section describes the alternative methods.
Multiple Pose Hypotheses
We adapted the multiple pose hypotheses method [9] to our network, predicting a fixed number of quaternions in the last layer. For training, we used the loss in (3), and for inference, we computed mean-shift clustering on the predicted quaternions [9] and picked the mean of the largest cluster.
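A sketch of this selection step using scikit-learn's MeanShift is shown below; the bandwidth value is a placeholder, and the naive quaternion averaging is a simplification of our implementation.

```python
import numpy as np
from sklearn.cluster import MeanShift

def select_hypothesis(quats: np.ndarray) -> np.ndarray:
    """Cluster the predicted quaternions and return the mean of the largest cluster."""
    ms = MeanShift(bandwidth=0.1).fit(quats)        # bandwidth is a placeholder
    labels, counts = np.unique(ms.labels_, return_counts=True)
    biggest = labels[np.argmax(counts)]
    q = quats[ms.labels_ == biggest].mean(axis=0)   # assumes q and -q are not mixed
    return q / np.linalg.norm(q)                    # renormalize to a unit quaternion
```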
Single Variable
For objects with a symmetry axis, we can directly predict the three coordinates of the axis since there is no ambiguity. Here, the last layer of our network had only three elements, and for training, we used the MSE to the ground truth axis.
For objects with discrete symmetries, we computed the minimum loss to the closest target (one for each equivalent pose). Here, the last layer of our network was six-dimensional, storing two columns of the rotation matrix $R$. Learning two columns has been shown to be advantageous over other representations like quaternions due to the continuity of the $\mathbb{R}^6$ space [14]. This representation was also used by one of the best-performing methods [15], and, as we will see, mapping onto $\mathbb{R}^6$ does indeed work well for asymmetric objects (Tab. II). The entire rotation matrix can be reconstructed from the two columns using Gram-Schmidt orthonormalization. For training, we used the L1 loss.
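The Gram-Schmidt reconstruction from the two predicted columns follows the standard procedure from [14]; a minimal sketch (the helper name is ours):

```python
import numpy as np

def r6_to_matrix(r6: np.ndarray) -> np.ndarray:
    """Reconstruct a rotation matrix from two predicted columns [14]
    via Gram-Schmidt orthonormalization."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)            # normalize the first column
    b2 = a2 - (b1 @ a2) * b1                # remove the component along b1
    b2 /= np.linalg.norm(b2)
    b3 = np.cross(b1, b2)                   # third column completes the basis
    return np.stack([b1, b2, b3], axis=1)   # columns of R
```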
One-Hot Vector
Instead of treating the output layer as a population code, we used it as a one-hot vector, where each element/index represents a rotation. As is common for one-hot vectors, we used the Cross-Entropy loss for training. For inference, we picked the index with the largest logit/probability.
IV-D Real-World-Data Experiment
We carried out our real-world experiments with the T-LESS dataset [16], which contains feature-less and mostly symmetric objects. Table II shows all 30 objects.
Object | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
---|---|---|---|---|---|---|---|---|---|---
Symmetries | Continuous | Continuous | Continuous | Continuous | 2 | 2 | 2 | 2 | 2 | 2
Trained on objects | 1, 2 | 1, 2 | 3, 4 | 3, 4 | 5, 6 | 5, 6 | 7, 8 | 7, 8 | 9 | 10
Test instances | 818 | 465 | 397 | 619 | 194 | 100 | 250 | 150 | 249 | 145
Single Variable | 80.4% | 80.9% | 90.3% | 84.1% | 66.0% | 64.9% | 21.9% | 34.9% | 58.7% | 64.0%
One Hot Vector | 71.0% | 74.5% | 82.1% | 78.5% | 69.0% | 50.7% | 41.5% | 30.9% | 41.6% | 51.5%
Population Code | 81.8% | 84.1% | 91.6% | 86.8% | 89.6% | 91.5% | 83.2% | 89.1% | 92.2% | 92.5%

Object | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20
---|---|---|---|---|---|---|---|---|---|---
Symmetries | 2 | 2 | Continuous | Continuous | Continuous | Continuous | Continuous | None | 2 | 2
Trained on objects | 11, 12 | 11, 12 | 13, 14 | 13, 14 | 15, 16 | 15, 16 | 17 | 18 | 19, 20 | 19, 20
Test instances | 185 | 148 | 148 | 150 | 148 | 198 | 150 | 149 | 198 | 245
Single Variable | 74.5% | 73.5% | 86.6% | 84.1% | 93.5% | 89.0% | 94.4% | 85.2% | 47.7% | 34.0%
One Hot Vector | 60.9% | 62.2% | 84.0% | 84.9% | 86.4% | 87.4% | 88.0% | 44.3% | 35.0% | 19.1%
Population Code | 86.8% | 84.9% | 90.9% | 86.1% | 95.8% | 93.0% | 96.1% | 75.9% | 82.6% | 74.0%

Object | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30
---|---|---|---|---|---|---|---|---|---|---
Symmetries | None | None | 2 | Continuous | 2 | 2 | 4 | 2 | 2 | Continuous
Trained on objects | 21, 22 | 21, 22 | 23 | 24 | 25, 26 | 25, 26 | 27 | 28 | 29 | 30
Test instances | 184 | 195 | 250 | 195 | 98 | 100 | 96 | 195 | 99 | 148
Single Variable | 62.7% | 64.0% | 66.4% | 92.4% | 75.9% | 79.8% | 7.8% | 15.8% | 18.6% | 89.7%
One Hot Vector | 34.5% | 27.8% | 34.6% | 86.2% | 41.3% | 34.9% | 17.8% | 20.7% | 31.2% | 87.4%
Population Code | 60.1% | 59.7% | 89.0% | 93.9% | 83.5% | 85.3% | 58.9% | 79.2% | 88.3% | 94.0%
For training, we used the images from the BlenderProc4BOP dataset [17], extracting all instances of an object with at least 60% visibility. We augmented these data with the T-LESS Primesense camera images [16]. For all images, we used the provided bounding boxes and made square cutouts by using the maximum of width and height plus a 10% padding. The resulting cutouts were scaled to 128x128 pixels. We supplemented these images with 8,000 generated images per object using pyrender and the 3D model files from the T-LESS dataset. This combination resulted in about 20,000 to 22,000 images per object.
We combined some pairs of visually similar objects into one dataset and trained a single model for both objects (Tab. II). This strategy resulted in a small improvement (Tab. VI) and also points to the ability of our architecture and population-code representation to generalize across objects.
We trained all methods for 80 epochs, using the Adam optimizer with a learning rate of 0.0002 and a batch size of 64. As the loss function, we used MSE, except where specified differently above (Section IV-C). During training, we augmented the pyrender-generated images by 1) adding random brightness, i.e., a uniformly distributed value between -0.2 and 0.2 added to all pixels, with a 50% probability, and 2) adding a background image with a 70% probability. Moreover, for all training images, we added Gaussian noise (std: 0.02) with a 50% probability, applied contrast normalization, and shifted the image in the image plane in x and y with a 90% probability.
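A simplified sketch of this augmentation pipeline follows; the function name is ours, the background compositing is a placeholder (we simply blend where the rendered image is dark), contrast normalization is omitted, and the shift range is illustrative.

```python
import numpy as np

def augment(img: np.ndarray, background: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Augment a [0, 1] gray-scale image (simplified sketch)."""
    out = img.copy()
    if rng.random() < 0.5:                                  # random brightness
        out += rng.uniform(-0.2, 0.2)
    if rng.random() < 0.7:                                  # add a background image
        mask = img < 1e-3                                   # placeholder compositing rule
        out[mask] = background[mask]
    if rng.random() < 0.5:                                  # Gaussian pixel noise
        out += rng.normal(0.0, 0.02, size=out.shape)
    if rng.random() < 0.9:                                  # small in-plane shift
        dx, dy = rng.integers(-5, 6, size=2)                # shift range is a placeholder
        out = np.roll(out, (dy, dx), axis=(0, 1))           # np.roll wraps around the border
    return np.clip(out, 0.0, 1.0)
```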

For testing, we used the official test set from the BOP Pose Estimation Challenge [17]. Since our focus was on rotation estimation rather than object detection, we used the provided ground truth bounding boxes. We compared Single Variable, One-Hot Vector, and Population Code across all 30 objects (Tab. II), whereas Multiple Pose Hypotheses was evaluated only on objects 4 and 5, as its performance was no better than random weights (Tab. IV and V). This suggests that the approach cannot learn from scratch and requires a pre-trained network backbone, as used by [9].
For our method, the average inference time was 3.2 msec on a MacBook Air (M1, 2020), which makes our method suitable for edge devices and real-time applications. In comparison, Multiple Pose Hypotheses required 115 msec for inference due to the costly clustering.
For the population code, the average MSSD accuracy across all objects was 84.7% (Tab. III). This value compares favorably against the best result on the BOP leaderboard for RGB-only input in the category Localization of Seen Objects: 81.4% (https://bop.felk.cvut.cz/leaderboards/pose-estimation-bop19/t-less/ accessed on 9/14/2024). The limitation is that we did not estimate the translation but instead used the ground truth translation.
Our average accuracy for the VSD < 0.3 variant was 87.3% across all objects (Tab. III). In comparison, using the same ground-truth bounding boxes, Sundermeyer et al. reported a 38.34% accuracy from RGB input and 84.04% with ICP refinement on the depth data [4]. However, the same caveat as above holds regarding estimating the translation.
Predicting a single variable performed well when there was no ambiguity in the target (Tab. III) but struggled with ambiguous poses (Tab. III). In contrast, our population code method performed equally well under ambiguity (Tab. III). Moreover, population code targets consistently outperformed one-hot vectors by large margins (Tab. II).
We also observed that our network faithfully learned the entire distribution of the population code (Fig. 5) and not just the peaks. This behavior is expected since we compute the MSE across all neurons of a population code. Interestingly, even without providing symmetry information during training, the performance drop was relatively small (Tab. VI); i.e., the network could still learn to represent the ambiguity caused by symmetry.
Without Ambiguity

Method | VSD | MSSD | MSPD | VSD < 0.3
---|---|---|---|---
Single Variable | 73.41% | 83.35% | 85.74% | 84.74%
One Hot Vector | 65.26% | 73.27% | 75.89% | 78.81%
Population Code | 75.33% | 84.68% | 86.07% | 88.04%

With Ambiguity

Method | VSD | MSSD | MSPD | VSD < 0.3
---|---|---|---|---
Single Variable | 44.70% | 49.60% | 43.51% | 43.01%
One Hot Vector | 43.17% | 40.12% | 35.31% | 43.04%
Population Code | 72.59% | 84.74% | 79.61% | 86.23%

All Objects

Method | VSD | MSSD | MSPD | VSD < 0.3
---|---|---|---|---
Single Variable | 61.77% | 69.67% | 68.62% | 67.82%
One Hot Vector | 56.31% | 59.83% | 59.44% | 64.31%
Population Code | 74.22% | 84.70% | 83.46% | 87.31%
Approach | VSD | MSSD | MSPD | VSD < 0.3 | VSS | ADI
---|---|---|---|---|---|---
Random Weights | 8.2% | 9.0% | 14.4% | 6.6% | 55.5% | 7.1%
Multiple Poses | 8.4% | 6.3% | 9.6% | 2.6% | 57.5% | 4.9%
Single Variable | 74.4% | 84.1% | 86.2% | 85.0% | 89.1% | 87.1%
One-Hot Vector | 68.5% | 78.5% | 80.6% | 81.6% | 86.7% | 82.6%
Population Code | 80.0% | 86.8% | 88.5% | 89.5% | 90.3% | 90.5%
Approach | VSD | MSSD | MSPD | VSD < 0.3 | VSS | ADI
---|---|---|---|---|---|---
Random Weights | 17.4% | 1.7% | 0.9% | 4.6% | 68.0% | 11.3%
Multiple Poses | 20.7% | 3.3% | 2.3% | 9.8% | 69.0% | 30.4%
Single Variable | 61.5% | 66.0% | 61.4% | 73.2% | 87.2% | 82.5%
One-Hot Vector | 69.3% | 69.0% | 66.6% | 82.5% | 89.9% | 87.1%
Population Code | 83.4% | 89.6% | 87.8% | 94.9% | 94.1% | 96.9%
Ablation | VSD | MSSD | MSPD
---|---|---|---
Original (trained on objects 5 & 6) | 83.4% | 89.6% | 87.8%
Trained only on object 5 | 82.8% | 88.2% | 86.6%
Trained only on object 5, removed 1x1 conv layer | 81.4% | 87.1% | 84.7%
Trained only on object 5, trained without symmetry | 77.7% | 84.9% | 82.3%
V CONCLUSIONS
Representing object orientation with a population code allows end-to-end training of a pose-estimation network from scratch and solves the ambiguity problem caused by object symmetry. This strategy results in fast and accurate inference. The speed allows real-time operation on edge devices. We found that the population code results in higher accuracies compared to the other representations using the same training data and network. For robotic control, the population code could also be directly mapped onto a grasp posture (omitting rotation matrices), which we speculate is also the operation used by our brains.
References
- [1] H. Hoffmann, Unsupervised Learning of Visuomotor Associations, ser. MPI Series in Biological Cybernetics. Logos Verlag Berlin, 2005, vol. 11.
- [2] B. Wen, W. Yang, J. Kautz, and S. Birchfield, “FoundationPose: Unified 6D pose estimation and tracking of novel objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17868–17879.
- [3] J. Liu, W. Sun, H. Yang, Z. Zeng, C. Liu, J. Zheng, X. Liu, H. Rahmani, N. Sebe, and A. Mian, “Deep learning-based object pose estimation: A comprehensive survey,” arXiv preprint arXiv:2405.07801, 2024.
- [4] M. Sundermeyer, Z.-C. Marton, M. Durner, and R. Triebel, “Augmented autoencoders: Implicit 3D orientation learning for 6D object detection,” International Journal of Computer Vision, vol. 128, no. 3, pp. 714–729, 2020.
- [5] D. H. Hubel and T. N. Wiesel, “Receptive fields of single neurones in the cat’s striate cortex.” J Physiol, vol. 148, no. 3, pp. 574–591, Oct 1959.
- [6] A. Pouget, P. Dayan, and R. Zemel, “Information processing with population codes,” Nature Reviews Neuroscience, vol. 1, no. 2, pp. 125–132, 2000.
- [7] I. Gilitschenski, R. Sahoo, W. Schwarting, A. Amini, S. Karaman, and D. Rus, “Deep orientation uncertainty learning based on a Bingham loss,” in International Conference on Learning Representations, 2019.
- [8] K. Murphy, C. Esteves, V. Jampani, S. Ramalingam, and A. Makadia, “Implicit-PDF: Non-parametric representation of probability distributions on the rotation manifold,” arXiv preprint arXiv:2106.05965, 2021.
- [9] F. Manhardt, D. M. Arroyo, C. Rupprecht, B. Busam, T. Birdal, N. Navab, and F. Tombari, “Explaining the ambiguity of object detection and 6D pose from visual data,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6841–6850.
- [10] R. A. Dixon, Mathographics. Dover Publications, 1991.
- [11] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A ConvNet for the 2020s,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11976–11986.
- [12] T. Hodaň, M. Sundermeyer, B. Drost, Y. Labbé, E. Brachmann, F. Michel, C. Rother, and J. Matas, “BOP challenge 2020 on 6D object localization,” in Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, 2020, pp. 577–594.
- [13] T. Hodaň, J. Matas, and Š. Obdržálek, “On evaluation of 6D object pose estimation,” in Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14. Springer, 2016, pp. 606–619.
- [14] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5745–5753.
- [15] G. Wang, F. Manhardt, F. Tombari, and X. Ji, “GDR-Net: Geometry-guided direct regression network for monocular 6D object pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16611–16621.
- [16] T. Hodaň, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis, “T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects,” in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2017, pp. 880–888.
- [17] T. Hodaň, M. Sundermeyer, B. Drost, Y. Labbé, E. Brachmann, F. Michel, C. Rother, and J. Matas, “BOP challenge 2020 on 6D object localization,” in Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, 2020, pp. 577–594.