Generalizing Single-View 3D Shape Retrieval to Occlusions and Unseen Objects

Qirui Wu¹ Daniel Ritchie² Manolis Savva¹ Angel X. Chang^1,3
¹Simon Fraser University ²Brown University ³Alberta Machine Intelligence Institute (Amii)
https://3dlg-hcvc.github.io/generalizing_shape_retrieval/

Abstract

Single-view 3D shape retrieval is a challenging task that is increasingly important with the growth of available 3D data. Prior work that has studied this task has not focused on evaluating how realistic occlusions impact performance, and how shape retrieval methods generalize to scenarios where either the target 3D shape database contains unseen shapes, or the input image contains unseen objects. In this paper, we systematically evaluate single-view 3D shape retrieval along three different axes: the presence of object occlusions and truncations, generalization to unseen 3D shape data, and generalization to unseen objects in the input images. We standardize two existing datasets of real images and propose a dataset generation pipeline to produce a synthetic dataset of scenes with multiple objects exhibiting realistic occlusions. Our experiments show that training on occlusion-free data as was commonly done in prior work leads to significant performance degradation for inputs with occlusion. We find that that by first pretraining on our synthetic dataset with occlusions and then finetuning on real data, we can significantly outperform models from prior work and demonstrate robustness to both unseen 3D shapes and unseen objects.

1 Introduction

Refer to caption — Figure 1: We focus on the single-view 3D shape retrieval task in complex but realistic scenarios where both the object and 3D shape candidates may be unseen during training, and the object may be observed in cluttered scenes under significant occlusions.

3D shape retrieval given a single-view image is increasingly useful for 3D content creation, robotic perception and other applications. The increasing volume of 3D data for domains such as e-commerce, interior design, and augmented reality makes single-view shape retrieval an important task. However, this is a challenging task. Paired real image–3D shape data require significant annotation so there are no large-scale, diverse datasets. Synthetic data can be helpful in bridging this gap but prior work has deployed it in simplified, unrealistic settings where a single synthetic 3D object is overlaid on a blank or random image background (see Fig. 1). Some work leverages existing, relatively small image-shape datasets such as Pix3D [36] to demonstrate retrieval performance. However, results are not comparable due to the lack of a standardized, shared evaluation protocol, leaving work to create its own training and evaluation splits, and pick its own metrics for evaluation. Lastly, occlusion or truncation of objects in the image are prevalent scenarios that impact task difficulty in real data but are usually ignored [30].

In this paper, we systematically study the generalization of single-view 3D shape retrieval methods along three axes, making associated contributions in each axis.

Generalization to unseen objects. Prior work on shape retrieval has not been evaluated in settings involving unseen objects (i.e. no overlap in 3D shapes between train and val/test splits). Thus, it is unclear how existing retrieval methods generalize to images containing unseen objects. We present experiments that evaluate on this setting, and demonstrate that this form of generalization to unseen objects remains challenging (see Table 1 for a summary of performance degradation in the unseen object setting).

Generalization to occluded objects. Occlusion and truncation occur in many real images but prior work has not studied their impact during training and in evaluation. We construct a dataset with varying occlusion rates and systematically analyze the impact of occlusions on retrieval performance.

Generalization to similar shapes. Most work on image-to-shape retrieval has been on evaluated on small shape databases. There has been no systematic study of performance of models when tested on large databases where there may be many shapes that are similar to the one annotated GT shape. In such cases, it is possible that the common accuracy metric fails to capture whether more similar shapes are ranked higher than dissimilar shapes. While reconstruction metrics such as Chamfer distance and volumetric intersection-over-union (IoU) have been used for evaluation, these metrics are limited in evaluating fine-grained distinctions between object instances in a robust way. We propose a suite of view-dependent and view-independent metrics that are more appropriate for this practical setting.

In summary, we make the following contributions: 1) we propose a standardized evaluation protocol that better characterizes 3D shape retrieval performance and we recommend a suite of view-dependent and view-independent metrics; 2) we develop a multi-object occlusion dataset to measure the impact of occlusions on retrieval; 3) we experimentally analyze the generalization of 3D shape retrieval methods on unseen objects and novel 3D shape databases.

Dataset	objects	$Acc_{1}\uparrow$	$Acc_{5}\uparrow$	$\text{CD}\downarrow$	$\text{CD}_{5}\downarrow$
Easy set	seen	74.9	89.2	0.35	1.66
Easy set	unseen	0.3	7.6	1.52	1.77
Hard set	seen	47.2	73.5	0.99	2.04
Hard set	unseen	0.0	17.4	2.40	2.62

Table 1: Performance of our CMIC [30] implementation trained on the Pix3D “Easy set” as reported by Lin et al. [30], and tested on val sets of Pix3D that contain unobserved objects (“unseen”) or contain occlusions (“hard set”). Note the significant performance degradation across metrics.

2 Related Work

3D shape retrieval is well studied [37, 27]. Task variants involve querying of shapes from single-view images [29, 30], sketches [18], text, as well as partial scans [19, 33] and other 3D objects [2, 18]. Here, we describe relevant work focusing on single, photo-realistic image to 3D shape retrieval.

Image to shape retrieval. 3D shapes are commonly represented as multiview image renderings [29, 31, 14, 23, 30, 10] to shrink the gap between 2D and 3D. To align the different modalities of the same physical entity, real RGB images and shape multiview images are encoded into a joint-embedding space using triplet or contrastive losses. There is also recent interest in studying generalization to novel objects and occlusions. Nguyen et al. [32] use templates for object pose estimation, and study the robustness of 3D shape retrieval to occlusions in the object pose estimation task.

Early work by Li et al. [29] used a joint embedding space of 3D shapes and 2D images to retrieve shapes from images. Lee et al. [26] proposed to learn a joint embedding space for natural images and 3D shapes in an end-to-end manner. To handle different texture and lighting conditions in images, Fu et al. [10] proposed to synthesize textures on 3D shapes to generate hard negatives for additional training. Uy et al. [39] used a deformation-aware embedding space so that retrieved models better match the target after an appropriate deformation. A recent model, CMIC [30] achieved state-of-the-art performance on retrieval accuracy by using instance-level and category-level contrastive losses. We build on CMIC for our analysis of image to 3D shape retrieval generalization.

Joint retrieval and alignment. Another line of work uses image-to-shape retrieval as a component in an image-to-scene pipeline [16, 20, 23, 15], where objects are detected, segmented, aligned, and CAD models are retrieved and posed to form a 3D scene. In these works, the layout estimation and CAD model alignment tasks are often solved jointly. IM2CAD [20] proposed a pipeline that produces a full indoor scene (room and furniture) from a single image by leveraging 2D object recognition methods to detect multiple object candidates. Mask2CAD [23] proposed a more lightweight approach that jointly retrieves and aligns 3D CAD models to detected objects in an image. Patch2CAD [24] used a patch-based approach to improve shape retrieval. ROCA [15] and its follow-up [25] relied on dense 2D-3D correspondences to retrieve geometrically similar CAD models but mainly focused on differentiable pose estimation. Point2Objects [9] addressed the same task as Mask2CAD but directly regresses 9DoF alignments and treats object retrieval as a classification problem, and mainly showed results on synthetic renderings.

Evaluating image to shape retrieval. To study pose estimation, researchers have created datasets with 3D shapes aligned to real-world images [44, 45]. However, these datasets are not suitable for shape retrieval as the shape may not be an accurate match. Li et al. [28] provided a benchmark on monocular image to shape retrieval with 21 categories. Their images mostly offered unoccluded views of the object and the shape match was inexact (category match, shape geometry not guaranteed to match). There are few datasets with accurate image-shape match, leading to relatively small-scale and ad-hoc evaluation schemes in prior work. Li et al. [29] evaluated on a small dataset of 105 shapes paired with 315 google search images (each shape had 3 images), with the 105 shapes excluded from training. The authors noted that this benchmark was expensive to create. Sun et al. [36] contributed Pix3D, a dataset of images aligned to posed 3D shapes and demonstrated its usefulness for benchmarking methods for single-view 3D reconstruction, image-to-shape retrieval, and pose estimation. The Pix3D dataset provided annotations of whether the shape in the image had truncation or occlusion. However, experiments in this work and followups [36, 10, 14, 30] are limited to untruncated and unoccluded objects, and few work study generalization to unseen shapes [14, 30] in a rigorous manner. MeshRCNN [13] proposed standard splits for Pix3D, including a more challenging data split S2 for testing on images of unobserved objects, but this split was rarely used in followup work. Mask2CAD [23] was one of the few works that used the S2 data split. We standardize two real image datasets, and develop a new synthetic dataset and associated metrics to evaluate 3D shape retrieval in a more comprehensive manner than prior work.

3 Analyzing image-to-shape retrieval

To study the performance of image-to-shape retrieval, we argue that it is important to: 1) evaluate on unseen shapes by establishing a train/val/test split that ensures separation of val/test shapes not seen during training; 2) evaluate on occluded and truncated objects; and 3) evaluate the geometric similarity of the ranked results to the ground truth shape. In addition, we believe it is important to evaluate on real-world images while still leveraging synthetically constructed scenes and rendered images for controlled experiments and pre-training (see Fig. 2). To this end, we propose an evaluation protocol with standardized train/val/test splits for three datasets, two of which have 3D shapes aligned to real-world images, and a third that is synthetically constructed. For each dataset, we ensure that the val/test splits contain shapes not seen during training. As part of our evaluation, we also analyze existing 3D similarity evaluation metrics and propose a set of recommended metrics for image-to-shape retrieval.

We select Pix3D [36] as our first evaluation dataset as it provides accurate 3D shape-to-image matches and annotations indicating occlusion and truncation. Our second dataset is ScanIm2CAD: a new dataset using images from ScanNet [7] with projected masks and ShapeNet [3] models from Scan2CAD [1] annotations. We create a standardized image-to-shape retrieval benchmark on this dataset as it is a popular choice for recent work on joint shape-retrieval and alignment [23, 24, 15, 25]. The 3D shape to image match is not as high as Pix3D but this dataset offers real-world images of objects in context (in real-world scenes), with realistic occlusions and truncations. Lastly, we create MOOS, a dataset of synthetic scenes where we programmatically control the selection of objects and layouts. MOOS allows us to easily render images with ground-truth masks that are accurately aligned with ground-truth CAD models, and control the amount of occlusion and truncation.

Using these datasets, we investigate how well a state-of-the-art image-to-shape retrieval model, CMIC [30], can generalize to unseen shapes, and handle images with occluded and truncated objects. We find that performance drops drastically under these challenging settings (see Table 1). We use our synthetically generated MOOS dataset to systematically investigate the impact of occlusion and unseen shapes (see Section 7) and show that pretraining with MOOS improves performance on real-world images (see Table 6).

4 Model

We use CMIC [30] as the basis of our benchmarking. CMIC learns a joint embedding space between images and 3D shapes using contrastive losses at the instance and category levels. Given an image containing a target object, the joint embedding space is used to retrieve a ranked list of 3D shapes based on similarity to the visual features of the object in the image. Since the input images may contain multiple objects we use an object mask to indicate the object of interest. The image and mask are cropped from the input image using the 2D bounding box computed from the mask. As input to our models, we use the resized cropped image and mask and assume that the crop mainly contain only the object of interest, potentially with occlusions.

Encoders. CMIC consists of two separate encoders to obtain representations $f^{I}$ and $f^{S}$ for query image $I$ and 3D shape $S$ . The object segmentation mask $M$ is fed to the query encoder to mask out background information. Each 3D shape $S$ is represented as a set of multiview images $\{S_{k}\}_{k=1}^{m}$ rendered from predefined camera viewpoints. This approach aims to bridge the gap between the 2D and 3D modalities by converting the image-shape retrieval problem to image-image retrieval. By passing $m$ images of one shape into the shape encoder, we obtain a set of image features $\{f^{S_{k}}\}_{k=1}^{m}$ . For each shape in a batch, we compute multiple query-conditioned features $\{f^{S_{k}}_{i}\}_{i=1}^{B}$ for all queries $\{I_{i}\}_{i=1}^{B}$ using dot product attention.

Image-Shape Joint Embedding. Given query features $\{f^{i}\}_{i=1}^{B}$ and query-attended shape features $\{f^{S_{k}}_{i}\}_{i=1}^{B}$ , we learn an image-shape joint embedding using contrastive losses. We denote $D(f^{i},f^{S_{k}}_{j})$ as the similarity function between the embeddings of the $i$ -th image and the $k$ -th query-specific shape rendering of the $j$ -th image. The contrastive losses are designed at two levels: instance and category. The instance-level contrastive loss treats matching image-shape pairs as positive examples, and all other cases as negative. Thus, each image $I_{i}$ is paired with only one positive shape $S_{i}$ and $B-1$ negative shapes $S_{j}$ where $j\in B\setminus\{B_{i}\}$ :

L_{\text{inst}}=-\sum_{i\in B}\log\frac{D\left(f^{i},f_{i}^{S_{i}}\right)}{\sum_{j\in B}D\left(f^{i},f_{j}^{S_{k}}\right)}

The category-level contrastive loss is used to cluster and disperse image-shape pairs belonging to the same and different categories. This loss leverages shape category labels $y_{i}$ to maximize category agreement in a batch:

L_{\text{cat}}=-\sum_{i\in B}\frac{1}{|C(i)|}\sum_{c\in C(i)}\log\frac{D\left(f^{i},f_{i}^{S_{c}}\right)}{\sum_{j\in B}D\left(f^{i},f_{j}^{S_{k}}\right)}

where $C(i)$ refers to all instance with the same category label (e.g. $\{j|j\in B\setminus\{B_{i}\}\text{ and }y_{j}=y_{i}\}$ ). There could be multiple positive and negative samples for an image in the same mini-batch. The total loss is a weighted sum of the instance and category contrastive losses: $L=L_{\text{inst}}+\beta_{1}\cdot L_{\text{cat}}$ where $\beta_{1}$ is the weight on the category loss.

3D Shape Retrieval. The multiview image features of all 3D shape candidates are computed offline. At inference time, an RGB query is embedded using the query encoder. A list of 3D shapes is retrieved by ranking based on the cosine similarity between the query-conditioned shape feature and the query feature in descending order. We assume the category label is known to retrieve appropriate 3D shapes.

Implementation. We use a pretrained ResNet50 [17] to encode RGB images with object masks as an extra channel and a pretrained ResNet18 [17] to encode grayscale multiview renderings of 3D shapes. The extracted image/rendering embeddings are projected to features of dimension 128 with additional MLPs. We render 12 images for each 3D shape using predefined camera poses placed on the same horizontal plane and 30 degrees apart from each other. We apply several data augmentations to image queries including affine transformation, crop, flip and color jitter. We use temperature $\tau=0.1$ and $\beta_{1}=0.2$ . We train the model on a single Nvidia A40 GPU with a batch size of 64 using an Adam [21] optimizer with initial learning rate $5e^{-5}$ and $\text{betas}=(0.5,0.999)$ .

5 Datasets

Most work on image-to-shape retrieval rely on datasets of image-shape pairs [44, 45, 42, 36, 1] are constructed from realistic photos and synthetic 3D shapes. Due to the difficulty of manually aligning 3D shape to 2D objects, existing datasets are limited in data volume, both in terms of 2D images and 3D shapes for various categories. Unlike Pix3D whose accurate image-shape matches are obtained by using the IKEA object name or 3D scans of objects by the authors, other datasets [1] provide 3D shapes that do not fully match the geometry of the observed objects. Moreover, the annotated 3D object poses are subject to errors, which makes robust evaluation of single-view 3D shape retrieval methods harder. We propose a scalable and simple synthetic data generation pipeline for constructing 3D scenes from sampled 3D shapes that mimic realistic object occlusions.

Dataset	#images	#shape	#cat	alignment	scalable	occ info
Pix3D [36]	10K	395	9	accurate	no	partial
Scan2CAD [1]	25K	3,049	35	inaccurate	no	no
MOOS	120K	6,209	4	accurate	yes	complete

Table 2: Comparison of datasets with image and 3D shape pairs. Note that Pix3D only categorizes objects into three groups based on the degree of object occlusions.

5.1 Real Image Datasets

We evaluate on two datasets with 3D shapes aligned to real images, Pix3D [36] and Scan2CAD [1]. Following prior work [30, 10], we conduct experiments on 4 categories, chair, bed, sofa and table. There are $8,650$ images and $324$ unique 3D shapes in Pix3D over the $4$ categories. We create two sets using Pix3D annotations: Easy and Hard, where the latter contains occluded or truncated objects. We intentionally exclude 1,233 images of 181 objects from the training set to test generalization to unseen objects. For the remaining 7,417 images of 143 seen objects, we split images for each object into train and val sets in a 75:25 ratio. The Scan2CAD dataset collects scan-to-shape annotations by aligning 3D shapes from ShapeNet to 3D scans from ScanNet [7]. We use the ScanNet25K frames [7] as images and project 3D shapes onto each image using the camera poses. We use the ScanNetv2 splits resulting in 40K/11K image queries in train/val with 1,779 unique shapes over the four categories.

5.2 MOOS: Multi-Object Occlusion Scenes

To facilitate studying generalization in single-view shape retrieval, we propose a scalable synthetic dataset generation pipeline that we call Multi-Object Occlusion Scenes (MOOS). Our generation pipeline allows for control over the key variables we study: 1) amount of occlusion and truncation; and 2) novel shapes that are set aside for evaluation and not seen in training. Compared to prior datasets, MOOS provides a scalable dataset with accurate image-to-shape alignment and occlusion statistics (see Table 2).

Scenes in MOOS are comprised of randomly selected 3D shapes from 3D-FUTURE [12] that are randomly arranged to form a scene. Compared to ShapeNet, furniture objects in 3D-FUTURE have more consistent geometric quality, higher-resolution textures, and known physical dimensions. Each generated scene consists of 4 objects from 4 categories (chair, bed, table and sofa), with the same shape potentially occuring in multiple scenes. To generate a scene layout with natural occlusions in real environment, we use a heuristic algorithm that iteratively places newly sampled 3D shapes into the existing layout, ensuring the 2D bounding box from a top-down view does not intersect with previously inserted 3D shapes. See Fig. 3 for an overview. In this way, we compose a scene with closely placed objects exhibiting natural occlusion patterns, but no inter-penetrations. See supplement for more details on the generation procedure.

Using our pipeline we generate 10,000 unique scenes to construct the MOOS dataset. With our generated scenes, we construct a dataset of rendered images with occluded shapes for retrieval (see Table 2 and supplement for detailed statistics). We render 12 viewpoints per scene by evenly dividing along the azimuth every 30 degrees and sampling elevation uniformly between 5 and 25 degrees. We render an RGB image, instance segmentation, depth map, normal image and object-level RGB and mask images. Thus, each scene in MOOS has 156 renderings, all at 1K² resolution. The entire dataset consists of 1,560,000 images for the 10K scenes and is generated using PyTorch3D [34] in 35 hours. As truncation is a special case of occlusion, we compute the occlusion rate for each object instance by comparing its intact object mask and instance mask. Based on occlusion rate, we separate all object instances in MOOS into two subsets, the Occ set and the NoOcc set with 351K and 118K queries respectively, depending on whether occlusion exists. Both subsets are split into train, val and test splits in an 8:1:1 ratio. To investigate how shape retrieval methods generalize to unseen objects, we set aside $10\%$ of the objects (randomly selected) and their corresponding images from the train split.

6 Metrics

Typical shape retrieval metrics require both the ground-truth (GT) shape and suggested 3D shapes to compute either accuracy or point cloud based reconstruction scores [30, 23], such as Chamfer Distance (CD), Normal Consistency (NC) and $\mathrm{F}1^{t}$ . [23, 24] also adopt $\text{AP}^{\text{mesh}}$ from Gkioxari et al. [13] to compute average precision weighted by recall at different IoU thresholds based on $\mathrm{F}1^{0.3}$ scores. These metrics are view-independent in a way that ignores potential occlusions in real images. Moreover, reconstruction metrics are sensitive to the point sampling method and the number of sampled points. It is also unclear whether they robustly reflect how close retrieved shapes are to the ground truth. We perform quantitative and qualitative analysis (see supplement for details) to select appropriate metrics for shape retrieval. We summarize the details of the metrics we chose here.

6.1 View-independent Metrics

We use Accuracy (Acc) and Category Accuracy (CatAcc) to measure the average number of retrieved objects with correct shapes and categories, respectively. $Acc_{k}$ indicates whether the top-K retrieved objects contain the GT 3D shape. However, accuracy metrics do not measure whether retrieved shapes are structurally similar to the GT, but not exact matches. Thus we use Chamfer Distance (CD) and LFD L1 [4] to measure shape similarity given the GT and retrieved shapes. We use $\text{CD}_{k}$ and $\text{LFD}_{k}$ to denote the average score over the top-K retrievals. For CD, we sample 4K points using farthest point sampling for each shape. For LFD L1, we represent each 3D shape as its LFD features computed from a set of pre-rendered binary masks. Assuming all 3D shapes are normalized and centered in the same canonical orientation, we render 200 views by placing cameras on the vertices of 10 randomly rotated dodecahedrons. We use average L1 distance over all views to measure shape similarity.

6.2 View-dependent Metrics

In image queries, the object to be retrieved might be occluded by other objects or self-occluded, in which case matching the visible parts to the ground truth from a similar viewpoint is desirable. Hence, we propose a set of view-dependent metrics between objects in images and 3D shapes. These metrics can quantify retrieval performance for cases without corresponding 3D shape annotations. To compute the view-dependent metrics, we render each retrieved 3D shape under the same viewpoint and pose as the object in the image.

Mask IoU. The mask IoU is computed between the unoccluded (complete) rendered binary masks for the GT and retrieved 3D shape. vLFD L1. Normally LFD is computed from a set of multiview binary masks to measure region and contour similarity. Here, we only consider a single-view LFD by concatenating region-based and contour-based descriptors for one mask. The mask-to-mask LFD distance is defined as the $L1$ distance between LFDs of silhouettes of the ground truth and retrieved 3D shapes. LPIPS [46]. The mask IoU and vLFD metrics only account for similarity of the object silhouette, ignoring content in the interior of the object. To capture the perceptual similarity between the object as observed in the image and the retrieved shape, we feed resized patches of the query image and shape renders with occluded area masked out and cropped from the bounding box of the object mask into a pretrained VGG [35] model to compute LPIPS scores. Lower scores indicate better matches.

Models	NoOcc val set							Occ val set
Models	$Acc_{1}$ $\uparrow$	$CatAcc$ $\uparrow$	CD $\downarrow$	LFD $\downarrow$	MIoU $\uparrow$	vLFD $\downarrow$	LPIPS $\downarrow$	$Acc_{1}$ $\uparrow$	$CatAcc$ $\uparrow$	CD $\downarrow$	LFD $\downarrow$	MIoU $\uparrow$	vLFD $\downarrow$	LPIPS $\downarrow$
NoOcc-CMIC	84.1	99.3	0.631	0.430	0.961	0.137	0.159	48.3	85.9	2.090	1.391	0.735	1.023	0.224
Occ-CMIC	86.1	99.5	0.552	0.372	0.966	0.125	0.157	81.6	98.4	0.755	0.499	0.938	0.260	0.149
All-CMIC	87.8	99.7	0.485	0.330	0.972	0.108	0.153	82.6	98.6	0.696	0.469	0.941	0.251	0.147

Table 3: Cross-evaluation on the no occlusion (NoOcc), occlusion (Occ), and combined (All) sets of MOOS with seen objects during training. Training on all cases leads to the best performance in both occluded and occlusion-free image queries.

MOOS Set	Objects	Retrieval Accuracy			View-independent Metrics				View-dependent Metrics
MOOS Set	Objects	$Acc_{1}$ $\uparrow$	$Acc_{5}$ $\uparrow$	CatAcc $\uparrow$	CD $\downarrow$	$\text{CD}_{5}$ $\downarrow$	LFD $\downarrow$	$\text{LFD}_{5}$ $\downarrow$	MIoU $\uparrow$	$\text{MIoU}_{5}$ $\uparrow$	vLFD $\downarrow$	$\text{vLFD}_{5}$ $\downarrow$	LPIPS $\downarrow$	$\text{LPIPS}_{5}$ $\downarrow$
NoOcc	seen	87.8	99.4	99.7	0.485	3.143	0.330	2.158	0.972	0.653	0.108	1.065	0.153	0.310
NoOcc	unseen	58.9	88.7	94.9	1.710	3.258	1.126	2.229	0.831	0.620	0.525	1.195	0.217	0.322
Occ	seen	82.6	96.6	98.6	0.696	3.204	0.469	2.177	0.941	0.661	0.251	1.168	0.147	0.274
Occ	unseen	49.6	79.4	93.4	2.075	3.350	1.374	2.277	0.787	0.629	0.772	1.298	0.211	0.286

Table 4: Comparison on MOOS for query objects seen during training vs for unseen objects. Unseen queries are more challenging.

7 Results

Retrieval for non-occluded and occluded objects. Prior work [10, 14, 30] trains and evaluates on images with non-occluded and non-truncated objects. This is equivalent to our NoOcc set. Table 1 shows that models trained only on such easy retrieval queries generalize poorly to hard cases with occlusions (Occ). In contrast, models trained on both unoccluded and occluded sets (All) outperform models that only have access to one of the sets. Table 3 shows that performance of NoOcc-CMIC drops significantly when testing on occluded objects ( ${\sim}36\%$ drop on Acc, ${\sim}1.46$ increase on CD). Occ-CMIC, performs not only much better than NoOcc-CMIC on the Occ set, but also achieves comparable results on the NoOcc set, which is unsurprising since the occlusion rates of objects in Occ span from 0 to 1 so there exist objects that are barely occluded. The fact that All-CMIC outperforms both NoOcc-CMIC and Occ-CMIC on all metrics provides evidence that single-view shape retrieval models trained to handle occluded and unoccluded objects together generalize better. Fig. 4 shows how the metrics changes as the occlusion of the input images increases. Note both Acc and CD degrade drastically with increasing occlusion rate, while LPIPS is relatively stable. All remaining experiments are conducted with the All-CMIC model.

Val Set	Shape	View-independent Metrics		View-dependent Metrics
Val Set	Shape	CD $\downarrow$	LFD $\downarrow$	MIoU $\uparrow$	vLFD $\downarrow$	LPIPS $\downarrow$
NoOcc	3D-FUTURE	0.485	0.330	0.972	0.108	0.153
NoOcc	Scan2CAD	5.442	2.735	0.433	1.574	0.409
Occ	3D-FUTURE	0.69	0.469	0.941	0.251	0.147
Occ	Scan2CAD	5.430	2.690	0.452	1.695	0.361

Table 5: Evaluation on different shape database of ShapeNet 3D shapes from Scan2CAD on MOOS seen object images using the All-CMIC model. This regime is extremely challenging and as the shape database is different from the original annotated shape database, there is no ground-truth shape for measuring the accuracy.

Generalization to unseen object queries. Generalization to retrieval of 3D shapes for unseen object queries is crucial for deployment in practical applications. Our MOOS dataset contains unobserved object queries for both the NoOcc and Occ sets. Table 4 reports how the All-CMIC model performs on the NoOcc and Occ val set for unseen objects. Although the high $CatAcc$ numbers indicate shapes matching the correct category are retrieved, $Acc_{1}$ is noticeably lower by ${\sim}30\%$ than results for seen object queries. The view-independent metrics CD and LFD also increase by about 1.23 and 0.80 respectively for non-occluded objects. For view-dependent metrics, the small increase in LPIPS ( ${\sim}0.064$ ) shows that retrieved shapes are still perceptually similar to the ground truth shape under the same viewpoint. Fig. 5 shows examples for seen and unseen object queries.

Generalization to similar shapes. We test generalization of our model trained on 3D-FUTURE shapes to similar 3D shapes by retrieving from a different shape database. Instead of using 3D-FUTURE shapes as our retrieval database, we retrieve from 1,779 ShapeNet shapes used in Scan2CAD. Note that accuracy-based metrics are unavailable since there is no ground truth 3D shape. Results (Table 5) show that this regime is quite challenging. One important factor is that ShapeNet shapes are not as cleanly annotated as 3D-FUTURE. In particular, incorrect shape dimensions strongly impact metrics by breaking the correct scale assumption.

Models	Retrieval Accuracy			View-independent Metrics				View-dependent Metrics
	$Acc_{1}$ $\uparrow$	$Acc_{5}$ $\uparrow$	CatAcc $\uparrow$	CD $\downarrow$	$\text{CD}_{5}$ $\downarrow$	LFD $\downarrow$	$\text{LFD}_{5}$ $\downarrow$	MIoU $\uparrow$	$\text{MIoU}_{5}$ $\uparrow$	vLFD $\downarrow$	$\text{vLFD}_{5}$ $\downarrow$	LPIPS $\downarrow$	$\text{LPIPS}_{5}$ $\downarrow$
CMIC-Pix3D (500ep)	47.6	66.2	94.1	0.761	1.429	1.399	2.353	0.765	0.607	0.780	1.341	0.311	0.335
CMIC-MOOS-ft (5ep)	55.1	80.3	95.5	0.608	1.319	1.234	2.279	0.808	0.625	0.650	1.273	0.299	0.330
CMIC-Scan2CAD (500ep)	27.7	42.1	85.2	1.161	1.391	1.482	1.860	0.489	0.414	2.795	3.007	0.298	0.318
CMIC-MOOS-ft (5ep)	27.7	47.2	86.0	1.111	1.278	1.444	1.776	0.490	0.422	2.810	2.970	0.298	0.316

Table 6: Evaluation on Pix3D and Scan2CAD real image datasets. The pretrained-then-finetuned MOOS models (CMIC-MOOS-ft) outperform baselines (CMIC-Pix3D and CMIC-Scan2CAD) trained on real datasets, demonstrating the benefit of our synthetic pretraining.

Models	objects	Easy set		Hard set
Models	objects	$Acc_{1}\uparrow$	$Acc_{5}\uparrow$	$Acc_{1}\uparrow$	$Acc_{5}\uparrow$
CMIC-Pix3D	seen	74.9	89.2	47.2	73.5
CMIC-Pix3D	unseen	0.3	7.6	0.0	17.4
CMIC-MOOS-ft	seen	72.7	93.1	64.8	89.3
CMIC-MOOS-ft	unseen	34.3	62.8	24.2	62.4

Table 7: Performance breakdown of CMIC models on Pix3D “Easy set” and “Hard set”. Note the significant improvements of the MOOS pretrained model on unobserved objects.

Transfer from synthetic to real datasets. We demonstrate transfer of CMIC models pretrained on MOOS to two datasets with real images, Pix3D and Scan2CAD. In Table 6, we compare with CMIC-Pix3D and CMIC-Scan2CAD which are trained directly on the corresponding datasets for 500 epochs using the same hyperparameters. CMIC-Pix3D uses predicted object segmentations from Mask2Former [6] instead of the ground truth shape mask. For CMIC-Scan2CAD, we use the object mask from the dataset which may reflect occlusions as multiple objects are present in scene. CMIC-MOOS-ft is first pretrained on the All set of MOOS and then fine-tuned for 5 epochs on the respective real-world image dataset. On Pix3D, CMIC-MOOS-ft significantly surpasses CMIC-Pix3D with 7.5% higher accuracy, 0.153 lower CD, and 0.0122 lower LPIPS. It has better performance on objects unobserved or containing occlusions (see Table 7). On Scan2CAD, the pretrain-then-finetune strategy achieves competitive performance to CMIC-Scan2CAD. Although retrieval accuracy is the same, better performance on view-independent and view-dependent metrics except vLFD implies that CMIC-MOOS-ft retrieves 3D shapes that are more similar to the desired 3D shape in terms of geometry. Figs. 6 and 7 shows qualitative examples.

Models	$Acc_{1}$ $\uparrow$	CD $\downarrow$	LFD $\downarrow$	MIoU $\uparrow$	vLFD $\downarrow$	LPIPS $\downarrow$
CMIC-Pix3D	47.6	0.761	1.400	0.764	0.785	0.310
0-shot inference	20.9	1.510	2.135	0.624	1.286	0.360
2%-shot-ft	29.1	1.276	1.919	0.669	1.119	0.344
10%-shot-ft	44.1	0.904	1.517	0.751	0.842	0.312
50%-shot-ft	50.1	0.731	1.364	0.784	0.720	0.299
100%-shot-ft	55.1	0.613	1.238	0.806	0.654	0.292

Table 8: Few-shot MOOS pretrained CMIC fine-tuned on Pix3D. Even fine-tuning on fractions of Pix3D gives good performance.

Few-shot fine-tuning. We conduct few-shot fine-tuning on Pix3D (see Table 8) where $n$ %-shot-ft indicates a CMIC model pretrained on MOOS and finetuned on $n$ % Pix3D. We show that training on MOOS data enables fine-tuning performant models on Pix3D data with considerably less data. The relatively poor results of zero-shot inference on Pix3D reflect the domain shift from synthetic to real images. When finetuning on 50% Pix3D data, we still outperform CMIC-Pix3D. Even with only 10% Pix3D data, performance is comparable to the whole training data. We conclude that synthetic data is useful for training generalizable models and the effort spent on data collection and annotation for real image-shape pairs can be significantly reduced.

8 Conclusion

We studied the generalization of single-view 3D shape retrieval to occlusions and unseen objects. We standardized two real image datasets and presented a synthetic dataset generation pipeline that allowed us to systematically evaluate the performance of shape retrieval for inputs with occlusions and with 3D shapes or objects unseen during training. We show that training on synthetic data with occlusions helps significantly improve performance. Though results are promising, the task remains challenging and in particular generalization to unseen categories of objects is an open question for future work. We hope our work enables more rigorous evaluation of single-view 3D shape retrieval in practical settings.

Acknowledgments. This work was funded in part by a CIFAR AI Chair, a Canada Research Chair, NSERC Discovery Grant, NSF award #2016532, and enabled by support from WestGrid and Compute Canada. Daniel Ritchie is an advisor to Geopipe and owns equity in the company. Geopipe is a start-up that is developing 3D technology to build immersive virtual copies of the real world with applications in various fields, including games and architecture. We thank Weijie Lin for help with initial development of the metrics code, and Sonia Raychaudhuri, Yongsen Mao, Sanjay Haresh and Yiming Zhang for feedback on paper drafts.

References

Avetisyan et al. [2019] Armen Avetisyan, Manuel Dahnert, Angela Dai, Manolis Savva, Angel X Chang, and Matthias Nießner. Scan2CAD: Learning CAD model alignment in RGB-D scans. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 2614–2623, 2019.
Bai et al. [2016] Song Bai, Xiang Bai, Zhichao Zhou, Zhaoxiang Zhang, and Longin Jan Latecki. GIFT: A real-time and scalable 3D shape search engine. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 5023–5032, 2016.
Chang et al. [2015] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
Chen et al. [2003] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On visual similarity based 3D model retrieval. Computer Graphics Forum, 22(3):223–232, 2003.
Chen and Zhang [2019] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 5939–5948, 2019.
Cheng et al. [2022] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Dai et al. [2017] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 5828–5839, 2017.
Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. of the International Conference on Learning Representations (ICLR), 2021.
Engelmann et al. [2021] Francis Engelmann, Konstantinos Rematas, Bastian Leibe, and Vittorio Ferrari. From points to multi-object 3D reconstruction. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 4588–4597, 2021.
Fu et al. [2020] Huan Fu, Shunming Li, Rongfei Jia, Mingming Gong, Binqiang Zhao, and Dacheng Tao. Hard example generation by texture synthesis for cross-domain shape similarity learning. Advances in neural information processing systems, 33:14675–14687, 2020.
Fu et al. [2021a] Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Cao Li, Zengqi Xun, Chengyue Sun, Yiyun Fei, Yu Zheng, Ying Li, et al. 3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics. In Proc. of the International Conference on Computer Vision (ICCV), 2021a.
Fu et al. [2021b] Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3D-Future: 3D furniture shape with texture. International Journal of Computer Vision (IJCV), 129:3313–3337, 2021b.
Gkioxari et al. [2019] Georgia Gkioxari, Jitendra Malik, and Justin Johnson. Mesh R-CNN. In Proc. of the International Conference on Computer Vision (ICCV), pages 9785–9795, 2019.
Grabner et al. [2019] Alexander Grabner, Peter M Roth, and Vincent Lepetit. Location field descriptors: Single image 3D model retrieval in the wild. In Proc. of the International Conference on 3D Vision (3DV), pages 583–593. IEEE, 2019.
Gümeli et al. [2022] Can Gümeli, Angela Dai, and Matthias Nießner. ROCA: robust CAD model retrieval and alignment from a single image. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 4022–4031, 2022.
Gupta et al. [2015] Saurabh Gupta, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Aligning 3D models to RGB-D images of cluttered scenes. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 4731–4740, 2015.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
He et al. [2018] Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, and Xiang Bai. Triplet-center loss for multi-view 3D object retrieval. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 1945–1954, 2018.
Hua et al. [2017] Binh-Son Hua, Quang-Trung Truong, Minh-Khoi Tran, Quang-Hieu Pham, Asako Kanezaki, Tang Lee, HungYueh Chiang, Winston Hsu, Bo Li, Yijuan Lu, et al. SHREC’17: RGB-D to CAD retrieval with ObjectNN dataset. In Proc. of the Eurographics Workshop on 3D Object Retrieval (3DOR), pages 25–32, 2017.
Izadinia et al. [2017] Hamid Izadinia, Qi Shan, and Steven M Seitz. IM2CAD. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 5134–5143, 2017.
Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
Kuo et al. [2020] Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, and Angela Dai. Mask2CAD: 3D shape prediction by learning to segment and retrieve. In Proc. of the European Conference on Computer Vision (ECCV), volume 1, page 3. Springer, 2020.
Kuo et al. [2021] Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, and Angela Dai. Patch2CAD: Patchwise embedding learning for in-the-wild shape retrieval from a single image. In Proc. of the International Conference on Computer Vision (ICCV), pages 12589–12599, 2021.
Langer et al. [2022] Florian Langer, Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. SPARC: Sparse render-and-compare for CAD model alignment in a single RGB image. In Proc. British Machine Vision Conference, 2022.
Lee et al. [2018] Tang Lee, Yen-Liang Lin, HungYueh Chiang, Ming-Wei Chiu, Winston Hsu, and Polly Huang. Cross-domain image-based 3D shape retrieval by view sequence learning. In Proc. of the International Conference on 3D Vision (3DV), pages 258–266. IEEE, 2018.
Li et al. [2015a] Bo Li, Yijuan Lu, Chunyuan Li, Afzal Godil, Tobias Schreck, Masaki Aono, Martin Burtscher, Qiang Chen, Nihad Karim Chowdhury, Bin Fang, et al. A comparison of 3D shape retrieval methods based on a large-scale benchmark supporting multimodal queries. Computer Vision and Image Understanding, 131:1–27, 2015a.
Li et al. [2019] Wenhui Li, Anan Liu, Weizhi Nie, Dan Song, Yuqian Li, Zjenja Doubrovski, Jo Geraedts, Zishun Liu, Yunsheng Ma, et al. SHREC 2019-monocular image based 3D model retrieval. In Proc. of the Eurographics Workshop on 3D Object Retrieval (3DOR), pages 1–8, 2019.
Li et al. [2015b] Yangyan Li, Hao Su, Charles Ruizhongtai Qi, Noa Fish, Daniel Cohen-Or, and Leonidas J Guibas. Joint embeddings of shapes and images via CNN image purification. ACM Transactions on Graphics (TOG), Proc. SIGGRAPH Asia, 34(6):1–12, 2015b.
Lin et al. [2021] Ming-Xian Lin, Jie Yang, He Wang, Yu-Kun Lai, Rongfei Jia, Binqiang Zhao, and Lin Gao. Single image 3D shape retrieval via cross-modal instance and category contrastive learning. In Proc. of the International Conference on Computer Vision (ICCV), pages 11405–11415, 2021.
Massa et al. [2016] Francisco Massa, Bryan C Russell, and Mathieu Aubry. Deep exemplar 2D-3D detection by adapting from real to rendered views. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 6024–6033, 2016.
Nguyen et al. [2022] Van Nguyen Nguyen, Yinlin Hu, Yang Xiao, Mathieu Salzmann, and Vincent Lepetit. Templates for 3D object pose estimation revisited: generalization to new objects and robustness to occlusions. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 6771–6780, 2022.
Pham et al. [2018] Quang-Hieu Pham, Minh-Khoi Tran, Wenhui Li, Shu Xiang, Heyu Zhou, Weizhi Nie, Anan Liu, Yuting Su, Minh-Triet Tran, Ngoc-Minh Bui, et al. SHREC’18: RGB-D object-to-CAD retrieval. In Proc. of the Eurographics Workshop on 3D Object Retrieval (3DOR), volume 2, page 2, 2018.
Ravi et al. [2020] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3D deep learning with Pytorch3D. arXiv preprint arXiv:2007.08501, 2020.
Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Sun et al. [2018] Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B Tenenbaum, and William T Freeman. Pix3D: Dataset and methods for single-image 3D shape modeling. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 2974–2983, 2018.
Tangelder and Veltkamp [2008] Johan WH Tangelder and Remco C Veltkamp. A survey of content based 3D shape retrieval methods. Multimedia tools and applications, 39:441–471, 2008.
Tatarchenko et al. [2019] Maxim Tatarchenko, Stephan R Richter, René Ranftl, Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do single-view 3D reconstruction networks learn? In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 3405–3414, 2019.
Uy et al. [2021] Mikaela Angelina Uy, Vladimir G Kim, Minhyuk Sung, Noam Aigerman, Siddhartha Chaudhuri, and Leonidas J Guibas. Joint learning of 3D shape retrieval and deformation. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 11713–11722, 2021.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Wang et al. [2018a] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3D mesh models from single RGB images. In Proc. of the European Conference on Computer Vision (ECCV), pages 52–67, 2018a.
Wang et al. [2018b] Yaming Wang, Xiao Tan, Yi Yang, Xiao Liu, Errui Ding, Feng Zhou, and Larry S Davis. 3D pose estimation for fine-grained object categories. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018b.
Webber et al. [2010] William Webber, Alistair Moffat, and Justin Zobel. A similarity measure for indefinite rankings. ACM Transactions on Information Systems (TOIS), 28(4):1–38, 2010.
Xiang et al. [2014] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 75–82. IEEE, 2014.
Xiang et al. [2016] Yu Xiang, Wonhui Kim, Wei Chen, Jingwei Ji, Christopher Choy, Hao Su, Roozbeh Mottaghi, Leonidas Guibas, and Silvio Savarese. ObjectNet3D: A large scale database for 3D object recognition. In Proc. of the European Conference on Computer Vision (ECCV), 2016.
Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018.

In this supplement, we provide more details about the CMIC model (Appendix A), and the generation process (Section B.1) and statistics (Section B.2) for our synthetic MOOS dataset. We also provide a detailed analysis of metrics for shape retrieval (Appendix C), and additional results (Appendix D).

Appendix A CMIC Model Details

We present an overview of the CMIC [30] architecture in Fig. 8. Given a RGB image that may contain multiple objects and a binary mask indicating the object of interest obtained from either ground-truth or prediction, the object RGB query and mask are cropped from the input image and mask using 2D bounding box computed from the input mask. All image inputs, including object query, object mask and 3D shape multiview renderings, are resized to 224x224 for passing to the image encoders. For the image encoders, we use ResNet [17]-based encoders pretrained on ImageNet [22], R50 for the object query and R18 for the rendered shape images. We adapt the encoders for our task by modifying the first convolution layer to take inputs with 4-channel (RGB+mask) for the object query (R50) and 1-channel (greyscale) for the shape multiview renderings (R18). The output dimensions of the last FC layers of the encoders are also modified to obtain embeddings of dimension 128. Given query features $\{f^{i}\}_{i=1}^{B}$ and query-attended shape features $\{f^{S_{k}}_{i}\}_{i=1}^{B}$ , we define the similarity function $D(x,y)$ with a temperature hyperparameter $\tau$ as follows:

D(x,y):=e^{\frac{1}{\tau}\left(\frac{x}{\|x\|}\right)^{T}\left(\frac{y}{\|y\|}\right)}

Appendix B Multi-Object Occlusion Scene

B.1 Layout generation

The layout generation for a multi-object occlusion scene is implemented by iteratively inserting newly sampled 3D shapes into the existing layout to make sure its 2D bounding box does not intersect with any 2D bounding boxes of previously selected 3D shapes. Each scene is composed of 4 3D-FUTURE [12] objects from 4 categories (chair, bed, table and sofa) respectively. The goal is to form a scene where objects are close to each other to observe occlusion, but they don’t overlap. We first randomly sample 4 objects from 4 categories. For each 3D shape $S_{i}$ remaining to be placed, we rotate it by a random angle around the up axis. The initial position $\boldsymbol{p}_{i}^{0}$ is initialized as the origin if no shapes are placed before, otherwise set to the average position of all placed shapes. Then a unit vector $\boldsymbol{v}_{i}$ is randomly sampled as a moving direction. A base distance scalar moving distance $d_{i}^{0}$ is the sum of short sides of all placed objects. The position of the new shape is then $d_{i}\cdot\boldsymbol{v}_{i}+\boldsymbol{p}_{i}^{0}$ . If there are any intersections between 2D bounding boxes of placed objects and the new object, we iterative increase $d_{i}$ by 0.05 until no intersections are observed after $N$ times. The final shape position is calculated as $(d_{i}^{0}+0.05\times N)\cdot\boldsymbol{v}_{i}+\boldsymbol{p}_{i}^{0}$ (see the red dashed arrows in the top-down view layouts in the main paper). The vertical position of the shape is set so that the shape is positioned on the floor plane. We repeat the procedure until all shapes are placed into the scene. Note that we only consider intersections among 2D bounding boxes from the top-down view as a simplification instead of using more expensive physics-based collision checks.

B.2 MOOS Statistics

Dataset	Split	Train Set	Val Set		Test Set
Dataset	Split	Train Set	Seen	Unseen	Seen	Unseen
Pix3D [36]	Easy	2,998 (143)	1,065	1,055 (179)	-	-
Pix3D [36]	Hard	2,451 (143)	903	178 (62)	-	-
Scan2CAD [1]		39K (1548)	11K (560)		-	-
MOOS	NoOcc	85K (5K)	10K	6K (618)	12K	6K (618)
MOOS	Occ	249K (5K)	31K	17K (618)	35K	17K (618)
	All	334K (5K)	41K	23K (618)	47K	23K (618)

Table 9: Statistics of the number of object queries in the splits for the different datasets. Numbers in parentheses represent the number of unique 3D shapes. For the seen split, the correct shape is found in the training set.

	Chair	Bed	Table	Sofa	All
# unique shapes	1,639	1,073	1,038	2,457	6,207
# objects w/ occlusion	78,745	93,877	86,807	90,580	350,009
# objects w/o occlusion	35,752	26,033	30,335	27,621	119,741
# invisible objects	5,503	90	2,858	1,799	10,250

Table 10: Statistics of rendered objects in MOOS generation for each category. We report the number of number of unique shapes as well as number of object queries with and without occlusions. In some generated scenes, some objects are completely occluded or truncated. We do not includes those in our dataset of training or evaluation samples.

In Table 9, we provide statistics for the train and val split we create for MOOS. As part of of our work, we also provide standardized splits for image-to-shape retrieval with Pix3D [36] with splits that includes both seen and unseen shapes in the validation set. For Scan2CAD [7], the validation set consists of only unseen shapes. Compared to the other two datasets, MOOS provides a large set of rendered images with and without occlusion for training. In the NoOcc set, among all object queries we have 85K for training, 16K for validation, and 18K for test. In the Occ set, among all object queries we have 249K for training, 48K for validation, and 52K for test. For the 618 3D shapes that we set aside for exploring generalization to unseen objects, we have 23K object queries for validation and 23K for test.

We list detailed statistics of the objects in MOOS in Table 10 where for each category we count the number of unique shapes, the number of object instances in all scene renderings with/without some occlusions and the number of invisible objects due to complete occlusion or truncation. In Fig. 9, we plot a histogram of the number of unique shapes vs. the frequency of shapes for each category in MOOS. In Fig. 10, we plot a histogram of the number of object occurrences vs. occlusion rates for each category in MOOS.

Appendix C Metrics Analysis

A variety of different metrics for comparing shape similarity has been proposed and used for shape retrieval. Here we describe the analysis that we conduct to determine which metrics are more appropriate for single-view shape retrieval. We use the metrics we find to work well in our main paper. To conduct our analysis, we consider a target shape query, and rank all 3D shapes in the database using a given similarity metric to determine the usability and stability of the metric for retrieval.

We categorize metric candidates into three groups: (1) Point-cloud based reconstruction metrics including CD, NC and $\text{F1}^{t}$ ; (2) Shape2shape metrics including voxel IoU, neural shape descriptor and LFD [4] and (3) View-dependent metrics including mask IoU, vLFD, normal IoU, normal L2 and LPIPS [46]. Since our practical goal is to select metrics appropriate for the single-view shape retrieval task we perform a qualitative analysis over several hundred query examples. A rigorous study of metric design for single-view shape retrieval is an interesting direction for future work.

Based on our analysis we select the reconstruction metric CD, LFD as a view-based shape2shape metric, and several view-dependent metrics (mask IOU, vLFD, and LPIPS). In the following sections, we provide more details on how we selected these metrics.

C.1 Reconstruction Metrics

Reconstruction metrics computed from sampled point clouds are widely used to measure the quality of reconstructed or generated shapes against a target shape [41, 13]. These metrics assume that the 3D shapes being compared are well aligned and scaled. Several prior work [36, 23, 24, 30] has opted to use point-wise reconstruction metrics to evaluate the single-view shape retrieval task.

We consider three popular reconstruction metrics: Chamfer distance (CD), Normal consistency (NC), and point-wise F1 at distance threshold $t$ (F1^t) for measuring the similarity between two point clouds. Given two point clouds $P$ , $Q$ with positions and normals, let $\Lambda_{P,Q}=\left\{\left(p,\arg\min_{q}\|p-q\|\right)|p\in P,q\in Q\right\}$ denote the set of pairs $(p,q)$ where $q$ from $Q$ is the nearest neighbor of p and $n_{p}$ denote the unit normal vector at point $p$ .

The Chamfer distance (CD) and Normal consistency (NC) measures the similarity of two point clouds based on either distance of points (CD) or similarity of unit normals (NC). Formally, CD and NC are defined as:

The $\textbf{F1}^{t}$ score measures the percentage of points that are accurately reconstructed by taking into account both precision (how close reconstructed points are to ground truth points) and completeness (the percentage of ground truth points that are covered) for a distance threshold $t$ that controls the strictness of score. It is robust to the geometric layout of outliers [38] but does not have a scale-invariance property. We set $t=0.1$ for our analysis.

Score Variance $\downarrow$	FAS	FPS
CD	5.11e-03	2.25e-03
NC	6.49e-05	6.95e-05
$\text{F1}^{0.1}$	54.89	91.14

Table 11: Metric score variance over different number of points. We compute the variance of the score for shape-to-shape pairs with different samplings of number of points (1K, 2K, 4K, and 10K), and compare the variance for face area-weighted sampling (FAS) vs farthest point sampling (FPS). We find that

\text{F1}^{0.1}

has relatively high variance (note that we report NC from 0 to 1, while we report

\text{F1}^{0.1}

from 0 to 100), and that the combination of CD with FPS gives the least variance.

Metrics	#points	FAS			FPS
Metrics	#points	mMS $\downarrow$	mRD $\downarrow$	RBO $\uparrow$	mMS $\downarrow$	mRD $\downarrow$	RBO $\uparrow$
CD	1K	13.74	2.90	0.90	12.59	2.50	0.92
	2K	11.62	2.12	0.94	10.97	1.96	0.95
	4K	9.91	1.65	0.96	8.61	1.51	0.97
NC	1K	16.60	7.62	0.78	16.53	8.24	0.78
	2K	15.79	5.06	0.83	15.73	5.40	0.83
	4K	14.82	3.70	0.88	14.64	3.76	0.88
$\text{F1}^{0.1}$	1K	17.54	26.81	0.64	17.69	33.05	0.60
	2K	16.81	9.88	0.75	17.25	12.67	0.71
	4K	15.59	4.95	0.84	16.13	6.25	0.81

Table 12: Stability of pointcloud-based reconstruction metrics for shape ranking with different number of sampled points and comparing face area weighted sampling (FAS) vs farthest point sampling (FPS). For measuring stability, we use the mean number of moved shape ranks (mMS), the mean rank distance (mRD) of moved shapes, and the rank biased overlap (RBO). We find that CD with FPS gives the most stable rankings.

We compare the ranking quality among the three reconstruction metrics mentioned above computed from 2K sampled points using face area–weighted sampling (see examples in Fig. 11). Under the same sampling condition, NC and $\text{F1}^{t}$ tend to rank unrelated shapes higher compared to CD, which either present distinguishable geometric structure or have wrong categories. Note that the ranking score and result produced by F1 is subject to the threshold $t$ and the shape scale. With a larger threshold, $\text{F1}^{t}$ can rank shapes better. Based on our findings, we argue that CD is a simple and representative point cloud-based metric for shape retrieval that outputs reasonable shape ranking without the need to tweak hyperparameters.

We also investigate the effect of the number of points and the sampling method on the shape ranking. For each shape, we sample 1K, 2K, 4K and 10K points using face area-weighted sampling (FAS) and farthest point sampling (FPS). We observe that FPS is better at capturing thin and small structure of the shape when the number of points is small. For CD, FPS results in smaller metric score variance over different number of points (see Table 11). We quantify the stability of each metric under different number of points and sampling methods (see Table 12) using a set of stability metrics that we define. Given a base shape ranking computed from 10K points and a different sampling setting, we calculate the mean moved shapes (mMS) measuring how many shapes moved their ranks on average, the mean rank difference (mRD) over moved shapes, and Rank Biased Overlap [43] (RBO) measuring the similarity between two ranked lists. CD with FPS sampling gives the most stable rankings over different number of points. Considering both computation efficiency and stability, we find that CD with 4K sampled points using farthest point sampling is an appropriate metric for retrieval task. We note that Sun et al. [36] has conducted a user-study and found that CD and EMD to correlate better with human judgement than voxel IoU, with CD having the best correlation. One weakness of CD is that its score is potentially less interpretable than NC and $\text{F1}^{t}$ .

C.2 Shape2shape Metrics

Besides point clouds, 3D shapes can be represented using other formats such as voxels, multi-view images and neural descriptors. Therefore, we assess the potential of metrics based on such alternative 3D representations. VoxIoU computes the intersection-over-union between voxelizations of two shapes. We generate solid voxels at $128^{3}$ resolution using binvox¹¹1www.patrickmin.com/binvox. Similar to point cloud–based metrics, this metric is sensitive to scale and alignment issues. For neural shape descriptors, we use IMNET [5] as the shape encoder due to its simple but effective architecture. We train an IMNET model on 3D-FUTURE shapes from the first 1,500 3D-FRONT [11] scenes: 1083 unique objects including 312 chairs, 177 beds, 351 sofas and 253 tables. We compute cosine similarity between two normalized IMNET embeddings. Lastly, we represent each 3D shape as light-field descriptor, LFD [4], computed from a set of pre-rendered binary masks. Assuming all 3D shapes are normalized and centered in the same canonical orientation, 200-view renderings are obtained by placing cameras on vertices of 10 randomly rotated dodecahedrons. We use averaged L1 distances over all views to measure shape similarity. In general, these three shape2shape metrics produce shape rankings with comparable quality in a way that they may rank less similar shapes higher (see voxIoU in the 1st example, IMNET and LFD in the 2nd example in Fig. 12). As the IMNET model is trained on a limited number of 3D shapes, we believe the neural descriptor-based metric has potential for producing better rankings with more training data and more advanced neural shape descriptor models. Taking into account the computational simplicity and robustness, we choose LFD over voxIoU and IMNET.

C.3 View-dependent Metrics

We compare the view-dependent metrics we use in the main paper: mask IOU (MIoU), and single-view LFD L1 distance (vLFD), LPIPS [46] with two view-dependent normal based metrics inspired by [24], normal $L2$ distance (nL2) and normal IoU (nIoU). To evaluate the ranking quality of the view-dependent metrics, the inputs are either mask images, neutral renderings or normal maps for each 3D shape rendered under 10 randomly sampled camera viewpoints using PyTorch3D (see Fig. 13).

For MIoU and vLFD, only the binary segmentation mask is used to rank shapes by computing mask IoU and single-view LFD $L1$ distance respectively. Since both MIoU and vLFD mainly measure similarity between object silhouettes without considering geometry variants inside the object, we consider measuring view-dependent normal similarity derived from patch based normal similarity to determine patch-wise matches during training of Patch2CAD [24]. In Patch2CAD, for a patch of normals, the authors compute a self-similarity histogram over all pairwise angular distances of the normals in the patch and then measure IoUs between different patches to obtain an orientation-independent metric. In our case, for image mask to shape retrieval evaluation, we only need to consider one pose. Thus, we consider two simplified and computationally efficient metrics that measures L2 difference between normals (nL2) and IoU over histogram of normals (nIoU). We then define nL2 and nIoU where the former is the average $L2$ distance between normals over all pixels on the object, and the latter computes the IoU between 2D normal histograms of two shapes. To build the normal histogram, we represent each normal vector as two angles for azimuth and elevation such that they fall into two bins partitioned on azimuth and elevation separately. The IoU between two 2D normal histograms is obtained by taking the average of bin-wise IoUs over all bins. We set the bin size to 10 for both azimuth and elevation. As normal IoU is invariant to actual positions of normals, two different shapes can share similar 2D normal histogram pattern (see Fig. 14). For perceptual similarity measurement LPIPS, we use neutral renderings (white colored shapes) to reduce the effect of textures and resize images to 100x100 for passing into a pretrained VGG model ²²2github.com/richzhang/PerceptualSimilarity. We show two views of two query shapes in Fig. 15.

We observe that MIoU and vLFD measure more coarse-grained and overall geometry consistency with the query shape since they only get access to mask information. We find that nL2 can produce competitive rankings against MIoU and vLFD when the shape possess relatively simple structure. When the local structure of the target shape becomes complicated, both nL2 and nIoU fail to return meaningful shape rankings. LPIPS [46] is more robust to capturing fine-grained structure similarity (see thin slat structure presented in top-5 rankings of LPIPS in Fig. 15) given different views of shapes.

Appendix D Additional Results

In this section, we present additional results. We conduct an ablation study (Section D.1) that shows that some of the proposed techniques such as color transfer for data augmentation and unique shape mining is not necessary for good image-to-shape retrieval. Finally, we present more qualitative results on Pix3D [36] and ScanNet [7] frames (Section D.2), focusing on examples where the top retrieved shape does not match the ground truth model exactly.

CMIC	$Acc_{1}$ $\uparrow$	CD $\downarrow$	LFD $\downarrow$	MIoU $\uparrow$	vLFD $\downarrow$	LPIPS $\downarrow$
Baseline	48.2	0.650	1.411	0.771	0.764	0.305
Crop object	47.6	0.7614	1.400	0.764	0.785	0.310
Color Transfer [30]	48.3	0.705	1.404	0.771	0.769	0.308
Unique Shape Mining	43.9	0.809	1.522	0.747	0.849	0.321
R18 encoders	47.4	0.797	1.423	0.764	0.782	0.309
ViT query encoder	47.7	0.716	1.421	0.769	0.773	0.307
Multihead attention	47.4	0.854	1.422	0.757	0.806	0.312
Stacked attention	46.8	0.825	1.416	0.755	0.826	0.313

Table 13: Ablations of CMIC [30] on Pix3D comparing training techniques and network submodule design choices. We find that it is not necessary to use color transfer or unique shape mining as was done in the original paper and implementation.

D.1 Ablation Study

We perform a set of ablation studies for CMIC on Pix3D in terms of training techniques and model architecture as shown in the two subgroups in Table 13. The baseline is a CMIC model taking as input the whole RGB image and applying several data augmentations including affine transformation, crop, flip and color jitter.

We find that object cropping degrades the performance on Pix3D where images mainly contain one salient object, but improves retrieval given inputs containing multiple objects like MOOS. In the CMIC paper, Lin et al. [30] proposes color transfer to augment image queries, but our results show that simpler color jitter can achieve nearly the same performance. The original implementation of CMIC³³3https://github.com/IGLICT/IBSR_jittor uses unique shape mining when preparing 3D shapes in a batch. Our experiments show that actually results in worse performance.

We also experiment with various settings for different components of CMIC, including using R18 for both encoders, using ViT [8] as the query encoder, and replacing the dot production attention with multi-head attention or stacked attention blocks. However, none of these variants show promising signals of performance improvement. For the ViT encoder, we use a small ViT model pretrained on ImageNet21K [22] with patch size of 32. Similar to ResNet-based encoders, we add a linear projection layer to process the extra mask channel and an MLP layer to embed feature into the embedding space of dimension 128. For multi-head attention, we use a common multi-head attention block from Transformer [40] with 6 heads and feature dimension of 384. For stacked attention, we stack two multi-head attention blocks together.

D.2 Qualitative Results

We show detailed qualitative results on Pix3D [36] and Scan2CAD [1] (see Figs. 16 and 17), including input RGB images with masks, view-dependent metrics and complete renderings of ground-truth and top@1 retrieved shapes. In this qualitative results, we focus on showing examples where the retrieve shape is not the exact same model as the ground-truth. However, we see that in many cases, the retrieved shape is quite similar to the ground truth.