
UniG3D: A Unified 3D Object Generation Dataset

Qinghong Sun1, Yangguang Li1, ZeXiang Liu1, Xiaoshui Huang2, Fenggang Liu1
Xihui Liu3, Wanli Ouyang1, Jing Shao1
1SenseTime Research 2Shanghai AI Lab 3The University of Hong Kong
{sunqinghong1,liyangguang}@sensetime.com
Equal Contribution; Corresponding Author
Abstract

Generative AI is having a transformative impact on various areas, including virtual reality, autonomous driving, the metaverse, gaming, and robotics. Among these applications, 3D object generation techniques are of particular importance: they open up new avenues for creating, customizing, and exploring 3D objects. However, the quality and diversity of existing 3D object generation methods are constrained by the inadequacies of existing 3D object datasets, including poor text quality, incomplete multi-modal data representation (2D rendered images and 3D assets), and limited dataset size. To resolve these issues, we present UniG3D, a unified 3D object generation dataset constructed by applying a universal data transformation pipeline to the Objaverse and ShapeNet datasets. This pipeline converts each raw 3D model into a comprehensive multi-modal representation <text, image, point cloud, mesh> by employing rendering engines and multi-modal models, which ensure the richness of textual information and the comprehensiveness of data representation. Notably, our pipeline is universal in that it can be applied to any 3D dataset, since it requires only raw 3D data. The data sources for our dataset are selected based on their scale and quality. We then assess the effectiveness of our dataset with Point-E and SDFusion, two widely recognized object generation methods tailored to the prevalent 3D representations of point clouds and signed distance functions. Our dataset is available at: https://unig3d.github.io.

1 Introduction

Figure 1: Overview of the data transformation pipeline of UniG3D. First, we use Blender [30] to render each 3D model into multi-view meshes and images. Next, we obtain colored point clouds by computing points from multiple random-view rendered images. Finally, we employ the CLIP-ViT [32] and multi-modal LLM [33] models to acquire rich text information. Additionally, we present some quadruples within the datasets.

Generative AI has revolutionized the way humans work and improved their efficiency, as this technology can understand human intentions and automatically generate the required content. In particular, there have been numerous generative works in virtual reality [9; 10; 11; 12], autonomous driving [15; 16], the metaverse [7; 8], gaming [5; 6], and robotics [14; 13]. In these applications, a crucial technique is 3D object generation [68; 69; 38; 37; 36; 47; 31; 76], which involves creating realistic or novel 3D representations that simulate and replicate real-world or imagined 3D objects. 3D object generation technology opens up new possibilities for creating, customizing, and exploring 3D objects, making it a valuable tool in various fields where 3D models play a crucial role.

Several recent works [38; 37; 36] have tackled the problem of 3D generation by optimizing 3D representations against a text-to-image model, without leveraging 3D data. Although these methods have demonstrated promising results, state-of-the-art approaches typically require about six GPU hours to produce a single sample, making it challenging to scale up 3D data generation with them. There are alternative methods for 3D object generation that make use of 3D data. Some incorporate text as a condition during the 3D generation process; despite promising early results, many of these works are limited to simple prompts or a narrow set of object categories due to the scarcity of 3D training data [72; 35; 68; 73; 74; 31]. Alternatively, some methods use a pre-trained text-to-image model to condition their 3D generation procedure [47; 76]. However, since most datasets lack image data, researchers are left to render each dataset individually, a time-consuming and resource-intensive process. Existing methods that employ image-conditioning as an alternative to text-conditioning face similar challenges. As a result, the quality of text information, the availability of 2D rendered data, and the scalability of the dataset are all crucial.

To resolve these issues, we construct a unified 3D object generation dataset, UniG3D, using ShapeNet [48] and Objaverse [34] as raw data sources. We develop a unified pipeline that converts raw 3D models into comprehensive multi-modal data, allowing researchers focusing on different target 3D representations or different input conditions to use it conveniently. Specifically, as shown in Fig. 1, we convert each raw 3D model into a <text, image, point cloud, mesh> quadruple by employing a 3D model rendering engine [30], BLIP [33], and the CLIP [32] model. The former is used for rendering 2D and 3D representations, while the latter two are used for generating high-quality textual information. Our proposed unified multi-modal data transformation pipeline requires only 3D data, which eliminates the need for any manual annotation effort and demonstrates its scalability. To illustrate its effectiveness, we construct our dataset using ShapeNet [48] and Objaverse [34] as data sources, given their scale and quality. Then, we validate the efficacy of our dataset with Point-E [47] and SDFusion [31], two typical object generation methods that are widely used for the prevalent 3D representations of point cloud and signed distance function. UniG3D offers three contributions:

  1. We construct a large-scale unified 3D object generation dataset with rich textual information and comprehensive multi-modal data.

  2. We propose a universal data transformation pipeline that can convert any 3D data into representations suitable for most 3D object generation methods.

  3. To validate the efficacy of our dataset, we conduct experiments under various input conditions and target 3D representations. Based on our empirical investigations, we present several valuable insights into the impact of various conditions, the efficacy of data expansion, and the significance of text quality.

2 Related Work

2.1 3D Generative Methods

Several recent works have explored the challenge of generating 3D models with conditioned inputs by optimizing the 3D representations based on a text-image matching objective. [38] introduces DreamFields, a method that leverages CLIP to optimize the parameters of a NeRF model without the need for 3D training data. More recently, [37] extends DreamFields by incorporating a pre-trained text-to-image diffusion model in place of CLIP, producing more coherent and complex objects. [36] builds upon this technique by converting the NeRF representation into a mesh and further refining the mesh representation through a secondary optimization stage. Although these approaches are capable of generating diverse and intricate objects or scenes, the optimization procedures often demand significant GPU time to converge, posing challenges for practical applications.

While the above primarily rely on optimizing against a 2D text-image model and do not utilize 3D data, alternative methods for conditional 3D object generation incorporate 3D data, sometimes in conjunction with text labels. [72] leverages paired text-3D data to generate models in a joint representation space. [35] employs a flow-based model to generate 3D latent representations and finds some text-to-3D capability when conditioning its model on CLIP embeddings. [73] and [74] employ a VQ-VAE with an autoregressive prior to sample 3D shapes conditioned on text labels. SDFusion [31] utilizes an encoder-decoder structure to compress 3D shapes into a compact latent representation, which is then used to train a diffusion model for text-to-3D generation. While many of these works demonstrate promising early results, they tend to be limited to simple prompts or a narrow set of object categories due to the limited availability of 3D training data. Point-E [47] addresses this problem by building a large-scale text-3D dataset; however, that dataset is not open source.

Alternatively, some methods [76; 47] use a pre-trained text-to-image model to condition their 3D object generation procedure with images. However, since most datasets lack image data, researchers are left with the task of rendering each dataset individually, which is a time-consuming and resource-intensive process. In addition to text-conditioning, existing methods also employ image-conditioning as an alternative approach. However, they face similar challenges as mentioned above.

2.2 3D Object Datasets

Many widely used 3D datasets collect synthetic CAD models from online repositories [48; 49; 50; 51; 34]. ShapeNet [48] stands out as the most prevalent dataset. It covers 55 common object categories with about 51,300 unique 3D models. Every object in this dataset is represented as a precise 3D mesh, providing detailed geometric information, and each model is supplemented by a "name" field carrying rich associated metadata. Objaverse [34] is an extensive dataset comprising over 800K 3D models accompanied by descriptive captions, tags, and animations. It surpasses existing 3D repositories in scale, number of categories, and the visual variety of instances within each category. However, the majority of its objects lack appropriate text information, and each object is stored in GLB format, posing a challenge for users who want to leverage the 3D data directly. Another line of work [52; 53; 54; 55; 56; 57; 59; 60; 61; 62] collects real 3D objects at a limited scale. MVImgNet [60] is a recent medium-sized dataset of multi-view images, which are conveniently obtained by shooting videos of real-world objects in daily life. It contains 6.5 million frames from 219,188 videos covering objects from 238 classes, with rich annotations of object masks, camera parameters, and point clouds.

As previously discussed, the current state of 3D object generation techniques is hindered by several factors that limit their quality: deficiencies in text quality, limited availability of 2D rendered data, and dataset scale. Apart from dataset scale, these limitations manifest in two ways. First, available text information is often imprecise and lacking in detail, providing only broad categories or incomplete descriptions that correlate weakly with the actual models. Second, existing datasets typically lack the data formats required for 3D object generation tasks, such as multi-view 2D rendered images, 3D point clouds, and multi-view meshes. For example, ShapeNet [48] only provides raw meshes, while Objaverse [34] exclusively offers raw 3D models in GLB format.

3 The UniG3D Dataset

Table 1: Statistics of the four representations in the two UniG3D datasets.
Dataset          | Mesh      | PCL  | Image      | Text
UniG3D-ShapeNet  | 500K      | 50K  | 1 million  | 50K
UniG3D-Objaverse | 5 million | 500K | 10 million | 500K

In this section, we describe the data transformation process of UniG3D. The raw 3D data and associated text information are gathered from two 3D datasets, chosen among various candidates based on their scale and significance. The first is ShapeNet [48], a classic dataset with about 51K 3D objects across 55 annotated categories. The second is Objaverse [34], a large-scale 3D dataset with approximately 800K 3D objects, although the category of most objects is unknown. As shown in Fig. 1, we convert each 3D model into four representations: mesh, image, point cloud, and rich text information. We describe the transformation pipeline for each representation in detail below.

3.1 The Construction of Quadruples

As illustrated in Fig. 1, we present the unified data transformation pipeline employed in our dataset. This pipeline transforms raw 3D data and raw text information into a unified quadruple representation. It begins by leveraging a powerful 3D model rendering engine, which enables the generation of multiple 2D and 3D representations from the raw data. Furthermore, our pipeline incorporates state-of-the-art multi-modal models to generate high-quality textual information. By employing these models, we extract meaningful and descriptive text from the 3D representations, ensuring that our dataset includes comprehensive and accurate textual information that complements the visual aspects of the objects.

Mesh. For each 3D model, we employ Blender [30], a versatile software tool that supports various 3D formats and incorporates an optimized rendering engine, to generate ten multi-view meshes in the OBJ format. These meshes are rendered using a z-circular camera pose, ensuring comprehensive coverage of the object from different viewpoints. Specifically, the views are evenly spaced at intervals of 36 degrees to capture a wide range of perspectives. Additionally, to facilitate model training [31], we also provide the models in signed distance function (SDF) format, which offers a convenient representation for 3D object generation tasks.
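
The paper states that SDF-format models are additionally provided to facilitate training [31], but does not name the conversion tool. As a rough sketch of what such a conversion could look like, the following samples a signed distance grid from one exported OBJ mesh with trimesh; the library choice, grid resolution, and sign convention are all assumptions.

```python
import numpy as np
import trimesh

# Load one of the exported OBJ meshes (path is illustrative).
mesh = trimesh.load("model_view_00.obj", force="mesh")

# Normalize the mesh into the unit cube so SDF grids are comparable across models.
mesh.apply_translation(-mesh.bounding_box.centroid)
mesh.apply_scale(1.0 / max(mesh.extents))

# Build a regular 64^3 grid of query points in [-0.5, 0.5]^3 (resolution is assumed).
res = 64
axis = np.linspace(-0.5, 0.5, res)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)

# trimesh returns positive distances inside the mesh; negate to follow the
# common negative-inside SDF convention.
sdf = -trimesh.proximity.signed_distance(mesh, grid).reshape(res, res, res)
np.save("model_sdf_64.npy", sdf)
```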

Image. To obtain multi-view 2D images for each 3D model, we implement a customized rendering process. Leveraging the capabilities of Blender [30], we develop a script that first normalizes the 3D models to fit within a bounding cube, ensuring consistent scale across all objects, and sets up a standardized lighting arrangement to ensure uniform illumination across the rendered images. Subsequently, we employ Blender’s built-in real-time rendering engine to export the 2D images. The rendering process captures ten images using the z-circular camera pose, which provides a well-distributed set of views around the object. These views are captured at equal intervals of 36 degrees, and their camera poses match those of the above-mentioned multi-view meshes. In addition to the z-circular camera pose, we also capture another set of ten images using random camera poses, which introduce variations in viewing angle and orientation and enable the generation of dense point clouds. By combining the images captured with the z-circular camera pose and the random poses, we obtain a total of 20 multi-view 2D images for each 3D model. This diverse set of images provides comprehensive visual information for subsequent experiments and processing tasks.
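
The rendering script itself is not included in the paper; a minimal Blender Python sketch of the z-circular pass described above might look as follows. The engine identifier, resolution, camera distance, and output paths are assumptions, and the lighting setup is omitted.

```python
import math
import bpy

scene = bpy.context.scene
scene.render.engine = 'BLENDER_EEVEE'   # Blender's built-in real-time engine (pre-4.2 identifier)
scene.render.resolution_x = scene.render.resolution_y = 512  # assumed resolution

# Normalize the imported object (assumed active) to fit a unit bounding cube.
obj = bpy.context.active_object
scale = 1.0 / max(obj.dimensions)
obj.scale = tuple(s * scale for s in obj.scale)

# Create a camera that always tracks the object.
cam_data = bpy.data.cameras.new("cam")
cam = bpy.data.objects.new("cam", cam_data)
scene.collection.objects.link(cam)
track = cam.constraints.new(type='TRACK_TO')
track.target = obj
track.track_axis = 'TRACK_NEGATIVE_Z'
track.up_axis = 'UP_Y'
scene.camera = cam

# Ten z-circular views spaced 36 degrees apart (radius and height are assumed).
radius, height = 2.0, 0.8
for i in range(10):
    theta = math.radians(36 * i)
    cam.location = (radius * math.cos(theta), radius * math.sin(theta), height)
    scene.render.filepath = f"/tmp/render_{i:02d}.png"
    bpy.ops.render.render(write_still=True)
```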

Point Cloud. To convert the 3D models into colored point clouds, we utilize the RGBAD images rendered from them. Initially, we generate dense point clouds by associating points with each pixel in the rendered RGBAD images. However, these point clouds often exhibit uneven distribution and contain a large number of points. To address this issue, we employ voxel point sampling techniques to create uniform point clouds consisting of 4K points. By directly constructing point clouds from the rendered images, we circumvent potential challenges that may arise when sampling points from 3D meshes. These challenges include dealing with points located inside the model or handling 3D models stored in non-standard file formats [47]. Our approach ensures a consistent and reliable representation of the objects in the form of point clouds. To further enhance the quality of our dataset, we implement heuristics to exclude low-quality models. Specifically, we employ a criterion based on the singular value decomposition (SVD) of each point cloud [47]. We compute the SVD for each point cloud and retain only those models where the smallest singular value exceeds a certain threshold. This process effectively filters out flat objects or models with poor geometric structure, ensuring that our dataset comprises high-quality and meaningful 3D representations. By employing these techniques, we create a comprehensive dataset of colored point clouds that accurately represent the underlying 3D models, while also ensuring the inclusion of high-quality and diverse objects for further analysis and research.
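
A NumPy sketch of the point-cloud construction described above, assuming known camera intrinsics and extrinsics for each rendered RGBAD view; the voxel size, target point count, and SVD threshold below are illustrative values, not the paper's.

```python
import numpy as np

def unproject(rgbad, K, cam_to_world):
    """Back-project one H x W x 5 RGBAD rendering (RGB, alpha, depth) into a colored
    point cloud. K is an assumed 3x3 intrinsic matrix, cam_to_world a 4x4 extrinsic."""
    h, w, _ = rgbad.shape
    rgb, alpha, depth = rgbad[..., :3], rgbad[..., 3], rgbad[..., 4]
    v, u = np.mgrid[0:h, 0:w]
    mask = (alpha > 0.5) & (depth > 0)             # keep foreground pixels only
    rays = np.linalg.inv(K) @ np.stack([u[mask], v[mask], np.ones(mask.sum())])
    pts_cam = rays * depth[mask]                    # scale rays by per-pixel depth
    pts = (cam_to_world[:3, :3] @ pts_cam + cam_to_world[:3, 3:4]).T
    return np.concatenate([pts, rgb[mask]], axis=1)  # N x 6: xyz + rgb

def voxel_downsample(points, voxel=0.02, target=4096):
    """Roughly uniform subsampling: keep one point per occupied voxel, then trim."""
    keys = np.floor(points[:, :3] / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    idx = np.random.permutation(idx)[:target]
    return points[idx]

def passes_svd_filter(points, threshold=0.02):
    """Reject near-flat shapes: the smallest singular value of the centered cloud
    must exceed a threshold (value assumed; the paper does not state it)."""
    xyz = points[:, :3] - points[:, :3].mean(axis=0)
    return np.linalg.svd(xyz, compute_uv=False)[-1] > threshold
```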

Text. The text information of the 3D models has two main sources. The first is the raw text information associated with each 3D model. However, we observe that a significant portion of this text does not accurately correspond to the 3D models themselves, including terms such as "model", "blender", and "low poly". Therefore, we employ the CLIP-ViT model [32] to clean it by computing image-text similarity, retaining only texts whose similarity value is above a certain threshold. This filters out over 80% of low-quality texts, while the false recognition rate does not exceed 30%. The second source is a multi-modal LLM that generates descriptions from 2D images. For each 3D model, we employ BLIP [33] to generate rich and detailed descriptions based on its thumbnail or 2D rendered image, and we again evaluate their accuracy via the similarity score from the CLIP-ViT model [32].
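
A sketch of this text pipeline using Hugging Face implementations of CLIP and BLIP; the specific checkpoints and the similarity threshold are assumptions, since the paper does not state them.

```python
import torch
from PIL import Image
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPProcessor)

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

def clip_similarity(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = clip_proc(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def caption(image: Image.Image) -> str:
    """Generate a descriptive caption for a rendered view with BLIP."""
    inputs = blip_proc(images=image, return_tensors="pt")
    with torch.no_grad():
        ids = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.decode(ids[0], skip_special_tokens=True)

image = Image.open("render_00.png").convert("RGB")
keep_raw_text = clip_similarity(image, "low poly chair model") > 0.25  # threshold assumed
generated_caption = caption(image)
```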

Figure 2: A histogram of fine-grained UniG3D categories with representative members from several bins highlighted.

Based on our observations, data with low similarity is primarily attributable to abstract 3D models with weak visual features. Hence, we implement a filtering process in which we retain only those 3D models whose similarity score surpasses a predetermined threshold; the filtered-out models account for approximately 20% of the total. Furthermore, for 3D models with known categories, we enhance the description by aligning it with the corresponding category. Specifically, if the category is absent from the original description, we extract descriptive phrases from the existing description and merge them with the known category to create an improved description. Conversely, if the generated description already exhibits high quality and coherence, we retain it as is.
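
The exact phrase-extraction rule is not specified in the paper, but the merging heuristic could be sketched roughly as follows; the string handling below is an assumption.

```python
def merge_category(description: str, category: str) -> str:
    """Merge a known category into a generated caption when the caption misses it.
    This is a simplified stand-in for the paper's phrase-merging heuristic."""
    if category.lower() in description.lower():
        return description            # description already names the category
    detail = description.strip().rstrip(".")
    return f"a {category}, {detail}"

# e.g. merge_category("a white object with four legs", "chair")
# -> "a chair, a white object with four legs"
```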

3.2 Statistics of our dataset

The statistics for each representation of UniG3D are presented in Table 1. Our pipeline generates multiple representations for each raw 3D model: ten meshes, one colored point cloud, and 20 images, accompanied by corresponding descriptive text that enhances the richness and comprehensiveness of the dataset. To provide a visual overview, Fig. 2 shows the object counts across categories within UniG3D, covering both the long-tail and head categories. Object counts per category range from 0 to 200. More than half of the categories contain fewer than 20 objects, forming the long tail of the dataset, while approximately 50 head categories contain around 200 objects each.

4 Experiments

4.1 3D Object Generation Method

Figure 3: Meshes generated by SDFusion conditioned on different input modalities on ShapeNet.

In our experimental study, we leverage two commonly used 3D object generation methods, Point-E and SDFusion, which are tailored to two prevalent 3D representations: point cloud and signed distance function. Both methods are grounded in the diffusion process proposed by Sohl-Dickstein et al. [19]. To clarify the training and generation processes, we depict the forward and backward passes of the diffusion model in Fig. 4. Please refer to the supplementary material for detailed training hyperparameters. In particular, we utilize two model structures in SDFusion: a VQ-VAE and a 3D latent diffusion model, whose parameter count exceeds 400 million. To provide the ability for interaction, learning conditional distributions is important. We incorporate multiple conditional input modalities, such as text, image, and text-image multi-modal input, through task-specific encoders and cross-attention modules in the latent diffusion model. For text input, we embed the caption using BERT [1], while for image input, we embed the image using CLIP [32]. For point cloud generation, we employ two small model structures from Point-E: 40M-text and 40M-image. Specifically, 40M-text is a small model that conditions only on text captions, not rendered images; the caption is embedded with CLIP, and the CLIP embedding is appended as a single extra token of context. This model depends on the text captions in our 3D dataset and does not leverage the fine-tuned GLIDE model. 40M-image is a small model with full image conditioning through a grid of CLIP latents. In the future, we will expand the scale of the training dataset and model structure.
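
To make the 40M-text conditioning scheme concrete, the toy PyTorch sketch below appends a projected CLIP text embedding as a single extra context token in a transformer denoiser. The dimensions, layer counts, and module names are illustrative and do not reproduce Point-E's actual architecture.

```python
import torch
import torch.nn as nn

class TextConditionedDenoiser(nn.Module):
    """Toy denoiser in the spirit of the 40M-text variant: the CLIP text embedding
    is projected and prepended as one extra context token (sizes are assumptions)."""

    def __init__(self, point_dim=6, width=256, clip_dim=768, layers=4):
        super().__init__()
        self.point_in = nn.Linear(point_dim, width)
        self.cond_in = nn.Linear(clip_dim, width)      # CLIP embedding -> one token
        self.time_in = nn.Linear(1, width)             # timestep -> one token
        block = nn.TransformerEncoderLayer(d_model=width, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.point_out = nn.Linear(width, point_dim)

    def forward(self, noisy_points, t, clip_text_embed):
        # noisy_points: (B, N, 6) xyz+rgb, t: (B, 1), clip_text_embed: (B, 768)
        tokens = self.point_in(noisy_points)
        cond = self.cond_in(clip_text_embed).unsqueeze(1)   # (B, 1, width)
        time = self.time_in(t.float()).unsqueeze(1)         # (B, 1, width)
        h = self.backbone(torch.cat([cond, time, tokens], dim=1))
        return self.point_out(h[:, 2:])                     # predictions for point tokens only

model = TextConditionedDenoiser()
eps_hat = model(torch.randn(2, 1024, 6), torch.tensor([[10.], [500.]]), torch.randn(2, 768))
```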

Figure 4: The directed graphical model depicts the diffusion process for 3D representations. The variable $x_i^{(t)}$ denotes the 3D representation at timestep $t$. The processes of noise addition and denoising are represented by $q$ and $p$ respectively. The denoising process is conditioned on a variable $c$, which can be embedded text, an embedded image, or a combination of both.
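
For reference, a standard formulation of the forward (noising) and conditional reverse (denoising) processes sketched in Fig. 4, following [19] with DDPM-style notation; the Gaussian parameterization and noise schedule $\beta_t$ below are the usual assumptions and are not spelled out in this paper:

$$q\left(x_i^{(t)} \mid x_i^{(t-1)}\right) = \mathcal{N}\left(x_i^{(t)};\ \sqrt{1-\beta_t}\, x_i^{(t-1)},\ \beta_t \mathbf{I}\right),$$

$$p_\theta\left(x_i^{(t-1)} \mid x_i^{(t)}, c\right) = \mathcal{N}\left(x_i^{(t-1)};\ \mu_\theta\left(x_i^{(t)}, t, c\right),\ \Sigma_\theta\left(x_i^{(t)}, t, c\right)\right),$$

where $c$ is the embedded text, the embedded image, or their combination.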

4.2 Experimental Results

4.2.1 The effects of different conditions

To explore the distinct roles of different input modalities in the 3D object generation process, we first compare the effects of images and text. The experiment is conducted on ShapeNet. As shown in Fig. 3, images provide a direct visual reference, allowing the generation of 3D objects that closely resemble the appearance of the referenced images. However, images may lack contextual information or high-level semantics that can be conveyed through textual descriptions, as shown in the second column of Fig. 5. Textual descriptions allow precise and specific control over the generated 3D objects by providing detailed instructions or constraints. However, textual descriptions may sometimes be ambiguous or subjective, leading to different interpretations and potential variations in the generated 3D objects. Additionally, text lacks the ability to convey rich visual details, such as color gradients, textures, or fine-grained shape features.

Figure 5: Comparison of the generative results by SDFusion when using images as the sole condition versus when incorporating additional text information.

Consequently, text and images as conditioning inputs each have their own advantages and disadvantages. We therefore also explore the effect of using both modalities as conditions. As shown in Fig. 5, when using images as the sole condition, the model may have limited semantic understanding due to issues such as viewing angle. For example, for the paper airplane in the first row and the chairs in the second and third rows, the model struggles to determine the specific category or infer the occluded parts from the image information alone. By incorporating text as an additional modality, the model gains a better grasp of the desired object characteristics, resulting in improved generation quality and a more comprehensive understanding of the semantic information associated with the object. Overall, using both text and images as conditioning inputs in 3D generation methods offers complementary benefits: text provides semantic control and language understanding, while images contribute visual realism and rich visual details. The choice between these modalities depends on the specific requirements and objectives of the 3D generation task.

Figure 6: Effects of increased data sources. Data(s) represents the use of only ShapeNet, while Data(s+o) represents the additional inclusion of the UniG3D-Objaverse-coco dataset. (a) and (b) represent the experiments conducted on different 3D representations.

4.2.2 The effectiveness of data expansion

Due to limited computing resources, we are currently unable to conduct experiments with the entire UniG3D-Objaverse dataset. Therefore, we create a subset by selecting data from the COCO categories, which we refer to as UniG3D-Objaverse-coco. This subset contains around 50K 3D objects across 66 categories. UniG3D-ShapeNet has 50K+ 3D objects across 55 categories.

To explore how data generation quality, such as point cloud completeness and diversity, changes after incrementally adding data from different categories, we conduct experiments on two 3D representations. First, because incorporating images as conditions in the Point-E method incurs a high training cost, we initially conduct experiments using text as the conditioning modality. As shown in Fig. 6 (a), increasing the data sources enhances the diversity of generated 3D models, underscoring the need for large-scale and scalable datasets. Because the model is sensitive to language ambiguity and variation, such ambiguities must be addressed when integrating different data sources. For example, in different datasets the term "mouse" could refer to either a small rodent or a computer input device. We then explore the impact of data expansion under different representations and conditions based on the SDFusion method. The first column in Fig. 6 (b) shows the test data from UniG3D-Objaverse-coco. We observe that the model trained solely on ShapeNet exhibits generalization ability. However, the addition of supplementary data allows the model to perform better on the new domain.

4.2.3 The impact of multi-view data

Figure 7: Effects of increased multi-view data. Training(S) represents the model trained with single-view data, while Training(M) represents the model utilizing multi-view data.

To address the potential limitations of images captured from a single viewpoint, our dataset offers users multi-view data to facilitate data augmentation. As described in Section 3.1, our dataset includes ten multi-view rendered images along with their corresponding meshes, with views evenly spaced at intervals of 36 degrees to capture a wide range of perspectives. To investigate the influence of multi-view data, we compare the performance of generative models trained on single-view and multi-view data using the SDFusion method. The training data is the combination of the ShapeNet and UniG3D-Objaverse-coco datasets. The benefits of multi-view training data are demonstrated in Fig. 7: the model trained with such data performs better when the input image exhibits relatively uncommon viewpoints. Consequently, augmenting the dataset with multi-view perspectives proves to be an effective strategy for enhancing the model's robustness to less frequent angles.

Figure 8: (a) corresponds to using only the category as text information during training, while (b) corresponds to utilizing the descriptive text generated by our pipeline.

4.2.4 The importance of text quality

Due to the relatively lower training cost of generation experiments based on the point cloud representation, we employ Point-E to investigate the impact of text quality on 3D object generation. To ensure data diversity, we use both the ShapeNet and UniG3D-Objaverse-coco datasets as training data. To clearly illustrate the quality and diversity of the generated point clouds, we present three results for each text condition in Fig. 8, using distinct seeds. When category names are used as the text condition at inference time, there is little difference in the completeness and diversity of the generated point clouds. However, the model trained with only category labels can generate a chair model only when given the word "chair"; when given descriptive text as input, it fails to generate results with matching visual features. In contrast, the model trained with higher-quality text can accept more detailed descriptive input and generate more controllable models by specifying detailed textual information, such as an airplane's type, color, or material. Overall, our UniG3D data transformation pipeline enables text-conditioned models to receive more detailed textual information and provide more control.

5 Limitation and Social Impact

Due to current limitations in computing resources, we are unable to conduct experiments using the complete UniG3D-Objaverse dataset. However, in future work, we plan to present experimental results based on the entire dataset to validate the consistency of our conclusions on a larger scale. Moreover, considering the advancements in recent methods that demonstrate improved speed and quality, we intend to explore a broader range of 3D generation methods in our future experiments. Additionally, we aim to incorporate additional tasks related to 3D understanding, such as novel view synthesis, neural surface reconstruction, and 3D point cloud classification, to further expand the scope and applicability of our dataset.

This work aims to provide a unified dataset for the 3D generation task, eliminating the need for extensive human annotation efforts. While this approach offers positive impacts such as reducing human labor, it is crucial to acknowledge that the reduction of human labor may have negative consequences, including job loss or displacement, especially for individuals with lower skill levels who may rely on employment opportunities.

6 Conclusion

In this study, we provide a unified 3D object generation dataset called UniG3D. Our dataset is constructed by utilizing a universal data transformation pipeline applied to the Objaverse and ShapeNet datasets. This pipeline effectively converts each raw 3D model into a comprehensive multi-modal data representation, encompassing text, images, point clouds, and meshes. To achieve this, it utilizes rendering engines and multi-modal models that are capable of capturing textual information and ensuring a comprehensive representation of the data. As a result, our dataset guarantees the richness of textual information and the comprehensiveness of data representation. Our pipeline’s universality stems from its ability to be applied to any 3D dataset, solely relying on raw 3D data, thereby enhancing its practicality and flexibility. During the construction of our dataset, we meticulously select data sources considering their scale and quality, guaranteeing the incorporation of diverse and reliable information. Furthermore, through empirical investigations and analysis of various factors, we present several key insights.

References

  • [1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [2] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • [3] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • [4] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
  • [5] Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, and Sanja Fidler. Learning to simulate dynamic environments with gamegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1231–1240, 2020.
  • [6] Ruben Rodriguez Torrado, Philip Bontrager, Julian Togelius, Jialin Liu, and Diego Perez-Liebana. Deep reinforcement learning for general video game ai. In 2018 IEEE Conference on Computational Intelligence and Games (CIG), pages 1–8. IEEE, 2018.
  • [7] Yijing Lin, Hongyang Du, Dusit Niyato, Jiangtian Nie, Jiayi Zhang, Yanyu Cheng, and Zhaohui Yang. Blockchain-aided secure semantic communication for ai-generated content in metaverse. IEEE Open Journal of the Computer Society, 4:72–83, 2023.
  • [8] Yijing Lin, Zhipeng Gao, Hongyang Du, Dusit Niyato, Jiawen Kang, Abbas Jamalipour, and Xuemin Sherman Shen. A unified framework for integrating semantic communication and ai-generated content in metaverse. arXiv preprint arXiv:2305.11911, 2023.
  • [9] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1534–1543, 2016.
  • [10] Ze Liu, Zheng Zhang, Yue Cao, Han Hu, and Xin Tong. Group-free 3d object detection via transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2949–2958, 2021.
  • [11] Ishan Misra, Rohit Girdhar, and Armand Joulin. An end-to-end transformer model for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2906–2917, 2021.
  • [12] Thang Vu, Kookhoi Kim, Tung M Luu, Thanh Nguyen, and Chang D Yoo. Softgroup for 3d instance segmentation on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2708–2717, 2022.
  • [13] Cesar Cadena, Anthony R Dick, and Ian D Reid. Multi-modal auto-encoders as joint estimators for robotics scene understanding. In Robotics: Science and systems, volume 5, 2016.
  • [14] Christian Wojek, Stefan Walk, Stefan Roth, and Bernt Schiele. Monocular 3d scene understanding with explicit occlusion reasoning. In CVPR 2011, pages 1993–2000. IEEE, 2011.
  • [15] Yingwei Li, Adams Wei Yu, Tianjian Meng, Ben Caine, Jiquan Ngiam, Daiyi Peng, Junyang Shen, Yifeng Lu, Denny Zhou, Quoc V Le, et al. Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17182–17191, 2022.
  • [16] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021.
  • [17] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022.
  • [18] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • [19] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • [20] Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan Chen, Laurent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, et al. Gaudi: A neural architect for immersive 3d scene generation. Advances in Neural Information Processing Systems, 35:25102–25116, 2022.
  • [21] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021.
  • [22] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • [23] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics (TOG), 39(6):1–14, 2020.
  • [24] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. Advances In Neural Information Processing Systems, 35:31841–31854, 2022.
  • [25] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. Cips-3d: A 3d-aware generator of gans based on conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788, 2021.
  • [26] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. Stylesdf: High-resolution 3d-consistent image and geometry generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13503–13513, 2022.
  • [27] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022.
  • [28] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. Advances in Neural Information Processing Systems, 33:20154–20166, 2020.
  • [29] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5799–5809, 2021.
  • [30] Brian R Kent. 3D scientific visualization with Blender®. Morgan & Claypool Publishers, 2015.
  • [31] Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander Schwing, and Liangyan Gui. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. arXiv preprint arXiv:2212.04493, 2022.
  • [32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [33] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
  • [34] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. arXiv preprint arXiv:2212.08051, 2022.
  • [35] Aditya Sanghi, Hang Chu, Joseph G Lambourne, Ye Wang, Chin-Yi Cheng, Marco Fumero, and Kamal Rahimi Malekshan. Clip-forge: Towards zero-shot text-to-shape generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18603–18613, 2022.
  • [36] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. arXiv preprint arXiv:2211.10440, 2022.
  • [37] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
  • [38] Ajay Jain, Ben Mildenhall, Jonathan T Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with dream fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 867–876, 2022.
  • [39] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  • [40] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  • [41] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
  • [42] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XV, pages 89–106. Springer, 2022.
  • [43] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • [44] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • [45] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
  • [46] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  • [47] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  • [48] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  • [49] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
  • [50] Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3d-future: 3d furniture shape with texture. International Journal of Computer Vision, 129:3313–3337, 2021.
  • [51] Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21126–21136, 2022.
  • [52] Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, 120:153–168, 2016.
  • [53] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1790–1799, 2020.
  • [54] Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1588–1597, 2019.
  • [55] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.
  • [56] Liu Liu, Wenqiang Xu, Haoyuan Fu, Sucheng Qian, Qiaojun Yu, Yang Han, and Cewu Lu. Akb-48: a real-world articulated object knowledge base. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14809–14818, 2022.
  • [57] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021.
  • [58] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
  • [59] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, et al. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. arXiv preprint arXiv:2301.07525, 2023.
  • [60] Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-scale dataset of multi-view images. arXiv preprint arXiv:2303.06042, 2023.
  • [61] Arjun Singh, James Sha, Karthik S Narayan, Tudor Achim, and Pieter Abbeel. Bigbird: A large-scale 3d database of object instances. In 2014 IEEE international conference on robotics and automation (ICRA), pages 509–516. IEEE, 2014.
  • [62] Berk Calli, Aaron Walsman, Arjun Singh, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M Dollar. Benchmarking in manipulation research: Using the yale-cmu-berkeley object and model set. IEEE Robotics & Automation Magazine, 22(3):36–52, 2015.
  • [63] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. In International conference on machine learning, pages 40–49. PMLR, 2018.
  • [64] Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy Mitra, and Leonidas J Guibas. Structurenet: Hierarchical graph networks for 3d shape generation. arXiv preprint arXiv:1908.00575, 2019.
  • [65] Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4541–4550, 2019.
  • [66] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2837–2845, 2021.
  • [67] Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. Learning gradient fields for shape generation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 364–381. Springer, 2020.
  • [68] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. arXiv preprint arXiv:2210.06978, 2022.
  • [69] Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5826–5835, 2021.
  • [70] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. Clip-mesh: Generating textured meshes from text using pretrained image-text models. In SIGGRAPH Asia 2022 Conference Papers, pages 1–8, 2022.
  • [71] Kevin Chen, Christopher B Choy, Manolis Savva, Angel X Chang, Thomas Funkhouser, and Silvio Savarese. Text2shape: Generating shapes from natural language by learning joint embeddings. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III 14, pages 100–116. Springer, 2019.
  • [72] Zhengzhe Liu, Yi Wang, Xiaojuan Qi, and Chi-Wing Fu. Towards implicit text-guided 3d shape generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17896–17906, 2022.
  • [73] Paritosh Mittal, Yen-Chi Cheng, Maneesh Singh, and Shubham Tulsiani. Autosdf: Shape priors for 3d completion, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 306–315, 2022.
  • [74] Rao Fu, Xiao Zhan, Yiwen Chen, Daniel Ritchie, and Srinath Sridhar. Shapecrafter: A recursive text-conditioned 3d shape generation model. arXiv preprint arXiv:2207.09446, 2022.
  • [75] Blender 3.5.1. https://www.blender.org/download/. 2023.
  • [76] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.