
3D Scene Graph Prediction on Point Clouds Using Knowledge Graphs

Yiding Qiu1 and Henrik I. Christensen2
1Yiding Qiu is with the Department of Computer Science and Engineering, University of California San Diego, La Jolla, 92122, USA, yiqiu@eng.ucsd.edu
2Henrik I. Christensen is with the Department of Computer Science and Engineering, University of California San Diego, La Jolla, 92122, USA, hichristensen@eng.ucsd.edu
Abstract

3D scene graph prediction is a task that aims to concurrently predict object classes and their relationships within a 3D environment. As these environments are primarily designed by and for humans, incorporating commonsense knowledge regarding objects and their relationships can significantly constrain and enhance the prediction of the scene graph. In this paper, we investigate the application of commonsense knowledge graphs for 3D scene graph prediction on point clouds of indoor scenes. Through experiments conducted on a real-world indoor dataset, we demonstrate that integrating commonsense knowledge via the message passing method leads to a 15.0% improvement in scene graph prediction accuracy with external knowledge and 7.96% with internal knowledge when compared to state-of-the-art algorithms. We also tested the model in the real world, generating scene graphs at 10 frames per second, to demonstrate its use in a more realistic robotics setting.

I INTRODUCTION

A 3D scene graph is a high-level semantic scene representation that captures objects and their relationships within a 3D environment. This representation has recently demonstrated its potential for robotics tasks, including 3D scene reconstruction [1], path planning [2, 3], and navigation [4, 5]. Most object-level semantic mapping methods primarily focus on predicting the class of objects in a scene [6, 7]. However, 3D scene graph estimation diverges from these methods, as it requires the additional tasks of (a) predicting whether an edge should exist between two objects, and (b) predicting the label of the edge as a semantic relationship. Fig. 1 exemplifies the 3D scene graph prediction problem. Given point cloud data segmented into class-agnostic clusters, the task is to simultaneously classify the objects and relationships. The resulting scene graph comprises objects labeled as nodes and relationships labeled as directed edges. Multiple directed edges can exist from one object to another. The graph can also be described using the triplet Subject-Predicate-Object. To clarify the concept further, we follow the convention established in the 2D scene graph prediction domain, using the term "relationship" to describe the triplet and the term "predicate" to describe the label of the edge.

These relationships offer valuable information that benefits robotic tasks in multiple ways. First, semantic relationships can direct robots to search for target objects more efficiently. For instance, during object-goal navigation, the spatial relationship between two objects, such as Cup-On-Table, can constrain the search space and improve object search efficiency. Secondly, labeled relationships offer robots a richer vocabulary to communicate with humans. For example, instead of describing a scene with the coordinates of a chair and a table, an agent can use the phrase Chair-NextTo-Table, which resembles natural language more closely.

Figure 1: Given a 3D point cloud segmented into class-agnostic clusters, the objective is to generate the corresponding scene graph that labels the clusters and infers spatial or semantic relationships among them. The resulting 3D scene graph has nodes representing classes of objects, and directed labeled edges representing semantic relationships.

Compared to tasks focusing solely on object classification accuracy, scene graph prediction is more challenging, as accuracy is evaluated based on the correct prediction of triplets. One intuitive approach to address this problem is to utilize knowledge—the common structure or pattern typically found in most environments—to infer which object-relationship pairs are more likely to be present, given the current observation. Indoor environments have a human-centered structure and adhere to the laws of physics. For instance, chairs are usually situated near tables, and cups are commonly placed on tables or other large flat surfaces. Some triplets are more likely to exist than others, while some are nearly impossible to encounter in reality, such as Table-On-Cup. Acquiring this knowledge from external resources or through training can simplify the prediction process. In some sense, knowing the predicates can help improve object recognition accuracy, and having higher confidence in object classes can increase or decrease the likelihood of some predicates.

We propose to incorporate external common-sense knowledge into 3D scene graph prediction tasks. The external knowledge graph is generated from various sources, including Visual Genome [8], ConceptNet [9], and WordNet [10]. Inspired by the Graph Bridging Network (GB-Net) [11], we use graph message passing to learn both node embeddings and edge embeddings within and between scene graphs and knowledge graphs, and perform experiments on the indoor 3D scene graph dataset 3DSSG [12]. A major distinction between our work and [11] lies in the dataset type; [11] performs the task on 2D images for 2D scene graph prediction, whereas our task utilizes 3D point cloud data. This presents a greater challenge because 3D segmented point clouds often suffer from missing points. In general, object recognition on point cloud data has lower accuracy than on images. Moreover, relationships in indoor environments are primarily spatial and geometric, while relationships in 2D image datasets can be conceptually abstract. Consequently, the types of knowledge graphs we employ differ from those used for 2D scene graph tasks. Additionally, indoor datasets typically have a smaller vocabulary for both objects and relationships, making it easier for common-sense knowledge to capture relationships.

To the best of our knowledge, our work is the first to leverage external common-sense knowledge for the 3D scene graph prediction problem. The current state-of-the-art algorithms are [12] and [1]. However, neither of these methods utilizes external knowledge for prediction. We compare our prediction results with both works and demonstrate that our model outperforms both in scene graph prediction tasks.

The main contributions of this paper are: (1) we propose a model for the 3D scene graph construction problem that incorporates common-sense knowledge and demonstrates improved results; (2) we perform experiments in a real-world setting and demonstrate a possible application in the robotics domain.

The remainder of the paper is organized as follows. In Section II, we discuss related work, followed by the problem formulation in Section III. The main method and the network structure for our approach are described in Section IV. In Section V we discuss the dataset used, the overall experimental design, and the results. Section VI presents our real-world experiments. Finally, we summarize our work and outline future challenges in Section VII.

II Related Work

II-A Scene graph prediction with knowledge

Scene graph prediction problems have been predominantly tackled on 2D image datasets. Two main methods are prevalent for integrating knowledge into the model. The first method is to extract the high-level structure inherent in the training dataset. For instance, MotifNet [13] leverages the most frequent relations between labeled object pairs in the training set for scene graph prediction. Knowledge-embedded routing network (KERN) [14], on the other hand, employs the co-occurrence probabilities between objects implicitly to aid in resolving the long-tail problem in the dataset.

The second method involves utilizing external knowledge sources for the task. For example, Gu et al. [15] used ConceptNet [9] to refine object and relationship features prior to training on scene graph generation. In [11], Zareian et al. introduced GB-Net, which unifies scene graphs and knowledge graphs by learning the node encoding and edge encoding both within and between graphs. In their model, a scene graph represents an "image-conditioned instantiation" of a commonsense knowledge graph. They employed multiple knowledge bases as external knowledge sources, including WordNet [10], ConceptNet [9], and Visual Genome [8]. This represents one of the latest attempts to merge knowledge graphs and scene graphs, enabling learning of object features and predicates with neighboring nodes, while also constructing a "bridge" that infuses information between the two graphs through message passing.

Our work adopts the second method since we aim to generalize beyond the training data and deploy the algorithm in a real-world setting. We also aim to explore the effectiveness of external commonsense knowledge in generating 3D scene graphs, and as such, we adopt a bridging structure similar to [11]. However, due to inherent differences in our dataset (image vs. point cloud), our method also takes into account the incompleteness of the point cloud dataset and the innate 3D relationships in the reconstructed scene. Moreover, the knowledge graphs we construct differ from [11], as the indoor robot task employs a distinct set of objects and relationships.

II-B 3D scene understanding and scene graph generation

The application of deep learning methods for 3D point cloud recognition can be traced back to PointNet [16], a model still used extensively to encode and predict point cloud data. In our study, we likewise use PointNet as a backbone for point cloud embedding.

Recently, there has been some research into indoor scene graphs specifically designed for indoor robot mapping. One example is the 3D scene graph [17], which establishes a semi-automatic framework to create a dataset unifying objects, rooms, and cameras in a structured manner. While some traditional approaches [18, 19] use SLAM for semantic mapping at the object level, their maps do not include labeled semantic relationships. Certain methods [5, 3] construct a scene graph from each image and then merge the 2D scene graphs into a global 3D graph. However, these approaches are tested on a small selection of objects and a minimal set of relationships. The works most similar to ours include [12, 1], with the 3DSSG dataset first introduced in these studies. [12] proposed the Scene Graph Prediction Network (SGPN), which learns the 3D scene graph from reconstructed point clouds. [1] further extended the method to accommodate RGB-D images as input and demonstrated that a model initially trained on reconstructed point clouds could be used for online scene graph prediction with a minor decline in prediction performance.

Figure 2: The pipeline of the overall model. The input is the class-agnostic point cloud of a scene. We begin by extracting point clouds such that each segment is either a subject or an object, and the predicate is the union of two segments. This forms the nodes of the Scene Representation (SR). Each node comprises a PointNet encoding feature and a contextual vector, with edges defined by distance. The knowledge input draws from three sources: ConceptNet, Visual Genome, and WordNet. The knowledge graph features are GloVe embeddings, with edges constructed from these three sources. The scene graph and knowledge graph are subsequently trained together, allowing for simultaneous updates to both nodes and edges using the Knowledge-Scene Graph Network. Finally, the updated nodes from the intermediate scene graph are classified to establish relation triplets Subject-Predicate-Object.

Based on these insights, our work focuses on generating scene graphs using the point cloud dataset, considering that the image-to-reconstructed point cloud can be accomplished either by traditional SLAM or by the graph fusing method. Notably, our approach differs from all previously mentioned methods by explicitly incorporating common-sense knowledge.

III Problem Formulation

Consider a constructed 3D scene with point cloud $\mathcal{P}\subset\mathbb{R}^{3}$ segmented into $n$ class-agnostic clusters $\mathcal{C}_{i}$ for $i=1,\ldots,n$, and a directed graph $G=(\mathcal{V},\mathcal{E})$. The objective of the task is to classify each cluster $\mathcal{C}_{i}$ as an object in $\mathcal{V}$ and associate predicates from $\mathcal{E}$ with it. The classes of the nodes and the directed edges eventually yield a set of relation triplets Subject-Predicate-Object. There are two forms to represent the 3D scene graph, which are semantically equivalent and can be transformed into each other: the triplet form is used for the evaluation of the model, whereas the graph form is utilized for map visualization and subsequent tasks, such as navigation.

IV Method

IV-A Pipeline

As shown in Figure 2, the entire pipeline consists of two streams: one originating from the point cloud and the other from knowledge. The first stream generates the scene representation (SR) of the scene graph. Given the input of a room's point cloud, we extract the point cloud of each object and the predicate, in the form of the union of two clusters. These are encoded with a three-layer PointNet structure, constituting the first part of the SR node features. Additionally, we concatenate a set of contextual vectors to the PointNet feature embedding, as described in Section IV-B. Edges of the SR graph are kept if two objects in 3D space are within a certain Euclidean distance threshold, with both directions preserved at this stage.
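As a concrete illustration of the edge-construction step, the sketch below connects two segments with a pair of directed edges when they are within the distance threshold. Measuring the distance between segment centroids is an assumption made for illustration (the actual criterion may differ), and the 0.5 m threshold follows the implementation details in Section V-C.

```python
import numpy as np

def build_sr_edges(centroids, threshold=0.5):
    """Keep a directed edge (i, j) when the two segment centroids are within
    `threshold` meters; both directions are preserved at this stage."""
    edges = []
    n = len(centroids)
    for i in range(n):
        for j in range(n):
            if i != j and np.linalg.norm(centroids[i] - centroids[j]) < threshold:
                edges.append((i, j))
    return edges

# Example: centroids of three segmented clusters (in meters)
centroids = np.array([[0.0, 0.0, 0.0], [0.3, 0.1, 0.0], [2.0, 2.0, 0.0]])
print(build_sr_edges(centroids))  # [(0, 1), (1, 0)]
```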

The second stream originates from knowledge sources and is thus named the knowledge representation (KR). Each node, representing either an object, subject, or predicate, is encoded using a GloVe [20] embedding. The edges are provided by three different knowledge sources: Visual Genome, ConceptNet, and WordNet. The process of constructing a knowledge graph is explained in detail in Section IV-C.

Next, both SR and KR are fed to a Knowledge-Scene Graph Network (KSGN). Through message passing, the node features of both SR and KR are updated alongside their neighboring node features, and the edges within and between the graphs are updated as well (Section IV-D). The resulting updated SR node features of the objects and predicates are finally passed through a two-layer multi-layer perceptron (MLP) to predict relation triplets.

IV-B Input feature

The node features of the input point cloud consist of the PointNet encoding and the contextual vector. The contextual vector is an 11-dimensional vector first introduced in [1]. The vector includes the centroid of the point cloud $(x,y,z)\in\mathbb{R}^{3}$, the standard deviation that describes the sparsity of the segment $(\sigma_{x},\sigma_{y},\sigma_{z})\in\mathbb{R}^{3}$, the size of the bounding box $(b_{x},b_{y},b_{z})\in\mathbb{R}^{3}$, the maximum length of the segment $l=\max(b_{x},b_{y},b_{z})\in\mathbb{R}$, and the bounding box volume $v=b_{x}\cdot b_{y}\cdot b_{z}\in\mathbb{R}$. The vector is calculated for each segment in the room as well as for the union of two segments for the predicate.
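The contextual vector described above can be computed directly from a segment's points. The following is a minimal sketch; the function name and the ordering of the concatenated components are illustrative choices, not taken from the released code.

```python
import numpy as np

def contextual_vector(points):
    """11-D contextual descriptor of a point-cloud segment: centroid (3),
    per-axis standard deviation (3), bounding-box size (3), maximum box
    length (1), and box volume (1)."""
    centroid = points.mean(axis=0)                    # (x, y, z)
    std = points.std(axis=0)                          # (sigma_x, sigma_y, sigma_z)
    bbox = points.max(axis=0) - points.min(axis=0)    # (b_x, b_y, b_z)
    length = bbox.max()                               # l = max(b_x, b_y, b_z)
    volume = np.prod(bbox)                            # v = b_x * b_y * b_z
    return np.concatenate([centroid, std, bbox, [length], [volume]])

# A predicate's vector is computed on the union of the two segments, e.g.
# contextual_vector(np.vstack([segment_subject, segment_object]))
segment = np.random.rand(500, 3)
print(contextual_vector(segment).shape)  # (11,)
```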

Type (dimensions)          Subtype          Knowledge source
obj-obj (3 x 160 x 160)    subject-object   VG
                           object-subject   VG
                           relatedTo        ConceptNet
obj-pred (2 x 160 x 27)    sub-pred         VG
                           obj-pred         VG
pred-obj (2 x 27 x 160)    pred-sub         VG
                           pred-obj         VG
pred-pred (2 x 27 x 27)    category         hand label
                           WUP score        WordNet
TABLE I: Knowledge graph adjacency matrix description
Figure 3: The pipeline of the graph bridging model

IV-C Knowledge Graph Construction

The knowledge graphs are constructed from three different sources:

Visual Genome [8]: This contains labeled scene graphs for image datasets. We filtered the object and relationship vocabulary for the training data and constructed four different matrices: subject-object, object-subject, subject-predicate, and predicate-object.

ConceptNet [9]: A multilingual, crowd-sourced knowledge graph database. We retrieved an object-object matrix from it.

WordNet [10]: A lexical database that contains semantic explanations and synonyms for nouns, verbs, and adjectives. We use WordNet for predicate-predicate relationships.

The knowledge graph in our study is defined using four types of adjacency matrices, as depicted in Figure 2 and Table I. For each object class in the dataset, we assume it can appear either as a subject or as an object. Thus, for both subject-object and object-subject relationships, the adjacency matrices are square. We list the details of the knowledge graph in Table I. Because the dataset we use contains 160 objects and 27 relationships, we also list the matrix dimensions for clarity.

Regarding ConceptNet, we use relatedTo for between-object relationships, with a binary weight. As for the predicate-predicate relationships, we created a manually crafted categorical adjacency matrix. This matrix connects all directional spatial relationships (such as left, right, behind, front), comparison relationships (smaller than, bigger than, higher than, lower than), and relationships that imply attachment (attached to, standing on, lying on). This manually created matrix is binary as well. For similarity scores between predicates, we use WordNet to obtain Wu-Palmer (WUP) scores. For all other adjacency matrices, weights are normalized to the range from 0 to 1.
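To illustrate how the WUP entries of the predicate-predicate matrix can be obtained, the sketch below queries WordNet through NLTK. Using the first synset of each predicate's head word is an assumption made for brevity; the actual synset selection and phrase handling may differ.

```python
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def wup_score(pred_a, pred_b):
    """Wu-Palmer similarity between the first WordNet synsets of two
    predicate head words; returns 0.0 when no synset or score is found."""
    syn_a = wn.synsets(pred_a)
    syn_b = wn.synsets(pred_b)
    if not syn_a or not syn_b:
        return 0.0
    score = syn_a[0].wup_similarity(syn_b[0])
    return score if score is not None else 0.0

# Build a small predicate-predicate similarity matrix (illustrative words)
predicates = ["standing", "lying", "hanging"]
matrix = [[wup_score(a, b) for b in predicates] for a in predicates]
print(matrix)
```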

IV-D Knowledge-Scene Graph Network

The Knowledge-Scene Graph Network (KSGN) is a modified version of GB-Net [11]. As shown in Figure 2, KSGN accepts two types of graph input: the scene graph and the knowledge graph. Notably, both inputs are heterogeneous graphs.

The detailed model is depicted in Figure 3. The scene representation consists of Scene Entities (SE) and Scene Predicates (SP), and the common-sense knowledge representation consists of Common-Sense Entities (CE) and Common-Sense Predicates (CP). In the following, we use $\Delta\in\{SE, SP, CE, CP\}$ to denote the different types of nodes.

Each node is encoded by a two-layer MLP to form the node features of the outgoing message:

$\mathbf{m}_{i}^{\Delta\rightarrow}=\phi_{\text{send}}^{\Delta}\left(\mathbf{x}_{i}^{\Delta}\right)$ (1)

where $\phi_{\text{send}}$ is the MLP referred to as the "send head"; it is trained with weights shared across the four types of nodes.

Given each outgoing message, we compute the message along each incoming edge. This is done by first taking a weighted sum over edges of the same type and then concatenating across the different edge types:

$\mathbf{m}_{j}^{\Delta\leftarrow}=\phi_{\text{receive}}^{\Delta}\left(\bigcup_{\Delta^{\prime}}\ \bigcup_{\mathcal{E}_{k}\in\mathcal{E}^{\Delta^{\prime}\rightarrow\Delta}}\ \sum_{\left(i,j,a_{ij}^{k}\right)\in\mathcal{E}_{k}}a_{ij}^{k}\,\mathbf{m}_{i}^{\Delta^{\prime}\rightarrow}\right)$ (2)

Finally, given the original input nodes and the received messages, we use a gated recurrent unit (GRU) to update the node representations. The updated node vectors are used to classify objects and predicates through training and back-propagation.
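A minimal sketch of one such propagation step is given below, written in PyTorch for a single node type. The layer sizes, the number of edge types, and the dense adjacency representation are illustrative assumptions; the actual KSGN operates over all four node types with their respective edge sets.

```python
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    """One propagation step following Eqs. (1)-(2): a two-layer 'send head'
    MLP, a weighted sum of incoming messages per edge type, concatenation
    across edge types through a 'receive head', and a GRU cell update."""

    def __init__(self, dim=256, num_edge_types=4):
        super().__init__()
        self.send = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.receive = nn.Sequential(nn.Linear(num_edge_types * dim, dim),
                                     nn.ReLU(), nn.Linear(dim, dim))
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, x, adjacency_list):
        # x: (num_nodes, dim); adjacency_list: one weighted (num_nodes, num_nodes)
        # adjacency matrix a_ij per edge type.
        outgoing = self.send(x)                               # Eq. (1)
        per_type = [A @ outgoing for A in adjacency_list]     # weighted sums per type
        incoming = self.receive(torch.cat(per_type, dim=-1))  # Eq. (2)
        return self.gru(incoming, x)                          # GRU node update

# Toy usage: 5 nodes, 4 edge types
x = torch.randn(5, 256)
adjacency_list = [torch.rand(5, 5) for _ in range(4)]
print(MessagePassingStep()(x, adjacency_list).shape)  # torch.Size([5, 256])
```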

                        $O_{160}R_{26}$                              $O_{27}R_{7}$
Model                   RE ↑    REsingle ↑   Obj@1 ↑   Obj@5 ↑       RE ↑    REsingle ↑   Obj@1 ↑   Obj@5 ↑
SGPN [12]               0.071   0.119        0.357     0.623         0.383   0.385        0.420     0.780
SGFN [1]                0.113   0.169        0.504     0.754         0.417   0.417        0.624     0.923
Ours (internal KG)      0.122   0.184        0.466     0.739         0.450   0.450        0.644     0.917
Ours (external KG)      0.130   0.187        0.473     0.742         0.469   0.470        0.637     0.922
TABLE II: Quantitative results of the evaluated methods (recall; higher is better).

IV-E Loss

The model is trained in an end-to-end fashion, and the total loss consists of the classification loss for both objects and predicates:

$\mathcal{L}_{\text{total}}=\lambda\mathcal{L}_{\text{obj}}+\mathcal{L}_{\text{pred}}$

where $\lambda$ is a user-defined weight factor. Because a subject can have multiple relationships with an object, we use cross-entropy loss for $\mathcal{L}_{\text{pred}}$.
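As a sketch of how this objective might be implemented, the snippet below combines a standard cross-entropy term for objects with a per-class binary cross-entropy term for predicates; treating the multi-label predicate loss as binary cross-entropy is our assumption of one common realization, and the class counts are taken from the $O_{160}R_{26}$ setting.

```python
import torch
import torch.nn.functional as F

def total_loss(obj_logits, obj_labels, pred_logits, pred_labels, lam=0.5):
    """Weighted object classification loss plus multi-label predicate loss."""
    l_obj = F.cross_entropy(obj_logits, obj_labels)
    l_pred = F.binary_cross_entropy_with_logits(pred_logits, pred_labels)
    return lam * l_obj + l_pred

obj_logits = torch.randn(8, 160)                       # 8 objects, 160 classes
obj_labels = torch.randint(0, 160, (8,))
pred_logits = torch.randn(20, 26)                      # 20 edges, 26 predicate classes
pred_labels = torch.randint(0, 2, (20, 26)).float()    # multiple predicates may co-exist
print(total_loss(obj_logits, obj_labels, pred_logits, pred_labels))
```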

V Experiment on dataset

V-A Task description

The task objective is twofold: to generate scene graphs by (1) predicting the labels of segmented clusters of point cloud, and (2) predicting the labels of relationships between two point cloud clusters. For this, we use the 3RScan[21] dataset, which consists of 1482 scans across 478 different scenes. Each scene is recorded using Google Tango, yielding sequences of RGB-D images with accurately calibrated camera poses. The ground truth scene graph annotations are provided by 3DSSG [12].

The entire dataset is divided based on the number of scans, with 1061 scenes for training, 117 for validation, and 157 for testing. The original dataset contains 534 object classes and 40 relationship classes. The types of relationships captured in the data include supporting relationships (e.g., standing, lying), proximity relationships (e.g., next to, in front of), and comparative relationships (e.g., bigger than, taller than). In this project, we trained and evaluated the algorithm in two different settings. The first setting consists of 160 object classes and 26 predicate classes ($O_{160}R_{26}$), and the second setting consists of 27 object classes and 7 predicate classes ($O_{27}R_{7}$).

V-B Comparison Models

Scene Graph Prediction Network (SGPN) [12]: The network takes as input the object point clouds and the predicate point clouds and uses a Graph Neural Network (GNN) to predict the scene graph.

Scene Graph Fusion Network (SGFN) [1]: The network uses only contextual vectors as predicate input instead of PointNet-encoded features. It uses a Graph Attention Network (GAT) among the triplets for prediction.

Ours: We present two versions of the model: one uses internal knowledge graphs, and the other uses external knowledge graphs. Both models have the same structure. The only difference is that for the internal knowledge graphs, we initialize the knowledge representation with zero matrices, which allows the model to capture the relationships within the dataset.

V-C Implementation details

The model is implemented in PyTorch. The learning rate is set to 0.001. The input graphs are generated based on the distance between entities (0.5 meters). We train our models for 100K epochs and SGPN and SGFN for 300K epochs. The loss weight factor $\lambda$ is 0.5.

V-D Evaluation metric

To evaluate the model's performance on scene graph generation, we compare the predicted Subject-Predicate-Object triplets with the ground-truth triplets. Considering that multiple predicates can co-exist between two objects, we employ two measures for evaluation: RE (abbreviation for relationship), which requires the predicted triplets to exactly match the ground truth, and REsingle, a relaxed measure in which at least one predicted relationship must match the ground truth.

Most current 2D scene graph prediction works adopt the recall rate as the evaluation metric [22, 13, 15, 11]. Note that $\text{Recall}=TP/(TP+FN)$, where $TP$ is the number of true positives and $FN$ the number of false negatives. The major reason is that the ground-truth annotations of relationships are likely to be incomplete, and thus metrics such as accuracy or precision, which penalize false-positive predictions, do not fairly reflect the actual performance of a model. In the table, we use Obj@K to indicate whether the correct object class appears in the top K predictions.
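For concreteness, a set-based version of the triplet recall could be computed as follows; the exact matching protocol (e.g., per-edge versus per-scene aggregation) is an assumption made here for illustration.

```python
def triplet_recall(predicted, ground_truth):
    """Fraction of ground-truth (subject, predicate, object) triplets that
    appear among the predictions, i.e. recall = TP / (TP + FN)."""
    pred_set = set(predicted)
    tp = sum(1 for triplet in ground_truth if triplet in pred_set)
    return tp / len(ground_truth) if ground_truth else 0.0

gt = [("chair", "attached to", "floor"), ("cup", "standing on", "table")]
pred = [("chair", "attached to", "floor"), ("chair", "close by", "table")]
print(triplet_recall(pred, gt))  # 0.5
```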

Figure 4: Convergence rate of models

V-E Results

Table II summarizes the quantitative results of our model in comparison with state-of-the-art models. Our model outperforms both baselines in triplet prediction. The results indicate that incorporating knowledge graphs enhances overall performance, with external knowledge graphs offering greater improvements than internal knowledge graphs. Nevertheless, in terms of object classification accuracy, our model does not always give better predictions. This may be attributed to the inherent challenge of classifying objects using point cloud data, which in turn complicates the association between scene graphs and knowledge graphs for objects.

Fig. 4 illustrates how the use of knowledge graphs can expedite model convergence. Compared to the SGFN model, our model reaches convergence in earlier epochs, implying a reduced need for training epochs.

V-F Error Analysis

We performed error analysis for both object classification and predicate classification. The top five misclassifications for objects, denoted as ground truth/prediction, were wall/curtain, wall/wardrobe, wall/blinds, pillow/cushion, and chair/side table. The difficulty in recognizing curtains and blinds was also a common challenge during semantic classification on the ScanNet point cloud dataset [23]. Regarding predicate classes, the most frequently occurring mistakes were leaning against/close by, cover/lying on, and cover/standing on.

VI Real-world Experiment

We conducted real-world experiments to assess both the limitations of our algorithm and its potential applications for robotics. While the 3RScan dataset is based on real-world data and employs RGB-D and inertial measurement unit (IMU) data from a Google Tango cell phone, there are hurdles to directly utilizing the trained model in a real-time scenario. These challenges stem primarily from the requirement of offline post-processing to construct the segmented 3D point cloud from RGB-D and IMU inputs, and the reliance on manually cleaned segmentation. Moreover, the camera poses provided with the original data were generated offline, providing a higher degree of accuracy than online camera pose estimates.

For our real-world experiments, we used the Intel RealSense D435 and RealSense T265 sensors. The camera poses and trajectories were estimated online using RTAB-Map [24]. Similar to [1], we adopted the online segment fusion method from Tateno et al. [25]. This algorithm performs image segmentation on depth images and fuses point clouds given camera poses. In addition, while the model was trained on a GPU, we employed the Open Neural Network Exchange (ONNX) [26] in our real-world experiments to enhance inference speed and eliminate the need for a GPU. We conducted our scene graph generation experiments in two locations within a school building: an office area and a basement containing a kitchen. Using RGB-D images, camera poses, and online image segmentation, the algorithm predicted the class of each point cloud and the relationships between segments for each frame, subsequently fusing this information on the fly. The result was a constructed 3D point cloud with annotated scene graphs on segments.
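As one way to realize the GPU-free deployment mentioned above, the sketch below exports a toy stand-in network to ONNX and runs it with ONNX Runtime on the CPU; the model architecture, file name, input names, and feature sizes are placeholders rather than the actual KSGN.

```python
import torch
import torch.nn as nn
import onnxruntime as ort

# Toy stand-in for the trained network; the real model and input shapes differ.
model = nn.Sequential(nn.Linear(267, 128), nn.ReLU(), nn.Linear(128, 160))
dummy = torch.randn(1, 267)  # e.g. PointNet feature (256) + contextual vector (11)

# Export the PyTorch model to an ONNX file
torch.onnx.export(model, dummy, "toy_ksgn.onnx",
                  input_names=["feat"], output_names=["logits"], opset_version=11)

# CPU-only inference with ONNX Runtime
session = ort.InferenceSession("toy_ksgn.onnx", providers=["CPUExecutionProvider"])
logits = session.run(None, {"feat": dummy.numpy()})[0]
print(logits.shape)  # (1, 160)
```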

The results indicate that our model is particularly adept at detecting larger structures and furniture, such as walls, floors, cabinets, tables, and chairs. However, due to the sparsity of point clouds associated with smaller objects, segmenting and classifying these objects remains challenging. Our model operated in real-time, achieving more than 10 frames per second on average. The most frequently predicted accurate triplets are Wall-AttachedTo-Floor and Chair-AttachedTo-Floor.

Potential improvements include the utilization of a more refined image segmentation algorithm, which could result in better 3D point cloud segmentation, albeit with a potential trade-off in inference speed. Additionally, the depth images provided by RealSense are somewhat noisy, suggesting that implementing filtering algorithms could help smooth the depth data, thereby improving segmentation.

VII Conclusion

In this study, we used external knowledge sources from Visual Genome, ConceptNet, and WordNet to predict 3D scene graphs on point cloud data constructed from RGB-D images. Our findings reveal that the use of external knowledge enhances the accuracy of scene graph prediction and expedites model convergence. We also conducted real-world experiments with the algorithm, demonstrating its capability to generate scene graphs online in a cluttered environment.

Nonetheless, our method exhibits several limitations. The primary challenge in 3D scene graph prediction lies in the accuracy of object classification on point clouds. One potential avenue for improving object recognition could involve the use of RGB-D images projected onto point clouds. Another limitation is that we do not yet use common-sense knowledge regarding spatial and size relationships between objects. Given that point cloud data can be quite sparse for smaller objects, we could potentially leverage larger objects and room information to improve the classification accuracy of smaller objects. As a future research direction, we plan to investigate various types of common-sense knowledge that can be used for scene graph prediction.

References

  • [1] S.-C. Wu, J. Wald, K. Tateno, N. Navab, and F. Tombari, “Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7515–7525.
  • [2] C. Agia, K. M. Jatavallabhula, M. Khodeir, O. Miksik, V. Vineet, M. Mukadam, L. Paull, and F. Shkurti, “Taskography: Evaluating robot task planning over large 3d scene graphs,” in Conference on Robot Learning.   PMLR, 2022, pp. 46–58.
  • [3] S. Amiri, K. Chandan, and S. Zhang, “Reasoning with scene graphs for robot planning under partial observability,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5560–5567, 2022.
  • [4] C. Gomez, M. Fehr, A. Millane, A. C. Hernandez, J. Nieto, R. Barber, and R. Siegwart, “Hybrid topological and 3d dense mapping through autonomous exploration for large indoor environments,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 9673–9679.
  • [5] U.-H. Kim, J.-M. Park, T.-J. Song, and J.-H. Kim, “3d scene graph: A sparse and semantic representation of physical environments for intelligent agents,” IEEE transactions on cybernetics, vol. 50, no. 12, pp. 4921–4933, 2019.
  • [6] A. Nüchter and J. Hertzberg, “Towards semantic maps for mobile robots,” Robotics and Autonomous Systems, vol. 56, no. 11, pp. 915–926, 2008.
  • [7] N. Sünderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, “Meaningful maps with object-oriented semantic mapping,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 5079–5085.
  • [8] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017.
  • [9] H. Liu and P. Singh, “Conceptnet—a practical commonsense reasoning tool-kit,” BT technology journal, vol. 22, no. 4, pp. 211–226, 2004.
  • [10] G. A. Miller, WordNet: An electronic lexical database.   MIT press, 1998.
  • [11] A. Zareian, S. Karaman, and S.-F. Chang, “Bridging knowledge graphs to generate scene graphs,” in European conference on computer vision.   Springer, 2020, pp. 606–623.
  • [12] J. Wald, H. Dhamo, N. Navab, and F. Tombari, “Learning 3d semantic scene graphs from 3d indoor reconstructions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3961–3970.
  • [13] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi, “Neural motifs: Scene graph parsing with global context,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5831–5840.
  • [14] T. Chen, W. Yu, R. Chen, and L. Lin, “Knowledge-embedded routing network for scene graph generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6163–6171.
  • [15] J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling, “Scene graph generation with external knowledge and image reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1969–1978.
  • [16] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [17] I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese, “3d scene graph: A structure for unified semantics, 3d space, and camera,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5664–5673.
  • [18] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans,” Robotics: Science and Systems (RSS), 2020.
  • [19] N. Hughes, Y. Chang, and L. Carlone, “Hydra: A real-time spatial perception system for 3D scene graph construction and optimization,” Robotics: Science and Systems (RSS), 2022.
  • [20] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. [Online]. Available: http://www.aclweb.org/anthology/D14-1162
  • [21] J. Wald, A. Avetisyan, N. Navab, F. Tombari, and M. Nießner, “Rio: 3d object instance re-localization in changing indoor environments,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7658–7667.
  • [22] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei, “Visual relationship detection with language priors,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14.   Springer, 2016, pp. 852–869.
  • [23] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839.
  • [24] M. Labbé and F. Michaud, “Rtab-map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation,” Journal of Field Robotics, vol. 36, no. 2, pp. 416–446, 2019.
  • [25] K. Tateno, F. Tombari, and N. Navab, “Real-time and scalable incremental segmentation on dense slam,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2015, pp. 4465–4472.
  • [26] J. Bai, F. Lu, K. Zhang et al., “Onnx: Open neural network exchange,” https://github.com/onnx/onnx, 2019.