
EqvAfford: SE(3) Equivariance for Point-Level Affordance Learning

Yue Chen1∗      Chenrui Tie2,1∗      Ruihai Wu1,3∗      Hao Dong1,3†     
1CFCS, School of CS, PKU      2School of EECS, PKU     
3National Key Laboratory for Multimedia Information Processing, School of CS, PKU
Abstract

Humans perceive and interact with the world with an awareness of equivariance, which helps us manipulate different objects in diverse poses. Such equivariance also exists in many robotic manipulation scenarios. For example, whatever the pose of a drawer (translation, rotation, and tilt), the manipulation strategy is consistent (grasp the handle and pull in a straight line). Traditional models usually lack this awareness of equivariance, which may require more training data and lead to poor performance under novel object poses. We therefore propose the EqvAfford framework, with novel designs that guarantee equivariance in point-level affordance learning for downstream robotic manipulation, achieving strong performance and generalization on representative tasks with objects in diverse poses.

∗ Equal contribution. Yue Chen and Chenrui Tie’s order was determined by a coin flip. Ruihai Wu provided mentoring.
† Corresponding author.

1 Introduction

Symmetry is a fundamental aspect of our physical world, readily apparent in how we perceive and interact with it. In this work, we focus on object manipulation tasks with a certain spatial symmetry, where the manipulation strategy is equivariant to the object’s 6D pose. This kind of spatial symmetry is called SE(3) equivariance, where SE(3) is the group consisting of 3D translations and rotations.

We humans can naturally recognize and exploit SE(3) equivariance in object manipulation tasks. For example, the method of opening a drawer, pulling on its handle in a vertical direction, is intuitively understood to be effective regardless of the drawer’s position, orientation, or tilt. However, understanding SE(3) equivariance poses a significant challenge for neural networks. Consider training a neural network to identify the grasp point on a drawer’s point cloud: if the point cloud is rotated and translated, the model’s output could differ greatly, even though the manipulation point and strategy should stay consistent. This case underscores the limitations of data-driven approaches that lack an understanding of the physical world, leading to weak generalization across varied initial conditions. In this work, we develop a perception system designed to deduce potential affordance, which includes both the location of interaction and the manner of manipulation, by leveraging SE(3) equivariance to bolster the model’s generalization ability.

Figure 1: For each point on a 3D object, we predict an actionability (affordance) score and an interaction orientation to manipulate the object. By leveraging SE(3) equivariance, our method shows consistency and generalizes to different poses of the object.

Object manipulation stands as a pivotal area of research in robotics, where a key challenge lies in identifying the optimal interaction points (i.e., affordable point that indicates where to interact with) and executing appropriate actions (how to interact). We discover that a large portion of object manipulation tasks exhibit SE(3) equivariance. Notably, for tasks where the interaction is independent of the object’s absolute spatial pose, leveraging equivariance can significantly help. In the case of the drawer opening task, the optimal strategy remains consistent regardless of the drawer’s rotation: grasp the handle and exert a vertical pull outwards. Therefore, we propose a framework that leverages SE(3) equivariance to predict per-point affordance and downstream manipulation strategy on each point, with theoretical guarantees and great performance.

In summary, our main contributions are:

  • We propose to leverage the inherent equivariance for point-level affordance learning and downstream robotic manipulation tasks;

  • We propose a framework with novel designs and multiple modules that can infer affordance for object manipulation with the theoretical guarantee of equivariance;

  • Experiments show the superiority of our framework on manipulating objects with diverse and novel object poses.

2 Related Work

2.1 Learning Affordance for Robotic Manipulation

For manipulation, it is crucial to find suitable interaction points. Gibson proposed the concept of affordance, i.e., the interaction opportunities an object offers. Subsequent works leveraged passive observations as demonstrations, such as videos of agents interacting with scenes, and trained models to learn what and how to interact. This kind of method has been used to learn possible contact locations [1] or grasp patterns [6]. However, learning possible interactions from passive observations suffers from distribution shift. [17] proposed learning from interactions to obtain more informative samples for articulated object manipulation [10, 8]. Recent works expanded affordance to more diverse scenarios and tasks, such as articulated object manipulation [35, 25, 5], deformable object manipulation [16, 30, 28], clutter manipulation [14], bimanual manipulation [36], and zero-shot generalization to novel shapes [11, 12].

2.2 Leveraging SE(3) Equivariant Representations for Robot Manipulation

Many works have focused on building SE(3) equivariant representations from 3D inputs [2, 3, 13, 18, 26, 21, 34, 15, 20] to enable robust generalization to out-of-distribution inputs by ensuring inherent equivariance. This allows a model to automatically generalize to inputs at novel 6D poses without heavy data augmentation. Prior works have applied equivariant representations to diverse robot manipulation tasks [32, 22, 23, 29, 33, 24, 7, 9, 4, 19]. This work studies leveraging SE(3) equivariance for point-level affordance learning and, in turn, articulated object manipulation.

3 Problem Formulation

We tackle 3D articulated object manipulation tasks. Given a point cloud $x$ of a 3D articulated object (e.g., a cabinet with drawers and doors), our model predicts where and how to interact with the object. We define 6 types of short-term primitive actions, such as pushing and pulling, which are parameterized by the 6D pose of the robot gripper. For each primitive action, we predict the following for each visible point $p$ of the 3D articulated object.

  • An affordance score $a_p \in [0,1]$, measuring how likely the point $p$ should be interacted with in the specific task.

  • A 6D pose $R_p \in SO(3)$ to interact with the point $p$.

  • A success likelihood score $s_{R,p}$ for an action proposal $R_p$.

The affordance score $a_p$ should be invariant, while the proposed 6D pose $R_p$ should be equivariant, under any SE(3) transformation applied to the object. For example, in the “pulling drawer” task, if the cabinet is rotated or translated, $a_p$, which measures how likely the point $p$ should be interacted with, should remain invariant: no matter how the cabinet is placed, the correct way to open the drawer is always to interact with the handle. Meanwhile, the 6D interaction pose needs to change correspondingly with the pose change of the cabinet.
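
To make this transformation rule concrete, below is a minimal sketch (in PyTorch, with hypothetical tensor shapes) of how the per-point quantities should behave when an SE(3) transform $(R, t)$ is applied to the object: the affordance map is unchanged, while each interaction orientation composes with the rotation.

```python
import torch

def transform_labels(points, affordance, poses, R, t):
    """Apply an SE(3) transform (R, t) to an object and its per-point labels.

    points:     (N, 3) object point cloud
    affordance: (N,)   per-point affordance scores a_p    -> invariant
    poses:      (N, 3, 3) per-point interaction poses R_p -> equivariant
    R: (3, 3) rotation, t: (3,) translation
    """
    new_points = points @ R.T + t        # the observation moves with the object
    new_affordance = affordance.clone()  # "where to interact" does not change
    new_poses = R @ poses                # "how to interact" rotates with the object
    return new_points, new_affordance, new_poses
```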

4 Method

Figure 2: Overview of our proposed framework. Taking as input a point cloud of the object, our framework first outputs a per-point SE(3) invariant feature $f_p^i$ and SE(3) equivariant feature $f_p^e$. The invariant feature $f_p^i$ results in an affordance map invariant to object rotations, while the equivariant feature $f_p^e$ results in manipulation actions equivariant to object rotations.

4.1 Network Designs

As shown in Fig.2, our framework consists of four modules: VN-DGCNN Encoder, Affordance Prediction Module, Action Proposal Module, and Action Scoring Module.

The VN-DGCNN Encoder first encodes the object into per-point SE(3) invariant and equivariant features.

During inference, the Affordance Prediction Module predicts the point-level affordance score (whether a point is actionable for the task) and selects the manipulation point with the highest affordance score; the Action Proposal Module proposes many candidate gripper actions; and the Action Scoring Module rates the proposed actions and selects the best action for manipulation.
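
A minimal sketch of this inference flow, assuming hypothetical `encoder`, `affordance_head`, `proposal_head`, and `scoring_head` modules (the proposal head is assumed to be stochastic so it can be sampled repeatedly, as in Where2Act):

```python
import torch

@torch.no_grad()
def infer_interaction(points, encoder, affordance_head, proposal_head, scoring_head,
                      num_proposals=100):
    """Select where and how to interact; all module names are placeholders."""
    f_inv, f_eqv = encoder(points)              # per-point invariant / equivariant features
    affordance = affordance_head(f_inv)         # (N,) actionability scores
    p = int(torch.argmax(affordance))           # manipulation point with highest affordance
    proposals = [proposal_head(f_eqv[p]) for _ in range(num_proposals)]   # candidate gripper poses
    scores = torch.stack([scoring_head(f_inv[p], R) for R in proposals])  # rate each candidate
    best = int(torch.argmax(scores))
    return points[p], proposals[best]           # contact point and gripper orientation
```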

During training, we first collect diverse interactions with their corresponding outcomes to train the Action Scoring Module and Action Proposal Module; the Affordance Prediction Module is then trained to aggregate the average scores that the Action Scoring Module assigns to actions proposed by the Action Proposal Module.

SE(3) Equivariant Encoder. For each point of the input point cloud, we extract an SE(3) invariant feature $f_p^i$ and an SE(3) equivariant feature $f_p^e$. Here we use the VN-DGCNN segmentation network [2] to extract the features. Denote the number of points in the input point cloud as $N$ and the feature dimension as $d$. As [2] uses vectors to represent neurons, $f_p^i, f_p^e \in \mathbb{R}^{d\times 3}$. For each point $p \in \mathbb{R}^3$ in the input point cloud $x \in \mathbb{R}^{N\times 3}$, we have

$\mathcal{E}_i(p) = f_p^i$   (1)
$\mathcal{E}_e(p) = f_p^e$   (2)

When the point cloud $x$ is transformed by $T \in SE(3)$, we have

$\mathcal{E}_i(T(p)) = f_p^i$   (3)
$\mathcal{E}_e(T(p)) = T(f_p^e)$   (4)
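
As a sanity check of properties (3)-(4), one can compare the features of an object before and after a random rigid transform. The snippet below is a sketch under the assumption that the encoder returns per-point vector-neuron features of shape $(N, d, 3)$ and handles translation by centering the input.

```python
import torch

def random_rotation():
    """Sample a random 3D rotation via QR decomposition of a Gaussian matrix."""
    Q, _ = torch.linalg.qr(torch.randn(3, 3))
    if torch.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]                      # flip a column to get det = +1
    return Q

def check_encoder_equivariance(encoder, points, atol=1e-4):
    """Eq. (3): invariant features stay fixed. Eq. (4): equivariant features rotate."""
    R, t = random_rotation(), torch.randn(3)
    f_inv, f_eqv = encoder(points)              # each (N, d, 3) vector-neuron features
    g_inv, g_eqv = encoder(points @ R.T + t)
    assert torch.allclose(g_inv, f_inv, atol=atol)         # Eq. (3)
    assert torch.allclose(g_eqv, f_eqv @ R.T, atol=atol)   # Eq. (4), rotation part
```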

Affordance Prediction Module. Our affordance prediction module, similar to Where2Act [17], is an MLP. It takes the SE(3) invariant feature $f_p^i$ of each point and outputs a per-point affordance score $a_p \in [0,1]$. As the input is invariant to any translation and rotation, the affordance $a_p$ is SE(3) invariant. This means that regardless of the position and orientation of the input object, we always prefer to interact with the same part of the object (such as the handle when opening a drawer). Formally,

$a_p = D_a(f_p^i)$   (5)
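
A sketch of $D_a$, under the assumption that the invariant vector-neuron feature $f_p^i \in \mathbb{R}^{d\times 3}$ is flattened before the MLP; the hidden width and $d$ are illustrative, not the paper's exact values.

```python
import torch.nn as nn

class AffordanceHead(nn.Module):
    """MLP D_a: invariant per-point feature -> affordance score a_p in [0, 1]."""
    def __init__(self, d=128, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),       # squash to [0, 1]
        )

    def forward(self, f_inv):                         # f_inv: (N, d, 3)
        return self.mlp(f_inv.flatten(1)).squeeze(-1)  # (N,) affordance scores
```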

Action Pose Proposal Module. For each point $p$, this module takes the SE(3) equivariant feature $f_p^e$ as input. We employ the VN-DGCNN classification network [2] to implement it, which is equivariant to any SE(3) transformation. The network $D_r$ predicts a gripper end-effector pose as a rotation matrix, which is equivariant to the input point cloud.

$R_p = D_r(f_p^e)$   (6)
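
One common way to turn equivariant vector outputs into a valid rotation matrix is Gram-Schmidt orthonormalization of two predicted 3D vectors: if the two vectors rotate with the object, the resulting frame does too. Whether the paper uses exactly this construction is not stated, so treat the sketch below as an illustrative assumption.

```python
import torch

def vectors_to_rotation(v1, v2):
    """Build a rotation matrix from two (equivariant) 3D vectors via Gram-Schmidt."""
    e1 = v1 / v1.norm()
    u2 = v2 - (e1 @ v2) * e1                 # remove the component of v2 along e1
    e2 = u2 / u2.norm()
    e3 = torch.linalg.cross(e1, e2)          # completes a right-handed orthonormal frame
    return torch.stack([e1, e2, e3], dim=-1)  # columns are the frame axes
```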

Action Scoring Module. For a proposed interaction orientation $R$ at point $p$, we use an MLP to evaluate a success likelihood $s_{R,p} \in [0,1]$ for this interaction. This score can be used to filter low-rated proposals or to select the proposals that are more likely to succeed. Given an invariant feature $f_p^i$ of point $p$ and a proposed interaction orientation $R$, this module outputs a scalar $s_{R,p} \in [0,1]$. Namely,

$s_{R,p} = D_s(f_p^i, R)$   (7)

At test time, an interaction is considered positive when $s_{R,p} > 0.5$. Due to the page limit, more details of point-level affordance designs can be found in [17, 27].
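
A sketch of $D_s$ in which the proposed orientation is simply flattened and concatenated with the flattened invariant feature; the paper's exact fusion (and how $R$ is expressed so that the score stays invariant, e.g., relative to an object frame) is not detailed here, so those choices are assumptions.

```python
import torch
import torch.nn as nn

class ActionScoringHead(nn.Module):
    """MLP D_s: (invariant feature f_p^i, proposed orientation R) -> s_{R,p} in [0, 1]."""
    def __init__(self, d=128, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d * 3 + 9, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, f_inv, R):                      # f_inv: (d, 3), R: (3, 3)
        x = torch.cat([f_inv.flatten(), R.flatten()])
        return self.mlp(x).squeeze(-1)                # scalar success likelihood

# At test time a proposal is kept as positive when the returned score exceeds 0.5.
```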

4.2 Data Collection

Following [17], we adopt a learning-from-interaction method to collect training data. In the SAPIEN [31] simulator, we place a random articulated object at the center of the scene and use a flying parallel gripper to interact with the object at a specific point $p$ and orientation $R$. We tackle six types of primitive actions and pre-program a short trajectory for each. Once $(p, R)$ is specified, we can try the different pre-programmed short trajectories to interact with the object.

We use both online and offline methods to collect training data. In offline collection, we randomly choose an object and load it into the simulator. We then sample an interaction point $p$ and an interaction orientation $R$ and roll out to see whether the gripper can complete the task.

As offline data collection is inefficient, we also add online adaptive data collection. When training the action scoring module $D_s$ with a data pair $(p, R)$, we infer the score predictions $s_{R,p}$ for all points on the articulated object. We then sample the point $p^*$ with the highest score and conduct an additional interaction at point $p^*$ with orientation $R$.
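
A sketch of this online adaptive step; `simulate_interaction` stands in for rolling out the pre-programmed trajectory in SAPIEN, and the per-point loop over `scoring_head` is illustrative.

```python
import torch

def adaptive_sample(points, f_inv, scoring_head, R, simulate_interaction):
    """Online adaptive collection: interact where the current D_s is most confident.

    simulate_interaction(p, R) is a placeholder returning 1 (success) or 0 (failure).
    """
    with torch.no_grad():
        scores = torch.stack([scoring_head(f, R) for f in f_inv])  # s_{R,p} for every point
    p_star = int(torch.argmax(scores))            # point the current model trusts most
    result = simulate_interaction(points[p_star], R)
    return points[p_star], R, result              # a new (p*, R, r) training triple
```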

4.3 Training and Losses

Our loss function is a combination of the following three terms.

Action Scoring Loss. For a batch of data points $(p, R, r)$, where $r$ is the ground-truth interaction result ($1$ for positive, $0$ for negative), we train the action scoring module $D_s$ with the standard binary cross-entropy loss.

Action Proposal Loss. For a batch of data points $(p, R, r)$, we train the action pose proposal module $D_r$ on positive data points by minimizing the geodesic distance between the predicted orientation $R_p$ and the ground-truth orientation $R$.

Affordance Prediction Loss. We train the affordance prediction module $D_a$ by supervising each point's score with the expected success rate of executing a random orientation proposed by $D_r$. Instead of rolling out these proposals in the simulator, we directly use the action scoring module $D_s$ to evaluate whether an interaction would succeed.
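
The three terms can be written roughly as below; this is a sketch assuming batched heads $D_s$, $D_r$ and a stochastic proposal module that can be sampled several times per point, with all shapes illustrative.

```python
import torch
import torch.nn.functional as F

def geodesic_distance(R_pred, R_gt):
    """Geodesic distance between batches of rotation matrices."""
    cos = (torch.einsum("bii->b", R_pred.transpose(1, 2) @ R_gt) - 1.0) / 2.0
    return torch.acos(cos.clamp(-1.0 + 1e-6, 1.0 - 1e-6))

def training_losses(f_inv, f_eqv, R_gt, r_gt, a_pred, D_s, D_r, num_proposals=10):
    """f_inv, f_eqv: batched per-point features; R_gt: (B, 3, 3); r_gt: (B,) in {0, 1}."""
    # 1) Action scoring: binary cross-entropy against the interaction outcome r.
    loss_score = F.binary_cross_entropy(D_s(f_inv, R_gt), r_gt.float())
    # 2) Action proposal: geodesic distance, on positive samples only.
    pos = r_gt.bool()
    loss_proposal = geodesic_distance(D_r(f_eqv[pos]), R_gt[pos]).mean()
    # 3) Affordance: regress a_p to the expected success of sampled proposals, judged by D_s.
    with torch.no_grad():
        target = torch.stack([D_s(f_inv, D_r(f_eqv)) for _ in range(num_proposals)]).mean(0)
    loss_afford = F.mse_loss(a_pred, target)
    return loss_score + loss_proposal + loss_afford
```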

5 Experiment

We use the SAPIEN [31] simulator with a realistic physics engine to evaluate our framework, and compare against Where2Act [17] under the same setting for a fair comparison.

We select 15 object categories from the PartNet-Mobility dataset [31] to conduct our experiments and report quantitative results on four commonly seen tasks: pushing and pulling doors and drawers. We train our method and the baseline on training shapes, and report performance on test shapes from the training categories and on shapes from the test categories, to evaluate generalization to novel shapes and unseen categories.

Figure 3: We visualize the action scoring and action proposal predictions over the movable parts. We show the qualitative results of our method and baseline on two experiment settings across four objects. Our method is robust to rotation of the input point cloud.
                    Where2Act                        EqvAfford
                    z                SO(3)           z                SO(3)
Pushing Door        86.55 / 80.42    56.74 / 53.44   86.49 / 81.03    86.79 / 81.55
Pulling Door        64.29 / 59.87    56.39 / 51.82   63.81 / 58.55    65.70 / 62.65
Pushing Drawer      86.49 / 86.99    58.31 / 48.49   87.89 / 87.09    88.81 / 87.76
Pulling Drawer      67.05 / 60.35    58.83 / 52.15   67.74 / 61.01    68.22 / 61.96
Table 1: Quantitative evaluation of the learned point-level affordance using F1-score. The numbers correspond to test shapes from the training categories (before slash) and the test categories (after slash). Higher numbers indicate better performance.
                    Where2Act                        EqvAfford
                    z                SO(3)           z                SO(3)
Pushing Door        83.91 / 81.49    38.44 / 37.78   83.84 / 82.80    85.51 / 83.96
Pulling Door        62.76 / 57.28    11.95 / 11.77   62.06 / 56.98    62.78 / 59.23
Pushing Drawer      78.39 / 74.89    35.37 / 29.45   79.58 / 76.02    80.16 / 77.10
Pulling Drawer      68.26 / 64.45    12.65 / 12.26   68.88 / 65.96    69.16 / 67.10
Table 2: We report task success rates of manipulating articulated objects. We evaluate our method and baseline in four tasks. The numbers correspond to test shapes from both the training categories (before slash) and the test categories (after slash).

Results and Analysis. To demonstrate that our method preserves the inherent equivariance, we designed two test settings, $z$ and $SO(3)$, where $z$ places objects in an aligned orientation and generates novel poses with rotations only about the z-axis, and $SO(3)$ applies arbitrary rotations. Tables 1 and 2 provide a quantitative comparison of our method with the Where2Act baseline, showing that EqvAfford is the most effective in both affordance learning and downstream robotic manipulation. In Table 1, our method adeptly captures invariant geometric features, outperforming Where2Act, whose features fluctuate with rotational poses. We visualize the predicted action scores in Fig. 3: given objects in different poses, our method extracts invariant geometric features and action-equivariant predictions, whereas the baseline fails to achieve a comparable level of performance. Table 2 further underscores this superiority, showing higher task success rates in manipulating articulated objects. Lastly, our model exhibits enhanced generalization, adapting to unseen novel object categories more effectively than the baseline.
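
For reference, the two test settings can be generated as follows (a sketch; the exact sampling used in the experiments is an assumption): the $z$ setting rotates objects only about the vertical axis, while the $SO(3)$ setting applies arbitrary rotations.

```python
import math
import torch

def random_z_rotation():
    """'z' setting: rotate the object only about the vertical (z) axis."""
    theta = torch.rand(()).item() * 2 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s, 0.0],
                         [s,  c, 0.0],
                         [0.0, 0.0, 1.0]])

def random_so3_rotation():
    """'SO(3)' setting: arbitrary 3D rotation, via QR of a Gaussian matrix."""
    Q, _ = torch.linalg.qr(torch.randn(3, 3))
    if torch.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]                    # ensure a proper rotation (det = +1)
    return Q
```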

6 Conclusion

We propose a novel framework, EqvAfford, leveraging SE(3) equivariance for affordance learning and downstream robotic manipulation, with novel designs that theoretically guarantee equivariance. By predicting per-point SE(3) invariant affordance and equivariant interaction orientations, our method generalizes well to diverse object poses. Experiments on affordance learning and robotic manipulation showcase our method qualitatively and quantitatively.

References

  • Brahmbhatt et al. [2019] Samarth Brahmbhatt, Cusuh Ham, Charles C Kemp, and James Hays. Contactdb: Analyzing and predicting grasp contact via thermal imaging. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8709–8719, 2019.
  • Deng et al. [2021] Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacchi, and Leonidas J Guibas. Vector neurons: A general framework for so (3)-equivariant networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12200–12209, 2021.
  • Fuchs et al. [2020] Fabian Fuchs, Daniel Worrall, Volker Fischer, and Max Welling. Se (3)-transformers: 3d roto-translation equivariant attention networks. Advances in neural information processing systems, 33:1970–1981, 2020.
  • Gao et al. [2024] Chongkai Gao, Zhengrong Xue, Shuying Deng, Tianhai Liang, Siqi Yang, Lin Shao, and Huazhe Xu. Riemann: Near real-time se(3)-equivariant robot manipulation without point cloud segmentation, 2024.
  • Geng et al. [2024] Haoran Geng, Songlin Wei, Congyue Deng, Bokui Shen, He Wang, and Leonidas Guibas. Sage: Bridging semantic and actionable parts for generalizable manipulation of articulated objects, 2024.
  • Hamer et al. [2010] Henning Hamer, Juergen Gall, Thibaut Weise, and Luc Van Gool. An object-dependent hand pose prior from sparse training data. In CVPR, 2010.
  • Hu et al. [2024] Boce Hu, Xupeng Zhu, Dian Wang, Zihao Dong, Haojie Huang, Chenghao Wang, Robin Walters, and Robert Platt. Orbitgrasp: se(3)-equivariant grasp learning, 2024.
  • Hu et al. [2017] Ruizhen Hu, Wenchao Li, Oliver Van Kaick, Ariel Shamir, Hao Zhang, and Hui Huang. Learning to predict part mobility from a single static snapshot. ACM Transactions on Graphics (TOG), 36(6):1–13, 2017.
  • Huang et al. [2022] Haojie Huang, Dian Wang, Xupeng Zhu, Robin Walters, and Robert Platt. Edge grasp network: A graph-based se(3)-invariant approach to grasp detection, 2022.
  • Jain et al. [2021] Ajinkya Jain, Rudolf Lioutikov, Caleb Chuck, and Scott Niekum. Screwnet: Category-independent articulation model estimation from depth images using screw theory. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13670–13677. IEEE, 2021.
  • Ju et al. [2024] Yuanchen Ju, Kaizhe Hu, Guowei Zhang, Gu Zhang, Mingrun Jiang, and Huazhe Xu. Robo-abc: Affordance generalization beyond categories via semantic correspondence for robot manipulation. ECCV 2024, 2024.
  • Kuang et al. [2024] Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, and Yue Wang. Ram: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation, 2024.
  • Lei et al. [2023] Jiahui Lei, Congyue Deng, Karl Schmeckpeper, Leonidas Guibas, and Kostas Daniilidis. Efem: Equivariant neural field expectation maximization for 3d object segmentation without scene supervision. In CVPR, 2023.
  • Li et al. [2024] Yitong Li, Ruihai Wu, Haoran Lu, Chuanruo Ning, Yan Shen, Guanqi Zhan, and Hao Dong. Broadcasting support relations recursively from local dynamics for object retrieval in clutters. In Robotics: Science and Systems, 2024.
  • Liu et al. [2023] Yunze Liu, Junyu Chen, Zekai Zhang, Jingwei Huang, and Li Yi. Leaf: Learning frames for 4d point cloud sequence understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 604–613, 2023.
  • Lu et al. [2024] Haoran Lu, Yitong Li, Ruihai Wu, Chuanruo Ning, Yan Shen, and Hao Dong. Unigarment: A unified simulation and benchmark for garment manipulation. ICRA Workshop on Deformable Object Manipulation, 2024.
  • Mo et al. [2021] Kaichun Mo, Leonidas J Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2act: From pixels to actions for articulated 3d objects. In ICCV, 2021.
  • Ryu et al. [2023] Hyunwoo Ryu, Hong in Lee, Jeong-Hoon Lee, and Jongeun Choi. Equivariant descriptor fields: Se(3)-equivariant energy-based models for end-to-end visual robotic manipulation learning. In The Eleventh International Conference on Learning Representations, 2023.
  • Ryu et al. [2024] Hyunwoo Ryu, Jiwoo Kim, Hyunseok An, Junwoo Chang, Joohwan Seo, Taehan Kim, Yubin Kim, Chaewon Hwang, Jongeun Choi, and Roberto Horowitz. Diffusion-edfs: Bi-equivariant denoising generative modeling on se(3) for visual robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18007–18018, 2024.
  • Scarpellini et al. [2024] Gianluca Scarpellini, Stefano Fiorini, Francesco Giuliari, Pietro Morerio, and Alessio Del Bue. Diffassemble: A unified graph-diffusion model for 2d and 3d reassembly, 2024.
  • Seo et al. [2023] Joohwan Seo, Nikhil Potu Surya Prakash, Xiang Zhang, Changhao Wang, Jongeun Choi, Masayoshi Tomizuka, and Roberto Horowitz. Robot manipulation task learning by leveraging se (3) group invariance and equivariance. arXiv preprint arXiv:2308.14984, 2023.
  • Simeonov et al. [2022] Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, and Vincent Sitzmann. Neural descriptor fields: Se (3)-equivariant object representations for manipulation. In 2022 International Conference on Robotics and Automation (ICRA), pages 6394–6400. IEEE, 2022.
  • Simeonov et al. [2023] Anthony Simeonov, Yilun Du, Yen-Chen Lin, Alberto Rodriguez Garcia, Leslie Pack Kaelbling, Tomás Lozano-Pérez, and Pulkit Agrawal. Se (3)-equivariant relational rearrangement with neural descriptor fields. In Conference on Robot Learning, pages 835–846. PMLR, 2023.
  • Wang et al. [2024a] Dian Wang, Stephen Hart, David Surovik, Tarik Kelestemur, Haojie Huang, Haibo Zhao, Mark Yeatman, Jiuguang Wang, Robin Walters, and Robert Platt. Equivariant diffusion policy, 2024a.
  • Wang et al. [2024b] Junbo Wang, Wenhai Liu, Qiaojun Yu, Yang You, Liu Liu, Weiming Wang, and Cewu Lu. Rpmart: Towards robust perception and manipulation for articulated objects, 2024b.
  • Weng et al. [2023] Thomas Weng, David Held, Franziska Meier, and Mustafa Mukadam. Neural grasp distance fields for robot manipulation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 1814–1821. IEEE, 2023.
  • Wu et al. [2022] Ruihai Wu, Yan Zhao, Kaichun Mo, Zizheng Guo, Yian Wang, Tianhao Wu, Qingnan Fan, Xuelin Chen, Leonidas Guibas, and Hao Dong. Vat-mart: Learning visual action trajectory proposals for manipulating 3d articulated objects. ICLR, 2022.
  • Wu et al. [2023a] Ruihai Wu, Chuanruo Ning, and Hao Dong. Learning foresightful dense visual affordance for deformable object manipulation. In IEEE International Conference on Computer Vision (ICCV), 2023a.
  • Wu et al. [2023b] Ruihai Wu, Chenrui Tie, Yushi Du, Yan Zhao, and Hao Dong. Leveraging se (3) equivariance for learning 3d geometric shape assembly. In ICCV, 2023b.
  • Wu et al. [2024] Ruihai Wu, Haoran Lu, Yiyan Wang, Yubo Wang, and Hao Dong. Unigarmentmanip: A unified framework for category-level garment manipulation via dense visual correspondence. CVPR, 2024.
  • Xiang et al. [2020] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020.
  • Xue et al. [2023] Zhengrong Xue, Zhecheng Yuan, Jiashun Wang, Xueqian Wang, Yang Gao, and Huazhe Xu. Useek: Unsupervised se (3)-equivariant 3d keypoints for generalizable manipulation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 1715–1722. IEEE, 2023.
  • Yang et al. [2023] Jingyun Yang, Congyue Deng, Jimmy Wu, Rika Antonova, Leonidas Guibas, and Jeannette Bohg. Equivact: Sim (3)-equivariant visuomotor policies beyond rigid object manipulation. arXiv preprint arXiv:2310.16050, 2023.
  • Yang et al. [2024] Jingyun Yang, Zi ang Cao, Congyue Deng, Rika Antonova, Shuran Song, and Jeannette Bohg. Equibot: Sim(3)-equivariant diffusion policy for generalizable and data efficient learning, 2024.
  • Yuan et al. [2024] Chengbo Yuan, Chuan Wen, Tong Zhang, and Yang Gao. General flow as foundation affordance for scalable robot learning, 2024.
  • Zhao et al. [2023] Yan Zhao, Ruihai Wu, Zhehuan Chen, Yourong Zhang, Qingnan Fan, Kaichun Mo, and Hao Dong. Dualafford: Learning collaborative visual affordance for dual-gripper manipulation. ICLR, 2023.