
DiffCloud: Real-to-Sim from Point Clouds with
Differentiable Simulation and Rendering of Deformable Objects

Priya Sundaresan1†, Rika Antonova1∗, and Jeannette Bohg1

1Department of Computer Science, Stanford University, Stanford, CA 94305, USA. {priyasun, rika.antonova, bohg}@stanford.edu

†P. Sundaresan was supported by the NSF Graduate Research Fellowship. ∗Supported by the National Science Foundation grant No. 2030859 to the Computing Research Association for the CIFellows Project. This project was supported in part by a research award from Meta. The authors also thank Krishna Murthy Jatavallabhula for helpful discussions.
Abstract

Research in manipulation of deformable objects is typically conducted on a limited range of scenarios, because handling each scenario on hardware takes significant effort. Realistic simulators with support for various types of deformations and interactions have the potential to speed up experimentation with novel tasks and algorithms. However, for highly deformable objects it is challenging to align the output of a simulator with the behavior of real objects. Manual tuning is not intuitive, hence automated methods are needed. We view this alignment problem as a joint perception-inference challenge and demonstrate how to use recent neural network architectures to successfully perform simulation parameter inference from real point clouds. We analyze the performance of various architectures, comparing their data and training requirements. Furthermore, we propose to leverage differentiable point cloud sampling and differentiable simulation to significantly reduce the time to achieve the alignment. We employ an efficient way to propagate gradients from point clouds to simulated meshes and further through to the physical simulation parameters, such as mass and stiffness. Experiments with highly deformable objects show that our method can achieve comparable or better alignment with real object behavior, while reducing the time needed to achieve this by more than an order of magnitude. Videos and supplementary material are available at diffcloud.github.io.

I Introduction

We consider the real-to-sim problem of inferring parameters of general-purpose simulators from real observations such that the gap between reality and simulation is reduced [1, 2, 3, 4]. A common approach to solving this problem is to train an inverse model on data generated with a black-box (non-differentiable) simulator using a wide range of parameter settings. The input to such models usually consists of trajectories of a low-dimensional state of the system, e.g. position and orientation of rigid objects in the scene. The outputs are parameters such as mass, friction, and other physical object properties.

The state of a highly deformable object cannot be captured by only its position and orientation, since the object deforms during motion. Hence, we represent objects with point clouds as observed through depth cameras. Recent neural network architectures, such as PointNet++ [5] and MeteorNet [6], are well suited for learning to process point clouds. As we will show, they can yield inverse models that offer a viable solution to the challenging task of real-to-sim for deformables from point clouds. Nonetheless, data collection and training for these can be computationally demanding.

Figure 1: Experimental setup. We execute deformable manipulation trajectories using a Kinova Gen3 arm. We post-process observations recorded from two stereo depth cameras (Intel D435) to generate merged point clouds with the robot arm masked from view. These observations are fed to our proposed method DiffCloud for real-to-sim parameter estimation.

In this work, we propose an alternative approach that employs a differentiable simulator to allow adjusting simulation parameters directly via gradient descent without the need for dataset collection and pre-training. Our approach combines differentiable point cloud rendering and differentiable simulation to bring the behavior of simulated, highly deformable objects closer to that of real objects. We instantiate a scene with a simulated object and create an end-to-end differentiable pipeline that lets us seamlessly propagate the gradients from real point clouds to the low-level physical simulation parameters. For highly deformable objects, even small changes in these parameters can have a significant impact on the behavior of the object. We show that establishing end-to-end differentiability yields a faster alignment between simulation and reality, compared to training inverse models with a black-box (i.e. gradient-free) treatment of the simulator.

We conduct a set of experiments where a robot manipulates highly deformable real objects, such as cloth and paper towels. We show that our approach successfully infers simulation parameters, such as mass and stiffness, making the behavior of simulated deformables match the real ones. In simulation experiments, we explore interactions of the deformables with rigid objects: stretching a band on a pole, and hanging a vest onto a rigid pole. Overall, our experiments show that we can obtain similar or better alignment between the simulated and the target object, compared to the inverse model baselines. The major benefit of our approach is that it obtains such an alignment after, on average, 10 minutes of direct gradient descent, replacing 2.5–5.5 hours of data collection and training for the inverse models.

II Background

II-A Real-to-Sim for Deformable Objects

Many approaches in deformable object manipulation are limited to specific scenarios due to the complexity of real hardware setups [7, 8, 9, 10, 11, 12]. Furthermore, there is a lack of easily tunable yet realistic simulators that support deformables and could aid experimentation with novel tasks. Hence, automated ways to find simulation parameters are needed to make the behavior of simulated deformables resemble that of their real-world counterparts. In robotics, this real-to-sim problem has been extensively studied for rigid objects [13, 14, 15, 16, 17, 18]. However, these methods assume access to a low-dimensional state, such as object poses. Recent surveys have reviewed learning-based approaches for manipulation and perception of deformables [19, 20, 21], including methods for tracking and registration. However, such methods either employ markers [22], assume a known and simplified deformation model [23], or show applicability only to specific cases, e.g. a rope lying flat on a surface [24]. As of now, there is no generic tracking or registration approach for deformables that is robust in a wide range of scenarios.

II-B Inverse Models for Real-to-Sim

One approach towards real-to-sim is to learn an inverse model that maps from a sequence of sensor observations to simulation parameters [25]. In robotics, several earlier works explored using inverse models for highly deformable objects. For example, in [26] the authors propose a piecewise linear elastic material model for cloth and fit the material model to real cloths by applying controlled forces and measuring the deformation response. This results in a paired dataset of cloth types and estimated material parameters, but limits generalization to unseen cloths without additional physical measurements. In [27], the authors propose using the dataset of [26] as a source of supervision for training a network to regress material parameters from RGB videos of cloth. This supervised training approach relies heavily on domain randomization to achieve generalization to cloths with unseen colors or patterns. Additionally, the proposed regression framework discretizes the space of outputs to deal with the high-dimensional model of cloth deformation, limiting the range of cloth types that can ultimately be accurately modelled. In this work, we use a sequence of depth images (converted to point clouds) as input to our models. Using point clouds instead of RGB images allows us to circumvent the need for extensive domain randomization and for contrasting textures (for foreground/background segmentation). We adopt established methods for processing point clouds, such as PointNet/PointNet++ [28, 5] and MeteorNet [6], and train them to regress to continuous simulation parameters.

II-C Differentiable Rendering and Differentiable Simulation

Differentiable rendering of images allows using 2D observations to inform 3D understanding, by propagating gradients from images to illumination models, 3D objects in the scene, and camera pose. Various works have explored making the rasterization process (coloring image pixels based on the assignment of triangular mesh faces) fully differentiable, including DIB-R [29] and SoftRAS [30]. Other works explore making the processes of ray tracing [31] and volumetric rendering [32, 33] differentiable. Toolkits such as Kaolin [34] and PyTorch3D [35] provide batched implementations of differentiable rasterization, allowing a significant speedup of these computationally expensive processes. In this work, we aim to perform correspondence-free alignment of simulated object states, such as meshes, with real visual observations. While images are one such observation space to bridge the gap, achieving close alignment between images that are differentiably rendered from simulated meshes and real images still remains an open challenge due to the range of colors, textures, and fine details present in real images. Thus, we use point clouds to represent objects and build on PyTorch3D [35], which offers a way to differentiably sample a point cloud from a given mesh such that gradients from losses on the point clouds propagate to the mesh vertices. Furthermore, the PyTorch3D library exposes efficient data structures and CUDA-enabled batched operations for 3D meshes and point clouds, allowing seamless integration into simulator backends that support PyTorch.

Figure 2: Overview of DiffCloud: the proposed method for real-to-sim parameter estimation from point clouds. DiffCloud combines differentiable point cloud sampling with a differentiable mesh-based simulator to propagate losses computed between simulated (blue) and real (green, with red noise artifacts) point clouds to the underlying simulation properties. We visualize updating mass and stiffness of a simulated cloth lifted off of a table.

Differentiable simulation is a promising paradigm that allows adapting simulators by propagating gradients of the simulator outputs w.r.t. the low-level physical simulation parameters. Recent differentiable simulators with support for deformables include [36, 37, 38, 39, 40, 41, 42, 43, 44, 45]. Most of these only support a limited set of interactions and types of deformation. For example, out of the above, only [36] (a differentiable version of ARCSim [46]) supports arbitrary meshes and modeling interactions of rigid and deformable objects. From the simulators listed above, only gradSim [37] provides a combination of differentiable RGB rendering and simulation. This approach is conceptually related to our work, but is still substantially different, since it does not handle point clouds, does not support interactions between rigid and deformable objects, and presents only sim-to-sim results with simple simulated images and plain textures. The differentiable simulation framework proposed in [47] shows results on inferring deformation properties of a real object from point clouds. We do not build on this approach directly, because it has several key limitations. First, there are limitations in the underlying simulator: the model is aimed at objects with low-to-medium levels of deformability, such as tight plush toys and pillows. In contrast, we are interested in highly deformable objects, such as cloth, garments, and stretchable bands. Second, there are limitations in the loss formulation: the point cloud loss proposed in [47] relies on a simple diffusion of the observed deformation into a deformation field. This is not suitable for the case of highly deformable objects that rapidly and drastically change shape under gravity and upon interacting with other objects. Finally, this prior work is geared towards analyzing cases with passive dynamics of a single object falling and colliding with the table or bending under gravity; the simulator also lacks support for modeling frictional contacts. We are instead interested in actuating a robot to perform manipulation with highly deformable objects that can also interact in non-trivial ways with rigid objects in the scene. We found that [36] is the only available simulator that readily supports such functionality, hence we incorporate it into our approach.

III Our Approach : DiffCloud

Our objective is to discover the physical simulation parameters (e.g. stiffness, mass, friction) that would cause the behavior of the simulated deformable objects to match the behavior of observed real objects. We start by recording a sequence of point cloud observations of the scene, where a robot manipulates a deformable object. Then, we construct a simulated environment in a differentiable simulator that can load meshes of deformable and rigid objects in the scene and simulate their interactions. To obtain simulated point clouds, we implement a differentiable point cloud sampler: we sample points on the surface of the simulated objects using PyTorch3D [35]. By defining a differentiable loss between the simulated and real point cloud sequences, we can backpropagate all the way to the simulation parameters. Figure 2 gives an overview of our approach.

In this work, we assume that the initial geometry of objects in the scene is known. This allows us to initialize the simulation start state to match the real scene. We also assume that the location of the grasp is known, so we can grasp the object in the simulated scene in the same way as in reality (with a simulated gripper or gripper anchors). With that, we can control the simulated gripper (or anchors) along the trajectory of the end effector that was recorded in reality.

III-A Loss Definition

One candidate for a loss function on point clouds is the Chamfer distance:

$d_{\text{Chamf}}(\mathcal{P}^{a},\mathcal{P}^{b}) = \tfrac{1}{|\mathcal{P}^{a}|}\sum_{\boldsymbol{x}^{a}\in\mathcal{P}^{a}}\min_{\boldsymbol{x}^{b}\in\mathcal{P}^{b}}\|\boldsymbol{x}^{a}-\boldsymbol{x}^{b}\|_{2}^{2} + \tfrac{1}{|\mathcal{P}^{b}|}\sum_{\boldsymbol{x}^{b}\in\mathcal{P}^{b}}\min_{\boldsymbol{x}^{a}\in\mathcal{P}^{a}}\|\boldsymbol{x}^{b}-\boldsymbol{x}^{a}\|_{2}^{2}. \quad (1)$

Here, $\mathcal{P}^{a},\mathcal{P}^{b}$ denote two point clouds; $\boldsymbol{x}^{a},\boldsymbol{x}^{b}$ are 3D points in $\mathcal{P}^{a}$ and $\mathcal{P}^{b}$, respectively. This distance metric is frequently used to compare the alignment between two point clouds that are either complete (sampled from meshes) or are generated by perception with the same camera perspective and noise properties. In our case of using low-cost depth sensors, the challenge of constructing the loss function on the real and simulated point clouds is that we need to avoid paying attention to the noise artifacts in the real point cloud. Statistical outlier filtering and post-processing techniques can alleviate noise, but some amount is likely to remain. For example, see the red patches highlighted in the real point cloud in Figure 1. Our insight is that a loss that disregards noise artifacts can be obtained by using a unidirectional Chamfer distance. This yields a loss that relieves the pressure for the simulated point clouds to match the noisy parts of the real point clouds:

$d_{\text{Chamf}}^{\text{sim}\rightarrow\text{real}}(\mathcal{P}^{\text{sim}},\mathcal{P}^{\text{real}}) = \sum_{\boldsymbol{x}^{\text{sim}}\in\mathcal{P}^{\text{sim}}}\min_{\boldsymbol{x}^{\text{real}}\in\mathcal{P}^{\text{real}}}\|\boldsymbol{x}^{\text{sim}}-\boldsymbol{x}^{\text{real}}\|_{2}^{2}. \quad (2)$

While the naive computation of the Chamfer distance can be expensive, PyTorch3D provides an efficient GPU-based implementation (see Section 3.1 in [35]). With that, we can quickly propagate gradients from the point clouds through the mesh representation to optimize the low-level physical parameters of the simulator. We found that including point clouds from one or two depth cameras is sufficient to construct a partially occluded point cloud that is still informative enough for the overall optimization to be successful.
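Below is a minimal sketch of the unidirectional distance in Equation 2, written in plain PyTorch for clarity; in practice we rely on the batched GPU implementation provided by PyTorch3D. The function and variable names are illustrative.

```python
import torch

def chamfer_sim_to_real(points_sim: torch.Tensor, points_real: torch.Tensor) -> torch.Tensor:
    """Sum over simulated points of the squared distance to the closest real point.

    points_sim:  (N_sim, 3) differentiable points sampled from the simulated mesh.
    points_real: (N_real, 3) fixed points from the observed (noisy) real point cloud.
    """
    # Pairwise squared distances, shape (N_sim, N_real).
    dists = torch.cdist(points_sim, points_real, p=2) ** 2
    # For every simulated point, distance to its nearest real neighbor.
    nearest, _ = dists.min(dim=1)
    # Ignoring the real-to-sim direction relieves pressure to explain noise artifacts.
    return nearest.sum()
```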

III-B Gradient Propagation

We build upon the differentiable simulator DiffSim [36], which supports mesh-based simulation and contact handling for rigid objects and thin-shell deformables (e.g. cloth). In DiffSim, the simulation state is represented by the generalized coordinates $\mathbf{q}=[\mathbf{q_1}^{T},\mathbf{q_2}^{T},\ldots,\mathbf{q_n}^{T}]^{T}$ of all objects in the simulation, with corresponding velocities $\mathbf{\dot{q}}=[\mathbf{\dot{q}_1}^{T},\mathbf{\dot{q}_2}^{T},\ldots,\mathbf{\dot{q}_n}^{T}]^{T}$. The generalized coordinates are $\mathbf{q_i}\in\mathbb{R}^{6}$ for rigid objects, denoting position and orientation, and $\mathbf{q_i}\in\mathbb{R}^{3}$ for deformable nodes, with 3 DoF for position alone. Here, $n$ is the total number of deformable nodes and rigid bodies in the scene. DiffSim uses the implicit Euler method to compute $\mathbf{q},\mathbf{\dot{q}}$ at each time step and performs collision resolution in localized impact zones. A given mesh has a body frame with the origin set to its center of mass at the start of simulation. A mesh vertex $p$ has coordinate $\mathbf{p_0}=(p_{x},p_{y},p_{z})^{T}$ in the body frame and world coordinate $\mathbf{p}=\mathbf{f}(\mathbf{q})=[\mathbf{r}]\mathbf{p_0}+\mathbf{t}$, where $\mathbf{r}=(\phi,\theta,\psi)^{T}$ and $\mathbf{t}=(t_{x},t_{y},t_{z})^{T}$ form the 6-DoF pose of the mesh. Propagating gradients from vertex $\mathbf{p}$ to the generalized coordinates $\mathbf{q}$ involves computing the Jacobian $\nabla\mathbf{f}$, i.e. the partial derivatives $\partial\mathbf{f}(\mathbf{q})/\partial\mathbf{q}$. In this way, DiffSim is able to solve sim-to-sim optimizations by comparing current mesh states to target mesh states and propagating gradients from the observed positional differences.
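As a concrete illustration of this gradient flow (not DiffSim's internal implementation), the following sketch builds the world-frame vertex $\mathbf{p}=[\mathbf{r}]\mathbf{p_0}+\mathbf{t}$ with PyTorch autograd and obtains gradients of a scalar loss with respect to the 6-DoF pose; the vertex coordinates and the loss are arbitrary placeholders.

```python
import torch
from pytorch3d.transforms import euler_angles_to_matrix

p0 = torch.tensor([0.1, 0.0, 0.2])           # body-frame vertex coordinate (placeholder)
r = torch.zeros(3, requires_grad=True)        # Euler angles (phi, theta, psi)
t = torch.zeros(3, requires_grad=True)        # translation (t_x, t_y, t_z)

R = euler_angles_to_matrix(r, convention="XYZ")   # rotation matrix [r]
p = R @ p0 + t                                    # world-frame vertex f(q)

loss = p.pow(2).sum()     # any scalar loss defined on the world-frame vertex
loss.backward()           # autograd supplies the partial derivatives df(q)/dq
print(r.grad, t.grad)     # gradients w.r.t. the pose parameters
```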

Since ground truth mesh states are not readily available for real deformables, we use point clouds as an observation space. To allow losses in the point cloud space to be propagated to simulated cloth vertices, we implement a differentiable point cloud sampling step. A triangular mesh face can be represented by its enclosing vertices $\mathbf{p_1},\mathbf{p_2},\mathbf{p_3}$. In barycentric coordinates, a random point on the surface of the face can be generated by sampling three random numbers $(u,v,w)$ such that $u+v+w\leq 1$ [35]. Hence, we can obtain a random point $\boldsymbol{x}$ inside the triangle as:

$\boldsymbol{x}=u\,\mathbf{p_1}+v\,\mathbf{p_2}+w\,\mathbf{p_3}. \quad (3)$

To generate an $N$-point, uniform-density point cloud from a mesh, we first sample $N$ triangular faces $\{(\mathbf{p_{i,1}},\mathbf{p_{i,2}},\mathbf{p_{i,3}})\}_{i=1,\ldots,N}$, weighted proportionally to the area of each face. This step ensures that the resulting point cloud is evenly distributed over the surface of the mesh, instead of being disproportionately dense in regions where the mesh has closely packed faces. For the $i^{\text{th}}$ sampled face $(\mathbf{p_{i,1}},\mathbf{p_{i,2}},\mathbf{p_{i,3}})$, we generate random coefficients $(u_i,v_i,w_i)$ and use Equation 3 to compute a point $\boldsymbol{x}_i$ that lies on the face. Concatenating the points obtained by applying this procedure to the $N$ sampled faces and coefficients yields the point cloud $\mathcal{P}=\{\boldsymbol{x}_i\}_{i=1,\ldots,N}$. We connect the PyTorch3D [35] implementation of this sampling procedure to the output of a differentiable simulator. With that, gradients from loss functions operating on $\{\boldsymbol{x}_i\}_{i=1,\ldots,N}$ can be propagated to the mesh vertices $(\mathbf{p_{i,1}},\mathbf{p_{i,2}},\mathbf{p_{i,3}})$ via the chain rule, observing that:

$\partial\boldsymbol{x}_i/\partial\mathbf{p_{i,1}}=u_i, \quad \partial\boldsymbol{x}_i/\partial\mathbf{p_{i,2}}=v_i, \quad \partial\boldsymbol{x}_i/\partial\mathbf{p_{i,3}}=w_i. \quad (4)$
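The following is a minimal sketch of this sampling procedure; in practice we use pytorch3d.ops.sample_points_from_meshes, which implements the same area-weighted, differentiable sampling with batched CUDA kernels. The barycentric coefficients below use a standard square-root reparameterization so that sampled points are uniform on each face.

```python
import torch

def sample_point_cloud(verts: torch.Tensor, faces: torch.Tensor, n: int) -> torch.Tensor:
    """verts: (V, 3) differentiable mesh vertices; faces: (F, 3) vertex indices."""
    p1, p2, p3 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    # Face areas for area-proportional face sampling.
    areas = 0.5 * torch.cross(p2 - p1, p3 - p1, dim=1).norm(dim=1)
    idx = torch.multinomial(areas, n, replacement=True)      # N faces, area-weighted
    p1, p2, p3 = p1[idx], p2[idx], p3[idx]
    # Barycentric coefficients (u, v, w) with u + v + w = 1, uniform on each triangle.
    r1, r2 = torch.rand(n, 1), torch.rand(n, 1)
    u = 1.0 - r1.sqrt()
    v = r1.sqrt() * (1.0 - r2)
    w = r1.sqrt() * r2
    # Equation 3: gradients w.r.t. the sampled points flow back to p1, p2, p3.
    return u * p1 + v * p2 + w * p3
```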

III-C Optimization

We consider scenarios where a robot executes a trajectory to manipulate a deformable object, which potentially also interacts with other objects in the scene. For each scenario, we get a real point cloud sequence of length $T$ with a resolution of $N$ points per frame: $\big\{\mathcal{P}^{\mathrm{real}}_t=\{\boldsymbol{x}_i^{\mathrm{real}}\}_{i=1,\ldots,N}\big\}_{t=1}^{T}$. We instantiate each real scenario in DiffSim, using representative meshes to model the deformable and rigid objects in the initial scene. Next, we begin optimizing the material properties of the simulated cloth such that the discovered parameters best explain the observed target point cloud data. Each simulated scenario involves manipulating a simulated cloth with the same trajectory as in the real scenario. Even with identical trajectory execution, we expect a discrepancy between the real and simulated point cloud sequences due to a mismatch in the motion of the deformables, whose physical properties we aim to optimize. For each iteration of optimization, we run the simulation for the number of steps in the scenario horizon using the current set of estimated parameters. At each timestep $t$, we sample the simulated object surfaces in the scene according to Equation 3 to obtain a point cloud. The concatenated point clouds across frames give the simulated point cloud sequence $\big\{\mathcal{P}^{\mathrm{sim}}_t=\{\boldsymbol{x}_i^{\mathrm{sim}}\}_{i=1,\ldots,N}\big\}_{t=1}^{T}$. In practice, we compute the unidirectional Chamfer distance (Equation 2) on $(\mathcal{P}^{\mathrm{sim}}_t,\mathcal{P}^{\mathrm{real}}_t)$ for the corresponding point clouds observed at time $t$. For most experiments, we found that using only the last frame works well, i.e. $t=T$, as in [37, 36]. Other frames can be used, and their selection can be treated as a hyperparameter. This loss is propagated across all frames of the simulation and used to update the underlying parameters of the deformable object. This procedure is repeated for a fixed number of iterations or until the Chamfer distance falls below a threshold.
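A schematic sketch of one such optimization run follows. Here, rollout, real_pcs, and threshold are hypothetical placeholders: rollout stands for stepping the differentiable simulator for $T$ steps along the recorded gripper trajectory and returning the differentiable cloth mesh at every frame, while sample_point_cloud and chamfer_sim_to_real refer to the helpers sketched in Sections III-A and III-B.

```python
import torch

params = torch.tensor([5.0, 5.0], requires_grad=True)   # initial (w_stiff, w_mass) guess
optimizer = torch.optim.Adam([params], lr=0.2)

for it in range(50):
    # Hypothetical: simulate T steps with the current parameters; each element
    # of `meshes` is a (verts, faces) pair for the cloth at one frame.
    meshes = rollout(params)
    # Sample the last simulated frame (t = T), as discussed above.
    sim_pc = sample_point_cloud(*meshes[-1], n=3500)
    # Unidirectional Chamfer loss (Equation 2) against the last real frame.
    loss = chamfer_sim_to_real(sim_pc, real_pcs[-1])
    optimizer.zero_grad()
    loss.backward()               # gradients reach the simulation parameters
    optimizer.step()
    if loss.item() < threshold:   # threshold: pre-defined stopping value
        break
```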

IV Experiments

In this section, we compare DiffCloud with non-differentiable methods for parameter estimation: we evaluate the degree of point cloud alignment between simulated and real trajectories and the compute efficiency across methods.

Figure 3: Left: We visualize the average compute times across all methods for performing parameter estimation in the real lift and fold scenarios. Compared to the baseline inverse models, DiffCloud achieves more than an order of magnitude speedup, since it eliminates the need to pre-generate a dataset and train on it. Right: The optimized DiffCloud parameters found in the lift scenario correspond to the intuitive physical properties of real cloths, ranging from highly deformable to shape retaining. Each darkened circle represents the category median across three trajectories per cloth type.

IV-A Baseline Inverse Models

As experimental baselines, we use methods that treat the simulator as a black box, i.e. observations are not end-to-end differentiable w.r.t. the parameters. These inverse models are implemented as regression networks that take point cloud sequences as inputs and predict $k$ simulation parameters. They are trained on simulated point cloud sequences generated by running simulations with a wide range of parameter settings.

  • MeteorNet: An architecture for learning representations of 3D point cloud sequences from [6], which we modify to predict $k$ simulation parameters. Specifically, we use MeteorNet-cls (Appendix D.2 in [6]).

  • PointNet++: A regressor similar to the above, but using the multi-scale grouping architecture from [5] (Appendix B.1) to extract features from a single point cloud; uses three set abstraction layers followed by three fully connected layers with output sizes $(512, 256, k)$.

  • MLP: A regressor with a fully connected network that also operates on a single frame; uses five 2D convolutional layers with output sizes $(64, 64, 64, 128, 1024)$, a symmetric max pooling layer, two fully connected layers with output sizes $(512, 256)$, a dropout layer, and a final fully connected layer with $k$ outputs. A minimal sketch of this regressor is shown after this list.
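Below is a hedged sketch of the MLP baseline. The per-point feature extractor is written here with 1-D convolutions over the point dimension (a common way to implement shared per-point layers); the dropout rate is an illustrative choice.

```python
import torch
import torch.nn as nn

class MLPRegressor(nn.Module):
    def __init__(self, k: int = 2):
        super().__init__()
        dims = [3, 64, 64, 64, 128, 1024]
        # Shared per-point feature layers with output sizes (64, 64, 64, 128, 1024).
        self.features = nn.Sequential(*[
            layer for i in range(len(dims) - 1)
            for layer in (nn.Conv1d(dims[i], dims[i + 1], 1), nn.ReLU())
        ])
        self.head = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Dropout(0.3),          # illustrative dropout rate
            nn.Linear(256, k),        # k simulation parameters
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) single point cloud frame.
        x = self.features(points.transpose(1, 2))   # (B, 1024, N)
        x = x.max(dim=2).values                      # symmetric max pooling over points
        return self.head(x)
```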

Real point cloud sequences suffer from partial observability, self-occlusion, and noise. To minimize the gap between simulation and reality, we aim to generate a training dataset for each of the regressor methods that is as realistic as possible. As training data for the regressors (MeteorNet, PointNet++, MLP), we initialize $N$ simulations with uniformly sampled parameters and record the resulting point cloud sequences and ground truth parameters. This yields a dataset $\mathcal{D}=\big\{\{\boldsymbol{x}_t\}_{t=1}^{T},\ \boldsymbol{w}=[w_{\mathrm{stiff}},w_{\mathrm{mass}}]\big\}_{i=1}^{N}$. For each simulation run, we generate a point cloud sequence from mesh states according to Equation 3, but bias the sampling of points to a subset of the deformable's faces, instead of all of them, to mimic a partial point cloud with occlusion. Then, we apply random Gaussian noise to this point cloud to mimic the effect of real sensor noise. We generate a dataset of 1500 training and 375 test point cloud sequences with 3,500 points per frame across methods. For comparison, MeteorNet was trained on a dataset of 576 examples and the PointNet++ [5] literature uses thousands of examples.
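A minimal sketch of this augmentation step is shown below; the face-subset fraction and noise scale are illustrative values rather than the exact ones used in our experiments, and sample_point_cloud refers to the sampler sketched in Section III-B.

```python
import torch

def augment_frame(verts, faces, n=3500, keep_frac=0.6, noise_std=0.005):
    # Drop a random subset of faces to mimic a partially observed deformable.
    keep = torch.randperm(faces.shape[0])[: int(keep_frac * faces.shape[0])]
    pc = sample_point_cloud(verts, faces[keep], n)   # Section III-B sampler
    # Add Gaussian jitter to mimic real depth-sensor noise.
    return pc + noise_std * torch.randn_like(pc)
```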

We train the MeteorNet regressor to learn a mapping $f:\{\boldsymbol{x}_t\}_{t=1}^{T}\rightarrow\boldsymbol{w}$, which maps an input point cloud sequence $\{\boldsymbol{x}_t\}_{t=1}^{T}$ to simulator parameters $\boldsymbol{w}$ that would generate the observed behavior. The PointNet++ and MLP methods learn a function $g:\boldsymbol{x}_t\rightarrow\boldsymbol{w}$ mapping only a single frame $t$ (selected from a sequence) to the material parameters. In practice, we choose the frame $t$ to be the same frame for which we compute the unidirectional Chamfer distance in DiffCloud optimization. We train each network using an $L_1$ loss between the predicted and ground truth parameters, using the Adam optimizer with a learning rate of 0.001 for 100 epochs. We run dataset generation on an Intel i5-9400F CPU. We use an NVIDIA GeForce GTX 1070 GPU for training and optimization.

IV-B DiffCloud Implementation Details

DiffCloud is our proposed approach described in Section III. For all simulations we use DiffSim [36] – a differentiable version of the garment simulator ARCSim [46]. As in [27], we initialize the deformable, in this case cloth, to a basis material and learn two scalar multipliers $(w_{\mathrm{stiff}}, w_{\mathrm{mass}})$ for the stiffness and mass tensors. These multipliers represent the simulation parameters we aim to learn. We initialize them to the midpoint of each parameter range. We empirically determine this range as $[0.1, 10]$ simulation units for all parameters, so that forward simulation is numerically stable. We want to encourage small changes in the learnable parameters to make a visually compelling difference in deformation. Thus, instead of directly optimizing $(w_{\mathrm{stiff}}, w_{\mathrm{mass}})$, we optimize intermediate variables $(s_{\mathrm{stiff}}, s_{\mathrm{mass}})$, where $(w_{\mathrm{stiff}}, w_{\mathrm{mass}})=\sigma(s_{\mathrm{stiff}}, s_{\mathrm{mass}})\times(10-0.1)+0.1$, and $\sigma(x)=\frac{1}{1+e^{-x}}$ is the standard sigmoid function. Intuitively, this maps $(s_{\mathrm{stiff}}, s_{\mathrm{mass}})$ first to the range $[0,1]$ via the sigmoid function, and then interpolates this value within the parameter range of $[0.1, 10]$. We choose the final parameters to be those that incurred the lowest loss during optimization.
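This reparameterization can be summarized by the following sketch, where the variable names are illustrative: the optimizer updates the unconstrained variables, which are squashed into the stable range before every forward simulation.

```python
import torch

LOW, HIGH = 0.1, 10.0

s = torch.zeros(2, requires_grad=True)   # (s_stiff, s_mass); sigmoid(0) gives the range midpoint

def to_params(s: torch.Tensor) -> torch.Tensor:
    # Sigmoid maps to (0, 1); the affine map interpolates within [LOW, HIGH].
    return torch.sigmoid(s) * (HIGH - LOW) + LOW

w_stiff, w_mass = to_params(s)            # values passed to the simulator each iteration
```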

IV-C Evaluation Metrics

Alignment: For quantitative evaluation we use the unidirectional Chamfer distance (Equation 2), which characterizes how well the motions of the simulated and real point clouds align. For each baseline, we first infer the predicted parameters using the target point cloud trajectory as input. We then run the simulator using the inferred parameters and generate point clouds from the simulated meshes. We compute the Chamfer distance between the point clouds generated by the baselines and the real point cloud, and compare this against the loss from running DiffCloud on the real point cloud.

Efficiency: Aside from alignment, we also evaluate the computational cost of all methods. Across experiments, we report (1) the time it takes to run DiffCloud optimization on trajectories and (2) the combined dataset generation, training, and inference times for the supervised baselines.

IV-D Real Experimental Setup

IV-D1 Hardware Setup

Our hardware setup consists of a Kinova Gen3 robot arm with a Robotiq 2F-85 gripper and two Intel RealSense D435 cameras (Figure 1). The table workspace measures 50×43 cm, with one camera mounted overhead and another mounted on the side. For each scenario outlined below, we execute trajectories using velocity control in the Cartesian (end-effector) space and record robot joint states and point cloud observations at each step. We merge point clouds from both camera views and use a transformation obtained from standard checkerboard calibration to obtain point clouds in the frame of reference of the robot, which has a corresponding frame of reference in simulation. Using recorded joint states and known robot geometry, we mask out the end-effector from the point clouds as depicted in Figure 1. This yields a set of real, post-processed point cloud observations $\{\boldsymbol{x}_t^{\text{real}}\}_{t=1}^{T}$.
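A hedged sketch of this post-processing is given below; the camera-to-robot transforms and the end-effector masking radius are illustrative placeholders for the calibrated extrinsics and the robot-geometry mask used in our pipeline.

```python
import numpy as np

def merge_and_mask(pc_top, pc_side, T_top, T_side, ee_pos, mask_radius=0.08):
    """pc_*: (N, 3) camera-frame points; T_*: (4, 4) camera-to-robot transforms."""
    def to_robot(pc, T):
        homog = np.hstack([pc, np.ones((pc.shape[0], 1))])
        return (homog @ T.T)[:, :3]
    # Express both views in the robot frame and merge them.
    merged = np.vstack([to_robot(pc_top, T_top), to_robot(pc_side, T_side)])
    # Drop points near the end-effector so only the manipulated object remains.
    keep = np.linalg.norm(merged - ee_pos, axis=1) > mask_radius
    return merged[keep]
```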

IV-D2 Real Scenarios

We consider two real manipulation scenarios using cloth as the deformable object of interest. The lift scenario involves grasping and lifting a cloth from a flat start state on a table surface. At the end of the trajectory, the cloth is lifted off the table and suspended mid-air. The fold scenario involves folding a cloth in half, starting from a diamond shape and ending in a triangular configuration. We perform both scenarios on 5 different fabrics grouped into three categories: highly deformable, medium, and shape retaining. The highly deformable class consists of cloths that have especially low stiffness and are collapsible, such as silk-like materials. The medium class consists of cloths with modest deformation that can still crumple subject to enough force. The shape retaining class contains cloths that resist deformation more than the other categories, such as paper towels and stiff washcloths. For each fabric, we execute 3 trajectories for a total of 15 trajectories per scenario. Each trajectory is $\approx 2.5$ seconds long with robot commands sent at 10 Hz, resulting in a horizon length of $T=25$. During the lift and fold trajectories, the cloth either collapses or maintains its shape depending on the underlying cloth properties (which we attempt to learn).

Figure 4: Left: A Kinova robot executes trajectories to lift real cloths from a flat starting state. From these trajectories, DiffCloud accurately infers stiffness and mass parameters capturing the observed degree of collapsibility in the real cloths. Right: Across all cloth types, DiffCloud achieves lower loss on average than all competing baselines on the lift scenario.

IV-E Real Robot Experiments

We aim to infer cloth stiffness and mass from the lift and fold scenarios discussed in Section IV-D.

IV-E1 DiffCloud Specifications

For specifying the DiffCloud loss and evaluation metrics, we need to use the frames for which deformation is most visible. The lift trajectory exhibits the least occlusion at the end of the trajectory, when the cloth is off the table. For folding, the cloth is most visible in the middle of the trajectory. The level of occlusion at the end of real lift trajectories is minimal, so simulated point clouds are generated by sampling from all cloth faces. In the fold scenario, the half of the real cloth that remains flat on the table does not appear in the real point cloud. Hence, we sample simulated point clouds only from cloth faces on the upper half of the cloth in simulation, as a heuristic for generating occlusion-sensitive point clouds. The sampling procedure discussed in Section III-B does not model occlusion, but we hope to relax this in future work. The simulated counterparts for the lift and fold scenarios both use a 2D cloth mesh consisting of a $7\times 7$ grid of 49 vertices, where we keyframe the position of the top corner vertex to produce the same motion executed in the real scenario for $T=25$ steps. For both scenarios, we run DiffCloud for 50 iterations using the Adam optimizer with a learning rate of 0.2 to infer $(w_{\mathrm{stiff}}, w_{\mathrm{mass}})$.

IV-E2 Results

We find that DiffCloud is able to estimate mass and stiffness cloth parameters that visually explain the observed real point cloud behavior in the lift (Figure 4) and fold (Figure 5) scenarios. We note that DiffCloud achieves a lower loss than MeteorNet, PointNet++, and MLP across the 3 cloth categories in the lift scenario (Figure 4), according to the evaluation metric from Section IV-C. We also visualize the final parameters found by DiffCloud and observe that the parameters align with the qualitative descriptions of the 3 categories of cloth types considered (Figure 3). For instance, DiffCloud discovers high mass, low stiffness parameters for the polka dot cloth and low mass, high stiffness parameters for the red-black cloth, which belong to the highly deformable and shape retaining categories, respectively. Similarly, for the fold scenario, we find that DiffCloud is able to find parameters that account for variations in deformation during execution, such as collapsing inwards mid-fold or maintaining shape (Figure 5). The severity of self-occlusion is more prominent in the fold scenario, making robust parameter estimation more difficult. Still, DiffCloud achieves a loss on par with the baselines, which (1) are trained on thousands of examples, (2) use data augmentations to be invariant to varying degrees of self-occlusion, and (3) require significantly more compute time. DiffCloud takes on average 10 minutes per trajectory in the lift and fold scenarios to optimize the simulation parameters (Figure 3). While the baseline regressors all have inference times on the order of milliseconds, each method incurs an up-front cost of more than two hours of dataset generation. Furthermore, the training procedure requires an additional 40 minutes to multiple hours per scenario. With a significantly smaller computational footprint, DiffCloud achieves parameter estimation results on real data that are comparable to, and in some cases better than, the baselines.

IV-F Further Simulation Experiments

Figure 5: Left: DiffCloud correctly learns to approximate low mass/high stiffness for the shape retaining cloth (1st row), such that three corners lift off the table mid-fold (2nd row, 2nd column), and high mass/low stiffness for the heavy, highly deformable polka dot fabric (3rd row), such that all ungrasped corners rest on the table mid-fold (4th row, 2nd column). Right: Compared to data-driven baselines, DiffCloud achieves lower or comparable loss in 13/15 trajectories across categories. We note that all methods struggle to robustly estimate parameters for 2/15 paper towel (shape retaining) manipulation trajectories, which appear as outliers. This is due to difficulties perceiving very thin sheets in motion, which affects all methods.

IV-F1 Simulation Scenarios

We further evaluate the performance of DiffCloud on two additional simulated scenarios: hanging one shoulder of a vest onto a pole (vest hang), and stretching an elastic band against a pole (band stretch), shown in Figure 6. The mass and stiffness of the vest mesh determine the outline of the vest when hung on the pole. The mass and stiffness of the elastic band dictate the extent to which the band travels up or down along the pole when pulled taut. We aim to analyze how the contact-rich aspects of these scenarios influence the performance of various methods. To separate these effects from point cloud quality considerations, we use fully observable, noise-free point clouds. Aside from skipping augmentations (random jittering, dropout of cloth faces) during dataset generation, we use an identical training procedure for MLP, PointNet++, and MeteorNet as in Section IV-E.

Figure 6: We evaluate all competing methods using the metric provided in Section IV-C on parameter estimation in two contact-rich, simulated scenarios: band stretch and vest hang. Across 10 runs per scenario, DiffCloud achieves alignment between simulated and target point clouds in all runs, while baseline regressors achieve mixed success.

IV-F2 DiffCloud Specifications

DiffCloud performs optimization using the Adam optimizer (with learning rate 0.3 for vest hang and 0.4 for band stretch). We take the unidirectional Chamfer loss on the last frame for band stretch. For vest hang, most of the deformation happens in the second half of the trajectory, so we pick an intermediate frame in this part of the trajectory (instead of the final frame). We compute the loss between the simulated cloth point cloud only (no poles) and the entire target scene (including poles). This focuses the optimization on the deformables. Since synthetic point clouds are noiseless, it is possible to choose a termination criterion for DiffCloud based on when the computed loss falls below a pre-defined threshold. In practice, we find that a loss threshold of 0.0005 corresponds to a well-aligned match. Thus, for each target trajectory, we run DiffCloud until the loss falls below this threshold or until the number of optimization iterations exceeds 50, as in the lift and fold scenarios.

IV-F3 Results

For evaluation purposes, we generate a held-out test set of 10 simulated episodes for each scenario, with mass and stiffness parameters sampled uniformly at random in the range $[0.1, 10]$. For each episode, we render the corresponding point cloud sequences according to the procedure from Section III-B without additional augmentations. We evaluate the performance of all methods on estimating the parameters of this set of target point cloud sequences. Across all target point cloud sequences in both scenarios, we find that DiffCloud is able to converge to a set of parameters that yield a Chamfer distance below the threshold. The baselines are only able to achieve sub-threshold alignment in 50-80% of runs (Figure 6). Due to the threshold-based stopping criterion for DiffCloud in the simulated experiments, running optimization for a given target point cloud sequence takes 5 minutes on average across the vest hang and band stretch scenarios, compared to a combined dataset generation, training, and inference time on the order of hours for the baselines.

V Conclusion and Future Work

Figure 7: DiffCloud optimizes the mass and stiffness parameters of simulated deformables to match a highly deformable target elastic band (top row) and a shape retaining target vest (bottom row). Starting from an initial guess for the parameters (black), DiffCloud takes gradient steps to update the parameters such that the loss computed between the point cloud observations with optimized parameters (blue) and the target point clouds (green) is minimized. The optimization terminates once the loss falls below a threshold of 0.0005, denoting close visual alignment.

In this work, we proposed DiffCloud: an approach that combines differentiable point cloud sampling with differentiable simulation for solving real-to-sim problems with highly deformable objects. For comparison, we employed recently developed neural network architectures for processing point clouds to learn inverse models that infer simulation parameters without treating the simulator as differentiable. Our experiments showed that DiffCloud reduced the time needed for obtaining real-to-sim alignment by more than an order of magnitude. This result opens the way to more agile experimentation with real-to-sim for highly deformable objects, while still treating the problem as a joint perception-inference task, i.e. not requiring dedicated methods to find correspondences between the depth camera observations (point clouds) and simulated meshes.

Possible avenues for future work include relaxing the need for robot masking by including the robot model into simulation; improving point cloud rendering with realistic occlusion and camera noise models; and jointly optimizing robot actions along with physical simulation parameters. Furthermore, it would be interesting to apply DiffCloud to contact-rich multi-stage tasks that involve multiple rigid and deformable objects. Another interesting direction is to use DiffCloud to infer the starting state of the simulation, including learning the morphology of object meshes. This would allow us to eliminate the need for manual specification of object meshes and initial simulation states, opening the way for creating new realistic simulation scenes ‘on the fly.’

References

  • [1] A. Prakash, S. Debnath, J.-F. Lafleche, E. Cameracci, S. Birchfield, M. T. Law et al., “Self-supervised real-to-sim scene generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16 044–16 054.
  • [2] P. Chang and T. Padif, “Sim2real2sim: Bridging the gap between simulation and real-world in flexible object manipulation,” in 2020 Fourth IEEE International Conference on Robotic Computing (IRC).   IEEE, 2020, pp. 56–62.
  • [3] J. Zhang, L. Tai, P. Yun, Y. Xiong, M. Liu, J. Boedecker, and W. Burgard, “Vr-goggles for robots: Real-to-sim domain adaptation for visual control,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1148–1155, 2019.
  • [4] F. Liu, Z. Li, Y. Han, J. Lu, F. Richter, and M. C. Yip, “Real-to-sim registration of deformable soft tissue with position-based dynamics for surgical robot autonomy,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 12 328–12 334.
  • [5] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in neural information processing systems, vol. 30, 2017.
  • [6] X. Liu, M. Yan, and J. Bohg, “Meteornet: Deep learning on dynamic 3d point cloud sequences,” in ICCV, 2019.
  • [7] E. Yoshida, K. Ayusawa, I. G. Ramirez-Alpizar, K. Harada, C. Duriez, and A. Kheddar, “Simulation-based optimal motion planning for deformable object,” in 2015 IEEE international workshop on advanced robotics and its social impacts (ARSO).   IEEE, 2015, pp. 1–6.
  • [8] J. Matas, S. James, and A. J. Davison, “Sim-to-real reinforcement learning for deformable object manipulation,” in Conference on Robot Learning.   PMLR, 2018, pp. 734–743.
  • [9] S. D. Klee, B. Q. Ferreira, R. Silva, J. P. Costeira, F. S. Melo, and M. Veloso, “Personalized assistance for dressing users,” in International Conference on Social Robotics.   Springer, 2015, pp. 359–369.
  • [10] A. Kapusta, Z. Erickson, H. M. Clever, W. Yu, C. K. Liu, G. Turk, and C. C. Kemp, “Personalized collaborative plans for robot-assisted dressing via optimization and simulation,” Autonomous Robots, vol. 43, no. 8, pp. 2183–2207, 2019.
  • [11] A. Clegg, W. Yu, J. Tan, C. K. Liu, and G. Turk, “Learning to dress: Synthesizing human dressing motion via deep reinforcement learning,” ACM Transactions on Graphics (TOG), vol. 37, no. 6, pp. 1–10, 2018.
  • [12] S. Li, N. Figueroa, A. Shah, and J. A. Shah, “Provably Safe and Efficient Motion Planning with Uncertain Human Dynamics,” in Robotics: Science and Systems (RSS), 2021.
  • [13] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox, “Closing the sim-to-real loop: Adapting simulation randomization with real world experience,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 8973–8979.
  • [14] F. Ramos, R. C. Possas, and D. Fox, “BayesSim: adaptive domain randomization via probabilistic inference for robotics simulators,” in Robotics: Science and Systems (RSS), 2019.
  • [15] L. Barcelos, R. Oliveira, R. Possas, L. Ott, and F. Ramos, “Disco: Double likelihood-free inference stochastic control,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 10 969–10 975.
  • [16] B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull, “Active domain randomization,” in Conference on Robot Learning.   PMLR, 2020, pp. 1162–1176.
  • [17] F. Muratore, T. G. Gruner, F. Wiese, B. Belousov, M. Gienger, and J. Peters, “Neural posterior domain randomization,” in Conference on Robot Learning.   PMLR, 2021.
  • [18] M. Hwasser, D. Kragic, and R. Antonova, “Variational auto-regularized alignment for sim-to-real control,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 2732–2738.
  • [19] R. Herguedas, G. López-Nicolás, R. Aragüés, and C. Sagüés, “Survey on multi-robot manipulation of deformable objects,” in IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), 2019.
  • [20] V. E. Arriola-Rios, P. Guler, F. Ficuciello, D. Kragic, B. Siciliano, and J. L. Wyatt, “Modeling of deformable objects for robotic manipulation: A tutorial and review,” Frontiers in Robotics and AI, vol. 7, p. 82, 2020.
  • [21] H. Yin, A. Varava, and D. Kragic, “Modeling, learning, perception, and control methods for deformable object manipulation,” Science Robotics, vol. 6, no. 54, 2021.
  • [22] D. Navarro-Alarcon, Y.-H. Liu, J. G. Romero, and P. Li, “Model-free visually servoed deformation control of elastic objects by robot manipulators,” IEEE Transactions on Robotics, vol. 29, no. 6, pp. 1457–1468, 2013.
  • [23] W. Sun, M. Çetin, R. Chan, and A. S. Willsky, “Learning the dynamics and time-recursive boundary detection of deformable objects,” IEEE Transactions on Image Processing, vol. 17, no. 11, pp. 2186–2200, 2008.
  • [24] T. Tang, C. Wang, and M. Tomizuka, “A framework for manipulating deformable linear objects by coherent point drift,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3426–3433, 2018.
  • [25] J. Willard, X. Jia, S. Xu, M. Steinbach, and V. Kumar, “Integrating physics-based modeling with machine learning: A survey,” arXiv preprint arXiv:2003.04919, vol. 1, no. 1, pp. 1–34, 2020.
  • [26] H. Wang, J. F. O’Brien, and R. Ramamoorthi, “Data-driven elastic models for cloth: modeling and measurement,” ACM transactions on graphics (TOG), vol. 30, no. 4, pp. 1–12, 2011.
  • [27] S. Yang, J. Liang, and M. C. Lin, “Learning-based cloth material recovery from video,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4383–4393.
  • [28] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [29] W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler, “Learning to predict 3d objects with an interpolation-based differentiable renderer,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [30] S. Liu, T. Li, W. Chen, and H. Li, “Soft rasterizer: A differentiable renderer for image-based 3d reasoning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7708–7717.
  • [31] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik, “Multi-view supervision for single-view reconstruction via differentiable ray consistency,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2626–2634.
  • [32] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger, “Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3504–3515.
  • [33] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in European conference on computer vision.   Springer, 2020, pp. 405–421.
  • [34] K. M. Jatavallabhula, E. Smith, J.-F. Lafleche, C. F. Tsang, A. Rozantsev, W. Chen, T. Xiang, R. Lebaredian, and S. Fidler, “Kaolin: A pytorch library for accelerating 3d deep learning research,” arXiv preprint arXiv:1911.05063, 2019.
  • [35] N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari, “Accelerating 3d deep learning with pytorch3d,” arXiv preprint arXiv:2007.08501, 2020.
  • [36] Y.-L. Qiao, J. Liang, V. Koltun, and M. Lin, “Scalable differentiable physics for learning and control,” in International Conference on Machine Learning.   PMLR, 2020, pp. 7847–7856.
  • [37] K. M. Jatavallabhula, M. Macklin, F. Golemo, V. Voleti, L. Petrini, M. Weiss, B. Considine, J. Parent-Levesque, K. Xie, K. Erleben et al., “gradsim: Differentiable simulation for system identification and visuomotor control,” 2021.
  • [38] Y. Hu, L. Anderson, T.-M. Li, Q. Sun, N. Carr, J. Ragan-Kelley, and F. Durand, “Difftaichi: Differentiable programming for physical simulation,” 2020.
  • [39] J. Liang, M. Lin, and V. Koltun, “Differentiable cloth simulation for inverse problems,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [40] Y. Hu, J. Liu, A. Spielberg, J. B. Tenenbaum, W. T. Freeman, J. Wu, D. Rus, and W. Matusik, “Chainqueen: A real-time differentiable physical simulator for soft robotics,” in 2019 International conference on robotics and automation (ICRA).   IEEE, 2019, pp. 6265–6271.
  • [41] D. Hahn, P. Banzet, J. M. Bern, and S. Coros, “Real2sim: Visco-elastic parameter estimation from dynamic motion,” ACM Transactions on Graphics (TOG), vol. 38, no. 6, pp. 1–13, 2019.
  • [42] M. Geilinger, D. Hahn, J. Zehnder, M. Bächer, B. Thomaszewski, and S. Coros, “Add: Analytically differentiable dynamics for multi-body systems with frictional contact,” ACM Transactions on Graphics (TOG), vol. 39, no. 6, pp. 1–15, 2020.
  • [43] T. Du, K. Wu, P. Ma, S. Wah, A. Spielberg, D. Rus, and W. Matusik, “DiffPD: Differentiable projective dynamics with contact,” arXiv e-prints, pp. arXiv–2101, 2021.
  • [44] M. Lutter, J. Silberbauer, J. Watson, and J. Peters, “Differentiable physics models for real-world offline model-based reinforcement learning,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 4163–4170.
  • [45] M. A. Z. Mora, M. P. Peychev, S. Ha, M. Vechev, and S. Coros, “Pods: Policy optimization via differentiable simulation,” in International Conference on Machine Learning.   PMLR, 2021, pp. 7805–7817.
  • [46] R. Narain, A. Samii, and J. F. O’brien, “Adaptive anisotropic remeshing for cloth simulation,” ACM transactions on graphics (TOG), vol. 31, no. 6, pp. 1–10, 2012. [Online]. Available: graphics.berkeley.edu/resources/ARCSim
  • [47] S. Weiss, R. Maier, D. Cremers, R. Westermann, and N. Thuerey, “Correspondence-free material reconstruction using sparse surface constraints,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4686–4695.