
1 S-Lab for Advanced Intelligence, Nanyang Technological University
  yidi001@e.ntu.edu.sg, ccloy@ntu.edu.sg
2 Shanghai AI Laboratory
  daibo@pjlab.org.cn

Transformer with Implicit Edges for Particle-based Physics Simulation

Yidi Shao1 [0000-0001-8020-5150]    Chen Change Loy1 [0000-0001-5345-1591]    Bo Dai2 [0000-0003-0777-9232]
Abstract

Particle-based systems provide a flexible and unified way to simulate physics systems with complex dynamics. Most existing data-driven simulators for particle-based systems adopt graph neural networks (GNNs) as their network backbones, as particles and their interactions can be naturally represented by graph nodes and graph edges. However, while particle-based systems usually contain hundreds or even thousands of particles, the explicit modeling of particle interactions as graph edges inevitably leads to significant computational overhead, due to the increased number of particle interactions. Consequently, in this paper we propose a novel Transformer-based method, dubbed Transformer with Implicit Edges (TIE), to capture the rich semantics of particle interactions in an edge-free manner. The core idea of TIE is to decentralize the computation involving pair-wise particle interactions into per-particle updates. This is achieved by adjusting the self-attention module to resemble the update formula of graph edges in GNN. To improve the generalization ability of TIE, we further amend TIE with learnable material-specific abstract particles to disentangle global material-wise semantics from local particle-wise semantics. We evaluate our model on diverse domains of varying complexity and materials. Compared with existing GNN-based methods, without bells and whistles, TIE achieves superior performance and generalization across all these domains. Codes and models are available at https://github.com/ftbabi/TIE_ECCV2022.git. (Bo Dai completed this work when he was with S-Lab, NTU.)

1 Introduction

Particle-based physics simulation not only facilitates the exploration of underlying principles in physics, chemistry, and biology, but also plays an important role in computer graphics, e.g., enabling the creation of vivid visual effects such as explosions and fluid dynamics in films and games. By viewing a system as a composition of particles, particle-based physics simulation imitates system dynamics according to the states of particles as well as their mutual interactions. In this way, although different systems may contain different materials and follow different physical laws, they can be simulated in a unified manner with promising quality.

Recent approaches for particle-based physics simulation [2, 23, 17, 12, 22, 25] often adopt a graph neural network (GNN) [10] as the backbone network structure, where particles are treated as graph nodes, and interactions between neighboring particles are explicitly modeled as edges. By explicitly modeling particle interactions, existing methods effectively capture the semantics emerging from those interactions (e.g., the influence of action-reaction forces), which are crucial for accurate simulation of complex systems. However, such an explicit formulation requires the computation of edge features for all valid interactions. Since a particle-based system usually contains hundreds or even thousands of densely distributed particles, the explicit formulation inevitably leads to significant computational overhead, limiting the efficiency and scalability of these GNN-based approaches.

Figure 1: (a) Samples from base domains. FluidFall contains two drops of water. FluidShake simulates a block of water in a moving box. RiceGrip has a deformable object squeezed by two grippers. BoxBath contains a rigid cube washed by water. (b) Samples from BunnyBath, where we change the rigid cube into a bunny for the generalization test. We compare our TIE with DPI-Net, which achieves the best performance on BunnyBath among previous methods. While the bunny is flooded upside down in the ground truth, TIE produces more faithful rollouts, especially in terms of the bunny's posture and the fluid dynamics. More comparisons can be found in Section 4.2.

In this paper, instead of relying on GNN, we propose to adopt Transformer as the backbone network structure for particle-based physics simulation. While particle interactions are represented as graph edges in GNN, in Transformer they are captured by a series of self-attention operations, in the form of dot-products between tokens of interacting particles. Consequently, in Transformer only particle tokens are required to simulate a system, leading to significantly reduced computational complexity when compared to GNN-based approaches.

The vanilla Transformer, however, is not directly applicable for effective simulation, since the rich semantics of particle interactions cannot be fully conveyed by the dot-products in self-attention operations. In this work, we address the problem via a novel modification to the self-attention operation that resembles the effect of edges in GNN but avoids the need to model them explicitly. Specifically, each particle token is decomposed into three tokens, namely a state token, a receiver token, and a sender token. In particular, the state token keeps track of the particle state, the receiver token describes how the particle's state would change, and the sender token indicates how the particle will affect its interacting neighbors. By taking receiver tokens and sender tokens as both keys and values, while state tokens serve as queries, the structure of edges in GNN can be equivalently represented by the attention module in Transformer. To further trace the edge semantics in GNN, motivated by the normalization applied to edges, both receiver and sender tokens are first decentralized in our attention module. Then we recover the standard deviations of edges from the receiver and sender tokens, and apply the recovered scalar values as part of the attention scores. Thus, the edge features can be effectively revived in Transformer. Moreover, to improve the generalization ability of the proposed method, we further propose to assign a learnable abstract particle to each type of material, and force particles of the same material to interact with their corresponding abstract particle, so that global semantics shared by all particles of the same material can be disentangled from local particle-level semantics.

Our method, dubbed Transformer with Implicit Edges for Particle-based Physics Simulation (TIE), possesses several advantages over previous methods. First, thanks to the proposed edge-free design, TIE maintains the same level of computational complexity as the vanilla Transformer, while combining the advantages of both Transformer and GNN. TIE not only inherits the self-attention operation from Transformer, which can naturally attend to essential particles in the dynamically changing system, but is also capable of extracting rich semantics from particle interactions as GNN does, without suffering from its significant computational overhead. Besides, the introduction of learnable abstract particles further boosts the performance of TIE in terms of generality and accuracy, by disentangling global semantics such as the intrinsic characteristics of different materials. For instance, after learning the dynamics of water, TIE can be directly applied to systems with varying numbers and configurations of particles, mimicking various effects including waterfall and flood.

To demonstrate the effectiveness of TIE, a comprehensive evaluation is conducted on four standard environments commonly used in the literature [12, 22], covering domains of different complexity and materials, where TIE achieves superior performance across all these environments compared to existing methods. Attractive properties of TIE, such as its strong generalization ability, are also studied, where we adjust the number and configuration of particles in each environment to create unseen systems for TIE to simulate without re-training. Compared to previous methods, TIE obtains more realistic simulation results across most unseen systems. For example, after changing the shape from cube to bunny in BoxBath, the MSEs achieved by TIE are at least 30% lower than those of previous methods.

2 Related Work

Physics simulation by neural networks. There are many different kinds of representations for physics simulation. Grid-based methods [11, 24, 28] adopt convolutional architectures for learning high-dimensional physical systems, while mesh-based simulations [3, 14, 9, 18, 20, 30, 19] typically simulate objects with continuous surfaces, such as cloth, rigid objects, and water surfaces.

Many studies [2, 23, 17, 12, 25, 22] simulate physics on particle-based systems, where all objects are represented by groups of particles. Specifically, Interaction Network (IN) [2] simulated interactions at the object level. Smooth Particle Networks (SPNets) [23] implemented fluid dynamics using position-based fluids [15]. Hierarchical Relation Network (HRN) [17] predicted physical dynamics based on hierarchical graph convolution. Dynamic Particle Interaction Networks (DPI-Net) [12] combined dynamic graphs, multi-step spatial propagation, and hierarchical structure to simulate particles. CConv [25] used spatial convolutions to simulate fluid particles. Graph Network-based Simulators (GNS) [22] computed dynamics via learned message-passing.

Previous work mostly adopted graph networks for simulation. They extracted latent semantics by explicitly modeling edges and storing their embeddings, and required each particle to interact with all its nearby particles without a selective mechanism. In contrast, our TIE is able to capture edge semantics in an edge-free manner, and selectively focus on necessary particle interactions through the attention mechanism. Experiments show that TIE is more efficient, and surpasses existing GNN-based methods.

Transformer. Transformer [26] was designed for machine translation and achieved state-of-the-art performance in many natural language processing tasks [6, 21, 4]. Recently, Transformer has shown great expandability and applicability in many other fields, such as computer vision [29, 5, 7, 27, 13] and graph representations [32, 31, 8]. To our knowledge, no attempt has been made to apply Transformer to physics simulation.

Our TIE inherits the multi-head attention mechanism, which helps dynamically model the latent patterns in particle interactions. Although Graph Transformer [8], which we refer to as GraphTrans for short, is also a Transformer-based model on graphs, it still explicitly models each valid edge to enhance the semantics of particle tokens, failing to make full use of the attention mechanism to describe the relations among tokens in a more efficient manner. We adapt GraphTrans [8] to particle-based simulation and compare it with TIE in our experiments. Quantitative and qualitative results show that TIE achieves more faithful rollouts in a more efficient way.

3 Methodology

3.1 Problem Formulation

For a particle-based system composed of $N$ particles, we use $\mathcal{X}^t = \{\bm{x}^t_i\}_{i=1}^N$ to denote the system state at time step $t$, where $\bm{x}^t_i$ denotes the state of the $i$-th particle. Specifically, $\bm{x}^t_i = [\bm{p}^t_i, \bm{q}^t_i, \bm{a}_i]$, where $\bm{p}^t_i, \bm{q}^t_i \in \mathbb{R}^3$ refer to position and velocity, and $\bm{a}_i \in \mathbb{R}^{d_a}$ represents fixed particle attributes such as its material type. The goal of a simulator is to learn a model $\phi(\cdot)$ from previous rollouts of a system to causally predict a rollout trajectory over a specific time period conditioned on the initial system state $\mathcal{X}^0$. The prediction runs in a recursive manner, where the simulator predicts the state $\hat{\mathcal{X}}^{t+1} = \phi(\mathcal{X}^t)$ at time step $t+1$ based on the state $\mathcal{X}^t = \{\bm{x}^t_i\}$ at time step $t$. In practice, we predict the velocities of particles $\hat{Q}^{t+1} = \{\hat{\bm{q}}^{t+1}_i\}$ and obtain their positions via $\hat{\bm{p}}^{t+1}_i = \bm{p}^t_i + \Delta t \cdot \hat{\bm{q}}^{t+1}_i$, where $\Delta t$ is a domain-specific constant. In the following discussion, the time step $t$ is omitted to avoid verbose notation.
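To make the recursive prediction concrete, the following is a minimal sketch (Python/NumPy) of the rollout loop described above; `simulator` is a placeholder for the learned model $\phi(\cdot)$ and is an assumption made purely for illustration.

```python
import numpy as np

def rollout(simulator, positions, velocities, attrs, dt, num_steps):
    """Causally unroll a particle system: predict velocities, then integrate positions."""
    trajectory = [positions.copy()]
    for _ in range(num_steps):
        q_hat = simulator(positions, velocities, attrs)  # \hat{q}^{t+1} from the learned model
        positions = positions + dt * q_hat               # \hat{p}^{t+1} = p^t + dt * \hat{q}^{t+1}
        velocities = q_hat                                # the prediction becomes the next input
        trajectory.append(positions.copy())
    return np.stack(trajectory)
```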

3.2 GNN-based Approach

As particle-based physics systems can be naturally viewed as directed graphs, a straightforward solution for particle-based physics simulation is to apply a graph neural network (GNN) [2, 23, 17, 12, 22]. Specifically, we can regard particles in the system as graph nodes, and interactions between pairs of particles as directed edges. Given the states of particles $\mathcal{X} = \{\bm{x}_i\}_{i=1}^N$ at some time step, to predict the velocities of particles at the next time step, GNN first obtains the initial node features and edge features following:

\bm{v}_i^{(0)} = f^{\mathrm{enc}}_V(\bm{x}_i),   (1)
\bm{e}_{ij}^{(0)} = f^{\mathrm{enc}}_E(\bm{x}_i, \bm{x}_j),   (2)

where $\bm{v}_i, \bm{e}_{ij} \in \mathbb{R}^{d_h}$ are $d_h$-dimensional vectors, and $f^{\mathrm{enc}}_V(\cdot), f^{\mathrm{enc}}_E(\cdot)$ are respectively the node and edge encoders. Subsequently, GNN conducts $L$ rounds of message passing, and obtains the velocities of particles as:

\bm{e}^{(l+1)}_{ij} = f^{\mathrm{prop}}_E(\bm{v}^{(l)}_i, \bm{v}^{(l)}_j, \bm{e}^{(l)}_{ij}),   (3)
\bm{v}^{(l+1)}_i = f^{\mathrm{prop}}_V\big(\bm{v}^{(l)}_i, \sum_{j \in \mathcal{N}_i} \bm{e}^{(l+1)}_{ij}\big),   (4)
\hat{\bm{q}}_i = f^{\mathrm{dec}}_V(\bm{v}^{(L)}_i),   (5)

where $\mathcal{N}_i$ indicates the set of neighbors of the $i$-th particle, and $f^{\mathrm{prop}}_E(\cdot), f^{\mathrm{prop}}_V(\cdot)$ and $f^{\mathrm{dec}}_V(\cdot)$ are respectively the edge propagation module, the node propagation module, and the node decoder. In practice, $f^{\mathrm{enc}}_V(\cdot), f^{\mathrm{enc}}_E(\cdot), f^{\mathrm{prop}}_E(\cdot), f^{\mathrm{prop}}_V(\cdot)$ and $f^{\mathrm{dec}}_V(\cdot)$ are often implemented as multi-layer perceptrons (MLPs). Moreover, a window function $g$ is commonly used to filter out interactions between distant particles and reduce computational complexity:

g(i, j) = \mathbf{1}\left(\|\bm{p}_i - \bm{p}_j\|_2 < R\right),   (6)

where $\mathbf{1}(\cdot)$ is the indicator function and $R$ is a pre-defined threshold.
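For concreteness, the following is a minimal sketch (Python/NumPy) of the GNN update in Equations 1-6. The encoders and propagation modules are stand-ins for the MLPs used in practice; here they are random linear maps, an assumption made purely for illustration. Note that the edge features form an $N \times N$ tensor, which illustrates the computational overhead of explicit edges.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_h, L, R = 8, 6, 16, 2, 0.08
x = rng.standard_normal((N, d_in))                 # particle states x_i
positions = rng.standard_normal((N, 3)) * 0.05     # particle positions p_i

W_v  = rng.standard_normal((d_h, d_in))            # node encoder, Eq. (1)
W_e  = rng.standard_normal((d_h, 2 * d_in))        # edge encoder, Eq. (2)
W_pe = rng.standard_normal((d_h, 3 * d_h))         # edge propagation, Eq. (3)
W_pv = rng.standard_normal((d_h, 2 * d_h))         # node propagation, Eq. (4)
W_d  = rng.standard_normal((3, d_h))               # node decoder, Eq. (5)

# Window function, Eq. (6): keep only interactions between particles closer than R.
mask = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1) < R

v = x @ W_v.T                                      # Eq. (1)
xi = np.repeat(x[:, None, :], N, axis=1)           # x_i broadcast over pairs
xj = np.repeat(x[None, :, :], N, axis=0)           # x_j broadcast over pairs
e = np.concatenate([xi, xj], axis=-1) @ W_e.T      # Eq. (2): explicit N x N edge features

for _ in range(L):                                 # L rounds of message passing
    vi = np.repeat(v[:, None, :], N, axis=1)
    vj = np.repeat(v[None, :, :], N, axis=0)
    e = np.concatenate([vi, vj, e], axis=-1) @ W_pe.T   # Eq. (3)
    agg = (e * mask[..., None]).sum(axis=1)             # sum over neighbors N_i
    v = np.concatenate([v, agg], axis=-1) @ W_pv.T      # Eq. (4)

q_hat = v @ W_d.T                                  # Eq. (5): predicted velocities
```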

(a) Edge propagations in GNNs.
(b) Implicit edge propagations in TIE.
Figure 2: We illustrate the propagation of edges in GNNs and in TIE, where explicit or implicit edges are shown in red boxes. The MLP in each layer is split into square blocks followed by summations; different MLP blocks are shown by square areas with different colors. The key idea of TIE is that it replaces the explicit edges $\bm{e}_{ij}^{(l+1)}$ with receiver tokens $\bm{r}_i^{(l)}$ and sender tokens $\bm{s}_j^{(l)}$. When only considering the trainable weights of each MLP, the summation of the receiver and sender tokens within a red box equals the edge within the red box of the same depth, as shown in Equation 12. As the indices show, the updates of nodes $i$ and $j$ are independent in Figure (b), thus TIE does not require explicit edges.

3.3 From GNN to Transformer

To accurately simulate the changes of a system over time, it is crucial to exploit the rich semantics conveyed by the interactions among particles, such as the energy transition of a system when constrained by material characteristics and physical laws. While GNN achieves this by explicitly modeling particle interactions as graph edges, such a treatment also leads to substantial computational overhead. Since a particle-based system contains hundreds even thousands of particles, and particles of a system are densely clustered together, this issue significantly limits the efficiency of GNN-based approaches.

Inspired by the recent successes of Transformer [26], which applies computationally efficient self-attention operations to model the communication among different tokens, in this paper we propose a Transformer-based method, which we refer to as Transformer with Implicit Edges (TIE), for particle-based physics simulation. We first describe how to apply a vanilla Transformer to this task. Specifically, we assign a token to each particle of the system, so that particle interactions are naturally handled by $L$ blocks of multi-head self-attention modules. While the token features are initialized according to Equation 1, they are updated in the $l$-th block as:

\omega_{ij} = (W^{(l)}_Q \bm{v}^{(l)}_i)^\top (W^{(l)}_K \bm{v}^{(l)}_j),   (7)
\bm{v}^{(l+1)}_i = \sum_j \frac{\exp(\omega_{ij}/\sqrt{d})}{\sum_k \exp(\omega_{ik}/\sqrt{d})} \cdot (W^{(l)}_V \bm{v}^{(l)}_j),   (8)

where $d$ is the dimension of features, and $W^{(l)}_Q, W^{(l)}_K, W^{(l)}_V$ are the weight matrices for queries, keys, and values. Following standard practice, a mask is generated according to Equation 6 to mask out distant particles when computing the attention. Finally, the prediction of velocities follows Equation 5.
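A minimal sketch (Python/NumPy) of the masked self-attention in Equations 7-8 over particle tokens is given below; the weight matrices and the `mask` (from Equation 6) are assumed inputs.

```python
import numpy as np

def vanilla_attention(v, W_q, W_k, W_v, mask):
    """One masked self-attention update over particle tokens v of shape (N, d)."""
    d = v.shape[-1]
    q, k, val = v @ W_q.T, v @ W_k.T, v @ W_v.T
    scores = (q @ k.T) / np.sqrt(d)                 # Eq. (7), scaled
    scores = np.where(mask, scores, -np.inf)        # drop distant particles (Eq. 6 mask)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)   # row-wise softmax
    return attn @ val                               # Eq. (8)
```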

Although the vanilla Transformer provides a flexible approach for particle-based simulation, directly applying it leads to inferior simulation results, as shown in our experiments. In particular, the vanilla Transformer uses attention weights, i.e., scalars obtained via dot-products, to represent particle interactions, which are insufficient to reflect the rich semantics of particle interactions. To combine the merits of GNN and Transformer, TIE modifies the self-attention operation in the vanilla Transformer to implicitly include edge features as in GNN, in an edge-free manner. Figure 2 compares our proposed implicit edges with the explicit edges in GNN. Specifically, since $f^{\mathrm{prop}}_E$ in Equation 3 and $f^{\mathrm{enc}}_E$ in Equation 2 are both implemented as MLPs in practice, by expanding Equation 3 recursively and grouping terms respectively for the $i$-th and $j$-th particle we can obtain:

\bm{r}^{(0)}_i = W^{(0)}_r \bm{x}_i, \qquad \bm{s}^{(0)}_j = W^{(0)}_s \bm{x}_j,   (9)
\bm{r}^{(l)}_i = W^{(l)}_r \bm{v}^{(l)}_i + W^{(l)}_m \bm{r}^{(l-1)}_i,   (10)
\bm{s}^{(l)}_j = W^{(l)}_s \bm{v}^{(l)}_j + W^{(l)}_m \bm{s}^{(l-1)}_j,   (11)
\bm{e}^{(l+1)}_{ij} = \bm{r}^{(l)}_i + \bm{s}^{(l)}_j,   (12)

where the effect of an explicit edge is achieved by two additional tokens, which we refer to as the receiver token $\bm{r}_i$ and the sender token $\bm{s}_j$. The detailed expansion of Equation 3 can be found in the supplementary material. Following the above expansion, TIE thus assigns three tokens to each particle of the system, namely a receiver token $\bm{r}_i$, a sender token $\bm{s}_i$, and a state token $\bm{v}_i$. The state token is similar to the particle token in the vanilla Transformer, and its update formula combines the node update formula in GNN and the self-attention formula in Transformer:

\omega'_{ij} = (W^{(l)}_Q \bm{v}^{(l)}_i)^\top \bm{r}^{(l)}_i + (W^{(l)}_Q \bm{v}^{(l)}_i)^\top \bm{s}^{(l)}_j,   (13)
\bm{v}^{(l+1)}_i = \bm{r}^{(l)}_i + \sum_j \frac{\exp(\omega'_{ij}/\sqrt{d})}{\sum_k \exp(\omega'_{ik}/\sqrt{d})} \cdot \bm{s}^{(l)}_j.   (14)

We refer to Equation 14 as an implicit way to incorporate the rich semantics of particle interactions, since TIE approximates graph edges in GNN with two additional tokens per particle, and, more importantly, these two tokens can be updated, along with the original token, separately for each particle, avoiding the significant computational overhead. The modified self-attention in TIE can be interpreted from two perspectives: from the perspective of graph edges, each edge is decomposed into a receiver token and a sender token, maintaining two extra paths in the Transformer's self-attention module; from the perspective of self-attention, the receiver tokens and the sender tokens respectively replace the original keys and values.
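The following is a minimal sketch (Python/NumPy) of one TIE block following Equations 10-14: receiver and sender tokens are updated per particle, and the attention score in Equation 13 is computed without ever materializing a pairwise edge feature. All weight matrices are assumptions for illustration.

```python
import numpy as np

def tie_block(v, r_prev, s_prev, W_q, W_r, W_s, W_m, mask):
    """One TIE block: per-particle token updates, no explicit pairwise edge features."""
    d = v.shape[-1]
    r = v @ W_r.T + r_prev @ W_m.T                  # Eq. (10), per particle
    s = v @ W_s.T + s_prev @ W_m.T                  # Eq. (11), per particle
    q = v @ W_q.T
    # Eq. (13): omega'_ij = q_i . r_i + q_i . s_j, computed without forming e_ij
    scores = ((q * r).sum(-1, keepdims=True) + q @ s.T) / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)
    v_next = r + attn @ s                           # Eq. (14)
    return v_next, r, s
```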

In practice, since GNN-based methods usually incorporate LayerNorm [1] in their network architectures that computes the mean and std of edge features to improve their performance and training speed, we can further modify the self-attention in Equation 13 and Equation 14 to include the effect of normalization as well:

\left(\sigma^{(l)}_{ij}\right)^2 = \frac{1}{d}(\bm{r}^{(l)}_i)^\top \bm{r}^{(l)}_i + \frac{1}{d}(\bm{s}^{(l)}_j)^\top \bm{s}^{(l)}_j + \frac{2}{d}(\bm{r}^{(l)}_i)^\top \bm{s}^{(l)}_j - (\mu^{(l)}_{r_i} + \mu^{(l)}_{s_j})^2,   (15)
\omega''_{ij} = \frac{(W^{(l)}_Q \bm{v}^{(l)}_i)^\top (\bm{r}^{(l)}_i - \mu^{(l)}_{r_i}) + (W^{(l)}_Q \bm{v}^{(l)}_i)^\top (\bm{s}^{(l)}_j - \mu^{(l)}_{s_j})}{\sigma^{(l)}_{ij}},   (16)
\bm{v}^{(l+1)}_i = \sum_j \frac{\exp(\omega''_{ij}/\sqrt{d})}{\sum_k \exp(\omega''_{ik}/\sqrt{d})} \cdot \frac{(\bm{r}^{(l)}_i - \mu^{(l)}_{r_i}) + (\bm{s}^{(l)}_j - \mu^{(l)}_{s_j})}{\sigma^{(l)}_{ij}},   (17)

where $\mu^{(l)}_{r_i}$ and $\mu^{(l)}_{s_j}$ are respectively the means of the receiver token and the sender token after the $l$-th block. The detailed derivation can be found in the supplementary material.
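As a small sanity check of Equation 15, the snippet below (Python/NumPy) verifies that the standard deviation of the implicit edge $\bm{r}_i + \bm{s}_j$ can be recovered from per-particle quantities alone; the random tokens are placeholders used purely for the check.

```python
import numpy as np

d = 128
rng = np.random.default_rng(0)
r_i, s_j = rng.standard_normal(d), rng.standard_normal(d)
mu_r, mu_s = r_i.mean(), s_j.mean()

# Eq. (15): std of the implicit edge from per-particle terms only.
sigma_sq = (r_i @ r_i + s_j @ s_j + 2 * r_i @ s_j) / d - (mu_r + mu_s) ** 2
assert np.isclose(sigma_sq, (r_i + s_j).var())   # matches the variance of the explicit edge
```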

3.4 Abstract Particles

To further improve the generalization ability of TIE and disentangle global material-specific semantics from local particle-wise semantics, we further equip TIE with material-specific abstract particles.

For $N_a$ types of materials, TIE adopts $N_a$ abstract particles $A = \{\bm{a}_k\}_{k=1}^{N_a}$ respectively, each of which is a virtual particle with a learnable state token. Ideally, the abstract particle $\bm{a}_k$ should capture the material-specific semantics of the $k$-th material. Abstract particles act as additional particles in the system, and their update formulas are the same as those of normal particles. Unlike normal particles, which only interact with neighboring particles, each abstract particle is forced to interact with all particles belonging to its corresponding material. Therefore, with $N_a$ abstract particles, TIE has $N + N_a$ particles in total: $\{\bm{a}_1, \cdots, \bm{a}_{N_a}, \bm{x}_1, \cdots, \bm{x}_N\}$. Once TIE is trained, the abstract particles can be reused when generalizing TIE to unseen domains that have the same materials but vary in particle amount and configuration.
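The sketch below (Python/NumPy) illustrates one way abstract particles could be prepended to the particle tokens and wired into the attention mask; `material_ids` and the learnable table `abstract_tokens` are assumed names, not part of the released implementation.

```python
import numpy as np

def add_abstract_particles(v, mask, material_ids, abstract_tokens):
    """Prepend one learnable abstract token per material; each one attends to all
    particles of its material (and vice versa), regardless of spatial distance."""
    Na, N = abstract_tokens.shape[0], v.shape[0]
    v_all = np.concatenate([abstract_tokens, v], axis=0)   # (Na + N, d) tokens
    full = np.zeros((Na + N, Na + N), dtype=bool)
    full[Na:, Na:] = mask                                  # keep the normal window mask
    for k in range(Na):
        members = material_ids == k
        full[k, Na:] = members                             # abstract particle -> its material
        full[Na:, k] = members                             # its material -> abstract particle
        full[k, k] = True
    return v_all, full
```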

3.5 Training Objective and Evaluation Metric

To train TIE with existing rollouts of a domain, the standard mean square error (MSE) loss is applied to the output of TIE:

\text{MSE}(\hat{Q}, Q) = \frac{1}{N}\sum_i \|\hat{\bm{q}}_i - \bm{q}_i\|_2^2,   (18)

where $\hat{Q} = \{\hat{\bm{q}}_i\}_{i=1}^N$ and $Q = \{\bm{q}_i\}_{i=1}^N$ are respectively the estimation and the ground truth, and $\|\cdot\|_2$ is the L2 norm.

In terms of evaluation metric, since a system usually contains multiple types of materials with imbalanced numbers of particles, to better reflect the estimation accuracy we apply the Mean of Material-wise MSE ($\text{M}^3\text{SE}$) for evaluation:

\text{M}^3\text{SE}(\hat{Q}, Q) = \frac{1}{K}\sum_k \frac{1}{N_k}\sum_i \|\hat{\bm{q}}_{i,k} - \bm{q}_{i,k}\|_2^2,   (19)

where $K$ is the number of material types and $N_k$ is the number of particles belonging to the $k$-th material. $\text{M}^3\text{SE}$ is equivalent to the standard MSE when $K = 1$.
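A minimal sketch (Python/NumPy) of the $\text{M}^3\text{SE}$ metric in Equation 19 is given below; `material_ids` is an assumed per-particle label array.

```python
import numpy as np

def m3se(q_hat, q, material_ids):
    """Eq. (19): MSE computed per material, then averaged over the K materials."""
    errors = ((q_hat - q) ** 2).sum(axis=-1)           # squared L2 error per particle
    per_material = [errors[material_ids == k].mean()   # MSE within material k
                    for k in np.unique(material_ids)]
    return float(np.mean(per_material))
```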

Table 1: We report $\text{M}^3\text{SE}$ (1e-2) results on four base domains, while keeping the models' numbers of parameters similar to each other. TIE achieves superior performance on all domains without incurring significant computational overhead. When adding trainable abstract particles, TIE, marked by +, further improves performance on RiceGrip and BoxBath, which involve complex deformations and multi-material interactions respectively.
Methods | FluidFall (M3SE / #Para) | FluidShake (M3SE / #Para) | RiceGrip (M3SE / #Para) | BoxBath (M3SE / #Para)
DPI-Net [12] | 0.08±0.05 / 0.61M | 1.38±0.45 / 0.62M | 0.13±0.09 / 1.98M | 1.33±0.29 / 1.98M
CConv [25] | 0.08±0.02 / 0.84M | 1.41±0.46 / 0.84M | N/A | N/A
GNS [22] | 0.09±0.02 / 0.70M | 1.66±0.37 / 0.70M | 0.40±0.16 / 0.71M | 1.56±0.23 / 0.70M
GraphTrans [8] | 0.04±0.01 / 0.77M | 1.36±0.37 / 0.77M | 0.12±0.11 / 0.78M | 1.27±0.25 / 0.77M
TIE (Ours) | 0.04±0.01 / 0.77M | 1.22±0.37 / 0.77M | 0.13±0.12 / 0.78M | 1.35±0.35 / 0.77M
TIE+ (Ours) | 0.04±0.00 / 0.77M | 1.30±0.41 / 0.77M | 0.08±0.08 / 0.78M | 0.92±0.16 / 0.77M
(a) Batch size is 1.
(b) Batch size is 1.
(c) Batch size is 4.
(d) Batch size is 4.
Figure 3: We report the models' average training time per iteration. The batch size in (a) and (b) is set to 1, while the batch size in (c) and (d) is set to 4. As the number of interactions increases, the time cost of TIE+ remains stable, while the other models take more time to train due to the computational overhead introduced by the extra interactions.

4 Experiments

We adopt four domains commonly used in the literature [12, 22, 25] for evaluation. FluidFall is a basic simulation of two droplets of water with 189 particles in total; FluidShake is more complex and simulates water in a randomly moving box, containing 450 to 627 fluid particles; BoxBath simulates water washing a rigid cube in a fixed box, with 960 fluid particles and 64 rigid particles; RiceGrip simulates the interactions between deformable rice and two rigid grippers, including 570 to 980 particles. Samples are displayed in Figure 1. To explore the effectiveness of our model, we compare TIE with four representative approaches: DPI-Net [12], CConv [25], GNS [22], and GraphTrans [8].

Implementation Details. TIE contains $L = 4$ blocks. For the multi-head self-attention version, the receiver tokens and sender tokens are regarded as the projected keys and values for each head. After projecting the concatenated state tokens from all heads, a two-layer MLP with dimensions 256 and 128 follows. The concatenated receiver tokens and sender tokens are directly projected by a one-layer MLP with dimension 128. The remaining hidden dimensions are 128 by default. We train four models independently on the four domains, with 5 epochs on FluidShake and BoxBath, 13 epochs on FluidFall, and 20 epochs on RiceGrip. On BoxBath, all models adopt the same strategy to keep the shape of the rigid object, following [12]. We adopt MSE on velocities as the training loss for all models. The neighborhood radius $R$ in Equation 6 is set to 0.08. We adopt the Adam optimizer with an initial learning rate of 0.0008, decayed by a factor of 0.8 when the validation loss stops decreasing for 3 epochs. The batch size is set to 16 on all domains. All models are trained and tested on V100 GPUs for all experiments, with no augmentation involved.
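The optimization schedule above can be expressed compactly; the following sketch (PyTorch) mirrors the stated settings (Adam, initial learning rate 0.0008, decay factor 0.8 with patience 3 on the validation loss), with `model` as a placeholder. It is not the released training script.

```python
import torch

def configure_optimization(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=8e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.8, patience=3)
    return optimizer, scheduler

# After each validation pass: scheduler.step(val_loss)
```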

4.1 Basic Domains

Figure 4: Qualitative results on base domains. TIE is able to achieve more faithful results on all domains. On FluidFall, TIE better maintains the shape before the droplets merge and handles the redundant neighbors introduced when the two droplets move closer to each other. The relative positions of the droplets are also closer to the ground truth. On FluidShake, TIE predicts two faithful blocks of water on the top right. On RiceGrip, when focusing on the areas compressed by the grippers, the rice restores its position more faithfully with TIE. On BoxBath, the rigid cube predicted by TIE is pushed far enough away from the right wall, and the position of the cube predicted by TIE+ is much closer to the ground truth. The fluid particles predicted by our models are also more faithful.

Quantitative results are provided in Table 1, while qualitative results are shown in Figure 4. TIE achieves superior performance on all domains. The effectiveness of abstract particles is more obvious on RiceGrip and BoxBath, which involve complex materials or multi-material interactions.

Performance comparison. We compare TIE with four representative approaches: DPI-Net [12], CConv [25], GNS [22], and GraphTrans [8]. Since DPI-Net and GNS adopt message-passing graph networks for particle-based simulation, we set the number of propagation steps to four for both models, except that DPI-Net adopts a total of six propagation steps on BoxBath and RiceGrip, where hierarchical structures are adopted. For CConv, which designs convolutional layers carefully tailored to modeling fluid dynamics, such as an SPH-like local kernel [16], we only report results on fluid-based domains. As shown in Table 1, TIE achieves superior performance on most domains, while TIE+, which has abstract particles, further improves the performance especially on RiceGrip and BoxBath, suggesting the effectiveness of abstract particles in modeling complex deformations and multi-material interactions. As shown by the qualitative results in Figure 4, our model predicts more faithful rollouts on all domains.

Efficiency comparison. The training time of the models with varying batch size is shown in Figure 3. For simplicity, the number of particles is fixed and only the number of interactions varies in Figure 3. Since TIE uses implicit edges to model particle interactions and significantly reduces computational overhead, TIE has the fastest training speed among general GNN-based simulators (CConv [25] is a specialized simulator for systems containing only fluid), as shown in Figure 3 (a) and (b). When the batch size is larger than 1, GNN-based methods need to pad edges for each batch, leading to extra computational cost. In contrast, TIE only needs the corresponding attention masks to denote the connectivities without further padding, and is thus faster to train with large batch sizes. We do not report the speed of GraphTrans with more than $1.4\times 10^4$ interactions due to memory limits. In terms of testing speed, it is hard to compare different methods directly since different simulation results lead to different numbers of valid particle interactions.

4.2 Generalizations

Table 2: $\text{M}^3\text{SE}$ on generalization domains. The lists of numbers in FluidShake and RiceGrip are ranges of particle counts, while the tuples in BoxBath denote the number of fluid particles, the number of rigid particles, and the shape of the rigid object respectively. Training settings are marked by *. TIE+ achieves the best results in most cases.
Methods | FluidShake [450,627]* | | RiceGrip [570,980]* |
 | [720,1013] | [1025,1368] | [1062,1347] | [1349,1642]
DPI-Net [12] | 2.13±0.55 | 2.78±0.84 | 0.23±0.13 | 0.38±0.67
CConv [25] | 2.01±0.55 | 2.43±0.81 | N/A | N/A
GNS [22] | 2.61±0.44 | 3.41±0.59 | 0.47±0.20 | 0.51±0.28
GraphTrans [8] | 2.68±0.52 | 3.97±0.70 | 0.20±0.13 | 0.22±0.18
TIE+ (Ours) | 1.92±0.47 | 2.46±0.65 | 0.17±0.11 | 0.19±0.15
Methods | BoxBath (960,64,cube)* | | |
 | (1280,64,cube) | (960,125,cube) | (960,136,ball) | (960,41,bunny)
DPI-Net [12] | 1.70±0.22 | 3.22±0.88 | 2.86±0.99 | 2.04±0.79
GNS [22] | 2.97±0.48 | 2.97±0.71 | 3.50±0.67 | 2.17±0.37
GraphTrans [8] | 1.88±0.25 | 1.50±0.30 | 1.71±0.34 | 2.22±0.61
TIE+ (Ours) | 1.57±0.18 | 1.49±0.19 | 1.45±0.27 | 1.39±0.48
Figure 5: Qualitative results on generalized domains. Here we only show part of the results on generalized BoxBath, where we mainly change the shape and size of the rigid object. TIE+ predicts more faithful movements of the rigid object, while the fluid particles are also vivid. More details can be found in the supplementary materials.

As shown in Table 2, we generate more complex domains to challenge the robustness of our full model TIE+. Specifically, we add more particles to FluidShake and RiceGrip, which we refer to as L-FluidShake and L-RiceGrip respectively. L-FluidShake includes 720 to 1368 particles, while L-RiceGrip contains 1062 to 1642 particles. On BoxBath, we change the size and shape of the rigid object. Specifically, we increase the number of fluid particles to 1280 in Lfluid-BoxBath, and we enlarge the rigid cube to 125 particles in L-BoxBath. We also change the shape of the rigid object into a ball and a bunny, which we refer to as BallBath and BunnyBath respectively. Details of the generalized environment settings and results can be found in the supplementary materials.

Quantitative results are summarized in Table 2, while qualitative results are depicted in Figure 5. As shown in Table 2, TIE+ achieves lower $\text{M}^3\text{SE}$ on most domains, while producing more faithful rollouts in Figure 5. On L-FluidShake, TIE+ maintains the block of fluid in the air and predicts faithful waves on the surface. On L-RiceGrip, while DPI-Net and GNS have difficulties maintaining the shape, the rice predicted by GraphTrans is compressed densely only in the center areas where the grippers have reached; the left and right sides of the rice do not deform properly compared with the ground truth. In contrast, TIE+ is able to maintain the shape of the large rice and faithfully deform the whole rice after compression. On generalized BoxBath, TIE+ is able to predict faithful rollouts when the fluid particles flood the rigid object into the air or when the wave of fluid particles starts to push the rigid object after the collision. Even when the rigid object is changed to a bunny with a more complex surface, TIE+ generates more accurate predictions for both fluid and rigid particles.

4.3 Ablation Studies

We comprehensively analyze TIE and explore the effectiveness of our model in the following aspects: (a) with and without implicit edges; (b) with and without normalization effects in attention; (c) with and without abstract particles; and (d) sensitivity to $R$. The experiments for (a), (b), and (d) are conducted on FluidShake and L-FluidShake, while experiment (c) is conducted on BoxBath. The quantitative results are in Table 3 and Table 4.

Table 3: Ablation studies. We comprehensively explore the effectiveness of TIE, including the effects of implicitly modeling edges, of normalization effects in attention, and of abstract particles. We report $\text{M}^3\text{SE}$ (1e-2) on FluidShake and L-FluidShake, which are complex domains involving outer forces.
Configurations | A (Transformer) | B | C (TIE) | D (TIE+)
Implicit Edges | – | ✓ | ✓ | ✓
Normalization | – | – | ✓ | ✓
Abstract Particles | – | – | – | ✓
FluidShake | 2.75±0.86 | 1.52±0.39 | 1.22±0.37 | 1.30±0.41
L-FluidShake | 8.18±3.15 | 3.17±0.94 | 2.40±0.74 | 2.16±0.62

Effectiveness of implicit edges. Configuration A uses vanilla Transformer encoders, while configuration B is TIE without the interaction attention, ensuring that the only difference between them is the edge-free structure. The hidden dimension and number of blocks are the same, while TIE is slightly larger because of the extra projections for receiver and sender tokens. As shown in Table 3, the original Transformer achieves worse performance, suggesting that scalar attention scores alone are insufficient to capture the rich semantics of interactions among particles. In contrast, implicitly modeling edges enables TIE to take advantage of GNN methods and recover more semantics of particle interactions.

Effectiveness of normalization effects in attention. We follow configuration C to build TIE, which includes Equation 17. Comparing configurations B and C in Table 3, we find that the normalization effects bring benefits to TIE on both base and generalized domains. Such a structure further enables TIE to trace the rich semantics of edges, leading to more stable and robust performance.

Effectiveness of abstract particles. As shown in Table 4, we replace the abstract particles with dummy particles, which are zero-initialized vectors with fixed values but have the same connectivities as abstract particles. Thus, the dummy particles cannot capture the semantics of materials during training. TIE with dummy particles only slightly improves the performance on the base domain, suggesting that the extra connectivities introduced by abstract particles alone benefit TIE little. TIE+ achieves more stable and robust performance, suggesting that the abstract particles effectively disentangle the domain-specific semantics, i.e., the outer forces introduced by walls, from the material-specific semantics, i.e., the patterns of fluid particle dynamics.

Table 4: Ablation studies on abstract particles and sensitivity to the radius $R$. To explore the material-aware semantics extracted by abstract particles, we conduct experiments on BoxBath and the generalized domain BunnyBath, where the rigid cube is replaced by a bunny. We replace abstract particles with dummy particles, which are zero constant vectors with the same connectivities as abstract particles. TIE marked by "dummy" adopts dummy particles. The sensitivity results are on the right part. We report $\text{M}^3\text{SE}$ (1e-2) on FluidShake. Our default setting on all domains is marked by *.
Methods | BoxBath (960,64,cube)* | BoxBath (960,41,bunny) || Methods | FluidShake R=0.07 | R*=0.08 | R=0.09
TIE | 1.35±0.35 | 1.50±0.45 || DPI-Net | 2.60±0.56 | 1.38±0.45 | 1.66±0.48
TIE dummy | 1.21±0.28 | 1.96±0.71 || GraphTrans | 1.97±0.48 | 1.36±0.37 | 1.36±0.38
TIE+ | 0.92±0.16 | 1.39±0.48 || TIE | 1.60±0.37 | 1.22±0.37 | 1.31±0.40

Sensitivity to $R$. Quantitative results are reported on FluidShake. As shown in Table 4, when $R$ is smaller, models tend to suffer a drop in accuracy due to insufficient particle interactions. When $R$ is larger, the drop in accuracy for DPI-Net is caused by redundant interactions due to the high flexibility of fluid moving patterns. In all cases, TIE achieves superior performance more efficiently, suggesting the effectiveness and robustness of our model.

5 Conclusion

In this paper, we propose Transformer with Implicit Edges (TIE), which traces edge semantics in an edge-free manner and introduces abstract particles to simulate domains of different complexity and materials. Our experimental results show the effectiveness and efficiency of the edge-free structure. The abstract particles enable TIE to capture material-specific semantics, achieving robust performance on complex generalization domains. Finally, TIE makes a successful attempt to hybridize GNN and Transformer for physics simulation and achieves superior performance over existing methods, showing the potential of implicitly modeling edges in physics simulation.

Acknowledgements. This study is supported under the RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). It is also supported by Singapore MOE AcRF Tier 2 (MOE-T2EP20221-0011) and Shanghai AI Laboratory.

References

  • [1] Ba, L.J., Kiros, J.R., Hinton, G.E.: Layer normalization. CoRR (2016)
  • [2] Battaglia, P.W., Pascanu, R., Lai, M., Rezende, D.J., Kavukcuoglu, K.: Interaction networks for learning about objects, relations and physics. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain (2016)
  • [3] Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric deep learning: Going beyond euclidean data. IEEE Signal Process. Mag. (2017)
  • [4] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (2020)
  • [5] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I (2020)
  • [6] Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers) (2019)
  • [7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 (2021)
  • [8] Dwivedi, V.P., Bresson, X.: A generalization of transformer networks to graphs. CoRR (2020)
  • [9] Hanocka, R., Hertz, A., Fish, N., Giryes, R., Fleishman, S., Cohen-Or, D.: MeshCNN: a network with an edge. ACM Trans. Graph. (2019)
  • [10] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. CoRR (2016)
  • [11] Lee, S., You, D.: Data-driven prediction of unsteady flow over a circular cylinder using deep learning. Journal of Fluid Mechanics (2019)
  • [12] Li, Y., Wu, J., Tedrake, R., Tenenbaum, J.B., Torralba, A.: Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 (2019)
  • [13] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. CoRR (2021)
  • [14] Luo, R., Shao, T., Wang, H., Xu, W., Chen, X., Zhou, K., Yang, Y.: NNWarp: Neural network-based nonlinear deformation. IEEE Trans. Vis. Comput. Graph. (2020)
  • [15] Macklin, M., Müller, M.: Position based fluids. ACM Trans. Graph. (2013)
  • [16] Monaghan, J.J.: Smoothed particle hydrodynamics. Annual review of astronomy and astrophysics (1992)
  • [17] Mrowca, D., Zhuang, C., Wang, E., Haber, N., Fei-Fei, L., Tenenbaum, J., Yamins, D.L.: Flexible neural representation for physics prediction. In: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada (2018)
  • [18] Nash, C., Ganin, Y., Eslami, S.M.A., Battaglia, P.W.: PolyGen: An autoregressive generative model of 3d meshes. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (2020)
  • [19] Pfaff, T., Fortunato, M., Sanchez-Gonzalez, A., Battaglia, P.W.: Learning mesh-based simulation with graph networks. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 (2021)
  • [20] Qiao, Y., Liang, J., Koltun, V., Lin, M.C.: Scalable differentiable physics for learning and control. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (2020)
  • [21] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog (2019)
  • [22] Sanchez-Gonzalez, A., Godwin, J., Pfaff, T., Ying, R., Leskovec, J., Battaglia, P.W.: Learning to simulate complex physics with graph networks. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (2020)
  • [23] Schenck, C., Fox, D.: SPNets: Differentiable fluid dynamics for deep neural networks. In: 2nd Annual Conference on Robot Learning, CoRL 2018, Zürich, Switzerland, 29-31 October 2018, Proceedings (2018)
  • [24] Thuerey, N., Weißenow, K., Prantl, L., Hu, X.: Deep learning methods for reynolds-averaged navier–stokes simulations of airfoil flows. AIAA Journal (2020)
  • [25] Ummenhofer, B., Prantl, L., Thuerey, N., Koltun, V.: Lagrangian fluid simulation with continuous convolutions. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 (2020)
  • [26] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA (2017)
  • [27] Wang, H., Zhu, Y., Adam, H., Yuille, A.L., Chen, L.: MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021 (2021)
  • [28] Wang, R., Kashinath, K., Mustafa, M., Albert, A., Yu, R.: Towards physics-informed deep learning for turbulent flow prediction. In: KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020 (2020)
  • [29] Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018 (2018)
  • [30] Weng, Z., Paus, F., Varava, A., Yin, H., Asfour, T., Kragic, D.: Graph-based task-specific prediction models for interactions between deformable and rigid objects. CoRR (2021)
  • [31] Zhang, J., Zhang, H., Xia, C., Sun, L.: Graph-Bert: Only attention is needed for learning graph representations. CoRR (2020)
  • [32] Zhou, D., Zheng, L., Han, J., He, J.: A data-driven graph generative model for temporal interaction networks. In: KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020 (2020)

Appendix 0.A Model Details

0.A.1 Decomposing GNN

In the following, we show the detailed derivation of implicitly modeling edges in TIE. When omitting the normalization, bias, and activation, the edge propagation function implemented by MLPs in GNN can be written as

\bm{e}^{(l+1)}_{ij} = W^{(l)}\left[\bm{v}^{(l)}_i; \bm{v}^{(l)}_j; \bm{e}^{(l)}_{ij}\right],   (20)

where $W^{(l)} \in \mathbb{R}^{d\times 3d}$ is the parameter matrix of the MLP, and $[\cdot;\cdot]$ denotes concatenation. By splitting $W^{(l)}$ into three square blocks $W^{(l)} = [W^{(l)}_r, W^{(l)}_s, W^{(l)}_m]$ and expanding the edge embeddings, we have

\bm{e}^{(l+1)}_{ij} = W^{(l)}_r \bm{v}^{(l)}_i + W^{(l)}_s \bm{v}^{(l)}_j + W^{(l)}_m \bm{e}^{(l)}_{ij}   (21)
  = \left(W^{(l)}_r \bm{v}^{(l)}_i + W^{(l)}_m W^{(l-1)}_r \bm{v}^{(l-1)}_i\right) + \left(W^{(l)}_s \bm{v}^{(l)}_j + W^{(l)}_m W^{(l-1)}_s \bm{v}^{(l-1)}_j\right) + W^{(l)}_m W^{(l-1)}_m \bm{e}^{(l-1)}_{ij}   (24)
  = \left(W^{(l)}_r \bm{v}^{(l)}_i + \sum_{u=1}^{l}\left(\prod_{k=0}^{u-1} W^{(l-k)}_m\right) W^{(l-u)}_r \bm{v}^{(l-u)}_i\right) + \left(W^{(l)}_s \bm{v}^{(l)}_j + \sum_{u=1}^{l}\left(\prod_{k=0}^{u-1} W^{(l-k)}_m\right) W^{(l-u)}_s \bm{v}^{(l-u)}_j\right),   (26)

where we assume $\bm{v}^{(0)}_i = \bm{x}_i$, and $[W^{(0)}_r, W^{(0)}_s] = W^{(0)}$ are the parameters of the edge initialization function $f^{\mathrm{enc}}_E(\cdot)$. Letting $\bm{r}^{(l)}_i = W^{(l)}_r \bm{v}^{(l)}_i + W^{(l)}_m \bm{r}^{(l-1)}_i$ and $\bm{s}^{(l)}_j = W^{(l)}_s \bm{v}^{(l)}_j + W^{(l)}_m \bm{s}^{(l-1)}_j$, we can further simplify Eq. 26 as

\bm{e}^{(l+1)}_{ij} = \left(W^{(l)}_r \bm{v}^{(l)}_i + W^{(l)}_m \bm{r}^{(l-1)}_i\right) + \left(W^{(l)}_s \bm{v}^{(l)}_j + W^{(l)}_m \bm{s}^{(l-1)}_j\right)   (27)
  = \bm{r}^{(l)}_i + \bm{s}^{(l)}_j,   (28)

where we assume $\bm{r}^{(0)}_i = W^{(0)}_r \bm{v}^{(0)}_i$ and $\bm{s}^{(0)}_j = W^{(0)}_s \bm{v}^{(0)}_j$, and $l \in \{1, 2, \cdots, L\}$ is the index of the block.
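The decomposition can be checked numerically: the snippet below (Python/NumPy) propagates an explicit edge with block weights $[W_r, W_s, W_m]$ and, in parallel, the receiver and sender recursions of Equations 9-11, then verifies Equation 28. All tensors are random placeholders used purely for the check.

```python
import numpy as np

d, L = 16, 3
rng = np.random.default_rng(0)
x_i, x_j = rng.standard_normal(d), rng.standard_normal(d)
v_i = rng.standard_normal((L + 1, d))        # node features v^{(l)} for particle i
v_j = rng.standard_normal((L + 1, d))        # node features v^{(l)} for particle j
W_r = rng.standard_normal((L + 1, d, d))
W_s = rng.standard_normal((L + 1, d, d))
W_m = rng.standard_normal((L + 1, d, d))

e = W_r[0] @ x_i + W_s[0] @ x_j              # explicit edge initialization
r, s = W_r[0] @ x_i, W_s[0] @ x_j            # Eq. (9)
for l in range(1, L + 1):
    e = W_r[l] @ v_i[l] + W_s[l] @ v_j[l] + W_m[l] @ e   # explicit update, Eq. (21)
    r = W_r[l] @ v_i[l] + W_m[l] @ r                     # Eq. (10)
    s = W_s[l] @ v_j[l] + W_m[l] @ s                     # Eq. (11)

assert np.allclose(e, r + s)                 # Eq. (28): the edge is recovered without storing it
```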

By substituting Equation 26 into the self-attention formula in Transformer [26], we have:

\omega'_{ij} = (W^{(l)}_Q \bm{v}^{(l)}_i)^\top (\bm{r}^{(l)}_i + \bm{s}^{(l)}_j)   (29)
  = (W^{(l)}_Q \bm{v}^{(l)}_i)^\top \bm{r}^{(l)}_i + (W^{(l)}_Q \bm{v}^{(l)}_i)^\top \bm{s}^{(l)}_j,   (30)
\hat{\omega}'_{ij} = \mathrm{softmax}\left(\frac{\omega'_{ij}}{\sqrt{d}}\right),   (31)
\bm{v}^{(l+1)}_i = \sum_j \hat{\omega}'_{ij}\cdot(\bm{r}^{(l)}_i + \bm{s}^{(l)}_j)   (32)
  = \sum_j \hat{\omega}'_{ij}\cdot\bm{r}^{(l)}_i + \sum_j \hat{\omega}'_{ij}\cdot\bm{s}^{(l)}_j   (33)
  = \bm{r}^{(l)}_i + \sum_j \hat{\omega}'_{ij}\cdot\bm{s}^{(l)}_j.   (34)

0.A.2 Normalization Effects

Given Equation 26, we further modify the self-attention in Equation 30 and Equation 34 to include the effects of the normalization applied to edges in GNN. Since GNN-based methods usually incorporate LayerNorm [1] in their network architectures, which computes the mean and std of edge features to improve performance and training speed, we propose to apply the effect of normalizing each edge in GNN to our model. The mean $\mu^{(l)}_{ij}$ and std $\sigma^{(l)}_{ij}$ of the interaction between particles $i$ and $j$ can be computed from the receiver token $\bm{r}^{(l)}_i$ and the sender token $\bm{s}^{(l)}_j$ by

\mu^{(l)}_{ij} = \frac{1}{d}\sum_k\left(r^{(l)}_{ik} + s^{(l)}_{jk}\right)   (35)
  = \frac{1}{d}\left(\sum_k r^{(l)}_{ik} + \sum_k s^{(l)}_{jk}\right)   (36)
  = \mu^{(l)}_{r_i} + \mu^{(l)}_{s_j},   (37)
\left(\sigma^{(l)}_{ij}\right)^2 = \frac{1}{d}\sum_k\left(r^{(l)}_{ik} + s^{(l)}_{jk} - \mu^{(l)}_{ij}\right)^2   (38)
  = \frac{1}{d}\sum_k\left((r^{(l)}_{ik})^2 + (s^{(l)}_{jk})^2 + (\mu^{(l)}_{ij})^2 + 2 r^{(l)}_{ik} s^{(l)}_{jk} - 2\mu^{(l)}_{ij}(r^{(l)}_{ik} + s^{(l)}_{jk})\right)   (39)
  = \frac{1}{d}\left((\bm{r}^{(l)}_i)^\top\bm{r}^{(l)}_i + (\bm{s}^{(l)}_j)^\top\bm{s}^{(l)}_j\right) + (\mu^{(l)}_{ij})^2 + \frac{2}{d}(\bm{r}^{(l)}_i)^\top\bm{s}^{(l)}_j - 2(\mu^{(l)}_{ij})^2   (40)
  = \frac{1}{d}(\bm{r}^{(l)}_i)^\top\bm{r}^{(l)}_i + \frac{1}{d}(\bm{s}^{(l)}_j)^\top\bm{s}^{(l)}_j + \frac{2}{d}(\bm{r}^{(l)}_i)^\top\bm{s}^{(l)}_j - (\mu^{(l)}_{ij})^2   (41)
  = \frac{1}{d}(\bm{r}^{(l)}_i)^\top\bm{r}^{(l)}_i + \frac{1}{d}(\bm{s}^{(l)}_j)^\top\bm{s}^{(l)}_j + \frac{2}{d}(\bm{r}^{(l)}_i)^\top\bm{s}^{(l)}_j - (\mu^{(l)}_{r_i} + \mu^{(l)}_{s_j})^2,   (42)

where $r^{(l)}_{ik}$ and $s^{(l)}_{jk}$ are respectively the $k$-th elements of $\bm{r}^{(l)}_i$ and $\bm{s}^{(l)}_j$, and $\mu^{(l)}_{r_i}$ and $\mu^{(l)}_{s_j}$ are respectively the means of the receiver token $\bm{r}^{(l)}_i$ and the sender token $\bm{s}^{(l)}_j$ after the $l$-th block. Hence, by replacing $\bm{r}^{(l)}_i + \bm{s}^{(l)}_j$ in Equation 29 and Equation 32 with $\frac{\bm{r}^{(l)}_i + \bm{s}^{(l)}_j - \mu^{(l)}_{ij}}{\sigma^{(l)}_{ij}}$, we have

\omega''_{ij} = \frac{(W^{(l)}_Q \bm{v}^{(l)}_i)^\top(\bm{r}^{(l)}_i - \mu^{(l)}_{r_i}) + (W^{(l)}_Q \bm{v}^{(l)}_i)^\top(\bm{s}^{(l)}_j - \mu^{(l)}_{s_j})}{\sigma^{(l)}_{ij}},   (43)
\hat{\omega}''_{ij} = \mathrm{softmax}\left(\frac{\omega''_{ij}}{\sqrt{d}}\right),   (44)
\bm{v}^{(l+1)}_i = \sum_j \hat{\omega}''_{ij}\cdot\frac{\bm{r}^{(l)}_i - \mu^{(l)}_{r_i}}{\sigma^{(l)}_{ij}} + \sum_j \hat{\omega}''_{ij}\cdot\frac{\bm{s}^{(l)}_j - \mu^{(l)}_{s_j}}{\sigma^{(l)}_{ij}}.   (45)

As for the scaling and shifting parameters in LayerNorm [1], we add them to Equation 45 to better resemble the normalization effects.

Appendix 0.B Experiment Details

0.B.1 Implementation Details

Inputs and outputs details. For FluidFall, FluidShake, and BoxBath, we only use the particle states at time $t$ as inputs and output the velocities at time $t+1$. For RiceGrip, we concatenate the particle states from $t-2$ to $t$ as inputs and output a 6-dimensional vector containing the velocities of the current observed position and the resting position. For BoxBath, we output 7-dimensional vectors, where 3 dimensions are the predicted velocities and 4 dimensions are rotation constraints. The rotation constraints, which predict the rotation velocities, are applied only to rigid particles, the same as in DPI-Net [12]. All particle states, such as positions and velocities, are first normalized by the mean and standard deviation computed on the corresponding training set before being fed into the models.

Training. We train four models independently on the four domains, with 5 epochs on FluidShake and BoxBath, 13 epochs on FluidFall, and 20 epochs on RiceGrip. For common settings, we adopt the Adam optimizer with an initial learning rate of 0.0008, decayed by a factor of 0.8 when the validation loss stops decreasing for 3 epochs. The batch size is set to 16 on all domains. All models are trained and tested on V100 GPUs for all experiments, with no augmentation involved.

Baseline details. For a fair comparison, the following settings are the same as for TIE: model inputs, number of training epochs on each domain, learning-rate schedule, and training loss on velocities. Hyper-parameters for the baselines are first set as in their original papers and then fine-tuned within a small range. For example, in terms of batch size, 16 works better for DPI-Net than its original setting.

0.B.2 Data Generation

Table 5: Details of generalization settings. We list the number of particles in both training domains and generalization domains. The lists of numbers in L-FluidShake and L-RiceGrip are ranges of particle counts, while the numbers of fluid and rigid particles in generalized BoxBath are listed separately.
Domains Training Settings Generalization Settings
L-FluidShake [450, 627] [720, 1368]
L-RiceGrip [570, 980] [1062, 1642]
Lfluid-BoxBath Fluid: 960. Rigid: 64 Fluid: 1280. Rigid: 64
L-BoxBath Fluid: 960. Rigid: 64 Fluid: 960. Rigid: 125
BunnyBath Fluid: 960. Rigid: 64 Fluid: 960. Rigid: 41
BallBath Fluid: 960. Rigid: 64 Fluid: 960. Rigid: 136

Basic Domains. We use the same settings for our datasets as in previous work [12]. FluidFall contains two fluid droplets of different sizes. The sizes of the droplets are randomly generated with one droplet larger than the other. Positions and viscosity of the droplets are randomly initialized. This domain contains 189 particles with 121 frames per rollout. There are 2700 rollouts in the training set and 300 rollouts in the validation set. FluidShake simulates water in a moving box. The speed of the box is randomly generated at each time step. In addition, the size of the box and the number of particles vary across rollouts. In the basic training and validation sets, the number of particles varies from 450 to 627. This domain has 301 frames per rollout. There are 1800 rollouts in the training set and 200 rollouts in the validation set. RiceGrip contains two grippers and a piece of sticky rice. The grippers' positions and orientations are randomly initialized. The number of particles for the rice varies from 570 to 980, with 41 frames per rollout in the training and validation sets. There are 4500 rollouts in the training set and 500 rollouts in the validation set. BoxBath simulates a rigid cube washed by water in a fixed container. The initial positions of the fluid block and the rigid cube are randomly initialized. This domain contains 960 fluid particles and 64 rigid particles with 151 frames per rollout. There are 2700 rollouts in the training set and 300 rollouts in the validation set.

Generalization Domains. We list the details of the generalization settings in Table 5. We add more particles to FluidShake and RiceGrip, which we refer to as L-FluidShake and L-RiceGrip respectively. L-FluidShake includes 720 to 1368 particles, while L-RiceGrip contains 1062 to 1642 particles. On BoxBath, we enlarge the fluid block and change the size and shape of the rigid object. Specifically, we increase the number of fluid particles to 1280 in Lfluid-BoxBath, and we enlarge the rigid cube to 125 particles in L-BoxBath. We also change the shape of the rigid object into a ball and a bunny, which we refer to as BallBath and BunnyBath respectively. The number of test rollouts and the number of frames per rollout are the same as in the corresponding basic domains.

0.B.3 Rendered Rollouts

We visualize some rollouts on BoxBath and its generalized domains, which are complex domains with multi-material interactions. We break BoxBath down into 5 key steps: 1. the rigid object, flooded by fluid, hits the wall; 2. the rigid object is thrown into the air; 3. the rigid object falls into the fluid; 4. the fluid, after hitting the wall, pushes the rigid object; 5. the rigid object slows down and stops moving. The visual results and analysis are shown in the following. Our TIE+ achieves more faithful rollouts on all domains, suggesting the effectiveness of our implicitly modeled edges and the abstract particles.

Figure 6: Qualitative results on BoxBath. For step 2 and 3, the rigid cube is flooded into the air and pushed away from the right wall. Our TIE+ is able to achieve more faithful rollouts even with more accurate rotation angles, and the cube is pushed far away enough from the right wall. For step 5, when the cube slows down and finally stops, the position of the rigid cube predicted by our TIE+ is closer to the ground truth. For the simulations of the fluid particles, our model achieves more vivid results compared with other models. For example, the surface of water is smooth and is closer to the ground truth.
Figure 7: Qualitative results on Lfluid-BoxBath, where we add more fluid particles. When focusing on the rigid cube, the positions and rotation angles achieved by our TIE+ are much closer to the ground truth. When focusing on the fluid particles, our TIE+ predicts more vivid wave, which floods the box and pushes it towards left in a more faithful manner.
Figure 8: Qualitative results on L-BoxBath, where we enlarge the rigid cube. Notice that the rigid cubes predicted by DPI-Net and GNS tend to rotate and move in place; these two methods have more difficulty generalizing to this domain. The rigid cube predicted by GraphTrans is overly pushed by the wave at $t=56$, while it does not fully interact with the left wall to bounce back at $t=124$. For both the fluid part and the rigid part, TIE+ still predicts more faithful results.
Figure 9: Qualitative results on BallBath, where we change the cube into a ball. The balls predicted by both DPI-Net and GNS rotate and move in place until the end, while GraphTrans has difficulty predicting the position of the ball. TIE+ still achieves a faithful rollout.
Figure 10: Qualitative results on BunnyBath, where we change the cube into a bunny with more complex surfaces. TIE+ achieves a more faithful rollout. At time $t=2$, we show the shape of the bunny, which has complex surfaces. At time $t=45$, while TIE+ rolls out vivid dynamics for the fluid particles, it also predicts closer positions for the bunny, which is rotating and flying in the air.