
Learning Space-Time Continuous Neural PDEs from Partially Observed States

Valerii Iakovlev  Markus Heinonen  Harri Lähdesmäki
Department of Computer Science, Aalto University, Finland
{valerii.iakovlev, markus.o.heinonen, harri.lahdesmaki}@aalto.fi
Abstract

We introduce a novel grid-independent model for learning partial differential equations (PDEs) from noisy and partial observations on irregular spatiotemporal grids. We propose a space-time continuous latent neural PDE model with an efficient probabilistic framework and a novel encoder design for improved data efficiency and grid independence. The latent state dynamics are governed by a PDE model that combines the collocation method and the method of lines. We employ amortized variational inference for approximate posterior estimation and utilize a multiple shooting technique for enhanced training speed and stability. Our model demonstrates state-of-the-art performance on complex synthetic and real-world datasets, overcoming limitations of previous approaches and effectively handling partially-observed data. The proposed model outperforms recent methods, showing its potential to advance data-driven PDE modeling and enabling robust, grid-independent modeling of complex partially-observed dynamic processes.

1 Introduction

Source code and datasets can be found in our GitHub repository.

Modeling spatiotemporal processes allows us to understand and predict the behavior of complex systems that evolve over time and space (Cressie and Wikle, 2011). Partial differential equations (PDEs) are a popular tool for this task as they have a solid mathematical foundation (Evans, 2010) and can describe the dynamics of a wide range of physical, biological, and social phenomena (Murray, 2002; Hirsch, 2007). However, deriving PDEs can be challenging, especially when the system’s underlying mechanisms are complex and not well understood. Data-driven methods can bypass these challenges (Brunton and Kutz, 2019). By learning the underlying system dynamics directly from data, we can develop accurate PDE models that capture the essential features of the system. This approach has changed our ability to model complex systems and make predictions about their behavior in a data-driven manner.

While current data-driven PDE models have been successful at modeling complex spatiotemporal phenomena, they often operate under various simplifying assumptions such as regularity of the spatial or temporal grids (Long et al., 2018; Kochkov et al., 2021; Pfaff et al., 2021; Li et al., 2021; Han et al., 2022; Poli et al., 2022), discreteness in space or time (Seo et al., 2020; Pfaff et al., 2021; Lienen and Günnemann, 2022; Brandstetter et al., 2022), and availability of complete and noiseless observations (Long et al., 2018; Pfaff et al., 2021; Wu et al., 2022). Such assumptions become increasingly limiting in more realistic scenarios with scarce data and irregularly spaced, noisy and partial observations.

We address the limitations of existing methods and propose a space-time continuous and grid-independent model that can learn PDE dynamics from noisy and partial observations made on irregular spatiotemporal grids. Our main contributions include:

  • Development of an efficient generative modeling framework for learning latent neural PDE models from noisy and partially-observed data;

  • Novel PDE model that merges two PDE solution techniques – the collocation method and the method of lines – to achieve space-time continuity, grid-independence, and data efficiency;

  • Novel encoder design that operates on local spatiotemporal neighborhoods for improved data-efficiency and grid-independence.

Our model demonstrates state-of-the-art performance on complex synthetic and real-world datasets, opening up the possibility for accurate and efficient modeling of complex dynamic processes and promoting further advancements in data-driven PDE modeling.

2 Problem Setup

In this work we are concerned with modeling of spatiotemporal processes. For brevity, we present our method for a single observed trajectory, but the extension to multiple trajectories is straightforward. We observe a spatiotemporal dynamical system evolving over time on a spatial domain $\Omega$. The observations are made at $M$ arbitrary consecutive time points $t_{1:M}:=(t_{1},\dots,t_{M})$ and $N$ arbitrary observation locations $\bm{x}_{1:N}:=(\bm{x}_{1},\dots,\bm{x}_{N})$, where $\bm{x}_{i}\in\Omega$. This generates a sequence of observations $\bm{u}_{1:M}:=(\bm{u}_{1},\dots,\bm{u}_{M})$, where $\bm{u}_{i}\in\mathbb{R}^{N\times D}$ contains the $D$-dimensional observations at the $N$ observation locations. We define $\bm{u}_{i}^{j}$ as the observation at time $t_{i}$ and location $\bm{x}_{j}$. The number of time points and observation locations may vary between different observed trajectories.

We assume the data is generated by a dynamical system with a latent state $\bm{z}(t,\bm{x})\in\mathbb{R}^{d}$, where $t$ is time and $\bm{x}\in\Omega$ is the spatial location. The latent state is governed by an unknown PDE and is mapped to the observed state $\bm{u}(t,\bm{x})\in\mathbb{R}^{D}$ by an unknown observation function $g$ and likelihood model $p$:

$$\frac{\partial \bm{z}(t,\bm{x})}{\partial t}=F\big(\bm{z}(t,\bm{x}),\,\partial_{\bm{x}}\bm{z}(t,\bm{x}),\,\partial^{2}_{\bm{x}}\bm{z}(t,\bm{x}),\,\ldots\big), \tag{1}$$

$$\bm{u}(t,\bm{x})\sim p\big(g(\bm{z}(t,\bm{x}))\big), \tag{2}$$

where $\partial^{\bullet}_{\bm{x}}\bm{z}(t,\bm{x})$ denotes partial derivatives with respect to $\bm{x}$.

In this work we make two assumptions that are highly relevant in real-world scenarios. First, we assume partial observations, that is, the observed state $\bm{u}(t,\bm{x})$ does not contain all information about the latent state $\bm{z}(t,\bm{x})$ (e.g., $\bm{z}(t,\bm{x})$ contains pressure and velocity, but $\bm{u}(t,\bm{x})$ contains information only about the pressure). Second, we assume out-of-distribution time points and observation locations, that is, their number, positions, and density can change arbitrarily at test time.

3 Model

Figure 1: Model sketch. The initial latent state $\bm{z}(t_{1},\bm{x})$ is evolved via $F_{\theta_{\text{dyn}}}$ to the following latent states, which are then mapped to the observed states by $g_{\theta_{\text{dec}}}$.

Here we describe the model components (Sec. 3.1) which are then used to construct the generative model (Sec. 3.2).

3.1 Model components

Our model consists of four parts: a space-time continuous latent state $\bm{z}(t,\bm{x})$ and observed state $\bm{u}(t,\bm{x})$, a dynamics function $F_{\theta_{\text{dyn}}}$ governing the temporal evolution of the latent state, and an observation function $g_{\theta_{\text{dec}}}$ mapping the latent state to the observed state (see Figure 1). Next, we describe these components in detail.

Latent state.

To define a space-time continuous latent state $\bm{z}(t,\bm{x})\in\mathbb{R}^{d}$, we introduce $\bm{z}(t):=(\bm{z}^{1}(t),\dots,\bm{z}^{N}(t))\in\mathbb{R}^{N\times d}$, where each $\bm{z}^{i}(t)\in\mathbb{R}^{d}$ corresponds to the observation location $\bm{x}_{i}$. Then, we define the latent state $\bm{z}(t,\bm{x})$ as a spatial interpolant of $\bm{z}(t)$:

$$\bm{z}(t,\bm{x}):=\mathrm{Interpolate}(\bm{z}(t))(\bm{x}), \tag{3}$$

where $\mathrm{Interpolate}(\cdot)$ maps $\bm{z}(t)$ to an interpolant which can be evaluated at any spatial location $\bm{x}\in\Omega$ (see Figure 2). We do not rely on a particular interpolation method, but in this work we use linear interpolation as it shows good performance and facilitates efficient implementation.
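To make Eq. 3 concrete, below is a minimal sketch of such an interpolation step, assuming a one-dimensional spatial domain and PyTorch; the actual model applies linear interpolation on the 2D observation mesh, and all names here are illustrative.

```python
import torch

def linear_interpolant(x_nodes, z_nodes):
    """Return a callable that linearly interpolates the latent node values
    z_nodes (N, d), defined at 1D locations x_nodes (N,), at arbitrary query
    locations. Minimal 1D sketch of Interpolate(z(t))."""
    order = torch.argsort(x_nodes)
    x_sorted, z_sorted = x_nodes[order], z_nodes[order]

    def interp(x_query):
        # Index of the right neighbor of every query point, clamped to the domain.
        idx = torch.searchsorted(x_sorted, x_query).clamp(1, len(x_sorted) - 1)
        x0, x1 = x_sorted[idx - 1], x_sorted[idx]
        z0, z1 = z_sorted[idx - 1], z_sorted[idx]
        w = ((x_query - x0) / (x1 - x0)).unsqueeze(-1)   # interpolation weights
        return (1 - w) * z0 + w * z1                     # (num_queries, d)

    return interp

# z(t, x) evaluated anywhere in the domain:
# z_tx = linear_interpolant(x_nodes, z_t)(x_query)
```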

Latent state dynamics.

Figure 2: Latent state $\bm{z}(t,\bm{x})$ defined as an interpolant of $\bm{z}(t):=(\bm{z}^{1}(t),\dots,\bm{z}^{4}(t))$.

Given a space-time continuous latent state, one can naturally define its dynamics in terms of a PDE:

$$\frac{\partial \bm{z}(t,\bm{x})}{\partial t}=F_{\theta_{\text{dyn}}}\big(\bm{z}(t,\bm{x}),\,\partial_{\bm{x}}\bm{z}(t,\bm{x}),\,\partial^{2}_{\bm{x}}\bm{z}(t,\bm{x}),\,\ldots\big), \tag{4}$$

where $F_{\theta_{\text{dyn}}}$ is a dynamics function with parameters $\theta_{\text{dyn}}$. This is a viable approach known as the collocation method (Kansa, 1990; Cheng, 2009), but it has several limitations. It requires us to decide which partial derivatives to include in the dynamics function, and it also requires an interpolant that has all the selected partial derivatives (e.g., a linear interpolant has only first-order derivatives). To avoid these limitations, we combine the collocation method with another PDE solution technique known as the method of lines (Schiesser, 1991; Hamdi et al., 2007), which approximates spatial derivatives $\partial^{\bullet}_{\bm{x}}\bm{z}(t,\bm{x})$ using only evaluations of $\bm{z}(t,\bm{x})$, and we let the dynamics function approximate all required derivatives in a data-driven manner. To do that, we define the spatial neighborhood of $\bm{x}$ as $\mathcal{N}_{\text{S}}(\bm{x})$, a set containing $\bm{x}$ and its spatial neighbors, and define $\bm{z}(t,\mathcal{N}_{\text{S}}(\bm{x}))$, the set of evaluations of the interpolant $\bm{z}(t,\bm{x})$ at the points in $\mathcal{N}_{\text{S}}(\bm{x})$:

$$\mathcal{N}_{\text{S}}(\bm{x}):=\{\bm{x}^{\prime}\in\Omega:\bm{x}^{\prime}=\bm{x}\ \text{or}\ \bm{x}^{\prime}\ \text{is a spatial neighbor of}\ \bm{x}\}, \tag{5}$$

$$\bm{z}(t,\mathcal{N}_{\text{S}}(\bm{x})):=\{\bm{z}(t,\bm{x}^{\prime}):\bm{x}^{\prime}\in\mathcal{N}_{\text{S}}(\bm{x})\}, \tag{6}$$

and assume that this information is sufficient to approximate all required spatial derivatives at $\bm{x}$. This is a reasonable assumption since, e.g., finite differences can approximate derivatives using only function values and the locations of the evaluation points. Hence, we define the dynamics of $\bm{z}(t,\bm{x})$ as

$$\frac{\partial \bm{z}(t,\bm{x})}{\partial t}=F_{\theta_{\text{dyn}}}\big(\mathcal{N}_{\text{S}}(\bm{x}),\bm{z}(t,\mathcal{N}_{\text{S}}(\bm{x}))\big), \tag{7}$$

which is defined only in terms of the values of the latent state, but not its spatial derivatives.
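For concreteness, such a dynamics function could be a small MLP over the stacked neighborhood values. Since the neighborhood pattern is fixed (see below), the relative locations in $\mathcal{N}_{\text{S}}(\bm{x})$ are constant and can be absorbed into the network; the sketch below makes this assumption, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class DynamicsFn(nn.Module):
    """Sketch of F_theta_dyn: maps the latent-state values evaluated on a
    fixed-size spatial neighborhood to the time derivative of the latent
    state at the neighborhood center."""
    def __init__(self, latent_dim, n_points, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim * n_points, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_neighborhood):
        # z_neighborhood: (N, n_points, latent_dim), interpolant values at the
        # neighborhood points (including the center) of each observation location.
        return self.net(z_neighborhood.flatten(start_dim=1))  # (N, latent_dim)
```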

Figure 3: Example of $\mathcal{N}_{\text{S}}(\bm{x}_{i})$. Instead of using the observation locations (dots) to define spatial neighbors, we use spatial locations arranged in a fixed predefined pattern (crosses).

One way to define the spatial neighbors for $\bm{x}$ is in terms of the observation locations $\bm{x}_{1:N}$ (e.g., use the nearest ones), as was done, for example, in (Long et al., 2018; Pfaff et al., 2021; Lienen and Günnemann, 2022). Instead, we utilize the continuity of the latent state $\bm{z}(t,\bm{x})$ and define the spatial neighbors in a grid-independent manner as a fixed number of points arranged in a predefined pattern around $\bm{x}$ (see Figure 3). This allows us to fix the shape and size of the spatial neighborhoods in advance, making them independent of the observation locations. In this work we use a spatial neighborhood consisting of two concentric circles of radius $r$ and $r/2$; each circle contains 8 evaluation points as in Figure 3. In Appendix D we compare neighborhoods of various shapes and sizes.
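A sketch of generating this fixed evaluation pattern (two concentric circles of radius $r$ and $r/2$ with 8 points each, plus the center), assuming a 2D domain:

```python
import math
import torch

def neighborhood_offsets(r, n_per_circle=8):
    """Relative offsets of the neighborhood evaluation points: two concentric
    circles of radius r and r/2, each with n_per_circle points (2D sketch)."""
    angles = torch.arange(n_per_circle, dtype=torch.float32) * (2 * math.pi / n_per_circle)
    circle = torch.stack([torch.cos(angles), torch.sin(angles)], dim=-1)  # unit circle
    return torch.cat([r * circle, 0.5 * r * circle], dim=0)               # (2 * n_per_circle, 2)

# Evaluation points of N_S(x_i): x_i itself plus x_i + neighborhood_offsets(r).
```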

Equation 7 allows us to simulate the temporal evolution of $\bm{z}(t,\bm{x})$ at any spatial location. However, since $\bm{z}(t,\bm{x})$ is defined only in terms of a spatial interpolant of $\bm{z}(t)$ (see Eq. 3), with $\bm{z}^{i}(t)=\bm{z}(t,\bm{x}_{i})$, it is sufficient to simulate the latent state dynamics only at the observation locations $\bm{x}_{1:N}$. Hence, we can completely characterize the latent state dynamics in terms of a system of $N$ ODEs:

$$\frac{d\bm{z}(t)}{dt}:=\begin{pmatrix}\frac{d\bm{z}^{1}(t)}{dt}\\ \vdots\\ \frac{d\bm{z}^{N}(t)}{dt}\end{pmatrix}=\begin{pmatrix}\frac{\partial\bm{z}(t,\bm{x}_{1})}{\partial t}\\ \vdots\\ \frac{\partial\bm{z}(t,\bm{x}_{N})}{\partial t}\end{pmatrix}=\begin{pmatrix}F_{\theta_{\text{dyn}}}(\mathcal{N}_{\text{S}}(\bm{x}_{1}),\bm{z}(t,\mathcal{N}_{\text{S}}(\bm{x}_{1})))\\ \vdots\\ F_{\theta_{\text{dyn}}}(\mathcal{N}_{\text{S}}(\bm{x}_{N}),\bm{z}(t,\mathcal{N}_{\text{S}}(\bm{x}_{N})))\end{pmatrix}. \tag{8}$$

For convenience, we define $\bm{z}(t;t_{1},\bm{z}_{1},\theta_{\text{dyn}}):=\mathrm{ODESolve}(t;t_{1},\bm{z}_{1},\theta_{\text{dyn}})$ as the solution of the ODE system in Equation 8 at time $t$ with initial state $\bm{z}(t_{1})=\bm{z}_{1}$ and parameters $\theta_{\text{dyn}}$. We also define $\bm{z}(t,\bm{x};t_{1},\bm{z}_{1},\theta_{\text{dyn}})$ as the spatial interpolant of $\bm{z}(t;t_{1},\bm{z}_{1},\theta_{\text{dyn}})$ as in Equation 3. We solve the ODEs using off-the-shelf differentiable ODE solvers from the torchdiffeq package (Chen, 2018). Note that we solve for the state $\bm{z}(t)$ only at the observation locations $\bm{x}_{1:N}$, so to get the neighborhood values $\bm{z}(t,\mathcal{N}_{\text{S}}(\bm{x}_{i}))$ we perform interpolation at every step of the ODE solver.
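A minimal sketch of solving this ODE system with torchdiffeq: at every solver step the current node values are interpolated, the interpolant is evaluated on each spatial neighborhood, and $F_{\theta_{\text{dyn}}}$ returns the time derivatives. The interpolate and dynamics_fn callables and the neighborhood tensor are assumed to be defined as in the sketches above; names are illustrative.

```python
import torch.nn as nn
from torchdiffeq import odeint

class LatentODE(nn.Module):
    """Right-hand side of the ODE system in Eq. (8) (illustrative sketch)."""
    def __init__(self, dynamics_fn, interpolate, nbr_points):
        super().__init__()
        self.dynamics_fn = dynamics_fn   # F_theta_dyn, e.g. an MLP over neighborhood values
        self.interpolate = interpolate   # z(t) at the nodes -> interpolant over Omega
        self.nbr_points = nbr_points     # (N, K, 2) neighborhood evaluation points

    def forward(self, t, z_t):
        # z_t: (N, d) latent state at the observation locations.
        z_nbrs = self.interpolate(z_t)(self.nbr_points)   # (N, K, d)
        return self.dynamics_fn(z_nbrs)                   # dz/dt at the nodes, (N, d)

# ODESolve(t; t_1, z_1, theta_dyn):
# z_traj = odeint(LatentODE(...), z_1, t_grid, rtol=1e-3, atol=1e-4)
```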

Observation function.

We define the mapping from the latent space to the observation space as a parametric function $g_{\theta_{\text{dec}}}$ with parameters $\theta_{\text{dec}}$:

$$\bm{u}(t,\bm{x})\sim\mathcal{N}\big(g_{\theta_{\text{dec}}}(\bm{z}(t,\bm{x})),\,\sigma_{u}^{2}I_{D}\big), \tag{9}$$

where $\mathcal{N}$ is the Gaussian distribution, $\sigma_{u}^{2}$ is the noise variance, and $I_{D}$ is the $D$-by-$D$ identity matrix.
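As an example, the observation function could be a small MLP decoder (a sketch; the architecture is not prescribed by the model, and the names are illustrative):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of g_theta_dec: maps a latent state in R^d to the mean of the
    Gaussian observation model in R^D."""
    def __init__(self, latent_dim, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, z):
        return self.net(z)

# Observation likelihood, Eq. (9):
# log_prob = torch.distributions.Normal(decoder(z), sigma_u).log_prob(u).sum(-1)
```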

3.2 Generative model

Figure 4: Multiple shooting splits a trajectory with one initial state (top) into two sub-trajectories with two initial states (bottom) and tries to minimize the gap between sub-trajectories (orange arrow).

Training models of dynamic systems is often challenging due to long training times and training instabilities (Ribeiro et al., 2020; Metz et al., 2021). To alleviate these problems, various heuristics have been proposed, such as progressive lengthening and splitting of the training trajectories (Yildiz et al., 2019; Um et al., 2020). We use multiple shooting (Bock and Plitt, 1984; Voss et al., 2004), a simple and efficient technique which has demonstrated its effectiveness in ODE learning applications (Jordana et al., 2021; Hegde et al., 2022). We extend the multiple shooting framework for latent ODE models presented in (Iakovlev et al., 2023) to our PDE modeling setup by introducing spatial dimensions in the latent state and designing an encoder adapted specifically to the PDE setting (Section 4.2).

Multiple shooting splits a single trajectory $\{\bm{z}(t_{i})\}_{i=1,\dots,M}$ with one initial state $\bm{z}_{1}$ into $B$ consecutive non-overlapping sub-trajectories $\{\bm{z}(t_{i})\}_{i\in\mathcal{I}_{b}},\ b=1,\dots,B$, with $B$ initial states $\bm{s}_{1:B}:=(\bm{s}_{1},\dots,\bm{s}_{B})$, while imposing a continuity penalty between the sub-trajectories (see Figure 4). The index set $\mathcal{I}_{b}$ contains the time point indices of the $b$'th sub-trajectory. We denote the temporal position of $\bm{s}_{b}$ as $t_{[b]}$ and place $\bm{s}_{b}$ at the first time point preceding the $b$'th sub-trajectory (except $\bm{s}_{1}$, which is placed at $t_{1}$). Note that the shooting states $\bm{s}_{b}$ have the same dimension as the original latent state $\bm{z}(t)$, i.e., $\bm{s}_{b}\in\mathbb{R}^{N\times d}$. Multiple shooting allows us to parallelize the simulation over the sub-trajectories and shortens the simulation intervals, thus improving training speed and stability. In Appendix D we demonstrate the effect of multiple shooting on model training and prediction accuracy.
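To make the bookkeeping concrete, the following sketch splits a trajectory's time indices into the index sets $\mathcal{I}_{b}$ and records the positions of the shooting states (assuming, purely for illustration, sub-trajectories of a fixed length):

```python
def split_into_subtrajectories(num_time_points, block_size):
    """Split time indices 0..M-1 into B consecutive non-overlapping index sets
    I_b; the shooting state s_b sits at the time point preceding block b
    (and s_1 at the very first time point). Illustrative sketch."""
    index_sets, shooting_positions = [], []
    for start in range(0, num_time_points, block_size):
        block = list(range(start, min(start + block_size, num_time_points)))
        index_sets.append(block)
        shooting_positions.append(0 if start == 0 else start - 1)
    return index_sets, shooting_positions

# split_into_subtrajectories(7, 3) ->
#   index_sets = [[0, 1, 2], [3, 4, 5], [6]], shooting_positions = [0, 2, 5]
```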

We begin by defining the prior over the unknown model parameters and initial states:

$$p(\bm{s}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})=p(\bm{s}_{1:B}|\theta_{\text{dyn}})\,p(\theta_{\text{dyn}})\,p(\theta_{\text{dec}}), \tag{10}$$

where $p(\theta_{\text{dyn}})$ and $p(\theta_{\text{dec}})$ are zero-mean diagonal Gaussians, and the continuity-inducing prior $p(\bm{s}_{1:B}|\theta_{\text{dyn}})$ is defined as in (Iakovlev et al., 2023):

$$p(\bm{s}_{1:B}|\theta_{\text{dyn}})=p(\bm{s}_{1})\prod_{b=2}^{B}p(\bm{s}_{b}|\bm{s}_{b-1},\theta_{\text{dyn}}). \tag{11}$$

Intuitively, the continuity prior $p(\bm{s}_{b}|\bm{s}_{b-1},\theta_{\text{dyn}})$ takes the initial latent state $\bm{s}_{b-1}$, simulates it forward from time $t_{[b-1]}$ to $t_{[b]}$ to get $\bm{\mu}_{[b]}=\mathrm{ODESolve}(t_{[b]};t_{[b-1]},\bm{s}_{b-1},\theta_{\text{dyn}})$, and then forces $\bm{\mu}_{[b]}$ to approximately match the initial state $\bm{s}_{b}$ of the next sub-trajectory, thus promoting continuity of the full trajectory. We assume the continuity-inducing prior factorizes across the grid points, i.e.,

$$p(\bm{s}_{1:B}|\theta_{\text{dyn}})=\left[\prod_{j=1}^{N}p(\bm{s}_{1}^{j})\right]\left[\prod_{b=2}^{B}\prod_{j=1}^{N}p(\bm{s}_{b}^{j}|\bm{s}_{b-1},\theta_{\text{dyn}})\right] \tag{12}$$

$$=\left[\prod_{j=1}^{N}p(\bm{s}_{1}^{j})\right]\left[\prod_{b=2}^{B}\prod_{j=1}^{N}\mathcal{N}\!\left(\bm{s}_{b}^{j}\,\middle|\,\bm{z}(t_{[b]},\bm{x}_{j};t_{[b-1]},\bm{s}_{b-1},\theta_{\text{dyn}}),\,\sigma_{c}^{2}I_{d}\right)\right], \tag{13}$$

where $p(\bm{s}_{1}^{j})$ is a diagonal Gaussian, and the parameter $\sigma_{c}^{2}$ controls the strength of the prior. Note that the term $\bm{z}(t_{[b]},\bm{x}_{j};t_{[b-1]},\bm{s}_{b-1},\theta_{\text{dyn}})$ in Equation 13 equals the ODE forward solution $\mathrm{ODESolve}(t_{[b]};t_{[b-1]},\bm{s}_{b-1},\theta_{\text{dyn}})$ at grid location $\bm{x}_{j}$.

Finally, we define our generative model in terms of the following sampling procedure:

$$\theta_{\text{dyn}},\theta_{\text{dec}},\bm{s}_{1:B}\sim p(\theta_{\text{dyn}})\,p(\theta_{\text{dec}})\,p(\bm{s}_{1:B}|\theta_{\text{dyn}}), \tag{14}$$

$$\bm{z}(t_{i})=\bm{z}(t_{i};t_{[b]},\bm{s}_{b},\theta_{\text{dyn}}),\qquad b\in\{1,\dots,B\},\ i\in\mathcal{I}_{b}, \tag{15}$$

$$\bm{u}_{i}^{j}\sim p\big(\bm{u}_{i}^{j}|g_{\theta_{\text{dec}}}(\bm{z}(t_{i},\bm{x}_{j}))\big),\qquad i=1,\dots,M,\ j=1,\dots,N, \tag{16}$$

with the following joint distribution (see Appendix A for details about the model specification):

$$p(\bm{u}_{1:M},\bm{s}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})=\prod_{b=1}^{B}\prod_{i\in\mathcal{I}_{b}}\prod_{j=1}^{N}\left[p(\bm{u}_{i}^{j}|\bm{s}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})\right]p(\bm{s}_{1:B}|\theta_{\text{dyn}})\,p(\theta_{\text{dyn}})\,p(\theta_{\text{dec}}). \tag{17}$$

4 Parameter Inference, Encoder, and Forecasting

4.1 Amortized variational inference

We approximate the true posterior over the model parameters and initial states, $p(\bm{s}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}}|\bm{u}_{1:M})$, using variational inference (Blei et al., 2017) with the following approximate posterior:

$$q(\theta_{\text{dyn}},\theta_{\text{dec}},\bm{s}_{1:B})=q(\theta_{\text{dyn}})q(\theta_{\text{dec}})q(\bm{s}_{1:B})=q_{\bm{\psi}_{\text{dyn}}}(\theta_{\text{dyn}})\,q_{\bm{\psi}_{\text{dec}}}(\theta_{\text{dec}})\prod_{b=1}^{B}\prod_{j=1}^{N}q_{\bm{\psi}_{b}^{j}}(\bm{s}_{b}^{j}), \tag{18}$$

where $q_{\bm{\psi}_{\text{dyn}}}$, $q_{\bm{\psi}_{\text{dec}}}$ and $q_{\bm{\psi}_{b}^{j}}$ are diagonal Gaussians, and $\bm{\psi}_{\text{dyn}}$, $\bm{\psi}_{\text{dec}}$ and $\bm{\psi}_{b}^{j}$ are variational parameters. To avoid direct optimization over the local variational parameters $\bm{\psi}_{b}^{j}$, we use amortized variational inference (Kingma and Welling, 2013) and train an encoder $h_{\theta_{\text{enc}}}$ with parameters $\theta_{\text{enc}}$ which maps observations $\bm{u}_{1:M}$ to $\bm{\psi}_{b}^{j}$ (see Section 4.2). For brevity, we sometimes omit the dependence of approximate posteriors on variational parameters and simply write, e.g., $q(\bm{s}_{b}^{j})$.

In variational inference the best approximation of the posterior is obtained by minimizing the Kullback-Leibler divergence $\mathrm{KL}\big[q(\theta_{\text{dyn}},\theta_{\text{dec}},\bm{s}_{1:B})\,\big\|\,p(\theta_{\text{dyn}},\theta_{\text{dec}},\bm{s}_{1:B}|\bm{u}_{1:M})\big]$, which is equivalent to maximizing the evidence lower bound (ELBO), defined for our model as:

$$\mathcal{L}=\underbrace{\sum_{b=1}^{B}\sum_{i\in\mathcal{I}_{b}}\sum_{j=1}^{N}\mathbb{E}_{q(\bm{s}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})}\left[\log p(\bm{u}_{i}^{j}|\bm{s}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})\right]}_{\textit{(i)}\text{ observation model}}-\underbrace{\sum_{j=1}^{N}\mathrm{KL}\big[q(\bm{s}_{1}^{j})\,\|\,p(\bm{s}_{1}^{j})\big]}_{\textit{(ii)}\text{ initial state prior}}$$

$$-\underbrace{\sum_{b=2}^{B}\sum_{j=1}^{N}\mathbb{E}_{q(\theta_{\text{dyn}},\bm{s}_{b-1})}\Big[\mathrm{KL}\big[q(\bm{s}_{b}^{j})\,\|\,p(\bm{s}_{b}^{j}|\bm{s}_{b-1},\theta_{\text{dyn}})\big]\Big]}_{\textit{(iii)}\text{ continuity prior}}-\underbrace{\mathrm{KL}\big[q(\theta_{\text{dyn}})\,\|\,p(\theta_{\text{dyn}})\big]}_{\textit{(iv)}\text{ dynamics prior}}-\underbrace{\mathrm{KL}\big[q(\theta_{\text{dec}})\,\|\,p(\theta_{\text{dec}})\big]}_{\textit{(v)}\text{ decoder prior}}.$$

The terms (ii), (iv), and (v) are computed analytically, while terms (i) and (iii) are approximated using Monte Carlo integration for the expectations and numerical ODE solvers for the initial value problems. See Appendices A and B for the approximate posterior details and for the derivation and computation of the ELBO.
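To illustrate how these terms are assembled in practice, below is a single-sample Monte Carlo sketch of the ELBO. All distributions are torch.distributions objects, simulate and decode are placeholders for the ODE solve and the decoder, and shapes are assumed to broadcast; this is a schematic, not the exact training code.

```python
from torch.distributions import Normal, kl_divergence

def elbo_one_sample(q_dyn, q_dec, q_s, p_dyn, p_dec, p_s1,
                    simulate, decode, u_obs, sigma_u, sigma_c):
    """One-sample Monte Carlo estimate of the ELBO (illustrative sketch).
    q_s is a list of B Normal posteriors over the shooting states;
    simulate(s_b, theta_dyn) returns the latent states of block b at its time
    points together with the block endpoint z(t_[b+1]; t_[b], s_b)."""
    theta_dyn, theta_dec = q_dyn.rsample(), q_dec.rsample()
    s = [q.rsample() for q in q_s]

    log_lik = kl_cont = 0.0
    for b in range(len(s)):
        z_block, z_end = simulate(s[b], theta_dyn)
        # (i) observation model term for block b
        log_lik = log_lik + Normal(decode(z_block, theta_dec), sigma_u).log_prob(u_obs[b]).sum()
        # (iii) continuity term: q(s_{b+1}) vs. N(s_{b+1} | z(t_[b+1]; s_b), sigma_c^2 I)
        if b + 1 < len(s):
            kl_cont = kl_cont + kl_divergence(q_s[b + 1], Normal(z_end, sigma_c)).sum()

    kl_s1 = kl_divergence(q_s[0], p_s1).sum()                                          # (ii)
    kl_params = kl_divergence(q_dyn, p_dyn).sum() + kl_divergence(q_dec, p_dec).sum()  # (iv), (v)
    return log_lik - kl_s1 - kl_cont - kl_params
```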

4.2 Encoder

Figure 5: The spatiotemporal neighborhood of a multiple shooting time point $t_{[b]}=t_{i}$ and location $\bm{x}_{j}$, $\bm{u}[t_{[b]},\bm{x}_{j}]$ (denoted by green, blue and orange crosses and the dots inside), is mapped to the variational parameters $\bm{\psi}_{b}^{j}$ via the encoder.

Here we describe our encoder which maps observations $\bm{u}_{1:M}$ to the local variational parameters $\bm{\psi}_{b}^{j}$ required to sample the initial latent state of sub-trajectory $b$ at time point $t_{[b]}$ and observation location $\bm{x}_{j}$. Similarly to our model, the encoder should be data-efficient and grid-independent.

Similarly to our model (Section 3.1), we enable grid-independence by making the encoder operate on spatial interpolants of the observations $\bm{u}_{1:M}$ (even if they are noisy):

$$\bm{u}_{i}(\bm{x}):=\mathrm{Interpolate}(\bm{u}_{i})(\bm{x}),\qquad i=1,\dots,M, \tag{19}$$

where spatial interpolation is done separately for each time point $i$. We then use the interpolants $\bm{u}_{i}(\bm{x})$ to define the spatial neighborhoods $\mathcal{N}_{\text{S}}(\bm{x})$ in a grid-independent manner.

To improve data-efficiency, we assume $\bm{\psi}_{b}^{j}$ does not depend on the whole observed sequence $\bm{u}_{1:M}$, but only on some local information in a spatiotemporal neighborhood of $t_{[b]}$ and $\bm{x}_{j}$. We define the temporal neighborhood of $t_{[b]}$ as

$$\mathcal{N}_{\text{T}}(t_{[b]}):=\{k:|t_{k}-t_{[b]}|\leq\delta_{T},\ k=1,\dots,M\}, \tag{20}$$

where $\delta_{T}$ is a hyperparameter controlling the neighborhood size, and then define the spatiotemporal neighborhood of $t_{[b]}$ and $\bm{x}_{j}$ as

$$\bm{u}[t_{[b]},\bm{x}_{j}]:=\{\bm{u}_{k}(\bm{x}):k\in\mathcal{N}_{\text{T}}(t_{[b]}),\,\bm{x}\in\mathcal{N}_{\text{S}}(\bm{x}_{j})\}. \tag{21}$$

Our encoder operates on such spatiotemporal neighborhoods $\bm{u}[t_{[b]},\bm{x}_{j}]$ and works in three steps (see Figure 5). First, for each time index $k\in\mathcal{N}_{\text{T}}(t_{[b]})$ it aggregates the spatial information $\{\bm{u}_{k}(\bm{x})\}_{\bm{x}\in\mathcal{N}_{\text{S}}(\bm{x}_{j})}$ into a vector $\alpha_{k}^{\text{S}}$. Then, it aggregates the spatial representations $\alpha_{k}^{\text{S}}$ across time into another vector $\alpha_{[b]}^{\text{T}}$, which is finally mapped to the variational parameters $\bm{\psi}_{b}^{j}$ as follows:

$$\bm{\psi}_{b}^{j}=h_{\theta_{\text{enc}}}(\bm{u}[t_{[b]},\bm{x}_{j}])=h_{\text{read}}(h_{\text{temporal}}(h_{\text{spatial}}(\bm{u}[t_{[b]},\bm{x}_{j}]))). \tag{22}$$

Spatial aggregation.

Since the spatial neighborhoods are fixed and remain identical for all spatial locations (see Figure 5), we implement the spatial aggregation function $h_{\text{spatial}}$ as an MLP which takes the elements of the set $\{\bm{u}_{k}(\bm{x})\}_{\bm{x}\in\mathcal{N}_{\text{S}}(\bm{x}_{j})}$, stacked in a fixed order, as the input.

Temporal aggregation.

We implement $h_{\text{temporal}}$ as a stack of transformer layers (Vaswani et al., 2017), which allows it to operate on input sets of arbitrary size. We use time-aware attention and continuous relative positional encodings (Iakovlev et al., 2023), which were shown to be effective on data from dynamical systems observed at irregular time intervals. Each transformer layer takes a layer-specific input set $\{\xi_{k}^{\text{in}}\}_{k\in\mathcal{N}_{\text{T}}(t_{[b]})}$, where $\xi_{k}^{\text{in}}$ is located at $t_{k}$, and maps it to an output set $\{\xi_{k}^{\text{out}}\}_{k\in\mathcal{N}_{\text{T}}(t_{[b]})}$, where each $\xi_{k}^{\text{out}}$ is computed using only the input elements within distance $\delta_{T}$ from $t_{k}$, thus promoting temporal locality. Furthermore, instead of using absolute positional encodings, the model assumes the behavior of the system does not depend on time and uses relative temporal distances to inject positional information. The first layer takes $\{\alpha_{k}^{\text{S}}\}_{k\in\mathcal{N}_{\text{T}}(t_{[b]})}$ as the input, while the last layer returns a single element at time point $t_{[b]}$, which represents the temporal aggregation $\alpha_{[b]}^{\text{T}}$.

Variational parameter readout.

Since $\alpha_{[b]}^{\text{T}}$ is a fixed-length vector, we implement $h_{\text{read}}$ as an MLP.
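A schematic of the three-step encoder is sketched below. For brevity the time-aware transformer layers with relative positional encodings are replaced by a vanilla TransformerEncoder, and all sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of h_theta_enc: spatiotemporal neighborhood u[t_[b], x_j] ->
    variational parameters psi_b^j = (gamma, log tau)."""
    def __init__(self, obs_dim, n_nbrs, latent_dim, width=128):
        super().__init__()
        self.h_spatial = nn.Sequential(              # spatial aggregation
            nn.Linear(obs_dim * n_nbrs, width), nn.ReLU(), nn.Linear(width, width))
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=4, batch_first=True)
        self.h_temporal = nn.TransformerEncoder(layer, num_layers=2)   # temporal aggregation
        self.h_read = nn.Linear(width, 2 * latent_dim)                 # readout

    def forward(self, u_nbr):
        # u_nbr: (T, n_nbrs, obs_dim), observations interpolated at the spatial
        # neighborhood of x_j for the T time points in N_T(t_[b]).
        alpha_s = self.h_spatial(u_nbr.flatten(start_dim=1))           # (T, width)
        # Simplification: take the last output as the aggregate at t_[b].
        alpha_t = self.h_temporal(alpha_s.unsqueeze(0))[0, -1]         # (width,)
        gamma, log_tau = self.h_read(alpha_t).chunk(2, dim=-1)
        return gamma, log_tau
```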

4.3 Forecasting

Given initial observations $\tilde{\bm{u}}_{1:m}$ at time points $t_{1:m}$, we predict the future observation $\tilde{\bm{u}}_{n}$ at a time point $t_{n}>t_{m}$ as the expected value of the approximate posterior predictive distribution:

$$p(\tilde{\bm{u}}_{n}|\tilde{\bm{u}}_{1:m},\bm{u}_{1:M})\approx\int p(\tilde{\bm{u}}_{n}|\tilde{\bm{s}}_{m},\theta_{\text{dyn}},\theta_{\text{dec}})\,q(\tilde{\bm{s}}_{m})\,q(\theta_{\text{dyn}})\,q(\theta_{\text{dec}})\,d\tilde{\bm{s}}_{m}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}. \tag{23}$$

The expected value is estimated via Monte Carlo integration (see Appendix C.4 for details).
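A sketch of this Monte Carlo estimate (encode, simulate, and decode are illustrative placeholders for the encoder, the latent ODE solve, and the decoder):

```python
import torch

@torch.no_grad()
def forecast_mean(encode, simulate, decode, q_dyn, q_dec,
                  u_init, t_init, t_target, n_samples=10):
    """Approximate E[u_n | u_1:m] by sampling the approximate posterior,
    rolling the latent state forward, and decoding (illustrative sketch)."""
    preds = []
    for _ in range(n_samples):
        theta_dyn, theta_dec = q_dyn.sample(), q_dec.sample()
        s_m = encode(u_init, t_init).sample()        # latent state at the last observed time
        z_n = simulate(s_m, t_init[-1], t_target, theta_dyn)
        preds.append(decode(z_n, theta_dec))
    return torch.stack(preds).mean(dim=0)
```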

5 Experiments

Figure 6: Top: the Shallow Water dataset contains observations of the wave height in a pool of water. Middle: the Navier-Stokes dataset contains observations of the concentration of a species transported in a fluid via buoyancy and a velocity field. Bottom: the Scalar Flow dataset contains observations of smoke plumes rising in warm air.

We use three challenging datasets: Shallow Water, Navier-Stokes, and Scalar Flow, which contain observations of a spatiotemporal system at $N\approx 1100$ grid points evolving over time (see Figure 6). The first two datasets are synthetic and generated using numerical PDE solvers (we use scikit-fdiff (Cellier, 2019) for Shallow Water, and PhiFlow (Holl et al., 2020) for Navier-Stokes), while the third dataset contains real-world observations (camera images) of smoke plumes rising in warm air (Eckert et al., 2019). In all cases the observations are made at irregular spatiotemporal grids and contain only partial information about the true system state. In particular, for Shallow Water we observe only the wave height, for Navier-Stokes we observe only the concentration of the species, and for Scalar Flow only pixel densities are known. All datasets contain 60/20/20 training/validation/testing trajectories. See Appendix C for details.

We train our model for 20k iterations with a constant learning rate of $3\mathrm{e}{-4}$ and linear warmup. The latent spatiotemporal dynamics are simulated using differentiable ODE solvers from the torchdiffeq package (Chen, 2018) (we use dopri5 with rtol=$1\mathrm{e}{-3}$, atol=$1\mathrm{e}{-4}$, no adjoint). Training is done on a single NVIDIA Tesla V100 GPU, with a single run taking 3-4 hours. We use the mean absolute error (MAE) on the test set as the performance measure. Error bars are standard errors over 4 random seeds. For forecasting we use the expected value of the posterior predictive distribution. See Appendix C for all details about the training, validation, and testing setup.

Latent state dimension.

Here we show the advantage of using latent-space models on partially observed data. We change the latent state dimension $d$ from 1 to 5 and measure the test MAE. Note that for $d=1$ we effectively have a data-space model which models the observations without trying to reconstruct the missing states. Figure 7 shows that in all cases the performance improves as the latent dimension grows. For Shallow Water and Navier-Stokes the true latent dimension is 3. Since Scalar Flow is a real-world process, there is no true latent dimension. As a benchmark, we provide the performance of our model trained on fully-observed versions of the synthetic datasets (we use the same architecture and hyperparameters, but fix $d$ to 3). Figure 7 also shows examples of model predictions (at the final time point) for different values of $d$. We see a huge difference between $d=1$ and $d=3,5$. Note how an apparently small difference in MAE between $d=1$ and $d=5$ for Scalar Flow corresponds to a dramatic improvement in prediction quality.

Figure 7: Left: Test MAE vs. latent state dimension $d$. Black lines are test MAE on fully-observed versions of the datasets ($\pm$ standard error). Right: Model predictions for different $d$.
Figure 8: Predictions on spatial grids of different density (linear interpolant, test data).

Grid independence.

In this experiment we demonstrate the grid-independence property of our model by training it on grids with $\approx 1100$ observation locations, and then testing on a coarser grid, the original grid, and a finer grid. We evaluate the effect of using different interpolation methods by repeating the experiment with linear, k-nearest-neighbors, and inverse distance weighting (IDW) interpolants. For Shallow Water and Navier-Stokes the coarser/finer grids contain 290/4200 nodes, while for Scalar Flow they contain 560/6420 nodes, respectively. Table 1 shows the model's performance for different spatial grids and interpolation methods. We see that all interpolation methods perform rather similarly on the original grid, but linear interpolation and IDW tend to perform better on finer/coarser grids than k-NN. The performance drop on coarse grids is expected since we get less accurate information about the system's initial state and simulate the dynamics on a coarser grid. Figure 8 also shows examples of model predictions (at the final time point) for different grid sizes and the linear interpolant.

Table 1: Test MAE for different interpolation methods.
Dataset | Grid | k-NN | Linear | IDW
Shallow Water | Coarser | 0.046 ± 0.002 | 0.034 ± 0.001 | 0.038 ± 0.002
Shallow Water | Original | 0.017 ± 0.002 | 0.016 ± 0.002 | 0.017 ± 0.003
Shallow Water | Finer | 0.031 ± 0.003 | 0.017 ± 0.003 | 0.030 ± 0.002
Navier-Stokes | Coarser | 0.087 ± 0.006 | 0.069 ± 0.009 | 0.066 ± 0.006
Navier-Stokes | Original | 0.048 ± 0.009 | 0.041 ± 0.003 | 0.045 ± 0.010
Navier-Stokes | Finer | 0.054 ± 0.009 | 0.044 ± 0.004 | 0.049 ± 0.002
Scalar Flow | Coarser | 0.041 ± 0.021 | 0.032 ± 0.009 | 0.035 ± 0.012
Scalar Flow | Original | 0.019 ± 0.001 | 0.018 ± 0.000 | 0.018 ± 0.001
Scalar Flow | Finer | 0.040 ± 0.016 | 0.026 ± 0.006 | 0.028 ± 0.007

Comparison to other models.

We test our model against recent spatiotemporal models from the literature: Finite Element Networks (FEN) (Lienen and Günnemann, 2022), Neural Stochastic PDEs (NSPDE) (Salvi et al., 2021), MAgNet (Boussif et al., 2022), and DINo (Yin et al., 2023). We also use Neural ODEs (NODE) (Chen et al., 2018) as a baseline. We use the official implementations of all models and tune their hyperparameters for the best performance (see App. C for details). For Shallow Water and Navier-Stokes we use the first 5 time points to infer the latent state and then predict the next 20 time points, while for Scalar Flow we use the first 10 points for inference and predict the next 10 points. For synthetic data, we consider two settings: one where the data is fully observed (i.e., the complete state is recorded) – a setting for which most models are designed – and one where the data is partially observed (i.e., only part of the full state is given, as discussed at the beginning of this section). The results are shown in Table 2. We see that some of the baseline models achieve reasonably good results on the fully-observed datasets, but they fail on partially-observed data, while our model maintains strong performance in all cases. Apart from the fully-observed Shallow Water dataset, where FEN performs slightly better, our method outperforms the other methods on all datasets by a clear margin. See Appendix C for hyperparameter details. In Appendix E we demonstrate our model's capability to learn dynamics from noisy data. In Appendix F we show model predictions on different datasets.

Table 2: Test MAE for different models.
Model | Shallow Water (Full) | Shallow Water (Partial) | Navier-Stokes (Full) | Navier-Stokes (Partial) | Scalar Flow
NODE | 0.036 ± 0.000 | 0.084 ± 0.001 | 0.053 ± 0.001 | 0.109 ± 0.001 | 0.056 ± 0.001
FEN | **0.011 ± 0.002** | 0.064 ± 0.005 | 0.031 ± 0.001 | 0.108 ± 0.002 | 0.062 ± 0.005
NSPDE | 0.019 ± 0.002 | 0.033 ± 0.001 | 0.042 ± 0.004 | 0.075 ± 0.002 | 0.059 ± 0.002
DINo | 0.027 ± 0.001 | 0.063 ± 0.003 | 0.047 ± 0.001 | 0.113 ± 0.002 | 0.059 ± 0.001
MAgNet | NA | 0.061 ± 0.001 | NA | 0.103 ± 0.003 | 0.056 ± 0.003
Ours | 0.014 ± 0.002 | **0.016 ± 0.002** | **0.024 ± 0.003** | **0.041 ± 0.003** | **0.042 ± 0.001**

6 Related Work

Closest to our work is Ayed et al. (2022), who considered the problem of learning PDEs from partial observations and proposed a discrete and grid-dependent model that is restricted to regular spatiotemporal grids. Another related work is that of Nguyen et al. (2020), who proposed a variational inference framework for learning ODEs from noisy and partially-observed data. However, they consider only low-dimensional ODEs and are restricted to regular grids.

Other works have considered learning latent-space PDE dynamics using the “encode-process-decode” approach. Pfaff et al. (2021) use a GNN-based encoder and dynamics function, map the observations to the same spatial grid in the latent space, and learn the latent-space dynamics. Sanchez et al. (2020) use a similar approach but with CNNs, mapping the observations to a coarser latent grid and learning the coarse-scale dynamics. Wu et al. (2022) use CNNs to map observations to a low-dimensional latent vector and learn the latent dynamics. However, all these approaches are grid-dependent, limited to regular spatial/temporal grids, and require fully-observed data.

Interpolation has been used in numerous studies for various applications. Works such as (Alet et al., 2019; Jiang et al., 2020; Han et al., 2022) use interpolation to map latent states on coarse grids to observations on finer grids. Hua et al. (2022) used interpolation as a post-processing step to obtain continuous predictions, while Boussif et al. (2022) used it to recover observations at missing nodes.

Another approach for achieving grid-independence was presented in neural operators (Li et al., 2021; Lu et al., 2021), which learn a mapping between infinite-dimensional function spaces and represent the mapping in a grid-independent manner.

7 Conclusion

We proposed a novel space-time continuous, grid-independent model for learning PDE dynamics from noisy and partial observations on irregular spatiotemporal grids. Our contributions include an efficient generative modeling framework, a novel latent PDE model merging collocation and the method of lines, and a data-efficient, grid-independent encoder design. The model demonstrates state-of-the-art performance on complex datasets, highlighting its potential for advancing data-driven PDE modeling and enabling accurate predictions of spatiotemporal phenomena in diverse fields. However, our model and encoder operate on every spatial and temporal location, which might not be the most efficient approach and hinders scaling to extremely large grids; hence, research into more efficient latent state extraction and dynamics modeling methods is needed.

References

  • Alet et al. (2019) Ferran Alet, Adarsh Keshav Jeewajee, Maria Bauza Villalonga, Alberto Rodriguez, Tomas Lozano-Perez, and Leslie Kaelbling. Graph element networks: adaptive, structured computation and memory. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 212–222. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/alet19a.html.
  • Ayed et al. (2022) Ibrahim Ayed, Emmanuel de Bézenac, Arthur Pajot, and Patrick Gallinari. Modelling spatiotemporal dynamics from earth observation data with neural differential equations. Machine Learning, 111(6):2349–2380, 2022. doi: 10.1007/s10994-022-06139-2. URL https://doi.org/10.1007/s10994-022-06139-2.
  • Blei et al. (2017) David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, Apr 2017. ISSN 1537-274X. doi: 10.1080/01621459.2017.1285773. URL http://dx.doi.org/10.1080/01621459.2017.1285773.
  • Bock and Plitt (1984) H.G. Bock and K.J. Plitt. A multiple shooting algorithm for direct solution of optimal control problems*. IFAC Proceedings Volumes, 17(2):1603–1608, 1984. ISSN 1474-6670. doi: https://doi.org/10.1016/S1474-6670(17)61205-9. URL https://www.sciencedirect.com/science/article/pii/S1474667017612059. 9th IFAC World Congress: A Bridge Between Control Science and Technology, Budapest, Hungary, 2-6 July 1984.
  • Boussif et al. (2022) Oussama Boussif, Yoshua Bengio, Loubna Benabbou, and Dan Assouline. MAgnet: Mesh agnostic neural PDE solver. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=bx2roi8hca8.
  • Brandstetter et al. (2022) Johannes Brandstetter, Daniel E. Worrall, and Max Welling. Message passing neural PDE solvers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=vSix3HPYKSU.
  • Brunton and Kutz (2019) Steven L. Brunton and J. Nathan Kutz. Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control. Cambridge University Press, 2019. doi: 10.1017/9781108380690.
  • Cellier (2019) Nicolas Cellier. scikit-fdiff, 2019. URL https://gitlab.com/celliern/scikit-fdiff.
  • Chen (2018) Ricky T. Q. Chen. torchdiffeq, 2018. URL https://github.com/rtqichen/torchdiffeq.
  • Chen et al. (2018) Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations, 2018.
  • Cheng (2009) Alexander H.-D. Cheng. Radial basis function collocation method. In Computational Mechanics, pages 219–219, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg. ISBN 978-3-540-75999-7.
  • Cressie and Wikle (2011) N. Cressie and C. K. Wikle. Statistics for Spatio-Temporal Data. Wiley, 2011. ISBN 9780471692744. URL https://books.google.fi/books?id=-kOC6D0DiNYC.
  • Eckert et al. (2019) Marie-Lena Eckert, Kiwon Um, and Nils Thuerey. Scalarflow: A large-scale volumetric data set of real-world scalar transport flows for computer animation and machine learning. ACM Transactions on Graphics, 38(6):239, 2019.
  • Evans (2010) L. C. Evans. Partial Differential Equations. American Mathematical Society, 2010. ISBN 9780821849743.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL https://proceedings.mlr.press/v9/glorot10a.html.
  • Hamdi et al. (2007) S. Hamdi, W. E. Schiesser, and G. W Griffiths. Method of lines. Scholarpedia, 2(7):2859, 2007. doi: 10.4249/scholarpedia.2859. revision #26511.
  • Han et al. (2022) Xu Han, Han Gao, Tobias Pfaff, Jian-Xun Wang, and Liping Liu. Predicting physics in mesh-reduced space with temporal attention. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=XctLdNfCmP.
  • Hegde et al. (2022) Pashupati Hegde, Cagatay Yildiz, Harri Lähdesmäki, Samuel Kaski, and Markus Heinonen. Variational multiple shooting for bayesian ODEs with gaussian processes. In The 38th Conference on Uncertainty in Artificial Intelligence, 2022. URL https://openreview.net/forum?id=r2NuhIUoceq.
  • Hirsch (2007) Charles Hirsch. Numerical computation of internal and external flows: The fundamentals of computational fluid dynamics. Elsevier, 2007.
  • Holl et al. (2020) Philipp Holl, Vladlen Koltun, Kiwon Um, and Nils Thuerey. phiflow: A differentiable pde solving framework for deep learning via physical simulations. In NeurIPS Workshop on Differentiable vision, graphics, and physics applied to machine learning, 2020. URL https://montrealrobotics.ca/diffcvgp/assets/papers/3.pdf.
  • Hua et al. (2022) Chuanbo Hua, Federico Berto, Michael Poli, Stefano Massaroli, and Jinkyoo Park. Efficient continuous spatio-temporal simulation with graph spline networks. In ICML 2022 2nd AI for Science Workshop, 2022. URL https://openreview.net/forum?id=PBT0Vftuji.
  • Iakovlev et al. (2023) Valerii Iakovlev, Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Latent neural ODEs with sparse bayesian multiple shooting. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=moIlFZfj_1b.
  • Jiang et al. (2020) Chiyu Max Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa Mustafa, Hamdi A. Tchelepi, Philip Marcus, Mr Prabhat, and Anima Anandkumar. Meshfreeflownet: A physics-constrained deep continuous space-time super-resolution framework. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2020. doi: 10.1109/sc41405.2020.00013. URL http://dx.doi.org/10.1109/SC41405.2020.00013.
  • Jordana et al. (2021) Armand Jordana, Justin Carpentier, and Ludovic Righetti. Learning dynamical systems from noisy sensor measurements using multiple shooting, 2021.
  • Kansa (1990) E.J. Kansa. Multiquadrics—a scattered data approximation scheme with applications to computational fluid-dynamics—ii solutions to parabolic, hyperbolic and elliptic partial differential equations. Computers and Mathematics with Applications, 19(8):147–161, 1990. ISSN 0898-1221. doi: https://doi.org/10.1016/0898-1221(90)90271-K. URL https://www.sciencedirect.com/science/article/pii/089812219090271K.
  • Kingma and Ba (2017) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2013.
  • Kochkov et al. (2021) Dmitrii Kochkov, Jamie A. Smith, Ayya Alieva, Qing Wang, Michael P. Brenner, and Stephan Hoyer. Machine learning–accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences, 118(21):e2101784118, 2021. doi: 10.1073/pnas.2101784118. URL https://www.pnas.org/doi/abs/10.1073/pnas.2101784118.
  • Li et al. (2021) Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Burigede liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=c8P9NQVtmnO.
  • Lienen and Günnemann (2022) Marten Lienen and Stephan Günnemann. Learning the dynamics of physical systems from sparse observations with finite element networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=HFmAukZ-k-2.
  • Long et al. (2018) Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. PDE-net: Learning PDEs from data. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3208–3216. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/long18a.html.
  • Lu et al. (2021) Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, Mar 2021. ISSN 2522-5839. doi: 10.1038/s42256-021-00302-5. URL http://dx.doi.org/10.1038/s42256-021-00302-5.
  • Metz et al. (2021) Luke Metz, C. Daniel Freeman, Samuel S. Schoenholz, and Tal Kachman. Gradients are not all you need, 2021.
  • Murray (2002) James D. Murray. Mathematical Biology I. An Introduction, volume 17 of Interdisciplinary Applied Mathematics. Springer, New York, 3 edition, 2002. doi: 10.1007/b98868.
  • Nguyen et al. (2020) Duong Nguyen, Said Ouala, Lucas Drumetz, and Ronan Fablet. Variational deep learning for the identification and reconstruction of chaotic and stochastic dynamical systems from noisy and partial observations. ArXiv, abs/2009.02296, 2020.
  • Pfaff et al. (2021) Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter Battaglia. Learning mesh-based simulation with graph networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=roNqYL0_XP.
  • Poli et al. (2022) Michael Poli, Stefano Massaroli, Federico Berto, Jinkyoo Park, Tri Dao, Christopher Re, and Stefano Ermon. Transform once: Efficient operator learning in frequency domain. In ICML 2022 2nd AI for Science Workshop, 2022. URL https://openreview.net/forum?id=x1fNT5yj41N.
  • Ribeiro et al. (2020) Antônio H. Ribeiro, Koen Tiels, Jack Umenberger, Thomas B. Schön, and Luis A. Aguirre. On the smoothness of nonlinear system identification. Automatica, 121:109158, 2020. ISSN 0005-1098. doi: https://doi.org/10.1016/j.automatica.2020.109158. URL https://www.sciencedirect.com/science/article/pii/S0005109820303563.
  • Salvi et al. (2021) Cristopher Salvi, Maud Lemercier, and Andris Gerasimovics. Neural stochastic pdes: Resolution-invariant learning of continuous spatiotemporal dynamics, 2021.
  • Sanchez et al. (2020) Alvaro Sanchez, Dmitrii Kochkov, Jamie Alexander Smith, Michael Brenner, Peter Battaglia, and Tobias Joachim Pfaff. Learning latent field dynamics of PDEs. Machine Learning and the Physical Sciences Workshop, 2020.
  • Schiesser (1991) William E. Schiesser. The numerical method of lines: Integration of partial differential equations. 1991.
  • Seo et al. (2020) Sungyong Seo, Chuizheng Meng, and Yan Liu. Physics-aware difference graph networks for sparsely-observed dynamics. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1gelyrtwH.
  • Um et al. (2020) Kiwon Um, Robert Brand, Yun Fei, Philipp Holl, and Nils Thuerey. Solver-in-the-Loop: Learning from Differentiable Physics to Interact with Iterative PDE-Solvers. Advances in Neural Information Processing Systems, 2020.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  • Voss et al. (2004) Henning Voss, J. Timmer, and Juergen Kurths. Nonlinear dynamical system identification from uncertain and indirect measurements. International Journal of Bifurcation and Chaos, 14, 01 2004.
  • Wu et al. (2022) Tailin Wu, Takashi Maruyama, and Jure Leskovec. Learning to accelerate partial differential equations via latent global evolution. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=xvZtgp5wyYT.
  • Yildiz et al. (2019) Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Ode2vae: Deep generative second order odes with bayesian neural networks. ArXiv, abs/1905.10994, 2019.
  • Yin et al. (2023) Yuan Yin, Matthieu Kirchmeyer, Jean-Yves Franceschi, Alain Rakotomamonjy, and patrick gallinari. Continuous PDE dynamics forecasting with implicit neural representations. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=B73niNjbPs.

Appendix A

A.1 Model specification.

Here we provide all details about our model specification. The joint distribution for our model is

$$p(\bm{u}_{1:M},\bm{s}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})=p(\bm{u}_{1:M}|\bm{s}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})\,p(\bm{s}_{1:B}|\theta_{\text{dyn}})\,p(\theta_{\text{dyn}})\,p(\theta_{\text{dec}}). \tag{24}$$

Next, we specify each component in detail.

Parameter priors.

The parameter priors are isotropic zero-mean multivariate normal distributions:

$$p(\theta_{\text{dyn}})=\mathcal{N}(\theta_{\text{dyn}}|\mathbf{0},I), \tag{25}$$

$$p(\theta_{\text{dec}})=\mathcal{N}(\theta_{\text{dec}}|\mathbf{0},I), \tag{26}$$

where $\mathcal{N}$ is the normal distribution, $\mathbf{0}$ is a zero vector, and $I$ is the identity matrix; both have the appropriate dimensionality, dependent on the number of dynamics and decoder parameters.

Continuity prior.

We define the continuity prior as

$$p(\bm{s}_{1:B}|\theta_{\text{dyn}})=p(\bm{s}_{1})\prod_{b=2}^{B}p(\bm{s}_{b}|\bm{s}_{b-1},\theta_{\text{dyn}}) \tag{27}$$

$$=\left[\prod_{j=1}^{N}p(\bm{s}_{1}^{j})\right]\left[\prod_{b=2}^{B}\prod_{j=1}^{N}p(\bm{s}_{b}^{j}|\bm{s}_{b-1},\theta_{\text{dyn}})\right] \tag{28}$$

$$=\left[\prod_{j=1}^{N}\mathcal{N}(\bm{s}_{1}^{j}|\mathbf{0},I)\right]\left[\prod_{b=2}^{B}\prod_{j=1}^{N}\mathcal{N}\!\left(\bm{s}_{b}^{j}\,\middle|\,\bm{z}(t_{[b]},\bm{x}_{j};t_{[b-1]},\bm{s}_{b-1},\theta_{\text{dyn}}),\,\sigma_{c}^{2}I\right)\right], \tag{29}$$

where $\mathcal{N}$ is the normal distribution, $\mathbf{0}\in\mathbb{R}^{d}$ is a zero vector, $I\in\mathbb{R}^{d\times d}$ is the identity matrix, and $\sigma_{c}\in\mathbb{R}$ is the parameter controlling the strength of the prior. Smaller values of $\sigma_{c}$ tend to produce smaller gaps between the sub-trajectories.

Observation model.

$$p(\bm{u}_{1:M}|\bm{s}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})=\prod_{b=1}^{B}\prod_{i\in\mathcal{I}_{b}}\prod_{j=1}^{N}p(\bm{u}_{i}^{j}|\bm{s}_{b},\theta_{\text{dyn}},\theta_{\text{dec}}) \tag{30}$$

$$=\prod_{b=1}^{B}\prod_{i\in\mathcal{I}_{b}}\prod_{j=1}^{N}p\big(\bm{u}_{i}^{j}|g_{\theta_{\text{dec}}}(\bm{z}(t_{i},\bm{x}_{j};t_{[b]},\bm{s}_{b},\theta_{\text{dyn}}))\big) \tag{31}$$

$$=\prod_{b=1}^{B}\prod_{i\in\mathcal{I}_{b}}\prod_{j=1}^{N}\mathcal{N}\big(\bm{u}_{i}^{j}|g_{\theta_{\text{dec}}}(\bm{z}(t_{i},\bm{x}_{j};t_{[b]},\bm{s}_{b},\theta_{\text{dyn}})),\,\sigma_{u}^{2}I\big), \tag{32}$$

where $\mathcal{N}$ is the normal distribution, $\sigma_{u}^{2}$ is the observation noise variance, and $I\in\mathbb{R}^{D\times D}$ is the identity matrix. Note again that $\bm{z}(t_{i},\bm{x}_{j};t_{[b]},\bm{s}_{b},\theta_{\text{dyn}})$ above equals the ODE forward solution $\mathrm{ODESolve}(t_{i};t_{[b]},\bm{s}_{b},\theta_{\text{dyn}})$ at grid location $\bm{x}_{j}$.

A.2 Approximate posterior specification.

Here we provide all details about the approximate posterior. We define the approximate posterior as

$$q(\theta_{\text{dyn}},\theta_{\text{dec}},\bm{s}_{1:B})=q(\theta_{\text{dyn}})q(\theta_{\text{dec}})q(\bm{s}_{1:B})=q_{\bm{\psi}_{\text{dyn}}}(\theta_{\text{dyn}})\,q_{\bm{\psi}_{\text{dec}}}(\theta_{\text{dec}})\prod_{b=1}^{B}\prod_{j=1}^{N}q_{\bm{\psi}_{b}^{j}}(\bm{s}_{b}^{j}). \tag{33}$$

Next, we specify each component in detail.

Dynamics parameters posterior.

We define $q_{\bm{\psi}_{\text{dyn}}}(\theta_{\text{dyn}})$ as

$$q_{\bm{\psi}_{\text{dyn}}}(\theta_{\text{dyn}})=\mathcal{N}(\theta_{\text{dyn}}|\bm{\gamma}_{\text{dyn}},\mathrm{diag}(\bm{\tau}_{\text{dyn}}^{2})), \tag{34}$$

where $\bm{\gamma}_{\text{dyn}}$ and $\bm{\tau}_{\text{dyn}}^{2}$ are vectors of the appropriate dimension (dependent on the number of dynamics parameters), and $\mathrm{diag}(\bm{\tau}_{\text{dyn}}^{2})$ is a matrix with $\bm{\tau}_{\text{dyn}}^{2}$ on the diagonal. We define the vector of variational parameters as $\bm{\psi}_{\text{dyn}}=(\bm{\gamma}_{\text{dyn}},\bm{\tau}_{\text{dyn}}^{2})$. We optimize directly over $\bm{\psi}_{\text{dyn}}$ and initialize $\bm{\gamma}_{\text{dyn}}$ using Xavier (Glorot and Bengio, 2010) initialization, while $\bm{\tau}_{\text{dyn}}$ is initialized with each element equal to $9\cdot 10^{-4}$.

Decoder parameters posterior.

We define $q_{\bm{\psi}_{\text{dec}}}(\theta_{\text{dec}})$ as

\displaystyle q_{\bm{\psi}_{\text{dec}}}(\theta_{\text{dec}})=\mathcal{N}(\theta_{\text{dec}}|\bm{\gamma}_{\text{dec}},\mathrm{diag}(\bm{\tau}_{\text{dec}}^{2})), (35)

where $\bm{\gamma}_{\text{dec}}$ and $\bm{\tau}_{\text{dec}}^{2}$ are vectors of the appropriate dimension (dependent on the number of decoder parameters), and $\mathrm{diag}(\bm{\tau}_{\text{dec}}^{2})$ is a matrix with $\bm{\tau}_{\text{dec}}^{2}$ on the diagonal. We define the vector of variational parameters as $\bm{\psi}_{\text{dec}}=(\bm{\gamma}_{\text{dec}},\bm{\tau}_{\text{dec}}^{2})$. We optimize directly over $\bm{\psi}_{\text{dec}}$ and initialize $\bm{\gamma}_{\text{dec}}$ using Xavier (Glorot and Bengio, 2010) initialization, while $\bm{\tau}_{\text{dec}}$ is initialized with each element equal to $9\cdot 10^{-4}$.

Shooting variables posterior.

We define $q_{\bm{\psi}_{b}^{j}}({\bm{s}}_{b}^{j})$ as

\displaystyle q_{\bm{\psi}_{b}^{j}}({\bm{s}}_{b}^{j})=\mathcal{N}({\bm{s}}_{b}^{j}|\bm{\gamma}_{b}^{j},\mathrm{diag}([\bm{\tau}_{b}^{j}]^{2})), (36)

where the vectors $\bm{\gamma}_{b}^{j},\bm{\tau}_{b}^{j}\in\mathbb{R}^{d}$ are returned by the encoder $h_{\theta_{\text{enc}}}$, and $\mathrm{diag}([\bm{\tau}_{b}^{j}]^{2})$ is a matrix with $[\bm{\tau}_{b}^{j}]^{2}$ on the diagonal. We define the vector of variational parameters as $\bm{\psi}_{b}^{j}=(\bm{\gamma}_{b}^{j},\bm{\tau}_{b}^{j})$. Because the variational inference for the shooting variables is amortized, our model is trained w.r.t. the parameters of the encoder network, $\theta_{\text{enc}}$.

Appendix B

B.1 Derivation of ELBO.

For our model and the choice of the approximate posterior, the ELBO can be written as

\displaystyle\mathcal{L} = \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\frac{p({\bm{u}}_{1:M},{\bm{s}}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})}{q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (37)
\displaystyle= \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\frac{p({\bm{u}}_{1:M}|{\bm{s}}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})\,p({\bm{s}}_{1:B}|\theta_{\text{dyn}})\,p(\theta_{\text{dyn}})\,p(\theta_{\text{dec}})}{q({\bm{s}}_{1:B})\,q(\theta_{\text{dyn}})\,q(\theta_{\text{dec}})}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (38)
\displaystyle= \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln p({\bm{u}}_{1:M}|{\bm{s}}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (39)
\displaystyle\quad-\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\frac{q({\bm{s}}_{1:B})}{p({\bm{s}}_{1:B}|\theta_{\text{dyn}})}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (40)
\displaystyle\quad-\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\frac{q(\theta_{\text{dyn}})}{p(\theta_{\text{dyn}})}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (41)
\displaystyle\quad-\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\frac{q(\theta_{\text{dec}})}{p(\theta_{\text{dec}})}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (42)
\displaystyle= \mathcal{L}_{1}-\mathcal{L}_{2}-\mathcal{L}_{3}-\mathcal{L}_{4}. (43)

Next, we will look at each term $\mathcal{L}_{i}$ separately.

\displaystyle\mathcal{L}_{1} = \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln p({\bm{u}}_{1:M}|{\bm{s}}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (44)
\displaystyle= \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\left[\prod_{b=1}^{B}\prod_{i\in\mathcal{I}_{b}}\prod_{j=1}^{N}p({\bm{u}}_{i}^{j}|{\bm{s}}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (45)
\displaystyle= \sum_{b=1}^{B}\sum_{i\in\mathcal{I}_{b}}\sum_{j=1}^{N}\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\left[p({\bm{u}}_{i}^{j}|{\bm{s}}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (46)
\displaystyle= \sum_{b=1}^{B}\sum_{i\in\mathcal{I}_{b}}\sum_{j=1}^{N}\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{b})\ln\left[p({\bm{u}}_{i}^{j}|{\bm{s}}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{b} (47)
\displaystyle= \sum_{b=1}^{B}\sum_{i\in\mathcal{I}_{b}}\sum_{j=1}^{N}\mathbb{E}_{q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{b})}\ln\left[p({\bm{u}}_{i}^{j}|{\bm{s}}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})\right]. (48)
\displaystyle\mathcal{L}_{2} = \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\frac{q({\bm{s}}_{1:B})}{p({\bm{s}}_{1:B}|\theta_{\text{dyn}})}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (49)
\displaystyle= \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\left[\frac{q({\bm{s}}_{1})}{p({\bm{s}}_{1})}\prod_{b=2}^{B}\frac{q({\bm{s}}_{b})}{p({\bm{s}}_{b}|{\bm{s}}_{b-1},\theta_{\text{dyn}})}\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (50)
\displaystyle= \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\left[\prod_{j=1}^{N}\frac{q({\bm{s}}_{1}^{j})}{p({\bm{s}}_{1}^{j})}\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (51)
\displaystyle\quad+\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\left[\prod_{b=2}^{B}\prod_{j=1}^{N}\frac{q({\bm{s}}_{b}^{j})}{p({\bm{s}}_{b}^{j}|{\bm{s}}_{b-1},\theta_{\text{dyn}})}\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (52)
\displaystyle= \sum_{j=1}^{N}\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\left[\frac{q({\bm{s}}_{1}^{j})}{p({\bm{s}}_{1}^{j})}\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (53)
\displaystyle\quad+\sum_{b=2}^{B}\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\sum_{j=1}^{N}\ln\left[\frac{q({\bm{s}}_{b}^{j})}{p({\bm{s}}_{b}^{j}|{\bm{s}}_{b-1},\theta_{\text{dyn}})}\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (54)
\displaystyle= \sum_{j=1}^{N}\int q({\bm{s}}_{1}^{j})\ln\left[\frac{q({\bm{s}}_{1}^{j})}{p({\bm{s}}_{1}^{j})}\right]d{\bm{s}}_{1}^{j} (55)
\displaystyle\quad+\sum_{b=2}^{B}\int q(\theta_{\text{dyn}},{\bm{s}}_{b-1},{\bm{s}}_{b})\sum_{j=1}^{N}\ln\left[\frac{q({\bm{s}}_{b}^{j})}{p({\bm{s}}_{b}^{j}|{\bm{s}}_{b-1},\theta_{\text{dyn}})}\right]d\theta_{\text{dyn}}\,d{\bm{s}}_{b-1}\,d{\bm{s}}_{b} (56)
\displaystyle= \sum_{j=1}^{N}\int q({\bm{s}}_{1}^{j})\ln\left[\frac{q({\bm{s}}_{1}^{j})}{p({\bm{s}}_{1}^{j})}\right]d{\bm{s}}_{1}^{j} (57)
\displaystyle\quad+\sum_{b=2}^{B}\int q(\theta_{\text{dyn}},{\bm{s}}_{b-1})\sum_{j=1}^{N}\left[\int q({\bm{s}}_{b}^{j})\ln\frac{q({\bm{s}}_{b}^{j})}{p({\bm{s}}_{b}^{j}|{\bm{s}}_{b-1},\theta_{\text{dyn}})}\,d{\bm{s}}_{b}^{j}\right]d\theta_{\text{dyn}}\,d{\bm{s}}_{b-1} (58)
\displaystyle= \sum_{j=1}^{N}\mathrm{KL}\left(q({\bm{s}}_{1}^{j})\,\|\,p({\bm{s}}_{1}^{j})\right)+\sum_{b=2}^{B}\mathbb{E}_{q(\theta_{\text{dyn}},{\bm{s}}_{b-1})}\left[\sum_{j=1}^{N}\mathrm{KL}\left(q({\bm{s}}_{b}^{j})\,\|\,p({\bm{s}}_{b}^{j}|{\bm{s}}_{b-1},\theta_{\text{dyn}})\right)\right], (59)

where $\mathrm{KL}$ denotes the Kullback–Leibler (KL) divergence. Both of the KL divergences above have a closed form, but the expectation w.r.t. $q(\theta_{\text{dyn}},{\bm{s}}_{b-1})$ does not.

\displaystyle\mathcal{L}_{3}=\mathrm{KL}(q(\theta_{\text{dyn}})\,\|\,p(\theta_{\text{dyn}})),\quad\mathcal{L}_{4}=\mathrm{KL}(q(\theta_{\text{dec}})\,\|\,p(\theta_{\text{dec}})). (60)
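For completeness, we note the standard closed-form expression used for such Gaussian KL terms; this is the generic formula for diagonal Gaussians (the means and variances below are generic placeholders, not new model quantities):

\mathrm{KL}\left(\mathcal{N}(\bm{\mu}_{q},\mathrm{diag}(\bm{\sigma}_{q}^{2}))\,\|\,\mathcal{N}(\bm{\mu}_{p},\mathrm{diag}(\bm{\sigma}_{p}^{2}))\right)=\frac{1}{2}\sum_{k}\left[\frac{\sigma_{q,k}^{2}}{\sigma_{p,k}^{2}}+\frac{(\mu_{p,k}-\mu_{q,k})^{2}}{\sigma_{p,k}^{2}}-1+\ln\frac{\sigma_{p,k}^{2}}{\sigma_{q,k}^{2}}\right].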

B.2 Computation of ELBO.

We compute the ELBO using the following algorithm:

  1. Sample $\theta_{\text{dyn}},\theta_{\text{dec}}$ from $q_{\bm{\psi}_{\text{dyn}}}(\theta_{\text{dyn}})$, $q_{\bm{\psi}_{\text{dec}}}(\theta_{\text{dec}})$.

  2. Sample ${\bm{s}}_{1:B}$ by sampling each ${\bm{s}}_{b}^{j}$ from $q_{\bm{\psi}_{b}^{j}}({\bm{s}}_{b}^{j})$ with $\bm{\psi}_{b}^{j}=h_{\theta_{\text{enc}}}({\bm{u}}[t_{[b]},{\bm{x}}_{j}])$.

  3. Compute ${\bm{u}}_{1:M}$ from ${\bm{s}}_{1:B}$ as in Equations 14–16.

  4. Compute the ELBO $\mathcal{L}$ (the KL terms are computed in closed form; the expectations are estimated with Monte Carlo integration using one sample).

Sampling is done using reparametrization to allow unbiased gradients w.r.t. the model parameters.
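For concreteness, the following is a minimal Python sketch of this one-sample, reparametrized ELBO estimate. It is a simplified illustration rather than the actual implementation: the helper functions decode_mean and prior_mean (standing in for the decoder applied to the latent ODE solution and for the ODE forward solution that serves as the prior mean of the shooting variables) and all tensor shapes are assumptions.

```python
import torch
from torch.distributions import Normal, kl_divergence

def elbo_one_sample(mu_s, sig_s, u_obs, decode_mean, prior_mean, sigma_u, sigma_c):
    """One-sample Monte Carlo ELBO estimate (illustrative sketch).

    mu_s, sig_s : (B, N, d) variational means/stds of the shooting variables (from the encoder)
    u_obs       : (M, N, D) observations
    decode_mean : maps sampled shooting variables to predicted observations of shape (M, N, D)
    prior_mean  : maps s_{b-1} to the ODE forward solution used as the prior mean of s_b
    """
    q_s = Normal(mu_s, sig_s)
    s = q_s.rsample()                                # reparametrized sample of s_{1:B}

    # L1: expected log-likelihood, evaluated with a single sample
    u_pred = decode_mean(s)                          # g_dec(z(t_i, x_j; t_[b], s_b))
    log_lik = Normal(u_pred, sigma_u).log_prob(u_obs).sum()

    # L2: KL between q(s_1) and the standard normal prior (closed form) ...
    kl_s1 = kl_divergence(Normal(mu_s[0], sig_s[0]), Normal(0.0, 1.0)).sum()
    # ... plus KL between q(s_b) and p(s_b | s_{b-1}), with s_{b-1} sampled (Monte Carlo)
    kl_rest = kl_divergence(Normal(mu_s[1:], sig_s[1:]),
                            Normal(prior_mean(s[:-1]), sigma_c)).sum()

    # L3 and L4 (KL terms over theta_dyn and theta_dec) are omitted here for brevity
    return log_lik - kl_s1 - kl_rest
```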

Appendix C

C.1 Datasets.

Shallow Water.

The shallow water equations are a system of partial differential equations (PDEs) that simulate the behavior of water in a shallow basin. These equations are effectively a depth-integrated version of the Navier-Stokes equations, assuming the horizontal length scale is significantly larger than the vertical length scale. Given these assumptions, they provide a model for water dynamics in a basin or similar environment, and are commonly utilized in predicting the propagation of water waves, tides, tsunamis, and coastal currents. The state of the system modeled by these equations consists of the wave height $h(t,x,y)$, the velocity in the $x$-direction $u(t,x,y)$, and the velocity in the $y$-direction $v(t,x,y)$. Given an initial state $(h_{0},u_{0},v_{0})$, we solve the PDEs on a spatial domain $\Omega$ over the time interval $[0,T]$. The shallow water equations are defined as:

\displaystyle\frac{\partial h}{\partial t}+\frac{\partial(hu)}{\partial x}+\frac{\partial(hv)}{\partial y}=0, (61)
\displaystyle\frac{\partial u}{\partial t}+u\frac{\partial u}{\partial x}+v\frac{\partial u}{\partial y}+g\frac{\partial h}{\partial x}=0, (62)
\displaystyle\frac{\partial v}{\partial t}+u\frac{\partial v}{\partial x}+v\frac{\partial v}{\partial y}+g\frac{\partial h}{\partial y}=0, (63)

where $g$ is the gravitational constant.

The spatial domain $\Omega$ is a unit square with periodic boundary conditions. We set $T=0.1$ sec. The solution is evaluated at randomly selected spatial locations and time points. We use 1089 spatial locations and 25 time points. The spatial and temporal grids are the same for all trajectories. Since we are dealing with partially-observed cases, we assume that we observe only the wave height $h(t,x,y)$.

For each trajectory, we start with zero initial velocities, and the initial height $h_{0}(x,y)$ is generated as:

\displaystyle\tilde{h}_{0}(x,y)=\sum_{k,l=-N}^{N}\lambda_{kl}\cos(2\pi(kx+ly))+\gamma_{kl}\sin(2\pi(kx+ly)), (64)
\displaystyle h_{0}(x,y)=1+\frac{\tilde{h}_{0}(x,y)-\mathrm{min}(\tilde{h}_{0})}{\mathrm{max}(\tilde{h}_{0})-\mathrm{min}(\tilde{h}_{0})}, (65)

where $N=3$ and $\lambda_{kl},\gamma_{kl}\sim\mathcal{N}(0,1)$.
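As an illustration, a minimal NumPy sketch of this initial-condition sampler (evaluating $\tilde{h}_0$ at arbitrary spatial locations and rescaling it to $[1,2]$) might look as follows; the function name and array shapes are our own choices, not part of the released code.

```python
import numpy as np

def sample_initial_height(xy, N=3, rng=np.random.default_rng()):
    """Sample h_0 at locations xy (shape (num_points, 2)) via a random truncated Fourier series."""
    x, y = xy[:, 0], xy[:, 1]
    h_tilde = np.zeros_like(x)
    for k in range(-N, N + 1):
        for l in range(-N, N + 1):
            lam, gam = rng.standard_normal(2)      # lambda_kl, gamma_kl ~ N(0, 1)
            phase = 2 * np.pi * (k * x + l * y)
            h_tilde += lam * np.cos(phase) + gam * np.sin(phase)
    # shift/rescale so that the initial height lies in [1, 2]
    return 1 + (h_tilde - h_tilde.min()) / (h_tilde.max() - h_tilde.min())

# usage: heights at 1089 random locations in the unit square
xy = np.random.default_rng(0).uniform(0, 1, size=(1089, 2))
h0 = sample_initial_height(xy)
```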

The datasets used for training, validation, and testing contain 60, 20, and 20 trajectories, respectively.

We use scikit-fdiff (Cellier, 2019) to solve the PDEs.

Navier-Stokes.

For this dataset we model the propagation of a scalar field (e.g., smoke concentration) in a fluid (e.g., air). The modeling is done by coupling the Navier-Stokes equations with the Boussinesq buoyancy term and the transport equation to model the propagation of the scalar field. The state of the system modeled by these equations consists of the scalar field $c(t,x,y)$, the velocity in the $x$-direction $u(t,x,y)$, the velocity in the $y$-direction $v(t,x,y)$, and the pressure $p(t,x,y)$. Given an initial state $(c_{0},u_{0},v_{0},p_{0})$, we solve the PDEs on a spatial domain $\Omega$ over the time interval $[0,T]$. The Navier-Stokes equations with the transport equation are defined as:

\displaystyle\frac{\partial u}{\partial x}+\frac{\partial v}{\partial y}=0, (66)
\displaystyle\frac{\partial u}{\partial t}+u\frac{\partial u}{\partial x}+v\frac{\partial u}{\partial y}=-\frac{\partial p}{\partial x}+\nu\left(\frac{\partial^{2}u}{\partial x^{2}}+\frac{\partial^{2}u}{\partial y^{2}}\right), (67)
\displaystyle\frac{\partial v}{\partial t}+u\frac{\partial v}{\partial x}+v\frac{\partial v}{\partial y}=-\frac{\partial p}{\partial y}+\nu\left(\frac{\partial^{2}v}{\partial x^{2}}+\frac{\partial^{2}v}{\partial y^{2}}\right)+c, (68)
\displaystyle\frac{\partial c}{\partial t}=-u\frac{\partial c}{\partial x}-v\frac{\partial c}{\partial y}+\nu\left(\frac{\partial^{2}c}{\partial x^{2}}+\frac{\partial^{2}c}{\partial y^{2}}\right), (69)

where $\nu=0.002$.

The spatial domain $\Omega$ is a unit square with periodic boundary conditions. We set $T=2.0$ sec, but drop the first $0.5$ seconds due to slow dynamics during this time period. The solution is evaluated at randomly selected spatial locations and time points. We use 1089 spatial locations and 25 time points. The spatial and temporal grids are the same for all trajectories. Since we are dealing with partially-observed cases, we assume that we observe only the scalar field $c(t,x,y)$.

For each trajectory, we start with zero initial velocities and pressure, and the initial scalar field $c_{0}(x,y)$ is generated as:

\displaystyle\tilde{c}_{0}(x,y)=\sum_{k,l=-N}^{N}\lambda_{kl}\cos(2\pi(kx+ly))+\gamma_{kl}\sin(2\pi(kx+ly)), (70)
\displaystyle c_{0}(x,y)=\frac{\tilde{c}_{0}(x,y)-\mathrm{min}(\tilde{c}_{0})}{\mathrm{max}(\tilde{c}_{0})-\mathrm{min}(\tilde{c}_{0})}, (71)

where $N=2$ and $\lambda_{kl},\gamma_{kl}\sim\mathcal{N}(0,1)$.

The datasets used for training, validation, and testing contain 60, 20, and 20 trajectories, respectively.

We use PhiFlow (Holl et al., 2020) to solve the PDEs.

Scalar Flow.

Figure 9: Spatial grid used for Scalar Flow dataset.

This dataset, proposed by Eckert et al. (2019), consists of observations of smoke plumes rising in hot air. The observations are post-processed camera images of the smoke plumes taken from multiple views. For simplicity, we use only the front view. The dataset contains 104 trajectories, where each trajectory has 150 time points and each image has a resolution of 1080 $\times$ 1920. Each trajectory was recorded for $T=2.5$ seconds.

To reduce the dimensionality of the observations, we sub-sample the original spatial and temporal grids. For the temporal grid, we remove the first 50 time points, which leaves 100 time points, and then take every 4th time point, leaving 20 time points in total. The original 1080 $\times$ 1920 spatial grid is first down-sampled by a factor of 9, giving a new grid with resolution 120 $\times$ 213, and then the new grid is further sub-sampled based on the smoke density at each node. In particular, we compute the average smoke density at each node (averaged over time), and then sample the nodes without replacement with probability proportional to the average smoke density (thus, nodes that have zero density most of the time are not selected). This gives a new grid with 1089 nodes. See an example of a final grid in Figure 9.
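The density-weighted sub-sampling step can be summarized by the short NumPy sketch below; the variable names and exact normalization are illustrative assumptions.

```python
import numpy as np

def subsample_grid(density, num_nodes=1089, rng=np.random.default_rng(0)):
    """Pick grid nodes with probability proportional to the time-averaged smoke density.

    density : array of shape (T, H, W) with smoke density over time on the down-sampled grid
    returns : indices of the selected nodes in the flattened (H * W) grid
    """
    avg = density.mean(axis=0).ravel()        # time-averaged density per node
    probs = avg / avg.sum()                   # sampling probabilities (zero-density nodes get 0)
    return rng.choice(avg.size, size=num_nodes, replace=False, p=probs)

# usage with a hypothetical (20, 120, 213) density tensor
density = np.random.default_rng(1).random((20, 120, 213))
node_ids = subsample_grid(density)
```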

We further smooth the observations by applying Gaussian smoothing with a standard deviation of 1.5 (assuming a domain size of 120 $\times$ 213).

We use the first 60 trajectories for training, the next 20 for validation, and the next 20 for testing.

In this case the spatial domain is non-periodic, which means that for some observation location ${\bm{x}}_{i}$ some of its spatial neighbors $\mathcal{N}_{\text{S}}({\bm{x}}_{i})$ might be outside of the domain. We found that to account for such cases it is sufficient to mark such out-of-domain neighbors by setting their value to $-1$.

Time grids used for the three datasets are shown in Figure 10.

Figure 10: Time grids used for Shallow Water (top), Navier-Stokes (middle), and Scalar Flow (bottom).

C.2 Model architecture and hyper-parameters.

Dynamics function.

For all datasets we define $F_{\theta_{\text{dyn}}}$ as an MLP. For Shallow Water/Navier-Stokes/Scalar Flow we use 1/3/3 hidden layers with the size of 1024/512/512, respectively. We use ReLU nonlinearities.
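A minimal PyTorch sketch of such a dynamics MLP follows (shown here with the Shallow Water configuration of one hidden layer of width 1024; the input and latent dimensions are illustrative assumptions, since they depend on the latent dimension and on how the spatial neighborhood is encoded):

```python
import torch.nn as nn

def make_dynamics_mlp(in_dim, latent_dim, hidden=1024, num_hidden=1):
    """MLP with ReLU nonlinearities mapping aggregated neighborhood features to dz/dt."""
    layers, width = [], in_dim
    for _ in range(num_hidden):
        layers += [nn.Linear(width, hidden), nn.ReLU()]
        width = hidden
    layers.append(nn.Linear(width, latent_dim))
    return nn.Sequential(*layers)

# e.g., Navier-Stokes/Scalar Flow would use hidden=512, num_hidden=3
F_dyn = make_dynamics_mlp(in_dim=170, latent_dim=10)   # sizes are assumptions
```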

Observation function.

For all datasets we define $g_{\theta_{\text{dec}}}$ as a selector function which takes the latent state ${\bm{z}}(t,x)\in\mathbb{R}^{d}$ and returns its first component.

Encoder.

Our encoder $h_{\theta_{\text{enc}}}$ consists of three functions: $h_{\theta_{\text{spatial}}}$, $h_{\theta_{\text{temporal}}}$, and $h_{\theta_{\text{read}}}$. The spatial aggregation function $h_{\theta_{\text{spatial}}}$ is a linear mapping to $\mathbb{R}^{128}$. The temporal aggregation function $h_{\theta_{\text{temporal}}}$ is a stack of transformer layers with temporal attention and continuous relative positional encodings (Iakovlev et al., 2023). For all datasets, we set the number of transformer layers to 6. Finally, the variational parameter readout function $h_{\theta_{\text{read}}}$ is a mapping defined as

\displaystyle\bm{\psi}_{b}^{j}=h_{\theta_{\text{read}}}(\alpha_{[b]}^{\text{T}})=\begin{pmatrix}\bm{\gamma}_{b}^{j}\\ \bm{\tau}_{b}^{j}\end{pmatrix}=\begin{pmatrix}\mathrm{Linear}(\alpha_{[b]}^{\text{T}})\\ \mathrm{exp}(\mathrm{Linear}(\alpha_{[b]}^{\text{T}}))\end{pmatrix}, (72)

where $\mathrm{Linear}$ is a linear layer (different for each row), and $\bm{\gamma}_{b}^{j}$ and $\bm{\tau}_{b}^{j}$ are the variational parameters discussed in Appendix A.
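For illustration, the readout could be implemented along the following lines (a sketch, assuming a 128-dimensional aggregated feature and latent dimension $d$; the class and attribute names are ours):

```python
import torch
import torch.nn as nn

class VariationalReadout(nn.Module):
    """Maps the temporally aggregated feature alpha to (gamma, tau) of q(s_b^j)."""
    def __init__(self, feature_dim=128, latent_dim=10):
        super().__init__()
        self.mean_head = nn.Linear(feature_dim, latent_dim)
        self.log_std_head = nn.Linear(feature_dim, latent_dim)

    def forward(self, alpha):
        gamma = self.mean_head(alpha)                 # variational mean
        tau = torch.exp(self.log_std_head(alpha))     # positive standard deviation
        return gamma, tau
```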

Spatial and temporal neighborhoods.

We use the same spatial neighborhoods $\mathcal{N}_{\text{S}}({\bm{x}})$ for both the encoder and the dynamics function. We define $\mathcal{N}_{\text{S}}({\bm{x}})$ as the set of points consisting of the point ${\bm{x}}$ and points on two concentric circles centered at ${\bm{x}}$, with radii $r$ and $r/2$, respectively. Each circle contains 8 points spaced 45 degrees apart (see Figure 11 (right)). The radius $r$ is set to $0.1$. For Shallow Water/Navier-Stokes/Scalar Flow the size of the temporal neighborhood ($\delta_{T}$) is set to $0.1$/$0.1$/$0.2$, respectively.
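A small NumPy sketch of how such neighborhood offsets could be constructed (purely illustrative; the function name is ours):

```python
import numpy as np

def circular_neighborhood(r=0.1, points_per_circle=8, radii_fractions=(1.0, 0.5)):
    """Offsets of the spatial neighborhood: the center point plus two concentric circles."""
    angles = np.arange(points_per_circle) * (2 * np.pi / points_per_circle)  # 45-degree spacing
    offsets = [np.zeros(2)]
    for frac in radii_fractions:
        ring = frac * r * np.stack([np.cos(angles), np.sin(angles)], axis=1)
        offsets.extend(ring)
    return np.array(offsets)        # shape (1 + 2 * 8, 2)

neighborhood = circular_neighborhood()
```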

Multiple Shooting.

For Shallow Water/Navier-Stokes/Scalar Flow we split the full training trajectories into 4/4/19 sub-trajectories, or, equivalently, use a sub-trajectory length of 6/6/2.
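One plausible way to implement this splitting is sketched below (an assumption on our part; the exact index bookkeeping and boundary handling in the implementation may differ). Consecutive blocks share their boundary time point, so for 20 time points and a block length of 2 this yields 19 sub-trajectories, matching the Scalar Flow setting.

```python
def split_into_subtrajectories(num_time_points, block_len):
    """Indices of overlapping sub-trajectories: consecutive blocks share their boundary point."""
    blocks, start = [], 0
    while start < num_time_points - 1:
        end = min(start + block_len - 1, num_time_points - 1)
        blocks.append(list(range(start, end + 1)))
        start = end
    return blocks

# 20 time points with block_len=2 -> 19 sub-trajectories
print(len(split_into_subtrajectories(20, 2)))
```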

C.3 Training, validation, and testing setup.

Data preprocessing.

We scale the temporal grids, spatial grids, and observations to be within the interval $[0,1]$.

Training.

We train our model for 20000 iterations using the Adam (Kingma and Ba, 2017) optimizer with a constant learning rate of 3e-4 and a linear warmup for 200 iterations. The latent spatiotemporal dynamics are simulated using differentiable ODE solvers from the torchdiffeq package (Chen, 2018) (we use dopri5 with rtol=1e-3, atol=1e-4, no adjoint). The batch size is 1.
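For reference, the corresponding solver call in torchdiffeq looks roughly as follows; the dynamics module and tensor shapes below are placeholders, not the actual model.

```python
import torch
from torchdiffeq import odeint  # differentiable ODE solvers (Chen, 2018)

class LatentDynamics(torch.nn.Module):
    """Placeholder dynamics; the real model aggregates spatial neighborhoods of latent states."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(dim, 512), torch.nn.ReLU(),
                                       torch.nn.Linear(512, dim))

    def forward(self, t, z):
        return self.net(z)

f = LatentDynamics(dim=10)
z0 = torch.zeros(1089, 10)                        # latent state at all grid nodes
t = torch.linspace(0.0, 1.0, 25)                  # normalized time grid
z_traj = odeint(f, z0, t, method='dopri5', rtol=1e-3, atol=1e-4)   # shape (25, 1089, 10)
```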

Validation.

We use the validation set to track the performance of our model during training and save the parameters that produce the best validation performance. As the performance measure, we use the mean absolute error of predicting the full validation trajectories given a number of initial observations. For Shallow Water/Navier-Stokes/Scalar Flow we use the first 5/5/10 observations. The predictions are made by taking one sample from the posterior predictive distribution (see Appendix C.4 for details).

Testing.

Testing is done similarly to validation, except that as the prediction we use an estimate of the expected value of the posterior predictive distribution (see Appendix C.4 for details).

C.4 Forecasting.

Given initial observations $\tilde{{\bm{u}}}_{1:m}$ at time points $t_{1:m}$, we predict the future observation $\tilde{{\bm{u}}}_{n}$ at a time point $t_{n}>t_{m}$ as the expected value of the approximate posterior predictive distribution:

\displaystyle p(\tilde{{\bm{u}}}_{n}|\tilde{{\bm{u}}}_{1:m},{\bm{u}}_{1:M})\approx\int p(\tilde{{\bm{u}}}_{n}|\tilde{{\bm{s}}}_{m},\theta_{\text{dyn}},\theta_{\text{dec}})q(\tilde{{\bm{s}}}_{m})q(\theta_{\text{dyn}})q(\theta_{\text{dec}})\,d\tilde{{\bm{s}}}_{m}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}. (73)

The expected value is estimated via Monte Carlo integration, so the algorithm for predicting $\tilde{{\bm{u}}}_{n}$ is as follows (a code sketch of this procedure is given after the list):

  1. Sample $\theta_{\text{dyn}},\theta_{\text{dec}}$ from $q(\theta_{\text{dyn}})$, $q(\theta_{\text{dec}})$.

  2. Sample $\tilde{{\bm{s}}}_{m}$ from $q(\tilde{{\bm{s}}}_{m})=\prod_{j=1}^{N}q_{\bm{\psi}_{m}^{j}}(\tilde{{\bm{s}}}_{m}^{j})$, where the variational parameters $\bm{\psi}_{m}^{j}$ are given by the encoder $h_{\theta_{\text{enc}}}$ operating on the initial observations $\tilde{{\bm{u}}}_{1:m}$ as $\bm{\psi}_{m}^{j}=h_{\theta_{\text{enc}}}(\tilde{{\bm{u}}}[t_{m},{\bm{x}}_{j}])$.

  3. Compute the latent state $\tilde{{\bm{z}}}(t_{n})={\bm{z}}(t_{n};t_{m},\tilde{{\bm{s}}}_{m},\theta_{\text{dyn}})$.

  4. Sample $\tilde{{\bm{u}}}_{n}$ by sampling each $\tilde{{\bm{u}}}_{n}^{j}$ from $\mathcal{N}(\tilde{{\bm{u}}}_{n}^{j}|g_{\theta_{\text{dec}}}(\tilde{{\bm{z}}}(t_{n},{\bm{x}}_{j})),\sigma_{u}^{2}I)$.

  5. Repeat steps 1–4 $n$ times and average the predictions (we use $n=10$).
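A compact sketch of this Monte Carlo forecasting loop is given below. It is PyTorch-style pseudocode: the placeholder functions sample_params, encode, solve_latent, and decode are our own names for the respective model components, and the sketch averages decoder means rather than noisy observation samples, which has the same expectation.

```python
import torch

def forecast(u_init, t_init, t_target, sample_params, encode, solve_latent, decode, n=10):
    """Estimate E[u(t_target)] by averaging n samples from the approximate posterior predictive."""
    preds = []
    for _ in range(n):
        theta_dyn, theta_dec = sample_params()                    # step 1: sample parameters from q
        gamma, tau = encode(u_init, t_init)                       # step 2: variational params of q(s_m)
        s_m = gamma + tau * torch.randn_like(tau)                 #         sample the latent state
        z_n = solve_latent(s_m, t_init[-1], t_target, theta_dyn)  # step 3: ODE forward solve
        preds.append(decode(z_n, theta_dec))                      # step 4: map latent state to observations
    return torch.stack(preds).mean(dim=0)                         # step 5: average the predictions
```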

C.5 Model comparison setup.

NODE.

For the NODE model the dynamics function was implemented as a fully connected feedforward neural network with 3 hidden layers, 512 neurons each, and ReLU nonlinearities.

FEN.

We use the official implementation of FEN. We use the FEN variant without the transport term, as we found it improves results on our datasets. The dynamics were assumed to be stationary and autonomous in all cases. The dynamics function was represented by a fully connected feedforward neural network with 3 hidden layers, 512 neurons each, and ReLU nonlinearities.

NSPDE.

We use the official implementation of NSPDE. We set the number of hidden channels to 16, and set modes1 and modes2 to 32.

DINo.

We use the official implementation of DINo. The encoder is an MLP with 3 hidden layers, 512 neurons each, and Swish non-linearities. The code dimension is 100. The dynamics function is an MLP with 3 hidden layers, 512 neurons each, and Swish non-linearities. The decoder has 3 layers and 64 channels.

MAgNet.

We use the official implementation of MAgNet. We use the graph neural network variant of the model. The number of message-passing steps is 5. All MLPs have 4 layers with 128 neurons in each layer. The latent state dimension is 128.

Appendix D

D.1 Spatiotemporal neighborhood shapes and sizes.

Here we investigate the effect of changing the shape and size of spatial and temporal neighborhoods used by the encoder and dynamics functions. We use the default hyperparameters discussed in Appendix C and change only the neighborhood shape or size. A neighborhood size of zero implies no spatial/temporal aggregation.

Initially, we use the original circular neighborhood displayed in Figure 11 for both encoder and dynamics function and change only its size (radius). The results are presented in Figures 12(a) and 12(b). In Figure 12(a), it is surprising to see very little effect from changing the encoder’s spatial neighborhood size. A potential explanation is that the dynamics function shares the spatial aggregation task with the encoder. However, the results in Figure 12(b) are more intuitive, displaying a U-shaped curve for the test MAE, indicating the importance of using spatial neighborhoods of appropriate size. Interestingly, the best results tend to be achieved with relatively large neighborhood sizes. Similarly, Figure 12(c) shows U-shaped curves for the encoder’s temporal neighborhood size, suggesting that latent state inference benefits from utilizing local temporal information.

We then examine the effect of changing the shape of the dynamics function's spatial neighborhood. We use $n$circle neighborhoods, which consist of $n$ equidistant concentric circular neighborhoods (see examples in Figure 11). Effectively, we maintain a fixed neighborhood size while altering its density. The results can be seen in Figure 13. We find that performance does not significantly improve when using denser (and presumably more informative) spatial neighborhoods, indicating that accurate predictions require only a relatively sparse neighborhood of appropriate size.

Figure 11: Left: original circular neighborhood (1circle). Center: circular neighborhood with increased size. Right: circular neighborhood of a different shape (2circle).
Figure 12: (a),(b): Test MAE vs spatial neighborhood sizes of the encoder and dynamics function, respectively. (c): Test MAE vs temporal neighborhood size of the encoder. Note that the spatial and temporal domains are normalized, so their largest size in any dimension is 1.
Figure 13: Test MAE vs spatial neighborhood shape.

D.2 Multiple shooting.

Here we demonstrate the effect of using multiple shooting for model training. In Figure 14 (left), we vary the sub-trajectory length (longer sub-trajectories imply more difficult training) and plot the test errors for each sub-trajectory length. We observe that in all cases the best results are achieved when the sub-trajectory length is considerably smaller than the full trajectory length. In Figure 14 (right) we further show the training times; as can be seen, multiple shooting noticeably reduces training time.

Figure 14: Test MAE vs training sub-trajectory length.

Appendix E

Noisy Data.

Here we show the effect of observation noise on our model and compare the results against other models. We train all models with observation noise of various strengths, and then compute the test MAE on noiseless data (we still use noisy data to infer the initial state at test time). Figure 15 shows that our model can manage noise strengths up to 0.1 without significant drops in performance. Note that all observations are in the range $[0,1]$.

Figure 15: Test MAE vs observation noise $\sigma_{u}$.

Appendix F

F.1 Model Predictions

We show predictions of the different models on the different datasets in Figure 16 (the synthetic data is partially observed).

Figure 16: Predictions from different models. Only forecasting is shown.

F.2 Visualization of Prediction Uncertainty

Figures 17, 18, and 19 demonstrate the prediction uncertainty across different samples from the posterior distribution.

Figure 17: Left: Test prediction for a single trajectory from the Shallow Water dataset. Shown are the data (top), the mean prediction (middle), and the standard deviation of the predictions (bottom). Columns show predictions at consecutive time points. Right: Ground truth observations (dashed black) and predictions (colored) with standard deviation, plotted over time at two spatial locations.
Figure 18: Left: Test prediction for a single trajectory from the Navier-Stokes dataset. Shown are the data (top), the mean prediction (middle), and the standard deviation of the predictions (bottom). Columns show predictions at consecutive time points. Right: Ground truth observations (dashed black) and predictions (colored) with standard deviation, plotted over time at two spatial locations.
Figure 19: Left: Test prediction for a single trajectory from the Scalar Flow dataset. Shown are the data (top), the mean prediction (middle), and the standard deviation of the predictions (bottom). Columns show predictions at consecutive time points. Right: Ground truth observations (dashed black) and predictions (colored) with standard deviation, plotted over time at two spatial locations.