
Learning Space-Time Continuous Neural PDEs from Partially Observed States

Valerii Iakovlev  Markus Heinonen  Harri Lähdesmäki
Department of Computer Science, Aalto University, Finland
{valerii.iakovlev, markus.o.heinonen, harri.lahdesmaki}@aalto.fi
Abstract

We introduce a novel grid-independent model for learning partial differential equations (PDEs) from noisy and partial observations on irregular spatiotemporal grids. We propose a space-time continuous latent neural PDE model with an efficient probabilistic framework and a novel encoder design for improved data efficiency and grid independence. The latent state dynamics are governed by a PDE model that combines the collocation method and the method of lines. We employ amortized variational inference for approximate posterior estimation and utilize a multiple shooting technique for enhanced training speed and stability. Our model demonstrates state-of-the-art performance on complex synthetic and real-world datasets, overcoming limitations of previous approaches and effectively handling partially-observed data. The proposed model outperforms recent methods, showing its potential to advance data-driven PDE modeling and enabling robust, grid-independent modeling of complex partially-observed dynamic processes.

1 Introduction

Source code and datasets can be found in our GitHub repository.

Modeling spatiotemporal processes allows us to understand and predict the behavior of complex systems that evolve over time and space (Cressie and Wikle, 2011). Partial differential equations (PDEs) are a popular tool for this task as they have a solid mathematical foundation (Evans, 2010) and can describe the dynamics of a wide range of physical, biological, and social phenomena (Murray, 2002; Hirsch, 2007). However, deriving PDEs can be challenging, especially when the system’s underlying mechanisms are complex and not well understood. Data-driven methods can bypass these challenges (Brunton and Kutz, 2019). By learning the underlying system dynamics directly from data, we can develop accurate PDE models that capture the essential features of the system. This approach has changed our ability to model complex systems and make predictions about their behavior in a data-driven manner.

While current data-driven PDE models have been successful at modeling complex spatiotemporal phenomena, they often operate under various simplifying assumptions such as regularity of the spatial or temporal grids (Long et al., 2018; Kochkov et al., 2021; Pfaff et al., 2021; Li et al., 2021; Han et al., 2022; Poli et al., 2022), discreteness in space or time (Seo et al., 2020; Pfaff et al., 2021; Lienen and Günnemann, 2022; Brandstetter et al., 2022), and availability of complete and noiseless observations (Long et al., 2018; Pfaff et al., 2021; Wu et al., 2022). Such assumptions become increasingly limiting in more realistic scenarios with scarce data and irregularly spaced, noisy and partial observations.

We address the limitations of existing methods and propose a space-time continuous and grid-independent model that can learn PDE dynamics from noisy and partial observations made on irregular spatiotemporal grids. Our main contributions include:

  • Development of an efficient generative modeling framework for learning latent neural PDE models from noisy and partially-observed data;

  • Novel PDE model that merges two PDE solution techniques – the collocation method and the method of lines – to achieve space-time continuity, grid-independence, and data efficiency;

  • Novel encoder design that operates on local spatiotemporal neighborhoods for improved data-efficiency and grid-independence.

Our model demonstrates state-of-the-art performance on complex synthetic and real-world datasets, opening up the possibility for accurate and efficient modeling of complex dynamic processes and promoting further advancements in data-driven PDE modeling.

2 Problem Setup

In this work we are concerned with modeling of spatiotemporal processes. For brevity, we present our method for a single observed trajectory, but the extension to multiple trajectories is straightforward. We observe a spatiotemporal dynamical system evolving over time on a spatial domain $\Omega$. The observations are made at $M$ arbitrary consecutive time points $t_{1:M}:=(t_{1},\dots,t_{M})$ and $N$ arbitrary observation locations $\bm{x}_{1:N}:=(\bm{x}_{1},\dots,\bm{x}_{N})$, where $\bm{x}_{i}\in\Omega$. This generates a sequence of observations $\bm{u}_{1:M}:=(\bm{u}_{1},\dots,\bm{u}_{M})$, where $\bm{u}_{i}\in\mathbb{R}^{N\times D}$ contains the $D$-dimensional observations at the $N$ observation locations. We define $\bm{u}_{i}^{j}$ as the observation at time $t_{i}$ and location $\bm{x}_{j}$. The number of time points and observation locations may vary between different observed trajectories.

We assume the data is generated by a dynamical system with a latent state $\bm{z}(t,\bm{x})\in\mathbb{R}^{d}$, where $t$ is time and $\bm{x}\in\Omega$ is the spatial location. The latent state is governed by an unknown PDE and is mapped to the observed state $\bm{u}(t,\bm{x})\in\mathbb{R}^{D}$ by an unknown observation function $g$ and likelihood model $p$:

$$\frac{\partial \bm{z}(t,\bm{x})}{\partial t}=F\big(\bm{z}(t,\bm{x}),\,\partial_{\bm{x}}\bm{z}(t,\bm{x}),\,\partial^{2}_{\bm{x}}\bm{z}(t,\bm{x}),\,\ldots\big), \tag{1}$$

$$\bm{u}(t,\bm{x})\sim p\big(g(\bm{z}(t,\bm{x}))\big), \tag{2}$$

where $\partial^{\bullet}_{\bm{x}}\bm{z}(t,\bm{x})$ denotes partial derivatives with respect to $\bm{x}$.

In this work we make two assumptions that are highly relevant in real-world scenarios. First, we assume partial observations, that is, the observed state $\bm{u}(t,\bm{x})$ does not contain all information about the latent state $\bm{z}(t,\bm{x})$ (e.g., $\bm{z}(t,\bm{x})$ contains pressure and velocity, but $\bm{u}(t,\bm{x})$ contains information only about the pressure). Second, we assume out-of-distribution time points and observation locations, that is, their number, positions, and density can change arbitrarily at test time.

3 Model

Figure 1: Model sketch. The initial latent state $\bm{z}(t_{1},\bm{x})$ is evolved via $F_{\theta_{\text{dyn}}}$ to the following latent states, which are then mapped to the observed states by $g_{\theta_{\text{dec}}}$.

Here we describe the model components (Sec. 3.1) which are then used to construct the generative model (Sec. 3.2).

3.1 Model components

Our model consists of four parts: a space-time continuous latent state $\bm{z}(t,\bm{x})$ and observed state $\bm{u}(t,\bm{x})$, a dynamics function $F_{\theta_{\text{dyn}}}$ governing the temporal evolution of the latent state, and an observation function $g_{\theta_{\text{dec}}}$ mapping the latent state to the observed state (see Figure 1). Next, we describe these components in detail.

Latent state.

To define a space-time continuous latent state $\bm{z}(t,\bm{x})\in\mathbb{R}^{d}$, we introduce $\bm{z}(t):=(\bm{z}^{1}(t),\dots,\bm{z}^{N}(t))\in\mathbb{R}^{N\times d}$, where each $\bm{z}^{i}(t)\in\mathbb{R}^{d}$ corresponds to the observation location $\bm{x}_{i}$. Then, we define the latent state $\bm{z}(t,\bm{x})$ as a spatial interpolant of $\bm{z}(t)$:

$$\bm{z}(t,\bm{x}):=\mathrm{Interpolate}(\bm{z}(t))(\bm{x}), \tag{3}$$

where $\mathrm{Interpolate}(\cdot)$ maps $\bm{z}(t)$ to an interpolant which can be evaluated at any spatial location $\bm{x}\in\Omega$ (see Figure 2). We do not rely on a particular interpolation method, but in this work we use linear interpolation as it shows good performance and facilitates efficient implementation.
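To make Eq. 3 concrete, below is a minimal sketch of such an interpolation step, assuming a one-dimensional spatial domain and PyTorch; the actual model applies linear interpolation on the 2D observation mesh, and all names here are illustrative.

```python
import torch

def linear_interpolant(x_nodes, z_nodes):
    """Return a callable that linearly interpolates the latent node values
    z_nodes (N, d), defined at 1D locations x_nodes (N,), at arbitrary query
    locations. Minimal 1D sketch of Interpolate(z(t))."""
    order = torch.argsort(x_nodes)
    x_sorted, z_sorted = x_nodes[order], z_nodes[order]

    def interp(x_query):
        # Index of the right neighbor of every query point, clamped to the domain.
        idx = torch.searchsorted(x_sorted, x_query).clamp(1, len(x_sorted) - 1)
        x0, x1 = x_sorted[idx - 1], x_sorted[idx]
        z0, z1 = z_sorted[idx - 1], z_sorted[idx]
        w = ((x_query - x0) / (x1 - x0)).unsqueeze(-1)   # interpolation weights
        return (1 - w) * z0 + w * z1                     # (num_queries, d)

    return interp

# z(t, x) evaluated anywhere in the domain:
# z_tx = linear_interpolant(x_nodes, z_t)(x_query)
```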

Latent state dynamics.

Figure 2: Latent state $\bm{z}(t,\bm{x})$ defined as an interpolant of $\bm{z}(t):=(\bm{z}^{1}(t),\dots,\bm{z}^{4}(t))$.

Given a space-time continuous latent state, one can naturally define its dynamics in terms of a PDE:

$$\frac{\partial \bm{z}(t,\bm{x})}{\partial t}=F_{\theta_{\text{dyn}}}\big(\bm{z}(t,\bm{x}),\,\partial_{\bm{x}}\bm{z}(t,\bm{x}),\,\partial^{2}_{\bm{x}}\bm{z}(t,\bm{x}),\,\ldots\big), \tag{4}$$

where $F_{\theta_{\text{dyn}}}$ is a dynamics function with parameters $\theta_{\text{dyn}}$. This is a viable approach known as the collocation method (Kansa, 1990; Cheng, 2009), but it has several limitations. It requires us to decide which partial derivatives to include in the dynamics function, and it also requires an interpolant that has all the selected partial derivatives (e.g., a linear interpolant has only first-order derivatives). To avoid these limitations, we combine the collocation method with another PDE solution technique known as the method of lines (Schiesser, 1991; Hamdi et al., 2007), which approximates spatial derivatives $\partial^{\bullet}_{\bm{x}}\bm{z}(t,\bm{x})$ using only evaluations of $\bm{z}(t,\bm{x})$, and we let the dynamics function approximate all required derivatives in a data-driven manner. To do that, we define the spatial neighborhood of $\bm{x}$ as $\mathcal{N}_{\text{S}}(\bm{x})$, a set containing $\bm{x}$ and its spatial neighbors, and define $\bm{z}(t,\mathcal{N}_{\text{S}}(\bm{x}))$, the set of evaluations of the interpolant $\bm{z}(t,\bm{x})$ at the points in $\mathcal{N}_{\text{S}}(\bm{x})$:

$$\mathcal{N}_{\text{S}}(\bm{x}):=\{\bm{x}^{\prime}\in\Omega:\bm{x}^{\prime}=\bm{x}\ \text{or}\ \bm{x}^{\prime}\ \text{is a spatial neighbor of}\ \bm{x}\}, \tag{5}$$

$$\bm{z}(t,\mathcal{N}_{\text{S}}(\bm{x})):=\{\bm{z}(t,\bm{x}^{\prime}):\bm{x}^{\prime}\in\mathcal{N}_{\text{S}}(\bm{x})\}, \tag{6}$$

and assume that this information is sufficient to approximate all required spatial derivatives at $\bm{x}$. This is a reasonable assumption since, e.g., finite differences can approximate derivatives using only function values and the locations of the evaluation points. Hence, we define the dynamics of $\bm{z}(t,\bm{x})$ as

$$\frac{\partial \bm{z}(t,\bm{x})}{\partial t}=F_{\theta_{\text{dyn}}}\big(\mathcal{N}_{\text{S}}(\bm{x}),\bm{z}(t,\mathcal{N}_{\text{S}}(\bm{x}))\big), \tag{7}$$

which is defined only in terms of the values of the latent state, but not its spatial derivatives.
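For concreteness, such a dynamics function could be a small MLP over the stacked neighborhood values. Since the neighborhood pattern is fixed (see below), the relative locations in $\mathcal{N}_{\text{S}}(\bm{x})$ are constant and can be absorbed into the network; the sketch below makes this assumption, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class DynamicsFn(nn.Module):
    """Sketch of F_theta_dyn: maps the latent-state values evaluated on a
    fixed-size spatial neighborhood to the time derivative of the latent
    state at the neighborhood center."""
    def __init__(self, latent_dim, n_points, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim * n_points, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_neighborhood):
        # z_neighborhood: (N, n_points, latent_dim), interpolant values at the
        # neighborhood points (including the center) of each observation location.
        return self.net(z_neighborhood.flatten(start_dim=1))  # (N, latent_dim)
```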

Figure 3: Example of $\mathcal{N}_{\text{S}}(\bm{x}_{i})$. Instead of using the observation locations (dots) to define spatial neighbors, we use spatial locations arranged in a fixed predefined pattern (crosses).

One way to define the spatial neighbors for $\bm{x}$ is in terms of the observation locations $\bm{x}_{1:N}$ (e.g., use the nearest ones), as was done, for example, in (Long et al., 2018; Pfaff et al., 2021; Lienen and Günnemann, 2022). Instead, we utilize the continuity of the latent state $\bm{z}(t,\bm{x})$ and define the spatial neighbors in a grid-independent manner as a fixed number of points arranged in a predefined pattern around $\bm{x}$ (see Figure 3). This allows us to fix the shape and size of the spatial neighborhoods in advance, making them independent of the observation locations. In this work we use a spatial neighborhood consisting of two concentric circles of radius $r$ and $r/2$; each circle contains 8 evaluation points as in Figure 3. In Appendix D we compare neighborhoods of various shapes and sizes.
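A sketch of generating this fixed evaluation pattern (two concentric circles of radius $r$ and $r/2$ with 8 points each, plus the center), assuming a 2D domain:

```python
import math
import torch

def neighborhood_offsets(r, n_per_circle=8):
    """Relative offsets of the neighborhood evaluation points: two concentric
    circles of radius r and r/2, each with n_per_circle points (2D sketch)."""
    angles = torch.arange(n_per_circle, dtype=torch.float32) * (2 * math.pi / n_per_circle)
    circle = torch.stack([torch.cos(angles), torch.sin(angles)], dim=-1)  # unit circle
    return torch.cat([r * circle, 0.5 * r * circle], dim=0)               # (2 * n_per_circle, 2)

# Evaluation points of N_S(x_i): x_i itself plus x_i + neighborhood_offsets(r).
```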

Equation 7 allows us to simulate the temporal evolution of $\bm{z}(t,\bm{x})$ at any spatial location. However, since $\bm{z}(t,\bm{x})$ is defined only in terms of a spatial interpolant of $\bm{z}(t)$ (see Eq. 3), with $\bm{z}^{i}(t)=\bm{z}(t,\bm{x}_{i})$, it is sufficient to simulate the latent state dynamics only at the observation locations $\bm{x}_{1:N}$. Hence, we can completely characterize the latent state dynamics in terms of a system of $N$ ODEs:

$$\frac{d\bm{z}(t)}{dt}:=\begin{pmatrix}\frac{d\bm{z}^{1}(t)}{dt}\\ \vdots\\ \frac{d\bm{z}^{N}(t)}{dt}\end{pmatrix}=\begin{pmatrix}\frac{\partial\bm{z}(t,\bm{x}_{1})}{\partial t}\\ \vdots\\ \frac{\partial\bm{z}(t,\bm{x}_{N})}{\partial t}\end{pmatrix}=\begin{pmatrix}F_{\theta_{\text{dyn}}}(\mathcal{N}_{\text{S}}(\bm{x}_{1}),\bm{z}(t,\mathcal{N}_{\text{S}}(\bm{x}_{1})))\\ \vdots\\ F_{\theta_{\text{dyn}}}(\mathcal{N}_{\text{S}}(\bm{x}_{N}),\bm{z}(t,\mathcal{N}_{\text{S}}(\bm{x}_{N})))\end{pmatrix}. \tag{8}$$

For convenience, we define $\bm{z}(t;t_{1},\bm{z}_{1},\theta_{\text{dyn}}):=\mathrm{ODESolve}(t;t_{1},\bm{z}_{1},\theta_{\text{dyn}})$ as the solution of the ODE system in Equation 8 at time $t$ with initial state $\bm{z}(t_{1})=\bm{z}_{1}$ and parameters $\theta_{\text{dyn}}$. We also define $\bm{z}(t,\bm{x};t_{1},\bm{z}_{1},\theta_{\text{dyn}})$ as the spatial interpolant of $\bm{z}(t;t_{1},\bm{z}_{1},\theta_{\text{dyn}})$ as in Equation 3. We solve the ODEs using off-the-shelf differentiable ODE solvers from the torchdiffeq package (Chen, 2018). Note that we solve for the state $\bm{z}(t)$ only at the observation locations $\bm{x}_{1:N}$, so to get the neighborhood values $\bm{z}(t,\mathcal{N}_{\text{S}}(\bm{x}_{i}))$ we perform interpolation at every step of the ODE solver.
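A minimal sketch of solving this ODE system with torchdiffeq: at every solver step the current node values are interpolated, the interpolant is evaluated on each spatial neighborhood, and $F_{\theta_{\text{dyn}}}$ returns the time derivatives. The interpolate and dynamics_fn callables and the neighborhood tensor are assumed to be defined as in the sketches above; names are illustrative.

```python
import torch.nn as nn
from torchdiffeq import odeint

class LatentODE(nn.Module):
    """Right-hand side of the ODE system in Eq. (8) (illustrative sketch)."""
    def __init__(self, dynamics_fn, interpolate, nbr_points):
        super().__init__()
        self.dynamics_fn = dynamics_fn   # F_theta_dyn, e.g. an MLP over neighborhood values
        self.interpolate = interpolate   # z(t) at the nodes -> interpolant over Omega
        self.nbr_points = nbr_points     # (N, K, 2) neighborhood evaluation points

    def forward(self, t, z_t):
        # z_t: (N, d) latent state at the observation locations.
        z_nbrs = self.interpolate(z_t)(self.nbr_points)   # (N, K, d)
        return self.dynamics_fn(z_nbrs)                   # dz/dt at the nodes, (N, d)

# ODESolve(t; t_1, z_1, theta_dyn):
# z_traj = odeint(LatentODE(...), z_1, t_grid, rtol=1e-3, atol=1e-4)
```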

Observation function.

We define the mapping from the latent space to the observation space as a parametric function $g_{\theta_{\text{dec}}}$ with parameters $\theta_{\text{dec}}$:

$$\bm{u}(t,\bm{x})\sim\mathcal{N}\big(g_{\theta_{\text{dec}}}(\bm{z}(t,\bm{x})),\,\sigma_{u}^{2}I_{D}\big), \tag{9}$$

where $\mathcal{N}$ is the Gaussian distribution, $\sigma_{u}^{2}$ is the noise variance, and $I_{D}$ is the $D$-by-$D$ identity matrix.
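As an example, the observation function could be a small MLP decoder (a sketch; the architecture is not prescribed by the model, and the names are illustrative):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of g_theta_dec: maps a latent state in R^d to the mean of the
    Gaussian observation model in R^D."""
    def __init__(self, latent_dim, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, z):
        return self.net(z)

# Observation likelihood, Eq. (9):
# log_prob = torch.distributions.Normal(decoder(z), sigma_u).log_prob(u).sum(-1)
```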

3.2 Generative model

Figure 4: Multiple shooting splits a trajectory with one initial state (top) into two sub-trajectories with two initial states (bottom) and tries to minimize the gap between sub-trajectories (orange arrow).

Training models of dynamic systems is often challenging due to long training times and training instabilities (Ribeiro et al., 2020; Metz et al., 2021). To alleviate these problems, various heuristics have been proposed, such as progressive lengthening and splitting of the training trajectories (Yildiz et al., 2019; Um et al., 2020). We use multiple shooting (Bock and Plitt, 1984; Voss et al., 2004), a simple and efficient technique which has demonstrated its effectiveness in ODE learning applications (Jordana et al., 2021; Hegde et al., 2022). We extend the multiple shooting framework for latent ODE models presented in (Iakovlev et al., 2023) to our PDE modeling setup by introducing spatial dimensions in the latent state and designing an encoder adapted specifically to the PDE setting (Section 4.2).

Multiple shooting splits a single trajectory $\{\bm{z}(t_{i})\}_{i=1,\dots,M}$ with one initial state $\bm{z}_{1}$ into $B$ consecutive non-overlapping sub-trajectories $\{\bm{z}(t_{i})\}_{i\in\mathcal{I}_{b}},\ b=1,\dots,B$, with $B$ initial states $\bm{s}_{1:B}:=(\bm{s}_{1},\dots,\bm{s}_{B})$, while imposing a continuity penalty between the sub-trajectories (see Figure 4). The index set $\mathcal{I}_{b}$ contains the time point indices of the $b$'th sub-trajectory. We denote the temporal position of $\bm{s}_{b}$ as $t_{[b]}$ and place $\bm{s}_{b}$ at the first time point preceding the $b$'th sub-trajectory (except $\bm{s}_{1}$, which is placed at $t_{1}$). Note that the shooting states $\bm{s}_{b}$ have the same dimension as the original latent state $\bm{z}(t)$, i.e., $\bm{s}_{b}\in\mathbb{R}^{N\times d}$. Multiple shooting allows us to parallelize the simulation over the sub-trajectories and shortens the simulation intervals, thus improving training speed and stability. In Appendix D we demonstrate the effect of multiple shooting on model training and prediction accuracy.
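To make the bookkeeping concrete, the following sketch splits a trajectory's time indices into the index sets $\mathcal{I}_{b}$ and records the positions of the shooting states (assuming, purely for illustration, sub-trajectories of a fixed length):

```python
def split_into_subtrajectories(num_time_points, block_size):
    """Split time indices 0..M-1 into B consecutive non-overlapping index sets
    I_b; the shooting state s_b sits at the time point preceding block b
    (and s_1 at the very first time point). Illustrative sketch."""
    index_sets, shooting_positions = [], []
    for start in range(0, num_time_points, block_size):
        block = list(range(start, min(start + block_size, num_time_points)))
        index_sets.append(block)
        shooting_positions.append(0 if start == 0 else start - 1)
    return index_sets, shooting_positions

# split_into_subtrajectories(7, 3) ->
#   index_sets = [[0, 1, 2], [3, 4, 5], [6]], shooting_positions = [0, 2, 5]
```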

We begin by defining the prior over the unknown model parameters and initial states:

$$p(\bm{s}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})=p(\bm{s}_{1:B}|\theta_{\text{dyn}})\,p(\theta_{\text{dyn}})\,p(\theta_{\text{dec}}), \tag{10}$$

where $p(\theta_{\text{dyn}})$ and $p(\theta_{\text{dec}})$ are zero-mean diagonal Gaussians, and the continuity-inducing prior $p(\bm{s}_{1:B}|\theta_{\text{dyn}})$ is defined as in (Iakovlev et al., 2023):

$$p(\bm{s}_{1:B}|\theta_{\text{dyn}})=p(\bm{s}_{1})\prod_{b=2}^{B}p(\bm{s}_{b}|\bm{s}_{b-1},\theta_{\text{dyn}}). \tag{11}$$

Intuitively, the continuity prior $p(\bm{s}_{b}|\bm{s}_{b-1},\theta_{\text{dyn}})$ takes the initial latent state $\bm{s}_{b-1}$, simulates it forward from time $t_{[b-1]}$ to $t_{[b]}$ to get $\bm{\mu}_{[b]}=\mathrm{ODESolve}(t_{[b]};t_{[b-1]},\bm{s}_{b-1},\theta_{\text{dyn}})$, and then forces $\bm{\mu}_{[b]}$ to approximately match the initial state $\bm{s}_{b}$ of the next sub-trajectory, thus promoting continuity of the full trajectory. We assume the continuity-inducing prior factorizes across the grid points, i.e.,

$$p(\bm{s}_{1:B}|\theta_{\text{dyn}})=\left[\prod_{j=1}^{N}p(\bm{s}_{1}^{j})\right]\left[\prod_{b=2}^{B}\prod_{j=1}^{N}p(\bm{s}_{b}^{j}|\bm{s}_{b-1},\theta_{\text{dyn}})\right] \tag{12}$$

$$=\left[\prod_{j=1}^{N}p(\bm{s}_{1}^{j})\right]\left[\prod_{b=2}^{B}\prod_{j=1}^{N}\mathcal{N}\!\left(\bm{s}_{b}^{j}\,\middle|\,\bm{z}(t_{[b]},\bm{x}_{j};t_{[b-1]},\bm{s}_{b-1},\theta_{\text{dyn}}),\,\sigma_{c}^{2}I_{d}\right)\right], \tag{13}$$

where $p(\bm{s}_{1}^{j})$ is a diagonal Gaussian, and the parameter $\sigma_{c}^{2}$ controls the strength of the prior. Note that the term $\bm{z}(t_{[b]},\bm{x}_{j};t_{[b-1]},\bm{s}_{b-1},\theta_{\text{dyn}})$ in Equation 13 equals the ODE forward solution $\mathrm{ODESolve}(t_{[b]};t_{[b-1]},\bm{s}_{b-1},\theta_{\text{dyn}})$ at grid location $\bm{x}_{j}$.

Finally, we define our generative model in terms of the following sampling procedure:

$$\theta_{\text{dyn}},\theta_{\text{dec}},\bm{s}_{1:B}\sim p(\theta_{\text{dyn}})\,p(\theta_{\text{dec}})\,p(\bm{s}_{1:B}|\theta_{\text{dyn}}), \tag{14}$$

$$\bm{z}(t_{i})=\bm{z}(t_{i};t_{[b]},\bm{s}_{b},\theta_{\text{dyn}}),\qquad b\in\{1,\dots,B\},\ i\in\mathcal{I}_{b}, \tag{15}$$

$$\bm{u}_{i}^{j}\sim p\big(\bm{u}_{i}^{j}|g_{\theta_{\text{dec}}}(\bm{z}(t_{i},\bm{x}_{j}))\big),\qquad i=1,\dots,M,\ j=1,\dots,N, \tag{16}$$

with the following joint distribution (see Appendix A for details about the model specification):

$$p(\bm{u}_{1:M},\bm{s}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})=\prod_{b=1}^{B}\prod_{i\in\mathcal{I}_{b}}\prod_{j=1}^{N}\left[p(\bm{u}_{i}^{j}|\bm{s}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})\right]p(\bm{s}_{1:B}|\theta_{\text{dyn}})\,p(\theta_{\text{dyn}})\,p(\theta_{\text{dec}}). \tag{17}$$

4 Parameter Inference, Encoder, and Forecasting

4.1 Amortized variational inference

We approximate the true posterior over the model parameters and initial states, $p(\bm{s}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}}|\bm{u}_{1:M})$, using variational inference (Blei et al., 2017) with the following approximate posterior:

$$q(\theta_{\text{dyn}},\theta_{\text{dec}},\bm{s}_{1:B})=q(\theta_{\text{dyn}})q(\theta_{\text{dec}})q(\bm{s}_{1:B})=q_{\bm{\psi}_{\text{dyn}}}(\theta_{\text{dyn}})\,q_{\bm{\psi}_{\text{dec}}}(\theta_{\text{dec}})\prod_{b=1}^{B}\prod_{j=1}^{N}q_{\bm{\psi}_{b}^{j}}(\bm{s}_{b}^{j}), \tag{18}$$

where $q_{\bm{\psi}_{\text{dyn}}}$, $q_{\bm{\psi}_{\text{dec}}}$ and $q_{\bm{\psi}_{b}^{j}}$ are diagonal Gaussians, and $\bm{\psi}_{\text{dyn}}$, $\bm{\psi}_{\text{dec}}$ and $\bm{\psi}_{b}^{j}$ are variational parameters. To avoid direct optimization over the local variational parameters $\bm{\psi}_{b}^{j}$, we use amortized variational inference (Kingma and Welling, 2013) and train an encoder $h_{\theta_{\text{enc}}}$ with parameters $\theta_{\text{enc}}$ which maps observations $\bm{u}_{1:M}$ to $\bm{\psi}_{b}^{j}$ (see Section 4.2). For brevity, we sometimes omit the dependence of approximate posteriors on variational parameters and simply write, e.g., $q(\bm{s}_{b}^{j})$.

In variational inference the best approximation of the posterior is obtained by minimizing the Kullback-Leibler divergence $\mathrm{KL}\big[q(\theta_{\text{dyn}},\theta_{\text{dec}},\bm{s}_{1:B})\,\big\|\,p(\theta_{\text{dyn}},\theta_{\text{dec}},\bm{s}_{1:B}|\bm{u}_{1:M})\big]$, which is equivalent to maximizing the evidence lower bound (ELBO), defined for our model as:

$$\mathcal{L}=\underbrace{\sum_{b=1}^{B}\sum_{i\in\mathcal{I}_{b}}\sum_{j=1}^{N}\mathbb{E}_{q(\bm{s}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})}\left[\log p(\bm{u}_{i}^{j}|\bm{s}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})\right]}_{\textit{(i)}\text{ observation model}}-\underbrace{\sum_{j=1}^{N}\mathrm{KL}\big[q(\bm{s}_{1}^{j})\,\|\,p(\bm{s}_{1}^{j})\big]}_{\textit{(ii)}\text{ initial state prior}}$$

$$-\underbrace{\sum_{b=2}^{B}\sum_{j=1}^{N}\mathbb{E}_{q(\theta_{\text{dyn}},\bm{s}_{b-1})}\Big[\mathrm{KL}\big[q(\bm{s}_{b}^{j})\,\|\,p(\bm{s}_{b}^{j}|\bm{s}_{b-1},\theta_{\text{dyn}})\big]\Big]}_{\textit{(iii)}\text{ continuity prior}}-\underbrace{\mathrm{KL}\big[q(\theta_{\text{dyn}})\,\|\,p(\theta_{\text{dyn}})\big]}_{\textit{(iv)}\text{ dynamics prior}}-\underbrace{\mathrm{KL}\big[q(\theta_{\text{dec}})\,\|\,p(\theta_{\text{dec}})\big]}_{\textit{(v)}\text{ decoder prior}}.$$

The terms (ii), (iv), and (v) are computed analytically, while terms (i) and (iii) are approximated using Monte Carlo integration for the expectations and numerical ODE solvers for the initial value problems. See Appendices A and B for the approximate posterior details and for the derivation and computation of the ELBO.
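To illustrate how these terms are assembled in practice, below is a single-sample Monte Carlo sketch of the ELBO. All distributions are torch.distributions objects, simulate and decode are placeholders for the ODE solve and the decoder, and shapes are assumed to broadcast; this is a schematic, not the exact training code.

```python
from torch.distributions import Normal, kl_divergence

def elbo_one_sample(q_dyn, q_dec, q_s, p_dyn, p_dec, p_s1,
                    simulate, decode, u_obs, sigma_u, sigma_c):
    """One-sample Monte Carlo estimate of the ELBO (illustrative sketch).
    q_s is a list of B Normal posteriors over the shooting states;
    simulate(s_b, theta_dyn) returns the latent states of block b at its time
    points together with the block endpoint z(t_[b+1]; t_[b], s_b)."""
    theta_dyn, theta_dec = q_dyn.rsample(), q_dec.rsample()
    s = [q.rsample() for q in q_s]

    log_lik = kl_cont = 0.0
    for b in range(len(s)):
        z_block, z_end = simulate(s[b], theta_dyn)
        # (i) observation model term for block b
        log_lik = log_lik + Normal(decode(z_block, theta_dec), sigma_u).log_prob(u_obs[b]).sum()
        # (iii) continuity term: q(s_{b+1}) vs. N(s_{b+1} | z(t_[b+1]; s_b), sigma_c^2 I)
        if b + 1 < len(s):
            kl_cont = kl_cont + kl_divergence(q_s[b + 1], Normal(z_end, sigma_c)).sum()

    kl_s1 = kl_divergence(q_s[0], p_s1).sum()                                          # (ii)
    kl_params = kl_divergence(q_dyn, p_dyn).sum() + kl_divergence(q_dec, p_dec).sum()  # (iv), (v)
    return log_lik - kl_s1 - kl_cont - kl_params
```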

4.2 Encoder

Figure 5: The spatiotemporal neighborhood of a multiple shooting time point $t_{[b]}=t_{i}$ and location $\bm{x}_{j}$, $\bm{u}[t_{[b]},\bm{x}_{j}]$ (denoted by green, blue and orange crosses and the dots inside), is mapped to the variational parameters $\bm{\psi}_{b}^{j}$ via the encoder.

Here we describe our encoder which maps observations $\bm{u}_{1:M}$ to the local variational parameters $\bm{\psi}_{b}^{j}$ required to sample the initial latent state of sub-trajectory $b$ at time point $t_{[b]}$ and observation location $\bm{x}_{j}$. Similarly to our model, the encoder should be data-efficient and grid-independent.

Similarly to our model (Section 3.1), we enable grid-independence by making the encoder operate on spatial interpolants of the observations $\bm{u}_{1:M}$ (even if they are noisy):

$$\bm{u}_{i}(\bm{x}):=\mathrm{Interpolate}(\bm{u}_{i})(\bm{x}),\qquad i=1,\dots,M, \tag{19}$$

where spatial interpolation is done separately for each time point $i$. We then use the interpolants $\bm{u}_{i}(\bm{x})$ to define the spatial neighborhoods $\mathcal{N}_{\text{S}}(\bm{x})$ in a grid-independent manner.

To improve data-efficiency, we assume $\bm{\psi}_{b}^{j}$ does not depend on the whole observed sequence $\bm{u}_{1:M}$, but only on some local information in a spatiotemporal neighborhood of $t_{[b]}$ and $\bm{x}_{j}$. We define the temporal neighborhood of $t_{[b]}$ as

$$\mathcal{N}_{\text{T}}(t_{[b]}):=\{k:|t_{k}-t_{[b]}|\leq\delta_{T},\ k=1,\dots,M\}, \tag{20}$$

where $\delta_{T}$ is a hyperparameter controlling the neighborhood size, and then define the spatiotemporal neighborhood of $t_{[b]}$ and $\bm{x}_{j}$ as

$$\bm{u}[t_{[b]},\bm{x}_{j}]:=\{\bm{u}_{k}(\bm{x}):k\in\mathcal{N}_{\text{T}}(t_{[b]}),\,\bm{x}\in\mathcal{N}_{\text{S}}(\bm{x}_{j})\}. \tag{21}$$

Our encoder operates on such spatiotemporal neighborhoods $\bm{u}[t_{[b]},\bm{x}_{j}]$ and works in three steps (see Figure 5). First, for each time index $k\in\mathcal{N}_{\text{T}}(t_{[b]})$ it aggregates the spatial information $\{\bm{u}_{k}(\bm{x})\}_{\bm{x}\in\mathcal{N}_{\text{S}}(\bm{x}_{j})}$ into a vector $\alpha_{k}^{\text{S}}$. Then, it aggregates the spatial representations $\alpha_{k}^{\text{S}}$ across time into another vector $\alpha_{[b]}^{\text{T}}$, which is finally mapped to the variational parameters $\bm{\psi}_{b}^{j}$ as follows:

$$\bm{\psi}_{b}^{j}=h_{\theta_{\text{enc}}}(\bm{u}[t_{[b]},\bm{x}_{j}])=h_{\text{read}}(h_{\text{temporal}}(h_{\text{spatial}}(\bm{u}[t_{[b]},\bm{x}_{j}]))). \tag{22}$$

Spatial aggregation.

Since the spatial neighborhoods are fixed and remain identical for all spatial locations (see Figure 5), we implement the spatial aggregation function $h_{\text{spatial}}$ as an MLP which takes the elements of the set $\{\bm{u}_{k}(\bm{x})\}_{\bm{x}\in\mathcal{N}_{\text{S}}(\bm{x}_{j})}$, stacked in a fixed order, as the input.

Temporal aggregation.

We implement $h_{\text{temporal}}$ as a stack of transformer layers (Vaswani et al., 2017), which allows it to operate on input sets of arbitrary size. We use time-aware attention and continuous relative positional encodings (Iakovlev et al., 2023), which were shown to be effective on data from dynamical systems observed at irregular time intervals. Each transformer layer takes a layer-specific input set $\{\xi_{k}^{\text{in}}\}_{k\in\mathcal{N}_{\text{T}}(t_{[b]})}$, where $\xi_{k}^{\text{in}}$ is located at $t_{k}$, and maps it to an output set $\{\xi_{k}^{\text{out}}\}_{k\in\mathcal{N}_{\text{T}}(t_{[b]})}$, where each $\xi_{k}^{\text{out}}$ is computed using only the input elements within distance $\delta_{T}$ from $t_{k}$, thus promoting temporal locality. Furthermore, instead of using absolute positional encodings, the model assumes the behavior of the system does not depend on time and uses relative temporal distances to inject positional information. The first layer takes $\{\alpha_{k}^{\text{S}}\}_{k\in\mathcal{N}_{\text{T}}(t_{[b]})}$ as the input, while the last layer returns a single element at time point $t_{[b]}$, which represents the temporal aggregation $\alpha_{[b]}^{\text{T}}$.

Variational parameter readout.

Since $\alpha_{[b]}^{\text{T}}$ is a fixed-length vector, we implement $h_{\text{read}}$ as an MLP.
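A schematic of the three-step encoder is sketched below. For brevity the time-aware transformer layers with relative positional encodings are replaced by a vanilla TransformerEncoder, and all sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of h_theta_enc: spatiotemporal neighborhood u[t_[b], x_j] ->
    variational parameters psi_b^j = (gamma, log tau)."""
    def __init__(self, obs_dim, n_nbrs, latent_dim, width=128):
        super().__init__()
        self.h_spatial = nn.Sequential(              # spatial aggregation
            nn.Linear(obs_dim * n_nbrs, width), nn.ReLU(), nn.Linear(width, width))
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=4, batch_first=True)
        self.h_temporal = nn.TransformerEncoder(layer, num_layers=2)   # temporal aggregation
        self.h_read = nn.Linear(width, 2 * latent_dim)                 # readout

    def forward(self, u_nbr):
        # u_nbr: (T, n_nbrs, obs_dim), observations interpolated at the spatial
        # neighborhood of x_j for the T time points in N_T(t_[b]).
        alpha_s = self.h_spatial(u_nbr.flatten(start_dim=1))           # (T, width)
        # Simplification: take the last output as the aggregate at t_[b].
        alpha_t = self.h_temporal(alpha_s.unsqueeze(0))[0, -1]         # (width,)
        gamma, log_tau = self.h_read(alpha_t).chunk(2, dim=-1)
        return gamma, log_tau
```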

4.3 Forecasting

Given initial observations $\tilde{\bm{u}}_{1:m}$ at time points $t_{1:m}$, we predict the future observation $\tilde{\bm{u}}_{n}$ at a time point $t_{n}>t_{m}$ as the expected value of the approximate posterior predictive distribution:

$$p(\tilde{\bm{u}}_{n}|\tilde{\bm{u}}_{1:m},\bm{u}_{1:M})\approx\int p(\tilde{\bm{u}}_{n}|\tilde{\bm{s}}_{m},\theta_{\text{dyn}},\theta_{\text{dec}})\,q(\tilde{\bm{s}}_{m})\,q(\theta_{\text{dyn}})\,q(\theta_{\text{dec}})\,d\tilde{\bm{s}}_{m}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}. \tag{23}$$

The expected value is estimated via Monte Carlo integration (see Appendix C.4 for details).
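A sketch of this Monte Carlo estimate (encode, simulate, and decode are illustrative placeholders for the encoder, the latent ODE solve, and the decoder):

```python
import torch

@torch.no_grad()
def forecast_mean(encode, simulate, decode, q_dyn, q_dec,
                  u_init, t_init, t_target, n_samples=10):
    """Approximate E[u_n | u_1:m] by sampling the approximate posterior,
    rolling the latent state forward, and decoding (illustrative sketch)."""
    preds = []
    for _ in range(n_samples):
        theta_dyn, theta_dec = q_dyn.sample(), q_dec.sample()
        s_m = encode(u_init, t_init).sample()        # latent state at the last observed time
        z_n = simulate(s_m, t_init[-1], t_target, theta_dyn)
        preds.append(decode(z_n, theta_dec))
    return torch.stack(preds).mean(dim=0)
```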

5 Experiments

Figure 6: Top: the Shallow Water dataset contains observations of the wave height in a pool of water. Middle: the Navier-Stokes dataset contains observations of the concentration of a species transported in a fluid via buoyancy and a velocity field. Bottom: the Scalar Flow dataset contains observations of smoke plumes rising in warm air.

We use three challenging datasets: Shallow Water, Navier-Stokes, and Scalar Flow, which contain observations of a spatiotemporal system at $N\approx 1100$ grid points evolving over time (see Figure 6). The first two datasets are synthetic and generated using numerical PDE solvers (we use scikit-fdiff (Cellier, 2019) for Shallow Water, and PhiFlow (Holl et al., 2020) for Navier-Stokes), while the third dataset contains real-world observations (camera images) of smoke plumes rising in warm air (Eckert et al., 2019). In all cases the observations are made at irregular spatiotemporal grids and contain only partial information about the true system state. In particular, for Shallow Water we observe only the wave height, for Navier-Stokes we observe only the concentration of the species, and for Scalar Flow only pixel densities are known. All datasets contain 60/20/20 training/validation/testing trajectories. See Appendix C for details.

We train our model for 20k iterations with a constant learning rate of $3\mathrm{e}{-4}$ and linear warmup. The latent spatiotemporal dynamics are simulated using differentiable ODE solvers from the torchdiffeq package (Chen, 2018) (we use dopri5 with rtol=$1\mathrm{e}{-3}$, atol=$1\mathrm{e}{-4}$, no adjoint). Training is done on a single NVIDIA Tesla V100 GPU, with a single run taking 3-4 hours. We use the mean absolute error (MAE) on the test set as the performance measure. Error bars are standard errors over 4 random seeds. For forecasting we use the expected value of the posterior predictive distribution. See Appendix C for all details about the training, validation, and testing setup.

Latent state dimension.

Here we show the advantage of using latent-space models on partially observed data. We change the latent state dimension $d$ from 1 to 5 and measure the test MAE. Note that for $d=1$ we effectively have a data-space model which models the observations without trying to reconstruct the missing states. Figure 7 shows that in all cases the performance improves as the latent dimension grows. For Shallow Water and Navier-Stokes the true latent dimension is 3. Since Scalar Flow is a real-world process, there is no true latent dimension. As a benchmark, we provide the performance of our model trained on fully-observed versions of the synthetic datasets (we use the same architecture and hyperparameters, but fix $d$ to 3). Figure 7 also shows examples of model predictions (at the final time point) for different values of $d$. We see a huge difference between $d=1$ and $d=3,5$. Note how an apparently small difference in MAE between $d=1$ and $d=5$ for Scalar Flow corresponds to a dramatic improvement in prediction quality.

Figure 7: Left: Test MAE vs. latent state dimension $d$. Black lines are test MAE on fully-observed versions of the datasets ($\pm$ standard error). Right: Model predictions for different $d$.
Figure 8: Predictions on spatial grids of different density (linear interpolant, test data).

Grid independence.

In this experiment we demonstrate the grid-independence property of our model by training it on grids with $\approx 1100$ observation locations, and then testing on a coarser grid, the original grid, and a finer grid. We evaluate the effect of using different interpolation methods by repeating the experiment with linear, k-nearest-neighbors, and inverse distance weighting (IDW) interpolants. For Shallow Water and Navier-Stokes the coarser/finer grids contain 290/4200 nodes, while for Scalar Flow they contain 560/6420 nodes, respectively. Table 1 shows the model's performance for different spatial grids and interpolation methods. We see that all interpolation methods perform rather similarly on the original grid, but linear interpolation and IDW tend to perform better on finer/coarser grids than k-NN. The performance drop on coarse grids is expected since we get less accurate information about the system's initial state and simulate the dynamics on a coarser grid. Figure 8 also shows examples of model predictions (at the final time point) for different grid sizes and the linear interpolant.

Table 1: Test MAE for different interpolation methods.
Dataset | Grid | k-NN | Linear | IDW
Shallow Water | Coarser | 0.046 ± 0.002 | 0.034 ± 0.001 | 0.038 ± 0.002
Shallow Water | Original | 0.017 ± 0.002 | 0.016 ± 0.002 | 0.017 ± 0.003
Shallow Water | Finer | 0.031 ± 0.003 | 0.017 ± 0.003 | 0.030 ± 0.002
Navier-Stokes | Coarser | 0.087 ± 0.006 | 0.069 ± 0.009 | 0.066 ± 0.006
Navier-Stokes | Original | 0.048 ± 0.009 | 0.041 ± 0.003 | 0.045 ± 0.010
Navier-Stokes | Finer | 0.054 ± 0.009 | 0.044 ± 0.004 | 0.049 ± 0.002
Scalar Flow | Coarser | 0.041 ± 0.021 | 0.032 ± 0.009 | 0.035 ± 0.012
Scalar Flow | Original | 0.019 ± 0.001 | 0.018 ± 0.000 | 0.018 ± 0.001
Scalar Flow | Finer | 0.040 ± 0.016 | 0.026 ± 0.006 | 0.028 ± 0.007

Comparison to other models.

We test our model against recent spatiotemporal models from the literature: Finite Element Networks (FEN) (Lienen and Günnemann, 2022), Neural Stochastic PDEs (NSPDE) (Salvi et al., 2021), MAgNet (Boussif et al., 2022), and DINo (Yin et al., 2023). We also use Neural ODEs (NODE) (Chen et al., 2018) as a baseline. We use the official implementations of all models and tune their hyperparameters for the best performance (see App. C for details). For Shallow Water and Navier-Stokes we use the first 5 time points to infer the latent state and then predict the next 20 time points, while for Scalar Flow we use the first 10 points for inference and predict the next 10 points. For synthetic data, we consider two settings: one where the data is fully observed (i.e., the complete state is recorded) – a setting for which most models are designed – and one where the data is partially observed (i.e., only part of the full state is given, as discussed at the beginning of this section). The results are shown in Table 2. We see that some of the baseline models achieve reasonably good results on the fully-observed datasets, but they fail on partially-observed data, while our model maintains strong performance in all cases. Apart from the fully-observed Shallow Water dataset, where FEN performs slightly better, our method outperforms the other methods on all datasets by a clear margin. See Appendix C for hyperparameter details. In Appendix E we demonstrate our model's capability to learn dynamics from noisy data. In Appendix F we show model predictions on different datasets.

Table 2: Test MAE for different models.
Model | Shallow Water (Full) | Shallow Water (Partial) | Navier-Stokes (Full) | Navier-Stokes (Partial) | Scalar Flow
NODE | 0.036 ± 0.000 | 0.084 ± 0.001 | 0.053 ± 0.001 | 0.109 ± 0.001 | 0.056 ± 0.001
FEN | **0.011 ± 0.002** | 0.064 ± 0.005 | 0.031 ± 0.001 | 0.108 ± 0.002 | 0.062 ± 0.005
NSPDE | 0.019 ± 0.002 | 0.033 ± 0.001 | 0.042 ± 0.004 | 0.075 ± 0.002 | 0.059 ± 0.002
DINo | 0.027 ± 0.001 | 0.063 ± 0.003 | 0.047 ± 0.001 | 0.113 ± 0.002 | 0.059 ± 0.001
MAgNet | NA | 0.061 ± 0.001 | NA | 0.103 ± 0.003 | 0.056 ± 0.003
Ours | 0.014 ± 0.002 | **0.016 ± 0.002** | **0.024 ± 0.003** | **0.041 ± 0.003** | **0.042 ± 0.001**

6 Related Work

Closest to our work is Ayed et al. (2022), who considered the problem of learning PDEs from partial observations and proposed a discrete and grid-dependent model that is restricted to regular spatiotemporal grids. Another related work is that of Nguyen et al. (2020), who proposed a variational inference framework for learning ODEs from noisy and partially-observed data. However, they consider only low-dimensional ODEs and are restricted to regular grids.

Other works have considered learning latent-space PDE dynamics using the “encode-process-decode” approach. Pfaff et al. (2021) use a GNN-based encoder and dynamics function, map the observations to the same spatial grid in the latent space, and learn the latent-space dynamics. Sanchez et al. (2020) use a similar approach but with CNNs, mapping the observations to a coarser latent grid and learning the coarse-scale dynamics. Wu et al. (2022) use CNNs to map observations to a low-dimensional latent vector and learn the latent dynamics. However, all these approaches are grid-dependent, limited to regular spatial/temporal grids, and require fully-observed data.

Interpolation has been used in numerous studies for various applications. Works such as (Alet et al., 2019; Jiang et al., 2020; Han et al., 2022) use interpolation to map latent states on coarse grids to observations on finer grids. Hua et al. (2022) used interpolation as a post-processing step to obtain continuous predictions, while Boussif et al. (2022) used it to recover observations at missing nodes.

Another approach for achieving grid-independence was presented in neural operators (Li et al., 2021; Lu et al., 2021), which learn a mapping between infinite-dimensional function spaces and represent the mapping in a grid-independent manner.

7 Conclusion

We proposed a novel space-time continuous, grid-independent model for learning PDE dynamics from noisy and partial observations on irregular spatiotemporal grids. Our contributions include an efficient generative modeling framework, a novel latent PDE model merging collocation and the method of lines, and a data-efficient, grid-independent encoder design. The model demonstrates state-of-the-art performance on complex datasets, highlighting its potential for advancing data-driven PDE modeling and enabling accurate predictions of spatiotemporal phenomena in diverse fields. However, our model and encoder operate on every spatial and temporal location, which might not be the most efficient approach and hinders scaling to extremely large grids; hence, research into more efficient latent state extraction and dynamics modeling methods is needed.

References

  • Alet et al. (2019) Ferran Alet, Adarsh Keshav Jeewajee, Maria Bauza Villalonga, Alberto Rodriguez, Tomas Lozano-Perez, and Leslie Kaelbling. Graph element networks: adaptive, structured computation and memory. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 212–222. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/alet19a.html.
  • Ayed et al. (2022) Ibrahim Ayed, Emmanuel de Bézenac, Arthur Pajot, and Patrick Gallinari. Modelling spatiotemporal dynamics from earth observation data with neural differential equations. Machine Learning, 111(6):2349–2380, 2022. doi: 10.1007/s10994-022-06139-2. URL https://doi.org/10.1007/s10994-022-06139-2.
  • Blei et al. (2017) David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, Apr 2017. ISSN 1537-274X. doi: 10.1080/01621459.2017.1285773. URL http://dx.doi.org/10.1080/01621459.2017.1285773.
  • Bock and Plitt (1984) H.G. Bock and K.J. Plitt. A multiple shooting algorithm for direct solution of optimal control problems*. IFAC Proceedings Volumes, 17(2):1603–1608, 1984. ISSN 1474-6670. doi: https://doi.org/10.1016/S1474-6670(17)61205-9. URL https://www.sciencedirect.com/science/article/pii/S1474667017612059. 9th IFAC World Congress: A Bridge Between Control Science and Technology, Budapest, Hungary, 2-6 July 1984.
  • Boussif et al. (2022) Oussama Boussif, Yoshua Bengio, Loubna Benabbou, and Dan Assouline. MAgnet: Mesh agnostic neural PDE solver. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=bx2roi8hca8.
  • Brandstetter et al. (2022) Johannes Brandstetter, Daniel E. Worrall, and Max Welling. Message passing neural PDE solvers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=vSix3HPYKSU.
  • Brunton and Kutz (2019) Steven L. Brunton and J. Nathan Kutz. Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control. Cambridge University Press, 2019. doi: 10.1017/9781108380690.
  • Cellier (2019) Nicolas Cellier. scikit-fdiff, 2019. URL https://gitlab.com/celliern/scikit-fdiff.
  • Chen (2018) Ricky T. Q. Chen. torchdiffeq, 2018. URL https://github.com/rtqichen/torchdiffeq.
  • Chen et al. (2018) Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations, 2018.
  • Cheng (2009) Alexander H.-D. Cheng. Radial basis function collocation method. In Computational Mechanics, pages 219–219, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg. ISBN 978-3-540-75999-7.
  • Cressie and Wikle (2011) N. Cressie and C. K. Wikle. Statistics for Spatio-Temporal Data. Wiley, 2011. ISBN 9780471692744. URL https://books.google.fi/books?id=-kOC6D0DiNYC.
  • Eckert et al. (2019) Marie-Lena Eckert, Kiwon Um, and Nils Thuerey. Scalarflow: A large-scale volumetric data set of real-world scalar transport flows for computer animation and machine learning. ACM Transactions on Graphics, 38(6):239, 2019.
  • Evans (2010) L. C. Evans. Partial Differential Equations. American Mathematical Society, 2010. ISBN 9780821849743.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL https://proceedings.mlr.press/v9/glorot10a.html.
  • Hamdi et al. (2007) S. Hamdi, W. E. Schiesser, and G. W Griffiths. Method of lines. Scholarpedia, 2(7):2859, 2007. doi: 10.4249/scholarpedia.2859. revision #26511.
  • Han et al. (2022) Xu Han, Han Gao, Tobias Pfaff, Jian-Xun Wang, and Liping Liu. Predicting physics in mesh-reduced space with temporal attention. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=XctLdNfCmP.
  • Hegde et al. (2022) Pashupati Hegde, Cagatay Yildiz, Harri Lähdesmäki, Samuel Kaski, and Markus Heinonen. Variational multiple shooting for bayesian ODEs with gaussian processes. In The 38th Conference on Uncertainty in Artificial Intelligence, 2022. URL https://openreview.net/forum?id=r2NuhIUoceq.
  • Hirsch (2007) Charles Hirsch. Numerical computation of internal and external flows: The fundamentals of computational fluid dynamics. Elsevier, 2007.
  • Holl et al. (2020) Philipp Holl, Vladlen Koltun, Kiwon Um, and Nils Thuerey. phiflow: A differentiable pde solving framework for deep learning via physical simulations. In NeurIPS Workshop on Differentiable vision, graphics, and physics applied to machine learning, 2020. URL https://montrealrobotics.ca/diffcvgp/assets/papers/3.pdf.
  • Hua et al. (2022) Chuanbo Hua, Federico Berto, Michael Poli, Stefano Massaroli, and Jinkyoo Park. Efficient continuous spatio-temporal simulation with graph spline networks. In ICML 2022 2nd AI for Science Workshop, 2022. URL https://openreview.net/forum?id=PBT0Vftuji.
  • Iakovlev et al. (2023) Valerii Iakovlev, Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Latent neural ODEs with sparse bayesian multiple shooting. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=moIlFZfj_1b.
  • Jiang et al. (2020) Chiyu Max Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa Mustafa, Hamdi A. Tchelepi, Philip Marcus, Mr Prabhat, and Anima Anandkumar. Meshfreeflownet: A physics-constrained deep continuous space-time super-resolution framework. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2020. doi: 10.1109/sc41405.2020.00013. URL http://dx.doi.org/10.1109/SC41405.2020.00013.
  • Jordana et al. (2021) Armand Jordana, Justin Carpentier, and Ludovic Righetti. Learning dynamical systems from noisy sensor measurements using multiple shooting, 2021.
  • Kansa (1990) E.J. Kansa. Multiquadrics—a scattered data approximation scheme with applications to computational fluid-dynamics—ii solutions to parabolic, hyperbolic and elliptic partial differential equations. Computers and Mathematics with Applications, 19(8):147–161, 1990. ISSN 0898-1221. doi: https://doi.org/10.1016/0898-1221(90)90271-K. URL https://www.sciencedirect.com/science/article/pii/089812219090271K.
  • Kingma and Ba (2017) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2013.
  • Kochkov et al. (2021) Dmitrii Kochkov, Jamie A. Smith, Ayya Alieva, Qing Wang, Michael P. Brenner, and Stephan Hoyer. Machine learning–accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences, 118(21):e2101784118, 2021. doi: 10.1073/pnas.2101784118. URL https://www.pnas.org/doi/abs/10.1073/pnas.2101784118.
  • Li et al. (2021) Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Burigede liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=c8P9NQVtmnO.
  • Lienen and Günnemann (2022) Marten Lienen and Stephan Günnemann. Learning the dynamics of physical systems from sparse observations with finite element networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=HFmAukZ-k-2.
  • Long et al. (2018) Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. PDE-net: Learning PDEs from data. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3208–3216. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/long18a.html.
  • Lu et al. (2021) Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, Mar 2021. ISSN 2522-5839. doi: 10.1038/s42256-021-00302-5. URL http://dx.doi.org/10.1038/s42256-021-00302-5.
  • Metz et al. (2021) Luke Metz, C. Daniel Freeman, Samuel S. Schoenholz, and Tal Kachman. Gradients are not all you need, 2021.
  • Murray (2002) James D. Murray. Mathematical Biology I. An Introduction, volume 17 of Interdisciplinary Applied Mathematics. Springer, New York, 3 edition, 2002. doi: 10.1007/b98868.
  • Nguyen et al. (2020) Duong Nguyen, Said Ouala, Lucas Drumetz, and Ronan Fablet. Variational deep learning for the identification and reconstruction of chaotic and stochastic dynamical systems from noisy and partial observations. ArXiv, abs/2009.02296, 2020.
  • Pfaff et al. (2021) Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter Battaglia. Learning mesh-based simulation with graph networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=roNqYL0_XP.
  • Poli et al. (2022) Michael Poli, Stefano Massaroli, Federico Berto, Jinkyoo Park, Tri Dao, Christopher Re, and Stefano Ermon. Transform once: Efficient operator learning in frequency domain. In ICML 2022 2nd AI for Science Workshop, 2022. URL https://openreview.net/forum?id=x1fNT5yj41N.
  • Ribeiro et al. (2020) Antônio H. Ribeiro, Koen Tiels, Jack Umenberger, Thomas B. Schön, and Luis A. Aguirre. On the smoothness of nonlinear system identification. Automatica, 121:109158, 2020. ISSN 0005-1098. doi: https://doi.org/10.1016/j.automatica.2020.109158. URL https://www.sciencedirect.com/science/article/pii/S0005109820303563.
  • Salvi et al. (2021) Cristopher Salvi, Maud Lemercier, and Andris Gerasimovics. Neural stochastic pdes: Resolution-invariant learning of continuous spatiotemporal dynamics, 2021.
  • Sanchez et al. (2020) Alvaro Sanchez, Dmitrii Kochkov, Jamie Alexander Smith, Michael Brenner, Peter Battaglia, and Tobias Joachim Pfaff. Learning latent field dynamics of PDEs. Machine Learning and the Physical Sciences Workshop, 2020.
  • Schiesser (1991) William E. Schiesser. The numerical method of lines: Integration of partial differential equations. 1991.
  • Seo et al. (2020) Sungyong Seo, Chuizheng Meng, and Yan Liu. Physics-aware difference graph networks for sparsely-observed dynamics. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1gelyrtwH.
  • Um et al. (2020) Kiwon Um, Robert Brand, Yun Fei, Philipp Holl, and Nils Thuerey. Solver-in-the-Loop: Learning from Differentiable Physics to Interact with Iterative PDE-Solvers. Advances in Neural Information Processing Systems, 2020.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  • Voss et al. (2004) Henning Voss, J. Timmer, and Juergen Kurths. Nonlinear dynamical system identification from uncertain and indirect measurements. International Journal of Bifurcation and Chaos, 14, 01 2004.
  • Wu et al. (2022) Tailin Wu, Takashi Maruyama, and Jure Leskovec. Learning to accelerate partial differential equations via latent global evolution. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=xvZtgp5wyYT.
  • Yildiz et al. (2019) Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Ode2vae: Deep generative second order odes with bayesian neural networks. ArXiv, abs/1905.10994, 2019.
  • Yin et al. (2023) Yuan Yin, Matthieu Kirchmeyer, Jean-Yves Franceschi, Alain Rakotomamonjy, and patrick gallinari. Continuous PDE dynamics forecasting with implicit neural representations. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=B73niNjbPs.

Appendix A

A.1 Model specification.

Here we provide all details about our model specification. The joint distribution for our model is

$$p(\bm{u}_{1:M},\bm{s}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})=p(\bm{u}_{1:M}|\bm{s}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})\,p(\bm{s}_{1:B}|\theta_{\text{dyn}})\,p(\theta_{\text{dyn}})\,p(\theta_{\text{dec}}). \tag{24}$$

Next, we specify each component in detail.

Parameter priors.

The parameter priors are isotropic zero-mean multivariate normal distributions:

$$p(\theta_{\text{dyn}})=\mathcal{N}(\theta_{\text{dyn}}|\mathbf{0},I), \tag{25}$$

$$p(\theta_{\text{dec}})=\mathcal{N}(\theta_{\text{dec}}|\mathbf{0},I), \tag{26}$$

where $\mathcal{N}$ is the normal distribution, $\mathbf{0}$ is a zero vector, and $I$ is the identity matrix; both have the appropriate dimensionality, dependent on the number of dynamics and decoder parameters.

Continuity prior.

We define the continuity prior as

$$p(\bm{s}_{1:B}|\theta_{\text{dyn}})=p(\bm{s}_{1})\prod_{b=2}^{B}p(\bm{s}_{b}|\bm{s}_{b-1},\theta_{\text{dyn}}) \tag{27}$$

$$=\left[\prod_{j=1}^{N}p(\bm{s}_{1}^{j})\right]\left[\prod_{b=2}^{B}\prod_{j=1}^{N}p(\bm{s}_{b}^{j}|\bm{s}_{b-1},\theta_{\text{dyn}})\right] \tag{28}$$

$$=\left[\prod_{j=1}^{N}\mathcal{N}(\bm{s}_{1}^{j}|\mathbf{0},I)\right]\left[\prod_{b=2}^{B}\prod_{j=1}^{N}\mathcal{N}\!\left(\bm{s}_{b}^{j}\,\middle|\,\bm{z}(t_{[b]},\bm{x}_{j};t_{[b-1]},\bm{s}_{b-1},\theta_{\text{dyn}}),\,\sigma_{c}^{2}I\right)\right], \tag{29}$$

where $\mathcal{N}$ is the normal distribution, $\mathbf{0}\in\mathbb{R}^{d}$ is a zero vector, $I\in\mathbb{R}^{d\times d}$ is the identity matrix, and $\sigma_{c}\in\mathbb{R}$ is the parameter controlling the strength of the prior. Smaller values of $\sigma_{c}$ tend to produce smaller gaps between the sub-trajectories.

Observation model.

$$p(\bm{u}_{1:M}|\bm{s}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})=\prod_{b=1}^{B}\prod_{i\in\mathcal{I}_{b}}\prod_{j=1}^{N}p(\bm{u}_{i}^{j}|\bm{s}_{b},\theta_{\text{dyn}},\theta_{\text{dec}}) \tag{30}$$

$$=\prod_{b=1}^{B}\prod_{i\in\mathcal{I}_{b}}\prod_{j=1}^{N}p\big(\bm{u}_{i}^{j}|g_{\theta_{\text{dec}}}(\bm{z}(t_{i},\bm{x}_{j};t_{[b]},\bm{s}_{b},\theta_{\text{dyn}}))\big) \tag{31}$$

$$=\prod_{b=1}^{B}\prod_{i\in\mathcal{I}_{b}}\prod_{j=1}^{N}\mathcal{N}\big(\bm{u}_{i}^{j}|g_{\theta_{\text{dec}}}(\bm{z}(t_{i},\bm{x}_{j};t_{[b]},\bm{s}_{b},\theta_{\text{dyn}})),\,\sigma_{u}^{2}I\big), \tag{32}$$

where $\mathcal{N}$ is the normal distribution, $\sigma_{u}^{2}$ is the observation noise variance, and $I\in\mathbb{R}^{D\times D}$ is the identity matrix. Note again that $\bm{z}(t_{i},\bm{x}_{j};t_{[b]},\bm{s}_{b},\theta_{\text{dyn}})$ above equals the ODE forward solution $\mathrm{ODESolve}(t_{i};t_{[b]},\bm{s}_{b},\theta_{\text{dyn}})$ at grid location $\bm{x}_{j}$.

A.2 Approximate posterior specification.

Here we provide all details about the approximate posterior. We define the approximate posterior as

$$q(\theta_{\text{dyn}},\theta_{\text{dec}},\bm{s}_{1:B})=q(\theta_{\text{dyn}})q(\theta_{\text{dec}})q(\bm{s}_{1:B})=q_{\bm{\psi}_{\text{dyn}}}(\theta_{\text{dyn}})\,q_{\bm{\psi}_{\text{dec}}}(\theta_{\text{dec}})\prod_{b=1}^{B}\prod_{j=1}^{N}q_{\bm{\psi}_{b}^{j}}(\bm{s}_{b}^{j}). \tag{33}$$

Next, we specify each component in detail.

Dynamics parameters posterior.

We define $q_{\bm{\psi}_{\text{dyn}}}(\theta_{\text{dyn}})$ as

$$q_{\bm{\psi}_{\text{dyn}}}(\theta_{\text{dyn}})=\mathcal{N}(\theta_{\text{dyn}}|\bm{\gamma}_{\text{dyn}},\mathrm{diag}(\bm{\tau}_{\text{dyn}}^{2})), \tag{34}$$

where $\bm{\gamma}_{\text{dyn}}$ and $\bm{\tau}_{\text{dyn}}^{2}$ are vectors of the appropriate dimension (dependent on the number of dynamics parameters), and $\mathrm{diag}(\bm{\tau}_{\text{dyn}}^{2})$ is a matrix with $\bm{\tau}_{\text{dyn}}^{2}$ on the diagonal. We define the vector of variational parameters as $\bm{\psi}_{\text{dyn}}=(\bm{\gamma}_{\text{dyn}},\bm{\tau}_{\text{dyn}}^{2})$. We optimize directly over $\bm{\psi}_{\text{dyn}}$ and initialize $\bm{\gamma}_{\text{dyn}}$ using Xavier (Glorot and Bengio, 2010) initialization, while $\bm{\tau}_{\text{dyn}}$ is initialized with each element equal to $9\cdot 10^{-4}$.

Decoder parameters posterior.

We define $q_{\bm{\psi}_{\text{dec}}}(\theta_{\text{dec}})$ as

\displaystyle q_{\bm{\psi}_{\text{dec}}}(\theta_{\text{dec}})=\mathcal{N}(\theta_{\text{dec}}|\bm{\gamma}_{\text{dec}},\mathrm{diag}(\bm{\tau}_{\text{dec}}^{2})), (35)

where $\bm{\gamma}_{\text{dec}}$ and $\bm{\tau}_{\text{dec}}^{2}$ are vectors of the appropriate dimension (dependent on the number of decoder parameters), and $\mathrm{diag}(\bm{\tau}_{\text{dec}}^{2})$ is a matrix with $\bm{\tau}_{\text{dec}}^{2}$ on the diagonal. We define the vector of variational parameters as $\bm{\psi}_{\text{dec}}=(\bm{\gamma}_{\text{dec}},\bm{\tau}_{\text{dec}}^{2})$. We optimize directly over $\bm{\psi}_{\text{dec}}$ and initialize $\bm{\gamma}_{\text{dec}}$ using Xavier (Glorot and Bengio, 2010) initialization, while $\bm{\tau}_{\text{dec}}$ is initialized with each element equal to $9\cdot 10^{-4}$.

Shooting variables posterior.

We define $q_{\bm{\psi}_{b}^{j}}({\bm{s}}_{b}^{j})$ as

\displaystyle q_{\bm{\psi}_{b}^{j}}({\bm{s}}_{b}^{j})=\mathcal{N}({\bm{s}}_{b}^{j}|\bm{\gamma}_{b}^{j},\mathrm{diag}([\bm{\tau}_{b}^{j}]^{2})), (36)

where the vectors $\bm{\gamma}_{b}^{j},\bm{\tau}_{b}^{j}\in\mathbb{R}^{d}$ are returned by the encoder $h_{\theta_{\text{enc}}}$, and $\mathrm{diag}([\bm{\tau}_{b}^{j}]^{2})$ is a matrix with $[\bm{\tau}_{b}^{j}]^{2}$ on the diagonal. We define the vector of variational parameters as $\bm{\psi}_{b}^{j}=(\bm{\gamma}_{b}^{j},\bm{\tau}_{b}^{j})$. Because the variational inference for the shooting variables is amortized, our model is trained w.r.t. the parameters of the encoder network, $\theta_{\text{enc}}$.

Appendix B

B.1 Derivation of ELBO.

For our model and the choice of the approximate posterior, the ELBO can be written as

\displaystyle\mathcal{L} = \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\frac{p({\bm{u}}_{1:M},{\bm{s}}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})}{q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (37)
\displaystyle= \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\frac{p({\bm{u}}_{1:M}|{\bm{s}}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})\,p({\bm{s}}_{1:B}|\theta_{\text{dyn}})\,p(\theta_{\text{dyn}})\,p(\theta_{\text{dec}})}{q({\bm{s}}_{1:B})\,q(\theta_{\text{dyn}})\,q(\theta_{\text{dec}})}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (38)
\displaystyle= \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln p({\bm{u}}_{1:M}|{\bm{s}}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (39)
\displaystyle\quad-\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\frac{q({\bm{s}}_{1:B})}{p({\bm{s}}_{1:B}|\theta_{\text{dyn}})}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (40)
\displaystyle\quad-\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\frac{q(\theta_{\text{dyn}})}{p(\theta_{\text{dyn}})}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (41)
\displaystyle\quad-\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\frac{q(\theta_{\text{dec}})}{p(\theta_{\text{dec}})}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (42)
\displaystyle= \mathcal{L}_{1}-\mathcal{L}_{2}-\mathcal{L}_{3}-\mathcal{L}_{4}. (43)

Next, we will look at each term $\mathcal{L}_{i}$ separately.

\displaystyle\mathcal{L}_{1} = \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln p({\bm{u}}_{1:M}|{\bm{s}}_{1:B},\theta_{\text{dyn}},\theta_{\text{dec}})\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (44)
\displaystyle= \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\left[\prod_{b=1}^{B}\prod_{i\in\mathcal{I}_{b}}\prod_{j=1}^{N}p({\bm{u}}_{i}^{j}|{\bm{s}}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (45)
\displaystyle= \sum_{b=1}^{B}\sum_{i\in\mathcal{I}_{b}}\sum_{j=1}^{N}\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\left[p({\bm{u}}_{i}^{j}|{\bm{s}}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (46)
\displaystyle= \sum_{b=1}^{B}\sum_{i\in\mathcal{I}_{b}}\sum_{j=1}^{N}\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{b})\ln\left[p({\bm{u}}_{i}^{j}|{\bm{s}}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{b} (47)
\displaystyle= \sum_{b=1}^{B}\sum_{i\in\mathcal{I}_{b}}\sum_{j=1}^{N}\mathbb{E}_{q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{b})}\ln\left[p({\bm{u}}_{i}^{j}|{\bm{s}}_{b},\theta_{\text{dyn}},\theta_{\text{dec}})\right]. (48)
\displaystyle\mathcal{L}_{2} = \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\frac{q({\bm{s}}_{1:B})}{p({\bm{s}}_{1:B}|\theta_{\text{dyn}})}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (49)
\displaystyle= \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\left[\frac{q({\bm{s}}_{1})}{p({\bm{s}}_{1})}\prod_{b=2}^{B}\frac{q({\bm{s}}_{b})}{p({\bm{s}}_{b}|{\bm{s}}_{b-1},\theta_{\text{dyn}})}\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (50)
\displaystyle= \int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\left[\prod_{j=1}^{N}\frac{q({\bm{s}}_{1}^{j})}{p({\bm{s}}_{1}^{j})}\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (51)
\displaystyle\quad+\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\left[\prod_{b=2}^{B}\prod_{j=1}^{N}\frac{q({\bm{s}}_{b}^{j})}{p({\bm{s}}_{b}^{j}|{\bm{s}}_{b-1},\theta_{\text{dyn}})}\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (52)
\displaystyle= \sum_{j=1}^{N}\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\ln\left[\frac{q({\bm{s}}_{1}^{j})}{p({\bm{s}}_{1}^{j})}\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (53)
\displaystyle\quad+\sum_{b=2}^{B}\int q(\theta_{\text{dyn}},\theta_{\text{dec}},{\bm{s}}_{1:B})\sum_{j=1}^{N}\ln\left[\frac{q({\bm{s}}_{b}^{j})}{p({\bm{s}}_{b}^{j}|{\bm{s}}_{b-1},\theta_{\text{dyn}})}\right]d\theta_{\text{dyn}}\,d\theta_{\text{dec}}\,d{\bm{s}}_{1:B} (54)
\displaystyle= \sum_{j=1}^{N}\int q({\bm{s}}_{1}^{j})\ln\left[\frac{q({\bm{s}}_{1}^{j})}{p({\bm{s}}_{1}^{j})}\right]d{\bm{s}}_{1}^{j} (55)
\displaystyle\quad+\sum_{b=2}^{B}\int q(\theta_{\text{dyn}},{\bm{s}}_{b-1},{\bm{s}}_{b})\sum_{j=1}^{N}\ln\left[\frac{q({\bm{s}}_{b}^{j})}{p({\bm{s}}_{b}^{j}|{\bm{s}}_{b-1},\theta_{\text{dyn}})}\right]d\theta_{\text{dyn}}\,d{\bm{s}}_{b-1}\,d{\bm{s}}_{b} (56)
\displaystyle= \sum_{j=1}^{N}\int q({\bm{s}}_{1}^{j})\ln\left[\frac{q({\bm{s}}_{1}^{j})}{p({\bm{s}}_{1}^{j})}\right]d{\bm{s}}_{1}^{j} (57)
\displaystyle\quad+\sum_{b=2}^{B}\int q(\theta_{\text{dyn}},{\bm{s}}_{b-1})\sum_{j=1}^{N}\left[\int q({\bm{s}}_{b}^{j})\ln\frac{q({\bm{s}}_{b}^{j})}{p({\bm{s}}_{b}^{j}|{\bm{s}}_{b-1},\theta_{\text{dyn}})}\,d{\bm{s}}_{b}^{j}\right]d\theta_{\text{dyn}}\,d{\bm{s}}_{b-1} (58)
\displaystyle= \sum_{j=1}^{N}\mathrm{KL}\left(q({\bm{s}}_{1}^{j})\,\|\,p({\bm{s}}_{1}^{j})\right)+\sum_{b=2}^{B}\mathbb{E}_{q(\theta_{\text{dyn}},{\bm{s}}_{b-1})}\left[\sum_{j=1}^{N}\mathrm{KL}\left(q({\bm{s}}_{b}^{j})\,\|\,p({\bm{s}}_{b}^{j}|{\bm{s}}_{b-1},\theta_{\text{dyn}})\right)\right], (59)

where $\mathrm{KL}$ denotes the Kullback–Leibler (KL) divergence. Both of the KL divergences above have a closed form, but the expectation w.r.t. $q(\theta_{\text{dyn}},{\bm{s}}_{b-1})$ does not.

\displaystyle\mathcal{L}_{3}=\mathrm{KL}(q(\theta_{\text{dyn}})\,\|\,p(\theta_{\text{dyn}})),\quad\mathcal{L}_{4}=\mathrm{KL}(q(\theta_{\text{dec}})\,\|\,p(\theta_{\text{dec}})). (60)
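For completeness, we note the standard closed-form expression used for such Gaussian KL terms; this is the generic formula for diagonal Gaussians (the means and variances below are generic placeholders, not new model quantities):

\mathrm{KL}\left(\mathcal{N}(\bm{\mu}_{q},\mathrm{diag}(\bm{\sigma}_{q}^{2}))\,\|\,\mathcal{N}(\bm{\mu}_{p},\mathrm{diag}(\bm{\sigma}_{p}^{2}))\right)=\frac{1}{2}\sum_{k}\left[\frac{\sigma_{q,k}^{2}}{\sigma_{p,k}^{2}}+\frac{(\mu_{p,k}-\mu_{q,k})^{2}}{\sigma_{p,k}^{2}}-1+\ln\frac{\sigma_{p,k}^{2}}{\sigma_{q,k}^{2}}\right].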

B.2 Computation of ELBO.

We compute the ELBO using the following algorithm:

  1. Sample $\theta_{\text{dyn}},\theta_{\text{dec}}$ from $q_{\bm{\psi}_{\text{dyn}}}(\theta_{\text{dyn}})$, $q_{\bm{\psi}_{\text{dec}}}(\theta_{\text{dec}})$.

  2. Sample ${\bm{s}}_{1:B}$ by sampling each ${\bm{s}}_{b}^{j}$ from $q_{\bm{\psi}_{b}^{j}}({\bm{s}}_{b}^{j})$ with $\bm{\psi}_{b}^{j}=h_{\theta_{\text{enc}}}({\bm{u}}[t_{[b]},{\bm{x}}_{j}])$.

  3. Compute ${\bm{u}}_{1:M}$ from ${\bm{s}}_{1:B}$ as in Equations 14–16.

  4. Compute the ELBO $\mathcal{L}$ (the KL terms are computed in closed form; the expectations are estimated with Monte Carlo integration using one sample).

Sampling is done using reparametrization to allow unbiased gradients w.r.t. the model parameters.
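For concreteness, the following is a minimal Python sketch of this one-sample, reparametrized ELBO estimate. It is a simplified illustration rather than the actual implementation: the helper functions decode_mean and prior_mean (standing in for the decoder applied to the latent ODE solution and for the ODE forward solution that serves as the prior mean of the shooting variables) and all tensor shapes are assumptions.

```python
import torch
from torch.distributions import Normal, kl_divergence

def elbo_one_sample(mu_s, sig_s, u_obs, decode_mean, prior_mean, sigma_u, sigma_c):
    """One-sample Monte Carlo ELBO estimate (illustrative sketch).

    mu_s, sig_s : (B, N, d) variational means/stds of the shooting variables (from the encoder)
    u_obs       : (M, N, D) observations
    decode_mean : maps sampled shooting variables to predicted observations of shape (M, N, D)
    prior_mean  : maps s_{b-1} to the ODE forward solution used as the prior mean of s_b
    """
    q_s = Normal(mu_s, sig_s)
    s = q_s.rsample()                                # reparametrized sample of s_{1:B}

    # L1: expected log-likelihood, evaluated with a single sample
    u_pred = decode_mean(s)                          # g_dec(z(t_i, x_j; t_[b], s_b))
    log_lik = Normal(u_pred, sigma_u).log_prob(u_obs).sum()

    # L2: KL between q(s_1) and the standard normal prior (closed form) ...
    kl_s1 = kl_divergence(Normal(mu_s[0], sig_s[0]), Normal(0.0, 1.0)).sum()
    # ... plus KL between q(s_b) and p(s_b | s_{b-1}), with s_{b-1} sampled (Monte Carlo)
    kl_rest = kl_divergence(Normal(mu_s[1:], sig_s[1:]),
                            Normal(prior_mean(s[:-1]), sigma_c)).sum()

    # L3 and L4 (KL terms over theta_dyn and theta_dec) are omitted here for brevity
    return log_lik - kl_s1 - kl_rest
```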

Appendix C

C.1 Datasets.

Shallow Water.

The shallow water equations are a system of partial differential equations (PDEs) that simulate the behavior of water in a shallow basin. These equations are effectively a depth-integrated version of the Navier-Stokes equations, assuming the horizontal length scale is significantly larger than the vertical length scale. Given these assumptions, they provide a model for water dynamics in a basin or similar environment, and are commonly utilized in predicting the propagation of water waves, tides, tsunamis, and coastal currents. The state of the system modeled by these equations consists of the wave height $h(t,x,y)$, the velocity in the $x$-direction $u(t,x,y)$, and the velocity in the $y$-direction $v(t,x,y)$. Given an initial state $(h_{0},u_{0},v_{0})$, we solve the PDEs on a spatial domain $\Omega$ over the time interval $[0,T]$. The shallow water equations are defined as:

\displaystyle\frac{\partial h}{\partial t}+\frac{\partial(hu)}{\partial x}+\frac{\partial(hv)}{\partial y}=0, (61)
\displaystyle\frac{\partial u}{\partial t}+u\frac{\partial u}{\partial x}+v\frac{\partial u}{\partial y}+g\frac{\partial h}{\partial x}=0, (62)
\displaystyle\frac{\partial v}{\partial t}+u\frac{\partial v}{\partial x}+v\frac{\partial v}{\partial y}+g\frac{\partial h}{\partial y}=0, (63)

where $g$ is the gravitational constant.

The spatial domain $\Omega$ is a unit square with periodic boundary conditions. We set $T=0.1$ sec. The solution is evaluated at randomly selected spatial locations and time points. We use 1089 spatial locations and 25 time points. The spatial and temporal grids are the same for all trajectories. Since we are dealing with partially-observed cases, we assume that we observe only the wave height $h(t,x,y)$.

For each trajectory, we start with zero initial velocities, and the initial height $h_{0}(x,y)$ is generated as:

\displaystyle\tilde{h}_{0}(x,y)=\sum_{k,l=-N}^{N}\lambda_{kl}\cos(2\pi(kx+ly))+\gamma_{kl}\sin(2\pi(kx+ly)), (64)
\displaystyle h_{0}(x,y)=1+\frac{\tilde{h}_{0}(x,y)-\mathrm{min}(\tilde{h}_{0})}{\mathrm{max}(\tilde{h}_{0})-\mathrm{min}(\tilde{h}_{0})}, (65)

where $N=3$ and $\lambda_{kl},\gamma_{kl}\sim\mathcal{N}(0,1)$.
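As an illustration, a minimal NumPy sketch of this initial-condition sampler (evaluating $\tilde{h}_0$ at arbitrary spatial locations and rescaling it to $[1,2]$) might look as follows; the function name and array shapes are our own choices, not part of the released code.

```python
import numpy as np

def sample_initial_height(xy, N=3, rng=np.random.default_rng()):
    """Sample h_0 at locations xy (shape (num_points, 2)) via a random truncated Fourier series."""
    x, y = xy[:, 0], xy[:, 1]
    h_tilde = np.zeros_like(x)
    for k in range(-N, N + 1):
        for l in range(-N, N + 1):
            lam, gam = rng.standard_normal(2)      # lambda_kl, gamma_kl ~ N(0, 1)
            phase = 2 * np.pi * (k * x + l * y)
            h_tilde += lam * np.cos(phase) + gam * np.sin(phase)
    # shift/rescale so that the initial height lies in [1, 2]
    return 1 + (h_tilde - h_tilde.min()) / (h_tilde.max() - h_tilde.min())

# usage: heights at 1089 random locations in the unit square
xy = np.random.default_rng(0).uniform(0, 1, size=(1089, 2))
h0 = sample_initial_height(xy)
```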

The datasets used for training, validation, and testing contain 60, 20, and 20 trajectories, respectively.

We use scikit-fdiff (Cellier, 2019) to solve the PDEs.

Navier-Stokes.

For this dataset we model the propagation of a scalar field (e.g., smoke concentration) in a fluid (e.g., air). The modeling is done by coupling the Navier-Stokes equations with the Boussinesq buoyancy term and the transport equation to model the propagation of the scalar field. The state of the system modeled by these equations consists of the scalar field $c(t,x,y)$, the velocity in the $x$-direction $u(t,x,y)$, the velocity in the $y$-direction $v(t,x,y)$, and the pressure $p(t,x,y)$. Given an initial state $(c_{0},u_{0},v_{0},p_{0})$, we solve the PDEs on a spatial domain $\Omega$ over the time interval $[0,T]$. The Navier-Stokes equations with the transport equation are defined as:

\displaystyle\frac{\partial u}{\partial x}+\frac{\partial v}{\partial y}=0, (66)
\displaystyle\frac{\partial u}{\partial t}+u\frac{\partial u}{\partial x}+v\frac{\partial u}{\partial y}=-\frac{\partial p}{\partial x}+\nu\left(\frac{\partial^{2}u}{\partial x^{2}}+\frac{\partial^{2}u}{\partial y^{2}}\right), (67)
\displaystyle\frac{\partial v}{\partial t}+u\frac{\partial v}{\partial x}+v\frac{\partial v}{\partial y}=-\frac{\partial p}{\partial y}+\nu\left(\frac{\partial^{2}v}{\partial x^{2}}+\frac{\partial^{2}v}{\partial y^{2}}\right)+c, (68)
\displaystyle\frac{\partial c}{\partial t}=-u\frac{\partial c}{\partial x}-v\frac{\partial c}{\partial y}+\nu\left(\frac{\partial^{2}c}{\partial x^{2}}+\frac{\partial^{2}c}{\partial y^{2}}\right), (69)

where $\nu=0.002$.

The spatial domain $\Omega$ is a unit square with periodic boundary conditions. We set $T=2.0$ sec, but drop the first $0.5$ seconds due to slow dynamics during this time period. The solution is evaluated at randomly selected spatial locations and time points. We use 1089 spatial locations and 25 time points. The spatial and temporal grids are the same for all trajectories. Since we are dealing with partially-observed cases, we assume that we observe only the scalar field $c(t,x,y)$.

For each trajectory, we start with zero initial velocities and pressure, and the initial scalar field $c_{0}(x,y)$ is generated as:

\displaystyle\tilde{c}_{0}(x,y)=\sum_{k,l=-N}^{N}\lambda_{kl}\cos(2\pi(kx+ly))+\gamma_{kl}\sin(2\pi(kx+ly)), (70)
\displaystyle c_{0}(x,y)=\frac{\tilde{c}_{0}(x,y)-\mathrm{min}(\tilde{c}_{0})}{\mathrm{max}(\tilde{c}_{0})-\mathrm{min}(\tilde{c}_{0})}, (71)

where $N=2$ and $\lambda_{kl},\gamma_{kl}\sim\mathcal{N}(0,1)$.

The datasets used for training, validation, and testing contain 60, 20, and 20 trajectories, respectively.

We use PhiFlow (Holl et al., 2020) to solve the PDEs.

Scalar Flow.

Figure 9: Spatial grid used for Scalar Flow dataset.

This dataset, proposed by Eckert et al. (2019), consists of observations of smoke plumes rising in hot air. The observations are post-processed camera images of the smoke plumes taken from multiple views. For simplicity, we use only the front view. The dataset contains 104 trajectories, where each trajectory has 150 time points and each image has a resolution of 1080 $\times$ 1920. Each trajectory was recorded for $T=2.5$ seconds.

To reduce the dimensionality of the observations, we sub-sample the original spatial and temporal grids. For the temporal grid, we remove the first 50 time points, which leaves 100 time points, and then take every 4th time point, leaving 20 time points in total. The original 1080 $\times$ 1920 spatial grid is first down-sampled by a factor of 9, giving a new grid with resolution 120 $\times$ 213, and then the new grid is further sub-sampled based on the smoke density at each node. In particular, we compute the average smoke density at each node (averaged over time), and then sample the nodes without replacement with probability proportional to the average smoke density (thus, nodes that have zero density most of the time are not selected). This gives a new grid with 1089 nodes. See an example of a final grid in Figure 9.
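The density-weighted sub-sampling step can be summarized by the short NumPy sketch below; the variable names and exact normalization are illustrative assumptions.

```python
import numpy as np

def subsample_grid(density, num_nodes=1089, rng=np.random.default_rng(0)):
    """Pick grid nodes with probability proportional to the time-averaged smoke density.

    density : array of shape (T, H, W) with smoke density over time on the down-sampled grid
    returns : indices of the selected nodes in the flattened (H * W) grid
    """
    avg = density.mean(axis=0).ravel()        # time-averaged density per node
    probs = avg / avg.sum()                   # sampling probabilities (zero-density nodes get 0)
    return rng.choice(avg.size, size=num_nodes, replace=False, p=probs)

# usage with a hypothetical (20, 120, 213) density tensor
density = np.random.default_rng(1).random((20, 120, 213))
node_ids = subsample_grid(density)
```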

We further smooth the observations by applying Gaussian smoothing with a standard deviation of 1.5 (assuming a domain size of 120 $\times$ 213).

We use the first 60 trajectories for training, the next 20 for validation, and the next 20 for testing.

In this case the spatial domain is non-periodic, which means that for some observation location ${\bm{x}}_{i}$ some of its spatial neighbors $\mathcal{N}_{\text{S}}({\bm{x}}_{i})$ might be outside of the domain. We found that to account for such cases it is sufficient to mark such out-of-domain neighbors by setting their value to $-1$.

Time grids used for the three datasets are shown in Figure 10.

Figure 10: Time grids used for Shallow Water (top), Navier-Stokes (middle), and Scalar Flow (bottom).

C.2 Model architecture and hyper-parameters.

Dynamics function.

For all datasets we define $F_{\theta_{\text{dyn}}}$ as an MLP. For Shallow Water/Navier-Stokes/Scalar Flow we use 1/3/3 hidden layers with the size of 1024/512/512, respectively. We use ReLU nonlinearities.
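A minimal PyTorch sketch of such a dynamics MLP follows (shown here with the Shallow Water configuration of one hidden layer of width 1024; the input and latent dimensions are illustrative assumptions, since they depend on the latent dimension and on how the spatial neighborhood is encoded):

```python
import torch.nn as nn

def make_dynamics_mlp(in_dim, latent_dim, hidden=1024, num_hidden=1):
    """MLP with ReLU nonlinearities mapping aggregated neighborhood features to dz/dt."""
    layers, width = [], in_dim
    for _ in range(num_hidden):
        layers += [nn.Linear(width, hidden), nn.ReLU()]
        width = hidden
    layers.append(nn.Linear(width, latent_dim))
    return nn.Sequential(*layers)

# e.g., Navier-Stokes/Scalar Flow would use hidden=512, num_hidden=3
F_dyn = make_dynamics_mlp(in_dim=170, latent_dim=10)   # sizes are assumptions
```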

Observation function.

For all datasets we define $g_{\theta_{\text{dec}}}$ as a selector function which takes the latent state ${\bm{z}}(t,x)\in\mathbb{R}^{d}$ and returns its first component.

Encoder.

Our encoder $h_{\theta_{\text{enc}}}$ consists of three functions: $h_{\theta_{\text{spatial}}}$, $h_{\theta_{\text{temporal}}}$, and $h_{\theta_{\text{read}}}$. The spatial aggregation function $h_{\theta_{\text{spatial}}}$ is a linear mapping to $\mathbb{R}^{128}$. The temporal aggregation function $h_{\theta_{\text{temporal}}}$ is a stack of transformer layers with temporal attention and continuous relative positional encodings (Iakovlev et al., 2023). For all datasets, we set the number of transformer layers to 6. Finally, the variational parameter readout function $h_{\theta_{\text{read}}}$ is a mapping defined as

\displaystyle\bm{\psi}_{b}^{j}=h_{\theta_{\text{read}}}(\alpha_{[b]}^{\text{T}})=\begin{pmatrix}\bm{\gamma}_{b}^{j}\\ \bm{\tau}_{b}^{j}\end{pmatrix}=\begin{pmatrix}\mathrm{Linear}(\alpha_{[b]}^{\text{T}})\\ \mathrm{exp}(\mathrm{Linear}(\alpha_{[b]}^{\text{T}}))\end{pmatrix}, (72)

where $\mathrm{Linear}$ is a linear layer (different for each row), and $\bm{\gamma}_{b}^{j}$ and $\bm{\tau}_{b}^{j}$ are the variational parameters discussed in Appendix A.
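For illustration, the readout could be implemented along the following lines (a sketch, assuming a 128-dimensional aggregated feature and latent dimension $d$; the class and attribute names are ours):

```python
import torch
import torch.nn as nn

class VariationalReadout(nn.Module):
    """Maps the temporally aggregated feature alpha to (gamma, tau) of q(s_b^j)."""
    def __init__(self, feature_dim=128, latent_dim=10):
        super().__init__()
        self.mean_head = nn.Linear(feature_dim, latent_dim)
        self.log_std_head = nn.Linear(feature_dim, latent_dim)

    def forward(self, alpha):
        gamma = self.mean_head(alpha)                 # variational mean
        tau = torch.exp(self.log_std_head(alpha))     # positive standard deviation
        return gamma, tau
```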

Spatial and temporal neighborhoods.

We use the same spatial neighborhoods $\mathcal{N}_{\text{S}}({\bm{x}})$ for both the encoder and the dynamics function. We define $\mathcal{N}_{\text{S}}({\bm{x}})$ as the set of points consisting of the point ${\bm{x}}$ and points on two concentric circles centered at ${\bm{x}}$, with radii $r$ and $r/2$, respectively. Each circle contains 8 points spaced 45 degrees apart (see Figure 11 (right)). The radius $r$ is set to $0.1$. For Shallow Water/Navier-Stokes/Scalar Flow the size of the temporal neighborhood ($\delta_{T}$) is set to $0.1$/$0.1$/$0.2$, respectively.
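A small NumPy sketch of how such neighborhood offsets could be constructed (purely illustrative; the function name is ours):

```python
import numpy as np

def circular_neighborhood(r=0.1, points_per_circle=8, radii_fractions=(1.0, 0.5)):
    """Offsets of the spatial neighborhood: the center point plus two concentric circles."""
    angles = np.arange(points_per_circle) * (2 * np.pi / points_per_circle)  # 45-degree spacing
    offsets = [np.zeros(2)]
    for frac in radii_fractions:
        ring = frac * r * np.stack([np.cos(angles), np.sin(angles)], axis=1)
        offsets.extend(ring)
    return np.array(offsets)        # shape (1 + 2 * 8, 2)

neighborhood = circular_neighborhood()
```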

Multiple Shooting.

For Shallow Water/Navier-Stokes/Scalar Flow we split the full training trajectories into 4/4/19 sub-trajectories, or, equivalently, use a sub-trajectory length of 6/6/2.
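One plausible way to implement this splitting is sketched below (an assumption on our part; the exact index bookkeeping and boundary handling in the implementation may differ). Consecutive blocks share their boundary time point, so for 20 time points and a block length of 2 this yields 19 sub-trajectories, matching the Scalar Flow setting.

```python
def split_into_subtrajectories(num_time_points, block_len):
    """Indices of overlapping sub-trajectories: consecutive blocks share their boundary point."""
    blocks, start = [], 0
    while start < num_time_points - 1:
        end = min(start + block_len - 1, num_time_points - 1)
        blocks.append(list(range(start, end + 1)))
        start = end
    return blocks

# 20 time points with block_len=2 -> 19 sub-trajectories
print(len(split_into_subtrajectories(20, 2)))
```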

C.3 Training, validation, and testing setup.

Data preprocessing.

We scale the temporal grids, spatial grids, and observations to be within the interval $[0,1]$.

Training.

We train our model for 20000 iterations using the Adam (Kingma and Ba, 2017) optimizer with a constant learning rate of 3e-4 and a linear warmup for 200 iterations. The latent spatiotemporal dynamics are simulated using differentiable ODE solvers from the torchdiffeq package (Chen, 2018) (we use dopri5 with rtol=1e-3, atol=1e-4, no adjoint). The batch size is 1.
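For reference, the corresponding solver call in torchdiffeq looks roughly as follows; the dynamics module and tensor shapes below are placeholders, not the actual model.

```python
import torch
from torchdiffeq import odeint  # differentiable ODE solvers (Chen, 2018)

class LatentDynamics(torch.nn.Module):
    """Placeholder dynamics; the real model aggregates spatial neighborhoods of latent states."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(dim, 512), torch.nn.ReLU(),
                                       torch.nn.Linear(512, dim))

    def forward(self, t, z):
        return self.net(z)

f = LatentDynamics(dim=10)
z0 = torch.zeros(1089, 10)                        # latent state at all grid nodes
t = torch.linspace(0.0, 1.0, 25)                  # normalized time grid
z_traj = odeint(f, z0, t, method='dopri5', rtol=1e-3, atol=1e-4)   # shape (25, 1089, 10)
```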

Validation.

We use the validation set to track the performance of our model during training and save the parameters that produce the best validation performance. As the performance measure, we use the mean absolute error of predicting the full validation trajectories given a number of initial observations. For Shallow Water/Navier-Stokes/Scalar Flow we use the first 5/5/10 observations. The predictions are made by taking one sample from the posterior predictive distribution (see Appendix C.4 for details).

Testing.

Testing is done similarly to validation, except that as the prediction we use an estimate of the expected value of the posterior predictive distribution (see Appendix C.4 for details).

C.4 Forecasting.

Given initial observations $\tilde{{\bm{u}}}_{1:m}$ at time points $t_{1:m}$, we predict the future observation $\tilde{{\bm{u}}}_{n}$ at a time point $t_{n}>t_{m}$ as the expected value of the approximate posterior predictive distribution:

\displaystyle p(\tilde{{\bm{u}}}_{n}|\tilde{{\bm{u}}}_{1:m},{\bm{u}}_{1:M})\approx\int p(\tilde{{\bm{u}}}_{n}|\tilde{{\bm{s}}}_{m},\theta_{\text{dyn}},\theta_{\text{dec}})q(\tilde{{\bm{s}}}_{m})q(\theta_{\text{dyn}})q(\theta_{\text{dec}})\,d\tilde{{\bm{s}}}_{m}\,d\theta_{\text{dyn}}\,d\theta_{\text{dec}}. (73)

The expected value is estimated via Monte Carlo integration, so the algorithm for predicting $\tilde{{\bm{u}}}_{n}$ is as follows (a code sketch of this procedure is given after the list):

  1. Sample $\theta_{\text{dyn}},\theta_{\text{dec}}$ from $q(\theta_{\text{dyn}})$, $q(\theta_{\text{dec}})$.

  2. Sample $\tilde{{\bm{s}}}_{m}$ from $q(\tilde{{\bm{s}}}_{m})=\prod_{j=1}^{N}q_{\bm{\psi}_{m}^{j}}(\tilde{{\bm{s}}}_{m}^{j})$, where the variational parameters $\bm{\psi}_{m}^{j}$ are given by the encoder $h_{\theta_{\text{enc}}}$ operating on the initial observations $\tilde{{\bm{u}}}_{1:m}$ as $\bm{\psi}_{m}^{j}=h_{\theta_{\text{enc}}}(\tilde{{\bm{u}}}[t_{m},{\bm{x}}_{j}])$.

  3. Compute the latent state $\tilde{{\bm{z}}}(t_{n})={\bm{z}}(t_{n};t_{m},\tilde{{\bm{s}}}_{m},\theta_{\text{dyn}})$.

  4. Sample $\tilde{{\bm{u}}}_{n}$ by sampling each $\tilde{{\bm{u}}}_{n}^{j}$ from $\mathcal{N}(\tilde{{\bm{u}}}_{n}^{j}|g_{\theta_{\text{dec}}}(\tilde{{\bm{z}}}(t_{n},{\bm{x}}_{j})),\sigma_{u}^{2}I)$.

  5. Repeat steps 1–4 $n$ times and average the predictions (we use $n=10$).
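A compact sketch of this Monte Carlo forecasting loop is given below. It is PyTorch-style pseudocode: the placeholder functions sample_params, encode, solve_latent, and decode are our own names for the respective model components, and the sketch averages decoder means rather than noisy observation samples, which has the same expectation.

```python
import torch

def forecast(u_init, t_init, t_target, sample_params, encode, solve_latent, decode, n=10):
    """Estimate E[u(t_target)] by averaging n samples from the approximate posterior predictive."""
    preds = []
    for _ in range(n):
        theta_dyn, theta_dec = sample_params()                    # step 1: sample parameters from q
        gamma, tau = encode(u_init, t_init)                       # step 2: variational params of q(s_m)
        s_m = gamma + tau * torch.randn_like(tau)                 #         sample the latent state
        z_n = solve_latent(s_m, t_init[-1], t_target, theta_dyn)  # step 3: ODE forward solve
        preds.append(decode(z_n, theta_dec))                      # step 4: map latent state to observations
    return torch.stack(preds).mean(dim=0)                         # step 5: average the predictions
```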

C.5 Model comparison setup.

NODE.

For the NODE model the dynamics function was implemented as a fully connected feedforward neural network with 3 hidden layers, 512 neurons each, and ReLU nonlinearities.

FEN.

We use the official implementation of FEN. We use the FEN variant without the transport term, as we found it improves results on our datasets. The dynamics were assumed to be stationary and autonomous in all cases. The dynamics function was represented by a fully connected feedforward neural network with 3 hidden layers, 512 neurons each, and ReLU nonlinearities.

NSPDE.

We use the official implementation of NSPDE. We set the number of hidden channels to 16, and set modes1 and modes2 to 32.

DINo.

We use the official implementation of DINo. The encoder is an MLP with 3 hidden layers, 512 neurons each, and Swish non-linearities. The code dimension is 100. The dynamics function is an MLP with 3 hidden layers, 512 neurons each, and Swish non-linearities. The decoder has 3 layers and 64 channels.

MAgNet.

We use the official implementation of MAgNet. We use the graph neural network variant of the model. The number of message-passing steps is 5. All MLPs have 4 layers with 128 neurons in each layer. The latent state dimension is 128.

Appendix D

D.1 Spatiotemporal neighborhood shapes and sizes.

Here we investigate the effect of changing the shape and size of spatial and temporal neighborhoods used by the encoder and dynamics functions. We use the default hyperparameters discussed in Appendix C and change only the neighborhood shape or size. A neighborhood size of zero implies no spatial/temporal aggregation.

Initially, we use the original circular neighborhood displayed in Figure 11 for both encoder and dynamics function and change only its size (radius). The results are presented in Figures 12(a) and 12(b). In Figure 12(a), it is surprising to see very little effect from changing the encoder’s spatial neighborhood size. A potential explanation is that the dynamics function shares the spatial aggregation task with the encoder. However, the results in Figure 12(b) are more intuitive, displaying a U-shaped curve for the test MAE, indicating the importance of using spatial neighborhoods of appropriate size. Interestingly, the best results tend to be achieved with relatively large neighborhood sizes. Similarly, Figure 12(c) shows U-shaped curves for the encoder’s temporal neighborhood size, suggesting that latent state inference benefits from utilizing local temporal information.

We then examine the effect of changing the shape of the dynamics function's spatial neighborhood. We use $n$circle neighborhoods, which consist of $n$ equidistant concentric circular neighborhoods (see examples in Figure 11). Effectively, we maintain a fixed neighborhood size while altering its density. The results can be seen in Figure 13. We find that performance does not significantly improve when using denser (and presumably more informative) spatial neighborhoods, indicating that accurate predictions require only a relatively sparse neighborhood of appropriate size.

Figure 11: Left: original circular neighborhood (1circle). Center: circular neighborhood with increased size. Right: circular neighborhood of a different shape (2circle).
Figure 12: (a),(b): Test MAE vs spatial neighborhood sizes of the encoder and dynamics function, respectively. (c): Test MAE vs temporal neighborhood size of the encoder. Note that the spatial and temporal domains are normalized, so their largest size in any dimension is 1.
Figure 13: Test MAE vs spatial neighborhood shape.

D.2 Multiple shooting.

Here we demonstrate the effect of using multiple shooting for model training. In Figure 14 (left), we vary the sub-trajectory length (longer sub-trajectories imply more difficult training) and plot the test errors for each sub-trajectory length. We observe that in all cases the best results are achieved when the sub-trajectory length is considerably smaller than the full trajectory length. In Figure 14 (right) we further show the training times; as can be seen, multiple shooting noticeably reduces training time.

Figure 14: Test MAE vs training sub-trajectory length.

Appendix E

Noisy Data.

Here we show the effect of observation noise on our model and compare the results against other models. We train all models with observation noise of various strengths, and then compute the test MAE on noiseless data (we still use noisy data to infer the initial state at test time). Figure 15 shows that our model can manage noise strengths up to 0.1 without significant drops in performance. Note that all observations are in the range $[0,1]$.

Figure 15: Test MAE vs observation noise $\sigma_{u}$.

Appendix F

F.1 Model Predictions

We show predictions of the different models on the different datasets in Figure 16 (the synthetic data is partially observed).

Figure 16: Predictions from different models. Only forecasting is shown.

F.2 Visualization of Prediction Uncertainty

Figures 17, 18, and 19 demonstrate the prediction uncertainty across different samples from the posterior distribution.

Figure 17: Left: Test prediction for a single trajectory from the Shallow Water dataset. Shown are the data (top), the mean prediction (middle), and the standard deviation of the predictions (bottom). Columns show predictions at consecutive time points. Right: Ground truth observations (dashed black) and predictions (colored) with standard deviation, plotted over time at two spatial locations.
Figure 18: Left: Test prediction for a single trajectory from the Navier-Stokes dataset. Shown are the data (top), the mean prediction (middle), and the standard deviation of the predictions (bottom). Columns show predictions at consecutive time points. Right: Ground truth observations (dashed black) and predictions (colored) with standard deviation, plotted over time at two spatial locations.
Figure 19: Left: Test prediction for a single trajectory from the Scalar Flow dataset. Shown are the data (top), the mean prediction (middle), and the standard deviation of the predictions (bottom). Columns show predictions at consecutive time points. Right: Ground truth observations (dashed black) and predictions (colored) with standard deviation, plotted over time at two spatial locations.