
Neural Integral Equations

Emanuele Zappala Idaho State University and Yale University, emanuelezappala@isu.edu Antonio Henrique de Oliveira Fonseca Yale University, antonio.fonseca@yale.edu Josue Ortega Caro Yale University, josue.ortegacaro@yale.edu Andrew Henry Moberly Yale University, andrew.moberly@yale.edu Michael James Higley Yale University, m.higley@yale.edu Jessica Cardin Yale University, jess.cardin@yale.edu David van Dijk Yale University, david.vandijk@yale.edu
Abstract

Nonlinear operators with long-distance spatiotemporal dependencies are fundamental in modeling complex systems across sciences, yet learning these nonlocal operators remains challenging in machine learning. Integral equations (IEs), which model such nonlocal systems, have wide-ranging applications in physics, chemistry, biology, and engineering. We introduce Neural Integral Equations (NIE), a method for learning unknown integral operators from data using an IE solver. To improve scalability and model capacity, we also present Attentional Neural Integral Equations (ANIE), which replaces the integral with self-attention. Both models are grounded in the theory of second-kind integral equations, where the indeterminate appears both inside and outside the integral operator. We provide theoretical analysis showing how self-attention can approximate integral operators under mild regularity assumptions, further deepening previously reported connections between transformers and integration, and deriving corresponding approximation results for integral operators. Through numerical benchmarks on synthetic and real-world data — including Lotka-Volterra, Navier-Stokes, and Burgers’ equations, as well as brain dynamics and integral equations — we showcase the models’ capabilities and their ability to derive interpretable dynamics embeddings. Our experiments demonstrate that ANIE outperforms existing methods, especially for longer time intervals and higher-dimensional problems. Our work addresses a critical gap in machine learning for nonlocal operators and offers a powerful tool for studying unknown complex systems with long-range dependencies.

1 Introduction

Integral equations (IEs) are functional equations where the indeterminate function appears under the sign of integration [SRH+81]. The theory of IEs has a long history in pure and applied mathematics, dating back to the 1800s, and is thought to have started with Fourier’s theorem [Gro07]. Another early application of IEs was Dirichlet’s problem (a PDE), which was originally solved through its integral formulation. Subsequent studies, carried out by Fredholm, Volterra, and Hilbert and Schmidt, significantly contributed to the establishment of this theory. IEs appear in many applications ranging from physics and chemistry to biology and engineering [Waz11, Gro07], for instance in potential theory, diffraction, and inverse problems such as scattering in quantum mechanics [Waz11, Gro07, Lak95]. Neural field equations, which model brain activity, can be described using IEs and integro-differential equations (IDEs) due to their highly non-local nature [Ama77]. IEs are related to the theory of ordinary differential equations (ODEs) and partial differential equations (PDEs); however, they possess unique properties. While ODEs and PDEs describe local behavior, IEs model global (long-distance) spatiotemporal relations. Moreover, ODEs and PDEs have IE forms that in certain circumstances can be solved more effectively and efficiently, due to the better stability properties of IE solvers compared to ODE and PDE solvers [Rok85, Rok90]. See also [GK98] for an example of a PDE system that is solved with high accuracy through an IE method.

Learning nonlocal operators for dynamics with long-distance relations is an open problem in deep learning. In this article, we introduce and address the problem of learning nonlocal dynamics from data through integral equations. Namely, we introduce the Neural Integral Equation (NIE) and the Attentional Neural Integral Equation (ANIE). Our setup is that of an operator learning problem, where we learn the integral operator that generates dynamics fitting the given data. Often, one has observations of a dynamical system without knowing its analytical form. Our approach permits modeling the system purely from the observations. This model, via the learned integral operator, can be used to generate dynamics, as well as to infer the spatiotemporal relations that generated the data. The innovation of our proposed method lies in formulating the operator learning problem associated with dynamics as an optimization problem over the solutions of an IE obtained through an IE solver. Unlike other operator learning methods that learn dynamics as a mapping between function spaces at fixed time points, i.e. as a mapping T:\prod_{i}\mathcal{A}_{i}\longrightarrow\prod_{j}\mathcal{B}_{j}, where \mathcal{A}_{i} and \mathcal{B}_{j} are function spaces each representing a time coordinate, NIE and ANIE allow learning dynamics continuously, with arbitrary time resolution. Our solver outputs solutions through an iterative procedure [Waz11], which converges to a solution of the IE.

Figure 1: Diagrammatic representation of the model. The solver is initialized with f, also called the free function. This initialization is often the first time point of the dynamics. To solve the IE and find the solution \mathbf{y}, an iterative procedure is carried out in which, at each solver step k, the integral of G_{\theta}(\mathbf{y}^{k},x,t) is computed and used as the solution \mathbf{y}^{k+1} in the next step. Integration is done either with Monte Carlo (via torchquad) or with self-attention, representing NIE and ANIE respectively. The solver integration steps are repeated until convergence of \mathbf{y}^{k} to the solution of the IE. This solution is then compared to the input data to compute a loss that, via backpropagation, is used to find the \theta that minimizes the error. The resulting integral operator represents the IE that models the data. Top right, an example of the attention weights for calcium imaging dynamics is presented. Bottom right, an example of a dynamical embedding of the Navier-Stokes dataset colored by velocity is shown.

1.1 Our Contributions

In this article, we introduce Neural Integral Equations (NIE) and Attentional Neural Integral Equations (ANIE), which are novel neural network based methods for learning dynamics, in the form of IEs, from data. Our architectures allow modeling dynamics with long-distance spatiotemporal relations typical of non-local functional equations. Our main contributions are as follows:

  • We introduce a novel method for learning dynamics from data as solutions of IEs of the second kind through an IE solver.

  • We implement a fully differentiable IE solver in PyTorch (code available at: https://github.com/emazap7/ANIE).

  • We implement a highly scalable version of the solver where integration is done with a self-attention mechanism.

  • We derive theoretical results on convergence of the solver, and approximation capabilities of our models.

  • Our model provides explainable dynamics and meaningful embeddings of these dynamics.

  • Finally, we use our method to model and interpret non-local dynamics from brain activity recordings.

1.2 Background and related work

1.2.1 Integral equations in numerical analysis

Due to their wide range of applications, the theory of IEs has attracted the attention of mathematicians, physicists, and engineers for a long time. Detailed accounts of integral equations can be found in [Zem12, Waz11, Bôc26]. Along with their theoretical properties, much attention has been devoted to the development of efficient integral equation solvers, focusing on rapidly obtaining highly accurate solutions of certain PDE systems [Rok85, Rok90]. In fact, integral equation solvers are known to obtain more accurate solutions than differential solvers for a variety of ODEs and PDEs.

1.2.2 Operator learning

IE solvers are used to solve given equations through some iterative procedure, see e.g. [Waz11, DM88]. Moreover, machine learning approaches to solving given types of IEs have been implemented [GFZJ22, Que16, KD19, GLS+21, EB12]. In such cases, the IE is known, and we seek its solution. However, in practice we often do not have access to the analytical form of the equation, and we only have data sampled from a system. In such cases, we want to model the system by learning an operator that can reproduce it. This is the setting of operator learning problems, and several approaches to operator learning, including using deep learning, have been presented [KLL+21, LJP+21, LKA+21, LKA+20, Cao21, HWS+23, MKH+22, KLS24, PMB+22, BdBR+24, OOK+23, OSG+22]. Typical operator learning problems are formulated on finite grids (finite difference methods) that approximate the domain of the functions. In this case, recovering the continuous limit is a very challenging problem, and irregularly sampled data can completely alter the evaluation of the learned operator. Operator learning for integral equations has not been considered thus far, and it constitutes the main novelty of the present article; this novelty is realized by formulating the operator learning problem through an IE solver. The convenience of this approach lies in the capability of the solver to sample the domain of integration continuously, and in the capacity of integral equations to model very complex dynamics due to their highly non-local behavior. A similar approach for Integro-Differential Equations (IDEs) was followed in [ZFM+23]. However, in the present work our implementation does not include differential solvers, and the reformulation of such dynamical problems in terms of IEs has great benefits in terms of solver speed and stability. Moreover, our version of an IE solver that approximates integrals via self-attention allows for higher dimensional integrals than those considered in [ZFM+23].

1.2.3 Learning continuous dynamics

Modeling continuous dynamics from discretely sampled data is a fundamental task in data science. Methods for continuous modeling include those based on ODEs [CRBD18, CAN21]. While ODEs are useful for modeling temporal dynamics, they are fundamentally local equations which model neither spatial nor long-range temporal relations. The authors of [CRBD18] have employed auxiliary tools, such as RNNs, in order to include non-locality. We point out that RNNs can be seen as performing a temporal integration (in discrete steps), in order to codify some degree of non-local (temporal) dependence in the dynamics. In this work, we introduce a framework that provides a more general and formal solution to this non-local integration problem. Moreover, the dynamics are not produced sequentially with respect to time, as is done by ODE solvers, but are processed in parallel, thus providing increased efficiency, as we demonstrate experimentally.

1.2.4 Integration via self-attention

The self-attention mechanism and transformers were introduced in [VSP+17] and applied to machine translation tasks. Following their initial success, they have since been used in many other domains, including operator learning for dynamics [Cao21, GZ22]. Interestingly, the self-attention mechanism can be interpreted as the Nyström method for approximating integrals [XZC+21]. Making use of this connection, we approximate the integral kernel of our model using self-attention, allowing efficient integration over higher dimensions.

2 Neural integral equations

An IE (of Urysohn type) takes the general form given by

\mathbb{y}(t)=f(t)+\int_{\alpha(t)}^{\beta(t)}G(\mathbb{y}(s),t,s)ds, (1)

where the variable s is the local time used for integration for each t, which is the global time. Due to their fundamentally non-local behavior, integral equations have been used to model physical and biological phenomena, such as brain dynamics, virus spreading, and plasma physics [Waz11, Gro07, Ama77]. The case considered in this article, where the indeterminate function \mathbb{y}(t) appears both under the sign of integration and outside of it, is termed an equation of the second kind, as opposed to the first kind, where the indeterminate function appears only inside the integral operator. IEs of the second kind are more stable than those of the first kind, for reasons rooted in functional analysis, as explained in Appendix B.4.
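As a concrete elementary example of a second-kind equation, consider the linear Fredholm IE with separable kernel K(t,s)=ts and free function f(t)=t,

\mathbb{y}(t)=t+\lambda\int_{0}^{1}t\,s\,\mathbb{y}(s)ds.

Substituting the ansatz \mathbb{y}(t)=ct gives c=1+\lambda c/3, hence \mathbb{y}(t)=3t/(3-\lambda) for \lambda\neq 3. Note that the free function f determines the solution, playing a role analogous to that of an initial condition for an ODE.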

We introduce Neural Integral Equations (NIEs), a deep neural network model based on integral equations. NIEs are integral equations as defined by Equation 1, where G is a neural network, parameterized by \theta, and indicated by G_{\theta}. Training a NIE consists of optimizing G_{\theta} in such a way that the corresponding solution \mathbb{y} to Equation 1 fits the given data. At each step of the training, we perform two fundamental procedures. The first one is to solve the IE determined by G_{\theta}, and the second one is to optimize G_{\theta} in such a way that solving the corresponding IE produces a function that fits a given dataset. Details on the solver procedure and the training are given in Appendix D.
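A minimal sketch of this two-step training loop (solve the IE, then take a gradient step) is given below. The solver nie_solve, the network G_theta, the data loader, and the choice of the free function as the broadcast first time point are illustrative placeholders for the components described in Appendix D, not the released implementation.

import torch

def train_nie(G_theta, nie_solve, data_loader, epochs=100, lr=1e-3):
    """Outer training loop: solve the IE defined by the current G_theta, then take a
    gradient step so that the IE solution fits the observed trajectories."""
    optimizer = torch.optim.Adam(G_theta.parameters(), lr=lr)
    for _ in range(epochs):
        for t_grid, y_obs in data_loader:           # y_obs: (batch, T, d) observed trajectories
            f = y_obs[:, :1].expand_as(y_obs)       # free function from the first time point
            y_pred = nie_solve(G_theta, f, t_grid)  # differentiable IE solver (see Algorithm 1)
            loss = torch.nn.functional.mse_loss(y_pred, y_obs)
            optimizer.zero_grad()
            loss.backward()                         # backpropagate through the solver iterations
            optimizer.step()
    return G_theta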

IEs, in contrast to ODEs and PDEs, are non-local equations [SRH+81], since in order to evaluate the integral operator \int_{\alpha(t)}^{\beta(t)}G_{\theta}(\bullet,t,s)ds:\mathcal{A}\longrightarrow\mathcal{A} on a function \mathbf{y}, we need the value of \mathbf{y} over the full integration domain. In fact, to evaluate the RHS of Equation 1 at an arbitrary time point t, the function \mathbb{y}(s) between \alpha(t) and \beta(t) is needed. Here, \alpha and \beta are arbitrary functions; common choices include \alpha(t)=a and \beta(t)=b (called Fredholm equations), or \alpha(t)=0 and \beta(t)=t (called Volterra equations). Consequently, solving an integral equation requires an iterative procedure, based on the notion of Picard iterations (successive approximation method), where the solution is obtained as a sequence of approximations that converge to it. We refer the reader to Appendix B.3 for details on the solver implemented in this article, the theory upon which it is based, and the proofs regarding the convergence of our algorithms to a solution of the given IE (see Theorem B.1 and Corollary B.2). We also refer to [Waz11] for an elementary and computationally driven introduction to the theory behind the methods that motivate this procedure, and [DM88] for a more detailed account.

Interestingly, utilizing NIEs to model ODEs allows bypassing ODE solvers, such as those used in [CRBD18, CAN21]. The advantage of this approach is that the integral equation solver is more stable than the ODE solver [KR12]. ODE solver instabilities, induced by equation stiffness, have been previously considered in [GBD+20, FJNO20]. The IE solver presented in this work does not suffer from these problems, and is also significantly faster.
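For instance, the initial value problem \mathbb{y}^{\prime}(t)=F(\mathbb{y}(t),t) with \mathbb{y}(0)=\mathbb{y}_{0} is equivalent, by integrating both sides, to the Volterra IE of the second kind

\mathbb{y}(t)=\mathbb{y}_{0}+\int_{0}^{t}F(\mathbb{y}(s),s)ds,

so applying the IE solver to this special case recovers the ODE dynamics without stepping sequentially through time.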

It is often useful to consider a more specific form for IEs, where the function G factors as the product of a kernel K and a generally non-linear function F, i.e. G(\mathbf{y},t,s)=K(t,s)F(\mathbf{y}). Here, K is matrix-valued and carries the dependence on time (both t and s), while F depends only on the indeterminate function \mathbf{y}. The form of this IE is therefore:

\mathbb{y}(t)=f(t)+\int_{\alpha(t)}^{\beta(t)}K(t,s)F(\mathbb{y}(s))ds. (2)

NIEs in this form comprise two neural networks, namely K and F. We observe that in IEs, the initial condition is embedded in the equation itself, and it is not an arbitrary value to be specified as an extra condition. To solve the IE we implement a solver that performs an iterative procedure to obtain a solution, see Appendix B.3 and Appendix D. During the iterations, Monte Carlo sampling is performed to evaluate the integrals. This procedure allows our deep learning model to be independent of the temporal grid points, therefore resulting in a continuous model, since the model internally uses randomly sampled points to generate the successive iterations, as opposed to using fixed grid points. The general algorithm for training NIE is given in Algorithm 1, and a diagrammatic overview of it is shown in Figure 1. See also Figure 2 for a visualization of the general solving procedure.

Algorithm 1 NIE method training step. Integration is performed using the torchquad module, with the Monte Carlo method.
1: \mathbf{y}_{0}(t) \triangleright Initialization
2: \mathbf{y}(t) \triangleright Solution to IE with initial \mathbf{y}_{0}(t)
3: \mathbf{y}^{0}(t):=\mathbf{y}_{0}(t) \triangleright Initial solution guess
4: while {\rm iter}\leq{\rm maxiter} and {\rm error}>{\rm tolerance} do
5:     Evaluate: \mathbf{y}^{i+1}(t)=f(\mathbf{y}^{i},t)+\int_{\alpha(t)}^{\beta(t)}G(t,s,\mathbf{y}^{i}(s))ds
6:     Set solution to be: r\mathbf{y}^{i}+(1-r)\mathbf{y}^{i+1}
7:     New error: {\rm error}={\rm metric}(\mathbf{y}^{i+1},\mathbf{y}^{i})
8: end while
9: Output of solver: \mathbf{y}(t)
10: Compute loss wrt observations: {\rm loss}(\mathbf{y}(t),{\rm obs})
11: Gradient descent step
Figure 2: Diagrammatic representation of the IE solver procedure. The solver is initialized with the free function \mathbf{y}^{0}:=f. The integral operator is applied to \mathbf{y}^{0}, and a new guess \mathbf{y}^{1} is obtained. This is repeated until convergence to a solution. The left panel shows the solution as a function of solver steps. The right panel shows the error of the solution as a function of solver steps.
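To make Algorithm 1 concrete, the sketch below is a minimal, purely temporal Volterra-type version of the solver (it could serve as the nie_solve placeholder used in the training sketch above), with the integral estimated by plain Monte Carlo sampling in PyTorch. The piecewise-constant lookup of the iterate, the tensor shapes, and the omission of the batch dimension are simplifying assumptions; the released implementation is in the repository linked above.

import torch

def nie_solve(G_theta, f, t_grid, n_mc=100, max_iter=10, tol=1e-4, r=0.5):
    """Successive approximation (Picard) sketch for a Volterra-type NIE,
    y(t) = f(t) + integral over [0, t] of G_theta(y(s), t, s) ds, with Monte Carlo quadrature.
    f: (T, d) free function on the sorted 1D grid t_grid; G_theta(y, t, s) -> (n_mc, d)."""
    y = f.clone()                                         # y^0 := f
    for _ in range(max_iter):
        y_new = torch.empty_like(y)
        for i, t in enumerate(t_grid):
            s = torch.rand(n_mc) * t                      # uniform samples in [0, t]
            idx = torch.searchsorted(t_grid, s).clamp(max=len(t_grid) - 1)
            y_s = y[idx]                                  # lookup of the current iterate at the sampled times
            integrand = G_theta(y_s, t.expand(n_mc), s)   # (n_mc, d)
            y_new[i] = f[i] + t * integrand.mean(dim=0)   # MC estimate of the integral over [0, t]
        y_next = r * y + (1 - r) * y_new                  # relaxed update (line 6 of Algorithm 1)
        err = (y_next - y).norm() / (y.norm() + 1e-12)    # convergence metric (line 7)
        y = y_next
        if err < tol:
            break
    return y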

2.1 Space, time and higher dimensional integration

IEs can have multiple space dimensions in addition to time. Such equations are formulated as

\mathbb{y}(\mathbb{x},t)=f(\mathbb{x},t)+\int_{\Omega}\int_{\alpha(t)}^{\beta(t)}G(\mathbb{y}(\mathbb{x}^{\prime},s),\mathbb{x},\mathbb{x}^{\prime},t,s)d\mathbb{x}^{\prime}ds, (3)

where \Omega\subset\mathbb{R}^{n} is a domain in \mathbb{R}^{n}, and \mathbb{y}:\Omega\times I\longrightarrow\mathbb{R}^{m}, for some interval I\subset\mathbb{R}. More commonly, in the literature, one finds a simpler case of higher dimensional IEs, where the integral component \int_{\Omega}\int_{\alpha(t)}^{\beta(t)}G(\mathbb{y}(\mathbb{x}^{\prime},s),\mathbb{x},\mathbb{x}^{\prime},t,s)d\mathbb{x}^{\prime}ds is obtained as a sum of terms with only partial integrations. Such an equation takes the form

\mathbb{y}(\mathbb{x},t)=f(\mathbb{x},t)+\int_{\alpha(t)}^{\beta(t)}G_{1}(\mathbb{y}(\mathbb{x},s),\mathbb{x},t,s)ds+\int_{\Omega}G_{2}(\mathbb{y}(\mathbb{x}^{\prime},t),\mathbb{x},\mathbb{x}^{\prime},t)d\mathbb{x}^{\prime}. (4)

These equations are the integral counterpart of PDEs, similarly to the relation between one-dimensional IEs and ODEs, and they are called Partial Integral Equations (PIEs). With slight abuse of notation we will still refer to Equation 3 as a PIE, as we will in practice use such an approach to model PDEs in the case of Burgers’ equation and the Navier-Stokes equation.

2.2 Attentional neural integral equations

Training of NIE requires an integration step at each time point, incurring a potentially high computational cost. This integration step is implemented using the torchquad package [GTM21], a high performance numerical Monte Carlo integration method, resulting in fast integration and high scalability of NIE. For example, solving ODEs using NIE is significantly faster than using traditional ODE solvers (Table 5). However, several limitations are associated with the torchquad integration method. In particular, torchquad requires a rapidly increasing number of sampled points as the number of dimensions grows. To use NIE for solving PDEs and (P)IEs we require efficient spatial integration in high dimensions.
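For reference, a minimal usage sketch of torchquad's Monte Carlo integrator is shown below; the toy integrand g, the sample budget, and the exact call signature (taken from the torchquad documentation) are assumptions about the installed version rather than part of our method.

import torch
from torchquad import MonteCarlo, set_up_backend

set_up_backend("torch", data_type="float32")   # enable the PyTorch backend (GPU if available)

def g(x):
    # toy integrand over a 3D domain; torchquad passes sample points of shape (N, 3)
    return torch.exp(-x.pow(2).sum(dim=1))

mc = MonteCarlo()
# Monte Carlo needs many more samples as the dimension grows, which is the
# scalability issue that motivates replacing this quadrature with self-attention.
value = mc.integrate(g, dim=3, N=20000, integration_domain=[[0.0, 1.0]] * 3)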

To address these challenges, we have employed an approach to NIE where the integral operator is based on a self-attention mechanism. In fact, self-attention can be viewed as an approximation of an integration procedure [TBY+19, XZC+21], where the product of queries and keys plays the role of a kernel, such as the one discussed in Section 2. In [Cao21] the parallelism between self-attention and integration of kernels was further explored to interpret transformers as Galerkin projections in operator learning tasks.

We have replaced the analytical integral \int_{\alpha(t)}^{\beta(t)}G(t,s,\mathbb{y}(s))ds in Equation 1 with a self-attention procedure. The resulting model, which we call Attentional Neural Integral Equation (ANIE), follows the same principle of iterative IE solving presented in Section 2, but with the neural networks K and F replaced by attention matrices. It can be shown, see Appendix B.3, that the successive approximation method is still applicable in this case to obtain a solution for the corresponding equation. Observe, following the comparison between integration and self-attention, that K is decomposed into the product of queries and keys, as described for instance in [Cao21]. The interval of integration [\alpha(t),\beta(t)] is determined, in the attentional approximation, by means of the mask. In particular, if there is no mask we have a Fredholm IE, while the causal attention mask [YZQC21] corresponds to a Volterra type of IE.
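To make this correspondence concrete, the hedged sketch below treats a single attention head as a quadrature rule: the softmax of queries times keys plays the role of the kernel K(t,s), the values play the role of F(\mathbb{y}(s)), and a causal (lower-triangular) mask restricts the effective integration domain from the full interval (Fredholm) to [0,t] (Volterra). The single-head form and the shapes are simplifying assumptions.

import torch

def attention_as_integral(y, W_q, W_k, W_v, causal=False):
    """Single-head self-attention viewed as a kernel quadrature.
    y: (T, d) current iterate sampled at T time points; W_q, W_k, W_v: (d, d) weights."""
    Q, Keys, V = y @ W_q, y @ W_k, y @ W_v
    scores = Q @ Keys.T / Keys.shape[-1] ** 0.5   # unnormalized kernel K(t, s)
    if causal:
        # Volterra-type operator: each t only integrates over s <= t
        mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)       # row-normalized quadrature weights
    return weights @ V                            # approximates the integral of K(t, s) F(y(s)) ds at each t

# No mask corresponds to a Fredholm IE; causal=True corresponds to a Volterra IE.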

An iterative procedure similar to the one discussed in Algorithm 1 is implemented to solve the corresponding IE, see Appendix B.3 and Appendix D.2. During iterations, we sample points uniformly from the spatiotemporal domain, and the corresponding integral operator does not depend on the grid points of the dataset. Our experiments on the Burgers’ dataset (Section 3.1) show that our model is stable with respect to changes of the spatiotemporal stamps, since the model internally uses randomly sampled points to generate the successive iterations, rather than the fixed grid points. A detailed description of the integration procedure, along with solver steps and training for ANIE, is given in Appendix D.2. Moreover, Theorem B.1, Corollary B.2 and Remark B.3 show that the solver procedure converges to a solution under certain mild assumptions.

Algorithm 2 summarizes the solving and training procedures for ANIE. A detailed description of the meaning of \mathfrak{Att} is found in Appendix D.2. Theoretical considerations on Fredholm generalized equations with general operators, integral operator approximation through self-attention, and existence of the solutions for these equations are given in Appendix B.4. Figure 13 gives a diagrammatic representation of the integration procedure implemented in this article, and Figure 14 gives a schematic representation of the solver procedure with space and time.

Algorithm 2 ANIE method training step. Integration here is replaced by a transformer employing self-attention.
1: \mathbf{y}_{0}(\mathbf{x},t) \triangleright Initialization
2: \mathbf{y}(\mathbf{x},t) \triangleright Solution to IE with initial \mathbf{y}_{0}(\mathbf{x},t)
3: \mathbf{y}^{0}(\mathbf{x},t):=\mathbf{y}_{0}(\mathbf{x},t) \triangleright Initial solution guess
4: while {\rm iter}\leq{\rm maxiter} and {\rm error}>{\rm tolerance} do
5:     Concatenate space and time tokens to \mathbf{y}^{i}(\mathbf{x},t): \tilde{\mathbf{y}}^{i}(\mathbf{x},t)={\rm concat}(\mathbf{y}^{i}(\mathbf{x},t),s,t)
6:     Evaluate with self-attention: \mathbf{y}^{i+1}(\mathbf{x},t)=f(\tilde{\mathbf{y}}^{i},t)+\mathfrak{Att}(\tilde{\mathbf{y}}^{i}(\mathbf{x},t))
7:     Set solution to be: r\mathbf{y}^{i}+(1-r)\mathbf{y}^{i+1}
8:     New error: {\rm error}={\rm metric}(\mathbf{y}^{i+1},\mathbf{y}^{i})
9: end while
10: Output of solver: \mathbf{y}(\mathbf{x},t)
11: Compute loss wrt observations: {\rm loss}(\mathbf{y}(\mathbf{x},t),{\rm obs})
12: Gradient descent step
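A condensed sketch of the ANIE solver loop of Algorithm 2 is shown below; att_integral stands for any attention-based integral operator (for instance the single-head sketch above), and the relaxation parameter r, the convergence metric, and the tensor shapes are illustrative assumptions rather than the exact released code.

import torch

def anie_solve(att_integral, f, coords, max_iter=10, tol=1e-4, r=0.5):
    """Iterate y^{i+1} = f + Att(concat(y^i, coords)) until convergence (Algorithm 2).
    f: (N, d) free function on N spacetime points; coords: (N, k) space-time tokens;
    att_integral maps (N, d + k) -> (N, d)."""
    y = f.clone()                                  # y^0 := f
    for _ in range(max_iter):
        y_tilde = torch.cat([y, coords], dim=-1)   # concatenate space and time tokens (line 5)
        y_new = f + att_integral(y_tilde)          # self-attention replaces the integral (line 6)
        y_next = r * y + (1 - r) * y_new           # relaxed update (line 7)
        err = (y_next - y).norm() / (y.norm() + 1e-12)
        y = y_next
        if err < tol:
            break
    return y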

3 Experiments

3.1 Modeling PDEs with IEs: Burgers’ and Navier-Stokes equations

PDEs can be reformulated as IEs in several circumstances, and dynamics generated by differential operators can therefore be modeled through ANIE as a PIE, where integration is performed in space and time. We consider two well known types of PDEs, namely the Burgers’ equation and the Navier-Stokes equation. Since NIE is implemented only for time integration, we use only ANIE in these experiments, which allows for efficient space and time integration. We observe that our implementation of Algorithm 2 applied to the case of the Navier-Stokes equation closely parallels the IE method employed in [GK98], with the main difference that we learn the Green’s function through gradient descent, since no knowledge of the underlying Navier-Stokes equations is assumed.

Table 1: Left, benchmark on the Navier-Stokes equation. We evaluate the models on predicting dynamics of different lengths (t=3,5,10,20) for unseen initial conditions. The models that use a single time point are ANIE (ours), FNO2D, ViT [DBK+20], ViTsmall [LLS21] and ViTparallel [TCEN+22], while the convolutional LSTM, FNO3D, and ViT3D all use more time points (2, 10 and 2, respectively) to predict the rest of the dynamics. ANIE also outperforms the models that use more data points for initialization. Right, benchmark on the Burgers’ equation with different time intervals t=10,15,25 and space resolutions s=256,512, where a time interpolation task is performed. The symbol “-” indicates models that were not suitable for certain experiments (e.g. wrong dimensionality), while NA indicates models that did not converge, or did not fit in memory.
              Navier-Stokes                          Burgers’
              t=3      t=5      t=10     t=20        t=10              t=15              t=25
                                                     s=256    s=512    s=256    s=512    s=256    s=512
LSTM          .1384    .2337    .1422    .2465       -        -        -        -        -        -
ResNet        -        -        -        -           .0295    .0309    .0280    .0232    .0194    .0204
Conv1DLSTM    -        -        -        -           .0132    .0133    .0132    .0136    .0124    .0134
Conv2DLSTM    .4935    .4393    .3931    .2999       -        -        -        -        -        -
FNO1D         -        -        -        -           .0088    .0088    .0087    .0087    .0083    .0086
Galerkin      -        -        -        -           .0525    NA       .0521    NA       .0518    NA
FNO2D         .2795    .2724    NA       NA          -        -        -        -        -        -
FNO3D         NA       NA       .1751    .0701       -        -        -        -        -        -
ViT           .1093    .0877    .2473    .2367       .0430    .0423    .0423    .0422    .0422    .0424
ViTsmall      .0926    .0702    .0677    .0655       .0429    .0429    .0426    .0427    .0417    .0424
ViTparallel   .2901    .2660    .2475    .2368       .0433    .0702    .0573    .0861    .0435    .0700
ViT3D         .0360    .0365    .0433    .0406       -        -        -        -        -        -
ANIE (ours)   .0194    .0220    .0193    .0117       .0024    .0026    .0024    .0024    .0022    .0023

For the Burgers’ equation, we focus on the ability of ANIE to model both space and time continuously, and we therefore perform an interpolation task, where the model outputs time points that are not included in the training set, for unseen initial conditions. This is in contrast to [Cao21, LKA+21], where a “static” Burgers’ equation was considered in which the learned operator maps the initial condition (t=0) to the final time (t=1), thus treating time as a discrete two point set. In our approach, we model the system continuously over a time interval and randomly sample points during iterations to perform the quadrature of the temporal integrals. In this experiment, the Galerkin model [Cao21] was not included for the higher spatial dimension setting because the amount of memory required exceeded what was available to us during the experiments. The results are reported in Table 1 (right side), and an example of the learned dynamics is given in Figure 4.

For the Navier-Stokes equation, we consider an extrapolation task where we evaluate the model on unseen initial conditions. Previous works have shown high performance in predicting Navier-Stokes dynamics from new initial conditions, but they require several frames (i.e. several time points) to be fed into the model in order to achieve such performance. Since ANIE learns the full dynamics from arbitrarily chosen initial conditions, we achieve good performance even when a single initial condition is used to initialize the system. We train FNO2D, ViT, ViTsmall and ViTparallel with initialization on a single time point, while convolutional LSTM, FNO3D, and ViT3D are trained with 2, 2 and 10 time points for initialization, respectively. The results are given in Table 1 (left side). We note that ANIE also outperforms the models that use more data points for initialization. FNO2D did not converge for higher numbers of time points, and therefore results for time points t=10 and t=20 have not been reported, while for FNO3D we have conducted the experiments only for t=10 and t=20, since using fewer points for the time dimension would have effectively reduced FNO3D to FNO2D. An example of the Navier-Stokes dynamics is given in Figure 3.

Figure 3: Example dynamics of the (2+1)D Navier-Stokes system, where the model is initialized only with the first frame of the dynamics. Ground truth data is given at the bottom. Along with the final prediction (Step 7), the subsequent solver guesses are shown. The error during the solution generation is reported on the right. The figure also shows that the solver converges when producing the final output, cf. Figure 12.

3.2 Modeling brain dynamics using ANIE

Brain activity can be modeled as a spatiotemporal dynamical system [VGSS20]. While most connections between neurons are localized in space, there are a significant number of interactions that are long-range [ERML+13]. As such, brain dynamics can be modeled using integral equations [Ama77] which, unlike PDEs, allow for non-local interactions. Since ANIE allows efficient learning of integral operators from data, we demonstrate the ability of ANIE to learn non-local brain dynamics from functional Magnetic Resonance Imaging (fMRI) recordings.

To obtain fMRI data that has arbitrary time duration as well as unlimited trials, we make use of neurolib [CJO21], an fMRI simulation package. The data provided by this tool permit more extensive comparisons and greater statistical power. neurolib simulates whole-brain activity using a system of delay differential equations, which are non-local equations, thus allowing testing of ANIE’s ability to model non-local systems. Here we show the performance of ANIE and other models in modeling data generated by neurolib. Details about data generation and preprocessing can be found in Appendix F.4.

The generated fMRI data comprises neural activity for 80 nodes localized across the cortex. The first half of the data is used for training and the second half is used for testing. For training, the data are divided into segments of 20 time points, where the first time point is used as the initial condition, and the loss is computed over all 20 points. As such, the models are trained as an initial condition problem. During inference, the models are given points from the test set as new initial conditions and asked to extrapolate for the following 19 points. The mean error per point for 200 new initial conditions is shown in Figure 7 and summarized in Table 2. Figure 5 shows the data and model per fMRI recording node over time. We show that ANIE has significantly better performance than other benchmarked methods for medium (t=10) and long (t=20) time step predictions, demonstrating its ability to model non-local dynamics. For shorter and more localized dynamics (t=5), FNO1D shows better performance, which can be explained by the fact that FNO1D outputs the average of the initial points provided as the prediction for the first 5 time steps.
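For clarity, a minimal sketch of the windowing used to pose this as an initial-condition problem is shown below; the array layout and variable names are assumptions matching the description above, not the exact preprocessing code.

import numpy as np

def make_segments(activity, seg_len=20):
    """Split a (T, nodes) recording into non-overlapping segments of seg_len time points.
    The first point of each segment is the initial condition; during training the loss is
    computed on the full segment, and at test time the remaining seg_len - 1 points are
    extrapolated from the initial condition."""
    n_seg = activity.shape[0] // seg_len
    segments = activity[: n_seg * seg_len].reshape(n_seg, seg_len, -1)
    initial_conditions = segments[:, :1]     # (n_seg, 1, nodes)
    targets = segments                       # (n_seg, seg_len, nodes)
    return initial_conditions, targets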

3.3 Interpretable dynamics

In addition to modeling and generating new dynamics, it is useful to get insight into the underlying process that generates the dynamics. For example, in neuroscience, a major goal is to understand how specific brain activity patterns give rise to cognition, learning, and behavior. To explore the interpretability of ANIE we carry out two experiments. For the first experiment, we augment the spacetime integration domain with a CLS token [DCLT18], such that each dynamics is projected into a single vector. This vector can then be related to specific properties of the dynamics. Specifically, we embed these vectors for different Navier-Stokes dynamics and find that the resulting manifold (projected using PCA) has a highly nonrandom structure. This is in contrast to the projection of the raw data (see Figure 6). To further explore the resulting dynamics manifold, we color it by the velocities of the dynamics, a property that was not explicitly seen by the model during training. We find that the manifold highly correlates with velocity whereas the embedding of the raw data has no such correlation. To quantify this we compute the kNN regression error on the embeddings with respect to the velocities and find that the embedding obtained from ANIE has significantly lower error (see Table 3).

For the second experiment, we inspect the attention weights of the model when predicting brain dynamics (calcium imaging, Appendix G), in order to infer which cortical loci drive neuronal dynamics. Figure 11 shows that the motor and visual cortices are the areas of the brain with the highest attention values. We note that the attention plots are not directly correlated with the brain activity inputs, suggesting that they point to new information about the data. To validate this, we compare the performance of predicting the visual stimulus, which was not explicitly provided to the model, from either the raw data or the attention values using a kNN-regressor (k=3) (see Appendix G). In Table 4 we show that the attention weights significantly (p=0.035) outperform the raw data, thus demonstrating that ANIE can provide insights into the modeled dynamics.
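A hedged sketch of this comparison is shown below, using scikit-learn's k-nearest-neighbor regressor with k=3 and cross-validation; the flattening of the per-frame features and the variable names are illustrative assumptions.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

def stimulus_r2(features, stimulus_contrast, k=3, folds=10):
    """R^2 of a kNN regressor predicting the visual-stimulus contrast from either the
    raw frames or the model's attention weights (both flattened per frame)."""
    X = features.reshape(features.shape[0], -1)
    model = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(model, X, stimulus_contrast, cv=folds, scoring="r2")
    return scores.mean(), scores.std()

# Compare, e.g., stimulus_r2(raw_frames, contrast) against stimulus_r2(attention_maps, contrast).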

3.4 Further experiments

In Appendix A we include several additional experiments: on the training speed of ANIE, showing that it is significantly faster than ODE-solver-based models; on the hyperparameter sensitivity of the model; and on the modeling of IE dynamics; along with further tables and figures on the experiments of Subsections 3.1, 3.2 and 3.3. In Appendix A.5 we explore the convergence of the solver to fixed points of the corresponding IE.

4 Limitations

An important consideration relates to the theoretical guarantees for convergence of the solver as considered in Appendix B.3. In fact, enforcing contractivity of the integral operator might be needed for certain types of applications as seen in Appendix B.3. However, we mention that Lipschitz constraints (in particular contractivity) have been long studied in machine learning, including the case of transformers [KPM21, QWC+23]. We therefore expect that in cases when the solver is unstable, and convergence is problematic, one can enforce convergence using a Lipschitz constraint.
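As an illustration of how such a constraint could be imposed in practice, the sketch below applies spectral normalization to the layers of a simple G_theta network, which controls its Lipschitz constant; treating this as sufficient for contractivity of the full integral operator is an assumption that depends on the integration domain and scaling, as discussed in Appendix B.3.

import torch.nn as nn
from torch.nn.utils import spectral_norm

def make_lipschitz_G(in_dim, hidden, out_dim):
    """G_theta with spectral-normalized linear layers: each weight is rescaled to have
    approximately unit spectral norm (estimated by power iteration), so the network's
    Lipschitz constant is controlled, which helps keep the learned integral operator
    contractive (up to the scaling induced by the integration domain)."""
    return nn.Sequential(
        spectral_norm(nn.Linear(in_dim, hidden)),
        nn.Tanh(),                                 # 1-Lipschitz nonlinearity
        spectral_norm(nn.Linear(hidden, out_dim)),
    )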

5 Conclusions

We have presented a neural network-based integral equation solver that can learn an integral operator and its associated dynamics from data, and capture non-local spatiotemporal interactions. We have demonstrated the ability of our method to learn ODE, PDE, and IE systems better than other dynamics models, in terms of both predictive accuracy and scalability. We have studied the approximation capabilities of NIE and ANIE, and obtained theoretical guarantees for the convergence of the IE solver, enhancing the understanding of this deep-learning approach. Moreover, we have shown that ANIE is interpretable and produces meaningful dynamics embeddings, as demonstrated in our applications to the Navier-Stokes equation and brain activity dynamics. Since IEs can be used to model many real-world systems, we anticipate that NIE and ANIE will have broad applications in science and engineering.

6 Data availability statement

We have included descriptions regarding how to reproduce all synthetic datasets used in Figures 3, 4, 5 and 8, as well as Table 1. All synthetic datasets are available at https://github.com/emazap7/ANIE. The datasets can also be accessed at https://figshare.com/articles/dataset/IE_spirals/25606242 for integral equation spirals, https://figshare.com/articles/dataset/Burgers_1k_t400/25606149 for Burgers data, https://figshare.com/articles/dataset/Navier_Stokes_Dataset_mat/25606152 for Navier-Stokes data, and https://figshare.com/articles/dataset/fMRI_data/25606272 for the simulated fMRI data. Lotka-Volterra and Lorenz system datasets can be generated using the notebooks found at https://github.com/emazap7/ANIE. The calcium imaging dataset is not available open source. We have included a detailed account of the techniques used in [BHS+20, HSF+20] (see Appendix G), where information on how to obtain the dataset is included.

7 Code availability statement

All codes are available at: https://github.com/emazap7/ANIE. Detailed installation instructions are found there. Jupyter notebooks for training and testing of the models for the main experiments are provided in the repository. Pre-trained models are accessible directly, and instructions on how to run the notebooks are included as comments throughout the notebooks. The main codes for the models, along with the experiments, are found in the directory “IE_source”.

8 Author contributions

EZ conceived the algorithmic framework, obtained the theoretical results, and contributed to the numerical experiments. AF and JOC contributed to the numerical experiments. AM, MH and JC provided the calcium imaging data. DvD led the study and conceived the algorithmic framework. EZ, AF, JOC and DvD contributed to writing the article.

References

  • [Ama77] Shun-ichi Amari. Dynamics of pattern formation in lateral-inhibition type neural fields. Biological cybernetics, 27(2):77–87, 1977.
  • [BdBR+24] Francesca Bartolucci, Emmanuel de Bezenac, Bogdan Raonic, Roberto Molinaro, Siddhartha Mishra, and Rima Alaifari. Representation equivalent neural operators: a framework for alias-free operator learning. Advances in Neural Information Processing Systems, 36, 2024.
  • [BHS+20] Daniel Barson, Ali S Hamodi, Xilin Shen, Gyorgy Lur, R Todd Constable, Jessica A Cardin, Michael C Crair, and Michael J Higley. Simultaneous mesoscopic and two-photon imaging of neuronal activity in cortical circuits. Nature methods, 17(1):107–113, 2020.
  • [Bôc26] Maxime Bôcher. An introduction to the study of integral equations. Number 10. University Press, 1926.
  • [BP72] Edward R Benton and George W Platzman. A table of solutions of the one-dimensional burgers equation. Quarterly of Applied Mathematics, 30(2):195–212, 1972.
  • [Bra69] HJ Brascamp. The fredholm theory of integral equations for special types of compact operators on a separable hilbert space. Compositio Mathematica, 21(1):59–80, 1969.
  • [BRSS17] Małgorzata Borówko, Wojciech Rżysko, Stefan Sokołowski, and Tomasz Staszewski. Integral equations theory for two-dimensional systems involving nanoparticles. Molecular Physics, 115(9-12):1065–1073, 2017.
  • [CAN21] Ricky T. Q. Chen, Brandon Amos, and Maximilian Nickel. Learning neural event functions for ordinary differential equations. International Conference on Learning Representations, 2021.
  • [Cao21] Shuhao Cao. Choose a transformer: Fourier or galerkin. Advances in neural information processing systems, 34:24924–24940, 2021.
  • [Cho68] Alexandre Joel Chorin. Numerical solution of the navier-stokes equations. Mathematics of computation, 22(104):745–762, 1968.
  • [CJO21] Caglar Cakan, Nikola Jajcay, and Klaus Obermayer. neurolib: a simulation framework for whole-brain neural mass modeling. Cognitive Computation, pages 1–21, 2021.
  • [CRBD18] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.
  • [DAK23] Waleed Diab and Mohammed Al-Kobaisi. U-deeponet: U-net enhanced deep operator network for geologic carbon sequestration. arXiv preprint arXiv:2311.15288, 2023.
  • [Dav60] Harold Thayer Davis. Introduction to nonlinear differential and integral equations. US Atomic Energy Commission, 1960.
  • [DBK+20] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [DKM+20] Talgat Daulbaev, Alexandr Katrutsa, Larisa Markeeva, Julia Gusak, Andrzej Cichocki, and Ivan Oseledets. Interpolation technique to speed up gradients propagation in neural odes. Advances in Neural Information Processing Systems, 33:16689–16700, 2020.
  • [DM88] Leonard Michael Delves and Julie L Mohamed. Computational methods for integral equations. CUP Archive, 1988.
  • [DR07] Philip J Davis and Philip Rabinowitz. Methods of numerical integration. Courier Corporation, 2007.
  • [EB12] Sohrab Effati and Reza Buzhabadi. A neural network approach for solving fredholm integral equations of the second kind. Neural Computing and Applications, 21(5):843–852, 2012.
  • [ERML+13] Mária Ercsey-Ravasz, Nikola T Markov, Camille Lamy, David C Van Essen, Kenneth Knoblauch, Zoltán Toroczkai, and Henry Kennedy. A predictive network model of cerebral cortical connectivity based on a distance rule. Neuron, 80(1):184–197, 2013.
  • [Fef00] Charles L Fefferman. Existence and smoothness of the navier-stokes equation. The millennium prize problems, 57:67, 2000.
  • [FJNO20] Chris Finlay, Jörn-Henrik Jacobsen, Levon Nurbekyan, and Adam Oberman. How to train your neural ode: the world of jacobian and kinetic regularization. In International conference on machine learning, pages 3154–3164. PMLR, 2020.
  • [GBD+20] Arnab Ghosh, Harkirat Behl, Emilien Dupont, Philip Torr, and Vinay Namboodiri. Steer: Simple temporal regularization for neural ode. Advances in Neural Information Processing Systems, 33:14831–14843, 2020.
  • [GFZJ22] Yu Guan, Tingting Fang, Diankun Zhang, and Congming Jin. Solving fredholm integral equations using deep learning. International Journal of Applied and Computational Mathematics, 8(2):1–10, 2022.
  • [GIKM10] Yurii N Grigoriev, Nail H Ibragimov, Vladimir F Kovalev, and Sergey V Meleshko. Symmetries of integro-differential equations: with applications in mechanics and plasma physics, volume 806. Springer, 2010.
  • [GK98] Leslie Greengard and Mary Catherine Kropinski. An integral equation approach to the incompressible navier–stokes equations in two dimensions. SIAM Journal on Scientific Computing, 20(1):318–336, 1998.
  • [GLS+21] Rui Guo, Zhichao Lin, Tao Shan, Maokun Li, Fan Yang, Shenheng Xu, and Aria Abubakar. Solving combined field integral equation with deep neural network for 2-d conducting object. IEEE Antennas and Wireless Propagation Letters, 20(4):538–542, 2021.
  • [Gro07] Charles W Groetsch. Integral equations of the first kind, inverse problems and regularization: a crash course. In Journal of Physics: Conference Series, volume 73, page 012001. IOP Publishing, 2007.
  • [GTM21] Pablo Gómez, Håvard Hem Toftevaag, and Gabriele Meoni. torchquad: Numerical integration in arbitrary dimensions with pytorch. Journal of Open Source Software, 6(64):3439, 2021.
  • [GZ22] Nicholas Geneva and Nicholas Zabaras. Transformers for modeling physical systems. Neural Networks, 146:272–289, 2022.
  • [HSF+20] Ali S Hamodi, Aude Martinez Sabino, N Dalton Fitzgerald, Dionysia Moschou, and Michael C Crair. Transverse sinus injections drive robust whole-brain expression of transgenes. Elife, 9:e53639, 2020.
  • [HSW89] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • [HWS+23] Zhongkai Hao, Zhengyi Wang, Hang Su, Chengyang Ying, Yinpeng Dong, Songming Liu, Ze Cheng, Jian Song, and Jun Zhu. Gnot: A general neural operator transformer for operator learning. In International Conference on Machine Learning, pages 12556–12569. PMLR, 2023.
  • [KBJD20] Jacob Kelly, Jesse Bettencourt, Matthew J Johnson, and David K Duvenaud. Learning differential equations that are easy to solve. Advances in Neural Information Processing Systems, 33:4370–4380, 2020.
  • [KCL21] Patrick Kidger, Ricky TQ Chen, and Terry Lyons. “hey, that’s not an ode”: faster ode adjoints with 12 lines of code. Journal of Machine Learning Research, 2021.
  • [KD19] Alexander Keller and Ken Dahm. Integral equations and machine learning. Mathematics and Computers in Simulation, 161:2–12, 2019.
  • [KGES19] Manochehr Kazemi, Hamid Mottaghi Golshan, Reza Ezzati, and Mohsen Sadatrasoul. New approach to solve two-dimensional fredholm integral equations. Journal of Computational and Applied Mathematics, 354:66–79, 2019.
  • [KLL+21] Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces. arXiv preprint arXiv:2108.08481, 2021.
  • [KLS24] Nikola B Kovachki, Samuel Lanthaler, and Andrew M Stuart. Operator learning: Algorithms and analysis. arXiv preprint arXiv:2402.15715, 2024.
  • [KPM21] Hyunjik Kim, George Papamakarios, and Andriy Mnih. The lipschitz constant of self-attention. In International Conference on Machine Learning, pages 5562–5571. PMLR, 2021.
  • [KR12] Dan Kushnir and Vladimir Rokhlin. A highly accurate solver for stiff ordinary differential equations. SIAM Journal on Scientific Computing, 34(3):A1296–A1315, 2012.
  • [Kra] M. A. Krasnosel’skii. Geometrical methods of nonlinear analysis. Springer-Verlag, Berlin, 1984.
  • [Kra64] Yu P Krasnosel’skii. Topological methods in the theory of nonlinear integral equations. Pergamon Press, 1964.
  • [Lak95] Vangipuram Lakshmikantham. Theory of integro-differential equations, volume 1. CRC press, 1995.
  • [LJP+21] Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.
  • [LKA+20] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. In International Conference on Learning Representations, Workshop on Integration of Deep Neural Models and Differential Equations, 2020.
  • [LKA+21] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021.
  • [LLS21] Seung Hoon Lee, Seunghyun Lee, and Byung Cheol Song. Vision transformer for small-size datasets. arXiv preprint arXiv:2112.13492, 2021.
  • [LPW+17] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. Advances in neural information processing systems, 30, 2017.
  • [LR02] Xian-Fang Li and Er-Qian Rong. Solution of a class of two-dimensional integral equations. Journal of computational and applied mathematics, 145(2):335–343, 2002.
  • [MKH+22] Andreas Maier, Harald Köstler, Marco Heisig, Patrick Krauss, and Seung Hee Yang. Known operator learning and hybrid machine learning in medical imaging—a review of the past, the present, and the future. Progress in Biomedical Engineering, 4(2):022002, 2022.
  • [Mor13] Valter Moretti. Spectral theory and quantum mechanics: with an introduction to the algebraic formulation. Springer Science & Business Media, 2013.
  • [OOK+23] Oded Ovadia, Vivek Oommen, Adar Kahana, Ahmad Peyvan, Eli Turkel, and George Em Karniadakis. Real-time inference and extrapolation via a diffusion-inspired temporal transformer operator (ditto). arXiv preprint arXiv:2307.09072, 2023.
  • [OSG+22] Vivek Oommen, Khemraj Shukla, Somdatta Goswami, Rémi Dingreville, and George Em Karniadakis. Learning two-phase microstructure evolution using neural operators and autoencoder architectures. npj Computational Materials, 8(1):190, 2022.
  • [PMB+22] Michael Poli, Stefano Massaroli, Federico Berto, Jinkyoo Park, Tri Dao, Christopher Ré, and Stefano Ermon. Transform once: Efficient operator learning in frequency domain. Advances in Neural Information Processing Systems, 35:7947–7959, 2022.
  • [PMSR21] Avik Pal, Yingbo Ma, Viral Shah, and Christopher V Rackauckas. Opening the blackbox: Accelerating neural differential equations by regularizing internal solver heuristics. In International Conference on Machine Learning, pages 8325–8335. PMLR, 2021.
  • [PMY+20] Michael Poli, Stefano Massaroli, Atsushi Yamashita, Hajime Asama, and Jinkyoo Park. Hypersolvers: Toward fast continuous-depth models. Advances in Neural Information Processing Systems, 33:21105–21117, 2020.
  • [PYD20] Kourosh Parand, Hafez Yari, and Mehdi Delkhosh. Solving two-dimensional integral equations of the second kind on non-rectangular domains with error estimate. Engineering with Computers, 36(2):725–739, 2020.
  • [Que16] Qichao Que. Integral Equations For Machine Learning Problems. PhD thesis, The Ohio State University, 2016.
  • [QWC+23] Xianbiao Qi, Jianan Wang, Yihao Chen, Yukai Shi, and Lei Zhang. Lipsformer: Introducing lipschitz continuity to vision transformers. arXiv preprint arXiv:2304.09856, 2023.
  • [RCD19] Yulia Rubanova, Ricky TQ Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. Advances in neural information processing systems, 32, 2019.
  • [Rok85] Vladimir Rokhlin. Rapid solution of integral equations of classical potential theory. Journal of computational physics, 60(2):187–207, 1985.
  • [Rok90] Vladimir Rokhlin. Rapid solution of integral equations of scattering theory in two dimensions. Journal of Computational Physics, 86(2):414–439, 1990.
  • [SB51] Edwin E Salpeter and Hans Albrecht Bethe. A relativistic equation for bound-state problems. Physical Review, 84(6):1232, 1951.
  • [SRH+81] Harlan W Stech, Samuel M Rankin, Terry L Herdman, et al. Integral and functional differential equations, volume 67. CRC Press, 1981.
  • [TBY+19] Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. Transformer dissection: A unified understanding of transformer’s attention via the lens of kernel. In Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4344–4353, 2019.
  • [TCEN+22] Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Jakob Verbeek, and Hervé Jégou. Three things everyone should know about vision transformers. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pages 497–515. Springer, 2022.
  • [TF59] W Tobocman and LL Foldy. Integral equations for the schrödinger wave function. American Journal of Physics, 27(7):483–490, 1959.
  • [VGSS20] Saurabh Vyas, Matthew D Golub, David Sussillo, and Krishna V Shenoy. Computation through neural population dynamics. Annual Review of Neuroscience, 43:249, 2020.
  • [VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [Waz11] Abdul-Majid Wazwaz. Linear and nonlinear integral equations, volume 639. Springer, 2011.
  • [XZC+21] Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, page 14138, 2021.
  • [YBR+19] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077, 2019.
  • [YZQC21] Xu Yang, Hanwang Zhang, Guojun Qi, and Jianfei Cai. Causal attention for vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9847–9857, 2021.
  • [Zem12] Stephen M Zemyan. The classical theory of integral equations: a concise treatment. Springer Science & Business Media, 2012.
  • [ZFM+23] Emanuele Zappala, Antonio H de O Fonseca, Andrew H Moberly, Michael J Higley, Chadi Abdallah, Jessica A Cardin, and David van Dijk. Neural integro-differential equations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11104–11112, 2023.

Appendix A Additional Experiments

A.1 Additional figures and tables

Here we report additional figures and tables that complement the results given in Section 3.

The model DeepOnet + UNET in Table 2 is implemented similarly to [DAK23].

Figure 4: Example dynamics of the (1+1)D Burgers’ equation. Each frame represents the full dynamics, where the \vec{x} axis shows time and the \vec{y} axis shows space. Top row is data, and bottom row is ANIE prediction. Columns represent different dynamics resulting from different initial conditions. R2 values of model fits are shown for each of the dynamics.
Table 2: Benchmark on predicting fMRI brain dynamics. We report the mean squared errors per extrapolated dynamics of different lengths (t=5,10,20) on new initial conditions. All models use a single data point as initial condition, while the LSTM model uses 2 time points. We see that as the dynamics get more non-local (i.e. longer time intervals), only ANIE can correctly predict them, as shown by the lower mean squared error.
                  t = 5               t = 10              t = 20
NODE              0.98 ± 0.07831      1.759 ± 0.1407      2.361 ± 0.2227
LSTM              2.004 ± 0.1856      2.182 ± 0.195       2.47 ± 0.1993
Residual Network  2.396 ± 0.1705      2.535 ± 0.1706      2.742 ± 0.2
FNO1D             0.4735 ± 0.04857    1.5110 ± 0.13570    2.7320 ± 0.31410
ViT               1.543 ± 0.1235      1.6650 ± 0.1091     2.0350 ± 0.1497
DeepOnet+AE       2.436 ± 0.5546      2.774 ± 1.018       6.405 ± 2.062
DeepOnet+UNET     2.692 ± 0.5066      3.065 ± 0.962       3.024 ± 0.665
ANIE (ours)       0.7974 ± 0.08118    1.086 ± 0.112       1.242 ± 0.1256
Figure 5: Example dynamics of fMRI data and corresponding model prediction. For each image, time is represented on the \vec{x} axis, and cortical locations (80 nodes) are represented on the \vec{y} axis.
Figure 6: Embedding of Navier-Stokes dynamics using ANIE (Panel A), PCA (Panel B), and sample dynamics from the embedding spaces (Panel C). We see that the leftmost dynamics in Panel A correspond to lower velocity dynamics, and the embedding smoothly transitions toward higher velocities from left to right. Such structure is lost when directly embedding using other methods (e.g. the reported PCA plot).
Figure 7: Quantification, using absolute error per time point, of model fits to simulated fMRI dataset. Models were run during inference on initial conditions not seen during training. ANIE has the best performance (lowest error) in predicting longer dynamics, which encompass a higher non-local component.
Figure 8: Example dynamics of the Navier-Stokes system. Ground truth data (top) and the prediction using ANIE (bottom) are shown. The prediction was generated from an initial condition that was not seen during training. $R^2$ values quantify the model fit.
Figure 9: Quantification, using $R^2$, of model fits to the 2D IE spiral dataset. Models were run at inference on initial conditions not seen during training. ANIE has the best performance (highest $R^2$) in predicting the dynamics.
Figure 10: Hyperparameter sensitivity analysis for ANIE, LSTM, and Neural ODE. Validation-set mean squared error for different hyperparameter combinations; all models were trained for 300 epochs.
Figure 11: Example dynamics from the calcium imaging dataset and their respective attention plots. The attention weights do not directly reflect the input intensity and show activity over the motor and visual cortices.
Table 3: Benchmark on the embedding experiment. We perform KNN regression with $k=5$ on embeddings of Navier-Stokes dynamics, correlating the velocity of the dynamics with the embedding. All values are mean squared errors and are multiplied by a factor of $10^{-4}$.
PCA                UMAP               LSTM               ViT                 ANIE (ours)
$9.91 \pm 2.30$    $7.27 \pm 1.75$    $8.83 \pm 1.55$    $11.04 \pm 2.35$    $\mathbf{6.52 \pm 1.31}$

Table 4: Performance in $R^2$ of a KNN regressor at regressing the contrast of visual stimuli from the learned latent representation. Results are presented as mean $\pm$ std ($N=1600$ frames, 10-fold cross-validation).
PCA                     ViT                      ConvLSTM                 ANIE (ours)
$0.6763 \pm 0.02100$    $0.6231 \pm 0.07581$     $0.6263 \pm 0.06008$     $\mathbf{0.6944 \pm 0.02982}$

A.2 Benchmark of (A)NIE training speed

Neural ODEs (NODEs) can be slow and have poor scalability [KBJD20]. As such, several methods have been introduced to improve their performance [KBJD20, KCL21, DKM+20, PMY+20, PMSR21]. Despite these improvements, NODE is still significantly slower than discrete methods such as LSTMs. We hypothesize that (A)NIE has significantly better scalability than NODE, comparable to fast but discrete LSTMs, despite being a continuous model. To test this, we compare NIE and ANIE to the latest optimized version of (latent) NODE [RCD19] and to LSTM on three different dynamical systems: the Lotka-Volterra equations, the Lorenz system, and IE-generated 2D spirals (see Appendix F for data generation details). During training, models were initialized with the first half of the data and tasked to predict the second half. Training speeds are reported in Table 5. While all models achieve comparable (good) fits to the data, we find that ANIE is the fastest model on 2 out of the 3 datasets. Furthermore, ANIE obtains a lower MSE than all other models.

Table 5: Comparison of training speed between NIE, ANIE, NODE, and LSTM on three different dynamical systems. Wall time is given in seconds per training iteration; mean squared error quantifies the accuracy of the model fits to the data. NIE, ANIE, and LSTM are significantly faster than Neural ODE on all systems. However, while LSTM is fast, it is not a continuous-time model like NIE, ANIE, and Neural ODE. Thus, NIE and ANIE combine the advantages of true continuous models with the speed of an LSTM.
                     Wall Time (sec per iteration)                            Mean Squared Error
Models               Lotka-Volterra      Lorenz             2D-Spirals        Lotka-Volterra      Lorenz              2D-Spirals
LSTM                 0.0116              $\mathbf{0.012}$   0.02              0.4492              0.9926              0.1313
Latent NeuralODE     2.88                0.845              0.494             0.4502              1.1001              0.1385
NIE (ours)           0.0312              0.38               0.154             0.4596              1.0229              0.1449
ANIE (ours)          $\mathbf{0.0081}$   0.1764             $\mathbf{0.0082}$ $\mathbf{0.4411}$   $\mathbf{0.9421}$   $\mathbf{0.1257}$

A.3 Hyperparameter sensitivity benchmark

For most deep learning models, including NODE, finding numerically stable solutions usually requires an extensive hyperparameter search. Since IE solvers are known to be more stable than ODE solvers, we hypothesize that (A)NIE is less sensitive to hyperparameter changes. To test this, we quantify model fit, for the Lotka-Volterra dynamical system, as a function of two different hyperparameters: learning rate and L2 norm weight regularization. We perform this experiment for three different models: LSTM, Latent NODE, and ANIE. As shown in Figure 10, we find that ANIE generally has lower validation error as well as more consistent errors across hyperparameter values, compared to LSTM and NODE, therefore validating our hypothesis.

A.4 Modeling 2D IE spirals

To further test the ability of ANIE to model non-local systems, we benchmark ANIE, NODE, and LSTM on a dataset of 2D spirals generated by integral equations. The data consist of 500 2D curves with 100 time points each, split in half for training and testing. During training, the first 20 points were given as the initial condition and the models were tasked to predict the full 100-point dynamics. Details on the data generation are described in Appendix F. For ANIE, the initialization is given via the free function $f$, which takes the values of the first 20 points and sets the remaining 80 points equal to the value of the 20th point. For NODE, the initialization is given by a reverse RNN over the first 20 points, which outputs a distribution corresponding to the first time point (see [CRBD18] for more details on their latent ODE experiments). For LSTM, we input the data in segments of 20 points to predict the next point of the sequence; the process is repeated with the output of the previous step until all points of the curve are predicted. During inference, we test the models’ performance on initial conditions never seen during training. Table 6 shows the correlation between the ground truth curves and the model predictions, and Figure 9 shows the correlation coefficients for the 500 curves. In summary, ANIE significantly outperforms the other tested methods in predicting IE-generated non-local dynamics.

Table 6: Benchmark on 2D IE spirals. $R^2$ values of the model fits are provided for NODE, LSTM, ViT, FNO, and ANIE. ANIE has the best performance.
NODE                     LSTM                     ViT                      FNO                      ANIE (ours)
$0.1778 \pm 0.06932$     $0.3410 \pm 0.1132$      $0.1668 \pm 0.1624$      $0.2882 \pm 0.08609$     $\mathbf{0.7366 \pm 0.1440}$

A.5 Solver convergence

We now consider the convergence of the solver to a solution of the IE for a trained model. Our experiment considers a model trained with a fixed number of solver iterations, and we explore whether, at the end of training, those iterations converge to a solution. The results show that the model learns to converge to a solution of Equation 5 within the iterations fixed during training, i.e. a fixed point of the IE is reached when the model outputs a prediction.

Figure 12: Rate of convergence across the solver iterations for a model trained on the Navier-Stokes equations with $7$ iterations. The model learns to converge within the given steps.

Figure 12 and Figure 3 show the convergence error (i.e. the value $\|y_{n+1}-y_{n}\|$) and the guesses produced by the solver during inference (i.e. $y_{n}$ for $n$ the iteration index), respectively.

Appendix B Integral Equations

Integral Equations (IEs) are equations where the unknown function appears under the sign of integration. In general, these equations can be written as

$\lambda\mathbf{y}=f+T(\mathbf{y}),$   (5)

where $T$ is an integral operator, e.g. as in Equation 1 and Equation 3, and $f$ is a known term of the equation. In fact, these functional equations have been studied for classes of compact operators $T$ that are not necessarily in the form of integral operators – see [Bra69] and references therein. We can distinguish two fundamental kinds of equations of the form given in Equation 5 that have been studied extensively over the years. When $\lambda=0$, we say that the corresponding IE is of the first kind, while when $\lambda\neq 0$ we say that it is of the second kind.

In this article we formulate our methods based on equations of the second kind, for the following important theoretical considerations, which apply to the case where $T$ is compact on an infinite-dimensional space (such as the function spaces considered in this article). Firstly, an equation of the first kind can easily have no solution, as the range of a compact operator $T$ on an infinite-dimensional space is not the whole space (see Proposition 4.41 in [Mor13]). Therefore, for some choices of $f$, there is no $\mathbf{y}$ such that $T(\mathbf{y})=-f$, and Equation 5 has no solutions. The other issue is intrinsic to the nature of equations of the first kind and does not relate to the existence of solutions. In fact, a compact injective operator $T$ (on an infinite-dimensional space) does not admit a bounded left inverse (see Proposition 4.42 in [Mor13]). In practice, this means that even if Equation 5 has a unique solution for a given $f$, varying $f$ by a small amount can result in very significant variations of the corresponding solution $\mathbf{y}$. This is clearly a potential issue for a deep learning model that aims at learning the operator $T$ from data: the observations from which $T$ is learned may be noisy, which may result in considerable perturbations of the solution $\mathbf{y}$ and, consequently, of the operator $T$ that the model converges to. Since equations of the second kind are much more stable, we have formulated all the theory in this setting and implemented our solver for such equations. The issues relating to existence and uniqueness of solutions for these equations are discussed in more detail in Section B.4 below.
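The practical difference between the two kinds can already be seen at the discrete level. The following minimal sketch (the smooth Gaussian kernel and the grid size are arbitrary illustrative choices, not taken from our experiments) compares the conditioning of a discretized first-kind operator $T$ with that of the second-kind operator $\mathbb{1}-T$:

```python
import numpy as np

# Illustrative sketch: conditioning of first- vs second-kind discretizations.
# The smooth Gaussian kernel below is an arbitrary choice, scaled so that ||T|| < 1.
n = 200
t = np.linspace(0.0, 1.0, n)
w = np.full(n, 1.0 / n)                                  # simple quadrature weights
K = 0.5 * np.exp(-(t[:, None] - t[None, :]) ** 2)        # smooth kernel K(t, s)
T = K * w                                                # discretized integral operator

print("cond(T)     =", np.linalg.cond(T))                # huge: noise in f is strongly amplified
print("cond(I - T) =", np.linalg.cond(np.eye(n) - T))    # moderate: stable second-kind solve
```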

The theories of IEs and Integro-Differential Equations (IDEs) are tightly related, and it is often possible to reduce problems about IEs to problems about IDEs and vice versa, both in practical and theoretical situations. IEs are also related to differential equations, and it is possible to reformulate problems about ODEs in the language of IEs or IDEs. In certain cases, IEs can also be converted to differential equations, even though this is not always possible [Zem12, GIKM10]. In fact, the theory of IEs is not equivalent to that of differential equations. The most intuitive way of understanding this is to consider the local nature of differential equations, as opposed to the non-local nature of IEs. By non-locality of IEs we mean that each spatiotemporal point in an IE depends on an integration over the full domain of the solution function $\mathbf{y}$. In the case of differential equations, each point depends only on contiguous points through the local definition of the differential operators.

B.1 Integral Equations (1D)

We first discuss IEs where the integral operator involves only a temporal integration (i.e. 1D), as discussed in Section 2. In analogy with differential equations, this case can be considered as the one corresponding to ODEs.

These IEs are given by an equation of type

$\mathbf{y}(t)=f(t)+\int_{\alpha(t)}^{\beta(t)}G(\mathbf{y},t,s)\,ds,$   (6)

where $f$ is the free term, which does not depend on $\mathbf{y}$, while the unknown function $\mathbf{y}$ appears both on the LHS and on the RHS under the sign of integration. The term $\int_{\alpha(t)}^{\beta(t)}G(\mathbf{y},t,s)\,ds$ defines an integral operator $\mathcal{C}(D)\longrightarrow\mathcal{C}(D)$ from the space $\mathcal{C}(D)$ of integrable functions over some domain $D\subset\mathbb{R}$ into itself. We observe that the variables $t$ and $s$ appearing in $G$ both lie in $D$, and they are interpreted as time variables. We refer to them as global and local time, respectively, following the convention of [ZFM+23]. The functions $\alpha$ and $\beta$ determine the extremes of integration for each (global) time $t$. Common choices for $\alpha$ and $\beta$ are $\alpha(t)=0$ and $\beta(t)=t$ (Volterra equations), or $\alpha(t)=a$ and $\beta(t)=b$ (Fredholm equations).

The fundamental question in the theory of IEs is whether solutions exist and are unique. It turns out that under relatively mild assumptions on the regularity of $G$, IEs admit unique solutions [Zem12]. Furthermore, the proofs in the first chapter of [Lak95] show the close relation between IEs and IDEs, as existence and uniqueness problems for IDEs are shown to be equivalent to analogous problems for IEs. The fixed point theorems of Schauder and Tychonoff are then used to prove the results.

B.2 Integral Equations (n+1D)

We now discuss the case of IEs where the integral operator involves integration over a multi-dimensional domain of $\mathbb{R}^{n}$. This is the IE analogue of PDEs; such equations are commonly referred to as Partial Integral Equations (PIEs) when integration occurs over different components separately. We consider equations where the multi-dimensional integral is obtained through multiple integrations. An equation of this type takes the form

$\mathbf{y}(\mathbf{x},t)=f(\mathbf{x},t)+\int_{\Omega}\int_{\alpha(t)}^{\beta(t)}G(\mathbf{y},\mathbf{x},\mathbf{x}^{\prime},t,s)\,d\mathbf{x}^{\prime}ds,$   (7)

where $\Omega\subset\mathbb{R}^{n}$ is a domain in $\mathbb{R}^{n}$, and $\mathbf{y}:\Omega\times\mathbb{R}\longrightarrow\mathbb{R}^{m}$. Here $m$ does not necessarily coincide with $n$.

PIEs and higher dimensional IEs have been studied, in some restricted form, since the 1800’s, as they were employed to formulate the laws of electromagnetism before the unified version of Maxwell’s equations was published. In addition, early work on the Dirichlet problem found the IE approach fruitful, and it is well known that several problems in scattering theory (molecular, atomic, and nuclear) are formulated in terms of (P)IEs. In fact, the Schrödinger equation can be recast as an IE [TF59]. See also [SB51], where bound-state problems are treated with the IE formalism.

B.3 Generalities on solving Integral Equations

The most striking difference between solving an IE and solving an ODE is that for an IE, in order to evaluate the equation at a single time point, one needs to know the solution at all time points. This is clearly an issue, since solving for one point requires that we already know the solution everywhere. To better elucidate this point, we compare the solution procedure of an ODE of the form $\dot{\mathbf{y}}=f(\mathbf{y},t)$ with that of an IE of the form $\mathbf{y}=f(t)+\int_{0}^{1}G(\mathbf{y},t,s)\,ds$.

Let us assume we are solving an ODE of the form $\dot{\mathbf{y}}(t)=f(\mathbf{y},t)$ and that $\mathbf{y}$ is known at time points $t_{0},t_{1},\ldots,t_{k-1}$. Then one can obtain $\mathbf{y}$ at $t_{k}$ by means of the Euler method, using the known value at $t_{k-1}$ and taking a small enough step $\Delta t$ forward in time. In general, therefore, one starts from the initial condition $\mathbf{y}_{0}$ and determines the solution $\mathbf{y}$ at the points $t_{0},\ldots,t_{n}$ by taking small steps and approximating the derivative as $\Delta\mathbf{y}/\Delta t$ over small intervals $\Delta t$. Of course, more sophisticated methods are possible for the numerical solution of the ODE, but they essentially produce the next time point from the previous ones in a sequential way. Let us now consider a Fredholm IE analogous to the ODE above, namely a simple equation of the form $\mathbf{y}=f(t)+\int_{0}^{1}G(\mathbf{y},t,s)\,ds$. Suppose we know $\mathbf{y}$ at time points $t_{0},\ldots,t_{k-1}$. In order to determine $\mathbf{y}(t_{k})$, we need to compute $f(t_{k})+\int_{0}^{1}G(\mathbf{y},t_{k},s)\,ds$, which requires knowing $\mathbf{y}$ over the full interval $[0,1]$, since $G$ is integrated over $[0,1]$. Clearly, knowing a single time point of $\mathbf{y}$ (or a sequence of previous values) no longer suffices. In a Volterra-type equation the integral would be over $[0,t_{k}]$ (where the unknown value at $t_{k}$ is included!), which does not change the essence of the issue.
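The contrast can be made concrete with the following sketch (the ODE right-hand side and the integrand $G$ below are arbitrary illustrative choices): the Euler update only touches the previous value, while each evaluation of the Fredholm operator requires the current guess for $\mathbf{y}$ on the whole interval.

```python
import numpy as np

ts = np.linspace(0.0, 1.0, 101)
dt = ts[1] - ts[0]

# ODE y' = f(y, t): Euler marches forward using only the previous value.
f_ode = lambda y, t: -y                                     # illustrative right-hand side
y_ode = np.empty_like(ts)
y_ode[0] = 1.0
for k in range(1, len(ts)):
    y_ode[k] = y_ode[k - 1] + dt * f_ode(y_ode[k - 1], ts[k - 1])

# Fredholm IE y = f + int_0^1 G(y, t, s) ds: each evaluation needs y on all of [0, 1],
# so the update acts on the whole discretized function at once (Picard iteration).
G = lambda y_s, t, s: 0.5 * np.cos(t - s) * np.tanh(y_s)    # illustrative integrand, contractive in y
f = np.ones_like(ts)                                        # free term f(t)
y_ie = f.copy()                                             # initial guess y^0 = f
for _ in range(5):
    y_ie = f + np.array([(G(y_ie, tk, ts) * dt).sum() for tk in ts])
```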

Although several methods can be employed to solve IEs, most (if not all) of them are based on the concept of iteration over some initial guess for the solution of the IE. Iterating on the initial guess produces a sequence of functions that then converges to a solution of the integral equation. More specifically, one can consider the von Neumann series of the integral operator as we now discuss. In fact, let us consider Equation 5, which can be rewritten as

$(\mathbb{1}-T)(\mathbf{y})=f,$

where we assume for the moment that $T$ is a linear operator. Observe that if we can invert $\mathbb{1}-T$, then we obtain $\mathbf{y}$ as $(\mathbb{1}-T)^{-1}(f)$. This can be done by writing the von Neumann series $(\mathbb{1}-T)^{-1}=\sum_{k=0}^{\infty}T^{k}$. This expression makes sense whenever the series $\sum_{k=0}^{\infty}T^{k}$ converges in operator norm, which is guaranteed in important cases, for instance when $\sum_{k=0}^{\infty}\|T\|^{k}$ converges (e.g. when $\|T\|<1$); milder conditions for the convergence of the series exist as well. When the von Neumann series is meaningful, we can obtain $\mathbf{y}$ by iteratively applying the powers $T^{k}$ to $f$. The nonlinear case is handled by a similar iterative procedure, called the Method of Successive Approximations, or Picard iteration [Dav60]. It is straightforward to convince oneself that, under mild conditions, the method outputs a solution of the integral equation; conditions under which the sequence of iterates is guaranteed to converge can be found in [Dav60]. A particularly well known case is when the integrand of the integral operator is contractive (i.e. Lipschitz with constant between $0$ and $1$) with respect to the variable $\mathbf{y}$. We give a proof of this approach for our setting, c.f. similar results found in [Dav60].
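For a linear operator with $\|T\|<1$, the procedure amounts to summing partial sums of the von Neumann series. A minimal sketch (the smooth kernel is an arbitrary choice scaled so that $\|T\|<1$) compares the iterates against a direct solve of the discretized system:

```python
import numpy as np

# Linear Fredholm equation y = f + T(y), with T(y)(t) = int_0^1 K(t, s) y(s) ds.
# The kernel is an arbitrary smooth choice scaled so that ||T|| < 1.
n = 101
t = np.linspace(0.0, 1.0, n)
dt = t[1] - t[0]
K = 0.4 * np.exp(-np.abs(t[:, None] - t[None, :]))
f = np.sin(2 * np.pi * t)

y = f.copy()                          # y^0 = f; after k steps y collects sum_{j<=k} T^j f
for _ in range(20):
    y = f + dt * (K @ y)              # y^{n+1} = f + T(y^n)

y_direct = np.linalg.solve(np.eye(n) - dt * K, f)   # direct solve of (I - T) y = f
print("max deviation from direct solve:", np.abs(y - y_direct).max())
```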

Theorem B.1.

Let $\epsilon>0$ be fixed, and suppose that $T$ is Lipschitz with constant $k<1$. Then we can find $y\in X$ such that $\|T(y)+f-y\|<\epsilon$, independently of the choice of $f$.

Proof.

Let us set $y_{0}:=f$ and $y_{n+1}=f+T(y_{n})$, and consider the term $\|y_{1}-y_{0}\|$. We have

$\|y_{1}-y_{0}\|=\|T(y_{0})\|.$

For an arbitrary $n>1$ we have

$\|y_{n+1}-y_{n}\|=\|T(y_{n})-T(y_{n-1})\|\leq k\|y_{n}-y_{n-1}\|.$

Therefore, applying the same procedure to $y_{n}-y_{n-1}=T(y_{n-1})-T(y_{n-2})$ until we reach $y_{1}-y_{0}$, we obtain the inequality

$\|y_{n+1}-y_{n}\|\leq k^{n}\|T(y_{0})\|.$

Since $k<1$, the term $k^{n}\|T(y_{0})\|$ is eventually smaller than $\epsilon$, i.e. for all $n\geq\nu$ for some choice of $\nu$. Defining $y:=y_{\nu}$ for such $\nu$ gives the following:

$\|T(y_{\nu})+f-y_{\nu}\|=\|y_{\nu+1}-y_{\nu}\|<\epsilon.$ ∎

The following corollary is now an easy consequence.

Corollary B.2.

Consider the same hypotheses as above. Then Equation 5 admits a solution. In particular, if the integrand $G$ in Equation 6 is contractive with respect to $\mathbf{y}$ with constant $k$ such that $k\cdot(b-a)<1$ (where $[a,b]$ is the codomain of $\alpha,\beta$), the iterative method in Algorithm 1 converges to a solution of the equation.

Proof.

From the proof of Theorem B.1 it follows that the sequence $y_{n}$ is a Cauchy sequence. Since $X$ is a Banach space, $y_{n}$ converges to some $y\in X$. By continuity of $T$, $y$ is a solution of Equation 5. For the second part of the statement, observe that when $G$ is contractive with respect to $\mathbf{y}$, we can apply Theorem B.1 to show that the sequence generated by Algorithm 1 is Cauchy, and we can proceed as in the first part of the proof. ∎

Remark B.3.

Observe that the result of Corollary B.2 applies to Algorithm 2 as well, under the assumption that the transformer architecture is contractive with respect to the input sequence $\mathbf{y}$. A statement for higher dimensional IEs can be obtained (and proved) similarly to the second part of Corollary B.2, using the measure of $\Omega\times[a,b]$ instead of the value $(b-a)$.

In practice, the method of successive approximations is implemented as follows. The initial guess for the integral equation is given by the free function $f$ (i.e. $T^{0}(f)$), which is used to initialize the iterative procedure. Then we apply $T$ to $\mathbf{y}^{0}:=T^{0}(f)$ to obtain a new approximation $\mathbf{z}^{1}:=f+T(\mathbf{y}^{0})$. We set $\mathbf{y}^{1}:=r\mathbf{y}^{0}+(1-r)\mathbf{z}^{1}$, apply $T$ to $\mathbf{y}^{1}$, and repeat. Here $r$ is a smoothing factor that determines the contribution of the new approximation at each step. As the iterations grow, the fractions of the contributions due to the smoothing factor $r$ tend to $1$. Observe that when we sum $r\mathbf{y}^{i}+(1-r)\mathbf{y}^{i+1}$ with $r=0$, we obtain the terms of the von Neumann series up to degree $i$ applied to $f$: $\sum_{k=0}^{i}T^{k}(f)$. The smoothing factor generally provides good empirical regularization for IE solvers; we have set $r=1/2$ throughout our experiments, even though we have not observed any concrete difference between different values of $r$. This procedure is shown in Figure 2.
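A generic sketch of this smoothed iteration is given below; here `T` stands for any callable mapping a discretized function to a discretized function (in NIE/ANIE it is the learned integral operator), and the default $r=1/2$ matches the value used in our experiments.

```python
import numpy as np

def successive_approximations(T, f, n_iter=6, r=0.5):
    """Smoothed successive approximations for y = f + T(y).

    T : callable taking a discretized function y (array) to T(y) (array).
    f : discretized free function, also used as the initial guess y^0.
    r : smoothing factor weighting the previous iterate against the new one.
    """
    y = f.copy()                              # y^0 := f
    err = None
    for _ in range(n_iter):
        z = f + T(y)                          # raw new approximation z^{i+1}
        y_new = r * y + (1.0 - r) * z         # smoothed update y^{i+1}
        err = np.linalg.norm(y_new - y)       # solver-internal convergence error
        y = y_new
    return y, err
```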

In [DM88], Chapter 4, error bounds are computed for the iterative procedure described above in the case where the integrand function $G$ splits into the product of a kernel (see above) and a linear function $F$. In the same chapter, a detailed description of the Nyström approximation for the computation of the error is also given. We describe a concrete realization of the iterative procedure discussed above in Section D, along with the learning steps for the training of our model. Moreover, we observe that the procedure described above does not depend on $T$ being an integral operator rather than a general operator; therefore, applying this methodology to the case where $T$ is replaced by a transformer is still meaningful, under the assumption that $T$ is such that the iterated sequence of approximations converges.

Depending on the specific IE being solved (e.g. Fredholm or Volterra, $1$D or $(n+1)$D, etc.), the actual procedure for finding a numerical solution can vary. A list of articles showcasing this wide variety of methods for specific types of equations is [KR12, BRSS17, LR02, KGES19, PYD20]. Such variations on the same iterative theme aim at converging to a solution most efficiently, obtaining the best error bounds, and improving the stability of the solver, and they depend substantially on the form of the integral operator. Since our method is applied without knowledge of the shape of the integral operator, and instead aims at inferring (i.e. learning) the integral operator from data, we implement an iterative procedure that is fixed and depends only on a smoothing-factor hyperparameter. This is described in detail in the next section. We point out, however, that since the integrand, and therefore the integral operator itself, is learned during training, one can expect the model to adapt to the procedure so that our iterations are, in a sense, “optimal” with respect to the target.

Thus far, our considerations on the implementation of IE solvers seem to point to a fundamental computational issue, since they entail a more sophisticated solving procedure than that of ODEs or PDEs. However, in various situations, solving even ODEs and PDEs through IE solvers presents significant advantages that are not obvious from the discussion above. The first advantage is that IE solvers are significantly more stable than ODE and PDE solvers, as shown for instance in [KR12, Rok85, Rok90]. In particular, this provides a new solution to the issue of underflow during the training of NODEs that does not consist of a regularization, but of a complete change of perspective. In addition, even though solving an IE requires iterating, the number of iterations is in general not particularly high; in our experiments it was sufficient to fix the total number of iterations between $4$ and $6$. When solving an ODE, on the other hand, one needs to go sequentially through each time step, and these can be on the order of $100$ (as in some of our experiments). In contrast, our IE solver processes the full time interval in parallel at each iteration. This results in a much faster algorithm compared to differential solvers, as shown in our experiments.

B.4 Existence and uniqueness of solutions

The solver procedure described in the previous subsection assumes, of course, that a solution exists in the first place. As mentioned at the beginning of this section, we treat equations of the second kind in this article also because their existence conditions are better behaved than those of equations of the first kind. We now give some theoretical considerations in this regard, and also discuss when solutions are uniquely determined. Existence and uniqueness are two fundamental parts of the well-posedness of IEs, the other being the continuity of solutions with respect to the initial data.

A concise and relatively self-contained reference for the existence and uniqueness of solutions of (linear) Fredholm IEs is Section 4.5 in [Mor13]. It is shown in Theorem 4.43 that if a Fredholm equation has a Hermitian kernel, then the IE has a unique solution whenever $\lambda$ is not an eigenvalue of the integral operator. For real coefficients, which is the case we are interested in, one can simply reduce to symmetric kernels, i.e. kernels for which $K(t,s)=K(s,t)$ for all $t,s$. Since in this article we have assumed $\lambda=1$, the condition becomes equivalent to saying that there is no nonzero function $\mathbf{z}$ such that $\int_{0}^{1}K(t,s)\mathbf{z}(s)\,ds=\mathbf{z}(t)$ for all $t$.
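On a discretization, this condition can be checked directly by inspecting the spectrum of the quadrature matrix associated to the kernel; the symmetric kernel below is an illustrative choice, not one appearing in our experiments.

```python
import numpy as np

# Check, on a grid, that lambda = 1 is not an eigenvalue of the integral operator
# with symmetric kernel K(t, s) = K(s, t).
n = 200
t = np.linspace(0.0, 1.0, n)
dt = t[1] - t[0]
K = np.cos(np.pi * (t[:, None] - t[None, :]))     # symmetric illustrative kernel
eig = np.linalg.eigvalsh(dt * K)                  # spectrum of the discretized operator
print("eigenvalue closest to 1:", eig[np.argmin(np.abs(eig - 1.0))])
```

For this particular kernel the spectrum stays well away from $1$, so the corresponding discretized second-kind equation is uniquely solvable.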

For more general (linear) integral operators (bounded on a Hilbert space), a similar result holds. From Theorem 4.44 in [Mor13], a generalized Fredholm integral equation admits solutions if and only if the free function is orthogonal to every solution of the associated homogeneous adjoint equation. The latter admits the zero function as a solution (so its solution set is not empty), and it is obtained from Equation 5 by deleting $f$, taking the adjoint of $T$, and taking the complex conjugate of $\mathbf{y}$; in the real case, the conjugate of $\mathbf{y}$ is $\mathbf{y}$ itself. Moreover, uniqueness is guaranteed if the associated homogeneous equation has only trivial solutions. In the case of nonlinear integral operators, several existence and uniqueness conditions can be found in the literature; the reader is referred to [Dav60, Zem12, Lak95] for specific formulations. Generally speaking, such conditions are imposed on the integrand functions that determine the integral operator, in such a way that fixed point theorems (such as Schauder’s and Tychonoff’s) can be applied.

Observe that such formulations of existence and uniqueness based on the contractive properties of the operator $T$ are particularly interesting in the case where the integral operator is replaced by a general neural network (between function spaces) that is not necessarily obtained through integration. In practice, when $T$ is a general neural network, possibly nonlinear in all entries except with respect to the function $\mathbf{y}$, $T$ can be approximated by an integral operator using the following reasoning. It is known that Hilbert-Schmidt operators on the Hilbert space of square integrable functions are approximated by integral operators – see e.g. [Mor13], Theorem 4.26. It is reasonable to assume that neural network operators are well behaved enough to be considered Hilbert-Schmidt operators. They therefore approximate some integral operator, and the training process consequently learns an integral equation.

More generally, for IEs of Urysohn or Hammerstein type, the existence and uniqueness problems are well understood under much milder conditions, namely when the operator is completely continuous [Kra64, Kra]. In this situation, it is sufficient for the operator to have a nonzero topological index to guarantee that the corresponding IE admits a solution; to study the problem of uniqueness, one can determine the value of the topological index on a bounded subset of the Banach space under consideration, since this is directly related to the number of fixed points of the given IE.

The previous discussion, however, does not directly apply to the case where $T$ is a transformer, because $T(\mathbf{y})$ is then not linear in $\mathbf{y}$. Such equations can still be considered generalized Fredholm equations, and conditions under which nonlinear operators $T$ can be approximated by integral operators can be found in the literature, but the extent to which such equations are equivalent to integral equations is a fascinating question that we do not explicitly consider in this article.

As a word of warning, we mention that the general theory ensures existence and uniqueness of solutions under some (mild) assumptions. In principle, one should impose constraints to ensure that such assumptions are satisfied so that the results apply. However, in our experiments we have observed good stability and good convergence without imposing any additional constraints. This does not hold in general, but we hypothesize that during optimization the model converges towards operators whose associated IE is well behaved, in order to avoid regimes of poor stability due to lack of solutions or lack of uniqueness of solutions. For different datasets such behavior might not hold, and extra care in this regard might be needed.

B.5 The initial condition for IEs

NIE does not learn a dynamical system through the derivative of a function $\mathbf{y}$, as is the case for ODEs and IDEs. Therefore, we do not need to specify an initial condition in the solver during training and evaluation. In fact, the initial condition of an IE is encoded in the equation itself: for instance, taking $t=0$ in a Volterra or Fredholm equation uniquely fixes $\mathbf{y}(\mathbf{x},0)$ for all $\mathbf{x}$.

Therefore, we can specify a condition for IEs by constraining the free function $f(\mathbf{x},t)$. In what follows, we make use of this paradigm several times. There are two immediate ways to impose constraints on the free function. The simplest is to fix a value $\mathbf{y}_{0}$ and set $f(\mathbf{x},t)=\mathbf{y}_{0}$ for all $t$. Alternatively, one can choose an arbitrary function $f$ and keep it fixed. In practice, the latter is conceptually more meaningful: for instance, in theoretical physics, when transforming the Schrödinger equation into an integral equation, on the RHS one can choose an arbitrary function $\psi(\mathbf{x},t)$ corresponding to the wave function of free particles, i.e. without potential $V$. Applications of this procedure are found in the experiments below.
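A sketch of the two ways of constraining the free function is given below (the shapes, $m$ time points and $q$ channels, are illustrative); option (b) corresponds, for instance, to the scheme used in the 2D IE spirals experiment of Appendix A.4.

```python
import numpy as np

m, q = 100, 2
# (a) Constant free function: f(t) = y0 for all t.
y0 = np.array([1.0, -0.5])
f_const = np.tile(y0, (m, 1))                                   # shape (m, q)

# (b) Arbitrary fixed free function: the first k observed points are held fixed
#     and the remaining points are filled with the k-th value.
k = 20
observed = np.random.randn(k, q)                                # stand-in for observed data
f_fixed = np.concatenate([observed, np.tile(observed[-1], (m - k, 1))], axis=0)
```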

Appendix C Approximation capabilities

In this appendix we consider the capabilities of our models to approximate (nonlinear) integral operators and integral equations.

C.1 NIE

We consider two settings, in which the integral operator is modeled either by a single hidden layer feed-forward neural network of arbitrary width, or by an arbitrarily deep neural network.

We want to show that when we restrict ourselves to single hidden layer feed-forward neural networks of arbitrary width for the function $G_{\theta}$ in Equation 1, we can approximate a wide class of integral equations over a suitable subset of the space of functions. In the case of deep neural networks, we will argue that the NIE architecture can approximate any “regular enough” integral operator, where the required regularity is described below. We restrict our considerations to function spaces whose domain is $\mathbb{R}$, since the higher dimensional case is easily adapted from this discussion. We will therefore use $y$ instead of $\mathbf{y}$ to indicate the elements of the domain of the integral operators.

Let $T:C([0,1])\longrightarrow C([0,1])$ be an integral operator on the space of continuous functions, defined by $y\mapsto T(y)(t):=\int_{\alpha(t)}^{\beta(t)}G(y(s),t,s)\,ds$ for continuous functions $\alpha,\beta:[0,1]\longrightarrow[0,1]$ and a continuous $G:\mathbb{R}\times[0,1]\times[0,1]\longrightarrow\mathbb{R}$. In fact, for what follows we could take $G$ to be Borel measurable instead of imposing the more restrictive condition of continuity; however, since continuity is generally required in applications, we impose this more restrictive condition. Moreover, our discussion easily extends to the case where the definition intervals are $[a,b]$ instead of $[0,1]$, with simple modifications, and a similar approach also extends to higher dimensional integrals. We assume that $T$ is such that the corresponding integral equation of the second kind, i.e. Equation 1, admits a solution $y^{*}:[0,1]\longrightarrow\mathbb{R}$ in $C([0,1])$. Since $y^{*}$ is continuous, there exists a compact set $K=[-k,k]$, for some $k>0$, such that $y^{*}([0,1])\subset K$. Let us now consider a neighborhood $U_{K}$ of $y^{*}$ in the compact-open topology such that for all $y\in U_{K}$ we have $y([0,1])\subset K$. This could be, for instance, the space of functions $y$ mapping $[0,1]$ into the open interval $(-k,k)=K^{\circ}$. We can therefore restrict $G$ to the domain $K\times[0,1]\times[0,1]$; for notational simplicity we still denote this restriction by $G$, and the corresponding integral operator (defined over the neighborhood $U_{K}$) by $T$.

For an arbitrarily chosen $\epsilon>0$, we want to show that we can approximate $T(y)$ with error at most $\epsilon$, in the metric induced by $C([0,1])$ on $U_{K}$, through an NIE integral operator $T_{\theta}(y)(t):=\int_{\alpha(t)}^{\beta(t)}G_{\theta}(y(s),t,s)\,ds$. To this purpose, let us set $Q=\sup_{[0,1]}|\beta(t)-\alpha(t)|$, and observe that by applying the Universal Approximation Theorem for single hidden layer feed-forward neural networks (see [HSW89]) to the function $G:K\times[0,1]\times[0,1]\longrightarrow\mathbb{R}$, we can find a single hidden layer neural network $G_{\theta}:K\times[0,1]\times[0,1]\longrightarrow\mathbb{R}$ such that for all $t,s\in[0,1]$ we have $|G(y(s),t,s)-G_{\theta}(y(s),t,s)|<\epsilon/Q$. With such a $G_{\theta}$, for all functions $y\in U_{K}$ and any fixed $t^{*}$ in $[0,1]$ we have

$\|T(y)(t^{*})-\int_{\alpha(t^{*})}^{\beta(t^{*})}G_{\theta}(y(s),t^{*},s)\,ds\| \leq \int_{\alpha(t^{*})}^{\beta(t^{*})}\|G(y(s),t^{*},s)-G_{\theta}(y(s),t^{*},s)\|\,ds < |\beta(t^{*})-\alpha(t^{*})|\,\epsilon/Q.$

Therefore, uniformly in $t$, we have that

$\|T(y)(t)-\int_{\alpha(t)}^{\beta(t)}G_{\theta}(y(s),t,s)\,ds\|<\epsilon.$

But this means that $d(T(y),T_{\theta}(y))<\epsilon$, with $d$ the metric on $U_{K}$ induced by that of $C([0,1])$.

We observe that while this approximation does not hold in complete generality, it is valid for an important class of integral operators, since we are usually interested in operators whose corresponding IE admits continuous solutions, and in modeling the operator in a neighborhood of a solution. Moreover, under mild assumptions (e.g. those discussed in Appendix B.4), the dependence of the solution on the initial data is continuous, and therefore the solutions of the equation for perturbed $f$ lie in a neighborhood of a solution $y^{*}$ obtained for $f$. So our results apply in such important cases. Lastly, we point out that throughout the previous reasoning we have implicitly assumed that numerical integration is performed with infinite precision. Of course, this is not the case in practice, but since we can make the numerical integration error arbitrarily small by choosing sufficiently dense samples for a given integration scheme, the error due to numerical integration can be rendered small enough that the previous inequalities still hold.

We now consider the case where we allow deep neural networks, as in [LPW+17], Theorem 1. In this case, we argue that for any IE of the second kind as in Equation 1, where $T(y)(t):=\int_{\alpha(t)}^{\beta(t)}G(y(s),t,s)\,ds$ for a Lebesgue integrable function $G$, we can approximate the integral operator $T$ with arbitrary precision. As a consequence, there is an NIE model that realizes any IE as in Equation 1 with arbitrary accuracy. We can proceed as in the case of single hidden layer neural networks above, with the main difference that, applying Theorem 1 from [LPW+17], we do not need to restrict ourselves to a neighborhood $U_{K}$ of a solution $y^{*}$ of the integral equation, and the neural integral operator $\int_{\alpha(t)}^{\beta(t)}G_{\theta}(y(s),t,s)\,ds$ approximates $T$ uniformly with respect to $t$ for any $y\in C([0,1])$. Observe that in order to use Theorem 1 in [LPW+17] we need to precompose $G$ and $G_{\theta}$ with a characteristic function $\chi_{[0,1]}$, which does not affect the result.

C.2 ANIE

We give some comments on the approximation properties of ANIE with respect to generalized Fredholm equations. For simplicity, we consider the case where the integration is performed only over time, even though the same reasoning extends to spatiotemporal domains. Let $T:C([0,1])\longrightarrow C([0,1])$ denote a Fredholm integral operator defined through the assignment $T(y)(t)=\int_{0}^{1}G(\mathbf{y}(t),t,\mathbf{y}(s),s)\,ds$. Observe that this integral form is more general than the one considered in Equation 1, and it follows the interpretation of integration in terms of self-attention (c.f. Appendix D.2, where the integration approximation used in this article is described in more detail).

Let us assume that the integral equation $y=f^{*}+T(y)$ admits a unique continuous solution $y^{*}\in C^{2}([0,1])$, and that $G$ is regular enough that the equation admits a unique solution in $C([0,1])$ for functions $f$ in a neighborhood of $f^{*}$ in the compact-open topology. Observe that such well-posedness conditions are usually relatively mild (see for instance Appendix B.4 and references therein), and this is the main situation of interest in applications. Then there exists a compact set $K=[-k,k]$ such that $y^{*}([0,1])\subset K$, and we can choose a neighborhood $U_{K}$ of $y^{*}$ in the compact-open topology of $C([0,1])$ such that $y([0,1])\subset K$ for all $y\in U_{K}$. In fact, one can simply choose $U_{K}:=\{y\in C([0,1])\ |\ |y([0,1])|<k\}$. Under these hypotheses, there are numerical integration schemes that can approximate the integral $\int_{0}^{1}G(\mathbf{y}(t),t,\mathbf{y}(s),s)\,ds$ for any fixed choice of $t$ with arbitrary precision, upon choosing a sufficiently large number of evaluation points. For instance, for any fixed $t$, the error of the trapezoidal rule is bounded by a term that goes to zero as $n$ grows, where $n$ is the number of points chosen in $[0,1]$ for approximating the integral; see Equation 2.1.6 in [DR07]. This term is the modulus of continuity

$\omega_{t}(1/n):=\max_{|s_{1}-s_{2}|<1/n}|G(\mathbf{y}(t),t,\mathbf{y}(s_{1}),s_{1})-G(\mathbf{y}(t),t,\mathbf{y}(s_{2}),s_{2})|.$

For each choice of $n$, there exists a compact set $K_{n}:=[-k_{n},k_{n}]$ such that $y^{*}$ maps into $K_{n}$, and $G$ restricted to $K_{n}\times[0,1]\times K_{n}\times[0,1]$ has $\omega_{t}(1/n)<1/n$ for all $t\in[0,1]$. In this situation we can choose a neighborhood $U_{K_{n}}$ of $y^{*}$ such that $\omega_{t}(1/n)<1/n$ for all $t\in[0,1]$ and all $y\in U_{K_{n}}$, and this numerical integration approximates the value of $T(y)(t)$ with arbitrarily high accuracy.

We indicate our numerical integration scheme using the formula

$T(y)(t)=\int_{0}^{1}G(\mathbf{y}(t),t,\mathbf{y}(s),s)\,ds\approx\sum_{i=0}^{n}w_{i}(t)\,G(\mathbf{y}(t),t,\mathbf{y}(s_{i}),s_{i}),$

where $s_{i}$ indicates the $i^{\rm th}$ grid point of $\{t_{i}\}\subset[0,1]$. We can therefore obtain the evaluation of $T(y)$ at the grid points $t_{j}$ as

$T(y)(t_{j})=\int_{0}^{1}G(\mathbf{y}(t_{j}),t_{j},\mathbf{y}(s),s)\,ds\approx\sum_{i=0}^{n}w_{i}(t_{j})\,G(\mathbf{y}(t_{j}),t_{j},\mathbf{y}(s_{i}),s_{i}),$

by choosing $t$ to be one of the grid points.
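In code, this grid evaluation is a weighted sum over the local time index; the sketch below uses $t$-independent trapezoidal weights and an arbitrary illustrative integrand $G$ (the learned integrand would take its place).

```python
import numpy as np

n = 101
s = np.linspace(0.0, 1.0, n)                  # grid points, s_i = t_i
w = np.full(n, 1.0 / (n - 1))                 # trapezoidal weights w_i
w[0] = w[-1] = 0.5 / (n - 1)

def G(y_t, t, y_s, s):
    # illustrative integrand G(y(t), t, y(s), s)
    return np.exp(-(t - s) ** 2) * np.tanh(y_t + y_s)

y = np.sin(2 * np.pi * s)                     # a discretized function on the grid
Ty = np.array([np.sum(w * G(y[j], s[j], y, s)) for j in range(n)])   # T(y)(t_j)
```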

From our regularity assumptions on the derivatives, we can uniformly bound the error of evaluating $T(y)$ at the points $t_{j}$, so that for sufficiently dense grids the evaluation error is smaller than $\epsilon/2$ for any choice of $\epsilon>0$, when evaluating on functions $y$ in a neighborhood of $y^{*}$.

Let us now consider a permutation of the input of $T(y)$ by some $\sigma\in\Sigma_{n}$. This means that we permute the grid points $\{t_{i}\}$ into $\{t_{\sigma i}\}$. The approximate integration above gives

$T(y)(t_{\sigma j})\approx\sum_{i}w_{\sigma i}(t_{\sigma j})\,G(\mathbf{y}(t_{\sigma j}),t_{\sigma j},\mathbf{y}(s_{\sigma i}),s_{\sigma i})=\sum_{i}w_{i}(t_{\sigma j})\,G(\mathbf{y}(t_{\sigma j}),t_{\sigma j},\mathbf{y}(s_{i}),s_{i}),$

where the second equality follows from the fact that we are summing over all permuted indices $i$. This means that our approximation of the integration, evaluated on grid points, is permutation equivariant. Applying Theorem 2 in [YBR+19], we can find a transformer architecture and a weight configuration, which we denote by $\mathcal{T}$, such that $\|\sum_{i=0}^{n}w_{i}(t_{j})G(\mathbf{y}(t_{j}),t_{j},\mathbf{y}(s_{i}),s_{i})-\mathcal{T}(y(t_{j}))\|_{p}<\epsilon/2$, as a function of the $t_{j}$’s. As a consequence, we obtain the approximation

$\|T(y)(t)-\mathcal{T}(y(t))\|_{p} \leq \|T(y)(t_{j})-\sum_{i=0}^{n}w_{i}(t_{j})G(\mathbf{y}(t_{j}),t_{j},\mathbf{y}(s_{i}),s_{i})\|_{p} + \|\sum_{i=0}^{n}w_{i}(t_{j})G(\mathbf{y}(t_{j}),t_{j},\mathbf{y}(s_{i}),s_{i})-\mathcal{T}(y(t_{j}))\|_{p} < \epsilon/2+\epsilon/2=\epsilon,$

for any choice of $y$ in a neighborhood of $y^{*}$.

Appendix D Detailed description of the solver and the training procedure

We give here a detailed account of the implementation of the models NIE and ANIE (for $1$D and $(n+1)$D IEs, respectively). More specifically, we provide a more thorough description of Algorithm 1 and Algorithm 2 for solving the IEs associated to the neural networks $G$ (feed-forward) and $\mathfrak{Att}$ (transformer), and we contextualize these algorithms within the optimization procedure that learns the neural networks.

D.1 Implementation of NIE

We only consider the case of Equation 1, as the case where the function $G$ splits into the product of a kernel $K$ and a (possibly) nonlinear function $F$ is substantially identical. The training of NIE has two main components: an optimization step that targets $G$, and a solver procedure that obtains a solution of the integral equation determined by $G$, or more precisely by the integral operator that $G$ defines. Therefore, we want to solve Equation 1 for a fixed neural network $G$, determine how far this solution is from fitting the data, and optimize $G$ in such a way that at the next step we obtain a solution that more accurately fits the data. At the end of training, we have a neural network $G$ that defines an integral operator, and in turn an integral equation, whose associated solution(s) approximate the given data.

To fix notation, let us call $X$ the training dataset. This contains several instances of $q$-dimensional curves over time. In other words, we consider $X=\{X_{i}\}_{i\leq N}$, where $N$ is the number of instances and each $X_{i}=\{\mathbf{x}^{i}_{0},\ldots,\mathbf{x}^{i}_{m}\}$, where each $\mathbf{x}^{i}_{k}\in\mathbb{R}^{q}$ is a $q$-dimensional vector and the sequence of $\mathbf{x}^{i}_{k}$ refers to a discretization of the time interval where the curves are defined. For simplicity, we assume that the time points are taken in $[0,1]$. The neural network $G$ defining the integral operator will be denoted by $G_{\theta}$, in order to explicitly indicate its dependence on the parameters $\theta$. The objective of training is to optimize $\theta$ in such a way that the corresponding $G_{\theta}$ defines an IE whose solutions $\mathbf{y}_{i}(t)$, corresponding to different initializations, pass through the discretized curves $\mathbf{x}^{i}(t)$.

Let us now consider one training step $n$, where the neural network $G_{\theta_{n}}$ has weights obtained from the previous training steps (or randomly initialized if this is the first step). We need to solve the IE

$\mathbf{y}=f(t)+\int_{\alpha(t)}^{\beta(t)}G_{\theta_{n}}(\mathbf{y},t,s)\,ds,$   (8)

associated to the integral operator $\int_{\alpha(t)}^{\beta(t)}G_{\theta_{n}}(\mathbf{y},t,s)\,ds$ corresponding to the weights $\theta_{n}$ at training step $n$.

For simplicity we consider a batch size of $1$, so that our training curve is given by $\{\mathbf{x}_{0},\ldots,\mathbf{x}_{m}\}$, where we suppress the superscript $i$ because there is only one curve. We select the first vector $\mathbf{x}_{0}$ and use it to initialize a full curve consisting of repeated instances of this value; in other words, we define $f(t)=\mathbf{x}_{0}$ for all times $t$. We now apply the IE solver procedure and set the zeroth order solution of the IE to be $\mathbf{y}^{0}(t)=f(t)=\mathbf{x}_{0}$. We then apply the integral operator determined by $G_{\theta_{n}}$, computing

$\mathbf{z}^{1}=f(t)+\int_{\alpha(t)}^{\beta(t)}G_{\theta_{n}}(\mathbf{y}^{0},t,s)\,ds.$

Observe that at this stage we can perform the integration over the interval $[\alpha(t),\beta(t)]$ for each time $t$, since $\mathbf{y}^{0}$ is given for all times $t$. We then set $\mathbf{y}^{1}=r\mathbf{y}^{0}+(1-r)\mathbf{z}^{1}$, where $r$ is a smoothing factor $0\leq r<1$ which is set beforehand. The function $\mathbf{y}^{1}(t)$ is now the new approximation of the solution of the IE given in Equation 8. We can now compute the global error between $\mathbf{y}^{0}$ and $\mathbf{y}^{1}$, which we denote by $m(\mathbf{y}^{0},\mathbf{y}^{1})$. This error is internal to the solver and does not refer to how well the model fits the data; it refers to how far the solver is from converging. We iterate this procedure. Assume this has been done $k$ times; then we have a function approximating a solution of the IE at the $k^{\rm th}$ iteration, denoted by $\mathbf{y}^{k}(t)$. We compute

$\mathbf{z}^{k+1}=f(t)+\int_{\alpha(t)}^{\beta(t)}G_{\theta_{n}}(\mathbf{y}^{k},t,s)\,ds,$

where, as before, we can evaluate the integral over the intervals $[\alpha(t),\beta(t)]$ because the function $\mathbf{y}^{k}(t)$ is defined over the full time span of the dynamics. We then set $\mathbf{y}^{k+1}=r\mathbf{y}^{k}+(1-r)\mathbf{z}^{k+1}$ and continue until the fixed number of solver iterations is reached.

This iterative procedure converges to a solution of the integral equation for the integral operator defined through $G_{\theta_{n}}$ [DM88]. In order to optimize the parameters $\theta$ of $G$, we require gradients when applying the neural network $G_{\theta_{n}}$; we compute the loss between the solution $\mathbf{y}$ obtained through the iterative procedure and the data, and we then backpropagate.
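The following condensed PyTorch-style sketch summarizes one training step under simplifying assumptions (batch size $1$, Fredholm limits $\alpha(t)=0$ and $\beta(t)=1$, a plain Riemann approximation of the integral); the network `G_theta`, the optimizer, the grids, and the stand-in data are illustrative rather than the exact implementation.

```python
import torch

# Condensed, illustrative sketch of one NIE training step.
m, q, n_iter, r = 100, 2, 5, 0.5
t = torch.linspace(0.0, 1.0, m)                        # global times t
s = torch.linspace(0.0, 1.0, m)                        # local (integration) times s
G_theta = torch.nn.Sequential(
    torch.nn.Linear(q + 2, 64), torch.nn.Tanh(), torch.nn.Linear(64, q))
opt = torch.optim.Adam(G_theta.parameters(), lr=1e-3)

x = torch.randn(m, q)                                  # one training curve (stand-in data)
f = x[0].expand(m, q)                                  # free function: f(t) = x_0 for all t

y = f.clone()                                          # y^0 = f
for _ in range(n_iter):                                # solver iterations
    # evaluate G_theta(y(s), t, s) on the (t, s) grid, then integrate over s
    inp = torch.cat([y.unsqueeze(0).expand(m, m, q),          # y(s)
                     t.view(m, 1, 1).expand(m, m, 1),         # t
                     s.view(1, m, 1).expand(m, m, 1)], dim=-1)
    Ty = G_theta(inp).mean(dim=1)                      # ~ int_0^1 G_theta(y(s), t, s) ds
    z = f + Ty                                         # new raw approximation
    y = r * y + (1.0 - r) * z                          # smoothed solver update

loss = torch.nn.functional.mse_loss(y, x)              # compare the IE solution to the data
opt.zero_grad()
loss.backward()                                        # backpropagate through the solver
opt.step()
```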

D.2 Implementation of ANIE

We now consider ANIE, an IE model where the integral is approximated via self-attention. As the iterative solver procedure for obtaining a solution of the IE determined by the integral operator is conceptually the same as for NIE above, we focus mostly on the details relative to the use of self-attention in this setting. Firstly, we consider an IE with space and time, which takes the form of Equation 7. Our dataset now consists of instances of a given dynamics $X=\{X_{i}\}_{i\leq N}$, where $N$ is the number of instances in the dataset and each $X_{i}=\{\mathbf{x}^{i}_{s_{1},\ldots,s_{d},j}\}$ is a family of $q$-dimensional vectors (where $q$ is the dimension of the dynamics), indexed by the spatial and temporal indices $s_{1},\ldots,s_{d}$ and $j$ corresponding to a discretization (e.g. a mesh) of the spatiotemporal domain $\Omega\times[0,T]$. Observe that the dimension of the spatial domain $\Omega$ is assumed here to be $d$, so that each $\mathbf{x}$ depends on $d$ spatial indices. One can thus think of each dynamics instance in the dataset as a temporal sequence of spatial meshes, e.g. a sequence of images when $d=2$. We assume that the number of time points in such a sequence is $m_{T}$, that the total number of space points is $m_{\Omega}$, and we set $m=m_{T}m_{\Omega}$.

For the sake of simplicity, we assume that the attention model approximating the integral operator consists of a single self-attention layer. Let $\mathfrak{Att}$ denote a self-attention layer, and assume that $\mathfrak{Att}:\mathbb{R}^{m\times(q+d+1)}\longrightarrow\mathbb{R}^{m\times(q+d+1)}$. Observe that the attention layer maps sequences of length $m$ of $(q+d+1)$-dimensional vectors to sequences of the same type. We therefore think of $\mathfrak{Att}:\mathcal{X}\longrightarrow\mathcal{Y}$ as a mapping between two function spaces $\mathcal{X}$ and $\mathcal{Y}$, whose elements are functions $\mathbf{y}(\mathbf{x},t)$ in discretized form, with $\mathbf{x}\in\Omega$ and $t\in[a,b]$. As discussed in [Cao21] (see also [XZC+21]), the self-attention mechanism can be thought of as an approximation of an integral operator: given a discretized function $\mathbf{y}(\mathbf{x},t)$, $\mathfrak{Att}(\mathbf{y}(\mathbf{x},t))$ is another discretized function obtained through an approximation of an integration over the variables $\mathbf{x}$ and $t$. This theoretical motivation, together with the computational complexity of performing Monte Carlo integration in higher dimensions, led us to consider an IE solver where, instead of learning a simple neural network $G$ as in NIE, we learn the integral operator in the form of its attentional approximation $\mathfrak{Att}$.

As in the detailed description of NIE above, we assume that the batch size is equal to $1$ and that the dataset is $X=\{X_{i}\}_{i\leq N}$ with $X_{i}=\{\mathbf{x}^{i}_{s_{1},\ldots,s_{d},j}\}$, for a discretization of a spatiotemporal domain $\Omega\times[0,T]$ as described at the beginning of this subsection. Let $\mathfrak{Att}_{\theta}$ denote the transformer with parameters $\theta$ obtained at epoch $n$ of the training session; if $n=0$, this simply means that $\mathfrak{Att}_{\theta}$ is randomly initialized. We want to inspect epoch $n+1$. The IE we solve at each training epoch takes the form

$\mathbf{y}=f(\mathbf{x},t)+\mathfrak{Att}_{\theta}(\mathbf{y},\mathbf{x},t),$   (9)

where $\mathfrak{Att}_{\theta}(\mathbf{y},\mathbf{x},t)$ is an approximation of an integral operator $\int_{\Omega}\int_{0}^{T}G(\mathbf{y},\mathbf{x},\mathbf{x}^{\prime},t,s)\,d\mathbf{x}^{\prime}ds$ for some $G$. The solver is initialized through the free function $f(\mathbf{x},t)$, which plays the role of the first “guess” for the solution of the IE. Observe that since $f$ is evaluated on the full discretization of $\Omega\times[0,T]$, the length $m$ of the sequence of vectors that approximates $f(\mathbf{x},t)$ equals the product of the number of space points and the number of time points. The solver therefore creates its first approximation by setting $\mathbf{y}^{0}(\mathbf{x},t)=f(\mathbf{x},t)$. Then, for the first iteration of the solver, we create the new sequence $\tilde{\mathbf{y}}^{0}$ by concatenating to each vector $\mathbf{y}^{0}(\mathbf{x}_{s},t_{r})$ the corresponding spatiotemporal coordinates $(\mathbf{x}_{s},t_{r})$. Now $\tilde{\mathbf{y}}^{0}$ consists of a sequence of $m=m_{T}m_{\Omega}$ vectors (one per spacetime point), each of which carries a spacetime encoding (through concatenation). See Figure 13 for a schematic representation of the integration procedure through a transformer. Then we set

$\tilde{\mathbf{y}}^{1}(\mathbf{x},t)=f(\mathbf{x},t)+\mathfrak{Att}_{\theta}(\tilde{\mathbf{y}}^{0}),$

where the dependence of $\tilde{\mathbf{y}}^{1}$ on the spacetime coordinates $\mathbf{x}$ and $t$ indicates that we have one vector $\tilde{\mathbf{y}}^{1}$ per spacetime coordinate. If $q$ is the dimension of the dynamics (i.e. the number of channels per spacetime point), then the sequence $\tilde{\mathbf{y}}^{1}$ consists of vectors of dimension $q+d+1$, where $d$ is the number of space dimensions. This happens because $\tilde{\mathbf{y}}^{1}$ is the output of a transformer applied to a sequence of $q$-dimensional vectors concatenated with a $(d+1)$-dimensional coordinate sequence. The two simplest options are either to discard the last $d+1$ dimensions of the vectors, or to add an additional linear layer that projects from $q+d+1$ dimensions to $q$. As tests have not shown a significant difference between the two approaches, we have adopted the former. Proceeding in this way we obtain the $q$-dimensional sequence $\mathbf{z}^{1}(\mathbf{x},t)$. Lastly, we set $\mathbf{y}^{1}(\mathbf{x},t)=r\mathbf{y}^{0}+(1-r)\mathbf{z}^{1}$, where $r$ is a smoothing factor that is a hyperparameter of the solver. One then computes the error $m(\mathbf{y}^{0},\mathbf{y}^{1})$ between the initial step and the second one to quantify the degree of change of the new approximation, where $m(\bullet,\bullet)$ is a global error metric fixed throughout.

Figure 13: Integration through self-attention

We now iterate the same procedure: assuming that the approximation $\mathbf{y}^{i}$ to the equation has been obtained, we concatenate the spacetime coordinates to obtain $\tilde{\mathbf{y}}^{i}$ and set

$\tilde{\mathbf{y}}^{i+1}(\mathbf{x},t)=f(\mathbf{x},t)+\mathfrak{Att}_{\theta}(\tilde{\mathbf{y}}^{i}),$

which we use to obtain $\mathbf{z}^{i+1}$ (by deleting the last $d+1$ dimensions). Then we set $\mathbf{y}^{i+1}=r\mathbf{y}^{i}+(1-r)\mathbf{z}^{i+1}$ and compute the global error $m(\mathbf{y}^{i},\mathbf{y}^{i+1})$. Figure 14 shows a solver integration step in detail.

Figure 14: Solver between steps $k$ and $k+1$. The whole dynamics consists of frames ordered consecutively with respect to time. Integration with respect to space and time is performed on the function $\mathbf{y}^{k}$ to produce the guess $k+1$. The procedure is then repeated until convergence.

In practice, the number of iterations of the solver is a fixed hyperparameter, which we have set between $3$ and $5$ in our experiments. This has been sufficient to achieve good results and to learn a model that is stable under the solving procedure described above. Since the solver is fully implemented in PyTorch and the model that approximates the integral operator is a transformer, we can simply backpropagate through the solver at each epoch, after we have solved for $\mathbf{y}$ and compared the solution to the given data $\{X_{i}\}_{i\leq N}$.
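A condensed sketch of the solver loop just described is the following, where `attn` is a placeholder for the transformer $\mathfrak{Att}_{\theta}$ and the shapes follow the notation above ($m=m_{T}m_{\Omega}$ spacetime points, $q$ channels, $d$ spatial dimensions); dropping the last $d+1$ output dimensions corresponds to the choice adopted in the text.

```python
import torch

# Illustrative sketch of the ANIE solver loop; `attn` stands in for Att_theta.
m_T, m_Omega, q, d = 10, 64, 1, 2
m = m_T * m_Omega
coords = torch.rand(m, d + 1)                          # spacetime coordinates (x, t)
f = torch.randn(m, q)                                  # discretized free function f(x, t)
attn = torch.nn.TransformerEncoderLayer(d_model=q + d + 1, nhead=1, batch_first=True)

r, n_iter = 0.5, 4
y = f.clone()                                          # y^0 = f
for _ in range(n_iter):
    y_tilde = torch.cat([y, coords], dim=-1)           # concatenate the spacetime encoding
    att_out = attn(y_tilde.unsqueeze(0)).squeeze(0)    # self-attention over all m points
    z = f + att_out[:, :q]                             # drop the last d+1 dims, add f
    y_new = r * y + (1.0 - r) * z                      # smoothed update y^{i+1}
    err = (y_new - y).norm()                           # convergence error m(y^i, y^{i+1})
    y = y_new
```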

We complete this subsection with a more concrete description of the motivation for approximating integration through the mechanism of self-attention. Very similar perspectives have appeared in other sources (see e.g. [Cao21, XZC+21]), but we provide a formulation of these considerations that more closely fits the perspective of integral operators for integral equations used in this article. This also serves as a more explicit description of the operator $\mathfrak{Att}$ found in Algorithm 2.

We consider an $n$-dimensional dynamics $\mathbf{y}(\mathbf{x},t)$ depending on space $\mathbf{x}\in\Omega$ (for some domain $\Omega$) and time $t\in[0,1]$. The queries, keys and values of self-attention can be considered as maps $\psi_{W}:\mathbb{R}^{n+1}\times\Omega\times[0,1]\longrightarrow\mathbb{R}^{1\times d}$, where $d$ is the latent dimension of the self-attention and $W=Q,K,V$ for queries, keys and values, respectively. Then, for $W=Q,K,V$, we have

W[\mathbf{y}|\mathbf{x}|t]=\begin{bmatrix}\psi_{W}^{0}[\mathbf{y}|\mathbf{x}|t]\\ \vdots\\ \psi_{W}^{i}[\mathbf{y}|\mathbf{x}|t]\\ \vdots\\ \psi_{W}^{d-1}[\mathbf{y}|\mathbf{x}|t]\end{bmatrix},

where $[\mathbf{y}|\mathbf{x}|t]$ indicates concatenation of the terms in the bracket. Let us now consider the “traditional” quadratic self-attention; similar considerations apply, mutatis mutandis, to the linear attention used in the experiments. The product between queries and keys gives

[(\cdots\psi_{Q}^{i}[\mathbf{y}|\mathbf{x}|t]\cdots)\cdot(\cdots\psi_{K}^{j}[\mathbf{y}|\mathbf{x}|t]\cdots)^{T}]_{ij}=(\psi_{Q}^{i}\cdot\hat{\psi}_{K}^{j}),

where $T$ indicates transposition, and $\hat{\psi}$ indicates the columns of the transposed matrix. Let $\mathbf{z}$ be the output of the self-attention layer; observe that it consists of entries $(\mathbf{z}_{i})_{j}$, where $i$ indicates a spatiotemporal point and $j$ indicates the $j^{\rm th}$ dimension of the $n$-dimensional dynamics. Then, we have

(\mathbf{z}_{i})_{j}=\sum_{j}(\psi_{Q}^{i}\cdot\hat{\psi}_{K}^{j})\psi_{V}^{j}\approx\int_{\Omega\times[0,1]}K(\mathbf{y},\mathbf{x},t,\mathbf{y}^{\prime},\mathbf{x}^{\prime},t^{\prime})F(\mathbf{y}^{\prime},\mathbf{x}^{\prime},t^{\prime})\,d\mathbf{x}^{\prime}dt^{\prime},

where the primed symbols indicate the variables over which we sum (which is why they are “being integrated” in the integral), and $\mathbf{y}$ is evaluated at $(\mathbf{x},t)$, while $\mathbf{y}^{\prime}$ is evaluated at $(\mathbf{x}^{\prime},t^{\prime})$.
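The correspondence above can be illustrated numerically (this is only a sanity check, not part of the model): an attention-style weighted sum over grid points, scaled by the grid-cell volume, approximates the corresponding kernel integral. The kernel $K$ and function $F$ below are arbitrary illustrative choices, and the softmax normalization of standard attention is omitted to expose the quadrature structure.

```python
import math
import torch

# Sanity check: an unnormalized attention-style sum, weighted by the grid-cell
# volume, approximates a kernel integral over the (here one-dimensional) domain.
m = 1000                                              # number of grid points on [0, 1]
t = torch.linspace(0, 1, m)

K = lambda a, b: torch.exp(-(a[:, None] - b[None, :]) ** 2)    # arbitrary smooth kernel K(t, s)
F = lambda s: torch.sin(2 * math.pi * s)                       # arbitrary "values" F(s)

scores = K(t, t)                                      # analogue of the query-key products, (m, m)
values = F(t)[:, None]                                # analogue of the values, (m, 1)
attention_sum = (scores @ values).squeeze(1) / m      # row-wise weighted sum ~ integral in s

# Reference quadrature for \int_0^1 K(t, s) F(s) ds at each grid point t.
reference = torch.trapezoid(scores * F(t)[None, :], dx=1.0 / (m - 1), dim=1)

print((attention_sum - reference).abs().max())        # small, and shrinks as m grows
```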

D.3 Dependence of the model on iteration steps

We explore here the dependence of model extrapolation from the initial condition on the number of solver iterations, for the Navier-Stokes dataset. The results are reported in Table 7 and Table 8, where mean squared errors and standard deviations are given. Figure 15 shows the results reported in Table 7. We perform our experiments with two different models, one with a much larger number of parameters than the other. We see that for the smaller model, the impact of the number of solver steps is much more pronounced. This indicates that while a very large model is able to compensate for the effect of the solver steps and reduce the difference in testing quality, a smaller model can greatly benefit from a higher number of iterations. We notice, in particular, that an ANIE model with a single layer performs as well as an ANIE model with 4 layers and a lower number of iterations. In all cases, a higher number of solver steps gives better evaluations than single-iteration models with statistical significance ($P<0.0001$).

Table 7: Dependence of the model's fit on the number of solver iterations for a single-layer ANIE architecture. Mean squared errors and standard deviations are reported. For higher numbers of iterations, the test error decreases in a statistically significant manner ($P<0.0001$).
iter $=1$ | iter $=2$ | iter $=4$ | iter $=6$ | iter $=8$
0.06532 ± 0.00996 | 0.04918 ± 0.00794 | 0.04591 ± 0.00780 | 0.04563 ± 0.00760 | 0.04371 ± 0.00719
Table 8: Dependence of the model's fit on the number of solver iterations for a 4-layer ANIE architecture. Mean squared errors and standard deviations are reported. For higher numbers of iterations, the test error decreases in a statistically significant manner ($P<0.0001$).
iter $=1$ | iter $=2$ | iter $=3$ | iter $=4$ | iter $=5$
0.04485 ± 0.00766 | 0.04417 ± 0.00764 | 0.04359 ± 0.00773 | 0.04229 ± 0.00711 | 0.04196 ± 0.00714
Figure 15: Dependence of the model's fit on the number of solver iterations for a single-layer ANIE architecture. The figure represents the results in Table 7.

Appendix E Computational cost

We now give more details regarding the computational cost of our models.

The theoretical cost of NIE per iteration is of order $N\times T$, where $N$ is the number of Monte Carlo sampling points and $T$ is the number of time points used in the solver. This has to be multiplied by the number of iterations, which, for example, was taken to be $3$ in the experiments on training speed.

For ANIE, we have performed our experiments using a linear version of self-attention, whose computational cost is linear in the number of spacetime points used (which changes depending on the resolution of the dataset). So, for a spacetime grid $\Omega_{n}\subset\Omega$ consisting of $n$ space points and a grid $T_{m}\subset I$ consisting of $m$ time points, the computational cost is of order $n\cdot m$ times the number of solver iterations. The number of iterations for ANIE varied between $3$ and $7$ throughout the experiments. We observe that quadratic attention would instead result in a computational cost of order $(nm)^{2}\cdot r$, where $r$ is the number of iterations of the solver.

Appendix F Artificial Dataset Generation

F.1 Lotka-Volterra System

The Lotka-Volterra equations are a classic system of nonlinear differential equations that model the interaction between two populations. The equations are given by:

\frac{dx}{dt}=\beta x-\alpha xy
\frac{dy}{dt}=\delta xy-\gamma y

where $\alpha$ and $\delta$ define the population interaction terms, and $\beta$ and $\gamma$ are the intrinsic growth and decay rates of populations $x$ and $y$, respectively. To generate our dataset, 100 values of $\alpha,\beta,\delta$ and $\gamma$ have been randomly generated and the corresponding system has been solved with a fixed initial condition. Our code was adapted from https://scipy-cookbook.readthedocs.io/items/LoktaVolterraTutorial.html. An example visualization of a solution is given in Figure 16.
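A minimal sketch of this generation procedure is given below, assuming the form of the system displayed above; the parameter ranges, time grid, and initial condition are illustrative placeholders rather than the exact values used for the dataset.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(t, xy, alpha, beta, delta, gamma):
    x, y = xy
    return [beta * x - alpha * x * y,       # prey: intrinsic growth minus interaction
            delta * x * y - gamma * y]      # predator: interaction gain minus intrinsic decay

rng = np.random.default_rng(0)
t_eval = np.linspace(0, 15, 100)
x0 = [10.0, 5.0]                            # fixed initial condition (illustrative)

curves = []
for _ in range(100):                        # 100 randomly generated parameter sets
    alpha, beta, delta, gamma = rng.uniform(0.1, 1.5, size=4)
    sol = solve_ivp(lotka_volterra, (0, 15), x0, t_eval=t_eval,
                    args=(alpha, beta, delta, gamma))
    curves.append(sol.y.T)                  # one (time, 2) trajectory per parameter set
dataset = np.stack(curves)                  # shape (100, 100, 2)
```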

Figure 16: Example visualization of Lotka-Volterra 2D System

F.2 Lorenz System

The Lorenz system is a 3-dimensional system of ordinary differential equations modelling atmospheric convection. This system is known to be chaotic, which means that small variations in the initial conditions can significantly affect the final trajectory. The system is given by:

\frac{dx}{dt}=\sigma(y-x)
\frac{dy}{dt}=x(\rho-z)-y
\frac{dz}{dt}=xy-\beta z

We have sampled 100 random initial conditions and solved the system with the same fixed parameters. Our code was adapted from https://github.com/gboeing/lorenz-system. An example visualization with default hyperparameters is given in Figure 17.
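A corresponding sketch for this dataset is given below; the classical parameter values $\sigma=10$, $\rho=28$, $\beta=8/3$, the sampling range of the initial conditions, and the time grid are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, xyz, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = xyz
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

rng = np.random.default_rng(0)
t_eval = np.linspace(0, 4, 400)
dataset = np.stack([
    solve_ivp(lorenz, (0, 4), rng.uniform(-15, 15, size=3), t_eval=t_eval).y.T
    for _ in range(100)                     # 100 random initial conditions, fixed parameters
])                                          # shape (100, 400, 3)
```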

Figure 17: Example visualization of 3D Lorenz Dynamical System

F.3 Integral Equation Spirals

The 2D integral equation spirals have been obtained by solving an IE with the following form:

\mathbf{y}(t)=\int_{0}^{t}\begin{bmatrix}\cos 2\pi(t-s)&-\sin 2\pi(t-s)\\ -\sin 2\pi(t-s)&-\cos 2\pi(t-s)\end{bmatrix}\tanh(2\pi\mathbf{y}(s))\,ds+\mathbf{z}_{0}+\begin{bmatrix}\cos(t)\\ \cos(t+\pi)\end{bmatrix},

where $\mathbf{z}_{0}$ was sampled from a uniform distribution.

The equation has been solved numerically through our solver (with analytical functions instead of neural networks) for different known functions $f$ corresponding to different choices of $\mathbf{z}_{0}$. An example visualization of a solution is given in Figure 18.
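The following is a minimal numerical analogue of this procedure, with a fixed-point iteration and trapezoidal quadrature in place of the learned operator; the grid size, number of iterations, and the sampling range of $\mathbf{z}_{0}$ are illustrative.

```python
import numpy as np

def kernel(t, s):
    """The 2x2 rotation-like kernel of the spiral IE above."""
    c, sn = np.cos(2 * np.pi * (t - s)), np.sin(2 * np.pi * (t - s))
    return np.array([[c, -sn], [-sn, -c]])

def solve_spiral_ie(z0, n_t=100, n_iter=30):
    """Fixed-point iteration y <- f + (integral operator applied to y), Volterra in time."""
    t = np.linspace(0, 1, n_t)
    f = z0[None, :] + np.stack([np.cos(t), np.cos(t + np.pi)], axis=1)   # free function f(t)
    y = f.copy()
    for _ in range(n_iter):
        integral = np.zeros_like(y)
        for i in range(1, n_t):
            # integrand evaluated on the grid points s <= t_i
            vals = np.stack([kernel(t[i], t[j]) @ np.tanh(2 * np.pi * y[j])
                             for j in range(i + 1)])
            integral[i] = np.trapz(vals, t[: i + 1], axis=0)
        y = f + integral
    return t, y

rng = np.random.default_rng(0)
t, y = solve_spiral_ie(z0=rng.uniform(-1.0, 1.0, size=2))   # y has shape (n_t, 2)
```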

Figure 18: Example visualization of 2D Integral Equation Spiral

F.4 FMRI data generation

The simulated fMRI data was generated using neurolib [CJO21]. The authors of this tool provide code to generate resting-state fMRI data from a given structural connectivity matrix and a delay matrix. The code can be found on their GitHub page (https://github.com/neurolib-dev/neurolib/blob/master/examples/example-0-aln-minimal.ipynb). We used this code to generate 100000 time points of data for 80 voxels corresponding to regions of the cortex.

The generated data is normalized by computing the z-score of the logarithm of the whole dataset. The data is then downsampled in time by a factor of 10, resulting in 10k time points. In our tests, we use the first 5k points, where the first 2.5k points are used for training and the remaining points are reserved for testing. During batching, each point is taken as the initial condition of a curve of length 20 points.
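A sketch of this preprocessing pipeline is shown below; the random stand-in for the neurolib output (assumed positive so that the logarithm is defined) and the array names are illustrative.

```python
import numpy as np

# `raw` stands for the simulated traces from neurolib, shape (100000, 80); here a
# strictly positive random stand-in is used so that the logarithm is defined.
rng = np.random.default_rng(0)
raw = rng.lognormal(size=(100000, 80))

log_data = np.log(raw)
z = (log_data - log_data.mean()) / log_data.std()    # z-score of the log of the whole data
z = z[::10]                                          # downsample in time by a factor of 10 -> 10k points

train, test = z[:2500], z[2500:5000]                 # first 5k points, half training / half testing

# During batching, each time point is the initial condition of a curve of 20 points.
window = 20
train_curves = np.stack([train[i:i + window]
                         for i in range(len(train) - window + 1)])   # (n_curves, 20, 80)
```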

Appendix G Calcium imaging dataset

C57BL/6J mice were kept on a 12h light/dark cycle, provided with food and water ad libitum, and housed individually following headpost implants. Imaging experiments were performed during the light phase of the cycle. For mesoscopic imaging, brain-wide expression of jRCaMP1b was achieved via postnatal sinus injection as described in [BHS+20, HSF+20].

Briefly, P0-P1 litters were removed from their home cage and placed on a heating pad. Pups were kept on ice for 5 min to induce anesthesia via hypothermia and then maintained on a metal plate surrounded by ice for the duration of the injection. Pups were injected bilaterally with 4 µl of AAV9-hsyn-NES-jRCaMP1b ($2.5\times 10^{13}$ gc/ml, Addgene). Mice also received an injection of AAV9-hsyn-ACh3.0 to express the genetically encoded cholinergic sensor ACh3.0 (Jing et al., 2020), although these data were not used in the present study. Once the entire litter was injected, pups were returned to their home cage.

Surgical procedures were performed on sinus-injected animals once they reached adulthood (>P50). Mice were anesthetized using 1-2% isoflurane and maintained at 37ºC for the duration of the surgery. For mesoscopic imaging, the skin and fascia above the skull were removed from the nasal bone to the posterior of the intraparietal bone and laterally between the temporal muscles. The surface of the skull was thoroughly cleaned with saline and the edges of the incision secured to the skull with Vetbond. A custom titanium headpost for head fixation was secured to the posterior of the nasal bone with transparent dental cement (Metabond, Parkell), and a thin layer of dental cement was applied to the entire dorsal surface of the skull. Next, a layer of cyanoacrylate (Maxi-Cure, Bob Smith Industries) was used to cover the skull and left to cure for approximately 30 min at room temperature to provide a smooth surface for trans-cranial imaging.

Mesoscopic calcium imaging was performed using a Zeiss Axiozoom with a 1x, 0.25 NA objective with a 56 mm working distance (Zeiss). Epifluorescent excitation was provided by an LED bank (Spectra X Light Engine, Lumencor) using two output wavelengths: 395/25 (isosbestic for ACh3.0, Lohani et al., 2020) and 575/25nm (jRCaMP1b). Emitted light passed through a dual camera image splitter (TwinCam, Cairn Research) then through either a 525/50 (ACh3.0) or 630/75 (jRCaMP1b) emission filter (Chroma) before it reached two sCMOS cameras (Orca-Flash V3, Hamamatsu). Images were acquired at 512x512 resolution after 4x pixel binning. Each channel was acquired at 10 Hz with 20 ms exposure using HCImage software (Hamamatsu).

For visual stimulation, sinusoidal drifting gratings (2 Hz, 0.04 cycles/degree) were generated using custom-written functions based on Psychtoolbox in Matlab and presented on an LCD monitor at a distance of 20 cm from the right eye. Stimuli were presented for 2 seconds with a 5 second inter-stimulus interval.

Imaging frames were grouped by excitation wavelength (395 nm, 470 nm, and 575 nm) and downsampled from 512×512 to 256×256 pixels. Detrending was applied using a low-pass filter ($N=100$, $f_{cutoff}=0.001$ Hz). Time traces were obtained using $(\Delta F/F)_{i}=(F_{i}-F_{(i,o)})/F_{(i,o)}$, where $F_{i}$ is the fluorescence of pixel $i$ and $F_{(i,o)}$ is the corresponding low-pass filtered signal.

Hemodynamic artifacts were removed using a linear regression accounting for spatiotemporal dependencies between neighboring pixels. We used the isosbestic excitation of ACh3.0 (395 nm), co-expressed in these mice, as a means of measuring activity-independent fluctuations in fluorescence associated with hemodynamic signals. Briefly, given two $p\times 1$ random signals $y_{1}$ and $y_{2}$ corresponding to the $\Delta F/F$ of $p$ pixels for the two excitation wavelengths (“green” and “UV”), we consider the following linear model:

y_{1}=x+z+\eta, (10)
y_{2}=Az+\xi, (11)

where $x$ and $z$ are mutually uncorrelated $p\times 1$ random signals corresponding to the $p$ pixels of the neuronal and hemodynamic signals, respectively, $\eta$ and $\xi$ are white Gaussian $p\times 1$ noise signals, and $A$ is an unknown $p\times p$ real invertible matrix. We estimate the neuronal signal as the optimal linear estimator for $x$ (in the sense of minimum mean squared error):

\hat{x}=H\begin{pmatrix}y_{1}\\ y_{2}\end{pmatrix}, (14)
H=\Sigma_{xy}\,\Sigma_{y}^{-1}, (15)

where $y=\begin{pmatrix}y_{1}\\ y_{2}\end{pmatrix}$ is obtained by stacking $y_{1}$ on top of $y_{2}$, $\Sigma_{y}=E[yy^{T}]$ is the autocorrelation matrix of $y$, and $\Sigma_{xy}=E[xy^{T}]$ is the cross-correlation matrix between $x$ and $y$. The matrix $\Sigma_{y}$ is estimated directly from the observations, and the matrix $\Sigma_{xy}$ is estimated by:

\Sigma_{xy}=\Bigl(\;\Sigma_{y_{1}}-\sigma_{\eta}^{2}I-\bigl(\Sigma_{y_{1}y_{2}}\,(\Sigma_{y_{2}}-\sigma_{\xi}^{2}I)^{-1}\,\Sigma_{y_{2}}^{-1}\,\Sigma_{y_{1}y_{2}}^{T}\bigr)^{T}\quad 0\;\Bigr), (16)

where $\sigma_{\eta}^{2}$ and $\sigma_{\xi}^{2}$ are the noise variances of $\eta$ and $\xi$, respectively, and $I$ is the $p\times p$ identity matrix. The noise variances $\sigma_{\eta}^{2}$ and $\sigma_{\xi}^{2}$ are evaluated according to the median of the singular values of the corresponding correlation matrices $\Sigma_{y_{1}}$ and $\Sigma_{y_{2}}$. This analysis is usually performed in patches, where the patch size $p$ is determined by the number of available time samples and the number of estimated parameters. In the present study, we used a patch size of $p=9$. The final activity traces were obtained by z-scoring the corrected $\Delta F/F$ signals per pixel. The dimensionality of the resulting video is then reduced via PCA to 10 components, which represent approximately 80% of the data variance.
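As a reference for this procedure, the following is a minimal transcription of the estimator in Eqs. (14)-(16) for a single patch; the empirical correlation estimates, the function name, and the random test signals are illustrative assumptions rather than the exact analysis code.

```python
import numpy as np

def correct_hemodynamics(y1, y2):
    """Optimal linear estimate of the neuronal signal x from the two channels,
    following Eqs. (14)-(16) above; y1, y2 have shape (p, T) for one patch."""
    p, T = y1.shape
    y = np.vstack([y1, y2])                           # stack y1 on top of y2 -> (2p, T)
    S_y = y @ y.T / T                                 # autocorrelation matrix of y
    S_y1, S_y2 = y1 @ y1.T / T, y2 @ y2.T / T
    S_y1y2 = y1 @ y2.T / T
    I = np.eye(p)

    # Noise variances from the medians of the singular values of the channel correlations.
    var_eta = np.median(np.linalg.svd(S_y1, compute_uv=False))
    var_xi = np.median(np.linalg.svd(S_y2, compute_uv=False))

    # Cross-correlation estimate of Eq. (16): nonzero block for y1, zero block for y2.
    inner = S_y1y2 @ np.linalg.inv(S_y2 - var_xi * I) @ np.linalg.inv(S_y2) @ S_y1y2.T
    S_xy = np.hstack([S_y1 - var_eta * I - inner.T, np.zeros((p, p))])

    H = S_xy @ np.linalg.inv(S_y)                     # Eq. (15)
    return H @ y                                      # Eq. (14), shape (p, T)

rng = np.random.default_rng(0)
x_hat = correct_hemodynamics(rng.normal(size=(9, 500)), rng.normal(size=(9, 500)))
```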

Appendix H Burgers’ equations

The Burgers’ equation is a quasilinear parabolic partial differential equation that takes the form

\frac{\partial u}{\partial t}+u\frac{\partial u}{\partial x}=\nu\frac{\partial^{2}u}{\partial x^{2}}, (17)

where $x$ is the spatial dimension, $t$ indicates time, and $\nu$ is a diffusion coefficient called viscosity; see [BP72]. A very interesting behavior of the solutions of the Burgers' equation is the presence of shock waves.

Our dataset is generated using the Matlab code of [LKA+21], which can be found on their GitHub page (https://github.com/zongyi-li/fourier_neural_operator/tree/master/data_generation/burgers). The solution is given on a spatial mesh of 1024 points, and 400 time points are generated from a random initial condition. We use 1000 curves for training and test on 200 unseen curves, where the time interval used for testing spans 1/4 of the original time interval.

Appendix I Navier-Stokes equations

The Navier-Stokes equations are partial differential equations that arise in fluid mechanics, where they are used to describe the motion of viscous fluids. They are derived from the conservation laws (for momentum and mass) for Newtonian fluids subject to an external force with the addition of pressure and friction forces, where the unknown function indicates the velocity vector of the fluid [Cho68, Fef00]. Their expression is given by the system

\frac{\partial}{\partial t}u_{i}+\sum_{j}u_{j}\frac{\partial u_{i}}{\partial x_{j}}=\nu\Delta u_{i}-\frac{\partial p}{\partial x_{i}}+f_{i}(\mathbf{x},t), (18)
{\rm div}\,u=\sum_{i}\frac{\partial u_{i}}{\partial x_{i}}=0, (19)

where $\Delta$ is the Laplacian operator, $f$ is the external force, and $\mathbf{u}$ is the unknown velocity function. We experiment on the same dataset for $\nu=10^{-3}$ of [LKA+21], which can be found on their GitHub page (https://github.com/zongyi-li/fourier_neural_operator/tree/master/data_generation/navier_stokes). They solved the viscous, incompressible 2D Navier-Stokes equation for the vorticity on the unit torus, hence with periodic boundary conditions. The initial time point is sampled from a Gaussian distribution. The forcing term is a linear combination of sine and cosine functions depending only on space, and independent of time. The numerical method for the solution of the equation is pseudospectral, for the vorticity-streamfunction formulation. The solver scheme follows these steps: (1) the Poisson equation is solved, (2) the vorticity is differentiated, and (3) the nonlinear term is added. A Crank-Nicolson update is used to advance time. Details can be found in Appendix A.3.3 of [LKA+21].

We use 4000 instances for training and 1000 for testing. In our tasks, we utilize a single time point to initialize our model (ANIE) and obtain the full dynamics from a single frame. For the other models, we use the minimal number of time points they allow. This is not always possible: for instance, FNO3D cannot be applied to a single time point, or to a few time points, since the time convolution needs several time points to produce meaningful results. Despite this significant advantage given to FNO3D, ANIE (ours) still performs better on the prediction of 10 and 20 time points.

Appendix J Additional details on experiments and computational resources

The numbers of parameters for the models used in the experiments are given in Tables 9 and 10. In all cases, the Adam optimizer has been employed. Experiments have been run on a 16GB NVIDIA A100 GPU.

Table 9: Number of parameters for the fMRI and 2D IE curves experiments
 | ANIE | NODE | LSTM | Residual Net
Generated fMRI | 321,233 | 319,280 | 328,400 | 319,280
2D curves | 18,819 | 1,654 | 20,862 | -
Table 10: Number of parameters for the PDE experiments. The ViT models and ResNet have a variable number of parameters for the Burgers' equation, depending on the space and time resolutions; we report the minimum and maximum number of parameters across these models.
 | Burgers | Navier-Stokes
LSTM | - | 302,059,520
FNO3D | - | 6,558,537
Galerkin | 511,329 | -
Conv1DLSTM | 149,520 | -
ResNet | 425,056-578,912 | -
Conv2DLSTM | - | 447,528
ViT | 3,823,904-6,467,872 | 105,258,016
ViTsmall | 4,497,828-6,483,876 | 105,321,636
ViTparallel | 6,979,328-9,623,296 | 126,260,224
ViT3D | - | 105,340,096
ANIE | 161,154 | 1,278,627