Scalable marginalization of correlated latent variables with applications to learning particle interaction kernels

Mengyang Gu^†¹¹1Correspondence should be addressed to Mengyang Gu (mengyang@pstat.ucsb.edu ) , Xubo Liu^†, Xinyi Fang^†, Sui Tang^‡

^† Department of Statistics and Applied Probability, University of California, Santa Barbara, CA
^‡ Department of Mathematics, University of California, Santa Barbara, CA

Abstract

Marginalization of latent variables or nuisance parameters is a fundamental aspect of Bayesian inference and uncertainty quantification. In this work, we focus on scalable marginalization of latent variables in modeling correlated data, such as spatio-temporal or functional observations. We first introduce Gaussian processes (GPs) for modeling correlated data and highlight the computational challenge, where the computational complexity increases cubically fast along with the number of observations. We then review the connection between the state space model and GPs with Matérn covariance for temporal inputs. The Kalman filter and Rauch-Tung-Striebel smoother were introduced as a scalable marginalization technique for computing the likelihood and making predictions of GPs without approximation. We summarize recent efforts on extending the scalable marginalization idea to the linear model of coregionalization for multivariate correlated output and spatio-temporal observations. In the final part of this work, we introduce a novel marginalization technique to estimate interaction kernels and forecast particle trajectories. The computational progress lies in the sparse representation of the inverse covariance matrix of the latent variables, then applying conjugate gradient for improving predictive accuracy with large data sets. The computational advances achieved in this work outline a wide range of applications in molecular dynamic simulation, cellular migration, and agent-based models.

KEYWORDS: Marginalization, Bayesian inference, Scalable computation, Gaussian process, Kalman filter, Particle interaction

1 Introduction

Given a set of latent variables in a model, do we fit a model with a particular set of latent variables, or do we integrate out the latent variables when making predictions? Marginalization of latent variables is an iconic feature of the Bayesian analysis. The art of marginalization in statistics can at least be traced back to the De Finetti’s theorem [12], which states that an infinite sequence $\{X_{i}\}^{\infty}_{i=1}$ is exchangeable, if and if only if there exists a random variable $\theta\in\Theta$ with probability distribution $\pi(\cdot)$ , and a conditional distribution $p(\cdot\mid\theta)$ , such that

p(x_{1},...,x_{N})=\int\left\{\prod^{N}_{i=1}p(x_{i}\mid\theta)\right\}\pi(\theta)d\theta.

(1)

Marginalization of nuisance parameters for models with independent observations has been comprehensively reviewed in [7]. Bayesian model selection [8, 4] and Bayesian model averaging [42], as two other examples, both rely on the marginalization of parameters in each model.

For spatially correlated data, the Jefferys prior of the covariance parameters in a Gaussian process (GP), which is proportional to the squared root of the Fisher information matrix of the likelihood, often leads to improper posteriors [6]. The posterior of the covariance parameter becomes proper if the prior is derived based on the Fisher information matrix of the marginal likelihood, after marginalizing out the mean and variance parameters. The resulting prior, after marginalization, is a reference prior, which has been studied for modeling spatially correlated data and computer model emulation [39, 46, 28, 20, 37].

Marginalization of latent variables has lately been aware by the machine learning community as well, for purposes of uncertainty quantification and propagation. In [29], for instance, the deep ensembles of models with a scoring function were proposed to assess the uncertainty in deep neural networks, and it is closely related to Bayesian model averaging with a uniform prior on parameters. This approach was further studied in [64], where the importance of marginalization is highlighted. Neural networks with infinite depth were shown to be equivalent to a GP with a particular kernel function in [38], and it was lately shown in [32] that the results of deep neural networks can be reproduced by GPs, where the latent nodes are marginalized out.

In this work, we study the marginalization of latent variables for correlated data, particularly focusing on scalable computation. Gaussian processes have been ubiquitously used for modeling spatially correlated data [3] and emulating computer experiments [50]. Computing the likelihood in GPs and making predictions, however, cost $\mathcal{O}(N^{3})$ operations, where $N$ is the number of observations, due to finding the inverse and determinant of the covariance matrix. To overcome the computational bottleneck, various approximation approaches, such as inducing point approaches [54], fixed rank approximation [10], integrated nested Laplace approximation [48], stochastic partial differential equation representation [33], local Gaussian process approximation [15], and hierarchical nearest-neighbor Gaussian process models [11], circulant embedding [55], many of which can be summarized into the framework of Vecchia approximation [59, 27]. Scalable computation of a GP model with a multi-dimensional input space and a smooth covariance function is of great interest in recent years.

The exact computation of GP models with smaller computational complexity was less studied in past. To fill this knowledge gap, we will first review the stochastic differential equation representation of a GP with the Matérn covariance and one-dimensional input variable [63, 22], where the solution can be written as a dynamic linear model [62]. Kalman filter and Rauch–Tung–Striebel smoother [26, 45] can be implemented for computing the likelihood function and predictive distribution exactly, reducing the computational complexity of GP using a Matérn kernel with a half-integer roughness parameter and 1D input from $\mathcal{O}(N^{3})$ to $\mathcal{O}(N)$ operations. Here, interestingly, the latent states of a GP model are marginalized out in Kalman Filter iteratively. Thus the Kalman filter can be considered as an example of marginalization of latent variables, which leads to efficient computation. Note that the Kalman filter is not directly applicable for GP with multivariate inputs, yet GPs with some of the widely used covariance structures, such as the product or separable kernel [5] and linear model of coregionalization [3], can be written as state space models on an augmented lattice [19, 17]. Based on this connection, we introduce a few extensions of scalable marginalization for modeling incomplete matrices of correlated data.

The contributions of this work are twofold. First, the computational scalability and efficiency of marginalizing latent variables for models of correlated data and functional data are less studied. Here we discuss the marginalization of latent states in the Kalman filter in computing the likelihood and making predictions, with only $\mathcal{O}(N)$ computational operations. We discuss recent extensions on structured data with multi-dimensional input. Second, we develop new marginalization techniques to estimate interaction kernels of particles and to forecast trajectories of particles, which have wide applications in agent-based models [9], cellular migration [23], and molecular dynamic simulation [43]. The computational gain comes from the sparse representation of inverse covariance of interaction kernels, and the use of the conjugate gradient algorithm [24] for iterative computation. Specifically, we reduce the computational order from $\mathcal{O}((nMDL)^{3})+\mathcal{O}(n^{4}L^{2}M^{2}D)$ operations in recent studies [34, 13] to $\mathcal{O}(Tn^{2}MDL)+\mathcal{O}(n^{2}MDL\log(nMDL))$ operations based on training data of $M$ simulation runs, each containing $n$ particles in a $D$ dimensional space at $L$ time points, with $T$ being the number of iterations in the sparse conjugate gradient algorithm. This allows us to estimate interaction kernels of dynamic systems with many more observations. Here the sparsity comes from the use of the Matérn kernel, which is distinct from any of the approximation methods based on sparse covariance structures.

The rest of the paper is organized below. We first introduce the GP as a surrogate model for approximating computationally expensive simulations in Section 2. The state space model representation of a GP with Matérn covariance and temporal input is introduced in Section 3.1. We then review the Kalman filter as a computationally scalable technique to marginalize out latent states for computing the likelihood of a GP model and making predictions in Section 3.2. In Section 3.3, we discuss the extension of latent state marginalization in linear models of coregionaliztion for multivariate functional data, spatial and spatio-temporal data on the incomplete lattice. The new computationally scalable algorithm for estimating interaction kernel and forecasting particle trajectories is introduced in Section 4. We conclude this study and discuss a few potential research directions in Section 5. The code and data used in this paper are publicly available: https://github.com/UncertaintyQuantification/scalable_marginalization.

2 Background: Gaussian process

We briefly introduce the GP model in this section. We focus on computer model emulation, where the GP emulator is often used as a surrogate model to approximate computer experiments [52]. Consider a real-valued unknown function $z(\cdot)$ , modeled by a Gaussian stochastic process (GaSP) or Gaussian process (GP), $z(\cdot)\sim\mathcal{GP}(\mu(\cdot),\sigma^{2}K(\cdot,\cdot))$ , meaning that, for any inputs $\{\mathbf{x}_{1},\ldots,\mathbf{x}_{N}\}$ (with $\mathbf{x}_{i}$ being a $p\times 1$ vector), the marginal distribution of $\mathbf{z}=(z(\mathbf{x}_{1}),...,z(\mathbf{x}_{N}))^{T}$ follows a multivariate normal distribution,

\mathbf{z}\mid\bm{\beta},\,\sigma^{2},\,{\bm{\gamma}}\sim\mathcal{MN}(\bm{\mu},\sigma^{2}{\mathbf{R}})\,,

(2)

where $\bm{\mu}=(\mu(\mathbf{x}_{1}),...,\mu(\mathbf{x}_{N}))^{T}$ is a vector of mean or trend parameters, $\sigma^{2}$ is the unknown variance and ${\mathbf{R}}$ is the correlation matrix with the $(i,j)$ element modeled by a kernel $K\,(\mathbf{x}_{i},\mathbf{x}_{j})$ with parameters $\bm{\gamma}$ . It is common to model the mean by $\mu(\mathbf{x})=\mathbf{h}(\mathbf{x}){\bm{\beta}}\,$ , where $\mathbf{h}(\mathbf{x})$ is a $1\times q$ row vector of basis function, and $\bm{\beta}$ is a $q\times 1$ vector of mean parameters.

When modeling spatially correlated data, the isotropic kernel is often used, where the input of the kernel only depends on the Euclidean distance $K(\mathbf{x}_{a},\mathbf{x}_{b})=K(||\mathbf{x}_{a}-\mathbf{x}_{b}||)$ . In comparison, each coordinate of the latent function in computer experiments could have different physical meanings and units. Thus a product kernel is often used in constructing a GP emulator, such that correlation lengths can be different at each coordinate. For any $\mathbf{x}_{a}=(x_{a1},\ldots,x_{ap})$ and $\mathbf{x}_{b}=(x_{b1},\ldots,x_{bp})$ , the kernel function can be written as $K(\mathbf{x}_{a},\mathbf{x}_{b})=K_{1}(x_{a1},x_{b1})\times...\times K_{p}(x_{ap},x_{bp})$ , where $K_{l}$ is a kernel for the $l$ th coordinate with a distinct range parameter $\gamma_{l}$ , for $l=1,...,p$ . Some frequently used kernels $K_{l}$ include power exponential and Matérn kernel functions [44]. The Matérn kernel, for instance, follows

K_{l}(d_{l})=\frac{1}{2^{\nu_{l}-1}\Gamma(\nu_{l})}\left(\frac{\sqrt{2\nu_{l}}d_{l}}{\gamma_{l}}\right)^{\nu_{l}}\mathcal{K}_{\nu_{l}}\left(\frac{\sqrt{2\nu_{l}}d_{l}}{\gamma_{l}}\right),

(3)

where $d_{l}=|x_{al}-x_{bl}|$ , $\Gamma(\cdot)$ is the gamma function, $\mathcal{K}_{\nu_{l}}(\cdot/\gamma_{l})$ is the modified Bessel function of the second kind with the range parameter and roughness parameter being $\gamma_{l}$ and $\nu_{l}$ , respectively. The Matérn correlation has a closed-form expression when the roughness parameter is a half-integer, i.e. $\nu_{l}={2k_{l}+1}/{2}$ with $k_{l}\in\mathbb{N}$ . It becomes the exponential correlation and Gaussian correlation, when $k_{l}=0$ and $k_{l}\to\infty$ , respectively. The GP with Matérn kernel is $\lfloor\nu_{l}-1\rfloor$ mean square differentiable at coordinate $l$ . This is a good property, as the differentiability of the process is directly controlled by the roughness parameter.

Denote mean basis of observations $\mathbf{H}=(\mathbf{h}^{T}(\mathbf{x}_{1}),...,\mathbf{h}^{T}(\mathbf{x}_{N}))^{T}$ . The parameters in GP contain mean parameters $\bm{\beta}$ , variance parameter $\sigma^{2}$ , and range parameters $\bm{\gamma}=(\gamma_{1},...,\gamma_{p})$ . Integrating out the mean and variance parameters with respect to reference prior $\pi(\bm{\beta},\sigma)\propto 1/\sigma^{2}$ , the predictive distribution of any input $\mathbf{x}^{*}$ follows a student t distribution [20]:

z({\mathbf{x}^{*}})\mid\mathbf{z},\,\bm{\gamma}\sim\mathcal{T}(\hat{z}({\mathbf{x}}^{*}),\hat{\sigma}^{2}K^{**},N-q)\,,

(4)

with $N-q$ degrees of freedom, where

$\displaystyle\hat{z}({\mathbf{x}}^{*})=$	$\displaystyle{\mathbf{h}({\mathbf{x}}^{})}\hat{\bm{\beta}}+\mathbf{r}^{T}(\mathbf{x}^{}){{\mathbf{R}}}^{-1}\left(\mathbf{z}-\mathbf{H}\hat{\bm{\beta}}\right),$	(5)
$\displaystyle\hat{\sigma}^{2}=$	$\displaystyle(N-q)^{-1}{\left(\mathbf{z}-\mathbf{H}\hat{\bm{\beta}}\right)}^{T}{{\mathbf{R}}}^{-1}\left({\mathbf{z}}-\mathbf{H}\hat{\bm{\beta}}\right),$	(6)
$\displaystyle K^{**}=$	$\displaystyle K({\mathbf{x}^{}},{\mathbf{x}^{}})-{\mathbf{r}^{T}(\mathbf{x}^{}){{\mathbf{R}}}^{-1}\mathbf{r}(\mathbf{x}^{})}+{\mathbf{h}}^{}(\mathbf{x}^{})^{T}$
	$\displaystyle\times\left(\mathbf{H}^{T}{{\mathbf{R}}}^{-1}\mathbf{H}\right)^{-1}{\mathbf{h}}^{}(\mathbf{x}^{}),$	(7)

with ${\mathbf{h}}^{*}(\mathbf{x}^{*})=\left({{\mathbf{h}(\mathbf{x}^{*})}}-\mathbf{H}^{T}{{\mathbf{R}}}^{-1}\mathbf{r}(\mathbf{x}^{*})\right)$ , $\hat{\bm{\beta}}=\left(\mathbf{H}^{T}{{\mathbf{R}}}^{-1}\ \mathbf{H}\right)^{-1}\mathbf{H}^{T}{{\mathbf{R}}}^{-1}\mathbf{z}$ being the generalized least squares estimator of $\bm{\beta}$ , and $\mathbf{r}(\mathbf{x}^{*})=(K(\mathbf{x}^{*},{\mathbf{x}}_{1}),\ldots,K(\mathbf{x}^{*},{\mathbf{x}}_{N}))^{T}$ .

Direct computation of the likelihood requires $\mathcal{O}(N^{3})$ operations due to computing the Cholesky decomposition of the covariane matrix for matrix inversion, and the determinant of the covariance matrix. Thus a posterior sampling algorithm, such as the Markov chain Monte Carlo (MCMC) algorithm can be slow, as it requires a large number of posterior samples. Plug-in estimators, such as the maximum likelihood estimator (MLE) were often used to estimate the range parameters $\bm{\gamma}$ in covariance. In [20], the maximum marginal posterior estimator (MMPE) with robust parameterizations was discussed to overcome the instability of the MLE. The MLE and MMPE of the parameters in a GP emulator with both the product kernel and the isotropic kernel are all implemented in the RobustGaSP package [18].

In some applications, we may not directly observe the latent function but a noisy realization:

y(\mathbf{x})=z(\mathbf{x})+\epsilon(\mathbf{x}),

(8)

where $z(.)$ is modeled as a zero-mean GP with covariance $\sigma^{2}K(.,.)$ , and $\epsilon(\mathbf{x})\sim\mathcal{N}(0,\sigma^{2}_{0})$ follows an independent Gaussian noise. This model is typically referred to as the Gaussian process regression [44], which is suitable for scenarios containing noisy observations, such as experimental or field observations, numerical solutions of differential equations with non-negligible error, and stochastic simulations. Denote the noisy observations $\mathbf{y}=(y(\mathbf{x}_{1}),y(\mathbf{x}_{2}),...,y(\mathbf{x}_{N}))^{T}$ at the design input set $\{\mathbf{x}_{1},\mathbf{x}_{2},...,\mathbf{x}_{N}\}$ and the nugget parameter $\eta=\sigma^{2}_{0}/\sigma^{2}$ . Both range and nugget parameters in GPR can be estimated by the plug-in estimators [18]. The predictive distribution of $f(\mathbf{x}^{*})$ at any input $\mathbf{x}^{*}$ can be obtained by replacing $\mathbf{R}$ with $\mathbf{\tilde{R}}=\mathbf{R}+\eta\mathbf{I}_{n}$ in Equation (4).

Constructing a GP emulator to approximate computer simulation typically starts with a “space-filling” design, such as the Latin hypercube sampling (LHS), to fill the input space. Numerical solutions of computer models were then obtained at these design points, and the set $\{(\mathbf{x}_{i},y_{i})\}^{N}_{i=1}$ is used for training a GP emulator. For any observed input $\mathbf{x}^{*}$ , the predictive mean in (4) is often used for predictions, and the uncertainty of observations can be obtained through the predictive distribution. in Figure 1, we plot the predictive mean of a GP emulator to approximate the Branin function [56] with N training inputs sampled from LHS, using the default setting of the RobustGaSP package [18]. When the number of observations increases from $N=12$ (middle panel) to $N=24$ (right panel), the predictive error becomes smaller.

Refer to caption — Figure 1: Predictions by the GP emulator of a function on 2D inputs with $N=12$ and $N=24$ observations (black circles) are shown in the left and right panels, respectively.

The computational complexity of GP models increases at the order of $\mathcal{O}(N^{3})$ , which prohibits applications on emulating complex computer simulations, when a relatively large number of simulation runs are required. In the next section, we will introduce the state space representation of GP with Matérn covariance and one-dimensional input, where the computational order scales as $\mathcal{O}(N)$ without approximation. This method can be applied to problems with high dimensional input space discussed in Section 4.

3 Marginalization in Kalman filter

3.1 State space representation of GP with the Matérn kernel

Suppose we model the observations by Equation (8) where the latent process $z(.)$ is assumed to follow a GP on 1D input. For simplicity, here we assume a zero mean parameter ( $\mu=0$ ), and a mean function can be easily included in the analysis. It has been realized that a GP defined in 1D input space using a Matérn covariance with a half-integer roughness parameter input can be written as stochastic differential equations (SDEs) [63, 22], which can reduce the operations of computing the likelihood and making predictions from $\mathcal{O}(N^{3})$ to $\mathcal{O}(N)$ operations, with the use of Kalman filter. Here we first review SDE representation and then we discuss marginalization of latent variables in the Kalman filter algorithm for scalable computation.

When the roughness parameter is $\nu=5/2$ , for instance, the Matérn kernel has the expression below

K(d)=\left(1+\frac{\sqrt{5}d}{\gamma}+\frac{5d^{2}}{3\gamma^{2}}\right)\exp\left(-\frac{\sqrt{5}d}{\gamma}\right),\,

(9)

where $d=|x_{a}-x_{b}|$ is the distance between any $x_{a},x_{b}\in\mathbb{R}$ and $\gamma$ is a range parameter typically estimated by data. The output and two derivatives of the GP with the Matérn kernel in (9) can be written as the SDE below,

\displaystyle\frac{d\bm{\theta}(x)}{dx}=\mathbf{J}\bm{\theta}(x)+\mathbf{L}\varepsilon(x),

(10)

or in the matrix form,

\frac{d}{dx}\begin{pmatrix}z(x)\\ z^{(1)}(x)\\ z^{(2)}(x)\end{pmatrix}=\begin{pmatrix}0&1&0\\ 0&0&1\\ -\lambda^{3}&-3\lambda^{2}&-3\lambda\end{pmatrix}\begin{pmatrix}z(x)\\ z^{(1)}(x)\\ z^{(2)}(x)\end{pmatrix}+\begin{pmatrix}0\\ 0\\ 1\end{pmatrix}\varepsilon(x),

where $\varepsilon(x)\sim N(0,\sigma^{2})$ , with $\lambda=\frac{\sqrt{2\nu}}{\gamma}=\frac{\sqrt{5}}{\gamma}$ , and $z^{(l)}(\cdot)$ is the $l$ th derivative of the process $z(\cdot)$ . Denote $c=\frac{16}{3}\sigma^{2}\lambda^{5}$ and $\mathbf{F}=(1,0,0)$ . Assume the 1D input is ordered, i.e. $x_{1}\leq x_{2}\leq...\leq x_{N}$ . The solution of SDE in (10) can be expressed as a continuous-time dynamic linear model [61],

\displaystyle\begin{split}y(x_{i})&=\mathbf{F}\bm{\theta}(x_{i})+\epsilon(x_{i}),\\ \bm{\theta}(x_{i})&=\mathbf{G}(x_{i})\bm{\theta}(x_{i-1})+\mathbf{w}(x_{i}),\,\end{split}

(11)

where $\mathbf{w}(x_{i})\sim\mathcal{MN}(0,\mathbf{W}(x_{i}))$ for $i=2,...,N$ , $\bm{\theta}(x_{1})\sim\mathcal{MN}(\mathbf{0},\mathbf{W}(x_{1}))$ and Gaussian noise follows $\epsilon(x_{i})\sim\mathcal{N}(0,\sigma^{2}_{0})$ . Here $\mathbf{G}(x_{i})=e^{\mathbf{J}(x_{i}-x_{i-1})}$ and $\mathbf{W}(x_{i})=\int^{x_{i}-x_{i-1}}_{0}e^{\mathbf{J}t}\mathbf{L}c\mathbf{L}^{T}e^{\mathbf{J}^{T}t}dt$ from $i=2,...,N$ , and stationary distribution $\bm{\theta}(x_{i})\sim\mathcal{MN}(0,\mathbf{W}(x_{1}))$ , with $\mathbf{W}(x_{1})=\int^{\infty}_{0}e^{\mathbf{J}t}\mathbf{L}c\mathbf{L}^{T}e^{\mathbf{J}^{T}t}dt.$ Both $\mathbf{G}(x_{i})$ and $\mathbf{W}(x_{i})$ have closed-form expressions given in the Appendix 5.1. The joint distribution of the states follows $\left(\bm{\theta}^{T}(x_{1}),...,\bm{\theta}^{T}(x_{N})\right)^{T}\sim\mathcal{MN}(\mathbf{0},\bm{\Lambda}^{-1})$ , where the inverse covariance $\bm{\Lambda}$ is a block tri-diagonal matrix discussed in Appendix 5.1.

3.2 Kalman filter as a scalable marginalization technique

For dynamic linear models in (11), Kalman filter and Rauch–Tung–Striebel (RTS) smoother can be used as an exact and scalable approach to compute the likelihood, and predictive distributions. The Kalman filter and RTS smoother are sometimes called the forward filtering and backward smoothing/sampling algorithm, widely used in dynamic linear models of time series. We refer the readers to [61, 41] for discussion of dynamic linear models.

Write $\mathbf{G}(x_{i})=\mathbf{G}_{i}$ , $\mathbf{W}(x_{i})=\mathbf{W}_{i}$ , $\bm{\theta}(x_{i})=\bm{\theta}_{i}$ and $y(x_{i})=y_{i}$ for $i=1,...,N$ . In Lemma 1, we summarize Kalman filter and RTS smoother for the dynamic linear model in (11). Compared with $\mathcal{O}(N^{3})$ computational operations and $\mathcal{O}(N^{2})$ storage cost from GPs, the outcomes of Kalman filter and RTS smoother can be used for computing the likelihood and predictive distribution with $\mathcal{O}(N)$ operations and $\mathcal{O}(N)$ storage cost, summarized in Lemma 1. All the distributions in Lemma 1 and Lemma 2 are conditional distributions given parameters $(\gamma,\sigma^{2},\sigma^{2}_{0})$ , which are omitted for simplicity.

Lemma 1 (Kalman Filter and RTS Smoother [26, 45]).

1. (Kalman Filter.) Let $\bm{\theta}_{i-1}|\mathbf{y}_{1:i-1}\sim\mathcal{MN}(\mathbf{m}_{i-1},\mathbf{C}_{i-1})$ . For $i=2,...,N$ , iteratively we have,

(i)

the one-step-ahead predictive distribution of $\bm{\theta}_{i}$ given $\mathbf{y}_{1:i-1}$ ,

$\bm{\theta}_{i}|\mathbf{y}_{1:i-1}\sim\mathcal{MN}(\mathbf{b}_{i},\mathbf{B}_{i}),\vspace{-.05in}$ (12)

with $\mathbf{b}_{i}=\mathbf{G}_{i}\mathbf{m}_{i-1}$ and $\mathbf{B}_{i}=\mathbf{G}_{i}\mathbf{C}_{i-1}\mathbf{G}^{T}_{i}+\mathbf{W}_{i}$ ,
(ii)

the one-step-ahead predictive distribution of $Y_{i}$ given $\mathbf{y}_{1:i-1}$ ,

$Y_{i}|\mathbf{y}_{1:i-1}\sim\mathcal{N}(f_{i},Q_{i}),\vspace{-.05in}$ (13)

with $f_{i}=\mathbf{F}\mathbf{b}_{i},$ and $Q_{i}=\mathbf{F}\mathbf{B}_{i}\mathbf{F}^{T}+\sigma^{2}_{0}$ ,
(iii)

the filtering distribution of $\bm{\theta}_{i}$ given $\mathbf{y}_{1:i}$ ,

$\bm{\theta}_{i}|\mathbf{y}_{1:i}\sim\mathcal{MN}(\mathbf{m}_{i},\mathbf{C}_{i}),\vspace{-.05in}$ (14)

with $\mathbf{m}_{i}=\mathbf{b}_{i}+\mathbf{B}_{i}\mathbf{F}^{T}Q^{-1}_{i}(y_{i}-f_{i})$ and $\mathbf{C}_{i}=\mathbf{B}_{i}-\mathbf{B}_{i}\mathbf{F}^{T}Q^{-1}_{i}\mathbf{F}\mathbf{B}_{i}$ .

2. (RTS Smoother.) Denote $\bm{\theta}_{i+1}|\mathbf{y}_{1:n}\sim\mathcal{N}(s_{i+1},S_{i+1})$ , then recursively for $i=N-1,...,1$ ,

\bm{\theta}_{i}|\mathbf{y}_{1:N}\sim\mathcal{MN}(\mathbf{s}_{i},\mathbf{S}_{i}),\vspace{-.05in}

(15)

where $\mathbf{s}_{i}=\mathbf{m}_{i}+\mathbf{C}_{i}\mathbf{G}^{T}_{i+1}\mathbf{B}^{-1}_{i+1}(\mathbf{s}_{i+1}-\mathbf{b}_{i+1})$ and $\mathbf{S}_{i}=\mathbf{C}_{i}-\mathbf{C}_{i}\mathbf{G}^{T}_{i+1}\mathbf{B}^{-1}_{i+1}(\mathbf{B}_{i+1}-\mathbf{S}_{i+1})\mathbf{B}^{-1}_{i+1}\mathbf{G}_{i+1}\mathbf{C}_{i}$ .

Lemma 2 (Likelihood and predictive distribution).

1. (Likelihood.) The likelihood follows

p(\mathbf{y}_{1:N}\mid\sigma^{2},\sigma_{0}^{2},\gamma)=\left\{\prod^{N}_{i=1}(2\pi Q_{i})^{-\frac{1}{2}}\right\}\exp\left\{-\sum^{N}_{i=1}\frac{(y_{i}-f_{i})^{2}}{2Q_{i}}\right\},

where $f_{i}$ and $Q_{i}$ are given in Kalman filter. The likelihood can be used to obtain the MLE of the parameters $(\sigma^{2},\sigma_{0}^{2},\gamma)$ .

2. (Predictive distribution.)

(i)

By the last step of Kalman filter, one has $\bm{\theta}_{N}|\mathbf{y}_{1:N}$ and recursively by the RTS smoother, the predictive distribution of $\bm{\theta}_{i}$ for $i=N-1,...,1$ follows

$\bm{\theta}_{i}|\mathbf{y}_{1:N}\sim\mathcal{MN}(\mathbf{s}_{i},\mathbf{S}_{i}).\vspace{-.05in}$ (16)

(ii)

For any $x^{*}$ (W.l.o.g. let $x_{i}<x^{*}<x_{i+1}$ )

\bm{\theta}(x^{*})\mid\mathbf{y}_{1:N}\sim\mathcal{MN}\left(\hat{\bm{\theta}}(x^{*}),\hat{\bm{\Sigma}}(x^{*})\right)

where

	$\displaystyle\hat{\bm{\theta}}(x^{*})$	$\displaystyle=\mathbf{G}^{}_{i}\mathbf{s}_{i}+\mathbf{W}^{}_{i}(\mathbf{G}^{}_{i+1})^{T}(\tilde{\mathbf{W}}^{}_{i+1})^{-1}(\mathbf{s}_{i+1}-\mathbf{G}^{}_{i+1}\mathbf{G}^{}_{i}\mathbf{s}_{i})$
	$\displaystyle\hat{\bm{\Sigma}}(x^{*})$	$\displaystyle=((\mathbf{W}^{}_{i})^{-1}+(\mathbf{G}^{}_{i+1})^{T}(\mathbf{W}^{}_{i+1})^{-1}\mathbf{G}^{}_{i+1})^{-1}$

with terms denoted with ‘*’ given in the Appendix 5.1.

Although we introduce the Matérn kernel with $\nu=5/2$ as an example, the likelihood and predictive distribution of GPs with the Matérn kernel of a small half-integer roughness parameter can be computed efficiently, for both equally spaced and not equally spaced 1D inputs. For the Matérn kernel with a very large roughness parameter, the dimension of the latent states becomes large, which makes efficient computation prohibitive. In practice, the Matérn kernel with a relatively large roughness parameter (e.g. with $\nu=5/2$ ) is found to be accurate for estimating a smooth latent function in computer experiments [20, 2]. Because of this reason, the Matérn kernel with $\nu=5/2$ is the default choice of the kernel function in some packages of GP emulators [47, 18].

For a model containing latent variables, one may proceed with two usual approaches:

(i)

sampling the latent variables $\bm{\theta}({x_{i}})$ from the posterior distribution by the MCMC algorithm,
(ii)

optimizing the latent variables $\bm{\theta}({x_{i}})$ to minimize a loss function.

For approach (i), the MCMC algorithm is usually much slower than the Kalman filter, as the number of the latent states is high, requiring a large number of posterior samples [17]. On the other hand, the prior correlation between states may not be taken into account directly in approach (ii), making the estimation less efficient than the Kalman filter, if data contain correlation across latent states. In comparison, the latent states in the dynamic linear model in (11) are iteratively marginalized out in Kalman filter, and the closed-form expression is derived in each step, which only takes $\mathcal{O}(N)$ operations and storage cost, with $N$ being the number of observations.

In practice, when a sensible probability model or a prior of latent variables is considered, the principle is to integrate out the latent variables when making predictions. Posterior samples and optimization algorithms, on the other hand, can be very useful for approximating the marginal likelihood when closed-form expressions are not available. As an example, we will introduce applications that integrate the sparse covariance structure along with conjugate gradient optimization into estimating particle interaction kernels, and forecasting particle trajectories in Section 4, which integrates both marginalization and optimization to tackle a computationally challenging scenario.

In Figure 2, we compare the cost for computing the predictive mean for a nonlinear function with 1D inputs [16]. The input is uniformly sampled from $[0.5,0.25]$ , and an independent Gaussian white noise with a standard deviation of $0.1$ is added in simulating the observations. We compare two ways of computing the predictive mean. The first approach implements direct computation of the predictive mean by Equation (5). The second approach is computed by the likelihood function and predictive distribution from Lemma 2 based on the Kalman filter and RTS smoother. The range and nugget parameters are fixed to be $0.5$ and $10^{-4}$ for demonstration purposes, respectively. The computational time of this simulated experiment is shown in the left panel in Figure 2. The approach based on Kalman filter and RTS smoother is much faster, as computing the likelihood and making predictions by Kalman filter and RTS smoother only require $\mathcal{O}(N)$ operations, whereas the direct computation cost $\mathcal{O}(N^{3})$ operations. The right panel gives the predictive mean, latent truth, and observations, when $N=1000$ . The difference between the two approaches is very small, as both methods are exact.

3.3 Marginalization of correlated matrix observations with multi-dimensional inputs

The Kalman filter is widely applied in signal processing, system control, and modeling time series. Here we introduce a few recent studies that apply Kalman filter to GP models with Matérn covariance to model spatial, spatio-temporal, and functional observations.

Let $\mathbf{y}(\mathbf{x})=(y_{1}(\mathbf{x}),...,y_{n_{1}}(\mathbf{x}))^{T}$ be an $n_{1}$ -dimensional real-valued output vector at a $p$ -dimensional input vector $\mathbf{x}$ . For simplicity, assume the mean is zero. Consider the latent factor model:

\mathbf{y}(\mathbf{x})=\mathbf{A}\mathbf{z}(\mathbf{x})+\bm{\epsilon}(\mathbf{x}),

(17)

where $\mathbf{A}=[\mathbf{a}_{1},...,\mathbf{a}_{d}]$ is an $n_{1}\times d$ factor loading matrix and $\mathbf{z}(\mathbf{x})=(z_{1}(\mathbf{x}),...,z_{d}(\mathbf{x}))^{T}$ is a $d$ -dimensional factor processes, with $d\leq n_{1}$ . The noise process follows $\bm{\epsilon}(\mathbf{x})\sim\mathcal{N}(\mathbf{0},\,\sigma^{2}_{0}\mathbf{I}_{n_{1}})$ . Each factor is modeled by a zero-mean Gaussian process (GP), meaning that $\mathbf{Z}_{l}=(z_{l}(\mathbf{x}_{1}),...,z_{l}(\mathbf{x}_{n_{2}}))$ follows a multivariate normal distribution $\mathbf{Z}^{T}_{l}\sim\mathcal{MN}(\mathbf{0},\bm{\Sigma}_{l})$ , where the $(i,\,j)$ entry of $\bm{\Sigma}_{l}$ is parameterized by a covariance function $\sigma^{2}_{l}K_{l}(\mathbf{x}_{i},\mathbf{x}_{j})$ for $l=1,...,d$ . The model (17) is often known as the semiparametric latent factor model in the machine learning community [53], and it belongs to a class of linear models of coregionalization [3]. It has a wide range of applications in modeling multivariate spatially correlated data and functional observations from computer experiments [14, 25, 40].

We have the following two assumptions for model (17).

Assumption 1.

The prior of latent processes $\mathbf{z}_{i}(.)$ and $\mathbf{z}_{j}(.)$ are independent, for any $i\neq j$ .

Assumption 2.

The factor loadings are orthogonal, i.e. $\mathbf{A}^{T}\mathbf{A}=\mathbf{I}_{d}$ .

The first assumption is typically assumed for modeling multivariate spatially correlated data or computer experiments [3, 25]. Secondly, note that the model in (17) is unchanged if we replace $(\mathbf{A},\mathbf{z}(\mathbf{x}))$ by $(\mathbf{A}\mathbf{E},\mathbf{E}^{-1}\mathbf{z}(\mathbf{x}))$ for an invertible matrix $\mathbf{E}$ , meaning that the linear subspace of $\mathbf{A}$ can be identified if no further constraint is imposed. Furthermore, as the variance of each latent process $\sigma^{2}_{i}$ is estimated by the data, imposing the unity constraint on each column of $\mathbf{A}$ can reduce identifiability issues. The second assumption was also assumed in other recent studies [31, 30].

Given Assumption 1 and Assumption 2, we review recent results that alleviates the computational cost. Let us first assume the observations are an $N=n_{1}\times n_{2}$ matrix $\mathbf{Y}=[\mathbf{y}(\mathbf{x}_{1}),...,\mathbf{y}(\mathbf{x}_{n_{2}})]$ .

Lemma 3 (Posterior independence and orthogonal projection [17]).

For model (17) with Assumption 1 and Assumption 2, we have two properties below.

1. (Posterior Independence.) For any $l\neq m$

\mbox{Cov}[\mathbf{Z}_{l}^{T},\mathbf{Z}_{m}^{T}|\mathbf{Y}]=\mathbf{0},

and for each $l=1,...,d$ ,

\mathbf{Z}_{l}^{T}|\mathbf{Y}\sim\mathcal{MN}(\bm{\mu}_{Z_{l}},\bm{\Sigma}_{z_{l}}),

where $\bm{\mu}_{z_{l}}=\bm{\Sigma}_{l}\bm{\tilde{\Sigma}}^{-1}_{l}\mathbf{\tilde{y}}_{l}$ , $\mathbf{\tilde{y}}_{l}=\mathbf{Y}^{T}\mathbf{a}_{l}$ and $\bm{\Sigma}_{Z_{l}}=\bm{\Sigma}_{l}-\bm{\Sigma}_{l}\bm{\tilde{\Sigma}}_{l}^{-1}\bm{\Sigma}_{l}$ with $\bm{\tilde{\Sigma}}_{l}=\bm{\Sigma}_{l}+\sigma_{0}^{2}\mathbf{I}_{n_{2}}$ .

2. (Orthogonal projection.) After integrating $\mathbf{\mathbf{missing}}{z}(\cdot)$ , the marginal likelihood is a product of multivariate normal densities at projected observations:

p(\mathbf{Y})=\prod^{d}_{l=1}\mathcal{PN}(\tilde{\mathbf{y}}_{l};\mathbf{0},\bm{\tilde{\Sigma}}_{l})\prod^{n_{1}}_{l=d+1}\mathcal{PN}(\tilde{\mathbf{y}}_{c,l};\mathbf{0},\sigma^{2}_{0}\mathbf{I}_{n_{2}}),

(18)

where $\tilde{\mathbf{y}}_{c,l}=\mathbf{Y}^{T}\mathbf{a}_{c,l}$ with $\mathbf{a}_{c,l}$ being the $l$ th column of $\mathbf{A}_{c}$ , the orthogonal component of $\mathbf{A}$ , and $\mathcal{PN}$ denotes the density for a multivariate normal distribution.

The properties in Lemma 17 lead to computationally scalable expressions of the maximum marginal likelihood estimator (MMLE) of factor loadings.

Theorem 1 (Generalized probabilistic principal component analysis [19]).

Assume $\mathbf{A}^{T}\mathbf{A}=\mathbf{I}_{d}$ , after marginalizing out $\mathbf{Z}^{T}_{l}\sim\mathcal{MN}(\mathbf{0},\bm{\Sigma}_{l})$ for $l=1,2,...,d$ , we have the results below.

•

If $\bm{\Sigma}_{1}=...=\bm{\Sigma}_{d}=\bm{\Sigma}$ , the marginal likelihood is maximized when

$\hat{\mathbf{A}}=\mathbf{U}\mathbf{S},$ (19)

where $\mathbf{U}$ is an $n_{1}\times d$ matrix of the first $d$ principal eigenvectors of $\mathbf{G}={\mathbf{Y}(\sigma^{2}_{0}\bm{\Sigma}^{-1}+\mathbf{I}_{n_{2}})^{-1}\mathbf{Y}^{T}}$ and $\mathbf{S}$ is a $d\times d$ orthogonal rotation matrix;

•

If the covariances of the factor processes are different, denoting $\mathbf{G}_{l}={\mathbf{Y}(\sigma^{2}_{0}\bm{\Sigma}^{-1}_{l}+\mathbf{I}_{n_{2}})^{-1}\mathbf{Y}^{T}}$ , the MMLE of factor loadings is

\mathbf{\hat{A}}=\operatorname*{argmax}_{\mathbf{A}}\sum^{d}_{l=1}{\mathbf{a}^{T}_{l}\mathbf{G}_{l}\mathbf{a}_{l}},\quad\text{s.t.}\quad\mathbf{A}^{T}\mathbf{A}=\mathbf{I}_{d}.

(20)

The estimator $\mathbf{A}$ in Theorem 1 is called the generalized probabilistic principal component analysis (GPPCA). The optimization algorithm that preserves the orthogonal constraints in (20) is available in [60].

In [58], the latent factor is assumed to follow independent standard normal distributions, and the authors derived the MMLE of the factor loading matrix $\mathbf{A}$ , which was termed the probabilistic principal component analysis (PPCA). The GPPCA extends the PPCA to correlated latent factors modeled by GPs, which incorporates the prior correlation information between outputs as a function of inputs, and the latent factor processes were marginalized out when estimating the factor loading matrix and other parameters. When the input is 1D and the Matérn covariance is used for modeling latent factors, the computational order of GPPCA is $\mathcal{O}(Nd)$ , which is the same as the PCA. For correlated data, such as spatio-temporal observations and multivariate functional data, GPPCA provides a flexible and scalable approach to estimate factor loading by marginalizing out the latent factors [19].

Spatial and spatio-temporal models with a separable covariance can be written as a special case of the model in (17). For instance, suppose $d=n_{1}$ and the $n_{1}\times n_{2}$ latent factor matrix follows $\mathbf{Z}\sim\mathcal{MN}(0,\,\sigma^{2}\mathbf{R}_{1}\otimes\mathbf{R}_{2}),$ where $\mathbf{R}_{1}$ and $\mathbf{R}_{2}$ are $n_{1}\times n_{1}$ and $n_{2}\times n_{2}$ subcovariances, respectively. Denote the eigen decomposition $\mathbf{R}_{1}=\mathbf{V}_{1}\bm{\Lambda}_{1}\mathbf{V}_{1}^{T}$ with $\bm{\Lambda}_{1}$ being a diagonal matrix with the eigenvalues $\lambda_{i}$ , for $i=1,...,n_{1}$ . Then this separable model can be written as the model in (17), with $\mathbf{A}=\mathbf{V}_{1}$ , $\bm{\Sigma}_{l}=\sigma^{2}\lambda_{l}\mathbf{R}_{2}$ . The connection suggests that the latent factor loading matrix can be specified as the eigenvector matrix of a covariance matrix, parameterized by a kernel function. This approach is studied in [17] for modeling incomplete lattice with irregular missing patterns, and the Kalman filter is integrated for accelerating computation on massive spatial and spatio-temporal observations.

4 Scalable marginalization for learning particle interaction kernels from trajectory data

Collective motions with particle interactions are very common in both microscopic and macroscopic systems [35, 36]. Learning interaction kernels between particles is important for two purposes. First, physical laws are less understood for many complex systems, such as cell migration or non-equilibrium thermodynamic processes. Estimating the interaction kernels between particles from fields or experimental data is essential for learning these processes. Second, simulation of particle interactions, such as ab initio molecular dynamics simulation, can be very computationally expensive. Statistical learning approaches can be used for reducing the computational cost of simulations.

For demonstration purposes, we consider a simple first-order system. In [34], for a system with $n$ interacting particles, the velocity of the $i$ th particle at time $t$ , $\mathbf{v}_{i}(t)=d\mathbf{x}_{i}(t)/dt$ , is modeled by positions between all particles,

\mathbf{v}_{i}(t)=\sum^{n}_{j=1}\phi(||\mathbf{x}_{j}(t)-\mathbf{x}_{i}(t)||)\mathbf{u}_{i,j}(t),

(21)

where $\phi(\cdot)$ is a latent interaction kernel function between particle $i$ and all other particles, with $||\cdot||$ being the Euclidean distance, and $\mathbf{u}_{i,j}(t)=\mathbf{x}_{j}(t)-\mathbf{x}_{i}(t)$ is a vector of differences between positions of particles $i$ and $j$ , for $i,j=1,...,n$ . Here $\phi(\cdot)$ is a two-body interaction function. In molecular dynamics simulation, for instance, the Lennard-Jones potential can be related to molecular forces and the acceleration of molecules in a second-order system. The statistical learning approach can be extended to a second-order system that involves acceleration and external force terms. The first-order system as (21) can be considered as an approximation of the second-order system. Furthermore, the interaction between particles is global, as any particle is affected by all other particles. Learning global interactions is more computationally challenging than local interactions [51], and approximating the global interaction by the local interaction is of interest in future studies.

One important goal is to efficiently estimate the unobservable interaction functions from the particle trajectory data, without specifying a parametric form. This goal is key for estimating the behaviors of dynamic systems in experiments and in observational studies, as the physics law in a new system may be unknown. In [13], $\phi(\cdot)$ is modeled by a zero-mean Gaussian process with a Matérn covariance:

\phi(\cdot)\sim\mathcal{GP}(0,\sigma^{2}K(\cdot,\cdot)).

(22)

Computing estimation of interactions of large-scale systems or more simulation runs, however, is prohibitive, as the direct inversion of the covariance matrix of observations of velocities requires $\mathcal{O}((nMDL)^{3})$ operations, where $M$ is the number of simulations or experiments, $n$ is the number of particles, $D$ is the dimension of each particle, $L$ denotes the number of time points for each trial. Furthermore, constructing such covariance contains computing an $n^{2}LM\times n^{2}LM$ matrix of $\phi$ for a $D$ -dimensional input space, which takes $\mathcal{O}(n^{4}L^{2}M^{2}D)$ operations. Thus, directly estimating interaction kernel with a GP model in (22) is only applicable to systems with a small number of observations [34, 13].

This work makes contributions from two different aspects for estimating dynamic systems of interacting particles. We first show the covariance of velocity observations can be obtained by operations on a few sparse matrices, after marginalizing out the latent interaction function. The sparsity of the inverse covariance of the latent interaction kernel allows us to modify the Kalman filter to efficiently compute the matrix product in this problem, and then apply a conjugate gradient (CG) algorithm [24, 21, 49] to solve this system iteratively. The computational complexity of computing the predictive mean and variance of a test point is at the order of $\mathcal{O}(TnN)+\mathcal{O}(nN\log(nN))$ , for a system of $n$ particles, $N=nMDL$ observations, and $T$ is the number of iterations required in the CG algorithm. We found that typically around a few hundred CG iterations can achieve high predictive accuracy for a moderately large number of observations. The algorithm leads substantial reduction of computational cost, compared to direct computation.

Second, we study the effect of experimental designs on estimating the interaction kernel function. In previous studies, it is unclear how initial positions, time steps of trajectory and the number of particles affect the accuracy in estimating interaction kernels. Compared to other conventional problems in computer model emulation, where a “space-filling” design is often used, here we cannot directly observe the realizations of the latent function. Instead, the output velocity is a weighted average of the interaction kernel functions between particles. Besides, we cannot directly control distances between the particles moving away from the initial positions, in both simulation and experimental studies. When the distance between two particles $i$ and $j$ is small, the contribution $\phi(||\mathbf{x}_{i}(t)-\mathbf{x}_{j}(t)||)\mathbf{u}_{i,j}(t)$ can be small, if the repulsive force by $\phi(\cdot)$ does not increase as fast as the distance decreases. Thus we found that the estimation of interaction kernel function can be less accurate in the input domain of small distances. This problem can be alleviated by placing initial positions of more particles close to each other, providing more data with small distance pairs that improve the accuracy in estimation.

4.1 Scalable computation based on sparse representation of inverse covariance

For illustration purposes, let us first consider a simple scenario where we have $M=1$ simulation and $L=1$ time point of a system of $n$ interacting particles at a $D$ dimensional space. Since we only have 1 time point, we omit the notation $t$ when there is no confusion. The algorithm for the general scenario with $L>1$ and $M>1$ is discussed in Appendix 5.2. In practice, the experimental observations of velocity from multiple particle tracking algorithms or particle image velocimetry typically contain noises [1]. Even for simulation data, the numerical error could be non-negligible for large and complex systems. In these scenarios, the observed velocity $\mathbf{\tilde{v}}_{i}=(v_{i,1},...,v_{i,D})^{T}$ is a noisy realization: $\mathbf{\tilde{v}}_{i}=\mathbf{v}_{i}+\bm{\epsilon}_{i}$ , where $\bm{\epsilon}_{i}\sim\mathcal{MN}(0,\sigma^{2}_{0}\mathbf{I}_{D})$ denotes a vector of Gaussian noise with variance $\sigma^{2}_{0}$ .

Assume the $nD$ observations of velocity are $\mathbf{\tilde{v}}=(\tilde{v}_{1,1},...,\tilde{v}_{n,1},\tilde{v}_{1,2},...,\tilde{v}_{n,2},...,\tilde{v}_{n-1,D},\tilde{v}_{n,D})^{T}$ . After integrating out the latent function modeled in Equation (22), the marginal distribution of observations follows

\displaystyle(\mathbf{\tilde{v}}\mid\mathbf{R}_{\phi},\sigma^{2},\sigma^{2}_{0})\sim\mathcal{MN}\left(\mathbf{0},\sigma^{2}\mathbf{U}\mathbf{R}_{\phi}\mathbf{U}^{T}+\sigma^{2}_{0}\mathbf{I}_{nD}\right),

(23)

where $\mathbf{U}$ is an $nD\times n^{2}$ block diagonal matrix, with the $i$ th $D\times n$ block in the diagonals being $(\mathbf{u}_{i,1},...,\mathbf{u}_{i,n})$ , and $\mathbf{R}_{\phi}$ is an $n^{2}\times n^{2}$ matrix, where the $(i^{\prime},j^{\prime})$ term in the $(i,j)$ th $n\times n$ block is $K(d_{i,i^{\prime}},d_{j,j^{\prime}})$ with $d_{i,i^{\prime}}=||\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}||$ and $d_{j,j^{\prime}}=||\mathbf{x}_{j}-\mathbf{x}_{j}^{\prime}||$ for $i,i^{\prime},j,j^{\prime}=1,...,n$ . Direct computation of the likelihood involves computing the inverse of an $nD\times nD$ covariance matrix and constructing an $n^{2}\times n^{2}$ matrix $\mathbf{R}_{\phi}$ , which costs $\mathcal{O}((nD)^{3})+\mathcal{O}(n^{4}D)$ operations. This is computationally expensive even for small systems.

Here we use an exponential kernel function, $K(d)=\exp(-d/\gamma)$ with range parameter $\gamma$ , of modeling any nonnegative distance input $d$ for illustration, where this method can be extended to include Matérn kernels with half-integer roughness parameters. Denote distance pairs $d_{i,j}=||\mathbf{x}_{i}-\mathbf{x}_{j}||$ , and there are $(n-1)n/2$ unique positive distance pairs. Denote the $(n-1)n/2$ distance pairs $\mathbf{d}_{s}=(d_{s,1},...d_{s,(n-1)n/2})^{T}$ in an increasing order with the subscript $s$ meaning ‘sorted’. Here we do not need to consider the case when $d_{i,j}=0$ , as $\mathbf{u}_{i,j}=\mathbf{0}$ , leading to zero contribution to the velocity. Thus the model in (21) can be reduced to exclude the interaction between particle at zero distance. In reality, two particles at the same position are impractical, as there typically exists a repulsive force when two particles get very close. Hence, we can reduce the $n^{2}$ distance pairs $d_{i,j}$ for $i=1,...,n$ and $j=1,...,n$ , to $(n-1)n/2$ unique positive terms $d_{s,i}$ in an increasing order, for $i=1,...,(n-1)n/2$ .

Denote the $(n-1)n/2\times(n-1)n/2$ correlation matrix of the kernel outputs $\bm{\phi}=(\phi(d_{s,1}),...,\phi(d_{s,(n-1)n/2}))^{T}$ by $\mathbf{R}_{s}$ and $\mathbf{U}_{s}$ is $nD\times(n-1)n/2$ sparse matrix with $n-1$ nonzero terms on each row, where the nonzero entries of the $i$ th particle correspond to the distance pairs in the $\mathbf{R}_{s}$ . Denote the nugget parameter $\eta=\sigma^{2}_{0}/\sigma^{2}$ . After marginalizing out $\phi$ , the covariance of velocity observations follows

\displaystyle(\mathbf{\tilde{v}}\mid\gamma,\sigma^{2},\eta)\sim\mathcal{MN}\left(\mathbf{0},\sigma^{2}\mathbf{\tilde{R}}_{v}\right),

(24)

with

\mathbf{\tilde{R}}_{v}=(\mathbf{U}_{s}\mathbf{R}_{s}\mathbf{U}^{T}_{s}+\eta\mathbf{I}_{nD}).

(25)

The conditional distribution of the interaction kernel $\phi(d^{*})$ at any distance $d^{*}$ follows

\displaystyle(\phi(d^{*})\mid\mathbf{\tilde{v}},\gamma,\sigma^{2},\eta)\sim\mathcal{N}(\hat{\phi}(d^{*}),\sigma^{2}K^{*}),

(26)

where the predictive mean and variance follow

	$\displaystyle\hat{\phi}(d^{*})$	$\displaystyle=\mathbf{r}^{T}(d^{*})\mathbf{U}^{T}_{s}\mathbf{\tilde{R}}_{v}^{-1}\mathbf{\tilde{v}},$		(27)
	$\displaystyle\sigma^{2}K^{*}$	$\displaystyle=\sigma^{2}\left(K(d^{},d^{})-\mathbf{r}^{T}(d^{})\mathbf{U}^{T}_{s}\mathbf{\tilde{R}}_{v}^{-1}\mathbf{U}_{s}\mathbf{r}(d^{})\right),$		(28)

with $\mathbf{r}(d^{*})=(K(d^{*},d_{s,1}),...,K(d^{*},d_{s,n(n-1)/2}))^{T}$ . After obtaining the estimated interaction kernel, one can use it to forecast trajectories of particles and understand the physical mechanism of flocking behaviors.

Our primary task is to efficiently compute the predictive distribution of interaction kernel in (26), where the most computationally expensive terms in the predictive mean and variance is $\mathbf{\tilde{R}}_{v}^{-1}\mathbf{\tilde{v}}$ and $\mathbf{\tilde{R}}_{v}^{-1}\mathbf{U}_{s}\mathbf{r}(d^{*})$ . Note that the $\mathbf{U}_{s}$ is a sparse matrix with $n(n-1)d$ nonzero terms and the inverse covariance matrix $\mathbf{R}^{-1}_{s}$ is a tri-diagonal matrix. However, directly applying the CG algorithm is still computationally challenging, as neither $\mathbf{\tilde{R}}_{v}$ nor $\mathbf{\tilde{R}}_{v}^{-1}$ is sparse. To solve this problem, we extend a step in the Kalman filter to efficiently compute the matrix-vector multiplication with the use of sparsity induced by the choice of covariance matrix. Each step of the CG iteration in the new algorithm only costs $\mathcal{O}(nDT)$ operations for computing a system of $n$ particles and $D$ dimensions with $T$ CG iteration steps. For most systems we explored, we found a few hundred iterations in the CG algorithm achieve high accuracy. The substantial reduction of the computational cost allows us to use more observations to improve the predictive accuracy. We term this approach the sparse conjugate gradient algorithm for Gaussian processes (sparse CG-GP). The algorithm for the scenario with $M$ simulations, each containing $L$ time frames of $n$ particles in a $D$ dimensional space, is discussed in Appendix 5.2.

The comparison of the computational cost between the full GP model and the proposed sparse CG-GP method is shown in the right panel in Figure 3. The most computational expensive part of the full GP model is on constructing the $n(n-1)/2\times n(n-1)/2$ correlation matrix $\mathbf{R}_{s}$ of $\phi$ for $n(n-1)/2$ distance pairs. The sparse CG-GP algorithm is much faster as we do not need to construct this covariance matrix; instead we only need to efficiently compute matrix multiplication by utilizing the sparse structure of the inverse of $\mathbf{R}_{s}$ (Appendix 5.2). Note the GP model with an exponential covariance naturally induces a sparse inverse covariance matrix that can be used for faster computation, which is different from imposing a sparse covariance structure for approximation.

In the left panel in Figure 3, we show the predictive mean and uncertainty assessment by the sparse CG-GP method for three different designs for sampling the initial positions of particles. From the first to the third designs, the initial value of each coordinate of the particle is sampled independently from a uniform distribution $\mathcal{U}[a_{1},b_{1}]$ , normal distribution $\mathcal{N}(a_{2},b_{2})$ , and log uniform (reciprocal) distribution $\mathcal{LU}[\log(a_{3}),\log(b_{3})]$ , respectively.

For experiments with the interaction kernel being the truncated Lennard-Jones potential given in Appendix 5.2, we use $a_{1}=0$ , $b_{1}=5$ , $a_{2}=0$ , $b_{2}=5$ , $a_{3}=10^{-3}$ and $b_{3}=5$ for three designs of initial positions. Compared with the first design, the second design of initial positions, which was assumed in [34], has a larger probability mass of distributions near 0. In the third design, the distributions of the distance between particle pairs are monotonically decreasing, with more probability mass near 0 than those in the first two designs. In all cases shown in Figure 3, we assume $M=1$ , $L=1$ and the noise variance is set to be $\sigma_{0}=10^{-3}$ in the simulation. For demonstration purposes, the range and nugget parameters are fixed to be $\gamma=5$ and $\eta=10^{-5}$ respectively, when computing the predictive distribution of $\phi$ . The estimation of the interaction kernel on large distances is accurate for all different designs, whereas the estimation of the interaction kernel at small distances is not satisfying for the first two designs. When particles are initialized from the third design (log-uniform), the accuracy is better, as there are more particles near each other, providing more information about the particles at small values. This result is intuitive, as the small distance pairs have relatively small contributions to the velocity based on Equation (21), and we need more particles close to each other to estimate the interaction kernel function at small distances.

The numerical comparison between different designs allows us to better understand the learning efficiency in different scenarios, which can be used to design experiments. Because of the large improvement of computational scalability compared to previous studies [13, 34], we can accurately estimate interaction kernels based on more particles and longer trajectories.

4.2 Numerical results

Here we discuss two scenarios, where the interaction between particles follow the truncated Lennard-Jones (LJ) and opinion dynamics (OD) kernel functions. The LJ potential is widely used in MD simulations of interacting molecules [43]. First-order systems of form (21) have also been successfully applied in modeling opinion dynamics in social networks (see the survey [36] and references therein). The interaction function $\phi$ models how the opinions of pairs of people influence each other. In our numerical example, we consider heterophilious opinion interactions: each agent is more influenced by its neighbors slightly further away from its closest neighbors. As time evolves, the opinions of agents merge into clusters, with the number of clusters significantly smaller than the number of agents. This phenomenon was studied in [36] that heterophilious dynamics enhances consensus, contradicting the intuition that would suggest that the tendency to bond more with those who are different rather than with those who are similar would break connections and prevent clusters of consensus.

The details of the interaction functions are given in Appendix 5.3. For each interaction, we test our method based on 12 configurations of 2 particle sizes ( $n=50$ and $n=200$ ), 2 time lengths ( $L=1$ and $L=10$ ), and 3 designs of initial positions (uniform, normal and log-uniform). The computational scalability of the sparse CG algorithm allows us to efficiently compute the predictions in most of these experimental settings within a few seconds. For each configuration, we repeat the experiments 10 times to average the effects of randomness in the initial positions of particles. The root of the mean squared error in predicting the interaction kernels by averaging these 10 experiments of each configuration is given in Appendix 5.4.

In Figure 4, we show the estimation of interactions kernels and forecasts of particle trajectories with different designs, particle sizes and time points. The sparse CG-GP method is relatively accurate for almost all scenarios. Among different initial positions, the estimation of trajectories for LJ interaction is the most accurate when the initial positions of the particles are sampled by the log-uniform distribution. This is because there are more small distances between particles when the initial positions follow a log-uniform distribution, providing more data to estimate the interaction kernel at small distances. Furthermore, when we have more particles or observations at larger time intervals, the estimation of the interaction kernel from all designs becomes more accurate in terms of the normalized root mean squared error with the detailed comparison given in Appendix 5.4.

In panel (c) and panel (f) of Figure 4, we plot the trajectory forecast of $10$ particles over $200$ time points for the truncated LJ kernel and OD kernel, respectively. In both simulation scenarios, interaction kernels are estimated based on trajectories of $n=50$ particles across $L=20$ time steps with initial positions sampled from the log-uniform design. The trajectories of only $10$ particles out of $50$ particles are shown for better visualization. For trajectories simulated by the truncated LJ, some particles can move very close, since the repulsive force between two particles becomes smaller as the force is proportional to the distance from Equation (21), and the truncation of kernel substantially reduces the repulsive force when particles move close. For the OD simulation, the particles move toward a cluster, as expected, since the particles always have attractive forces between each other. The forecast trajectories are close to the hold-out truth, indicating the high accuracy of our approach.

Compared with the results shown in previous studies [34, 13], estimating the interaction kernels and forecasting trajectories both look more accurate. The large computational reduction by the sparse CG-GP algorithm shown in Figure 3 permits the use of longer trajectories from more particles to estimate the interaction kernel, which improves the predictive accuracy. Here particle has interactions with all other particles in our simulation, making the number of distance pairs large. Yet we are able to estimate the interaction kernel and forecast the trajectories of particles within only tens of seconds in a desktop for the most time consuming scenario we considered. Since the particles typically have very small or no interaction when the distances between them are large, approximation can be made by enforcing interactions between particles within the specified radius, for further reducing the computational cost.

5 Concluding remarks

We have introduced scalable marginalization of latent variables for correlated data. We first introduce GP models and reviewed the SDE representation of GPs with Matérn covariance and one-dimensional input. Kalman filter and RTS smoother were introduced as a scalable marginalization way to compute the likelihood function and predictive distribution, which reduces the computational complexity of GP with Matérn covariance for 1D input from $\mathcal{O}(N^{3})$ to $\mathcal{O}(N)$ operations without approximation, where $N$ is the number of observations. Recent efforts on extending scalable computation from 1D input to multi-dimensional input are discussed. In particular, we developed a new scalable algorithm for predicting particle interaction kernel and forecast trajectories of particles. The achievement is through the sparse representation of GPs in modeling interaction kernel, and then efficient computation for matrix multiplication by modifying the Kalman filter algorithm. An iterative algorithm based on CG can then be applied, which reduces the computational complexity.

There are a wide range of future topics relevant to this study. First, various models of spatio-temporal data can be written as random factor models in (17) with latent factors modeled as Gaussian processes for temporal inputs. It is of interest to utilize the computational advantage of the dynamic linear models of factor processes, extending the computational tools by relaxing the independence between prior factor processes in Assumption 1 or incorporating the Toeplitz covariance structure for stationary temporal processes. Second, for estimating systems of particle interactions, we can further reduce computation by only considering interactions within a radius between particles. Third, a comprehensively study the experimental design, initialization, and parameter estimation in will be helpful for estimating latent interaction functions that can be unidentifiable or weakly identifiable in certain scenarios. Furthermore, velocity directions and angle order parameters are essential for understanding the mechanism of active nematics and cell migration, which can motivate more complex models of interactions. Finally, the sparse CG algorithm developed in this work is of interest to reducing the computational complexity of GP models with multi-dimensional input and general designs.

Acknowledgement

The work is partially supported by the National Institutes of Health under Award No. R01DK130067. Gu and Liu acknowledge the partial support from National Science Foundation (NSF) under Award No. DMS-2053423. Fang acknowledges the support from the UCSB academic senate faculty research grants program. Tang is partially supported by Regents Junior Faculty fellowship, Faculty Early Career Acceleration grant, Hellman Family Faculty Fellowship sponsored by UCSB and the NSF under Award No. DMS-2111303. The authors thank the editor and two referees for their comments that substantially improved the article.

Appendix

5.1 Closed-form expressions of state space representation of GP having Matérn covariance with $\nu=5/2$

Denote $\lambda=\frac{\sqrt{5}}{\gamma}$ , $q=\frac{16}{3}\sigma^{2}\lambda^{5}$ and $d_{i}=|x_{i}-x_{i-1}|$ . For $i=2,...,N$ , $\mathbf{G}_{i}$ and $\mathbf{W}_{i}$ in (11) have the expressions below:

\mathbf{G}_{i}=\frac{e^{-\lambda d_{i}}}{2}\begin{pmatrix}\lambda^{2}d_{i}^{2}+2\lambda+2&2(\lambda d_{i}^{2}+d_{i})&d_{i}^{2}\\ -\lambda^{3}d_{i}^{2}&-2(\lambda^{2}d_{i}^{2}-\lambda d_{i}-1)&2d_{i}-\lambda d_{i}^{2}\\ \lambda^{4}d_{i}^{2}-2\lambda^{3}d_{i}&2(\lambda^{3}d_{i}^{2}-3\lambda^{2}d_{i})&\lambda^{2}d_{i}^{2}-4\lambda d_{i}+2\end{pmatrix},

\mathbf{W}_{i}=\frac{4\sigma^{2}\lambda^{5}}{3}\begin{pmatrix}W_{1,1}(x_{i})&W_{1,2}(x_{i})&W_{1,3}(x_{i})\\ W_{2,1}(x_{i})&W_{2,2}(x_{i})&W_{2,3}(x_{i})\\ W_{3,1}(x_{i})&W_{3,2}(x_{i})&W_{3,3}(x_{i})\end{pmatrix},

with

	$\displaystyle W_{1,1}(x_{i})$	$\displaystyle=\frac{e^{-2\lambda d_{i}}(3+6\lambda d_{i}+6\lambda^{2}d^{2}_{i}+4\lambda^{3}d^{3}_{i}+2\lambda^{4}d^{4}_{i})-3}{-4\lambda^{5}},$
	$\displaystyle W_{1,2}(x_{i})$	$\displaystyle=W_{2,1}(x_{i})=\frac{e^{-2\lambda d_{i}}d_{i}^{4}}{2},$
	$\displaystyle W_{1,3}(x_{i})$	$\displaystyle=W_{3,1}(x_{i})=\frac{e^{-2\lambda d_{i}}(1+2\lambda d_{i}+2\lambda^{2}d^{2}_{i}+4\lambda^{3}d^{3}_{i}-2\lambda^{4}d^{4}_{i})-1}{4\lambda^{3}},$
	$\displaystyle W_{2,2}(x_{i})$	$\displaystyle=\frac{e^{-2\lambda d_{i}}(1+2\lambda d_{i}+2\lambda^{2}d^{2}_{i}-4\lambda^{3}d^{3}_{i}+2\lambda^{4}d^{4}_{i})-1}{-4\lambda^{3}},$
	$\displaystyle W_{2,3}(x_{i})$	$\displaystyle=W_{3,2}(x_{i})=\frac{e^{-2\lambda d_{i}}d_{i}^{2}(4-4\lambda d_{i}+\lambda^{2}d^{2}_{i})}{2},$
	$\displaystyle W_{3,3}(x_{i})$	$\displaystyle=\frac{e^{-2\lambda d_{i}}(-3+10\lambda^{2}d^{2}_{i}-22\lambda^{2}d^{2}_{i}+12\lambda^{2}d^{2}_{i}-2\lambda^{4}d^{4}_{i})+3}{4\lambda},$

and the stationary covariance of $\bm{\theta}_{i}$ , $i=1,...,\tilde{N}$ , is

\mathbf{W}_{1}=\begin{pmatrix}\sigma^{2}&0&-\sigma^{2}\lambda^{2}/3\\ 0&\sigma^{2}\lambda^{2}/3&0\\ -\sigma^{2}\lambda^{2}/3&0&\sigma^{2}\lambda^{4}\end{pmatrix},

The joint distribution of latent states follows $\left(\bm{\theta}^{T}(x_{1}),...,\bm{\theta}^{T}(x_{N})\right)^{T}\sim\mathcal{MN}(\mathbf{0},\bm{\Lambda}^{-1})$ , where the $\bm{\Lambda}$ is a symmetric block tri-diagonal matrix with the $i$ th diagonal block being $\mathbf{W}^{-1}_{i}+\mathbf{G}^{T}_{i}\mathbf{W}^{-1}_{i+1}\mathbf{G}_{i}$ for $i=1,...,{N}-1$ , and the $N$ th diagonal block being $\mathbf{W}_{{N}}^{-1}$ . The primary off-diagonal block of $\bm{\Lambda}$ is $-\mathbf{G}^{T}_{i}\mathbf{W}^{-1}_{i}$ , for $i=2,...,N$ .

Suppose $x_{i}<x^{*}<x_{i+1}$ . Let $d_{i}^{*}=|x_{i}^{*}-x_{i}|$ and $d_{i+1}^{*}=|x_{i+1}-x_{i}^{*}|$ . The “*” terms $\mathbf{G}_{i}^{*}$ and $\mathbf{W}_{i}^{*}$ can be computed by replacing $d_{i}$ in $\mathbf{G}_{i}$ and $\mathbf{W}_{i}$ by $d_{i}^{*}$ , whereas the “*” terms $\mathbf{G}_{i+1}^{*}$ and $\mathbf{W}_{i+1}^{*}$ can be computed by replacing the $d_{i}$ in $\mathbf{G}_{i}$ and $\mathbf{W}_{i}$ by $d_{i+1}^{*}$ . Furthermore, $\tilde{\mathbf{W}}^{*}_{i+1}=\mathbf{W}_{i+1}^{*}+\mathbf{G}_{i+1}^{*}\mathbf{W}_{i}^{*}(\mathbf{G}_{i+1}^{*})^{T}$ .

5.2 The sparse CG-GP algorithm for estimating interaction kernels

Here we discuss the details of computing the predictive mean and variance in (26). The $N$ -vector of velocity observations is denoted as $\mathbf{\tilde{v}}$ , where the total number of observations is defined by $N=nDML$ . To compute the predictive mean and variance, the most computational challenging part is to compute $N$ -vector $\mathbf{z}=(\mathbf{U}_{s}\mathbf{R}_{s}\mathbf{U}^{T}_{s}+\eta\mathbf{I}_{N})^{-1}\mathbf{\tilde{v}}$ . Here $\mathbf{R}_{s}$ and $\mathbf{U}_{s}$ are $\tilde{N}\times\tilde{N}$ and $N\times\tilde{N}$ , respectively, where $\tilde{N}=n(n-1)ML/2$ is the number of non-zero unique distance pairs. Note that both $\mathbf{U}_{s}$ and $\mathbf{R}^{-1}_{s}$ are sparse. Instead of directly computing the matrix inversion and the matrix-vector multiplication, we utilize the sparsity structure to accelerate the computation in the sparse CG-GP algorithm. In the iteration, we need to efficiently compute

\tilde{\mathbf{z}}=(\mathbf{U}_{s}\mathbf{R}_{s}\mathbf{U}^{T}_{s}+\eta\mathbf{I}_{N})\mathbf{z},

(29)

for any real-valued N-vector $\mathbf{z}$ .

We have four steps to compute the quantity in (29) efficiently. Denote $x_{i,j,m}[t_{l}]$ the $j$ th spatial coordinate of particle $i$ at time $t_{l}$ , in the $m$ th simulation, for $i=1,...,n$ , $j=1,...,D$ , $l=1,...,L$ and $m=1,...,M$ . In the following, we use $\mathbf{x}_{\cdot,\cdot,m}[\cdot]$ to denote a vector of all positions in the $m$ th simulation and vice versa. Furthermore, we use $z[k]$ to mean the $k$ th entry of any vector $\mathbf{z}$ , $\mathbf{A}[k,.]$ and $\mathbf{A}[.,k]$ to mean the $k$ th row vector and $k$ th column vector of any matrix $\mathbf{A}$ , respectively. The rank of a particle with position $\mathbf{x}_{i,.,m}[t_{l}]$ is defined to be $P=(m-1)Ln+(l-1)n+i$ .

First, we reduce the $N\times\tilde{N}$ sparse matrix $\mathbf{U}_{s}$ of distance difference pairs to an $N\times n$ matrix $\mathbf{U}_{re}$ , where ‘re’ means reduced, with the $((m-1)LnD+(l-1)nD+(j-1)n+i_{1},i_{2})$ th entry of $\mathbf{U}_{re}$ being $(x_{i_{1},j,m}[t_{l}]-x_{i_{2},j,m}[t_{l}])$ , for any $|i_{1}-i_{2}|=1,...,n-1$ , $i_{1}\leq n$ and $i_{2}\leq n$ . Furthermore, we create a $\tilde{N}\times 2$ matrix $\mathbf{P}_{r}$ in which the $h$ th row records the rank of a distance pair is the $h$ th largest in the zero-excluded sorted distance pairs $\mathbf{d}_{s}$ , where $P_{r}[h,1]$ and $P_{r}[h,2]$ are the rank of rows of these distances in the matrix $\mathbf{d}_{mat}$ , where the $j$ th column records the unordered distance pairs of the $j$ th particle for $j=1,...,n$ . We further assume $P_{r}[h,1]>P_{r}[h,2]$ .

For any $N$ -vector $\mathbf{z}$ , the $k$ th entry of $\mathbf{U}^{T}_{s}\mathbf{z}$ can be written as

(\mathbf{U}^{T}_{s}\mathbf{z})[k]=\sum_{j_{k}=0}^{D-1}\mathbf{U}_{re}\bigg{[}P_{r}[k,1]+c_{j_{k}},P_{r}[k,2]-(m-1)nL-(l-1)n\bigg{]}\bigg{(}z[P_{r}[k,1]+c_{j_{k}}]-z[P_{r}[k,2]+c_{j_{k}}]\bigg{)},

(30)

where $c_{j_{k}}=(D-1)(m-1)nL+(D-1)(l-1)n+nj_{k}$ for $k=1,...,\tilde{N}$ , if the $k$ th largest entry of distance pair is from time frame $l$ in the $m$ th simulation. The output is denoted as an $\tilde{N}$ vector $\mathbf{g}_{1}$ , i.e. $\mathbf{g}_{1}=(\mathbf{U}^{T}_{s}\mathbf{z})$ .

Second, since the exponential kernel is used, $\mathbf{R}^{-1}_{s}$ is a tri-diagonal matrix [20]. We modify a Kalman filter step to efficiently compute the product of an upper bi-diagonal $\mathbf{g}_{2}=\mathbf{L}^{T}_{s}\mathbf{g}_{1}$ , where $\mathbf{L}_{s}$ is the factor of the Cholesky decomposition $\mathbf{R}_{s}=\mathbf{L}_{s}\mathbf{L}^{T}_{s}$ . Denote the Cholesky decomposition of the inverse covariance the factor $\mathbf{R}^{-1}_{s}=\mathbf{\tilde{L}}_{s}\mathbf{\tilde{L}}_{s}^{T}$ , where $\mathbf{\tilde{L}}_{s}$ can be written as the lower bi-diagonal matrix below:

\mathbf{\tilde{L}}_{s}=\begin{pmatrix}\frac{1}{\sqrt{1-\rho_{1}^{2}}}&&&\\ \frac{-\rho_{1}}{\sqrt{1-\rho_{1}^{2}}}&\frac{1}{\sqrt{1-\rho_{2}^{2}}}&&\\ &\frac{-\rho_{2}}{\sqrt{1-\rho_{2}^{2}}}\,\ddots&&&\\ &\quad\ddots&\frac{1}{\sqrt{1-\rho_{\tilde{N}-1}^{2}}}&\\ &&\frac{-\rho_{\tilde{N}-1}}{\sqrt{1-\rho_{\tilde{N}-1}^{2}}}&1\\ \end{pmatrix},

(31)

where $\rho_{k}=\exp(-(d_{s,k+1}-d_{s,k})/\gamma)$ for $k=1,...,\tilde{N}-1$ . We modify the Thomas algorithm [57] to solve $\mathbf{g}_{2}$ from equation $(\mathbf{L}^{T}_{s})^{-1}\mathbf{g}_{2}=\mathbf{g}_{1}$ . Here $(\mathbf{L}^{T}_{s})^{-1}$ is an upper bi-diagonal matrix with explicit form

(\mathbf{L}^{T}_{s})^{-1}=\begin{pmatrix}1&\frac{-\rho_{1}}{\sqrt{1-\rho_{1}^{2}}}&\\ &\frac{1}{\sqrt{1-\rho_{1}^{2}}}&\\ &&\ddots&\ddots&\\ &&&\frac{1}{\sqrt{1-\rho_{\tilde{N}-2}^{2}}}&\frac{-\rho_{\tilde{N}-1}}{\sqrt{1-\rho_{\tilde{N}-1}^{2}}}\\ &&&&\frac{1}{\sqrt{1-\rho_{\tilde{N}-1}^{2}}}\\ \end{pmatrix}.

(32)

Here only up to 2 entries in each row of $(\mathbf{L}^{T}_{s})^{-1}$ are nonzero. Using a backward solver, the $\mathbf{g}_{2}$ can be obtained by the iteration below:

	$\displaystyle\mathbf{g}_{2}[\tilde{N}]$	$\displaystyle=\mathbf{g}_{1}[\tilde{N}]\sqrt{1-\rho_{\tilde{N}-1}^{2}},$		(33)
	$\displaystyle\mathbf{g}_{2}[k]$	$\displaystyle=\sqrt{1-\rho_{k-1}^{2}}\mathbf{g}_{1}[k]+\frac{\rho_{k}\mathbf{g}_{2}[k+1]\sqrt{1-\rho_{k-1}^{2}}}{\sqrt{1-\rho_{k}^{2}}},\quad$		(34)

for $k=\tilde{N}-1,...,2,1$ . Note that the Thomas algorithm is not stable in general, but here the stability issue is greatly improved, as the matrix in the system is bi-diagonal instead of tri-diagonal.

Third, we compute $\mathbf{g}_{3}=\mathbf{L}_{s}\mathbf{g}_{2}$ by solving $\mathbf{L}^{-1}_{s}\mathbf{g}_{3}=\mathbf{g}_{2}$ :

	$\displaystyle\mathbf{g}_{3}[1]$	$\displaystyle=\mathbf{g}_{2}[1],$		(35)
	$\displaystyle\mathbf{g}_{3}[k]$	$\displaystyle=\sqrt{1-\rho_{k-1}^{2}}\mathbf{g}_{2}[k]+\rho_{k-1}\mathbf{g}_{3}[k-1],$		(36)

for $k=2,....,\tilde{N}-1$ .

Finally, we denote a $MLn\times n$ matrix $\mathbf{P}_{c}$ . $\mathbf{P}_{c}$ is initialized as a zero matrix. And then for $r_{c}=1,...,MLn$ , row $r_{c}$ of $\mathbf{P}_{c}$ stores the ranks of distances between the $i$ th particle and other $n-1$ particles in $\mathbf{d}_{s}$ . For instance, at the $l$ th time step in the $m$ th simulation, particle $i$ has $n-1$ non-zero distances $||\mathbf{x}_{1}-\mathbf{x}_{i}||,...,||\mathbf{x}_{i-1}-\mathbf{x}_{i}||,||\mathbf{x}_{i+1}-\mathbf{x}_{i}||,...,||\mathbf{x}_{n}-\mathbf{x}_{i}||$ with ranks $h_{1},....,h_{i-1},h_{i+1},...h_{n}$ in $\mathbf{d}_{s}$ . Then the $((m-1)Ln+(l-1)n+i)$ th row of $\mathbf{P}_{c}$ is filled with $(h_{1},...,h_{i-1},h_{i+1},...,h_{n})$ .

Given any $\tilde{N}$ -vector $\mathbf{g}_{3}$ , the $k$ th entry of $\mathbf{U}_{s}\mathbf{g}_{3}$ can be written as

(\mathbf{U}_{s}\mathbf{g}_{3})[k]=\mathbf{U}_{re}[k,.]\mathbf{g}_{3}[\mathbf{P}_{c}[k^{\prime},.]^{T}],

(37)

assuming that $k$ satisfies $k=(m-1)LnD+(l-1)nD+jn+i$ and $k^{\prime}=i+(m-1)Ln+(l-1)n$ for some $m,l,j$ and $i$ , and $k=1,...,N$ . The output of this step is an $N$ vector $\mathbf{g}_{4}$ , with the $k$ th entry being $\mathbf{g}_{4}[k]:=(\mathbf{U}^{T}_{s}\mathbf{g}_{3})[k]$ , for $k=1,...,N$ .

We summarize the sparse CG-GP algorithm using the following steps to compute $\tilde{\mathbf{z}}$ in (29) below.

1.

Use equation (30) to compute $\mathbf{g}_{1}[k]=(\mathbf{U}^{T}_{s}\mathbf{z})[k]$ , for $k=1,...,\tilde{N}$ .
2.

Use equations (33) and (34) to solve $\mathbf{g}_{2}$ from $(\mathbf{L}_{s}^{T})^{-1}\mathbf{g}_{2}=\mathbf{g}_{1}$ where $(\mathbf{L}_{s}^{T})^{-1}=(\mathbf{L}_{s}^{-1})^{T}$ with $\mathbf{L}_{s}^{-1}$ given in equation (32).
3.

Use equations (35) and (36) to solve $\mathbf{g}_{3}$ from $\mathbf{L}_{s}^{-1}\mathbf{g}_{3}=\mathbf{g}_{2}$ , where $\mathbf{L}_{s}^{-1}$ is given in equation (32).
4.

Use equation (37) to compute $\mathbf{g}_{4}[k]=(\mathbf{U}_{s}\mathbf{g}_{3})[k]$ and let $\tilde{\mathbf{z}}=\mathbf{g}_{4}+\eta\mathbf{z}$ .

5.3 Interaction kernels in simulated studies

Here we give the expressions of the truncated L-J and OD kernels of particle interaction in [34, 13]. The truncated LJ kernel is given by

\phi_{LJ}(d)=\left\{\begin{array}[]{ll}c_{2}\exp(-c_{1}d^{12}),&d\in[0,0.95],\\ \frac{8(d^{-4}-d^{-10})}{3},&d\in(0.95,\infty),\end{array}\right.

where

\displaystyle c_{1}=-\frac{1}{12}\frac{c_{4}}{c_{3}(0.95)^{11}}\mbox{ and }c_{2}=c_{3}\exp(c_{1}(0.95)^{12}),

with $c_{3}=\frac{8}{3}(0.95^{-4}-0.95^{-10})$ and $c_{4}=\frac{8}{3}(10(0.95)^{-11}-4(0.95)^{-5})$ .

The interaction kernel of OD is defined as

\phi_{OD}(d)=\left\{\begin{array}[]{ll}0.4,&d\in[0,c_{5}),\\ -0.3\cos(10\pi(d-c_{5}))+0.7,&d\in[c_{5},c_{6}),\\ 1,&d\in[c_{6}\leq d<0.95),\\ 0.5\cos(10\pi(d-0.95))+0.5,&d\in[0.95,1.05),\\ 0,&d\in[1.05,\infty),\end{array}\right.

where $c_{5}=\frac{1}{\sqrt{2}}-0.05$ and $c_{6}=\frac{1}{\sqrt{2}}+0.05$ .

5.4 Further numerical results on estimating interaction kernels

We outline the numerical results of estimating the interaction functions at $N^{*}_{1}=1000$ equally spaced distance pairs at $d\in[0,5]$ and $d\in[0,1.5]$ for the truncated LJ and OD, respectively. For each configuration, we repeat the simulation $N^{*}_{2}=10$ times and compute the predictive error in each simulation. The total number of test points is $N^{*}=N^{*}_{1}N^{*}_{2}=10^{4}$ . For demonstration purposes, we do not add a noise into the simulated data (i.e. $\sigma^{2}_{0}=0$ ). The range and nugget parameters are fixed to be $\gamma=5$ and $\eta=10^{-5}$ . We compute the normalized root of mean squared error (NRMSE) in estimating the interaction kernel function:

\mbox{NRMSE}=\frac{1}{\sigma_{\phi}}\sqrt{\sum^{N^{*}}_{i=1}\frac{(\hat{\phi}(d^{*}_{i})-\phi(d^{*}_{i}))^{2}}{N^{*}}},

where $\hat{\phi}(.)$ is the estimated interaction kernel from the velocities and positions of the particles; $\sigma_{\phi}$ is the standard deviation of the interaction function at test points.

Truncated LJ	n=50	n=200	n=50	n=200
	L=1	L=1	L=10	L=10
Uniform	$.11$	$.021$	$.026$	$.0051$
Normal	$.037$	$.012$	$.0090$	$.0028$
Log-uniform	$.043$	$.0036$	$.0018$	$.00091$
OD	n=50	n=200	n=50	n=200
	L=1	L=1	L=10	L=10
Uniform	$.024$	$.0086$	$.0031$	$.0036$
Normal	$.13$	$.013$	$.038$	$.0064$
Log-uniform	$.076$	$.0045$	$.0018$	$.00081$

Table 1: NRMSE of the sparse CG-GP method for estimating the truncated LJ and OD kernels of particle interaction.

Table 1 gives the NRMSE of the sparse CG-GP method for the truncated LJ and OD kernels at 12 configurations. Typically the estimation is the most accurate when the initial positions of the particles are sampled from the log-uniform design for a given number of observations and an interaction kernel. This is because the contributions to the velocities from the kernel function are proportional to the distance of particle in (21), and small contributions from the interaction kernel at small distance values make the kernel hard to estimate from the trajectory data in general. When the initial positions of the particles are sampled from the log-uniform design, more particles are close to each other, which provides more information to estimate the interaction kernel at a small distance.

Furthermore, the predictive error of the interaction kernel is smaller, when the trajectories with a larger number of particle sizes or at longer time points are used in estimation, as more observations typically improve predictive accuracy. The sparse CG-GP algorithm reduces the computational cost substantially, which allows more observations to be used for making predictions.

References

[1] Ronald J Adrian and Jerry Westerweel. Particle image velocimetry. Number 30. Cambridge university press, 2011.
[2] Kyle R Anderson, Ingrid A Johanson, Matthew R Patrick, Mengyang Gu, Paul Segall, Michael P Poland, Emily K Montgomery-Brown, and Asta Miklius. Magma reservoir failure and the onset of caldera collapse at Kīlauea volcano in 2018. Science, 366(6470), 2019.
[3] Sudipto Banerjee, Bradley P Carlin, and Alan E Gelfand. Hierarchical modeling and analysis for spatial data. Crc Press, 2014.
[4] Maria Maddalena Barbieri and James O Berger. Optimal predictive model selection. The annals of statistics, 32(3):870–897, 2004.
[5] Maria J Bayarri, James O Berger, Rui Paulo, Jerry Sacks, John A Cafeo, James Cavendish, Chin-Hsu Lin, and Jian Tu. A framework for validation of computer models. Technometrics, 49(2):138–154, 2007.
[6] James O Berger, Victor De Oliveira, and Bruno Sansó. Objective Bayesian analysis of spatially correlated data. Journal of the American Statistical Association, 96(456):1361–1374, 2001.
[7] James O Berger, Brunero Liseo, and Robert L Wolpert. Integrated likelihood methods for eliminating nuisance parameters. Statistical science, 14(1):1–28, 1999.
[8] James O Berger and Luis R Pericchi. The intrinsic bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433):109–122, 1996.
[9] Iain D Couzin, Jens Krause, Nigel R Franks, and Simon A Levin. Effective leadership and decision-making in animal groups on the move. Nature, 433(7025):513–516, 2005.
[10] Noel Cressie and Gardar Johannesson. Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226, 2008.
[11] Abhirup Datta, Sudipto Banerjee, Andrew O Finley, and Alan E Gelfand. Hierarchical nearest-neighbor gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812, 2016.
[12] Bruno De Finetti. La prévision: ses lois logiques, ses sources subjectives. In Annales de l’institut Henri Poincaré, volume 7, pages 1–68, 1937.
[13] Jinchao Feng, Yunxiang Ren, and Sui Tang. Data-driven discovery of interacting particle systems using gaussian processes. arXiv preprint arXiv:2106.02735, 2021.
[14] Alan E Gelfand, Sudipto Banerjee, and Dani Gamerman. Spatial process modelling for univariate and multivariate dynamic spatial data. Environmetrics: The official journal of the International Environmetrics Society, 16(5):465–479, 2005.
[15] Robert B Gramacy and Daniel W Apley. Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578, 2015.
[16] Robert B Gramacy and Herbert KH Lee. Cases for the nugget in modeling computer experiments. Statistics and Computing, 22(3):713–722, 2012.
[17] Mengyang Gu and Hanmo Li. Gaussian Orthogonal Latent Factor Processes for Large Incomplete Matrices of Correlated Data. Bayesian Analysis, pages 1 – 26, 2022.
[18] Mengyang Gu, Jesus Palomo, and James O. Berger. RobustGaSP: Robust Gaussian Stochastic Process Emulation in R. The R Journal, 11(1):112–136, 2019.
[19] Mengyang Gu and Weining Shen. Generalized probabilistic principal component analysis of correlated data. Journal of Machine Learning Research, 21(13), 2020.
[20] Mengyang Gu, Xiaojing Wang, and James O Berger. Robust Gaussian stochastic process emulation. Annals of Statistics, 46(6A):3038–3066, 2018.
[21] Wolfgang Hackbusch. Iterative solution of large sparse systems of equations, volume 95. Springer, 1994.
[22] Jouni Hartikainen and Simo Sarkka. Kalman filtering and smoothing solutions to temporal gaussian process regression models. In Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on, pages 379–384. IEEE, 2010.
[23] Silke Henkes, Yaouen Fily, and M Cristina Marchetti. Active jamming: Self-propelled soft particles at high density. Physical Review E, 84(4):040301, 2011.
[24] Magnus R Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving. Journal of research of the National Bureau of Standards, 49(6):409, 1952.
[25] Dave Higdon, James Gattiker, Brian Williams, and Maria Rightley. Computer model calibration using high-dimensional output. Journal of the American Statistical Association, 103(482):570–583, 2008.
[26] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960.
[27] Matthias Katzfuss and Joseph Guinness. A general framework for vecchia approximations of gaussian processes. Statistical Science, 36(1):124–141, 2021.
[28] Hannes Kazianka and Jürgen Pilz. Objective Bayesian analysis of spatial data with uncertain nugget and range parameters. Canadian Journal of Statistics, 40(2):304–327, 2012.
[29] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.
[30] Clifford Lam and Qiwei Yao. Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics, 40(2):694–726, 2012.
[31] Clifford Lam, Qiwei Yao, and Neil Bathia. Estimation of latent factors for high-dimensional time series. Biometrika, 98(4):901–918, 2011.
[32] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
[33] Finn Lindgren, Håvard Rue, and Johan Lindström. An explicit link between gaussian fields and gaussian markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011.
[34] Fei Lu, Ming Zhong, Sui Tang, and Mauro Maggioni. Nonparametric inference of interaction laws in systems of agents from trajectory data. Proc. Natl. Acad. Sci. U.S.A., 116(29):14424–14433, 2019.
[35] M Cristina Marchetti, Jean-François Joanny, Sriram Ramaswamy, Tanniemola B Liverpool, Jacques Prost, Madan Rao, and R Aditi Simha. Hydrodynamics of soft active matter. Reviews of modern physics, 85(3):1143, 2013.
[36] Sebastien Motsch and Eitan Tadmor. Heterophilious dynamics enhances consensus. SIAM review, 56(4):577–621, 2014.
[37] Joseph Muré. Propriety of the reference posterior distribution in gaussian process modeling. The Annals of Statistics, 49(4):2356–2377, 2021.
[38] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
[39] Rui Paulo. Default priors for Gaussian processes. Annals of statistics, 33(2):556–582, 2005.
[40] Rui Paulo, Gonzalo García-Donato, and Jesús Palomo. Calibration of computer models with multivariate output. Computational Statistics and Data Analysis, 56(12):3959–3974, 2012.
[41] Giovanni Petris, Sonia Petrone, and Patrizia Campagnoli. Dynamic linear models with R. Springer, 2009.
[42] Adrian E Raftery, David Madigan, and Jennifer A Hoeting. Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92(437):179–191, 1997.
[43] Dennis C Rapaport and Dennis C Rapaport Rapaport. The art of molecular dynamics simulation. Cambridge university press, 2004.
[44] Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press, 2006.
[45] Herbert E Rauch, F Tung, and Charlotte T Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA journal, 3(8):1445–1450, 1965.
[46] Cuirong Ren, Dongchu Sun, and Chong He. Objective bayesian analysis for a spatial model with nugget effects. Journal of Statistical Planning and Inference, 142(7):1933–1946, 2012.
[47] Olivier Roustant, David Ginsbourger, and Yves Deville. Dicekriging, diceoptim: Two r packages for the analysis of computer experiments by kriging-based metamodeling and optimization. Journal of Statistical Software, 51(1):1–55, 2012.
[48] Håvard Rue, Sara Martino, and Nicolas Chopin. Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations. Journal of the royal statistical society: Series B (statistical methodology), 71(2):319–392, 2009.
[49] Yousef Saad. Iterative methods for sparse linear systems. SIAM, 2003.
[50] Jerome Sacks, William J Welch, Toby J Mitchell, and Henry P Wynn. Design and analysis of computer experiments. Statistical science, 4(4):409–423, 1989.
[51] Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter Battaglia. Learning to simulate complex physics with graph networks. In International Conference on Machine Learning, pages 8459–8468. PMLR, 2020.
[52] Thomas J Santner, Brian J Williams, and William I Notz. The design and analysis of computer experiments. Springer Science & Business Media, 2003.
[53] Matthias Seeger, Yee-Whye Teh, and Michael Jordan. Semiparametric latent factor models. Technical report, 2005.
[54] Edward Snelson and Zoubin Ghahramani. Sparse gaussian processes using pseudo-inputs. Advances in neural information processing systems, 18:1257, 2006.
[55] Jonathan R Stroud, Michael L Stein, and Shaun Lysen. Bayesian and maximum likelihood estimation for gaussian processes on an incomplete lattice. Journal of computational and Graphical Statistics, 26(1):108–120, 2017.
[56] S. Surjanovic and D. Bingham. Virtual library of simulation experiments: Test functions and datasets. http://www.sfu.ca/~ssurjano, 2017.
[57] Llewellyn Hilleth Thomas. Elliptic problems in linear difference equations over a network. Watson Sci. Comput. Lab. Rept., Columbia University, New York, 1:71, 1949.
[58] Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
[59] Aldo V Vecchia. Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society: Series B (Methodological), 50(2):297–312, 1988.
[60] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397–434, 2013.
[61] M. West and P. J. Harrison. Bayesian Forecasting & Dynamic Models. Springer Verlag, 2nd edition, 1997.
[62] Mike West and Jeff Harrison. Bayesian forecasting and dynamic models. Springer Science & Business Media, 2006.
[63] Peter Whittle. Stochastic process in several dimensions. Bulletin of the International Statistical Institute, 40(2):974–994, 1963.
[64] Andrew G Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. Advances in neural information processing systems, 33:4697–4708, 2020.

Scalable marginalization of correlated latent variables with applications to learning particle interaction kernels

Abstract

1 Introduction

2 Background: Gaussian process

3 Marginalization in Kalman filter

3.1 State space representation of GP with the Matérn kernel

3.2 Kalman filter as a scalable marginalization technique

Lemma 1 (Kalman Filter and RTS Smoother [26, 45]).

Lemma 2 (Likelihood and predictive distribution).

3.3 Marginalization of correlated matrix observations with multi-dimensional inputs

Assumption 1.

Assumption 2.

Lemma 3 (Posterior independence and orthogonal projection [17]).

Theorem 1 (Generalized probabilistic principal component analysis [19]).

4 Scalable marginalization for learning particle interaction kernels from trajectory data

4.1 Scalable computation based on sparse representation of inverse covariance

4.2 Numerical results

5 Concluding remarks

Acknowledgement

Appendix

5.1 Closed-form expressions of state space representation of GP having Matérn covariance with ν=5/2\nu=5/2

5.2 The sparse CG-GP algorithm for estimating interaction kernels

5.3 Interaction kernels in simulated studies

5.4 Further numerical results on estimating interaction kernels

References

5.1 Closed-form expressions of state space representation of GP having Matérn covariance with $\nu=5/2$