
Scalable marginalization of correlated latent variables with applications to learning particle interaction kernels

Mengyang Gu (correspondence should be addressed to Mengyang Gu, mengyang@pstat.ucsb.edu), Xubo Liu, Xinyi Fang, Sui Tang

Department of Statistics and Applied Probability, University of California, Santa Barbara, CA
Department of Mathematics, University of California, Santa Barbara, CA
Abstract

Marginalization of latent variables or nuisance parameters is a fundamental aspect of Bayesian inference and uncertainty quantification. In this work, we focus on scalable marginalization of latent variables in modeling correlated data, such as spatio-temporal or functional observations. We first introduce Gaussian processes (GPs) for modeling correlated data and highlight the computational challenge, where the computational complexity increases as the cube of the number of observations. We then review the connection between the state space model and GPs with Matérn covariance for temporal inputs. The Kalman filter and Rauch-Tung-Striebel smoother are introduced as a scalable marginalization technique for computing the likelihood and making predictions of GPs without approximation. We summarize recent efforts on extending the scalable marginalization idea to the linear model of coregionalization for multivariate correlated output and spatio-temporal observations. In the final part of this work, we introduce a novel marginalization technique to estimate interaction kernels and forecast particle trajectories. The computational advance lies in the sparse representation of the inverse covariance matrix of the latent variables, combined with a conjugate gradient algorithm for improving predictive accuracy with large data sets. The computational advances achieved in this work enable a wide range of applications in molecular dynamics simulation, cellular migration, and agent-based models.

KEYWORDS: Marginalization, Bayesian inference, Scalable computation, Gaussian process, Kalman filter, Particle interaction

1 Introduction

Given a set of latent variables in a model, do we fit a model with a particular set of latent variables, or do we integrate out the latent variables when making predictions? Marginalization of latent variables is an iconic feature of Bayesian analysis. The art of marginalization in statistics can be traced back at least to de Finetti's theorem [12], which states that an infinite sequence \{X_{i}\}^{\infty}_{i=1} is exchangeable if and only if there exists a random variable \theta\in\Theta with probability distribution \pi(\cdot), and a conditional distribution p(\cdot\mid\theta), such that

p(x_{1},\ldots,x_{N})=\int\left\{\prod^{N}_{i=1}p(x_{i}\mid\theta)\right\}\pi(\theta)\,d\theta. (1)

Marginalization of nuisance parameters for models with independent observations has been comprehensively reviewed in [7]. Bayesian model selection [8, 4] and Bayesian model averaging [42], as two other examples, both rely on the marginalization of parameters in each model.

For spatially correlated data, the Jeffreys prior of the covariance parameters in a Gaussian process (GP), which is proportional to the square root of the determinant of the Fisher information matrix of the likelihood, often leads to improper posteriors [6]. The posterior of the covariance parameters becomes proper if the prior is derived from the Fisher information matrix of the marginal likelihood, after marginalizing out the mean and variance parameters. The resulting prior, after marginalization, is a reference prior, which has been studied for modeling spatially correlated data and computer model emulation [39, 46, 28, 20, 37].

Marginalization of latent variables has lately drawn attention in the machine learning community as well, for purposes of uncertainty quantification and propagation. In [29], for instance, deep ensembles of models with a scoring function were proposed to assess the uncertainty in deep neural networks, an approach closely related to Bayesian model averaging with a uniform prior on parameters. This approach was further studied in [64], where the importance of marginalization is highlighted. Neural networks with infinite width were shown to be equivalent to a GP with a particular kernel function in [38], and it was later shown in [32] that the results of deep neural networks can be reproduced by GPs, where the latent nodes are marginalized out.

In this work, we study the marginalization of latent variables for correlated data, particularly focusing on scalable computation. Gaussian processes have been ubiquitously used for modeling spatially correlated data [3] and emulating computer experiments [50]. Computing the likelihood in GPs and making predictions, however, cost \mathcal{O}(N^{3}) operations, where N is the number of observations, due to finding the inverse and determinant of the covariance matrix. To overcome this computational bottleneck, various approximation approaches have been developed, such as inducing point approaches [54], fixed rank approximation [10], integrated nested Laplace approximation [48], stochastic partial differential equation representation [33], local Gaussian process approximation [15], hierarchical nearest-neighbor Gaussian process models [11], and circulant embedding [55], many of which can be summarized within the framework of the Vecchia approximation [59, 27]. Scalable computation of a GP model with a multi-dimensional input space and a smooth covariance function has been of great interest in recent years.

The exact computation of GP models with smaller computational complexity has been less studied in the past. To fill this knowledge gap, we will first review the stochastic differential equation representation of a GP with the Matérn covariance and a one-dimensional input variable [63, 22], whose solution can be written as a dynamic linear model [62]. The Kalman filter and Rauch-Tung-Striebel smoother [26, 45] can be implemented for computing the likelihood function and predictive distribution exactly, reducing the computational complexity of a GP using a Matérn kernel with a half-integer roughness parameter and 1D input from \mathcal{O}(N^{3}) to \mathcal{O}(N) operations. Here, interestingly, the latent states of the GP model are iteratively marginalized out in the Kalman filter. Thus the Kalman filter can be considered as an example of marginalization of latent variables that leads to efficient computation. Note that the Kalman filter is not directly applicable to GPs with multivariate inputs, yet GPs with some of the widely used covariance structures, such as the product or separable kernel [5] and the linear model of coregionalization [3], can be written as state space models on an augmented lattice [19, 17]. Based on this connection, we introduce a few extensions of scalable marginalization for modeling incomplete matrices of correlated data.

The contributions of this work are twofold. First, the computational scalability and efficiency of marginalizing latent variables for models of correlated data and functional data have been less studied. Here we discuss the marginalization of latent states in the Kalman filter for computing the likelihood and making predictions, with only \mathcal{O}(N) computational operations. We discuss recent extensions to structured data with multi-dimensional inputs. Second, we develop new marginalization techniques to estimate interaction kernels of particles and to forecast trajectories of particles, which have wide applications in agent-based models [9], cellular migration [23], and molecular dynamics simulation [43]. The computational gain comes from the sparse representation of the inverse covariance of interaction kernels, and the use of the conjugate gradient algorithm [24] for iterative computation. Specifically, we reduce the computational order from \mathcal{O}((nMDL)^{3})+\mathcal{O}(n^{4}L^{2}M^{2}D) operations in recent studies [34, 13] to \mathcal{O}(Tn^{2}MDL)+\mathcal{O}(n^{2}MDL\log(nMDL)) operations based on training data of M simulation runs, each containing n particles in a D-dimensional space at L time points, with T being the number of iterations in the sparse conjugate gradient algorithm. This allows us to estimate interaction kernels of dynamic systems with many more observations. Here the sparsity comes from the use of the Matérn kernel, which is distinct from the approximation methods based on sparse covariance structures.

The rest of the paper is organized as follows. We first introduce the GP as a surrogate model for approximating computationally expensive simulations in Section 2. The state space model representation of a GP with Matérn covariance and temporal input is introduced in Section 3.1. We then review the Kalman filter as a computationally scalable technique to marginalize out latent states for computing the likelihood of a GP model and making predictions in Section 3.2. In Section 3.3, we discuss the extension of latent state marginalization in linear models of coregionalization for multivariate functional data, and spatial and spatio-temporal data on incomplete lattices. The new computationally scalable algorithm for estimating interaction kernels and forecasting particle trajectories is introduced in Section 4. We conclude this study and discuss a few potential research directions in Section 5. The code and data used in this paper are publicly available: https://github.com/UncertaintyQuantification/scalable_marginalization.

2 Background: Gaussian process

We briefly introduce the GP model in this section. We focus on computer model emulation, where the GP emulator is often used as a surrogate model to approximate computer experiments [52]. Consider a real-valued unknown function z(\cdot), modeled by a Gaussian stochastic process (GaSP) or Gaussian process (GP), z(\cdot)\sim\mathcal{GP}(\mu(\cdot),\sigma^{2}K(\cdot,\cdot)), meaning that, for any inputs \{\mathbf{x}_{1},\ldots,\mathbf{x}_{N}\} (with \mathbf{x}_{i} being a p\times 1 vector), the marginal distribution of \mathbf{z}=(z(\mathbf{x}_{1}),\ldots,z(\mathbf{x}_{N}))^{T} follows a multivariate normal distribution,

\mathbf{z}\mid\bm{\beta},\,\sigma^{2},\,{\bm{\gamma}}\sim\mathcal{MN}(\bm{\mu},\sigma^{2}{\mathbf{R}})\,, (2)

where \bm{\mu}=(\mu(\mathbf{x}_{1}),\ldots,\mu(\mathbf{x}_{N}))^{T} is a vector of mean or trend parameters, \sigma^{2} is the unknown variance, and {\mathbf{R}} is the correlation matrix whose (i,j) element is modeled by a kernel K(\mathbf{x}_{i},\mathbf{x}_{j}) with parameters \bm{\gamma}. It is common to model the mean by \mu(\mathbf{x})=\mathbf{h}(\mathbf{x}){\bm{\beta}}, where \mathbf{h}(\mathbf{x}) is a 1\times q row vector of basis functions, and \bm{\beta} is a q\times 1 vector of mean parameters.

When modeling spatially correlated data, an isotropic kernel is often used, where the kernel depends only on the Euclidean distance, K(\mathbf{x}_{a},\mathbf{x}_{b})=K(||\mathbf{x}_{a}-\mathbf{x}_{b}||). In comparison, each coordinate of the latent function in computer experiments could have a different physical meaning and unit. Thus a product kernel is often used in constructing a GP emulator, such that the correlation lengths can differ across coordinates. For any \mathbf{x}_{a}=(x_{a1},\ldots,x_{ap}) and \mathbf{x}_{b}=(x_{b1},\ldots,x_{bp}), the kernel function can be written as K(\mathbf{x}_{a},\mathbf{x}_{b})=K_{1}(x_{a1},x_{b1})\times\cdots\times K_{p}(x_{ap},x_{bp}), where K_{l} is a kernel for the lth coordinate with a distinct range parameter \gamma_{l}, for l=1,\ldots,p. Some frequently used kernels K_{l} include the power exponential and Matérn kernel functions [44]. The Matérn kernel, for instance, follows

K_{l}(d_{l})=\frac{1}{2^{\nu_{l}-1}\Gamma(\nu_{l})}\left(\frac{\sqrt{2\nu_{l}}d_{l}}{\gamma_{l}}\right)^{\nu_{l}}\mathcal{K}_{\nu_{l}}\left(\frac{\sqrt{2\nu_{l}}d_{l}}{\gamma_{l}}\right), (3)

where d_{l}=|x_{al}-x_{bl}|, \Gamma(\cdot) is the gamma function, and \mathcal{K}_{\nu_{l}}(\cdot/\gamma_{l}) is the modified Bessel function of the second kind, with range parameter \gamma_{l} and roughness parameter \nu_{l}. The Matérn correlation has a closed-form expression when the roughness parameter is a half-integer, i.e. \nu_{l}=(2k_{l}+1)/2 with k_{l}\in\mathbb{N}. It becomes the exponential correlation and the Gaussian correlation when k_{l}=0 and k_{l}\to\infty, respectively. The GP with a Matérn kernel is \lfloor\nu_{l}\rfloor times mean square differentiable in the lth coordinate. This is a useful property, as the differentiability of the process is directly controlled by the roughness parameter.
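To make the product-kernel construction concrete, the short sketch below (Python with NumPy; an illustration added here, not code from the original paper) evaluates the Matérn correlation with roughness parameter \nu_{l}=5/2 in each coordinate and multiplies the coordinates together; the range parameters gamma are assumed given.

    import numpy as np

    def matern_5_2(d, gamma):
        """Matern correlation with roughness 5/2 for distances d >= 0 and range gamma."""
        a = np.sqrt(5.0) * np.abs(d) / gamma
        return (1.0 + a + a**2 / 3.0) * np.exp(-a)

    def product_kernel(xa, xb, gamma):
        """Product of one-dimensional Matern-5/2 correlations over the p coordinates."""
        d = np.abs(np.asarray(xa, dtype=float) - np.asarray(xb, dtype=float))
        return float(np.prod(matern_5_2(d, np.asarray(gamma, dtype=float))))

    # correlation between two inputs in p = 2 dimensions with distinct range parameters
    print(product_kernel([0.1, 0.4], [0.3, 0.9], gamma=[0.5, 1.0]))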

Denote the mean basis of the observations by \mathbf{H}=(\mathbf{h}^{T}(\mathbf{x}_{1}),\ldots,\mathbf{h}^{T}(\mathbf{x}_{N}))^{T}. The parameters in the GP contain the mean parameters \bm{\beta}, the variance parameter \sigma^{2}, and the range parameters \bm{\gamma}=(\gamma_{1},\ldots,\gamma_{p}). Integrating out the mean and variance parameters with respect to the reference prior \pi(\bm{\beta},\sigma)\propto 1/\sigma^{2}, the predictive distribution at any input \mathbf{x}^{*} follows a Student's t distribution [20]:

z({\mathbf{x}^{*}})\mid\mathbf{z},\,\bm{\gamma}\sim\mathcal{T}(\hat{z}({\mathbf{x}}^{*}),\hat{\sigma}^{2}K^{**},N-q)\,, (4)

with N-q degrees of freedom, where

\hat{z}({\mathbf{x}}^{*})={\mathbf{h}({\mathbf{x}}^{*})}\hat{\bm{\beta}}+\mathbf{r}^{T}(\mathbf{x}^{*}){{\mathbf{R}}}^{-1}\left(\mathbf{z}-\mathbf{H}\hat{\bm{\beta}}\right), (5)
\hat{\sigma}^{2}=(N-q)^{-1}{\left(\mathbf{z}-\mathbf{H}\hat{\bm{\beta}}\right)}^{T}{{\mathbf{R}}}^{-1}\left({\mathbf{z}}-\mathbf{H}\hat{\bm{\beta}}\right), (6)
K^{**}=K({\mathbf{x}^{*}},{\mathbf{x}^{*}})-{\mathbf{r}^{T}(\mathbf{x}^{*}){{\mathbf{R}}}^{-1}\mathbf{r}(\mathbf{x}^{*})}+{\mathbf{h}}^{*}(\mathbf{x}^{*})^{T}\left(\mathbf{H}^{T}{{\mathbf{R}}}^{-1}\mathbf{H}\right)^{-1}{\mathbf{h}}^{*}(\mathbf{x}^{*}), (7)

with {\mathbf{h}}^{*}(\mathbf{x}^{*})=\left(\mathbf{h}^{T}(\mathbf{x}^{*})-\mathbf{H}^{T}{{\mathbf{R}}}^{-1}\mathbf{r}(\mathbf{x}^{*})\right), \hat{\bm{\beta}}=\left(\mathbf{H}^{T}{{\mathbf{R}}}^{-1}\mathbf{H}\right)^{-1}\mathbf{H}^{T}{{\mathbf{R}}}^{-1}\mathbf{z} being the generalized least squares estimator of \bm{\beta}, and \mathbf{r}(\mathbf{x}^{*})=(K(\mathbf{x}^{*},{\mathbf{x}}_{1}),\ldots,K(\mathbf{x}^{*},{\mathbf{x}}_{N}))^{T}.
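For readers who prefer code, the following sketch (an assumed Python/NumPy implementation with a constant mean basis h(x)=1, so q=1) evaluates Equations (5)-(7) by direct \mathcal{O}(N^{3}) linear algebra; here kernel can be any correlation function, such as the product Matérn sketch above, and the small jitter is only a numerical stabilizer.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def gp_predict(X, z, x_star, kernel, gamma):
        """Direct O(N^3) GP prediction with a constant mean basis h(x) = 1 (q = 1)."""
        N = len(X)
        R = np.array([[kernel(X[i], X[j], gamma) for j in range(N)] for i in range(N)])
        r = np.array([kernel(x, x_star, gamma) for x in X])
        h = np.ones(N)                                   # H with a single constant basis
        L = cho_factor(R + 1e-10 * np.eye(N))            # jitter for numerical stability
        Ri_z, Ri_h, Ri_r = cho_solve(L, z), cho_solve(L, h), cho_solve(L, r)
        beta_hat = (h @ Ri_z) / (h @ Ri_h)               # GLS estimator of the mean
        resid = z - beta_hat * h
        z_hat = beta_hat + r @ cho_solve(L, resid)       # predictive mean, Eq. (5)
        sigma2_hat = resid @ cho_solve(L, resid) / (N - 1)   # Eq. (6) with q = 1
        h_star = 1.0 - h @ Ri_r
        K_star = kernel(x_star, x_star, gamma) - r @ Ri_r + h_star**2 / (h @ Ri_h)  # Eq. (7)
        return z_hat, sigma2_hat * K_star                # center and scale of the t distribution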

Direct computation of the likelihood requires \mathcal{O}(N^{3}) operations, due to computing the Cholesky decomposition of the covariance matrix for matrix inversion and the determinant of the covariance matrix. Thus a posterior sampling algorithm, such as a Markov chain Monte Carlo (MCMC) algorithm, can be slow, as it requires a large number of posterior samples. Plug-in estimators, such as the maximum likelihood estimator (MLE), are often used to estimate the range parameters \bm{\gamma} in the covariance. In [20], the maximum marginal posterior estimator (MMPE) with robust parameterizations was discussed to overcome the instability of the MLE. The MLE and MMPE of the parameters in a GP emulator, with both the product kernel and the isotropic kernel, are implemented in the RobustGaSP package [18].

In some applications, we may not directly observe the latent function but a noisy realization:

y(\mathbf{x})=z(\mathbf{x})+\epsilon(\mathbf{x}), (8)

where z(\cdot) is modeled as a zero-mean GP with covariance \sigma^{2}K(\cdot,\cdot), and \epsilon(\mathbf{x})\sim\mathcal{N}(0,\sigma^{2}_{0}) is independent Gaussian noise. This model is typically referred to as Gaussian process regression [44], which is suitable for scenarios containing noisy observations, such as experimental or field observations, numerical solutions of differential equations with non-negligible error, and stochastic simulations. Denote the noisy observations \mathbf{y}=(y(\mathbf{x}_{1}),y(\mathbf{x}_{2}),\ldots,y(\mathbf{x}_{N}))^{T} at the design input set \{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{N}\} and the nugget parameter \eta=\sigma^{2}_{0}/\sigma^{2}. Both the range and nugget parameters can be estimated by plug-in estimators [18]. The predictive distribution of z(\mathbf{x}^{*}) at any input \mathbf{x}^{*} can be obtained by replacing \mathbf{R} with \mathbf{\tilde{R}}=\mathbf{R}+\eta\mathbf{I}_{N} in Equation (4).

Constructing a GP emulator to approximate a computer simulation typically starts with a “space-filling” design, such as Latin hypercube sampling (LHS), to fill the input space. Numerical solutions of the computer model are then obtained at these design points, and the set \{(\mathbf{x}_{i},y_{i})\}^{N}_{i=1} is used for training the GP emulator. For any input \mathbf{x}^{*}, the predictive mean in (4) is often used for prediction, and the uncertainty can be obtained through the predictive distribution. In Figure 1, we plot the predictive mean of a GP emulator approximating the Branin function [56] with N training inputs sampled by LHS, using the default setting of the RobustGaSP package [18]. When the number of observations increases from N=12 (middle panel) to N=24 (right panel), the predictive error becomes smaller.

Figure 1: Predictions by the GP emulator of a function with 2D inputs, using N=12 and N=24 observations (black circles), shown in the middle and right panels, respectively.

The computational complexity of GP models increases at the order of \mathcal{O}(N^{3}), which prohibits applications to emulating complex computer simulations when a relatively large number of simulation runs is required. In the next section, we will introduce the state space representation of a GP with Matérn covariance and one-dimensional input, where the computational cost scales as \mathcal{O}(N) without approximation. This method can also be applied to problems with high-dimensional input spaces, as discussed in Section 4.

3 Marginalization in Kalman filter

3.1 State space representation of GP with the Matérn kernel

Suppose we model the observations by Equation (8), where the latent process z(\cdot) is assumed to follow a GP with 1D input. For simplicity, here we assume a zero mean (\mu=0); a mean function can easily be included in the analysis. It has been realized that a GP defined on a 1D input space with a Matérn covariance having a half-integer roughness parameter can be written as a stochastic differential equation (SDE) [63, 22], which can reduce the cost of computing the likelihood and making predictions from \mathcal{O}(N^{3}) to \mathcal{O}(N) operations, with the use of the Kalman filter. Here we first review the SDE representation, and then we discuss marginalization of latent variables in the Kalman filter algorithm for scalable computation.

When the roughness parameter is \nu=5/2, for instance, the Matérn kernel has the expression

K(d)=\left(1+\frac{\sqrt{5}d}{\gamma}+\frac{5d^{2}}{3\gamma^{2}}\right)\exp\left(-\frac{\sqrt{5}d}{\gamma}\right), (9)

where d=|x_{a}-x_{b}| is the distance between any x_{a},x_{b}\in\mathbb{R} and \gamma is a range parameter typically estimated from data. The output and two derivatives of the GP with the Matérn kernel in (9) can be written as the SDE below,

\frac{d\bm{\theta}(x)}{dx}=\mathbf{J}\bm{\theta}(x)+\mathbf{L}\varepsilon(x), (10)

or in matrix form,

\frac{d}{dx}\begin{pmatrix}z(x)\\ z^{(1)}(x)\\ z^{(2)}(x)\end{pmatrix}=\begin{pmatrix}0&1&0\\ 0&0&1\\ -\lambda^{3}&-3\lambda^{2}&-3\lambda\end{pmatrix}\begin{pmatrix}z(x)\\ z^{(1)}(x)\\ z^{(2)}(x)\end{pmatrix}+\begin{pmatrix}0\\ 0\\ 1\end{pmatrix}\varepsilon(x),

where \varepsilon(x)\sim N(0,\sigma^{2}), \lambda=\frac{\sqrt{2\nu}}{\gamma}=\frac{\sqrt{5}}{\gamma}, and z^{(l)}(\cdot) is the lth derivative of the process z(\cdot). Denote c=\frac{16}{3}\sigma^{2}\lambda^{5} and \mathbf{F}=(1,0,0). Assume the 1D inputs are ordered, i.e. x_{1}\leq x_{2}\leq\ldots\leq x_{N}. The solution of the SDE in (10) can be expressed as a continuous-time dynamic linear model [61],

\begin{split}y(x_{i})&=\mathbf{F}\bm{\theta}(x_{i})+\epsilon(x_{i}),\\ \bm{\theta}(x_{i})&=\mathbf{G}(x_{i})\bm{\theta}(x_{i-1})+\mathbf{w}(x_{i}),\end{split} (11)

where \mathbf{w}(x_{i})\sim\mathcal{MN}(\mathbf{0},\mathbf{W}(x_{i})) for i=2,\ldots,N, and the Gaussian noise follows \epsilon(x_{i})\sim\mathcal{N}(0,\sigma^{2}_{0}). Here \mathbf{G}(x_{i})=e^{\mathbf{J}(x_{i}-x_{i-1})} and \mathbf{W}(x_{i})=\int^{x_{i}-x_{i-1}}_{0}e^{\mathbf{J}t}\mathbf{L}c\mathbf{L}^{T}e^{\mathbf{J}^{T}t}dt for i=2,\ldots,N, and the first state follows the stationary distribution \bm{\theta}(x_{1})\sim\mathcal{MN}(\mathbf{0},\mathbf{W}(x_{1})) with \mathbf{W}(x_{1})=\int^{\infty}_{0}e^{\mathbf{J}t}\mathbf{L}c\mathbf{L}^{T}e^{\mathbf{J}^{T}t}dt. Both \mathbf{G}(x_{i}) and \mathbf{W}(x_{i}) have closed-form expressions given in Appendix 5.1. The joint distribution of the states follows \left(\bm{\theta}^{T}(x_{1}),\ldots,\bm{\theta}^{T}(x_{N})\right)^{T}\sim\mathcal{MN}(\mathbf{0},\bm{\Lambda}^{-1}), where the inverse covariance \bm{\Lambda} is a block tri-diagonal matrix discussed in Appendix 5.1.
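The transition quantities in (11) can be computed numerically as in the sketch below (assumed Python code; the closed-form expressions of Appendix 5.1 are not reproduced here), which builds J, L, and c for the Matérn kernel with \nu=5/2, computes G(x_{i})=e^{\mathbf{J}(x_{i}-x_{i-1})} with a matrix exponential, and approximates W(x_{i}) by quadrature.

    import numpy as np
    from scipy.linalg import expm
    from scipy.integrate import quad_vec

    def transition_matrices(delta, gamma, sigma2):
        """G(x_i) and W(x_i) of the Matern-5/2 state space model for a spacing delta."""
        lam = np.sqrt(5.0) / gamma
        J = np.array([[0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [-lam**3, -3.0 * lam**2, -3.0 * lam]])
        Lvec = np.array([[0.0], [0.0], [1.0]])
        c = 16.0 / 3.0 * sigma2 * lam**5
        G = expm(J * delta)                              # matrix exponential e^{J delta}
        integrand = lambda t: expm(J * t) @ (c * Lvec @ Lvec.T) @ expm(J.T * t)
        W, _ = quad_vec(integrand, 0.0, delta)           # W(x_i) by numerical quadrature
        return G, W

    G1, W1 = transition_matrices(delta=0.1, gamma=0.5, sigma2=1.0)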

3.2 Kalman filter as a scalable marginalization technique

For dynamic linear models in (11), Kalman filter and Rauch–Tung–Striebel (RTS) smoother can be used as an exact and scalable approach to compute the likelihood, and predictive distributions. The Kalman filter and RTS smoother are sometimes called the forward filtering and backward smoothing/sampling algorithm, widely used in dynamic linear models of time series. We refer the readers to [61, 41] for discussion of dynamic linear models.

Write \mathbf{G}(x_{i})=\mathbf{G}_{i}, \mathbf{W}(x_{i})=\mathbf{W}_{i}, \bm{\theta}(x_{i})=\bm{\theta}_{i} and y(x_{i})=y_{i} for i=1,\ldots,N. In Lemma 1, we summarize the Kalman filter and RTS smoother for the dynamic linear model in (11). Compared with the \mathcal{O}(N^{3}) computational operations and \mathcal{O}(N^{2}) storage cost of direct GP computation, the outcomes of the Kalman filter and RTS smoother can be used for computing the likelihood and predictive distribution with \mathcal{O}(N) operations and \mathcal{O}(N) storage cost, as summarized in Lemma 2. All the distributions in Lemma 1 and Lemma 2 are conditional distributions given the parameters (\gamma,\sigma^{2},\sigma^{2}_{0}), which are omitted for simplicity.

Lemma 1 (Kalman Filter and RTS Smoother [26, 45]).

1. (Kalman Filter.) Let \bm{\theta}_{i-1}|\mathbf{y}_{1:i-1}\sim\mathcal{MN}(\mathbf{m}_{i-1},\mathbf{C}_{i-1}). For i=2,\ldots,N, iteratively we have,

  • (i)

    the one-step-ahead predictive distribution of \bm{\theta}_{i} given \mathbf{y}_{1:i-1},

    \bm{\theta}_{i}|\mathbf{y}_{1:i-1}\sim\mathcal{MN}(\mathbf{b}_{i},\mathbf{B}_{i}), (12)

    with \mathbf{b}_{i}=\mathbf{G}_{i}\mathbf{m}_{i-1} and \mathbf{B}_{i}=\mathbf{G}_{i}\mathbf{C}_{i-1}\mathbf{G}^{T}_{i}+\mathbf{W}_{i},

  • (ii)

    the one-step-ahead predictive distribution of Y_{i} given \mathbf{y}_{1:i-1},

    Y_{i}|\mathbf{y}_{1:i-1}\sim\mathcal{N}(f_{i},Q_{i}), (13)

    with f_{i}=\mathbf{F}\mathbf{b}_{i} and Q_{i}=\mathbf{F}\mathbf{B}_{i}\mathbf{F}^{T}+\sigma^{2}_{0},

  • (iii)

    the filtering distribution of \bm{\theta}_{i} given \mathbf{y}_{1:i},

    \bm{\theta}_{i}|\mathbf{y}_{1:i}\sim\mathcal{MN}(\mathbf{m}_{i},\mathbf{C}_{i}), (14)

    with \mathbf{m}_{i}=\mathbf{b}_{i}+\mathbf{B}_{i}\mathbf{F}^{T}Q^{-1}_{i}(y_{i}-f_{i}) and \mathbf{C}_{i}=\mathbf{B}_{i}-\mathbf{B}_{i}\mathbf{F}^{T}Q^{-1}_{i}\mathbf{F}\mathbf{B}_{i}.

2. (RTS Smoother.) Denote \bm{\theta}_{i+1}|\mathbf{y}_{1:N}\sim\mathcal{MN}(\mathbf{s}_{i+1},\mathbf{S}_{i+1}); then recursively for i=N-1,\ldots,1,

\bm{\theta}_{i}|\mathbf{y}_{1:N}\sim\mathcal{MN}(\mathbf{s}_{i},\mathbf{S}_{i}), (15)

where \mathbf{s}_{i}=\mathbf{m}_{i}+\mathbf{C}_{i}\mathbf{G}^{T}_{i+1}\mathbf{B}^{-1}_{i+1}(\mathbf{s}_{i+1}-\mathbf{b}_{i+1}) and \mathbf{S}_{i}=\mathbf{C}_{i}-\mathbf{C}_{i}\mathbf{G}^{T}_{i+1}\mathbf{B}^{-1}_{i+1}(\mathbf{B}_{i+1}-\mathbf{S}_{i+1})\mathbf{B}^{-1}_{i+1}\mathbf{G}_{i+1}\mathbf{C}_{i}.

Lemma 2 (Likelihood and predictive distribution).

1. (Likelihood.) The likelihood follows

p(\mathbf{y}_{1:N}\mid\sigma^{2},\sigma_{0}^{2},\gamma)=\left\{\prod^{N}_{i=1}(2\pi Q_{i})^{-\frac{1}{2}}\right\}\exp\left\{-\sum^{N}_{i=1}\frac{(y_{i}-f_{i})^{2}}{2Q_{i}}\right\},

where f_{i} and Q_{i} are given in the Kalman filter. The likelihood can be used to obtain the MLE of the parameters (\sigma^{2},\sigma_{0}^{2},\gamma).

2. (Predictive distribution.)

  • (i)

    By the last step of the Kalman filter, one has \bm{\theta}_{N}|\mathbf{y}_{1:N}, and recursively by the RTS smoother, the predictive distribution of \bm{\theta}_{i} for i=N-1,\ldots,1 follows

    \bm{\theta}_{i}|\mathbf{y}_{1:N}\sim\mathcal{MN}(\mathbf{s}_{i},\mathbf{S}_{i}). (16)

  • (ii)

    For any x^{*} (without loss of generality, let x_{i}<x^{*}<x_{i+1}),

    \bm{\theta}(x^{*})\mid\mathbf{y}_{1:N}\sim\mathcal{MN}\left(\hat{\bm{\theta}}(x^{*}),\hat{\bm{\Sigma}}(x^{*})\right),

    where

    \hat{\bm{\theta}}(x^{*})=\mathbf{G}^{*}_{i}\mathbf{s}_{i}+\mathbf{W}^{*}_{i}(\mathbf{G}^{*}_{i+1})^{T}(\tilde{\mathbf{W}}^{*}_{i+1})^{-1}(\mathbf{s}_{i+1}-\mathbf{G}^{*}_{i+1}\mathbf{G}^{*}_{i}\mathbf{s}_{i}),
    \hat{\bm{\Sigma}}(x^{*})=((\mathbf{W}^{*}_{i})^{-1}+(\mathbf{G}^{*}_{i+1})^{T}(\mathbf{W}^{*}_{i+1})^{-1}\mathbf{G}^{*}_{i+1})^{-1},

    with the terms denoted with ‘*’ given in Appendix 5.1.

Although we use the Matérn kernel with \nu=5/2 as an example, the likelihood and predictive distribution of GPs with a Matérn kernel having a small half-integer roughness parameter can be computed efficiently, for both equally spaced and unequally spaced 1D inputs. For a Matérn kernel with a very large roughness parameter, the dimension of the latent states becomes large, which prohibits efficient computation. In practice, the Matérn kernel with a moderately large roughness parameter (e.g. \nu=5/2) is found to be accurate for estimating a smooth latent function in computer experiments [20, 2]. For this reason, the Matérn kernel with \nu=5/2 is the default choice of kernel function in some packages for GP emulation [47, 18].
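As an illustration of how the latent states are marginalized out step by step, the sketch below (assumed Python code written directly from part 1 of Lemma 1 and part 1 of Lemma 2) runs the Kalman filter over precomputed transition matrices G_{i} and W_{i} and accumulates the log-likelihood in \mathcal{O}(N) operations; the stationary covariance W(x_{1}) is taken as given, and the matrices could be obtained, for instance, from the transition_matrices sketch in Section 3.1.

    import numpy as np

    def kalman_log_likelihood(y, G_list, W_list, W1, sigma2_0):
        """Log-likelihood of the dynamic linear model (11) with F = (1, 0, 0)."""
        F = np.array([1.0, 0.0, 0.0])
        m, C = np.zeros(3), W1.copy()       # moments of theta(x_1) before seeing y_1
        log_lik, N = 0.0, len(y)
        # first observation: theta(x_1) ~ MN(0, W1)
        f, Q = F @ m, F @ C @ F + sigma2_0
        log_lik += -0.5 * (np.log(2.0 * np.pi * Q) + (y[0] - f) ** 2 / Q)
        m = m + C @ F * (y[0] - f) / Q
        C = C - np.outer(C @ F, F @ C) / Q
        for i in range(1, N):
            G, W = G_list[i - 1], W_list[i - 1]
            b, B = G @ m, G @ C @ G.T + W           # one-step-ahead moments of theta_i
            f, Q = F @ b, F @ B @ F + sigma2_0      # one-step-ahead moments of y_i
            log_lik += -0.5 * (np.log(2.0 * np.pi * Q) + (y[i] - f) ** 2 / Q)
            m = b + B @ F * (y[i] - f) / Q          # filtering mean
            C = B - np.outer(B @ F, F @ B) / Q      # filtering covariance
        return log_lik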

For a model containing latent variables, one may proceed with two usual approaches:

  • (i)

    sampling the latent variables \bm{\theta}({x_{i}}) from the posterior distribution by the MCMC algorithm,

  • (ii)

    optimizing the latent variables \bm{\theta}({x_{i}}) to minimize a loss function.

For approach (i), the MCMC algorithm is usually much slower than the Kalman filter, as the number of latent states is large, requiring a large number of posterior samples [17]. On the other hand, the prior correlation between states may not be taken into account directly in approach (ii), making the estimation less efficient than the Kalman filter if the data contain correlation across latent states. In comparison, the latent states in the dynamic linear model in (11) are iteratively marginalized out in the Kalman filter, and a closed-form expression is derived in each step, which only takes \mathcal{O}(N) operations and storage cost, with N being the number of observations.

In practice, when a sensible probability model or prior of the latent variables is available, the principle is to integrate out the latent variables when making predictions. Posterior samples and optimization algorithms, on the other hand, can be very useful for approximating the marginal likelihood when closed-form expressions are not available. As an example, in Section 4 we introduce an application that combines a sparse covariance structure with conjugate gradient optimization to estimate particle interaction kernels and forecast particle trajectories, integrating both marginalization and optimization to tackle a computationally challenging scenario.

Figure 2: Comparison between the direct computation of the GP and the computation by the Kalman filter (KF) and RTS smoother, both with the Matérn kernel in (9). The left panel shows the computational cost of computing the predictive mean for different numbers of observations. When N=5000, the direct computation takes around 20 seconds, whereas the computation by the KF and RTS smoother takes 0.029 seconds. The predictive mean computed in both ways is plotted in the right panel for N=1{,}000, where the root mean squared difference between the two approaches is 5.98\times 10^{-12}.

In Figure 2, we compare the cost of computing the predictive mean for a nonlinear function with 1D inputs [16]. The input is uniformly sampled from [0.5,0.25], and independent Gaussian white noise with a standard deviation of 0.1 is added when simulating the observations. We compare two ways of computing the predictive mean. The first approach implements direct computation of the predictive mean by Equation (5). The second approach computes the likelihood function and predictive distribution from Lemma 2 based on the Kalman filter and RTS smoother. The range and nugget parameters are fixed to 0.5 and 10^{-4}, respectively, for demonstration purposes. The computational time of this simulated experiment is shown in the left panel of Figure 2. The approach based on the Kalman filter and RTS smoother is much faster, as computing the likelihood and making predictions by the Kalman filter and RTS smoother only require \mathcal{O}(N) operations, whereas the direct computation costs \mathcal{O}(N^{3}) operations. The right panel gives the predictive mean, latent truth, and observations when N=1000. The difference between the two approaches is very small, as both methods are exact.

3.3 Marginalization of correlated matrix observations with multi-dimensional inputs

The Kalman filter is widely applied in signal processing, system control, and modeling time series. Here we introduce a few recent studies that apply Kalman filter to GP models with Matérn covariance to model spatial, spatio-temporal, and functional observations.

Let \mathbf{y}(\mathbf{x})=(y_{1}(\mathbf{x}),\ldots,y_{n_{1}}(\mathbf{x}))^{T} be an n_{1}-dimensional real-valued output vector at a p-dimensional input vector \mathbf{x}. For simplicity, assume the mean is zero. Consider the latent factor model:

\mathbf{y}(\mathbf{x})=\mathbf{A}\mathbf{z}(\mathbf{x})+\bm{\epsilon}(\mathbf{x}), (17)

where \mathbf{A}=[\mathbf{a}_{1},\ldots,\mathbf{a}_{d}] is an n_{1}\times d factor loading matrix and \mathbf{z}(\mathbf{x})=(z_{1}(\mathbf{x}),\ldots,z_{d}(\mathbf{x}))^{T} is a d-dimensional vector of factor processes, with d\leq n_{1}. The noise process follows \bm{\epsilon}(\mathbf{x})\sim\mathcal{N}(\mathbf{0},\,\sigma^{2}_{0}\mathbf{I}_{n_{1}}). Each factor is modeled by a zero-mean Gaussian process (GP), meaning that \mathbf{Z}_{l}=(z_{l}(\mathbf{x}_{1}),\ldots,z_{l}(\mathbf{x}_{n_{2}})) follows a multivariate normal distribution, \mathbf{Z}^{T}_{l}\sim\mathcal{MN}(\mathbf{0},\bm{\Sigma}_{l}), where the (i,\,j) entry of \bm{\Sigma}_{l} is parameterized by a covariance function \sigma^{2}_{l}K_{l}(\mathbf{x}_{i},\mathbf{x}_{j}) for l=1,\ldots,d. The model (17) is often known as the semiparametric latent factor model in the machine learning community [53], and it belongs to the class of linear models of coregionalization [3]. It has a wide range of applications in modeling multivariate spatially correlated data and functional observations from computer experiments [14, 25, 40].

We have the following two assumptions for model (17).

Assumption 1.

The priors of the latent processes z_{i}(\cdot) and z_{j}(\cdot) are independent, for any i\neq j.

Assumption 2.

The factor loadings are orthogonal, i.e. \mathbf{A}^{T}\mathbf{A}=\mathbf{I}_{d}.

The first assumption is typically made for modeling multivariate spatially correlated data or computer experiments [3, 25]. Second, note that the model in (17) is unchanged if we replace (\mathbf{A},\mathbf{z}(\mathbf{x})) by (\mathbf{A}\mathbf{E},\mathbf{E}^{-1}\mathbf{z}(\mathbf{x})) for an invertible matrix \mathbf{E}, meaning that only the linear subspace spanned by \mathbf{A} can be identified if no further constraint is imposed. Furthermore, as the variance of each latent process \sigma^{2}_{i} is estimated from the data, imposing the unit-norm constraint on each column of \mathbf{A} reduces identifiability issues. The second assumption was also made in other recent studies [31, 30].

Given Assumptions 1 and 2, we review recent results that alleviate the computational cost. Let us first assume the observations form an n_{1}\times n_{2} matrix \mathbf{Y}=[\mathbf{y}(\mathbf{x}_{1}),\ldots,\mathbf{y}(\mathbf{x}_{n_{2}})], with N=n_{1}\times n_{2} entries.

Lemma 3 (Posterior independence and orthogonal projection [17]).

For model (17) with Assumption 1 and Assumption 2, we have two properties below.

1. (Posterior Independence.) For any l\neq m,

\mbox{Cov}[\mathbf{Z}_{l}^{T},\mathbf{Z}_{m}^{T}|\mathbf{Y}]=\mathbf{0},

and for each l=1,\ldots,d,

\mathbf{Z}_{l}^{T}|\mathbf{Y}\sim\mathcal{MN}(\bm{\mu}_{Z_{l}},\bm{\Sigma}_{Z_{l}}),

where \bm{\mu}_{Z_{l}}=\bm{\Sigma}_{l}\bm{\tilde{\Sigma}}^{-1}_{l}\mathbf{\tilde{y}}_{l}, \mathbf{\tilde{y}}_{l}=\mathbf{Y}^{T}\mathbf{a}_{l}, and \bm{\Sigma}_{Z_{l}}=\bm{\Sigma}_{l}-\bm{\Sigma}_{l}\bm{\tilde{\Sigma}}_{l}^{-1}\bm{\Sigma}_{l} with \bm{\tilde{\Sigma}}_{l}=\bm{\Sigma}_{l}+\sigma_{0}^{2}\mathbf{I}_{n_{2}}.

2. (Orthogonal projection.) After integrating out \mathbf{z}(\cdot), the marginal likelihood is a product of multivariate normal densities at the projected observations:

p(\mathbf{Y})=\prod^{d}_{l=1}\mathcal{PN}(\tilde{\mathbf{y}}_{l};\mathbf{0},\bm{\tilde{\Sigma}}_{l})\prod^{n_{1}}_{l=d+1}\mathcal{PN}(\tilde{\mathbf{y}}_{c,l};\mathbf{0},\sigma^{2}_{0}\mathbf{I}_{n_{2}}), (18)

where \tilde{\mathbf{y}}_{c,l}=\mathbf{Y}^{T}\mathbf{a}_{c,l}, with \mathbf{a}_{c,l} being the lth column of \mathbf{A}_{c}, the orthogonal complement of \mathbf{A}, and \mathcal{PN} denotes the density of a multivariate normal distribution.

The properties in Lemma 3 lead to computationally scalable expressions of the maximum marginal likelihood estimator (MMLE) of the factor loadings.

Theorem 1 (Generalized probabilistic principal component analysis [19]).

Assume \mathbf{A}^{T}\mathbf{A}=\mathbf{I}_{d}. After marginalizing out \mathbf{Z}^{T}_{l}\sim\mathcal{MN}(\mathbf{0},\bm{\Sigma}_{l}) for l=1,2,\ldots,d, we have the results below.

  • If \bm{\Sigma}_{1}=\ldots=\bm{\Sigma}_{d}=\bm{\Sigma}, the marginal likelihood is maximized when

    \hat{\mathbf{A}}=\mathbf{U}\mathbf{S}, (19)

    where \mathbf{U} is an n_{1}\times d matrix of the first d principal eigenvectors of \mathbf{G}=\mathbf{Y}(\sigma^{2}_{0}\bm{\Sigma}^{-1}+\mathbf{I}_{n_{2}})^{-1}\mathbf{Y}^{T} and \mathbf{S} is a d\times d orthogonal rotation matrix;

  • If the covariances of the factor processes are different, denoting \mathbf{G}_{l}=\mathbf{Y}(\sigma^{2}_{0}\bm{\Sigma}^{-1}_{l}+\mathbf{I}_{n_{2}})^{-1}\mathbf{Y}^{T}, the MMLE of the factor loadings is

    \mathbf{\hat{A}}=\operatorname*{argmax}_{\mathbf{A}}\sum^{d}_{l=1}{\mathbf{a}^{T}_{l}\mathbf{G}_{l}\mathbf{a}_{l}},\quad\text{s.t.}\quad\mathbf{A}^{T}\mathbf{A}=\mathbf{I}_{d}. (20)

The estimator \hat{\mathbf{A}} in Theorem 1 is called the generalized probabilistic principal component analysis (GPPCA) estimator. An optimization algorithm that preserves the orthogonal constraint in (20) is available in [60].

In [58], the latent factors are assumed to follow independent standard normal distributions, and the authors derived the MMLE of the factor loading matrix \mathbf{A}, which was termed probabilistic principal component analysis (PPCA). GPPCA extends PPCA to correlated latent factors modeled by GPs, which incorporates the prior correlation information between outputs as a function of the inputs, and the latent factor processes are marginalized out when estimating the factor loading matrix and other parameters. When the input is 1D and the Matérn covariance is used for modeling the latent factors, the computational order of GPPCA is \mathcal{O}(Nd), the same as that of PCA. For correlated data, such as spatio-temporal observations and multivariate functional data, GPPCA provides a flexible and scalable approach to estimating the factor loadings by marginalizing out the latent factors [19].
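To illustrate the first case of Theorem 1, the sketch below (assumed Python code) forms \mathbf{G}=\mathbf{Y}(\sigma^{2}_{0}\bm{\Sigma}^{-1}+\mathbf{I}_{n_{2}})^{-1}\mathbf{Y}^{T} and returns its first d principal eigenvectors, which give the MMLE of the factor loading matrix up to an orthogonal rotation \mathbf{S}.

    import numpy as np

    def gppca_loadings_shared(Y, Sigma, sigma2_0, d):
        """MMLE of A in Theorem 1 when Sigma_1 = ... = Sigma_d = Sigma (up to a rotation S)."""
        n2 = Sigma.shape[0]
        # (sigma0^2 Sigma^{-1} + I)^{-1} Y^T through a single linear solve
        middle = np.linalg.solve(sigma2_0 * np.linalg.inv(Sigma) + np.eye(n2), Y.T)
        G = Y @ middle
        eigval, eigvec = np.linalg.eigh(G)           # eigenvalues in ascending order
        return eigvec[:, ::-1][:, :d]                # first d principal eigenvectors

    # Y is the n1 x n2 observation matrix; Sigma is the shared n2 x n2 factor covariance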

Spatial and spatio-temporal models with a separable covariance can be written as a special case of the model in (17). For instance, suppose d=n_{1} and the n_{1}\times n_{2} latent factor matrix follows \mathbf{Z}\sim\mathcal{MN}(\mathbf{0},\,\sigma^{2}\mathbf{R}_{1}\otimes\mathbf{R}_{2}), where \mathbf{R}_{1} and \mathbf{R}_{2} are n_{1}\times n_{1} and n_{2}\times n_{2} subcovariances, respectively. Denote the eigendecomposition \mathbf{R}_{1}=\mathbf{V}_{1}\bm{\Lambda}_{1}\mathbf{V}_{1}^{T}, with \bm{\Lambda}_{1} being a diagonal matrix with eigenvalues \lambda_{i}, for i=1,\ldots,n_{1}. Then this separable model can be written as the model in (17), with \mathbf{A}=\mathbf{V}_{1} and \bm{\Sigma}_{l}=\sigma^{2}\lambda_{l}\mathbf{R}_{2}. This connection suggests that the latent factor loading matrix can be specified as the eigenvector matrix of a covariance matrix, parameterized by a kernel function. This approach is studied in [17] for modeling incomplete lattices with irregular missing patterns, and the Kalman filter is integrated for accelerating computation on massive spatial and spatio-temporal observations.

4 Scalable marginalization for learning particle interaction kernels from trajectory data

Collective motions with particle interactions are very common in both microscopic and macroscopic systems [35, 36]. Learning the interaction kernels between particles is important for two purposes. First, physical laws are less understood for many complex systems, such as cell migration or non-equilibrium thermodynamic processes. Estimating the interaction kernels between particles from field or experimental data is essential for learning these processes. Second, simulation of particle interactions, such as ab initio molecular dynamics simulation, can be very computationally expensive. Statistical learning approaches can be used to reduce the computational cost of simulations.

For demonstration purposes, we consider a simple first-order system. In [34], for a system with n interacting particles, the velocity of the ith particle at time t, \mathbf{v}_{i}(t)=d\mathbf{x}_{i}(t)/dt, is modeled through the positions of all particles,

\mathbf{v}_{i}(t)=\sum^{n}_{j=1}\phi(||\mathbf{x}_{j}(t)-\mathbf{x}_{i}(t)||)\mathbf{u}_{i,j}(t), (21)

where \phi(\cdot) is a latent interaction kernel function between particle i and all other particles, with ||\cdot|| being the Euclidean distance, and \mathbf{u}_{i,j}(t)=\mathbf{x}_{j}(t)-\mathbf{x}_{i}(t) is the vector of differences between the positions of particles j and i, for i,j=1,\ldots,n. Here \phi(\cdot) is a two-body interaction function. In molecular dynamics simulation, for instance, the Lennard-Jones potential can be related to molecular forces and the acceleration of molecules in a second-order system. The statistical learning approach can be extended to a second-order system that involves acceleration and external force terms, and the first-order system in (21) can be considered as an approximation of the second-order system. Furthermore, the interaction between particles is global, as any particle is affected by all other particles. Learning global interactions is more computationally challenging than learning local interactions [51], and approximating the global interaction by local interactions is of interest in future studies.
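To make the data-generating mechanism in (21) explicit, the sketch below (assumed Python code; the forward Euler step of size dt is a simple stand-in integrator, not necessarily the scheme used in the cited simulations) computes the velocities of n particles in D dimensions for a given vectorized interaction kernel phi and advances their positions over L time steps.

    import numpy as np

    def velocities(X, phi):
        """Velocities of model (21): v_i = sum_j phi(||x_j - x_i||) (x_j - x_i)."""
        n = X.shape[0]
        V = np.zeros_like(X)
        for i in range(n):
            U = X - X[i]                              # differences x_j - x_i, shape (n, D)
            d = np.linalg.norm(U, axis=1)
            mask = d > 0                              # particles at zero distance do not contribute
            V[i] = (phi(d[mask])[:, None] * U[mask]).sum(axis=0)
        return V

    def simulate(X0, phi, dt, L):
        """Forward Euler trajectories over L time steps from initial positions X0 (n x D)."""
        X, traj = X0.copy(), [X0.copy()]
        for _ in range(L):
            X = X + dt * velocities(X, phi)
            traj.append(X.copy())
        return np.stack(traj)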

One important goal is to efficiently estimate the unobservable interaction function from particle trajectory data, without specifying a parametric form. This is key for estimating the behavior of dynamic systems in experiments and observational studies, as the physical laws governing a new system may be unknown. In [13], \phi(\cdot) is modeled by a zero-mean Gaussian process with a Matérn covariance:

\phi(\cdot)\sim\mathcal{GP}(0,\sigma^{2}K(\cdot,\cdot)). (22)

Estimating the interactions of large-scale systems or using more simulation runs, however, is prohibitive, as direct inversion of the covariance matrix of the velocity observations requires \mathcal{O}((nMDL)^{3}) operations, where M is the number of simulations or experiments, n is the number of particles, D is the dimension of each particle, and L denotes the number of time points in each trial. Furthermore, constructing such a covariance involves computing an n^{2}LM\times n^{2}LM matrix of \phi for a D-dimensional input space, which takes \mathcal{O}(n^{4}L^{2}M^{2}D) operations. Thus, directly estimating the interaction kernel with the GP model in (22) is only applicable to systems with a small number of observations [34, 13].

This work contributes to the estimation of dynamic systems of interacting particles in two different ways. We first show that the covariance of the velocity observations can be obtained by operations on a few sparse matrices, after marginalizing out the latent interaction function. The sparsity of the inverse covariance of the latent interaction kernel allows us to modify the Kalman filter to efficiently compute the matrix products in this problem, and then apply a conjugate gradient (CG) algorithm [24, 21, 49] to solve the system iteratively. The computational complexity of computing the predictive mean and variance at a test point is of the order \mathcal{O}(TnN)+\mathcal{O}(nN\log(nN)) for a system of n particles and N=nMDL observations, where T is the number of iterations required in the CG algorithm. We found that typically a few hundred CG iterations can achieve high predictive accuracy for a moderately large number of observations. The algorithm leads to a substantial reduction of the computational cost compared to direct computation.

Second, we study the effect of experimental designs on estimating the interaction kernel function. In previous studies, it was unclear how the initial positions, the time steps of the trajectories, and the number of particles affect the accuracy of estimating interaction kernels. Compared to conventional problems in computer model emulation, where a “space-filling” design is often used, here we cannot directly observe realizations of the latent function. Instead, the output velocity is a weighted average of the interaction kernel evaluated at the distances between particles. Moreover, we cannot directly control the distances between particles as they move away from their initial positions, in both simulation and experimental studies. When the distance between two particles i and j is small, the contribution \phi(||\mathbf{x}_{i}(t)-\mathbf{x}_{j}(t)||)\mathbf{u}_{i,j}(t) can be small if the repulsive force given by \phi(\cdot) does not increase as fast as the distance decreases. Thus we found that the estimation of the interaction kernel can be less accurate in the input domain of small distances. This problem can be alleviated by placing the initial positions of more particles close to each other, providing more small-distance pairs that improve the estimation accuracy.

4.1 Scalable computation based on sparse representation of inverse covariance

For illustration purposes, let us first consider a simple scenario where we have M=1 simulation and L=1 time point of a system of n interacting particles in a D-dimensional space. Since we only have one time point, we omit the notation t when there is no confusion. The algorithm for the general scenario with L>1 and M>1 is discussed in Appendix 5.2. In practice, experimental observations of velocity from multiple particle tracking algorithms or particle image velocimetry typically contain noise [1]. Even for simulation data, the numerical error can be non-negligible for large and complex systems. In these scenarios, the observed velocity \mathbf{\tilde{v}}_{i}=(\tilde{v}_{i,1},\ldots,\tilde{v}_{i,D})^{T} is a noisy realization, \mathbf{\tilde{v}}_{i}=\mathbf{v}_{i}+\bm{\epsilon}_{i}, where \bm{\epsilon}_{i}\sim\mathcal{MN}(\mathbf{0},\sigma^{2}_{0}\mathbf{I}_{D}) denotes a vector of Gaussian noise with variance \sigma^{2}_{0}.

Assume the nD observations of velocity are \mathbf{\tilde{v}}=(\tilde{v}_{1,1},\ldots,\tilde{v}_{n,1},\tilde{v}_{1,2},\ldots,\tilde{v}_{n,2},\ldots,\tilde{v}_{n-1,D},\tilde{v}_{n,D})^{T}. After integrating out the latent function modeled in Equation (22), the marginal distribution of the observations follows

(\mathbf{\tilde{v}}\mid\mathbf{R}_{\phi},\sigma^{2},\sigma^{2}_{0})\sim\mathcal{MN}\left(\mathbf{0},\sigma^{2}\mathbf{U}\mathbf{R}_{\phi}\mathbf{U}^{T}+\sigma^{2}_{0}\mathbf{I}_{nD}\right), (23)

where \mathbf{U} is an nD\times n^{2} block diagonal matrix, with the ith D\times n diagonal block being (\mathbf{u}_{i,1},\ldots,\mathbf{u}_{i,n}), and \mathbf{R}_{\phi} is an n^{2}\times n^{2} matrix, whose (i^{\prime},j^{\prime}) term in the (i,j)th n\times n block is K(d_{i,i^{\prime}},d_{j,j^{\prime}}), with d_{i,i^{\prime}}=||\mathbf{x}_{i}-\mathbf{x}_{i^{\prime}}|| and d_{j,j^{\prime}}=||\mathbf{x}_{j}-\mathbf{x}_{j^{\prime}}|| for i,i^{\prime},j,j^{\prime}=1,\ldots,n. Direct computation of the likelihood involves computing the inverse of an nD\times nD covariance matrix and constructing an n^{2}\times n^{2} matrix \mathbf{R}_{\phi}, which costs \mathcal{O}((nD)^{3})+\mathcal{O}(n^{4}D) operations. This is computationally expensive even for small systems.

Here we use an exponential kernel function, K(d)=\exp(-d/\gamma) with range parameter \gamma, for modeling any nonnegative distance input d for illustration; the method can be extended to Matérn kernels with half-integer roughness parameters. Denote the distance pairs d_{i,j}=||\mathbf{x}_{i}-\mathbf{x}_{j}||; there are (n-1)n/2 unique positive distance pairs. Denote the (n-1)n/2 distance pairs \mathbf{d}_{s}=(d_{s,1},\ldots,d_{s,(n-1)n/2})^{T} in increasing order, with the subscript s meaning ‘sorted’. Here we do not need to consider the case d_{i,j}=0, as then \mathbf{u}_{i,j}=\mathbf{0}, leading to zero contribution to the velocity; thus the model in (21) can be reduced to exclude interactions between particles at zero distance. In reality, two particles at exactly the same position are impractical, as there typically exists a repulsive force when two particles get very close. Hence, we can reduce the n^{2} distance pairs d_{i,j}, for i=1,\ldots,n and j=1,\ldots,n, to (n-1)n/2 unique positive terms d_{s,i} in increasing order, for i=1,\ldots,(n-1)n/2.

Denote by \mathbf{R}_{s} the (n-1)n/2\times(n-1)n/2 correlation matrix of the kernel outputs \bm{\phi}=(\phi(d_{s,1}),\ldots,\phi(d_{s,(n-1)n/2}))^{T}, and let \mathbf{U}_{s} be an nD\times(n-1)n/2 sparse matrix with n-1 nonzero terms in each row, where the nonzero entries for the ith particle correspond to the distance pairs in \mathbf{R}_{s}. Denote the nugget parameter \eta=\sigma^{2}_{0}/\sigma^{2}. After marginalizing out \bm{\phi}, the covariance of the velocity observations follows

(\mathbf{\tilde{v}}\mid\gamma,\sigma^{2},\eta)\sim\mathcal{MN}\left(\mathbf{0},\sigma^{2}\mathbf{\tilde{R}}_{v}\right), (24)

with

\mathbf{\tilde{R}}_{v}=\mathbf{U}_{s}\mathbf{R}_{s}\mathbf{U}^{T}_{s}+\eta\mathbf{I}_{nD}. (25)

The conditional distribution of the interaction kernel \phi(d^{*}) at any distance d^{*} follows

(\phi(d^{*})\mid\mathbf{\tilde{v}},\gamma,\sigma^{2},\eta)\sim\mathcal{N}(\hat{\phi}(d^{*}),\sigma^{2}K^{*}), (26)

where the predictive mean and variance follow

\hat{\phi}(d^{*})=\mathbf{r}^{T}(d^{*})\mathbf{U}^{T}_{s}\mathbf{\tilde{R}}_{v}^{-1}\mathbf{\tilde{v}}, (27)
\sigma^{2}K^{*}=\sigma^{2}\left(K(d^{*},d^{*})-\mathbf{r}^{T}(d^{*})\mathbf{U}^{T}_{s}\mathbf{\tilde{R}}_{v}^{-1}\mathbf{U}_{s}\mathbf{r}(d^{*})\right), (28)

with \mathbf{r}(d^{*})=(K(d^{*},d_{s,1}),\ldots,K(d^{*},d_{s,n(n-1)/2}))^{T}. After obtaining the estimated interaction kernel, one can use it to forecast the trajectories of particles and to understand the physical mechanism of flocking behaviors.
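For a small system, the predictive equations (27)-(28) can be evaluated directly by dense linear algebra, which is useful for checking the scalable algorithm introduced next. The sketch below (assumed Python code; the particle-major ordering of the flattened velocity vector and the construction of U_s from the sorted distance pairs are our illustrative choices) computes \hat{\phi}(d^{*}) and the corresponding relative predictive variance with the exponential kernel K(d)=\exp(-d/\gamma).

    import numpy as np

    def exp_kernel(d1, d2, gamma):
        """Exponential correlation between two sets of distance inputs."""
        return np.exp(-np.abs(np.subtract.outer(d1, d2)) / gamma)

    def build_distance_pairs(X):
        """Sorted distance pairs d_s and the matrix U_s (built dense here for clarity)."""
        n, D = X.shape
        pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
        d = np.array([np.linalg.norm(X[j] - X[i]) for i, j in pairs])
        order = np.argsort(d)
        d_s = d[order]
        U_s = np.zeros((n * D, len(pairs)))
        for col, idx in enumerate(order):
            i, j = pairs[idx]
            u = X[j] - X[i]                       # u_{i,j} = x_j - x_i
            U_s[i * D:(i + 1) * D, col] = u       # contribution to particle i
            U_s[j * D:(j + 1) * D, col] = -u      # u_{j,i} = -u_{i,j} for particle j
        return d_s, U_s

    def predict_phi(d_star, X, v_tilde, gamma, eta):
        """Predictive mean and relative variance of phi(d*) by Eqs. (27)-(28), dense version.
        v_tilde is the noisy velocity vector flattened particle by particle (length n*D)."""
        d_s, U_s = build_distance_pairs(X)
        R_s = exp_kernel(d_s, d_s, gamma)
        R_v = U_s @ R_s @ U_s.T + eta * np.eye(U_s.shape[0])
        r = exp_kernel(np.atleast_1d(d_star), d_s, gamma).ravel()
        mean = r @ (U_s.T @ np.linalg.solve(R_v, v_tilde))
        var = 1.0 - r @ U_s.T @ np.linalg.solve(R_v, U_s @ r)   # K(d*, d*) = 1; scale by sigma^2
        return mean, var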

Our primary task is to efficiently compute the predictive distribution of the interaction kernel in (26), where the most computationally expensive terms in the predictive mean and variance are \mathbf{\tilde{R}}_{v}^{-1}\mathbf{\tilde{v}} and \mathbf{\tilde{R}}_{v}^{-1}\mathbf{U}_{s}\mathbf{r}(d^{*}). Note that \mathbf{U}_{s} is a sparse matrix with n(n-1)D nonzero terms and that the inverse covariance matrix \mathbf{R}^{-1}_{s} is a tri-diagonal matrix. However, directly applying the CG algorithm is still computationally challenging, as neither \mathbf{\tilde{R}}_{v} nor \mathbf{\tilde{R}}_{v}^{-1} is sparse. To solve this problem, we extend a step in the Kalman filter to efficiently compute the matrix-vector multiplication, using the sparsity induced by the choice of covariance matrix. The CG iterations in the new algorithm only cost \mathcal{O}(nDT) operations for a system of n particles in D dimensions, with T CG iteration steps. For most systems we explored, we found that a few hundred CG iterations achieve high accuracy. The substantial reduction of the computational cost allows us to use more observations to improve the predictive accuracy. We term this approach the sparse conjugate gradient algorithm for Gaussian processes (sparse CG-GP). The algorithm for the scenario with M simulations, each containing L time frames of n particles in a D-dimensional space, is discussed in Appendix 5.2.
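A minimal sketch of the sparse CG-GP idea is given below (assumed Python/SciPy code, not the authors' implementation; the matrix-vector product applies R_s through a tridiagonal solve with R_s^{-1} rather than through the modified Kalman filter step described above, and the general case with M>1 and L>1 of Appendix 5.2 is not covered). The inputs U_s and d_s can be taken from the dense sketch above, and d_s is assumed to have strictly increasing entries.

    import numpy as np
    from scipy.sparse import csc_matrix, diags
    from scipy.sparse.linalg import cg, splu, LinearOperator

    def exp_kernel_precision(d_s, gamma):
        """Tridiagonal inverse correlation matrix R_s^{-1} of the exponential kernel
        at the sorted, strictly increasing distances d_s (a Markov property of this kernel)."""
        a = np.exp(-np.diff(d_s) / gamma)       # correlations of consecutive sorted inputs
        c = 1.0 / (1.0 - a**2)
        main = np.ones(len(d_s))
        main[:-1] += c - 1.0                    # left endpoint of each gap
        main[1:] += c - 1.0                     # right endpoint of each gap
        return csc_matrix(diags([-a * c, main, -a * c], offsets=[-1, 0, 1]))

    def solve_Rv_cg(U_s, d_s, v_tilde, gamma, eta, maxiter=500):
        """Solve (U_s R_s U_s^T + eta I) x = v_tilde by conjugate gradient; each
        matrix-vector product uses only the sparse U_s and a tridiagonal solve for R_s."""
        U = csc_matrix(U_s)
        lu = splu(exp_kernel_precision(d_s, gamma))   # factorization of the tridiagonal R_s^{-1}
        def matvec(x):
            return U @ lu.solve(U.T @ x) + eta * x    # (U_s R_s U_s^T + eta I) x
        A = LinearOperator((len(v_tilde), len(v_tilde)), matvec=matvec)
        x, info = cg(A, v_tilde, maxiter=maxiter)
        return x

In this sketch, the cost of each CG iteration is dominated by the nonzero entries of U_s plus a tridiagonal solve of length n(n-1)/2, so the full covariance matrix is never formed.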

Figure 3: Estimation of the particle interaction kernel given by the truncated Lennard-Jones potential in [34] with D=2. The left panel shows the true interaction kernel (black curve) and the estimated kernel functions (colored curves), based on three different ways of initializing the positions of n=200 particles. The computational time of the full GP approach in [13] and of the sparse CG-GP approach is given in the right panel. Here the most computationally intensive part of the full GP model is constructing the correlation matrix \mathbf{R}_{s} of \bm{\phi} for the n(n-1)/2 distance pairs in Equation (25), which scales as \mathcal{O}(n^{4}D).

The comparison of the computational cost between the full GP model and the proposed sparse CG-GP method is shown in the right panel of Figure 3. The most computationally expensive part of the full GP model is constructing the n(n-1)/2\times n(n-1)/2 correlation matrix \mathbf{R}_{s} of \bm{\phi} for the n(n-1)/2 distance pairs. The sparse CG-GP algorithm is much faster, as we do not need to construct this covariance matrix; instead, we only need to efficiently compute matrix multiplications by utilizing the sparse structure of the inverse of \mathbf{R}_{s} (Appendix 5.2). Note that the GP model with an exponential covariance naturally induces a sparse inverse covariance matrix that can be used for faster computation, which is different from imposing a sparse covariance structure as an approximation.

In the left panel of Figure 3, we show the predictive mean and uncertainty assessment by the sparse CG-GP method for three different designs of the initial positions of the particles. From the first to the third design, the initial value of each coordinate of a particle is sampled independently from a uniform distribution \mathcal{U}[a_{1},b_{1}], a normal distribution \mathcal{N}(a_{2},b_{2}), and a log-uniform (reciprocal) distribution \mathcal{LU}[\log(a_{3}),\log(b_{3})], respectively.

For experiments with the interaction kernel being the truncated Lennard-Jones potential given in Appendix 5.2, we use a_{1}=0, b_{1}=5, a_{2}=0, b_{2}=5, a_{3}=10^{-3}, and b_{3}=5 for the three designs of initial positions. Compared with the first design, the second design of initial positions, which was assumed in [34], has a larger probability mass near 0. In the third design, the distribution of the distances between particle pairs is monotonically decreasing, with more probability mass near 0 than in the first two designs. In all cases shown in Figure 3, we assume M=1 and L=1, and the noise standard deviation is set to \sigma_{0}=10^{-3} in the simulation. For demonstration purposes, the range and nugget parameters are fixed to \gamma=5 and \eta=10^{-5}, respectively, when computing the predictive distribution of \phi. The estimation of the interaction kernel at large distances is accurate for all designs, whereas the estimation at small distances is not satisfactory for the first two designs. When the particles are initialized from the third design (log-uniform), the accuracy is better, as there are more particles near each other, providing more information about the interaction at small distances. This result is intuitive, as small-distance pairs have relatively small contributions to the velocity based on Equation (21), and we need more particles close to each other to estimate the interaction kernel function at small distances.

The numerical comparison between different designs allows us to better understand the learning efficiency in different scenarios, which can be used to design experiments. Because of the large improvement of computational scalability compared to previous studies [13, 34], we can accurately estimate interaction kernels based on more particles and longer trajectories.

4.2 Numerical results

Figure 4: Estimation of the interaction function based on the sparse CG-GP method, and trajectory forecasts for the truncated LJ and OD simulations. In panel (a), the colored curves are the estimated interactions for different initial positions and particle sizes, all based on trajectories using only L=1 time step, whereas the black curve is the truncated LJ kernel used in the simulation. The colored curves in panel (b) are the same as those in panel (a), but based on trajectories of L=10 time steps. Panels (d) and (e) are the same as panels (a) and (b), respectively, but the simulated data are generated by the OD interaction kernel. The shaded areas are the 95% predictive intervals. In panel (c), we graph the simulated trajectories of 10 out of 50 particles over L=200 time steps, and the trajectory forecast based on the estimated interaction function and initial positions. The arrows indicate the directions of the velocities of the particles at the last time step. Panel (f) is the same as panel (c), but for the OD interaction kernel.

Here we discuss two scenarios, where the interaction between particles follows the truncated Lennard-Jones (LJ) and the opinion dynamics (OD) kernel functions, respectively. The LJ potential is widely used in MD simulations of interacting molecules [43]. First-order systems of the form (21) have also been successfully applied to modeling opinion dynamics in social networks (see the survey [36] and references therein). The interaction function \phi models how the opinions of pairs of people influence each other. In our numerical example, we consider heterophilious opinion interactions: each agent is more influenced by neighbors slightly further away than by its closest neighbors. As time evolves, the opinions of the agents merge into clusters, with the number of clusters significantly smaller than the number of agents. It was shown in [36] that heterophilious dynamics enhance consensus, contradicting the intuition that the tendency to bond more with those who are different, rather than with those who are similar, would break connections and prevent clusters of consensus.

The details of the interaction functions are given in Appendix 5.3. For each interaction kernel, we test our method on 12 configurations formed by 2 particle sizes (n=50 and n=200), 2 trajectory lengths (L=1 and L=10), and 3 designs of initial positions (uniform, normal and log-uniform). The computational scalability of the sparse CG algorithm allows us to compute the predictions in most of these experimental settings within a few seconds. For each configuration, we repeat the experiment 10 times to average out the randomness in the initial positions of the particles. The normalized root mean squared error in predicting the interaction kernels, averaged over these 10 experiments for each configuration, is given in Appendix 5.4.

In Figure 4, we show the estimation of the interaction kernels and the forecasts of particle trajectories with different designs, particle sizes, and numbers of time points. The sparse CG-GP method is relatively accurate in almost all scenarios. Among the different initial positions, the estimation for the LJ interaction is the most accurate when the initial positions of the particles are sampled from the log-uniform distribution. This is because there are more small distances between particles when the initial positions follow a log-uniform distribution, providing more data to estimate the interaction kernel at small distances. Furthermore, when we have more particles or observations over more time steps, the estimation of the interaction kernel becomes more accurate for all designs in terms of the normalized root mean squared error, with the detailed comparison given in Appendix 5.4.

In panels (c) and (f) of Figure 4, we plot the trajectory forecasts of 10 particles over 200 time points for the truncated LJ kernel and the OD kernel, respectively. In both simulation scenarios, the interaction kernels are estimated based on trajectories of n=50 particles across L=20 time steps, with initial positions sampled from the log-uniform design. The trajectories of only 10 out of the 50 particles are shown for better visualization. For trajectories simulated by the truncated LJ kernel, some particles can move very close to each other, since the repulsive force between two particles becomes smaller as they approach (the force is proportional to the distance in Equation (21)), and the truncation of the kernel substantially reduces the repulsive force when particles move close. For the OD simulation, the particles move toward a cluster, as expected, since the particles always have attractive forces between each other. The forecast trajectories are close to the held-out truth, indicating the high accuracy of our approach.
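To make the forecasting step concrete, the sketch below (our own code, not the authors' implementation) integrates the particle system forward with an estimated kernel plugged in. It assumes the first-order system (21) takes the standard form \dot{\mathbf{x}}_i=\frac{1}{n}\sum_{j}\phi(||\mathbf{x}_j-\mathbf{x}_i||)(\mathbf{x}_j-\mathbf{x}_i), as in [34], and uses forward-Euler time stepping; the function name, the step size dt, and the number of steps are placeholders.

```python
import numpy as np

def forecast(x0, phi_hat, dt, n_steps):
    """x0: (n, D) initial positions; phi_hat: vectorized estimated interaction kernel."""
    traj = [x0.copy()]
    x = x0.copy()
    n = x.shape[0]
    for _ in range(n_steps):
        diff = x[None, :, :] - x[:, None, :]          # pairwise x_j - x_i, shape (n, n, D)
        dist = np.linalg.norm(diff, axis=-1)          # pairwise distances, shape (n, n)
        w = np.zeros_like(dist)
        mask = dist > 0
        w[mask] = phi_hat(dist[mask])                 # phi(||x_j - x_i||), zero for j = i
        v = (w[:, :, None] * diff).sum(axis=1) / n    # velocities, shape (n, D)
        x = x + dt * v                                # forward-Euler step
        traj.append(x.copy())
    return np.stack(traj)                             # (n_steps + 1, n, D)
```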

Compared with the results shown in previous studies [34, 13], both the estimation of the interaction kernels and the trajectory forecasts appear more accurate. The large computational reduction from the sparse CG-GP algorithm, shown in Figure 3, permits the use of longer trajectories from more particles to estimate the interaction kernel, which improves the predictive accuracy. Here each particle interacts with all other particles in our simulation, making the number of distance pairs large. Yet we are able to estimate the interaction kernel and forecast the trajectories of the particles within tens of seconds on a desktop for the most time-consuming scenario we considered. Since particles typically have very small or no interaction when the distances between them are large, a further approximation can be made by only modeling interactions between particles within a specified radius, to further reduce the computational cost.

5 Concluding remarks

We have introduced scalable marginalization of latent variables for correlated data. We first introduced GP models and reviewed the SDE representation of GPs with Matérn covariance and one-dimensional input. The Kalman filter and RTS smoother were introduced as a scalable marginalization technique to compute the likelihood function and the predictive distribution, which reduces the computational complexity of GPs with Matérn covariance and 1D input from \mathcal{O}(N^{3}) to \mathcal{O}(N) operations without approximation, where N is the number of observations. Recent efforts on extending scalable computation from 1D input to multi-dimensional input were discussed. In particular, we developed a new scalable algorithm for estimating the particle interaction kernel and forecasting the trajectories of particles. This is achieved through a sparse representation of the GP modeling the interaction kernel, together with efficient matrix multiplication obtained by modifying the Kalman filter algorithm, so that an iterative algorithm based on CG can be applied to reduce the computational complexity.

There is a wide range of future topics relevant to this study. First, various models of spatio-temporal data can be written as random factor models in (17) with latent factors modeled as Gaussian processes of temporal inputs. It is of interest to utilize the computational advantages of the dynamic linear models of factor processes, extending the computational tools by relaxing the independence between prior factor processes in Assumption 1 or by incorporating the Toeplitz covariance structure of stationary temporal processes. Second, for estimating systems of particle interactions, we can further reduce the computation by only considering interactions between particles within a radius. Third, a comprehensive study of experimental design, initialization, and parameter estimation will be helpful for estimating latent interaction functions, which can be unidentifiable or weakly identifiable in certain scenarios. Furthermore, velocity directions and angle order parameters are essential for understanding the mechanisms of active nematics and cell migration, which can motivate more complex models of interactions. Finally, the sparse CG algorithm developed in this work is of interest for reducing the computational complexity of GP models with multi-dimensional inputs and general designs.

Acknowledgement

The work is partially supported by the National Institutes of Health under Award No. R01DK130067. Gu and Liu acknowledge the partial support from National Science Foundation (NSF) under Award No. DMS-2053423. Fang acknowledges the support from the UCSB academic senate faculty research grants program. Tang is partially supported by Regents Junior Faculty fellowship, Faculty Early Career Acceleration grant, Hellman Family Faculty Fellowship sponsored by UCSB and the NSF under Award No. DMS-2111303. The authors thank the editor and two referees for their comments that substantially improved the article.

Appendix

5.1 Closed-form expressions of state space representation of GP having Matérn covariance with ν=5/2\nu=5/2

Denote λ=5γ\lambda=\frac{\sqrt{5}}{\gamma}, q=163σ2λ5q=\frac{16}{3}\sigma^{2}\lambda^{5} and di=|xixi1|d_{i}=|x_{i}-x_{i-1}|. For i=2,,Ni=2,...,N, 𝐆i\mathbf{G}_{i} and 𝐖i\mathbf{W}_{i} in (11) have the expressions below:

\mathbf{G}_{i}=\frac{e^{-\lambda d_{i}}}{2}\begin{pmatrix}\lambda^{2}d_{i}^{2}+2\lambda d_{i}+2&2(\lambda d_{i}^{2}+d_{i})&d_{i}^{2}\\ -\lambda^{3}d_{i}^{2}&-2(\lambda^{2}d_{i}^{2}-\lambda d_{i}-1)&2d_{i}-\lambda d_{i}^{2}\\ \lambda^{4}d_{i}^{2}-2\lambda^{3}d_{i}&2(\lambda^{3}d_{i}^{2}-3\lambda^{2}d_{i})&\lambda^{2}d_{i}^{2}-4\lambda d_{i}+2\end{pmatrix},
𝐖i=4σ2λ53(W1,1(xi)W1,2(xi)W1,3(xi)W2,1(xi)W2,2(xi)W2,3(xi)W3,1(xi)W3,2(xi)W3,3(xi)),\mathbf{W}_{i}=\frac{4\sigma^{2}\lambda^{5}}{3}\begin{pmatrix}W_{1,1}(x_{i})&W_{1,2}(x_{i})&W_{1,3}(x_{i})\\ W_{2,1}(x_{i})&W_{2,2}(x_{i})&W_{2,3}(x_{i})\\ W_{3,1}(x_{i})&W_{3,2}(x_{i})&W_{3,3}(x_{i})\end{pmatrix},

with

W1,1(xi)\displaystyle W_{1,1}(x_{i}) =e2λdi(3+6λdi+6λ2di2+4λ3di3+2λ4di4)34λ5,\displaystyle=\frac{e^{-2\lambda d_{i}}(3+6\lambda d_{i}+6\lambda^{2}d^{2}_{i}+4\lambda^{3}d^{3}_{i}+2\lambda^{4}d^{4}_{i})-3}{-4\lambda^{5}},
W1,2(xi)\displaystyle W_{1,2}(x_{i}) =W2,1(xi)=e2λdidi42,\displaystyle=W_{2,1}(x_{i})=\frac{e^{-2\lambda d_{i}}d_{i}^{4}}{2},
W1,3(xi)\displaystyle W_{1,3}(x_{i}) =W3,1(xi)=e2λdi(1+2λdi+2λ2di2+4λ3di32λ4di4)14λ3,\displaystyle=W_{3,1}(x_{i})=\frac{e^{-2\lambda d_{i}}(1+2\lambda d_{i}+2\lambda^{2}d^{2}_{i}+4\lambda^{3}d^{3}_{i}-2\lambda^{4}d^{4}_{i})-1}{4\lambda^{3}},
W2,2(xi)\displaystyle W_{2,2}(x_{i}) =e2λdi(1+2λdi+2λ2di24λ3di3+2λ4di4)14λ3,\displaystyle=\frac{e^{-2\lambda d_{i}}(1+2\lambda d_{i}+2\lambda^{2}d^{2}_{i}-4\lambda^{3}d^{3}_{i}+2\lambda^{4}d^{4}_{i})-1}{-4\lambda^{3}},
W2,3(xi)\displaystyle W_{2,3}(x_{i}) =W3,2(xi)=e2λdidi2(44λdi+λ2di2)2,\displaystyle=W_{3,2}(x_{i})=\frac{e^{-2\lambda d_{i}}d_{i}^{2}(4-4\lambda d_{i}+\lambda^{2}d^{2}_{i})}{2},
W_{3,3}(x_{i}) =\frac{e^{-2\lambda d_{i}}(-3+10\lambda d_{i}-22\lambda^{2}d^{2}_{i}+12\lambda^{3}d^{3}_{i}-2\lambda^{4}d^{4}_{i})+3}{4\lambda},

and the stationary covariance of \bm{\theta}_{i}, i=1,...,N, is

𝐖1=(σ20σ2λ2/30σ2λ2/30σ2λ2/30σ2λ4),\mathbf{W}_{1}=\begin{pmatrix}\sigma^{2}&0&-\sigma^{2}\lambda^{2}/3\\ 0&\sigma^{2}\lambda^{2}/3&0\\ -\sigma^{2}\lambda^{2}/3&0&\sigma^{2}\lambda^{4}\end{pmatrix},

The joint distribution of the latent states follows \left(\bm{\theta}^{T}(x_{1}),...,\bm{\theta}^{T}(x_{N})\right)^{T}\sim\mathcal{MN}(\mathbf{0},\bm{\Lambda}^{-1}), where \bm{\Lambda} is a symmetric block tri-diagonal matrix with the ith diagonal block being \mathbf{W}^{-1}_{i}+\mathbf{G}^{T}_{i+1}\mathbf{W}^{-1}_{i+1}\mathbf{G}_{i+1} for i=1,...,N-1, and the Nth diagonal block being \mathbf{W}_{N}^{-1}. The primary off-diagonal block of \bm{\Lambda}, at position (i-1,i), is -\mathbf{G}^{T}_{i}\mathbf{W}^{-1}_{i}, for i=2,...,N.

Suppose x_{i}<x^{*}<x_{i+1}. Let d_{i}^{*}=|x^{*}-x_{i}| and d_{i+1}^{*}=|x_{i+1}-x^{*}|. The "*" terms \mathbf{G}_{i}^{*} and \mathbf{W}_{i}^{*} can be computed by replacing d_{i} in \mathbf{G}_{i} and \mathbf{W}_{i} by d_{i}^{*}, whereas the "*" terms \mathbf{G}_{i+1}^{*} and \mathbf{W}_{i+1}^{*} can be computed by replacing d_{i} in \mathbf{G}_{i} and \mathbf{W}_{i} by d_{i+1}^{*}. Furthermore, \tilde{\mathbf{W}}^{*}_{i+1}=\mathbf{W}_{i+1}^{*}+\mathbf{G}_{i+1}^{*}\mathbf{W}_{i}^{*}(\mathbf{G}_{i+1}^{*})^{T}.
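As an illustration of Appendix 5.1, the sketch below (our own code) assembles \mathbf{G}_{i} and \mathbf{W}_{i} for sorted one-dimensional inputs. Instead of the closed-form entries of \mathbf{W}_{i} listed above, it uses the stationarity identity \mathbf{W}_{i}=\mathbf{W}_{1}-\mathbf{G}_{i}\mathbf{W}_{1}\mathbf{G}_{i}^{T}, which gives the same matrices for a stationary state process; the function name and argument conventions are ours.

```python
import numpy as np

def matern52_state_space(x, sigma, gamma):
    """For sorted 1D inputs x, return lists Gs, Ws where Gs[i-1], Ws[i-1] hold G_i, W_i
    (1-based i); there is no G_1, so Gs[0] is None and Ws[0] is the stationary covariance W_1."""
    lam = np.sqrt(5.0) / gamma
    W1 = np.array([[sigma**2, 0.0, -sigma**2 * lam**2 / 3],
                   [0.0, sigma**2 * lam**2 / 3, 0.0],
                   [-sigma**2 * lam**2 / 3, 0.0, sigma**2 * lam**4]])
    Gs, Ws = [None], [W1]
    for i in range(1, len(x)):
        d = abs(x[i] - x[i - 1])
        e = np.exp(-lam * d) / 2.0
        G = e * np.array([
            [lam**2 * d**2 + 2 * lam * d + 2, 2 * (lam * d**2 + d), d**2],
            [-lam**3 * d**2, -2 * (lam**2 * d**2 - lam * d - 1), 2 * d - lam * d**2],
            [lam**4 * d**2 - 2 * lam**3 * d, 2 * (lam**3 * d**2 - 3 * lam**2 * d),
             lam**2 * d**2 - 4 * lam * d + 2]])
        Gs.append(G)
        Ws.append(W1 - G @ W1 @ G.T)   # conditional covariance of theta_i given theta_{i-1}
    return Gs, Ws
```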

5.2 The sparse CG-GP algorithm for estimating interaction kernels

Here we discuss the details of computing the predictive mean and variance in (26). The N-vector of velocity observations is denoted by \mathbf{\tilde{v}}, where the total number of observations is N=nDML. To compute the predictive mean and variance, the most computationally challenging part is computing the N-vector \mathbf{z}=(\mathbf{U}_{s}\mathbf{R}_{s}\mathbf{U}^{T}_{s}+\eta\mathbf{I}_{N})^{-1}\mathbf{\tilde{v}}. Here \mathbf{R}_{s} and \mathbf{U}_{s} are \tilde{N}\times\tilde{N} and N\times\tilde{N} matrices, respectively, where \tilde{N}=n(n-1)ML/2 is the number of non-zero unique distance pairs. Note that both \mathbf{U}_{s} and \mathbf{R}^{-1}_{s} are sparse. Instead of directly computing the matrix inversion and the matrix-vector multiplication, we utilize the sparsity structure to accelerate the computation in the sparse CG-GP algorithm. In each CG iteration, we need to efficiently compute

𝐳~=(𝐔s𝐑s𝐔sT+η𝐈N)𝐳,\tilde{\mathbf{z}}=(\mathbf{U}_{s}\mathbf{R}_{s}\mathbf{U}^{T}_{s}+\eta\mathbf{I}_{N})\mathbf{z}, (29)

for any real-valued N-vector 𝐳\mathbf{z}.

We have four steps to compute the quantity in (29) efficiently. Denote by x_{i,j,m}[t_{l}] the jth spatial coordinate of particle i at time t_{l} in the mth simulation, for i=1,...,n, j=1,...,D, l=1,...,L and m=1,...,M. In the following, we use \mathbf{x}_{\cdot,\cdot,m}[\cdot] to denote the vector of all positions in the mth simulation, and similarly for other index sets. Furthermore, we use z[k] to denote the kth entry of any vector \mathbf{z}, and \mathbf{A}[k,.] and \mathbf{A}[.,k] to denote the kth row vector and the kth column vector of any matrix \mathbf{A}, respectively. The rank of a particle with position \mathbf{x}_{i,.,m}[t_{l}] is defined to be P=(m-1)Ln+(l-1)n+i.

First, we reduce the N\times\tilde{N} sparse matrix \mathbf{U}_{s} of distance difference pairs to an N\times n matrix \mathbf{U}_{re}, where 're' stands for reduced, with the ((m-1)LnD+(l-1)nD+(j-1)n+i_{1},\,i_{2})th entry of \mathbf{U}_{re} being (x_{i_{1},j,m}[t_{l}]-x_{i_{2},j,m}[t_{l}]), for any i_{1}\neq i_{2} with i_{1}\leq n and i_{2}\leq n. Furthermore, we create an \tilde{N}\times 2 matrix \mathbf{P}_{r}, where the hth row corresponds to the distance pair that is the hth largest entry in the zero-excluded sorted distance vector \mathbf{d}_{s}; P_{r}[h,1] and P_{r}[h,2] record the row ranks of this distance in the matrix \mathbf{d}_{mat}, whose jth column records the unordered distance pairs of the jth particle, for j=1,...,n. We further assume P_{r}[h,1]>P_{r}[h,2].

For any NN-vector 𝐳\mathbf{z}, the kkth entry of 𝐔sT𝐳\mathbf{U}^{T}_{s}\mathbf{z} can be written as

(𝐔sT𝐳)[k]=jk=0D1𝐔re[Pr[k,1]+cjk,Pr[k,2](m1)nL(l1)n](z[Pr[k,1]+cjk]z[Pr[k,2]+cjk]),(\mathbf{U}^{T}_{s}\mathbf{z})[k]=\sum_{j_{k}=0}^{D-1}\mathbf{U}_{re}\bigg{[}P_{r}[k,1]+c_{j_{k}},P_{r}[k,2]-(m-1)nL-(l-1)n\bigg{]}\bigg{(}z[P_{r}[k,1]+c_{j_{k}}]-z[P_{r}[k,2]+c_{j_{k}}]\bigg{)}, (30)

where cjk=(D1)(m1)nL+(D1)(l1)n+njkc_{j_{k}}=(D-1)(m-1)nL+(D-1)(l-1)n+nj_{k} for k=1,,N~k=1,...,\tilde{N}, if the kkth largest entry of distance pair is from time frame ll in the mmth simulation. The output is denoted as an N~\tilde{N} vector 𝐠1\mathbf{g}_{1}, i.e. 𝐠1=(𝐔sT𝐳)\mathbf{g}_{1}=(\mathbf{U}^{T}_{s}\mathbf{z}).

Second, since the exponential kernel is used, \mathbf{R}^{-1}_{s} is a tri-diagonal matrix [20]. We modify a Kalman filter step to efficiently compute the product \mathbf{g}_{2}=\mathbf{L}^{T}_{s}\mathbf{g}_{1}, where \mathbf{L}_{s} is the lower-triangular factor of the Cholesky decomposition \mathbf{R}_{s}=\mathbf{L}_{s}\mathbf{L}^{T}_{s}. Denote the Cholesky decomposition of the inverse covariance by \mathbf{R}^{-1}_{s}=\mathbf{\tilde{L}}_{s}\mathbf{\tilde{L}}_{s}^{T}, where \mathbf{\tilde{L}}_{s} can be written as the lower bi-diagonal matrix below:

𝐋~s=(11ρ12ρ11ρ1211ρ22ρ21ρ2211ρN~12ρN~11ρN~121),\mathbf{\tilde{L}}_{s}=\begin{pmatrix}\frac{1}{\sqrt{1-\rho_{1}^{2}}}&&&\\ \frac{-\rho_{1}}{\sqrt{1-\rho_{1}^{2}}}&\frac{1}{\sqrt{1-\rho_{2}^{2}}}&&\\ &\frac{-\rho_{2}}{\sqrt{1-\rho_{2}^{2}}}\,\ddots&&&\\ &\quad\ddots&\frac{1}{\sqrt{1-\rho_{\tilde{N}-1}^{2}}}&\\ &&\frac{-\rho_{\tilde{N}-1}}{\sqrt{1-\rho_{\tilde{N}-1}^{2}}}&1\\ \end{pmatrix}, (31)

where ρk=exp((ds,k+1ds,k)/γ)\rho_{k}=\exp(-(d_{s,k+1}-d_{s,k})/\gamma) for k=1,,N~1k=1,...,\tilde{N}-1. We modify the Thomas algorithm [57] to solve 𝐠2\mathbf{g}_{2} from equation (𝐋sT)1𝐠2=𝐠1(\mathbf{L}^{T}_{s})^{-1}\mathbf{g}_{2}=\mathbf{g}_{1}. Here (𝐋sT)1(\mathbf{L}^{T}_{s})^{-1} is an upper bi-diagonal matrix with explicit form

(𝐋sT)1=(1ρ11ρ1211ρ1211ρN~22ρN~11ρN~1211ρN~12).(\mathbf{L}^{T}_{s})^{-1}=\begin{pmatrix}1&\frac{-\rho_{1}}{\sqrt{1-\rho_{1}^{2}}}&\\ &\frac{1}{\sqrt{1-\rho_{1}^{2}}}&\\ &&\ddots&\ddots&\\ &&&\frac{1}{\sqrt{1-\rho_{\tilde{N}-2}^{2}}}&\frac{-\rho_{\tilde{N}-1}}{\sqrt{1-\rho_{\tilde{N}-1}^{2}}}\\ &&&&\frac{1}{\sqrt{1-\rho_{\tilde{N}-1}^{2}}}\\ \end{pmatrix}. (32)

Here only up to 2 entries in each row of (\mathbf{L}^{T}_{s})^{-1} are nonzero. Using a backward solve, \mathbf{g}_{2} can be obtained by the iteration below:

𝐠2[N~]\displaystyle\mathbf{g}_{2}[\tilde{N}] =𝐠1[N~]1ρN~12,\displaystyle=\mathbf{g}_{1}[\tilde{N}]\sqrt{1-\rho_{\tilde{N}-1}^{2}}, (33)
𝐠2[k]\displaystyle\mathbf{g}_{2}[k] =1ρk12𝐠1[k]+ρk𝐠2[k+1]1ρk121ρk2,\displaystyle=\sqrt{1-\rho_{k-1}^{2}}\mathbf{g}_{1}[k]+\frac{\rho_{k}\mathbf{g}_{2}[k+1]\sqrt{1-\rho_{k-1}^{2}}}{\sqrt{1-\rho_{k}^{2}}},\quad (34)

for k=\tilde{N}-1,...,2,1, with the convention \rho_{0}=0, so that the k=1 update reduces to \mathbf{g}_{2}[1]=\mathbf{g}_{1}[1]+\rho_{1}\mathbf{g}_{2}[2]/\sqrt{1-\rho_{1}^{2}}. Note that the Thomas algorithm is not stable in general, but the stability issue is greatly alleviated here, as the matrix in the system is bi-diagonal instead of tri-diagonal.
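For concreteness, the backward solve in (33)-(34) can be sketched as follows (our own code, 0-based indexing, with rho[k-1] playing the role of \rho_{k} and with the \rho_{0}=0 convention handled explicitly):

```python
import numpy as np

def backward_bidiagonal_solve(g1, rho):
    """Solve (L_s^T)^{-1} g2 = g1, i.e., g2 = L_s^T g1; rho has length N_tilde - 1."""
    n = len(g1)
    s = np.sqrt(1.0 - rho**2)                 # s[k-1] = sqrt(1 - rho_k^2)
    g2 = np.empty(n)
    g2[-1] = g1[-1] * s[-1]                   # equation (33)
    for k in range(n - 2, 0, -1):             # 1-based k = N_tilde-1, ..., 2
        g2[k] = s[k - 1] * g1[k] + rho[k] * g2[k + 1] * s[k - 1] / s[k]   # equation (34)
    g2[0] = g1[0] + rho[0] * g2[1] / s[0]     # 1-based k = 1, using rho_0 = 0
    return g2
```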

Third, we compute 𝐠3=𝐋s𝐠2\mathbf{g}_{3}=\mathbf{L}_{s}\mathbf{g}_{2} by solving 𝐋s1𝐠3=𝐠2\mathbf{L}^{-1}_{s}\mathbf{g}_{3}=\mathbf{g}_{2}:

𝐠3[1]\displaystyle\mathbf{g}_{3}[1] =𝐠2[1],\displaystyle=\mathbf{g}_{2}[1], (35)
𝐠3[k]\displaystyle\mathbf{g}_{3}[k] =1ρk12𝐠2[k]+ρk1𝐠3[k1],\displaystyle=\sqrt{1-\rho_{k-1}^{2}}\mathbf{g}_{2}[k]+\rho_{k-1}\mathbf{g}_{3}[k-1], (36)

for k=2,...,\tilde{N}.
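The forward recursion in (35)-(36) admits a similar sketch (our own code) under the same indexing conventions; composing it with the backward solve above yields \mathbf{g}_{3}=\mathbf{L}_{s}\mathbf{L}^{T}_{s}\mathbf{g}_{1}=\mathbf{R}_{s}\mathbf{g}_{1} without ever forming \mathbf{R}_{s}.

```python
import numpy as np

def forward_bidiagonal_solve(g2, rho):
    """Solve L_s^{-1} g3 = g2, i.e., g3 = L_s g2; rho has length N_tilde - 1."""
    n = len(g2)
    s = np.sqrt(1.0 - rho**2)
    g3 = np.empty(n)
    g3[0] = g2[0]                                        # equation (35)
    for k in range(1, n):                                # 1-based k = 2, ..., N_tilde
        g3[k] = s[k - 1] * g2[k] + rho[k - 1] * g3[k - 1]   # equation (36)
    return g3
```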

Finally, we define an MLn\times n matrix \mathbf{P}_{c}, initialized as a zero matrix. For r_{c}=1,...,MLn, row r_{c} of \mathbf{P}_{c} stores the ranks in \mathbf{d}_{s} of the distances between the corresponding particle and the other n-1 particles. For instance, at the lth time step in the mth simulation, particle i has n-1 non-zero distances ||\mathbf{x}_{1}-\mathbf{x}_{i}||,...,||\mathbf{x}_{i-1}-\mathbf{x}_{i}||,||\mathbf{x}_{i+1}-\mathbf{x}_{i}||,...,||\mathbf{x}_{n}-\mathbf{x}_{i}||, with ranks h_{1},...,h_{i-1},h_{i+1},...,h_{n} in \mathbf{d}_{s}. Then the ((m-1)Ln+(l-1)n+i)th row of \mathbf{P}_{c} is filled with (h_{1},...,h_{i-1},h_{i+1},...,h_{n}).

Given any N~\tilde{N}-vector 𝐠3\mathbf{g}_{3}, the kkth entry of 𝐔s𝐠3\mathbf{U}_{s}\mathbf{g}_{3} can be written as

(𝐔s𝐠3)[k]=𝐔re[k,.]𝐠3[𝐏c[k,.]T],(\mathbf{U}_{s}\mathbf{g}_{3})[k]=\mathbf{U}_{re}[k,.]\mathbf{g}_{3}[\mathbf{P}_{c}[k^{\prime},.]^{T}], (37)

assuming that k satisfies k=(m-1)LnD+(l-1)nD+jn+i and k^{\prime}=i+(m-1)Ln+(l-1)n for some m, l, j and i, with k=1,...,N. The output of this step is an N-vector \mathbf{g}_{4}, with the kth entry being \mathbf{g}_{4}[k]:=(\mathbf{U}_{s}\mathbf{g}_{3})[k], for k=1,...,N.

We summarize the sparse CG-GP algorithm using the following steps to compute 𝐳~\tilde{\mathbf{z}} in (29) below.

  1. Use equation (30) to compute \mathbf{g}_{1}[k]=(\mathbf{U}^{T}_{s}\mathbf{z})[k], for k=1,...,\tilde{N}.

  2. Use equations (33) and (34) to solve \mathbf{g}_{2} from (\mathbf{L}_{s}^{T})^{-1}\mathbf{g}_{2}=\mathbf{g}_{1}, where (\mathbf{L}_{s}^{T})^{-1} is given in equation (32).

  3. Use equations (35) and (36) to solve \mathbf{g}_{3} from \mathbf{L}_{s}^{-1}\mathbf{g}_{3}=\mathbf{g}_{2}, where \mathbf{L}_{s}^{-1}=((\mathbf{L}_{s}^{T})^{-1})^{T}, with (\mathbf{L}_{s}^{T})^{-1} given in equation (32).

  4. Use equation (37) to compute \mathbf{g}_{4}[k]=(\mathbf{U}_{s}\mathbf{g}_{3})[k] and let \tilde{\mathbf{z}}=\mathbf{g}_{4}+\eta\mathbf{z}.
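Putting the four steps together, the following high-level sketch (our own code, not the authors' implementation) wires the matrix-vector product (29) into a conjugate gradient solver. For simplicity it assumes \mathbf{U}_{s} has been assembled as a scipy sparse matrix and uses generic sparse products in place of the \mathbf{U}_{re}, \mathbf{P}_{r} and \mathbf{P}_{c} bookkeeping of equations (30) and (37); the two bi-diagonal solves of steps 2 and 3 are reproduced compactly inside multiply_Rs.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def multiply_Rs(g1, rho):
    """Compute R_s g1 through the two bi-diagonal solves of steps 2 and 3."""
    s = np.sqrt(1.0 - rho**2)
    n = len(g1)
    g2 = np.empty(n)                              # backward solve: g2 = L_s^T g1
    g2[-1] = g1[-1] * s[-1]
    for k in range(n - 2, 0, -1):
        g2[k] = s[k - 1] * g1[k] + rho[k] * g2[k + 1] * s[k - 1] / s[k]
    g2[0] = g1[0] + rho[0] * g2[1] / s[0]
    g3 = np.empty(n)                              # forward solve: g3 = L_s g2
    g3[0] = g2[0]
    for k in range(1, n):
        g3[k] = s[k - 1] * g2[k] + rho[k - 1] * g3[k - 1]
    return g3

def sparse_cg_gp_solve(U_s, d_s, v_tilde, gamma, eta):
    """Solve (U_s R_s U_s^T + eta I) z = v_tilde by conjugate gradients."""
    rho = np.exp(-np.diff(d_s) / gamma)           # rho_k = exp(-(d_{s,k+1} - d_{s,k}) / gamma)
    N = U_s.shape[0]
    def matvec(z):                                # the product in equation (29)
        return U_s @ multiply_Rs(U_s.T @ z, rho) + eta * z
    A = LinearOperator((N, N), matvec=matvec, dtype=float)
    z, info = cg(A, v_tilde)
    return z
```

Each matrix-vector product only involves sparse products with \mathbf{U}_{s} and two bi-diagonal recursions, so the \tilde{N}\times\tilde{N} matrix \mathbf{R}_{s} is never constructed.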

5.3 Interaction kernels in simulated studies

Here we give the expressions of the truncated L-J and OD kernels of particle interaction in [34, 13]. The truncated LJ kernel is given by

ϕLJ(d)={c2exp(c1d12),d[0,0.95],8(d4d10)3,d(0.95,),\phi_{LJ}(d)=\left\{\begin{array}[]{ll}c_{2}\exp(-c_{1}d^{12}),&d\in[0,0.95],\\ \frac{8(d^{-4}-d^{-10})}{3},&d\in(0.95,\infty),\end{array}\right.

where

c1=112c4c3(0.95)11 and c2=c3exp(c1(0.95)12),\displaystyle c_{1}=-\frac{1}{12}\frac{c_{4}}{c_{3}(0.95)^{11}}\mbox{ and }c_{2}=c_{3}\exp(c_{1}(0.95)^{12}),

with c3=83(0.9540.9510)c_{3}=\frac{8}{3}(0.95^{-4}-0.95^{-10}) and c4=83(10(0.95)114(0.95)5)c_{4}=\frac{8}{3}(10(0.95)^{-11}-4(0.95)^{-5}).
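The following is our own transcription of \phi_{LJ} with the constants above (not the authors' code); it assumes the kernel is evaluated at nonnegative distances.

```python
import numpy as np

def phi_LJ(d):
    d = np.atleast_1d(np.asarray(d, dtype=float))
    c3 = 8.0 / 3.0 * (0.95 ** -4 - 0.95 ** -10)
    c4 = 8.0 / 3.0 * (10 * 0.95 ** -11 - 4 * 0.95 ** -5)
    c1 = -c4 / (12.0 * c3 * 0.95 ** 11)
    c2 = c3 * np.exp(c1 * 0.95 ** 12)
    out = np.empty_like(d)
    left = d <= 0.95
    out[left] = c2 * np.exp(-c1 * d[left] ** 12)            # d in [0, 0.95]
    out[~left] = 8.0 / 3.0 * (d[~left] ** -4 - d[~left] ** -10)  # d > 0.95
    return out
```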

The interaction kernel of OD is defined as

\phi_{OD}(d)=\left\{\begin{array}{ll}0.4,&d\in[0,c_{5}),\\ -0.3\cos(10\pi(d-c_{5}))+0.7,&d\in[c_{5},c_{6}),\\ 1,&d\in[c_{6},0.95),\\ 0.5\cos(10\pi(d-0.95))+0.5,&d\in[0.95,1.05),\\ 0,&d\in[1.05,\infty),\end{array}\right.

where c5=120.05c_{5}=\frac{1}{\sqrt{2}}-0.05 and c6=12+0.05c_{6}=\frac{1}{\sqrt{2}}+0.05.
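A corresponding transcription of \phi_{OD}, with c_{5} and c_{6} as defined above, is given below (again our own code, not the authors').

```python
import numpy as np

def phi_OD(d):
    d = np.atleast_1d(np.asarray(d, dtype=float))
    c5 = 1.0 / np.sqrt(2.0) - 0.05
    c6 = 1.0 / np.sqrt(2.0) + 0.05
    out = np.zeros_like(d)                                    # 0 for d >= 1.05
    out[d < c5] = 0.4
    m = (d >= c5) & (d < c6)
    out[m] = -0.3 * np.cos(10 * np.pi * (d[m] - c5)) + 0.7
    out[(d >= c6) & (d < 0.95)] = 1.0
    m = (d >= 0.95) & (d < 1.05)
    out[m] = 0.5 * np.cos(10 * np.pi * (d[m] - 0.95)) + 0.5
    return out
```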

5.4 Further numerical results on estimating interaction kernels

We outline the numerical results of estimating the interaction functions at N^{*}_{1}=1000 equally spaced distance values on d\in[0,5] and d\in[0,1.5] for the truncated LJ and OD kernels, respectively. For each configuration, we repeat the simulation N^{*}_{2}=10 times and compute the predictive error in each simulation. The total number of test points is N^{*}=N^{*}_{1}N^{*}_{2}=10^{4}. For demonstration purposes, we do not add noise to the simulated data (i.e., \sigma^{2}_{0}=0). The range and nugget parameters are fixed at \gamma=5 and \eta=10^{-5}. We compute the normalized root mean squared error (NRMSE) in estimating the interaction kernel function:

NRMSE=1σϕi=1N(ϕ^(di)ϕ(di))2N,\mbox{NRMSE}=\frac{1}{\sigma_{\phi}}\sqrt{\sum^{N^{*}}_{i=1}\frac{(\hat{\phi}(d^{*}_{i})-\phi(d^{*}_{i}))^{2}}{N^{*}}},

where ϕ^(.)\hat{\phi}(.) is the estimated interaction kernel from the velocities and positions of the particles; σϕ\sigma_{\phi} is the standard deviation of the interaction function at test points.
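A short sketch of this computation (our own code) is given below; phi_hat_vals and phi_vals hold the estimated and true kernel values at all N^{*} test distances, stacked over the repeated experiments.

```python
import numpy as np

def nrmse(phi_hat_vals, phi_vals):
    rmse = np.sqrt(np.mean((phi_hat_vals - phi_vals) ** 2))
    return rmse / np.std(phi_vals)   # sigma_phi: standard deviation of the true kernel at test points
```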

Truncated LJ   n=50, L=1   n=200, L=1   n=50, L=10   n=200, L=10
Uniform        .11         .021         .026         .0051
Normal         .037        .012         .0090        .0028
Log-uniform    .043        .0036        .0018        .00091

OD             n=50, L=1   n=200, L=1   n=50, L=10   n=200, L=10
Uniform        .024        .0086        .0031        .0036
Normal         .13         .013         .038         .0064
Log-uniform    .076        .0045        .0018        .00081

Table 1: NRMSE of the sparse CG-GP method for estimating the truncated LJ and OD kernels of particle interaction.

Table 1 gives the NRMSE of the sparse CG-GP method for the truncated LJ and OD kernels under the 12 configurations. The estimation is typically the most accurate when the initial positions of the particles are sampled from the log-uniform design, for a given number of observations and interaction kernel. This is because the contributions to the velocities from the kernel function are proportional to the distance between particles in (21), and the small contributions from the interaction kernel at small distances make the kernel hard to estimate from the trajectory data in general. When the initial positions of the particles are sampled from the log-uniform design, more particles are close to each other, which provides more information for estimating the interaction kernel at small distances.

Furthermore, the predictive error of the interaction kernel is smaller when trajectories with more particles or more time steps are used in estimation, as more observations typically improve predictive accuracy. The sparse CG-GP algorithm reduces the computational cost substantially, which allows more observations to be used for making predictions.

References

  • [1] Ronald J Adrian and Jerry Westerweel. Particle image velocimetry. Number 30. Cambridge university press, 2011.
  • [2] Kyle R Anderson, Ingrid A Johanson, Matthew R Patrick, Mengyang Gu, Paul Segall, Michael P Poland, Emily K Montgomery-Brown, and Asta Miklius. Magma reservoir failure and the onset of caldera collapse at Kīlauea volcano in 2018. Science, 366(6470), 2019.
  • [3] Sudipto Banerjee, Bradley P Carlin, and Alan E Gelfand. Hierarchical modeling and analysis for spatial data. Crc Press, 2014.
  • [4] Maria Maddalena Barbieri and James O Berger. Optimal predictive model selection. The annals of statistics, 32(3):870–897, 2004.
  • [5] Maria J Bayarri, James O Berger, Rui Paulo, Jerry Sacks, John A Cafeo, James Cavendish, Chin-Hsu Lin, and Jian Tu. A framework for validation of computer models. Technometrics, 49(2):138–154, 2007.
  • [6] James O Berger, Victor De Oliveira, and Bruno Sansó. Objective Bayesian analysis of spatially correlated data. Journal of the American Statistical Association, 96(456):1361–1374, 2001.
  • [7] James O Berger, Brunero Liseo, and Robert L Wolpert. Integrated likelihood methods for eliminating nuisance parameters. Statistical science, 14(1):1–28, 1999.
  • [8] James O Berger and Luis R Pericchi. The intrinsic bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433):109–122, 1996.
  • [9] Iain D Couzin, Jens Krause, Nigel R Franks, and Simon A Levin. Effective leadership and decision-making in animal groups on the move. Nature, 433(7025):513–516, 2005.
  • [10] Noel Cressie and Gardar Johannesson. Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226, 2008.
  • [11] Abhirup Datta, Sudipto Banerjee, Andrew O Finley, and Alan E Gelfand. Hierarchical nearest-neighbor gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812, 2016.
  • [12] Bruno De Finetti. La prévision: ses lois logiques, ses sources subjectives. In Annales de l’institut Henri Poincaré, volume 7, pages 1–68, 1937.
  • [13] Jinchao Feng, Yunxiang Ren, and Sui Tang. Data-driven discovery of interacting particle systems using gaussian processes. arXiv preprint arXiv:2106.02735, 2021.
  • [14] Alan E Gelfand, Sudipto Banerjee, and Dani Gamerman. Spatial process modelling for univariate and multivariate dynamic spatial data. Environmetrics: The official journal of the International Environmetrics Society, 16(5):465–479, 2005.
  • [15] Robert B Gramacy and Daniel W Apley. Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578, 2015.
  • [16] Robert B Gramacy and Herbert KH Lee. Cases for the nugget in modeling computer experiments. Statistics and Computing, 22(3):713–722, 2012.
  • [17] Mengyang Gu and Hanmo Li. Gaussian Orthogonal Latent Factor Processes for Large Incomplete Matrices of Correlated Data. Bayesian Analysis, pages 1 – 26, 2022.
  • [18] Mengyang Gu, Jesus Palomo, and James O. Berger. RobustGaSP: Robust Gaussian Stochastic Process Emulation in R. The R Journal, 11(1):112–136, 2019.
  • [19] Mengyang Gu and Weining Shen. Generalized probabilistic principal component analysis of correlated data. Journal of Machine Learning Research, 21(13), 2020.
  • [20] Mengyang Gu, Xiaojing Wang, and James O Berger. Robust Gaussian stochastic process emulation. Annals of Statistics, 46(6A):3038–3066, 2018.
  • [21] Wolfgang Hackbusch. Iterative solution of large sparse systems of equations, volume 95. Springer, 1994.
  • [22] Jouni Hartikainen and Simo Sarkka. Kalman filtering and smoothing solutions to temporal gaussian process regression models. In Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on, pages 379–384. IEEE, 2010.
  • [23] Silke Henkes, Yaouen Fily, and M Cristina Marchetti. Active jamming: Self-propelled soft particles at high density. Physical Review E, 84(4):040301, 2011.
  • [24] Magnus R Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.
  • [25] Dave Higdon, James Gattiker, Brian Williams, and Maria Rightley. Computer model calibration using high-dimensional output. Journal of the American Statistical Association, 103(482):570–583, 2008.
  • [26] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960.
  • [27] Matthias Katzfuss and Joseph Guinness. A general framework for vecchia approximations of gaussian processes. Statistical Science, 36(1):124–141, 2021.
  • [28] Hannes Kazianka and Jürgen Pilz. Objective Bayesian analysis of spatial data with uncertain nugget and range parameters. Canadian Journal of Statistics, 40(2):304–327, 2012.
  • [29] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.
  • [30] Clifford Lam and Qiwei Yao. Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics, 40(2):694–726, 2012.
  • [31] Clifford Lam, Qiwei Yao, and Neil Bathia. Estimation of latent factors for high-dimensional time series. Biometrika, 98(4):901–918, 2011.
  • [32] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
  • [33] Finn Lindgren, Håvard Rue, and Johan Lindström. An explicit link between gaussian fields and gaussian markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011.
  • [34] Fei Lu, Ming Zhong, Sui Tang, and Mauro Maggioni. Nonparametric inference of interaction laws in systems of agents from trajectory data. Proc. Natl. Acad. Sci. U.S.A., 116(29):14424–14433, 2019.
  • [35] M Cristina Marchetti, Jean-François Joanny, Sriram Ramaswamy, Tanniemola B Liverpool, Jacques Prost, Madan Rao, and R Aditi Simha. Hydrodynamics of soft active matter. Reviews of modern physics, 85(3):1143, 2013.
  • [36] Sebastien Motsch and Eitan Tadmor. Heterophilious dynamics enhances consensus. SIAM review, 56(4):577–621, 2014.
  • [37] Joseph Muré. Propriety of the reference posterior distribution in gaussian process modeling. The Annals of Statistics, 49(4):2356–2377, 2021.
  • [38] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
  • [39] Rui Paulo. Default priors for Gaussian processes. Annals of statistics, 33(2):556–582, 2005.
  • [40] Rui Paulo, Gonzalo García-Donato, and Jesús Palomo. Calibration of computer models with multivariate output. Computational Statistics and Data Analysis, 56(12):3959–3974, 2012.
  • [41] Giovanni Petris, Sonia Petrone, and Patrizia Campagnoli. Dynamic linear models with R. Springer, 2009.
  • [42] Adrian E Raftery, David Madigan, and Jennifer A Hoeting. Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92(437):179–191, 1997.
  • [43] Dennis C Rapaport. The art of molecular dynamics simulation. Cambridge University Press, 2004.
  • [44] Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press, 2006.
  • [45] Herbert E Rauch, F Tung, and Charlotte T Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA journal, 3(8):1445–1450, 1965.
  • [46] Cuirong Ren, Dongchu Sun, and Chong He. Objective bayesian analysis for a spatial model with nugget effects. Journal of Statistical Planning and Inference, 142(7):1933–1946, 2012.
  • [47] Olivier Roustant, David Ginsbourger, and Yves Deville. Dicekriging, diceoptim: Two r packages for the analysis of computer experiments by kriging-based metamodeling and optimization. Journal of Statistical Software, 51(1):1–55, 2012.
  • [48] Håvard Rue, Sara Martino, and Nicolas Chopin. Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations. Journal of the royal statistical society: Series B (statistical methodology), 71(2):319–392, 2009.
  • [49] Yousef Saad. Iterative methods for sparse linear systems. SIAM, 2003.
  • [50] Jerome Sacks, William J Welch, Toby J Mitchell, and Henry P Wynn. Design and analysis of computer experiments. Statistical science, 4(4):409–423, 1989.
  • [51] Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter Battaglia. Learning to simulate complex physics with graph networks. In International Conference on Machine Learning, pages 8459–8468. PMLR, 2020.
  • [52] Thomas J Santner, Brian J Williams, and William I Notz. The design and analysis of computer experiments. Springer Science & Business Media, 2003.
  • [53] Matthias Seeger, Yee-Whye Teh, and Michael Jordan. Semiparametric latent factor models. Technical report, 2005.
  • [54] Edward Snelson and Zoubin Ghahramani. Sparse gaussian processes using pseudo-inputs. Advances in neural information processing systems, 18:1257, 2006.
  • [55] Jonathan R Stroud, Michael L Stein, and Shaun Lysen. Bayesian and maximum likelihood estimation for gaussian processes on an incomplete lattice. Journal of computational and Graphical Statistics, 26(1):108–120, 2017.
  • [56] S. Surjanovic and D. Bingham. Virtual library of simulation experiments: Test functions and datasets. http://www.sfu.ca/~ssurjano, 2017.
  • [57] Llewellyn Hilleth Thomas. Elliptic problems in linear difference equations over a network. Watson Sci. Comput. Lab. Rept., Columbia University, New York, 1:71, 1949.
  • [58] Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
  • [59] Aldo V Vecchia. Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society: Series B (Methodological), 50(2):297–312, 1988.
  • [60] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397–434, 2013.
  • [61] M. West and P. J. Harrison. Bayesian Forecasting & Dynamic Models. Springer Verlag, 2nd edition, 1997.
  • [62] Mike West and Jeff Harrison. Bayesian forecasting and dynamic models. Springer Science & Business Media, 2006.
  • [63] Peter Whittle. Stochastic process in several dimensions. Bulletin of the International Statistical Institute, 40(2):974–994, 1963.
  • [64] Andrew G Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. Advances in neural information processing systems, 33:4697–4708, 2020.