
Kernel Packet: An Exact and Scalable Algorithm for Gaussian Process Regression with Matérn Correlations

Haoyuan Chen (chenhaoyuan2018@tamu.edu)
Liang Ding (ldingaa@tamu.edu)
Rui Tuo (ruituo@tamu.edu)
Wm Michael Barnes ’64 Department of Industrial & Systems Engineering
Texas A&M University
College Station, TX 77843, USA
First two authors contributed equally to this work.
Abstract

We develop an exact and scalable algorithm for one-dimensional Gaussian process regression with Matérn correlations whose smoothness parameter $\nu$ is a half-integer. The proposed algorithm only requires $\mathcal{O}(\nu^{3}n)$ operations and $\mathcal{O}(\nu n)$ storage. This leads to a linear-cost solver since $\nu$ is chosen to be fixed and usually very small in most applications. The proposed method can be applied to multi-dimensional problems if a full grid or a sparse grid design is used. The proposed method is based on a novel theory for Matérn correlation functions. We find that a suitable rearrangement of these correlation functions can produce a compactly supported function, called a “kernel packet”. Using a set of kernel packets as basis functions leads to a sparse representation of the covariance matrix that results in the proposed algorithm. Simulation studies show that the proposed algorithm, when applicable, is significantly superior to the existing alternatives in both computational time and predictive accuracy.

Keywords: Computer experiments, Kriging, Uncertainty quantification, Compactly supported functions, Sparse matrices

1 Introduction

Gaussian process (GP) regression is a powerful function reconstruction tool. It has been widely used in computer experiments (Santner et al., 2003; Gramacy, 2020), spatial statistics (Cressie, 2015), supervised learning (Rasmussen, 2006), reinforcement learning (Deisenroth et al., 2013), probabilistic numerics (Hennig et al., 2015) and Bayesian optimization (Srinivas et al., 2009). GP regression models are flexible enough to fit a variety of functions, and they also enable uncertainty quantification for prediction by providing predictive distributions. With these appealing features, GP regression has become the primary surrogate model for computer experiments since it was popularized by Sacks et al. (1989). Despite these advantages, Gaussian process regression has its drawbacks. A major one is its computational complexity. Training a GP model requires computing matrix inverses and determinants. With $n$ training points, each of these matrix manipulations takes $\mathcal{O}(n^{3})$ operations (referred to as “time” hereafter, assuming for simplicity that no parallel computing is used) if a direct method, such as the Cholesky decomposition, is applied. Besides, the computation for model training may also be hindered by the $\mathcal{O}(n^{2})$ storage requirement (Gramacy, 2020) to store the $n\times n$ covariance matrix.

Tremendous efforts have been made in the literature to address the computational challenges of GP regression. Recent advances in scalable GP regression include random Fourier features (Rahimi and Recht, 2007), Nyström approximation (also known as inducing points) (Smola and Schölkopf, 2000; Williams and Seeger, 2001; Titsias, 2009; Bui et al., 2017; Katzfuss, 2017; Chen and Stein, 2021), structured kernel interpolation (Wilson and Nickisch, 2015), etc. These methods are based on different types of approximation of GPs, i.e., efficiency is gained at the cost of accuracy. In contrast, the main objective of this work is to propose a novel scalable approach that does not need an approximation.

In this work, we focus on the use of GP regression in the context of computer experiments. In these studies, the training data are acquired through an experiment, in which the input points can be chosen. Such a choice is called a design of the experiment. It is well known that a suitably chosen design can largely simplify the computation. Here we consider the “tensor-space” techniques in terms of using a product correlation function and a full grid or a sparse grid (Plumlee, 2014) design. The tensor-space techniques can reduce a multivariate GP regression problem to several univariate problems. It is worth noting that, in some applications besides computer experiments, even if the input sites are not controllable, the data are naturally observed on full grids, e.g., the remote sensing data in geoscience applications (Bazi and Melgani, 2009). In these scenarios, the tensor-space techniques are also applicable.

Having the tensor-space techniques, the final hard nut to crack is the one-dimensional GP regression problem. We assume that the one-dimensional input data are already ordered throughout this work. This assumption is reasonable in computer experiment applications since the design points are chosen at our will. In other applications where we do not have ordered data in the first place, it takes only $\mathcal{O}(n\log n)$ time to sort them.

This work presents a mathematically exact algorithm for conditional inference in one-dimensional GP regression with time and space complexity both linear in $n$. This algorithm is specialized for Matérn correlations with smoothness $\nu$ being a half-integer (see Section 1.1 for the definition). Matérn correlations are commonly used in practice (Stein, 1999; Santner et al., 2003; Gramacy, 2020). In most applications, $\nu$ is chosen to be small, e.g., $\nu=1.5$ or $\nu=2.5$, for the sake of higher model flexibility. The proposed algorithm enjoys the following important features.

  • Given the hyper-parameters of the GP, the proposed algorithm is mathematically exact, i.e., all numerical error is attributed to the roundoff error given by the machine precision.

  • There is no restriction on the one-dimensional input points, but if the points are equally spaced, the computational time can be further reduced.

  • It takes only $\mathcal{O}(\nu^{3}n)$ time to compute the matrix inversion and the determinant. For equally spaced designs, this time is further reduced to $\mathcal{O}(\nu^{2}n)$.

  • After the above pre-processing time, it takes only $\mathcal{O}(\nu+\log n)$ or even $\mathcal{O}(\nu)$ time to make a new prediction (i.e., evaluate the conditional mean) at an untried point.

  • The storage requirement is only $\mathcal{O}(\nu n)$.

The remainder of this article is organized as follows. We will review the general idea of GP regression and some existing algorithms in Sections 1.1 and 1.2, respectively. The mathematical theory behind the proposed algorithm is introduced in Section 2. In Section 3, we propose the main algorithm. Numerical studies are given in Section 4. In Section 5, we briefly discuss some possible extensions of the proposed method. Concluding remarks are made in Section 6. Appendices A and B contain the required mathematical tools and our technical proofs, respectively.

1.1 A Review of GP Regression

Let $Y(\mathbf{x})$ denote a stationary GP prior on $\mathbb{R}^{d}$ with mean function $\mu(\mathbf{x})$, variance $\sigma^{2}$, and correlation function $K(\mathbf{x},\mathbf{x}^{\prime})$. The correlation function is also referred to as a “kernel function” in the language of applied math or machine learning (Rasmussen, 2006). When $d=1$, there are two types of popular correlation functions. The first type is the Matérn family (Stein, 1999):

K(x,x^{\prime})=\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\frac{\lvert x-x^{\prime}\rvert}{\omega}\right)^{\nu}K_{\nu}\left(\sqrt{2\nu}\frac{\lvert x-x^{\prime}\rvert}{\omega}\right), (1)

for any $x,x^{\prime}\in\mathbb{R}$, where $\nu>0$ is the smoothness parameter, $\omega>0$ is the scale and $K_{\nu}$ is the modified Bessel function of the second kind. The smoothness parameter $\nu$ governs the smoothness of the GP $Y$ (Santner et al., 2003; Stein, 1999); the scale parameter $\omega$ determines the spread of the correlation (Rasmussen, 2006). Matérn correlation functions are widely used because of their great flexibility. The second type is the Gaussian family:

K(x,x^{\prime})=\exp\left(-\frac{\lvert x-x^{\prime}\rvert^{2}}{\omega}\right), (2)

for any $x,x^{\prime}\in\mathbb{R}^{d}$. A Gaussian kernel function is the limit of a sequence of Matérn kernels with the smoothness parameter tending to infinity. The sample paths generated by a GP with a Gaussian correlation function are infinitely differentiable.

For multi-dimensional problems, a typical choice of the correlation structure is the separable or product correlation:

K(\mathbf{x},\mathbf{x}^{\prime})=\prod_{j=1}^{d}K_{j}(x_{j},x^{\prime}_{j}), (3)

for any $\mathbf{x}=(x_{1},\ldots,x_{d})^{T},\mathbf{x}^{\prime}=(x^{\prime}_{1},\ldots,x^{\prime}_{d})^{T}$, where $K_{j}$ is a one-dimensional Matérn or Gaussian correlation function for each $j$. This assumption ensures that the GP lives in a tensor space, and it is key to the “tensor-space” techniques, which reduce multi-dimensional problems to one-dimensional ones.

Suppose that we have observed $\mathbf{Y}=\big(Y(\mathbf{x}_{1}),\cdots,Y(\mathbf{x}_{n})\big)^{T}$ on $n$ distinct points $\mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{n}$. The aim of GP regression is to predict the output at an untried input $\mathbf{x}^{*}$ by computing the distribution of $Y(\mathbf{x}^{*})$ conditional on $\mathbf{Y}$, which is a normal distribution with the following conditional mean and variance (Santner et al., 2003; Banerjee et al., 2014):

\mathbb{E}\left[Y(\mathbf{x}^{*})\big|\mathbf{Y}\right]=\mu(\mathbf{x}^{*})+K(\mathbf{x}^{*},\mathbf{X})\mathbf{K}^{-1}\left(\mathbf{Y}-\boldsymbol{\mu}\right), (4)
\mathrm{Var}\left[Y(\mathbf{x}^{*})\big|\mathbf{Y}\right]=\sigma^{2}\left(K(\mathbf{x}^{*},\mathbf{x}^{*})-K(\mathbf{x}^{*},\mathbf{X})\mathbf{K}^{-1}K(\mathbf{X},\mathbf{x}^{*})\right), (5)

where $\sigma^{2}>0$ is the variance, $K(\mathbf{x}^{*},\mathbf{X})=\left(K(\mathbf{X},\mathbf{x}^{*})\right)^{T}=\left(K(\mathbf{x}^{*},\mathbf{x}_{1}),\cdots,K(\mathbf{x}^{*},\mathbf{x}_{n})\right)$, $\mathbf{K}=\left[K(\mathbf{x}_{i},\mathbf{x}_{s})\right]_{i,s=1}^{n}$ and $\boldsymbol{\mu}=\left(\mu(\mathbf{x}_{1}),\cdots,\mu(\mathbf{x}_{n})\right)^{T}$.

In GP regression, the mean function $\mu$ is usually parametrized in a linear form $\mu=\sum_{i=1}^{p}\beta_{i}f_{i}$ for some unknown coefficient vector $\boldsymbol{\beta}=\big(\beta_{1},\cdots,\beta_{p}\big)^{T}$ and known regression functions $f_{1},\cdots,f_{p}$. To improve the predictive performance of GP regression, the coefficient vector $\boldsymbol{\beta}$, variance $\sigma^{2}$ and scales $\boldsymbol{\omega}=\big(\omega_{1},\cdots,\omega_{d}\big)^{T}$ associated with each one-dimensional correlation function $K_{j}$ are usually estimated via maximum likelihood (Jones et al., 1998). The log-likelihood function given the data is:

L(\boldsymbol{\beta},\sigma^{2},\boldsymbol{\omega})=-\frac{1}{2}\left[n\log\sigma^{2}+\log\det(\mathbf{K})+\frac{1}{\sigma^{2}}\big(\mathbf{Y}-\mathbf{F}\boldsymbol{\beta}\big)^{T}\mathbf{K}^{-1}\big(\mathbf{Y}-\mathbf{F}\boldsymbol{\beta}\big)\right], (6)

where $\det(\mathbf{K})$ denotes the determinant of the correlation matrix $\mathbf{K}$ and $\mathbf{F}$ is the $n\times p$ matrix whose $(i,s)^{\rm th}$ entry is $f_{s}(\mathbf{x}_{i})$. The Maximum Likelihood Estimator (MLE) is then defined as the maximizer of the log-likelihood function: $(\hat{\boldsymbol{\beta}},\hat{\sigma}^{2},\hat{\boldsymbol{\omega}})=\operatorname*{argmax}_{\boldsymbol{\beta},\sigma^{2},\boldsymbol{\omega}}L(\boldsymbol{\beta},\sigma^{2},\boldsymbol{\omega})$.

In both GP regression and parameter estimation, the computation can become unstable or even intractable because it involves the inverse and the determinant of the correlation matrix $\mathbf{K}$. Each task takes $\mathcal{O}(n^{3})$ time if a direct method, such as the Cholesky decomposition, is applied, which is a fundamental computational challenge for GP regression.
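To fix ideas, the following minimal sketch (our own illustration, not part of the proposed algorithm) evaluates (4), (5) and the log-determinant needed in (6) for a one-dimensional Matérn-$3/2$ correlation with a dense Cholesky factorization; this is the $\mathcal{O}(n^{3})$ baseline that the kernel-packet approach is designed to replace. The function names, the jitter term, and the test function are illustrative choices.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def matern32(x, xp, omega=1.0):
    """Matérn correlation (1) with nu = 3/2: (1 + sqrt(3) r / omega) exp(-sqrt(3) r / omega)."""
    s = np.sqrt(3.0) * np.abs(x[:, None] - xp[None, :]) / omega
    return (1.0 + s) * np.exp(-s)

def gp_predict_dense(x_train, y_train, x_test, omega=1.0, sigma2=1.0, mu=0.0):
    """Direct O(n^3) evaluation of the conditional mean (4), variance (5) and log det(K)."""
    n = len(x_train)
    K = matern32(x_train, x_train, omega)
    c, low = cho_factor(K + 1e-10 * np.eye(n))     # small jitter for numerical stability
    alpha = cho_solve((c, low), y_train - mu)      # K^{-1}(Y - mu)
    k_star = matern32(x_test, x_train, omega)      # K(x*, X), one row per test point
    mean = mu + k_star @ alpha                     # Eq. (4)
    var = sigma2 * (1.0 - np.sum(k_star * cho_solve((c, low), k_star.T).T, axis=1))  # Eq. (5)
    logdet_K = 2.0 * np.sum(np.log(np.diag(c)))    # log det(K), as needed in Eq. (6)
    return mean, var, logdet_K

x = np.sort(np.random.default_rng(0).uniform(size=500))
y = np.sin(12 * np.pi * x)
mean, var, logdet_K = gp_predict_dense(x, y, np.linspace(0.0, 1.0, 11))
```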

1.2 Comparisons with Existing Methods

When applicable, as a specialized algorithm, the proposed method is significantly superior to the existing alternatives. In this section, we compare the proposed method with a few popular existing approaches for large-scale GP regression. It is worth noting that the fundamental mathematical theory for the proposed method differs from that of any of the existing methods. A summary of the comparisons is presented in Table 1.

Method | Kernels | Design | Time | Storage | Accuracy
Proposed Method | Matérn-$\nu$, $\nu-1/2\in\mathbb{N}$ | Arbitrary | $\mathcal{O}(\nu^{3}n)$ | $\mathcal{O}(\nu n)$ | Exact
Toeplitz Methods | Stationary kernels | Equally spaced | $\mathcal{O}(n\log n)$ | $\mathcal{O}(n)$ | Depending on the number of iterations
Local Approximate GP (with $m$ nearest neighbors) | Arbitrary | Arbitrary | $\mathcal{O}(m^{3})$ | $\mathcal{O}(m^{2}+n)$ | Unknown
Random Fourier Features (with $m$ random features) | Matérn-$\nu$, $\nu>\frac{1}{2}$; Gaussian | Arbitrary | $\mathcal{O}(m^{2}n)$ | $\mathcal{O}(m^{2}+mn)$ | $\mathcal{O}_{p}(m^{-1/2})$
Nyström Approximation (with $m$ inducing points) | Matérn-$\nu$, $\nu>\frac{3}{2}$; Gaussian | Arbitrary | $\mathcal{O}(m^{2}n)$ | $\mathcal{O}(m^{2}+mn)$ | Matérn: $\mathcal{O}_{p}(m^{-2\nu-1})$; Gaussian: $\mathcal{O}_{p}(\exp(-\alpha m\log m))$

Table 1: Comparisons with existing methods.

Toeplitz methods: Toeplitz methods (Wood and Chan, 1994) work for stationary GPs with equally spaced design points. These methods leverage the Toeplitz structure of the covariance matrices under this setting. To make a prediction in terms of solving (4) and (5), there are two approaches. The first is to solve the Toeplitz system exactly, using, for example, the Levinson algorithm (Zhang et al., 2005), which takes $\mathcal{O}(n^{2})$ time. A more commonly used approach is based on a conjugate gradient algorithm (Atkinson, 2008) to solve the matrix inversion problems in (4) and (5). Each step takes $\mathcal{O}(n\log n)$ time. For the sake of rapid computation, the number of iterations is chosen to be small, but then the method becomes inexact. Moreover, the conjugate gradient algorithm is unable to compute the determinant in (6) (Wilson and Nickisch, 2015). Thus one has to resort to the exact algorithm to compute the likelihood value, which takes $\mathcal{O}(n^{2})$ time. Toeplitz methods only work for equally spaced design points. This is a strong restriction for one-dimensional problems. For multi-dimensional problems in a tensor space, this restriction can also be problematic, especially under a sparse grid design: many well-known sparse grid designs are not based on equally spaced one-dimensional points, such as the Clenshaw-Curtis sparse grids (Gerstner and Griebel, 1998) or the ones suggested by Plumlee (2014).

Local Approximate Gaussian Processes: Gramacy and Apley (2015) proposed a sequential design scheme that dynamically defines the support of a Gaussian process predictor based on a local subset of the data. The local subset comprises $m$ data points and, consequently, local approximate GP reduces the time and space complexity of GP regression to $\mathcal{O}(m^{3})$ and $\mathcal{O}(m^{2}+n)$, respectively. Local approximate GPs can achieve a decent accuracy level in empirical experiments, but the theoretical properties of this algorithm are still unknown.

Random Fourier Features: The class of Fourier feature methods originates from the work of Rahimi and Recht (2007). These methods essentially use $\sum_{i=1}^{m}\phi_{i}(\mathbf{x})\phi_{i}(\mathbf{x}^{\prime})$ to approximate $K(\mathbf{x},\mathbf{x}^{\prime})$, where $\phi_{1}(\mathbf{x}),\ldots,\phi_{m}(\mathbf{x})$ are basis functions constructed from random samples of the spectral density, i.e., the Fourier transform of the kernel function $K$. This low-rank approximation reduces the time and space complexity of GP regression to $\mathcal{O}(m^{2}n)$ and $\mathcal{O}(m^{2}+mn)$, respectively, with accuracy $\mathcal{O}_{p}(m^{-1/2})$ (Sriperumbudur and Szabo, 2015). Clearly, the price for the fast computation of random Fourier features is a loss of accuracy.

Nyström Approximation: These methods approximate the $n\times n$ covariance matrix $\mathbf{K}$ by an $m\times m$ matrix $\widetilde{\mathbf{K}}=K(\widetilde{\mathbf{X}},\widetilde{\mathbf{X}})$, where $\widetilde{\mathbf{X}}=\{\tilde{\mathbf{x}}_{i}\}_{i=1}^{m}$ are called the inducing points. Similar to random Fourier features, Nyström approximations reduce the time and space complexity of GP regression to $\mathcal{O}(m^{2}n)$ and $\mathcal{O}(m^{2}+mn)$, respectively. There are several approaches to choosing the inducing points. Smola and Schölkopf (2000); Williams and Seeger (2001) selected $\widetilde{\mathbf{X}}$ from the data points $\mathbf{X}$ by an orthogonalization procedure. Titsias (2009); Bui et al. (2017) treated $\widetilde{\mathbf{X}}$ as hidden variables and selected these inducing points via variational Bayesian inference. Katzfuss (2017); Chen and Stein (2021) further developed the Nyström approximation to construct more precise kernel approximations with multi-resolution structures. For a Matérn-$\nu$ kernel with $\nu>3/2$, it is shown in Burt et al. (2019) that the accuracy level of any inducing-point method is $\mathcal{O}_{p}(m^{-2\nu-1})$, which is higher than that of random Fourier features. It was also shown in Tuo and Wang (2020) that GP regression with a Matérn-$\nu$ kernel converges to the underlying true GP at the rate $\mathcal{O}(n^{-\nu})$ and, hence, the number of inducing points should satisfy $m=\mathcal{O}(n^{\frac{\nu}{2\nu+1}})$ to achieve the optimal order of approximation accuracy. In this case, the time and space complexity of Nyström approximations are $\mathcal{O}(n^{1+\frac{2\nu}{2\nu+1}})$ and $\mathcal{O}(n^{1+\frac{\nu}{2\nu+1}})$, respectively. These are higher than those of the proposed algorithm, not to mention that the latter provides exact solutions.

Other methods: Using a compactly supported kernel (Gramacy, 2020) can induce a sparse covariance matrix, which can lead to an improvement in matrix manipulations. However, if the support of the kernel remains the same while the design points become dense in a finite interval, the sparsity of the covariance matrix is not high enough to improve the order of magnitude. On the other hand, shrinking the support may substantially change the sample path properties of the GP, and impair the power of prediction. Recently, Loper et al. (2021) proposed a general approximation scheme for one-dimensional GP regression, which results in a linear-time inference method.

2 Theory of Kernel Packet Basis

In this section, we introduce the mathematical theory for the novel approach of inverting the correlation matrix in (4) and (5). Technical proofs of all theorems are deferred to Appendix B.

Directly inverting the matrix $\mathbf{K}$ in (4) and (5) is time consuming because $\mathbf{K}$ is a dense matrix. Note that each entry of $\mathbf{K}$ is an evaluation of the function $K(\cdot,x_{j})$ for some $j$. The matrix $\mathbf{K}$ is not sparse because the support of $K$ is the entire real line. The main idea of this work is to find an exact representation of $\mathbf{K}$ in terms of sparse matrices. This exact representation is built via a change-of-basis transformation.

In this section, we suppose $K$ is a one-dimensional kernel. Consider the linear space $\mathcal{K}=\operatorname{span}\{K(\cdot,x_{j})\}_{j=1}^{n}$. The goal is to find another basis for $\mathcal{K}$, denoted as $\{\phi_{j}\}_{j=1}^{n}$, satisfying the following properties:

  1. Almost all of the $\phi_{j}$'s have compact supports.

  2. $\{\phi_{j}\}_{j=1}^{n}$ can be obtained from $\{K(\cdot,x_{j})\}_{j=1}^{n}$ via a sparse linear transformation, i.e., the matrix defining the linear transform from $\{K(\cdot,x_{j})\}_{j=1}^{n}$ to $\{\phi_{j}\}_{j=1}^{n}$ is sparse.

Unless otherwise specified, throughout this article we assume that the one-dimensional kernel $K$ is a Matérn correlation function as in (1), whose spectral density is proportional to $\left(2\nu/\omega^{2}+x^{2}\right)^{-(\nu+1/2)}$; see Rasmussen (2006); Tuo and Wu (2016). For notational simplicity, let $c^{2}:=2\nu/\omega^{2}$, so that the above spectral density is proportional to $(c^{2}+x^{2})^{-(\nu+1/2)}$.
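For later reference, the half-integer members of (1) that appear most often in this paper have the following well-known closed forms, each a polynomial in $c|x-x^{\prime}|$ multiplied by a decaying exponential; this is the structure exploited by kernel packets:

K(x,x^{\prime})=
\begin{cases}
e^{-c|x-x^{\prime}|}, & \nu=1/2,\\
\left(1+c|x-x^{\prime}|\right)e^{-c|x-x^{\prime}|}, & \nu=3/2,\\
\left(1+c|x-x^{\prime}|+\tfrac{1}{3}c^{2}|x-x^{\prime}|^{2}\right)e^{-c|x-x^{\prime}|}, & \nu=5/2,
\end{cases}
\qquad c=\sqrt{2\nu}/\omega.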

2.1 Definition and Existence of Kernel Packets

In this section we introduce the theory that explains how we can find a compactly supported function in $\mathcal{K}$. Clearly, such a function must admit the representation $\phi(x)=\sum_{j=1}^{n}A_{j}K(x,x_{j})$. Recall the requirement that the linear transform be sparse, which means that most of the coefficients $A_{j}$ must be zero. This inspires the following definition.

Definition 1

Given a correlation function $K$ and input points $a_{1}<\cdots<a_{k}$, a non-zero function $\phi$ is called a kernel packet (KP) of degree $k$ if it admits the representation $\phi(x)=\sum_{j=1}^{k}A_{j}K(x,a_{j})$ and the support of $\phi$ is $[a_{1},a_{k}]$.

At first sight, it seems too optimistic to expect the existence of KPs. But, surprisingly, these functions do exist for one-dimensional Matérn correlation functions with half-integer smoothness. We will show that if the smoothness parameter $\nu$ is a half-integer, i.e., $\nu-1/2\in\mathbb{N}$, there is a KP of degree $k:=2\nu+2$ given any $k$ distinct input points.

For simplicity, we will use $k$ to parametrize the Matérn correlation; in other words, $\nu=(k-2)/2$ for $k=3,5,7,\ldots$. Let $\mathbf{a}=(a_{1},\ldots,a_{k})^{T}$ be a vector with $a_{1}<\cdots<a_{k}$. The goal is to find coefficients $A_{j}$ such that

\phi_{\mathbf{a}}(x):=\sum_{j=1}^{k}A_{j}K(x,a_{j}) (7)

is a KP. We will first find a necessary condition for the $A_{j}$'s, and then prove that this condition is also sufficient. We apply the Paley-Wiener theorem (see Lemma 15 in Appendix A and Stein and Shakarchi (2003)), which states that $\phi_{\mathbf{a}}(x)$ has a compact support only if the inverse Fourier transform of $\phi_{\mathbf{a}}$, denoted as $\tilde{\phi}_{\mathbf{a}}(x)$, can be extended to an entire function, i.e., a complex-valued function that is holomorphic on the whole complex plane. Let $i=\sqrt{-1}$. Our convention for the inverse Fourier transform is $\tilde{f}(\xi)=(2\pi)^{-1/2}\int_{-\infty}^{\infty}f(x)e^{i\xi x}dx$. Direct calculations show

\tilde{\phi}_{\mathbf{a}}(x)\propto\left[\sum_{j=1}^{k}A_{j}\exp\{ia_{j}x\}\right](c^{2}+x^{2})^{-(k-1)/2},\quad x\in\mathbb{R}.

Clearly, the analytic continuation of this function (up to a constant) is

\tilde{\phi}_{\mathbf{a}}(z)\propto\left[\sum_{j=1}^{k}A_{j}\exp\{ia_{j}z\}\right](c^{2}+z^{2})^{-(k-1)/2}\eqqcolon\gamma(z)(c^{2}+z^{2})^{-(k-1)/2},

and this function can be defined at any $z\in\mathbb{C}\setminus\{\pm ci\}$. Note that the function $(c^{2}+z^{2})^{-(k-1)/2}$ has poles at $z=\pm ci$, each with multiplicity $(k-1)/2$. According to the Paley-Wiener theorem, we have to make $\tilde{\phi}_{\mathbf{a}}(z)$ an entire function, which implies that $\gamma(\pm ci)=0$, each with multiplicity $(k-1)/2$ as well. This condition leads to a set of equations (this statement is formalized as Lemma 20 in Appendix B):

\gamma^{(j)}(ci)=0,\qquad\gamma^{(j)}(-ci)=0,

for $j=0,\ldots,(k-3)/2$, where $\gamma^{(j)}$ denotes the $j$th derivative of $\gamma$. Clearly, there are $k-1$ equations, which can be rewritten as

\sum_{j=1}^{k}A_{j}a_{j}^{l}\exp\{\delta ca_{j}\}=0, (8)

with $l=0,\ldots,(k-3)/2$ and $\delta=\pm 1$, which is a $(k-1)\times k$ linear system. All solutions to this system are real-valued vectors because all coefficients are real.

Next we study the properties of the linear system (8) and the corresponding $\phi_{\mathbf{a}}$. Theorem 2 states that $\phi_{\mathbf{a}}$ can be uniquely determined by (8) up to a multiplicative constant.

Theorem 2

If $a_{1},\ldots,a_{k}$ are distinct, the solution space of (8) is one-dimensional, i.e., there do not exist two linearly independent solutions to (8).

Another important property of (8) is that its solution is not affected by a shift of $\mathbf{a}$. Define $\mathbf{a}+t=(a_{1}+t,\ldots,a_{k}+t)^{T}$.

Theorem 3

The solution space of (8), as a function of $\mathbf{a}$, is invariant under the shift transformation $T_{t}(\mathbf{a})=\mathbf{a}+t$ for any $t\in\mathbb{R}$.

Remark 4

Theorem 3 suggests that we can apply a shift to $\mathbf{a}$ without affecting the solution space. It is worth noting that, although the solution space does not change theoretically, the condition number of the linear system (8) may change, which may significantly affect the numerical accuracy. In order to enhance the numerical stability in solving (8), we suggest standardizing $\mathbf{a}$ using the transformation $T_{t}(\mathbf{a})=\mathbf{a}+t$ such that $a_{1}+t=-(a_{k}+t)$, i.e., $t=-(a_{1}+a_{k})/2$. The same standardization technique is employed in the proof of Theorem 5.
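As a concrete illustration, the following minimal sketch (with $\omega=1$ and function names of our own choosing; this is not the authors' implementation) obtains the coefficients $A_{j}$ of a KP as the one-dimensional null space of the linear system (8), after the shift standardization of Remark 4, and then checks numerically that the resulting $\phi_{\mathbf{a}}$ vanishes outside $[a_{1},a_{k}]$:

```python
import numpy as np
from scipy.linalg import null_space

def matern_halfint(r, nu):
    """Matérn correlation (1) with omega = 1 for half-integer nu in {1/2, 3/2, 5/2}."""
    s = np.sqrt(2 * nu) * np.abs(r)
    if nu == 0.5:
        return np.exp(-s)
    if nu == 1.5:
        return (1 + s) * np.exp(-s)
    if nu == 2.5:
        return (1 + s + s ** 2 / 3) * np.exp(-s)
    raise ValueError("only nu = 1/2, 3/2, 5/2 are covered in this sketch")

def kp_coefficients(a, nu, omega=1.0):
    """Solve the (k-1) x k system (8); `a` holds k = 2*nu + 2 sorted, distinct points."""
    k = len(a)
    c = np.sqrt(2 * nu) / omega
    a = a - (a[0] + a[-1]) / 2                  # shift of Remark 4: a_1 + t = -(a_k + t)
    rows = [a ** l * np.exp(delta * c * a)      # rows indexed by (l, delta) as in (8)
            for delta in (1.0, -1.0) for l in range((k - 1) // 2)]
    return null_space(np.array(rows))[:, 0]     # one-dimensional null space (Theorem 2)

a = np.array([0.1, 0.25, 0.4, 0.6, 0.9])        # k = 5 points for Matérn-3/2
A = kp_coefficients(a, nu=1.5)
phi = lambda x: sum(Aj * matern_halfint(x - aj, 1.5) for Aj, aj in zip(A, a))
print(phi(-0.3), phi(1.5))   # approximately zero outside [a_1, a_k] (Theorem 5)
print(phi(0.5))              # non-zero inside the support
```

By Theorem 3, the coefficients computed from the shifted points are valid for the original points as well.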

Theorem 5 confirms that any non-zero $\phi_{\mathbf{a}}$ is indeed a KP.

Theorem 5

The support of any non-zero function $\phi_{\mathbf{a}}$ defined by (7) and (8) is $[a_{1},a_{k}]$.

In other words, we have the following Corollary 6.

Corollary 6

Let $K$ be a Matérn correlation with smoothness $\nu$. If $\nu$ is a half-integer, then $K$ admits a KP of degree $2\nu+2$. In addition, given $a_{1}<\cdots<a_{k}$, a function $\phi_{\mathbf{a}}$ of the form (7) is a KP if and only if the coefficients $A_{j}$ are given by a non-zero solution to (8).

Figure 1 illustrates that a linear combination of the $5$ components $\{K(\cdot,a_{j})\}_{j=1}^{5}$ produces a compactly supported KP for the Matérn-$3/2$ correlation function.

Figure 1: KP $\phi_{\mathbf{a}}$ (black line) for the Matérn-$3/2$ correlation function, together with its components $\{K(\cdot,a_{j})\}_{j=1}^{5}$.

It is evident that KPs are highly non-trivial and precious. Their existence relies on the correlation function. Theorem 7 shows that many other correlation functions do not admit any KP, and consequently, the proposed algorithm is not applicable to these correlations.

Theorem 7

The following correlation functions do not admit KPs:

  1. Any Matérn correlation function whose smoothness parameter is not a half-integer.

  2. Any Gaussian correlation function.

Theorem 8 shows that the KP constructed by (8) has the lowest degree.

Theorem 8

Let $K$ be a Matérn correlation function with half-integer smoothness $\nu$. Let $m$ be a positive integer with $m<2\nu+2$. Then any function of the form $\sum_{j=1}^{m}A_{j}K(\cdot,a_{j})$ does not have a compact support unless $A_{j}=0$ for all $j=1,\ldots,m$; in other words, there does not exist a KP of degree lower than $2\nu+2$.

2.2 One-sided Kernel Packets

Besides KPs, we need to introduce a set of functions to capture the “boundary effects” of Gaussian process regression. As before, let $\mathbf{a}=(a_{1},\ldots,a_{s})^{T}$ be a vector with $a_{1}<\cdots<a_{s}$. We consider the functions

\phi_{\mathbf{a}}(x):=\sum_{j=1}^{s}A_{j}K(x,a_{j}), (9)

with $(k+1)/2\leq s\leq k-1$ and a non-zero real vector $(A_{1},\ldots,A_{s})^{T}$. Theorem 8 then implies that $\phi_{\mathbf{a}}$ in (9) cannot have a compact support. Nevertheless, it is possible that the support of $\phi_{\mathbf{a}}$ is a half real line. In this case, we call $\phi_{\mathbf{a}}$ a one-sided KP. Specifically, we call $\phi_{\mathbf{a}}$ a right-sided KP if $\operatorname{supp}\phi_{\mathbf{a}}=[a_{1},+\infty)$, and we call $\phi_{\mathbf{a}}$ a left-sided KP if $\operatorname{supp}\phi_{\mathbf{a}}=(-\infty,a_{s}]$.

First we consider right-sided KPs. We propose to identify the $A_{j}$'s by solving

\sum_{j=1}^{s}A_{j}a_{j}^{l}\exp\{-ca_{j}\}=0,\quad\sum_{j=1}^{s}A_{j}a_{j}^{r}\exp\{ca_{j}\}=0, (10)

where $l=0,\ldots,(k-3)/2$ and the second set of equations in (10) comprises auxiliary equations for the case $s\geq(k+3)/2$, with $r=0,\ldots,s-(k+3)/2$. Similar to (8), (10) is an $(s-1)\times s$ linear system.

The following theorems describe the properties of the linear system (10) and the corresponding $\phi_{\mathbf{a}}$. Specifically, Theorem 11 confirms that $\phi_{\mathbf{a}}$ is indeed a right-sided KP.

Theorem 9

The solution space of (10) is one-dimensional provided that $a_{1},\ldots,a_{s}$ are distinct.

Theorem 10

The solution space of (10), as a function of $\mathbf{a}$, is invariant under the shift transformation $T_{t}(\mathbf{a})=\mathbf{a}+t$ for any $t\in\mathbb{R}$.

Theorem 11

The support of any non-zero function $\phi_{\mathbf{a}}$ defined by (9) and (10) is $[a_{1},+\infty)$.

Left-sided KPs are constructed similarly by solving the following equations:

\sum_{j=1}^{s}A_{j}a_{j}^{l}\exp\{ca_{j}\}=0,\quad\sum_{j=1}^{s}A_{j}a_{j}^{r}\exp\{-ca_{j}\}=0, (11)

where $l=0,\ldots,(k-3)/2$ and the second set of equations comprises auxiliary equations for the case $s\geq(k+3)/2$, with $r=0,\ldots,s-(k+3)/2$. The properties of left-sided KPs are analogous to those stated in Theorems 9-11, so we omit the statements.

Remark 12

As in Remark 4, we suggest applying a shift transformation to $\mathbf{a}$ before computing the $A_{j}$'s. Let $T_{t}(\mathbf{a})=(a^{\prime}_{1},\ldots,a^{\prime}_{s})^{T}$. We suggest using $T_{t}$ such that $a^{\prime}_{1}=0$ (i.e., $t=-a_{1}$) for right-sided KPs, and $a^{\prime}_{s}=0$ (i.e., $t=-a_{s}$) for left-sided KPs. The same shift is employed in the proof of Theorem 11.
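Analogously, here is a minimal sketch (again with a hypothetical function name and $\omega=1$; not the authors' code) of the boundary construction: the coefficients of a one-sided KP solve the $(s-1)\times s$ system (10) or (11) after the shift of Remark 12, where the left-sided system is obtained by flipping the sign of $c$ in both sets of equations.

```python
import numpy as np
from scipy.linalg import null_space

def one_sided_kp_coefficients(a, nu, side="right", omega=1.0):
    """Solve (10) (right-sided) or (11) (left-sided); `a` has s points, (k+1)/2 <= s <= k-1."""
    s, k = len(a), int(2 * nu + 2)
    c = np.sqrt(2 * nu) / omega
    sign = -1.0 if side == "right" else 1.0        # exp(-c a_j) in (10), exp(+c a_j) in (11)
    a = a - (a[0] if side == "right" else a[-1])   # shift of Remark 12: a'_1 = 0 or a'_s = 0
    rows = [a ** l * np.exp(sign * c * a) for l in range((k - 1) // 2)]        # l = 0,...,(k-3)/2
    rows += [a ** r * np.exp(-sign * c * a) for r in range(s - (k + 1) // 2)]  # auxiliary equations
    return null_space(np.array(rows))[:, 0]        # one-dimensional by Theorem 9

# Matérn-3/2 (k = 5): a right-sided KP on s = 4 points, supported on [a_1, +infinity)
A_right = one_sided_kp_coefficients(np.array([0.0, 0.2, 0.5, 0.7]), nu=1.5, side="right")
```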

2.3 Kernel Packet Basis

Let $x_{1}<\cdots<x_{n}$ be the input data, and let $K$ be a Matérn correlation function with half-integer smoothness. Suppose $n\geq k$. We can construct the following $n$ functions, as a subset of $\mathcal{K}$:

  1. $\phi_{1},\phi_{2},\ldots,\phi_{(k-1)/2}$, defined as the left-sided KPs $\phi_{(x_{1},\ldots,x_{(k+1)/2})},\phi_{(x_{1},\ldots,x_{(k+1)/2+1})},\ldots,\phi_{(x_{1},\ldots,x_{k-1})}$;

  2. $\phi_{(k+1)/2},\phi_{(k+1)/2+1},\ldots,\phi_{n-(k-1)/2}$, defined as the KPs $\phi_{(x_{1},\ldots,x_{k})},\phi_{(x_{2},\ldots,x_{k+1})},\ldots,\phi_{(x_{n-k+1},\ldots,x_{n})}$;

  3. $\phi_{n-(k-3)/2},\ldots,\phi_{n-1},\phi_{n}$, defined as the right-sided KPs $\phi_{(x_{n-k+2},\ldots,x_{n})},\ldots,\phi_{(x_{n-(k-1)/2-1},\ldots,x_{n})},\phi_{(x_{n-(k-1)/2},\ldots,x_{n})}$.

Note that the KPs and one-sided KPs given the input points cannot be uniquely defined; they are unique only up to a non-zero multiplicative factor. Here the choice of these factors is nonessential: the general theory and algorithms in this article are valid for any specific choice. We now present Theorem 13, which, together with the fact that the dimension of $\mathcal{K}$ is $n$, implies that $\{\phi_{j}\}_{j=1}^{n}$ forms a basis for $\mathcal{K}$, referred to as the KP basis.

Theorem 13

Let $x_{1}<\cdots<x_{n}$ be the input data and let the functions $\phi_{1},\ldots,\phi_{n}$ be constructed in the above manner. Then the basis functions $\{\phi_{j}\}_{j=1}^{n}$ are linearly independent in $\mathcal{K}$.

Further, it is straightforward to check via Theorems 5 and 11 that, for any $x\in\mathbb{R}$, the vector $\boldsymbol{\phi}(x)=\left(\phi_{1}(x),\ldots,\phi_{n}(x)\right)^{T}$ has at most $k-1$ non-zero entries. As a result, we have constructed a basis for $\mathcal{K}$ satisfying the two sparsity properties mentioned at the beginning of Section 2. Figure 2 illustrates KP bases corresponding to the Matérn-$3/2$ and Matérn-$5/2$ correlation functions with input points $\mathbf{X}=\{0.1,0.2,\ldots,1\}$.

Figure 2: KP basis functions corresponding to the Matérn-$3/2$ (left) and Matérn-$5/2$ (right) correlation functions with input points $\mathbf{X}=\{0.1,0.2,\ldots,1\}$. The KPs, left-sided KPs, and right-sided KPs are plotted in orange, blue, and green, respectively.

3 Kernel Packet Algorithms

In this section, we employ the KP bases to develop scalable algorithms for GP regression. In Sections 3.1 and 3.2, we present algorithms for one-dimensional GP regression with noiseless and noisy data, respectively. In Section 3.3, we generalize the one-dimensional algorithms to higher dimensions by applying the tensor and sparse grid techniques.

3.1 One-dimensional GP Regression with Noiseless Data

The theory in Section 2 shows that for one-dimensional problems, given ordered and distinct inputs $x_{1}<\cdots<x_{n}$, the correlation matrix $\mathbf{K}$ admits a sparse representation as

\mathbf{K}\mathbf{A}=\boldsymbol{\phi}(\mathbf{X}), (12)

where both $\mathbf{A}$ and $\boldsymbol{\phi}(\mathbf{X})$ are banded matrices. In (12), the $(l,j)^{\rm th}$ entry of $\boldsymbol{\phi}(\mathbf{X})$ is $\phi_{j}(x_{l})$. In view of the compact supports of the $\phi_{j}$'s, $\boldsymbol{\phi}(\mathbf{X})$ is a banded matrix with bandwidth $(k-3)/2$:

\boldsymbol{\phi}(\mathbf{X})=\begin{bmatrix}\ddots&&&&\\ \ddots&\phi_{j-\frac{k-3}{2}}(x_{j-2\frac{k-3}{2}})&&&\\ \ddots&\vdots&\ddots&&\\ &\phi_{j-\frac{k-3}{2}}(x_{j})&\cdots&\phi_{j+\frac{k-3}{2}}(x_{j})&\\ &&\ddots&\vdots&\ddots\\ &&&\phi_{j+\frac{k-3}{2}}(x_{j+2\frac{k-3}{2}})&\ddots\\ &&&&\ddots\end{bmatrix}.

The matrix $\mathbf{A}$ consists of the coefficients used to construct the KPs. In view of the sparse representation, $\mathbf{A}$ is a banded matrix with bandwidth $(k-1)/2$:

\mathbf{A}=\begin{bmatrix}\ddots&&&&\\ \ddots&A_{j-2\frac{k-1}{2},j-\frac{k-1}{2}}&&&\\ \ddots&\vdots&\ddots&&\\ &A_{j,j-\frac{k-1}{2}}&\cdots&A_{j,j+\frac{k-1}{2}}&\\ &&\ddots&\vdots&\ddots\\ &&&A_{j+2\frac{k-1}{2},j+\frac{k-1}{2}}&\ddots\\ &&&&\ddots\end{bmatrix}.

Computing $\mathbf{A}$ and $\boldsymbol{\phi}(\mathbf{X})$ takes $\mathcal{O}(k^{3}n)$ time, because the construction of each $\phi_{j}$ involves at most $k$ kernel basis functions and the time complexity for solving the coefficients $\{A_{w,j}:|w-j|\leq\frac{k-1}{2}\}$ satisfying equation (8), (10) or (11) is $\mathcal{O}(k^{3})$. The computational time $\mathcal{O}(k^{3}n)$ in this step dominates that of the next step, which is $\mathcal{O}(k^{2}n)$. However, if the design points are equally spaced, the KP coefficients given by (8) remain the same for every $k$ consecutive data points, so we only need to compute these values once, and the computational time of this step is only $\mathcal{O}(k^{4})$. In this case, the computation time of the next step, i.e., $\mathcal{O}(k^{2}n)$, dominates, provided that $k\ll n$.

Now we solve the GP regression problem by substituting the identity $K(\cdot,\mathbf{X})=\boldsymbol{\phi}(\cdot)\mathbf{A}^{-1}$ into (4) and (5) to obtain

\mathbb{E}\left[Y(x^{*})\big|\mathbf{Y}\right]=\mu(x^{*})+\boldsymbol{\phi}^{T}(x^{*})\left[\boldsymbol{\phi}(\mathbf{X})\right]^{-1}\left(\mathbf{Y}-\boldsymbol{\mu}\right), (13)
\mathrm{Var}\left[Y(x^{*})\big|\mathbf{Y}\right]=\sigma^{2}\left(K(x^{*},x^{*})-\boldsymbol{\phi}^{T}(x^{*})\left[\boldsymbol{\phi}(\mathbf{X})\right]^{-1}K(\mathbf{X},x^{*})\right). (14)

The key to GP regression now becomes calculating the vector $\left[\boldsymbol{\phi}(\mathbf{X})\right]^{-1}\mathbf{v}$ with $\mathbf{v}=\mathbf{Y}-\boldsymbol{\mu}$ or $\mathbf{v}=K(\mathbf{X},x^{*})$. This is equivalent to solving the sparse banded linear system $\boldsymbol{\phi}(\mathbf{X})\mathbf{s}=\mathbf{v}$. There exist quite a few sparse linear solvers that can handle this linear system efficiently. For example, the algorithm based on the LU decomposition in Davis (2006) can be applied to solve for $\mathbf{s}$ in $\mathcal{O}(k^{2}n)$ time. MATLAB provides convenient and efficient built-in functions, such as mldivide or decomposition, to solve sparse banded linear systems of this form.

It is worth noting that (13) can be executed in the following faster way when we need to evaluate $\mathbb{E}\left[Y(x^{*})|\mathbf{Y}\right]$ for many different $x^{*}$. First, we compute $\mathbf{s}:=\left[\boldsymbol{\phi}(\mathbf{X})\right]^{-1}\left(\mathbf{Y}-\boldsymbol{\mu}\right)$, which takes $\mathcal{O}(k^{2}n)$ time. Next we evaluate $\mu(x^{*})+\boldsymbol{\phi}^{T}(x^{*})\mathbf{s}$ for different $x^{*}$. As noted before, $\boldsymbol{\phi}^{T}(x^{*})$ has at most $k-1$ non-zero entries; see Figure 2. If we know which $k-1$ entries are non-zero, the second step takes only $\mathcal{O}(k)$ time. To find the non-zero entries, a general approach is to use a binary search, which takes $\mathcal{O}(\log n)$ time. Sometimes, these entries can be found in constant time. For example, if the design points are equally spaced, there exist explicit expressions for the indices of the non-zero entries; if we need to predict $x^{*}$ over a dense mesh (which is a typical task of surrogate modeling), we can use the indices of the non-zero entries for the previous point as an initial guess to find those for the current point.

Similar to the conditional inference, the log-likelihood function (6) can also be computed in $\mathcal{O}(k^{2}n)$ time. First, the log-determinant of $\mathbf{K}$ can be rewritten as $\log\det(\mathbf{K})=\log\det\left(\boldsymbol{\phi}(\mathbf{X})\right)-\log\det(\mathbf{A})$, according to identity (12). Because both $\mathbf{A}$ and $\boldsymbol{\phi}(\mathbf{X})$ are banded matrices, their determinants can be computed in $\mathcal{O}(k^{2}n)$ time by sequential methods (Kamgnia and Nguenang, 2014, Section 4.1). Second, the same method used for the conditional inference can be applied to compute $\left(\mathbf{Y}-\mathbf{F}\boldsymbol{\beta}\right)^{T}\mathbf{K}^{-1}\left(\mathbf{Y}-\mathbf{F}\boldsymbol{\beta}\right)$ in $\mathcal{O}(k^{2}n)$ time.
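The following minimal sketch (with a placeholder banded matrix standing in for $\boldsymbol{\phi}(\mathbf{X})$ and random data; not the authors' code) illustrates the linear-algebra step described above: factorize the banded matrix $\boldsymbol{\phi}(\mathbf{X})$ once with a sparse LU decomposition, then reuse the factorization both for $\mathbf{s}=[\boldsymbol{\phi}(\mathbf{X})]^{-1}(\mathbf{Y}-\boldsymbol{\mu})$ in (13) and for the log-determinant term in the likelihood.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n, half_bw = 1000, 1                  # half-bandwidth (k-3)/2 = 1 for Matérn-3/2 (k = 5)
rng = np.random.default_rng(0)

# Placeholder banded matrix standing in for phi(X); in the actual algorithm its
# entries are phi_j(x_l), assembled from the KP coefficients of Section 2.
offsets = list(range(-half_bw, half_bw + 1))
diagonals = [rng.standard_normal(n - abs(o)) for o in offsets]
diagonals[half_bw] += 5.0             # keep the placeholder well conditioned
Phi = sp.diags(diagonals, offsets=offsets, format="csc")

y_centered = rng.standard_normal(n)   # stands in for Y - mu

lu = splu(Phi)                        # sparse LU of a banded matrix: O(k^2 n)
s = lu.solve(y_centered)              # s = [phi(X)]^{-1}(Y - mu), used in (13)

# log|det(phi(X))| from the LU factors (L has a unit diagonal); combined with the
# analogous term for A this gives log det(K) = log det(phi(X)) - log det(A) from (12).
logabsdet_phi = np.sum(np.log(np.abs(lu.U.diagonal())))
```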

3.2 One-dimensional GP Regression with Noisy Data

Suppose we observe data $\mathbf{Z}$, a noisy version of $\mathbf{Y}$. Specifically, $Z(\mathbf{x}_{i})=Y(\mathbf{x}_{i})+\varepsilon$, where $\varepsilon\sim\mathcal{N}(0,\sigma^{2}_{Y})$. In this case, the covariance of the observed noisy responses is $\mathrm{Cov}\big(Z(\mathbf{x}_{i}),Z(\mathbf{x}_{j})\big)=\sigma^{2}K(\mathbf{x}_{i},\mathbf{x}_{j})+\sigma_{Y}^{2}\mathbb{I}(\mathbf{x}_{i}=\mathbf{x}_{j})$. In other words, the covariance matrix $\mathrm{Cov}(\mathbf{Z},\mathbf{Z})$ is $\sigma^{2}\mathbf{K}+\sigma_{Y}^{2}\mathbb{I}$, where $\mathbb{I}$ is the identity matrix. The posterior predictive distribution at a new point $\mathbf{x}^{*}$ is also normal, with the following conditional mean and variance:

\mathbb{E}\left[Y(\mathbf{x}^{*})\big|\mathbf{Z}\right]=\mu(\mathbf{x}^{*})+K(\mathbf{x}^{*},\mathbf{X})\left[\mathbf{K}+\frac{\sigma_{Y}^{2}}{\sigma^{2}}\mathbb{I}\right]^{-1}\left(\mathbf{Z}-\boldsymbol{\mu}\right), (15)
\mathrm{Var}\left[Y(\mathbf{x}^{*})\big|\mathbf{Z}\right]=\sigma^{2}\left(K(\mathbf{x}^{*},\mathbf{x}^{*})-K(\mathbf{x}^{*},\mathbf{X})\left[\mathbf{K}+\frac{\sigma_{Y}^{2}}{\sigma^{2}}\mathbb{I}\right]^{-1}K(\mathbf{X},\mathbf{x}^{*})\right), (16)

and the log-likelihood function given the data $\mathbf{Z}$ is:

L(\boldsymbol{\beta},\sigma^{2},\boldsymbol{\omega})=-\frac{1}{2}\left[\log\det(\sigma^{2}\mathbf{K}+\sigma^{2}_{Y}\mathbb{I})+\big(\mathbf{Z}-\mathbf{F}\boldsymbol{\beta}\big)^{T}\big[\sigma^{2}\mathbf{K}+\sigma_{Y}^{2}\mathbb{I}\big]^{-1}\big(\mathbf{Z}-\mathbf{F}\boldsymbol{\beta}\big)\right]. (17)

When the input $\mathbf{x}$ is one-dimensional, (15), (16), and (17) can be calculated in $\mathcal{O}(k^{2}n)$ time, as in the noiseless case, because the covariance matrix $\sigma^{2}\mathbf{K}+\sigma_{Y}^{2}\mathbb{I}$ admits the following factorization:

\sigma^{2}\mathbf{K}+\sigma_{Y}^{2}\mathbb{I}=\big(\sigma^{2}\boldsymbol{\phi}(\mathbf{X})+\sigma_{Y}^{2}\mathbf{A}\big)\mathbf{A}^{-1}. (18)

By substituting (18) and the identity $K(\cdot,\mathbf{X})=\boldsymbol{\phi}(\cdot)\mathbf{A}^{-1}$ into (15), (16), and (17), we obtain:

\mathbb{E}\left[Y(x^{*})\big|\mathbf{Z}\right]=\mu(x^{*})+\boldsymbol{\phi}^{T}(x^{*})\left[\boldsymbol{\phi}(\mathbf{X})+\frac{\sigma_{Y}^{2}}{\sigma^{2}}\mathbf{A}\right]^{-1}\left(\mathbf{Z}-\boldsymbol{\mu}\right), (19)
\mathrm{Var}\left[Y(x^{*})\big|\mathbf{Z}\right]=\sigma^{2}\left(K(x^{*},x^{*})-\boldsymbol{\phi}^{T}(x^{*})\left[\boldsymbol{\phi}(\mathbf{X})+\frac{\sigma_{Y}^{2}}{\sigma^{2}}\mathbf{A}\right]^{-1}K(\mathbf{X},x^{*})\right), (20)

and

L(\boldsymbol{\beta},\sigma^{2},\boldsymbol{\omega})=-\frac{1}{2}\bigg[\log\det\big(\sigma^{2}\boldsymbol{\phi}(\mathbf{X})+\sigma_{Y}^{2}\mathbf{A}\big)-\log\det(\mathbf{A})+\big(\mathbf{Z}-\mathbf{F}\boldsymbol{\beta}\big)^{T}\mathbf{A}\big[\sigma^{2}\boldsymbol{\phi}(\mathbf{X})+\sigma_{Y}^{2}\mathbf{A}\big]^{-1}\big(\mathbf{Z}-\mathbf{F}\boldsymbol{\beta}\big)\bigg]. (21)

We have shown that $\boldsymbol{\phi}(\mathbf{X})$ and $\mathbf{A}$ are banded matrices with bandwidths $(k-3)/2$ and $(k-1)/2$, respectively. Therefore, the matrix $\sigma^{2}\boldsymbol{\phi}(\mathbf{X})+\sigma_{Y}^{2}\mathbf{A}$ is also a banded matrix, with bandwidth $(k-1)/2$. The time complexity for computing this sum is $\mathcal{O}(kn)$. We can then use the algorithms for banded matrices introduced in Section 3.1 to compute (19), (20), and (21) with time complexity $\mathcal{O}(k^{2}n)$. Recall that the time complexities for computing $\boldsymbol{\phi}(\mathbf{X})$ and $\mathbf{A}$ are both $\mathcal{O}(k^{3}n)$. Therefore, in the noisy setting, the total time complexity for computing the posterior and the MLE is still $\mathcal{O}(k^{3}n)$, the same as in the noiseless case.
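For the noisy case, the only change to the sketch of Section 3.1 is the banded matrix being factorized: by (18)-(19), $\boldsymbol{\phi}(\mathbf{X})+\frac{\sigma_{Y}^{2}}{\sigma^{2}}\mathbf{A}$ takes the place of $\boldsymbol{\phi}(\mathbf{X})$. A minimal, self-contained sketch with placeholder banded matrices (not the authors' code):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 1000
rng = np.random.default_rng(1)

def placeholder_banded(n, half_bw, shift=5.0):
    """Random banded matrix; stands in for phi(X) or A from the KP construction."""
    offsets = list(range(-half_bw, half_bw + 1))
    diagonals = [rng.standard_normal(n - abs(o)) for o in offsets]
    diagonals[half_bw] += shift
    return sp.diags(diagonals, offsets=offsets, format="csc")

Phi = placeholder_banded(n, 1)        # bandwidth (k-3)/2 for phi(X), with k = 5
A = placeholder_banded(n, 2)          # bandwidth (k-1)/2 for A
sigma2, sigma2_Y = 1.0, 0.1
z_centered = rng.standard_normal(n)   # stands in for Z - mu

B = (Phi + (sigma2_Y / sigma2) * A).tocsc()   # banded with bandwidth (k-1)/2, see (19)
s = splu(B).solve(z_centered)                 # [phi(X) + (sigma_Y^2/sigma^2) A]^{-1}(Z - mu)
```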

3.3 Multi-dimensional KP

When the data are noiseless, the exact algorithm proposed in Section 3.1 can be used to solve multi-dimensional problems if the input points form a full grid or a sparse grid.

A full grid is defined as the Cartesian product of one-dimensional point sets: $\mathbf{X}^{\rm FG}=\times_{j=1}^{d}\mathbf{X}^{(j)}$, where each $\mathbf{X}^{(j)}$ denotes any one-dimensional point set. Assuming a separable correlation function (3) comprising $d$ one-dimensional Matérn correlation functions with half-integer smoothness, and inputs on a full grid $\mathbf{X}^{\rm FG}$, the covariance vector $K(\mathbf{x}^{*},\mathbf{X}^{\rm FG})$ and covariance matrix $\mathbf{K}$ decompose into Kronecker products of matrices over each input dimension (Saatçi, 2012; Wilson, 2014):

K(\mathbf{x}^{*},\mathbf{X}^{\rm FG})=\bigotimes_{j=1}^{d}K_{j}(x^{*}_{j},\mathbf{X}^{(j)})=\bigotimes_{j=1}^{d}\boldsymbol{\phi}^{T}_{j}(x_{j}^{*})\mathbf{A}^{-1}_{j}=\bigg(\bigotimes_{j=1}^{d}\boldsymbol{\phi}^{T}_{j}(x_{j}^{*})\bigg)\bigg(\bigotimes_{j=1}^{d}\mathbf{A}^{-1}_{j}\bigg), (22)
\mathbf{K}=\bigotimes_{j=1}^{d}K_{j}(\mathbf{X}^{(j)},\mathbf{X}^{(j)})=\bigotimes_{j=1}^{d}\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})\mathbf{A}^{-1}_{j}=\bigg(\bigotimes_{j=1}^{d}\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})\bigg)\bigg(\bigotimes_{j=1}^{d}\mathbf{A}^{-1}_{j}\bigg). (23)

When we compute the vector $K(\mathbf{x}^{*},\mathbf{X})\mathbf{K}^{-1}$, the matrix $\bigotimes_{j=1}^{d}\mathbf{A}^{-1}_{j}$ cancels, as in the one-dimensional case. Therefore, (4) and (5) can be expressed as

\mathbb{E}\left[Y(\mathbf{x}^{*})\big|\mathbf{Y}\right]=\mu(\mathbf{x}^{*})+\left(\bigotimes_{j=1}^{d}\boldsymbol{\phi}^{T}_{j}(x_{j}^{*})\right)\left(\bigotimes_{j=1}^{d}\left[\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})\right]^{-1}\right)\left(\mathbf{Y}-\boldsymbol{\mu}\right), (24)
\mathrm{Var}\left[Y(\mathbf{x}^{*})\big|\mathbf{Y}\right]=\sigma^{2}\bigg(K(\mathbf{x}^{*},\mathbf{x}^{*})-\prod_{j=1}^{d}\boldsymbol{\phi}^{T}_{j}(x_{j}^{*})\left[\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})\right]^{-1}K_{j}(\mathbf{X}^{(j)},x_{j}^{*})\bigg), (25)

and the log-likelihood function (6) becomes

L(\boldsymbol{\beta},\sigma^{2},\boldsymbol{\omega})=-\frac{1}{2}\bigg[n\log\sigma^{2}+\sum_{j=1}^{d}\frac{n}{n_{j}}\left(\log\det\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})-\log\det\mathbf{A}_{j}\right)+\frac{1}{\sigma^{2}}\big(\mathbf{Y}-\mathbf{F}\boldsymbol{\beta}\big)^{T}\left(\bigotimes_{j=1}^{d}\mathbf{A}_{j}\right)\left(\bigotimes_{j=1}^{d}\left[\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})\right]^{-1}\right)\big(\mathbf{Y}-\mathbf{F}\boldsymbol{\beta}\big)\bigg], (26)

where the $\boldsymbol{\phi}_{j}$'s are the KPs associated with the correlation function $K_{j}$ and point set $\mathbf{X}^{(j)}$, $\mathbf{A}_{j}$ is the coefficient matrix for constructing $\boldsymbol{\phi}_{j}$ defined in (12), $n=\prod_{j=1}^{d}n_{j}$ is the size of $\mathbf{X}^{\rm FG}$, and $n_{j}$ is the size of $\mathbf{X}^{(j)}$. We also note that the entries of the vector $\bigotimes_{j=1}^{d}\boldsymbol{\phi}_{j}(\cdot)$ are products of one-dimensional KPs. Therefore, similar to the one-dimensional case, $\bigotimes_{j=1}^{d}\boldsymbol{\phi}_{j}(\cdot)$ is a vector of compactly supported functions.

Figure 3: Product KP basis functions $\phi_{(.4\,.5\,.6\,.7\,.8)}(x_{1})\phi_{(.4\,.5\,.6\,.7\,.8)}(x_{2})$ corresponding to the Matérn-$3/2$ (left) and $\phi_{(.3\,.4\,.5\,.6\,.7\,.8\,.9)}(x_{1})\phi_{(.3\,.4\,.5\,.6\,.7\,.8\,.9)}(x_{2})$ corresponding to the Matérn-$5/2$ (right) correlation functions.

In (24)-(26), $\prod_{j=1}^{d}\boldsymbol{\phi}^{T}_{j}(x_{j}^{*})\left[\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})\right]^{-1}K_{j}(\mathbf{X}^{(j)},x_{j}^{*})$ and $\sum_{j=1}^{d}\frac{n}{n_{j}}\left(\log\det\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})-\log\det\mathbf{A}_{j}\right)$ can clearly be computed in $\mathcal{O}(\sum_{j=1}^{d}k^{3}n_{j})$ time. The computation of $\left(\bigotimes_{j=1}^{d}\left[\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})\right]^{-1}\right)\mathbf{v}$ for $\mathbf{v}\in\mathbb{R}^{n}$ is known as Kronecker product least squares (KLS), which has been extensively studied; see, e.g., Graham (2018); Fausett and Fulton (1994). Essentially, the task can be accomplished by sequentially solving $d$ problems, where the $j^{\rm th}$ problem is to solve $n/n_{j}$ independent banded linear systems with $n_{j}$ variables. With a banded solver, the total time complexity is $\mathcal{O}(dk^{3}n)$.
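A minimal sketch of this sequential-solve idea is given below (dense solves are used for brevity and the function name is ours; in the actual algorithm each per-dimension solve would exploit the banded structure of $\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})$):

```python
import numpy as np
from scipy.linalg import solve

def kron_solve(Bs, v):
    """Solve (B_1 kron ... kron B_d) s = v by one sweep of small solves per dimension."""
    shape = [B.shape[0] for B in Bs]
    T = np.asarray(v).reshape(shape)          # reshape v into an n_1 x ... x n_d tensor
    for j, B in enumerate(Bs):
        T = np.moveaxis(T, j, 0)              # bring dimension j to the front
        front_shape = T.shape
        # one factorization of B, n/n_j right-hand sides of size n_j
        T = solve(B, T.reshape(front_shape[0], -1)).reshape(front_shape)
        T = np.moveaxis(T, 0, j)
    return T.reshape(-1)

# Consistency check against an explicit Kronecker product on a tiny example.
rng = np.random.default_rng(0)
B1 = rng.standard_normal((3, 3)) + 3 * np.eye(3)
B2 = rng.standard_normal((4, 4)) + 3 * np.eye(4)
v = rng.standard_normal(12)
s = kron_solve([B1, B2], v)
assert np.allclose(np.kron(B1, B2) @ s, v)
```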

Although full grid designs result in simple and fast computation for GP regression, their sizes increase exponentially in the dimension. When the dimension is large, another class of grid-based designs, called sparse grids, can be practically more useful. Let $\mathbf{X}_{\mathbf{l}}$ denote a full grid of the form $\mathbf{X}_{\mathbf{l}}=\times_{j=1}^{d}\mathbf{X}_{l_{j}}$, where $l_{j}\in\mathbb{N}$ and $\mathbf{X}_{l_{j}}$ is a one-dimensional point set satisfying the nested structure $\emptyset=\mathbf{X}_{0}\subseteq\mathbf{X}_{1}\subseteq\cdots\subseteq\mathbf{X}_{l_{j}}$ for each $j$. A sparse grid of level $\eta$ is defined as a union of full grids $\mathbf{X}_{\mathbf{l}}$: $\mathbf{X}^{\rm SG}_{\eta}=\bigcup_{|\mathbf{l}|\leq\eta+d-1}\mathbf{X}_{\mathbf{l}}$, where $|\mathbf{l}|:=\sum_{j=1}^{d}l_{j}$. GP regression on sparse grid designs was first discussed in Plumlee (2014). According to Algorithm 1 in Plumlee (2014), together with (24) and (25), GP regression on $\mathbf{X}^{\rm SG}_{\eta}$ admits the expression

\mathbb{E}\left[Y(\mathbf{x}^{*})\big|\mathbf{Y}\right]=\mu(\mathbf{x}^{*})+\sum_{|\mathbf{l}|=\max\{d,\eta-d+1\}}^{\eta}(-1)^{\eta-|\mathbf{l}|}{d-1\choose\eta-|\mathbf{l}|}\bar{f}_{\mathbf{l}}(\mathbf{x}^{*}), (27)
\mathrm{Var}\left[Y(\mathbf{x}^{*})\big|\mathbf{Y}\right]=\sigma^{2}\bigg(K(\mathbf{x}^{*},\mathbf{x}^{*})-\sum_{|\mathbf{l}|=\max\{d,\eta-d+1\}}^{\eta}(-1)^{\eta-|\mathbf{l}|}{d-1\choose\eta-|\mathbf{l}|}\bar{K}_{\mathbf{l}}(\mathbf{x}^{*},\mathbf{x}^{*})\bigg), (28)

where

\bar{f}_{\mathbf{l}}(\mathbf{x}^{*}):=\left(\bigotimes_{j=1}^{d}\boldsymbol{\phi}^{T}_{l_{j}}(x_{j}^{*})\right)\left(\bigotimes_{j=1}^{d}\left[\boldsymbol{\phi}_{l_{j}}(\mathbf{X}_{l_{j}})\right]^{-1}\right)\left(\mathbf{Y}_{\mathbf{l}}-\boldsymbol{\mu}_{\mathbf{l}}\right),
\bar{K}_{\mathbf{l}}(\mathbf{x}^{*},\mathbf{x}^{*}):=\prod_{j=1}^{d}\boldsymbol{\phi}^{T}_{l_{j}}(x_{j}^{*})\left[\boldsymbol{\phi}_{l_{j}}(\mathbf{X}_{l_{j}})\right]^{-1}K_{j}(\mathbf{X}_{l_{j}},x_{j}^{*}),

come from (24) and (25), respectively; $\mathbf{Y}_{\mathbf{l}}$ and $\boldsymbol{\mu}_{\mathbf{l}}$ denote the sub-vectors of $\mathbf{Y}$ and $\boldsymbol{\mu}$ on the full grid $\mathbf{X}_{\mathbf{l}}$, respectively. Based on Theorem 1 and Algorithm 2 in Plumlee (2014) and (26), $\log\det\mathbf{K}$ and $(\mathbf{Y}-\mathbf{F}\boldsymbol{\beta})^{T}\mathbf{K}^{-1}(\mathbf{Y}-\mathbf{F}\boldsymbol{\beta})$ in the log-likelihood function (6) can be decomposed as the following linear combinations, respectively:

\sum_{|\mathbf{l}|=\max\{d,\eta-d+1\}}^{\eta}\sum_{j=1}^{d}\bigg(\log\frac{\det\boldsymbol{\phi}_{l_{j}}(\mathbf{X}_{l_{j}})}{\det\boldsymbol{\phi}_{l_{j}-1}(\mathbf{X}_{l_{j}-1})}-\log\frac{\det\mathbf{A}_{l_{j}}}{\det\mathbf{A}_{l_{j}-1}}\bigg)\prod_{w\neq j}(n_{l_{w}}-n_{l_{w}-1}), (29)
\sum_{|\mathbf{l}|=\max\{d,\eta-d+1\}}^{\eta}\frac{(-1)^{\eta-|\mathbf{l}|}}{\sigma^{2}}{d-1\choose\eta-|\mathbf{l}|}\big(\mathbf{Y}_{\mathbf{l}}-\mathbf{F}_{\mathbf{l}}\boldsymbol{\beta}\big)^{T}\left(\bigotimes_{j=1}^{d}\mathbf{A}_{l_{j}}\right)\left(\bigotimes_{j=1}^{d}\left[\boldsymbol{\phi}_{l_{j}}(\mathbf{X}_{l_{j}})\right]^{-1}\right)\big(\mathbf{Y}_{\mathbf{l}}-\mathbf{F}_{\mathbf{l}}\boldsymbol{\beta}\big), (30)

where $\boldsymbol{\phi}_{0}(\mathbf{X}_{0})=\mathbf{A}_{0}=1$ and $\mathbf{F}_{\mathbf{l}}$ denotes the sub-matrix of $\mathbf{F}$ on the full grid $\mathbf{X}_{\mathbf{l}}$.

The above idea of direct computation fails to work for noisy data, because the Kronecker product structure of the covariance matrices breaks down due to the noise. Nonetheless, conjugate gradient methods can be implemented efficiently in the presence of the KP factorization (12). We defer the details to Section 5.

4 Numerical Experiments

We first conduct numerical experiments to assess the performance of the proposed algorithm for grid-based designs on test functions in Section 4.1. Next we apply the proposed method to one-dimensional real datasets in Section 4.2 to further assess its performance.

4.1 Grid-based Designs

4.1.1 Full Grid Designs

We test our algorithm on the following deterministic function:

f(\mathbf{x})=\sin(12\pi x_{1})+\sin(12\pi x_{2}),\quad\mathbf{x}\in(0,1)^{2}.

Samples of $f$ are collected from a level-$\eta$ full grid design: $\mathbf{X}^{\mathsf{FG}}_{\eta}=\times_{j=1}^{2}\{2^{-\eta},2\cdot 2^{-\eta},\ldots,1-2^{-\eta}\}$ with $\eta=5,6,\ldots,13$. The proposed KP algorithm is applied for GP regression using a product Matérn correlation function. We choose the same correlation function in each dimension, with $\omega=1$ and either $\nu=3/2$ or $5/2$. We investigate the mean squared error (MSE) and the average computational time over 1000 random test points for each prediction resulting from KP and the following approximate or fast GP regression algorithms with fixed correlation functions.

  1. laGP R package (https://bobby.gramacy.com/r_packages/laGP/). In each experiment, laGP is run under the Gaussian covariance family, the only covariance family supported by the package; the size of the local subset is set to 100.

  2. Inducing Points provided in the GPML toolbox (Rasmussen and Nickisch, 2010). The number of inducing points $m$ is set as $m=\sqrt{n}$, which is the choice achieving the optimal approximation power for the Matérn-$5/2$ correlation according to Burt et al. (2019). However, if the algorithm crashes due to a large sample size, $m$ is reduced to a level at which the algorithm can run properly. We consider Matérn-$3/2$ and $5/2$ correlations.

  3. RFF to approximate the Matérn-$3/2$ and Matérn-$5/2$ correlation functions by feature functions $\bigl[\frac{1}{\sqrt{m}}\cos(\gamma_{i}x+b_{i})\bigr]_{i=1}^{m}$, where $m=\sqrt{n}$, $\{\gamma_{i}\}_{i=1}^{m}$ are independent and identically distributed (i.i.d.) samples from $t$-distributions with three and five degrees of freedom, respectively, and $\{b_{i}\}_{i=1}^{m}$ are i.i.d. samples from the uniform distribution on $[0,2\pi]$. If the algorithm crashes due to a large sample size, $m$ is reduced to a level at which the algorithm can run properly.

  4. Toeplitz system solver, which incorporates the one-dimensional Toeplitz method and the Kronecker product technique. We consider Matérn-$3/2$ and $5/2$ correlations. In this experiment we use equally spaced design points, so that the Toeplitz method can work.

Figure 4: Logarithm of the MSE for predictions with the Matérn-$3/2$ correlation function (left) and the Matérn-$5/2$ correlation function (middle), and logarithm of the averaged computational time (right). laGP uses the Gaussian covariance family in both the left and the middle panels. No results are shown for the cases where a runtime error occurs or the prediction error ceases to improve.

We sample 1000 i.i.d. test points uniformly from (0,1)^{2} for each experimental trial. Figure 4 compares the MSE and the computational time of all algorithms, both on logarithmic scales, for sample sizes 2^{2j}, j=5,6,\ldots,13.

The performance curves of some algorithms in Figure 4 are incomplete, because these algorithms fail at a certain sample size due to a runtime error, or because their prediction MSE ceases to improve. In such cases, we stop the subsequent experimental trials with larger sample sizes for these algorithms. Specifically, laGP breaks down due to runtime errors for sample sizes larger than 2^{16}. The MSE of the Toeplitz method ceases to improve at sample size 2^{20}. For sample sizes larger than 2^{20}, the number of random features for RFF and the number of inducing points are both fixed at m=2^{10} for the subsequent trials; otherwise, both methods break down because the approximated covariance matrices are nearly singular. Because of their fixed m's, RFF and the inducing points method show no noticeable improvement for sample sizes larger than 2^{18}. In contrast, KP can handle designs with up to 2^{26} (more than 67 million) grid points.

Figure 4 shows that KP has the lowest MSE and the shortest computational time in all experimental trials. The inducing points method, laGP, and RFF have similar MSEs in all trials. The Toeplitz and KP algorithms, which compute the GP regression exactly, outperform the approximation methods.

4.1.2 Sparse Grid Designs

We test our algorithm on the Griewank function (Molga and Smutnicki, 2005), defined as

f(𝐱)=j=1dxj24000j=1dcos(xjj)+1,𝐱(2,2)d,f(\mathbf{x})=\sum_{j=1}^{d}\frac{x_{j}^{2}}{4000}-\prod_{j=1}^{d}\cos\left(\frac{x_{j}}{\sqrt{j}}\right)+1,\quad\mathbf{x}\in(-2,2)^{d},

with d=10 and d=20, respectively. Samples of f are collected from a level-\eta sparse grid design (\eta=3,4,\ldots,7). We consider a constant mean \mu(\mathbf{x})=\beta and Matérn correlations with \nu=3/2,5/2 and a single scale parameter \omega for all dimensions. We treat the mean \beta, variance \sigma^{2}, and scale \omega as unknown parameters and use the MLE-predictor. We compare the performance of the proposed KP algorithm with that of the direct method for GP regression on the sparse grids given in Plumlee (2014).

We sample 1000 i.i.d. test points uniformly from the input space for each experimental trial and estimate the mean squared error from these points. In each trial, the mean squared errors of KP and the direct method are of the same order, and their differences are within \pm 10^{-10}. This is because both methods compute the MLE-predictor exactly; it also confirms the numerical correctness of the proposed method.

Figure 5: Logarithm of the computational time for MLE-predictions with the Matérn-3/2 and Matérn-5/2 correlation functions and 10- and 20-dimensional inputs.

Figure 5 compares the logarithm of the computational time needed to estimate the unknown parameters \beta, \sigma^{2}, and \omega and to make predictions on the test points. The proposed KP algorithm has a significant advantage in computational time. When there are more than 10^{5} training points, the KP algorithm is at least twice as fast as the direct method in each trial.

4.2 Real Datasets

In this section, we assess the performance of the proposed algorithm on two real-world datasets: the Mauna Loa CO2\text{CO}_{2} dataset (Keeling and Whorf, 2005) and the intraday stock prices of Apple Inc.

4.2.1 CO2\text{CO}_{2} Data Interpolation

This dataset consists of the monthly average atmospheric \text{CO}_{2} concentrations at the Mauna Loa Observatory in Hawaii over the last sixty years. The dataset has 767 data points in total and features an overall upward trend and a yearly cycle.

We fit the data using GP models accelerated by the KP, inducing points, and RFF methods, respectively. For the proposed KP method, we consider a constant mean \mu(\mathbf{x})=\beta, a single scale parameter \omega, and Matérn correlations with \nu=3/2 and 5/2, respectively. For the inducing points method and RFF, we also consider a constant mean \mu(\mathbf{x})=\beta; different from KP, we use Gaussian correlations with scale parameter \omega for these two methods. The number of inducing points is set to 100, and the number of random features is set to 30 for RFF. For all algorithms, we treat the mean \beta, variance \sigma^{2}, and scale \omega as unknown parameters and use the MLE-predictor. For each algorithm, we compute the conditional mean and standard deviation on 2000 test points and plot the predictive curve. To evaluate the speed of training and prediction, we record the elapsed training time and calculate the average time for a new prediction.
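For reference, the sketch below spells out the standard direct \mathcal{O}(n^{3}) computation of the constant-mean MLE-predictor with a Matérn-3/2 correlation described above. It is not the KP algorithm; it only implements the exact formulas that KP evaluates at linear cost. The jitter term, the search interval for \log\omega, and the function names are our own choices, and the predictive variance uses plug-in estimates without the correction for estimating \beta.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize_scalar

def matern32(d, omega):
    """Matern-3/2 correlation at distance d with scale parameter omega."""
    a = np.sqrt(3.0) * np.abs(d) / omega
    return (1.0 + a) * np.exp(-a)

def profile_neg_loglik(log_omega, x, y):
    """Negative log-likelihood with beta and sigma^2 profiled out."""
    n = len(x)
    R = matern32(x[:, None] - x[None, :], np.exp(log_omega)) + 1e-10 * np.eye(n)
    c = cho_factor(R)
    one = np.ones(n)
    beta = (one @ cho_solve(c, y)) / (one @ cho_solve(c, one))
    resid = y - beta
    sigma2 = resid @ cho_solve(c, resid) / n
    logdet = 2.0 * np.sum(np.log(np.diag(c[0])))
    return 0.5 * (n * np.log(sigma2) + logdet)

def mle_predict(x, y, x_new):
    """Estimate omega by MLE and return the predictive mean and std. dev."""
    x, y, x_new = map(np.asarray, (x, y, x_new))
    res = minimize_scalar(profile_neg_loglik, bounds=(-5.0, 5.0),
                          args=(x, y), method="bounded")
    omega = np.exp(res.x)
    n = len(x)
    R = matern32(x[:, None] - x[None, :], omega) + 1e-10 * np.eye(n)
    c = cho_factor(R)
    one = np.ones(n)
    beta = (one @ cho_solve(c, y)) / (one @ cho_solve(c, one))
    resid = y - beta
    sigma2 = resid @ cho_solve(c, resid) / n
    r = matern32(x_new[:, None] - x[None, :], omega)       # cross-correlations
    mean = beta + r @ cho_solve(c, resid)
    quad = np.sum(r * cho_solve(c, r.T).T, axis=1)         # r_i^T R^{-1} r_i
    sd = np.sqrt(np.maximum(sigma2 * (1.0 - quad), 0.0))
    return mean, sd
```

The Cholesky factorization above is exactly the \mathcal{O}(n^{3}) bottleneck that the KP factorization replaces with banded solves.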

The training and prediction times of each algorithm are shown in Table 2. The KP methods with Matérn-3/2 and Matérn-5/2 are faster than the inducing points method and RFF. The predictive curve given by each algorithm is shown in Figure 6. Both KP methods interpolate the data adequately from 1960 to 2020 with accurate conditional standard deviations. In contrast, the inducing points method and RFF fail to interpolate the data, because their numbers of feature functions are smaller than the number of observations; this results in predictive curves with larger standard deviations.

Algorithm | \text{CO}_{2}: T_{\text{train}} (sec) | \text{CO}_{2}: T_{\text{pred}} (10^{-3} sec) | Stock Price: T_{\text{train}} (sec) | Stock Price: T_{\text{pred}} (10^{-3} sec)
KP Matérn-3/2 | 0.18 \pm 0.13 | 2.84 \pm 0.96 | 0.37 \pm 0.11 | 2.83 \pm 1.07
KP Matérn-5/2 | 0.23 \pm 0.17 | 3.31 \pm 1.22 | 0.44 \pm 0.27 | 3.54 \pm 1.73
Inducing Points | 0.28 \pm 0.09 | 5.26 \pm 1.56 | 0.58 \pm 0.13 | 9.88 \pm 3.37
RFF | 0.25 \pm 0.12 | 3.34 \pm 1.39 | 0.50 \pm 0.26 | 7.42 \pm 0.99
Table 2: Comparison of training and prediction times on the \text{CO}_{2} and stock price datasets (mean \pm standard deviation).
Figure 6: Observations of the monthly \text{CO}_{2} concentration (first row) and its interpolation by KP with Matérn-3/2 (second row), KP with Matérn-5/2 (third row), Inducing Points (fourth row), and RFF (fifth row). The blue curves show the predictions and the red areas mark \pm 1 standard deviation.

4.2.2 Stock Price Regression

This dataset consists of the intraday stock prices of Apple Inc. from January 2009 to April 2011. The dataset has 1259 data points in total. We assume that the data points are corrupted by noise, so they are randomly distributed around some underlying trend. In this experiment, our goal is to reconstruct the underlying trend via GP regression.

Similar to Section 4.2.1, we run KP with Matérn-3/2 and Matérn-5/2 correlations on the dataset and use the inducing points method and RFF as benchmark algorithms. The settings of all algorithms are exactly the same as in Section 4.2.1, except that the number of inducing points is set to 200 and the number of random features is set to 100 for RFF. We further treat the noise variance parameter \sigma_{Y}^{2} as an unknown parameter and use the MLE-predictor. We also record the elapsed training times and calculate the average time for a new prediction.

Figure 7: Stock price regression by KP Matérn-3/2 (upper left), KP Matérn-5/2 (upper right), inducing points (lower left), and RFF (lower right). The blue curves show the predictions, the red areas mark \pm 1 standard deviation, and the red dots mark the observations.

Similar to the previous experiment, Table 2 shows that the KP methods are more efficient than the inducing points method and RFF in both training and prediction. The predictive curve given by each algorithm is shown in Figure 7. Both KP methods successfully capture the local changes of the overall trend, while the inducing points method and RFF fail to do so. This is because neither inducing points nor RFF uses enough feature functions to reconstruct curves with highly localized fluctuations; as a result, their predictive curves are overly smooth, and a large number of data points fall outside their \pm 1 standard deviation bands.

5 Possible Extensions

Although the primary focus of this article is on exact algorithms, we would like to mention the potential of combining the proposed method with existing approximate algorithms. In this section, we will briefly discuss how to use the conjugate gradient method in the presence of the KP factorization (12) to accommodate a broader class of multi-dimensional Gaussian process regression problems.

5.1 Multi-dimensional GP Regression with Noisy Data

Suppose the input points lie in a full grid \mathbf{X}^{\rm FG} and the observed data \mathbf{Z} are noisy: Z(\mathbf{x}_{i})=Y(\mathbf{x}_{i})+\varepsilon, where \varepsilon\sim\mathcal{N}(0,\sigma^{2}_{Y}). Following arguments similar to those in Sections 3.2 and 3.3, we can show that the following matrix operations are essential for computing the posterior and the MLE:

[j=1dϕj(𝐗(j))+σY2σ2j=1d𝐀j]1𝕧\displaystyle\left[\bigotimes_{j=1}^{d}\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})+\frac{\sigma_{Y}^{2}}{\sigma^{2}}\bigotimes_{j=1}^{d}\mathbf{A}_{j}\right]^{-1}\mathbb{v} (31)
logdet(j=1dϕj(𝐗(j))+σY2σ2j=1d𝐀j)\displaystyle\log\det\left(\bigotimes_{j=1}^{d}\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})+\frac{\sigma_{Y}^{2}}{\sigma^{2}}\bigotimes_{j=1}^{d}\mathbf{A}_{j}\right) (32)

for some \mathbb{v}\in\mathbb{R}^{n}. The direct Kronecker product approach fails in this scenario because the additive noise breaks the tensor product structure. Nonetheless, conjugate gradient methods, such as those implemented in GPyTorch (Gardner et al., 2018) or MATLAB (Barrett et al., 1994), can be employed to solve (31) and (32) efficiently, because conjugate gradient methods require nothing more than multiplications of the covariance matrix with vectors. In our case, both \{\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)})\} and \{\mathbf{A}_{j}\} are banded matrices, so \bigotimes_{j=1}^{d}\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)}) and \bigotimes_{j=1}^{d}\mathbf{A}_{j}, being Kronecker products of banded matrices, have only \mathcal{O}(n) non-zero entries. Therefore, the costs of multiplying \bigotimes_{j=1}^{d}\boldsymbol{\phi}_{j}(\mathbf{X}^{(j)}) and \bigotimes_{j=1}^{d}\mathbf{A}_{j} with a vector both scale linearly with the number of points in the grid. If the input points lie in a sparse grid, the posterior and MLE conditional on noisy data can also be computed efficiently, because they can be decomposed as linear combinations of posteriors and MLEs on full grids, as shown in Section 3.3.
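The following is a minimal sketch of this idea: a Kronecker matrix-vector product that never forms the n\times n matrix, wrapped in a LinearOperator and passed to SciPy's conjugate gradient solver. The tridiagonal factors produced by banded_spd are only symmetric positive-definite stand-ins for \boldsymbol{\phi}_{j}(\mathbf{X}^{(j)}) and \mathbf{A}_{j} (so that CG is guaranteed to converge in this toy example); the actual factors come from the KP factorization (12), and in general one may need to symmetrize the system or use another Krylov method.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

def kron_matvec(factors, v):
    """Compute (F_1 kron F_2 kron ... kron F_d) @ v without forming the
    Kronecker product; the cost is linear in len(v) for banded factors."""
    x = np.asarray(v, dtype=float)
    n = x.size
    for F in factors:
        m = F.shape[0]
        x = (F @ x.reshape(m, n // m)).T.reshape(-1)
    return x

def banded_spd(m, rng):
    """Tridiagonal SPD stand-in for a banded KP factor matrix."""
    main = 2.0 + rng.random(m)
    off = -0.5 * np.ones(m - 1)
    return sp.diags([off, main, off], offsets=[-1, 0, 1], format="csr")

rng = np.random.default_rng(0)
dims = [50, 40, 30]                          # a 3-dimensional full grid
Phi = [banded_spd(m, rng) for m in dims]     # stand-ins for phi_j(X^(j))
A = [banded_spd(m, rng) for m in dims]       # stand-ins for A_j
n = int(np.prod(dims))
tau = 0.1                                    # noise-to-signal ratio sigma_Y^2 / sigma^2

# Matrix-vector product of the system matrix in (31), assembled on the fly.
op = LinearOperator((n, n), dtype=float,
                    matvec=lambda v: kron_matvec(Phi, v) + tau * kron_matvec(A, v))

rhs = rng.standard_normal(n)
sol, info = cg(op, rhs)                      # info == 0 indicates convergence
```

Each call to kron_matvec costs \mathcal{O}(n) for banded factors, so the per-iteration cost of the solver is linear in the grid size.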

5.2 Additive Covariance Functions

Suppose the GP is equipped with the following additive covariance function:

K(𝐱,𝐱)=jk,j(xj,xj)K(\mathbf{x},\mathbf{x}^{\prime})=\sum_{\mathcal{I}\in\mathcal{F}}\prod_{j\in\mathcal{I}}k_{\mathcal{I},j}(x_{j},x^{\prime}_{j}) (33)

where \mathcal{F} is any subset of the power set of \{1,2,\ldots,d\} and k_{\mathcal{I},j} is any one-dimensional Matérn correlation with half-integer smoothness. When the input points lie in a full grid \mathbf{X}^{\rm FG}, the posterior and MLE can also be computed efficiently using conjugate gradient methods. In this case, the covariance matrix can be written in the following form:

𝐊=[jϕ,j(𝐗(j))][j𝐀,j1].\mathbf{K}=\sum_{\mathcal{I}\in\mathcal{F}}\left[\bigotimes_{j\in\mathcal{I}}\boldsymbol{\phi}_{\mathcal{I},j}(\mathbf{X}^{(j)})\right]\left[\bigotimes_{j\in\mathcal{I}}\mathbf{A}_{\mathcal{I},j}^{-1}\right]. (34)

The matrix-vector product \mathbf{K}\mathbb{v} can be computed in time linear in the size of \mathbf{X}^{\rm FG} for any vector \mathbb{v}\in\mathbb{R}^{n}. First, for each \mathcal{I}\in\mathcal{F}, we use the KLS techniques introduced in Section 3.3 to compute \mathbb{v}^{\prime}=\big[\bigotimes_{j\in\mathcal{I}}\mathbf{A}^{-1}_{\mathcal{I},j}\big]\mathbb{v}, which has time complexity linear in the size of \mathbf{X}^{\rm FG}. Then we compute \big[\bigotimes_{j\in\mathcal{I}}\boldsymbol{\phi}_{\mathcal{I},j}(\mathbf{X}^{(j)})\big]\mathbb{v}^{\prime}, which has the same time complexity, and sum the results over \mathcal{I}\in\mathcal{F}. Similar to Section 5.1, efficient algorithms also exist when the input points lie in a sparse grid.
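A sketch of this two-step multiplication for the terms of (34) is given below. It reuses the Kronecker trick from the sketch in Section 5.1, replacing multiplication by \mathbf{A}_{\mathcal{I},j} with a sparse LU solve so that \bigl[\bigotimes_{j\in\mathcal{I}}\mathbf{A}_{\mathcal{I},j}^{-1}\bigr]\mathbb{v} is applied dimension by dimension; the factor matrices are placeholders, and the bookkeeping for dimensions outside \mathcal{I} (whose one-dimensional factors are constant) is omitted.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def kron_apply(ops, v):
    """Apply (op_1 kron ... kron op_d) to v; each entry of `ops` is a pair
    (fn, size) where fn acts column-wise on an (n_j, k) matrix
    (either a matrix product or a banded solve)."""
    x = np.asarray(v, dtype=float)
    n = x.size
    for fn, m in ops:
        x = fn(x.reshape(m, n // m)).T.reshape(-1)
    return x

def _matmul(M):
    return lambda X: M @ X

def term_matvec(Phis, As, v):
    """One term of (34): [kron_j Phi_{I,j}] [kron_j A_{I,j}^{-1}] v."""
    solves = [(splu(sp.csc_matrix(A)).solve, A.shape[0]) for A in As]
    mults = [(_matmul(P), P.shape[0]) for P in Phis]
    w = kron_apply(solves, v)     # banded solves apply kron_j A_{I,j}^{-1}
    return kron_apply(mults, w)   # then multiply by kron_j Phi_{I,j}

def additive_cov_matvec(terms, v):
    """K v for the additive covariance (33)-(34); `terms` is a list of
    (Phis, As) pairs, one pair of factor lists per index set I in F."""
    return sum(term_matvec(Phis, As, v) for Phis, As in terms)
```

The resulting matvec can be wrapped in a LinearOperator and passed to a Krylov solver exactly as in Section 5.1.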

6 Conclusions and Discussion

In this work, we propose a fast and exact algorithm for one-dimensional Gaussian process regression under Matérn correlations with half-integer smoothness. The proposed method can be applied to some multi-dimensional problems by using tensor product techniques, including full grid and sparse grid designs and their generalizations (Plumlee et al., 2021). With a simple modification, the proposed algorithm can also accommodate noisy data. If the design is not grid-based, the proposed algorithm is not applicable; in that case, we may apply the idea of Ding et al. (2020) to develop approximate algorithms, which work not only for regression problems but also for other types of supervised learning tasks.

Another direction for future work is to establish the relationship between KP and the state-space approaches. The latter leverage the Gauss-Markov process representation of certain GPs, including Matérn-type GPs with half-integer smoothness, and employ Kalman filtering and related methodologies to handle GP regression, which results in learning algorithms with both time and space complexity in O(n) (Hartikainen and Särkkä, 2010; Saatçi, 2012; Sarkka et al., 2013; Loper et al., 2021). Whether the key mathematical theories behind KP and the state-space approaches are essentially equivalent is unknown and requires further investigation. Although it has the same time and space complexity as KP, the Kalman filtering method is formulated as sequential data processing, which differs significantly from the usual supervised learning framework and makes it more difficult to comprehend. The proposed method, in contrast, is given by a simple matrix factorization (12), which is easy to implement and to incorporate into more complicated models.

Appendix

A Paley-Wiener Theorems

We will need two Paley-Wiener theorems in our proofs, given by Lemmas 15 and 16. For detailed discussion, we refer to Chapter 4 of Stein and Shakarchi (2003). Denote the support of function ff as suppf\operatorname{supp}f.

Definition 14 (Stein and Shakarchi (2003), page 112)

We say that a function ff is of moderate decrease if there exists MM\in\mathbb{R} so that |f(x)|M/(1+|x|α)|f(x)|\leq M/(1+|x|^{\alpha}) for some α>1\alpha>1, for all xx\in\mathbb{R}.

Lemma 15 (Theorem 3.3 in Chapter 4 of Stein and Shakarchi (2003))

Suppose f is continuous and of moderate decrease on \mathbb{R}, and let \hat{f} be the Fourier transform of f. Then f has an extension to the complex plane that is entire with |f(z)|\leq Ae^{M|z|} for some A>0, if and only if \operatorname{supp}\hat{f}\subset[-M,M]. (Stein and Shakarchi (2003) use an equivalent but different definition of the inverse Fourier transform, \tilde{f}(\xi)=\int_{-\infty}^{\infty}f(x)e^{2\pi ix\xi}dx, under which this inequality becomes |f(z)|\leq Ae^{2\pi M|z|}.)
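As an illustration of Lemma 15 (our own example; constants depend on the transform convention), take

f(x)=\left(\frac{\sin(Mx/2)}{x}\right)^{2},\qquad f(0):=M^{2}/4.

Then f is continuous and of moderate decrease since |f(x)|\leq\min(M^{2}/4,\,x^{-2}); its Fourier transform is, up to a positive constant, the triangle function \max(0,M-|\xi|), so \operatorname{supp}\hat{f}=[-M,M]; and f extends to the entire function (\sin(Mz/2)/z)^{2}, which satisfies |f(z)|\leq Ae^{M|z|} because |\sin(Mz/2)|\leq e^{M|z|/2}.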

Lemma 16 (Theorem 3.5 in Chapter 4 of Stein and Shakarchi (2003))

Suppose ff is continuous and of moderate decrease on \mathbb{R}, f^\hat{f} is the Fourier transform of ff. Then suppf^[0,+)\operatorname{supp}\hat{f}\subset[0,+\infty) if and only if ff can be extended to a continuous and bounded function in the closed upper half-plane {z=x+iy:y0}\{z=x+iy:y\geq 0\} with ff holomorphic in the interior.

B Technical Proofs

B.1 Algebraic Properties

The following Lemma 17 will be useful in proving the main theorems. We use \deg p to denote the degree of a polynomial p. For notational convenience, we define the degree of the zero polynomial as -1. We say that x is a zero of a function f if f(x)=0.

Lemma 17

Let p1p_{1} and p2p_{2} be polynomials with degp1=d1,degp2=d2\deg p_{1}=d_{1},\deg p_{2}=d_{2}. If max(degp1,degp2)0\max(\deg p_{1},\deg p_{2})\geq 0 and c0c\neq 0, then the function f(x):=p1(x)ecx+p2(x)ecxf(x):=p_{1}(x)e^{cx}+p_{2}(x)e^{-cx} has at most d1+d2+1d_{1}+d_{2}+1 real-valued zeros.

Proof  Without loss of generality, we assume that p1p_{1} is non-zero. Suppose ff has at least d1+d2+2d_{1}+d_{2}+2 real-valued zeros. Equivalently, the function g(x)=p1(x)e2cx+p2(x)g(x)=p_{1}(x)e^{2cx}+p_{2}(x) has at least d1+d2+2d_{1}+d_{2}+2 real-valued zeros. The mean value theorem implies g(ξ)=0g^{\prime}(\xi)=0 for some ξ\xi lying between two consecutive real-valued zeros of gg. Therefore, gg^{\prime} has at least d1+d2+1d_{1}+d_{2}+1 real-valued zeros. Repeating this procedure d2+1d_{2}+1 times, we can conclude that g(d2+1)g^{(d_{2}+1)} has at least d1+1d_{1}+1 real-valued zeros. Note that g(d2+1)g^{(d_{2}+1)} possesses the form g(d2+1)(x)=q(x)e2cx,g^{(d_{2}+1)}(x)=q(x)e^{2cx}, where q(x)q(x) is a non-zero polynomial with degree d1d_{1}. Because e2cx>0e^{2cx}>0, q(x)q(x) has at least d1+1d_{1}+1 real-valued zeros, which contradicts the fundamental theorem of algebra.  
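For instance, in the simplest case d_{1}=d_{2}=0 (an illustration we add for concreteness), f(x)=ae^{cx}+be^{-cx} with (a,b)\neq(0,0): if both a and b are non-zero, any zero of f satisfies e^{2cx}=-b/a, which has at most one real solution; if exactly one of a,b is zero, f has no real zeros. In either case f has at most d_{1}+d_{2}+1=1 real-valued zero, as Lemma 17 asserts.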

Lemma 18 directly leads to Theorems 2 and 9. We call the zero vector the trivial solution to a homogeneous system of linear equations.

Lemma 18

The following homogeneous systems have only the trivial solutions.

  1. For m\geq 1, the m\times m system in the unknowns (u_{1},\ldots,u_{m})^{T}:

    j=1mbjlexp{cbj}uj=0,\displaystyle\sum_{j=1}^{m}b_{j}^{l}\exp\{cb_{j}\}u_{j}=0,

    with l=0,,m1l=0,\ldots,m-1, c0c\neq 0 and distinct real numbers b1,,bmb_{1},\ldots,b_{m}.

  2. For m,s\geq 1, the (m+s)\times(m+s) system in the unknowns (u_{1},\ldots,u_{m+s})^{T}:

    j=1m+sbjlexp{cbj}uj=0,j=1m+sbjrexp{cbj}uj=0,\displaystyle\sum_{j=1}^{m+s}b_{j}^{l}\exp\{cb_{j}\}u_{j}=0,\quad\sum_{j=1}^{m+s}b_{j}^{r}\exp\{-cb_{j}\}u_{j}=0, (35)

    with l=0,,m1l=0,\ldots,m-1, r=0,,s1r=0,\ldots,s-1, c0c\neq 0 and distinct real numbers b1,,bm+sb_{1},\ldots,b_{m+s}.

Proof  For both parts, it suffices to prove that the coefficient matrices have full row rank, which is equivalent to their having full column rank since they are square. This inspires us to consider the transposes of the coefficient matrices.

Here we only provide the proof of Part 2; the proof of Part 1 follows similar lines. For Part 2, the linear system corresponding to the transposed coefficient matrix is

l=0m1bjlexp{cbj}vl+r=0s1bjrexp{cbj}vm+r=0,\displaystyle\sum_{l=0}^{m-1}b_{j}^{l}\exp\{cb_{j}\}v_{l}+\sum_{r=0}^{s-1}b_{j}^{r}\exp\{-cb_{j}\}v_{m+r}=0, (36)

for j=1,\ldots,m+s, with the vector of unknowns (v_{0},\ldots,v_{m+s-1})^{T}. Suppose (35) has a non-trivial solution. Then (36) also has a non-trivial solution, denoted by (v^{*}_{0},\ldots,v^{*}_{m+s-1})^{T}. Write p_{1}(x)=\sum_{l=0}^{m-1}v^{*}_{l}x^{l} and p_{2}(x)=\sum_{r=0}^{s-1}v^{*}_{m+r}x^{r}. Then (36) implies that each b_{j} is a zero of the function f(x):=p_{1}(x)e^{cx}+p_{2}(x)e^{-cx}, so f(x) has at least m+s distinct zeros. Note that \deg p_{1}\leq m-1 and \deg p_{2}\leq s-1. Because (v^{*}_{0},\ldots,v^{*}_{m+s-1})^{T} is non-trivial, we have \max(\deg p_{1},\deg p_{2})\geq 0. Thus Lemma 17 yields that f(x) has at most m+s-1 distinct zeros, a contradiction.  

Proof [Proof of Theorem 2] This theorem follows directly from Part 2 of Lemma 18, because each (k-1)\times(k-1) submatrix of the coefficient matrix corresponds to a linear system of the form in Part 2 of Lemma 18.  

Proof [Proof of Theorem 9] This theorem follows directly from Lemma 18, because each (s-1)\times(s-1) submatrix of the coefficient matrix corresponds to a linear system of the form in Lemma 18.  

Proof [Proof of Theorem 3] Let (A1,,Ak)T(A_{1},\ldots,A_{k})^{T} be a solution to (8). It suffices to prove that

j=1kAj(aj+t)lexp{δc(aj+t)}=0,\sum_{j=1}^{k}A_{j}(a_{j}+t)^{l}\exp\{\delta c(a_{j}+t)\}=0,

for l=0,,(k3)/2l=0,\ldots,(k-3)/2, δ=±1\delta=\pm 1 and each tt\in\mathbb{R}. This can be proved by noting that

j=1kAj(aj+t)lexp{δc(aj+t)}\displaystyle\sum_{j=1}^{k}A_{j}(a_{j}+t)^{l}\exp\{\delta c(a_{j}+t)\}
=\displaystyle= exp{δct}j=1kAjexp{δcaj}m=0l(lm)ajmtlm\displaystyle\exp\{\delta ct\}\sum_{j=1}^{k}A_{j}\exp\{\delta ca_{j}\}\sum_{m=0}^{l}\binom{l}{m}a_{j}^{m}t^{l-m}
=\displaystyle= exp{δct}m=0l(lm)tlm(j=1kAjajmexp{δcaj})=0,\displaystyle\exp\{\delta ct\}\sum_{m=0}^{l}\binom{l}{m}t^{l-m}\left(\sum_{j=1}^{k}A_{j}a_{j}^{m}\exp\{\delta ca_{j}\}\right)=0,

where the last equality follows from the identity j=1kAjajmexp{δcaj}=0\sum_{j=1}^{k}A_{j}a_{j}^{m}\exp\{\delta ca_{j}\}=0 for 0ml0\leq m\leq l, ensured by equation system (8).  

Proof [Proof of Theorem 10] The proof follows from arguments similar to the proof of Theorem 3.  

B.2 Results for the Supports

We first prove the following useful lemma.

Lemma 19

Let K be a Matérn correlation with half-integer smoothness. Suppose b_{1}<\cdots<b_{m} and t\in(b_{\tau},b_{\tau+1}) for some 1\leq\tau<m. Let \psi(x)=\sum_{j=1}^{m}B_{j}K(x,b_{j}) with B_{j}\in\mathbb{R}, and denote \psi_{1}(x)=\sum_{j=1}^{\tau}B_{j}K(x,b_{j}) and \psi_{2}(x)=\sum_{j=\tau+1}^{m}B_{j}K(x,b_{j}). If there exists \epsilon>0 such that \psi(x)=0 for all x\in(t-\epsilon,t+\epsilon), then

\psi_{1}(x)=\begin{cases}\psi(x),&\text{for }x<b_{\tau},\\ 0,&\text{otherwise},\end{cases}\qquad\text{and}\qquad\psi_{2}(x)=\begin{cases}0,&\text{for }x\leq b_{\tau+1},\\ \psi(x),&\text{otherwise}.\end{cases}

Proof  It is known that when ν=p+1/2\nu=p+1/2 with pp\in\mathbb{N}, the Matérn correlation can be expressed as (Santner et al., 2003)

K(x,x)=Pp(|xx|)exp{c|xx|},\displaystyle K(x,x^{\prime})=P_{p}(|x-x^{\prime}|)\exp\{-c|x-x^{\prime}|\}, (37)

where c=2ν/ωc=\sqrt{2\nu/\omega} and Pp(x)=σ2p!(2p)!j=0p(p+j)!j!(pj)!(2cx)pjP_{p}(x)=\sigma^{2}\frac{p!}{(2p)!}\sum_{j=0}^{p}\frac{(p+j)!}{j!(p-j)!}(2cx)^{p-j} is a polynomial of degree pp.
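For example (spelled out here for concreteness), specializing (37) to p=1 (\nu=3/2) and p=2 (\nu=5/2) gives the familiar closed forms

K(x,x^{\prime})=\sigma^{2}\bigl(1+c|x-x^{\prime}|\bigr)e^{-c|x-x^{\prime}|}\quad\text{and}\quad K(x,x^{\prime})=\sigma^{2}\Bigl(1+c|x-x^{\prime}|+\tfrac{1}{3}c^{2}|x-x^{\prime}|^{2}\Bigr)e^{-c|x-x^{\prime}|},

respectively.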

Therefore, for any x(bτ,bτ+1)x\in(b_{\tau},b_{\tau+1}),

ψ(x)\displaystyle\psi(x) =\displaystyle= ψ1(x)+ψ2(x)\displaystyle\psi_{1}(x)+\psi_{2}(x)
=\displaystyle= j=1τBjPp(xbj)ec(xbj)+j=τ+1mBjPp(bjx)ec(xbj)\displaystyle\sum_{j=1}^{\tau}B_{j}P_{p}(x-b_{j})e^{-c(x-b_{j})}+\sum_{j=\tau+1}^{m}B_{j}P_{p}(b_{j}-x)e^{c(x-b_{j})}
=:\displaystyle=: p1(x)ecx+p2(x)ecx,\displaystyle p_{1}(x)e^{-cx}+p_{2}(x)e^{cx},

where p_{1} and p_{2} are polynomials; moreover, p_{1}(x)e^{-cx}=\psi_{1}(x) for all x>b_{\tau} and p_{2}(x)e^{cx}=\psi_{2}(x) for all x<b_{\tau+1}. Thus \psi(x) is an analytic function on (b_{\tau},b_{\tau+1}). Then \psi(x)=0 for x\in(t-\epsilon,t+\epsilon) implies \psi(x)=0 on (b_{\tau},b_{\tau+1}), which is possible only if p_{1}\equiv p_{2}\equiv 0, because otherwise \psi could have at most 2p+1 distinct zeros on (b_{\tau},b_{\tau+1}) according to Lemma 17. Hence, \psi_{1}(x)=0 whenever x\geq b_{\tau}, and \psi_{2}(x)=0 whenever x\leq b_{\tau+1}.  

The following lemma formalizes the rationale behind (8).

Lemma 20

Let U be a connected open subset of \mathbb{C} containing a point z_{0}, let f(z) be a holomorphic function on U, and let m be a positive integer. Then f(z)(z-z_{0})^{-m} can be extended to a holomorphic function on U if and only if f^{(j)}(z_{0})=0 for j=0,\ldots,m-1.

Proof  First assume that f(z)(z-z_{0})^{-m} is holomorphic. Then f(z_{0}) must be zero, because otherwise \lim_{z\rightarrow z_{0}}f(z)(z-z_{0})^{-m}=\infty. If f vanishes identically in U, the desired result is trivial. If f does not vanish identically in U, then, according to Theorem 1.1 in Chapter 3 of Stein and Shakarchi (2003), there exist a neighborhood V\subset U of z_{0} and a unique positive integer m^{\prime} such that f(z)=(z-z_{0})^{m^{\prime}}g(z) for z\in V, with g a non-vanishing holomorphic function on V. Clearly it must hold that m^{\prime}\geq m, because otherwise we again have \lim_{z\rightarrow z_{0}}f(z)(z-z_{0})^{-m}=\infty. Then it is easily checked that f^{(j)}(z_{0})=0 for j=0,\ldots,m-1.

For the converse, in a small disc centered at z_{0} the function f has a power series expansion f=\sum_{j=0}^{\infty}a_{j}(z-z_{0})^{j}, where a_{j}=f^{(j)}(z_{0})/j! for each j\in\mathbb{N}. Thus a_{0}=\cdots=a_{m-1}=0. Consequently, f(z)(z-z_{0})^{-m}=\sum_{j=m}^{\infty}a_{j}(z-z_{0})^{j-m} on this disc, so f(z)(z-z_{0})^{-m}, which is holomorphic on U\setminus\{z_{0}\}, extends to a holomorphic function on U.  

Proof [Proof of Theorem 8] As before, let k:=2ν+2k:=2\nu+2. Suppose that ϕ(x)=j=1mAjK(x,aj)\phi(x)=\sum_{j=1}^{m}A_{j}K(x,a_{j}) has a compact support. The analytic continuation of its inverse Fourier transform is

\tilde{\phi}(z)=\sum_{j=1}^{m}A_{j}\exp\{ia_{j}z\}(c^{2}+z^{2})^{-(k-1)/2}:=\gamma(z)(c^{2}+z^{2})^{-(k-1)/2},

for z\in\mathbb{C}\setminus\{\pm ci\}. Then Lemma 20 entails \gamma^{(j)}(\pm ci)=0 for j=0,\ldots,(k-3)/2, which leads to the linear system

j=1mAjajlexp{δcaj}=0,\sum_{j=1}^{m}A_{j}a_{j}^{l}\exp\{\delta ca_{j}\}=0,

with l=0,,(k3)/2l=0,\ldots,(k-3)/2 and δ=±1\delta=\pm 1. But this system has only the trivial solution in view of Lemma 18.  

Proof [Proof of Theorem 5] Without loss of generality, we can assume that a_{1}=-M and a_{k}=M for some positive real number M, because otherwise we can apply a shift translation to convert the original problem to this form in view of Theorem 3.

We first employ Lemma 15 to show that \operatorname{supp}\phi_{\mathbf{a}}\subset[-M,M]=[a_{1},a_{k}]. The defining system (8), together with Lemma 20, implies that \tilde{\phi}_{\mathbf{a}} is entire. By continuity, |\tilde{\phi}_{\mathbf{a}}| is bounded in the region |z|\leq 2c. For |z|\geq 2c, we have |c^{2}+z^{2}|\geq|z|^{2}-c^{2}\geq 3c^{2}, and hence

\left|\tilde{\phi}_{\mathbf{a}}(z)\right|=\left|\gamma(z)\right|\cdot\left|(c^{2}+z^{2})^{-(k-1)/2}\right|\leq c^{-(k-1)}\left|\gamma(z)\right|\leq c^{-(k-1)}\sum_{j=1}^{k}\left|A_{j}\right|\cdot\left|\exp\{ia_{j}z\}\right|\leq c^{-(k-1)}\sum_{j=1}^{k}\left|A_{j}\right|\exp\{M|z|\},

where the last inequality follows from the facts that |e^{z}|\leq e^{|z|} and |a_{j}|\leq M for all j. Moreover, \tilde{\phi}_{\mathbf{a}} is clearly of moderate decrease on \mathbb{R}. According to Lemma 15, we obtain that \operatorname{supp}\phi_{\mathbf{a}}\subset[-M,M].

It remains to prove that \operatorname{supp}\phi_{\mathbf{a}}=[-M,M]. Suppose \operatorname{supp}\phi_{\mathbf{a}}\neq[-M,M]. Then we can find -M\leq M_{1}<M_{2}\leq M such that \phi_{\mathbf{a}}(x)=0 for all x\in[M_{1},M_{2}]. Therefore, Lemma 19 implies that \phi_{\mathbf{a}} can be expressed as

ϕ𝐚(x)=j=1τAjK(x,aj)+j=τ+1kAjK(x,aj):=ψ1(x)+ψ2(x),\phi_{\mathbf{a}}(x)=\sum_{j=1}^{\tau}A_{j}K(x,a_{j})+\sum_{j=\tau+1}^{k}A_{j}K(x,a_{j}):=\psi_{1}(x)+\psi_{2}(x),

for some 1\leq\tau<k, such that \operatorname{supp}\psi_{1}\subset[a_{1},a_{\tau}] and \operatorname{supp}\psi_{2}\subset[a_{\tau+1},a_{k}]. Because at least one of \psi_{1} and \psi_{2} does not vanish identically, such a function, according to Definition 1, is a KP of degree less than k. But this contradicts Theorem 8.  

Proof [Proof of Theorem 7] For Part 1, if the smoothness parameter \nu of a Matérn kernel is not a half-integer, then direct calculation shows

ϕ~a(x)\displaystyle\tilde{\phi}_{a}(x) \displaystyle\propto [j=1kAjexp{iajx}](c2+x2)ν12\displaystyle\left[\sum_{j=1}^{k}A_{j}\exp\{ia_{j}x\}\right](c^{2}+x^{2})^{-\nu-\frac{1}{2}} (38)
=\displaystyle= [j=1kAjexp{iajx}]exp{(ν+12)log(x2+c2)}.\displaystyle\left[\sum_{j=1}^{k}A_{j}\exp\{ia_{j}x\}\right]\exp\left\{-(\nu+\frac{1}{2})\log(x^{2}+c^{2})\right\}.

The goal is to prove that (38) cannot be extended to an entire function unless A_{j}=0 for each j. There is no continuous complex logarithm defined on all of \mathbb{C}\setminus\{0\}; here we consider the principal branch of the complex logarithm, \operatorname{Log}z:=\log|z|+i\operatorname{Arg}z, where \operatorname{Arg}z is the principal value of the argument of z, ranging in (-\pi,\pi]. For x>0, we have \log x=\operatorname{Log}x. It is known that \operatorname{Log}z is holomorphic on the set \mathbb{C}\setminus\{z\in\mathbb{R}:z\leq 0\}. Therefore, the function in (38) can be analytically continued to the region \mathcal{S}:=\mathbb{C}\setminus\mathcal{S}^{c}, where \mathcal{S}^{c}:=\{yi:y\in\mathbb{R},|y|\geq c\}.

Because analytical continuation is unique, the analytical continuation of (38) should coincide with

g(z):=[j=1kAjexp{iajz}]exp{(ν+12)Log(z2+c2)}g(z):=\left[\sum_{j=1}^{k}A_{j}\exp\{ia_{j}z\}\right]\exp\left\{-(\nu+\frac{1}{2})\operatorname{Log}(z^{2}+c^{2})\right\}

on \mathcal{S}. Because \nu+1/2 is not an integer, \exp\left\{-(\nu+\frac{1}{2})\operatorname{Log}(z^{2}+c^{2})\right\} is discontinuous when z moves across \mathcal{S}^{c}. Suppose there exists a KP. Then g(z) must be an entire function in view of Lemma 15. To make g(z) continuous on \mathcal{S}^{c}, we must have \sum_{j=1}^{k}A_{j}\exp\{ia_{j}z\}=0 on \mathcal{S}^{c}; but this readily implies \sum_{j=1}^{k}A_{j}\exp\{ia_{j}z\}=0 on \mathbb{C}, because \sum_{j=1}^{k}A_{j}\exp\{ia_{j}z\} is also an entire function. Therefore g\equiv 0. By the uniqueness of the Fourier transform, the underlying KP vanishes identically on \mathbb{R}, which is a contradiction.

For Part 2, a Gaussian correlation function K is analytic on \mathbb{R}, and so is \phi_{\mathbf{a}}:=\sum_{j=1}^{k}A_{j}K(x-a_{j}). Therefore, \phi_{\mathbf{a}} cannot have a compact support unless \phi_{\mathbf{a}}\equiv 0.  

Proof [Proof of Theorem 11] Without loss of generality, we can assume that a1=0a_{1}=0 because otherwise we can apply shift translation to make this happen in view of Theorem 10.

Clearly, ϕ𝐚\phi_{\mathbf{a}} is of moderate decrease in view of the expression (37). Direct calculation shows

ϕ~𝐚(z)[j=1sAjexp{iajz}](c2+z2)(k1)/2=γ(z)(c2+z2)(k1)/2.\tilde{\phi}_{\mathbf{a}}(z)\propto\left[\sum_{j=1}^{s}A_{j}\exp\{ia_{j}z\}\right](c^{2}+z^{2})^{-(k-1)/2}=\gamma(z)(c^{2}+z^{2})^{-(k-1)/2}.

Equation (10) implies \gamma^{(j)}(ci)=0 for j=0,\ldots,(k-3)/2. Thus \frac{d^{j}}{dz^{j}}\big(\gamma(z)(z+ci)^{-(k-1)/2}\big)\big|_{z=ci}=0 for j=0,\ldots,(k-3)/2, which, together with Lemma 20, yields that f(z):=\gamma(z)(c^{2}+z^{2})^{-(k-1)/2} is holomorphic in a neighborhood of ci. So f(z) is continuous on the upper half-plane \{z=x+iy:y\geq 0\} and holomorphic in its interior. To employ Lemma 16, it remains to prove that f(z) is bounded on \{z=x+iy:y\geq 0\}. For |z-ci|\leq c, f(z) is clearly bounded as it is a continuous function on a compact set. For |z-ci|\geq c and z\in\{z=x+iy:y\geq 0\}, we have

|(c2+z2)(k1)/2|=|zci|(k1)/2|z+ci|(k1)/2c(k1).|(c^{2}+z^{2})^{-(k-1)/2}|=|z-ci|^{-(k-1)/2}|z+ci|^{-(k-1)/2}\leq c^{-(k-1)}.

Write z=x+iyz=x+iy, then

|f(z)|=|γ(z)||(c2+z2)(k1)/2||j=1sAjexp{iaj(x+iy)}c(k1)|\displaystyle|f(z)|=|\gamma(z)||(c^{2}+z^{2})^{-(k-1)/2}|\leq\left|\sum_{j=1}^{s}A_{j}\exp\{ia_{j}(x+iy)\}c^{-(k-1)}\right|
\displaystyle\leq c(k1)j=1s|Aj||exp{iaj(x+iy)}|=c(k1)j=1s|Aj|exp{ajy},\displaystyle c^{-(k-1)}\sum_{j=1}^{s}\left|A_{j}\right|\left|\exp\{ia_{j}(x+iy)\}\right|=c^{-(k-1)}\sum_{j=1}^{s}\left|A_{j}\right|\exp\{-a_{j}y\},

which is bounded as y0y\geq 0. Therefore, according to Lemma 16, suppϕ𝐚[0,+)\operatorname{supp}\phi_{\mathbf{a}}\subset[0,+\infty).

Next, we prove that 0\in\operatorname{supp}\phi_{\mathbf{a}}. First, it can be shown that A_{1}\neq 0, because otherwise (A_{2},\ldots,A_{s})^{T} would be a solution to the linear system \sum_{j=2}^{s}a_{j}^{l}\exp\{-ca_{j}\}A_{j}=0, with l=0,\ldots,(k-3)/2, if s=(k+1)/2, or \sum_{j=2}^{s}a_{j}^{l}\exp\{-ca_{j}\}A_{j}=0,\ \sum_{j=2}^{s}a_{j}^{r}\exp\{ca_{j}\}A_{j}=0, with l=0,\ldots,(k-3)/2 and r=0,\ldots,s-(k+3)/2, if s\geq(k+3)/2. Then Lemma 18 implies that A_{j}=0 for all j=2,\ldots,s, so all coefficients vanish, which is a contradiction. Now, suppose 0\not\in\operatorname{supp}\phi_{\mathbf{a}}. Then there exists \epsilon>0 such that \phi_{\mathbf{a}}(x)=0 for all x<\epsilon. Without loss of generality, assume \epsilon<a_{2}. We now apply a shift transformation. Recall that T_{-\epsilon}(\mathbf{a}):=(a_{1}-\epsilon,\ldots,a_{s}-\epsilon). Theorem 10 implies that \phi_{T_{-\epsilon}(\mathbf{a})}(x)=\sum_{j=1}^{s}A_{j}K(x+\epsilon,a_{j}). Thus,

\tilde{\phi}_{T_{-\epsilon}(\mathbf{a})}(z)\propto\left[\sum_{j=1}^{s}A_{j}\exp\{i(a_{j}-\epsilon)z\}\right](c^{2}+z^{2})^{-(k-1)/2}.

It is easily seen that \tilde{\phi}_{T_{-\epsilon}(\mathbf{a})}(z) is unbounded for z=iy with sufficiently large y>0, which contradicts Lemma 16 because \operatorname{supp}\phi_{T_{-\epsilon}(\mathbf{a})}\subset[0,+\infty) would force \tilde{\phi}_{T_{-\epsilon}(\mathbf{a})} to be bounded on the closed upper half-plane. Specifically,

\tilde{\phi}_{T_{-\epsilon}(\mathbf{a})}(iy)\propto\left[\sum_{j=1}^{s}A_{j}\exp\{-(a_{j}-\epsilon)y\}\right](c^{2}-y^{2})^{-(k-1)/2}=A_{1}(c^{2}-y^{2})^{-\frac{k-1}{2}}\exp\{(\epsilon-a_{1})y\}+(c^{2}-y^{2})^{-\frac{k-1}{2}}\sum_{j=2}^{s}A_{j}\exp\{-(a_{j}-\epsilon)y\},

where the first term diverges and the second term converges to zero as y+y\rightarrow+\infty, because A10A_{1}\neq 0 and a1<ϵ<a2<<asa_{1}<\epsilon<a_{2}<\cdots<a_{s}.

It remains to prove that \operatorname{supp}\phi_{\mathbf{a}}=[0,+\infty). Suppose \operatorname{supp}\phi_{\mathbf{a}}\neq[0,+\infty). Then there exist M_{2}>M_{1}>0 such that \phi_{\mathbf{a}}(x)=0 whenever x\in(M_{1},M_{2}). Without loss of generality, we assume that M_{1},M_{2}\not\in\{a_{1},\ldots,a_{s}\}. Then we can write

ϕ𝐚(x)=j=1sAjK(x,aj)+0K(x,M1)+0K(x,M2).\phi_{\mathbf{a}}(x)=\sum_{j=1}^{s}A_{j}K(x,a_{j})+0\cdot K(x,M_{1})+0\cdot K(x,M_{2}).

Then Lemma 19 implies that ϕ𝐚(x)\phi_{\mathbf{a}}(x) can be decomposed into

ϕ𝐚(x)=j=1τAjK(x,aj)+j=τ+1sAjK(x,aj):=ψ1(x)+ψ2(x),\phi_{\mathbf{a}}(x)=\sum_{j=1}^{\tau}A_{j}K(x,a_{j})+\sum_{j=\tau+1}^{s}A_{j}K(x,a_{j}):=\psi_{1}(x)+\psi_{2}(x),

for some 1\leq\tau<s, such that \operatorname{supp}\psi_{1}\subset[0,M_{1}] and \operatorname{supp}\psi_{2}\subset[M_{2},+\infty). Because 0\in\operatorname{supp}\phi_{\mathbf{a}}, we have \operatorname{supp}\psi_{1}\neq\emptyset. Therefore, \psi_{1} is a non-zero function with compact support, which contradicts Theorem 8.  

B.3 Linear Independence

Proof [Proof of Theorem 13] We have learned from Theorems 5 and 11, and the analogous counterpart of Theorem 11 for the left-sided KPs that:

  1. The left-sided KPs \phi_{1},\phi_{2},\ldots,\phi_{(k-1)/2} have supports (-\infty,x_{(k+1)/2}], (-\infty,x_{(k+1)/2+1}], \ldots, (-\infty,x_{k-1}], respectively.

  2. The KPs \phi_{(k+1)/2},\phi_{(k+1)/2+1},\ldots,\phi_{n-(k-1)/2} have supports [x_{1},x_{k}], [x_{2},x_{k+1}], \ldots, [x_{n-k+1},x_{n}], respectively.

  3. The right-sided KPs \phi_{n-(k-3)/2},\ldots,\phi_{n-1},\phi_{n} have supports [x_{n-k+2},\infty), \ldots, [x_{n-(k-1)/2-1},\infty), [x_{n-(k-1)/2},\infty), respectively.

Therefore, for τ<n(k1)/2\tau<n-(k-1)/2, any function of the form f:=j=1τλjϕjf:=\sum_{j=1}^{\tau}\lambda_{j}\phi_{j} satisfies suppf(,xτ+(k1)/2]\operatorname{supp}f\subset(-\infty,x_{\tau+(k-1)/2}]. Note that suppϕτ+1=[xτ+1(k1)/2,xτ+1+(k1)/2](,xτ+(k1)/2]\operatorname{supp}\phi_{\tau+1}=[x_{\tau+1-(k-1)/2},x_{\tau+1+(k-1)/2}]\not\subset(-\infty,x_{\tau+(k-1)/2}], which proves that ϕτ+1span{ϕ1,,ϕτ}\phi_{\tau+1}\not\in\operatorname{span}\{\phi_{1},\ldots,\phi_{\tau}\}. Hence, by induction, we can prove that ϕ1,,ϕn(k1)/2\phi_{1},\ldots,\phi_{n-(k-1)/2} are linearly independent.

Similarly, we can prove that the right-sided KPs ϕn(k3)/2,,ϕn\phi_{n-(k-3)/2},\ldots,\phi_{n} are linearly independent.

Now suppose j=1nξjϕj=0\sum_{j=1}^{n}\xi_{j}\phi_{j}=0 for ξ1,,ξn\xi_{1},\ldots,\xi_{n}\in\mathbb{R}. We rearrange this identity as

f1:=j=1n(k1)/2ξjϕj=j=n(k3)/2nξjϕj=:f2,\displaystyle f_{1}:=\sum_{j=1}^{n-(k-1)/2}\xi_{j}\phi_{j}=-\sum_{j=n-(k-3)/2}^{n}\xi_{j}\phi_{j}=:f_{2}, (39)

i.e., the left-hand side of (39) is a linear combination of the left-sided KPs and the KPs, and the right-hand side of (39) is a linear combination of the right-sided KPs.

Note that \operatorname{supp}f_{1}\subset(-\infty,x_{n}] and \operatorname{supp}f_{2}\subset[x_{n-k+2},+\infty). Then the identity (39) implies \operatorname{supp}f_{2}\subset(-\infty,x_{n}]\cap[x_{n-k+2},+\infty)=[x_{n-k+2},x_{n}]. By definition, f_{2} is a linear combination of the k-1 functions K(\cdot,x_{n-k+2}),\ldots,K(\cdot,x_{n}). Hence, by Theorem 8, f_{2} has a compact support only if f_{2}\equiv 0, which, together with the fact that \phi_{n-(k-3)/2},\ldots,\phi_{n} are linearly independent, yields that \xi_{n-(k-3)/2}=\cdots=\xi_{n}=0. Then (39) gives f_{1}\equiv 0, and hence \xi_{1}=\cdots=\xi_{n-(k-1)/2}=0 because \phi_{1},\ldots,\phi_{n-(k-1)/2} have been shown to be linearly independent. In summary, \phi_{1},\ldots,\phi_{n} are linearly independent.  


References

  • Atkinson (2008) Kendall E Atkinson. An Introduction to Numerical Analysis. John Wiley & Sons, 2008.
  • Banerjee et al. (2014) Sudipto Banerjee, Bradley P Carlin, and Alan E Gelfand. Hierarchical Modeling and Analysis for Spatial Data. CRC press, 2014.
  • Barrett et al. (1994) Richard Barrett, Michael Berry, Tony F Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk Van der Vorst. Templates for the solution of linear systems: building blocks for iterative methods. SIAM, 1994.
  • Bazi and Melgani (2009) Yakoub Bazi and Farid Melgani. Gaussian process approach to remote sensing image classification. IEEE transactions on geoscience and remote sensing, 48(1):186–197, 2009.
  • Bui et al. (2017) Thang D Bui, Josiah Yan, and Richard E Turner. A unifying framework for gaussian process pseudo-point approximations using power expectation propagation. The Journal of Machine Learning Research, 18(1):3649–3720, 2017.
  • Burt et al. (2019) David Burt, Carl Edward Rasmussen, and Mark Van Der Wilk. Rates of convergence for sparse variational gaussian process regression. In International Conference on Machine Learning, pages 862–871. PMLR, 2019.
  • Chen and Stein (2021) Jie Chen and Michael Stein. Linear-cost covariance functions for gaussian random fields. Journal of the American Statistical Association, 2021. doi: 10.1080/01621459.2021.1919122.
  • Cressie (2015) Noel Cressie. Statistics for spatial data. John Wiley & Sons, 2015.
  • Davis (2006) Timothy A Davis. Direct Methods for Sparse Linear Systems. SIAM, 2006.
  • Deisenroth et al. (2013) Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE transactions on pattern analysis and machine intelligence, 37(2):408–423, 2013.
  • Ding et al. (2020) Liang Ding, Rui Tuo, and Shahin Shahrampour. Generalization guarantees for sparse kernel approximation with entropic optimal features. In Proceedings of the 37th International Conference on Machine Learning, pages 2545–2555, 2020.
  • Fausett and Fulton (1994) Donald W Fausett and Charles T Fulton. Large least squares problems involving kronecker products. SIAM Journal on Matrix Analysis and Applications, 15(1):219–227, 1994.
  • Gardner et al. (2018) Jacob R Gardner, Geoff Pleiss, David Bindel, Kilian Q Weinberger, and Andrew Gordon Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. arXiv preprint arXiv:1809.11165, 2018.
  • Gerstner and Griebel (1998) Thomas Gerstner and Michael Griebel. Numerical integration using sparse grids. Numerical Algorithms, 18(3):209–232, 1998.
  • Graham (2018) Alexander Graham. Kronecker Products and Matrix Calculus with Applications. Courier Dover Publications, 2018.
  • Gramacy (2020) Robert B Gramacy. Surrogates: Gaussian process modeling, design, and optimization for the applied sciences. Chapman and Hall/CRC, 2020.
  • Gramacy and Apley (2015) Robert B Gramacy and Daniel W Apley. Local gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578, 2015.
  • Hartikainen and Särkkä (2010) Jouni Hartikainen and Simo Särkkä. Kalman filtering and smoothing solutions to temporal gaussian process regression models. In 2010 IEEE international workshop on machine learning for signal processing, pages 379–384. IEEE, 2010.
  • Hennig et al. (2015) Philipp Hennig, Michael A Osborne, and Mark Girolami. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471(2179):20150142, 2015.
  • Jones et al. (1998) Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. J. Glob. Optim., 13(4):455–492, 1998.
  • Kamgnia and Nguenang (2014) Emmanuel Kamgnia and Louis Bernard Nguenang. Some efficient methods for computing the determinant of large sparse matrices. Revue Africaine de la Recherche en Informatique et Mathématiques Appliquées, 17:73–92, 2014.
  • Katzfuss (2017) Matthias Katzfuss. A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, 112(517):201–214, Jan 2017.
  • Keeling and Whorf (2005) Charles D Keeling and TP Whorf. Atmospheric carbon dioxide record from mauna loa. Carbon Dioxide Research Group, Scripps Institution of Oceanography, University of California La Jolla, California, pages 92093–0444, 2005.
  • Loper et al. (2021) Jackson Loper, David Blei, John P Cunningham, and Liam Paninski. A general linear-time inference method for Gaussian processes on one dimension. Journal of Machine Learning Research, 22(234):1–36, 2021.
  • Molga and Smutnicki (2005) Marcin Molga and Czesław Smutnicki. Test functions for optimization needs. Test functions for optimization needs, 101:48, 2005.
  • Plumlee et al. (2021) M Plumlee, CB Erickson, BE Ankenman, and E Lawrence. Composite grid designs for adaptive computer experiments with fast inference. Biometrika, 108(3):749–755, 2021.
  • Plumlee (2014) Matthew Plumlee. Fast prediction of deterministic functions using sparse grid experimental designs. Journal of the American Statistical Association, 109(508):1581–1591, 2014.
  • Rahimi and Recht (2007) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184, 2007.
  • Rasmussen (2006) Carl Edward Rasmussen. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • Rasmussen and Nickisch (2010) Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (gpml) toolbox. Journal of Machine Learning Research, 11(100):3011–3015, 2010.
  • Saatçi (2012) Yunus Saatçi. Scalable inference for structured Gaussian process models. PhD thesis, Citeseer, 2012.
  • Sacks et al. (1989) Jerome Sacks, William J Welch, Toby J Mitchell, and Henry P Wynn. Design and analysis of computer experiments. Statistical science, 4(4):409–423, 1989.
  • Santner et al. (2003) Thomas J Santner, Brian J Williams, William I Notz, and Brain J Williams. The Design and Analysis of Computer Experiments, volume 1. Springer, 2003.
  • Sarkka et al. (2013) Simo Sarkka, Arno Solin, and Jouni Hartikainen. Spatiotemporal learning via infinite-dimensional bayesian filtering and smoothing: A look at gaussian process regression through kalman filtering. IEEE Signal Processing Magazine, 30(4):51–61, 2013.
  • Smola and Schölkopf (2000) Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. 2000.
  • Srinivas et al. (2009) Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
  • Sriperumbudur and Szabo (2015) Bharath Sriperumbudur and Zoltan Szabo. Optimal rates for random fourier features. Advances in Neural Information Processing Systems, 28:1144–1152, 2015.
  • Stein and Shakarchi (2003) Elias M Stein and Rami Shakarchi. Complex Analysis. Princeton University Press, 2003.
  • Stein (1999) Michael L Stein. Interpolation of Spatial Data: some theory for kriging. Springer Science & Business Media, 1999.
  • Titsias (2009) Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In Artificial intelligence and statistics, pages 567–574. PMLR, 2009.
  • Tuo and Wang (2020) Rui Tuo and Wenjia Wang. Kriging prediction with isotropic matern correlations: robustness and experimental designs. Journal of Machine Learning Research, 21(187):1–38, 2020.
  • Tuo and Wu (2016) Rui Tuo and C. F. Jeff Wu. A theoretical framework for calibration in computer models: parametrization, estimation and convergence properties. SIAM/ASA Journal on Uncertainty Quantification, 4(1):767–795, 2016.
  • Williams and Seeger (2001) Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Proceedings of the 14th Annual Conference on Neural Information Processing Systems, pages 682–688, 2001.
  • Wilson and Nickisch (2015) Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (kiss-gp). In International Conference on Machine Learning, pages 1775–1784, 2015.
  • Wilson (2014) Andrew Gordon Wilson. Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, Citeseer, 2014.
  • Wood and Chan (1994) Andrew TA Wood and Grace Chan. Simulation of stationary Gaussian processes in [0,1]d[0,1]^{d}. Journal of Computational and Graphical Statistics, 3(4):409–432, 1994.
  • Zhang et al. (2005) Yunong Zhang, William E Leithead, and Douglas J Leith. Time-series gaussian process regression based on toeplitz computation of O(N2)O(N^{2}) operations and O(N)O(N)-level storage. In Proceedings of the 44th IEEE Conference on Decision and Control, pages 3711–3716. IEEE, 2005.