
Large-scale Heteroscedastic Regression via Gaussian Process

Haitao Liu, Yew-Soon Ong, and Jianfei Cai. Haitao Liu is with the Rolls-Royce@NTU Corporate Lab, Nanyang Technological University, Singapore, 637460. E-mail: htliu@ntu.edu.sg. Yew-Soon Ong is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore, 639798. E-mail: asysong@ntu.edu.sg. Jianfei Cai is with the Department of Data Science & AI, Monash University, Australia. E-mail: jianfei.cai@monash.edu.
Abstract

Heteroscedastic regression, which considers varying noise among observations, has many applications in fields such as machine learning and statistics. Here we focus on heteroscedastic Gaussian process (HGP) regression, which integrates the latent function and the noise function in a unified non-parametric Bayesian framework. Though showing remarkable performance, HGP suffers from cubic time complexity, which strictly limits its application to big data. To improve the scalability, we first develop a variational sparse inference algorithm, named VSHGP, to handle large-scale datasets. Furthermore, two variants are developed to improve the scalability and capability of VSHGP. The first is stochastic VSHGP (SVSHGP), which derives a factorized evidence lower bound, thus enabling efficient stochastic variational inference. The second is distributed VSHGP (DVSHGP), which (i) follows the Bayesian committee machine formalism to distribute computations over multiple local VSHGP experts with many inducing points, and (ii) adopts hybrid parameters for the experts to guard against over-fitting and capture local variety. The superiority of DVSHGP and SVSHGP over existing scalable heteroscedastic/homoscedastic GPs is then extensively verified on various datasets.

Index Terms:
Heteroscedastic GP, Large-scale, Sparse approximation, Stochastic variational inference, Distributed learning.

Notation

$F, F_V$ = Lower bounds for the log model evidence $\log p(\bm{y})$
$\bm{f}, \bm{g}$ = Latent function values and log variances
$\bm{f}_m, \bm{g}_u$ = Inducing variables for $\bm{f}$ and $\bm{g}$
$k^f, k^g$ = Kernels for $f$ and $g$
$\bm{K}^f, \bm{K}^g$ = Kernel matrices for $\bm{f}$ and $\bm{g}$
$m, u$ = Inducing sizes for $\bm{f}$ and $\bm{g}$
$m_0, u_0$ = Inducing sizes for each expert in DVSHGP
$m_{\mathrm{b}}$ = Number of basis functions for GPVC [8]
$m_{\mathrm{sod}}$ = Subset size for EHSoD [13]
$M$ = Number of VSHGP experts
$\mathcal{M}_i$ = $i$th VSHGP expert ($1 \leq i \leq M$)
$n, n_*$ = Training size and test size
$n_0$ = Training size for each expert in DVSHGP
$\bm{X}, \bm{X}_*$ = Training and test data points
$\bm{y}, \bm{y}_*$ = Training and test observations
$\mu_0$ = Mean parameter for the latent noise function $g$
$\bm{\mu}_m, \bm{\Sigma}_m$ = Mean and variance of the posterior $q(\bm{f}_m)$
$\bm{\mu}_u, \bm{\Sigma}_u$ = Mean and variance of the posterior $q(\bm{g}_u)$
$\bm{\mu}_f, \bm{\Sigma}_f$ = Mean and variance of the posterior $q(\bm{f})$
$\bm{\mu}_f^{\star}, \bm{\Sigma}_f^{\star}$ = Mean and variance of the optimal $q^{\star}(\bm{f})$
$\bm{\mu}_g, \bm{\Sigma}_g$ = Mean and variance of the posterior $q(\bm{g})$
$\bm{\Lambda}_{nn}$ = Variational parameters for $q(\bm{g}_u)$ in VSHGP
$\mu_{f_{(\mathcal{A})}*}, \mu_{g_{(\mathcal{A})}*}$ = (Aggregated) prediction means of $f$ and $g$ at $\bm{x}_*$
$\sigma^2_{f_{(\mathcal{A})}*}, \sigma^2_{g_{(\mathcal{A})}*}$ = (Aggregated) prediction variances of $f$ and $g$ at $\bm{x}_*$
$\mu_{(\mathcal{A})*}, \sigma^2_{(\mathcal{A})*}$ = (Aggregated) prediction mean and variance at $\bm{x}_*$
$\mu_{i*}, \sigma^2_{i*}$ = Prediction mean and variance of $\mathcal{M}_i$ at $\bm{x}_*$
$w_{i*}^f, w_{i*}^g$ = Weights of expert $\mathcal{M}_i$ at $\bm{x}_*$ for $f$ and $g$
$\bm{\theta}^f, \bm{\theta}^g$ = Kernel parameters for $k^f$ and $k^g$
$\epsilon_i$ = Noise for observation $y_i$ at point $\bm{x}_i$ ($1 \leq i \leq n$)
$\gamma$ = Step-size parameter for optimization
$\mathcal{B}$ = Mini-batch set sampled from $\bm{X}$

I Introduction

In supervised learning, we learn a machine learning model from $n$ training points $\bm{X}=\{\bm{x}_{i}\in R^{d}\}_{i=1}^{n}$ and the observations $\bm{y}=\{y(\bm{x}_{i})=f(\bm{x}_{i})+\epsilon_{i}\}_{i=1}^{n}$, where $f$ is the underlying function and $\epsilon_{i}$ is the independent noise. The non-parametric Gaussian process (GP) places a GP prior on $f$ with the i.i.d. noise $\epsilon_{i}\sim\mathcal{N}(0,\sigma^{2}_{\epsilon})$, resulting in the standard homoscedastic GP. The homoscedasticity offers tractable GP inference, enabling extensive applications in regression and classification [1], visualization [2], Bayesian optimization [3], multi-task learning [4], active learning [5], etc.

Many realistic problems, e.g., volatility forecasting [6], biophysical variable estimation [7], cosmological redshift estimation [8], and robotics and vehicle control [9, 10], however, should consider input-dependent noise rather than a simple constant noise in order to fit the local noise rates of complicated data distributions. In comparison to the conventional homoscedastic GP, the heteroscedastic GP (HGP) provides better quantification of different sources of uncertainty, which further benefits downstream tasks, e.g., active learning, optimization and uncertainty quantification [11, 12].

To account for heteroscedastic noise in GP, there exist two main strategies: (i) treat the GP as a black box and interpret the heteroscedasticity using another separate model; or (ii) integrate the heteroscedasticity within the unifying GP framework. The first, post-model strategy trains a standard GP to capture the underlying function $f$, and then either trains another GP to take the remaining empirical variance into account [13], or uses quantile regression [14] to model the lower and upper quantiles of the variance, respectively.

In contrast, the second, integration strategy provides an elegant framework for heteroscedastic regression. The simplest way to mimic variable noise is to add independent yet different noise variances to the diagonal of the kernel matrix [15]. Goldberg et al. [16] introduced a more principled HGP which jointly infers a mean-field GP for $f(\bm{x})$ and an additional GP $g(\bm{x})$ for $\log\sigma^{2}_{\epsilon}(\bm{x})$. Similarly, other HGPs which describe the heteroscedastic noise using the pointwise division of two GPs or a general non-linear combination of GPs have been proposed, for example, in [17, 18, 19]. Note that, unlike the homoscedastic GP, inference in HGP is challenging since the model evidence (marginal likelihood) $p(\bm{y})$ and the posterior $p(y_{*}|\bm{X},\bm{y},\bm{x}_{*})$ are intractable. Hence, various approximate inference methods, e.g., Markov chain Monte Carlo (MCMC) [16], maximum a posteriori (MAP) [20, 21, 22, 23], variational inference [24, 25], expectation propagation [26, 27] and Laplace approximation [28], have been used. MCMC, though the most accurate, is time-consuming when handling large-scale datasets; MAP is a point estimate which risks over-fitting and oscillation; variational inference and its variants, which run fast via maximizing a tractable and rigorous evidence lower bound (ELBO), provide a trade-off.

This paper focuses on the HGP model developed in [16]. When handling $n$ training points, the standard GP suffers from a cubic complexity $\mathcal{O}(n^{3})$ due to the inversion of an $n\times n$ kernel matrix, which makes it unaffordable for big data. Since HGP employs an additional log-GP for the noise variance, its complexity is about twice that of the standard GP. Hence, to handle large-scale datasets, which is of great demand in the era of big data, the scalability of HGP should be improved.

Recently, there has been an increasing trend of studying scalable GPs, which fall into two core categories: global and local approximations [29]. As the representative of global approximation, sparse approximation considers $m$ ($m\ll n$) global inducing pairs $\{\bm{X}_{m},\bm{f}_{m}\}$ to summarize the training data by approximating the prior [30] or the posterior [31], resulting in a complexity of $\mathcal{O}(nm^{2})$. Variants of sparse approximation have recently been proposed to handle millions of data points via distributed inference [32, 33], stochastic variational inference [34, 35], or structured inducing points [36]. The global sparse approximation, however, often faces challenges in capturing quick-varying features [37]. Differently, local approximation, which follows the idea of divide-and-conquer, first trains GP experts on local subsets and then aggregates their predictions, for example by means of the product-of-experts (PoE) [38] or the Bayesian committee machine (BCM) [39, 40, 41]. Hence, local approximation not only distributes the computations but also captures quick-varying features [42]. Hybrid strategies have thereafter been presented to take advantage of both global and local approximations [43, 44, 45].

The developments in scalable homoscedastic GPs have thus motivated us to scale up HGPs. Alternatively, we could combine the simple Subset-of-Data (SoD) approximation [46] with the empirical HGP [13], which trains two separate GPs for predicting the mean and variance, respectively. The empirical variance, however, is hard to fit since it follows an asymmetric Gaussian distribution. More reasonably, the GP using variable covariances (GPVC) [8] follows the idea of the relevance vector machine (RVM) [47] that a stationary kernel $k(\cdot,\cdot)$ has a positive and finite Fourier spectrum, suggesting the use of only $m_{\mathrm{b}}$ ($m_{\mathrm{b}}\ll n$) independent basis functions for approximation. Note that GPVC shares the basis functions for $f$ and $g$, which however might exhibit distinct features. Besides, the RVM-type model usually suffers from underestimated prediction variance when moving away from $\bm{X}$ [48].

Besides, there are some scalable "pseudo" HGPs [49, 43] which are not designed for this purpose, but can describe the heteroscedasticity to some extent due to their factorized conditionals. For instance, the Fully Independent Training Conditional (FITC) [49] and its block version, the Partially Independent Conditional (PIC) [43], adopt a factorized training conditional to achieve a varying noise [50]. Though equipped with heteroscedastic variance, these models (i) severely underestimate the constant noise variance, (ii) sacrifice the prediction mean [50], and (iii) may produce discontinuous predictions on block boundaries [44]. Recently, stochastic/distributed variants of FITC and PIC have been developed to further improve the scalability [35, 33, 51, 45].

The high capability of HGP for describing complicated data distributions, together with its poor scalability, motivates us to develop variational sparse HGPs which employ an additional log-GP for heteroscedasticity. Particularly, the main contributions are:

1. A variational inference algorithm for sparse HGP, named VSHGP, is developed. Specifically, VSHGP derives an analytical ELBO using $m$ inducing points for $f$ and $u$ inducing points for $g$, resulting in a greatly reduced complexity of $\mathcal{O}(nm^{2}+nu^{2})$. Besides, tricks such as re-parameterization are used to ease the inference;

2. The stochastic variant SVSHGP is presented to further improve the scalability of VSHGP. Specifically, we derive a factorized ELBO which allows using efficient stochastic variational inference;

3. The distributed variant DVSHGP is presented for improving the scalability and capability of VSHGP. The local experts with many inducing points (i) distribute the computations for parallel computing, and (ii) employ hybrid parameters to guard against over-fitting and capture local variety;

4. Extensive experiments conducted on datasets with up to two million points reveal that the localized DVSHGP exhibits superior performance, while the global SVSHGP may sacrifice the prediction mean for capturing heteroscedastic noise. The SVSHGP is implemented with the GPflow toolbox [52], which benefits from the parallel/GPU speedup and automatic differentiation of TensorFlow [53]; the DVSHGP is built upon the GPML toolbox [54]. These implementations are available at https://github.com/LiuHaiTao01.

The remainder of the article is organized as follows. Section II first develops the VSHGP model via variational inference. Then, we present the stochastic variant in Section III and the distributed variant in Section IV to further enhance the scalability and capability. Section V discusses implementation issues, followed by extensive experiments in Section VI. Finally, Section VII provides concluding remarks.

II Variational sparse HGP

II-A Sparse approximation

We follow [16] to define the HGP as $y(\bm{x})=f(\bm{x})+\epsilon(\bm{x})$, wherein the latent function $f(\bm{x})$ and the noise $\epsilon(\bm{x})$ follow

f(\bm{x})\sim\mathcal{GP}(0,k^{f}(\bm{x},\bm{x}^{\prime})),\quad\epsilon(\bm{x})\sim\mathcal{N}(0,\sigma^{2}_{\epsilon}(\bm{x})). (1)

It is observed that the input-dependent noise variance $\sigma^{2}_{\epsilon}(\bm{x})$ enables the description of possible heteroscedasticity. Notably, this HGP degenerates to the homoscedastic GP when $\sigma^{2}_{\epsilon}(\bm{x})$ is a constant. To ensure positivity, we particularly consider the exponential form $\sigma^{2}_{\epsilon}(\bm{x})=e^{g(\bm{x})}$, wherein the latent function $g(\bm{x})$, akin to $f(\bm{x})$, follows an independent GP prior

g(\bm{x})\sim\mathcal{GP}(\mu_{0},k^{g}(\bm{x},\bm{x}^{\prime})). (2)

The only difference is that, unlike the zero-mean GP prior placed on $f(\bm{x})$, we explicitly consider a prior mean $\mu_{0}$ to account for the variability of the noise variance. (For $f$, we can pre-process the data to fulfill the zero-mean assumption; for $g$, however, it is hard to satisfy this assumption, since we have no access to the "noise" data.) The kernels $k^{f}$ and $k^{g}$ could be, e.g., the squared exponential (SE) kernel equipped with automatic relevance determination (ARD)

k(\bm{x},\bm{x}^{\prime})=\sigma^{2}_{s}\exp\left(-\frac{1}{2}\sum_{i=1}^{d}\frac{(x_{i}-x^{\prime}_{i})^{2}}{l_{i}^{2}}\right), (3)

where the signal variance $\sigma^{2}_{s}$ is an output scale, and the length-scale $l_{i}$ is an input scale along the $i$th dimension.
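For concreteness, a minimal NumPy sketch of the SE-ARD kernel in (3) is given below; the function name and array shapes are our own illustrative choices rather than part of the released toolboxes.

```python
import numpy as np

def se_ard_kernel(X1, X2, signal_var, lengthscales):
    """SE kernel with ARD, Eq. (3): k(x, x') = s^2 exp(-0.5 * sum_i (x_i - x'_i)^2 / l_i^2)."""
    X1s = X1 / lengthscales          # scale each input dimension by its length-scale
    X2s = X2 / lengthscales
    # squared Euclidean distances between all pairs of scaled points
    d2 = np.sum(X1s**2, 1)[:, None] + np.sum(X2s**2, 1)[None, :] - 2.0 * X1s @ X2s.T
    return signal_var * np.exp(-0.5 * np.maximum(d2, 0.0))

# example: a 5x3 kernel matrix between random 2D points
K = se_ard_kernel(np.random.randn(5, 2), np.random.randn(3, 2),
                  signal_var=1.0, lengthscales=np.array([1.0, 2.0]))
```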

Given the training data $\mathcal{D}=\{\bm{X},\bm{y}\}$, the joint priors follow

p(\bm{f})=\mathcal{N}(\bm{f}|\bm{0},\bm{K}^{f}_{nn}),\quad p(\bm{g})=\mathcal{N}(\bm{g}|\mu_{0}\bm{1},\bm{K}^{g}_{nn}), (4)

where $[\bm{K}^{f}_{nn}]_{ij}=k^{f}(\bm{x}_{i},\bm{x}_{j})$ and $[\bm{K}^{g}_{nn}]_{ij}=k^{g}(\bm{x}_{i},\bm{x}_{j})$ for $1\leq i,j\leq n$. Accordingly, the data likelihood becomes

p(\bm{y}|\bm{f},\bm{g})=\mathcal{N}(\bm{y}|\bm{f},\bm{\Sigma}_{\epsilon}), (5)

where the diagonal noise matrix has $[\bm{\Sigma}_{\epsilon}]_{ii}=e^{g(\bm{x}_{i})}$.

To scale up HGP, we follow the sparse approximation framework to introduce $m$ inducing variables $\bm{f}_{m}\sim\mathcal{N}(\bm{f}_{m}|\bm{0},\bm{K}_{mm}^{f})$ at the inducing points $\bm{X}_{m}$ for $\bm{f}$; similarly, we introduce $u$ inducing variables $\bm{g}_{u}\sim\mathcal{N}(\bm{g}_{u}|\mu_{0}\bm{1},\bm{K}_{uu}^{g})$ at the independent $\bm{X}_{u}$ for $\bm{g}$. Besides, we assume that $\bm{f}_{m}$ is a sufficient statistic for $\bm{f}$, and $\bm{g}_{u}$ a sufficient statistic for $\bm{g}$. (Sufficient statistic here means that the variables $\bm{z}$ and $\bm{f}$ are independent given $\bm{f}_{m}$, i.e., $p(\bm{z}|\bm{f},\bm{f}_{m})=p(\bm{z}|\bm{f}_{m})$.) As a result, we obtain two training conditionals

p(\bm{f}|\bm{f}_{m})=\mathcal{N}(\bm{f}|\bm{\Omega}_{nm}^{f}\bm{f}_{m},\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f}),
p(\bm{g}|\bm{g}_{u})=\mathcal{N}(\bm{g}|\bm{\Omega}_{nu}^{g}(\bm{g}_{u}-\mu_{0}\bm{1})+\mu_{0}\bm{1},\bm{K}_{nn}^{g}-\bm{Q}_{nn}^{g}),

where $\bm{\Omega}_{nm}^{f}=\bm{K}_{nm}^{f}(\bm{K}_{mm}^{f})^{-1}$, $\bm{\Omega}_{nu}^{g}=\bm{K}_{nu}^{g}(\bm{K}_{uu}^{g})^{-1}$, $\bm{Q}_{nn}^{f}=\bm{\Omega}_{nm}^{f}\bm{K}_{mn}^{f}$ and $\bm{Q}_{nn}^{g}=\bm{\Omega}_{nu}^{g}\bm{K}_{un}^{g}$.
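As an illustration, the interpolation matrix $\bm{\Omega}_{nm}^{f}$ and the Nyström term $\bm{Q}_{nn}^{f}$ could be formed as in the following sketch; the helper name and the jitter value are our own assumptions, and a Cholesky factor is used purely for numerical stability.

```python
import numpy as np

def nystrom_quantities(Knm, Kmm, jitter=1e-6):
    """Return Omega = Knm Kmm^{-1} and Q = Omega Knm^T for the sparse conditional p(f | f_m)."""
    L = np.linalg.cholesky(Kmm + jitter * np.eye(Kmm.shape[0]))  # Kmm = L L^T
    # solve Kmm^{-1} Knm^T via two triangular solves, then transpose
    A = np.linalg.solve(L, Knm.T)
    Omega = np.linalg.solve(L.T, A).T          # n x m
    Q = Omega @ Knm.T                          # n x n Nystrom approximation of Knn
    return Omega, Q
```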

In the augmented probability space, the model evidence

p(\bm{y})=\int p(\bm{y}|\bm{f},\bm{g})p(\bm{f}|\bm{f}_{m})p(\bm{g}|\bm{g}_{u})p(\bm{f}_{m})p(\bm{g}_{u})\,d\bm{f}\,d\bm{f}_{m}\,d\bm{g}\,d\bm{g}_{u},

together with the posterior $p(\bm{z}|\bm{y})=p(\bm{y}|\bm{z})p(\bm{z})/p(\bm{y})$, where $\bm{z}=\{\bm{f},\bm{g},\bm{f}_{m},\bm{g}_{u}\}$, however, is intractable. Hence, we use variational inference to derive an analytical ELBO of $\log p(\bm{y})$.

II-B Evidence lower bound

We employ the mean-field theory [55] to approximate the intractable posterior $p(\bm{z}|\bm{y})=p(\bm{f}|\bm{f}_{m})p(\bm{g}|\bm{g}_{u})p(\bm{f}_{m}|\bm{y})p(\bm{g}_{u}|\bm{y})$ as

p(\bm{z}|\bm{y})\approx q(\bm{z})=p(\bm{f}|\bm{f}_{m})p(\bm{g}|\bm{g}_{u})q(\bm{f}_{m})q(\bm{g}_{u}), (6)

where $q(\bm{f}_{m})$ and $q(\bm{g}_{u})$ are free variational distributions to approximate the posteriors $p(\bm{f}_{m}|\bm{y})$ and $p(\bm{g}_{u}|\bm{y})$, respectively.

In order to push the approximation $q(\bm{z})$ towards the exact $p(\bm{z}|\bm{y})$, we minimize their Kullback-Leibler (KL) divergence $\mathrm{KL}(q(\bm{z})||p(\bm{z}|\bm{y}))$, which, since $\mathrm{KL}(\cdot,\cdot)\geq 0$, is equivalent to maximizing the ELBO $F$ defined as

F=\int q(\bm{z})\log\frac{p(\bm{z},\bm{y})}{q(\bm{z})}d\bm{z}=\log p(\bm{y})-\mathrm{KL}(q(\bm{z})||p(\bm{z}|\bm{y})). (7)

As a consequence, instead of directly maximizing the intractable $\log p(\bm{y})$ for inference, we now seek the maximization of $F$ w.r.t. the variational distributions $q(\bm{f}_{m})$ and $q(\bm{g}_{u})$.

By reformulating $F$ we observe that

F=-\mathrm{KL}(q(\bm{f}_{m})||q^{\star}(\bm{f}_{m}))+H(q(\bm{g}_{u}))+\mathrm{const},

where $H(q(\bm{g}_{u}))$ is the information entropy of $q(\bm{g}_{u})$; $q^{\star}(\bm{f}_{m})$ is the optimal distribution since it maximizes the bound $F$, and it satisfies, given the normalizer $C_{0}$,

q^{\star}(\bm{f}_{m})=\frac{p(\bm{f}_{m})}{C_{0}}e^{\int p(\bm{f}|\bm{f}_{m})p(\bm{g}|\bm{g}_{u})q(\bm{g}_{u})\log p(\bm{y}|\bm{f},\bm{g})d\bm{f}d\bm{g}d\bm{g}_{u}}. (8)

Thereafter, by substituting $q^{\star}(\bm{f}_{m})$ back into $F$, we arrive at a tighter ELBO, given $q(\bm{g}_{u})=\mathcal{N}(\bm{g}_{u}|\bm{\mu}_{u},\bm{\Sigma}_{u})$, as

F_{V}=\log C_{0}-\mathrm{KL}(q(\bm{g}_{u})||p(\bm{g}_{u})) (9)
=\underbrace{\log\mathcal{N}(\bm{y}|\bm{0},\bm{Q}_{nn}^{f}+\bm{R}_{g})}_{\mathrm{log\,term}}-\underbrace{0.25\,\mathrm{Tr}[\bm{\Sigma}_{g}]}_{\mathrm{trace\,term\,of}\,\bm{g}}
-\underbrace{0.5\,\mathrm{Tr}[\bm{R}_{g}^{-1}(\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f})]}_{\mathrm{trace\,term\,of}\,\bm{f}}-\underbrace{\mathrm{KL}(q(\bm{g}_{u})||p(\bm{g}_{u}))}_{\mathrm{KL\,term}},

where the diagonal matrix $\bm{R}_{g}\in R^{n\times n}$ has $[\bm{R}_{g}]_{ii}=e^{[\bm{\mu}_{g}]_{i}-[\bm{\Sigma}_{g}]_{ii}/2}$, with the mean and variance

\bm{\mu}_{g}=\bm{\Omega}_{nu}^{g}(\bm{\mu}_{u}-\mu_{0}\bm{1})+\mu_{0}\bm{1}, (10a)
\bm{\Sigma}_{g}=\bm{K}_{nn}^{g}-\bm{Q}_{nn}^{g}+\bm{\Omega}_{nu}^{g}\bm{\Sigma}_{u}(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}, (10b)

coming from $q(\bm{g})=\int p(\bm{g}|\bm{g}_{u})q(\bm{g}_{u})d\bm{g}_{u}$, which approximates $p(\bm{g}|\bm{y})$. It is observed that the analytical bound $F_{V}$ depends only on $q(\bm{g}_{u})$ since we have "marginalized" $q(\bm{f}_{m})$ out.

Let us delve further into the terms of $F_{V}$ in (9):

  • The log term $\log\mathcal{N}(\bm{y}|\bm{0},\bm{\Sigma}_{y})$, where $\bm{\Sigma}_{y}=\bm{Q}_{nn}^{f}+\bm{R}_{g}$, is analogous to that in a standard GP. It achieves a bias-variance trade-off for both $f$ and $g$ by penalizing model complexity and low data likelihood [1].

  • The two trace terms act as a regularization to choose good inducing sets for $\bm{f}$ and $\bm{g}$, and guard against over-fitting. It is observed that $\mathrm{Tr}[\bm{K}_{nn}^{g}-\bm{Q}_{nn}^{g}]$ and $\mathrm{Tr}[\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f}]$ represent the total variances of the training conditionals $p(\bm{g}|\bm{g}_{u})$ and $p(\bm{f}|\bm{f}_{m})$, respectively. To maximize $F_{V}$, the traces should be very small, implying that $\bm{f}_{m}$ and $\bm{g}_{u}$ are very informative (i.e., sufficient statistics, also called variational compression [56]) for $\bm{f}$ and $\bm{g}$. Particularly, zero traces indicate that $\bm{f}_{m}=\bm{f}$ and $\bm{g}_{u}=\bm{g}$, thus recovering the variational HGP (VHGP) [24]. Besides, zero traces imply that the variances of $q(\bm{g}_{u})$ equal those of $p(\bm{g}_{u}|\bm{y})$.

  • The KL term is a constraint for rationalising $q(\bm{g}_{u})$. It is observed that minimizing the traces only pushes the variances of $q(\bm{g}_{u})$ towards those of $p(\bm{g}_{u}|\bm{y})$. To let the co-variances of $q(\bm{g}_{u})$ rationally approximate those of $p(\bm{g}_{u}|\bm{y})$, the KL term penalizes $q(\bm{g}_{u})$ so that it does not deviate significantly from the prior $p(\bm{g}_{u})$.

II-C Reparameterization and inference

In order to maximize the ELBO $F_{V}$ in (9), we need to infer $\omega=u+u(u+1)/2$ free variational parameters in $\bm{\mu}_{u}$ and $\bm{\Sigma}_{u}$. Assume that $u=0.01n$, then $\omega$ is larger than $n$ when the training size $n>2\times 10^{4}$, leading to a high-dimensional and non-trivial optimization task.

We observe that the derivatives of $F_{V}$ w.r.t. $\bm{\mu}_{u}$ and $\bm{\Sigma}_{u}$ are

\frac{\partial F_{V}}{\partial\bm{\mu}_{u}}=\frac{1}{2}(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}\bm{\Lambda}_{nn}^{ab}\bm{1}-(\bm{K}_{uu}^{g})^{-1}(\bm{\mu}_{u}-\mu_{0}\bm{1}),
\frac{\partial F_{V}}{\partial\bm{\Sigma}_{u}}=-\frac{1}{4}(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}(\bm{\Lambda}_{nn}^{ab}+\bm{I})\bm{\Omega}_{nu}^{g}+\frac{1}{2}[\bm{\Sigma}_{u}^{-1}-(\bm{K}_{uu}^{g})^{-1}],

where the diagonal matrix $\bm{\Lambda}_{nn}^{ab}=\bm{\Lambda}_{nn}^{a}+\bm{\Lambda}_{nn}^{b}$ with $\bm{\Lambda}_{nn}^{a}=(\bm{\Sigma}_{y}^{-1}\bm{y}\bm{y}^{\mathsf{T}}\bm{\Sigma}_{y}^{-1}-\bm{\Sigma}_{y}^{-1})\odot\bm{R}_{g}$, $\bm{\Lambda}_{nn}^{b}=(\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f})\odot\bm{R}_{g}^{-1}$, and the operator $\odot$ being the element-wise product. Hence, it is observed that the optimal $\bm{\mu}_{u}$ and $\bm{\Sigma}_{u}^{-1}$ satisfy

\bm{\mu}_{u}=0.5\,\bm{K}_{un}^{g}\bm{\Lambda}_{nn}^{ab}\bm{1}+\mu_{0}\bm{1},
\bm{\Sigma}_{u}^{-1}=0.5\,(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}(\bm{\Lambda}_{nn}^{ab}+\bm{I})\bm{\Omega}_{nu}^{g}+(\bm{K}_{uu}^{g})^{-1}.

Interestingly, we find that both the optimal $\bm{\mu}_{u}$ and $\bm{\Sigma}_{u}^{-1}$ depend on $\bm{\Lambda}_{nn}=0.5(\bm{\Lambda}_{nn}^{ab}+\bm{I})$, which is a positive semi-definite diagonal matrix; see the non-negativity proof in Appendix A. Hence, we re-parameterize $\bm{\mu}_{u}$ and $\bm{\Sigma}_{u}^{-1}$ in terms of $\bm{\Lambda}_{nn}$ as

\bm{\mu}_{u}=\bm{K}_{un}^{g}(\bm{\Lambda}_{nn}-0.5\bm{I})\bm{1}+\mu_{0}\bm{1}, (11a)
\bm{\Sigma}_{u}^{-1}=(\bm{K}_{uu}^{g})^{-1}+(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}\bm{\Lambda}_{nn}\bm{\Omega}_{nu}^{g}. (11b)

This re-parameterization eases the model inference by (i) reducing the number of variational parameters from $\omega$ to $n$, and (ii) limiting the new variational parameters $\bm{\Lambda}_{nn}$ to be non-negative, thus narrowing the search space.
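A minimal sketch of how $q(\bm{g}_{u})$ could be recovered from the non-negative diagonal $\bm{\Lambda}_{nn}$ via (11) is shown below; the variable names are illustrative, and dense inverses are used only for clarity.

```python
import numpy as np

def q_gu_from_lambda(Kun, Kuu, lam_diag, mu0, jitter=1e-6):
    """Eq. (11): mu_u = Kun (Lambda - 0.5 I) 1 + mu0 1,
                 Sigma_u^{-1} = Kuu^{-1} + Omega^T Lambda Omega, Omega = Knu Kuu^{-1} (n x u)."""
    u = Kuu.shape[0]
    Kuu_j = Kuu + jitter * np.eye(u)
    mu_u = Kun @ (lam_diag - 0.5) + mu0 * np.ones(u)   # lam_diag has shape (n,)
    Omega = np.linalg.solve(Kuu_j, Kun).T              # (Kuu^{-1} Kun)^T = Knu Kuu^{-1}
    Sigma_u_inv = np.linalg.inv(Kuu_j) + Omega.T @ (lam_diag[:, None] * Omega)
    Sigma_u = np.linalg.inv(Sigma_u_inv)
    return mu_u, Sigma_u
```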

So far, the bound $F_{V}$ depends on the variational parameters $\bm{\Lambda}_{nn}$, the kernel parameters $\bm{\theta}^{f}$ and $\bm{\theta}^{g}$, the mean parameter $\mu_{0}$ for $g$, and the inducing points $\bm{X}_{m}$ and $\bm{X}_{u}$. We maximize $F_{V}$ to infer all these hyperparameters $\bm{\zeta}=\{\bm{\Lambda}_{nn},\bm{\theta}^{f},\bm{\theta}^{g},\mu_{0},\bm{X}_{m},\bm{X}_{u}\}$ jointly for model selection. This non-linear optimization task can be solved via conjugate gradient descent (CGD), since the derivatives of $F_{V}$ w.r.t. these hyperparameters have closed forms; see Appendix B.

II-D Predictive distribution

The predictive distribution $p(y_{*}|\bm{y},\bm{x}_{*})$ at the test point $\bm{x}_{*}$ is approximated as

q(y_{*})=\int p(y_{*}|f_{*},g_{*})q(f_{*})q(g_{*})df_{*}dg_{*}. (12)

As for $q(f_{*})=\int p(f_{*}|\bm{f}_{m})q^{\star}(\bm{f}_{m})d\bm{f}_{m}$, we first calculate $q^{\star}(\bm{f}_{m})=\mathcal{N}(\bm{f}_{m}|\bm{\mu}_{f}^{\star},\bm{\Sigma}_{f}^{\star})$ from (8) as

\bm{\mu}_{f}^{\star}=\bm{K}_{mm}^{f}\bm{K}_{R}^{-1}\bm{K}_{mn}^{f}\bm{R}_{g}^{-1}\bm{y},\quad\bm{\Sigma}_{f}^{\star}=\bm{K}_{mm}^{f}\bm{K}_{R}^{-1}\bm{K}_{mm}^{f}, (13)

where $\bm{K}_{R}=\bm{K}_{mn}^{f}\bm{R}_{g}^{-1}\bm{K}_{nm}^{f}+\bm{K}_{mm}^{f}$. Using $q^{\star}(\bm{f}_{m})$, we have $q(f_{*})=\mathcal{N}(f_{*}|\mu_{f*},\sigma_{f*}^{2})$ with

\mu_{f*}=\bm{k}_{*m}^{f}\bm{K}_{R}^{-1}\bm{K}_{mn}^{f}\bm{R}_{g}^{-1}\bm{y}, (14a)
\sigma_{f*}^{2}=k_{**}^{f}-\bm{k}_{*m}^{f}(\bm{K}_{mm}^{f})^{-1}\bm{k}_{m*}^{f}+\bm{k}_{*m}^{f}\bm{K}_{R}^{-1}\bm{k}_{m*}^{f}. (14b)

It is interesting to find in (14b) that the correction term $\bm{k}_{*m}^{f}\bm{K}_{R}^{-1}\bm{k}_{m*}^{f}$ contains the heteroscedasticity information from the noise term $\bm{R}_{g}$. Hence, $q(f_{*})$ produces heteroscedastic variances over the input domain; see the illustrative example in Fig. 2(b). The heteroscedastic $\sigma^{2}_{f}(\bm{x}_{*})$ (i) eases the learning of $g$, and (ii) plays an auxiliary role, since the heteroscedasticity is mainly explained by $g$. Also, our VSHGP is believed to produce a better prediction mean $\mu_{f}(\bm{x}_{*})$ through the interaction between $f$ and $g$ in (14a), provided that $g$ is learned well; see the numerical experiments below.

Similarly, we have the predictive distribution $q(g_{*})=\int p(g_{*}|\bm{g}_{u})q(\bm{g}_{u})d\bm{g}_{u}=\mathcal{N}(g_{*}|\mu_{g*},\sigma_{g*}^{2})$ where, given $\bm{K}_{\Lambda}=\bm{K}_{un}^{g}\bm{\Lambda}_{nn}^{-1}\bm{K}_{nu}^{g}+\bm{K}_{uu}^{g}$,

\mu_{g*}=\bm{k}_{*u}^{g}(\bm{K}_{uu}^{g})^{-1}(\bm{\mu}_{u}-\mu_{0}\bm{1})+\mu_{0}\bm{1}, (15a)
\sigma^{2}_{g*}=k_{**}^{g}-\bm{k}_{*u}^{g}(\bm{K}_{uu}^{g})^{-1}\bm{k}_{u*}^{g}+\bm{k}_{*u}^{g}\bm{K}_{\Lambda}^{-1}\bm{k}_{u*}^{g}. (15b)

Finally, using the posteriors $q(f_{*})$ and $q(g_{*})$, and the likelihood $p(y_{*}|f_{*},g_{*})=\mathcal{N}(y_{*}|f_{*},e^{g_{*}})$, we have

q(y_{*})=\int\mathcal{N}(y_{*}|\mu_{f*},e^{g_{*}}+\sigma_{f*}^{2})\mathcal{N}(g_{*}|\mu_{g*},\sigma_{g*}^{2})dg_{*}, (16)

which is intractable and non-Gaussian. However, the integral can be approximated up to several digits using the Gauss-Hermite quadrature, resulting in the mean and variance as [24]

\mu_{*}=\mu_{f*},\quad\sigma^{2}_{*}=\sigma_{f*}^{2}+e^{\mu_{g*}+\sigma_{g*}^{2}/2}. (17)

In the final prediction variance, the variance $\sigma^{2}_{f*}$ represents the uncertainty about $f$ due to data density, and it approaches zero with increasing $n$; the exponential term captures the intrinsic heteroscedastic noise uncertainty.
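Given the moments of $q(f_{*})$ and $q(g_{*})$, the combination in (17) is straightforward; the following sketch (with our own function name) makes the log-normal correction of the noise term explicit.

```python
import numpy as np

def predictive_moments(mu_f, var_f, mu_g, var_g):
    """Eq. (17): mean = mu_f; variance = var_f + E[exp(g_*)] with g_* ~ N(mu_g, var_g)."""
    mu_y = mu_f
    var_y = var_f + np.exp(mu_g + 0.5 * var_g)  # log-normal mean of the heteroscedastic noise
    return mu_y, var_y
```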

It is notable that the unifying VSHGP includes VHGP [24] and the variational sparse GP (VSGP) [31] as special cases: when $\bm{f}_{m}=\bm{f}$ and $\bm{g}_{u}=\bm{g}$, VSHGP recovers VHGP; when $q(\bm{g}_{u})=p(\bm{g}_{u})$, i.e., we are facing a homoscedastic regression task, VSHGP degenerates to VSGP.

Overall, by introducing inducing sets for both $\bm{f}$ and $\bm{g}$, VSHGP is equipped with the means to handle large-scale heteroscedastic regression. However, (i) the current time complexity $\mathcal{O}(nm^{2}+nu^{2})$, which is linear in the training size, still makes VSHGP unaffordable for, e.g., millions of data points; and (ii) as a global approximation, the capability of VSHGP is limited by the small and global inducing sets.

To this end, we introduce below two strategies to further improve the scalability and capability of VSHGP.

III Stochastic VSHGP

To further improve the scalability of VSHGP, the variational distribution $q(\bm{f}_{m})=\mathcal{N}(\bm{f}_{m}|\bm{\mu}_{m},\bm{\Sigma}_{m})$ is re-introduced to use the original bound $F=\int q(\bm{z})\log\frac{p(\bm{z},\bm{y})}{q(\bm{z})}d\bm{z}$ in (7). Given the factorized likelihood $p(\bm{y}|\bm{f},\bm{g})=\prod_{i=1}^{n}p(y_{i}|f_{i},g_{i})$, the ELBO $F$ is

F=\sum_{i=1}^{n}\mathbb{E}_{q(f_{i})q(g_{i})}[\log p(y_{i}|f_{i},g_{i})]-\mathrm{KL}[q(\bm{f}_{m})||p(\bm{f}_{m})]-\mathrm{KL}[q(\bm{g}_{u})||p(\bm{g}_{u})], (18)

where the first expectation term is expressed as

\sum_{i=1}^{n}\mathbb{E}_{q(f_{i})q(g_{i})}[\log p(y_{i}|f_{i},g_{i})]=\sum_{i=1}^{n}\left[\log\mathcal{N}(y_{i}|[\bm{\mu}_{f}]_{i},[\bm{R}_{g}]_{ii})-\frac{1}{4}[\bm{\Sigma}_{g}]_{ii}-\frac{1}{2}[\bm{\Sigma}_{f}\bm{R}_{g}^{-1}]_{ii}\right],

with $\bm{\mu}_{f}=\bm{\Omega}_{nm}^{f}\bm{\mu}_{m}$ and $\bm{\Sigma}_{f}=\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f}+\bm{\Omega}_{nm}^{f}\bm{\Sigma}_{m}(\bm{\Omega}_{nm}^{f})^{\mathsf{T}}$.

The new $F$ is a relaxed version of $F_{V}$ in (9). It is found that the derivatives satisfy

\frac{\partial F}{\partial\bm{\mu}_{m}}=(\bm{\Omega}_{nm}^{f})^{\mathsf{T}}\bm{R}_{g}^{-1}(\bm{y}-\bm{\Omega}_{nm}^{f}\bm{\mu}_{m})-(\bm{K}_{mm}^{f})^{-1}\bm{\mu}_{m}, (19)
\frac{\partial F}{\partial\bm{\Sigma}_{m}}=-\frac{1}{2}(\bm{\Omega}_{nm}^{f})^{\mathsf{T}}\bm{R}_{g}^{-1}\bm{\Omega}_{nm}^{f}+\frac{1}{2}(\bm{\Sigma}_{m}^{-1}-(\bm{K}_{mm}^{f})^{-1}).

Setting the gradients to zero recovers the optimal solution $q^{\star}(\bm{f}_{m})$ in (13), indicating that $F_{V}\geq F$ with equality at $q(\bm{f}_{m})=q^{\star}(\bm{f}_{m})$.

The scalability is improved through the first term on the right-hand side of (18), which factorizes over data points. This sum form allows the use of efficient stochastic gradient descent (SGD), e.g., Adam [57], with a mini-batch mode for big data. Specifically, we choose a random subset $\mathcal{B}\subseteq\{1,\cdots,n\}$ to obtain an unbiased estimate of $F$ as

\widetilde{F}=\frac{n}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\mathbb{E}_{q(f_{i})q(g_{i})}[\log p(y_{i}|f_{i},g_{i})]-\mathrm{KL}[q(\bm{f}_{m})||p(\bm{f}_{m})]-\mathrm{KL}[q(\bm{g}_{u})||p(\bm{g}_{u})], (20)

where $|\mathcal{B}|\ll n$ is the mini-batch size. More efficiently, since the two variational distributions are defined in terms of KL divergence, we could optimize them along the natural gradients instead of the Euclidean gradients; see Appendix C.
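A schematic sketch of the unbiased mini-batch estimate (20) is given below; the callables for the per-point expectation and the two KL terms are placeholders standing in for the analytical expressions above.

```python
import numpy as np

def stochastic_elbo(expected_loglik, kl_fm, kl_gu, n, batch_idx):
    """Eq. (20): (n/|B|) * sum_{i in B} E_q[log p(y_i|f_i,g_i)] - KL[q(f_m)||p(f_m)] - KL[q(g_u)||p(g_u)].

    expected_loglik(i) returns the per-point expectation; kl_fm and kl_gu are scalars."""
    scale = n / float(len(batch_idx))                 # re-weight the mini-batch sum
    data_term = scale * sum(expected_loglik(i) for i in batch_idx)
    return data_term - kl_fm - kl_gu

# example with a dummy per-point term
rng = np.random.default_rng(0)
batch = rng.choice(10000, size=500, replace=False)
elbo = stochastic_elbo(lambda i: -0.5, kl_fm=1.2, kl_gu=0.8, n=10000, batch_idx=batch)
```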

Finally, the predictions of the stochastic VSHGP (SVSHGP) follow (17), with the predictive moments of $f_{*}$ and $g_{*}$ now given by

\mu_{f*}=\bm{k}_{*m}^{f}(\bm{K}_{mm}^{f})^{-1}\bm{\mu}_{m},
\sigma^{2}_{f*}=k_{**}^{f}-\bm{k}_{*m}^{f}(\bm{K}_{mm}^{f})^{-1}(\bm{K}_{mm}^{f}-\bm{\Sigma}_{m})(\bm{K}_{mm}^{f})^{-1}\bm{k}_{m*}^{f},

and

\mu_{g*}=\bm{k}_{*u}^{g}(\bm{K}_{uu}^{g})^{-1}(\bm{\mu}_{u}-\mu_{0}\bm{1})+\mu_{0}\bm{1},
\sigma^{2}_{g*}=k_{**}^{g}-\bm{k}_{*u}^{g}(\bm{K}_{uu}^{g})^{-1}(\bm{K}_{uu}^{g}-\bm{\Sigma}_{u})(\bm{K}_{uu}^{g})^{-1}\bm{k}_{u*}^{g}.

Compared to the deterministic VSHGP, the stochastic variant greatly reduces the time complexity from $\mathcal{O}(nm^{2}+nu^{2})$ to $\mathcal{O}(|\mathcal{B}|m^{2}+|\mathcal{B}|u^{2}+m^{3}+u^{3})$, at the cost of requiring more optimization effort in the enlarged parameter space: unlike VSHGP, SVSHGP cannot re-parameterize $\bm{\mu}_{u}$ and $\bm{\Sigma}_{u}$, and has to infer $m+m(m+1)/2$ more variational parameters in $\bm{\mu}_{m}$ and $\bm{\Sigma}_{m}$. Besides, the capability of SVSHGP, akin to VSHGP, is still limited by the finite number of global inducing points.

IV Distributed VSHGP

To further improve the scalability and capability of VSHGP via many inducing points, the distributed variant, named DVSHGP, combines VSHGP with local approximations, e.g., the Bayesian committee machine (BCM) [39, 40], to enable distributed learning and capture local variety.

IV-A Training experts with hybrid parameters

We first partition the training data $\mathcal{D}$ into $M$ subsets $\mathcal{D}_{i}=\{\bm{X}_{i},\bm{y}_{i}\}$, $1\leq i\leq M$. Then, we train a VSHGP expert $\mathcal{M}_{i}$ on $\mathcal{D}_{i}$ using the relevant inducing sets $\bm{X}_{m_{i}}$ and $\bm{X}_{u_{i}}$. Particularly, to obtain computational gains, an independence assumption is imposed on all the experts $\{\mathcal{M}_{i}\}_{i=1}^{M}$ such that $\log p(\bm{y};\bm{X},\bm{\zeta})$ is decomposed into the sum of $M$ individual terms

\log p(\bm{y};\bm{X},\bm{\zeta})\approx\sum_{i=1}^{M}\log p(\bm{y}_{i};\bm{X}_{i},\bm{\zeta}_{i})\geq\sum_{i=1}^{M}F_{V_{i}}, (21)

where $\bm{\zeta}_{i}$ is the hyperparameters of $\mathcal{M}_{i}$, and $F_{V_{i}}=F(q(\bm{g}_{u_{i}}))$ is the ELBO of $\mathcal{M}_{i}$. The factorization (21) allows the inversions to be calculated efficiently as $(\bm{K}_{mm}^{f})^{-1}\approx\mathrm{diag}[\{(\bm{K}^{f}_{m_{i}m_{i}})^{-1}\}_{i=1}^{M}]$ and $(\bm{K}_{uu}^{g})^{-1}\approx\mathrm{diag}[\{(\bm{K}^{g}_{u_{i}u_{i}})^{-1}\}_{i=1}^{M}]$.

We train these VSHGP experts with hybrid parameters. Specifically, the BCM-type aggregation requires sharing the priors $p(f_{*})$ and $p(g_{*})$ over the experts. That means we should share the hyperparameters, including $\bm{\theta}^{f}$, $\bm{\theta}^{g}$ and $\mu_{0}$, across experts. These global parameters are beneficial for guarding against over-fitting [40], at the cost, however, of degrading the capability. Hence, we leave the variational parameters $\bm{\Lambda}_{n_{i}n_{i}}$ and the inducing points $\bm{X}_{m_{i}}$ and $\bm{X}_{u_{i}}$ to be inferred individually by each expert. These local parameters improve the capturing of local variety by (i) pushing $q(\bm{g}_{u_{i}})$ towards the posterior $p(\bm{g}_{u_{i}}|\bm{y}_{i})$ of $\mathcal{M}_{i}$, and (ii) using many inducing points.

Besides, because of the local parameters, we prefer partitioning the data into disjoint experts rather than random experts as in [40]. The disjoint partition using clustering techniques produces local and separate experts, which are desirable for learning the relevant local parameters. In contrast, the random partition, which assigns points randomly to the subsets, produces global and overlapped experts, for which it is difficult to estimate the local parameters well. For instance, when DVSHGP uses random experts on the toy example below, it fails to capture the heteroscedastic noise.

Finally, supposing that each expert has the same training size $n_{0}=n/M$, the training complexity per expert is $\mathcal{O}(n_{0}m_{0}^{2}+n_{0}u_{0}^{2})$, where $m_{0}$ is the inducing size for $\bm{f}_{i}$ and $u_{0}$ that for $\bm{g}_{i}$. Due to the $M$ local experts, DVSHGP naturally offers parallel/distributed training, hence reducing the time complexity of VSHGP by a factor ideally close to the number of machines when $m_{0}=m$ and $u_{0}=u$.
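A minimal sketch of the disjoint partitioning that creates the $M$ experts is shown below, here using scikit-learn's KMeans as one possible clustering choice; the subsequent per-expert training call is deliberately omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def partition_experts(X, y, M, seed=0):
    """Disjoint k-means partition of the training data into M subsets, one per VSHGP expert."""
    labels = KMeans(n_clusters=M, random_state=seed, n_init=10).fit_predict(X)
    return [(X[labels == i], y[labels == i]) for i in range(M)]

# each subset (X_i, y_i) would then be handed to an independent VSHGP expert,
# e.g., trained in parallel over the available cores
subsets = partition_experts(np.random.randn(1000, 2), np.random.randn(1000), M=5)
```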

IV-B Aggregating experts’ predictions

For each VSHGP expert $\mathcal{M}_{i}$, we obtain the predictive distribution $q_{i}(y_{*})$ with the means $\{\mu_{f_{i}*},\mu_{g_{i}*},\mu_{i*}\}$ and variances $\{\sigma^{2}_{f_{i}*},\sigma^{2}_{g_{i}*},\sigma^{2}_{i*}\}$. Thereafter, we combine the experts' predictions to perform the final prediction by, for example, the robust BCM (RBCM) aggregation, which naturally supports distributed/parallel computing [40, 58].

The key to the success of the aggregation is that we do not directly combine the experts' predictions $\{\mu_{i*},\sigma^{2}_{i*}\}_{i=1}^{M}$. This is because (i) the RBCM aggregation of $\{q_{i}(y_{*})\}_{i=1}^{M}$ produces an inconsistent prediction variance with increasing $n$ and $M$ [41]; and (ii) the predictive distribution $q_{i}(y_{*})$ in (16) is non-Gaussian. (Note that the generalized RBCM strategy [41], which provides consistent predictions, is not favored by our DVSHGP with local parameters, since it requires (a) the experts' predictions to be Gaussian and (b) an additional global base expert.) To have a meaningful prediction variance, which is crucial for heteroscedastic regression, we instead perform the aggregation for the latent functions $f$ and $g$, respectively. This is because the prediction variances of $f$ and $g$ approach zero with increasing $n$, which agrees with the property of RBCM.

We first have the aggregated predictive distribution for $f_{*}$, given the prior $p(f_{*})=\mathcal{N}(f_{*}|0,\sigma_{f**}^{2}\triangleq k^{f}_{**})$, as

p_{\mathcal{A}}(f_{*}|\bm{y},\bm{x}_{*})=\frac{\prod_{i=1}^{M}[q_{i}(f_{*})]^{w^{f}_{i*}}}{[p(f_{*})]^{\sum_{i}w^{f}_{i*}-1}}, (22)

with the mean and variance expressed, respectively, as

\mu_{f_{\mathcal{A}}*}=\sigma_{f_{\mathcal{A}}*}^{2}\sum_{i=1}^{M}w^{f}_{i*}\sigma_{f_{i}*}^{-2}\mu_{f_{i}*},
\sigma_{f_{\mathcal{A}}*}^{-2}=\sum_{i=1}^{M}w^{f}_{i*}\sigma_{f_{i}*}^{-2}+\left(1-\sum_{i=1}^{M}w^{f}_{i*}\right)\sigma_{f**}^{-2},

where the weight $w^{f}_{i*}\geq 0$ represents the contribution of $\mathcal{M}_{i}$ at $\bm{x}_{*}$ for $f$, and is defined as the difference in differential entropy between the prior $p(f_{*})$ and the posterior $q_{i}(f_{*})$, i.e., $w^{f}_{i*}=0.5(\log\sigma_{f**}^{2}-\log\sigma_{f_{i}*}^{2})$. Similarly, for $g$, which explicitly considers a prior mean $\mu_{0}$, the aggregated predictive distribution is

p_{\mathcal{A}}(g_{*}|\bm{y},\bm{x}_{*})=\frac{\prod_{i=1}^{M}[q_{i}(g_{*})]^{w^{g}_{i*}}}{[p(g_{*})]^{\sum_{i}w^{g}_{i*}-1}}, (23)

with the mean and variance expressed, respectively, as

\mu_{g_{\mathcal{A}}*}=\sigma_{g_{\mathcal{A}}*}^{2}\left[\sum_{i=1}^{M}w^{g}_{i*}\sigma_{g_{i}*}^{-2}\mu_{g_{i}*}+\left(1-\sum_{i=1}^{M}w^{g}_{i*}\right)\sigma_{g**}^{-2}\mu_{0}\right],
\sigma_{g_{\mathcal{A}}*}^{-2}=\sum_{i=1}^{M}w^{g}_{i*}\sigma_{g_{i}*}^{-2}+\left(1-\sum_{i=1}^{M}w^{g}_{i*}\right)\sigma_{g**}^{-2},

where $\sigma_{g**}^{-2}$ is the prior precision of $g_{*}$, and $w^{g}_{i*}=0.5(\log\sigma_{g**}^{2}-\log\sigma_{g_{i}*}^{2})$ is the weight of $\mathcal{M}_{i}$ at $\bm{x}_{*}$ for $g$.
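The aggregation (22) is a weighted product of Gaussian experts corrected by the prior; a NumPy sketch for the latent $f$ at a single test point is given below (the $g$ case in (23) additionally weights the prior mean $\mu_{0}$), with illustrative variable names.

```python
import numpy as np

def rbcm_aggregate_f(mu_experts, var_experts, prior_var):
    """RBCM aggregation for f_* at one test point, Eq. (22).
    mu_experts, var_experts: arrays of shape (M,) with the experts' predictions;
    prior_var: prior variance k_{**}^f of f_*."""
    w = 0.5 * (np.log(prior_var) - np.log(var_experts))        # differential-entropy weights
    prec = np.sum(w / var_experts) + (1.0 - np.sum(w)) / prior_var
    var_agg = 1.0 / prec
    mu_agg = var_agg * np.sum(w * mu_experts / var_experts)
    return mu_agg, var_agg
```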

Figure 1: Hierarchy of the proposed DVSHGP model.

Thereafter, as shown in Fig. 1, the final predictions akin to (17) are combined as

\mu_{\mathcal{A}*}=\mu_{f_{\mathcal{A}}*},\quad\sigma^{2}_{\mathcal{A}*}=\sigma_{f_{\mathcal{A}}*}^{2}+e^{\mu_{g_{\mathcal{A}}*}+\sigma_{g_{\mathcal{A}}*}^{2}/2}. (24)

The hierarchical and localized computation structure enables (i) large-scale regression via distributed computations, and (ii) flexible approximation of slow-/quick-varying features by local experts and many inducing points (up to the training size $n$).

Figure 2: Illustration of DVSHGP on the toy example. The crosses marked in different colors in (a) represent the $M=5$ subsets. The top circles and bottom squares marked in different colors represent the optimized positions of the inducing points for $\{\bm{f}_{i}\}_{i=1}^{M}$ and $\{\bm{g}_{i}\}_{i=1}^{M}$, respectively. The red curves present the prediction mean, whereas the black curves represent the 95% confidence interval of the prediction mean.

Finally, we illustrate the DVSHGP on a heteroscedastic toy example expressed as

y(x)=\mathrm{sinc}(x)+\epsilon,\quad x\in[-10,10], (25)

where $\epsilon\sim\mathcal{N}(0,\sigma_{\epsilon}^{2}(x))$ and $\sigma_{\epsilon}(x)=0.05+0.2(1+\sin(2x))/(1+e^{-0.2x})$. We draw 500 training points from (25), and use the $k$-means technique to partition them into five disjoint subsets. We then employ ten inducing points for both $\bm{f}_{i}$ and $\bm{g}_{i}$ of each VSHGP expert $\mathcal{M}_{i}$, $1\leq i\leq 5$. Fig. 2 shows that (i) DVSHGP can efficiently employ up to 100 inducing points for modeling through the five local experts, and (ii) DVSHGP successfully describes the underlying function $f$ and the heteroscedastic log noise variance $g$.
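For reproducibility, the heteroscedastic toy data (25) can be generated as in the following sketch; the seed and sampling scheme are arbitrary choices of ours.

```python
import numpy as np

def toy_data(n=500, seed=0):
    """y(x) = sinc(x) + eps, eps ~ N(0, sigma_eps^2(x)), x in [-10, 10], Eq. (25)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10.0, 10.0, size=n)
    sigma_eps = 0.05 + 0.2 * (1.0 + np.sin(2.0 * x)) / (1.0 + np.exp(-0.2 * x))
    y = np.sinc(x / np.pi) + sigma_eps * rng.standard_normal(n)  # np.sinc is sin(pi x)/(pi x)
    return x, y
```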

V Discussions regarding implementation

V-A Implementation of DVSHGP

As for DVSHGP, we should infer (i) the global parameters, including the kernel parameters $\bm{\theta}^{f}$ and $\bm{\theta}^{g}$ and the mean $\mu_{0}$; and (ii) the local parameters, including the variational parameters $\bm{\Lambda}_{n_{i}n_{i}}$ and the inducing parameters $\bm{X}_{m_{i}}$ and $\bm{X}_{u_{i}}$, for the local experts $\{\mathcal{M}_{i}\}_{i=1}^{M}$.

Notably, the variational parameters $\bm{\Lambda}_{n_{i}n_{i}}$ are crucial for the success of DVSHGP, since they represent the heteroscedasticity. To learn them well, there are two issues: (i) how to initialize them and (ii) how to optimize them. As for initialization, let us focus on VSHGP, which is the foundation for the experts in DVSHGP. It is observed in (11) that $\bm{\Lambda}_{nn}$ directly determines the initialization of $q(\bm{g}_{u})=\mathcal{N}(\bm{g}_{u}|\bm{\mu}_{u},\bm{\Sigma}_{u})$. Given the prior $\bm{g}_{u}\sim\mathcal{N}(\bm{g}_{u}|\mu_{0}\bm{1},\bm{K}_{uu}^{g})$, we intuitively place the prior mean $\mu_{0}\bm{1}$ on $\bm{\mu}_{u}$, resulting in $\bm{\Lambda}_{nn}=0.5\bm{I}$. In contrast, if we initialize $[\bm{\Lambda}_{nn}]_{ii}$ with a value larger or smaller than $0.5$, the cumulative term $\bm{K}_{un}^{g}(\bm{\Lambda}_{nn}-0.5\bm{I})\bm{1}$ in (11a) moves far away from zero with increasing $n$, leading to an improper prior mean for $\bm{\mu}_{u}$. As for optimization, compared to the standard GP, our DVSHGP needs to additionally infer $n$ variational parameters and $M(m_{0}+u_{0})d$ inducing parameters, which greatly enlarge the parameter space and increase the optimization difficulty. Hence, we use an alternating strategy: we first optimize the variational parameters individually to roughly capture the heteroscedasticity, and then learn all the hyperparameters jointly.

Figure 3: The variational parameters learnt respectively by DVSHGP and VHGP on the toy problem. The crosses marked in different colors represent the variational parameters inferred for the five local experts of DVSHGP.

Fig. 3 depicts the variational parameters inferred by DVSHGP and the original VHGP [24], respectively, over the training points of the toy problem. It turns out that the variational parameters estimated by DVSHGP (i) generally agree with those of VHGP, and (ii) showcase local characteristics that are beneficial for describing local variety.

V-B Implementation of SVSHGP

As for SVSHGP, to effectively infer the variational parameters in $q(\bm{f}_{m})$ and $q(\bm{g}_{u})$, we adopt natural gradient descent (NGD), which, however, requires careful tuning of the step parameter $\gamma$ defined in Appendix C. For the Gaussian likelihood, the optimal choice is $\gamma=1.0$, since taking a unit step is equivalent to performing a VB update [34]. But for the stochastic case, empirical results suggest that $\gamma$ should be gradually increased to some fixed value. Hence, we follow the schedule in [59]: take $\gamma_{\mathrm{initial}}=0.0001$ and log-linearly increase $\gamma$ to $\gamma_{\mathrm{final}}=0.1$ over five iterations, and then keep $\gamma_{\mathrm{final}}$ for the remaining iterations.

Thereafter, we employ a hybrid strategy, called NGD+Adam, for optimization. Specifically, we perform a step of NGD on the variational parameters with the aforementioned $\gamma$ schedule, followed by a step of Adam on the remaining hyperparameters with a fixed step $\gamma=0.01$.
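The log-linear ramp of the NGD step size described above can be written as the following sketch; only the schedule itself is shown, the optimizer updates being omitted.

```python
import numpy as np

def ngd_step_schedule(n_iter, gamma_init=1e-4, gamma_final=0.1, ramp_iters=5):
    """Log-linearly increase gamma from gamma_init to gamma_final over ramp_iters iterations,
    then hold gamma_final for the remaining iterations."""
    ramp = np.logspace(np.log10(gamma_init), np.log10(gamma_final), ramp_iters)
    return np.concatenate([ramp, np.full(n_iter - ramp_iters, gamma_final)])

gammas = ngd_step_schedule(1000)  # gamma used for the NGD step at each iteration
```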

Figure 4: Illustration of SVSHGP using Adam and NGD+Adam respectively on the toy example. The dashed line represents the final ELBO of VSHGP.

Fig. 4 depicts the convergence histories of SVSHGP using Adam and NGD+Adam respectively on the toy example. We use $m=u=20$ inducing points and a mini-batch size of $|\mathcal{B}|=50$. As the ground truth, the final ELBO obtained by VSHGP is provided. It is observed that (i) NGD+Adam converges faster than pure Adam, and (ii) the stochastic optimizers finally approach the solution of the CGD used in VSHGP.

V-C DVSHGP vs. (S)VSHGP

Compared to the global (S)VSHGP, the performance of DVSHGP is enhanced by many inducing points and the localized experts with individual variational and inducing parameters, resulting in the capability of capturing quick-varying features. To verify this, we apply DVSHGP and (S)VSHGP to the time-series solar irradiance dataset [60] which contains quick-varying and heteroscedastic features.

Figure 5: Comparison of DVSHGP and VSHGP on the solar irradiance dataset.

In the comparison, DVSHGP employs the $k$-means technique to partition the 391 training points into $M=10$ subsets, and uses $m_{0}=u_{0}=20$ inducing points for each expert; (S)VSHGP employs $m=u=20$ inducing points. Particularly, for (S)VSHGP we initialize the length-scales of the SE kernel (3) for $k^{f}$ and $k^{g}$ with a relatively small value of 5.0 on this quick-varying dataset. Fig. 5 shows that (i) DVSHGP captures the quick-varying and heteroscedastic features successfully via local experts and many inducing points, whereas (ii) (S)VSHGP fails due to the small set of global inducing points. (Since VSHGP and SVSHGP show similar predictions on this dataset, we only illustrate the VSHGP predictions here.)

VI Numerical experiments

This section verifies the proposed DVSHGP and SVSHGP against existing scalable HGPs on a synthetic dataset and four real-world datasets. The comparison includes (i) GPVC [8], (ii) the distributed variant of PIC (dPIC) [33], (iii) FITC [49], and (iv) the SoD based empirical HGP (EHSoD) [13]. Besides, the comparison also employs the homoscedastic VSGP [31] and RBCM [40] to showcase the benefits brought by the consideration of heteroscedasticity.

We implement DVSHGP, FITC, EHSoD, VSGP and RBCM by the GPML toolbox [54], and implement SVSHGP by the GPflow toolbox [52]; we use the published GPVC codes available at https://github.com/OxfordML/GPz and the dPIC codes available at https://github.com/qminh93/dSGP_ICML16. They are executed on a personal computer with four 3.70 GHz cores and 16 GB RAM for the synthetic and three medium-sized datasets, and on a Linux workstation with eight 3.20 GHz cores and 32GB memory for the large-scale dataset.

All the GPs employ the SE kernel in (3). Both $\bm{X}$ and $\bm{y}$ are normalized to zero mean and unit variance before training. Finally, we use $n_{*}$ test points $\{\bm{X}_{*},\bm{y}_{*}\}$ to assess the model accuracy by the standardized mean square error (SMSE) and the mean standardized log loss (MSLL) [1]. The SMSE quantifies the discrepancy between the predictions and the exact observations; particularly, it equals one when the model always predicts the mean of $\bm{y}$. The MSLL quantifies the quality of the predictive distribution, and is negative for better models; particularly, it equals zero when the model always predicts the mean and variance of $\bm{y}$.
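The two criteria can be computed as in the sketch below, which is our own implementation of the standard definitions in [1], using the training targets for standardization.

```python
import numpy as np

def smse(y_test, mu_pred, y_train):
    """Standardized mean square error: MSE normalized by the variance of the training targets."""
    return np.mean((y_test - mu_pred) ** 2) / np.var(y_train)

def msll(y_test, mu_pred, var_pred, y_train):
    """Mean standardized log loss: negative log predictive density minus that of the
    trivial Gaussian model built from the training mean and variance."""
    nll_model = 0.5 * np.log(2 * np.pi * var_pred) + (y_test - mu_pred) ** 2 / (2 * var_pred)
    m0, v0 = np.mean(y_train), np.var(y_train)
    nll_trivial = 0.5 * np.log(2 * np.pi * v0) + (y_test - m0) ** 2 / (2 * v0)
    return np.mean(nll_model - nll_trivial)
```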

VI-A Synthetic dataset

We employ a 2D version of the toy example (25) as

y(\bm{x})=f(\bm{x})+\epsilon(\bm{x}),\quad\bm{x}\in[-10,10]^{2},

with the highly nonlinear latent function $f(\bm{x})=\mathrm{sinc}(0.1x_{1}x_{2})$ and noise $\epsilon(\bm{x})\sim\mathcal{N}(0,\sigma_{\epsilon}^{2}(0.1x_{1}x_{2}))$. We randomly generate 10,000 training points and evaluate the model accuracy on 4,900 grid test points. We generate ten instances of the training data such that each model is repeated ten times.

We have $M=50$ and $m_{0}=u_{0}=100$ for DVSHGP, resulting in $n_{0}=200$ data points assigned to each expert; we have $m_{\mathrm{b}}=300$ basis functions for GPVC; we have $m=300$ for SVSHGP, FITC and VSGP; we have $m=300$ and $M=50$ for dPIC; we have $M=50$ for RBCM; finally, we train two separate GPs on a subset of size $m_{\mathrm{sod}}=2{,}000$ for EHSoD. As for optimization, DVSHGP adopts a two-stage process: it first optimizes only the variational parameters using CGD with up to 30 line searches, and then learns all the parameters jointly using up to 70 line searches; SVSHGP trains with NGD+Adam using $|\mathcal{B}|=1{,}000$ over 1,000 iterations; VSGP, FITC, GPVC and RBCM use up to 100 line searches to learn the parameters; dPIC employs the default optimization settings in the published codes; and finally EHSoD uses up to 50 line searches to train each of the two standard GPs.

Figure 6: Comparison of GPs over ten runs on the synthetic dataset.

Fig. 6 depicts the modeling results of the different GPs over ten runs on the synthetic sinc2D dataset. (Since the dPIC codes only provide the prediction mean, we do not report its MSLL value or its estimated noise variance in the following plots.) The horizontal axis represents the sum of training and predicting time for a model. It turns out that DVSHGP, SVSHGP, dPIC, VSGP and RBCM are competitive in terms of SMSE, but DVSHGP and SVSHGP perform better in terms of MSLL due to the well estimated heteroscedastic noise. Compared to the homoscedastic VSGP and RBCM, FITC has heteroscedastic variances, as indicated by the lower MSLL, at the cost, however, of (i) sacrificing the accuracy of the prediction mean, and (ii) suffering from an invalid noise variance $\sigma^{2}_{\epsilon}$ (FITC estimates $\sigma^{2}_{\epsilon}$ as 0.0030, while VSGP estimates it as 0.0309). As a principled HGP, GPVC performs slightly better than FITC in terms of MSLL. Finally, the simple EHSoD has the worst SMSE, but it outperforms the homoscedastic VSGP and RBCM in terms of MSLL.

Figure 7: The exact noise variance and the prediction variances of different heteroscedastic/homoscedastic GPs on the synthetic dataset.

In terms of efficiency, the RBCM requires less computing time since it contains no variational/inducing parameters, thus resulting in (i) lower complexity, and (ii) early stop for optimization. This also happens for the three datasets below.

Finally, Fig. 7 depicts the prediction variances of all the GPs except dPIC in comparison to the exact $\sigma^{2}$ on the synthetic dataset. It is first observed that the homoscedastic VSGP and RBCM are unable to describe the complex noise variance: they yield a nearly constant variance over the input domain. In contrast, DVSHGP, SVSHGP and GPVC capture the varying noise variance accurately by using the additional noise process $g$; FITC also captures the profile of the exact $\sigma^{2}$, with however unstable peaks and valleys; EHSoD is found to capture only a rough profile of the exact $\sigma^{2}$.

VI-B Medium real-world datasets

This section conducts the comparison on three real-world datasets. The first is the 9D protein dataset [61] with 45,730 data points; taken from CASP 5-9, it describes the physicochemical properties of protein tertiary structure. The second is the 21D sarcos dataset [1], which relates to the inverse kinematics of a robot arm and has 48,933 data points. The third is the 3D 3droad dataset, which comprises 434,874 data points [62] extracted from a 2D road network in North Jutland, Denmark, plus elevation information.

VI-B1 The protein dataset

For the protein dataset, we randomly choose 35,000 training points and 10,730 test points. In the comparison, we have $M=100$ (i.e., $n_{0}=350$) and $m_{0}=u_{0}=175$ for DVSHGP; we have $m=400$ for SVSHGP, VSGP and FITC; we have $m=400$ and $M=100$ for dPIC; we have $m_{\mathrm{b}}=400$ for GPVC; we have $M=100$ for RBCM; and finally we have $m_{\mathrm{sod}}=4{,}000$ for EHSoD. As for optimization, SVSHGP trains with NGD+Adam using $|\mathcal{B}|=2{,}000$ over 2,000 iterations. The optimization settings of the other GPs remain consistent with those used for the synthetic dataset.

Figure 8: Comparison of GPs over ten runs on the protein dataset.

The results of different GPs over ten runs are summarized in Fig. 8. Among the HGPs, it is observed that dPIC outperforms the others in terms of SMSE, followed by DVSHGP. On the other hand, DVSHGP performs the best in terms of MSLL, followed by FITC and SVSHGP. The simple EHSoD is found to produce unstable MSLL results because of the small subset. Finally, the homoscedastic VSGP and RBCM provide mediocre SMSE and MSLL results.

Figure 9: The distributions of log noise variances estimated by different GPs on the protein dataset. The dashed and dotted lines indicate the log noise variances of VSGP and RBCM, respectively.

Next, Fig. 9 offers insights into the distributions of log noise variances of all the GPs except dPIC on the protein dataset for a single run. Note that (i) as homoscedastic GPs, the log noise variances of VSGP and RBCM are marked as dashed and dotted lines, respectively; and (ii) we plot the variance of p(f|𝒟,𝒙)p(f_{*}|\mathcal{D},\bm{x}_{*}) for FITC since (a) it accounts for the heteroscedasticity and (b) the scalar noise variance \sigma^{2}_{\epsilon} is severely underestimated. The results in Fig. 9 indicate that the protein dataset may contain heteroscedastic noise. Besides, compared to VSGP, which uses a global inducing set, the localized RBCM provides a more compact estimate of \sigma^{2}_{\epsilon}. This compact noise variance, which is also observed on the two datasets below, yields a lower MSLL for RBCM.

Furthermore, we clearly observe the interaction between ff and gg for DVSHGP, SVSHGP and GPVC. The small MSLL of RBCM suggests that the protein dataset may have small noise variances at some test points. Hence, the localized DVSHGP, which captures local variety through individual variational and inducing parameters for each expert, produces a longer tail in Fig. 9. The well estimated heteroscedastic noise in turn improves the prediction mean of DVSHGP through the interaction between ff and gg. In contrast, due to the limited global inducing set, SVSHGP and GPVC trade the accuracy of the prediction mean for capturing heteroscedastic noise.

Figure 10: The effect of algorithmic parameters on the performance of different sparse GPs on the protein dataset.

Notably, the performance of sparse GPs is affected by their modeling parameters, e.g., the inducing sizes m0m_{0}, mm, u0u_{0} and uu, the number of basis functions mbm_{\mathrm{b}}, and the subset size msodm_{\mathrm{sod}}. Fig. 10(a) and (b) depict the average results of sparse GPs over ten runs using different parameters. Particularly, we investigate the impact of the subset size n0n_{0} on DVSHGP in Fig. 10(c) using m0=u0=0.5n0m_{0}=u_{0}=0.5n_{0}. It is found that DVSHGP favours large n0n_{0} (small MM) and large m0m_{0} and u0u_{0}. Similarly, VSGP and FITC favour more inducing points. However, dPIC shows unstable SMSE performance with increasing mm; GPVC performs slightly worse with increasing mbm_{\mathrm{b}} in terms of both SMSE and MSLL, which has also been observed in the original paper [8] and may be caused by the sharing of basis functions for ff and gg. Finally, EHSoD shows poorer MSLL values when msod3000m_{\mathrm{sod}}\geq 3000 because of the difficulty of approximating the empirical variances.

VI-B2 The sarcos dataset

For the sarcos dataset, we randomly choose 40,000 training points and 8,933 test points. In the comparison, we have M=120M=120 (i.e., n0333n_{0}\approx 333) and m0=u0=175m_{0}=u_{0}=175 for DVSHGP; we have m=600m=600 for SVSHGP, VSGP and FITC; we have m=600m=600 and M=120M=120 for dPIC; we have mb=600m_{\mathrm{b}}=600 for GPVC; we have M=120M=120 for RBCM; and finally we have msod=4,000m_{\mathrm{sod}}=4,000 for EHSoD. The optimization settings are the same as before.

The results of different GPs over ten runs on the sarcos dataset are depicted in Fig. 11. Besides, Fig. 12 depicts the log noise variances of the GPs on this dataset. Different from the protein dataset, the sarcos dataset seems to exhibit only weak heteroscedastic noise across the input domain, which is verified by the facts that (i) the noise variance estimated by DVSHGP is nearly constant, and (ii) DVSHGP agrees with RBCM in terms of both SMSE and MSLL. Hence, all the HGPs except EHSoD perform similarly in terms of MSLL.

Figure 11: Comparison of GPs over ten runs on the sarcos dataset.
Figure 12: The distributions of log noise variances of different GPs on the sarcos dataset. The dashed and dotted lines indicate the log noise variances of VSGP and RBCM, respectively.

In addition, the weak heteroscedasticity in the sarcos dataset suggests that we can use only a few inducing points for 𝒈\bm{g} to speed up the inference. For instance, we retrain DVSHGP using u0=5u_{0}=5. This extremely small inducing set for 𝒈\bm{g} (i) greatly reduces the computing time, to about 350 seconds, and (ii) retains almost the same model accuracy, with SMSE = 0.0099 and MSLL = -2.3034.

VI-B3 The 3droad dataset

Finally, for the 3droad dataset, we randomly choose 390,000 training points, and use the remaining 44,874 data points for testing. In the comparison, we have M=800M=800 (i.e., n0487n_{0}\approx 487) and m0=u0=250m_{0}=u_{0}=250 for DVSHGP; we have m=500m=500 for SVSHGP, VSGP and FITC; we have m=500m=500 and M=800M=800 for dPIC; we have mb=500m_{\mathrm{b}}=500 for GPVC; we have M=800M=800 for RBCM; and finally we have msod=8,000m_{\mathrm{sod}}=8,000 for EHSoD. As for optimization, SVSHGP trains with NGD+Adam using ||=4,000|\mathcal{B}|=4,000 over 4,000 iterations. The optimization settings of the other GPs remain the same as before.

Figure 13: Comparison of GPs over ten runs on the 3droad dataset.

The results of different GPs over ten runs on the 3droad dataset are depicted in Fig. 13. It is observed that DVSHGP outperforms the others in terms of both SMSE and MSLL, followed by RBCM. For the other HGPs, especially SVSHGP and GPVC, the relatively poor noise variance estimate (large MSLL) in turn sacrifices the accuracy of the prediction mean. Even so, the heteroscedastic noise helps SVSHGP, GPVC and FITC perform similarly to VSGP in terms of MSLL.

In addition, Fig. 14 depicts the log noise variances of these GPs on the 3droad dataset. The highly accurate prediction mean of DVSHGP in turn helps it estimate the heteroscedastic noise well. It is observed that (i) the noise variances estimated by DVSHGP are more compact than those of the other HGPs; and (ii) its average noise variance agrees with that of RBCM.

Figure 14: The distributions of log noise variances of different GPs on the 3droad dataset. The dashed and dotted lines indicate the log noise variances of VSGP and RBCM, respectively.

Finally, the results on the 3droad dataset, together with those on the other two datasets, indicate that:

  • a well estimated noise variance of an HGP in turn improves the prediction mean via the interaction between ff and gg; otherwise, the accuracy of the prediction mean may be sacrificed;

  • modeling the heteroscedastic noise usually improves sparse HGPs over the homoscedastic VSGP in terms of MSLL.

VI-C Large real-world dataset

This final section evaluates the performance of different GPs on the 11D electric dataset (available at https://archive.ics.uci.edu/ml/index.php), which is partitioned into two million training points and 49,280 test points. The HGPs in the comparison include DVSHGP, SVSHGP, dPIC and EHSoD; GPVC and FITC are unaffordable for this massive dataset, and the stochastic variant of FITC [35] is not included since it does not support end-to-end training. Besides, RBCM and the stochastic variant of VSGP, named SVGP [34], are employed for this big dataset.

TABLE I: Comparison of GPs on the electric dataset.
        DVSHGP    SVSHGP    dPIC      EHSoD     SVGP      RBCM
SMSE    0.0020    0.0029    0.0042    0.0103    0.0028    0.0023
MSLL    -3.4456   -3.1207   -         -1.9453   -2.8489   -3.0647
t [h]   11.05     7.44      47.22     4.72      3.55      3.97

In the comparison, we have M=2,000M=2,000 (i.e., n0=1,000n_{0}=1,000) and m0=u0=300m_{0}=u_{0}=300 for DVSHGP; we have m=2,500m=2,500 for SVSHGP and SVGP; we have m=2,500m=2,500 and M=2,000M=2,000 for dPIC; we have M=2,000M=2,000 for RBCM; and finally we have msod=15,000m_{\mathrm{sod}}=15,000 for EHSoD. As for optimization, SVSHGP trains with NGD+Adam using ||=5,000|\mathcal{B}|=5,000 over 10,000 iterations. The optimization settings of the other GPs remain the same as before.

The average results over five runs in Table I indicate that DVSHGP outperforms the others in terms of both SMSE and MSLL, followed by SVSHGP. The simple EHSoD provides the worst performance, and cannot be improved by using a larger msodm_{\mathrm{sod}} due to the memory limit of the current infrastructure. Additionally, in terms of efficiency, we find that (i) SVSHGP is faster than DVSHGP owing to the parallel/GPU acceleration provided by TensorFlow (further GPU speedup could also be exploited for DVSHGP in Matlab); (ii) SVGP is faster than SVSHGP because of its lower complexity; and (iii) the huge computing time of dPIC might be incurred by its unoptimized code.

Figure 15: Illustration of (a) the computing time of DVSHGP vs. number of parallel cores, and (b) the performance of SVSHGP, from left to right, using ||=1,000|\mathcal{B}|=1,000, 2,500 and 5,000 on the electric dataset.

Finally, thanks to its distributed framework, DVSHGP can be accelerated over multiple processing cores; Fig. 15(a) depicts its total computing time using different numbers of cores. It is observed that DVSHGP using eight cores achieves a speedup of around 3.5 in comparison to its centralized counterpart. Fig. 15(b) further examines the performance of SVSHGP using varying mini-batch sizes |||\mathcal{B}|. It is observed that (i) a small |||\mathcal{B}| significantly speeds up the model training, and (ii) different mini-batch sizes yield similar SMSE and MSLL here, because the model has been optimized over sufficient iterations.
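For completeness, the sketch below illustrates the stochastic update that underlies this mini-batch behaviour: since the bound of SVSHGP factorizes over data points, a mini-batch gradient rescaled by n/|\mathcal{B}| is an unbiased estimate of the full-data gradient. The callables grad_lik and grad_kl are hypothetical placeholders for the data-dependent and KL terms of the bound, and a plain gradient step stands in for the NGD+Adam combination actually used.

import numpy as np

# A minimal sketch of stochastic training on a factorized bound, assuming
# hypothetical callables grad_lik(idx, params), the gradient of the data term
# on a mini-batch, and grad_kl(params), the gradient of the mini-batch
# independent KL term.
def svi_train(n, params, grad_lik, grad_kl, batch_size=1000, n_iter=1000, lr=1e-2):
    rng = np.random.default_rng(0)
    for _ in range(n_iter):
        idx = rng.choice(n, size=batch_size, replace=False)
        # Rescale the data term by n/|B| for an unbiased full-data gradient,
        # then add the KL term once, and take a gradient-ascent step.
        g = (n / batch_size) * grad_lik(idx, params) + grad_kl(params)
        params = params + lr * g
    return params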

VII Discussions and conclusions

In order to scale up the original HGP [16], we have presented distributed and stochastic variational sparse HGPs. The proposed SVSHGP improves the scalability through stochastic variational inference. The proposed DVSHGP (i) enables large-scale regression via distributed computations, and (ii) achieves high model capability via localized experts and many inducing points. We compared them to existing scalable homoscedastic/heteroscedastic GPs on a synthetic dataset and four real-world datasets. The comparative results indicate that DVSHGP exhibits superior performance in terms of both SMSE and MSLL, whereas SVSHGP, due to its limited global inducing set, may sacrifice the accuracy of the prediction mean for capturing heteroscedastic noise.

Apart from our scalable HGP framework, several new GP paradigms have recently been developed from different perspectives to improve the predictive distribution. For instance, instead of directly modeling the heteroscedastic noise, one could introduce additional latent inputs to modulate the kernel [63, 64]; directly target the posterior distribution of interest to enhance the prediction of ff [65]; adopt highly flexible priors over functions, e.g., implicit processes [66]; mix various GPs during inference [67, 68]; or develop specific non-stationary kernels [69]. These paradigms bring new interpretations, at the cost, however, of losing a direct description of heteroscedasticity or requiring complicated, high-complexity inference. Together with our scalable HGPs, they greatly facilitate future work on fitting data distributions with high quality and efficiency.

Finally, our future work will consider the heteroscedasticity in the underlying function ff, i.e., non-stationarity, as in [22, 18, 67]. The integration of various kinds of heteroscedasticity is believed to further improve predictions.

Appendix A Non-negativity of Λnn\Lambda_{nn}

The variational diagonal matrix 𝚲nn\bm{\Lambda}_{nn} is expressed as

𝚲nn=0.5(𝚲nna+𝚲nnb+𝑰).\bm{\Lambda}_{nn}=0.5(\bm{\Lambda}_{nn}^{a}+\bm{\Lambda}_{nn}^{b}+\bm{I}).

In order to prove the non-negativity of 𝚲nn\bm{\Lambda}_{nn}, it suffices to show that the diagonal elements of 𝚲nna+𝑰\bm{\Lambda}^{a}_{nn}+\bm{I} and 𝚲nnb\bm{\Lambda}^{b}_{nn} are non-negative, respectively.

First, the diagonal elements of 𝚲nnb\bm{\Lambda}^{b}_{nn} are given by

[𝚲nnb]ii=[(𝑲nnf𝑸nnf)𝑹g1]ii,1in,[\bm{\Lambda}_{nn}^{b}]_{ii}=[(\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f})\bm{R}_{g}^{-1}]_{ii},\quad 1\leq i\leq n,

where the diagonal elements of 𝑹g1\bm{R}_{g}^{-1} satisfy [𝑹g1]ii=e[𝚺g]ii/2[𝝁g]i>0[\bm{R}_{g}^{-1}]_{ii}=e^{[\bm{\Sigma}_{g}]_{ii}/2-[\bm{\mu}_{g}]_{i}}>0, 1in1\leq i\leq n; and the diagonal elements of 𝑲nnf𝑸nnf\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f} are the variances of training conditional p(𝒇|𝒇m)p(\bm{f}|\bm{f}_{m}). Therefore, 𝚲nnb\bm{\Lambda}_{nn}^{b} has non-negative diagonal elements.

Second, given 𝜷n=𝚺y1𝒚\bm{\beta}_{n}=\bm{\Sigma}_{y}^{-1}\bm{y}, the diagonal elements of 𝚲nna+𝑰\bm{\Lambda}^{a}_{nn}+\bm{I} are given by

[𝚲nna+𝑰]ii=[𝜷n𝜷n𝖳𝑹g𝚺y1𝑹g+𝑰]ii,1in.[\bm{\Lambda}_{nn}^{a}+\bm{I}]_{ii}=[\bm{\beta}_{n}\bm{\beta}_{n}^{\mathsf{T}}\bm{R}_{g}-\bm{\Sigma}_{y}^{-1}\bm{R}_{g}+\bm{I}]_{ii},\quad 1\leq i\leq n.

For 𝜷n𝜷n𝖳𝑹g\bm{\beta}_{n}\bm{\beta}_{n}^{\mathsf{T}}\bm{R}_{g}, the diagonal elements are non-negative. For 𝑰𝚺y1𝑹g\bm{I}-\bm{\Sigma}_{y}^{-1}\bm{R}_{g}, given the Cholesky decomposition 𝑲Λf=𝑳Λf(𝑳Λf)𝖳\bm{K}_{\Lambda}^{f}=\bm{L}_{\Lambda}^{f}(\bm{L}_{\Lambda}^{f})^{\mathsf{T}}, we have

𝑰𝚺y1𝑹g=\displaystyle\bm{I}-\bm{\Sigma}_{y}^{-1}\bm{R}_{g}= [𝑹g1𝑲nmf(𝑲Λf)1𝑲mnf𝑹g1]𝑹g\displaystyle[\bm{R}_{g}^{-1}\bm{K}_{nm}^{f}(\bm{K}_{\Lambda}^{f})^{-1}\bm{K}_{mn}^{f}\bm{R}_{g}^{-1}]\bm{R}_{g}
=\displaystyle= [𝑹g1𝑲nmf(𝑳Λf)𝖳(𝑳Λf)1𝑲mnf𝑹g1]𝑹g,\displaystyle[\bm{R}_{g}^{-1}\bm{K}_{nm}^{f}(\bm{L}_{\Lambda}^{f})^{-\mathsf{T}}(\bm{L}_{\Lambda}^{f})^{-1}\bm{K}_{mn}^{f}\bm{R}_{g}^{-1}]\bm{R}_{g},

indicating that the diagonal elements must be non-negative.

Hence, from the foregoing discussions, we know that 𝚲nn\bm{\Lambda}_{nn} is a non-negative diagonal matrix.
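As a quick numerical sanity check of this result, the sketch below builds the involved quantities from random data and verifies that the diagonal of \bm{\Lambda}_{nn} is non-negative. It is only an illustration, under the assumptions that \bm{\Sigma}_{y}=\bm{Q}_{nn}^{f}+\bm{R}_{g} with a diagonal, positive \bm{R}_{g} and that an RBF kernel is used; the random quantities are not taken from the paper's experiments.

import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 10

def rbf(A, B, ell=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / ell**2)

X, Xm = rng.normal(size=(n, 2)), rng.normal(size=(m, 2))
Knn, Knm, Kmm = rbf(X, X), rbf(X, Xm), rbf(Xm, Xm) + 1e-8 * np.eye(m)
Qnn = Knm @ np.linalg.solve(Kmm, Knm.T)       # Nystrom approximation Q_nn^f
Rg = np.diag(np.exp(rng.normal(size=n)))      # diagonal heteroscedastic variances
Sigma_y = Qnn + Rg
y = rng.normal(size=(n, 1))
beta = np.linalg.solve(Sigma_y, y)            # beta_n = Sigma_y^{-1} y

Lambda_a = beta @ beta.T @ Rg - np.linalg.solve(Sigma_y, Rg)
Lambda_b = (Knn - Qnn) @ np.linalg.inv(Rg)
Lambda = 0.5 * (Lambda_a + Lambda_b + np.eye(n))
assert np.all(np.diag(Lambda) >= -1e-8)       # non-negative up to round-off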

Appendix B Derivatives of FVF_{V} w.r.t. hyperparameters

Let 𝝀n=log(𝚲nn𝟏)\bm{\lambda}_{n}=\log(\bm{\Lambda}_{nn}\bm{1}) collect the nn variational parameters in log form to ensure non-negativity. The derivatives of FVF_{V} w.r.t. 𝝀n\bm{\lambda}_{n} are

FV𝝀n=\displaystyle\frac{\partial F_{V}}{\partial\bm{\lambda}_{n}}= 𝚲nn[12(𝑸nng+12𝑨nn)𝚲nnab𝟏+14𝑨nn𝟏\displaystyle\bm{\Lambda}_{nn}\left[\frac{1}{2}(\bm{Q}_{nn}^{g}+\frac{1}{2}\bm{A}_{nn})\bm{\Lambda}_{nn}^{ab}\bm{1}+\frac{1}{4}\bm{A}_{nn}\bm{1}\right.
12𝑨nn𝚲nn𝟏𝝁g+μ0𝟏],\displaystyle-\left.\frac{1}{2}\bm{A}_{nn}\bm{\Lambda}_{nn}\bm{1}-\bm{\mu}_{g}+\mu_{0}\bm{1}\right],

where 𝑨nn=(𝑲nug𝑲Λ1𝑲ung)2\bm{A}_{nn}=(\bm{K}_{nu}^{g}\bm{K}_{\Lambda}^{-1}\bm{K}_{un}^{g})^{\odot 2}, and the operator 2{\odot 2} denotes the element-wise square.

The derivatives of FVF_{V} w.r.t. the kernel parameters 𝜽f={θif}\bm{\theta}^{f}=\{\theta_{i}^{f}\} are

FVθif=12Tr[(𝜷n𝜷n𝖳𝚺y1+𝑹g1)𝑸nnfθif𝑹g1𝑲nnfθif].\frac{\partial F_{V}}{\partial\theta^{f}_{i}}=\frac{1}{2}\mathrm{Tr}\left[(\bm{\beta}_{n}\bm{\beta}_{n}^{\mathsf{T}}-\bm{\Sigma}_{y}^{-1}+\bm{R}_{g}^{-1})\frac{\partial\bm{Q}_{nn}^{f}}{\partial\theta^{f}_{i}}-\bm{R}_{g}^{-1}\frac{\partial\bm{K}_{nn}^{f}}{\partial\theta^{f}_{i}}\right].

Similarly, the derivatives of FVF_{V} w.r.t. the kernel parameters 𝜽g={θig}\bm{\theta}^{g}=\{\theta_{i}^{g}\} are

FVθig=\displaystyle\frac{\partial F_{V}}{\partial\theta^{g}_{i}}= 12Tr[𝝁gθig(𝟏𝖳𝚲nnab)12(𝚲nnab+𝑰)𝚺gθig]\displaystyle\frac{1}{2}\mathrm{Tr}\left[\frac{\partial\bm{\mu}_{g}}{\partial\theta^{g}_{i}}(\bm{1}^{\mathsf{T}}\bm{\Lambda}_{nn}^{ab})-\frac{1}{2}(\bm{\Lambda}_{nn}^{ab}+\bm{I})\frac{\partial\bm{\Sigma}_{g}}{\partial\theta^{g}_{i}}\right]
\displaystyle- 12Tr[𝑽uug𝑲uugθig(𝛀nug)𝖳𝚲nn𝛀nug𝚺uθig+2𝝁uθig𝜸u𝖳],\displaystyle\frac{1}{2}\mathrm{Tr}\left[\bm{V}_{uu}^{g}\frac{\partial\bm{K}_{uu}^{g}}{\partial\theta^{g}_{i}}-(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}\bm{\Lambda}_{nn}\bm{\Omega}_{nu}^{g}\frac{\partial\bm{\Sigma}_{u}}{\partial\theta^{g}_{i}}+2\frac{\partial\bm{\mu}_{u}}{\partial\theta^{g}_{i}}\bm{\gamma}_{u}^{\mathsf{T}}\right],

where 𝜸u=(𝑲uug)1(𝝁uμ0𝟏)\bm{\gamma}_{u}=(\bm{K}_{uu}^{g})^{-1}(\bm{\mu}_{u}-\mu_{0}\bm{1}) and 𝑽uug=(𝑲uug)1𝜸u𝜸u𝖳(𝑲uug)1𝚺u(𝑲uug)1\bm{V}_{uu}^{g}=(\bm{K}_{uu}^{g})^{-1}-\bm{\gamma}_{u}\bm{\gamma}_{u}^{\mathsf{T}}-(\bm{K}_{uu}^{g})^{-1}\bm{\Sigma}_{u}(\bm{K}_{uu}^{g})^{-1}.

The derivative of FVF_{V} w.r.t. the mean parameter μ0\mu_{0} of gg is

FVμ0=12Tr(𝚲nnab).\frac{\partial F_{V}}{\partial\mu_{0}}=\frac{1}{2}\mathrm{Tr}(\bm{\Lambda}_{nn}^{ab}).

Finally, we calculate the derivatives of FVF_{V} w.r.t. the inducing points 𝑿m\bm{X}_{m} and 𝑿u\bm{X}_{u}. Since the inducing points are involved in the kernel matrices, we get the derivatives 𝑲nmf/xijf\partial\bm{K}^{f}_{nm}/\partial x^{f}_{ij}, 𝑲mnf/xijf\partial\bm{K}^{f}_{mn}/\partial x^{f}_{ij}, 𝑲mmf/xijf\partial\bm{K}^{f}_{mm}/\partial x^{f}_{ij}, 𝑲nug/xijg\partial\bm{K}^{g}_{nu}/\partial x^{g}_{ij}, 𝑲ung/xijg\partial\bm{K}^{g}_{un}/\partial x^{g}_{ij}, and 𝑲uug/xijg\partial\bm{K}^{g}_{uu}/\partial x^{g}_{ij}, where xijf=[𝑿m]ijx^{f}_{ij}=[\bm{X}_{m}]_{ij} and xijg=[𝑿u]ijx^{g}_{ij}=[\bm{X}_{u}]_{ij}. We first obtain the derivatives of FVF_{V} w.r.t. 𝑿m\bm{X}_{m} as

FVxijf=2Tr[𝑲nmfxijf𝑨mnf]+Tr[𝑲mmfxijf𝑨mmf],\frac{\partial F_{V}}{\partial x^{f}_{ij}}=2\mathrm{Tr}\left[\frac{\partial\bm{K}^{f}_{nm}}{\partial x^{f}_{ij}}\bm{A}^{f}_{mn}\right]+\mathrm{Tr}\left[\frac{\partial\bm{K}^{f}_{mm}}{\partial x^{f}_{ij}}\bm{A}^{f}_{mm}\right],

where 𝑨mnf=0.5(𝛀nmf)𝖳(𝜷n𝜷n𝖳𝚺y1+𝑹g1)\bm{A}_{mn}^{f}=0.5(\bm{\Omega}_{nm}^{f})^{\mathsf{T}}(\bm{\beta}_{n}\bm{\beta}_{n}^{\mathsf{T}}-\bm{\Sigma}_{y}^{-1}+\bm{R}_{g}^{-1}), and 𝑨mmf=𝑨mnf𝛀nmf\bm{A}_{mm}^{f}=-\bm{A}_{mn}^{f}\bm{\Omega}_{nm}^{f}. Similarly, the derivatives of FVF_{V} w.r.t. 𝑿u\bm{X}_{u} are given by

FVxijg=Tr[𝑲nugxijg𝑨ung]+Tr[𝑨nug𝑲ungxijg]+Tr[𝑲uugxijg𝑨uug].\frac{\partial F_{V}}{\partial x^{g}_{ij}}=\mathrm{Tr}\left[\frac{\partial\bm{K}^{g}_{nu}}{\partial x^{g}_{ij}}\bm{A}^{g}_{un}\right]+\mathrm{Tr}\left[\bm{A}^{g}_{nu}\frac{\partial\bm{K}^{g}_{un}}{\partial x^{g}_{ij}}\right]+\mathrm{Tr}\left[\frac{\partial\bm{K}^{g}_{uu}}{\partial x^{g}_{ij}}\bm{A}^{g}_{uu}\right].

For 𝑨ung\bm{A}_{un}^{g} in FV/xijg\partial F_{V}/\partial x^{g}_{ij}, we have

𝑨ung=\displaystyle\bm{A}^{g}_{un}= 0.5(𝛀nug)𝖳(𝚲nn0.5𝑰)𝟏(𝟏𝖳𝚲nnab)𝑻1\displaystyle\underbrace{0.5(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}(\bm{\Lambda}_{nn}-0.5\bm{I})\bm{1}(\bm{1}^{\mathsf{T}}\bm{\Lambda}_{nn}^{ab})}_{\bm{T}_{1}}
+0.25(𝑯uu𝑲ung𝚲nn[𝛀nuΛ+𝛀nug]𝖳(𝚲nnab+𝑰))𝑻2\displaystyle+\underbrace{0.25\left(\bm{H}_{uu}\bm{K}_{un}^{g}\bm{\Lambda}_{nn}-[\bm{\Omega}_{nu}^{\Lambda}+\bm{\Omega}_{nu}^{g}]^{\mathsf{T}}(\bm{\Lambda}_{nn}^{ab}+\bm{I})\right)}_{\bm{T}_{2}}
+0.5(𝑱uu𝑲ung𝚲nn𝜸u𝟏𝖳(𝚲nn0.5𝑰))𝑻3.\displaystyle+\underbrace{0.5\left(\bm{J}_{uu}\bm{K}_{un}^{g}\bm{\Lambda}_{nn}-\bm{\gamma}_{u}\bm{1}^{\mathsf{T}}(\bm{\Lambda}_{nn}-0.5\bm{I})\right)}_{\bm{T}_{3}}.

where 𝑯uu=(𝛀nuΛ)𝖳(𝚲nnab+𝑰)𝛀nuΛ\bm{H}_{uu}=(\bm{\Omega}_{nu}^{\Lambda})^{\mathsf{T}}(\bm{\Lambda}_{nn}^{ab}+\bm{I})\bm{\Omega}_{nu}^{\Lambda}, and 𝑱uu=𝑲Λ1𝑲uug((𝑲uug)1𝚺u1)𝑲uug𝑲Λ1\bm{J}_{uu}=\bm{K}_{\Lambda}^{-1}\bm{K}_{uu}^{g}((\bm{K}_{uu}^{g})^{-1}-\bm{\Sigma}_{u}^{-1})\bm{K}_{uu}^{g}\bm{K}_{\Lambda}^{-1}. For 𝑨nug\bm{A}_{nu}^{g}, we have

𝑨nug=0.5(𝚲nn0.5𝑰)𝟏(𝟏𝖳𝚲nnab)𝛀nug+𝑻2𝖳+𝑻3𝖳.\bm{A}^{g}_{nu}=0.5(\bm{\Lambda}_{nn}-0.5\bm{I})\bm{1}(\bm{1}^{\mathsf{T}}\bm{\Lambda}_{nn}^{ab})\bm{\Omega}_{nu}^{g}+\bm{T}_{2}^{\mathsf{T}}+\bm{T}_{3}^{\mathsf{T}}.

For 𝑨uug\bm{A}_{uu}^{g}, we have

𝑨uug=\displaystyle\bm{A}_{uu}^{g}= 𝑻1𝛀nug0.25((𝛀nug)𝖳(𝚲nnab+𝑰)𝛀nug𝑯uu)\displaystyle-\bm{T}_{1}\bm{\Omega}_{nu}^{g}-0.25\left((\bm{\Omega}_{nu}^{g})^{\mathsf{T}}(\bm{\Lambda}_{nn}^{ab}+\bm{I})\bm{\Omega}_{nu}^{g}-\bm{H}_{uu}\right)
+0.5(𝑷uu+𝑷uu𝖳+𝑽uug𝑱uu),\displaystyle+0.5\left(\bm{P}_{uu}+\bm{P}_{uu}^{\mathsf{T}}+\bm{V}_{uu}^{g}-\bm{J}_{uu}\right),

where 𝑷uu=𝑲Λ1𝑲uug((𝑲uug)1𝚺u1)\bm{P}_{uu}=\bm{K}_{\Lambda}^{-1}\bm{K}_{uu}^{g}((\bm{K}_{uu}^{g})^{-1}-\bm{\Sigma}_{u}^{-1}).

The calculation of FV/xijf\partial F_{V}/\partial x^{f}_{ij} and FV/xijg\partial F_{V}/\partial x^{g}_{ij} requires a loop over the m×dm\times d and u×du\times d parameters of the inducing points, which is quite slow even for moderate mm, uu and dd. Fortunately, the derivative 𝑲/xijf(g)\partial\bm{K}/\partial x_{ij}^{f(g)} has only nn or mm (uu) non-zero elements. Owing to this sparsity, the derivative computations can be vectorized such that the derivatives w.r.t. all the inducing points along a specific dimension are obtained at once, as sketched below.
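As an illustration of this vectorized computation, the sketch below assembles, in a single pass, the only non-zero column of each derivative \partial\bm{K}_{nm}^{f}/\partial[\bm{X}_{m}]_{ij} along one input dimension j. It is a sketch only, assuming a squared-exponential kernel with lengthscale ell; the function names are hypothetical and do not come from the paper's Matlab code.

import numpy as np

def rbf(X, Z, s2=1.0, ell=1.0):
    # k(x, z) = s2 * exp(-||x - z||^2 / (2 ell^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * X @ Z.T
    return s2 * np.exp(-0.5 * d2 / ell**2)

def dKnm_dXm_dim(X, Xm, j, s2=1.0, ell=1.0):
    """Column i of the result is the only non-zero column of dK_nm/d[X_m]_{ij}."""
    K = rbf(X, Xm, s2, ell)
    diff_j = X[:, j][:, None] - Xm[:, j][None, :]   # (n, m) differences along dim j
    return K * diff_j / ell**2                      # d k(x_k, z_i) / d z_{i j}

The trace terms then collapse onto this single column: with D = dKnm_dXm_dim(X, Xm, j), the gradients w.r.t. all m inducing points along dimension j are obtained as np.einsum('ki,ik->i', D, A_mn), avoiding the explicit loop over i.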

Appendix C Natural gradients of q(𝒇m)q(\bm{f}_{m}) and q(𝒈u)q(\bm{g}_{u})

For exponential family distributions parameterized by natural parameters 𝜽\bm{\theta}, the probability density function (PDF) is p(𝒙)=h(𝒙)e𝜽𝖳𝒕(𝒙)A(𝜽)p(\bm{x})=h(\bm{x})e^{\bm{\theta}^{\mathsf{T}}\bm{t}(\bm{x})-A(\bm{\theta})}, where h(𝒙)h(\bm{x}) is the underlying measure, 𝒕(𝒙)\bm{t}(\bm{x}) is the sufficient statistic, and A(𝜽)A(\bm{\theta}) is the log normalizer; the expectation parameters are defined as 𝝍=𝔼p(𝒙)[𝒕(𝒙)]\bm{\psi}=\mathbb{E}_{p(\bm{x})}[\bm{t}(\bm{x})]. We update the natural parameters using natural gradients as

𝜽(t+1)=𝜽(t)γ(t)𝑮𝜽(t)1F𝜽(t)=𝜽(t)γ(t)F𝝍(t),\bm{\theta}_{(t+1)}=\bm{\theta}_{(t)}-\gamma_{(t)}\bm{G}_{\bm{\theta}_{(t)}}^{-1}\frac{\partial F}{\partial\bm{\theta}_{(t)}}=\bm{\theta}_{(t)}-\gamma_{(t)}\frac{\partial F}{\partial\bm{\psi}_{(t)}},

where FF is the objective function, and 𝑮𝜽=𝝍(t)/𝜽(t)\bm{G}_{\bm{\theta}}=\partial\bm{\psi}_{(t)}/\partial\bm{\theta}_{(t)} is the Fisher information matrix, with 𝝍\bm{\psi} being the expectation parameters of the exponential family distribution.

For q(𝒈u)=𝒩(𝒈u|𝝁u,𝚺u)q(\bm{g}_{u})=\mathcal{N}(\bm{g}_{u}|\bm{\mu}_{u},\bm{\Sigma}_{u}), its natural parameters are 𝜽\bm{\theta} which are partitioned into two components

𝜽1=𝚺u1𝝁u,𝚯2=12𝚺u1,\bm{\theta}_{1}=\bm{\Sigma}_{u}^{-1}\bm{\mu}_{u},\quad\bm{\Theta}_{2}=-\frac{1}{2}\bm{\Sigma}_{u}^{-1},

where 𝜽1\bm{\theta}_{1} comprises the first uu elements of 𝜽\bm{\theta}, and 𝚯2\mathbf{\Theta}_{2} comprises the remaining elements reshaped into a square matrix. Accordingly, the expectation parameters 𝝍\bm{\psi} are divided as

𝝍1=𝝁u,𝚿2=𝝁u𝝁u𝖳+𝚺u.\bm{\psi}_{1}=\bm{\mu}_{u},\quad\bm{\Psi}_{2}=\bm{\mu}_{u}\bm{\mu}_{u}^{\mathsf{T}}+\bm{\Sigma}_{u}.

Thereafter, we update the natural parameters with step γ(t)\gamma_{(t)} as

𝜽1(t+1)\displaystyle\bm{\theta}_{1_{(t+1)}} =𝚺u(t)1𝝁u(t)γ(t)F𝝍1(t),\displaystyle=\bm{\Sigma}_{u_{(t)}}^{-1}\bm{\mu}_{u_{(t)}}-\gamma_{(t)}\frac{\partial F}{\partial\bm{\psi}_{1_{(t)}}},
𝚯2(t+1)\displaystyle\bm{\Theta}_{2_{(t+1)}} =12𝚺u(t)1γ(t)F𝚿2(t),\displaystyle=-\frac{1}{2}\bm{\Sigma}_{u_{(t)}}^{-1}-\gamma_{(t)}\frac{\partial F}{\partial\bm{\Psi}_{2_{(t)}}},

where F/𝝍1(t)=F/𝝁u(t)\partial F/\partial\bm{\psi}_{1_{(t)}}=\partial F/\partial\bm{\mu}_{u_{(t)}} and F/𝚿2(t)=F/𝚺u(t)\partial F/\partial\bm{\Psi}_{2_{(t)}}=\partial F/\partial\bm{\Sigma}_{u_{(t)}}. The derivatives F/𝝁u\partial F/\partial\bm{\mu}_{u} and F/𝚺u\partial F/\partial\bm{\Sigma}_{u} are respectively expressed as

F𝝁u=\displaystyle\frac{\partial F}{\partial\bm{\mu}_{u}}= 12(𝛀nug)𝖳𝚲nnab𝟏(𝑲uug)1(𝝁uμ0𝟏),\displaystyle\frac{1}{2}(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}\bm{\Lambda}_{nn}^{a^{\prime}b}\bm{1}-(\bm{K}_{uu}^{g})^{-1}(\bm{\mu}_{u}-\mu_{0}\bm{1}),
F𝚺u=\displaystyle\frac{\partial F}{\partial\bm{\Sigma}_{u}}= 14(𝛀nug)𝖳(𝚲nnab+𝑰)𝛀nug+12[𝚺u1(𝑲uug)1],\displaystyle-\frac{1}{4}(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}(\bm{\Lambda}_{nn}^{a^{\prime}b}+\bm{I})\bm{\Omega}_{nu}^{g}+\frac{1}{2}[\bm{\Sigma}_{u}^{-1}-(\bm{K}_{uu}^{g})^{-1}],

where 𝚲nnab=𝚲nna+𝚲nnb\bm{\Lambda}_{nn}^{a^{\prime}b}=\bm{\Lambda}_{nn}^{a^{\prime}}+\bm{\Lambda}_{nn}^{b}, and 𝚲nna\bm{\Lambda}_{nn}^{a^{\prime}} is a diagonal matrix with the diagonal element being

[𝚲nna]ii=[𝑹g1(𝒚𝛀nmf𝝁m)(𝒚𝛀nmf𝝁m)𝖳𝑰]ii.[\bm{\Lambda}_{nn}^{a^{\prime}}]_{ii}=[\bm{R}_{g}^{-1}(\bm{y}-\bm{\Omega}_{nm}^{f}\bm{\mu}_{m})(\bm{y}-\bm{\Omega}_{nm}^{f}\bm{\mu}_{m})^{\mathsf{T}}-\bm{I}]_{ii}.

For q(𝒇m)=𝒩(𝒇m|𝝁m,𝚺m)q(\bm{f}_{m})=\mathcal{N}(\bm{f}_{m}|\bm{\mu}_{m},\bm{\Sigma}_{m}), the updates of 𝝁m(t+1)\bm{\mu}_{m_{(t+1)}} and 𝚺m(t+1)\bm{\Sigma}_{m_{(t+1)}} follow the foregoing steps, with the derivatives F/𝝁m\partial F/\partial\bm{\mu}_{m} and F/𝚺m\partial F/\partial\bm{\Sigma}_{m} given by (19).
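To summarize the updates of this appendix operationally, the sketch below performs one natural-gradient step for a Gaussian variational posterior. It is illustrative only: the callables grad_psi1 and grad_Psi2 are hypothetical placeholders returning the gradients of the objective w.r.t. the expectation parameters, and the descent direction follows the update written above (for ascent on a lower bound, the sign of the step is flipped).

import numpy as np

# One natural-gradient update for q(g_u) = N(mu, Sigma): move the natural
# parameters along the gradient taken w.r.t. the expectation parameters.
def natural_gradient_step(mu, Sigma, grad_psi1, grad_Psi2, gamma=0.1):
    Sigma_inv = np.linalg.inv(Sigma)

    # Natural parameters: theta1 = Sigma^{-1} mu, Theta2 = -0.5 * Sigma^{-1}.
    theta1 = Sigma_inv @ mu
    Theta2 = -0.5 * Sigma_inv

    theta1_new = theta1 - gamma * grad_psi1(mu, Sigma)
    Theta2_new = Theta2 - gamma * grad_Psi2(mu, Sigma)

    # Map back to the moment parameterization: Sigma = -0.5 Theta2^{-1}, mu = Sigma theta1.
    Sigma_new = -0.5 * np.linalg.inv(Theta2_new)
    mu_new = Sigma_new @ theta1_new
    return mu_new, Sigma_new

In practice the step size is kept small enough that Theta2_new remains negative definite, so that Sigma_new stays a valid covariance matrix.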

Acknowledgment

This work is funded by the National Research Foundation, Singapore under its AI Singapore programme [Award No.: AISG-RP-2018-004], the Data Science and Artificial Intelligence Research Center (DSAIR) at Nanyang Technological University and supported under the Rolls-Royce@NTU Corporate Lab.

References

  • [1] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning. MIT Press, 2006.
  • [2] N. Lawrence, “Probabilistic non-linear principal component analysis with Gaussian process latent variable models,” Journal of Machine Learning Research, vol. 6, no. Nov, pp. 1783–1816, 2005.
  • [3] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, “Taking the human out of the loop: A review of Bayesian optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
  • [4] H. Liu, J. Cai, and Y. Ong, “Remarks on multi-output Gaussian process regression,” Knowledge-Based Systems, vol. 144, no. March, pp. 102–121, 2018.
  • [5] B. Settles, “Active learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, pp. 1–114, 2012.
  • [6] P. Kou, D. Liang, L. Gao, and J. Lou, “Probabilistic electricity price forecasting with variational heteroscedastic Gaussian process and active learning,” Energy Conversion and Management, vol. 89, pp. 298–308, 2015.
  • [7] M. Lázaro-Gredilla, M. K. Titsias, J. Verrelst, and G. Camps-Valls, “Retrieval of biophysical parameters with heteroscedastic Gaussian processes,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 4, pp. 838–842, 2014.
  • [8] I. A. Almosallam, M. J. Jarvis, and S. J. Roberts, “GPz: non-stationary sparse Gaussian processes for heteroscedastic uncertainty estimation in photometric redshifts,” Monthly Notices of the Royal Astronomical Society, vol. 462, no. 1, pp. 726–739, 2016.
  • [9] M. Bauza and A. Rodriguez, “A probabilistic data-driven model for planar pushing,” in International Conference on Robotics and Automation, 2017, pp. 3008–3015.
  • [10] A. J. Smith, M. AlAbsi, and T. Fields, “Heteroscedastic Gaussian process-based system identification and predictive control of a quadcopter,” in AIAA Atmospheric Flight Mechanics Conference, 2018, p. 0298.
  • [11] J. Kirschner and A. Krause, “Information directed sampling and bandits with heteroscedastic noise,” in Proceedings of the 31st Conference On Learning Theory, 2018, pp. 358–384.
  • [12] A. Kendall and Y. Gal, “What uncertainties do we need in Bayesian deep learning for computer vision?” in Advances in Neural Information Processing Systems, 2017, pp. 5574–5584.
  • [13] S. Urban, M. Ludersdorfer, and P. Van Der Smagt, “Sensor calibration and hysteresis compensation with heteroscedastic Gaussian processes,” IEEE Sensors Journal, vol. 15, no. 11, pp. 6498–6506, 2015.
  • [14] F. C. Pereira, C. Antoniou, J. A. Fargas, and M. Ben-Akiva, “A metamodel for estimating error bounds in real-time traffic prediction systems,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 3, pp. 1310–1322, 2014.
  • [15] E. Snelson and Z. Ghahramani, “Variable noise and dimensionality reduction for sparse Gaussian processes,” in Uncertainty in Artificial Intelligence, 2006, pp. 461–468.
  • [16] P. W. Goldberg, C. K. Williams, and C. M. Bishop, “Regression with input-dependent noise: A Gaussian process treatment,” in Advances in Neural Information Processing Systems, 1998, pp. 493–499.
  • [17] L. Muñoz-González, M. Lázaro-Gredilla, and A. R. Figueiras-Vidal, “Divisive Gaussian processes for nonstationary regression,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 11, pp. 1991–2003, 2014.
  • [18] L. Munoz-Gonzalez, M. Lázaro-Gredilla, and A. R. Figueiras-Vidal, “Laplace approximation for divisive Gaussian processes for nonstationary regression,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 618–624, 2016.
  • [19] A. D. Saul, J. Hensman, A. Vehtari, and N. D. Lawrence, “Chained Gaussian processes,” in Artificial Intelligence and Statistics, 2016, pp. 1431–1440.
  • [20] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard, “Most likely heteroscedastic Gaussian process regression,” in International Conference on Machine Learning, 2007, pp. 393–400.
  • [21] N. Quadrianto, K. Kersting, M. D. Reid, T. S. Caetano, and W. L. Buntine, “Kernel conditional quantile estimation via reduction revisited,” in International Conference on Data Mining, 2009, pp. 938–943.
  • [22] M. Heinonen, H. Mannerström, J. Rousu, S. Kaski, and H. Lähdesmäki, “Non-stationary Gaussian process regression with Hamiltonian Monte Carlo,” in Artificial Intelligence and Statistics, 2016, pp. 732–740.
  • [23] M. Binois, R. B. Gramacy, and M. Ludkovski, “Practical heteroscedastic Gaussian process modeling for large simulation experiments,” Journal of Computational and Graphical Statistics, vol. 27, no. 4, pp. 808–821, 2018.
  • [24] M. K. Titsias and M. Lázaro-Gredilla, “Variational heteroscedastic Gaussian process regression,” in International Conference on Machine Learning, 2011, pp. 841–848.
  • [25] M. Menictas and M. P. Wand, “Variational inference for heteroscedastic semiparametric regression,” Australian & New Zealand Journal of Statistics, vol. 57, no. 1, pp. 119–138, 2015.
  • [26] L. Muñoz-González, M. Lázaro-Gredilla, and A. R. Figueiras-Vidal, “Heteroscedastic Gaussian process regression using expectation propagation,” in Machine Learning for Signal Processing, 2011, pp. 1–6.
  • [27] V. Tolvanen, P. Jylänki, and A. Vehtari, “Expectation propagation for nonstationary heteroscedastic Gaussian process regression,” in Machine Learning for Signal Processing, 2014, pp. 1–6.
  • [28] M. Hartmann and J. Vanhatalo, “Laplace approximation and natural gradient for Gaussian process regression with heteroscedastic student-t model,” Statistics and Computing, vol. 29, no. 4, pp. 753–773, 2019.
  • [29] H. Liu, Y.-S. Ong, X. Shen, and J. Cai, “When Gaussian process meets big data: A review of scalable GPs,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–19, 2020.
  • [30] J. Quiñonero-Candela and C. E. Rasmussen, “A unifying view of sparse approximate Gaussian process regression,” Journal of Machine Learning Research, vol. 6, no. Dec, pp. 1939–1959, 2005.
  • [31] M. Titsias, “Variational learning of inducing variables in sparse Gaussian processes,” in Artificial Intelligence and Statistics, 2009, pp. 567–574.
  • [32] Y. Gal, M. van der Wilk, and C. E. Rasmussen, “Distributed variational inference in sparse Gaussian process regression and latent variable models,” in Advances in Neural Information Processing Systems, 2014, pp. 3257–3265.
  • [33] T. N. Hoang, Q. M. Hoang, and B. K. H. Low, “A distributed variational inference framework for unifying parallel sparse Gaussian process regression models.” in International Conference on Machine Learning, 2016, pp. 382–391.
  • [34] J. Hensman, N. Fusi, and N. D. Lawrence, “Gaussian processes for big data,” in Uncertainty in Artificial Intelligence, 2013, pp. 282–290.
  • [35] T. N. Hoang, Q. M. Hoang, and B. K. H. Low, “A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data,” in International Conference on Machine Learning, 2015, pp. 569–578.
  • [36] A. Wilson and H. Nickisch, “Kernel interpolation for scalable structured Gaussian processes (KISS-GP),” in International Conference on Machine Learning, 2015, pp. 1775–1784.
  • [37] T. D. Bui and R. E. Turner, “Tree-structured Gaussian process approximations,” in Advances in Neural Information Processing Systems, 2014, pp. 2213–2221.
  • [38] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
  • [39] V. Tresp, “A Bayesian committee machine,” Neural Computation, vol. 12, no. 11, pp. 2719–2741, 2000.
  • [40] M. P. Deisenroth and J. W. Ng, “Distributed Gaussian processes,” in International Conference on Machine Learning, 2015, pp. 1481–1490.
  • [41] H. Liu, J. Cai, Y. Wang, and Y.-S. Ong, “Generalized robust Bayesian committee machine for large-scale Gaussian process regression,” in International Conference on Machine Learning, 2018, pp. 3137–3146.
  • [42] H. Liu, J. Cai, Y.-S. Ong, and Y. Wang, “Understanding and comparing scalable Gaussian process regression for big data,” Knowledge-Based Systems, vol. 164, pp. 324–335, 2019.
  • [43] E. Snelson and Z. Ghahramani, “Local and global sparse Gaussian process approximations,” in Artificial Intelligence and Statistics, 2007, pp. 524–531.
  • [44] J. Vanhatalo and A. Vehtari, “Modelling local and global phenomena with sparse Gaussian processes,” in Uncertainty in Artificial Intelligence, 2008, pp. 571–578.
  • [45] R. Ouyang and K. H. Low, “Gaussian process decentralized data fusion meets transfer learning in large-scale distributed cooperative perception,” in AAAI Conference on Artificial Intelligence, 2018.
  • [46] K. Chalupka, C. K. Williams, and I. Murray, “A framework for evaluating approximation methods for Gaussian process regression,” Journal of Machine Learning Research, vol. 14, no. Feb, pp. 333–350, 2013.
  • [47] M. E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. 1, no. Jun, pp. 211–244, 2001.
  • [48] C. E. Rasmussen and J. Quinonero-Candela, “Healing the relevance vector machine through augmentation,” in International Conference on Machine Learning, 2005, pp. 689–696.
  • [49] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudo-inputs,” in Advances in Neural Information Processing Systems, 2006, pp. 1257–1264.
  • [50] M. Bauer, M. van der Wilk, and C. E. Rasmussen, “Understanding probabilistic sparse Gaussian process approximations,” in Advances in Neural Information Processing Systems, 2016, pp. 1533–1541.
  • [51] H. Yu, T. Nghia, B. K. H. Low, and P. Jaillet, “Stochastic variational inference for Bayesian sparse Gaussian process regression,” in International Joint Conference on Neural Networks, 2019, pp. 1–8.
  • [52] D. G. Matthews, G. Alexander, M. Van Der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman, “GPflow: A Gaussian process library using TensorFlow,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 1299–1304, 2017.
  • [53] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation, 2016, pp. 265–283.
  • [54] C. E. Rasmussen and H. Nickisch, “Gaussian processes for machine learning (GPML) toolbox,” Journal of Machine Learning Research, vol. 11, no. Nov, pp. 3011–3015, 2010.
  • [55] S. Sun, “A review of deterministic approximate inference techniques for Bayesian machine learning,” Neural Computing and Applications, vol. 23, no. 7-8, pp. 2039–2050, 2013.
  • [56] J. Hensman and N. D. Lawrence, “Nested variational compression in deep Gaussian processes,” arXiv preprint arXiv:1412.1370, 2014.
  • [57] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [58] B. Ingram and D. Cornford, “Parallel geostatistics for sparse and dense datasets,” in GeoENV VII–Geostatistics for Environmental Applications. Springer, 2010, pp. 371–381.
  • [59] H. Salimbeni, S. Eleftheriadis, and J. Hensman, “Natural gradients in practice: Non-conjugate variational inference in Gaussian process models,” in International Conference on Artificial Intelligence and Statistics, 2018, pp. 689–697.
  • [60] J. Hensman, N. Durrande, A. Solin et al., “Variational fourier features for Gaussian processes.” Journal of Machine Learning Research, vol. 18, pp. 1–52, 2018.
  • [61] D. Dheeru and E. Karra Taniskidou, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
  • [62] M. Kaul, B. Yang, and C. S. Jensen, “Building accurate 3D spatial networks to enable next generation intelligent transportation systems,” in International Conference on Mobile Data Management, vol. 1, 2013, pp. 137–146.
  • [63] C. Wang and R. M. Neal, “Gaussian process regression with heteroscedastic or non-Gaussian residuals,” arXiv preprint arXiv:1212.6246, 2012.
  • [64] V. Dutordoir, H. Salimbeni, J. Hensman, and M. Deisenroth, “Gaussian process conditional density estimation,” in Advances in Neural Information Processing Systems, 2018, pp. 2385–2395.
  • [65] M. Jankowiak, G. Pleiss, and J. R. Gardner, “Sparse Gaussian process regression beyond variational inference,” arXiv preprint arXiv:1910.07123, 2019.
  • [66] C. Ma, Y. Li, and J. M. Hernandez-Lobato, “Variational implicit processes,” in International Conference on Machine Learning, 2019, pp. 4222–4233.
  • [67] T. Nguyen and E. Bonilla, “Fast allocation of Gaussian process experts,” in International Conference on Machine Learning, 2014, pp. 145–153.
  • [68] D. Wu and J. Ma, “A two-layer mixture model of Gaussian process functional regressions and its MCMC EM algorithm,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4894–4904, 2018.
  • [69] S. Remes, M. Heinonen, and S. Kaski, “Non-stationary spectral kernels,” in Advances in Neural Information Processing Systems, 2017, pp. 4642–4651.
Haitao Liu received the Ph.D. degree from the School of Energy and Power Engineering, Dalian University of Technology, Dalian, China, in 2016. He is currently a Research Fellow with the Rolls-Royce@NTU Corp Laboratory, Nanyang Technological University, Singapore. His current research interests include multi-task learning, large-scale Gaussian process, active learning, and optimization.
Yew-Soon Ong (M’99-SM’12-F’18) received the Ph.D. degree in artificial intelligence in complex design from the University of Southampton, U.K., in 2003. He is a President’s Chair Professor in Computer Science at the Nanyang Technological University (NTU), and holds the position of Chief Artificial Intelligence Scientist at the Agency for Science, Technology and Research Singapore. At NTU, he serves as Director of the Data Science and Artificial Intelligence Research Center and Director of the Singtel-NTU Cognitive & Artificial Intelligence Joint Lab. His research interest is in artificial and computational intelligence. He is founding Editor-in-Chief of the IEEE Transactions on Emerging Topics in Computational Intelligence, Technical Editor-in-Chief of Memetic Computing and Associate Editor of IEEE Transactions on Neural Networks & Learning Systems, the IEEE Transactions on Cybernetics, IEEE Transactions on Evolutionary Computation and others. He has received several IEEE outstanding paper awards, and is listed as a Thomson Reuters highly cited researcher and among the World’s Most Influential Scientific Minds.
Jianfei Cai (S’98–M’02–SM’07) is a Professor at the Faculty of IT, Monash University, where he currently serves as the Head of the Data Science & AI Department. Before that, he was a full professor, a cluster deputy director of the Data Science & AI Research center (DSAIR), Head of the Visual and Interactive Computing Division and Head of the Computer Communications Division at Nanyang Technological University (NTU). His major research interests include computer vision, multimedia and deep learning. He has published more than 200 technical papers in international conferences and journals. He is a co-recipient of paper awards in ACCV, ICCM, IEEE ICIP and MMSP. He has served as an Associate Editor for IEEE T-IP, T-MM, T-CSVT and Visual Computer, as well as an Area Chair for ICCV, ECCV, ACM Multimedia, ICME and ICIP. He was the Chair of the IEEE CAS VSPC-TC during 2016-2018. He also served as the leading TPC Chair for IEEE ICME 2012 and the best paper award committee co-chair for IEEE T-MM 2019.