
Large-scale Heteroscedastic Regression via Gaussian Process

Haitao Liu, Yew-Soon Ong, and Jianfei Cai. Haitao Liu is with the Rolls-Royce@NTU Corporate Lab, Nanyang Technological University, Singapore, 637460. E-mail: htliu@ntu.edu.sg. Yew-Soon Ong is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore, 639798. E-mail: asysong@ntu.edu.sg. Jianfei Cai is with the Department of Data Science & AI, Monash University, Australia. E-mail: jianfei.cai@monash.edu.
Abstract

Heteroscedastic regression, which considers varying noise among observations, has many applications in fields such as machine learning and statistics. Here we focus on heteroscedastic Gaussian process (HGP) regression, which integrates the latent function and the noise function in a unified non-parametric Bayesian framework. Though showing remarkable performance, HGP suffers from cubic time complexity, which strictly limits its application to big data. To improve the scalability, we first develop a variational sparse inference algorithm, named VSHGP, to handle large-scale datasets. Furthermore, two variants are developed to improve the scalability and capability of VSHGP. The first is stochastic VSHGP (SVSHGP), which derives a factorized evidence lower bound, thus enabling efficient stochastic variational inference. The second is distributed VSHGP (DVSHGP), which (i) follows the Bayesian committee machine formalism to distribute computations over multiple local VSHGP experts with many inducing points, and (ii) adopts hybrid parameters for the experts to guard against over-fitting and capture local variety. The superiority of DVSHGP and SVSHGP over existing scalable heteroscedastic/homoscedastic GPs is then extensively verified on various datasets.

Index Terms:
Heteroscedastic GP, Large-scale, Sparse approximation, Stochastic variational inference, Distributed learning.

Notation

$F, F_V$ = Lower bounds for the log model evidence $\log p(\bm{y})$
$\bm{f}, \bm{g}$ = Latent function values and log variances
$\bm{f}_m, \bm{g}_u$ = Inducing variables for $\bm{f}$ and $\bm{g}$
$k^f, k^g$ = Kernels for $f$ and $g$
$\bm{K}^f, \bm{K}^g$ = Kernel matrices for $\bm{f}$ and $\bm{g}$
$m, u$ = Inducing sizes for $\bm{f}$ and $\bm{g}$
$m_0, u_0$ = Inducing sizes for each expert in DVSHGP
$m_{\mathrm{b}}$ = Number of basis functions for GPVC [8]
$m_{\mathrm{sod}}$ = Subset size for EHSoD [13]
$M$ = Number of VSHGP experts
$\mathcal{M}_i$ = $i$th VSHGP expert ($1 \leq i \leq M$)
$n, n_*$ = Training size and test size
$n_0$ = Training size for each expert in DVSHGP
$\bm{X}, \bm{X}_*$ = Training and test data points
$\bm{y}, \bm{y}_*$ = Training and test observations
$\mu_0$ = Mean parameter for the latent noise function $g$
$\bm{\mu}_m, \bm{\Sigma}_m$ = Mean and variance of the posterior $q(\bm{f}_m)$
$\bm{\mu}_u, \bm{\Sigma}_u$ = Mean and variance of the posterior $q(\bm{g}_u)$
$\bm{\mu}_f, \bm{\Sigma}_f$ = Mean and variance of the posterior $q(\bm{f})$
$\bm{\mu}_f^{\star}, \bm{\Sigma}_f^{\star}$ = Mean and variance of the optimal $q^{\star}(\bm{f})$
$\bm{\mu}_g, \bm{\Sigma}_g$ = Mean and variance of the posterior $q(\bm{g})$
$\bm{\Lambda}_{nn}$ = Variational parameters for $q(\bm{g}_u)$ in VSHGP
$\mu_{f_{(\mathcal{A})}*}, \mu_{g_{(\mathcal{A})}*}$ = (Aggregated) prediction means of $f$ and $g$ at $\bm{x}_*$
$\sigma^2_{f_{(\mathcal{A})}*}, \sigma^2_{g_{(\mathcal{A})}*}$ = (Aggregated) prediction variances of $f$ and $g$ at $\bm{x}_*$
$\mu_{(\mathcal{A})*}, \sigma^2_{(\mathcal{A})*}$ = (Aggregated) prediction mean and variance at $\bm{x}_*$
$\mu_{i*}, \sigma^2_{i*}$ = Prediction mean and variance of $\mathcal{M}_i$ at $\bm{x}_*$
$w_{i*}^f, w_{i*}^g$ = Weights of expert $\mathcal{M}_i$ at $\bm{x}_*$ for $f$ and $g$
$\bm{\theta}^f, \bm{\theta}^g$ = Kernel parameters for $k^f$ and $k^g$
$\epsilon_i$ = Noise for observation $y_i$ at point $\bm{x}_i$ ($1 \leq i \leq n$)
$\gamma$ = Step-size parameter for optimization
$\mathcal{B}$ = Mini-batch set sampled from $\bm{X}$

I Introduction

In supervised learning, we learn a machine learning model from $n$ training points $\bm{X}=\{\bm{x}_{i}\in R^{d}\}_{i=1}^{n}$ and the observations $\bm{y}=\{y(\bm{x}_{i})=f(\bm{x}_{i})+\epsilon_{i}\}_{i=1}^{n}$, where $f$ is the underlying function and $\epsilon_{i}$ is the independent noise. The non-parametric Gaussian process (GP) places a GP prior on $f$ with the i.i.d. noise $\epsilon_{i}\sim\mathcal{N}(0,\sigma^{2}_{\epsilon})$, resulting in the standard homoscedastic GP. The homoscedasticity offers tractable GP inference, enabling extensive applications in regression and classification [1], visualization [2], Bayesian optimization [3], multi-task learning [4], active learning [5], etc.

Many realistic problems, e.g., volatility forecasting [6], biophysical variable estimation [7], cosmological redshift estimation [8], and robotics and vehicle control [9, 10], however, should consider input-dependent noise rather than a simple constant noise in order to fit the local noise rates of complicated data distributions. In comparison to the conventional homoscedastic GP, the heteroscedastic GP (HGP) provides better quantification of different sources of uncertainty, which further benefits downstream tasks, e.g., active learning, optimization and uncertainty quantification [11, 12].

To account for heteroscedastic noise in GP, there exist two main strategies: (i) treat the GP as a black box and interpret the heteroscedasticity using another separate model; or (ii) integrate the heteroscedasticity within the unifying GP framework. The first, post-model strategy trains a standard GP to capture the underlying function $f$, and then either trains another GP to take the remaining empirical variance into account [13], or uses quantile regression [14] to model the lower and upper quantiles of the variance, respectively.

In contrast, the second, integration strategy provides an elegant framework for heteroscedastic regression. The simplest way to mimic variable noise is to add independent yet different noise variances to the diagonal of the kernel matrix [15]. Goldberg et al. [16] introduced a more principled HGP which jointly infers a mean-field GP for $f(\bm{x})$ and an additional GP $g(\bm{x})$ for $\log\sigma^{2}_{\epsilon}(\bm{x})$. Similarly, other HGPs which describe the heteroscedastic noise using the pointwise division of two GPs or a general non-linear combination of GPs have been proposed, for example, in [17, 18, 19]. Note that, unlike the homoscedastic GP, inference in HGP is challenging since the model evidence (marginal likelihood) $p(\bm{y})$ and the posterior $p(y_{*}|\bm{X},\bm{y},\bm{x}_{*})$ are intractable. Hence, various approximate inference methods, e.g., Markov chain Monte Carlo (MCMC) [16], maximum a posteriori (MAP) [20, 21, 22, 23], variational inference [24, 25], expectation propagation [26, 27] and Laplace approximation [28], have been used. MCMC, though the most accurate, is time-consuming when handling large-scale datasets; MAP is a point estimate which risks over-fitting and oscillation; variational inference and its variants, which run fast via maximizing a tractable and rigorous evidence lower bound (ELBO), provide a trade-off.

This paper focuses on the HGP model developed in [16]. When handling $n$ training points, the standard GP suffers from a cubic complexity $\mathcal{O}(n^{3})$ due to the inversion of an $n\times n$ kernel matrix, which makes it unaffordable for big data. Since HGP employs an additional log-GP for the noise variance, its complexity is about twice that of the standard GP. Hence, to handle large-scale datasets, which is of great demand in the era of big data, the scalability of HGP should be improved.

Recently, there has been an increasing trend of studying scalable GPs, which fall into two core categories: global and local approximations [29]. As the representative of global approximation, sparse approximation considers $m$ ($m\ll n$) global inducing pairs $\{\bm{X}_{m},\bm{f}_{m}\}$ to summarize the training data by approximating the prior [30] or the posterior [31], resulting in a complexity of $\mathcal{O}(nm^{2})$. Variants of sparse approximation have recently been proposed to handle millions of data points via distributed inference [32, 33], stochastic variational inference [34, 35], or structured inducing points [36]. The global sparse approximation, however, often faces challenges in capturing quick-varying features [37]. Differently, local approximation, which follows the idea of divide-and-conquer, first trains GP experts on local subsets and then aggregates their predictions, for example by means of the product-of-experts (PoE) [38] or the Bayesian committee machine (BCM) [39, 40, 41]. Hence, local approximation not only distributes the computations but also captures quick-varying features [42]. Hybrid strategies have thereafter been presented to take advantage of both global and local approximations [43, 44, 45].

The developments in scalable homoscedastic GPs have thus motivated us to scale up HGPs. Alternatively, we could combine the simple Subset-of-Data (SoD) approximation [46] with the empirical HGP [13], which trains two separate GPs for predicting the mean and variance, respectively. The empirical variance, however, is hard to fit since it follows an asymmetric Gaussian distribution. More reasonably, the GP using variable covariances (GPVC) [8] follows the idea of the relevance vector machine (RVM) [47] that a stationary kernel $k(\cdot,\cdot)$ has a positive and finite Fourier spectrum, suggesting the use of only $m_{\mathrm{b}}$ ($m_{\mathrm{b}}\ll n$) independent basis functions for approximation. Note that GPVC shares the basis functions for $f$ and $g$, which however might exhibit distinct features. Besides, the RVM-type model usually suffers from underestimated prediction variance when moving away from $\bm{X}$ [48].

Besides, there are some scalable "pseudo" HGPs [49, 43] which are not designed for this purpose, but can describe the heteroscedasticity to some extent due to their factorized conditionals. For instance, the Fully Independent Training Conditional (FITC) [49] and its block version, the Partially Independent Conditional (PIC) [43], adopt a factorized training conditional to achieve a varying noise [50]. Though equipped with heteroscedastic variance, these models (i) severely underestimate the constant noise variance, (ii) sacrifice the prediction mean [50], and (iii) may produce discontinuous predictions on block boundaries [44]. Recently, stochastic/distributed variants of FITC and PIC have been developed to further improve the scalability [35, 33, 51, 45].

The high capability of HGP for describing complicated data distributions, together with its poor scalability, motivates us to develop variational sparse HGPs which employ an additional log-GP for heteroscedasticity. Particularly, the main contributions are:

1. A variational inference algorithm for sparse HGP, named VSHGP, is developed. Specifically, VSHGP derives an analytical ELBO using $m$ inducing points for $f$ and $u$ inducing points for $g$, resulting in a greatly reduced complexity of $\mathcal{O}(nm^{2}+nu^{2})$. Besides, tricks such as re-parameterization are used to ease the inference;

2. The stochastic variant SVSHGP is presented to further improve the scalability of VSHGP. Specifically, we derive a factorized ELBO which allows using efficient stochastic variational inference;

3. The distributed variant DVSHGP is presented for improving the scalability and capability of VSHGP. The local experts with many inducing points (i) distribute the computations for parallel computing, and (ii) employ hybrid parameters to guard against over-fitting and capture local variety;

4. Extensive experiments conducted on datasets with up to two million points reveal that the localized DVSHGP exhibits superior performance, while the global SVSHGP may sacrifice the prediction mean for capturing heteroscedastic noise. The SVSHGP is implemented with the GPflow toolbox [52], which benefits from the parallel/GPU speedup and automatic differentiation of TensorFlow [53]; the DVSHGP is built upon the GPML toolbox [54]. These implementations are available at https://github.com/LiuHaiTao01.

The remainder of the article is organized as follows. Section II first develops the VSHGP model via variational inference. Then, we present the stochastic variant in Section III and the distributed variant in Section IV to further enhance the scalability and capability. Section V discusses implementation issues, followed by extensive experiments in Section VI. Finally, Section VII provides concluding remarks.

II Variational sparse HGP

II-A Sparse approximation

We follow [16] to define the HGP as $y(\bm{x})=f(\bm{x})+\epsilon(\bm{x})$, wherein the latent function $f(\bm{x})$ and the noise $\epsilon(\bm{x})$ follow

f(\bm{x})\sim\mathcal{GP}(0,k^{f}(\bm{x},\bm{x}^{\prime})),\quad\epsilon(\bm{x})\sim\mathcal{N}(0,\sigma^{2}_{\epsilon}(\bm{x})). (1)

It is observed that the input-dependent noise variance $\sigma^{2}_{\epsilon}(\bm{x})$ enables the description of possible heteroscedasticity. Notably, this HGP degenerates to the homoscedastic GP when $\sigma^{2}_{\epsilon}(\bm{x})$ is a constant. To ensure positivity, we particularly consider the exponential form $\sigma^{2}_{\epsilon}(\bm{x})=e^{g(\bm{x})}$, wherein the latent function $g(\bm{x})$, akin to $f(\bm{x})$, follows an independent GP prior

g(\bm{x})\sim\mathcal{GP}(\mu_{0},k^{g}(\bm{x},\bm{x}^{\prime})). (2)

The only difference is that, unlike the zero-mean GP prior placed on $f(\bm{x})$, we explicitly consider a prior mean $\mu_{0}$ to account for the variability of the noise variance. (For $f$, we can pre-process the data to fulfill the zero-mean assumption; for $g$, however, it is hard to satisfy this assumption, since we have no access to the "noise" data.) The kernels $k^{f}$ and $k^{g}$ could be, e.g., the squared exponential (SE) kernel equipped with automatic relevance determination (ARD)

k(\bm{x},\bm{x}^{\prime})=\sigma^{2}_{s}\exp\left(-\frac{1}{2}\sum_{i=1}^{d}\frac{(x_{i}-x^{\prime}_{i})^{2}}{l_{i}^{2}}\right), (3)

where the signal variance $\sigma^{2}_{s}$ is an output scale, and the length-scale $l_{i}$ is an input scale along the $i$th dimension.
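For concreteness, a minimal NumPy sketch of the SE-ARD kernel in (3) is given below; the function name and array shapes are our own illustrative choices rather than part of the released toolboxes.

```python
import numpy as np

def se_ard_kernel(X1, X2, signal_var, lengthscales):
    """SE kernel with ARD, Eq. (3): k(x, x') = s^2 exp(-0.5 * sum_i (x_i - x'_i)^2 / l_i^2)."""
    X1s = X1 / lengthscales          # scale each input dimension by its length-scale
    X2s = X2 / lengthscales
    # squared Euclidean distances between all pairs of scaled points
    d2 = np.sum(X1s**2, 1)[:, None] + np.sum(X2s**2, 1)[None, :] - 2.0 * X1s @ X2s.T
    return signal_var * np.exp(-0.5 * np.maximum(d2, 0.0))

# example: a 5x3 kernel matrix between random 2D points
K = se_ard_kernel(np.random.randn(5, 2), np.random.randn(3, 2),
                  signal_var=1.0, lengthscales=np.array([1.0, 2.0]))
```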

Given the training data $\mathcal{D}=\{\bm{X},\bm{y}\}$, the joint priors follow

p(\bm{f})=\mathcal{N}(\bm{f}|\bm{0},\bm{K}^{f}_{nn}),\quad p(\bm{g})=\mathcal{N}(\bm{g}|\mu_{0}\bm{1},\bm{K}^{g}_{nn}), (4)

where $[\bm{K}^{f}_{nn}]_{ij}=k^{f}(\bm{x}_{i},\bm{x}_{j})$ and $[\bm{K}^{g}_{nn}]_{ij}=k^{g}(\bm{x}_{i},\bm{x}_{j})$ for $1\leq i,j\leq n$. Accordingly, the data likelihood becomes

p(\bm{y}|\bm{f},\bm{g})=\mathcal{N}(\bm{y}|\bm{f},\bm{\Sigma}_{\epsilon}), (5)

where the diagonal noise matrix has $[\bm{\Sigma}_{\epsilon}]_{ii}=e^{g(\bm{x}_{i})}$.

To scale up HGP, we follow the sparse approximation framework to introduce $m$ inducing variables $\bm{f}_{m}\sim\mathcal{N}(\bm{f}_{m}|\bm{0},\bm{K}_{mm}^{f})$ at the inducing points $\bm{X}_{m}$ for $\bm{f}$; similarly, we introduce $u$ inducing variables $\bm{g}_{u}\sim\mathcal{N}(\bm{g}_{u}|\mu_{0}\bm{1},\bm{K}_{uu}^{g})$ at the independent $\bm{X}_{u}$ for $\bm{g}$. Besides, we assume that $\bm{f}_{m}$ is a sufficient statistic for $\bm{f}$, and $\bm{g}_{u}$ a sufficient statistic for $\bm{g}$. (Sufficient statistic here means that the variables $\bm{z}$ and $\bm{f}$ are independent given $\bm{f}_{m}$, i.e., $p(\bm{z}|\bm{f},\bm{f}_{m})=p(\bm{z}|\bm{f}_{m})$.) As a result, we obtain two training conditionals

p(\bm{f}|\bm{f}_{m})=\mathcal{N}(\bm{f}|\bm{\Omega}_{nm}^{f}\bm{f}_{m},\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f}),
p(\bm{g}|\bm{g}_{u})=\mathcal{N}(\bm{g}|\bm{\Omega}_{nu}^{g}(\bm{g}_{u}-\mu_{0}\bm{1})+\mu_{0}\bm{1},\bm{K}_{nn}^{g}-\bm{Q}_{nn}^{g}),

where $\bm{\Omega}_{nm}^{f}=\bm{K}_{nm}^{f}(\bm{K}_{mm}^{f})^{-1}$, $\bm{\Omega}_{nu}^{g}=\bm{K}_{nu}^{g}(\bm{K}_{uu}^{g})^{-1}$, $\bm{Q}_{nn}^{f}=\bm{\Omega}_{nm}^{f}\bm{K}_{mn}^{f}$ and $\bm{Q}_{nn}^{g}=\bm{\Omega}_{nu}^{g}\bm{K}_{un}^{g}$.
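As an illustration, the interpolation matrix $\bm{\Omega}_{nm}^{f}$ and the Nyström term $\bm{Q}_{nn}^{f}$ could be formed as in the following sketch; the helper name and the jitter value are our own assumptions, and a Cholesky factor is used purely for numerical stability.

```python
import numpy as np

def nystrom_quantities(Knm, Kmm, jitter=1e-6):
    """Return Omega = Knm Kmm^{-1} and Q = Omega Knm^T for the sparse conditional p(f | f_m)."""
    L = np.linalg.cholesky(Kmm + jitter * np.eye(Kmm.shape[0]))  # Kmm = L L^T
    # solve Kmm^{-1} Knm^T via two triangular solves, then transpose
    A = np.linalg.solve(L, Knm.T)
    Omega = np.linalg.solve(L.T, A).T          # n x m
    Q = Omega @ Knm.T                          # n x n Nystrom approximation of Knn
    return Omega, Q
```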

In the augmented probability space, the model evidence

p(\bm{y})=\int p(\bm{y}|\bm{f},\bm{g})p(\bm{f}|\bm{f}_{m})p(\bm{g}|\bm{g}_{u})p(\bm{f}_{m})p(\bm{g}_{u})\,d\bm{f}\,d\bm{f}_{m}\,d\bm{g}\,d\bm{g}_{u},

together with the posterior $p(\bm{z}|\bm{y})=p(\bm{y}|\bm{z})p(\bm{z})/p(\bm{y})$, where $\bm{z}=\{\bm{f},\bm{g},\bm{f}_{m},\bm{g}_{u}\}$, however, is intractable. Hence, we use variational inference to derive an analytical ELBO of $\log p(\bm{y})$.

II-B Evidence lower bound

We employ the mean-field theory [55] to approximate the intractable posterior $p(\bm{z}|\bm{y})=p(\bm{f}|\bm{f}_{m})p(\bm{g}|\bm{g}_{u})p(\bm{f}_{m}|\bm{y})p(\bm{g}_{u}|\bm{y})$ as

p(\bm{z}|\bm{y})\approx q(\bm{z})=p(\bm{f}|\bm{f}_{m})p(\bm{g}|\bm{g}_{u})q(\bm{f}_{m})q(\bm{g}_{u}), (6)

where $q(\bm{f}_{m})$ and $q(\bm{g}_{u})$ are free variational distributions to approximate the posteriors $p(\bm{f}_{m}|\bm{y})$ and $p(\bm{g}_{u}|\bm{y})$, respectively.

In order to push the approximation $q(\bm{z})$ towards the exact $p(\bm{z}|\bm{y})$, we minimize their Kullback-Leibler (KL) divergence $\mathrm{KL}(q(\bm{z})||p(\bm{z}|\bm{y}))$, which, since $\mathrm{KL}(\cdot,\cdot)\geq 0$, is equivalent to maximizing the ELBO $F$ defined as

F=\int q(\bm{z})\log\frac{p(\bm{z},\bm{y})}{q(\bm{z})}d\bm{z}=\log p(\bm{y})-\mathrm{KL}(q(\bm{z})||p(\bm{z}|\bm{y})). (7)

As a consequence, instead of directly maximizing the intractable $\log p(\bm{y})$ for inference, we now seek the maximization of $F$ w.r.t. the variational distributions $q(\bm{f}_{m})$ and $q(\bm{g}_{u})$.

By reformulating $F$ we observe that

F=-\mathrm{KL}(q(\bm{f}_{m})||q^{\star}(\bm{f}_{m}))+H(q(\bm{g}_{u}))+\mathrm{const},

where $H(q(\bm{g}_{u}))$ is the information entropy of $q(\bm{g}_{u})$; $q^{\star}(\bm{f}_{m})$ is the optimal distribution since it maximizes the bound $F$, and it satisfies, given the normalizer $C_{0}$,

q^{\star}(\bm{f}_{m})=\frac{p(\bm{f}_{m})}{C_{0}}e^{\int p(\bm{f}|\bm{f}_{m})p(\bm{g}|\bm{g}_{u})q(\bm{g}_{u})\log p(\bm{y}|\bm{f},\bm{g})d\bm{f}d\bm{g}d\bm{g}_{u}}. (8)

Thereafter, by substituting $q^{\star}(\bm{f}_{m})$ back into $F$, we arrive at a tighter ELBO, given $q(\bm{g}_{u})=\mathcal{N}(\bm{g}_{u}|\bm{\mu}_{u},\bm{\Sigma}_{u})$, as

F_{V}=\log C_{0}-\mathrm{KL}(q(\bm{g}_{u})||p(\bm{g}_{u})) (9)
=\underbrace{\log\mathcal{N}(\bm{y}|\bm{0},\bm{Q}_{nn}^{f}+\bm{R}_{g})}_{\mathrm{log\,term}}-\underbrace{0.25\,\mathrm{Tr}[\bm{\Sigma}_{g}]}_{\mathrm{trace\,term\,of}\,\bm{g}}
-\underbrace{0.5\,\mathrm{Tr}[\bm{R}_{g}^{-1}(\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f})]}_{\mathrm{trace\,term\,of}\,\bm{f}}-\underbrace{\mathrm{KL}(q(\bm{g}_{u})||p(\bm{g}_{u}))}_{\mathrm{KL\,term}},

where the diagonal matrix $\bm{R}_{g}\in R^{n\times n}$ has $[\bm{R}_{g}]_{ii}=e^{[\bm{\mu}_{g}]_{i}-[\bm{\Sigma}_{g}]_{ii}/2}$, with the mean and variance

\bm{\mu}_{g}=\bm{\Omega}_{nu}^{g}(\bm{\mu}_{u}-\mu_{0}\bm{1})+\mu_{0}\bm{1}, (10a)
\bm{\Sigma}_{g}=\bm{K}_{nn}^{g}-\bm{Q}_{nn}^{g}+\bm{\Omega}_{nu}^{g}\bm{\Sigma}_{u}(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}, (10b)

coming from $q(\bm{g})=\int p(\bm{g}|\bm{g}_{u})q(\bm{g}_{u})d\bm{g}_{u}$, which approximates $p(\bm{g}|\bm{y})$. It is observed that the analytical bound $F_{V}$ depends only on $q(\bm{g}_{u})$ since we have "marginalized" $q(\bm{f}_{m})$ out.

Let us delve further into the terms of $F_{V}$ in (9):

  • The log term $\log\mathcal{N}(\bm{y}|\bm{0},\bm{\Sigma}_{y})$, where $\bm{\Sigma}_{y}=\bm{Q}_{nn}^{f}+\bm{R}_{g}$, is analogous to that in a standard GP. It achieves a bias-variance trade-off for both $f$ and $g$ by penalizing model complexity and low data likelihood [1].

  • The two trace terms act as a regularization to choose good inducing sets for $\bm{f}$ and $\bm{g}$, and guard against over-fitting. It is observed that $\mathrm{Tr}[\bm{K}_{nn}^{g}-\bm{Q}_{nn}^{g}]$ and $\mathrm{Tr}[\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f}]$ represent the total variances of the training conditionals $p(\bm{g}|\bm{g}_{u})$ and $p(\bm{f}|\bm{f}_{m})$, respectively. To maximize $F_{V}$, the traces should be very small, implying that $\bm{f}_{m}$ and $\bm{g}_{u}$ are very informative (i.e., sufficient statistics, also called variational compression [56]) for $\bm{f}$ and $\bm{g}$. Particularly, zero traces indicate that $\bm{f}_{m}=\bm{f}$ and $\bm{g}_{u}=\bm{g}$, thus recovering the variational HGP (VHGP) [24]. Besides, zero traces imply that the variances of $q(\bm{g}_{u})$ equal those of $p(\bm{g}_{u}|\bm{y})$.

  • The KL term is a constraint for rationalising $q(\bm{g}_{u})$. It is observed that minimizing the traces only pushes the variances of $q(\bm{g}_{u})$ towards those of $p(\bm{g}_{u}|\bm{y})$. To let the co-variances of $q(\bm{g}_{u})$ rationally approximate those of $p(\bm{g}_{u}|\bm{y})$, the KL term penalizes $q(\bm{g}_{u})$ so that it does not deviate significantly from the prior $p(\bm{g}_{u})$.

II-C Reparameterization and inference

In order to maximize the ELBO $F_{V}$ in (9), we need to infer $\omega=u+u(u+1)/2$ free variational parameters in $\bm{\mu}_{u}$ and $\bm{\Sigma}_{u}$. Assume that $u=0.01n$, then $\omega$ is larger than $n$ when the training size $n>2\times 10^{4}$, leading to a high-dimensional and non-trivial optimization task.

We observe that the derivatives of $F_{V}$ w.r.t. $\bm{\mu}_{u}$ and $\bm{\Sigma}_{u}$ are

\frac{\partial F_{V}}{\partial\bm{\mu}_{u}}=\frac{1}{2}(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}\bm{\Lambda}_{nn}^{ab}\bm{1}-(\bm{K}_{uu}^{g})^{-1}(\bm{\mu}_{u}-\mu_{0}\bm{1}),
\frac{\partial F_{V}}{\partial\bm{\Sigma}_{u}}=-\frac{1}{4}(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}(\bm{\Lambda}_{nn}^{ab}+\bm{I})\bm{\Omega}_{nu}^{g}+\frac{1}{2}[\bm{\Sigma}_{u}^{-1}-(\bm{K}_{uu}^{g})^{-1}],

where the diagonal matrix $\bm{\Lambda}_{nn}^{ab}=\bm{\Lambda}_{nn}^{a}+\bm{\Lambda}_{nn}^{b}$ with $\bm{\Lambda}_{nn}^{a}=(\bm{\Sigma}_{y}^{-1}\bm{y}\bm{y}^{\mathsf{T}}\bm{\Sigma}_{y}^{-1}-\bm{\Sigma}_{y}^{-1})\odot\bm{R}_{g}$, $\bm{\Lambda}_{nn}^{b}=(\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f})\odot\bm{R}_{g}^{-1}$, and the operator $\odot$ being the element-wise product. Hence, it is observed that the optimal $\bm{\mu}_{u}$ and $\bm{\Sigma}_{u}^{-1}$ satisfy

\bm{\mu}_{u}=0.5\,\bm{K}_{un}^{g}\bm{\Lambda}_{nn}^{ab}\bm{1}+\mu_{0}\bm{1},
\bm{\Sigma}_{u}^{-1}=0.5\,(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}(\bm{\Lambda}_{nn}^{ab}+\bm{I})\bm{\Omega}_{nu}^{g}+(\bm{K}_{uu}^{g})^{-1}.

Interestingly, we find that both the optimal $\bm{\mu}_{u}$ and $\bm{\Sigma}_{u}^{-1}$ depend on $\bm{\Lambda}_{nn}=0.5(\bm{\Lambda}_{nn}^{ab}+\bm{I})$, which is a positive semi-definite diagonal matrix; see the non-negativity proof in Appendix A. Hence, we re-parameterize $\bm{\mu}_{u}$ and $\bm{\Sigma}_{u}^{-1}$ in terms of $\bm{\Lambda}_{nn}$ as

\bm{\mu}_{u}=\bm{K}_{un}^{g}(\bm{\Lambda}_{nn}-0.5\bm{I})\bm{1}+\mu_{0}\bm{1}, (11a)
\bm{\Sigma}_{u}^{-1}=(\bm{K}_{uu}^{g})^{-1}+(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}\bm{\Lambda}_{nn}\bm{\Omega}_{nu}^{g}. (11b)

This re-parameterization eases the model inference by (i) reducing the number of variational parameters from $\omega$ to $n$, and (ii) limiting the new variational parameters $\bm{\Lambda}_{nn}$ to be non-negative, thus narrowing the search space.
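A minimal sketch of how $q(\bm{g}_{u})$ could be recovered from the non-negative diagonal $\bm{\Lambda}_{nn}$ via (11) is shown below; the variable names are illustrative, and dense inverses are used only for clarity.

```python
import numpy as np

def q_gu_from_lambda(Kun, Kuu, lam_diag, mu0, jitter=1e-6):
    """Eq. (11): mu_u = Kun (Lambda - 0.5 I) 1 + mu0 1,
                 Sigma_u^{-1} = Kuu^{-1} + Omega^T Lambda Omega, Omega = Knu Kuu^{-1} (n x u)."""
    u = Kuu.shape[0]
    Kuu_j = Kuu + jitter * np.eye(u)
    mu_u = Kun @ (lam_diag - 0.5) + mu0 * np.ones(u)   # lam_diag has shape (n,)
    Omega = np.linalg.solve(Kuu_j, Kun).T              # (Kuu^{-1} Kun)^T = Knu Kuu^{-1}
    Sigma_u_inv = np.linalg.inv(Kuu_j) + Omega.T @ (lam_diag[:, None] * Omega)
    Sigma_u = np.linalg.inv(Sigma_u_inv)
    return mu_u, Sigma_u
```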

So far, the bound $F_{V}$ depends on the variational parameters $\bm{\Lambda}_{nn}$, the kernel parameters $\bm{\theta}^{f}$ and $\bm{\theta}^{g}$, the mean parameter $\mu_{0}$ for $g$, and the inducing points $\bm{X}_{m}$ and $\bm{X}_{u}$. We maximize $F_{V}$ to infer all these hyperparameters $\bm{\zeta}=\{\bm{\Lambda}_{nn},\bm{\theta}^{f},\bm{\theta}^{g},\mu_{0},\bm{X}_{m},\bm{X}_{u}\}$ jointly for model selection. This non-linear optimization task can be solved via conjugate gradient descent (CGD), since the derivatives of $F_{V}$ w.r.t. these hyperparameters have closed forms; see Appendix B.

II-D Predictive distribution

The predictive distribution $p(y_{*}|\bm{y},\bm{x}_{*})$ at the test point $\bm{x}_{*}$ is approximated as

q(y_{*})=\int p(y_{*}|f_{*},g_{*})q(f_{*})q(g_{*})df_{*}dg_{*}. (12)

As for $q(f_{*})=\int p(f_{*}|\bm{f}_{m})q^{\star}(\bm{f}_{m})d\bm{f}_{m}$, we first calculate $q^{\star}(\bm{f}_{m})=\mathcal{N}(\bm{f}_{m}|\bm{\mu}_{f}^{\star},\bm{\Sigma}_{f}^{\star})$ from (8) as

\bm{\mu}_{f}^{\star}=\bm{K}_{mm}^{f}\bm{K}_{R}^{-1}\bm{K}_{mn}^{f}\bm{R}_{g}^{-1}\bm{y},\quad\bm{\Sigma}_{f}^{\star}=\bm{K}_{mm}^{f}\bm{K}_{R}^{-1}\bm{K}_{mm}^{f}, (13)

where $\bm{K}_{R}=\bm{K}_{mn}^{f}\bm{R}_{g}^{-1}\bm{K}_{nm}^{f}+\bm{K}_{mm}^{f}$. Using $q^{\star}(\bm{f}_{m})$, we have $q(f_{*})=\mathcal{N}(f_{*}|\mu_{f*},\sigma_{f*}^{2})$ with

\mu_{f*}=\bm{k}_{*m}^{f}\bm{K}_{R}^{-1}\bm{K}_{mn}^{f}\bm{R}_{g}^{-1}\bm{y}, (14a)
\sigma_{f*}^{2}=k_{**}^{f}-\bm{k}_{*m}^{f}(\bm{K}_{mm}^{f})^{-1}\bm{k}_{m*}^{f}+\bm{k}_{*m}^{f}\bm{K}_{R}^{-1}\bm{k}_{m*}^{f}. (14b)

It is interesting to find in (14b) that the correction term $\bm{k}_{*m}^{f}\bm{K}_{R}^{-1}\bm{k}_{m*}^{f}$ contains the heteroscedasticity information from the noise term $\bm{R}_{g}$. Hence, $q(f_{*})$ produces heteroscedastic variances over the input domain; see the illustrative example in Fig. 2(b). The heteroscedastic $\sigma^{2}_{f}(\bm{x}_{*})$ (i) eases the learning of $g$, and (ii) plays an auxiliary role, since the heteroscedasticity is mainly explained by $g$. Also, our VSHGP is believed to produce a better prediction mean $\mu_{f}(\bm{x}_{*})$ through the interaction between $f$ and $g$ in (14a), provided that $g$ is learned well; see the numerical experiments below.

Similarly, we have the predictive distribution $q(g_{*})=\int p(g_{*}|\bm{g}_{u})q(\bm{g}_{u})d\bm{g}_{u}=\mathcal{N}(g_{*}|\mu_{g*},\sigma_{g*}^{2})$ where, given $\bm{K}_{\Lambda}=\bm{K}_{un}^{g}\bm{\Lambda}_{nn}^{-1}\bm{K}_{nu}^{g}+\bm{K}_{uu}^{g}$,

\mu_{g*}=\bm{k}_{*u}^{g}(\bm{K}_{uu}^{g})^{-1}(\bm{\mu}_{u}-\mu_{0}\bm{1})+\mu_{0}\bm{1}, (15a)
\sigma^{2}_{g*}=k_{**}^{g}-\bm{k}_{*u}^{g}(\bm{K}_{uu}^{g})^{-1}\bm{k}_{u*}^{g}+\bm{k}_{*u}^{g}\bm{K}_{\Lambda}^{-1}\bm{k}_{u*}^{g}. (15b)

Finally, using the posteriors $q(f_{*})$ and $q(g_{*})$, and the likelihood $p(y_{*}|f_{*},g_{*})=\mathcal{N}(y_{*}|f_{*},e^{g_{*}})$, we have

q(y_{*})=\int\mathcal{N}(y_{*}|\mu_{f*},e^{g_{*}}+\sigma_{f*}^{2})\mathcal{N}(g_{*}|\mu_{g*},\sigma_{g*}^{2})dg_{*}, (16)

which is intractable and non-Gaussian. However, the integral can be approximated up to several digits using the Gauss-Hermite quadrature, resulting in the mean and variance as [24]

\mu_{*}=\mu_{f*},\quad\sigma^{2}_{*}=\sigma_{f*}^{2}+e^{\mu_{g*}+\sigma_{g*}^{2}/2}. (17)

In the final prediction variance, the variance $\sigma^{2}_{f*}$ represents the uncertainty about $f$ due to data density, and it approaches zero with increasing $n$; the exponential term captures the intrinsic heteroscedastic noise uncertainty.
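Given the moments of $q(f_{*})$ and $q(g_{*})$, the combination in (17) is straightforward; the following sketch (with our own function name) makes the log-normal correction of the noise term explicit.

```python
import numpy as np

def predictive_moments(mu_f, var_f, mu_g, var_g):
    """Eq. (17): mean = mu_f; variance = var_f + E[exp(g_*)] with g_* ~ N(mu_g, var_g)."""
    mu_y = mu_f
    var_y = var_f + np.exp(mu_g + 0.5 * var_g)  # log-normal mean of the heteroscedastic noise
    return mu_y, var_y
```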

It is notable that the unifying VSHGP includes VHGP [24] and the variational sparse GP (VSGP) [31] as special cases: when $\bm{f}_{m}=\bm{f}$ and $\bm{g}_{u}=\bm{g}$, VSHGP recovers VHGP; when $q(\bm{g}_{u})=p(\bm{g}_{u})$, i.e., we are facing a homoscedastic regression task, VSHGP degenerates to VSGP.

Overall, by introducing inducing sets for both $\bm{f}$ and $\bm{g}$, VSHGP is equipped with the means to handle large-scale heteroscedastic regression. However, (i) the current time complexity $\mathcal{O}(nm^{2}+nu^{2})$, which is linear in the training size, still makes VSHGP unaffordable for, e.g., millions of data points; and (ii) as a global approximation, the capability of VSHGP is limited by the small and global inducing sets.

To this end, we introduce below two strategies to further improve the scalability and capability of VSHGP.

III Stochastic VSHGP

To further improve the scalability of VSHGP, the variational distribution $q(\bm{f}_{m})=\mathcal{N}(\bm{f}_{m}|\bm{\mu}_{m},\bm{\Sigma}_{m})$ is re-introduced to use the original bound $F=\int q(\bm{z})\log\frac{p(\bm{z},\bm{y})}{q(\bm{z})}d\bm{z}$ in (7). Given the factorized likelihood $p(\bm{y}|\bm{f},\bm{g})=\prod_{i=1}^{n}p(y_{i}|f_{i},g_{i})$, the ELBO $F$ is

F=\sum_{i=1}^{n}\mathbb{E}_{q(f_{i})q(g_{i})}[\log p(y_{i}|f_{i},g_{i})]-\mathrm{KL}[q(\bm{f}_{m})||p(\bm{f}_{m})]-\mathrm{KL}[q(\bm{g}_{u})||p(\bm{g}_{u})], (18)

where the first expectation term is expressed as

\sum_{i=1}^{n}\mathbb{E}_{q(f_{i})q(g_{i})}[\log p(y_{i}|f_{i},g_{i})]=\sum_{i=1}^{n}\left[\log\mathcal{N}(y_{i}|[\bm{\mu}_{f}]_{i},[\bm{R}_{g}]_{ii})-\frac{1}{4}[\bm{\Sigma}_{g}]_{ii}-\frac{1}{2}[\bm{\Sigma}_{f}\bm{R}_{g}^{-1}]_{ii}\right],

with $\bm{\mu}_{f}=\bm{\Omega}_{nm}^{f}\bm{\mu}_{m}$ and $\bm{\Sigma}_{f}=\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f}+\bm{\Omega}_{nm}^{f}\bm{\Sigma}_{m}(\bm{\Omega}_{nm}^{f})^{\mathsf{T}}$.

The new $F$ is a relaxed version of $F_{V}$ in (9). It is found that the derivatives satisfy

\frac{\partial F}{\partial\bm{\mu}_{m}}=(\bm{\Omega}_{nm}^{f})^{\mathsf{T}}\bm{R}_{g}^{-1}(\bm{y}-\bm{\Omega}_{nm}^{f}\bm{\mu}_{m})-(\bm{K}_{mm}^{f})^{-1}\bm{\mu}_{m}, (19)
\frac{\partial F}{\partial\bm{\Sigma}_{m}}=-\frac{1}{2}(\bm{\Omega}_{nm}^{f})^{\mathsf{T}}\bm{R}_{g}^{-1}\bm{\Omega}_{nm}^{f}+\frac{1}{2}(\bm{\Sigma}_{m}^{-1}-(\bm{K}_{mm}^{f})^{-1}).

Setting the gradients to zero recovers the optimal solution $q^{\star}(\bm{f}_{m})$ in (13), indicating that $F_{V}\geq F$ with equality at $q(\bm{f}_{m})=q^{\star}(\bm{f}_{m})$.

The scalability is improved through the first term on the right-hand side of (18), which factorizes over data points. This sum form allows the use of efficient stochastic gradient descent (SGD), e.g., Adam [57], with a mini-batch mode for big data. Specifically, we choose a random subset $\mathcal{B}\subseteq\{1,\cdots,n\}$ to obtain an unbiased estimate of $F$ as

\widetilde{F}=\frac{n}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\mathbb{E}_{q(f_{i})q(g_{i})}[\log p(y_{i}|f_{i},g_{i})]-\mathrm{KL}[q(\bm{f}_{m})||p(\bm{f}_{m})]-\mathrm{KL}[q(\bm{g}_{u})||p(\bm{g}_{u})], (20)

where $|\mathcal{B}|\ll n$ is the mini-batch size. More efficiently, since the two variational distributions are defined in terms of KL divergence, we could optimize them along the natural gradients instead of the Euclidean gradients; see Appendix C.
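A schematic sketch of the unbiased mini-batch estimate (20) is given below; the callables for the per-point expectation and the two KL terms are placeholders standing in for the analytical expressions above.

```python
import numpy as np

def stochastic_elbo(expected_loglik, kl_fm, kl_gu, n, batch_idx):
    """Eq. (20): (n/|B|) * sum_{i in B} E_q[log p(y_i|f_i,g_i)] - KL[q(f_m)||p(f_m)] - KL[q(g_u)||p(g_u)].

    expected_loglik(i) returns the per-point expectation; kl_fm and kl_gu are scalars."""
    scale = n / float(len(batch_idx))                 # re-weight the mini-batch sum
    data_term = scale * sum(expected_loglik(i) for i in batch_idx)
    return data_term - kl_fm - kl_gu

# example with a dummy per-point term
rng = np.random.default_rng(0)
batch = rng.choice(10000, size=500, replace=False)
elbo = stochastic_elbo(lambda i: -0.5, kl_fm=1.2, kl_gu=0.8, n=10000, batch_idx=batch)
```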

Finally, the predictions of the stochastic VSHGP (SVSHGP) follow (17), with the predictive moments of $f_{*}$ and $g_{*}$ now given by

\mu_{f*}=\bm{k}_{*m}^{f}(\bm{K}_{mm}^{f})^{-1}\bm{\mu}_{m},
\sigma^{2}_{f*}=k_{**}^{f}-\bm{k}_{*m}^{f}(\bm{K}_{mm}^{f})^{-1}(\bm{K}_{mm}^{f}-\bm{\Sigma}_{m})(\bm{K}_{mm}^{f})^{-1}\bm{k}_{m*}^{f},

and

\mu_{g*}=\bm{k}_{*u}^{g}(\bm{K}_{uu}^{g})^{-1}(\bm{\mu}_{u}-\mu_{0}\bm{1})+\mu_{0}\bm{1},
\sigma^{2}_{g*}=k_{**}^{g}-\bm{k}_{*u}^{g}(\bm{K}_{uu}^{g})^{-1}(\bm{K}_{uu}^{g}-\bm{\Sigma}_{u})(\bm{K}_{uu}^{g})^{-1}\bm{k}_{u*}^{g}.

Compared to the deterministic VSHGP, the stochastic variant greatly reduces the time complexity from $\mathcal{O}(nm^{2}+nu^{2})$ to $\mathcal{O}(|\mathcal{B}|m^{2}+|\mathcal{B}|u^{2}+m^{3}+u^{3})$, at the cost of requiring more optimization effort in the enlarged parameter space: unlike VSHGP, SVSHGP cannot re-parameterize $\bm{\mu}_{u}$ and $\bm{\Sigma}_{u}$, and has to infer $m+m(m+1)/2$ more variational parameters in $\bm{\mu}_{m}$ and $\bm{\Sigma}_{m}$. Besides, the capability of SVSHGP, akin to VSHGP, is still limited by the finite number of global inducing points.

IV Distributed VSHGP

To further improve the scalability and capability of VSHGP via many inducing points, the distributed variant, named DVSHGP, combines VSHGP with local approximations, e.g., the Bayesian committee machine (BCM) [39, 40], to enable distributed learning and capture local variety.

IV-A Training experts with hybrid parameters

We first partition the training data $\mathcal{D}$ into $M$ subsets $\mathcal{D}_{i}=\{\bm{X}_{i},\bm{y}_{i}\}$, $1\leq i\leq M$. Then, we train a VSHGP expert $\mathcal{M}_{i}$ on $\mathcal{D}_{i}$ using the relevant inducing sets $\bm{X}_{m_{i}}$ and $\bm{X}_{u_{i}}$. Particularly, to obtain computational gains, an independence assumption is imposed on all the experts $\{\mathcal{M}_{i}\}_{i=1}^{M}$ such that $\log p(\bm{y};\bm{X},\bm{\zeta})$ is decomposed into the sum of $M$ individual terms

\log p(\bm{y};\bm{X},\bm{\zeta})\approx\sum_{i=1}^{M}\log p(\bm{y}_{i};\bm{X}_{i},\bm{\zeta}_{i})\geq\sum_{i=1}^{M}F_{V_{i}}, (21)

where $\bm{\zeta}_{i}$ is the hyperparameters of $\mathcal{M}_{i}$, and $F_{V_{i}}=F(q(\bm{g}_{u_{i}}))$ is the ELBO of $\mathcal{M}_{i}$. The factorization (21) allows the inversions to be calculated efficiently as $(\bm{K}_{mm}^{f})^{-1}\approx\mathrm{diag}[\{(\bm{K}^{f}_{m_{i}m_{i}})^{-1}\}_{i=1}^{M}]$ and $(\bm{K}_{uu}^{g})^{-1}\approx\mathrm{diag}[\{(\bm{K}^{g}_{u_{i}u_{i}})^{-1}\}_{i=1}^{M}]$.

We train these VSHGP experts with hybrid parameters. Specifically, the BCM-type aggregation requires sharing the priors $p(f_{*})$ and $p(g_{*})$ over the experts. That means we should share the hyperparameters, including $\bm{\theta}^{f}$, $\bm{\theta}^{g}$ and $\mu_{0}$, across experts. These global parameters are beneficial for guarding against over-fitting [40], at the cost, however, of degrading the capability. Hence, we leave the variational parameters $\bm{\Lambda}_{n_{i}n_{i}}$ and the inducing points $\bm{X}_{m_{i}}$ and $\bm{X}_{u_{i}}$ to be inferred individually by each expert. These local parameters improve the capturing of local variety by (i) pushing $q(\bm{g}_{u_{i}})$ towards the posterior $p(\bm{g}_{u_{i}}|\bm{y}_{i})$ of $\mathcal{M}_{i}$, and (ii) using many inducing points.

Besides, because of the local parameters, we prefer partitioning the data into disjoint experts rather than random experts as in [40]. The disjoint partition using clustering techniques produces local and separate experts, which are desirable for learning the relevant local parameters. In contrast, the random partition, which assigns points randomly to the subsets, produces global and overlapped experts, for which it is difficult to estimate the local parameters well. For instance, when DVSHGP uses random experts on the toy example below, it fails to capture the heteroscedastic noise.

Finally, supposing that each expert has the same training size $n_{0}=n/M$, the training complexity per expert is $\mathcal{O}(n_{0}m_{0}^{2}+n_{0}u_{0}^{2})$, where $m_{0}$ is the inducing size for $\bm{f}_{i}$ and $u_{0}$ that for $\bm{g}_{i}$. Due to the $M$ local experts, DVSHGP naturally offers parallel/distributed training, hence reducing the time complexity of VSHGP by a factor ideally close to the number of machines when $m_{0}=m$ and $u_{0}=u$.
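A minimal sketch of the disjoint partitioning that creates the $M$ experts is shown below, here using scikit-learn's KMeans as one possible clustering choice; the subsequent per-expert training call is deliberately omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def partition_experts(X, y, M, seed=0):
    """Disjoint k-means partition of the training data into M subsets, one per VSHGP expert."""
    labels = KMeans(n_clusters=M, random_state=seed, n_init=10).fit_predict(X)
    return [(X[labels == i], y[labels == i]) for i in range(M)]

# each subset (X_i, y_i) would then be handed to an independent VSHGP expert,
# e.g., trained in parallel over the available cores
subsets = partition_experts(np.random.randn(1000, 2), np.random.randn(1000), M=5)
```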

IV-B Aggregating experts’ predictions

For each VSHGP expert $\mathcal{M}_{i}$, we obtain the predictive distribution $q_{i}(y_{*})$ with the means $\{\mu_{f_{i}*},\mu_{g_{i}*},\mu_{i*}\}$ and variances $\{\sigma^{2}_{f_{i}*},\sigma^{2}_{g_{i}*},\sigma^{2}_{i*}\}$. Thereafter, we combine the experts' predictions to perform the final prediction by, for example, the robust BCM (RBCM) aggregation, which naturally supports distributed/parallel computing [40, 58].

The key to the success of the aggregation is that we do not directly combine the experts' predictions $\{\mu_{i*},\sigma^{2}_{i*}\}_{i=1}^{M}$. This is because (i) the RBCM aggregation of $\{q_{i}(y_{*})\}_{i=1}^{M}$ produces an inconsistent prediction variance with increasing $n$ and $M$ [41]; and (ii) the predictive distribution $q_{i}(y_{*})$ in (16) is non-Gaussian. (Note that the generalized RBCM strategy [41], which provides consistent predictions, is not favored by our DVSHGP with local parameters, since it requires (a) the experts' predictions to be Gaussian and (b) an additional global base expert.) To have a meaningful prediction variance, which is crucial for heteroscedastic regression, we instead perform the aggregation for the latent functions $f$ and $g$, respectively. This is because the prediction variances of $f$ and $g$ approach zero with increasing $n$, which agrees with the property of RBCM.

We first have the aggregated predictive distribution for $f_{*}$, given the prior $p(f_{*})=\mathcal{N}(f_{*}|0,\sigma_{f**}^{2}\triangleq k^{f}_{**})$, as

p_{\mathcal{A}}(f_{*}|\bm{y},\bm{x}_{*})=\frac{\prod_{i=1}^{M}[q_{i}(f_{*})]^{w^{f}_{i*}}}{[p(f_{*})]^{\sum_{i}w^{f}_{i*}-1}}, (22)

with the mean and variance expressed, respectively, as

\mu_{f_{\mathcal{A}}*}=\sigma_{f_{\mathcal{A}}*}^{2}\sum_{i=1}^{M}w^{f}_{i*}\sigma_{f_{i}*}^{-2}\mu_{f_{i}*},
\sigma_{f_{\mathcal{A}}*}^{-2}=\sum_{i=1}^{M}w^{f}_{i*}\sigma_{f_{i}*}^{-2}+\left(1-\sum_{i=1}^{M}w^{f}_{i*}\right)\sigma_{f**}^{-2},

where the weight $w^{f}_{i*}\geq 0$ represents the contribution of $\mathcal{M}_{i}$ at $\bm{x}_{*}$ for $f$, and is defined as the difference in differential entropy between the prior $p(f_{*})$ and the posterior $q_{i}(f_{*})$, i.e., $w^{f}_{i*}=0.5(\log\sigma_{f**}^{2}-\log\sigma_{f_{i}*}^{2})$. Similarly, for $g$, which explicitly considers a prior mean $\mu_{0}$, the aggregated predictive distribution is

p_{\mathcal{A}}(g_{*}|\bm{y},\bm{x}_{*})=\frac{\prod_{i=1}^{M}[q_{i}(g_{*})]^{w^{g}_{i*}}}{[p(g_{*})]^{\sum_{i}w^{g}_{i*}-1}}, (23)

with the mean and variance expressed, respectively, as

\mu_{g_{\mathcal{A}}*}=\sigma_{g_{\mathcal{A}}*}^{2}\left[\sum_{i=1}^{M}w^{g}_{i*}\sigma_{g_{i}*}^{-2}\mu_{g_{i}*}+\left(1-\sum_{i=1}^{M}w^{g}_{i*}\right)\sigma_{g**}^{-2}\mu_{0}\right],
\sigma_{g_{\mathcal{A}}*}^{-2}=\sum_{i=1}^{M}w^{g}_{i*}\sigma_{g_{i}*}^{-2}+\left(1-\sum_{i=1}^{M}w^{g}_{i*}\right)\sigma_{g**}^{-2},

where $\sigma_{g**}^{-2}$ is the prior precision of $g_{*}$, and $w^{g}_{i*}=0.5(\log\sigma_{g**}^{2}-\log\sigma_{g_{i}*}^{2})$ is the weight of $\mathcal{M}_{i}$ at $\bm{x}_{*}$ for $g$.
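The aggregation (22) is a weighted product of Gaussian experts corrected by the prior; a NumPy sketch for the latent $f$ at a single test point is given below (the $g$ case in (23) additionally weights the prior mean $\mu_{0}$), with illustrative variable names.

```python
import numpy as np

def rbcm_aggregate_f(mu_experts, var_experts, prior_var):
    """RBCM aggregation for f_* at one test point, Eq. (22).
    mu_experts, var_experts: arrays of shape (M,) with the experts' predictions;
    prior_var: prior variance k_{**}^f of f_*."""
    w = 0.5 * (np.log(prior_var) - np.log(var_experts))        # differential-entropy weights
    prec = np.sum(w / var_experts) + (1.0 - np.sum(w)) / prior_var
    var_agg = 1.0 / prec
    mu_agg = var_agg * np.sum(w * mu_experts / var_experts)
    return mu_agg, var_agg
```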

Figure 1: Hierarchy of the proposed DVSHGP model.

Thereafter, as shown in Fig. 1, the final predictions akin to (17) are combined as

\mu_{\mathcal{A}*}=\mu_{f_{\mathcal{A}}*},\quad\sigma^{2}_{\mathcal{A}*}=\sigma_{f_{\mathcal{A}}*}^{2}+e^{\mu_{g_{\mathcal{A}}*}+\sigma_{g_{\mathcal{A}}*}^{2}/2}. (24)

The hierarchical and localized computation structure enables (i) large-scale regression via distributed computations, and (ii) flexible approximation of slow-/quick-varying features by local experts and many inducing points (up to the training size $n$).

Figure 2: Illustration of DVSHGP on the toy example. The crosses marked in different colors in (a) represent the $M=5$ subsets. The top circles and bottom squares marked in different colors represent the optimized positions of the inducing points for $\{\bm{f}_{i}\}_{i=1}^{M}$ and $\{\bm{g}_{i}\}_{i=1}^{M}$, respectively. The red curves present the prediction mean, whereas the black curves represent the 95% confidence interval of the prediction mean.

Finally, we illustrate the DVSHGP on a heteroscedastic toy example expressed as

y(x)=\mathrm{sinc}(x)+\epsilon,\quad x\in[-10,10], (25)

where $\epsilon\sim\mathcal{N}(0,\sigma_{\epsilon}^{2}(x))$ and $\sigma_{\epsilon}(x)=0.05+0.2(1+\sin(2x))/(1+e^{-0.2x})$. We draw 500 training points from (25), and use the $k$-means technique to partition them into five disjoint subsets. We then employ ten inducing points for both $\bm{f}_{i}$ and $\bm{g}_{i}$ of each VSHGP expert $\mathcal{M}_{i}$, $1\leq i\leq 5$. Fig. 2 shows that (i) DVSHGP can efficiently employ up to 100 inducing points for modeling through the five local experts, and (ii) DVSHGP successfully describes the underlying function $f$ and the heteroscedastic log noise variance $g$.
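For reproducibility, the heteroscedastic toy data (25) can be generated as in the following sketch; the seed and sampling scheme are arbitrary choices of ours.

```python
import numpy as np

def toy_data(n=500, seed=0):
    """y(x) = sinc(x) + eps, eps ~ N(0, sigma_eps^2(x)), x in [-10, 10], Eq. (25)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10.0, 10.0, size=n)
    sigma_eps = 0.05 + 0.2 * (1.0 + np.sin(2.0 * x)) / (1.0 + np.exp(-0.2 * x))
    y = np.sinc(x / np.pi) + sigma_eps * rng.standard_normal(n)  # np.sinc is sin(pi x)/(pi x)
    return x, y
```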

V Discussions regarding implementation

V-A Implementation of DVSHGP

As for DVSHGP, we should infer (i) the global parameters, including the kernel parameters $\bm{\theta}^{f}$ and $\bm{\theta}^{g}$ and the mean $\mu_{0}$; and (ii) the local parameters, including the variational parameters $\bm{\Lambda}_{n_{i}n_{i}}$ and the inducing parameters $\bm{X}_{m_{i}}$ and $\bm{X}_{u_{i}}$, for the local experts $\{\mathcal{M}_{i}\}_{i=1}^{M}$.

Notably, the variational parameters $\bm{\Lambda}_{n_{i}n_{i}}$ are crucial for the success of DVSHGP, since they represent the heteroscedasticity. To learn them well, there are two issues: (i) how to initialize them and (ii) how to optimize them. As for initialization, let us focus on VSHGP, which is the foundation for the experts in DVSHGP. It is observed in (11) that $\bm{\Lambda}_{nn}$ directly determines the initialization of $q(\bm{g}_{u})=\mathcal{N}(\bm{g}_{u}|\bm{\mu}_{u},\bm{\Sigma}_{u})$. Given the prior $\bm{g}_{u}\sim\mathcal{N}(\bm{g}_{u}|\mu_{0}\bm{1},\bm{K}_{uu}^{g})$, we intuitively place the prior mean $\mu_{0}\bm{1}$ on $\bm{\mu}_{u}$, resulting in $\bm{\Lambda}_{nn}=0.5\bm{I}$. In contrast, if we initialize $[\bm{\Lambda}_{nn}]_{ii}$ with a value larger or smaller than $0.5$, the cumulative term $\bm{K}_{un}^{g}(\bm{\Lambda}_{nn}-0.5\bm{I})\bm{1}$ in (11a) moves far away from zero with increasing $n$, leading to an improper prior mean for $\bm{\mu}_{u}$. As for optimization, compared to the standard GP, our DVSHGP needs to additionally infer $n$ variational parameters and $M(m_{0}+u_{0})d$ inducing parameters, which greatly enlarge the parameter space and increase the optimization difficulty. Hence, we use an alternating strategy: we first optimize the variational parameters individually to roughly capture the heteroscedasticity, and then learn all the hyperparameters jointly.

Figure 3: The variational parameters learnt respectively by DVSHGP and VHGP on the toy problem. The crosses marked in different colors represent the variational parameters inferred for the five local experts of DVSHGP.

Fig. 3 depicts the variational parameters inferred by DVSHGP and the original VHGP [24], respectively, over the training points of the toy problem. It turns out that the variational parameters estimated by DVSHGP (i) generally agree with those of VHGP, and (ii) showcase local characteristics that are beneficial for describing local variety.

V-B Implementation of SVSHGP

As for SVSHGP, to effectively infer the variational parameters in $q(\bm{f}_{m})$ and $q(\bm{g}_{u})$, we adopt natural gradient descent (NGD), which, however, requires careful tuning of the step parameter $\gamma$ defined in Appendix C. For the Gaussian likelihood, the optimal choice is $\gamma=1.0$, since taking a unit step is equivalent to performing a VB update [34]. But for the stochastic case, empirical results suggest that $\gamma$ should be gradually increased to some fixed value. Hence, we follow the schedule in [59]: take $\gamma_{\mathrm{initial}}=0.0001$ and log-linearly increase $\gamma$ to $\gamma_{\mathrm{final}}=0.1$ over five iterations, and then keep $\gamma_{\mathrm{final}}$ for the remaining iterations.

Thereafter, we employ a hybrid strategy, called NGD+Adam, for optimization. Specifically, we perform a step of NGD on the variational parameters with the aforementioned $\gamma$ schedule, followed by a step of Adam on the remaining hyperparameters with a fixed step $\gamma=0.01$.
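The log-linear ramp of the NGD step size described above can be written as the following sketch; only the schedule itself is shown, the optimizer updates being omitted.

```python
import numpy as np

def ngd_step_schedule(n_iter, gamma_init=1e-4, gamma_final=0.1, ramp_iters=5):
    """Log-linearly increase gamma from gamma_init to gamma_final over ramp_iters iterations,
    then hold gamma_final for the remaining iterations."""
    ramp = np.logspace(np.log10(gamma_init), np.log10(gamma_final), ramp_iters)
    return np.concatenate([ramp, np.full(n_iter - ramp_iters, gamma_final)])

gammas = ngd_step_schedule(1000)  # gamma used for the NGD step at each iteration
```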

Figure 4: Illustration of SVSHGP using Adam and NGD+Adam respectively on the toy example. The dashed line represents the final ELBO of VSHGP.

Fig. 4 depicts the convergence histories of SVSHGP using Adam and NGD+Adam respectively on the toy example. We use $m=u=20$ inducing points and a mini-batch size of $|\mathcal{B}|=50$. As the ground truth, the final ELBO obtained by VSHGP is provided. It is observed that (i) NGD+Adam converges faster than pure Adam, and (ii) the stochastic optimizers finally approach the solution of the CGD used in VSHGP.

V-C DVSHGP vs. (S)VSHGP

Compared to the global (S)VSHGP, the performance of DVSHGP is enhanced by many inducing points and the localized experts with individual variational and inducing parameters, resulting in the capability of capturing quick-varying features. To verify this, we apply DVSHGP and (S)VSHGP to the time-series solar irradiance dataset [60] which contains quick-varying and heteroscedastic features.

Figure 5: Comparison of DVSHGP and VSHGP on the solar irradiance dataset.

In the comparison, DVSHGP employs the $k$-means technique to partition the 391 training points into $M=10$ subsets, and uses $m_{0}=u_{0}=20$ inducing points for each expert; (S)VSHGP employs $m=u=20$ inducing points. Particularly, for (S)VSHGP we initialize the length-scales of the SE kernel (3) for $k^{f}$ and $k^{g}$ with a relatively small value of 5.0 on this quick-varying dataset. Fig. 5 shows that (i) DVSHGP captures the quick-varying and heteroscedastic features successfully via local experts and many inducing points, whereas (ii) (S)VSHGP fails due to the small set of global inducing points. (Since VSHGP and SVSHGP show similar predictions on this dataset, we only illustrate the VSHGP predictions here.)

VI Numerical experiments

This section verifies the proposed DVSHGP and SVSHGP against existing scalable HGPs on a synthetic dataset and four real-world datasets. The comparison includes (i) GPVC [8], (ii) the distributed variant of PIC (dPIC) [33], (iii) FITC [49], and (iv) the SoD based empirical HGP (EHSoD) [13]. Besides, the comparison also employs the homoscedastic VSGP [31] and RBCM [40] to showcase the benefits brought by the consideration of heteroscedasticity.

We implement DVSHGP, FITC, EHSoD, VSGP and RBCM by the GPML toolbox [54], and implement SVSHGP by the GPflow toolbox [52]; we use the published GPVC codes available at https://github.com/OxfordML/GPz and the dPIC codes available at https://github.com/qminh93/dSGP_ICML16. They are executed on a personal computer with four 3.70 GHz cores and 16 GB RAM for the synthetic and three medium-sized datasets, and on a Linux workstation with eight 3.20 GHz cores and 32GB memory for the large-scale dataset.

All the GPs employ the SE kernel in (3). Both $\bm{X}$ and $\bm{y}$ are normalized to zero mean and unit variance before training. Finally, we use $n_{*}$ test points $\{\bm{X}_{*},\bm{y}_{*}\}$ to assess the model accuracy by the standardized mean square error (SMSE) and the mean standardized log loss (MSLL) [1]. The SMSE quantifies the discrepancy between the predictions and the exact observations; particularly, it equals one when the model always predicts the mean of $\bm{y}$. The MSLL quantifies the quality of the predictive distribution, and is negative for better models; particularly, it equals zero when the model always predicts the mean and variance of $\bm{y}$.
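The two criteria can be computed as in the sketch below, which is our own implementation of the standard definitions in [1], using the training targets for standardization.

```python
import numpy as np

def smse(y_test, mu_pred, y_train):
    """Standardized mean square error: MSE normalized by the variance of the training targets."""
    return np.mean((y_test - mu_pred) ** 2) / np.var(y_train)

def msll(y_test, mu_pred, var_pred, y_train):
    """Mean standardized log loss: negative log predictive density minus that of the
    trivial Gaussian model built from the training mean and variance."""
    nll_model = 0.5 * np.log(2 * np.pi * var_pred) + (y_test - mu_pred) ** 2 / (2 * var_pred)
    m0, v0 = np.mean(y_train), np.var(y_train)
    nll_trivial = 0.5 * np.log(2 * np.pi * v0) + (y_test - m0) ** 2 / (2 * v0)
    return np.mean(nll_model - nll_trivial)
```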

VI-A Synthetic dataset

We employ a 2D version of the toy example (25) as

y(\bm{x})=f(\bm{x})+\epsilon(\bm{x}),\quad\bm{x}\in[-10,10]^{2},

with the highly nonlinear latent function $f(\bm{x})=\mathrm{sinc}(0.1x_{1}x_{2})$ and noise $\epsilon(\bm{x})\sim\mathcal{N}(0,\sigma_{\epsilon}^{2}(0.1x_{1}x_{2}))$. We randomly generate 10,000 training points and evaluate the model accuracy on 4,900 grid test points. We generate ten instances of the training data such that each model is repeated ten times.

We have $M=50$ and $m_{0}=u_{0}=100$ for DVSHGP, resulting in $n_{0}=200$ data points assigned to each expert; we have $m_{\mathrm{b}}=300$ basis functions for GPVC; we have $m=300$ for SVSHGP, FITC and VSGP; we have $m=300$ and $M=50$ for dPIC; we have $M=50$ for RBCM; finally, we train two separate GPs on a subset of size $m_{\mathrm{sod}}=2{,}000$ for EHSoD. As for optimization, DVSHGP adopts a two-stage process: it first optimizes only the variational parameters using CGD with up to 30 line searches, and then learns all the parameters jointly using up to 70 line searches; SVSHGP trains with NGD+Adam using $|\mathcal{B}|=1{,}000$ over 1,000 iterations; VSGP, FITC, GPVC and RBCM use up to 100 line searches to learn the parameters; dPIC employs the default optimization settings in the published codes; and finally EHSoD uses up to 50 line searches to train each of the two standard GPs.

Figure 6: Comparison of GPs over ten runs on the synthetic dataset.

Fig. 6 depicts the modeling results of the different GPs over ten runs on the synthetic sinc2D dataset. (Since the dPIC codes only provide the prediction mean, we do not report its MSLL value or its estimated noise variance in the following plots.) The horizontal axis represents the sum of training and predicting time for a model. It turns out that DVSHGP, SVSHGP, dPIC, VSGP and RBCM are competitive in terms of SMSE, but DVSHGP and SVSHGP perform better in terms of MSLL due to the well estimated heteroscedastic noise. Compared to the homoscedastic VSGP and RBCM, FITC has heteroscedastic variances, as indicated by the lower MSLL, at the cost, however, of (i) sacrificing the accuracy of the prediction mean, and (ii) suffering from an invalid noise variance $\sigma^{2}_{\epsilon}$ (FITC estimates $\sigma^{2}_{\epsilon}$ as 0.0030, while VSGP estimates it as 0.0309). As a principled HGP, GPVC performs slightly better than FITC in terms of MSLL. Finally, the simple EHSoD has the worst SMSE, but it outperforms the homoscedastic VSGP and RBCM in terms of MSLL.

Figure 7: The exact noise variance and the prediction variances of different heteroscedastic/homoscedastic GPs on the synthetic dataset.

In terms of efficiency, the RBCM requires less computing time since it contains no variational/inducing parameters, thus resulting in (i) lower complexity, and (ii) early stop for optimization. This also happens for the three datasets below.

Finally, Fig. 7 depicts the prediction variances of all the GPs except dPIC in comparison to the exact $\sigma^{2}$ on the synthetic dataset. It is first observed that the homoscedastic VSGP and RBCM are unable to describe the complex noise variance: they yield a nearly constant variance over the input domain. In contrast, DVSHGP, SVSHGP and GPVC capture the varying noise variance accurately by using the additional noise process $g$; FITC also captures the profile of the exact $\sigma^{2}$, with however unstable peaks and valleys; EHSoD is found to capture only a rough profile of the exact $\sigma^{2}$.

VI-B Medium real-world datasets

This section conducts the comparison on three real-world datasets. The first is the 9D protein dataset [61] with 45,730 data points; taken from CASP 5-9, it describes the physicochemical properties of protein tertiary structure. The second is the 21D sarcos dataset [1], which relates to the inverse kinematics of a robot arm and has 48,933 data points. The third is the 3D 3droad dataset, which comprises 434,874 data points [62] extracted from a 2D road network in North Jutland, Denmark, plus elevation information.

VI-B1 The protein dataset

For the protein dataset, we randomly choose 35,000 training points and 10,730 test points. In the comparison, we have $M=100$ (i.e., $n_{0}=350$) and $m_{0}=u_{0}=175$ for DVSHGP; we have $m=400$ for SVSHGP, VSGP and FITC; we have $m=400$ and $M=100$ for dPIC; we have $m_{\mathrm{b}}=400$ for GPVC; we have $M=100$ for RBCM; and finally we have $m_{\mathrm{sod}}=4{,}000$ for EHSoD. As for optimization, SVSHGP trains with NGD+Adam using $|\mathcal{B}|=2{,}000$ over 2,000 iterations. The optimization settings of the other GPs remain consistent with those used for the synthetic dataset.

Figure 8: Comparison of GPs over ten runs on the protein dataset.

The results of different GPs over ten runs are summarized in Fig. 8. Among the HGPs, it is observed that dPIC outperforms the others in terms of SMSE, followed by DVSHGP. On the other hand, DVSHGP performs the best in terms of MSLL, followed by FITC and SVSHGP. The simple EHSoD is found to produce unstable MSLL results because of the small subset. Finally, the homoscedastic VSGP and RBCM provide mediocre SMSE and MSLL results.

Figure 9: The distributions of log noise variances estimated by different GPs on the protein dataset. The dashed and dotted lines indicate the log noise variances of VSGP and RBCM, respectively.

Next, Fig. 9 offers insights into the distributions of log noise variances of all the GPs except dPIC on the protein dataset for a single run. Note that (i) as homoscedastic GPs, the log noise variances of VSGP and RBCM are marked as dashed and dotted lines, respectively; and (ii) we plot the variance of p(f|𝒟,𝒙)p(f_{*}|\mathcal{D},\bm{x}_{*}) for FITC since (a) it accounts for the heteroscedasticity and (b) the scalar noise variance \sigma^{2}_{\epsilon} is severely underestimated. The results in Fig. 9 indicate that the protein dataset may contain heteroscedastic noise. Besides, compared to VSGP, which uses a global inducing set, the localized RBCM provides a more compact estimate of \sigma^{2}_{\epsilon}. This compact noise variance, which is also observed on the two datasets below, yields a lower MSLL for RBCM.

Furthermore, we clearly observe the interaction between ff and gg for DVSHGP, SVSHGP and GPVC. The small MSLL of RBCM suggests that the protein dataset may have small noise variances at some test points. Hence, the localized DVSHGP, which captures local variety through individual variational and inducing parameters for each expert, produces a longer tail in Fig. 9. The well estimated heteroscedastic noise in turn improves the prediction mean of DVSHGP through the interaction between ff and gg. In contrast, due to the limited global inducing set, SVSHGP and GPVC trade the accuracy of the prediction mean for capturing heteroscedastic noise.

Figure 10: The effect of algorithmic parameters on the performance of different sparse GPs on the protein dataset.

Notably, the performance of sparse GPs is affected by their modeling parameters, e.g., the inducing sizes m0m_{0}, mm, u0u_{0} and uu, the number of basis functions mbm_{\mathrm{b}}, and the subset size msodm_{\mathrm{sod}}. Fig. 10(a) and (b) depict the average results of sparse GPs over ten runs using different parameters. Particularly, we investigate the impact of the subset size n0n_{0} on DVSHGP in Fig. 10(c) using m0=u0=0.5n0m_{0}=u_{0}=0.5n_{0}. It is found that DVSHGP favours large n0n_{0} (small MM) and large m0m_{0} and u0u_{0}. Similarly, VSGP and FITC favour more inducing points. However, dPIC shows unstable SMSE performance with increasing mm; GPVC performs slightly worse with increasing mbm_{\mathrm{b}} in terms of both SMSE and MSLL, which has also been observed in the original paper [8] and may be caused by the sharing of basis functions for ff and gg. Finally, EHSoD shows poorer MSLL values when msod3000m_{\mathrm{sod}}\geq 3000 because of the difficulty of approximating the empirical variances.

VI-B2 The sarcos dataset

For the sarcos dataset, we randomly choose 40,000 training points and 8,933 test points. In the comparison, we have M=120M=120 (i.e., n0333n_{0}\approx 333) and m0=u0=175m_{0}=u_{0}=175 for DVSHGP; we have m=600m=600 for SVSHGP, VSGP and FITC; we have m=600m=600 and M=120M=120 for dPIC; we have mb=600m_{\mathrm{b}}=600 for GPVC; we have M=120M=120 for RBCM; and finally we have msod=4,000m_{\mathrm{sod}}=4,000 for EHSoD. The optimization settings are the same as before.

The results of different GPs over ten runs on the sarcos dataset are depicted in Fig. 11. Besides, Fig. 12 depicts the log noise variances of the GPs on this dataset. Different from the protein dataset, the sarcos dataset seems to exhibit only weak heteroscedastic noise across the input domain, which is verified by the facts that (i) the noise variance estimated by DVSHGP is nearly constant, and (ii) DVSHGP agrees with RBCM in terms of both SMSE and MSLL. Hence, all the HGPs except EHSoD perform similarly in terms of MSLL.

Figure 11: Comparison of GPs over ten runs on the sarcos dataset.
Figure 12: The distributions of log noise variances of different GPs on the sarcos dataset. The dashed and dotted lines indicate the log noise variances of VSGP and RBCM, respectively.

In addition, the weak heteroscedasticity in the sarcos dataset suggests that we can use only a few inducing points for 𝒈\bm{g} to speed up the inference. For instance, we retrain DVSHGP using u0=5u_{0}=5. This extremely small inducing set for 𝒈\bm{g} (i) greatly reduces the computing time, to about 350 seconds, and (ii) retains almost the same model accuracy, with SMSE = 0.0099 and MSLL = -2.3034.

VI-B3 The 3droad dataset

Finally, for the 3droad dataset, we randomly choose 390,000 training points, and use the remaining 44,874 data points for testing. In the comparison, we have M=800M=800 (i.e., n0487n_{0}\approx 487) and m0=u0=250m_{0}=u_{0}=250 for DVSHGP; we have m=500m=500 for SVSHGP, VSGP and FITC; we have m=500m=500 and M=800M=800 for dPIC; we have mb=500m_{\mathrm{b}}=500 for GPVC; we have M=800M=800 for RBCM; and finally we have msod=8,000m_{\mathrm{sod}}=8,000 for EHSoD. As for optimization, SVSHGP trains with NGD+Adam using ||=4,000|\mathcal{B}|=4,000 over 4,000 iterations. The optimization settings of the other GPs remain the same as before.

Figure 13: Comparison of GPs over ten runs on the 3droad dataset.

The results of different GPs over ten runs on the 3droad dataset are depicted in Fig. 13. It is observed that DVSHGP outperforms the others in terms of both SMSE and MSLL, followed by RBCM. For the other HGPs, especially SVSHGP and GPVC, the relatively poor noise variance estimate (large MSLL) in turn sacrifices the accuracy of the prediction mean. Even so, the heteroscedastic noise helps SVSHGP, GPVC and FITC perform similarly to VSGP in terms of MSLL.

In addition, Fig. 14 depicts the log noise variances of these GPs on the 3droad dataset. The highly accurate prediction mean of DVSHGP in turn helps it estimate the heteroscedastic noise well. It is observed that (i) the noise variances estimated by DVSHGP are more compact than those of the other HGPs; and (ii) its average noise variance agrees with that of RBCM.

Figure 14: The distributions of log noise variances of different GPs on the 3droad dataset. The dashed and dotted lines indicate the log noise variances of VSGP and RBCM, respectively.

Finally, the results on the 3droad dataset, together with those on the other two datasets, indicate that:

  • a well estimated noise variance of an HGP in turn improves the prediction mean via the interaction between ff and gg; otherwise, the accuracy of the prediction mean may be sacrificed;

  • modeling the heteroscedastic noise usually improves sparse HGPs over the homoscedastic VSGP in terms of MSLL.

VI-C Large real-world dataset

This final section evaluates the performance of different GPs on the 11D electric dataset (available at https://archive.ics.uci.edu/ml/index.php), which is partitioned into two million training points and 49,280 test points. The HGPs in the comparison include DVSHGP, SVSHGP, dPIC and EHSoD; GPVC and FITC are unaffordable for this massive dataset, and the stochastic variant of FITC [35] is not included since it does not support end-to-end training. Besides, RBCM and the stochastic variant of VSGP, named SVGP [34], are employed for this big dataset.

TABLE I: Comparison of GPs on the electric dataset.
        DVSHGP    SVSHGP    dPIC      EHSoD     SVGP      RBCM
SMSE    0.0020    0.0029    0.0042    0.0103    0.0028    0.0023
MSLL    -3.4456   -3.1207   -         -1.9453   -2.8489   -3.0647
t [h]   11.05     7.44      47.22     4.72      3.55      3.97

In the comparison, we have M=2,000M=2,000 (i.e., n0=1,000n_{0}=1,000) and m0=u0=300m_{0}=u_{0}=300 for DVSHGP; we have m=2,500m=2,500 for SVSHGP and SVGP; we have m=2,500m=2,500 and M=2,000M=2,000 for dPIC; we have M=2,000M=2,000 for RBCM; and finally we have msod=15,000m_{\mathrm{sod}}=15,000 for EHSoD. As for optimization, SVSHGP trains with NGD+Adam using ||=5,000|\mathcal{B}|=5,000 over 10,000 iterations. The optimization settings of the other GPs remain the same as before.

The average results over five runs in Table I indicate that DVSHGP outperforms the others in terms of both SMSE and MSLL, followed by SVSHGP. The simple EHSoD provides the worst performance, and cannot be improved by using a larger msodm_{\mathrm{sod}} due to the memory limit of the current infrastructure. Additionally, in terms of efficiency, we find that (i) SVSHGP is faster than DVSHGP owing to the parallel/GPU acceleration provided by TensorFlow (further GPU speedup could also be exploited for DVSHGP in Matlab); (ii) SVGP is faster than SVSHGP because of its lower complexity; and (iii) the huge computing time of dPIC might be incurred by its unoptimized code.

Figure 15: Illustration of (a) the computing time of DVSHGP vs. number of parallel cores, and (b) the performance of SVSHGP, from left to right, using ||=1,000|\mathcal{B}|=1,000, 2,500 and 5,000 on the electric dataset.

Finally, thanks to its distributed framework, DVSHGP can be accelerated over multiple processing cores; Fig. 15(a) depicts its total computing time using different numbers of cores. It is observed that DVSHGP using eight cores achieves a speedup of around 3.5 in comparison to its centralized counterpart. Fig. 15(b) further examines the performance of SVSHGP using varying mini-batch sizes |||\mathcal{B}|. It is observed that (i) a small |||\mathcal{B}| significantly speeds up the model training, and (ii) different mini-batch sizes yield similar SMSE and MSLL here, because the model has been optimized over sufficient iterations.
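For completeness, the sketch below illustrates the stochastic update that underlies this mini-batch behaviour: since the bound of SVSHGP factorizes over data points, a mini-batch gradient rescaled by n/|\mathcal{B}| is an unbiased estimate of the full-data gradient. The callables grad_lik and grad_kl are hypothetical placeholders for the data-dependent and KL terms of the bound, and a plain gradient step stands in for the NGD+Adam combination actually used.

import numpy as np

# A minimal sketch of stochastic training on a factorized bound, assuming
# hypothetical callables grad_lik(idx, params), the gradient of the data term
# on a mini-batch, and grad_kl(params), the gradient of the mini-batch
# independent KL term.
def svi_train(n, params, grad_lik, grad_kl, batch_size=1000, n_iter=1000, lr=1e-2):
    rng = np.random.default_rng(0)
    for _ in range(n_iter):
        idx = rng.choice(n, size=batch_size, replace=False)
        # Rescale the data term by n/|B| for an unbiased full-data gradient,
        # then add the KL term once, and take a gradient-ascent step.
        g = (n / batch_size) * grad_lik(idx, params) + grad_kl(params)
        params = params + lr * g
    return params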

VII Discussions and conclusions

In order to scale up the original HGP [16], we have presented distributed and stochastic variational sparse HGPs. The proposed SVSHGP improves the scalability through stochastic variational inference. The proposed DVSHGP (i) enables large-scale regression via distributed computations, and (ii) achieves high model capability via localized experts and many inducing points. We compared them to existing scalable homoscedastic/heteroscedastic GPs on a synthetic dataset and four real-world datasets. The comparative results indicate that DVSHGP exhibits superior performance in terms of both SMSE and MSLL, whereas SVSHGP, due to its limited global inducing set, may sacrifice the accuracy of the prediction mean for capturing heteroscedastic noise.

Apart from our scalable HGP framework, several new GP paradigms have recently been developed from different perspectives to improve the predictive distribution. For instance, instead of directly modeling the heteroscedastic noise, one could introduce additional latent inputs to modulate the kernel [63, 64]; directly target the posterior distribution of interest to enhance the prediction of ff [65]; adopt highly flexible priors over functions, e.g., implicit processes [66]; mix various GPs during inference [67, 68]; or develop specific non-stationary kernels [69]. These paradigms bring new interpretations, at the cost, however, of losing a direct description of heteroscedasticity or requiring complicated, high-complexity inference. Together with our scalable HGPs, they greatly facilitate future work on fitting data distributions with high quality and efficiency.

Finally, our future work will consider the heteroscedasticity in the underlying function ff, i.e., non-stationarity, as in [22, 18, 67]. The integration of various kinds of heteroscedasticity is believed to further improve predictions.

Appendix A Non-negativity of Λnn\Lambda_{nn}

The variational diagonal matrix 𝚲nn\bm{\Lambda}_{nn} is expressed as

𝚲nn=0.5(𝚲nna+𝚲nnb+𝑰).\bm{\Lambda}_{nn}=0.5(\bm{\Lambda}_{nn}^{a}+\bm{\Lambda}_{nn}^{b}+\bm{I}).

In order to prove the non-negativity of 𝚲nn\bm{\Lambda}_{nn}, it suffices to show that the diagonal elements of 𝚲nna+𝑰\bm{\Lambda}^{a}_{nn}+\bm{I} and 𝚲nnb\bm{\Lambda}^{b}_{nn} are non-negative, respectively.

First, the diagonal elements of 𝚲nnb\bm{\Lambda}^{b}_{nn} are given by

[𝚲nnb]ii=[(𝑲nnf𝑸nnf)𝑹g1]ii,1in,[\bm{\Lambda}_{nn}^{b}]_{ii}=[(\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f})\bm{R}_{g}^{-1}]_{ii},\quad 1\leq i\leq n,

where the diagonal elements of 𝑹g1\bm{R}_{g}^{-1} satisfy [𝑹g1]ii=e[𝚺g]ii/2[𝝁g]i>0[\bm{R}_{g}^{-1}]_{ii}=e^{[\bm{\Sigma}_{g}]_{ii}/2-[\bm{\mu}_{g}]_{i}}>0, 1in1\leq i\leq n; and the diagonal elements of 𝑲nnf𝑸nnf\bm{K}_{nn}^{f}-\bm{Q}_{nn}^{f} are the variances of training conditional p(𝒇|𝒇m)p(\bm{f}|\bm{f}_{m}). Therefore, 𝚲nnb\bm{\Lambda}_{nn}^{b} has non-negative diagonal elements.

Second, given 𝜷n=𝚺y1𝒚\bm{\beta}_{n}=\bm{\Sigma}_{y}^{-1}\bm{y}, the diagonal elements of 𝚲nna+𝑰\bm{\Lambda}^{a}_{nn}+\bm{I} are given by

[𝚲nna+𝑰]ii=[𝜷n𝜷n𝖳𝑹g𝚺y1𝑹g+𝑰]ii,1in.[\bm{\Lambda}_{nn}^{a}+\bm{I}]_{ii}=[\bm{\beta}_{n}\bm{\beta}_{n}^{\mathsf{T}}\bm{R}_{g}-\bm{\Sigma}_{y}^{-1}\bm{R}_{g}+\bm{I}]_{ii},\quad 1\leq i\leq n.

For 𝜷n𝜷n𝖳𝑹g\bm{\beta}_{n}\bm{\beta}_{n}^{\mathsf{T}}\bm{R}_{g}, the diagonal elements are non-negative. For 𝑰𝚺y1𝑹g\bm{I}-\bm{\Sigma}_{y}^{-1}\bm{R}_{g}, given the Cholesky decomposition 𝑲Λf=𝑳Λf(𝑳Λf)𝖳\bm{K}_{\Lambda}^{f}=\bm{L}_{\Lambda}^{f}(\bm{L}_{\Lambda}^{f})^{\mathsf{T}}, we have

𝑰𝚺y1𝑹g=\displaystyle\bm{I}-\bm{\Sigma}_{y}^{-1}\bm{R}_{g}= [𝑹g1𝑲nmf(𝑲Λf)1𝑲mnf𝑹g1]𝑹g\displaystyle[\bm{R}_{g}^{-1}\bm{K}_{nm}^{f}(\bm{K}_{\Lambda}^{f})^{-1}\bm{K}_{mn}^{f}\bm{R}_{g}^{-1}]\bm{R}_{g}
=\displaystyle= [𝑹g1𝑲nmf(𝑳Λf)𝖳(𝑳Λf)1𝑲mnf𝑹g1]𝑹g,\displaystyle[\bm{R}_{g}^{-1}\bm{K}_{nm}^{f}(\bm{L}_{\Lambda}^{f})^{-\mathsf{T}}(\bm{L}_{\Lambda}^{f})^{-1}\bm{K}_{mn}^{f}\bm{R}_{g}^{-1}]\bm{R}_{g},

indicating that the diagonal elements must be non-negative.

Hence, from the foregoing discussions, we know that 𝚲nn\bm{\Lambda}_{nn} is a non-negative diagonal matrix.
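As a quick numerical sanity check of this result, the sketch below builds the involved quantities from random data and verifies that the diagonal of \bm{\Lambda}_{nn} is non-negative. It is only an illustration, under the assumptions that \bm{\Sigma}_{y}=\bm{Q}_{nn}^{f}+\bm{R}_{g} with a diagonal, positive \bm{R}_{g} and that an RBF kernel is used; the random quantities are not taken from the paper's experiments.

import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 10

def rbf(A, B, ell=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / ell**2)

X, Xm = rng.normal(size=(n, 2)), rng.normal(size=(m, 2))
Knn, Knm, Kmm = rbf(X, X), rbf(X, Xm), rbf(Xm, Xm) + 1e-8 * np.eye(m)
Qnn = Knm @ np.linalg.solve(Kmm, Knm.T)       # Nystrom approximation Q_nn^f
Rg = np.diag(np.exp(rng.normal(size=n)))      # diagonal heteroscedastic variances
Sigma_y = Qnn + Rg
y = rng.normal(size=(n, 1))
beta = np.linalg.solve(Sigma_y, y)            # beta_n = Sigma_y^{-1} y

Lambda_a = beta @ beta.T @ Rg - np.linalg.solve(Sigma_y, Rg)
Lambda_b = (Knn - Qnn) @ np.linalg.inv(Rg)
Lambda = 0.5 * (Lambda_a + Lambda_b + np.eye(n))
assert np.all(np.diag(Lambda) >= -1e-8)       # non-negative up to round-off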

Appendix B Derivatives of FVF_{V} w.r.t. hyperparameters

Let 𝝀n=log(𝚲nn𝟏)\bm{\lambda}_{n}=\log(\bm{\Lambda}_{nn}\bm{1}) collect the nn variational parameters in log form to ensure non-negativity. The derivatives of FVF_{V} w.r.t. 𝝀n\bm{\lambda}_{n} are

FV𝝀n=\displaystyle\frac{\partial F_{V}}{\partial\bm{\lambda}_{n}}= 𝚲nn[12(𝑸nng+12𝑨nn)𝚲nnab𝟏+14𝑨nn𝟏\displaystyle\bm{\Lambda}_{nn}\left[\frac{1}{2}(\bm{Q}_{nn}^{g}+\frac{1}{2}\bm{A}_{nn})\bm{\Lambda}_{nn}^{ab}\bm{1}+\frac{1}{4}\bm{A}_{nn}\bm{1}\right.
12𝑨nn𝚲nn𝟏𝝁g+μ0𝟏],\displaystyle-\left.\frac{1}{2}\bm{A}_{nn}\bm{\Lambda}_{nn}\bm{1}-\bm{\mu}_{g}+\mu_{0}\bm{1}\right],

where 𝑨nn=(𝑲nug𝑲Λ1𝑲ung)2\bm{A}_{nn}=(\bm{K}_{nu}^{g}\bm{K}_{\Lambda}^{-1}\bm{K}_{un}^{g})^{\odot 2}, and the operator 2{\odot 2} denotes the element-wise square.

The derivatives of FVF_{V} w.r.t. the kernel parameters 𝜽f={θif}\bm{\theta}^{f}=\{\theta_{i}^{f}\} are

FVθif=12Tr[(𝜷n𝜷n𝖳𝚺y1+𝑹g1)𝑸nnfθif𝑹g1𝑲nnfθif].\frac{\partial F_{V}}{\partial\theta^{f}_{i}}=\frac{1}{2}\mathrm{Tr}\left[(\bm{\beta}_{n}\bm{\beta}_{n}^{\mathsf{T}}-\bm{\Sigma}_{y}^{-1}+\bm{R}_{g}^{-1})\frac{\partial\bm{Q}_{nn}^{f}}{\partial\theta^{f}_{i}}-\bm{R}_{g}^{-1}\frac{\partial\bm{K}_{nn}^{f}}{\partial\theta^{f}_{i}}\right].

Similarly, the derivatives of FVF_{V} w.r.t. the kernel parameters 𝜽g={θig}\bm{\theta}^{g}=\{\theta_{i}^{g}\} are

FVθig=\displaystyle\frac{\partial F_{V}}{\partial\theta^{g}_{i}}= 12Tr[𝝁gθig(𝟏𝖳𝚲nnab)12(𝚲nnab+𝑰)𝚺gθig]\displaystyle\frac{1}{2}\mathrm{Tr}\left[\frac{\partial\bm{\mu}_{g}}{\partial\theta^{g}_{i}}(\bm{1}^{\mathsf{T}}\bm{\Lambda}_{nn}^{ab})-\frac{1}{2}(\bm{\Lambda}_{nn}^{ab}+\bm{I})\frac{\partial\bm{\Sigma}_{g}}{\partial\theta^{g}_{i}}\right]
\displaystyle- 12Tr[𝑽uug𝑲uugθig(𝛀nug)𝖳𝚲nn𝛀nug𝚺uθig+2𝝁uθig𝜸u𝖳],\displaystyle\frac{1}{2}\mathrm{Tr}\left[\bm{V}_{uu}^{g}\frac{\partial\bm{K}_{uu}^{g}}{\partial\theta^{g}_{i}}-(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}\bm{\Lambda}_{nn}\bm{\Omega}_{nu}^{g}\frac{\partial\bm{\Sigma}_{u}}{\partial\theta^{g}_{i}}+2\frac{\partial\bm{\mu}_{u}}{\partial\theta^{g}_{i}}\bm{\gamma}_{u}^{\mathsf{T}}\right],

where 𝜸u=(𝑲uug)1(𝝁uμ0𝟏)\bm{\gamma}_{u}=(\bm{K}_{uu}^{g})^{-1}(\bm{\mu}_{u}-\mu_{0}\bm{1}) and 𝑽uug=(𝑲uug)1𝜸u𝜸u𝖳(𝑲uug)1𝚺u(𝑲uug)1\bm{V}_{uu}^{g}=(\bm{K}_{uu}^{g})^{-1}-\bm{\gamma}_{u}\bm{\gamma}_{u}^{\mathsf{T}}-(\bm{K}_{uu}^{g})^{-1}\bm{\Sigma}_{u}(\bm{K}_{uu}^{g})^{-1}.

The derivative of FVF_{V} w.r.t. the mean parameter μ0\mu_{0} of gg is

FVμ0=12Tr(𝚲nnab).\frac{\partial F_{V}}{\partial\mu_{0}}=\frac{1}{2}\mathrm{Tr}(\bm{\Lambda}_{nn}^{ab}).

Finally, we calculate the derivatives of FVF_{V} w.r.t. the inducing points 𝑿m\bm{X}_{m} and 𝑿u\bm{X}_{u}. Since the inducing points are involved in the kernel matrices, we get the derivatives 𝑲nmf/xijf\partial\bm{K}^{f}_{nm}/\partial x^{f}_{ij}, 𝑲mnf/xijf\partial\bm{K}^{f}_{mn}/\partial x^{f}_{ij}, 𝑲mmf/xijf\partial\bm{K}^{f}_{mm}/\partial x^{f}_{ij}, 𝑲nug/xijg\partial\bm{K}^{g}_{nu}/\partial x^{g}_{ij}, 𝑲ung/xijg\partial\bm{K}^{g}_{un}/\partial x^{g}_{ij}, and 𝑲uug/xijg\partial\bm{K}^{g}_{uu}/\partial x^{g}_{ij}, where xijf=[𝑿m]ijx^{f}_{ij}=[\bm{X}_{m}]_{ij} and xijg=[𝑿u]ijx^{g}_{ij}=[\bm{X}_{u}]_{ij}. We first obtain the derivatives of FVF_{V} w.r.t. 𝑿m\bm{X}_{m} as

FVxijf=2Tr[𝑲nmfxijf𝑨mnf]+Tr[𝑲mmfxijf𝑨mmf],\frac{\partial F_{V}}{\partial x^{f}_{ij}}=2\mathrm{Tr}\left[\frac{\partial\bm{K}^{f}_{nm}}{\partial x^{f}_{ij}}\bm{A}^{f}_{mn}\right]+\mathrm{Tr}\left[\frac{\partial\bm{K}^{f}_{mm}}{\partial x^{f}_{ij}}\bm{A}^{f}_{mm}\right],

where 𝑨mnf=0.5(𝛀nmf)𝖳(𝜷n𝜷n𝖳𝚺y1+𝑹g1)\bm{A}_{mn}^{f}=0.5(\bm{\Omega}_{nm}^{f})^{\mathsf{T}}(\bm{\beta}_{n}\bm{\beta}_{n}^{\mathsf{T}}-\bm{\Sigma}_{y}^{-1}+\bm{R}_{g}^{-1}), and 𝑨mmf=𝑨mnf𝛀nmf\bm{A}_{mm}^{f}=-\bm{A}_{mn}^{f}\bm{\Omega}_{nm}^{f}. Similarly, the derivatives of FVF_{V} w.r.t. 𝑿u\bm{X}_{u} are given by

FVxijg=Tr[𝑲nugxijg𝑨ung]+Tr[𝑨nug𝑲ungxijg]+Tr[𝑲uugxijg𝑨uug].\frac{\partial F_{V}}{\partial x^{g}_{ij}}=\mathrm{Tr}\left[\frac{\partial\bm{K}^{g}_{nu}}{\partial x^{g}_{ij}}\bm{A}^{g}_{un}\right]+\mathrm{Tr}\left[\bm{A}^{g}_{nu}\frac{\partial\bm{K}^{g}_{un}}{\partial x^{g}_{ij}}\right]+\mathrm{Tr}\left[\frac{\partial\bm{K}^{g}_{uu}}{\partial x^{g}_{ij}}\bm{A}^{g}_{uu}\right].

For 𝑨ung\bm{A}_{un}^{g} in FV/xijg\partial F_{V}/\partial x^{g}_{ij}, we have

𝑨ung=\displaystyle\bm{A}^{g}_{un}= 0.5(𝛀nug)𝖳(𝚲nn0.5𝑰)𝟏(𝟏𝖳𝚲nnab)𝑻1\displaystyle\underbrace{0.5(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}(\bm{\Lambda}_{nn}-0.5\bm{I})\bm{1}(\bm{1}^{\mathsf{T}}\bm{\Lambda}_{nn}^{ab})}_{\bm{T}_{1}}
+0.25(𝑯uu𝑲ung𝚲nn[𝛀nuΛ+𝛀nug]𝖳(𝚲nnab+𝑰))𝑻2\displaystyle+\underbrace{0.25\left(\bm{H}_{uu}\bm{K}_{un}^{g}\bm{\Lambda}_{nn}-[\bm{\Omega}_{nu}^{\Lambda}+\bm{\Omega}_{nu}^{g}]^{\mathsf{T}}(\bm{\Lambda}_{nn}^{ab}+\bm{I})\right)}_{\bm{T}_{2}}
+0.5(𝑱uu𝑲ung𝚲nn𝜸u𝟏𝖳(𝚲nn0.5𝑰))𝑻3.\displaystyle+\underbrace{0.5\left(\bm{J}_{uu}\bm{K}_{un}^{g}\bm{\Lambda}_{nn}-\bm{\gamma}_{u}\bm{1}^{\mathsf{T}}(\bm{\Lambda}_{nn}-0.5\bm{I})\right)}_{\bm{T}_{3}}.

where 𝑯uu=(𝛀nuΛ)𝖳(𝚲nnab+𝑰)𝛀nuΛ\bm{H}_{uu}=(\bm{\Omega}_{nu}^{\Lambda})^{\mathsf{T}}(\bm{\Lambda}_{nn}^{ab}+\bm{I})\bm{\Omega}_{nu}^{\Lambda}, and 𝑱uu=𝑲Λ1𝑲uug((𝑲uug)1𝚺u1)𝑲uug𝑲Λ1\bm{J}_{uu}=\bm{K}_{\Lambda}^{-1}\bm{K}_{uu}^{g}((\bm{K}_{uu}^{g})^{-1}-\bm{\Sigma}_{u}^{-1})\bm{K}_{uu}^{g}\bm{K}_{\Lambda}^{-1}. For 𝑨nug\bm{A}_{nu}^{g}, we have

𝑨nug=0.5(𝚲nn0.5𝑰)𝟏(𝟏𝖳𝚲nnab)𝛀nug+𝑻2𝖳+𝑻3𝖳.\bm{A}^{g}_{nu}=0.5(\bm{\Lambda}_{nn}-0.5\bm{I})\bm{1}(\bm{1}^{\mathsf{T}}\bm{\Lambda}_{nn}^{ab})\bm{\Omega}_{nu}^{g}+\bm{T}_{2}^{\mathsf{T}}+\bm{T}_{3}^{\mathsf{T}}.

For 𝑨uug\bm{A}_{uu}^{g}, we have

𝑨uug=\displaystyle\bm{A}_{uu}^{g}= 𝑻1𝛀nug0.25((𝛀nug)𝖳(𝚲nnab+𝑰)𝛀nug𝑯uu)\displaystyle-\bm{T}_{1}\bm{\Omega}_{nu}^{g}-0.25\left((\bm{\Omega}_{nu}^{g})^{\mathsf{T}}(\bm{\Lambda}_{nn}^{ab}+\bm{I})\bm{\Omega}_{nu}^{g}-\bm{H}_{uu}\right)
+0.5(𝑷uu+𝑷uu𝖳+𝑽uug𝑱uu),\displaystyle+0.5\left(\bm{P}_{uu}+\bm{P}_{uu}^{\mathsf{T}}+\bm{V}_{uu}^{g}-\bm{J}_{uu}\right),

where 𝑷uu=𝑲Λ1𝑲uug((𝑲uug)1𝚺u1)\bm{P}_{uu}=\bm{K}_{\Lambda}^{-1}\bm{K}_{uu}^{g}((\bm{K}_{uu}^{g})^{-1}-\bm{\Sigma}_{u}^{-1}).

The calculation of FV/xijf\partial F_{V}/\partial x^{f}_{ij} and FV/xijg\partial F_{V}/\partial x^{g}_{ij} requires a loop over the m×dm\times d and u×du\times d parameters of the inducing points, which is quite slow even for moderate mm, uu and dd. Fortunately, the derivative 𝑲/xijf(g)\partial\bm{K}/\partial x_{ij}^{f(g)} has only nn or mm (uu) non-zero elements. Owing to this sparsity, the derivative computations can be vectorized such that the derivatives w.r.t. all the inducing points along a specific dimension are obtained at once, as sketched below.
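As an illustration of this vectorized computation, the sketch below assembles, in a single pass, the only non-zero column of each derivative \partial\bm{K}_{nm}^{f}/\partial[\bm{X}_{m}]_{ij} along one input dimension j. It is a sketch only, assuming a squared-exponential kernel with lengthscale ell; the function names are hypothetical and do not come from the paper's Matlab code.

import numpy as np

def rbf(X, Z, s2=1.0, ell=1.0):
    # k(x, z) = s2 * exp(-||x - z||^2 / (2 ell^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * X @ Z.T
    return s2 * np.exp(-0.5 * d2 / ell**2)

def dKnm_dXm_dim(X, Xm, j, s2=1.0, ell=1.0):
    """Column i of the result is the only non-zero column of dK_nm/d[X_m]_{ij}."""
    K = rbf(X, Xm, s2, ell)
    diff_j = X[:, j][:, None] - Xm[:, j][None, :]   # (n, m) differences along dim j
    return K * diff_j / ell**2                      # d k(x_k, z_i) / d z_{i j}

The trace terms then collapse onto this single column: with D = dKnm_dXm_dim(X, Xm, j), the gradients w.r.t. all m inducing points along dimension j are obtained as np.einsum('ki,ik->i', D, A_mn), avoiding the explicit loop over i.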

Appendix C Natural gradients of q(𝒇m)q(\bm{f}_{m}) and q(𝒈u)q(\bm{g}_{u})

For exponential family distributions parameterized by natural parameters 𝜽\bm{\theta}, the probability density function (PDF) is p(𝒙)=h(𝒙)e𝜽𝖳𝒕(𝒙)A(𝜽)p(\bm{x})=h(\bm{x})e^{\bm{\theta}^{\mathsf{T}}\bm{t}(\bm{x})-A(\bm{\theta})}, where h(𝒙)h(\bm{x}) is the underlying measure, 𝒕(𝒙)\bm{t}(\bm{x}) is the sufficient statistic, and A(𝜽)A(\bm{\theta}) is the log normalizer; the expectation parameters are defined as 𝝍=𝔼p(𝒙)[𝒕(𝒙)]\bm{\psi}=\mathbb{E}_{p(\bm{x})}[\bm{t}(\bm{x})]. We update the natural parameters using natural gradients as

𝜽(t+1)=𝜽(t)γ(t)𝑮𝜽(t)1F𝜽(t)=𝜽(t)γ(t)F𝝍(t),\bm{\theta}_{(t+1)}=\bm{\theta}_{(t)}-\gamma_{(t)}\bm{G}_{\bm{\theta}_{(t)}}^{-1}\frac{\partial F}{\partial\bm{\theta}_{(t)}}=\bm{\theta}_{(t)}-\gamma_{(t)}\frac{\partial F}{\partial\bm{\psi}_{(t)}},

where FF is the objective function, and 𝑮𝜽=𝝍(t)/𝜽(t)\bm{G}_{\bm{\theta}}=\partial\bm{\psi}_{(t)}/\partial\bm{\theta}_{(t)} is the Fisher information matrix, with 𝝍\bm{\psi} being the expectation parameters of the exponential family distribution.

For q(𝒈u)=𝒩(𝒈u|𝝁u,𝚺u)q(\bm{g}_{u})=\mathcal{N}(\bm{g}_{u}|\bm{\mu}_{u},\bm{\Sigma}_{u}), its natural parameters are 𝜽\bm{\theta} which are partitioned into two components

𝜽1=𝚺u1𝝁u,𝚯2=12𝚺u1,\bm{\theta}_{1}=\bm{\Sigma}_{u}^{-1}\bm{\mu}_{u},\quad\bm{\Theta}_{2}=-\frac{1}{2}\bm{\Sigma}_{u}^{-1},

where 𝜽1\bm{\theta}_{1} comprises the first uu elements of 𝜽\bm{\theta}, and 𝚯2\mathbf{\Theta}_{2} comprises the remaining elements reshaped into a square matrix. Accordingly, the expectation parameters 𝝍\bm{\psi} are divided as

𝝍1=𝝁u,𝚿2=𝝁u𝝁u𝖳+𝚺u.\bm{\psi}_{1}=\bm{\mu}_{u},\quad\bm{\Psi}_{2}=\bm{\mu}_{u}\bm{\mu}_{u}^{\mathsf{T}}+\bm{\Sigma}_{u}.

Thereafter, we update the natural parameters with step γ(t)\gamma_{(t)} as

𝜽1(t+1)\displaystyle\bm{\theta}_{1_{(t+1)}} =𝚺u(t)1𝝁u(t)γ(t)F𝝍1(t),\displaystyle=\bm{\Sigma}_{u_{(t)}}^{-1}\bm{\mu}_{u_{(t)}}-\gamma_{(t)}\frac{\partial F}{\partial\bm{\psi}_{1_{(t)}}},
𝚯2(t+1)\displaystyle\bm{\Theta}_{2_{(t+1)}} =12𝚺u(t)1γ(t)F𝚿2(t),\displaystyle=-\frac{1}{2}\bm{\Sigma}_{u_{(t)}}^{-1}-\gamma_{(t)}\frac{\partial F}{\partial\bm{\Psi}_{2_{(t)}}},

where F/𝝍1(t)=F/𝝁u(t)\partial F/\partial\bm{\psi}_{1_{(t)}}=\partial F/\partial\bm{\mu}_{u_{(t)}} and F/𝚿2(t)=F/𝚺u(t)\partial F/\partial\bm{\Psi}_{2_{(t)}}=\partial F/\partial\bm{\Sigma}_{u_{(t)}}. The derivatives F/𝝁u\partial F/\partial\bm{\mu}_{u} and F/𝚺u\partial F/\partial\bm{\Sigma}_{u} are respectively expressed as

F𝝁u=\displaystyle\frac{\partial F}{\partial\bm{\mu}_{u}}= 12(𝛀nug)𝖳𝚲nnab𝟏(𝑲uug)1(𝝁uμ0𝟏),\displaystyle\frac{1}{2}(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}\bm{\Lambda}_{nn}^{a^{\prime}b}\bm{1}-(\bm{K}_{uu}^{g})^{-1}(\bm{\mu}_{u}-\mu_{0}\bm{1}),
F𝚺u=\displaystyle\frac{\partial F}{\partial\bm{\Sigma}_{u}}= 14(𝛀nug)𝖳(𝚲nnab+𝑰)𝛀nug+12[𝚺u1(𝑲uug)1],\displaystyle-\frac{1}{4}(\bm{\Omega}_{nu}^{g})^{\mathsf{T}}(\bm{\Lambda}_{nn}^{a^{\prime}b}+\bm{I})\bm{\Omega}_{nu}^{g}+\frac{1}{2}[\bm{\Sigma}_{u}^{-1}-(\bm{K}_{uu}^{g})^{-1}],

where 𝚲nnab=𝚲nna+𝚲nnb\bm{\Lambda}_{nn}^{a^{\prime}b}=\bm{\Lambda}_{nn}^{a^{\prime}}+\bm{\Lambda}_{nn}^{b}, and 𝚲nna\bm{\Lambda}_{nn}^{a^{\prime}} is a diagonal matrix with the diagonal element being

[𝚲nna]ii=[𝑹g1(𝒚𝛀nmf𝝁m)(𝒚𝛀nmf𝝁m)𝖳𝑰]ii.[\bm{\Lambda}_{nn}^{a^{\prime}}]_{ii}=[\bm{R}_{g}^{-1}(\bm{y}-\bm{\Omega}_{nm}^{f}\bm{\mu}_{m})(\bm{y}-\bm{\Omega}_{nm}^{f}\bm{\mu}_{m})^{\mathsf{T}}-\bm{I}]_{ii}.

For q(𝒇m)=𝒩(𝒇m|𝝁m,𝚺m)q(\bm{f}_{m})=\mathcal{N}(\bm{f}_{m}|\bm{\mu}_{m},\bm{\Sigma}_{m}), the updates of 𝝁m(t+1)\bm{\mu}_{m_{(t+1)}} and 𝚺m(t+1)\bm{\Sigma}_{m_{(t+1)}} follow the foregoing steps, with the derivatives F/𝝁m\partial F/\partial\bm{\mu}_{m} and F/𝚺m\partial F/\partial\bm{\Sigma}_{m} given by (19).
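To summarize the updates of this appendix operationally, the sketch below performs one natural-gradient step for a Gaussian variational posterior. It is illustrative only: the callables grad_psi1 and grad_Psi2 are hypothetical placeholders returning the gradients of the objective w.r.t. the expectation parameters, and the descent direction follows the update written above (for ascent on a lower bound, the sign of the step is flipped).

import numpy as np

# One natural-gradient update for q(g_u) = N(mu, Sigma): move the natural
# parameters along the gradient taken w.r.t. the expectation parameters.
def natural_gradient_step(mu, Sigma, grad_psi1, grad_Psi2, gamma=0.1):
    Sigma_inv = np.linalg.inv(Sigma)

    # Natural parameters: theta1 = Sigma^{-1} mu, Theta2 = -0.5 * Sigma^{-1}.
    theta1 = Sigma_inv @ mu
    Theta2 = -0.5 * Sigma_inv

    theta1_new = theta1 - gamma * grad_psi1(mu, Sigma)
    Theta2_new = Theta2 - gamma * grad_Psi2(mu, Sigma)

    # Map back to the moment parameterization: Sigma = -0.5 Theta2^{-1}, mu = Sigma theta1.
    Sigma_new = -0.5 * np.linalg.inv(Theta2_new)
    mu_new = Sigma_new @ theta1_new
    return mu_new, Sigma_new

In practice the step size is kept small enough that Theta2_new remains negative definite, so that Sigma_new stays a valid covariance matrix.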

Acknowledgment

This work is funded by the National Research Foundation, Singapore under its AI Singapore programme [Award No.: AISG-RP-2018-004], the Data Science and Artificial Intelligence Research Center (DSAIR) at Nanyang Technological University and supported under the Rolls-Royce@NTU Corporate Lab.

References

  • [1] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning. MIT Press, 2006.
  • [2] N. Lawrence, “Probabilistic non-linear principal component analysis with Gaussian process latent variable models,” Journal of Machine Learning Research, vol. 6, no. Nov, pp. 1783–1816, 2005.
  • [3] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, “Taking the human out of the loop: A review of Bayesian optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
  • [4] H. Liu, J. Cai, and Y. Ong, “Remarks on multi-output Gaussian process regression,” Knowledge-Based Systems, vol. 144, no. March, pp. 102–121, 2018.
  • [5] B. Settles, “Active learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, pp. 1–114, 2012.
  • [6] P. Kou, D. Liang, L. Gao, and J. Lou, “Probabilistic electricity price forecasting with variational heteroscedastic Gaussian process and active learning,” Energy Conversion and Management, vol. 89, pp. 298–308, 2015.
  • [7] M. Lázaro-Gredilla, M. K. Titsias, J. Verrelst, and G. Camps-Valls, “Retrieval of biophysical parameters with heteroscedastic Gaussian processes,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 4, pp. 838–842, 2014.
  • [8] I. A. Almosallam, M. J. Jarvis, and S. J. Roberts, “GPz: non-stationary sparse Gaussian processes for heteroscedastic uncertainty estimation in photometric redshifts,” Monthly Notices of the Royal Astronomical Society, vol. 462, no. 1, pp. 726–739, 2016.
  • [9] M. Bauza and A. Rodriguez, “A probabilistic data-driven model for planar pushing,” in International Conference on Robotics and Automation, 2017, pp. 3008–3015.
  • [10] A. J. Smith, M. AlAbsi, and T. Fields, “Heteroscedastic Gaussian process-based system identification and predictive control of a quadcopter,” in AIAA Atmospheric Flight Mechanics Conference, 2018, p. 0298.
  • [11] J. Kirschner and A. Krause, “Information directed sampling and bandits with heteroscedastic noise,” in Proceedings of the 31st Conference On Learning Theory, 2018, pp. 358–384.
  • [12] A. Kendall and Y. Gal, “What uncertainties do we need in Bayesian deep learning for computer vision?” in Advances in Neural Information Processing Systems, 2017, pp. 5574–5584.
  • [13] S. Urban, M. Ludersdorfer, and P. Van Der Smagt, “Sensor calibration and hysteresis compensation with heteroscedastic Gaussian processes,” IEEE Sensors Journal, vol. 15, no. 11, pp. 6498–6506, 2015.
  • [14] F. C. Pereira, C. Antoniou, J. A. Fargas, and M. Ben-Akiva, “A metamodel for estimating error bounds in real-time traffic prediction systems,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 3, pp. 1310–1322, 2014.
  • [15] E. Snelson and Z. Ghahramani, “Variable noise and dimensionality reduction for sparse Gaussian processes,” in Uncertainty in Artificial Intelligence, 2006, pp. 461–468.
  • [16] P. W. Goldberg, C. K. Williams, and C. M. Bishop, “Regression with input-dependent noise: A Gaussian process treatment,” in Advances in Neural Information Processing Systems, 1998, pp. 493–499.
  • [17] L. Muñoz-González, M. Lázaro-Gredilla, and A. R. Figueiras-Vidal, “Divisive Gaussian processes for nonstationary regression,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 11, pp. 1991–2003, 2014.
  • [18] L. Munoz-Gonzalez, M. Lázaro-Gredilla, and A. R. Figueiras-Vidal, “Laplace approximation for divisive Gaussian processes for nonstationary regression,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 618–624, 2016.
  • [19] A. D. Saul, J. Hensman, A. Vehtari, and N. D. Lawrence, “Chained Gaussian processes,” in Artificial Intelligence and Statistics, 2016, pp. 1431–1440.
  • [20] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard, “Most likely heteroscedastic Gaussian process regression,” in International Conference on Machine Learning, 2007, pp. 393–400.
  • [21] N. Quadrianto, K. Kersting, M. D. Reid, T. S. Caetano, and W. L. Buntine, “Kernel conditional quantile estimation via reduction revisited,” in International Conference on Data Mining, 2009, pp. 938–943.
  • [22] M. Heinonen, H. Mannerström, J. Rousu, S. Kaski, and H. Lähdesmäki, “Non-stationary Gaussian process regression with Hamiltonian Monte Carlo,” in Artificial Intelligence and Statistics, 2016, pp. 732–740.
  • [23] M. Binois, R. B. Gramacy, and M. Ludkovski, “Practical heteroscedastic Gaussian process modeling for large simulation experiments,” Journal of Computational and Graphical Statistics, vol. 27, no. 4, pp. 808–821, 2018.
  • [24] M. K. Titsias and M. Lázaro-Gredilla, “Variational heteroscedastic Gaussian process regression,” in International Conference on Machine Learning, 2011, pp. 841–848.
  • [25] M. Menictas and M. P. Wand, “Variational inference for heteroscedastic semiparametric regression,” Australian & New Zealand Journal of Statistics, vol. 57, no. 1, pp. 119–138, 2015.
  • [26] L. Muñoz-González, M. Lázaro-Gredilla, and A. R. Figueiras-Vidal, “Heteroscedastic Gaussian process regression using expectation propagation,” in Machine Learning for Signal Processing, 2011, pp. 1–6.
  • [27] V. Tolvanen, P. Jylänki, and A. Vehtari, “Expectation propagation for nonstationary heteroscedastic Gaussian process regression,” in Machine Learning for Signal Processing, 2014, pp. 1–6.
  • [28] M. Hartmann and J. Vanhatalo, “Laplace approximation and natural gradient for Gaussian process regression with heteroscedastic student-t model,” Statistics and Computing, vol. 29, no. 4, pp. 753–773, 2019.
  • [29] H. Liu, Y.-S. Ong, X. Shen, and J. Cai, “When Gaussian process meets big data: A review of scalable GPs,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–19, 2020.
  • [30] J. Quiñonero-Candela and C. E. Rasmussen, “A unifying view of sparse approximate Gaussian process regression,” Journal of Machine Learning Research, vol. 6, no. Dec, pp. 1939–1959, 2005.
  • [31] M. Titsias, “Variational learning of inducing variables in sparse Gaussian processes,” in Artificial Intelligence and Statistics, 2009, pp. 567–574.
  • [32] Y. Gal, M. van der Wilk, and C. E. Rasmussen, “Distributed variational inference in sparse Gaussian process regression and latent variable models,” in Advances in Neural Information Processing Systems, 2014, pp. 3257–3265.
  • [33] T. N. Hoang, Q. M. Hoang, and B. K. H. Low, “A distributed variational inference framework for unifying parallel sparse Gaussian process regression models.” in International Conference on Machine Learning, 2016, pp. 382–391.
  • [34] J. Hensman, N. Fusi, and N. D. Lawrence, “Gaussian processes for big data,” in Uncertainty in Artificial Intelligence, 2013, pp. 282–290.
  • [35] T. N. Hoang, Q. M. Hoang, and B. K. H. Low, “A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data,” in International Conference on Machine Learning, 2015, pp. 569–578.
  • [36] A. Wilson and H. Nickisch, “Kernel interpolation for scalable structured Gaussian processes (KISS-GP),” in International Conference on Machine Learning, 2015, pp. 1775–1784.
  • [37] T. D. Bui and R. E. Turner, “Tree-structured Gaussian process approximations,” in Advances in Neural Information Processing Systems, 2014, pp. 2213–2221.
  • [38] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
  • [39] V. Tresp, “A Bayesian committee machine,” Neural Computation, vol. 12, no. 11, pp. 2719–2741, 2000.
  • [40] M. P. Deisenroth and J. W. Ng, “Distributed Gaussian processes,” in International Conference on Machine Learning, 2015, pp. 1481–1490.
  • [41] H. Liu, J. Cai, Y. Wang, and Y.-S. Ong, “Generalized robust Bayesian committee machine for large-scale Gaussian process regression,” in International Conference on Machine Learning, 2018, pp. 3137–3146.
  • [42] H. Liu, J. Cai, Y.-S. Ong, and Y. Wang, “Understanding and comparing scalable Gaussian process regression for big data,” Knowledge-Based Systems, vol. 164, pp. 324–335, 2019.
  • [43] E. Snelson and Z. Ghahramani, “Local and global sparse Gaussian process approximations,” in Artificial Intelligence and Statistics, 2007, pp. 524–531.
  • [44] J. Vanhatalo and A. Vehtari, “Modelling local and global phenomena with sparse Gaussian processes,” in Uncertainty in Artificial Intelligence, 2008, pp. 571–578.
  • [45] R. Ouyang and K. H. Low, “Gaussian process decentralized data fusion meets transfer learning in large-scale distributed cooperative perception,” in AAAI Conference on Artificial Intelligence, 2018.
  • [46] K. Chalupka, C. K. Williams, and I. Murray, “A framework for evaluating approximation methods for Gaussian process regression,” Journal of Machine Learning Research, vol. 14, no. Feb, pp. 333–350, 2013.
  • [47] M. E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. 1, no. Jun, pp. 211–244, 2001.
  • [48] C. E. Rasmussen and J. Quinonero-Candela, “Healing the relevance vector machine through augmentation,” in International Conference on Machine Learning, 2005, pp. 689–696.
  • [49] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudo-inputs,” in Advances in Neural Information Processing Systems, 2006, pp. 1257–1264.
  • [50] M. Bauer, M. van der Wilk, and C. E. Rasmussen, “Understanding probabilistic sparse Gaussian process approximations,” in Advances in Neural Information Processing Systems, 2016, pp. 1533–1541.
  • [51] H. Yu, T. Nghia, B. K. H. Low, and P. Jaillet, “Stochastic variational inference for Bayesian sparse Gaussian process regression,” in International Joint Conference on Neural Networks, 2019, pp. 1–8.
  • [52] D. G. Matthews, G. Alexander, M. Van Der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman, “GPflow: A Gaussian process library using TensorFlow,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 1299–1304, 2017.
  • [53] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation, 2016, pp. 265–283.
  • [54] C. E. Rasmussen and H. Nickisch, “Gaussian processes for machine learning (GPML) toolbox,” Journal of Machine Learning Research, vol. 11, no. Nov, pp. 3011–3015, 2010.
  • [55] S. Sun, “A review of deterministic approximate inference techniques for Bayesian machine learning,” Neural Computing and Applications, vol. 23, no. 7-8, pp. 2039–2050, 2013.
  • [56] J. Hensman and N. D. Lawrence, “Nested variational compression in deep Gaussian processes,” arXiv preprint arXiv:1412.1370, 2014.
  • [57] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [58] B. Ingram and D. Cornford, “Parallel geostatistics for sparse and dense datasets,” in GeoENV VII–Geostatistics for Environmental Applications. Springer, 2010, pp. 371–381.
  • [59] H. Salimbeni, S. Eleftheriadis, and J. Hensman, “Natural gradients in practice: Non-conjugate variational inference in Gaussian process models,” in International Conference on Artificial Intelligence and Statistics, 2018, pp. 689–697.
  • [60] J. Hensman, N. Durrande, A. Solin et al., “Variational fourier features for Gaussian processes.” Journal of Machine Learning Research, vol. 18, pp. 1–52, 2018.
  • [61] D. Dheeru and E. Karra Taniskidou, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
  • [62] M. Kaul, B. Yang, and C. S. Jensen, “Building accurate 3D spatial networks to enable next generation intelligent transportation systems,” in International Conference on Mobile Data Management, vol. 1, 2013, pp. 137–146.
  • [63] C. Wang and R. M. Neal, “Gaussian process regression with heteroscedastic or non-Gaussian residuals,” arXiv preprint arXiv:1212.6246, 2012.
  • [64] V. Dutordoir, H. Salimbeni, J. Hensman, and M. Deisenroth, “Gaussian process conditional density estimation,” in Advances in Neural Information Processing Systems, 2018, pp. 2385–2395.
  • [65] M. Jankowiak, G. Pleiss, and J. R. Gardner, “Sparse Gaussian process regression beyond variational inference,” arXiv preprint arXiv:1910.07123, 2019.
  • [66] C. Ma, Y. Li, and J. M. Hernandez-Lobato, “Variational implicit processes,” in International Conference on Machine Learning, 2019, pp. 4222–4233.
  • [67] T. Nguyen and E. Bonilla, “Fast allocation of Gaussian process experts,” in International Conference on Machine Learning, 2014, pp. 145–153.
  • [68] D. Wu and J. Ma, “A two-layer mixture model of Gaussian process functional regressions and its MCMC EM algorithm,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4894–4904, 2018.
  • [69] S. Remes, M. Heinonen, and S. Kaski, “Non-stationary spectral kernels,” in Advances in Neural Information Processing Systems, 2017, pp. 4642–4651.
Haitao Liu received the Ph.D. degree from the School of Energy and Power Engineering, Dalian University of Technology, Dalian, China, in 2016. He is currently a Research Fellow with the Rolls-Royce@NTU Corp Laboratory, Nanyang Technological University, Singapore. His current research interests include multi-task learning, large-scale Gaussian process, active learning, and optimization.
Yew-Soon Ong (M’99-SM’12-F’18) received the Ph.D. degree in artificial intelligence in complex design from the University of Southampton, U.K., in 2003. He is a President’s Chair Professor in Computer Science at the Nanyang Technological University (NTU), and holds the position of Chief Artificial Intelligence Scientist at the Agency for Science, Technology and Research Singapore. At NTU, he serves as Director of the Data Science and Artificial Intelligence Research Center and Director of the Singtel-NTU Cognitive & Artificial Intelligence Joint Lab. His research interest is in artificial and computational intelligence. He is founding Editor-in-Chief of the IEEE Transactions on Emerging Topics in Computational Intelligence, Technical Editor-in-Chief of Memetic Computing and Associate Editor of IEEE Transactions on Neural Networks & Learning Systems, the IEEE Transactions on Cybernetics, IEEE Transactions on Evolutionary Computation and others. He has received several IEEE outstanding paper awards, and is listed as a Thomson Reuters highly cited researcher and among the World’s Most Influential Scientific Minds.
Jianfei Cai (S’98–M’02–SM’07) is a Professor at the Faculty of IT, Monash University, where he currently serves as the Head of the Data Science & AI Department. Before that, he was a full professor, a cluster deputy director of the Data Science & AI Research center (DSAIR), Head of the Visual and Interactive Computing Division and Head of the Computer Communications Division at Nanyang Technological University (NTU). His major research interests include computer vision, multimedia and deep learning. He has published more than 200 technical papers in international conferences and journals. He is a co-recipient of paper awards in ACCV, ICCM, IEEE ICIP and MMSP. He has served as an Associate Editor for IEEE T-IP, T-MM, T-CSVT and Visual Computer, as well as an Area Chair for ICCV, ECCV, ACM Multimedia, ICME and ICIP. He was the Chair of the IEEE CAS VSPC-TC during 2016-2018. He also served as the leading TPC Chair for IEEE ICME 2012 and the best paper award committee co-chair for IEEE T-MM 2019.