
Deconditional Downscaling with Gaussian Processes

Siu Lun Chau
University of Oxford
Shahine Bouabid
University of Oxford
Dino Sejdinovic
University of Oxford
Siu Lun Chau and Shahine Bouabid contributed equally. Department of Statistics, Oxford, UK, OX1 3LB. <siu.chau@stats.ox.ac.uk, shahine.bouabid@stats.ox.ac.uk, dino.sejdinovic@stats.ox.ac.uk>
Abstract

Refining low-resolution (LR) spatial fields with high-resolution (HR) information, often known as statistical downscaling, is challenging as the diversity of spatial datasets often prevents direct matching of observations. Yet, when LR samples are modeled as aggregate conditional means of HR samples with respect to a mediating variable that is globally observed, the recovery of the underlying fine-grained field can be framed as taking an “inverse” of the conditional expectation, namely a deconditioning problem. In this work, we propose a Bayesian formulation of deconditioning which naturally recovers the initial reproducing kernel Hilbert space formulation from Hsu and Ramos [1]. We extend deconditioning to a downscaling setup and devise an efficient conditional mean embedding estimator for multiresolution data. By treating conditional expectations as inter-domain features of the underlying field, a posterior for the latent field can be established as a solution to the deconditioning problem. Furthermore, we show that this solution can be viewed as a two-staged vector-valued kernel ridge regressor and that it has a minimax optimal convergence rate under mild assumptions. Lastly, we demonstrate its proficiency in a synthetic and a real-world atmospheric field downscaling problem, showing substantial improvements over existing methods.

1 Introduction

Spatial observations often operate at limited resolution due to practical constraints. For example, remote sensing atmosphere products [2, 3, 4, 5] provide measurements of atmospheric properties such as cloud top temperature and optical thickness, but only at low resolution. Devising methods to refine low-resolution (LR) variables for local-scale analysis thus plays a crucial part in our understanding of the anthropogenic impact on climate.

When high-resolution (HR) observations of different covariates are available, details can be instilled into the LR field for refinement. This task is referred to as statistical downscaling or spatial disaggregation and models LR observations as the aggregation of an unobserved underlying HR field. For example, multispectral optical satellite imagery [6, 7] typically comes at higher resolution than atmospheric products and can be used to refine the latter.

Statistical downscaling has been studied in various forms, notably with a probabilistic treatment [8, 9, 10, 11, 12, 13], in which Gaussian processes (GP) [14] are typically used in conjunction with a sparse variational formulation [15] to recover the underlying unobserved HR field. Our approach follows this line of research, where no data from the underlying HR groundtruth field is observed. Deep neural network (DNN) based approaches [16, 17, 18], on the other hand, study this problem in a different setting, where matched HR and LR observations are assumed to be available for training; they then follow a standard supervised learning approach and learn a mapping between resolutions.

However, both lines of existing methods require access to bags of HR covariates that are paired with aggregated targets, which in practice might be infeasible. For example, the multitude of satellites in orbit not only collect snapshots of the atmosphere at different resolutions, but also from different places and at different times, such that these observations are not jointly observed. To overcome this limitation, we propose to consider a more flexible mediated statistical downscaling setup that only requires indirect matching between LR and HR covariates through a mediating field. We assume that this additional field can be easily observed, and matched separately with both HR covariates and aggregate LR targets. We then use this third-party field to mediate learning and downscale our unmatched data. In our motivating application, climate simulations [19, 20, 21] based on physical science can serve as a mediating field since they provide a comprehensive spatiotemporal coverage of meteorological variables that can be matched to both LR and HR covariates.

Formally, let ${}^{b}\!{\boldsymbol{x}}=\{x^{(1)},\ldots,x^{(n)}\}\subset{\mathcal{X}}$ be a general notation for bags of HR covariates, $f:{\mathcal{X}}\rightarrow{\mathbb{R}}$ the field of interest we wish to recover and $\tilde{z}$ the LR aggregate observations from the field $f$. We suppose that ${}^{b}\!{\boldsymbol{x}}$ and $\tilde{z}$ are unmatched, but that there exist mediating covariates $y,\tilde{y}\in{\mathcal{Y}}$, such that $({}^{b}\!{\boldsymbol{x}},y)$ are jointly observed and likewise for $(\tilde{y},\tilde{z})$, as illustrated in Figure 1. We assume the aggregation observation model $\tilde{z}={\mathbb{E}}_{X}[f(X)|Y=\tilde{y}]+\varepsilon$ with some noise $\varepsilon$. Our goal in mediated statistical downscaling is then to estimate $f$ given $({}^{b}\!{\boldsymbol{x}},y)$ and $(\tilde{y},\tilde{z})$, which corresponds to the deconditioning problem introduced in [1].

Figure 1: The LR response $\tilde{z}$ (blue) and the bag HR covariates ${}^{b}\!{\boldsymbol{x}}$ (green) are unmatched. The downscaling is mediated through bag-level LR covariates $y$ and $\tilde{y}$ (orange).

Motivated by applications in likelihood-free inference and task-transfer regression, Hsu and Ramos [1] first studied the deconditioning problem through the lens of reproducing kernel Hilbert space (RKHS) and introduced the framework of Deconditional Mean Embeddings (DME) as its solution.

In this work, we first propose a Bayesian formulation of deconditioning that results in a simpler and more elegant way to arrive at the DME-based estimator of Hsu and Ramos [1], using the conditional expectations of $f$. Motivated by our mediated statistical downscaling problem, we then extend deconditioning to a multi-resolution setup and bridge the gap between DMEs and existing probabilistic statistical downscaling methods [9]. By placing a GP prior on the sought field $f$, we obtain a posterior distribution of the downscaled field as a principled Bayesian solution to the downscaling task on indirectly matched data. For scalability, we provide a tractable variational inference approximation and an alternative approximation to the conditional mean operator (CMO) [22] to speed up computation for large multi-resolution datasets.

From a theoretical standpoint, we further develop the framework of DMEs by establishing it as a two-staged vector-valued regressor with a natural reconstruction loss that mirrors Grünewälder et al. [23]’s work on conditional mean embeddings. This perspective allows us to leverage distribution regression theory from [24, 25] and obtain convergence rate results for the deconditional mean operator (DMO) estimator. Under mild assumptions, we obtain conditions under which this rate is minimax optimal in terms of statistical-computational efficiency.

Our contributions are summarized as follows:

  • We propose a Bayesian formulation of the deconditioning problem of Hsu and Ramos [1]. We establish its posterior mean as a DME-based estimate and its posterior covariance as a gauge of the deconditioning quality.

  • We extend deconditioning to a multi-resolution setup in the context of the mediated statistical downscaling problem and bridge the gap with existing probabilistic statistical downscaling methods. Computationally efficient algorithms are devised.

  • We demonstrate that the DMO estimate minimises a two-staged vector-valued regression and derive its convergence rate under mild assumptions, with conditions for minimax optimality.

  • We benchmark our model against existing methods for statistical downscaling tasks in climate science, on both synthetic and real-world multi-resolution atmospheric fields data, and show improved performance.

2 Background Materials

2.1 Notations

Let $X,Y$ be a pair of random variables taking values in non-empty sets ${\mathcal{X}}$ and ${\mathcal{Y}}$, respectively. Let $k:{\mathcal{X}}\times{\mathcal{X}}\rightarrow{\mathbb{R}}$ and $\ell:{\mathcal{Y}}\times{\mathcal{Y}}\rightarrow{\mathbb{R}}$ be positive definite kernels. The closure of the span of their canonical feature maps $k_{x}:=k(x,\cdot)$ and $\ell_{y}:=\ell(y,\cdot)$ for $x\in{\mathcal{X}}$ and $y\in{\mathcal{Y}}$ respectively induces RKHSs ${\mathcal{H}}_{k}\subseteq{\mathbb{R}}^{\mathcal{X}}$ and ${\mathcal{H}}_{\ell}\subseteq{\mathbb{R}}^{\mathcal{Y}}$, endowed with inner products $\langle\cdot,\cdot\rangle_{k}$ and $\langle\cdot,\cdot\rangle_{\ell}$.

We observe realizations ${}^{b}\!{\mathcal{D}}_{1}=\{{}^{b}\!{\boldsymbol{x}}_{j},y_{j}\}_{j=1}^{N}$ of bags ${}^{b}\!{\boldsymbol{x}}_{j}=\{x_{j}^{(i)}\}_{i=1}^{n_{j}}$ from the conditional distribution ${\mathbb{P}}_{X|Y=y_{j}}$, with bag-level covariates $y_{j}$ sampled from ${\mathbb{P}}_{Y}$. We concatenate them into vectors ${\bf x}:=\begin{bmatrix}{}^{b}\!{\boldsymbol{x}}_{1}&\ldots&{}^{b}\!{\boldsymbol{x}}_{N}\end{bmatrix}^{\top}$ and ${\bf y}:=\begin{bmatrix}y_{1}&\ldots&y_{N}\end{bmatrix}^{\top}$.

For simplicity, our exposition uses the notation without bagging – i.e. ${\mathcal{D}}_{1}=\{x_{j},y_{j}\}_{j=1}^{N}$ – in Sections 2, 3 and 4, where our contributions are relevant to the general theory of deconditioning from Hsu and Ramos [1]. We return to the bagged dataset formalism in Sections 3.3 and 5, which correspond to our motivating application of mediated statistical downscaling.

With an abuse of notation, we define feature matrices by stacking feature maps along columns as ${\bf\Phi}_{\bf x}:=\begin{bmatrix}k_{x_{1}}&\ldots&k_{x_{N}}\end{bmatrix}$ and ${\bf\Psi}_{\bf y}:=\begin{bmatrix}\ell_{y_{1}}&\ldots&\ell_{y_{N}}\end{bmatrix}$, and we denote Gram matrices as ${\bf K}_{{\bf x}{\bf x}}={\bf\Phi}_{\bf x}^{\top}{\bf\Phi}_{\bf x}=[k(x_{i},x_{j})]_{1\leq i,j\leq N}$ and ${\bf L}_{{\bf y}{\bf y}}={\bf\Psi}_{\bf y}^{\top}{\bf\Psi}_{\bf y}=[\ell(y_{i},y_{j})]_{1\leq i,j\leq N}$. The notation abuse $(\cdot)^{\top}(\cdot)$ is a shorthand for elementwise RKHS inner products when it is clear from the context.

Let $Z$ denote the real-valued random variable stemming from the noisy conditional expectation of some unknown latent function $f:{\mathcal{X}}\rightarrow{\mathbb{R}}$, as $Z={\mathbb{E}}_{X}[f(X)|Y]+\varepsilon$. We suppose one observes another set of realizations ${\mathcal{D}}_{2}=\{\tilde{y}_{j},\tilde{z}_{j}\}_{j=1}^{M}$ from ${\mathbb{P}}_{YZ}$, which is sampled independently from ${\mathcal{D}}_{1}$. Likewise, we stack observations into vectors $\tilde{\bf y}:=\begin{bmatrix}\tilde{y}_{1}&\ldots&\tilde{y}_{M}\end{bmatrix}^{\top}$, $\tilde{\bf z}:=\begin{bmatrix}\tilde{z}_{1}&\ldots&\tilde{z}_{M}\end{bmatrix}^{\top}$ and define the feature matrix ${\bf\Psi}_{\tilde{\bf y}}:=\begin{bmatrix}\ell_{\tilde{y}_{1}}&\ldots&\ell_{\tilde{y}_{M}}\end{bmatrix}$.

2.2 Conditional and Deconditional Kernel Mean Embeddings

Marginal and Joint Mean Embeddings

Kernel mean embeddings of distributions provide a powerful framework for representing and manipulating distributions without specifying their parametric form [26, 22]. The marginal mean embedding of the measure ${\mathbb{P}}_{X}$ is defined as $\mu_{X}:={\mathbb{E}}_{X}[k_{X}]\in{\mathcal{H}}_{k}$ and corresponds to the Riesz representer of the expectation functional $f\mapsto{\mathbb{E}}_{X}[f(X)]$. It can hence be used to evaluate expectations ${\mathbb{E}}_{X}[f(X)]=\langle f,\mu_{X}\rangle_{k}$. If the mapping ${\mathbb{P}}_{X}\mapsto\mu_{X}$ is injective, the kernel $k$ is said to be characteristic [27], a property satisfied by the Gaussian and Matérn kernels on ${\mathbb{R}}^{d}$ [28]. In practice, the Monte Carlo estimator $\hat{\mu}_{X}:=\frac{1}{N}\sum_{i=1}^{N}k_{x_{i}}$ provides an unbiased estimate of $\mu_{X}$ [29].

Extending this rationale to embeddings of joint distributions, we define $C_{YY}:={\mathbb{E}}_{Y}[\ell_{Y}\otimes\ell_{Y}]\in{\mathcal{H}}_{\ell}\otimes{\mathcal{H}}_{\ell}$ and $C_{XY}:={\mathbb{E}}_{X,Y}[k_{X}\otimes\ell_{Y}]\in{\mathcal{H}}_{k}\otimes{\mathcal{H}}_{\ell}$, which can be identified with the cross-covariance operators between Hilbert spaces $C_{YY}:{\mathcal{H}}_{\ell}\rightarrow{\mathcal{H}}_{\ell}$ and $C_{XY}:{\mathcal{H}}_{\ell}\rightarrow{\mathcal{H}}_{k}$. They correspond to the Riesz representers of the bilinear forms $(g,g^{\prime})\mapsto\operatorname{Cov}(g(Y),g^{\prime}(Y))=\langle g,C_{YY}g^{\prime}\rangle_{\ell}$ and $(f,g)\mapsto\operatorname{Cov}(f(X),g(Y))=\langle f,C_{XY}g\rangle_{k}$. As above, empirical estimates are obtained as $\hat{C}_{YY}=\frac{1}{N}{\bf\Psi}_{\bf y}{\bf\Psi}_{\bf y}^{\top}=\frac{1}{N}\sum_{i=1}^{N}\ell_{y_{i}}\otimes\ell_{y_{i}}$ and $\hat{C}_{XY}=\frac{1}{N}{\bf\Phi}_{\bf x}{\bf\Psi}_{\bf y}^{\top}=\frac{1}{N}\sum_{i=1}^{N}k_{x_{i}}\otimes\ell_{y_{i}}$. Again, the notation abuse $(\cdot)(\cdot)^{\top}$ is a shorthand for element-wise tensor products when clear from context.

Conditional Mean Embeddings

Similarly, one can introduce RKHS embeddings for conditional distributions, referred to as Conditional Mean Embeddings (CME). The CME of the conditional probability measure ${\mathbb{P}}_{X|Y=y}$ is defined as $\mu_{X|Y=y}:={\mathbb{E}}_{X}[k_{X}|Y=y]\in{\mathcal{H}}_{k}$. As introduced by Fukumizu et al. [27], it is common to formulate conditioning in terms of a Hilbert space operator $C_{X|Y}:{\mathcal{H}}_{\ell}\rightarrow{\mathcal{H}}_{k}$ called the Conditional Mean Operator (CMO). By definition, $C_{X|Y}$ satisfies $C_{X|Y}\ell_{y}=\mu_{X|Y=y}$ and $C_{X|Y}^{\top}f={\mathbb{E}}_{X}[f(X)|Y=\cdot]$ for all $f\in{\mathcal{H}}_{k}$, where $C_{X|Y}^{\top}$ denotes the adjoint of $C_{X|Y}$. Moreover, the CMO admits the expression $C_{X|Y}=C_{XY}C_{YY}^{-1}$, provided $\ell_{y}\in\operatorname{Range}(C_{YY})$ for all $y\in{\mathcal{Y}}$ [27, 26]. Song et al. [30] show that a nonparametric empirical form of the CMO writes

\hat{C}_{X|Y}={\bf\Phi}_{\bf x}({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N})^{-1}{\bf\Psi}_{\bf y}^{\top}, \qquad (1)

where $\lambda>0$ is a regularisation parameter ensuring the empirical operator is globally defined and bounded.
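As a concrete illustration, the following sketch evaluates the empirical CME $\hat{\mu}_{X|Y=y}(x^{\prime})=k(x^{\prime},{\bf x})({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N})^{-1}\ell({\bf y},y)$ implied by (1) with plain NumPy; the Gaussian kernel, its lengthscale and the regularisation value are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

def gaussian_gram(A, B, lengthscale=1.0):
    """Gram matrix of a Gaussian kernel between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

def empirical_cme(x_eval, y_query, x_train, y_train, lam=1e-2):
    """Evaluate mu_hat_{X|Y=y_query} at the points x_eval following (1):
    k(x_eval, x) (L_yy + N*lam*I)^{-1} l(y, y_query)."""
    N = y_train.shape[0]
    L_yy = gaussian_gram(y_train, y_train)
    weights = np.linalg.solve(L_yy + N * lam * np.eye(N),
                              gaussian_gram(y_train, y_query))   # (N, n_query)
    return gaussian_gram(x_eval, x_train) @ weights              # (n_eval, n_query)
```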

As observed by Grünewälder et al. [23], since $C_{X|Y}$ defines a mapping from ${\mathcal{H}}_{\ell}$ to ${\mathcal{H}}_{k}$, it can be interpreted as the solution to a vector-valued regression problem. This perspective enables the derivation of probabilistic convergence bounds on the empirical CMO estimator [23, 25].

Deconditional Mean Embeddings

Introduced by Hsu and Ramos [1] as a new class of embeddings, Deconditional Mean Embeddings (DME) are natural counterparts of CMEs. While the CME $\mu_{X|Y=y}\in{\mathcal{H}}_{k}$ allows taking the conditional expectation of any $f\in{\mathcal{H}}_{k}$ through the inner product ${\mathbb{E}}_{X}[f(X)|Y=y]=\langle f,\mu_{X|Y=y}\rangle_{k}$, the DME, denoted $\mu_{X=x|Y}\in{\mathcal{H}}_{\ell}$, solves the inverse problem (the slightly unusual notation $\mu_{X=x|Y}$ is taken from Hsu and Ramos [1] and is meant to contrast with the usual conditioning $\mu_{X|Y=y}$) and allows recovering the initial function whose conditional expectation was taken, through the inner product $\langle{\mathbb{E}}_{X}[f(X)|Y=\cdot],\mu_{X=x|Y}\rangle_{\ell}=f(x)$.

The associated Hilbert space operator, the Deconditional Mean Operator (DMO), is thus defined as the operator $D_{X|Y}:{\mathcal{H}}_{k}\rightarrow{\mathcal{H}}_{\ell}$ such that $D_{X|Y}^{\top}{\mathbb{E}}_{X}[f(X)|Y=\cdot]=f$ for all $f\in{\mathcal{H}}_{k}$. It admits an expression in terms of the CMO and cross-covariance operators, $D_{X|Y}=(C_{X|Y}C_{YY})^{\top}(C_{X|Y}C_{YY}C_{X|Y}^{\top})^{-1}$, provided $\ell_{y}\in\operatorname{Range}(C_{YY})$ and $k_{x}\in\operatorname{Range}(C_{X|Y}C_{YY}C_{X|Y}^{\top})$ for all $y\in{\mathcal{Y}}$ and $x\in{\mathcal{X}}$. A nonparametric empirical estimate of the DMO using the datasets ${\mathcal{D}}_{1}$ and ${\mathcal{D}}_{2}$ described above is given by $\hat{D}_{X|Y}={\bf\Psi}_{\tilde{\bf y}}\left({\bf A}^{\top}{\bf K}_{{\bf x}{\bf x}}{\bf A}+M\epsilon{\bf I}_{M}\right)^{-1}{\bf A}^{\top}{\bf\Phi}_{\bf x}^{\top}$, where $\epsilon>0$ is a regularisation term and ${\bf A}:=({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N})^{-1}{\bf L}_{{\bf y}\tilde{\bf y}}$ can be interpreted as a mediation operator. Applying the DMO to the expected responses $\tilde{\bf z}$, Hsu and Ramos [1] recover an estimate of $f$ as

\hat{f}(x)=k(x,{\bf x}){\bf A}\left({\bf A}^{\top}{\bf K}_{{\bf x}{\bf x}}{\bf A}+M\epsilon{\bf I}_{M}\right)^{-1}\tilde{\bf z}. \qquad (2)

Note that since separate samples $\tilde{\bf y}$ can be used to estimate $C_{YY}$, this naturally fits a mediating variable setup where ${\bf x}$ and the conditional means $\tilde{\bf z}$ are not jointly observed.
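Since (2) only involves Gram matrices, a direct (non-scalable) implementation is short. The sketch below assumes generic kernel callables kern_k and kern_l that return Gram matrices; the function name and regularisation values are illustrative.

```python
import numpy as np

def dme_estimate(x_eval, x, y, y_tilde, z_tilde, kern_k, kern_l, lam=1e-2, eps=1e-2):
    """Sketch of the DME-based estimate (2):
    f_hat(x) = k(x, X) A (A^T K_xx A + M*eps*I)^{-1} z_tilde,
    with mediation matrix A = (L_yy + N*lam*I)^{-1} L_{y y_tilde}."""
    N, M = y.shape[0], y_tilde.shape[0]
    A = np.linalg.solve(kern_l(y, y) + N * lam * np.eye(N), kern_l(y, y_tilde))  # (N, M)
    middle = A.T @ kern_k(x, x) @ A + M * eps * np.eye(M)                        # (M, M)
    return kern_k(x_eval, x) @ (A @ np.linalg.solve(middle, z_tilde))            # (n_eval,)
```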

3 Deconditioning with Gaussian processes

In this section, we introduce Conditional Mean Process (CMP), a stochastic process stemming from the conditional expectation of a GP. We provide a characterisation of the CMP and show that the corresponding posterior mean of its integrand is a DME-based estimator. We also derive in Appendix B a variational formulation of our model that scales to large datasets and demonstrate its performance in Section 5.

For simplicity, we put aside observation bagging in Sections 3.1 and 3.2, as our contributions there are relevant to the general theory of DMEs. We return to the bagged formalism in Section 3.3 and extend deconditioning to the multiresolution setup inherent to the mediated statistical downscaling application. In what follows, ${\mathcal{X}}$ is a measurable space, ${\mathcal{Y}}$ a Borel space, and the feature maps $k_{x}$ and $\ell_{y}$ are Borel-measurable functions for any $x\in{\mathcal{X}}$, $y\in{\mathcal{Y}}$. All proofs and derivations of the paper are included in the appendix.

3.1 Conditional Mean Process

Bayesian quadrature [31, 32, 14] is based on the observation that the integral of a GP with respect to some marginal measure is a Gaussian random variable. We start by probing the implications of integrating with respect to the conditional distribution ${\mathbb{P}}_{X|Y=y}$ and considering such integrals as functions of the conditioning variable $y$. This gives rise to the notion of conditional mean processes.

Definition 3.1 (Conditional Mean Process).

Let $f\sim{\mathcal{G}}{\mathcal{P}}(m,k)$ with integrable sample paths, i.e. $\int_{\mathcal{X}}|f|\,{\mathrm{d}}{\mathbb{P}}_{X}<\infty$ a.s. The CMP induced by $f$ with respect to ${\mathbb{P}}_{X|Y}$ is defined as the stochastic process $\left\{g(y)\,:\,y\in\mathcal{Y}\right\}$ given by

g(y)={\mathbb{E}}_{X}[f(X)|Y=y]=\int_{\mathcal{X}}f(x)\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x). \qquad (3)

By linearity of the integral, it is clear that $g(y)$ is a Gaussian random variable for each $y\in\mathcal{Y}$. The sample path integrability requirement ensures $g$ is well-defined almost everywhere. The following result characterizes the CMP as a GP on ${\mathcal{Y}}$.

Proposition 3.2 (CMP characterization).

Suppose ${\mathbb{E}}_{X}[|m(X)|]<\infty$ and ${\mathbb{E}}_{X}[\|k_{X}\|_{k}]<\infty$, and let $(X^{\prime},Y^{\prime})\sim{\mathbb{P}}_{XY}$. Then $g$ is a Gaussian process $g\sim{\mathcal{G}}{\mathcal{P}}(\nu,q)$ a.s., specified by

\nu(y)={\mathbb{E}}_{X}[m(X)|Y=y]\qquad\quad q(y,y^{\prime})={\mathbb{E}}_{X,X^{\prime}}[k(X,X^{\prime})|Y=y,Y^{\prime}=y^{\prime}] \qquad (4)

for all $y,y^{\prime}\in{\mathcal{Y}}$. Furthermore, $q(y,y^{\prime})=\langle\mu_{X|Y=y},\mu_{X|Y=y^{\prime}}\rangle_{k}$ a.s.

Intuitively, the CMP can be understood as a GP on the conditional means, where its covariance $q(y,y^{\prime})$ is induced by the similarity between the CMEs at $y$ and $y^{\prime}$. Resorting to the kernel $\ell$ defined on ${\mathcal{Y}}$, we can reexpress the covariance using Hilbert space operators as $q(y,y^{\prime})=\langle C_{X|Y}\ell_{y},C_{X|Y}\ell_{y^{\prime}}\rangle_{k}$. A natural nonparametric estimate of the CMP covariance thus follows from the CMO estimator in (1) as $\hat{q}(y,y^{\prime})=\ell(y,{\bf y})\left({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N}\right)^{-1}{\bf K}_{{\bf x}{\bf x}}\left({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N}\right)^{-1}\ell({\bf y},y^{\prime})$. When $m\in{\mathcal{H}}_{k}$, the mean function can be written as $\nu(y)=\langle\mu_{X|Y=y},m\rangle_{k}$, for which we can also use the empirical estimate $\hat{\nu}(y)=\ell(y,{\bf y})\left({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N}\right)^{-1}{\bf\Phi}_{\bf x}^{\top}m$. Finally, one can also quantify the covariance between the CMP $g$ and its integrand $f$, i.e. $\operatorname{Cov}(f(x),g(y))={\mathbb{E}}_{X}[k(x,X)|Y=y]$. Under the same assumptions as Proposition 3.2, this covariance can be expressed using mean embeddings, i.e. $\operatorname{Cov}(f(x),g(y))=\langle k_{x},\mu_{X|Y=y}\rangle_{k}$, and admits the empirical estimate $k(x,{\bf x})\left({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N}\right)^{-1}\ell({\bf y},y)$.
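The empirical expressions above translate directly into code. The sketch below assumes generic kernel callables kern_k and kern_l returning Gram matrices and an illustrative regularisation value.

```python
import numpy as np

def cmp_covariance(y1, y2, x_train, y_train, kern_k, kern_l, lam=1e-2):
    """Empirical CMP covariance q_hat(y1, y2) =
    l(y1, y) (L_yy + N*lam*I)^{-1} K_xx (L_yy + N*lam*I)^{-1} l(y, y2) (sketch)."""
    N = y_train.shape[0]
    reg = kern_l(y_train, y_train) + N * lam * np.eye(N)
    left = np.linalg.solve(reg, kern_l(y_train, y1))    # reg^{-1} l(y, y1)
    right = np.linalg.solve(reg, kern_l(y_train, y2))   # reg^{-1} l(y, y2)
    return left.T @ kern_k(x_train, x_train) @ right

def cmp_cross_covariance(x_eval, y_eval, x_train, y_train, kern_k, kern_l, lam=1e-2):
    """Empirical Cov(f(x), g(y)) = k(x, x_train) (L_yy + N*lam*I)^{-1} l(y_train, y)."""
    N = y_train.shape[0]
    reg = kern_l(y_train, y_train) + N * lam * np.eye(N)
    return kern_k(x_eval, x_train) @ np.linalg.solve(reg, kern_l(y_train, y_eval))
```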

3.2 Deconditional Posterior

Given the independent observations introduced above, ${\mathcal{D}}_{1}=\{{\bf x},{\bf y}\}$ and ${\mathcal{D}}_{2}=\{\tilde{\bf y},\tilde{\bf z}\}$, we may now consider an additive noise model with a CMP prior on the aggregate observations, $\tilde{\bf z}|\tilde{\bf y}\sim{\mathcal{N}}(g(\tilde{\bf y}),\sigma^{2}{\bf I}_{M})$. Let ${\bf Q}_{\tilde{\bf y}\tilde{\bf y}}:=q(\tilde{\bf y},\tilde{\bf y})$ be the kernel matrix induced by $q$ on $\tilde{\bf y}$ and let $\boldsymbol{\Upsilon}:=\operatorname{Cov}(f({\bf x}),\tilde{\bf z})={\bf\Phi}_{\bf x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}$ be the cross-covariance between $f({\bf x})$ and $\tilde{\bf z}$. The joint normality of $f({\bf x})$ and $\tilde{\bf z}$ gives

\begin{bmatrix}f({\bf x})\\ \tilde{\bf z}\end{bmatrix}\mid{\bf y},\tilde{\bf y}\sim{\mathcal{N}}\left(\begin{bmatrix}m({\bf x})\\ \nu(\tilde{\bf y})\end{bmatrix},\begin{bmatrix}{\bf K}_{{\bf x}{\bf x}}&\boldsymbol{\Upsilon}\\ \boldsymbol{\Upsilon}^{\top}&{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M}\end{bmatrix}\right). \qquad (5)

Using Gaussian conditioning, we can then readily derive the posterior distribution of the underlying GP field $f$ given the aggregate observations $\tilde{\bf z}$ corresponding to $\tilde{\bf y}$. The latter naturally degenerates if observations are paired, i.e. ${\bf y}=\tilde{\bf y}$. This formulation can be seen as an instance of inter-domain GPs [33], where we utilise the observed conditional means $\tilde{\bf z}$ as inter-domain inducing features for inference on $f$.

Proposition 3.3 (Deconditional Posterior).

Given aggregate observations $\tilde{\bf z}$ with homoscedastic noise $\sigma^{2}$, the deconditional posterior of $f$ is defined as the Gaussian process $f|\tilde{\bf z}\sim{\mathcal{G}}{\mathcal{P}}(m_{\mathrm{d}},k_{\mathrm{d}})$ where

m_{\mathrm{d}}(x) = m(x)+k_{x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}({\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}(\tilde{\bf z}-\nu(\tilde{\bf y})), \qquad (6)
k_{\mathrm{d}}(x,x^{\prime}) = k(x,x^{\prime})-k_{x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}({\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}{\bf\Psi}_{\tilde{\bf y}}^{\top}C_{X|Y}^{\top}k_{x^{\prime}}. \qquad (7)

Substituting terms by their empirical forms, we can define a nonparametric estimate of $m_{\mathrm{d}}$ as

\hat{m}_{\mathrm{d}}(x):=m(x)+k(x,{\bf x}){\bf A}(\hat{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}(\tilde{\bf z}-\hat{\nu}(\tilde{\bf y})) \qquad (8)

which, when $m=0$, reduces to the DME-based estimator in (2) by taking the noise variance $\frac{\sigma^{2}}{N}$ as the inverse regularization parameter. Hsu and Ramos [1] recover a similar posterior mean expression in their Bayesian interpretation of DME. However, they do not link the distributions of $f$ and its CMP, which leads to much more complicated chained inference derivations combining fully Bayesian and MAP estimates, while we naturally recover it using simple Gaussian conditioning.

Likewise, an empirical estimate of the deconditional covariance is given by

\hat{k}_{\mathrm{d}}(x,x^{\prime}):=k(x,x^{\prime})-k(x,{\bf x}){\bf A}(\hat{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}{\bf A}^{\top}k({\bf x},x^{\prime}). \qquad (9)

Interestingly, the latter can be rearranged as the difference between the original kernel and the kernel undergoing conditioning and deconditioning steps, $\hat{k}_{\mathrm{d}}(x,x^{\prime})=k(x,x^{\prime})-\langle k_{x},\hat{D}_{X|Y}\hat{C}_{X|Y}k_{x^{\prime}}\rangle_{k}$. This can be interpreted as a measure of reconstruction quality, which vanishes in the case of perfect deconditioning, i.e. $\hat{D}_{X|Y}\hat{C}_{X|Y}=\operatorname{Id}_{{\mathcal{H}}_{k}}$.
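Putting (8) and (9) together, a minimal sketch of the empirical deconditional posterior with zero prior mean (so that $\hat{\nu}(\tilde{\bf y})=0$) reads as follows; the kernel callables and the noise and regularisation values are illustrative assumptions.

```python
import numpy as np

def deconditional_posterior(x_eval, x, y, y_tilde, z_tilde, kern_k, kern_l,
                            sigma2=1e-2, lam=1e-2):
    """Sketch of the empirical deconditional posterior (8)-(9) with m = 0,
    using A = (L_yy + N*lam*I)^{-1} L_{y y_tilde} and Q_hat = A^T K_xx A."""
    N, M = y.shape[0], y_tilde.shape[0]
    K_xx, K_ex = kern_k(x, x), kern_k(x_eval, x)
    A = np.linalg.solve(kern_l(y, y) + N * lam * np.eye(N), kern_l(y, y_tilde))  # (N, M)
    S = A.T @ K_xx @ A + sigma2 * np.eye(M)                                      # Q_hat + sigma^2 I
    mean = K_ex @ A @ np.linalg.solve(S, z_tilde)                                # empirical (8)
    cov = kern_k(x_eval, x_eval) - K_ex @ A @ np.linalg.solve(S, A.T @ K_ex.T)   # empirical (9)
    return mean, cov
```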

3.3 Deconditioning and multiresolution data

Downscaling applications typically involve multiresolution data, with a bag dataset ${}^{b}\!{\mathcal{D}}_{1}=\{({}^{b}\!{\boldsymbol{x}}_{j},y_{j})\}_{j=1}^{N}$ where ${}^{b}\!{\boldsymbol{x}}_{j}=\{x_{j}^{(i)}\}_{i=1}^{n_{j}}$. In this setup, the size mismatch between the concatenations ${\bf x}=[x_{1}^{(1)}\;\ldots\;x_{N}^{(n_{N})}]^{\top}$ and ${\bf y}=\begin{bmatrix}y_{1}&\ldots&y_{N}\end{bmatrix}^{\top}$ prevents us from readily applying (1) to estimate the CMO and thus infer the deconditional posterior. There is, however, a straightforward way to alleviate this: simply replicate the bag-level covariates $y_{j}$ to match the bag sizes $n_{j}$. Although simple, this method incurs a ${\mathcal{O}}((\sum_{j=1}^{N}n_{j})^{3})$ cost due to the matrix inversion in (1).

Alternatively, since the bags ${}^{b}\!{\boldsymbol{x}}_{j}$ are sampled from the conditional distribution ${\mathbb{P}}_{X|Y=y_{j}}$, unbiased Monte Carlo estimators of the CMEs are given by $\hat{\mu}_{X|Y=y_{j}}=\frac{1}{n_{j}}\sum_{i=1}^{n_{j}}k_{x_{j}^{(i)}}$. Let ${\bf\hat{M}}_{{\bf y}}=[\hat{\mu}_{X|Y=y_{1}}\ \ldots\ \hat{\mu}_{X|Y=y_{N}}]$ denote their concatenation along columns. We can then rewrite the cross-covariance operator as $C_{XY}={\mathbb{E}}_{Y}[{\mathbb{E}}_{X}[k_{X}|Y]\otimes\ell_{Y}]$ and hence take $\frac{1}{N}{\bf\hat{M}}_{{\bf y}}{\bf\Psi}_{\bf y}^{\top}$ as an estimate of $C_{XY}$. Substituting empirical forms into $C_{X|Y}=C_{XY}C_{YY}^{-1}$ and applying the Woodbury identity, we obtain an alternative CMO estimator that only requires the inversion of an $N\times N$ matrix. We call it the Conditional Mean Shrinkage Operator and define it as

{}^{S}\hat{C}_{X|Y}:={\bf\hat{M}}_{\bf y}({\bf L}_{{\bf y}{\bf y}}+\lambda N{\bf I}_{N})^{-1}{\bf\Psi}_{\bf y}^{\top}. \qquad (10)

This estimator can be seen as a generalisation of the Kernel Mean Shrinkage Estimator [34] to the conditional case. We provide in Appendix C modifications of (8) and (9) including this estimator for the computation of the deconditional posterior.
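The bag-level quantities entering this estimator reduce to simple averages of the kernel $k$ over bags, as in the sketch below (kern_k is an assumed generic kernel callable returning Gram matrices); only $N\times N$ matrices then need to be inverted downstream, in contrast to the ${\mathcal{O}}((\sum_{j}n_{j})^{3})$ cost of the replication approach.

```python
import numpy as np

def bag_embedding_gram(bags_x, kern_k):
    """Gram matrix of the Monte Carlo CME estimates mu_hat_{X|Y=y_j}:
    entry (j, j') is the double average of k over the points of bags j and j' (sketch)."""
    N = len(bags_x)
    G = np.empty((N, N))
    for j in range(N):
        for jp in range(j, N):
            G[j, jp] = G[jp, j] = kern_k(bags_x[j], bags_x[jp]).mean()
    return G

def bag_cross_kernel(x_eval, bags_x, kern_k):
    """Entry (t, j) is <k_{x_eval[t]}, mu_hat_j>_k, i.e. the average of
    k(x_eval[t], .) over the points of bag j (sketch)."""
    return np.stack([kern_k(x_eval, b).mean(axis=1) for b in bags_x], axis=1)
```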

4 Deconditioning as regression

In Section 3.2, we obtained a DMO-based estimate for the posterior mean of $f|\tilde{\bf z}$. As the estimate gets closer to the exact operator, the uncertainty collapses and the Bayesian view meets the frequentist one. It is however unclear how the empirical operators converge with finite data. Adopting an alternative perspective, we now demonstrate that the DMO estimate can be obtained as the minimiser of a two-staged vector-valued regression. This frequentist turn enables us to leverage the rich theory of vector-valued regression and establish, under mild assumptions, a convergence rate for the DMO estimator, with conditions to fulfill minimax optimality in terms of statistical-computational efficiency. In the following, we briefly review the CMO's vector-valued regression viewpoint and construct an analogous regression problem for the DMO. We refer the reader to [35] for a comprehensive overview of vector-valued RKHS theory. As in Sections 3.1 and 3.2, this section contributes to the general theory of DMEs and we hence put aside the bag notation.

Stage 1: Regressing the Conditional Mean Operator

As first introduced by Grünewälder et al. [23] and generalised to infinite dimensional spaces by Singh et al. [25], estimating $C_{X|Y}^{\top}$ is equivalent to solving a vector-valued kernel ridge regression problem in the hypothesis space of Hilbert-Schmidt operators from ${\mathcal{H}}_{k}$ to ${\mathcal{H}}_{\ell}$, denoted $\operatorname{HS}({\mathcal{H}}_{k},{\mathcal{H}}_{\ell})$. Specifically, we may consider the operator-valued kernel defined over ${\mathcal{H}}_{k}$ as $\Gamma(f,f^{\prime}):=\langle f,f^{\prime}\rangle_{k}\operatorname{Id}_{{\mathcal{H}}_{\ell}}$. We denote ${\mathcal{H}}_{\Gamma}$ the ${\mathcal{H}}_{\ell}$-valued RKHS spanned by $\Gamma$ with norm $\|\cdot\|_{\Gamma}$, which can be identified with $\operatorname{HS}({\mathcal{H}}_{k},{\mathcal{H}}_{\ell})$ (indeed, ${\mathcal{H}}_{\Gamma}=\overline{\operatorname{Span}}\{\Gamma_{f}h,\,f\in{\mathcal{H}}_{k},h\in{\mathcal{H}}_{\ell}\}=\overline{\operatorname{Span}}\{f\otimes h,\,f\in{\mathcal{H}}_{k},h\in{\mathcal{H}}_{\ell}\}=\overline{{\mathcal{H}}_{k}\otimes{\mathcal{H}}_{\ell}}\cong\operatorname{HS}({\mathcal{H}}_{k},{\mathcal{H}}_{\ell})$). Singh et al. [25] frame CMO regression as the minimisation of the surrogate discrepancy ${\mathcal{E}}_{\mathrm{c}}(C):={\mathbb{E}}_{X,Y}\left[\|k_{X}-C^{\top}\ell_{Y}\|^{2}_{k}\right]$, which they replace with an empirical regularised version restricted to ${\mathcal{H}}_{\Gamma}$, given by

\hat{\mathcal{E}}_{\mathrm{c}}(C):=\frac{1}{N}\sum_{i=1}^{N}\|k_{x_{i}}-C^{\top}\ell_{y_{i}}\|^{2}_{k}+\lambda\|C\|^{2}_{\Gamma},\qquad C\in{\mathcal{H}}_{\Gamma},\quad\lambda>0. \qquad (11)

This ${\mathcal{H}}_{k}$-valued kernel ridge regression problem admits a closed-form minimiser which shares the same empirical form as the CMO, i.e. $\hat{C}_{X|Y}^{\top}=\arg\min_{C\in{\mathcal{H}}_{\Gamma}}\,\hat{\mathcal{E}}_{\mathrm{c}}(C)$ [23, 25].

Stage 2 : Regressing the Deconditional Mean Operator

The DMO, on the other hand, is defined as the operator $D_{X|Y}:{\mathcal{H}}_{k}\rightarrow{\mathcal{H}}_{\ell}$ such that $D_{X|Y}^{\top}C_{X|Y}^{\top}f=f$ for all $f\in{\mathcal{H}}_{k}$. Since deconditioning corresponds to finding a pseudo-inverse of the CMO, it is natural to consider the reconstruction objective ${\mathcal{E}}_{\mathrm{d}}(D):={\mathbb{E}}_{Y}\left[\|\ell_{Y}-DC_{X|Y}\ell_{Y}\|^{2}_{\ell}\right]$. Introducing a novel characterization of the DMO, we propose to minimise this objective over the hypothesis space of Hilbert-Schmidt operators $\operatorname{HS}({\mathcal{H}}_{k},{\mathcal{H}}_{\ell})$, which identifies with ${\mathcal{H}}_{\Gamma}$. As above, we denote $\hat{C}_{X|Y}$ the empirical CMO learnt in Stage 1, and we substitute the loss with an empirical regularised formulation on ${\mathcal{H}}_{\Gamma}$

\hat{\mathcal{E}}_{\mathrm{d}}(D):=\frac{1}{M}\sum_{j=1}^{M}\|\ell_{\tilde{y}_{j}}-D\hat{C}_{X|Y}\ell_{\tilde{y}_{j}}\|^{2}_{\ell}+\epsilon\|D\|^{2}_{\Gamma},\qquad D\in{\mathcal{H}}_{\Gamma},\quad\epsilon>0. \qquad (12)
Proposition 4.1 (Empirical DMO as vector-valued regressor).

The minimiser of the empirical reconstruction risk is the empirical DMO, i.e. $\hat{D}_{X|Y}=\arg\min_{D\in{\mathcal{H}}_{\Gamma}}\hat{\mathcal{E}}_{\mathrm{d}}(D)$.

Since it requires estimating the CMO first, minimising (12) can be viewed as a two-staged vector-valued regression problem.

Convergence results

Following the footsteps of [25, 24], this perspective enables us to state the performance of the DMO estimate in terms of asymptotic convergence of the objective ${\mathcal{E}}_{\mathrm{d}}$. As in Caponnetto and De Vito [36], we must restrict the class of probability measures for ${\mathbb{P}}_{XY}$ and ${\mathbb{P}}_{Y}$ to ensure uniform convergence even when ${\mathcal{H}}_{k}$ is infinite dimensional. The family of distributions considered is a general class of priors that does not assume parametric distributions and is parametrized by two variables: $b>1$ controls the effective input dimension and $c\in\,]1,2]$ controls functional smoothness. Mild regularity assumptions are also placed on the original spaces ${\mathcal{X}},{\mathcal{Y}}$, their corresponding RKHSs ${\mathcal{H}}_{k},{\mathcal{H}}_{\ell}$ and the vector-valued RKHS ${\mathcal{H}}_{\Gamma}$. We discuss these assumptions in detail in Appendix D. Importantly, while ${\mathcal{H}}_{k}$ can be infinite dimensional, we nonetheless have to assume that the RKHS ${\mathcal{H}}_{\ell}$ is finite dimensional. We will relax this assumption in further research.

Theorem 4.2 (Empirical DMO Convergence Rate).

Denote $D_{{\mathbb{P}}_{Y}}=\arg\min_{D\in{\mathcal{H}}_{\Gamma}}{\mathcal{E}}_{\mathrm{d}}(D)$ and assume the assumptions stated in Appendix D are satisfied. In particular, let $(b,c)$ and $(0,c^{\prime})$ be the parameters of the restricted class of distributions for ${\mathbb{P}}_{Y}$ and ${\mathbb{P}}_{XY}$ respectively, and let $\iota\in\,]0,1]$ be the Hölder continuity exponent in ${\mathcal{H}}_{\Gamma}$. Then, choosing $\lambda=N^{-\frac{1}{c^{\prime}+1}}$ and $N=M^{\frac{a(c^{\prime}+1)}{\iota(c^{\prime}-1)}}$ with $a>0$, we have the following:

  • If $a\leq\frac{b(c+1)}{bc+1}$, then ${\mathcal{E}}_{\mathrm{d}}(\hat{D}_{X|Y})-{\mathcal{E}}_{\mathrm{d}}(D_{{\mathbb{P}}_{Y}})={\mathcal{O}}(M^{\frac{-ac}{c+1}})$ with $\epsilon=M^{\frac{-a}{c+1}}$

  • If $a\geq\frac{b(c+1)}{bc+1}$, then ${\mathcal{E}}_{\mathrm{d}}(\hat{D}_{X|Y})-{\mathcal{E}}_{\mathrm{d}}(D_{{\mathbb{P}}_{Y}})={\mathcal{O}}(M^{\frac{-bc}{bc+1}})$ with $\epsilon=M^{\frac{-b}{bc+1}}$

This theorem underlines a trade-off between computational and statistical efficiency with respect to the dataset cardinalities $N=|{\mathcal{D}}_{1}|$, $M=|{\mathcal{D}}_{2}|$ and the problem difficulty $b,c,c^{\prime}$. For $a\leq\frac{b(c+1)}{bc+1}$, a smaller $a$ means fewer samples from ${\mathcal{D}}_{1}$ at fixed $M$ and thus computational savings, but it also hampers convergence, resulting in reduced statistical efficiency. At $a=\frac{b(c+1)}{bc+1}<2$, the convergence rate is minimax optimal in terms of computational-statistical efficiency, i.e. the optimal rate is attained with the smallest possible $M$. We note that at this optimum $N>M$, which means fewer samples are required from ${\mathcal{D}}_{2}$. Taking $a\geq\frac{b(c+1)}{bc+1}$ does not improve the convergence rate but only increases the size of ${\mathcal{D}}_{1}$ and hence the computational cost it bears.
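To make the trade-off concrete, here is an illustrative plug-in of Theorem 4.2; the values of $b$, $c$, $c^{\prime}$ and $\iota$ below are chosen for exposition only and are not claimed by the paper.

b=2,\quad c=c^{\prime}=2,\quad \iota=1 \;\Rightarrow\; a^{\star}=\tfrac{b(c+1)}{bc+1}=\tfrac{6}{5},\quad N=M^{\frac{a^{\star}(c^{\prime}+1)}{\iota(c^{\prime}-1)}}=M^{18/5},\quad \lambda=N^{-\frac{1}{c^{\prime}+1}}=M^{-6/5},\quad \epsilon=M^{-\frac{b}{bc+1}}=M^{-2/5},

which yields the rate ${\mathcal{E}}_{\mathrm{d}}(\hat{D}_{X|Y})-{\mathcal{E}}_{\mathrm{d}}(D_{{\mathbb{P}}_{Y}})={\mathcal{O}}(M^{-4/5})$, attained with $N>M$ as discussed above.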

5 Deconditional Downscaling Experiments

We demonstrate and evaluate our CMP-based downscaling approaches on both synthetic experiments and a challenging atmospheric temperature field downscaling problem with unmatched multi-resolution data. We denote the exact CMP deconditional posterior from Section 3 as cmp, its variant using the efficient shrinkage CMO estimator as s-cmp, and the variational formulation as varcmp. They are compared against vbagg [9] — which we describe below — and a GP regression [14] baseline (gpr) modified to take bag centroids as input. Experiments are implemented in PyTorch [37, 38], all code and datasets are made available (https://github.com/shahineb/deconditional-downscaling) and computational details are provided in Appendix E.

Variational Aggregate Learning

vbagg is introduced by Law et al. [9] as a variational aggregate learning framework to disaggregate exponential family models, with emphasis on the Poisson family. We consider its application to the Gaussian family, which models the relationship between aggregate targets $z_{j}$ and bag covariates $\{x_{j}^{(i)}\}_{i}$ by bag-wise averaging of a GP prior on the function of interest. In fact, the Gaussian vbagg corresponds exactly to a special case of cmp on matched data, where the bag covariates are simply one-hot encoded indices with kernel $\ell(j,j^{\prime})=\delta(j,j^{\prime})$, where $\delta$ is the Kronecker delta. However, vbagg cannot handle unmatched data, as bag indices do not instill the smoothness that is used for mediation. For a fair analysis, we compare the variational methods varcmp and vbagg together, and the exact methods cmp/s-cmp against an exact version of vbagg, which we implement and refer to as bagg-gp.

5.1 Swiss Roll

The scikit-learn [39] swiss roll manifold sampling function generates a 3D manifold of points $x\in{\mathbb{R}}^{3}$ mapped to their position along the manifold $t\in{\mathbb{R}}$. Our objective is to recover $t$ for each point $x$ while only observing $t$ at an aggregate level. In the first experiment, we compare our model to existing weakly supervised spatial disaggregation methods when all high-resolution covariates are matched with a coarse-level aggregate target. We then withdraw this requirement in a companion experiment.

Figure 2: Step 1: Split space regularly along height. Step 2: Group points into height-level bags. Step 3: Average point targets into bag-level aggregate targets.

5.1.1 Direct matching

Experimental Setup

Inspired by the experimental setup of Law et al. [9], we regularly split space along height $B-1$ times as depicted in Figure 2 and group together manifold points within each height level, hence mixing together points with very different positions on the manifold. We obtain bags of samples $\{({}^{b}\!{\boldsymbol{x}}_{j},{\boldsymbol{t}}_{j})\}_{j=1}^{B}$ where the $j$th bag contains $n_{j}$ points ${}^{b}\!{\boldsymbol{x}}_{j}=\{x_{j}^{(i)}\}_{i=1}^{n_{j}}$ and their corresponding targets ${\boldsymbol{t}}_{j}=\{t_{j}^{(i)}\}_{i=1}^{n_{j}}$. We then construct bag aggregate targets by taking noisy bag target averages $z_{j}:=\frac{1}{n_{j}}\sum_{i=1}^{n_{j}}t_{j}^{(i)}+\varepsilon_{j}$, where $\varepsilon_{j}\sim{\mathcal{N}}(0,\sigma^{2})$. We thus obtain the matched weakly supervised bag learning dataset ${\mathcal{D}}^{\circ}=\{({}^{b}\!{\boldsymbol{x}}_{j},z_{j})\}_{j=1}^{B}$. Since each bag corresponds to a height level, the centre altitude of each height split $y_{j}\in{\mathbb{R}}$ is a natural candidate bag-level covariate that informs on the relative positions of the bags. We can hence augment the above dataset as ${\mathcal{D}}=\{({}^{b}\!{\boldsymbol{x}}_{j},y_{j},z_{j})\}_{j=1}^{B}$. Using these bag datasets, we now wish to downscale the aggregate targets $z_{j}$ to recover the unobserved manifold locations $\{{\boldsymbol{t}}_{j}\}_{j=1}^{B}$ and to be able to query the target at any previously unseen input $x$.
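A minimal sketch of this bagging procedure with scikit-learn is given below; the number of samples, number of bags and noise level are illustrative choices, not the paper's experimental settings.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

def make_height_bags(n_samples=1000, n_bags=20, noise_std=0.1, seed=0):
    """Split the swiss roll regularly along height (axis 1 in scikit-learn's
    convention), group points into height-level bags and average their manifold
    positions t into noisy bag-level aggregate targets (sketch)."""
    rng = np.random.default_rng(seed)
    X, t = make_swiss_roll(n_samples=n_samples, random_state=seed)
    edges = np.linspace(X[:, 1].min(), X[:, 1].max(), n_bags + 1)
    bag_idx = np.clip(np.digitize(X[:, 1], edges) - 1, 0, n_bags - 1)
    bags_x = [X[bag_idx == j] for j in range(n_bags)]                   # HR covariates per bag
    y = 0.5 * (edges[:-1] + edges[1:])                                  # bag-level covariate: split centre altitude
    z = np.array([t[bag_idx == j].mean() + rng.normal(0, noise_std)    # noisy aggregate target
                  for j in range(n_bags)])
    return bags_x, y, z
```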

Models

We use a zero-mean prior on $f$ and choose Gaussian kernels for $k$ and $\ell$. Inducing point locations are initialized with the K-means++ procedure for varcmp and vbagg such that they spread evenly across the manifold. For exact methods, kernel hyperparameters and noise variance $\sigma^{2}$ are learnt on ${\mathcal{D}}$ by optimising the marginal likelihood. For varcmp, they are learnt jointly with the variational distribution parameters by maximising an evidence lower bound objective. While CMP-based methods can leverage the bag-level covariates $y_{j}$, baselines are restricted to learn from ${\mathcal{D}}^{\circ}$. The Adam optimiser [40] is used in all experiments.

Results

We test models against the unobserved groundtruth $\{{\boldsymbol{t}}_{j}\}_{j=1}^{B}$ by evaluating the root mean square error (RMSE) of the posterior mean. Table 1 shows that cmp, s-cmp and varcmp outperform their corresponding counterparts, i.e. bagg-gp and vbagg, with statistical significance confirmed by a Wilcoxon signed-rank test in Appendix E. Most notably, this shows that the additional knowledge on bag-level dependence instilled by $\ell$ is beneficial even in a setting where each bag is matched with an aggregate target.

Table 1: RMSE on the swiss roll experiment for models trained on directly and indirectly matched datasets; scores averaged over 20 seeds, 1 s.d. reported; * indicates our proposed methods.
Matching cmp* s-cmp* varcmp* bagg-gp vbagg gpr
Direct 0.33±0.06 0.25±0.04 0.18±0.04 0.60±0.01 0.22±0.04 0.70±0.05
Indirect 0.80±0.14 1.05±0.04 0.87±0.07 1.13±0.11 1.46±0.34 1.04±0.05

5.1.2 Indirect matching

Experimental Setup

We now impose indirect matching through the mediating variable $y_{j}$. We randomly select $N=\lfloor\frac{B}{2}\rfloor$ bags, which we take to be the first $N$ without loss of generality, and split ${\mathcal{D}}$ into ${\mathcal{D}}_{1}=\{({}^{b}\!{\boldsymbol{x}}_{j},y_{j})\}_{j=1}^{N}$ and ${\mathcal{D}}_{2}=\{(\tilde{y}_{j},\tilde{z}_{j})\}_{j=1}^{B-N}=\{(y_{N+j},z_{N+j})\}_{j=1}^{B-N}$, such that no pair of covariate bag ${}^{b}\!{\boldsymbol{x}}_{j}$ and aggregate target $\tilde{z}_{j}$ is jointly observed.

Models

CMP-based methods are naturally able to learn in this setting and are trained by independently drawing samples from ${\mathcal{D}}_{1}$ and ${\mathcal{D}}_{2}$. Baseline methods, however, require bags of covariates to be matched with an aggregate bag target. To remedy this, we place a separate prior $g\sim{\mathcal{G}}{\mathcal{P}}(0,\ell)$ and fit the regression model $\tilde{z}_{j}=g(\tilde{y}_{j})+\varepsilon_{j}$ over ${\mathcal{D}}_{2}$. We then use the predictive posterior mean to augment the first dataset as ${\mathcal{D}}_{1}^{\prime}=\left\{\left({}^{b}\!{\boldsymbol{x}}_{j},{\mathbb{E}}[g(y_{j})|{\mathcal{D}}_{2}]\right)\right\}_{j=1}^{N}$. This dataset can then be used to train bagg-gp, vbagg and gpr.
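A minimal sketch of this baseline adaptation is given below; the scikit-learn kernel choice is an illustrative stand-in for the kernel $\ell$ used in the paper, and inputs are assumed to be 2-D arrays.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def pseudo_targets(y_bags, y_tilde, z_tilde):
    """Fit a GP regression g on D2 = (y_tilde, z_tilde) and evaluate its posterior
    mean at the D1 bag-level covariates y_bags, yielding pseudo aggregate targets
    E[g(y_j) | D2] for the baselines (sketch)."""
    g = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    g.fit(np.asarray(y_tilde), np.asarray(z_tilde))
    return g.predict(np.asarray(y_bags))
```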

Results

For comparison, we use the same evaluation as in the direct matching experiment. Table 1 shows an anticipated performance drop (higher RMSE) for all models, but we observe that bagg-gp and vbagg suffer most from the mediated matching of the dataset, while cmp and varcmp report the best scores by a substantial margin. This highlights how using a separate prior on $g$ to mediate ${\mathcal{D}}_{1}$ and ${\mathcal{D}}_{2}$ turns out to be suboptimal in contrast to using the prior naturally implied by the CMP. While s-cmp is more computationally efficient than cmp, we observe a relative drop in its performance.

5.2 Mediated downscaling of atmospheric temperature

Given the large diversity of sources and formats of remote sensing and model data, expecting directly matched observations is often unrealistic [41]. For example, two distinct satellite products will often provide low and high resolution imagery that can be matched neither spatially nor temporally [2, 3, 4, 5]. Climate simulations [20, 21, 19] on the other hand provide a comprehensive coarse resolution coverage of meteorological variables that can serve as a mediating dataset.

For the purpose of demonstration, we create an experimental setup inspired by this problem using Coupled Model Intercomparison Project Phase 6 (CMIP6) [19] simulation data. This grants us access to groundtruth high-resolution covariates to facilitate model evaluation.

Experimental Setup

We collect monthly mean 2D atmospheric field simulations from CMIP6 data [42, 43] for the following variables: air temperature at cloud top (T), mean cloud top pressure (P), total cloud cover (TCC) and cloud albedo (α). First, we collocate TCC and α onto an HR latitude-longitude grid of size 360×720 to obtain fine-grained fields (latitudeHR, longitudeHR, altitudeHR, TCCHR, αHR), augmented with a static HR surface altitude field. Then we collocate P and T onto an LR grid of size 21×42 to obtain coarse-grained fields (latitudeLR, longitudeLR, PLR, TLR). We denote by $B$ the number of low-resolution pixels.

Our objective is to disaggregate TLR to the HR field granularity. We assimilate the $j$th coarse temperature pixel to an aggregate target $z_{j}:=\text{T}_{j}^{\text{LR}}$ corresponding to bag $j$. Each bag includes HR covariates ${}^{b}\!{\boldsymbol{x}}_{j}=\{x_{j}^{(i)}\}_{i=1}^{n_{j}}:=\{(\text{latitude}_{j}^{\text{HR}(i)}, \text{longitude}_{j}^{\text{HR}(i)}, \text{altitude}_{j}^{\text{HR}(i)}, \text{TCC}_{j}^{\text{HR}(i)}, \alpha_{j}^{\text{HR}(i)})\}_{i=1}^{n_{j}}$. To emulate unmatched observations, we randomly select $N=\lfloor\frac{B}{2}\rfloor$ of the bags $\{{}^{b}\!{\boldsymbol{x}}_{j}\}_{j=1}^{N}$ and keep the opposite half of LR observations $\{z_{N+j}\}_{j=1}^{B-N}$, such that no available aggregate bag target corresponds to one of the available bags. Finally, we choose the pressure field PLR as the mediating variable. We hence compose a third-party low-resolution field of bag-level covariates $y_{j}:=(\text{latitude}_{j}^{\text{LR}}, \text{longitude}_{j}^{\text{LR}}, \text{P}_{j}^{\text{LR}})$, which can separately be matched with both sets above to obtain datasets ${\mathcal{D}}_{1}=\{({}^{b}\!{\boldsymbol{x}}_{j},y_{j})\}_{j=1}^{N}$ and ${\mathcal{D}}_{2}=\{(\tilde{y}_{j},\tilde{z}_{j})\}_{j=1}^{B-N}=\{(y_{N+j},z_{N+j})\}_{j=1}^{B-N}$.

Figure 3: Left: High-resolution atmospheric covariates used for prediction; Center-Left: Observed low-resolution temperature field, grey pixels are unobserved; Center: Unobserved high-resolution groundtruth temperature field; Center-Right: varcmp deconditional posterior mean; Right: 95% confidence region size on predictions; temperature values are in Kelvin.
Models Setup

We only consider variational methods in order to scale to the large number of pixels. varcmp is naturally able to learn from indirectly matched data. We use a Matérn-1.5 kernel for the rough spatial covariates (latitude, longitude) and a Gaussian kernel for the atmospheric covariates (P, TCC, α) and surface altitude. $k$ and $\ell$ are hence both taken as sums of Matérn and Gaussian kernels, and their hyperparameters are learnt along with the noise variance during training. A high-resolution noise term is also introduced, with details provided in Appendix E. Inducing point locations are uniformly initialized across the HR grid. We replace gpr with an inducing point variational counterpart, vargpr [15]. Since baseline methods require a matched dataset, we proceed as in the unmatched swiss roll experiment and fit a GP regression model $g$ with kernel $\ell$ on ${\mathcal{D}}_{2}$, then use its predictive posterior mean to obtain pseudo-targets for the bags of HR covariates from ${\mathcal{D}}_{1}$.
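A sketch of such a kernel composition in GPyTorch is given below; the active_dims indices assume a covariate stacking order that is illustrative only, and the actual configuration (lengthscales, high-resolution noise term) is described in Appendix E.

```python
import gpytorch

def make_field_kernel():
    """Sum of a Matern-1.5 component on the spatial coordinates and a Gaussian (RBF)
    component on the remaining covariates, assuming columns stacked as
    (latitude, longitude, altitude, TCC, albedo) (sketch)."""
    spatial = gpytorch.kernels.MaternKernel(nu=1.5, ard_num_dims=2, active_dims=(0, 1))
    atmospheric = gpytorch.kernels.RBFKernel(ard_num_dims=3, active_dims=(2, 3, 4))
    return gpytorch.kernels.ScaleKernel(spatial) + gpytorch.kernels.ScaleKernel(atmospheric)
```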

Results

Performance is evaluated by comparing the deconditional posterior mean against the original high-resolution field THR available in CMIP6 data [43], which we emphasise is never observed. We use a random Fourier features [44] approximation of the kernel $k$ to scale kernel evaluation to the HR covariate grid during testing. As reported in Table 2, varcmp substantially outperforms both baselines, with statistical significance provided in Appendix E. Figure 3 shows the reconstructed image with varcmp; plots for the other methods are included in Appendix E. The model resolves statistical patterns from HR covariates into coarse-resolution temperature pixels, thereby reconstructing a faithful HR version of the temperature field.

Table 2: Downscaling similarity scores of the posterior mean against the groundtruth high-resolution cloud top temperature field; averaged over 10 seeds; 1 s.d. reported; ↓: lower is better; ↑: higher is better.
Model RMSE ↓ MAE ↓ Corr. ↑ SSIM ↑
vargpr 8.02±0.28 5.55±0.17 0.831±0.012 0.212±0.011
vbagg 8.25±0.15 5.82±0.11 0.821±0.006 0.182±0.004
varcmp 7.40±0.25 5.34±0.22 0.848±0.011 0.212±0.013

6 Discussion

We introduced a scalable Bayesian solution to the mediated statistical downscaling problem, which handles unmatched multi-resolution data. The proposed approach combines Gaussian Processes with the framework of deconditioning using RKHSs and recovers previous approaches as its special cases. We provided convergence rates for the associated deconditioning operator. Finally, we demonstrated the advantages over spatial disaggregation baselines in synthetic data and in a challenging atmospheric fields downscaling problem.

In future work, exploring theoretical guarantees of the computationally efficient shrinkage formulation in a multi-resolution setting and relaxing the finite dimensionality assumption for the convergence rate will have fruitful practical and theoretical implications. Further directions also open up in quantifying uncertainty over the deconditional posterior, since it is computed using empirical estimates of the CMP covariance; this may be problematic if the mediating variable undergoes covariate shift between the two datasets.

Acknowledgments

SLC is supported by the EPSRC and MRC through the OxWaSP CDT programme EP/L016710/1. SB is supported by the EU’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No 860100. DS is supported in part by Tencent AI Lab and in part by the Alan Turing Institute (EP/N510129/1).

References

  • Hsu and Ramos [2019] Kelvin Hsu and Fabio Ramos. Bayesian Deconditional Kernel Mean Embeddings. Proceedings of Machine Learning Research. PMLR, 2019.
  • Remer et al. [2005] L. A. Remer, Y. J. Kaufman, D. Tanré, S. Mattoo, D. A. Chu, J. V. Martins, R.-R. Li, C. Ichoku, R. C. Levy, R. G. Kleidman, T. F. Eck, E. Vermote, and B. N. Holben. The MODIS Aerosol Algorithm, Products, and Validation. Journal of the Atmospheric Sciences, 2005.
  • Platnick et al. [2003] S. Platnick, M.D. King, S.A. Ackerman, W.P. Menzel, B.A. Baum, J.C. Riedi, and R.A. Frey. The MODIS cloud products: algorithms and examples from Terra. IEEE Transactions on Geoscience and Remote Sensing, 2003.
  • Stephens et al. [2002] Graeme L Stephens, Deborah G Vane, Ronald J Boain, Gerald G Mace, Kenneth Sassen, Zhien Wang, Anthony J Illingworth, Ewan J O’connor, William B Rossow, Stephen L Durden, et al. The CloudSat mission and the A-train: A new dimension of space-based observations of clouds and precipitation. Bulletin of the American Meteorological Society, 2002.
  • Barnes et al. [1998] William L. Barnes, Thomas S. Pagano, and Vincent V. Salomonson. Prelaunch characteristics of the Moderate Resolution Imaging Spectroradiometer (MODIS) on EOS-AM1. IEEE Transactions on Geoscience and Remote Sensing, 1998.
  • Malenovský et al. [2012] Zbyněk Malenovský, Helmut Rott, Josef Cihlar, Michael E. Schaepman, Glenda García-Santos, Richard Fernandes, and Michael Berger. Sentinels for science: Potential of Sentinel-1, -2, and -3 missions for scientific observations of ocean, cryosphere, and land. Remote Sensing of Environment, 2012.
  • Knight and Kvaran [2014] Edward J. Knight and Geir Kvaran. Landsat-8 operational land imager design. Remote Sensing of Environment, 2014.
  • Zhang et al. [2020] Yivan Zhang, Nontawat Charoenphakdee, Zhenguo Wu, and Masashi Sugiyama. Learning from aggregate observations. In Advances in Neural Information Processing Systems, 2020.
  • Law et al. [2018] Leon Ho Chung Law, Dino Sejdinovic, Ewan Cameron, Tim C.D. Lucas, Seth Flaxman, Katherine Battle, and Kenji Fukumizu. Variational learning on aggregate outputs with Gaussian processes. In Advances in Neural Information Processing Systems, 2018.
  • Hamelijnck et al. [2019] Oliver Hamelijnck, Theodoros Damoulas, Kangrui Wang, and Mark A. Girolami. Multi-resolution multi-task Gaussian processes. In Advances in Neural Information Processing Systems, 2019.
  • Yousefi et al. [2019] Fariba Yousefi, Michael Thomas Smith, and Mauricio A. Álvarez. Multi-task learning for aggregated data using Gaussian processes. In Advances in Neural Information Processing Systems, 2019.
  • Tanaka et al. [2019] Yusuke Tanaka, Toshiyuki Tanaka, Tomoharu Iwata, Takeshi Kurashima, Maya Okawa, Yasunori Akagi, and Hiroyuki Toda. Spatially aggregated Gaussian processes with multivariate areal outputs. Advances in Neural Information Processing Systems, 2019.
  • Tanskanen et al. [2020] Ville Tanskanen, Krista Longi, and Arto Klami. Non-Linearities in Gaussian Processes with Integral Observations. IEEE International Workshop on Machine Learning for Signal Processing, 2020.
  • Rasmussen and Williams [2005] C Rasmussen and C Williams. Gaussian Processes for Machine Learning, 2005.
  • Titsias [2009] Michalis K. Titsias. Variational learning of inducing variables in sparse gaussian processes. In AISTATS, 2009.
  • Vaughan et al. [2021] Anna Vaughan, Will Tebbutt, J Scott Hosking, and Richard E Turner. Convolutional conditional neural processes for local climate downscaling. Geoscientific Model Development Discussions, pages 1–25, 2021.
  • Groenke et al. [2020] Brian Groenke, Luke Madaus, and Claire Monteleoni. Climalign: Unsupervised statistical downscaling of climate variables via normalizing flows. In Proceedings of the 10th International Conference on Climate Informatics, pages 60–66, 2020.
  • Vandal et al. [2017] Thomas Vandal, Evan Kodra, Sangram Ganguly, Andrew Michaelis, Ramakrishna Nemani, and Auroop R Ganguly. Deepsd: Generating high resolution climate change projections through single image super-resolution. In Proceedings of the 23rd acm sigkdd international conference on knowledge discovery and data mining, pages 1663–1672, 2017.
  • Eyring et al. [2016] Veronika Eyring, Sandrine Bony, Gerald A Meehl, Catherine A Senior, Bjorn Stevens, Ronald J Stouffer, and Karl E Taylor. Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geoscientific Model Development, 2016.
  • Flato [2011] Gregory M. Flato. Earth system models: an overview. WIREs Climate Change, 2011.
  • Scholze et al. [2012] Marko Scholze, J. Icarus Allen, William J. Collins, Sarah E. Cornell, Chris Huntingford, Manoj M. Joshi, Jason A. Lowe, Robin S. Smith, and Oliver Wild. Earth system models. Cambridge University Press, 2012.
  • Muandet et al. [2016a] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522, 2016a.
  • Grünewälder et al. [2012] Steffen Grünewälder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimilano Pontil. Conditional Mean Embeddings as Regressors. In Proceedings of the 29th International Coference on International Conference on Machine Learning, 2012.
  • Szabó et al. [2016] Zoltán Szabó, Bharath K. Sriperumbudur, Barnabás Póczos, and Arthur Gretton. Learning theory for distribution regression. Journal of Machine Learning Research, 2016.
  • Singh et al. [2019] Rahul Singh, Maneesh Sahani, and Arthur Gretton. Kernel instrumental variable regression. In Advances in Neural Information Processing Systems, 2019.
  • Song et al. [2013] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 2013.
  • Fukumizu et al. [2004] Kenji Fukumizu, Francis R Bach, and Michael I Jordan. Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces. Journal of Machine Learning Research, 2004.
  • Fukumizu et al. [2008] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, 2008.
  • Sriperumbudur et al. [2012] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 2012.
  • Song et al. [2009] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, 2009.
  • Larkin [1972] FM Larkin. Gaussian measure in Hilbert space and applications in numerical analysis. The Rocky Mountain Journal of Mathematics, 1972.
  • Briol et al. [2019] François-Xavier Briol, Chris J Oates, Mark Girolami, Michael A Osborne, Dino Sejdinovic, et al. Probabilistic integration: A role in statistical computation? Statistical Science, 2019.
  • Rudner et al. [2020] Tim GJ Rudner, Dino Sejdinovic, and Yarin Gal. Inter-domain deep Gaussian processes. In International Conference on Machine Learning, 2020.
  • Muandet et al. [2016b] Krikamol Muandet, Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, and Bernhard Schölkopf. Kernel mean shrinkage estimators. Journal of Machine Learning Research, 2016b.
  • Paulsen and Raghupathi [2016] Vern Paulsen and Mrinal Raghupathi. An introduction to the theory of reproducing kernel Hilbert spaces. Cambridge university press, 2016.
  • Caponnetto and De Vito [2007] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 2007.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. 2019.
  • Gardner et al. [2018] Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, 2018.
  • Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR, 2015.
  • Watson-Parris et al. [2016] D. Watson-Parris, N. Schutgens, N. Cook, Z. Kipling, P. Kershaw, E. Gryspeerdt, B. Lawrence, and P. Stier. Community Intercomparison Suite (CIS) v1.4.0: a tool for intercomparing models and observations. Geoscientific Model Development, 2016.
  • Roberts [2018 v20180730] Malcolm Roberts. MOHC HadGEM3-GC31-HM model output prepared for CMIP6 HighResMIP hist-1950. Earth System Grid Federation, 2018 v20180730. doi: 10.22033/ESGF/CMIP6.6040.
  • Voldoire [2019 v20190221] Aurore Voldoire. CNRM-CERFACS CNRM-CM6-1-HR model output prepared for CMIP6 HighResMIP hist-1950. Earth System Grid Federation, 2019 v20190221. doi: 10.22033/ESGF/CMIP6.4040.
  • Rahimi et al. [2007] Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007.
  • Kallenberg [2002] Olav Kallenberg. Foundations of Modern Probability. Springer, 2002.
  • Micchelli and Pontil [2005] Charles A Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural computation, 2005.
  • Law et al. [2019] Ho Chung Law, Peilin Zhao, Leung Sing Chan, Junzhou Huang, and Dino Sejdinovic. Hyperparameter learning via distributional transfer. In Advances in Neural Information Processing Systems, 2019.
  • Steinwart and Christmann [2008] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Publishing Company, Incorporated, 2008.

Appendix A Proofs

A.1 Proofs of Section 3

Proposition A.1.

Let h:𝒳h:{\mathcal{X}}\rightarrow{\mathbb{R}} such that 𝔼[|h(X)|]<{\mathbb{E}}[|h(X)|]<\infty. Then, {y𝒴𝔼[|h(X)||Y=y]<}\{y\in{\mathcal{Y}}\mid{\mathbb{E}}[|h(X)||Y=y]<\infty\} is a full measure set with respect to Y{\mathbb{P}}_{Y}.

Proof.

Since 𝒳{\mathcal{X}} is a Borel space and 𝒴{\mathcal{Y}} is measurable, the existence of a Y{\mathbb{P}}_{Y}-a.e. regular conditional probability distribution is guaranteed by [45, Theorem 6.3]. Now suppose 𝔼[|h(X)|]<{\mathbb{E}}[|h(X)|]<\infty and let 𝒴o={y𝒴𝔼[|h(X)||Y=y]<}{\mathcal{Y}}^{o}=\{y\in{\mathcal{Y}}\mid{\mathbb{E}}[|h(X)||Y=y]<\infty\}. Since 𝔼[|h(X)|]=𝔼[𝔼[|h(X)|Y]]{\mathbb{E}}[|h(X)|]={\mathbb{E}}\left[{\mathbb{E}}[|h(X)|\mid Y]\right], the conditional expectation 𝔼[|h(X)|Y]{\mathbb{E}}[|h(X)|\mid Y] must have finite expectation almost everywhere, i.e. Y(𝒴o)=1{\mathbb{P}}_{Y}({\mathcal{Y}}^{o})=1. ∎

Proposition 3.2.

Suppose 𝔼[|m(X)|]<{\mathbb{E}}[|m(X)|]<\infty and 𝔼[kXk]<{\mathbb{E}}[\|k_{X}\|_{k}]<\infty and let (X,Y)XY(X^{\prime},Y^{\prime})\sim{\mathbb{P}}_{XY}. Then gg is a Gaussian process g𝒢𝒫(ν,q)g\sim{\mathcal{G}}{\mathcal{P}}(\nu,q) a.s. , specified by

ν(y)=𝔼[m(X)|Y=y]q(y,y)=𝔼[k(X,X)|Y=y,Y=y]\nu(y)={\mathbb{E}}[m(X)|Y=y]\qquad\quad q(y,y^{\prime})={\mathbb{E}}[k(X,X^{\prime})|Y=y,Y^{\prime}=y^{\prime}] (13)

y,y𝒴\forall y,y^{\prime}\in{\mathcal{Y}}. Furthermore, q(y,y)=μX|Y=y,μX|Y=ykq(y,y^{\prime})=\langle\mu_{X|Y=y},\mu_{X|Y=y^{\prime}}\rangle_{k} a.s.

Proof of Proposition 3.2.

We will assume for the sake of simplicity that m=0m=0 in the following derivations and will return to the case of an uncentered GP at the end of the proof.

Show that g(y)g(y) is in a space of Gaussian random variables

Let (Ω,,)(\Omega,{\mathcal{F}},{\mathbb{P}}) denote a probability space and L2(Ω,)L^{2}(\Omega,{\mathbb{P}}) the space of square integrable random variables endowed with standard inner product. x𝒳\forall x\in{\mathcal{X}}, since f(x)f(x) is Gaussian, then f(x)L2(Ω,)f(x)\in L^{2}(\Omega,{\mathbb{P}}). We can hence define 𝒮(f){\mathcal{S}}(f) as the closure in L2(Ω,)L^{2}(\Omega,{\mathbb{P}}) of the vector space spanned by ff, i.e. 𝒮(f):=Span¯{f(x):x𝒳}{\mathcal{S}}(f):=\overline{\operatorname{Span}}\left\{{f(x)\,:\,x\in{\mathcal{X}}}\right\}.

Elements of 𝒮(f){\mathcal{S}}(f) can be written as limits of centered Gaussian random variables, hence when their variance sequences converge they are normally distributed. Let T𝒮(f)T\in{\mathcal{S}}(f)^{\perp}; then we have 𝔼[Tf(x)]=0{\mathbb{E}}[Tf(x)]=0. For y𝒴y\in{\mathcal{Y}}, we also have

𝔼[Tg(y)]=𝔼[𝒳Tf(x)dX|Y=y]\displaystyle{\mathbb{E}}[Tg(y)]={\mathbb{E}}\left[\int_{\mathcal{X}}Tf(x)\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}\right] (14)

In order to switch the order of integration, we need to show that the double integral converges absolutely.

𝒳𝔼[|Tf(x)|]dX|Y=y(x)\displaystyle\int_{\mathcal{X}}{\mathbb{E}}[|Tf(x)|]\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x) 𝒳𝔼[T2]𝔼[f(x)2]dX|Y=y(x)\displaystyle\leq\int_{\mathcal{X}}\sqrt{{\mathbb{E}}[T^{2}]{\mathbb{E}}[f(x)^{2}]}\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x) (15)
=𝔼[T2]𝒳kxkdX|Y=y(x)\displaystyle=\sqrt{{\mathbb{E}}[T^{2}]}\int_{\mathcal{X}}\|k_{x}\|_{k}\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x) (16)
=𝔼[T2]𝔼[kXk|Y=y]\displaystyle=\sqrt{{\mathbb{E}}[T^{2}]}{\mathbb{E}}[\|k_{X}\|_{k}|Y=y] (17)

Since TL2(Ω,)T\in L^{2}(\Omega,{\mathbb{P}}), 𝔼[T2]<{\mathbb{E}}[T^{2}]<\infty. Moreover, as we assume that 𝔼[kXk]<{\mathbb{E}}[\|k_{X}\|_{k}]<\infty, Proposition A.1 gives that 𝔼[kXk|Y=y]<{\mathbb{E}}[\|k_{X}\|_{k}|Y=y]<\infty a.s. We can thus apply Fubini’s theorem and obtain

𝔼[Tg(y)]=𝒳𝔼[Tf(x)]dX|Y=y(x)=0a.s.\displaystyle{\mathbb{E}}[Tg(y)]=\int_{\mathcal{X}}{\mathbb{E}}[Tf(x)]\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x)=0\enspace\text{a.s.} (18)

As this holds for any T𝒮(f)T\in{\mathcal{S}}(f)^{\perp}, we conclude that g(y)(𝒮(f)) a.s.g(y)𝒮(f) a.s.g(y)\in\left({\mathcal{S}}(f)^{\perp}\right)^{\perp}\text{ a.s.}\Rightarrow g(y)\in{\mathcal{S}}(f)\text{ a.s.} We cannot yet claim that g(y)g(y) is Gaussian, since we do not know whether it arises as the limit of a sequence of Gaussian variables with a converging variance sequence. We now have to prove that g(y)g(y) has finite variance.

Show that g(y)g(y) has finite variance

We proceed by computing the covariance between g(y)g(y) and g(y)g(y^{\prime}); this is more general and yields the variance as the special case y=yy=y^{\prime}.

Let y,y𝒴y,y^{\prime}\in{\mathcal{Y}}, the covariance of g(y)g(y) and g(y)g(y^{\prime}) is given by

q(y,y)\displaystyle q(y,y^{\prime}) =𝔼[g(y)g(y)]𝔼[g(y)]𝔼[g(y)]\displaystyle={\mathbb{E}}[g(y)g(y^{\prime})]-{\mathbb{E}}[g(y)]{\mathbb{E}}[g(y^{\prime})] (19)
=𝔼[𝒳𝒳f(x)f(x)dX|Y=y(x)dX|Y=y(x)]\displaystyle={\mathbb{E}}\left[\int_{\mathcal{X}}\int_{\mathcal{X}}f(x)f(x^{\prime})\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x)\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y^{\prime}}(x^{\prime})\right] (20)
𝔼[𝒳f(x)dX|Y=y(x)]𝔼[𝒳f(x)dX|Y=y(x)]\displaystyle-{\mathbb{E}}\left[\int_{\mathcal{X}}f(x)\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x)\right]{\mathbb{E}}\left[\int_{\mathcal{X}}f(x^{\prime})\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y^{\prime}}(x^{\prime})\right] (21)

Choosing TT as a constant random variable in the above, we can show that 𝒳𝔼[|f(x)|]dX|Y=y(x)<\int_{\mathcal{X}}{\mathbb{E}}[|f(x)|]\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x)<\infty a.s. We can hence apply Fubini’s theorem to switch integration order in the mean terms (21) and obtain that 𝔼[g(y)]=0{\mathbb{E}}[g(y)]=0 since ff is centered.

To apply Fubini’s theorem to (20), we need to show that the triple integration absolutely converges. Let x,x𝒳x,x^{\prime}\in{\mathcal{X}}, we know that 𝔼[|f(x)f(x)|]𝔼[f(x)2]𝔼[f(x)2]=kxkkxk{\mathbb{E}}[|f(x)f(x^{\prime})|]\leq\sqrt{{\mathbb{E}}[f(x)^{2}]{\mathbb{E}}[f(x^{\prime})^{2}]}=\|k_{x}\|_{k}\|k_{x^{\prime}}\|_{k}. Using similar arguments as above, we obtain

𝒳𝒳𝔼[|f(x)f(x)|]dX|Y=y(x)dX|Y=y(x)𝔼[kXk|Y=y]𝔼[kXk|Y=y]< a.s.\displaystyle\int_{\mathcal{X}}\int_{\mathcal{X}}{\mathbb{E}}[|f(x)f(x^{\prime})|]\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x)\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y^{\prime}}(x^{\prime})\leq{\mathbb{E}}[\|k_{X}\|_{k}|Y=y]{\mathbb{E}}[\|k_{X}\|_{k}|Y=y^{\prime}]<\infty\text{ a.s.} (22)

We can thus apply Fubini’s theorem which yields

q(y,y)\displaystyle q(y,y^{\prime}) =𝒳𝒳𝔼[f(x)f(x)]dX|Y=y(x)dX|Y=y(x)\displaystyle=\int_{\mathcal{X}}\int_{\mathcal{X}}{\mathbb{E}}[f(x)f(x^{\prime})]\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x)\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y^{\prime}}(x^{\prime}) (23)
=𝒳𝒳Cov(f(x),f(x))k(x,x)dX|Y=y(x)dX|Y=y(x)\displaystyle=\int_{\mathcal{X}}\int_{\mathcal{X}}\underbrace{\operatorname{Cov}(f(x),f(x^{\prime}))}_{k(x,x^{\prime})}\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x)\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y^{\prime}}(x^{\prime}) (24)
=𝔼[k(X,X)|Y=y,Y=y]\displaystyle={\mathbb{E}}[k(X,X^{\prime})|Y=y,Y^{\prime}=y^{\prime}] (25)
𝔼[kXk|Y=y]𝔼[kXk|Y=y]<a.s.\displaystyle\leq{\mathbb{E}}[\|k_{X}\|_{k}|Y=y]{\mathbb{E}}[\|k_{X}\|_{k}|Y=y^{\prime}]<\infty\enspace\text{a.s.} (26)

where (X,Y)(X^{\prime},Y^{\prime}) denote random variables with the same joint distribution as (X,Y)(X,Y), as defined in the proposition.

Since g(y)𝒮(f)g(y)\in{\mathcal{S}}(f) and has finite variance q(y,y)q(y,y) a.s., it is a centered Gaussian random variable a.s. Furthermore, as this holds for any y𝒴y\in{\mathcal{Y}}, any finite subset of {g(y):y𝒴}\{g(y)\,:\,y\in{\mathcal{Y}}\} follows a multivariate normal distribution, which shows that gg is a centered Gaussian process on 𝒴{\mathcal{Y}} whose covariance function is specified by qq.

Uncentered case m0m\neq 0

We now return to an uncentered GP prior on ff under the assumption that 𝔼[|m(X)|]<{\mathbb{E}}[|m(X)|]<\infty. By Proposition A.1, we get that 𝔼[|m(X)||Y=y]<{\mathbb{E}}[|m(X)|\,|Y=y]<\infty a.s. for y𝒴y\in{\mathcal{Y}}.

Let ν:y𝔼[m(X)|Y=y]\nu:y\mapsto{\mathbb{E}}[m(X)|Y=y]. We can clearly rewrite gg as the sum of ν\nu and a centered GP on 𝒴{\mathcal{Y}}

g(y)=ν(y)+𝒳(f(x)m(x))dX|Y=y(x),y𝒴g(y)=\nu(y)+\int_{\mathcal{X}}(f(x)-m(x))\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x),\qquad\forall y\in{\mathcal{Y}} (27)

which is well-defined almost surely.

It follows that 𝔼[g(y)]=𝔼[ν(y)]+0=ν(y){\mathbb{E}}[g(y)]={\mathbb{E}}[\nu(y)]+0=\nu(y). Moreover, since ν(y)\nu(y) is a constant shift, the covariance is not affected and has the same expression as for the centered GP. Since this holds for any y𝒴y\in{\mathcal{Y}}, we conclude that g𝒢𝒫(ν,q)g\sim{\mathcal{G}}{\mathcal{P}}(\nu,q) a.s.

Show that q(y,y)=μX|Y=y,μX|Y=ykq(y,y^{\prime})=\langle\mu_{X|Y=y},\mu_{X|Y=y^{\prime}}\rangle_{k}

First, we know by Proposition A.1 that 𝔼[kXk|Y=y]<{\mathbb{E}}[\|k_{X}\|_{k}|Y=y]<\infty Y{\mathbb{P}}_{Y}-a.e. By the triangle inequality, we obtain μX|Y=yk=𝔼[kX|Y=y]k𝔼[kXk|Y=y]<\|\mu_{X|Y=y}\|_{k}=\|{\mathbb{E}}[k_{X}|Y=y]\|_{k}\leq{\mathbb{E}}[\|k_{X}\|_{k}|Y=y]<\infty Y{\mathbb{P}}_{Y}-a.e., and hence μX|Y=y\mu_{X|Y=y} is well-defined up to a set of measure zero with respect to Y{\mathbb{P}}_{Y}.

Using the notation from Proposition 3.2, we can proceed for any y,y𝒴y,y^{\prime}\in{\mathcal{Y}} as

q(y,y)\displaystyle q(y,y^{\prime}) =𝔼[k(X,X)|Y=y,Y=y]\displaystyle={\mathbb{E}}[k(X,X^{\prime})|Y=y,Y^{\prime}=y^{\prime}] (28)
=𝒳𝒳k(x,x)dX|Y=y(x)dX|Y=y(x)\displaystyle=\int_{\mathcal{X}}\int_{\mathcal{X}}k(x,x^{\prime})\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x)\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y^{\prime}}(x^{\prime}) (29)
=𝒳𝒳kx,kxkdX|Y=y(x)dX|Y=y(x)\displaystyle=\int_{\mathcal{X}}\int_{\mathcal{X}}\langle k_{x},k_{x^{\prime}}\rangle_{k}\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x)\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y^{\prime}}(x^{\prime}) (30)
=𝒳kxdX|Y=y(x),𝒳kxdX|Y=y(x)k a.s.\displaystyle=\left\langle\int_{\mathcal{X}}k_{x}\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}(x),\int_{\mathcal{X}}k_{x^{\prime}}\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y^{\prime}}(x^{\prime})\right\rangle_{k}\enspace\text{ a.s.} (31)
=μX|Y=y,μX|Y=yk a.s.\displaystyle=\langle\mu_{X|Y=y},\mu_{X|Y=y^{\prime}}\rangle_{k}\enspace\text{ a.s.} (32)
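To build intuition for Proposition 3.2, the following sketch (our own illustration, not part of the proof) numerically checks the identity q(y,y^{\prime})={\mathbb{E}}[k(X,X^{\prime})|Y=y,Y^{\prime}=y^{\prime}] on a toy example; the squared-exponential kernel, the Gaussian conditionals X|Y=y\sim{\mathcal{N}}(y,0.5^{2}) and all variable names are our own assumptions.

import numpy as np

rng = np.random.default_rng(0)
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)   # squared-exponential kernel

y1, y2 = -1.0, 2.0
n_cond, n_draws = 500, 4000

# Fixed samples approximating the conditionals P(X | Y=y1) and P(X | Y=y2)
x1 = rng.normal(y1, 0.5, n_cond)
x2 = rng.normal(y2, 0.5, n_cond)
x_all = np.concatenate([x1, x2])

# Draw f ~ GP(0, k) jointly at all locations; g(y) is approximated by the
# average of f over the corresponding conditional samples
K = k(x_all, x_all) + 1e-6 * np.eye(x_all.size)
L = np.linalg.cholesky(K)
f = L @ rng.standard_normal((x_all.size, n_draws))   # one centered GP draw per column
g1 = f[:n_cond].mean(axis=0)                         # g(y1) for each draw
g2 = f[n_cond:].mean(axis=0)                         # g(y2) for each draw

print(np.mean(g1 * g2))     # empirical covariance of g(y1) and g(y2) across draws
print(k(x1, x2).mean())     # Monte Carlo estimate of E[k(X, X') | Y=y1, Y'=y2]

The two printed values agree up to Monte Carlo error, in line with the expression of qq derived above.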

Proposition 3.3.

Given aggregate observations 𝐳~\tilde{\bf z} with homoscedastic noise σ2\sigma^{2}, the deconditional posterior of ff is defined as the Gaussian process f|𝐳~𝒢𝒫(md,kd)f|\tilde{\bf z}\sim{\mathcal{G}}{\mathcal{P}}(m_{\mathrm{d}},k_{\mathrm{d}}) where

md(x)\displaystyle m_{\mathrm{d}}(x) =m(x)+kxCX|Y𝚿𝐲~(𝐐𝐲~𝐲~+σ2𝐈M)1(𝐳~ν(𝐲~)),\displaystyle=m(x)+k_{x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}({\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}(\tilde{\bf z}-\nu(\tilde{\bf y})), (33)
kd(x,x)\displaystyle k_{\mathrm{d}}(x,x^{\prime}) =k(x,x)kxCX|Y𝚿𝐲~(𝐐𝐲~𝐲~+σ2𝐈M)1𝚿𝐲~CX|Ykx.\displaystyle=k(x,x^{\prime})-k_{x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}({\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}{\bf\Psi}_{\tilde{\bf y}}^{\top}C_{X|Y}^{\top}k_{x^{\prime}}. (34)
Proof of Proposition 3.3.

Recall that

[f(𝐱)𝐳~]𝐲,𝐲~𝒩([m(𝐱)ν(𝐲~)],[𝐊𝐱𝐱𝚼𝚼𝐐𝐲~𝐲~+σ2𝐈M]).\begin{bmatrix}f({\bf x})\\ \tilde{\bf z}\end{bmatrix}\mid{{\bf y},\tilde{\bf y}}\sim{\mathcal{N}}\left(\begin{bmatrix}m({\bf x})\\ \nu(\tilde{\bf y})\end{bmatrix},\begin{bmatrix}{\bf K}_{{\bf x}{\bf x}}&\boldsymbol{\Upsilon}\\ \boldsymbol{\Upsilon}^{\top}&{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M}\end{bmatrix}\right). (35)

where 𝚼=Cov(f(𝐱),𝐳~)=𝚽𝐱CX|Y𝚿𝐲~\boldsymbol{\Upsilon}=\operatorname{Cov}(f({\bf x}),\tilde{\bf z})={\bf\Phi}_{\bf x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}.

Applying Gaussian conditioning, we obtain that

f(𝐱)𝐳~,𝐲,𝐲~𝒩(\displaystyle f({\bf x})\mid\tilde{\bf z},{\bf y},\tilde{\bf y}\sim{\mathcal{N}}( m(𝐱)+𝚼(𝐐𝐲~𝐲~+σ2𝐈M)1(𝐳~ν(𝐲~)),\displaystyle m({\bf x})+\boldsymbol{\Upsilon}({\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}(\tilde{\bf z}-\nu(\tilde{\bf y})), (36)
𝐊𝐱𝐱𝚼(𝐐𝐲~𝐲~+σ2𝐈M)1𝚼)\displaystyle{\bf K}_{{\bf x}{\bf x}}-\boldsymbol{\Upsilon}({\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}\boldsymbol{\Upsilon}^{\top}) (37)

Since the latter holds for any input 𝐱𝒳N{\bf x}\in{\mathcal{X}}^{N}, the Kolmogorov extension theorem implies that ff conditioned on the data 𝐳~,𝐲~\tilde{\bf z},\tilde{\bf y} is a draw from a GP. We denote it f|𝐳~𝒢𝒫(md,kd)f|\tilde{\bf z}\sim{\mathcal{G}}{\mathcal{P}}(m_{\mathrm{d}},k_{\mathrm{d}}) and it is specified by

md(x)\displaystyle m_{\mathrm{d}}(x) =m(x)+kxCX|Y𝚿𝐲~(𝐐𝐲~𝐲~+σ2𝐈M)1(𝐳~ν(𝐲~)),\displaystyle=m(x)+k_{x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}({\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}(\tilde{\bf z}-\nu(\tilde{\bf y})), (38)
kd(x,x)\displaystyle k_{\mathrm{d}}(x,x^{\prime}) =k(x,x)kxCX|Y𝚿𝐲~(𝐐𝐲~𝐲~+σ2𝐈M)1𝚿𝐲~CX|Ykx.\displaystyle=k(x,x^{\prime})-k_{x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}({\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}{\bf\Psi}_{\tilde{\bf y}}^{\top}C_{X|Y}^{\top}k_{x^{\prime}}. (39)

Note that we abuse notation

"kxCX|Y𝚿𝐲~"\displaystyle"k_{x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}" =[kx,CX|Yy~1kkx,CX|Yy~Mk]\displaystyle=\begin{bmatrix}\langle k_{x},C_{X|Y}\ell_{\tilde{y}_{1}}\rangle_{k}&\ldots&\langle k_{x},C_{X|Y}\ell_{\tilde{y}_{M}}\rangle_{k}\end{bmatrix} (40)
=[kx,μX|Y=y~1kkx,μX|Y=y~Mk]\displaystyle=\begin{bmatrix}\langle k_{x},\mu_{X|Y=\tilde{y}_{1}}\rangle_{k}&\ldots&\langle k_{x},\mu_{X|Y=\tilde{y}_{M}}\rangle_{k}\end{bmatrix} (41)
=[Cov(f(x),g(y~1))Cov(f(x),g(y~M))].\displaystyle=\begin{bmatrix}\operatorname{Cov}(f(x),g(\tilde{y}_{1}))&\ldots&\operatorname{Cov}(f(x),g(\tilde{y}_{M}))\end{bmatrix}. (42)

A.2 Proofs of Section 4

Proposition 4.1 (Empirical DMO as vector-valued regressor).

The minimiser of the empirical reconstruction risk is the empirical DMO, i.e. D^X|Y=argminDΓ^d(D)\hat{D}_{X|Y}=\arg\min_{D\in{\mathcal{H}}_{\Gamma}}{\hat{\mathcal{E}}_{\mathrm{d}}(D)}

Proof of Proposition 4.1.

Let DΓD\in{\mathcal{H}}_{\Gamma}, we recall the form of the regularised empirical objective

^d(D)=1Mj=1My~jDC^X|Yy~j2+ϵDΓ2\hat{\mathcal{E}}_{\mathrm{d}}(D)=\frac{1}{M}\sum_{j=1}^{M}\|\ell_{\tilde{y}_{j}}-D\hat{C}_{X|Y}\ell_{\tilde{y}_{j}}\|^{2}_{\ell}+\epsilon\|D\|^{2}_{\Gamma} (43)

By [46, Theorem 4.1], if D^argminDΓ^d(D)\hat{D}\in\underset{D\in{\mathcal{H}}_{\Gamma}}{\arg\min}\,\hat{\mathcal{E}}_{\mathrm{d}}(D), then it is unique and has form

D^=j=1MΓC^X|Yy~jcj\hat{D}=\sum_{j=1}^{M}\Gamma_{\hat{C}_{X|Y}\ell_{\tilde{y}_{j}}}c_{j} (44)

where ΓC^X|Yy~j:Γ\Gamma_{\hat{C}_{X|Y}\ell_{\tilde{y}_{j}}}:{\mathcal{H}}_{\ell}\rightarrow{\mathcal{H}}_{\Gamma} is the vector-valued kernel Γ\Gamma’s feature map indexed by C^X|Yy~j\hat{C}_{X|Y}\ell_{\tilde{y}_{j}}, such that for any hΓh\in{\mathcal{H}}_{\Gamma} and gg\in{\mathcal{H}}_{\ell}, we have h,ΓC^X|Yy~jgΓ=h(C^X|Yy~j),g\langle h,\Gamma_{\hat{C}_{X|Y}\ell_{\tilde{y}_{j}}}g\rangle_{\Gamma}=\langle h(\hat{C}_{X|Y}\ell_{\tilde{y}_{j}}),g\rangle_{\ell}. (see [35] for a detailed review of vector-valued RKHS). Furthermore, coefficients c1,,cMc_{1},\ldots,c_{M}\in{\mathcal{H}}_{\ell} are the unique solutions to

i=1M(Γ(C^X|Yy~i,C^X|Yy~j)+Mϵδij)ci=y~j\sum_{i=1}^{M}\left(\Gamma(\hat{C}_{X|Y}\ell_{\tilde{y}_{i}},\hat{C}_{X|Y}\ell_{\tilde{y}_{j}})+M\epsilon\delta_{ij}\right)c_{i}=\ell_{\tilde{y}_{j}} (45)

Since

Γ(C^X|Yy~i,C^X|Yy~j)=C^X|Yy~i,C^X|Yy~jkId=q^(y~i,y~j)Id\Gamma(\hat{C}_{X|Y}\ell_{\tilde{y}_{i}},\hat{C}_{X|Y}\ell_{\tilde{y}_{j}})=\langle\hat{C}_{X|Y}\ell_{\tilde{y}_{i}},\hat{C}_{X|Y}\ell_{\tilde{y}_{j}}\rangle_{k}\operatorname{Id}_{{\mathcal{H}}_{\ell}}=\hat{q}(\tilde{y}_{i},\tilde{y}_{j})\operatorname{Id}_{{\mathcal{H}}_{\ell}} (46)

where Id\operatorname{Id}_{{\mathcal{H}}_{\ell}} denotes the identity operator on {\mathcal{H}}_{\ell}, the above simplifies to

i=1M(q^(y~i,y~j)+Mϵδij)ci=y~j1jM\displaystyle\sum_{i=1}^{M}\left(\hat{q}(\tilde{y}_{i},\tilde{y}_{j})+M\epsilon\delta_{ij}\right)c_{i}=\ell_{\tilde{y}_{j}}\quad\forall 1\leq j\leq M (47)
(𝐐^𝐲~𝐲~+Mϵ𝐈M)𝐜=𝚿𝐲~\displaystyle\Leftrightarrow\left(\hat{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+M\epsilon{\bf I}_{M}\right){\bf c}^{\top}={\bf\Psi}_{\tilde{\bf y}}^{\top} (48)
𝐜=𝚿𝐲~(𝐐^𝐲~𝐲~+Mϵ𝐈M)1\displaystyle\Leftrightarrow{\bf c}={\bf\Psi}_{\tilde{\bf y}}\left(\hat{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+M\epsilon{\bf I}_{M}\right)^{-1} (49)

where 𝐜=[c1cM]{\bf c}=\begin{bmatrix}c_{1}&\ldots&c_{M}\end{bmatrix}.

Since for any fkf\in{\mathcal{H}}_{k} and gg\in{\mathcal{H}}_{\ell}, our choice of kernel gives Γfg=gf\Gamma_{f}g=g\otimes f, plugging (49) into (44) we obtain

D^\displaystyle\hat{D} =j=1MΓC^X|Yy~jcj\displaystyle=\sum_{j=1}^{M}\Gamma_{\hat{C}_{X|Y}\ell_{\tilde{y}_{j}}}c_{j} (50)
=j=1McjC^X|Yy~j\displaystyle=\sum_{j=1}^{M}c_{j}\otimes\hat{C}_{X|Y}\ell_{\tilde{y}_{j}} (51)
=𝐜[C^X|Y𝚿𝐲~]\displaystyle={\bf c}\left[\hat{C}_{X|Y}{\bf\Psi}_{\tilde{\bf y}}\right]^{\top} (52)
=[𝚿𝐲~(𝐐^𝐲~𝐲~+Mϵ𝐈M)1][C^X|Y𝚿𝐲~]\displaystyle=\left[{\bf\Psi}_{\tilde{\bf y}}\left(\hat{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+M\epsilon{\bf I}_{M}\right)^{-1}\right]\left[\hat{C}_{X|Y}{\bf\Psi}_{\tilde{\bf y}}\right]^{\top} (53)
=𝚿𝐲~(𝐐^𝐲~𝐲~+Mϵ𝐈M)1𝚿𝐲~C^X|Y\displaystyle={\bf\Psi}_{\tilde{\bf y}}\left(\hat{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+M\epsilon{\bf I}_{M}\right)^{-1}{\bf\Psi}_{\tilde{\bf y}}^{\top}\hat{C}_{X|Y}^{\top} (54)
=𝚿𝐲~(𝐐^𝐲~𝐲~+Mϵ𝐈M)1𝚿𝐲~𝚿𝐲(𝐋𝐲𝐲+Nλ𝐈N)1𝚽𝐱\displaystyle={\bf\Psi}_{\tilde{\bf y}}\left(\hat{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+M\epsilon{\bf I}_{M}\right)^{-1}{\bf\Psi}_{\tilde{\bf y}}^{\top}{\bf\Psi}_{{\bf y}}\left({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N}\right)^{-1}{\bf\Phi}_{\bf x} (55)
=𝚿𝐲~(𝐐^𝐲~𝐲~+Mϵ𝐈M)1𝐀𝚽𝐱\displaystyle={\bf\Psi}_{\tilde{\bf y}}\left(\hat{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+M\epsilon{\bf I}_{M}\right)^{-1}{\bf A}{\bf\Phi}_{\bf x} (56)
=D^X|Y\displaystyle=\hat{D}_{X|Y} (57)

which concludes the proof. ∎

Theorem 4.2 (Empirical DMO Convergence Rate).

Denote DY=argminDΓd(D)D_{{\mathbb{P}}_{Y}}=\arg\min_{D\in{\mathcal{H}}_{\Gamma}}{\mathcal{E}}_{\mathrm{d}}(D). Assume the assumptions stated in Appendix D are satisfied. In particular, let (b,c)(b,c) and (0,c)(0,c^{\prime}) be the parameters of the restricted classes of distributions for Y{\mathbb{P}}_{Y} and XY{\mathbb{P}}_{XY} respectively, and let ι]0,1]\iota\in]0,1] be the Hölder continuity exponent in Γ{\mathcal{H}}_{\Gamma}. Then, if we choose λ=N1c+1\lambda=N^{-\frac{1}{c^{\prime}+1}} and N=Ma(c+1)ι(c1)N=M^{\frac{a(c^{\prime}+1)}{\iota(c^{\prime}-1)}} where a>0a>0, we have the following result:

  • If ab(c+1)bc+1a\leq\frac{b(c+1)}{bc+1}, then d(D^X|Y)d(DY)=𝒪(Macc+1){\mathcal{E}}_{\mathrm{d}}(\hat{D}_{X|Y})-{\mathcal{E}}_{\mathrm{d}}(D_{{\mathbb{P}}_{Y}})={\mathcal{O}}(M^{\frac{-ac}{c+1}}) with ϵ=Mac+1\epsilon=M^{\frac{-a}{c+1}}

  • If ab(c+1)bc+1a\geq\frac{b(c+1)}{bc+1}, then d(D^X|Y)d(DY)=𝒪(Mbcbc+1){\mathcal{E}}_{\mathrm{d}}(\hat{D}_{X|Y})-{\mathcal{E}}_{\mathrm{d}}(D_{{\mathbb{P}}_{Y}})={\mathcal{O}}(M^{\frac{-bc}{bc+1}}) with ϵ=Mbbc+1\epsilon=M^{\frac{-b}{bc+1}}

Proof of Theorem 4.2.

In Appendix D, we present Theorem D.4 which is a detailed version of this result with all assumptions explicitly stated. The proof of Theorem D.4 constitutes the proof of this result. ∎

Appendix B Variational formulation of the deconditional posterior

Inference computational complexity is 𝒪(M3){\mathcal{O}}(M^{3}) for the posterior mean and 𝒪(N3+M3){\mathcal{O}}(N^{3}+M^{3}) for the posterior covariance. To scale to large datasets, we introduce in the following a variational formulation as a scalable approximation to the deconditional posterior f(𝐱)|𝐳~f({\bf x})|\tilde{\bf z}. Without loss of generality, we assume in the following that ff is centered, i.e. m=0m=0.

B.1 Variational formulation

We consider a set of dd inducing locations 𝐰=[w1wd]𝒳d{\bf w}=\begin{bmatrix}w_{1}&\ldots&w_{d}\end{bmatrix}^{\top}\in{\mathcal{X}}^{d} and define inducing points as the Gaussian vector 𝐮:=f(𝐰)𝒩(0,𝐊𝐰𝐰){\bf u}:=f({\bf w})\sim{\mathcal{N}}(0,{\bf K}_{\bf ww}), where 𝐊𝐰𝐰:=k(𝐰,𝐰){\bf K}_{\bf ww}:=k({\bf w,w}). We place a dd-dimensional variational distribution q(𝐮)=𝒩(𝐮|𝜼,𝚺)q({\bf u})={\mathcal{N}}({\bf u|\boldsymbol{\eta},\boldsymbol{\Sigma}}) over the inducing points and define q(𝐟):=p(𝐟|𝐮)q(𝐮)d𝐮q({\bf f}):=\int p({\bf f}|{\bf u})q({\bf u})\,{\mathrm{d}}{\bf u} as an approximation of the deconditional posterior p(𝐟|𝐳)p({\bf f}|{\bf z}). The deconditional posterior can thus be approximated by optimising the variational parameters 𝜼\boldsymbol{\eta}, 𝚺\boldsymbol{\Sigma} to maximise the evidence lower bound (ELBO) objective given by

ELBO(q)=𝔼q(𝐟)[logp(𝐳~|𝐟)]KL(q(𝐮)p(𝐮)).\operatorname{ELBO}(q)={\mathbb{E}}_{q({\bf f})}[\log p({\bf\tilde{z}|f})]-\operatorname{KL}(q({\bf u})\|p({\bf u})). (58)

As both qq and pp are Gaussian, the Kullback-Leibler divergence admits a closed form. The expected log-likelihood term decomposes as

𝔼q(𝐟)[logp(𝐳~|𝐟)]=M2log(2πσ2)12σ2(tr(𝐀𝚺¯𝐀)+𝐳~A𝜼¯22){\mathbb{E}}_{q({\bf f})}[\log p({\bf\tilde{z}|f})]=-\frac{M}{2}\log(2\pi\sigma^{2})-\frac{1}{2\sigma^{2}}\left(\operatorname{tr}\left({\bf A^{\top}{\bf\bar{\boldsymbol{\Sigma}}}A}\right)+\left\|{\tilde{\bf z}-A^{\top}{\bf\bar{\boldsymbol{\eta}}}}\right\|_{2}^{2}\right) (59)

where 𝜼¯\bar{\boldsymbol{\eta}} and 𝚺¯\bar{\boldsymbol{\Sigma}} are the parameters of the posterior variational distribution q(𝐟)=𝒩(𝐟|𝜼¯,𝚺¯)q({\bf f})={\mathcal{N}}({\bf f}|\bar{\boldsymbol{\eta}},\bar{\boldsymbol{\Sigma}}) given by

𝜼¯=𝐊𝐱𝐰𝐊𝐰𝐰𝟏𝜼𝚺¯=𝐊𝐱𝐱𝐊𝐱𝐰[𝐊𝐰𝐰𝟏𝐊𝐰𝐰𝟏𝚺𝐊𝐰𝐰𝟏]𝐊𝐰𝐱{\bf\bar{\boldsymbol{\eta}}=K_{xw}K_{ww}^{-1}\boldsymbol{\eta}}\qquad\quad{\bf\bar{\boldsymbol{\Sigma}}=K_{xx}-K_{xw}\left[K_{ww}^{-1}-K_{ww}^{-1}\boldsymbol{\Sigma}K_{ww}^{-1}\right]K_{wx}} (60)

Given this objective, we can optimise the lower bound with respect to the variational parameters 𝜼,𝚺\boldsymbol{\eta},\boldsymbol{\Sigma}, the noise σ2\sigma^{2} and the parameters of the kernels kk and \ell, with the option of parametrizing these kernels with feature maps given by deep neural networks [47], using for example a stochastic gradient approach. We may also learn the inducing locations 𝐰{\bf w}.

B.2 Details on evidence lower bound derivation

For completeness, we provide here the derivation of the evidence lower bound objective. Let us recall its expression as stated in (58)

ELBO(q)=𝔼q(𝐟)[logp(𝐳~|𝐟)]KL(q(𝐮)p(𝐮))\operatorname{ELBO}(q)={\mathbb{E}}_{q({\bf f})}[\log p({\bf\tilde{z}}|{\bf f})]-\operatorname{KL}(q({\bf u})\|p({\bf u})) (61)

The second term is the Kullback-Leibler divergence between two Gaussian densities, which has a known and tractable closed-form expression.

KL(q(𝐮)p(𝐮))=12[tr(𝐊𝐰𝐰1𝚺)+𝜼𝐊𝐰𝐰1𝜼d+logdet𝐊𝐰𝐰det𝚺]\operatorname{KL}(q({\bf u})\|p({\bf u}))=\frac{1}{2}\left[\operatorname{tr}\left({\bf K}_{\bf ww}^{-1}\boldsymbol{\Sigma}\right)+\boldsymbol{\eta}^{\top}{\bf K}_{\bf ww}^{-1}\boldsymbol{\eta}-d+\log\frac{\det{\bf K}_{\bf ww}}{\det\boldsymbol{\Sigma}}\right] (62)

The first term is the expected log-likelihood and needs to be derived. Using properties of integrals of Gaussian densities, we can start by showing that q(𝐟)q({\bf f}) also corresponds to a Gaussian density, given by

q(𝐟)\displaystyle q({\bf f}) =p(𝐟|𝐮)q(𝐮)d𝐮\displaystyle=\int p({\bf f}|{\bf u})q({\bf u})\,{\mathrm{d}}{\bf u} (63)
=𝒩(𝐟|𝐊𝐱𝐰𝐊𝐰𝐰𝟏𝐮,𝐊𝐱𝐱𝐊𝐱𝐰𝐊𝐰𝐰𝟏𝐊𝐰𝐱)×𝒩(𝐮|𝜼,𝚺)d𝐮\displaystyle=\int{\mathcal{N}}({\bf f}|{\bf K_{xw}K_{ww}^{-1}u},{\bf K_{xx}-K_{xw}K_{ww}^{-1}K_{wx}})\times{\mathcal{N}}({\bf u|\boldsymbol{\eta},\boldsymbol{\Sigma}})\,{\mathrm{d}}{\bf u} (64)
=𝒩(𝐟|𝜼¯,𝚺¯)\displaystyle={\mathcal{N}}({\bf f|\bar{\boldsymbol{\eta}},\bar{\boldsymbol{\Sigma}}}) (65)

where

𝜼¯\displaystyle{\bf\bar{\boldsymbol{\eta}}} =𝐊𝐱𝐰𝐊𝐰𝐰𝟏𝜼\displaystyle={\bf K_{xw}K_{ww}^{-1}\boldsymbol{\eta}} (66)
𝚺¯\displaystyle{\bf\bar{\boldsymbol{\Sigma}}} =𝐊𝐱𝐱𝐊𝐱𝐰[𝐊𝐰𝐰𝟏𝐊𝐰𝐰𝟏𝚺𝐊𝐰𝐰𝟏]𝐊𝐰𝐱\displaystyle={\bf K_{xx}-K_{xw}\left[K_{ww}^{-1}-K_{ww}^{-1}\boldsymbol{\Sigma}K_{ww}^{-1}\right]K_{wx}} (67)

We now derive a closed-form expression of 𝔼q(𝐟)[logp(𝐳~|𝐟)]{\mathbb{E}}_{q({\bf f})}[\log p({\bf\tilde{z}}|{\bf f})] on which a gradient-based optimisation routine can be run. Using Gaussian conditioning on (5), we obtain

p(𝐳~|𝐟)=𝒩(𝐳~|𝚼𝐊𝐱𝐱1𝐟,𝐐𝐲~𝐲~+σ2𝐈M𝚼𝐊𝐱𝐱1𝚼)p({\bf\tilde{z}|f})={\mathcal{N}}({\bf\tilde{z}}|\boldsymbol{\Upsilon}^{\top}{\bf K}_{{\bf x}{\bf x}}^{-1}{\bf f}\;,\;{{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}}+\sigma^{2}{\bf I}_{M}-\boldsymbol{\Upsilon}^{\top}{\bf K_{{\bf x}{\bf x}}}^{-1}\boldsymbol{\Upsilon}) (68)

We notice that 𝚼𝐊𝐱𝐱1=(𝐲~,𝐲)(𝐋𝐲𝐲+λN𝐈N)1𝐊𝐱𝐱𝐊𝐱𝐱𝟏=(𝐲~,𝐲)(𝐋𝐲𝐲+λN𝐈N)1=𝐀⊤\boldsymbol{\Upsilon}^{\top}{\bf K}_{{\bf x}{\bf x}}^{-1}=\ell({\bf\tilde{y},y})({\bf L}_{{\bf y}{\bf y}}+\lambda N{\bf I}_{N})^{-1}{\bf K_{{\bf x}{\bf x}}K_{{\bf x}{\bf x}}^{-1}}=\ell({\bf\tilde{y},y})({\bf L}_{{\bf y}{\bf y}}+\lambda N{\bf I}_{N})^{-1}={\bf A}^{\top}.

Hence we also have 𝚼𝐊𝐱𝐱1𝚼=𝐀𝐊𝐱𝐱𝐀=𝐐𝐲~𝐲~\boldsymbol{\Upsilon}^{\top}{\bf K}_{{\bf x}{\bf x}}^{-1}\boldsymbol{\Upsilon}={\bf A}^{\top}{\bf K}_{{\bf x}{\bf x}}{\bf A}={\bf Q}_{\tilde{\bf y}\tilde{\bf y}}.

We can thus simplify (68) as

p(𝐳~|𝐟)=𝒩(𝐳~|𝐀𝐟,σ2𝐈M)p({\bf\tilde{z}|f})={\mathcal{N}}({\bf\tilde{z}}|{\bf A^{\top}}{\bf f},\sigma^{2}{\bf I}_{M}) (69)

Then,

logp(𝐳~|𝐟)\displaystyle\log p({\bf\tilde{z}|f}) =M2log(2πσ2)12σ2𝐳~𝐀𝐟22\displaystyle=-\frac{M}{2}\log(2\pi\sigma^{2})-\frac{1}{2\sigma^{2}}\left\|{\bf\tilde{z}-A^{\top}f}\right\|^{2}_{2} (70)
𝔼q(𝐟)[logp(𝐳~|𝐟)]\displaystyle\Rightarrow{\mathbb{E}}_{q({\bf f})}[\log p({\bf\tilde{z}|f})] =M2log(2πσ2)12σ2𝔼q(𝐟)[𝐳~𝐀𝐟22]\displaystyle=-\frac{M}{2}\log(2\pi\sigma^{2})-\frac{1}{2\sigma^{2}}{\mathbb{E}}_{q({\bf f})}\left[\left\|{\bf\tilde{z}-A^{\top}f}\right\|^{2}_{2}\right] (71)

Using the trace trick to express the expectation with respect to the posterior variational parameters 𝜼¯,𝚺¯{\bar{\boldsymbol{\eta}},\bar{\boldsymbol{\Sigma}}}, we have

𝔼q(𝐟)[𝐳~𝐀𝐟22]\displaystyle{\mathbb{E}}_{q({\bf f})}\left[\left\|{\bf\tilde{z}-A^{\top}f}\right\|^{2}_{2}\right] =𝔼q(𝐟)[tr((𝐳~𝐀𝐟)(𝐳~𝐀𝐟))]\displaystyle={\mathbb{E}}_{q({\bf f})}\left[\operatorname{tr}\left(\left({\bf\tilde{z}-A^{\top}f}\right)^{\top}\left({\bf\tilde{z}-A^{\top}f}\right)\right)\right] (72)
=𝔼q(𝐟)[tr((𝐳~𝐀𝐟)(𝐳~𝐀𝐟))]\displaystyle={\mathbb{E}}_{q({\bf f})}\left[\operatorname{tr}\left(\left({\bf\tilde{z}-A^{\top}f}\right)\left({\bf\tilde{z}-A^{\top}f}\right)^{\top}\right)\right] (73)
=tr(𝔼q(𝐟)[(𝐳~𝐀𝐟)(𝐳~𝐀𝐟)])\displaystyle=\operatorname{tr}\left({\mathbb{E}}_{q({\bf f})}\left[\left({\bf\tilde{z}-A^{\top}f}\right)\left({\bf\tilde{z}-A^{\top}f}\right)^{\top}\right]\right) (74)

And

𝔼q(𝐟)[(𝐳~𝐀𝐟)(𝐳~𝐀𝐟)]\displaystyle{\mathbb{E}}_{q({\bf f})}\left[\left({\bf\tilde{z}-A^{\top}f}\right)\left({\bf\tilde{z}-A^{\top}f}\right)^{\top}\right] =Cov(𝐳~𝐀𝐟)+𝔼q(𝐟)[𝐳~𝐀𝐟]𝔼q(𝐟)[𝐳~𝐀𝐟]\displaystyle=\operatorname{Cov}({\bf\tilde{z}-A^{\top}f})+{\mathbb{E}}_{q({\bf f})}\left[{\bf\tilde{z}-A^{\top}f}\right]{\mathbb{E}}_{q({\bf f})}\left[{\bf\tilde{z}-A^{\top}f}\right]^{\top} (76)
=𝐀𝚺¯𝐀+(𝐳~𝐀𝜼¯)(𝐳~𝐀𝜼¯)\displaystyle={\bf A^{\top}\bar{\boldsymbol{\Sigma}}A}+\left({\bf\tilde{z}-A^{\top}\bar{\boldsymbol{\eta}}}\right)\left({\bf\tilde{z}-A^{\top}\bar{\boldsymbol{\eta}}}\right)^{\top} (77)

Hence, it comes that

𝔼q(𝐟)[𝐳~𝐀𝐟22]\displaystyle{\mathbb{E}}_{q({\bf f})}\left[\left\|{\bf\tilde{z}-A^{\top}f}\right\|^{2}_{2}\right] =tr(𝐀𝚺¯𝐀)+tr((𝐳~𝐀𝜼¯)(𝐳~𝐀𝜼¯))\displaystyle=\operatorname{tr}\left({\bf A^{\top}\bar{\boldsymbol{\Sigma}}A}\right)+\operatorname{tr}\left(\left({\bf\tilde{z}-A^{\top}\bar{\boldsymbol{\eta}}}\right)\left({\bf\tilde{z}-A^{\top}\bar{\boldsymbol{\eta}}}\right)^{\top}\right) (78)
=tr(𝐀𝚺¯𝐀)+𝐳~𝐀𝜼¯22\displaystyle=\operatorname{tr}\left({\bf A^{\top}\bar{\boldsymbol{\Sigma}}A}\right)+\left\|{\bf\tilde{z}-A^{\top}\bar{\boldsymbol{\eta}}}\right\|_{2}^{2} (79)

which can be efficiently computed as it only requires diagonal terms.

Putting everything together, we obtain

ELBO(q)=M2log(2πσ2)12σ2(tr(𝐀𝚺¯𝐀)+𝐳~𝐀𝜼¯22)KL(q(𝐮)p(𝐮))\operatorname{ELBO}(q)=-\frac{M}{2}\log(2\pi\sigma^{2})-\frac{1}{2\sigma^{2}}\left(\operatorname{tr}\left({\bf A^{\top}\bar{\boldsymbol{\Sigma}}A}\right)+\left\|{\bf\tilde{z}-A^{\top}\bar{\boldsymbol{\eta}}}\right\|_{2}^{2}\right)-\operatorname{KL}(q({\bf u})\|p({\bf u})) (80)
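As an illustration, a minimal sketch of how the objective (80) can be evaluated is given below, assuming the matrix {\bf A} and all kernel matrices have been precomputed; the function name and its arguments are our own and this is not the implementation used for the experiments.

import numpy as np

def elbo(K_xx, K_xw, K_ww, A, z_tilde, eta, Sigma, sigma2, jitter=1e-6):
    # Sketch of the ELBO in (80); (eta, Sigma) are the variational parameters of q(u).
    d, M = K_ww.shape[0], z_tilde.shape[0]
    Kww = K_ww + jitter * np.eye(d)
    Kww_inv = np.linalg.inv(Kww)

    # Posterior variational parameters of q(f), eqs. (66)-(67)
    eta_bar = K_xw @ (Kww_inv @ eta)
    Sigma_bar = K_xx - K_xw @ (Kww_inv - Kww_inv @ Sigma @ Kww_inv) @ K_xw.T

    # Expected log-likelihood term, eq. (79)
    quad = np.trace(A.T @ Sigma_bar @ A) + np.sum((z_tilde - A.T @ eta_bar) ** 2)
    exp_loglik = -0.5 * M * np.log(2 * np.pi * sigma2) - quad / (2 * sigma2)

    # KL(q(u) || p(u)) between two Gaussian densities, eq. (62)
    kl = 0.5 * (np.trace(Kww_inv @ Sigma) + eta @ (Kww_inv @ eta) - d
                + np.linalg.slogdet(Kww)[1] - np.linalg.slogdet(Sigma)[1])
    return exp_loglik - kl

In practice one would implement this quantity in an automatic differentiation framework so that it can be maximised with respect to the variational parameters, the noise and the kernel hyperparameters by stochastic gradient ascent.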

Appendix C Details on Conditional Mean Shrinkage Operator

C.1 Deconditional posterior with Conditional Mean Shrinkage Operator

We recall from Proposition 3.3 that the deconditional posterior is a GP specified by the mean and covariance functions

md(x)\displaystyle m_{\mathrm{d}}(x) =m(x)+kxCX|Y𝚿𝐲~(𝐐𝐲~𝐲~+σ2𝐈M)1(𝐳~ν(𝐲~)),\displaystyle=m(x)+k_{x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}({\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}(\tilde{\bf z}-\nu(\tilde{\bf y})), (81)
kd(x,x)\displaystyle k_{\mathrm{d}}(x,x^{\prime}) =k(x,x)kxCX|Y𝚿𝐲~(𝐐𝐲~𝐲~+σ2𝐈M)1𝚿𝐲~CX|Ykx\displaystyle=k(x,x^{\prime})-k_{x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}({\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}{\bf\Psi}_{\tilde{\bf y}}^{\top}C_{X|Y}^{\top}k_{x^{\prime}} (82)

for any x,x𝒳x,x^{\prime}\in{\mathcal{X}}, where we abuse notation for the cross-covariance term

"kxCX|Y𝚿𝐲~"=[kx,CX|Yy~1kkx,CX|Yy~Mk]."k_{x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}"=\begin{bmatrix}\langle k_{x},C_{X|Y}\ell_{\tilde{y}_{1}}\rangle_{k}&\ldots&\langle k_{x},C_{X|Y}\ell_{\tilde{y}_{M}}\rangle_{k}\end{bmatrix}. (83)

The CMO appears in the cross-covariance term kxCX|Y𝚿𝐲~k_{x}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}} and in the CMP covariance matrix 𝐐𝐲~𝐲~=𝚿𝐲~CX|YCX|Y𝚿𝐲~{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}={\bf\Psi}_{\tilde{\bf y}}^{\top}C_{X|Y}^{\top}C_{X|Y}{\bf\Psi}_{\tilde{\bf y}}. To derive empirical versions using the Conditional Mean Shrinkage Operator we replace it by C^X|YS=𝐌^𝐲(𝐋𝐲𝐲+λN𝐈N)1𝚿𝐲{}^{S}\!\hat{C}_{X|Y}={\bf\hat{M}}_{{\bf y}}({\bf L}_{{\bf y}{\bf y}}+\lambda N{\bf I}_{N})^{-1}{\bf\Psi}_{{\bf y}}^{\top}.

The empirical cross-covariance operator with shrinkage CMO estimate is given by

kxC^X|YS𝚿𝐲~\displaystyle k_{x}^{\top}\,\!{}^{S}\!\hat{C}_{X|Y}{\bf\Psi}_{\tilde{\bf y}} =kx𝐌^𝐲(𝐋𝐲𝐲+λN𝐈N)1𝚿𝐲𝚿𝐲~\displaystyle=k_{x}^{\top}{\bf\hat{M}}_{{\bf y}}({\bf L}_{{\bf y}{\bf y}}+\lambda N{\bf I}_{N})^{-1}{\bf\Psi}_{{\bf y}}^{\top}{\bf\Psi}_{\tilde{\bf y}} (84)
=kx𝐌^𝐲(𝐋𝐲𝐲+λN𝐈N)1𝐋𝐲𝐲~\displaystyle=k_{x}^{\top}{\bf\hat{M}}_{{\bf y}}({\bf L}_{{\bf y}{\bf y}}+\lambda N{\bf I}_{N})^{-1}{\bf L}_{{\bf y}\tilde{\bf y}} (85)
=kx𝐌^𝐲𝐀\displaystyle=k_{x}^{\top}{\bf\hat{M}}_{{\bf y}}{\bf A} (86)

where we abuse notation

"kx𝐌^𝐲"\displaystyle"k_{x}^{\top}{\bf\hat{M}}_{{\bf y}}" :=[kx,μ^X|Y=y1kkx,μ^X|Y=yNk]\displaystyle:=\begin{bmatrix}\langle k_{x},\hat{\mu}_{X|Y=y_{1}}\rangle_{k}&\ldots&\langle k_{x},\hat{\mu}_{X|Y=y_{N}}\rangle_{k}\end{bmatrix} (87)
=[1n1i=1n1k(x1(i),x)1nNi=1nNk(xN(i),x)].\displaystyle=\begin{bmatrix}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}k(x_{1}^{(i)},x)&\ldots&\frac{1}{n_{N}}\sum_{i=1}^{n_{N}}k(x_{N}^{(i)},x)\end{bmatrix}. (88)

The empirical shrinkage CMP covariance matrix is given by

𝐐^𝐲~𝐲~S{}^{S}\!\hat{\bf Q}_{\tilde{\bf y}\tilde{\bf y}} :=𝚿𝐲~C^X|YSC^X|YS𝚿𝐲~\displaystyle:={\bf\Psi}_{\tilde{\bf y}}^{\top}\,\!{}^{S}\!\hat{C}_{X|Y}^{\top}\,\!{}^{S}\!\hat{C}_{X|Y}{\bf\Psi}_{\tilde{\bf y}} (89)
=𝚿𝐲~𝚿𝐲(𝐋𝐲𝐲+λN𝐈N)1𝐌^𝐲𝐌^𝐲(𝐋𝐲𝐲+λN𝐈N)1𝚿𝐲𝚿𝐲~\displaystyle={\bf\Psi}_{\tilde{\bf y}}^{\top}{\bf\Psi}_{{\bf y}}({\bf L}_{{\bf y}{\bf y}}+\lambda N{\bf I}_{N})^{-1}{\bf\hat{M}}_{{\bf y}}^{\top}{\bf\hat{M}}_{{\bf y}}({\bf L}_{{\bf y}{\bf y}}+\lambda N{\bf I}_{N})^{-1}{\bf\Psi}_{{\bf y}}^{\top}{\bf\Psi}_{\tilde{\bf y}} (90)
=𝐀𝐌^𝐲𝐌^𝐲𝐀\displaystyle={\bf A}^{\top}{\bf\hat{M}}_{{\bf y}}^{\top}{\bf\hat{M}}_{{\bf y}}{\bf A} (91)

where with similar notation abuse

"𝐌^𝐲𝐌^𝐲"=[μ^X|Y=yi,μ^X|Y=yjk]1i,jN=[1ninjl=1nir=1njk(xi(l),xj(r))]1i,jN"{\bf\hat{M}}_{{\bf y}}^{\top}{\bf\hat{M}}_{{\bf y}}"=\left[\langle\hat{\mu}_{X|Y=y_{i}},\hat{\mu}_{X|Y=y_{j}}\rangle_{k}\right]_{1\leq i,j\leq N}=\left[\frac{1}{n_{i}n_{j}}\sum_{l=1}^{n_{i}}\sum_{r=1}^{n_{j}}k(x_{i}^{(l)},x_{j}^{(r)})\right]_{1\leq i,j\leq N} (92)

Substituting these expressions into (81) and (82), we obtain empirical estimates of the deconditional posterior based on the shrinkage CMO estimator, defined as

m^dS(x):=m(x)+kx𝐌^𝐲𝐀(𝐀𝐌^𝐲𝐌^𝐲𝐀+σ2𝐈M)1(𝐳~μ^(𝐲~)),\,\!{}^{S}\!\hat{m}_{\mathrm{d}}(x):=m(x)+k_{x}^{\top}{\bf\hat{M}}_{{\bf y}}{\bf A}({\bf A}^{\top}{\bf\hat{M}}_{{\bf y}}^{\top}{\bf\hat{M}}_{{\bf y}}{\bf A}+\sigma^{2}{\bf I}_{M})^{-1}(\tilde{\bf z}-\hat{\mu}(\tilde{\bf y})), (93)
k^dS(x,x):=k(x,x)kx𝐌^𝐲𝐀(𝐀𝐌^𝐲𝐌^𝐲𝐀+σ2𝐈M)1𝐀𝐌^𝐲kx\,\!{}^{S}\!\hat{k}_{\mathrm{d}}(x,x^{\prime}):=k(x,x^{\prime})-k_{x}^{\top}{\bf\hat{M}}_{{\bf y}}{\bf A}({\bf A}^{\top}{\bf\hat{M}}_{{\bf y}}^{\top}{\bf\hat{M}}_{{\bf y}}{\bf A}+\sigma^{2}{\bf I}_{M})^{-1}{\bf A}^{\top}{\bf\hat{M}}_{{\bf y}}^{\top}k_{x^{\prime}} (94)

for any x,x𝒳x,x^{\prime}\in{\mathcal{X}}.
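To make the matrix computations behind (93) and (94) explicit, here is a minimal sketch assuming a centered prior m=0, kernel functions kk and \ell that return Gram matrices, and bag samples matched to the mediating labels; the function and variable names are ours and this is only an illustration of the formulas above.

import numpy as np

def shrinkage_deconditional_posterior(k, l, bags_x, y, y_tilde, z_tilde,
                                      x_test, lam, sigma2):
    # Sketch of (93)-(94) with the shrinkage CMO, assuming a centered prior m = 0.
    # bags_x[j] holds the high-resolution samples of bag j, matched to label y[j].
    N, M = len(y), len(y_tilde)
    A = np.linalg.solve(l(y, y) + lam * N * np.eye(N), l(y, y_tilde))     # (N, M)

    # Entries of "M_y^T M_y": double kernel averages over pairs of bags, eq. (92)
    G = np.array([[k(bi, bj).mean() for bj in bags_x] for bi in bags_x])  # (N, N)

    # Rows of "k_x^T M_y" at the test points: within-bag kernel averages, eq. (88)
    kxM = np.stack([k(x_test, b).mean(axis=1) for b in bags_x], axis=1)   # (T, N)

    Q = A.T @ G @ A                                                       # CMP covariance, eq. (91)
    B = kxM @ A
    solve = lambda rhs: np.linalg.solve(Q + sigma2 * np.eye(M), rhs)
    mean = B @ solve(z_tilde)                    # posterior mean, eq. (93) with m = 0
    cov = k(x_test, x_test) - B @ solve(B.T)     # posterior covariance, eq. (94)
    return mean, cov

For one-dimensional inputs, kk and \ell could for instance be taken as rbf = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2).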

Note that as the number of bags increases, it is possible to derive a variational formulation similar to the one proposed in Section B that leverages the shrinkage estimator to further speed up the overall computation.

C.2 Ablation Study

In this section we present an ablation study of the shrinkage CMO estimator. The goal is to illustrate that the shrinkage CMO performs on par with the standard CMO estimator while being much faster to compute.

In the following, we sample bag data of the form 𝒟b={b𝒙j,yj}j=1N{}^{b}\!{\mathcal{D}}=\{^{b}{\boldsymbol{x}}_{j},y_{j}\}_{j=1}^{N} with 𝒙jb={xj(i)}i=1n{}^{b}{\boldsymbol{x}}_{j}=\{x_{j}^{(i)}\}_{i=1}^{n}, i.e. there are NN bags with nn elements each. We first sample NN bag labels yj𝒩(0,2)y_{j}\sim{\mathcal{N}}(0,2) and, for each bag label yjy_{j}, we sample nn observations xj(i)|yj𝒩(yjsin(yj),0.52)x_{j}^{(i)}|y_{j}\sim{\mathcal{N}}(y_{j}\sin(y_{j}),0.5^{2}).

Recall that for the standard CME one would need to repeat each bag label to match the cardinality of its samples xj(i)x_{j}^{(i)}, i.e. the CME is estimated using the data {xj(i),yj}j=1,i=1N,n\{x_{j}^{(i)},y_{j}\}_{j=1,i=1}^{N,n}.

Denote by C^X|Y\hat{C}_{X|Y} the standard CMO estimator and by C^X|YS{}^{S}\hat{C}_{X|Y} the shrinkage CMO estimator. We compare the two estimators on a grid of test points {xi,yi}i=1N\{x^{*}_{i},y^{*}_{i}\}_{i=1}^{N^{*}}, i.e. we compute the RMSE between the values μ^X|Y=yi(xi):=C^X|Yyi,kxik\hat{\mu}_{X|Y=y^{*}_{i}}(x^{*}_{i}):=\langle\hat{C}_{X|Y}\ell_{y^{*}_{i}},k_{x^{*}_{i}}\rangle_{k} and μ^X|Y=yiS(xi):=SC^X|Yyi,kxik{}^{S}\hat{\mu}_{X|Y=y^{*}_{i}}(x^{*}_{i}):=\langle^{S}\hat{C}_{X|Y}\ell_{y^{*}_{i}},k_{x^{*}_{i}}\rangle_{k} for each ii. We also report the time in seconds needed to compute each estimator. The following results are run on a CPU. Kernel hyperparameters are chosen using the median heuristic. The regularisation for both estimators is set to 0.1.
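A minimal sketch of this comparison is given below; it uses a reduced configuration of 50 bags with 50 samples each and fixed unit lengthscales instead of the median heuristic, so the numbers it produces are only indicative of the trend reported in the figures.

import numpy as np
import time

rng = np.random.default_rng(0)
N_bags, n_per_bag, reg = 50, 50, 0.1

# Synthetic bag data: bag labels y_j and bag samples x_j^(i) | y_j
y = rng.normal(0.0, np.sqrt(2.0), N_bags)
bags = [rng.normal(yj * np.sin(yj), 0.5, n_per_bag) for yj in y]

rbf = lambda a, b, s: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / s ** 2)
s_x, s_y = 1.0, 1.0                     # lengthscales (fixed here for simplicity)
x_test = np.linspace(-4.0, 4.0, 100)
y_test = np.linspace(-3.0, 3.0, 100)

# Standard CMO: repeat each bag label to match its samples
t0 = time.time()
X_all, Y_all = np.concatenate(bags), np.repeat(y, n_per_bag)
n_all = X_all.size
W = np.linalg.solve(rbf(Y_all, Y_all, s_y) + reg * n_all * np.eye(n_all),
                    rbf(Y_all, y_test, s_y))
mu_std = rbf(x_test, X_all, s_x) @ W             # CME evaluations on the grid
t_std = time.time() - t0

# Shrinkage CMO: one empirical CME per bag, only an N x N system to solve
t0 = time.time()
W_s = np.linalg.solve(rbf(y, y, s_y) + reg * N_bags * np.eye(N_bags),
                      rbf(y, y_test, s_y))
kxM = np.stack([rbf(x_test, b, s_x).mean(axis=1) for b in bags], axis=1)
mu_shrink = kxM @ W_s
t_shrink = time.time() - t0

rmse = np.sqrt(np.mean((mu_std - mu_shrink) ** 2))
print(f"RMSE={rmse:.4f}, standard: {t_std:.2f}s, shrinkage: {t_shrink:.4f}s")

Note that this sketch computes the RMSE over the full grid of test combinations rather than over paired test points, which is sufficient to illustrate the agreement between the two estimators.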

Figure 4: 3 bags with 50 samples each. (left) Data, (middle) μ^X|Y=yi(xi)\hat{\mu}_{X|Y=y^{*}_{i}}(x^{*}_{i}) Standard CME. (right) μ^X|Y=yiS(xi){}^{S}\hat{\mu}_{X|Y=y^{*}_{i}}(x^{*}_{i}) Shrinkage CME. We see that both algorithms require very little time to train (\sim 0.01 second), with a negligible difference in values as shown by the RMSE.
Figure 5: 50 bags with 3 samples each. (left) Data, (middle) μ^X|Y=yi(xi)\hat{\mu}_{X|Y=y^{*}_{i}}(x^{*}_{i}) Standard CME. (right) μ^X|Y=yiS(xi){}^{S}\hat{\mu}_{X|Y=y^{*}_{i}}(x^{*}_{i}) Shrinkage CME. Again, both algorithms require very little time to train (\sim 0.03 second). However, there is an increase in RMSE for the shrinkage estimator because there are far fewer samples in each bag, so the empirical CME estimates μ^X|Y=yj\hat{\mu}_{X|Y=y_{j}} might not be accurate. Nonetheless, the difference remains small.

Figures 4 and 5 show how the shrinkage CMO performs compared to the standard CMO in a small-data regime. When we increase the data size, the computational differences become substantial (see Figures 6 and 7).

Figure 6: 50 bags with 500 samples each. (left) Data, (middle) μ^X|Y=yi(xi)\hat{\mu}_{X|Y=y^{*}_{i}}(x^{*}_{i}) Standard CME. (right) μ^X|Y=yiS(xi){}^{S}\hat{\mu}_{X|Y=y^{*}_{i}}(x^{*}_{i}) Shrinkage CME. With a small RMSE of 0.03, the Shrinkage CME is approximately 600 times quicker than the standard version.
Figure 7: 500 bags with 50 samples each. (left) Data, (middle) μ^X|Y=yi(xi)\hat{\mu}_{X|Y=y^{*}_{i}}(x^{*}_{i}) Standard CME. (right) μ^X|Y=yiS(xi){}^{S}\hat{\mu}_{X|Y=y^{*}_{i}}(x^{*}_{i}) Shrinkage CME. Again, with a small RMSE of 0.02, the Shrinkage CME is approximately 200 times quicker than the standard CME.

Appendix D Details on Convergence Result

In this section, we provide insights about the convergence results stated in Section 4. These results are largely based on the impactful work of Caponnetto and De Vito [36], Szabó et al. [24] and Singh et al. [25] which we modify to fit our problem setup. Each assumption that we make is adapted from a similar assumption made in those works, for which we provide intuition and a detailed justification. We start by redefining the mathematical tools introduced in these works that are necessary to state our result.

D.1 Definitions and 𝒫K(b,c){\mathcal{P}}_{K}(b,c) spaces

We start by providing a general definition of covariance operators over vector-valued RKHS, which will allow us to specify a class of probability distributions for our convergence result.

Definition D.1 (Covariance operator).

Let 𝒲{\mathcal{W}} be a Polish space endowed with a measure ρ\rho, let 𝒢{\mathcal{G}} be a real separable Hilbert space and let K:𝒲2(𝒢)K:{\mathcal{W}}^{2}\rightarrow{\mathcal{L}}({\mathcal{G}}) be an operator-valued kernel spanning a 𝒢{\mathcal{G}}-valued RKHS K{\mathcal{H}}_{K}.

The covariance operator of KK is defined as the positive trace class operator given by

TK:=𝒲KwKwdρ(w)(K)T_{K}:=\int_{{\mathcal{W}}}K_{w}K_{w}^{*}d\rho(w)\in{\mathcal{L}}({\mathcal{H}}_{K}) (95)

where (K){\mathcal{L}}({\mathcal{H}}_{K}) denotes the space of bounded linear operators on K{\mathcal{H}}_{K}.

Definition D.2 (Power of self-adjoint Hilbert operator).

Let TT be a compact self-adjoint Hilbert space operator with spectral decomposition T=n=1λnenenT=\sum_{n=1}^{\infty}\lambda_{n}e_{n}\otimes e_{n}, where (en)n(e_{n})_{n\in{\mathbb{N}}} is an orthonormal basis of Ker(T)\operatorname{Ker}(T)^{\perp}. The rrth power of TT is defined as Tr=n=1λnrenenT^{r}=\sum_{n=1}^{\infty}\lambda_{n}^{r}e_{n}\otimes e_{n}.

Using the covariance operator, we now introduce a general class of priors that does not assume parametric distributions, by adapting to our setup a definition originally introduced by Caponnetto and De Vito [36]. This class captures the difficulty of a regression problem in terms of two simple parameters, bb and cc [24].

Definition D.3 (𝒫K(b,c){\mathcal{P}}_{K}(b,c) class).

Let ρ:𝒢𝒲[0,[{\mathcal{E}}_{\rho}:{\mathcal{G}}^{\mathcal{W}}\rightarrow[0,\infty[ be an expected risk function over ρ\rho and let Eρ=argminρE_{\rho}=\arg\min\,{\mathcal{E}}_{\rho}. Then, given b>1b>1 and c]1,2]c\in]1,2], we say that ρ\rho is a 𝒫K(b,c){\mathcal{P}}_{K}(b,c) class probability measure w.r.t. ρ{\mathcal{E}}_{\rho} if

  1. 1.

    Range assumption: GK\exists G\in{\mathcal{H}}_{K} such that Eρ=TKc12GE_{\rho}=T_{K}^{\frac{c-1}{2}}\circ G with GK2R\|G\|_{K}^{2}\leq R for some R0R\geq 0

  2. 2.

    Spectral assumption: the eigenvalues (λn)n(\lambda_{n})_{n\in{\mathbb{N}}} of TKT_{K} satisfy αnbλnβ,n\alpha\leq n^{b}\lambda_{n}\leq\beta\,,\forall n\in{\mathbb{N}} for some βα0\beta\geq\alpha\geq 0

The range assumption controls the functional smoothness of EρE_{\rho}: a larger cc corresponds to increased smoothness. Specifically, elements of Range(TKc12)\operatorname{Range}({T_{K}^{\frac{c-1}{2}}}) admit Fourier coefficients (γn)n(\gamma_{n})_{n\in{\mathbb{N}}} such that n=1γn2λn(c+1)<\sum_{n=1}^{\infty}\gamma_{n}^{2}\lambda_{n}^{-(c+1)}<\infty. In the limit c1c\rightarrow 1, we obtain Range(TK0)=Range(IdK)=K\operatorname{Range}({T_{K}^{0}})=\operatorname{Range}({\operatorname{Id}_{{\mathcal{H}}_{K}}})={\mathcal{H}}_{K}. Since the ranked eigenvalues are positive and λn0\lambda_{n}\rightarrow 0, greater powers of the covariance operator TKT_{K} give rise to faster decay of the Fourier coefficients and hence to smoother operators.

The spectral assumption can be read as a polynomial decay of the eigenvalues of TKT_{K}, i.e. λn=Θ(nb)\lambda_{n}=\Theta(n^{-b}). Thus, a larger bb leads to faster decay and hence to a smaller effective input dimension.

D.2 Complete statement of the convergence result

The following result corresponds to a detailed version of Theorem 4.2 where all the assumptions are explicitly stated. As such, its proof also constitutes the proof for Theorem 4.2.

Theorem D.4 (Empirical DMO Convergence Rate).

Assume that

  1. 1.

𝒳{\mathcal{X}} and 𝒴{\mathcal{Y}} are Polish spaces, i.e. separable and completely metrizable topological spaces

  2. 2.

    kk and \ell are continuous, bounded, their canonical feature maps kxk_{x} and y\ell_{y} are measurable and kk is characteristic

  3. 3.

    {\mathcal{H}}_{\ell} is finite dimensional

  4. 4.

    argmincΓ\arg\min{\mathcal{E}}_{\mathrm{c}}\in{\mathcal{H}}_{\Gamma} and argmindΓ\arg\min{\mathcal{E}}_{\mathrm{d}}\in{\mathcal{H}}_{\Gamma}

  5. 5.

    The operator family {ΓμX|Y=y}y𝒴\{\Gamma_{\mu_{X|Y=y}}\}_{y\in{\mathcal{Y}}} is Hölder continuous with exponent ι]0,1]\iota\in]0,1]

  6. 6.

    XY{\mathbb{P}}_{XY} is a 𝒫Γ(0,c){\mathcal{P}}_{\Gamma}(0,c^{\prime}) class probability measure w.r.t. c{\mathcal{E}}_{\mathrm{c}} and Y{\mathbb{P}}_{Y} is a 𝒫Γ(b,c){\mathcal{P}}_{\Gamma}(b,c) class probability measure w.r.t. d{\mathcal{E}}_{\mathrm{d}}

  7. 7.

    g\forall g\in{\mathcal{H}}_{\ell}, g<\|g\|_{\ell}<\infty almost surely

Let DY=argminDΓd(D)D_{{\mathbb{P}}_{Y}}=\underset{D\in{\mathcal{H}}_{\Gamma}}{\arg\min}\,{\mathcal{E}}_{\mathrm{d}}(D). Then, if we choose λ=N1c+1\lambda=N^{-\frac{1}{c^{\prime}+1}} and N=Ma(c+1)ι(c1)N=M^{\frac{a(c^{\prime}+1)}{\iota(c^{\prime}-1)}} where a>0a>0, we have

  • If ab(c+1)bc+1a\leq\frac{b(c+1)}{bc+1}, then d(D^X|Y)d(DY)=𝒪(Macc+1){\mathcal{E}}_{\mathrm{d}}(\hat{D}_{X|Y})-{\mathcal{E}}_{\mathrm{d}}(D_{{\mathbb{P}}_{Y}})={\mathcal{O}}(M^{\frac{-ac}{c+1}}) with ϵ=Mac+1\epsilon=M^{\frac{-a}{c+1}}

  • If ab(c+1)bc+1a\geq\frac{b(c+1)}{bc+1}, then d(D^X|Y)d(DY)=𝒪(Mbcbc+1){\mathcal{E}}_{\mathrm{d}}(\hat{D}_{X|Y})-{\mathcal{E}}_{\mathrm{d}}(D_{{\mathbb{P}}_{Y}})={\mathcal{O}}(M^{\frac{-bc}{bc+1}}) with ϵ=Mbbc+1\epsilon=M^{\frac{-b}{bc+1}}

Proof of Theorem 4.2.

The main objective here is to rigorously verify that, within our setup, the conditions of Theorem 4 from [25] are met. We reformulate each of the assumptions stated by Singh et al. [25] from the perspective of our problem and verify that they are satisfied.

Assumption 1

Assume observation model Z~=f(X)+ε~\tilde{Z}=f(X)+\tilde{\varepsilon}, with 𝔼[ε~|Y]=0{\mathbb{E}}[\tilde{\varepsilon}|Y]=0 and suppose X|Y=y{\mathbb{P}}_{X|Y=y} is not constant in yy.

In this work, the observation model considered is Z=𝔼[f(X)|Y]+εZ={\mathbb{E}}[f(X)|Y]+\varepsilon and the objective is to recover the underlying random variable f(X)f(X), whose noisy conditional expectation is observed. The latter presumes that we could bring ZZ to XX’s resolution. We can model this by introducing the “pre-aggregation” observation model Z~=f(X)+ε~\tilde{Z}=f(X)+\tilde{\varepsilon} such that Z=𝔼[Z~|Y]Z={\mathbb{E}}[\tilde{Z}|Y], where ε~\tilde{\varepsilon} is a noise term at the individual level satisfying 𝔼[ε~|Y]=0{\mathbb{E}}[\tilde{\varepsilon}|Y]=0.

Assumption 2

𝒳{\mathcal{X}} and 𝒴{\mathcal{Y}} are Polish spaces.

We also make this assumption.

Assumption 3

kk and \ell are continuous and bounded, their canonical feature maps are measurable and kk is characteristic.

We make the same assumptions. The separability of 𝒳{\mathcal{X}} and 𝒴{\mathcal{Y}}, along with the continuity assumptions on the kernels, allows separability to propagate to the associated RKHSs k{\mathcal{H}}_{k} and {\mathcal{H}}_{\ell} and to the vector-valued RKHS Γ{\mathcal{H}}_{\Gamma}. Boundedness and continuity of the kernels ensure the measurability of the CMO and hence that measures on 𝒳{\mathcal{X}} and 𝒴{\mathcal{Y}} can be extended to k{\mathcal{H}}_{k} and {\mathcal{H}}_{\ell}. The assumption that kk is characteristic ensures that conditional mean embeddings μX|Y=y\mu_{X|Y=y} uniquely embed the conditional distributions X|Y=y{\mathbb{P}}_{X|Y=y}, and hence that operators over {\mathcal{H}}_{\ell} are identified.

Assumption 4

argmincΓ\arg\min\,{\mathcal{E}}_{\mathrm{c}}\in{\mathcal{H}}_{\Gamma}.

This property is stronger than what the actual conditional mean operator needs to satisfy, but it is necessary to make sure the problem is well-defined. We also make this assumption.

Assumption 5

XY{\mathbb{P}}_{XY} is a 𝒫Γ(0,c){\mathcal{P}}_{\Gamma}(0,c^{\prime}) class probability measure, with c]1,2]c^{\prime}\in]1,2]

As explained by Singh et al. [25], this assumption, which we also make, is required to bound the approximation error. Through the definition of the 𝒫Γ(0,c){\mathcal{P}}_{\Gamma}(0,c^{\prime}) class, this hypothesis assumes the existence of a probability measure over k{\mathcal{H}}_{k}, which we denote k{\mathbb{P}}_{{\mathcal{H}}_{k}}. Since k{\mathcal{H}}_{k} is Polish (see below), the latter can be constructed as an extension of X{\mathbb{P}}_{X} over the Borel σ\sigma-algebra associated to k{\mathcal{H}}_{k} [48, Lemma A.3.16].

Assumption 6

k{\mathcal{H}}_{k} is a Polish space

Since kk is continuous and 𝒳{\mathcal{X}} is separable, k{\mathcal{H}}_{k} is a separable Hilbert space which makes it Polish.

Assumption 7

The {ΓμX|Y=y}y𝒴\{\Gamma_{\mu_{X|Y=y}}\}_{y\in{\mathcal{Y}}} operator family is

  • Uniformly bounded in Hilbert-Schmidt norm, i.e. B>0\exists B>0 such that  y𝒴\forall y\in{\mathcal{Y}}, ΓμX|Y=yHS(,Γ)2B\|\Gamma_{\mu_{X|Y=y}}\|^{2}_{\operatorname{HS}({\mathcal{H}}_{\ell},{\mathcal{H}}_{\Gamma})}\leq B

  • Hölder continuous in operator norm, i.e. L>0,ι]0,1]\exists L>0,\iota\in]0,1] such that y,y𝒴\forall y,y^{\prime}\in{\mathcal{Y}}, ΓμX|Y=yΓμX|Y=y(,Γ)LμX|Y=yμX|Y=ykι\|\Gamma_{\mu_{X|Y=y}}-\Gamma_{\mu_{X|Y=y^{\prime}}}\|_{{\mathcal{L}}({\mathcal{H}}_{\ell},{\mathcal{H}}_{\Gamma})}\leq L\|\mu_{X|Y=y}-\mu_{X|Y=y^{\prime}}\|^{\iota}_{k}

where (,Γ){\mathcal{L}}({\mathcal{H}}_{\ell},{\mathcal{H}}_{\Gamma}) denotes the space of bounded linear operator between {\mathcal{H}}_{\ell} and Γ{\mathcal{H}}_{\Gamma}.

Since we assume that {\mathcal{H}}_{\ell} is finite dimensional, we make a stronger assumption than boundedness in Hilbert-Schmidt norm, which then follows as

ΓμX|Y=yHS(,Γ)2\displaystyle\|\Gamma_{\mu_{X|Y=y}}\|^{2}_{\operatorname{HS}({\mathcal{H}}_{\ell},{\mathcal{H}}_{\Gamma})} =tr(Γ(μX|Y=y,μX|Y=y))\displaystyle=\operatorname{tr}\left(\Gamma(\mu_{X|Y=y},\mu_{X|Y=y})\right) (96)
=tr(μX|Y=y,μX|Y=ykId)\displaystyle=\operatorname{tr}\left(\langle\mu_{X|Y=y},\mu_{X|Y=y}\rangle_{k}\operatorname{Id}_{{\mathcal{H}}_{\ell}}\right) (97)
=q(y,y)tr(Id)<.\displaystyle=q(y,y)\operatorname{tr}\left(\operatorname{Id}_{{\mathcal{H}}_{\ell}}\right)<\infty. (98)

Hölder continuity is a mild assumption that is commonly satisfied, as discussed in [24].

Assumption 8

argmindΓ\arg\min\,{\mathcal{E}}_{\mathrm{d}}\in{\mathcal{H}}_{\Gamma} and {\mathcal{H}}_{\ell} is a space of bounded functions almost surely

We assume that the true minimiser of d{\mathcal{E}}_{\mathrm{d}} is in Γ{\mathcal{H}}_{\Gamma} so that the problem is well-defined. The second assumption here is expressed in terms of a probability measure {\mathbb{P}}_{{\mathcal{H}}_{\ell}} over {\mathcal{H}}_{\ell}. We also assume that there exists B>0B>0 such that g\forall g\in{\mathcal{H}}_{\ell}, g<B\|g\|_{\ell}<B\enspace{\mathbb{P}}_{{\mathcal{H}}_{\ell}}-almost surely.

Assumption 9

Y{\mathbb{P}}_{Y} is a 𝒫Γ(b,c){\mathcal{P}}_{\Gamma}(b,c) class probability measure, with b>1b>1 and c]1,2]c\in]1,2]

This last hypothesis is not required per se to obtain a bound on the excess error of the regularized estimate D^X|Y\hat{D}_{X|Y}. However, it allows us to simplify the bounds and state them in terms of the parameters bb and cc, which characterize the effective input dimension and the functional smoothness respectively.

Furthermore, a premise to this assumption is the existence of a probability measure over {\mathcal{H}}_{\ell}, which we denote {\mathbb{P}}_{{\mathcal{H}}_{\ell}}. Since \ell is continuous and 𝒴{\mathcal{Y}} is separable, {\mathcal{H}}_{\ell} is separable and thus Polish. We can then construct {\mathbb{P}}_{{\mathcal{H}}_{\ell}} by extension of Y{\mathbb{P}}_{Y} [48, Lemma A.3.16].

This theorem underlines a trade-off between computational and statistical efficiency w.r.t. the dataset cardinalities N=|𝒟1|N=|{\mathcal{D}}_{1}| and M=|𝒟2|M=|{\mathcal{D}}_{2}| and the problem difficulty (b,c,c)(b,c,c^{\prime}).

For ab(c+1)bc+1a\leq\frac{b(c+1)}{bc+1}, a smaller aa means fewer samples from 𝒟1{\mathcal{D}}_{1} at fixed MM and thus computational savings. But it also hampers convergence, resulting in reduced statistical efficiency. At a=b(c+1)bc+1<2a=\frac{b(c+1)}{bc+1}<2, the convergence rate is minimax optimal in terms of computational-statistical efficiency, i.e. the convergence rate is optimal with the smallest possible MM. We note that at this optimum, N>MN>M and hence we require fewer samples from 𝒟2{\mathcal{D}}_{2} than from 𝒟1{\mathcal{D}}_{1}. Taking ab(c+1)bc+1a\geq\frac{b(c+1)}{bc+1} does not improve the convergence rate but increases the size of 𝒟1{\mathcal{D}}_{1} and hence the computational cost it bears.

We also note that a larger Hölder exponent ι\iota, which translates into smoother kernels, leads to a reduced NN. Similarly, since cc+1c1c^{\prime}\mapsto\frac{c^{\prime}+1}{c^{\prime}-1} and cb(c+1)bc+1c\mapsto\frac{b(c+1)}{bc+1} are strictly decreasing functions over ]1,2]]1,2], stronger regularity in the range assumption, which means smoother operators, reduces the number of samples needed from 𝒟1{\mathcal{D}}_{1} to achieve minimax optimality. Smoother problems hence require fewer samples.

A larger spectral decay exponent bb translates here into requiring more samples to reach minimax optimality and undermines the optimal convergence rate. Hence, problems with a smaller effective input dimension are harder to solve and require more samples and iterations.
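As a concrete numerical illustration (the parameter values are our own choice, not taken from the experiments), consider b=2, c=c^{\prime}=2 and a Lipschitz operator family, i.e. \iota=1. The minimax optimal choice is then a=\frac{b(c+1)}{bc+1}=\frac{6}{5}, which corresponds to N=M^{\frac{a(c^{\prime}+1)}{\iota(c^{\prime}-1)}}=M^{18/5} and \epsilon=M^{-\frac{a}{c+1}}=M^{-2/5}, and yields an excess risk of {\mathcal{O}}(M^{-\frac{ac}{c+1}})={\mathcal{O}}(M^{-4/5})={\mathcal{O}}(M^{-\frac{bc}{bc+1}}), i.e. both regimes of Theorem D.4 coincide at this boundary value of a.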

Appendix E Additional Experimental Results

E.1 Swiss Roll Experiment

E.1.1 Statistical significance table

Table 3: p-values from a two-tailed Wilcoxon signed-rank test between all pairs of methods for the test RMSE of the swiss-roll experiment with direct and indirect matching setups. The null hypothesis is that the score samples come from the same distribution. We only present the lower triangular part of the table for clarity.
Matching cmp bagg-gp varcmp vbagg gpr s-cmp
Direct cmp - - - - - -
bagg-gp 0.00006 - - - - -
varcmp 0.00008 0.00006 - - - -
vbagg 0.00006 0.00006 0.005723 - - -
gpr 0.00006 0.00006 0.00006 0.00006 - -
s-cmp 0.00006 0.00006 0.000477 0.014269 0.00006 -
Indirect cmp - - - - - -
bagg-gp 0.011129 - - - - -
varcmp 0.001944 0.015240 - - - -
vbagg 0.000089 0.047858 0.000089 - - -
gpr 0.025094 0.047858 0.047858 0.851925 - -
s-cmp 0.000089 0.002821 0.000089 0.000140 0.052222 -

E.1.2 Compute and Resources Specifications

Computations for all experiments were carried out on an internal cluster. We used a single GeForce GTX 1080 Ti GPU to speed up computations and conduct each experiment with multiple initialisation seeds. We underline however that the experiment does not require GPU acceleration and can be performed on CPU in a timely manner.

E.2 CMP with high-resolution noise observation model

E.2.1 Deconditional posterior with high-resolution noise

Beyond observation noise on the aggregate observations 𝐳~\tilde{{\bf z}} as introduced in Section LABEL:sec:_deconditional_posterior, it is natural to also consider noise at the high-resolution level, i.e. noise placed directly on ff in addition to the noise on gg at the aggregate level. Let ξ𝒢𝒫(0,δ)\xi\sim{\mathcal{G}}{\mathcal{P}}(0,\delta) be the zero-mean Gaussian process with covariance function

δ:|𝒳×𝒳(x,x){1 if x=x0 else.{\delta}:\left|\begin{array}[]{rcl}{{\mathcal{X}}\times{\mathcal{X}}}&\longrightarrow&{{\mathbb{R}}}\\ {(x,x^{\prime})}&\longmapsto&{\begin{cases}1\text{ if }x=x^{\prime}\\ 0\text{ else}\end{cases}}\\ \end{array}\right.. (99)

By incorporating this Gaussian noise process in the integrand, we can replace the definition of the CMP by

g(y)=𝒳(f(x)+ςξ(x))dX|Y=y,y𝒴,g(y)=\int_{{\mathcal{X}}}\left(f(x)+\varsigma\xi(x)\right)\,{\mathrm{d}}{\mathbb{P}}_{X|Y=y}\,,\enspace\forall y\in{\mathcal{Y}}, (100)

where ς>0\varsigma>0 is the high-resolution noise standard deviation. Essentially, this amounts to considering a contaminated covariance for the HR observation process. This covariance is defined as

{k^{\varsigma}}:\left|\begin{array}[]{rcl}{{\mathcal{X}}\times{\mathcal{X}}}&\longrightarrow&{{\mathbb{R}}}\\ {(x,x^{\prime})}&\longmapsto&{k(x,x^{\prime})+\varsigma^{2}\delta(x,x^{\prime})}\\ \end{array}\right.. (101)

Under the same regularity assumptions as in Proposition 3.2, the covariance of the CMP becomes $q(y,y^{\prime})={\mathbb{E}}[k^{\varsigma}(X,X^{\prime})|Y=y,Y^{\prime}=y^{\prime}]$, while the mean and cross-covariance terms are not affected. It can similarly be written in terms of conditional mean embeddings, but using as integrands for the CMEs the canonical feature maps induced by $k^{\varsigma}$, i.e. $\mu_{X|Y=y}^{\varsigma}:={\mathbb{E}}[k^{\varsigma}(X,\cdot)|Y=y]$ for any $y\in{\mathcal{Y}}$. Critically, this is reflected in the expression of the empirical CMP covariance, which reads

\hat{q}(y,y^{\prime})=\ell(y,{\bf y})({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N})^{-1}({\bf K}_{{\bf x}{\bf x}}+\varsigma^{2}{\bf I}_{N})({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N})^{-1}\ell({\bf y},y^{\prime}), (102)

thus yielding the matrix form

\hat{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}:=\hat{q}(\tilde{\bf y},\tilde{\bf y}) (103)
={\bf L}_{\tilde{\bf y}{\bf y}}({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N})^{-1}({\bf K}_{{\bf x}{\bf x}}+\varsigma^{2}{\bf I}_{N})({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N})^{-1}{\bf L}_{{\bf y}\tilde{\bf y}} (104)
={\bf A}^{\top}({\bf K}_{{\bf x}{\bf x}}+\varsigma^{2}{\bf I}_{N}){\bf A}, (105)

which can readily be used in (8) and (9) to compute the deconditional posterior.

This high-resolution noise term introduces an additional regularization to the model, which helps prevent degeneracy of the deconditional posterior covariance. Indeed, we have

\hat{k}_{\mathrm{d}}({\bf x},{\bf x})={\bf K}_{{\bf x}{\bf x}}-{\bf K}_{{\bf x}{\bf x}}{\bf A}(\hat{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M})^{-1}{\bf A}^{\top}{\bf K}_{{\bf x}{\bf x}} (106)
={\bf K}_{{\bf x}{\bf x}}-{\bf K}_{{\bf x}{\bf x}}{\bf A}({\bf A}^{\top}({\bf K}_{{\bf x}{\bf x}}+\varsigma^{2}{\bf I}_{N}){\bf A}+\sigma^{2}{\bf I}_{M})^{-1}{\bf A}^{\top}{\bf K}_{{\bf x}{\bf x}} (107)
={\bf K}_{{\bf x}{\bf x}}-{\bf K}_{{\bf x}{\bf x}}\left({\bf A}{\bf A}^{\top}({\bf K}_{{\bf x}{\bf x}}+\varsigma^{2}{\bf I}_{N})+\sigma^{2}{\bf I}_{N}\right)^{-1}{\bf A}{\bf A}^{\top}{\bf K}_{{\bf x}{\bf x}}, (108)

where on the last line we have used the Woodbury identity. We can see that when $\sigma=\varsigma=0$, (108) degenerates to 0. The aggregate observation model noise $\sigma$ provides a first layer of regularization at the low-resolution level. The high-resolution noise $\varsigma$ supplements it, making the computation of the empirical covariance matrix numerically more stable.
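For concreteness, the NumPy sketch below assembles these empirical quantities from precomputed kernel matrices, using ${\bf A}=({\bf L}_{{\bf y}{\bf y}}+N\lambda{\bf I}_{N})^{-1}{\bf L}_{{\bf y}\tilde{\bf y}}$ as in (103)–(105) together with the posterior covariance (106). It is a minimal illustration only, not the implementation used in our experiments; the toy kernels, variable names and parameter values are arbitrary.

```python
import numpy as np

def deconditional_posterior_cov(K_xx, L_yy, L_yyt, lam, sigma, varsigma):
    """Empirical CMP covariance with HR noise and deconditional posterior covariance.

    Follows (103)-(106): A = (L_yy + N*lam*I_N)^{-1} L_yyt,
    Q = A^T (K_xx + varsigma^2 I_N) A,
    k_d = K_xx - K_xx A (Q + sigma^2 I_M)^{-1} A^T K_xx.
    """
    N, M = L_yyt.shape
    A = np.linalg.solve(L_yy + N * lam * np.eye(N), L_yyt)              # N x M
    Q = A.T @ (K_xx + varsigma**2 * np.eye(N)) @ A                      # M x M, eq. (105)
    KA = K_xx @ A                                                       # N x M
    k_d = K_xx - KA @ np.linalg.solve(Q + sigma**2 * np.eye(M), KA.T)   # N x N, eq. (106)
    return A, Q, k_d

# Toy usage with RBF kernels on scalar inputs (illustrative only).
rng = np.random.default_rng(0)
N, M = 50, 10
x, y, y_tilde = rng.normal(size=(N, 1)), rng.normal(size=(N, 1)), rng.normal(size=(M, 1))
rbf = lambda a, b: np.exp(-0.5 * (a - b.T) ** 2)
A, Q, k_d = deconditional_posterior_cov(rbf(x, x), rbf(y, y), rbf(y, y_tilde),
                                        lam=1e-2, sigma=0.1, varsigma=0.1)
```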

E.2.2 Variational deconditional posterior with high-resolution noise

The high-resolution noise observation process can also be incorporated into the variational derivation to obtain a slightly different ELBO objective. We have

p({\bf\tilde{z}}\,|\,{\bf f})={\mathcal{N}}({\bf\tilde{z}}\,|\,\boldsymbol{\Upsilon}^{\top}{\bf K}_{{\bf x}{\bf x}}^{-1}{\bf f}\;,\;{\bf Q}_{\tilde{\bf y}\tilde{\bf y}}+\sigma^{2}{\bf I}_{M}-\boldsymbol{\Upsilon}^{\top}{\bf K}_{{\bf x}{\bf x}}^{-1}\boldsymbol{\Upsilon}) (109)
={\mathcal{N}}({\bf\tilde{z}}\,|\,{\bf A}^{\top}{\bf f}\;,\;{\bf A}^{\top}({\bf K}_{{\bf x}{\bf x}}+\varsigma^{2}{\bf I}_{N}){\bf A}+\sigma^{2}{\bf I}_{M}-{\bf A}^{\top}{\bf K}_{{\bf x}{\bf x}}{\bf A}) (110)
={\mathcal{N}}({\bf\tilde{z}}\,|\,{\bf A}^{\top}{\bf f}\;,\;\varsigma^{2}{\bf A}^{\top}{\bf A}+\sigma^{2}{\bf I}_{M}). (111)

The expected log-likelihood under the variational posterior hence reads

{\mathbb{E}}_{q({\bf f})}[\log p({\bf\tilde{z}}\,|\,{\bf f})]=-\frac{M}{2}\log(2\pi)-\frac{1}{2}\log\det(\varsigma^{2}{\bf A}^{\top}{\bf A}+\sigma^{2}{\bf I}_{M}) (112)
-\frac{1}{2}{\mathbb{E}}_{q({\bf f})}\left[(\tilde{\bf z}-{\bf A}^{\top}{\bf f})^{\top}(\varsigma^{2}{\bf A}^{\top}{\bf A}+\sigma^{2}{\bf I}_{M})^{-1}(\tilde{\bf z}-{\bf A}^{\top}{\bf f})\right]. (113)

With a derivation similar to the one proposed in Appendix B, the expected log-likelihood can be expressed in terms of the variational posterior parameters as

{\mathbb{E}}_{q({\bf f})}[\log p({\bf\tilde{z}}\,|\,{\bf f})]=-\frac{M}{2}\log(2\pi)-\frac{1}{2}\log\det(\varsigma^{2}{\bf A}^{\top}{\bf A}+\sigma^{2}{\bf I}_{M}) (114)
-\frac{1}{2}(\tilde{\bf z}-{\bf A}^{\top}\bar{\boldsymbol{\eta}})^{\top}(\varsigma^{2}{\bf A}^{\top}{\bf A}+\sigma^{2}{\bf I}_{M})^{-1}(\tilde{\bf z}-{\bf A}^{\top}\bar{\boldsymbol{\eta}}) (115)
-\frac{1}{2}\operatorname{tr}\left((\varsigma^{2}{\bf A}^{\top}{\bf A}+\sigma^{2}{\bf I}_{M})^{-1}{\bf A}^{\top}\bar{\boldsymbol{\Sigma}}{\bf A}\right). (116)

In particular, the last term can be rearranged into $\operatorname{tr}\left(\bar{\boldsymbol{\Sigma}}^{\nicefrac{1}{2}}{\bf A}(\varsigma^{2}{\bf A}^{\top}{\bf A}+\sigma^{2}{\bf I}_{M})^{-1}{\bf A}^{\top}\bar{\boldsymbol{\Sigma}}^{\nicefrac{1}{2}}\right)$, which can efficiently be computed as an inverse quadratic form [38].
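As an illustration, the following NumPy sketch evaluates the expected log-likelihood (114)–(116), assuming the matrix ${\bf A}$ and the variational parameters $\bar{\boldsymbol{\eta}}$, $\bar{\boldsymbol{\Sigma}}$ have already been computed (as in Appendix B). Variable names are illustrative, and a practical implementation would rather rely on the inverse quadratic form rearrangement above [38].

```python
import numpy as np

def expected_log_likelihood(z_tilde, A, eta_bar, Sigma_bar, sigma, varsigma):
    """Expected log-likelihood (114)-(116) under the variational posterior q(f).

    z_tilde: (M,) aggregate observations; A: (N, M); eta_bar: (N,) variational mean;
    Sigma_bar: (N, N) variational covariance; sigma, varsigma: noise standard deviations.
    """
    M = z_tilde.shape[0]
    C = varsigma**2 * A.T @ A + sigma**2 * np.eye(M)           # covariance of z_tilde given f
    _, logdet = np.linalg.slogdet(C)                           # log-determinant term of eq. (114)
    r = z_tilde - A.T @ eta_bar                                # residual of eq. (115)
    quad = r @ np.linalg.solve(C, r)
    trace = np.trace(np.linalg.solve(C, A.T @ Sigma_bar @ A))  # trace term of eq. (116)
    return -0.5 * (M * np.log(2 * np.pi) + logdet + quad + trace)
```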

E.3 Mediated downscaling of atmospheric temperature

E.3.1 Map visualization of atmospheric fields dataset

Figure 8: Map visualization of the dataset used in the mediated downscaling experiment (for one random seed); Top: Bags of high-resolution albedo $\alpha_{\mathrm{HR}}$ and total cloud cover $\mathrm{TCC}_{\mathrm{HR}}$ pixels, which are observed in ${\mathcal{D}}_{1}$ (each “coarse pixel” delineates a bag of HR pixels); Middle: Low-resolution pressure field $P_{\mathrm{LR}}$, which is observed everywhere and plays the role of the mediating variable; Bottom: Low-resolution temperature field $T_{\mathrm{LR}}$ pixels, which are observed in ${\mathcal{D}}_{2}$ and that we want to downscale; grey pixels are unobserved; the grey layer on the HR covariate maps (top) is the exact complement of the grey layer on the observed $T_{\mathrm{LR}}$ map (bottom).

E.3.2 Downscaling prediction maps

Figure 9: Predicted downscaled atmospheric temperature field with vargpr; Top-Left: Posterior mean; Top-Right: Size of the 95% confidence region, i.e. 2 posterior standard deviations; Bottom-Left: Squared difference with the unobserved groundtruth $T_{\mathrm{HR}}$; Bottom-Right: Difference between the unobserved groundtruth $T_{\mathrm{HR}}$ and the posterior mean.
Figure 10: Predicted downscaled atmospheric temperature field with vbagg; Top-Left: Posterior mean; Top-Right: Size of the 95% confidence region, i.e. 2 posterior standard deviations; Bottom-Left: Squared difference with the unobserved groundtruth $T_{\mathrm{HR}}$; Bottom-Right: Difference between the unobserved groundtruth $T_{\mathrm{HR}}$ and the posterior mean.
Figure 11: Predicted downscaled atmospheric temperature field with varcmp; Top-Left: Posterior mean; Top-Right: Size of the 95% confidence region, i.e. 2 posterior standard deviations; Bottom-Left: Squared difference with the unobserved groundtruth $T_{\mathrm{HR}}$; Bottom-Right: Difference between the unobserved groundtruth $T_{\mathrm{HR}}$ and the posterior mean.

E.3.3 Statistical significance table

Table 4: p-values from a two-tailed Wilcoxon signed-rank test between all pairs of methods for the evaluation scores on the mediated statistical downscaling experiment. The null hypothesis is that the score samples come from the same distribution. As before, we only present the lower-triangular part of the table for readability.
Metric Method varcmp vbagg vargpr
varcmp - - -
RMSE vbagg 0.005062 - -
vargpr 0.006910 0.046853 -
varcmp - - -
MAE vbagg 0.005062 - -
vargpr 0.059336 0.006910 -
varcmp - - -
CORR vbagg 0.005062 - -
vargpr 0.016605 0.028417 -
varcmp - - -
SSIM vbagg 0.005062 - -
vargpr 0.959354 0.005062 -

E.3.4 Compute and Resources Specifications

Computations for all experiments were carried out on an internal cluster. We used a single GeForce GTX 1080 Ti GPU to speed up computations and conduct each experiment with multiple initialisation seeds.