
Evaluating Aleatoric Uncertainty
via Conditional Generative Models

Ziyi Huang, Henry Lam, Haofeng Zhang
Columbia University
New York, NY 10027
zh2354, khl2114, hz2553@columbia.edu
Abstract

Aleatoric uncertainty quantification seeks distributional knowledge of random responses, which is important for reliability analysis and robustness improvement in machine learning applications. Previous research on aleatoric uncertainty estimation mainly targets closed-form conditional densities or variances, which requires strong restrictions on the data distribution or dimensionality. To overcome these restrictions, we study conditional generative models for aleatoric uncertainty estimation. We introduce two metrics to measure the discrepancy between two conditional distributions that suit these models. Both metrics can be easily and unbiasedly computed via Monte Carlo simulation of the conditional generative models, thus facilitating their evaluation and training. We demonstrate numerically how our metrics provide correct measurements of conditional distributional discrepancies and can be used to train conditional models competitive against existing benchmarks.

1 Introduction

Uncertainty quantification plays a pivotal role in machine learning systems, especially for downstream decision-making tasks involving reliability analysis and optimization. There are two major types of uncertainty, aleatoric uncertainty and epistemic uncertainty. Aleatoric uncertainty refers to the intrinsic stochasticity of the output given a specific input [21], while epistemic uncertainty accounts for the model uncertainty caused by data and modeling limitations [24]. Most classical machine learning algorithms that focus on mean response prediction primarily address epistemic uncertainty, but aleatoric uncertainty, which describes the distribution of responses beyond summary statistics like the mean, has been gaining importance because of risk and safety-critical considerations.

Existing approaches for aleatoric uncertainty estimation can be largely divided into the following directions: negative log-likelihood (NLL) loss-based estimation, forecaster calibration, and conditional density estimation (CDE). While powerful, these approaches are limited by several drawbacks arising from real-world applications:

  1. Negative Log-Likelihood Loss: In regression tasks, aleatoric uncertainty can be estimated through the conditional mean and variance from models (heteroscedastic neural networks) optimized by the NLL loss [41, 2, 4, 31, 24, 5]. However, this approach requires scalar-type outputs and cannot be easily extended to broader computer vision applications, such as image generation. In addition, the computation of the NLL loss relies on the assumption of a conditional Gaussian or Gaussian-like distribution, which may not hold for real-world datasets.

  2. Forecaster Calibration: In the calibration literature, aleatoric uncertainty estimators are also known as forecasters [12, 26, 49], with multiple definitions of calibration modes [12, 49, 8, 26, 5]. Under these definitions, the ground-truth conditional distribution function is well calibrated, but not vice versa. Thus, some intuitive sharpness criteria are typically applied to avoid trivial forecasters such as the unconditional distribution. However, little is known about how to recover the ground-truth conditional distribution function via calibration, even asymptotically.

  3. Conditional Density Estimation: In CDE-based approaches [20, 22, 52, 45, 46, 6, 7], aleatoric uncertainty is directly calculated by estimating conditional densities in a certain form (such as kernel density). Most CDE methods apply only to low-dimensional responses following absolutely continuous conditional distributions. Moreover, the output of CDE methods is an explicit formula for the conditional density function. Thus, numerical characteristics such as conditional quantiles may be hard to obtain, as they involve numerical integration that is generally difficult to implement, especially in higher-dimensional settings.

To address the above challenges, we study a framework using conditional generative models to estimate aleatoric uncertainty. Compared to previous approaches, conditional generative models [38, 47] are more scalable and flexible, being applicable regardless of the dimension and distribution of the input/output vectors. Moreover, they can easily produce numerical characteristics of the underlying distributions, or other performance estimates, through Monte Carlo methods.

At the core of our framework is the construction of distance metrics between the generative model and the ground-truth distribution, which is required for both model evaluation and training [13, 42, 1]. In particular, we generalize the maximum mean discrepancy (MMD) [14, 34] to the setting of conditional distributions, by constructing two new metrics that we call joint maximum mean discrepancy (JMMD) and average maximum mean discrepancy (AMMD). We derive statistical properties in estimating these metrics and illustrate that both metrics admit easy-to-implement and computationally scalable unbiased estimators. Based on these, we further develop two approaches to optimize conditional generative models suited for different tasks and conduct comprehensive experiments to show the correctness and effectiveness of our framework.

Our approach has the following strengths relative to previous methods: 1) A similar study with conditional MMD can be found in [47], which, as far as we know, is the most relevant work on MMD-based conditional generative models. However, their framework involves unrealistic technical assumptions that may not hold in practice, as well as matrix inversion operations that suffer from instability and scalability issues (see Section 4.1). 2) Both JMMD and AMMD are evaluation metrics that are desirably "distribution-free" (i.e., the data are not assumed to follow any particular type of distribution) and "model-free" (i.e., the evaluation does not involve additional estimated models such as a discriminator). In previous research, the Fréchet Inception Distance (FID) [19, 36] is a standard metric to assess the quality of unconditional generative models. However, the closed-form computation of FID assumes that both the generative model and the data follow multivariate Gaussian distributions. Another commonly used evaluation approach is Indirect Sampling Likelihood (ISL) [3, 13], which computes the NLL under a kernel density fitted to samples from the generative model. However, kernel density estimation deteriorates in quality as dimensionality increases and may fit the generative model poorly. Finally, the loss value on test data is an alternative for performance examination. However, typical losses based on f-divergence or the Wasserstein distance cannot indicate the performance of the generator alone (see Section 2).

2 Related Work

Learning and Evaluation Criteria. Evaluation criteria for generative models against data are typically borrowed from discrepancy measures between two probability distributions in the statistics literature. The latter include two major types: f-divergences and integral probability metrics. The seminal paper [13] used the Jensen-Shannon divergence in its original form, and [42] extended it to general f-divergences motivated by the benefits of other divergence function choices. The computation of integral probability metrics has two important sub-directions, MMD [34, 33] and the Wasserstein distance [1, 15]. Among these criteria, a discriminator is typically needed for approaches based on f-divergence (variational representation) and the Wasserstein distance (dual representation), while it is not required for MMD methods. The loss functions from f-divergence and the Wasserstein distance cannot be directly used to evaluate generative models alone due to their dependency on the quality of discriminators. In addition, other conditional distance measures may encounter challenges when used with generative models. For instance, the NLL value [31] and the CDE value [6] require an explicit form of the model's density function.

Aleatoric Uncertainty in Deep Learning. Besides the directions discussed in Section 1, aleatoric uncertainty in classification tasks can be estimated from the output of softmax layers in deep neural networks [40, 31, 17]. Previous research [16] pointed out that directly using softmax outputs for estimation can be inaccurate, as the softmax probability of the predicted class does not reflect the ground-truth correctness likelihood. The ground-truth conditional mass function has zero calibration error, but not vice versa. Hence forecasters with zero calibration error, which have been studied extensively [30, 37, 29], are not sufficient to recover the ground-truth conditional mass function. The forecasters can be heuristically improved by a second-level metric named sharpness (or refinement error) [30, 27, 28]. Since aleatoric uncertainty in classification can be captured by vector-valued maps such as softmax responses, it is not necessary to use a conditional generative model for this task, and thus we do not focus on classification in this paper.

3 Conditional Generative Models and Maximum Mean Discrepancy

3.1 Conditional Generative Models

In this section, we provide rigorous definitions of conditional generative models. Consider a standard statistical framework where a pair of random vectors $(X,Y)\in\mathcal{X}\times\mathcal{Y}$ follows a joint distribution $P_{X,Y}$ with marginal distributions $X\sim P_X$ and $Y\sim P_Y$. We assume the space $\mathcal{X}\subset\mathbb{R}^d$ with $d\geq 1$, which is allowed to contain either continuous or discrete components. Denote the conditional distribution of $Y$ given $X$ by $P_{Y|X}$. For a given value $x$ of $X$, denote the conditional distribution as $P_{Y|X=x}$. Typically, we regard $X$ as a vector of inputs (examples) and $Y$ as a vector of outputs (labels). For instance, $\mathcal{Y}\subset\mathbb{R}^q$ with $q\geq 1$ in regression and $\mathcal{Y}\subset[K]:=\{1,\ldots,K\}$ in classification. Alternatively, in image generation tasks, $X$ refers to auxiliary information (such as image attributes or labels) and $Y$ refers to the image, in order to keep the notation consistent.

Our goal is to quantify the conditional distribution $P_{Y|X}$ via conditional generative models. More precisely, let $\xi\in\mathbb{R}^m$ be a random vector independent of $X$ with a known distribution $P_\xi$ (specified by the learner), and the goal is to construct a function $G:\mathbb{R}^m\times\mathcal{X}\to\mathcal{Y}$ such that the conditional distribution of $G(\xi,X)|X=x$ is the same as $P_{Y|X=x}$. The following lemma demonstrates the existence of such a function $G$, termed the conditional generative model $G:\mathbb{R}^m\times\mathcal{X}\to\mathcal{Y}$.

Lemma 1 (Adapted from Theorem 5.10 in [23]).

Let $(X,Y)$ be a random pair taking values in $\mathcal{X}\times\mathcal{Y}$ with joint distribution $P_{X,Y}$. Suppose $\mathcal{Y}$ is a standard Borel space. Then there exist a random vector $\xi\sim P_\xi=\text{Uniform}([0,1]^m)$ and a Borel-measurable function $G:\mathbb{R}^m\times\mathcal{X}\to\mathcal{Y}$ such that $\xi$ is independent of $X$ and $(X,Y)=(X,G(\xi,X))$ almost everywhere. In particular, such $G$ satisfies $Y|X=x\sim G(\xi,X)|X=x$ for a.e. $x$ with respect to $P_X$.
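As a standard one-dimensional illustration (not part of the lemma itself), when $\mathcal{Y}\subset\mathbb{R}$ one may take $m=1$ and let $G$ be the conditional quantile function,

$$G(\xi,x)=F^{-1}_{Y|X=x}(\xi),\qquad \xi\sim\text{Uniform}([0,1]),$$

so that $G(\xi,X)|X=x\sim P_{Y|X=x}$ by inverse transform sampling.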

The conditional generative model can provide more information than standard regression models with single-point prediction. In regression problems, the conditional mean can be estimated by taking the sample mean of multiple draws $G(\xi_i,x)$ at $X=x$. Meanwhile, other numerical characteristics of the underlying target distribution, such as the conditional variance and conditional quantiles, can also be estimated by Monte Carlo sampling from the conditional generative model, beyond what single-point prediction could offer.
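For example, such Monte Carlo summaries might be computed as in the following minimal sketch, which assumes a trained generator $G$ and noise $\xi\sim\text{Uniform}([-1,1]^{10})$ as in the experiments of Section 5; the function name and interface are illustrative.

```python
import numpy as np

def conditional_statistics(G, x, num_samples=1000, xi_dim=10, q=0.9, rng=None):
    """Monte Carlo estimates of the conditional mean, variance, and q-quantile
    of Y | X = x from a conditional generative model G(xi, x)."""
    rng = np.random.default_rng() if rng is None else rng
    xi = rng.uniform(-1.0, 1.0, size=(num_samples, xi_dim))
    samples = np.array([G(xi_i, x) for xi_i in xi])   # conditionally i.i.d. draws
    return {
        "mean": samples.mean(axis=0),
        "variance": samples.var(axis=0, ddof=1),
        "quantile": np.quantile(samples, q, axis=0),
    }
```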

In the rest of this paper, we use $P_{Y|X}$ for the ground-truth conditional distribution and $Q_{Y|X}$ for the distribution of the conditional generative model $G(\xi,X)|X$. We denote by $Q_{X,G(\xi,X)}$ the joint distribution of $(X,G(\xi,X))$. For each given $x$, the generative model can generate conditionally independent and identically distributed (i.i.d.) samples $G(\xi_i,x)$ from the conditional distribution $Q_{Y|X=x}$. We parametrize the conditional generative model in a hypothesis class $\{G_\theta(\xi,X):\theta\in\Theta\}$ with parameter $\theta$. To learn $G(\xi,X)|X$ as an estimate of $P_{Y|X}$, we need a metric that quantifies the difference between $G(\xi,X)|X$ and $P_{Y|X}$ using finite training data, which relates to the Two-Sample Test. To this end, we will use the (kernel) maximum mean discrepancy (MMD) [14], described in the next subsection.

3.2 Two-Sample Test via Maximum Mean Discrepancy

We review the standard MMD in the setting of unconditional distributions on $\mathcal{Y}$. Section A provides preliminaries on the reproducing kernel Hilbert space (RKHS). Suppose that $\mathcal{F}_X$ ($\mathcal{F}_Y$) is the RKHS defined on the space $\mathcal{X}$ ($\mathcal{Y}$) with kernel $k_1$ ($k_2$) and feature map $\phi_1$ ($\phi_2$). We adopt the following two basic assumptions throughout this paper (i.e., all theorems make these assumptions without explicit mention). Detailed explanations of Assumptions 1 and 2 can be found in Section A. Gaussian kernels, for instance, satisfy both assumptions.

Assumption 1.

We assume the following: 1) $k_1(\cdot,\cdot)$ is measurable and $\mathbb{E}_{x\sim P_X}[k_1(x,x)]<\infty$. 2) $k_2(\cdot,\cdot)$ is measurable and $\mathbb{E}_{y\sim P_Y}[k_2(y,y)]<\infty$. Moreover, $\mathbb{E}_{y\sim P_{Y|X=x}}[k_2(y,y)|X=x]<\infty$ for any $x\in\mathcal{X}$. In addition, these assumptions also hold when the data distribution $P$ is replaced by the generative distribution $Q$.

Assumption 2.

We assume the following: 1) $k_1$ is characteristic. 2) $k_2$ is characteristic. 3) $k_1\otimes k_2$ is characteristic.

The integral probability metric aims to measure the discrepancy between two distributions. Let $\mathcal{G}$ denote a set of functions $\mathcal{Y}\to\mathbb{R}$. Given two distributions $P_Y$ and $Q_Y$ on $\mathcal{Y}$, the integral probability metric is defined as

$$IPM(P_Y,Q_Y)=\sup_{f\in\mathcal{G}}\big|\mathbb{E}[f(Y)]-\mathbb{E}[f(\hat{Y})]\big|$$

where $Y\sim P_Y$ and $\hat{Y}\sim Q_Y$. MMD is a special case of integral probability metrics, obtained by choosing $\mathcal{G}$ to be the unit ball in the RKHS $\mathcal{F}_Y$. Let $\mu_{P_Y}$ denote the kernel mean embedding of $P_Y$ in $\mathcal{F}_Y$: $\mu_{P_Y}:=\mathbb{E}_{y\sim P_Y}[\phi_2(y)]$. Under Assumption 1, $\mu_{P_Y}$ is guaranteed to be an element of the RKHS $\mathcal{F}_Y$ [14]. With these discussions, the squared MMD distance between $P_Y$ and $Q_Y$ is formally defined as

$$\begin{aligned}
MMD^2(P_Y,Q_Y)&=\sup_{f\in\mathcal{F}_Y,\ \|f\|_{\mathcal{F}_Y}\leq 1}\big|\mathbb{E}[f(Y_1)]-\mathbb{E}[f(\hat{Y}_1)]\big|^2\\
&=\|\mu_{P_Y}-\mu_{Q_Y}\|^2_{\mathcal{F}_Y}=\mathbb{E}[k_2(Y_1,Y_2)]-2\,\mathbb{E}[k_2(Y_1,\hat{Y}_1)]+\mathbb{E}[k_2(\hat{Y}_1,\hat{Y}_2)]
\end{aligned} \tag{1}$$

where $Y_1,Y_2\stackrel{i.i.d.}{\sim}P_Y$ and $\hat{Y}_1,\hat{Y}_2\stackrel{i.i.d.}{\sim}Q_Y$.

Theorem 2 ([14]).

$MMD^2(P_Y,Q_Y)\geq 0$, and $MMD^2(P_Y,Q_Y)=0$ if and only if $P_Y=Q_Y$.

Suppose we have data $y_i\stackrel{i.i.d.}{\sim}P_Y$ ($i\in[n]$) and $\hat{y}_j\stackrel{i.i.d.}{\sim}Q_Y$ ($j\in[m]$). Then a standard unbiased estimator of $MMD^2(P_Y,Q_Y)$ [14] is

$$\mathcal{L}^u_{MMD^2}=\frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{\substack{i'=1\\ i'\neq i}}^{n}k_2(y_i,y_{i'})-\frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}k_2(y_i,\hat{y}_j)+\frac{1}{m(m-1)}\sum_{j=1}^{m}\sum_{\substack{j'=1\\ j'\neq j}}^{m}k_2(\hat{y}_j,\hat{y}_{j'}).$$
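For concreteness, a minimal NumPy sketch of this unbiased estimator is given below; the Gaussian kernel, its bandwidth, and the function names are illustrative choices of ours rather than prescriptions of the paper.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Pairwise Gaussian Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd2_unbiased(y, y_hat, kernel=gaussian_kernel):
    """Unbiased estimator of MMD^2 for samples y ~ P_Y and y_hat ~ Q_Y (2-D arrays)."""
    n, m = len(y), len(y_hat)
    k_yy, k_yh, k_hh = kernel(y, y), kernel(y, y_hat), kernel(y_hat, y_hat)
    # Diagonals are excluded from the within-sample sums, which removes the bias.
    term_pp = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    term_qq = (k_hh.sum() - np.trace(k_hh)) / (m * (m - 1))
    return term_pp - 2 * k_yh.mean() + term_qq
```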

4 Generalization to Conditional Two-Sample Test

In this section, we generalize MMD to the conditional Two-Sample Test. We first explain the limitations of conditional maximum mean discrepancy (CMMD), the state-of-the-art approach for using MMD with conditional models [47]. Then, we introduce two metrics, JMMD and AMMD, which bypass the strong restrictions and biased estimation of CMMD. We also present the connections and comparisons among these metrics, and describe how to use them to construct conditional generative models.

4.1 Previous Work on Conditional Maximum Mean Discrepancy

In [47], conditional generative moment-matching networks (CGMMNs) were developed for conditional distribution generation. In particular, they leveraged previous work on the conditional mean embedding $C_{P_{Y|X}}$ of the conditional distribution [51, 9, 50, 39] (Section A provides a review of this topic). They used the discrepancy between $C_{P_{Y|X=x}}$ and $C_{Q_{Y|X=x}}$ to measure the difference between two conditional distributions, termed CMMD, which is defined formally as:

$$CMMD^2=\|C_{P_{Y|X}}-C_{Q_{Y|X}}\|^2_{\mathcal{F}_X\otimes\mathcal{F}_Y} \tag{2}$$

where $P$ represents the ground-truth data distribution and $Q$ represents the generative distribution. The estimator of $CMMD^2$ developed in [47] is as follows:

$$\mathcal{L}_{C^2}(P,Q)=\|\tilde{C}_{P_{Y|X}}-\tilde{C}_{Q_{Y|X}}\|^2_{\mathcal{F}_X\otimes\mathcal{F}_Y}, \tag{3}$$
$$\tilde{C}_{P_{Y|X}}=\tilde{C}_{P_{YX}}(\tilde{C}_{P_{XX}}+\lambda I)^{-1}=\Phi_2(K_X+\lambda nI)^{-1}\Phi_1 \tag{4}$$

where $\Phi_2=(\phi_2(y_1),\ldots,\phi_2(y_n))$, $\Phi_1=(\phi_1(x_1),\ldots,\phi_1(x_n))$, $K_X=\Phi_1^\top\Phi_1$, and $I$ is the identity matrix.

While [47] is the most relevant study to our problem setting, directly applying CMMD to aleatoric uncertainty estimation has the following limitations:

Computationally Expensive: The matrix inverse in the estimator is computationally expensive for practical implementation. The running time of a single inversion in one iteration is of order $O(B^3)$, where $B$ is the batch size. Meanwhile, the batch size should be sufficiently large to achieve good performance for generative models [34].

Strong Technical Assumptions and Existence of Inversion: 1) The existence of the conditional mean embedding operator $C_{P_{Y|X}}$ typically requires strong assumptions: $\forall g\in\mathcal{F}_Y$, $\mathbb{E}_{P_{Y|X}}[g(Y)|X]\in\mathcal{F}_X$. This assumption is not necessarily true for continuous domains [51], and simple counterexamples using the Gaussian kernel can be found [9]. 2) In general, $\tilde{C}^{-1}_{XX}$ does not exist when $\mathcal{F}_X$ is infinite-dimensional, since $\tilde{C}_{XX}$ is a compact operator and thus has arbitrarily small positive eigenvalues [39]. When the matrix is singular, the matrix inversion can be unstable, and the performance of the estimator $\tilde{C}_{P_{Y|X}}$ in [47] may be degraded after adding $\lambda I$ to avoid the singularity. 3) Even though the first two points can be alleviated to some extent [43], the CMMD metric (2) is well-defined only if $C_{P_{Y|X}},C_{Q_{Y|X}}\in\mathcal{F}_X\otimes\mathcal{F}_Y$. However, this requires a much stronger assumption than the existence of $\tilde{C}^{-1}_{XX}$ (see Assumption 6 in Section A).

Bias: CMMD does not admit an obvious unbiased estimator. The estimator $\tilde{C}_{P_{Y|X}}$ in (4) is biased, even asymptotically if $\lambda$ is fixed [51].

To bypass the above limitations, we propose two alternative metrics that require only basic assumptions on the existence of the cross-covariance operator and the characteristic property of the kernels (Assumptions 1 and 2). In particular, we do not require the existence of the inverse of any operator or matrix, which makes our metrics easy to implement in real-world applications.

4.2 Average Maximum Mean Discrepancy (AMMD)

We first introduce a rather straightforward approach, which we term the AMMD metric. AMMD shows better potential for multi-output problems (such as image generation) where the data consist of i.i.d. inputs $x_i$ with conditionally independent outputs $y_{i,j}$ at each $x_i$; see Section C for a more detailed discussion. In AMMD, at each $x$, we use (1) to build an unbiased estimator of the MMD between the conditional distributions of $Y|X=x$. These estimators are then averaged with respect to the marginal $P_X$. More specifically, we define

$$AMMD^2(P,Q)=\mathbb{E}_{x\sim P_X}\big[MMD_{X=x}^2(P_{Y|X=x},Q_{Y|X=x})\big]$$

where

$$\begin{aligned}
MMD_{X=x}^2(P_{Y|X=x},Q_{Y|X=x})&:=\|\mu_{P_{Y|X=x}}-\mu_{Q_{Y|X=x}}\|^2_{\mathcal{F}_Y}\\
&=\mathbb{E}[k_2(Y^x_1,Y^x_2)|X=x]-2\,\mathbb{E}[k_2(Y^x_1,\hat{Y}^x_1)|X=x]+\mathbb{E}[k_2(\hat{Y}^x_1,\hat{Y}^x_2)|X=x]
\end{aligned} \tag{5}$$

is a function of $x$, and for fixed $x$, $Y^x_1,Y^x_2\stackrel{i.i.d.}{\sim}P_{Y|X=x}$ and $\hat{Y}^x_1,\hat{Y}^x_2\stackrel{i.i.d.}{\sim}Q_{Y|X=x}$. Note that $\mu_{P_{Y|X=x}},\mu_{Q_{Y|X=x}}\in\mathcal{F}_Y$ are guaranteed by Assumption 1, so $MMD_{X=x}^2$ is well-defined. Hence $AMMD^2(P,Q)$ is also well-defined as the expectation of a non-negative measurable function.

Remark 3.

In (5), $Y^x_1,Y^x_2$ are drawn in a conditionally independent manner for each $x$. This is not equivalent to globally drawing two unconditionally independent samples $Y_1,Y_2$ and considering $Y_1|X=x,Y_2|X=x$ for each $x$, because the latter are not conditionally independent in general. Therefore, in general, $\mathbb{E}_{x\sim P_X}\big[\mathbb{E}[k_2(Y^x_1,Y^x_2)|X=x]\big]\neq\mathbb{E}[k_2(Y_1,Y_2)]$.

Theorem 4.

$AMMD^2(P,Q)\geq 0$, and $AMMD^2(P,Q)=0$ if and only if $P_{Y|X=x}=Q_{Y|X=x}$ for a.e. $x$ with respect to $P_X$.

Theorem 4 shows that $AMMD^2(P,Q)$ offers a metric to measure the discrepancy between $P_{Y|X}$ and $Q_{Y|X}$. Next, we propose a Monte Carlo estimator of $AMMD^2$ for conditional generative models:

  1. Take a batch $\{(x_i,y_{i,l}):i\in[n],l\in[r]\}$ from $P$ of batch size $rn$, where $y_{i,l}$ ($l\in[r]$) are the outputs at the same $x_i$. Here, $r$ is restricted by the specification of the task: $r=1$ in single-output problems, but $r$ can be greater than $1$ in multi-output problems such as image generation; $n\geq 1$.

  2. Generate a batch $\{(x_i,G(\xi_{i,j},x_i)):i\in[n],j\in[m]\}$ from $Q$ of batch size $mn$, where $\xi_{1,1},\ldots,\xi_{n,m}$ are i.i.d. and independent of $x_1,\ldots,x_n$; $m\geq 2$.

  3. Compute
$$\begin{aligned}
\hat{A^2}(P,Q)=\frac{1}{n}\sum_{i=1}^{n}\Big(&-\frac{2}{mr}\sum_{j=1}^{m}\sum_{l=1}^{r}k_2\big(y_{i,l},G(\xi_{i,j},x_i)\big)\\
&+\frac{1}{m(m-1)}\sum_{j=1}^{m}\sum_{\substack{j'=1\\ j'\neq j}}^{m}k_2\big(G(\xi_{i,j},x_i),G(\xi_{i,j'},x_i)\big)\Big)
\end{aligned} \tag{6}$$

The next theorem establishes the error analysis of the estimator $\hat{A^2}(P,Q)$.

Theorem 5.

$\hat{A^2}(P,Q)$ is an unbiased estimator of $AMMD^2(P,Q)-C_0$, where $C_0$ is a constant independent of $Q$ given by $C_0=\mathbb{E}_{x\sim P_X}\big[\mathbb{E}[k_2(Y^x_1,Y^x_2)|X=x]\big]$. Moreover, the variance of $\hat{A^2}(P,Q)$ is $O\big(\frac{1}{n\min\{m,r\}}\big)+\frac{1}{n}K_0$, where $K_0=\text{Var}_{x\sim P_X}\big[-2\,\mathbb{E}[k_2(Y^x_1,\hat{Y}^x_1)|X=x]+\mathbb{E}[k_2(\hat{Y}^x_1,\hat{Y}^x_2)|X=x]\big]$ is independent of $n,m,r$. (The explicit formula for the variance is given in Section B.)

$C_0$ in Theorem 5 is free of the conditional generative model and thus does not need to be computed in the training/evaluation phase. This is in the same spirit as NLL, which is formed by a free-of-model constant plus the Kullback–Leibler divergence between the model and the data. Theorem 5 shows that if $n$ is not allowed to be large (e.g., $n$ is bounded above by the number of classes in label-based image generation problems), the variance of the estimator $\hat{A^2}(P,Q)$ should be reduced by increasing $m$ and $r$. On the other hand, if $n$ is allowed to be large (e.g., in regression problems with continuous features), then given a fixed computational budget, we should increase $n$ while maintaining the smallest possible values $m=2$ and $r=1$ to reduce the variance of $\hat{A^2}(P,Q)$ efficiently.
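As an illustration of how (6) can be computed in practice, a short NumPy sketch follows; the kernel argument is any function returning a pairwise Gram matrix (e.g., the Gaussian kernel sketch in Section 3.2), and the $\text{Uniform}([-1,1]^{10})$ noise mirrors the experimental setup in Section 5 rather than being required by the estimator.

```python
import numpy as np

def ammd2_hat(x_batch, y_batch, G, kernel, m=2, xi_dim=10, rng=None):
    """Estimate \hat{A^2}(P, Q) in (6).
    x_batch: (n, d) inputs; y_batch: (n, r, q) with r conditionally independent
    real outputs per input (r = 1 for single-output data).
    G(xi, x) returns one sample from Q_{Y|X=x}; kernel(A, B) returns a Gram matrix."""
    rng = np.random.default_rng() if rng is None else rng
    n = x_batch.shape[0]
    total = 0.0
    for i in range(n):
        xi = rng.uniform(-1.0, 1.0, size=(m, xi_dim))
        y_fake = np.stack([G(xi_j, x_batch[i]) for xi_j in xi])   # (m, q)
        k_rf = kernel(y_batch[i], y_fake)    # real-vs-fake block, shape (r, m)
        k_ff = kernel(y_fake, y_fake)        # fake-vs-fake block, shape (m, m)
        total += -2.0 * k_rf.mean() \
                 + (k_ff.sum() - np.trace(k_ff)) / (m * (m - 1))
    return total / n
```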

4.3 Joint Maximum Mean Discrepancy (JMMD)

We then introduce the JMMD metric, which is based on the joint distribution. Compared with AMMD, JMMD is more suitable for single-output tasks (such as regression) where the data consist of jointly i.i.d. samples $(x_i,y_i)$; see Section C for a more detailed discussion. According to the observation in Lemma 1, matching $Q_{Y|X=x}$ (the conditional distribution of $G(\xi,X)|X=x$) with $P_{Y|X=x}$ for a.e. $x$ can be transferred to matching $Q_{X,Y}$ (the joint distribution of $(X,G(\xi,X))$) with $P_{X,Y}$. This motivates us to define the following metric, which we term JMMD:

$$\begin{aligned}
JMMD^2(P,Q)&=MMD^2(P_{X,Y},Q_{X,Y})\\
&=\mathbb{E}[k_3((X_1,Y_1),(X_2,Y_2))]-2\,\mathbb{E}[k_3((X_1,Y_1),(\hat{X}_1,\hat{Y}_1))]+\mathbb{E}[k_3((\hat{X}_1,\hat{Y}_1),(\hat{X}_2,\hat{Y}_2))]
\end{aligned}$$

where $k_3=k_1\otimes k_2$ is the kernel of the tensor product space $\mathcal{F}_X\otimes\mathcal{F}_Y$, $(X_1,Y_1),(X_2,Y_2)\stackrel{i.i.d.}{\sim}P_{X,Y}$, and $(\hat{X}_1,\hat{Y}_1),(\hat{X}_2,\hat{Y}_2)\stackrel{i.i.d.}{\sim}Q_{X,Y}$. Note that $JMMD^2(P,Q)$ can be viewed alternatively as the discrepancy between the cross-covariance operators $C_{P_{YX}}$ and $C_{Q_{YX}}$ defined on the tensor product space $\mathcal{F}_X\otimes\mathcal{F}_Y$. Since $C_{P_{YX}},C_{Q_{YX}}\in\mathcal{F}_X\otimes\mathcal{F}_Y$ is guaranteed by Assumption 1, $JMMD^2(P,Q)$ is well-defined (see Section A for more details).

Theorem 6.

$JMMD^2(P,Q)\geq 0$, and $JMMD^2(P,Q)=0$ if and only if $P_{Y|X=x}=Q_{Y|X=x}$ for a.e. $x$ with respect to $P_X$.

Theorem 6 shows that $JMMD^2(P,Q)$ offers a metric to measure the discrepancy between $P_{Y|X}$ and $Q_{Y|X}$. In parallel to AMMD, we propose a Monte Carlo estimator of $JMMD^2$ for conditional generative models: Take a batch of samples $\{(x_l,y_l):l\in[r]\}$ from $P$ of batch size $r$; $r\geq 2$. Generate a batch $\{(\hat{x}_j,G(\xi_j,\hat{x}_j)):j\in[m]\}$ from $Q$ of batch size $m$, where $\xi_1,\ldots,\xi_m$ are i.i.d. and independent of $x_1,\ldots,x_r,\hat{x}_1,\ldots,\hat{x}_m$; $m\geq 2$. Compute

$$\begin{aligned}
\hat{J^2}(P,Q)=&-\frac{2}{mr}\sum_{j=1}^{m}\sum_{l=1}^{r}k_3\big((x_l,y_l),(\hat{x}_j,G(\xi_j,\hat{x}_j))\big)\\
&+\frac{1}{m(m-1)}\sum_{j=1}^{m}\sum_{\substack{j'=1\\ j'\neq j}}^{m}k_3\big((\hat{x}_j,G(\xi_j,\hat{x}_j)),(\hat{x}_{j'},G(\xi_{j'},\hat{x}_{j'}))\big)
\end{aligned} \tag{7}$$

The next theorem establishes the error analysis of the estimator $\hat{J^2}(P,Q)$.

Theorem 7.

$\hat{J^2}(P,Q)$ is an unbiased estimator of $JMMD^2(P,Q)-C_1$, where $C_1$ is a constant independent of $Q$ given by $C_1=\mathbb{E}[k_3((X_1,Y_1),(X_2,Y_2))]$. Moreover, the variance of $\hat{J^2}(P,Q)$ is $O\big(\frac{1}{\min\{m,r\}}\big)$. (The explicit formula for the variance is given in Section B.)

$C_1$ in Theorem 7 is free of the conditional generative model. Theorem 7 shows that the variance of $\hat{J^2}(P,Q)$ decreases at the order of $\frac{1}{\min\{m,r\}}$. Therefore, given a fixed computational budget $B$, we should set $m=\Theta(r)=\Theta(\sqrt{B})$ to achieve the minimum variance of the estimator $\hat{J^2}(P,Q)$.
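A corresponding NumPy sketch of (7) is shown below; it assumes the product kernel $k_3((x,y),(x',y'))=k_1(x,x')\,k_2(y,y')$ built from two user-supplied Gram-matrix functions (e.g., Gaussian kernels), and again draws noise from $\text{Uniform}([-1,1]^{10})$ as in Section 5.

```python
import numpy as np

def jmmd2_hat(x_real, y_real, x_cond, G, k1, k2, xi_dim=10, rng=None):
    """Estimate \hat{J^2}(P, Q) in (7).
    (x_real, y_real): r real pairs from P; x_cond: m conditioning inputs for Q.
    k3 is taken as the product of k1 (on X) and k2 (on Y)."""
    rng = np.random.default_rng() if rng is None else rng
    m = x_cond.shape[0]
    xi = rng.uniform(-1.0, 1.0, size=(m, xi_dim))
    y_fake = np.stack([G(xi_j, x_j) for xi_j, x_j in zip(xi, x_cond)])
    # k3 between real pairs (x_l, y_l) and generated pairs (x_hat_j, G(xi_j, x_hat_j)).
    k3_cross = k1(x_real, x_cond) * k2(y_real, y_fake)       # shape (r, m)
    # k3 among generated pairs, with the diagonal removed for unbiasedness.
    k3_fake = k1(x_cond, x_cond) * k2(y_fake, y_fake)        # shape (m, m)
    return -2.0 * k3_cross.mean() \
           + (k3_fake.sum() - np.trace(k3_fake)) / (m * (m - 1))
```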

Algorithm 1 Algorithm Framework of A-CGM

Input: Training dataset $\mathcal{D}=\{(x_i,y_i):i\in\mathcal{I}\}$.
Output: Finalized parameters $\theta$ of the generative model $G_\theta(\xi,x)$.

1: Randomly divide the training dataset $\mathcal{D}$ into mini-batches.
2: for $t=0,\ldots,T-1$ do
3:     Set $\mathcal{B}^G=\emptyset$
4:     for each mini-batch $\mathcal{B}$ in $\mathcal{D}$ do
5:         for each $x\in\mathcal{B}$ do
6:             Draw multiple i.i.d. copies $\xi_1,\ldots,\xi_m$ from $P_\xi$
7:             Generate conditional samples by forward-propagating through $G_\theta(\xi_j,x)$
8:             Add $(x,G_\theta(\xi_1,x)),\ldots,(x,G_\theta(\xi_m,x))$ to $\mathcal{B}^G$
9:         end for
10:         Optimize $\theta$ with $\hat{A^2}(P,Q_\theta)$ in (6) based on $\mathcal{B}$ and $\mathcal{B}^G$
11:     end for
12: end for
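The following PyTorch sketch mirrors Algorithm 1 at a high level; the generator interface, the kernel bandwidth, and the data layout are our own assumptions, and for simplicity $\theta$ is updated once per mini-batch using only the samples generated from that mini-batch.

```python
import torch

def gaussian_k2(a, b, bandwidth=1.0):
    # Pairwise Gaussian kernel matrix, differentiable in both arguments.
    return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))

def train_a_cgm(gen, loader, epochs=100, m=2, xi_dim=10, lr=5e-4):
    """Schematic A-CGM loop: minimize the mini-batch estimate of \hat{A^2}(P, Q_theta)."""
    opt = torch.optim.Adam(gen.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:                          # x: (n, d), y: (n, r, q)
            n = x.shape[0]
            xi = torch.rand(n, m, xi_dim) * 2 - 1    # Uniform([-1, 1]^10) noise
            x_rep = x.unsqueeze(1).expand(-1, m, -1)
            y_fake = gen(xi, x_rep)                  # assumed interface: (n, m, q)
            loss = 0.0
            for i in range(n):
                k_rf = gaussian_k2(y[i], y_fake[i])
                k_ff = gaussian_k2(y_fake[i], y_fake[i])
                loss = loss - 2 * k_rf.mean() \
                       + (k_ff.sum() - k_ff.diagonal().sum()) / (m * (m - 1))
            loss = loss / n
            opt.zero_grad()
            loss.backward()
            opt.step()
    return gen
```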

4.4 Connections and Comparisons among Metrics

We establish the theoretical connections among CMMD, JMMD, and AMMD as follows.

Theorem 8.

Suppose that $AMMD$ and $JMMD$ are well-defined. Then we have

$$JMMD^2\leq\mathbb{E}_{x\sim P_X}[k_1(x,x)]\cdot AMMD^2.$$
Theorem 9.

Suppose that $AMMD$ is well-defined. Moreover, suppose that for all $g\in\mathcal{F}_Y$, $\mathbb{E}_{P_{Y|X}}[g(Y)|X]\in\mathcal{F}_X$ and $\mathbb{E}_{Q_{Y|X}}[g(Y)|X]\in\mathcal{F}_X$, so that the conditional mean embeddings $C_{P_{Y|X}}$, $C_{Q_{Y|X}}$ are well-defined. Furthermore, we assume $C_{P_{Y|X}}\in\mathcal{F}_X\otimes\mathcal{F}_Y$ and $C_{Q_{Y|X}}\in\mathcal{F}_X\otimes\mathcal{F}_Y$, so that CMMD (2) is well-defined. Then we have

$$AMMD^2\leq\mathbb{E}_{x\sim P_X}[k_1(x,x)]\cdot CMMD^2.$$

We further highlight the strengths of AMMD and JMMD for conditional generative model evaluation, which is a challenging task due to the delicacies of evaluation at the conditional distribution level. First, both metrics are "distribution-free", i.e., neither the data nor the conditional generative model is restricted to a specific type of distribution. In contrast, FID, for instance, requires a Gaussian assumption. Second, they are "model-free", i.e., their evaluation does not involve additional estimated models beyond the conditional generative model itself, such as kernel density estimators in Indirect Sampling Likelihood [3, 13] or estimated discriminators [42, 1].

4.5 Conditional Generative Model Construction

With the evaluation metrics in place, we present two corresponding deep-learning-based methods to construct conditional generative models. Our approaches are named J-CGM and A-CGM, targeting JMMD and AMMD, respectively, for different tasks. For performance measurement, the values of JMMD and AMMD are estimated by drawing samples from the generative model optimized in J-CGM/A-CGM. Denote by $G_\theta(\xi,X)$ the generative model optimized in J-CGM/A-CGM with parameters $\theta$. Note that $G_\theta(\xi,X)$ takes both the given conditional variable $X$ and the extra random vector $\xi$ as inputs. Let $Q_\theta$ be the joint distribution of $(X,G_\theta(\xi,X))$. A detailed step-by-step pseudo-code of A-CGM is listed in Algorithm 1. A similar procedure for J-CGM is presented in Algorithm 2 in Section C.

5 Experiments

Experimental Setup. We empirically verify the effectiveness of our proposed approaches on both regression and image generation tasks. In both tasks, we compare the performance of our approaches with the state-of-the-art MMD-based conditional generative model, CGMMN [47]. For regression, our experiments are conducted on the following widely used real-world benchmark datasets: Boston, Concrete, Energy, Wine, Yacht, Kin8nm, Protein, and CCPP [18, 10, 31, 44]. Besides the JMMD and AMMD values, we also report the scores of FID [19, 36], which is a standard metric for assessing the quality of generative models. In our experiments, FID is computed based on the joint distribution of $(X,Y)$, since it is originally defined in the unconditional sense. In the label-based image generation task, we adopt the benchmark dataset MNIST [32] to evaluate the correctness of our framework. In this task, $X$ is the label of the image and the generative model $G_\theta(\xi,X)$ is expected to output random image samples with the attributes of class $X$. We provide visuals to directly show the generation performance of different approaches. All experiments are conducted on a GeForce RTX 2080 Ti GPU. More experimental results are presented in Section C.
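A computation of FID on the joint pair $(X,Y)$ might be sketched as follows; fitting Gaussians directly to the raw joint samples is only our assumption of the procedure, since the paper does not spell out the preprocessing here.

```python
import numpy as np
from scipy.linalg import sqrtm

def joint_fid(real_xy, fake_xy):
    """Frechet distance between Gaussian fits of joint (X, Y) samples,
    each array of shape (num_samples, d + q)."""
    mu_r, mu_f = real_xy.mean(axis=0), fake_xy.mean(axis=0)
    cov_r = np.cov(real_xy, rowvar=False)
    cov_f = np.cov(fake_xy, rowvar=False)
    cov_sqrt = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(cov_sqrt):      # numerical noise can yield tiny imaginary parts
        cov_sqrt = cov_sqrt.real
    return float(((mu_r - mu_f) ** 2).sum()
                 + np.trace(cov_r + cov_f - 2 * cov_sqrt))
```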

5.1 Aleatoric Uncertainty in Regression

| Dataset | CGMMN JMMD | CGMMN AMMD | CGMMN FID | J-CGM JMMD | J-CGM AMMD | J-CGM FID | A-CGM JMMD | A-CGM AMMD | A-CGM FID |
|---|---|---|---|---|---|---|---|---|---|
| Boston | 1.92e-2 | 8.41e-5 | 1.35e-2 | 2.92e-4 | 8.49e-4 | 7.84e-3 | 3.17e-4 | 8.74e-5 | 5.20e-3 |
| Concrete | 9.57e-3 | 1.67e-4 | 8.64e-3 | 1.53e-4 | 1.78e-4 | 9.86e-3 | 2.47e-4 | 1.99e-5 | 9.74e-3 |
| Energy | 1.87e-2 | 1.40e-4 | 7.96e-3 | 3.16e-4 | 9.45e-4 | 9.16e-3 | 3.09e-4 | 1.23e-4 | 9.38e-3 |
| Wine | 1.09e-2 | 3.02e-4 | 1.01e-2 | 1.14e-4 | 3.61e-4 | 9.94e-3 | 1.16e-4 | 2.72e-4 | 9.86e-3 |
| Yacht | 1.28e-2 | 5.76e-5 | 1.11e-2 | 6.60e-4 | 4.75e-4 | 1.13e-2 | 1.67e-4 | 4.63e-5 | 7.68e-3 |
| Kin8nm | 1.00e-2 | 1.46e-3 | 1.41e-2 | 1.10e-4 | 1.39e-3 | 9.69e-3 | 1.01e-4 | 1.38e-3 | 9.76e-3 |
| Protein | 9.20e-3 | 7.87e-3 | 9.83e-3 | 7.08e-5 | 8.08e-3 | 9.81e-3 | 8.60e-5 | 2.18e-3 | 1.00e-2 |
| CCPP | 1.64e-2 | 2.42e-3 | 9.99e-3 | 2.91e-4 | 2.24e-3 | — | 2.57e-4 | 1.59e-3 | 1.02e-2 |

Table 1: Conditional generative models in regression. Best results are in bold.

Kernel Selections. As the regression data are low-dimensional, we choose $k_1$ and $k_2$ to be standard Gaussian kernels $k_1(x_1,x_2):=\exp\big(-\frac{1}{2}\|x_1-x_2\|_2^2\big)$ and $k_2(y_1,y_2):=\exp\big(-\frac{1}{2}\|y_1-y_2\|_2^2\big)$. Note that they readily satisfy both Assumptions 1 and 2.

Implementation Details. For regression tasks, we apply a simple network architecture with 2 hidden layers to avoid overfitting. We use the ReLU function as the activation function, and the number of neurons in each hidden layer is 32. The input of the generative model is the concatenation of two vectors: the covariate vector $X$ and the extra random vector $\xi$ following a $10$-dimensional uniform distribution $\text{Uniform}([-1,1]^{10})$. Our network is optimized by the Adam optimizer [25] with learning rate 0.0005. For the preprocessing step, we follow the same experimental procedure as [44] for data normalization and dataset splitting. The AMMD evaluation omits the free-of-model constant $C_0$, as justified by Theorem 5.
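A minimal PyTorch sketch of the regression generator described above is given below; the module name and the exact interface are our own conventions rather than the released implementation.

```python
import torch
import torch.nn as nn

class RegressionGenerator(nn.Module):
    """G_theta(xi, x): concatenates the covariate x with the noise xi and maps it
    through two hidden layers of 32 ReLU units to a q-dimensional output."""
    def __init__(self, x_dim, xi_dim=10, y_dim=1, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + xi_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, y_dim),
        )

    def forward(self, xi, x):
        return self.net(torch.cat([x, xi], dim=-1))

# Example: one conditional draw at a batch of covariates.
# x = torch.randn(16, 8); xi = torch.rand(16, 10) * 2 - 1
# y_sample = RegressionGenerator(x_dim=8)(xi, x)
```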

We compare the performance of our proposed J-CGM and A-CGM with the state-of-the-art baseline CGMMN [47] on multiple real-world benchmark regression datasets. Table 1 reports the evaluation metrics of the different models on the testing data. As shown, J-CGM and A-CGM achieve competitive performance under the AMMD and JMMD evaluation criteria, while A-CGM tends to produce better results on AMMD. In contrast, although CGMMN can produce satisfactory results on AMMD, it generally underperforms on JMMD. Under the FID criterion, J-CGM and A-CGM achieve slightly better results than CGMMN. However, note that FID is a heuristic criterion without the statistical properties we develop for AMMD and JMMD and is thus less reliable.

5.2 Aleatoric Uncertainty in Image Generation

Figure 1: Random conditional samples generated by different approaches.

Kernel Selections. As pointed out by previous studies on deep kernels [34, 33, 53, 35, 11], for complicated and high-dimensional real-world data, a kernel test using a simple kernel such as the Gaussian kernel should be conducted on the code/feature space instead of the original data space, in order to provide stronger signals for discrepancy measurement between high-dimensional distributions. Following this guidance, we apply an auto-encoder [48] to learn representative features of the input images in the preprocessing step. Precisely, suppose the pre-trained auto-encoder network is given by $B_{\omega'}\circ A_\omega$, where $A_\omega:\mathcal{Y}\to\hat{\mathcal{Y}}$ is the encoder network with $\hat{\mathcal{Y}}$ being the lower-dimensional code space and $B_{\omega'}:\hat{\mathcal{Y}}\to\mathcal{Y}$ is the decoder network. We use the following feature-aware deep kernel on the image space $\mathcal{Y}$:
$$k_2(y_1,y_2)=\Big((1-\epsilon_0)\,\kappa_1\big(A_\omega(y_1),A_\omega(y_2)\big)+\epsilon_0\Big)\,\kappa_2(y_1,y_2),$$
where $\kappa_1$ is a Gaussian kernel defined on the code space $\hat{\mathcal{Y}}$, $\kappa_2$ is a Gaussian kernel defined on the original image space $\mathcal{Y}$, and $\epsilon_0\in(0,1)$ is introduced to ensure that $k_2(y_1,y_2)$ is a characteristic kernel [35, 11]. We set $k_1$ to be a standard Gaussian kernel since $\mathcal{X}$ is low-dimensional.
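A small sketch of this feature-aware kernel is shown below; the encoder handle, the Gram-matrix functions $\kappa_1,\kappa_2$, and the value of $\epsilon_0$ are placeholders to be supplied by the user (e.g., Gaussian kernels as in the sketch of Section 3.2).

```python
def deep_kernel_k2(y1, y2, encoder, kappa1, kappa2, eps0=0.1):
    """Feature-aware kernel on the image space:
    k2(y, y') = ((1 - eps0) * kappa1(A(y), A(y')) + eps0) * kappa2(y, y'),
    where `encoder` stands in for the pre-trained encoder A_omega."""
    code1, code2 = encoder(y1), encoder(y2)
    return ((1 - eps0) * kappa1(code1, code2) + eps0) * kappa2(y1, y2)
```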

Corresponding to our kernel, we now assume that all conditional generative models output samples in the code space for the convenience of MMD tests: $G_\theta(\xi,X):\mathbb{R}^m\times\mathcal{X}\to\hat{\mathcal{Y}}$. The generated image is then given by $B_{\omega'}\circ G_\theta(\xi,X)$.

Implementation Details. In the auto-encoder network, the encoder and decoder networks $A_\omega$ and $B_{\omega'}$ each have a single hidden layer with 1024 neurons. The dimension of the code space $\hat{\mathcal{Y}}$ is 32. The generative network has 3 hidden layers with the ReLU function as the activation function; the numbers of neurons in the hidden layers are 64, 256, and 256. The generative network takes two vectors as input: the one-hot encoding of the label $X$ and the extra random vector $\xi$ following a $10$-dimensional uniform distribution $\text{Uniform}([-1,1]^{10})$. The generative network is optimized by the Adam optimizer [25] with learning rate 0.001.

In Figure 1, we show a few random conditional samples of the reconstructed images from A-CGM, J-CGM, and CGMMN. Overall, all models can generate clear and recognizable samples of handwritten digits. In particular, the reconstructed images from J-CGM are more diverse, with multiple writing styles, while those from A-CGM are more clearly distinct. These results demonstrate the effectiveness of our approaches on multiple real-world applications.

6 Conclusions

In this paper, we study the feasibility of leveraging conditional generative models for aleatoric uncertainty estimation. With theoretical justification, we propose two metrics for discrepancy measurement between two conditional distributions and demonstrate that both metrics can be easily and unbiasedly computed via Monte Carlo simulation. Experimental evaluations on multiple tasks corroborate our theory and further demonstrate the effectiveness of our approaches in real-world applications. Our study explores a new direction for aleatoric uncertainty estimation, which overcomes several limitations of previous research. In the future, we will extend our approaches for aleatoric uncertainty estimation to more real-world applications, such as super-resolution image generation.

References

  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.
  • [2] C. M. Bishop. Mixture density networks. 1994.
  • [3] O. Breuleux, Y. Bengio, and P. Vincent. Quickly generating representative samples from an rbm-derived process. Neural Computation, 23(8):2058–2073, 2011.
  • [4] G. C. Cawley, N. L. Talbot, and O. Chapelle. Estimating predictive variances with kernel ridge regression. In Machine Learning Challenges Workshop, pages 56–77. Springer, 2005.
  • [5] P. Cui, W. Hu, and J. Zhu. Calibrated reliable regression using maximum mean discrepancy. Advances in Neural Information Processing Systems, 33:17164–17175, 2020.
  • [6] N. Dalmasso, T. Pospisil, A. B. Lee, R. Izbicki, P. E. Freeman, and A. I. Malz. Conditional density estimation tools in python and r with applications to photometric redshifts and likelihood-free cosmological inference. Astronomy and Computing, 30:100362, 2020.
  • [7] V. Dutordoir, H. Salimbeni, J. Hensman, and M. Deisenroth. Gaussian process conditional density estimation. Advances in Neural Information Processing Systems, 31, 2018.
  • [8] M. Fasiolo, S. N. Wood, M. Zaffran, R. Nedellec, and Y. Goude. Fast calibrated additive quantile regression. Journal of the American Statistical Association, 116(535):1402–1412, 2021.
  • [9] K. Fukumizu, L. Song, and A. Gretton. Kernel bayes’ rule: Bayesian inference with positive definite kernels. The Journal of Machine Learning Research, 14(1):3753–3783, 2013.
  • [10] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
  • [11] R. Gao, F. Liu, J. Zhang, B. Han, T. Liu, G. Niu, and M. Sugiyama. Maximum mean discrepancy test is aware of adversarial attacks. In International Conference on Machine Learning, pages 3564–3575. PMLR, 2021.
  • [12] T. Gneiting, F. Balabdaoui, and A. E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268, 2007.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
  • [14] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • [15] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. Advances in Neural Information Processing Systems, 30, 2017.
  • [16] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
  • [17] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
  • [18] J. M. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
  • [19] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
  • [20] M. P. Holmes, A. G. Gray, and C. L. Isbell. Fast nonparametric conditional density estimation. arXiv preprint arXiv:1206.5278, 2012.
  • [21] E. Hüllermeier and W. Waegeman. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110(3):457–506, 2021.
  • [22] R. Izbicki, A. B. Lee, and P. E. Freeman. Photo-zz estimation: An example of nonparametric conditional density estimation under selection bias. The Annals of Applied Statistics, 11(2):698–724, 2017.
  • [23] O. Kallenberg. Foundations of modern probability, volume 2. Springer, 1997.
  • [24] A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30, 2017.
  • [25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [26] V. Kuleshov, N. Fenner, and S. Ermon. Accurate uncertainties for deep learning using calibrated regression. In International conference on machine learning, pages 2796–2804. PMLR, 2018.
  • [27] V. Kuleshov and P. S. Liang. Calibrated structured prediction. Advances in Neural Information Processing Systems, 28, 2015.
  • [28] M. Kull and P. Flach. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 68–85. Springer, 2015.
  • [29] M. Kull, M. Perello Nieto, M. Kängsepp, T. Silva Filho, H. Song, and P. Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. Advances in Neural Information Processing Systems, 32, 2019.
  • [30] A. Kumar, P. S. Liang, and T. Ma. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32, 2019.
  • [31] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017.
  • [32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [33] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. Mmd gan: Towards deeper understanding of moment matching network. Advances in Neural Information Processing Systems, 30, 2017.
  • [34] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727. PMLR, 2015.
  • [35] F. Liu, W. Xu, J. Lu, G. Zhang, A. Gretton, and D. J. Sutherland. Learning deep kernels for non-parametric two-sample tests. In International Conference on Machine Learning, pages 6316–6326. PMLR, 2020.
  • [36] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are gans created equal? a large-scale study. Advances in neural information processing systems, 31, 2018.
  • [37] M. Minderer, J. Djolonga, R. Romijnders, F. Hubis, X. Zhai, N. Houlsby, D. Tran, and M. Lucic. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems, 34, 2021.
  • [38] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [39] K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
  • [40] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, 2005.
  • [41] D. A. Nix and A. S. Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 ieee international conference on neural networks (ICNN’94), volume 1, pages 55–60. IEEE, 1994.
  • [42] S. Nowozin, B. Cseke, and R. Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. Advances in Neural Information Processing Systems, 29, 2016.
  • [43] J. Park and K. Muandet. A measure-theoretic approach to kernel conditional mean embeddings. Advances in Neural Information Processing Systems, 33:21247–21259, 2020.
  • [44] T. Pearce, M. Zaki, A. Brintrup, and A. Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. arXiv preprint arXiv:1802.07167, 2018.
  • [45] T. Pospisil and A. B. Lee. Rfcde: Random forests for conditional density estimation. arXiv preprint arXiv:1804.05753, 2018.
  • [46] T. Pospisil and A. B. Lee. (f) rfcde: Random forests for conditional density estimation and functional data. arXiv preprint arXiv:1906.07177, 2019.
  • [47] Y. Ren, J. Zhu, J. Li, and Y. Luo. Conditional generative moment-matching networks. Advances in Neural Information Processing Systems, 29, 2016.
  • [48] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
  • [49] H. Song, T. Diethe, M. Kull, and P. Flach. Distribution calibration for regression. In International Conference on Machine Learning, pages 5897–5906. PMLR, 2019.
  • [50] L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.
  • [51] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968, 2009.
  • [52] M. Sugiyama, I. Takeuchi, T. Suzuki, T. Kanamori, H. Hachiya, and D. Okanohara. Conditional density estimation via least-squares density ratio estimation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 781–788. JMLR Workshop and Conference Proceedings, 2010.
  • [53] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378. PMLR, 2016.