
Evaluating Aleatoric Uncertainty
via Conditional Generative Models

Ziyi Huang, Henry Lam, Haofeng Zhang
Columbia University
New York, NY 10027
zh2354, khl2114, hz2553@columbia.edu
Abstract

Aleatoric uncertainty quantification seeks distributional knowledge of random responses, which is important for reliability analysis and robustness improvement in machine learning applications. Previous research on aleatoric uncertainty estimation mainly targets closed-form conditional densities or variances, which requires strong restrictions on the data distribution or dimensionality. To overcome these restrictions, we study conditional generative models for aleatoric uncertainty estimation. We introduce two metrics to measure the discrepancy between two conditional distributions that suit these models. Both metrics can be easily and unbiasedly computed via Monte Carlo simulation of the conditional generative models, thus facilitating their evaluation and training. We demonstrate numerically how our metrics provide correct measurements of conditional distributional discrepancies and can be used to train conditional models competitive against existing benchmarks.

1 Introduction

Uncertainty quantification plays a pivotal role in machine learning systems, especially for downstream decision-making tasks involving reliability analysis and optimization. There are two major types of uncertainty, aleatoric uncertainty and epistemic uncertainty. Aleatoric uncertainty refers to the intrinsic stochasticity of the output given a specific input [21], while epistemic uncertainty accounts for the model uncertainty caused by data and modeling limitations [24]. Most classical machine learning algorithms that focus on mean response prediction primarily address epistemic uncertainty, but aleatoric uncertainty, which describes the distribution of responses beyond summary statistics like the mean, has been gaining importance because of risk and safety-critical considerations.

Existing approaches for aleatoric uncertainty estimation can be largely divided into the following directions: negative log-likelihood (NLL) loss-based estimation, forecaster calibration, and conditional density estimation (CDE). While powerful, these approaches are limited by several drawbacks arising from real-world applications:

  1. Negative Log-Likelihood Loss: In regression tasks, aleatoric uncertainty can be estimated through the conditional mean and variance from models (heteroscedastic neural networks) optimized by the NLL loss [41, 2, 4, 31, 24, 5]. However, this approach requires scalar-type outputs and cannot be easily extended to broader computer vision applications, such as image generation. In addition, the computation of the NLL loss relies on the assumption of a conditional Gaussian or Gaussian-like distribution, which may not hold for real-world datasets.

  2. Forecaster Calibration: In the calibration literature, aleatoric uncertainty estimators are also known as forecasters [12, 26, 49], with multiple definitions of calibration modes [12, 49, 8, 26, 5]. Under these definitions, the ground-truth conditional distribution function is well calibrated, but not vice versa. Thus, some intuitive sharpness criteria are typically applied to avoid trivial forecasters such as the unconditional distribution. However, little is known about how to recover the ground-truth conditional distribution function via calibration, even asymptotically.

  3. Conditional Density Estimation: In CDE-based approaches [20, 22, 52, 45, 46, 6, 7], aleatoric uncertainty is directly calculated by estimating conditional densities in a certain form (such as kernel density). Most CDE methods apply only to low-dimensional responses following absolutely continuous conditional distributions. Moreover, the output of CDE methods is an explicit formula for the conditional density function. Thus, numerical characteristics such as conditional quantiles may be hard to obtain, as they involve numerical integration that is generally difficult to implement, especially in higher-dimensional settings.

To address the above challenges, we study a framework using conditional generative models to estimate aleatoric uncertainty. Compared to previous approaches, conditional generative models [38, 47] are more scalable and flexible, being applicable regardless of the dimension and distribution of the input/output vectors. Moreover, they can easily produce numerical characteristics of the underlying distributions, or other performance estimates, through Monte Carlo methods.

At the core of our framework is the construction of distance metrics between the generative model and the ground-truth distribution, which is required for both model evaluation and training [13, 42, 1]. In particular, we generalize the maximum mean discrepancy (MMD) [14, 34] to the setting of conditional distributions, by constructing two new metrics that we call joint maximum mean discrepancy (JMMD) and average maximum mean discrepancy (AMMD). We derive statistical properties in estimating these metrics and illustrate that both metrics admit easy-to-implement and computationally scalable unbiased estimators. Based on these, we further develop two approaches to optimize conditional generative models suited for different tasks and conduct comprehensive experiments to show the correctness and effectiveness of our framework.

Our approach has the following strengths relative to previous methods: 1) A similar study with conditional MMD can be found in [47], which, as far as we know, is the most relevant work on MMD-based conditional generative models. However, their framework involves unrealistic technical assumptions that may not hold in practice, as well as matrix inversion operations that suffer from instability and scalability issues (see Section 4.1). 2) Both JMMD and AMMD are evaluation metrics that are desirably "distribution-free" (i.e., the data are not assumed to follow any particular type of distribution) and "model-free" (i.e., the evaluation does not involve additional estimated models such as a discriminator). In previous research, the Fréchet Inception Distance (FID) [19, 36] is a standard metric to assess the quality of unconditional generative models. However, the closed-form computation of FID assumes that both the generative model and the data follow multivariate Gaussian distributions. Another commonly used evaluation approach is Indirect Sampling Likelihood (ISL) [3, 13], which computes the NLL under a kernel density fitted to samples from the generative model. However, kernel density estimation deteriorates in quality as dimensionality increases and may fit the generative model poorly. Finally, the loss value on test data is an alternative for performance examination. However, typical losses based on f-divergence or the Wasserstein distance cannot indicate the performance of the generator alone (see Section 2).

2 Related Work

Learning and Evaluation Criteria. Evaluation criteria for generative models against data are typically borrowed from discrepancy measures between two probability distributions in the statistics literature. The latter include two major types: f-divergences and integral probability metrics. The seminal paper [13] used the Jensen-Shannon divergence in its original form, and [42] extended it to general f-divergences motivated by the benefits of other divergence function choices. The computation of integral probability metrics has two important sub-directions, MMD [34, 33] and the Wasserstein distance [1, 15]. Among these criteria, a discriminator is typically needed for approaches based on f-divergence (variational representation) and the Wasserstein distance (dual representation), while it is not required for MMD methods. The loss functions from f-divergence and the Wasserstein distance cannot be directly used to evaluate generative models alone due to their dependency on the quality of discriminators. In addition, other conditional distance measures may encounter challenges when used with generative models. For instance, the NLL value [31] and the CDE value [6] require an explicit form of the model's density function.

Aleatoric Uncertainty in Deep Learning. Besides the directions discussed in Section 1, aleatoric uncertainty in classification tasks can be estimated from the output of softmax layers in deep neural networks [40, 31, 17]. Previous research [16] pointed out that directly using softmax outputs for estimation can be inaccurate, as the softmax probability of the predicted class does not reflect the ground-truth correctness likelihood. The ground-truth conditional mass function has zero calibration error, but not vice versa. Hence forecasters with zero calibration error, which have been studied extensively [30, 37, 29], are not sufficient to recover the ground-truth conditional mass function. The forecasters can be heuristically improved by a second-level metric named sharpness (or refinement error) [30, 27, 28]. Since aleatoric uncertainty in classification can be captured by vector-valued maps such as softmax responses, it is not necessary to use a conditional generative model for this task, and thus we do not focus on classification in this paper.

3 Conditional Generative Models and Maximum Mean Discrepancy

3.1 Conditional Generative Models

In this section, we provide rigorous definitions of conditional generative models. Consider a standard statistical framework where a pair of random vectors $(X,Y)\in\mathcal{X}\times\mathcal{Y}$ follows a joint distribution $P_{X,Y}$ with marginal distributions $X\sim P_X$ and $Y\sim P_Y$. We assume the space $\mathcal{X}\subset\mathbb{R}^d$ with $d\geq 1$, which is allowed to contain either continuous or discrete components. Denote the conditional distribution of $Y$ given $X$ by $P_{Y|X}$. For a given value $x$ of $X$, denote the conditional distribution as $P_{Y|X=x}$. Typically, we regard $X$ as a vector of inputs (examples) and $Y$ as a vector of outputs (labels). For instance, $\mathcal{Y}\subset\mathbb{R}^q$ with $q\geq 1$ in regression and $\mathcal{Y}\subset[K]:=\{1,\ldots,K\}$ in classification. Alternatively, in image generation tasks, $X$ refers to auxiliary information (such as image attributes or labels) and $Y$ refers to the image, in order to keep the notation consistent.

Our goal is to quantify the conditional distribution $P_{Y|X}$ via conditional generative models. More precisely, let $\xi\in\mathbb{R}^m$ be a random vector independent of $X$ with a known distribution $P_\xi$ (specified by the learner), and the goal is to construct a function $G:\mathbb{R}^m\times\mathcal{X}\to\mathcal{Y}$ such that the conditional distribution of $G(\xi,X)|X=x$ is the same as $P_{Y|X=x}$. The following lemma demonstrates the existence of such a function $G$, termed the conditional generative model $G:\mathbb{R}^m\times\mathcal{X}\to\mathcal{Y}$.

Lemma 1 (Adapted from Theorem 5.10 in [23]).

Let $(X,Y)$ be a random pair taking values in $\mathcal{X}\times\mathcal{Y}$ with joint distribution $P_{X,Y}$. Suppose $\mathcal{Y}$ is a standard Borel space. Then there exist a random vector $\xi\sim P_\xi=\text{Uniform}([0,1]^m)$ and a Borel-measurable function $G:\mathbb{R}^m\times\mathcal{X}\to\mathcal{Y}$ such that $\xi$ is independent of $X$ and $(X,Y)=(X,G(\xi,X))$ almost everywhere. In particular, such $G$ satisfies $Y|X=x\sim G(\xi,X)|X=x$ for a.e. $x$ with respect to $P_X$.
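As a standard one-dimensional illustration (not part of the lemma itself), when $\mathcal{Y}\subset\mathbb{R}$ one may take $m=1$ and let $G$ be the conditional quantile function,

$$G(\xi,x)=F^{-1}_{Y|X=x}(\xi),\qquad \xi\sim\text{Uniform}([0,1]),$$

so that $G(\xi,X)|X=x\sim P_{Y|X=x}$ by inverse transform sampling.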

The conditional generative model can provide more information than standard regression models with single-point prediction. In regression problems, the conditional mean can be estimated by taking the sample mean of multiple draws $G(\xi_i,x)$ at $X=x$. Meanwhile, other numerical characteristics of the underlying target distribution, such as the conditional variance and conditional quantiles, can also be estimated by Monte Carlo sampling from the conditional generative model, beyond what single-point prediction could offer.
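For example, such Monte Carlo summaries might be computed as in the following minimal sketch, which assumes a trained generator $G$ and noise $\xi\sim\text{Uniform}([-1,1]^{10})$ as in the experiments of Section 5; the function name and interface are illustrative.

```python
import numpy as np

def conditional_statistics(G, x, num_samples=1000, xi_dim=10, q=0.9, rng=None):
    """Monte Carlo estimates of the conditional mean, variance, and q-quantile
    of Y | X = x from a conditional generative model G(xi, x)."""
    rng = np.random.default_rng() if rng is None else rng
    xi = rng.uniform(-1.0, 1.0, size=(num_samples, xi_dim))
    samples = np.array([G(xi_i, x) for xi_i in xi])   # conditionally i.i.d. draws
    return {
        "mean": samples.mean(axis=0),
        "variance": samples.var(axis=0, ddof=1),
        "quantile": np.quantile(samples, q, axis=0),
    }
```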

In the rest of this paper, we use $P_{Y|X}$ for the ground-truth conditional distribution and $Q_{Y|X}$ for the distribution of the conditional generative model $G(\xi,X)|X$. We denote by $Q_{X,G(\xi,X)}$ the joint distribution of $(X,G(\xi,X))$. For each given $x$, the generative model can generate conditionally independent and identically distributed (i.i.d.) samples $G(\xi_i,x)$ from the conditional distribution $Q_{Y|X=x}$. We parametrize the conditional generative model in a hypothesis class $\{G_\theta(\xi,X):\theta\in\Theta\}$ with parameter $\theta$. To learn $G(\xi,X)|X$ as an estimate of $P_{Y|X}$, we need a metric that quantifies the difference between $G(\xi,X)|X$ and $P_{Y|X}$ using finite training data, which relates to the Two-Sample Test. To this end, we will use the (kernel) maximum mean discrepancy (MMD) [14], described in the next subsection.

3.2 Two-Sample Test via Maximum Mean Discrepancy

We review the standard MMD in the setting of unconditional distributions on $\mathcal{Y}$. Section A provides preliminaries on the reproducing kernel Hilbert space (RKHS). Suppose that $\mathcal{F}_X$ ($\mathcal{F}_Y$) is the RKHS defined on the space $\mathcal{X}$ ($\mathcal{Y}$) with kernel $k_1$ ($k_2$) and feature map $\phi_1$ ($\phi_2$). We adopt the following two basic assumptions throughout this paper (i.e., all theorems make these assumptions without explicit mention). Detailed explanations of Assumptions 1 and 2 can be found in Section A. Gaussian kernels, for instance, satisfy both assumptions.

Assumption 1.

We assume the following: 1) $k_1(\cdot,\cdot)$ is measurable and $\mathbb{E}_{x\sim P_X}[k_1(x,x)]<\infty$. 2) $k_2(\cdot,\cdot)$ is measurable and $\mathbb{E}_{y\sim P_Y}[k_2(y,y)]<\infty$. Moreover, $\mathbb{E}_{y\sim P_{Y|X=x}}[k_2(y,y)|X=x]<\infty$ for any $x\in\mathcal{X}$. In addition, these assumptions also hold when the data distribution $P$ is replaced by the generative distribution $Q$.

Assumption 2.

We assume the following: 1) $k_1$ is characteristic. 2) $k_2$ is characteristic. 3) $k_1\otimes k_2$ is characteristic.

The integral probability metric aims to measure the discrepancy between two distributions. Let $\mathcal{G}$ denote a set of functions $\mathcal{Y}\to\mathbb{R}$. Given two distributions $P_Y$ and $Q_Y$ on $\mathcal{Y}$, the integral probability metric is defined as

$$IPM(P_Y,Q_Y)=\sup_{f\in\mathcal{G}}\big|\mathbb{E}[f(Y)]-\mathbb{E}[f(\hat{Y})]\big|$$

where $Y\sim P_Y$ and $\hat{Y}\sim Q_Y$. MMD is a special case of integral probability metrics, obtained by choosing $\mathcal{G}$ to be the unit ball in the RKHS $\mathcal{F}_Y$. Let $\mu_{P_Y}$ denote the kernel mean embedding of $P_Y$ in $\mathcal{F}_Y$: $\mu_{P_Y}:=\mathbb{E}_{y\sim P_Y}[\phi_2(y)]$. Under Assumption 1, $\mu_{P_Y}$ is guaranteed to be an element of the RKHS $\mathcal{F}_Y$ [14]. With these discussions, the squared MMD distance between $P_Y$ and $Q_Y$ is formally defined as

$$\begin{aligned}
MMD^2(P_Y,Q_Y)&=\sup_{f\in\mathcal{F}_Y,\ \|f\|_{\mathcal{F}_Y}\leq 1}\big|\mathbb{E}[f(Y_1)]-\mathbb{E}[f(\hat{Y}_1)]\big|^2\\
&=\|\mu_{P_Y}-\mu_{Q_Y}\|^2_{\mathcal{F}_Y}=\mathbb{E}[k_2(Y_1,Y_2)]-2\,\mathbb{E}[k_2(Y_1,\hat{Y}_1)]+\mathbb{E}[k_2(\hat{Y}_1,\hat{Y}_2)]
\end{aligned} \tag{1}$$

where $Y_1,Y_2\stackrel{i.i.d.}{\sim}P_Y$ and $\hat{Y}_1,\hat{Y}_2\stackrel{i.i.d.}{\sim}Q_Y$.

Theorem 2 ([14]).

$MMD^2(P_Y,Q_Y)\geq 0$, and $MMD^2(P_Y,Q_Y)=0$ if and only if $P_Y=Q_Y$.

Suppose we have data $y_i\stackrel{i.i.d.}{\sim}P_Y$ ($i\in[n]$) and $\hat{y}_j\stackrel{i.i.d.}{\sim}Q_Y$ ($j\in[m]$). Then a standard unbiased estimator of $MMD^2(P_Y,Q_Y)$ [14] is

$$\mathcal{L}^u_{MMD^2}=\frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{\substack{i'=1\\ i'\neq i}}^{n}k_2(y_i,y_{i'})-\frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}k_2(y_i,\hat{y}_j)+\frac{1}{m(m-1)}\sum_{j=1}^{m}\sum_{\substack{j'=1\\ j'\neq j}}^{m}k_2(\hat{y}_j,\hat{y}_{j'}).$$
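For concreteness, a minimal NumPy sketch of this unbiased estimator is given below; the Gaussian kernel, its bandwidth, and the function names are illustrative choices of ours rather than prescriptions of the paper.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Pairwise Gaussian Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd2_unbiased(y, y_hat, kernel=gaussian_kernel):
    """Unbiased estimator of MMD^2 for samples y ~ P_Y and y_hat ~ Q_Y (2-D arrays)."""
    n, m = len(y), len(y_hat)
    k_yy, k_yh, k_hh = kernel(y, y), kernel(y, y_hat), kernel(y_hat, y_hat)
    # Diagonals are excluded from the within-sample sums, which removes the bias.
    term_pp = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    term_qq = (k_hh.sum() - np.trace(k_hh)) / (m * (m - 1))
    return term_pp - 2 * k_yh.mean() + term_qq
```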

4 Generalization to Conditional Two-Sample Test

In this section, we generalize MMD to the conditional Two-Sample Test. We first explain the limitations of conditional maximum mean discrepancy (CMMD), the state-of-the-art approach for using MMD with conditional models [47]. Then, we introduce two metrics, JMMD and AMMD, which bypass the strong restrictions and biased estimation of CMMD. We also present the connections and comparisons among these metrics, and describe how to use them to construct conditional generative models.

4.1 Previous Work on Conditional Maximum Mean Discrepancy

In [47], conditional generative moment-matching networks (CGMMNs) were developed for conditional distribution generation. In particular, they leveraged previous work on the conditional mean embedding $C_{P_{Y|X}}$ of the conditional distribution [51, 9, 50, 39] (Section A provides a review of this topic). They used the discrepancy between $C_{P_{Y|X=x}}$ and $C_{Q_{Y|X=x}}$ to measure the difference between two conditional distributions, termed CMMD, which is defined formally as:

$$CMMD^2=\|C_{P_{Y|X}}-C_{Q_{Y|X}}\|^2_{\mathcal{F}_X\otimes\mathcal{F}_Y} \tag{2}$$

where $P$ represents the ground-truth data distribution and $Q$ represents the generative distribution. The estimator of $CMMD^2$ developed in [47] is as follows:

$$\mathcal{L}_{C^2}(P,Q)=\|\tilde{C}_{P_{Y|X}}-\tilde{C}_{Q_{Y|X}}\|^2_{\mathcal{F}_X\otimes\mathcal{F}_Y}, \tag{3}$$
$$\tilde{C}_{P_{Y|X}}=\tilde{C}_{P_{YX}}(\tilde{C}_{P_{XX}}+\lambda I)^{-1}=\Phi_2(K_X+\lambda nI)^{-1}\Phi_1 \tag{4}$$

where $\Phi_2=(\phi_2(y_1),\ldots,\phi_2(y_n))$, $\Phi_1=(\phi_1(x_1),\ldots,\phi_1(x_n))$, $K_X=\Phi_1^\top\Phi_1$, and $I$ is the identity matrix.

While [47] is the most relevant study to our problem setting, directly applying CMMD to aleatoric uncertainty estimation has the following limitations:

Computationally Expensive: The matrix inverse in the estimator is computationally expensive for practical implementation. The running time of a single inversion in one iteration is of order $O(B^3)$, where $B$ is the batch size. Meanwhile, the batch size should be sufficiently large to achieve good performance for generative models [34].

Strong Technical Assumptions and Existence of Inversion: 1) The existence of the conditional mean embedding operator $C_{P_{Y|X}}$ typically requires strong assumptions: $\forall g\in\mathcal{F}_Y$, $\mathbb{E}_{P_{Y|X}}[g(Y)|X]\in\mathcal{F}_X$. This assumption is not necessarily true for continuous domains [51], and simple counterexamples using the Gaussian kernel can be found [9]. 2) In general, $\tilde{C}^{-1}_{XX}$ does not exist when $\mathcal{F}_X$ is infinite-dimensional, since $\tilde{C}_{XX}$ is a compact operator and thus has arbitrarily small positive eigenvalues [39]. When the matrix is singular, the matrix inversion can be unstable, and the performance of the estimator $\tilde{C}_{P_{Y|X}}$ in [47] may be degraded after adding $\lambda I$ to avoid the singularity. 3) Even though the first two points can be alleviated to some extent [43], the CMMD metric (2) is well-defined only if $C_{P_{Y|X}},C_{Q_{Y|X}}\in\mathcal{F}_X\otimes\mathcal{F}_Y$. However, this requires a much stronger assumption than the existence of $\tilde{C}^{-1}_{XX}$ (see Assumption 6 in Section A).

Bias: CMMD does not admit an obvious unbiased estimator. The estimator $\tilde{C}_{P_{Y|X}}$ in (4) is biased, even asymptotically if $\lambda$ is fixed [51].

To bypass the above limitations, we propose two alternative metrics that require only basic assumptions on the existence of the cross-covariance operator and the characteristic property of the kernels (Assumptions 1 and 2). In particular, we do not require the existence of the inverse of any operator or matrix, which makes our metrics easy to implement in real-world applications.

4.2 Average Maximum Mean Discrepancy (AMMD)

We first introduce a rather straightforward approach, which we term the AMMD metric. AMMD shows better potential for multi-output problems (such as image generation) where the data consist of i.i.d. inputs $x_i$ with conditionally independent outputs $y_{i,j}$ at each $x_i$; see Section C for a more detailed discussion. In AMMD, at each $x$, we use (1) to build an unbiased estimator of the MMD between the conditional distributions of $Y|X=x$. These estimators are then averaged with respect to the marginal $P_X$. More specifically, we define

$$AMMD^2(P,Q)=\mathbb{E}_{x\sim P_X}\big[MMD_{X=x}^2(P_{Y|X=x},Q_{Y|X=x})\big]$$

where

$$\begin{aligned}
MMD_{X=x}^2(P_{Y|X=x},Q_{Y|X=x})&:=\|\mu_{P_{Y|X=x}}-\mu_{Q_{Y|X=x}}\|^2_{\mathcal{F}_Y}\\
&=\mathbb{E}[k_2(Y^x_1,Y^x_2)|X=x]-2\,\mathbb{E}[k_2(Y^x_1,\hat{Y}^x_1)|X=x]+\mathbb{E}[k_2(\hat{Y}^x_1,\hat{Y}^x_2)|X=x]
\end{aligned} \tag{5}$$

is a function of $x$, and for fixed $x$, $Y^x_1,Y^x_2\stackrel{i.i.d.}{\sim}P_{Y|X=x}$ and $\hat{Y}^x_1,\hat{Y}^x_2\stackrel{i.i.d.}{\sim}Q_{Y|X=x}$. Note that $\mu_{P_{Y|X=x}},\mu_{Q_{Y|X=x}}\in\mathcal{F}_Y$ are guaranteed by Assumption 1, so $MMD_{X=x}^2$ is well-defined. Hence $AMMD^2(P,Q)$ is also well-defined as the expectation of a non-negative measurable function.

Remark 3.

In (5), $Y^x_1,Y^x_2$ are drawn in a conditionally independent manner for each $x$. This is not equivalent to globally drawing two unconditionally independent samples $Y_1,Y_2$ and considering $Y_1|X=x,Y_2|X=x$ for each $x$, because the latter are not conditionally independent in general. Therefore, in general, $\mathbb{E}_{x\sim P_X}\big[\mathbb{E}[k_2(Y^x_1,Y^x_2)|X=x]\big]\neq\mathbb{E}[k_2(Y_1,Y_2)]$.

Theorem 4.

$AMMD^2(P,Q)\geq 0$, and $AMMD^2(P,Q)=0$ if and only if $P_{Y|X=x}=Q_{Y|X=x}$ for a.e. $x$ with respect to $P_X$.

Theorem 4 shows that $AMMD^2(P,Q)$ offers a metric to measure the discrepancy between $P_{Y|X}$ and $Q_{Y|X}$. Next, we propose a Monte Carlo estimator of $AMMD^2$ for conditional generative models:

  1. Take a batch $\{(x_i,y_{i,l}):i\in[n],l\in[r]\}$ from $P$ of batch size $rn$, where $y_{i,l}$ ($l\in[r]$) are the outputs at the same $x_i$. Here, $r$ is restricted by the specification of the task: $r=1$ in single-output problems, but $r$ can be greater than $1$ in multi-output problems such as image generation; $n\geq 1$.

  2. Generate a batch $\{(x_i,G(\xi_{i,j},x_i)):i\in[n],j\in[m]\}$ from $Q$ of batch size $mn$, where $\xi_{1,1},\ldots,\xi_{n,m}$ are i.i.d. and independent of $x_1,\ldots,x_n$; $m\geq 2$.

  3. Compute
$$\begin{aligned}
\hat{A^2}(P,Q)=\frac{1}{n}\sum_{i=1}^{n}\Big(&-\frac{2}{mr}\sum_{j=1}^{m}\sum_{l=1}^{r}k_2\big(y_{i,l},G(\xi_{i,j},x_i)\big)\\
&+\frac{1}{m(m-1)}\sum_{j=1}^{m}\sum_{\substack{j'=1\\ j'\neq j}}^{m}k_2\big(G(\xi_{i,j},x_i),G(\xi_{i,j'},x_i)\big)\Big)
\end{aligned} \tag{6}$$

The next theorem establishes the error analysis of the estimator $\hat{A^2}(P,Q)$.

Theorem 5.

$\hat{A^2}(P,Q)$ is an unbiased estimator of $AMMD^2(P,Q)-C_0$, where $C_0$ is a constant independent of $Q$ given by $C_0=\mathbb{E}_{x\sim P_X}\big[\mathbb{E}[k_2(Y^x_1,Y^x_2)|X=x]\big]$. Moreover, the variance of $\hat{A^2}(P,Q)$ is $O\big(\frac{1}{n\min\{m,r\}}\big)+\frac{1}{n}K_0$, where $K_0=\text{Var}_{x\sim P_X}\big[-2\,\mathbb{E}[k_2(Y^x_1,\hat{Y}^x_1)|X=x]+\mathbb{E}[k_2(\hat{Y}^x_1,\hat{Y}^x_2)|X=x]\big]$ is independent of $n,m,r$. (The explicit formula for the variance is given in Section B.)

$C_0$ in Theorem 5 is free of the conditional generative model and thus does not need to be computed in the training/evaluation phase. This is in the same spirit as NLL, which is formed by a free-of-model constant plus the Kullback–Leibler divergence between the model and the data. Theorem 5 shows that if $n$ is not allowed to be large (e.g., $n$ is bounded above by the number of classes in label-based image generation problems), the variance of the estimator $\hat{A^2}(P,Q)$ should be reduced by increasing $m$ and $r$. On the other hand, if $n$ is allowed to be large (e.g., in regression problems with continuous features), then given a fixed computational budget, we should increase $n$ while maintaining the smallest possible values $m=2$ and $r=1$ to reduce the variance of $\hat{A^2}(P,Q)$ efficiently.
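As an illustration of how (6) can be computed in practice, a short NumPy sketch follows; the kernel argument is any function returning a pairwise Gram matrix (e.g., the Gaussian kernel sketch in Section 3.2), and the $\text{Uniform}([-1,1]^{10})$ noise mirrors the experimental setup in Section 5 rather than being required by the estimator.

```python
import numpy as np

def ammd2_hat(x_batch, y_batch, G, kernel, m=2, xi_dim=10, rng=None):
    """Estimate \hat{A^2}(P, Q) in (6).
    x_batch: (n, d) inputs; y_batch: (n, r, q) with r conditionally independent
    real outputs per input (r = 1 for single-output data).
    G(xi, x) returns one sample from Q_{Y|X=x}; kernel(A, B) returns a Gram matrix."""
    rng = np.random.default_rng() if rng is None else rng
    n = x_batch.shape[0]
    total = 0.0
    for i in range(n):
        xi = rng.uniform(-1.0, 1.0, size=(m, xi_dim))
        y_fake = np.stack([G(xi_j, x_batch[i]) for xi_j in xi])   # (m, q)
        k_rf = kernel(y_batch[i], y_fake)    # real-vs-fake block, shape (r, m)
        k_ff = kernel(y_fake, y_fake)        # fake-vs-fake block, shape (m, m)
        total += -2.0 * k_rf.mean() \
                 + (k_ff.sum() - np.trace(k_ff)) / (m * (m - 1))
    return total / n
```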

4.3 Joint Maximum Mean Discrepancy (JMMD)

We then introduce the JMMD metric, which is based on the joint distribution. Compared with AMMD, JMMD is more suitable for single-output tasks (such as regression) where the data consist of jointly i.i.d. samples $(x_i,y_i)$; see Section C for a more detailed discussion. According to the observation in Lemma 1, matching $Q_{Y|X=x}$ (the conditional distribution of $G(\xi,X)|X=x$) with $P_{Y|X=x}$ for a.e. $x$ can be transferred to matching $Q_{X,Y}$ (the joint distribution of $(X,G(\xi,X))$) with $P_{X,Y}$. This motivates us to define the following metric, which we term JMMD:

$$\begin{aligned}
JMMD^2(P,Q)&=MMD^2(P_{X,Y},Q_{X,Y})\\
&=\mathbb{E}[k_3((X_1,Y_1),(X_2,Y_2))]-2\,\mathbb{E}[k_3((X_1,Y_1),(\hat{X}_1,\hat{Y}_1))]+\mathbb{E}[k_3((\hat{X}_1,\hat{Y}_1),(\hat{X}_2,\hat{Y}_2))]
\end{aligned}$$

where $k_3=k_1\otimes k_2$ is the kernel of the tensor product space $\mathcal{F}_X\otimes\mathcal{F}_Y$, $(X_1,Y_1),(X_2,Y_2)\stackrel{i.i.d.}{\sim}P_{X,Y}$, and $(\hat{X}_1,\hat{Y}_1),(\hat{X}_2,\hat{Y}_2)\stackrel{i.i.d.}{\sim}Q_{X,Y}$. Note that $JMMD^2(P,Q)$ can be viewed alternatively as the discrepancy between the cross-covariance operators $C_{P_{YX}}$ and $C_{Q_{YX}}$ defined on the tensor product space $\mathcal{F}_X\otimes\mathcal{F}_Y$. Since $C_{P_{YX}},C_{Q_{YX}}\in\mathcal{F}_X\otimes\mathcal{F}_Y$ is guaranteed by Assumption 1, $JMMD^2(P,Q)$ is well-defined (see Section A for more details).

Theorem 6.

$JMMD^2(P,Q)\geq 0$, and $JMMD^2(P,Q)=0$ if and only if $P_{Y|X=x}=Q_{Y|X=x}$ for a.e. $x$ with respect to $P_X$.

Theorem 6 shows that $JMMD^2(P,Q)$ offers a metric to measure the discrepancy between $P_{Y|X}$ and $Q_{Y|X}$. In parallel to AMMD, we propose a Monte Carlo estimator of $JMMD^2$ for conditional generative models: Take a batch of samples $\{(x_l,y_l):l\in[r]\}$ from $P$ of batch size $r$; $r\geq 2$. Generate a batch $\{(\hat{x}_j,G(\xi_j,\hat{x}_j)):j\in[m]\}$ from $Q$ of batch size $m$, where $\xi_1,\ldots,\xi_m$ are i.i.d. and independent of $x_1,\ldots,x_r,\hat{x}_1,\ldots,\hat{x}_m$; $m\geq 2$. Compute

$$\begin{aligned}
\hat{J^2}(P,Q)=&-\frac{2}{mr}\sum_{j=1}^{m}\sum_{l=1}^{r}k_3\big((x_l,y_l),(\hat{x}_j,G(\xi_j,\hat{x}_j))\big)\\
&+\frac{1}{m(m-1)}\sum_{j=1}^{m}\sum_{\substack{j'=1\\ j'\neq j}}^{m}k_3\big((\hat{x}_j,G(\xi_j,\hat{x}_j)),(\hat{x}_{j'},G(\xi_{j'},\hat{x}_{j'}))\big)
\end{aligned} \tag{7}$$

The next theorem establishes the error analysis of the estimator $\hat{J^2}(P,Q)$.

Theorem 7.

$\hat{J^2}(P,Q)$ is an unbiased estimator of $JMMD^2(P,Q)-C_1$, where $C_1$ is a constant independent of $Q$ given by $C_1=\mathbb{E}[k_3((X_1,Y_1),(X_2,Y_2))]$. Moreover, the variance of $\hat{J^2}(P,Q)$ is $O\big(\frac{1}{\min\{m,r\}}\big)$. (The explicit formula for the variance is given in Section B.)

$C_1$ in Theorem 7 is free of the conditional generative model. Theorem 7 shows that the variance of $\hat{J^2}(P,Q)$ decreases at the order of $\frac{1}{\min\{m,r\}}$. Therefore, given a fixed computational budget $B$, we should set $m=\Theta(r)=\Theta(\sqrt{B})$ to achieve the minimum variance of the estimator $\hat{J^2}(P,Q)$.
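A corresponding NumPy sketch of (7) is shown below; it assumes the product kernel $k_3((x,y),(x',y'))=k_1(x,x')\,k_2(y,y')$ built from two user-supplied Gram-matrix functions (e.g., Gaussian kernels), and again draws noise from $\text{Uniform}([-1,1]^{10})$ as in Section 5.

```python
import numpy as np

def jmmd2_hat(x_real, y_real, x_cond, G, k1, k2, xi_dim=10, rng=None):
    """Estimate \hat{J^2}(P, Q) in (7).
    (x_real, y_real): r real pairs from P; x_cond: m conditioning inputs for Q.
    k3 is taken as the product of k1 (on X) and k2 (on Y)."""
    rng = np.random.default_rng() if rng is None else rng
    m = x_cond.shape[0]
    xi = rng.uniform(-1.0, 1.0, size=(m, xi_dim))
    y_fake = np.stack([G(xi_j, x_j) for xi_j, x_j in zip(xi, x_cond)])
    # k3 between real pairs (x_l, y_l) and generated pairs (x_hat_j, G(xi_j, x_hat_j)).
    k3_cross = k1(x_real, x_cond) * k2(y_real, y_fake)       # shape (r, m)
    # k3 among generated pairs, with the diagonal removed for unbiasedness.
    k3_fake = k1(x_cond, x_cond) * k2(y_fake, y_fake)        # shape (m, m)
    return -2.0 * k3_cross.mean() \
           + (k3_fake.sum() - np.trace(k3_fake)) / (m * (m - 1))
```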

Algorithm 1 Algorithm Framework of A-CGM

Input: Training dataset $\mathcal{D}=\{(x_i,y_i):i\in\mathcal{I}\}$.
Output: Finalized parameters $\theta$ of the generative model $G_\theta(\xi,x)$.

1: Randomly divide the training dataset $\mathcal{D}$ into mini-batches.
2: for $t=0,\ldots,T-1$ do
3:     Set $\mathcal{B}^G=\emptyset$
4:     for each mini-batch $\mathcal{B}$ in $\mathcal{D}$ do
5:         for each $x\in\mathcal{B}$ do
6:             Draw multiple i.i.d. copies $\xi_1,\ldots,\xi_m$ from $P_\xi$
7:             Generate conditional samples by forward-propagating through $G_\theta(\xi_j,x)$
8:             Add $(x,G_\theta(\xi_1,x)),\ldots,(x,G_\theta(\xi_m,x))$ to $\mathcal{B}^G$
9:         end for
10:         Optimize $\theta$ with $\hat{A^2}(P,Q_\theta)$ in (6) based on $\mathcal{B}$ and $\mathcal{B}^G$
11:     end for
12: end for
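The following PyTorch sketch mirrors Algorithm 1 at a high level; the generator interface, the kernel bandwidth, and the data layout are our own assumptions, and for simplicity $\theta$ is updated once per mini-batch using only the samples generated from that mini-batch.

```python
import torch

def gaussian_k2(a, b, bandwidth=1.0):
    # Pairwise Gaussian kernel matrix, differentiable in both arguments.
    return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))

def train_a_cgm(gen, loader, epochs=100, m=2, xi_dim=10, lr=5e-4):
    """Schematic A-CGM loop: minimize the mini-batch estimate of \hat{A^2}(P, Q_theta)."""
    opt = torch.optim.Adam(gen.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:                          # x: (n, d), y: (n, r, q)
            n = x.shape[0]
            xi = torch.rand(n, m, xi_dim) * 2 - 1    # Uniform([-1, 1]^10) noise
            x_rep = x.unsqueeze(1).expand(-1, m, -1)
            y_fake = gen(xi, x_rep)                  # assumed interface: (n, m, q)
            loss = 0.0
            for i in range(n):
                k_rf = gaussian_k2(y[i], y_fake[i])
                k_ff = gaussian_k2(y_fake[i], y_fake[i])
                loss = loss - 2 * k_rf.mean() \
                       + (k_ff.sum() - k_ff.diagonal().sum()) / (m * (m - 1))
            loss = loss / n
            opt.zero_grad()
            loss.backward()
            opt.step()
    return gen
```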

4.4 Connections and Comparisons among Metrics

We establish the theoretical connections among CMMD, JMMD, and AMMD as follows.

Theorem 8.

Suppose that $AMMD$ and $JMMD$ are well-defined. Then we have

$$JMMD^2\leq\mathbb{E}_{x\sim P_X}[k_1(x,x)]\cdot AMMD^2.$$
Theorem 9.

Suppose that $AMMD$ is well-defined. Moreover, suppose that for all $g\in\mathcal{F}_Y$, $\mathbb{E}_{P_{Y|X}}[g(Y)|X]\in\mathcal{F}_X$ and $\mathbb{E}_{Q_{Y|X}}[g(Y)|X]\in\mathcal{F}_X$, so that the conditional mean embeddings $C_{P_{Y|X}}$, $C_{Q_{Y|X}}$ are well-defined. Furthermore, we assume $C_{P_{Y|X}}\in\mathcal{F}_X\otimes\mathcal{F}_Y$ and $C_{Q_{Y|X}}\in\mathcal{F}_X\otimes\mathcal{F}_Y$, so that CMMD (2) is well-defined. Then we have

$$AMMD^2\leq\mathbb{E}_{x\sim P_X}[k_1(x,x)]\cdot CMMD^2.$$

We further highlight the strengths of AMMD and JMMD for conditional generative model evaluation, which is a challenging task due to the delicacies of evaluation at the conditional distribution level. First, both metrics are "distribution-free", i.e., neither the data nor the conditional generative model is restricted to a specific type of distribution. In contrast, FID, for instance, requires a Gaussian assumption. Second, they are "model-free", i.e., their evaluation does not involve additional estimated models beyond the conditional generative model itself, such as kernel density estimators in Indirect Sampling Likelihood [3, 13] or estimated discriminators [42, 1].

4.5 Conditional Generative Model Construction

With the evaluation metrics in place, we present two corresponding deep-learning-based methods to construct conditional generative models. Our approaches are named J-CGM and A-CGM, targeting JMMD and AMMD, respectively, for different tasks. For performance measurement, the values of JMMD and AMMD are estimated by drawing samples from the generative model optimized in J-CGM/A-CGM. Denote by $G_\theta(\xi,X)$ the generative model optimized in J-CGM/A-CGM with parameters $\theta$. Note that $G_\theta(\xi,X)$ takes both the given conditional variable $X$ and the extra random vector $\xi$ as inputs. Let $Q_\theta$ be the joint distribution of $(X,G_\theta(\xi,X))$. A detailed step-by-step pseudo-code of A-CGM is listed in Algorithm 1. A similar procedure for J-CGM is presented in Algorithm 2 in Section C.

5 Experiments

Experimental Setup. We empirically verify the effectiveness of our proposed approaches on both regression and image generation tasks. In both tasks, we compare the performance of our approaches with the state-of-the-art MMD-based conditional generative model, CGMMN [47]. For regression, our experiments are conducted on the following widely used real-world benchmark datasets: Boston, Concrete, Energy, Wine, Yacht, Kin8nm, Protein, and CCPP [18, 10, 31, 44]. Besides the JMMD and AMMD values, we also report the scores of FID [19, 36], which is a standard metric for assessing the quality of generative models. In our experiments, FID is computed based on the joint distribution of $(X,Y)$, since it is originally defined in the unconditional sense. In the label-based image generation task, we adopt the benchmark dataset MNIST [32] to evaluate the correctness of our framework. In this task, $X$ is the label of the image and the generative model $G_\theta(\xi,X)$ is expected to output random image samples with the attributes of class $X$. We provide visuals to directly show the generation performance of different approaches. All experiments are conducted on a GeForce RTX 2080 Ti GPU. More experimental results are presented in Section C.
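A computation of FID on the joint pair $(X,Y)$ might be sketched as follows; fitting Gaussians directly to the raw joint samples is only our assumption of the procedure, since the paper does not spell out the preprocessing here.

```python
import numpy as np
from scipy.linalg import sqrtm

def joint_fid(real_xy, fake_xy):
    """Frechet distance between Gaussian fits of joint (X, Y) samples,
    each array of shape (num_samples, d + q)."""
    mu_r, mu_f = real_xy.mean(axis=0), fake_xy.mean(axis=0)
    cov_r = np.cov(real_xy, rowvar=False)
    cov_f = np.cov(fake_xy, rowvar=False)
    cov_sqrt = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(cov_sqrt):      # numerical noise can yield tiny imaginary parts
        cov_sqrt = cov_sqrt.real
    return float(((mu_r - mu_f) ** 2).sum()
                 + np.trace(cov_r + cov_f - 2 * cov_sqrt))
```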

5.1 Aleatoric Uncertainty in Regression

| Dataset | CGMMN JMMD | CGMMN AMMD | CGMMN FID | J-CGM JMMD | J-CGM AMMD | J-CGM FID | A-CGM JMMD | A-CGM AMMD | A-CGM FID |
|---|---|---|---|---|---|---|---|---|---|
| Boston | 1.92e-2 | 8.41e-5 | 1.35e-2 | 2.92e-4 | 8.49e-4 | 7.84e-3 | 3.17e-4 | 8.74e-5 | 5.20e-3 |
| Concrete | 9.57e-3 | 1.67e-4 | 8.64e-3 | 1.53e-4 | 1.78e-4 | 9.86e-3 | 2.47e-4 | 1.99e-5 | 9.74e-3 |
| Energy | 1.87e-2 | 1.40e-4 | 7.96e-3 | 3.16e-4 | 9.45e-4 | 9.16e-3 | 3.09e-4 | 1.23e-4 | 9.38e-3 |
| Wine | 1.09e-2 | 3.02e-4 | 1.01e-2 | 1.14e-4 | 3.61e-4 | 9.94e-3 | 1.16e-4 | 2.72e-4 | 9.86e-3 |
| Yacht | 1.28e-2 | 5.76e-5 | 1.11e-2 | 6.60e-4 | 4.75e-4 | 1.13e-2 | 1.67e-4 | 4.63e-5 | 7.68e-3 |
| Kin8nm | 1.00e-2 | 1.46e-3 | 1.41e-2 | 1.10e-4 | 1.39e-3 | 9.69e-3 | 1.01e-4 | 1.38e-3 | 9.76e-3 |
| Protein | 9.20e-3 | 7.87e-3 | 9.83e-3 | 7.08e-5 | 8.08e-3 | 9.81e-3 | 8.60e-5 | 2.18e-3 | 1.00e-2 |
| CCPP | 1.64e-2 | 2.42e-3 | 9.99e-3 | 2.91e-4 | 2.24e-3 | — | 2.57e-4 | 1.59e-3 | 1.02e-2 |

Table 1: Conditional generative models in regression. Best results are in bold.

Kernel Selections. As the regression data are low-dimensional, we choose $k_1$ and $k_2$ to be standard Gaussian kernels $k_1(x_1,x_2):=\exp\big(-\frac{1}{2}\|x_1-x_2\|_2^2\big)$ and $k_2(y_1,y_2):=\exp\big(-\frac{1}{2}\|y_1-y_2\|_2^2\big)$. Note that they readily satisfy both Assumptions 1 and 2.

Implementation Details. For regression tasks, we apply a simple network architecture with 2 hidden layers to avoid overfitting. We use the ReLU function as the activation function, and the number of neurons in each hidden layer is 32. The input of the generative model is the concatenation of two vectors: the covariate vector $X$ and the extra random vector $\xi$ following a $10$-dimensional uniform distribution $\text{Uniform}([-1,1]^{10})$. Our network is optimized by the Adam optimizer [25] with learning rate 0.0005. For the preprocessing step, we follow the same experimental procedure as [44] for data normalization and dataset splitting. The AMMD evaluation omits the free-of-model constant $C_0$, as justified by Theorem 5.
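A minimal PyTorch sketch of the regression generator described above is given below; the module name and the exact interface are our own conventions rather than the released implementation.

```python
import torch
import torch.nn as nn

class RegressionGenerator(nn.Module):
    """G_theta(xi, x): concatenates the covariate x with the noise xi and maps it
    through two hidden layers of 32 ReLU units to a q-dimensional output."""
    def __init__(self, x_dim, xi_dim=10, y_dim=1, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + xi_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, y_dim),
        )

    def forward(self, xi, x):
        return self.net(torch.cat([x, xi], dim=-1))

# Example: one conditional draw at a batch of covariates.
# x = torch.randn(16, 8); xi = torch.rand(16, 10) * 2 - 1
# y_sample = RegressionGenerator(x_dim=8)(xi, x)
```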

We compare the performance of our proposed J-CGM and A-CGM with the state-of-the-art baseline CGMMN [47] on multiple real-world benchmark regression datasets. Table 1 reports the evaluation metrics of the different models on the testing data. As shown, J-CGM and A-CGM achieve competitive performance under the AMMD and JMMD evaluation criteria, while A-CGM tends to produce better results on AMMD. In contrast, although CGMMN can produce satisfactory results on AMMD, it generally underperforms on JMMD. Under the FID criterion, J-CGM and A-CGM achieve slightly better results than CGMMN. However, note that FID is a heuristic criterion without the statistical properties we develop for AMMD and JMMD and is thus less reliable.

5.2 Aleatoric Uncertainty in Image Generation

Figure 1: Random conditional samples generated by different approaches.

Kernel Selections. As pointed out by previous studies on deep kernels [34, 33, 53, 35, 11], for complicated and high-dimensional real-world data, a kernel test using a simple kernel such as the Gaussian kernel should be conducted on the code/feature space instead of the original data space, in order to provide stronger signals for discrepancy measurement between high-dimensional distributions. Following this guidance, we apply an auto-encoder [48] to learn representative features of the input images in the preprocessing step. Precisely, suppose the pre-trained auto-encoder network is given by $B_{\omega'}\circ A_\omega$, where $A_\omega:\mathcal{Y}\to\hat{\mathcal{Y}}$ is the encoder network with $\hat{\mathcal{Y}}$ being the lower-dimensional code space and $B_{\omega'}:\hat{\mathcal{Y}}\to\mathcal{Y}$ is the decoder network. We use the following feature-aware deep kernel on the image space $\mathcal{Y}$:
$$k_2(y_1,y_2)=\Big((1-\epsilon_0)\,\kappa_1\big(A_\omega(y_1),A_\omega(y_2)\big)+\epsilon_0\Big)\,\kappa_2(y_1,y_2),$$
where $\kappa_1$ is a Gaussian kernel defined on the code space $\hat{\mathcal{Y}}$, $\kappa_2$ is a Gaussian kernel defined on the original image space $\mathcal{Y}$, and $\epsilon_0\in(0,1)$ is introduced to ensure that $k_2(y_1,y_2)$ is a characteristic kernel [35, 11]. We set $k_1$ to be a standard Gaussian kernel since $\mathcal{X}$ is low-dimensional.
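A small sketch of this feature-aware kernel is shown below; the encoder handle, the Gram-matrix functions $\kappa_1,\kappa_2$, and the value of $\epsilon_0$ are placeholders to be supplied by the user (e.g., Gaussian kernels as in the sketch of Section 3.2).

```python
def deep_kernel_k2(y1, y2, encoder, kappa1, kappa2, eps0=0.1):
    """Feature-aware kernel on the image space:
    k2(y, y') = ((1 - eps0) * kappa1(A(y), A(y')) + eps0) * kappa2(y, y'),
    where `encoder` stands in for the pre-trained encoder A_omega."""
    code1, code2 = encoder(y1), encoder(y2)
    return ((1 - eps0) * kappa1(code1, code2) + eps0) * kappa2(y1, y2)
```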

Corresponding to our kernel, we now assume that all conditional generative models output samples in the code space for the convenience of MMD tests: $G_\theta(\xi,X):\mathbb{R}^m\times\mathcal{X}\to\hat{\mathcal{Y}}$. The generated image is then given by $B_{\omega'}\circ G_\theta(\xi,X)$.

Implementation Details. In the auto-encoder network, the encoder and decoder networks $A_\omega$ and $B_{\omega'}$ each have a single hidden layer with 1024 neurons. The dimension of the code space $\hat{\mathcal{Y}}$ is 32. The generative network has 3 hidden layers with the ReLU function as the activation function; the numbers of neurons in the hidden layers are 64, 256, and 256. The generative network takes two vectors as input: the one-hot encoding of the label $X$ and the extra random vector $\xi$ following a $10$-dimensional uniform distribution $\text{Uniform}([-1,1]^{10})$. The generative network is optimized by the Adam optimizer [25] with learning rate 0.001.

In Figure 1, we show a few random conditional samples of the reconstructed images from A-CGM, J-CGM, and CGMMN. Overall, all models can generate clear and recognizable samples of handwritten digits. In particular, the reconstructed images from J-CGM are more diverse, with multiple writing styles, while those from A-CGM are more clearly distinct. These results demonstrate the effectiveness of our approaches on multiple real-world applications.

6 Conclusions

In this paper, we study the feasibility of leveraging conditional generative models for aleatoric uncertainty estimation. With theoretical justification, we propose two metrics for discrepancy measurement between two conditional distributions and demonstrate that both metrics can be easily and unbiasedly computed via Monte Carlo simulation. Experimental evaluations on multiple tasks corroborate our theory and further demonstrate the effectiveness of our approaches in real-world applications. Our study explores a new direction for aleatoric uncertainty estimation, which overcomes several limitations of previous research. In the future, we will extend our approaches for aleatoric uncertainty estimation to more real-world applications, such as super-resolution image generation.

References

  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.
  • [2] C. M. Bishop. Mixture density networks. 1994.
  • [3] O. Breuleux, Y. Bengio, and P. Vincent. Quickly generating representative samples from an rbm-derived process. Neural Computation, 23(8):2058–2073, 2011.
  • [4] G. C. Cawley, N. L. Talbot, and O. Chapelle. Estimating predictive variances with kernel ridge regression. In Machine Learning Challenges Workshop, pages 56–77. Springer, 2005.
  • [5] P. Cui, W. Hu, and J. Zhu. Calibrated reliable regression using maximum mean discrepancy. Advances in Neural Information Processing Systems, 33:17164–17175, 2020.
  • [6] N. Dalmasso, T. Pospisil, A. B. Lee, R. Izbicki, P. E. Freeman, and A. I. Malz. Conditional density estimation tools in python and r with applications to photometric redshifts and likelihood-free cosmological inference. Astronomy and Computing, 30:100362, 2020.
  • [7] V. Dutordoir, H. Salimbeni, J. Hensman, and M. Deisenroth. Gaussian process conditional density estimation. Advances in Neural Information Processing Systems, 31, 2018.
  • [8] M. Fasiolo, S. N. Wood, M. Zaffran, R. Nedellec, and Y. Goude. Fast calibrated additive quantile regression. Journal of the American Statistical Association, 116(535):1402–1412, 2021.
  • [9] K. Fukumizu, L. Song, and A. Gretton. Kernel bayes’ rule: Bayesian inference with positive definite kernels. The Journal of Machine Learning Research, 14(1):3753–3783, 2013.
  • [10] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
  • [11] R. Gao, F. Liu, J. Zhang, B. Han, T. Liu, G. Niu, and M. Sugiyama. Maximum mean discrepancy test is aware of adversarial attacks. In International Conference on Machine Learning, pages 3564–3575. PMLR, 2021.
  • [12] T. Gneiting, F. Balabdaoui, and A. E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268, 2007.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
  • [14] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • [15] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. Advances in Neural Information Processing Systems, 30, 2017.
  • [16] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
  • [17] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
  • [18] J. M. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
  • [19] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
  • [20] M. P. Holmes, A. G. Gray, and C. L. Isbell. Fast nonparametric conditional density estimation. arXiv preprint arXiv:1206.5278, 2012.
  • [21] E. Hüllermeier and W. Waegeman. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110(3):457–506, 2021.
  • [22] R. Izbicki, A. B. Lee, and P. E. Freeman. Photo-zz estimation: An example of nonparametric conditional density estimation under selection bias. The Annals of Applied Statistics, 11(2):698–724, 2017.
  • [23] O. Kallenberg. Foundations of modern probability, volume 2. Springer, 1997.
  • [24] A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30, 2017.
  • [25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [26] V. Kuleshov, N. Fenner, and S. Ermon. Accurate uncertainties for deep learning using calibrated regression. In International conference on machine learning, pages 2796–2804. PMLR, 2018.
  • [27] V. Kuleshov and P. S. Liang. Calibrated structured prediction. Advances in Neural Information Processing Systems, 28, 2015.
  • [28] M. Kull and P. Flach. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 68–85. Springer, 2015.
  • [29] M. Kull, M. Perello Nieto, M. Kängsepp, T. Silva Filho, H. Song, and P. Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. Advances in Neural Information Processing Systems, 32, 2019.
  • [30] A. Kumar, P. S. Liang, and T. Ma. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32, 2019.
  • [31] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017.
  • [32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [33] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. Mmd gan: Towards deeper understanding of moment matching network. Advances in Neural Information Processing Systems, 30, 2017.
  • [34] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727. PMLR, 2015.
  • [35] F. Liu, W. Xu, J. Lu, G. Zhang, A. Gretton, and D. J. Sutherland. Learning deep kernels for non-parametric two-sample tests. In International Conference on Machine Learning, pages 6316–6326. PMLR, 2020.
  • [36] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are gans created equal? a large-scale study. Advances in neural information processing systems, 31, 2018.
  • [37] M. Minderer, J. Djolonga, R. Romijnders, F. Hubis, X. Zhai, N. Houlsby, D. Tran, and M. Lucic. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems, 34, 2021.
  • [38] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [39] K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
  • [40] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, 2005.
  • [41] D. A. Nix and A. S. Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 ieee international conference on neural networks (ICNN’94), volume 1, pages 55–60. IEEE, 1994.
  • [42] S. Nowozin, B. Cseke, and R. Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. Advances in Neural Information Processing Systems, 29, 2016.
  • [43] J. Park and K. Muandet. A measure-theoretic approach to kernel conditional mean embeddings. Advances in Neural Information Processing Systems, 33:21247–21259, 2020.
  • [44] T. Pearce, M. Zaki, A. Brintrup, and A. Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. arXiv preprint arXiv:1802.07167, 2018.
  • [45] T. Pospisil and A. B. Lee. Rfcde: Random forests for conditional density estimation. arXiv preprint arXiv:1804.05753, 2018.
  • [46] T. Pospisil and A. B. Lee. (f) rfcde: Random forests for conditional density estimation and functional data. arXiv preprint arXiv:1906.07177, 2019.
  • [47] Y. Ren, J. Zhu, J. Li, and Y. Luo. Conditional generative moment-matching networks. Advances in Neural Information Processing Systems, 29, 2016.
  • [48] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
  • [49] H. Song, T. Diethe, M. Kull, and P. Flach. Distribution calibration for regression. In International Conference on Machine Learning, pages 5897–5906. PMLR, 2019.
  • [50] L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.
  • [51] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968, 2009.
  • [52] M. Sugiyama, I. Takeuchi, T. Suzuki, T. Kanamori, H. Hachiya, and D. Okanohara. Conditional density estimation via least-squares density ratio estimation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 781–788. JMLR Workshop and Conference Proceedings, 2010.
  • [53] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378. PMLR, 2016.