
A Unified Framework for Multi-distribution Density Ratio Estimation

Lantao Yu
Department of Computer Science
Stanford University
lantaoyu@cs.stanford.edu

Yujia Jin
Department of Management Science and Engineering
Stanford University
yujiajin@stanford.edu

Stefano Ermon
Department of Computer Science
Stanford University
ermon@cs.stanford.edu
Abstract

Binary density ratio estimation (DRE), the problem of estimating the ratio $p_1/p_2$ given their empirical samples, provides the foundation for many state-of-the-art machine learning algorithms such as contrastive representation learning and covariate shift adaptation. In this work, we consider a generalized setting where, given samples from multiple distributions $p_1,\ldots,p_k$ (for $k>2$), we aim to efficiently estimate the density ratios between all pairs of distributions. Such a generalization leads to important new applications such as estimating statistical discrepancy among multiple random variables like multi-distribution $f$-divergence, and bias correction via multiple importance sampling. We then develop a general framework from the perspective of Bregman divergence minimization, where each strictly convex multivariate function induces a proper loss for multi-distribution DRE. Moreover, we rederive the theoretical connection between multi-distribution density ratio estimation and class probability estimation, justifying the use of any strictly proper scoring rule composite with a link function for multi-distribution DRE. We show that our framework leads to methods that strictly generalize their counterparts in binary DRE, as well as new methods that show comparable or superior performance on various downstream tasks.

1 Introduction

Estimating the density ratio between two distributions from their empirical samples is a central problem in machine learning. It continuously drives progress in the field and finds applications in many tasks, including anomaly detection (Hido et al., 2008; Smola et al., 2009; Hido et al., 2011), importance weighting in covariate shift adaptation (Huang et al., 2006; Sugiyama et al., 2007), generative modeling (Uehara et al., 2016; Nowozin et al., 2016; Grover et al., 2019), two-sample testing (Sugiyama et al., 2011; Gretton et al., 2012), and mutual information estimation and representation learning (Oord et al., 2018; Hjelm et al., 2018). The paradigm is powerful because computing density ratios focuses on extracting and preserving the contrastive information between two distributions, which is crucial in many tasks. Despite the tremendous success of binary DRE, many applications involve more than two probability distributions, and developing density ratio estimation methods for multiple distributions has the potential to advance various applications such as estimating multi-distribution statistical discrepancy measures (Garcia-Garcia & Williamson, 2012), multi-domain transfer learning, bias correction and variance reduction with multiple importance sampling (Elvira et al., 2019), multi-marginal generative modeling (Cao et al., 2019), and multilingual machine translation (Dong et al., 2015; Aharoni et al., 2019).

Although recent years have witnessed significant progress toward more sophisticated and advanced methods for binary DRE (Sugiyama et al., 2012; Liu et al., 2017; Rhodes et al., 2020; Kato & Teshima, 2021; Choi et al., 2021), methods for estimating density ratios among multiple distributions remain largely unexplored. One exception is an empirical study of multi-class logistic regression for multi-task learning (Bickel et al., 2008), in which density ratios serve as resampling weights between the distribution of a pooled set of examples from multiple tasks and the target distribution of a given task at hand, leading to significant accuracy improvements in HIV therapy screening experiments.

In this work, we propose a unified framework based on expected Bregman divergence minimization, in which any strictly convex multivariate function induces a proper loss for multi-distribution DRE, thus generalizing the framework of Sugiyama et al. (2012) to the multi-distribution case. Moreover, by directly generalizing the Bregman identity of Menon & Ong (2016) to the multivariate case, we rederive a result similar to that of Nock et al. (2016), which formally relates losses for multi-distribution density ratio estimation and class probability estimation, and theoretically justifies the use of any strictly proper scoring rule (e.g., the logarithmic score (Good, 1952), the Brier score (Brier et al., 1950) and the pseudo-spherical score (Good, 1971)) composite with a link function for multi-distribution DRE. By choosing a variety of specific convex functions or proper scoring rules, we show that our unified framework leads to methods that strictly generalize their counterparts for binary DRE, as well as new objectives specific to multi-distribution DRE. We demonstrate the effectiveness of our framework, and study and compare the empirical performance of its different instantiations on various downstream tasks that rely on accurate multi-distribution density ratio estimation.

2 Preliminaries

2.1 Multi-class Experiments

In multi-class experiments, we have a pair of random variables $(X,Y)\in\mathcal{X}\times\mathcal{Y}$ with joint distribution $D(X,Y)$, where $\mathcal{X}$ is the sample space and $\mathcal{Y}=[k]:=\{1,\ldots,k\}$ is the finite label space. Define the probability simplex as $\Delta_k:=\{\bm{p}\in\mathbb{R}^k_{\geq 0}\,|\,\mathbf{1}^\top\bm{p}=1\}$. By the chain rule of probability, any joint distribution $D(X,Y)$ can be decomposed into class priors $\pi_i:=\mathbb{P}(Y=i)$ and class conditionals $P_i(x):=\mathbb{P}(X=x|Y=i)$ for $i\in[k]$, or into the sample marginal $M(x):=\mathbb{P}(X=x)$ and the class probability function $\bm{\eta}:\mathcal{X}\to\Delta_k$ (i.e., $\eta_i(x)=\mathbb{P}(Y=i|X=x)$). We write $\bm{\eta}(x)$ as a vector $\bm{\eta}$ and omit $x$ when it is clear from context. Thus we can also represent the joint distribution as $D=(\bm{\pi},P_1,\ldots,P_k)$ (where $\bm{\pi}\in\Delta_k$) or as $(M,\bm{\eta})$. For any $i\in[k]$, we assume $P_i$ has density $p_i$ with respect to the Lebesgue measure.

Remark on notation. To avoid confusion, we emphasize that the class probability is denoted $\eta_i(x)=\mathbb{P}(Y=i|X=x)$ and the class conditional is denoted $P_i(x)=\mathbb{P}(X=x|Y=i)$ with density $p_i(x)$. The former satisfies the normalization constraint $\sum_{i=1}^k\eta_i(x)=1$ for all $x\in\mathcal{X}$, while $i$ in the latter only serves as the index for the $k$ different distributions.

In multi-class classification, given independent and identically distributed (i.i.d.) samples from the joint distribution $D(X,Y)$, we want to learn a probabilistic classifier $\hat{\bm{\eta}}:\mathcal{X}\to\Delta_k$ to approximate the true class probability function $\bm{\eta}$ by minimizing the following $\ell$-risk:

$$\mathcal{L}_{\text{CPE}}(\hat{\bm{\eta}};D)=\mathbb{E}_{D(x,y)}[\ell(y,\hat{\bm{\eta}}(x))]=\mathbb{E}_{x\sim M}\big[\mathbb{E}_{y\sim\bm{\eta}(x)}[\ell(y,\hat{\bm{\eta}}(x))]\big]=\mathbb{E}_{x\sim M}[L(\bm{\eta}(x),\hat{\bm{\eta}}(x))]\quad(1)$$

where $\ell:[k]\times\Delta_k\to\mathbb{R}$ is the loss incurred by using the class predictor $\hat{\bm{\eta}}(x)$ when the true class is $y$, and $L:\Delta_k\times\Delta_k\to\mathbb{R}$ is the expected loss of $\hat{\bm{\eta}}(x)$ under the true class probability $\bm{\eta}(x)$.

Definition 1 (Proper loss).

A loss function $\ell$ is proper if the corresponding expected loss satisfies $L(P,Q)\geq L(P,P)$ for all $P,Q\in\Delta_k$. It is strictly proper if equality holds only when $P=Q$.

In statistical decision theory (Gneiting & Raftery, 2007), the negative of a proper loss is also called a proper scoring rule (i.e., $S(y,\hat{\bm{\eta}}(x))=-\ell(y,\hat{\bm{\eta}}(x))$), which assesses the utility of the prediction. Properness of a loss is desirable in multi-class classification because it encourages the class probability estimator $\hat{\bm{\eta}}$ to match the true class probability function $\bm{\eta}$. An important property of proper losses is summarized in Theorem 1 below.
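As a quick numeric illustration of properness, consider the logarithmic score from the abstract, i.e., the log loss $\ell(y,\hat{\bm{\eta}})=-\log\hat{\eta}_y$: by Gibbs' inequality, the expected loss $L(P,Q)=-\sum_i P_i\log Q_i$ is minimized exactly at $Q=P$, so the log loss is strictly proper. A minimal sketch (assuming NumPy; the function name is ours):

```python
import numpy as np

def expected_log_loss(P, Q):
    """L(P, Q) = E_{y~P}[-log Q_y], the expected log loss of prediction Q under truth P."""
    return -np.sum(P * np.log(Q))

rng = np.random.default_rng(0)
P = np.array([0.5, 0.3, 0.2])  # true class probability vector at some x

# Any other Q on the simplex incurs at least as large an expected loss as Q = P.
for _ in range(100):
    Q = rng.dirichlet(np.ones(3))
    assert expected_log_loss(P, Q) >= expected_log_loss(P, P) - 1e-12
```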

Definition 2 (Bregman divergence).

Given a differentiable convex function $\phi:\mathcal{S}\to\mathbb{R}$ defined on a convex set $\mathcal{S}\subset\mathbb{R}^d$ and two points $\bm{x},\bm{y}\in\mathcal{S}$, the Bregman divergence from $\bm{x}$ to $\bm{y}$ is defined as:

$$\mathbf{B}_{\phi}(\bm{x},\bm{y}):=\phi(\bm{x})-\phi(\bm{y})-\langle\bm{x}-\bm{y},\nabla\phi(\bm{y})\rangle\quad(2)$$
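Definition 2 translates directly into code. The following sketch (assuming NumPy; function names are ours) evaluates Eq. (2) for a user-supplied $\phi$ and $\nabla\phi$, and checks two classical special cases: $\phi(\bm{x})=\|\bm{x}\|^2$ recovers the squared Euclidean distance, and the negative Shannon entropy recovers the KL divergence on the simplex:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """B_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>, as in Eq. (2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return phi(x) - phi(y) - np.dot(x - y, grad_phi(y))

# phi(x) = ||x||^2  ==>  B_phi(x, y) = ||x - y||^2
sq, grad_sq = (lambda x: np.dot(x, x)), (lambda x: 2.0 * x)
x, y = np.array([1.0, 2.0]), np.array([0.0, 0.5])
assert np.isclose(bregman(sq, grad_sq, x, y), np.sum((x - y) ** 2))

# phi(p) = sum_i p_i log p_i  ==>  B_phi(p, q) = KL(p || q) on the simplex
negent, grad_negent = (lambda p: np.sum(p * np.log(p))), (lambda p: np.log(p) + 1.0)
p, q = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
assert np.isclose(bregman(negent, grad_negent, p, q), np.sum(p * np.log(p / q)))
```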
Theorem 1 ((Gneiting & Raftery, 2007); Proposition 7 in (Vernet et al., 2011)).

Given a proper loss $\ell$ with corresponding expected loss $L$, for any $P,Q\in\Delta_k$, the generalized entropy function $\underline{L}(P):=\inf_{Q\in\Delta_k}L(P,Q)=L(P,P)$ is concave; when $\underline{L}$ is differentiable, the regret or excess risk of a predictor $Q$ over the Bayes-optimal $P$ is the Bregman divergence induced by the convex function $f=-\underline{L}$:

$$\mathrm{reg}(P,Q;\ell):=L(P,Q)-L(P,P)=\mathbf{B}_{f}(P,Q)\quad(3)$$

Given the Bregman divergence representation of the point-wise regret in Theorem 1 and the $\ell$-risk in Equation (1), the excess risk of a class probability estimator $\hat{\bm{\eta}}$ over the Bayes-optimal $\bm{\eta}$ is:

$$\mathrm{reg}(\hat{\bm{\eta}};M,\bm{\eta},\ell):=\mathcal{L}_{\text{CPE}}(\hat{\bm{\eta}};D)-\mathcal{L}_{\text{CPE}}(\bm{\eta};D)=\mathbb{E}_{M(x)}[L(\bm{\eta}(x),\hat{\bm{\eta}}(x))-L(\bm{\eta}(x),\bm{\eta}(x))]=\mathbb{E}_{M(x)}[\mathbf{B}_{f}(\bm{\eta}(x),\hat{\bm{\eta}}(x))]\quad(4)$$

2.2 Multi-distribution $f$-Divergence

Csiszár’s $f$-divergence is a popular way to measure the discrepancy between two probability distributions. Specifically, given two distributions $P,Q$ and a convex function $f:\mathbb{R}_+\to\mathbb{R}\cup\{\pm\infty\}$ satisfying $f(1)=0$, the $f$-divergence between $P$ and $Q$ is defined as $\mathbf{D}_f(P\|Q)=\mathbb{E}_Q[f(\mathrm{d}P/\mathrm{d}Q)]$. In the following, we introduce the multi-distribution extension of $f$-divergence (Garcia-Garcia & Williamson, 2012).

Definition 3 (Multi-distribution ff-divergence).

For $k$ probability distributions $P_1,\ldots,P_k$ on a common probability space $(\mathcal{X},\sigma(\mathcal{X}))$ with densities $p_1,\ldots,p_k$, and a multivariate closed convex function $f:\mathbb{R}_+^{k-1}\to\mathbb{R}\cup\{\pm\infty\}$ satisfying $f(\bm{1})=0$, the multi-distribution $f$-divergence between $P_1,\ldots,P_{k-1}$ and $P_k$ is defined as:

$$\mathbf{D}_{f}\left(P_{1},\ldots,P_{k-1}\|P_{k}\right)=\mathbb{E}_{p_{k}(x)}\left[f\left(\frac{p_{1}(x)}{p_{k}(x)},\ldots,\frac{p_{k-1}(x)}{p_{k}(x)}\right)\right]\quad(5)$$

2.3 Connecting Density Ratios and Class Probabilities via Link Function

Inspired by the definition in Eq. (5), we consider the following canonical density ratio vector (more discussion of this choice can be found in Section 3.2): $\bm{r}(x)=(r_1(x),\ldots,r_k(x))$, where $r_i(x):=p_i(x)/p_k(x)$ and $r_k(x)=1$. We can then connect a density ratio vector $\bm{r}(x)\in\mathbb{R}_+^{k-1}\times\{1\}$ and a class probability vector $\bm{\eta}(x)\in\Delta_k$ via an invertible link function.

According to Bayes’ theorem, we have:

$$\frac{\mathbb{P}(X=x,Y=i)}{\mathbb{P}(X=x,Y=k)}=\frac{\pi_{i}p_{i}(x)}{\pi_{k}p_{k}(x)}=\frac{M(x)\eta_{i}(x)}{M(x)\eta_{k}(x)}\;\Leftrightarrow\;r_{i}(x)=\frac{p_{i}(x)}{p_{k}(x)}=\frac{\pi_{k}}{\pi_{i}}\cdot\frac{\eta_{i}(x)}{\eta_{k}(x)}.\quad(6)$$

Thus we define the following multi-distribution link function $\Psi_{\mathrm{dr}}:\Delta_k\to\mathbb{R}_+^{k-1}\times\{1\}$ as a natural generalization of the binary DRE link function (Menon & Ong, 2016; Vernet et al., 2011):

$$[\Psi_{\mathrm{dr}}(\bm{\eta}(x))]_{i}:=\frac{\pi_{k}}{\pi_{i}}\cdot\frac{\eta_{i}(x)}{\eta_{k}(x)}=r_{i}(x),\quad\text{for all }i\in[k].\quad(7)$$

Given Eq. (7) and the normalization constraint $\sum_{i\in[k]}\eta_i=1$, we obtain the inverse link function:

$$[\Psi^{-1}_{\mathrm{dr}}(\bm{r}(x))]_{i}:=\frac{\pi_{i}r_{i}(x)}{\sum_{j\in[k]}\pi_{j}r_{j}(x)}=\eta_{i}(x),\quad\text{for all }i\in[k].\quad(8)$$

Thus, given knowledge of the prior distribution $\bm{\pi}$ (which can also be easily estimated from empirical samples), one can transform a class probability estimator into a density ratio estimator via $\hat{\bm{r}}(x)=\Psi_{\mathrm{dr}}(\hat{\bm{\eta}}(x))$, and vice versa via $\hat{\bm{\eta}}(x)=\Psi^{-1}_{\mathrm{dr}}(\hat{\bm{r}}(x))$.
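The link function in Eq. (7) and its inverse in Eq. (8) can be sketched in a few lines (assuming NumPy; function names are ours, and the last coordinate of the arrays plays the role of class $k$), including a round-trip check that $\Psi^{-1}_{\mathrm{dr}}(\Psi_{\mathrm{dr}}(\bm{\eta}))=\bm{\eta}$:

```python
import numpy as np

def link(eta, pi):
    """Psi_dr, Eq. (7): r_i = (pi_k / pi_i) * eta_i / eta_k (index k = last coordinate)."""
    return (pi[-1] / pi) * (eta / eta[-1])

def inverse_link(r, pi):
    """Psi_dr^{-1}, Eq. (8): eta_i = pi_i r_i / sum_j pi_j r_j."""
    w = pi * r
    return w / w.sum()

pi = np.array([0.2, 0.3, 0.5])   # class priors (k = 3)
eta = np.array([0.1, 0.6, 0.3])  # class probabilities at some point x

r = link(eta, pi)
assert np.isclose(r[-1], 1.0)                 # canonical constraint r_k = p_k / p_k = 1
assert np.allclose(inverse_link(r, pi), eta)  # round trip recovers eta
```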

3 A Unified Framework for Multi-distribution DRE

3.1 Multi-distribution Density Ratio Estimation Problem Setup

Following the basic formulation of multi-class experiments in Section 2.1, we now introduce the problem setup of multi-distribution density ratio estimation (DRE). Recall that $\mathcal{X}$ is the common data domain and $P_1,\ldots,P_k$ are $k$ different distributions defined on $\mathcal{X}$ with densities $p_1,\ldots,p_k$. Suppose we are given $n_i$ i.i.d. samples $\{x_j^{(i)}\}_{j=1}^{n_i}$ from each distribution $P_i$. The goal of multi-distribution DRE is to estimate the density ratios between all pairs of distributions, $\{r_{ij}:=p_i/p_j\}_{i,j\in[k]}$, from the i.i.d. datasets $\{\{x_j^{(i)}\}_{j=1}^{n_i}\}_{i=1}^k$. In this paper, we assume that the density ratios are always well defined on the domain $\mathcal{X}$ (e.g., when the distributions have strictly positive densities), which is also a common assumption in the binary DRE problem (Kanamori et al., 2009; Kato & Teshima, 2021).

A naive approach to this problem is to separately estimate each density $p_i$ from $\{x_j^{(i)}\}_{j=1}^{n_i}$ and then plug $p_i$ and $p_j$ in to get $r_{ij}$. However, as previous theoretical works (Kpotufe, 2017; Nguyen et al., 2007; Kanamori et al., 2012; Que & Belkin, 2013) suggest, directly estimating density ratios has many advantages in practical settings: (1) optimal convergence rates depend only on the smoothness of the density ratio and not on that of the densities; (2) optimal rates depend only on the intrinsic dimension of the data, thus escaping the curse of dimensionality in density estimation. Inspired by these observations in binary DRE, this paper aims to develop a general framework for directly estimating multi-distribution density ratios. Moreover, in Section 4 we theoretically prove that various interesting facts (Menon & Ong, 2016; Sugiyama et al., 2012) that hold in the binary case extend to our multi-distribution case.

While most previous works focus on DRE in the binary case, multi-distribution DRE has many important downstream applications. For example, given an integrable function $\phi:\mathcal{X}\to\mathbb{R}$, suppose we want to use importance sampling to estimate the expectation of $\phi$ with respect to a target distribution $Q$ with density $q$ w.r.t. the base measure:

$$\mathbb{E}_{q(x)}[\phi(x)]=\int_{\mathcal{X}}q(x)\phi(x)\,\mathrm{d}x=\int_{\mathcal{X}}p(x)\frac{q(x)}{p(x)}\phi(x)\,\mathrm{d}x=\mathbb{E}_{p(x)}[r(x)\cdot\phi(x)]\quad(9)$$

where we use the density ratio $r=q/p$ to correct the bias caused by using samples from the proposal distribution $p$ rather than the target distribution $q$. However, in practice, finding a good proposal is critical yet challenging (Owen & Zhou, 2000). An alternative and more robust strategy is to use a population of different proposals (sampling schemes) together with a set of density ratios to correct the bias, which is also known as multiple importance sampling (MIS) (Cappé et al., 2004; Elvira et al., 2015). Given $k$ different proposals $p_1,\ldots,p_k$, the MIS estimate of the expectation is given by:

$$\mathbb{E}_{q(x)}[\phi(x)]=\sum_{i=1}^{k}\omega_{i}\,\mathbb{E}_{p_{i}(x)}\left[\frac{q(x)}{p_{i}(x)}\phi(x)\right]\quad(10)$$

where $\omega_i$ is the weight for each proposal $p_i$ and $\sum_i\omega_i=1$. Thus a more efficient and accurate multi-distribution DRE method leads to better MIS. In the context of multi-source off-policy evaluation (Kallus et al., 2021), the proposals correspond to a set of demonstration policies and the target distribution is the query policy whose performance we want to evaluate from the offline multi-source demonstrations; in the context of multi-domain transfer learning (covariate shift adaptation) (Bickel et al., 2008; Dinh et al., 2013), the proposals correspond to a set of data-generating distributions (e.g., multiple source domains or various data augmentation strategies) and the target is the test distribution we care about. Estimating multi-distribution density ratios also allows us to compute important information quantities among multiple random variables, such as the multi-distribution $f$-divergence in Equation (5), which can be used to analyze various kinds of discrepancy and correlation among multiple random variables and further has the potential to inspire new generative models for the multiple marginal matching problem (Cao et al., 2019).
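A minimal numeric sketch of the MIS identity in Eq. (10): we take the target $q=\mathcal{N}(0,1)$, $\phi(x)=x^2$ (so $\mathbb{E}_q[\phi]=1$), and two Gaussian proposals. The densities, proposal parameters, and weights below are our own illustrative choices, and we use exact density ratios $q/p_i$ as a sanity check; in practice these ratios would come from a multi-distribution DRE estimator:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
phi = lambda x: x ** 2                              # E_q[phi] = 1 for q = N(0, 1)
mus, sigmas, omegas = [-1.0, 1.0], [1.5, 1.5], [0.5, 0.5]  # two proposals, omega sums to 1

est = 0.0
for mu, sigma, omega in zip(mus, sigmas, omegas):
    x = mu + sigma * rng.standard_normal(200_000)   # samples from proposal p_i
    ratio = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, mu, sigma)  # density ratio q / p_i
    est += omega * np.mean(ratio * phi(x))          # Eq. (10), Monte Carlo form

assert abs(est - 1.0) < 0.05                        # close to the true E_q[phi] = 1
```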

3.2 Multi-distribution DRE via Bregman Divergence Minimization

Inspired by the success of Bregman divergence minimization in unifying various binary DRE methods (Sugiyama et al., 2012), in this section we propose a general framework for solving the multi-distribution density ratio estimation problem. First, we discuss our modeling choices. Although our goal is to estimate $\binom{k}{2}$ density ratios (between all possible pairs), the solution set $\{r_{ij}:=p_i/p_j\}_{i,j\in[k]}$ actually has only $k-1$ degrees of freedom (e.g., $r_{ik}=r_{ij}\cdot r_{jk}$). Thus, without loss of generality, we parametrize $k-1$ density ratio models $\hat{\bm{r}}_{\bm{\theta}}=(\hat{r}_{\theta_1},\ldots,\hat{r}_{\theta_{k-1}})$ to approximate the true canonical density ratios $\bm{r}=(r_1,\ldots,r_{k-1})$, where $r_i:=p_i/p_k$ for $i\in[k-1]$. For notational simplicity, we omit the dependence on the parameters $\bm{\theta}$ and write our density ratio models as $\hat{\bm{r}}=(\hat{r}_1,\ldots,\hat{r}_{k-1})$. An advantage of this modeling choice is that any density ratio can be recovered in one step of computation, $\frac{p_i}{p_j}=\frac{p_i/p_k}{p_j/p_k}=\frac{r_i}{r_j}$, thus avoiding large compounding errors while naturally ensuring consistency within the solution set (if we instead parametrized $\hat{r}_{ij}$, $\hat{r}_{jk}$ and $\hat{r}_{ik}$ separately, we would have to ensure that $\hat{r}_{ik}=\hat{r}_{ij}\cdot\hat{r}_{jk}$).

Since our goal is to optimize $\hat{\bm{r}}$ to approximate the true density ratios $\bm{r}$, we use the Bregman divergence (Def. 2) to measure the discrepancy between $\bm{r}$ and $\hat{\bm{r}}$. Specifically, for any strictly convex function $f:\mathbb{R}^{k-1}_+\to\mathbb{R}$ and any $x\in\mathcal{X}$, we have the following point-wise optimization problem:

$$\min_{\hat{\bm{r}}(x)\in\mathbb{R}^{k-1}_{+}}\mathbf{B}_{f}(\bm{r}(x),\hat{\bm{r}}(x))=f(\bm{r}(x))-f(\hat{\bm{r}}(x))-\langle\nabla f(\hat{\bm{r}}(x)),\bm{r}(x)-\hat{\bm{r}}(x)\rangle\quad(11)$$

which corresponds to the difference between the value of $f$ at $\bm{r}$ and the value of the first-order Taylor expansion of $f$ around $\hat{\bm{r}}$ evaluated at $\bm{r}$. Although this formulation can be understood as a regression problem from $\hat{\bm{r}}(x)$ to the true density ratios $\bm{r}(x)$, we actually only have i.i.d. samples $x\sim p_1,\ldots,p_k$ rather than the true targets $\bm{r}(x)$. In this case, we use the following expected Bregman divergence to measure the overall discrepancy between the true density ratios $\bm{r}$ and the density ratio models $\hat{\bm{r}}$:

$$\mathcal{L}_{\text{DRE}}(\hat{\bm{r}};D)=\int_{\mathcal{X}}p_{k}(x)\Big(f(\bm{r}(x))-f(\hat{\bm{r}}(x))-\langle\nabla f(\hat{\bm{r}}(x)),\bm{r}(x)-\hat{\bm{r}}(x)\rangle\Big)\mathrm{d}x\quad(12)$$
$$=\mathbb{E}_{p_{k}(x)}\left[\langle\nabla f(\hat{\bm{r}}(x)),\hat{\bm{r}}(x)\rangle-f(\hat{\bm{r}}(x))\right]-\sum_{i\in[k-1]}\mathbb{E}_{p_{i}(x)}[\partial_{i}f(\hat{\bm{r}}(x))]+C\quad(13)$$

where $C:=\int_{\mathcal{X}}p_k(x)f(\bm{r}(x))\mathrm{d}x=\mathbf{D}_f(P_1,\ldots,P_{k-1}\|P_k)$ is a constant with respect to $\hat{\bm{r}}$, and the equality follows from the fact that $p_k\cdot(r_1,\ldots,r_{k-1})=(p_1,\ldots,p_{k-1})$ by the definition of $\bm{r}$. The rationale behind this choice is that it allows an unbiased estimate of the discrepancy between $\bm{r}$ and $\hat{\bm{r}}$ using only i.i.d. samples from $p_1,\ldots,p_k$. Specifically, since $C$ is a constant, we have the following optimization problem over $\hat{\bm{r}}$ to approximate the true density ratios (where each expectation $\mathbb{E}_{p_i}$ can be empirically estimated using samples from $p_i$):

$$\min_{\hat{\bm{r}}:\mathcal{X}\to\mathbb{R}^{k-1}_{+}}\;\mathbb{E}_{p_{k}(x)}\left[\langle\nabla f(\hat{\bm{r}}(x)),\hat{\bm{r}}(x)\rangle-f(\hat{\bm{r}}(x))\right]-\sum_{i\in[k-1]}\mathbb{E}_{p_{i}(x)}\left[\partial_{i}f(\hat{\bm{r}}(x))\right]\quad(14)$$
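To make the objective in Eq. (14) concrete, the following sketch evaluates its population value exactly on a small discrete domain (our own illustrative densities, $k=3$), using the quadratic choice $f(\hat{\bm{r}})=\frac{1}{2}\|\hat{\bm{r}}-\bm{1}\|^2$ from the Multi-LSIF instantiation in Section 5, and checks that the true ratios minimize it (since the objective equals $\mathbb{E}_{p_k}[\mathbf{B}_f(\bm{r},\hat{\bm{r}})]-C$ and the Bregman divergence is nonnegative):

```python
import numpy as np

# Three distributions on the 3-point domain {0, 1, 2}; row i is p_{i+1}, last row is p_k.
p = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4]])

f      = lambda r: 0.5 * np.sum((r - 1.0) ** 2, axis=-1)  # quadratic convex function
grad_f = lambda r: r - 1.0

def objective(r_hat):
    """Population value of Eq. (14): E_{p_k}[<grad f, r_hat> - f] - sum_i E_{p_i}[d_i f]."""
    first = np.sum(p[-1] * (np.sum(grad_f(r_hat) * r_hat, axis=-1) - f(r_hat)))
    second = sum(np.sum(p[i] * grad_f(r_hat)[:, i]) for i in range(2))
    return first - second

r_true = (p[:-1] / p[-1]).T   # shape (3, 2): true ratios r_i(x) = p_i(x) / p_k(x)

# Any multiplicative perturbation of the true ratios can only increase the objective.
rng = np.random.default_rng(0)
for _ in range(100):
    r_pert = r_true * np.exp(0.3 * rng.standard_normal(r_true.shape))
    assert objective(r_true) <= objective(r_pert) + 1e-12
```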

Interestingly, the above multi-distribution DRE formulation, which is based on Bregman divergence minimization, can alternatively be derived from the perspective of variational estimation of multi-distribution $f$-divergence. In the following, we briefly discuss this interpretation of Eq. (14).

Based on Fenchel duality, we can represent any strictly convex function $f:\mathbb{R}_+^{k-1}\to\mathbb{R}\cup\{+\infty\}$ through its conjugate function $f^*(\bm{s}):=\max_{\bm{r}\in\mathbb{R}^{k-1}_+}\langle\bm{s},\bm{r}\rangle-f(\bm{r})$ as:

$$f(\bm{r}(x))=\max_{\bm{s}:\mathcal{X}\to\mathbb{R}^{k-1}}\langle\bm{r}(x),\bm{s}(x)\rangle-f^{*}(\bm{s}(x)),\quad\text{for any }x\in\mathcal{X}.\quad(15)$$

In order to estimate the multi-distribution $f$-divergence defined in Eq. (5) using only samples from $P_1,\ldots,P_k$ (instead of their density information), we consider the following variational representation of the multi-distribution $f$-divergence, obtained by substituting Eq. (15) into Eq. (5):

$$\mathbf{D}_{f}(P_{1},\ldots,P_{k-1}\|P_{k})=-\min_{\bm{s}:\mathcal{X}\to\mathbb{R}^{k-1}}\left[-\sum_{i\in[k-1]}\mathbb{E}_{p_{i}(x)}[\bm{s}(x)]_{i}+\mathbb{E}_{p_{k}(x)}f^{*}(\bm{s}(x))\right]\quad(16)$$

We then have the following proposition revealing the equivalence between the optimization problems in Eqs. (14) and (16).

Proposition 1 (DRE via variational estimation of multi-distribution $f$-divergence).

Given a strictly convex function $f:\mathbb{R}^{k-1}_+\to\mathbb{R}\cup\{+\infty\}$, the optimization problem in Eq. (14) (induced by minimizing the expected Bregman divergence $\mathbf{B}_f(\bm{r},\hat{\bm{r}})$) is equivalent to the one in Eq. (16) (for variational estimation of multi-distribution $f$-divergence) under the change of variables $\nabla f(\hat{\bm{r}}(x))=\bm{s}(x)$ for all $x\in\mathcal{X}$.

4 Connecting Losses for Multi-class Classification and DRE

In this section, we rederive a result similar to that of Nock et al. (2016) by directly generalizing the Bregman identity of Menon & Ong (2016) to the multivariate case, which establishes the theoretical connection between multi-distribution DRE and multi-class classification.

In Section 2.1, we showed that exact minimization of the excess risk for any strictly proper loss $\ell$ yields the true class probability function $\bm{\eta}$, and consequently the true density ratios $\bm{r}$ through the link function $\Psi_{\mathrm{dr}}(\bm{\eta})$. In the following, we take a further step and show that minimizing any strictly proper loss is essentially equivalent to minimizing an expected Bregman divergence between the true density ratios $\bm{r}$ and the approximate density ratios $\hat{\bm{r}}$, thus generalizing the binary-case theoretical results of Menon & Ong (2016) to the multi-distribution case and justifying the validity of using any strictly proper scoring rule (e.g., the Brier score (Brier et al., 1950) and the pseudo-spherical score (Good, 1971)) for multi-distribution DRE. All proofs for this section can be found in Appendix A.3.

We start by introducing the following multivariate Bregman identity.

Lemma 1 (Multivariate Bregman Identity).

Given a convex function $f:\mathbb{R}^{k-1}\to\mathbb{R}$, we can define an associated function $f^{\circledast}(u_1,\ldots,u_{k-1})=\left(1+\sum_{i\in[k-1]}u_i\right)f\left(\frac{1}{1+\sum_{i\in[k-1]}u_i}\cdot\bm{u}\right)$. We can show that (i) $f^{\circledast}$ is convex and (ii) for any $\bm{u},\bm{v}\in\mathbb{R}^{k-1}$, the associated Bregman divergences satisfy:

$$\mathbf{B}_{f}\left(\frac{1}{1+\sum_{i\in[k-1]}u_{i}}\cdot\bm{u},\;\frac{1}{1+\sum_{i\in[k-1]}v_{i}}\cdot\bm{v}\right)=\frac{1}{1+\sum_{i\in[k-1]}u_{i}}\mathbf{B}_{f^{\circledast}}(\bm{u},\bm{v}).\quad(17)$$

One can then apply Lemma 1 with $u_i=\frac{\pi_i}{\pi_k}r_i$ and $v_i=\frac{\pi_i}{\pi_k}\hat{r}_i$ for each $i\in[k-1]$, and use the fact that $\mathbf{B}_{f^{\circledast}_{\pi}}(\bm{r},\hat{\bm{r}})=\mathbf{B}_{f^{\circledast}}(\bm{u},\bm{v})$ for $f^{\circledast}_{\pi}(\bm{r})=f^{\circledast}\left(\frac{1}{\pi_k}\bm{\pi}_{[1:k-1]}\circ\bm{r}\right)$, to establish the following connection between the optimality gaps of density ratio estimators and class probability estimators. Here $\bm{a}\circ\bm{b}$ denotes the element-wise product of vectors $\bm{a}$ and $\bm{b}$, and $\bm{\pi}_{[1:k-1]}\in\mathbb{R}^{k-1}$ denotes the restriction of $\bm{\pi}$ to its first $k-1$ coordinates.
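The identity in Lemma 1 holds for any convex $f$; the following sketch checks it numerically for one strictly convex test function of our choosing, $f(\bm{x})=\sum_i x_i\log x_i$, using central-difference gradients so no gradient has to be derived by hand:

```python
import numpy as np

def num_grad(fun, x, eps=1e-6):
    """Central-difference gradient of a scalar function fun at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (fun(x + e) - fun(x - e)) / (2 * eps)
    return g

def bregman(fun, a, b):
    """B_fun(a, b) = fun(a) - fun(b) - <a - b, grad fun(b)>, per Definition 2."""
    return fun(a) - fun(b) - np.dot(a - b, num_grad(fun, b))

f = lambda x: np.sum(x * np.log(x))                       # strictly convex test function
f_assoc = lambda u: (1 + u.sum()) * f(u / (1 + u.sum()))  # the associated f^{circledast}

u, v = np.array([0.7, 1.3]), np.array([1.1, 0.4])
lhs = bregman(f, u / (1 + u.sum()), v / (1 + v.sum()))    # left side of Eq. (17)
rhs = bregman(f_assoc, u, v) / (1 + u.sum())              # right side of Eq. (17)
assert np.isclose(lhs, rhs, atol=1e-5)
```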

Proposition 2.

For any convex function $f:\mathbb{R}^{k-1}_+\to\mathbb{R}$ and two density ratio vectors $\bm{r}(x)$ and $\hat{\bm{r}}(x)$, one can construct the corresponding class probability vectors $\bm{\eta}(x)=\Psi_{\mathrm{dr}}^{-1}(\bm{r}(x))$ and $\hat{\bm{\eta}}(x)=\Psi_{\mathrm{dr}}^{-1}(\hat{\bm{r}}(x))$ through the inverse link function in Eq. (8), and obtain:

$$\mathbf{B}_{f}\left(\bm{\eta}(x),\hat{\bm{\eta}}(x)\right)=\frac{\pi_{k}}{\pi_{k}+\sum_{i\in[k-1]}\pi_{i}r_{i}(x)}\mathbf{B}_{f^{\circledast}_{\pi}}\left(\bm{r}(x),\hat{\bm{r}}(x)\right)\quad\text{for all }x\in\mathcal{X},\quad(18)$$

where the convex function $f^{\circledast}_{\pi}$ induced by a prior distribution $\bm{\pi}\in\Delta_k$ is defined as

$$f^{\circledast}_{\pi}(r_{1},\ldots,r_{k-1}):=\left(1+\sum_{i\in[k-1]}\pi_{i}r_{i}/\pi_{k}\right)\cdot f\left(\frac{\bm{\pi}_{[1:k-1]}\circ\bm{r}}{\pi_{k}+\sum_{i\in[k-1]}\pi_{i}r_{i}}\right).\quad(19)$$

Combining Proposition 2 with the Bregman divergence representation of the point-wise regret for a proper loss $\ell$ in multi-class classification in Eq. (4), we obtain the following main theorem, which interprets the minimization of multi-class classification regret as multi-distribution DRE under expected Bregman divergence minimization.

Theorem 2.

Given any strictly proper loss $\ell$, for any joint data distribution $D(X,Y)$ with class prior $\bm{\pi}\in\Delta_k$, the multi-class classification regret defined in Eq. (4) satisfies:

$$\mathrm{reg}(\hat{\bm{\eta}};M,\bm{\eta},\ell)=\pi_{k}\,\mathbb{E}_{p_{k}(x)}\mathbf{B}_{f^{\circledast}_{\pi}}(\bm{r}(x),\hat{\bm{r}}(x)),\quad(20)$$

where $f^{\circledast}_{\pi}$ is as defined in Eq. (19), and $\bm{r}=\Psi_{\mathrm{dr}}(\bm{\eta})$ and $\hat{\bm{r}}=\Psi_{\mathrm{dr}}(\hat{\bm{\eta}})$ as defined in Eq. (7).

Theorem 2 generalizes a known equivalence between density ratio estimation and class probability estimation in the binary case (see Section 5 of Menon & Ong (2016)), providing a similar equivalence for the more complicated multi-class experiments. Moreover, compared to the binary-case result, we provide a simpler proof, relax the twice-differentiability assumption on the convex function $f$ induced by the proper loss $\ell$ (i.e., $f=-\underline{L}$; see Theorem 1 for details), and generalize the argument to an arbitrary prior distribution $\bm{\pi}\in\Delta_k$ instead of the uniform prior $\pi_1=\pi_2=1/2$ considered by Menon & Ong (2016).

Moreover, we note that the multi-distribution $f$-divergence among the class conditionals $P_1,\ldots,P_k$ also corresponds to the statistical information measure in multi-class experiments (DeGroot, 1962), defined as the gap between the prior and posterior generalized entropies. Since we have established the equivalence between multi-distribution DRE (Eq. (14)) and variational estimation of multi-distribution $f$-divergence (Eq. (16)), we can show that, by choosing particular convex functions (associated with the loss $\ell$ for multi-class classification), multi-distribution DRE can be viewed as estimating the statistical information measure in multi-class experiments. See Appendix A.3.1 for detailed discussions.

5 Examples of Multi-distribution DRE

In the binary density ratio matching under Bregman divergence framework (Sugiyama et al., 2012), choosing various convex functions recovers popular binary DRE methods such as KLIEP (Sugiyama et al., 2008), LSIF (Kanamori et al., 2009) and logistic regression (Franklin, 2005). In this section, we provide some instantiations of our multi-distribution DRE framework. Specifically, Section 3.2 shows that any strictly convex multivariate function $f:\mathbb{R}^{k-1}_+\to\mathbb{R}$ induces a proper loss for multi-distribution DRE, and Section 4 justifies that any strictly proper scoring rule composite with $\Psi_{\mathrm{dr}}$ can also be used for multi-distribution DRE. We briefly discuss some choices of the convex function or proper scoring rule here, and provide detailed derivations in Appendix A.4.

5.1 Methods Induced by Convex Functions

Multi-class Logistic Regression.  From Section 2.3, we know that there is a one-to-one correspondence between a class probability estimator and a density ratio estimator: 𝒓^=Ψdr𝜼^\hat{{\bm{r}}}=\Psi_{\mathrm{dr}}\circ\hat{{\bm{\eta}}} and 𝜼^=Ψdr1𝒓^\hat{{\bm{\eta}}}=\Psi_{\mathrm{dr}}^{-1}\circ\hat{{\bm{r}}}. For clarity of presentation, here we assume the class prior distribution 𝝅{\bm{\pi}} is uniform, so that r^i(x)=η^i(x)/η^k(x)\hat{r}_{i}(x)=\hat{\eta}_{i}(x)/\hat{\eta}_{k}(x) and η^i(x)=r^i(x)/j=1kr^j(x)\hat{\eta}_{i}(x)=\hat{r}_{i}(x)/\sum_{j=1}^{k}\hat{r}_{j}(x). To recover the loss of multi-class logistic regression, we choose the convex function f(r^1,,r^k1)=1ki=1kr^ilog(r^i/j=1kr^j)f(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\frac{1}{k}\sum_{i=1}^{k}\hat{r}_{i}\log\left(\hat{r}_{i}/\sum_{j=1}^{k}\hat{r}_{j}\right) (with r^k1\hat{r}_{k}\equiv 1). In this case, the loss in Eq. (14) reduces to:

1k𝔼pk(x)[log(j=1kr^j(x))]1ki=1k1𝔼pi(x)[log(r^i(x)j=1kr^j(x))]=(1ki=1k𝔼pi(x)[logη^i(x)])\frac{1}{k}\mathbb{E}_{p_{k}(x)}\left[\log\left(\sum_{j=1}^{k}\hat{r}_{j}(x)\right)\right]-\frac{1}{k}\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\log\left(\frac{\hat{r}_{i}(x)}{\sum_{j=1}^{k}\hat{r}_{j}(x)}\right)\right]=-\left(\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}_{p_{i}(x)}[\log\hat{\eta}_{i}(x)]\right) (21)

We provide discussions for the general case (non-uniform prior 𝝅{\bm{\pi}}) in Appendix A.4.1. Interestingly, we note that the above convex function also gives rise to the multi-distribution Jensen-Shannon divergence (Garcia-Garcia & Williamson, 2012) (also known as the information radius (Sibson, 1969), 𝐃f(P1,,Pk)=1ki=1kDKL(Pi1kj=1kPj)\mathbf{D}_{f}(P_{1},\ldots,P_{k})=\frac{1}{k}\sum_{i=1}^{k}D_{\mathrm{KL}}(P_{i}\|\frac{1}{k}\sum_{j=1}^{k}P_{j})) up to an additive constant of logk\log k.
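As an illustration, the right-hand side of Eq. (21) is simply the multi-class cross-entropy loss applied to logits logr^i\log\hat{r}_{i}, with the last logit fixed to 00 since r^k1\hat{r}_{k}\equiv 1. The following is a minimal NumPy sketch under the uniform-prior setting above; function and variable names are our own illustration, not from the paper's code.

```python
import numpy as np

def multi_lr_dre_loss(log_r, labels):
    """Multi-class logistic-regression loss for DRE (Eq. 21, uniform prior).

    log_r : (n, k) array of log density-ratio estimates log r_hat_i(x);
            the last column should be fixed to 0 since r_hat_k = 1.
    labels: (n,) array; labels[m] = i means sample m was drawn from p_i.
    Returns the empirical average of -log eta_hat_y(x), where
    eta_hat_i = r_hat_i / sum_j r_hat_j (a softmax over log r_hat).
    """
    # log-softmax, shifted by the row maximum for numerical stability
    z = log_r - log_r.max(axis=1, keepdims=True)
    log_eta = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_eta[np.arange(len(labels)), labels].mean()
```

For instance, with all log-ratios initialized to zero the loss equals logk\log k, the entropy of a uniform guess.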

Multi-LSIF.  When the convex function associated with the Bregman divergence is chosen to be f(r^1,,r^k1)=12i=1k1(r^i1)2=12𝒓^𝟏2f(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\frac{1}{2}\sum_{i=1}^{k-1}(\hat{r}_{i}-1)^{2}=\frac{1}{2}\|\hat{{\bm{r}}}-\bm{1}\|^{2}, the loss in Eq. (14) reduces to:

12i=1k1𝔼pk(x)[r^i2(x)1]i=1k1𝔼pi(x)[r^i(x)1]=12i=1k1𝔼pk(x)[(r^i(x)ri(x))2]C\frac{1}{2}\sum_{i=1}^{k-1}\mathbb{E}_{p_{k}(x)}\left[\hat{r}_{i}^{2}(x)-1\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\hat{r}_{i}(x)-1\right]=\frac{1}{2}\sum_{i=1}^{k-1}\mathbb{E}_{p_{k}(x)}\left[(\hat{r}_{i}(x)-r_{i}(x))^{2}\right]-C (22)

where C=12𝔼pk(x)[𝒓(x)𝟏2]C=\frac{1}{2}\mathbb{E}_{p_{k}(x)}\left[\|{\bm{r}}(x)-\bm{1}\|^{2}\right] is a constant w.r.t. 𝒓^\hat{{\bm{r}}}, and the minimizer of the above loss matches the true density ratios. This strictly generalizes the Least-Squares Importance Fitting (LSIF) (Kanamori et al., 2009) method to the multi-distribution case.
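Given ratio estimates evaluated on samples from the reference distribution pkp_{k} and on samples from each pip_{i}, the empirical version of the left-hand side of Eq. (22) can be computed as in the following sketch (function and argument names are our own illustration):

```python
import numpy as np

def multi_lsif_loss(r_hat_on_pk, r_hat_on_pi):
    """Empirical Multi-LSIF loss (left-hand side of Eq. 22).

    r_hat_on_pk: (n_k, k-1) values of all ratio estimates r_hat_1..r_hat_{k-1}
                 evaluated on samples drawn from the reference p_k.
    r_hat_on_pi: list of length k-1; entry i is a 1-D array holding the
                 values of r_hat_{i+1} on samples drawn from p_{i+1}.
    """
    # 0.5 * sum_i E_{p_k}[r_hat_i^2 - 1], estimated by sample means
    quad = 0.5 * (r_hat_on_pk ** 2 - 1.0).sum(axis=1).mean()
    # sum_i E_{p_i}[r_hat_i - 1], estimated by sample means
    lin = sum((r_i - 1.0).mean() for r_i in r_hat_on_pi)
    return quad - lin
```

When all estimates equal the constant 11 (the correct answer for identical distributions), the loss is exactly zero.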

Besides, we also consider the following simple convex functions that either strictly generalize their binary DRE counterparts as above, or lead to completely new methods for multi-distribution DRE:

  • Multi-KLIEP.  f(r^1,,r^k1)=i=1k1(r^ilogr^ir^i)=𝒓^,log(𝒓^)𝒓^1f(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\sum_{i=1}^{k-1}(\hat{r}_{i}\log\hat{r}_{i}-\hat{r}_{i})=\langle\hat{{\bm{r}}},\log(\hat{{\bm{r}}})\rangle-\|\hat{{\bm{r}}}\|_{1}. This strictly generalizes the Kullback–Leibler Importance Estimation Procedure (KLIEP) (Sugiyama et al., 2008) to the multi-distribution case. See Appendix A.4.3 for more details.

  • Power.  f(r^1,,r^k1)=i=1k1r^iα=𝒓^ααf(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\sum_{i=1}^{k-1}\hat{r}_{i}^{\alpha}=\|\hat{{\bm{r}}}\|_{\alpha}^{\alpha},   for α>1\alpha>1. This strictly generalizes the robust DRE method in (Sugiyama et al., 2012), which recovers Multi-KLIEP when α1\alpha\to 1 and Multi-LSIF when α=2\alpha=2. See Appendix A.4.4 for more details.

  • Quadratic.  f(r^1,,r^k1)=𝒓^𝑯𝒓^+𝒒𝒓^f(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\hat{{\bm{r}}}^{\top}{\bm{H}}\hat{{\bm{r}}}+{\bm{q}}^{\top}\hat{{\bm{r}}}, for any positive definite matrix 𝑯0{\bm{H}}\succ 0. When 𝑯{\bm{H}} is the identity matrix and 𝒒=(2,,2){\bm{q}}=(-2,\ldots,-2), this is equivalent to Multi-LSIF.

  • LogSumExp.  f(r^1,,r^k1)=αlog(i=1k1exp(r^i/α))f(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\alpha\log\left(\sum_{i=1}^{k-1}\exp(\hat{r}_{i}/\alpha)\right) for α>0\alpha>0.

In principle, we can use any desired strictly convex function f:+k1f:{\mathbb{R}}^{k-1}_{+}\to{\mathbb{R}} within the optimization problem in Eq. (14), which highlights the potential of our unified framework for discovering novel objectives for multi-distribution DRE. In terms of modeling flexibility, the curvature of each convex function encodes a different inductive bias that may favor different downstream applications, and we leave the design of convex functions better suited to specific DRE tasks as an exciting avenue for future work.
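To illustrate this plug-in flexibility, the loss in Eq. (14) can be implemented generically once ff and f\nabla f are supplied. The sketch below (all names are our own, hypothetical choices) instantiates it with the Multi-KLIEP function from the list above, for which the loss simplifies to 𝔼pk[ir^i]i𝔼pi[logr^i]\mathbb{E}_{p_{k}}[\sum_{i}\hat{r}_{i}]-\sum_{i}\mathbb{E}_{p_{i}}[\log\hat{r}_{i}].

```python
import numpy as np

def bregman_dre_loss(f, grad_f, r_on_pk, r_on_pi):
    """Generic multi-distribution DRE loss induced by a strictly convex f:
    E_{p_k}[<grad_f(r), r> - f(r)] - sum_i E_{p_i}[d_i f(r)]  (Eq. 14).

    f, grad_f : callables mapping an (n, k-1) array to (n,) / (n, k-1).
    r_on_pk   : (n_k, k-1) ratio estimates on samples from p_k.
    r_on_pi   : list of (n_i, k-1) ratio estimates on samples from each p_i.
    """
    g = grad_f(r_on_pk)
    ref_term = ((g * r_on_pk).sum(axis=1) - f(r_on_pk)).mean()
    data_term = sum(grad_f(r)[:, i].mean() for i, r in enumerate(r_on_pi))
    return ref_term - data_term

# Multi-KLIEP instantiation: f(r) = sum_i (r_i log r_i - r_i), grad f = log r
f_kliep = lambda r: (r * np.log(r) - r).sum(axis=1)
grad_f_kliep = np.log
```

Swapping in a different pair `(f, grad_f)` (e.g. the Power or LogSumExp functions above) yields the corresponding objective with no other code changes.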

5.2 Methods Induced by Proper Scoring Rules Composite with Ψdr\Psi_{\mathrm{dr}}

From Section 4, we know that any strictly proper loss :[k]×Δk\ell:[k]\times\Delta_{k}\to{\mathbb{R}} (or strictly proper scoring rule S(i,𝜼^)=(i,𝜼^)S(i,\hat{{\bm{\eta}}})=-\ell(i,\hat{{\bm{\eta}}})) in conjunction with the link function Ψdr\Psi_{\mathrm{dr}} also induces valid losses for multi-distribution DRE:

min𝒓^:𝒳+k1𝔼D(x,y)[(y,𝜼^(x))]=𝔼xM,y𝜼(x)[(y,Ψdr1(𝒓^(x)))]\min_{\hat{{\bm{r}}}:{\mathcal{X}}\to{\mathbb{R}}^{k-1}_{+}}\mathbb{E}_{D(x,y)}[\ell(y,\hat{{\bm{\eta}}}(x))]=\mathbb{E}_{x\sim M,y\sim{\bm{\eta}}(x)}[\ell(y,\Psi_{\mathrm{dr}}^{-1}(\hat{{\bm{r}}}(x)))] (23)

In this work, we consider using the following classic proper scoring rules (Gneiting & Raftery, 2007), where 𝜼^\hat{{\bm{\eta}}} is parametrized as Ψdr1(𝒓^)\Psi_{\mathrm{dr}}^{-1}(\hat{{\bm{r}}}) (i.e. η^i=πir^i/j=1kπjr^j\hat{\eta}_{i}=\pi_{i}\hat{r}_{i}/\sum_{j=1}^{k}\pi_{j}\hat{r}_{j}):

  • Logarithm score. (Good, 1952) The loss is specified as (i,𝜼^)=log(η^i)\ell(i,\hat{{\bm{\eta}}})=-\log(\hat{\eta}_{i}), which also recovers the loss of multi-class logistic regression in Section 5.1.

  • Brier score. (Brier et al., 1950) The loss is specified as (i,𝜼^)=2η^i+j=1kη^j2+1\ell(i,\hat{{\bm{\eta}}})=-2\hat{\eta}_{i}+\sum_{j=1}^{k}\hat{\eta}_{j}^{2}+1.

  • Logarithm pseudo-spherical score. (Good, 1971; Fujisawa & Eguchi, 2008) The loss is specified as (i,𝜼^)=log(η^iα1(j=1kη^jα)(α1)/α)\ell(i,\hat{{\bm{\eta}}})=-\log\left(\frac{\hat{\eta}_{i}^{\alpha-1}}{(\sum_{j=1}^{k}\hat{\eta}_{j}^{\alpha})^{(\alpha-1)/\alpha}}\right) for α>1\alpha>1.
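For instance, the Brier score composite with Ψdr1\Psi_{\mathrm{dr}}^{-1} can be computed directly from ratio estimates, as in the following sketch (a minimal illustration with hypothetical names, assuming r^k1\hat{r}_{k}\equiv 1 is appended as the last column):

```python
import numpy as np

def brier_dre_loss(r_hat, labels, pi):
    """Brier score composite with the link Psi_dr^{-1} (Section 5.2).

    r_hat : (n, k) ratio estimates, last column fixed to 1 (r_hat_k = 1).
    labels: (n,) source index of each sample (y in [k]).
    pi    : (k,) class prior.
    """
    w = pi * r_hat                           # pi_i * r_hat_i
    eta = w / w.sum(axis=1, keepdims=True)   # eta_hat_i via the link
    n = len(labels)
    # Brier loss: -2 eta_hat_y + sum_j eta_hat_j^2 + 1, averaged over samples
    return (-2 * eta[np.arange(n), labels] + (eta ** 2).sum(axis=1) + 1).mean()
```

With uniform ratios and a uniform prior, every η^i=1/k\hat{\eta}_{i}=1/k and the loss equals 11/k1-1/k.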

6 Experiments

In this section, we verify the validity of our framework, and study and compare the various instantiations introduced in Section 5, on a variety of tasks that all rely on accurate multi-distribution density ratio estimation. In particular, the tasks we consider are density ratio estimation among multiple multivariate Gaussian distributions, anomaly detection on CIFAR-10 (Krizhevsky et al., 2009), multi-target MNIST generation (LeCun et al., 1998) and multi-distribution off-policy policy evaluation. We discuss the basic problem setups, evaluation metrics and experimental results below, and provide more experimental details for each task in Appendix A.5.

Synthetic Data Experiments.  We first apply our methods to estimate density ratios among k=5k=5 multivariate Gaussian distributions with different mean vectors and identity covariance matrices. We conduct experiments for data dimensions ranging from 2 to 50. Since Gaussian distributions have tractable densities, we know the ground-truth density ratio functions, and we calculate the mean absolute error (MAE) between all (k2)k\choose 2 true density ratios and the learned ones:

MAE(𝒓,𝒓^;M(x))=2k(k1)𝔼M(x)[1i<jk|rij(x)r^ij(x)|]\displaystyle\mathrm{MAE}({\bm{r}},\hat{{\bm{r}}};M(x))=\frac{2}{k(k-1)}\mathbb{E}_{M(x)}\left[\sum_{1\leq i<j\leq k}\left|r_{ij}(x)-\hat{r}_{ij}(x)\right|\right]

where the density ratio between pip_{i} and pjp_{j} is recovered as r^i/r^j\hat{r}_{i}/\hat{r}_{j}, as discussed in Section 3.2. We summarize the results in Table 1, which shows that multi-class logistic regression and the Brier score composite with Ψdr\Psi_{\mathrm{dr}} achieve superior performance on this task.
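The metric above can be computed from the learned per-distribution ratios as in this sketch (our own illustrative code, using the convention that column kk holds rk1r_{k}\equiv 1):

```python
import numpy as np

def pairwise_mae(r_true, r_hat):
    """MAE over all (k choose 2) pairwise ratios (the synthetic-data metric).

    r_true, r_hat: (n, k) arrays whose column i holds r_i(x) = p_i(x)/p_k(x)
    on samples x ~ M (so the last column is all ones); the pairwise ratio
    r_ij is recovered as r_i / r_j.
    """
    n, k = r_true.shape
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            total += np.abs(r_true[:, i] / r_true[:, j]
                            - r_hat[:, i] / r_hat[:, j]).mean()
    return 2.0 * total / (k * (k - 1))
```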

OOD Detection on CIFAR-10.  Suppose we have kk different distributions p1,,pkp_{1},\ldots,p_{k}, where pk=i[k1]αipip_{k}=\sum_{i\in[k-1]}\alpha_{i}p_{i} (with i[k1]αi=1\sum_{i\in[k-1]}\alpha_{i}=1 and i,αi>0\forall i,\alpha_{i}>0). For each distribution pip_{i} (ik1i\leq k-1), samples from the mixture distribution pkp_{k} contain both inlier and outlier samples. The goal of this task is to identify the inlier samples within the pool of mixture samples. In particular, we use the learned density ratio r^i\hat{r}_{i} as the score function to retrieve the inlier samples of pip_{i}, since the true density ratio function ri=pi/j[k1]αjpjr_{i}=p_{i}/\sum_{j\in[k-1]}\alpha_{j}p_{j} tends to be larger for samples from pip_{i} and smaller for samples from the other distributions. We report the AUROC averaged over the k1k-1 scoring functions.

Table 1: Mean absolute error for multi-distribution density ratio estimation among five multivariate Gaussian distributions. Results are averaged across three random seeds.
Method Dim=2\mathrm{Dim}=2 Dim=5\mathrm{Dim}=5 Dim=10\mathrm{Dim}=10 Dim=20\mathrm{Dim}=20 Dim=30\mathrm{Dim}=30 Dim=40\mathrm{Dim}=40 Dim=50\mathrm{Dim}=50
Random Init 1.724±0.031.724\pm 0.03 1.723±0.0081.723\pm 0.008 1.728±0.021.728\pm 0.02 1.765±0.0171.765\pm 0.017 1.749±0.0091.749\pm 0.009 1.753±0.0021.753\pm 0.002 1.768±0.0081.768\pm 0.008
Multi-LR 0.044±0.003\bm{0.044}\pm 0.003 0.048±0.005\bm{0.048}\pm 0.005 0.061±0.002\bm{0.061}\pm 0.002 0.07±0.001\bm{0.07}\pm 0.001 0.081±0.002\bm{0.081}\pm 0.002 0.089±0.001\bm{0.089}\pm 0.001 0.098±0.002\bm{0.098}\pm 0.002
Multi-KLIEP 0.051±0.0020.051\pm 0.002 0.066±0.0020.066\pm 0.002 0.074±0.00.074\pm 0.0 0.089±0.0020.089\pm 0.002 0.105±0.0050.105\pm 0.005 0.112±0.0040.112\pm 0.004 0.123±0.0030.123\pm 0.003
Multi-LSIF 0.073±0.0060.073\pm 0.006 0.097±0.0010.097\pm 0.001 0.109±0.0050.109\pm 0.005 0.124±0.0030.124\pm 0.003 0.144±0.0040.144\pm 0.004 0.141±0.0050.141\pm 0.005 0.158±0.0040.158\pm 0.004
Power 0.054±0.0030.054\pm 0.003 0.073±0.0010.073\pm 0.001 0.081±0.0040.081\pm 0.004 0.104±0.0030.104\pm 0.003 0.117±0.0030.117\pm 0.003 0.123±0.0050.123\pm 0.005 0.135±0.0040.135\pm 0.004
Brier 0.042±0.002\bm{0.042}\pm 0.002 0.056±0.003\bm{0.056}\pm 0.003 0.066±0.003\bm{0.066}\pm 0.003 0.081±0.002\bm{0.081}\pm 0.002 0.086±0.002\bm{0.086}\pm 0.002 0.094±0.002\bm{0.094}\pm 0.002 0.105±0.001\bm{0.105}\pm 0.001
Spherical 0.103±0.0070.103\pm 0.007 0.106±0.0060.106\pm 0.006 0.115±0.0040.115\pm 0.004 0.121±0.0050.121\pm 0.005 0.125±0.0060.125\pm 0.006 0.132±0.0030.132\pm 0.003 0.138±0.0110.138\pm 0.011
LogSumExp 0.231±0.0670.231\pm 0.067 0.198±0.0340.198\pm 0.034 0.184±0.0130.184\pm 0.013 0.179±0.0140.179\pm 0.014 0.184±0.0090.184\pm 0.009 0.192±0.010.192\pm 0.01 0.193±0.0030.193\pm 0.003
Quadratic 0.148±0.0330.148\pm 0.033 0.186±0.0280.186\pm 0.028 0.218±0.0110.218\pm 0.011 0.219±0.0180.219\pm 0.018 0.226±0.0180.226\pm 0.018 0.236±0.0230.236\pm 0.023 0.254±0.0140.254\pm 0.014
Table 2: Results for CIFAR-10 OOD detection, MNIST multi-target generation and multi-distribution off-policy policy evaluation error based on learned density ratios. \uparrow means higher is better and \downarrow means lower is better. Results of top 33 methods for each task are bold. Results are averaged across three random seeds.
Method CIFAR-10 OOD (\uparrow) MNIST Generation (\downarrow) Off-policy Evaluation (\downarrow)
Random Init 0.499±0.0170.499\pm 0.017 1.598±0.0631.598\pm 0.063 1377.68±379.761377.68\pm 379.76
Multi-LR 0.854±0.009\bm{0.854}\pm 0.009 0.156±0.014\bm{0.156}\pm 0.014 62.43±12.8762.43\pm 12.87
Multi-KLIEP 0.828±0.0050.828\pm 0.005 0.281±0.0500.281\pm 0.050 110.89±35.33110.89\pm 35.33
Multi-LSIF 0.801±0.0080.801\pm 0.008 0.274±0.0270.274\pm 0.027 71.09±1.1271.09\pm 1.12
Power (α=1.5\alpha=1.5) 0.816±0.0070.816\pm 0.007 0.224±0.0360.224\pm 0.036 53.43±20.73\bm{53.43}\pm 20.73
Brier 0.849±0.010\bm{0.849}\pm 0.010 0.107±0.022\bm{0.107}\pm 0.022 71.21±17.3471.21\pm 17.34
Spherical (α=1.8\alpha=1.8) 0.853±0.010\bm{0.853}\pm 0.010 0.145±0.041\bm{0.145}\pm 0.041 /
LogSumExp (α=5\alpha=5) 0.782±0.0120.782\pm 0.012 / 52.02±9.16\bm{52.02}\pm 9.16
Quadratic 0.804±0.0090.804\pm 0.009 / 55.10±11.92\bm{55.10}\pm 11.92

Multi-target MNIST Generation.  DRE can be used in the sampling-importance-resampling (SIR) paradigm (Liu & Chen, 1998; Doucet et al., 2000). Suppose we want to obtain samples from p1,,pk1p_{1},\ldots,p_{k-1} while we only have a large set of samples from pkp_{k}. For each i[k1]i\in[k-1], we can use the density ratio function r^i\hat{r}_{i} in conjunction with SIR to approximately sample from the target distribution pip_{i} (Algorithm 1 in (Grover et al., 2019)). For this task, we evaluate whether the SIR samples for target distribution pip_{i} contain the correct proportion of classes/properties (the 10 digit classes in MNIST), and we use 1k1i=1k1j=110|hijh^ij|\frac{1}{k-1}\sum_{i=1}^{k-1}\sum_{j=1}^{10}|h_{ij}-\hat{h}_{ij}| as the evaluation metric, where hijh_{ij} and h^ij\hat{h}_{ij} denote the desired and sampled proportions of property jj in target-generation task ii.
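As a concrete illustration of the resampling step, the following sketch draws from the proposal pool in proportion to the learned ratio (names are our own; this is a minimal stand-in for the SIR procedure referenced above, not a reproduction of Algorithm 1 in (Grover et al., 2019)):

```python
import numpy as np

def sir_resample(samples, r_hat_i, n_draw, rng=None):
    """Sampling-importance-resampling (SIR) with a learned density ratio.

    samples : (n, ...) array of draws from the proposal p_k.
    r_hat_i : (n,) learned ratio estimates r_hat_i(x) = p_i(x)/p_k(x).
    n_draw  : number of (approximate) samples from the target p_i to return.
    """
    if rng is None:
        rng = np.random.default_rng()
    w = r_hat_i / r_hat_i.sum()  # self-normalized importance weights
    idx = rng.choice(len(samples), size=n_draw, p=w, replace=True)
    return samples[idx]
```

Proposal samples with larger estimated ratios are resampled more often, so the resampled set is approximately distributed according to the target pip_{i}.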

Multi-distribution Off-policy Policy Evaluation.  Suppose we have kk different reinforcement learning policies pi(a|s)p_{i}(a|s), each inducing an occupancy measure (Syed et al., 2008) (i.e., state-action distribution) ρi(s,a)\rho_{i}(s,a). Density ratios allow us to conduct off-policy policy evaluation, which estimates the expected return (sum of rewards) of target policies p1,,pk1p_{1},\ldots,p_{k-1} given trajectories sampled from a source policy pkp_{k}. In this case, we evaluate the following metric to assess the quality of the learned density ratios (τ={(st,at)}t=1T\tau=\{(s_{t},a_{t})\}_{t=1}^{T} denotes a sequence of state-action pairs):

1k1i=1k1|𝔼pk(τ)[t=1Tr^i(st,at)r(st,at)]𝔼pi(τ)[t=1Tr(st,at)]|\displaystyle\frac{1}{k-1}\sum_{i=1}^{k-1}\left|\mathbb{E}_{p_{k}(\tau)}\left[\sum_{t=1}^{T}\hat{r}_{i}(s_{t},a_{t})r(s_{t},a_{t})\right]-\mathbb{E}_{p_{i}(\tau)}\left[\sum_{t=1}^{T}r(s_{t},a_{t})\right]\right|
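The first expectation in this metric, the importance-weighted return, can be estimated from source-policy trajectories as in the following sketch (all names are hypothetical; the learned occupancy ratio r^i\hat{r}_{i} and the reward function are assumed given):

```python
import numpy as np

def weighted_return(trajectories, r_hat_i, reward_fn):
    """Importance-weighted return estimate for target policy p_i using
    trajectories drawn from the source policy p_k, i.e. an estimate of
    E_{p_k(tau)}[sum_t r_hat_i(s_t, a_t) * r(s_t, a_t)].

    trajectories: list of trajectories; each is a list of (s, a) pairs.
    r_hat_i     : callable, learned occupancy-ratio estimate r_hat_i(s, a).
    reward_fn   : callable, reward r(s, a).
    """
    returns = [sum(r_hat_i(s, a) * reward_fn(s, a) for (s, a) in tau)
               for tau in trajectories]
    return float(np.mean(returns))
```

The metric above then compares this estimate against the Monte-Carlo return computed from on-policy trajectories of each target policy.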

We summarize the results for CIFAR-10 OOD detection, multi-target MNIST generation and multi-distribution off-policy policy evaluation in Table 2 (omitted results indicate that the corresponding method performs worse than the listed methods by a large margin on that task). Methods induced by proper scoring rules, such as multi-class logistic regression, the Brier score and the pseudo-spherical score, tend to achieve the best performance on the first two tasks. Surprisingly, methods induced by simple multivariate convex functions such as LogSumExp and the quadratic function show excellent performance on the third task. These results demonstrate an advantage of our framework: it offers considerable flexibility for designing new multi-distribution DRE objectives suited to various downstream applications.

7 Conclusion

In this paper, we focus on the generalized problem of efficiently estimating density ratios among multiple distributions. We propose a general framework based on expected Bregman divergence minimization, where each strictly convex function induces a proper loss for multi-distribution DRE. Furthermore, we rederive the theoretical equivalence between the losses of class probability estimation and density ratio estimation, which justifies the use of any strictly proper scoring rule for multi-distribution DRE. Finally, we demonstrate the effectiveness of our framework on various downstream tasks.

References

  • Aharoni et al. (2019) Roee Aharoni, Melvin Johnson, and Orhan Firat. Massively multilingual neural machine translation. arXiv preprint arXiv:1903.00089, 2019.
  • Bickel et al. (2008) Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, and Tobias Scheffer. Multi-task learning for hiv therapy screening. In Proceedings of the 25th international conference on Machine learning, pp.  56–63, 2008.
  • Brier et al. (1950) Glenn W Brier et al. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Cao et al. (2019) Jiezhang Cao, Langyuan Mo, Yifan Zhang, Kui Jia, Chunhua Shen, and Mingkui Tan. Multi-marginal wasserstein gan. Advances in Neural Information Processing Systems, 32:1776–1786, 2019.
  • Cappé et al. (2004) Olivier Cappé, Arnaud Guillin, Jean-Michel Marin, and Christian P Robert. Population monte carlo. Journal of Computational and Graphical Statistics, 13(4):907–929, 2004.
  • Choi et al. (2021) Kristy Choi, Madeline Liao, and Stefano Ermon. Featurized density ratio estimation. arXiv preprint arXiv:2107.02212, 2021.
  • DeGroot (1962) Morris H DeGroot. Uncertainty, information, and sequential experiments. The Annals of Mathematical Statistics, 33(2):404–419, 1962.
  • Dinh et al. (2013) Cuong V Dinh, Robert PW Duin, Ignacio Piqueras-Salazar, and Marco Loog. Fidos: A generalized fisher based feature extraction method for domain shift. Pattern Recognition, 46(9):2510–2518, 2013.
  • Dong et al. (2015) Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  1723–1732, 2015.
  • Doucet et al. (2000) Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential monte carlo sampling methods for bayesian filtering. Statistics and computing, 10(3):197–208, 2000.
  • Duchi et al. (2018) John Duchi, Khashayar Khosravi, and Feng Ruan. Multiclass classification, information, divergence and surrogate risk. The Annals of Statistics, 46(6B):3246–3275, 2018.
  • Elvira et al. (2015) Víctor Elvira, Luca Martino, David Luengo, and Mónica F Bugallo. Efficient multiple importance sampling estimators. IEEE Signal Processing Letters, 22(10):1757–1761, 2015.
  • Elvira et al. (2019) Víctor Elvira, Luca Martino, David Luengo, and Mónica F Bugallo. Generalized multiple importance sampling. Statistical Science, 34(1):129–155, 2019.
  • Franklin (2005) James Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.
  • Fujisawa & Eguchi (2008) Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9):2053–2081, 2008.
  • Garcia-Garcia & Williamson (2012) Dario Garcia-Garcia and Robert C Williamson. Divergences and risks for multiclass experiments. In Conference on Learning Theory, pp.  28–1. JMLR Workshop and Conference Proceedings, 2012.
  • Gneiting & Raftery (2007) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359–378, 2007.
  • Good (1952) IJ Good. Rational decisions. Journal of the Royal Statistical Society, pp.  107–114, 1952.
  • Good (1971) IJ Good. Comment on “measuring information and uncertainty” by robert j. buehler. Foundations of Statistical Inference, pp.  337–339, 1971.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • Grover et al. (2019) Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, Eric Horvitz, and Stefano Ermon. Bias correction of learned generative models using likelihood-free importance weighting. arXiv preprint arXiv:1906.09531, 2019.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. PMLR, 2018.
  • Hido et al. (2008) Shohei Hido, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi Kanamori. Inlier-based outlier detection via direct density ratio estimation. In 2008 Eighth IEEE international conference on data mining, pp.  223–232. IEEE, 2008.
  • Hido et al. (2011) Shohei Hido, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi Kanamori. Statistical outlier detection using direct density ratio estimation. Knowledge and information systems, 26(2):309–336, 2011.
  • Hjelm et al. (2018) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
  • Huang et al. (2006) Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. Advances in neural information processing systems, 19:601–608, 2006.
  • Kallus et al. (2021) Nathan Kallus, Yuta Saito, and Masatoshi Uehara. Optimal off-policy evaluation from multiple logging policies. In International Conference on Machine Learning, pp. 5247–5256. PMLR, 2021.
  • Kanamori et al. (2009) Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. The Journal of Machine Learning Research, 10:1391–1445, 2009.
  • Kanamori et al. (2012) Takafumi Kanamori, Taiji Suzuki, and Masashi Sugiyama. Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3):335–367, 2012.
  • Kato & Teshima (2021) Masahiro Kato and Takeshi Teshima. Non-negative bregman divergence minimization for deep direct density ratio estimation. In International Conference on Machine Learning, pp. 5320–5333. PMLR, 2021.
  • Kpotufe (2017) Samory Kpotufe. Lipschitz density-ratios, structured data, and data-driven tuning. In Artificial Intelligence and Statistics, pp.  1320–1328. PMLR, 2017.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Liu & Chen (1998) Jun S Liu and Rong Chen. Sequential monte carlo methods for dynamic systems. Journal of the American statistical association, 93(443):1032–1044, 1998.
  • Liu et al. (2017) Song Liu, Akiko Takeda, Taiji Suzuki, and Kenji Fukumizu. Trimmed density ratio estimation. arXiv preprint arXiv:1703.03216, 2017.
  • Menon & Ong (2016) Aditya Menon and Cheng Soon Ong. Linking losses for density ratio and class-probability estimation. In International Conference on Machine Learning, pp. 304–313. PMLR, 2016.
  • Nguyen et al. (2007) XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In NIPS, pp.  1089–1096, 2007.
  • Nock et al. (2016) Richard Nock, Aditya Menon, and Cheng Soon Ong. A scaled bregman theorem with applications. Advances in Neural Information Processing Systems, 29:19–27, 2016.
  • Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp.  271–279, 2016.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Owen & Zhou (2000) Art Owen and Yi Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135–143, 2000.
  • Que & Belkin (2013) Qichao Que and Mikhail Belkin. Inverse density as an inverse problem: The fredholm equation approach. arXiv preprint arXiv:1304.5575, 2013.
  • Rhodes et al. (2020) Benjamin Rhodes, Kai Xu, and Michael U Gutmann. Telescoping density-ratio estimation. arXiv preprint arXiv:2006.12204, 2020.
  • Sibson (1969) Robin Sibson. Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 14(2):149–160, 1969.
  • Smola et al. (2009) Alex Smola, Le Song, and Choon Hui Teo. Relative novelty detection. In Artificial Intelligence and Statistics, pp.  536–543. PMLR, 2009.
  • Sugiyama et al. (2007) Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul Von Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, volume 7, pp.  1433–1440. Citeseer, 2007.
  • Sugiyama et al. (2008) Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
  • Sugiyama et al. (2011) Masashi Sugiyama, Taiji Suzuki, Yuta Itoh, Takafumi Kanamori, and Manabu Kimura. Least-squares two-sample test. Neural networks, 24(7):735–751, 2011.
  • Sugiyama et al. (2012) Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density-ratio matching under the bregman divergence: a unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
  • Syed et al. (2008) Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp.  1032–1039, 2008.
  • Uehara et al. (2016) Masatoshi Uehara, Issei Sato, Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
  • Vernet et al. (2011) Elodie Vernet, Robert Williamson, Mark Reid, et al. Composite multiclass losses. 2011.

Appendix A Proofs

A.1 Proofs for Section 2

Theorem 1 (restated).

Proof.

For completeness, we provide the proof here. First, we can check that L¯(P):Δk\underline{L}(P):\Delta_{k}\to{\mathbb{R}} is a concave function. Define 𝐋(P)\mathbf{L}(P) to be the vector ((1,P),,(k,P))\left(\ell(1,P),\ldots,\ell(k,P)\right). Then the entropy function can be represented as L¯(P)=L(P,P)=𝔼yP[(y,P)]=P𝐋(P)\underline{L}(P)=L(P,P)=\mathbb{E}_{y\sim P}[\ell(y,P)]=P^{\top}\mathbf{L}(P) and similarly L(P,Q)=P𝐋(Q)L(P,Q)=P^{\top}\mathbf{L}(Q). For λ[0,1]\lambda\in[0,1] and P,QΔkP,Q\in\Delta_{k}, we have:

L¯(λP+(1λ)Q)\displaystyle\underline{L}(\lambda P+(1-\lambda)Q) =(λP+(1λ)Q)𝐋(λP+(1λ)Q)\displaystyle=(\lambda P+(1-\lambda)Q)^{\top}\mathbf{L}(\lambda P+(1-\lambda)Q)
=λP𝐋(λP+(1λ)Q)+(1λ)Q𝐋(λP+(1λ)Q)\displaystyle=\lambda P^{\top}\mathbf{L}(\lambda P+(1-\lambda)Q)+(1-\lambda)Q^{\top}\mathbf{L}(\lambda P+(1-\lambda)Q)
λP𝐋(P)+(1λ)Q𝐋(Q)=λL¯(P)+(1λ)L¯(Q)\displaystyle\geq\lambda P^{\top}\mathbf{L}(P)+(1-\lambda)Q^{\top}\mathbf{L}(Q)=\lambda\underline{L}(P)+(1-\lambda)\underline{L}(Q)

where the inequality is because \ell is proper. Thus L¯\underline{L} is a concave function. Next we show that the excess risk is a Bregman divergence with convex function L¯-\underline{L}. First, observe that

L(P,Q)=P𝐋(Q)=Q𝐋(Q)+(PQ)𝐋(Q)\displaystyle L(P,Q)=P^{\top}{\mathbf{L}}(Q)=Q^{\top}{\mathbf{L}}(Q)+(P-Q)^{\top}{\mathbf{L}}(Q)

Because \ell is proper, we have:

0L(P,Q)L(P,P)\displaystyle 0\leq L(P,Q)-L(P,P) =Q𝐋(Q)+(PQ)𝐋(Q)P𝐋(P)\displaystyle=Q^{\top}{\mathbf{L}}(Q)+(P-Q)^{\top}{\mathbf{L}}(Q)-P^{\top}{\mathbf{L}}(P)
=L¯(P)(L¯(Q))(PQ)(𝐋(Q))\displaystyle=-\underline{L}(P)-(-\underline{L}(Q))-(P-Q)^{\top}(-{\mathbf{L}}(Q))

Rearranging terms, we get L¯(P)(L¯(Q))+(𝐋(Q))(PQ)-\underline{L}(P)\geq(-\underline{L}(Q))+(-{\mathbf{L}}(Q))^{\top}(P-Q), and therefore 𝐋(Q)-{\mathbf{L}}(Q) is a subderivative of L¯-\underline{L}. When L¯-\underline{L} is differentiable, its subdifferential contains exactly one subderivative, so 𝐋(Q)=(L¯(Q))-{\mathbf{L}}(Q)=\nabla(-\underline{L}(Q)). Therefore, we have reg(P,Q)=L(P,Q)L(P,P)=f(P)f(Q)f(Q),PQ=𝐁f(P,Q)\mathrm{reg}(P,Q)=L(P,Q)-L(P,P)=f(P)-f(Q)-\langle\nabla f(Q),P-Q\rangle=\mathbf{B}_{f}(P,Q) with f=L¯f=-\underline{L}. ∎
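The identity reg(P,Q)=𝐁f(P,Q)\mathrm{reg}(P,Q)=\mathbf{B}_{f}(P,Q) with f=L¯f=-\underline{L} is easy to verify numerically for a concrete proper loss. The snippet below (our own sanity check, not part of the paper) uses the log loss (y,Q)=logQy\ell(y,Q)=-\log Q_{y}, for which L¯\underline{L} is the Shannon entropy and the excess risk is the KL divergence:

```python
import numpy as np

# Numeric check: for the log loss l(y, Q) = -log Q_y, the excess risk
# L(P, Q) - L(P, P) = KL(P || Q) equals the Bregman divergence B_f(P, Q)
# with f = -L_bar, where L_bar(P) = -sum_i P_i log P_i (Shannon entropy).
f = lambda p: np.sum(p * np.log(p))            # f = -L_bar (negative entropy)
grad_f = lambda p: np.log(p) + 1.0
bregman = lambda p, q: f(p) - f(q) - np.dot(grad_f(q), p - q)

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])
excess_risk = np.dot(P, -np.log(Q)) - np.dot(P, -np.log(P))  # KL(P || Q)
assert np.isclose(excess_risk, bregman(P, Q))
```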

A.2 Proofs for Section 3.2

Proposition 1 (restated).

Proof of Proposition 1.

We first recall that the optimization problem for multi-distribution DRE is of the form

min𝒓^:𝒳+k1𝔼pk(x)[f(𝒓^(x)),𝒓^(x)f(𝒓^(x))]i[k1]𝔼pi(x)[if(𝒓^(x))]\displaystyle\min_{\hat{{\bm{r}}}:{\mathcal{X}}\to{\mathbb{R}}^{k-1}_{+}}\mathbb{E}_{p_{k}(x)}\left[\langle\nabla f(\hat{{\bm{r}}}(x)),\hat{{\bm{r}}}(x)\rangle-f(\hat{{\bm{r}}}(x))\right]-\sum_{i\in[k-1]}\mathbb{E}_{p_{i}(x)}[\partial_{i}f(\hat{{\bm{r}}}(x))] (24)

and one can use the Fenchel-dual convex conjugate to represent the ff-divergence as

𝐃f(P1,,Pk1||Pk)=min𝒔:𝒳k1[i[k1]𝔼pi(x)[𝒔(x)]i+𝔼pk(x)f(𝒔(x))]\mathbf{D}_{f}(P_{1},\cdots,P_{k-1}||P_{k})=-\min_{{\bm{s}}:{\mathcal{X}}\rightarrow\mathbb{R}^{k-1}}\left[-\sum_{i\in[k-1]}\mathbb{E}_{p_{i}(x)}[{\bm{s}}(x)]_{i}+\mathbb{E}_{p_{k}(x)}f^{*}({\bm{s}}(x))\right] (25)

By first-order optimality condition of convex functions, for any x𝒳x\in{\mathcal{X}} the optimal solution 𝒔¯(x)\overline{{\bm{s}}}(x) for Eq. (25) satisfies that

i[k1],x𝒳,pi(x)if(𝒔¯(x))pk(x)=0pi(x)pk(x)=if(𝒔¯(x))\forall i\in[k-1],x\in{\mathcal{X}},~{}~{}p_{i}(x)-\partial_{i}f^{*}(\overline{{\bm{s}}}(x))p_{k}(x)=0\Longleftrightarrow\frac{p_{i}(x)}{p_{k}(x)}=\partial_{i}f^{*}(\overline{{\bm{s}}}(x))

Therefore 𝒓¯(x)=f(𝒔¯(x))\overline{{\bm{r}}}(x)=\nabla f^{*}(\overline{{\bm{s}}}(x)) recovers the true density ratios.

Now we show that under change of variable 𝒔(x)=f(𝒓^(x)){\bm{s}}(x)=\nabla f(\hat{{\bm{r}}}(x)), one can write the problem in Eq. (25) equivalently as the one in Eq. (24). First due to the property of the convex conjugate function (f=ff^{**}=f), we have:

f(𝒔(x))=max𝒉(x)k1𝒔(x),𝒉(x)f(𝒉(x))f^{*}({\bm{s}}(x))=\max_{{\bm{h}}(x)\in{\mathbb{R}}^{k-1}}\langle{\bm{s}}(x),{\bm{h}}(x)\rangle-f({\bm{h}}(x))

Substituting 𝒔(x){\bm{s}}(x) with f(𝒓^(x))\nabla f(\hat{{\bm{r}}}(x)), we have:

f(𝒔(x))=max𝒉(x)k1f(𝒓^(x)),𝒉(x)f(𝒉(x))f^{*}({\bm{s}}(x))=\max_{{\bm{h}}(x)\in{\mathbb{R}}^{k-1}}\langle\nabla f(\hat{{\bm{r}}}(x)),{\bm{h}}(x)\rangle-f({\bm{h}}(x)) (26)

Setting the derivative w.r.t. 𝒉{\bm{h}} to zero and using the strict convexity of ff (f(𝒂)=f(𝒃)𝒂=𝒃\nabla f({\bm{a}})=\nabla f({\bm{b}})\Leftrightarrow{\bm{a}}={\bm{b}}), we know that the maximum of Eq. (26) is attained at 𝒉¯(x)=𝒓^(x)\overline{{\bm{h}}}(x)=\hat{{\bm{r}}}(x). Thus we have:

f(𝒔(x))=f(𝒓^(x)),𝒓^(x)f(𝒓^(x))f^{*}({\bm{s}}(x))=\langle\nabla f(\hat{{\bm{r}}}(x)),\hat{{\bm{r}}}(x)\rangle-f(\hat{{\bm{r}}}(x)) (27)

Plugging Eq. (27) and 𝒔(x)=f(𝒓^(x)){\bm{s}}(x)=\nabla f(\hat{{\bm{r}}}(x)) back into the optimization problem in Eq. (25), we obtain the following equivalent problem (the leading negative sign in Eq. (25) changes the optimal value but not the minimizer):

min𝒓^:𝒳+k1𝔼pk(x)[f(𝒓^(x)),𝒓^(x)f(𝒓^(x))]i[k1]𝔼pi(x)if(𝒓^(x)),\min_{\hat{{\bm{r}}}:{\mathcal{X}}\to{\mathbb{R}}^{k-1}_{+}}\mathbb{E}_{p_{k}(x)}\left[\langle\nabla f(\hat{{\bm{r}}}(x)),\hat{{\bm{r}}}(x)\rangle-f(\hat{{\bm{r}}}(x))\right]-\sum_{i\in[k-1]}\mathbb{E}_{p_{i}(x)}\partial_{i}f(\hat{{\bm{r}}}(x)),

which is the same as the problem in Eq. (24). ∎
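The conjugate identity in Eq. (27) is easy to verify numerically for a concrete choice of ff. The snippet below (our own sanity check, not part of the paper) uses the Multi-LSIF-style quadratic f(𝒓)=12𝒓𝟏2f({\bm{r}})=\frac{1}{2}\|{\bm{r}}-\bm{1}\|^{2}, whose conjugate f(𝒔)=12𝒔2+isif^{*}({\bm{s}})=\frac{1}{2}\|{\bm{s}}\|^{2}+\sum_{i}s_{i} is available in closed form:

```python
import numpy as np

# Numeric check of Eq. (27): f*(grad_f(r)) = <grad_f(r), r> - f(r),
# for the quadratic choice f(r) = 0.5 * ||r - 1||^2, with
# grad_f(r) = r - 1 and conjugate f*(s) = 0.5 * ||s||^2 + sum_i s_i.
f = lambda r: 0.5 * np.sum((r - 1.0) ** 2)
grad_f = lambda r: r - 1.0
f_conj = lambda s: 0.5 * np.sum(s ** 2) + np.sum(s)

r = np.array([0.5, 2.0, 1.3])
s = grad_f(r)
assert np.isclose(f_conj(s), np.dot(s, r) - f(r))
```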

A.3 Proofs for Section 4

Lemma 1 (restated).

Proof of Lemma 1.

For simplicity of notation, we let uk=vk=1u_{k}=v_{k}=1 for arbitrary u,vk1u,v\in\mathbb{R}^{k-1}. We first prove the convexity of ff^{\circledast} by definition. Given any two points u,vk1u,v\in\mathbb{R}^{k-1} and θ[0,1]\theta\in[0,1], one has

f(θ𝒖+(1θ)𝒗)\displaystyle f^{\circledast}(\theta{\bm{u}}+(1-\theta){\bm{v}})
=\displaystyle= (i[k](θui+(1θ)vi))f(1i[k](θui+(1θ)vi)(θ𝒖+(1θ)𝒗))\displaystyle\left(\sum_{i\in[k]}\left(\theta u_{i}+(1-\theta)v_{i}\right)\right)\cdot f\left(\frac{1}{\sum_{i\in[k]}(\theta u_{i}+(1-\theta)v_{i})}\cdot(\theta{\bm{u}}+(1-\theta){\bm{v}})\right)
=\displaystyle= (θi[k]ui+(1θ)i[k]vi)f(1θi[k]ui+(1θ)i[k]vi(θ𝒖+(1θ)𝒗))\displaystyle\left(\theta\sum_{i\in[k]}u_{i}+(1-\theta)\sum_{i\in[k]}v_{i}\right)\cdot f\left(\frac{1}{\theta\sum_{i\in[k]}u_{i}+(1-\theta)\sum_{i\in[k]}v_{i}}\cdot(\theta{\bm{u}}+(1-\theta){\bm{v}})\right)
()\displaystyle\stackrel{{\scriptstyle(\star)}}{{\leq}} θ(i[k]ui)f(1i[k]ui𝒖)+(1θ)(i[k]vi)f(1i[k]vi𝒗)\displaystyle\theta\left(\sum_{i\in[k]}u_{i}\right)f\left(\frac{1}{\sum_{i\in[k]}u_{i}}{\bm{u}}\right)+(1-\theta)\left(\sum_{i\in[k]}v_{i}\right)f\left(\frac{1}{\sum_{i\in[k]}v_{i}}{\bm{v}}\right)
=\displaystyle= θf(𝒖)+(1θ)f(𝒗).\displaystyle\theta f^{\circledast}({\bm{u}})+(1-\theta)f^{\circledast}({\bm{v}}).

Here for inequality ()(\star) we use the fact that for any convex function g:ng:\mathbb{R}^{n}\rightarrow\mathbb{R}, the perspective function h(t,x):=tg(x/t)h(t,x)\mathrel{\mathop{:}}=tg(x/t) is jointly convex in (t,x)(t,x) for t>0t>0.

Now, to see that the identity holds, note that we can write

LHS=\displaystyle\mathrm{LHS}= f(1i[k]ui𝒖)f(1i[k]vi𝒗)\displaystyle f\left(\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}\right)-f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)
f(1i[k]vi𝒗),1i[k]ui𝒖1i[k]vi𝒗\displaystyle-\left\langle\nabla f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right),\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}-\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right\rangle

and that

RHS=\displaystyle\mathrm{RHS}= f(1i[k]ui𝒖)i[k]vii[k]uif(1i[k]vi𝒗)1i[k]uif(𝒗),𝒖𝒗\displaystyle f\left(\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}\right)-\frac{\sum_{i\in[k]}v_{i}}{\sum_{i\in[k]}u_{i}}\cdot f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)-\frac{1}{\sum_{i\in[k]}u_{i}}\left\langle\nabla f^{\circledast}\left({\bm{v}}\right),{\bm{u}}-{\bm{v}}\right\rangle
=(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{=}} f(1i[k]ui𝒖)i[k]vii[k]uif(1i[k]vi𝒗)\displaystyle f\left(\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}\right)-\frac{\sum_{i\in[k]}v_{i}}{\sum_{i\in[k]}u_{i}}\cdot f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)
1i[k]uif(1i[k]vi𝒗)𝟏+(𝐈1i[k]vi𝟏𝒗)f(1i[k]vi𝒗),𝒖𝒗\displaystyle-\frac{1}{\sum_{i\in[k]}u_{i}}\left\langle f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)\bm{1}+\left(\mathbf{I}-\frac{1}{\sum_{i\in[k]}v_{i}}\bm{1}{\bm{v}}^{\top}\right)\nabla f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right),{\bm{u}}-{\bm{v}}\right\rangle
=(ii)\displaystyle\stackrel{{\scriptstyle(ii)}}{{=}} f(1i[k]ui𝒖)[i[k]vii[k]ui+i[k](uivi)i[k]ui]f(1i[k]vi𝒗)f\left(\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}\right)-\left[\frac{\sum_{i\in[k]}v_{i}}{\sum_{i\in[k]}u_{i}}+\frac{\sum_{i\in[k]}(u_{i}-v_{i})}{\sum_{i\in[k]}u_{i}}\right]\cdot f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)
1i[k]uif(1i[k]vi𝒗),(𝐈1i[k]vi𝒗𝟏)(𝒖𝒗)-\frac{1}{\sum_{i\in[k]}u_{i}}\left\langle\nabla f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right),\left(\mathbf{I}-\frac{1}{\sum_{i\in[k]}v_{i}}{\bm{v}}\bm{1}^{\top}\right)({\bm{u}}-{\bm{v}})\right\rangle
=\displaystyle= f(1i[k]ui𝒖)f(1i[k]vi𝒗)f\left(\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}\right)-f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)
f(1i[k]vi𝒗),1i[k]ui𝒖1i[k]ui𝒗i[k](uivi)(i[k]ui)(i[k]vi)𝒗-\left\langle\nabla f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right),\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}-\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{v}}-\frac{\sum_{i\in[k]}(u_{i}-v_{i})}{\left(\sum_{i\in[k]}u_{i}\right)\left(\sum_{i\in[k]}v_{i}\right)}\cdot{\bm{v}}\right\rangle
=\displaystyle= f(1i[k]ui𝒖)f(1i[k]vi𝒗)f\left(\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}\right)-f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)
f(1i[k]vi𝒗),1i[k]ui𝒖1i[k]vi𝒗,-\left\langle\nabla f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right),\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}-\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right\rangle,

where we use (i)(i) the gradient formula f(𝒗)=f(1i[k]vi𝒗)𝟏+(𝐈1i[k]vi𝟏𝒗)f(1i[k]vi𝒗)\nabla f^{\circledast}({\bm{v}})=f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)\bm{1}+\left(\mathbf{I}-\frac{1}{\sum_{i\in[k]}v_{i}}\bm{1}{\bm{v}}^{\top}\right)\nabla f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right), which follows from the definition of ff^{\circledast}, and (ii)(ii) rearranging terms together with the adjoint identity 𝑨𝒙,𝒚=𝒙,𝑨𝒚\langle{\bm{A}}{\bm{x}},{\bm{y}}\rangle=\langle{\bm{x}},{\bm{A}}^{\top}{\bm{y}}\rangle.

Thus, we have shown that LHS=RHS\mathrm{LHS}=\mathrm{RHS}, which concludes the proof. ∎
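As a quick numerical sanity check of Lemma 1, consider a sketch under the assumption that ff is the negative entropy f(𝒑)=ipilogpif({\bm{p}})=\sum_{i}p_{i}\log p_{i} on the simplex, for which the gradient formula above simplifies to f(𝒗)=log(𝒗/ivi)\nabla f^{\circledast}({\bm{v}})=\log({\bm{v}}/\sum_{i}v_{i}):

```python
import numpy as np

# Check Lemma 1 for k = 3 with f = negative entropy, u_k = v_k = 1 fixed.
f = lambda p: np.sum(p * np.log(p))

def f_star(u):                       # the extension f^circledast of f
    return np.sum(u) * f(u / np.sum(u))

u = np.array([0.7, 1.9, 1.0])        # last coordinate fixed to 1
v = np.array([2.2, 0.5, 1.0])

# For negative entropy, grad f_star(v) = log(v / sum(v)).
grad_fs_v = np.log(v / np.sum(v))
B_fs = f_star(u) - f_star(v) - grad_fs_v @ (u - v)

# Bregman divergence of f between the normalized (simplex) points
ub, vb = u / np.sum(u), v / np.sum(v)
B_f = f(ub) - f(vb) - (np.log(vb) + 1.0) @ (ub - vb)

# Lemma 1: B_{f_star}(u, v) = (sum_i u_i) * B_f(u / sum u, v / sum v)
assert np.isclose(B_fs, np.sum(u) * B_f)
```

For this choice both sides reduce to a scaled KL divergence, so the assertion holds exactly.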

Proposition 2 (restated).

Proof of Proposition 2.

Given any x𝒳x\in{\mathcal{X}}, the equality follows by applying Lemma 1 with 𝒖=1πk𝝅[1:k1]𝒓(x){\bm{u}}=\frac{1}{\pi_{k}}{\bm{\pi}}_{[1:k-1]}\circ{\bm{r}}(x) and 𝒗=1πk𝝅[1:k1]𝒓^(x){\bm{v}}=\frac{1}{\pi_{k}}{\bm{\pi}}_{[1:k-1]}\circ\hat{{\bm{r}}}(x), where \circ denotes element-wise multiplication. To see why this is true, note that by the definitions of ηi(x)\eta_{i}(x) and η^i(x)\hat{\eta}_{i}(x) we have

𝜼(x)=𝝅𝒓(x)πk+j[k1]πjrj(x)=11+i[k1]ui𝒖,and similarly𝜼^(𝒙)=11+i[k1]vi𝒗.{\bm{\eta}}(x)=\frac{{\bm{\pi}}\circ{\bm{r}}(x)}{\pi_{k}+\sum_{j\in[k-1]}\pi_{j}r_{j}(x)}=\frac{1}{1+\sum_{i\in[k-1]}u_{i}}\cdot{\bm{u}},~{}\text{and similarly}~{}\hat{{\bm{\eta}}}({\bm{x}})=\frac{1}{1+\sum_{i\in[k-1]}v_{i}}\cdot{\bm{v}}.

Consequently applying Lemma 1 implies that

𝐁f(𝜼(x),𝜼^(x))=11+i[k1]ui𝐁f(𝒖,𝒗)\mathbf{B}_{f}({\bm{\eta}}(x),\hat{{\bm{\eta}}}(x))=\frac{1}{1+\sum_{i\in[k-1]}u_{i}}\mathbf{B}_{f^{\circledast}}({\bm{u}},{\bm{v}}) (28)

Note that given any convex function ff^{\circledast}, we consider its composition with a linear map:

fπ(𝒓)=f(1πk𝝅[1:k1]𝒓)=f(𝒖).f^{\circledast}_{\pi}({\bm{r}})=f^{\circledast}\left(\frac{1}{\pi_{k}}{\bm{\pi}}_{[1:k-1]}\circ{\bm{r}}\right)=f^{\circledast}({\bm{u}}).

We note that composition with a linear map preserves convexity and leaves the Bregman divergence unchanged, i.e., we have

𝐁f(𝒖,𝒗)=f(𝒖)f(𝒗)f(𝒗),𝒖𝒗\displaystyle\mathbf{B}_{f^{\circledast}}({\bm{u}},{\bm{v}})=f^{\circledast}({\bm{u}})-f^{\circledast}({\bm{v}})-\langle\nabla f^{\circledast}({\bm{v}}),{\bm{u}}-{\bm{v}}\rangle (29)
=\displaystyle= f(1πk𝝅1:k1𝒓)f(1πk𝝅1:k1𝒓^)f(1πk𝝅1:k1𝒓^),1πk𝝅1:k1(𝒓𝒓^)\displaystyle f^{\circledast}\left(\frac{1}{\pi_{k}}{\bm{\pi}}_{1:k-1}\circ{\bm{r}}\right)-f^{\circledast}\left(\frac{1}{\pi_{k}}{\bm{\pi}}_{1:k-1}\circ\hat{{\bm{r}}}\right)-\left\langle\nabla f^{\circledast}\left(\frac{1}{\pi_{k}}{\bm{\pi}}_{1:k-1}\circ\hat{{\bm{r}}}\right),\frac{1}{\pi_{k}}{\bm{\pi}}_{1:k-1}\circ({\bm{r}}-\hat{{\bm{r}}})\right\rangle
=()\displaystyle\stackrel{{\scriptstyle(\star)}}{{=}} fπ(𝒓)fπ(𝒓^)fπ(𝒓^),𝒓𝒓^=𝐁fπ(𝒓,𝒓^),f^{\circledast}_{\pi}({\bm{r}})-f^{\circledast}_{\pi}(\hat{{\bm{r}}})-\langle\nabla f^{\circledast}_{\pi}(\hat{{\bm{r}}}),{\bm{r}}-\hat{{\bm{r}}}\rangle=\mathbf{B}_{f^{\circledast}_{\pi}}({\bm{r}},\hat{{\bm{r}}}),

where for equality ()(\star) we use the chain rule for differentiating the linear composite mapping. Combining Equations (28) and (29) and substituting 𝒖=1πk𝝅[1:k1]𝒓{\bm{u}}=\frac{1}{\pi_{k}}{\bm{\pi}}_{[1:k-1]}\circ{\bm{r}} gives the desired result. ∎

Theorem 2 (restated).

Proof of Theorem 2.

Given the multi-class classification regret under a proper loss \ell in Eq. (4) and Proposition 2, we have:

reg(𝜼^;M,𝜼,)\displaystyle\mathrm{reg}(\hat{{\bm{\eta}}};M,{\bm{\eta}},\ell) :=CPE(𝜼^;D)CPE(𝜼;D)=𝔼M(x)[𝐁f(𝜼(x),𝜼^(x))]\displaystyle\mathrel{\mathop{:}}={\mathcal{L}}_{\text{CPE}}(\hat{{\bm{\eta}}};D)-{\mathcal{L}}_{\text{CPE}}({\bm{\eta}};D)=\mathbb{E}_{M(x)}[\mathbf{B}_{f}({\bm{\eta}}(x),\hat{{\bm{\eta}}}(x))]
=(i)i[k]πi𝔼pi(x)𝐁f(𝜼(x),𝜼^(x))=𝔼pk(x)(i[k]πipi(x)pk(x)𝐁f(𝜼(x),𝜼^(x)))\displaystyle\stackrel{{\scriptstyle(i)}}{{=}}\sum_{i\in[k]}\pi_{i}\mathbb{E}_{p_{i}(x)}\mathbf{B}_{f}({\bm{\eta}}(x),\hat{{\bm{\eta}}}(x))=\mathbb{E}_{p_{k}(x)}\left(\sum_{i\in[k]}\pi_{i}\frac{p_{i}(x)}{p_{k}(x)}\mathbf{B}_{f}({\bm{\eta}}(x),\hat{{\bm{\eta}}}(x))\right)
=(ii)𝔼pk(x)((πk+i[k1]πiri(x))𝐁f(𝜼(x),𝜼^(x)))\displaystyle\stackrel{{\scriptstyle(ii)}}{{=}}\mathbb{E}_{p_{k}(x)}\left(\left(\pi_{k}+\sum_{i\in[k-1]}\pi_{i}r_{i}(x)\right)\cdot\mathbf{B}_{f}({\bm{\eta}}(x),\hat{{\bm{\eta}}}(x))\right)
=(iii)πk𝔼pk(x)𝐁fπ(𝒓(x),𝒓^(x)),\displaystyle\stackrel{{\scriptstyle(iii)}}{{=}}\pi_{k}\mathbb{E}_{p_{k}(x)}\mathbf{B}_{f^{\circledast}_{\pi}}({\bm{r}}(x),\hat{{\bm{r}}}(x)),

where we use (i)(i) the definition of marginal distribution M(x)=i[k]πipi(x)M(x)=\sum_{i\in[k]}\pi_{i}p_{i}(x), (ii)(ii) the definition of density ratio that ri(x)=pi(x)/pk(x)r_{i}(x)=p_{i}(x)/p_{k}(x) x𝒳,i[k]\forall x\in{\mathcal{X}},i\in[k], and (iii)(iii) Proposition 2 with the consistent definitions of fπf^{\circledast}_{\pi} and 𝒓{\bm{r}}, 𝒓^\hat{{\bm{r}}} as stated in the theorem. ∎

A.3.1 Information Measure in Multi-class Experiments

In this section, we show that multi-distribution density ratio estimation can be viewed as estimating the statistical information measure (DeGroot, 1962) in multi-class experiments, under appropriate choices for the convex function ff.

We first introduce the following definitions in multi-class experiments. For 𝒑Δk{\bm{p}}\in\Delta^{k}, any proper loss function :[k]×Δk\ell:[k]\times\Delta^{k}\to\mathbb{R} induces a generalized entropy:

H(𝒑):=inf𝒒Δki[k]pi(i,𝒒),H_{\ell}({\bm{p}})\mathrel{\mathop{:}}=\inf_{{\bm{q}}\in\Delta^{k}}\sum_{i\in[k]}p_{i}\ell(i,{\bm{q}}),

which measures the uncertainty of the task. Given a multi-class experiment D=(𝝅,P1,,Pk)=(M,𝜼)D=({\bm{\pi}},P_{1},\ldots,P_{k})=(M,{\bm{\eta}}) and the generalized entropy H:ΔkH_{\ell}:\Delta^{k}\to\mathbb{R} (which is closed and concave), the information measure in a multi-class experiment (DeGroot, 1962; Duchi et al., 2018) is defined as the gap between the prior and posterior generalized entropy:

H(D)=H(𝝅)𝔼M(x)[H(𝜼(x))].\mathcal{I}_{H_{\ell}}(D)=H_{\ell}({\bm{\pi}})-\mathbb{E}_{M(x)}[H_{\ell}({\bm{\eta}}(x))].

We next introduce the following connections between multi-distribution ff-divergence, generalized entropy and information measure in multi-class experiments. Specifically, Duchi et al. (2018) proved an equivalence between the gap of prior and posterior Bayes risks and the multi-distribution ff-divergence induced by a convex function ff depending on \ell and the prior 𝝅{\bm{\pi}}, demonstrating the utility of multi-distribution ff-divergence for experimental design of multi-class classification.

Theorem 3 ((Duchi et al., 2018)).

Given a proper loss \ell, its associated generalized entropy HH_{\ell} and a multi-class distribution D=(𝛑,P1,,Pk)=(M,𝛈)D=({\bm{\pi}},P_{1},\ldots,P_{k})=(M,{\bm{\eta}}), we can define a closed convex function f,𝛑:+k1{±}f_{\ell,{\bm{\pi}}}:{\mathbb{R}}^{k-1}_{+}\to{\mathbb{R}}\cup\{\pm\infty\} as

f,𝝅(𝒕):=sup𝝂Δk(H(𝝅)i[k1]πi(i,𝝂)tiπk(k,𝝂))f_{\ell,{\bm{\pi}}}({\bm{t}})\mathrel{\mathop{:}}=\sup_{\bm{\nu}\in\Delta^{k}}\left(H_{\ell}({\bm{\pi}})-\sum_{i\in[k-1]}\pi_{i}\ell(i,\bm{\nu})t_{i}-\pi_{k}\ell(k,\bm{\nu})\right) (30)

We can then express the information measure of multi-class experiments as the multi-distribution ff-divergence induced by Eq. (30):

H(D)\displaystyle\mathcal{I}_{H_{\ell}}(D) =H(𝝅)𝔼M(x)[H(𝜼(x))]=inf𝝂Δki[k]πi(i,𝝂)inf𝜼^CPE(𝜼^;D)=H_{\ell}({\bm{\pi}})-\mathbb{E}_{M(x)}[H_{\ell}({\bm{\eta}}(x))]=\inf_{\bm{\nu}\in\Delta^{k}}\sum_{i\in[k]}\pi_{i}\ell(i,\bm{\nu})-\inf_{\hat{{\bm{\eta}}}}{\mathcal{L}}_{\text{CPE}}(\hat{{\bm{\eta}}};D)
=𝐃f,𝝅(P1,,Pk1Pk).\displaystyle=\mathbf{D}_{f_{\ell,{\bm{\pi}}}}(P_{1},\ldots,P_{k-1}\|P_{k}).

Given Theorem 3 and Proposition 1, we know that multi-distribution density ratio estimation by minimizing expected Bregman divergence (Eq. (14)), induced by the convex function f,𝝅f_{\ell,{\bm{\pi}}} defined in Eq. (30), corresponds to estimating the statistical information measure in multi-class classification experiments.

A.4 Examples of Multi-distribution DRE

A.4.1 Multi-class Logistic Regression

From Section 2.3, we know that there is a one-to-one correspondence between a class probability estimator and a density ratio estimator through the link and the inverse link function: 𝒓^=Ψdr𝜼^\hat{{\bm{r}}}=\Psi_{\mathrm{dr}}\circ\hat{{\bm{\eta}}} and 𝜼^=Ψdr1𝒓^\hat{{\bm{\eta}}}=\Psi_{\mathrm{dr}}^{-1}\circ\hat{{\bm{r}}}. When the class prior distribution 𝝅{\bm{\pi}} is uniform, we have:

r^i(x)=η^i(x)η^k(x)andη^i(x)=r^i(x)j=1kr^j(x),for alli[k],x𝒳.\hat{r}_{i}(x)=\frac{\hat{\eta}_{i}(x)}{\hat{\eta}_{k}(x)}~{}~{}\text{and}~{}~{}\hat{\eta}_{i}(x)=\frac{\hat{r}_{i}(x)}{\sum_{j=1}^{k}\hat{r}_{j}(x)},~{}~{}\text{for all}~{}i\in[k],x\in{\mathcal{X}}. (31)

To recover the loss of multi-class logistic regression used by Bickel et al. (2008), we choose the following convex function (where we use r^k=1\hat{r}_{k}=1):

f(r^1,,r^k1)=\displaystyle f(\hat{r}_{1},\ldots,\hat{r}_{k-1})= 1ki=1kr^ilog(r^ij=1kr^j)\displaystyle\frac{1}{k}\sum_{i=1}^{k}\hat{r}_{i}\log\left(\frac{\hat{r}_{i}}{\sum_{j=1}^{k}\hat{r}_{j}}\right) (32)
if(r^1,,r^k1)=\displaystyle\partial_{i}f(\hat{r}_{1},\ldots,\hat{r}_{k-1})= 1klog(r^ij=1kr^j)fori[k1]\displaystyle\frac{1}{k}\log\left(\frac{\hat{r}_{i}}{\sum_{j=1}^{k}\hat{r}_{j}}\right)\quad\text{for}~{}i\in[k-1] (33)

Thus the loss in Eq. (14) reduces to:

1k𝔼pk(x)[i=1k1r^i(x)logr^i(x)j=1kr^j(x)i=1k1r^i(x)logr^i(x)+(i=1kr^i(x))log(i=1kr^i(x))]\displaystyle\frac{1}{k}\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}\hat{r}_{i}(x)\log\frac{\hat{r}_{i}(x)}{\sum_{j=1}^{k}\hat{r}_{j}(x)}-\sum_{i=1}^{k-1}\hat{r}_{i}(x)\log\hat{r}_{i}(x)+\left(\sum_{i=1}^{k}\hat{r}_{i}(x)\right)\log\left(\sum_{i=1}^{k}\hat{r}_{i}(x)\right)\right]-
1ki=1k1𝔼pi(x)[log(r^i(x)j=1kr^j(x))]\displaystyle\frac{1}{k}\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\log\left(\frac{\hat{r}_{i}(x)}{\sum_{j=1}^{k}\hat{r}_{j}(x)}\right)\right]
=\displaystyle= 1k𝔼pk(x)[r^k(x)log(i=1kr^i(x))]1ki=1k1𝔼pi(x)[log(r^i(x)j=1kr^j(x))]\displaystyle\frac{1}{k}\mathbb{E}_{p_{k}(x)}\left[\hat{r}_{k}(x)\log\left(\sum_{i=1}^{k}\hat{r}_{i}(x)\right)\right]-\frac{1}{k}\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\log\left(\frac{\hat{r}_{i}(x)}{\sum_{j=1}^{k}\hat{r}_{j}(x)}\right)\right]
=(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{=}} (1ki=1k𝔼pi(x)[log(𝜼^i(x))])\displaystyle-\left(\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}_{p_{i}(x)}[\log(\hat{{\bm{\eta}}}_{i}(x))]\right)

where (i)(i) is because r^k(x)=1,x𝒳\hat{r}_{k}(x)=1,~{}\forall x\in{\mathcal{X}} and Eq. (31).
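The identity above can also be checked numerically: for any positive values standing in for 𝒓^(x)\hat{{\bm{r}}}(x) on each distribution's samples, the Bregman-style loss and the multi-class log loss coincide. A minimal NumPy sketch (the sample values are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
# r_hats[i][n] is the (k-1)-vector of ratio estimates at the n-th sample
# drawn from p_{i+1}; any positive values will do for this identity check.
r_hats = [rng.uniform(0.5, 2.0, size=(10, k - 1)) for _ in range(k)]

def with_rk(R):                       # append the fixed coordinate r^_k = 1
    return np.hstack([R, np.ones((len(R), 1))])

# Bregman-style loss derived from Eq. (32)
Rk = with_rk(r_hats[-1])
loss = np.mean(np.log(Rk.sum(1))) / k
for i in range(k - 1):
    Ri = with_rk(r_hats[i])
    loss -= np.mean(np.log(Ri[:, i] / Ri.sum(1))) / k

# Multi-class log loss on eta^ obtained via the inverse link in Eq. (31)
nll = 0.0
for i in range(k):
    Ri = with_rk(r_hats[i])
    nll -= np.mean(np.log(Ri[:, i] / Ri.sum(1))) / k

assert np.isclose(loss, nll)          # the two objectives agree pointwise
```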

When the class prior 𝝅{\bm{\pi}} is not uniform, from Section 2.3, we know that the link and inverse link connecting density ratio estimators and class probability estimators are:

r^i=πkπiη^iη^kandη^i=πir^ij[k]πjr^j,for alli[k],x𝒳.\hat{r}_{i}=\frac{\pi_{k}}{\pi_{i}}\cdot\frac{\hat{\eta}_{i}}{\hat{\eta}_{k}}\quad\text{and}\quad\hat{\eta}_{i}=\frac{\pi_{i}\hat{r}_{i}}{\sum_{j\in[k]}\pi_{j}\hat{r}_{j}},~{}\text{for all}~{}i\in[k],x\in{\mathcal{X}}. (34)
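A small NumPy sketch of this link / inverse-link pair, checking the round trip on illustrative values of the prior and class probabilities:

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])            # class prior, k = 3
eta = np.array([0.1, 0.6, 0.3])           # class probabilities at some x

# Link of Eq. (34): density ratio estimates, with r_k = 1 by construction
r = (pi[-1] / pi) * (eta / eta[-1])
# Inverse link: back to class probabilities
eta_back = (pi * r) / np.sum(pi * r)

assert np.isclose(r[-1], 1.0)
assert np.allclose(eta_back, eta)         # round trip recovers eta
```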

In this case, we choose the following convex function (where we use r^k=1\hat{r}_{k}=1):

f(r^1,,r^k1)=\displaystyle f(\hat{r}_{1},\ldots,\hat{r}_{k-1})= i=1kπir^ilogπir^i(i=1kπir^i)log(i=1kπir^i)\displaystyle\sum_{i=1}^{k}\pi_{i}\hat{r}_{i}\log\pi_{i}\hat{r}_{i}-\left(\sum_{i=1}^{k}\pi_{i}\hat{r}_{i}\right)\log\left(\sum_{i=1}^{k}\pi_{i}\hat{r}_{i}\right) (35)
if(r^1,,r^k1)=\displaystyle\partial_{i}f(\hat{r}_{1},\ldots,\hat{r}_{k-1})= πilog(πir^ij=1kπjr^j)fori[k1]\pi_{i}\log\left(\frac{\pi_{i}\hat{r}_{i}}{\sum_{j=1}^{k}\pi_{j}\hat{r}_{j}}\right)\quad\text{for}~{}i\in[k-1] (36)

Note that when 𝝅{\bm{\pi}} is the uniform distribution, Eq. (35) reduces to Eq. (32).

The loss in Eq. (14) reduces to:

𝔼pk(x)[πkr^k(x)log(i=1kπir^i(x))πkr^k(x)log(πkr^k(x))]i=1k1𝔼pi(x)[πilog(πir^i(x)j=1kπjr^j(x))]\displaystyle\mathbb{E}_{p_{k}(x)}\left[\pi_{k}\hat{r}_{k}(x)\log\left(\sum_{i=1}^{k}\pi_{i}\hat{r}_{i}(x)\right)-\pi_{k}\hat{r}_{k}(x)\log(\pi_{k}\hat{r}_{k}(x))\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\pi_{i}\log\left(\frac{\pi_{i}\hat{r}_{i}(x)}{\sum_{j=1}^{k}\pi_{j}\hat{r}_{j}(x)}\right)\right]
=\displaystyle= (i=1kπi𝔼pi(x)[log(η^i(x))])\displaystyle-\left(\sum_{i=1}^{k}\pi_{i}\mathbb{E}_{p_{i}(x)}[\log(\hat{\eta}_{i}(x))]\right)

which corresponds to the multi-class logistic regression loss for the class probability estimators 𝜼^\hat{{\bm{\eta}}}.

Remark. Interestingly, we note that the multi-distribution ff-divergence associated with the convex function in Eq. (32) is the multi-distribution Jensen-Shannon divergence (Garcia-Garcia & Williamson, 2012) (also known as the information radius (Sibson, 1969)) up to a constant of logk\log k:

𝐃f(P1,,Pk)\displaystyle\mathbf{D}_{f}(P_{1},\ldots,P_{k}) =𝔼pk(x)[f(p1(x)pk(x),,pk1(x)pk(x))]\displaystyle=\mathbb{E}_{p_{k}(x)}\left[f\left(\frac{p_{1}(x)}{p_{k}(x)},\ldots,\frac{p_{k-1}(x)}{p_{k}(x)}\right)\right]
=1k𝔼pk(x)[i=1kpi(x)pk(x)log(pi(x)j=1kpj(x))]\displaystyle=\frac{1}{k}\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k}\frac{p_{i}(x)}{p_{k}(x)}\log\left(\frac{p_{i}(x)}{\sum_{j=1}^{k}p_{j}(x)}\right)\right]
=1ki=1k𝔼pi(x)[log(pi(x)1kj=1kpj(x))]logk\displaystyle=\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}_{p_{i}(x)}\left[\log\left(\frac{p_{i}(x)}{\frac{1}{k}\sum_{j=1}^{k}p_{j}(x)}\right)\right]-\log k
=1ki=1kDKL(Pi1kj=1kPj)logk\displaystyle=\frac{1}{k}\sum_{i=1}^{k}D_{\mathrm{KL}}\left(P_{i}\|\frac{1}{k}\sum_{j=1}^{k}P_{j}\right)-\log k

A.4.2 Least-squares Importance Fitting

When the convex function associated with the Bregman divergence is chosen to be:

f(r^1,,r^k1)\displaystyle f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =12i=1k1(r^i1)2=12𝒓^𝟏2\displaystyle=\frac{1}{2}\sum_{i=1}^{k-1}(\hat{r}_{i}-1)^{2}=\frac{1}{2}\|\hat{{\bm{r}}}-\bm{1}\|^{2} (37)
if(r^1,,r^k1)\displaystyle\partial_{i}f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =r^i1fori[k1]\displaystyle=\hat{r}_{i}-1\quad\text{for}~{}i\in[k-1] (38)

The loss in Eq. (14) reduces to:

𝔼pk(x)[i=1k1(r^i2(x)r^i(x))12i=1k1(r^i2(x)2r^i(x)+1)]i=1k1𝔼pi(x)[r^i(x)1]\displaystyle\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}(\hat{r}_{i}^{2}(x)-\hat{r}_{i}(x))-\frac{1}{2}\sum_{i=1}^{k-1}(\hat{r}_{i}^{2}(x)-2\hat{r}_{i}(x)+1)\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\hat{r}_{i}(x)-1\right]
=\displaystyle= 12𝔼pk(x)[i=1k1(r^i2(x)1)]i=1k1𝔼pi(x)[r^i(x)1]\displaystyle\frac{1}{2}\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}(\hat{r}_{i}^{2}(x)-1)\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\hat{r}_{i}(x)-1\right]
=\displaystyle= 12i=1k1𝔼pk(x)[r^i2(x)12pi(x)pk(x)(r^i(x)1)]\displaystyle\frac{1}{2}\sum_{i=1}^{k-1}\mathbb{E}_{p_{k}(x)}\left[\hat{r}_{i}^{2}(x)-1-2\frac{p_{i}(x)}{p_{k}(x)}(\hat{r}_{i}(x)-1)\right]
=\displaystyle= 12i=1k1𝔼pk(x)[(r^i(x)ri(x))2]C\displaystyle\frac{1}{2}\sum_{i=1}^{k-1}\mathbb{E}_{p_{k}(x)}\left[(\hat{r}_{i}(x)-r_{i}(x))^{2}\right]-C

where C=12𝔼pk(x)[𝒓(x)𝟏2]C=\frac{1}{2}\mathbb{E}_{p_{k}(x)}\left[\|{\bm{r}}(x)-\bm{1}\|^{2}\right] is a constant w.r.t. 𝒓^\hat{{\bm{r}}}, so the minimizer of the above loss matches the true density ratios. This strictly generalizes the Least-Squares Importance Fitting (LSIF) (Kanamori et al., 2009) method to the multi-distribution case.
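On a finite support the expectations are exact, so the claim that the minimizer matches the true ratios can be checked directly. A small NumPy sketch with illustrative discrete distributions:

```python
import numpy as np

# Three discrete distributions on a 3-point support; p_3 is the denominator.
p = np.array([[0.2, 0.5, 0.3],     # p_1
              [0.6, 0.1, 0.3],     # p_2
              [0.4, 0.3, 0.3]])    # p_3 = p_k
r_true = p[:2] / p[2]              # true ratio vectors, shape (k-1, support)

def lsif_loss(r_hat):              # the simplified objective above, exactly
    term_k = 0.5 * np.sum(p[2] * (r_hat**2 - 1.0).sum(0))
    term_i = sum(np.sum(p[i] * (r_hat[i] - 1.0)) for i in range(2))
    return term_k - term_i

# The true ratios minimize the loss (it equals 0.5 * sum E[(r^ - r)^2] - C).
assert lsif_loss(r_true) <= lsif_loss(r_true + 0.1)
assert lsif_loss(r_true) <= lsif_loss(np.ones_like(r_true))
```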

A.4.3 KL Importance Estimation Procedure

When the convex function associated with the Bregman divergence is chosen to be:

f(r^1,,r^k1)\displaystyle f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =i=1k1(r^ilogr^ir^i)=𝒓^,log(𝒓^)𝒓^1\displaystyle=\sum_{i=1}^{k-1}(\hat{r}_{i}\log\hat{r}_{i}-\hat{r}_{i})=\langle\hat{{\bm{r}}},\log(\hat{{\bm{r}}})\rangle-\|\hat{{\bm{r}}}\|_{1} (39)
if(r^1,,r^k1)\displaystyle\partial_{i}f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =logr^ifori[k1]\displaystyle=\log\hat{r}_{i}\quad\text{for}~{}i\in[k-1] (40)

The loss in Eq. (14) reduces to:

𝔼pk(x)[i=1k1r^i(x)logr^i(x)i=1k1(r^i(x)logr^i(x)r^i(x))]i=1k1𝔼pi(x)[logr^i(x)]\displaystyle\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}\hat{r}_{i}(x)\log\hat{r}_{i}(x)-\sum_{i=1}^{k-1}(\hat{r}_{i}(x)\log\hat{r}_{i}(x)-\hat{r}_{i}(x))\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}[\log\hat{r}_{i}(x)]
=\displaystyle= 𝔼pk(x)[i=1k1r^i(x)]i=1k1𝔼pi(x)[logr^i(x)]\displaystyle\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}\hat{r}_{i}(x)\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}[\log\hat{r}_{i}(x)] (41)

This is equivalent to the following constrained optimization problem, with the term 𝔼pk(x)[r^i(x)]\mathbb{E}_{p_{k}(x)}[\hat{r}_{i}(x)] in Eq. (41) acting as a Lagrangian term for the normalization constraint:

argmin𝒓^:𝒳k1i=1k1DKL(pi(x)r^i(x)pk(x))=i=1k1𝔼pi(x)[log(pi(x)r^i(x)pk(x))]\displaystyle\operatorname*{arg\,min}_{\hat{{\bm{r}}}:{\mathcal{X}}\to{\mathbb{R}}^{k-1}}\sum_{i=1}^{k-1}D_{\mathrm{KL}}(p_{i}(x)\|\hat{r}_{i}(x)\cdot p_{k}(x))=\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\log\left(\frac{p_{i}(x)}{\hat{r}_{i}(x)\cdot p_{k}(x)}\right)\right]
=\displaystyle= argmin𝒓^:𝒳k1i=1k1𝔼pi(x)[logr^i(x)]\displaystyle\operatorname*{arg\,min}_{\hat{{\bm{r}}}:{\mathcal{X}}\to{\mathbb{R}}^{k-1}}-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}[\log\hat{r}_{i}(x)]
s.t.𝔼pk(x)[r^i(x)]=1andr^i(x)0,for alli[k1].\displaystyle\text{s.t.}\quad\mathbb{E}_{p_{k}(x)}[\hat{r}_{i}(x)]=1~{}~{}\text{and}~{}~{}\hat{r}_{i}(x)\geq 0,~{}~{}\text{for all}~{}i\in[k-1].

which strictly generalizes the Kullback–Leibler Importance Estimation Procedure (KLIEP) (Sugiyama et al., 2008) to the multi-distribution case.
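A small NumPy sketch, again on a finite support with illustrative discrete distributions, checking that the true ratios satisfy the KLIEP normalization constraint and minimize the objective in Eq. (41):

```python
import numpy as np

p = np.array([[0.2, 0.5, 0.3],          # p_1
              [0.6, 0.1, 0.3],          # p_2
              [0.4, 0.3, 0.3]])         # p_3 = p_k (denominator)
r_true = p[:2] / p[2]                   # true ratios, shape (k-1, support)

# The true ratios satisfy the constraint E_{p_k}[r_i] = 1 exactly.
assert np.allclose(p[2] @ r_true.T, 1.0)

def kliep_loss(r_hat):                  # the objective in Eq. (41), exactly
    return np.sum(p[2] * r_hat.sum(0)) - sum(
        np.sum(p[i] * np.log(r_hat[i])) for i in range(2))

# The true ratios minimize Eq. (41) among positive candidates.
assert kliep_loss(r_true) <= kliep_loss(r_true * 1.3)
assert kliep_loss(r_true) <= kliep_loss(np.full_like(r_true, 0.8))
```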

A.4.4 Basu’s Power Divergence for Robust DRE

For some α>1\alpha>1, we choose the following convex function (the α\alpha-th power of the α\alpha-norm of a vector):

f(r^1,,r^k1)\displaystyle f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =i=1k1r^iα=𝒓^αα\displaystyle=\sum_{i=1}^{k-1}\hat{r}_{i}^{\alpha}=\|\hat{{\bm{r}}}\|_{\alpha}^{\alpha} (42)
if(r^1,,r^k1)\displaystyle\partial_{i}f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =αr^iα1\displaystyle=\alpha\hat{r}_{i}^{\alpha-1} (43)

The loss in Eq. (14) reduces to:

𝔼pk(x)[i=1k1αr^iα(x)i=1k1r^iα(x)]i=1k1𝔼pi(x)[αr^iα1(x)]\displaystyle\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}\alpha\hat{r}_{i}^{\alpha}(x)-\sum_{i=1}^{k-1}\hat{r}_{i}^{\alpha}(x)\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\alpha\hat{r}_{i}^{\alpha-1}(x)\right]
=\displaystyle= i=1k1𝔼pk(x)[(α1)r^iα(x)]i=1k1𝔼pi(x)[αr^iα1(x)]\displaystyle\sum_{i=1}^{k-1}\mathbb{E}_{p_{k}(x)}\left[(\alpha-1)\hat{r}_{i}^{\alpha}(x)\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\alpha\hat{r}_{i}^{\alpha-1}(x)\right] (44)

To understand the robustness of this formulation, for each i[k1]i\in[k-1], we take the derivative of Eq. (44) w.r.t. the parameters of the density ratio model r^i\hat{r}_{i} and set it to zero; after dividing by α(α1)\alpha(\alpha-1), we get the following parameter estimation equation:

𝔼pk(x)[r^iα1(x)r^i(x)]𝔼pi(x)[r^iα2(x)r^i(x)]=𝟎\mathbb{E}_{p_{k}(x)}[\hat{r}_{i}^{\alpha-1}(x)\nabla\hat{r}_{i}(x)]-\mathbb{E}_{p_{i}(x)}[\hat{r}_{i}^{\alpha-2}(x)\nabla\hat{r}_{i}(x)]=\bm{0} (45)

Now we apply the same analysis to the multi-distribution KLIEP method in Eq. (41) and we get the following equation (for each i[k1]i\in[k-1]):

𝔼pk(x)[r^i(x)]𝔼pi(x)[r^i1(x)r^i(x)]=𝟎\mathbb{E}_{p_{k}(x)}[\nabla\hat{r}_{i}(x)]-\mathbb{E}_{p_{i}(x)}[\hat{r}_{i}^{-1}(x)\nabla\hat{r}_{i}(x)]=\bm{0} (46)

Comparing Eq. (45) with Eq. (46), we can see that the power divergence DRE method in Eq. (44) is a weighted version of the multi-distribution KLIEP method, where the weight r^iα1(x)\hat{r}_{i}^{\alpha-1}(x) controls the importance of the samples. In scenarios where outlier samples tend to have density ratios less than one, they will have less influence on the parameter estimation, which generalizes the binary Basu’s power divergence robust DRE method (Sugiyama et al., 2012) to the multi-distribution case. Another interesting observation is that when α1\alpha\to 1, Eq. (45) recovers the KLIEP estimating equation in Eq. (46); when α=2\alpha=2, the power divergence DRE loss in Eq. (44) recovers the multi-distribution LSIF method in Section A.4.2.
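The α=2\alpha=2 case can be checked numerically: on a finite support, the power-divergence loss in Eq. (44) equals twice the multi-distribution LSIF loss minus the constant k1k-1, so the two objectives share the same minimizer. A small NumPy sketch with illustrative discrete distributions:

```python
import numpy as np

p = np.array([[0.2, 0.5, 0.3],           # p_1
              [0.6, 0.1, 0.3],           # p_2
              [0.4, 0.3, 0.3]])          # p_3 = p_k (denominator), so k = 3

def power_loss(r_hat, alpha):            # Eq. (44), exact expectations
    return (alpha - 1) * np.sum(p[2] * (r_hat**alpha).sum(0)) \
        - alpha * sum(np.sum(p[i] * r_hat[i]**(alpha - 1)) for i in range(2))

def lsif_loss(r_hat):                    # Section A.4.2 objective
    return 0.5 * np.sum(p[2] * (r_hat**2 - 1.0).sum(0)) \
        - sum(np.sum(p[i] * (r_hat[i] - 1.0)) for i in range(2))

rng = np.random.default_rng(1)
r_hat = rng.uniform(0.5, 2.0, size=(2, 3))   # arbitrary positive candidate
# At alpha = 2 the two losses agree up to an affine transform (k - 1 = 2).
assert np.isclose(power_loss(r_hat, 2), 2 * lsif_loss(r_hat) - 2)
```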

A.4.5 More Examples

When the convex function is chosen to be the Log-Sum-Exp type function (for α>0\alpha>0):

f(r^1,,r^k1)\displaystyle f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =αlog(i=1k1exp(r^i/α))\displaystyle=\alpha\log\left(\sum_{i=1}^{k-1}\exp(\hat{r}_{i}/\alpha)\right) (47)
if(r^1,,r^k1)=\displaystyle\partial_{i}f(\hat{r}_{1},\ldots,\hat{r}_{k-1})= exp(r^i/α)j=1k1exp(r^j/α)\frac{\exp(\hat{r}_{i}/\alpha)}{\sum_{j=1}^{k-1}\exp(\hat{r}_{j}/\alpha)} (48)

The loss in Eq. (14) can be written as:

𝔼pk(x)[i=1k1r^i(x)exp(r^i(x)/α)j=1k1exp(r^j(x)/α)αlog(i=1k1exp(r^i(x)/α))]i=1k1𝔼pi(x)[exp(r^i(x)/α)j=1k1exp(r^j(x)/α)]\displaystyle\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}\frac{\hat{r}_{i}(x)\exp(\hat{r}_{i}(x)/\alpha)}{\sum_{j=1}^{k-1}\exp(\hat{r}_{j}(x)/\alpha)}-\alpha\log\left(\sum_{i=1}^{k-1}\exp(\hat{r}_{i}(x)/\alpha)\right)\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\frac{\exp(\hat{r}_{i}(x)/\alpha)}{\sum_{j=1}^{k-1}\exp(\hat{r}_{j}(x)/\alpha)}\right]

We can similarly derive loss functions induced by other convex functions such as the quadratic function f(r^1,,r^k1)=𝒓^𝑯𝒓^+𝒒𝒓^f(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\hat{{\bm{r}}}^{\top}{\bm{H}}\hat{{\bm{r}}}+{\bm{q}}^{\top}\hat{{\bm{r}}}, for some positive definite matrix 𝑯0{\bm{H}}\succ 0.

A.5 More Experimental Details

We provide more details about the problem setup of each task used in our empirical study.

For the synthetic data experiments, we use k=5k=5 multivariate Gaussian distributions with identity covariance matrix and different mean vectors:

𝝁1\displaystyle{\bm{\mu}}_{1} =(1,0,0,,0)d=(1,0,0,\ldots,0)\in{\mathbb{R}}^{d}
𝝁2\displaystyle{\bm{\mu}}_{2} =(1,0,0,,0)d=(-1,0,0,\ldots,0)\in{\mathbb{R}}^{d}
𝝁3\displaystyle{\bm{\mu}}_{3} =(0,1,0,,0)d=(0,1,0,\ldots,0)\in{\mathbb{R}}^{d}
𝝁4\displaystyle{\bm{\mu}}_{4} =(0,1,0,,0)d=(0,-1,0,\ldots,0)\in{\mathbb{R}}^{d}
𝝁5\displaystyle{\bm{\mu}}_{5} =(1,0,0,,0)d=(1,0,0,\ldots,0)\in{\mathbb{R}}^{d}

We use this design so that the density ratios are almost surely well-defined and the numerical optimization with respect to the canonical density ratio vector 𝒓^=(r^1,,r^k1)\hat{{\bm{r}}}=(\hat{r}_{1},\ldots,\hat{r}_{k-1}) is more stable. We use a Multi-Layer Perceptron (MLP) with two hidden layers (Linear(d,32)Linear(32,32)Linear(32,k1)\text{Linear}(d,32)\to\text{Linear}(32,32)\to\text{Linear}(32,k-1)) and ReLU activations to realize the density ratio model.
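For concreteness, a NumPy sketch of a ratio model with this architecture (random untrained weights; the exponential output parameterization is our illustrative choice to keep the ratio estimates positive, not necessarily the one used in the experiments):

```python
import numpy as np

def mlp_ratio_model(x, params):
    """MLP with ReLU hidden layers; output exponentiated to give positive ratios."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)     # ReLU
    W, b = params[-1]
    return np.exp(h @ W + b)               # (n, k-1) positive ratio estimates

d, k = 8, 5
rng = np.random.default_rng(0)
sizes = [(d, 32), (32, 32), (32, k - 1)]   # the architecture described above
params = [(0.1 * rng.standard_normal(s), np.zeros(s[1])) for s in sizes]

x = rng.standard_normal((4, d))            # a batch of 4 inputs
r_hat = mlp_ratio_model(x, params)
assert r_hat.shape == (4, k - 1) and np.all(r_hat > 0)
```

In practice the weights would be trained by minimizing one of the Bregman losses derived in Section A.4.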

For CIFAR-10 OOD detection experiments, we set k=4k=4 and construct each distribution as: p1p_{1} - samples labeled {airplane, automobile, bird}; p2p_{2} - samples labeled {cat, deer, dog, frog}; p3p_{3} - samples labeled {horse, ship, truck}; and p4p_{4} - a uniform mixture of these distributions. We use the standard convolutional neural network from the PyTorch tutorial (https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) with k1k-1 outputs to realize the density ratio model.

For MNIST multi-target generation experiments, we use k=6k=6 and construct each distribution as: p1p_{1} - samples labeled {0,1}; p2p_{2} - samples labeled {2,3}; p3p_{3} - samples labeled {4,5}; p4p_{4} - samples labeled {6,7}; p5p_{5} - samples labeled {8,9}; p6p_{6} - a mixture of these distributions. We use a convolutional neural network with two convolutional layers (Conv(1, 32, 3, 1) \to Conv(32, 64, 3, 1) \to Linear(9216, 128) \to Linear(128, k1k-1)) and ReLU activations to realize the density ratio model.

For multi-distribution off-policy policy evaluation experiments, we conducted experiments on the Half-Cheetah environment in OpenAI Gym (Brockman et al., 2016). We use the soft actor-critic (SAC) algorithm (Haarnoja et al., 2018) to obtain five different policies with average expected returns of {3811, 5277, 6444, 7397, 5728}, respectively, and we learn density ratios between their induced occupancy measures (state-action distributions). We use an MLP with three hidden layers (Linear(17,256)Linear(256,256)Linear(256,256)Linear(256,k1)\text{Linear}(17,256)\to\text{Linear}(256,256)\to\text{Linear}(256,256)\to\text{Linear}(256,k-1)) and ReLU activations to realize the density ratio model.