
A Unified Framework for Multi-distribution Density Ratio Estimation

Lantao Yu
Department of Computer Science
Stanford University
lantaoyu@cs.stanford.edu

Yujia Jin
Department of Management Science and Engineering
Stanford University
yujiajin@stanford.edu

Stefano Ermon
Department of Computer Science
Stanford University
ermon@cs.stanford.edu
Abstract

Binary density ratio estimation (DRE), the problem of estimating the ratio $p_1/p_2$ given their empirical samples, provides the foundation for many state-of-the-art machine learning algorithms such as contrastive representation learning and covariate shift adaptation. In this work, we consider a generalized setting where, given samples from multiple distributions $p_1,\ldots,p_k$ (for $k>2$), we aim to efficiently estimate the density ratios between all pairs of distributions. Such a generalization leads to important new applications such as estimating statistical discrepancy among multiple random variables like multi-distribution $f$-divergence, and bias correction via multiple importance sampling. We then develop a general framework from the perspective of Bregman divergence minimization, where each strictly convex multivariate function induces a proper loss for multi-distribution DRE. Moreover, we rederive the theoretical connection between multi-distribution density ratio estimation and class probability estimation, justifying the use of any strictly proper scoring rule composite with a link function for multi-distribution DRE. We show that our framework leads to methods that strictly generalize their counterparts in binary DRE, as well as new methods that show comparable or superior performance on various downstream tasks.

1 Introduction

Estimating the density ratio between two distributions from their empirical samples is a central problem in machine learning. It continuously drives progress in the field and finds applications in many tasks, including anomaly detection (Hido et al., 2008; Smola et al., 2009; Hido et al., 2011), importance weighting in covariate shift adaptation (Huang et al., 2006; Sugiyama et al., 2007), generative modeling (Uehara et al., 2016; Nowozin et al., 2016; Grover et al., 2019), two-sample testing (Sugiyama et al., 2011; Gretton et al., 2012), and mutual information estimation and representation learning (Oord et al., 2018; Hjelm et al., 2018). The paradigm is powerful because computing density ratios focuses on extracting and preserving the contrastive information between two distributions, which is crucial in many tasks. Despite the tremendous success of binary DRE, many applications involve more than two probability distributions, and developing density ratio estimation methods for multiple distributions has the potential to advance various applications such as estimating multi-distribution statistical discrepancy measures (Garcia-Garcia & Williamson, 2012), multi-domain transfer learning, bias correction and variance reduction with multiple importance sampling (Elvira et al., 2019), multi-marginal generative modeling (Cao et al., 2019), and multilingual machine translation (Dong et al., 2015; Aharoni et al., 2019).

Although recent years have witnessed significant progress toward more sophisticated and advanced methods for binary DRE (Sugiyama et al., 2012; Liu et al., 2017; Rhodes et al., 2020; Kato & Teshima, 2021; Choi et al., 2021), methods for estimating density ratios among multiple distributions remain largely unexplored. One exception is an empirical study of multi-class logistic regression for multi-task learning (Bickel et al., 2008), in which density ratios serve as resampling weights between the distribution of a pooled set of examples from multiple tasks and the target distribution of a given task at hand, leading to significant accuracy improvements in HIV therapy screening experiments.

In this work, we propose a unified framework based on expected Bregman divergence minimization, in which any strictly convex multivariate function induces a proper loss for multi-distribution DRE, thus generalizing the framework of Sugiyama et al. (2012) to the multi-distribution case. Moreover, by directly generalizing the Bregman identity of Menon & Ong (2016) to the multivariate case, we rederive a result similar to that of Nock et al. (2016), which formally relates losses for multi-distribution density ratio estimation and class probability estimation, and theoretically justifies the use of any strictly proper scoring rule (e.g., the logarithmic score (Good, 1952), the Brier score (Brier et al., 1950) and the pseudo-spherical score (Good, 1971)) composite with a link function for multi-distribution DRE. By choosing a variety of specific convex functions or proper scoring rules, we show that our unified framework leads to methods that strictly generalize their counterparts for binary DRE, as well as new objectives specific to multi-distribution DRE. We demonstrate the effectiveness of our framework, and study and compare the empirical performance of its different instantiations on various downstream tasks that rely on accurate multi-distribution density ratio estimation.

2 Preliminaries

2.1 Multi-class Experiments

In multi-class experiments, we have a pair of random variables $(X,Y)\in\mathcal{X}\times\mathcal{Y}$ with joint distribution $D(X,Y)$, where $\mathcal{X}$ is the sample space and $\mathcal{Y}=[k]:=\{1,\ldots,k\}$ is the finite label space. Define the probability simplex as $\Delta_k:=\{\bm{p}\in\mathbb{R}^k_{\geq 0}\,|\,\mathbf{1}^\top\bm{p}=1\}$. By the chain rule of probability, any joint distribution $D(X,Y)$ can be decomposed into class priors $\pi_i:=\mathbb{P}(Y=i)$ and class conditionals $P_i(x):=\mathbb{P}(X=x|Y=i)$ for $i\in[k]$, or into the sample marginal $M(x):=\mathbb{P}(X=x)$ and the class probability function $\bm{\eta}:\mathcal{X}\to\Delta_k$ (i.e., $\eta_i(x)=\mathbb{P}(Y=i|X=x)$). We write $\bm{\eta}(x)$ as a vector $\bm{\eta}$ and omit $x$ when it is clear from context. Thus we can also represent the joint distribution as $D=(\bm{\pi},P_1,\ldots,P_k)$ (where $\bm{\pi}\in\Delta_k$) or as $(M,\bm{\eta})$. For any $i\in[k]$, we assume $P_i$ has density $p_i$ with respect to the Lebesgue measure.

Remark on notation. To avoid confusion, we emphasize that the class probability is denoted $\eta_i(x)=\mathbb{P}(Y=i|X=x)$ and the class conditional is denoted $P_i(x)=\mathbb{P}(X=x|Y=i)$ with density $p_i(x)$. The former satisfies the normalization constraint $\sum_{i=1}^k\eta_i(x)=1$ for all $x\in\mathcal{X}$, while $i$ in the latter only serves as the index for the $k$ different distributions.

In multi-class classification, given independent and identically distributed (i.i.d.) samples from the joint distribution $D(X,Y)$, we want to learn a probabilistic classifier $\hat{\bm{\eta}}:\mathcal{X}\to\Delta_k$ to approximate the true class probability function $\bm{\eta}$ by minimizing the following $\ell$-risk:

$$\mathcal{L}_{\text{CPE}}(\hat{\bm{\eta}};D)=\mathbb{E}_{D(x,y)}[\ell(y,\hat{\bm{\eta}}(x))]=\mathbb{E}_{x\sim M}\big[\mathbb{E}_{y\sim\bm{\eta}(x)}[\ell(y,\hat{\bm{\eta}}(x))]\big]=\mathbb{E}_{x\sim M}[L(\bm{\eta}(x),\hat{\bm{\eta}}(x))]\quad(1)$$

where $\ell:[k]\times\Delta_k\to\mathbb{R}$ is the loss incurred by using the class predictor $\hat{\bm{\eta}}(x)$ when the true class is $y$, and $L:\Delta_k\times\Delta_k\to\mathbb{R}$ is the expected loss of $\hat{\bm{\eta}}(x)$ under the true class probability $\bm{\eta}(x)$.

Definition 1 (Proper loss).

A loss function $\ell$ is proper if the corresponding expected loss satisfies $L(P,Q)\geq L(P,P)$ for all $P,Q\in\Delta_k$. It is strictly proper if equality holds only when $P=Q$.

In statistical decision theory (Gneiting & Raftery, 2007), the negative of a proper loss is also called a proper scoring rule (i.e., $S(y,\hat{\bm{\eta}}(x))=-\ell(y,\hat{\bm{\eta}}(x))$), which assesses the utility of the prediction. Properness of a loss is desirable in multi-class classification because it encourages the class probability estimator $\hat{\bm{\eta}}$ to match the true class probability function $\bm{\eta}$. An important property of proper losses is summarized in Theorem 1 below.
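As a quick numeric illustration of properness, consider the logarithmic score from the abstract, i.e., the log loss $\ell(y,\hat{\bm{\eta}})=-\log\hat{\eta}_y$: by Gibbs' inequality, the expected loss $L(P,Q)=-\sum_i P_i\log Q_i$ is minimized exactly at $Q=P$, so the log loss is strictly proper. A minimal sketch (assuming NumPy; the function name is ours):

```python
import numpy as np

def expected_log_loss(P, Q):
    """L(P, Q) = E_{y~P}[-log Q_y], the expected log loss of prediction Q under truth P."""
    return -np.sum(P * np.log(Q))

rng = np.random.default_rng(0)
P = np.array([0.5, 0.3, 0.2])  # true class probability vector at some x

# Any other Q on the simplex incurs at least as large an expected loss as Q = P.
for _ in range(100):
    Q = rng.dirichlet(np.ones(3))
    assert expected_log_loss(P, Q) >= expected_log_loss(P, P) - 1e-12
```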

Definition 2 (Bregman divergence).

Given a differentiable convex function $\phi:\mathcal{S}\to\mathbb{R}$ defined on a convex set $\mathcal{S}\subset\mathbb{R}^d$ and two points $\bm{x},\bm{y}\in\mathcal{S}$, the Bregman divergence from $\bm{x}$ to $\bm{y}$ is defined as:

$$\mathbf{B}_{\phi}(\bm{x},\bm{y}):=\phi(\bm{x})-\phi(\bm{y})-\langle\bm{x}-\bm{y},\nabla\phi(\bm{y})\rangle\quad(2)$$
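Definition 2 translates directly into code. The following sketch (assuming NumPy; function names are ours) evaluates Eq. (2) for a user-supplied $\phi$ and $\nabla\phi$, and checks two classical special cases: $\phi(\bm{x})=\|\bm{x}\|^2$ recovers the squared Euclidean distance, and the negative Shannon entropy recovers the KL divergence on the simplex:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """B_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>, as in Eq. (2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return phi(x) - phi(y) - np.dot(x - y, grad_phi(y))

# phi(x) = ||x||^2  ==>  B_phi(x, y) = ||x - y||^2
sq, grad_sq = (lambda x: np.dot(x, x)), (lambda x: 2.0 * x)
x, y = np.array([1.0, 2.0]), np.array([0.0, 0.5])
assert np.isclose(bregman(sq, grad_sq, x, y), np.sum((x - y) ** 2))

# phi(p) = sum_i p_i log p_i  ==>  B_phi(p, q) = KL(p || q) on the simplex
negent, grad_negent = (lambda p: np.sum(p * np.log(p))), (lambda p: np.log(p) + 1.0)
p, q = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
assert np.isclose(bregman(negent, grad_negent, p, q), np.sum(p * np.log(p / q)))
```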
Theorem 1 ((Gneiting & Raftery, 2007); Proposition 7 in (Vernet et al., 2011)).

Given a proper loss $\ell$ with corresponding expected loss $L$, for any $P,Q\in\Delta_k$, the generalized entropy function $\underline{L}(P):=\inf_{Q\in\Delta_k}L(P,Q)=L(P,P)$ is concave; when $\underline{L}$ is differentiable, the regret or excess risk of a predictor $Q$ over the Bayes-optimal $P$ is the Bregman divergence induced by the convex function $f=-\underline{L}$:

$$\mathrm{reg}(P,Q;\ell):=L(P,Q)-L(P,P)=\mathbf{B}_{f}(P,Q)\quad(3)$$

Given the Bregman divergence representation of the point-wise regret in Theorem 1 and the $\ell$-risk in Equation (1), the excess risk of a class probability estimator $\hat{\bm{\eta}}$ over the Bayes-optimal $\bm{\eta}$ is:

$$\mathrm{reg}(\hat{\bm{\eta}};M,\bm{\eta},\ell):=\mathcal{L}_{\text{CPE}}(\hat{\bm{\eta}};D)-\mathcal{L}_{\text{CPE}}(\bm{\eta};D)=\mathbb{E}_{M(x)}[L(\bm{\eta}(x),\hat{\bm{\eta}}(x))-L(\bm{\eta}(x),\bm{\eta}(x))]=\mathbb{E}_{M(x)}[\mathbf{B}_{f}(\bm{\eta}(x),\hat{\bm{\eta}}(x))]\quad(4)$$

2.2 Multi-distribution $f$-Divergence

Csiszár’s $f$-divergence is a popular way to measure the discrepancy between two probability distributions. Specifically, given two distributions $P,Q$ and a convex function $f:\mathbb{R}_+\to\mathbb{R}\cup\{\pm\infty\}$ satisfying $f(1)=0$, the $f$-divergence between $P$ and $Q$ is defined as $\mathbf{D}_f(P\|Q)=\mathbb{E}_Q[f(\mathrm{d}P/\mathrm{d}Q)]$. In the following, we introduce the multi-distribution extension of $f$-divergence (Garcia-Garcia & Williamson, 2012).

Definition 3 (Multi-distribution ff-divergence).

For $k$ probability distributions $P_1,\ldots,P_k$ on a common probability space $(\mathcal{X},\sigma(\mathcal{X}))$ with densities $p_1,\ldots,p_k$, and a multivariate closed convex function $f:\mathbb{R}_+^{k-1}\to\mathbb{R}\cup\{\pm\infty\}$ satisfying $f(\bm{1})=0$, the multi-distribution $f$-divergence between $P_1,\ldots,P_{k-1}$ and $P_k$ is defined as:

$$\mathbf{D}_{f}\left(P_{1},\ldots,P_{k-1}\|P_{k}\right)=\mathbb{E}_{p_{k}(x)}\left[f\left(\frac{p_{1}(x)}{p_{k}(x)},\ldots,\frac{p_{k-1}(x)}{p_{k}(x)}\right)\right]\quad(5)$$

2.3 Connecting Density Ratios and Class Probabilities via Link Function

Inspired by the definition in Eq. (5), we consider the following canonical density ratio vector (more discussion of this choice can be found in Section 3.2): $\bm{r}(x)=(r_1(x),\ldots,r_k(x))$, where $r_i(x):=p_i(x)/p_k(x)$ and $r_k(x)=1$. We can then connect a density ratio vector $\bm{r}(x)\in\mathbb{R}_+^{k-1}\times\{1\}$ and a class probability vector $\bm{\eta}(x)\in\Delta_k$ via an invertible link function.

According to Bayes’ theorem, we have:

$$\frac{\mathbb{P}(X=x,Y=i)}{\mathbb{P}(X=x,Y=k)}=\frac{\pi_{i}p_{i}(x)}{\pi_{k}p_{k}(x)}=\frac{M(x)\eta_{i}(x)}{M(x)\eta_{k}(x)}\;\Leftrightarrow\;r_{i}(x)=\frac{p_{i}(x)}{p_{k}(x)}=\frac{\pi_{k}}{\pi_{i}}\cdot\frac{\eta_{i}(x)}{\eta_{k}(x)}.\quad(6)$$

Thus we define the following multi-distribution link function $\Psi_{\mathrm{dr}}:\Delta_k\to\mathbb{R}_+^{k-1}\times\{1\}$ as a natural generalization of the binary DRE link function (Menon & Ong, 2016; Vernet et al., 2011):

$$[\Psi_{\mathrm{dr}}(\bm{\eta}(x))]_{i}:=\frac{\pi_{k}}{\pi_{i}}\cdot\frac{\eta_{i}(x)}{\eta_{k}(x)}=r_{i}(x),\quad\text{for all }i\in[k].\quad(7)$$

Given Eq. (7) and the normalization constraint $\sum_{i\in[k]}\eta_i=1$, we obtain the inverse link function:

$$[\Psi^{-1}_{\mathrm{dr}}(\bm{r}(x))]_{i}:=\frac{\pi_{i}r_{i}(x)}{\sum_{j\in[k]}\pi_{j}r_{j}(x)}=\eta_{i}(x),\quad\text{for all }i\in[k].\quad(8)$$

Thus, given knowledge of the prior distribution $\bm{\pi}$ (which can also be easily estimated from empirical samples), one can transform a class probability estimator into a density ratio estimator via $\hat{\bm{r}}(x)=\Psi_{\mathrm{dr}}(\hat{\bm{\eta}}(x))$, and vice versa via $\hat{\bm{\eta}}(x)=\Psi^{-1}_{\mathrm{dr}}(\hat{\bm{r}}(x))$.
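The link function in Eq. (7) and its inverse in Eq. (8) can be sketched in a few lines (assuming NumPy; function names are ours, and the last coordinate of the arrays plays the role of class $k$), including a round-trip check that $\Psi^{-1}_{\mathrm{dr}}(\Psi_{\mathrm{dr}}(\bm{\eta}))=\bm{\eta}$:

```python
import numpy as np

def link(eta, pi):
    """Psi_dr, Eq. (7): r_i = (pi_k / pi_i) * eta_i / eta_k (index k = last coordinate)."""
    return (pi[-1] / pi) * (eta / eta[-1])

def inverse_link(r, pi):
    """Psi_dr^{-1}, Eq. (8): eta_i = pi_i r_i / sum_j pi_j r_j."""
    w = pi * r
    return w / w.sum()

pi = np.array([0.2, 0.3, 0.5])   # class priors (k = 3)
eta = np.array([0.1, 0.6, 0.3])  # class probabilities at some point x

r = link(eta, pi)
assert np.isclose(r[-1], 1.0)                 # canonical constraint r_k = p_k / p_k = 1
assert np.allclose(inverse_link(r, pi), eta)  # round trip recovers eta
```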

3 A Unified Framework for Multi-distribution DRE

3.1 Multi-distribution Density Ratio Estimation Problem Setup

Following the basic formulation of multi-class experiments in Section 2.1, we now introduce the problem setup of multi-distribution density ratio estimation (DRE). Recall that $\mathcal{X}$ is the common data domain and $P_1,\ldots,P_k$ are $k$ different distributions defined on $\mathcal{X}$ with densities $p_1,\ldots,p_k$. Suppose we are given $n_i$ i.i.d. samples $\{x_j^{(i)}\}_{j=1}^{n_i}$ from each distribution $P_i$. The goal of multi-distribution DRE is to estimate the density ratios between all pairs of distributions, $\{r_{ij}:=p_i/p_j\}_{i,j\in[k]}$, from the i.i.d. datasets $\{\{x_j^{(i)}\}_{j=1}^{n_i}\}_{i=1}^k$. In this paper, we assume that the density ratios are always well defined on the domain $\mathcal{X}$ (e.g., when the distributions have strictly positive densities), which is also a common assumption in the binary DRE problem (Kanamori et al., 2009; Kato & Teshima, 2021).

A naive approach to this problem is to separately estimate each density $p_i$ from $\{x_j^{(i)}\}_{j=1}^{n_i}$ and then plug $p_i$ and $p_j$ in to get $r_{ij}$. However, as previous theoretical works (Kpotufe, 2017; Nguyen et al., 2007; Kanamori et al., 2012; Que & Belkin, 2013) suggest, directly estimating density ratios has many advantages in practical settings: (1) optimal convergence rates depend only on the smoothness of the density ratio and not on that of the densities; (2) optimal rates depend only on the intrinsic dimension of the data, thus escaping the curse of dimensionality in density estimation. Inspired by these observations in binary DRE, this paper aims to develop a general framework for directly estimating multi-distribution density ratios. Moreover, in Section 4 we theoretically prove that various interesting facts (Menon & Ong, 2016; Sugiyama et al., 2012) that hold in the binary case extend to our multi-distribution case.

While most previous works focus on DRE in the binary case, multi-distribution DRE has many important downstream applications. For example, given an integrable function $\phi:\mathcal{X}\to\mathbb{R}$, suppose we want to use importance sampling to estimate the expectation of $\phi$ with respect to a target distribution $Q$ with density $q$ w.r.t. the base measure:

$$\mathbb{E}_{q(x)}[\phi(x)]=\int_{\mathcal{X}}q(x)\phi(x)\,\mathrm{d}x=\int_{\mathcal{X}}p(x)\frac{q(x)}{p(x)}\phi(x)\,\mathrm{d}x=\mathbb{E}_{p(x)}[r(x)\cdot\phi(x)]\quad(9)$$

where we use the density ratio $r=q/p$ to correct the bias caused by using samples from the proposal distribution $p$ rather than the target distribution $q$. However, in practice, finding a good proposal is critical yet challenging (Owen & Zhou, 2000). An alternative and more robust strategy is to use a population of different proposals (sampling schemes) together with a set of density ratios to correct the bias, which is also known as multiple importance sampling (MIS) (Cappé et al., 2004; Elvira et al., 2015). Given $k$ different proposals $p_1,\ldots,p_k$, the MIS estimate of the expectation is given by:

$$\mathbb{E}_{q(x)}[\phi(x)]=\sum_{i=1}^{k}\omega_{i}\,\mathbb{E}_{p_{i}(x)}\left[\frac{q(x)}{p_{i}(x)}\phi(x)\right]\quad(10)$$

where $\omega_i$ is the weight for each proposal $p_i$ and $\sum_i\omega_i=1$. Thus a more efficient and accurate multi-distribution DRE method leads to better MIS. In the context of multi-source off-policy evaluation (Kallus et al., 2021), the proposals correspond to a set of demonstration policies and the target distribution is the query policy whose performance we want to evaluate from the offline multi-source demonstrations; in the context of multi-domain transfer learning (covariate shift adaptation) (Bickel et al., 2008; Dinh et al., 2013), the proposals correspond to a set of data-generating distributions (e.g., multiple source domains or various data augmentation strategies) and the target is the test distribution we care about. Estimating multi-distribution density ratios also allows us to compute important information quantities among multiple random variables, such as the multi-distribution $f$-divergence in Equation (5), which can be used to analyze various kinds of discrepancy and correlation among multiple random variables and further has the potential to inspire new generative models for the multiple marginal matching problem (Cao et al., 2019).
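A minimal numeric sketch of the MIS identity in Eq. (10): we take the target $q=\mathcal{N}(0,1)$, $\phi(x)=x^2$ (so $\mathbb{E}_q[\phi]=1$), and two Gaussian proposals. The densities, proposal parameters, and weights below are our own illustrative choices, and we use exact density ratios $q/p_i$ as a sanity check; in practice these ratios would come from a multi-distribution DRE estimator:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
phi = lambda x: x ** 2                              # E_q[phi] = 1 for q = N(0, 1)
mus, sigmas, omegas = [-1.0, 1.0], [1.5, 1.5], [0.5, 0.5]  # two proposals, omega sums to 1

est = 0.0
for mu, sigma, omega in zip(mus, sigmas, omegas):
    x = mu + sigma * rng.standard_normal(200_000)   # samples from proposal p_i
    ratio = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, mu, sigma)  # density ratio q / p_i
    est += omega * np.mean(ratio * phi(x))          # Eq. (10), Monte Carlo form

assert abs(est - 1.0) < 0.05                        # close to the true E_q[phi] = 1
```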

3.2 Multi-distribution DRE via Bregman Divergence Minimization

Inspired by the success of Bregman divergence minimization in unifying various binary DRE methods (Sugiyama et al., 2012), in this section we propose a general framework for solving the multi-distribution density ratio estimation problem. First, we discuss our modeling choices. Although our goal is to estimate $\binom{k}{2}$ density ratios (between all possible pairs), the solution set $\{r_{ij}:=p_i/p_j\}_{i,j\in[k]}$ actually has only $k-1$ degrees of freedom (e.g., $r_{ik}=r_{ij}\cdot r_{jk}$). Thus, without loss of generality, we parametrize $k-1$ density ratio models $\hat{\bm{r}}_{\bm{\theta}}=(\hat{r}_{\theta_1},\ldots,\hat{r}_{\theta_{k-1}})$ to approximate the true canonical density ratios $\bm{r}=(r_1,\ldots,r_{k-1})$, where $r_i:=p_i/p_k$ for $i\in[k-1]$. For notational simplicity, we omit the dependence on the parameters $\bm{\theta}$ and write our density ratio models as $\hat{\bm{r}}=(\hat{r}_1,\ldots,\hat{r}_{k-1})$. An advantage of this modeling choice is that any density ratio can be recovered in one step of computation, $\frac{p_i}{p_j}=\frac{p_i/p_k}{p_j/p_k}=\frac{r_i}{r_j}$, thus avoiding large compounding errors while naturally ensuring consistency within the solution set (if we instead parametrized $\hat{r}_{ij}$, $\hat{r}_{jk}$ and $\hat{r}_{ik}$ separately, we would have to ensure that $\hat{r}_{ik}=\hat{r}_{ij}\cdot\hat{r}_{jk}$).

Since our goal is to optimize $\hat{\bm{r}}$ to approximate the true density ratios $\bm{r}$, we use the Bregman divergence (Def. 2) to measure the discrepancy between $\bm{r}$ and $\hat{\bm{r}}$. Specifically, for any strictly convex function $f:\mathbb{R}^{k-1}_+\to\mathbb{R}$ and any $x\in\mathcal{X}$, we have the following point-wise optimization problem:

$$\min_{\hat{\bm{r}}(x)\in\mathbb{R}^{k-1}_{+}}\mathbf{B}_{f}(\bm{r}(x),\hat{\bm{r}}(x))=f(\bm{r}(x))-f(\hat{\bm{r}}(x))-\langle\nabla f(\hat{\bm{r}}(x)),\bm{r}(x)-\hat{\bm{r}}(x)\rangle\quad(11)$$

which corresponds to the difference between the value of $f$ at $\bm{r}$ and the value of the first-order Taylor expansion of $f$ around $\hat{\bm{r}}$ evaluated at $\bm{r}$. Although this formulation can be understood as a regression problem from $\hat{\bm{r}}(x)$ to the true density ratios $\bm{r}(x)$, we actually only have i.i.d. samples $x\sim p_1,\ldots,p_k$ rather than the true targets $\bm{r}(x)$. In this case, we use the following expected Bregman divergence to measure the overall discrepancy between the true density ratios $\bm{r}$ and the density ratio models $\hat{\bm{r}}$:

$$\mathcal{L}_{\text{DRE}}(\hat{\bm{r}};D)=\int_{\mathcal{X}}p_{k}(x)\Big(f(\bm{r}(x))-f(\hat{\bm{r}}(x))-\langle\nabla f(\hat{\bm{r}}(x)),\bm{r}(x)-\hat{\bm{r}}(x)\rangle\Big)\mathrm{d}x\quad(12)$$
$$=\mathbb{E}_{p_{k}(x)}\left[\langle\nabla f(\hat{\bm{r}}(x)),\hat{\bm{r}}(x)\rangle-f(\hat{\bm{r}}(x))\right]-\sum_{i\in[k-1]}\mathbb{E}_{p_{i}(x)}[\partial_{i}f(\hat{\bm{r}}(x))]+C\quad(13)$$

where $C:=\int_{\mathcal{X}}p_k(x)f(\bm{r}(x))\mathrm{d}x=\mathbf{D}_f(P_1,\ldots,P_{k-1}\|P_k)$ is a constant with respect to $\hat{\bm{r}}$, and the equality follows from the fact that $p_k\cdot(r_1,\ldots,r_{k-1})=(p_1,\ldots,p_{k-1})$ by the definition of $\bm{r}$. The rationale behind this choice is that it allows an unbiased estimate of the discrepancy between $\bm{r}$ and $\hat{\bm{r}}$ using only i.i.d. samples from $p_1,\ldots,p_k$. Specifically, since $C$ is a constant, we have the following optimization problem over $\hat{\bm{r}}$ to approximate the true density ratios (where each expectation $\mathbb{E}_{p_i}$ can be empirically estimated using samples from $p_i$):

$$\min_{\hat{\bm{r}}:\mathcal{X}\to\mathbb{R}^{k-1}_{+}}\;\mathbb{E}_{p_{k}(x)}\left[\langle\nabla f(\hat{\bm{r}}(x)),\hat{\bm{r}}(x)\rangle-f(\hat{\bm{r}}(x))\right]-\sum_{i\in[k-1]}\mathbb{E}_{p_{i}(x)}\left[\partial_{i}f(\hat{\bm{r}}(x))\right]\quad(14)$$
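To make the objective in Eq. (14) concrete, the following sketch evaluates its population value exactly on a small discrete domain (our own illustrative densities, $k=3$), using the quadratic choice $f(\hat{\bm{r}})=\frac{1}{2}\|\hat{\bm{r}}-\bm{1}\|^2$ from the Multi-LSIF instantiation in Section 5, and checks that the true ratios minimize it (since the objective equals $\mathbb{E}_{p_k}[\mathbf{B}_f(\bm{r},\hat{\bm{r}})]-C$ and the Bregman divergence is nonnegative):

```python
import numpy as np

# Three distributions on the 3-point domain {0, 1, 2}; row i is p_{i+1}, last row is p_k.
p = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4]])

f      = lambda r: 0.5 * np.sum((r - 1.0) ** 2, axis=-1)  # quadratic convex function
grad_f = lambda r: r - 1.0

def objective(r_hat):
    """Population value of Eq. (14): E_{p_k}[<grad f, r_hat> - f] - sum_i E_{p_i}[d_i f]."""
    first = np.sum(p[-1] * (np.sum(grad_f(r_hat) * r_hat, axis=-1) - f(r_hat)))
    second = sum(np.sum(p[i] * grad_f(r_hat)[:, i]) for i in range(2))
    return first - second

r_true = (p[:-1] / p[-1]).T   # shape (3, 2): true ratios r_i(x) = p_i(x) / p_k(x)

# Any multiplicative perturbation of the true ratios can only increase the objective.
rng = np.random.default_rng(0)
for _ in range(100):
    r_pert = r_true * np.exp(0.3 * rng.standard_normal(r_true.shape))
    assert objective(r_true) <= objective(r_pert) + 1e-12
```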

Interestingly, the above multi-distribution DRE formulation, which is based on Bregman divergence minimization, can alternatively be derived from the perspective of variational estimation of multi-distribution $f$-divergence. In the following, we briefly discuss this interpretation of Eq. (14).

Based on Fenchel duality, we can represent any strictly convex function $f:\mathbb{R}_+^{k-1}\to\mathbb{R}\cup\{+\infty\}$ through its conjugate function $f^*(\bm{s}):=\max_{\bm{r}\in\mathbb{R}^{k-1}_+}\langle\bm{s},\bm{r}\rangle-f(\bm{r})$ as:

$$f(\bm{r}(x))=\max_{\bm{s}:\mathcal{X}\to\mathbb{R}^{k-1}}\langle\bm{r}(x),\bm{s}(x)\rangle-f^{*}(\bm{s}(x)),\quad\text{for any }x\in\mathcal{X}.\quad(15)$$

In order to estimate the multi-distribution $f$-divergence defined in Eq. (5) using only samples from $P_1,\ldots,P_k$ (instead of their density information), we consider the following variational representation of the multi-distribution $f$-divergence, obtained by substituting Eq. (15) into Eq. (5):

$$\mathbf{D}_{f}(P_{1},\ldots,P_{k-1}\|P_{k})=-\min_{\bm{s}:\mathcal{X}\to\mathbb{R}^{k-1}}\left[-\sum_{i\in[k-1]}\mathbb{E}_{p_{i}(x)}[\bm{s}(x)]_{i}+\mathbb{E}_{p_{k}(x)}f^{*}(\bm{s}(x))\right]\quad(16)$$

We then have the following proposition revealing the equivalence between the optimization problems in Eqs. (14) and (16).

Proposition 1 (DRE via variational estimation of multi-distribution $f$-divergence).

Given a strictly convex function $f:\mathbb{R}^{k-1}_+\to\mathbb{R}\cup\{+\infty\}$, the optimization problem in Eq. (14) (induced by minimizing the expected Bregman divergence $\mathbf{B}_f(\bm{r},\hat{\bm{r}})$) is equivalent to the one in Eq. (16) (for variational estimation of multi-distribution $f$-divergence) under the change of variables $\nabla f(\hat{\bm{r}}(x))=\bm{s}(x)$ for all $x\in\mathcal{X}$.

4 Connecting Losses for Multi-class Classification and DRE

In this section, we rederive a result similar to that of Nock et al. (2016) by directly generalizing the Bregman identity of Menon & Ong (2016) to the multivariate case, which establishes the theoretical connection between multi-distribution DRE and multi-class classification.

In Section 2.1, we showed that exact minimization of the excess risk for any strictly proper loss $\ell$ yields the true class probability function $\bm{\eta}$, and consequently the true density ratios $\bm{r}$ through the link function $\Psi_{\mathrm{dr}}(\bm{\eta})$. In the following, we take a further step and show that minimizing any strictly proper loss is essentially equivalent to minimizing an expected Bregman divergence between the true density ratios $\bm{r}$ and the approximate density ratios $\hat{\bm{r}}$, thus generalizing the binary-case theoretical results of Menon & Ong (2016) to the multi-distribution case and justifying the validity of using any strictly proper scoring rule (e.g., the Brier score (Brier et al., 1950) and the pseudo-spherical score (Good, 1971)) for multi-distribution DRE. All proofs for this section can be found in Appendix A.3.

We start by introducing the following multivariate Bregman identity.

Lemma 1 (Multivariate Bregman Identity).

Given a convex function $f:\mathbb{R}^{k-1}\to\mathbb{R}$, we can define an associated function $f^{\circledast}(u_1,\ldots,u_{k-1})=\left(1+\sum_{i\in[k-1]}u_i\right)f\left(\frac{1}{1+\sum_{i\in[k-1]}u_i}\cdot\bm{u}\right)$. We can show that (i) $f^{\circledast}$ is convex and (ii) for any $\bm{u},\bm{v}\in\mathbb{R}^{k-1}$, the associated Bregman divergences satisfy:

$$\mathbf{B}_{f}\left(\frac{1}{1+\sum_{i\in[k-1]}u_{i}}\cdot\bm{u},\;\frac{1}{1+\sum_{i\in[k-1]}v_{i}}\cdot\bm{v}\right)=\frac{1}{1+\sum_{i\in[k-1]}u_{i}}\mathbf{B}_{f^{\circledast}}(\bm{u},\bm{v}).\quad(17)$$

One can then apply Lemma 1 with $u_i=\frac{\pi_i}{\pi_k}r_i$ and $v_i=\frac{\pi_i}{\pi_k}\hat{r}_i$ for each $i\in[k-1]$, and use the fact that $\mathbf{B}_{f^{\circledast}_{\pi}}(\bm{r},\hat{\bm{r}})=\mathbf{B}_{f^{\circledast}}(\bm{u},\bm{v})$ for $f^{\circledast}_{\pi}(\bm{r})=f^{\circledast}\left(\frac{1}{\pi_k}\bm{\pi}_{[1:k-1]}\circ\bm{r}\right)$, to establish the following connection between the optimality gaps of density ratio estimators and class probability estimators. Here $\bm{a}\circ\bm{b}$ denotes the element-wise product of vectors $\bm{a}$ and $\bm{b}$, and $\bm{\pi}_{[1:k-1]}\in\mathbb{R}^{k-1}$ denotes the restriction of $\bm{\pi}$ to its first $k-1$ coordinates.
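The identity in Lemma 1 holds for any convex $f$; the following sketch checks it numerically for one strictly convex test function of our choosing, $f(\bm{x})=\sum_i x_i\log x_i$, using central-difference gradients so no gradient has to be derived by hand:

```python
import numpy as np

def num_grad(fun, x, eps=1e-6):
    """Central-difference gradient of a scalar function fun at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (fun(x + e) - fun(x - e)) / (2 * eps)
    return g

def bregman(fun, a, b):
    """B_fun(a, b) = fun(a) - fun(b) - <a - b, grad fun(b)>, per Definition 2."""
    return fun(a) - fun(b) - np.dot(a - b, num_grad(fun, b))

f = lambda x: np.sum(x * np.log(x))                       # strictly convex test function
f_assoc = lambda u: (1 + u.sum()) * f(u / (1 + u.sum()))  # the associated f^{circledast}

u, v = np.array([0.7, 1.3]), np.array([1.1, 0.4])
lhs = bregman(f, u / (1 + u.sum()), v / (1 + v.sum()))    # left side of Eq. (17)
rhs = bregman(f_assoc, u, v) / (1 + u.sum())              # right side of Eq. (17)
assert np.isclose(lhs, rhs, atol=1e-5)
```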

Proposition 2.

For any convex function $f:\mathbb{R}^{k-1}_+\to\mathbb{R}$ and two density ratio vectors $\bm{r}(x)$ and $\hat{\bm{r}}(x)$, one can construct the corresponding class probability vectors $\bm{\eta}(x)=\Psi_{\mathrm{dr}}^{-1}(\bm{r}(x))$ and $\hat{\bm{\eta}}(x)=\Psi_{\mathrm{dr}}^{-1}(\hat{\bm{r}}(x))$ through the inverse link function in Eq. (8), and obtain:

$$\mathbf{B}_{f}\left(\bm{\eta}(x),\hat{\bm{\eta}}(x)\right)=\frac{\pi_{k}}{\pi_{k}+\sum_{i\in[k-1]}\pi_{i}r_{i}(x)}\mathbf{B}_{f^{\circledast}_{\pi}}\left(\bm{r}(x),\hat{\bm{r}}(x)\right)\quad\text{for all }x\in\mathcal{X},\quad(18)$$

where the convex function $f^{\circledast}_{\pi}$ induced by a prior distribution $\bm{\pi}\in\Delta_k$ is defined as

$$f^{\circledast}_{\pi}(r_{1},\ldots,r_{k-1}):=\left(1+\sum_{i\in[k-1]}\pi_{i}r_{i}/\pi_{k}\right)\cdot f\left(\frac{\bm{\pi}_{[1:k-1]}\circ\bm{r}}{\pi_{k}+\sum_{i\in[k-1]}\pi_{i}r_{i}}\right).\quad(19)$$

Combining Proposition 2 with the Bregman divergence representation of the point-wise regret for a proper loss $\ell$ in multi-class classification in Eq. (4), we obtain the following main theorem, which interprets the minimization of multi-class classification regret as multi-distribution DRE under expected Bregman divergence minimization.

Theorem 2.

Given any strictly proper loss $\ell$, for any joint data distribution $D(X,Y)$ with class prior $\bm{\pi}\in\Delta_k$, the multi-class classification regret defined in Eq. (4) satisfies:

$$\mathrm{reg}(\hat{\bm{\eta}};M,\bm{\eta},\ell)=\pi_{k}\,\mathbb{E}_{p_{k}(x)}\mathbf{B}_{f^{\circledast}_{\pi}}(\bm{r}(x),\hat{\bm{r}}(x)),\quad(20)$$

where $f^{\circledast}_{\pi}$ is as defined in Eq. (19), and $\bm{r}=\Psi_{\mathrm{dr}}(\bm{\eta})$ and $\hat{\bm{r}}=\Psi_{\mathrm{dr}}(\hat{\bm{\eta}})$ as defined in Eq. (7).

Theorem 2 generalizes a known equivalence between density ratio estimation and class probability estimation in the binary case (see Section 5 of Menon & Ong (2016)), providing a similar equivalence for the more complicated multi-class experiments. Moreover, compared to the binary-case result, we provide a simpler proof, relax the twice-differentiability assumption on the convex function $f$ induced by the proper loss $\ell$ (i.e., $f=-\underline{L}$; see Theorem 1 for details), and generalize the argument to an arbitrary prior distribution $\bm{\pi}\in\Delta_k$ instead of the uniform prior $\pi_1=\pi_2=1/2$ considered by Menon & Ong (2016).

Moreover, we note that the multi-distribution $f$-divergence among the class conditionals $P_1,\ldots,P_k$ also corresponds to the statistical information measure in multi-class experiments (DeGroot, 1962), defined as the gap between the prior and posterior generalized entropies. Since we have established the equivalence between multi-distribution DRE (Eq. (14)) and variational estimation of multi-distribution $f$-divergence (Eq. (16)), we can show that, by choosing particular convex functions (associated with the loss $\ell$ for multi-class classification), multi-distribution DRE can be viewed as estimating the statistical information measure in multi-class experiments. See Appendix A.3.1 for detailed discussions.

5 Examples of Multi-distribution DRE

In the binary density ratio matching under Bregman divergence framework (Sugiyama et al., 2012), choosing various convex functions recovers popular binary DRE methods such as KLIEP (Sugiyama et al., 2008), LSIF (Kanamori et al., 2009) and logistic regression (Franklin, 2005). In this section, we provide some instantiations of our multi-distribution DRE framework. Specifically, Section 3.2 shows that any strictly convex multivariate function $f:\mathbb{R}^{k-1}_+\to\mathbb{R}$ induces a proper loss for multi-distribution DRE, and Section 4 justifies that any strictly proper scoring rule composite with $\Psi_{\mathrm{dr}}$ can also be used for multi-distribution DRE. We briefly discuss some choices of the convex function or proper scoring rule here, and provide detailed derivations in Appendix A.4.

5.1 Methods Induced by Convex Functions

Multi-class Logistic Regression.  From Section 2.3, we know that there is a one-to-one correspondence between a class probability estimator and a density ratio estimator: 𝒓^=Ψdr𝜼^\hat{{\bm{r}}}=\Psi_{\mathrm{dr}}\circ\hat{{\bm{\eta}}} and 𝜼^=Ψdr1𝒓^\hat{{\bm{\eta}}}=\Psi_{\mathrm{dr}}^{-1}\circ\hat{{\bm{r}}}. For clarity of presentation, here we assume the class prior distribution 𝝅{\bm{\pi}} is uniform, so that r^i(x)=η^i(x)/η^k(x)\hat{r}_{i}(x)=\hat{\eta}_{i}(x)/\hat{\eta}_{k}(x) and η^i(x)=r^i(x)/j=1kr^j(x)\hat{\eta}_{i}(x)=\hat{r}_{i}(x)/\sum_{j=1}^{k}\hat{r}_{j}(x). To recover the loss of multi-class logistic regression, we choose the convex function f(r^1,,r^k1)=1ki=1kr^ilog(r^i/j=1kr^j)f(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\frac{1}{k}\sum_{i=1}^{k}\hat{r}_{i}\log\left(\hat{r}_{i}/\sum_{j=1}^{k}\hat{r}_{j}\right) (with r^k1\hat{r}_{k}\equiv 1). In this case, the loss in Eq. (14) reduces to:

1k𝔼pk(x)[log(j=1kr^j(x))]1ki=1k1𝔼pi(x)[log(r^i(x)j=1kr^j(x))]=(1ki=1k𝔼pi(x)[logη^i(x)])\frac{1}{k}\mathbb{E}_{p_{k}(x)}\left[\log\left(\sum_{j=1}^{k}\hat{r}_{j}(x)\right)\right]-\frac{1}{k}\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\log\left(\frac{\hat{r}_{i}(x)}{\sum_{j=1}^{k}\hat{r}_{j}(x)}\right)\right]=-\left(\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}_{p_{i}(x)}[\log\hat{\eta}_{i}(x)]\right) (21)

We provide discussions for the general case (non-uniform prior 𝝅{\bm{\pi}}) in Appendix A.4.1. Interestingly, we note that the above convex function also gives rise to the multi-distribution Jensen-Shannon divergence (Garcia-Garcia & Williamson, 2012) (also known as the information radius (Sibson, 1969), 𝐃f(P1,,Pk)=1ki=1kDKL(Pi1kj=1kPj)\mathbf{D}_{f}(P_{1},\ldots,P_{k})=\frac{1}{k}\sum_{i=1}^{k}D_{\mathrm{KL}}(P_{i}\|\frac{1}{k}\sum_{j=1}^{k}P_{j})) up to an additive constant of logk\log k.
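As an illustration, the right-hand side of Eq. (21) is simply the multi-class cross-entropy loss applied to logits logr^i\log\hat{r}_{i}, with the last logit fixed to 00 since r^k1\hat{r}_{k}\equiv 1. The following is a minimal NumPy sketch under the uniform-prior setting above; function and variable names are our own illustration, not from the paper's code.

```python
import numpy as np

def multi_lr_dre_loss(log_r, labels):
    """Multi-class logistic-regression loss for DRE (Eq. 21, uniform prior).

    log_r : (n, k) array of log density-ratio estimates log r_hat_i(x);
            the last column should be fixed to 0 since r_hat_k = 1.
    labels: (n,) array; labels[m] = i means sample m was drawn from p_i.
    Returns the empirical average of -log eta_hat_y(x), where
    eta_hat_i = r_hat_i / sum_j r_hat_j (a softmax over log r_hat).
    """
    # log-softmax, shifted by the row maximum for numerical stability
    z = log_r - log_r.max(axis=1, keepdims=True)
    log_eta = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_eta[np.arange(len(labels)), labels].mean()
```

For instance, with all log-ratios initialized to zero the loss equals logk\log k, the entropy of a uniform guess.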

Multi-LSIF.  When the convex function associated with the Bregman divergence is chosen to be f(r^1,,r^k1)=12i=1k1(r^i1)2=12𝒓^𝟏2f(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\frac{1}{2}\sum_{i=1}^{k-1}(\hat{r}_{i}-1)^{2}=\frac{1}{2}\|\hat{{\bm{r}}}-\bm{1}\|^{2}, the loss in Eq. (14) reduces to:

12i=1k1𝔼pk(x)[r^i2(x)1]i=1k1𝔼pi(x)[r^i(x)1]=12i=1k1𝔼pk(x)[(r^i(x)ri(x))2]C\frac{1}{2}\sum_{i=1}^{k-1}\mathbb{E}_{p_{k}(x)}\left[\hat{r}_{i}^{2}(x)-1\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\hat{r}_{i}(x)-1\right]=\frac{1}{2}\sum_{i=1}^{k-1}\mathbb{E}_{p_{k}(x)}\left[(\hat{r}_{i}(x)-r_{i}(x))^{2}\right]-C (22)

where C=12𝔼pk(x)[𝒓(x)𝟏2]C=\frac{1}{2}\mathbb{E}_{p_{k}(x)}\left[\|{\bm{r}}(x)-\bm{1}\|^{2}\right] is a constant w.r.t. 𝒓^\hat{{\bm{r}}}, and the minimizer of the above loss matches the true density ratios. This strictly generalizes the Least-Squares Importance Fitting (LSIF) (Kanamori et al., 2009) method to the multi-distribution case.
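Given ratio estimates evaluated on samples from the reference distribution pkp_{k} and on samples from each pip_{i}, the empirical version of the left-hand side of Eq. (22) can be computed as in the following sketch (function and argument names are our own illustration):

```python
import numpy as np

def multi_lsif_loss(r_hat_on_pk, r_hat_on_pi):
    """Empirical Multi-LSIF loss (left-hand side of Eq. 22).

    r_hat_on_pk: (n_k, k-1) values of all ratio estimates r_hat_1..r_hat_{k-1}
                 evaluated on samples drawn from the reference p_k.
    r_hat_on_pi: list of length k-1; entry i is a 1-D array holding the
                 values of r_hat_{i+1} on samples drawn from p_{i+1}.
    """
    # 0.5 * sum_i E_{p_k}[r_hat_i^2 - 1], estimated by sample means
    quad = 0.5 * (r_hat_on_pk ** 2 - 1.0).sum(axis=1).mean()
    # sum_i E_{p_i}[r_hat_i - 1], estimated by sample means
    lin = sum((r_i - 1.0).mean() for r_i in r_hat_on_pi)
    return quad - lin
```

When all estimates equal the constant 11 (the correct answer for identical distributions), the loss is exactly zero.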

Besides, we also consider the following simple convex functions that either strictly generalize their binary DRE counterparts as above, or lead to completely new methods for multi-distribution DRE:

  • Multi-KLIEP.  f(r^1,,r^k1)=i=1k1(r^ilogr^ir^i)=𝒓^,log(𝒓^)𝒓^1f(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\sum_{i=1}^{k-1}(\hat{r}_{i}\log\hat{r}_{i}-\hat{r}_{i})=\langle\hat{{\bm{r}}},\log(\hat{{\bm{r}}})\rangle-\|\hat{{\bm{r}}}\|_{1}. This strictly generalizes the Kullback–Leibler Importance Estimation Procedure (KLIEP) (Sugiyama et al., 2008) to the multi-distribution case. See Appendix A.4.3 for more details.

  • Power.  f(r^1,,r^k1)=i=1k1r^iα=𝒓^ααf(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\sum_{i=1}^{k-1}\hat{r}_{i}^{\alpha}=\|\hat{{\bm{r}}}\|_{\alpha}^{\alpha},   for α>1\alpha>1. This strictly generalizes the robust DRE method in (Sugiyama et al., 2012), which recovers Multi-KLIEP when α1\alpha\to 1 and Multi-LSIF when α=2\alpha=2. See Appendix A.4.4 for more details.

  • Quadratic.  f(r^1,,r^k1)=𝒓^𝑯𝒓^+𝒒𝒓^f(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\hat{{\bm{r}}}^{\top}{\bm{H}}\hat{{\bm{r}}}+{\bm{q}}^{\top}\hat{{\bm{r}}}, for any positive definite matrix 𝑯0{\bm{H}}\succ 0. When 𝑯{\bm{H}} is the identity matrix and 𝒒=(2,,2){\bm{q}}=(-2,\ldots,-2), this is equivalent to Multi-LSIF.

  • LogSumExp.  f(r^1,,r^k1)=αlog(i=1k1exp(r^i/α))f(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\alpha\log\left(\sum_{i=1}^{k-1}\exp(\hat{r}_{i}/\alpha)\right) for α>0\alpha>0.

In principle, we can use any desired strictly convex function f:+k1f:{\mathbb{R}}^{k-1}_{+}\to{\mathbb{R}} within the optimization problem in Eq. (14), which highlights the potential of our unified framework for discovering novel objectives for multi-distribution DRE. In terms of modeling flexibility, the curvature of each convex function encodes a different inductive bias that may favor different downstream applications, and we leave the design of convex functions better suited to specific DRE tasks as an exciting avenue for future work.
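To illustrate this plug-in flexibility, the loss in Eq. (14) can be implemented generically once ff and f\nabla f are supplied. The sketch below (all names are our own, hypothetical choices) instantiates it with the Multi-KLIEP function from the list above, for which the loss simplifies to 𝔼pk[ir^i]i𝔼pi[logr^i]\mathbb{E}_{p_{k}}[\sum_{i}\hat{r}_{i}]-\sum_{i}\mathbb{E}_{p_{i}}[\log\hat{r}_{i}].

```python
import numpy as np

def bregman_dre_loss(f, grad_f, r_on_pk, r_on_pi):
    """Generic multi-distribution DRE loss induced by a strictly convex f:
    E_{p_k}[<grad_f(r), r> - f(r)] - sum_i E_{p_i}[d_i f(r)]  (Eq. 14).

    f, grad_f : callables mapping an (n, k-1) array to (n,) / (n, k-1).
    r_on_pk   : (n_k, k-1) ratio estimates on samples from p_k.
    r_on_pi   : list of (n_i, k-1) ratio estimates on samples from each p_i.
    """
    g = grad_f(r_on_pk)
    ref_term = ((g * r_on_pk).sum(axis=1) - f(r_on_pk)).mean()
    data_term = sum(grad_f(r)[:, i].mean() for i, r in enumerate(r_on_pi))
    return ref_term - data_term

# Multi-KLIEP instantiation: f(r) = sum_i (r_i log r_i - r_i), grad f = log r
f_kliep = lambda r: (r * np.log(r) - r).sum(axis=1)
grad_f_kliep = np.log
```

Swapping in a different pair `(f, grad_f)` (e.g. the Power or LogSumExp functions above) yields the corresponding objective with no other code changes.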

5.2 Methods Induced by Proper Scoring Rules Composite with Ψdr\Psi_{\mathrm{dr}}

From Section 4, we know that any strictly proper loss :[k]×Δk\ell:[k]\times\Delta_{k}\to{\mathbb{R}} (or strictly proper scoring rule S(i,𝜼^)=(i,𝜼^)S(i,\hat{{\bm{\eta}}})=-\ell(i,\hat{{\bm{\eta}}})) in conjunction with the link function Ψdr\Psi_{\mathrm{dr}} also induces valid losses for multi-distribution DRE:

min𝒓^:𝒳+k1𝔼D(x,y)[(y,𝜼^(x))]=𝔼xM,y𝜼(x)[(y,Ψdr1(𝒓^(x)))]\min_{\hat{{\bm{r}}}:{\mathcal{X}}\to{\mathbb{R}}^{k-1}_{+}}\mathbb{E}_{D(x,y)}[\ell(y,\hat{{\bm{\eta}}}(x))]=\mathbb{E}_{x\sim M,y\sim{\bm{\eta}}(x)}[\ell(y,\Psi_{\mathrm{dr}}^{-1}(\hat{{\bm{r}}}(x)))] (23)

In this work, we consider using the following classic proper scoring rules (Gneiting & Raftery, 2007), where 𝜼^\hat{{\bm{\eta}}} is parametrized as Ψdr1(𝒓^)\Psi_{\mathrm{dr}}^{-1}(\hat{{\bm{r}}}) (i.e. η^i=πir^i/j=1kπjr^j\hat{\eta}_{i}=\pi_{i}\hat{r}_{i}/\sum_{j=1}^{k}\pi_{j}\hat{r}_{j}):

  • Logarithm score. (Good, 1952) The loss is specified as (i,𝜼^)=log(η^i)\ell(i,\hat{{\bm{\eta}}})=-\log(\hat{\eta}_{i}), which also recovers the loss of multi-class logistic regression in Section 5.1.

  • Brier score. (Brier et al., 1950) The loss is specified as (i,𝜼^)=2η^i+j=1kη^j2+1\ell(i,\hat{{\bm{\eta}}})=-2\hat{\eta}_{i}+\sum_{j=1}^{k}\hat{\eta}_{j}^{2}+1.

  • Logarithm pseudo-spherical score. (Good, 1971; Fujisawa & Eguchi, 2008) The loss is specified as (i,𝜼^)=log(η^iα1(j=1kη^jα)(α1)/α)\ell(i,\hat{{\bm{\eta}}})=-\log\left(\frac{\hat{\eta}_{i}^{\alpha-1}}{(\sum_{j=1}^{k}\hat{\eta}_{j}^{\alpha})^{(\alpha-1)/\alpha}}\right) for α>1\alpha>1.
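For instance, the Brier score composite with Ψdr1\Psi_{\mathrm{dr}}^{-1} can be computed directly from ratio estimates, as in the following sketch (a minimal illustration with hypothetical names, assuming r^k1\hat{r}_{k}\equiv 1 is appended as the last column):

```python
import numpy as np

def brier_dre_loss(r_hat, labels, pi):
    """Brier score composite with the link Psi_dr^{-1} (Section 5.2).

    r_hat : (n, k) ratio estimates, last column fixed to 1 (r_hat_k = 1).
    labels: (n,) source index of each sample (y in [k]).
    pi    : (k,) class prior.
    """
    w = pi * r_hat                           # pi_i * r_hat_i
    eta = w / w.sum(axis=1, keepdims=True)   # eta_hat_i via the link
    n = len(labels)
    # Brier loss: -2 eta_hat_y + sum_j eta_hat_j^2 + 1, averaged over samples
    return (-2 * eta[np.arange(n), labels] + (eta ** 2).sum(axis=1) + 1).mean()
```

With uniform ratios and a uniform prior, every η^i=1/k\hat{\eta}_{i}=1/k and the loss equals 11/k1-1/k.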

6 Experiments

In this section, we verify the validity of our framework, and study and compare the various instantiations introduced in Section 5, on a variety of tasks that all rely on accurate multi-distribution density ratio estimation. In particular, the tasks we consider are density ratio estimation among multiple multivariate Gaussian distributions, anomaly detection on CIFAR-10 (Krizhevsky et al., 2009), multi-target MNIST generation (LeCun et al., 1998) and multi-distribution off-policy policy evaluation. We discuss the basic problem setups, evaluation metrics and experimental results below, and provide more experimental details for each task in Appendix A.5.

Synthetic Data Experiments.  We first apply our methods to estimate density ratios among k=5k=5 multivariate Gaussian distributions with different mean vectors and identity covariance matrices. We conduct experiments for data dimensions ranging from 2 to 50. Since Gaussian distributions have tractable densities, we know the ground-truth density ratio functions, and we calculate the mean absolute error (MAE) between all (k2)k\choose 2 true density ratios and the learned ones:

MAE(𝒓,𝒓^;M(x))=2k(k1)𝔼M(x)[1i<jk|rij(x)r^ij(x)|]\displaystyle\mathrm{MAE}({\bm{r}},\hat{{\bm{r}}};M(x))=\frac{2}{k(k-1)}\mathbb{E}_{M(x)}\left[\sum_{1\leq i<j\leq k}\left|r_{ij}(x)-\hat{r}_{ij}(x)\right|\right]

where the density ratio between pip_{i} and pjp_{j} is recovered as r^i/r^j\hat{r}_{i}/\hat{r}_{j}, as discussed in Section 3.2. We summarize the results in Table 1, which shows that multi-class logistic regression and the Brier score composite with Ψdr\Psi_{\mathrm{dr}} achieve superior performance on this task.
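The metric above can be computed from the learned per-distribution ratios as in this sketch (our own illustrative code, using the convention that column kk holds rk1r_{k}\equiv 1):

```python
import numpy as np

def pairwise_mae(r_true, r_hat):
    """MAE over all (k choose 2) pairwise ratios (the synthetic-data metric).

    r_true, r_hat: (n, k) arrays whose column i holds r_i(x) = p_i(x)/p_k(x)
    on samples x ~ M (so the last column is all ones); the pairwise ratio
    r_ij is recovered as r_i / r_j.
    """
    n, k = r_true.shape
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            total += np.abs(r_true[:, i] / r_true[:, j]
                            - r_hat[:, i] / r_hat[:, j]).mean()
    return 2.0 * total / (k * (k - 1))
```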

OOD Detection on CIFAR-10.  Suppose we have kk different distributions p1,,pkp_{1},\ldots,p_{k}, where pk=i[k1]αipip_{k}=\sum_{i\in[k-1]}\alpha_{i}p_{i} (with i[k1]αi=1\sum_{i\in[k-1]}\alpha_{i}=1 and i,αi>0\forall i,\alpha_{i}>0). For each distribution pip_{i} (ik1i\leq k-1), samples from the mixture distribution pkp_{k} contain both inlier and outlier samples. The goal of this task is to identify the inlier samples within the pool of mixture samples. In particular, we use the learned density ratio r^i\hat{r}_{i} as the score function to retrieve the inlier samples of pip_{i}, since the true density ratio function ri=pi/j[k1]αjpjr_{i}=p_{i}/\sum_{j\in[k-1]}\alpha_{j}p_{j} tends to be larger for samples from pip_{i} and smaller for samples from the other distributions. We report the AUROC averaged over the k1k-1 scoring functions.

Table 1: Mean absolute error for multi-distribution density ratio estimation among five multivariate Gaussian distributions. Results are averaged across three random seeds.
Method Dim=2\mathrm{Dim}=2 Dim=5\mathrm{Dim}=5 Dim=10\mathrm{Dim}=10 Dim=20\mathrm{Dim}=20 Dim=30\mathrm{Dim}=30 Dim=40\mathrm{Dim}=40 Dim=50\mathrm{Dim}=50
Random Init 1.724±0.031.724\pm 0.03 1.723±0.0081.723\pm 0.008 1.728±0.021.728\pm 0.02 1.765±0.0171.765\pm 0.017 1.749±0.0091.749\pm 0.009 1.753±0.0021.753\pm 0.002 1.768±0.0081.768\pm 0.008
Multi-LR 0.044±0.003\bm{0.044}\pm 0.003 0.048±0.005\bm{0.048}\pm 0.005 0.061±0.002\bm{0.061}\pm 0.002 0.07±0.001\bm{0.07}\pm 0.001 0.081±0.002\bm{0.081}\pm 0.002 0.089±0.001\bm{0.089}\pm 0.001 0.098±0.002\bm{0.098}\pm 0.002
Multi-KLIEP 0.051±0.0020.051\pm 0.002 0.066±0.0020.066\pm 0.002 0.074±0.00.074\pm 0.0 0.089±0.0020.089\pm 0.002 0.105±0.0050.105\pm 0.005 0.112±0.0040.112\pm 0.004 0.123±0.0030.123\pm 0.003
Multi-LSIF 0.073±0.0060.073\pm 0.006 0.097±0.0010.097\pm 0.001 0.109±0.0050.109\pm 0.005 0.124±0.0030.124\pm 0.003 0.144±0.0040.144\pm 0.004 0.141±0.0050.141\pm 0.005 0.158±0.0040.158\pm 0.004
Power 0.054±0.0030.054\pm 0.003 0.073±0.0010.073\pm 0.001 0.081±0.0040.081\pm 0.004 0.104±0.0030.104\pm 0.003 0.117±0.0030.117\pm 0.003 0.123±0.0050.123\pm 0.005 0.135±0.0040.135\pm 0.004
Brier 0.042±0.002\bm{0.042}\pm 0.002 0.056±0.003\bm{0.056}\pm 0.003 0.066±0.003\bm{0.066}\pm 0.003 0.081±0.002\bm{0.081}\pm 0.002 0.086±0.002\bm{0.086}\pm 0.002 0.094±0.002\bm{0.094}\pm 0.002 0.105±0.001\bm{0.105}\pm 0.001
Spherical 0.103±0.0070.103\pm 0.007 0.106±0.0060.106\pm 0.006 0.115±0.0040.115\pm 0.004 0.121±0.0050.121\pm 0.005 0.125±0.0060.125\pm 0.006 0.132±0.0030.132\pm 0.003 0.138±0.0110.138\pm 0.011
LogSumExp 0.231±0.0670.231\pm 0.067 0.198±0.0340.198\pm 0.034 0.184±0.0130.184\pm 0.013 0.179±0.0140.179\pm 0.014 0.184±0.0090.184\pm 0.009 0.192±0.010.192\pm 0.01 0.193±0.0030.193\pm 0.003
Quadratic 0.148±0.0330.148\pm 0.033 0.186±0.0280.186\pm 0.028 0.218±0.0110.218\pm 0.011 0.219±0.0180.219\pm 0.018 0.226±0.0180.226\pm 0.018 0.236±0.0230.236\pm 0.023 0.254±0.0140.254\pm 0.014
Table 2: Results for CIFAR-10 OOD detection, MNIST multi-target generation and multi-distribution off-policy policy evaluation error based on learned density ratios. \uparrow means higher is better and \downarrow means lower is better. Results of top 33 methods for each task are bold. Results are averaged across three random seeds.
Method CIFAR-10 OOD (\uparrow) MNIST Generation (\downarrow) Off-policy Evaluation (\downarrow)
Random Init 0.499±0.0170.499\pm 0.017 1.598±0.0631.598\pm 0.063 1377.68±379.761377.68\pm 379.76
Multi-LR 0.854±0.009\bm{0.854}\pm 0.009 0.156±0.014\bm{0.156}\pm 0.014 62.43±12.8762.43\pm 12.87
Multi-KLIEP 0.828±0.0050.828\pm 0.005 0.281±0.0500.281\pm 0.050 110.89±35.33110.89\pm 35.33
Multi-LSIF 0.801±0.0080.801\pm 0.008 0.274±0.0270.274\pm 0.027 71.09±1.1271.09\pm 1.12
Power (α=1.5\alpha=1.5) 0.816±0.0070.816\pm 0.007 0.224±0.0360.224\pm 0.036 53.43±20.73\bm{53.43}\pm 20.73
Brier 0.849±0.010\bm{0.849}\pm 0.010 0.107±0.022\bm{0.107}\pm 0.022 71.21±17.3471.21\pm 17.34
Spherical (α=1.8\alpha=1.8) 0.853±0.010\bm{0.853}\pm 0.010 0.145±0.041\bm{0.145}\pm 0.041 /
LogSumExp (α=5\alpha=5) 0.782±0.0120.782\pm 0.012 / 52.02±9.16\bm{52.02}\pm 9.16
Quadratic 0.804±0.0090.804\pm 0.009 / 55.10±11.92\bm{55.10}\pm 11.92

Multi-target MNIST Generation.  DRE can be used in the sampling-importance-resampling (SIR) paradigm (Liu & Chen, 1998; Doucet et al., 2000). Suppose we want to obtain samples from p1,,pk1p_{1},\ldots,p_{k-1} while we only have a large set of samples from pkp_{k}. For each i[k1]i\in[k-1], we can use the density ratio function r^i\hat{r}_{i} in conjunction with SIR to approximately sample from the target distribution pip_{i} (Algorithm 1 in (Grover et al., 2019)). For this task, we evaluate whether the SIR samples for target distribution pip_{i} contain the correct proportion of classes/properties (the 10 digit classes in MNIST), and we use 1k1i=1k1j=110|hijh^ij|\frac{1}{k-1}\sum_{i=1}^{k-1}\sum_{j=1}^{10}|h_{ij}-\hat{h}_{ij}| as the evaluation metric, where hijh_{ij} and h^ij\hat{h}_{ij} denote the desired and sampled proportions of property jj in target-generation task ii.
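As a concrete illustration of the resampling step, the following sketch draws from the proposal pool in proportion to the learned ratio (names are our own; this is a minimal stand-in for the SIR procedure referenced above, not a reproduction of Algorithm 1 in (Grover et al., 2019)):

```python
import numpy as np

def sir_resample(samples, r_hat_i, n_draw, rng=None):
    """Sampling-importance-resampling (SIR) with a learned density ratio.

    samples : (n, ...) array of draws from the proposal p_k.
    r_hat_i : (n,) learned ratio estimates r_hat_i(x) = p_i(x)/p_k(x).
    n_draw  : number of (approximate) samples from the target p_i to return.
    """
    if rng is None:
        rng = np.random.default_rng()
    w = r_hat_i / r_hat_i.sum()  # self-normalized importance weights
    idx = rng.choice(len(samples), size=n_draw, p=w, replace=True)
    return samples[idx]
```

Proposal samples with larger estimated ratios are resampled more often, so the resampled set is approximately distributed according to the target pip_{i}.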

Multi-distribution Off-policy Policy Evaluation.  Suppose we have kk different reinforcement learning policies pi(a|s)p_{i}(a|s), each inducing an occupancy measure (Syed et al., 2008) (i.e., state-action distribution) ρi(s,a)\rho_{i}(s,a). Density ratios allow us to conduct off-policy policy evaluation, which estimates the expected return (sum of rewards) of target policies p1,,pk1p_{1},\ldots,p_{k-1} given trajectories sampled from a source policy pkp_{k}. In this case, we evaluate the following metric to assess the quality of the learned density ratios (τ={(st,at)}t=1T\tau=\{(s_{t},a_{t})\}_{t=1}^{T} denotes a sequence of state-action pairs):

1k1i=1k1|𝔼pk(τ)[t=1Tr^i(st,at)r(st,at)]𝔼pi(τ)[t=1Tr(st,at)]|\displaystyle\frac{1}{k-1}\sum_{i=1}^{k-1}\left|\mathbb{E}_{p_{k}(\tau)}\left[\sum_{t=1}^{T}\hat{r}_{i}(s_{t},a_{t})r(s_{t},a_{t})\right]-\mathbb{E}_{p_{i}(\tau)}\left[\sum_{t=1}^{T}r(s_{t},a_{t})\right]\right|
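The first expectation in this metric, the importance-weighted return, can be estimated from source-policy trajectories as in the following sketch (all names are hypothetical; the learned occupancy ratio r^i\hat{r}_{i} and the reward function are assumed given):

```python
import numpy as np

def weighted_return(trajectories, r_hat_i, reward_fn):
    """Importance-weighted return estimate for target policy p_i using
    trajectories drawn from the source policy p_k, i.e. an estimate of
    E_{p_k(tau)}[sum_t r_hat_i(s_t, a_t) * r(s_t, a_t)].

    trajectories: list of trajectories; each is a list of (s, a) pairs.
    r_hat_i     : callable, learned occupancy-ratio estimate r_hat_i(s, a).
    reward_fn   : callable, reward r(s, a).
    """
    returns = [sum(r_hat_i(s, a) * reward_fn(s, a) for (s, a) in tau)
               for tau in trajectories]
    return float(np.mean(returns))
```

The metric above then compares this estimate against the Monte-Carlo return computed from on-policy trajectories of each target policy.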

We summarize the results for CIFAR-10 OOD detection, multi-target MNIST generation and multi-distribution off-policy policy evaluation in Table 2 (omitted results indicate that the corresponding method performs worse than the listed methods by a large margin on that task). Methods induced by proper scoring rules, such as multi-class logistic regression, the Brier score and the pseudo-spherical score, tend to achieve the best performance on the first two tasks. Surprisingly, methods induced by simple multivariate convex functions such as LogSumExp and the quadratic function show excellent performance on the third task. These results demonstrate an advantage of our framework: it offers considerable flexibility for designing new multi-distribution DRE objectives suited to various downstream applications.

7 Conclusion

In this paper, we focus on the generalized problem of efficiently estimating density ratios among multiple distributions. We propose a general framework based on expected Bregman divergence minimization, where each strictly convex function induces a proper loss for multi-distribution DRE. Furthermore, we rederive the theoretical equivalence between the losses of class probability estimation and density ratio estimation, which justifies the use of any strictly proper scoring rule for multi-distribution DRE. Finally, we demonstrate the effectiveness of our framework on various downstream tasks.

References

  • Aharoni et al. (2019) Roee Aharoni, Melvin Johnson, and Orhan Firat. Massively multilingual neural machine translation. arXiv preprint arXiv:1903.00089, 2019.
  • Bickel et al. (2008) Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, and Tobias Scheffer. Multi-task learning for hiv therapy screening. In Proceedings of the 25th international conference on Machine learning, pp.  56–63, 2008.
  • Brier et al. (1950) Glenn W Brier et al. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Cao et al. (2019) Jiezhang Cao, Langyuan Mo, Yifan Zhang, Kui Jia, Chunhua Shen, and Mingkui Tan. Multi-marginal wasserstein gan. Advances in Neural Information Processing Systems, 32:1776–1786, 2019.
  • Cappé et al. (2004) Olivier Cappé, Arnaud Guillin, Jean-Michel Marin, and Christian P Robert. Population monte carlo. Journal of Computational and Graphical Statistics, 13(4):907–929, 2004.
  • Choi et al. (2021) Kristy Choi, Madeline Liao, and Stefano Ermon. Featurized density ratio estimation. arXiv preprint arXiv:2107.02212, 2021.
  • DeGroot (1962) Morris H DeGroot. Uncertainty, information, and sequential experiments. The Annals of Mathematical Statistics, 33(2):404–419, 1962.
  • Dinh et al. (2013) Cuong V Dinh, Robert PW Duin, Ignacio Piqueras-Salazar, and Marco Loog. Fidos: A generalized fisher based feature extraction method for domain shift. Pattern Recognition, 46(9):2510–2518, 2013.
  • Dong et al. (2015) Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  1723–1732, 2015.
  • Doucet et al. (2000) Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential monte carlo sampling methods for bayesian filtering. Statistics and computing, 10(3):197–208, 2000.
  • Duchi et al. (2018) John Duchi, Khashayar Khosravi, and Feng Ruan. Multiclass classification, information, divergence and surrogate risk. The Annals of Statistics, 46(6B):3246–3275, 2018.
  • Elvira et al. (2015) Víctor Elvira, Luca Martino, David Luengo, and Mónica F Bugallo. Efficient multiple importance sampling estimators. IEEE Signal Processing Letters, 22(10):1757–1761, 2015.
  • Elvira et al. (2019) Víctor Elvira, Luca Martino, David Luengo, and Mónica F Bugallo. Generalized multiple importance sampling. Statistical Science, 34(1):129–155, 2019.
  • Franklin (2005) James Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.
  • Fujisawa & Eguchi (2008) Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9):2053–2081, 2008.
  • Garcia-Garcia & Williamson (2012) Dario Garcia-Garcia and Robert C Williamson. Divergences and risks for multiclass experiments. In Conference on Learning Theory, pp.  28–1. JMLR Workshop and Conference Proceedings, 2012.
  • Gneiting & Raftery (2007) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359–378, 2007.
  • Good (1952) IJ Good. Rational decisions. Journal of the Royal Statistical Society, pp.  107–114, 1952.
  • Good (1971) IJ Good. Comment on “measuring information and uncertainty” by robert j. buehler. Foundations of Statistical Inference, pp.  337–339, 1971.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • Grover et al. (2019) Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, Eric Horvitz, and Stefano Ermon. Bias correction of learned generative models using likelihood-free importance weighting. arXiv preprint arXiv:1906.09531, 2019.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. PMLR, 2018.
  • Hido et al. (2008) Shohei Hido, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi Kanamori. Inlier-based outlier detection via direct density ratio estimation. In 2008 Eighth IEEE international conference on data mining, pp.  223–232. IEEE, 2008.
  • Hido et al. (2011) Shohei Hido, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi Kanamori. Statistical outlier detection using direct density ratio estimation. Knowledge and information systems, 26(2):309–336, 2011.
  • Hjelm et al. (2018) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
  • Huang et al. (2006) Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. Advances in neural information processing systems, 19:601–608, 2006.
  • Kallus et al. (2021) Nathan Kallus, Yuta Saito, and Masatoshi Uehara. Optimal off-policy evaluation from multiple logging policies. In International Conference on Machine Learning, pp. 5247–5256. PMLR, 2021.
  • Kanamori et al. (2009) Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. The Journal of Machine Learning Research, 10:1391–1445, 2009.
  • Kanamori et al. (2012) Takafumi Kanamori, Taiji Suzuki, and Masashi Sugiyama. Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3):335–367, 2012.
  • Kato & Teshima (2021) Masahiro Kato and Takeshi Teshima. Non-negative bregman divergence minimization for deep direct density ratio estimation. In International Conference on Machine Learning, pp. 5320–5333. PMLR, 2021.
  • Kpotufe (2017) Samory Kpotufe. Lipschitz density-ratios, structured data, and data-driven tuning. In Artificial Intelligence and Statistics, pp.  1320–1328. PMLR, 2017.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Liu & Chen (1998) Jun S Liu and Rong Chen. Sequential monte carlo methods for dynamic systems. Journal of the American statistical association, 93(443):1032–1044, 1998.
  • Liu et al. (2017) Song Liu, Akiko Takeda, Taiji Suzuki, and Kenji Fukumizu. Trimmed density ratio estimation. arXiv preprint arXiv:1703.03216, 2017.
  • Menon & Ong (2016) Aditya Menon and Cheng Soon Ong. Linking losses for density ratio and class-probability estimation. In International Conference on Machine Learning, pp. 304–313. PMLR, 2016.
  • Nguyen et al. (2007) XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In NIPS, pp.  1089–1096, 2007.
  • Nock et al. (2016) Richard Nock, Aditya Menon, and Cheng Soon Ong. A scaled bregman theorem with applications. Advances in Neural Information Processing Systems, 29:19–27, 2016.
  • Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp.  271–279, 2016.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Owen & Zhou (2000) Art Owen and Yi Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135–143, 2000.
  • Que & Belkin (2013) Qichao Que and Mikhail Belkin. Inverse density as an inverse problem: The fredholm equation approach. arXiv preprint arXiv:1304.5575, 2013.
  • Rhodes et al. (2020) Benjamin Rhodes, Kai Xu, and Michael U Gutmann. Telescoping density-ratio estimation. arXiv preprint arXiv:2006.12204, 2020.
  • Sibson (1969) Robin Sibson. Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 14(2):149–160, 1969.
  • Smola et al. (2009) Alex Smola, Le Song, and Choon Hui Teo. Relative novelty detection. In Artificial Intelligence and Statistics, pp.  536–543. PMLR, 2009.
  • Sugiyama et al. (2007) Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul Von Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, volume 7, pp.  1433–1440. Citeseer, 2007.
  • Sugiyama et al. (2008) Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
  • Sugiyama et al. (2011) Masashi Sugiyama, Taiji Suzuki, Yuta Itoh, Takafumi Kanamori, and Manabu Kimura. Least-squares two-sample test. Neural networks, 24(7):735–751, 2011.
  • Sugiyama et al. (2012) Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density-ratio matching under the bregman divergence: a unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
  • Syed et al. (2008) Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp.  1032–1039, 2008.
  • Uehara et al. (2016) Masatoshi Uehara, Issei Sato, Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
  • Vernet et al. (2011) Elodie Vernet, Robert Williamson, Mark Reid, et al. Composite multiclass losses. 2011.

Appendix A Proofs

A.1 Proofs for Section 2

Theorem 1 (restated).

Proof.

For completeness, we provide the proof here. First, we can check that L¯(P):Δk\underline{L}(P):\Delta_{k}\to{\mathbb{R}} is a concave function. Define 𝐋(P)\mathbf{L}(P) to be the vector ((1,P),,(k,P))\left(\ell(1,P),\ldots,\ell(k,P)\right). Then the entropy function can be represented as L¯(P)=L(P,P)=𝔼yP[(y,P)]=P𝐋(P)\underline{L}(P)=L(P,P)=\mathbb{E}_{y\sim P}[\ell(y,P)]=P^{\top}\mathbf{L}(P) and similarly L(P,Q)=P𝐋(Q)L(P,Q)=P^{\top}\mathbf{L}(Q). For λ[0,1]\lambda\in[0,1] and P,QΔkP,Q\in\Delta_{k}, we have:

L¯(λP+(1λ)Q)\displaystyle\underline{L}(\lambda P+(1-\lambda)Q) =(λP+(1λ)Q)𝐋(λP+(1λ)Q)\displaystyle=(\lambda P+(1-\lambda)Q)^{\top}\mathbf{L}(\lambda P+(1-\lambda)Q)
=λP𝐋(λP+(1λ)Q)+(1λ)Q𝐋(λP+(1λ)Q)\displaystyle=\lambda P^{\top}\mathbf{L}(\lambda P+(1-\lambda)Q)+(1-\lambda)Q^{\top}\mathbf{L}(\lambda P+(1-\lambda)Q)
λP𝐋(P)+(1λ)Q𝐋(Q)=λL¯(P)+(1λ)L¯(Q)\displaystyle\geq\lambda P^{\top}\mathbf{L}(P)+(1-\lambda)Q^{\top}\mathbf{L}(Q)=\lambda\underline{L}(P)+(1-\lambda)\underline{L}(Q)

where the inequality is because \ell is proper. Thus L¯\underline{L} is a concave function. Next we show that the excess risk is a Bregman divergence with convex function L¯-\underline{L}. First, observe that

L(P,Q)=P𝐋(Q)=Q𝐋(Q)+(PQ)𝐋(Q)\displaystyle L(P,Q)=P^{\top}{\mathbf{L}}(Q)=Q^{\top}{\mathbf{L}}(Q)+(P-Q)^{\top}{\mathbf{L}}(Q)

Because \ell is proper, we have:

0L(P,Q)L(P,P)\displaystyle 0\leq L(P,Q)-L(P,P) =Q𝐋(Q)+(PQ)𝐋(Q)P𝐋(P)\displaystyle=Q^{\top}{\mathbf{L}}(Q)+(P-Q)^{\top}{\mathbf{L}}(Q)-P^{\top}{\mathbf{L}}(P)
=L¯(P)(L¯(Q))(PQ)(𝐋(Q))\displaystyle=-\underline{L}(P)-(-\underline{L}(Q))-(P-Q)^{\top}(-{\mathbf{L}}(Q))

Rearranging terms, we get L¯(P)(L¯(Q))+(𝐋(Q))(PQ)-\underline{L}(P)\geq(-\underline{L}(Q))+(-{\mathbf{L}}(Q))^{\top}(P-Q), and therefore 𝐋(Q)-{\mathbf{L}}(Q) is a subderivative of L¯-\underline{L}. When L¯-\underline{L} is differentiable, its subdifferential contains exactly one subderivative, so 𝐋(Q)=(L¯(Q))-{\mathbf{L}}(Q)=\nabla(-\underline{L}(Q)). Therefore, we have reg(P,Q)=L(P,Q)L(P,P)=f(P)f(Q)f(Q),PQ=𝐁f(P,Q)\mathrm{reg}(P,Q)=L(P,Q)-L(P,P)=f(P)-f(Q)-\langle\nabla f(Q),P-Q\rangle=\mathbf{B}_{f}(P,Q) with f=L¯f=-\underline{L}. ∎
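The identity reg(P,Q)=𝐁f(P,Q)\mathrm{reg}(P,Q)=\mathbf{B}_{f}(P,Q) with f=L¯f=-\underline{L} is easy to verify numerically for a concrete proper loss. The snippet below (our own sanity check, not part of the paper) uses the log loss (y,Q)=logQy\ell(y,Q)=-\log Q_{y}, for which L¯\underline{L} is the Shannon entropy and the excess risk is the KL divergence:

```python
import numpy as np

# Numeric check: for the log loss l(y, Q) = -log Q_y, the excess risk
# L(P, Q) - L(P, P) = KL(P || Q) equals the Bregman divergence B_f(P, Q)
# with f = -L_bar, where L_bar(P) = -sum_i P_i log P_i (Shannon entropy).
f = lambda p: np.sum(p * np.log(p))            # f = -L_bar (negative entropy)
grad_f = lambda p: np.log(p) + 1.0
bregman = lambda p, q: f(p) - f(q) - np.dot(grad_f(q), p - q)

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])
excess_risk = np.dot(P, -np.log(Q)) - np.dot(P, -np.log(P))  # KL(P || Q)
assert np.isclose(excess_risk, bregman(P, Q))
```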

A.2 Proofs for Section 3.2

Proposition 1 (restated).

Proof of Proposition 1.

We first recall that the optimization problem for multi-distribution DRE is of the form

min𝒓^:𝒳+k1𝔼pk(x)[f(𝒓^(x)),𝒓^(x)f(𝒓^(x))]i[k1]𝔼pi(x)[if(𝒓^(x))]\displaystyle\min_{\hat{{\bm{r}}}:{\mathcal{X}}\to{\mathbb{R}}^{k-1}_{+}}\mathbb{E}_{p_{k}(x)}\left[\langle\nabla f(\hat{{\bm{r}}}(x)),\hat{{\bm{r}}}(x)\rangle-f(\hat{{\bm{r}}}(x))\right]-\sum_{i\in[k-1]}\mathbb{E}_{p_{i}(x)}[\partial_{i}f(\hat{{\bm{r}}}(x))] (24)

and one can use the Fenchel-dual convex conjugate to represent the ff-divergence as

𝐃f(P1,,Pk1||Pk)=min𝒔:𝒳k1[i[k1]𝔼pi(x)[𝒔(x)]i+𝔼pk(x)f(𝒔(x))]\mathbf{D}_{f}(P_{1},\cdots,P_{k-1}||P_{k})=-\min_{{\bm{s}}:{\mathcal{X}}\rightarrow\mathbb{R}^{k-1}}\left[-\sum_{i\in[k-1]}\mathbb{E}_{p_{i}(x)}[{\bm{s}}(x)]_{i}+\mathbb{E}_{p_{k}(x)}f^{*}({\bm{s}}(x))\right] (25)

By first-order optimality condition of convex functions, for any x𝒳x\in{\mathcal{X}} the optimal solution 𝒔¯(x)\overline{{\bm{s}}}(x) for Eq. (25) satisfies that

i[k1],x𝒳,pi(x)if(𝒔¯(x))pk(x)=0pi(x)pk(x)=if(𝒔¯(x))\forall i\in[k-1],x\in{\mathcal{X}},~{}~{}p_{i}(x)-\partial_{i}f^{*}(\overline{{\bm{s}}}(x))p_{k}(x)=0\Longleftrightarrow\frac{p_{i}(x)}{p_{k}(x)}=\partial_{i}f^{*}(\overline{{\bm{s}}}(x))

Therefore 𝒓¯(x)=f(𝒔¯(x))\overline{{\bm{r}}}(x)=\nabla f^{*}(\overline{{\bm{s}}}(x)) recovers the true density ratios.

Now we show that under change of variable 𝒔(x)=f(𝒓^(x)){\bm{s}}(x)=\nabla f(\hat{{\bm{r}}}(x)), one can write the problem in Eq. (25) equivalently as the one in Eq. (24). First due to the property of the convex conjugate function (f=ff^{**}=f), we have:

f(𝒔(x))=max𝒉(x)k1𝒔(x),𝒉(x)f(𝒉(x))f^{*}({\bm{s}}(x))=\max_{{\bm{h}}(x)\in{\mathbb{R}}^{k-1}}\langle{\bm{s}}(x),{\bm{h}}(x)\rangle-f({\bm{h}}(x))

Substituting 𝒔(x){\bm{s}}(x) with f(𝒓^(x))\nabla f(\hat{{\bm{r}}}(x)), we have:

f(𝒔(x))=max𝒉(x)k1f(𝒓^(x)),𝒉(x)f(𝒉(x))f^{*}({\bm{s}}(x))=\max_{{\bm{h}}(x)\in{\mathbb{R}}^{k-1}}\langle\nabla f(\hat{{\bm{r}}}(x)),{\bm{h}}(x)\rangle-f({\bm{h}}(x)) (26)

Setting the derivative w.r.t. 𝒉{\bm{h}} to zero and using the strict convexity of ff (f(𝒂)=f(𝒃)𝒂=𝒃\nabla f({\bm{a}})=\nabla f({\bm{b}})\Leftrightarrow{\bm{a}}={\bm{b}}), we know that the maximum of Eq. (26) is attained at 𝒉¯(x)=𝒓^(x)\overline{{\bm{h}}}(x)=\hat{{\bm{r}}}(x). Thus we have:

f(𝒔(x))=f(𝒓^(x)),𝒓^(x)f(𝒓^(x))f^{*}({\bm{s}}(x))=\langle\nabla f(\hat{{\bm{r}}}(x)),\hat{{\bm{r}}}(x)\rangle-f(\hat{{\bm{r}}}(x)) (27)

Plugging Eq. (27) and 𝒔(x)=f(𝒓^(x)){\bm{s}}(x)=\nabla f(\hat{{\bm{r}}}(x)) back into the optimization problem in Eq. (25), we obtain the following equivalent problem (the leading negative sign in Eq. (25) changes the optimal value but not the minimizer):

min𝒓^:𝒳+k1𝔼pk(x)[f(𝒓^(x)),𝒓^(x)f(𝒓^(x))]i[k1]𝔼pi(x)if(𝒓^(x)),\min_{\hat{{\bm{r}}}:{\mathcal{X}}\to{\mathbb{R}}^{k-1}_{+}}\mathbb{E}_{p_{k}(x)}\left[\langle\nabla f(\hat{{\bm{r}}}(x)),\hat{{\bm{r}}}(x)\rangle-f(\hat{{\bm{r}}}(x))\right]-\sum_{i\in[k-1]}\mathbb{E}_{p_{i}(x)}\partial_{i}f(\hat{{\bm{r}}}(x)),

which is the same as the problem in Eq. (24). ∎
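The conjugate identity in Eq. (27) is easy to verify numerically for a concrete choice of ff. The snippet below (our own sanity check, not part of the paper) uses the Multi-LSIF-style quadratic f(𝒓)=12𝒓𝟏2f({\bm{r}})=\frac{1}{2}\|{\bm{r}}-\bm{1}\|^{2}, whose conjugate f(𝒔)=12𝒔2+isif^{*}({\bm{s}})=\frac{1}{2}\|{\bm{s}}\|^{2}+\sum_{i}s_{i} is available in closed form:

```python
import numpy as np

# Numeric check of Eq. (27): f*(grad_f(r)) = <grad_f(r), r> - f(r),
# for the quadratic choice f(r) = 0.5 * ||r - 1||^2, with
# grad_f(r) = r - 1 and conjugate f*(s) = 0.5 * ||s||^2 + sum_i s_i.
f = lambda r: 0.5 * np.sum((r - 1.0) ** 2)
grad_f = lambda r: r - 1.0
f_conj = lambda s: 0.5 * np.sum(s ** 2) + np.sum(s)

r = np.array([0.5, 2.0, 1.3])
s = grad_f(r)
assert np.isclose(f_conj(s), np.dot(s, r) - f(r))
```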

A.3 Proofs for Section 4

Lemma 1 (restated).

Proof of Lemma 1.

For simplicity of notation, we let uk=vk=1u_{k}=v_{k}=1 for arbitrary u,vk1u,v\in\mathbb{R}^{k-1}. We first prove the convexity of ff^{\circledast} by definition. Given any two points u,vk1u,v\in\mathbb{R}^{k-1} and θ[0,1]\theta\in[0,1], one has

f(θ𝒖+(1θ)𝒗)\displaystyle f^{\circledast}(\theta{\bm{u}}+(1-\theta){\bm{v}})
=\displaystyle= (i[k](θui+(1θ)vi))f(1i[k](θui+(1θ)vi)(θ𝒖+(1θ)𝒗))\displaystyle\left(\sum_{i\in[k]}\left(\theta u_{i}+(1-\theta)v_{i}\right)\right)\cdot f\left(\frac{1}{\sum_{i\in[k]}(\theta u_{i}+(1-\theta)v_{i})}\cdot(\theta{\bm{u}}+(1-\theta){\bm{v}})\right)
=\displaystyle= (θi[k]ui+(1θ)i[k]vi)f(1θi[k]ui+(1θ)i[k]vi(θ𝒖+(1θ)𝒗))\displaystyle\left(\theta\sum_{i\in[k]}u_{i}+(1-\theta)\sum_{i\in[k]}v_{i}\right)\cdot f\left(\frac{1}{\theta\sum_{i\in[k]}u_{i}+(1-\theta)\sum_{i\in[k]}v_{i}}\cdot(\theta{\bm{u}}+(1-\theta){\bm{v}})\right)
()\displaystyle\stackrel{{\scriptstyle(\star)}}{{\leq}} θ(i[k]ui)f(1i[k]ui𝒖)+(1θ)(i[k]vi)f(1i[k]vi𝒗)\displaystyle\theta\left(\sum_{i\in[k]}u_{i}\right)f\left(\frac{1}{\sum_{i\in[k]}u_{i}}{\bm{u}}\right)+(1-\theta)\left(\sum_{i\in[k]}v_{i}\right)f\left(\frac{1}{\sum_{i\in[k]}v_{i}}{\bm{v}}\right)
=\displaystyle= θf(𝒖)+(1θ)f(𝒗).\displaystyle\theta f^{\circledast}({\bm{u}})+(1-\theta)f^{\circledast}({\bm{v}}).

Here for inequality ()(\star) we use the fact that for any convex function g:ng:\mathbb{R}^{n}\rightarrow\mathbb{R}, the perspective function h(t,x):=tg(x/t)h(t,x)\mathrel{\mathop{:}}=tg(x/t) is jointly convex in (t,x)(t,x) for t>0t>0.

Now, to see that the identity holds, note that we can write

LHS=\displaystyle\mathrm{LHS}= f(1i[k]ui𝒖)f(1i[k]vi𝒗)\displaystyle f\left(\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}\right)-f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)
f(1i[k]vi𝒗),1i[k]ui𝒖1i[k]vi𝒗\displaystyle-\left\langle\nabla f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right),\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}-\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right\rangle

and that

RHS=\displaystyle\mathrm{RHS}= f(1i[k]ui𝒖)i[k]vii[k]uif(1i[k]vi𝒗)1i[k]uif(𝒗),𝒖𝒗\displaystyle f\left(\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}\right)-\frac{\sum_{i\in[k]}v_{i}}{\sum_{i\in[k]}u_{i}}\cdot f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)-\frac{1}{\sum_{i\in[k]}u_{i}}\left\langle\nabla f^{\circledast}\left({\bm{v}}\right),{\bm{u}}-{\bm{v}}\right\rangle
=(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{=}} f(1i[k]ui𝒖)i[k]vii[k]uif(1i[k]vi𝒗)\displaystyle f\left(\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}\right)-\frac{\sum_{i\in[k]}v_{i}}{\sum_{i\in[k]}u_{i}}\cdot f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)
1i[k]uif(1i[k]vi𝒗)𝟏+(𝐈1i[k]vi𝟏𝒗)f(1i[k]vi𝒗),𝒖𝒗\displaystyle-\frac{1}{\sum_{i\in[k]}u_{i}}\left\langle f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)\bm{1}+\left(\mathbf{I}-\frac{1}{\sum_{i\in[k]}v_{i}}\bm{1}{\bm{v}}^{\top}\right)\nabla f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right),{\bm{u}}-{\bm{v}}\right\rangle
=(ii)\displaystyle\stackrel{{\scriptstyle(ii)}}{{=}} f(1i[k]ui𝒖)[i[k]vii[k]ui+i[k](uivi)i[k]ui]f(1i[k]vi𝒗)f\left(\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}\right)-\left[\frac{\sum_{i\in[k]}v_{i}}{\sum_{i\in[k]}u_{i}}+\frac{\sum_{i\in[k]}(u_{i}-v_{i})}{\sum_{i\in[k]}u_{i}}\right]\cdot f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)
1i[k]uif(1i[k]vi𝒗),(𝐈1i[k]vi𝒗𝟏)(𝒖𝒗)-\frac{1}{\sum_{i\in[k]}u_{i}}\left\langle\nabla f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right),\left(\mathbf{I}-\frac{1}{\sum_{i\in[k]}v_{i}}{\bm{v}}\bm{1}^{\top}\right)({\bm{u}}-{\bm{v}})\right\rangle
=\displaystyle= f(1i[k]ui𝒖)f(1i[k]vi𝒗)f\left(\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}\right)-f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)
f(1i[k]vi𝒗),1i[k]ui𝒖1i[k]ui𝒗i[k](uivi)(i[k]ui)(i[k]vi)𝒗-\left\langle\nabla f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right),\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}-\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{v}}-\frac{\sum_{i\in[k]}(u_{i}-v_{i})}{\left(\sum_{i\in[k]}u_{i}\right)\left(\sum_{i\in[k]}v_{i}\right)}\cdot{\bm{v}}\right\rangle
=\displaystyle= f(1i[k]ui𝒖)f(1i[k]vi𝒗)f\left(\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}\right)-f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)
f(1i[k]vi𝒗),1i[k]ui𝒖1i[k]vi𝒗,-\left\langle\nabla f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right),\frac{1}{\sum_{i\in[k]}u_{i}}\cdot{\bm{u}}-\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right\rangle,

where we use (i)(i) the gradient formula f(𝒗)=f(1i[k]vi𝒗)𝟏+(𝐈1i[k]vi𝟏𝒗)f(1i[k]vi𝒗)\nabla f^{\circledast}({\bm{v}})=f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right)\bm{1}+\left(\mathbf{I}-\frac{1}{\sum_{i\in[k]}v_{i}}\bm{1}{\bm{v}}^{\top}\right)\nabla f\left(\frac{1}{\sum_{i\in[k]}v_{i}}\cdot{\bm{v}}\right), which follows from the definition of ff^{\circledast}, and (ii)(ii) rearranging terms together with the adjoint identity 𝑨𝒙,𝒚=𝒙,𝑨𝒚\langle{\bm{A}}{\bm{x}},{\bm{y}}\rangle=\langle{\bm{x}},{\bm{A}}^{\top}{\bm{y}}\rangle.

Thus, we have shown that LHS=RHS\mathrm{LHS}=\mathrm{RHS}, which concludes the proof. ∎
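As a quick numerical sanity check of Lemma 1, consider a sketch under the assumption that ff is the negative entropy f(𝒑)=ipilogpif({\bm{p}})=\sum_{i}p_{i}\log p_{i} on the simplex, for which the gradient formula above simplifies to f(𝒗)=log(𝒗/ivi)\nabla f^{\circledast}({\bm{v}})=\log({\bm{v}}/\sum_{i}v_{i}):

```python
import numpy as np

# Check Lemma 1 for k = 3 with f = negative entropy, u_k = v_k = 1 fixed.
f = lambda p: np.sum(p * np.log(p))

def f_star(u):                       # the extension f^circledast of f
    return np.sum(u) * f(u / np.sum(u))

u = np.array([0.7, 1.9, 1.0])        # last coordinate fixed to 1
v = np.array([2.2, 0.5, 1.0])

# For negative entropy, grad f_star(v) = log(v / sum(v)).
grad_fs_v = np.log(v / np.sum(v))
B_fs = f_star(u) - f_star(v) - grad_fs_v @ (u - v)

# Bregman divergence of f between the normalized (simplex) points
ub, vb = u / np.sum(u), v / np.sum(v)
B_f = f(ub) - f(vb) - (np.log(vb) + 1.0) @ (ub - vb)

# Lemma 1: B_{f_star}(u, v) = (sum_i u_i) * B_f(u / sum u, v / sum v)
assert np.isclose(B_fs, np.sum(u) * B_f)
```

For this choice both sides reduce to a scaled KL divergence, so the assertion holds exactly.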

Proposition 2 (restated).

Proof of Proposition 2.

Given any x𝒳x\in{\mathcal{X}}, the equality follows by applying Lemma 1 with 𝒖=1πk𝝅[1:k1]𝒓(x){\bm{u}}=\frac{1}{\pi_{k}}{\bm{\pi}}_{[1:k-1]}\circ{\bm{r}}(x) and 𝒗=1πk𝝅[1:k1]𝒓^(x){\bm{v}}=\frac{1}{\pi_{k}}{\bm{\pi}}_{[1:k-1]}\circ\hat{{\bm{r}}}(x), where \circ denotes element-wise multiplication. To see why this is true, note that by the definitions of ηi(x)\eta_{i}(x) and η^i(x)\hat{\eta}_{i}(x) we have

𝜼(x)=𝝅𝒓(x)πk+j[k1]πjrj(x)=11+i[k1]ui𝒖,and similarly𝜼^(𝒙)=11+i[k1]vi𝒗.{\bm{\eta}}(x)=\frac{{\bm{\pi}}\circ{\bm{r}}(x)}{\pi_{k}+\sum_{j\in[k-1]}\pi_{j}r_{j}(x)}=\frac{1}{1+\sum_{i\in[k-1]}u_{i}}\cdot{\bm{u}},~{}\text{and similarly}~{}\hat{{\bm{\eta}}}({\bm{x}})=\frac{1}{1+\sum_{i\in[k-1]}v_{i}}\cdot{\bm{v}}.

Consequently applying Lemma 1 implies that

𝐁f(𝜼(x),𝜼^(x))=11+i[k1]ui𝐁f(𝒖,𝒗)\mathbf{B}_{f}({\bm{\eta}}(x),\hat{{\bm{\eta}}}(x))=\frac{1}{1+\sum_{i\in[k-1]}u_{i}}\mathbf{B}_{f^{\circledast}}({\bm{u}},{\bm{v}}) (28)

Note that given any convex function ff^{\circledast}, we consider its composition with a linear map:

fπ(𝒓)=f(1πk𝝅[1:k1]𝒓)=f(𝒖).f^{\circledast}_{\pi}({\bm{r}})=f^{\circledast}\left(\frac{1}{\pi_{k}}{\bm{\pi}}_{[1:k-1]}\circ{\bm{r}}\right)=f^{\circledast}({\bm{u}}).

We note that composition with a linear map preserves convexity and leaves the Bregman divergence unchanged, i.e., we have

𝐁f(𝒖,𝒗)=f(𝒖)f(𝒗)f(𝒗),𝒖𝒗\displaystyle\mathbf{B}_{f^{\circledast}}({\bm{u}},{\bm{v}})=f^{\circledast}({\bm{u}})-f^{\circledast}({\bm{v}})-\langle\nabla f^{\circledast}({\bm{v}}),{\bm{u}}-{\bm{v}}\rangle (29)
=\displaystyle= f(1πk𝝅1:k1𝒓)f(1πk𝝅1:k1𝒓^)f(1πk𝝅1:k1𝒓^),1πk𝝅1:k1(𝒓𝒓^)\displaystyle f^{\circledast}\left(\frac{1}{\pi_{k}}{\bm{\pi}}_{1:k-1}\circ{\bm{r}}\right)-f^{\circledast}\left(\frac{1}{\pi_{k}}{\bm{\pi}}_{1:k-1}\circ\hat{{\bm{r}}}\right)-\left\langle\nabla f^{\circledast}\left(\frac{1}{\pi_{k}}{\bm{\pi}}_{1:k-1}\circ\hat{{\bm{r}}}\right),\frac{1}{\pi_{k}}{\bm{\pi}}_{1:k-1}\circ({\bm{r}}-\hat{{\bm{r}}})\right\rangle
=()\displaystyle\stackrel{{\scriptstyle(\star)}}{{=}} fπ(𝒓)fπ(𝒓^)fπ(𝒓^),𝒓𝒓^=𝐁fπ(𝒓,𝒓^),f^{\circledast}_{\pi}({\bm{r}})-f^{\circledast}_{\pi}(\hat{{\bm{r}}})-\langle\nabla f^{\circledast}_{\pi}(\hat{{\bm{r}}}),{\bm{r}}-\hat{{\bm{r}}}\rangle=\mathbf{B}_{f^{\circledast}_{\pi}}({\bm{r}},\hat{{\bm{r}}}),

where for equality ()(\star) we use the chain rule for differentiating the linear composite mapping. Combining Equations (28) and (29) and substituting 𝒖=1πk𝝅[1:k1]𝒓{\bm{u}}=\frac{1}{\pi_{k}}{\bm{\pi}}_{[1:k-1]}\circ{\bm{r}} gives the desired result. ∎

Theorem 2 (restated).

Proof of Theorem 2.

Given the multi-class classification regret under a proper loss \ell in Eq. (4) and Proposition 2, we have:

reg(𝜼^;M,𝜼,)\displaystyle\mathrm{reg}(\hat{{\bm{\eta}}};M,{\bm{\eta}},\ell) :=CPE(𝜼^;D)CPE(𝜼;D)=𝔼M(x)[𝐁f(𝜼(x),𝜼^(x))]\displaystyle\mathrel{\mathop{:}}={\mathcal{L}}_{\text{CPE}}(\hat{{\bm{\eta}}};D)-{\mathcal{L}}_{\text{CPE}}({\bm{\eta}};D)=\mathbb{E}_{M(x)}[\mathbf{B}_{f}({\bm{\eta}}(x),\hat{{\bm{\eta}}}(x))]
=(i)i[k]πi𝔼pi(x)𝐁f(𝜼(x),𝜼^(x))=𝔼pk(x)(i[k]πipi(x)pk(x)𝐁f(𝜼(x),𝜼^(x)))\displaystyle\stackrel{{\scriptstyle(i)}}{{=}}\sum_{i\in[k]}\pi_{i}\mathbb{E}_{p_{i}(x)}\mathbf{B}_{f}({\bm{\eta}}(x),\hat{{\bm{\eta}}}(x))=\mathbb{E}_{p_{k}(x)}\left(\sum_{i\in[k]}\pi_{i}\frac{p_{i}(x)}{p_{k}(x)}\mathbf{B}_{f}({\bm{\eta}}(x),\hat{{\bm{\eta}}}(x))\right)
=(ii)𝔼pk(x)((πk+i[k1]πiri(x))𝐁f(𝜼(x),𝜼^(x)))\displaystyle\stackrel{{\scriptstyle(ii)}}{{=}}\mathbb{E}_{p_{k}(x)}\left(\left(\pi_{k}+\sum_{i\in[k-1]}\pi_{i}r_{i}(x)\right)\cdot\mathbf{B}_{f}({\bm{\eta}}(x),\hat{{\bm{\eta}}}(x))\right)
=(iii)πk𝔼pk(x)𝐁fπ(𝒓(x),𝒓^(x)),\displaystyle\stackrel{{\scriptstyle(iii)}}{{=}}\pi_{k}\mathbb{E}_{p_{k}(x)}\mathbf{B}_{f^{\circledast}_{\pi}}({\bm{r}}(x),\hat{{\bm{r}}}(x)),

where we use (i)(i) the definition of marginal distribution M(x)=i[k]πipi(x)M(x)=\sum_{i\in[k]}\pi_{i}p_{i}(x), (ii)(ii) the definition of density ratio that ri(x)=pi(x)/pk(x)r_{i}(x)=p_{i}(x)/p_{k}(x) x𝒳,i[k]\forall x\in{\mathcal{X}},i\in[k], and (iii)(iii) Proposition 2 with the consistent definitions of fπf^{\circledast}_{\pi} and 𝒓{\bm{r}}, 𝒓^\hat{{\bm{r}}} as stated in the theorem. ∎

A.3.1 Information Measure in Multi-class Experiments

In this section, we show that multi-distribution density ratio estimation can be viewed as estimating the statistical information measure (DeGroot, 1962) in multi-class experiments, under appropriate choices for the convex function ff.

We first introduce the following definitions in multi-class experiments. For 𝒑Δk{\bm{p}}\in\Delta^{k}, any proper loss function :[k]×Δk\ell:[k]\times\Delta^{k}\to\mathbb{R} induces a generalized entropy:

H(𝒑):=inf𝒒Δki[k]pi(i,𝒒),H_{\ell}({\bm{p}})\mathrel{\mathop{:}}=\inf_{{\bm{q}}\in\Delta^{k}}\sum_{i\in[k]}p_{i}\ell(i,{\bm{q}}),

which measures the uncertainty of the task. Given a multi-class experiment D=(𝝅,P1,,Pk)=(M,𝜼)D=({\bm{\pi}},P_{1},\ldots,P_{k})=(M,{\bm{\eta}}) and the generalized entropy H:ΔkH_{\ell}:\Delta^{k}\to\mathbb{R} (which is closed and concave), the information measure in a multi-class experiment (DeGroot, 1962; Duchi et al., 2018) is defined as the gap between the prior and posterior generalized entropy:

H(D)=H(𝝅)𝔼M(x)[H(𝜼(x))].\mathcal{I}_{H_{\ell}}(D)=H_{\ell}({\bm{\pi}})-\mathbb{E}_{M(x)}[H_{\ell}({\bm{\eta}}(x))].

We next introduce the following connections between multi-distribution ff-divergence, generalized entropy and information measure in multi-class experiments. Specifically, Duchi et al. (2018) proved an equivalence between the gap of prior and posterior Bayes risks and the multi-distribution ff-divergence induced by a convex function ff depending on \ell and the prior 𝝅{\bm{\pi}}, demonstrating the utility of multi-distribution ff-divergence for experimental design of multi-class classification.

Theorem 3 ((Duchi et al., 2018)).

Given a proper loss \ell, its associated generalized entropy HH_{\ell} and a multi-class distribution D=(𝛑,P1,,Pk)=(M,𝛈)D=({\bm{\pi}},P_{1},\ldots,P_{k})=(M,{\bm{\eta}}), we can define a closed convex function f,𝛑:+k1{±}f_{\ell,{\bm{\pi}}}:{\mathbb{R}}^{k-1}_{+}\to{\mathbb{R}}\cup\{\pm\infty\} as

f,𝝅(𝒕):=sup𝝂Δk(H(𝝅)i[k1]πi(i,𝝂)tiπk(k,𝝂))f_{\ell,{\bm{\pi}}}({\bm{t}})\mathrel{\mathop{:}}=\sup_{\bm{\nu}\in\Delta^{k}}\left(H_{\ell}({\bm{\pi}})-\sum_{i\in[k-1]}\pi_{i}\ell(i,\bm{\nu})t_{i}-\pi_{k}\ell(k,\bm{\nu})\right) (30)

We can then express the information measure of multi-class experiments as the multi-distribution ff-divergence induced by Eq. (30):

H(D)\displaystyle\mathcal{I}_{H_{\ell}}(D) =H(𝝅)𝔼M(x)[H(𝜼(x))]=inf𝝂Δki[k]πi(i,𝝂)inf𝜼^CPE(𝜼^;D)=H_{\ell}({\bm{\pi}})-\mathbb{E}_{M(x)}[H_{\ell}({\bm{\eta}}(x))]=\inf_{\bm{\nu}\in\Delta^{k}}\sum_{i\in[k]}\pi_{i}\ell(i,\bm{\nu})-\inf_{\hat{{\bm{\eta}}}}{\mathcal{L}}_{\text{CPE}}(\hat{{\bm{\eta}}};D)
=𝐃f,𝝅(P1,,Pk1Pk).\displaystyle=\mathbf{D}_{f_{\ell,{\bm{\pi}}}}(P_{1},\ldots,P_{k-1}\|P_{k}).

Given Theorem 3 and Proposition 1, we know that multi-distribution density ratio estimation by minimizing expected Bregman divergence (Eq. (14)), induced by the convex function f,𝝅f_{\ell,{\bm{\pi}}} defined in Eq. (30), corresponds to estimating the statistical information measure in multi-class classification experiments.

A.4 Examples of Multi-distribution DRE

A.4.1 Multi-class Logistic Regression

From Section 2.3, we know that there is a one-to-one correspondence between a class probability estimator and a density ratio estimator through the link and the inverse link function: 𝒓^=Ψdr𝜼^\hat{{\bm{r}}}=\Psi_{\mathrm{dr}}\circ\hat{{\bm{\eta}}} and 𝜼^=Ψdr1𝒓^\hat{{\bm{\eta}}}=\Psi_{\mathrm{dr}}^{-1}\circ\hat{{\bm{r}}}. When the class prior distribution 𝝅{\bm{\pi}} is uniform, we have:

r^i(x)=η^i(x)η^k(x)andη^i(x)=r^i(x)j=1kr^j(x),for alli[k],x𝒳.\hat{r}_{i}(x)=\frac{\hat{\eta}_{i}(x)}{\hat{\eta}_{k}(x)}~{}~{}\text{and}~{}~{}\hat{\eta}_{i}(x)=\frac{\hat{r}_{i}(x)}{\sum_{j=1}^{k}\hat{r}_{j}(x)},~{}~{}\text{for all}~{}i\in[k],x\in{\mathcal{X}}. (31)

To recover the loss of multi-class logistic regression used by Bickel et al. (2008), we choose the following convex function (where we use r^k=1\hat{r}_{k}=1):

f(r^1,,r^k1)=\displaystyle f(\hat{r}_{1},\ldots,\hat{r}_{k-1})= 1ki=1kr^ilog(r^ij=1kr^j)\displaystyle\frac{1}{k}\sum_{i=1}^{k}\hat{r}_{i}\log\left(\frac{\hat{r}_{i}}{\sum_{j=1}^{k}\hat{r}_{j}}\right) (32)
if(r^1,,r^k1)=\displaystyle\partial_{i}f(\hat{r}_{1},\ldots,\hat{r}_{k-1})= 1klog(r^ij=1kr^j)fori[k1]\displaystyle\frac{1}{k}\log\left(\frac{\hat{r}_{i}}{\sum_{j=1}^{k}\hat{r}_{j}}\right)\quad\text{for}~{}i\in[k-1] (33)

Thus the loss in Eq. (14) reduces to:

1k𝔼pk(x)[i=1k1r^i(x)logr^i(x)j=1kr^j(x)i=1k1r^i(x)logr^i(x)+(i=1kr^i(x))log(i=1kr^i(x))]\displaystyle\frac{1}{k}\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}\hat{r}_{i}(x)\log\frac{\hat{r}_{i}(x)}{\sum_{j=1}^{k}\hat{r}_{j}(x)}-\sum_{i=1}^{k-1}\hat{r}_{i}(x)\log\hat{r}_{i}(x)+\left(\sum_{i=1}^{k}\hat{r}_{i}(x)\right)\log\left(\sum_{i=1}^{k}\hat{r}_{i}(x)\right)\right]-
1ki=1k1𝔼pi(x)[log(r^i(x)j=1kr^j(x))]\displaystyle\frac{1}{k}\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\log\left(\frac{\hat{r}_{i}(x)}{\sum_{j=1}^{k}\hat{r}_{j}(x)}\right)\right]
=\displaystyle= 1k𝔼pk(x)[r^k(x)log(i=1kr^i(x))]1ki=1k1𝔼pi(x)[log(r^i(x)j=1kr^j(x))]\displaystyle\frac{1}{k}\mathbb{E}_{p_{k}(x)}\left[\hat{r}_{k}(x)\log\left(\sum_{i=1}^{k}\hat{r}_{i}(x)\right)\right]-\frac{1}{k}\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\log\left(\frac{\hat{r}_{i}(x)}{\sum_{j=1}^{k}\hat{r}_{j}(x)}\right)\right]
=(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{=}} (1ki=1k𝔼pi(x)[log(𝜼^i(x))])\displaystyle-\left(\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}_{p_{i}(x)}[\log(\hat{{\bm{\eta}}}_{i}(x))]\right)

where (i)(i) is because r^k(x)=1,x𝒳\hat{r}_{k}(x)=1,~{}\forall x\in{\mathcal{X}} and Eq. (31).
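The identity above can also be checked numerically: for any positive values standing in for 𝒓^(x)\hat{{\bm{r}}}(x) on each distribution's samples, the Bregman-style loss and the multi-class log loss coincide. A minimal NumPy sketch (the sample values are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
# r_hats[i][n] is the (k-1)-vector of ratio estimates at the n-th sample
# drawn from p_{i+1}; any positive values will do for this identity check.
r_hats = [rng.uniform(0.5, 2.0, size=(10, k - 1)) for _ in range(k)]

def with_rk(R):                       # append the fixed coordinate r^_k = 1
    return np.hstack([R, np.ones((len(R), 1))])

# Bregman-style loss derived from Eq. (32)
Rk = with_rk(r_hats[-1])
loss = np.mean(np.log(Rk.sum(1))) / k
for i in range(k - 1):
    Ri = with_rk(r_hats[i])
    loss -= np.mean(np.log(Ri[:, i] / Ri.sum(1))) / k

# Multi-class log loss on eta^ obtained via the inverse link in Eq. (31)
nll = 0.0
for i in range(k):
    Ri = with_rk(r_hats[i])
    nll -= np.mean(np.log(Ri[:, i] / Ri.sum(1))) / k

assert np.isclose(loss, nll)          # the two objectives agree pointwise
```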

When the class prior 𝝅{\bm{\pi}} is not uniform, from Section 2.3, we know that the link and inverse link connecting density ratio estimators and class probability estimators are:

r^i=πkπiη^iη^kandη^i=πir^ij[k]πjr^j,for alli[k],x𝒳.\hat{r}_{i}=\frac{\pi_{k}}{\pi_{i}}\cdot\frac{\hat{\eta}_{i}}{\hat{\eta}_{k}}\quad\text{and}\quad\hat{\eta}_{i}=\frac{\pi_{i}\hat{r}_{i}}{\sum_{j\in[k]}\pi_{j}\hat{r}_{j}},~{}\text{for all}~{}i\in[k],x\in{\mathcal{X}}. (34)
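A small NumPy sketch of this link / inverse-link pair, checking the round trip on illustrative values of the prior and class probabilities:

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])            # class prior, k = 3
eta = np.array([0.1, 0.6, 0.3])           # class probabilities at some x

# Link of Eq. (34): density ratio estimates, with r_k = 1 by construction
r = (pi[-1] / pi) * (eta / eta[-1])
# Inverse link: back to class probabilities
eta_back = (pi * r) / np.sum(pi * r)

assert np.isclose(r[-1], 1.0)
assert np.allclose(eta_back, eta)         # round trip recovers eta
```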

In this case, we choose the following convex function (where we use r^k=1\hat{r}_{k}=1):

f(r^1,,r^k1)=\displaystyle f(\hat{r}_{1},\ldots,\hat{r}_{k-1})= i=1kπir^ilogπir^i(i=1kπir^i)log(i=1kπir^i)\displaystyle\sum_{i=1}^{k}\pi_{i}\hat{r}_{i}\log\pi_{i}\hat{r}_{i}-\left(\sum_{i=1}^{k}\pi_{i}\hat{r}_{i}\right)\log\left(\sum_{i=1}^{k}\pi_{i}\hat{r}_{i}\right) (35)
if(r^1,,r^k1)=\displaystyle\partial_{i}f(\hat{r}_{1},\ldots,\hat{r}_{k-1})= πilog(πir^ij=1kπjr^j)fori[k1]\pi_{i}\log\left(\frac{\pi_{i}\hat{r}_{i}}{\sum_{j=1}^{k}\pi_{j}\hat{r}_{j}}\right)\quad\text{for}~{}i\in[k-1] (36)

Note that when 𝝅{\bm{\pi}} is the uniform distribution, Eq. (35) reduces to Eq. (32).

The loss in Eq. (14) reduces to:

𝔼pk(x)[πkr^k(x)log(i=1kπir^i(x))πkr^k(x)log(πkr^k(x))]i=1k1𝔼pi(x)[πilog(πir^i(x)j=1kπjr^j(x))]\displaystyle\mathbb{E}_{p_{k}(x)}\left[\pi_{k}\hat{r}_{k}(x)\log\left(\sum_{i=1}^{k}\pi_{i}\hat{r}_{i}(x)\right)-\pi_{k}\hat{r}_{k}(x)\log(\pi_{k}\hat{r}_{k}(x))\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\pi_{i}\log\left(\frac{\pi_{i}\hat{r}_{i}(x)}{\sum_{j=1}^{k}\pi_{j}\hat{r}_{j}(x)}\right)\right]
=\displaystyle= (i=1kπi𝔼pi(x)[log(η^i(x))])\displaystyle-\left(\sum_{i=1}^{k}\pi_{i}\mathbb{E}_{p_{i}(x)}[\log(\hat{\eta}_{i}(x))]\right)

which corresponds to the multi-class logistic regression loss for the class probability estimators 𝜼^\hat{{\bm{\eta}}}.

Remark. Interestingly, we note that the multi-distribution ff-divergence associated with the convex function in Eq. (32) is the multi-distribution Jensen-Shannon divergence (Garcia-Garcia & Williamson, 2012) (also known as the information radius (Sibson, 1969)) up to a constant of logk\log k:

𝐃f(P1,,Pk)\displaystyle\mathbf{D}_{f}(P_{1},\ldots,P_{k}) =𝔼pk(x)[f(p1(x)pk(x),,pk1(x)pk(x))]\displaystyle=\mathbb{E}_{p_{k}(x)}\left[f\left(\frac{p_{1}(x)}{p_{k}(x)},\ldots,\frac{p_{k-1}(x)}{p_{k}(x)}\right)\right]
=1k𝔼pk(x)[i=1kpi(x)pk(x)log(pi(x)j=1kpj(x))]\displaystyle=\frac{1}{k}\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k}\frac{p_{i}(x)}{p_{k}(x)}\log\left(\frac{p_{i}(x)}{\sum_{j=1}^{k}p_{j}(x)}\right)\right]
=1ki=1k𝔼pi(x)[log(pi(x)1kj=1kpj(x))]logk\displaystyle=\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}_{p_{i}(x)}\left[\log\left(\frac{p_{i}(x)}{\frac{1}{k}\sum_{j=1}^{k}p_{j}(x)}\right)\right]-\log k
=1ki=1kDKL(Pi1kj=1kPj)logk\displaystyle=\frac{1}{k}\sum_{i=1}^{k}D_{\mathrm{KL}}\left(P_{i}\|\frac{1}{k}\sum_{j=1}^{k}P_{j}\right)-\log k

A.4.2 Least-squares Importance Fitting

When the convex function associated with the Bregman divergence is chosen to be:

f(r^1,,r^k1)\displaystyle f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =12i=1k1(r^i1)2=12𝒓^𝟏2\displaystyle=\frac{1}{2}\sum_{i=1}^{k-1}(\hat{r}_{i}-1)^{2}=\frac{1}{2}\|\hat{{\bm{r}}}-\bm{1}\|^{2} (37)
if(r^1,,r^k1)\displaystyle\partial_{i}f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =r^i1fori[k1]\displaystyle=\hat{r}_{i}-1\quad\text{for}~{}i\in[k-1] (38)

The loss in Eq. (14) reduces to:

𝔼pk(x)[i=1k1(r^i2(x)r^i(x))12i=1k1(r^i2(x)2r^i(x)+1)]i=1k1𝔼pi(x)[r^i(x)1]\displaystyle\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}(\hat{r}_{i}^{2}(x)-\hat{r}_{i}(x))-\frac{1}{2}\sum_{i=1}^{k-1}(\hat{r}_{i}^{2}(x)-2\hat{r}_{i}(x)+1)\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\hat{r}_{i}(x)-1\right]
=\displaystyle= 12𝔼pk(x)[i=1k1(r^i2(x)1)]i=1k1𝔼pi(x)[r^i(x)1]\displaystyle\frac{1}{2}\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}(\hat{r}_{i}^{2}(x)-1)\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\hat{r}_{i}(x)-1\right]
=\displaystyle= 12i=1k1𝔼pk(x)[r^i2(x)12pi(x)pk(x)(r^i(x)1)]\displaystyle\frac{1}{2}\sum_{i=1}^{k-1}\mathbb{E}_{p_{k}(x)}\left[\hat{r}_{i}^{2}(x)-1-2\frac{p_{i}(x)}{p_{k}(x)}(\hat{r}_{i}(x)-1)\right]
=\displaystyle= 12i=1k1𝔼pk(x)[(r^i(x)ri(x))2]C\displaystyle\frac{1}{2}\sum_{i=1}^{k-1}\mathbb{E}_{p_{k}(x)}\left[(\hat{r}_{i}(x)-r_{i}(x))^{2}\right]-C

where C=12𝔼pk(x)[𝒓(x)𝟏2]C=\frac{1}{2}\mathbb{E}_{p_{k}(x)}\left[\|{\bm{r}}(x)-\bm{1}\|^{2}\right] is a constant w.r.t. 𝒓^\hat{{\bm{r}}}, so the minimizer of the above loss matches the true density ratios. This strictly generalizes the Least-Squares Importance Fitting (LSIF) (Kanamori et al., 2009) method to the multi-distribution case.
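On a finite support the expectations are exact, so the claim that the minimizer matches the true ratios can be checked directly. A small NumPy sketch with illustrative discrete distributions:

```python
import numpy as np

# Three discrete distributions on a 3-point support; p_3 is the denominator.
p = np.array([[0.2, 0.5, 0.3],     # p_1
              [0.6, 0.1, 0.3],     # p_2
              [0.4, 0.3, 0.3]])    # p_3 = p_k
r_true = p[:2] / p[2]              # true ratio vectors, shape (k-1, support)

def lsif_loss(r_hat):              # the simplified objective above, exactly
    term_k = 0.5 * np.sum(p[2] * (r_hat**2 - 1.0).sum(0))
    term_i = sum(np.sum(p[i] * (r_hat[i] - 1.0)) for i in range(2))
    return term_k - term_i

# The true ratios minimize the loss (it equals 0.5 * sum E[(r^ - r)^2] - C).
assert lsif_loss(r_true) <= lsif_loss(r_true + 0.1)
assert lsif_loss(r_true) <= lsif_loss(np.ones_like(r_true))
```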

A.4.3 KL Importance Estimation Procedure

When the convex function associated with the Bregman divergence is chosen to be:

f(r^1,,r^k1)\displaystyle f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =i=1k1(r^ilogr^ir^i)=𝒓^,log(𝒓^)𝒓^1\displaystyle=\sum_{i=1}^{k-1}(\hat{r}_{i}\log\hat{r}_{i}-\hat{r}_{i})=\langle\hat{{\bm{r}}},\log(\hat{{\bm{r}}})\rangle-\|\hat{{\bm{r}}}\|_{1} (39)
if(r^1,,r^k1)\displaystyle\partial_{i}f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =logr^ifori[k1]\displaystyle=\log\hat{r}_{i}\quad\text{for}~{}i\in[k-1] (40)

The loss in Eq. (14) reduces to:

𝔼pk(x)[i=1k1r^i(x)logr^i(x)i=1k1(r^i(x)logr^i(x)r^i(x))]i=1k1𝔼pi(x)[logr^i(x)]\displaystyle\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}\hat{r}_{i}(x)\log\hat{r}_{i}(x)-\sum_{i=1}^{k-1}(\hat{r}_{i}(x)\log\hat{r}_{i}(x)-\hat{r}_{i}(x))\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}[\log\hat{r}_{i}(x)]
=\displaystyle= 𝔼pk(x)[i=1k1r^i(x)]i=1k1𝔼pi(x)[logr^i(x)]\displaystyle\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}\hat{r}_{i}(x)\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}[\log\hat{r}_{i}(x)] (41)

This is equivalent to the following constrained optimization problem, with the term 𝔼pk(x)[r^i(x)]\mathbb{E}_{p_{k}(x)}[\hat{r}_{i}(x)] in Eq. (41) acting as a Lagrangian term for the normalization constraint:

argmin𝒓^:𝒳k1i=1k1DKL(pi(x)r^i(x)pk(x))=i=1k1𝔼pi(x)[log(pi(x)r^i(x)pk(x))]\displaystyle\operatorname*{arg\,min}_{\hat{{\bm{r}}}:{\mathcal{X}}\to{\mathbb{R}}^{k-1}}\sum_{i=1}^{k-1}D_{\mathrm{KL}}(p_{i}(x)\|\hat{r}_{i}(x)\cdot p_{k}(x))=\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\log\left(\frac{p_{i}(x)}{\hat{r}_{i}(x)\cdot p_{k}(x)}\right)\right]
=\displaystyle= argmin𝒓^:𝒳k1i=1k1𝔼pi(x)[logr^i(x)]\displaystyle\operatorname*{arg\,min}_{\hat{{\bm{r}}}:{\mathcal{X}}\to{\mathbb{R}}^{k-1}}-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}[\log\hat{r}_{i}(x)]
s.t.𝔼pk(x)[r^i(x)]=1andr^i(x)0,for alli[k1].\displaystyle\text{s.t.}\quad\mathbb{E}_{p_{k}(x)}[\hat{r}_{i}(x)]=1~{}~{}\text{and}~{}~{}\hat{r}_{i}(x)\geq 0,~{}~{}\text{for all}~{}i\in[k-1].

which strictly generalizes the Kullback–Leibler Importance Estimation Procedure (KLIEP) (Sugiyama et al., 2008) to the multi-distribution case.
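A small NumPy sketch, again on a finite support with illustrative discrete distributions, checking that the true ratios satisfy the KLIEP normalization constraint and minimize the objective in Eq. (41):

```python
import numpy as np

p = np.array([[0.2, 0.5, 0.3],          # p_1
              [0.6, 0.1, 0.3],          # p_2
              [0.4, 0.3, 0.3]])         # p_3 = p_k (denominator)
r_true = p[:2] / p[2]                   # true ratios, shape (k-1, support)

# The true ratios satisfy the constraint E_{p_k}[r_i] = 1 exactly.
assert np.allclose(p[2] @ r_true.T, 1.0)

def kliep_loss(r_hat):                  # the objective in Eq. (41), exactly
    return np.sum(p[2] * r_hat.sum(0)) - sum(
        np.sum(p[i] * np.log(r_hat[i])) for i in range(2))

# The true ratios minimize Eq. (41) among positive candidates.
assert kliep_loss(r_true) <= kliep_loss(r_true * 1.3)
assert kliep_loss(r_true) <= kliep_loss(np.full_like(r_true, 0.8))
```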

A.4.4 Basu’s Power Divergence for Robust DRE

For some α>1\alpha>1, we choose the following convex function (the α\alpha-th power of the α\alpha-norm of a vector):

f(r^1,,r^k1)\displaystyle f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =i=1k1r^iα=𝒓^αα\displaystyle=\sum_{i=1}^{k-1}\hat{r}_{i}^{\alpha}=\|\hat{{\bm{r}}}\|_{\alpha}^{\alpha} (42)
if(r^1,,r^k1)\displaystyle\partial_{i}f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =αr^iα1\displaystyle=\alpha\hat{r}_{i}^{\alpha-1} (43)

The loss in Eq. (14) reduces to:

𝔼pk(x)[i=1k1αr^iα(x)i=1k1r^iα(x)]i=1k1𝔼pi(x)[αr^iα1(x)]\displaystyle\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}\alpha\hat{r}_{i}^{\alpha}(x)-\sum_{i=1}^{k-1}\hat{r}_{i}^{\alpha}(x)\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\alpha\hat{r}_{i}^{\alpha-1}(x)\right]
=\displaystyle= i=1k1𝔼pk(x)[(α1)r^iα(x)]i=1k1𝔼pi(x)[αr^iα1(x)]\displaystyle\sum_{i=1}^{k-1}\mathbb{E}_{p_{k}(x)}\left[(\alpha-1)\hat{r}_{i}^{\alpha}(x)\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\alpha\hat{r}_{i}^{\alpha-1}(x)\right] (44)

To understand the robustness of this formulation, for each i[k1]i\in[k-1], we take the derivative of Eq. (44) w.r.t. the parameters of the density ratio model r^i\hat{r}_{i} and set it to zero; after dividing by α(α1)\alpha(\alpha-1), we get the following parameter estimation equation:

𝔼pk(x)[r^iα1(x)r^i(x)]𝔼pi(x)[r^iα2(x)r^i(x)]=𝟎\mathbb{E}_{p_{k}(x)}[\hat{r}_{i}^{\alpha-1}(x)\nabla\hat{r}_{i}(x)]-\mathbb{E}_{p_{i}(x)}[\hat{r}_{i}^{\alpha-2}(x)\nabla\hat{r}_{i}(x)]=\bm{0} (45)

Now we apply the same analysis to the multi-distribution KLIEP method in Eq. (41) and we get the following equation (for each i[k1]i\in[k-1]):

𝔼pk(x)[r^i(x)]𝔼pi(x)[r^i1(x)r^i(x)]=𝟎\mathbb{E}_{p_{k}(x)}[\nabla\hat{r}_{i}(x)]-\mathbb{E}_{p_{i}(x)}[\hat{r}_{i}^{-1}(x)\nabla\hat{r}_{i}(x)]=\bm{0} (46)

Comparing Eq. (45) with Eq. (46), we can see that the power divergence DRE method in Eq. (44) is a weighted version of the multi-distribution KLIEP method, where the weight r^iα1(x)\hat{r}_{i}^{\alpha-1}(x) controls the importance of the samples. In scenarios where outlier samples tend to have density ratios less than one, they will have less influence on the parameter estimation, which generalizes the binary Basu’s power divergence robust DRE method (Sugiyama et al., 2012) to the multi-distribution case. Another interesting observation is that when α1\alpha\to 1, Eq. (45) recovers the KLIEP estimating equation in Eq. (46); when α=2\alpha=2, the power divergence DRE loss in Eq. (44) recovers the multi-distribution LSIF method in Section A.4.2.
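The α=2\alpha=2 case can be checked numerically: on a finite support, the power-divergence loss in Eq. (44) equals twice the multi-distribution LSIF loss minus the constant k1k-1, so the two objectives share the same minimizer. A small NumPy sketch with illustrative discrete distributions:

```python
import numpy as np

p = np.array([[0.2, 0.5, 0.3],           # p_1
              [0.6, 0.1, 0.3],           # p_2
              [0.4, 0.3, 0.3]])          # p_3 = p_k (denominator), so k = 3

def power_loss(r_hat, alpha):            # Eq. (44), exact expectations
    return (alpha - 1) * np.sum(p[2] * (r_hat**alpha).sum(0)) \
        - alpha * sum(np.sum(p[i] * r_hat[i]**(alpha - 1)) for i in range(2))

def lsif_loss(r_hat):                    # Section A.4.2 objective
    return 0.5 * np.sum(p[2] * (r_hat**2 - 1.0).sum(0)) \
        - sum(np.sum(p[i] * (r_hat[i] - 1.0)) for i in range(2))

rng = np.random.default_rng(1)
r_hat = rng.uniform(0.5, 2.0, size=(2, 3))   # arbitrary positive candidate
# At alpha = 2 the two losses agree up to an affine transform (k - 1 = 2).
assert np.isclose(power_loss(r_hat, 2), 2 * lsif_loss(r_hat) - 2)
```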

A.4.5 More Examples

When the convex function is chosen to be the Log-Sum-Exp type function (for α>0\alpha>0):

f(r^1,,r^k1)\displaystyle f(\hat{r}_{1},\ldots,\hat{r}_{k-1}) =αlog(i=1k1exp(r^i/α))\displaystyle=\alpha\log\left(\sum_{i=1}^{k-1}\exp(\hat{r}_{i}/\alpha)\right) (47)
if(r^1,,r^k1)=\displaystyle\partial_{i}f(\hat{r}_{1},\ldots,\hat{r}_{k-1})= exp(r^i/α)j=1k1exp(r^j/α)\frac{\exp(\hat{r}_{i}/\alpha)}{\sum_{j=1}^{k-1}\exp(\hat{r}_{j}/\alpha)} (48)

The loss in Eq. (14) can be written as:

𝔼pk(x)[i=1k1r^i(x)exp(r^i(x)/α)j=1k1exp(r^j(x)/α)αlog(i=1k1exp(r^i(x)/α))]i=1k1𝔼pi(x)[exp(r^i(x)/α)j=1k1exp(r^j(x)/α)]\displaystyle\mathbb{E}_{p_{k}(x)}\left[\sum_{i=1}^{k-1}\frac{\hat{r}_{i}(x)\exp(\hat{r}_{i}(x)/\alpha)}{\sum_{j=1}^{k-1}\exp(\hat{r}_{j}(x)/\alpha)}-\alpha\log\left(\sum_{i=1}^{k-1}\exp(\hat{r}_{i}(x)/\alpha)\right)\right]-\sum_{i=1}^{k-1}\mathbb{E}_{p_{i}(x)}\left[\frac{\exp(\hat{r}_{i}(x)/\alpha)}{\sum_{j=1}^{k-1}\exp(\hat{r}_{j}(x)/\alpha)}\right]

We can similarly derive loss functions induced by other convex functions such as the quadratic function f(r^1,,r^k1)=𝒓^𝑯𝒓^+𝒒𝒓^f(\hat{r}_{1},\ldots,\hat{r}_{k-1})=\hat{{\bm{r}}}^{\top}{\bm{H}}\hat{{\bm{r}}}+{\bm{q}}^{\top}\hat{{\bm{r}}}, for some positive definite matrix 𝑯0{\bm{H}}\succ 0.

A.5 More Experimental Details

We provide more details about the problem setup of each task used in our empirical study.

For the synthetic data experiments, we use k=5k=5 multivariate Gaussian distributions with identity covariance matrix and different mean vectors:

𝝁1\displaystyle{\bm{\mu}}_{1} =(1,0,0,,0)d=(1,0,0,\ldots,0)\in{\mathbb{R}}^{d}
𝝁2\displaystyle{\bm{\mu}}_{2} =(1,0,0,,0)d=(-1,0,0,\ldots,0)\in{\mathbb{R}}^{d}
𝝁3\displaystyle{\bm{\mu}}_{3} =(0,1,0,,0)d=(0,1,0,\ldots,0)\in{\mathbb{R}}^{d}
𝝁4\displaystyle{\bm{\mu}}_{4} =(0,1,0,,0)d=(0,-1,0,\ldots,0)\in{\mathbb{R}}^{d}
𝝁5\displaystyle{\bm{\mu}}_{5} =(1,0,0,,0)d=(1,0,0,\ldots,0)\in{\mathbb{R}}^{d}

We use this design so that the density ratios are almost surely well-defined and the numerical optimization with respect to the canonical density ratio vector 𝒓^=(r^1,,r^k1)\hat{{\bm{r}}}=(\hat{r}_{1},\ldots,\hat{r}_{k-1}) is more stable. We use a Multi-Layer Perceptron (MLP) with two hidden layers (Linear(d,32)Linear(32,32)Linear(32,k1)\text{Linear}(d,32)\to\text{Linear}(32,32)\to\text{Linear}(32,k-1)) and ReLU activations to realize the density ratio model.
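For concreteness, a NumPy sketch of a ratio model with this architecture (random untrained weights; the exponential output parameterization is our illustrative choice to keep the ratio estimates positive, not necessarily the one used in the experiments):

```python
import numpy as np

def mlp_ratio_model(x, params):
    """MLP with ReLU hidden layers; output exponentiated to give positive ratios."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)     # ReLU
    W, b = params[-1]
    return np.exp(h @ W + b)               # (n, k-1) positive ratio estimates

d, k = 8, 5
rng = np.random.default_rng(0)
sizes = [(d, 32), (32, 32), (32, k - 1)]   # the architecture described above
params = [(0.1 * rng.standard_normal(s), np.zeros(s[1])) for s in sizes]

x = rng.standard_normal((4, d))            # a batch of 4 inputs
r_hat = mlp_ratio_model(x, params)
assert r_hat.shape == (4, k - 1) and np.all(r_hat > 0)
```

In practice the weights would be trained by minimizing one of the Bregman losses derived in Section A.4.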

For CIFAR-10 OOD detection experiments, we set k=4k=4 and construct each distribution as: p1p_{1} - samples labeled {airplane, automobile, bird}; p2p_{2} - samples labeled {cat, deer, dog, frog}; p3p_{3} - samples labeled {horse, ship, truck}; and p4p_{4} - a uniform mixture of these distributions. We use the standard convolutional neural network from the PyTorch tutorial (https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) with k1k-1 outputs to realize the density ratio model.

For MNIST multi-target generation experiments, we use k=6k=6 and construct each distribution as: p1p_{1} - samples labeled {0,1}; p2p_{2} - samples labeled {2,3}; p3p_{3} - samples labeled {4,5}; p4p_{4} - samples labeled {6,7}; p5p_{5} - samples labeled {8,9}; p6p_{6} - a mixture of these distributions. We use a convolutional neural network with two convolutional layers (Conv(1, 32, 3, 1) \to Conv(32, 64, 3, 1) \to Linear(9216, 128) \to Linear(128, k1k-1)) and ReLU activations to realize the density ratio model.

For multi-distribution off-policy policy evaluation experiments, we conducted experiments on the Half-Cheetah environment in OpenAI Gym (Brockman et al., 2016). We use the soft actor-critic (SAC) algorithm (Haarnoja et al., 2018) to obtain five different policies with average expected returns of {3811, 5277, 6444, 7397, 5728}, respectively, and we learn density ratios between their induced occupancy measures (state-action distributions). We use an MLP with three hidden layers (Linear(17,256)Linear(256,256)Linear(256,256)Linear(256,k1)\text{Linear}(17,256)\to\text{Linear}(256,256)\to\text{Linear}(256,256)\to\text{Linear}(256,k-1)) and ReLU activations to realize the density ratio model.