A Unified Framework for Multi-distribution Density Ratio Estimation
Abstract
Binary density ratio estimation (DRE), the problem of estimating the ratio between two probability densities given their empirical samples, provides the foundation for many state-of-the-art machine learning algorithms such as contrastive representation learning and covariate shift adaptation. In this work, we consider a generalized setting where, given samples from multiple distributions $P_1, \ldots, P_k$ (for $k > 2$), we aim to efficiently estimate the density ratios between all pairs of distributions. Such a generalization leads to important new applications such as estimating statistical discrepancies among multiple random variables, like the multi-distribution $f$-divergence, and bias correction via multiple importance sampling. We then develop a general framework from the perspective of Bregman divergence minimization, where each strictly convex multivariate function induces a proper loss for multi-distribution DRE. Moreover, we rederive the theoretical connection between multi-distribution density ratio estimation and class probability estimation, justifying the use of any strictly proper scoring rule composite with a link function for multi-distribution DRE. We show that our framework leads to methods that strictly generalize their counterparts in binary DRE, as well as new methods that show comparable or superior performance on various downstream tasks.
1 Introduction
Estimating the density ratio between two distributions based on their empirical samples is a central problem in machine learning, which continuously drives progress in this field and finds applications in many machine learning tasks such as anomaly detection (Hido et al., 2008; Smola et al., 2009; Hido et al., 2011), importance weighting in covariate shift adaptation (Huang et al., 2006; Sugiyama et al., 2007), generative modeling (Uehara et al., 2016; Nowozin et al., 2016; Grover et al., 2019), two-sample testing (Sugiyama et al., 2011; Gretton et al., 2012), and mutual information estimation and representation learning (Oord et al., 2018; Hjelm et al., 2018). It is such a powerful paradigm because computing the density ratio focuses on extracting and preserving the contrastive information between two distributions, which is crucial in many tasks. Despite the tremendous success of binary DRE, many applications involve more than two probability distributions, and developing density ratio estimation methods for multiple distributions has the potential to advance various applications such as estimating multi-distribution statistical discrepancy measures (Garcia-Garcia & Williamson, 2012), multi-domain transfer learning, bias correction and variance reduction with multiple importance sampling (Elvira et al., 2019), multi-marginal generative modeling (Cao et al., 2019), and multilingual machine translation (Dong et al., 2015; Aharoni et al., 2019).
Although recent years have witnessed significant progress and a continuously increasing interest in developing more sophisticated methods for binary DRE (Sugiyama et al., 2012; Liu et al., 2017; Rhodes et al., 2020; Kato & Teshima, 2021; Choi et al., 2021), methods for estimating density ratios among multiple distributions remain largely unexplored. The main exception is an empirical exploration of multi-class logistic regression for multi-task learning (Bickel et al., 2008), where density ratios serve as resampling weights between the distribution of a pool of examples from multiple tasks and the target distribution of the task at hand, leading to significant accuracy improvements in HIV therapy screening experiments.
In this work, we propose a unified framework based on expected Bregman divergence minimization, where any strictly convex multivariate function induces a proper loss for multi-distribution DRE, thus generalizing the framework of (Sugiyama et al., 2012) to the multi-distribution case. Moreover, by directly generalizing the Bregman identity of (Menon & Ong, 2016) to the multivariate case, we rederive a result similar to (Nock et al., 2016), which formally relates losses for multi-distribution density ratio estimation and class probability estimation and theoretically justifies the use of any strictly proper scoring rule (e.g., the logarithm score (Good, 1952), the Brier score (Brier et al., 1950) and the pseudo-spherical score (Good, 1971)) composite with a link function for multi-distribution DRE. By choosing a variety of specific convex functions or proper scoring rules, we show that our unified framework leads to methods that strictly generalize their counterparts for binary DRE, as well as new objectives specific to multi-distribution DRE. We demonstrate the effectiveness of our framework, and study and compare the empirical performance of its different instantiations on various downstream tasks that rely on accurate multi-distribution density ratio estimation.
2 Preliminaries
2.1 Multi-class Experiments
In multi-class experiments, we have a pair of random variables $(X, Y)$ with joint distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the sample space and $\mathcal{Y} = [k] := \{1, \ldots, k\}$ is the finite label space. Define the probability simplex as $\Delta_k := \{q \in \mathbb{R}_{\geq 0}^k : \sum_{y=1}^k q_y = 1\}$. According to the chain rule of probability, any joint distribution can be decomposed into class priors $\pi_y := P(Y = y)$ and class conditionals $P_y := P(X \mid Y = y)$ for $y \in [k]$, or into the sample marginal $P_X$ and the class probability function $\eta_y(x) := P(Y = y \mid X = x)$. We write $\eta(x) := (\eta_1(x), \ldots, \eta_k(x)) \in \Delta_k$ as a vector and omit the argument $x$ when it is clear from context. Thus we can also represent the joint distribution as $(\pi, \{P_y\}_{y=1}^k)$ (where $\pi \in \Delta_k$) or $(P_X, \eta)$. For any $y \in [k]$, we assume $P_y$ has density $p_y$ with respect to the Lebesgue measure.
Remark on notation. To avoid confusion, we emphasize that the class probability is denoted as $\eta_y(x) = P(Y = y \mid X = x)$ and the class conditional is denoted as $P_y = P(X \mid Y = y)$ with density $p_y$. The former satisfies the normalization constraint $\sum_{y=1}^k \eta_y(x) = 1$, while in the latter $y$ only serves as the index for the different distributions.
In multi-class classification, given independent and identically distributed (i.i.d.) samples from the joint distribution $P$, we want to learn a probabilistic classifier $\hat{\eta}: \mathcal{X} \to \Delta_k$ to approximate the true class probability function $\eta$ by minimizing the following $\ell$-risk:
$\mathbb{E}_{(X,Y) \sim P}\big[\ell(Y, \hat{\eta}(X))\big] = \mathbb{E}_{X \sim P_X}\big[L(\eta(X), \hat{\eta}(X))\big], \quad L(\eta, \hat{\eta}) := \sum_{y=1}^{k} \eta_y\, \ell(y, \hat{\eta}),$   (1)
where $\ell(y, \hat{\eta})$ is the loss incurred for using the class predictor $\hat{\eta}$ when the true class is $y$, and $L(\eta, \hat{\eta})$ is the expected loss of $\hat{\eta}$ under the true class probability $\eta$.
Definition 1 (Proper loss).
A loss function $\ell$ is proper if the corresponding expected loss satisfies $L(\eta, \eta) \leq L(\eta, \hat{\eta})$ for all $\eta, \hat{\eta} \in \Delta_k$. It is strictly proper if the equality holds only when $\hat{\eta} = \eta$.
In statistical decision theory (Gneiting & Raftery, 2007), the negative of a proper loss is also called a proper scoring rule (i.e., $S(y, \hat{\eta}) = -\ell(y, \hat{\eta})$), which assesses the utility of the prediction. Properness of a loss is desirable in multi-class classification because it encourages the class probability estimator $\hat{\eta}$ to match the true class probability function $\eta$. An important property of proper losses is summarized in Theorem 1 below; we first recall the definition of Bregman divergence.
Definition 2 (Bregman divergence).
Given a differentiable convex function $f$ defined on a convex set $\Omega \subseteq \mathbb{R}^d$ and two points $a, b \in \Omega$, the Bregman divergence from $b$ to $a$ is defined as:
$B_f(a, b) := f(a) - f(b) - \langle \nabla f(b),\, a - b \rangle.$   (2)
Theorem 1 ((Gneiting & Raftery, 2007); Proposition 7 in (Vernet et al., 2011)).
Given a proper loss $\ell$ and the corresponding expected loss $L$, for any $\eta \in \Delta_k$, the generalized entropy function $\underline{L}(\eta) := L(\eta, \eta)$ is concave; when $\underline{L}$ is differentiable, the regret or excess risk of a predictor $\hat{\eta}$ over the Bayes-optimal $\eta$ is the Bregman divergence induced by the convex function $-\underline{L}$:
$\mathrm{regret}(\eta, \hat{\eta}) := L(\eta, \hat{\eta}) - L(\eta, \eta) = B_{-\underline{L}}(\eta, \hat{\eta}).$   (3)
2.2 Multi-distribution $f$-Divergence
Csiszár's $f$-divergence is a popular way to measure the discrepancy between two probability distributions. Specifically, given two distributions $P, Q$ with densities $p, q$ and a convex function $f$ satisfying $f(1) = 0$, the $f$-divergence between $P$ and $Q$ is defined as $D_f(P \,\|\, Q) := \mathbb{E}_{x \sim Q}\big[f\big(p(x)/q(x)\big)\big]$. In the following, we will introduce the multi-distribution extension of the $f$-divergence (Garcia-Garcia & Williamson, 2012).
Definition 3 (Multi-distribution $f$-divergence).
For probability distributions $P_1, \ldots, P_k$ on a common probability space with densities $p_1, \ldots, p_k$, given a multivariate closed convex function $f: \mathbb{R}_{\geq 0}^{k-1} \to \mathbb{R}$ satisfying $f(\mathbf{1}) = 0$, the multi-distribution $f$-divergence between $P_1, \ldots, P_{k-1}$ and $P_k$ is defined as:
$D_f(P_1, \ldots, P_{k-1} \,\|\, P_k) := \mathbb{E}_{x \sim P_k}\Big[f\Big(\tfrac{p_1(x)}{p_k(x)}, \ldots, \tfrac{p_{k-1}(x)}{p_k(x)}\Big)\Big].$   (5)
2.3 Connecting Density Ratios and Class Probabilities via Link Function
Inspired by the definition in Eq. (5), we consider the following canonical density ratio vector (more discussion about this choice can be found in Section 3.2): $r(x) := (r_1(x), \ldots, r_{k-1}(x))$, where $r_y(x) := p_y(x)/p_k(x)$ and we use the convention $r_k(x) \equiv 1$. Then we can connect a density ratio vector and a class probability vector via an invertible link function.
According to Bayes’ theorem, we have:
$\eta_y(x) = P(Y = y \mid X = x) = \frac{\pi_y\, p_y(x)}{\sum_{y'=1}^{k} \pi_{y'}\, p_{y'}(x)}, \quad y \in [k].$   (6)
Thus we define the following multi-distribution link function $\Psi: \Delta_k \to \mathbb{R}_{\geq 0}^{k-1}$ as a natural generalization of the binary DRE link function (Menon & Ong, 2016; Vernet et al., 2011):
$r_y(x) = \Psi_y(\eta(x)) := \frac{\pi_k}{\pi_y} \cdot \frac{\eta_y(x)}{\eta_k(x)}, \quad y \in [k-1].$   (7)
Given Eq. (7) and the normalization constraint $\sum_{y=1}^{k} \eta_y(x) = 1$, we obtain the inverse link function:
$\eta_y(x) = \Psi_y^{-1}(r(x)) = \frac{\pi_y\, r_y(x)}{\sum_{y'=1}^{k} \pi_{y'}\, r_{y'}(x)}, \quad y \in [k], \text{ with } r_k(x) \equiv 1.$   (8)
Thus, given knowledge of the prior distribution $\pi$ (which can also be easily estimated from empirical samples), one can transform a class probability estimator into a density ratio estimator via $\Psi$ and vice versa via $\Psi^{-1}$.
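To make the link concrete, the following is a minimal NumPy sketch of Eqs. (7)-(8); it is an illustration rather than the authors' code, and it assumes the convention $r_k(x) = 1$ described above (function names are hypothetical):

```python
import numpy as np

def link(eta, pi):
    """Eq. (7): map class probabilities eta (length k) to canonical ratios r_1, ..., r_{k-1}."""
    return (pi[-1] * eta[:-1]) / (pi[:-1] * eta[-1])

def inverse_link(r, pi):
    """Eq. (8): map canonical ratios r_1, ..., r_{k-1} back to class probabilities (length k)."""
    r_full = np.append(r, 1.0)          # append r_k = p_k / p_k = 1
    unnormalized = pi * r_full
    return unnormalized / unnormalized.sum()
```

For instance, with a uniform prior the inverse link simply normalizes the vector $(\hat{r}_1, \ldots, \hat{r}_{k-1}, 1)$.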
3 A Unified Framework for Multi-distribution DRE
3.1 Multi-distribution Density Ratio Estimation Problem Setup
Following the basic formulation of multi-class experiments in Section 2.1, we now introduce the problem setup of multi-distribution density ratio estimation (DRE). Recall that $\mathcal{X}$ is the common data domain and $P_1, \ldots, P_k$ are different distributions defined on $\mathcal{X}$ with densities $p_1, \ldots, p_k$. Suppose we are given i.i.d. samples $\mathcal{D}_y = \{x_i^{(y)}\}_{i=1}^{n_y}$ from each distribution $P_y$. The goal of multi-distribution DRE is to estimate the density ratios $p_i(x)/p_j(x)$ between all pairs of distributions from the i.i.d. datasets $\mathcal{D}_1, \ldots, \mathcal{D}_k$. In this paper, we assume that the density ratios are always well-defined on the domain $\mathcal{X}$ (e.g., when the distributions have strictly positive densities), which is also a common assumption in the binary DRE problem (Kanamori et al., 2009; Kato & Teshima, 2021).
A naive approach to this problem is to separately estimate each density $\hat{p}_y$ from $\mathcal{D}_y$ and then plug in the estimates to get $\hat{p}_i(x)/\hat{p}_j(x)$. However, as previous theoretical works (Kpotufe, 2017; Nguyen et al., 2007; Kanamori et al., 2012; Que & Belkin, 2013) suggest, directly estimating density ratios has many advantages in practical settings. Specifically, we know that (1) optimal convergence rates depend only on the smoothness of the density ratio and not on the densities; (2) optimal rates depend only on the intrinsic dimension of the data, thus escaping the curse of dimensionality in density estimation. Inspired by these observations in binary DRE, this paper aims to develop a general framework for directly estimating multi-distribution density ratios. Moreover, in Section 4 we also theoretically prove that various interesting facts (Menon & Ong, 2016; Sugiyama et al., 2012), which hold in the binary case, extend to our multi-distribution case.
While most previous works focus on DRE in the binary case, multi-distribution DRE has many important downstream applications. For example, given any integrable function $g$, suppose we want to use importance sampling with a proposal distribution $Q$ (with density $q$) to estimate the expectation of $g$ with respect to a target distribution $P$ with density $p$ w.r.t. the base measure:
$\mathbb{E}_{x \sim P}[g(x)] = \mathbb{E}_{x \sim Q}\Big[\frac{p(x)}{q(x)}\, g(x)\Big],$   (9)
where we use the density ratio $p(x)/q(x)$ to correct the bias caused by using samples from the proposal distribution $Q$ rather than the target distribution $P$. However, in practice, finding a good proposal is critical yet challenging (Owen & Zhou, 2000). An alternative and more robust strategy is to use a population of different proposals (sampling schemes) and a set of density ratios to correct the bias, which is also known as multiple importance sampling (MIS) (Cappé et al., 2004; Elvira et al., 2015). Given $m$ different proposals $Q_1, \ldots, Q_m$ with densities $q_1, \ldots, q_m$, the MIS estimate of the expectation is given by:
$\widehat{\mathbb{E}_{P}[g]} = \sum_{j=1}^{m} \alpha_j \cdot \frac{1}{n_j} \sum_{i=1}^{n_j} \frac{p(x_i^{(j)})}{q_j(x_i^{(j)})}\, g(x_i^{(j)}), \quad x_i^{(j)} \sim Q_j,$   (10)
where $\alpha_j$ is the weight for each proposal and satisfies $\sum_{j=1}^{m} \alpha_j = 1$. Thus a more efficient and accurate multi-distribution DRE method will lead to better MIS. In the context of multi-source off-policy policy evaluation (Kallus et al., 2021), the proposals correspond to a set of demonstration policies and the target distribution is induced by the query policy whose performance we want to evaluate from the offline multi-source demonstrations; in the context of multi-domain transfer learning (covariate shift adaptation) (Bickel et al., 2008; Dinh et al., 2013), the proposals correspond to a set of data generating distributions (e.g., multiple source domains or various data augmentation strategies) and the target is the test distribution we care about. Estimating multi-distribution density ratios also allows us to compute important information quantities among multiple random variables such as the multi-distribution $f$-divergence in Eq. (5), which can be used to analyze various kinds of discrepancies and correlations between multiple random variables and further has the potential of inspiring new generative models for the multiple marginal matching problem (Cao et al., 2019).
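As a concrete illustration, below is a hedged sketch of one common form of the MIS estimator in Eq. (10), where the true ratios $p(x)/q_j(x)$ are replaced by learned density ratio estimates; the function and variable names are illustrative, not part of the paper:

```python
import numpy as np

def mis_estimate(g, proposal_samples, ratio_fns, weights):
    """proposal_samples[j]: array of draws from proposal Q_j;
    ratio_fns[j](x): (learned) estimate of p(x) / q_j(x), assumed vectorized;
    g: integrand, assumed vectorized; weights: nonnegative and summing to one."""
    estimate = 0.0
    for x_j, ratio_j, w_j in zip(proposal_samples, ratio_fns, weights):
        estimate += w_j * np.mean(ratio_j(x_j) * g(x_j))
    return estimate
```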
3.2 Multi-distribution DRE via Bregman Divergence Minimization
Inspired by the success of Bregman divergence minimization for unifying various DRE methods in the binary case (Sugiyama et al., 2012), in this section, we propose a general framework for solving the multi-distribution density ratio estimation problem. First, we discuss our modeling choices. Although our goal is to estimate the density ratios between all possible pairs of distributions, the solution set actually has only $k-1$ degrees of freedom (e.g., $p_i(x)/p_j(x) = r_i(x)/r_j(x)$ with $r_y(x) = p_y(x)/p_k(x)$). Thus, without loss of generality, we parametrize density ratio models $\hat{r}_\theta(x) = (\hat{r}_1(x), \ldots, \hat{r}_{k-1}(x))$ to approximate the true canonical density ratios $r_y(x) = p_y(x)/p_k(x)$ for $y \in [k-1]$. For simplicity of notation, we will omit the dependence on the parameters $\theta$ and write our density ratio models as $\hat{r}$. An advantage of this modeling choice is that any density ratio can be recovered within one step of computation ($\hat{p}_i(x)/\hat{p}_j(x) = \hat{r}_i(x)/\hat{r}_j(x)$, with the convention $\hat{r}_k(x) \equiv 1$), thus avoiding large compounding errors while naturally ensuring consistency within the solution set (i.e., if we parametrized $p_i/p_j$, $p_j/p_l$ and $p_i/p_l$ separately, we would have to make sure they satisfy $\frac{p_i}{p_j} \cdot \frac{p_j}{p_l} = \frac{p_i}{p_l}$).
Since our goal is to optimize $\hat{r}$ to approximate the true density ratios $r$, we consider using the Bregman divergence (Def. 2) to measure the discrepancy between $r(x)$ and $\hat{r}(x)$. Specifically, for any strictly convex function $f$ and any $x \in \mathcal{X}$, we have the following point-wise discrepancy:
$B_f\big(r(x), \hat{r}(x)\big) = f\big(r(x)\big) - f\big(\hat{r}(x)\big) - \big\langle \nabla f\big(\hat{r}(x)\big),\, r(x) - \hat{r}(x) \big\rangle,$   (11)
which corresponds to the difference between the value of $f$ at $r(x)$ and the value of the first-order Taylor expansion of $f$ around the point $\hat{r}(x)$ evaluated at the point $r(x)$. Although the current formulation can be understood as a regression problem from $x$ to the true density ratios $r(x)$, we actually only have i.i.d. samples from $P_1, \ldots, P_k$ instead of the true targets $r(x)$. In this case, we consider using the following expected Bregman divergence (under the reference distribution $P_k$) to measure the overall discrepancy from the true density ratios $r$ to the density ratio models $\hat{r}$:
$\mathbb{E}_{x \sim P_k}\big[B_f(r(x), \hat{r}(x))\big] = \mathbb{E}_{x \sim P_k}\big[f(r(x))\big] - \mathbb{E}_{x \sim P_k}\big[f(\hat{r}(x)) + \langle \nabla f(\hat{r}(x)),\, r(x) - \hat{r}(x)\rangle\big]$   (12)
$= C + \mathbb{E}_{x \sim P_k}\big[\langle \nabla f(\hat{r}(x)),\, \hat{r}(x)\rangle - f(\hat{r}(x))\big] - \sum_{y=1}^{k-1} \mathbb{E}_{x \sim P_y}\big[\partial_y f(\hat{r}(x))\big],$   (13)
where $C := \mathbb{E}_{x \sim P_k}[f(r(x))]$ is a constant with respect to $\hat{r}$ and the last equality comes from the fact that $\mathbb{E}_{x \sim P_k}\big[\partial_y f(\hat{r}(x))\, r_y(x)\big] = \mathbb{E}_{x \sim P_y}\big[\partial_y f(\hat{r}(x))\big]$ according to the definition of $r_y$. The rationale behind the above choice is that it allows us to get an unbiased estimate of the discrepancy between $r$ and $\hat{r}$ (up to the constant $C$) using only i.i.d. samples from $P_1, \ldots, P_k$. Specifically, since $C$ is a constant, we have the following optimization problem over $\hat{r}$ to approximate the true density ratios (where each expectation can be empirically estimated using samples from the corresponding distribution):
$\min_{\hat{r}}\; \mathbb{E}_{x \sim P_k}\big[\langle \nabla f(\hat{r}(x)),\, \hat{r}(x)\rangle - f(\hat{r}(x))\big] - \sum_{y=1}^{k-1} \mathbb{E}_{x \sim P_y}\big[\partial_y f(\hat{r}(x))\big].$   (14)
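As a sketch of how Eq. (14) can be estimated from samples alone, the following PyTorch snippet evaluates the objective for an arbitrary differentiable strictly convex $f$ via automatic differentiation. The interface of the ratio model and the batch layout are assumptions made for illustration; this is not the authors' implementation:

```python
import torch

def bregman_dre_loss(f, ratio_model, batches):
    """f: strictly convex function mapping (n, k-1) ratio vectors to (n,) values.
    ratio_model: maps a batch of inputs to k-1 positive ratio estimates.
    batches: list of k tensors; batches[y] holds samples from P_{y+1}, the last from P_k."""
    k = len(batches)
    # Terms estimated with samples from the reference distribution P_k.
    r_ref = ratio_model(batches[-1])                                    # (n_k, k-1)
    f_ref = f(r_ref)
    grad_ref = torch.autograd.grad(f_ref.sum(), r_ref, create_graph=True)[0]
    loss = ((grad_ref * r_ref).sum(dim=1) - f_ref).mean()
    # Terms estimated with samples from P_1, ..., P_{k-1}.
    for y in range(k - 1):
        r_y = ratio_model(batches[y])
        grad_y = torch.autograd.grad(f(r_y).sum(), r_y, create_graph=True)[0]
        loss = loss - grad_y[:, y].mean()
    return loss
```

Closed-form instantiations (e.g., Multi-LSIF below) avoid the extra gradient computation, but this generic form makes the role of the convex function explicit.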
Interestingly, the above multi-distribution DRE formulation, which is based on Bregman divergence minimization, can alternatively be derived from the perspective of variational estimation of the multi-distribution $f$-divergence. In the following, we briefly discuss this interpretation of Eq. (14).
Based on Fenchel duality, we can represent any strictly convex function $f$ through its convex conjugate $f^*$ as:
$f(r) = \sup_{t \in \mathbb{R}^{k-1}}\; \langle t, r \rangle - f^*(t).$   (15)
In order to estimate the multi-distribution $f$-divergence defined in Eq. (5) using only samples from $P_1, \ldots, P_k$ (instead of their density information), we consider the following variational representation of the multi-distribution $f$-divergence obtained by substituting Eq. (15) into Eq. (5):
$D_f(P_1, \ldots, P_{k-1} \,\|\, P_k) = \sup_{t: \mathcal{X} \to \mathbb{R}^{k-1}}\; \sum_{y=1}^{k-1} \mathbb{E}_{x \sim P_y}\big[t_y(x)\big] - \mathbb{E}_{x \sim P_k}\big[f^*(t(x))\big].$   (16)
We then have the following proposition revealing the equivalence between the optimization problems in Eq. (14) and Eq. (16).
Proposition 1 (DRE via variational estimation of the multi-distribution $f$-divergence). Under the change of variable $t(x) = \nabla f(\hat{r}(x))$, the optimization problem in Eq. (16) is equivalent to the one in Eq. (14), and the optimal solution of both recovers the true canonical density ratios $r$.
4 Connecting Losses for Multi-class Classification and DRE
In this section, we rederive a result similar to (Nock et al., 2016) by directly generalizing the Bregman identity of (Menon & Ong, 2016) to the multivariate case, which establishes the theoretical connection between multi-distribution DRE and multi-class classification.
In Section 2.1, we have shown that the exact minimization of the excess risk for any strictly proper loss results in the true class probability function $\eta$, and consequently gives us the true density ratios through the link function $\Psi$. In the following, we take a further step to show that the procedure of minimizing any strictly proper loss is essentially equivalent to minimizing an expected Bregman divergence between the true density ratios $r$ and the approximate density ratios $\hat{r}$, thus generalizing the theoretical results in the binary case (Menon & Ong, 2016) to the multi-distribution case and justifying the validity of using any strictly proper scoring rule (e.g., the Brier score (Brier et al., 1950) and the pseudo-spherical score (Good, 1971)) for multi-distribution DRE. All proofs for this section can be found in Appendix A.3.
We start by introducing the following multivariate Bregman identity.
Lemma 1 (Multivariate Bregman Identity).
Given a convex function , we can define an associated function . We can show that (i) is convex and (ii) for any , their associated Bregman divergences satisfy:
(17) |
One can then apply Lemma 1 with appropriate choices of the convex function and its arguments for each $x$ to establish the following connection between the optimality gap of density ratio estimators and class probability estimators, where we use $u \odot v$ to denote the element-wise product between vectors $u$ and $v$, and $u_{1:k-1}$ to denote the restriction of a vector $u$ to its first $k-1$ coordinates.
Proposition 2.
For any convex function $f$ and two density ratio vectors $r(x)$ and $\hat{r}(x)$, one can construct the corresponding class probability vectors $\eta(x)$ and $\hat{\eta}(x)$ through the inverse link function in Eq. (8), and obtain:
(18) |
where we define the convex function induced by some prior distribution as
(19) |
Combining Proposition 2 with the Bregman divergence representation of the point-wise regret of a proper loss for multi-class classification in Theorem 1, we provide the following main theorem that interprets the minimization of the multi-class classification regret as multi-distribution DRE under expected Bregman divergence minimization.
Theorem 2 (informal). For any strictly proper loss $\ell$ and prior $\pi$, the regret of the composite class probability estimator $\Psi^{-1}(\hat{r})$ equals an expected Bregman divergence between the true canonical density ratios $r$ and the density ratio model $\hat{r}$, induced by a convex function determined by $\ell$ and $\pi$; hence minimizing a strictly proper loss is equivalent to multi-distribution DRE in the sense of Eq. (14).
Theorem 2 generalizes a known equivalence between density ratio estimation and class probability estimation in the binary case (see Section 5 in (Menon & Ong, 2016)), and provides a similar equivalence for the more complicated multi-class experiments. Besides, in comparison to the binary-case result, we also provide a simpler proof, loosen the assumption on the twice-differentiability of the convex function induced by the proper loss (i.e., $-\underline{L}$; see Theorem 1 for more details), and generalize the argument to an arbitrary prior distribution instead of the uniform prior considered in (Menon & Ong, 2016).
Moreover, we note that the multi-distribution $f$-divergence among class conditionals also corresponds to the statistical information measure in multi-class experiments (DeGroot, 1962), defined as the gap between the prior and posterior generalized entropy. Since we have established the equivalence between multi-distribution DRE (Eq. (14)) and variational estimation of the multi-distribution $f$-divergence (Eq. (16)), we can show that, by choosing particular convex functions (associated with the loss for multi-class classification), multi-distribution DRE can be viewed as estimating the statistical information measure in multi-class experiments. See detailed discussions in Appendix A.3.1.
5 Examples of Multi-distribution DRE
In the binary density-ratio matching under Bregman divergence framework (Sugiyama et al., 2012), one can choose various convex functions to recover popular binary DRE methods such as KLIEP (Sugiyama et al., 2008), LSIF (Kanamori et al., 2009) and logistic regression (Franklin, 2005). In this section, we provide some instantiations of our multi-distribution DRE framework. Specifically, Section 3.2 suggests that any strictly convex multivariate function induces a proper loss for multi-distribution DRE, and Section 4 justifies that any strictly proper scoring rule composite with the link function $\Psi$ can also be used for multi-distribution DRE. We briefly discuss some choices of the convex function or proper scoring rule, and we provide detailed derivations in Appendix A.4.
5.1 Methods Induced by Convex Functions
Multi-class Logistic Regression. From Section 2.3, we know that there is a one-to-one correspondence between a class probability estimator and a density ratio estimator: $\hat{r} = \Psi(\hat{\eta})$ and $\hat{\eta} = \Psi^{-1}(\hat{r})$. For clarity of presentation, here we assume the class prior distribution is uniform, so that $\hat{\eta}_y(x) = \hat{r}_y(x) / \sum_{y'=1}^{k} \hat{r}_{y'}(x)$ with $\hat{r}_k(x) \equiv 1$. To recover the loss of multi-class logistic regression, we choose the convex function $f$ given in Appendix A.4.1. In this case, the loss in Eq. (14) reduces to:
$-\sum_{y=1}^{k} \mathbb{E}_{x \sim P_y}\Big[\log \frac{\hat{r}_y(x)}{\sum_{y'=1}^{k} \hat{r}_{y'}(x)}\Big], \quad \text{with } \hat{r}_k(x) \equiv 1.$   (21)
We provide discussions for the general case (non-uniform prior $\pi$) in Appendix A.4.1. Interestingly, we note that the above convex function also gives rise to the multi-distribution Jensen-Shannon divergence (Garcia-Garcia & Williamson, 2012), also known as the information radius (Sibson, 1969), up to an additive constant.
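A minimal sketch of this instantiation under a uniform prior follows; it assumes, for illustration, that the model outputs log-ratios $h_y(x) = \log \hat{r}_y(x)$, so that the class probabilities are a softmax with a zero logit appended for the reference class:

```python
import torch
import torch.nn.functional as F

def multi_lr_loss(log_ratio_model, batches):
    """batches[y] holds samples from P_{y+1}; the last batch comes from the reference P_k."""
    loss, k = 0.0, len(batches)
    for y, x in enumerate(batches):
        h = log_ratio_model(x)                                        # (n, k-1) log-ratio estimates
        zeros = torch.zeros(h.shape[0], 1, device=h.device, dtype=h.dtype)
        logits = torch.cat([h, zeros], dim=1)                         # log r_k = 0 for the reference class
        loss = loss - F.log_softmax(logits, dim=1)[:, y].mean() / k
    return loss
```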
Multi-LSIF. When the convex function associated with the Bregman divergence is chosen to be a quadratic function of $\hat{r}$ (see Appendix A.4.2), the loss in Eq. (14) reduces to:
$\frac{1}{2}\, \mathbb{E}_{x \sim P_k}\big[\|\hat{r}(x)\|_2^2\big] - \sum_{y=1}^{k-1} \mathbb{E}_{x \sim P_y}\big[\hat{r}_y(x)\big] + C,$   (22)
where $C$ is a constant w.r.t. $\hat{r}$ and the minimizer of the above loss function matches the true density ratios, which strictly generalizes the Least-Squares Importance Fitting (LSIF) (Kanamori et al., 2009) method to the multi-distribution case.
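A corresponding sketch of the Multi-LSIF objective, dropping the constant $C$ and assuming a ratio model with $k-1$ nonnegative outputs (names are illustrative):

```python
import torch

def multi_lsif_loss(ratio_model, batches):
    """batches[y] ~ P_{y+1}; the last batch comes from the reference distribution P_k."""
    r_ref = ratio_model(batches[-1])                        # (n_k, k-1)
    loss = 0.5 * (r_ref ** 2).sum(dim=1).mean()             # (1/2) E_{P_k}[ ||r_hat||^2 ]
    for y in range(len(batches) - 1):
        loss = loss - ratio_model(batches[y])[:, y].mean()  # - E_{P_y}[ r_hat_y ]
    return loss
```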
Besides, we also consider the following simple convex functions that either strictly generalize their binary DRE counterparts as above, or lead to completely new methods for multi-distribution DRE:
- Multi-KLIEP. A KL-type convex function (see Appendix A.4.3) yields a multi-distribution generalization of the Kullback–Leibler Importance Estimation Procedure (KLIEP) (Sugiyama et al., 2008).
- Power. A power-type convex function (see Appendix A.4.4) yields a multi-distribution analogue of Basu's power divergence for robust DRE.
- Quadratic. $f(r) = \frac{1}{2}\, r^\top A\, r$ for any positive definite matrix $A$. When $A$ is the identity matrix, this is equivalent to Multi-LSIF.
- LogSumExp. A log-sum-exp-type convex function (see Appendix A.4.5).
In principle, we can use any desired strictly convex function within the optimization problem in Eq. (14), implying the great potential of our unified framework for discovering novel objectives for multi-distribution DRE. In terms of modeling flexibility, the curvatures of different convex functions encode different inductive biases that may favor different downstream applications, and we leave the design of more suitable convex functions for DRE as an exciting future avenue.
5.2 Methods Induced by Proper Scoring Rules Composite with
From Section 4, we know that any strictly proper loss $\ell$ (or strictly proper scoring rule $S = -\ell$) in conjunction with the link function $\Psi$ also induces a valid loss for multi-distribution DRE:
$\min_{\hat{r}}\; \sum_{y=1}^{k} \pi_y\, \mathbb{E}_{x \sim P_y}\Big[\ell\big(y, \Psi^{-1}(\hat{r}(x))\big)\Big].$   (23)
In this work, we consider the following classic proper scoring rules (Gneiting & Raftery, 2007), where the class probability estimator is parametrized through the density ratio model as $\hat{\eta} = \Psi^{-1}(\hat{r})$ (a sketch of one such composite loss is given after the list):
- Logarithm score (Good, 1952), whose composite loss recovers multi-class logistic regression.
- Brier score (Brier et al., 1950). The loss is specified as $\ell(y, \hat{\eta}) = \sum_{y'=1}^{k} \big(\mathbb{1}[y' = y] - \hat{\eta}_{y'}\big)^2$.
- Pseudo-spherical score (Good, 1971).
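For concreteness, the following sketch composes the Brier score with the inverse link under a uniform class prior, as in Eq. (23); the model interface and the uniform-prior simplification are illustrative assumptions:

```python
import torch

def brier_composite_loss(ratio_model, batches):
    """batches[y] holds samples from P_{y+1}; ratio_model outputs k-1 nonnegative ratio estimates."""
    loss, k = 0.0, len(batches)
    for y, x in enumerate(batches):
        r = ratio_model(x)                                            # (n, k-1)
        ones = torch.ones(r.shape[0], 1, device=r.device, dtype=r.dtype)
        r_full = torch.cat([r, ones], dim=1)                          # append r_k = 1
        eta = r_full / r_full.sum(dim=1, keepdim=True)                # inverse link with uniform prior
        target = torch.zeros_like(eta)
        target[:, y] = 1.0
        loss = loss + ((target - eta) ** 2).sum(dim=1).mean() / k     # Brier score for class y
    return loss
```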
6 Experiments
In this section, we verify the validity of our framework, as well as study and compare the various instantiations introduced in Section 5, on a variety of tasks that all rely on accurate multi-distribution density ratio estimation. In particular, the tasks we consider include density ratio estimation among multiple multivariate Gaussian distributions, anomaly detection on CIFAR-10 (Krizhevsky et al., 2009), multi-target MNIST Generation (LeCun et al., 1998) and multi-distribution off-policy policy evaluation. We discuss the basic problem setups, evaluation metrics and experimental results in the following and we provide more experimental details for each task in Appendix A.5.
Synthetic Data Experiments. We first apply our methods to estimate density ratios among multivariate Gaussian distributions with different mean vectors and identity covariance matrices. We conducted experiments for various data dimensions ranging from 2 to 50. Since Gaussian distributions have tractable densities, we know the ground-truth density ratio functions, and we calculate the mean absolute error (MAE) between all true density ratios and the learned ones, i.e., the average of $\big|p_i(x)/p_j(x) - \hat{r}_i(x)/\hat{r}_j(x)\big|$ over all pairs $i \neq j$ and test points $x$, where the density ratio between $P_i$ and $P_j$ is recovered by $\hat{r}_i(x)/\hat{r}_j(x)$ as discussed in Section 3.2. We summarize the results in Table 1, from which we can see that multi-class logistic regression and the Brier score composite with $\Psi^{-1}$ show superior performance on this task.
OOD Detection on CIFAR-10. Suppose we have $k$ distributions $P_1, \ldots, P_k$, where $P_k$ is a uniform mixture of $P_1, \ldots, P_{k-1}$ (in our experiments $k = 4$; see Appendix A.5). For each component distribution $P_i$ ($i < k$), samples from the mixture distribution $P_k$ contain both inlier samples (drawn from $P_i$) and outlier samples (drawn from the other components). The goal of this task is to identify the inlier samples from the pool of mixture samples. In particular, we use the learned density ratio $\hat{r}_i(x) \approx p_i(x)/p_k(x)$ as the score function to retrieve the inlier samples of $P_i$, since the true density ratio function tends to be larger for samples from $P_i$ and smaller for samples from the other distributions. In this case, we calculate the AUROC averaged over the scoring functions.
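The evaluation for this task can be sketched as follows, assuming a scikit-learn-style AUROC computation and a learned ratio function; the names and interfaces are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_auroc(ratio_fn, mixture_samples, inlier_labels):
    """ratio_fn(x): learned estimate of p_i(x) / p_k(x), used as the inlier score for P_i;
    inlier_labels: 1 for mixture samples drawn from P_i, 0 otherwise."""
    scores = np.asarray([float(ratio_fn(x)) for x in mixture_samples])
    return roc_auc_score(inlier_labels, scores)
```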
[Table 1: MAE of the estimated density ratios on the synthetic Gaussian task for data dimensions from 2 to 50. Methods compared: Random Init, Multi-LR, Multi-KLIEP, Multi-LSIF, Power, Brier, Spherical, LogSumExp, Quadratic.]
[Table 2: Results on CIFAR-10 OOD detection (average AUROC), multi-target MNIST generation, and multi-distribution off-policy policy evaluation for Random Init, Multi-LR, Multi-KLIEP, Multi-LSIF, Power, Brier, Spherical, LogSumExp, and Quadratic.]
Multi-target MNIST Generation. DRE can be used in the sampling-importance-resampling (SIR) paradigm (Liu & Chen, 1998; Doucet et al., 2000). Suppose we want to obtain samples from a target distribution $P_i$ while we have a large set of samples from the mixture distribution $P_k$. For each $i$, we can use the learned density ratio function $\hat{r}_i$ in conjunction with SIR to approximately sample from the target distribution (Algorithm 1 in (Grover et al., 2019)). For this task, we evaluate whether the SIR samples for each target distribution contain the correct proportion of classes/properties (the 10 digits in MNIST), and we use the discrepancy between the desired proportion and the sampled proportion of each property, aggregated over properties and target-generation tasks, as the evaluation metric.
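A minimal sketch of the SIR step with a learned ratio follows; it assumes ratio_fn estimates the target-to-pool density ratio and omits practical details such as softening or clipping the weights:

```python
import numpy as np

def sir_resample(pool, ratio_fn, num_samples, seed=0):
    """pool: list of candidate samples from the pool distribution;
    ratio_fn(x): learned estimate of p_target(x) / p_pool(x)."""
    rng = np.random.default_rng(seed)
    weights = np.asarray([float(ratio_fn(x)) for x in pool])
    probs = weights / weights.sum()                 # self-normalized importance weights
    idx = rng.choice(len(pool), size=num_samples, replace=True, p=probs)
    return [pool[i] for i in idx]
```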
Multi-distribution Off-policy Policy Evaluation. Suppose we have several different reinforcement learning policies, each inducing an occupancy measure (Syed et al., 2008), i.e., a state-action distribution. Density ratios between occupancy measures allow us to conduct off-policy policy evaluation, which estimates the expected return (sum of rewards) of target policies given trajectories sampled from a source policy. In this case, we assess the quality of the learned density ratios by the error of the density-ratio-weighted estimates of the target policies' expected returns, computed from state-action pairs sampled from the source policy.
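A sketch of the reweighting step is given below; it assumes ratio_fn estimates the ratio of the target policy's occupancy measure to the source policy's, and that per-step rewards are averaged (normalization by horizon or discounting is omitted):

```python
import numpy as np

def ope_estimate(transitions, ratio_fn):
    """transitions: list of (state, action, reward) tuples collected under the source policy;
    ratio_fn(s, a): learned d_target(s, a) / d_source(s, a) over occupancy measures."""
    weighted = [float(ratio_fn(s, a)) * r for (s, a, r) in transitions]
    return float(np.mean(weighted))
```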
We summarize the results for CIFAR-10 OOD detection, multi-target MNIST generation and multi-distribution off-policy policy evaluation in Table 2 (omitted results indicate that the corresponding method performs worse than the listed methods by a large margin on the specific task). We can see that methods induced by proper scoring rules such as multi-class logistic regression, the Brier score and the pseudo-spherical score tend to have the best performance on the first two tasks. Surprisingly, methods induced by some simple multivariate convex functions such as LogSumExp and the quadratic function show excellent performance on the third task. These results demonstrate the advantage of our framework in the sense that it offers great flexibility for designing new multi-distribution DRE objectives that are more suitable for various downstream applications.
7 Conclusion
In this paper, we focus on the generalized problem of efficiently estimating density ratios among multiple distributions. We propose a general framework based on expected Bregman divergence minimization, where each strictly convex function induces a proper loss for multi-distribution DRE. Furthermore, we rederive the theoretical equivalence between the losses of class probability estimation and density ratio estimation, which justifies the use of any strictly proper scoring rule for multi-distribution DRE. Finally, we demonstrated the effectiveness of our framework on various downstream tasks.
References
- Aharoni et al. (2019) Roee Aharoni, Melvin Johnson, and Orhan Firat. Massively multilingual neural machine translation. arXiv preprint arXiv:1903.00089, 2019.
- Bickel et al. (2008) Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, and Tobias Scheffer. Multi-task learning for HIV therapy screening. In Proceedings of the 25th international conference on Machine learning, pp. 56–63, 2008.
- Brier et al. (1950) Glenn W Brier et al. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.
- Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
- Cao et al. (2019) Jiezhang Cao, Langyuan Mo, Yifan Zhang, Kui Jia, Chunhua Shen, and Mingkui Tan. Multi-marginal wasserstein gan. Advances in Neural Information Processing Systems, 32:1776–1786, 2019.
- Cappé et al. (2004) Olivier Cappé, Arnaud Guillin, Jean-Michel Marin, and Christian P Robert. Population monte carlo. Journal of Computational and Graphical Statistics, 13(4):907–929, 2004.
- Choi et al. (2021) Kristy Choi, Madeline Liao, and Stefano Ermon. Featurized density ratio estimation. arXiv preprint arXiv:2107.02212, 2021.
- DeGroot (1962) Morris H DeGroot. Uncertainty, information, and sequential experiments. The Annals of Mathematical Statistics, 33(2):404–419, 1962.
- Dinh et al. (2013) Cuong V Dinh, Robert PW Duin, Ignacio Piqueras-Salazar, and Marco Loog. Fidos: A generalized fisher based feature extraction method for domain shift. Pattern Recognition, 46(9):2510–2518, 2013.
- Dong et al. (2015) Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1723–1732, 2015.
- Doucet et al. (2000) Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential monte carlo sampling methods for bayesian filtering. Statistics and computing, 10(3):197–208, 2000.
- Duchi et al. (2018) John Duchi, Khashayar Khosravi, and Feng Ruan. Multiclass classification, information, divergence and surrogate risk. The Annals of Statistics, 46(6B):3246–3275, 2018.
- Elvira et al. (2015) Víctor Elvira, Luca Martino, David Luengo, and Mónica F Bugallo. Efficient multiple importance sampling estimators. IEEE Signal Processing Letters, 22(10):1757–1761, 2015.
- Elvira et al. (2019) Víctor Elvira, Luca Martino, David Luengo, and Mónica F Bugallo. Generalized multiple importance sampling. Statistical Science, 34(1):129–155, 2019.
- Franklin (2005) James Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.
- Fujisawa & Eguchi (2008) Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9):2053–2081, 2008.
- Garcia-Garcia & Williamson (2012) Dario Garcia-Garcia and Robert C Williamson. Divergences and risks for multiclass experiments. In Conference on Learning Theory, pp. 28–1. JMLR Workshop and Conference Proceedings, 2012.
- Gneiting & Raftery (2007) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359–378, 2007.
- Good (1952) IJ Good. Rational decisions. Journal of the Royal Statistical Society, pp. 107–114, 1952.
- Good (1971) IJ Good. Comment on “measuring information and uncertainty” by robert j. buehler. Foundations of Statistical Inference, pp. 337–339, 1971.
- Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
- Grover et al. (2019) Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, Eric Horvitz, and Stefano Ermon. Bias correction of learned generative models using likelihood-free importance weighting. arXiv preprint arXiv:1906.09531, 2019.
- Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. PMLR, 2018.
- Hido et al. (2008) Shohei Hido, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi Kanamori. Inlier-based outlier detection via direct density ratio estimation. In 2008 Eighth IEEE international conference on data mining, pp. 223–232. IEEE, 2008.
- Hido et al. (2011) Shohei Hido, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi Kanamori. Statistical outlier detection using direct density ratio estimation. Knowledge and information systems, 26(2):309–336, 2011.
- Hjelm et al. (2018) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
- Huang et al. (2006) Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. Advances in neural information processing systems, 19:601–608, 2006.
- Kallus et al. (2021) Nathan Kallus, Yuta Saito, and Masatoshi Uehara. Optimal off-policy evaluation from multiple logging policies. In International Conference on Machine Learning, pp. 5247–5256. PMLR, 2021.
- Kanamori et al. (2009) Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. The Journal of Machine Learning Research, 10:1391–1445, 2009.
- Kanamori et al. (2012) Takafumi Kanamori, Taiji Suzuki, and Masashi Sugiyama. Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3):335–367, 2012.
- Kato & Teshima (2021) Masahiro Kato and Takeshi Teshima. Non-negative bregman divergence minimization for deep direct density ratio estimation. In International Conference on Machine Learning, pp. 5320–5333. PMLR, 2021.
- Kpotufe (2017) Samory Kpotufe. Lipschitz density-ratios, structured data, and data-driven tuning. In Artificial Intelligence and Statistics, pp. 1320–1328. PMLR, 2017.
- Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Liu & Chen (1998) Jun S Liu and Rong Chen. Sequential monte carlo methods for dynamic systems. Journal of the American statistical association, 93(443):1032–1044, 1998.
- Liu et al. (2017) Song Liu, Akiko Takeda, Taiji Suzuki, and Kenji Fukumizu. Trimmed density ratio estimation. arXiv preprint arXiv:1703.03216, 2017.
- Menon & Ong (2016) Aditya Menon and Cheng Soon Ong. Linking losses for density ratio and class-probability estimation. In International Conference on Machine Learning, pp. 304–313. PMLR, 2016.
- Nguyen et al. (2007) XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In NIPS, pp. 1089–1096, 2007.
- Nock et al. (2016) Richard Nock, Aditya Menon, and Cheng Soon Ong. A scaled bregman theorem with applications. Advances in Neural Information Processing Systems, 29:19–27, 2016.
- Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 271–279, 2016.
- Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Owen & Zhou (2000) Art Owen and Yi Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135–143, 2000.
- Que & Belkin (2013) Qichao Que and Mikhail Belkin. Inverse density as an inverse problem: The fredholm equation approach. arXiv preprint arXiv:1304.5575, 2013.
- Rhodes et al. (2020) Benjamin Rhodes, Kai Xu, and Michael U Gutmann. Telescoping density-ratio estimation. arXiv preprint arXiv:2006.12204, 2020.
- Sibson (1969) Robin Sibson. Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 14(2):149–160, 1969.
- Smola et al. (2009) Alex Smola, Le Song, and Choon Hui Teo. Relative novelty detection. In Artificial Intelligence and Statistics, pp. 536–543. PMLR, 2009.
- Sugiyama et al. (2007) Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul Von Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, volume 7, pp. 1433–1440. Citeseer, 2007.
- Sugiyama et al. (2008) Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
- Sugiyama et al. (2011) Masashi Sugiyama, Taiji Suzuki, Yuta Itoh, Takafumi Kanamori, and Manabu Kimura. Least-squares two-sample test. Neural networks, 24(7):735–751, 2011.
- Sugiyama et al. (2012) Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density-ratio matching under the bregman divergence: a unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
- Syed et al. (2008) Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pp. 1032–1039, 2008.
- Uehara et al. (2016) Masatoshi Uehara, Issei Sato, Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
- Vernet et al. (2011) Elodie Vernet, Robert Williamson, Mark Reid, et al. Composite multiclass losses. 2011.
Appendix A Proofs
A.1 Proofs for Section 2
(Restatement of Theorem 1.)
Proof.
For completeness, we provide the proof here. First, we can check that is a concave function. Define to be the vector . Then the entropy function can be represented as and similarly . For and , we have:
where the inequality is because is proper. Thus is a concave function. Next we show that the excess risk is a Bregman divergence with convex function . First, observe that
Because is proper, we have:
Rearrange the term we get and therefore is a subderivative of . When is differentiable, its subdifferential contains exactly one subderivative and . Therefore, we have with . ∎
A.2 Proofs for Section 3.2
(Restatement of Proposition 1.)
Proof of Proposition 1.
We first recall that the optimization problem for multi-distribution DRE is of the form
(24) |
and one can use the Fenchel-dual convex conjugate to represent the -divergence as
(25) |
By first-order optimality condition of convex functions, for any the optimal solution for Eq. (25) satisfies that
Therefore recovers the true density ratios.
Now we show that under change of variable , one can write the problem in Eq. (25) equivalently as the one in Eq. (24). First due to the property of the convex conjugate function (), we have:
Substituting with , we have:
(26) |
Taking derivative w.r.t. and due to the strict convexity of (), we know that the minimum of Eq. (26) achieves at . Thus we have:
(27) |
A.3 Proofs for Section 4
(Restatement of Lemma 1.)
Proof of Lemma 1.
For simplicity of notations we let for arbitrary . We first prove the convexity of by definition. Given any two points and , one has
Here for inequality we use the fact that for any convex function , the perspective function defined as is a function jointly convex in .
Now to see the identity holds, note we can write
and that
where we use the gradient formula that by definition of , and rearranging terms and that .
Thus, we have shown the claimed identity, which concludes the proof. ∎
(Restatement of Proposition 2.)
Proof of Proposition 2.
Given any , the equality follows by applying Lemma 1 with and . To see why this is true, note that we have by definition of and that (here implies element-wise multiplication)
Consequently applying Lemma 1 implies that
(28) |
Note that given any convex function , we consider its composition with linear map function as
We note that linear composition preserves convexity and Bregman divergence equality, i.e. we have
(29) | ||||
where for equality we use chain rule for taking derivatives of the linear composite mapping. Combining Equations (28) and (29) and replacing gives the desired result. ∎
(Restatement of Theorem 2.)
Proof of Theorem 2.
A.3.1 Information Measure in Multi-class Experiments
In this section, we show that multi-distribution density ratio estimation can be viewed as estimating the statistical information measure (DeGroot, 1962) in multi-class experiments, under appropriate choices for the convex function .
We first introduce the following definitions in multi-class experiments. For , any proper loss function induces a generalized entropy:
which measures the uncertainty of the task. Given a multi-class experiment and the generalized entropy (which is closed concave), the information measure in a multi-class experiment (DeGroot, 1962; Duchi et al., 2018) is defined as the gap between the prior and posterior generalized entropy:
We next introduce the following connections between the multi-distribution $f$-divergence, generalized entropy, and the information measure in multi-class experiments. Specifically, Duchi et al. (2018) proved an equivalence between the gap of the prior and posterior Bayes risks and the multi-distribution $f$-divergence induced by a convex function depending on the loss and the prior, demonstrating the utility of the multi-distribution $f$-divergence for experimental design in multi-class classification.
Theorem 3 ((Duchi et al., 2018)).
Given a proper loss , its associated generalized entropy and a multi-class distribution , we can define a closed convex function as
(30) |
We can then express the information measure of multi-class experiments as the multi-distribution $f$-divergence induced by the convex function in Eq. (30):
Given Theorem 3 and Proposition 1, we know that multi-distribution density ratio estimation by minimizing expected Bregman divergence (Eq. (14)), induced by the convex function defined in Eq. (30), corresponds to estimating the statistical information measure in multi-class classification experiments.
A.4 Examples of Multi-distribution DRE
A.4.1 Multi-class Logistic Regression
From Section 2.3, we know that there is a one-to-one correspondence between a class probability estimator and a density ratio estimator through the link and the inverse link function: $\hat{r} = \Psi(\hat{\eta})$ and $\hat{\eta} = \Psi^{-1}(\hat{r})$. When the class prior distribution is uniform, we have:
$\hat{\eta}_y(x) = \frac{\hat{r}_y(x)}{\sum_{y'=1}^{k} \hat{r}_{y'}(x)}, \quad \text{with } \hat{r}_k(x) \equiv 1.$   (31)
To recover the loss of multi-class logistic regression used in (Bickel et al., 2008), we choose the following convex function (where we use ):
(32) | ||||
(33) |
Thus the loss in Eq. (14) reduces to:
where is because and Eq. (31).
When the class prior is not uniform, from Section 2.3, we know that the link and inverse link connecting density ratio estimators and class probability estimators are:
$\hat{r}_y(x) = \frac{\pi_k}{\pi_y} \cdot \frac{\hat{\eta}_y(x)}{\hat{\eta}_k(x)}, \qquad \hat{\eta}_y(x) = \frac{\pi_y\, \hat{r}_y(x)}{\sum_{y'=1}^{k} \pi_{y'}\, \hat{r}_{y'}(x)}.$   (34)
In this case, we choose the following convex function (where we use ):
(35) | ||||
(36) |
Note that when is uniform distribution, Eq. (35) reduces to Eq. (32).
The loss in Eq. (14) reduces to:
which corresponds to the multi-class logistic regression loss for the class probability estimators .
A.4.2 Least-squares Importance Fitting
When the convex function associated with the Bregman divergence is chosen to be:
(37) | ||||
(38) |
The loss in Eq. (14) reduces to:
$\frac{1}{2}\, \mathbb{E}_{x \sim P_k}\big[\|\hat{r}(x)\|_2^2\big] - \sum_{y=1}^{k-1} \mathbb{E}_{x \sim P_y}\big[\hat{r}_y(x)\big] + C,$
where $C$ is a constant w.r.t. $\hat{r}$ and the minimizer of the above loss function matches the true density ratios, which strictly generalizes the Least-Squares Importance Fitting (LSIF) (Kanamori et al., 2009) method to the multi-distribution case.
A.4.3 KL Importance Estimation Procedure
When the convex function associated with the Bregman divergence is chosen to be:
(39) | ||||
(40) |
The loss in Eq. (14) reduces to:
$\mathbb{E}_{x \sim P_k}\Big[\sum_{y=1}^{k-1} \hat{r}_y(x)\Big] - \sum_{y=1}^{k-1} \mathbb{E}_{x \sim P_y}\big[\log \hat{r}_y(x)\big] + C.$   (41)
This is equivalent to the following constrained optimization problem:
$\max_{\hat{r}}\; \sum_{y=1}^{k-1} \mathbb{E}_{x \sim P_y}\big[\log \hat{r}_y(x)\big] \quad \text{s.t.} \quad \mathbb{E}_{x \sim P_k}\big[\hat{r}_y(x)\big] = 1 \;\; \text{for all } y \in [k-1],$
which strictly generalizes the Kullback–Leibler Importance Estimation Procedure (KLIEP) (Sugiyama et al., 2008) to the multi-distribution case.
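A sketch of the unconstrained form is given below, assuming the objective reduces (up to constants) to $\mathbb{E}_{P_k}[\sum_y \hat{r}_y] - \sum_y \mathbb{E}_{P_y}[\log \hat{r}_y]$ as written above; the model interface is illustrative:

```python
import torch

def multi_kliep_loss(ratio_model, batches):
    """batches[y] ~ P_{y+1}; the last batch comes from the reference distribution P_k."""
    r_ref = ratio_model(batches[-1])                                   # (n_k, k-1) positive ratios
    loss = r_ref.sum(dim=1).mean()                                     # E_{P_k}[ sum_y r_hat_y ]
    for y in range(len(batches) - 1):
        loss = loss - torch.log(ratio_model(batches[y])[:, y]).mean()  # - E_{P_y}[ log r_hat_y ]
    return loss
```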
A.4.4 Basu’s Power Divergence for Robust DRE
For some , we choose the following convex function (the -norm of a vector):
(42) | ||||
(43) |
The loss in Eq. (14) reduces to:
(44) |
To understand the robustness of this formulation, for each , we take the derivative of Eq. (44) w.r.t. the parameters in the density ratio model and equate it to zero, and we get the following parameter estimation equation:
(45) |
Now we apply the same analysis to the multi-distribution KLIEP method in Eq. (41) and we get the following equation (for each ):
(46) |
Comparing Eq. (45) with Eq. (46), we can see that the power divergence DRE method in Eq. (44) is a weighted version of the multi-distribution KLIEP method, where the weight controls the importance of the samples. In scenarios where outlier samples tend to have density ratios less than one, they have less influence on the parameter estimation, which generalizes the binary Basu's power divergence robust DRE method (Sugiyama et al., 2012) to the multi-distribution case. Another interesting observation is that in one limit of the power parameter, Eq. (45) recovers the KLIEP gradient in Eq. (46); in another, the power divergence DRE in Eq. (44) recovers the multi-distribution LSIF method in Section A.4.2.
A.4.5 More Examples
When the convex function is chosen to be the Log-Sum-Exp type function (for ):
(47) | ||||
(48) |
The loss in Eq. (14) can be written as:
We can similarly derive loss functions induced by other convex functions such as the quadratic function $f(r) = \frac{1}{2}\, r^\top A\, r$ for some positive definite matrix $A$.
A.5 More Experimental Details
We provide more details about the problem setup of each task used in our empirical study.
For the synthetic data experiments, we use multivariate Gaussian distributions with identity covariance matrix and different mean vectors:
We use this design so that the density ratios are almost surely well-defined and the numerical optimization with respect to the canonical density ratio vector is more stable. We use a two-layer Multi-Layer Perceptron (MLP) with ReLU activation functions to realize the density ratio model.
For the CIFAR-10 OOD detection experiments, we set $k = 4$ and construct each distribution as: $P_1$ – samples labeled {airplane, automobile, bird}; $P_2$ – samples labeled {cat, deer, dog, frog}; $P_3$ – samples labeled {horse, ship, truck}; and $P_4$ – a uniform mixture of these distributions. We use a standard convolutional neural network from the PyTorch tutorial (https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) with $k - 1$ outputs to realize the density ratio model.
For the MNIST multi-target generation experiments, we use $k = 6$ and construct each distribution as: $P_1$ – samples labeled {0,1}; $P_2$ – samples labeled {2,3}; $P_3$ – samples labeled {4,5}; $P_4$ – samples labeled {6,7}; $P_5$ – samples labeled {8,9}; and $P_6$ – a mixture of these distributions. We use a two-layer convolutional neural network (Conv(1, 32, 3, 1) → Conv(32, 64, 3, 1) → Linear(9216, 128) → Linear(128, $k-1$)) with ReLU activation functions to realize the density ratio model.
For multi-distribution off-policy policy evaluation experiments, we conducted experiments on the Half-Cheetah environment in OpenAI Gym (Brockman et al., 2016). We use soft actor-critic algorithm (Haarnoja et al., 2018) to obtain five different policies with average expected return of {3811, 5277, 6444, 7397, 5728} respectively and we learn density ratios between their induced occupancy measures (state-action distributions). We use a three-layer MLP () with ReLU activation function to realize the density ratio model.