
Minimum Gamma Divergence for
Regression and Classification Problems

Shinto Eguchi

Preface

In an era where data drives decision-making across diverse fields, the need for robust and efficient statistical methods has never been greater. As a researcher deeply involved in the study of divergence measures, I have witnessed firsthand the transformative impact these tools can have on statistical inference and machine learning. This book aims to provide a comprehensive guide to the class of power divergences, with a particular focus on the $\gamma$-divergence, exploring their theoretical underpinnings, practical applications, and potential for enhancing robustness in statistical models and machine learning algorithms.

The inspiration for this book stems from the growing recognition that traditional statistical methods often fall short in the presence of model misspecification, outliers, and noisy data. Divergence measures, such as the $\gamma$-divergence, offer a promising alternative by providing robust estimation techniques that can withstand these challenges. This book seeks to bridge the gap between theoretical development and practical application, offering new insights and methodologies that can be readily applied in various scientific and engineering disciplines.

The book is structured into four main chapters. Chapter 1 introduces the foundational concepts of divergence measures, including the well-known Kullback-Leibler divergence and its limitations. It then presents a detailed exploration of power divergences, such as the $\alpha$-, $\beta$-, and $\gamma$-divergences, highlighting their unique properties and advantages. Chapter 2 develops minimum divergence methods for regression models, demonstrating how these methods can improve robustness and efficiency in statistical estimation. Chapter 3 extends these methods to Poisson point processes, with a focus on ecological applications, providing a robust framework for modeling species distributions and other spatial phenomena. Finally, Chapter 4 explores the use of divergence measures in machine learning, including applications in Boltzmann machines, AdaBoost, and active learning. The chapter emphasizes the practical benefits of these measures in enhancing model robustness and performance.

By providing a detailed examination of divergence measures, this book aims to offer a valuable resource for statisticians, machine learning practitioners, and researchers. It presents a unified perspective on the use of power divergences in various contexts, offering practical examples and empirical results to illustrate their effectiveness. The methodologies discussed in this book are designed to be both insightful and practical, enabling readers to apply these concepts in their work and research.

This book is the culmination of years of research and collaboration. I am grateful to my colleagues and students whose questions and feedback have shaped the content of this book. Special thanks to Hironori Fujisawa, Masayuki Henmi, Takashi Takenouchi, Osamu Komori, Kenichi Hatashi, Su-Yun Huang, Hung Hung, Shogo Kato, Yusuke Saigusa and Hideitsu Hino for their invaluable support and contributions.

I invite you to explore the rich landscape of divergence measures presented in this book. Whether you are a researcher, practitioner, or student, I hope you find the concepts and methods discussed here to be both insightful and practical. It is my sincere wish that this book will contribute to the advancement of robust statistical methods and inspire further research and innovation in the field.

Tokyo, 2024 Shinto Eguchi

Chapter 1 Power divergence

We present a mathematical framework for discussing the class of divergence measures, which are essential tools for quantifying the difference between two probability distributions. These measures find applications in various fields such as statistics, machine learning, and data science. We begin by discussing the well-known Kullback-Leibler (KL) divergence, highlighting its advantages and limitations. To address the shortcomings of the KL-divergence, we introduce three alternative types: the $\alpha$-, $\beta$-, and $\gamma$-divergences. We emphasize the importance of choosing the right `reference measure', especially for the $\beta$- and $\gamma$-divergences, as it significantly impacts the results.

1.1 Introduction

We provide a comprehensive study of divergence measures that are essential tools for quantifying the difference between two probability distributions. These measures find applications in various fields such as statistics, machine learning, and data science [1, 6, 19, 79, 20]. See also [98, 17, 58, 34, 72, 80, 50].

We present the $\alpha$-, $\beta$-, and $\gamma$-divergence measures, each characterized by distinctive properties and advantages. These measures are particularly well-suited for a variety of applications, offering tailored solutions to specific challenges in statistical inference and machine learning. We further explore the practical applications of these divergence measures, examining their implementation in statistical models such as generalized linear models and Poisson point processes. Special attention is given to selecting the appropriate `reference measure', which is crucial for the accuracy and effectiveness of these methods. The chapter concludes by identifying areas for future research, including the further exploration of reference measures. Overall, this chapter serves as a resource for understanding the mathematical and practical aspects of divergence measures.

In recent years, a number of studies have been conducted on the robustness of machine learning models using the $\gamma$-divergence, which was proposed in [33]. This book highlights that the $\gamma$-divergence can be defined even when the power exponent $\gamma$ is negative, provided certain integrability conditions are met [26]. Specifically, one key condition is that the probability distributions are defined on a set of finite discrete values. We demonstrate that the $\gamma$-divergence with $\gamma=-1$ is intimately connected to the inequality between the arithmetic mean and the geometric mean of the ratio of two probability mass functions, thus terming it the geometric-mean (GM) divergence. Likewise, we show that the $\gamma$-divergence with $\gamma=-2$ can be derived from the inequality between the arithmetic mean and the harmonic mean of the mass functions, leading to its designation as the harmonic-mean (HM) divergence.

1.2 Probabilistic framework

Let $Y$ be a random variable with a set $\mathcal{Y}$ of possible values in $\mathbb{R}$. We denote by $\Lambda$ a $\sigma$-finite measure, referred to as the reference measure. The reference measure $\Lambda$ is typically either the Lebesgue measure when $Y$ is a continuous random variable or the counting measure when $Y$ is a discrete one. Let us define $\mathcal{P}$ as the space of all probability measures $P$ that are mutually absolutely continuous with respect to $\Lambda$. The probability $P(B)$ for an event $B$ can be expressed as:

\displaystyle P(B)=\int_{B}p(y){\rm d}\Lambda(y),

where $p(y)=(\partial P/\partial\Lambda)(y)$ is referred to as the Radon-Nikodym (RN) derivative. Specifically, $p(y)$ is referred to as the probability density function (pdf) if $Y$ is a continuous random variable and the probability mass function (pmf) if $Y$ is a discrete one.

Definition 1.

Let $D(P,Q)$ denote a functional defined on ${\mathcal{P}}\times{\mathcal{P}}$. Then, we call $D(P,Q)$ a divergence measure if $D(P,Q)\geq 0$ for all $P$ and $Q$ of $\mathcal{P}$, with $D(P,Q)=0$ if and only if $P=Q$.

Consider two normal distributions ${\tt Nor}(\mu_{1},\sigma_{1}^{2})$ and ${\tt Nor}(\mu_{2},\sigma_{2}^{2})$. If both distributions have the same mean $\mu$ and variance $\sigma^{2}$, they are identical, and their divergence is zero. However, as the mean and variance of one distribution diverge from those of the other, the divergence measure increases, quantifying how one distribution differs from the other. Thus, a divergence measure quantifies how one probability distribution diverges from another. The key properties are non-negativity, asymmetry, and vanishing exactly when the two distributions are identical. The asymmetry of a divergence measure helps one to discuss model comparisons, variational inference, generative models, optimal control policies, and so on. Researchers have proposed various divergence measures in statistics and machine learning to compare two models or to measure the information loss when approximating a distribution. Such a measure is more appropriately termed `information divergence', although it is simply called `divergence' here for brevity. As a specific example, the Kullback-Leibler (KL) divergence is given by the following equation:

D_{0}(P,Q)=\int_{\mathcal{Y}}p(y)\log\frac{p(y)}{q(y)}{\rm d}\Lambda(y), (1.1)

where $p(y)=\partial P/\partial\Lambda(y)$ and $q(y)=\partial Q/\partial\Lambda(y)$. The KL-divergence is essentially independent of the choice of $\Lambda$ since it can be written without $\Lambda$ as

\displaystyle D_{0}(P,Q)=\int\log\frac{\partial P}{\partial Q}\,{\rm d}P.

This implies that any property of the KL-divergence $D_{0}(P,Q)$ can be regarded as an intrinsic property of the probability measures $P$ and $Q$, regardless of the RN-derivatives with respect to the reference measure. The definition (1.1) implicitly assumes integrability: the integral of $p$ against the logarithm of the density ratio must be finite. Such assumptions are generally acceptable in practical applications in statistics and machine learning. However, if we take a Cauchy distribution as $P$ and a normal distribution as $Q$, then $D_{0}(P,Q)$ is not finite. Thus, the KL-divergence is associated with unstable behavior, which gives rise to the non-robustness of the minimum KL-divergence method, or equivalently the maximum likelihood method. This aspect will be discussed in the following chapter.
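The instability caused by heavy tails can be made concrete numerically. The following sketch (illustrative code, not from the text; the function names are ours) evaluates the KL-divergence for finite discrete distributions and shows that the truncated integral defining $D_{0}(P,Q)$ for a Cauchy $P$ and a standard normal $Q$ keeps growing as the truncation range widens:

```python
import math

def kl_discrete(p, q):
    # D_0(P, Q) for two pmfs given on a common finite support
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

p, q = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]
assert kl_discrete(p, p) == 0.0                       # zero at identity
assert kl_discrete(p, q) > 0 and kl_discrete(q, p) > 0  # non-negativity
assert abs(kl_discrete(p, q) - kl_discrete(q, p)) > 1e-6  # asymmetry

def truncated_kl_cauchy_normal(t, n=100000):
    # midpoint-rule integration of p*log(p/q) over [-t, t] for a Cauchy p
    # and a standard normal q; the integrand behaves like 1/(2*pi) for
    # large |y|, so the value grows roughly linearly in t
    h = 2.0 * t / n
    total = 0.0
    for i in range(n):
        y = -t + (i + 0.5) * h
        pc = 1.0 / (math.pi * (1.0 + y * y))
        log_q = -0.5 * y * y - 0.5 * math.log(2.0 * math.pi)
        total += pc * (math.log(pc) - log_q) * h
    return total

assert truncated_kl_cauchy_normal(10) < truncated_kl_cauchy_normal(100)
```

The last assertion illustrates that the Cauchy-versus-normal KL integral diverges: enlarging the truncation window keeps increasing the value without bound.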

If we write the cross-entropy as

\displaystyle H_{0}(P,Q)=-\int p\log{q}\,{\rm d}\Lambda, (1.2)

then the KL-divergence is written as the difference:

\displaystyle D_{0}(P,Q)=H_{0}(P,Q)-H_{0}(P,P).

The KL-divergence is a divergence measure due to the convexity of the negative logarithmic function. In foundational statistics, the Neyman-Pearson lemma holds a pivotal role. This lemma posits that the likelihood ratio test (LRT) is the most powerful method for hypothesis testing when comparing a null hypothesis distribution $P$ against an alternative distribution $Q$. In this context, the KL-divergence $D_{0}(P,Q)$ can be interpreted as the expected value of the log-likelihood ratio under the null hypothesis distribution $P$. For a more in-depth discussion of the close relationship between $D_{0}(P,Q)$ and the Neyman-Pearson lemma, the reader is referred to [25].

In the context of machine learning, KL-divergence is often used in algorithms like variational autoencoders. Here, KL-divergence helps quantify how closely the learned distribution approximates the real data distribution. Lower KL-divergence values indicate better approximations, thus helping in the model’s optimization process.
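As an illustration of this usage (a sketch under the standard Gaussian VAE setting, not code from the text), the KL penalty used per latent coordinate has the well-known closed form $\frac{1}{2}(\mu^{2}+\sigma^{2}-1-\log\sigma^{2})$ for ${\tt Nor}(\mu,\sigma^{2})$ against the prior ${\tt Nor}(0,1)$, which can be checked against direct numerical integration of (1.1):

```python
import math

def kl_gauss_std(mu, sigma):
    # closed-form KL( Nor(mu, sigma^2) || Nor(0, 1) ), the per-coordinate
    # regularization term of a Gaussian variational autoencoder
    return 0.5 * (mu * mu + sigma * sigma - 1.0 - math.log(sigma * sigma))

def kl_gauss_numeric(mu, sigma, t=12.0, n=200000):
    # midpoint-rule integration of p*log(p/q) over [mu - t, mu + t]
    h = 2.0 * t / n
    total = 0.0
    for i in range(n):
        y = mu - t + (i + 0.5) * h
        log_p = (-0.5 * ((y - mu) / sigma) ** 2
                 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))
        log_q = -0.5 * y * y - 0.5 * math.log(2.0 * math.pi)
        total += math.exp(log_p) * (log_p - log_q) * h
    return total

assert kl_gauss_std(0.0, 1.0) == 0.0   # identical distributions
assert abs(kl_gauss_std(0.7, 1.3) - kl_gauss_numeric(0.7, 1.3)) < 1e-6
```

Lower values of this term indicate that the learned latent distribution is closer to the prior, which is how the KL-divergence enters the optimization.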

1.3 Power divergence measures

The KL-divergence is sometimes referred to as the log divergence due to its definition involving the logarithmic function. Alternatively, a specific class of power divergence measures can be derived from power functions characterized by the exponent parameters $\alpha$, $\beta$, and $\gamma$, as detailed below. Among the numerous ways to quantify the divergence or distance between two probability distributions, the power divergence measures occupy a unique and significant place. Originating from foundational concepts in information theory, these measures have been extended and adapted to address various challenges across statistics and machine learning. As we strive to make better decisions based on data, understanding the nuances between different divergence measures becomes crucial. This section introduces the power divergence measures through three key types: the $\alpha$, $\beta$, and $\gamma$ divergences; see [9] for a comprehensive review. Each of these offers distinct advantages and limitations, and serves as a building block for diverse applications ranging from robust parameter estimation to model selection and beyond.

(1) $\alpha$-divergence:

\displaystyle D_{\alpha}(P,Q;\Lambda)=\frac{1}{\alpha(1-\alpha)}{\int_{\mathcal{Y}}\left[1-\left(\frac{q(y)}{p(y)}\right)^{\alpha}\right]p(y){\rm d}\Lambda(y)},

where $\alpha$ belongs to $\mathbb{R}$, cf. [8, 1] for further details. Let us introduce

\displaystyle W_{\alpha}(R)=\frac{1}{\alpha(1-\alpha)}(1-R^{\alpha})-\frac{1}{1-\alpha}(1-R),

as a generator function for $R\geq 0$. Then the $\alpha$-divergence is written as

\displaystyle D_{\alpha}(P,Q;\Lambda)=\int_{\mathcal{Y}}W_{\alpha}\left(\frac{q(y)}{p(y)}\right)p(y){\rm d}\Lambda(y).

Note that $W_{\alpha}(R)$ is a convex function with $W_{\alpha}(1)=0$ and $W_{\alpha}^{\prime}(1)=0$, so that $W_{\alpha}(R)\geq 0$, with equality if and only if $R=1$. This implies $D_{\alpha}(P,Q;\Lambda)\geq 0$ with equality if and only if $P=Q$. This shows that $D_{\alpha}(P,Q;\Lambda)$ is a divergence measure. The log expression [8] is given by

\displaystyle\Delta_{\alpha}(P,Q;\Lambda)=\frac{1}{\alpha-1}\log{\int_{\mathcal{Y}}\left(\frac{q(y)}{p(y)}\right)^{\alpha}p(y){\rm d}\Lambda(y)}.

The $\alpha$-divergence is associated with the Pythagorean identity in the space $\mathcal{P}$. Assume that a triple of $P$, $Q$ and $R$ satisfies

\displaystyle D_{\alpha}(P,Q;\Lambda)+D_{\alpha}(Q,R;\Lambda)=D_{\alpha}(P,R;\Lambda).

This equation reflects a Pythagorean relation, wherein the triple $P,Q,R$ forms a right triangle if $D_{\alpha}(P,Q;\Lambda)$ is considered the squared Euclidean distance between $P$ and $Q$. We define two curves $\{P_{t}\}_{0\leq t\leq 1}$ and $\{R_{s}\}_{0\leq s\leq 1}$ in $\mathcal{P}$ such that the RN-derivatives of $P_{t}$ and $R_{s}$ are given by $p_{t}(y)=(1-t)p(y)+tq(y)$ and

\displaystyle r_{s}(y)=z_{s}\exp\{(1-s)\log r(y)+s\log q(y)\},

respectively, where $z_{s}$ is a normalizing constant. We then observe that the Pythagorean relation remains unchanged for the triple $P_{t},Q,R_{s}$, as illustrated by the following equation:

\displaystyle D_{\alpha}(P_{t},Q;\Lambda)+D_{\alpha}(Q,R_{s};\Lambda)=D_{\alpha}(P_{t},R_{s};\Lambda).

In accordance with this, the $\alpha$-divergence allows $\mathcal{P}$ to be treated as if it were a Euclidean space. This property plays a central role in the approach of information geometry, and it gives geometric insights for statistics and machine learning [73].

For example, consider a multinomial distribution ${\tt MN}(\pi,m)$ with a probability mass function (pmf):

\displaystyle p(y,\pi,m)=\binom{m}{y_{1}\cdots y_{k}}\pi_{1}^{y_{1}}\cdots\pi_{k}^{y_{k}} (1.3)

for $y=(y_{1},\dots,y_{k})\in{\mathcal{Y}}$ with ${\mathcal{Y}}=\{\,y\ |\ \sum_{j=1}^{k}y_{j}=m\,\}$, where the components $y_{j}$ are nonnegative integers and $\pi=(\pi_{j})_{j=1}^{k}$. The $\alpha$-divergence between multinomial distributions ${\tt MN}(\pi,m)$ and ${\tt MN}(\rho,m)$ can be expressed as follows:

\displaystyle D_{\alpha}({\tt MN}(\pi,m),{\tt MN}(\rho,m);C)=\frac{1}{\alpha(1-\alpha)}\Big{\{}1-\Big{(}\sum_{j=1}^{k}\pi_{j}^{1-\alpha}\rho_{j}^{\alpha}\Big{)}^{m}\Big{\}}, (1.4)

where $C$ is the counting measure.
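This closed form follows from the multinomial theorem. As a quick numerical check (illustrative code, not from the text; the coefficient $1/(\alpha(1-\alpha))$ is used throughout so that the divergence is nonnegative, and all function names are ours), direct summation of the defining integral over the multinomial support agrees with the closed form:

```python
import math
from itertools import product

def multinomial_pmf(y, prob):
    # pmf (1.3) of MN(prob, m) evaluated at a count vector y
    coef = math.factorial(sum(y))
    for yj in y:
        coef //= math.factorial(yj)
    out = float(coef)
    for yj, pj in zip(y, prob):
        out *= pj ** yj
    return out

def alpha_div_direct(pi, rho, m, alpha):
    # direct summation of the defining integral with the counting measure
    total = 0.0
    for y in product(range(m + 1), repeat=len(pi)):
        if sum(y) != m:
            continue
        p = multinomial_pmf(y, pi)
        q = multinomial_pmf(y, rho)
        total += (1.0 - (q / p) ** alpha) * p
    return total / (alpha * (1.0 - alpha))

def alpha_div_closed(pi, rho, m, alpha):
    # closed form obtained via the multinomial theorem
    s = sum(pj ** (1.0 - alpha) * rj ** alpha for pj, rj in zip(pi, rho))
    return (1.0 - s ** m) / (alpha * (1.0 - alpha))

pi, rho, alpha = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2], 0.7
for m in (1, 3):
    assert abs(alpha_div_direct(pi, rho, m, alpha)
               - alpha_div_closed(pi, rho, m, alpha)) < 1e-10
```

Note that conventions for the $\alpha$-divergence differ across the literature; the sketch simply keeps the same convention in both evaluations.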

The $\alpha$-divergence is independent of the choice of $\Lambda$ since

\displaystyle D_{\alpha}(P,Q;\Lambda)=\frac{1}{\alpha(1-\alpha)}{\int\left[1-\left(\frac{\partial Q}{\partial P}\right)^{\alpha}\right]{\rm d}P}.

This indicates that $D_{\alpha}(P,Q;\Lambda)$ is independent of the choice of $\Lambda$, as $D_{\alpha}(P,Q;\Lambda)=D_{\alpha}(P,Q;\tilde{\Lambda})$ for any $\tilde{\Lambda}$. Consequently, equation (1.4) is also independent of $C$. In general, any divergence of the Csiszár class is independent of the choice of the reference measure [26].

(2) $\beta$-divergence:

\displaystyle D_{\beta}(P,Q;\Lambda)=\frac{1}{\beta(\beta+1)}\int\{p(y)^{\beta+1}+{\beta}q(y)^{\beta+1}-(\beta+1)p(y)q(y)^{\beta}\}{\rm d}\Lambda(y), (1.5)

where $\beta$ belongs to $\mathbb{R}$. For more details, refer to [2, 65]. Let us consider a generator function defined as follows:

\displaystyle U_{\beta}(R)=\frac{1}{\beta(\beta+1)}R^{\beta+1}\ \text{ for }\ R\geq 0.

It follows from the convexity of $U_{\beta}(R)$ that $U_{\beta}(R_{1})-U_{\beta}(R_{0})\geq U_{\beta}^{\prime}(R_{0})(R_{1}-R_{0})$. This shows that $D_{\beta}(P,Q;\Lambda)$ is a divergence measure due to

\displaystyle D_{\beta}(P,Q;\Lambda)=\int\{U_{\beta}(p)-U_{\beta}(q)-U_{\beta}^{\prime}(q)(p-q)\}{\rm d}\Lambda.
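For pmfs on a finite support with the counting measure as $\Lambda$, the defining integral (1.5) is a finite sum, and the nonnegativity implied by the convexity argument can be checked directly (illustrative code, not from the text; names are ours):

```python
def beta_div(p, q, beta):
    # D_beta(P, Q) of (1.5) for pmfs on a common finite support
    # (counting measure as the reference measure)
    return sum(pj ** (beta + 1) + beta * qj ** (beta + 1)
               - (beta + 1) * pj * qj ** beta
               for pj, qj in zip(p, q)) / (beta * (beta + 1))

p, q = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]
for beta in (-0.5, 0.5, 1.0, 2.0):   # beta = 0 and beta = -1 are excluded
    assert beta_div(p, q, beta) > 0           # positive since p != q
    assert abs(beta_div(p, p, beta)) < 1e-12  # zero at identity
```

For $\beta=1$ the sum collapses to $\frac{1}{2}\sum_{j}(p_{j}-q_{j})^{2}$, half the squared Euclidean distance between the two pmfs.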

We also observe the property of preserving the Pythagorean relation for the $\beta$-divergence. When $P$, $Q$ and $R$ form a right triangle by the $\beta$-divergence, the right triangle is preserved for the triple $P_{t}$, $Q$ and $R_{s}$.

It is worth noting that the $\beta$-divergence depends on the choice of reference measure $\Lambda$. For instance, if we choose $\tilde{\Lambda}$ as the reference measure, then the $\beta$-divergence is given by:

\displaystyle D_{\beta}(P,Q;\tilde{\Lambda})=\frac{1}{\beta(\beta+1)}\int\{\tilde{p}^{\beta+1}+\beta\tilde{q}^{\beta+1}-(\beta+1)\tilde{p}\tilde{q}^{\beta}\}{\rm d}\tilde{\Lambda}.

Here, $\tilde{p}=\partial P/\partial\tilde{\Lambda}$ and $\tilde{q}=\partial Q/\partial\tilde{\Lambda}$. This can be rewritten as

\displaystyle\frac{1}{\beta(\beta+1)}\left\{\int(\tilde{\lambda}p)^{\beta}{\rm d}P+\beta\int(\tilde{\lambda}q)^{\beta}{\rm d}Q-(\beta+1)\int(\tilde{\lambda}q)^{\beta}{\rm d}P\right\}, (1.6)

where $\tilde{\lambda}={\partial\Lambda}/{\partial\tilde{\Lambda}}$. Hence, the integrands of $D_{\beta}(P,Q;\tilde{\Lambda})$ are given by the integrands of $D_{\beta}(P,Q;\Lambda)$ multiplied by $\tilde{\lambda}^{\beta}$. The choice of the reference measure thus has a substantial effect on the value of the $\beta$-divergence.

We again consider a multinomial distribution ${\tt MN}(\pi,m)$ defined in (1.3). Unfortunately, the $\beta$-divergence $D_{\beta}({\tt MN}(\pi,m),{\tt MN}(\rho,m);C)$ with the counting measure $C$ would have an intractable expression. Therefore, we select $\tilde{\Lambda}$ in a way that the Radon-Nikodym (RN) derivative is defined as

\displaystyle\frac{\partial\tilde{\Lambda}}{\partial C}(y)=\binom{m}{y_{1}\cdots y_{k}} (1.7)

as a reference measure. Accordingly, $\tilde{p}(y,\pi,m)=\pi_{1}^{y_{1}}\cdots\pi_{k}^{y_{k}}$, and hence

\displaystyle\int\tilde{p}(y,\pi,m)^{\beta}{\rm d}P(y)=\sum_{y\in{\mathcal{Y}}}\binom{m}{y_{1}\cdots y_{k}}\left\{\pi_{1}^{y_{1}}\cdots\pi_{k}^{y_{k}}\right\}^{\beta+1},

which is equal to $\big{\{}\sum_{j=1}^{k}\pi_{j}^{\beta+1}\big{\}}^{m}$ by the multinomial theorem. Using this approach, a closed-form expression for the $\beta$-divergence can be derived:

\displaystyle D_{\beta}({\tt MN}(\pi,m),{\tt MN}(\rho,m);\tilde{\Lambda})=\frac{\big{\{}\sum_{j=1}^{k}\pi_{j}^{\beta+1}\big{\}}^{m}}{\beta(\beta+1)}+\frac{\big{\{}\sum_{j=1}^{k}\rho_{j}^{\beta+1}\big{\}}^{m}}{\beta+1}-\frac{\big{\{}\sum_{j=1}^{k}\pi_{j}\rho_{j}^{\beta}\big{\}}^{m}}{\beta} (1.8)

due to (1.6). In this way, the expression (1.8) has a tractable form, which reduces to the standard form of the $\beta$-divergence when $m=1$. Subsequent discussions will explore the choice of reference measure that provides the most accurate inference within statistical models, such as the generalized linear model and the model of inhomogeneous Poisson point processes.
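The reduction can also be verified numerically (an illustrative sketch, not from the text; names are ours): summing the integrand of (1.5) over the multinomial support, with each support point weighted by its multinomial coefficient so that $\tilde{p}(y)=\pi_{1}^{y_{1}}\cdots\pi_{k}^{y_{k}}$, reproduces the closed form (1.8):

```python
import math
from itertools import product

def beta_div_multinomial(pi, rho, m, beta):
    # direct evaluation of (1.5) under a reference measure that assigns
    # the multinomial coefficient as mass to each support point, so that
    # ptilde(y) = prod_j pi_j^{y_j} and qtilde(y) = prod_j rho_j^{y_j}
    total = 0.0
    for y in product(range(m + 1), repeat=len(pi)):
        if sum(y) != m:
            continue
        coef = math.factorial(m)
        for yj in y:
            coef //= math.factorial(yj)
        pt = math.prod(pj ** yj for pj, yj in zip(pi, y))
        qt = math.prod(rj ** yj for rj, yj in zip(rho, y))
        total += coef * (pt ** (beta + 1) + beta * qt ** (beta + 1)
                         - (beta + 1) * pt * qt ** beta)
    return total / (beta * (beta + 1))

def beta_div_closed(pi, rho, m, beta):
    # closed form (1.8)
    a = sum(pj ** (beta + 1) for pj in pi) ** m
    b = sum(rj ** (beta + 1) for rj in rho) ** m
    c = sum(pj * rj ** beta for pj, rj in zip(pi, rho)) ** m
    return a / (beta * (beta + 1)) + b / (beta + 1) - c / beta

pi, rho, beta = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2], 0.5
for m in (1, 2, 3):
    assert abs(beta_div_multinomial(pi, rho, m, beta)
               - beta_div_closed(pi, rho, m, beta)) < 1e-10
```

Setting $m=1$ recovers the standard $\beta$-divergence between the cell-probability vectors themselves.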

(3) $\gamma$-divergence [33]:

\displaystyle D_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\frac{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}{\rm d}\Lambda(y)}{\big{(}\int_{\mathcal{Y}}q(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{\gamma}{\gamma+1}}}+\frac{1}{\gamma}\Big{(}\int_{\mathcal{Y}}p(y)^{\gamma+1}{\rm d}\Lambda(y)\Big{)}^{\frac{1}{\gamma+1}}. (1.9)

If we define the $\gamma$-cross entropy as

\displaystyle H_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\frac{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}{\rm d}\Lambda(y)}{\big{(}\int_{\mathcal{Y}}q(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{\gamma}{\gamma+1}}},

then the $\gamma$-divergence is written as the difference:

\displaystyle D_{\gamma}(P,Q;\Lambda)=H_{\gamma}(P,Q;\Lambda)-H_{\gamma}(P,P;\Lambda).

It is noteworthy that the $\gamma$-cross entropy is a convex-linear functional with respect to the first argument:

\displaystyle\sum_{j=1}^{k}w_{j}H_{\gamma}(P_{j},Q;\Lambda)=H_{\gamma}(\bar{P},Q;\Lambda),

where the $w_{j}$'s are positive weights with $\sum_{j=1}^{k}w_{j}=1$ and $\bar{P}=\sum_{j=1}^{k}w_{j}P_{j}$. This property gives explicitly the empirical expression for the $\gamma$-entropy given a data set $\{Y_{i}\}_{1\leq i\leq n}$. Consider the empirical distribution $(1/n)\sum_{i=1}^{n}{1}_{Y_{i}}$ as $\bar{P}$, where $1_{Y_{i}}$ is the Dirac measure at the atom $Y_{i}$. Then

\displaystyle H_{\gamma}(\bar{P},Q;\Lambda)=-\frac{1}{\gamma}\frac{1}{n}\sum_{i=1}^{n}\frac{q(Y_{i})^{\gamma}}{\big{(}\int_{\mathcal{Y}}q(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{\gamma}{\gamma+1}}}.

If we assume that $\{Y_{i}\}_{1\leq i\leq n}$ is a sequence of independently and identically distributed observations from $P$, then

\mathbb{E}[H_{\gamma}(\bar{P},Q;\Lambda)]=H_{\gamma}(P,Q;\Lambda),

and hence $H_{\gamma}(\bar{P},Q;\Lambda)$ converges almost surely to $H_{\gamma}(P,Q;\Lambda)$ by the strong law of large numbers. Subsequently, this will be used to define the empirical loss based on the dataset. Needless to say, the empirical expression of the cross entropy $H_{0}(\bar{P},Q;\Lambda)$ in (1.2) is the negative log-likelihood. The $\gamma$-diagonal entropy is proportional to the Lebesgue norm with exponent $\lambda=\gamma+1$ as

\displaystyle H_{\gamma}(P,P;\Lambda)=-\frac{1}{\gamma}{\Big{(}\int_{\mathcal{Y}}p(y)^{\lambda}{\rm d}\Lambda(y)\Big{)}^{\frac{1}{\lambda}}}.
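The empirical expression of the $\gamma$-cross entropy and its convergence can be illustrated by simulation on a finite discrete $\mathcal{Y}$ with the counting measure (an illustrative sketch, not from the text; names are ours):

```python
import random

def gamma_cross_entropy(p, q, gamma):
    # population H_gamma(P, Q) for pmfs on a finite support
    norm = sum(qj ** (gamma + 1) for qj in q) ** (gamma / (gamma + 1))
    return -sum(pj * qj ** gamma for pj, qj in zip(p, q)) / (gamma * norm)

def empirical_gamma_cross_entropy(sample, q, gamma):
    # H_gamma(Pbar, Q): a normalized sample average of q(Y_i)^gamma
    norm = sum(qj ** (gamma + 1) for qj in q) ** (gamma / (gamma + 1))
    return -sum(q[y] ** gamma for y in sample) / (gamma * norm * len(sample))

random.seed(0)
p, q, gamma = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3], 0.5
sample = random.choices(range(3), weights=p, k=100000)
pop = gamma_cross_entropy(p, q, gamma)
emp = empirical_gamma_cross_entropy(sample, q, gamma)
# the empirical value is unbiased for, and converges to, the population value
assert abs(pop - emp) < 0.01
```

The empirical quantity is exactly the convex-linear combination of the per-observation cross entropies, which is why it is available in closed form from the data alone.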

Considering the conjugate exponent $\lambda^{*}=\lambda/(\lambda-1)$, the Hölder inequality for $p$ and $q^{\gamma}$ states

\displaystyle{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}{\rm d}\Lambda(y)}\leq{\Big{\{}\int_{\mathcal{Y}}p(y)^{\lambda}{\rm d}\Lambda(y)\Big{\}}^{\frac{1}{\lambda}}\Big{\{}\int_{\mathcal{Y}}\{q(y)^{\gamma}\}^{\lambda^{*}}{\rm d}\Lambda(y)\Big{\}}^{\frac{1}{\lambda^{*}}}}.

This holds for any $\lambda>1$ with equality if and only if $p=q$. This implies that the $\gamma$-divergence satisfies the definition of a divergence measure for any $\gamma>0$. It should be noted that the Hölder inequality is applied not to the pair $p$ and $q$ but to the pair $p$ and $q^{\gamma}$, which yields the property of vanishing exactly at identity required of a divergence measure. Also, the $\gamma$-divergence approaches the KL-divergence in the limit:

\displaystyle\lim_{\gamma\rightarrow 0}D_{\gamma}(P,Q;\Lambda)=D_{0}(P,Q;\Lambda).

This is because $\lim_{\gamma\rightarrow 0}(s^{\gamma}-1)/\gamma=\log s$ for all $s>0$; the same limiting argument applies to the $\alpha$ and $\beta$ divergence measures.
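The limit can be observed numerically on a finite discrete space (illustrative code, not from the text; names are ours):

```python
import math

def gamma_div(p, q, g):
    # gamma-divergence (1.9) for pmfs on a finite support (counting measure)
    num = sum(pj * qj ** g for pj, qj in zip(p, q))
    den = sum(qj ** (g + 1) for qj in q) ** (g / (g + 1))
    diag = sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1))
    return -num / (g * den) + diag / g

def kl_div(p, q):
    # KL-divergence (1.1) on the same support
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

p, q = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]
kl = kl_div(p, q)
# the gap to the KL-divergence shrinks as gamma tends to 0
assert abs(gamma_div(p, q, 0.001) - kl) < 1e-3
assert abs(gamma_div(p, q, 0.001) - kl) < abs(gamma_div(p, q, 1.0) - kl)
```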

We observe a close relationship between the $\beta$ and $\gamma$ divergence measures. Consider a minimization problem for the $\beta$-divergence over the scale of the second argument: $\min_{\sigma>0}D_{\beta}(P,\sigma Q;\Lambda)$. By definition, if we write $D_{\beta}^{*}(P,Q;\Lambda)=\min_{\sigma>0}D_{\beta}(P,\sigma Q;\Lambda)$, then $D_{\beta}^{*}(P,\sigma Q;\Lambda)=D_{\beta}^{*}(P,Q;\Lambda)$ for all $\sigma>0$. The minimizing $\sigma$ is given by

\displaystyle\sigma_{\rm opt}=\frac{\int p(y)q(y)^{\beta}{\rm d}\Lambda(y)}{\int q(y)^{\beta+1}{\rm d}\Lambda(y)}.

If we identify $\beta$ with $\gamma$, then the close relationship is found in

\displaystyle D_{\beta}(P,\sigma_{\rm opt}Q;\Lambda)=\frac{\{-\gamma H_{\gamma}(P,P;\Lambda)\}^{\beta+1}-\{-\gamma H_{\gamma}(P,Q;\Lambda)\}^{\beta+1}}{\beta(\beta+1)}.

In accordance with this, the $\gamma$-divergence can be viewed as the $\beta$-divergence interpreted in a projective geometry [27]. Similarly, consider a dual problem: $\min_{\sigma>0}D_{\beta}(\sigma P,Q;\Lambda)$. Then, the minimizing $\sigma$ is given by

\displaystyle\sigma_{\rm opt}^{*}=\Big{\{}\frac{\int p(y)q(y)^{\beta}{\rm d}\Lambda(y)}{\int p(y)^{\beta+1}{\rm d}\Lambda(y)}\Big{\}}^{\frac{1}{\beta}}.

Hence, the scale-adjusted divergence is given by

\displaystyle D_{\beta}(\sigma^{*}_{\rm opt}P,Q;\Lambda)=\frac{\{-\gamma H_{\gamma}^{*}(Q,Q;\Lambda)\}^{\frac{\beta+1}{\beta}}-\{-\gamma H_{\gamma}^{*}(P,Q;\Lambda)\}^{\frac{\beta+1}{\beta}}}{\beta+1} (1.10)

Thus, we get a dualistic version of the $\gamma$-divergence as

\displaystyle D^{*}_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\frac{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}{\rm d}\Lambda(y)}{\big{(}\int_{\mathcal{Y}}p(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{1}{\gamma+1}}}+\frac{1}{\gamma}\Big{(}\int_{\mathcal{Y}}q(y)^{\gamma+1}{\rm d}\Lambda(y)\Big{)}^{\frac{\gamma}{\gamma+1}}. (1.11)

We refer to $D^{*}_{\gamma}(P,Q;\Lambda)$ as the dual $\gamma$-divergence. If we define the dual $\gamma$-entropy as

\displaystyle H^{*}_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\frac{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}{\rm d}\Lambda(y)}{\big{(}\int_{\mathcal{Y}}p(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{1}{\gamma+1}}},

then $D^{*}_{\gamma}(P,Q;\Lambda)=H^{*}_{\gamma}(P,Q;\Lambda)-H^{*}_{\gamma}(Q,Q;\Lambda)$. In effect, the $\gamma$-divergence $D_{\gamma}$ and its dual $D_{\gamma}^{*}$ are connected as follows.

\displaystyle D^{*}_{\gamma}(P,Q;\Lambda)=\frac{\big{(}\int q^{\gamma+1}{\rm d}\Lambda\big{)}^{\frac{\gamma}{\gamma+1}}}{\big{(}\int p^{\gamma+1}{\rm d}\Lambda\big{)}^{\frac{1}{\gamma+1}}}D_{\gamma}(P,Q;\Lambda).
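This connecting identity is easy to verify numerically for pmfs on a finite support (an illustrative sketch, not from the text; names are ours):

```python
def gamma_div(p, q, g):
    # gamma-divergence (1.9) on a finite support (counting measure)
    num = sum(pj * qj ** g for pj, qj in zip(p, q))
    den = sum(qj ** (g + 1) for qj in q) ** (g / (g + 1))
    return -num / (g * den) + sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1)) / g

def gamma_div_dual(p, q, g):
    # dual gamma-divergence (1.11) on the same support
    num = sum(pj * qj ** g for pj, qj in zip(p, q))
    den = sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1))
    return -num / (g * den) + sum(qj ** (g + 1) for qj in q) ** (g / (g + 1)) / g

p, q, g = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3], 0.8
ratio = (sum(qj ** (g + 1) for qj in q) ** (g / (g + 1))
         / sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1)))
# the dual divergence equals the gamma-divergence times the connecting ratio
assert abs(gamma_div_dual(p, q, g) - ratio * gamma_div(p, q, g)) < 1e-12
```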

1.4 The $\gamma$-divergence and its dual

In the evolving landscape of statistical divergence measures, a lesser-explored but highly potent member of the family is the $\gamma$-divergence. This divergence serves as an interesting alternative to the more commonly used $\alpha$ and $\beta$ divergences, with unique properties and advantages that make it particularly suited for certain classes of problems. The dual $\gamma$-divergence offers further flexibility, allowing for nuanced analysis from different perspectives. The following section is dedicated to a deep dive into the mathematical formulations and properties of these divergences, shedding light on their invariance characteristics, relationships to other divergences, and potential applications. Notably, we shall establish that the $\gamma$-divergence is well-defined even for negative values of the exponent $\gamma$, and examine its special cases which connect to the geometric-mean and harmonic-mean divergences. This comprehensive treatment aims to illuminate the role that the $\gamma$-divergence and its dual can play in advancing both theoretical and applied aspects of statistical inference and machine learning [28].

Let us focus on the $\gamma$-divergence among the power divergence measures. We define a power-transformed function as follows:

\displaystyle q^{(\gamma)}(y)=\frac{q(y)^{\gamma+1}}{\int_{\mathcal{Y}}q(\tilde{y})^{\gamma+1}{\rm d}\Lambda(\tilde{y})}, (1.12)

which we refer to as the $\gamma$-expression of $q(y)$, where $Q\in{\mathcal{P}}$. Thus, the measure having the RN-derivative $q^{(\gamma)}$ belongs to $\mathcal{P}$ since $\int q^{(\gamma)}(y){\rm d}\Lambda(y)=1$. We can write

\displaystyle H_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\int_{\mathcal{Y}}p(y)\{q^{(\gamma)}(y)\}^{\frac{\gamma}{\gamma+1}}{\rm d}\Lambda(y),

and

\displaystyle H_{\gamma}^{*}(P,Q;\Lambda)=-\frac{1}{\gamma}\int_{\mathcal{Y}}\{p^{(\gamma)}(y)\}^{\frac{1}{\gamma+1}}q(y)^{\gamma}{\rm d}\Lambda(y).

These equations directly yield an observation: $D_{\gamma}$ and $D_{\gamma}^{*}$ are scale-invariant, but only with respect to one of the two arguments:

\displaystyle D_{\gamma}(P,\sigma Q;\Lambda)=D_{\gamma}(P,Q;\Lambda)\text{ and }D_{\gamma}^{*}(\sigma P,Q;\Lambda)=D_{\gamma}^{*}(P,Q;\Lambda); (1.13)

while

\displaystyle D_{\gamma}(\sigma P,Q;\Lambda)=\sigma D_{\gamma}(P,Q;\Lambda)\quad\text{ and }\quad D^{*}_{\gamma}(P,\sigma Q;\Lambda)=\sigma^{\gamma}D^{*}_{\gamma}(P,Q;\Lambda).
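Since the expressions (1.9) and (1.11) make sense for possibly unnormalized nonnegative functions, the scale relations can be checked directly (illustrative code, not from the text; names are ours):

```python
def gamma_div(p, q, g):
    # (1.9) evaluated for nonnegative weight sequences, not necessarily normalized
    num = sum(pj * qj ** g for pj, qj in zip(p, q))
    den = sum(qj ** (g + 1) for qj in q) ** (g / (g + 1))
    return -num / (g * den) + sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1)) / g

def gamma_div_dual(p, q, g):
    # (1.11) evaluated likewise
    num = sum(pj * qj ** g for pj, qj in zip(p, q))
    den = sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1))
    return -num / (g * den) + sum(qj ** (g + 1) for qj in q) ** (g / (g + 1)) / g

p, q, g, s = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3], 0.8, 2.5
sp = [s * x for x in p]
sq = [s * x for x in q]
# scale-invariance (1.13) in one argument ...
assert abs(gamma_div(p, sq, g) - gamma_div(p, q, g)) < 1e-10
assert abs(gamma_div_dual(sp, q, g) - gamma_div_dual(p, q, g)) < 1e-10
# ... and homogeneous scaling in the other
assert abs(gamma_div(sp, q, g) - s * gamma_div(p, q, g)) < 1e-10
assert abs(gamma_div_dual(p, sq, g) - s ** g * gamma_div_dual(p, q, g)) < 1e-10
```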

The power exponent $\gamma$ is usually assumed to be positive. However, we extend $\gamma$ to be any real number in this discussion; see [23] for a brief discussion.

Proposition 1.

$D_{\gamma}(P,Q;\Lambda)$ and $D^{*}_{\gamma}(P,Q;\Lambda)$ defined in (1.9) and (1.11), respectively, are both divergence measures in the sense of Definition 1 for any real number $\gamma$.

Proof.

We introduce two generator functions defined as:

\displaystyle V_{\gamma}(R)=-\frac{1}{\gamma}R^{\frac{\gamma}{1+\gamma}}\quad\text{and}\quad V_{\gamma}^{*}(R)=-\frac{1}{\gamma}R^{\frac{1}{1+\gamma}} (1.14)

for $R>0$. By definition, the $\gamma$-divergence can be expressed as:

\displaystyle D_{\gamma}(P,Q;\Lambda)=\int_{{\mathcal{Y}}}p(y)\{V_{\gamma}(q^{(\gamma)}(y))-V_{\gamma}(p^{(\gamma)}(y))\}{\rm d}\Lambda(y). (1.15)

Due to the convexity of $V_{\gamma}(R)$ in $R$ for any $\gamma\in\mathbb{R}$, we have

\displaystyle D_{\gamma}(P,Q;\Lambda)\geq\int_{{\mathcal{Y}}}p(y)V_{\gamma}^{\prime}(p^{(\gamma)}(y))\{q^{(\gamma)}(y)-p^{(\gamma)}(y)\}{\rm d}\Lambda(y) (1.16)

with equality if and only if $P=Q$. The right-hand side of (1.16) can be rewritten as:

\displaystyle-\frac{1}{1+\gamma}\Big{(}\int_{{\mathcal{Y}}}p^{\gamma+1}{\rm d}\Lambda\Big{)}^{\frac{1}{\gamma+1}}\int_{{\mathcal{Y}}}(q^{(\gamma)}-p^{(\gamma)}){\rm d}\Lambda.

The last integral vanishes identically since $p^{(\gamma)}$ and $q^{(\gamma)}$ both have total mass one. Similarly, we observe for any real number $\gamma$ that

\displaystyle D_{\gamma}^{*}(P,Q;\Lambda)=\int_{{\mathcal{Y}}}\{V^{*}_{\gamma}(p^{(\gamma)}(y))-V^{*}_{\gamma}(q^{(\gamma)}(y))\}q(y)^{\gamma}{\rm d}\Lambda(y), (1.17)

which is nonnegative, with equality if and only if $P=Q$, due to the convexity of $V_{\gamma}^{*}(R)$. Therefore, $D_{\gamma}(P,Q;\Lambda)$ and $D^{*}_{\gamma}(P,Q;\Lambda)$ are both divergence measures for any real number $\gamma$. ∎
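Proposition 1 can be illustrated numerically for pmfs on a finite discrete space (an illustrative sketch, not from the text; note that $\gamma=-1$ is skipped here because the exponents in the closed form (1.9) degenerate at $\gamma+1=0$, the GM divergence arising as a limiting case):

```python
def gamma_div(p, q, g):
    # gamma-divergence (1.9) for pmfs on a finite support; g may be negative
    num = sum(pj * qj ** g for pj, qj in zip(p, q))
    den = sum(qj ** (g + 1) for qj in q) ** (g / (g + 1))
    diag = sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1))
    return -num / (g * den) + diag / g

p, q = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]
for g in (-0.5, -2.0, -3.0):   # negative exponents, gamma = -1 excluded
    assert gamma_div(p, q, g) > 0           # positive since p != q
    assert abs(gamma_div(p, p, g)) < 1e-12  # zero at identity
```

The case $g=-2.0$ corresponds to the harmonic-mean (HM) divergence discussed in the introduction.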

We will discuss Dγ(P,Q;Λ)D_{\gamma}(P,Q;\Lambda) with a negative power exponent γ\gamma in the context of statistical inference. The γ\gamma-divergence (1.9) is implicitly assumed to be integrable, as is the KL-divergence; the integrability condition matters particularly for the γ\gamma-divergence with γ<0\gamma<0. Let us look into the case of a multinomial distribution 𝙼𝙽(π,m){\tt MN}(\pi,m) defined in (1.3) with the reference measure Λ~\tilde{\Lambda} given by (1.7). An argument similar to that for the β\beta-divergence yields

Dγ(𝙼𝙽(π,m),𝙼𝙽(ρ,m);Λ~)=mγj=1kπjρjγ{j=1kρj}γ+1γγ+1+mγ{j=1kπj}γ+11γ+1.\displaystyle D_{\gamma}({\tt MN}(\pi,m),{\tt MN}(\rho,m);\tilde{\Lambda})=-\frac{m}{\gamma}\frac{\sum_{j=1}^{k}\pi_{j}\rho_{j}{}^{\gamma}}{\big{\{}\sum_{j=1}^{k}\rho_{j}{}^{\gamma+1}\big{\}}^{\frac{\gamma}{\gamma+1}}}+\frac{m}{\gamma}\Big{\{}\sum_{j=1}^{k}\pi_{j}{}^{\gamma+1}\Big{\}}^{\frac{1}{\gamma+1}}.

Here, π\pi and ρ\rho are cell probability vectors of dimension kk; compare the log expression in (1.21). The γ\gamma-divergence with the counting measure would also have no closed-form expression. Therefore, careful consideration is needed when choosing the reference measure Λ\Lambda for the γ\gamma-divergence. Let π(y)\pi(y) be the RN-derivative of Λ\Lambda with respect to the Lebesgue measure LL. Then, the cross entropy term of the γ\gamma-divergence (1.9) is written as

Hγ(P,Q;Λ)=1γ𝒴p(y)q(y)γπ(y)𝑑L(y)(𝒴q(y)γ+1π(y)𝑑L(y))γγ+1,\displaystyle H_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\frac{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}\pi(y)dL(y)}{\ \ \big{(}\int_{\mathcal{Y}}q(y)^{\gamma+1}\pi(y)dL(y)\big{)}^{\frac{\gamma}{\gamma+1}}},

where π=Λ/L\pi=\partial\Lambda/\partial L. Our key objective is to identify a π\pi that ensures stable and smooth behavior under a given statistical model and dataset.
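The multinomial closed form displayed above is straightforward to evaluate. Below is a minimal sketch (our own illustration; NumPy and the function name are assumptions) that computes it and confirms the divergence properties.

```python
import numpy as np

def gamma_div_multinomial(pi, rho, m, gamma):
    # closed form of D_gamma(MN(pi, m), MN(rho, m); tilde-Lambda) displayed above
    cross = np.sum(pi * rho**gamma) / np.sum(rho**(gamma + 1))**(gamma / (gamma + 1))
    diag = np.sum(pi**(gamma + 1))**(1.0 / (gamma + 1))
    return (m / gamma) * (diag - cross)

pi = np.array([0.2, 0.3, 0.5])
rho = np.array([0.25, 0.25, 0.5])
for g in (-0.5, 0.5, 1.0):
    assert gamma_div_multinomial(pi, rho, 5, g) >= 0.0         # divergence property
    assert abs(gamma_div_multinomial(pi, pi, 5, g)) < 1e-12    # zero on the diagonal
```

Note that the expression is simply mm times the γ\gamma-divergence between the cell probability vectors under the counting measure, reflecting the choice of Λ~\tilde{\Lambda}.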

We discuss a generalization of the γ\gamma-divergence. Let VV be a convex function. Then, the VV-divergence is defined by

DV(P,Q;Λ)=𝒴p(y){V(v(zqq(y)))V(v(zpp(y)))}dΛ(y),D_{V}(P,Q;\Lambda)=\int_{\mathcal{Y}}p(y)\{V(v^{*}(z_{q}q(y)))-V(v^{*}(z_{p}p(y)))\}{\rm d}\Lambda(y), (1.18)

where zqz_{q} is a normalizing constant satisfying v(zqq(y))dΛ(y)=1\int v^{*}(z_{q}q(y)){\rm d}\Lambda(y)=1 and the function v(t)v^{*}(t) satisfies

V(v(t))=1t.V^{\prime}(v^{*}(t))=\frac{1}{t}. (1.19)

From the convexity of VV, it follows that

DV(P,Q;Λ)𝒴p(y)V(v(zpp(y))){v(zqq(y))v(zpp(y))}dΛ(y)\displaystyle D_{V}(P,Q;\Lambda)\geq\int_{\mathcal{Y}}p(y)V^{\prime}(v^{*}(z_{p}p(y)))\{v^{*}(z_{q}q(y))-v^{*}(z_{p}p(y))\}{\rm d}\Lambda(y)

which is equal to

𝒴{v(zqq(y))v(zpp(y))}dΛ(y)\displaystyle\int_{\mathcal{Y}}\{v^{*}(z_{q}q(y))-v^{*}(z_{p}p(y))\}{\rm d}\Lambda(y) (1.20)

up to the proportionality factor 1/zp1/z_{p}, due to (1.19). By the definition of the normalizing constants, the integral in (1.20) vanishes. Hence, DV(P,Q;Λ)D_{V}(P,Q;\Lambda) becomes a divergence measure. Specifically, DV(P,σQ;Λ)=DV(P,Q;Λ)D_{V}(P,\sigma Q;\Lambda)=D_{V}(P,Q;\Lambda) for σ>0\sigma>0 due to the normalizing constant zqz_{q}. For example, if V(R)=(1/γ)Rγγ+1V(R)=-(1/\gamma)R^{\frac{\gamma}{\gamma+1}} as in (1.14), then DV(P,Q;Λ)D_{V}(P,Q;\Lambda) reduces to the γ\gamma-divergence. There are various examples of VV other than (1.14), for example,

V(R)=2log(R+11)+R+1,\displaystyle V(R)=2\log(\sqrt{R+1}-1)+\sqrt{R+1},

which is related to the κ\kappa-entropy discussed in a physical context [52, 78]. We do not go further into this topic as it is beyond the scope of this book.

We investigate a notable property of the dual γ\gamma divergence. There exists a strong relationship between the generalized mean of probability measures and the minimization of the average dual γ\gamma divergence. Subsequently, we will explore its applications in active learning.

Proposition 2.

Consider an average of kk dual γ\gamma-divergence measures as

A(P)=j=1kwjDγ(P,Qj;Λ).\displaystyle A(P)=\sum_{j=1}^{k}w_{j}D_{\gamma}^{*}(P,Q_{j};\Lambda).

Let Popt=argminP𝒫A(P)P^{\rm opt}=\mathop{\rm argmin}_{P\in{\mathcal{P}}}A(P). Then, the Radon-Nikodym (RN) derivative of PoptP^{\rm opt} is uniquely determined as follows:

PoptΛ(y)=zw{j=1kwjqj(y)γ}1γ\displaystyle\frac{\partial P^{\rm opt}}{\partial\Lambda}(y)=z_{w}\Big{\{}\sum_{j=1}^{k}w_{j}q_{j}(y)^{\gamma}\Big{\}}^{\frac{1}{\gamma}}

where qj=Qj/Λq_{j}={\partial Q_{j}}/{\partial\Lambda} and zwz_{w} is the normalizing constant.

Proof.

If we write Popt/Λ\partial P^{\rm opt}/{\partial\Lambda} by poptp^{\rm opt}, then

A(P)A(Popt)=1γj=1kwj{𝒴p(y)qj(y)γdΛ(y)(𝒴p(y)γ+1dΛ(y))1γ+1𝒴popt(y)qj(y)γdΛ(y)(𝒴{popt(y)}γ+1dΛ(y))1γ+1}\displaystyle A(P)-A(P^{\rm opt})=-\frac{1}{\gamma}\sum_{j=1}^{k}w_{j}\Big{\{}\frac{\int_{\mathcal{Y}}p(y)q_{j}(y)^{\gamma}{\rm d}\Lambda(y)}{(\int_{\mathcal{Y}}p(y)^{\gamma+1}{\rm d}\Lambda(y))^{\frac{1}{\gamma+1}}}-\frac{\int_{\mathcal{Y}}p^{\rm opt}(y)q_{j}(y)^{\gamma}{\rm d}\Lambda(y)}{(\int_{\mathcal{Y}}\{p^{\rm opt}(y)\}^{\gamma+1}{\rm d}\Lambda(y))^{\frac{1}{\gamma+1}}}\Big{\}}

which is equal to

1zw1γ{𝒴p(y){popt(y)}γdΛ(y)(𝒴p(y)γ+1dΛ(y))1γ+1(𝒴{popt(y)}γ+1dΛ(y))γγ+1}.\displaystyle-\frac{1}{z_{w}}\frac{1}{\gamma}\Big{\{}\frac{\int_{\mathcal{Y}}p(y)\{p^{\rm opt}(y)\}^{\gamma}{\rm d}\Lambda(y)}{(\int_{\mathcal{Y}}p(y)^{\gamma+1}{\rm d}\Lambda(y))^{\frac{1}{\gamma+1}}}-{\Big{(}\int_{\mathcal{Y}}\{p^{\rm opt}(y)\}^{\gamma+1}{\rm d}\Lambda(y)\Big{)}^{\frac{\gamma}{\gamma+1}}}\Big{\}}.

This expression simplifies to (1/zw)Dγ(P,Popt;Λ)(1/z_{w})D_{\gamma}^{*}(P,P^{\rm opt};\Lambda). Therefore, A(P)A(Popt)A(P)\geq A(P^{\rm opt}) and the equality holds if and only if P=PoptP=P^{\rm opt}. This is due to the property of Dγ(P,Popt;Λ)D_{\gamma}^{*}(P,P^{\rm opt};\Lambda) as a divergence measure. ∎

The optimal distribution PoptP^{\rm opt} can be viewed as a consensus distribution, integrating the kk committee members’ distributions QjQ_{j} into the average of divergence measures with importance weights wjw_{j}. We adopt a “query by committee” approach and examine the robustness against variations in committee distributions. Proposition 2 leads to an average version of the Pythagorean relations:

j=1kwjDγ(P,Qj)=Dγ(P,Popt)+j=1kwjDγ(Popt,Qj).\displaystyle\sum_{j=1}^{k}w_{j}D_{\gamma}^{*}(P,Q_{j})=D_{\gamma}^{*}(P,P^{\rm opt})+\sum_{j=1}^{k}w_{j}D_{\gamma}^{*}(P^{\rm opt},Q_{j}).

We refer to PoptP^{\rm opt} as the power mean of the set {Qj}1jk\{Q_{j}\}_{1\leq j\leq k}. In general, a generalized mean is defined as

GMϕ=ϕ1(j=1kwjϕ(qj)),\displaystyle GM_{\phi}=\phi^{-1}\Big{(}\sum_{j=1}^{k}w_{j}\phi(q_{j})\Big{)},

where ϕ\phi is a one-to-one function on (0,)(0,\infty). We confirm that, if γ=1\gamma=1, then Popt=j=1kwjQjP^{\rm opt}=\sum_{j=1}^{k}w_{j}Q_{j}, which is the arithmetic mean, that is, the mixture distribution of the QjQ_{j}’s with mixture proportions wjw_{j}’s. If γ=1\gamma=-1, then

PoptΛ(y)=zw{j=1kwjqj(y)1}1,\displaystyle\frac{\partial P^{\rm opt}}{\partial\Lambda}(y)=z_{w}\Big{\{}\sum_{j=1}^{k}w_{j}q_{j}(y)^{-1}\Big{\}}^{-1},

which is the harmonic mean of qjq_{j}’s with weights wjw_{j}’s. As γ\gamma goes to 0, the dual γ\gamma-divergence Dγ(P,Q;Λ)D_{\gamma}^{*}(P,Q;\Lambda) converges to the dual KL-divergence D0(P,Q)D_{0}^{*}(P,Q) defined by D0(Q,P)D_{0}(Q,P). The minimizer is given by

PoptΛ(y)=zj=1k{qj(y)}wj,\displaystyle\frac{\partial P^{\rm opt}}{\partial\Lambda}(y)=z\prod_{j=1}^{k}\{q_{j}(y)\}^{w_{j}},

which is the geometric mean of the qjq_{j}’s with weights wjw_{j}’s. We will discuss divergence measures using the harmonic and geometric means of ratios of RN-derivatives in a later section.
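The special cases of Proposition 2 can be verified numerically. The sketch below (our own illustration; NumPy and the name `power_mean` are assumptions) computes the power mean of pmfs on a finite space and checks the arithmetic-mean case at γ=1\gamma=1, the harmonic-mean case at γ=1\gamma=-1, and the geometric-mean limit as γ0\gamma\rightarrow 0.

```python
import numpy as np

def power_mean(qs, w, gamma):
    # power mean of pmfs qs (rows) with weights w: the minimizer in
    # Proposition 2, normalized so that it is again a pmf
    if gamma == 0.0:                      # limiting geometric-mean case
        g = np.exp(w @ np.log(qs))
    else:
        g = (w @ qs**gamma)**(1.0 / gamma)
    return g / g.sum()                    # normalization plays the role of z_w

qs = np.array([[0.2, 0.3, 0.5],
               [0.4, 0.4, 0.2]])
w = np.array([0.6, 0.4])

assert np.allclose(power_mean(qs, w, 1.0), w @ qs)          # mixture at gamma = 1
h = 1.0 / (w @ (1.0 / qs))
assert np.allclose(power_mean(qs, w, -1.0), h / h.sum())    # harmonic at gamma = -1
assert np.allclose(power_mean(qs, w, 0.0),
                   power_mean(qs, w, 1e-8), atol=1e-6)      # gamma -> 0 limit
```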

We often utilize the logarithmic expression for the γ\gamma divergence, given by

Δγ(P,Q;Λ)=1γlog𝒴p(y)q(y)γdΛ(y)(𝒴p(y)γ+1dΛ(y))1γ+1(𝒴q(y)γ+1dΛ(y))γγ+1.\displaystyle\Delta_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\log\frac{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}{\rm d}\Lambda(y)}{\big{(}\int_{\mathcal{Y}}p(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{1}{\gamma+1}}\big{(}\int_{\mathcal{Y}}q(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{\gamma}{\gamma+1}}}. (1.21)

We find a remarkable invariance property:

Δγ(τP,σQ;Λ)=Δγ(P,Q;Λ)\displaystyle\Delta_{\gamma}(\tau P,\sigma Q;\Lambda)=\Delta_{\gamma}(P,Q;\Lambda)

for all τ>0\tau>0 and σ>0\sigma>0, noting that the log expression can be written as

Δγ(P,Q;Λ)=1γlog𝒴{p(γ)(y)}1γ+1{q(γ)(y)}γγ+1dΛ(y)\displaystyle\Delta_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\log{\int_{\mathcal{Y}}\{p^{(\gamma)}(y)\}^{\frac{1}{\gamma+1}}\{q^{(\gamma)}(y)\}^{\frac{\gamma}{\gamma+1}}{\rm d}\Lambda(y)} (1.22)

by the use of the γ\gamma-expression defined in (1.12). This implies that Δγ(P,Q;Λ)\Delta_{\gamma}(P,Q;\Lambda) measures not a departure between PP and QQ but an angle between them. When γ=1\gamma=1, this is the negative log cosine similarity for pp and qq. In effect, the cosine similarity for vectors aa and bb in d\mathbb{R}^{d} is defined by

cos(a,b)=j=1dajbj{j=1d|aj|2}12{j=1d|bj|2}12.\displaystyle\cos(a,b)=\frac{\sum_{j=1}^{d}a_{j}b_{j}}{\{\sum_{j=1}^{d}|a_{j}|^{2}\}^{\frac{1}{2}}\{\sum_{j=1}^{d}|b_{j}|^{2}\}^{\frac{1}{2}}}.

This is closely related to Δγ(P,Q;Λ)\Delta_{\gamma}(P,Q;\Lambda) on a discrete space 𝒴\mathcal{Y} when γ=1\gamma=1. We will discuss an extension of Δγ\Delta_{\gamma} defined on the space of all signed measures, which comprehensively gives an asymmetry in measuring the angle.
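The angle interpretation can be checked directly. The sketch below (our own illustration, assuming NumPy and a finite space with the counting reference measure) evaluates the log expression (1.21) and confirms both the cosine identity at γ=1\gamma=1 and the scale invariance in both arguments.

```python
import numpy as np

def log_gamma_div(p, q, gamma):
    # logarithmic expression (1.21) of the gamma-divergence on a finite
    # space with the counting reference measure
    num = np.sum(p * q**gamma)
    den = (np.sum(p**(gamma + 1))**(1 / (gamma + 1))
           * np.sum(q**(gamma + 1))**(gamma / (gamma + 1)))
    return -np.log(num / den) / gamma

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])
cos_pq = np.sum(p * q) / (np.sqrt(np.sum(p**2)) * np.sqrt(np.sum(q**2)))
assert np.isclose(log_gamma_div(p, q, 1.0), -np.log(cos_pq))  # gamma = 1 case
# scale invariance in both arguments: Delta(tau*P, sigma*Q) = Delta(P, Q)
assert np.isclose(log_gamma_div(2.0 * p, 0.5 * q, 1.0), log_gamma_div(p, q, 1.0))
```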

In summary, the exploration of power divergence measures in this section has illuminated their potential as versatile tools for quantifying the divergence between probability distributions. From the foundational Kullback-Leibler divergence to the more specialized α\alpha, β\beta, and γ\gamma divergence measures, we have seen that each type has its own strengths and limitations, making them suited for particular classes of problems. We have also underscored the mathematical properties that make these divergences unique, such as invariance under different conditions and applicability in empirical settings. As the field of statistics and machine learning continues to evolve, it is evident that these power divergence measures will find even broader applications, providing rigorous ways to compare models, make predictions, and draw inferences from increasingly complex data.

1.5 GM and HM divergence measures

We discuss a situation where a random variable YY is discrete, taking values in a finite set denoted by 𝒴={1,,k}{\mathcal{Y}}=\{1,...,k\}. Let 𝒫\mathcal{P} be the space of all probability measures on 𝒴\mathcal{Y}. In the realm of statistical divergence measures, the arithmetic, geometric, and harmonic means for a probability measure PP of 𝒫\mathcal{P} receive less attention despite their mathematical elegance and potential applications. For this, consider the RN-derivative of a probability measure relative to QQ, which equals the ratio of probability mass functions (pmfs) pp and qq induced by PP and QQ in 𝒫{\mathcal{P}}. Then, there is a well-known inequality between the arithmetic and geometric means:

y𝒴p(y)q(y)r(y)y𝒴{p(y)q(y)}r(y)\displaystyle\sum_{y\in{\mathcal{Y}}}\frac{p(y)}{q(y)}r(y)\geq\prod_{y\in{\mathcal{Y}}}\Big{\{}\frac{p(y)}{q(y)}\Big{\}}^{r(y)} (1.23)

and that between the arithmetic and harmonic means:

y𝒴p(y)q(y)r(y)[y𝒴{p(y)q(y)}1r(y)]1,\displaystyle\sum_{y\in{\mathcal{Y}}}\frac{p(y)}{q(y)}r(y)\geq\Big{[}\sum_{y\in{\mathcal{Y}}}\Big{\{}\frac{p(y)}{q(y)}\Big{\}}^{-1}r(y)\Big{]}^{-1}, (1.24)

where r(y)r(y) is an arbitrarily fixed pmf on 𝒴\mathcal{Y} serving as a weight function. Equality in (1.23) or (1.24) holds if and only if p=qp=q. These well-known inequality relationships among the means serve as the mathematical bedrock for defining new divergence measures. Specifically, the Geometric Mean (GM) and Harmonic Mean (HM) divergences are inspired by inequalities involving these means and ratios of probabilities as follows.
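The inequalities (1.23) and (1.24) are quickly verified numerically. The sketch below (our own illustration; NumPy and the random choice of pmfs are assumptions) evaluates the weighted arithmetic, geometric, and harmonic means of the ratio p/qp/q.

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(6))
q = rng.dirichlet(np.ones(6))
r = rng.dirichlet(np.ones(6))   # an arbitrarily fixed weight pmf

ratio = p / q
am = np.sum(ratio * r)          # arithmetic mean: LHS of (1.23) and (1.24)
gm = np.prod(ratio**r)          # geometric mean: RHS of (1.23)
hm = 1.0 / np.sum(r / ratio)    # harmonic mean: RHS of (1.24)
assert am >= gm >= hm           # AM >= GM >= HM, with equality iff p = q
# equality case: the ratio p/p is identically one, so all three means are 1
assert np.isclose(np.sum((p / p) * r), np.prod((p / p)**r))
```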

First, we define the GM-divergence as

DGM(P,Q;R)=y𝒴p(y)q(y)r(y)y𝒴q(y)r(y)y𝒴p(y)r(y)\displaystyle D_{\rm GM}(P,Q;R)=\sum_{y\in{\mathcal{Y}}}\frac{p(y)}{q(y)}r(y)\prod_{y\in{\mathcal{Y}}}q(y)^{r(y)}-\prod_{y\in{\mathcal{Y}}}p(y)^{r(y)}

transforming the expression (1.23), where p(y)p(y), q(y)q(y) and r(y)r(y) are the pmfs with respect to PP, QQ and RR, respectively. Note that DGM(P,Q;R)D_{\rm GM}(P,Q;R) is a divergence measure on 𝒫\mathcal{P} as defined in Definition 1. We restrict 𝒴\mathcal{Y} to be a finite discrete set for this discussion; however, our results can be generalized. In effect, the GM-divergence has a general form:

DGM(P,Q;R)=𝒴p(y)q(y)𝑑R(y)exp{𝒴logq(y)𝑑R(y)}exp{𝒴logp(y)𝑑R(y)}\displaystyle D_{\rm GM}(P,Q;R)=\int_{\mathcal{Y}}\frac{p(y)}{q(y)}dR(y)\exp\Big{\{}\int_{\mathcal{Y}}\log q(y)dR(y)\Big{\}}-\exp\Big{\{}\int_{\mathcal{Y}}\log p(y)dR(y)\Big{\}} (1.25)

For comparison, let us look at the γ\gamma-divergence

Dγ(P,Q;R)=1γ𝒴p(y)q(y)γ𝑑R(y)(𝒴q(y)γ+1𝑑R(y))γγ+1+1γ(𝒴p(y)γ+1𝑑R(y))1γ+1,\displaystyle D_{\gamma}(P,Q;R)=-\frac{1}{\gamma}\frac{\int_{{\mathcal{Y}}}p(y){q(y)}^{\gamma}dR(y)}{\big{(}\int_{{\mathcal{Y}}}{q(y)}^{\gamma+1}dR(y)\big{)}^{\frac{\gamma}{\gamma+1}}}+\frac{1}{\gamma}{\big{(}\int_{{\mathcal{Y}}}{p(y)}^{\gamma+1}dR(y)\big{)}^{\frac{1}{\gamma+1}}}, (1.26)

by selecting a probability measure RR as a reference measure.

Proposition 3.

Let DGM(P,Q;R)D_{\rm GM}(P,Q;R) and Dγ(P,Q;R)D_{\gamma}(P,Q;R) be defined in (1.25) and (1.26), respectively. Then,

limγ1Dγ(P,Q;R)=DGM(P,Q;R)\displaystyle\lim_{\gamma\rightarrow-1}D_{\gamma}(P,Q;R)=D_{\rm GM}(P,Q;R) (1.27)
Proof.

It follows from L’Hôpital’s rule that

limγ11γ+1log𝒴p(y)γ+1𝑑R(y)=𝒴logp(y)𝑑R(y).\displaystyle\lim_{\gamma\rightarrow-1}{\frac{1}{\gamma+1}}\log{\int_{{\mathcal{Y}}}{p(y)}^{\gamma+1}dR(y)}=\int_{{\mathcal{Y}}}\log p(y)dR(y).

This implies

limγ1{𝒴p(y)γ+1𝑑R(y)}1γ+1=exp{𝒴logp(y)𝑑R(y)}.\displaystyle\lim_{\gamma\rightarrow-1}\Big{\{}{\int_{{\mathcal{Y}}}{p(y)}^{\gamma+1}dR(y)}\Big{\}}^{\frac{1}{\gamma+1}}=\exp\Big{\{}\ \int_{{\mathcal{Y}}}\log p(y)dR(y)\Big{\}}.

Accordingly, we conclude (1.27). ∎

We write the GM-divergence as the difference of the cross and diagonal entropy measures: DGM(P,Q;R)=HGM(P,Q;R)HGM(P,P;R)D_{\rm GM}(P,Q;R)=H_{\rm GM}(P,Q;R)-H_{\rm GM}(P,P;R), where

HGM(P,Q;R)=𝒴p(y)q(y)𝑑R(y)exp{𝒴logq(y)𝑑R(y)}.\displaystyle H_{\rm GM}(P,Q;R)=\int_{{\mathcal{Y}}}\frac{p(y)}{q(y)}dR(y)\exp\Big{\{}\int_{{\mathcal{Y}}}\log q(y)dR(y)\Big{\}}.

The GM-divergence has a log expression:

ΔGM(P,Q;R)=logHGM(P,Q;R)HGM(P,P;R).\displaystyle\Delta_{\rm GM}(P,Q;R)=\log\frac{H_{\rm GM}(P,Q;R)}{H_{\rm GM}(P,P;R)}.

We note that ΔGM(P,Q;R)\Delta_{\rm GM}(P,Q;R) is equal to Δγ(P,Q;R)\Delta_{\gamma}(P,Q;R) in the limit as γ\gamma tends to 1-1. Here we discuss the case of the Poisson distribution family. We choose a Poisson distribution Po(τ)(\tau) as the reference measure. Then, the GM-divergence in log form is given by

ΔGM(𝙿𝚘(λ),𝙿𝚘(μ);𝙿𝚘(τ))=τ(λμlogλμ1).\displaystyle\Delta_{\rm GM}({\tt Po}(\lambda),{\tt Po}(\mu);{\tt Po}(\tau))=\tau\Big{(}\frac{\lambda}{\mu}-\log\frac{\lambda}{\mu}-1\Big{)}.
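This closed form, τ(λ/μlog(λ/μ)1)\tau(\lambda/\mu-\log(\lambda/\mu)-1), can be checked against the definition of ΔGM\Delta_{\rm GM}. The sketch below (our own illustration; NumPy, `math.lgamma`, and truncation of the Poisson support at 300 are assumptions) evaluates HGM(P,Q;R)H_{\rm GM}(P,Q;R) directly with Poisson pmfs and the reference Po(τ)(\tau).

```python
import numpy as np
from math import lgamma

y = np.arange(0, 300)
lam, mu, tau = 2.0, 3.0, 1.5
lg = np.array([lgamma(v + 1.0) for v in y])

def log_pmf(l):
    # log of the Poisson(l) pmf at each point of y
    return -l + y * np.log(l) - lg

lp, lq, lr = log_pmf(lam), log_pmf(mu), log_pmf(tau)
r = np.exp(lr)

# H_GM(P,Q;R) = [sum (p/q) r] * exp{ sum (log q) r }, truncated at y = 299
H_pq = np.sum(np.exp(lp - lq) * r) * np.exp(np.sum(lq * r))
H_pp = np.sum(r) * np.exp(np.sum(lp * r))
delta = np.log(H_pq / H_pp)
assert np.isclose(delta, tau * (lam / mu - np.log(lam / mu) - 1.0), atol=1e-8)
```

The log-factorial terms cancel between the two entropies, so the same value is obtained whichever dominating measure the densities are taken against.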

Second, we introduce the HM-divergence. Suppose r(y)=q(y)1/z𝒴q(z)1r(y)=q(y)^{-1}/\sum_{z\in{\mathcal{Y}}}q(z)^{-1} in the inequality (1.24). Then, (1.24) is written as

y𝒴p(y)q(y)2{y𝒴q(y)1}1{y𝒴p(y)1}1{y𝒴q(y)1}.\displaystyle\sum_{y\in{\mathcal{Y}}}\frac{p(y)}{q(y)^{2}}\Big{\{}\sum_{y\in{\mathcal{Y}}}q(y)^{-1}\Big{\}}^{-1}\geq\Big{\{}\sum_{y\in{\mathcal{Y}}}{p(y)}^{-1}\Big{\}}^{-1}\Big{\{}\sum_{y\in{\mathcal{Y}}}q(y)^{-1}\Big{\}}. (1.28)

We define the harmonic-mean (HM) divergence by rearranging the inequality (1.28) as

DHM(P,Q)=HHM(P,Q)HHM(P,P).\displaystyle D_{\rm HM}(P,Q)=H_{\rm HM}(P,Q)-H_{\rm HM}(P,P).

Here

HHM(P,Q)=y𝒴p(y)q(y)2{y𝒴q(y)1}2\displaystyle H_{\rm HM}(P,Q)=\sum_{y\in{\mathcal{Y}}}\frac{p(y)}{q(y)^{2}}\Big{\{}\sum_{y\in{\mathcal{Y}}}q(y)^{-1}\Big{\}}^{-2}

is the cross entropy, where p(y)p(y) and q(y)q(y) are the pmfs of PP and QQ, respectively. The (diagonal) entropy is given by the harmonic mean of the p(y)p(y)’s:

HHM(P,P)={y𝒴p(y)1}1.\displaystyle H_{\rm HM}(P,P)=\Big{\{}\sum_{y\in{\mathcal{Y}}}p(y)^{-1}\Big{\}}^{-1}.

Note that DHM(P,Q)D_{\rm HM}(P,Q) qualifies as a divergence measure on 𝒫\mathcal{P}, as defined in Definition 1, due to the inequality (1.28). When γ=2\gamma=-2, DHM(P,Q)D_{\rm HM}(P,Q) is equal to Dγ(P,Q;C)D_{\gamma}(P,Q;C) in (1.9) with the counting measure CC. The log form is given by

ΔHM(P,Q)=logHHM(P,Q)logHHM(P,P).\displaystyle\Delta_{\rm HM}(P,Q)=\log H_{\rm HM}(P,Q)-\log H_{\rm HM}(P,P).
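The HM-divergence is likewise easy to evaluate on a finite space. The sketch below (our own illustration; NumPy and the helper names are assumptions) implements the cross and diagonal entropies above and checks the divergence properties implied by (1.28).

```python
import numpy as np

def H_HM(p, q):
    # cross entropy of the HM type on a finite space
    return np.sum(p / q**2) / np.sum(1.0 / q)**2

def D_HM(p, q):
    # HM-divergence: H_HM(P,Q) minus the harmonic mean of the p(y)'s
    return H_HM(p, q) - 1.0 / np.sum(1.0 / p)

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.3, 0.3, 0.2, 0.2])
assert D_HM(p, q) >= 0.0             # divergence property from (1.28)
assert np.isclose(D_HM(p, p), 0.0)   # vanishes when P = Q
```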

The GM-divergence provides an insightful lens through which we can examine statistical similarity or dissimilarity by leveraging the multiplicative nature of probabilities. The HM-divergence, on the other hand, focuses on rates and ratios, thus providing a complementary perspective to the GM-divergence, particularly useful in scenarios where rate-based analysis is pivotal. By extending the divergence measures to include the GM and HM divergences, we gain a nuanced toolkit for quantifying divergence, each with unique advantages and applications. For instance, the GM-divergence could be particularly useful in applications where multiplicative effects are prominent, such as in network science or econometrics. Similarly, the HM-divergence might be beneficial in settings like biostatistics or communications, where rate and proportion are of prime importance. This framework, rooted in the relationships among arithmetic, geometric, and harmonic means, not only expands the class of divergence measures but also elevates our understanding of how different mathematical properties can be tailored to suit the needs of diverse statistical challenges.

1.6 Concluding remarks

In summary, this chapter has laid the groundwork for understanding the class of power divergence measures in a probabilistic framework. We have seen that divergence measures quantify the difference between two probability distributions and have applications in statistics and machine learning. The chapter begins with the well-known Kullback-Leibler (KL) divergence, highlighting its advantages and limitations. To address the limitations of the KL-divergence, three types of power divergence measures are introduced.

Let us look at the α\alpha, β\beta, and γ\gamma-divergence measures for a Poisson distribution model. Let 𝙿𝚘(λ){\tt Po}(\lambda) denote a Poisson distribution with the RN-derivative

p(y,λ)=λyeλ\displaystyle p(y,\lambda)=\lambda^{y}e^{-\lambda} (1.29)

with respect to the reference measure RR having (R)/(C)(y)=1/y!(\partial R)/(\partial C)(y)=1/y!. Seven examples of power divergence between Poisson distributions 𝙿𝚘(λ0){\tt Po}(\lambda_{0}) and 𝙿𝚘(λ1){\tt Po}(\lambda_{1}) are listed in Table 1.1. Note that this choice of the reference measure enables us to obtain tractable forms of the β\beta and γ\gamma divergences as well as their variants. Here we use a basic formula to obtain these divergence measures:

𝒴p(y,λ)adR(y)=exp(λaaλ)\displaystyle\int_{{\mathcal{Y}}}p(y,\lambda)^{a}{\rm d}R(y)=\exp(\lambda^{a}-a\lambda)

for any real exponent aa. The contour sets of the seven divergences between Poisson distributions are plotted in Figure 1.1. All the divergences attain the unique minimum 0 on the diagonal line {(λ0,λ1):λ0=λ1}\{(\lambda_{0},\lambda_{1}):\lambda_{0}=\lambda_{1}\}. The contour sets of the GM and HM divergences are flat compared with those of the other divergences.
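The basic formula above can be verified by direct summation, since dR/dC(y)=1/y!{\rm d}R/{\rm d}C(y)=1/y!. The sketch below (our own illustration; NumPy, `math.lgamma`, and truncation of the sum at 200 are assumptions) checks it for one choice of λ\lambda and aa.

```python
import numpy as np
from math import lgamma

lam, a = 2.5, 1.7
y = np.arange(0, 200)
log_fact = np.array([lgamma(v + 1.0) for v in y])
# integral of p(y, lam)^a = (lam^y e^{-lam})^a against dR(y) = dC(y)/y!
lhs = np.sum(np.exp(a * (y * np.log(lam) - lam) - log_fact))
assert np.isclose(lhs, np.exp(lam**a - a * lam))
```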

Table 1.1: The variants of power divergence
α\alpha-divergence 1α(1α)(1exp{λ0(λ1λ0)α(1α)λ0αλ1})\frac{1}{\alpha(1-\alpha)}\big{(}1-\exp\{\lambda_{0}(\frac{\lambda_{1}}{\lambda_{0}})^{\alpha}-(1-\alpha)\lambda_{0}-\alpha\lambda_{1}\}\big{)}
β\beta-divergence eλ0β+1(β+1)λ0+βeλ1β+1(β+1)λ1(β+1)eλ0λ1βλ0βλ1β(β+1)\displaystyle\frac{e^{\lambda_{0}^{\beta+1}-(\beta+1)\lambda_{0}}+\beta e^{\lambda_{1}^{\beta+1}-(\beta+1)\lambda_{1}}-(\beta+1)e^{\lambda_{0}\lambda_{1}^{\beta}-\lambda_{0}-\beta\lambda_{1}}}{\beta(\beta+1)}
γ\gamma-divergence 1γexp(λ0){exp(λ0λ1γγγ+1λ1γ+1)exp(1γ+1λ0γ+1)}-\frac{1}{\gamma}\exp({-\lambda_{0}})\{\exp(\lambda_{0}\lambda_{1}^{\gamma}-\frac{\gamma}{\gamma+1}\lambda_{1}^{\gamma+1})-\exp(\frac{1}{\gamma+1}\lambda_{0}^{\gamma+1})\}
dual γ\gamma-divergence 1γexp(γλ1){exp(λ0λ1γ1γ+1λ0γ+1)exp(γγ+1λ1γ+1)}-\frac{1}{\gamma}\exp({-\gamma\lambda_{1}})\{\exp(\lambda_{0}\lambda_{1}^{\gamma}-\frac{1}{\gamma+1}\lambda_{0}^{\gamma+1})-\exp(\frac{\gamma}{\gamma+1}\lambda_{1}^{\gamma+1})\}
log γ\gamma-divergence 1γ(λ0λ1γ1γ+1λ0γ+1γγ+1λ1γ+1)-\frac{1}{\gamma}(\lambda_{0}\lambda_{1}^{\gamma}-\frac{1}{\gamma+1}\lambda_{0}^{\gamma+1}-\frac{\gamma}{\gamma+1}\lambda_{1}^{\gamma+1})
GM-divergence λ1exp(λ0){exp(λ0λ1)exp(1)}\lambda_{1}\exp({-\lambda_{0}})\{\exp(\frac{\lambda_{0}}{\lambda_{1}})-\exp(1)\}
HM-divergence 12exp(λ0){exp(λ0λ1221λ1)exp(1λ0)}\small\mbox{$\frac{1}{2}$}\exp({-\lambda_{0}})\{\exp(\frac{\lambda_{0}}{\lambda_{1}^{2}}-2\frac{1}{\lambda_{1}})-\exp(-\frac{1}{\lambda_{0}})\}
Refer to caption
Figure 1.1: Contour plots of the power divergences.

The α\alpha-divergence is intrinsic for assessing the divergence between two probability measures. One of its most important properties is invariance under the choice of the reference measure that expresses the Radon-Nikodym derivatives of the two probability measures. This invariance provides a direct understanding of intrinsic properties beyond those of the probability density or mass functions. A serious drawback is that an empirical counterpart is not available for a given dataset in most practical situations. This makes it difficult to apply the α\alpha-divergence to statistical inference for estimation and prediction. In effect, its statistical applications are limited to the curved exponential family, which is modeled within an exponential family. See [18] for the statistical curvature characterizing second-order efficiency.

The β\beta-divergence and the γ\gamma-divergence are not invariant under the choice of the reference measure. We have to determine the reference measure from the viewpoint of applications to statistics and machine learning. Subsequently, we discuss the appropriate selection of the reference measure for both the β\beta and γ\gamma divergences. Both divergence measures are effective for applications in statistics and machine learning since the empirical loss function for a parametric model distribution is available for any dataset. For example, the β\beta-divergence is utilized as a cost function measuring the difference between a data matrix and its factorization in nonnegative matrix factorization. In such applications the minimum β\beta-divergence method is more robust than the maximum likelihood method, which can be viewed as the minimum KL-divergence method. In practice, the β\beta-divergence is not scale-invariant on the space of all finite measures, which includes the space of all probability measures. We will see that this lack of scale invariance precludes simple estimating functions even under a normal distribution model.

Alternatively, the γ\gamma-divergence is scale-invariant with respect to the second argument. The γ\gamma-divergence provides a simple estimating function for the minimum γ\gamma-estimator. This property enables us to propose an efficient learning algorithm for solving the estimating equation. For example, the γ\gamma-divergence is used for cluster analysis. The cluster centers are determined by local minima of the empirical loss function defined by the γ\gamma-divergence; see [75, 74] for the learning architecture. A fixed-point type of algorithm is proposed to conduct fast detection of the local minima. Such practical properties in applications will be explored in the following section. We also consider the dual γ\gamma-divergence, which is scale-invariant in the first argument. We will explore its applicability to defining the consensus distribution in a context of active learning. It is confirmed that the γ\gamma-divergence is well defined even for negative values of the exponent γ\gamma. The γ\gamma-divergences with γ=1\gamma=-1 and γ=2\gamma=-2 reduce to the GM and HM divergences, respectively. In subsequent discussions, special attention is given to the GM and HM divergences for various objectives in applications; see [26] for the application to dynamic treatment regimes in medical science.

Chapter 2 Minimum divergence for regression model

This chapter explores statistical estimation within regression models. We introduce a comprehensive class of estimators known as Minimum Divergence Estimators (MDEs), along with their empirical loss functions under a parametric framework. Standard properties such as unbiasedness, consistency, and asymptotic normality of these estimators are thoroughly examined. Additionally, the chapter addresses the issue of model misspecification, which can result in biases, inaccurate inferences, and diminished statistical power, and highlights the vulnerability of conventional methods to such misspecifications. Our primary goal is to identify estimators that remain robust against potential biases arising from model misspecification. We place particular emphasis on the γ\gamma-divergence, which underpins the γ\gamma-estimator known for its efficiency and robustness.

2.1 Introduction

We study statistical estimation in a regression model including a generalized linear model. The maximum likelihood (ML) method is widely employed and developed for the estimation problem. This estimation method has been standard on the basis of solid evidence that the ML-estimator is asymptotically consistent and efficient when the underlying distribution is correctly specified in a regression model. See [16, 62, 7, 39]. The power of parametric inference in regression models is substantial, offering several advantages and capabilities that are essential for effective statistical analysis and decision-making. The ML method has been a cornerstone of parametric inference. This principle yields estimators that are asymptotically unbiased, consistent, and efficient, given that the model is correctly specified. Specifically, generalized linear models (GLMs) extend linear models to accommodate response variables that have error distribution models other than a normal distribution, enabling the use of ML estimation across binomial, Poisson, and other exponential family distributions.

However, we are frequently concerned with model misspecification, which occurs when the statistical model does not accurately represent the data-generating process. This could be due to the omission of important variables, the inclusion of irrelevant variables, incorrect functional forms, or wrong error structures. Such misspecification can lead to biases, inaccurate inferences, and reduced statistical power. See [95] for the critical issue of model misspecification. A misspecified model is more sensitive to outliers, often resulting in more biased estimates. Outliers can also obscure the true relationships between variables, making model misspecification difficult to detect. Unfortunately, the performance of the ML-estimator is easily degraded in such difficult situations because of its excessive sensitivity to model misspecification. Such limitations in the face of model misspecification and complex data structures have prompted the development of a broad spectrum of alternative methodologies. In this way, we take the MDE approach rather than the maximum likelihood.

We discuss a class of estimating methods based on the minimization of divergence measures [2, 65, 69, 67, 21, 66, 22, 55, 53, 54, 26]. These are known as minimum divergence estimators (MDEs). The empirical loss functions for a given dataset are discussed from a unified perspective under a parametric model. Thus, we derive a broad class of estimation methods via MDEs. Our primary objective is to find estimators that are robust against potential biases in the presence of model misspecification. MDEs can be applied in a straightforward manner when the outcome is a continuous variable, in which case the reference measure defining the divergence is fixed to the Lebesgue measure. Alternatively, more consideration is needed regarding the choice of a reference measure when the outcome is a discrete variable. In particular, the β\beta and γ\gamma divergence measures depend strongly on the choice of a reference measure. We explore effective choices for the reference measure to ensure that the corresponding divergences are tractable and can be expressed feasibly. We focus on the γ\gamma-divergence as a robust MDE through an effective choice of the reference measure.

This chapter is organized as follows. Section 2.2 gives an overview of M-estimation in the framework of generalized linear models. In Section 2.3 the γ\gamma-divergence is introduced in a regression model and the γ\gamma-loss function is discussed. In Section 2.4 we focus on the γ\gamma-estimator in a normal linear regression model. A simple numerical example demonstrates a robust property of the γ\gamma-estimator compared to the ML-estimator. Section 2.5 discusses a logistic model for binary regression. The γ\gamma-loss function is shown to have a robust property in which the Euclidean distance from the estimating function to the decision boundary is uniformly bounded when γ\gamma is in a specific range. Section 2.6 extends the result in the binary case to a multiclass case. Section 2.7 considers a Poisson regression model focusing on a log-linear model. The γ\gamma-divergence is given by a specific choice of the reference measure. The robustness of the γ\gamma-estimator is confirmed for any γ\gamma in the specific range. A simple numerical experiment is conducted. Finally, concluding remarks on geometric understandings are given in Section 2.8.

2.2 M-estimators in a generalized linear model

Let us establish a probabilistic framework for a dd-dimensional covariate variable XX in a subset 𝒳\mathcal{X} of d\mathbb{R}^{d}, and an outcome YY with a value of a subset 𝒴\mathcal{Y} of \mathbb{R} in a regression model paradigm. The major objective is to estimate the regression function

r(x)=𝔼[Y|X=x]\displaystyle r(x)=\mathbb{E}[Y|X=x]

based on a given dataset. In a paradigm of prediction, XX is often called a feature vector, where to build a predictor defined by a function from 𝒳\mathcal{X} to 𝒴\mathcal{Y} is one of the most important tasks. Let 𝒫(x){\mathcal{P}}(x) be a space of conditional probability measures conditioned on X=xX=x. For any event BB in 𝒴\mathcal{Y} the conditional probability given X=xX=x is written by

P(B|x)=Bp(y|x)dΛ(y),\displaystyle P(B|x)=\int_{B}p(y|x){\rm d}\Lambda(y),

where p(y|x)p(y|x) is the RN-derivative of Y=yY=y given X=xX=x with a reference measure Λ\Lambda. A statistical model embedded in 𝒫(x){\mathcal{P}}(x) is written as

={P(|x,θ):θΘ},\displaystyle{{\mathcal{M}}}=\{P(\cdot|x,\theta):\theta\in\Theta\}, (2.1)

where θ\theta is a parameter of a parameter space Θ\Theta. Then, the Kullback-Leibler (KL) divergence on 𝒫(x){\mathcal{P}}(x) is given by

D0(P0(|x),P1(|x))=H0(P0(|x),P1(|x))H0(P0(|x),P0(|x)),\displaystyle D_{0}(P_{0}(\cdot|x),P_{1}(\cdot|x))=H_{0}(P_{0}(\cdot|x),P_{1}(\cdot|x))-H_{0}(P_{0}(\cdot|x),P_{0}(\cdot|x)),

with the cross entropy,

H0(P0(|x),P1(|x))=𝒴p0(y|x)logp1(y|x)dΛ(y).\displaystyle H_{0}(P_{0}(\cdot|x),P_{1}(\cdot|x))=-\int_{\mathcal{Y}}{p_{0}(y|x)}\log{p_{1}(y|x)}{\rm d}\Lambda(y).

Note that the KL-divergence is independent of the choice of reference measure Λ\Lambda as discussed in Chapter 1. Let 𝒟={(Xi,Yi):1in}{\mathcal{D}}=\{(X_{i},Y_{i}):1\leq i\leq n\} be a random sample drawn from a distribution of {\mathcal{M}}. The goal is to estimate the parameter θ\theta in {\mathcal{M}} in (2.1). Then, the negative log-likelihood function is defined by

L0(θ;Λ)=1ni=1nlogp(Yi|Xi,θ),\displaystyle L_{0}(\theta;\Lambda)=-\frac{1}{n}\sum_{i=1}^{n}\log p(Y_{i}|X_{i},\theta),

where p(y|x,θ)p(y|x,\theta) is the RN-derivative of P(,θ)P(\cdot,\theta) with respect to Λ\Lambda. Note that, for any measure Λ~\tilde{\Lambda} equivalent to Λ\Lambda, the negative log-likelihood functions satisfy L0(θ;Λ~)=L0(θ;Λ)L_{0}(\theta;\tilde{\Lambda})=L_{0}(\theta;\Lambda) up to a constant. The expectation of L0(θ;Λ)L_{0}(\theta;\Lambda) under the model distribution of \mathcal{M} is equal to the cross entropy:

𝔼0[L0(θ;Λ)|X¯]=1ni=1nH0(P(|Xi,θ0),P(|Xi,θ))\displaystyle\mathbb{E}_{0}[L_{0}(\theta;\Lambda)|\underline{X}]=\frac{1}{n}\sum_{i=1}^{n}H_{0}(P(\cdot|X_{i},\theta_{0}),P(\cdot|X_{i},\theta)) (2.2)

where X¯=(X1,,Xn)\underline{X}=(X_{1},...,X_{n}), θ0\theta_{0} is the true value of the parameter, and 𝔼0\mathbb{E}_{0} is the conditional expectation under the model distributions P(|Xi,θ0)P(\cdot|X_{i},\theta_{0}). Hence,

𝔼0[L0(θ;Λ)|X¯]𝔼0[L0(θ0;Λ)|X¯]=1ni=1nD0(P(|Xi,θ0),P(|Xi,θ)),\displaystyle\mathbb{E}_{0}[L_{0}(\theta;\Lambda)|\underline{X}]-\mathbb{E}_{0}[L_{0}(\theta_{0};\Lambda)|\underline{X}]=\frac{1}{n}\sum_{i=1}^{n}D_{0}(P(\cdot|X_{i},\theta_{0}),P(\cdot|X_{i},\theta)),

which can be viewed as an empirical analogue of the Pythagorean equation. Due to the property of the KL-divergence as a divergence measure,

θ0=argminθΘ𝔼0[L0(θ;Λ)|X¯].\displaystyle\theta_{0}=\mathop{\rm argmin}_{\theta\in\Theta}\mathbb{E}_{0}[L_{0}(\theta;\Lambda)|\underline{X}].

By definition, the ML-estimator θ^0\hat{\theta}_{0} is the minimizer of L0(θ;Λ)L_{0}(\theta;\Lambda) in θ\theta, while the true value θ0\theta_{0} is the minimizer of 𝔼0[L0(θ;Λ)|X¯]\mathbb{E}_{0}[L_{0}(\theta;\Lambda)|\underline{X}] in θ\theta. The continuous mapping theorem then establishes the consistency of the ML-estimator for the true parameter; see [61, 94]. The estimating function is defined by the gradient of the negative log-likelihood function

S0(θ;Λ)=θL0(θ;Λ).\displaystyle{S}_{0}(\theta;\Lambda)=\frac{\partial}{\partial\theta}L_{0}(\theta;\Lambda).

Hence, the ML-estimator θ^\hat{\theta} is a solution of the estimating equation, S0(θ;Λ)=0{S}_{0}(\theta;\Lambda)=0 under regularity conditions. We note that the solution of the expected estimating function under the distribution with the true value θ0\theta_{0} is θ0\theta_{0} itself, that is,

𝔼0[S0(θ0;Λ)|X¯]=0.\displaystyle\mathbb{E}_{0}[{S}_{0}(\theta_{0};\Lambda)|\underline{X}]=0.

This implies that, by the continuous mapping theorem again, the ML-estimator is consistent for the true value θ0\theta_{0}.

The framework of a generalized linear model (GLM) is suitable for a wide range of data types other than the ordinary linear regression model, see [62]. While the ordinary linear regression usually assumes that the response variable is normally distributed, GLMs allow for response variables that have different distributions, such as the Bernoulli, categorical, Poisson, negative binomial distributions and exponential families in a unified manner. In this way, GLMs provide excellent applicability for a wide range of data types, including count data, binary data, and other types of skewed or non-continuous data. A GLM consists of three main components:

  1. Random Component: Specifies the probability distribution of the response variable YY. This is typically a member of the exponential family of distributions (e.g., normal, exponential, binomial, Poisson).

  2. Systematic Component: Represents the linear combination of the predictor variables, similar to ordinary linear regression. It is usually expressed as η=θX\eta=\theta^{\top}X.

  3. Link Function: Provides the relationship between the random and systematic components. The expected value of YY given X=xX=x, i.e., the regression function, is in one-to-one correspondence with the linear predictor η=θx\eta=\theta^{\top}x through the link function gg.

In the framework of the GLM, an exponential dispersion model is employed as

pexp(y,ω,ϕ)=exp{yωa(ω)ϕ+c(y,ϕ)},\displaystyle p_{\exp}(y,\omega,\phi)=\exp\Big{\{}\frac{y\omega-a(\omega)}{\phi}+c(y,\phi)\Big{\}},

with respect to a reference measure Λ\Lambda, where ω\omega and ϕ\phi are called the canonical and dispersion parameters, respectively; see [51]. Here we assume that ω\omega can take any value in (,)(-\infty,\infty). This allows for the linear modeling ω=g(θx)\omega=g(\theta^{\top}x) with a flexible form of the link function gg. In particular, if gg is the identity function, it is referred to as the canonical link function. This formulation covers most practical models in statistics, such as the logistic and log-linear models. In practice, the dispersion parameter ϕ\phi is usually estimated separately from θ\theta, and hence we assume ϕ=1\phi=1 is known for simplicity. This leads to a generalized linear model:

p(y|x,θ)=exp{yωa(ω)+c(y)}\displaystyle p(y|x,\theta)=\exp\big{\{}y\omega-a(\omega)+c(y)\big{\}} (2.3)

with ω=g(θx)\omega=g(\theta^{\top}x) as the conditional RN-derivative of YY given X=xX=x. The regression function is given by

𝔼[Y|X=x,θ]=a(g(θx))\displaystyle\mathbb{E}[Y|X=x,\theta]=a^{\prime}(g(\theta^{\top}x))

due to the Bartlett identity.
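The Bartlett-identity formula above can be checked numerically. The following sketch assumes a Poisson model with the canonical link (gg the identity), so that a(ω)=eωa(\omega)=e^{\omega} and c(y)=logy!c(y)=-\log y!; the values of θ\theta and xx are arbitrary illustrations.

```python
import math

# Illustrative check of E[Y|X=x, theta] = a'(g(theta^T x)) for a Poisson GLM
# with the canonical link (g the identity), where a(w) = exp(w) and
# c(y) = -log(y!).  We sum y * p(y) directly over a truncated range.

def poisson_pmf(y, w):
    # p(y|x, theta) = exp{y*w - a(w) + c(y)} with w = theta^T x
    return math.exp(y * w - math.exp(w) - math.lgamma(y + 1))

theta = [0.3, -0.2]
x = [1.0, 2.0]
w = sum(t * xi for t, xi in zip(theta, x))   # linear predictor

mean = sum(y * poisson_pmf(y, w) for y in range(200))
print(round(mean, 6), round(math.exp(w), 6))
```

The truncation at 200 terms is harmless here since the Poisson tail decays super-exponentially.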

Let us consider M-estimators for the parameter θ\theta in the linear model (2.3). The M-estimator was originally introduced to cover robust estimators of a location parameter; see [47] for breakthrough ideas in robust statistics and [83] for robust regression. We define an M-type loss function for the GLM defined in (2.3):

L¯Ψ(θ,𝒟)=1ni=1nΨ(Yi,θXi)\displaystyle\bar{L}_{\Psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}\Psi(Y_{i},\theta^{\top}X_{i}) (2.4)

for a given dataset 𝒟={(Xi,Yi)}i=1n{\cal D}=\{(X_{i},Y_{i})\}_{i=1}^{n} and we call

θ^Ψ:=argminθdL¯Ψ(θ,𝒟)\displaystyle\hat{\theta}_{\Psi}:=\mathop{\rm argmin}_{\theta\in\mathbb{R}^{d}}\bar{L}_{\Psi}(\theta,{\cal D})

the M-estimator. Here the generator function Ψ(y,s)\Psi(y,s) is assumed to be convex with respect to ss. If Ψ(y,s)=a(g(s))yg(s)\Psi(y,s)=a(g(s))-yg(s), then the M-estimator is nothing but the ML-estimator. Thus, the estimating function is given by

S¯ψ(θ,𝒟)=1ni=1nψ(Yi,θXi)Xi,\displaystyle\bar{S}_{\psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}\psi(Y_{i},\theta^{\top}X_{i})X_{i}, (2.5)

where ψ(y,s)=(/s)Ψ(y,s)\psi(y,s)=(\partial/\partial s)\Psi(y,s). If we confine the generator function Ψ\Psi to the form Ψ(ys)\Psi(y-s), then this formulation reduces to the original form of M-estimation [48, 93]. In general, the estimating function is characterized by ψ(Y,θX)\psi(Y,\theta^{\top}X). Hereafter we assume that

𝔼θ[ψ(Y,θX)|X=x]=0,\displaystyle\mathbb{E}_{\theta}[\psi(Y,\theta^{\top}X)|X=x]=0,

where 𝔼θ\mathbb{E}_{\theta} is the expectation under p(y|x,θ)p(y|x,\theta). This assumption leads to consistency of the estimator θ^Ψ\hat{\theta}_{\Psi}. We note that the relationship between the loss function and the estimating function is not one-to-one; indeed, there exist many choices of the estimating function that yield the estimator θ^Ψ\hat{\theta}_{\Psi} other than (2.5). We next give a geometric discussion of unbiased estimating functions.
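The conditional unbiasedness assumption above can be verified explicitly in the simplest case. The sketch below assumes a logistic GLM with the canonical link, under the sign convention that the loss is minimized, so Ψ(y,s)=a(s)ys\Psi(y,s)=a(s)-ys with a(s)=log(1+es)a(s)=\log(1+e^{s}) and ψ(y,s)=a(s)y\psi(y,s)=a^{\prime}(s)-y; the value of θx\theta^{\top}x is an arbitrary illustration.

```python
import math

# Check of E_theta[psi(Y, theta^T x) | X = x] = 0 for the logistic GLM with
# canonical link: psi(y, s) = a'(s) - y = sigmoid(s) - y, summed against the
# Bernoulli model probabilities.

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def psi(y, s):
    return sigmoid(s) - y

s = 0.7                    # theta^T x
p1 = sigmoid(s)            # P(Y = 1 | X = x, theta)
expected_psi = (1.0 - p1) * psi(0, s) + p1 * psi(1, s)
print(expected_psi)
```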

We investigate the behavior of the score function Sγ(x,y,θ){S}_{\gamma}(x,y,\theta) of the γ\gamma-estimator. By definition, the γ\gamma-estimator is the solution at which the sample mean of the score function equals zero. We write a linear predictor as Fθ(x)=θ1x+θ0F_{\theta}(x)=\theta_{1}^{\top}x+\theta_{0}, where θ=(θ0,θ1)\theta=(\theta_{0},\theta_{1}). We call

Hθ={xp:Fθ(x)=0}\displaystyle H_{\theta}=\{x\in\mathbb{R}^{p}:F_{\theta}(x)=0\} (2.6)

the prediction boundary. Then, the following formula is well known in Euclidean geometry.

Proposition 4.

Let x𝒳x\in{\cal X} and d(x,Hθ)d(x,H_{\theta}) be the Euclidean distance from xx to the prediction boundary HθH_{\theta} defined in (2.6). Then,

d(x,Hθ)=|Fθ(x)|θ1\displaystyle d(x,H_{\theta})=\frac{|F_{\theta}(x)|}{\|\theta_{1}\|} (2.7)
Proof.

Let xx^{*} be the projection of xx onto HθH_{\theta}. Then, d(x,Hθ)=xxd(x,H_{\theta})=\|x-x^{*}\|, where \|\cdot\| denotes the Euclidean norm. There exists a nonzero scalar τ\tau such that xx=τθ1x-x^{*}=\tau\theta_{1}, noting that a normal vector to the hyperplane HθH_{\theta} is given by θ1\theta_{1}. Hence, θ1(xx)=τθ12\theta_{1}^{\top}(x-x^{*})=\tau\|\theta_{1}\|^{2} and

d(x,Hθ)=|τ|θ1=|θ1(xx)|θ1,\displaystyle d(x,H_{\theta})=|\tau|\|\theta_{1}\|=\frac{|\theta_{1}^{\top}(x-x^{*})|}{\|\theta_{1}\|}, (2.8)

which concludes (2.7) since |θ1(xx)|=|Fθ(x)|{|\theta_{1}^{\top}(x-x^{*})|}=|F_{\theta}(x)| due to Fθ(x)=0F_{\theta}(x^{*})=0. ∎
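Proposition 4 can be checked numerically by comparing the closed-form distance (2.7) with the distance to the explicit projection. The numbers below are arbitrary illustrative values.

```python
import math

# Numerical check of Proposition 4: the distance from x to the prediction
# boundary H_theta = {x : theta_1^T x + theta_0 = 0} equals
# |F_theta(x)| / ||theta_1||.  We compare with the distance to the explicit
# projection x* = x - (F_theta(x) / ||theta_1||^2) theta_1.

theta1 = [3.0, -4.0]
theta0 = 2.0
x = [1.0, 1.0]

F = sum(a * b for a, b in zip(theta1, x)) + theta0   # F_theta(x)
norm1 = math.sqrt(sum(a * a for a in theta1))        # ||theta_1||

x_star = [xi - (F / norm1 ** 2) * ti for xi, ti in zip(x, theta1)]
dist_proj = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_star)))
print(abs(F) / norm1, dist_proj)
```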

Thus, a covariate vector XX of 𝒳\cal X is decomposed into the orthogonal and horizontal components as X=Zθ(X)+Wθ(X)X=Z_{\theta}(X)+W_{\theta}(X), where

Zθ(X)=θXθ2θandWθ(X)=XZθ(X).\displaystyle Z_{\theta}(X)=\frac{\theta^{\top}X}{\|\theta\|^{2}}\>\theta\ \ {\rm and}\ \ W_{\theta}(X)=X-Z_{\theta}(X). (2.9)

We note that Zθ(X)Wθ(X)=0Z_{\theta}(X)^{\top}W_{\theta}(X)=0 and X2=Zθ(X)2+Wθ(X)2\|X\|^{2}=\|Z_{\theta}(X)\|^{2}+\|W_{\theta}(X)\|^{2}. Due to the orthogonal decomposition (2.9) of XX, the estimating function is also decomposed into

Sψ(Y,X,θ)=Sψ(O)(Y,X,θ)+Sψ(H)(Y,X,θ),\displaystyle S_{\psi}(Y,X,\theta)=S^{\rm(O)}_{\psi}(Y,X,\theta)+S^{\rm(H)}_{\psi}(Y,X,\theta),

where

Sψ(O)(Y,X,θ)=ψ(Y,θZθ(X))Zθ(X),Sψ(H)(Y,X,θ)=ψ(Y,θZθ(X))Wθ(X).\displaystyle S^{\rm(O)}_{\psi}(Y,X,\theta)=\psi(Y,\theta^{\top}Z_{\theta}(X))Z_{\theta}(X),{\ \ }S^{\rm(H)}_{\psi}(Y,X,\theta)=\psi(Y,\theta^{\top}Z_{\theta}(X))W_{\theta}(X).

Here we use the property θZθ(X)=θX\theta^{\top}Z_{\theta}(X)=\theta^{\top}X. Thus, in Sψ(O)(Y,X,θ)S^{\rm(O)}_{\psi}(Y,X,\theta), ψ(Y,θZθ(X))\psi(Y,\theta^{\top}Z_{\theta}(X)) and Zθ(X)Z_{\theta}(X) are strongly connected to each other; in Sψ(H)(Y,X,θ)S^{\rm(H)}_{\psi}(Y,X,\theta), ψ(Y,θZθ(X))\psi(Y,\theta^{\top}Z_{\theta}(X)) and Wθ(X)W_{\theta}(X) are less connected.
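The identities behind the decomposition (2.9) are easy to verify numerically. The vectors below are arbitrary illustrative values.

```python
# Numerical check of (2.9): Z_theta(X) and W_theta(X) satisfy Z^T W = 0,
# ||X||^2 = ||Z||^2 + ||W||^2, and theta^T Z_theta(X) = theta^T X.

theta = [1.0, 2.0, -1.0]
X = [0.5, -1.5, 2.0]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

Z = [dot(theta, X) / dot(theta, theta) * t for t in theta]  # orthogonal part
W = [xi - zi for xi, zi in zip(X, Z)]                       # horizontal part

orth = dot(Z, W)                              # should be 0
pyth = dot(X, X) - dot(Z, Z) - dot(W, W)      # should be 0
proj = dot(theta, Z) - dot(theta, X)          # should be 0
print(orth, pyth, proj)
```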

The estimating function (2.5) is decomposed into a sum of the orthogonal and horizontal components,

S¯ψ(θ,𝒟)=S¯ψ(O)(θ,𝒟)+S¯ψ(H)(θ,𝒟),\displaystyle\bar{S}_{\psi}(\theta,{\cal D})=\bar{S}^{\rm(O)}_{\psi}(\theta,{\cal D})+\bar{S}^{\rm(H)}_{\psi}(\theta,{\cal D}),

where

S¯ψ(O)(θ,𝒟)=1ni=1nSψ(O)(Yi,Xi,θ),S¯ψ(H)(θ,𝒟)=1ni=1nSψ(H)(Yi,Xi,θ).\displaystyle\bar{S}^{\rm(O)}_{\psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}S^{\rm(O)}_{\psi}(Y_{i},X_{i},\theta),\ \ \bar{S}^{\rm(H)}_{\psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}S^{\rm(H)}_{\psi}(Y_{i},X_{i},\theta).

We consider a specific type of contamination in the covariate space 𝒳\cal X.

Proposition 5.

Let 𝒟={(Xi,Yi)}i=1n{\cal D}=\{(X_{i},Y_{i})\}_{i=1}^{n} and 𝒟={(Xi,Yi)}i=1n{\cal D}^{*}=\{(X^{*}_{i},Y_{i})\}_{i=1}^{n}, where Xi=Xi+σ(Xi)Wθ(Xi)X^{*}_{i}=X_{i}+\sigma(X_{i})W_{\theta}(X_{i}) with an arbitrary fixed scalar σ(Xi)\sigma(X_{i}) depending on XiX_{i}. Then, L¯Ψ(θ,𝒟)=L¯Ψ(θ,𝒟)\bar{L}_{\Psi}(\theta,{\cal D}^{*})=\bar{L}_{\Psi}(\theta,{\cal D}), S¯ψ(O)(θ,𝒟)=S¯ψ(O)(θ,𝒟)\bar{S}^{\rm(O)}_{\psi}(\theta,{\cal D}^{*})=\bar{S}^{\rm(O)}_{\psi}(\theta,{\cal D}) and

S¯ψ(H)(θ,𝒟)=1ni=1nψ(Yi,θZθ(Xi))(1+σ(Xi))Wθ(Xi).\displaystyle\bar{S}^{\rm(H)}_{\psi}(\theta,{\cal D}^{*})=\frac{1}{n}\sum_{i=1}^{n}\psi(Y_{i},\theta^{\top}Z_{\theta}(X_{i}))(1+\sigma(X_{i}))W_{\theta}(X_{i}).
Proof.

By definition, Zθ(Xi)=Zθ(Xi)Z_{\theta}(X^{*}_{i})=Z_{\theta}(X_{i}) and Wθ(Xi)=(1+σ(Xi))Wθ(Xi)W_{\theta}(X^{*}_{i})=(1+\sigma(X_{i}))W_{\theta}(X_{i}) due to Zθ(Xi)Wθ(Xi)=0Z_{\theta}(X_{i})^{\top}W_{\theta}(X_{i})=0. These imply the conclusion. ∎
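The invariance claim of Proposition 5 can be checked numerically. The sketch below uses the illustrative generator Ψ(y,s)=(ys)2/2\Psi(y,s)=(y-s)^{2}/2 (an assumed example, not the general Ψ\Psi of the text) and a synthetic dataset: contaminating each XiX_{i} along its horizontal component leaves the loss (2.4) unchanged since θXi=θXi\theta^{\top}X^{*}_{i}=\theta^{\top}X_{i}.

```python
# Numerical check of Proposition 5 with Psi(y, s) = (y - s)^2 / 2: the loss is
# invariant under contamination along the horizontal component W_theta(X_i).

theta = [1.0, -2.0]
data = [([1.0, 0.5], 0.2), ([-0.5, 1.5], -3.0), ([2.0, 2.0], -1.0)]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

def loss(dataset):
    return sum((y - dot(theta, x)) ** 2 / 2 for x, y in dataset) / len(dataset)

def contaminate(x, sigma):
    nt2 = dot(theta, theta)
    Z = [dot(theta, x) / nt2 * t for t in theta]      # orthogonal component
    W = [xi - zi for xi, zi in zip(x, Z)]             # horizontal component
    return [xi + sigma * wi for xi, wi in zip(x, W)]  # X* = X + sigma * W

data_star = [(contaminate(x, 5.0), y) for x, y in data]
print(loss(data), loss(data_star))
```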

We observe that L¯Ψ(θ,𝒟)\bar{L}_{\Psi}(\theta,{\cal D}^{*}) and S¯ψ(O)(θ,𝒟)\bar{S}_{\psi}^{(O)}(\theta,{\cal D}^{*}) are both unaffected by the contamination in {Xi}\{X_{i}^{*}\}. In contrast, S¯ψ(H)(θ,𝒟)\bar{S}_{\psi}^{(H)}(\theta,{\cal D}^{*}) is substantially influenced through the scalar multiplication. Hence, we can change the definition of the horizontal component as

S¯¯ψ(H)(θ,𝒟)=1ni=1nψ(Yi,θZθ(Xi))Wθ(Xi)Wθ(Xi)\displaystyle\bar{\bar{S}}^{\rm(H)}_{\psi}(\theta,{\cal D}^{*})=\frac{1}{n}\sum_{i=1}^{n}\psi(Y_{i},\theta^{\top}Z_{\theta}(X_{i}))\frac{W_{\theta}(X_{i})}{\|W_{\theta}(X_{i})\|}

by choosing σ(Xi)=Wθ(Xi)11\sigma(X_{i})={\|W_{\theta}(X_{i})\|}^{-1}-1. Then, it has a mild behavior in the sense that

S¯¯ψ(H)(θ,𝒟)1ni=1n|ψ(Yi,θZθ(Xi))|.\displaystyle\|\bar{\bar{S}}^{\rm(H)}_{\psi}(\theta,{\cal D}^{*})\|\leq\frac{1}{n}\sum_{i=1}^{n}|\psi(Y_{i},\theta^{\top}Z_{\theta}(X_{i}))|.

In this way, the estimating function (2.5) of M-estimator θ^Ψ\hat{\theta}_{\Psi} can be written as

S~ψ(θ,𝒟)=1ni=1nψ(Yi,θXi){Zθ(Xi)+Wθ(Xi)Wθ(Xi)}.\displaystyle\tilde{S}_{\psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}\psi(Y_{i},\theta^{\top}X_{i})\Big{\{}Z_{\theta}(X_{i})+\frac{W_{\theta}(X_{i})}{\|W_{\theta}(X_{i})\|}\Big{\}}. (2.10)
Proposition 6.

Assume there exists a constant cc such that

sup(y,s)𝒴×|ψ(y,s)s|=c.\displaystyle\sup_{(y,s)\in{\cal Y}\times\mathbb{R}}|\psi(y,s)s|=c.

Then, the estimating function S~ψ(θ,𝒟)\tilde{S}_{\psi}(\theta,{\cal D}) in (2.10) of the M-estimator θ^Ψ\hat{\theta}_{\Psi} is bounded with respect to any dataset 𝒟\cal D.

Proof.

It follows from the assumption that the constant c1=sup(y,s)𝒴×|ψ(y,s)|c_{1}=\sup_{(y,s)\in{\cal Y}\times\mathbb{R}}|\psi(y,s)| is finite, since

|ψ(y,s)|sup(y,s)𝒴×[1,1]|ψ(y,s)|+sup(y,s)𝒴×|ψ(y,s)s|.\displaystyle|\psi(y,s)|\leq\sup_{(y,s)\in{\cal Y}\times[-1,1]}|\psi(y,s)|+\sup_{(y,s)\in{\cal Y}\times\mathbb{R}}|\psi(y,s)s|.

Therefore, we observe

sup𝒟S~ψ(θ,𝒟)\displaystyle\sup_{\cal D}\|\tilde{S}_{\psi}(\theta,{\cal D})\| 1ni=1n|ψ(Yi,θXi)|{Zθ(Xi)+1}\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}|\psi(Y_{i},\theta^{\top}X_{i})|\big{\{}\|Z_{\theta}(X_{i})\|+1\big{\}}
1θsup(y,s)𝒴×|ψ(y,s)s|+sup(y,s)𝒴×|ψ(y,s)|,\displaystyle\leq\frac{1}{\|\theta\|}\sup_{(y,s)\in{\cal Y}\times\mathbb{R}}|\psi(y,s)s|+\sup_{(y,s)\in{\cal Y}\times\mathbb{R}}|\psi(y,s)|,

which is equal to c/θ+c1c/\|\theta\|+c_{1}. ∎

On the other hand, suppose another type of contamination 𝒟={(Xi,Yi)}i=1n{\cal D}^{**}=\{(X^{**}_{i},Y_{i})\}_{i=1}^{n}, where Xi=Xi+τ(Xi)Zθ(Xi)X^{**}_{i}=X_{i}+\tau(X_{i})Z_{\theta}(X_{i}) with a fixed scalar τ(Xi)\tau(X_{i}) depending on XiX_{i}. Then, L¯Ψ(θ,𝒟)\bar{L}_{\Psi}(\theta,{\cal D}^{**}) and S¯ψ(O)(θ,𝒟)\bar{S}_{\psi}^{(O)}(\theta,{\cal D}^{**}) are both strongly influenced, while S¯ψ(H)(θ,𝒟)\bar{S}_{\psi}^{(H)}(\theta,{\cal D}^{**}) is unaffected.

The ML-estimator is a standard estimator defined by maximization of the likelihood for a given dataset {(Xi,Yi)}i=1n\{(X_{i},Y_{i})\}_{i=1}^{n}. Explicitly, the negative log-likelihood function is given by

L0(θ;Λ)=1ni=1n{Yig(θXi)a(g(θXi))+c(Yi)}.\displaystyle L_{0}(\theta;\Lambda)=-\frac{1}{n}\sum_{i=1}^{n}\{Y_{i}g(\theta^{\top}X_{i})-a(g(\theta^{\top}X_{i}))+c(Y_{i})\}.

The likelihood estimating function is given by

S0(θ;Λ)=1ni=1n{Yia(g(θXi))}g(θXi)Xi.\displaystyle{S}_{0}(\theta;\Lambda)=-\frac{1}{n}\sum_{i=1}^{n}\{Y_{i}-a^{\prime}(g(\theta^{\top}X_{i}))\}g^{\prime}(\theta^{\top}X_{i})X_{i}. (2.11)

Here the regression parameter θ\theta is of our main interest. We note that the ML-estimator θ^0\hat{\theta}_{0} can be obtained without the nuisance parameter ϕ\phi even if it is unknown. In practice, there are methods for estimating ϕ\phi using the deviance and the Pearson χ2\chi^{2} statistic when ϕ\phi is unknown. The expected value of the negative log-likelihood conditional on X¯=(X1,,Xn)\underline{X}=(X_{1},...,X_{n}) is given by

𝔼[L0(θ;Λ)|X¯]=1ni=1n{a(g(θXi))g(θXi)a(g(θXi))}\displaystyle\mathbb{E}[L_{0}(\theta;\Lambda)|\underline{X}]=-\frac{1}{n}\sum_{i=1}^{n}\{a^{\prime}(g(\theta^{\top}X_{i}))g(\theta^{\top}X_{i})-a(g(\theta^{\top}X_{i}))\}

up to a constant since the conditional expectation is given by 𝔼[Y|X=x]=a(g(θx))\mathbb{E}[Y|X=x]=a^{\prime}(g(\theta^{\top}x)) due to a basic property of the exponential dispersion model (2.3).

2.3 The γ\gamma-loss function and its variants

Let us discuss the γ\gamma-divergence in the framework of the regression model, based on the discussion in the general distribution setting of the preceding section. The γ\gamma-divergence is given by

Dγ(P(|X,θ0),P(|X,θ1);Λ)=Hγ(P(|X,θ0),P(|X,θ1);Λ)Hγ(P(|X,θ0),P(|X,θ0);Λ)\displaystyle D_{\gamma}(P(\cdot|X,\theta_{0}),P(\cdot|X,\theta_{1});\Lambda)=H_{\gamma}(P(\cdot|X,\theta_{0}),P(\cdot|X,\theta_{1});\Lambda)-H_{\gamma}(P(\cdot|X,\theta_{0}),P(\cdot|X,\theta_{0});\Lambda)

with the cross entropy,

Hγ(P(|X,θ0),P(|X,θ1);Λ)=1γ𝒴p(y|X,θ0){p(γ)(y|X,θ1)}γγ+1dΛ(y).\displaystyle H_{\gamma}(P(\cdot|X,\theta_{0}),P(\cdot|X,\theta_{1});\Lambda)=-\frac{1}{\gamma}\int_{\mathcal{Y}}{p(y|X,\theta_{0})}\{{p^{(\gamma)}(y|X,\theta_{1})}\}^{\frac{\gamma}{\gamma+1}}{\rm d}\Lambda(y).

The loss function derived from the γ\gamma-divergence is

Lγ(θ;Λ)=1n1γi=1n{p(γ)(Yi|Xi,θ)}γγ+1,\displaystyle L_{\gamma}(\theta;\Lambda)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\{p^{(\gamma)}(Y_{i}|X_{i},\theta)\}^{\frac{\gamma}{\gamma+1}},

where p(γ)(y|x,θ)p^{(\gamma)}(y|x,\theta) is the γ\gamma-expression of p(y|x,θ)p(y|x,\theta), that is

p(γ)(y|x,θ)={p(y|x,θ)}γ+1{p(y~|x,θ)}γ+1dΛ(y~).\displaystyle p^{(\gamma)}(y|x,\theta)=\frac{\{p(y|x,\theta)\}^{\gamma+1}}{\int\{p(\tilde{y}|x,\theta)\}^{\gamma+1}{\rm d}\Lambda(\tilde{y})}. (2.12)

We define the γ\gamma-estimator for the parameter θ\theta by θ^γ=argminθΘLγ(θ;Λ)\hat{\theta}_{\gamma}=\mathop{\rm argmin}_{\theta\in\Theta}L_{\gamma}(\theta;\Lambda). By definition, the γ\gamma-estimator reduces to the ML-estimator in the limit as γ\gamma tends to 00.
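This limiting behavior can be illustrated numerically. The sketch below assumes a Bernoulli logistic model with a scalar covariate and the counting measure as Λ\Lambda; the dataset and the grid search are synthetic illustrations. For small γ\gamma, the grid minimizer of the γ\gamma-loss matches that of the negative log-likelihood.

```python
import math

# Sketch: gamma-loss for a Bernoulli logistic model p(1|x) = sigmoid(theta*x),
# built from the gamma-expression (2.12) under the counting measure.  For
# gamma near 0, the grid minimizer agrees with the ML grid minimizer.

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

data = [(1.0, 1), (2.0, 1), (-1.0, 0), (0.5, 1), (-2.0, 0), (1.5, 0)]

def p(y, x, th):
    p1 = sigmoid(th * x)
    return p1 if y == 1 else 1.0 - p1

def gamma_loss(th, gamma):
    total = 0.0
    for x, y in data:
        denom = p(0, x, th) ** (gamma + 1) + p(1, x, th) ** (gamma + 1)
        p_gam = p(y, x, th) ** (gamma + 1) / denom   # gamma-expression (2.12)
        total += p_gam ** (gamma / (gamma + 1))
    return -total / (gamma * len(data))

def log_loss(th):
    return -sum(math.log(p(y, x, th)) for x, y in data) / len(data)

grid = [i / 100 for i in range(-300, 301)]
th_gamma = min(grid, key=lambda th: gamma_loss(th, 1e-4))
th_ml = min(grid, key=log_loss)
print(th_gamma, th_ml)
```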

Remark 1.

Let us discuss the behavior of the γ\gamma-loss function as |γ||\gamma| becomes large in the case where the outcome YY takes values in a finite discrete set 𝒴\cal Y. For simplicity, we define the loss function as

Lγ(θ;Λ)=sign(γ)i=1n{p(γ)(Yi|Xi,θ)}γγ+1.\displaystyle L_{\gamma}(\theta;\Lambda)=-{\rm sign}(\gamma)\sum_{i=1}^{n}\{p^{(\gamma)}(Y_{i}|X_{i},\theta)\}^{\frac{\gamma}{\gamma+1}}.

Let f(x,θ)=argmaxy𝒴p(y|x,θ)f(x,\theta)=\mathop{\rm argmax}_{y\in{\cal Y}}p(y|x,\theta) and g(x,θ)=argminy𝒴p(y|x,θ)g(x,\theta)=\mathop{\rm argmin}_{y\in{\cal Y}}p(y|x,\theta). Then, the γ\gamma-expression satisfies

p()(y|x,θ)\displaystyle p^{(\infty)}(y|x,\theta) :=limγp(γ)(y|x,θ)\displaystyle:=\lim_{\gamma\rightarrow\infty}p^{(\gamma)}(y|x,\theta)
=limγ{p(y|x,θ)/maxy𝒴p(y|x,θ)}γ+1y~𝒴{p(y~|x,θ)/maxy𝒴p(y|x,θ)}γ+1\displaystyle=\lim_{\gamma\rightarrow\infty}\frac{\{p(y|x,\theta)/\max_{y^{*}\in{\cal Y}}p(y^{*}|x,\theta)\}^{\gamma+1}}{\sum_{\tilde{y}\in{\cal Y}}\{p(\tilde{y}|x,\theta)/\max_{y^{*}\in{\cal Y}}p(y^{*}|x,\theta)\}^{\gamma+1}}
=I(y=f(x,θ))\displaystyle={\rm I}(y=f(x,\theta))

Similarly,

p()(y|x,θ)\displaystyle p^{(-\infty)}(y|x,\theta) :=limγp(γ)(y|x,θ)\displaystyle:=\lim_{\gamma\rightarrow-\infty}p^{(\gamma)}(y|x,\theta)
=limγ{miny𝒴p(y|x,θ)/p(y|x,θ)}γ1y~𝒴{miny𝒴p(y|x,θ)/p(y~|x,θ)}γ1\displaystyle=\lim_{\gamma\rightarrow-\infty}\frac{\{\min_{y^{*}\in{\cal Y}}p(y^{*}|x,\theta)/p(y|x,\theta)\}^{-\gamma-1}}{\sum_{\tilde{y}\in{\cal Y}}\{\min_{y^{*}\in{\cal Y}}p(y^{*}|x,\theta)/p(\tilde{y}|x,\theta)\}^{-\gamma-1}}
=I(y=g(x,θ))\displaystyle={\rm I}(y=g(x,\theta))

Hence, L(θ;Λ)L_{\infty}(\theta;\Lambda) is equivalent to the 0-1 loss function i=1nI(Yif(Xi,θ))\sum_{i=1}^{n}{\rm I}(Y_{i}\neq f(X_{i},\theta)); while

L(θ;Λ)=i=1nI(Yi=g(Xi,θ))\displaystyle L_{-\infty}(\theta;\Lambda)=\sum_{i=1}^{n}{\rm I}(Y_{i}=g(X_{i},\theta)) (2.13)

This is the number of YiY_{i}’s equal to the worst predictor g(Xi,θ)g(X_{i},\theta). If we focus on the case 𝒴={0,1}{\cal Y}=\{0,1\}, then L(θ;Λ)L_{-\infty}(\theta;\Lambda) is nothing but the 0-1 loss function since I(y=g(x,θ))=I(yf(x,θ)){\rm I}(y=g(x,\theta))={\rm I}(y\neq f(x,\theta)). In principle, minimization of the 0-1 loss is hard due to its non-differentiability. The γ\gamma-loss function smoothly connects the log-loss and the 0-1 loss without this computational challenge. See [31, 71] for a detailed discussion of 0-1 loss optimization.
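The two limits in Remark 1 can be illustrated with a small discrete distribution (the pmf below is an arbitrary illustration): as γ\gamma grows the γ\gamma-expression concentrates on the mode f(x,θ)f(x,\theta), and as γ\gamma\to-\infty it concentrates on the anti-mode g(x,θ)g(x,\theta).

```python
# Numerical illustration of the limits of the gamma-expression for a 3-point
# pmf on Y = {0, 1, 2}: large positive gamma picks out the mode, large
# negative gamma picks out the anti-mode.

p = [0.5, 0.3, 0.2]  # a conditional pmf, mode at y=0, anti-mode at y=2

def p_gamma(p, gamma):
    w = [q ** (gamma + 1) for q in p]
    s = sum(w)
    return [x / s for x in w]

big = p_gamma(p, 60.0)      # approximates the indicator of the mode
small = p_gamma(p, -60.0)   # approximates the indicator of the anti-mode
print([round(q, 4) for q in big], [round(q, 4) for q in small])
```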

In the subsequent discussion, the γ\gamma-expression will play an important role in clarifying the statistical properties of the γ\gamma-estimator. In fact, the γ\gamma-expression function is a counterpart of the log model function logp(y|x,θ)\log p(y|x,\theta) in L0(θ;Λ)L_{0}(\theta;\Lambda). We note, as one of the most basic properties, that

1γ𝔼0[{p(γ)(Y|X,θ)}γγ+1|X=x]=Hγ(P(|x,θ0),P(|x,θ);Λ),\displaystyle-\frac{1}{\gamma}\mathbb{E}_{0}\Big{[}\{p^{(\gamma)}(Y|X,\theta)\}^{\frac{\gamma}{\gamma+1}}|X=x\Big{]}=H_{\gamma}(P(\cdot|x,\theta_{0}),P(\cdot|x,\theta);\Lambda), (2.14)

Equation (2.14) yields

𝔼0[Lγ(θ;Λ)|X¯]=1ni=1nHγ(P(|Xi,θ0),P(|Xi,θ);Λ).\displaystyle\mathbb{E}_{0}[L_{\gamma}(\theta;\Lambda)|\underline{X}]=\frac{1}{n}\sum_{i=1}^{n}H_{\gamma}(P(\cdot|X_{i},\theta_{0}),P(\cdot|X_{i},\theta);\Lambda).

Hence,

𝔼0[Lγ(θ;Λ)|X¯]𝔼0[Lγ(θ0;Λ)|X¯]=1ni=1nDγ(P(|Xi,θ0),P(|Xi,θ);Λ).\displaystyle\mathbb{E}_{0}[L_{\gamma}(\theta;\Lambda)|\underline{X}]-\mathbb{E}_{0}[L_{\gamma}(\theta_{0};\Lambda)|\underline{X}]=\frac{1}{n}\sum_{i=1}^{n}D_{\gamma}(P(\cdot|X_{i},\theta_{0}),P(\cdot|X_{i},\theta);\Lambda). (2.15)

This implies

θ0=argminθΘ𝔼0[Lγ(θ;Λ)|X¯].\displaystyle\theta_{0}=\mathop{\rm argmin}_{\theta\in\Theta}\mathbb{E}_{0}[L_{\gamma}(\theta;\Lambda)|\underline{X}].

Thus, by a discussion similar to that for the ML-estimator and the KL-divergence, we observe that θ^γ\hat{\theta}_{\gamma} is consistent for θ0\theta_{0}. The γ\gamma-estimating function is defined by

Sγ(θ;Λ)=θLγ(θ;Λ).\displaystyle{S}_{\gamma}(\theta;\Lambda)=\frac{\partial}{\partial\theta}L_{\gamma}(\theta;\Lambda).

Then, we have a basic property that the γ\gamma-estimating function should satisfy in general.

Proposition 7.

The true value of the parameter is the solution of the expected γ\gamma-estimating equation under the expectation of the true distribution. That is, if θ=θ0\theta=\theta_{0},

𝔼0[Sγ(θ;Λ)|X¯]=0,\displaystyle\mathbb{E}_{0}[{S}_{\gamma}(\theta;\Lambda)|\underline{X}]=0, (2.16)

where 𝔼0\mathbb{E}_{0} is the conditional expectation under the true distribution P(|Xi,θ0)P(\cdot|X_{i},\theta_{0})’s given X¯\underline{X}.

Proof.

By definition,

Sγ(θ;Λ)=1n1γ+1i=1n{p(γ)(Yi|Xi,θ)}1γ+1θp(γ)(Yi|Xi,θ).\displaystyle{S}_{\gamma}(\theta;\Lambda)=-\frac{1}{n}\frac{1}{\gamma+1}\sum_{i=1}^{n}\{p^{(\gamma)}(Y_{i}|X_{i},\theta)\}^{-\frac{1}{\gamma+1}}\frac{\partial}{\partial\theta}p^{(\gamma)}(Y_{i}|X_{i},\theta).

Here we note

{p(γ)(Yi|Xi,θ)}1γ+1=1p(Yi|Xi,θ)\displaystyle\{p^{(\gamma)}(Y_{i}|X_{i},\theta)\}^{-\frac{1}{\gamma+1}}=\frac{1}{p(Y_{i}|X_{i},\theta)}

up to a proportionality constant. Hence,

𝔼0[Sγ(θ;Λ)|X¯]i=1n𝒴p(y|Xi,θ0)p(y|Xi,θ)θp(γ)(y|Xi,θ)dΛ(y).\displaystyle\mathbb{E}_{0}[{S}_{\gamma}(\theta;\Lambda)|\underline{X}]\propto\sum_{i=1}^{n}\int_{\mathcal{Y}}\frac{p(y|X_{i},\theta_{0})}{p(y|X_{i},\theta)}\frac{\partial}{\partial\theta}p^{(\gamma)}(y|X_{i},\theta){\rm d}\Lambda(y).

If θ=θ0\theta=\theta_{0}, then this vanishes identically due to the total mass one of p(γ)(y|Xi,θ0)p^{(\gamma)}(y|X_{i},\theta_{0}). ∎

The γ\gamma-estimator θ^γ\hat{\theta}_{\gamma} is a solution of the estimating equation, while the true value θ0\theta_{0} is the solution of the expected estimating equation under the true distribution with θ0\theta_{0}. As before, this shows the consistency of the γ\gamma-estimator for the true value of the parameter. The γ\gamma-estimating function Sγ(θ;Λ){S}_{\gamma}(\theta;\Lambda) is said to be unbiased in the sense of (2.16). Such an unbiasedness property leads to the consistency of the estimator. However, if the underlying distribution is misspecified, then we have to evaluate the expectation in (2.16) under the misspecified distribution rather than the true distribution. Thus, the unbiasedness property generally breaks down, and the Euclidean norm of the estimating function may diverge in the worst case. We will investigate such behaviors in misspecified situations later.

Now, we consider the MDEs via the GM and HM divergences introduced in Chapter 1. First, consider the loss function defined by the GM-divergence:

LGM(θ;R)=1ni=1nr(Yi)p(Yi|Xi,θ)exp{logp(y|Xi,θ)𝑑R(y)},\displaystyle L_{\rm GM}(\theta;R)=\frac{1}{n}\sum_{i=1}^{n}\frac{r(Y_{i})}{p(Y_{i}|X_{i},\theta)}\exp\Big{\{}\int\log p(y|X_{i},\theta)dR(y)\Big{\}},

where RR is the reference probability measure in 𝒫(x){\mathcal{P}}(x). We define as θ^GM=argminθΘLGM(θ;R)\hat{\theta}_{\rm GM}=\mathop{\rm argmin}_{\theta\in\Theta}L_{\rm GM}(\theta;R), which we refer to as the GM-estimator. The GM\rm GM-estimating equation is given by

SGM(θ;R)\displaystyle{S}_{\rm GM}(\theta;R) :=1ni=1nr(Yi)p(Yi|Xi,θ)exp{logp(y|Xi,θ)𝑑R(y)}\displaystyle:=\frac{1}{n}\sum_{i=1}^{n}\frac{r(Y_{i})}{p(Y_{i}|X_{i},\theta)}\exp\Big{\{}\int\log p(y|X_{i},\theta)dR(y)\Big{\}}
×{S(Yi|Xi,θ)S(y|Xi,θ)dR(y)}=0,\displaystyle\times\Big{\{}S(Y_{i}|X_{i},\theta)-\int S(y|X_{i},\theta)dR(y)\Big{\}}=0, (2.17)

where S(y|x,θ)=(/θ)logp(y|x,θ)S(y|x,\theta)=(\partial/\partial\theta)\log p(y|x,\theta). Secondly, consider the loss function defined by the HM-divergence:

LHM(θ)=12ni=1n{p(2)(Yi|Xi,θ)}2.\displaystyle L_{\rm HM}(\theta)=\frac{1}{2n}\sum_{i=1}^{n}\{p^{(-2)}(Y_{i}|X_{i},\theta)\}^{2}.

The (2)(-2)-model can be viewed as an inverse-probability-weighted model on account of

p(2)(y|x,θ)=1p(y|x,θ)j=0k1p(j|x,θ)\displaystyle p^{(-2)}(y|x,\theta)=\frac{\small\displaystyle\frac{1}{p(y|x,\theta)}}{\displaystyle{\large\mbox{$\sum_{j=0}^{k}$}}\ \frac{1}{p(j|x,\theta)}}

We define the HM estimator by θ^HM=argminθΘLHM(θ)\hat{\theta}_{\rm HM}=\mathop{\rm argmin}_{\theta\in\Theta}L_{\rm HM}(\theta). The HM\rm HM-estimating equation is given by

SHM(θ)=1ni=1np(2)(Yi|Xi,θ)θp(2)(Yi|Xi,θ)\displaystyle{S}_{\rm HM}(\theta)=\frac{1}{n}\sum_{i=1}^{n}p^{(-2)}(Y_{i}|X_{i},\theta)\frac{\partial}{\partial\theta}p^{(-2)}(Y_{i}|X_{i},\theta)

We note from the discussion in Section 2 that LGM(θ;R)L_{\rm GM}(\theta;R) and SGM(θ;R){S}_{\rm GM}(\theta;R) are equal to Lγ(θ;R)L_{\gamma}(\theta;R) and Sγ(θ;R){S}_{\gamma}(\theta;R) with γ=1\gamma=-1; LHM(θ)L_{\rm HM}(\theta) and SHM(θ){S}_{\rm HM}(\theta) are equal to Lγ(θ;C)L_{\gamma}(\theta;C) and Sγ(θ;C){S}_{\gamma}(\theta;C) with γ=2\gamma=-2. We will discuss the dependence on the reference measure RR, where we would like to elucidate which choice of RR gives a reasonable performance in the presence of possible model misspecification.

We focus on the GLM framework, in which we look into the formula for the γ\gamma-divergence. Careful attention should then be paid to the choice of the reference measure in the γ\gamma-divergence. The original reference measure Λ\Lambda is changed to RR such that R/Λ(y)=exp{c(y)}\partial R/\partial\Lambda(y)=\exp\{c(y)\}. Hence, the model is given by p(y|x,ω)=exp{yωa(ω)}p(y|x,\omega)=\exp\{y\omega-a(\omega)\} with respect to RR. We note that RR is a probability measure since its RN-derivative is equal to p(y|x,θ)p(y|x,\theta) defined in (2.3) when θ\theta is a zero vector. This makes the model more mathematically tractable and allows us to use standard statistical methods for estimation and inference. Then, the γ\gamma-expression for p(y|x,ω)p(y|x,\omega) is given by

p(γ)(y|x,ω)=p(y|x,(γ+1)ω).\displaystyle p^{(\gamma)}(y|x,\omega)=p(y|x,(\gamma+1)\omega). (2.18)

This reflexiveness property is convenient for the analysis based on the γ\gamma-divergence. First of all, the γ\gamma-loss function is given by

Lγ(θ;R)=1n1γi=1nexp{γYig(θXi)γγ+1a((γ+1)g(θXi))}\displaystyle L_{\gamma}(\theta;R)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\exp\Big{\{}\gamma\,Y_{i}\,g(\theta^{\top}X_{i})-\frac{\gamma}{\gamma+1}a\big{(}(\gamma+1)g(\theta^{\top}X_{i})\big{)}\Big{\}} (2.19)

due to the γ\gamma-expression (2.18). The γ\gamma-estimating function is given by

Sγ(θ;R)=\displaystyle{S}_{\gamma}(\theta;R)= 1ni=1nexp{γYig(θXi)γγ+1a((γ+1)g(θXi))}\displaystyle\frac{1}{n}\sum_{i=1}^{n}\exp\Big{\{}\gamma\,Y_{i}\,g(\theta^{\top}X_{i})-\frac{\gamma}{\gamma+1}a\big{(}(\gamma+1)g(\theta^{\top}X_{i})\big{)}\Big{\}} (2.20)
×{Yia((γ+1)g(θXi))}g(θXi)Xi.\displaystyle\times\{Y_{i}-a^{\prime}\big{(}(\gamma+1)g(\theta^{\top}X_{i})\big{)}\}g^{\prime}(\theta^{\top}X_{i})X_{i}.

We note that the change of the reference measure from Λ\Lambda to RR is the key for the minimum γ\gamma-divergence estimation. In fact, the γ\gamma-loss function would not have a closed form such as (2.19) unless the reference measure were changed. Here, we remark that the γ\gamma-loss function is a specific example of the M-type loss function L¯Ψ(θ)\bar{L}_{\Psi}(\theta) in (2.4) with the relationship

Ψ(y,s)=1γexp{γyg(s)γγ+1a((γ+1)g(s))}.\displaystyle\Psi(y,s)=-\frac{1}{\gamma}\exp\Big{\{}\gamma yg(s)-\frac{\gamma}{\gamma+1}a\big{(}(\gamma+1)g(s)\big{)}\Big{\}}.

The expected γ\gamma loss function is given by

𝔼0[Lγ(θ;R)|X¯]=1ni=1nexp{a(γg(θXi))γγ+1a((γ+1)g(θXi))},\displaystyle\mathbb{E}_{0}[L_{\gamma}(\theta;R)|\underline{X}]=-\frac{1}{n}\sum_{i=1}^{n}\exp\Big{\{}a\big{(}\gamma g(\theta^{\top}X_{i})\big{)}-\frac{\gamma}{\gamma+1}a\big{(}(\gamma+1)g(\theta^{\top}X_{i})\big{)}\Big{\}},

where 𝔼0\mathbb{E}_{0} denotes the expectation under the true distribution P(|x,θ0)P(\cdot|x,\theta_{0}). This function attains a global minimum at θ=θ0\theta=\theta_{0}, as discussed around (2.15) in the general framework. Similarly, the GM-loss function is written as

LGM(θ;R)=1ni=1nr(Yi)exp{(YiμR)g(θXi)+a(g(θXi))},\displaystyle L_{\rm GM}(\theta;R)=\frac{1}{n}\sum_{i=1}^{n}{r(Y_{i})}\exp\big{\{}-(Y_{i}-\mu_{R})\,g(\theta^{\top}X_{i})+a\big{(}g(\theta^{\top}X_{i})\big{)}\big{\}},

where μR=yr(y)dΛ(y)\mu_{R}=\int yr(y){\rm d}\Lambda(y). The HM loss function is written as

LHM(θ)=12ni=1nexp{2Yig(θXi)2a(g(θXi))}.\displaystyle L_{\rm HM}(\theta)=\frac{1}{2n}\sum_{i=1}^{n}\exp\{-2Y_{i}\,g(\theta^{\top}X_{i})-2a\big{(}\!\!-\!g(\theta^{\top}X_{i})\big{)}\}.

since the γ\gamma-expression becomes p(γ)(y|x,θ)=exp{yg(θx)a(g(θx))}p^{(\gamma)}(y|x,\theta)=\exp\{-y\,g(\theta^{\top}x)-a\big{(}\!\!-\!g(\theta^{\top}x)\big{)}\} when γ=2\gamma=-2. In accordance with these, all the formulas for the loss functions defined in the general model (2.1) carry over naturally to the GLM. Subsequently, we proceed to specific GLMs to discuss deeper properties.
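The γ\gamma-expression identity (2.18) can be checked exactly in the simplest case. The sketch below assumes a Bernoulli model written with respect to the counting measure (so c(y)=0c(y)=0 and RR is the counting measure), with p(y|ω)=exp{yωa(ω)}p(y|\omega)=\exp\{y\omega-a(\omega)\} and a(ω)=log(1+eω)a(\omega)=\log(1+e^{\omega}).

```python
import math

# Check of (2.18) for the Bernoulli model: normalizing p^(gamma+1) reproduces
# p(y | (gamma+1) * w) exactly.

def p(y, w):
    return math.exp(y * w - math.log(1.0 + math.exp(w)))

w, gamma = 0.8, 1.7
denom = p(0, w) ** (gamma + 1) + p(1, w) ** (gamma + 1)
lhs = [p(y, w) ** (gamma + 1) / denom for y in (0, 1)]  # gamma-expression
rhs = [p(y, (gamma + 1) * w) for y in (0, 1)]           # tilted model
print(lhs, rhs)
```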

We have discussed the generalization of the γ\gamma-divergence in the preceding section. The generalized divergence DV(P,Q;Λ)D_{V}(P,Q;\Lambda) defined in (1.18) in Chapter 1 yields the loss function

LV(θ;Λ)=1ni=1nV(v(z(θ,Xi)p(Yi|Xi,θ))),\displaystyle L_{V}(\theta;\Lambda)=-\frac{1}{n}\sum_{i=1}^{n}V(v^{*}(z(\theta,X_{i})p(Y_{i}|X_{i},\theta))),

where z(θ,Xi)z(\theta,X_{i}) is a normalizing factor satisfying

𝒴v(z(θ,Xi)p(y|Xi,θ))dΛ(y)=1.\displaystyle\int_{\mathcal{Y}}v^{*}(z(\theta,X_{i})p(y|X_{i},\theta)){\rm d}\Lambda(y)=1. (2.21)

A discussion similar to the above yields

𝔼0[LV(θ;Λ)|X¯]𝔼0[LV(θ0;Λ)|X¯]=1ni=1nDV(P(|Xi,θ0),P(|Xi,θ);Λ).\displaystyle\mathbb{E}_{0}[L_{V}(\theta;\Lambda)|\underline{X}]-\mathbb{E}_{0}[L_{V}(\theta_{0};\Lambda)|\underline{X}]=\frac{1}{n}\sum_{i=1}^{n}D_{V}(P(\cdot|X_{i},\theta_{0}),P(\cdot|X_{i},\theta);\Lambda).

The estimating function is written as

SV(θ;Λ)=1ni=1n1z(θ,Xi)p(Yi|Xi,θ)θv(z(θ,Xi)p(Yi|Xi,θ)).\displaystyle{S}_{V}(\theta;\Lambda)=-\frac{1}{n}\sum_{i=1}^{n}\frac{1}{z(\theta,X_{i})p(Y_{i}|X_{i},\theta)}\frac{\partial}{\partial\theta}v^{*}(z(\theta,X_{i})p(Y_{i}|X_{i},\theta)).

due to assumption (1.19). This implies

𝔼0[SV(θ0;Λ)|X¯]=1ni=1n1z(θ0,Xi)θv(z(θ0,Xi)p(y|Xi,θ))dΛ(y)\displaystyle\mathbb{E}_{0}[{S}_{V}(\theta_{0};\Lambda)|\underline{X}]=-\frac{1}{n}\sum_{i=1}^{n}\frac{1}{z(\theta_{0},X_{i})}\int\frac{\partial}{\partial\theta}v^{*}(z(\theta_{0},X_{i})p(y|X_{i},\theta)){\rm d}\Lambda(y)

which vanishes since each v(z(θ0,Xi)p(y|Xi,θ))v^{*}(z(\theta_{0},X_{i})p(y|X_{i},\theta)) has total mass one as in (2.21). Consequently, we can derive the MD estimator based on the generalized divergence DV(P,Q;Λ)D_{V}(P,Q;\Lambda), with the γ\gamma-divergence as the standard case. In Chapter 3, we will consider another candidate of DV(P,Q;Λ)D_{V}(P,Q;\Lambda) for estimation under a Poisson point process model.

2.4 Normal linear regression

Linear regression, one of the most familiar and widely used statistical techniques, dates back to the 19th century in the mathematical formulation by Carl Friedrich Gauss [96]. The term originated from Francis Galton’s eminent observation on regression towards the mean at the beginning of the 20th century. The ordinary least squares method has since evolved with the advancement of statistical theory and computational methods. As the application of linear regression expanded, statisticians recognized its sensitivity to outliers. Outliers can significantly influence the regression model’s estimates, leading to misleading results. To address these limitations, robust regression methods were developed. These methods aim to provide estimates that are less affected by outliers or violations of model assumptions such as normality of errors or homoscedasticity.

Let YY be an outcome variable in \mathbb{R} and XX be a covariate vector in a subset 𝒳{\mathcal{X}} of d\mathbb{R}^{d}. Assume the conditional probability density function (pdf) of YY given X=xX=x as

p(y|X=x,θ,σ2)=12πσ2exp{12(yθx)2σ2}.\displaystyle p(y|X=x,\theta,\sigma^{2})=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\Big{\{}-\small\mbox{$\frac{1}{2}$}\frac{(y-\theta^{\top}x)^{2}}{\sigma^{2}}\Big{\}}. (2.22)

The normal linear regression model (2.22) is one of the simplest examples of a GLM, with the identity link function and dispersion parameter σ\sigma. Indeed, σ\sigma is a crucial parameter for assessing model fit; we will discuss its estimation later. The KL-divergence between normal distributions is given by

D0(𝙽𝚘𝚛(μ0,σ02),𝙽𝚘𝚛(μ1,σ12))=12(μ1μ0)2σ12+12(σ02σ12logσ02σ121).\displaystyle D_{0}({\tt Nor}(\mu_{0},\sigma_{0}^{2}),{\tt Nor}(\mu_{1},\sigma_{1}^{2}))=\small\mbox{$\frac{1}{2}$}\frac{(\mu_{1}-\mu_{0})^{2}}{\sigma_{1}^{2}}+\small\mbox{$\frac{1}{2}$}\Big{(}\frac{\sigma_{0}^{2}}{\sigma_{1}^{2}}-\log\frac{\sigma_{0}^{2}}{\sigma_{1}^{2}}-1\Big{)}.

For a given dataset (Xi,Yi)i=1n{(X_{i},Y_{i})}_{i=1}^{n}, the negative log-likelihood function is as follows:

\displaystyle L_{0}(\theta)=\small\mbox{$\frac{1}{2}$}\frac{1}{n}\sum_{i=1}^{n}\Big\{\frac{(Y_{i}-\theta^{\top}X_{i})^{2}}{\sigma^{2}}+\log(2\pi\sigma^{2})\Big\}.

The estimating function for θ\theta is

S0(θ)=1n1σ2i=1n(YiθXi)Xi,\displaystyle{S}_{0}(\theta)=\frac{1}{n}\frac{1}{\sigma^{2}}\sum_{i=1}^{n}(Y_{i}-\theta^{\top}X_{i})X_{i},

where σ2\sigma^{2} is assumed to be known; in practice, when σ2\sigma^{2} is unknown, it is estimated as discussed later. Equating the estimating function to zero gives the likelihood equation, whose solution, the ML-estimator, is nothing but the least squares estimator. This is a well-known element of statistics with a wide range of applications, for which several standard tools for assessing model fit and diagnostics have been established.
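As a quick illustration, solving the likelihood equation above amounts to ordinary least squares; a minimal sketch in Python (all data and names here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))                # covariate vectors
theta_true = np.array([0.5, 1.5, 1.0])
Y = X @ theta_true + rng.normal(size=n)    # normal linear model with sigma = 1

# Equating S_0(theta) to zero gives the normal equations X'X theta = X'Y,
# i.e. the least squares estimator.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
sigma2_hat = np.mean((Y - X @ theta_hat) ** 2)  # ML estimate of the variance
```

With moderate sample size, `theta_hat` recovers the true coefficients up to sampling error.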

On the other hand, robust regression methods aim to provide estimates that are less affected by outliers or violations of model assumptions like normality of errors. The key is the introduction of M-estimators, which generalize maximum likelihood estimators. They work by minimizing a sum of a loss function applied to the residuals. The choice of the loss function (such as Huber's winsorized loss or Tukey's biweight loss [3]) determines the robustness and efficiency of the estimator. The M-estimator, θ^Ψ\hat{\theta}_{\Psi}, of a parameter θ\theta is obtained by minimizing an objective function, typically a sum of Ψ\Psi's applied to the standardized residuals:

θ^Ψ=argminθdi=1nΨ(YiθXiσ).\displaystyle\hat{\theta}_{\Psi}=\mathop{\rm argmin}_{\theta\in\mathbb{R}^{d}}\sum_{i=1}^{n}\Psi\Big{(}\frac{Y_{i}-\theta^{\top}X_{i}}{\sigma}\Big{)}. (2.23)

The estimating equation is given by

i=1nψ(YiθXiσ)Xi=0,\displaystyle\sum_{i=1}^{n}\psi\Big{(}\frac{Y_{i}-\theta^{\top}X_{i}}{\sigma}\Big{)}X_{i}=0,

where ψ(r)=(/r)Ψ(r)\psi(r)=(\partial/\partial r)\Psi(r). Here are typical examples:
(1). Quadratic loss: Ψ(r)=r2\Psi(r)=r^{2}, which is equivalent to the log-likelihood function
(2). Huber’s loss: Ψ(r)={12r2for |r|kk(|r|12k)for |r|>k\Psi(r)=\left\{\begin{array}[]{lc}\small\mbox{$\frac{1}{2}$}r^{2}&\text{for }|r|\leq k\\[5.69054pt] k(|r|-\small\mbox{$\frac{1}{2}$}k)&\text{for }|r|>k\end{array}\right.
(3). Tukey’s loss: Ψ(r)={c26(1[1(rc)2]3)for |r|cc26for |r|>c,\Psi(r)=\left\{\begin{array}[]{lc}\frac{c^{2}}{6}\left(1-\left[1-\left(\frac{r}{c}\right)^{2}\right]^{3}\right)&\text{for }|r|\leq c\\[8.53581pt] \frac{c^{2}}{6}&\text{for }|r|>c\end{array}\right.,
where kk and cc are hyperparameters.
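The three losses above can be written compactly through their derivatives ψ\psi; a small sketch (the hyperparameter defaults are common illustrative choices, not prescribed by the text):

```python
import numpy as np

def psi_quadratic(r):
    # derivative of the quadratic loss is proportional to r: the non-robust case
    return r

def psi_huber(r, k=1.345):
    # identity inside [-k, k], clipped to +/- k outside: bounded but not redescending
    return np.clip(r, -k, k)

def psi_tukey(r, c=4.685):
    # biweight: redescending, exactly zero for |r| > c
    return np.where(np.abs(r) <= c, r * (1.0 - (r / c) ** 2) ** 2, 0.0)
```

A large residual contributes fully under the quadratic loss, is capped under Huber's loss, and is discarded entirely under Tukey's loss.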

We return to the discussion of the γ\gamma-estimator. The γ\gamma-divergence is given by

Dγ(𝙽𝚘𝚛(μ0,σ2),𝙽𝚘𝚛(μ1,σ2))=cγ(σ2)[exp{12γγ+1(μ1μ0)2σ2}1]γγ+1,\displaystyle D_{\gamma}({\tt Nor}(\mu_{0},\sigma^{2}),{\tt Nor}(\mu_{1},\sigma^{2}))=c_{\gamma}(\sigma^{2}){}^{\frac{\gamma}{\gamma+1}}\Big{[}\exp\Big{\{}-\small\mbox{$\frac{1}{2}$}\frac{\gamma}{\gamma+1}\frac{(\mu_{1}-\mu_{0})^{2}}{\sigma^{2}}\Big{\}}-1\Big{]},

where cγ=(γ+1)121γ+1c_{\gamma}=(\gamma+1)^{\small\mbox{$\frac{1}{2}$}\frac{1}{\gamma+1}}. The γ\gamma-expression of the normal linear model is given by

\displaystyle p^{(\gamma)}(y|x,\theta)=p_{0}(y,\theta^{\top}x,\sigma^{2}/(\gamma+1)),

where p0(y,μ,σ2)p_{0}(y,\mu,\sigma^{2}) is a normal density function with mean μ\mu and variance σ2\sigma^{2}. Hence, the γ\gamma-loss function is given by

\displaystyle L_{\gamma}(\theta)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\{p_{0}(Y_{i},\theta^{\top}X_{i},\sigma^{2}/(\gamma+1))\}^{\frac{\gamma}{\gamma+1}},

which is written as

\displaystyle-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\exp\Big\{-\small\mbox{$\frac{1}{2}$}\gamma\frac{(Y_{i}-\theta^{\top}X_{i})^{2}}{\sigma^{2}}-\small\mbox{$\frac{1}{2}$}\frac{\gamma}{\gamma+1}\log(2\pi\sigma^{2})\Big\} (2.24)

up to a scalar multiple. Consequently, the γ\gamma-loss function is a specific example of the Ψ\Psi-loss function in (2.23), viewed with \Psi(r)\propto-(1/\gamma)\exp(-\small\mbox{$\frac{1}{2}$}\gamma r^{2}). We note that the γ\gamma-estimator is thus one of the M-estimators. The γ\gamma-estimating function is defined as Sγ(θ)=1ni=1nSγ(Xi,Yi,θ),{S}_{\gamma}(\theta)=\frac{1}{n}\sum_{i=1}^{n}{S}_{\gamma}(X_{i},Y_{i},\theta), where the score function is defined by

\displaystyle{S}_{\gamma}(x,y,\theta)=(2\pi\sigma^{2})^{-\small\mbox{$\frac{1}{2}$}\frac{\gamma}{\gamma+1}}\exp\Big\{-\small\mbox{$\frac{1}{2}$}\gamma\frac{(y-\theta^{\top}x)^{2}}{\sigma^{2}}\Big\}\frac{y-\theta^{\top}x}{\sigma^{2}}x. (2.25)

As an M-estimator, the generator function is \psi(r,\gamma)=r\exp(-\small\mbox{$\frac{1}{2}$}\gamma r^{2}). Fig 2.1 displays the plots of the generator functions:
(1). γ\gamma-loss, ψ(r,γ)=rexp(12γr2)\psi(r,\gamma)=r\exp(-\small\mbox{$\frac{1}{2}$}\gamma r^{2}),
(2). Huber’s loss, \psi(r,k)={\mathbb{I}}(|r|\leq k)\,r+{\mathbb{I}}(|r|>k)\,k\,{\rm sign}(r),
(3). Tukey’s loss, \psi(r,c)={\mathbb{I}}(|r|\leq c)\,r\{1-(r/c)^{2}\}^{2}.
It is observed that the generator functions of the γ\gamma-loss and Tukey's loss are both redescending. This means that the influence of each data point on the estimation decreases to zero beyond a certain threshold, effectively eliminating the impact of extreme outliers. Unlike the quadratic and Huber's loss functions, such redescending loss functions are non-convex. This characteristic makes them more robust but also introduces challenges in optimization, as it can lead to multiple local minima.
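Since the γ\gamma-estimating equation weights each residual by exp(-\frac{1}{2}\gamma r^{2}/\sigma^{2}), it can be solved by iteratively reweighted least squares; a minimal sketch assuming σ2\sigma^{2} known (the function name and iteration count are our choices):

```python
import numpy as np

def gamma_regression(X, Y, gamma=0.3, sigma2=1.0, n_iter=50):
    """Fixed-point iteration for sum_i w_i (Y_i - theta'X_i) X_i = 0
    with w_i = exp(-gamma (Y_i - theta'X_i)^2 / (2 sigma2))."""
    theta = np.linalg.solve(X.T @ X, X.T @ Y)   # start from least squares
    for _ in range(n_iter):
        r = Y - X @ theta
        w = np.exp(-0.5 * gamma * r ** 2 / sigma2)
        XW = X * w[:, None]
        # weighted least squares step: X'WX theta = X'WY
        theta = np.linalg.solve(X.T @ XW, XW.T @ Y)
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
theta_true = np.array([1.5, 1.0])
Y = X @ theta_true + rng.normal(size=500)
theta_gamma = gamma_regression(X, Y)
```

On clean data all weights stay close to one and the iteration essentially reproduces least squares; outlying residuals receive exponentially small weights.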

Refer to caption
Figure 2.1: Plots of the generator functions

The variance parameter σ2\sigma^{2} in the normal regression model is referred to as a dispersion parameter in GLM. In a situation where σ2\sigma^{2} is unknown the likelihood method is similar to the known case. The ML-estimator for σ2\sigma^{2} is derived by

\displaystyle\hat{\sigma}^{2}=\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\hat{\theta}_{0}^{\top}X_{i})^{2},

plugging in θ^0\hat{\theta}_{0} for θ\theta. Alternatively, the γ\gamma-estimator for (θ,σ2)(\theta,\sigma^{2}) is derived as the solution of the joint estimating equation combining

σ2=γ+1ni=1nexp{12γ(YiθXi)2σ2}(YiθXi)2\displaystyle\sigma^{2}=\frac{\gamma+1}{n}\sum_{i=1}^{n}\exp\Big{\{}-\small\mbox{$\frac{1}{2}$}{\gamma}\frac{(Y_{i}-\theta^{\top}X_{i})^{2}}{\sigma^{2}}\Big{\}}(Y_{i}-\theta^{\top}X_{i})^{2}

with the estimating equation for θ\theta. Similarly, we can find that the boundedness property for the γ\gamma-score function for σ2\sigma^{2} holds.

Let us apply the geometric discussion associated with the decision boundary HθH_{\theta} in (2.6) to the normal regression model. We write the estimating function of M-estimator in (2.23) as

Sψ(θ,𝒟)=i=1nψ(YiθXiσ0)Xi\displaystyle S_{\psi}(\theta,{\cal D})=\sum_{i=1}^{n}\psi\Big{(}\frac{Y_{i}-\theta^{\top}X_{i}}{\sigma_{0}}\Big{)}X_{i}

for a given dataset {\cal D}=\{(X_{i},Y_{i})\}_{i=1}^{n}. Due to the orthogonal decomposition of XX, the estimating function is also decomposed into a sum of the orthogonal and horizontal components, S¯ψ(O)(θ,𝒟)+S¯ψ(H)(θ,𝒟)\bar{S}^{\rm(O)}_{\psi}(\theta,{\cal D})+\bar{S}^{\rm(H)}_{\psi}(\theta,{\cal D}), where

S¯ψ(O)(θ,𝒟)=1ni=1nψ(YiθXiσ0)Zθ(Xi),S¯ψ(H)(θ,𝒟)=1ni=1nψ(YiθXiσ0)Wθ(Xi).\displaystyle\bar{S}^{\rm(O)}_{\psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}\psi\Big{(}\frac{Y_{i}-\theta^{\top}X_{i}}{\sigma_{0}}\Big{)}Z_{\theta}(X_{i}),\ \ \bar{S}^{\rm(H)}_{\psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}\psi\Big{(}\frac{Y_{i}-\theta^{\top}X_{i}}{\sigma_{0}}\Big{)}W_{\theta}(X_{i}).

We note that this decomposition is the same as that for the GLM in Section . We consider a specific type of contamination in the covariate space 𝒳\cal X such that 𝒟={(Xi,Yi)}i=1n{\cal D}^{*}=\{(X^{*}_{i},Y_{i})\}_{i=1}^{n}, where Xi=Xi+σ(Xi)Wθ(Xi)X^{*}_{i}=X_{i}+\sigma(X_{i})W_{\theta}(X_{i}) with a fixed scalar σ(Xi)\sigma(X_{i}) depending on XiX_{i}. As in the discussion for the general setting of the GLM, L¯Ψ(θ,𝒟)\bar{L}_{\Psi}(\theta,{\cal D}^{*}) and S¯ψ(O)(θ,𝒟)\bar{S}_{\psi}^{(O)}(\theta,{\cal D}^{*}) are both strongly affected by this contamination, while S¯ψ(H)(θ,𝒟)\bar{S}_{\psi}^{(\rm H)}(\theta,{\cal D}^{*}) is unaffected. Let us investigate a preferable property of the γ\gamma-estimator by applying the decomposition formula above.

Proposition 8.

Let Sγ(x,y,θ){S}_{\gamma}(x,y,\theta) be the γ\gamma-score function defined in (2.25). Then,

supx𝒳d(Sγ(x,y,θ),Hθ)<\displaystyle\sup_{x\in{\mathcal{X}}}d({S}_{\gamma}(x,y,\theta),H_{\theta})<\infty (2.26)

for any fixed yy of \mathbb{R} and any γ>0\gamma>0, where dd is the Euclidean distance.

Proof.

It is written that

d(Sγ(x,y,θ),Hθ)=exp(12γz2)|z(zy)|,\displaystyle d({S}_{\gamma}(x,y,\theta),H_{\theta})=\exp(-\small\mbox{$\frac{1}{2}$}\gamma z^{2})|z(z-y)|,

where z=yθxz=y-\theta^{\top}x. Therefore,

supx𝒳d(Sγ(x,y,θ),Hθ)exp(12γz2)(z2+|yz|),\displaystyle\sup_{x\in{\mathcal{X}}}d({S}_{\gamma}(x,y,\theta),H_{\theta})\leq\exp(-\small\mbox{$\frac{1}{2}$}{\gamma}z^{2})(z^{2}+|yz|),

which is bounded by

supz>0z2exp(12γz2)+|y|supz>0|z|exp(12γz2).\displaystyle\sup_{z>0}z^{2}\exp(-\small\mbox{$\frac{1}{2}$}{\gamma}z^{2})+|y|\sup_{z>0}|z|\exp(-\small\mbox{$\frac{1}{2}$}{\gamma}z^{2}).

This is simplified as

\displaystyle\frac{2}{\gamma}\exp(-1)+|y|\frac{1}{\sqrt{\gamma}}\exp\Big(-\frac{1}{2}\Big).

Therefore, (2.26) is concluded for the fixed yy. ∎
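The two suprema used in the proof are elementary calculus facts: z^{2}e^{-\gamma z^{2}/2} is maximized at z=\sqrt{2/\gamma} and ze^{-\gamma z^{2}/2} at z=1/\sqrt{\gamma}. A quick numerical check on a grid (the value γ=0.3\gamma=0.3 is an arbitrary choice):

```python
import numpy as np

gamma = 0.3
z = np.linspace(0.0, 20.0, 2_000_001)  # fine grid covering both maximizers

# sup_z z^2 exp(-gamma z^2 / 2), attained at z = sqrt(2/gamma)
sup1 = np.max(z ** 2 * np.exp(-0.5 * gamma * z ** 2))
# sup_z z exp(-gamma z^2 / 2), attained at z = 1/sqrt(gamma)
sup2 = np.max(z * np.exp(-0.5 * gamma * z ** 2))
```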

It follows from Proposition 8 that all the estimating scores of the γ\gamma-estimator lie in a tubular neighborhood

𝒩θ(δ)={zd:d(z,θ)δ}\displaystyle{\mathcal{N}}_{\theta}(\delta)=\big{\{}z\in\mathbb{R}^{d}:d(z,{\mathcal{H}}_{\theta})\leq\delta\big{\}} (2.27)

surrounding HθH_{\theta}. As a result, the distance from the estimating function to the boundary HθH_{\theta} is bounded, that is,

\displaystyle\sup_{x\in{\mathcal{X}}}d({S}_{\gamma}(\theta),{\mathcal{H}}_{\theta})\leq\frac{2}{\gamma}\exp(-1)+\frac{1}{\sqrt{\gamma}}\exp\Big(-\frac{1}{2}\Big)\max_{1\leq i\leq n}|Y_{i}|.

However, in the limiting case of γ=0\gamma=0, that is, for the ML-estimator, this boundedness property against covariate outliers breaks down. Tukey's biweight estimating function satisfies the boundedness; Huber's does not.

We present a brief numerical experiment. Assume that the covariate vectors XiX_{i}’s are generated from a bivariate normal distribution 𝙽𝚘𝚛(0,I){\tt Nor}(0,{\rm I}), where I\rm I denotes the 2-dimensional identity matrix. The simulation was designed with the following two scenarios for the conditional distribution of the response variables YiY_{i}’s.

Specified model

\hskip 25.60747ptY_{i}\sim{\tt Nor}(\theta_{1}^{\top}X_{i}+\theta_{0},\sigma^{2}).

Misspecified model

Y_{i}\sim(1-\pi)\,{\tt Nor}(\theta_{1}^{\top}X_{i}+\theta_{0},\sigma^{2})+\pi\,{\tt Nor}(\theta_{*1}^{\top}X_{i}+\theta_{*0},\sigma_{*}^{2}).

Here the parameters were set as (\theta_{0},\theta_{1}^{\top})=(0.5,1.5,1.0)^{\top} and \pi=0.1 with \sigma=1; (\theta_{*0},\theta_{*1}^{\top})=(0.5,-1.5,-1.0)^{\top} with \sigma_{*}=1.

We compared the ML-estimator θ^0\hat{\theta}_{0} and the γ\gamma-estimator θ^γ\hat{\theta}_{\gamma} with γ=0.3\gamma=0.3, where the simulation was conducted with 300 replications. In the case of the specified model, the ML-estimator was slightly superior to the γ\gamma-estimator in terms of the root mean square error (rmse); however, the superiority is almost negligible. Next, we supposed the misspecified model, a mixture of two normal regression models in which one component was the same model as above with mixing probability 0.90.9, and the other was still a normal regression model but with the negative slope vector and mixing probability 0.10.1. Under this misspecified setting, the γ\gamma-estimator was crucially superior to the ML-estimator: the rmse of the ML-estimator is more than double that of the γ\gamma-estimator. Thus, the ML-estimator is sensitive to the presence of such a heterogeneous subgroup, while the γ\gamma-estimator is robust. Proposition 8 suggests that the effect of the subgroup is substantially suppressed in the estimating function of the γ\gamma-estimator. See Table 2.1 and Figure 2.2 for details.
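The misspecified-model experiment can be reproduced in outline as follows; this is a compact sketch with fewer replications than the text's 300, and the seed and helper names are our own choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def gamma_fit(X, Y, gamma=0.3, sigma2=1.0, n_iter=50):
    # IRLS for the gamma-estimating equation, sigma^2 taken as known
    theta = np.linalg.solve(X.T @ X, X.T @ Y)
    for _ in range(n_iter):
        w = np.exp(-0.5 * gamma * (Y - X @ theta) ** 2 / sigma2)
        XW = X * w[:, None]
        theta = np.linalg.solve(X.T @ XW, XW.T @ Y)
    return theta

theta_true = np.array([0.5, 1.5, 1.0])    # (intercept, slope vector)
theta_star = np.array([0.5, -1.5, -1.0])  # contaminating component
err_ml, err_g = [], []
for _ in range(30):
    n = 300
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    comp = rng.uniform(size=n) < 0.1      # 10% drawn from the contaminating model
    mean = np.where(comp, X @ theta_star, X @ theta_true)
    Y = mean + rng.normal(size=n)
    err_ml.append(np.linalg.norm(np.linalg.solve(X.T @ X, X.T @ Y) - theta_true))
    err_g.append(np.linalg.norm(gamma_fit(X, Y) - theta_true))
rmse_ml = np.sqrt(np.mean(np.square(err_ml)))
rmse_g = np.sqrt(np.mean(np.square(err_g)))
```

The γ\gamma-estimator's rmse comes out clearly smaller than the ML-estimator's, in line with Table 2.1(b).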

Table 2.1: Comparison between the ML-estimator and the γ\gamma-estimator.

(a). The case of specified model

Method estimate rmse
ML-estimate (0.495672, 1.50753, 1.00211) 0.173617
γ\gamma-estimate (0.497593, 1.50754, 1.00301) 0.176443

(b). The case of misspecified model

Method estimate rmse
ML-estimate (0.50788, 1.19093, 0.774613) 0.486919
γ\gamma-estimate (0.500093, 1.43289, 0.941501) 0.219798
Refer to caption
Figure 2.2: Box-whisker Plots of the ML-estimator and the γ\gamma-estimator

2.5 Binary logistic regression

We consider a binary outcome YY with a value in 𝒴={0,1}{\mathcal{Y}}=\{0,1\} and a covariate XX in a subset 𝒳{\mathcal{X}} of d\mathbb{R}^{d}. The probability distribution is characterized by a probability mass function (pmf) or the RN-derivative with respect to a counting measure CC:

p(y,π)=πy(1π)1y,\displaystyle p(y,\pi)=\pi^{y}(1-\pi)^{1-y},

which is referred to as the Bernoulli distribution 𝙱𝚎𝚛(π){\tt Ber}(\pi), where π\pi is the probability of Y=1Y=1. A binary regression model is defined by a link function mapping the systematic component ω\omega into the random component: g(\omega)={\exp(\omega)}/{\{1+\exp(\omega)\}}, so that the conditional pmf given X=xX=x with a linear model ω=θx\omega=\theta^{\top}x is given by

p(y|x,θ)=exp(yθx)1+exp(θx),\displaystyle p(y|x,\theta)=\frac{\exp(y\theta^{\top}x)}{1+\exp(\theta^{\top}x)}, (2.28)

which is referred to as a logistic model [15, 46].

The KL-divergence between Bernoulli distributions is given by

D0(𝙱𝚎𝚛(π),𝙱𝚎𝚛(ρ))=πlogπρ+(1π)log1π1ρ.\displaystyle D_{0}({\tt Ber}(\pi),{\tt Ber}(\rho))=\pi\log\frac{\pi}{\rho}+(1-\pi)\log\frac{1-\pi}{1-\rho}.

For a given dataset {(Xi,Yi)}i=1,,n\{(X_{i},Y_{i})\}_{i=1,...,n}, the negative log-likelihood function is given by

\displaystyle L_{0}(\theta)=-\frac{1}{n}\sum_{i=1}^{n}\Big[Y_{i}\theta^{\top}X_{i}-\log\{1+\exp(\theta^{\top}X_{i})\}\Big]

and the likelihood equation is written by

S0(θ)=1ni=1n{Yiexp(θXi)1+exp(θXi)}Xi=0.\displaystyle{S}_{0}(\theta)=-\frac{1}{n}\sum_{i=1}^{n}\Big{\{}Y_{i}-\frac{\exp(\theta^{\top}X_{i})}{1+\exp(\theta^{\top}X_{i})}\Big{\}}X_{i}=0. (2.29)

On the other hand, the γ\gamma-divergence is given by

Dγ(𝙱𝚎𝚛(π),𝙱𝚎𝚛(ρ);C)=1γπργ+(1π)(1ρ)γ{ργ+1+(1ρ)γ+1}γγ+1+1γ{πγ+1+(1π)γ+1}1γ+1,\displaystyle D_{\gamma}({\tt Ber}(\pi),{\tt Ber}(\rho);C)=-\frac{1}{\gamma}\frac{\pi\rho^{\gamma}+(1-\pi)(1-\rho)^{\gamma}}{\big{\{}\rho^{\gamma+1}+(1-\rho)^{\gamma+1}\big{\}}^{\frac{\gamma}{\gamma+1}}}+\frac{1}{\gamma}{\big{\{}\pi^{\gamma+1}+(1-\pi)^{\gamma+1}\big{\}}^{\frac{1}{\gamma+1}}},

where CC is the counting measure on 𝒴\cal Y. Note that this depends on the choice of CC as the reference measure on 𝒴\cal Y. The γ\gamma-expression of the logistic model (2.28) is given by

p(γ)(y|x,ω)=exp{(γ+1)yθx}1+exp{(γ+1)θx}.\displaystyle p^{(\gamma)}(y|x,\omega)=\frac{\exp\{(\gamma+1)y\theta^{\top}x\}}{1+\exp\{(\gamma+1)\theta^{\top}x\}}.

Hence, the γ\gamma-loss function is written by

Lγ(θ;C)=1n1γi=1n[exp{(γ+1)YiθXi}1+exp{(γ+1)θXi}]γγ+1.\displaystyle L_{\gamma}(\theta;C)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\Big{[}\frac{\exp\{(\gamma+1)Y_{i}\theta^{\top}X_{i}\}}{1+\exp\{(\gamma+1)\theta^{\top}X_{i}\}}\Big{]}^{\frac{\gamma}{\gamma+1}}. (2.30)

and the γ\gamma-estimating function is written as

Sγ(θ;C)=1ni=1nSγ(Xi,Yi,θ;C),\displaystyle{S}_{\gamma}(\theta;C)=\frac{1}{n}\sum_{i=1}^{n}{S}_{\gamma}(X_{i},Y_{i},\theta;C),

where

Sγ(X,Y,θ;C)=[exp{(γ+1)YiθX}1+exp{(γ+1)θX}]γγ+1{Yexp{(γ+1)θX}1+exp{(γ+1)θX}}X,\displaystyle{S}_{\gamma}(X,Y,\theta;C)=\Big{[}\frac{\exp\{(\gamma+1)Y_{i}\theta^{\top}X\}}{1+\exp\{(\gamma+1)\theta^{\top}X\}}\Big{]}^{\frac{\gamma}{\gamma+1}}\Big{\{}Y-\frac{\exp\{(\gamma+1)\theta^{\top}X\}}{1+\exp\{(\gamma+1)\theta^{\top}X\}}\Big{\}}X, (2.31)

see [49] for the discussion of robust mislabeling. See [24, 69, 90, 92, 91, 55] for other types of MDE approaches than the γ\gamma-estimation.
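The γ\gamma-loss (2.30) can be minimized numerically. A sketch using scipy, where the optimizer choice, γ=0.5\gamma=0.5, and the simulated data are our assumptions; note that the exponent of the γ\gamma-expression keeps every summand in (0,1], so the loss is numerically stable:

```python
import numpy as np
from scipy.optimize import minimize

def gamma_loss(theta, X, Y, gamma=0.5):
    # gamma-loss (2.30): -(1/(n*gamma)) sum_i [p^(gamma)(Y_i|X_i)]^(gamma/(gamma+1))
    s = (gamma + 1.0) * (X @ theta)
    logp = Y * s - np.logaddexp(0.0, s)   # log of the gamma-expression
    return -np.mean(np.exp(gamma / (gamma + 1.0) * logp)) / gamma

rng = np.random.default_rng(4)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([0.5, 1.0, 1.5])
Y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ theta_true)))).astype(float)

res = minimize(gamma_loss, np.zeros(3), args=(X, Y), method="BFGS")
theta_gamma = res.x
```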

The γ\gamma-divergence on the space of Bernoulli distributions is well defined for all real numbers γ\gamma. Let us fix γ=1\gamma=-1; the GM-divergence between Bernoulli distributions is then given by

DGM(𝙱𝚎𝚛(π),𝙱𝚎𝚛(ρ);R)={πρr+1π1ρ(1r)}πr(1π)1rρr(1ρ)1r,\displaystyle D_{\rm GM}({\tt Ber}(\pi),{\tt Ber}(\rho);R)=\Big{\{}\frac{\pi}{\rho}r+\frac{1-\pi}{1-\rho}(1-r)\Big{\}}{\pi}^{r}({1-\pi})^{1-r}-{\rho}^{r}({1-\rho})^{1-r},

where the reference measure RR is chosen by 𝙱𝚎𝚛(r){\tt Ber}(r). Hence, the GM-loss function is given by

LGM(θ;R)=1ni=1nrYi(1r)1Yiexp{(rYi)θXi}.\displaystyle L_{\rm GM}(\theta;R)=\frac{1}{n}\sum_{i=1}^{n}r^{Y_{i}}(1-r)^{1-Y_{i}}\exp\{(r-Y_{i})\theta^{\top}X_{i}\}.

The GM-loss function with the reference measure {\tt Ber}(\small\mbox{$\frac{1}{2}$}) is equal to the exponential loss function of the AdaBoost algorithm discussed in the context of ensemble learning [30]. The integrated discrimination improvement index via odds [40] is based on the GM-loss function to assess prediction performance. We will give a further discussion in a subsequent chapter. The GM-estimating function is written as

SGM(θ;𝙱𝚎𝚛(r))=1ni=1n(2Yi1)exp{(rYi)θXi}Xi\displaystyle{S}_{\rm GM}(\theta;{\tt Ber}(r))=\frac{1}{n}\sum_{i=1}^{n}(2Y_{i}-1)\exp\{(r-Y_{i})\theta^{\top}X_{i}\}X_{i}

due to r^{Y}(1-r)^{1-Y}(r-Y)=-(2Y-1)r(1-r) for Y=0,1Y=0,1. Therefore, this estimating function is unbiased for any rr with 0<r<10<r<1; that is, the expected estimating function conditional on (X1,,Xn)(X_{1},...,X_{n}) under the logistic model (2.28) is equal to the zero vector.
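Conditional unbiasedness can also be verified directly: under the logistic model, P(Y=1|x)=e^{s}/(1+e^{s}) with s=\theta^{\top}x, and the two terms of the GM-score cancel for every rr. A one-line numerical confirmation:

```python
import numpy as np

# E[(2Y-1) exp{(r-Y) s}] under P(Y=1) = e^s/(1+e^s) should vanish for all r and s.
for s in (-3.0, 0.0, 2.5):
    p1 = np.exp(s) / (1.0 + np.exp(s))
    for r in (0.1, 0.5, 0.9):
        m = p1 * np.exp((r - 1.0) * s) - (1.0 - p1) * np.exp(r * s)
        assert abs(m) < 1e-10  # conditional expectation of the GM-score is zero
```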

We now discuss which rr is effective in practical logistic regression applications. In particular, we focus on the problem of imbalanced samples, an important issue in binary regression. An imbalanced dataset is one where the distribution of samples across the two classes is far from equal. For example, in a medical diagnosis dataset, the number of patients with a rare disease (class 1) may be significantly lower than those without it (class 0). Such a situation is characterized as

\displaystyle 0\approx P(Y=1)\ll P(Y=0)\approx 1.

Imbalanced samples raise difficult issues of model bias, poor generalization, and inaccurate performance metrics for prediction, and can lead to biased or inconsistent estimators, affecting hypothesis tests and confidence intervals. To address these problems, resampling techniques have been exploited: oversampling the minority class or undersampling the majority class can balance the dataset. Also, cost-sensitive learning introduces a cost matrix to penalize misclassification of the minority class more heavily. An asymmetric logistic regression has been proposed that introduces a new parameter to account for data complexity [55]; the authors observe that this parameter controls the influence of imbalanced sampling. Here we tackle this problem with the GM-estimator by choosing an appropriate reference distribution RR in the GM-loss function. We select 𝙱𝚎𝚛(π^0){\tt Ber}(\hat{\pi}_{0}) as the reference measure, where π^0\hat{\pi}_{0} is the proportion of the negative samples, namely π^0=i=1n𝕀(Yi=0)/n\hat{\pi}_{0}=\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=0)/n. Then the resultant loss function is given by

LGM(iw)(θ)=1ni=1nπ^0Yi(1π^0)1Yiexp{(π^0Yi)θXi}.\displaystyle L^{\rm(iw)}_{\rm GM}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\hat{\pi}_{0}^{Y_{i}}(1-\hat{\pi}_{0})^{1-Y_{i}}\exp\{(\hat{\pi}_{0}-Y_{i})\theta^{\top}X_{i}\}. (2.32)

We refer to this as the inverse-weighted GM-loss function since the weight satisfies

π^0Yi(1π^0)1Yi1(1π^0)Yiπ^01Yi.\hat{\pi}_{0}^{Y_{i}}(1-\hat{\pi}_{0})^{1-Y_{i}}\propto\frac{1}{(1-\hat{\pi}_{0})^{Y_{i}}\hat{\pi}_{0}^{1-Y_{i}}}.

Hence, the estimating function is given by

SGM(iw)(θ)=1ni=1n(2Yi1)exp{(π^0Yi)θXi}Xi.\displaystyle{S}^{\rm(iw)}_{\rm GM}(\theta)=\frac{1}{n}\sum_{i=1}^{n}(2Y_{i}-1)\exp\{(\hat{\pi}_{0}-Y_{i})\theta^{\top}X_{i}\}X_{i}.

Equating the estimating function to zero gives the equality between two sums of positive and negative samples:

1ni=1n𝕀(Yi=1)exp{(π^01)θXi}Xi=1ni=1n𝕀(Yi=0)exp{π^0θXi}Xi.\displaystyle\frac{1}{n}\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=1)\exp\{(\hat{\pi}_{0}-1)\theta^{\top}X_{i}\}X_{i}=\frac{1}{n}\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=0)\exp\{\hat{\pi}_{0}\theta^{\top}X_{i}\}X_{i}.

Alternatively, the likelihood estimating equation is written as

1ni=1n𝕀(Yi=1)11+exp(θXi)Xi=1ni=1n𝕀(Yi=0)exp(θXi)1+exp(θXi)Xi.\displaystyle\frac{1}{n}\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=1)\frac{1}{1+\exp(\theta^{\top}X_{i})}X_{i}=\frac{1}{n}\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=0)\frac{\exp(\theta^{\top}X_{i})}{1+\exp(\theta^{\top}X_{i})}X_{i}.

Both estimating equations are unbiased; however, the weightings contrast with each other.
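The inverse-weighted GM-loss (2.32) is a sum of positive weights times exponentials of linear functions of θ\theta, hence convex, and can be minimized directly. A sketch on an imbalanced sample, where the use of scipy and the data design (rare positives near +\mu, abundant negatives near -\mu) are our illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, eps = 2000, 0.05
mu = np.array([2.0, 2.0])
Y = (rng.uniform(size=n) < eps).astype(float)          # ~5% positives
X0 = np.where(Y[:, None] == 1.0, mu, -mu) + rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), X0])                  # add an intercept column

pi0 = np.mean(Y == 0.0)   # proportion of negative samples

def gm_loss(theta):
    # inverse-weighted GM-loss (2.32) with reference Ber(pi0)
    w = pi0 ** Y * (1.0 - pi0) ** (1.0 - Y)
    return np.mean(w * np.exp((pi0 - Y) * (X @ theta)))

theta_gm = minimize(gm_loss, np.zeros(3), method="BFGS").x
tpr = np.mean((X[Y == 1.0] @ theta_gm) > 0.0)  # true positive rate at threshold 0
```

On this well-separated design the fitted direction classifies the rare class accurately.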

We conduct a brief numerical experiment. Assume that the covariate vectors are generated from a mixture of bivariate normal distributions as

\displaystyle X_{i}\sim(1-\epsilon)\,{\tt Nor}(-\mu_{0},{\rm I})+\epsilon\,{\tt Nor}(\mu_{0},{\rm I}),

where I\rm I denotes the 2-dimensional identity matrix. Here we set n=1000n=1000 and \mu_{0}=(2,2)^{\top}, and the mixture ratio ϵ\epsilon is taken over several fixed values. The outcome variables are generated from Bernoulli distributions as Yi𝙱𝚎𝚛(π(Xi))Y_{i}\sim{\tt Ber}(\pi(X_{i})), where

π(Xi,θ0)=exp(θ0+θ1Xi1+θ2Xi2)1+exp(θ0+θ1Xi1+θ2Xi2)\displaystyle\pi(X_{i},\theta_{0})=\frac{\exp(\theta_{0}+\theta_{1}X_{i1}+\theta_{2}X_{i2})}{1+\exp(\theta_{0}+\theta_{1}X_{i1}+\theta_{2}X_{i2})}

where (\theta_{0},\theta_{1},\theta_{2})=(0.5,1.0,1.5). This simulation is designed to yield imbalanced samples such that the positive sample proportion is approximately ϵ\epsilon.

We compared the ML-estimator θ^\hat{\theta} with the inverse-weighted GM-estimator θ^GM\hat{\theta}_{\rm GM} over 30 replications. We observe that the GM-estimator has better performance than the ML-estimator in the sense of the true positive rate. Table 2.2 lists the true positive and negative rates based on test samples of size 10001000. Note that the two label-conditional distributions, 𝙽𝚘𝚛(μ0,I){\tt Nor}(\mu_{0},{\rm I}) and 𝙽𝚘𝚛(μ0,I){\tt Nor}(-\mu_{0},{\rm I}), are set to be sufficiently separated from each other. Hence, the classification problem becomes an extremely easy task when ϵ\epsilon is a moderate value. Both the ML-estimator and the GM-estimator perform well in the cases ϵ=0.3,0.1\epsilon=0.3,0.1. Alternatively, we observe that the true positive rate of the GM-estimator is considerably higher than that of the ML-estimator in the imbalanced situations ϵ=0.03,0.01\epsilon=0.03,0.01.

Table 2.2: The comparison between MLE vs GME
ϵ\epsilon MLE GME
0.3 (0.969, 0.995) (0.956, 0.994)
0.1 (0.897, 0.998) (0.902, 0.995)
0.05 (0.800, 0.999) (0.817, 0.994)
0.03 (0.705, 0.999) (0.733, 0.995)
0.01 (0.462, 0.999) (0.538, 0.996)

(a,b)(a,b) denotes a pair of the true positive and negative rates aa and bb.

We next focus on the HM-divergence (γ\gamma-divergence, γ=2\gamma=-2):

DHM(𝙱𝚎𝚛(π),𝙱𝚎𝚛(ρ))=π(1ρ)2+(1π)ρ2π(1π),\displaystyle D_{\rm HM}({\tt Ber}(\pi),{\tt Ber}(\rho))=\pi(1-\rho)^{2}+(1-\pi)\rho^{2}-\pi(1-\pi),

where the reference measure is determined by 𝙱𝚎𝚛(ρ){\tt Ber}(\rho). The HM-loss function is derived as

\displaystyle L_{\rm HM}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\Bigg[\frac{\exp\{(1-Y_{i})\theta^{\top}X_{i}\}}{1+\exp(\theta^{\top}X_{i})}\Bigg]^{2},

for the logistic model (2.28). Note that the HM-loss function is the γ\gamma-loss function with γ=2\gamma=-2, for which the γ\gamma-expression is reduced to

p(2)(y|x,ω)=exp{(1y)θx}1+exp(θx).\displaystyle p^{(-2)}(y|x,\omega)=\frac{\exp\{(1-y)\theta^{\top}x\}}{1+\exp(\theta^{\top}x)}.

Hence, the HM-estimating function is written as

\displaystyle{S}_{\rm HM}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\frac{\exp(\theta^{\top}X_{i})}{\{1+\exp(\theta^{\top}X_{i})\}^{2}}\Big\{Y_{i}-\frac{\exp(\theta^{\top}X_{i})}{1+\exp(\theta^{\top}X_{i})}\Big\}X_{i}.

This is a weighted likelihood score function with the conditional variance of YY as the weight function. We will observe that this weighting is the key to the robustness of the HM-estimator against covariate outliers.

Let us investigate the behavior of the estimating function Sγ(θ;C){S}_{\gamma}(\theta;C) of the γ\gamma-estimator. In general, Sγ(θ;C){S}_{\gamma}(\theta;C) is unbiased at \theta_{0}, that is, \mathbb{E}_{0}[{S}_{\gamma}(\theta_{0};C)|\underline{X}]=0 under the conditional expectation with respect to the true distribution with the pmf p(y|x,θ0)p(y|x,\theta_{0}). However, this property is easily violated if the expectation is taken under a misspecified distribution QQ with the pmf q(y|x)q(y|x) other than the true distribution [12, 13, 53]. Hence, we look into the expected estimating function under the misspecified model.

Proposition 9.

Consider the γ\gamma-estimating function under a logistic model (2.28). Assume γ>0\gamma>0 or γ<1\gamma<-1. Then,

supx𝒳|θ𝔼Q[Sγ(X,Y,θ)|X=x]|<,\sup_{x\in{\mathcal{X}}}|\theta^{\top}\mathbb{E}_{Q}[{S}_{\gamma}(X,Y,\theta)|X=x]|<\infty, (2.33)

where 𝔼Q[|X=x]\mathbb{E}_{Q}[\ \cdot\ |X=x] is the conditional expectation under a misspecified distribution QQ outside the model (2.28).

Proof.

It is written from (2.31) that

𝔼Q[Sγ(X,Y,θ)|X=x]\displaystyle\mathbb{E}_{Q}[{S}_{\gamma}(X,Y,\theta)|X=x]
=\sum_{y=0}^{1}\bigg[\frac{\exp\{(\gamma+1)y\theta^{\top}x\}}{1+\exp\{(\gamma+1)\theta^{\top}x\}}\bigg]^{\frac{\gamma}{\gamma+1}}\Big[y-\frac{\exp\{(\gamma+1)\theta^{\top}x\}}{1+\exp\{(\gamma+1)\theta^{\top}x\}}\Big]q(y|x)\,x. (2.34)

Hence, if s=(γ+1)θxs=(\gamma+1)\theta^{\top}x, then

|θ𝔼Q[Sγ(X,Y,θ)|X=x]|Ψγ(s)+Ψγ(s),\displaystyle\big{|}\theta^{\top}\mathbb{E}_{Q}[{S}_{\gamma}(X,Y,\theta)|X=x]\big{|}\leq\Psi_{\gamma}(s)+\Psi_{\gamma}(-s),

where

Ψγ(s)=|s||γ+1|{11+exp(s)}γγ+1exp(s)1+exp(s).\displaystyle\Psi_{\gamma}(s)=\frac{|s|}{|\gamma+1|}\Big{\{}\frac{1}{1+\exp(-s)}\Big{\}}^{\frac{\gamma}{\gamma+1}}\frac{\exp(-s)}{1+\exp(-s)}. (2.35)

We observe that, if γ>0\gamma>0 or γ<1\gamma<-1, then

supsΨγ(s)=supsΨγ(s)<.\displaystyle\sup_{s\in\mathbb{R}}\Psi_{\gamma}(s)=\sup_{s\in\mathbb{R}}\Psi_{\gamma}(-s)<\infty.

This concludes (2.33).

∎

We note that Proposition 9 focuses only on the logistic model (2.28); however, such a boundedness property also holds for the probit model and the complementary log-log model.

We consider a geometric understanding of the bounded property in (2.33). In GLM, the linear predictor is written as θx=θ1x1+θ0\theta^{\top}x=\theta_{1}^{\top}x_{1}+\theta_{0}, where θ1\theta_{1} and θ0\theta_{0} are referred to as the slope vector and the intercept term, respectively. The decision boundary Hθ{H}_{\theta} is defined as in (2.6). The Euclidean distance of xx to Hθ{H}_{\theta},

\displaystyle d(x,{\mathcal{H}}_{\theta})=\frac{|\theta_{1}^{\top}x_{1}+\theta_{0}|}{\|\theta_{1}\|},

is referred to as the margin of xx from the decision boundary Hθ{H}_{\theta}, which plays a central role in the support vector machine [14]. Let

𝒩θ(δ)={x𝒳:d(x,θ)δ}.\displaystyle{\mathcal{N}}_{\theta}(\delta)=\big{\{}x\in{\mathcal{X}}:d(x,{\mathcal{H}}_{\theta})\leq\delta\big{\}}. (2.36)

This is the δ\delta-tubular neighborhood containing θ{\mathcal{H}}_{\theta}. From this perspective, Proposition 9 states, for any γ\gamma with γ<1\gamma<-1 or γ>0\gamma>0, that the conditional expectation of the γ\gamma-estimating function lies in the tubular neighborhood with probability one even under a misspecified distribution outside the parametric model (2.28). On the other hand, the likelihood estimating function does not satisfy such a stability property because the margin of its conditional expectation becomes unbounded. Therefore, we conclude that the γ\gamma-estimator is robust against model misspecification for γ>0\gamma>0 or γ<1\gamma<-1, while the ML-estimator is not robust.

We observe in the Euclidean geometric view that, for a feature vector xx of 𝒳\cal X, the decision hyperplane θ{\mathcal{H}}_{\theta} decomposes xx into orthogonal and tangential components as x=z+wx=z+w, where z=(θx)θ/θ2z=(\theta^{\top}x)\theta/\|\theta\|^{2} and w=xzw=x-z. Note that zwz\perp w and x2=z2+w2\|x\|^{2}=\|z\|^{2}+\|w\|^{2}. In accordance with this geometric view, we give more insight into the robust performance of the γ\gamma-estimator class. We write the γ\gamma-estimating function (2.31) as Sγ(x,y,θ)=ηγ(y,θx)(z+w)S_{\gamma}(x,y,\theta)=\eta_{\gamma}(y,\theta^{\top}x)(z+w). Then,

\displaystyle|S_{\gamma}(x,y,\theta)|\leq|\eta_{\gamma}(y,z^{\top}\theta)|(\|z\|+\|w\|). (2.37)

Therefore, we conclude that

\displaystyle|S_{\gamma}(x,y,\theta)|\leq\sup_{s\in\mathbb{R}}|\eta_{\gamma}(y,s)\,s|\,\frac{1}{\|\theta\|}+\sup_{s\in\mathbb{R}}|\eta_{\gamma}(y,s)|\,\|w\|. (2.38)

Thus, we observe a robust property of the γ\gamma-estimator in a more direct perspective.

Proposition 10.

Assume γ>0\gamma>0 or γ<1\gamma<-1. Then, the γ\gamma-estimating function Sγ(θ;C){S}_{\gamma}(\theta;C) based on a dataset {\cal D}=\{(X_{i},Y_{i})\}_{i=1}^{n} satisfies

sup𝒟|θSγ(θ;C)|<.\displaystyle\sup_{\cal D}|\theta^{\top}{S}_{\gamma}(\theta;C)|<\infty. (2.39)
Proof.

It is written from (2.31) that

\theta^{\top}{S}_{\gamma}(\theta;C)
=\frac{1}{n}\sum_{i=1}^{n}\bigg[\frac{\exp\{(\gamma+1)Y_{i}\theta^{\top}X_{i}\}}{1+\exp\{(\gamma+1)\theta^{\top}X_{i}\}}\bigg]^{\frac{\gamma}{\gamma+1}}\Big[Y_{i}-\frac{\exp\{(\gamma+1)\theta^{\top}X_{i}\}}{1+\exp\{(\gamma+1)\theta^{\top}X_{i}\}}\Big]\theta^{\top}X_{i}, (2.40)

which is decomposed into the sum of the positive and negative samples as

1ni=1n{𝕀(Yi=1)Ψγ(Si)𝕀(Yi=0)Ψγ(Si)}\displaystyle\frac{1}{n}\sum_{i=1}^{n}\big{\{}{\mathbb{I}}(Y_{i}=1)\Psi_{\gamma}(S_{i})-{\mathbb{I}}(Y_{i}=0)\Psi_{\gamma}(-S_{i})\big{\}}

where Si=(γ+1)θXiS_{i}=(\gamma+1)\theta^{\top}X_{i} and Ψγ(s)\Psi_{\gamma}(s) is defined in (2.35). Hence, we get

\displaystyle|\theta^{\top}{S}_{\gamma}(\theta;C)|\leq\frac{1}{n}\sum_{i=1}^{n}\big{\{}{\mathbb{I}}(Y_{i}=1)|\Psi_{\gamma}(S_{i})|+{\mathbb{I}}(Y_{i}=0)|\Psi_{\gamma}(-S_{i})|\big{\}}

which is bounded by

n1nsups|Ψγ(s)|+n0nsups|Ψγ(s)|\displaystyle\frac{n_{1}}{n}\sup_{s\in\mathbb{R}}|\Psi_{\gamma}(s)|+\frac{n_{0}}{n}\sup_{s\in\mathbb{R}}|\Psi_{\gamma}(-s)|

which is equal to $\delta_{\gamma}$, where $n_{y}=\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=y)$ for $y=0,1$. This concludes (2.39). ∎
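The boundedness asserted in Proposition 10 can be checked numerically. The sketch below, assuming the per-sample form displayed in (2.40) for a scalar margin $t=\theta^{\top}x$, computes $\theta^{\top}S_{\gamma}(x,y,\theta)$ in log-space for numerical stability and contrasts $\gamma=0.8$ (bounded) with the likelihood case $\gamma=0$ (unbounded); the function name is ours, not the book's.

```python
import numpy as np

def theta_score(t, y, gamma):
    """theta^T S_gamma for the Bernoulli logistic model, following the
    summand of (2.40) with margin t = theta^T x.  The weight factor is
    evaluated in log-space to avoid overflow for large |t|."""
    u = (gamma + 1.0) * t
    log_w = (gamma / (gamma + 1.0)) * (y * u - np.logaddexp(0.0, u))
    mu = np.exp(-np.logaddexp(0.0, -u))   # logistic mean, numerically stable
    return np.exp(log_w) * (y - mu) * t

# A negative-labelled sample pushed far into the positive side:
ml_small, ml_large = theta_score(5.0, 0, 0.0), theta_score(500.0, 0, 0.0)
g_small, g_large = theta_score(5.0, 0, 0.8), theta_score(500.0, 0, 0.8)
```

For $\gamma=0$ (the ML score) the magnitude grows linearly in the margin, while for $\gamma=0.8$ the weight factor drives it back to zero, in line with (2.39).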

The log-likelihood estimating function is given by

\displaystyle{S}_{0}(\theta;\Lambda)=\frac{1}{n}\sum_{i=1}^{n}\Big{\{}{\mathbb{I}}(Y_{i}=1)\frac{1}{1+\exp(\theta^{\top}X_{i})}-{\mathbb{I}}(Y_{i}=0)\frac{\exp(\theta^{\top}X_{i})}{1+\exp(\theta^{\top}X_{i})}\Big{\}}X_{i}. (2.41)

Hence, $|\theta^{\top}{S}_{0}(\theta;\Lambda)|$ is unbounded in $\{\theta^{\top}X_{i}:i=1,...,n\}$ since either of the two terms in (2.41) diverges to infinity as $|\theta^{\top}X_{i}|$ goes to infinity. The GM-estimating function is written as

SGM(θ,R)=1ni=1n[𝕀(Yi=1)exp{r(0)θXi}r(0)+𝕀(Yi=0)exp{r(1)θXi}r(1)]Xi.\displaystyle{S}_{\rm GM}(\theta,R)=\frac{1}{n}\sum_{i=1}^{n}\Big{[}{\mathbb{I}}(Y_{i}=1)\exp\{r(0)\theta^{\top}X_{i}\}r(0)+{\mathbb{I}}(Y_{i}=0)\exp\{r(1)\theta^{\top}X_{i}\}r(1)\Big{]}X_{i}.

This implies that |θSGM(θ;R)||\theta^{\top}{S}_{\rm GM}(\theta;R)| is unbounded.

We present a brief numerical study with two types of sampling. One is based on the covariate distribution conditional on the outcome $Y$, which is widely analyzed in case-control studies. The other is based on the conditional distribution of $Y$ given the covariate vector $X$, which is common in cohort studies. First, we consider a model of an outcome-conditional distribution. Assume that the conditional distribution of $X$ given $Y=y$ is a bivariate normal distribution ${\tt Nor}(\mu_{y},{\rm I})$, where ${\rm I}$ is the 2-dimensional identity matrix. Then, the marginal distribution of $X$ is written as $p_{1}{\tt Nor}(\mu_{1},{\rm I})+p_{0}{\tt Nor}(\mu_{0},{\rm I})$, where $p_{y}=P(Y=y)$. The conditional pmf of $Y$ given $X=x$ is given by

p(y|x,θ)=exp{y(θ1x+θ0)}1+exp(θ1x+θ0)\displaystyle p(y|x,\theta)=\frac{\exp\{y(\theta_{1}^{\top}x+\theta_{0})\}}{1+\exp(\theta_{1}^{\top}x+\theta_{0})}

due to the Bayes formula, where $\theta_{1}=\mu_{1}-\mu_{0}$ and $\theta_{0}=\frac{1}{2}(\mu_{0}^{\top}\mu_{0}-\mu_{1}^{\top}\mu_{1})+\log(p_{1}/p_{0})$. Let $N\sim{\tt Bin}(n,p_{1})$. The simulation was conducted with $N$ positive samples ($Y_{i}=1$) and $n-N$ negative samples ($Y_{i}=0$) as follows.

(a). Specified model: $\{X_{i}\}_{i=1}^{N}\sim{\tt Nor}(\mu_{1},{\rm I})$ and $\{X_{i}\}_{i=N+1}^{n}\sim{\tt Nor}(\mu_{0},{\rm I})$.

(b). Misspecified model: $\{X_{i}\}_{i=1}^{N}\sim(1-\pi)\,{\tt Nor}(\mu_{1},{\rm I})+\pi\,{\tt Nor}(\sigma\mu_{0},{\rm I})$ and $\{X_{i}\}_{i=N+1}^{n}\sim{\tt Nor}(\mu_{0},{\rm I})$.

Here parameters were set as $\mu_{1}=(0.5,0.5)^{\top}$, $\mu_{0}=-(0.5,0.5)^{\top}$, $p_{1}=0.5$ and $(\pi,\sigma)=(0.1,-4.0)$, so that $(\theta_{0},\theta_{1}^{\top})=(0.0,1.0,1.0)$. Figure 2.3 shows the plot of 103 negative samples (blue), 87 positive samples (green) and 10 positive outliers (red) on the logistic model surface $\{(x_{1},x_{2},p(1|(x_{1},x_{2}),(\theta_{0},\theta_{1}))):-3.5\leq x_{1}\leq 3.5,\ -3.5\leq x_{2}\leq 3.5\}$. The 10 positive outliers lie away from the hull of the 87 positive samples.

Refer to caption
Figure 2.3: Covariate vectors on the logistic model

We compared the ML-estimator $\hat{\theta}_{0}$, the $\gamma$-estimator $\hat{\theta}_{\gamma}$ with $\gamma=0.8$, the GM-estimator $\hat{\theta}_{\rm GM}$ and the HM-estimator $\hat{\theta}_{\rm HM}$, where the simulation was conducted with 300 replications. See Table 2.3 for the performance of the four estimators in cases (a) and (b), and Figure 2.4 for the box-whisker plot in case (b). In case (a) of the specified model, the ML-estimator was superior to the other estimators in terms of the root mean square error (rmse); however, the superiority is subtle. Next, we observe case (b) of the misspecified model, in which the conditional distribution given $Y=1$ is contaminated with a normal distribution ${\tt Nor}(\sigma\mu_{0},{\rm I})$ with mixing ratio $0.1$. Under this setting, the $\gamma$-estimator $(\gamma=0.8)$ and the HM-estimator were substantially robust, while the ML-estimator and the GM-estimator were sensitive to the misspecification. Upon closer observation, it becomes apparent that the $\gamma$-estimator $(\gamma=0.8)$ and the HM-estimator were superior to the ML-estimator and the GM-estimator in bias rather than in variance, as shown in Figure 2.4. This observation is consistent with what Proposition 10 asserts: the $\gamma$-estimator has the boundedness property if $\gamma<-1$ or $\gamma>0$. Indeed, the ML-estimator, the GM-estimator and the HM-estimator equal the $\gamma$-estimators with $\gamma=0,-1,-2$, respectively; among these, only the HM-estimator ($\gamma=-2$) satisfies the condition.
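The comparison above can be reproduced in outline. The following sketch uses hypothetical sample sizes and outlier locations of our own choosing (not the exact setting of Table 2.3): it fits the logistic model by minimizing the log-loss and the $\gamma$-loss with SciPy, initializing the $\gamma$-fit at the ML solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Scenario (b)-style data: two Gaussian classes plus gross outliers labelled y = 1
X = np.vstack([rng.normal((0.5, 0.5), 1.0, (100, 2)),     # positives
               rng.normal((-0.5, -0.5), 1.0, (100, 2)),   # negatives
               rng.normal((-3.0, -3.0), 0.3, (20, 2))])   # positive outliers
y = np.concatenate([np.ones(100), np.zeros(100), np.ones(20)])
Z = np.hstack([np.ones((X.shape[0], 1)), X])              # prepend intercept

def ml_loss(th):
    z = Z @ th
    return np.mean(np.logaddexp(0.0, z) - y * z)          # negative log-likelihood

def gamma_loss(th, gamma=0.8):
    u = (gamma + 1.0) * (Z @ th)                          # gamma-expression margin
    g = gamma / (gamma + 1.0)
    return -np.mean(np.exp(g * (y * u - np.logaddexp(0.0, u)))) / gamma

th_ml = minimize(ml_loss, np.zeros(3), method="BFGS").x
th_g = minimize(gamma_loss, th_ml, method="BFGS").x

true = np.array([0.0, 1.0, 1.0])
err_ml = np.linalg.norm(th_ml - true)
err_g = np.linalg.norm(th_g - true)
```

In this synthetic run, the gross outliers flatten the ML slope, while the $\gamma$-estimate stays near the true $(\theta_{0},\theta_{1}^{\top})=(0,1,1)$.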

Table 2.3: Comparison between the ML-estimator and the γ\gamma-estimator.

(a). The case of specified model

Method estimate rmse
ML-estimator (0.011,1.021,1.014)({0.011,1.021,1.014}) 0.3410.341
γ\gamma-estimate (0.012,1.062,1.045)({0.012,1.062,1.045}) 0.4070.407
GM-estimate (0.009,1.031,1.029)({0.009,1.031,1.029}) 0.3650.365
HM-estimate (0.013,1.051,1.037)({0.013,1.051,1.037}) 0.3900.390

(b). The case of misspecified model

Method estimate rmse
ML-estimator (0.102,0.481,0.503)({0.102,0.481,0.503}) 0.7580.758
γ\gamma-estimate (0.081,0.889,0.911)({0.081,0.889,0.911}) 0.4410.441
GM-estimate (0.161,0.428,0.452)({0.161,0.428,0.452}) 0.8390.839
HM-estimate (0.070,0.862,0.885)({-0.070,0.862,0.885}) 0.4640.464
Refer to caption
Figure 2.4: Box-whisker Plots of the ML-estimator and the γ\gamma-estimator (γ=0.8)(\gamma=0.8), GM-estimator, HM-estimator

Second, we consider a model of a covariate-conditional distribution of YY. Assume that XX follows a standard normal distribution 𝙽𝚘𝚛(0,I2){\tt Nor}(0,I_{2}) and a conditional distribution of YY given X=xX=x follows a logistic model

p(y|x,θ)=exp{y(θ1x+θ0)}1+exp(θ1x+θ0).\displaystyle p(y|x,\theta)=\frac{\exp\{y(\theta_{1}^{\top}x+\theta_{0})\}}{1+\exp(\theta_{1}^{\top}x+\theta_{0})}.

The simulation was conducted based on a scenario as follows.

(a). Specified model: $Y_{i}\,|\,(X_{i}=x)\sim{\tt Ber}(p(1|x,\theta))$.

(b). Misspecified model: $Y_{i}\,|\,(X_{i}=x)\sim(1-\epsilon)\,{\tt Ber}(p(1|x,\theta))+\epsilon\,{\tt Ber}(p(1|x,\theta_{\rm out}))$.

Here parameters were set as $(\theta_{0},\theta_{1}^{\top})=(0.0,1.0,1.0)$.

Similarly, a comparison among the ML-estimator $\hat{\theta}_{0}$, the $\gamma$-estimator $\hat{\theta}_{\gamma}$ with $\gamma=0.8$, the GM-estimator $\hat{\theta}_{\rm GM}$ and the HM-estimator $\hat{\theta}_{\rm HM}$ was conducted with $100$ replications. See Table 2.4. In case (a), the ML-estimator was slightly superior to the other estimators. In case (b), the $\gamma$-estimator $(\gamma=0.8)$ and the HM-estimator were more robust, while the ML-estimator and the GM-estimator were sensitive, the same tendency as in the outcome-conditional model.

Table 2.4: Comparison between the ML-estimator and the γ\gamma-estimator.

(a). The case of specified model

Method estimate rmse
ML-estimator (0.011,1.021,1.014)({0.011,1.021,1.014}) 0.3410.341
γ\gamma-estimate (0.012,1.062,1.045)({0.012,1.062,1.045}) 0.4070.407
GM-estimate (0.009,1.031,1.029)({0.009,1.031,1.029}) 0.3650.365
HM-estimate (0.013,1.051,1.037)({0.013,1.051,1.037}) 0.3900.390

(b). The case of misspecified model

Method estimate rmse
ML-estimator (0.102,0.481,0.503)({0.102,0.481,0.503}) 0.7580.758
γ\gamma-estimate (0.081,0.889,0.911)({0.081,0.889,0.911}) 0.4410.441
GM-estimate (0.161,0.428,0.452)({0.161,0.428,0.452}) 0.8390.839
HM-estimate (0.070,0.862,0.885)({-0.070,0.862,0.885}) 0.4640.464

2.6 Multiclass logistic regression

We consider a situation where an outcome variable YY has a value in 𝒴={0,,k}{\mathcal{Y}}=\{0,...,k\} and a covariate XX with a value in a subset 𝒳{\mathcal{X}} of d\mathbb{R}^{d}. The probability distribution is given by a probability mass function (pmf)

p(y,π)=j=0kπj,𝕀(y=j)\displaystyle p(y,\pi)=\prod_{j=0}^{k}\pi_{j}{}^{{\mathbb{I}}(y=j)},

which is referred to as the categorical distribution ${\tt Cat}(\pi)$, where $\pi=(\pi_{j})_{j=1}^{k}$ is the probability vector $(P(Y=j))_{j=1}^{k}$, with $\pi_{0}=1-\sum_{j=1}^{k}\pi_{j}$.

Remark 2.

We begin with a simple case of estimating $\pi$ without any covariates. Let $\{Y_{i}\}_{1\leq i\leq n}$ be a random sample drawn from ${\tt Cat}(\pi)$. Then, the estimators discussed here equal the observed frequency vector as follows. First of all, the ML-estimator is the observed frequency vector with components $(\hat{\pi}_{0},...,\hat{\pi}_{k})$, where $\hat{\pi}_{j}=(1/n)\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=j)$. Next, the $\gamma$-loss function

Lγ(π,C)=1n1γi=1nj=0kπjγ𝕀(Yi=j)(j=0kπj)γ+1γγ+1\displaystyle L_{\gamma}(\pi,C)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\frac{\prod_{j=0}^{k}\pi_{j}{}^{\gamma{\mathbb{I}}(Y_{i}=j)}}{\big{(}\sum_{j=0}^{k}\pi_{j}{}^{\gamma+1}\big{)}^{\frac{\gamma}{\gamma+1}}}

is written as $-\frac{1}{\gamma}\sum_{j=0}^{k}\hat{\pi}_{j}\pi_{j}^{\gamma}/\big(\sum_{j=0}^{k}\pi_{j}^{\gamma+1}\big)^{\frac{\gamma}{\gamma+1}}$. We observe

Lγ(π,C)=Dγ(𝙲𝚊𝚝(π^),𝙲𝚊𝚝(π))\displaystyle L_{\gamma}(\pi,C)=D_{\gamma}({\tt Cat}(\hat{\pi}),{\tt Cat}(\pi))

up to a constant. Therefore, the $\gamma$-estimator for $\pi$ is equal to $\hat{\pi}$ for all $\gamma$. Similarly, the $\beta$-estimator is equal to $\hat{\pi}$ for all $\beta$. However, the $\alpha$-estimator does not satisfy this except in the limit of $\alpha$ to $0$, that is, the ML-estimator.
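Remark 2 can be verified numerically: minimizing the $\gamma$-loss over the probability simplex recovers the observed frequency vector. A minimal sketch (the softmax parametrization of the simplex is our own device, and the frequencies are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

pi_hat = np.array([0.5, 0.3, 0.2])   # observed frequency vector (k = 2)

def gamma_loss(logits, gamma=0.8):
    """-(1/gamma) sum_j pihat_j pi_j^gamma / (sum_j pi_j^{gamma+1})^{gamma/(gamma+1)}."""
    p = softmax(logits)
    return -(pi_hat @ p**gamma) / (p**(gamma + 1.0)).sum()**(gamma / (gamma + 1.0)) / gamma

pi_opt = softmax(minimize(gamma_loss, np.zeros(3), method="BFGS").x)
```

The minimizer agrees with $\hat{\pi}$, since the $\gamma$-cross entropy $H_{\gamma}(\hat{\pi},\pi)$ is minimized exactly when $\pi\propto\hat{\pi}$.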

We return to the discussion of the regression model with a covariate vector $X$. A multiclass logistic regression model is defined by a softmax function as the link function from the systematic component $\eta$ to the random component. The conditional pmf given $X=x$ is given by

p(y|x,θ)={11+j=1kexp(ηj) if y=0,exp(ηy)1+j=1kexp(ηj) if y=1,,k,\displaystyle p(y|x,\theta)=\left\{\begin{array}[]{cl}\displaystyle{\frac{1}{1+\sum_{j=1}^{k}\exp(\eta_{j})}}&\text{ if }y=0,\\[14.22636pt] \displaystyle{\frac{\exp(\eta_{y})}{1+\sum_{j=1}^{k}\exp(\eta_{j})}}&\text{ if }y=1,...,k\end{array}\right., (2.44)

which is referred to as a multinomial logistic model, where $\theta=(\theta_{1},...,\theta_{k})^{\top}$ and $\eta_{j}=\theta_{j}^{\top}x$. The KL-divergence between categorical distributions is given by

D0(𝙲𝚊𝚝(π),𝙲𝚊𝚝(ρ))=j=0kπjlogπjρj.\displaystyle D_{0}({\tt Cat}(\pi),{\tt Cat}(\rho))=\sum_{j=0}^{k}\pi_{j}\log\frac{\pi_{j}}{\rho_{j}}.

For a given dataset {(Xi,Yi)}i=1,,n\{(X_{i},Y_{i})\}_{i=1,...,n}, the negative log-likelihood function is given by

\displaystyle L_{0}(\theta;C)=-\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\theta_{Y_{i}}^{\top}X_{i}-\log\Big{\{}1+\sum_{j=1}^{k}\exp(\theta_{j}^{\top}X_{i})\Big{\}}\bigg{]}

where we set $\theta_{y}=0$ if $y=0$. The $j$-th component of the likelihood equation is written as

S0(θ;C)j=1ni=1n{𝕀(Yi=j)exp(θjXi)1+l=1kexp(θlXi)}Xi=0.\displaystyle{S}_{0}{}_{j}(\theta;C)=-\frac{1}{n}\sum_{i=1}^{n}\Big{\{}{\mathbb{I}}(Y_{i}=j)-\frac{\exp(\theta_{j}{}^{\top}X_{i})}{1+\sum_{l=1}^{k}\exp(\theta_{l}{}^{\top}X_{i})}\Big{\}}X_{i}=0.

for j=1,,kj=1,...,k. The γ\gamma-divergence is given by

\displaystyle D_{\gamma}({\tt Cat}(\pi),{\tt Cat}(\rho))=-\frac{1}{\gamma}\frac{\sum_{j=0}^{k}\pi_{j}\rho_{j}^{\gamma}}{\big{(}\sum_{j=0}^{k}\rho_{j}^{\gamma+1}\big{)}^{\frac{\gamma}{\gamma+1}}}+\frac{1}{\gamma}\bigg{(}\sum_{j=0}^{k}\pi_{j}^{\gamma+1}\bigg{)}^{\frac{1}{\gamma+1}}.

We remark that the γ\gamma-expression defined in (2.12) is given by

p(γ)(y|x,θ)=p(y|x,(γ+1)θ),\displaystyle p^{(\gamma)}(y|x,\theta)=p(y|x,(\gamma+1)\theta),

where $p(y|x,\theta)$ is in the multinomial logistic model (2.44). Hence, the $\gamma$-loss function is given by

\displaystyle L_{\gamma}(\theta;C)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\{p(Y_{i}|X_{i},(\gamma+1)\theta)\}^{\frac{\gamma}{\gamma+1}}

and the γ\gamma-estimating equation is written as

\displaystyle{S}_{\gamma j}(\theta;C):=\frac{1}{n}\sum_{i=1}^{n}{S}_{\gamma j}(X_{i},Y_{i},\theta;C)=0,

where ${S}_{\gamma j}(X,Y,\theta;C)$ is defined by

{p(Y|X,(γ+1)θ)}γγ+1{𝕀(Y=j)p(j|X,(γ+1)θ)}X.\displaystyle\{p(Y|X,(\gamma+1)\theta)\}^{\frac{\gamma}{\gamma+1}}\big{\{}{\mathbb{I}}(Y=j)-p(j|X,(\gamma+1)\theta)\big{\}}X. (2.45)
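As a sanity check on (2.45), the following sketch implements the multinomial logistic pmf (2.44) and the per-sample $\gamma$-estimating function; at $\gamma\approx 0$ it reduces to the ordinary likelihood score $\{{\mathbb{I}}(Y=j)-p(j|X,\theta)\}X$. The function names and parameter values are our own.

```python
import numpy as np

def probs(x, theta):
    """p(y|x,theta) of (2.44); theta has one row per class j = 1..k, class 0 is baseline."""
    eta = np.concatenate([[0.0], theta @ x])   # eta_0 = 0
    e = np.exp(eta - eta.max())                # stabilized softmax
    return e / e.sum()

def gamma_score(x, y_lab, j, theta, gamma):
    """j-th component S_{gamma j}(x, y, theta) of (2.45)."""
    p = probs(x, (gamma + 1.0) * theta)
    return p[y_lab]**(gamma / (gamma + 1.0)) * ((y_lab == j) - p[j]) * x

x = np.array([1.0, -0.5])
theta = np.array([[0.3, -0.2], [0.1, 0.4]])    # k = 2 non-baseline classes
ml = (1.0 - probs(x, theta)[1]) * x            # likelihood score for j = 1, Y = 1
near_ml = gamma_score(x, 1, 1, theta, 1e-8)    # gamma near 0 recovers it
```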

The GM-divergence between categorical distributions is given by

DGM(𝙲𝚊𝚝(π),𝙲𝚊𝚝(ρ);R)=y=0kπyρyr(y)y=0kρyr(y)y=0kπyr(y),\displaystyle D_{\rm GM}({\tt Cat}(\pi),{\tt Cat}(\rho);R)=\sum_{y=0}^{k}\frac{\pi_{y}}{\rho_{y}}r(y)\prod_{y=0}^{k}\rho_{y}^{r(y)}-\prod_{y=0}^{k}\pi_{y}^{r(y)},

where the reference distribution RR is chosen by 𝙲𝚊𝚝(r){\tt Cat}(r). Hence, the GM-loss function is given by

\displaystyle L_{\rm GM}(\theta;R)=\frac{1}{n}\sum_{i=1}^{n}r(Y_{i})\exp\{(\bar{\theta}_{R}-\theta_{Y_{i}})^{\top}X_{i}\},

where $\bar{\theta}_{R}=\sum_{j=1}^{k}r(j)\theta_{j}$. We will see later that the GM-loss is closely related to the exponential loss in the multiclass AdaBoost algorithm. The GM-estimating function is given by

\displaystyle{S}_{{\rm GM}\,j}(\theta;R)=\frac{1}{n}\sum_{i=1}^{n}r(Y_{i})\exp\{(\bar{\theta}_{R}-\theta_{Y_{i}})^{\top}X_{i}\}\{r(Y_{i})-{\mathbb{I}}(Y_{i}=j)\}X_{i}.

Finally, the HM-divergence is

DHM(𝙲𝚊𝚝(π),𝙲𝚊𝚝(ρ))=y=0kπy(1ρy)2y=0kπy(1πy)2.\displaystyle D_{\rm HM}({\tt Cat}(\pi),{\tt Cat}(\rho))=\sum_{y=0}^{k}\pi_{y}(1-\rho_{y})^{2}-\sum_{y=0}^{k}\pi_{y}(1-\pi_{y})^{2}.

The HM-loss function is derived as

\displaystyle L_{\rm HM}(\theta)=\frac{1}{2n}\sum_{i=1}^{n}\{p(Y_{i}|X_{i},-\theta)\}^{2}

for the logistic model (2.44), noting $p^{(-2)}(y|x,\theta)=p(y|x,-\theta)$. This is the sum of the squared probabilities of the inverse labels. Hence, the HM-estimating function is written as

SHM(θ)=1ni=1np(Yi|Xi,θ)θp(Yi|Xi,θ).\displaystyle{S}_{\rm HM}(\theta)=\frac{1}{n}\sum_{i=1}^{n}p(Y_{i}|X_{i},-\theta)\frac{\partial}{\partial\theta}p(Y_{i}|X_{i},-\theta).
Refer to caption
Figure 2.5: Plots of contours of KL, GM and HM divergence measures.

Let us have a brief look at the behavior of the $\gamma$-estimating function ${S}_{\gamma}(\theta;C)$ in the presence of misspecification of the parametric model in the multiclass logistic distribution (2.44). Basically, most of the properties are similar to those in the Bernoulli logistic model.

Proposition 11.

Consider the γ\gamma-estimating function under a multiclass logistic model (2.44). Assume γ>0\gamma>0 or γ<1\gamma<-1. Then,

\sup_{x\in{\mathcal{X}}}|\theta_{j}^{\top}{S}_{\gamma j}(\theta;C)|\leq\delta_{\gamma j}, (2.46)

where

\displaystyle\delta_{\gamma j}=\sup_{s\in\mathbb{R}^{k}}\ \sum_{l\neq j}\frac{|s_{j}|}{|\gamma+1|}[\{f_{j}(s)\}^{\frac{\gamma}{\gamma+1}}f_{l}(s)+\{f_{l}(s)\}^{\frac{\gamma}{\gamma+1}}f_{j}(s)] (2.47)

with fj(s)=exp(sj)/{1+l=1kexp(sl)}.f_{j}(s)=\exp(s_{j})/\{1+\sum_{l=1}^{k}\exp(s_{l})\}.

Proof.

We confirm that $\delta_{\gamma j}$ is finite if $\gamma<-1$ or $\gamma>0$. It is written from (2.45) that

|θjSγ(θ;C)j|1ni=1n[{p(j|Xi,(γ+1)θ)}γγ+1{1p(j|Xi,(γ+1)θ)}\displaystyle\big{|}\theta_{j}^{\top}{S}_{\gamma}{}_{j}(\theta;C)\big{|}\leq\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\{p(j|X_{i},(\gamma+1)\theta)\}^{\frac{\gamma}{\gamma+1}}\big{\{}1-p(j|X_{i},(\gamma+1)\theta)\big{\}}
+ljp(l|Xi,(γ+1)θ)γγ+1p(j|Xi,(γ+1)θ)]|θjXi|.\displaystyle\hskip 73.97716pt+\sum_{l\neq j}p(l|X_{i},(\gamma+1)\theta)^{\frac{\gamma}{\gamma+1}}p(j|X_{i},(\gamma+1)\theta)\bigg{]}|\theta_{j}^{\top}X_{i}|. (2.48)

Hence, writing $S_{ij}=(\gamma+1)\theta_{j}^{\top}X_{i}$, we have

\displaystyle\big{|}\theta_{j}^{\top}{S}_{\gamma j}(\theta;C)\big{|}\leq\frac{1}{n}\sum_{i=1}^{n}\sum_{l\neq j}\bigg{[}\{f_{j}(S_{i})\}^{\frac{\gamma}{\gamma+1}}f_{l}(S_{i})+\{f_{l}(S_{i})\}^{\frac{\gamma}{\gamma+1}}f_{j}(S_{i})\bigg{]}\frac{|S_{ij}|}{|\gamma+1|},

where $S_{i}=(S_{i1},...,S_{ik})$. Taking the supremum of the right-hand side over all $S_{i}$'s concludes (2.46). ∎

The $j$-th linear predictor is written as $\theta_{j}^{\top}x=\theta_{1j}^{\top}x_{1}+\theta_{0j}$ with slope vector $\theta_{1j}$ and intercept $\theta_{0j}$. The $j$-th decision boundary is given by

\displaystyle{{H}(\theta_{j})}=\{x\in{\mathcal{X}}:\theta_{1j}^{\top}x_{1}+\theta_{0j}=0\}. (2.49)

In the context of prediction, a predictor of the label $Y$ based on a given feature vector $x$ is given by

f(x)=argmaxy𝒴θyx,\displaystyle f(x)=\mathop{\rm argmax}_{y\in{\mathcal{Y}}}\theta_{y}^{\top}x,

which is equal to the Bayes rule under the multiclass logistic model, where $\theta_{0}=0$ in the parametrization as in (2.44). We observe, through a discussion similar to Proposition 9 for the Bernoulli logistic model, that $\theta_{j}^{\top}\mathbb{E}_{Q}[{S}_{\gamma j}(\theta,X,Y)|X=x]$ is uniformly bounded in $x\in{\mathcal{X}}$ even under any misspecified distribution $Q$ outside the parametric model. Therefore, we conclude that the $\gamma$-estimator has such a stable behavior for the whole sample $\{(X_{i},Y_{i})\}_{1\leq i\leq n}$ if $\gamma$ is in the range $(-\infty,-1)\cup(0,\infty)$. The ML-estimator and the GM-estimator equal the $\gamma$-estimators with $\gamma=0$ and $\gamma=-1$, respectively. Both values lie outside this range, which suggests that they suffer from unboundedness.

We next study ordinal regression, also known as ordinal classification. Consider an ordinal outcome YY having values in 𝒴={0,,k}{\cal Y}=\{0,...,k\}. The probability of YY falling into a certain category yy or lower is modeled as

(Yy|X,θ)=exp(θ0y+θ1X)1+exp(θ0y+θ1X)\displaystyle{\mathbb{P}}(Y\leq y|X,\theta)=\frac{\exp(\theta_{0y}+\theta_{1}^{\top}X)}{1+\exp(\theta_{0y}+\theta_{1}^{\top}X)} (2.50)

for $y=0,...,k-1$, where $\theta=(\theta_{00},...,\theta_{0,k-1},\theta_{1})$. The model (2.50) is referred to as the ordinal logistic model, noting ${\mathbb{P}}(Y\leq y|X,\theta)=F(\theta_{0y}+\theta_{1}^{\top}X)$ with the logistic distribution function $F(z)=\exp(z)/(1+\exp(z))$. Here, the thresholds are assumed to satisfy $\theta_{00}\leq\cdots\leq\theta_{0,k-1}$ to ensure that the probability statement (2.50) makes sense. Each threshold $\theta_{0y}$ effectively sets a boundary point on the latent continuous scale, beyond which the likelihood of higher category outcomes increases. The difference between consecutive thresholds also gives insight into the "distance" or discrimination between adjacent categories on the latent scale, governed by the predictors.

For given $n$ observations $\{(X_{i},Y_{i})\}_{i=1}^{n}$, the negative log-likelihood function is

L0(θ)=i=1nlogp(Yi|Xi,θ)\displaystyle L_{0}(\theta)=-\sum_{i=1}^{n}\log p(Y_{i}|X_{i},\theta)

where $p(y|x,\theta)=F(\theta_{0y}+\theta_{1}^{\top}x)-F(\theta_{0,y-1}+\theta_{1}^{\top}x)$. Similarly, the $\gamma$-loss function can be given in a straightforward manner. However, these loss functions seem complicated since the conditional probability $p(y|x,\theta)$ is introduced indirectly as a difference between the cumulative distribution functions $F(\theta_{0y}+\theta_{1}^{\top}x)$'s.

To address this issue, each threshold is treated as a separate binarized response, effectively turning the ordinal regression problem into multiple binary regression problems. Let $P(y)$ and $F(y)$ be cumulative distribution functions on $\cal Y$. We define a dichotomized cross entropy

\displaystyle H_{0}^{\rm(d)}(P,F)=\sum_{y=0}^{k}\big[P(y)\log F(y)+(1-P(y))\log(1-F(y))\big].

This is a sum of cross entropies between the Bernoulli distributions ${\tt Ber}(P(y))$ and ${\tt Ber}(F(y))$. The KL divergence is given as $D_{0}^{\rm(d)}(P,F)=H_{0}^{\rm(d)}(P,P)-H_{0}^{\rm(d)}(P,F)$. Thus, the dichotomized log-likelihood function is given by

L0(d)(θ)=i=1ny=0kZiylogF(θ0y+θXi)+(1Ziy)log{1F(θ0y+θXi)},\displaystyle L_{0}^{\rm(d)}(\theta)=\sum_{i=1}^{n}\sum_{y=0}^{k}Z_{iy}\log F(\theta_{0y}+\theta^{\top}X_{i})+(1-Z_{iy})\log\{1-F(\theta_{0y}+\theta^{\top}X_{i})\},

where Ziy=I(Yiy)Z_{iy}={\rm I}(Y_{i}\leq y). Note 𝔼[L0(d)(θ)]=H0(d)(P,F(,θ))\mathbb{E}[L_{0}^{\rm(d)}(\theta)]=H_{0}^{\rm(d)}(P,F(\cdot,\theta)), where 𝔼\mathbb{E} denotes the expectation under the distribution PP and F(y,θ)=F(θ0y+θx)F(y,\theta)=F(\theta_{0y}+\theta^{\top}x). Under the ordinal logistic model (2.50),

\displaystyle L_{0}^{\rm(d)}(\theta)=\sum_{i=1}^{n}\sum_{y=0}^{k}\left[Z_{iy}(\theta_{0y}+\theta^{\top}X_{i})-\log\{1+\exp(\theta_{0y}+\theta^{\top}X_{i})\}\right].

On the other hand, the dichotomized γ\gamma-loss function is given by

Lγ(d)(θ)=1γi=1ny=0k[Ziy{F(γ)(θ0y+θXi)}γγ+1+(1Ziy){1F(γ)(θ0y+θXi)}γγ+1],\displaystyle L_{\gamma}^{\rm(d)}(\theta)=-\frac{1}{\gamma}\sum_{i=1}^{n}\sum_{y=0}^{k}\left[Z_{iy}\{F^{(\gamma)}(\theta_{0y}+\theta^{\top}X_{i})\}^{\frac{\gamma}{\gamma+1}}+(1-Z_{iy})\{1-F^{(\gamma)}(\theta_{0y}+\theta^{\top}X_{i})\}^{\frac{\gamma}{\gamma+1}}\right],

where F(γ)(A)F^{(\gamma)}(A) is the γ\gamma-expression for F(A)F(A), that is,

F(γ)(A)={F(A)}γ+1F(A)γ+1+{1F(A)}γ+1.\displaystyle F^{(\gamma)}(A)=\frac{\{F(A)\}^{\gamma+1}}{F(A)^{\gamma+1}+\{1-F(A)\}^{\gamma+1}}.

Under the ordinal logistic model (2.50),

\displaystyle L_{\gamma}^{\rm(d)}(\theta)=-\frac{1}{\gamma}\sum_{i=1}^{n}\sum_{y=0}^{k}\left\{\frac{\exp\{Z_{iy}(\gamma+1)(\theta_{0y}+\theta^{\top}X_{i})\}}{1+\exp\{(\gamma+1)(\theta_{0y}+\theta^{\top}X_{i})\}}\right\}^{\frac{\gamma}{\gamma+1}}.

If γ\gamma is taken a limit to 1-1, then it is reduced the GM-loss function

LGM(d)(θ,C)=i=1ny=0kexp{(12Ziy)(θ0y+θXi)};\displaystyle L_{\rm GM}^{\rm(d)}(\theta,C)=\sum_{i=1}^{n}\sum_{y=0}^{k}\exp\{(\small\mbox{$\frac{1}{2}$}-Z_{iy})(\theta_{0y}+\theta^{\top}X_{i})\};

and if $\gamma=-2$, it reduces to the HM-loss function

LHM(d)(θ)=12i=1ny=0k{exp{(1Ziy)(θ0y+θXi)}1+exp(θ0y+θXi)}2.\displaystyle L_{\rm HM}^{\rm(d)}(\theta)=\frac{1}{2}\sum_{i=1}^{n}\sum_{y=0}^{k}\left\{\frac{\exp\{(1-Z_{iy})(\theta_{0y}+\theta^{\top}X_{i})\}}{1+\exp(\theta_{0y}+\theta^{\top}X_{i})}\right\}^{2}.
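The dichotomization above can be checked in code. The sketch below builds the binarized responses $Z_{iy}={\rm I}(Y_{i}\leq y)$ for a toy ordinal dataset and confirms that the generic dichotomized log-likelihood with logistic $F$ agrees with its simplified closed form; the threshold values and data are illustrative, and the sum runs over $y=0,...,k-1$ since $P(Y\leq k)=1$ contributes nothing.

```python
import numpy as np

F = lambda z: 1.0 / (1.0 + np.exp(-z))         # logistic cdf

th0 = np.array([-1.0, 0.0, 1.0])               # thresholds theta_{0y}, y = 0,1,2 (k = 3)
slope = np.array([0.5])
Xs = np.array([[-2.0], [0.3], [1.5], [0.0]])   # toy covariates
Ys = np.array([0, 1, 3, 2])                    # ordinal outcomes in {0,...,3}

A = th0[None, :] + Xs @ slope[:, None]         # theta_{0y} + theta^T x_i, shape (n, k)
Z = (Ys[:, None] <= np.arange(3)[None, :])     # Z_{iy} = I(Y_i <= y)

# generic dichotomized log-likelihood vs. its logistic closed form
generic = np.sum(Z * np.log(F(A)) + (~Z) * np.log(1.0 - F(A)))
simplified = np.sum(Z * A - np.log1p(np.exp(A)))
```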
Remark 3.

Let us discuss an extension of the dichotomized loss functions to a setting where the outcome space $\cal Y$ is a subset of $\mathbb{R}^{d}$. Consider a partition of ${\cal Y}$ such that ${\cal Y}=\oplus_{j=1}^{k}B_{j}$. Then, the model is reduced to a categorical distribution ${\tt Cat}(\pi(x,\theta))$, where $\pi(x,\theta)=(\pi_{1}(x,\theta),...,\pi_{k}(x,\theta))$ with $\pi_{j}(x,\theta)=\int_{B_{j}}p(y|x,\theta){\rm d}\Lambda(y)$. The cross entropy is reduced to

H0(d)(π,π(x,θ))=j=1kπjlogπj(x,θ)\displaystyle H_{0}^{\rm(d)}(\pi,\pi(x,\theta))=\sum_{j=1}^{k}\pi_{j}\log\pi_{j}(x,\theta)

and the negative log-likelihood function is reduced to

\displaystyle L_{0}^{\rm(d)}(\theta)=-\sum_{i=1}^{n}\sum_{j=1}^{k}{\rm I}(Y_{i}\in B_{j})\log\pi_{j}(X_{i},\theta).

Similarly, the γ\gamma-cross entropy is reduced to

\displaystyle H_{\gamma}^{\rm(d)}(\pi,\pi(x,\theta))=-\frac{1}{\gamma}\sum_{j=1}^{k}\pi_{j}\left\{\frac{\pi_{j}(x,\theta)^{\gamma+1}}{\sum_{j^{\prime}=1}^{k}\pi_{j^{\prime}}(x,\theta)^{\gamma+1}}\right\}^{\frac{\gamma}{\gamma+1}}

and the γ\gamma-loss function is reduced to

\displaystyle L_{\gamma}^{\rm(d)}(\theta)=-\frac{1}{\gamma}\sum_{i=1}^{n}\sum_{j=1}^{k}{\rm I}(Y_{i}\in B_{j})\left\{\frac{\pi_{j}(X_{i},\theta)^{\gamma+1}}{\sum_{j^{\prime}=1}^{k}\pi_{j^{\prime}}(X_{i},\theta)^{\gamma+1}}\right\}^{\frac{\gamma}{\gamma+1}}.

There are several parametric models similar to the present one, including ordered probit models, the continuation-ratio model and the adjacent-categories logit model. The coefficients in ordinal regression models tell us about the change in the odds of being in a higher-ordered category as the predictor increases. Importantly, because of the ordered nature of the outcomes, the interpretation of these coefficients is tied not just to changes between specific categories but to changes across the order of categories. Ordinal regression is useful in fields like the social sciences, marketing, and health sciences, where rating scales (such as agreement, satisfaction, or pain scales) are common and the assumption of equidistant categories is not reasonable. This method respects the order within the categories, which would be ignored in standard multiclass approaches.

2.7 Poisson regression model

The Poisson regression model is a member of the generalized linear model (GLM) family, which is typically used for count data. When the outcome variable is a count (i.e., the number of times an event occurs), the Poisson regression model is a suitable approach to analyze the relationship between the count and explanatory variables. The key assumptions behind the Poisson regression model are that the mean and variance of the outcome variable are equal and that the observations are independent of each other. The primary objective of Poisson regression is to model the expected count of an event occurring given a set of explanatory variables. The model provides a framework to estimate the log rate of events, which can be back-transformed to provide an estimate of the event count at different levels of the explanatory variables.

Let YY be a response variable having a value in 𝒴={0,1,}{\mathcal{Y}}=\{0,1,...\} and XX be a covariate variable with a value in a subset 𝒳{\mathcal{X}} of d\mathbb{R}^{d}. A Poisson distribution 𝙿𝚘(λ){\tt Po}(\lambda) with an intensity parameter λ\lambda has a probability mass function (pmf) given by

p(y,λ)=λyy!exp(λ)\displaystyle p(y,\lambda)=\frac{\lambda^{y}}{y!}\exp(-\lambda)

for $y$ of $\mathcal{Y}$. A Poisson regression model for a count $Y$ given $X=x$ is defined by the probability distribution $P(\cdot|x,\theta)$ with pmf

p(y|x,θ)=1y!exp{yθxexp(θx)}.\displaystyle p(y|x,\theta)=\frac{1}{y!}\exp\{y\theta^{\top}x-\exp(\theta^{\top}x)\}. (2.51)

The link function relating the regression function to the canonical variable is the logarithmic function, $g(\lambda)=\log\lambda$, in which case (2.51) is referred to as a log-linear model. The likelihood principle gives the negative log-likelihood function

L0(θ)=1ni=1n{YiθXiexp(θXi)logYi!}.\displaystyle L_{0}(\theta)=-\frac{1}{n}\sum_{i=1}^{n}\{Y_{i}\theta^{\top}X_{i}-\exp(\theta^{\top}X_{i})-\log Y_{i}!\}.

for a given dataset $\{(X_{i},Y_{i})\}_{i=1}^{n}$. Here the term $\log Y_{i}!$ can be neglected since it is constant in $\theta$. In effect, the estimating function is given by

S0(θ)=1ni=1n{Yiexp(θXi)}Xi.\displaystyle{S}_{0}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\{Y_{i}-\exp(\theta^{\top}X_{i})\}X_{i}.
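A quick way to validate the pair $(L_{0},S_{0})$ is to check that $S_{0}$ is the negative gradient of $L_{0}$; a minimal sketch with synthetic data of our own choosing (the constant $\log Y_{i}!$ is dropped):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([np.ones((50, 1)), rng.normal(0, 0.5, (50, 2))])  # intercept + 2 covariates
theta_true = np.array([0.5, 1.0, -0.5])
Y = rng.poisson(np.exp(X @ theta_true))

def L0(th):
    # negative log-likelihood, dropping the log Y! constant
    return -np.mean(Y * (X @ th) - np.exp(X @ th))

def S0(th):
    # estimating function (1/n) sum_i (Y_i - exp(theta^T X_i)) X_i
    return X.T @ (Y - np.exp(X @ th)) / len(Y)

# central finite differences of L0 should equal -S0
th = np.array([0.3, 0.8, -0.2])
eps = 1e-6
num_grad = np.array([(L0(th + eps * e) - L0(th - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
```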

We see from the general theory of likelihood methods that the ML-estimator is consistent for $\theta$.

Next, we consider the γ\gamma-divergence and its applications to the Poisson model. For this, we fix a reference measure as R=𝙿𝚘(μ)R={\tt Po}(\mu). Then, the RN-derivative of a conditional probability measure P(|x,θ)P(\cdot|x,\theta) with respect to RR is given by

\displaystyle\frac{{\rm d}P(y|x,\theta)}{{\rm d}R}=\mu^{-y}\exp\{y\theta^{\top}x+\mu-\exp(\theta^{\top}x)\}

and hence the γ\gamma-expression for this is given by

p(γ)(y|x,θ)=exp[(γ+1)yθxexp{(γ+1)θx}].\displaystyle p^{(\gamma)}(y|x,\theta)=\exp[(\gamma+1)y\theta^{\top}x-\exp\{(\gamma+1)\theta^{\top}x\}].

The γ\gamma-cross entropy between Poisson distribution is given by

\hskip 8.53581ptH_{\gamma}(P(\cdot|x,\theta_{0}),P(\cdot|x,\theta_{1});R)
\displaystyle=-\frac{1}{\gamma}\exp\Big{[}\exp(\theta^{\top}_{0}x+\gamma\theta^{\top}_{1}x)-\exp(\theta_{0}^{\top}x)-\frac{\gamma}{\gamma+1}\exp\{(\gamma+1)\theta^{\top}_{1}x\}\Big{]},

where $R$ is the reference measure defined by $R(y)=1/y!$ for $y=0,1,\ldots$. Note that this choice of $R$ enables such a tractable form of the entropy. Hence, the $\gamma$-loss function is given by

\displaystyle L_{\gamma}(\theta;R)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\exp\Big{[}\gamma Y_{i}\theta^{\top}X_{i}-\frac{\gamma}{\gamma+1}\exp\{(\gamma+1)\theta^{\top}X_{i}\}\Big{]}.

The estimating function is given by

Sγ(θ;R)=1ni=1nSγ(θ,Xi,Yi;R),\displaystyle{S}_{\gamma}(\theta;R)=\frac{1}{n}\sum_{i=1}^{n}{S}_{\gamma}(\theta,X_{i},Y_{i};R),

where

\displaystyle{S}_{\gamma}(\theta,X,Y;R)=w_{\gamma}(X,Y,\theta)[Y-\exp\{(\gamma+1)\theta^{\top}X\}]X. (2.52)

where

\displaystyle w_{\gamma}(X,Y,\theta)=\exp\Big{[}\gamma Y\theta^{\top}X-\frac{\gamma}{\gamma+1}\exp\{(\gamma+1)\theta^{\top}X\}\Big{]}.
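The weight $w_{\gamma}$ and the estimating function (2.52) can be implemented directly; the sketch below verifies by finite differences that $S_{\gamma}$ is the negative gradient of the $\gamma$-loss $L_{\gamma}$. Data and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.hstack([np.ones((40, 1)), rng.normal(0, 0.4, (40, 1))])
Y = rng.poisson(np.exp(X @ np.array([0.5, 1.0])))
gamma = 0.05

def w(th):
    """Weight w_gamma(X, Y, theta) for every sample."""
    u = X @ th
    return np.exp(gamma * Y * u - gamma / (gamma + 1) * np.exp((gamma + 1) * u))

def L(th):   # gamma-loss
    return -np.mean(w(th)) / gamma

def S(th):   # gamma-estimating function (2.52), averaged over the sample
    u = X @ th
    return X.T @ (w(th) * (Y - np.exp((gamma + 1) * u))) / len(Y)

th = np.array([0.4, 0.8])
eps = 1e-6
num = np.array([(L(th + eps * e) - L(th - eps * e)) / (2 * eps) for e in np.eye(2)])
```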

We investigate the boundedness property of the estimating function.

Proposition 12.

Let

Φγ(x,y,θ)=θSγ(θ,x,y;R),\displaystyle\Phi_{\gamma}(x,y,\theta)=\theta^{\top}{S}_{\gamma}(\theta,x,y;R),

where ${S}_{\gamma}(\theta,x,y;R)$ is the estimating function defined in (2.52). Then, if $\gamma>0$,

supx𝒳|Φγ(x,y,θ)|<\displaystyle\sup_{x\in{\cal X}}|\Phi_{\gamma}(x,y,\theta)|<\infty

for any fixed y𝒴y\in{\cal Y}.

Proof.

By definition, we have

\displaystyle|\Phi_{\gamma}(x,y,\theta)|\leq|\omega|e^{\gamma y\omega-\frac{\gamma}{\gamma+1}e^{(\gamma+1)\omega}}\big{(}y+e^{(\gamma+1)\omega}\big{)},

where ω=θx\omega=\theta^{\top}x. The following limit holds for positive constants c1,c2,c3c_{1},c_{2},c_{3}:

lim|ω||ω|ec1ωc2ec3ω=0\displaystyle\lim_{|\omega|\rightarrow\infty}|\omega|e^{c_{1}\omega-c_{2}e^{c_{3}\omega}}=0 (2.53)

Thus, we immediately observe

lim|ω||ω|eγyωγγ+1e(γ+1)ω(y+e(γ+1)ω)=0\displaystyle\lim_{|\omega|\rightarrow\infty}|\omega|e^{\gamma y\omega-\frac{\gamma}{\gamma+1}e^{(\gamma+1)\omega}}\big{(}y+e^{(\gamma+1)\omega}\big{)}=0

due to (2.53). This concludes that |Φγ(x,y,θ)||\Phi_{\gamma}(x,y,\theta)| is a bounded function in xx for any y𝒴y\in{\cal Y}. ∎
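The limit (2.53) and the resulting bound on $\Phi_{\gamma}$ can be checked numerically; the sketch evaluates the bounding function from the proof at several values of $\omega$ (the constants $y=3$, $\gamma=0.1$ are chosen for illustration):

```python
import math

def bound(omega, y, gamma):
    """|omega| e^{gamma*y*omega - (gamma/(gamma+1)) e^{(gamma+1) omega}} (y + e^{(gamma+1) omega})."""
    a = (gamma + 1.0) * omega
    inner = gamma * y * omega - gamma / (gamma + 1.0) * math.exp(a)
    return abs(omega) * math.exp(inner) * (y + math.exp(a))

# the bound redescends to 0 in both tails of omega = theta^T x
vals = [bound(w, y=3, gamma=0.1) for w in (-30.0, -5.0, 0.0, 5.0, 30.0)]
```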

It is noted that the function in (2.53) has a mild shape, as shown in Figure 2.6. The redescending property is thus characteristic of the $\gamma$-estimating function: the graph rapidly approaches $0$ as the absolute value of the canonical variable $\omega$ increases.

Refer to caption
Figure 2.6: Plots of |ω|ec1ωc2ec3ω|\omega|e^{c_{1}\omega-c_{2}e^{c_{3}\omega}} for (c1,c2,c3)=(1,1,1),(0.8,0.8,0.8),(1.2,1.2,1.2).(c_{1},c_{2},c_{3})=(1,1,1),(0.8,0.8,0.8),(1.2,1.2,1.2).

We remark that Φγ(x,y,θ)\Phi_{\gamma}(x,y,\theta) denotes the margin of the estimating function to the boundary {x𝒳:θx=0}\{x\in{\cal X}:\theta^{\top}x=0\}. The margin is bounded in XX but unbounded in YY, and its behavior is delicate, as seen in Figure 2.7. When γ>0.2\gamma>0.2, the boundedness almost breaks down in a practical numerical sense. The green lines are plotted for the curve {(y,w,0):y=e(γ+1)w}.\{(y,w,0):y=e^{(\gamma+1)w}\}. Thus, the margin vanishes on the green line. The behavior is mild in a region away from the green line when γ\gamma is a small positive value, whereas it is unbounded there when γ\gamma equals zero, that is, in the likelihood case. This suggests a robust and efficient property for the γ\gamma-estimator with a small positive γ\gamma. To check this, we conduct a numerical experiment in which the Poisson log-linear model p(y|x,θ)p(y|x,\theta) in (2.51) is misspecified. The synthetic dataset is generated from a mixture distribution, in which a heterogeneous subgroup is generated from a Poisson distribution p(y|x,θhetero)p(y|x,\theta_{\rm hetero}) with a small proportion π\pi in addition to a normal group from p(y|x,θ)p(y|x,\theta) with proportion 1π1-\pi. Here θhetero\theta_{\rm hetero} is determined from plausible scenarios. We generate XiX_{i}’s from a trivariate normal distribution 𝙽𝚘𝚛(0,0.2I){\tt Nor}(0,0.2\,{\rm I}) and YiY_{i}’s from

(1π)𝙿𝚘(exp(θ1Xi+θ0))+π𝙿𝚘(exp(θhetero1Xi+θ0)).\displaystyle(1-\pi){\tt Po}(\exp(\theta_{1}^{\top}X_{i}+\theta_{0}))+\pi{\tt Po}(\exp(\theta_{{\rm hetero}1}^{\top}X_{i}+\theta_{0})).

Here the intercept is set as θ0=0.5\theta_{0}=0.5 and the slope vector as θ1=(0.5,1.5,1.0)\theta_{1}=(0.5,1.5,-1.0) in the normal group, while the slope vector θhetero1\theta_{{\rm hetero}1} is set as either θ1-\theta_{1} or a zero vector in the minor group. This means that the minor group has a reverse association to the normal group, or no reaction to the covariate. If there is no misspecification above, or equivalently π=0\pi=0, then the ML-estimator performs better than the γ\gamma-estimator. However, the ML-estimator is sensitive to such misspecification, while the γ\gamma-estimator has robust performance; see Table 2.5. Here, the sample size is set as n=100n=100 and the replication number as m=300m=300. The value of γ\gamma is selected as 0.050.05, since larger values of γ\gamma yield unreasonable estimates. This is because the margin of the γ\gamma-estimator has extreme behavior, as noted around Figure 2.7.
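The contaminated data-generating process described above can be sketched as follows; `simulate_mixture` is our name, and the scenario values are those stated in the text.

```python
import numpy as np

def simulate_mixture(n, theta1, theta0, theta_hetero1, pi, rng):
    # X_i ~ Nor(0, 0.2 I); each Y_i comes from the normal Poisson group with
    # probability 1 - pi and from the heterogeneous subgroup with probability pi
    X = rng.normal(0.0, np.sqrt(0.2), size=(n, len(theta1)))
    is_hetero = rng.random(n) < pi
    slope = np.where(is_hetero[:, None], theta_hetero1, theta1)
    mu = np.exp(np.sum(slope * X, axis=1) + theta0)
    return X, rng.poisson(mu)

# scenario (c): reversed association in the minor group
theta1 = np.array([0.5, 1.5, -1.0])
X, Y = simulate_mixture(100, theta1, 0.5, -theta1, 0.3, np.random.default_rng(1))
```

Setting `pi=0` recovers the well-specified case (a), and a zero vector for `theta_hetero1` gives case (b).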

Refer to caption
Figure 2.7: 3D plots of Φγ(x,y,θ)\Phi_{\gamma}(x,y,\theta) against yy and ww where γ=0.05,0.1,0.0\gamma=0.05,0.1,0.0.
Table 2.5: Comparison between the ML-estimator and the γ\gamma-estimator.

(a). The case of π=0\pi=0

Method estimate rmse
ML-estimator (0.50,1.50,1.00,0.49)(0.50,1.50,-1.00,0.49) 0.0620.062
γ\gamma-estimator (0.59,1.40,0.93,0.46)(0.59,1.40,-0.93,0.46) 0.1870.187

(b). The case of π=0.3\pi=0.3 and θhetero1=0.\theta_{{\rm hetero}1}=0.

Method estimate rmse
ML-estimator (0.40,1.32,0.88,0.45)(0.40,1.32,-0.88,0.45) 0.3100.310
γ\gamma-estimator (0.39,1.47,0.97,0.48)(0.39,1.47,-0.97,0.48) 0.2350.235

(c). The case of π=0.3\pi=0.3 and θhetero1=θ1.\theta_{{\rm hetero}1}=-\theta_{1}.

Method estimate rmse
ML-estimator (0.41,1.29,0.88,0.43)(0.41,1.29,-0.88,0.43) 0.3860.386
γ\gamma-estimator (0.40,1.47,0.99,0.48)(0.40,1.47,-0.99,0.48) 0.2930.293

In this section, we focused on the γ\gamma-divergence within the framework of the Poisson regression model. The γ\gamma-divergence provides a robust alternative to the traditional ML-estimator, which is sensitive to model misspecification and outliers. The robustness of the estimator was examined from a geometric viewpoint, highlighting the behavior of the estimating function in the feature space and its relationship with the prediction level set. This work not only contributes to the theoretical understanding of statistical estimation methods but also offers practical insights for their application in various fields, ranging from biostatistics to machine learning. For future work, it would be beneficial to further investigate the theoretical underpinnings of the γ\gamma-divergence in a wider range of statistical models and to explore its application in more complex and high-dimensional data scenarios, including machine learning contexts such as deep learning, transfer learning, multi-task learning and meta-learning.

2.8 Concluding remarks

In this chapter we have provided a comprehensive exploration of MDEs, particularly the γ\gamma-divergence, within regression models; see [70, 74, 76] for other applications in unsupervised learning. This addresses the challenges posed by model misspecification, which can lead to biased estimates and inaccuracies, and proposes MDEs as a robust solution. We have discussed various regression models, including normal, logistic, and Poisson, demonstrating the efficacy of the γ\gamma-divergence in handling outliers and model inconsistencies. In particular, the robustness of the estimator is pursued from a geometric perspective on the estimating function in the feature space. This elucidates the intrinsic relationship between the feature space and the outcome space: the behavior of the estimating function in the product space of the feature and outcome spaces is characterized by the projection length to the prediction level set. The chapter concludes with numerical experiments showcasing the superiority of γ\gamma-estimators over traditional maximum likelihood estimators in certain misspecified models, thereby highlighting the practical benefits of MDEs in statistical estimation and inference. In conclusion, it is important to recognize the significant role of the γ\gamma-divergence in enhancing model robustness against biases and misspecifications. Emphasizing its applicability across different statistical models, the chapter underscores the potential of MDEs to improve the reliability and accuracy of statistical inferences, particularly in complex or imperfect real-world data scenarios. This work not only contributes to the theoretical understanding of statistical estimation methods but also offers practical insights for their application in diverse fields, ranging from biostatistics to machine learning.

For future work, considering the promising results of the γ\gamma-divergence in regression models, it could be beneficial to explore its application in more complex and high-dimensional data scenarios. This includes delving into machine learning contexts, such as deep learning or neural networks, where robustness against data imperfections is crucial. Machine learning is rapidly developing toward generative models for documents, images and movies, in which architectures on a huge scale of high-dimensional vector and matrix computations establish pre-trained models such as large language models. A challenging direction is to incorporate the γ\gamma-divergence approach into such areas, including multi-task learning, transfer learning, meta learning and so forth. For example, transfer learning is important to strengthen the empirical knowledge for the target domain. Few-shot learning is deeply intertwined with transfer learning. In fact, most few-shot learning approaches are based on the principles of transfer learning. The idea is to pre-train a model on a related task with ample data (source domain) and then fine-tune or adapt this model to the new task (target domain) with limited data. This approach leverages the knowledge (features, representations) acquired during the pre-training phase to make accurate predictions in the few-shot scenario. Additionally, investigating the theoretical underpinnings of the γ\gamma-divergence in a wider range of statistical models could further solidify its role as a versatile and robust tool in statistical estimation and inference.

In transfer learning, the goal is to leverage knowledge from a source domain to improve learning in a target domain. The γ\gamma-divergence can be used to ensure robust parameter estimation during this process. Let 𝒮\mathcal{S} be the source domain with distribution PsP_{s} and parameter θs\theta_{s}, and 𝒯\mathcal{T} be the target domain with distribution PtP_{t} and parameter θt\theta_{t}. The objective is to minimize a loss function that incorporates both source and target domains:

L(θs,θt)=L𝒮(θs)+λL𝒯(θt|θs)L(\theta_{s},\theta_{t})=L_{\mathcal{S}}(\theta_{s})+\lambda L_{\mathcal{T}}(\theta_{t}|\theta_{s})

where λ\lambda is a regularization parameter balancing the influence of the source model on the target model. Using γ\gamma-divergence, the loss functions L𝒮(θs)L_{\mathcal{S}}(\theta_{s}) and L𝒯(θt|θs)L_{\mathcal{T}}(\theta_{t}|\theta_{s}) are defined as:

L𝒮(θs)=1nsi=1nsDγ(Ps,i(|θs),Qs)L_{\mathcal{S}}(\theta_{s})=\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}D_{\gamma}(P_{s,i}(\cdot|\theta_{s}),Q_{s})
L𝒯(θt|θs)=1nti=1ntDγ(Pt,i(|θt),Ps,i(|θs))L_{\mathcal{T}}(\theta_{t}|\theta_{s})=\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}D_{\gamma}(P_{t,i}(\cdot|\theta_{t}),P_{s,i}(\cdot|\theta_{s}))

where QsQ_{s} is the empirical distribution in the source domain, and DγD_{\gamma} denotes the γ\gamma-divergence. The gradients for updating the parameters are given by:

θsL(θs,θt)=θsL𝒮(θs)+λθsL𝒯(θt|θs)\nabla_{\theta_{s}}L(\theta_{s},\theta_{t})=\nabla_{\theta_{s}}L_{\mathcal{S}}(\theta_{s})+\lambda\nabla_{\theta_{s}}L_{\mathcal{T}}(\theta_{t}|\theta_{s})
θtL(θs,θt)=λθtL𝒯(θt|θs)\nabla_{\theta_{t}}L(\theta_{s},\theta_{t})=\lambda\nabla_{\theta_{t}}L_{\mathcal{T}}(\theta_{t}|\theta_{s})

These gradients take into account the robustness properties of the γ\gamma-divergence, reducing sensitivity to outliers and model misspecifications.
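The combined transfer loss above can be sketched numerically. The following is a minimal sketch, assuming discrete distributions on a common finite support so that the γ\gamma-divergence reduces to finite sums; `gamma_div` and `transfer_loss` are our names, and the per-observation averaging in the text is suppressed for brevity.

```python
import numpy as np

def gamma_div(p, q, gamma=0.5):
    # log-form gamma-divergence between discrete pmfs; zero iff p is proportional to q
    p, q = np.asarray(p, float), np.asarray(q, float)
    return (np.log(np.sum(p**(1 + gamma))) / (gamma * (1 + gamma))
            - np.log(np.sum(p * q**gamma)) / gamma
            + np.log(np.sum(q**(1 + gamma))) / (1 + gamma))

def transfer_loss(p_source, q_source_emp, p_target, lam=0.5, gamma=0.5):
    # L(theta_s, theta_t) = L_S(theta_s) + lam * L_T(theta_t | theta_s), with
    # L_S = D_gamma(P_s, Q_s) and L_T = D_gamma(P_t, P_s) as in the text
    return (gamma_div(p_source, q_source_emp, gamma)
            + lam * gamma_div(p_target, p_source, gamma))
```

The gradients in the text would then be obtained by differentiating this scalar loss with respect to the model parameters.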

In multi-task learning, the aim is to learn multiple related tasks simultaneously, sharing knowledge among them to improve overall performance. The γ\gamma-divergence helps in creating a robust shared representation. Let 𝒯i\mathcal{T}_{i} denote the ii-th task with parameter θi\theta_{i}, and let Θ\Theta be the shared parameter space. The combined loss function for multiple tasks is:

L(Θ)=i=1mαiLi(θi,Θ)L(\Theta)=\sum_{i=1}^{m}\alpha_{i}L_{i}(\theta_{i},\Theta)

where αi\alpha_{i} are weights for each task, and LiL_{i} is the loss for task 𝒯i\mathcal{T}_{i}. The task-specific losses LiL_{i} are defined using γ\gamma-divergence:

Li(θi,Θ)=1nij=1niDγ(Pi,j(|θi,Θ),Qi)L_{i}(\theta_{i},\Theta)=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}D_{\gamma}(P_{i,j}(\cdot|\theta_{i},\Theta),Q_{i})

where QiQ_{i} is the empirical distribution for task 𝒯i\mathcal{T}_{i}. The gradients for updating the shared parameters Θ\Theta and task-specific parameters θi\theta_{i} are:

ΘL(Θ)=i=1mαiΘLi(θi,Θ)\nabla_{\Theta}L(\Theta)=\sum_{i=1}^{m}\alpha_{i}\nabla_{\Theta}L_{i}(\theta_{i},\Theta)
θiL(Θ)=αiθiLi(θi,Θ)\nabla_{\theta_{i}}L(\Theta)=\alpha_{i}\nabla_{\theta_{i}}L_{i}(\theta_{i},\Theta)

The γ\gamma-divergence ensures that the updates are robust to outliers and anomalies within each task’s data. By using a divergence measure that penalizes discrepancies between distributions, the model learns shared features that are less sensitive to noise and specific to individual tasks.
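A corresponding sketch of the combined multi-task loss, under the same simplifying assumption of discrete distributions on a finite support (names ours; the dependence of each task model on (θi,Θ)(\theta_{i},\Theta) is left implicit):

```python
import numpy as np

def gamma_div(p, q, gamma=0.5):
    # log-form gamma-divergence between discrete pmfs; zero iff p is proportional to q
    p, q = np.asarray(p, float), np.asarray(q, float)
    return (np.log(np.sum(p**(1 + gamma))) / (gamma * (1 + gamma))
            - np.log(np.sum(p * q**gamma)) / gamma
            + np.log(np.sum(q**(1 + gamma))) / (1 + gamma))

def multitask_loss(model_pmfs, empirical_pmfs, alphas, gamma=0.5):
    # L(Theta) = sum_i alpha_i * L_i, with L_i the gamma-divergence between the
    # model for task i and that task's empirical distribution Q_i
    return sum(a * gamma_div(p, q, gamma)
               for a, p, q in zip(alphas, model_pmfs, empirical_pmfs))
```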

Efficiently optimizing the γ\gamma-divergence in high-dimensional parameter spaces remains challenging, and developing scalable algorithms that maintain robustness properties is crucial. Further theoretical exploration of the convergence properties and bounds of γ\gamma-divergence-based estimators in transfer and multi-task learning scenarios is also needed, as is applying these robust methods to diverse real-world datasets in fields like healthcare, finance, and natural language processing to validate their practical effectiveness and robustness. By integrating the γ\gamma-divergence into transfer and multi-task learning frameworks, we can enhance the robustness and adaptability of machine learning models, making them more reliable in varied and complex data environments.

Chapter 3 Minimum divergence for Poisson point process

This study introduces a robust alternative to traditional species distribution models (SDMs) using Poisson Point Processes (PPP) and new divergence measures. We propose the FF-estimator, a method grounded in cumulative distribution functions, offering enhanced accuracy and robustness over maximum likelihood (ML) estimation, especially under model misspecification. Our simulations highlight its superior performance and practical applicability in ecological studies, marking a significant step forward in ecological modeling for biodiversity conservation.

3.1 Introduction

Species distribution models (SDMs) are crucial in ecology for mapping the distribution of species across various habitats and geographical areas [37, 29, 63, 88, 68]. These models play an essential role in enhancing our understanding of biodiversity patterns, predicting changes in species distributions due to climate change, and guiding conservation and management efforts [60]. The MaxEnt (Maximum Entropy) approach to species distribution modeling represents a pivotal methodology in ecology, especially in the context of predicting species distributions under various environmental conditions [77]. This approach is particularly favored for its ability to handle presence-only data, a common scenario in ecological studies where the absence of a species is often unrecorded or unknown [26]. Alternatively, the approach based on the Poisson Point Process (PPP) gives a more comprehensive understanding of random events scattered across a certain space or time [81]. It is particularly powerful in various fields including ecology, seismology, telecommunications, and spatial statistics. We quickly review the framework for a PPP, cf. [89] for practical applications focusing on ecological studies and [57, 45] for statistical learning perspectives. The close relation between the MaxEnt and PPP approaches is rigorously discussed in [82].

In this chapter, we introduce an innovative approach that employs Poisson Point Processes (PPP) along with alternative divergence measures to enhance the robustness and efficiency of SDMs [84]. We propose the use of the FF-estimator, a novel method based on cumulative distribution functions, which offers a promising alternative to the ML-estimator, particularly in the presence of model misspecification. Traditional approaches, such as ML estimation, often grapple with issues of model misspecification, leading to inaccurate predictions. Our approach is evaluated through a series of simulations, demonstrating its superiority over traditional methods in terms of accuracy and robustness. The chapter also explores the computational aspects of these estimators, providing insights into their practical application in ecological studies. By addressing key challenges in SDM estimation, our methodology paves the way for more reliable and effective ecological modeling, essential for biodiversity conservation and ecological research.

Let AA be a subset of 2\mathbb{R}^{2} in which observed points are recorded. Then the event space is given by the collection of all possible finite subsets of AA, namely the union of pairs {(m,{s1,,sm})}m=1\{(m,\{s_{1},...,s_{m}\})\}_{m=1}^{\infty} together with (0,)(0,\emptyset), where \emptyset denotes the empty set. Thus, the event space comprises pairs of the set of observed points {s1,,sm}\{s_{1},...,s_{m}\} and the number mm. Let λ(s)\lambda(s) be a positive function on AA, called an intensity function. A PPP is described by the intensity function λ(s)\lambda(s) through a two-step procedure for any realization.

  • (i)

    The number MM is non-negative and generated from a Poisson distribution. This distribution, denoted as 𝙿𝚘(Λ){\tt Po}(\Lambda), has a probability mass function (pmf) given by

    p(m,Λ)=Λmm!exp{Λ}\displaystyle p(m,\Lambda)=\frac{\Lambda^{m}}{m!}\exp\{-\Lambda\}

    where Λ=Aλ(s)ds\Lambda=\int_{A}\lambda(s){\rm d}s with an intensity function λ(s)\lambda(s) on AA.

  • (ii)

    The sequence (S1,,SM)(S_{1},...,S_{M}) in AA is obtained by independent and identically distributed sample of a random variable SS on AA with probability density function (pdf) given by

    p(s)=λ(s)Λ\displaystyle p(s)=\frac{\lambda(s)}{\Lambda}

    for sAs\in A.

This description covers the basic statistical structure of the Poisson point process. The joint random variable Ξ=(M,{S1,,SM})\Xi=(M,\{S_{1},...,S_{M}\}) has a pdf written as

p(ξ)=exp{Λ}i=1mλ(si),p(\xi)=\exp\{-\Lambda\}\prod_{i=1}^{m}{\lambda(s_{i})}, (3.1)

where ξ=(m,{s1,,sm})\xi=(m,\{s_{1},...,s_{m}\}). Thus, the intensity function λ(s)\lambda(s) characterizes the pdf p(ξ)p(\xi) of the PPP. The set of all intensity functions has a one-to-one correspondence with the set of all distributions of PPPs. Subsequently, we will discuss divergence measures on intensity functions rather than on pdfs.
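The two-step construction above can be simulated directly. The sketch below uses the standard thinning device on a rectangle, which is equivalent to steps (i)-(ii); it assumes an upper bound `lam_max` dominating λ\lambda on AA, and all names are ours.

```python
import numpy as np

def simulate_ppp(lam, lam_max, area=((0.0, 1.0), (0.0, 1.0)), rng=None):
    # Simulate a PPP with intensity lam(s) on a rectangle A by thinning:
    # draw M ~ Po(lam_max * |A|) uniform candidate points, then keep each
    # candidate s independently with probability lam(s) / lam_max.
    rng = rng if rng is not None else np.random.default_rng()
    (x0, x1), (y0, y1) = area
    m = rng.poisson(lam_max * (x1 - x0) * (y1 - y0))
    pts = np.column_stack([rng.uniform(x0, x1, m), rng.uniform(y0, y1, m)])
    keep = rng.random(m) < lam(pts) / lam_max
    return pts[keep]
```

With a constant intensity the retained count is Poisson with mean Λ=λ|A|\Lambda=\lambda|A|, matching step (i).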

3.2 Species distribution model

Species Distribution Models (SDMs) are crucial tools in ecology for understanding and predicting species distributions across spatial landscapes. The inhomogeneous PPP plays a significant role in enhancing the functionality and accuracy of these models due to its ability to handle spatial heterogeneity, which is often a characteristic of ecological data. Ecological landscapes are inherently heterogeneous with varying attributes such as vegetation, soil types, and climatic conditions. The inhomogeneous PPP accommodates this spatial heterogeneity by allowing the event rate to vary across space, thereby enabling a more realistic modeling of species distributions. This can incorporate environmental covariates to model the intensity function of the point process, which in turn helps in understanding how different environmental factors influence species distribution. This is crucial for both theoretical ecological studies and practical conservation planning [56].

If presence and absence data are available, we can employ familiar statistical methods such as the logistic model, random forests and other binary classification algorithms. However, ecological datasets often consist of presence-only records, which can be a challenge for traditional statistical models. We focus on a statistical analysis for presence-only data, which inhomogeneous PPP modeling can handle effectively, making it a powerful tool for species distribution modeling in data-scarce scenarios.

Let us introduce a SDM in the framework of PPP discussed above. Suppose that we get a presence dataset, say {S1,,SM}\{S_{1},...,S_{M}\}, or a set of observed points for a species in a study area AA. Then, we build a statistical model of an intensity function driving a PPP on A{A}, in which a parametric model is given by

={λ(s,θ):θΘ},\displaystyle{\mathcal{M}}=\{\lambda(s,\theta):\theta\in\Theta\}, (3.2)

called a species distribution model (SDM), where θ\theta is an unknown parameter in the space Θ\Theta. The pdf of the joint random variable Ξ=(M,{S1,,SM})\Xi=(M,\{S_{1},...,S_{M}\}) is written as

p(ξ,θ)=exp{Λ(θ)}i=1mλ(si,θ)\displaystyle p(\xi,\theta)=\exp\{-\Lambda(\theta)\}\prod_{i=1}^{m}{\lambda(s_{i},\theta)}

due to (3.1), where ξ=(m,{s1,,sm})\xi=(m,\{s_{1},...,s_{m}\}) and Λ(θ)=Aλ(s,θ)𝑑s\Lambda(\theta)=\int_{A}\lambda(s,\theta)ds. In ecological terms, this can be understood as recording the locations (e.g., GPS coordinates) where a particular species has been observed. The pdf here helps in modeling the likelihood of finding the species at different locations within the study area, considering various environmental factors. Typically, we shall consider a log-linear model

λ(s,θ)=exp{θ1x(s)+θ0}\displaystyle\lambda(s,\theta)=\exp\{\theta_{1}^{\top}x(s)+\theta_{0}\} (3.3)

with θ=(θ0,θ1)\theta=(\theta_{0},\theta_{1}), a feature vector x(s)x(s), a slope vector θ1\theta_{1} and an intercept θ0\theta_{0}. Here x(s)x(s) consists of environmental characteristics such as geographical, climatic and other factors influencing the habitat of the species. Then, parameter estimation is key in SDMs to understand the relationships between species distributions and environmental covariates. The ML-estimator is a common approach used in PPP to estimate these parameters, which in turn refines the SDM.

The negative log-likelihood function based on an observation sequence (M,{S1,,SM})(M,\{S_{1},...,S_{M}\}) is given by

L0(θ)=i=1Mlogλ(Si,θ)+Λ(θ).\displaystyle L_{0}(\theta)=-\sum_{i=1}^{M}\log\lambda(S_{i},\theta)+\Lambda(\theta). (3.4)

Here the cumulative intensity is usually approximated as

Λ(θ)=i=1nwiλ(Si,θ)\displaystyle\Lambda(\theta)=\sum_{i=1}^{n}w_{i}\lambda(S_{i},\theta) (3.5)

by Gaussian quadrature, where SM+1,,Sn{S_{M+1},...,S_{n}} are the centers of the grid cells containing no presence location and wiw_{i} is a quadrature weight for a grid cell area. The approximated estimating equation is given by

S0(θ)=i=1n{Ziwiλ(Si,θ)}θlogλ(Si,θ)=0,\displaystyle{S}_{0}(\theta)=\sum_{i=1}^{n}\{Z_{i}-w_{i}\lambda(S_{i},\theta)\}\frac{\partial}{\partial\theta}\log\lambda(S_{i},\theta)={0}, (3.6)

where ZiZ_{i} is an indicator for presence, that is, Zi=1Z_{i}=1 if 1iM1\leq i\leq M and 0 otherwise.
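Under the grid approximation, the loss (3.4)-(3.5) and the estimating function (3.6) for the log-linear model (3.3) can be coded compactly. A minimal sketch (names ours), where `X_all` stacks the feature vectors x(Si)x(S_{i}) at the presence points and quadrature centres, `z` holds the presence indicators, and `w` the quadrature weights:

```python
import numpy as np

def neg_log_lik(theta, X_all, z, w):
    # (3.4) with Lambda(theta) approximated by the quadrature sum (3.5)
    eta = X_all @ theta
    return -np.sum(z * eta) + np.sum(w * np.exp(eta))

def score(theta, X_all, z, w):
    # the estimating function (3.6): sum_i {z_i - w_i * lambda_i} * x_i,
    # i.e. minus the gradient of neg_log_lik for the log-linear model
    lam = np.exp(X_all @ theta)
    return (z - w * lam) @ X_all
```

Setting the score to zero is then equivalent to minimizing the approximate negative log-likelihood.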

Let p(ξ)p(\xi) and q(ξ)q(\xi) be probability density functions (pdf) of two PPPs, where ξ=(m,{s1,,sm})\xi=(m,\{s_{1},...,s_{m}\}) is a realization. Due to the discussion above, the pdfs are written as

p(ξ)=exp{Λ}i=1mλ(si),q(ξ)=exp{H}i=1mη(si),\displaystyle p(\xi)=\exp\{-\Lambda\}\prod_{i=1}^{m}{\lambda(s_{i})},\ \ \ q(\xi)=\exp\{-H\}\prod_{i=1}^{m}{\eta(s_{i})}, (3.7)

in which p(ξ)p(\xi) and λ(s)\lambda(s) have a one-to-one correspondence, and q(ξ)q(\xi) and η(s)\eta(s) have also the same property. The Kullback-Leibler (KL) divergence between pp and qq is defined by the difference between the cross entropy and the diagonal entropy as D0(p,q)=H0(p,q)H0(p,p){D}_{0}(p,q)={H}_{0}(p,q)-{H}_{0}(p,p), where the cross entropy is defined by

H0(p,q)=𝔼p[logq(Ξ)]\displaystyle{H}_{0}(p,q)=-\mathbb{E}_{p}[\log{q(\Xi)}]

with the expectation 𝔼p\mathbb{E}_{p} with the pdf p(ξ)p(\xi). This is written as

H0(p,q)\displaystyle{H}_{0}(p,q) =m=0Λmm!eΛA××Alog{eHj=1mη(sj)}j=1mλ(sj)Λdsj\displaystyle=-\sum_{m=0}^{\infty}\frac{\Lambda^{m}}{m!}e^{-\Lambda}\int_{A\times\cdots\times A}\log\{e^{-H}\prod_{j=1}^{m}{\eta(s_{j})}\}\prod_{j=1}^{m}\frac{\lambda(s_{j})}{\Lambda}{\rm d}s_{j}
=m=0Λmm!eΛ[H+mΛAλ(s)logη(s)ds]\displaystyle=\sum_{m=0}^{\infty}\frac{\Lambda^{m}}{m!}e^{-\Lambda}\Big{[}-H+\frac{m}{\Lambda}\int_{A}{\lambda(s)}\log{\eta(s)}{\rm d}s\Big{]}
=A{λ(s)logη(s)η(s)}ds.\displaystyle=\int_{A}\{\lambda(s)\log{\eta(s)}-\eta(s)\}{\rm d}s. (3.8)

Thus, the KL-divergence is

D0(p,q)=A{λ(s)logλ(s)η(s)λ(s)+η(s)}ds,\displaystyle{D}_{0}(p,q)=\int_{A}\Big{\{}\lambda(s)\log\frac{\lambda(s)}{\eta(s)}-\lambda(s)+\eta(s)\Big{\}}{\rm d}s, (3.9)

see [57] for detailed derivations. This can be seen as a way to assess the effectiveness of an ecological model. For instance, how well does our model predict where a species will be found, based on environmental factors like climate, soil type, or vegetation? The closer our model’s predictions are to the actual observations, the better it is at explaining the species’ distribution. In effect, D0(p,q){D}_{0}(p,q) coincides with the extended KL-divergence between intensity functions λ(s)\lambda(s) and η(s)\eta(s). Here, the term λ(s)+η(s)-\lambda(s)+\eta(s) in the integrand of (3.9) should be added to the standard form since both λ(s)\lambda(s) and η(s)\eta(s) in general do not have total mass one.
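As a worked check, (3.9) can be evaluated numerically on a grid. The integrand λlog(λ/η)λ+η\lambda\log(\lambda/\eta)-\lambda+\eta is pointwise nonnegative, so the divergence is nonnegative and vanishes when λ=η\lambda=\eta; the sketch below (our names) uses a simple Riemann-sum approximation.

```python
import numpy as np

def kl_intensity(lam, eta, grid, cell_area):
    # extended KL-divergence (3.9) between intensity functions,
    # approximated by a Riemann sum over grid points covering A
    l, e = lam(grid), eta(grid)
    return cell_area * np.sum(l * np.log(l / e) - l + e)
```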

Let (M,{S1,,SM})(M,\{S_{1},...,S_{M}\}) be an observation having the pdf p(ξ,θ0)p(\xi,\theta_{0}). We consider the expected value under the true pdf p(ξ,θ0)p(\xi,\theta_{0}), which is given by

𝔼0[L0(θ)]=A{λ(s,θ0)logλ(s,θ)λ(s,θ)}ds\displaystyle\mathbb{E}_{0}[L_{0}(\theta)]=-\int_{A}\{\lambda(s,\theta_{0})\log\lambda(s,\theta)-\lambda(s,\theta)\}{\rm d}s

noting a familiar formula for a random sum in PPP:

𝔼0[i=1Mlogλ(Si,θ)]=Aλ(s,θ0)logλ(s,θ)ds,\displaystyle\mathbb{E}_{0}\Big{[}\sum_{i=1}^{M}\log\lambda(S_{i},\theta)\Big{]}=\int_{A}\lambda(s,\theta_{0})\log\lambda(s,\theta){\rm d}s, (3.10)

where 𝔼0\mathbb{E}_{0} denotes the expectation with the pdf p(ξ,θ0)p(\xi,\theta_{0}). This is nothing but the cross entropy between intensity functions λ(s,θ0)\lambda(s,\theta_{0}) and λ(s,θ)\lambda(s,\theta). Accordingly, we observe a close relationship between the log-likelihood and the KL-divergence that is parallel to the discussion around (2.2) in Chapter 2. In effect,

𝔼0[L0(θ)]𝔼0[L0(θ0)]=D0(p(,θ0),p(,θ)).\displaystyle\mathbb{E}_{0}[L_{0}(\theta)]-\mathbb{E}_{0}[L_{0}(\theta_{0})]=D_{0}(p(\cdot,\theta_{0}),p(\cdot,\theta)).

This relation yields the consistency of the ML-estimator for the true value θ0\theta_{0}, noting θ0=argminθΘ𝔼0[L0(θ)]\theta_{0}=\mathop{\rm argmin}_{\theta\in\Theta}\mathbb{E}_{0}[L_{0}(\theta)]. This suggests that the method used to estimate the impact of environmental factors on species distribution is dependable. In practical terms, ecologists can trust the model to make accurate predictions about where a species might be found, based on environmental data.

3.3 Divergence measures on intensity functions

We would like to extend the minimum divergence method to the estimation of an SDM. The main objective is to propose an alternative to the maximum likelihood method, aiming to enhance robustness and expedite computation. We have observed the close relationship between the log-likelihood and the KL-divergence in the previous section. Fortunately, the empirical form of the KL-divergence matches the log-likelihood function in the framework of the SDM. We remark that the fact that the KL-divergence between PPPs is equal to the KL-divergence between their intensity functions is essential for ensuring this property. However, this key relation does not hold for the power divergence.

First, we review a formula for random sum and product in PPP, which is gently and comprehensively discussed in [89].

Proposition 13.

Let Ξ=(M,{S1,,SM})\Xi=(M,\{S_{1},...,S_{M}\}) be a realization of a PPP with an intensity function λ(s)\lambda(s) on an area AA. Then, for any integrable function g(s)g(s),

𝔼[i=1Mg(Si)]=Ag(s)λ(s)ds\displaystyle\mathbb{E}\Big{[}\sum_{i=1}^{M}g(S_{i})\Big{]}=\int_{A}g(s)\lambda(s){\rm d}s (3.11)

and

𝔼[i=1Mg(Si)]=exp{A{g(s)1}λ(s)ds}.\displaystyle\mathbb{E}\Big{[}\prod_{i=1}^{M}g(S_{i})\Big{]}=\exp\Big{\{}\int_{A}\{g(s)-1\}\lambda(s){\rm d}s\Big{\}}.
Proof.

By definition,

𝔼[i=1Mg(Si)]=m=0Λmm!eΛA××Ai=1mg(si)i=1mλ(si)Λdsi\displaystyle\mathbb{E}\Big{[}\sum_{i=1}^{M}g(S_{i})\Big{]}=\sum_{m=0}^{\infty}\frac{\Lambda^{m}}{m!}e^{-\Lambda}\int_{A\times\cdots\times A}\sum_{i=1}^{m}g(s_{i})\prod_{i=1}^{m}\frac{\lambda(s_{i})}{\Lambda}{\rm d}s_{i}
=m=0Λmm!eΛmΛAλ(s)g(s)ds\displaystyle=\sum_{m=0}^{\infty}\frac{\Lambda^{m}}{m!}e^{-\Lambda}\frac{m}{\Lambda}\int_{A}{\lambda(s)}{g(s)}{\rm d}s
=Ag(s)λ(s)ds.\displaystyle=\int_{A}g(s)\lambda(s){\rm d}s.

Similarly,

𝔼[i=1Mg(Si)]=m=0Λmm!eΛA××Ai=1mg(si)i=1mλ(si)Λdsi\displaystyle\mathbb{E}\Big{[}\prod_{i=1}^{M}g(S_{i})\Big{]}=\sum_{m=0}^{\infty}\frac{\Lambda^{m}}{m!}e^{-\Lambda}\int_{A\times\cdots\times A}\prod_{i=1}^{m}g(s_{i})\prod_{i=1}^{m}\frac{\lambda(s_{i})}{\Lambda}{\rm d}s_{i}
=m=0Λmm!eΛ{Ag(s)λ(s)Λds}m\displaystyle=\sum_{m=0}^{\infty}\frac{\Lambda^{m}}{m!}e^{-\Lambda}\Big{\{}\int_{A}\frac{g(s)\lambda(s)}{\Lambda}{\rm d}s\Big{\}}^{m}
=exp{A{g(s)1}λ(s)ds}.\displaystyle=\exp\Big{\{}\int_{A}\{g(s)-1\}\lambda(s){\rm d}s\Big{\}}.

Proposition 13 gives interesting properties of the random sum and the random product; see section 2.6 in [89] for further discussion and historical backgrounds. In ecology, this can be interpreted as predicting the total impact or effect of a particular environmental factor (represented by g(s)g(s)) across all locations where a species is observed within a study area AA. For example, g(s)g(s) could represent the level of a specific nutrient or habitat quality at each observation point SiS_{i}. The integral then sums up these effects across the entire habitat, providing a comprehensive view of how the environmental factor influences the species across its distribution. This formula can be used in SDMs to quantify the cumulative effect of environmental variables on species presence. For instance, it could help in assessing how total food availability or habitat suitability across a landscape influences the likelihood of species presence. By integrating such ecological factors into the SDM, researchers can gain insights into the species’ habitat preferences and distribution patterns. Understanding the cumulative impact of environmental factors is crucial for conservation planning and management. This approach helps identify critical areas that contribute significantly to species survival and can guide habitat restoration or protection efforts. For instance, if the model shows that certain areas have a high cumulative impact on species presence, these areas might be prioritized for conservation.
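The random-sum formula (3.11) can be verified by Monte Carlo for a homogeneous intensity on A=[0,1]A=[0,1]; a small sketch (names ours):

```python
import numpy as np

def mc_random_sum(lam_const, g, n_rep, rng):
    # Monte Carlo estimate of E[sum_{i<=M} g(S_i)] for a homogeneous PPP on [0, 1]:
    # step (i) draws M ~ Po(Lambda) with Lambda = lam_const * |A| = lam_const,
    # step (ii) draws S_i iid uniform, i.e. from lambda / Lambda
    total = 0.0
    for _ in range(n_rep):
        m = rng.poisson(lam_const)
        s = rng.uniform(0.0, 1.0, m)
        total += g(s).sum()
    return total / n_rep
```

With λ10\lambda\equiv 10 and g(s)=s2g(s)=s^{2}, the right-hand side of (3.11) is 0110s2ds=10/3\int_{0}^{1}10s^{2}{\rm d}s=10/3, which the simulation reproduces.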

Second, we introduce divergence measures to apply the estimation for a species distributional model employing the formula introduced in Proposition 13. The γ\gamma-divergence between probability measures PP and QQ of PPPs with RN-derivatives p(ξ)p(\xi) and q(ξ)q(\xi) in (3.7) is given by

Dγ(P,Q)=Hγ(P,Q)Hγ(P,P).\displaystyle D_{\gamma}(P,Q)=H_{\gamma}(P,Q)-H_{\gamma}(P,P).

Here the cross γ\gamma-entropy is defined by

Hγ(P,Q)=1γ𝔼P[q(Ξ)γ]{𝔼Q[q(Ξ)γ]}γγ+1,\displaystyle H_{\gamma}(P,Q)=-\frac{1}{\gamma}\frac{\mathbb{E}_{P}[q(\Xi)^{\gamma}]}{\{\mathbb{E}_{Q}[q(\Xi)^{\gamma}]\}^{\frac{\gamma}{\gamma+1}}},

where 𝔼P\mathbb{E}_{P} denote the expectation with respect to PP. Accordingly, the γ\gamma-cross entropy between probability distributions P0P_{0} and PθP_{\theta} having the intensity functions λ0(s)\lambda_{0}(s) and λ(s,θ)\lambda(s,\theta), respectively, is written as

Hγ(P0,Pθ)=1γexp[A{λ0(s)(λ(s,θ)γ1)γγ+1λ(s,θ)γ+1}ds]\displaystyle H_{\gamma}(P_{0},P_{\theta})=-\frac{1}{\gamma}\exp\Big{[}\int_{A}\Big{\{}\lambda_{0}(s)(\lambda(s,\theta)^{\gamma}-1)-\frac{\gamma}{\gamma+1}\lambda(s,\theta)^{\gamma+1}\Big{\}}{\rm d}s\Big{]}

since

𝔼P0[p(Ξ,θ)γ]\displaystyle{\mathbb{E}_{P_{0}}[p(\Xi,\theta)^{\gamma}]} =exp{γΛ(θ)}𝔼P0[i=1Mλ(Si,θ)γ]\displaystyle=\exp\{\gamma\Lambda(\theta)\}\mathbb{E}_{P_{0}}\Big{[}\prod_{i=1}^{M}\lambda(S_{i},\theta)^{\gamma}\Big{]}
=exp[A{γλ(s,θ)+λ0(s)(λ(s,θ)γ1)}ds]\displaystyle=\exp\Big{[}\int_{A}\{\gamma\lambda(s,\theta)+\lambda_{0}(s)(\lambda(s,\theta)^{\gamma}-1)\}{\rm d}s\Big{]} (3.12)

due to Proposition 13. However, it is difficult to give an empirical expression of Hγ(P0,Pθ)H_{\gamma}(P_{0},P_{\theta}) for a given realization (M,{S1,,SM})(M,\{S_{1},...,S_{M}\}) generated from P0P_{0}. Accordingly, we consider another type of divergence.

Consider the log γ\gamma-divergence between P0P_{0} and PθP_{\theta} that is defined by

Δγ(P0,Pθ)=1γlog𝔼P0[p(Ξ,θ)γ]{𝔼Pθ[p(Ξ,θ)γ]}γγ+1{𝔼P0[p0(Ξ)γ]}1γ+1.\displaystyle\Delta_{\gamma}(P_{0},P_{\theta})=-\frac{1}{\gamma}\log\frac{\mathbb{E}_{P_{0}}[p(\Xi,\theta)^{\gamma}]}{\{\mathbb{E}_{P_{\theta}}[p(\Xi,\theta)^{\gamma}]\}^{\frac{\gamma}{\gamma+1}}\{\mathbb{E}_{P_{0}}[p_{0}(\Xi)^{\gamma}]\}^{\frac{1}{\gamma+1}}}.

This is written as

Δγ(P0,Pθ)=1γA{λ0(s)λ(s,θ)γγγ+1λ(s,θ)γ+11γ+1λ0(s)γ+1}ds.\displaystyle{\Delta}_{\gamma}(P_{0},P_{\theta})=-\frac{1}{\gamma}\int_{A}\Big{\{}\lambda_{0}(s)\lambda(s,\theta)^{\gamma}-\frac{\gamma}{\gamma+1}\lambda(s,\theta)^{\gamma+1}-\frac{1}{\gamma+1}\lambda_{0}(s)^{\gamma+1}\Big{\}}{\rm d}s. (3.13)

Therefore, the loss function is induced as

Lγ(θ)=1γi=1Mλ(Si,θ)γ+1γ+1Aλ(s,θ)γ+1ds\displaystyle L_{\gamma}(\theta)=-\frac{1}{\gamma}\sum_{i=1}^{M}\lambda(S_{i},\theta)^{\gamma}+\frac{1}{\gamma+1}\int_{A}\lambda(s,\theta)^{\gamma+1}{\rm d}s

for an SDM (3.2). This loss function is totally different from the negative log-likelihood. In a regression model, Dγ(P,Q)D_{\gamma}(P,Q) and Δγ(P,Q)\Delta_{\gamma}(P,Q) yield the same loss function, while in a PPP model only Δγ(P,Q)\Delta_{\gamma}(P,Q) yields the loss function Lγ(θ)L_{\gamma}(\theta).

We observe that the properties of random sums and products lead to delicate differences among one-to-one transformed divergence measures. So, we consider a divergence measure defined directly on the space of intensity functions rather than on that of probability distributions of PPPs. The β\beta-divergence is given by

Dβ(λ,η)=1βA{λ(s)η(s)βββ+1η(s)β+11β+1λ(s)β+1}ds.\displaystyle{D}_{\beta}(\lambda,\eta)=-\frac{1}{\beta}\int_{A}\Big{\{}\lambda(s)\eta(s)^{\beta}-\frac{\beta}{\beta+1}\eta(s)^{\beta+1}-\frac{1}{\beta+1}\lambda(s)^{\beta+1}\Big{\}}{\rm d}s. (3.14)

The γ\gamma-divergence is given by

Dγ(λ,η)=1γAλ(s)η(s)γds{Aη(s)γ+1ds}γγ+1+1γ{Aλ(s)γ+1ds}1γ+1.\displaystyle{D}_{\gamma}(\lambda,\eta)=-\frac{1}{\gamma}\frac{\int_{A}\lambda(s)\eta(s)^{\gamma}{\rm d}s}{\{\int_{A}\eta(s)^{\gamma+1}{\rm d}s\}^{\frac{\gamma}{\gamma+1}}}+\frac{1}{\gamma}\Big{\{}\int_{A}\lambda(s)^{\gamma+1}{\rm d}s\Big{\}}^{\frac{1}{\gamma+1}}.

The loss functions corresponding to these are given by

Lβ(θ)=1βi=1Mλ(Si,θ)β+1β+1Aλ(s,θ)β+1ds\displaystyle L_{\beta}(\theta)=-\frac{1}{\beta}\sum_{i=1}^{M}\lambda(S_{i},\theta)^{\beta}+\frac{1}{\beta+1}\int_{A}\lambda(s,\theta)^{\beta+1}{\rm d}s (3.15)

and

Lγ(θ)=1γi=1Mλ(Si,θ)γ{Aλ(s,θ)γ+1ds}γγ+1.\displaystyle L_{\gamma}(\theta)=-\frac{1}{\gamma}\frac{\sum_{i=1}^{M}\lambda(S_{i},\theta)^{\gamma}}{\{\int_{A}\lambda(s,\theta)^{\gamma+1}{\rm d}s\}^{\frac{\gamma}{\gamma+1}}}. (3.16)

The estimating functions corresponding to these are given by

Sβ(θ)=i=1Mλ(Si,θ)βθlogλ(Si,θ)Aλ(s,θ)β+1θlogλ(s,θ)ds\displaystyle{S}_{\beta}(\theta)=\sum_{i=1}^{M}\lambda(S_{i},\theta)^{\beta}\frac{\partial}{\partial\theta}\log\lambda(S_{i},\theta)-\int_{A}\lambda(s,\theta)^{\beta+1}\frac{\partial}{\partial\theta}\log\lambda(s,\theta){\rm d}s

and

Sγ(θ)=\displaystyle{S}_{\gamma}(\theta)= i=1Mλ(Si,θ)γ{Aλ(s,θ)γ+1ds}γγ+1\displaystyle\sum_{i=1}^{M}\frac{\lambda(S_{i},\theta)^{\gamma}}{\{\int_{A}\lambda(s,\theta)^{\gamma+1}{\rm d}s\}^{\frac{\gamma}{\gamma+1}}}
×{θlogλ(Si,θ)Aλ(s,θ)γ+1Aλ(s,θ)γ+1dsθlogλ(s,θ)ds}.\displaystyle\times\Big{\{}\frac{\partial}{\partial\theta}\log\lambda(S_{i},\theta)-\int_{A}\frac{\lambda(s,\theta)^{\gamma+1}}{\int_{A}\lambda(s,\theta)^{\gamma+1}{\rm d}s}\frac{\partial}{\partial\theta}\log\lambda(s,\theta){\rm d}s\Big{\}}. (3.17)
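As a minimal numerical sketch (the intensity values and quadrature grid below are hypothetical), the loss functions (3.15) and (3.16) can be evaluated under a quadrature approximation of the integral over AA. The check at the end illustrates that the γ\gamma-loss is invariant to a rescaling of the intensity, while the β\beta-loss is not:

```python
import numpy as np

def beta_loss(lam_pres, lam_grid, w, beta):
    # beta-loss (3.15): presence sum plus quadrature approximation of the integral
    return (-np.sum(lam_pres ** beta) / beta
            + np.sum(w * lam_grid ** (beta + 1)) / (beta + 1))

def gamma_loss(lam_pres, lam_grid, w, gamma):
    # gamma-loss (3.16): the normalizing denominator makes the loss scale-invariant
    denom = np.sum(w * lam_grid ** (gamma + 1)) ** (gamma / (gamma + 1))
    return -np.sum(lam_pres ** gamma) / (gamma * denom)

# hypothetical intensities at presence points and at quadrature grid points
lam_pres = np.array([1.0, 2.0, 0.5])
lam_grid = np.array([0.8, 1.2, 1.5, 0.7])
w = np.full(4, 0.25)  # quadrature weights

# rescaling lambda -> 3*lambda leaves the gamma-loss unchanged
g1 = gamma_loss(lam_pres, lam_grid, w, gamma=0.5)
g2 = gamma_loss(3.0 * lam_pres, 3.0 * lam_grid, w, gamma=0.5)
```

With these hypothetical values, g1 and g2 agree to machine precision; this scale invariance is what later prevents the intercept of a log-linear model from being identified by the γ\gamma-loss.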

A divergence measure between two PPPs can be written as a functional of the intensity functions induced by them. We observe an interesting relationship from this viewpoint.

Proposition 14.

Let PP and QQ be probability distributions for PPPs with intensity functions λ\lambda and η\eta, respectively. Then, the log γ\gamma-divergence Δγ(P,Q)\Delta_{\gamma}(P,Q) in (3.13) is equal to the β\beta-divergence Dβ(λ,η)D_{\beta}(\lambda,\eta) in (3.14) when γ=β\gamma=\beta.

The proof is immediate by definition.

Essentially, Δγ(P,Q)\Delta_{\gamma}(P,Q) satisfies a scale invariance, expressing an angle between PP and QQ rather than a distance between them; Dβ(λ,η)D_{\beta}(\lambda,\eta) does not satisfy such invariance on the space of intensity functions. Thus, the two have totally different characteristics; however, the connection between probability distributions and their intensity functions for PPPs entails this coincidence. It follows from Proposition 14 that the GM-divergence DGM(P,Q)D_{\rm GM}(P,Q) equals the Itakura-Saito divergence, that is

DGM(P,Q)=A{λ(s)η(s)logλ(s)η(s)1}ds.\displaystyle{D}_{\rm GM}(P,Q)=\int_{A}\Big{\{}\frac{\lambda(s)}{\eta(s)}-\log\frac{\lambda(s)}{\eta(s)}-1\Big{\}}{\rm d}s.

Hence, the GM-loss function is given by

LGM(θ)=i=1M1λ(Si,θ)+Alogλ(s,θ)ds.\displaystyle L_{\rm GM}(\theta)=\sum_{i=1}^{M}\frac{1}{\lambda(S_{i},\theta)}+\int_{A}\log\lambda(s,\theta){\rm d}s.

and the estimating function is

SGM(θ)=i=1M1λ(Si,θ)θlogλ(Si,θ)+Aθlogλ(s,θ)ds.\displaystyle{S}_{\rm GM}(\theta)=\sum_{i=1}^{M}\frac{1}{\lambda(S_{i},\theta)}\frac{\partial}{\partial\theta}\log\lambda(S_{i},\theta)+\int_{A}\frac{\partial}{\partial\theta}\log\lambda(s,\theta){\rm d}s. (3.18)
Proposition 15.

Assume a log-linear model λ(s,θ)=exp{θ1x(s)+θ0}\lambda(s,\theta)=\exp\{\theta_{1}^{\top}x(s)+\theta_{0}\} with a feature vector x(s)x(s). Then, the estimating function of the GM-estimator is given by

SGM(θ)=i=1Mexp{θ1x(Si)θ0}x0(Si)Ax0(s)ds.\displaystyle{S}_{\rm GM}(\theta)=\sum_{i=1}^{M}\exp\{-\theta_{1}^{\top}x(S_{i})-\theta_{0}\}x_{0}(S_{i})-\int_{A}x_{0}(s){\rm d}s. (3.19)

where x0(s)=(1,x(s))x_{0}(s)=(1,x(s)^{\top})^{\top}.

Proof.

The proof is immediate: equation (3.19) is seen by applying (3.18) to the log-linear model. ∎

Equating SGM(θ){S}_{\rm GM}(\theta) to zero yields

i=1Mexp{θx0(Si)}x0(Si)=j=1nw(Sj)x0(Sj)\displaystyle\sum_{i=1}^{M}\exp\{-\theta^{\top}x_{0}(S_{i})\}x_{0}(S_{i})=\sum_{j=1}^{n}w(S_{j})x_{0}(S_{j}) (3.20)

by the quadrature approximation, where w(Sj)w(S_{j}) denotes the quadrature weight at SjS_{j}. This implies that the inverse-intensity weighted mean of the presence data {x(Si)}i=1M\{x(S_{i})\}_{i=1}^{M} is equal to the region mean of {x(Sj)}j=1n\{x(S_{j})\}_{j=1}^{n}. The learning algorithm that solves the estimating equation (3.20) for the GM-estimator needs only to update the inverse-intensity mean over the presence data; no update of the region mean is required during the iteration process. In contrast, the computational cost for the γ\gamma-estimator frequently becomes huge for a large set of quadrature points. For example, obtaining the ML-estimator requires evaluating the quadrature approximation in the likelihood equation (3.6) at every iteration. Such an evaluation step is absent from the algorithm for obtaining the GM-estimator.
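The iteration just described can be sketched as follows. This is a hypothetical Newton-type implementation (not the author's code): the region mean x̄0 is evaluated once before the loop, and only the presence-side weighted mean is updated at each step; the GM-loss for the log-linear model is convex in θ, so a Newton iteration is applicable:

```python
import numpy as np

def gm_estimate(X_pres, X_grid, w, n_iter=100, tol=1e-10):
    """Solve the GM-estimating equation (3.20) for a log-linear model.

    X_pres: (M, d) features at presence points; X_grid: (n, d) features at
    quadrature points with weights w (hypothetical argument layout)."""
    X0p = np.column_stack([np.ones(len(X_pres)), X_pres])
    X0g = np.column_stack([np.ones(len(X_grid)), X_grid])
    xbar0 = w @ X0g                      # region mean: computed once, never updated
    theta = np.zeros(X0p.shape[1])
    for _ in range(n_iter):
        v = np.exp(-X0p @ theta)         # inverse intensity weights at presence points
        S = v @ X0p - xbar0              # GM-estimating function (3.23)
        H = (X0p * v[:, None]).T @ X0p   # Jacobian (Hessian of the convex GM-loss)
        theta += np.linalg.solve(H, S)   # Newton step toward S(theta) = 0
        if np.max(np.abs(S)) < tol:
            break
    return theta
```

Only the presence term is touched inside the loop, which is the source of the cost reduction discussed above.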

Finally, we look into an approach to minimum divergence defined on the space of pdfs. An intensity function λ(s,θ)\lambda(s,\theta) determines the pdf for the occurrence of a point ss by p(s,θ)=λ(s,θ)/Λ(θ)p(s,\theta)=\lambda(s,\theta)/\Lambda(\theta). From this point of view, we can consider the divergence class for pdfs, which has been discussed in Chapter 2. However, this approach has the weakness that, in a log-linear model λ(s,θ)=exp{θ1x(s)+θ0}\lambda(s,\theta)=\exp\{\theta_{1}x(s)+\theta_{0}\}, such a pdf transformation cancels the intercept parameter θ0\theta_{0}, as found in

p(s,θ)=exp{θ1x(s)}Aexp{θ1x(s~)}ds~.\displaystyle p(s,\theta)=\frac{\exp\{\theta_{1}x(s)\}}{\int_{A}\exp\{\theta_{1}x(\tilde{s})\}{\rm d}\tilde{s}}.

The maximum entropy method is based on such an approach, so that the intercept parameter cannot be consistently estimated. See [82] for a detailed discussion in a rigorous framework. Here we note that the γ\gamma-divergence between p=λ/Λp=\lambda/\Lambda and q=η/Hq=\eta/H is essentially equal to that between λ\lambda and η\eta, that is

Dγ(λ,η)=ΛDγ(p,q).{D}_{\gamma}(\lambda,\eta)=\Lambda{D}_{\gamma}(p,q).

This implies that the intercept θ0\theta_{0} is not estimable via the estimating function (3.17). Indeed, the γ\gamma-loss function (3.16) for the log-linear model reduces to

Lγ(θ)=1γi=1Mexp{γθ1x(Si)}{Aexp{(γ+1)θ1x(s)}ds}γγ+1,\displaystyle L_{\gamma}(\theta)=-\frac{1}{\gamma}\frac{\sum_{i=1}^{M}\exp\{\gamma\theta_{1}^{\top}x(S_{i})\}}{\{\int_{A}\exp\{(\gamma+1)\theta_{1}^{\top}x(s)\}{\rm d}s\}^{\frac{\gamma}{\gamma+1}}},

which is constant in θ0\theta_{0}. From the scale invariance of the log γ\gamma-divergence, Δγ(p,q)=Δγ(λ,η)\Delta_{\gamma}(p,q)=\Delta_{\gamma}(\lambda,\eta), noting that pp and qq equal λ\lambda and η\eta up to scale factors. Hence, the intercept parameter is again not identifiable. On the other hand, the β\beta-loss function (3.15) is written as

Lβ(θ)=1βi=1Mexp{β(θ1x(Si)+θ0)}+1β+1Aexp{(β+1)(θ1x(s)+θ0)}ds\displaystyle L_{\beta}(\theta)=-\frac{1}{\beta}\sum_{i=1}^{M}\exp\{\beta(\theta_{1}^{\top}x(S_{i})+\theta_{0})\}+\frac{1}{\beta+1}\int_{A}\exp\{(\beta+1)(\theta_{1}^{\top}x(s)+\theta_{0})\}{\rm d}s

in which θ0\theta_{0} is estimable.

3.4 Robust divergence method

We discuss robustness for estimating the SDM defined by a parametric intensity function λ(s,θ)\lambda(s,\theta), in particular a log-linear model λ(s,θ)=exp{θ1x(s)+θ0}\lambda(s,\theta)=\exp\{\theta_{1}^{\top}x(s)+\theta_{0}\}, where θ=(θ0,θ1)\theta=(\theta_{0},\theta_{1}) and x(s)x(s) is an environmental feature vector influencing the habitat of a target species. In Section 3.3 we discussed minimum divergence estimation for the SDM, in which power divergence measures were explored in an exhaustive manner on the space of PPP distributions, on that of intensity functions, and on that of pdfs. In that examination, the minimum β\beta-divergence method defined on the space of intensity functions is recommended for its desirable consistency property.

We look at the β\beta-estimating function for a given dataset (M,{S1,,SM})(M,\{S_{1},...,S_{M}\}), which is defined as

Sβ(θ)=j=1n{Zjwjexp(θ1x(Sj)+θ0)}exp{β(θ1x(Sj)+θ0)}x0(Sj),\displaystyle{S}_{\beta}(\theta)=\sum_{j=1}^{n}\{Z_{j}-w_{j}\exp(\theta_{1}^{\top}x(S_{j})+\theta_{0})\}\exp\{\beta(\theta_{1}^{\top}x(S_{j})+\theta_{0})\}x_{0}(S_{j}), (3.21)

where x0(Sj)=(1,x(Sj))x_{0}(S_{j})=(1,x(S_{j})^{\top})^{\top}, wjw_{j} is a quadrature weight on the set {S1,,Sn}\{S_{1},...,S_{n}\} combining presence and background grid centers, and ZjZ_{j} is an indicator for presence, that is, Zj=1Z_{j}=1 if 1jM1\leq j\leq M and 0 otherwise. Note that taking β=0\beta=0 yields the likelihood equation (3.6). Alternatively, taking the limit of β\beta to 1-1 yields the GM-estimating function as

SGM(θ)\displaystyle{S}_{\rm GM}(\theta) =j=1n{Zjexp{(θ1x(Sj)+θ0)}wj}x0(Sj),\displaystyle=\sum_{j=1}^{n}\{Z_{j}\exp\{-(\theta_{1}^{\top}x(S_{j})+\theta_{0})\}-w_{j}\}x_{0}(S_{j}), (3.22)
=i=1Mexp{(θ1x(Si)+θ0)}x0(Si)x¯0,\displaystyle=\sum_{i=1}^{M}\exp\{-(\theta_{1}^{\top}x(S_{i})+\theta_{0})\}x_{0}(S_{i})-\bar{x}_{0}, (3.23)

where x¯0=j=1nwjx0(Sj)\bar{x}_{0}=\sum_{j=1}^{n}w_{j}x_{0}(S_{j}). This leads to a remarkable reduction of the computational cost of the learning algorithm, as discussed after Proposition 15: the computation in (3.22) involves only the first term over the presence data, with a single evaluation of x¯0\bar{x}_{0} using the background data. For any β\beta, the β\beta-estimating function is unbiased, because

𝔼θ[Sβ(θ)]=\displaystyle\mathbb{E}_{\theta}[{S}_{\beta}(\theta)]= Aexp{(β+1)(θ1x(s)+θ0)}x0(s)ds\displaystyle\int_{A}\exp\{(\beta+1)(\theta_{1}^{\top}x(s)+\theta_{0})\}x_{0}(s){\rm d}s
j=1nwjexp{(β+1)(θ1x(Sj)+θ0)}x0(Sj),\displaystyle-\sum_{j=1}^{n}w_{j}\exp\{(\beta+1)(\theta_{1}^{\top}x(S_{j})+\theta_{0})\}x_{0}(S_{j}),

which is equal to a zero vector if the quadrature approximation is proper, where 𝔼θ\mathbb{E}_{\theta} denotes expectation under the log-linear model λ(s,θ)=exp(θ1x(s)+θ0)\lambda(s,\theta)=\exp(\theta_{1}^{\top}x(s)+\theta_{0}). This unbiasedness property guarantees the consistency of the β\beta-estimator for θ\theta. In accordance with this, we would like to select the most robust estimator against model misspecification in the class of β\beta-estimators. The β\beta-estimator differs from the ML-estimator only in the estimating weight exp{β(θ1x(Sj)+θ0)}\exp\{\beta(\theta_{1}^{\top}x(S_{j})+\theta_{0})\} in (3.21). We contemplate that this estimating weight may not be effective in every data situation, regardless of whether β\beta is positive or negative. Indeed, the estimating weight is unbounded and admits no comprehensible interpretation under misspecification.
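For concreteness, the β\beta-estimating function (3.21) can be sketched as follows; this is a hypothetical implementation with illustrative data arrays, and setting β=1\beta=-1 reproduces the GM-form (3.22):

```python
import numpy as np

def S_beta(theta, X0, Z, w, beta):
    # beta-estimating function (3.21) on combined presence/background points;
    # X0 has a leading column of ones, Z is the presence indicator, w the weights
    lam = np.exp(X0 @ theta)
    return ((Z - w * lam) * lam ** beta) @ X0

def S_gm(theta, X0, Z, w):
    # GM-estimating function (3.22), the beta -> -1 limit of S_beta
    lam = np.exp(X0 @ theta)
    return (Z / lam - w) @ X0
```

At β=0 the weight is identically one and S_beta reduces to the likelihood-equation form.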

We consider another estimator, rather than the β\beta-estimator, to seek robustness against misspecification based on the discussion above [84]. A main objective is to change the estimating weight of the β\beta-estimator into a more comprehensible form. Let FF be a cumulative distribution function defined on [0,)[0,\infty). Then, we define an estimating function

SF(θ)=j=1n{Zjwjexp(θ1x(Sj)+θ0)}F(σexp(θ1x(Sj)+θ0))x0(Sj),\displaystyle{S}_{F}(\theta)=\sum_{j=1}^{n}\{Z_{j}-w_{j}\exp(\theta_{1}^{\top}x(S_{j})+\theta_{0})\}F(\sigma\exp(\theta_{1}^{\top}x(S_{j})+\theta_{0}))x_{0}(S_{j}), (3.24)

where σ>0\sigma>0 is a hyperparameter. We call θ^F\hat{\theta}_{F}, defined as a solution of the equation SF(θ)=0{S}_{F}(\theta)=0, the FF-estimator. The unbiasedness of SF(θ){S}_{F}(\theta) can be confirmed immediately. In this definition, the estimating weight is given as F(σexp(θ1x(Sj)+θ0))F(\sigma\exp(\theta_{1}^{\top}x(S_{j})+\theta_{0})). For example, we will use a Pareto cumulative distribution function

F(t)=1(1+ηct)1η,F(t)=1-(1+\eta ct)^{-\frac{1}{\eta}},

where the shape parameter η>0\eta>0 is fixed to be 11 in a subsequent experiment. We envisage that FF expresses the existence probability of the intensity for the presence of the target species. Hence, a low value of the weight implies a low probability of presence. The plot of the estimating weights F(σλ(Si,θ))F(\sigma\lambda(S_{i},\theta)) for i=1,,Mi=1,...,M would be helpful if we knew the true value.
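A hypothetical sketch of this weight: with η=1\eta=1 the Pareto CDF reduces to F(t)=ct/(1+ct)F(t)=ct/(1+ct), which rises from 0 toward 1, so points whose fitted intensity is implausibly small receive small estimating weights (σ=1.2\sigma=1.2 below mirrors the later experiment; the helper names are ours):

```python
import numpy as np

def pareto_cdf(t, c=1.0, eta=1.0):
    # F(t) = 1 - (1 + eta*c*t)^(-1/eta); for eta = 1 this is c*t / (1 + c*t)
    return 1.0 - (1.0 + eta * c * t) ** (-1.0 / eta)

def F_weight(theta, X0, sigma=1.2):
    # estimating weight F(sigma * lambda(S_j, theta)) appearing in (3.24)
    lam = np.exp(X0 @ theta)
    return pareto_cdf(sigma * lam)
```

The weight is strictly increasing in the fitted intensity and bounded by 1, in contrast to the unbounded weight of the β\beta-estimator.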

Suppose that the log-linear model for the given data would be misspecified. We consider a specific situation for misspecification such that

λ(s)=(1ϵ)λ(s,θ)+ϵλout(s).\displaystyle\lambda(s)=(1-\epsilon)\lambda(s,\theta)+\epsilon\lambda_{\rm out}(s). (3.25)

This implies that a subgroup contaminates, with probability ϵ\epsilon, a major group whose intensity function is correctly specified. Here the subgroup has the intensity function λout(s)\lambda_{\rm out}(s), which is far away from the log-linear model λ(s,θ)\lambda(s,\theta). Geometrically speaking, the underlying intensity function λ(s)\lambda(s) in (3.25) lies in a tubular neighborhood surrounding the model {λ(s,θ):θΘ}\{\lambda(s,\theta):\theta\in\Theta\} with radius ϵ\epsilon in the space of all intensity functions. Under this misspecification, we expect that the estimating weights for the subgroup should be suppressed relative to those for the major group. It cannot be denied that in practical situations there is always a risk of model misspecification. It is comparatively easy to find outliers in presence records or background data caused by mechanical errors, and standard data preprocessing procedures are helpful for such data cleaning; however, it is difficult to find outliers under a latent structure of misspecification as above. In this regard, the FF-estimator approach is promising for such difficult situations. The hyperparameter σ\sigma should be selected by a cross-validation method so that it has an effective impact on the estimation process. We leave enhancing the clarity and practical applicability of the concepts in this approach as future work.

We present a brief numerical experiment. Assume that a feature vector set {X(sj)}j=1n\{X(s_{j})\}_{j=1}^{n} on presence and background grids is generated from a bivariate normal distribution 𝙽𝚘𝚛(0,I){\tt Nor}(0,{\rm I}), where I\rm I denotes the 2-dimensional identity matrix. Our simulation scenarios for the intensity function were organized as follows.

(a). Specified model: λ(s)=λ(s,θ0)\hskip 17.07164pt\lambda(s)=\lambda(s,\theta_{0}), where λ(s,θ0)=exp{θ01X(s)+θ00}.\lambda(s,\theta_{0})=\exp\{\theta_{01}^{\top}X(s)+\theta_{00}\}.

(b). Misspecified model: λ(s)=(1ϵ)λ(s,θ0)+ϵλ(s,θ)\hskip 2.56073pt\lambda(s)=(1-\epsilon)\lambda(s,\theta_{0})+\epsilon\lambda(s,\theta_{*}), where θ=(θ00,θ01).\theta_{*}=(\theta_{00},-\theta_{01}).

Here the parameters were set as θ00=0.5\theta_{00}=0.5, θ01=(1.5,1.0)\theta_{01}=(1.5,-1.0)^{\top}, and ϵ=0.1\epsilon=0.1. In case (b), λ(s,θ)\lambda(s,\theta_{*}) is a specific example of λout(s)\lambda_{\rm out}(s) in (3.25), which implies that the subgroup has the intensity function with the slope parameter negated relative to the major group. In ecological studies, a major group of a species might thrive under conditions where a few others do not, and vice versa; using a negated parameter can imitate this kind of inverse relationship. See Figure 3.1 for the plot of presence numbers against the two-dimensional feature vectors. The presence numbers were generated from the misspecified model (b) with simulation size 10001000.

Figure 3.1: Plot of presence numbers with major group (red), minor group (blue) and backgrounds (gray).
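A data-generating process of the kind plotted in Figure 3.1 can be sketched as follows. This is a hypothetical simulation fragment (the grid size and the unit-area Poisson sampling of presence numbers are our assumptions, not the exact setup used for the figure):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000  # number of grid points (assumed)
X = rng.multivariate_normal(np.zeros(2), np.eye(2), size=n)

theta00 = 0.5
theta01 = np.array([1.5, -1.0])
eps = 0.1  # contamination ratio

lam_major = np.exp(X @ theta01 + theta00)      # correctly specified component
lam_minor = np.exp(X @ (-theta01) + theta00)   # subgroup with negated slope
lam = (1 - eps) * lam_major + eps * lam_minor  # misspecified intensity, scenario (b)

# presence numbers per grid cell, treating each cell as having unit area (assumed)
counts = rng.poisson(lam)
```

The two components are largest in opposite directions of the feature space, which produces the two colored clouds seen in the figure.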

We compared the ML-estimator θ^0\hat{\theta}_{0}, the FF-estimator (σ=1.2)(\sigma=1.2), and the β\beta-estimator θ^β\hat{\theta}_{\beta} (β=0.1)(\beta=-0.1), where the simulation was conducted with 300 replications. In the case of the correctly specified model, the ML-estimator was slightly superior to the FF-estimator and the β\beta-estimator in terms of the root mean square error (rmse); however, its advantage over the FF-estimator was slight. Next, we supposed a mixture model of two intensity functions, in which one was the log-linear model as above with mixing ratio 0.90.9, and the other was still a log-linear model but with the negated slope vector and mixing ratio 0.10.1. Under this setting, the FF-estimator was especially superior to the ML-estimator and the β\beta-estimator, with the rmse of the ML-estimator more than double that of the FF-estimator. The β\beta-estimator was less robust to this misspecification. Thus, the ML-estimator is sensitive to the presence of such a heterogeneous subgroup, while the FF-estimator is robust. We consider that the estimating weight F(σλ(s,θ))F(\sigma\lambda(s,\theta)) effectively suppresses the influence of the subgroup in the estimating function of the FF-estimator. See Table 3.1 for details and Figure 3.2 for the plot of the three estimators in case (b). We observe in this numerical experiment that the FF-estimator has almost the same performance as the ML-estimator when the model is correctly specified, and more robust performance than the ML-estimator when the model is partially misspecified.

Table 3.1: Comparison among the ML-estimator, FF-estimator and β\beta-estimator.

(a). The case of specified model

Method estimate rmse
ML-estimator (0.499,1.497,1.003)({0.499,1.497,-1.003}) 0.1520.152
FF-estimator (0.501,1.496,1.002)({0.501,1.496,-1.002}) 0.1560.156
β\beta-estimator (0.707,1.498,1.004)({0.707,1.498,-1.004}) 0.2590.259

(b). The case of misspecified model

Method estimate rmse
ML-estimator (0.930,1.208,0.784)({0.930,1.208,-0.784}) 0.8690.869
FF-estimator (0.562,1.410,0.935)({0.562,1.410,-0.935}) 0.3790.379
β\beta-estimator (1.211,1.163,0.750)({1.211,1.163,-0.750}) 1.0681.068
Figure 3.2: Box-whisker Plots of the ML-estimator, FF-estimator and β\beta-estimator

We note that the FF-estimator θ^F\hat{\theta}_{F} is defined by the solution of the equation SF(θ)=0{S}_{F}(\theta)=0, while an objective function has not been given explicitly. However, it follows from the Poincaré lemma that there is an objective function LF(θ)L_{F}(\theta) such that θ^F=argminθΘLF(θ)\hat{\theta}_{F}=\mathop{\rm argmin}_{\theta\in\Theta}L_{F}(\theta), because SF(θ){S}_{F}(\theta) is integrable: its Jacobian matrix is symmetric. See [4] for geometric insights into the Poincaré lemma. In effect, we have the solution as follows.

Proposition 16.

Let FF be a cumulative distribution function on [0,)[0,\infty). Consider a loss function for a model λ(s,θ)\lambda(s,\theta) defined by

LF(θ)=i=1MaF(λ(Si,θ))+j=1nwjbF(λ(Sj,θ)),\displaystyle L_{F}(\theta)=-\sum_{i=1}^{M}a_{F}(\lambda(S_{i},\theta))+\sum_{j=1}^{n}w_{j}b_{F}(\lambda(S_{j},\theta)),

where aF(λ)=0λF(z)z𝑑za_{F}(\lambda)=\int_{0}^{\lambda}\frac{F(z)}{z}dz and bF(λ)=0λF(z)𝑑z.b_{F}(\lambda)=\int_{0}^{\lambda}{F(z)}dz. Then, if the model is a log-linear model λ(s,θ)=exp{θ1x(s)+θ0}\lambda(s,\theta)=\exp\{\theta_{1}^{\top}x(s)+\theta_{0}\}, the estimating function is given by SF(θ){S}_{F}(\theta) in (3.24).

Proof.

The gradient vector of LF(θ)L_{F}(\theta) is given by

θLF(θ)=i=1MaF(λ(Si,θ))θλ(Si,θ)+j=1nwjbF(λ(Sj,θ))θλ(Sj,θ).\displaystyle\frac{\partial}{\partial\theta}L_{F}(\theta)=-\sum_{i=1}^{M}a_{F}^{\prime}(\lambda(S_{i},\theta))\frac{\partial}{\partial\theta}\lambda(S_{i},\theta)+\sum_{j=1}^{n}w_{j}b_{F}^{\prime}(\lambda(S_{j},\theta))\frac{\partial}{\partial\theta}\lambda(S_{j},\theta).

Since aF(λ)=F(λ)/λa_{F}^{\prime}(\lambda)=F(\lambda)/\lambda and bF(λ)=F(λ)b_{F}^{\prime}(\lambda)=F(\lambda), this gradient equals SF(θ)-{S}_{F}(\theta), where

SF(θ)\displaystyle{S}_{F}(\theta) =i=1MF(λ(Si,θ))θlogλ(Si,θ)j=1nwjλ(Sj,θ)F(λ(Sj,θ))θlogλ(Sj,θ)\displaystyle=\sum_{i=1}^{M}F(\lambda(S_{i},\theta))\frac{\partial}{\partial\theta}\log\lambda(S_{i},\theta)-\sum_{j=1}^{n}w_{j}\lambda(S_{j},\theta)F(\lambda(S_{j},\theta))\frac{\partial}{\partial\theta}\log\lambda(S_{j},\theta)
=j=1n(Zjwjλ(Sj,θ))F(λ(Sj,θ))θlogλ(Sj,θ),\displaystyle=\sum_{j=1}^{n}(Z_{j}-w_{j}{\lambda(S_{j},\theta)})F(\lambda(S_{j},\theta))\frac{\partial}{\partial\theta}\log\lambda(S_{j},\theta), (3.26)

where ZjZ_{j} is the presence indicator. Hence, we conclude that SF(θ){S}_{F}(\theta) is equal to that given in (3.24) under a log-linear model. ∎
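To make Proposition 16 concrete, for the Pareto CDF with η=1\eta=1 and σ\sigma absorbed into FF, the integrals have closed forms: aF(λ)=log(1+cλ)a_{F}(\lambda)=\log(1+c\lambda) and bF(λ)=λlog(1+cλ)/cb_{F}(\lambda)=\lambda-\log(1+c\lambda)/c. The hypothetical sketch below builds the loss LF(θ)L_{F}(\theta) and the estimating function, so that a finite-difference check can confirm that the gradient of LF(θ)L_{F}(\theta) matches the estimating-function form up to sign, meaning that minimizing LF(θ)L_{F}(\theta) solves SF(θ)=0{S}_{F}(\theta)=0:

```python
import numpy as np

c = 1.0  # Pareto scale; shape eta fixed to 1, so F(z) = c*z / (1 + c*z)

def a_F(lam):
    # integral of F(z)/z over (0, lam]: log(1 + c*lam)
    return np.log1p(c * lam)

def b_F(lam):
    # integral of F(z) over (0, lam]: lam - log(1 + c*lam)/c
    return lam - np.log1p(c * lam) / c

def L_F(theta, X0, Z, w):
    # loss of Proposition 16 on combined presence/background points
    lam = np.exp(X0 @ theta)
    return -np.sum(Z * a_F(lam)) + np.sum(w * b_F(lam))

def S_F(theta, X0, Z, w):
    # estimating function (3.24) with sigma absorbed into F
    lam = np.exp(X0 @ theta)
    return ((Z - w * lam) * (c * lam / (1.0 + c * lam))) @ X0
```

Differentiating L_F numerically and comparing with -S_F reproduces the identity in the proof.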

The FF-estimator is derived by minimization of this loss function LF(θ)L_{F}(\theta). Hence, we ask whether there is a divergence measure that induces LF(θ)L_{F}(\theta).

Remark 4.

We have a quick review of the Bregman divergence that is defined by

DU(λ,η)=[U(λ(s))U(η(s))U(η(s)){λ(s)η(s)}]ds\displaystyle D_{U}(\lambda,\eta)=\int[U(\lambda(s))-U(\eta(s))-U^{\prime}(\eta(s))\{\lambda(s)-\eta(s)\}]{\rm d}s

where UU is a convex function. The loss function is given by

LU(θ)=i=1MU(λ(Si,θ))+j=1nwj{λ(Sj,θ)U(λ(Sj,θ))U(λ(Sj,θ))}\displaystyle L_{U}(\theta)=-\sum_{i=1}^{M}U^{\prime}(\lambda(S_{i},\theta))+\sum_{j=1}^{n}w_{j}\{\lambda(S_{j},\theta)U^{\prime}(\lambda(S_{j},\theta))-U(\lambda(S_{j},\theta))\}

and the estimating function is

SU(θ)=j=1n{Zjwjλ(Sj,θ)}U′′(λ(Sj,θ))λ(Sj,θ)θlogλ(Sj,θ).\displaystyle{S}_{U}(\theta)=\sum_{j=1}^{n}\{Z_{j}-w_{j}\lambda(S_{j},\theta)\}{U^{\prime\prime}(\lambda(S_{j},\theta))}{\lambda(S_{j},\theta)}\frac{\partial}{\partial\theta}\log\lambda(S_{j},\theta).

Therefore, we observe that, if F(z)=U′′(z)zF(z)=U^{\prime\prime}(z)z, then SF(θ)=SU(θ){S}_{F}(\theta)={S}_{U}(\theta), where SF(θ){S}_{F}(\theta) is defined in (3.26). This implies that the divergence, DF(λ,η)D_{F}(\lambda,\eta), associated with SF(θ){S}_{F}(\theta) is equal to the Bregman divergence DU(λ,η)D_{U}(\lambda,\eta) with the generator UU satisfying F(z)=U′′(z)zF(z)=U^{\prime\prime}(z)z. That is,

DF(λ,η)=[AF(λ(s))AF(η(s))aF(η(s)){λ(s)η(s)}]ds,\displaystyle D_{F}(\lambda,\eta)=\int[A_{F}(\lambda(s))-A_{F}(\eta(s))-a_{F}(\eta(s))\{\lambda(s)-\eta(s)\}]{\rm d}s,

where aFa_{F} and bFb_{F} are defined in Proposition 16 and AF(z)=0zaF(s)dsA_{F}(z)=\int_{0}^{z}a_{F}(s){\rm d}s.

3.5 Conclusion and Future Work

Our study marks a significant advance in the field of species distribution modeling by introducing a novel approach that leverages Poisson Point Processes (PPP) and alternative divergence measures. The key contribution of this work is the development of the FF-estimator, a robust and efficient tool designed to overcome the limitations of traditional ML-methods, particularly in cases of model misspecification.

The FF-estimator, based on cumulative distribution functions, demonstrates superior performance in our simulations. This robustness is particularly notable in handling model misspecification, a common challenge in ecological data analysis. Our approach provides ecologists and conservationists with a more reliable tool for predicting species distributions, which is crucial for biodiversity conservation efforts and ecological planning. We also explored the computational aspects of the FF-estimator, finding it to be computationally feasible for practical applications, despite its advanced statistical underpinnings. While our study offers significant contributions, it also opens up several avenues for future research: Further validation of the FF-estimator in diverse ecological settings and with different species is necessary to establish its generalizability and practical utility. The integration of the FF-estimator with other types of ecological data, such as presence-only data, would enhance its applicability. There is scope for further refining the computational algorithms to enhance the efficiency of the FF-estimator, making it more accessible for large-scale ecological studies. Exploring the applicability of this method in other scientific disciplines, such as environmental science and geography, could be a fruitful area of research. In conclusion, our work not only contributes to the theoretical underpinnings of species distribution modeling but also has significant practical implications for ecological research and conservation strategies.

The intensity function is modeled based on environmental variables, reflecting habitat preferences. This process typically involves a dataset of locations where species and environmental information have been observed, along with accurate and high-quality background data. With precise training on these datasets, reliable predictions can be derived using maximum likelihood methods in Poisson point process modeling. These predictions are easily obtained by plugging the maximum likelihood estimates into the predictors. While Poisson point process modeling and the maximum likelihood method can derive reliable predictions from observed data, predicting for 'unsampled' areas that differ significantly from the observed regions poses a significant challenge [97, 64].

The ability to predict the occurrence of target species in unobserved areas using datasets of observed locations, environmental information, and background data is a pivotal issue in species distribution modeling (SDM) and ecological research. Applying these models to regions that differ significantly from those included in the training dataset introduces several technical and methodological challenges. To address this issue, exploring predictions based on the similarity of environmental variables is essential. One promising approach relies on ecological similarity rather than geographical proximity, making it particularly effective for species with wide distributions or fragmented habitats. Additionally, by adopting a weighted likelihood approach and linking Poisson point processes through a probability kernel function between observed and unobserved areas, it becomes possible to efficiently predict the probability of species occurrence in unobserved areas. We believe that the methodologies developed in this study will inspire further innovations in statistical ecology and beyond.

Chapter 4 Minimum divergence in machine learning

We discuss divergence measures and applications encompassing several areas of machine learning: Boltzmann machines, gradient boosting, active learning, and cosine similarity. Boltzmann machines have seen wide development as generative models with the help of statistical dynamics. The ML-estimator is a basic device for data learning, but its computation is challenging owing to the evaluation of the partition function. We introduce the GM-divergence and the GM-estimator for Boltzmann machines. The GM-estimator is shown to admit fast computation, since it is free of partition-function evaluation. Next, we focus on active learning, particularly the Query by Committee method, highlighting how divergence measures can be used to select informative data points and integrating statistical and machine learning concepts. Finally, we extend the γ\gamma-divergence to a space of real-valued functions. This yields a natural extension of cosine similarity, called the γ\gamma-cosine similarity. Its basic properties are explored and demonstrated in numerical experiments in comparison with traditional cosine similarity.

4.1 Boltzmann machine

Boltzmann Machines (BMs) are a class of stochastic recurrent neural networks introduced in the early 1980s, crucial in bridging the realms of statistical physics and machine learning; see [44, 43] for the mechanics of BMs, and [35] for comprehensive foundations in the theory underlying neural networks and deep learning. They have become fundamental for understanding and developing more advanced generative models. BMs are statistical models that learn to represent the underlying probability distributions of a dataset. They consist of visible and hidden units, where the visible units correspond to the observed data and the hidden units capture the latent features. Usually, the connections between these units are symmetrical, which means the weight matrix is symmetric. The energy of a configuration in a BM is calculated using an energy function, typically defined by the biases of units and the weights of the connections between units. The partition function is a normalizing factor that ensures the probabilities sum to 1, obtained by summing the exponentiated negative energy over all possible configurations of the units [35].

Training a BM involves adjusting the parameters (weights and biases) to maximize the likelihood of the observed data. This is often done via stochastic maximum likelihood or contrastive divergence. The log-likelihood gradient has a simple form, but computing it exactly is intractable due to the partition function. Thus, approximations or sampling methods like Markov chain Monte Carlo are used. BMs have been extended to more complex and efficient models like Restricted BMs and deep belief networks. They have found applications in dimensionality reduction, topic modeling, and collaborative filtering among others. We overview the principles and applications of BMs, especially in exploring the landscape of energy-based models and the geometrical insights into the learning dynamics of such models. The exploration of divergence, cross-entropy, and entropy in the context of BMs might yield profound understandings, potentially propelling advancements in both theoretical and practical domains of machine learning and artificial intelligence.

Let 𝒫\mathcal{P} be the space of all probability mass functions defined on a finite discrete set 𝒳={1,1}d{\mathcal{X}}=\{-1,1\}^{d}, that is

𝒫={p(x):p(x)>0(x𝒳),x𝒳p(x)=1},\displaystyle{\mathcal{P}}=\Big{\{}p(x):p(x)>0\ (\forall x\in{\mathcal{X}})\ ,\sum_{x\in{\mathcal{X}}}p(x)=1\Big{\}},

in which p(x)p(x) is called a dd-variate Boltzmann distribution. A standard BM in 𝒫\mathcal{P} is introduced as

p(x,θ)=1Zθexp{E(x,θ)}\displaystyle p(x,\theta)=\frac{1}{Z_{\theta}}\exp\{-E(x,\theta)\}

for x𝒳x\in{\mathcal{X}}, where E(x,θ)E(x,\theta) is the energy function defined by

E(x,θ)=bxxWx\displaystyle E(x,\theta)=-b^{\top}x-x^{\top}{W}x

with a parameter θ=(b,W)\theta=(b,{W}). Here ZθZ_{\theta} is the partition function defined by

Zθ=x𝒳exp{E(x,θ)}.\displaystyle Z_{\theta}=\sum_{x\in{\mathcal{X}}}\exp\{-E(x,\theta)\}.
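For instance, the partition function can be computed by brute-force enumeration, which is feasible only for small dd; the 2d2^{d} growth of this sum is exactly what makes likelihood-based learning hard. A hypothetical sketch:

```python
import itertools
import numpy as np

def energy(x, b, W):
    # E(x, theta) = -b^T x - x^T W x
    return -b @ x - x @ W @ x

def partition(b, W):
    # Z_theta: sum of exp{-E(x, theta)} over all 2^d configurations in {-1, 1}^d
    d = len(b)
    return sum(np.exp(-energy(np.array(xs), b, W))
               for xs in itertools.product([-1, 1], repeat=d))
```

With b = 0 and W = 0, every configuration has zero energy and the sum is simply 2^d.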

The Kullback-Leibler (KL) divergence is written as

D0(p,p(,θ))=x𝒳p(x)logp(x)p(x,θ)\displaystyle D_{0}(p,p(\cdot,\theta))=\sum_{x\in{\mathcal{X}}}p(x)\log\frac{p(x)}{p(x,\theta)}

which involves the partition function ZθZ_{\theta}. The negative log-likelihood function for a given dataset {xi}i=1,,N\{x_{i}\}_{i=1,...,N} is written as

L0(θ)=i=1NE(xi,θ)+NlogZθ\displaystyle L_{0}(\theta)=\sum_{i=1}^{N}E(x_{i},\theta)+N\log Z_{\theta}

and the estimating function is given by

SML(θ)=1Ni=1N[xi𝔼θ[X]xixi𝔼θ[XX]]\displaystyle{S}_{\rm ML}(\theta)=\frac{1}{N}\sum_{i=1}^{N}\begin{bmatrix}\!\!\!\!x_{i}-\mathbb{E}_{\theta}[X]\vspace{1mm}\\ x_{i}x_{i}^{\top}-\mathbb{E}_{\theta}[XX^{\top}]\end{bmatrix}

where 𝔼θ\mathbb{E}_{\theta} denotes the expectation with respect to p(x,θ)p(x,\theta). In practice, the likelihood computation is known to be infeasible because it requires a sequential procedure with large summations involving 𝔼θ\mathbb{E}_{\theta} or ZθZ_{\theta}. There is much literature discussing approximate computations such as variational approximations and Markov chain Monte Carlo simulations [42].

On the other hand, we observe that the computation of the GM-divergence D_{\rm GM} does not require any evaluation of the partition function, as follows. The GM-divergence is defined by

\displaystyle D_{\rm GM}(p,p(\cdot,\theta))=\sum_{x\in{\mathcal{X}}}\frac{p(x)}{p(x,\theta)}\prod_{x^{\prime}\in{\mathcal{X}}}p(x^{\prime},\theta)^{\frac{1}{m}}. (4.1)

This is written as

DGM(p,p(,θ))=x𝒳p(x)exp{E(x,θ)E¯(θ)}\displaystyle D_{\rm GM}(p,p(\cdot,\theta))=\sum_{x\in{\mathcal{X}}}{p(x)}{\exp\{E(x,\theta)-\bar{E}(\theta)\}}

where \bar{E}(\theta) is an averaged energy given by \bar{E}(\theta)=\frac{1}{m}\sum_{x\in{\mathcal{X}}}E(x,\theta), with m=|{\mathcal{X}}|=2^{d} being the number of states. Note that the averaged energy is written as

\displaystyle\bar{E}(\theta)=-b^{\top}\bar{x}-{\rm tr}({W}\overline{xx^{\top}}), (4.2)

where \bar{x}=\frac{1}{m}\sum_{x\in{\mathcal{X}}}x and \overline{xx^{\top}}=\frac{1}{m}\sum_{x\in{\mathcal{X}}}xx^{\top}. We observe that D_{\rm GM}(p,p(\cdot,\theta)) is free of the partition function Z_{\theta}, owing to the cancellation of Z_{\theta} between the two factors on the right-hand side of (4.1).

For a given dataset {Xi}i=1,,N\{X_{i}\}_{i=1,...,N}, the GM-loss function for θ\theta is defined by

LGM(θ)=1Ni=1Nexp{E(Xi,θ)E¯(θ)}\displaystyle L_{\rm GM}(\theta)=\frac{1}{N}\sum_{i=1}^{N}{\exp\{E(X_{i},\theta)-\bar{E}(\theta)\}}

and the minimizer θ^GM\hat{\theta}_{\rm GM} is called the GM-estimator. The estimating function is given by

SGM(θ)=1Ni=1Nexp{E(Xi,θ)E¯(θ)}[XiX¯XiXiXX¯].\displaystyle{S}_{\rm GM}(\theta)=\frac{1}{N}\sum_{i=1}^{N}{\exp\{E(X_{i},\theta)-\bar{E}(\theta)\}}\begin{bmatrix}X_{i}-\bar{X}\vspace{1mm}\\ X_{i}X_{i}^{\top}-\overline{XX^{\top}}\end{bmatrix}.
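By contrast, evaluating the GM-loss and its estimating function requires no enumeration: on {\mathcal{X}}=\{-1,1\}^{d} the state mean of x is 0 and that of xx^{\top} is the identity, so the averaged energy reduces to -{\rm tr}(W). A minimal sketch under these conventions (function names are ours):

```python
import numpy as np

def energy(x, b, W):
    # E(x, theta) = -b^T x - x^T W x
    return -b @ x - x @ W @ x

def gm_loss(X, b, W):
    # L_GM = (1/N) sum_i exp{E(X_i) - Ebar}; on {-1,1}^d the averaged
    # energy is Ebar = -tr(W), since the state mean of x is 0 and the
    # state mean of x x^T is the identity -- no 2^d summation needed
    Ebar = -np.trace(W)
    return float(np.mean([np.exp(energy(x, b, W) - Ebar) for x in X]))

def gm_score(X, b, W):
    # estimating function S_GM: weighted deviations of the sufficient
    # statistics (x, x x^T) from their state averages
    d = len(b)
    Ebar = -np.trace(W)
    w = np.array([np.exp(energy(x, b, W) - Ebar) for x in X])
    g_b = (w[:, None] * X).mean(axis=0)          # state mean of x is 0
    g_W = np.mean([wi * (np.outer(x, x) - np.eye(d)) for wi, x in zip(w, X)],
                  axis=0)
    return g_b, g_W
```

At \theta=0 all weights \exp\{E-\bar{E}\} equal one, so the loss is 1 and the score reduces to the empirical moments, which is a convenient check.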

Accordingly, the computation for finding \hat{\theta}_{\rm GM} is drastically lighter than that for the ML-estimator. For example, a stochastic gradient algorithm can be applied in a feasible manner. In some cases, a Newton-type algorithm may also be applicable, suggested as the update

\displaystyle\theta\longleftarrow\theta+\Big{\{}\sum_{i=1}^{N}{\exp\{E(X_{i},\theta)-\bar{E}(\theta)\}}{S}(X_{i})\Big{\}}^{-1}{S}_{\rm GM}(\theta)

where

S(Xi)=[XiX¯XiXiXX¯][XiX¯,XiXiXX¯].\displaystyle{S}(X_{i})=\begin{bmatrix}X_{i}-\bar{X}\vspace{1mm}\\ X_{i}X_{i}^{\top}-\overline{XX^{\top}}\end{bmatrix}\begin{bmatrix}X_{i}^{\top}-\bar{X}^{\top},X_{i}X_{i}^{\top}-\overline{XX^{\top}}\end{bmatrix}.

This discussion also applies to deep BMs built from restricted BMs.

Here we present a small simulation study to demonstrate the fast computation of the GM-estimator compared to the ML-estimator. The computation time can of course vary with the hardware specifications and other running processes. The simulation is done by a Python program on Google Colaboratory
(https://research.google.com/colaboratory). Keep in mind that the computation of the partition function Z_{\theta} is extremely challenging in large dimensions due to the exponential number of terms; for simplicity, this implementation does not optimize that calculation and is not feasible for very large dimensions. Figure 4.1 plots the computation times for the log-likelihood and the GM loss across dimensions d=5,\ldots,10 of the BM, where the sample size n is fixed at 100. The computation time for the log-likelihood increases significantly with the dimension, reflecting the exponential growth in the number of states summed over in the partition function; indeed, it is not feasible to compute the log-likelihood within a reasonable time frame for dimensions approaching d=20 with the current method. On the other hand, the computation time for the GM loss remains low and stable across these dimensions, which reinforces its computational efficiency, particularly in higher dimensions. This result suggests that the GM loss offers a computationally efficient alternative to the ML-estimator as the dimensionality of the problem increases. For dimensions higher than 10, the naive gradient algorithm for the ML-estimator cannot converge in the allotted time, while that for the GM-estimator works well if d\leq 100. When d=100 and N=5000, the computation time is approximately 0.811 seconds.

Figure 4.1: Computation time for the ML-estimator and GM-estimator for d=5,…,10
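The timing experiment behind Figure 4.1 can be sketched as the following minimal harness (not the original Colaboratory script; the ±1 coding, the random Gaussian parameters, and all names are our assumptions):

```python
import itertools
import time
import numpy as np

def _energies(X, b, W):
    # E(x, theta) = -b^T x - x^T W x, evaluated row-wise
    return np.array([-b @ x - x @ W @ x for x in X])

def time_losses(d, N, seed=0):
    # wall-clock comparison: exact log-likelihood (enumerates 2^d states
    # for Z_theta) versus the partition-free GM loss
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(N, d))
    b = rng.normal(size=d)
    W = 0.1 * rng.normal(size=(d, d))

    t0 = time.perf_counter()
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    logZ = np.log(np.sum(np.exp(-_energies(states, b, W))))
    nll = np.sum(_energies(X, b, W)) + N * logZ
    t_ml = time.perf_counter() - t0

    t0 = time.perf_counter()
    Ebar = -np.trace(W)              # closed-form mean energy on {-1,1}^d
    gm = np.mean(np.exp(_energies(X, b, W) - Ebar))
    t_gm = time.perf_counter() - t0
    return t_ml, t_gm, nll, gm
```

The likelihood timing should grow roughly like 2^{d}, while the GM timing depends only on N, reproducing the qualitative gap reported in Figure 4.1.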

Consider a Boltzmann distribution with visible and hidden units

p(x,h,θ)=1Zθexp{E(x,h,θ)}\displaystyle p(x,h,\theta)=\frac{1}{Z_{\theta}}\exp\{-E(x,h,\theta)\}

for (x,h)𝒳×(x,h)\in{\mathcal{X}}\times{\mathcal{H}}, where 𝒳={0,1}d{\mathcal{X}}=\{0,1\}^{d}, ={0,1}{\mathcal{H}}=\{0,1\}^{\ell} and E(x,h,θ)E(x,h,\theta) is the energy function defined by

E(x,h,θ)=bxchxWh\displaystyle E(x,h,\theta)=-b^{\top}x-c^{\top}h-x^{\top}{W}h

with a parameter \theta=(b,c,{W}). Here Z_{\theta} is the partition function defined by

Zθ=(x,h)𝒳×exp{E(x,h,θ)}.\displaystyle Z_{\theta}=\sum_{(x,h)\in{\mathcal{X}}\times{\mathcal{H}}}\exp\{-E(x,h,\theta)\}.

The marginal distribution is given by

p(x,θ)=1Zθhexp{E(x,h,θ)},\displaystyle p(x,\theta)=\frac{1}{Z_{\theta}}\sum_{h\in{\mathcal{H}}}\exp\{-E(x,h,\theta)\},

and the GM-divergence is given by

\displaystyle D_{\rm GM}(p,p(\cdot,\theta)) =\sum_{x\in{\mathcal{X}}}{p(x)}\Big{[}\sum_{h\in{\mathcal{H}}}\exp\{-E(x,h,\theta)\}\Big{]}^{-1}\prod_{x^{\prime}\in{\mathcal{X}}}\Big{[}\sum_{h\in{\mathcal{H}}}\exp\{-E(x^{\prime},h,\theta)\}\Big{]}^{\frac{1}{m}}
=\sum_{x\in{\mathcal{X}}}{p(x)}{\exp\{\tilde{E}(x,\theta)-\bar{E}(\theta)\}} (4.3)

where E¯(θ)=1mx𝒳E~(x,θ)\bar{E}(\theta)={\frac{1}{m}}\sum_{x\in{\mathcal{X}}}\tilde{E}(x,\theta) and

E~(x,θ)=loghexp{E(x,h,θ)}.\displaystyle\tilde{E}(x,\theta)=-\log\sum_{h\in{\mathcal{H}}}\exp\{-E(x,h,\theta)\}.

Note that the bias term \bar{E}(\theta) cannot be written in terms of the sufficient statistics as in (4.2). For a given dataset \{x_{i}\}_{i=1,...,N}, the GM-loss function for \theta is defined by

LGM(θ)\displaystyle L_{\rm GM}(\theta) =1Ni=1N[hexp{E(Xi,h,θ)}]1x𝒳[hexp{E(x,h,θ)}]1m\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\Big{[}\sum_{h\in{\mathcal{H}}}\exp\{-E(X_{i},h,\theta)\}\Big{]}^{-1}\prod_{x\in{\mathcal{X}}}\Big{[}\sum_{h\in{\mathcal{H}}}\exp\{-E(x,h,\theta)\}\Big{]}^{\frac{1}{m}}
=1Ni=1Nexp{E~(Xi,θ)E¯(θ)}.\displaystyle=\frac{1}{N}\sum_{i=1}^{N}{\exp\{\tilde{E}(X_{i},\theta)-\bar{E}(\theta)\}}. (4.4)

The estimating function is given by

\displaystyle{S}_{\rm GM}(\theta) =\frac{1}{N}\sum_{i=1}^{N}{\exp\{\tilde{E}(X_{i},\theta)-\bar{E}(\theta)\}}\begin{bmatrix}X_{i}-\bar{X}\vspace{1.5mm}\\ \mathbb{E}_{\theta}[H|X_{i}]-\overline{\mathbb{E}_{\theta}[H|X]}\vspace{1.5mm}\\ \mathbb{E}_{\theta}[X_{i}H^{\top}|X_{i}]-\overline{\mathbb{E}_{\theta}[XH^{\top}|X]}\end{bmatrix} (4.5)

where

p(h|x,θ)=p(x,h,θ)p(x,θ)=exp{E(x,h,θ)}exp{E~(x,θ)}\displaystyle p(h|x,\theta)=\frac{p(x,h,\theta)}{p(x,\theta)}=\frac{\exp\{-E(x,h,\theta)\}}{\exp\{-\tilde{E}(x,\theta)\}} (4.6)

and

𝔼θ[H|X]¯=1mx𝒳𝔼θ[H|x],𝔼θ[XH|X]¯=1mx𝒳𝔼θ[xH|x].\displaystyle\overline{\mathbb{E}_{\theta}[H|X]}={\frac{1}{m}}\sum_{x\in{\mathcal{X}}}\mathbb{E}_{\theta}[H|x],\ \ \ \overline{\mathbb{E}_{\theta}[XH^{\top}|X]}={\frac{1}{m}}\sum_{x\in{\mathcal{X}}}\mathbb{E}_{\theta}[xH^{\top}|x].

For H=(H_{k})_{k=1}^{\ell}, the components H_{k} are conditionally independent given x, as seen in

p(h|x,θ)\displaystyle p(h|x,\theta) =exp{bx+ch+xWh}hexp{bx+ch+xWh}\displaystyle=\frac{\exp\{b^{\top}x+c^{\top}h+x^{\top}{W}h\}}{\sum_{h^{\prime}\in{\mathcal{H}}}\exp\{b^{\top}x+c^{\top}h^{\prime}+x^{\top}{W}h^{\prime}\}} (4.7)
=k=1exp{ckhk+j=1dxjWjkhk}1+exp{ck+j=1dxjWjk},\displaystyle=\prod_{k=1}^{\ell}\frac{\exp\{c_{k}h_{k}+\sum_{j=1}^{d}x_{j}{W}_{jk}h_{k}\}}{1+\exp\{c_{k}+\sum_{j=1}^{d}x_{j}{W}_{jk}\}},

and hence,

𝔼θ[Hk|X]¯=1mxexp{ck+j=1dxjWjk}1+exp{ck+j=1dxjWjk}\displaystyle\overline{\mathbb{E}_{\theta}[H_{k}|X]}={\frac{1}{m}}\sum_{x}\frac{\exp\{c_{k}+\sum_{j=1}^{d}x_{j}{W}_{jk}\}}{1+\exp\{c_{k}+\sum_{j=1}^{d}x_{j}{W}_{jk}\}}

and

𝔼θ[Hk|X(j)]¯=1mx(j)exp{ck+Wjk+jjxjWjk}1+exp{ck+Wjk+jjxjWjk},\displaystyle\overline{\mathbb{E}_{\theta}[H_{k}|X_{(-j)}]}={\frac{1}{m}}\sum_{\ \ x_{(-j)}}\frac{\exp\{c_{k}+{W}_{jk}+\sum_{j^{\prime}\neq j}x_{j^{\prime}}{W}_{j^{\prime}k}\}}{1+\exp\{c_{k}+{W}_{jk}+\sum_{j^{\prime}\neq j}x_{j^{\prime}}{W}_{j^{\prime}k}\}},

where x(j)=(x1,,xj1,xj+1,,xd)x_{(-j)}=(x_{1},...,x_{j-1},x_{j+1},...,x_{d}).

Note that the conditional expectation \mathbb{E}_{\theta}[\ \cdot\ |x] in the estimating function (4.5) can be evaluated by p(h|x,\theta) in (4.7), which is free of the partition function Z_{\theta}. A stochastic gradient algorithm is thus easily implemented with fast computation.
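The partition-free ingredients of the estimating function (4.5) can be sketched as follows: the conditional mean of the hidden units from (4.7), and the closed form \tilde{E}(x,\theta)=-b^{\top}x-\sum_{k}\log(1+\exp\{c_{k}+(x^{\top}W)_{k}\}) obtained by summing out h\in\{0,1\}^{\ell} (function names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cond_hidden_mean(x, c, W):
    # E_theta[H | x]: by the conditional independence in (4.7), the k-th
    # hidden unit is Bernoulli with success probability
    # sigmoid(c_k + sum_j x_j W_jk) -- no partition function involved
    return sigmoid(c + x @ W)

def e_tilde(x, b, c, W):
    # E~(x, theta) = -log sum_h exp{-E(x, h, theta)}
    #             = -b^T x - sum_k log(1 + exp{c_k + (x^T W)_k})
    return -b @ x - np.sum(np.logaddexp(0.0, c + x @ W))
```

With c = 0 and W = 0 each hidden unit is a fair coin and \tilde{E}(x)=-b^{\top}x-\ell\log 2, which provides a quick check.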

Next, consider a Boltzmann distribution connecting visible and hidden units to an output variable as

p(x,h,y,θ)=1Zθexp{E(x,h,y,θ)}\displaystyle p(x,h,y,\theta)=\frac{1}{Z_{\theta}}\exp\{-E(x,h,y,\theta)\}

for (x,h,y)𝒳××𝒴(x,h,y)\in{\mathcal{X}}\times{\mathcal{H}}\times{\mathcal{Y}}, where E(x,h,y,θ)E(x,h,y,\theta) is the energy function defined by

\displaystyle E(x,h,y,\theta)=-b^{\top}x-c^{\top}h-d^{\top}e(y)-x^{\top}{W}h-h^{\top}{U}e(y)

with e(y)=(0,\ldots,0,\overset{(y)}{1},0,\ldots,0)^{\top} for y\in{\mathcal{Y}} and a parameter \theta=(b,c,d,{W},{U}). Here Z_{\theta} is the partition function defined by

\displaystyle Z_{\theta}=\sum_{(x,h,y)\in{\mathcal{X}}\times{\mathcal{H}}\times{\mathcal{Y}}}\exp\{-E(x,h,y,\theta)\}.

The marginal distribution is given by

\displaystyle p(x,y,\theta)=\frac{1}{Z_{\theta}}\sum_{h\in{\mathcal{H}}}\exp\{-E(x,h,y,\theta)\}.

Similarly, for a given dataset {(xi,yi)}i=1,,N\{(x_{i},y_{i})\}_{i=1,...,N}, the GM-loss function for θ\theta is defined by

\displaystyle L_{\rm GM}(\theta)=\frac{1}{N}\sum_{i=1}^{N}{\exp\{\tilde{E}(x_{i},y_{i},\theta)-\bar{E}(\theta)\}},

where

E~(x,y,θ)=loghexp{E(x,h,y,θ)}\displaystyle\tilde{E}(x,y,\theta)=-\log\sum_{h\in{\mathcal{H}}}\exp\{-E(x,h,y,\theta)\}

and

E¯(θ)=1m(x,y)𝒳×𝒴E~(x,y,θ).\displaystyle\bar{E}(\theta)={\frac{1}{m^{\prime}}}\sum_{(x,y)\in{\mathcal{X}}\times{\mathcal{Y}}}\tilde{E}(x,y,\theta).

where m^{\prime} is the cardinality of {\mathcal{X}}\times{\mathcal{Y}}. Accordingly, we can apply the GM-loss function to the Boltzmann distribution with supervised outcomes, and to that with multiple hidden variables. In summary, the GM divergence and the GM-estimator have advantageous properties over the KL divergence and the ML-estimator in the theoretical formulation, and a numerical example shows this advantage in a small-scale experiment. However, we have not yet conducted sufficient experiments and practical applications to confirm the advantage; further investigation is needed to compare the GM method with the current methods developed for the deep belief network [99].

4.2 Multiclass AdaBoost

AdaBoost belongs to the family of ensemble learning algorithms that combine the decisions of multiple base learners, or weak learners, to produce a strong learner. The core premise is that a group of "weak" models can be combined to form a "strong" model. AdaBoost [30] and its variants have found applications across various domains, including bioinformatics and statistical ecology, where they help in creating robust predictive models from noisy or incomplete data. AdaBoost has also been extended to handle multiclass classification problems.

An example is Multiclass AdaBoost or AdaBoost.M1, an extension adapting the algorithm to handle more than two classes. There are also other variants like SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function) which further extends AdaBoost to multiclass scenarios [38]. Random forests and gradient boosting machines (GBM) can be mentioned as popular and efficient methods of ensemble learning [5, 32]. Random forests exhibit a good balance between bias and variance, primarily due to the averaging of uncorrelated decision trees. GBMs are highly flexible and can be used with various loss functions and types of weak learners, though trees are standard. AdaBoost excels in situations where bias reduction is crucial, while Random Forests are a robust, all-rounder solution. GBMs offer high flexibility and can achieve superior performance with careful tuning, especially in complex datasets. Interestingly, there is a connection between AdaBoost and information geometry [69]. The process of re-weighting the data points can be seen as a form of geometrically moving along the manifold of probability distributions over the data. This geometric interpretation might tie back to concepts of divergence and entropy, which are core to information geometry. We focus on discussing various loss functions that are derived from the class of power divergence.

We discuss the setting of a binary class label. Let X be a covariate with a value in a subset {\mathcal{X}} of \mathbb{R}^{d}, and Y be an outcome with a value in {\mathcal{Y}}=\{-1,1\}. Let f(x) be a predictor such that the prediction rule is given by h(x)={\rm sign}(f(x)). The exponential loss was proposed for AdaBoost [30, 85]. One of its most characteristic points is that the optimization is conducted in the function space spanned by a set of weak classifiers. The exponential loss functional plays a central role as a key ingredient, defined on a space of predictors as

Lexp(f)=1ni=1nexp{Yif(Xi)},\displaystyle L_{\exp}(f)=\frac{1}{n}\sum_{i=1}^{n}\exp\{-Y_{i}f(X_{i})\},

where ff is a predictor on 𝒳{\mathcal{X}}. If we take expectation under a conditional distribution

\displaystyle p_{0}(y|x)=\frac{\exp\{yf_{0}(x)\}}{\exp\{f_{0}(x)\}+\exp\{-f_{0}(x)\}},

then the expected exponential loss is

𝔼[Lexp(f)]=exp{f(x)f0(x)}+exp{f(x)+f0(x)}exp{f0(x)}+exp{f0(x)},\displaystyle\mathbb{E}[L_{\exp}(f)]=\frac{\exp\{f(x)-f_{0}(x)\}+\exp\{-f(x)+f_{0}(x)\}}{\exp\{f_{0}(x)\}+\exp\{-f_{0}(x)\}},

which is greater than or equal to 2/(\exp\{f_{0}(x)\}+\exp\{-f_{0}(x)\}). The equality holds if and only if f=f_{0}. This implies that the minimizer of the expected exponential loss is equal to the true predictor f_{0}, namely,

f0=argminf𝔼[Lexp(f)].\displaystyle f_{0}=\mathop{\rm argmin}_{f\in{\cal F}}\mathbb{E}[L_{\exp}(f)].

The functional optimization is practically implemented by a simple algorithm. The stagewise learning algorithm is given as follows (Freund & Schapire, 1995):

(1). Provide J:={hj:𝒳{1,1};jJ}{\mathcal{H}}_{J}:=\{h_{j}:{\mathcal{X}}\rightarrow\{-1,1\};j\in J\}. Set as w0,i=1nw_{0,i}=\frac{1}{n} and h0(x)=0h_{0}(x)=0.

(2). For step t=1,,Tt=1,...,T

(2.a). ht=argminhJErrt(h)\displaystyle h_{t}=\mathop{\rm argmin}_{h\in{\mathcal{H}}_{J}}{\rm Err}_{t}(h), where Errt(h)=i=1nwt1,i𝕀(h(Xi)Yi)\displaystyle{\rm Err}_{t}(h)=\sum_{i=1}^{n}w_{t-1,i}{\mathbb{I}}(h(X_{i})\neq Y_{i}).

(2.b). \alpha_{t}=\mathop{\rm argmin}_{\alpha\in\mathbb{R}}L_{\exp}(f_{t-1}+\alpha h_{t}), where f_{t-1}(x)=\sum_{j=1}^{t-1}\alpha_{j}h_{j}(x).

(2.c). wt,i=wt1,iexp{αtYiht(Xi)}.w_{t,i}=w_{t-1,i}\exp\{-\alpha_{t}Y_{i}h_{t}(X_{i})\}.

(3). Set h_{T}(x)={\rm sign}\Big{(}\sum_{t=1}^{T}\alpha_{t}h_{t}(x)\Big{)}.
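The steps above can be sketched in a few lines, assuming a user-supplied finite family `stumps` of weak classifiers h: {\mathcal{X}}\rightarrow\{-1,1\} (the names `adaboost` and `stumps` are ours); substep (2.b) is solved by the closed-form half log-odds \alpha_{t}=\frac{1}{2}\log\frac{1-{\rm err}_{t}}{{\rm err}_{t}}:

```python
import numpy as np

def adaboost(X, y, stumps, T=10):
    # stagewise steps (1)-(3): each stump is h: X -> {-1, +1};
    # alpha_t is the half log-odds of the weighted error rate err_t
    n = len(y)
    w = np.full(n, 1.0 / n)
    alphas, chosen = [], []
    for _ in range(T):
        errs = [np.sum(w * (h(X) != y)) / np.sum(w) for h in stumps]
        j = int(np.argmin(errs))
        err = min(max(errs[j], 1e-12), 1.0 - 1e-12)   # keep log-odds finite
        alpha = 0.5 * np.log((1.0 - err) / err)
        w = w * np.exp(-alpha * y * stumps[j](X))     # substep (2.c)
        alphas.append(alpha)
        chosen.append(j)
    def predict(Xnew):
        f = sum(a * stumps[j](Xnew) for a, j in zip(alphas, chosen))
        return np.sign(f)
    return predict
```

The error is clipped away from 0 and 1 to keep the log-odds finite when a stump classifies the weighted sample perfectly.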

Note that substep (2.b) admits a closed form: the half log-odds of the error rate,

αt=12log1errt(ht)errt(ht),\displaystyle\alpha_{t}=\small\mbox{$\frac{1}{2}$}\log\frac{1-{\rm err}_{t}(h_{t})}{{\rm err}_{t}(h_{t})},

where {\rm err}_{t}(h)={\rm Err}_{t}(h)/\sum_{i=1}^{n}w_{t-1,i}. The algorithm has an elegant and simple form, in which the mathematical operations are defined only by the elementary functions \exp and \log. By contrast, the iteratively reweighted least squares algorithm requires a matrix inversion even for a linear logistic model. Let us apply the \gamma-loss functional to the boosting algorithm; cf. Chapter 2 for the general form of the \gamma-loss. First of all, confirm

p(γ)(y|f(x))=exp{(γ+1)yf(x)}exp{(γ+1)f(x)}+exp{(γ+1)f(x)}\displaystyle p^{(\gamma)}(y|f(x))=\frac{\exp\{(\gamma+1)yf(x)\}}{\exp\{(\gamma+1)f(x)\}+\exp\{-(\gamma+1)f(x)\}}

as the γ\gamma-expression. Hence, the γ\gamma-loss functional is written by

Lγ(f)=i=1n{p(γ)(Yi|f(Xi))}γγ+1.\displaystyle L_{\gamma}(f)=\sum_{i=1}^{n}\big{\{}p^{(\gamma)}(Y_{i}|f(X_{i}))\big{\}}^{\frac{\gamma}{\gamma+1}}.

Let us discuss the gradient boosting algorithm based on the \gamma-loss functional. The stagewise learning algorithm f_{t+1}=f_{t}+\alpha^{*}f^{*} for t=0,1,\ldots,T is given as follows:

(α,f)=argmin(α,f)×Lγ(ft+αf),\displaystyle(\alpha^{*},f^{*})=\mathop{\rm argmin}_{(\alpha,f)\in\mathbb{R}\times{\mathcal{F}}}L_{\gamma}(f_{t}+\alpha f),

where f_{0} is an initial guess and T is determined by an appropriate stopping rule. However, the joint minimization is computationally expensive. For this reason, we use the gradient

Lγ(ft)=αLγ(ft+αf)|α=0,\displaystyle\nabla L_{\gamma}(f_{t})=\partial_{\alpha}L_{\gamma}(f_{t}+\alpha f)\Big{|}_{\alpha=0},

which is written as

i=1nπγ(Yi,ft(Xi))𝕀(Yisign(f(Xi)))+Ct,\displaystyle\sum_{i=1}^{n}\pi_{\gamma}(Y_{i},f_{t}(X_{i})){\mathbb{I}}(Y_{i}\neq{\rm sign}(f(X_{i})))+C_{t},

where CtC_{t} is a constant in ff, and

πγ(Yi,ft(Xi))=p(γ)(+1|f(Xi))p(γ)(1|f(Xi)){p(γ)(Yi|f(Xi))}1γ+1.\displaystyle\pi_{\gamma}(Y_{i},f_{t}(X_{i}))=\frac{p^{(\gamma)}(+1|f(X_{i}))p^{(\gamma)}(-1|f(X_{i}))}{\big{\{}p^{(\gamma)}(Y_{i}|f(X_{i}))\big{\}}^{\frac{1}{\gamma+1}}}.

Accordingly, the \gamma-boosting algorithm parallels AdaBoost as follows:

(1). Provide J:={hj:𝒳{1,1};jJ}{\mathcal{H}}_{J}:=\{h_{j}:{\mathcal{X}}\rightarrow\{-1,1\};j\in J\}. Set as h0(x)=0h_{0}(x)=0.

(2). For step t=1,,Tt=1,...,T

(2.a). ht+1=argminhJErrt(h)\displaystyle h_{t+1}=\mathop{\rm argmin}_{h\in{\mathcal{H}}_{J}}{\rm Err}_{t}(h), where Errt(h)=i=1nπγ(Yi,ft(Xi))𝕀(h(Xi)Yi)\displaystyle{\rm Err}_{t}(h)=\sum_{i=1}^{n}\pi_{\gamma}(Y_{i},f_{t}(X_{i})){\mathbb{I}}(h(X_{i})\neq Y_{i}).

(2.b). αt+1=argminαLγ(ft+αht+1)\displaystyle\alpha_{t+1}=\mathop{\rm argmin}_{\alpha\in\mathbb{R}}L_{\gamma}(f_{t}+\alpha h_{t+1}), where ft(x)=j=1tαjhj(x)\displaystyle f_{t}(x)=\sum_{j=1}^{t}\alpha_{j}h_{j}(x).

(2.c). Errt+1(h)=i=1nπγ(Yi,ft(Xi)+αt+1ht+1(Xi))𝕀(h(Xi)Yi)\displaystyle{\rm Err}_{t+1}(h)=\sum_{i=1}^{n}\pi_{\gamma}(Y_{i},f_{t}(X_{i})+\alpha_{t+1}h_{t+1}(X_{i})){\mathbb{I}}(h(X_{i})\neq Y_{i})

(3). Set h_{T}(x)={\rm sign}\Big{(}\sum_{t=1}^{T}\alpha_{t}h_{t}(x)\Big{)}.
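The weight \pi_{\gamma}(y,f(x)) driving the reweighting above can be evaluated directly from the \gamma-expression p^{(\gamma)}; a small sketch for the binary case (assuming \gamma\neq-1; function names are ours):

```python
import numpy as np

def p_gamma(y, f, gamma):
    # gamma-expression p^(gamma)(y | f) for binary y in {-1, +1}
    a = (gamma + 1.0) * f
    return np.exp(y * a) / (np.exp(a) + np.exp(-a))

def pi_gamma(y, f, gamma):
    # boosting weight pi_gamma(y, f(x)) appearing in the gradient
    num = p_gamma(1.0, f, gamma) * p_gamma(-1.0, f, gamma)
    return num / p_gamma(y, f, gamma) ** (1.0 / (gamma + 1.0))
```

At f = 0 both classes have probability 1/2, so \pi_{\gamma}=(1/4)/(1/2)^{1/(\gamma+1)}; for \gamma=1 this equals 1/(2\sqrt{2}).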

The GM-loss functional, L_{\rm GM}(f), and the HM-loss functional, L_{\rm HM}(f), are derived by setting \gamma to -1 and -2, respectively. It can be observed that the exponential loss functional is nothing but the GM-loss functional. We now consider the situation of an imbalanced sample, where \pi_{-1}\gg\pi_{1} for the probability \pi_{y} of Y=y. We adopt the adjusted exponential (GM) loss functional in (2.32) as

Lexp(w)(f)=1ni=1nπ1Yiexp{πYiYif(Xi)}.\displaystyle L_{\exp}^{(\rm w)}(f)=\frac{1}{n}\sum_{i=1}^{n}\pi_{1-Y_{i}}\exp\{-\pi_{Y_{i}}Y_{i}f(X_{i})\}.

The learning algorithm is given by replacing substeps (2.b) and (2.c) to

(2.b). αt=argminαLexp(w)(ft1+αht)\displaystyle\alpha_{t}=\mathop{\rm argmin}_{\alpha\in\mathbb{R}}L_{\exp}^{\rm(w)}(f_{t-1}+\alpha h_{t}), where ft1(x)=j=1t1αjhj(x)\displaystyle f_{t-1}(x)=\sum_{j=1}^{t-1}\alpha_{j}h_{j}(x).

(2.c). wt,i=wt1,iexp{πYiYiαtht(Xi)}.w_{t,i}=w_{t-1,i}\exp\{-\pi_{Y_{i}}Y_{i}\alpha_{t}h_{t}(X_{i})\}.

We observe in (2.b):

Lexp(w)(ft1+αh)eπ1αerrt1+eπ1α(1errt1)+eπ0αerrt0+eπ0α(1errt0),\displaystyle L_{\exp}^{(\rm w)}(f_{t-1}+\alpha h)\propto e^{\pi_{1}\alpha}{\rm err}_{t1}+e^{-\pi_{1}\alpha}(1-{\rm err}_{t1})+e^{\pi_{0}\alpha}{\rm err}_{t0}+e^{-\pi_{0}\alpha}(1-{\rm err}_{t0}),

where

\displaystyle{\rm err}_{ty}=\sum_{i=1}^{n}\pi_{1-y}w_{t-1,i}{\mathbb{I}}(Y_{i}=y,Y_{i}\neq h(X_{i}))\Big{/}\sum_{i=1}^{n}w_{t-1,i}.

We discuss a setting of a multiclass label. Let XX be a feature vector in a subset 𝒳\mathcal{X} of d\mathbb{R}^{d} and YY be a label in 𝒴={1,,k}{\mathcal{Y}}=\{1,...,k\}. The major objective is to predict YY given X=xX=x, in which there are spaces of classifiers and predictors, namely, ={h:𝒳𝒴}{\cal H}=\{h:{\mathcal{X}}\rightarrow{\mathcal{Y}}\} and

={f(x)=(f1(x),,fk(x))k:y=1kfy(x)=0}.\displaystyle{\mathcal{F}}=\{f(x)=(f_{1}(x),...,f_{k}(x))\in\mathbb{R}^{k}:\sum_{y=1}^{k}f_{y}(x)=0\}.

A classifier h(x)h(x) is introduced by a predictor f(x)f(x) as

hf(x)=argmaxy𝒴fy(x);\displaystyle h_{f}(x)=\mathop{\rm argmax}_{y\in{\mathcal{Y}}}f_{y}(x);

a predictor f_{h}(x) is introduced by a classifier h(x) as

fh(x)=(𝕀(h(x)=y)1k)y𝒴.\displaystyle f_{h}(x)=\Big{(}{\mathbb{I}}(h(x)=y)-\frac{1}{k}\Big{)}_{y\in{\mathcal{Y}}}. (4.8)

Note that ={hf:f}{\mathcal{H}}=\{h_{f}:f\in{\mathcal{F}}\}; while {fh:h}\{f_{h}:h\in{\mathcal{H}}\} is a subset of \mathcal{F}. In the learning algorithm discussed below, the predictor is updated by the linear span of predictors embedded by selected classifiers in a sequential manner. The conditional probability mass function (pmf) of YY given X=xX=x is assumed as a soft-max function

p(y|f(x))=exp{fy(x)}j𝒴exp{fj(x)}\displaystyle p(y|f(x))=\frac{\exp\{f_{y}(x)\}}{\sum_{j\in{\mathcal{Y}}}\exp\{f_{j}(x)\}}

where f(x) is a predictor of \mathcal{F}. We notice that p(y|f(x)) and f_{y}(x) are one-to-one as functions of y. Indeed, they are connected as f_{y}(x)=\log p(y|f(x))-\frac{1}{k}\sum_{j=1}^{k}\log p(j|f(x)). We note that this assumption is in the framework of the GLM, as in the conditional pmf (2.44) with a different parametrization discussed in Section 2.6, if f(x) is a linear predictor. However, the formulation here is nonparametric, in which the model is written as {\mathcal{M}}=\{p(y|f(x)):f\in{\mathcal{F}}\}. Similarly, the \gamma-loss functional for f is

Lγ(f)=i=1n{exp{(γ+1)fYi(Xi)}y𝒴exp{(γ+1)fy(Xi)}}γγ+1.\displaystyle L_{\gamma}(f)=\sum_{i=1}^{n}\Big{\{}\frac{\exp\{(\gamma+1)f_{Y_{i}}(X_{i})\}}{\sum_{y\in{\mathcal{Y}}}\exp\{(\gamma+1)f_{y}(X_{i})\}}\Big{\}}^{\frac{\gamma}{\gamma+1}}.
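As a quick numerical sketch, the multiclass \gamma-loss functional can be evaluated as follows, with labels coded 0, …, k−1 and a numerically stabilized soft-max (the function name is ours):

```python
import numpy as np

def gamma_loss(F, y, gamma):
    # L_gamma(f) = sum_i { softmax((gamma+1) f(X_i))_{Y_i} }^{gamma/(gamma+1)}
    # F: (n, k) predictor matrix with rows summing to zero; y in {0,...,k-1}
    A = (gamma + 1.0) * F
    P = np.exp(A - A.max(axis=1, keepdims=True))   # stabilized soft-max
    P = P / P.sum(axis=1, keepdims=True)
    return float(np.sum(P[np.arange(len(y)), y] ** (gamma / (gamma + 1.0))))
```

For the zero predictor each class has probability 1/k, so every term equals (1/k)^{\gamma/(\gamma+1)}, which gives a quick check.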

The minimum of the expected \gamma-loss functional in f is attained at f=f^{(0)} when the expectation is taken under the conditional distribution

p0(y|x)=exp{fy(0)(x)}j=1kexp{fj(0)(x)}.\displaystyle p_{0}(y|x)=\frac{\exp\{f_{y}^{(0)}(x)\}}{\sum_{j=1}^{k}\exp\{f_{j}^{(0)}(x)\}}.

Thus, we conclude that the minimizer of the expected γ\gamma-loss in ff is equal to the true predictor f(0)f^{(0)}, namely,

f(0)=argminf𝔼[Lγ(f)].\displaystyle f^{(0)}=\mathop{\rm argmin}_{f\in{\cal F}}\mathbb{E}[L_{\gamma}(f)].

Thus, the minimization of the \gamma-loss functional on the predictor space {\cal F} yields nonparametric consistency. Similarly, the stagewise learning algorithm f_{t}=f_{t-1}+\alpha^{*}f_{h^{*}} for t=1,\ldots,T is given as follows:

(α,h)=argmin(α,h)×Lγ(ft1+αfh),\displaystyle(\alpha^{*},h^{*})=\mathop{\rm argmin}_{(\alpha,h)\in\mathbb{R}\times{\mathcal{H}}}L_{\gamma}(f_{t-1}+\alpha f_{h}),

where f0f_{0} is an initial guess and fhf_{h} is defined in (4.8). For any fixed α\alpha, we observe

Lγ(ft1+αfh)=i=1nwt1,i𝕀(Yih(Xi))+Ct1,\displaystyle L_{\gamma}(f_{t-1}+\alpha f_{h})=\sum_{i=1}^{n}w_{t-1,i}{\mathbb{I}}(Y_{i}\neq h(X_{i}))+C_{t-1},

where

\displaystyle w_{t-1,i}=\Big{\{}\frac{\exp\{(\gamma+1)f_{Y_{i}}(X_{i})\}}{\sum_{y\in{\mathcal{Y}}}\exp\{(\gamma+1)f_{y}(X_{i})\}}\Big{\}}^{\frac{\gamma}{\gamma+1}}. The resulting algorithm is as follows.

(1). Provide J:={hj:𝒳{1,,k};jJ}{\mathcal{H}}_{J}:=\{h_{j}:{\mathcal{X}}\rightarrow\{1,...,k\};j\in J\}. Set as w0,i=1nw_{0,i}=\frac{1}{n} and h0(x)=0h_{0}(x)=0.

(2). For step t=1,,Tt=1,...,T

(2.a). ht=argminhJErrt(h)\displaystyle h_{t}=\mathop{\rm argmin}_{h\in{\mathcal{H}}_{J}}{\rm Err}_{t}(h), where Errt(h)=i=1nwt1,i𝕀(h(Xi)Yi)\displaystyle{\rm Err}_{t}(h)=\sum_{i=1}^{n}w_{t-1,i}{\mathbb{I}}(h(X_{i})\neq Y_{i}).

(2.b). αt=argminαLγ(ft1+αfht)\displaystyle\alpha_{t}=\mathop{\rm argmin}_{\alpha\in\mathbb{R}}L_{\gamma}(f_{t-1}+\alpha f_{h_{t}}), where ft1(x)=j=1t1αjfhj(x)\displaystyle f_{t-1}(x)=\sum_{j=1}^{t-1}\alpha_{j}f_{h_{j}}(x) with the

embedded predictor fh(x)f_{h}(x) defined in (4.8).

(2.c). w_{t,i}=w_{t-1,i}\Big{\{}\frac{\exp\{(\gamma+1)\alpha_{t}f_{t,Y_{i}}(X_{i})\}}{\sum_{y\in{\mathcal{Y}}}\exp\{(\gamma+1)f_{t,y}(X_{i})\}}\Big{\}}^{\frac{\gamma}{\gamma+1}}.

(3). Set hT(x)=argmaxy𝒴fT(y,x).\displaystyle h_{T}(x)=\mathop{\rm argmax}_{y\in{\mathcal{Y}}}f_{T}(y,x).

The GM-loss functional is given by

LGM(f)=i=1nexp{fYi(Xi)}\displaystyle L_{\rm GM}(f)=\sum_{i=1}^{n}{\exp\{-f_{Y_{i}}(X_{i})\}}

due to the normalizing condition \sum_{y\in{\mathcal{Y}}}f_{y}(x)=0. This is essentially the same as the exponential loss [38], in which the class label y is coded similarly to (4.8). Thus, the equivalence of the GM-loss and the exponential loss also holds for multiclass classification. We can discuss the problem of imbalanced samples similarly to the binary case. Let \pi_{y}=P(Y=y) and

πyinv=1πyj=1k1πj.\displaystyle\pi_{y}^{\rm inv}=\frac{\frac{1}{\pi_{y}}}{\sum_{j=1}^{k}\frac{1}{\pi_{j}}}.

The adjusted exponential (GM) loss functional in (2.32) is then

Lexp(w)(f)=1ni=1nπYiinvexp{πYifYi(Xi)}.\displaystyle L_{\exp}^{(\rm w)}(f)=\frac{1}{n}\sum_{i=1}^{n}\pi_{Y_{i}}^{\rm inv}\exp\{-\pi_{Y_{i}}f_{Y_{i}}(X_{i})\}.
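A small sketch of the inverse-prior weights \pi_{y}^{\rm inv} and the adjusted loss above, with labels coded 0, …, k−1 (function names are ours):

```python
import numpy as np

def inverse_priors(pi):
    # pi_y^inv: normalized reciprocals of the class probabilities
    inv = 1.0 / np.asarray(pi, dtype=float)
    return inv / inv.sum()

def adjusted_exp_loss(F, y, pi):
    # L_exp^(w): each term is pi_{Y_i}^inv exp{-pi_{Y_i} f_{Y_i}(X_i)};
    # F is (n, k) with rows summing to zero, y holds labels 0,...,k-1
    pi = np.asarray(pi, dtype=float)
    pinv = inverse_priors(pi)
    fy = F[np.arange(len(y)), y]
    return float(np.mean(pinv[y] * np.exp(-pi[y] * fy)))
```

With balanced priors the weights are uniform and the adjusted loss reduces to a scaled version of the plain exponential loss.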

The learning algorithm is given by a minor change for substeps (2.b) and (2.c). The HM-loss functional is given by

LHM(f)=i=1n{exp{fYi(Xi)}y𝒴exp{fy(Xi)}}2.\displaystyle L_{\rm HM}(f)=\sum_{i=1}^{n}\Big{\{}\frac{\exp\{-f_{Y_{i}}(X_{i})\}}{\sum_{y\in{\mathcal{Y}}}\exp\{-f_{y}(X_{i})\}}\Big{\}}^{2}.

GBMs are highly flexible and can be adapted to various loss functions and types of weak learners, although decision trees are commonly used as the base learners. This flexibility is one of the key strengths of GBMs, allowing them to be tailored to a wide range of problems and data types. The loss functions discussed above can be applied to GBMs. They require careful tuning of several parameters (e.g., number of trees, learning rate, depth of trees), which can be time-consuming. This discussion primarily focuses on the minimum divergence principle from a theoretical perspective; in future projects, we aim to extend it to develop effective GBM applications for a wide range of datasets.

4.3 Active learning

Active learning is a subfield of machine learning that focuses on building efficient training datasets; see [86] for a comprehensive survey. Unlike traditional supervised learning, where all labels are provided upfront, active learning aims to select the most informative examples for labeling, thereby potentially reducing the number of labeled examples needed to achieve a certain level of performance; cf. [10] for an account of how statistical methods are integrated into active learning algorithms. Active learning is a fascinating area where statistical machine learning and information geometry intersect, offering deep insights into the learning process. One of the primary goals is to reduce the number of labeled instances required to train a model effectively. Annotation can be expensive, especially for tasks like medical image labeling, natural language processing, or any domain-specific task requiring expert knowledge. In scenarios where data collection is expensive or time-consuming, active learning aims to make the most of a small dataset. By focusing on ambiguous or difficult instances, active learning improves the model's performance faster than random sampling would. For these reasons, active learning has attracted attention in today's data-rich but label-scarce environments.

The query by committee (QBC) method is a popular method in active learning, alongside other approaches such as uncertainty sampling, expected model change, and Bayesian optimization; see [87] for the theoretical underpinnings of the QBC approach. In the QBC approach, on which we focus, a "committee" of models is trained on the current labeled dataset. When it comes to selecting the next data point to label, the committee "votes" on the labels of the unlabeled data points. The data point with the most disagreement among the committee members is then selected for labeling. The idea is that this point lies in a region of high uncertainty and would therefore provide the most information if labeled. From an information geometry perspective, one can consider the divergence or distance between the probability distributions predicted by each model in the committee for a given data point; the point that maximizes this divergence can be considered the most informative.

Let XX be a feature vector in a subset 𝒳\mathcal{X} of d\mathbb{R}^{d} and YY be a label in 𝒴={1,,k}{\mathcal{Y}}=\{1,...,k\}. The conditional probability mass function (pmf) of YY given X=xX=x is assumed as a soft-max function

p(y|ξ(x))=exp{ξy(x)}j𝒴exp{ξj(x)}\displaystyle p(y|\xi(x))=\frac{\exp\{\xi_{y}(x)\}}{\sum_{j\in{\mathcal{Y}}}\exp\{\xi_{j}(x)\}}

where ξ(x)\xi(x) is a predictor vector with components {ξy(x)}y=1k\{\xi_{y}(x)\}_{y=1}^{k} satisfying y𝒴ξy(x)=0.\sum_{y\in{\mathcal{Y}}}\xi_{y}(x)=0. The prediction is conducted by

\displaystyle h(x)=\mathop{\rm argmax}_{y\in{\mathcal{Y}}}\xi_{y}(x)

noting that p(y|\xi(x)) and \xi_{y}(x) are one-to-one as functions of y. In effect, they are connected as \xi_{y}(x)=\log p(y|\xi(x))-\frac{1}{k}\sum_{j=1}^{k}\log p(j|\xi(x)). We note that this assumption is in the framework of the GLM, as in the conditional pmf (2.44) with a different parametrization discussed in Section 2.6, if \xi_{y}(x) is a linear predictor.

We aim to design a sequential family of datasets {St}t=0T\{S_{t}\}_{t=0}^{T} such that the (t+1)(t+1)-th dataset is updated as

St+1={(Xt+1,Yt+1)}St\displaystyle S_{t+1}=\{(X_{t+1},Y_{t+1})\}\cup S_{t}

for t, 0\leq t\leq T-1, where S_{0} is an appropriately chosen initial dataset. Given S_{t}, we conduct an experiment to obtain (X_{t+1},Y_{t+1}), in which X_{t+1} is explored to improve the performance of the prediction of the label Y, and the outcome Y_{t+1} is sampled from the conditional distribution given X_{t+1}. Thus, active learning proposes an update pair (X_{t+1},Y_{t+1}) that strengthens the t-th prediction result in a universal manner. The key to active learning is an efficient method for obtaining a feature vector X_{t+1} that compensates for the weakness of the prediction based on S_{t}. For this, it is preferable that the distribution of Y given X_{t+1} be separated from that given (X_{1},\ldots,X_{t}). Here, let us take the QBC approach, in which an acquisition function plays a central role.

Assume that there are mm committee members or machines such that the ll-th member employs a predictor ξ(tl)y(x)\xi^{(tl)}_{y}(x) for a feature vector xx and a label yy based on the dataset StS_{t}, and thus the prediction for YY given xx is performed by argmaxy𝒴ξ(tl)y(x)\mathop{\rm argmax}_{y\in{\mathcal{Y}}}\xi^{(tl)}_{y}(x). We define an acquisition function on a feature vector xx of 𝒳\mathcal{X} as

A(t)(x)=l=1mwlD(P(|ξ(tl)(x)),P(|ξ^(t)(x)))\displaystyle A^{(t)}(x)=\sum_{l=1}^{m}w_{l}D(P(\cdot|\xi^{(tl)}(x)),P(\cdot|\hat{\xi}^{(t)}(x))) (4.9)

adopting a divergence measure DD, where ξ(tl)(x)\xi^{(tl)}(x) is the predictor vector learned by the ll-th member at stage tt, and ξ^(t)(x)\hat{\xi}^{(t)}(x) is the consensus predictor vector combining {ξ(tl)}l=1m\{\xi^{(tl)}\}_{l=1}^{m}. The consensus predictor is given by

ξ^0(t)(x)=argminξΞl=1mwlD(P(|ξ(tl)(x)),P(|ξ(x))),\displaystyle\hat{\xi}_{0}^{(t)}(x)=\mathop{\rm argmin}_{\xi\in\Xi}\sum_{l=1}^{m}w_{l}D(P(\cdot|\xi^{(tl)}(x)),P(\cdot|\xi(x))),

where Ξ\Xi is the set of all the predictor vectors. Such an optimization problem is discussed around Proposition 2 in association with the generalized mean [41]. Accordingly, the new feature vector is selected as

X(t+1)=argmaxx𝒳(t)A(t)(x),\displaystyle X^{(t+1)}=\mathop{\rm argmax}_{x\in{\mathcal{X}}^{(t)}}A^{(t)}(x), (4.10)

where 𝒳(t){\mathcal{X}}^{(t)} is a subset of possible candidates of 𝒳\mathcal{X} at stage tt.
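To make the selection rule (4.10) concrete, the following sketch implements the acquisition (4.9) with the KL-divergence over a small committee; the linear predictors, weights, and candidate pool are hypothetical illustrations, not the book's experimental setup:

```python
import numpy as np

def softmax(xi):
    e = np.exp(xi - xi.max())
    return e / e.sum()

def kl(p, q):
    """KL-divergence D_0(p, q) between two pmfs on the label set."""
    return float(np.sum(p * np.log(p / q)))

def next_feature(candidates, members, w):
    """A^(t)(x) = sum_l w_l D(P(.|xi_l(x)), P(.|consensus(x))); return the argmax over x."""
    scores = []
    for x in candidates:
        xis = np.array([m(x) for m in members])   # member predictor vectors at x
        consensus = softmax(w @ xis)              # KL consensus: weighted mean predictor
        scores.append(sum(wl * kl(softmax(xi), consensus)
                          for wl, xi in zip(w, xis)))
    return candidates[int(np.argmax(scores))]

# two hypothetical committee members with linear predictors on R^2 (k = 2 labels)
b1, b2 = np.array([1.0, 0.5]), np.array([-0.5, 1.0])
members = [lambda x: np.array([x @ b1, -(x @ b1)]),
           lambda x: np.array([x @ b2, -(x @ b2)])]
w = np.array([0.5, 0.5])
cands = [np.array([1.0, 0.0]), np.array([0.0, 0.0]), np.array([1.0, -1.0])]
x_next = next_feature(cands, members, w)   # picks the candidate the members dispute most
```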

The standard choice of DD for (4.9) is the KL-divergence D0D_{0} in (1.1), which yields the consensus distribution with the pmf

p^0(t)(y|x)=exp{l=1mwlξ(tl)y(x)}j=1kexp{l=1mwlξ(tl)j(x)},\displaystyle\hat{p}_{0}^{(t)}(y|x)=\frac{\exp\big{\{}\sum_{l=1}^{m}w_{l}\xi^{(tl)}_{y}(x)\big{\}}}{\sum_{j=1}^{k}\exp\big{\{}\sum_{l=1}^{m}w_{l}\xi^{(tl)}_{j}(x)\big{\}}},

or equivalently ξ^0(t)(x)=l=1mwlξ(tl)(x)\hat{\xi}_{0}^{(t)}(x)=\sum_{l=1}^{m}w_{l}\xi^{(tl)}(x) as the consensus predictor. Alternatively, we adopt the dual γ\gamma-divergence DγD^{*}_{\gamma} defined in (1.11) and thus,

A(t)γ(x)=l=1mwlDγ(P(|ξ(tl)(x)),P(|ξ^(t)(x));C)\displaystyle A^{(t)}_{\gamma}(x)=\sum_{l=1}^{m}w_{l}D_{\gamma}(P(\cdot|\xi^{(tl)}(x)),P(\cdot|\hat{\xi}^{(t)}(x));C)

where

Dγ(P(|ξ(tl)(x)),P(|ξ^(t)(x));C)=y𝒴{p(γ)(y|ξ^(t)(x))}1γ+1p(y|ξ(tl)(x))γ.\displaystyle D_{\gamma}(P(\cdot|\xi^{(tl)}(x)),P(\cdot|\hat{\xi}^{(t)}(x));C)=\sum_{y\in{\mathcal{Y}}}\{p^{(\gamma)}(y|\hat{\xi}^{(t)}(x))\}^{\frac{1}{\gamma+1}}p(y|\xi^{(tl)}(x))^{\gamma}.

Here, p(γ)(y|ξ(x))p^{(\gamma)}(y|\xi(x)) is the γ\gamma-expression defined in (1.12), or

p(γ)(y|ξ(x))=exp{(γ+1)ξy(x)}j=1kexp{(γ+1)ξj(x)}.\displaystyle p^{(\gamma)}(y|\xi(x))=\frac{\exp\{(\gamma+1)\xi_{y}(x)\}}{\sum_{j=1}^{k}\exp\{(\gamma+1)\xi_{j}(x)\}}.

This yields

p^γ(t)(y|x)=[l=1mwlexp{γξ(tl)y(x)}]1γj=1k[l=1mwlexp{γξ(tl)j(x)}]1γ.\displaystyle\hat{p}_{\gamma}^{(t)}(y|x)=\frac{\Big{[}\sum_{l=1}^{m}w_{l}\exp\{\gamma\xi^{(tl)}_{y}(x)\}\Big{]}^{\frac{1}{\gamma}}}{\sum_{j=1}^{k}\Big{[}\sum_{l=1}^{m}w_{l}\exp\{\gamma\xi^{(tl)}_{j}(x)\}\Big{]}^{\frac{1}{\gamma}}}.

as the pmf of the consensus distribution and

ξ^γ,y(t)(x)=1γlog[l=1mwlexp{γξy(tl)(x)}]\displaystyle\hat{\xi}^{(t)}_{\gamma,y}(x)=\frac{1}{\gamma}\log\Big{[}\sum_{l=1}^{m}w_{l}\exp\{\gamma\xi_{y}^{(tl)}(x)\}\Big{]}

as the consensus predictor, up to a constant in yy. We note that the consensus predictor ξ^γ,y(t)(x)\hat{\xi}^{(t)}_{\gamma,y}(x) has the form of a log-sum-exp mean. This has the extreme forms

limγξ^γ(t)(x)=min1lmξ(tl)(x) and limγξ^γ(t)(x)=max1lmξ(tl)(x).\displaystyle\lim_{\gamma\rightarrow-\infty}\hat{\xi}_{\gamma}^{(t)}(x)=\min_{1\leq l\leq m}\xi^{(tl)}(x)\quad\text{ and }\quad\lim_{\gamma\rightarrow\infty}\hat{\xi}_{\gamma}^{(t)}(x)=\max_{1\leq l\leq m}\xi^{(tl)}(x).
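A minimal numerical sketch of the log-sum-exp consensus and its extreme forms, with hypothetical member predictors and weights:

```python
import numpy as np

def gamma_consensus(xis, w, gamma):
    """Log-sum-exp mean: xi_hat_y = (1/gamma) log( sum_l w_l exp(gamma xi_{l,y}) )."""
    xis, w = np.asarray(xis), np.asarray(w)[:, None]
    return np.log((w * np.exp(gamma * xis)).sum(axis=0)) / gamma

xis = [[1.0, -1.0], [-2.0, 2.0]]            # m = 2 committee members, k = 2 labels
w = [0.5, 0.5]

# gamma -> 0 recovers the KL consensus, i.e. the weighted mean of the predictors
assert np.allclose(gamma_consensus(xis, w, 1e-8), [-0.5, 0.5])
# gamma -> +inf / -inf approach the componentwise max / min over the members
assert np.allclose(gamma_consensus(xis, w, 100.0), np.max(xis, axis=0), atol=0.02)
assert np.allclose(gamma_consensus(xis, w, -100.0), np.min(xis, axis=0), atol=0.02)
```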

Let us look at the decision boundaries of the consensus predictors ξ^0(x)\hat{\xi}_{0}(x) and ξ^γ(x)\hat{\xi}_{\gamma}(x) combining two linear predictors in a two-dimensional space; see Figure 4.2.

Figure 4.2: Plots of decision boundaries of the dual KL and dual γ\gamma-divergence measures.

If all the committee machines have linear predictors, then the KL (γ=0\gamma=0) consensus predictor is still a linear predictor, but the γ\gamma-consensus predictor is nonlinear according to the value of γ0\gamma\neq 0, as in Figure 4.2. Hence, we can explore the nonlinearity at every stage by learning an appropriate value of γ\gamma. Needless to say, the objective is to find a good feature vector Xt+1X_{t+1} in (4.10), and hence we have to pay attention to the learning procedure conducted by the minimax process

maxx𝒳minξΞ{l=1mwlDγ(P(|ξ(tl)(x)),P(|ξ(x)))}.\displaystyle\max_{x\in{\mathcal{X}}}\min_{\xi\in\Xi}\Big{\{}\sum_{l=1}^{m}w_{l}D_{\gamma}(P(\cdot|\xi^{(tl)}(x)),P(\cdot|\xi(x)))\Big{\}}.

It is possible to monitor the minimax value at each stage, which evaluates the learning performance. In effect, the minimax game of the cross entropy between Nature and a decision maker is nicely discussed in [36]. The minimaxity is solved in a zero-sum game: Nature wants to maximize the cross entropy under a constraint with a fixed expectation; the decision maker wants to minimize it over the full space. However, our minimax process does not follow straightforwardly from this observation. Further discussion is necessary to propose a selection of the optimal value of γ\gamma based on StS_{t}.

4.4 The γ\gamma-cosine similarity

The γ\gamma-divergence is defined on the probability measures dominated by a reference measure. We have studied statistical applications focusing on regression and classification. We now show that the γ\gamma-divergence can be extended to the Lebesgue Lp{\rm L}_{p}-space, p(Λ)={f(x):fp<}{\cal L}_{p}(\Lambda)=\{f(x):\|f\|_{p}<\infty\} for an exponent pp, 1p1\leq p\leq\infty, where the Lp{\rm L}_{p}-norm is defined by

fp=(|f(x)|pdΛ(x))1p,\displaystyle\|f\|_{p}=\Big{(}\int|f(x)|^{p}{\rm d}\Lambda(x)\Big{)}^{\frac{1}{p}}, (4.11)

where Λ\Lambda is a σ\sigma-finite measure. There is a challenge for the extension: a function f(x)f(x) can take a negative value. The usual power transformation f(x)γf(x)^{\gamma} poses a problem when f(x)<0f(x)<0, since raising a negative number to a fractional power can lead to complex values, which would not be meaningful in this context. For this, we introduce a sign-preserved power transformation as

f(x)γ=sign(f(x))|f(x)|γ.\displaystyle f(x)^{\ominus\gamma}={\rm sign}(f(x))|f(x)|^{\gamma}.
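A minimal vectorized sketch of this transformation, applied pointwise:

```python
import numpy as np

def spow(x, gamma):
    """Sign-preserved power: sign(x) * |x|**gamma, elementwise."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.abs(x) ** gamma

# a plain fractional power of a negative number would be complex or NaN
assert np.isclose(spow(-8.0, 1.0 / 3.0), -2.0)
assert np.allclose(spow([4.0, -4.0], 0.5), [2.0, -2.0])
```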

The log γ\gamma-divergence (1.21) is extended as

Δγ(f,g;Λ)=1γlog|f(x)fp(g(x)gp)γdΛ(x)|\displaystyle\Delta_{\gamma}(f,g;\Lambda)=-\frac{1}{\gamma}\log\bigg{|}\int\frac{f(x)}{\|f\|_{p}}\Big{(}\frac{g(x)}{\|g\|_{p}}\Big{)}^{\ominus\gamma}{\rm d}\Lambda(x)\bigg{|}

for ff and gg of p(Λ){\cal L}_{p}(\Lambda), where p=γ+1p=\gamma+1. This still satisfies the scale invariance. There are potential developments for utilizing Δγ(f,g;Λ)\Delta_{\gamma}(f,g;\Lambda) in statistics and machine learning, by which signed functions such as predictor functions or functional data can be directly evaluated. In particular, we explore this idea in the context of cosine similarity.

Cosine similarity is a measure used to determine the similarity between two non-zero vectors in an inner product space, which includes Hilbert spaces. This measure is particularly important in many applications, such as information retrieval, text analysis, and pattern recognition. In a Hilbert space, which is a complete inner product space, cosine similarity can be defined in a way that generalizes the concept from Euclidean spaces:

cos(f,g)=ff,gg,\displaystyle\cos(f,g)=\Big{\langle}\frac{f}{\|f\|},\frac{g}{\|g\|}\Big{\rangle}, (4.12)

where f=f,f\|f\|=\sqrt{\langle f,f\rangle} and ,\langle\ ,\ \rangle is the inner product, namely, f,g=f(x)g(x)dΛ(x)\langle f,g\rangle=\int f(x)g(x){\rm d}\Lambda(x). Thus, in the Hilbert space, (Λ){\cal H}(\Lambda), cos(τf,σg)=cos(f,g)\cos(\tau f,\sigma g)=\cos(f,g) for any scalars τ\tau and σ\sigma. The Cauchy–Schwarz inequality yields |cos(f,g)|1|\cos(f,g)|\leq 1, and |cos(f,g)|=1|\cos(f,g)|=1 if and only if there exists a scalar σ\sigma such that g(x)=σf(x)g(x)=\sigma f(x) for almost every xx.

We extend the cosine measure (4.12) to the Lp-space by analogy with the extension of the log γ\gamma-divergence; see [59, 11] for foundations of functional analysis. For this, we observe that the Hölder inequality implies |H(u,v)|1|{\rm H}(u,v)|\leq 1, where

H(u,v)=uup,vvq,\displaystyle{\rm H}(u,v)=\Big{\langle}\frac{u}{\|u\|_{p}},\frac{v}{\|v\|_{q}}\Big{\rangle}, (4.13)

for upu\in{\cal L}_{p} and vqv\in{\cal L}_{q}, where qq is the conjugate exponent to pp satisfying 1p+1q=1\frac{1}{p}+\frac{1}{q}=1. The dual space (the Banach space of all continuous linear functionals) of the Lp-space for 1<p<1<p<\infty has a natural isomorphism with the Lq-space. The isomorphism associates vv with the functional ιp(v)p(Λ)\iota_{p}(v)\in{\cal L}_{p}(\Lambda)^{*} defined by uιp(v)(u)=uvdΛu\mapsto\iota_{p}(v)(u)=\int uv{\rm d}\Lambda. Thus, the Hölder inequality guarantees that ιp(v)(u)\iota_{p}(v)(u) is well defined and continuous, and hence q(Λ){\cal L}_{q}(\Lambda) is said to be the continuous dual space of p(Λ){\cal L}_{p}(\Lambda). At first glance, H(u,v){\rm H}(u,v) appears to be a surrogate for cos(f,g)\cos(f,g). However, the domain of cos\cos is (Λ)×(Λ){\cal H}(\Lambda)\times{\cal H}(\Lambda), while that of H\rm H is p(Λ)×q(Λ){\cal L}_{p}(\Lambda)\times{\cal L}_{q}(\Lambda). Further, |cos(f,g)|=1|\cos(f,g)|=1 means fgf\propto g, whereas |H(u,v)|=1|{\rm H}(u,v)|=1 means |u|p|v|q|u|^{p}\propto|v|^{q}. Thus, the functional H(u,v){\rm H}(u,v) has properties that make it inappropriate as a cosine functional for measuring an angle between vectors in a function space. For this reason, consider a transform κp\kappa_{p} from p{\cal L}_{p} to q{\cal L}_{q} defined by κp(v)=vpq\kappa_{p}(v)=v^{\ominus\frac{p}{q}}, noting |κp(v)|q=|v|p|\kappa_{p}(v)|^{q}=|v|^{p}. Then, we can define H(u,κp(v)){\rm H}(u,\kappa_{p}(v)) for uu and vv in p(Λ).{\cal L}_{p}(\Lambda). Consequently, we define a cosine measure on p{\cal L}_{p} as

cosγ(f,g)=ffp,(ggp)pq\displaystyle\cos_{\gamma}(f,g)=\Big{\langle}\frac{f\ }{\|f\|_{p}},\Big{(}\frac{g\ }{\|g\|_{p}}\Big{)}^{\ominus\frac{p}{q}}\Big{\rangle} (4.14)

linking γ\gamma via p=γ+1p=\gamma+1, where qq is the conjugate exponent to pp. A close connection with the log γ\gamma-divergence is noted as

|cosγ(f,g)|=exp{γΔγ(f,g)}.\displaystyle|\cos_{\gamma}(f,g)|=\exp\{-\gamma\Delta_{\gamma}(f,g)\}.

This implies cosγ(f,g)=0Δγ(f,g)=\cos_{\gamma}(f,g)=0\ \Longleftrightarrow\ \Delta_{\gamma}(f,g)=\infty, in which both express quantities when ff and gg are the most distinct. In this formulation, cosγ(f,g)\cos_{\gamma}(f,g), called the γ\gamma-cosine, ensures mathematical consistency across all real values of g(x)g(x), which is vital for the measure’s applicability in a wide range of contexts. Note that, if p=2p=2, then cosγ(f,g)=cos(f,g)\cos_{\gamma}(f,g)=\cos(f,g), in which g(x)pqg(x)^{\ominus\frac{p}{q}} reduces to g(x)g(x). Further, a basic property is summarized as follows:

Proposition 17.

Let ff and gg be in p(Λ){\cal L}_{p}(\Lambda). Then, |cosγ(f,g)|1|\cos_{\gamma}(f,g)|\leq 1, and equality holds if and only if gg is proportional to ff.

Proof.

By definition, cosγ(f,g)=H(f,κp(g))\cos_{\gamma}(f,g)={\rm H}(f,\kappa_{p}(g)), where H{\rm H} is defined in (4.13). This implies |cosγ(f,g)|1|\cos_{\gamma}(f,g)|\leq 1. The equality holds if and only if |f|p|κp(g)|q=|g|p|f|^{p}\propto|\kappa_{p}(g)|^{q}=|g|^{p}, that is, there exists a scalar σ\sigma such that g(x)=σf(x)g(x)=\sigma f(x) for almost every xx. ∎
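Proposition 17 can be checked numerically under a discrete (counting) reference measure; the vectors f and g below are hypothetical:

```python
import numpy as np

def spow(x, g):
    return np.sign(x) * np.abs(x) ** g     # sign-preserved power

def cos_gamma(f, g, gamma):
    """gamma-cosine: <f/||f||_p, (g/||g||_p)^{sign-power p/q}> with p = gamma+1, p/q = gamma."""
    p = gamma + 1.0
    fn = f / np.linalg.norm(f, ord=p)      # np.linalg.norm accepts a float ord for vectors
    gn = g / np.linalg.norm(g, ord=p)
    return float(np.dot(fn, spow(gn, gamma)))

f = np.array([0.5, -1.0, 2.0])
g = np.array([0.4, -0.9, 1.8])

assert abs(cos_gamma(f, g, 1.5)) <= 1.0                  # |cos_gamma| <= 1 (Proposition 17)
assert np.isclose(cos_gamma(f, 3.0 * f, 1.5), 1.0)       # equality iff g is proportional to f
assert np.isclose(cos_gamma(f, -2.0 * f, 1.5), -1.0)     # sign flips under negative scaling
# gamma = 1 (p = q = 2) reduces to the ordinary cosine similarity
std = np.dot(f, g) / (np.linalg.norm(f) * np.linalg.norm(g))
assert np.isclose(cos_gamma(f, g, 1.0), std)
```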

In this way, the γ\gamma-cosine is defined by the isomorphism between p{\cal L}_{p} and p{\cal L}_{p}^{*}. We note that

cosγ(τf,σg)=sign(τσ)cosγ(f,g).\displaystyle\cos_{\gamma}(\tau f,\sigma g)={\rm sign}(\tau\sigma)\cos_{\gamma}(f,g).

Accordingly, cosγ(f,g)\cos_{\gamma}(f,g) is a natural extension of the cosine functional cos(f,g)\cos(f,g). A distinctive feature is that cosγ(f,g)\cos_{\gamma}(f,g) is asymmetric in ff and gg if γ1\gamma\neq 1. The asymmetry remains, akin to divergence measures, providing a directional similarity measure between two functions. We have discussed the cosine measure extended on 1+γ(Λ){\cal L}_{1+\gamma}(\Lambda) in relation to the log γ\gamma-divergence. In effect, the divergence is defined so as to be applicable to any empirical probability measure for a given dataset. However, such a constraint is not required in this context. Hence we can define a generalized variant

cos(β,γ)(f,g)=(ffβ+γ)β,(ggβ+γ)γ\displaystyle\cos_{(\beta,\gamma)}(f,g)=\Big{\langle}\Big{(}\frac{f\ }{\|f\|_{\beta+\gamma}}\Big{)}^{\ominus\beta},\Big{(}\frac{g\ }{\|g\|_{\beta+\gamma}}\Big{)}^{\ominus{\gamma}}\Big{\rangle} (4.15)

for ff and gg in β+γ(Λ){\cal L}_{\beta+\gamma}(\Lambda), called the (β,γ)(\beta,\gamma)-cosine measure, with tuning parameters β1\beta\geq 1 and γ1.\gamma\geq 1. Specifically, it is noted cosγ(f,g)=cos(1,γ)(f,g)\cos_{\gamma}(f,g)=\cos_{(1,\gamma)}(f,g). We note that the information divergence associated with cos(β,γ)(f,g)\cos_{(\beta,\gamma)}(f,g) is given by

Δ(β,γ)(f,g)=1βγlog|cos(β,γ)(f,g)|.\displaystyle\Delta_{(\beta,\gamma)}(f,g)=-\frac{1}{\beta\gamma}\log|\cos_{(\beta,\gamma)}(f,g)|.

In statistical machine learning, this measure could be used to compare probability density functions, regression functions, or other functional forms, especially when dealing with asymmetric relationships. It might be particularly relevant in scenarios where the sign of the function values carries important information, such as in economic data, signal processing, or environmental modeling.

The formulation defined on the function space is easily reduced on a Euclidean space as follows. Let xx and yy be in d\mathbb{R}^{d}. Then, the cosine similarity is defined by

cos(x,y)=xx,yy=e(x),e(y),\displaystyle\cos(x,y)=\Big{\langle}\frac{x}{\|x\|},\frac{y}{\|y\|}\Big{\rangle}=\langle e(x),e(y)\rangle, (4.16)

where e(x)=x/xe(x)=x/\|x\| and ,\langle\cdot,\cdot\rangle and \|\cdot\| denote the Euclidean inner product and norm on d\mathbb{R}^{d}. The γ\gamma-cosine function in d\mathbb{R}^{d} is introduced as

cosγ(x,y)\displaystyle\cos_{\gamma}(x,y) =xxp,(yyp)γ=eγ(x),eγ(y),\displaystyle=\Big{\langle}\frac{x}{\|x\|_{p}},\Big{(}\frac{y}{\|y\|_{p}}\Big{)}^{\ominus\gamma}\Big{\rangle}=\langle e_{\gamma}(x),e_{\gamma}^{*}(y)\rangle,
=i=1dxisign(yi)|yi|γ{i=1d|xi|p}1p{i=1d|yi|p}1q,\displaystyle=\frac{\sum_{i=1}^{d}x_{i}{\rm sign}(y_{i})|y_{i}|^{\gamma}}{\{\sum_{i=1}^{d}|x_{i}|^{p}\}^{\frac{1}{p}}\{\sum_{i=1}^{d}|y_{i}|^{p}\}^{\frac{1}{q}}}, (4.17)

for a power parameter γ>0\gamma>0. We can view the plot of the sign-preserving power transformation xγx^{\ominus\gamma} for γ=15,25,35,45,1\gamma=\frac{1}{5},\frac{2}{5},\frac{3}{5},\frac{4}{5},1 in Fig. 4.3:

Refer to caption
Figure 4.3: Plots of the signed power function.

As for the generalized measure, the (β,γ)(\beta,\gamma)-cosine measure is given by

cos(β,γ)(x,y)=(xxβ+γ)β,(yyβ+γ)γ,\displaystyle\cos_{(\beta,\gamma)}(x,y)=\Big{\langle}\Big{(}\frac{x}{\|x\|_{\beta+\gamma}}\Big{)}^{\ominus\beta},\Big{(}\frac{y}{\|y\|_{\beta+\gamma}}\Big{)}^{\ominus\gamma}\Big{\rangle},

see the functional form (4.14).
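A vectorized sketch of (4.17) and the generalized (β, γ)-cosine, with hypothetical inputs:

```python
import numpy as np

def spow(x, g):
    return np.sign(x) * np.abs(x) ** g

def cos_beta_gamma(x, y, beta, gamma):
    """(beta,gamma)-cosine: inner product of the sign-powered, (beta+gamma)-normalized vectors."""
    p = beta + gamma
    xn = x / np.linalg.norm(x, ord=p)
    yn = y / np.linalg.norm(y, ord=p)
    return float(np.dot(spow(xn, beta), spow(yn, gamma)))

x = np.array([3.0, -1.0, 0.5])
y = np.array([2.5, -1.2, 0.4])

# (beta, gamma) = (1, 1) recovers the standard cosine similarity
std = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(cos_beta_gamma(x, y, 1.0, 1.0), std)
# scale behaviour: cos(tau x, sigma y) = sign(tau sigma) cos(x, y)
assert np.isclose(cos_beta_gamma(-2.0 * x, 3.0 * y, 1.0, 3.0),
                  -cos_beta_gamma(x, y, 1.0, 3.0))
```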

We investigate properties of the γ\gamma-cosine and the (β,γ)(\beta,\gamma)-cosine in comparison with the standard cosine. Let e(β,γ)(x)=(xxβ+γ)β\displaystyle e_{(\beta,\gamma)}(x)=\Big{(}\frac{x}{\|x\|_{\beta+\gamma}}\Big{)}^{\ominus\beta} and e(β,γ)(y)=(yyβ+γ)γ\displaystyle e_{(\beta,\gamma)}^{*}(y)=\Big{(}\frac{y}{\|y\|_{\beta+\gamma}}\Big{)}^{\ominus\gamma}. Then, the (β,γ)(\beta,\gamma)-cosine is written as

cos(β,γ)(x,y)=e(β,γ)(x),e(β,γ)(y),\displaystyle\cos_{(\beta,\gamma)}(x,y)=\langle e_{(\beta,\gamma)}(x),e_{(\beta,\gamma)}^{*}(y)\rangle,

We observe the following behaviors as γ\gamma takes extreme values.

Proposition 18.

Let xx and yy be in d\mathbb{R}^{d}. Then,
(a).              limγ0cos(β,γ)(x,y)=(xxβ)β,sign(y),\displaystyle\lim_{\gamma\rightarrow 0}\cos_{(\beta,\gamma)}(x,y)=\Big{\langle}\Big{(}\frac{x}{\|x\|_{\beta}}\Big{)}^{\ominus\beta},{\rm sign}(y)\Big{\rangle},
where sign(y)=(sign(yi))i=1d{\rm sign}(y)=({\rm sign}(y_{i}))_{i=1}^{d}. Further,
(b).              limγcos(β,γ)(x,y)=(xx)β,sign(y)sign(y)1,\displaystyle\lim_{\gamma\rightarrow\infty}\cos_{(\beta,\gamma)}(x,y)=\Big{\langle}\Big{(}\frac{x}{\|x\|_{\infty}}\Big{)}^{\ominus\beta},\frac{{\rm sign}_{\infty}(y)}{\|{\rm sign}_{\infty}(y)\|_{1}}\Big{\rangle},
where the ii-th component of sign(y){\rm sign}_{\infty}(y) denotes sign(yi)𝕀(|yi|=y){\rm sign}(y_{i}){\mathbb{I}}(|y_{i}|=\|y\|_{\infty}) for i=1,di=1...,d with y=max1id|yi|\|y\|_{\infty}=\max_{1\leq i\leq d}|y_{i}|.

Proof.

By definition, limγ0yγ=sign(y)\lim_{\gamma\rightarrow 0}y^{\ominus\gamma}={\rm sign}(y). This implies (a). Next, if we divide both the numerator and the denominator of e(β,γ)(y)e_{(\beta,\gamma)}^{*}(y) by y\|y\|_{\infty}, then

limγe(β,γ)(y)=limγ(yy)γ(yβ+γy)γ.\displaystyle\lim_{\gamma\rightarrow\infty}e_{(\beta,\gamma)}^{*}(y)=\lim_{\gamma\rightarrow\infty}\Big{(}\frac{y}{\|y\|_{\infty}}\Big{)}^{\ominus\gamma}\Big{(}\frac{\|y\|_{\beta+\gamma}}{\|y\|_{\infty}}\Big{)}^{-\gamma}.

Hence, for i=1,,di=1,...,d

sign(yi)limγ(|yi|y)γ=sign(yi)𝕀(|yi|=y);\displaystyle{\rm sign}(y_{i})\lim_{\gamma\rightarrow\infty}\Big{(}\frac{|y_{i}|}{\|y\|_{\infty}}\Big{)}^{\gamma}={\rm sign}(y_{i}){\mathbb{I}}(|y_{i}|=\|y\|_{\infty});
limγ(yβ+γy)γ=limγ[i=1d(|yi|y)γ]1=[i=1d𝕀(|yi|=y)]1.\displaystyle\lim_{\gamma\rightarrow\infty}\Big{(}\frac{\|y\|_{\beta+\gamma}}{\|y\|_{\infty}}\Big{)}^{-\gamma}=\lim_{\gamma\rightarrow\infty}\bigg{[}\sum_{i=1}^{d}\Big{(}\frac{|y_{i}|}{\|y\|_{\infty}}\Big{)}^{\gamma}\bigg{]}^{-1}=\Big{[}\sum_{i=1}^{d}{\mathbb{I}}(|{y_{i}}|={\|y\|_{\infty}})\Big{]}^{-1}.

Consequently, we conclude (b). ∎

We remark that

limβ0,γ0cos(β,γ)(x,y)=1dsign(x),sign(y).\displaystyle\lim_{\beta\rightarrow 0,\gamma\rightarrow 0}\cos_{(\beta,\gamma)}(x,y)=\frac{1}{d}\big{\langle}{\rm sign}(x),{\rm sign}(y)\big{\rangle}.

Alternatively, the order of taking limits of β\beta and γ\gamma to \infty with respect to cos(β,γ)(x,y)\cos_{(\beta,\gamma)}(x,y) results in different outcomes:

limβlimγcos(β,γ)(x,y)=sign(x),sign(y)sign(y)1;\displaystyle\lim_{\beta\rightarrow\infty}\lim_{\gamma\rightarrow\infty}\cos_{(\beta,\gamma)}(x,y)=\frac{\langle{\rm sign}_{\infty}(x),{\rm sign}_{\infty}(y)\rangle}{\|{\rm sign}_{\infty}(y)\|_{1}};
limγlimβcos(β,γ)(x,y)=sign(x),sign(y)sign(x)1;\displaystyle\lim_{\gamma\rightarrow\infty}\lim_{\beta\rightarrow\infty}\cos_{(\beta,\gamma)}(x,y)=\frac{\langle{\rm sign}_{\infty}(x),{\rm sign}_{\infty}(y)\rangle}{\|{\rm sign}_{\infty}(x)\|_{1}};
limγcos(γ,γ)(x,y)=sign(x),sign(y)sign(x)2sign(y)2.\displaystyle\lim_{\gamma\rightarrow\infty}\cos_{(\gamma,\gamma)}(x,y)=\frac{\langle{\rm sign}_{\infty}(x),{\rm sign}_{\infty}(y)\rangle}{\|{\rm sign}_{\infty}(x)\|_{2\ }\|{\rm sign}_{\infty}(y)\|_{2}}.
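Proposition 18 (b) can be checked numerically; the vectors below are hypothetical, with two components of y tied at its maximum absolute value:

```python
import numpy as np

def spow(x, g):
    return np.sign(x) * np.abs(x) ** g

def cos_beta_gamma(x, y, beta, gamma):
    p = beta + gamma
    return float(np.dot(spow(x / np.linalg.norm(x, ord=p), beta),
                        spow(y / np.linalg.norm(y, ord=p), gamma)))

x = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, -3.0, 3.0])            # |y_2| = |y_3| = ||y||_inf = 3: a tie at the max

beta = 2.0
# limit (b): the y-side collapses onto the sparse vector sign_inf(y) / ||sign_inf(y)||_1
s_inf = np.sign(y) * (np.abs(y) == np.abs(y).max())
limit = float(np.dot(spow(x / np.abs(x).max(), beta), s_inf / np.abs(s_inf).sum()))
# for a large but finite gamma the (beta, gamma)-cosine is already close to the limit
assert np.isclose(cos_beta_gamma(x, y, beta, 200.0), limit, atol=1e-2)
```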

Note that sign(y){\rm sign}_{\infty}(y) is a sparse vector, as it has a nonzero sign only at the components with the maximum absolute value, with 0's elsewhere. Thus, cos(β,)(x,y)\cos_{(\beta,\infty)}(x,y) is proportional to the Euclidean inner product between xβx^{\ominus\beta} and the sparse vector sign(y){\rm sign}_{\infty}(y). This is in contrast with the standard cosine similarity, in which the orthogonality with cos(x,y)\cos_{\infty}(x,y) is totally different from that with cos(x,y)\cos(x,y). In effect, cos(x,y)=0x,y=0\cos(x,y)=0\Leftrightarrow\langle x,y\rangle=0; cos(β,)(x,y)=0xβ,sign(y)=0\cos_{(\beta,\infty)}(x,y)=0\Leftrightarrow\langle x^{\ominus\beta},{\rm sign}_{\infty}(y)\rangle=0. The orthogonality with cos(β,)(x,y)\cos_{(\beta,\infty)}(x,y) reduces to the inner product of the dd_{\infty}-dimensional Euclidean space, where dd_{\infty} is the cardinality of {i{1,,d}:|yi|=y}\{i\in\{1,...,d\}:|y_{i}|=\|y\|_{\infty}\}. Note that the equality condition in the limit case of γ\gamma is totally different from that for finite γ\gamma. Indeed, x=±sign(y)x=\pm{\rm sign}_{\infty}(y) if and only if cos(β,)(x,y)=±1\cos_{(\beta,\infty)}(x,y)=\pm 1, where cos(x,y)\cos_{\infty}(x,y) can be viewed as the arithmetic mean of relative ratios over 1(y)1_{\infty}(y). It is pointed out that the cosine similarity has poor performance for high-dimensional data: its values become small numbers near zero, and hence they cannot extract important characteristics of vectors. It is frequently observed for high-dimensional data that only a small part of the components carries important information for a target analysis, while the remaining components are non-informative. The standard cosine similarity measures all components equally, while the power-transformed cosine (γ\gamma-cos) can focus on the small set of essential components. Thus, the γ\gamma-cos neglects the unnecessary information carried by the majority of components, so that it can extract the essential information associated with the principal components. In this sense, the γ\gamma-cos does not need any preprocessing procedures for dimension reduction such as principal component analysis.

Proposition 19.

Let x=(x0,x1)x=(x_{0},x_{1}) and y=(y0,y1)y=(y_{0},y_{1}), respectively, where x0,y0d0x_{0},y_{0}\in\mathbb{R}^{d_{0}}; x1,y1d1x_{1},y_{1}\in\mathbb{R}^{d_{1}} with d=d0+d1d=d_{0}+d_{1}. If x0>x1\|x_{0}\|_{\infty}>\|x_{1}\|_{\infty} and y0>y1\|y_{0}\|_{\infty}>\|y_{1}\|_{\infty}, then,

cos(β,)(x0,y0)=cos(β,)(x,y).\displaystyle\cos_{(\beta,\infty)}(x_{0},y_{0})=\cos_{(\beta,\infty)}(x,y). (4.18)
Proof.

From the assumption, x0=x\|x_{0}\|_{\infty}=\|x\|_{\infty} and 1(y0)=1(y)1_{\infty}(y_{0})=1_{\infty}(y). This implies

cos(β,)(x,y)=i=1dxiβx0βsign(yi)𝕀(|yi|=y0)|1(y0)|,\displaystyle\cos_{(\beta,\infty)}(x,y)=\sum_{i=1}^{d}\frac{\ x_{i}^{\ominus\beta}\ \ }{\|x_{0}\|_{\infty}^{\beta}}\frac{{\rm sign}(y_{i}){\mathbb{I}}(|y_{i}|=\|y_{0}\|_{\infty}\big{)}}{|1_{\infty}(y_{0})|}, (4.19)

which is nothing but cos(β,)(x0,y0)\cos_{(\beta,\infty)}(x_{0},y_{0}) since all the summands are zeros in the summation of ii from d0+1d_{0}+1 to dd in (4.19). ∎

In Proposition 19, the infinite-power cosine similarity can be viewed as a robust measure in the sense that cos(x0,y0)=cos((x0,x1),(y0,y1))\cos_{\infty}(x_{0},y_{0})=\cos_{\infty}((x_{0},x_{1}),(y_{0},y_{1})) for any minor components x1x_{1} and y1y_{1}. However, this robustness can be extreme, as seen in the following.

Proposition 20.

Consider a function of ϵ\epsilon as

Φ(ϵ)=cos(x,(y0,ϵy1)).\displaystyle\Phi(\epsilon)=\cos_{\infty}(x,(y_{0},\epsilon y_{1})).

Then, if y1=y0\|y_{1}\|_{\infty}=\|y_{0}\|_{\infty}, Φ(ϵ)\Phi(\epsilon) is not continuous at ϵ=1\epsilon=1.

Proof.

It follows from Proposition 19 that, if 0<ϵ<10<\epsilon<1, then

Φ(ϵ)=i=1d0xixsign(yi)𝕀(|yi|=y0)|1(y0)|\displaystyle\Phi(\epsilon)=\sum_{i=1}^{d_{0}}\frac{x_{i}\ \ }{\|x\|_{\infty}}\frac{{\rm sign}(y_{i}){\mathbb{I}}(|y_{i}|=\|y_{0}\|_{\infty}\big{)}}{|1_{\infty}(y_{0})|}

where d0d_{0} is the dimension of y0y_{0}. On the other hand,

Φ(1)=i=1d0xixsign(yi)𝕀(|yi|=y0)|1(y0)|+|1(y1)|+i=d0+1dxixsign(yi)𝕀(|yi|=y1)|1(y0)|+|1(y1)|.\displaystyle\Phi(1)=\sum_{i=1}^{d_{0}}\frac{x_{i}\ \ }{\|x\|_{\infty}}\frac{{\rm sign}(y_{i}){\mathbb{I}}(|y_{i}|=\|y_{0}\|_{\infty}\big{)}}{|1_{\infty}(y_{0})|+|1_{\infty}(y_{1})|}+\sum_{i=d_{0}+1}^{d}\frac{x_{i}\ \ }{\|x\|_{\infty}}\frac{{\rm sign}(y_{i}){\mathbb{I}}(|y_{i}|=\|y_{1}\|_{\infty}\big{)}}{|1_{\infty}(y_{0})|+|1_{\infty}(y_{1})|}.

This implies the discontinuity of Φ(ϵ)\Phi(\epsilon) at ϵ=1\epsilon=1. ∎

We investigate statistical properties of the power cosine measure in comparison with the conventional cosine similarity. For this, we consider scenarios generating realized vectors xx's and yy's in d\mathbb{R}^{d} as follows. Assume that the jj-th replications XjX_{j} and YjY_{j} are given by

Xj=μ1+ϵ1andYj=μ2+ϵ2,\displaystyle X_{j}=\mu_{1}+\epsilon_{1}\hskip 14.22636pt\mbox{and}\hskip 14.22636ptY_{j}=\mu_{2}+\epsilon_{2},

where ϵa\epsilon_{a}'s are independently and identically distributed as 𝙽𝚘𝚛(0,σ2𝕀d){\tt Nor}(0,\sigma^{2}{\mathbb{I}}_{d}). We conduct a numerical experiment with 20002000 replications, setting d=1000d=1000 and μ1=(10,9,,1,0,,0)\mu_{1}=(10,9,...,1,0,...,0)^{\top}, with μ2\mu_{2} specified below for several values of σ2\sigma^{2}.

First, fix μ2=μ1\mu_{2}=\mu_{1} as a proportional case. Then, the value of the cosine measure cos(β,γ)(X,Y)\cos_{(\beta,\gamma)}(X,Y) is expected to be 11 if the error terms are negligible. When (β,γ)=(1,1)(\beta,\gamma)=(1,1), cos(β,γ)(X,Y)\cos_{(\beta,\gamma)}(X,Y) does not have a mean close to 11 even with small errors; when β>1,γ>1\beta>1,\gamma>1, it has a mean close to 11 with reasonable errors. Table 4.1 shows detailed outcomes with the variance σ2=0.05,0.1,0.3,0.5\sigma^{2}=0.05,0.1,0.3,0.5, where Mean and Std denote the mean and standard deviation of cos(β,γ)(Xj,Yj)\cos_{(\beta,\gamma)}(X_{j},Y_{j})'s over 2000 replications. Second, fix

μ2=μ20μ1,μ20μ12μ1,\displaystyle\mu_{2}=\mu_{20}-\frac{\langle\mu_{1},\mu_{20}\rangle}{\|\mu_{1}\|^{2}}\mu_{1},

where μ20=(1,2,,10,0,,0)\mu_{20}=(1,2,...,10,0,...,0)^{\top}. Note μ1,μ2=0\langle\mu_{1},\mu_{2}\rangle=0 by construction. This means μ1\mu_{1} and μ2\mu_{2} are orthogonal in the L2-sense. Then, the value of the cosine measure cos(β,γ)(X,Y)\cos_{(\beta,\gamma)}(X,Y) should be near 00 if the error terms are negligible. For all the cases of (β,γ)(\beta,\gamma), the mean of cos(β,γ)(X,Y)\cos_{(\beta,\gamma)}(X,Y) is reasonably near 00 with small standard deviations; see Table 4.2 for details.
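A reduced-scale sketch of the proportional case (fewer replications than the 2000 reported; the seed and helper names are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def spow(x, g):
    return np.sign(x) * np.abs(x) ** g

def cos_beta_gamma(x, y, beta, gamma):
    p = beta + gamma
    return float(np.dot(spow(x / np.linalg.norm(x, ord=p), beta),
                        spow(y / np.linalg.norm(y, ord=p), gamma)))

d, reps, sigma2 = 1000, 200, 0.05
mu1 = np.concatenate([np.arange(10.0, 0.0, -1.0), np.zeros(d - 10)])  # (10, 9, ..., 1, 0, ..., 0)

# proportional case mu2 = mu1 with (beta, gamma) = (2, 2)
vals = [cos_beta_gamma(mu1 + rng.normal(0.0, np.sqrt(sigma2), d),
                       mu1 + rng.normal(0.0, np.sqrt(sigma2), d), 2.0, 2.0)
        for _ in range(reps)]
print(round(np.mean(vals), 3), round(np.std(vals), 3))   # mean close to 1, as in Table 4.1
```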

Table 4.1: cos_{(β,γ)}(X,Y) in a proportional case; each entry is Mean (Std)

(β,γ)    σ² = 0.05       σ² = 0.1        σ² = 0.3        σ² = 0.5
(1,1)    0.885 (0.005)   0.793 (0.008)   0.562 (0.018)   0.434 (0.023)
(2,2)    0.997 (0.001)   0.993 (0.003)   0.975 (0.008)   0.948 (0.014)
(2,5)    0.995 (0.003)   0.991 (0.007)   0.976 (0.018)   0.961 (0.030)

Table 4.2: cos_{(β,γ)}(X,Y) in an orthogonal case; each entry is Mean (Std)

(β,γ)    σ² = 0.05        σ² = 0.1        σ² = 0.3        σ² = 0.5
(1,1)    0.000 (0.029)    0.000 (0.015)   0.000 (0.020)   0.000 (0.027)
(2,2)    -0.086 (0.045)   0.092 (0.014)   0.093 (0.021)   -0.089 (0.036)
(2,5)    0.007 (0.000)    0.006 (0.000)   0.006 (0.000)   0.006 (0.000)

We applied these similarity measures to hierarchical clustering using a Python package. Synthetic data were generated in a setting of 8 clusters, each with 15 data points, in a 1000-dimensional Euclidean space. The distance functions used were cos(1,1)(x,y)\cos_{(1,1)}(x,y) and cos(1,5)(x,y)\cos_{(1,5)}(x,y), to compare performance in high-dimensional data clustering. The clustering criterion was set to maxclust in fcluster from the scipy.cluster.hierarchy module. The silhouette score, ranging from -1 to +1, served as a measure of the clustering quality. The clustering was conducted with 10 replications.

For case (a), using the distance based on cos(1,1)(x,y)\cos_{(1,1)}(x,y), the 10 silhouette scores had a mean of -0.038 with a standard deviation of 0.001, indicating poor clustering quality. Alternatively, for case (b), with the distance based on cos(1,5)(x,y)\cos_{(1,5)}(x,y), the scores had a mean of 0.833 and a standard deviation of 0.015, suggesting good clustering quality. Thus, the hierarchical clustering performance using (β,γ)=(1,5)(\beta,\gamma)=(1,5)-cosine similarity was significantly better than that using standard cosine similarity, as illustrated in typical dendrograms (Fig. 4.4).
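A self-contained sketch of such an experiment; the cluster geometry, noise level, and linkage method here are our assumptions, not the exact setting behind Fig. 4.4:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

def spow(x, g):
    return np.sign(x) * np.abs(x) ** g

def gamma_cos_dist(x, y, beta=1.0, gamma=5.0):
    """Dissimilarity 1 - cos_{(beta,gamma)}(x, y); note the measure is asymmetric,
    so pdist evaluates one orientation per pair."""
    p = beta + gamma
    xn = x / np.linalg.norm(x, ord=p)
    yn = y / np.linalg.norm(y, ord=p)
    return 1.0 - float(np.dot(spow(xn, beta), spow(yn, gamma)))

# 8 clusters x 15 points in R^1000; each cluster's signal lives on 5 coordinates
k, n_per, d = 8, 15, 1000
centers = np.zeros((k, d))
for j in range(k):
    centers[j, 5 * j:5 * j + 5] = 6.0
X = np.repeat(centers, n_per, axis=0) + rng.normal(0.0, 1.0, (k * n_per, d))

D = pdist(X, metric=gamma_cos_dist)            # condensed pairwise dissimilarities
labels = fcluster(linkage(D, method="average"), t=k, criterion="maxclust")
```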

Figure 4.4: The dendrograms by the distances based on (a) and (b).

Let XX be a dd-variate variable with a covariance matrix Σ\Sigma, which is a dd-dimensional symmetric, positive definite matrix. Suppose the eigenvalues λ1,,λd\lambda_{1},...,\lambda_{d} of Σ\Sigma are restricted as λ1λd0ϵ>δλd0+1λd\lambda_{1}\geq...\geq\lambda_{d_{0}}\geq\epsilon>\delta\geq\lambda_{d_{0}+1}\geq...\geq\lambda_{d}. Given an nn-random sample (X1,,Xn)(X_{1},...,X_{n}) from XX, the standard PCA solves for the kk principal vectors v1,,vkv_{1},...,v_{k}, giving the approximation xj=1kvjvjxx\approx\sum_{j=1}^{k}v_{j}v_{j}^{\top}x. Suppose that (X1,,Xn)(X_{1},...,X_{n}) is generated from 𝙽𝚘𝚛d(0,Σ){\tt Nor}_{d}(0,\Sigma), where

Σ=[Σ0OOϵ𝕀dd0].\displaystyle\Sigma=\begin{bmatrix}\Sigma_{0}&O\\ O^{\top}&\epsilon\>{\mathbb{I}}_{d-d_{0}}\end{bmatrix}. (4.20)

Here Σ0\Sigma_{0} is a positive-definite matrix of size d0×d0d_{0}\times d_{0}-matrix whose eigenvalues are (λ1,,λd0)(\lambda_{1},...,\lambda_{d_{0}}) and OO is a zero matrix of size d0×(dd0).d_{0}\times(d-d_{0}). We set as

n=500,d=1000,d0=10,(λ1,,λd0)=(5,4.5,,1.5,1),ϵ=0.1.\displaystyle n=500,\ d=1000,\ d_{0}=10,\ (\lambda_{1},...,\lambda_{d_{0}})=(5,4.5,...,1.5,1),\ \epsilon=0.1.

Thus, the scenario envisages a situation where the signal is only 1010-dimensional, with the remaining 990990 dimensions being noise.

For this, the sample covariance matrix is defined by

S=1ni=1n(XiX¯)(XiX¯)\displaystyle S=\frac{1}{n}\sum_{i=1}^{n}(X_{i}-\bar{X})(X_{i}-\bar{X})^{\top}

and λ^j\hat{\lambda}_{j} and v^j\hat{v}_{j} are obtained as the jj-th eigenvalue and eigenvector of SS, where X¯\bar{X} is the sample mean vector. We propose the γ\gamma-sample covariance matrix as

Sγ=1ni=1n(XiγXγ¯)(XiγXγ¯),\displaystyle S^{\ominus\gamma}=\frac{1}{n}\sum_{i=1}^{n}(X_{i}^{\ominus\gamma}-\overline{X^{\ominus\gamma}})(X_{i}^{\ominus\gamma}-\overline{X^{\ominus\gamma}})^{\top},

where the γ\gamma-transform for a dd-vector xx is given by xγ=(sign(xj)|xj|γ)j=1dx^{\ominus\gamma}=({\rm sign}(x_{j})|x_{j}|^{\gamma})_{j=1}^{d} and

Xγ¯=1ni=1nXiγ.\displaystyle\overline{X^{\ominus\gamma}}=\frac{1}{n}\sum_{i=1}^{n}X_{i}^{\ominus\gamma}.

Thus, the γ\gamma-PCA is derived by solving for the eigenvalues and eigenvectors of SγS^{\ominus\gamma}.

To implement the PCA modification in Python, given the specific requirements for generating the sample data, we follow these steps:

  • Generate Sample Data: Create a 1000-dimensional dataset where the first 10 dimensions are drawn from a normal distribution with a specific covariance matrix Σ0\Sigma_{0}, and the remaining dimensions have a much smaller variance.

  • Compute the γ\gamma-Sample Covariance Matrix: Apply the γ\gamma transformation to the covariance matrix computation.

  • Eigenvalue and Eigenvector Computation: Compute the eigenvalues and eigenvectors of the γ\gamma-sample covariance matrix.
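These steps can be sketched as follows; the eigenvalue sequence uses np.linspace(5, 1, 10) as an approximation to the stated (5, 4.5, ..., 1), and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

def spow(x, g):
    return np.sign(x) * np.abs(x) ** g          # gamma-transform, applied elementwise

# Step 1: synthetic data -- 10 signal dimensions, 990 small-variance noise dimensions
n, d, d0, eps = 500, 1000, 10, 0.1
lam = np.linspace(5.0, 1.0, d0)
X = rng.normal(0.0, 1.0, (n, d)) * np.sqrt(np.concatenate([lam, eps * np.ones(d - d0)]))

# Step 2: the gamma-sample covariance matrix of the transformed data
def gamma_cov(X, gamma):
    Z = spow(X, gamma)
    Zc = Z - Z.mean(axis=0)
    return Zc.T @ Zc / len(Z)

# Step 3: eigenvalues and the cumulative contribution ratio at d0 = 10 components
def ccr10(S):
    w = np.sort(np.linalg.eigvalsh(S))[::-1]    # eigenvalues in descending order
    return float(np.cumsum(w)[d0 - 1] / w.sum())

print(ccr10(gamma_cov(X, 1.0)), ccr10(gamma_cov(X, 2.0)))  # gamma = 2 concentrates the signal
```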

We conducted a numerical experiment according to these steps. The cumulative contribution ratios are plotted in Fig 4.5. It was observed that the standard PCA (γ=1.0)(\gamma=1.0) had poor performance for the synthetic dataset, in which the cumulative contribution at 1010 dimensions was lower than 0.30.3. Alternatively, the γ\gamma-PCA effectively improves the performance, as the cumulative contribution at 1010 dimensions was higher than 0.90.9 for γ=2.0\gamma=2.0. We remark that this efficiency of the γ\gamma-PCA depends on the simulation setting, in which the signal vector X0X_{0} of dimension d0d_{0} and the no-signal vector X1X_{1} of dimension dd0d-d_{0} are independent as in (4.20), where XX is decomposed as (X0,X1)(X_{0},X_{1}). If the independence is not assumed, then the good recovery by the γ\gamma-PCA is not observed. In reality, there would not be strong evidence of whether the independence holds or not. To address this issue, further discussion with real data analysis is needed. Additionally, combining PCA with other techniques like independent component analysis or machine learning algorithms can further enhance its performance in complex data environments. This broader perspective is important, especially concerning the real-world applicability and limitations of PCA modifications.

Figure 4.5: Plot of cumulative contribution ratios with γ=1.0,1.5,2.0\gamma=1.0,1.5,2.0

We have discussed the extension of the γ\gamma-divergence to the Lebesgue Lp-space and introduced the concept of γ\gamma-cosine similarity, a novel measure for comparing functions or vectors in a function space. This measure is particularly relevant in statistics and machine learning, especially when dealing with signed functions or functional data.

The γ\gamma-divergence, previously studied in the context of regression and classification, is extended to the Lebesgue Lp-space. To address the issue of functions taking negative values, a sign-preserved power transformation is introduced. This transformation is crucial for extending the log γ\gamma-divergence to functions that can take negative values. The concept of cosine similarity, commonly used in Hilbert spaces, is extended to the Lp-space. The γ\gamma-cosine similarity is defined as cosγ(f,g)=f/fp,(g/gp)(p/q)\cos_{\gamma}(f,g)=\langle f/\|f\|_{p},(g/\|g\|_{p})^{\ominus(p/q)}\rangle, where p=γ+1p=\gamma+1 and qq is the conjugate exponent of pp. This measure remains mathematically consistent for all real values of g(x)g(x).

Basic properties of the γ\gamma-cosine similarity are explored: |cosγ(f,g)|1|\cos_{\gamma}(f,g)|\leq 1, with equality holding if and only if gg is proportional to ff. It is also noted that cosγ(f,g)=0\cos_{\gamma}(f,g)=0 if and only if Δγ(f,g)=\Delta_{\gamma}(f,g)=\infty, indicating maximum distinctness between ff and gg. The generalized (β,γ)(\beta,\gamma)-cosine measure, a more general form of the cosine measure, is introduced for the Lβ+γ(Λ)L_{\beta+\gamma}(\Lambda) space, providing additional flexibility through the tuning parameters β\beta and γ\gamma.

An application of these similarity measures to hierarchical clustering is demonstrated using Python. The (β,γ)(\beta,\gamma)-cosine similarity shows better performance in clustering high-dimensional data than the standard cosine similarity. It can focus on essential components of the data, potentially reducing the need for preprocessing steps such as principal component analysis. The γ\gamma-PCA is defined, parallel to the (γ,γ)(\gamma,\gamma)-cosine, and demonstrated to perform well in high-dimensional settings.
Therefore, the γ\gamma-cosine and (β,γ)(\beta,\gamma)-cosine measures could be particularly useful in statistical machine learning for comparing probability density functions, regression functions, or other functional forms, especially in scenarios where the sign of function values is significant.
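In finite dimensions the definition above can be computed directly. Since qq is the conjugate exponent of p=γ+1p=\gamma+1, the power p/qp/q equals γ\gamma, so the sign-preserved power of g/gpg/\|g\|_{p} is sign(g)|g|γ/gpγ\operatorname{sign}(g)|g|^{\gamma}/\|g\|_{p}^{\gamma}. A minimal sketch (γ=1\gamma=1 recovers the ordinary cosine similarity):

```python
import numpy as np

def gamma_cosine(f, g, gamma):
    """gamma-cosine similarity of vectors f and g.
    With p = gamma + 1 and conjugate exponent q, p/q = gamma, so the
    sign-preserved power of g/||g||_p is sign(g)|g|^gamma / ||g||_p^gamma."""
    p = gamma + 1.0
    signed_pow = np.sign(g) * np.abs(g) ** gamma  # sign-preserved power
    return f @ signed_pow / (np.linalg.norm(f, p) * np.linalg.norm(g, p) ** gamma)

f = np.array([1.0, -2.0, 0.5])
print(gamma_cosine(f, 3.0 * f, 2.0))   # proportional vectors: similarity 1
print(gamma_cosine(f, -f, 2.0))        # opposite sign: similarity -1
g = np.array([0.2, 1.0, -1.5])
print(gamma_cosine(f, g, 1.0))         # gamma = 1: ordinary cosine similarity
```

The proportionality property |cosγ(f,g)|=1|\cos_{\gamma}(f,g)|=1 iff gfg\propto f follows here from Hölder's inequality, since |g|γ|g|^{\gamma} lies in the conjugate LqL_{q} space.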

In conclusion, the γ\gamma-cosine similarity and its generalized form, the (β,γ)(\beta,\gamma)-cosine measure, represent significant advancements in the field of statistical mathematics, particularly in the analysis of high-dimensional data and functional data analysis. These measures offer a more flexible and robust way to compare functions or vectors in various spaces, which is crucial for many applications in statistics and machine learning.

4.5 Concluding remarks

The concepts introduced in this chapter, particularly the GM divergence, γ\gamma-divergence, and γ\gamma-cosine similarity, offer promising avenues for advancing machine learning techniques, especially in high-dimensional settings. However, several areas warrant further exploration to fully understand and leverage these methodologies.

While the computational advantages of the GM divergence and γ\gamma-cosine similarity are demonstrated through simulations, real-world applications in domains such as bioinformatics, natural language processing, and image analysis could benefit from a deeper investigation. The scalability of these methods in extremely high-dimensional datasets, particularly those encountered in genomics or deep learning models, remains an open question. Future research should focus on implementing these methods in large-scale machine learning pipelines to assess their performance and robustness compared to traditional methods. This could include exploring parallel computing strategies or GPU acceleration to handle the increased computational demands in practical applications.

The chapter primarily discusses the GM divergence and γ\gamma-divergence, but the potential to extend these ideas to other divergence measures, such as Jensen-Shannon divergence or Renyi divergence, could be fruitful. Investigating how these alternative measures interact with the GM estimator or can be integrated into ensemble learning frameworks like AdaBoost might yield novel insights and improved algorithms. Moreover, a systematic comparison of these divergence measures across different machine learning tasks could provide clarity on their relative strengths and weaknesses.

While the γ\gamma-cosine similarity provides a novel way to compare vectors in function spaces, its theoretical underpinnings require further formalization. For instance, exploring its properties in different types of function spaces, such as Sobolev spaces or Besov spaces, might reveal new insights into its behavior and applications. Additionally, the interpretability of the γ\gamma-cosine similarity in practical settings is a key aspect that should be addressed. How does this measure correlate with traditional metrics used in machine learning, such as accuracy, precision, and recall? Can it be used to enhance the interpretability of models, particularly in domains requiring high levels of transparency, such as healthcare or finance?

The methods discussed in this chapter are largely grounded in parametric models, particularly in the context of Boltzmann machines and AdaBoost. However, extending these divergence-based methods to non-parametric or semi-parametric models could open up new applications, particularly in statistical machine learning. For example, exploring the use of GM divergence in the context of kernel methods, Gaussian processes, or non-parametric Bayesian models could provide new avenues for research. Similarly, semi-parametric approaches that combine the flexibility of non-parametric methods with the interpretability of parametric models could benefit from the computational advantages of the GM estimator.

To solidify the practical utility of the proposed methods, extensive empirical validation across a variety of datasets and machine learning tasks is essential. This includes benchmarking against state-of-the-art algorithms to evaluate performance in terms of accuracy, computational efficiency, and robustness. Establishing a comprehensive suite of benchmarks, possibly in collaboration with the broader research community, could facilitate the adoption of these methods. Such benchmarks should include both synthetic datasets, to explore the behavior of these methods under controlled conditions, and real-world datasets, to demonstrate their applicability in practical scenarios.

The introduction of the γ\gamma and β\beta parameters in the γ\gamma-cosine and (β,γ)(\beta,\gamma)-cosine measures adds a layer of flexibility, but also complexity. Understanding how sensitive these methods are to the choice of these parameters, and developing guidelines or heuristics for their selection, would be a valuable addition to the methodology. Future work could explore automatic or adaptive methods for tuning these parameters, possibly integrating them with cross-validation techniques or Bayesian optimization to improve the ease of use and performance of the algorithms.

In conclusion, the introduction of the GM divergence, γ\gamma-divergence, and γ\gamma-cosine similarity offers exciting opportunities for advancing machine learning and statistical modeling. However, their full potential will only be realized through continued research and development. By addressing the challenges outlined above, the field can better understand the theoretical implications, enhance practical applications, and ultimately integrate these methods into mainstream machine learning practice.
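As a concrete, purely illustrative example of such data-driven tuning (the heuristic and all helper names below are assumptions, not methods from the text), one can grid-search γ\gamma and keep the value that maximizes leave-one-out nearest-neighbour accuracy under the dissimilarity 1cosγ1-\cos_{\gamma}:

```python
import numpy as np

def gamma_cosine(f, g, gamma):
    # gamma-cosine similarity; gamma = 1 recovers the ordinary cosine
    p = gamma + 1.0
    num = f @ (np.sign(g) * np.abs(g) ** gamma)
    return num / (np.linalg.norm(f, p) * np.linalg.norm(g, p) ** gamma)

def loo_nn_accuracy(X, y, gamma):
    # leave-one-out 1-NN accuracy with dissimilarity 1 - cos_gamma
    n = len(y)
    correct = 0
    for i in range(n):
        d = [1.0 - gamma_cosine(X[i], X[j], gamma) if j != i else np.inf
             for j in range(n)]
        correct += y[int(np.argmin(d))] == y[i]
    return correct / n

rng = np.random.default_rng(1)
# toy two-class data: classes differ only in the first 5 of 50 coordinates
X = rng.standard_normal((60, 50))
y = np.repeat([0, 1], 30)
X[y == 1, :5] += 2.0

grid = [0.5, 1.0, 1.5, 2.0]
scores = {g: loo_nn_accuracy(X, y, g) for g in grid}
best = max(scores, key=scores.get)
print("selected gamma:", best)
```

More principled alternatives mentioned above, such as cross-validation over a downstream loss or Bayesian optimization, would replace the leave-one-out criterion here.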

Acknowledgements

I also would like to acknowledge the assistance provided by ChatGPT, an AI language model developed by OpenAI. Its ability to answer questions, provide suggestions, and assist in the drafting process has been a remarkable aid in organizing and refining the content of this book. While any errors or omissions are my own, the contributions of ChatGPT have certainly made the writing process more efficient and enjoyable.

Bibliography

  • [1] Shun-Ichi Amari. Differential geometry of curved exponential families-curvatures and information loss. The Annals of Statistics, 10(2):357–385, 1982.
  • [2] Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
  • [3] Albert E Beaton and John W Tukey. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16(2):147–185, 1974.
  • [4] Raoul Bott, Loring W Tu, et al. Differential forms in algebraic topology, volume 82. Springer, 1982.
  • [5] Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
  • [6] Jacob Burbea and C Rao. On the convexity of some divergence measures based on entropy functions. IEEE Transactions on Information Theory, 28(3):489–495, 1982.
  • [7] George Casella and Roger Berger. Statistical inference. CRC Press, 2024.
  • [8] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, pages 493–507, 1952.
  • [9] Andrzej Cichocki and Shun-ichi Amari. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
  • [10] David A Cohn, Zoubin Ghahramani, and Michael I Jordan. Active learning with statistical models. Journal of artificial intelligence research, 4:129–145, 1996.
  • [11] John B Conway. A course in functional analysis, volume 96. Springer, 2019.
  • [12] John Copas. Binary regression models for contaminated data. Journal of the Royal Statistical Society: Series B., 50:225–265, 1988.
  • [13] John Copas and Shinto Eguchi. Local model uncertainty and incomplete-data bias (with discussion). Journal of the Royal Statistical Society: Series B., 67:459–513, 2005.
  • [14] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273–297, 1995.
  • [15] David R Cox. Some problems connected with statistical inference. Annals of Mathematical Statistics, 29(2):357–372, 1958.
  • [16] David Roxbee Cox and David Victor Hinkley. Theoretical statistics. CRC Press, 1979.
  • [17] Imre Csiszár, Paul C Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.
  • [18] Bradley Efron. Defining the curvature of a statistical problem (with applications to second order efficiency). The Annals of Statistics, pages 1189–1242, 1975.
  • [19] Shinto Eguchi. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima mathematical journal, 15(2):341–391, 1985.
  • [20] Shinto Eguchi. Geometry of minimum contrast. Hiroshima Mathematical Journal, 22(3):631–647, 1992.
  • [21] Shinto Eguchi. Information geometry and statistical pattern recognition. Sugaku Expositions, 19:197–216, 2006.
  • [22] Shinto Eguchi. Information Divergence Geometry and the Application to Statistical Machine Learning, pages 309–332. Springer US, Boston, MA, 2009.
  • [23] Shinto Eguchi. Minimum information divergence of q-functions for dynamic treatment resumes. Information Geometry, 7(Suppl 1):229–249, 2024.
  • [24] Shinto Eguchi and John Copas. A class of logistic-type discriminant functions. Biometrika, 89:1–22, 2002.
  • [25] Shinto Eguchi and John Copas. Interpreting kullback–leibler divergence with the neyman–pearson lemma. Journal of Multivariate Analysis, 97(9):2034–2040, 2006.
  • [26] Shinto Eguchi and Osamu Komori. Minimum divergence methods in statistical machine learning. Springer, Tokyo, 2022.
  • [27] Shinto Eguchi, Osamu Komori, and Shogo Kato. Projective power entropy and maximum tsallis entropy distributions. Entropy, 13(10):1746–1764, 2011.
  • [28] Shinto Eguchi, Osamu Komori, and Atsumi Ohara. Duality of maximum entropy and minimum divergence. Entropy, 16(7):3552–3572, 2014.
  • [29] Jane Elith and John R Leathwick. Species distribution models: ecological explanation and prediction across space and time. Annual review of ecology, evolution, and systematics, 40(1):677–697, 2009.
  • [30] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • [31] Jerome H Friedman. On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data mining and knowledge discovery, 1:55–77, 1997.
  • [32] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
  • [33] Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99:2053–2081, 2008.
  • [34] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359–378, 2007.
  • [35] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
  • [36] Peter D Grünwald and A Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory. 2004.
  • [37] Antoine Guisan and Wilfried Thuiller. Predicting species distribution: offering more than simple habitat models. Ecology letters, 8(9):993–1009, 2005.
  • [38] Trevor Hastie, Saharon Rosset, Ji Zhu, and Hui Zou. Multi-class adaboost. Statistics and its Interface, 2(3):349–360, 2009.
  • [39] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  • [40] Kenichi Hayashi and Shinto Eguchi. A new integrated discrimination improvement index via odds. Statistical Papers, pages 1–20, 2024.
  • [41] Hideitsu Hino and Shinto Eguchi. Active learning by query by committee with robust divergences. Information Geometry, 6(1):81–106, 2023.
  • [42] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
  • [43] Geoffrey E Hinton. A practical guide to training restricted boltzmann machines. In Neural Networks: Tricks of the Trade: Second Edition, pages 599–619. Springer, 2012.
  • [44] Geoffrey E Hinton, Terrence J Sejnowski, et al. Learning and relearning in boltzmann machines. Parallel distributed processing: Explorations in the microstructure of cognition, 1(282-317):2, 1986.
  • [45] Hung Gia Hoang, Ba-Ngu Vo, Ba-Tuong Vo, and Ronald Mahler. The cauchy–schwarz divergence for poisson point processes. IEEE Transactions on Information Theory, 61(8):4475–4485, 2015.
  • [46] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied logistic regression. John Wiley & Sons, 2013.
  • [47] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution, pages 492–518. Springer, 1992.
  • [48] Peter J Huber and Elvezio M Ronchetti. Robust statistics. John Wiley & Sons, 2011.
  • [49] Hung Hung, Zhi-Yu Jou, and Su-Yun Huang. Robust mislabel logistic regression without modeling mislabel probabilities. Biometrics, 74(1):145–154, 2018.
  • [50] Jack Jewson, Jim Q Smith, and Chris Holmes. Principles of bayesian inference using general divergence criteria. Entropy, 20(6):442, 2018.
  • [51] Bent Jørgensen. Exponential dispersion models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 49(2):127–145, 1987.
  • [52] Giorgio Kaniadakis. Non-linear kinetics underlying generalized statistics. Physica A: Statistical mechanics and its applications, 296(3-4):405–425, 2001.
  • [53] Osamu Komori and Shinto Eguchi. Statistical Methods for Imbalanced Data in Ecological and Biological Studies. Springer, Tokyo, 2019.
  • [54] Osamu Komori and Shinto Eguchi. A unified formulation of k-means, fuzzy c-means and gaussian mixture model by the kolmogorov-nagumo average. Entropy, 23:518, 2021.
  • [55] Osamu Komori, Shinto Eguchi, Shiro Ikeda, Hiroshi Okamura, Momoko Ichinokawa, and Shinichiro Nakayama. An asymmetric logistic regression model for ecological data. Methods in Ecology and Evolution, 7(2):249–260, 2016.
  • [56] Osamu Komori, Shinto Eguchi, Yusuke Saigusa, Buntarou Kusumoto, and Yasuhiro Kubota. Sampling bias correction in species distribution models by quasi-linear poisson point process. Ecological Informatics, 55:1–11, 2020.
  • [57] Osamu Komori, Yusuke Saigusa, and Shinto Eguchi. Statistical learning for species distribution models in ecological studies. Japanese Journal of Statistics and Data Science, 6(2):803–826, 2023.
  • [58] Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
  • [59] David G Luenberger. Optimization by vector space methods. John Wiley & Sons, 1997.
  • [60] Kumar P Mainali, Dan L Warren, Kunjithapatham Dhileepan, Andrew McConnachie, Lorraine Strathie, Gul Hassan, Debendra Karki, Bharat B Shrestha, and Camille Parmesan. Projecting future expansion of invasive species: comparing and improving methodologies for species distribution modeling. Global change biology, 21(12):4464–4480, 2015.
  • [61] Henry B Mann and Abraham Wald. On the statistical treatment of linear stochastic difference equations. Econometrica, Journal of the Econometric Society, pages 173–220, 1943.
  • [62] P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman & Hall, New York, 1989.
  • [63] Cory Merow, Adam M Wilson, and Walter Jetz. Integrating occurrence data and expert maps for improved species range predictions. Global Ecology and Biogeography, 26(2):243–258, 2017.
  • [64] Hanna Meyer and Edzer Pebesma. Predicting into unknown space? estimating the area of applicability of spatial prediction models. Methods in Ecology and Evolution, 12(9):1620–1633, 2021.
  • [65] Mihoko Minami and Shinto Eguchi. Robust blind source separation by beta divergence. Neural Computation, 14:1859–1886, 2002.
  • [66] Md Nurul Haque Mollah, Shinto Eguchi, and Mihoko Minami. Robust prewhitening for ica by minimizing β\beta-divergence and its application to fastica. Neural Processing Letters, 25:91–110, 2007.
  • [67] Md Nurul Haque Mollah, Mihoko Minami, and Shinto Eguchi. Exploring latent structure of mixture ICA models by the minimum beta-divergence method. Neural Computation, 18:166–190, 2006.
  • [68] Victoria Diane Monette. Ecological factors associated with habitat use of baird’s tapirs (tapirus bairdii). 2019.
  • [69] Noboru Murata, Takashi Takenouchi, Takafumi Kanamori, and Shinto Eguchi. Information geometry of U{U}-boost and Bregman divergence. Neural Computation, 16:1437–1481, 2004.
  • [70] Kanta Naito and Shinto Eguchi. Density estimation with minimization of U{U}-divergence. Machine Learning, 90:29–57, 2013.
  • [71] Tan Nguyen and Scott Sanner. Algorithms for direct 0–1 loss optimization in binary classification. In International conference on machine learning, pages 1085–1093. PMLR, 2013.
  • [72] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  • [73] Frank Nielsen. On geodesic triangles with right angles in a dually flat space. In Progress in Information Geometry: Theory and Applications, pages 153–190. Springer, 2021.
  • [74] Akifumi Notsu and Shinto Eguchi. Robust clustering method in the presence of scattered observations. Neural Computation, 28:1141–1162, 2016.
  • [75] Akifumi Notsu, Osamu Komori, and Shinto Eguchi. Spontaneous clustering via minimum gamma-divergence. Neural computation, 26(2):421–448, 2014.
  • [76] Katsuhiro Omae, Osamu Komori, and Shinto Eguchi. Quasi-linear score for capturing heterogeneous structure in biomarkers. BMC Bioinformatics, 18:308, 2017.
  • [77] Steven J Phillips, Miroslav Dudík, and Robert E Schapire. A maximum entropy approach to species distribution modeling. In Proceedings of the twenty-first international conference on Machine learning, page 83, 2004.
  • [78] Giovanni Pistone. κ\kappa-exponential models from the geometrical viewpoint. The European Physical Journal B, 70:29–37, 2009.
  • [79] C Radhakrishna Rao. Differential metrics in probability spaces. Differential geometry in statistical inference, 10:217–240, 1987.
  • [80] Mark D Reid and Robert C Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12(3), 2011.
  • [81] Ian W Renner, Jane Elith, Adrian Baddeley, William Fithian, Trevor Hastie, Steven J Phillips, Gordana Popovic, and David I Warton. Point process models for presence-only analysis. Methods in Ecology and Evolution, 6(4):366–379, 2015.
  • [82] Ian W Renner and David I Warton. Equivalence of maxent and poisson point process models for species distribution modeling in ecology. Biometrics, 69(1):274–281, 2013.
  • [83] Peter J Rousseeuw and Annick M Leroy. Robust regression and outlier detection. John wiley & sons, 2005.
  • [84] Yusuke Saigusa, Shinto Eguchi, and Osamu Komori. Robust minimum divergence estimation in a spatial poisson point process. Ecological Informatics, 81:102569, 2024.
  • [85] Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms. Kybernetes, 42(1):164–166, 2013.
  • [86] Burr Settles. Active learning literature survey. 2009.
  • [87] H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory, pages 287–294, 1992.
  • [88] Helen R Sofaer, Catherine S Jarnevich, Ian S Pearse, Regan L Smyth, Stephanie Auer, Gericke L Cook, Thomas C Edwards Jr, Gerald F Guala, Timothy G Howard, Jeffrey T Morisette, et al. Development and delivery of species distribution models to inform decision-making. BioScience, 69(7):544–557, 2019.
  • [89] Roy L Streit. The poisson point process. Springer, 2010.
  • [90] Takashi Takenouchi and Shinto Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation, 16:767–787, 2004.
  • [91] Takashi Takenouchi, Osamu Komori, and Shinto Eguchi. Extension of receiver operating characteristic curve and auc-optimal classification. Neural Computation, 24:2789–2824, 2012.
  • [92] Takashi Takenouchi, Shinto Eguchi, Noboru Murata, and Takafumi Kanamori. Robust boosting algorithm against mislabeling in multiclass problems. Neural Computation, 20:1596–1630, 2008.
  • [93] Marina Valdora and Víctor J Yohai. Robust estimators for generalized linear models. Journal of Statistical Planning and Inference, 146:31–48, 2014.
  • [94] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.
  • [95] Halbert White. Maximum likelihood estimation of misspecified models. Econometrica: Journal of the econometric society, pages 1–25, 1982.
  • [96] Christopher KI Williams. Prediction with gaussian processes: From linear regression to linear prediction and beyond. In Learning in graphical models, pages 599–621. Springer, 1998.
  • [97] Katherine L Yates, Phil J Bouchet, M Julian Caley, Kerrie Mengersen, Christophe F Randin, Stephen Parnell, Alan H Fielding, Andrew J Bamford, Stephen Ban, A Márcia Barbosa, et al. Outstanding challenges in the transferability of ecological models. Trends in ecology & evolution, 33(10):790–802, 2018.
  • [98] Jun Zhang. Divergence function, duality, and convex analysis. Neural computation, 16(1):159–195, 2004.
  • [99] Huimin Zhao, Jie Liu, Huayue Chen, Jie Chen, Yang Li, Junjie Xu, and Wu Deng. Intelligent diagnosis using continuous wavelet transform and gauss convolutional deep belief network. IEEE Transactions on Reliability, 72(2):692–702, 2022.