
Minimum Gamma Divergence for
Regression and Classification Problems

Shinto Eguchi

Preface

In an era where data drives decision-making across diverse fields, the need for robust and efficient statistical methods has never been greater. As a researcher deeply involved in the study of divergence measures, I have witnessed firsthand the transformative impact these tools can have on statistical inference and machine learning. This book aims to provide a comprehensive guide to the class of power divergences, with a particular focus on the $\gamma$-divergence, exploring their theoretical underpinnings, practical applications, and potential for enhancing robustness in statistical models and machine learning algorithms.

The inspiration for this book stems from the growing recognition that traditional statistical methods often fall short in the presence of model misspecification, outliers, and noisy data. Divergence measures, such as the $\gamma$-divergence, offer a promising alternative by providing robust estimation techniques that can withstand these challenges. This book seeks to bridge the gap between theoretical development and practical application, offering new insights and methodologies that can be readily applied in various scientific and engineering disciplines.

The book is structured into four main chapters. Chapter 1 introduces the foundational concepts of divergence measures, including the well-known Kullback-Leibler divergence and its limitations. It then presents a detailed exploration of power divergences, such as the $\alpha$-, $\beta$-, and $\gamma$-divergences, highlighting their unique properties and advantages. Chapter 2 develops minimum divergence methods for regression models, demonstrating how these methods can improve robustness and efficiency in statistical estimation. Chapter 3 extends these methods to Poisson point processes, with a focus on ecological applications, providing a robust framework for modeling species distributions and other spatial phenomena. Finally, Chapter 4 explores the use of divergence measures in machine learning, including applications in Boltzmann machines, AdaBoost, and active learning. The chapter emphasizes the practical benefits of these measures in enhancing model robustness and performance.

By providing a detailed examination of divergence measures, this book aims to offer a valuable resource for statisticians, machine learning practitioners, and researchers. It presents a unified perspective on the use of power divergences in various contexts, offering practical examples and empirical results to illustrate their effectiveness. The methodologies discussed in this book are designed to be both insightful and practical, enabling readers to apply these concepts in their work and research.

This book is the culmination of years of research and collaboration. I am grateful to my colleagues and students whose questions and feedback have shaped the content of this book. Special thanks to Hironori Fujisawa, Masayuki Henmi, Takashi Takenouchi, Osamu Komori, Kenichi Hatashi, Su-Yun Huang, Hung Hung, Shogo Kato, Yusuke Saigusa and Hideitsu Hino for their invaluable support and contributions.

I invite you to explore the rich landscape of divergence measures presented in this book. Whether you are a researcher, practitioner, or student, I hope you find the concepts and methods discussed here to be both insightful and practical. It is my sincere wish that this book will contribute to the advancement of robust statistical methods and inspire further research and innovation in the field.

Tokyo, 2024 Shinto Eguchi

Chapter 1 Power divergence

We present a mathematical framework for discussing the class of divergence measures, which are essential tools for quantifying the difference between two probability distributions. These measures find applications in various fields such as statistics, machine learning, and data science. We begin by discussing the well-known Kullback-Leibler (KL) divergence, highlighting its advantages and limitations. To address the shortcomings of the KL-divergence, we introduce three alternative types: the $\alpha$-, $\beta$-, and $\gamma$-divergences. We emphasize the importance of choosing the right `reference measure', especially for the $\beta$- and $\gamma$-divergences, as it significantly impacts the results.

1.1 Introduction

We provide a comprehensive study of divergence measures that are essential tools for quantifying the difference between two probability distributions. These measures find applications in various fields such as statistics, machine learning, and data science [1, 6, 19, 79, 20]. See also [98, 17, 58, 34, 72, 80, 50].

We present the $\alpha$-, $\beta$-, and $\gamma$-divergence measures, each characterized by distinctive properties and advantages. These measures are particularly well-suited for a variety of applications, offering tailored solutions to specific challenges in statistical inference and machine learning. We further explore the practical applications of these divergence measures, examining their implementation in statistical models such as generalized linear models and Poisson point processes. Special attention is given to selecting the appropriate `reference measure', which is crucial for the accuracy and effectiveness of these methods. The chapter concludes by identifying areas for future research, including the further exploration of reference measures. Overall, this chapter serves as a resource for understanding the mathematical and practical aspects of divergence measures.

In recent years, a number of studies have been conducted on the robustness of machine learning models using the $\gamma$-divergence, which was proposed in [33]. This book highlights that the $\gamma$-divergence can be defined even when the power exponent $\gamma$ is negative, provided certain integrability conditions are met [26]. Specifically, one key condition is that the probability distributions are defined on a set of finite discrete values. We demonstrate that the $\gamma$-divergence with $\gamma=-1$ is intimately connected to the inequality between the arithmetic mean and the geometric mean of the ratio of two probability mass functions, thus terming it the geometric-mean (GM) divergence. Likewise, we show that the $\gamma$-divergence with $\gamma=-2$ can be derived from the inequality between the arithmetic mean and the harmonic mean of the mass functions, leading to its designation as the harmonic-mean (HM) divergence.

1.2 Probabilistic framework

Let $Y$ be a random variable with a set $\mathcal{Y}$ of possible values in $\mathbb{R}$. We denote by $\Lambda$ a $\sigma$-finite measure, referred to as the reference measure. The reference measure $\Lambda$ is typically either the Lebesgue measure when $Y$ is a continuous random variable or the counting measure when $Y$ is a discrete one. Let us define $\mathcal{P}$ as the space of all probability measures $P$ that are mutually absolutely continuous with respect to $\Lambda$. The probability $P(B)$ for an event $B$ can be expressed as:

\displaystyle P(B)=\int_{B}p(y){\rm d}\Lambda(y),

where $p(y)=(\partial P/\partial\Lambda)(y)$ is referred to as the Radon-Nikodym (RN) derivative. Specifically, $p(y)$ is referred to as the probability density function (pdf) if $Y$ is a continuous random variable and the probability mass function (pmf) if $Y$ is a discrete one.

Definition 1.

Let $D(P,Q)$ denote a functional defined on ${\mathcal{P}}\times{\mathcal{P}}$. Then, we call $D(P,Q)$ a divergence measure if $D(P,Q)\geq 0$ for all $P$ and $Q$ of $\mathcal{P}$, with $D(P,Q)=0$ if and only if $P=Q$.

Consider two normal distributions ${\tt Nor}(\mu_{1},\sigma_{1}^{2})$ and ${\tt Nor}(\mu_{2},\sigma_{2}^{2})$. If both distributions have the same mean $\mu$ and variance $\sigma^{2}$, they are identical, and their divergence is zero. However, as the mean and variance of one distribution diverge from those of the other, the divergence measure increases, quantifying how one distribution differs from the other. Thus, a divergence measure quantifies how one probability distribution diverges from another. The key properties are non-negativity, asymmetry, and vanishing exactly when the two distributions are identical. The asymmetry of a divergence measure helps one to discuss model comparisons, variational inference, generative models, optimal control policies, and so on. Researchers have proposed various divergence measures in statistics and machine learning to compare two models or to measure the information loss when approximating a distribution. Such a measure is more appropriately termed `information divergence', although it is simply called `divergence' here for brevity. As a specific example, the Kullback-Leibler (KL) divergence is given by the following equation:

D_{0}(P,Q)=\int_{\mathcal{Y}}p(y)\log\frac{p(y)}{q(y)}{\rm d}\Lambda(y), (1.1)

where $p(y)=\partial P/\partial\Lambda(y)$ and $q(y)=\partial Q/\partial\Lambda(y)$. The KL-divergence is essentially independent of the choice of $\Lambda$ since it can be written without $\Lambda$ as

\displaystyle D_{0}(P,Q)=\int\log\frac{\partial P}{\partial Q}\,{\rm d}P.

This implies that any property of the KL-divergence $D_{0}(P,Q)$ can be regarded as an intrinsic property of the probability measures $P$ and $Q$, regardless of the RN-derivatives with respect to the reference measure. The definition (1.1) implicitly assumes integrability: the integral of $p$ against the logarithm of the density ratio must be finite. Such assumptions are generally acceptable in practical applications in statistics and machine learning. However, if we take a Cauchy distribution as $P$ and a normal distribution as $Q$, then $D_{0}(P,Q)$ is not finite. Thus, the KL-divergence is associated with unstable behavior, which gives rise to the non-robustness of the minimum KL-divergence method, or equivalently the maximum likelihood method. This aspect will be discussed in the following chapter.
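The instability caused by heavy tails can be made concrete numerically. The following sketch (illustrative code, not from the text; the function names are ours) evaluates the KL-divergence for finite discrete distributions and shows that the truncated integral defining $D_{0}(P,Q)$ for a Cauchy $P$ and a standard normal $Q$ keeps growing as the truncation range widens:

```python
import math

def kl_discrete(p, q):
    # D_0(P, Q) for two pmfs given on a common finite support
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

p, q = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]
assert kl_discrete(p, p) == 0.0                       # zero at identity
assert kl_discrete(p, q) > 0 and kl_discrete(q, p) > 0  # non-negativity
assert abs(kl_discrete(p, q) - kl_discrete(q, p)) > 1e-6  # asymmetry

def truncated_kl_cauchy_normal(t, n=100000):
    # midpoint-rule integration of p*log(p/q) over [-t, t] for a Cauchy p
    # and a standard normal q; the integrand behaves like 1/(2*pi) for
    # large |y|, so the value grows roughly linearly in t
    h = 2.0 * t / n
    total = 0.0
    for i in range(n):
        y = -t + (i + 0.5) * h
        pc = 1.0 / (math.pi * (1.0 + y * y))
        log_q = -0.5 * y * y - 0.5 * math.log(2.0 * math.pi)
        total += pc * (math.log(pc) - log_q) * h
    return total

assert truncated_kl_cauchy_normal(10) < truncated_kl_cauchy_normal(100)
```

The last assertion illustrates that the Cauchy-versus-normal KL integral diverges: enlarging the truncation window keeps increasing the value without bound.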

If we write the cross-entropy as

\displaystyle H_{0}(P,Q)=-\int p\log{q}\,{\rm d}\Lambda, (1.2)

then the KL-divergence is written as the difference:

\displaystyle D_{0}(P,Q)=H_{0}(P,Q)-H_{0}(P,P).

The KL-divergence is a divergence measure due to the convexity of the negative logarithmic function. In foundational statistics, the Neyman-Pearson lemma holds a pivotal role. This lemma posits that the likelihood ratio test (LRT) is the most powerful method for hypothesis testing when comparing a null hypothesis distribution $P$ against an alternative distribution $Q$. In this context, the KL-divergence $D_{0}(P,Q)$ can be interpreted as the expected value of the log-likelihood ratio under the null hypothesis distribution $P$. For a more in-depth discussion of the close relationship between $D_{0}(P,Q)$ and the Neyman-Pearson lemma, the reader is referred to [25].

In the context of machine learning, KL-divergence is often used in algorithms like variational autoencoders. Here, KL-divergence helps quantify how closely the learned distribution approximates the real data distribution. Lower KL-divergence values indicate better approximations, thus helping in the model’s optimization process.
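As an illustration of this usage (a sketch under the standard Gaussian VAE setting, not code from the text), the KL penalty used per latent coordinate has the well-known closed form $\frac{1}{2}(\mu^{2}+\sigma^{2}-1-\log\sigma^{2})$ for ${\tt Nor}(\mu,\sigma^{2})$ against the prior ${\tt Nor}(0,1)$, which can be checked against direct numerical integration of (1.1):

```python
import math

def kl_gauss_std(mu, sigma):
    # closed-form KL( Nor(mu, sigma^2) || Nor(0, 1) ), the per-coordinate
    # regularization term of a Gaussian variational autoencoder
    return 0.5 * (mu * mu + sigma * sigma - 1.0 - math.log(sigma * sigma))

def kl_gauss_numeric(mu, sigma, t=12.0, n=200000):
    # midpoint-rule integration of p*log(p/q) over [mu - t, mu + t]
    h = 2.0 * t / n
    total = 0.0
    for i in range(n):
        y = mu - t + (i + 0.5) * h
        log_p = (-0.5 * ((y - mu) / sigma) ** 2
                 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))
        log_q = -0.5 * y * y - 0.5 * math.log(2.0 * math.pi)
        total += math.exp(log_p) * (log_p - log_q) * h
    return total

assert kl_gauss_std(0.0, 1.0) == 0.0   # identical distributions
assert abs(kl_gauss_std(0.7, 1.3) - kl_gauss_numeric(0.7, 1.3)) < 1e-6
```

Lower values of this term indicate that the learned latent distribution is closer to the prior, which is how the KL-divergence enters the optimization.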

1.3 Power divergence measures

The KL-divergence is sometimes referred to as the log divergence due to its definition involving the logarithmic function. Alternatively, a specific class of power divergence measures can be derived from power functions characterized by the exponent parameters $\alpha$, $\beta$, and $\gamma$, as detailed below. Among the numerous ways to quantify the divergence or distance between two probability distributions, the power divergence measures occupy a unique and significant place. Originating from foundational concepts in information theory, these measures have been extended and adapted to address various challenges across statistics and machine learning. As we strive to make better decisions based on data, understanding the nuances between different divergence measures becomes crucial. This section introduces the power divergence measures through three key types: the $\alpha$, $\beta$, and $\gamma$ divergences; see [9] for a comprehensive review. Each of these offers distinct advantages and limitations, and serves as a building block for diverse applications ranging from robust parameter estimation to model selection and beyond.

(1) $\alpha$-divergence:

\displaystyle D_{\alpha}(P,Q;\Lambda)=\frac{1}{\alpha(1-\alpha)}{\int_{\mathcal{Y}}\left[1-\left(\frac{q(y)}{p(y)}\right)^{\alpha}\right]p(y){\rm d}\Lambda(y)},

where $\alpha$ belongs to $\mathbb{R}$, cf. [8, 1] for further details. Let us introduce

\displaystyle W_{\alpha}(R)=\frac{1}{\alpha(1-\alpha)}(1-R^{\alpha})-\frac{1}{1-\alpha}(1-R),

as a generator function for $R\geq 0$. Then the $\alpha$-divergence is written as

\displaystyle D_{\alpha}(P,Q;\Lambda)=\int_{\mathcal{Y}}W_{\alpha}\left(\frac{q(y)}{p(y)}\right)p(y){\rm d}\Lambda(y).

Note that $W_{\alpha}(R)$ is a convex function with $W_{\alpha}(1)=0$ and $W_{\alpha}^{\prime}(1)=0$, so that $W_{\alpha}(R)\geq 0$, with equality if and only if $R=1$. This implies $D_{\alpha}(P,Q;\Lambda)\geq 0$ with equality if and only if $P=Q$. This shows that $D_{\alpha}(P,Q;\Lambda)$ is a divergence measure. The log expression [8] is given by

\displaystyle\Delta_{\alpha}(P,Q;\Lambda)=\frac{1}{\alpha-1}\log{\int_{\mathcal{Y}}\left(\frac{q(y)}{p(y)}\right)^{\alpha}p(y){\rm d}\Lambda(y)}.

The $\alpha$-divergence is associated with the Pythagorean identity in the space $\mathcal{P}$. Assume that a triple of $P$, $Q$ and $R$ satisfies

\displaystyle D_{\alpha}(P,Q;\Lambda)+D_{\alpha}(Q,R;\Lambda)=D_{\alpha}(P,R;\Lambda).

This equation reflects a Pythagorean relation, wherein the triple $P,Q,R$ forms a right triangle if $D_{\alpha}(P,Q;\Lambda)$ is considered the squared Euclidean distance between $P$ and $Q$. We define two curves $\{P_{t}\}_{0\leq t\leq 1}$ and $\{R_{s}\}_{0\leq s\leq 1}$ in $\mathcal{P}$ such that the RN-derivatives of $P_{t}$ and $R_{s}$ are given by $p_{t}(y)=(1-t)p(y)+tq(y)$ and

\displaystyle r_{s}(y)=z_{s}\exp\{(1-s)\log r(y)+s\log q(y)\},

respectively, where $z_{s}$ is a normalizing constant. We then observe that the Pythagorean relation remains unchanged for the triple $P_{t},Q,R_{s}$, as illustrated by the following equation:

\displaystyle D_{\alpha}(P_{t},Q;\Lambda)+D_{\alpha}(Q,R_{s};\Lambda)=D_{\alpha}(P_{t},R_{s};\Lambda).

In accordance with this, the $\alpha$-divergence allows $\mathcal{P}$ to be treated as if it were a Euclidean space. This property plays a central role in the approach of information geometry, and it gives geometric insights for statistics and machine learning [73].

For example, consider a multinomial distribution ${\tt MN}(\pi,m)$ with a probability mass function (pmf):

\displaystyle p(y,\pi,m)=\binom{m}{y_{1}\cdots y_{k}}\pi_{1}^{y_{1}}\cdots\pi_{k}^{y_{k}} (1.3)

for $y=(y_{1},\dots,y_{k})\in{\mathcal{Y}}$ with ${\mathcal{Y}}=\{\,y\ |\ \sum_{j=1}^{k}y_{j}=m\,\}$, where the components $y_{j}$ are nonnegative integers and $\pi=(\pi_{j})_{j=1}^{k}$. The $\alpha$-divergence between multinomial distributions ${\tt MN}(\pi,m)$ and ${\tt MN}(\rho,m)$ can be expressed as follows:

\displaystyle D_{\alpha}({\tt MN}(\pi,m),{\tt MN}(\rho,m);C)=\frac{1}{\alpha(1-\alpha)}\Big{\{}1-\Big{(}\sum_{j=1}^{k}\pi_{j}^{1-\alpha}\rho_{j}^{\alpha}\Big{)}^{m}\Big{\}}, (1.4)

where $C$ is the counting measure.
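This closed form follows from the multinomial theorem. As a quick numerical check (illustrative code, not from the text; the coefficient $1/(\alpha(1-\alpha))$ is used throughout so that the divergence is nonnegative, and all function names are ours), direct summation of the defining integral over the multinomial support agrees with the closed form:

```python
import math
from itertools import product

def multinomial_pmf(y, prob):
    # pmf (1.3) of MN(prob, m) evaluated at a count vector y
    coef = math.factorial(sum(y))
    for yj in y:
        coef //= math.factorial(yj)
    out = float(coef)
    for yj, pj in zip(y, prob):
        out *= pj ** yj
    return out

def alpha_div_direct(pi, rho, m, alpha):
    # direct summation of the defining integral with the counting measure
    total = 0.0
    for y in product(range(m + 1), repeat=len(pi)):
        if sum(y) != m:
            continue
        p = multinomial_pmf(y, pi)
        q = multinomial_pmf(y, rho)
        total += (1.0 - (q / p) ** alpha) * p
    return total / (alpha * (1.0 - alpha))

def alpha_div_closed(pi, rho, m, alpha):
    # closed form obtained via the multinomial theorem
    s = sum(pj ** (1.0 - alpha) * rj ** alpha for pj, rj in zip(pi, rho))
    return (1.0 - s ** m) / (alpha * (1.0 - alpha))

pi, rho, alpha = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2], 0.7
for m in (1, 3):
    assert abs(alpha_div_direct(pi, rho, m, alpha)
               - alpha_div_closed(pi, rho, m, alpha)) < 1e-10
```

Note that conventions for the $\alpha$-divergence differ across the literature; the sketch simply keeps the same convention in both evaluations.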

The $\alpha$-divergence is independent of the choice of $\Lambda$ since

\displaystyle D_{\alpha}(P,Q;\Lambda)=\frac{1}{\alpha(1-\alpha)}{\int\left[1-\left(\frac{\partial Q}{\partial P}\right)^{\alpha}\right]{\rm d}P}.

This indicates that $D_{\alpha}(P,Q;\Lambda)$ is independent of the choice of $\Lambda$, as $D_{\alpha}(P,Q;\Lambda)=D_{\alpha}(P,Q;\tilde{\Lambda})$ for any $\tilde{\Lambda}$. Consequently, equation (1.4) is also independent of $C$. In general, any divergence of the Csiszár class is independent of the choice of the reference measure [26].

(2) $\beta$-divergence:

\displaystyle D_{\beta}(P,Q;\Lambda)=\frac{1}{\beta(\beta+1)}\int\{p(y)^{\beta+1}+{\beta}q(y)^{\beta+1}-(\beta+1)p(y)q(y)^{\beta}\}{\rm d}\Lambda(y), (1.5)

where $\beta$ belongs to $\mathbb{R}$. For more details, refer to [2, 65]. Let us consider a generator function defined as follows:

\displaystyle U_{\beta}(R)=\frac{1}{\beta(\beta+1)}R^{\beta+1}\ \text{ for }\ R\geq 0.

It follows from the convexity of $U_{\beta}(R)$ that $U_{\beta}(R_{1})-U_{\beta}(R_{0})\geq U_{\beta}^{\prime}(R_{0})(R_{1}-R_{0})$. This shows that $D_{\beta}(P,Q;\Lambda)$ is a divergence measure due to

\displaystyle D_{\beta}(P,Q;\Lambda)=\int\{U_{\beta}(p)-U_{\beta}(q)-U_{\beta}^{\prime}(q)(p-q)\}{\rm d}\Lambda.
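For pmfs on a finite support with the counting measure as $\Lambda$, the defining integral (1.5) is a finite sum, and the nonnegativity implied by the convexity argument can be checked directly (illustrative code, not from the text; names are ours):

```python
def beta_div(p, q, beta):
    # D_beta(P, Q) of (1.5) for pmfs on a common finite support
    # (counting measure as the reference measure)
    return sum(pj ** (beta + 1) + beta * qj ** (beta + 1)
               - (beta + 1) * pj * qj ** beta
               for pj, qj in zip(p, q)) / (beta * (beta + 1))

p, q = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]
for beta in (-0.5, 0.5, 1.0, 2.0):   # beta = 0 and beta = -1 are excluded
    assert beta_div(p, q, beta) > 0           # positive since p != q
    assert abs(beta_div(p, p, beta)) < 1e-12  # zero at identity
```

For $\beta=1$ the sum collapses to $\frac{1}{2}\sum_{j}(p_{j}-q_{j})^{2}$, half the squared Euclidean distance between the two pmfs.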

We also observe the property of preserving the Pythagorean relation for the $\beta$-divergence. When $P$, $Q$ and $R$ form a right triangle by the $\beta$-divergence, the right triangle is preserved for the triple $P_{t}$, $Q$ and $R_{s}$.

It is worth noting that the $\beta$-divergence depends on the choice of reference measure $\Lambda$. For instance, if we choose $\tilde{\Lambda}$ as the reference measure, then the $\beta$-divergence is given by:

\displaystyle D_{\beta}(P,Q;\tilde{\Lambda})=\frac{1}{\beta(\beta+1)}\int\{\tilde{p}^{\beta+1}+\beta\tilde{q}^{\beta+1}-(\beta+1)\tilde{p}\tilde{q}^{\beta}\}{\rm d}\tilde{\Lambda}.

Here, $\tilde{p}=\partial P/\partial\tilde{\Lambda}$ and $\tilde{q}=\partial Q/\partial\tilde{\Lambda}$. This can be rewritten as

\displaystyle\frac{1}{\beta(\beta+1)}\left\{\int(\tilde{\lambda}p)^{\beta}{\rm d}P+\beta\int(\tilde{\lambda}q)^{\beta}{\rm d}Q-(\beta+1)\int(\tilde{\lambda}q)^{\beta}{\rm d}P\right\}, (1.6)

where $\tilde{\lambda}={\partial\Lambda}/{\partial\tilde{\Lambda}}$. Hence, the integrands of $D_{\beta}(P,Q;\tilde{\Lambda})$ are given by the integrands of $D_{\beta}(P,Q;\Lambda)$ multiplied by $\tilde{\lambda}^{\beta}$. The choice of the reference measure thus has a substantial effect on the value of the $\beta$-divergence.

We again consider a multinomial distribution ${\tt MN}(\pi,m)$ defined in (1.3). Unfortunately, the $\beta$-divergence $D_{\beta}({\tt MN}(\pi,m),{\tt MN}(\rho,m);C)$ with the counting measure $C$ would have an intractable expression. Therefore, we select $\tilde{\Lambda}$ in a way that the Radon-Nikodym (RN) derivative is defined as

\displaystyle\frac{\partial\tilde{\Lambda}}{\partial C}(y)=\binom{m}{y_{1}\cdots y_{k}} (1.7)

as a reference measure. Accordingly, $\tilde{p}(y,\pi,m)=\pi_{1}^{y_{1}}\cdots\pi_{k}^{y_{k}}$, and hence

\displaystyle\int\tilde{p}(y,\pi,m)^{\beta}{\rm d}P(y)=\sum_{y\in{\mathcal{Y}}}\binom{m}{y_{1}\cdots y_{k}}\left\{\pi_{1}^{y_{1}}\cdots\pi_{k}^{y_{k}}\right\}^{\beta+1},

which is equal to $\big{\{}\sum_{j=1}^{k}\pi_{j}^{\beta+1}\big{\}}^{m}$ by the multinomial theorem. Using this approach, a closed-form expression for the $\beta$-divergence can be derived:

\displaystyle D_{\beta}({\tt MN}(\pi,m),{\tt MN}(\rho,m);\tilde{\Lambda})=\frac{\big{\{}\sum_{j=1}^{k}\pi_{j}^{\beta+1}\big{\}}^{m}}{\beta(\beta+1)}+\frac{\big{\{}\sum_{j=1}^{k}\rho_{j}^{\beta+1}\big{\}}^{m}}{\beta+1}-\frac{\big{\{}\sum_{j=1}^{k}\pi_{j}\rho_{j}^{\beta}\big{\}}^{m}}{\beta} (1.8)

due to (1.6). In this way, the expression (1.8) has a tractable form, which reduces to the standard form of the $\beta$-divergence when $m=1$. Subsequent discussions will explore the choice of reference measure that provides the most accurate inference within statistical models, such as the generalized linear model and the model of inhomogeneous Poisson point processes.
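The reduction can also be verified numerically (an illustrative sketch, not from the text; names are ours): summing the integrand of (1.5) over the multinomial support, with each support point weighted by its multinomial coefficient so that $\tilde{p}(y)=\pi_{1}^{y_{1}}\cdots\pi_{k}^{y_{k}}$, reproduces the closed form (1.8):

```python
import math
from itertools import product

def beta_div_multinomial(pi, rho, m, beta):
    # direct evaluation of (1.5) under a reference measure that assigns
    # the multinomial coefficient as mass to each support point, so that
    # ptilde(y) = prod_j pi_j^{y_j} and qtilde(y) = prod_j rho_j^{y_j}
    total = 0.0
    for y in product(range(m + 1), repeat=len(pi)):
        if sum(y) != m:
            continue
        coef = math.factorial(m)
        for yj in y:
            coef //= math.factorial(yj)
        pt = math.prod(pj ** yj for pj, yj in zip(pi, y))
        qt = math.prod(rj ** yj for rj, yj in zip(rho, y))
        total += coef * (pt ** (beta + 1) + beta * qt ** (beta + 1)
                         - (beta + 1) * pt * qt ** beta)
    return total / (beta * (beta + 1))

def beta_div_closed(pi, rho, m, beta):
    # closed form (1.8)
    a = sum(pj ** (beta + 1) for pj in pi) ** m
    b = sum(rj ** (beta + 1) for rj in rho) ** m
    c = sum(pj * rj ** beta for pj, rj in zip(pi, rho)) ** m
    return a / (beta * (beta + 1)) + b / (beta + 1) - c / beta

pi, rho, beta = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2], 0.5
for m in (1, 2, 3):
    assert abs(beta_div_multinomial(pi, rho, m, beta)
               - beta_div_closed(pi, rho, m, beta)) < 1e-10
```

Setting $m=1$ recovers the standard $\beta$-divergence between the cell-probability vectors themselves.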

(3) $\gamma$-divergence [33]:

\displaystyle D_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\frac{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}{\rm d}\Lambda(y)}{\big{(}\int_{\mathcal{Y}}q(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{\gamma}{\gamma+1}}}+\frac{1}{\gamma}\Big{(}\int_{\mathcal{Y}}p(y)^{\gamma+1}{\rm d}\Lambda(y)\Big{)}^{\frac{1}{\gamma+1}}. (1.9)

If we define the $\gamma$-cross entropy as

\displaystyle H_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\frac{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}{\rm d}\Lambda(y)}{\big{(}\int_{\mathcal{Y}}q(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{\gamma}{\gamma+1}}},

then the $\gamma$-divergence is written as the difference:

\displaystyle D_{\gamma}(P,Q;\Lambda)=H_{\gamma}(P,Q;\Lambda)-H_{\gamma}(P,P;\Lambda).

It is noteworthy that the $\gamma$-cross entropy is a convex-linear functional with respect to the first argument:

\displaystyle\sum_{j=1}^{k}w_{j}H_{\gamma}(P_{j},Q;\Lambda)=H_{\gamma}(\bar{P},Q;\Lambda),

where the $w_{j}$'s are positive weights with $\sum_{j=1}^{k}w_{j}=1$ and $\bar{P}=\sum_{j=1}^{k}w_{j}P_{j}$. This property gives explicitly the empirical expression for the $\gamma$-entropy given a data set $\{Y_{i}\}_{1\leq i\leq n}$. Consider the empirical distribution $(1/n)\sum_{i=1}^{n}{1}_{Y_{i}}$ as $\bar{P}$, where $1_{Y_{i}}$ is the Dirac measure at the atom $Y_{i}$. Then

\displaystyle H_{\gamma}(\bar{P},Q;\Lambda)=-\frac{1}{\gamma}\frac{1}{n}\sum_{i=1}^{n}\frac{q(Y_{i})^{\gamma}}{\big{(}\int_{\mathcal{Y}}q(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{\gamma}{\gamma+1}}}.

If we assume that $\{Y_{i}\}_{1\leq i\leq n}$ is a sequence of independently and identically distributed observations from $P$, then

\mathbb{E}[H_{\gamma}(\bar{P},Q;\Lambda)]=H_{\gamma}(P,Q;\Lambda),

and hence $H_{\gamma}(\bar{P},Q;\Lambda)$ converges almost surely to $H_{\gamma}(P,Q;\Lambda)$ by the strong law of large numbers. Subsequently, this will be used to define the empirical loss based on the dataset. Needless to say, the empirical expression of the cross entropy $H_{0}(\bar{P},Q;\Lambda)$ in (1.2) is the negative log-likelihood. The $\gamma$-diagonal entropy is proportional to the Lebesgue norm with exponent $\lambda=\gamma+1$ as

\displaystyle H_{\gamma}(P,P;\Lambda)=-\frac{1}{\gamma}{\Big{(}\int_{\mathcal{Y}}p(y)^{\lambda}{\rm d}\Lambda(y)\Big{)}^{\frac{1}{\lambda}}}.
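The empirical expression of the $\gamma$-cross entropy and its convergence can be illustrated by simulation on a finite discrete $\mathcal{Y}$ with the counting measure (an illustrative sketch, not from the text; names are ours):

```python
import random

def gamma_cross_entropy(p, q, gamma):
    # population H_gamma(P, Q) for pmfs on a finite support
    norm = sum(qj ** (gamma + 1) for qj in q) ** (gamma / (gamma + 1))
    return -sum(pj * qj ** gamma for pj, qj in zip(p, q)) / (gamma * norm)

def empirical_gamma_cross_entropy(sample, q, gamma):
    # H_gamma(Pbar, Q): a normalized sample average of q(Y_i)^gamma
    norm = sum(qj ** (gamma + 1) for qj in q) ** (gamma / (gamma + 1))
    return -sum(q[y] ** gamma for y in sample) / (gamma * norm * len(sample))

random.seed(0)
p, q, gamma = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3], 0.5
sample = random.choices(range(3), weights=p, k=100000)
pop = gamma_cross_entropy(p, q, gamma)
emp = empirical_gamma_cross_entropy(sample, q, gamma)
# the empirical value is unbiased for, and converges to, the population value
assert abs(pop - emp) < 0.01
```

The empirical quantity is exactly the convex-linear combination of the per-observation cross entropies, which is why it is available in closed form from the data alone.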

Considering the conjugate exponent $\lambda^{*}=\lambda/(\lambda-1)$, the Hölder inequality for $p$ and $q^{\gamma}$ states

\displaystyle{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}{\rm d}\Lambda(y)}\leq{\Big{\{}\int_{\mathcal{Y}}p(y)^{\lambda}{\rm d}\Lambda(y)\Big{\}}^{\frac{1}{\lambda}}\Big{\{}\int_{\mathcal{Y}}\{q(y)^{\gamma}\}^{\lambda^{*}}{\rm d}\Lambda(y)\Big{\}}^{\frac{1}{\lambda^{*}}}}.

This holds for any $\lambda>1$ with equality if and only if $p=q$. This implies that the $\gamma$-divergence satisfies the definition of a divergence measure for any $\gamma>0$. It should be noted that the Hölder inequality is applied not to the pair $p$ and $q$ but to the pair $p$ and $q^{\gamma}$, which yields the property of vanishing exactly at identity required of a divergence measure. Also, the $\gamma$-divergence approaches the KL-divergence in the limit:

\displaystyle\lim_{\gamma\rightarrow 0}D_{\gamma}(P,Q;\Lambda)=D_{0}(P,Q;\Lambda).

This is because $\lim_{\gamma\rightarrow 0}(s^{\gamma}-1)/\gamma=\log s$ for all $s>0$; the same limiting argument applies to the $\alpha$ and $\beta$ divergence measures.
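The limit can be observed numerically on a finite discrete space (illustrative code, not from the text; names are ours):

```python
import math

def gamma_div(p, q, g):
    # gamma-divergence (1.9) for pmfs on a finite support (counting measure)
    num = sum(pj * qj ** g for pj, qj in zip(p, q))
    den = sum(qj ** (g + 1) for qj in q) ** (g / (g + 1))
    diag = sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1))
    return -num / (g * den) + diag / g

def kl_div(p, q):
    # KL-divergence (1.1) on the same support
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

p, q = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]
kl = kl_div(p, q)
# the gap to the KL-divergence shrinks as gamma tends to 0
assert abs(gamma_div(p, q, 0.001) - kl) < 1e-3
assert abs(gamma_div(p, q, 0.001) - kl) < abs(gamma_div(p, q, 1.0) - kl)
```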

We observe a close relationship between the $\beta$ and $\gamma$ divergence measures. Consider a minimization problem for the $\beta$-divergence over the scale of the second argument: $\min_{\sigma>0}D_{\beta}(P,\sigma Q;\Lambda)$. By definition, if we write $D_{\beta}^{*}(P,Q;\Lambda)=\min_{\sigma>0}D_{\beta}(P,\sigma Q;\Lambda)$, then $D_{\beta}^{*}(P,\sigma Q;\Lambda)=D_{\beta}^{*}(P,Q;\Lambda)$ for all $\sigma>0$. The minimizing $\sigma$ is given by

\displaystyle\sigma_{\rm opt}=\frac{\int p(y)q(y)^{\beta}{\rm d}\Lambda(y)}{\int q(y)^{\beta+1}{\rm d}\Lambda(y)}.

If we identify $\beta$ with $\gamma$, then the close relationship is found in

\displaystyle D_{\beta}(P,\sigma_{\rm opt}Q;\Lambda)=\frac{\{-\gamma H_{\gamma}(P,P;\Lambda)\}^{\beta+1}-\{-\gamma H_{\gamma}(P,Q;\Lambda)\}^{\beta+1}}{\beta(\beta+1)}.

In accordance with this, the $\gamma$-divergence can be viewed as the $\beta$-divergence interpreted in a projective geometry [27]. Similarly, consider a dual problem: $\min_{\sigma>0}D_{\beta}(\sigma P,Q;\Lambda)$. Then, the minimizing $\sigma$ is given by

\displaystyle\sigma_{\rm opt}^{*}=\Big{\{}\frac{\int p(y)q(y)^{\beta}{\rm d}\Lambda(y)}{\int p(y)^{\beta+1}{\rm d}\Lambda(y)}\Big{\}}^{\frac{1}{\beta}}.

Hence, the scale-adjusted divergence is given by

\displaystyle D_{\beta}(\sigma^{*}_{\rm opt}P,Q;\Lambda)=\frac{\{-\gamma H_{\gamma}^{*}(Q,Q;\Lambda)\}^{\frac{\beta+1}{\beta}}-\{-\gamma H_{\gamma}^{*}(P,Q;\Lambda)\}^{\frac{\beta+1}{\beta}}}{\beta+1} (1.10)

Thus, we get a dualistic version of the $\gamma$-divergence as

\displaystyle D^{*}_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\frac{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}{\rm d}\Lambda(y)}{\big{(}\int_{\mathcal{Y}}p(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{1}{\gamma+1}}}+\frac{1}{\gamma}\Big{(}\int_{\mathcal{Y}}q(y)^{\gamma+1}{\rm d}\Lambda(y)\Big{)}^{\frac{\gamma}{\gamma+1}}. (1.11)

We refer to $D^{*}_{\gamma}(P,Q;\Lambda)$ as the dual $\gamma$-divergence. If we define the dual $\gamma$-entropy as

\displaystyle H^{*}_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\frac{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}{\rm d}\Lambda(y)}{\big{(}\int_{\mathcal{Y}}p(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{1}{\gamma+1}}},

then $D^{*}_{\gamma}(P,Q;\Lambda)=H^{*}_{\gamma}(P,Q;\Lambda)-H^{*}_{\gamma}(Q,Q;\Lambda)$. In effect, the $\gamma$-divergence $D_{\gamma}$ and its dual $D_{\gamma}^{*}$ are connected as follows.

\displaystyle D^{*}_{\gamma}(P,Q;\Lambda)=\frac{\big{(}\int q^{\gamma+1}{\rm d}\Lambda\big{)}^{\frac{\gamma}{\gamma+1}}}{\big{(}\int p^{\gamma+1}{\rm d}\Lambda\big{)}^{\frac{1}{\gamma+1}}}D_{\gamma}(P,Q;\Lambda).
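This connecting identity is easy to verify numerically for pmfs on a finite support (an illustrative sketch, not from the text; names are ours):

```python
def gamma_div(p, q, g):
    # gamma-divergence (1.9) on a finite support (counting measure)
    num = sum(pj * qj ** g for pj, qj in zip(p, q))
    den = sum(qj ** (g + 1) for qj in q) ** (g / (g + 1))
    return -num / (g * den) + sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1)) / g

def gamma_div_dual(p, q, g):
    # dual gamma-divergence (1.11) on the same support
    num = sum(pj * qj ** g for pj, qj in zip(p, q))
    den = sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1))
    return -num / (g * den) + sum(qj ** (g + 1) for qj in q) ** (g / (g + 1)) / g

p, q, g = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3], 0.8
ratio = (sum(qj ** (g + 1) for qj in q) ** (g / (g + 1))
         / sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1)))
# the dual divergence equals the gamma-divergence times the connecting ratio
assert abs(gamma_div_dual(p, q, g) - ratio * gamma_div(p, q, g)) < 1e-12
```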

1.4 The $\gamma$-divergence and its dual

In the evolving landscape of statistical divergence measures, a lesser-explored but highly potent member of the family is the $\gamma$-divergence. This divergence serves as an interesting alternative to the more commonly used $\alpha$ and $\beta$ divergences, with unique properties and advantages that make it particularly suited for certain classes of problems. The dual $\gamma$-divergence offers further flexibility, allowing for nuanced analysis from different perspectives. The following section is dedicated to a deep dive into the mathematical formulations and properties of these divergences, shedding light on their invariance characteristics, relationships to other divergences, and potential applications. Notably, we shall establish that the $\gamma$-divergence is well-defined even for negative values of the exponent $\gamma$, and examine its special cases which connect to the geometric-mean and harmonic-mean divergences. This comprehensive treatment aims to illuminate the role that the $\gamma$-divergence and its dual can play in advancing both theoretical and applied aspects of statistical inference and machine learning [28].

Let us focus on the $\gamma$-divergence among the power divergence measures. We define a power-transformed function as follows:

\displaystyle q^{(\gamma)}(y)=\frac{q(y)^{\gamma+1}}{\int_{\mathcal{Y}}q(\tilde{y})^{\gamma+1}{\rm d}\Lambda(\tilde{y})}, (1.12)

which we refer to as the $\gamma$-expression of $q(y)$, where $Q\in{\mathcal{P}}$. Thus, the measure having the RN-derivative $q^{(\gamma)}$ belongs to $\mathcal{P}$ since $\int q^{(\gamma)}(y){\rm d}\Lambda(y)=1$. We can write

\displaystyle H_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\int_{\mathcal{Y}}p(y)\{q^{(\gamma)}(y)\}^{\frac{\gamma}{\gamma+1}}{\rm d}\Lambda(y),

and

\displaystyle H_{\gamma}^{*}(P,Q;\Lambda)=-\frac{1}{\gamma}\int_{\mathcal{Y}}\{p^{(\gamma)}(y)\}^{\frac{1}{\gamma+1}}q(y)^{\gamma}{\rm d}\Lambda(y).

These equations directly yield an observation: $D_{\gamma}$ and $D_{\gamma}^{*}$ are scale-invariant, but only with respect to one of the two arguments:

\displaystyle D_{\gamma}(P,\sigma Q;\Lambda)=D_{\gamma}(P,Q;\Lambda)\text{ and }D_{\gamma}^{*}(\sigma P,Q;\Lambda)=D_{\gamma}^{*}(P,Q;\Lambda); (1.13)

while

\displaystyle D_{\gamma}(\sigma P,Q;\Lambda)=\sigma D_{\gamma}(P,Q;\Lambda)\quad\text{ and }\quad D^{*}_{\gamma}(P,\sigma Q;\Lambda)=\sigma^{\gamma}D^{*}_{\gamma}(P,Q;\Lambda).
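Since the expressions (1.9) and (1.11) make sense for possibly unnormalized nonnegative functions, the scale relations can be checked directly (illustrative code, not from the text; names are ours):

```python
def gamma_div(p, q, g):
    # (1.9) evaluated for nonnegative weight sequences, not necessarily normalized
    num = sum(pj * qj ** g for pj, qj in zip(p, q))
    den = sum(qj ** (g + 1) for qj in q) ** (g / (g + 1))
    return -num / (g * den) + sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1)) / g

def gamma_div_dual(p, q, g):
    # (1.11) evaluated likewise
    num = sum(pj * qj ** g for pj, qj in zip(p, q))
    den = sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1))
    return -num / (g * den) + sum(qj ** (g + 1) for qj in q) ** (g / (g + 1)) / g

p, q, g, s = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3], 0.8, 2.5
sp = [s * x for x in p]
sq = [s * x for x in q]
# scale-invariance (1.13) in one argument ...
assert abs(gamma_div(p, sq, g) - gamma_div(p, q, g)) < 1e-10
assert abs(gamma_div_dual(sp, q, g) - gamma_div_dual(p, q, g)) < 1e-10
# ... and homogeneous scaling in the other
assert abs(gamma_div(sp, q, g) - s * gamma_div(p, q, g)) < 1e-10
assert abs(gamma_div_dual(p, sq, g) - s ** g * gamma_div_dual(p, q, g)) < 1e-10
```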

The power exponent $\gamma$ is usually assumed to be positive. However, we extend $\gamma$ to be any real number in this discussion; see [23] for a brief discussion.

Proposition 1.

$D_{\gamma}(P,Q;\Lambda)$ and $D^{*}_{\gamma}(P,Q;\Lambda)$ defined in (1.9) and (1.11), respectively, are both divergence measures in the sense of Definition 1 for any real number $\gamma$.

Proof.

We introduce two generator functions defined as:

\displaystyle V_{\gamma}(R)=-\frac{1}{\gamma}R^{\frac{\gamma}{1+\gamma}}\quad\text{and}\quad V_{\gamma}^{*}(R)=-\frac{1}{\gamma}R^{\frac{1}{1+\gamma}} (1.14)

for $R>0$. By definition, the $\gamma$-divergence can be expressed as:

\displaystyle D_{\gamma}(P,Q;\Lambda)=\int_{{\mathcal{Y}}}p(y)\{V_{\gamma}(q^{(\gamma)}(y))-V_{\gamma}(p^{(\gamma)}(y))\}{\rm d}\Lambda(y). (1.15)

Due to the convexity of $V_{\gamma}(R)$ in $R$ for any $\gamma\in\mathbb{R}$, we have

\displaystyle D_{\gamma}(P,Q;\Lambda)\geq\int_{{\mathcal{Y}}}p(y)V_{\gamma}^{\prime}(p^{(\gamma)}(y))\{q^{(\gamma)}(y)-p^{(\gamma)}(y)\}{\rm d}\Lambda(y) (1.16)

with equality if and only if $P=Q$. The right-hand side of (1.16) can be rewritten as:

\displaystyle-\frac{1}{1+\gamma}\Big{(}\int_{{\mathcal{Y}}}p^{\gamma+1}{\rm d}\Lambda\Big{)}^{\frac{1}{\gamma+1}}\int_{{\mathcal{Y}}}(q^{(\gamma)}-p^{(\gamma)}){\rm d}\Lambda.

The last integral vanishes identically since $p^{(\gamma)}$ and $q^{(\gamma)}$ both have total mass one. Similarly, we observe for any real number $\gamma$ that

\displaystyle D_{\gamma}^{*}(P,Q;\Lambda)=\int_{{\mathcal{Y}}}\{V^{*}_{\gamma}(p^{(\gamma)}(y))-V^{*}_{\gamma}(q^{(\gamma)}(y))\}q(y)^{\gamma}{\rm d}\Lambda(y), (1.17)

which is nonnegative, with equality if and only if $P=Q$, due to the convexity of $V_{\gamma}^{*}(R)$. Therefore, $D_{\gamma}(P,Q;\Lambda)$ and $D^{*}_{\gamma}(P,Q;\Lambda)$ are both divergence measures for any real number $\gamma$. ∎
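Proposition 1 can be illustrated numerically for pmfs on a finite discrete space (an illustrative sketch, not from the text; note that $\gamma=-1$ is skipped here because the exponents in the closed form (1.9) degenerate at $\gamma+1=0$, the GM divergence arising as a limiting case):

```python
def gamma_div(p, q, g):
    # gamma-divergence (1.9) for pmfs on a finite support; g may be negative
    num = sum(pj * qj ** g for pj, qj in zip(p, q))
    den = sum(qj ** (g + 1) for qj in q) ** (g / (g + 1))
    diag = sum(pj ** (g + 1) for pj in p) ** (1 / (g + 1))
    return -num / (g * den) + diag / g

p, q = [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]
for g in (-0.5, -2.0, -3.0):   # negative exponents, gamma = -1 excluded
    assert gamma_div(p, q, g) > 0           # positive since p != q
    assert abs(gamma_div(p, p, g)) < 1e-12  # zero at identity
```

The case $g=-2.0$ corresponds to the harmonic-mean (HM) divergence discussed in the introduction.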

We will discuss Dγ(P,Q;Λ)D_{\gamma}(P,Q;\Lambda) with a negative power exponent γ\gamma in the context of statistical inference. The γ\gamma-divergence (1.9) is implicitly assumed to be integrable, as is the KL-divergence; the integrability condition matters particularly for the γ\gamma-divergence with γ<0\gamma<0. Let us look into the case of a multinomial distribution 𝙼𝙽(π,m){\tt MN}(\pi,m) defined in (1.3) with the reference measure Λ~\tilde{\Lambda} given by (1.7). An argument similar to that for the β\beta-divergence yields

Dγ(𝙼𝙽(π,m),𝙼𝙽(ρ,m);Λ~)=mγj=1kπjρjγ{j=1kρj}γ+1γγ+1+mγ{j=1kπj}γ+11γ+1.\displaystyle D_{\gamma}({\tt MN}(\pi,m),{\tt MN}(\rho,m);\tilde{\Lambda})=-\frac{m}{\gamma}\frac{\sum_{j=1}^{k}\pi_{j}\rho_{j}{}^{\gamma}}{\big{\{}\sum_{j=1}^{k}\rho_{j}{}^{\gamma+1}\big{\}}^{\frac{\gamma}{\gamma+1}}}+\frac{m}{\gamma}\Big{\{}\sum_{j=1}^{k}\pi_{j}{}^{\gamma+1}\Big{\}}^{\frac{1}{\gamma+1}}.

Here, π\pi and ρ\rho are cell probability vectors of dimension kk; compare the log expression in (1.21). The γ\gamma-divergence with the counting measure would also have no closed-form expression. Therefore, careful consideration is needed when choosing the reference measure Λ\Lambda for the γ\gamma-divergence. Let π(y)\pi(y) be the RN-derivative of Λ\Lambda with respect to the Lebesgue measure LL. Then, the cross entropy term of the γ\gamma-divergence (1.9) is written as

Hγ(P,Q;Λ)=1γ𝒴p(y)q(y)γπ(y)𝑑L(y)(𝒴q(y)γ+1π(y)𝑑L(y))γγ+1,\displaystyle H_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\frac{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}\pi(y)dL(y)}{\ \ \big{(}\int_{\mathcal{Y}}q(y)^{\gamma+1}\pi(y)dL(y)\big{)}^{\frac{\gamma}{\gamma+1}}},

where π=Λ/L\pi=\partial\Lambda/\partial L. Our key objective is to identify a π\pi that ensures stable and smooth behavior under a given statistical model and dataset.
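The multinomial closed form displayed above is straightforward to evaluate. Below is a minimal sketch (our own illustration; NumPy and the function name are assumptions) that computes it and confirms the divergence properties.

```python
import numpy as np

def gamma_div_multinomial(pi, rho, m, gamma):
    # closed form of D_gamma(MN(pi, m), MN(rho, m); tilde-Lambda) displayed above
    cross = np.sum(pi * rho**gamma) / np.sum(rho**(gamma + 1))**(gamma / (gamma + 1))
    diag = np.sum(pi**(gamma + 1))**(1.0 / (gamma + 1))
    return (m / gamma) * (diag - cross)

pi = np.array([0.2, 0.3, 0.5])
rho = np.array([0.25, 0.25, 0.5])
for g in (-0.5, 0.5, 1.0):
    assert gamma_div_multinomial(pi, rho, 5, g) >= 0.0         # divergence property
    assert abs(gamma_div_multinomial(pi, pi, 5, g)) < 1e-12    # zero on the diagonal
```

Note that the expression is simply mm times the γ\gamma-divergence between the cell probability vectors under the counting measure, reflecting the choice of Λ~\tilde{\Lambda}.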

We discuss a generalization of the γ\gamma-divergence. Let VV be a convex function. Then, the VV-divergence is defined by

DV(P,Q;Λ)=𝒴p(y){V(v(zqq(y)))V(v(zpp(y)))}dΛ(y),D_{V}(P,Q;\Lambda)=\int_{\mathcal{Y}}p(y)\{V(v^{*}(z_{q}q(y)))-V(v^{*}(z_{p}p(y)))\}{\rm d}\Lambda(y), (1.18)

where zqz_{q} is a normalizing constant satisfying v(zqq(y))dΛ(y)=1\int v^{*}(z_{q}q(y)){\rm d}\Lambda(y)=1 and the function v(t)v^{*}(t) satisfies

V(v(t))=1t.V^{\prime}(v^{*}(t))=\frac{1}{t}. (1.19)

From the convexity of VV, it follows that

DV(P,Q;Λ)𝒴p(y)V(v(zpp(y))){v(zqq(y))v(zpp(y))}dΛ(y)\displaystyle D_{V}(P,Q;\Lambda)\geq\int_{\mathcal{Y}}p(y)V^{\prime}(v^{*}(z_{p}p(y)))\{v^{*}(z_{q}q(y))-v^{*}(z_{p}p(y))\}{\rm d}\Lambda(y)

which is equal to

𝒴{v(zqq(y))v(zpp(y))}dΛ(y)\displaystyle\int_{\mathcal{Y}}\{v^{*}(z_{q}q(y))-v^{*}(z_{p}p(y))\}{\rm d}\Lambda(y) (1.20)

up to the proportionality factor 1/zp1/z_{p}, due to (1.19). By the definition of the normalizing constants, the integral in (1.20) vanishes. Hence, DV(P,Q;Λ)D_{V}(P,Q;\Lambda) becomes a divergence measure. Specifically, DV(P,σQ;Λ)=DV(P,Q;Λ)D_{V}(P,\sigma Q;\Lambda)=D_{V}(P,Q;\Lambda) for σ>0\sigma>0 due to the normalizing constant zqz_{q}. For example, if V(R)=(1/γ)Rγγ+1V(R)=-(1/\gamma)R^{\frac{\gamma}{\gamma+1}} as in (1.14), then DV(P,Q;Λ)D_{V}(P,Q;\Lambda) reduces to the γ\gamma-divergence. There are various examples of VV other than (1.14), for example,

V(R)=2log(R+11)+R+1,\displaystyle V(R)=2\log(\sqrt{R+1}-1)+\sqrt{R+1},

which is related to the κ\kappa-entropy discussed in a physical context [52, 78]. We do not go further into this topic as it is beyond the scope of this book.

We investigate a notable property of the dual γ\gamma divergence. There exists a strong relationship between the generalized mean of probability measures and the minimization of the average dual γ\gamma divergence. Subsequently, we will explore its applications in active learning.

Proposition 2.

Consider an average of kk dual γ\gamma-divergence measures as

A(P)=j=1kwjDγ(P,Qj;Λ).\displaystyle A(P)=\sum_{j=1}^{k}w_{j}D_{\gamma}^{*}(P,Q_{j};\Lambda).

Let Popt=argminP𝒫A(P)P^{\rm opt}=\mathop{\rm argmin}_{P\in{\mathcal{P}}}A(P). Then, the Radon-Nikodym (RN) derivative of PoptP^{\rm opt} is uniquely determined as follows:

PoptΛ(y)=zw{j=1kwjqj(y)γ}1γ\displaystyle\frac{\partial P^{\rm opt}}{\partial\Lambda}(y)=z_{w}\Big{\{}\sum_{j=1}^{k}w_{j}q_{j}(y)^{\gamma}\Big{\}}^{\frac{1}{\gamma}}

where qj=Qj/Λq_{j}={\partial Q_{j}}/{\partial\Lambda} and zwz_{w} is the normalizing constant.

Proof.

If we write Popt/Λ\partial P^{\rm opt}/{\partial\Lambda} by poptp^{\rm opt}, then

A(P)A(Popt)=1γj=1kwj{𝒴p(y)qj(y)γdΛ(y)(𝒴p(y)γ+1dΛ(y))1γ+1𝒴popt(y)qj(y)γdΛ(y)(𝒴{popt(y)}γ+1dΛ(y))1γ+1}\displaystyle A(P)-A(P^{\rm opt})=-\frac{1}{\gamma}\sum_{j=1}^{k}w_{j}\Big{\{}\frac{\int_{\mathcal{Y}}p(y)q_{j}(y)^{\gamma}{\rm d}\Lambda(y)}{(\int_{\mathcal{Y}}p(y)^{\gamma+1}{\rm d}\Lambda(y))^{\frac{1}{\gamma+1}}}-\frac{\int_{\mathcal{Y}}p^{\rm opt}(y)q_{j}(y)^{\gamma}{\rm d}\Lambda(y)}{(\int_{\mathcal{Y}}\{p^{\rm opt}(y)\}^{\gamma+1}{\rm d}\Lambda(y))^{\frac{1}{\gamma+1}}}\Big{\}}

which is equal to

1zw1γ{𝒴p(y){popt(y)}γdΛ(y)(𝒴p(y)γ+1dΛ(y))1γ+1(𝒴{popt(y)}γ+1dΛ(y))γγ+1}.\displaystyle-\frac{1}{z_{w}}\frac{1}{\gamma}\Big{\{}\frac{\int_{\mathcal{Y}}p(y)\{p^{\rm opt}(y)\}^{\gamma}{\rm d}\Lambda(y)}{(\int_{\mathcal{Y}}p(y)^{\gamma+1}{\rm d}\Lambda(y))^{\frac{1}{\gamma+1}}}-{\Big{(}\int_{\mathcal{Y}}\{p^{\rm opt}(y)\}^{\gamma+1}{\rm d}\Lambda(y)\Big{)}^{\frac{\gamma}{\gamma+1}}}\Big{\}}.

This expression simplifies to (1/zw)Dγ(P,Popt;Λ)(1/z_{w})D_{\gamma}^{*}(P,P^{\rm opt};\Lambda). Therefore, A(P)A(Popt)A(P)\geq A(P^{\rm opt}) and the equality holds if and only if P=PoptP=P^{\rm opt}. This is due to the property of Dγ(P,Popt;Λ)D_{\gamma}^{*}(P,P^{\rm opt};\Lambda) as a divergence measure. ∎

The optimal distribution PoptP^{\rm opt} can be viewed as a consensus distribution, integrating the kk committee members’ distributions QjQ_{j} into the average of divergence measures with importance weights wjw_{j}. We adopt a “query by committee” approach and examine the robustness against variations in committee distributions. Proposition 2 leads to an average version of the Pythagorean relations:

j=1kwjDγ(P,Qj)=Dγ(P,Popt)+j=1kwjDγ(Popt,Qj).\displaystyle\sum_{j=1}^{k}w_{j}D_{\gamma}^{*}(P,Q_{j})=D_{\gamma}^{*}(P,P^{\rm opt})+\sum_{j=1}^{k}w_{j}D_{\gamma}^{*}(P^{\rm opt},Q_{j}).

We refer to PoptP^{\rm opt} as the power mean of the set {Qj}1jk\{Q_{j}\}_{1\leq j\leq k}. In general, a generalized mean is defined as

GMϕ=ϕ1(j=1kwjϕ(qj)),\displaystyle GM_{\phi}=\phi^{-1}\Big{(}\sum_{j=1}^{k}w_{j}\phi(q_{j})\Big{)},

where ϕ\phi is a one-to-one function on (0,)(0,\infty). We confirm that, if γ=1\gamma=1, then Popt=j=1kwjQjP^{\rm opt}=\sum_{j=1}^{k}w_{j}Q_{j}, which is the arithmetic mean, that is, the mixture distribution of the QjQ_{j}’s with mixture proportions wjw_{j}’s. If γ=1\gamma=-1, then

PoptΛ(y)=zw{j=1kwjqj(y)1}1,\displaystyle\frac{\partial P^{\rm opt}}{\partial\Lambda}(y)=z_{w}\Big{\{}\sum_{j=1}^{k}w_{j}q_{j}(y)^{-1}\Big{\}}^{-1},

which is the harmonic mean of qjq_{j}’s with weights wjw_{j}’s. As γ\gamma goes to 0, the dual γ\gamma-divergence Dγ(P,Q;Λ)D_{\gamma}^{*}(P,Q;\Lambda) converges to the dual KL-divergence D0(P,Q)D_{0}^{*}(P,Q) defined by D0(Q,P)D_{0}(Q,P). The minimizer is given by

PoptΛ(y)=zj=1k{qj(y)}wj,\displaystyle\frac{\partial P^{\rm opt}}{\partial\Lambda}(y)=z\prod_{j=1}^{k}\{q_{j}(y)\}^{w_{j}},

which is the geometric mean of the qjq_{j}’s with weights wjw_{j}’s. We will discuss divergence measures using the harmonic and geometric means of ratios of RN-derivatives in a later section.
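The special cases of Proposition 2 can be verified numerically. The sketch below (our own illustration; NumPy and the name `power_mean` are assumptions) computes the power mean of pmfs on a finite space and checks the arithmetic-mean case at γ=1\gamma=1, the harmonic-mean case at γ=1\gamma=-1, and the geometric-mean limit as γ0\gamma\rightarrow 0.

```python
import numpy as np

def power_mean(qs, w, gamma):
    # power mean of pmfs qs (rows) with weights w: the minimizer in
    # Proposition 2, normalized so that it is again a pmf
    if gamma == 0.0:                      # limiting geometric-mean case
        g = np.exp(w @ np.log(qs))
    else:
        g = (w @ qs**gamma)**(1.0 / gamma)
    return g / g.sum()                    # normalization plays the role of z_w

qs = np.array([[0.2, 0.3, 0.5],
               [0.4, 0.4, 0.2]])
w = np.array([0.6, 0.4])

assert np.allclose(power_mean(qs, w, 1.0), w @ qs)          # mixture at gamma = 1
h = 1.0 / (w @ (1.0 / qs))
assert np.allclose(power_mean(qs, w, -1.0), h / h.sum())    # harmonic at gamma = -1
assert np.allclose(power_mean(qs, w, 0.0),
                   power_mean(qs, w, 1e-8), atol=1e-6)      # gamma -> 0 limit
```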

We often utilize the logarithmic expression for the γ\gamma divergence, given by

Δγ(P,Q;Λ)=1γlog𝒴p(y)q(y)γdΛ(y)(𝒴p(y)γ+1dΛ(y))1γ+1(𝒴q(y)γ+1dΛ(y))γγ+1.\displaystyle\Delta_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\log\frac{\int_{\mathcal{Y}}p(y)q(y)^{\gamma}{\rm d}\Lambda(y)}{\big{(}\int_{\mathcal{Y}}p(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{1}{\gamma+1}}\big{(}\int_{\mathcal{Y}}q(y)^{\gamma+1}{\rm d}\Lambda(y)\big{)}^{\frac{\gamma}{\gamma+1}}}. (1.21)

We find a remarkable invariance property:

Δγ(τP,σQ;Λ)=Δγ(P,Q;Λ)\displaystyle\Delta_{\gamma}(\tau P,\sigma Q;\Lambda)=\Delta_{\gamma}(P,Q;\Lambda)

for all τ>0\tau>0 and σ>0\sigma>0, noting that the log expression can be written as

Δγ(P,Q;Λ)=1γlog𝒴{p(γ)(y)}1γ+1{q(γ)(y)}γγ+1dΛ(y)\displaystyle\Delta_{\gamma}(P,Q;\Lambda)=-\frac{1}{\gamma}\log{\int_{\mathcal{Y}}\{p^{(\gamma)}(y)\}^{\frac{1}{\gamma+1}}\{q^{(\gamma)}(y)\}^{\frac{\gamma}{\gamma+1}}{\rm d}\Lambda(y)} (1.22)

by the use of the γ\gamma-expression defined in (1.12). This implies that Δγ(P,Q;Λ)\Delta_{\gamma}(P,Q;\Lambda) measures not a departure between PP and QQ but an angle between them. When γ=1\gamma=1, this is the negative log cosine similarity for pp and qq. In effect, the cosine similarity for vectors aa and bb in d\mathbb{R}^{d} is defined by

cos(a,b)=j=1dajbj{j=1d|aj|2}12{j=1d|bj|2}12.\displaystyle\cos(a,b)=\frac{\sum_{j=1}^{d}a_{j}b_{j}}{\{\sum_{j=1}^{d}|a_{j}|^{2}\}^{\frac{1}{2}}\{\sum_{j=1}^{d}|b_{j}|^{2}\}^{\frac{1}{2}}}.

This is closely related to Δγ(P,Q;Λ)\Delta_{\gamma}(P,Q;\Lambda) on a discrete space 𝒴\mathcal{Y} when γ=1\gamma=1. We will discuss an extension of Δγ\Delta_{\gamma} defined on the space of all signed measures, which comprehensively gives an asymmetry in measuring the angle.
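The angle interpretation can be checked directly. The sketch below (our own illustration, assuming NumPy and a finite space with the counting reference measure) evaluates the log expression (1.21) and confirms both the cosine identity at γ=1\gamma=1 and the scale invariance in both arguments.

```python
import numpy as np

def log_gamma_div(p, q, gamma):
    # logarithmic expression (1.21) of the gamma-divergence on a finite
    # space with the counting reference measure
    num = np.sum(p * q**gamma)
    den = (np.sum(p**(gamma + 1))**(1 / (gamma + 1))
           * np.sum(q**(gamma + 1))**(gamma / (gamma + 1)))
    return -np.log(num / den) / gamma

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])
cos_pq = np.sum(p * q) / (np.sqrt(np.sum(p**2)) * np.sqrt(np.sum(q**2)))
assert np.isclose(log_gamma_div(p, q, 1.0), -np.log(cos_pq))  # gamma = 1 case
# scale invariance in both arguments: Delta(tau*P, sigma*Q) = Delta(P, Q)
assert np.isclose(log_gamma_div(2.0 * p, 0.5 * q, 1.0), log_gamma_div(p, q, 1.0))
```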

In summary, the exploration of power divergence measures in this section has illuminated their potential as versatile tools for quantifying the divergence between probability distributions. From the foundational Kullback-Leibler divergence to the more specialized α\alpha, β\beta, and γ\gamma divergence measures, we have seen that each type has its own strengths and limitations, making them suited for particular classes of problems. We have also underscored the mathematical properties that make these divergences unique, such as invariance under different conditions and applicability in empirical settings. As the field of statistics and machine learning continues to evolve, it is evident that these power divergence measures will find even broader applications, providing rigorous ways to compare models, make predictions, and draw inferences from increasingly complex data.

1.5 GM and HM divergence measures

We discuss a situation where a random variable YY is discrete, taking values in a finite set denoted by 𝒴={1,,k}{\mathcal{Y}}=\{1,...,k\}. Let 𝒫\mathcal{P} be the space of all probability measures on 𝒴\mathcal{Y}. In the realm of statistical divergence measures, the arithmetic, geometric, and harmonic means for a probability measure PP of 𝒫\mathcal{P} receive less attention despite their mathematical elegance and potential applications. For this, consider the RN-derivative of a probability measure relative to QQ, which equals the ratio of probability mass functions (pmfs) pp and qq induced by PP and QQ in 𝒫{\mathcal{P}}. Then, there is a well-known inequality between the arithmetic and geometric means:

y𝒴p(y)q(y)r(y)y𝒴{p(y)q(y)}r(y)\displaystyle\sum_{y\in{\mathcal{Y}}}\frac{p(y)}{q(y)}r(y)\geq\prod_{y\in{\mathcal{Y}}}\Big{\{}\frac{p(y)}{q(y)}\Big{\}}^{r(y)} (1.23)

and that between the arithmetic and harmonic means:

y𝒴p(y)q(y)r(y)[y𝒴{p(y)q(y)}1r(y)]1,\displaystyle\sum_{y\in{\mathcal{Y}}}\frac{p(y)}{q(y)}r(y)\geq\Big{[}\sum_{y\in{\mathcal{Y}}}\Big{\{}\frac{p(y)}{q(y)}\Big{\}}^{-1}r(y)\Big{]}^{-1}, (1.24)

where r(y)r(y) is an arbitrarily fixed pmf on 𝒴\mathcal{Y} serving as a weight function. Equality in (1.23) or (1.24) holds if and only if p=qp=q. These well-known inequality relationships among the means serve as the mathematical bedrock for defining new divergence measures. Specifically, the Geometric Mean (GM) and Harmonic Mean (HM) divergences are inspired by inequalities involving these means and ratios of probabilities as follows.
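The inequalities (1.23) and (1.24) are quickly verified numerically. The sketch below (our own illustration; NumPy and the random choice of pmfs are assumptions) evaluates the weighted arithmetic, geometric, and harmonic means of the ratio p/qp/q.

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(6))
q = rng.dirichlet(np.ones(6))
r = rng.dirichlet(np.ones(6))   # an arbitrarily fixed weight pmf

ratio = p / q
am = np.sum(ratio * r)          # arithmetic mean: LHS of (1.23) and (1.24)
gm = np.prod(ratio**r)          # geometric mean: RHS of (1.23)
hm = 1.0 / np.sum(r / ratio)    # harmonic mean: RHS of (1.24)
assert am >= gm >= hm           # AM >= GM >= HM, with equality iff p = q
# equality case: the ratio p/p is identically one, so all three means are 1
assert np.isclose(np.sum((p / p) * r), np.prod((p / p)**r))
```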

First, we define the GM-divergence as

DGM(P,Q;R)=y𝒴p(y)q(y)r(y)y𝒴q(y)r(y)y𝒴p(y)r(y)\displaystyle D_{\rm GM}(P,Q;R)=\sum_{y\in{\mathcal{Y}}}\frac{p(y)}{q(y)}r(y)\prod_{y\in{\mathcal{Y}}}q(y)^{r(y)}-\prod_{y\in{\mathcal{Y}}}p(y)^{r(y)}

transforming the expression (1.23), where p(y)p(y), q(y)q(y) and r(y)r(y) are the pmfs with respect to PP, QQ and RR, respectively. Note that DGM(P,Q;R)D_{\rm GM}(P,Q;R) is a divergence measure on 𝒫\mathcal{P} as defined in Definition 1. We restrict 𝒴\mathcal{Y} to be a finite discrete set for this discussion; however, our results can be generalized. In effect, the GM-divergence has a general form:

DGM(P,Q;R)=𝒴p(y)q(y)𝑑R(y)exp{𝒴logq(y)𝑑R(y)}exp{𝒴logp(y)𝑑R(y)}\displaystyle D_{\rm GM}(P,Q;R)=\int_{\mathcal{Y}}\frac{p(y)}{q(y)}dR(y)\exp\Big{\{}\int_{\mathcal{Y}}\log q(y)dR(y)\Big{\}}-\exp\Big{\{}\int_{\mathcal{Y}}\log p(y)dR(y)\Big{\}} (1.25)

For comparison, let us look at the γ\gamma-divergence

Dγ(P,Q;R)=1γ𝒴p(y)q(y)γ𝑑R(y)(𝒴q(y)γ+1𝑑R(y))γγ+1+1γ(𝒴p(y)γ+1𝑑R(y))1γ+1,\displaystyle D_{\gamma}(P,Q;R)=-\frac{1}{\gamma}\frac{\int_{{\mathcal{Y}}}p(y){q(y)}^{\gamma}dR(y)}{\big{(}\int_{{\mathcal{Y}}}{q(y)}^{\gamma+1}dR(y)\big{)}^{\frac{\gamma}{\gamma+1}}}+\frac{1}{\gamma}{\big{(}\int_{{\mathcal{Y}}}{p(y)}^{\gamma+1}dR(y)\big{)}^{\frac{1}{\gamma+1}}}, (1.26)

by selecting a probability measure RR as a reference measure.

Proposition 3.

Let DGM(P,Q;R)D_{\rm GM}(P,Q;R) and Dγ(P,Q;R)D_{\gamma}(P,Q;R) be defined in (1.25) and (1.26), respectively. Then,

limγ1Dγ(P,Q;R)=DGM(P,Q;R)\displaystyle\lim_{\gamma\rightarrow-1}D_{\gamma}(P,Q;R)=D_{\rm GM}(P,Q;R) (1.27)
Proof.

It follows from L’Hôpital’s rule that

limγ11γ+1log𝒴p(y)γ+1𝑑R(y)=𝒴logp(y)𝑑R(y).\displaystyle\lim_{\gamma\rightarrow-1}{\frac{1}{\gamma+1}}\log{\int_{{\mathcal{Y}}}{p(y)}^{\gamma+1}dR(y)}=\int_{{\mathcal{Y}}}\log p(y)dR(y).

This implies

limγ1{𝒴p(y)γ+1𝑑R(y)}1γ+1=exp{𝒴logp(y)𝑑R(y)}.\displaystyle\lim_{\gamma\rightarrow-1}\Big{\{}{\int_{{\mathcal{Y}}}{p(y)}^{\gamma+1}dR(y)}\Big{\}}^{\frac{1}{\gamma+1}}=\exp\Big{\{}\ \int_{{\mathcal{Y}}}\log p(y)dR(y)\Big{\}}.

Accordingly, we conclude (1.27). ∎

We write the GM-divergence as the difference of the cross and diagonal entropy measures: DGM(P,Q;R)=HGM(P,Q;R)HGM(P,P;R)D_{\rm GM}(P,Q;R)=H_{\rm GM}(P,Q;R)-H_{\rm GM}(P,P;R), where

HGM(P,Q;R)=𝒴p(y)q(y)𝑑R(y)exp{𝒴logq(y)𝑑R(y)}.\displaystyle H_{\rm GM}(P,Q;R)=\int_{{\mathcal{Y}}}\frac{p(y)}{q(y)}dR(y)\exp\Big{\{}\int_{{\mathcal{Y}}}\log q(y)dR(y)\Big{\}}.

The GM-divergence has a log expression:

ΔGM(P,Q;R)=logHGM(P,Q;R)HGM(P,P;R).\displaystyle\Delta_{\rm GM}(P,Q;R)=\log\frac{H_{\rm GM}(P,Q;R)}{H_{\rm GM}(P,P;R)}.

We note that ΔGM(P,Q;R)\Delta_{\rm GM}(P,Q;R) is equal to Δγ(P,Q;R)\Delta_{\gamma}(P,Q;R) in the limit as γ\gamma tends to 1-1. Here we discuss the case of the Poisson distribution family. We choose a Poisson distribution Po(τ)(\tau) as the reference measure. Then, the GM-divergence in log form is given by

ΔGM(𝙿𝚘(λ),𝙿𝚘(μ);𝙿𝚘(τ))=τ(λμlogλμ1).\displaystyle\Delta_{\rm GM}({\tt Po}(\lambda),{\tt Po}(\mu);{\tt Po}(\tau))=\tau\Big{(}\frac{\lambda}{\mu}-\log\frac{\lambda}{\mu}-1\Big{)}.
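This closed form, τ(λ/μlog(λ/μ)1)\tau(\lambda/\mu-\log(\lambda/\mu)-1), can be checked against the definition of ΔGM\Delta_{\rm GM}. The sketch below (our own illustration; NumPy, `math.lgamma`, and truncation of the Poisson support at 300 are assumptions) evaluates HGM(P,Q;R)H_{\rm GM}(P,Q;R) directly with Poisson pmfs and the reference Po(τ)(\tau).

```python
import numpy as np
from math import lgamma

y = np.arange(0, 300)
lam, mu, tau = 2.0, 3.0, 1.5
lg = np.array([lgamma(v + 1.0) for v in y])

def log_pmf(l):
    # log of the Poisson(l) pmf at each point of y
    return -l + y * np.log(l) - lg

lp, lq, lr = log_pmf(lam), log_pmf(mu), log_pmf(tau)
r = np.exp(lr)

# H_GM(P,Q;R) = [sum (p/q) r] * exp{ sum (log q) r }, truncated at y = 299
H_pq = np.sum(np.exp(lp - lq) * r) * np.exp(np.sum(lq * r))
H_pp = np.sum(r) * np.exp(np.sum(lp * r))
delta = np.log(H_pq / H_pp)
assert np.isclose(delta, tau * (lam / mu - np.log(lam / mu) - 1.0), atol=1e-8)
```

The log-factorial terms cancel between the two entropies, so the same value is obtained whichever dominating measure the densities are taken against.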

Second, we introduce the HM-divergence. Suppose r(y)=q(y)1/z𝒴q(z)1r(y)=q(y)^{-1}/\sum_{z\in{\mathcal{Y}}}q(z)^{-1} in the inequality (1.24). Then, (1.24) is written as

y𝒴p(y)q(y)2{y𝒴q(y)1}1{y𝒴p(y)1}1{y𝒴q(y)1}.\displaystyle\sum_{y\in{\mathcal{Y}}}\frac{p(y)}{q(y)^{2}}\Big{\{}\sum_{y\in{\mathcal{Y}}}q(y)^{-1}\Big{\}}^{-1}\geq\Big{\{}\sum_{y\in{\mathcal{Y}}}{p(y)}^{-1}\Big{\}}^{-1}\Big{\{}\sum_{y\in{\mathcal{Y}}}q(y)^{-1}\Big{\}}. (1.28)

We define the harmonic-mean (HM) divergence by rearranging the inequality (1.28) as

DHM(P,Q)=HHM(P,Q)HHM(P,P).\displaystyle D_{\rm HM}(P,Q)=H_{\rm HM}(P,Q)-H_{\rm HM}(P,P).

Here

HHM(P,Q)=y𝒴p(y)q(y)2{y𝒴q(y)1}2\displaystyle H_{\rm HM}(P,Q)=\sum_{y\in{\mathcal{Y}}}\frac{p(y)}{q(y)^{2}}\Big{\{}\sum_{y\in{\mathcal{Y}}}q(y)^{-1}\Big{\}}^{-2}

is the cross entropy, where p(y)p(y) and q(y)q(y) are the pmfs of PP and QQ, respectively. The (diagonal) entropy is given by the harmonic mean of the p(y)p(y)’s:

HHM(P,P)={y𝒴p(y)1}1.\displaystyle H_{\rm HM}(P,P)=\Big{\{}\sum_{y\in{\mathcal{Y}}}p(y)^{-1}\Big{\}}^{-1}.

Note that DHM(P,Q)D_{\rm HM}(P,Q) qualifies as a divergence measure on 𝒫\mathcal{P}, as defined in Definition 1, due to the inequality (1.28). When γ=2\gamma=-2, DHM(P,Q)D_{\rm HM}(P,Q) is equal to Dγ(P,Q;C)D_{\gamma}(P,Q;C) in (1.9) with the counting measure CC. The log form is given by

ΔHM(P,Q)=logHHM(P,Q)logHHM(P,P).\displaystyle\Delta_{\rm HM}(P,Q)=\log H_{\rm HM}(P,Q)-\log H_{\rm HM}(P,P).
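The HM-divergence is likewise easy to evaluate on a finite space. The sketch below (our own illustration; NumPy and the helper names are assumptions) implements the cross and diagonal entropies above and checks the divergence properties implied by (1.28).

```python
import numpy as np

def H_HM(p, q):
    # cross entropy of the HM type on a finite space
    return np.sum(p / q**2) / np.sum(1.0 / q)**2

def D_HM(p, q):
    # HM-divergence: H_HM(P,Q) minus the harmonic mean of the p(y)'s
    return H_HM(p, q) - 1.0 / np.sum(1.0 / p)

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.3, 0.3, 0.2, 0.2])
assert D_HM(p, q) >= 0.0             # divergence property from (1.28)
assert np.isclose(D_HM(p, p), 0.0)   # vanishes when P = Q
```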

The GM-divergence provides an insightful lens through which we can examine statistical similarity or dissimilarity by leveraging the multiplicative nature of probabilities. The HM-divergence, on the other hand, focuses on rates and ratios, thus providing a complementary perspective to the GM-divergence, particularly useful in scenarios where rate-based analysis is pivotal. By extending the divergence measures to include the GM and HM divergences, we gain a nuanced toolkit for quantifying divergence, each with unique advantages and applications. For instance, the GM-divergence could be particularly useful in applications where multiplicative effects are prominent, such as in network science or econometrics. Similarly, the HM-divergence might be beneficial in settings like biostatistics or communications, where rate and proportion are of prime importance. This framework, rooted in the relationships among arithmetic, geometric, and harmonic means, not only expands the class of divergence measures but also elevates our understanding of how different mathematical properties can be tailored to suit the needs of diverse statistical challenges.

1.6 Concluding remarks

In summary, this chapter has laid the groundwork for understanding the class of power divergence measures in a probabilistic framework. We have seen that divergence measures quantify the difference between two probability distributions and have applications in statistics and machine learning. The chapter begins with the well-known Kullback-Leibler (KL) divergence, highlighting its advantages and limitations. To address the limitations of the KL-divergence, three types of power divergence measures are introduced.

Let us look at the α\alpha, β\beta, and γ\gamma-divergence measures for a Poisson distribution model. Let 𝙿𝚘(λ){\tt Po}(\lambda) denote a Poisson distribution with the RN-derivative

p(y,λ)=λyeλ\displaystyle p(y,\lambda)=\lambda^{y}e^{-\lambda} (1.29)

with respect to the reference measure RR having (R)/(C)(y)=1/y!(\partial R)/(\partial C)(y)=1/y!. Seven examples of power divergence between Poisson distributions 𝙿𝚘(λ0){\tt Po}(\lambda_{0}) and 𝙿𝚘(λ1){\tt Po}(\lambda_{1}) are listed in Table 1.1. Note that this choice of the reference measure enables us to obtain tractable forms of the β\beta and γ\gamma divergences as well as their variants. Here we use a basic formula to obtain these divergence measures:

𝒴p(y,λ)adR(y)=exp(λaaλ)\displaystyle\int_{{\mathcal{Y}}}p(y,\lambda)^{a}{\rm d}R(y)=\exp(\lambda^{a}-a\lambda)

for any real exponent aa. The contour sets of the seven divergences between Poisson distributions are plotted in Figure 1.1. All the divergences attain the unique minimum 0 on the diagonal line {(λ0,λ1):λ0=λ1}\{(\lambda_{0},\lambda_{1}):\lambda_{0}=\lambda_{1}\}. The contour sets of the GM and HM divergences are flat compared with those of the other divergences.
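The basic formula above can be verified by direct summation, since dR/dC(y)=1/y!{\rm d}R/{\rm d}C(y)=1/y!. The sketch below (our own illustration; NumPy, `math.lgamma`, and truncation of the sum at 200 are assumptions) checks it for one choice of λ\lambda and aa.

```python
import numpy as np
from math import lgamma

lam, a = 2.5, 1.7
y = np.arange(0, 200)
log_fact = np.array([lgamma(v + 1.0) for v in y])
# integral of p(y, lam)^a = (lam^y e^{-lam})^a against dR(y) = dC(y)/y!
lhs = np.sum(np.exp(a * (y * np.log(lam) - lam) - log_fact))
assert np.isclose(lhs, np.exp(lam**a - a * lam))
```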

Table 1.1: The variants of power divergence
α\alpha-divergence 1α(1α)(1exp{λ0(λ1λ0)α(1α)λ0αλ1})\frac{1}{\alpha(1-\alpha)}\big{(}1-\exp\{\lambda_{0}(\frac{\lambda_{1}}{\lambda_{0}})^{\alpha}-(1-\alpha)\lambda_{0}-\alpha\lambda_{1}\}\big{)}
β\beta-divergence eλ0β+1(β+1)λ0+βeλ1β+1(β+1)λ1(β+1)eλ0λ1βλ0βλ1β(β+1)\displaystyle\frac{e^{\lambda_{0}^{\beta+1}-(\beta+1)\lambda_{0}}+\beta e^{\lambda_{1}^{\beta+1}-(\beta+1)\lambda_{1}}-(\beta+1)e^{\lambda_{0}\lambda_{1}^{\beta}-\lambda_{0}-\beta\lambda_{1}}}{\beta(\beta+1)}
γ\gamma-divergence 1γexp(λ0){exp(λ0λ1γγγ+1λ1γ+1)exp(1γ+1λ0γ+1)}-\frac{1}{\gamma}\exp({-\lambda_{0}})\{\exp(\lambda_{0}\lambda_{1}^{\gamma}-\frac{\gamma}{\gamma+1}\lambda_{1}^{\gamma+1})-\exp(\frac{1}{\gamma+1}\lambda_{0}^{\gamma+1})\}
dual γ\gamma-divergence 1γexp(γλ1){exp(λ0λ1γ1γ+1λ0γ+1)exp(γγ+1λ1γ+1)}-\frac{1}{\gamma}\exp({-\gamma\lambda_{1}})\{\exp(\lambda_{0}\lambda_{1}^{\gamma}-\frac{1}{\gamma+1}\lambda_{0}^{\gamma+1})-\exp(\frac{\gamma}{\gamma+1}\lambda_{1}^{\gamma+1})\}
log γ\gamma-divergence 1γ(λ0λ1γ1γ+1λ0γ+1γγ+1λ1γ+1)-\frac{1}{\gamma}(\lambda_{0}\lambda_{1}^{\gamma}-\frac{1}{\gamma+1}\lambda_{0}^{\gamma+1}-\frac{\gamma}{\gamma+1}\lambda_{1}^{\gamma+1})
GM-divergence λ1exp(λ0){exp(λ0λ1)exp(1)}\lambda_{1}\exp({-\lambda_{0}})\{\exp(\frac{\lambda_{0}}{\lambda_{1}})-\exp(1)\}
HM-divergence 12exp(λ0){exp(λ0λ1221λ1)exp(1λ0)}\small\mbox{$\frac{1}{2}$}\exp({-\lambda_{0}})\{\exp(\frac{\lambda_{0}}{\lambda_{1}^{2}}-2\frac{1}{\lambda_{1}})-\exp(-\frac{1}{\lambda_{0}})\}
Refer to caption
Figure 1.1: Contour plots of the power divergences.

The α\alpha-divergence is intrinsic for assessing the divergence between two probability measures. One of its most important properties is invariance under the choice of the reference measure that expresses the Radon-Nikodym derivatives of the two probability measures. This invariance provides a direct understanding of intrinsic properties beyond those of the probability density or mass functions. A serious drawback is that an empirical counterpart is not available for a given dataset in most practical situations. This makes it difficult to apply the α\alpha-divergence to statistical inference for estimation and prediction. In effect, its statistical applications are limited to the curved exponential family, which is modeled within an exponential family. See [18] for the statistical curvature characterizing second-order efficiency.

The β\beta-divergence and the γ\gamma-divergence are not invariant under the choice of the reference measure. We have to determine the reference measure from the viewpoint of applications to statistics and machine learning. Subsequently, we discuss the appropriate selection of the reference measure for both the β\beta and γ\gamma divergences. Both divergence measures are effective for applications in statistics and machine learning since the empirical loss function for a parametric model distribution is available for any dataset. For example, the β\beta-divergence is utilized as a cost function measuring the difference between a data matrix and its factorization in nonnegative matrix factorization. In such applications the minimum β\beta-divergence method is more robust than the maximum likelihood method, which can be viewed as the minimum KL-divergence method. In practice, the β\beta-divergence is not scale-invariant on the space of all finite measures, which includes the space of all probability measures. We will see that this lack of scale invariance precludes simple estimating functions even under a normal distribution model.

Alternatively, the γ\gamma-divergence is scale-invariant with respect to the second argument. The γ\gamma-divergence provides a simple estimating function for the minimum γ\gamma-estimator. This property enables us to propose an efficient learning algorithm for solving the estimating equation. For example, the γ\gamma-divergence is used for cluster analysis. The cluster centers are determined by local minima of the empirical loss function defined by the γ\gamma-divergence; see [75, 74] for the learning architecture. A fixed-point type of algorithm is proposed to conduct fast detection of the local minima. Such practical properties in applications will be explored in the following section. We also consider the dual γ\gamma-divergence, which is scale-invariant in the first argument. We will explore its applicability to defining the consensus distribution in a context of active learning. It is confirmed that the γ\gamma-divergence is well defined even for negative values of the exponent γ\gamma. The γ\gamma-divergences with γ=1\gamma=-1 and γ=2\gamma=-2 reduce to the GM and HM divergences, respectively. In subsequent discussions, special attention is given to the GM and HM divergences for various objectives in applications; see [26] for the application to dynamic treatment regimes in medical science.

Chapter 2 Minimum divergence for regression model

This chapter explores statistical estimation within regression models. We introduce a comprehensive class of estimators known as Minimum Divergence Estimators (MDEs), along with their empirical loss functions under a parametric framework. Standard properties such as unbiasedness, consistency, and asymptotic normality of these estimators are thoroughly examined. Additionally, the chapter addresses the issue of model misspecification, which can result in biases, inaccurate inferences, and diminished statistical power, and highlights the vulnerability of conventional methods to such misspecifications. Our primary goal is to identify estimators that remain robust against potential biases arising from model misspecification. We place particular emphasis on the γ\gamma-divergence, which underpins the γ\gamma-estimator known for its efficiency and robustness.

2.1 Introduction

We study statistical estimation in a regression model including a generalized linear model. The maximum likelihood (ML) method is widely employed and developed for the estimation problem. This estimation method has been standard on the basis of solid evidence that the ML-estimator is asymptotically consistent and efficient when the underlying distribution is correctly specified in a regression model. See [16, 62, 7, 39]. The power of parametric inference in regression models is substantial, offering several advantages and capabilities that are essential for effective statistical analysis and decision-making. The ML method has been a cornerstone of parametric inference. This principle yields estimators that are asymptotically unbiased, consistent, and efficient, given that the model is correctly specified. Specifically, generalized linear models (GLMs) extend linear models to accommodate response variables that have error distribution models other than a normal distribution, enabling the use of ML estimation across binomial, Poisson, and other exponential family distributions.

However, we are frequently concerned with model misspecification, which occurs when the statistical model does not accurately represent the data-generating process. This could be due to the omission of important variables, the inclusion of irrelevant variables, incorrect functional forms, or wrong error structures. Such misspecification can lead to biases, inaccurate inferences, and reduced statistical power. See [95] for the critical issue of model misspecification. A misspecified model is more sensitive to outliers, often resulting in more biased estimates. Outliers can also obscure the true relationships between variables, making model misspecification difficult to detect. Unfortunately, the performance of the ML-estimator is easily degraded in such difficult situations because of its excessive sensitivity to model misspecification. Such limitations in the face of model misspecification and complex data structures have prompted the development of a broad spectrum of alternative methodologies. In this way, we take the MDE approach rather than the maximum likelihood.

We discuss a class of estimating methods based on the minimization of divergence measures [2, 65, 69, 67, 21, 66, 22, 55, 53, 54, 26]. These are known as minimum divergence estimators (MDEs). The empirical loss functions for a given dataset are discussed from a unified perspective under a parametric model. Thus, we derive a broad class of estimation methods via MDEs. Our primary objective is to find estimators that are robust against potential biases in the presence of model misspecification. MDEs can be applied in a straightforward manner when the outcome is a continuous variable, in which case the reference measure defining the divergence is fixed to the Lebesgue measure. Alternatively, more consideration is needed regarding the choice of a reference measure when the outcome is a discrete variable. In particular, the β\beta and γ\gamma divergence measures depend strongly on the choice of a reference measure. We explore effective choices for the reference measure to ensure that the corresponding divergences are tractable and can be expressed feasibly. We focus on the γ\gamma-divergence as a robust MDE through an effective choice of the reference measure.

This chapter is organized as follows. Section 2.2 gives an overview of M-estimation in the framework of generalized linear models. In Section 2.3 the γ\gamma-divergence is introduced in a regression model and the γ\gamma-loss function is discussed. In Section 2.4 we focus on the γ\gamma-estimator in a normal linear regression model. A simple numerical example demonstrates a robust property of the γ\gamma-estimator compared to the ML-estimator. Section 2.5 discusses a logistic model for binary regression. The γ\gamma-loss function is shown to have a robust property in which the Euclidean distance from the estimating function to the decision boundary is uniformly bounded when γ\gamma is in a specific range. Section 2.6 extends the result in the binary case to a multiclass case. Section 2.7 considers a Poisson regression model focusing on a log-linear model. The γ\gamma-divergence is given by a specific choice of the reference measure. The robustness of the γ\gamma-estimator is confirmed for any γ\gamma in the specific range. A simple numerical experiment is conducted. Finally, concluding remarks on geometric understandings are given in Section 2.8.

2.2 M-estimators in a generalized linear model

Let us establish a probabilistic framework for a dd-dimensional covariate variable XX in a subset 𝒳\mathcal{X} of d\mathbb{R}^{d}, and an outcome YY with a value of a subset 𝒴\mathcal{Y} of \mathbb{R} in a regression model paradigm. The major objective is to estimate the regression function

r(x)=𝔼[Y|X=x]\displaystyle r(x)=\mathbb{E}[Y|X=x]

based on a given dataset. In a paradigm of prediction, XX is often called a feature vector, where to build a predictor defined by a function from 𝒳\mathcal{X} to 𝒴\mathcal{Y} is one of the most important tasks. Let 𝒫(x){\mathcal{P}}(x) be a space of conditional probability measures conditioned on X=xX=x. For any event BB in 𝒴\mathcal{Y} the conditional probability given X=xX=x is written by

P(B|x)=Bp(y|x)dΛ(y),\displaystyle P(B|x)=\int_{B}p(y|x){\rm d}\Lambda(y),

where p(y|x)p(y|x) is the RN-derivative of Y=yY=y given X=xX=x with a reference measure Λ\Lambda. A statistical model embedded in 𝒫(x){\mathcal{P}}(x) is written as

={P(|x,θ):θΘ},\displaystyle{{\mathcal{M}}}=\{P(\cdot|x,\theta):\theta\in\Theta\}, (2.1)

where θ\theta is a parameter of a parameter space Θ\Theta. Then, the Kullback-Leibler (KL) divergence on 𝒫(x){\mathcal{P}}(x) is given by

D0(P0(|x),P1(|x))=H0(P0(|x),P1(|x))H0(P0(|x),P0(|x)),\displaystyle D_{0}(P_{0}(\cdot|x),P_{1}(\cdot|x))=H_{0}(P_{0}(\cdot|x),P_{1}(\cdot|x))-H_{0}(P_{0}(\cdot|x),P_{0}(\cdot|x)),

with the cross entropy,

H0(P0(|x),P1(|x))=𝒴p0(y|x)logp1(y|x)dΛ(y).\displaystyle H_{0}(P_{0}(\cdot|x),P_{1}(\cdot|x))=-\int_{\mathcal{Y}}{p_{0}(y|x)}\log{p_{1}(y|x)}{\rm d}\Lambda(y).

Note that the KL-divergence is independent of the choice of reference measure Λ\Lambda as discussed in Chapter 1. Let 𝒟={(Xi,Yi):1in}{\mathcal{D}}=\{(X_{i},Y_{i}):1\leq i\leq n\} be a random sample drawn from a distribution of {\mathcal{M}}. The goal is to estimate the parameter θ\theta in {\mathcal{M}} in (2.1). Then, the negative log-likelihood function is defined by

L0(θ;Λ)=1ni=1nlogp(Yi|Xi,θ),\displaystyle L_{0}(\theta;\Lambda)=-\frac{1}{n}\sum_{i=1}^{n}\log p(Y_{i}|X_{i},\theta),

where p(y|x,θ)p(y|x,\theta) is the RN-derivative of P(,θ)P(\cdot,\theta) with respect to Λ\Lambda. Note that, for any measure Λ~\tilde{\Lambda} equivalent to Λ\Lambda, the negative log-likelihood functions satisfy L0(θ;Λ~)=L0(θ;Λ)L_{0}(\theta;\tilde{\Lambda})=L_{0}(\theta;\Lambda) up to a constant. The expectation of L0(θ;Λ)L_{0}(\theta;\Lambda) under the model distribution of \mathcal{M} is equal to the cross entropy:

𝔼0[L0(θ;Λ)|X¯]=1ni=1nH0(P(|Xi,θ0),P(|Xi,θ))\displaystyle\mathbb{E}_{0}[L_{0}(\theta;\Lambda)|\underline{X}]=\frac{1}{n}\sum_{i=1}^{n}H_{0}(P(\cdot|X_{i},\theta_{0}),P(\cdot|X_{i},\theta)) (2.2)

where X¯=(X1,,Xn)\underline{X}=(X_{1},...,X_{n}), θ0\theta_{0} is the true value of the parameter, and 𝔼0\mathbb{E}_{0} is the conditional expectation under the model distributions P(|Xi,θ0)P(\cdot|X_{i},\theta_{0}). Hence,

𝔼0[L0(θ;Λ)|X¯]𝔼0[L0(θ0;Λ)|X¯]=1ni=1nD0(P(|Xi,θ0),P(|Xi,θ)),\displaystyle\mathbb{E}_{0}[L_{0}(\theta;\Lambda)|\underline{X}]-\mathbb{E}_{0}[L_{0}(\theta_{0};\Lambda)|\underline{X}]=\frac{1}{n}\sum_{i=1}^{n}D_{0}(P(\cdot|X_{i},\theta_{0}),P(\cdot|X_{i},\theta)),

which can be viewed as an empirical analogue of the Pythagorean equation. Due to the property of the KL-divergence as a divergence measure,

θ0=argminθΘ𝔼0[L0(θ;Λ)|X¯].\displaystyle\theta_{0}=\mathop{\rm argmin}_{\theta\in\Theta}\mathbb{E}_{0}[L_{0}(\theta;\Lambda)|\underline{X}].

By definition, the ML-estimator θ^0\hat{\theta}_{0} is the minimizer of L0(θ;Λ)L_{0}(\theta;\Lambda) in θ\theta, while the true value θ0\theta_{0} is the minimizer of 𝔼0[L0(θ;Λ)|X¯]\mathbb{E}_{0}[L_{0}(\theta;\Lambda)|\underline{X}] in θ\theta. The continuous mapping theorem then establishes the consistency of the ML-estimator for the true parameter; see [61, 94]. The estimating function is defined by the gradient of the negative log-likelihood function

S0(θ;Λ)=θL0(θ;Λ).\displaystyle{S}_{0}(\theta;\Lambda)=\frac{\partial}{\partial\theta}L_{0}(\theta;\Lambda).

Hence, the ML-estimator θ^\hat{\theta} is a solution of the estimating equation, S0(θ;Λ)=0{S}_{0}(\theta;\Lambda)=0 under regularity conditions. We note that the solution of the expected estimating function under the distribution with the true value θ0\theta_{0} is θ0\theta_{0} itself, that is,

𝔼0[S0(θ0;Λ)|X¯]=0.\displaystyle\mathbb{E}_{0}[{S}_{0}(\theta_{0};\Lambda)|\underline{X}]=0.

This implies that, by the continuous mapping theorem again, the ML-estimator is consistent for the true value θ0\theta_{0}.

The framework of a generalized linear model (GLM) is suitable for a wide range of data types other than the ordinary linear regression model, see [62]. While the ordinary linear regression usually assumes that the response variable is normally distributed, GLMs allow for response variables that have different distributions, such as the Bernoulli, categorical, Poisson, negative binomial distributions and exponential families in a unified manner. In this way, GLMs provide excellent applicability for a wide range of data types, including count data, binary data, and other types of skewed or non-continuous data. A GLM consists of three main components:

  1. Random Component: Specifies the probability distribution of the response variable YY. This is typically a member of the exponential family of distributions (e.g., normal, exponential, binomial, Poisson).

  2. Systematic Component: Represents the linear combination of the predictor variables, similar to ordinary linear regression. It is usually expressed as η=θX\eta=\theta^{\top}X.

  3. Link Function: Provides the relationship between the random and systematic components. The expected value of YY given X=xX=x, i.e., the regression function, is in one-to-one correspondence with the linear predictor η=θx\eta=\theta^{\top}x through the link function gg.

In the framework of the GLM, an exponential dispersion model is employed as

pexp(y,ω,ϕ)=exp{yωa(ω)ϕ+c(y,ϕ)},\displaystyle p_{\exp}(y,\omega,\phi)=\exp\Big{\{}\frac{y\omega-a(\omega)}{\phi}+c(y,\phi)\Big{\}},

with respect to a reference measure Λ\Lambda, where ω\omega and ϕ\phi are called the canonical and dispersion parameters, respectively; see [51]. Here we assume that ω\omega can take any value in (,)(-\infty,\infty). This allows for the linear modeling ω=g(θx)\omega=g(\theta^{\top}x) with a flexible form of the link function gg. In particular, if gg is the identity function, it is referred to as the canonical link function. This formulation covers most practical models in statistics, such as the logistic and log-linear models. In practice, the dispersion parameter ϕ\phi is usually estimated separately from θ\theta, and hence we assume ϕ=1\phi=1 is known for simplicity. This leads to a generalized linear model:

p(y|x,θ)=exp{yωa(ω)+c(y)}\displaystyle p(y|x,\theta)=\exp\big{\{}y\omega-a(\omega)+c(y)\big{\}} (2.3)

with ω=g(θx)\omega=g(\theta^{\top}x) as the conditional RN-derivative of YY given X=xX=x. The regression function is given by

𝔼[Y|X=x,θ]=a(g(θx))\displaystyle\mathbb{E}[Y|X=x,\theta]=a^{\prime}(g(\theta^{\top}x))

due to the Bartlett identity.
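The Bartlett-identity formula above can be checked numerically. The following sketch assumes a Poisson model with the canonical link (gg the identity), so that a(ω)=eωa(\omega)=e^{\omega} and c(y)=logy!c(y)=-\log y!; the values of θ\theta and xx are arbitrary illustrations.

```python
import math

# Illustrative check of E[Y|X=x, theta] = a'(g(theta^T x)) for a Poisson GLM
# with the canonical link (g the identity), where a(w) = exp(w) and
# c(y) = -log(y!).  We sum y * p(y) directly over a truncated range.

def poisson_pmf(y, w):
    # p(y|x, theta) = exp{y*w - a(w) + c(y)} with w = theta^T x
    return math.exp(y * w - math.exp(w) - math.lgamma(y + 1))

theta = [0.3, -0.2]
x = [1.0, 2.0]
w = sum(t * xi for t, xi in zip(theta, x))   # linear predictor

mean = sum(y * poisson_pmf(y, w) for y in range(200))
print(round(mean, 6), round(math.exp(w), 6))
```

The truncation at 200 terms is harmless here since the Poisson tail decays super-exponentially.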

Let us consider M-estimators for the parameter θ\theta in the linear model (2.3). The M-estimator was originally introduced to cover robust estimators of a location parameter; see [47] for breakthrough ideas in robust statistics and [83] for robust regression. We define an M-type loss function for the GLM defined in (2.3):

L¯Ψ(θ,𝒟)=1ni=1nΨ(Yi,θXi)\displaystyle\bar{L}_{\Psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}\Psi(Y_{i},\theta^{\top}X_{i}) (2.4)

for a given dataset 𝒟={(Xi,Yi)}i=1n{\cal D}=\{(X_{i},Y_{i})\}_{i=1}^{n} and we call

θ^Ψ:=argminθdL¯Ψ(θ,𝒟)\displaystyle\hat{\theta}_{\Psi}:=\mathop{\rm argmin}_{\theta\in\mathbb{R}^{d}}\bar{L}_{\Psi}(\theta,{\cal D})

the M-estimator. Here the generator function Ψ(y,s)\Psi(y,s) is assumed to be convex with respect to ss. If Ψ(y,s)=a(g(s))yg(s)\Psi(y,s)=a(g(s))-yg(s), then the M-estimator is nothing but the ML-estimator. Thus, the estimating function is given by

S¯ψ(θ,𝒟)=1ni=1nψ(Yi,θXi)Xi,\displaystyle\bar{S}_{\psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}\psi(Y_{i},\theta^{\top}X_{i})X_{i}, (2.5)

where ψ(y,s)=(/s)Ψ(y,s)\psi(y,s)=(\partial/\partial s)\Psi(y,s). If we confine the generator function Ψ\Psi to the form Ψ(ys)\Psi(y-s), then this formulation reduces to the original form of M-estimation [48, 93]. In general, the estimating function is characterized by ψ(Y,θX)\psi(Y,\theta^{\top}X). Hereafter we assume that

𝔼θ[ψ(Y,θX)|X=x]=0,\displaystyle\mathbb{E}_{\theta}[\psi(Y,\theta^{\top}X)|X=x]=0,

where 𝔼θ\mathbb{E}_{\theta} is the expectation under p(y|x,θ)p(y|x,\theta). This assumption leads to consistency of the estimator θ^Ψ\hat{\theta}_{\Psi}. We note that the relationship between the loss function and the estimating function is not one-to-one; indeed, there exist many choices of the estimating function that yield the estimator θ^Ψ\hat{\theta}_{\Psi} other than (2.5). We next give a geometric discussion of unbiased estimating functions.
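The conditional unbiasedness assumption above can be verified explicitly in the simplest case. The sketch below assumes a logistic GLM with the canonical link, under the sign convention that the loss is minimized, so Ψ(y,s)=a(s)ys\Psi(y,s)=a(s)-ys with a(s)=log(1+es)a(s)=\log(1+e^{s}) and ψ(y,s)=a(s)y\psi(y,s)=a^{\prime}(s)-y; the value of θx\theta^{\top}x is an arbitrary illustration.

```python
import math

# Check of E_theta[psi(Y, theta^T x) | X = x] = 0 for the logistic GLM with
# canonical link: psi(y, s) = a'(s) - y = sigmoid(s) - y, summed against the
# Bernoulli model probabilities.

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def psi(y, s):
    return sigmoid(s) - y

s = 0.7                    # theta^T x
p1 = sigmoid(s)            # P(Y = 1 | X = x, theta)
expected_psi = (1.0 - p1) * psi(0, s) + p1 * psi(1, s)
print(expected_psi)
```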

We investigate the behavior of the score function Sγ(x,y,θ){S}_{\gamma}(x,y,\theta) of the γ\gamma-estimator. By definition, the γ\gamma-estimator is the solution at which the sample mean of the score function equals zero. We write a linear predictor as Fθ(x)=θ1x+θ0F_{\theta}(x)=\theta_{1}^{\top}x+\theta_{0}, where θ=(θ0,θ1)\theta=(\theta_{0},\theta_{1}). We call

Hθ={xp:Fθ(x)=0}\displaystyle H_{\theta}=\{x\in\mathbb{R}^{p}:F_{\theta}(x)=0\} (2.6)

the prediction boundary. Then, the following formula is well known in Euclidean geometry.

Proposition 4.

Let x𝒳x\in{\cal X} and d(x,Hθ)d(x,H_{\theta}) be the Euclidean distance from xx to the prediction boundary HθH_{\theta} defined in (2.6). Then,

d(x,Hθ)=|Fθ(x)|θ1\displaystyle d(x,H_{\theta})=\frac{|F_{\theta}(x)|}{\|\theta_{1}\|} (2.7)
Proof.

Let xx^{*} be the projection of xx onto HθH_{\theta}. Then, d(x,Hθ)=xxd(x,H_{\theta})=\|x-x^{*}\|, where \|\cdot\| denotes the Euclidean norm. There exists a nonzero scalar τ\tau such that xx=τθ1x-x^{*}=\tau\theta_{1}, noting that a normal vector to the hyperplane HθH_{\theta} is given by θ1\theta_{1}. Hence, θ1(xx)=τθ12\theta_{1}^{\top}(x-x^{*})=\tau\|\theta_{1}\|^{2} and

d(x,Hθ)=|τ|θ1=|θ1(xx)|θ1,\displaystyle d(x,H_{\theta})=|\tau|\|\theta_{1}\|=\frac{|\theta_{1}^{\top}(x-x^{*})|}{\|\theta_{1}\|}, (2.8)

which concludes (2.7) since |θ1(xx)|=|Fθ(x)|{|\theta_{1}^{\top}(x-x^{*})|}=|F_{\theta}(x)| due to Fθ(x)=0F_{\theta}(x^{*})=0. ∎
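Proposition 4 can be checked numerically by comparing the closed-form distance (2.7) with the distance to the explicit projection. The numbers below are arbitrary illustrative values.

```python
import math

# Numerical check of Proposition 4: the distance from x to the prediction
# boundary H_theta = {x : theta_1^T x + theta_0 = 0} equals
# |F_theta(x)| / ||theta_1||.  We compare with the distance to the explicit
# projection x* = x - (F_theta(x) / ||theta_1||^2) theta_1.

theta1 = [3.0, -4.0]
theta0 = 2.0
x = [1.0, 1.0]

F = sum(a * b for a, b in zip(theta1, x)) + theta0   # F_theta(x)
norm1 = math.sqrt(sum(a * a for a in theta1))        # ||theta_1||

x_star = [xi - (F / norm1 ** 2) * ti for xi, ti in zip(x, theta1)]
dist_proj = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_star)))
print(abs(F) / norm1, dist_proj)
```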

Thus, a covariate vector XX of 𝒳\cal X is decomposed into the orthogonal and horizontal components as X=Zθ(X)+Wθ(X)X=Z_{\theta}(X)+W_{\theta}(X), where

Zθ(X)=θXθ2θandWθ(X)=XZθ(X).\displaystyle Z_{\theta}(X)=\frac{\theta^{\top}X}{\|\theta\|^{2}}\>\theta\ \ {\rm and}\ \ W_{\theta}(X)=X-Z_{\theta}(X). (2.9)

We note that Zθ(X)Wθ(X)=0Z_{\theta}(X)^{\top}W_{\theta}(X)=0 and X2=Zθ(X)2+Wθ(X)2\|X\|^{2}=\|Z_{\theta}(X)\|^{2}+\|W_{\theta}(X)\|^{2}. Due to the orthogonal decomposition (2.9) of XX, the estimating function is also decomposed into

Sψ(Y,X,θ)=Sψ(O)(Y,X,θ)+Sψ(H)(Y,X,θ),\displaystyle S_{\psi}(Y,X,\theta)=S^{\rm(O)}_{\psi}(Y,X,\theta)+S^{\rm(H)}_{\psi}(Y,X,\theta),

where

Sψ(O)(Y,X,θ)=ψ(Y,θZθ(X))Zθ(X),Sψ(H)(Y,X,θ)=ψ(Y,θZθ(X))Wθ(X).\displaystyle S^{\rm(O)}_{\psi}(Y,X,\theta)=\psi(Y,\theta^{\top}Z_{\theta}(X))Z_{\theta}(X),{\ \ }S^{\rm(H)}_{\psi}(Y,X,\theta)=\psi(Y,\theta^{\top}Z_{\theta}(X))W_{\theta}(X).

Here we use the property θZθ(X)=θX\theta^{\top}Z_{\theta}(X)=\theta^{\top}X. Thus, in Sψ(O)(Y,X,θ)S^{\rm(O)}_{\psi}(Y,X,\theta), ψ(Y,θZθ(X))\psi(Y,\theta^{\top}Z_{\theta}(X)) and Zθ(X)Z_{\theta}(X) are strongly connected to each other; in Sψ(H)(Y,X,θ)S^{\rm(H)}_{\psi}(Y,X,\theta), ψ(Y,θZθ(X))\psi(Y,\theta^{\top}Z_{\theta}(X)) and Wθ(X)W_{\theta}(X) are less connected.
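The identities behind the decomposition (2.9) are easy to verify numerically. The vectors below are arbitrary illustrative values.

```python
# Numerical check of (2.9): Z_theta(X) and W_theta(X) satisfy Z^T W = 0,
# ||X||^2 = ||Z||^2 + ||W||^2, and theta^T Z_theta(X) = theta^T X.

theta = [1.0, 2.0, -1.0]
X = [0.5, -1.5, 2.0]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

Z = [dot(theta, X) / dot(theta, theta) * t for t in theta]  # orthogonal part
W = [xi - zi for xi, zi in zip(X, Z)]                       # horizontal part

orth = dot(Z, W)                              # should be 0
pyth = dot(X, X) - dot(Z, Z) - dot(W, W)      # should be 0
proj = dot(theta, Z) - dot(theta, X)          # should be 0
print(orth, pyth, proj)
```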

The estimating function (2.5) is decomposed into a sum of the orthogonal and horizontal components,

S¯ψ(θ,𝒟)=S¯ψ(O)(θ,𝒟)+S¯ψ(H)(θ,𝒟),\displaystyle\bar{S}_{\psi}(\theta,{\cal D})=\bar{S}^{\rm(O)}_{\psi}(\theta,{\cal D})+\bar{S}^{\rm(H)}_{\psi}(\theta,{\cal D}),

where

S¯ψ(O)(θ,𝒟)=1ni=1nSψ(O)(Yi,Xi,θ),S¯ψ(H)(θ,𝒟)=1ni=1nSψ(H)(Yi,Xi,θ).\displaystyle\bar{S}^{\rm(O)}_{\psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}S^{\rm(O)}_{\psi}(Y_{i},X_{i},\theta),\ \ \bar{S}^{\rm(H)}_{\psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}S^{\rm(H)}_{\psi}(Y_{i},X_{i},\theta).

We consider a specific type of contamination in the covariate space 𝒳\cal X.

Proposition 5.

Let 𝒟={(Xi,Yi)}i=1n{\cal D}=\{(X_{i},Y_{i})\}_{i=1}^{n} and 𝒟={(Xi,Yi)}i=1n{\cal D}^{*}=\{(X^{*}_{i},Y_{i})\}_{i=1}^{n}, where Xi=Xi+σ(Xi)Wθ(Xi)X^{*}_{i}=X_{i}+\sigma(X_{i})W_{\theta}(X_{i}) with an arbitrary fixed scalar σ(Xi)\sigma(X_{i}) depending on XiX_{i}. Then, L¯Ψ(θ,𝒟)=L¯Ψ(θ,𝒟)\bar{L}_{\Psi}(\theta,{\cal D}^{*})=\bar{L}_{\Psi}(\theta,{\cal D}), S¯ψ(O)(θ,𝒟)=S¯ψ(O)(θ,𝒟)\bar{S}^{\rm(O)}_{\psi}(\theta,{\cal D}^{*})=\bar{S}^{\rm(O)}_{\psi}(\theta,{\cal D}) and

S¯ψ(H)(θ,𝒟)=1ni=1nψ(Yi,θZθ(Xi))(1+σ(Xi))Wθ(Xi).\displaystyle\bar{S}^{\rm(H)}_{\psi}(\theta,{\cal D}^{*})=\frac{1}{n}\sum_{i=1}^{n}\psi(Y_{i},\theta^{\top}Z_{\theta}(X_{i}))(1+\sigma(X_{i}))W_{\theta}(X_{i}).
Proof.

By definition, Zθ(Xi)=Zθ(Xi)Z_{\theta}(X^{*}_{i})=Z_{\theta}(X_{i}) and Wθ(Xi)=(1+σ(Xi))Wθ(Xi)W_{\theta}(X^{*}_{i})=(1+\sigma(X_{i}))W_{\theta}(X_{i}) due to Zθ(Xi)Wθ(Xi)=0Z_{\theta}(X_{i})^{\top}W_{\theta}(X_{i})=0. These imply the conclusion. ∎
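The invariance claim of Proposition 5 can be checked numerically. The sketch below uses the illustrative generator Ψ(y,s)=(ys)2/2\Psi(y,s)=(y-s)^{2}/2 (an assumed example, not the general Ψ\Psi of the text) and a synthetic dataset: contaminating each XiX_{i} along its horizontal component leaves the loss (2.4) unchanged since θXi=θXi\theta^{\top}X^{*}_{i}=\theta^{\top}X_{i}.

```python
# Numerical check of Proposition 5 with Psi(y, s) = (y - s)^2 / 2: the loss is
# invariant under contamination along the horizontal component W_theta(X_i).

theta = [1.0, -2.0]
data = [([1.0, 0.5], 0.2), ([-0.5, 1.5], -3.0), ([2.0, 2.0], -1.0)]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

def loss(dataset):
    return sum((y - dot(theta, x)) ** 2 / 2 for x, y in dataset) / len(dataset)

def contaminate(x, sigma):
    nt2 = dot(theta, theta)
    Z = [dot(theta, x) / nt2 * t for t in theta]      # orthogonal component
    W = [xi - zi for xi, zi in zip(x, Z)]             # horizontal component
    return [xi + sigma * wi for xi, wi in zip(x, W)]  # X* = X + sigma * W

data_star = [(contaminate(x, 5.0), y) for x, y in data]
print(loss(data), loss(data_star))
```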

We observe that L¯Ψ(θ,𝒟)\bar{L}_{\Psi}(\theta,{\cal D}^{*}) and S¯ψ(O)(θ,𝒟)\bar{S}_{\psi}^{(O)}(\theta,{\cal D}^{*}) are both unaffected by the contamination in {Xi}\{X_{i}^{*}\}. In contrast, S¯ψ(H)(θ,𝒟)\bar{S}_{\psi}^{(H)}(\theta,{\cal D}^{*}) is substantially influenced through the scalar multiplication. Hence, we can change the definition of the horizontal component as

S¯¯ψ(H)(θ,𝒟)=1ni=1nψ(Yi,θZθ(Xi))Wθ(Xi)Wθ(Xi)\displaystyle\bar{\bar{S}}^{\rm(H)}_{\psi}(\theta,{\cal D}^{*})=\frac{1}{n}\sum_{i=1}^{n}\psi(Y_{i},\theta^{\top}Z_{\theta}(X_{i}))\frac{W_{\theta}(X_{i})}{\|W_{\theta}(X_{i})\|}

by choosing σ(Xi)=Wθ(Xi)11\sigma(X_{i})={\|W_{\theta}(X_{i})\|}^{-1}-1. Then, it has a mild behavior in the sense that

S¯¯ψ(H)(θ,𝒟)1ni=1n|ψ(Yi,θZθ(Xi))|.\displaystyle\|\bar{\bar{S}}^{\rm(H)}_{\psi}(\theta,{\cal D}^{*})\|\leq\frac{1}{n}\sum_{i=1}^{n}|\psi(Y_{i},\theta^{\top}Z_{\theta}(X_{i}))|.

In this way, the estimating function (2.5) of M-estimator θ^Ψ\hat{\theta}_{\Psi} can be written as

S~ψ(θ,𝒟)=1ni=1nψ(Yi,θXi){Zθ(Xi)+Wθ(Xi)Wθ(Xi)}.\displaystyle\tilde{S}_{\psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}\psi(Y_{i},\theta^{\top}X_{i})\Big{\{}Z_{\theta}(X_{i})+\frac{W_{\theta}(X_{i})}{\|W_{\theta}(X_{i})\|}\Big{\}}. (2.10)
Proposition 6.

Assume there exists a constant cc such that

sup(y,s)𝒴×|ψ(y,s)s|=c.\displaystyle\sup_{(y,s)\in{\cal Y}\times\mathbb{R}}|\psi(y,s)s|=c.

Then, the estimating function S~ψ(θ,𝒟)\tilde{S}_{\psi}(\theta,{\cal D}) in (2.10) of the M-estimator θ^Ψ\hat{\theta}_{\Psi} is bounded with respect to any dataset 𝒟\cal D.

Proof.

It follows from the assumption that the constant c1=sup(y,s)𝒴×|ψ(y,s)|c_{1}=\sup_{(y,s)\in{\cal Y}\times\mathbb{R}}|\psi(y,s)| is finite, since

|ψ(y,s)|sup(y,s)𝒴×[1,1]|ψ(y,s)|+sup(y,s)𝒴×|ψ(y,s)s|.\displaystyle|\psi(y,s)|\leq\sup_{(y,s)\in{\cal Y}\times[-1,1]}|\psi(y,s)|+\sup_{(y,s)\in{\cal Y}\times\mathbb{R}}|\psi(y,s)s|.

Therefore, we observe

sup𝒟S~ψ(θ,𝒟)\displaystyle\sup_{\cal D}\|\tilde{S}_{\psi}(\theta,{\cal D})\| 1ni=1n|ψ(Yi,θXi)|{Zθ(Xi)+1}\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}|\psi(Y_{i},\theta^{\top}X_{i})|\big{\{}\|Z_{\theta}(X_{i})\|+1\big{\}}
1θsup(y,s)𝒴×|ψ(y,s)s|+sup(y,s)𝒴×|ψ(y,s)|,\displaystyle\leq\frac{1}{\|\theta\|}\sup_{(y,s)\in{\cal Y}\times\mathbb{R}}|\psi(y,s)s|+\sup_{(y,s)\in{\cal Y}\times\mathbb{R}}|\psi(y,s)|,

which is equal to c/θ+c1c/\|\theta\|+c_{1}. ∎

On the other hand, suppose another type of contamination 𝒟={(Xi,Yi)}i=1n{\cal D}^{**}=\{(X^{**}_{i},Y_{i})\}_{i=1}^{n}, where Xi=Xi+τ(Xi)Zθ(Xi)X^{**}_{i}=X_{i}+\tau(X_{i})Z_{\theta}(X_{i}) with a fixed scalar τ(Xi)\tau(X_{i}) depending on XiX_{i}. Then, L¯Ψ(θ,𝒟)\bar{L}_{\Psi}(\theta,{\cal D}^{**}) and S¯ψ(O)(θ,𝒟)\bar{S}_{\psi}^{(O)}(\theta,{\cal D}^{**}) are both strongly influenced, while S¯ψ(H)(θ,𝒟)\bar{S}_{\psi}^{(H)}(\theta,{\cal D}^{**}) is unaffected.

The ML-estimator is a standard estimator defined by maximization of the likelihood for a given dataset {(Xi,Yi)}i=1n\{(X_{i},Y_{i})\}_{i=1}^{n}. Explicitly, the negative log-likelihood function is given by

L0(θ;Λ)=1ni=1n{Yig(θXi)a(g(θXi))+c(Yi)}.\displaystyle L_{0}(\theta;\Lambda)=-\frac{1}{n}\sum_{i=1}^{n}\{Y_{i}g(\theta^{\top}X_{i})-a(g(\theta^{\top}X_{i}))+c(Y_{i})\}.

The likelihood estimating function is given by

S0(θ;Λ)=1ni=1n{Yia(g(θXi))}g(θXi)Xi.\displaystyle{S}_{0}(\theta;\Lambda)=-\frac{1}{n}\sum_{i=1}^{n}\{Y_{i}-a^{\prime}(g(\theta^{\top}X_{i}))\}g^{\prime}(\theta^{\top}X_{i})X_{i}. (2.11)

Here the regression parameter θ\theta is of our main interest. We note that the ML-estimator θ^0\hat{\theta}_{0} can be obtained without the nuisance parameter ϕ\phi even if it is unknown. In practice, there are methods for estimating ϕ\phi using the deviance and the Pearson χ2\chi^{2} statistic when ϕ\phi is unknown. The expected value of the negative log-likelihood conditional on X¯=(X1,,Xn)\underline{X}=(X_{1},...,X_{n}) is given by

𝔼[L0(θ;Λ)|X¯]=1ni=1n{a(g(θXi))g(θXi)a(g(θXi))}\displaystyle\mathbb{E}[L_{0}(\theta;\Lambda)|\underline{X}]=-\frac{1}{n}\sum_{i=1}^{n}\{a^{\prime}(g(\theta^{\top}X_{i}))g(\theta^{\top}X_{i})-a(g(\theta^{\top}X_{i}))\}

up to a constant since the conditional expectation is given by 𝔼[Y|X=x]=a(g(θx))\mathbb{E}[Y|X=x]=a^{\prime}(g(\theta^{\top}x)) due to a basic property of the exponential dispersion model (2.3).

2.3 The γ\gamma-loss function and its variants

Let us discuss the γ\gamma-divergence in the framework of the regression model, based on the discussion in the general distribution setting of the preceding section. The γ\gamma-divergence is given by

Dγ(P(|X,θ0),P(|X,θ1);Λ)=Hγ(P(|X,θ0),P(|X,θ1);Λ)Hγ(P(|X,θ0),P(|X,θ0);Λ)\displaystyle D_{\gamma}(P(\cdot|X,\theta_{0}),P(\cdot|X,\theta_{1});\Lambda)=H_{\gamma}(P(\cdot|X,\theta_{0}),P(\cdot|X,\theta_{1});\Lambda)-H_{\gamma}(P(\cdot|X,\theta_{0}),P(\cdot|X,\theta_{0});\Lambda)

with the cross entropy,

Hγ(P(|X,θ0),P(|X,θ1);Λ)=1γ𝒴p(y|X,θ0){p(γ)(y|X,θ1)}γγ+1dΛ(y).\displaystyle H_{\gamma}(P(\cdot|X,\theta_{0}),P(\cdot|X,\theta_{1});\Lambda)=-\frac{1}{\gamma}\int_{\mathcal{Y}}{p(y|X,\theta_{0})}\{{p^{(\gamma)}(y|X,\theta_{1})}\}^{\frac{\gamma}{\gamma+1}}{\rm d}\Lambda(y).

The loss function derived from the γ\gamma-divergence is

Lγ(θ;Λ)=1n1γi=1n{p(γ)(Yi|Xi,θ)}γγ+1,\displaystyle L_{\gamma}(\theta;\Lambda)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\{p^{(\gamma)}(Y_{i}|X_{i},\theta)\}^{\frac{\gamma}{\gamma+1}},

where p(γ)(y|x,θ)p^{(\gamma)}(y|x,\theta) is the γ\gamma-expression of p(y|x,θ)p(y|x,\theta), that is

p(γ)(y|x,θ)={p(y|x,θ)}γ+1{p(y~|x,θ)}γ+1dΛ(y~).\displaystyle p^{(\gamma)}(y|x,\theta)=\frac{\{p(y|x,\theta)\}^{\gamma+1}}{\int\{p(\tilde{y}|x,\theta)\}^{\gamma+1}{\rm d}\Lambda(\tilde{y})}. (2.12)

We define the γ\gamma-estimator for the parameter θ\theta by θ^γ=argminθΘLγ(θ;Λ)\hat{\theta}_{\gamma}=\mathop{\rm argmin}_{\theta\in\Theta}L_{\gamma}(\theta;\Lambda). By definition, the γ\gamma-estimator reduces to the ML-estimator in the limit as γ\gamma tends to 00.
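This limiting behavior can be illustrated numerically. The sketch below assumes a Bernoulli logistic model with a scalar covariate and the counting measure as Λ\Lambda; the dataset and the grid search are synthetic illustrations. For small γ\gamma, the grid minimizer of the γ\gamma-loss matches that of the negative log-likelihood.

```python
import math

# Sketch: gamma-loss for a Bernoulli logistic model p(1|x) = sigmoid(theta*x),
# built from the gamma-expression (2.12) under the counting measure.  For
# gamma near 0, the grid minimizer agrees with the ML grid minimizer.

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

data = [(1.0, 1), (2.0, 1), (-1.0, 0), (0.5, 1), (-2.0, 0), (1.5, 0)]

def p(y, x, th):
    p1 = sigmoid(th * x)
    return p1 if y == 1 else 1.0 - p1

def gamma_loss(th, gamma):
    total = 0.0
    for x, y in data:
        denom = p(0, x, th) ** (gamma + 1) + p(1, x, th) ** (gamma + 1)
        p_gam = p(y, x, th) ** (gamma + 1) / denom   # gamma-expression (2.12)
        total += p_gam ** (gamma / (gamma + 1))
    return -total / (gamma * len(data))

def log_loss(th):
    return -sum(math.log(p(y, x, th)) for x, y in data) / len(data)

grid = [i / 100 for i in range(-300, 301)]
th_gamma = min(grid, key=lambda th: gamma_loss(th, 1e-4))
th_ml = min(grid, key=log_loss)
print(th_gamma, th_ml)
```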

Remark 1.

Let us discuss the behavior of the γ\gamma-loss function as |γ||\gamma| becomes large in the case where the outcome YY takes values in a finite discrete set 𝒴\cal Y. For simplicity, we define the loss function as

Lγ(θ;Λ)=sign(γ)i=1n{p(γ)(Yi|Xi,θ)}γγ+1.\displaystyle L_{\gamma}(\theta;\Lambda)=-{\rm sign}(\gamma)\sum_{i=1}^{n}\{p^{(\gamma)}(Y_{i}|X_{i},\theta)\}^{\frac{\gamma}{\gamma+1}}.

Let f(x,θ)=argmaxy𝒴p(y|x,θ)f(x,\theta)=\mathop{\rm argmax}_{y\in{\cal Y}}p(y|x,\theta) and g(x,θ)=argminy𝒴p(y|x,θ)g(x,\theta)=\mathop{\rm argmin}_{y\in{\cal Y}}p(y|x,\theta). Then, the γ\gamma-expression satisfies

p()(y|x,θ)\displaystyle p^{(\infty)}(y|x,\theta) :=limγp(γ)(y|x,θ)\displaystyle:=\lim_{\gamma\rightarrow\infty}p^{(\gamma)}(y|x,\theta)
=limγ{p(y|x,θ)/maxy𝒴p(y|x,θ)}γ+1y~𝒴{p(y~|x,θ)/maxy𝒴p(y|x,θ)}γ+1\displaystyle=\lim_{\gamma\rightarrow\infty}\frac{\{p(y|x,\theta)/\max_{y^{*}\in{\cal Y}}p(y^{*}|x,\theta)\}^{\gamma+1}}{\sum_{\tilde{y}\in{\cal Y}}\{p(\tilde{y}|x,\theta)/\max_{y^{*}\in{\cal Y}}p(y^{*}|x,\theta)\}^{\gamma+1}}
=I(y=f(x,θ))\displaystyle={\rm I}(y=f(x,\theta))

Similarly,

p()(y|x,θ)\displaystyle p^{(-\infty)}(y|x,\theta) :=limγp(γ)(y|x,θ)\displaystyle:=\lim_{\gamma\rightarrow-\infty}p^{(\gamma)}(y|x,\theta)
=limγ{miny𝒴p(y|x,θ)/p(y|x,θ)}γ1y~𝒴{miny𝒴p(y|x,θ)/p(y~|x,θ)}γ1\displaystyle=\lim_{\gamma\rightarrow-\infty}\frac{\{\min_{y^{*}\in{\cal Y}}p(y^{*}|x,\theta)/p(y|x,\theta)\}^{-\gamma-1}}{\sum_{\tilde{y}\in{\cal Y}}\{\min_{y^{*}\in{\cal Y}}p(y^{*}|x,\theta)/p(\tilde{y}|x,\theta)\}^{-\gamma-1}}
=I(y=g(x,θ))\displaystyle={\rm I}(y=g(x,\theta))

Hence, L(θ;Λ)L_{\infty}(\theta;\Lambda) is equivalent to the 0-1 loss function i=1nI(Yif(Xi,θ))\sum_{i=1}^{n}{\rm I}(Y_{i}\neq f(X_{i},\theta)); while

L(θ;Λ)=i=1nI(Yi=g(Xi,θ))\displaystyle L_{-\infty}(\theta;\Lambda)=\sum_{i=1}^{n}{\rm I}(Y_{i}=g(X_{i},\theta)) (2.13)

This is the number of YiY_{i}’s equal to the worst predictor g(Xi,θ)g(X_{i},\theta). If we focus on the case 𝒴={0,1}{\cal Y}=\{0,1\}, then L(θ;Λ)L_{-\infty}(\theta;\Lambda) is nothing but the 0-1 loss function since I(y=g(x,θ))=I(yf(x,θ)){\rm I}(y=g(x,\theta))={\rm I}(y\neq f(x,\theta)). In principle, minimization of the 0-1 loss is hard due to its non-differentiability. The γ\gamma-loss function smoothly connects the log-loss and the 0-1 loss without this computational challenge. See [31, 71] for a detailed discussion of 0-1 loss optimization.
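The two limits in Remark 1 can be illustrated with a small discrete distribution (the pmf below is an arbitrary illustration): as γ\gamma grows the γ\gamma-expression concentrates on the mode f(x,θ)f(x,\theta), and as γ\gamma\to-\infty it concentrates on the anti-mode g(x,θ)g(x,\theta).

```python
# Numerical illustration of the limits of the gamma-expression for a 3-point
# pmf on Y = {0, 1, 2}: large positive gamma picks out the mode, large
# negative gamma picks out the anti-mode.

p = [0.5, 0.3, 0.2]  # a conditional pmf, mode at y=0, anti-mode at y=2

def p_gamma(p, gamma):
    w = [q ** (gamma + 1) for q in p]
    s = sum(w)
    return [x / s for x in w]

big = p_gamma(p, 60.0)      # approximates the indicator of the mode
small = p_gamma(p, -60.0)   # approximates the indicator of the anti-mode
print([round(q, 4) for q in big], [round(q, 4) for q in small])
```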

In the subsequent discussion, the γ\gamma-expression will play an important role in clarifying the statistical properties of the γ\gamma-estimator. In fact, the γ\gamma-expression function is a counterpart of the log model function logp(y|x,θ)\log p(y|x,\theta) in L0(θ;Λ)L_{0}(\theta;\Lambda). We note, as one of the most basic properties, that

1γ𝔼0[{p(γ)(Y|X,θ)}γγ+1|X=x]=Hγ(P(|x,θ0),P(|x,θ);Λ),\displaystyle-\frac{1}{\gamma}\mathbb{E}_{0}\Big{[}\{p^{(\gamma)}(Y|X,\theta)\}^{\frac{\gamma}{\gamma+1}}|X=x\Big{]}=H_{\gamma}(P(\cdot|x,\theta_{0}),P(\cdot|x,\theta);\Lambda), (2.14)

Equation (2.14) yields

𝔼0[Lγ(θ;Λ)|X¯]=1ni=1nHγ(P(|Xi,θ0),P(|Xi,θ);Λ).\displaystyle\mathbb{E}_{0}[L_{\gamma}(\theta;\Lambda)|\underline{X}]=\frac{1}{n}\sum_{i=1}^{n}H_{\gamma}(P(\cdot|X_{i},\theta_{0}),P(\cdot|X_{i},\theta);\Lambda).

Hence,

𝔼0[Lγ(θ;Λ)|X¯]𝔼0[Lγ(θ0;Λ)|X¯]=1ni=1nDγ(P(|Xi,θ0),P(|Xi,θ);Λ).\displaystyle\mathbb{E}_{0}[L_{\gamma}(\theta;\Lambda)|\underline{X}]-\mathbb{E}_{0}[L_{\gamma}(\theta_{0};\Lambda)|\underline{X}]=\frac{1}{n}\sum_{i=1}^{n}D_{\gamma}(P(\cdot|X_{i},\theta_{0}),P(\cdot|X_{i},\theta);\Lambda). (2.15)

This implies

θ0=argminθΘ𝔼0[Lγ(θ;Λ)|X¯].\displaystyle\theta_{0}=\mathop{\rm argmin}_{\theta\in\Theta}\mathbb{E}_{0}[L_{\gamma}(\theta;\Lambda)|\underline{X}].

Thus, by a discussion similar to that for the ML-estimator and the KL-divergence, we observe that θ^γ\hat{\theta}_{\gamma} is consistent for θ0\theta_{0}. The γ\gamma-estimating function is defined by

Sγ(θ;Λ)=θLγ(θ;Λ).\displaystyle{S}_{\gamma}(\theta;\Lambda)=\frac{\partial}{\partial\theta}L_{\gamma}(\theta;\Lambda).

Then, we have a basic property that the γ\gamma-estimating function should satisfy in general.

Proposition 7.

The true value of the parameter is the solution of the expected γ\gamma-estimating equation under the expectation of the true distribution. That is, if θ=θ0\theta=\theta_{0},

𝔼0[Sγ(θ;Λ)|X¯]=0,\displaystyle\mathbb{E}_{0}[{S}_{\gamma}(\theta;\Lambda)|\underline{X}]=0, (2.16)

where 𝔼0\mathbb{E}_{0} is the conditional expectation under the true distribution P(|Xi,θ0)P(\cdot|X_{i},\theta_{0})’s given X¯\underline{X}.

Proof.

By definition,

Sγ(θ;Λ)=1n1γ+1i=1n{p(γ)(Yi|Xi,θ)}1γ+1θp(γ)(Yi|Xi,θ).\displaystyle{S}_{\gamma}(\theta;\Lambda)=-\frac{1}{n}\frac{1}{\gamma+1}\sum_{i=1}^{n}\{p^{(\gamma)}(Y_{i}|X_{i},\theta)\}^{-\frac{1}{\gamma+1}}\frac{\partial}{\partial\theta}p^{(\gamma)}(Y_{i}|X_{i},\theta).

Here we note

{p(γ)(Yi|Xi,θ)}1γ+1=1p(Yi|Xi,θ)\displaystyle\{p^{(\gamma)}(Y_{i}|X_{i},\theta)\}^{-\frac{1}{\gamma+1}}=\frac{1}{p(Y_{i}|X_{i},\theta)}

up to a proportionality constant. Hence,

𝔼0[Sγ(θ;Λ)|X¯]i=1n𝒴p(y|Xi,θ0)p(y|Xi,θ)θp(γ)(y|Xi,θ)dΛ(y).\displaystyle\mathbb{E}_{0}[{S}_{\gamma}(\theta;\Lambda)|\underline{X}]\propto\sum_{i=1}^{n}\int_{\mathcal{Y}}\frac{p(y|X_{i},\theta_{0})}{p(y|X_{i},\theta)}\frac{\partial}{\partial\theta}p^{(\gamma)}(y|X_{i},\theta){\rm d}\Lambda(y).

If θ=θ0\theta=\theta_{0}, then this vanishes identically due to the total mass one of p(γ)(y|Xi,θ0)p^{(\gamma)}(y|X_{i},\theta_{0}). ∎

The γ\gamma-estimator θ^γ\hat{\theta}_{\gamma} is a solution of the estimating equation, while the true value θ0\theta_{0} is the solution of the expected estimating equation under the true distribution with θ0\theta_{0}. As before, this shows the consistency of the γ\gamma-estimator for the true value of the parameter. The γ\gamma-estimating function Sγ(θ;Λ){S}_{\gamma}(\theta;\Lambda) is said to be unbiased in the sense of (2.16). Such an unbiasedness property leads to the consistency of the estimator. However, if the underlying distribution is misspecified, then we have to evaluate the expectation in (2.16) under the misspecified distribution rather than the true distribution. Thus, the unbiasedness property generally breaks down, and the Euclidean norm of the estimating function may diverge in the worst case. We will investigate such behaviors in misspecified situations later.

Now, we consider the MDEs via the GM and HM divergences introduced in Chapter 1. First, consider the loss function defined by the GM-divergence:

LGM(θ;R)=1ni=1nr(Yi)p(Yi|Xi,θ)exp{logp(y|Xi,θ)𝑑R(y)},\displaystyle L_{\rm GM}(\theta;R)=\frac{1}{n}\sum_{i=1}^{n}\frac{r(Y_{i})}{p(Y_{i}|X_{i},\theta)}\exp\Big{\{}\int\log p(y|X_{i},\theta)dR(y)\Big{\}},

where RR is the reference probability measure in 𝒫(x){\mathcal{P}}(x). We define as θ^GM=argminθΘLGM(θ;R)\hat{\theta}_{\rm GM}=\mathop{\rm argmin}_{\theta\in\Theta}L_{\rm GM}(\theta;R), which we refer to as the GM-estimator. The GM\rm GM-estimating equation is given by

SGM(θ;R)\displaystyle{S}_{\rm GM}(\theta;R) :=1ni=1nr(Yi)p(Yi|Xi,θ)exp{logp(y|Xi,θ)𝑑R(y)}\displaystyle:=\frac{1}{n}\sum_{i=1}^{n}\frac{r(Y_{i})}{p(Y_{i}|X_{i},\theta)}\exp\Big{\{}\int\log p(y|X_{i},\theta)dR(y)\Big{\}}
×{S(Yi|Xi,θ)S(y|Xi,θ)dR(y)}=0,\displaystyle\times\Big{\{}S(Y_{i}|X_{i},\theta)-\int S(y|X_{i},\theta)dR(y)\Big{\}}=0, (2.17)

where S(y|x,θ)=(/θ)logp(y|x,θ)S(y|x,\theta)=(\partial/\partial\theta)\log p(y|x,\theta). Secondly, consider the loss function defined by the HM-divergence:

LHM(θ)=12ni=1n{p(2)(Yi|Xi,θ)}2.\displaystyle L_{\rm HM}(\theta)=\frac{1}{2n}\sum_{i=1}^{n}\{p^{(-2)}(Y_{i}|X_{i},\theta)\}^{2}.

The (2)(-2)-model can be viewed as an inverse-probability-weighted model on account of

p(2)(y|x,θ)=1p(y|x,θ)j=0k1p(j|x,θ)\displaystyle p^{(-2)}(y|x,\theta)=\frac{\small\displaystyle\frac{1}{p(y|x,\theta)}}{\displaystyle{\large\mbox{$\sum_{j=0}^{k}$}}\ \frac{1}{p(j|x,\theta)}}

We define the HM estimator by θ^HM=argminθΘLHM(θ)\hat{\theta}_{\rm HM}=\mathop{\rm argmin}_{\theta\in\Theta}L_{\rm HM}(\theta). The HM\rm HM-estimating equation is given by

SHM(θ)=1ni=1np(2)(Yi|Xi,θ)θp(2)(Yi|Xi,θ)\displaystyle{S}_{\rm HM}(\theta)=\frac{1}{n}\sum_{i=1}^{n}p^{(-2)}(Y_{i}|X_{i},\theta)\frac{\partial}{\partial\theta}p^{(-2)}(Y_{i}|X_{i},\theta)

We note from the discussion in Section 2 that LGM(θ;R)L_{\rm GM}(\theta;R) and SGM(θ;R){S}_{\rm GM}(\theta;R) are equal to Lγ(θ;R)L_{\gamma}(\theta;R) and Sγ(θ;R){S}_{\gamma}(\theta;R) with γ=1\gamma=-1; LHM(θ)L_{\rm HM}(\theta) and SHM(θ){S}_{\rm HM}(\theta) are equal to Lγ(θ;C)L_{\gamma}(\theta;C) and Sγ(θ;C){S}_{\gamma}(\theta;C) with γ=2\gamma=-2. We will discuss the dependence on the reference measure RR, where we would like to elucidate which choice of RR gives a reasonable performance in the presence of possible model misspecification.

We focus on the GLM framework, in which we look into the formula for the γ\gamma-divergence. Careful attention should then be paid to the choice of the reference measure in the γ\gamma-divergence. The original reference measure Λ\Lambda is changed to RR such that R/Λ(y)=exp{c(y)}\partial R/\partial\Lambda(y)=\exp\{c(y)\}. Hence, the model is given by p(y|x,ω)=exp{yωa(ω)}p(y|x,\omega)=\exp\{y\omega-a(\omega)\} with respect to RR. We note that RR is a probability measure since its RN-derivative is equal to p(y|x,θ)p(y|x,\theta) defined in (2.3) when θ\theta is a zero vector. This makes the model more mathematically tractable and allows us to use standard statistical methods for estimation and inference. Then, the γ\gamma-expression for p(y|x,ω)p(y|x,\omega) is given by

p(γ)(y|x,ω)=p(y|x,(γ+1)ω).\displaystyle p^{(\gamma)}(y|x,\omega)=p(y|x,(\gamma+1)\omega). (2.18)

This reflexiveness property is convenient for the analysis based on the γ\gamma-divergence. First of all, the γ\gamma-loss function is given by

Lγ(θ;R)=1n1γi=1nexp{γYig(θXi)γγ+1a((γ+1)g(θXi))}\displaystyle L_{\gamma}(\theta;R)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\exp\Big{\{}\gamma\,Y_{i}\,g(\theta^{\top}X_{i})-\frac{\gamma}{\gamma+1}a\big{(}(\gamma+1)g(\theta^{\top}X_{i})\big{)}\Big{\}} (2.19)

due to the γ\gamma-expression (2.18). The γ\gamma-estimating function is given by

Sγ(θ;R)=\displaystyle{S}_{\gamma}(\theta;R)= 1ni=1nexp{γYig(θXi)γγ+1a((γ+1)g(θXi))}\displaystyle\frac{1}{n}\sum_{i=1}^{n}\exp\Big{\{}\gamma\,Y_{i}\,g(\theta^{\top}X_{i})-\frac{\gamma}{\gamma+1}a\big{(}(\gamma+1)g(\theta^{\top}X_{i})\big{)}\Big{\}} (2.20)
×{Yia((γ+1)g(θXi))}g(θXi)Xi.\displaystyle\times\{Y_{i}-a^{\prime}\big{(}(\gamma+1)g(\theta^{\top}X_{i})\big{)}\}g^{\prime}(\theta^{\top}X_{i})X_{i}.

We note that the change of the reference measure from Λ\Lambda to RR is the key for the minimum γ\gamma-divergence estimation. In fact, the γ\gamma-loss function would not have a closed form such as (2.19) unless the reference measure were changed. Here, we remark that the γ\gamma-loss function is a specific example of the M-type loss function L¯Ψ(θ)\bar{L}_{\Psi}(\theta) in (2.4) with the relationship

Ψ(y,s)=1γexp{γyg(s)γγ+1a((γ+1)g(s))}.\displaystyle\Psi(y,s)=-\frac{1}{\gamma}\exp\Big{\{}\gamma yg(s)-\frac{\gamma}{\gamma+1}a\big{(}(\gamma+1)g(s)\big{)}\Big{\}}.

The expected γ\gamma loss function is given by

𝔼0[Lγ(θ;R)|X¯]=1ni=1nexp{a(γg(θXi))γγ+1a((γ+1)g(θXi))},\displaystyle\mathbb{E}_{0}[L_{\gamma}(\theta;R)|\underline{X}]=-\frac{1}{n}\sum_{i=1}^{n}\exp\Big{\{}a\big{(}\gamma g(\theta^{\top}X_{i})\big{)}-\frac{\gamma}{\gamma+1}a\big{(}(\gamma+1)g(\theta^{\top}X_{i})\big{)}\Big{\}},

where 𝔼0\mathbb{E}_{0} denotes the expectation under the true distribution P(|x,θ0)P(\cdot|x,\theta_{0}). This function attains a global minimum at θ=θ0\theta=\theta_{0}, as discussed around (2.15) in the general framework. Similarly, the GM-loss function is written as

LGM(θ;R)=1ni=1nr(Yi)exp{(YiμR)g(θXi)+a(g(θXi))},\displaystyle L_{\rm GM}(\theta;R)=\frac{1}{n}\sum_{i=1}^{n}{r(Y_{i})}\exp\big{\{}-(Y_{i}-\mu_{R})\,g(\theta^{\top}X_{i})+a\big{(}g(\theta^{\top}X_{i})\big{)}\big{\}},

where μR=yr(y)dΛ(y)\mu_{R}=\int yr(y){\rm d}\Lambda(y). The HM loss function is written as

LHM(θ)=12ni=1nexp{2Yig(θXi)2a(g(θXi))}.\displaystyle L_{\rm HM}(\theta)=\frac{1}{2n}\sum_{i=1}^{n}\exp\{-2Y_{i}\,g(\theta^{\top}X_{i})-2a\big{(}\!\!-\!g(\theta^{\top}X_{i})\big{)}\}.

since the γ\gamma-expression becomes p(γ)(y|x,θ)=exp{yg(θx)a(g(θx))}p^{(\gamma)}(y|x,\theta)=\exp\{-y\,g(\theta^{\top}x)-a\big{(}\!\!-\!g(\theta^{\top}x)\big{)}\} when γ=2\gamma=-2. In accordance with these, all the formulas for the loss functions defined in the general model (2.1) carry over naturally to the GLM. Subsequently, we proceed to specific GLMs to discuss deeper properties.
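The γ\gamma-expression identity (2.18) can be checked exactly in the simplest case. The sketch below assumes a Bernoulli model written with respect to the counting measure (so c(y)=0c(y)=0 and RR is the counting measure), with p(y|ω)=exp{yωa(ω)}p(y|\omega)=\exp\{y\omega-a(\omega)\} and a(ω)=log(1+eω)a(\omega)=\log(1+e^{\omega}).

```python
import math

# Check of (2.18) for the Bernoulli model: normalizing p^(gamma+1) reproduces
# p(y | (gamma+1) * w) exactly.

def p(y, w):
    return math.exp(y * w - math.log(1.0 + math.exp(w)))

w, gamma = 0.8, 1.7
denom = p(0, w) ** (gamma + 1) + p(1, w) ** (gamma + 1)
lhs = [p(y, w) ** (gamma + 1) / denom for y in (0, 1)]  # gamma-expression
rhs = [p(y, (gamma + 1) * w) for y in (0, 1)]           # tilted model
print(lhs, rhs)
```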

We have discussed the generalization of the γ\gamma-divergence in the preceding section. The generalized divergence DV(P,Q;Λ)D_{V}(P,Q;\Lambda) defined in (1.18) in Chapter 1 yields the loss function

LV(θ;Λ)=1ni=1nV(v(z(θ,Xi)p(Yi|Xi,θ))),\displaystyle L_{V}(\theta;\Lambda)=-\frac{1}{n}\sum_{i=1}^{n}V(v^{*}(z(\theta,X_{i})p(Y_{i}|X_{i},\theta))),

where z(θ,Xi)z(\theta,X_{i}) is a normalizing factor satisfying

𝒴v(z(θ,Xi)p(y|Xi,θ))dΛ(y)=1.\displaystyle\int_{\mathcal{Y}}v^{*}(z(\theta,X_{i})p(y|X_{i},\theta)){\rm d}\Lambda(y)=1. (2.21)

A discussion similar to the above yields

𝔼0[LV(θ;Λ)|X¯]𝔼0[LV(θ0;Λ)|X¯]=1ni=1nDV(P(|Xi,θ0),P(|Xi,θ);Λ).\displaystyle\mathbb{E}_{0}[L_{V}(\theta;\Lambda)|\underline{X}]-\mathbb{E}_{0}[L_{V}(\theta_{0};\Lambda)|\underline{X}]=\frac{1}{n}\sum_{i=1}^{n}D_{V}(P(\cdot|X_{i},\theta_{0}),P(\cdot|X_{i},\theta);\Lambda).

The estimating function is written as

SV(θ;Λ)=1ni=1n1z(θ,Xi)p(Yi|Xi,θ)θv(z(θ,Xi)p(Yi|Xi,θ)).\displaystyle{S}_{V}(\theta;\Lambda)=-\frac{1}{n}\sum_{i=1}^{n}\frac{1}{z(\theta,X_{i})p(Y_{i}|X_{i},\theta)}\frac{\partial}{\partial\theta}v^{*}(z(\theta,X_{i})p(Y_{i}|X_{i},\theta)).

due to assumption (1.19). This implies

𝔼0[SV(θ0;Λ)|X¯]=1ni=1n1z(θ0,Xi)θv(z(θ0,Xi)p(y|Xi,θ))dΛ(y)\displaystyle\mathbb{E}_{0}[{S}_{V}(\theta_{0};\Lambda)|\underline{X}]=-\frac{1}{n}\sum_{i=1}^{n}\frac{1}{z(\theta_{0},X_{i})}\int\frac{\partial}{\partial\theta}v^{*}(z(\theta_{0},X_{i})p(y|X_{i},\theta)){\rm d}\Lambda(y)

which vanishes since each v(z(θ0,Xi)p(y|Xi,θ))v^{*}(z(\theta_{0},X_{i})p(y|X_{i},\theta)) has total mass one as in (2.21). Consequently, we can derive the MD estimator based on the generalized divergence DV(P,Q;Λ)D_{V}(P,Q;\Lambda), with the γ\gamma-divergence as the standard case. In Chapter 3, we will consider another candidate of DV(P,Q;Λ)D_{V}(P,Q;\Lambda) for estimation under a Poisson point process model.

2.4 Normal linear regression

Linear regression, one of the most familiar and widely used statistical techniques, dates back to the 19th century in the mathematical formulation by Carl Friedrich Gauss [96]. The term originated from Francis Galton’s eminent observation on regression towards the mean at the beginning of the 20th century. The ordinary least squares method has since evolved with the advancement of statistical theory and computational methods. As the application of linear regression expanded, statisticians recognized its sensitivity to outliers. Outliers can significantly influence the regression model’s estimates, leading to misleading results. To address these limitations, robust regression methods were developed. These methods aim to provide estimates that are less affected by outliers or violations of model assumptions such as normality of errors or homoscedasticity.

Let YY be an outcome variable in \mathbb{R} and XX be a covariate vector in a subset 𝒳{\mathcal{X}} of d\mathbb{R}^{d}. Assume the conditional probability density function (pdf) of YY given X=xX=x as

p(y|X=x,θ,σ2)=12πσ2exp{12(yθx)2σ2}.\displaystyle p(y|X=x,\theta,\sigma^{2})=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\Big{\{}-\small\mbox{$\frac{1}{2}$}\frac{(y-\theta^{\top}x)^{2}}{\sigma^{2}}\Big{\}}. (2.22)

The normal linear regression model (2.22) is one of the simplest examples of a GLM, with the identity link function and dispersion parameter σ\sigma. Indeed, σ\sigma is a crucial parameter for assessing model fit; we will discuss its estimation later. The KL-divergence between normal distributions is given by

D0(𝙽𝚘𝚛(μ0,σ02),𝙽𝚘𝚛(μ1,σ12))=12(μ1μ0)2σ12+12(σ02σ12logσ02σ121).\displaystyle D_{0}({\tt Nor}(\mu_{0},\sigma_{0}^{2}),{\tt Nor}(\mu_{1},\sigma_{1}^{2}))=\small\mbox{$\frac{1}{2}$}\frac{(\mu_{1}-\mu_{0})^{2}}{\sigma_{1}^{2}}+\small\mbox{$\frac{1}{2}$}\Big{(}\frac{\sigma_{0}^{2}}{\sigma_{1}^{2}}-\log\frac{\sigma_{0}^{2}}{\sigma_{1}^{2}}-1\Big{)}.

For a given dataset (Xi,Yi)i=1n{(X_{i},Y_{i})}_{i=1}^{n}, the negative log-likelihood function is as follows:

\displaystyle L_{0}(\theta)=\small\mbox{$\frac{1}{2}$}\frac{1}{n}\sum_{i=1}^{n}\Big\{\frac{(Y_{i}-\theta^{\top}X_{i})^{2}}{\sigma^{2}}+\log(2\pi\sigma^{2})\Big\}.

The estimating function for θ\theta is

S0(θ)=1n1σ2i=1n(YiθXi)Xi,\displaystyle{S}_{0}(\theta)=\frac{1}{n}\frac{1}{\sigma^{2}}\sum_{i=1}^{n}(Y_{i}-\theta^{\top}X_{i})X_{i},

where σ2\sigma^{2} is assumed to be known; in practice, when σ2\sigma^{2} is unknown, it is estimated as discussed later. Equating the estimating function to zero gives the likelihood equation, whose solution, the ML-estimator, is nothing but the least squares estimator. This is a well-known element of statistics with a wide range of applications, for which several standard tools for assessing model fit and diagnostics have been established.
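As a quick illustration, solving the likelihood equation above amounts to ordinary least squares; a minimal sketch in Python (all data and names here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))                # covariate vectors
theta_true = np.array([0.5, 1.5, 1.0])
Y = X @ theta_true + rng.normal(size=n)    # normal linear model with sigma = 1

# Equating S_0(theta) to zero gives the normal equations X'X theta = X'Y,
# i.e. the least squares estimator.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
sigma2_hat = np.mean((Y - X @ theta_hat) ** 2)  # ML estimate of the variance
```

With moderate sample size, `theta_hat` recovers the true coefficients up to sampling error.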

On the other hand, robust regression methods aim to provide estimates that are less affected by outliers or violations of model assumptions like normality of errors. The key is the introduction of M-estimators, which generalize maximum likelihood estimators. They work by minimizing a sum of a loss function applied to the residuals. The choice of the loss function (such as Huber's winsorized loss or Tukey's biweight loss [3]) determines the robustness and efficiency of the estimator. The M-estimator, θ^Ψ\hat{\theta}_{\Psi}, of a parameter θ\theta is obtained by minimizing an objective function, typically a sum of Ψ\Psi's applied to the standardized residuals:

θ^Ψ=argminθdi=1nΨ(YiθXiσ).\displaystyle\hat{\theta}_{\Psi}=\mathop{\rm argmin}_{\theta\in\mathbb{R}^{d}}\sum_{i=1}^{n}\Psi\Big{(}\frac{Y_{i}-\theta^{\top}X_{i}}{\sigma}\Big{)}. (2.23)

The estimating equation is given by

i=1nψ(YiθXiσ)Xi=0,\displaystyle\sum_{i=1}^{n}\psi\Big{(}\frac{Y_{i}-\theta^{\top}X_{i}}{\sigma}\Big{)}X_{i}=0,

where ψ(r)=(/r)Ψ(r)\psi(r)=(\partial/\partial r)\Psi(r). Here are typical examples:
(1). Quadratic loss: Ψ(r)=r2\Psi(r)=r^{2}, which is equivalent to the log-likelihood function
(2). Huber’s loss: Ψ(r)={12r2for |r|kk(|r|12k)for |r|>k\Psi(r)=\left\{\begin{array}[]{lc}\small\mbox{$\frac{1}{2}$}r^{2}&\text{for }|r|\leq k\\[5.69054pt] k(|r|-\small\mbox{$\frac{1}{2}$}k)&\text{for }|r|>k\end{array}\right.
(3). Tukey’s loss: Ψ(r)={c26(1[1(rc)2]3)for |r|cc26for |r|>c,\Psi(r)=\left\{\begin{array}[]{lc}\frac{c^{2}}{6}\left(1-\left[1-\left(\frac{r}{c}\right)^{2}\right]^{3}\right)&\text{for }|r|\leq c\\[8.53581pt] \frac{c^{2}}{6}&\text{for }|r|>c\end{array}\right.,
where kk and cc are hyperparameters.
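The three losses above can be written compactly through their derivatives ψ\psi; a small sketch (the hyperparameter defaults are common illustrative choices, not prescribed by the text):

```python
import numpy as np

def psi_quadratic(r):
    # derivative of the quadratic loss is proportional to r: the non-robust case
    return r

def psi_huber(r, k=1.345):
    # identity inside [-k, k], clipped to +/- k outside: bounded but not redescending
    return np.clip(r, -k, k)

def psi_tukey(r, c=4.685):
    # biweight: redescending, exactly zero for |r| > c
    return np.where(np.abs(r) <= c, r * (1.0 - (r / c) ** 2) ** 2, 0.0)
```

A large residual contributes fully under the quadratic loss, is capped under Huber's loss, and is discarded entirely under Tukey's loss.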

We return to the discussion of the γ\gamma-estimator. The γ\gamma-divergence is given by

Dγ(𝙽𝚘𝚛(μ0,σ2),𝙽𝚘𝚛(μ1,σ2))=cγ(σ2)[exp{12γγ+1(μ1μ0)2σ2}1]γγ+1,\displaystyle D_{\gamma}({\tt Nor}(\mu_{0},\sigma^{2}),{\tt Nor}(\mu_{1},\sigma^{2}))=c_{\gamma}(\sigma^{2}){}^{\frac{\gamma}{\gamma+1}}\Big{[}\exp\Big{\{}-\small\mbox{$\frac{1}{2}$}\frac{\gamma}{\gamma+1}\frac{(\mu_{1}-\mu_{0})^{2}}{\sigma^{2}}\Big{\}}-1\Big{]},

where cγ=(γ+1)121γ+1c_{\gamma}=(\gamma+1)^{\small\mbox{$\frac{1}{2}$}\frac{1}{\gamma+1}}. The γ\gamma-expression of the normal linear model is given by

\displaystyle p^{(\gamma)}(y|x,\theta)=p_{0}(y,\theta^{\top}x,\sigma^{2}/(\gamma+1)),

where p0(y,μ,σ2)p_{0}(y,\mu,\sigma^{2}) is a normal density function with mean μ\mu and variance σ2\sigma^{2}. Hence, the γ\gamma-loss function is given by

\displaystyle L_{\gamma}(\theta)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\{p_{0}(Y_{i},\theta^{\top}X_{i},\sigma^{2}/(\gamma+1))\}^{\frac{\gamma}{\gamma+1}},

which is written as

\displaystyle-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\exp\Big\{-\small\mbox{$\frac{1}{2}$}\gamma\frac{(Y_{i}-\theta^{\top}X_{i})^{2}}{\sigma^{2}}-\small\mbox{$\frac{1}{2}$}\frac{\gamma}{\gamma+1}\log(2\pi\sigma^{2})\Big\} (2.24)

up to a scalar multiple. Consequently, the γ\gamma-loss function is a specific example of the Ψ\Psi-loss function in (2.23), viewed with \Psi(r)\propto-(1/\gamma)\exp(-\small\mbox{$\frac{1}{2}$}\gamma r^{2}). We note that the γ\gamma-estimator is thus one of the M-estimators. The γ\gamma-estimating function is defined as Sγ(θ)=1ni=1nSγ(Xi,Yi,θ),{S}_{\gamma}(\theta)=\frac{1}{n}\sum_{i=1}^{n}{S}_{\gamma}(X_{i},Y_{i},\theta), where the score function is defined by

\displaystyle{S}_{\gamma}(x,y,\theta)=(2\pi\sigma^{2})^{-\small\mbox{$\frac{1}{2}$}\frac{\gamma}{\gamma+1}}\exp\Big\{-\small\mbox{$\frac{1}{2}$}\gamma\frac{(y-\theta^{\top}x)^{2}}{\sigma^{2}}\Big\}\frac{y-\theta^{\top}x}{\sigma^{2}}x. (2.25)

As an M-estimator, the generator function is \psi(r,\gamma)=r\exp(-\small\mbox{$\frac{1}{2}$}\gamma r^{2}). Fig 2.1 displays the plots of the generator functions:
(1). γ\gamma-loss, ψ(r,γ)=rexp(12γr2)\psi(r,\gamma)=r\exp(-\small\mbox{$\frac{1}{2}$}\gamma r^{2}),
(2). Huber’s loss, \psi(r,k)={\mathbb{I}}(|r|\leq k)\,r+{\mathbb{I}}(|r|>k)\,k\,{\rm sign}(r),
(3). Tukey’s loss, \psi(r,c)={\mathbb{I}}(|r|\leq c)\,r\{1-(r/c)^{2}\}^{2}.
It is observed that the generator functions of the γ\gamma-loss and Tukey's loss are both redescending. This means that the influence of each data point on the estimation decreases to zero beyond a certain threshold, effectively eliminating the impact of extreme outliers. Unlike the quadratic and Huber's loss functions, such redescending loss functions are non-convex. This characteristic makes them more robust but also introduces challenges in optimization, as it can lead to multiple local minima.
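Since the γ\gamma-estimating equation weights each residual by exp(-\frac{1}{2}\gamma r^{2}/\sigma^{2}), it can be solved by iteratively reweighted least squares; a minimal sketch assuming σ2\sigma^{2} known (the function name and iteration count are our choices):

```python
import numpy as np

def gamma_regression(X, Y, gamma=0.3, sigma2=1.0, n_iter=50):
    """Fixed-point iteration for sum_i w_i (Y_i - theta'X_i) X_i = 0
    with w_i = exp(-gamma (Y_i - theta'X_i)^2 / (2 sigma2))."""
    theta = np.linalg.solve(X.T @ X, X.T @ Y)   # start from least squares
    for _ in range(n_iter):
        r = Y - X @ theta
        w = np.exp(-0.5 * gamma * r ** 2 / sigma2)
        XW = X * w[:, None]
        # weighted least squares step: X'WX theta = X'WY
        theta = np.linalg.solve(X.T @ XW, XW.T @ Y)
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
theta_true = np.array([1.5, 1.0])
Y = X @ theta_true + rng.normal(size=500)
theta_gamma = gamma_regression(X, Y)
```

On clean data all weights stay close to one and the iteration essentially reproduces least squares; outlying residuals receive exponentially small weights.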

Refer to caption
Figure 2.1: Plots of the generator functions

The variance parameter σ2\sigma^{2} in the normal regression model is referred to as a dispersion parameter in GLM. In a situation where σ2\sigma^{2} is unknown the likelihood method is similar to the known case. The ML-estimator for σ2\sigma^{2} is derived by

\displaystyle\hat{\sigma}^{2}=\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\hat{\theta}_{0}^{\top}X_{i})^{2},

plugging in θ^0\hat{\theta}_{0} for θ\theta. Alternatively, the γ\gamma-estimator for (θ,σ2)(\theta,\sigma^{2}) is derived as the solution of the joint estimating equation combining

σ2=γ+1ni=1nexp{12γ(YiθXi)2σ2}(YiθXi)2\displaystyle\sigma^{2}=\frac{\gamma+1}{n}\sum_{i=1}^{n}\exp\Big{\{}-\small\mbox{$\frac{1}{2}$}{\gamma}\frac{(Y_{i}-\theta^{\top}X_{i})^{2}}{\sigma^{2}}\Big{\}}(Y_{i}-\theta^{\top}X_{i})^{2}

with the estimating equation for θ\theta. Similarly, we can find that the boundedness property for the γ\gamma-score function for σ2\sigma^{2} holds.

Let us apply the geometric discussion associated with the decision boundary HθH_{\theta} in (2.6) to the normal regression model. We write the estimating function of M-estimator in (2.23) as

Sψ(θ,𝒟)=i=1nψ(YiθXiσ0)Xi\displaystyle S_{\psi}(\theta,{\cal D})=\sum_{i=1}^{n}\psi\Big{(}\frac{Y_{i}-\theta^{\top}X_{i}}{\sigma_{0}}\Big{)}X_{i}

for a given dataset {\cal D}=\{(X_{i},Y_{i})\}_{i=1}^{n}. Due to the orthogonal decomposition of XX, the estimating function is also decomposed into a sum of the orthogonal and horizontal components, S¯ψ(O)(θ,𝒟)+S¯ψ(H)(θ,𝒟)\bar{S}^{\rm(O)}_{\psi}(\theta,{\cal D})+\bar{S}^{\rm(H)}_{\psi}(\theta,{\cal D}), where

S¯ψ(O)(θ,𝒟)=1ni=1nψ(YiθXiσ0)Zθ(Xi),S¯ψ(H)(θ,𝒟)=1ni=1nψ(YiθXiσ0)Wθ(Xi).\displaystyle\bar{S}^{\rm(O)}_{\psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}\psi\Big{(}\frac{Y_{i}-\theta^{\top}X_{i}}{\sigma_{0}}\Big{)}Z_{\theta}(X_{i}),\ \ \bar{S}^{\rm(H)}_{\psi}(\theta,{\cal D})=\frac{1}{n}\sum_{i=1}^{n}\psi\Big{(}\frac{Y_{i}-\theta^{\top}X_{i}}{\sigma_{0}}\Big{)}W_{\theta}(X_{i}).

We note that this decomposition is the same as that for the GLM in Section . We consider a specific type of contamination in the covariate space 𝒳\cal X such that 𝒟={(Xi,Yi)}i=1n{\cal D}^{*}=\{(X^{*}_{i},Y_{i})\}_{i=1}^{n}, where Xi=Xi+σ(Xi)Wθ(Xi)X^{*}_{i}=X_{i}+\sigma(X_{i})W_{\theta}(X_{i}) with a fixed scalar σ(Xi)\sigma(X_{i}) depending on XiX_{i}. As in the discussion for the general setting of the GLM, L¯Ψ(θ,𝒟)\bar{L}_{\Psi}(\theta,{\cal D}^{*}) and S¯ψ(O)(θ,𝒟)\bar{S}_{\psi}^{(O)}(\theta,{\cal D}^{*}) are both strongly affected by this contamination, while S¯ψ(H)(θ,𝒟)\bar{S}_{\psi}^{(\rm H)}(\theta,{\cal D}^{*}) is unaffected. Let us investigate a preferable property of the γ\gamma-estimator by applying the decomposition formula above.

Proposition 8.

Let Sγ(x,y,θ){S}_{\gamma}(x,y,\theta) be the γ\gamma-score function defined in (2.25). Then,

supx𝒳d(Sγ(x,y,θ),Hθ)<\displaystyle\sup_{x\in{\mathcal{X}}}d({S}_{\gamma}(x,y,\theta),H_{\theta})<\infty (2.26)

for any fixed yy of \mathbb{R} and any γ>0\gamma>0, where dd is the Euclidean distance.

Proof.

It is written that

d(Sγ(x,y,θ),Hθ)=exp(12γz2)|z(zy)|,\displaystyle d({S}_{\gamma}(x,y,\theta),H_{\theta})=\exp(-\small\mbox{$\frac{1}{2}$}\gamma z^{2})|z(z-y)|,

where z=yθxz=y-\theta^{\top}x. Therefore,

supx𝒳d(Sγ(x,y,θ),Hθ)exp(12γz2)(z2+|yz|),\displaystyle\sup_{x\in{\mathcal{X}}}d({S}_{\gamma}(x,y,\theta),H_{\theta})\leq\exp(-\small\mbox{$\frac{1}{2}$}{\gamma}z^{2})(z^{2}+|yz|),

which is bounded by

supz>0z2exp(12γz2)+|y|supz>0|z|exp(12γz2).\displaystyle\sup_{z>0}z^{2}\exp(-\small\mbox{$\frac{1}{2}$}{\gamma}z^{2})+|y|\sup_{z>0}|z|\exp(-\small\mbox{$\frac{1}{2}$}{\gamma}z^{2}).

This is simplified as

\displaystyle\frac{2}{\gamma}\exp(-1)+|y|\frac{1}{\sqrt{\gamma}}\exp\Big(-\frac{1}{2}\Big).

Therefore, (2.26) is concluded for the fixed yy. ∎
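The two suprema used in the proof are elementary calculus facts: z^{2}e^{-\gamma z^{2}/2} is maximized at z=\sqrt{2/\gamma} and ze^{-\gamma z^{2}/2} at z=1/\sqrt{\gamma}. A quick numerical check on a grid (the value γ=0.3\gamma=0.3 is an arbitrary choice):

```python
import numpy as np

gamma = 0.3
z = np.linspace(0.0, 20.0, 2_000_001)  # fine grid covering both maximizers

# sup_z z^2 exp(-gamma z^2 / 2), attained at z = sqrt(2/gamma)
sup1 = np.max(z ** 2 * np.exp(-0.5 * gamma * z ** 2))
# sup_z z exp(-gamma z^2 / 2), attained at z = 1/sqrt(gamma)
sup2 = np.max(z * np.exp(-0.5 * gamma * z ** 2))
```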

It follows from Proposition 8 that all the estimating scores of the γ\gamma-estimator lie in a tubular neighborhood

𝒩θ(δ)={zd:d(z,θ)δ}\displaystyle{\mathcal{N}}_{\theta}(\delta)=\big{\{}z\in\mathbb{R}^{d}:d(z,{\mathcal{H}}_{\theta})\leq\delta\big{\}} (2.27)

surrounding HθH_{\theta}. As a result, the distance from the estimating function to the boundary HθH_{\theta} is bounded, that is,

\displaystyle\sup_{x\in{\mathcal{X}}}d({S}_{\gamma}(\theta),{\mathcal{H}}_{\theta})\leq\frac{2}{\gamma}\exp(-1)+\frac{1}{\sqrt{\gamma}}\exp\Big(-\frac{1}{2}\Big)\max_{1\leq i\leq n}|Y_{i}|.

However, in the limiting case of γ=0\gamma=0, that is, for the ML-estimator, this boundedness property against covariate outliers breaks down. Tukey's biweight estimating function satisfies the boundedness; Huber's does not.

We present a brief numerical experiment. Assume that the covariate vectors XiX_{i}’s are generated from a bivariate normal distribution 𝙽𝚘𝚛(0,I){\tt Nor}(0,{\rm I}), where I\rm I denotes the 2-dimensional identity matrix. The simulation was designed with the following two scenarios for the conditional distribution of the response variables YiY_{i}’s.

Specified model

\hskip 25.60747ptY_{i}\sim{\tt Nor}(\theta_{1}^{\top}X_{i}+\theta_{0},\sigma^{2}).

Misspecified model

Y_{i}\sim(1-\pi)\,{\tt Nor}(\theta_{1}^{\top}X_{i}+\theta_{0},\sigma^{2})+\pi\,{\tt Nor}(\theta_{*1}^{\top}X_{i}+\theta_{*0},\sigma_{*}^{2}).

Here the parameters were set as (\theta_{0},\theta_{1}^{\top})=(0.5,1.5,1.0)^{\top} and \pi=0.1 with \sigma=1; (\theta_{*0},\theta_{*1}^{\top})=(0.5,-1.5,-1.0)^{\top} with \sigma_{*}=1.

We compared the ML-estimator θ^0\hat{\theta}_{0} and the γ\gamma-estimator θ^γ\hat{\theta}_{\gamma} with γ=0.3\gamma=0.3, where the simulation was conducted with 300 replications. In the case of the specified model, the ML-estimator was slightly superior to the γ\gamma-estimator in terms of the root mean square error (rmse); however, the superiority is almost negligible. Next, we supposed the misspecified model, a mixture of two normal regression models in which one component was the same model as above with mixing probability 0.90.9, and the other was still a normal regression model but with the negative slope vector and mixing probability 0.10.1. Under this misspecified setting, the γ\gamma-estimator was crucially superior to the ML-estimator: the rmse of the ML-estimator is more than double that of the γ\gamma-estimator. Thus, the ML-estimator is sensitive to the presence of such a heterogeneous subgroup, while the γ\gamma-estimator is robust. Proposition 8 suggests that the effect of the subgroup is substantially suppressed in the estimating function of the γ\gamma-estimator. See Table 2.1 and Figure 2.2 for details.
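The misspecified-model experiment can be reproduced in outline as follows; this is a compact sketch with fewer replications than the text's 300, and the seed and helper names are our own choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def gamma_fit(X, Y, gamma=0.3, sigma2=1.0, n_iter=50):
    # IRLS for the gamma-estimating equation, sigma^2 taken as known
    theta = np.linalg.solve(X.T @ X, X.T @ Y)
    for _ in range(n_iter):
        w = np.exp(-0.5 * gamma * (Y - X @ theta) ** 2 / sigma2)
        XW = X * w[:, None]
        theta = np.linalg.solve(X.T @ XW, XW.T @ Y)
    return theta

theta_true = np.array([0.5, 1.5, 1.0])    # (intercept, slope vector)
theta_star = np.array([0.5, -1.5, -1.0])  # contaminating component
err_ml, err_g = [], []
for _ in range(30):
    n = 300
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    comp = rng.uniform(size=n) < 0.1      # 10% drawn from the contaminating model
    mean = np.where(comp, X @ theta_star, X @ theta_true)
    Y = mean + rng.normal(size=n)
    err_ml.append(np.linalg.norm(np.linalg.solve(X.T @ X, X.T @ Y) - theta_true))
    err_g.append(np.linalg.norm(gamma_fit(X, Y) - theta_true))
rmse_ml = np.sqrt(np.mean(np.square(err_ml)))
rmse_g = np.sqrt(np.mean(np.square(err_g)))
```

The γ\gamma-estimator's rmse comes out clearly smaller than the ML-estimator's, in line with Table 2.1(b).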

Table 2.1: Comparison between the ML-estimator and the γ\gamma-estimator.

(a). The case of specified model

Method estimate rmse
ML-estimate (0.495672, 1.50753, 1.00211) 0.173617
γ\gamma-estimate (0.497593, 1.50754, 1.00301) 0.176443

(b). The case of misspecified model

Method estimate rmse
ML-estimate (0.50788, 1.19093, 0.774613) 0.486919
γ\gamma-estimate (0.500093, 1.43289, 0.941501) 0.219798
Refer to caption
Figure 2.2: Box-whisker Plots of the ML-estimator and the γ\gamma-estimator

2.5 Binary logistic regression

We consider a binary outcome YY with a value in 𝒴={0,1}{\mathcal{Y}}=\{0,1\} and a covariate XX in a subset 𝒳{\mathcal{X}} of d\mathbb{R}^{d}. The probability distribution is characterized by a probability mass function (pmf) or the RN-derivative with respect to a counting measure CC:

p(y,π)=πy(1π)1y,\displaystyle p(y,\pi)=\pi^{y}(1-\pi)^{1-y},

which is referred to as the Bernoulli distribution 𝙱𝚎𝚛(π){\tt Ber}(\pi), where π\pi is the probability of Y=1Y=1. A binary regression model is defined by a link function mapping the systematic component ω\omega into the random component: g(\omega)={\exp(\omega)}/{\{1+\exp(\omega)\}}, so that the conditional pmf given X=xX=x with a linear model ω=θx\omega=\theta^{\top}x is given by

p(y|x,θ)=exp(yθx)1+exp(θx),\displaystyle p(y|x,\theta)=\frac{\exp(y\theta^{\top}x)}{1+\exp(\theta^{\top}x)}, (2.28)

which is referred to as a logistic model [15, 46].

The KL-divergence between Bernoulli distributions is given by

D0(𝙱𝚎𝚛(π),𝙱𝚎𝚛(ρ))=πlogπρ+(1π)log1π1ρ.\displaystyle D_{0}({\tt Ber}(\pi),{\tt Ber}(\rho))=\pi\log\frac{\pi}{\rho}+(1-\pi)\log\frac{1-\pi}{1-\rho}.

For a given dataset {(Xi,Yi)}i=1,,n\{(X_{i},Y_{i})\}_{i=1,...,n}, the negative log-likelihood function is given by

\displaystyle L_{0}(\theta)=-\frac{1}{n}\sum_{i=1}^{n}\Big[Y_{i}\theta^{\top}X_{i}-\log\{1+\exp(\theta^{\top}X_{i})\}\Big]

and the likelihood equation is written by

S0(θ)=1ni=1n{Yiexp(θXi)1+exp(θXi)}Xi=0.\displaystyle{S}_{0}(\theta)=-\frac{1}{n}\sum_{i=1}^{n}\Big{\{}Y_{i}-\frac{\exp(\theta^{\top}X_{i})}{1+\exp(\theta^{\top}X_{i})}\Big{\}}X_{i}=0. (2.29)

On the other hand, the γ\gamma-divergence is given by

Dγ(𝙱𝚎𝚛(π),𝙱𝚎𝚛(ρ);C)=1γπργ+(1π)(1ρ)γ{ργ+1+(1ρ)γ+1}γγ+1+1γ{πγ+1+(1π)γ+1}1γ+1,\displaystyle D_{\gamma}({\tt Ber}(\pi),{\tt Ber}(\rho);C)=-\frac{1}{\gamma}\frac{\pi\rho^{\gamma}+(1-\pi)(1-\rho)^{\gamma}}{\big{\{}\rho^{\gamma+1}+(1-\rho)^{\gamma+1}\big{\}}^{\frac{\gamma}{\gamma+1}}}+\frac{1}{\gamma}{\big{\{}\pi^{\gamma+1}+(1-\pi)^{\gamma+1}\big{\}}^{\frac{1}{\gamma+1}}},

where CC is the counting measure on 𝒴\cal Y. Note that this depends on the choice of CC as the reference measure on 𝒴\cal Y. The γ\gamma-expression of the logistic model (2.28) is given by

p(γ)(y|x,ω)=exp{(γ+1)yθx}1+exp{(γ+1)θx}.\displaystyle p^{(\gamma)}(y|x,\omega)=\frac{\exp\{(\gamma+1)y\theta^{\top}x\}}{1+\exp\{(\gamma+1)\theta^{\top}x\}}.

Hence, the γ\gamma-loss function is written by

Lγ(θ;C)=1n1γi=1n[exp{(γ+1)YiθXi}1+exp{(γ+1)θXi}]γγ+1.\displaystyle L_{\gamma}(\theta;C)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\Big{[}\frac{\exp\{(\gamma+1)Y_{i}\theta^{\top}X_{i}\}}{1+\exp\{(\gamma+1)\theta^{\top}X_{i}\}}\Big{]}^{\frac{\gamma}{\gamma+1}}. (2.30)

and the γ\gamma-estimating function is written as

Sγ(θ;C)=1ni=1nSγ(Xi,Yi,θ;C),\displaystyle{S}_{\gamma}(\theta;C)=\frac{1}{n}\sum_{i=1}^{n}{S}_{\gamma}(X_{i},Y_{i},\theta;C),

where

Sγ(X,Y,θ;C)=[exp{(γ+1)YiθX}1+exp{(γ+1)θX}]γγ+1{Yexp{(γ+1)θX}1+exp{(γ+1)θX}}X,\displaystyle{S}_{\gamma}(X,Y,\theta;C)=\Big{[}\frac{\exp\{(\gamma+1)Y_{i}\theta^{\top}X\}}{1+\exp\{(\gamma+1)\theta^{\top}X\}}\Big{]}^{\frac{\gamma}{\gamma+1}}\Big{\{}Y-\frac{\exp\{(\gamma+1)\theta^{\top}X\}}{1+\exp\{(\gamma+1)\theta^{\top}X\}}\Big{\}}X, (2.31)

see [49] for the discussion of robust mislabeling. See [24, 69, 90, 92, 91, 55] for other types of MDE approaches than the γ\gamma-estimation.
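The γ\gamma-loss (2.30) can be minimized numerically. A sketch using scipy, where the optimizer choice, γ=0.5\gamma=0.5, and the simulated data are our assumptions; note that the exponent of the γ\gamma-expression keeps every summand in (0,1], so the loss is numerically stable:

```python
import numpy as np
from scipy.optimize import minimize

def gamma_loss(theta, X, Y, gamma=0.5):
    # gamma-loss (2.30): -(1/(n*gamma)) sum_i [p^(gamma)(Y_i|X_i)]^(gamma/(gamma+1))
    s = (gamma + 1.0) * (X @ theta)
    logp = Y * s - np.logaddexp(0.0, s)   # log of the gamma-expression
    return -np.mean(np.exp(gamma / (gamma + 1.0) * logp)) / gamma

rng = np.random.default_rng(4)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([0.5, 1.0, 1.5])
Y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ theta_true)))).astype(float)

res = minimize(gamma_loss, np.zeros(3), args=(X, Y), method="BFGS")
theta_gamma = res.x
```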

The γ\gamma-divergence on the space of Bernoulli distributions is well defined for all real numbers γ\gamma. Let us fix γ=1\gamma=-1; the GM-divergence between Bernoulli distributions is then given by

DGM(𝙱𝚎𝚛(π),𝙱𝚎𝚛(ρ);R)={πρr+1π1ρ(1r)}πr(1π)1rρr(1ρ)1r,\displaystyle D_{\rm GM}({\tt Ber}(\pi),{\tt Ber}(\rho);R)=\Big{\{}\frac{\pi}{\rho}r+\frac{1-\pi}{1-\rho}(1-r)\Big{\}}{\pi}^{r}({1-\pi})^{1-r}-{\rho}^{r}({1-\rho})^{1-r},

where the reference measure RR is chosen by 𝙱𝚎𝚛(r){\tt Ber}(r). Hence, the GM-loss function is given by

LGM(θ;R)=1ni=1nrYi(1r)1Yiexp{(rYi)θXi}.\displaystyle L_{\rm GM}(\theta;R)=\frac{1}{n}\sum_{i=1}^{n}r^{Y_{i}}(1-r)^{1-Y_{i}}\exp\{(r-Y_{i})\theta^{\top}X_{i}\}.

The GM-loss function with the reference measure {\tt Ber}(\small\mbox{$\frac{1}{2}$}) is equal to the exponential loss function of the AdaBoost algorithm discussed in the context of ensemble learning [30]. The integrated discrimination improvement index via odds [40] is based on the GM-loss function to assess prediction performance. We will give a further discussion in a subsequent chapter. The GM-estimating function is written as

SGM(θ;𝙱𝚎𝚛(r))=1ni=1n(2Yi1)exp{(rYi)θXi}Xi\displaystyle{S}_{\rm GM}(\theta;{\tt Ber}(r))=\frac{1}{n}\sum_{i=1}^{n}(2Y_{i}-1)\exp\{(r-Y_{i})\theta^{\top}X_{i}\}X_{i}

due to r^{Y}(1-r)^{1-Y}(r-Y)=-(2Y-1)r(1-r) for Y=0,1Y=0,1. Therefore, this estimating function is unbiased for any rr with 0<r<10<r<1; that is, the expected estimating function conditional on (X1,,Xn)(X_{1},...,X_{n}) under the logistic model (2.28) is equal to the zero vector.
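Conditional unbiasedness can also be verified directly: under the logistic model, P(Y=1|x)=e^{s}/(1+e^{s}) with s=\theta^{\top}x, and the two terms of the GM-score cancel for every rr. A one-line numerical confirmation:

```python
import numpy as np

# E[(2Y-1) exp{(r-Y) s}] under P(Y=1) = e^s/(1+e^s) should vanish for all r and s.
for s in (-3.0, 0.0, 2.5):
    p1 = np.exp(s) / (1.0 + np.exp(s))
    for r in (0.1, 0.5, 0.9):
        m = p1 * np.exp((r - 1.0) * s) - (1.0 - p1) * np.exp(r * s)
        assert abs(m) < 1e-10  # conditional expectation of the GM-score is zero
```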

We now discuss which rr is effective in practical logistic regression applications. In particular, we focus on the problem of imbalanced samples, an important issue in binary regression. An imbalanced dataset is one where the distribution of samples across the two classes is far from equal. For example, in a medical diagnosis dataset, the number of patients with a rare disease (class 1) may be significantly lower than those without it (class 0). Such a situation is characterized as

\displaystyle 0\approx P(Y=1)\ll P(Y=0)\approx 1.

Imbalanced samples raise difficult issues of model bias, poor generalization, and inaccurate performance metrics for prediction, and can lead to biased or inconsistent estimators, affecting hypothesis tests and confidence intervals. To address these problems, resampling techniques have been exploited: oversampling the minority class or undersampling the majority class can balance the dataset. Also, cost-sensitive learning introduces a cost matrix to penalize misclassification of the minority class more heavily. An asymmetric logistic regression has been proposed that introduces a new parameter to account for data complexity [55]; the authors observe that this parameter controls the influence of imbalanced sampling. Here we tackle this problem with the GM-estimator by choosing an appropriate reference distribution RR in the GM-loss function. We select 𝙱𝚎𝚛(π^0){\tt Ber}(\hat{\pi}_{0}) as the reference measure, where π^0\hat{\pi}_{0} is the proportion of the negative samples, namely π^0=i=1n𝕀(Yi=0)/n\hat{\pi}_{0}=\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=0)/n. Then the resultant loss function is given by

LGM(iw)(θ)=1ni=1nπ^0Yi(1π^0)1Yiexp{(π^0Yi)θXi}.\displaystyle L^{\rm(iw)}_{\rm GM}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\hat{\pi}_{0}^{Y_{i}}(1-\hat{\pi}_{0})^{1-Y_{i}}\exp\{(\hat{\pi}_{0}-Y_{i})\theta^{\top}X_{i}\}. (2.32)

We refer to this as the inverse-weighted GM-loss function since the weight satisfies

π^0Yi(1π^0)1Yi1(1π^0)Yiπ^01Yi.\hat{\pi}_{0}^{Y_{i}}(1-\hat{\pi}_{0})^{1-Y_{i}}\propto\frac{1}{(1-\hat{\pi}_{0})^{Y_{i}}\hat{\pi}_{0}^{1-Y_{i}}}.

Hence, the estimating function is given by

SGM(iw)(θ)=1ni=1n(2Yi1)exp{(π^0Yi)θXi}Xi.\displaystyle{S}^{\rm(iw)}_{\rm GM}(\theta)=\frac{1}{n}\sum_{i=1}^{n}(2Y_{i}-1)\exp\{(\hat{\pi}_{0}-Y_{i})\theta^{\top}X_{i}\}X_{i}.

Equating the estimating function to zero gives the equality between two sums of positive and negative samples:

1ni=1n𝕀(Yi=1)exp{(π^01)θXi}Xi=1ni=1n𝕀(Yi=0)exp{π^0θXi}Xi.\displaystyle\frac{1}{n}\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=1)\exp\{(\hat{\pi}_{0}-1)\theta^{\top}X_{i}\}X_{i}=\frac{1}{n}\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=0)\exp\{\hat{\pi}_{0}\theta^{\top}X_{i}\}X_{i}.

Alternatively, the likelihood estimating equation is written as

1ni=1n𝕀(Yi=1)11+exp(θXi)Xi=1ni=1n𝕀(Yi=0)exp(θXi)1+exp(θXi)Xi.\displaystyle\frac{1}{n}\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=1)\frac{1}{1+\exp(\theta^{\top}X_{i})}X_{i}=\frac{1}{n}\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=0)\frac{\exp(\theta^{\top}X_{i})}{1+\exp(\theta^{\top}X_{i})}X_{i}.

Both estimating equations are unbiased; however, the weightings contrast with each other.
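The inverse-weighted GM-loss (2.32) is a sum of positive weights times exponentials of linear functions of θ\theta, hence convex, and can be minimized directly. A sketch on an imbalanced sample, where the use of scipy and the data design (rare positives near +\mu, abundant negatives near -\mu) are our illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, eps = 2000, 0.05
mu = np.array([2.0, 2.0])
Y = (rng.uniform(size=n) < eps).astype(float)          # ~5% positives
X0 = np.where(Y[:, None] == 1.0, mu, -mu) + rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), X0])                  # add an intercept column

pi0 = np.mean(Y == 0.0)   # proportion of negative samples

def gm_loss(theta):
    # inverse-weighted GM-loss (2.32) with reference Ber(pi0)
    w = pi0 ** Y * (1.0 - pi0) ** (1.0 - Y)
    return np.mean(w * np.exp((pi0 - Y) * (X @ theta)))

theta_gm = minimize(gm_loss, np.zeros(3), method="BFGS").x
tpr = np.mean((X[Y == 1.0] @ theta_gm) > 0.0)  # true positive rate at threshold 0
```

On this well-separated design the fitted direction classifies the rare class accurately.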

We conduct a brief numerical experiment. Assume that the covariate vectors are generated from a mixture of bivariate normal distributions as

\displaystyle X_{i}\sim(1-\epsilon)\,{\tt Nor}(-\mu_{0},{\rm I})+\epsilon\,{\tt Nor}(\mu_{0},{\rm I}),

where I\rm I denotes the 2-dimensional identity matrix. Here we set n=1000n=1000 and \mu_{0}=(2,2)^{\top}, and the mixture ratio ϵ\epsilon is taken over several fixed values. The outcome variables are generated from Bernoulli distributions as Yi𝙱𝚎𝚛(π(Xi))Y_{i}\sim{\tt Ber}(\pi(X_{i})), where

π(Xi,θ0)=exp(θ0+θ1Xi1+θ2Xi2)1+exp(θ0+θ1Xi1+θ2Xi2)\displaystyle\pi(X_{i},\theta_{0})=\frac{\exp(\theta_{0}+\theta_{1}X_{i1}+\theta_{2}X_{i2})}{1+\exp(\theta_{0}+\theta_{1}X_{i1}+\theta_{2}X_{i2})}

where (\theta_{0},\theta_{1},\theta_{2})=(0.5,1.0,1.5). This simulation is designed to yield imbalanced samples such that the positive sample proportion is approximately ϵ\epsilon.

We compared the ML-estimator θ^\hat{\theta} with the inverse-weighted GM-estimator θ^GM\hat{\theta}_{\rm GM} over 30 replications. We observe that the GM-estimator has better performance than the ML-estimator in the sense of the true positive rate. Table 2.2 lists the true positive and negative rates based on test samples of size 10001000. Note that the two label-conditional distributions, 𝙽𝚘𝚛(μ0,I){\tt Nor}(\mu_{0},{\rm I}) and 𝙽𝚘𝚛(μ0,I){\tt Nor}(-\mu_{0},{\rm I}), are set to be sufficiently separated from each other. Hence, the classification problem becomes an extremely easy task when ϵ\epsilon is a moderate value. Both the ML-estimator and the GM-estimator perform well in the cases ϵ=0.3,0.1\epsilon=0.3,0.1. Alternatively, we observe that the true positive rate of the GM-estimator is considerably higher than that of the ML-estimator in the imbalanced situations ϵ=0.03,0.01\epsilon=0.03,0.01.

Table 2.2: The comparison between MLE vs GME
ϵ\epsilon MLE GME
0.3 (0.969, 0.995) (0.956, 0.994)
0.1 (0.897, 0.998) (0.902, 0.995)
0.05 (0.800, 0.999) (0.817, 0.994)
0.03 (0.705, 0.999) (0.733, 0.995)
0.01 (0.462, 0.999) (0.538, 0.996)

(a,b)(a,b) denotes a pair of the true positive and negative rates aa and bb.

We next focus on the HM-divergence (γ\gamma-divergence, γ=2\gamma=-2):

DHM(𝙱𝚎𝚛(π),𝙱𝚎𝚛(ρ))=π(1ρ)2+(1π)ρ2π(1π),\displaystyle D_{\rm HM}({\tt Ber}(\pi),{\tt Ber}(\rho))=\pi(1-\rho)^{2}+(1-\pi)\rho^{2}-\pi(1-\pi),

where the reference measure is determined by 𝙱𝚎𝚛(ρ){\tt Ber}(\rho). The HM-loss function is derived as

\displaystyle L_{\rm HM}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\Bigg[\frac{\exp\{(1-Y_{i})\theta^{\top}X_{i}\}}{1+\exp(\theta^{\top}X_{i})}\Bigg]^{2},

for the logistic model (2.28). Note that the HM-loss function is the γ\gamma-loss function with γ=2\gamma=-2, for which the γ\gamma-expression is reduced to

p(2)(y|x,ω)=exp{(1y)θx}1+exp(θx).\displaystyle p^{(-2)}(y|x,\omega)=\frac{\exp\{(1-y)\theta^{\top}x\}}{1+\exp(\theta^{\top}x)}.

Hence, the HM-estimating function is written as

\displaystyle{S}_{\rm HM}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\frac{\exp(\theta^{\top}X_{i})}{\{1+\exp(\theta^{\top}X_{i})\}^{2}}\Big\{Y_{i}-\frac{\exp(\theta^{\top}X_{i})}{1+\exp(\theta^{\top}X_{i})}\Big\}X_{i}.

This is a weighted likelihood score function with the conditional variance of YY as the weight function. We will observe that this weighting is the key to the robustness of the HM-estimator against covariate outliers.

Let us investigate the behavior of the estimating function Sγ(θ;C){S}_{\gamma}(\theta;C) of the γ\gamma-estimator. In general, Sγ(θ;C){S}_{\gamma}(\theta;C) is unbiased at \theta_{0}, that is, \mathbb{E}_{0}[{S}_{\gamma}(\theta_{0};C)|\underline{X}]=0 under the conditional expectation with respect to the true distribution with the pmf p(y|x,θ0)p(y|x,\theta_{0}). However, this property is easily violated if the expectation is taken under a misspecified distribution QQ with the pmf q(y|x)q(y|x) other than the true distribution [12, 13, 53]. Hence, we look into the expected estimating function under the misspecified model.

Proposition 9.

Consider the γ\gamma-estimating function under a logistic model (2.28). Assume γ>0\gamma>0 or γ<1\gamma<-1. Then,

supx𝒳|θ𝔼Q[Sγ(X,Y,θ)|X=x]|<,\sup_{x\in{\mathcal{X}}}|\theta^{\top}\mathbb{E}_{Q}[{S}_{\gamma}(X,Y,\theta)|X=x]|<\infty, (2.33)

where 𝔼Q[|X=x]\mathbb{E}_{Q}[\ \cdot\ |X=x] is the conditional expectation under a misspecified distribution QQ outside the model (2.28).

Proof.

It is written from (2.31) that

𝔼Q[Sγ(X,Y,θ)|X=x]\displaystyle\mathbb{E}_{Q}[{S}_{\gamma}(X,Y,\theta)|X=x]
=\sum_{y=0}^{1}\bigg[\frac{\exp\{(\gamma+1)y\theta^{\top}x\}}{1+\exp\{(\gamma+1)\theta^{\top}x\}}\bigg]^{\frac{\gamma}{\gamma+1}}\Big[y-\frac{\exp\{(\gamma+1)\theta^{\top}x\}}{1+\exp\{(\gamma+1)\theta^{\top}x\}}\Big]q(y|x)\,x. (2.34)

Hence, if s=(γ+1)θxs=(\gamma+1)\theta^{\top}x, then

|θ𝔼Q[Sγ(X,Y,θ)|X=x]|Ψγ(s)+Ψγ(s),\displaystyle\big{|}\theta^{\top}\mathbb{E}_{Q}[{S}_{\gamma}(X,Y,\theta)|X=x]\big{|}\leq\Psi_{\gamma}(s)+\Psi_{\gamma}(-s),

where

Ψγ(s)=|s||γ+1|{11+exp(s)}γγ+1exp(s)1+exp(s).\displaystyle\Psi_{\gamma}(s)=\frac{|s|}{|\gamma+1|}\Big{\{}\frac{1}{1+\exp(-s)}\Big{\}}^{\frac{\gamma}{\gamma+1}}\frac{\exp(-s)}{1+\exp(-s)}. (2.35)

We observe that, if γ>0\gamma>0 or γ<1\gamma<-1, then

supsΨγ(s)=supsΨγ(s)<.\displaystyle\sup_{s\in\mathbb{R}}\Psi_{\gamma}(s)=\sup_{s\in\mathbb{R}}\Psi_{\gamma}(-s)<\infty.

This concludes (2.33).

∎

We note that Proposition 9 focuses only on the logistic model (2.28); however, such a boundedness property also holds for the probit model and the complementary log-log model.

We consider a geometric understanding of the bounded property in (2.33). In GLM, the linear predictor is written as θx=θ1x1+θ0\theta^{\top}x=\theta_{1}^{\top}x_{1}+\theta_{0}, where θ1\theta_{1} and θ0\theta_{0} are referred to as the slope vector and the intercept term, respectively. The decision boundary Hθ{H}_{\theta} is defined as in (2.6). The Euclidean distance of xx to Hθ{H}_{\theta},

\displaystyle d(x,{\mathcal{H}}_{\theta})=\frac{|\theta_{1}^{\top}x_{1}+\theta_{0}|}{\|\theta_{1}\|},

is referred to as the margin of xx from the decision boundary Hθ{H}_{\theta}, which plays a central role in the support vector machine [14]. Let

𝒩θ(δ)={x𝒳:d(x,θ)δ}.\displaystyle{\mathcal{N}}_{\theta}(\delta)=\big{\{}x\in{\mathcal{X}}:d(x,{\mathcal{H}}_{\theta})\leq\delta\big{\}}. (2.36)

This is the δ\delta-tubular neighborhood containing θ{\mathcal{H}}_{\theta}. From this perspective, Proposition 9 states, for any γ\gamma with γ<1\gamma<-1 or γ>0\gamma>0, that the conditional expectation of the γ\gamma-estimating function lies in the tubular neighborhood with probability one even under a misspecified distribution outside the parametric model (2.28). On the other hand, the likelihood estimating function does not satisfy such a stability property because the margin of its conditional expectation becomes unbounded. Therefore, we conclude that the γ\gamma-estimator is robust against model misspecification for γ>0\gamma>0 or γ<1\gamma<-1, while the ML-estimator is not robust.

We observe in the Euclidean geometric view that, for a feature vector xx of 𝒳\cal X, the decision hyperplane θ{\mathcal{H}}_{\theta} decomposes xx into orthogonal and tangential components as x=z+wx=z+w, where z=(θx)θ/θ2z=(\theta^{\top}x)\theta/\|\theta\|^{2} and w=xzw=x-z. Note that zwz\perp w and x2=z2+w2\|x\|^{2}=\|z\|^{2}+\|w\|^{2}. In accordance with this geometric view, we give more insight into the robust performance of the γ\gamma-estimator class. We write the γ\gamma-estimating function (2.31) as Sγ(x,y,θ)=ηγ(y,θx)(z+w)S_{\gamma}(x,y,\theta)=\eta_{\gamma}(y,\theta^{\top}x)(z+w). Then,

\displaystyle|S_{\gamma}(x,y,\theta)|\leq|\eta_{\gamma}(y,z^{\top}\theta)|(\|z\|+\|w\|). (2.37)

Therefore, we conclude that

\displaystyle|S_{\gamma}(x,y,\theta)|\leq\sup_{s\in\mathbb{R}}|\eta_{\gamma}(y,s)\,s|\,\frac{1}{\|\theta\|}+\sup_{s\in\mathbb{R}}|\eta_{\gamma}(y,s)|\,\|w\|. (2.38)

Thus, we observe a robust property of the γ\gamma-estimator in a more direct perspective.

Proposition 10.

Assume γ>0\gamma>0 or γ<1\gamma<-1. Then, the γ\gamma-estimating function Sγ(θ;C){S}_{\gamma}(\theta;C) based on a dataset {\cal D}=\{(X_{i},Y_{i})\}_{i=1}^{n} satisfies

sup𝒟|θSγ(θ;C)|<.\displaystyle\sup_{\cal D}|\theta^{\top}{S}_{\gamma}(\theta;C)|<\infty. (2.39)
Proof.

It is written from (2.31) that

\theta^{\top}{S}_{\gamma}(\theta;C)
=\frac{1}{n}\sum_{i=1}^{n}\bigg[\frac{\exp\{(\gamma+1)Y_{i}\theta^{\top}X_{i}\}}{1+\exp\{(\gamma+1)\theta^{\top}X_{i}\}}\bigg]^{\frac{\gamma}{\gamma+1}}\Big[Y_{i}-\frac{\exp\{(\gamma+1)\theta^{\top}X_{i}\}}{1+\exp\{(\gamma+1)\theta^{\top}X_{i}\}}\Big]\theta^{\top}X_{i}, (2.40)

which is decomposed into the sum of the positive and negative samples as

1ni=1n{𝕀(Yi=1)Ψγ(Si)𝕀(Yi=0)Ψγ(Si)}\displaystyle\frac{1}{n}\sum_{i=1}^{n}\big{\{}{\mathbb{I}}(Y_{i}=1)\Psi_{\gamma}(S_{i})-{\mathbb{I}}(Y_{i}=0)\Psi_{\gamma}(-S_{i})\big{\}}

where Si=(γ+1)θXiS_{i}=(\gamma+1)\theta^{\top}X_{i} and Ψγ(s)\Psi_{\gamma}(s) is defined in (2.35). Hence, we get

\displaystyle|\theta^{\top}{S}_{\gamma}(\theta;C)|\leq\frac{1}{n}\sum_{i=1}^{n}\big{\{}{\mathbb{I}}(Y_{i}=1)|\Psi_{\gamma}(S_{i})|+{\mathbb{I}}(Y_{i}=0)|\Psi_{\gamma}(-S_{i})|\big{\}}

which is bounded by

n1nsups|Ψγ(s)|+n0nsups|Ψγ(s)|\displaystyle\frac{n_{1}}{n}\sup_{s\in\mathbb{R}}|\Psi_{\gamma}(s)|+\frac{n_{0}}{n}\sup_{s\in\mathbb{R}}|\Psi_{\gamma}(-s)|

which is equal to $\delta_{\gamma}$, where $n_{y}=\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=y)$ for $y=0,1$. This concludes (2.39). ∎
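The boundedness asserted in Proposition 10 can be checked numerically. The sketch below, assuming the per-sample form displayed in (2.40) for a scalar margin $t=\theta^{\top}x$, computes $\theta^{\top}S_{\gamma}(x,y,\theta)$ in log-space for numerical stability and contrasts $\gamma=0.8$ (bounded) with the likelihood case $\gamma=0$ (unbounded); the function name is ours, not the book's.

```python
import numpy as np

def theta_score(t, y, gamma):
    """theta^T S_gamma for the Bernoulli logistic model, following the
    summand of (2.40) with margin t = theta^T x.  The weight factor is
    evaluated in log-space to avoid overflow for large |t|."""
    u = (gamma + 1.0) * t
    log_w = (gamma / (gamma + 1.0)) * (y * u - np.logaddexp(0.0, u))
    mu = np.exp(-np.logaddexp(0.0, -u))   # logistic mean, numerically stable
    return np.exp(log_w) * (y - mu) * t

# A negative-labelled sample pushed far into the positive side:
ml_small, ml_large = theta_score(5.0, 0, 0.0), theta_score(500.0, 0, 0.0)
g_small, g_large = theta_score(5.0, 0, 0.8), theta_score(500.0, 0, 0.8)
```

For $\gamma=0$ (the ML score) the magnitude grows linearly in the margin, while for $\gamma=0.8$ the weight factor drives it back to zero, in line with (2.39).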

The log-likelihood estimating function is given by

\displaystyle{S}_{0}(\theta;\Lambda)=\frac{1}{n}\sum_{i=1}^{n}\Big{\{}{\mathbb{I}}(Y_{i}=1)\frac{1}{1+\exp(\theta^{\top}X_{i})}-{\mathbb{I}}(Y_{i}=0)\frac{\exp(\theta^{\top}X_{i})}{1+\exp(\theta^{\top}X_{i})}\Big{\}}X_{i}. (2.41)

Hence, $|\theta^{\top}{S}_{0}(\theta;\Lambda)|$ is unbounded in $\{\theta^{\top}X_{i}:i=1,...,n\}$ since either of the two terms in (2.41) diverges to infinity as $|\theta^{\top}X_{i}|$ goes to infinity. The GM-estimating function is written as

SGM(θ,R)=1ni=1n[𝕀(Yi=1)exp{r(0)θXi}r(0)+𝕀(Yi=0)exp{r(1)θXi}r(1)]Xi.\displaystyle{S}_{\rm GM}(\theta,R)=\frac{1}{n}\sum_{i=1}^{n}\Big{[}{\mathbb{I}}(Y_{i}=1)\exp\{r(0)\theta^{\top}X_{i}\}r(0)+{\mathbb{I}}(Y_{i}=0)\exp\{r(1)\theta^{\top}X_{i}\}r(1)\Big{]}X_{i}.

This implies that |θSGM(θ;R)||\theta^{\top}{S}_{\rm GM}(\theta;R)| is unbounded.

We present a brief numerical study with two types of sampling. One is based on the covariate distribution conditional on the outcome $Y$, which is widely analyzed in case-control studies. The other is based on the conditional distribution of $Y$ given the covariate vector $X$, which is common in cohort studies. First, we consider a model of an outcome-conditional distribution. Assume that the conditional distribution of $X$ given $Y=y$ is a bivariate normal distribution ${\tt Nor}(\mu_{y},{\rm I})$, where ${\rm I}$ is the 2-dimensional identity matrix. Then, the marginal distribution of $X$ is written as $p_{1}{\tt Nor}(\mu_{1},{\rm I})+p_{0}{\tt Nor}(\mu_{0},{\rm I})$, where $p_{y}=P(Y=y)$. The conditional pmf of $Y$ given $X=x$ is given by

p(y|x,θ)=exp{y(θ1x+θ0)}1+exp(θ1x+θ0)\displaystyle p(y|x,\theta)=\frac{\exp\{y(\theta_{1}^{\top}x+\theta_{0})\}}{1+\exp(\theta_{1}^{\top}x+\theta_{0})}

due to the Bayes formula, where $\theta_{1}=\mu_{1}-\mu_{0}$ and $\theta_{0}=\frac{1}{2}(\mu_{0}^{\top}\mu_{0}-\mu_{1}^{\top}\mu_{1})+\log(p_{1}/p_{0})$. Let $N\sim{\tt Bin}(n,p_{1})$. The simulation was conducted with $N$ positive samples ($Y_{i}=1$) and $n-N$ negative samples ($Y_{i}=0$) as follows.

(a). Specified model: $\{X_{i}\}_{i=1}^{N}\sim{\tt Nor}(\mu_{1},{\rm I})$ and $\{X_{i}\}_{i=N+1}^{n}\sim{\tt Nor}(\mu_{0},{\rm I})$.

(b). Misspecified model: $\{X_{i}\}_{i=1}^{N}\sim(1-\pi)\,{\tt Nor}(\mu_{1},{\rm I})+\pi\,{\tt Nor}(\sigma\mu_{0},{\rm I})$ and $\{X_{i}\}_{i=N+1}^{n}\sim{\tt Nor}(\mu_{0},{\rm I})$.

Here parameters were set as $\mu_{1}=(0.5,0.5)^{\top}$, $\mu_{0}=-(0.5,0.5)^{\top}$, $p_{1}=0.5$ and $(\pi,\sigma)=(0.1,-4.0)$, so that $(\theta_{0},\theta_{1}^{\top})=(0.0,1.0,1.0)$. Figure 2.3 shows the plot of 103 negative samples (blue), 87 positive samples (green) and 10 positive outliers (red) on the logistic model surface $\{(x_{1},x_{2},p(1|(x_{1},x_{2}),(\theta_{0},\theta_{1}))):-3.5\leq x_{1}\leq 3.5,\ -3.5\leq x_{2}\leq 3.5\}$. The 10 positive outliers lie away from the hull of the 87 positive samples.

Refer to caption
Figure 2.3: Covariate vectors on the logistic model

We compared the ML-estimator $\hat{\theta}_{0}$, the $\gamma$-estimator $\hat{\theta}_{\gamma}$ with $\gamma=0.8$, the GM-estimator $\hat{\theta}_{\rm GM}$ and the HM-estimator $\hat{\theta}_{\rm HM}$, where the simulation was conducted with 300 replications. See Table 2.3 for the performance of the four estimators in cases (a) and (b), and Figure 2.4 for the box-whisker plot in case (b). In case (a) of the specified model, the ML-estimator was superior to the other estimators in terms of the root mean square error (rmse); however, the superiority is subtle. Next, we observe case (b) of the misspecified model, in which the conditional distribution given $Y=1$ is contaminated with a normal distribution ${\tt Nor}(\sigma\mu_{0},{\rm I})$ with mixing ratio $0.1$. Under this setting, the $\gamma$-estimator $(\gamma=0.8)$ and the HM-estimator were substantially robust, while the ML-estimator and the GM-estimator were sensitive to the misspecification. Upon closer observation, it becomes apparent that the $\gamma$-estimator $(\gamma=0.8)$ and the HM-estimator were superior to the ML-estimator and the GM-estimator in bias rather than in variance, as shown in Figure 2.4. This observation is consistent with what Proposition 10 asserts: the $\gamma$-estimator has the boundedness property if $\gamma<-1$ or $\gamma>0$. Indeed, the ML-estimator, the GM-estimator and the HM-estimator equal the $\gamma$-estimators with $\gamma=0,-1,-2$, respectively; among these, only the HM-estimator ($\gamma=-2$) satisfies the condition.
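The comparison above can be reproduced in outline. The following sketch uses hypothetical sample sizes and outlier locations of our own choosing (not the exact setting of Table 2.3): it fits the logistic model by minimizing the log-loss and the $\gamma$-loss with SciPy, initializing the $\gamma$-fit at the ML solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Scenario (b)-style data: two Gaussian classes plus gross outliers labelled y = 1
X = np.vstack([rng.normal((0.5, 0.5), 1.0, (100, 2)),     # positives
               rng.normal((-0.5, -0.5), 1.0, (100, 2)),   # negatives
               rng.normal((-3.0, -3.0), 0.3, (20, 2))])   # positive outliers
y = np.concatenate([np.ones(100), np.zeros(100), np.ones(20)])
Z = np.hstack([np.ones((X.shape[0], 1)), X])              # prepend intercept

def ml_loss(th):
    z = Z @ th
    return np.mean(np.logaddexp(0.0, z) - y * z)          # negative log-likelihood

def gamma_loss(th, gamma=0.8):
    u = (gamma + 1.0) * (Z @ th)                          # gamma-expression margin
    g = gamma / (gamma + 1.0)
    return -np.mean(np.exp(g * (y * u - np.logaddexp(0.0, u)))) / gamma

th_ml = minimize(ml_loss, np.zeros(3), method="BFGS").x
th_g = minimize(gamma_loss, th_ml, method="BFGS").x

true = np.array([0.0, 1.0, 1.0])
err_ml = np.linalg.norm(th_ml - true)
err_g = np.linalg.norm(th_g - true)
```

In this synthetic run, the gross outliers flatten the ML slope, while the $\gamma$-estimate stays near the true $(\theta_{0},\theta_{1}^{\top})=(0,1,1)$.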

Table 2.3: Comparison between the ML-estimator and the γ\gamma-estimator.

(a). The case of specified model

Method estimate rmse
ML-estimator (0.011,1.021,1.014)({0.011,1.021,1.014}) 0.3410.341
γ\gamma-estimate (0.012,1.062,1.045)({0.012,1.062,1.045}) 0.4070.407
GM-estimate (0.009,1.031,1.029)({0.009,1.031,1.029}) 0.3650.365
HM-estimate (0.013,1.051,1.037)({0.013,1.051,1.037}) 0.3900.390

(b). The case of misspecified model

Method estimate rmse
ML-estimator (0.102,0.481,0.503)({0.102,0.481,0.503}) 0.7580.758
γ\gamma-estimate (0.081,0.889,0.911)({0.081,0.889,0.911}) 0.4410.441
GM-estimate (0.161,0.428,0.452)({0.161,0.428,0.452}) 0.8390.839
HM-estimate (0.070,0.862,0.885)({-0.070,0.862,0.885}) 0.4640.464
Refer to caption
Figure 2.4: Box-whisker Plots of the ML-estimator and the γ\gamma-estimator (γ=0.8)(\gamma=0.8), GM-estimator, HM-estimator

Second, we consider a model of a covariate-conditional distribution of YY. Assume that XX follows a standard normal distribution 𝙽𝚘𝚛(0,I2){\tt Nor}(0,I_{2}) and a conditional distribution of YY given X=xX=x follows a logistic model

p(y|x,θ)=exp{y(θ1x+θ0)}1+exp(θ1x+θ0).\displaystyle p(y|x,\theta)=\frac{\exp\{y(\theta_{1}^{\top}x+\theta_{0})\}}{1+\exp(\theta_{1}^{\top}x+\theta_{0})}.

The simulation was conducted based on a scenario as follows.

(a). Specified model: $Y_{i}\,|\,(X_{i}=x)\sim{\tt Ber}(p(1|x,\theta))$.

(b). Misspecified model: $Y_{i}\,|\,(X_{i}=x)\sim(1-\epsilon)\,{\tt Ber}(p(1|x,\theta))+\epsilon\,{\tt Ber}(p(1|x,\theta_{\rm out}))$.

Here parameters were set as $(\theta_{0},\theta_{1}^{\top})=(0.0,1.0,1.0)$.

Similarly, a comparison among the ML-estimator $\hat{\theta}_{0}$, the $\gamma$-estimator $\hat{\theta}_{\gamma}$ with $\gamma=0.8$, the GM-estimator $\hat{\theta}_{\rm GM}$ and the HM-estimator $\hat{\theta}_{\rm HM}$ was conducted with $100$ replications. See Table 2.4. In case (a), the ML-estimator was slightly superior to the other estimators. In case (b), the $\gamma$-estimator $(\gamma=0.8)$ and the HM-estimator were more robust, while the ML-estimator and the GM-estimator were sensitive, the same tendency as in the outcome-conditional model.

Table 2.4: Comparison between the ML-estimator and the γ\gamma-estimator.

(a). The case of specified model

Method estimate rmse
ML-estimator (0.011,1.021,1.014)({0.011,1.021,1.014}) 0.3410.341
γ\gamma-estimate (0.012,1.062,1.045)({0.012,1.062,1.045}) 0.4070.407
GM-estimate (0.009,1.031,1.029)({0.009,1.031,1.029}) 0.3650.365
HM-estimate (0.013,1.051,1.037)({0.013,1.051,1.037}) 0.3900.390

(b). The case of misspecified model

Method estimate rmse
ML-estimator (0.102,0.481,0.503)({0.102,0.481,0.503}) 0.7580.758
γ\gamma-estimate (0.081,0.889,0.911)({0.081,0.889,0.911}) 0.4410.441
GM-estimate (0.161,0.428,0.452)({0.161,0.428,0.452}) 0.8390.839
HM-estimate (0.070,0.862,0.885)({-0.070,0.862,0.885}) 0.4640.464

2.6 Multiclass logistic regression

We consider a situation where an outcome variable YY has a value in 𝒴={0,,k}{\mathcal{Y}}=\{0,...,k\} and a covariate XX with a value in a subset 𝒳{\mathcal{X}} of d\mathbb{R}^{d}. The probability distribution is given by a probability mass function (pmf)

p(y,π)=j=0kπj,𝕀(y=j)\displaystyle p(y,\pi)=\prod_{j=0}^{k}\pi_{j}{}^{{\mathbb{I}}(y=j)},

which is referred to as the categorical distribution ${\tt Cat}(\pi)$, where $\pi=(\pi_{j})_{j=1}^{k}$ is the probability vector $(P(Y=j))_{j=1}^{k}$, with $\pi_{0}=1-\sum_{j=1}^{k}\pi_{j}$.

Remark 2.

We begin with a simple case of estimating $\pi$ without any covariates. Let $\{Y_{i}\}_{1\leq i\leq n}$ be a random sample drawn from ${\tt Cat}(\pi)$. Then, the estimators discussed here equal the observed frequency vector as follows. First of all, the ML-estimator is the observed frequency vector with components $(\hat{\pi}_{0},...,\hat{\pi}_{k})$, where $\hat{\pi}_{j}=(1/n)\sum_{i=1}^{n}{\mathbb{I}}(Y_{i}=j)$. Next, the $\gamma$-loss function

Lγ(π,C)=1n1γi=1nj=0kπjγ𝕀(Yi=j)(j=0kπj)γ+1γγ+1\displaystyle L_{\gamma}(\pi,C)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\frac{\prod_{j=0}^{k}\pi_{j}{}^{\gamma{\mathbb{I}}(Y_{i}=j)}}{\big{(}\sum_{j=0}^{k}\pi_{j}{}^{\gamma+1}\big{)}^{\frac{\gamma}{\gamma+1}}}

is written as $-\frac{1}{\gamma}\sum_{j=0}^{k}\hat{\pi}_{j}\pi_{j}^{\gamma}/\big(\sum_{j=0}^{k}\pi_{j}^{\gamma+1}\big)^{\frac{\gamma}{\gamma+1}}$. We observe

Lγ(π,C)=Dγ(𝙲𝚊𝚝(π^),𝙲𝚊𝚝(π))\displaystyle L_{\gamma}(\pi,C)=D_{\gamma}({\tt Cat}(\hat{\pi}),{\tt Cat}(\pi))

up to a constant. Therefore, the $\gamma$-estimator for $\pi$ is equal to $\hat{\pi}$ for all $\gamma$. Similarly, the $\beta$-estimator is equal to $\hat{\pi}$ for all $\beta$. However, the $\alpha$-estimator does not satisfy this except in the limit of $\alpha$ to $0$, that is, the ML-estimator.
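Remark 2 can be verified numerically: minimizing the $\gamma$-loss over the probability simplex recovers the observed frequency vector. A minimal sketch (the softmax parametrization of the simplex is our own device, and the frequencies are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

pi_hat = np.array([0.5, 0.3, 0.2])   # observed frequency vector (k = 2)

def gamma_loss(logits, gamma=0.8):
    """-(1/gamma) sum_j pihat_j pi_j^gamma / (sum_j pi_j^{gamma+1})^{gamma/(gamma+1)}."""
    p = softmax(logits)
    return -(pi_hat @ p**gamma) / (p**(gamma + 1.0)).sum()**(gamma / (gamma + 1.0)) / gamma

pi_opt = softmax(minimize(gamma_loss, np.zeros(3), method="BFGS").x)
```

The minimizer agrees with $\hat{\pi}$, since the $\gamma$-cross entropy $H_{\gamma}(\hat{\pi},\pi)$ is minimized exactly when $\pi\propto\hat{\pi}$.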

We return to the discussion of the regression model with a covariate vector $X$. A multiclass logistic regression model is defined by a softmax function as the link function from the systematic component $\eta$ to the random component. The conditional pmf given $X=x$ is given by

p(y|x,θ)={11+j=1kexp(ηj) if y=0,exp(ηy)1+j=1kexp(ηj) if y=1,,k,\displaystyle p(y|x,\theta)=\left\{\begin{array}[]{cl}\displaystyle{\frac{1}{1+\sum_{j=1}^{k}\exp(\eta_{j})}}&\text{ if }y=0,\\[14.22636pt] \displaystyle{\frac{\exp(\eta_{y})}{1+\sum_{j=1}^{k}\exp(\eta_{j})}}&\text{ if }y=1,...,k\end{array}\right., (2.44)

which is referred to as a multinomial logistic model, where $\theta=(\theta_{1},...,\theta_{k})^{\top}$ and $\eta_{j}=\theta_{j}^{\top}x$. The KL-divergence between categorical distributions is given by

D0(𝙲𝚊𝚝(π),𝙲𝚊𝚝(ρ))=j=0kπjlogπjρj.\displaystyle D_{0}({\tt Cat}(\pi),{\tt Cat}(\rho))=\sum_{j=0}^{k}\pi_{j}\log\frac{\pi_{j}}{\rho_{j}}.

For a given dataset {(Xi,Yi)}i=1,,n\{(X_{i},Y_{i})\}_{i=1,...,n}, the negative log-likelihood function is given by

\displaystyle L_{0}(\theta;C)=-\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\theta_{Y_{i}}^{\top}X_{i}-\log\Big{\{}1+\sum_{j=1}^{k}\exp(\theta_{j}^{\top}X_{i})\Big{\}}\bigg{]}

where we set $\theta_{y}=0$ if $y=0$. The $j$-th component of the likelihood equation is written as

S0(θ;C)j=1ni=1n{𝕀(Yi=j)exp(θjXi)1+l=1kexp(θlXi)}Xi=0.\displaystyle{S}_{0}{}_{j}(\theta;C)=-\frac{1}{n}\sum_{i=1}^{n}\Big{\{}{\mathbb{I}}(Y_{i}=j)-\frac{\exp(\theta_{j}{}^{\top}X_{i})}{1+\sum_{l=1}^{k}\exp(\theta_{l}{}^{\top}X_{i})}\Big{\}}X_{i}=0.

for j=1,,kj=1,...,k. The γ\gamma-divergence is given by

\displaystyle D_{\gamma}({\tt Cat}(\pi),{\tt Cat}(\rho))=-\frac{1}{\gamma}\frac{\sum_{j=0}^{k}\pi_{j}\rho_{j}^{\gamma}}{\big{(}\sum_{j=0}^{k}\rho_{j}^{\gamma+1}\big{)}^{\frac{\gamma}{\gamma+1}}}+\frac{1}{\gamma}\bigg{(}\sum_{j=0}^{k}\pi_{j}^{\gamma+1}\bigg{)}^{\frac{1}{\gamma+1}}.

We remark that the γ\gamma-expression defined in (2.12) is given by

p(γ)(y|x,θ)=p(y|x,(γ+1)θ),\displaystyle p^{(\gamma)}(y|x,\theta)=p(y|x,(\gamma+1)\theta),

where $p(y|x,\theta)$ is in the multinomial logistic model (2.44). Hence, the $\gamma$-loss function is given by

\displaystyle L_{\gamma}(\theta;C)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\{p(Y_{i}|X_{i},(\gamma+1)\theta)\}^{\frac{\gamma}{\gamma+1}}

and the γ\gamma-estimating equation is written as

\displaystyle{S}_{\gamma j}(\theta;C):=\frac{1}{n}\sum_{i=1}^{n}{S}_{\gamma j}(X_{i},Y_{i},\theta;C)=0,

where ${S}_{\gamma j}(X,Y,\theta;C)$ is defined by

{p(Y|X,(γ+1)θ)}γγ+1{𝕀(Y=j)p(j|X,(γ+1)θ)}X.\displaystyle\{p(Y|X,(\gamma+1)\theta)\}^{\frac{\gamma}{\gamma+1}}\big{\{}{\mathbb{I}}(Y=j)-p(j|X,(\gamma+1)\theta)\big{\}}X. (2.45)
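As a sanity check on (2.45), the following sketch implements the multinomial logistic pmf (2.44) and the per-sample $\gamma$-estimating function; at $\gamma\approx 0$ it reduces to the ordinary likelihood score $\{{\mathbb{I}}(Y=j)-p(j|X,\theta)\}X$. The function names and parameter values are our own.

```python
import numpy as np

def probs(x, theta):
    """p(y|x,theta) of (2.44); theta has one row per class j = 1..k, class 0 is baseline."""
    eta = np.concatenate([[0.0], theta @ x])   # eta_0 = 0
    e = np.exp(eta - eta.max())                # stabilized softmax
    return e / e.sum()

def gamma_score(x, y_lab, j, theta, gamma):
    """j-th component S_{gamma j}(x, y, theta) of (2.45)."""
    p = probs(x, (gamma + 1.0) * theta)
    return p[y_lab]**(gamma / (gamma + 1.0)) * ((y_lab == j) - p[j]) * x

x = np.array([1.0, -0.5])
theta = np.array([[0.3, -0.2], [0.1, 0.4]])    # k = 2 non-baseline classes
ml = (1.0 - probs(x, theta)[1]) * x            # likelihood score for j = 1, Y = 1
near_ml = gamma_score(x, 1, 1, theta, 1e-8)    # gamma near 0 recovers it
```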

The GM-divergence between categorical distributions is given by

DGM(𝙲𝚊𝚝(π),𝙲𝚊𝚝(ρ);R)=y=0kπyρyr(y)y=0kρyr(y)y=0kπyr(y),\displaystyle D_{\rm GM}({\tt Cat}(\pi),{\tt Cat}(\rho);R)=\sum_{y=0}^{k}\frac{\pi_{y}}{\rho_{y}}r(y)\prod_{y=0}^{k}\rho_{y}^{r(y)}-\prod_{y=0}^{k}\pi_{y}^{r(y)},

where the reference distribution RR is chosen by 𝙲𝚊𝚝(r){\tt Cat}(r). Hence, the GM-loss function is given by

\displaystyle L_{\rm GM}(\theta;R)=\frac{1}{n}\sum_{i=1}^{n}r(Y_{i})\exp\{(\bar{\theta}_{R}-\theta_{Y_{i}})^{\top}X_{i}\},

where $\bar{\theta}_{R}=\sum_{j=1}^{k}r(j)\theta_{j}$. We will see later that the GM-loss is closely related to the exponential loss in the multiclass AdaBoost algorithm. The GM-estimating function is given by

\displaystyle{S}_{{\rm GM}\,j}(\theta;R)=\frac{1}{n}\sum_{i=1}^{n}r(Y_{i})\exp\{(\bar{\theta}_{R}-\theta_{Y_{i}})^{\top}X_{i}\}\{r(Y_{i})-{\mathbb{I}}(Y_{i}=j)\}X_{i}.

Finally, the HM-divergence is

DHM(𝙲𝚊𝚝(π),𝙲𝚊𝚝(ρ))=y=0kπy(1ρy)2y=0kπy(1πy)2.\displaystyle D_{\rm HM}({\tt Cat}(\pi),{\tt Cat}(\rho))=\sum_{y=0}^{k}\pi_{y}(1-\rho_{y})^{2}-\sum_{y=0}^{k}\pi_{y}(1-\pi_{y})^{2}.

The HM-loss function is derived as

\displaystyle L_{\rm HM}(\theta)=\frac{1}{2n}\sum_{i=1}^{n}\{p(Y_{i}|X_{i},-\theta)\}^{2}

for the logistic model (2.44), noting $p^{(-2)}(y|x,\theta)=p(y|x,-\theta)$. This is the sum of the squared probabilities of the inverse labels. Hence, the HM-estimating function is written as

SHM(θ)=1ni=1np(Yi|Xi,θ)θp(Yi|Xi,θ).\displaystyle{S}_{\rm HM}(\theta)=\frac{1}{n}\sum_{i=1}^{n}p(Y_{i}|X_{i},-\theta)\frac{\partial}{\partial\theta}p(Y_{i}|X_{i},-\theta).
Refer to caption
Figure 2.5: Plots of contours of KL, GM and HM divergence measures.

Let us have a brief look at the behavior of the $\gamma$-estimating function ${S}_{\gamma}(\theta;C)$ in the presence of misspecification of the parametric model in the multiclass logistic distribution (2.44). Basically, most of the properties are similar to those in the Bernoulli logistic model.

Proposition 11.

Consider the γ\gamma-estimating function under a multiclass logistic model (2.44). Assume γ>0\gamma>0 or γ<1\gamma<-1. Then,

\sup_{x\in{\mathcal{X}}}|\theta_{j}^{\top}{S}_{\gamma j}(\theta;C)|\leq\delta_{\gamma j}, (2.46)

where

\displaystyle\delta_{\gamma j}=\sup_{s\in\mathbb{R}^{k}}\ \sum_{l\neq j}\frac{|s_{j}|}{|\gamma+1|}[\{f_{j}(s)\}^{\frac{\gamma}{\gamma+1}}f_{l}(s)+\{f_{l}(s)\}^{\frac{\gamma}{\gamma+1}}f_{j}(s)] (2.47)

with fj(s)=exp(sj)/{1+l=1kexp(sl)}.f_{j}(s)=\exp(s_{j})/\{1+\sum_{l=1}^{k}\exp(s_{l})\}.

Proof.

We confirm that $\delta_{\gamma j}$ is finite if $\gamma<-1$ or $\gamma>0$. It is written from (2.45) that

|θjSγ(θ;C)j|1ni=1n[{p(j|Xi,(γ+1)θ)}γγ+1{1p(j|Xi,(γ+1)θ)}\displaystyle\big{|}\theta_{j}^{\top}{S}_{\gamma}{}_{j}(\theta;C)\big{|}\leq\frac{1}{n}\sum_{i=1}^{n}\bigg{[}\{p(j|X_{i},(\gamma+1)\theta)\}^{\frac{\gamma}{\gamma+1}}\big{\{}1-p(j|X_{i},(\gamma+1)\theta)\big{\}}
+ljp(l|Xi,(γ+1)θ)γγ+1p(j|Xi,(γ+1)θ)]|θjXi|.\displaystyle\hskip 73.97716pt+\sum_{l\neq j}p(l|X_{i},(\gamma+1)\theta)^{\frac{\gamma}{\gamma+1}}p(j|X_{i},(\gamma+1)\theta)\bigg{]}|\theta_{j}^{\top}X_{i}|. (2.48)

Hence, writing $S_{ij}=(\gamma+1)\theta_{j}^{\top}X_{i}$, we have

\displaystyle\big{|}\theta_{j}^{\top}{S}_{\gamma j}(\theta;C)\big{|}\leq\frac{1}{n}\sum_{i=1}^{n}\sum_{l\neq j}\bigg{[}\{f_{j}(S_{i})\}^{\frac{\gamma}{\gamma+1}}f_{l}(S_{i})+\{f_{l}(S_{i})\}^{\frac{\gamma}{\gamma+1}}f_{j}(S_{i})\bigg{]}\frac{|S_{ij}|}{|\gamma+1|},

where $S_{i}=(S_{i1},...,S_{ik})$. Taking the supremum of the right-hand side over all $S_{i}$'s concludes (2.46). ∎

The $j$-th linear predictor is written as $\theta_{j}^{\top}x=\theta_{1j}^{\top}x_{1}+\theta_{0j}$ with slope vector $\theta_{1j}$ and intercept $\theta_{0j}$. The $j$-th decision boundary is given by

\displaystyle{{H}(\theta_{j})}=\{x\in{\mathcal{X}}:\theta_{1j}^{\top}x_{1}+\theta_{0j}=0\}. (2.49)

In the context of prediction, a predictor of the label $Y$ based on a given feature vector $x$ is given by

f(x)=argmaxy𝒴θyx,\displaystyle f(x)=\mathop{\rm argmax}_{y\in{\mathcal{Y}}}\theta_{y}^{\top}x,

which is equal to the Bayes rule under the multiclass logistic model, where $\theta_{0}=0$ in the parametrization as in (2.44). We observe, through a discussion similar to Proposition 9 for the Bernoulli logistic model, that $\theta_{j}^{\top}\mathbb{E}_{Q}[{S}_{\gamma j}(\theta,X,Y)|X=x]$ is uniformly bounded in $x\in{\mathcal{X}}$ even under any misspecified distribution $Q$ outside the parametric model. Therefore, we conclude that the $\gamma$-estimator has such a stable behavior for the whole sample $\{(X_{i},Y_{i})\}_{1\leq i\leq n}$ if $\gamma$ is in the range $(-\infty,-1)\cup(0,\infty)$. The ML-estimator and the GM-estimator equal the $\gamma$-estimators with $\gamma=0$ and $\gamma=-1$, respectively. Both values lie outside this range, which suggests that they suffer from unboundedness.

We next study ordinal regression, also known as ordinal classification. Consider an ordinal outcome YY having values in 𝒴={0,,k}{\cal Y}=\{0,...,k\}. The probability of YY falling into a certain category yy or lower is modeled as

(Yy|X,θ)=exp(θ0y+θ1X)1+exp(θ0y+θ1X)\displaystyle{\mathbb{P}}(Y\leq y|X,\theta)=\frac{\exp(\theta_{0y}+\theta_{1}^{\top}X)}{1+\exp(\theta_{0y}+\theta_{1}^{\top}X)} (2.50)

for $y=0,...,k-1$, where $\theta=(\theta_{00},...,\theta_{0,k-1},\theta_{1})$. The model (2.50) is referred to as the ordinal logistic model, noting ${\mathbb{P}}(Y\leq y|X,\theta)=F(\theta_{0y}+\theta_{1}^{\top}X)$ with the logistic distribution function $F(z)=\exp(z)/(1+\exp(z))$. Here, the thresholds are assumed to satisfy $\theta_{00}\leq\cdots\leq\theta_{0,k-1}$ to ensure that the probability statement (2.50) makes sense. Each threshold $\theta_{0y}$ effectively sets a boundary point on the latent continuous scale, beyond which the likelihood of higher category outcomes increases. The difference between consecutive thresholds also gives insight into the "distance" or discrimination between adjacent categories on the latent scale, governed by the predictors.

For given $n$ observations $\{(X_{i},Y_{i})\}_{i=1}^{n}$, the negative log-likelihood function is

L0(θ)=i=1nlogp(Yi|Xi,θ)\displaystyle L_{0}(\theta)=-\sum_{i=1}^{n}\log p(Y_{i}|X_{i},\theta)

where $p(y|x,\theta)=F(\theta_{0y}+\theta_{1}^{\top}x)-F(\theta_{0,y-1}+\theta_{1}^{\top}x)$. Similarly, the $\gamma$-loss function can be given in a straightforward manner. However, these loss functions seem complicated since the conditional probability $p(y|x,\theta)$ is introduced indirectly as a difference between the cumulative distribution functions $F(\theta_{0y}+\theta_{1}^{\top}x)$'s.

To address this issue, each threshold is treated as a separate binarized response, effectively turning the ordinal regression problem into multiple binary regression problems. Let $P(y)$ and $F(y)$ be cumulative distribution functions on $\cal Y$. We define a dichotomized cross entropy

\displaystyle H_{0}^{\rm(d)}(P,F)=\sum_{y=0}^{k}\big[P(y)\log F(y)+(1-P(y))\log(1-F(y))\big].

This is a sum of cross entropies between the Bernoulli distributions ${\tt Ber}(P(y))$ and ${\tt Ber}(F(y))$. The KL divergence is given as $D_{0}^{\rm(d)}(P,F)=H_{0}^{\rm(d)}(P,P)-H_{0}^{\rm(d)}(P,F)$. Thus, the dichotomized log-likelihood function is given by

L0(d)(θ)=i=1ny=0kZiylogF(θ0y+θXi)+(1Ziy)log{1F(θ0y+θXi)},\displaystyle L_{0}^{\rm(d)}(\theta)=\sum_{i=1}^{n}\sum_{y=0}^{k}Z_{iy}\log F(\theta_{0y}+\theta^{\top}X_{i})+(1-Z_{iy})\log\{1-F(\theta_{0y}+\theta^{\top}X_{i})\},

where Ziy=I(Yiy)Z_{iy}={\rm I}(Y_{i}\leq y). Note 𝔼[L0(d)(θ)]=H0(d)(P,F(,θ))\mathbb{E}[L_{0}^{\rm(d)}(\theta)]=H_{0}^{\rm(d)}(P,F(\cdot,\theta)), where 𝔼\mathbb{E} denotes the expectation under the distribution PP and F(y,θ)=F(θ0y+θx)F(y,\theta)=F(\theta_{0y}+\theta^{\top}x). Under the ordinal logistic model (2.50),

\displaystyle L_{0}^{\rm(d)}(\theta)=\sum_{i=1}^{n}\sum_{y=0}^{k}\left[Z_{iy}(\theta_{0y}+\theta^{\top}X_{i})-\log\{1+\exp(\theta_{0y}+\theta^{\top}X_{i})\}\right].

On the other hand, the dichotomized γ\gamma-loss function is given by

Lγ(d)(θ)=1γi=1ny=0k[Ziy{F(γ)(θ0y+θXi)}γγ+1+(1Ziy){1F(γ)(θ0y+θXi)}γγ+1],\displaystyle L_{\gamma}^{\rm(d)}(\theta)=-\frac{1}{\gamma}\sum_{i=1}^{n}\sum_{y=0}^{k}\left[Z_{iy}\{F^{(\gamma)}(\theta_{0y}+\theta^{\top}X_{i})\}^{\frac{\gamma}{\gamma+1}}+(1-Z_{iy})\{1-F^{(\gamma)}(\theta_{0y}+\theta^{\top}X_{i})\}^{\frac{\gamma}{\gamma+1}}\right],

where F(γ)(A)F^{(\gamma)}(A) is the γ\gamma-expression for F(A)F(A), that is,

F(γ)(A)={F(A)}γ+1F(A)γ+1+{1F(A)}γ+1.\displaystyle F^{(\gamma)}(A)=\frac{\{F(A)\}^{\gamma+1}}{F(A)^{\gamma+1}+\{1-F(A)\}^{\gamma+1}}.

Under the ordinal logistic model (2.50),

\displaystyle L_{\gamma}^{\rm(d)}(\theta)=-\frac{1}{\gamma}\sum_{i=1}^{n}\sum_{y=0}^{k}\left\{\frac{\exp\{Z_{iy}(\gamma+1)(\theta_{0y}+\theta^{\top}X_{i})\}}{1+\exp\{(\gamma+1)(\theta_{0y}+\theta^{\top}X_{i})\}}\right\}^{\frac{\gamma}{\gamma+1}}.

If γ\gamma is taken a limit to 1-1, then it is reduced the GM-loss function

LGM(d)(θ,C)=i=1ny=0kexp{(12Ziy)(θ0y+θXi)};\displaystyle L_{\rm GM}^{\rm(d)}(\theta,C)=\sum_{i=1}^{n}\sum_{y=0}^{k}\exp\{(\small\mbox{$\frac{1}{2}$}-Z_{iy})(\theta_{0y}+\theta^{\top}X_{i})\};

and if $\gamma=-2$, it reduces to the HM-loss function

LHM(d)(θ)=12i=1ny=0k{exp{(1Ziy)(θ0y+θXi)}1+exp(θ0y+θXi)}2.\displaystyle L_{\rm HM}^{\rm(d)}(\theta)=\frac{1}{2}\sum_{i=1}^{n}\sum_{y=0}^{k}\left\{\frac{\exp\{(1-Z_{iy})(\theta_{0y}+\theta^{\top}X_{i})\}}{1+\exp(\theta_{0y}+\theta^{\top}X_{i})}\right\}^{2}.
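The dichotomization above can be checked in code. The sketch below builds the binarized responses $Z_{iy}={\rm I}(Y_{i}\leq y)$ for a toy ordinal dataset and confirms that the generic dichotomized log-likelihood with logistic $F$ agrees with its simplified closed form; the threshold values and data are illustrative, and the sum runs over $y=0,...,k-1$ since $P(Y\leq k)=1$ contributes nothing.

```python
import numpy as np

F = lambda z: 1.0 / (1.0 + np.exp(-z))         # logistic cdf

th0 = np.array([-1.0, 0.0, 1.0])               # thresholds theta_{0y}, y = 0,1,2 (k = 3)
slope = np.array([0.5])
Xs = np.array([[-2.0], [0.3], [1.5], [0.0]])   # toy covariates
Ys = np.array([0, 1, 3, 2])                    # ordinal outcomes in {0,...,3}

A = th0[None, :] + Xs @ slope[:, None]         # theta_{0y} + theta^T x_i, shape (n, k)
Z = (Ys[:, None] <= np.arange(3)[None, :])     # Z_{iy} = I(Y_i <= y)

# generic dichotomized log-likelihood vs. its logistic closed form
generic = np.sum(Z * np.log(F(A)) + (~Z) * np.log(1.0 - F(A)))
simplified = np.sum(Z * A - np.log1p(np.exp(A)))
```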
Remark 3.

Let us discuss an extension of the dichotomized loss functions to a setting where the outcome space $\cal Y$ is a subset of $\mathbb{R}^{d}$. Consider a partition of ${\cal Y}$ such that ${\cal Y}=\oplus_{j=1}^{k}B_{j}$. Then, the model is reduced to a categorical distribution ${\tt Cat}(\pi(x,\theta))$, where $\pi(x,\theta)=(\pi_{1}(x,\theta),...,\pi_{k}(x,\theta))$ with $\pi_{j}(x,\theta)=\int_{B_{j}}p(y|x,\theta){\rm d}\Lambda(y)$. The cross entropy is reduced to

H0(d)(π,π(x,θ))=j=1kπjlogπj(x,θ)\displaystyle H_{0}^{\rm(d)}(\pi,\pi(x,\theta))=\sum_{j=1}^{k}\pi_{j}\log\pi_{j}(x,\theta)

and the negative log-likelihood function is reduced to

\displaystyle L_{0}^{\rm(d)}(\theta)=-\sum_{i=1}^{n}\sum_{j=1}^{k}{\rm I}(Y_{i}\in B_{j})\log\pi_{j}(X_{i},\theta).

Similarly, the γ\gamma-cross entropy is reduced to

\displaystyle H_{\gamma}^{\rm(d)}(\pi,\pi(x,\theta))=-\frac{1}{\gamma}\sum_{j=1}^{k}\pi_{j}\left\{\frac{\pi_{j}(x,\theta)^{\gamma+1}}{\sum_{j^{\prime}=1}^{k}\pi_{j^{\prime}}(x,\theta)^{\gamma+1}}\right\}^{\frac{\gamma}{\gamma+1}}

and the γ\gamma-loss function is reduced to

\displaystyle L_{\gamma}^{\rm(d)}(\theta)=-\frac{1}{\gamma}\sum_{i=1}^{n}\sum_{j=1}^{k}{\rm I}(Y_{i}\in B_{j})\left\{\frac{\pi_{j}(X_{i},\theta)^{\gamma+1}}{\sum_{j^{\prime}=1}^{k}\pi_{j^{\prime}}(X_{i},\theta)^{\gamma+1}}\right\}^{\frac{\gamma}{\gamma+1}}.

There are several parametric models similar to the present one, including ordered probit models, the continuation-ratio model and the adjacent-categories logit model. The coefficients in ordinal regression models tell us about the change in the odds of being in a higher-ordered category as the predictor increases. Importantly, because of the ordered nature of the outcomes, the interpretation of these coefficients is tied not just to changes between specific categories but to changes across the order of categories. Ordinal regression is useful in fields like the social sciences, marketing, and health sciences, where rating scales (such as agreement, satisfaction, or pain scales) are common and the assumption of equidistant categories is not reasonable. This method respects the order within the categories, which would be ignored in standard multiclass approaches.

2.7 Poisson regression model

The Poisson regression model is a member of the generalized linear model (GLM) family, which is typically used for count data. When the outcome variable is a count (i.e., the number of times an event occurs), the Poisson regression model is a suitable approach to analyze the relationship between the count and explanatory variables. The key assumptions behind the Poisson regression model are that the mean and variance of the outcome variable are equal and that the observations are independent of each other. The primary objective of Poisson regression is to model the expected count of an event occurring given a set of explanatory variables. The model provides a framework to estimate the log rate of events, which can be back-transformed to provide an estimate of the event count at different levels of the explanatory variables.

Let YY be a response variable having a value in 𝒴={0,1,}{\mathcal{Y}}=\{0,1,...\} and XX be a covariate variable with a value in a subset 𝒳{\mathcal{X}} of d\mathbb{R}^{d}. A Poisson distribution 𝙿𝚘(λ){\tt Po}(\lambda) with an intensity parameter λ\lambda has a probability mass function (pmf) given by

p(y,λ)=λyy!exp(λ)\displaystyle p(y,\lambda)=\frac{\lambda^{y}}{y!}\exp(-\lambda)

for $y$ of $\mathcal{Y}$. A Poisson regression model for a count $Y$ given $X=x$ is defined by the probability distribution $P(\cdot|x,\theta)$ with pmf

p(y|x,θ)=1y!exp{yθxexp(θx)}.\displaystyle p(y|x,\theta)=\frac{1}{y!}\exp\{y\theta^{\top}x-\exp(\theta^{\top}x)\}. (2.51)

The link function relating the regression function to the canonical variable is the logarithmic function, $g(\lambda)=\log\lambda$, in which case (2.51) is referred to as a log-linear model. The likelihood principle gives the negative log-likelihood function

L0(θ)=1ni=1n{YiθXiexp(θXi)logYi!}.\displaystyle L_{0}(\theta)=-\frac{1}{n}\sum_{i=1}^{n}\{Y_{i}\theta^{\top}X_{i}-\exp(\theta^{\top}X_{i})-\log Y_{i}!\}.

for a given dataset $\{(X_{i},Y_{i})\}_{i=1}^{n}$. Here the term $\log Y_{i}!$ can be neglected since it is constant in $\theta$. In effect, the estimating function is given by

S0(θ)=1ni=1n{Yiexp(θXi)}Xi.\displaystyle{S}_{0}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\{Y_{i}-\exp(\theta^{\top}X_{i})\}X_{i}.
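A quick way to validate the pair $(L_{0},S_{0})$ is to check that $S_{0}$ is the negative gradient of $L_{0}$; a minimal sketch with synthetic data of our own choosing (the constant $\log Y_{i}!$ is dropped):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([np.ones((50, 1)), rng.normal(0, 0.5, (50, 2))])  # intercept + 2 covariates
theta_true = np.array([0.5, 1.0, -0.5])
Y = rng.poisson(np.exp(X @ theta_true))

def L0(th):
    # negative log-likelihood, dropping the log Y! constant
    return -np.mean(Y * (X @ th) - np.exp(X @ th))

def S0(th):
    # estimating function (1/n) sum_i (Y_i - exp(theta^T X_i)) X_i
    return X.T @ (Y - np.exp(X @ th)) / len(Y)

# central finite differences of L0 should equal -S0
th = np.array([0.3, 0.8, -0.2])
eps = 1e-6
num_grad = np.array([(L0(th + eps * e) - L0(th - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
```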

We see from the general theory of likelihood methods that the ML-estimator is consistent for $\theta$.

Next, we consider the γ\gamma-divergence and its applications to the Poisson model. For this, we fix a reference measure as R=𝙿𝚘(μ)R={\tt Po}(\mu). Then, the RN-derivative of a conditional probability measure P(|x,θ)P(\cdot|x,\theta) with respect to RR is given by

\displaystyle\frac{{\rm d}P(y|x,\theta)}{{\rm d}R}=\mu^{-y}\exp\{y\theta^{\top}x+\mu-\exp(\theta^{\top}x)\}

and hence the γ\gamma-expression for this is given by

p(γ)(y|x,θ)=exp[(γ+1)yθxexp{(γ+1)θx}].\displaystyle p^{(\gamma)}(y|x,\theta)=\exp[(\gamma+1)y\theta^{\top}x-\exp\{(\gamma+1)\theta^{\top}x\}].

The γ\gamma-cross entropy between Poisson distribution is given by

\hskip 8.53581ptH_{\gamma}(P(\cdot|x,\theta_{0}),P(\cdot|x,\theta_{1});R)
\displaystyle=-\frac{1}{\gamma}\exp\Big{[}\exp(\theta^{\top}_{0}x+\gamma\theta^{\top}_{1}x)-\exp(\theta_{0}^{\top}x)-\frac{\gamma}{\gamma+1}\exp\{(\gamma+1)\theta^{\top}_{1}x\}\Big{]},

where $R$ is the reference measure defined by $R(y)=1/y!$ for $y=0,1,\ldots$. Note that this choice of $R$ enables such a tractable form of the entropy. Hence, the $\gamma$-loss function is given by

\displaystyle L_{\gamma}(\theta;R)=-\frac{1}{n}\frac{1}{\gamma}\sum_{i=1}^{n}\exp\Big{[}\gamma Y_{i}\theta^{\top}X_{i}-\frac{\gamma}{\gamma+1}\exp\{(\gamma+1)\theta^{\top}X_{i}\}\Big{]}.

The estimating function is given by

Sγ(θ;R)=1ni=1nSγ(θ,Xi,Yi;R),\displaystyle{S}_{\gamma}(\theta;R)=\frac{1}{n}\sum_{i=1}^{n}{S}_{\gamma}(\theta,X_{i},Y_{i};R),

where

\displaystyle{S}_{\gamma}(\theta,X,Y;R)=w_{\gamma}(X,Y,\theta)[Y-\exp\{(\gamma+1)\theta^{\top}X\}]X. (2.52)

where

\displaystyle w_{\gamma}(X,Y,\theta)=\exp\Big{[}\gamma Y\theta^{\top}X-\frac{\gamma}{\gamma+1}\exp\{(\gamma+1)\theta^{\top}X\}\Big{]}.
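The weight $w_{\gamma}$ and the estimating function (2.52) can be implemented directly; the sketch below verifies by finite differences that $S_{\gamma}$ is the negative gradient of the $\gamma$-loss $L_{\gamma}$. Data and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.hstack([np.ones((40, 1)), rng.normal(0, 0.4, (40, 1))])
Y = rng.poisson(np.exp(X @ np.array([0.5, 1.0])))
gamma = 0.05

def w(th):
    """Weight w_gamma(X, Y, theta) for every sample."""
    u = X @ th
    return np.exp(gamma * Y * u - gamma / (gamma + 1) * np.exp((gamma + 1) * u))

def L(th):   # gamma-loss
    return -np.mean(w(th)) / gamma

def S(th):   # gamma-estimating function (2.52), averaged over the sample
    u = X @ th
    return X.T @ (w(th) * (Y - np.exp((gamma + 1) * u))) / len(Y)

th = np.array([0.4, 0.8])
eps = 1e-6
num = np.array([(L(th + eps * e) - L(th - eps * e)) / (2 * eps) for e in np.eye(2)])
```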

We investigate the boundedness property of the estimating function.

Proposition 12.

Let

Φγ(x,y,θ)=θSγ(θ,x,y;R),\displaystyle\Phi_{\gamma}(x,y,\theta)=\theta^{\top}{S}_{\gamma}(\theta,x,y;R),

where ${S}_{\gamma}(\theta,x,y;R)$ is the estimating function defined in (2.52). Then, if $\gamma>0$,

supx𝒳|Φγ(x,y,θ)|<\displaystyle\sup_{x\in{\cal X}}|\Phi_{\gamma}(x,y,\theta)|<\infty

for any fixed y𝒴y\in{\cal Y}.

Proof.

By definition, we have

\displaystyle|\Phi_{\gamma}(x,y,\theta)|\leq|\omega|e^{\gamma y\omega-\frac{\gamma}{\gamma+1}e^{(\gamma+1)\omega}}\big{(}y+e^{(\gamma+1)\omega}\big{)},

where ω=θx\omega=\theta^{\top}x. The following limit holds for positive constants c1,c2,c3c_{1},c_{2},c_{3}:

lim|ω||ω|ec1ωc2ec3ω=0\displaystyle\lim_{|\omega|\rightarrow\infty}|\omega|e^{c_{1}\omega-c_{2}e^{c_{3}\omega}}=0 (2.53)

Thus, we immediately observe

lim|ω||ω|eγyωγγ+1e(γ+1)ω(y+e(γ+1)ω)=0\displaystyle\lim_{|\omega|\rightarrow\infty}|\omega|e^{\gamma y\omega-\frac{\gamma}{\gamma+1}e^{(\gamma+1)\omega}}\big{(}y+e^{(\gamma+1)\omega}\big{)}=0

due to (2.53). This concludes that |Φγ(x,y,θ)||\Phi_{\gamma}(x,y,\theta)| is a bounded function in xx for any y𝒴y\in{\cal Y}. ∎
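The limit (2.53) and the resulting bound on $\Phi_{\gamma}$ can be checked numerically; the sketch evaluates the bounding function from the proof at several values of $\omega$ (the constants $y=3$, $\gamma=0.1$ are chosen for illustration):

```python
import math

def bound(omega, y, gamma):
    """|omega| e^{gamma*y*omega - (gamma/(gamma+1)) e^{(gamma+1) omega}} (y + e^{(gamma+1) omega})."""
    a = (gamma + 1.0) * omega
    inner = gamma * y * omega - gamma / (gamma + 1.0) * math.exp(a)
    return abs(omega) * math.exp(inner) * (y + math.exp(a))

# the bound redescends to 0 in both tails of omega = theta^T x
vals = [bound(w, y=3, gamma=0.1) for w in (-30.0, -5.0, 0.0, 5.0, 30.0)]
```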

It is noted that the function in (2.53) has a mild shape, as shown in Figure 2.6. The redescending property is thus characteristic of the $\gamma$-estimating function: the graph rapidly approaches $0$ as the absolute value of the canonical variable $\omega$ increases.

Refer to caption
Figure 2.6: Plots of |ω|ec1ωc2ec3ω|\omega|e^{c_{1}\omega-c_{2}e^{c_{3}\omega}} for (c1,c2,c3)=(1,1,1),(0.8,0.8,0.8),(1.2,1.2,1.2).(c_{1},c_{2},c_{3})=(1,1,1),(0.8,0.8,0.8),(1.2,1.2,1.2).

We remark that Φγ(x,y,θ)\Phi_{\gamma}(x,y,\theta) denotes the margin of the estimating function to the boundary {x𝒳:θx=0}\{x\in{\cal X}:\theta^{\top}x=0\}. The margin is bounded in XX but unbounded in YY, and its behavior is delicate, as seen in Figure 2.7. When γ>0.2\gamma>0.2, the boundedness almost breaks down in a practical numerical sense. The green lines are plotted for the curve {(y,w,0):y=e(γ+1)w}.\{(y,w,0):y=e^{(\gamma+1)w}\}. Thus, the margin vanishes on the green line. The behavior is mild in a region away from the green line when γ\gamma is a small positive value, whereas it is unbounded there when γ\gamma equals zero, that is, in the likelihood case. This suggests a robust and efficient property for the γ\gamma-estimator with a small positive γ\gamma. To check this, we conduct a numerical experiment in which the Poisson log-linear model p(y|x,θ)p(y|x,\theta) in (2.51) is misspecified. The synthetic dataset is generated from a mixture distribution, in which a heterogeneous subgroup is generated from a Poisson distribution p(y|x,θhetero)p(y|x,\theta_{\rm hetero}) with a small proportion π\pi in addition to a normal group from p(y|x,θ)p(y|x,\theta) with proportion 1π1-\pi. Here θhetero\theta_{\rm hetero} is determined from plausible scenarios. We generate XiX_{i}’s from a trivariate normal distribution 𝙽𝚘𝚛(0,0.2I){\tt Nor}(0,0.2\,{\rm I}) and YiY_{i}’s from

(1π)𝙿𝚘(exp(θ1Xi+θ0))+π𝙿𝚘(exp(θhetero1Xi+θ0)).\displaystyle(1-\pi){\tt Po}(\exp(\theta_{1}^{\top}X_{i}+\theta_{0}))+\pi{\tt Po}(\exp(\theta_{{\rm hetero}1}^{\top}X_{i}+\theta_{0})).

Here the intercept is set as θ0=0.5\theta_{0}=0.5 and the slope vector as θ1=(0.5,1.5,1.0)\theta_{1}=(0.5,1.5,-1.0) in the normal group, while the slope vector θhetero1\theta_{{\rm hetero}1} is set as either θ1-\theta_{1} or a zero vector in the minor group. This means that the minor group has a reverse association to the normal group, or no reaction to the covariate. If there is no misspecification above, or equivalently π=0\pi=0, then the ML-estimator performs better than the γ\gamma-estimator. However, the ML-estimator is sensitive to such misspecification, while the γ\gamma-estimator has robust performance; see Table 2.5. Here, the sample size is set as n=100n=100 and the replication number as m=300m=300. The value of γ\gamma is selected as 0.050.05, since larger values of γ\gamma yield unreasonable estimates. This is because the margin of the γ\gamma-estimator has extreme behavior, as noted around Figure 2.7.
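The contaminated data-generating process described above can be sketched as follows; `simulate_mixture` is our name, and the scenario values are those stated in the text.

```python
import numpy as np

def simulate_mixture(n, theta1, theta0, theta_hetero1, pi, rng):
    # X_i ~ Nor(0, 0.2 I); each Y_i comes from the normal Poisson group with
    # probability 1 - pi and from the heterogeneous subgroup with probability pi
    X = rng.normal(0.0, np.sqrt(0.2), size=(n, len(theta1)))
    is_hetero = rng.random(n) < pi
    slope = np.where(is_hetero[:, None], theta_hetero1, theta1)
    mu = np.exp(np.sum(slope * X, axis=1) + theta0)
    return X, rng.poisson(mu)

# scenario (c): reversed association in the minor group
theta1 = np.array([0.5, 1.5, -1.0])
X, Y = simulate_mixture(100, theta1, 0.5, -theta1, 0.3, np.random.default_rng(1))
```

Setting `pi=0` recovers the well-specified case (a), and a zero vector for `theta_hetero1` gives case (b).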

Refer to caption
Figure 2.7: 3D plots of Φγ(x,y,θ)\Phi_{\gamma}(x,y,\theta) against yy and ww where γ=0.05,0.1,0.0\gamma=0.05,0.1,0.0.
Table 2.5: Comparison between the ML-estimator and the γ\gamma-estimator.

(a). The case of π=0\pi=0

Method estimate rmse
ML-estimator (0.50,1.50,1.00,0.49)(0.50,1.50,-1.00,0.49) 0.0620.062
γ\gamma-estimator (0.59,1.40,0.93,0.46)(0.59,1.40,-0.93,0.46) 0.1870.187

(b). The case of π=0.3\pi=0.3 and θhetero1=0.\theta_{{\rm hetero}1}=0.

Method estimate rmse
ML-estimator (0.40,1.32,0.88,0.45)(0.40,1.32,-0.88,0.45) 0.3100.310
γ\gamma-estimator (0.39,1.47,0.97,0.48)(0.39,1.47,-0.97,0.48) 0.2350.235

(c). The case of π=0.3\pi=0.3 and θhetero1=θ1.\theta_{{\rm hetero}1}=-\theta_{1}.

Method estimate rmse
ML-estimator (0.41,1.29,0.88,0.43)(0.41,1.29,-0.88,0.43) 0.3860.386
γ\gamma-estimator (0.40,1.47,0.99,0.48)(0.40,1.47,-0.99,0.48) 0.2930.293

In this section, we focused on the γ\gamma-divergence within the framework of the Poisson regression model. The γ\gamma-divergence provides a robust alternative to the traditional ML-estimator, which is sensitive to model misspecification and outliers. The robustness of the estimator was examined from a geometric viewpoint, highlighting the behavior of the estimating function in the feature space and its relationship with the prediction level set. This work not only contributes to the theoretical understanding of statistical estimation methods but also offers practical insights for their application in various fields, ranging from biostatistics to machine learning. For future work, it would be beneficial to further investigate the theoretical underpinnings of the γ\gamma-divergence in a wider range of statistical models and to explore its application in more complex and high-dimensional data scenarios, including machine learning contexts such as deep learning, transfer learning, multi-task learning and meta-learning.

2.8 Concluding remarks

In this chapter we have provided a comprehensive exploration of MDEs, particularly the γ\gamma-divergence, within regression models; see [70, 74, 76] for other applications in unsupervised learning. This addresses the challenges posed by model misspecification, which can lead to biased estimates and inaccuracies, and proposes MDEs as a robust solution. We have discussed various regression models, including normal, logistic, and Poisson, demonstrating the efficacy of the γ\gamma-divergence in handling outliers and model inconsistencies. In particular, the robustness of the estimator is pursued from a geometric perspective on the estimating function in the feature space. This elucidates the intrinsic relationship between the feature space and the outcome space: the behavior of the estimating function in the product space of the feature and outcome spaces is characterized by the projection length to the prediction level set. The chapter concludes with numerical experiments showcasing the superiority of γ\gamma-estimators over traditional maximum likelihood estimators in certain misspecified models, thereby highlighting the practical benefits of MDEs in statistical estimation and inference. In conclusion, it is important to recognize the significant role of the γ\gamma-divergence in enhancing model robustness against biases and misspecifications. Emphasizing its applicability across different statistical models, the chapter underscores the potential of MDEs to improve the reliability and accuracy of statistical inferences, particularly in complex or imperfect real-world data scenarios. This work not only contributes to the theoretical understanding of statistical estimation methods but also offers practical insights for their application in diverse fields, ranging from biostatistics to machine learning.

For future work, considering the promising results of the γ\gamma-divergence in regression models, it could be beneficial to explore its application in more complex and high-dimensional data scenarios. This includes delving into machine learning contexts, such as deep learning or neural networks, where robustness against data imperfections is crucial. Machine learning is rapidly developing toward generative models for documents, images and movies, in which architectures on a huge scale of high-dimensional vector and matrix computations establish pre-trained models such as large language models. A challenging direction is to incorporate the γ\gamma-divergence approach into such areas, including multi-task learning, transfer learning, meta learning and so forth. For example, transfer learning is important to strengthen the empirical knowledge for the target domain. Few-shot learning is deeply intertwined with transfer learning. In fact, most few-shot learning approaches are based on the principles of transfer learning. The idea is to pre-train a model on a related task with ample data (source domain) and then fine-tune or adapt this model to the new task (target domain) with limited data. This approach leverages the knowledge (features, representations) acquired during the pre-training phase to make accurate predictions in the few-shot scenario. Additionally, investigating the theoretical underpinnings of the γ\gamma-divergence in a wider range of statistical models could further solidify its role as a versatile and robust tool in statistical estimation and inference.

In transfer learning, the goal is to leverage knowledge from a source domain to improve learning in a target domain. The γ\gamma-divergence can be used to ensure robust parameter estimation during this process. Let 𝒮\mathcal{S} be the source domain with distribution PsP_{s} and parameter θs\theta_{s}, and 𝒯\mathcal{T} be the target domain with distribution PtP_{t} and parameter θt\theta_{t}. The objective is to minimize a loss function that incorporates both source and target domains:

L(θs,θt)=L𝒮(θs)+λL𝒯(θt|θs)L(\theta_{s},\theta_{t})=L_{\mathcal{S}}(\theta_{s})+\lambda L_{\mathcal{T}}(\theta_{t}|\theta_{s})

where λ\lambda is a regularization parameter balancing the influence of the source model on the target model. Using γ\gamma-divergence, the loss functions L𝒮(θs)L_{\mathcal{S}}(\theta_{s}) and L𝒯(θt|θs)L_{\mathcal{T}}(\theta_{t}|\theta_{s}) are defined as:

L𝒮(θs)=1nsi=1nsDγ(Ps,i(|θs),Qs)L_{\mathcal{S}}(\theta_{s})=\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}D_{\gamma}(P_{s,i}(\cdot|\theta_{s}),Q_{s})
L𝒯(θt|θs)=1nti=1ntDγ(Pt,i(|θt),Ps,i(|θs))L_{\mathcal{T}}(\theta_{t}|\theta_{s})=\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}D_{\gamma}(P_{t,i}(\cdot|\theta_{t}),P_{s,i}(\cdot|\theta_{s}))

where QsQ_{s} is the empirical distribution in the source domain, and DγD_{\gamma} denotes the γ\gamma-divergence. The gradients for updating the parameters are given by:

θsL(θs,θt)=θsL𝒮(θs)+λθsL𝒯(θt|θs)\nabla_{\theta_{s}}L(\theta_{s},\theta_{t})=\nabla_{\theta_{s}}L_{\mathcal{S}}(\theta_{s})+\lambda\nabla_{\theta_{s}}L_{\mathcal{T}}(\theta_{t}|\theta_{s})
θtL(θs,θt)=λθtL𝒯(θt|θs)\nabla_{\theta_{t}}L(\theta_{s},\theta_{t})=\lambda\nabla_{\theta_{t}}L_{\mathcal{T}}(\theta_{t}|\theta_{s})

These gradients take into account the robustness properties of the γ\gamma-divergence, reducing sensitivity to outliers and model misspecifications.
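The combined transfer loss above can be sketched numerically. The following is a minimal sketch, assuming discrete distributions on a common finite support so that the γ\gamma-divergence reduces to finite sums; `gamma_div` and `transfer_loss` are our names, and the per-observation averaging in the text is suppressed for brevity.

```python
import numpy as np

def gamma_div(p, q, gamma=0.5):
    # log-form gamma-divergence between discrete pmfs; zero iff p is proportional to q
    p, q = np.asarray(p, float), np.asarray(q, float)
    return (np.log(np.sum(p**(1 + gamma))) / (gamma * (1 + gamma))
            - np.log(np.sum(p * q**gamma)) / gamma
            + np.log(np.sum(q**(1 + gamma))) / (1 + gamma))

def transfer_loss(p_source, q_source_emp, p_target, lam=0.5, gamma=0.5):
    # L(theta_s, theta_t) = L_S(theta_s) + lam * L_T(theta_t | theta_s), with
    # L_S = D_gamma(P_s, Q_s) and L_T = D_gamma(P_t, P_s) as in the text
    return (gamma_div(p_source, q_source_emp, gamma)
            + lam * gamma_div(p_target, p_source, gamma))
```

The gradients in the text would then be obtained by differentiating this scalar loss with respect to the model parameters.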

In multi-task learning, the aim is to learn multiple related tasks simultaneously, sharing knowledge among them to improve overall performance. The γ\gamma-divergence helps in creating a robust shared representation. Let 𝒯i\mathcal{T}_{i} denote the ii-th task with parameter θi\theta_{i}, and let Θ\Theta be the shared parameter space. The combined loss function for multiple tasks is:

L(Θ)=i=1mαiLi(θi,Θ)L(\Theta)=\sum_{i=1}^{m}\alpha_{i}L_{i}(\theta_{i},\Theta)

where αi\alpha_{i} are weights for each task, and LiL_{i} is the loss for task 𝒯i\mathcal{T}_{i}. The task-specific losses LiL_{i} are defined using γ\gamma-divergence:

Li(θi,Θ)=1nij=1niDγ(Pi,j(|θi,Θ),Qi)L_{i}(\theta_{i},\Theta)=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}D_{\gamma}(P_{i,j}(\cdot|\theta_{i},\Theta),Q_{i})

where QiQ_{i} is the empirical distribution for task 𝒯i\mathcal{T}_{i}. The gradients for updating the shared parameters Θ\Theta and task-specific parameters θi\theta_{i} are:

ΘL(Θ)=i=1mαiΘLi(θi,Θ)\nabla_{\Theta}L(\Theta)=\sum_{i=1}^{m}\alpha_{i}\nabla_{\Theta}L_{i}(\theta_{i},\Theta)
θiL(Θ)=αiθiLi(θi,Θ)\nabla_{\theta_{i}}L(\Theta)=\alpha_{i}\nabla_{\theta_{i}}L_{i}(\theta_{i},\Theta)

The γ\gamma-divergence ensures that the updates are robust to outliers and anomalies within each task’s data. By using a divergence measure that penalizes discrepancies between distributions, the model learns shared features that are less sensitive to noise and specific to individual tasks.
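A corresponding sketch of the combined multi-task loss, under the same simplifying assumption of discrete distributions on a finite support (names ours; the dependence of each task model on (θi,Θ)(\theta_{i},\Theta) is left implicit):

```python
import numpy as np

def gamma_div(p, q, gamma=0.5):
    # log-form gamma-divergence between discrete pmfs; zero iff p is proportional to q
    p, q = np.asarray(p, float), np.asarray(q, float)
    return (np.log(np.sum(p**(1 + gamma))) / (gamma * (1 + gamma))
            - np.log(np.sum(p * q**gamma)) / gamma
            + np.log(np.sum(q**(1 + gamma))) / (1 + gamma))

def multitask_loss(model_pmfs, empirical_pmfs, alphas, gamma=0.5):
    # L(Theta) = sum_i alpha_i * L_i, with L_i the gamma-divergence between the
    # model for task i and that task's empirical distribution Q_i
    return sum(a * gamma_div(p, q, gamma)
               for a, p, q in zip(alphas, model_pmfs, empirical_pmfs))
```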

Efficiently optimizing the γ\gamma-divergence in high-dimensional parameter spaces remains challenging, and developing scalable algorithms that maintain robustness properties is crucial. Further theoretical exploration of the convergence properties and bounds of γ\gamma-divergence-based estimators in transfer and multi-task learning scenarios is also needed, as is applying these robust methods to diverse real-world datasets in fields like healthcare, finance, and natural language processing to validate their practical effectiveness and robustness. By integrating the γ\gamma-divergence into transfer and multi-task learning frameworks, we can enhance the robustness and adaptability of machine learning models, making them more reliable in varied and complex data environments.

Chapter 3 Minimum divergence for Poisson point process

This study introduces a robust alternative to traditional species distribution models (SDMs) using Poisson Point Processes (PPP) and new divergence measures. We propose the FF-estimator, a method grounded in cumulative distribution functions, offering enhanced accuracy and robustness over maximum likelihood (ML) estimation, especially under model misspecification. Our simulations highlight its superior performance and practical applicability in ecological studies, marking a significant step forward in ecological modeling for biodiversity conservation.

3.1 Introduction

Species distribution models (SDMs) are crucial in ecology for mapping the distribution of species across various habitats and geographical areas [37, 29, 63, 88, 68]. These models play an essential role in enhancing our understanding of biodiversity patterns, predicting changes in species distributions due to climate change, and guiding conservation and management efforts [60]. The MaxEnt (Maximum Entropy) approach to species distribution modeling represents a pivotal methodology in ecology, especially in the context of predicting species distributions under various environmental conditions [77]. This approach is particularly favored for its ability to handle presence-only data, a common scenario in ecological studies where the absence of a species is often unrecorded or unknown [26]. Alternatively, the approach based on the Poisson Point Process (PPP) gives a more comprehensive understanding of random events scattered across a certain space or time [81]. It is particularly powerful in various fields including ecology, seismology, telecommunications, and spatial statistics. We quickly review the framework for a PPP, cf. [89] for practical applications focusing on ecological studies and [57, 45] for statistical learning perspectives. The close relation between the MaxEnt and PPP approaches is rigorously discussed in [82].

In this chapter, we introduce an innovative approach that employs Poisson Point Processes (PPP) along with alternative divergence measures to enhance the robustness and efficiency of SDMs [84]. We propose the use of the FF-estimator, a novel method based on cumulative distribution functions, which offers a promising alternative to the ML-estimator, particularly in the presence of model misspecification. Traditional approaches, such as ML estimation, often grapple with issues of model misspecification, leading to inaccurate predictions. Our approach is evaluated through a series of simulations, demonstrating its superiority over traditional methods in terms of accuracy and robustness. The chapter also explores the computational aspects of these estimators, providing insights into their practical application in ecological studies. By addressing key challenges in SDM estimation, our methodology paves the way for more reliable and effective ecological modeling, essential for biodiversity conservation and ecological research.

Let AA be a subset of 2\mathbb{R}^{2} in which observed points are recorded. Then the event space is given by the collection of all possible finite subsets of AA, namely the union of pairs {(m,{s1,,sm})}m=1\{(m,\{s_{1},...,s_{m}\})\}_{m=1}^{\infty} together with (0,)(0,\emptyset), where \emptyset denotes the empty set. Thus, the event space comprises pairs of the set of observed points {s1,,sm}\{s_{1},...,s_{m}\} and the number mm. Let λ(s)\lambda(s) be a positive function on AA, called an intensity function. A PPP is described by the intensity function λ(s)\lambda(s) through a two-step procedure for any realization.

  • (i)

    The number MM is non-negative and generated from a Poisson distribution. This distribution, denoted as 𝙿𝚘(Λ){\tt Po}(\Lambda), has a probability mass function (pmf) given by

    p(m,Λ)=Λmm!exp{Λ}\displaystyle p(m,\Lambda)=\frac{\Lambda^{m}}{m!}\exp\{-\Lambda\}

    where Λ=Aλ(s)ds\Lambda=\int_{A}\lambda(s){\rm d}s with an intensity function λ(s)\lambda(s) on AA.

  • (ii)

    The sequence (S1,,SM)(S_{1},...,S_{M}) in AA is obtained by independent and identically distributed sample of a random variable SS on AA with probability density function (pdf) given by

    p(s)=λ(s)Λ\displaystyle p(s)=\frac{\lambda(s)}{\Lambda}

    for sAs\in A.

This description covers the basic statistical structure of the Poisson point process. The joint random variable Ξ=(M,{S1,,SM})\Xi=(M,\{S_{1},...,S_{M}\}) has a pdf written as

p(ξ)=exp{Λ}i=1mλ(si),p(\xi)=\exp\{-\Lambda\}\prod_{i=1}^{m}{\lambda(s_{i})}, (3.1)

where ξ=(m,{s1,,sm})\xi=(m,\{s_{1},...,s_{m}\}). Thus, the intensity function λ(s)\lambda(s) characterizes the pdf p(ξ)p(\xi) of the PPP. The set of all intensity functions has a one-to-one correspondence with the set of all distributions of PPPs. Subsequently, we will discuss divergence measures on intensity functions rather than on pdfs.
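The two-step construction above can be simulated directly. The sketch below uses the standard thinning device on a rectangle, which is equivalent to steps (i)-(ii); it assumes an upper bound `lam_max` dominating λ\lambda on AA, and all names are ours.

```python
import numpy as np

def simulate_ppp(lam, lam_max, area=((0.0, 1.0), (0.0, 1.0)), rng=None):
    # Simulate a PPP with intensity lam(s) on a rectangle A by thinning:
    # draw M ~ Po(lam_max * |A|) uniform candidate points, then keep each
    # candidate s independently with probability lam(s) / lam_max.
    rng = rng if rng is not None else np.random.default_rng()
    (x0, x1), (y0, y1) = area
    m = rng.poisson(lam_max * (x1 - x0) * (y1 - y0))
    pts = np.column_stack([rng.uniform(x0, x1, m), rng.uniform(y0, y1, m)])
    keep = rng.random(m) < lam(pts) / lam_max
    return pts[keep]
```

With a constant intensity the retained count is Poisson with mean Λ=λ|A|\Lambda=\lambda|A|, matching step (i).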

3.2 Species distribution model

Species Distribution Models (SDMs) are crucial tools in ecology for understanding and predicting species distributions across spatial landscapes. The inhomogeneous PPP plays a significant role in enhancing the functionality and accuracy of these models due to its ability to handle spatial heterogeneity, which is often a characteristic of ecological data. Ecological landscapes are inherently heterogeneous with varying attributes such as vegetation, soil types, and climatic conditions. The inhomogeneous PPP accommodates this spatial heterogeneity by allowing the event rate to vary across space, thereby enabling a more realistic modeling of species distributions. This can incorporate environmental covariates to model the intensity function of the point process, which in turn helps in understanding how different environmental factors influence species distribution. This is crucial for both theoretical ecological studies and practical conservation planning [56].

If presence and absence data are available, we can employ familiar statistical methods such as the logistic model, random forests and other binary classification algorithms. However, ecological datasets often consist of presence-only records, which can be a challenge for traditional statistical models. We focus on a statistical analysis for presence-only data, which inhomogeneous PPP modeling can handle effectively, making it a powerful tool for species distribution modeling in data-scarce scenarios.

Let us introduce a SDM in the framework of PPP discussed above. Suppose that we get a presence dataset, say {S1,,SM}\{S_{1},...,S_{M}\}, or a set of observed points for a species in a study area AA. Then, we build a statistical model of an intensity function driving a PPP on A{A}, in which a parametric model is given by

={λ(s,θ):θΘ},\displaystyle{\mathcal{M}}=\{\lambda(s,\theta):\theta\in\Theta\}, (3.2)

called a species distribution model (SDM), where θ\theta is an unknown parameter in the space Θ\Theta. The pdf of the joint random variable Ξ=(M,{S1,,SM})\Xi=(M,\{S_{1},...,S_{M}\}) is written as

p(ξ,θ)=exp{Λ(θ)}i=1mλ(si,θ)\displaystyle p(\xi,\theta)=\exp\{-\Lambda(\theta)\}\prod_{i=1}^{m}{\lambda(s_{i},\theta)}

due to (3.1), where ξ=(m,{s1,,sm})\xi=(m,\{s_{1},...,s_{m}\}) and Λ(θ)=Aλ(s,θ)𝑑s\Lambda(\theta)=\int_{A}\lambda(s,\theta)ds. In ecological terms, this can be understood as recording the locations (e.g., GPS coordinates) where a particular species has been observed. The pdf here helps in modeling the likelihood of finding the species at different locations within the study area, considering various environmental factors. Typically, we shall consider a log-linear model

λ(s,θ)=exp{θ1x(s)+θ0}\displaystyle\lambda(s,\theta)=\exp\{\theta_{1}^{\top}x(s)+\theta_{0}\} (3.3)

with θ=(θ0,θ1)\theta=(\theta_{0},\theta_{1}), a feature vector x(s)x(s), a slope vector θ1\theta_{1} and an intercept θ0\theta_{0}. Here x(s)x(s) consists of environmental characteristics such as geographical, climatic and other factors influencing the habitat of the species. Then, parameter estimation is key in SDMs to understand the relationships between species distributions and environmental covariates. The ML-estimator is a common approach used in PPP to estimate these parameters, which in turn refines the SDM.

The negative log-likelihood function based on an observation sequence (M,{S1,,SM})(M,\{S_{1},...,S_{M}\}) is given by

L0(θ)=i=1Mlogλ(Si,θ)+Λ(θ).\displaystyle L_{0}(\theta)=-\sum_{i=1}^{M}\log\lambda(S_{i},\theta)+\Lambda(\theta). (3.4)

Here the cumulative intensity is usually approximated as

Λ(θ)=i=1nwiλ(Si,θ)\displaystyle\Lambda(\theta)=\sum_{i=1}^{n}w_{i}\lambda(S_{i},\theta) (3.5)

by Gaussian quadrature, where SM+1,,Sn{S_{M+1},...,S_{n}} are the centers of the grid cells containing no presence location and wiw_{i} is a quadrature weight for a grid cell area. The approximated estimating equation is given by

S0(θ)=i=1n{Ziwiλ(Si,θ)}θlogλ(Si,θ)=0,\displaystyle{S}_{0}(\theta)=\sum_{i=1}^{n}\{Z_{i}-w_{i}\lambda(S_{i},\theta)\}\frac{\partial}{\partial\theta}\log\lambda(S_{i},\theta)={0}, (3.6)

where ZiZ_{i} is an indicator for presence, that is, Zi=1Z_{i}=1 if 1iM1\leq i\leq M and 0 otherwise.
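Under the grid approximation, the loss (3.4)-(3.5) and the estimating function (3.6) for the log-linear model (3.3) can be coded compactly. A minimal sketch (names ours), where `X_all` stacks the feature vectors x(Si)x(S_{i}) at the presence points and quadrature centres, `z` holds the presence indicators, and `w` the quadrature weights:

```python
import numpy as np

def neg_log_lik(theta, X_all, z, w):
    # (3.4) with Lambda(theta) approximated by the quadrature sum (3.5)
    eta = X_all @ theta
    return -np.sum(z * eta) + np.sum(w * np.exp(eta))

def score(theta, X_all, z, w):
    # the estimating function (3.6): sum_i {z_i - w_i * lambda_i} * x_i,
    # i.e. minus the gradient of neg_log_lik for the log-linear model
    lam = np.exp(X_all @ theta)
    return (z - w * lam) @ X_all
```

Setting the score to zero is then equivalent to minimizing the approximate negative log-likelihood.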

Let p(ξ)p(\xi) and q(ξ)q(\xi) be probability density functions (pdf) of two PPPs, where ξ=(m,{s1,,sm})\xi=(m,\{s_{1},...,s_{m}\}) is a realization. Due to the discussion above, the pdfs are written as

p(ξ)=exp{Λ}i=1mλ(si),q(ξ)=exp{H}i=1mη(si),\displaystyle p(\xi)=\exp\{-\Lambda\}\prod_{i=1}^{m}{\lambda(s_{i})},\ \ \ q(\xi)=\exp\{-H\}\prod_{i=1}^{m}{\eta(s_{i})}, (3.7)

in which p(ξ)p(\xi) and λ(s)\lambda(s) have a one-to-one correspondence, and q(ξ)q(\xi) and η(s)\eta(s) have also the same property. The Kullback-Leibler (KL) divergence between pp and qq is defined by the difference between the cross entropy and the diagonal entropy as D0(p,q)=H0(p,q)H0(p,p){D}_{0}(p,q)={H}_{0}(p,q)-{H}_{0}(p,p), where the cross entropy is defined by

H0(p,q)=𝔼p[logq(Ξ)]\displaystyle{H}_{0}(p,q)=-\mathbb{E}_{p}[\log{q(\Xi)}]

with the expectation 𝔼p\mathbb{E}_{p} with the pdf p(ξ)p(\xi). This is written as

H0(p,q)\displaystyle{H}_{0}(p,q) =m=0Λmm!eΛA××Alog{eHj=1mη(sj)}j=1mλ(sj)Λdsj\displaystyle=-\sum_{m=0}^{\infty}\frac{\Lambda^{m}}{m!}e^{-\Lambda}\int_{A\times\cdots\times A}\log\{e^{-H}\prod_{j=1}^{m}{\eta(s_{j})}\}\prod_{j=1}^{m}\frac{\lambda(s_{j})}{\Lambda}{\rm d}s_{j}
=m=0Λmm!eΛ[H+mΛAλ(s)logη(s)ds]\displaystyle=\sum_{m=0}^{\infty}\frac{\Lambda^{m}}{m!}e^{-\Lambda}\Big{[}-H+\frac{m}{\Lambda}\int_{A}{\lambda(s)}\log{\eta(s)}{\rm d}s\Big{]}
=A{λ(s)logη(s)η(s)}ds.\displaystyle=\int_{A}\{\lambda(s)\log{\eta(s)}-\eta(s)\}{\rm d}s. (3.8)

Thus, the KL-divergence is

D0(p,q)=A{λ(s)logλ(s)η(s)λ(s)+η(s)}ds,\displaystyle{D}_{0}(p,q)=\int_{A}\Big{\{}\lambda(s)\log\frac{\lambda(s)}{\eta(s)}-\lambda(s)+\eta(s)\Big{\}}{\rm d}s, (3.9)

see [57] for detailed derivations. This can be seen as a way to assess the effectiveness of an ecological model. For instance, how well does our model predict where a species will be found, based on environmental factors like climate, soil type, or vegetation? The closer our model’s predictions are to the actual observations, the better it is at explaining the species’ distribution. In effect, D0(p,q){D}_{0}(p,q) coincides with the extended KL-divergence between intensity functions λ(s)\lambda(s) and η(s)\eta(s). Here, the term λ(s)+η(s)-\lambda(s)+\eta(s) in the integrand of (3.9) should be added to the standard form since both λ(s)\lambda(s) and η(s)\eta(s) in general do not have total mass one.
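As a worked check, (3.9) can be evaluated numerically on a grid. The integrand λlog(λ/η)λ+η\lambda\log(\lambda/\eta)-\lambda+\eta is pointwise nonnegative, so the divergence is nonnegative and vanishes when λ=η\lambda=\eta; the sketch below (our names) uses a simple Riemann-sum approximation.

```python
import numpy as np

def kl_intensity(lam, eta, grid, cell_area):
    # extended KL-divergence (3.9) between intensity functions,
    # approximated by a Riemann sum over grid points covering A
    l, e = lam(grid), eta(grid)
    return cell_area * np.sum(l * np.log(l / e) - l + e)
```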

Let (M,{S1,,SM})(M,\{S_{1},...,S_{M}\}) be an observation having the pdf p(ξ,θ0)p(\xi,\theta_{0}). We consider the expected value under the true pdf p(ξ,θ0)p(\xi,\theta_{0}), which is given by

𝔼0[L0(θ)]=A{λ(s,θ0)logλ(s,θ)λ(s,θ)}ds\displaystyle\mathbb{E}_{0}[L_{0}(\theta)]=-\int_{A}\{\lambda(s,\theta_{0})\log\lambda(s,\theta)-\lambda(s,\theta)\}{\rm d}s

noting a familiar formula for a random sum in PPP:

𝔼0[i=1Mlogλ(Si,θ)]=Aλ(s,θ0)logλ(s,θ)ds,\displaystyle\mathbb{E}_{0}\Big{[}\sum_{i=1}^{M}\log\lambda(S_{i},\theta)\Big{]}=\int_{A}\lambda(s,\theta_{0})\log\lambda(s,\theta){\rm d}s, (3.10)

where 𝔼0\mathbb{E}_{0} denotes the expectation with the pdf p(ξ,θ0)p(\xi,\theta_{0}). This is nothing but the cross entropy between intensity functions λ(s,θ0)\lambda(s,\theta_{0}) and λ(s,θ)\lambda(s,\theta). Accordingly, we observe a close relationship between the log-likelihood and the KL-divergence that is parallel to the discussion around (2.2) in Chapter 2. In effect,

𝔼0[L0(θ)]𝔼0[L0(θ0)]=D0(p(,θ0),p(,θ)).\displaystyle\mathbb{E}_{0}[L_{0}(\theta)]-\mathbb{E}_{0}[L_{0}(\theta_{0})]=D_{0}(p(\cdot,\theta_{0}),p(\cdot,\theta)).

This relation yields the consistency of the ML-estimator for the true value θ0\theta_{0}, noting θ0=argminθΘ𝔼0[L0(θ)]\theta_{0}=\mathop{\rm argmin}_{\theta\in\Theta}\mathbb{E}_{0}[L_{0}(\theta)]. This suggests that the method used to estimate the impact of environmental factors on species distribution is dependable. In practical terms, ecologists can trust the model to make accurate predictions about where a species might be found, based on environmental data.

3.3 Divergence measures on intensity functions

We would like to extend the minimum divergence method to the estimation of an SDM. The main objective is to propose an alternative to the maximum likelihood method, aiming to enhance robustness and expedite computation. We have observed the close relationship between the log-likelihood and the KL-divergence in the previous section. Fortunately, the empirical form of the KL-divergence matches the log-likelihood function in the framework of the SDM. We remark that the fact that the KL-divergence between PPPs is equal to the KL-divergence between their intensity functions is essential for ensuring this property. However, this key relation does not hold for the power divergence.

First, we review a formula for random sum and product in PPP, which is gently and comprehensively discussed in [89].

Proposition 13.

Let Ξ=(M,{S1,,SM})\Xi=(M,\{S_{1},...,S_{M}\}) be a realization of a PPP with an intensity function λ(s)\lambda(s) on an area AA. Then, for any integrable function g(s)g(s),

𝔼[i=1Mg(Si)]=Ag(s)λ(s)ds\displaystyle\mathbb{E}\Big{[}\sum_{i=1}^{M}g(S_{i})\Big{]}=\int_{A}g(s)\lambda(s){\rm d}s (3.11)

and

𝔼[i=1Mg(Si)]=exp{A{g(s)1}λ(s)ds}.\displaystyle\mathbb{E}\Big{[}\prod_{i=1}^{M}g(S_{i})\Big{]}=\exp\Big{\{}\int_{A}\{g(s)-1\}\lambda(s){\rm d}s\Big{\}}.
Proof.

By definition,

𝔼[i=1Mg(Si)]=m=0Λmm!eΛA××Ai=1mg(si)i=1mλ(si)Λdsi\displaystyle\mathbb{E}\Big{[}\sum_{i=1}^{M}g(S_{i})\Big{]}=\sum_{m=0}^{\infty}\frac{\Lambda^{m}}{m!}e^{-\Lambda}\int_{A\times\cdots\times A}\sum_{i=1}^{m}g(s_{i})\prod_{i=1}^{m}\frac{\lambda(s_{i})}{\Lambda}{\rm d}s_{i}
=m=0Λmm!eΛmΛAλ(s)g(s)ds\displaystyle=\sum_{m=0}^{\infty}\frac{\Lambda^{m}}{m!}e^{-\Lambda}\frac{m}{\Lambda}\int_{A}{\lambda(s)}{g(s)}{\rm d}s
=Ag(s)λ(s)ds.\displaystyle=\int_{A}g(s)\lambda(s){\rm d}s.

Similarly,

𝔼[i=1Mg(Si)]=m=0Λmm!eΛA××Ai=1mg(si)i=1mλ(si)Λdsi\displaystyle\mathbb{E}\Big{[}\prod_{i=1}^{M}g(S_{i})\Big{]}=\sum_{m=0}^{\infty}\frac{\Lambda^{m}}{m!}e^{-\Lambda}\int_{A\times\cdots\times A}\prod_{i=1}^{m}g(s_{i})\prod_{i=1}^{m}\frac{\lambda(s_{i})}{\Lambda}{\rm d}s_{i}
=m=0Λmm!eΛ{Ag(s)λ(s)Λds}m\displaystyle=\sum_{m=0}^{\infty}\frac{\Lambda^{m}}{m!}e^{-\Lambda}\Big{\{}\int_{A}\frac{g(s)\lambda(s)}{\Lambda}{\rm d}s\Big{\}}^{m}
=exp{A{g(s)1}λ(s)ds}.\displaystyle=\exp\Big{\{}\int_{A}\{g(s)-1\}\lambda(s){\rm d}s\Big{\}}.

Proposition 13 gives interesting properties of the random sum and the random product; see section 2.6 in [89] for further discussion and historical backgrounds. In ecology, this can be interpreted as predicting the total impact or effect of a particular environmental factor (represented by g(s)g(s)) across all locations where a species is observed within a study area AA. For example, g(s)g(s) could represent the level of a specific nutrient or habitat quality at each observation point SiS_{i}. The integral then sums up these effects across the entire habitat, providing a comprehensive view of how the environmental factor influences the species across its distribution. This formula can be used in SDMs to quantify the cumulative effect of environmental variables on species presence. For instance, it could help in assessing how total food availability or habitat suitability across a landscape influences the likelihood of species presence. By integrating such ecological factors into the SDM, researchers can gain insights into the species’ habitat preferences and distribution patterns. Understanding the cumulative impact of environmental factors is crucial for conservation planning and management. This approach helps identify critical areas that contribute significantly to species survival and can guide habitat restoration or protection efforts. For instance, if the model shows that certain areas have a high cumulative impact on species presence, these areas might be prioritized for conservation.
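The random-sum formula (3.11) can be verified by Monte Carlo for a homogeneous intensity on A=[0,1]A=[0,1]; a small sketch (names ours):

```python
import numpy as np

def mc_random_sum(lam_const, g, n_rep, rng):
    # Monte Carlo estimate of E[sum_{i<=M} g(S_i)] for a homogeneous PPP on [0, 1]:
    # step (i) draws M ~ Po(Lambda) with Lambda = lam_const * |A| = lam_const,
    # step (ii) draws S_i iid uniform, i.e. from lambda / Lambda
    total = 0.0
    for _ in range(n_rep):
        m = rng.poisson(lam_const)
        s = rng.uniform(0.0, 1.0, m)
        total += g(s).sum()
    return total / n_rep
```

With λ10\lambda\equiv 10 and g(s)=s2g(s)=s^{2}, the right-hand side of (3.11) is 0110s2ds=10/3\int_{0}^{1}10s^{2}{\rm d}s=10/3, which the simulation reproduces.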

Second, we introduce divergence measures to apply the estimation for a species distributional model employing the formula introduced in Proposition 13. The γ\gamma-divergence between probability measures PP and QQ of PPPs with RN-derivatives p(ξ)p(\xi) and q(ξ)q(\xi) in (3.7) is given by

Dγ(P,Q)=Hγ(P,Q)Hγ(P,P).\displaystyle D_{\gamma}(P,Q)=H_{\gamma}(P,Q)-H_{\gamma}(P,P).

Here the cross γ\gamma-entropy is defined by

Hγ(P,Q)=1γ𝔼P[q(Ξ)γ]{𝔼Q[q(Ξ)γ]}γγ+1,\displaystyle H_{\gamma}(P,Q)=-\frac{1}{\gamma}\frac{\mathbb{E}_{P}[q(\Xi)^{\gamma}]}{\{\mathbb{E}_{Q}[q(\Xi)^{\gamma}]\}^{\frac{\gamma}{\gamma+1}}},

where 𝔼P\mathbb{E}_{P} denote the expectation with respect to PP. Accordingly, the γ\gamma-cross entropy between probability distributions P0P_{0} and PθP_{\theta} having the intensity functions λ0(s)\lambda_{0}(s) and λ(s,θ)\lambda(s,\theta), respectively, is written as

Hγ(P0,Pθ)=1γexp[A{λ0(s)(λ(s,θ)γ1)γγ+1λ(s,θ)γ+1}ds]\displaystyle H_{\gamma}(P_{0},P_{\theta})=-\frac{1}{\gamma}\exp\Big{[}\int_{A}\Big{\{}\lambda_{0}(s)(\lambda(s,\theta)^{\gamma}-1)-\frac{\gamma}{\gamma+1}\lambda(s,\theta)^{\gamma+1}\Big{\}}{\rm d}s\Big{]}

since

𝔼P0[p(Ξ,θ)γ]\displaystyle{\mathbb{E}_{P_{0}}[p(\Xi,\theta)^{\gamma}]} =exp{γΛ(θ)}𝔼P0[i=1Mλ(Si,θ)γ]\displaystyle=\exp\{\gamma\Lambda(\theta)\}\mathbb{E}_{P_{0}}\Big{[}\prod_{i=1}^{M}\lambda(S_{i},\theta)^{\gamma}\Big{]}
=exp[A{γλ(s,θ)+λ0(s)(λ(s,θ)γ1)}ds]\displaystyle=\exp\Big{[}\int_{A}\{\gamma\lambda(s,\theta)+\lambda_{0}(s)(\lambda(s,\theta)^{\gamma}-1)\}{\rm d}s\Big{]} (3.12)

due to Proposition 13. However, it is difficult to give an empirical expression of Hγ(P0,Pθ)H_{\gamma}(P_{0},P_{\theta}) for a given realization (M,{S1,,SM})(M,\{S_{1},...,S_{M}\}) generated from P0P_{0}. Accordingly, we consider another type of divergence.

Consider the log γ\gamma-divergence between P0P_{0} and PθP_{\theta} that is defined by

Δγ(P0,Pθ)=1γlog𝔼P0[p(Ξ,θ)γ]{𝔼Pθ[p(Ξ,θ)γ]}γγ+1{𝔼P0[p0(Ξ)γ]}1γ+1.\displaystyle\Delta_{\gamma}(P_{0},P_{\theta})=-\frac{1}{\gamma}\log\frac{\mathbb{E}_{P_{0}}[p(\Xi,\theta)^{\gamma}]}{\{\mathbb{E}_{P_{\theta}}[p(\Xi,\theta)^{\gamma}]\}^{\frac{\gamma}{\gamma+1}}\{\mathbb{E}_{P_{0}}[p_{0}(\Xi)^{\gamma}]\}^{\frac{1}{\gamma+1}}}.

This is written as

Δγ(P0,Pθ)=1γA{λ0(s)λ(s,θ)γγγ+1λ(s,θ)γ+11γ+1λ0(s)γ+1}ds.\displaystyle{\Delta}_{\gamma}(P_{0},P_{\theta})=-\frac{1}{\gamma}\int_{A}\Big{\{}\lambda_{0}(s)\lambda(s,\theta)^{\gamma}-\frac{\gamma}{\gamma+1}\lambda(s,\theta)^{\gamma+1}-\frac{1}{\gamma+1}\lambda_{0}(s)^{\gamma+1}\Big{\}}{\rm d}s. (3.13)

Therefore, the loss function is induced as

Lγ(θ)=1γi=1Mλ(Si,θ)γ+1γ+1Aλ(s,θ)γ+1ds\displaystyle L_{\gamma}(\theta)=-\frac{1}{\gamma}\sum_{i=1}^{M}\lambda(S_{i},\theta)^{\gamma}+\frac{1}{\gamma+1}\int_{A}\lambda(s,\theta)^{\gamma+1}{\rm d}s

for an SDM (3.2). This loss function is totally different from the negative log-likelihood. In a regression model, Dγ(P,Q)D_{\gamma}(P,Q) and Δγ(P,Q)\Delta_{\gamma}(P,Q) yield the same loss function, while in a PPP model only Δγ(P,Q)\Delta_{\gamma}(P,Q) yields the loss function Lγ(θ)L_{\gamma}(\theta).

We observe that the properties of random sums and products lead to delicate differences among one-to-one transformed divergence measures. So, we consider a divergence measure defined directly on the space of intensity functions rather than on that of probability distributions of PPPs. The β\beta-divergence is given by

Dβ(λ,η)=1βA{λ(s)η(s)βββ+1η(s)β+11β+1λ(s)β+1}ds.\displaystyle{D}_{\beta}(\lambda,\eta)=-\frac{1}{\beta}\int_{A}\Big{\{}\lambda(s)\eta(s)^{\beta}-\frac{\beta}{\beta+1}\eta(s)^{\beta+1}-\frac{1}{\beta+1}\lambda(s)^{\beta+1}\Big{\}}{\rm d}s. (3.14)

The γ\gamma-divergence is given by

Dγ(λ,η)=1γAλ(s)η(s)γds{Aη(s)γ+1ds}γγ+1+1γ{Aλ(s)γ+1ds}1γ+1.\displaystyle{D}_{\gamma}(\lambda,\eta)=-\frac{1}{\gamma}\frac{\int_{A}\lambda(s)\eta(s)^{\gamma}{\rm d}s}{\{\int_{A}\eta(s)^{\gamma+1}{\rm d}s\}^{\frac{\gamma}{\gamma+1}}}+\frac{1}{\gamma}\Big{\{}\int_{A}\lambda(s)^{\gamma+1}{\rm d}s\Big{\}}^{\frac{1}{\gamma+1}}.

The loss functions corresponding to these are given by

Lβ(θ)=1βi=1Mλ(Si,θ)β+1β+1Aλ(s,θ)β+1ds\displaystyle L_{\beta}(\theta)=-\frac{1}{\beta}\sum_{i=1}^{M}\lambda(S_{i},\theta)^{\beta}+\frac{1}{\beta+1}\int_{A}\lambda(s,\theta)^{\beta+1}{\rm d}s (3.15)

and

Lγ(θ)=1γi=1Mλ(Si,θ)γ{Aλ(s,θ)γ+1ds}γγ+1.\displaystyle L_{\gamma}(\theta)=-\frac{1}{\gamma}\frac{\sum_{i=1}^{M}\lambda(S_{i},\theta)^{\gamma}}{\{\int_{A}\lambda(s,\theta)^{\gamma+1}{\rm d}s\}^{\frac{\gamma}{\gamma+1}}}. (3.16)

The estimating functions corresponding to these are given by

Sβ(θ)=i=1Mλ(Si,θ)βθlogλ(Si,θ)Aλ(s,θ)β+1θlogλ(s,θ)ds\displaystyle{S}_{\beta}(\theta)=\sum_{i=1}^{M}\lambda(S_{i},\theta)^{\beta}\frac{\partial}{\partial\theta}\log\lambda(S_{i},\theta)-\int_{A}\lambda(s,\theta)^{\beta+1}\frac{\partial}{\partial\theta}\log\lambda(s,\theta){\rm d}s

and

Sγ(θ)=\displaystyle{S}_{\gamma}(\theta)= i=1Mλ(Si,θ)γ{Aλ(s,θ)γ+1ds}γγ+1\displaystyle\sum_{i=1}^{M}\frac{\lambda(S_{i},\theta)^{\gamma}}{\{\int_{A}\lambda(s,\theta)^{\gamma+1}{\rm d}s\}^{\frac{\gamma}{\gamma+1}}}
×{θlogλ(Si,θ)Aλ(s,θ)γ+1Aλ(s,θ)γ+1dsθlogλ(s,θ)ds}.\displaystyle\times\Big{\{}\frac{\partial}{\partial\theta}\log\lambda(S_{i},\theta)-\int_{A}\frac{\lambda(s,\theta)^{\gamma+1}}{\int_{A}\lambda(s,\theta)^{\gamma+1}{\rm d}s}\frac{\partial}{\partial\theta}\log\lambda(s,\theta){\rm d}s\Big{\}}. (3.17)
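As a minimal numerical sketch (the intensity values and quadrature grid below are hypothetical), the loss functions (3.15) and (3.16) can be evaluated under a quadrature approximation of the integral over AA. The check at the end illustrates that the γ\gamma-loss is invariant to a rescaling of the intensity, while the β\beta-loss is not:

```python
import numpy as np

def beta_loss(lam_pres, lam_grid, w, beta):
    # beta-loss (3.15): presence sum plus quadrature approximation of the integral
    return (-np.sum(lam_pres ** beta) / beta
            + np.sum(w * lam_grid ** (beta + 1)) / (beta + 1))

def gamma_loss(lam_pres, lam_grid, w, gamma):
    # gamma-loss (3.16): the normalizing denominator makes the loss scale-invariant
    denom = np.sum(w * lam_grid ** (gamma + 1)) ** (gamma / (gamma + 1))
    return -np.sum(lam_pres ** gamma) / (gamma * denom)

# hypothetical intensities at presence points and at quadrature grid points
lam_pres = np.array([1.0, 2.0, 0.5])
lam_grid = np.array([0.8, 1.2, 1.5, 0.7])
w = np.full(4, 0.25)  # quadrature weights

# rescaling lambda -> 3*lambda leaves the gamma-loss unchanged
g1 = gamma_loss(lam_pres, lam_grid, w, gamma=0.5)
g2 = gamma_loss(3.0 * lam_pres, 3.0 * lam_grid, w, gamma=0.5)
```

With these hypothetical values, g1 and g2 agree to machine precision; this scale invariance is what later prevents the intercept of a log-linear model from being identified by the γ\gamma-loss.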

A divergence measure between two PPPs can be written as a functional of the intensity functions induced by them. We observe an interesting relationship from this viewpoint.

Proposition 14.

Let PP and QQ be probability distributions for PPPs with intensity functions λ\lambda and η\eta, respectively. Then, the log γ\gamma-divergence Δγ(P,Q)\Delta_{\gamma}(P,Q) in (3.13) is equal to the β\beta-divergence Dβ(λ,η)D_{\beta}(\lambda,\eta) in (3.14) when γ=β\gamma=\beta.

The proof is immediate by definition.

Essentially, Δγ(P,Q)\Delta_{\gamma}(P,Q) satisfies a scale invariance, expressing an angle between PP and QQ rather than a distance between them; Dβ(λ,η)D_{\beta}(\lambda,\eta) does not satisfy such invariance on the space of intensity functions. Thus, the two have totally different characteristics; however, the connection between probability distributions and their intensity functions for PPPs entails this coincidence. It follows from Proposition 14 that the GM-divergence DGM(P,Q)D_{\rm GM}(P,Q) equals the Itakura-Saito divergence, that is

DGM(P,Q)=A{λ(s)η(s)logλ(s)η(s)1}ds.\displaystyle{D}_{\rm GM}(P,Q)=\int_{A}\Big{\{}\frac{\lambda(s)}{\eta(s)}-\log\frac{\lambda(s)}{\eta(s)}-1\Big{\}}{\rm d}s.

Hence, the GM-loss function is given by

LGM(θ)=i=1M1λ(Si,θ)+Alogλ(s,θ)ds.\displaystyle L_{\rm GM}(\theta)=\sum_{i=1}^{M}\frac{1}{\lambda(S_{i},\theta)}+\int_{A}\log\lambda(s,\theta){\rm d}s.

and the estimating function is

SGM(θ)=i=1M1λ(Si,θ)θlogλ(Si,θ)+Aθlogλ(s,θ)ds.\displaystyle{S}_{\rm GM}(\theta)=\sum_{i=1}^{M}\frac{1}{\lambda(S_{i},\theta)}\frac{\partial}{\partial\theta}\log\lambda(S_{i},\theta)+\int_{A}\frac{\partial}{\partial\theta}\log\lambda(s,\theta){\rm d}s. (3.18)
Proposition 15.

Assume a log-linear model λ(s,θ)=exp{θ1x(s)+θ0}\lambda(s,\theta)=\exp\{\theta_{1}^{\top}x(s)+\theta_{0}\} with a feature vector x(s)x(s). Then, the estimating function of the GM-estimator is given by

SGM(θ)=i=1Mexp{θ1x(Si)θ0}x0(Si)Ax0(s)ds.\displaystyle{S}_{\rm GM}(\theta)=\sum_{i=1}^{M}\exp\{-\theta_{1}^{\top}x(S_{i})-\theta_{0}\}x_{0}(S_{i})-\int_{A}x_{0}(s){\rm d}s. (3.19)

where x0(s)=(1,x(s))x_{0}(s)=(1,x(s)^{\top})^{\top}.

Proof.

The proof is immediate: equation (3.19) is seen by applying (3.18) to the log-linear model. ∎

Equating SGM(θ){S}_{\rm GM}(\theta) to zero yields

i=1Mexp{θx0(Si)}x0(Si)=j=1nw(Sj)x0(Sj)\displaystyle\sum_{i=1}^{M}\exp\{-\theta^{\top}x_{0}(S_{i})\}x_{0}(S_{i})=\sum_{j=1}^{n}w(S_{j})x_{0}(S_{j}) (3.20)

by the quadrature approximation, where w(Sj)w(S_{j}) denotes the quadrature weight at SjS_{j}. This implies that the inverse-intensity weighted mean of the presence data {x(Si)}i=1M\{x(S_{i})\}_{i=1}^{M} is equal to the region mean of {x(Sj)}j=1n\{x(S_{j})\}_{j=1}^{n}. The learning algorithm that solves the estimating equation (3.20) for the GM-estimator needs only to update the inverse-intensity mean over the presence data; no update of the region mean is required during the iteration process. In contrast, the computational cost for the γ\gamma-estimator frequently becomes huge for a large set of quadrature points. For example, obtaining the ML-estimator requires evaluating the quadrature approximation in the likelihood equation (3.6) at every iteration. Such an evaluation step is absent from the algorithm for obtaining the GM-estimator.
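The iteration just described can be sketched as follows. This is a hypothetical Newton-type implementation (not the author's code): the region mean x̄0 is evaluated once before the loop, and only the presence-side weighted mean is updated at each step; the GM-loss for the log-linear model is convex in θ, so a Newton iteration is applicable:

```python
import numpy as np

def gm_estimate(X_pres, X_grid, w, n_iter=100, tol=1e-10):
    """Solve the GM-estimating equation (3.20) for a log-linear model.

    X_pres: (M, d) features at presence points; X_grid: (n, d) features at
    quadrature points with weights w (hypothetical argument layout)."""
    X0p = np.column_stack([np.ones(len(X_pres)), X_pres])
    X0g = np.column_stack([np.ones(len(X_grid)), X_grid])
    xbar0 = w @ X0g                      # region mean: computed once, never updated
    theta = np.zeros(X0p.shape[1])
    for _ in range(n_iter):
        v = np.exp(-X0p @ theta)         # inverse intensity weights at presence points
        S = v @ X0p - xbar0              # GM-estimating function (3.23)
        H = (X0p * v[:, None]).T @ X0p   # Jacobian (Hessian of the convex GM-loss)
        theta += np.linalg.solve(H, S)   # Newton step toward S(theta) = 0
        if np.max(np.abs(S)) < tol:
            break
    return theta
```

Only the presence term is touched inside the loop, which is the source of the cost reduction discussed above.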

Finally, we look into an approach to minimum divergence defined on the space of pdfs. An intensity function λ(s,θ)\lambda(s,\theta) determines the pdf for the occurrence of a point ss by p(s,θ)=λ(s,θ)/Λ(θ)p(s,\theta)=\lambda(s,\theta)/\Lambda(\theta). From this point of view, we can consider the divergence class for pdfs, which has been discussed in Chapter 2. However, this approach has the weakness that, in a log-linear model λ(s,θ)=exp{θ1x(s)+θ0}\lambda(s,\theta)=\exp\{\theta_{1}x(s)+\theta_{0}\}, such a pdf transformation cancels the intercept parameter θ0\theta_{0}, as found in

p(s,θ)=exp{θ1x(s)}Aexp{θ1x(s~)}ds~.\displaystyle p(s,\theta)=\frac{\exp\{\theta_{1}x(s)\}}{\int_{A}\exp\{\theta_{1}x(\tilde{s})\}{\rm d}\tilde{s}}.

The maximum entropy method is based on such an approach, so that the intercept parameter cannot be consistently estimated. See [82] for a detailed discussion in a rigorous framework. Here we note that the γ\gamma-divergence between p=λ/Λp=\lambda/\Lambda and q=η/Hq=\eta/H is essentially equal to that between λ\lambda and η\eta, that is

Dγ(λ,η)=ΛDγ(p,q).{D}_{\gamma}(\lambda,\eta)=\Lambda{D}_{\gamma}(p,q).

This implies that the intercept θ0\theta_{0} is not estimable via the estimating function (3.17). Indeed, the γ\gamma-loss function (3.16) for the log-linear model reduces to

Lγ(θ)=1γi=1Mexp{γθ1x(Si)}{Aexp{(γ+1)θ1x(s)}ds}γγ+1,\displaystyle L_{\gamma}(\theta)=-\frac{1}{\gamma}\frac{\sum_{i=1}^{M}\exp\{\gamma\theta_{1}^{\top}x(S_{i})\}}{\{\int_{A}\exp\{(\gamma+1)\theta_{1}^{\top}x(s)\}{\rm d}s\}^{\frac{\gamma}{\gamma+1}}},

which is constant in θ0\theta_{0}. From the scale invariance of the log γ\gamma-divergence, Δγ(p,q)=Δγ(λ,η)\Delta_{\gamma}(p,q)=\Delta_{\gamma}(\lambda,\eta), noting that pp and qq equal λ\lambda and η\eta up to scale factors. Hence, the intercept parameter is again not identifiable. On the other hand, the β\beta-loss function (3.15) is written as

Lβ(θ)=1βi=1Mexp{β(θ1x(Si)+θ0)}+1β+1Aexp{(β+1)(θ1x(s)+θ0)}ds\displaystyle L_{\beta}(\theta)=-\frac{1}{\beta}\sum_{i=1}^{M}\exp\{\beta(\theta_{1}^{\top}x(S_{i})+\theta_{0})\}+\frac{1}{\beta+1}\int_{A}\exp\{(\beta+1)(\theta_{1}^{\top}x(s)+\theta_{0})\}{\rm d}s

in which θ0\theta_{0} is estimable.

3.4 Robust divergence method

We discuss robustness for estimating the SDM defined by a parametric intensity function λ(s,θ)\lambda(s,\theta), in particular a log-linear model λ(s,θ)=exp{θ1x(s)+θ0}\lambda(s,\theta)=\exp\{\theta_{1}^{\top}x(s)+\theta_{0}\}, where θ=(θ0,θ1)\theta=(\theta_{0},\theta_{1}) and x(s)x(s) is an environmental feature vector influencing the habitat of a target species. In Section 3.3 we discussed minimum divergence estimation for the SDM, in which power divergence measures were explored in an exhaustive manner on the space of PPP distributions, on that of intensity functions, and on that of pdfs. In that examination, the minimum β\beta-divergence method defined on the space of intensity functions is recommended for its desirable consistency property.

We look at the β\beta-estimating function for a given dataset (M,{S1,,SM})(M,\{S_{1},...,S_{M}\}), which is defined as

Sβ(θ)=j=1n{Zjwjexp(θ1x(Sj)+θ0)}exp{β(θ1x(Sj)+θ0)}x0(Sj),\displaystyle{S}_{\beta}(\theta)=\sum_{j=1}^{n}\{Z_{j}-w_{j}\exp(\theta_{1}^{\top}x(S_{j})+\theta_{0})\}\exp\{\beta(\theta_{1}^{\top}x(S_{j})+\theta_{0})\}x_{0}(S_{j}), (3.21)

where x0(Sj)=(1,x(Sj))x_{0}(S_{j})=(1,x(S_{j})^{\top})^{\top}, wjw_{j} is a quadrature weight on the set {S1,,Sn}\{S_{1},...,S_{n}\} combining presence and background grid centers, and ZjZ_{j} is an indicator for presence, that is, Zj=1Z_{j}=1 if 1jM1\leq j\leq M and 0 otherwise. Note that taking β=0\beta=0 yields the likelihood equation (3.6). Alternatively, taking the limit of β\beta to 1-1 yields the GM-estimating function as

SGM(θ)\displaystyle{S}_{\rm GM}(\theta) =j=1n{Zjexp{(θ1x(Sj)+θ0)}wj}x0(Sj),\displaystyle=\sum_{j=1}^{n}\{Z_{j}\exp\{-(\theta_{1}^{\top}x(S_{j})+\theta_{0})\}-w_{j}\}x_{0}(S_{j}), (3.22)
=i=1Mexp{(θ1x(Si)+θ0)}x0(Si)x¯0,\displaystyle=\sum_{i=1}^{M}\exp\{-(\theta_{1}^{\top}x(S_{i})+\theta_{0})\}x_{0}(S_{i})-\bar{x}_{0}, (3.23)

where x¯0=j=1nwjx0(Sj)\bar{x}_{0}=\sum_{j=1}^{n}w_{j}x_{0}(S_{j}). This leads to a remarkable reduction of the computational cost of the learning algorithm, as discussed after Proposition 15: the computation in (3.22) involves only the first term over the presence data, with a single evaluation of x¯0\bar{x}_{0} using the background data. For any β\beta, the β\beta-estimating function is unbiased, because

𝔼θ[Sβ(θ)]=\displaystyle\mathbb{E}_{\theta}[{S}_{\beta}(\theta)]= Aexp{(β+1)(θ1x(s)+θ0)}x0(s)ds\displaystyle\int_{A}\exp\{(\beta+1)(\theta_{1}^{\top}x(s)+\theta_{0})\}x_{0}(s){\rm d}s
j=1nwjexp{(β+1)(θ1x(Sj)+θ0)}x0(Sj),\displaystyle-\sum_{j=1}^{n}w_{j}\exp\{(\beta+1)(\theta_{1}^{\top}x(S_{j})+\theta_{0})\}x_{0}(S_{j}),

which is equal to a zero vector if the quadrature approximation is proper, where 𝔼θ\mathbb{E}_{\theta} denotes expectation under the log-linear model λ(s,θ)=exp(θ1x(s)+θ0)\lambda(s,\theta)=\exp(\theta_{1}^{\top}x(s)+\theta_{0}). This unbiasedness property guarantees the consistency of the β\beta-estimator for θ\theta. In accordance with this, we would like to select the most robust estimator against model misspecification in the class of β\beta-estimators. The β\beta-estimator differs from the ML-estimator only in the estimating weight exp{β(θ1x(Sj)+θ0)}\exp\{\beta(\theta_{1}^{\top}x(S_{j})+\theta_{0})\} in (3.21). We contemplate that this estimating weight may not be effective in every data situation, regardless of whether β\beta is positive or negative. Indeed, the estimating weight is unbounded and admits no comprehensible interpretation under misspecification.
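For concreteness, the β\beta-estimating function (3.21) can be sketched as follows; this is a hypothetical implementation with illustrative data arrays, and setting β=1\beta=-1 reproduces the GM-form (3.22):

```python
import numpy as np

def S_beta(theta, X0, Z, w, beta):
    # beta-estimating function (3.21) on combined presence/background points;
    # X0 has a leading column of ones, Z is the presence indicator, w the weights
    lam = np.exp(X0 @ theta)
    return ((Z - w * lam) * lam ** beta) @ X0

def S_gm(theta, X0, Z, w):
    # GM-estimating function (3.22), the beta -> -1 limit of S_beta
    lam = np.exp(X0 @ theta)
    return (Z / lam - w) @ X0
```

At β=0 the weight is identically one and S_beta reduces to the likelihood-equation form.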

We consider another estimator, rather than the β\beta-estimator, to seek robustness against misspecification based on the discussion above [84]. A main objective is to change the estimating weight of the β\beta-estimator into a more comprehensible form. Let FF be a cumulative distribution function defined on [0,)[0,\infty). Then, we define an estimating function

SF(θ)=j=1n{Zjwjexp(θ1x(Sj)+θ0)}F(σexp(θ1x(Sj)+θ0))x0(Sj),\displaystyle{S}_{F}(\theta)=\sum_{j=1}^{n}\{Z_{j}-w_{j}\exp(\theta_{1}^{\top}x(S_{j})+\theta_{0})\}F(\sigma\exp(\theta_{1}^{\top}x(S_{j})+\theta_{0}))x_{0}(S_{j}), (3.24)

where σ>0\sigma>0 is a hyperparameter. We call θ^F\hat{\theta}_{F}, defined as a solution of the equation SF(θ)=0{S}_{F}(\theta)=0, the FF-estimator. The unbiasedness of SF(θ){S}_{F}(\theta) can be confirmed immediately. In this definition, the estimating weight is given as F(σexp(θ1x(Sj)+θ0))F(\sigma\exp(\theta_{1}^{\top}x(S_{j})+\theta_{0})). For example, we will use a Pareto cumulative distribution function

F(t)=1(1+ηct)1η,F(t)=1-(1+\eta ct)^{-\frac{1}{\eta}},

where the shape parameter η>0\eta>0 is fixed to be 11 in a subsequent experiment. We envisage that FF expresses the existence probability of the intensity for the presence of the target species. Hence, a low value of the weight implies a low probability of presence. The plot of the estimating weights F(σλ(Si,θ))F(\sigma\lambda(S_{i},\theta)) for i=1,,Mi=1,...,M would be helpful if we knew the true value.
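A hypothetical sketch of this weight: with η=1\eta=1 the Pareto CDF reduces to F(t)=ct/(1+ct)F(t)=ct/(1+ct), which rises from 0 toward 1, so points whose fitted intensity is implausibly small receive small estimating weights (σ=1.2\sigma=1.2 below mirrors the later experiment; the helper names are ours):

```python
import numpy as np

def pareto_cdf(t, c=1.0, eta=1.0):
    # F(t) = 1 - (1 + eta*c*t)^(-1/eta); for eta = 1 this is c*t / (1 + c*t)
    return 1.0 - (1.0 + eta * c * t) ** (-1.0 / eta)

def F_weight(theta, X0, sigma=1.2):
    # estimating weight F(sigma * lambda(S_j, theta)) appearing in (3.24)
    lam = np.exp(X0 @ theta)
    return pareto_cdf(sigma * lam)
```

The weight is strictly increasing in the fitted intensity and bounded by 1, in contrast to the unbounded weight of the β\beta-estimator.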

Suppose that the log-linear model for the given data would be misspecified. We consider a specific situation for misspecification such that

λ(s)=(1ϵ)λ(s,θ)+ϵλout(s).\displaystyle\lambda(s)=(1-\epsilon)\lambda(s,\theta)+\epsilon\lambda_{\rm out}(s). (3.25)

This implies that a subgroup contaminates, with probability ϵ\epsilon, a major group whose intensity function is correctly specified. Here the subgroup has the intensity function λout(s)\lambda_{\rm out}(s), which is far away from the log-linear model λ(s,θ)\lambda(s,\theta). Geometrically speaking, the underlying intensity function λ(s)\lambda(s) in (3.25) lies in a tubular neighborhood surrounding the model {λ(s,θ):θΘ}\{\lambda(s,\theta):\theta\in\Theta\} with radius ϵ\epsilon in the space of all intensity functions. Under this misspecification, we expect that the estimating weights for the subgroup should be suppressed relative to those for the major group. It cannot be denied that in practical situations there is always a risk of model misspecification. It is comparatively easy to find outliers in presence records or background data caused by mechanical errors, and standard data preprocessing procedures are helpful for such data cleaning; however, it is difficult to find outliers under a latent structure of misspecification as above. In this regard, the FF-estimator approach is promising for such difficult situations. The hyperparameter σ\sigma should be selected by a cross-validation method so that it has an effective impact on the estimation process. We leave enhancing the clarity and practical applicability of the concepts in this approach as future work.

We present a brief numerical experiment. Assume that a feature vector set {X(sj)}j=1n\{X(s_{j})\}_{j=1}^{n} on presence and background grids is generated from a bivariate normal distribution 𝙽𝚘𝚛(0,I){\tt Nor}(0,{\rm I}), where I\rm I denotes the 2-dimensional identity matrix. Our simulation scenarios for the intensity function were organized as follows.

(a). Specified model: λ(s)=λ(s,θ0)\hskip 17.07164pt\lambda(s)=\lambda(s,\theta_{0}), where λ(s,θ0)=exp{θ01X(s)+θ00}.\lambda(s,\theta_{0})=\exp\{\theta_{01}^{\top}X(s)+\theta_{00}\}.

(b). Misspecified model: λ(s)=(1ϵ)λ(s,θ0)+ϵλ(s,θ)\hskip 2.56073pt\lambda(s)=(1-\epsilon)\lambda(s,\theta_{0})+\epsilon\lambda(s,\theta_{*}), where θ=(θ00,θ01).\theta_{*}=(\theta_{00},-\theta_{01}).

Here the parameters were set as θ00=0.5\theta_{00}=0.5, θ01=(1.5,1.0)\theta_{01}=(1.5,-1.0)^{\top}, and ϵ=0.1\epsilon=0.1. In case (b), λ(s,θ)\lambda(s,\theta_{*}) is a specific example of λout(s)\lambda_{\rm out}(s) in (3.25), which implies that the subgroup has the intensity function with the slope parameter negated relative to the major group. In ecological studies, a major group of a species might thrive under conditions where a few others do not, and vice versa; using a negated parameter can imitate this kind of inverse relationship. See Figure 3.1 for the plot of presence numbers against the two-dimensional feature vectors. The presence numbers were generated from the misspecified model (b) with simulation size 10001000.

Figure 3.1: Plot of presence numbers with major group (red), minor group (blue) and backgrounds (gray).
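A data-generating process of the kind plotted in Figure 3.1 can be sketched as follows. This is a hypothetical simulation fragment (the grid size and the unit-area Poisson sampling of presence numbers are our assumptions, not the exact setup used for the figure):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000  # number of grid points (assumed)
X = rng.multivariate_normal(np.zeros(2), np.eye(2), size=n)

theta00 = 0.5
theta01 = np.array([1.5, -1.0])
eps = 0.1  # contamination ratio

lam_major = np.exp(X @ theta01 + theta00)      # correctly specified component
lam_minor = np.exp(X @ (-theta01) + theta00)   # subgroup with negated slope
lam = (1 - eps) * lam_major + eps * lam_minor  # misspecified intensity, scenario (b)

# presence numbers per grid cell, treating each cell as having unit area (assumed)
counts = rng.poisson(lam)
```

The two components are largest in opposite directions of the feature space, which produces the two colored clouds seen in the figure.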

We compared the ML-estimator θ^0\hat{\theta}_{0}, the FF-estimator (σ=1.2)(\sigma=1.2), and the β\beta-estimator θ^β\hat{\theta}_{\beta} (β=0.1)(\beta=-0.1), where the simulation was conducted with 300 replications. In the case of the correctly specified model, the ML-estimator was slightly superior to the FF-estimator and the β\beta-estimator in terms of the root mean square error (rmse); however, its advantage over the FF-estimator was slight. Next, we supposed a mixture model of two intensity functions, in which one was the log-linear model as above with mixing ratio 0.90.9, and the other was still a log-linear model but with the negated slope vector and mixing ratio 0.10.1. Under this setting, the FF-estimator was especially superior to the ML-estimator and the β\beta-estimator, with the rmse of the ML-estimator more than double that of the FF-estimator. The β\beta-estimator was less robust to this misspecification. Thus, the ML-estimator is sensitive to the presence of such a heterogeneous subgroup, while the FF-estimator is robust. We consider that the estimating weight F(σλ(s,θ))F(\sigma\lambda(s,\theta)) effectively suppresses the influence of the subgroup in the estimating function of the FF-estimator. See Table 3.1 for details and Figure 3.2 for the plot of the three estimators in case (b). We observe in this numerical experiment that the FF-estimator has almost the same performance as the ML-estimator when the model is correctly specified, and more robust performance than the ML-estimator when the model is partially misspecified.

Table 3.1: Comparison among the ML-estimator, FF-estimator and β\beta-estimator.

(a). The case of specified model

Method estimate rmse
ML-estimator (0.499,1.497,1.003)({0.499,1.497,-1.003}) 0.1520.152
FF-estimator (0.501,1.496,1.002)({0.501,1.496,-1.002}) 0.1560.156
β\beta-estimator (0.707,1.498,1.004)({0.707,1.498,-1.004}) 0.2590.259

(b). The case of misspecified model

Method estimate rmse
ML-estimator (0.930,1.208,0.784)({0.930,1.208,-0.784}) 0.8690.869
FF-estimator (0.562,1.410,0.935)({0.562,1.410,-0.935}) 0.3790.379
β\beta-estimator (1.211,1.163,0.750)({1.211,1.163,-0.750}) 1.0681.068
Figure 3.2: Box-whisker Plots of the ML-estimator, FF-estimator and β\beta-estimator

We note that the FF-estimator θ^F\hat{\theta}_{F} is defined by the solution of the equation SF(θ)=0{S}_{F}(\theta)=0, while an objective function has not been given explicitly. However, it follows from the Poincaré lemma that there is an objective function LF(θ)L_{F}(\theta) such that θ^F=argminθΘLF(θ)\hat{\theta}_{F}=\mathop{\rm argmin}_{\theta\in\Theta}L_{F}(\theta), because SF(θ){S}_{F}(\theta) is integrable: its Jacobian matrix is symmetric. See [4] for geometric insights into the Poincaré lemma. In effect, we have the solution as follows.

Proposition 16.

Let FF be a cumulative distribution function on [0,)[0,\infty). Consider a loss function for a model λ(s,θ)\lambda(s,\theta) defined by

LF(θ)=i=1MaF(λ(Si,θ))+j=1nwjbF(λ(Sj,θ)),\displaystyle L_{F}(\theta)=-\sum_{i=1}^{M}a_{F}(\lambda(S_{i},\theta))+\sum_{j=1}^{n}w_{j}b_{F}(\lambda(S_{j},\theta)),

where aF(λ)=0λF(z)z𝑑za_{F}(\lambda)=\int_{0}^{\lambda}\frac{F(z)}{z}dz and bF(λ)=0λF(z)𝑑z.b_{F}(\lambda)=\int_{0}^{\lambda}{F(z)}dz. Then, if the model is a log-linear model λ(s,θ)=exp{θ1x(s)+θ0}\lambda(s,\theta)=\exp\{\theta_{1}^{\top}x(s)+\theta_{0}\}, the estimating function is given by SF(θ){S}_{F}(\theta) in (3.24).

Proof.

The gradient vector of LF(θ)L_{F}(\theta) is given by

θLF(θ)=i=1MaF(λ(Si,θ))θλ(Si,θ)+j=1nwjbF(λ(Sj,θ))θλ(Sj,θ).\displaystyle\frac{\partial}{\partial\theta}L_{F}(\theta)=-\sum_{i=1}^{M}a_{F}^{\prime}(\lambda(S_{i},\theta))\frac{\partial}{\partial\theta}\lambda(S_{i},\theta)+\sum_{j=1}^{n}w_{j}b_{F}^{\prime}(\lambda(S_{j},\theta))\frac{\partial}{\partial\theta}\lambda(S_{j},\theta).

Since aF(λ)=F(λ)/λa_{F}^{\prime}(\lambda)=F(\lambda)/\lambda and bF(λ)=F(λ)b_{F}^{\prime}(\lambda)=F(\lambda), this gradient equals SF(θ)-{S}_{F}(\theta), where

SF(θ)\displaystyle{S}_{F}(\theta) =i=1MF(λ(Si,θ))θlogλ(Si,θ)j=1nwjλ(Sj,θ)F(λ(Sj,θ))θlogλ(Sj,θ)\displaystyle=\sum_{i=1}^{M}F(\lambda(S_{i},\theta))\frac{\partial}{\partial\theta}\log\lambda(S_{i},\theta)-\sum_{j=1}^{n}w_{j}\lambda(S_{j},\theta)F(\lambda(S_{j},\theta))\frac{\partial}{\partial\theta}\log\lambda(S_{j},\theta)
=j=1n(Zjwjλ(Sj,θ))F(λ(Sj,θ))θlogλ(Sj,θ),\displaystyle=\sum_{j=1}^{n}(Z_{j}-w_{j}{\lambda(S_{j},\theta)})F(\lambda(S_{j},\theta))\frac{\partial}{\partial\theta}\log\lambda(S_{j},\theta), (3.26)

where ZjZ_{j} is the presence indicator. Hence, we conclude that SF(θ){S}_{F}(\theta) is equal to that given in (3.24) under a log-linear model. ∎
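To make Proposition 16 concrete, for the Pareto CDF with η=1\eta=1 and σ\sigma absorbed into FF, the integrals have closed forms: aF(λ)=log(1+cλ)a_{F}(\lambda)=\log(1+c\lambda) and bF(λ)=λlog(1+cλ)/cb_{F}(\lambda)=\lambda-\log(1+c\lambda)/c. The hypothetical sketch below builds the loss LF(θ)L_{F}(\theta) and the estimating function, so that a finite-difference check can confirm that the gradient of LF(θ)L_{F}(\theta) matches the estimating-function form up to sign, meaning that minimizing LF(θ)L_{F}(\theta) solves SF(θ)=0{S}_{F}(\theta)=0:

```python
import numpy as np

c = 1.0  # Pareto scale; shape eta fixed to 1, so F(z) = c*z / (1 + c*z)

def a_F(lam):
    # integral of F(z)/z over (0, lam]: log(1 + c*lam)
    return np.log1p(c * lam)

def b_F(lam):
    # integral of F(z) over (0, lam]: lam - log(1 + c*lam)/c
    return lam - np.log1p(c * lam) / c

def L_F(theta, X0, Z, w):
    # loss of Proposition 16 on combined presence/background points
    lam = np.exp(X0 @ theta)
    return -np.sum(Z * a_F(lam)) + np.sum(w * b_F(lam))

def S_F(theta, X0, Z, w):
    # estimating function (3.24) with sigma absorbed into F
    lam = np.exp(X0 @ theta)
    return ((Z - w * lam) * (c * lam / (1.0 + c * lam))) @ X0
```

Differentiating L_F numerically and comparing with -S_F reproduces the identity in the proof.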

The FF-estimator is derived by minimization of this loss function LF(θ)L_{F}(\theta). Hence, we ask whether there is a divergence measure that induces LF(θ)L_{F}(\theta).

Remark 4.

We have a quick review of the Bregman divergence that is defined by

DU(λ,η)=[U(λ(s))U(η(s))U(η(s)){λ(s)η(s)}]ds\displaystyle D_{U}(\lambda,\eta)=\int[U(\lambda(s))-U(\eta(s))-U^{\prime}(\eta(s))\{\lambda(s)-\eta(s)\}]{\rm d}s

where UU is a convex function. The loss function is given by

LU(θ)=i=1MU(λ(Si,θ))+j=1nwj{λ(Sj,θ)U(λ(Sj,θ))U(λ(Sj,θ))}\displaystyle L_{U}(\theta)=-\sum_{i=1}^{M}U^{\prime}(\lambda(S_{i},\theta))+\sum_{j=1}^{n}w_{j}\{\lambda(S_{j},\theta)U^{\prime}(\lambda(S_{j},\theta))-U(\lambda(S_{j},\theta))\}

and the estimating function is

SU(θ)=j=1n{Zjwjλ(Sj,θ)}U′′(λ(Sj,θ))λ(Sj,θ)θlogλ(Sj,θ).\displaystyle{S}_{U}(\theta)=\sum_{j=1}^{n}\{Z_{j}-w_{j}\lambda(S_{j},\theta)\}{U^{\prime\prime}(\lambda(S_{j},\theta))}{\lambda(S_{j},\theta)}\frac{\partial}{\partial\theta}\log\lambda(S_{j},\theta).

Therefore, we observe that, if F(z)=U′′(z)zF(z)=U^{\prime\prime}(z)z, then SF(θ)=SU(θ){S}_{F}(\theta)={S}_{U}(\theta), where SF(θ){S}_{F}(\theta) is defined in (3.26). This implies that the divergence, DF(λ,η)D_{F}(\lambda,\eta), associated with SF(θ){S}_{F}(\theta) is equal to the Bregman divergence DU(λ,η)D_{U}(\lambda,\eta) with the generator UU satisfying F(z)=U′′(z)zF(z)=U^{\prime\prime}(z)z. That is,

DF(λ,η)=[AF(λ(s))AF(η(s))aF(η(s)){λ(s)η(s)}]ds,\displaystyle D_{F}(\lambda,\eta)=\int[A_{F}(\lambda(s))-A_{F}(\eta(s))-a_{F}(\eta(s))\{\lambda(s)-\eta(s)\}]{\rm d}s,

where aFa_{F} and bFb_{F} are defined in Proposition 16 and AF(z)=0zaF(s)dsA_{F}(z)=\int_{0}^{z}a_{F}(s){\rm d}s.

3.5 Conclusion and Future Work

Our study marks a significant advance in the field of species distribution modeling by introducing a novel approach that leverages Poisson Point Processes (PPP) and alternative divergence measures. The key contribution of this work is the development of the FF-estimator, a robust and efficient tool designed to overcome the limitations of traditional ML-methods, particularly in cases of model misspecification.

The FF-estimator, based on cumulative distribution functions, demonstrates superior performance in our simulations. This robustness is particularly notable in handling model misspecification, a common challenge in ecological data analysis. Our approach provides ecologists and conservationists with a more reliable tool for predicting species distributions, which is crucial for biodiversity conservation efforts and ecological planning. We also explored the computational aspects of the FF-estimator, finding it to be computationally feasible for practical applications, despite its advanced statistical underpinnings. While our study offers significant contributions, it also opens up several avenues for future research: Further validation of the FF-estimator in diverse ecological settings and with different species is necessary to establish its generalizability and practical utility. The integration of the FF-estimator with other types of ecological data, such as presence-only data, would enhance its applicability. There is scope for further refining the computational algorithms to enhance the efficiency of the FF-estimator, making it more accessible for large-scale ecological studies. Exploring the applicability of this method in other scientific disciplines, such as environmental science and geography, could be a fruitful area of research. In conclusion, our work not only contributes to the theoretical underpinnings of species distribution modeling but also has significant practical implications for ecological research and conservation strategies.

The intensity function is modeled based on environmental variables, reflecting habitat preferences. This process typically involves a dataset of locations where species and environmental information have been observed, along with accurate and high-quality background data. With precise training on these datasets, reliable predictions can be derived using maximum likelihood methods in Poisson point process modeling. These predictions are easily obtained by plugging the maximum likelihood estimates into the predictors. While Poisson point process modeling and the maximum likelihood method can derive reliable predictions from observed data, predicting for 'unsampled' areas that differ significantly from the observed regions poses a significant challenge [97, 64].

The ability to predict the occurrence of target species in unobserved areas using datasets of observed locations, environmental information, and background data is a pivotal issue in species distribution modeling (SDM) and ecological research. Applying these models to regions that differ significantly from those included in the training dataset introduces several technical and methodological challenges. To address this issue, exploring predictions based on the similarity of environmental variables is essential. One promising approach relies on ecological similarity rather than geographical proximity, making it particularly effective for species with wide distributions or fragmented habitats. Additionally, by adopting a weighted likelihood approach and linking Poisson point processes through a probability kernel function between observed and unobserved areas, it becomes possible to efficiently predict the probability of species occurrence in unobserved areas. We believe that the methodologies developed in this study will inspire further innovations in statistical ecology and beyond.

Chapter 4 Minimum divergence in machine learning

We discuss divergence measures and applications encompassing several areas of machine learning: Boltzmann machines, gradient boosting, active learning, and cosine similarity. Boltzmann machines have seen wide development as generative models with the help of statistical dynamics. The ML-estimator is a basic device for data learning, but its computation is challenging owing to the evaluation of the partition function. We introduce the GM-divergence and the GM-estimator for Boltzmann machines. The GM-estimator is shown to admit fast computation, since it is free of partition-function evaluation. Next, we focus on active learning, particularly the Query by Committee method, highlighting how divergence measures can be used to select informative data points and integrating statistical and machine learning concepts. Finally, we extend the γ\gamma-divergence to a space of real-valued functions. This yields a natural extension of cosine similarity, called the γ\gamma-cosine similarity. Its basic properties are explored and demonstrated in numerical experiments in comparison with traditional cosine similarity.

4.1 Boltzmann machine

Boltzmann Machines (BMs) are a class of stochastic recurrent neural networks introduced in the early 1980s, crucial in bridging the realms of statistical physics and machine learning; see [44, 43] for the mechanics of BMs, and [35] for comprehensive foundations in the theory underlying neural networks and deep learning. They have become fundamental for understanding and developing more advanced generative models. BMs are statistical models that learn to represent the underlying probability distributions of a dataset. They consist of visible and hidden units, where the visible units correspond to the observed data and the hidden units capture the latent features. Usually, the connections between these units are symmetrical, which means the weight matrix is symmetric. The energy of a configuration in a BM is calculated using an energy function, typically defined by the biases of units and the weights of the connections between units. The partition function is a normalizing factor that ensures the probabilities sum to 1, obtained by summing the exponentiated negative energy over all possible configurations of the units [35].

Training a BM involves adjusting the parameters (weights and biases) to maximize the likelihood of the observed data. This is often done via stochastic maximum likelihood or contrastive divergence. The log-likelihood gradient has a simple form, but computing it exactly is intractable due to the partition function. Thus, approximations or sampling methods like Markov chain Monte Carlo are used. BMs have been extended to more complex and efficient models like Restricted BMs and deep belief networks. They have found applications in dimensionality reduction, topic modeling, and collaborative filtering among others. We overview the principles and applications of BMs, especially in exploring the landscape of energy-based models and the geometrical insights into the learning dynamics of such models. The exploration of divergence, cross-entropy, and entropy in the context of BMs might yield profound understandings, potentially propelling advancements in both theoretical and practical domains of machine learning and artificial intelligence.

Let 𝒫\mathcal{P} be the space of all probability mass functions defined on a finite discrete set 𝒳={1,1}d{\mathcal{X}}=\{-1,1\}^{d}, that is

𝒫={p(x):p(x)>0(x𝒳),x𝒳p(x)=1},\displaystyle{\mathcal{P}}=\Big{\{}p(x):p(x)>0\ (\forall x\in{\mathcal{X}})\ ,\sum_{x\in{\mathcal{X}}}p(x)=1\Big{\}},

in which p(x)p(x) is called a dd-variate Boltzmann distribution. A standard BM in 𝒫\mathcal{P} is introduced as

p(x,θ)=1Zθexp{E(x,θ)}\displaystyle p(x,\theta)=\frac{1}{Z_{\theta}}\exp\{-E(x,\theta)\}

for x𝒳x\in{\mathcal{X}}, where E(x,θ)E(x,\theta) is the energy function defined by

E(x,θ)=bxxWx\displaystyle E(x,\theta)=-b^{\top}x-x^{\top}{W}x

with a parameter θ=(b,W)\theta=(b,{W}). Here ZθZ_{\theta} is the partition function defined by

Zθ=x𝒳exp{E(x,θ)}.\displaystyle Z_{\theta}=\sum_{x\in{\mathcal{X}}}\exp\{-E(x,\theta)\}.
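For instance, the partition function can be computed by brute-force enumeration, which is feasible only for small dd; the 2d2^{d} growth of this sum is exactly what makes likelihood-based learning hard. A hypothetical sketch:

```python
import itertools
import numpy as np

def energy(x, b, W):
    # E(x, theta) = -b^T x - x^T W x
    return -b @ x - x @ W @ x

def partition(b, W):
    # Z_theta: sum of exp{-E(x, theta)} over all 2^d configurations in {-1, 1}^d
    d = len(b)
    return sum(np.exp(-energy(np.array(xs), b, W))
               for xs in itertools.product([-1, 1], repeat=d))
```

With b = 0 and W = 0, every configuration has zero energy and the sum is simply 2^d.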

The Kullback-Leibler (KL) divergence is written as

D0(p,p(,θ))=x𝒳p(x)logp(x)p(x,θ)\displaystyle D_{0}(p,p(\cdot,\theta))=\sum_{x\in{\mathcal{X}}}p(x)\log\frac{p(x)}{p(x,\theta)}

which involves the partition function ZθZ_{\theta}. The negative log-likelihood function for a given dataset {xi}i=1,,N\{x_{i}\}_{i=1,...,N} is written as

L0(θ)=i=1NE(xi,θ)+NlogZθ\displaystyle L_{0}(\theta)=\sum_{i=1}^{N}E(x_{i},\theta)+N\log Z_{\theta}

and the estimating function is given by

SML(θ)=1Ni=1N[xi𝔼θ[X]xixi𝔼θ[XX]]\displaystyle{S}_{\rm ML}(\theta)=\frac{1}{N}\sum_{i=1}^{N}\begin{bmatrix}\!\!\!\!x_{i}-\mathbb{E}_{\theta}[X]\vspace{1mm}\\ x_{i}x_{i}^{\top}-\mathbb{E}_{\theta}[XX^{\top}]\end{bmatrix}

where 𝔼θ\mathbb{E}_{\theta} denotes the expectation with respect to p(x,θ)p(x,\theta). In practice, the likelihood computation is known to be infeasible because it requires a sequential procedure with large summations involving 𝔼θ\mathbb{E}_{\theta} or ZθZ_{\theta}. There is much literature discussing approximate computations such as variational approximations and Markov chain Monte Carlo simulations [42].

On the other hand, we observe that the computation of the GM-divergence D_{\rm GM} does not require any evaluation of the partition function, as follows. The GM-divergence is defined by

\displaystyle D_{\rm GM}(p,p(\cdot,\theta))=\sum_{x\in{\mathcal{X}}}\frac{p(x)}{p(x,\theta)}\prod_{x^{\prime}\in{\mathcal{X}}}p(x^{\prime},\theta)^{\frac{1}{m}}. (4.1)

This is written as

DGM(p,p(,θ))=x𝒳p(x)exp{E(x,θ)E¯(θ)}\displaystyle D_{\rm GM}(p,p(\cdot,\theta))=\sum_{x\in{\mathcal{X}}}{p(x)}{\exp\{E(x,\theta)-\bar{E}(\theta)\}}

where \bar{E}(\theta) is an averaged energy given by \bar{E}(\theta)=\frac{1}{m}\sum_{x\in{\mathcal{X}}}E(x,\theta), with m=|{\mathcal{X}}|=2^{d} being the number of states. Note that the averaged energy is written as

\displaystyle\bar{E}(\theta)=-b^{\top}\bar{x}-{\rm tr}({W}\overline{xx^{\top}}), (4.2)

where \bar{x}=\frac{1}{m}\sum_{x\in{\mathcal{X}}}x and \overline{xx^{\top}}=\frac{1}{m}\sum_{x\in{\mathcal{X}}}xx^{\top}. We observe that D_{\rm GM}(p,p(\cdot,\theta)) is free of the partition function Z_{\theta}, owing to the cancellation of Z_{\theta} between the two factors on the right-hand side of (4.1).

For a given dataset {Xi}i=1,,N\{X_{i}\}_{i=1,...,N}, the GM-loss function for θ\theta is defined by

LGM(θ)=1Ni=1Nexp{E(Xi,θ)E¯(θ)}\displaystyle L_{\rm GM}(\theta)=\frac{1}{N}\sum_{i=1}^{N}{\exp\{E(X_{i},\theta)-\bar{E}(\theta)\}}

and the minimizer θ^GM\hat{\theta}_{\rm GM} is called the GM-estimator. The estimating function is given by

SGM(θ)=1Ni=1Nexp{E(Xi,θ)E¯(θ)}[XiX¯XiXiXX¯].\displaystyle{S}_{\rm GM}(\theta)=\frac{1}{N}\sum_{i=1}^{N}{\exp\{E(X_{i},\theta)-\bar{E}(\theta)\}}\begin{bmatrix}X_{i}-\bar{X}\vspace{1mm}\\ X_{i}X_{i}^{\top}-\overline{XX^{\top}}\end{bmatrix}.
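By contrast, evaluating the GM-loss and its estimating function requires no enumeration: on {\mathcal{X}}=\{-1,1\}^{d} the state mean of x is 0 and that of xx^{\top} is the identity, so the averaged energy reduces to -{\rm tr}(W). A minimal sketch under these conventions (function names are ours):

```python
import numpy as np

def energy(x, b, W):
    # E(x, theta) = -b^T x - x^T W x
    return -b @ x - x @ W @ x

def gm_loss(X, b, W):
    # L_GM = (1/N) sum_i exp{E(X_i) - Ebar}; on {-1,1}^d the averaged
    # energy is Ebar = -tr(W), since the state mean of x is 0 and the
    # state mean of x x^T is the identity -- no 2^d summation needed
    Ebar = -np.trace(W)
    return float(np.mean([np.exp(energy(x, b, W) - Ebar) for x in X]))

def gm_score(X, b, W):
    # estimating function S_GM: weighted deviations of the sufficient
    # statistics (x, x x^T) from their state averages
    d = len(b)
    Ebar = -np.trace(W)
    w = np.array([np.exp(energy(x, b, W) - Ebar) for x in X])
    g_b = (w[:, None] * X).mean(axis=0)          # state mean of x is 0
    g_W = np.mean([wi * (np.outer(x, x) - np.eye(d)) for wi, x in zip(w, X)],
                  axis=0)
    return g_b, g_W
```

At \theta=0 all weights \exp\{E-\bar{E}\} equal one, so the loss is 1 and the score reduces to the empirical moments, which is a convenient check.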

Accordingly, the computation for finding \hat{\theta}_{\rm GM} is drastically lighter than that for the ML-estimator. For example, a stochastic gradient algorithm can be applied in a feasible manner. In some cases, a Newton-type algorithm may also be applicable, suggested as the update

\displaystyle\theta\longleftarrow\theta+\Big{\{}\sum_{i=1}^{N}{\exp\{E(X_{i},\theta)-\bar{E}(\theta)\}}{S}(X_{i})\Big{\}}^{-1}{S}_{\rm GM}(\theta)

where

S(Xi)=[XiX¯XiXiXX¯][XiX¯,XiXiXX¯].\displaystyle{S}(X_{i})=\begin{bmatrix}X_{i}-\bar{X}\vspace{1mm}\\ X_{i}X_{i}^{\top}-\overline{XX^{\top}}\end{bmatrix}\begin{bmatrix}X_{i}^{\top}-\bar{X}^{\top},X_{i}X_{i}^{\top}-\overline{XX^{\top}}\end{bmatrix}.

This discussion also applies to deep BMs built from restricted BMs.

Here we present a small simulation study to demonstrate the fast computation of the GM-estimator compared to the ML-estimator. The computation time can of course vary with the hardware specifications and other running processes. The simulation is done by a Python program on Google Colaboratory
(https://research.google.com/colaboratory). Keep in mind that the computation of the partition function Z_{\theta} is extremely challenging in large dimensions due to the exponential number of terms; for simplicity, this implementation does not optimize that calculation and is not feasible for very large dimensions. Figure 4.1 plots the computation times for the log-likelihood and the GM loss across dimensions d=5,\ldots,10 of the BM, where the sample size n is fixed at 100. The computation time for the log-likelihood increases significantly with the dimension, reflecting the exponential growth in the number of states summed over in the partition function; indeed, it is not feasible to compute the log-likelihood within a reasonable time frame for dimensions approaching d=20 with the current method. On the other hand, the computation time for the GM loss remains low and stable across these dimensions, which reinforces its computational efficiency, particularly in higher dimensions. This result suggests that the GM loss offers a computationally efficient alternative to the ML-estimator as the dimensionality of the problem increases. For dimensions higher than 10, the naive gradient algorithm for the ML-estimator cannot converge in the allotted time, while that for the GM-estimator works well if d\leq 100. When d=100 and N=5000, the computation time is approximately 0.811 seconds.

Figure 4.1: Computation time for the ML-estimator and GM-estimator for d=5,…,10
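The timing experiment behind Figure 4.1 can be sketched as the following minimal harness (not the original Colaboratory script; the ±1 coding, the random Gaussian parameters, and all names are our assumptions):

```python
import itertools
import time
import numpy as np

def _energies(X, b, W):
    # E(x, theta) = -b^T x - x^T W x, evaluated row-wise
    return np.array([-b @ x - x @ W @ x for x in X])

def time_losses(d, N, seed=0):
    # wall-clock comparison: exact log-likelihood (enumerates 2^d states
    # for Z_theta) versus the partition-free GM loss
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(N, d))
    b = rng.normal(size=d)
    W = 0.1 * rng.normal(size=(d, d))

    t0 = time.perf_counter()
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    logZ = np.log(np.sum(np.exp(-_energies(states, b, W))))
    nll = np.sum(_energies(X, b, W)) + N * logZ
    t_ml = time.perf_counter() - t0

    t0 = time.perf_counter()
    Ebar = -np.trace(W)              # closed-form mean energy on {-1,1}^d
    gm = np.mean(np.exp(_energies(X, b, W) - Ebar))
    t_gm = time.perf_counter() - t0
    return t_ml, t_gm, nll, gm
```

The likelihood timing should grow roughly like 2^{d}, while the GM timing depends only on N, reproducing the qualitative gap reported in Figure 4.1.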

Consider a Boltzmann distribution with visible and hidden units

p(x,h,θ)=1Zθexp{E(x,h,θ)}\displaystyle p(x,h,\theta)=\frac{1}{Z_{\theta}}\exp\{-E(x,h,\theta)\}

for (x,h)𝒳×(x,h)\in{\mathcal{X}}\times{\mathcal{H}}, where 𝒳={0,1}d{\mathcal{X}}=\{0,1\}^{d}, ={0,1}{\mathcal{H}}=\{0,1\}^{\ell} and E(x,h,θ)E(x,h,\theta) is the energy function defined by

E(x,h,θ)=bxchxWh\displaystyle E(x,h,\theta)=-b^{\top}x-c^{\top}h-x^{\top}{W}h

with a parameter \theta=(b,c,{W}). Here Z_{\theta} is the partition function defined by

Zθ=(x,h)𝒳×exp{E(x,h,θ)}.\displaystyle Z_{\theta}=\sum_{(x,h)\in{\mathcal{X}}\times{\mathcal{H}}}\exp\{-E(x,h,\theta)\}.

The marginal distribution is given by

p(x,θ)=1Zθhexp{E(x,h,θ)},\displaystyle p(x,\theta)=\frac{1}{Z_{\theta}}\sum_{h\in{\mathcal{H}}}\exp\{-E(x,h,\theta)\},

and the GM-divergence is given by

\displaystyle D_{\rm GM}(p,p(\cdot,\theta)) =\sum_{x\in{\mathcal{X}}}{p(x)}\Big{[}\sum_{h\in{\mathcal{H}}}\exp\{-E(x,h,\theta)\}\Big{]}^{-1}\prod_{x^{\prime}\in{\mathcal{X}}}\Big{[}\sum_{h\in{\mathcal{H}}}\exp\{-E(x^{\prime},h,\theta)\}\Big{]}^{\frac{1}{m}}
=\sum_{x\in{\mathcal{X}}}{p(x)}{\exp\{\tilde{E}(x,\theta)-\bar{E}(\theta)\}} (4.3)

where E¯(θ)=1mx𝒳E~(x,θ)\bar{E}(\theta)={\frac{1}{m}}\sum_{x\in{\mathcal{X}}}\tilde{E}(x,\theta) and

E~(x,θ)=loghexp{E(x,h,θ)}.\displaystyle\tilde{E}(x,\theta)=-\log\sum_{h\in{\mathcal{H}}}\exp\{-E(x,h,\theta)\}.

Note that the bias term \bar{E}(\theta) cannot be written in terms of the sufficient statistics as in (4.2). For a given dataset \{x_{i}\}_{i=1,...,N}, the GM-loss function for \theta is defined by

LGM(θ)\displaystyle L_{\rm GM}(\theta) =1Ni=1N[hexp{E(Xi,h,θ)}]1x𝒳[hexp{E(x,h,θ)}]1m\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\Big{[}\sum_{h\in{\mathcal{H}}}\exp\{-E(X_{i},h,\theta)\}\Big{]}^{-1}\prod_{x\in{\mathcal{X}}}\Big{[}\sum_{h\in{\mathcal{H}}}\exp\{-E(x,h,\theta)\}\Big{]}^{\frac{1}{m}}
=1Ni=1Nexp{E~(Xi,θ)E¯(θ)}.\displaystyle=\frac{1}{N}\sum_{i=1}^{N}{\exp\{\tilde{E}(X_{i},\theta)-\bar{E}(\theta)\}}. (4.4)

The estimating function is given by

\displaystyle{S}_{\rm GM}(\theta) =\frac{1}{N}\sum_{i=1}^{N}{\exp\{\tilde{E}(X_{i},\theta)-\bar{E}(\theta)\}}\begin{bmatrix}X_{i}-\bar{X}\vspace{1.5mm}\\ \mathbb{E}_{\theta}[H|X_{i}]-\overline{\mathbb{E}_{\theta}[H|X]}\vspace{1.5mm}\\ \mathbb{E}_{\theta}[X_{i}H^{\top}|X_{i}]-\overline{\mathbb{E}_{\theta}[XH^{\top}|X]}\end{bmatrix} (4.5)

where

p(h|x,θ)=p(x,h,θ)p(x,θ)=exp{E(x,h,θ)}exp{E~(x,θ)}\displaystyle p(h|x,\theta)=\frac{p(x,h,\theta)}{p(x,\theta)}=\frac{\exp\{-E(x,h,\theta)\}}{\exp\{-\tilde{E}(x,\theta)\}} (4.6)

and

𝔼θ[H|X]¯=1mx𝒳𝔼θ[H|x],𝔼θ[XH|X]¯=1mx𝒳𝔼θ[xH|x].\displaystyle\overline{\mathbb{E}_{\theta}[H|X]}={\frac{1}{m}}\sum_{x\in{\mathcal{X}}}\mathbb{E}_{\theta}[H|x],\ \ \ \overline{\mathbb{E}_{\theta}[XH^{\top}|X]}={\frac{1}{m}}\sum_{x\in{\mathcal{X}}}\mathbb{E}_{\theta}[xH^{\top}|x].

For H=(H_{k})_{k=1}^{\ell}, the components H_{k} are conditionally independent given x, as seen in

p(h|x,θ)\displaystyle p(h|x,\theta) =exp{bx+ch+xWh}hexp{bx+ch+xWh}\displaystyle=\frac{\exp\{b^{\top}x+c^{\top}h+x^{\top}{W}h\}}{\sum_{h^{\prime}\in{\mathcal{H}}}\exp\{b^{\top}x+c^{\top}h^{\prime}+x^{\top}{W}h^{\prime}\}} (4.7)
=k=1exp{ckhk+j=1dxjWjkhk}1+exp{ck+j=1dxjWjk},\displaystyle=\prod_{k=1}^{\ell}\frac{\exp\{c_{k}h_{k}+\sum_{j=1}^{d}x_{j}{W}_{jk}h_{k}\}}{1+\exp\{c_{k}+\sum_{j=1}^{d}x_{j}{W}_{jk}\}},

and hence,

𝔼θ[Hk|X]¯=1mxexp{ck+j=1dxjWjk}1+exp{ck+j=1dxjWjk}\displaystyle\overline{\mathbb{E}_{\theta}[H_{k}|X]}={\frac{1}{m}}\sum_{x}\frac{\exp\{c_{k}+\sum_{j=1}^{d}x_{j}{W}_{jk}\}}{1+\exp\{c_{k}+\sum_{j=1}^{d}x_{j}{W}_{jk}\}}

and

𝔼θ[Hk|X(j)]¯=1mx(j)exp{ck+Wjk+jjxjWjk}1+exp{ck+Wjk+jjxjWjk},\displaystyle\overline{\mathbb{E}_{\theta}[H_{k}|X_{(-j)}]}={\frac{1}{m}}\sum_{\ \ x_{(-j)}}\frac{\exp\{c_{k}+{W}_{jk}+\sum_{j^{\prime}\neq j}x_{j^{\prime}}{W}_{j^{\prime}k}\}}{1+\exp\{c_{k}+{W}_{jk}+\sum_{j^{\prime}\neq j}x_{j^{\prime}}{W}_{j^{\prime}k}\}},

where x(j)=(x1,,xj1,xj+1,,xd)x_{(-j)}=(x_{1},...,x_{j-1},x_{j+1},...,x_{d}).

Note that the conditional expectation \mathbb{E}_{\theta}[\ \cdot\ |x] in the estimating function (4.5) can be evaluated by p(h|x,\theta) in (4.7), which is free of the partition function Z_{\theta}. A stochastic gradient algorithm is thus easily implemented with fast computation.
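The partition-free ingredients of the estimating function (4.5) can be sketched as follows: the conditional mean of the hidden units from (4.7), and the closed form \tilde{E}(x,\theta)=-b^{\top}x-\sum_{k}\log(1+\exp\{c_{k}+(x^{\top}W)_{k}\}) obtained by summing out h\in\{0,1\}^{\ell} (function names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cond_hidden_mean(x, c, W):
    # E_theta[H | x]: by the conditional independence in (4.7), the k-th
    # hidden unit is Bernoulli with success probability
    # sigmoid(c_k + sum_j x_j W_jk) -- no partition function involved
    return sigmoid(c + x @ W)

def e_tilde(x, b, c, W):
    # E~(x, theta) = -log sum_h exp{-E(x, h, theta)}
    #             = -b^T x - sum_k log(1 + exp{c_k + (x^T W)_k})
    return -b @ x - np.sum(np.logaddexp(0.0, c + x @ W))
```

With c = 0 and W = 0 each hidden unit is a fair coin and \tilde{E}(x)=-b^{\top}x-\ell\log 2, which provides a quick check.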

Next, consider a Boltzmann distribution connecting visible and hidden units to an output variable as

p(x,h,y,θ)=1Zθexp{E(x,h,y,θ)}\displaystyle p(x,h,y,\theta)=\frac{1}{Z_{\theta}}\exp\{-E(x,h,y,\theta)\}

for (x,h,y)𝒳××𝒴(x,h,y)\in{\mathcal{X}}\times{\mathcal{H}}\times{\mathcal{Y}}, where E(x,h,y,θ)E(x,h,y,\theta) is the energy function defined by

\displaystyle E(x,h,y,\theta)=-b^{\top}x-c^{\top}h-d^{\top}e(y)-x^{\top}{W}h-h^{\top}{U}e(y)

with e(y)=(0,\ldots,0,\overset{(y)}{1},0,\ldots,0)^{\top} for y\in{\mathcal{Y}} and a parameter \theta=(b,c,d,{W},{U}). Here Z_{\theta} is the partition function defined by

\displaystyle Z_{\theta}=\sum_{(x,h,y)\in{\mathcal{X}}\times{\mathcal{H}}\times{\mathcal{Y}}}\exp\{-E(x,h,y,\theta)\}.

The marginal distribution is given by

\displaystyle p(x,y,\theta)=\frac{1}{Z_{\theta}}\sum_{h\in{\mathcal{H}}}\exp\{-E(x,h,y,\theta)\}.

Similarly, for a given dataset {(xi,yi)}i=1,,N\{(x_{i},y_{i})\}_{i=1,...,N}, the GM-loss function for θ\theta is defined by

\displaystyle L_{\rm GM}(\theta)=\frac{1}{N}\sum_{i=1}^{N}{\exp\{\tilde{E}(x_{i},y_{i},\theta)-\bar{E}(\theta)\}},

where

E~(x,y,θ)=loghexp{E(x,h,y,θ)}\displaystyle\tilde{E}(x,y,\theta)=-\log\sum_{h\in{\mathcal{H}}}\exp\{-E(x,h,y,\theta)\}

and

E¯(θ)=1m(x,y)𝒳×𝒴E~(x,y,θ).\displaystyle\bar{E}(\theta)={\frac{1}{m^{\prime}}}\sum_{(x,y)\in{\mathcal{X}}\times{\mathcal{Y}}}\tilde{E}(x,y,\theta).

where m^{\prime} is the cardinality of {\mathcal{X}}\times{\mathcal{Y}}. Accordingly, we can apply the GM-loss function to the Boltzmann distribution with supervised outcomes, and to that with multiple hidden variables. In summary, the GM divergence and the GM-estimator have advantageous properties over the KL divergence and the ML-estimator in the theoretical formulation, and a numerical example shows this advantage in a small-scale experiment. However, we have not yet conducted sufficient experiments and practical applications to confirm the advantage; further investigation is needed to compare the GM method with the current methods developed for the deep belief network [99].

4.2 Multiclass AdaBoost

AdaBoost belongs to the family of ensemble learning algorithms that combine the decisions of multiple base learners, or weak learners, to produce a strong learner. The core premise is that a group of "weak" models can be combined to form a "strong" model. AdaBoost [30] and its variants have found applications across various domains, including bioinformatics and statistical ecology, where they help in creating robust predictive models from noisy or incomplete data. AdaBoost has also been extended to handle multiclass classification problems.

An example is Multiclass AdaBoost or AdaBoost.M1, an extension adapting the algorithm to handle more than two classes. There are also other variants like SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function) which further extends AdaBoost to multiclass scenarios [38]. Random forests and gradient boosting machines (GBM) can be mentioned as popular and efficient methods of ensemble learning [5, 32]. Random forests exhibit a good balance between bias and variance, primarily due to the averaging of uncorrelated decision trees. GBMs are highly flexible and can be used with various loss functions and types of weak learners, though trees are standard. AdaBoost excels in situations where bias reduction is crucial, while Random Forests are a robust, all-rounder solution. GBMs offer high flexibility and can achieve superior performance with careful tuning, especially in complex datasets. Interestingly, there is a connection between AdaBoost and information geometry [69]. The process of re-weighting the data points can be seen as a form of geometrically moving along the manifold of probability distributions over the data. This geometric interpretation might tie back to concepts of divergence and entropy, which are core to information geometry. We focus on discussing various loss functions that are derived from the class of power divergence.

We discuss the setting of a binary class label. Let X be a covariate with a value in a subset {\mathcal{X}} of \mathbb{R}^{d}, and Y be an outcome with a value in {\mathcal{Y}}=\{-1,1\}. Let f(x) be a predictor such that the prediction rule is given by h(x)={\rm sign}(f(x)). The exponential loss was proposed for AdaBoost [30, 85]. One of its most characteristic points is that the optimization is conducted in the function space spanned by a set of weak classifiers. The exponential loss functional plays a central role as a key ingredient, defined on a space of predictors as

Lexp(f)=1ni=1nexp{Yif(Xi)},\displaystyle L_{\exp}(f)=\frac{1}{n}\sum_{i=1}^{n}\exp\{-Y_{i}f(X_{i})\},

where ff is a predictor on 𝒳{\mathcal{X}}. If we take expectation under a conditional distribution

\displaystyle p_{0}(y|x)=\frac{\exp\{yf_{0}(x)\}}{\exp\{f_{0}(x)\}+\exp\{-f_{0}(x)\}},

then the expected exponential loss is

𝔼[Lexp(f)]=exp{f(x)f0(x)}+exp{f(x)+f0(x)}exp{f0(x)}+exp{f0(x)},\displaystyle\mathbb{E}[L_{\exp}(f)]=\frac{\exp\{f(x)-f_{0}(x)\}+\exp\{-f(x)+f_{0}(x)\}}{\exp\{f_{0}(x)\}+\exp\{-f_{0}(x)\}},

which is greater than or equal to 2/(\exp\{f_{0}(x)\}+\exp\{-f_{0}(x)\}). The equality holds if and only if f=f_{0}. This implies that the minimizer of the expected exponential loss is equal to the true predictor f_{0}, namely,

f0=argminf𝔼[Lexp(f)].\displaystyle f_{0}=\mathop{\rm argmin}_{f\in{\cal F}}\mathbb{E}[L_{\exp}(f)].

The functional optimization is practically implemented by a simple algorithm. The stagewise learning algorithm is given as follows (Freund & Schapire, 1995):

(1). Provide J:={hj:𝒳{1,1};jJ}{\mathcal{H}}_{J}:=\{h_{j}:{\mathcal{X}}\rightarrow\{-1,1\};j\in J\}. Set as w0,i=1nw_{0,i}=\frac{1}{n} and h0(x)=0h_{0}(x)=0.

(2). For step t=1,,Tt=1,...,T

(2.a). ht=argminhJErrt(h)\displaystyle h_{t}=\mathop{\rm argmin}_{h\in{\mathcal{H}}_{J}}{\rm Err}_{t}(h), where Errt(h)=i=1nwt1,i𝕀(h(Xi)Yi)\displaystyle{\rm Err}_{t}(h)=\sum_{i=1}^{n}w_{t-1,i}{\mathbb{I}}(h(X_{i})\neq Y_{i}).

(2.b). \alpha_{t}=\mathop{\rm argmin}_{\alpha\in\mathbb{R}}L_{\exp}(f_{t-1}+\alpha h_{t}), where f_{t-1}(x)=\sum_{j=1}^{t-1}\alpha_{j}h_{j}(x).

(2.c). wt,i=wt1,iexp{αtYiht(Xi)}.w_{t,i}=w_{t-1,i}\exp\{-\alpha_{t}Y_{i}h_{t}(X_{i})\}.

(3). Set h_{T}(x)={\rm sign}\Big{(}\sum_{t=1}^{T}\alpha_{t}h_{t}(x)\Big{)}.
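The steps above can be sketched in a few lines, assuming a user-supplied finite family `stumps` of weak classifiers h: {\mathcal{X}}\rightarrow\{-1,1\} (the names `adaboost` and `stumps` are ours); substep (2.b) is solved by the closed-form half log-odds \alpha_{t}=\frac{1}{2}\log\frac{1-{\rm err}_{t}}{{\rm err}_{t}}:

```python
import numpy as np

def adaboost(X, y, stumps, T=10):
    # stagewise steps (1)-(3): each stump is h: X -> {-1, +1};
    # alpha_t is the half log-odds of the weighted error rate err_t
    n = len(y)
    w = np.full(n, 1.0 / n)
    alphas, chosen = [], []
    for _ in range(T):
        errs = [np.sum(w * (h(X) != y)) / np.sum(w) for h in stumps]
        j = int(np.argmin(errs))
        err = min(max(errs[j], 1e-12), 1.0 - 1e-12)   # keep log-odds finite
        alpha = 0.5 * np.log((1.0 - err) / err)
        w = w * np.exp(-alpha * y * stumps[j](X))     # substep (2.c)
        alphas.append(alpha)
        chosen.append(j)
    def predict(Xnew):
        f = sum(a * stumps[j](Xnew) for a, j in zip(alphas, chosen))
        return np.sign(f)
    return predict
```

The error is clipped away from 0 and 1 to keep the log-odds finite when a stump classifies the weighted sample perfectly.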

Note that substep (2.b) admits a closed form: the half log-odds of the error rate,

αt=12log1errt(ht)errt(ht),\displaystyle\alpha_{t}=\small\mbox{$\frac{1}{2}$}\log\frac{1-{\rm err}_{t}(h_{t})}{{\rm err}_{t}(h_{t})},

where {\rm err}_{t}(h)={\rm Err}_{t}(h)/\sum_{i=1}^{n}w_{t-1,i}. The algorithm has an elegant and simple form, in which the mathematical operations are defined only by the elementary functions \exp and \log. By contrast, the iteratively reweighted least squares algorithm requires a matrix inversion even for a linear logistic model. Let us apply the \gamma-loss functional to the boosting algorithm; cf. Chapter 2 for the general form of the \gamma-loss. First of all, confirm

p(γ)(y|f(x))=exp{(γ+1)yf(x)}exp{(γ+1)f(x)}+exp{(γ+1)f(x)}\displaystyle p^{(\gamma)}(y|f(x))=\frac{\exp\{(\gamma+1)yf(x)\}}{\exp\{(\gamma+1)f(x)\}+\exp\{-(\gamma+1)f(x)\}}

as the γ\gamma-expression. Hence, the γ\gamma-loss functional is written by

Lγ(f)=i=1n{p(γ)(Yi|f(Xi))}γγ+1.\displaystyle L_{\gamma}(f)=\sum_{i=1}^{n}\big{\{}p^{(\gamma)}(Y_{i}|f(X_{i}))\big{\}}^{\frac{\gamma}{\gamma+1}}.

Let us discuss the gradient boosting algorithm based on the \gamma-loss functional. The stagewise learning algorithm f_{t+1}=f_{t}+\alpha^{*}f^{*} for t=0,1,\ldots,T is given as follows:

(α,f)=argmin(α,f)×Lγ(ft+αf),\displaystyle(\alpha^{*},f^{*})=\mathop{\rm argmin}_{(\alpha,f)\in\mathbb{R}\times{\mathcal{F}}}L_{\gamma}(f_{t}+\alpha f),

where f_{0} is an initial guess and T is determined by an appropriate stopping rule. However, the joint minimization is computationally expensive. For this reason, we use the gradient

Lγ(ft)=αLγ(ft+αf)|α=0,\displaystyle\nabla L_{\gamma}(f_{t})=\partial_{\alpha}L_{\gamma}(f_{t}+\alpha f)\Big{|}_{\alpha=0},

which is written as

i=1nπγ(Yi,ft(Xi))𝕀(Yisign(f(Xi)))+Ct,\displaystyle\sum_{i=1}^{n}\pi_{\gamma}(Y_{i},f_{t}(X_{i})){\mathbb{I}}(Y_{i}\neq{\rm sign}(f(X_{i})))+C_{t},

where CtC_{t} is a constant in ff, and

πγ(Yi,ft(Xi))=p(γ)(+1|f(Xi))p(γ)(1|f(Xi)){p(γ)(Yi|f(Xi))}1γ+1.\displaystyle\pi_{\gamma}(Y_{i},f_{t}(X_{i}))=\frac{p^{(\gamma)}(+1|f(X_{i}))p^{(\gamma)}(-1|f(X_{i}))}{\big{\{}p^{(\gamma)}(Y_{i}|f(X_{i}))\big{\}}^{\frac{1}{\gamma+1}}}.

Accordingly, the \gamma-boosting algorithm parallels AdaBoost as follows:

(1). Provide J:={hj:𝒳{1,1};jJ}{\mathcal{H}}_{J}:=\{h_{j}:{\mathcal{X}}\rightarrow\{-1,1\};j\in J\}. Set as h0(x)=0h_{0}(x)=0.

(2). For step t=1,,Tt=1,...,T

(2.a). ht+1=argminhJErrt(h)\displaystyle h_{t+1}=\mathop{\rm argmin}_{h\in{\mathcal{H}}_{J}}{\rm Err}_{t}(h), where Errt(h)=i=1nπγ(Yi,ft(Xi))𝕀(h(Xi)Yi)\displaystyle{\rm Err}_{t}(h)=\sum_{i=1}^{n}\pi_{\gamma}(Y_{i},f_{t}(X_{i})){\mathbb{I}}(h(X_{i})\neq Y_{i}).

(2.b). αt+1=argminαLγ(ft+αht+1)\displaystyle\alpha_{t+1}=\mathop{\rm argmin}_{\alpha\in\mathbb{R}}L_{\gamma}(f_{t}+\alpha h_{t+1}), where ft(x)=j=1tαjhj(x)\displaystyle f_{t}(x)=\sum_{j=1}^{t}\alpha_{j}h_{j}(x).

(2.c). Errt+1(h)=i=1nπγ(Yi,ft(Xi)+αt+1ht+1(Xi))𝕀(h(Xi)Yi)\displaystyle{\rm Err}_{t+1}(h)=\sum_{i=1}^{n}\pi_{\gamma}(Y_{i},f_{t}(X_{i})+\alpha_{t+1}h_{t+1}(X_{i})){\mathbb{I}}(h(X_{i})\neq Y_{i})

(3). Set h_{T}(x)={\rm sign}\Big{(}\sum_{t=1}^{T}\alpha_{t}h_{t}(x)\Big{)}.
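The weight \pi_{\gamma}(y,f(x)) driving the reweighting above can be evaluated directly from the \gamma-expression p^{(\gamma)}; a small sketch for the binary case (assuming \gamma\neq-1; function names are ours):

```python
import numpy as np

def p_gamma(y, f, gamma):
    # gamma-expression p^(gamma)(y | f) for binary y in {-1, +1}
    a = (gamma + 1.0) * f
    return np.exp(y * a) / (np.exp(a) + np.exp(-a))

def pi_gamma(y, f, gamma):
    # boosting weight pi_gamma(y, f(x)) appearing in the gradient
    num = p_gamma(1.0, f, gamma) * p_gamma(-1.0, f, gamma)
    return num / p_gamma(y, f, gamma) ** (1.0 / (gamma + 1.0))
```

At f = 0 both classes have probability 1/2, so \pi_{\gamma}=(1/4)/(1/2)^{1/(\gamma+1)}; for \gamma=1 this equals 1/(2\sqrt{2}).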

The GM-loss functional, L_{\rm GM}(f), and the HM-loss functional, L_{\rm HM}(f), are derived by setting \gamma to -1 and -2, respectively. It can be observed that the exponential loss functional is nothing but the GM-loss functional. We now consider the situation of an imbalanced sample, where \pi_{-1}\gg\pi_{1} for the probability \pi_{y} of Y=y. We adopt the adjusted exponential (GM) loss functional in (2.32) as

Lexp(w)(f)=1ni=1nπ1Yiexp{πYiYif(Xi)}.\displaystyle L_{\exp}^{(\rm w)}(f)=\frac{1}{n}\sum_{i=1}^{n}\pi_{1-Y_{i}}\exp\{-\pi_{Y_{i}}Y_{i}f(X_{i})\}.

The learning algorithm is given by replacing substeps (2.b) and (2.c) to

(2.b). αt=argminαLexp(w)(ft1+αht)\displaystyle\alpha_{t}=\mathop{\rm argmin}_{\alpha\in\mathbb{R}}L_{\exp}^{\rm(w)}(f_{t-1}+\alpha h_{t}), where ft1(x)=j=1t1αjhj(x)\displaystyle f_{t-1}(x)=\sum_{j=1}^{t-1}\alpha_{j}h_{j}(x).

(2.c). wt,i=wt1,iexp{πYiYiαtht(Xi)}.w_{t,i}=w_{t-1,i}\exp\{-\pi_{Y_{i}}Y_{i}\alpha_{t}h_{t}(X_{i})\}.

We observe in (2.b):

Lexp(w)(ft1+αh)eπ1αerrt1+eπ1α(1errt1)+eπ0αerrt0+eπ0α(1errt0),\displaystyle L_{\exp}^{(\rm w)}(f_{t-1}+\alpha h)\propto e^{\pi_{1}\alpha}{\rm err}_{t1}+e^{-\pi_{1}\alpha}(1-{\rm err}_{t1})+e^{\pi_{0}\alpha}{\rm err}_{t0}+e^{-\pi_{0}\alpha}(1-{\rm err}_{t0}),

where

\displaystyle{\rm err}_{ty}=\sum_{i=1}^{n}\pi_{1-y}w_{t-1,i}{\mathbb{I}}(Y_{i}=y,Y_{i}\neq h(X_{i}))\Big{/}\sum_{i=1}^{n}w_{t-1,i}.

We discuss a setting of a multiclass label. Let XX be a feature vector in a subset 𝒳\mathcal{X} of d\mathbb{R}^{d} and YY be a label in 𝒴={1,,k}{\mathcal{Y}}=\{1,...,k\}. The major objective is to predict YY given X=xX=x, in which there are spaces of classifiers and predictors, namely, ={h:𝒳𝒴}{\cal H}=\{h:{\mathcal{X}}\rightarrow{\mathcal{Y}}\} and

={f(x)=(f1(x),,fk(x))k:y=1kfy(x)=0}.\displaystyle{\mathcal{F}}=\{f(x)=(f_{1}(x),...,f_{k}(x))\in\mathbb{R}^{k}:\sum_{y=1}^{k}f_{y}(x)=0\}.

A classifier h(x)h(x) is introduced by a predictor f(x)f(x) as

hf(x)=argmaxy𝒴fy(x);\displaystyle h_{f}(x)=\mathop{\rm argmax}_{y\in{\mathcal{Y}}}f_{y}(x);

a predictor f_{h}(x) is introduced by a classifier h(x) as

fh(x)=(𝕀(h(x)=y)1k)y𝒴.\displaystyle f_{h}(x)=\Big{(}{\mathbb{I}}(h(x)=y)-\frac{1}{k}\Big{)}_{y\in{\mathcal{Y}}}. (4.8)

Note that ={hf:f}{\mathcal{H}}=\{h_{f}:f\in{\mathcal{F}}\}; while {fh:h}\{f_{h}:h\in{\mathcal{H}}\} is a subset of \mathcal{F}. In the learning algorithm discussed below, the predictor is updated by the linear span of predictors embedded by selected classifiers in a sequential manner. The conditional probability mass function (pmf) of YY given X=xX=x is assumed as a soft-max function

p(y|f(x))=exp{fy(x)}j𝒴exp{fj(x)}\displaystyle p(y|f(x))=\frac{\exp\{f_{y}(x)\}}{\sum_{j\in{\mathcal{Y}}}\exp\{f_{j}(x)\}}

where f(x) is a predictor of \mathcal{F}. We notice that p(y|f(x)) and f_{y}(x) are one-to-one as functions of y. Indeed, they are connected as f_{y}(x)=\log p(y|f(x))-\frac{1}{k}\sum_{j=1}^{k}\log p(j|f(x)). We note that this assumption is in the framework of the GLM, as in the conditional pmf (2.44) with a different parametrization discussed in Section 2.6, if f(x) is a linear predictor. However, the formulation here is nonparametric, in which the model is written as {\mathcal{M}}=\{p(y|f(x)):f\in{\mathcal{F}}\}. Similarly, the \gamma-loss functional for f is

Lγ(f)=i=1n{exp{(γ+1)fYi(Xi)}y𝒴exp{(γ+1)fy(Xi)}}γγ+1.\displaystyle L_{\gamma}(f)=\sum_{i=1}^{n}\Big{\{}\frac{\exp\{(\gamma+1)f_{Y_{i}}(X_{i})\}}{\sum_{y\in{\mathcal{Y}}}\exp\{(\gamma+1)f_{y}(X_{i})\}}\Big{\}}^{\frac{\gamma}{\gamma+1}}.
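As a quick numerical sketch, the multiclass \gamma-loss functional can be evaluated as follows, with labels coded 0, …, k−1 and a numerically stabilized soft-max (the function name is ours):

```python
import numpy as np

def gamma_loss(F, y, gamma):
    # L_gamma(f) = sum_i { softmax((gamma+1) f(X_i))_{Y_i} }^{gamma/(gamma+1)}
    # F: (n, k) predictor matrix with rows summing to zero; y in {0,...,k-1}
    A = (gamma + 1.0) * F
    P = np.exp(A - A.max(axis=1, keepdims=True))   # stabilized soft-max
    P = P / P.sum(axis=1, keepdims=True)
    return float(np.sum(P[np.arange(len(y)), y] ** (gamma / (gamma + 1.0))))
```

For the zero predictor each class has probability 1/k, so every term equals (1/k)^{\gamma/(\gamma+1)}, which gives a quick check.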

The minimum of the expected \gamma-loss functional in f is attained at f=f^{(0)} when the expectation is taken under the conditional distribution

p0(y|x)=exp{fy(0)(x)}j=1kexp{fj(0)(x)}.\displaystyle p_{0}(y|x)=\frac{\exp\{f_{y}^{(0)}(x)\}}{\sum_{j=1}^{k}\exp\{f_{j}^{(0)}(x)\}}.

Thus, we conclude that the minimizer of the expected γ\gamma-loss in ff is equal to the true predictor f(0)f^{(0)}, namely,

f(0)=argminf𝔼[Lγ(f)].\displaystyle f^{(0)}=\mathop{\rm argmin}_{f\in{\cal F}}\mathbb{E}[L_{\gamma}(f)].

Thus, the minimization of the \gamma-loss functional on the predictor space {\cal F} yields nonparametric consistency. Similarly, the stagewise learning algorithm f_{t}=f_{t-1}+\alpha^{*}f_{h^{*}} for t=1,\ldots,T is given as follows:

(α,h)=argmin(α,h)×Lγ(ft1+αfh),\displaystyle(\alpha^{*},h^{*})=\mathop{\rm argmin}_{(\alpha,h)\in\mathbb{R}\times{\mathcal{H}}}L_{\gamma}(f_{t-1}+\alpha f_{h}),

where f0f_{0} is an initial guess and fhf_{h} is defined in (4.8). For any fixed α\alpha, we observe

Lγ(ft1+αfh)=i=1nwt1,i𝕀(Yih(Xi))+Ct1,\displaystyle L_{\gamma}(f_{t-1}+\alpha f_{h})=\sum_{i=1}^{n}w_{t-1,i}{\mathbb{I}}(Y_{i}\neq h(X_{i}))+C_{t-1},

where

\displaystyle w_{t-1,i}=\Big{\{}\frac{\exp\{(\gamma+1)f_{Y_{i}}(X_{i})\}}{\sum_{y\in{\mathcal{Y}}}\exp\{(\gamma+1)f_{y}(X_{i})\}}\Big{\}}^{\frac{\gamma}{\gamma+1}}. The resulting algorithm is as follows.

(1). Provide J:={hj:𝒳{1,,k};jJ}{\mathcal{H}}_{J}:=\{h_{j}:{\mathcal{X}}\rightarrow\{1,...,k\};j\in J\}. Set as w0,i=1nw_{0,i}=\frac{1}{n} and h0(x)=0h_{0}(x)=0.

(2). For step t=1,,Tt=1,...,T

(2.a). ht=argminhJErrt(h)\displaystyle h_{t}=\mathop{\rm argmin}_{h\in{\mathcal{H}}_{J}}{\rm Err}_{t}(h), where Errt(h)=i=1nwt1,i𝕀(h(Xi)Yi)\displaystyle{\rm Err}_{t}(h)=\sum_{i=1}^{n}w_{t-1,i}{\mathbb{I}}(h(X_{i})\neq Y_{i}).

(2.b). αt=argminαLγ(ft1+αfht)\displaystyle\alpha_{t}=\mathop{\rm argmin}_{\alpha\in\mathbb{R}}L_{\gamma}(f_{t-1}+\alpha f_{h_{t}}), where ft1(x)=j=1t1αjfhj(x)\displaystyle f_{t-1}(x)=\sum_{j=1}^{t-1}\alpha_{j}f_{h_{j}}(x) with the

embedded predictor fh(x)f_{h}(x) defined in (4.8).

(2.c). w_{t,i}=w_{t-1,i}\Big{\{}\frac{\exp\{(\gamma+1)\alpha_{t}f_{t,Y_{i}}(X_{i})\}}{\sum_{y\in{\mathcal{Y}}}\exp\{(\gamma+1)f_{t,y}(X_{i})\}}\Big{\}}^{\frac{\gamma}{\gamma+1}}.

(3). Set hT(x)=argmaxy𝒴fT(y,x).\displaystyle h_{T}(x)=\mathop{\rm argmax}_{y\in{\mathcal{Y}}}f_{T}(y,x).

The GM-loss functional is given by

LGM(f)=i=1nexp{fYi(Xi)}\displaystyle L_{\rm GM}(f)=\sum_{i=1}^{n}{\exp\{-f_{Y_{i}}(X_{i})\}}

due to the normalizing condition \sum_{y\in{\mathcal{Y}}}f_{y}(x)=0. This is essentially the same as the exponential loss [38], in which the class label y is coded similarly to (4.8). Thus, the equivalence of the GM-loss and the exponential loss also holds for multiclass classification. We can discuss the problem of imbalanced samples similarly to the binary case. Let \pi_{y}=P(Y=y) and

πyinv=1πyj=1k1πj.\displaystyle\pi_{y}^{\rm inv}=\frac{\frac{1}{\pi_{y}}}{\sum_{j=1}^{k}\frac{1}{\pi_{j}}}.

The adjusted exponential (GM) loss functional in (2.32) is then

Lexp(w)(f)=1ni=1nπYiinvexp{πYifYi(Xi)}.\displaystyle L_{\exp}^{(\rm w)}(f)=\frac{1}{n}\sum_{i=1}^{n}\pi_{Y_{i}}^{\rm inv}\exp\{-\pi_{Y_{i}}f_{Y_{i}}(X_{i})\}.
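A small sketch of the inverse-prior weights \pi_{y}^{\rm inv} and the adjusted loss above, with labels coded 0, …, k−1 (function names are ours):

```python
import numpy as np

def inverse_priors(pi):
    # pi_y^inv: normalized reciprocals of the class probabilities
    inv = 1.0 / np.asarray(pi, dtype=float)
    return inv / inv.sum()

def adjusted_exp_loss(F, y, pi):
    # L_exp^(w): each term is pi_{Y_i}^inv exp{-pi_{Y_i} f_{Y_i}(X_i)};
    # F is (n, k) with rows summing to zero, y holds labels 0,...,k-1
    pi = np.asarray(pi, dtype=float)
    pinv = inverse_priors(pi)
    fy = F[np.arange(len(y)), y]
    return float(np.mean(pinv[y] * np.exp(-pi[y] * fy)))
```

With balanced priors the weights are uniform and the adjusted loss reduces to a scaled version of the plain exponential loss.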

The learning algorithm is given by a minor change for substeps (2.b) and (2.c). The HM-loss functional is given by

LHM(f)=i=1n{exp{fYi(Xi)}y𝒴exp{fy(Xi)}}2.\displaystyle L_{\rm HM}(f)=\sum_{i=1}^{n}\Big{\{}\frac{\exp\{-f_{Y_{i}}(X_{i})\}}{\sum_{y\in{\mathcal{Y}}}\exp\{-f_{y}(X_{i})\}}\Big{\}}^{2}.

GBMs are highly flexible and can be adapted to various loss functions and types of weak learners, although decision trees are commonly used as the base learners. This flexibility is one of the key strengths of GBMs, allowing them to be tailored to a wide range of problems and data types. The loss functions discussed above can be applied to GBMs. They require careful tuning of several parameters (e.g., number of trees, learning rate, depth of trees), which can be time-consuming. This discussion primarily focuses on the minimum divergence principle from a theoretical perspective; in future projects, we aim to extend it to develop effective GBM applications for a wide range of datasets.

4.3 Active learning

Active learning is a subfield of machine learning that focuses on building efficient training datasets; see [86] for a comprehensive survey. Unlike traditional supervised learning, where all labels are provided upfront, active learning aims to select the most informative examples for labeling, thereby potentially reducing the number of labeled examples needed to achieve a certain level of performance; cf. [10] for an account of how statistical methods are integrated into active learning algorithms. Active learning is a fascinating area where statistical machine learning and information geometry intersect, offering deep insights into the learning process. One of the primary goals is to reduce the number of labeled instances required to train a model effectively. Annotation can be expensive, especially for tasks like medical image labeling, natural language processing, or any domain-specific task requiring expert knowledge. In scenarios where data collection is expensive or time-consuming, active learning aims to make the most of a small dataset. By focusing on ambiguous or difficult instances, active learning improves the model's performance faster than random sampling would. For these reasons, active learning has attracted attention in today's data-rich but label-scarce environments.

The query by committee (QBC) method is a popular method in active learning, alongside other approaches such as uncertainty sampling, expected model change, and Bayesian optimization; see [87] for the theoretical underpinnings of the QBC approach. In the QBC approach, on which we focus, a "committee" of models is trained on the current labeled dataset. When it comes to selecting the next data point to label, the committee "votes" on the labels of the unlabeled data points. The data point with the most disagreement among the committee members is then selected for labeling. The idea is that this point lies in a region of high uncertainty and would therefore provide the most information if labeled. From an information geometry perspective, one can consider the divergence or distance between the probability distributions predicted by each model in the committee for a given data point; the point that maximizes this divergence can be considered the most informative.

Let XX be a feature vector in a subset 𝒳\mathcal{X} of d\mathbb{R}^{d} and YY be a label in 𝒴={1,,k}{\mathcal{Y}}=\{1,...,k\}. The conditional probability mass function (pmf) of YY given X=xX=x is assumed as a soft-max function

p(y|ξ(x))=exp{ξy(x)}j𝒴exp{ξj(x)}\displaystyle p(y|\xi(x))=\frac{\exp\{\xi_{y}(x)\}}{\sum_{j\in{\mathcal{Y}}}\exp\{\xi_{j}(x)\}}

where ξ(x)\xi(x) is a predictor vector with components {ξy(x)}y=1k\{\xi_{y}(x)\}_{y=1}^{k} satisfying y𝒴ξy(x)=0.\sum_{y\in{\mathcal{Y}}}\xi_{y}(x)=0. The prediction is conducted by

\displaystyle h(x)=\mathop{\rm argmax}_{y\in{\mathcal{Y}}}\xi_{y}(x)

noting that p(y|\xi(x)) and \xi_{y}(x) are one-to-one as functions of y. In effect, they are connected as \xi_{y}(x)=\log p(y|\xi(x))-\frac{1}{k}\sum_{j=1}^{k}\log p(j|\xi(x)). We note that this assumption is in the framework of the GLM, as in the conditional pmf (2.44) with a different parametrization discussed in Section 2.6, if \xi_{y}(x) is a linear predictor.

We aim to design a sequential family of datasets {St}t=0T\{S_{t}\}_{t=0}^{T} such that the (t+1)(t+1)-th dataset is updated as

St+1={(Xt+1,Yt+1)}St\displaystyle S_{t+1}=\{(X_{t+1},Y_{t+1})\}\cup S_{t}

for t, 0\leq t\leq T-1, where S_{0} is an appropriately chosen initial dataset. Given S_{t}, we conduct an experiment to obtain (X_{t+1},Y_{t+1}), in which X_{t+1} is explored to improve the performance of the prediction of the label Y, and the outcome Y_{t+1} is sampled from the conditional distribution given X_{t+1}. Thus, active learning proposes an update pair (X_{t+1},Y_{t+1}) that strengthens the t-th prediction result in a universal manner. The key to active learning is an efficient method for obtaining a feature vector X_{t+1} that compensates for the weakness of the prediction based on S_{t}. For this, it is preferable that the distribution of Y given X_{t+1} be separated from that given (X_{1},\ldots,X_{t}). Here, let us take the QBC approach, in which an acquisition function plays a central role.

Assume that there are mm committee members or machines such that the ll-th member employs a predictor ξ(tl)y(x)\xi^{(tl)}_{y}(x) for a feature vector xx and a label yy based on the dataset StS_{t}, and thus the prediction for YY given xx is performed by argmaxy𝒴ξ(tl)y(x)\mathop{\rm argmax}_{y\in{\mathcal{Y}}}\xi^{(tl)}_{y}(x). We define an acquisition function on a feature vector xx of 𝒳\mathcal{X} as

A(t)(x)=l=1mwlD(P(|ξ(tl)(x)),P(|ξ^(t)(x)))\displaystyle A^{(t)}(x)=\sum_{l=1}^{m}w_{l}D(P(\cdot|\xi^{(tl)}(x)),P(\cdot|\hat{\xi}^{(t)}(x))) (4.9)

adopting a divergence measure DD, where ξ(tl)(x)\xi^{(tl)}(x) is the predictor vector learned by the ll-th member at stage tt, and ξ^(t)(x)\hat{\xi}^{(t)}(x) is the consensus predictor vector combining {ξ(tl)}l=1m\{\xi^{(tl)}\}_{l=1}^{m}. The consensus predictor is given by

ξ^0(t)(x)=argminξΞl=1mwlD(P(|ξ(tl)(x)),P(|ξ(x))),\displaystyle\hat{\xi}_{0}^{(t)}(x)=\mathop{\rm argmin}_{\xi\in\Xi}\sum_{l=1}^{m}w_{l}D(P(\cdot|\xi^{(tl)}(x)),P(\cdot|\xi(x))),

where Ξ\Xi is the set of all the predictor vectors. Such an optimization problem is discussed around Proposition 2 in association with the generalized mean [41]. Accordingly, the new feature vector is selected as

X(t+1)=argmaxx𝒳(t)A(t)(x),\displaystyle X^{(t+1)}=\mathop{\rm argmax}_{x\in{\mathcal{X}}^{(t)}}A^{(t)}(x), (4.10)

where 𝒳(t){\mathcal{X}}^{(t)} is a subset of possible candidates of 𝒳\mathcal{X} at stage tt.
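To make the selection rule (4.10) concrete, the following sketch implements the acquisition (4.9) with the KL-divergence over a small committee; the linear predictors, weights, and candidate pool are hypothetical illustrations, not the book's experimental setup:

```python
import numpy as np

def softmax(xi):
    e = np.exp(xi - xi.max())
    return e / e.sum()

def kl(p, q):
    """KL-divergence D_0(p, q) between two pmfs on the label set."""
    return float(np.sum(p * np.log(p / q)))

def next_feature(candidates, members, w):
    """A^(t)(x) = sum_l w_l D(P(.|xi_l(x)), P(.|consensus(x))); return the argmax over x."""
    scores = []
    for x in candidates:
        xis = np.array([m(x) for m in members])   # member predictor vectors at x
        consensus = softmax(w @ xis)              # KL consensus: weighted mean predictor
        scores.append(sum(wl * kl(softmax(xi), consensus)
                          for wl, xi in zip(w, xis)))
    return candidates[int(np.argmax(scores))]

# two hypothetical committee members with linear predictors on R^2 (k = 2 labels)
b1, b2 = np.array([1.0, 0.5]), np.array([-0.5, 1.0])
members = [lambda x: np.array([x @ b1, -(x @ b1)]),
           lambda x: np.array([x @ b2, -(x @ b2)])]
w = np.array([0.5, 0.5])
cands = [np.array([1.0, 0.0]), np.array([0.0, 0.0]), np.array([1.0, -1.0])]
x_next = next_feature(cands, members, w)   # picks the candidate the members dispute most
```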

The standard choice of DD for (4.9) is the KL-divergence D0D_{0} in (1.1), which yields the consensus distribution with the pmf

p^0(t)(y|x)=exp{l=1mwlξ(tl)y(x)}j=1kexp{l=1mwlξ(tl)j(x)},\displaystyle\hat{p}_{0}^{(t)}(y|x)=\frac{\exp\big{\{}\sum_{l=1}^{m}w_{l}\xi^{(tl)}_{y}(x)\big{\}}}{\sum_{j=1}^{k}\exp\big{\{}\sum_{l=1}^{m}w_{l}\xi^{(tl)}_{j}(x)\big{\}}},

or equivalently ξ^0(t)(x)=l=1mwlξ(tl)(x)\hat{\xi}_{0}^{(t)}(x)=\sum_{l=1}^{m}w_{l}\xi^{(tl)}(x) as the consensus predictor. Alternatively, we adopt the dual γ\gamma-divergence DγD^{*}_{\gamma} defined in (1.11) and thus,

A(t)γ(x)=l=1mwlDγ(P(|ξ(tl)(x)),P(|ξ^(t)(x));C)\displaystyle A^{(t)}_{\gamma}(x)=\sum_{l=1}^{m}w_{l}D_{\gamma}(P(\cdot|\xi^{(tl)}(x)),P(\cdot|\hat{\xi}^{(t)}(x));C)

where

Dγ(P(|ξ(tl)(x)),P(|ξ^(t)(x));C)=y𝒴{p(γ)(y|ξ^(t)(x))}1γ+1p(y|ξ(tl)(x))γ.\displaystyle D_{\gamma}(P(\cdot|\xi^{(tl)}(x)),P(\cdot|\hat{\xi}^{(t)}(x));C)=\sum_{y\in{\mathcal{Y}}}\{p^{(\gamma)}(y|\hat{\xi}^{(t)}(x))\}^{\frac{1}{\gamma+1}}p(y|\xi^{(tl)}(x))^{\gamma}.

Here, p(γ)(y|ξ(x))p^{(\gamma)}(y|\xi(x)) is the γ\gamma-expression defined in (1.12), or

p(γ)(y|ξ(x))=exp{(γ+1)ξy(x)}j=1kexp{(γ+1)ξj(x)}.\displaystyle p^{(\gamma)}(y|\xi(x))=\frac{\exp\{(\gamma+1)\xi_{y}(x)\}}{\sum_{j=1}^{k}\exp\{(\gamma+1)\xi_{j}(x)\}}.

This yields

p^γ(t)(y|x)=[l=1mwlexp{γξ(tl)y(x)}]1γj=1k[l=1mwlexp{γξ(tl)j(x)}]1γ.\displaystyle\hat{p}_{\gamma}^{(t)}(y|x)=\frac{\Big{[}\sum_{l=1}^{m}w_{l}\exp\{\gamma\xi^{(tl)}_{y}(x)\}\Big{]}^{\frac{1}{\gamma}}}{\sum_{j=1}^{k}\Big{[}\sum_{l=1}^{m}w_{l}\exp\{\gamma\xi^{(tl)}_{j}(x)\}\Big{]}^{\frac{1}{\gamma}}}.

as the pmf of the consensus distribution and

ξ^γ,y(t)(x)=1γlog[l=1mwlexp{γξy(tl)(x)}]\displaystyle\hat{\xi}^{(t)}_{\gamma,y}(x)=\frac{1}{\gamma}\log\Big{[}\sum_{l=1}^{m}w_{l}\exp\{\gamma\xi_{y}^{(tl)}(x)\}\Big{]}

as the consensus predictor, up to a constant in yy. We note that the consensus predictor ξ^γ,y(t)(x)\hat{\xi}^{(t)}_{\gamma,y}(x) has the form of a log-sum-exp mean. This has the extreme forms

limγξ^γ(t)(x)=min1lmξ(tl)(x) and limγξ^γ(t)(x)=max1lmξ(tl)(x).\displaystyle\lim_{\gamma\rightarrow-\infty}\hat{\xi}_{\gamma}^{(t)}(x)=\min_{1\leq l\leq m}\xi^{(tl)}(x)\quad\text{ and }\quad\lim_{\gamma\rightarrow\infty}\hat{\xi}_{\gamma}^{(t)}(x)=\max_{1\leq l\leq m}\xi^{(tl)}(x).
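A minimal numerical sketch of the log-sum-exp consensus and its extreme forms, with hypothetical member predictors and weights:

```python
import numpy as np

def gamma_consensus(xis, w, gamma):
    """Log-sum-exp mean: xi_hat_y = (1/gamma) log( sum_l w_l exp(gamma xi_{l,y}) )."""
    xis, w = np.asarray(xis), np.asarray(w)[:, None]
    return np.log((w * np.exp(gamma * xis)).sum(axis=0)) / gamma

xis = [[1.0, -1.0], [-2.0, 2.0]]            # m = 2 committee members, k = 2 labels
w = [0.5, 0.5]

# gamma -> 0 recovers the KL consensus, i.e. the weighted mean of the predictors
assert np.allclose(gamma_consensus(xis, w, 1e-8), [-0.5, 0.5])
# gamma -> +inf / -inf approach the componentwise max / min over the members
assert np.allclose(gamma_consensus(xis, w, 100.0), np.max(xis, axis=0), atol=0.02)
assert np.allclose(gamma_consensus(xis, w, -100.0), np.min(xis, axis=0), atol=0.02)
```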

Let us look at the decision boundaries of the consensus predictors ξ^0(x)\hat{\xi}_{0}(x) and ξ^γ(x)\hat{\xi}_{\gamma}(x) combining two linear predictors in a two-dimensional space; see Figure 4.2.

Figure 4.2: Plots of decision boundaries of the dual KL and dual γ\gamma-divergence measures.

If all the committee machines have linear predictors, then the KL (γ=0\gamma=0) consensus predictor is still a linear predictor, but the γ\gamma-consensus predictor is nonlinear according to the value of γ0\gamma\neq 0, as in Figure 4.2. Hence, we can explore the nonlinearity at every stage by learning an appropriate value of γ\gamma. Needless to say, the objective is to find a good feature vector Xt+1X_{t+1} in (4.10), and hence we have to pay attention to the learning procedure conducted by the minimax process

maxx𝒳minξΞ{l=1mwlDγ(P(|ξ(tl)(x)),P(|ξ(x)))}.\displaystyle\max_{x\in{\mathcal{X}}}\min_{\xi\in\Xi}\Big{\{}\sum_{l=1}^{m}w_{l}D_{\gamma}(P(\cdot|\xi^{(tl)}(x)),P(\cdot|\xi(x)))\Big{\}}.

It is possible to monitor the minimax value at each stage, which evaluates the learning performance. In effect, the minimax game of the cross entropy between Nature and a decision maker is nicely discussed in [36]. The minimaxity is solved in a zero-sum game: Nature wants to maximize the cross entropy under a constraint with a fixed expectation; the decision maker wants to minimize it over the full space. However, our minimax process does not follow straightforwardly from this observation. Further discussion is necessary to propose a selection of the optimal value of γ\gamma based on StS_{t}.

4.4 The γ\gamma-cosine similarity

The γ\gamma-divergence is defined on the probability measures dominated by a reference measure. We have studied statistical applications focusing on regression and classification. We now show that the γ\gamma-divergence can be extended to the Lebesgue Lp{\rm L}_{p}-space, p(Λ)={f(x):fp<}{\cal L}_{p}(\Lambda)=\{f(x):\|f\|_{p}<\infty\} for an exponent pp, 1p1\leq p\leq\infty, where the Lp{\rm L}_{p}-norm is defined by

fp=(|f(x)|pdΛ(x))1p,\displaystyle\|f\|_{p}=\Big{(}\int|f(x)|^{p}{\rm d}\Lambda(x)\Big{)}^{\frac{1}{p}}, (4.11)

where Λ\Lambda is a σ\sigma-finite measure. There is a challenge for the extension: a function f(x)f(x) can take a negative value. The usual power transformation f(x)γf(x)^{\gamma} poses a problem when f(x)<0f(x)<0, since raising a negative number to a fractional power can lead to complex values, which would not be meaningful in this context. For this, we introduce a sign-preserved power transformation as

f(x)γ=sign(f(x))|f(x)|γ.\displaystyle f(x)^{\ominus\gamma}={\rm sign}(f(x))|f(x)|^{\gamma}.
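A minimal vectorized sketch of this transformation, applied pointwise:

```python
import numpy as np

def spow(x, gamma):
    """Sign-preserved power: sign(x) * |x|**gamma, elementwise."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.abs(x) ** gamma

# a plain fractional power of a negative number would be complex or NaN
assert np.isclose(spow(-8.0, 1.0 / 3.0), -2.0)
assert np.allclose(spow([4.0, -4.0], 0.5), [2.0, -2.0])
```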

The log γ\gamma-divergence (1.21) is extended as

Δγ(f,g;Λ)=1γlog|f(x)fp(g(x)gp)γdΛ(x)|\displaystyle\Delta_{\gamma}(f,g;\Lambda)=-\frac{1}{\gamma}\log\bigg{|}\int\frac{f(x)}{\|f\|_{p}}\Big{(}\frac{g(x)}{\|g\|_{p}}\Big{)}^{\ominus\gamma}{\rm d}\Lambda(x)\bigg{|}

for ff and gg of p(Λ){\cal L}_{p}(\Lambda), where p=γ+1p=\gamma+1. This still satisfies the scale invariance. There are potential developments for utilizing Δγ(f,g;Λ)\Delta_{\gamma}(f,g;\Lambda) in statistics and machine learning, by which signed functions such as predictor functions or functional data can be directly evaluated. In particular, we explore this idea in the context of cosine similarity.

Cosine similarity is a measure used to determine the similarity between two non-zero vectors in an inner product space, which includes Hilbert spaces. This measure is particularly important in many applications, such as information retrieval, text analysis, and pattern recognition. In a Hilbert space, which is a complete inner product space, cosine similarity can be defined in a way that generalizes the concept from Euclidean spaces:

cos(f,g)=ff,gg,\displaystyle\cos(f,g)=\Big{\langle}\frac{f}{\|f\|},\frac{g}{\|g\|}\Big{\rangle}, (4.12)

where f=f,f\|f\|=\sqrt{\langle f,f\rangle} and ,\langle\ ,\ \rangle is the inner product, namely, f,g=f(x)g(x)dΛ(x)\langle f,g\rangle=\int f(x)g(x){\rm d}\Lambda(x). Thus, in the Hilbert space, (Λ){\cal H}(\Lambda), cos(τf,σg)=cos(f,g)\cos(\tau f,\sigma g)=\cos(f,g) for any scalars τ\tau and σ\sigma. The Cauchy–Schwarz inequality yields |cos(f,g)|1|\cos(f,g)|\leq 1, and |cos(f,g)|=1|\cos(f,g)|=1 if and only if there exists a scalar σ\sigma such that g(x)=σf(x)g(x)=\sigma f(x) for almost every xx.

We extend the cosine measure (4.12) to the Lp-space by analogy with the extension of the log γ\gamma-divergence; see [59, 11] for foundations of functional analysis. For this, we observe that the Hölder inequality implies |H(u,v)|1|{\rm H}(u,v)|\leq 1, where

H(u,v)=uup,vvq,\displaystyle{\rm H}(u,v)=\Big{\langle}\frac{u}{\|u\|_{p}},\frac{v}{\|v\|_{q}}\Big{\rangle}, (4.13)

for upu\in{\cal L}_{p} and vqv\in{\cal L}_{q}, where qq is the conjugate exponent to pp satisfying 1p+1q=1\frac{1}{p}+\frac{1}{q}=1. The dual space (the Banach space of all continuous linear functionals) of the Lp-space for 1<p<1<p<\infty has a natural isomorphism with the Lq-space. The isomorphism associates vv with the functional ιp(v)p(Λ)\iota_{p}(v)\in{\cal L}_{p}(\Lambda)^{*} defined by uιp(v)(u)=uvdΛu\mapsto\iota_{p}(v)(u)=\int uv{\rm d}\Lambda. Thus, the Hölder inequality guarantees that ιp(v)(u)\iota_{p}(v)(u) is well defined and continuous, and hence q(Λ){\cal L}_{q}(\Lambda) is said to be the continuous dual space of p(Λ){\cal L}_{p}(\Lambda). At first glance, H(u,v){\rm H}(u,v) appears to be a surrogate for cos(f,g)\cos(f,g). However, the domain of cos\cos is (Λ)×(Λ){\cal H}(\Lambda)\times{\cal H}(\Lambda), while that of H\rm H is p(Λ)×q(Λ){\cal L}_{p}(\Lambda)\times{\cal L}_{q}(\Lambda). Further, |cos(f,g)|=1|\cos(f,g)|=1 means fgf\propto g, whereas |H(u,v)|=1|{\rm H}(u,v)|=1 means |u|p|v|q|u|^{p}\propto|v|^{q}. Thus, the functional H(u,v){\rm H}(u,v) has properties that make it inappropriate as a cosine functional for measuring an angle between vectors in a function space. For this reason, consider a transform κp\kappa_{p} from p{\cal L}_{p} to q{\cal L}_{q} defined by κp(v)=vpq\kappa_{p}(v)=v^{\ominus\frac{p}{q}}, noting |κp(v)|q=|v|p|\kappa_{p}(v)|^{q}=|v|^{p}. Then, we can define H(u,κp(v)){\rm H}(u,\kappa_{p}(v)) for uu and vv in p(Λ).{\cal L}_{p}(\Lambda). Consequently, we define a cosine measure on p{\cal L}_{p} as

cosγ(f,g)=ffp,(ggp)pq\displaystyle\cos_{\gamma}(f,g)=\Big{\langle}\frac{f\ }{\|f\|_{p}},\Big{(}\frac{g\ }{\|g\|_{p}}\Big{)}^{\ominus\frac{p}{q}}\Big{\rangle} (4.14)

linking γ\gamma via p=γ+1p=\gamma+1, where qq is the conjugate exponent to pp. A close connection with the log γ\gamma-divergence is noted as

|cosγ(f,g)|=exp{γΔγ(f,g)}.\displaystyle|\cos_{\gamma}(f,g)|=\exp\{-\gamma\Delta_{\gamma}(f,g)\}.

This implies cosγ(f,g)=0Δγ(f,g)=\cos_{\gamma}(f,g)=0\ \Longleftrightarrow\ \Delta_{\gamma}(f,g)=\infty, in which both express quantities when ff and gg are the most distinct. In this formulation, cosγ(f,g)\cos_{\gamma}(f,g), called the γ\gamma-cosine, ensures mathematical consistency across all real values of g(x)g(x), which is vital for the measure’s applicability in a wide range of contexts. Note that, if p=2p=2, then cosγ(f,g)=cos(f,g)\cos_{\gamma}(f,g)=\cos(f,g), in which g(x)pqg(x)^{\ominus\frac{p}{q}} reduces to g(x)g(x). Further, a basic property is summarized as follows:

Proposition 17.

Let ff and gg be in p(Λ){\cal L}_{p}(\Lambda). Then, |cosγ(f,g)|1|\cos_{\gamma}(f,g)|\leq 1, and equality holds if and only if gg is proportional to ff.

Proof.

By definition, cosγ(f,g)=H(f,κp(g))\cos_{\gamma}(f,g)={\rm H}(f,\kappa_{p}(g)), where H{\rm H} is defined in (4.13). This implies |cosγ(f,g)|1|\cos_{\gamma}(f,g)|\leq 1. The equality holds if and only if |f|p|κp(g)|q=|g|p|f|^{p}\propto|\kappa_{p}(g)|^{q}=|g|^{p}, that is, there exists a scalar σ\sigma such that g(x)=σf(x)g(x)=\sigma f(x) for almost every xx. ∎
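Proposition 17 can be checked numerically under a discrete (counting) reference measure; the vectors f and g below are hypothetical:

```python
import numpy as np

def spow(x, g):
    return np.sign(x) * np.abs(x) ** g     # sign-preserved power

def cos_gamma(f, g, gamma):
    """gamma-cosine: <f/||f||_p, (g/||g||_p)^{sign-power p/q}> with p = gamma+1, p/q = gamma."""
    p = gamma + 1.0
    fn = f / np.linalg.norm(f, ord=p)      # np.linalg.norm accepts a float ord for vectors
    gn = g / np.linalg.norm(g, ord=p)
    return float(np.dot(fn, spow(gn, gamma)))

f = np.array([0.5, -1.0, 2.0])
g = np.array([0.4, -0.9, 1.8])

assert abs(cos_gamma(f, g, 1.5)) <= 1.0                  # |cos_gamma| <= 1 (Proposition 17)
assert np.isclose(cos_gamma(f, 3.0 * f, 1.5), 1.0)       # equality iff g is proportional to f
assert np.isclose(cos_gamma(f, -2.0 * f, 1.5), -1.0)     # sign flips under negative scaling
# gamma = 1 (p = q = 2) reduces to the ordinary cosine similarity
std = np.dot(f, g) / (np.linalg.norm(f) * np.linalg.norm(g))
assert np.isclose(cos_gamma(f, g, 1.0), std)
```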

In this way, the γ\gamma-cosine is defined by the isomorphism between p{\cal L}_{p} and p{\cal L}_{p}^{*}. We note that

cosγ(τf,σg)=sign(τσ)cosγ(f,g).\displaystyle\cos_{\gamma}(\tau f,\sigma g)={\rm sign}(\tau\sigma)\cos_{\gamma}(f,g).

Accordingly, cosγ(f,g)\cos_{\gamma}(f,g) is a natural extension of the cosine functional cos(f,g)\cos(f,g). A distinctive feature is that cosγ(f,g)\cos_{\gamma}(f,g) is asymmetric in ff and gg if γ1\gamma\neq 1. The asymmetry remains, akin to divergence measures, providing a directional similarity measure between two functions. We have discussed the cosine measure extended on 1+γ(Λ){\cal L}_{1+\gamma}(\Lambda) in relation to the log γ\gamma-divergence. In effect, the divergence is defined so as to be applicable to any empirical probability measure for a given dataset. However, such a constraint is not required in this context. Hence we can define a generalized variant

cos(β,γ)(f,g)=(ffβ+γ)β,(ggβ+γ)γ\displaystyle\cos_{(\beta,\gamma)}(f,g)=\Big{\langle}\Big{(}\frac{f\ }{\|f\|_{\beta+\gamma}}\Big{)}^{\ominus\beta},\Big{(}\frac{g\ }{\|g\|_{\beta+\gamma}}\Big{)}^{\ominus{\gamma}}\Big{\rangle} (4.15)

for ff and gg in β+γ(Λ){\cal L}_{\beta+\gamma}(\Lambda), called the (β,γ)(\beta,\gamma)-cosine measure, with tuning parameters β1\beta\geq 1 and γ1.\gamma\geq 1. Specifically, it is noted cosγ(f,g)=cos(1,γ)(f,g)\cos_{\gamma}(f,g)=\cos_{(1,\gamma)}(f,g). We note that the information divergence associated with cos(β,γ)(f,g)\cos_{(\beta,\gamma)}(f,g) is given by

Δ(β,γ)(f,g)=1βγlog|cos(β,γ)(f,g)|.\displaystyle\Delta_{(\beta,\gamma)}(f,g)=-\frac{1}{\beta\gamma}\log|\cos_{(\beta,\gamma)}(f,g)|.

In statistical machine learning, this measure could be used to compare probability density functions, regression functions, or other functional forms, especially when dealing with asymmetric relationships. It might be particularly relevant in scenarios where the sign of the function values carries important information, such as in economic data, signal processing, or environmental modeling.

The formulation defined on the function space is easily reduced on a Euclidean space as follows. Let xx and yy be in d\mathbb{R}^{d}. Then, the cosine similarity is defined by

cos(x,y)=xx,yy=e(x),e(y),\displaystyle\cos(x,y)=\Big{\langle}\frac{x}{\|x\|},\frac{y}{\|y\|}\Big{\rangle}=\langle e(x),e(y)\rangle, (4.16)

where e(x)=x/xe(x)=x/\|x\| and ,\langle\cdot,\cdot\rangle and \|\cdot\| denote the Euclidean inner product and norm on d\mathbb{R}^{d}. The γ\gamma-cosine function in d\mathbb{R}^{d} is introduced as

cosγ(x,y)\displaystyle\cos_{\gamma}(x,y) =xxp,(yyp)γ=eγ(x),eγ(y),\displaystyle=\Big{\langle}\frac{x}{\|x\|_{p}},\Big{(}\frac{y}{\|y\|_{p}}\Big{)}^{\ominus\gamma}\Big{\rangle}=\langle e_{\gamma}(x),e_{\gamma}^{*}(y)\rangle,
=i=1dxisign(yi)|yi|γ{i=1d|xi|p}1p{i=1d|yi|p}1q,\displaystyle=\frac{\sum_{i=1}^{d}x_{i}{\rm sign}(y_{i})|y_{i}|^{\gamma}}{\{\sum_{i=1}^{d}|x_{i}|^{p}\}^{\frac{1}{p}}\{\sum_{i=1}^{d}|y_{i}|^{p}\}^{\frac{1}{q}}}, (4.17)

for a power parameter γ>0\gamma>0. We can view the plot of the sign-preserving power transformation xγx^{\ominus\gamma} for γ=15,25,35,45,1\gamma=\frac{1}{5},\frac{2}{5},\frac{3}{5},\frac{4}{5},1 in Fig. 4.3:

Refer to caption
Figure 4.3: Plots of the signed power function.

As for the generalized measure, the (β,γ)(\beta,\gamma)-cosine measure is given by

cos(β,γ)(x,y)=(xxβ+γ)β,(yyβ+γ)γ,\displaystyle\cos_{(\beta,\gamma)}(x,y)=\Big{\langle}\Big{(}\frac{x}{\|x\|_{\beta+\gamma}}\Big{)}^{\ominus\beta},\Big{(}\frac{y}{\|y\|_{\beta+\gamma}}\Big{)}^{\ominus\gamma}\Big{\rangle},

see the functional form (4.14).
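A vectorized sketch of (4.17) and the generalized (β, γ)-cosine, with hypothetical inputs:

```python
import numpy as np

def spow(x, g):
    return np.sign(x) * np.abs(x) ** g

def cos_beta_gamma(x, y, beta, gamma):
    """(beta,gamma)-cosine: inner product of the sign-powered, (beta+gamma)-normalized vectors."""
    p = beta + gamma
    xn = x / np.linalg.norm(x, ord=p)
    yn = y / np.linalg.norm(y, ord=p)
    return float(np.dot(spow(xn, beta), spow(yn, gamma)))

x = np.array([3.0, -1.0, 0.5])
y = np.array([2.5, -1.2, 0.4])

# (beta, gamma) = (1, 1) recovers the standard cosine similarity
std = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(cos_beta_gamma(x, y, 1.0, 1.0), std)
# scale behaviour: cos(tau x, sigma y) = sign(tau sigma) cos(x, y)
assert np.isclose(cos_beta_gamma(-2.0 * x, 3.0 * y, 1.0, 3.0),
                  -cos_beta_gamma(x, y, 1.0, 3.0))
```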

We investigate properties of the γ\gamma-cosine and the (β,γ)(\beta,\gamma)-cosine in comparison with the standard cosine. Let e(β,γ)(x)=(xxβ+γ)β\displaystyle e_{(\beta,\gamma)}(x)=\Big{(}\frac{x}{\|x\|_{\beta+\gamma}}\Big{)}^{\ominus\beta} and e(β,γ)(y)=(yyβ+γ)γ\displaystyle e_{(\beta,\gamma)}^{*}(y)=\Big{(}\frac{y}{\|y\|_{\beta+\gamma}}\Big{)}^{\ominus\gamma}. Then, the (β,γ)(\beta,\gamma)-cosine is written as

cos(β,γ)(x,y)=e(β,γ)(x),e(β,γ)(y),\displaystyle\cos_{(\beta,\gamma)}(x,y)=\langle e_{(\beta,\gamma)}(x),e_{(\beta,\gamma)}^{*}(y)\rangle,

We observe the following behaviors as γ\gamma takes extreme values.

Proposition 18.

Let xx and yy be in d\mathbb{R}^{d}. Then,
(a).              limγ0cos(β,γ)(x,y)=(xxβ)β,sign(y),\displaystyle\lim_{\gamma\rightarrow 0}\cos_{(\beta,\gamma)}(x,y)=\Big{\langle}\Big{(}\frac{x}{\|x\|_{\beta}}\Big{)}^{\ominus\beta},{\rm sign}(y)\Big{\rangle},
where sign(y)=(sign(yi))i=1d{\rm sign}(y)=({\rm sign}(y_{i}))_{i=1}^{d}. Further,
(b).              limγcos(β,γ)(x,y)=(xx)β,sign(y)sign(y)1,\displaystyle\lim_{\gamma\rightarrow\infty}\cos_{(\beta,\gamma)}(x,y)=\Big{\langle}\Big{(}\frac{x}{\|x\|_{\infty}}\Big{)}^{\ominus\beta},\frac{{\rm sign}_{\infty}(y)}{\|{\rm sign}_{\infty}(y)\|_{1}}\Big{\rangle},
where the ii-th component of sign(y){\rm sign}_{\infty}(y) denotes sign(yi)𝕀(|yi|=y){\rm sign}(y_{i}){\mathbb{I}}(|y_{i}|=\|y\|_{\infty}) for i=1,di=1...,d with y=max1id|yi|\|y\|_{\infty}=\max_{1\leq i\leq d}|y_{i}|.

Proof.

By definition, limγ0yγ=sign(y)\lim_{\gamma\rightarrow 0}y^{\ominus\gamma}={\rm sign}(y). This implies (a). Next, if we divide both the numerator and the denominator of e(β,γ)(y)e_{(\beta,\gamma)}^{*}(y) by y\|y\|_{\infty}, then

limγe(β,γ)(y)=limγ(yy)γ(yβ+γy)γ.\displaystyle\lim_{\gamma\rightarrow\infty}e_{(\beta,\gamma)}^{*}(y)=\lim_{\gamma\rightarrow\infty}\Big{(}\frac{y}{\|y\|_{\infty}}\Big{)}^{\ominus\gamma}\Big{(}\frac{\|y\|_{\beta+\gamma}}{\|y\|_{\infty}}\Big{)}^{-\gamma}.

Hence, for i=1,,di=1,...,d

sign(yi)limγ(|yi|y)γ=sign(yi)𝕀(|yi|=y);\displaystyle{\rm sign}(y_{i})\lim_{\gamma\rightarrow\infty}\Big{(}\frac{|y_{i}|}{\|y\|_{\infty}}\Big{)}^{\gamma}={\rm sign}(y_{i}){\mathbb{I}}(|y_{i}|=\|y\|_{\infty});
limγ(yβ+γy)γ=limγ[i=1d(|yi|y)γ]1=[i=1d𝕀(|yi|=y)]1.\displaystyle\lim_{\gamma\rightarrow\infty}\Big{(}\frac{\|y\|_{\beta+\gamma}}{\|y\|_{\infty}}\Big{)}^{-\gamma}=\lim_{\gamma\rightarrow\infty}\bigg{[}\sum_{i=1}^{d}\Big{(}\frac{|y_{i}|}{\|y\|_{\infty}}\Big{)}^{\gamma}\bigg{]}^{-1}=\Big{[}\sum_{i=1}^{d}{\mathbb{I}}(|{y_{i}}|={\|y\|_{\infty}})\Big{]}^{-1}.

Consequently, we conclude (b). ∎

We remark that

limβ0,γ0cos(β,γ)(x,y)=1dsign(x),sign(y).\displaystyle\lim_{\beta\rightarrow 0,\gamma\rightarrow 0}\cos_{(\beta,\gamma)}(x,y)=\frac{1}{d}\big{\langle}{\rm sign}(x),{\rm sign}(y)\big{\rangle}.

Alternatively, the order of taking limits of β\beta and γ\gamma to \infty with respect to cos(β,γ)(x,y)\cos_{(\beta,\gamma)}(x,y) results in different outcomes:

limβlimγcos(β,γ)(x,y)=sign(x),sign(y)sign(y)1;\displaystyle\lim_{\beta\rightarrow\infty}\lim_{\gamma\rightarrow\infty}\cos_{(\beta,\gamma)}(x,y)=\frac{\langle{\rm sign}_{\infty}(x),{\rm sign}_{\infty}(y)\rangle}{\|{\rm sign}_{\infty}(y)\|_{1}};
limγlimβcos(β,γ)(x,y)=sign(x),sign(y)sign(x)1;\displaystyle\lim_{\gamma\rightarrow\infty}\lim_{\beta\rightarrow\infty}\cos_{(\beta,\gamma)}(x,y)=\frac{\langle{\rm sign}_{\infty}(x),{\rm sign}_{\infty}(y)\rangle}{\|{\rm sign}_{\infty}(x)\|_{1}};
limγcos(γ,γ)(x,y)=sign(x),sign(y)sign(x)2sign(y)2.\displaystyle\lim_{\gamma\rightarrow\infty}\cos_{(\gamma,\gamma)}(x,y)=\frac{\langle{\rm sign}_{\infty}(x),{\rm sign}_{\infty}(y)\rangle}{\|{\rm sign}_{\infty}(x)\|_{2\ }\|{\rm sign}_{\infty}(y)\|_{2}}.
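Proposition 18 (b) can be checked numerically; the vectors below are hypothetical, with two components of y tied at its maximum absolute value:

```python
import numpy as np

def spow(x, g):
    return np.sign(x) * np.abs(x) ** g

def cos_beta_gamma(x, y, beta, gamma):
    p = beta + gamma
    return float(np.dot(spow(x / np.linalg.norm(x, ord=p), beta),
                        spow(y / np.linalg.norm(y, ord=p), gamma)))

x = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, -3.0, 3.0])            # |y_2| = |y_3| = ||y||_inf = 3: a tie at the max

beta = 2.0
# limit (b): the y-side collapses onto the sparse vector sign_inf(y) / ||sign_inf(y)||_1
s_inf = np.sign(y) * (np.abs(y) == np.abs(y).max())
limit = float(np.dot(spow(x / np.abs(x).max(), beta), s_inf / np.abs(s_inf).sum()))
# for a large but finite gamma the (beta, gamma)-cosine is already close to the limit
assert np.isclose(cos_beta_gamma(x, y, beta, 200.0), limit, atol=1e-2)
```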

Note that sign(y){\rm sign}_{\infty}(y) is a sparse vector, as it has a nonzero sign only at the components with the maximum absolute value, with 0's elsewhere. Thus, cos(β,)(x,y)\cos_{(\beta,\infty)}(x,y) is proportional to the Euclidean inner product between xβx^{\ominus\beta} and the sparse vector sign(y){\rm sign}_{\infty}(y). This is in contrast with the standard cosine similarity, in which the orthogonality with cos(x,y)\cos_{\infty}(x,y) is totally different from that with cos(x,y)\cos(x,y). In effect, cos(x,y)=0x,y=0\cos(x,y)=0\Leftrightarrow\langle x,y\rangle=0; cos(β,)(x,y)=0xβ,sign(y)=0\cos_{(\beta,\infty)}(x,y)=0\Leftrightarrow\langle x^{\ominus\beta},{\rm sign}_{\infty}(y)\rangle=0. The orthogonality with cos(β,)(x,y)\cos_{(\beta,\infty)}(x,y) reduces to the inner product of the dd_{\infty}-dimensional Euclidean space, where dd_{\infty} is the cardinality of {i{1,,d}:|yi|=y}\{i\in\{1,...,d\}:|y_{i}|=\|y\|_{\infty}\}. Note that the equality condition in the limit case of γ\gamma is totally different from that for finite γ\gamma. Indeed, x=±sign(y)x=\pm{\rm sign}_{\infty}(y) if and only if cos(β,)(x,y)=±1\cos_{(\beta,\infty)}(x,y)=\pm 1, where cos(x,y)\cos_{\infty}(x,y) can be viewed as the arithmetic mean of relative ratios over 1(y)1_{\infty}(y). It is pointed out that the cosine similarity has poor performance for high-dimensional data: its values become small numbers near zero, and hence they cannot extract important characteristics of vectors. It is frequently observed for high-dimensional data that only a small part of the components carries important information for a target analysis, while the remaining components are non-informative. The standard cosine similarity measures all components equally, while the power-transformed cosine (γ\gamma-cos) can focus on the small set of essential components. Thus, the γ\gamma-cos neglects the unnecessary information carried by the majority of components, so that it can extract the essential information associated with the principal components. In this sense, the γ\gamma-cos does not need any preprocessing procedures for dimension reduction such as principal component analysis.

Proposition 19.

Let x=(x0,x1)x=(x_{0},x_{1}) and y=(y0,y1)y=(y_{0},y_{1}), respectively, where x0,y0d0x_{0},y_{0}\in\mathbb{R}^{d_{0}}; x1,y1d1x_{1},y_{1}\in\mathbb{R}^{d_{1}} with d=d0+d1d=d_{0}+d_{1}. If x0>x1\|x_{0}\|_{\infty}>\|x_{1}\|_{\infty} and y0>y1\|y_{0}\|_{\infty}>\|y_{1}\|_{\infty}, then,

cos(β,)(x0,y0)=cos(β,)(x,y).\displaystyle\cos_{(\beta,\infty)}(x_{0},y_{0})=\cos_{(\beta,\infty)}(x,y). (4.18)
Proof.

From the assumption, x0=x\|x_{0}\|_{\infty}=\|x\|_{\infty} and 1(y0)=1(y)1_{\infty}(y_{0})=1_{\infty}(y). This implies

cos(β,)(x,y)=i=1dxiβx0βsign(yi)𝕀(|yi|=y0)|1(y0)|,\displaystyle\cos_{(\beta,\infty)}(x,y)=\sum_{i=1}^{d}\frac{\ x_{i}^{\ominus\beta}\ \ }{\|x_{0}\|_{\infty}^{\beta}}\frac{{\rm sign}(y_{i}){\mathbb{I}}(|y_{i}|=\|y_{0}\|_{\infty}\big{)}}{|1_{\infty}(y_{0})|}, (4.19)

which is nothing but cos(β,)(x0,y0)\cos_{(\beta,\infty)}(x_{0},y_{0}) since all the summands are zeros in the summation of ii from d0+1d_{0}+1 to dd in (4.19). ∎

In Proposition 19, the infinite-power cosine similarity can be viewed as a robust measure in the sense that cos(x0,y0)=cos((x0,x1),(y0,y1))\cos_{\infty}(x_{0},y_{0})=\cos_{\infty}((x_{0},x_{1}),(y_{0},y_{1})) for any minor components x1x_{1} and y1y_{1}. However, this robustness can be extreme, as seen in the following.

Proposition 20.

Consider a function of ϵ\epsilon as

Φ(ϵ)=cos(x,(y0,ϵy1)).\displaystyle\Phi(\epsilon)=\cos_{\infty}(x,(y_{0},\epsilon y_{1})).

Then, if y1=y0\|y_{1}\|_{\infty}=\|y_{0}\|_{\infty}, Φ(ϵ)\Phi(\epsilon) is not continuous at ϵ=1\epsilon=1.

Proof.

It follows from Proposition 19 that, if 0<ϵ<10<\epsilon<1, then

Φ(ϵ)=i=1d0xixsign(yi)𝕀(|yi|=y0)|1(y0)|\displaystyle\Phi(\epsilon)=\sum_{i=1}^{d_{0}}\frac{x_{i}\ \ }{\|x\|_{\infty}}\frac{{\rm sign}(y_{i}){\mathbb{I}}(|y_{i}|=\|y_{0}\|_{\infty}\big{)}}{|1_{\infty}(y_{0})|}

where d0d_{0} is the dimension of y0y_{0}. On the other hand,

Φ(1)=i=1d0xixsign(yi)𝕀(|yi|=y0)|1(y0)|+|1(y1)|+i=d0+1dxixsign(yi)𝕀(|yi|=y1)|1(y0)|+|1(y1)|.\displaystyle\Phi(1)=\sum_{i=1}^{d_{0}}\frac{x_{i}\ \ }{\|x\|_{\infty}}\frac{{\rm sign}(y_{i}){\mathbb{I}}(|y_{i}|=\|y_{0}\|_{\infty}\big{)}}{|1_{\infty}(y_{0})|+|1_{\infty}(y_{1})|}+\sum_{i=d_{0}+1}^{d}\frac{x_{i}\ \ }{\|x\|_{\infty}}\frac{{\rm sign}(y_{i}){\mathbb{I}}(|y_{i}|=\|y_{1}\|_{\infty}\big{)}}{|1_{\infty}(y_{0})|+|1_{\infty}(y_{1})|}.

This implies the discontinuity of Φ(ϵ)\Phi(\epsilon) at ϵ=1\epsilon=1. ∎

We investigate statistical properties of the power cosine measure in comparison with the conventional cosine similarity. For this, we consider scenarios generating realized vectors xx's and yy's in d\mathbb{R}^{d} as follows. Assume that the jj-th replications XjX_{j} and YjY_{j} are given by

Xj=μ1+ϵ1andYj=μ2+ϵ2,\displaystyle X_{j}=\mu_{1}+\epsilon_{1}\hskip 14.22636pt\mbox{and}\hskip 14.22636ptY_{j}=\mu_{2}+\epsilon_{2},

where ϵa\epsilon_{a}'s are independently and identically distributed as 𝙽𝚘𝚛(0,σ2𝕀d){\tt Nor}(0,\sigma^{2}{\mathbb{I}}_{d}). We conduct a numerical experiment with 20002000 replications, setting d=1000d=1000 and μ1=(10,9,,1,0,,0)\mu_{1}=(10,9,...,1,0,...,0)^{\top}, with μ2\mu_{2} specified below for several values of σ2\sigma^{2}.

First, fix μ2=μ1\mu_{2}=\mu_{1} as a proportional case. Then, the value of the cosine measure cos(β,γ)(X,Y)\cos_{(\beta,\gamma)}(X,Y) is expected to be 11 if the error terms are negligible. When (β,γ)=(1,1)(\beta,\gamma)=(1,1), cos(β,γ)(X,Y)\cos_{(\beta,\gamma)}(X,Y) does not have a mean close to 11 even with small errors; when β>1,γ>1\beta>1,\gamma>1, it has a mean close to 11 with reasonable errors. Table 4.1 shows detailed outcomes with the variance σ2=0.05,0.1,0.3,0.5\sigma^{2}=0.05,0.1,0.3,0.5, where Mean and Std denote the mean and standard deviation of cos(β,γ)(Xj,Yj)\cos_{(\beta,\gamma)}(X_{j},Y_{j})'s over 2000 replications. Second, fix

μ2=μ20μ1,μ20μ12μ1,\displaystyle\mu_{2}=\mu_{20}-\frac{\langle\mu_{1},\mu_{20}\rangle}{\|\mu_{1}\|^{2}}\mu_{1},

where μ20=(1,2,,10,0,,0)\mu_{20}=(1,2,...,10,0,...,0)^{\top}. Note μ1,μ2=0\langle\mu_{1},\mu_{2}\rangle=0 by construction. This means μ1\mu_{1} and μ2\mu_{2} are orthogonal in the L2-sense. Then, the value of the cosine measure cos(β,γ)(X,Y)\cos_{(\beta,\gamma)}(X,Y) should be near 00 if the error terms are negligible. For all the cases of (β,γ)(\beta,\gamma), the mean of cos(β,γ)(X,Y)\cos_{(\beta,\gamma)}(X,Y) is reasonably near 00 with small standard deviations; see Table 4.2 for details.
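A reduced-scale sketch of the proportional case (fewer replications than the 2000 reported; the seed and helper names are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def spow(x, g):
    return np.sign(x) * np.abs(x) ** g

def cos_beta_gamma(x, y, beta, gamma):
    p = beta + gamma
    return float(np.dot(spow(x / np.linalg.norm(x, ord=p), beta),
                        spow(y / np.linalg.norm(y, ord=p), gamma)))

d, reps, sigma2 = 1000, 200, 0.05
mu1 = np.concatenate([np.arange(10.0, 0.0, -1.0), np.zeros(d - 10)])  # (10, 9, ..., 1, 0, ..., 0)

# proportional case mu2 = mu1 with (beta, gamma) = (2, 2)
vals = [cos_beta_gamma(mu1 + rng.normal(0.0, np.sqrt(sigma2), d),
                       mu1 + rng.normal(0.0, np.sqrt(sigma2), d), 2.0, 2.0)
        for _ in range(reps)]
print(round(np.mean(vals), 3), round(np.std(vals), 3))   # mean close to 1, as in Table 4.1
```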

Table 4.1: cos_{(β,γ)}(X,Y) in a proportional case; each entry is Mean (Std)

(β,γ)    σ² = 0.05       σ² = 0.1        σ² = 0.3        σ² = 0.5
(1,1)    0.885 (0.005)   0.793 (0.008)   0.562 (0.018)   0.434 (0.023)
(2,2)    0.997 (0.001)   0.993 (0.003)   0.975 (0.008)   0.948 (0.014)
(2,5)    0.995 (0.003)   0.991 (0.007)   0.976 (0.018)   0.961 (0.030)

Table 4.2: cos_{(β,γ)}(X,Y) in an orthogonal case; each entry is Mean (Std)

(β,γ)    σ² = 0.05        σ² = 0.1        σ² = 0.3        σ² = 0.5
(1,1)    0.000 (0.029)    0.000 (0.015)   0.000 (0.020)   0.000 (0.027)
(2,2)    -0.086 (0.045)   0.092 (0.014)   0.093 (0.021)   -0.089 (0.036)
(2,5)    0.007 (0.000)    0.006 (0.000)   0.006 (0.000)   0.006 (0.000)

We applied these similarity measures to hierarchical clustering using a Python package. Synthetic data were generated in a setting of 8 clusters, each with 15 data points, in a 1000-dimensional Euclidean space. The distance functions used were cos(1,1)(x,y)\cos_{(1,1)}(x,y) and cos(1,5)(x,y)\cos_{(1,5)}(x,y), to compare performance in high-dimensional data clustering. The clustering criterion was set to maxclust in fcluster from the scipy.cluster.hierarchy module. The silhouette score, ranging from -1 to +1, served as a measure of the clustering quality. The clustering was conducted with 10 replications.

For case (a), using the distance based on cos(1,1)(x,y)\cos_{(1,1)}(x,y), the 10 silhouette scores had a mean of -0.038 with a standard deviation of 0.001, indicating poor clustering quality. Alternatively, for case (b), with the distance based on cos(1,5)(x,y)\cos_{(1,5)}(x,y), the scores had a mean of 0.833 and a standard deviation of 0.015, suggesting good clustering quality. Thus, the hierarchical clustering performance using (β,γ)=(1,5)(\beta,\gamma)=(1,5)-cosine similarity was significantly better than that using standard cosine similarity, as illustrated in typical dendrograms (Fig. 4.4).
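A self-contained sketch of such an experiment; the cluster geometry, noise level, and linkage method here are our assumptions, not the exact setting behind Fig. 4.4:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

def spow(x, g):
    return np.sign(x) * np.abs(x) ** g

def gamma_cos_dist(x, y, beta=1.0, gamma=5.0):
    """Dissimilarity 1 - cos_{(beta,gamma)}(x, y); note the measure is asymmetric,
    so pdist evaluates one orientation per pair."""
    p = beta + gamma
    xn = x / np.linalg.norm(x, ord=p)
    yn = y / np.linalg.norm(y, ord=p)
    return 1.0 - float(np.dot(spow(xn, beta), spow(yn, gamma)))

# 8 clusters x 15 points in R^1000; each cluster's signal lives on 5 coordinates
k, n_per, d = 8, 15, 1000
centers = np.zeros((k, d))
for j in range(k):
    centers[j, 5 * j:5 * j + 5] = 6.0
X = np.repeat(centers, n_per, axis=0) + rng.normal(0.0, 1.0, (k * n_per, d))

D = pdist(X, metric=gamma_cos_dist)            # condensed pairwise dissimilarities
labels = fcluster(linkage(D, method="average"), t=k, criterion="maxclust")
```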

Figure 4.4: The dendrograms by the distances based on (a) and (b).

Let XX be a dd-variate variable with a covariance matrix Σ\Sigma, which is a dd-dimensional symmetric, positive definite matrix. Suppose the eigenvalues λ1,,λd\lambda_{1},...,\lambda_{d} of Σ\Sigma are restricted as λ1λd0ϵ>δλd0+1λd\lambda_{1}\geq...\geq\lambda_{d_{0}}\geq\epsilon>\delta\geq\lambda_{d_{0}+1}\geq...\geq\lambda_{d}. Given an nn-random sample (X1,,Xn)(X_{1},...,X_{n}) from XX, the standard PCA solves for the kk principal vectors v1,,vkv_{1},...,v_{k}, giving the approximation xj=1kvjvjxx\approx\sum_{j=1}^{k}v_{j}v_{j}^{\top}x. Suppose that (X1,,Xn)(X_{1},...,X_{n}) is generated from 𝙽𝚘𝚛d(0,Σ){\tt Nor}_{d}(0,\Sigma), where

Σ=[Σ0OOϵ𝕀dd0].\displaystyle\Sigma=\begin{bmatrix}\Sigma_{0}&O\\ O^{\top}&\epsilon\>{\mathbb{I}}_{d-d_{0}}\end{bmatrix}. (4.20)

Here Σ0\Sigma_{0} is a positive-definite matrix of size d0×d0d_{0}\times d_{0}-matrix whose eigenvalues are (λ1,,λd0)(\lambda_{1},...,\lambda_{d_{0}}) and OO is a zero matrix of size d0×(dd0).d_{0}\times(d-d_{0}). We set as

n=500,d=1000,d0=10,(λ1,,λd0)=(5,4.5,,1.5,1),ϵ=0.1.\displaystyle n=500,\ d=1000,\ d_{0}=10,\ (\lambda_{1},...,\lambda_{d_{0}})=(5,4.5,...,1.5,1),\ \epsilon=0.1.

Thus, the scenario envisages a situation where the signal is only 1010-dimensional, with the remaining 990990 dimensions being noise.

For this, the sample covariance matrix is defined by

S=1ni=1n(XiX¯)(XiX¯)\displaystyle S=\frac{1}{n}\sum_{i=1}^{n}(X_{i}-\bar{X})(X_{i}-\bar{X})^{\top}

and λ^j\hat{\lambda}_{j} and v^j\hat{v}_{j} are obtained as the jj-th eigenvalue and eigenvector of SS, where X¯\bar{X} is the sample mean vector. We propose the γ\gamma-sample covariance matrix as

Sγ=1ni=1n(XiγXγ¯)(XiγXγ¯),\displaystyle S^{\ominus\gamma}=\frac{1}{n}\sum_{i=1}^{n}(X_{i}^{\ominus\gamma}-\overline{X^{\ominus\gamma}})(X_{i}^{\ominus\gamma}-\overline{X^{\ominus\gamma}})^{\top},

where the γ\gamma-transform for a dd-vector xx is given by xγ=(sign(xj)|xj|γ)j=1dx^{\ominus\gamma}=({\rm sign}(x_{j})|x_{j}|^{\gamma})_{j=1}^{d} and

Xγ¯=1ni=1nXiγ.\displaystyle\overline{X^{\ominus\gamma}}=\frac{1}{n}\sum_{i=1}^{n}X_{i}^{\ominus\gamma}.

Thus, the γ\gamma-PCA is derived by solving for the eigenvalues and eigenvectors of SγS^{\ominus\gamma}.

To implement the PCA modification in Python, given the specific requirements for generating the sample data, we follow these steps:

  • Generate Sample Data: Create a 1000-dimensional dataset where the first 10 dimensions are drawn from a normal distribution with a specific covariance matrix Σ0\Sigma_{0}, and the remaining dimensions have a much smaller variance.

  • Compute the γ\gamma-Sample Covariance Matrix: Apply the γ\gamma transformation to the covariance matrix computation.

  • Eigenvalue and Eigenvector Computation: Compute the eigenvalues and eigenvectors of the γ\gamma-sample covariance matrix.
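These steps can be sketched as follows; the eigenvalue sequence uses np.linspace(5, 1, 10) as an approximation to the stated (5, 4.5, ..., 1), and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

def spow(x, g):
    return np.sign(x) * np.abs(x) ** g          # gamma-transform, applied elementwise

# Step 1: synthetic data -- 10 signal dimensions, 990 small-variance noise dimensions
n, d, d0, eps = 500, 1000, 10, 0.1
lam = np.linspace(5.0, 1.0, d0)
X = rng.normal(0.0, 1.0, (n, d)) * np.sqrt(np.concatenate([lam, eps * np.ones(d - d0)]))

# Step 2: the gamma-sample covariance matrix of the transformed data
def gamma_cov(X, gamma):
    Z = spow(X, gamma)
    Zc = Z - Z.mean(axis=0)
    return Zc.T @ Zc / len(Z)

# Step 3: eigenvalues and the cumulative contribution ratio at d0 = 10 components
def ccr10(S):
    w = np.sort(np.linalg.eigvalsh(S))[::-1]    # eigenvalues in descending order
    return float(np.cumsum(w)[d0 - 1] / w.sum())

print(ccr10(gamma_cov(X, 1.0)), ccr10(gamma_cov(X, 2.0)))  # gamma = 2 concentrates the signal
```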

We conducted a numerical experiment according to these steps. The cumulative contribution ratios are plotted in Fig 4.5. It was observed that the standard PCA (γ=1.0)(\gamma=1.0) had poor performance for the synthetic dataset, in which the cumulative contribution at 1010 dimensions was lower than 0.30.3. Alternatively, the γ\gamma-PCA effectively improves the performance, as the cumulative contribution at 1010 dimensions was higher than 0.90.9 for γ=2.0\gamma=2.0. We remark that this efficiency of the γ\gamma-PCA depends on the simulation setting, in which the signal vector X0X_{0} of dimension d0d_{0} and the no-signal vector X1X_{1} of dimension dd0d-d_{0} are independent as in (4.20), where XX is decomposed as (X0,X1)(X_{0},X_{1}). If the independence is not assumed, then the good recovery by the γ\gamma-PCA is not observed. In reality, there would not be strong evidence of whether the independence holds or not. To address this issue, further discussion with real data analysis is needed. Additionally, combining PCA with other techniques like independent component analysis or machine learning algorithms can further enhance its performance in complex data environments. This broader perspective is important, especially concerning the real-world applicability and limitations of PCA modifications.

Figure 4.5: Plot of cumulative contribution ratios with γ=1.0,1.5,2.0\gamma=1.0,1.5,2.0

We have discussed the extension of the γ\gamma-divergence to the Lebesgue Lp-space and introduced the concept of γ\gamma-cosine similarity, a novel measure for comparing functions or vectors in a function space. This measure is particularly relevant in statistics and machine learning, especially when dealing with signed functions or functional data.

The γ\gamma-divergence, previously studied in the context of regression and classification, is extended to the Lebesgue Lp-space. To address the issue of functions taking negative values, a sign-preserved power transformation is introduced. This transformation is crucial for extending the log γ\gamma-divergence to functions that can take negative values. The concept of cosine similarity, commonly used in Hilbert spaces, is extended to the Lp-space. The γ\gamma-cosine similarity is defined as cosγ(f,g)=f/fp,(g/gp)(p/q)\cos_{\gamma}(f,g)=\langle f/\|f\|_{p},(g/\|g\|_{p})^{\ominus(p/q)}\rangle, where p=γ+1p=\gamma+1 and qq is the conjugate exponent of pp. This measure remains mathematically consistent for all real values of g(x)g(x).

Basic properties of the γ\gamma-cosine similarity are explored: |cosγ(f,g)|1|\cos_{\gamma}(f,g)|\leq 1, with equality holding if and only if gg is proportional to ff. It is also noted that cosγ(f,g)=0\cos_{\gamma}(f,g)=0 if and only if Δγ(f,g)=\Delta_{\gamma}(f,g)=\infty, indicating maximum distinctness between ff and gg. The generalized (β,γ)(\beta,\gamma)-cosine measure, a more general form of the cosine measure, is introduced for the Lβ+γ(Λ)L_{\beta+\gamma}(\Lambda) space, providing additional flexibility through the tuning parameters β\beta and γ\gamma.

An application of these similarity measures to hierarchical clustering is demonstrated using Python. The (β,γ)(\beta,\gamma)-cosine similarity shows better performance in clustering high-dimensional data than the standard cosine similarity. It can focus on essential components of the data, potentially reducing the need for preprocessing steps such as principal component analysis. The γ\gamma-PCA is defined, parallel to the (γ,γ)(\gamma,\gamma)-cosine, and demonstrated to perform well in high-dimensional settings.
Therefore, the γ\gamma-cosine and (β,γ)(\beta,\gamma)-cosine measures could be particularly useful in statistical machine learning for comparing probability density functions, regression functions, or other functional forms, especially in scenarios where the sign of function values is significant.
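In finite dimensions the definition above can be computed directly. Since qq is the conjugate exponent of p=γ+1p=\gamma+1, the power p/qp/q equals γ\gamma, so the sign-preserved power of g/gpg/\|g\|_{p} is sign(g)|g|γ/gpγ\operatorname{sign}(g)|g|^{\gamma}/\|g\|_{p}^{\gamma}. A minimal sketch (γ=1\gamma=1 recovers the ordinary cosine similarity):

```python
import numpy as np

def gamma_cosine(f, g, gamma):
    """gamma-cosine similarity of vectors f and g.
    With p = gamma + 1 and conjugate exponent q, p/q = gamma, so the
    sign-preserved power of g/||g||_p is sign(g)|g|^gamma / ||g||_p^gamma."""
    p = gamma + 1.0
    signed_pow = np.sign(g) * np.abs(g) ** gamma  # sign-preserved power
    return f @ signed_pow / (np.linalg.norm(f, p) * np.linalg.norm(g, p) ** gamma)

f = np.array([1.0, -2.0, 0.5])
print(gamma_cosine(f, 3.0 * f, 2.0))   # proportional vectors: similarity 1
print(gamma_cosine(f, -f, 2.0))        # opposite sign: similarity -1
g = np.array([0.2, 1.0, -1.5])
print(gamma_cosine(f, g, 1.0))         # gamma = 1: ordinary cosine similarity
```

The proportionality property |cosγ(f,g)|=1|\cos_{\gamma}(f,g)|=1 iff gfg\propto f follows here from Hölder's inequality, since |g|γ|g|^{\gamma} lies in the conjugate LqL_{q} space.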

In conclusion, the γ\gamma-cosine similarity and its generalized form, the (β,γ)(\beta,\gamma)-cosine measure, represent significant advancements in the field of statistical mathematics, particularly in the analysis of high-dimensional data and functional data analysis. These measures offer a more flexible and robust way to compare functions or vectors in various spaces, which is crucial for many applications in statistics and machine learning.

4.5 Concluding remarks

The concepts introduced in this chapter, particularly the GM divergence, γ\gamma-divergence, and γ\gamma-cosine similarity, offer promising avenues for advancing machine learning techniques, especially in high-dimensional settings. However, several areas warrant further exploration to fully understand and leverage these methodologies.

While the computational advantages of the GM divergence and γ\gamma-cosine similarity are demonstrated through simulations, real-world applications in domains such as bioinformatics, natural language processing, and image analysis could benefit from a deeper investigation. The scalability of these methods in extremely high-dimensional datasets, particularly those encountered in genomics or deep learning models, remains an open question. Future research should focus on implementing these methods in large-scale machine learning pipelines to assess their performance and robustness compared to traditional methods. This could include exploring parallel computing strategies or GPU acceleration to handle the increased computational demands in practical applications.

The chapter primarily discusses the GM divergence and γ\gamma-divergence, but the potential to extend these ideas to other divergence measures, such as Jensen-Shannon divergence or Renyi divergence, could be fruitful. Investigating how these alternative measures interact with the GM estimator or can be integrated into ensemble learning frameworks like AdaBoost might yield novel insights and improved algorithms. Moreover, a systematic comparison of these divergence measures across different machine learning tasks could provide clarity on their relative strengths and weaknesses.

While the γ\gamma-cosine similarity provides a novel way to compare vectors in function spaces, its theoretical underpinnings require further formalization. For instance, exploring its properties in different types of function spaces, such as Sobolev spaces or Besov spaces, might reveal new insights into its behavior and applications. Additionally, the interpretability of the γ\gamma-cosine similarity in practical settings is a key aspect that should be addressed. How does this measure correlate with traditional metrics used in machine learning, such as accuracy, precision, and recall? Can it be used to enhance the interpretability of models, particularly in domains requiring high levels of transparency, such as healthcare or finance?

The methods discussed in this chapter are largely grounded in parametric models, particularly in the context of Boltzmann machines and AdaBoost. However, extending these divergence-based methods to non-parametric or semi-parametric models could open up new applications, particularly in statistical machine learning. For example, exploring the use of GM divergence in the context of kernel methods, Gaussian processes, or non-parametric Bayesian models could provide new avenues for research. Similarly, semi-parametric approaches that combine the flexibility of non-parametric methods with the interpretability of parametric models could benefit from the computational advantages of the GM estimator.

To solidify the practical utility of the proposed methods, extensive empirical validation across a variety of datasets and machine learning tasks is essential. This includes benchmarking against state-of-the-art algorithms to evaluate performance in terms of accuracy, computational efficiency, and robustness. Establishing a comprehensive suite of benchmarks, possibly in collaboration with the broader research community, could facilitate the adoption of these methods. Such benchmarks should include both synthetic datasets, to explore the behavior of these methods under controlled conditions, and real-world datasets, to demonstrate their applicability in practical scenarios.

The introduction of the γ\gamma and β\beta parameters in the γ\gamma-cosine and (β,γ)(\beta,\gamma)-cosine measures adds a layer of flexibility, but also complexity. Understanding how sensitive these methods are to the choice of these parameters, and developing guidelines or heuristics for their selection, would be a valuable addition to the methodology. Future work could explore automatic or adaptive methods for tuning these parameters, possibly integrating them with cross-validation techniques or Bayesian optimization to improve the ease of use and performance of the algorithms.

In conclusion, the introduction of the GM divergence, γ\gamma-divergence, and γ\gamma-cosine similarity offers exciting opportunities for advancing machine learning and statistical modeling. However, their full potential will only be realized through continued research and development. By addressing the challenges outlined above, the field can better understand the theoretical implications, enhance practical applications, and ultimately integrate these methods into mainstream machine learning practice.
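As a concrete, purely illustrative example of such data-driven tuning (the heuristic and all helper names below are assumptions, not methods from the text), one can grid-search γ\gamma and keep the value that maximizes leave-one-out nearest-neighbour accuracy under the dissimilarity 1cosγ1-\cos_{\gamma}:

```python
import numpy as np

def gamma_cosine(f, g, gamma):
    # gamma-cosine similarity; gamma = 1 recovers the ordinary cosine
    p = gamma + 1.0
    num = f @ (np.sign(g) * np.abs(g) ** gamma)
    return num / (np.linalg.norm(f, p) * np.linalg.norm(g, p) ** gamma)

def loo_nn_accuracy(X, y, gamma):
    # leave-one-out 1-NN accuracy with dissimilarity 1 - cos_gamma
    n = len(y)
    correct = 0
    for i in range(n):
        d = [1.0 - gamma_cosine(X[i], X[j], gamma) if j != i else np.inf
             for j in range(n)]
        correct += y[int(np.argmin(d))] == y[i]
    return correct / n

rng = np.random.default_rng(1)
# toy two-class data: classes differ only in the first 5 of 50 coordinates
X = rng.standard_normal((60, 50))
y = np.repeat([0, 1], 30)
X[y == 1, :5] += 2.0

grid = [0.5, 1.0, 1.5, 2.0]
scores = {g: loo_nn_accuracy(X, y, g) for g in grid}
best = max(scores, key=scores.get)
print("selected gamma:", best)
```

More principled alternatives mentioned above, such as cross-validation over a downstream loss or Bayesian optimization, would replace the leave-one-out criterion here.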

Acknowledgements

I also would like to acknowledge the assistance provided by ChatGPT, an AI language model developed by OpenAI. Its ability to answer questions, provide suggestions, and assist in the drafting process has been a remarkable aid in organizing and refining the content of this book. While any errors or omissions are my own, the contributions of ChatGPT have certainly made the writing process more efficient and enjoyable.

Bibliography

  • [1] Shun-Ichi Amari. Differential geometry of curved exponential families-curvatures and information loss. The Annals of Statistics, 10(2):357–385, 1982.
  • [2] Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
  • [3] Albert E Beaton and John W Tukey. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16(2):147–185, 1974.
  • [4] Raoul Bott, Loring W Tu, et al. Differential forms in algebraic topology, volume 82. Springer, 1982.
  • [5] Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
  • [6] Jacob Burbea and C Rao. On the convexity of some divergence measures based on entropy functions. IEEE Transactions on Information Theory, 28(3):489–495, 1982.
  • [7] George Casella and Roger Berger. Statistical inference. CRC Press, 2024.
  • [8] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, pages 493–507, 1952.
  • [9] Andrzej Cichocki and Shun-ichi Amari. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
  • [10] David A Cohn, Zoubin Ghahramani, and Michael I Jordan. Active learning with statistical models. Journal of artificial intelligence research, 4:129–145, 1996.
  • [11] John B Conway. A course in functional analysis, volume 96. Springer, 2019.
  • [12] John Copas. Binary regression models for contaminated data. Journal of the Royal Statistical Society: Series B., 50:225–265, 1988.
  • [13] John Copas and Shinto Eguchi. Local model uncertainty and incomplete-data bias (with discussion). Journal of the Royal Statistical Society: Series B., 67:459–513, 2005.
  • [14] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273–297, 1995.
  • [15] David R Cox. Some problems connected with statistical inference. Annals of Mathematical Statistics, 29(2):357–372, 1958.
  • [16] David Roxbee Cox and David Victor Hinkley. Theoretical statistics. CRC Press, 1979.
  • [17] Imre Csiszár, Paul C Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.
  • [18] Bradley Efron. Defining the curvature of a statistical problem (with applications to second order efficiency). The Annals of Statistics, pages 1189–1242, 1975.
  • [19] Shinto Eguchi. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima mathematical journal, 15(2):341–391, 1985.
  • [20] Shinto Eguchi. Geometry of minimum contrast. Hiroshima Mathematical Journal, 22(3):631–647, 1992.
  • [21] Shinto Eguchi. Information geometry and statistical pattern recognition. Sugaku Expositions, 19:197–216, 2006.
  • [22] Shinto Eguchi. Information Divergence Geometry and the Application to Statistical Machine Learning, pages 309–332. Springer US, Boston, MA, 2009.
  • [23] Shinto Eguchi. Minimum information divergence of q-functions for dynamic treatment resumes. Information Geometry, 7(Suppl 1):229–249, 2024.
  • [24] Shinto Eguchi and John Copas. A class of logistic-type discriminant functions. Biometrika, 89:1–22, 2002.
  • [25] Shinto Eguchi and John Copas. Interpreting kullback–leibler divergence with the neyman–pearson lemma. Journal of Multivariate Analysis, 97(9):2034–2040, 2006.
  • [26] Shinto Eguchi and Osamu Komori. Minimum divergence methods in statistical machine learning. Springer, Tokyo, 2022.
  • [27] Shinto Eguchi, Osamu Komori, and Shogo Kato. Projective power entropy and maximum tsallis entropy distributions. Entropy, 13(10):1746–1764, 2011.
  • [28] Shinto Eguchi, Osamu Komori, and Atsumi Ohara. Duality of maximum entropy and minimum divergence. Entropy, 16(7):3552–3572, 2014.
  • [29] Jane Elith and John R Leathwick. Species distribution models: ecological explanation and prediction across space and time. Annual review of ecology, evolution, and systematics, 40(1):677–697, 2009.
  • [30] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • [31] Jerome H Friedman. On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data mining and knowledge discovery, 1:55–77, 1997.
  • [32] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
  • [33] Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99:2053–2081, 2008.
  • [34] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359–378, 2007.
  • [35] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
  • [36] Peter D Grünwald and A Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust bayesian decision theory. 2004.
  • [37] Antoine Guisan and Wilfried Thuiller. Predicting species distribution: offering more than simple habitat models. Ecology letters, 8(9):993–1009, 2005.
  • [38] Trevor Hastie, Saharon Rosset, Ji Zhu, and Hui Zou. Multi-class adaboost. Statistics and its Interface, 2(3):349–360, 2009.
  • [39] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  • [40] Kenichi Hayashi and Shinto Eguchi. A new integrated discrimination improvement index via odds. Statistical Papers, pages 1–20, 2024.
  • [41] Hideitsu Hino and Shinto Eguchi. Active learning by query by committee with robust divergences. Information Geometry, 6(1):81–106, 2023.
  • [42] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
  • [43] Geoffrey E Hinton. A practical guide to training restricted boltzmann machines. In Neural Networks: Tricks of the Trade: Second Edition, pages 599–619. Springer, 2012.
  • [44] Geoffrey E Hinton, Terrence J Sejnowski, et al. Learning and relearning in boltzmann machines. Parallel distributed processing: Explorations in the microstructure of cognition, 1(282-317):2, 1986.
  • [45] Hung Gia Hoang, Ba-Ngu Vo, Ba-Tuong Vo, and Ronald Mahler. The cauchy–schwarz divergence for poisson point processes. IEEE Transactions on Information Theory, 61(8):4475–4485, 2015.
  • [46] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied logistic regression. John Wiley & Sons, 2013.
  • [47] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution, pages 492–518. Springer, 1992.
  • [48] Peter J Huber and Elvezio M Ronchetti. Robust statistics. John Wiley & Sons, 2011.
  • [49] Hung Hung, Zhi-Yu Jou, and Su-Yun Huang. Robust mislabel logistic regression without modeling mislabel probabilities. Biometrics, 74(1):145–154, 2018.
  • [50] Jack Jewson, Jim Q Smith, and Chris Holmes. Principles of bayesian inference using general divergence criteria. Entropy, 20(6):442, 2018.
  • [51] Bent Jørgensen. Exponential dispersion models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 49(2):127–145, 1987.
  • [52] Giorgio Kaniadakis. Non-linear kinetics underlying generalized statistics. Physica A: Statistical mechanics and its applications, 296(3-4):405–425, 2001.
  • [53] Osamu Komori and Shinto Eguchi. Statistical Methods for Imbalanced Data in Ecological and Biological Studies. Springer, Tokyo, 2019.
  • [54] Osamu Komori and Shinto Eguchi. A unified formulation of k-means, fuzzy c-means and gaussian mixture model by the kolmogorov-nagumo average. Entropy, 23:518, 2021.
  • [55] Osamu Komori, Shinto Eguchi, Shiro Ikeda, Hiroshi Okamura, Momoko Ichinokawa, and Shinichiro Nakayama. An asymmetric logistic regression model for ecological data. Methods in Ecology and Evolution, 7(2):249–260, 2016.
  • [56] Osamu Komori, Shinto Eguchi, Yusuke Saigusa, Buntarou Kusumoto, and Yasuhiro Kubota. Sampling bias correction in species distribution models by quasi-linear poisson point process. Ecological Informatics, 55:1–11, 2020.
  • [57] Osamu Komori, Yusuke Saigusa, and Shinto Eguchi. Statistical learning for species distribution models in ecological studies. Japanese Journal of Statistics and Data Science, 6(2):803–826, 2023.
  • [58] Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
  • [59] David G Luenberger. Optimization by vector space methods. John Wiley & Sons, 1997.
  • [60] Kumar P Mainali, Dan L Warren, Kunjithapatham Dhileepan, Andrew McConnachie, Lorraine Strathie, Gul Hassan, Debendra Karki, Bharat B Shrestha, and Camille Parmesan. Projecting future expansion of invasive species: comparing and improving methodologies for species distribution modeling. Global change biology, 21(12):4464–4480, 2015.
  • [61] Henry B Mann and Abraham Wald. On the statistical treatment of linear stochastic difference equations. Econometrica, Journal of the Econometric Society, pages 173–220, 1943.
  • [62] P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman & Hall, New York, 1989.
  • [63] Cory Merow, Adam M Wilson, and Walter Jetz. Integrating occurrence data and expert maps for improved species range predictions. Global Ecology and Biogeography, 26(2):243–258, 2017.
  • [64] Hanna Meyer and Edzer Pebesma. Predicting into unknown space? estimating the area of applicability of spatial prediction models. Methods in Ecology and Evolution, 12(9):1620–1633, 2021.
  • [65] Mihoko Minami and Shinto Eguchi. Robust blind source separation by beta divergence. Neural Computation, 14:1859–1886, 2002.
  • [66] Md Nurul Haque Mollah, Shinto Eguchi, and Mihoko Minami. Robust prewhitening for ica by minimizing β\beta-divergence and its application to fastica. Neural Processing Letters, 25:91–110, 2007.
  • [67] Md Nurul Haque Mollah, Mihoko Minami, and Shinto Eguchi. Exploring latent structure of mixture ICA models by the minimum beta-divergence method. Neural Computation, 18:166–190, 2006.
  • [68] Victoria Diane Monette. Ecological factors associated with habitat use of baird’s tapirs (tapirus bairdii). 2019.
  • [69] Noboru Murata, Takashi Takenouchi, Takafumi Kanamori, and Shinto Eguchi. Information geometry of U{U}-boost and Bregman divergence. Neural Computation, 16:1437–1481, 2004.
  • [70] Kanta Naito and Shinto Eguchi. Density estimation with minimization of U{U}-divergence. Machine Learning, 90:29–57, 2013.
  • [71] Tan Nguyen and Scott Sanner. Algorithms for direct 0–1 loss optimization in binary classification. In International conference on machine learning, pages 1085–1093. PMLR, 2013.
  • [72] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  • [73] Frank Nielsen. On geodesic triangles with right angles in a dually flat space. In Progress in Information Geometry: Theory and Applications, pages 153–190. Springer, 2021.
  • [74] Akifumi Notsu and Shinto Eguchi. Robust clustering method in the presence of scattered observations. Neural Computation, 28:1141–1162, 2016.
  • [75] Akifumi Notsu, Osamu Komori, and Shinto Eguchi. Spontaneous clustering via minimum gamma-divergence. Neural computation, 26(2):421–448, 2014.
  • [76] Katsuhiro Omae, Osamu Komori, and Shinto Eguchi. Quasi-linear score for capturing heterogeneous structure in biomarkers. BMC Bioinformatics, 18:308, 2017.
  • [77] Steven J Phillips, Miroslav Dudík, and Robert E Schapire. A maximum entropy approach to species distribution modeling. In Proceedings of the twenty-first international conference on Machine learning, page 83, 2004.
  • [78] Giovanni Pistone. κ\kappa-exponential models from the geometrical viewpoint. The European Physical Journal B, 70:29–37, 2009.
  • [79] C Radhakrishna Rao. Differential metrics in probability spaces. Differential geometry in statistical inference, 10:217–240, 1987.
  • [80] Mark D Reid and Robert C Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12(3), 2011.
  • [81] Ian W Renner, Jane Elith, Adrian Baddeley, William Fithian, Trevor Hastie, Steven J Phillips, Gordana Popovic, and David I Warton. Point process models for presence-only analysis. Methods in Ecology and Evolution, 6(4):366–379, 2015.
  • [82] Ian W Renner and David I Warton. Equivalence of maxent and poisson point process models for species distribution modeling in ecology. Biometrics, 69(1):274–281, 2013.
  • [83] Peter J Rousseeuw and Annick M Leroy. Robust regression and outlier detection. John wiley & sons, 2005.
  • [84] Yusuke Saigusa, Shinto Eguchi, and Osamu Komori. Robust minimum divergence estimation in a spatial poisson point process. Ecological Informatics, 81:102569, 2024.
  • [85] Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms. Kybernetes, 42(1):164–166, 2013.
  • [86] Burr Settles. Active learning literature survey. 2009.
  • [87] H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory, pages 287–294, 1992.
  • [88] Helen R Sofaer, Catherine S Jarnevich, Ian S Pearse, Regan L Smyth, Stephanie Auer, Gericke L Cook, Thomas C Edwards Jr, Gerald F Guala, Timothy G Howard, Jeffrey T Morisette, et al. Development and delivery of species distribution models to inform decision-making. BioScience, 69(7):544–557, 2019.
  • [89] Roy L Streit. The poisson point process. Springer, 2010.
  • [90] Takashi Takenouchi and Shinto Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation, 16:767–787, 2004.
  • [91] Takashi Takenouchi, Osamu Komori, and Shinto Eguchi. Extension of receiver operating characteristic curve and auc-optimal classification. Neural Computation, 24:2789–2824, 2012.
  • [92] Takashi Takenouchi, Shinto Eguchi, Noboru Murata, and Takafumi Kanamori. Robust boosting algorithm against mislabeling in multiclass problems. Neural Computation, 20:1596–1630, 2008.
  • [93] Marina Valdora and Víctor J Yohai. Robust estimators for generalized linear models. Journal of Statistical Planning and Inference, 146:31–48, 2014.
  • [94] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.
  • [95] Halbert White. Maximum likelihood estimation of misspecified models. Econometrica: Journal of the econometric society, pages 1–25, 1982.
  • [96] Christopher KI Williams. Prediction with gaussian processes: From linear regression to linear prediction and beyond. In Learning in graphical models, pages 599–621. Springer, 1998.
  • [97] Katherine L Yates, Phil J Bouchet, M Julian Caley, Kerrie Mengersen, Christophe F Randin, Stephen Parnell, Alan H Fielding, Andrew J Bamford, Stephen Ban, A Márcia Barbosa, et al. Outstanding challenges in the transferability of ecological models. Trends in ecology & evolution, 33(10):790–802, 2018.
  • [98] Jun Zhang. Divergence function, duality, and convex analysis. Neural computation, 16(1):159–195, 2004.
  • [99] Huimin Zhao, Jie Liu, Huayue Chen, Jie Chen, Yang Li, Junjie Xu, and Wu Deng. Intelligent diagnosis using continuous wavelet transform and gauss convolutional deep belief network. IEEE Transactions on Reliability, 72(2):692–702, 2022.