
Asymptotic Behavior of Bayesian Generalization Error in Multinomial Mixtures

Takumi Watanabe watanabe.t.bv@m.titech.ac.jp Department of Mathematical and Computing Science, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8552, Japan
Sumio Watanabe swatanab@c.titech.ac.jp Department of Mathematical and Computing Science, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8552, Japan
Abstract

Multinomial mixtures are widely used in information engineering; however, their mathematical properties have not yet been clarified because they are singular learning models. In fact, these models are non-identifiable and their Fisher information matrices are not positive definite. In recent years, the mathematical foundation of singular statistical models has been clarified by algebraic geometric methods. In this paper, we determine the real log canonical thresholds and multiplicities of multinomial mixtures and derive the asymptotic behaviors of their generalization error and free energy.
Keywords: generalization error, free energy, multinomial mixtures, real log canonical thresholds

1 Introduction

A finite mixture model is a probability distribution defined by a linear superposition of a finite number of distributions. Examples such as normal mixtures, Poisson mixtures, and multinomial mixtures have been applied in many research areas. In this paper, we mainly study the multinomial mixture, which provides a richer class of statistical models than the single multinomial distribution. The multinomial mixture has been applied to document clustering [1], anomaly detection in medical data [2], and image clustering [3]. In spite of this wide range of applications, the mathematical properties of its generalization performance have not yet been clarified. One of the mathematical difficulties is caused by the fact that these models are not identifiable. If the map from the set of parameters to the set of probability distributions is one-to-one, a statistical model is called identifiable; otherwise, it is called non-identifiable [4]. Such models are classified as singular models. If a probability model is singular, the posterior distribution cannot be approximated by any normal distribution, and classical information criteria for regular statistical models such as AIC [5], BIC [6], or DIC [7] cannot be applied to estimate the generalization losses of singular models.

Recently, in order to establish the mathematical foundation of Bayesian inference for singular models, Watanabe derived the asymptotic behaviors of the generalization error $G_n$ and the free energy $F_n$ [8]. There exist a positive real number $\lambda$ and a natural number $m$ such that the asymptotic behaviors of $F_n$ and $G_n$ are respectively given by

\mathbb{E}[F_n] = nS + \lambda\log n - (m-1)\log\log n + O(1),  (1.1)
\mathbb{E}[G_n] = \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right),  (1.2)

where $\lambda$ is called a real log canonical threshold (RLCT), $m$ is called a multiplicity, and $\mathbb{E}[\cdot]$ denotes the expectation value over all datasets. If a learning model is identifiable and regular, $\lambda = d/2$ and $m = 1$, where $d$ is the dimension of the parameter space [6]. If it is singular, $\lambda$ and $m$ depend on the true distribution, the model, and the prior. In the singular case, it was shown by [9] that both the RLCT and the multiplicity can be found by using the desingularization theorem of algebraic geometry. In general, the parameter set $K(w)=0$ contains complicated singularities, so it is difficult to find the resolution map; nevertheless, RLCTs and multiplicities have been clarified for several statistical models and learning machines. Examples of models whose RLCTs have been found include normal mixtures [10], Poisson mixtures [11], Bernoulli mixtures [12], rank regression [10], Latent Dirichlet Allocation (LDA) [13], and so on. In addition, RLCTs have been used to analyze the exchange rate of the replica exchange method [14], which is one of the Markov chain Monte Carlo methods. Moreover, in recent years, the information criterion $\mathrm{sBIC}$, which uses RLCTs in its calculation, has also been proposed [15].
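As a small numerical illustration (a sketch of our own, not part of the original analysis), the leading terms of (1.1) and (1.2) can be evaluated for given $\lambda$, $m$, $n$, and $S$; the concrete values below are hypothetical.

```python
import numpy as np

def asymptotic_free_energy(n, S, lam, m):
    # Leading terms of E[F_n], Eq. (1.1), dropping the O(1) remainder
    return n * S + lam * np.log(n) - (m - 1) * np.log(np.log(n))

def asymptotic_gen_error(n, lam, m):
    # Leading terms of E[G_n], Eq. (1.2), dropping the o(1/(n log n)) remainder
    return lam / n - (m - 1) / (n * np.log(n))

for n in [100, 1000, 10000]:
    print(n, asymptotic_gen_error(n, lam=1.5, m=2))
```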

In this paper, we clarify the RLCT of multinomial mixtures and derive the asymptotic behaviors of the generalization error and the free energy. Our analysis also shows the effect of the hyperparameter when a Dirichlet distribution is employed as the prior. We begin in Section 2 with an introduction to the framework of Bayesian inference. In Section 3 we explain multinomial mixtures, and in Section 4 we introduce previous studies on the RLCTs and multiplicities of multinomial mixtures. In Section 5 we state the main theorem of this paper. In Section 6 we prove the theorem. In Section 7 we discuss the phase transition due to the hyperparameters, and in Section 8 we conclude this paper.

2 Bayes estimation

In this section, we introduce the framework of Bayesian inference. Let $q(x)$ be a true probability distribution and let $X^n=(X_1,\dots,X_n)$ be a set of training data generated from $q(x)$ independently and identically. Let $p(x|w)$ be a probability model, where $w\in W\subset\mathbb{R}^d$ is a parameter and $W$ is the parameter space. The prior probability distribution $\varphi(w)$ is a function on $W$. The posterior distribution $p(w|X^n)$ is defined by

p(w|X^n) = \frac{1}{Z_n}\varphi(w)\prod_{i=1}^{n}p(X_i|w),  (2.1)

where $Z_n$ is the normalizing constant:

Z_n = \int\varphi(w)\prod_{i=1}^{n}p(X_i|w)\,\mathrm{d}w.  (2.2)

The constant $Z_n$ is called the marginal likelihood. The free energy $F_n$ is defined as the negative log marginal likelihood:

F_n = -\log Z_n.  (2.3)

The predictive distribution $p(x|X^n)$ is given by

p(x|X^n) = \int p(x|w)\,p(w|X^n)\,\mathrm{d}w.  (2.4)

The generalization error $G_n$ is the Kullback-Leibler divergence from the true distribution $q(x)$ to the predictive distribution $p(x|X^n)$:

G_n = \int q(x)\log\frac{q(x)}{p(x|X^n)}\,\mathrm{d}x.  (2.5)

The generalization error measures how different the predictive distribution $p(x|X^n)$ is from the true distribution $q(x)$.
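For discrete distributions such as those treated in this paper, the integral in (2.5) is a finite sum. A minimal sketch (our own, with hypothetical distributions):

```python
import numpy as np

def kl_divergence(q, p):
    # sum_x q(x) log(q(x)/p(x)), the discrete form of Eq. (2.5);
    # assumes p(x) > 0 wherever q(x) > 0
    q, p = np.asarray(q), np.asarray(p)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q_true = [0.5, 0.3, 0.2]      # hypothetical true distribution
p_pred = [0.45, 0.35, 0.20]   # hypothetical predictive distribution
print(kl_divergence(q_true, p_pred))
```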

For an arbitrary function $f: x^n\mapsto f(x^n)$, the expectation value of $f(X^n)$ over all sets of training samples is denoted by $\mathbb{E}[\cdot]$, that is,

\mathbb{E}[f(X^n)] = \int\dots\int f(x_1,\dots,x_n)\prod_{i=1}^{n}q(x_i)\,\mathrm{d}x_i.  (2.6)

Let the mean error function $K(w)$ be the Kullback-Leibler divergence from the true distribution to the probability model:

K(w) = \int q(x)\log\frac{q(x)}{p(x|w)}\,\mathrm{d}x.  (2.7)

The entropy $S$ of the true distribution and the empirical entropy $S_n$ are defined respectively by

S = -\int q(x)\log q(x)\,\mathrm{d}x,  (2.8)
S_n = -\frac{1}{n}\sum_{i=1}^{n}\log q(X_i).  (2.9)

It is known that the following relationship holds between the free energy and the generalization error [16]:

\mathbb{E}[G_n] = \mathbb{E}[F_{n+1}] - \mathbb{E}[F_n] - S.  (2.10)

The relation (2.10) is important because in most cases we do not know the true distribution $q(x)$, whereas the free energy can be calculated from the prior $\varphi(w)$, the probability model $p(x|w)$, and a sample $X^n$.
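As a sanity check of this framework on a regular model (a sketch of our own using a conjugate Bernoulli model, not the multinomial mixture studied in this paper), $\mathbb{E}[G_n]$ can be computed exactly by averaging the KL divergence (2.5) over the binomial distribution of the sufficient statistic, and one observes $\mathbb{E}[G_n]\approx\lambda/n$ with $\lambda=d/2=1/2$:

```python
import numpy as np
from scipy.stats import binom

q, a0, b0 = 0.3, 1.0, 1.0   # true Bernoulli parameter, Beta(a0, b0) prior

def expected_gen_error(n):
    k = np.arange(n + 1)
    w = binom.pmf(k, n, q)            # distribution of the count k = sum_i X_i
    p_hat = (a0 + k) / (a0 + b0 + n)  # Bayes predictive P(x = 1 | X^n)
    kl = q * np.log(q / p_hat) + (1 - q) * np.log((1 - q) / (1 - p_hat))
    return float(np.sum(w * kl))

for n in [100, 1000, 10000]:
    print(n, expected_gen_error(n), 0.5 / n)   # E[G_n] vs lambda/n with lambda = 1/2
```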

Let $\mathrm{Re}(z)$ be the real part of a complex number $z$. Define the zeta function of statistical learning theory as

\zeta(z) = \int K(w)^{z}\varphi(w)\,\mathrm{d}w.  (2.11)

If $K(w)\geq 0$ is an analytic function of $w$, then the function $\zeta(z)$ is holomorphic in the region $\mathrm{Re}(z)>0$, and it can be analytically continued to a unique meromorphic function on the entire complex plane. Moreover, it is known that all of its poles are negative real numbers.

In the following, assume that the mean error function $K(w)$ is analytic and that the true distribution is feasible with the probability model. Here, the true distribution $q(x)$ is said to be feasible with the probability model $p(x|w)$ if there is a parameter $w^*\in W$ such that $q(x)=p(x|w^*)$ holds for all $x$. Assume that the maximum pole of the zeta function $\zeta(z)$ is $-\lambda$ and its order is $m$. By applying Hironaka's resolution theorem of algebraic geometry, the asymptotic behaviors of the free energy and the generalization error can be expressed as follows [8][17]:

\mathbb{E}[F_n] = nS + \lambda\log n - (m-1)\log\log n + O(1),  (2.12)
\mathbb{E}[G_n] = \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right).  (2.13)

The constants $\lambda$ and $m$ are called the real log canonical threshold (RLCT) and the multiplicity, respectively.
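As a minimal worked example (our own, for orientation), take the one-dimensional case $K(w)=w^2$ on $W=[0,1]$ with the uniform prior $\varphi(w)=1$. Then

\zeta(z) = \int_0^1 w^{2z}\,\mathrm{d}w = \frac{1}{2z+1},

which has a single pole at $z=-1/2$ of order $1$, so $\lambda=1/2$ and $m=1$, consistent with the regular case $\lambda=d/2$ for $d=1$.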

3 Multinomial Mixtures

3.1 Multinomial Distribution

Let $\mathbb{Z}_{\geq 0}$ be the set of all non-negative integers and $\mathbb{R}_{\geq 0}$ the set of all non-negative real numbers. Let $L$ and $M$ be natural numbers greater than or equal to two, and define the set $D$ by

D = \left\{x=(x_1,\dots,x_L)\in\left(\mathbb{Z}_{\geq 0}\right)^L : \sum_{\ell=1}^{L}x_\ell = M\right\}.  (3.1)

The parameter vectors $b\in\mathbb{R}^L$ belong to the set $B$:

B = \left\{b=(b_1,\dots,b_L)\in\left(\mathbb{R}_{\geq 0}\right)^L : \sum_{\ell=1}^{L}b_\ell = 1,\ 0\leq b_\ell\leq 1\right\}.  (3.2)

The probability distribution of $x\in D$ determined by the vector $b$,

\mathrm{Mul}_L(x|b) = \frac{M!}{\prod_{\ell=1}^{L}x_\ell!}\prod_{\ell=1}^{L}(b_\ell)^{x_\ell},

is called the multinomial distribution. Here we define $0^0=1$ and $0!=1$. The constant $M$ represents the number of independent trials of the multinomial distribution, and the parameter $b=(b_1,\dots,b_L)$ represents the corresponding probabilities. The multinomial distribution is a generalization of several discrete distributions. If $M=1$ and $L=2$, the multinomial distribution is called the Bernoulli distribution. If $M=1$ and $L\geq 2$, it is called the categorical distribution. If $M\geq 2$ and $L=2$, it is called the binomial distribution.
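A quick numerical sketch of this definition (our own; the counts and probabilities are hypothetical):

```python
from scipy.stats import multinomial

M, b = 5, [0.2, 0.3, 0.5]   # M trials, L = 3 categories
x = [1, 2, 2]               # counts with sum(x) == M
print(multinomial.pmf(x, n=M, p=b))   # Mul_L(x|b)
# Special cases: M=1, L=2 is Bernoulli; M=1, L>=2 is categorical; M>=2, L=2 is binomial.
```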

3.2 Multinomial Mixtures

Let $H$ be a natural number greater than or equal to two. The parameter set $W$ is defined by

W = \left\{(a,b) : \sum_{h=1}^{H}a_h = 1,\ 0\leq a_h\leq 1,\ b_h=(b_{h1},\dots,b_{hL})\in B\ (\forall h\in[H])\right\},  (3.3)

where $[H]$ denotes the set $\{h\in\mathbb{Z} : 1\leq h\leq H\}$.

The probability distribution on $x\in D$ determined by the parameter $w=(a,b)\in W$,

p(x|w) = \sum_{h=1}^{H}a_h\,\mathrm{Mul}_L(x|b_h),  (3.4)

is called a multinomial mixture. Here $H$ represents the number of components. The $H$-dimensional parameter $a=(a_1,\dots,a_H)$ represents the mixing ratio; each $a_h$ is assumed to be non-negative with $\sum_{h=1}^{H}a_h=1$, so $a_h$ represents the weight of the $h$-th component distribution. A higher mixing ratio $a_h$ means a stronger effect of the $h$-th component.
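Definition (3.4) translates directly into code (a sketch of our own; the mixing ratios and component parameters below are hypothetical):

```python
from scipy.stats import multinomial

def mixture_pmf(x, a, b, M):
    # p(x|w) = sum_h a_h Mul_L(x|b_h), Eq. (3.4)
    return sum(a_h * multinomial.pmf(x, n=M, p=b_h) for a_h, b_h in zip(a, b))

a = [0.4, 0.6]                           # mixing ratio, H = 2 components
b = [[0.1, 0.4, 0.5], [0.6, 0.2, 0.2]]   # component parameters, L = 3
print(mixture_pmf([1, 1, 1], a, b, M=3))
```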

4 Previous Studies

In this section, we introduce several previous studies on the real log canonical thresholds of multinomial mixtures. When the probability model is a binomial mixture, an upper bound on the RLCT for a general number of components and its exact value in special cases have been clarified [12].

Theorem 4.1 (the RLCT and multiplicity of binomial mixtures [12]).

Let $x=(y_1,\dots,y_M)\in\{0,1\}^M$ be an $M$-dimensional binary vector and let $p(x|w)$ be a binomial mixture,

p(x|w) = \sum_{h=1}^{H}a_h\prod_{m=1}^{M}p_{hm}^{y_m}(1-p_{hm})^{1-y_m}.  (4.1)

It is assumed that the true distribution $q(x)$ is a binomial mixture,

q(x) = \sum_{h=1}^{H_0}a^*_h\prod_{m=1}^{M}(p^*_{hm})^{y_m}(1-p^*_{hm})^{1-y_m}.  (4.2)

Let the prior distribution $\varphi(w)$ be

\varphi(w;\eta) = \varphi_0(a;\alpha)\prod_{h=1}^{H}\prod_{m=1}^{M}\varphi_1(p_{hm};\beta),  (4.3)

where $\eta=\{\alpha,\beta\}$ is a set of hyperparameters and $\varphi_0(a;\alpha)$ is the prior distribution of the mixing ratio $a$ with hyperparameter $\alpha>0$ (a Dirichlet distribution),

\varphi_0(a;\alpha) = \frac{\Gamma(H\alpha)}{\Gamma(\alpha)^H}\left(\prod_{h=1}^{H-1}a_h^{\alpha-1}\right)\left(1-\sum_{i=1}^{H-1}a_i\right)^{\alpha-1},  (4.4)

and $\varphi_1(p_{hm};\beta)$ ($\beta>0$) is a beta distribution for each $h\in[H]$, $m\in[M]$,

\varphi_1(p_{hm};\beta) = \frac{\Gamma(2\beta)}{\Gamma(\beta)^2}p_{hm}^{\beta-1}(1-p_{hm})^{\beta-1}.  (4.5)

A component is called deterministic when $p^*_{hm}$ is one or zero, and probabilistic otherwise. Let $H_1$ and $H_2$ be the numbers of probabilistic and deterministic components, respectively, where $H=H_1+H_2$. Under the above conditions, the asymptotic behavior of the free energy $F_n$ is expressed as follows:

F_n \leq nS + \mu\log n - (m_\mu-1)\log\log n + o(\log\log n),  (4.6)

where $S$ is the entropy of the true distribution, and $\mu$ and $m_\mu$ are defined as follows. For $M\geq 3$,

\mu = \frac{H_0-1+H_1M+H_2M\beta}{2} + \frac{H-H_0}{2}\min\left\{\alpha,\ \frac{M}{2},\ \frac{\beta M}{2}\right\},  (4.7)
m_\mu = \begin{cases}2 & (\alpha=\min\{M/2,\ \beta M/2\})\\ 1 & (\mathrm{otherwise})\end{cases}.  (4.8)

For $M=2$,

\mu = \frac{H_0-1+H_1M+H_2M\beta}{2} + \frac{H-H_0}{2}\min\{\alpha,\ 1,\ \beta\},  (4.9)
m_\mu = \begin{cases}3 & (\alpha=\min\{1,\ \beta\})\\ 2 & (\alpha>\min\{1,\ \beta\})\\ 1 & (\mathrm{otherwise})\end{cases}.  (4.10)

Furthermore, if $H=H_0+1$, that is, when the number of components in the probability model is one greater than that in the true distribution, $\mu$ equals the exact value of the RLCT $\lambda$, and $m_\mu$ also equals the multiplicity $m$.
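Equations (4.7)-(4.10) are straightforward to transcribe (a sketch of our own; the parameter values in the final call are hypothetical):

```python
def free_energy_coefficients(H, H0, H1, H2, M, alpha, beta):
    # mu and m_mu from Theorem 4.1, Eqs. (4.7)-(4.10)
    if M >= 3:
        thresh = min(M / 2, beta * M / 2)
        m_mu = 2 if alpha == thresh else 1
    else:  # M == 2
        thresh = min(1, beta)
        m_mu = 3 if alpha == thresh else (2 if alpha > thresh else 1)
    mu = (H0 - 1 + H1 * M + H2 * M * beta) / 2 + (H - H0) / 2 * min(alpha, thresh)
    return mu, m_mu

print(free_energy_coefficients(H=3, H0=2, H1=2, H2=0, M=4, alpha=1.0, beta=1.0))
```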

Matsuda analyzed the RLCT of trinomial mixtures with two components. The exact value of the RLCT was elucidated when the true distribution is a multinomial distribution and the probability model is a trinomial mixture with two components, that is, in the case $L=3$, $H=2$.

Theorem 4.2 (the RLCT and multiplicity of trinomial mixtures with two components [18]).

Let the probability model $p(x|w)$ be a trinomial mixture with two components:

p(x|w) = a\,\mathrm{Mul}_3(x|b_1) + (1-a)\,\mathrm{Mul}_3(x|b_2),\quad (a,b_1,b_2)\in W.  (4.11)

Here, $\mathrm{Mul}_3(x|b)$ denotes the probability mass function of the trinomial distribution with parameter $b=(b_1,b_2,b_3)$. Also, let the true distribution $q(x)$ be a trinomial distribution:

q(x) = \mathrm{Mul}_3(x|b^*).  (4.12)

Also, assume that the prior distribution $\varphi(w)$ is positive and bounded on the parameter set $W$, and that the true parameter $b^*=(b_1^*,b_2^*,b_3^*)$ satisfies

b_1^*b_2^*b_3^* \neq 0.  (4.13)

Under these conditions, the RLCT is as follows:

\lambda = \frac{3}{2}.  (4.14)

Matsuda clarified the RLCT of a trinomial mixture with two components, that is, the case $H=2$, $H^*=1$, $L=3$, by using an algebraic geometry technique called weighted blow-up.

5 Main Theorem

In this section, we state the main result of this paper, which is a generalization of Theorem 4.2. We clarify the RLCT and the multiplicity of general multinomial mixtures with two components. Furthermore, we also consider the case where the Dirichlet distribution is adopted as the prior distribution of the mixing ratio.

Theorem 5.1 (Main Theorem).

Let the probability model $p(x|w)$ be a multinomial mixture with two components:

p(x|w) = a\,\mathrm{Mul}_L(x|b_1) + (1-a)\,\mathrm{Mul}_L(x|b_2),\quad (a,b_1,b_2)\in W.  (5.1)

Also, let the true distribution be a multinomial distribution:

q(x) = \mathrm{Mul}_L(x|b^*).  (5.2)

Also, assume that the prior distribution of the parameter $b$ is positive and bounded on the set $W$, and that the true parameter $b^*=(b_1^*,\dots,b_L^*)$ satisfies

\prod_{\ell=1}^{L}b_\ell^* \neq 0.  (5.3)

We consider the following two cases for the prior distribution of the mixing ratio $a$:

  1. If the prior distribution $\varphi(a)$ of the mixing ratio $a$ is positive and bounded, the RLCT $\lambda$ and the multiplicity $m$ are given by

    \lambda = \frac{L-1}{2} + \min\left(\frac{1}{2},\ \frac{L-1}{4}\right),  (5.4)
    m = \begin{cases}2 & (L=3)\\ 1 & (\mathrm{otherwise})\end{cases}.  (5.5)

  2. If the prior distribution $\varphi(a)$ of the mixing ratio $a$ is the Dirichlet distribution with $\alpha\ (>0)$ as a hyperparameter:

    \varphi(a;\alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2}a^{\alpha-1}(1-a)^{\alpha-1},  (5.6)

    the RLCT $\lambda$ and the multiplicity $m$ are given by

    \lambda = \frac{L-1}{2} + \min\left(\frac{\alpha}{2},\ \frac{L-1}{4}\right),  (5.7)
    m = \begin{cases}2 & \left(\alpha=\frac{L-1}{2}\right)\\ 1 & (\mathrm{otherwise})\end{cases}.  (5.8)
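The two cases can be evaluated together, since case 1 coincides with case 2 at $\alpha=1$ (a transcription sketch of our own):

```python
def rlct_two_component_mixture(L, alpha=1.0):
    # Theorem 5.1: RLCT lambda and multiplicity m for an L-nomial mixture with
    # two components; alpha=1.0 reproduces case 1 (positive bounded prior).
    lam = (L - 1) / 2 + min(alpha / 2, (L - 1) / 4)
    m = 2 if alpha == (L - 1) / 2 else 1
    return lam, m

for L in [2, 3, 4, 10]:
    print(L, rlct_two_component_mixture(L), rlct_two_component_mixture(L, alpha=0.5))
```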

6 Proof of the Main Theorem

6.1 Properties of the RLCTs and the multiplicities

To prove Theorem 5.1, we introduce notation and explain some properties of the RLCTs and the multiplicities. Since the RLCT $\lambda$ and the multiplicity $m$ are determined by the mean error function $K(w)$ and the prior distribution $\varphi(w)$, they are written as $\lambda(K,\varphi)$, $m(K,\varphi)$, or $\lambda(K)$, $m(K)$, respectively. If the maximum poles and their orders of the two zeta functions $\zeta_1(z)$, $\zeta_2(z)$:

\zeta_1(z) = \int K(w)^{z}\varphi(w)\,\mathrm{d}w,  (6.1)
\zeta_2(z) = \int K'(w)^{z}\varphi(w)\,\mathrm{d}w,  (6.2)

are equal, they are written as

K(w) \sim K'(w),  (6.3)

or $\lambda(K,\varphi)=\lambda(K',\varphi)$, $m(K,\varphi)=m(K',\varphi)$.

The following properties hold for the RLCTs and multiplicities [9][18].

Lemma 6.1.

Let $K(w)$ be the mean error function and let $\varphi(w)$ be the prior distribution.

  1. If there exist a function $K'(w)$ and constants $c,c'>0$ such that

    cK'(w) \leq K(w) \leq c'K'(w)  (6.4)

    holds for any $w\in W$, then $K(w)\sim K'(w)$.

  2. If $w=(w_1,w_2)$, $K(w)=K_1(w_1)+K_2(w_2)$, and $\varphi(w)=\varphi_1(w_1)\varphi_2(w_2)$, the following holds:

    \lambda(K,\varphi) = \lambda(K_1,\varphi_1) + \lambda(K_2,\varphi_2),  (6.5)
    m(K,\varphi) = m(K_1,\varphi_1) + m(K_2,\varphi_2) - 1.  (6.6)

  3. If $w=(w_1,w_2)$, $K(w)=K_1(w_1)K_2(w_2)$, and $\varphi(w)=\varphi_1(w_1)\varphi_2(w_2)$, then

    \lambda(K,\varphi) = \min\bigl(\lambda(K_1,\varphi_1),\ \lambda(K_2,\varphi_2)\bigr),  (6.7)
    m(K,\varphi) = \begin{cases}m(K_1,\varphi_1) & (\lambda(K_1,\varphi_1)<\lambda(K_2,\varphi_2))\\ m(K_1,\varphi_1)+m(K_2,\varphi_2) & (\lambda(K_1,\varphi_1)=\lambda(K_2,\varphi_2))\\ m(K_2,\varphi_2) & (\lambda(K_1,\varphi_1)>\lambda(K_2,\varphi_2))\end{cases}.  (6.8)

  4. Let $I,J$ be natural numbers, and let $\{f_i(w)\}_{i=1}^{I}$, $\{g_j(w)\}_{j=1}^{J}$ be sets of analytic functions. If the ideal generated by $\{f_i(w)\}_{i=1}^{I}$ and the ideal generated by $\{g_j(w)\}_{j=1}^{J}$ are equal and

    K_1(w) = \sum_{i=1}^{I}f_i(w)^2,\quad K_2(w) = \sum_{j=1}^{J}g_j(w)^2,  (6.9)

    then $K_1(w)\sim K_2(w)$.

  5. For any bounded functions $F(w)$, $G(w)$, $H(w)$ on a compact set,

    H(w)^2 + \left(F(w)+H(w)G(w)\right)^2 \sim H(w)^2 + F(w)^2.  (6.10)

  6. Let $K'(w)$ be the following function:

    K'(w) = \sum_{x\in D}\left(p(x|w)-q(x)\right)^2,  (6.11)

    then $K(w)\sim K'(w)$.

6.2 The restriction on the parameter set of general multinomial mixtures

To prove Theorem 5.1, we prepare some lemmas. As mentioned in Section 2, the zeta function $\zeta(z)$ is determined by the prior distribution $\varphi(w)$ and the mean error function $K(w)$, and the mean error function is defined as the KL divergence between the true distribution and the probability model. Thus, the mean error function $K(w)$ is

K(w) = \sum_{x\in D}q(x)\log\frac{q(x)}{p(x|w)}.  (6.12)

However, in the case of multinomial mixtures, some problems arise when considering the mean error function $K(w)$ on the entire parameter set $W$. When the probability model is a multinomial mixture, $p(x|w)=0$ for some $w\in W$ and some $x\in D$. Since the true distribution $q(x)$ is not $0$ by assumption, at the points $w$ such that $p(x|w)=0$ the value $q(x)/p(x|w)$ is not finite and the mean error function $K(w)$ diverges. Thus, the results on the asymptotic behavior of the generalization error in the reference [8] cannot be applied directly. To solve this problem, we prove that even if the original parameter set $W$ is restricted to the set $W_1$, on which $p(x|w)>0$, the asymptotic behavior of the generalization error does not change.

Lemma 6.2 (the restriction on the parameter set).

Let $W$ be the parameter set of multinomial mixtures with $H$ components. Let the probability model be the multinomial mixture with $H$ components:

p(x|w) = \sum_{h=1}^{H}a_h\,\mathrm{Mul}_L(x|b_h)\quad (w\in W).  (6.13)

Let $q(x)$ be a multinomial mixture with $H^*$ components ($H\geq H^*$):

q(x) = \sum_{h=1}^{H^*}a_h^*\,\mathrm{Mul}_L(x|b_h^*).  (6.14)

Here, for any $h=1,\dots,H^*$,

\prod_{\ell=1}^{L}b^*_{h\ell} \neq 0.  (6.15)

Fix a sufficiently small number $0<\varepsilon<1$, let $W_1$ be the subset of $W$ such that $p(x|w)>\varepsilon$ for all $x\in D$, and let $W_2$ be the complement of $W_1$ (i.e., $W_2=W\setminus W_1$). Let $\lambda$ be minus the maximum pole, and $m$ its order, of the zeta function whose integration range is restricted to $W_1$:

\zeta(z) = \int_{W_1}K(w)^{z}\varphi(w)\,\mathrm{d}w.  (6.16)

Then, the asymptotic behavior of the generalization error is expressed by the following equation:

\mathbb{E}[G_n] = \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right).  (6.17)

Lemma 6.2 means that the asymptotic behavior of the generalization error of multinomial mixtures can be analyzed by finding the maximum pole of the zeta function whose integration range is restricted to $W_1$. Lemma 6.2 will be proved in Section 6.2.1.

6.2.1 The proof of Lemma 6.2

We now prove Lemma 6.2. By the definition of the generalization error,

\mathbb{E}[G_n] = \mathbb{E}\left[\log\frac{q(X)}{p(X|X^n)}\right].  (6.18)

Here $X_{n+1}$ is written as $X$. By the definitions of the predictive distribution and the posterior distribution, and by the assumption $q(x)>0$,

\frac{q(X_{n+1})}{p(X_{n+1}|X^n)} = \frac{q(X_{n+1})}{\int p(X_{n+1}|w)p(w|X^n)\,\mathrm{d}w}  (6.19)
= \frac{q(X_{n+1})}{\int\frac{1}{Z_n}\varphi(w)p(X_{n+1}|w)\prod_{i=1}^{n}p(X_i|w)\,\mathrm{d}w}  (6.20)
= \frac{q(X_{n+1})\int\varphi(w)\prod_{i=1}^{n}p(X_i|w)\,\mathrm{d}w}{\int\varphi(w)\prod_{i=1}^{n+1}p(X_i|w)\,\mathrm{d}w}  (6.21)
= \frac{\int\varphi(w)\prod_{i=1}^{n}\frac{p(X_i|w)}{q(X_i)}\,\mathrm{d}w}{\int\varphi(w)\prod_{i=1}^{n+1}\frac{p(X_i|w)}{q(X_i)}\,\mathrm{d}w}  (6.22)
= \frac{Z(X^n)}{Z(X^{n+1})},  (6.23)

where $Z(X^n)$ is the quantity defined by the following equation:

Z(X^n) = \int\varphi(w)\prod_{i=1}^{n}\frac{p(X_i|w)}{q(X_i)}\,\mathrm{d}w.  (6.24)

By Eq. (6.23),

\mathbb{E}[G_n] = \mathbb{E}\left[\log\frac{Z(X^n)}{Z(X^{n+1})}\right].  (6.25)

Here, fix a sufficiently small number $0<\varepsilon<1$, let $W_1$ be the subset of $W$ on which $p(x|w)>\varepsilon$ for all $x\in D$, and let $W_2$ be the rest of $W$. The integral over the parameter $w$ is divided into the two integration domains $W_1$ and $W_2$, and $Z_1(X^n)$ and $Z_2(X^n)$ are defined as follows:

Z_1(X^n) = \int_{W_1}\varphi(w)\prod_{i=1}^{n}\frac{p(X_i|w)}{q(X_i)}\,\mathrm{d}w,  (6.26)
Z_2(X^n) = \int_{W_2}\varphi(w)\prod_{i=1}^{n}\frac{p(X_i|w)}{q(X_i)}\,\mathrm{d}w.  (6.27)

Then, the relation $Z(X^n)=Z_1(X^n)+Z_2(X^n)$ holds, and

\mathbb{E}[G_n] = \mathbb{E}\left[\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right].  (6.28)

Since $p(x|w)>0$ on $W_1$, the result of [9] can be applied; for sufficiently large $n$, the following asymptotic behavior of $Z_1(X^n)$ holds:

Z_1(X^n) = \frac{(\log n)^{m-1}}{n^{\lambda}}Z_0(X^n),  (6.29)

where $Z_0(X^n)$ is a random variable of $X^n$ satisfying

\mathbb{E}\left[\log\frac{Z_0(X^{n+1})}{Z_0(X^n)}\right] = o\!\left(\frac{1}{n\log n}\right).  (6.30)

Here, we introduce the following Lemma.

Lemma 6.3.

There is a random variable $\Theta(X^n)$ that takes the value $1$ only on a certain event determined by $X^n$ and $0$ otherwise, such that the following equations hold:

\mathbb{E}\left[\Theta(X^{n+1})\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right] = O\left(\exp(-n)\right),  (6.31)
\mathbb{E}\left[\left(1-\Theta(X^{n+1})\right)\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right] = \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right).  (6.32)

If we can show Lemma 6.3, the following holds:

\mathbb{E}[G_n] = \mathbb{E}\left[\Theta(X^{n+1})\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right] + \mathbb{E}\left[\left(1-\Theta(X^{n+1})\right)\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right]  (6.33)
= \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right),  (6.34)

and the proof of Lemma 6.2 is completed. To prove Lemma 6.3, we prepare Sections 6.2.2 through 6.2.5.

6.2.2 Sanov’s theorem

Let $L$ be a natural number, and let $\mathcal{P}$ be the set of probability distributions on the finite set $\{1,2,\dots,L\}$:

\mathcal{P} = \left\{(p_1,\dots,p_L)\in\mathbb{R}_{\geq 0}^{L} : \sum_{\ell=1}^{L}p_\ell = 1,\ p_\ell\geq 0\right\}.  (6.35)

Let $q=(q_1,\dots,q_L)\in\mathcal{P}$ be the true distribution, which is fixed. Let $X_1,\dots,X_n$ be random variables independently and identically generated from the probability distribution $q$. Also, for each $\ell=1,\dots,L$, let the random variable $n_\ell$ be the number of $X_1,\dots,X_n$ whose value is $\ell$. Let the empirical distribution $r_n$ be

r_n = \left(\frac{n_1}{n},\dots,\frac{n_L}{n}\right).  (6.36)

Then, the next theorem holds.

Theorem 6.1 (Sanov’s Theorem [19]).

Let $A$ be a subset of $\mathcal{P}$ and let $\mathrm{Pr}(r_n\in A)$ be the probability that the empirical distribution $r_n$ is included in the set $A$. Then the following inequality holds:

\limsup_{n\to\infty}\frac{1}{n}\log\mathrm{Pr}(r_n\in A) \leq -\inf_{p\in A}D(p\|q).  (6.37)
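The exponential decay rate in (6.37) can be observed numerically (a Monte Carlo sketch of our own; the true distribution $q$ and the event $A$ below are hypothetical, and the agreement is slow because of polynomial prefactors):

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.5, 0.3, 0.2])
t = 0.6   # event A = { r in P : r_1 >= t }

# I-projection of q onto A: raise the first coordinate to t, rescale the rest.
p_star = np.concatenate(([t], q[1:] * (1 - t) / (1 - q[0])))
D_star = float(np.sum(p_star * np.log(p_star / q)))   # inf_{p in A} D(p||q)

for n in [50, 100, 200]:
    counts = rng.multinomial(n, q, size=200_000)
    prob = np.mean(counts[:, 0] / n >= t)
    print(n, np.log(prob) / n, -D_star)   # (1/n) log Pr(r_n in A) vs -inf D(p||q)
```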

6.2.3 The property of the multinomial distribution with one trial

Let $q=(q_1,\dots,q_L)\in\mathcal{P}$ be positive in all elements, i.e., $q_1,\dots,q_L>0$. Let the positive number $\varepsilon>0$ be sufficiently small and define the set $\mathcal{P}_\varepsilon$ as follows:

\mathcal{P}_\varepsilon = \{p=(p_1,\dots,p_L) : \text{there exists } \ell \text{ such that } p_\ell\leq\varepsilon\}.  (6.38)

Since $\varepsilon$ is assumed to be sufficiently small, we can take $\varepsilon$ so that $q$ is not included in the set $\mathcal{P}_\varepsilon$. By $q\notin\mathcal{P}_\varepsilon$ and the property of the KL divergence, $D(q\|s)>0$ holds for all $s\in\mathcal{P}_\varepsilon$. Next, the constant $c>0$ is fixed as an arbitrary number that satisfies $0<c<\inf_{s\in\mathcal{P}_\varepsilon}D(q\|s)$, and the set $A$ is defined by

A = \left\{r\in\mathcal{P} : \inf_{s\in\mathcal{P}_\varepsilon}D(r\|s) > D(r\|q)+\frac{c}{2}\right\}.  (6.39)

Now, we show that $q\in A$ and that $A$ and $\mathcal{P}_\varepsilon$ have no intersection. Indeed, $q\in A$ holds because $D(q\|q)=0$ and $\inf_{s\in\mathcal{P}_\varepsilon}D(q\|s)>c>c/2$. Next, suppose there exists a probability distribution $r\in A\cap\mathcal{P}_\varepsilon$. Since $r\in A$,

\inf_{s\in\mathcal{P}_\varepsilon}D(r\|s) > D(r\|q)+\frac{c}{2}.  (6.40)

However, since $r\in\mathcal{P}_\varepsilon$, $\inf_{s\in\mathcal{P}_\varepsilon}D(r\|s)=0$, which is a contradiction. Furthermore, we can take a sufficiently small positive number $\delta>0$ such that

\{r\in\mathcal{P} : D(r\|q)<\delta\}\subset A.  (6.41)

By the definition of the set $A$, if $r\in A^c$ then $D(r\|q)\geq\delta$. Applying Sanov's theorem to the set $A^c$,

\limsup_{n\to\infty}\frac{1}{n}\log\mathrm{Pr}(r_n\in A^c) \leq -\inf_{p\in A^c}D(p\|q).  (6.42)

Therefore, for sufficiently large $n$,

\mathrm{Pr}(r_n\in A^c) \leq \exp\left(-n\inf_{p\in A^c}D(p\|q)\right)  (6.43)
\leq \exp(-n\delta).  (6.44)

Thus,

\mathrm{Pr}(r_n\in A) \geq 1-\exp(-n\delta).  (6.45)

By the definition of the set $A$, with probability at least $1-\exp(-n\delta)$, for any $p\in\mathcal{P}_\varepsilon$,

D(r_n\|p) \geq D(r_n\|q)+\frac{c}{2}.  (6.46)

Since $X_1,\dots,X_n$ are random variables generated independently from the true distribution $q$, and $n_\ell$ is the number of $X_1,\dots,X_n$ whose value is $\ell$, Eq. (6.46) yields

\sum_{\ell=1}^{L}\frac{n_\ell}{n}\log\frac{q_\ell}{p_\ell} \geq \frac{c}{2}.  (6.47)

Rearranging Eq. (6.47),

\prod_{\ell=1}^{L}\frac{p_\ell^{n_\ell}}{q_\ell^{n_\ell}} \leq \exp\left(-\frac{nc}{2}\right).  (6.48)

Eq. (6.48) means that if $p(x)$ and $q(x)$ are probability mass functions of multinomial distributions with one trial, the following inequality holds with probability at least $1-\exp(-n\delta)$:

\prod_{i=1}^{n}\frac{p(X_i)}{q(X_i)} \leq \exp\left(-\frac{nc}{2}\right).  (6.49)

6.2.4 The properties of the predictive distribution of the multinomial mixtures

In this section, we consider the lower bound of the predictive distribution $p(x|X^n)$ when the probability model is the multinomial mixture:

p(x|a,b) = \frac{M!}{\prod_{\ell=1}^{L}x_\ell!}\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell}.  (6.50)

We introduce the latent variable $y=(y_1,\dots,y_H)$, a vector in which one element is $1$ and the others are $0$. By using the variable $y$, we can rewrite the model as follows:

p(x,y|a,b) = \frac{M!}{\prod_{\ell=1}^{L}x_\ell!}\prod_{h=1}^{H}\left(a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell}\right)^{y_h}.  (6.51)

Both the prior distribution of the multinomial parameters and that of the mixing ratio are Dirichlet distributions:

\varphi(a,b|\alpha,\beta) = \frac{1}{R(\alpha,\beta)}\prod_{h=1}^{H}\left\{a_h^{\alpha_h-1}\prod_{\ell=1}^{L}b_{h\ell}^{\beta_{h\ell}-1}\right\},  (6.52)

where $\alpha,\beta$ are hyperparameters and $R(\alpha,\beta)$ is the normalizing constant:

R(\alpha,\beta) = \frac{\prod_{h=1}^{H}\Gamma(\alpha_h)}{\Gamma\left(\sum_{h=1}^{H}\alpha_h\right)}\prod_{h=1}^{H}\frac{\prod_{\ell=1}^{L}\Gamma(\beta_{h\ell})}{\Gamma\left(\sum_{\ell=1}^{L}\beta_{h\ell}\right)}.  (6.53)

To calculate the predictive distribution, we first calculate the posterior distribution $p(a,b|X^n,Y^n)$:

p(a,b|X^n,Y^n) = \frac{1}{\hat{R}_n}\varphi(a,b|\alpha,\beta)\prod_{i=1}^{n}p(X_i,Y_i|a,b)  (6.54)
= \frac{1}{\hat{R}_n}\left[\frac{1}{R(\alpha,\beta)}\prod_{h=1}^{H}\left\{a_h^{\alpha_h-1}\prod_{\ell=1}^{L}b_{h\ell}^{\beta_{h\ell}-1}\right\}\right]\prod_{i=1}^{n}\left[\frac{M!}{\prod_{\ell=1}^{L}X_{i\ell}!}\prod_{h=1}^{H}\left(a_h\prod_{\ell=1}^{L}b_{h\ell}^{X_{i\ell}}\right)^{Y_{ih}}\right]  (6.55)
\propto \prod_{h=1}^{H}\left\{a_h^{\alpha_h+\sum_{i=1}^{n}Y_{ih}-1}\prod_{\ell=1}^{L}b_{h\ell}^{\beta_{h\ell}+\sum_{i=1}^{n}X_{i\ell}Y_{ih}-1}\right\},  (6.56)

where $\hat{R}_n$ is the normalizing constant of $p(a,b|X^n,Y^n)$:

\hat{R}_n = \iint\varphi(a,b|\alpha,\beta)\prod_{i=1}^{n}p(X_i,Y_i|a,b)\,\mathrm{d}a\,\mathrm{d}b  (6.57)
\propto \iint\prod_{h=1}^{H}\left\{a_h^{\alpha_h+\sum_{i=1}^{n}Y_{ih}-1}\prod_{\ell=1}^{L}b_{h\ell}^{\beta_{h\ell}+\sum_{i=1}^{n}X_{i\ell}Y_{ih}-1}\right\}\,\mathrm{d}a\,\mathrm{d}b  (6.58)
= R\left(\alpha+\sum_{i=1}^{n}Y_i,\ \beta+\sum_{i=1}^{n}X_iY_i\right).  (6.59)

Thus, the predictive distribution can be calculated as follows:

p(x,y|X^n,Y^n) = \iint p(x,y|a,b)\,p(a,b|X^n,Y^n)\,\mathrm{d}a\,\mathrm{d}b  (6.60)
\propto \iint\prod_{h=1}^{H}\left(a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell}\right)^{y_h}\frac{1}{\hat{R}_n}\prod_{h=1}^{H}\left\{a_h^{\alpha_h+\sum_{i=1}^{n}Y_{ih}-1}\prod_{\ell=1}^{L}b_{h\ell}^{\beta_{h\ell}+\sum_{i=1}^{n}X_{i\ell}Y_{ih}-1}\right\}\,\mathrm{d}a\,\mathrm{d}b  (6.61)
= \frac{1}{\hat{R}_n}\iint\prod_{h=1}^{H}\left\{a_h^{\alpha_h+\sum_{i=1}^{n}Y_{ih}+y_h-1}\prod_{\ell=1}^{L}b_{h\ell}^{\beta_{h\ell}+\sum_{i=1}^{n}X_{i\ell}Y_{ih}+x_\ell y_h-1}\right\}\,\mathrm{d}a\,\mathrm{d}b  (6.62)
= \frac{R\left(\alpha+\sum_{i=1}^{n}Y_i+y,\ \beta+\sum_{i=1}^{n}X_iY_i+xy\right)}{R\left(\alpha+\sum_{i=1}^{n}Y_i,\ \beta+\sum_{i=1}^{n}X_iY_i\right)}.  (6.63)

Since the latent variable $y$ is a vector in which one element is $1$ and the others are $0$, and the vector $x$ satisfies $\sum_{\ell=1}^{L}x_\ell=M$, applying the property of the Gamma function $\Gamma(x+1)=x\Gamma(x)$, we can show that

p(x,y|X^n,Y^n) = O\!\left(\frac{1}{n^M}\right).  (6.64)

Thus, there exists a positive constant $C>0$ such that for all $x,y$,

p(x,y|X^n,Y^n) > \frac{C}{n^M}.  (6.65)

By the definition of the marginal distribution,

p(x|X^n) = \sum_{y}\sum_{Y^n}p(x,y|X^n,Y^n) > \frac{C}{n^M}.  (6.66)

Therefore,

\log p(x|X^n) \geq \log\frac{C}{n^M} = \log C - M\log n.  (6.67)
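The ratio of normalizing constants in Eq. (6.63) is convenient to evaluate through log-Gamma functions. Below is a sketch of our own (hyperparameters, counts, and the query are hypothetical) showing that even a query assigned to a component with no data decays only polynomially in $n$, consistent with the lower bound (6.65):

```python
import numpy as np
from scipy.special import gammaln

def logR(alpha, beta):
    # log R(alpha, beta), Eq. (6.53): a product of multivariate Beta functions
    la = np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))
    lb = np.sum(np.sum(gammaln(beta), axis=1) - gammaln(np.sum(beta, axis=1)))
    return la + lb

def log_pred(x, h, Ysum, XYsum, alpha, beta):
    # log p(x, y | X^n, Y^n) with y = e_h, the ratio of R's in Eq. (6.63)
    y = np.zeros_like(alpha); y[h] = 1.0
    xy = np.zeros_like(beta); xy[h] = x
    return logR(alpha + Ysum + y, beta + XYsum + xy) - logR(alpha + Ysum, beta + XYsum)

alpha, beta, M = np.ones(2), np.ones((2, 3)), 4
for n in [10, 100, 1000]:
    Ysum = np.array([float(n), 0.0])                                  # all n points in component 0
    XYsum = np.array([[2.0 * n, 1.0 * n, 1.0 * n], [0.0, 0.0, 0.0]])  # their accumulated counts
    x = np.array([M, 0, 0])                                           # query in the empty component 1
    print(n, log_pred(x, 1, Ysum, XYsum, alpha, beta))                # ~ -log(n) + const
```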

In this section, the prior distribution of the parameters was taken to be a Dirichlet distribution. In the main theorem, we also consider the case of a positive and bounded prior distribution. For such a prior, since it does not affect the poles of the zeta function, the lower bound of the predictive distribution $p(x|X^n)$ is likewise not of exponential order, as in the above discussion.

6.2.5 Properties of the maximum likelihood estimator of multinomial mixtures

Multinomial mixtures with $M$ trials are finite distributions on $(\mathbb{Z}_{\geq 0})^L$ such that $x_1+\dots+x_L=M$. The set $\mathcal{J}$ is defined by

\mathcal{J} = \left\{(x_1,\dots,x_L)\in\left(\mathbb{Z}_{\geq 0}\right)^L : \sum_{\ell=1}^{L}x_\ell = M\right\}.  (6.68)

The set $\mathcal{J}$ is finite, and the number of its elements is $J$. Let $\mathcal{P}_J$ be the set of all discrete probability distributions on the finite set $\{1,2,\dots,J\}$:

\mathcal{P}_J = \left\{(p_1,\dots,p_J)\in\mathbb{R}_{\geq 0}^{J} : \sum_{j=1}^{J}p_j = 1,\ p_j\geq 0\right\}.  (6.69)

Every probability distribution on $\mathcal{J}$ that can be expressed by a multinomial mixture with $M$ trials is included in the set $\mathcal{P}_J$. Given that the probability model $p(x|w)$ and the true distribution $q(x)$ are both multinomial mixtures with $M$ trials, and that the corresponding distributions on $\mathcal{J}$ are $\bar{p}(x|w)$ and $\bar{q}(x)$, the mean error function $K(w)$ is expressed as follows:

K(w) = \sum_{x}\bar{q}(x)\log\frac{\bar{q}(x)}{\bar{p}(x|w)}.  (6.70)

Since the function $K(w)$ is the KL divergence between the true distribution $q(x)$ and the probability model $p(x|w)$, $K(w)=0$ if $p(x|w)=q(x)$, and otherwise $K(w)>0$. Now, we fix the positive constant $c>0$ of Section 6.2.3 and define the subset $A$ of $\mathcal{P}_J$ as follows:

A = \left\{p(x|w) : K(w)<\frac{c}{2}\right\}.  (6.71)

Since $q(x)$ is positive at all points $x\in D$ by assumption, fixing the positive constant $\varepsilon>0$ of Section 6.2.3, the subset $E$ of $\mathcal{P}_J$ is defined as follows:

E = \{p(x|w) : \exists x\in\{1,\dots,J\}\ \text{s.t.}\ \bar{p}(x|w)<\varepsilon\}.  (6.72)

The set $E$ can be defined so that the sets $E$ and $A$ have no intersection. Then, the log empirical loss $L_n$ can be calculated as follows:

L_n = \frac{1}{n}\sum_{i=1}^{n}\log\frac{\bar{q}(X_i)}{\bar{p}(X_i)}  (6.73)
= \sum_{j=1}^{J}\frac{n_j}{n}\log\frac{\bar{q}_j}{\bar{p}_j}  (6.74)
= -\sum_{j=1}^{J}\frac{n_j}{n}\log\frac{\bar{p}_j}{n_j/n} - \sum_{j=1}^{J}\frac{n_j}{n}\log\frac{n_j/n}{\bar{q}_j}.  (6.75)

Since $n_j/n\to\bar{q}_j$ as $n\to\infty$, the first term of Eq. (6.75) converges to a certain constant, and the second term converges to $0$. Thus, with probability at least $1-\exp(-n\delta)$,

\frac{1}{n}\log\prod_{i=1}^{n}\frac{\bar{p}(X_i)}{\bar{q}(X_i)} \leq -\frac{c}{2}.  (6.76)

Therefore,

\prod_{i=1}^{n}\frac{\bar{p}(X_i)}{\bar{q}(X_i)} \leq \exp\left(-\frac{nc}{2}\right).  (6.77)

Let $w_0$ be the maximum likelihood estimator within the multinomial mixtures with $M$ trials, and let $w_1$ be the maximum likelihood estimator over all the discrete distributions in $\mathcal{P}_J$. Since the set of distributions that can be expressed by multinomial mixtures with $M$ trials is included in the set of all discrete distributions in $\mathcal{P}_J$, $\prod_{i=1}^{n}p(X_i|w_0)\leq\prod_{i=1}^{n}p(X_i|w_1)$ holds. Therefore, with probability at least $1-\exp(-n\delta)$,

\prod_{i=1}^{n}\frac{p(X_i|w_0)}{q(X_i)} \leq \prod_{i=1}^{n}\frac{\bar{p}(X_i|w_1)}{\bar{q}(X_i)} \leq \exp\left(-\frac{nc}{2}\right).  (6.78)

6.2.6 The proof of Lemma 6.3

Proof.

We fix the positive constants $\varepsilon,c,\delta>0$ of Section 6.2.3 and define the set $A$ as in Eq. (6.71). The random variable $\Theta(X^n)$ is defined as follows:

\Theta(X^n) = \begin{cases}1 & (r_n\notin A)\\ 0 & (r_n\in A)\end{cases},  (6.79)

where $r_n$ is the empirical distribution defined in Eq. (6.36). From Eq. (6.44), the probability of $\Theta(X^n)=1$ is less than $\exp(-n\delta)$. Using the fact that the true distribution $q(X)$ does not depend on the sample size $n$ and using Eq. (6.67),

\mathbb{E}\left[\Theta(X^{n+1})\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right] = \mathbb{E}\left[\Theta(X^{n+1})\log\frac{q(X)}{p(X|X^n)}\right]  (6.80)
= O\left(\exp(-n\delta)\,M\log n\right)  (6.81)
= O(\exp(-n)).  (6.82)

Moreover, by Eq. (6.78),

\mathbb{E}\left[\left(1-\Theta(X^{n+1})\right)\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right] \leq \mathbb{E}\left[\log\frac{\frac{(\log n)^{m-1}}{n^{\lambda}}Z_0(X^n)+\exp(-nc/2)}{\frac{(\log(n+1))^{m-1}}{(n+1)^{\lambda}}Z_0(X^{n+1})+\exp(-(n+1)c/2)}\right]  (6.83)
= \frac{\lambda}{n} - \frac{m-1}{n\log n} + \mathbb{E}\left[\log\frac{Z_0(X^n)}{Z_0(X^{n+1})}\right] + o\!\left(\frac{1}{n\log n}\right).  (6.84)

By the reference [9],

\mathbb{E}\left[\log\frac{Z_0(X^n)}{Z_0(X^{n+1})}\right] = o\!\left(\frac{1}{n\log n}\right).  (6.85)

Therefore,

\mathbb{E}\left[\left(1-\Theta(X^{n+1})\right)\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right] \leq \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right).  (6.86)

From the above, Lemma 6.3 is shown. ∎

6.3 Properties of the general components

We prepare a lemma that holds for multinomial mixtures with a general number of components.

Lemma 6.4.

Let $p(x|w)$ be a multinomial mixture with $H$ components:

p(x|w) = \sum_{h=1}^{H}a_h\,\mathrm{Mul}_L(x|b_h)\quad (w\in W).  (6.87)

Also, let the true distribution $q(x)$ be a multinomial mixture with $H^*$ ($H\geq H^*$) components:

q(x) = \sum_{h=1}^{H^*}a^*_h\,\mathrm{Mul}_L(x|b^*_h).  (6.88)

Let $K(w)$ be the mean error function determined by the probability model $p(x|w)$ and the true distribution $q(x)$. Then $K(w)$ has the same RLCT and multiplicity as $K_1(w)$ defined below:

K_1(w) = \sum_{x\in D}\left\{\sum_{h=1}^{H}a_h\left(\prod_{\ell=1}^{L-1}b_{h\ell}^{x_\ell}\right)-\sum_{h=1}^{H^*}a^*_h\left(\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\right)\right\}^2,  (6.89)

that is, $K(w)\sim K_1(w)$.

Proof.

From Lemma 6.1 (6), the mean error function $K(w)$ is equivalent to $K'(w)$ of Eq. (6.11), that is, their RLCTs and multiplicities are equal. We calculate $p(x|w)-q(x)$ as follows:

p(x|w)-q(x) = \sum_{h=1}^{H}a_h\frac{M!}{\prod_{\ell=1}^{L}x_\ell!}\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\frac{M!}{\prod_{\ell=1}^{L}x_\ell!}\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell}  (6.90)
= \left(\frac{M!}{\prod_{\ell=1}^{L}x_\ell!}\right)\left\{\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell}\right\}.  (6.91)

Since $\frac{M!}{\prod_{\ell=1}^{L}x_\ell!}$ is positive and bounded for any $x\in D$,

K(w) \sim \sum_{x\in D}\left\{\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell}\right\}^2.  (6.92)

Furthermore, since $\sum_{\ell=1}^{L}b_{h\ell}=1$ for each $h\in[H]$ and $\sum_{\ell=1}^{L}b^*_{h\ell}=1$ for each $h\in[H^*]$, both $b_{hL}$ and $b^*_{hL}$ can be expressed in terms of the other $b_{h\ell}$ and $b^*_{h\ell}$:

\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell} = \sum_{h=1}^{H}a_h\left(\prod_{\ell=1}^{L-1}b_{h\ell}^{x_\ell}\right)\left(1-\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{x_L} - \sum_{h=1}^{H^*}a^*_h\left(\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\right)\left(1-\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{x_L}.  (6.93)

Here, by using the binomial theorem,

\left(1-\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{x_L} = \sum_{i=0}^{x_L}\binom{x_L}{i}\left(-\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{i}1^{x_L-i}  (6.94)
= 1+\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{i}.  (6.95)

Also,

\left(1-\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{x_L} = 1+\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{i}.  (6.96)

Therefore,

\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell} = \sum_{h=1}^{H}a_h\left(\prod_{\ell=1}^{L-1}b_{h\ell}^{x_\ell}\right)\left\{1+\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{i}\right\} - \sum_{h=1}^{H^*}a^*_h\left(\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\right)\left\{1+\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{i}\right\}  (6.97)
= \sum_{h=1}^{H}a_h\prod_{\ell=1}^{L-1}(b_{h\ell})^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell} + \sum_{h=1}^{H}a_h\left(\prod_{\ell=1}^{L-1}(b_{h\ell})^{x_\ell}\right)\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{i} - \sum_{h=1}^{H^*}a^*_h\left(\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\right)\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{i}.  (6.98)

For simplicity, for each $h\in[H]$ and $h\in[H^*]$, we define $A_h$ and $A^*_h$ as follows:

A_h = a_h\prod_{\ell=1}^{L-1}(b_{h\ell})^{x_\ell}\quad (h=1,\dots,H),  (6.99)
A^*_h = a^*_h\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\quad (h=1,\dots,H^*).  (6.100)

It follows that

\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell} = \sum_{h=1}^{H}A_h - \sum_{h=1}^{H^*}A^*_h + \sum_{h=1}^{H}A_h\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{i} - \sum_{h=1}^{H^*}A^*_h\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{i}.  (6.101)

By using the multinomial theorem,

\left(\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{i} = \sum_{i_1,\dots,i_{L-1}}\frac{i!}{i_1!\cdots i_{L-1}!}b_{h1}^{i_1}\cdots b_{hL-1}^{i_{L-1}},  (6.102)
\left(\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{i} = \sum_{i_1,\dots,i_{L-1}}\frac{i!}{i_1!\cdots i_{L-1}!}(b^*_{h1})^{i_1}\cdots(b^*_{hL-1})^{i_{L-1}},  (6.103)

where the summation $\sum_{i_1,\dots,i_{L-1}}$ runs over all tuples of non-negative integers $(i_1,\dots,i_{L-1})$ such that $i_1+\dots+i_{L-1}=i$. We apply Eqs. (6.102) and (6.103) to Eq. (6.101). Let $C_{i_\ell}$ denote $\frac{i!}{i_1!\cdots i_{L-1}!}$. Then we obtain

\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell} = \sum_{h=1}^{H}A_h - \sum_{h=1}^{H^*}A^*_h + \sum_{h=1}^{H}A_h\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\sum_{i_1,\dots,i_{L-1}}C_{i_\ell}b_{h1}^{i_1}\cdots b_{hL-1}^{i_{L-1}} - \sum_{h=1}^{H^*}A^*_h\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\sum_{i_1,\dots,i_{L-1}}C_{i_\ell}(b^*_{h1})^{i_1}\cdots(b^*_{hL-1})^{i_{L-1}}  (6.104)
= \sum_{h=1}^{H}A_h - \sum_{h=1}^{H^*}A^*_h + \sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\sum_{i_1,\dots,i_{L-1}}C_{i_\ell}\left\{\sum_{h=1}^{H}A_hb_{h1}^{i_1}\cdots b_{hL-1}^{i_{L-1}} - \sum_{h=1}^{H^*}A^*_h(b^*_{h1})^{i_1}\cdots(b^*_{hL-1})^{i_{L-1}}\right\}.  (6.105)

By using Eqs.(6.99) and (6.100),

\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell} = \sum_{h=1}^{H}A_h - \sum_{h=1}^{H^*}A^*_h + \sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\sum_{i_1,\dots,i_{L-1}}C_{i_\ell}\left\{\sum_{h=1}^{H}a_hb_{h1}^{x_1+i_1}\cdots b_{hL-1}^{x_{L-1}+i_{L-1}} - \sum_{h=1}^{H^*}a^*_h(b^*_{h1})^{x_1+i_1}\cdots(b^*_{hL-1})^{x_{L-1}+i_{L-1}}\right\}.  (6.106)

We introduce a polynomial $f_{L-1}(x_1,\dots,x_{L-1};w)$ defined by

f_{L-1}(x_1,\dots,x_{L-1};w) = \sum_{h=1}^{H}A_h - \sum_{h=1}^{H^*}A^*_h  (6.107)
= \sum_{h=1}^{H}a_h\prod_{\ell=1}^{L-1}(b_{h\ell})^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\left(\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\right).  (6.108)

Then by Eq.(6.106), it follows that

\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell} = f_{L-1}(x_1,\dots,x_{L-1};w) + \sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\sum_{i_1,\dots,i_{L-1}}C_{i_\ell}\,f_{L-1}(x_1+i_1,\dots,x_{L-1}+i_{L-1};w).  (6.109)

The second term of Eq. (6.109) can be expressed as a linear combination of polynomials of the same form as the first term $f_{L-1}(x_1,\dots,x_{L-1};w)$. That is, in the second term, there are constants $C'(x_1,\dots,x_{L-1})$ that do not depend on the parameters such that

\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\sum_{i_1,\dots,i_{L-1}}C_{i_\ell}\,f_{L-1}(x_1+i_1,\dots,x_{L-1}+i_{L-1};w) = \sum_{x\in D}C'(x_1,\dots,x_{L-1})f_{L-1}(x_1,\dots,x_{L-1};w).  (6.110)

Therefore, the ideal generated from the set $\{f_{L-1}(x_1,\dots,x_{L-1})\}_{x\in D}$ and the ideal generated from the set $\{f_{L-1}(x_1,\dots,x_{L-1})+f_{L-1}(x_1+i_1,\dots,x_{L-1}+i_{L-1})\}_{x\in D}$ are equal, so the function $K_1(w)$ is defined as follows:

K_1(w) = \sum_{x\in D}f_{L-1}(x_1,\dots,x_{L-1};w)^2  (6.111)
= \sum_{x\in D}\left\{\sum_{h=1}^{H}a_h\left(\prod_{\ell=1}^{L-1}(b_{h\ell})^{x_\ell}\right) - \sum_{h=1}^{H^*}a^*_h\left(\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\right)\right\}^2.  (6.112)

From Lemma 6.1 (4), the two functions $K(w)$ and $K_1(w)$ are equivalent, that is, their RLCTs and multiplicities are equal. ∎

6.4 Properties of the two-component case

So far, we have prepared the lemma 6.4, which holds for multinomial mixtures with general components. Hereafter, we assume that the number of components of the multinomial mixtures of the probability model is 2 (i.e. H=2H=2) and that the true distribution is the multinomial distribution (i.e. H=1H^{*}=1). That is, the probability model p(x|w)p(x|w) and the true distribution q(x)q(x) are as follows:

p(x|w)\displaystyle p(x|w) =aMulL(x|b)+(1a)MulL(x|c),b,cB,\displaystyle=a\mathrm{Mul}_{L}(x|b)+(1-a)\mathrm{Mul}_{L}(x|c),\ b,c\in B, (6.113)
q(x)\displaystyle q(x) =MulL(x|b),bB,=1Lb0.\displaystyle=\mathrm{Mul}_{L}(x|b^{*}),\ b^{*}\in B,\ \prod_{\ell=1}^{L}b^{*}_{\ell}\neq 0. (6.114)

Then, the polynomial $f_{L}(x_{1},\cdots,x_{L};w)$, defined in the same way as Eq.(6.108) but with all $L$ coordinates, is expressed as follows:

fL(x1,,xL;w)=a=1Lbx+(1a)=1Lcx=1L(b)x.\displaystyle f_{L}(x_{1},\cdots,x_{L};w)=a\prod_{\ell=1}^{L}b_{\ell}^{x_{\ell}}+(1-a)\prod_{\ell=1}^{L}c_{\ell}^{x_{\ell}}-\prod_{\ell=1}^{L}(b_{\ell}^{*})^{x_{\ell}}.
Lemma 6.5.

For $j\in[2:L-1]$ and integers $x_{j}\geq 2$, the following holds:

fL1(x1,,xj,,xL1;w)=(bj+cj)fL1(x1,,xj1,,xL1;w)bjcjfL1(x1,,xj2,,xL1;w)(bjbj)(cjbj)(bj)xj2jL1(b)x,\displaystyle\begin{split}&f_{L-1}(x_{1},\cdots,x_{j},\cdots,x_{L-1};w)\\ &=(b_{j}+c_{j})f_{L-1}(x_{1},\cdots,x_{j}-1,\cdots,x_{L-1};w)\\ &\quad-b_{j}c_{j}f_{L-1}(x_{1},\cdots,x_{j}-2,\cdots,x_{L-1};w)\\ &\quad-(b_{j}-b_{j}^{*})(c_{j}-b_{j}^{*})(b_{j}^{*})^{x_{j}-2}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}},\end{split} (6.115)

where [2:L1][2:L-1] represents the set {:2L1}\{\ell\in\mathbb{Z}:2\leq\ell\leq L-1\}.

Proof.

By expanding each of the three terms on the right-hand side of Eq.(6.115),

(bj+cj)fL1(x1,,xj1,,xL1;w)=a=1L1bx+acjbjxj1jL1bx+(1a)bjcjxj1jL1cx+(1a)=1L1cx(bj+cj)(bj)xj1jL1(b)x,\displaystyle\begin{split}&(b_{j}+c_{j})f_{L-1}(x_{1},\cdots,x_{j}-1,\cdots,x_{L-1};w)\\ &\quad=a\prod_{\ell=1}^{L-1}b_{\ell}^{x_{\ell}}+ac_{j}b_{j}^{x_{j}-1}\prod_{\ell\neq j}^{L-1}b_{\ell}^{x_{\ell}}+(1-a)b_{j}c_{j}^{x_{j}-1}\prod_{\ell\neq j}^{L-1}c_{\ell}^{x_{\ell}}\\ &\quad+(1-a)\prod_{\ell=1}^{L-1}c_{\ell}^{x_{\ell}}-(b_{j}+c_{j})(b_{j}^{*})^{x_{j}-1}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}},\end{split} (6.116)
\displaystyle\begin{split}&-b_{j}c_{j}f_{L-1}(x_{1},\cdots,x_{j}-2,\cdots,x_{L-1};w)\\ &\quad=-ac_{j}b_{j}^{x_{j}-1}\prod_{\ell\neq j}^{L-1}b_{\ell}^{x_{\ell}}-(1-a)b_{j}c_{j}^{x_{j}-1}\prod_{\ell\neq j}^{L-1}c_{\ell}^{x_{\ell}}+b_{j}c_{j}(b_{j}^{*})^{x_{j}-2}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}},\end{split} (6.117)
\displaystyle\begin{split}&-(b_{j}-b_{j}^{*})(c_{j}-b_{j}^{*})(b_{j}^{*})^{x_{j}-2}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}}\\ &\quad=-b_{j}c_{j}(b_{j}^{*})^{x_{j}-2}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}}+b_{j}(b_{j}^{*})^{x_{j}-1}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}}\\ &\quad+c_{j}(b_{j}^{*})^{x_{j}-1}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}}-(b_{j}^{*})^{x_{j}}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}}.\end{split} (6.118)

Eq.(6.115) is then obtained by summing Eqs.(6.116), (6.117), and (6.118). ∎
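The recursion can also be checked mechanically. Below is a minimal sympy sketch of ours (not part of the paper), instantiated at $L=3$, $j=2$, and fixed exponents with $x_{2}\geq 2$; the symbols t1, t2 stand for $b_{1}^{*},b_{2}^{*}$.

    import sympy as sp

    a, b1, b2, c1, c2, t1, t2 = sp.symbols('a b1 b2 c1 c2 t1 t2')
    x1, x2 = 2, 4  # x2 >= 2 keeps every exponent below nonnegative

    def f2(y1, y2):
        # f_{L-1}(y1, y2; w) for H = 2, H* = 1
        return a*b1**y1*b2**y2 + (1 - a)*c1**y1*c2**y2 - t1**y1*t2**y2

    rhs = ((b2 + c2)*f2(x1, x2 - 1) - b2*c2*f2(x1, x2 - 2)
           - (b2 - t2)*(c2 - t2)*t2**(x2 - 2)*t1**x1)
    assert sp.expand(f2(x1, x2) - rhs) == 0
    print("Eq.(6.115) holds at the tested exponents.")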

Lemma 6.6.

Define the set DD^{\prime} of xx as follows:

D={(x1,x2,,xL1)L1|x{0,1}}\displaystyle D^{\prime}=\bigl{\{}(x_{1},x_{2},\cdots,x_{L-1})\in\mathbb{Z}^{L-1}\ |\ x_{\ell}\in\{0,1\}\bigr{\}} (6.119)

Define the function K2(w)K_{2}(w) of parameter ww as follows:

K2(w)==1L1(bb)2(cb)2+xDfL1(x1,,xL1;w)2.\displaystyle K_{2}(w)=\sum_{\ell=1}^{L-1}(b_{\ell}-b_{\ell}^{*})^{2}(c_{\ell}-b_{\ell}^{*})^{2}+\sum_{x\in D^{\prime}}f_{L-1}(x_{1},\cdots,x_{L-1};w)^{2}. (6.120)

Then, K(w)K2(w)K(w)\sim K_{2}(w).

Proof.

By applying Lemma 6.5 inductively, $f_{L-1}(x_{1},\cdots,x_{L-1};w)$ can be expressed in terms of

  • $f_{L-1}(1,0,0,\cdots,0;w),f_{L-1}(0,1,0,\cdots,0;w),\cdots$

  • $f_{L-1}(1,1,0,\cdots,0;w),f_{L-1}(1,0,1,\cdots,0;w),\cdots$

  • $\cdots$

  • $f_{L-1}(1,1,\cdots,1;w)$

  • $(b_{\ell}-b_{\ell}^{*})(c_{\ell}-b_{\ell}^{*})\ (\ell=1,\cdots,L-1)$.

Since the ideal generated by $\{f_{L-1}(x_{1},\cdots,x_{L-1};w)\}_{x\in D}\cup\{(b_{\ell}-b_{\ell}^{*})(c_{\ell}-b_{\ell}^{*})\}_{\ell=1}^{L-1}$ and the ideal generated by $\{f_{L-1}(x_{1},\cdots,x_{L-1};w)\}_{x\in D^{\prime}}\cup\{(b_{\ell}-b_{\ell}^{*})(c_{\ell}-b_{\ell}^{*})\}_{\ell=1}^{L-1}$ are equal, Lemma 6.1(4) can be applied. Thus, $K(w)\sim K_{2}(w)$. ∎
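As a numerical illustration (ours, not in the original), the following sketch evaluates $K_{2}(w)$ of Eq.(6.120) for $L=3$ at a few parameter points: it vanishes on the non-identifiable set of true parameters, for instance at $b=c=b^{*}$ or at $a=0$ with $c=b^{*}$ and $b$ arbitrary, and is positive away from it.

    import itertools

    def K2(a, b, c, t):
        # K_2(w) of Eq.(6.120) for L = 3; t plays the role of b*
        val = sum((b[l] - t[l])**2 * (c[l] - t[l])**2 for l in range(2))
        for x1, x2 in itertools.product((0, 1), repeat=2):
            f = (a * b[0]**x1 * b[1]**x2
                 + (1 - a) * c[0]**x1 * c[1]**x2
                 - t[0]**x1 * t[1]**x2)
            val += f * f
        return val

    t = (0.5, 0.3)                             # true b*_1, b*_2
    print(K2(0.7, t, t, t))                    # 0.0: both components at the truth
    print(K2(0.0, (0.9, 0.05), t, t))          # 0.0: vanishing mixture weight
    print(K2(0.3, (0.6, 0.2), (0.4, 0.3), t))  # positive: off the true set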

6.5 Proof of the main theorem

Let us prove Theorem 5.1.

Proof.

(Proof of Theorem 5.1) By Lemma 6.6, in order to obtain the RLCT and multiplicity of the multinomial mixture with two components, it suffices to find the maximum pole, and its order, of the zeta function determined by $K_{2}(w)$ and $\varphi(w)$.

\displaystyle\begin{split}K_{2}(w)&=\sum_{\ell=1}^{L-1}(b_{\ell}-b_{\ell}^{*})^{2}(c_{\ell}-b_{\ell}^{*})^{2}\\ &\quad+\sum_{\ell_{1}\in\{1,\cdots,L-1\}}(ab_{\ell_{1}}+(1-a)c_{\ell_{1}}-b^{*}_{\ell_{1}})^{2}\\ &\quad+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}(ab_{\ell_{1}}b_{\ell_{2}}+(1-a)c_{\ell_{1}}c_{\ell_{2}}-b^{*}_{\ell_{1}}b^{*}_{\ell_{2}})^{2}\\ &\quad+\dots\\ &\quad+\sum_{\ell_{1},\cdots,\ell_{L-1}\in\{1,\cdots,L-1\}}\Bigl(a\prod_{k=1}^{L-1}b_{\ell_{k}}+(1-a)\prod_{k=1}^{L-1}c_{\ell_{k}}-\prod_{k=1}^{L-1}b^{*}_{\ell_{k}}\Bigr)^{2},\end{split} (6.121)

where the summation $\displaystyle\sum_{\ell_{1},\ell_{2},\dots,\ell_{i}\in\{1,\cdots,L-1\}}$ runs over the set $\{(\ell_{1},\dots,\ell_{i})\in\{1,\dots,L-1\}^{i}:\ell_{j}\neq\ell_{j^{\prime}}\ \text{for all}\ j\neq j^{\prime}\}$, i.e. over tuples of distinct indices. Let us define a map $\Phi_{1}:u\mapsto w$, where for each $\ell\in[L-1]$,

{B=bβ=bBγ=cB.\displaystyle\begin{cases}B_{\ell}=b_{\ell}^{*}\\ \beta_{\ell}=b_{\ell}-B_{\ell}\\ \gamma_{\ell}=c_{\ell}-B_{\ell}\end{cases}. (6.122)

The parameter $u$ consists of $(a,\beta,\gamma)$, where $\beta=(\beta_{1},\dots,\beta_{L-1})$ and $\gamma=(\gamma_{1},\dots,\gamma_{L-1})$. Under this map,

\displaystyle\begin{split}&K_{2}(\Phi_{1}(u))\\ &=\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}\gamma_{\ell}^{2}\\ &\quad+\sum_{\ell_{1}\in\{1,\cdots,L-1\}}(a\beta_{\ell_{1}}+(1-a)\gamma_{\ell_{1}})^{2}\\ &\quad+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\Bigl\{a\beta_{\ell_{1}}\beta_{\ell_{2}}+(1-a)\gamma_{\ell_{1}}\gamma_{\ell_{2}}\\ &\qquad+B_{\ell_{2}}\{a\beta_{\ell_{1}}+(1-a)\gamma_{\ell_{1}}\}+B_{\ell_{1}}\{a\beta_{\ell_{2}}+(1-a)\gamma_{\ell_{2}}\}\Bigr\}^{2}\\ &\quad+\dots\\ &\quad+\sum_{\ell_{1},\cdots,\ell_{L-1}\in\{1,\cdots,L-1\}}\Bigl\{a\prod_{k=1}^{L-1}\beta_{\ell_{k}}+(1-a)\prod_{k=1}^{L-1}\gamma_{\ell_{k}}\\ &\qquad+\cdots+\sum_{j=1}^{L-1}\Bigl(\prod_{k\neq j}^{L-1}B_{\ell_{k}}\Bigr)(a\beta_{\ell_{j}}+(1-a)\gamma_{\ell_{j}})\Bigr\}^{2}.\end{split} (6.123)

From the symmetry of the parameters, we can restrict the integration range for aa as 0a120\leq a\leq\frac{1}{2} without loss of generality. Let us define a map Φ2:vu\Phi_{2}:v\to u, where

δ=aβ+(1a)γ(=1,,L1).\displaystyle\delta_{\ell}=a\beta_{\ell}+(1-a)\gamma_{\ell}\ \ (\ell=1,\cdots,L-1). (6.124)

The parameter $v$ consists of $(a,\beta,\delta)$, where $\delta=(\delta_{1},\dots,\delta_{L-1})$. Since we consider the range $\frac{1}{2}\leq 1-a\leq 1$, the Jacobian determinant of this transform is bounded away from zero; therefore, neither the maximum pole of the zeta function nor its order changes.
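For completeness, we record the computation behind this claim (a supplementary remark, not in the original text). With $a$ and $\beta$ held fixed, each $\delta_{\ell}$ depends on $\gamma_{\ell}$ linearly with coefficient $1-a$, so

\displaystyle\frac{\partial(a,\beta,\delta)}{\partial(a,\beta,\gamma)}=\det\begin{pmatrix}1&0&0\\ 0&I_{L-1}&0\\ \ast&\ast&(1-a)I_{L-1}\end{pmatrix}=(1-a)^{L-1}\in\Bigl[\frac{1}{2^{L-1}},1\Bigr],

and $\Phi_{2}$ is a diffeomorphism on this restricted range. We can obtain that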

\displaystyle\begin{split}&K_{2}(\Phi_{2}(\Phi_{1}(v)))\\ &=\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}\gamma_{\ell}^{2}\\ &\quad+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\Bigl\{a\beta_{\ell_{1}}\beta_{\ell_{2}}+(1-a)\gamma_{\ell_{1}}\gamma_{\ell_{2}}\\ &\qquad+B_{\ell_{2}}\{a\beta_{\ell_{1}}+(1-a)\gamma_{\ell_{1}}\}+B_{\ell_{1}}\{a\beta_{\ell_{2}}+(1-a)\gamma_{\ell_{2}}\}\Bigr\}^{2}\\ &\quad+\dots\\ &\quad+\sum_{\ell_{1},\cdots,\ell_{L-1}\in\{1,\cdots,L-1\}}\Bigl\{a\prod_{k=1}^{L-1}\beta_{\ell_{k}}+(1-a)\prod_{k=1}^{L-1}\gamma_{\ell_{k}}\\ &\qquad+\cdots+\sum_{j=1}^{L-1}\Bigl(\prod_{k\neq j}^{L-1}B_{\ell_{k}}\Bigr)(a\beta_{\ell_{j}}+(1-a)\gamma_{\ell_{j}})\Bigr\}^{2}.\end{split} (6.125)

Here, we eliminate γ\gamma by using γ=δaβ1a\displaystyle\gamma_{\ell}=\frac{\delta_{\ell}-a\beta_{\ell}}{1-a},

=1L1β2γ2\displaystyle\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}\gamma_{\ell}^{2} ==1L1β2(δaβ1a)2\displaystyle=\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}\Bigl{(}\frac{\delta_{\ell}-a\beta_{\ell}}{1-a}\Bigr{)}^{2} (6.126)
=1(1a)2=1L1β2(δaβ)2.\displaystyle=\frac{1}{(1-a)^{2}}\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}(\delta_{\ell}-a\beta_{\ell})^{2}. (6.127)

Since $\frac{1}{2}\leq 1-a\leq 1$, Lemma 6.1(5) gives

=1L1δ2+=1L1β2γ2\displaystyle\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}\gamma_{\ell}^{2} ==1L1δ2+1(1a)2=1L1β2(δaβ)2\displaystyle=\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\frac{1}{(1-a)^{2}}\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}(\delta_{\ell}-a\beta_{\ell})^{2} (6.128)
=1L1δ2+=1L1β2(aβ)2\displaystyle\sim\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}(-a\beta_{\ell})^{2} (6.129)
==1L1δ2+=1L1a2β4.\displaystyle=\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}. (6.130)

Moreover,

=1L1δ2+=1L1a2β4+1,2{1,,L1}{aβ1β2+(1a)γ1γ2+B2{aβ1+(1a)γ1}+B1{aβ2+(1a)γ2}}2\displaystyle\begin{split}&\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}\\ &\quad+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\Bigl{\{}a\beta_{\ell_{1}}\beta_{\ell_{2}}+(1-a)\gamma_{\ell_{1}}\gamma_{\ell_{2}}\\ &\quad+B_{\ell_{2}}\{a\beta_{\ell_{1}}+(1-a)\gamma_{\ell_{1}}\}+B_{\ell_{1}}\{a\beta_{\ell_{2}}+(1-a)\gamma_{\ell_{2}}\}\Bigr{\}}^{2}\end{split} (6.131)
==1L1δ2+=1L1a2β4+1,2{1,,L1}{aβ1β2\displaystyle=\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\Bigl{\{}a\beta_{\ell_{1}}\beta_{\ell_{2}} (6.132)
+(1a)δ1aβ11aδ2aβ21a+B2δ1+B1δ2}2\displaystyle\quad+(1-a)\frac{\delta_{\ell_{1}}-a\beta_{\ell_{1}}}{1-a}\frac{\delta_{\ell_{2}}-a\beta_{\ell_{2}}}{1-a}+B_{\ell_{2}}\delta_{\ell_{1}}+B_{\ell_{1}}\delta_{\ell_{2}}\Bigr{\}}^{2} (6.133)
=1L1δ2+=1L1a2β4+1,2{1,,L1}{aβ1β2+(aβ1)(aβ2)}2\displaystyle\sim\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\Bigl{\{}a\beta_{\ell_{1}}\beta_{\ell_{2}}+\left(a\beta_{\ell_{1}}\right)\left(a\beta_{\ell_{2}}\right)\Bigr{\}}^{2} (6.134)
==1L1δ2+=1L1a2β4+1,2{1,,L1}(aβ1β2)2(1+a)2\displaystyle=\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}(a\beta_{\ell_{1}}\beta_{\ell_{2}})^{2}(1+a)^{2} (6.135)
=1L1δ2+=1L1a2β4+1,2{1,,L1}a2β12β22.\displaystyle\sim\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}a^{2}\beta_{\ell_{1}}^{2}\beta_{\ell_{2}}^{2}. (6.136)

By recursively applying lemma 6.1(5), we can obtain that

K2(Φ2(Φ1(v)))\displaystyle K_{2}(\Phi_{2}(\Phi_{1}(v))) =1L1δ2+=1L1a2β4+1,2{1,,L1}a2β12β22.\displaystyle\sim\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}a^{2}\beta_{\ell_{1}}^{2}\beta_{\ell_{2}}^{2}. (6.137)

Here, for all parameters vv,

K2(Φ2(Φ1(v)))=1L1δ2+=1L1a2β4.\displaystyle K_{2}(\Phi_{2}(\Phi_{1}(v)))\geq\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}. (6.138)

Also, since $\beta_{\ell_{1}}^{2},\beta_{\ell_{2}}^{2}\geq 0$, the inequality of arithmetic and geometric means gives

=1L1β4+1,2{1,,L1}β12β22=1L1β4+121,2{1,,L1}(β14+β24).\displaystyle\sum_{\ell=1}^{L-1}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\beta_{\ell_{1}}^{2}\beta_{\ell_{2}}^{2}\ \leq\ \sum_{\ell=1}^{L-1}\beta_{\ell}^{4}+\frac{1}{2}\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}(\beta_{\ell_{1}}^{4}+\beta_{\ell_{2}}^{4}). (6.139)

Therefore, there exists a constant $k\geq 1$ that does not depend on $v$ such that

=1L1β4+1,2{1,,L1}β12β22k=1L1β4.\displaystyle\sum_{\ell=1}^{L-1}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\beta_{\ell_{1}}^{2}\beta_{\ell_{2}}^{2}\leq k\sum_{\ell=1}^{L-1}\beta_{\ell}^{4}. (6.140)

By Eqs. (6.138), (6.140),

=1L1δ2+=1L1a2β4K2(Φ2(Φ1(v)))k(=1L1δ2+=1L1a2β4).\displaystyle\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}\leq K_{2}(\Phi_{2}(\Phi_{1}(v)))\leq k\Bigl{(}\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}\Bigr{)}. (6.141)

Thus, defining $K_{3}(v)$ by

K3(v)==1L1δ2+=1L1a2β4,\displaystyle K_{3}(v)=\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}, (6.142)

it follows from Lemma 6.1(1) and Eq.(6.141) that

K3(v)K2(Φ2(Φ1(v))).\displaystyle K_{3}(v)\sim K_{2}(\Phi_{2}(\Phi_{1}(v))). (6.143)

From Eq.(6.143), the main theorem can be derived by finding the maximum pole, and its order, of the zeta function determined by $K_{3}(v)$ and $\varphi(\Phi_{2}(\Phi_{1}(v)))$. If the prior distribution $\varphi(a)$ of the mixture ratio $a$ is positive and bounded, then by Lemma 6.1(2)(3),

λ(K3,φ)\displaystyle\lambda(K_{3},\varphi) ==1L1λ(δ2)+min(λ(a2),=1L1λ(β4))\displaystyle=\sum_{\ell=1}^{L-1}\lambda(\delta_{\ell}^{2})+\min\Bigl{(}\lambda(a^{2}),\ \sum_{\ell=1}^{L-1}\lambda(\beta_{\ell}^{4})\Bigr{)} (6.144)
=L12+min(12,L14).\displaystyle=\frac{L-1}{2}+\min\Bigl{(}\frac{1}{2},\ \frac{L-1}{4}\Bigr{)}. (6.145)
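The one-variable RLCTs used here follow from a standard computation, which we record for convenience (a supplementary remark, not in the original text): for $k\in\{1,2\}$ and a positive bounded prior,

\displaystyle\int_{0}^{\varepsilon}(t^{2k})^{z}\,dt=\frac{\varepsilon^{2kz+1}}{2kz+1},

which has its pole at $z=-\frac{1}{2k}$; hence $\lambda(\delta_{\ell}^{2})=\lambda(a^{2})=\frac{1}{2}$ and $\lambda(\beta_{\ell}^{4})=\frac{1}{4}$. By Lemma 6.1(2)(3), the separated squares $\delta_{\ell}^{2}$ contribute the sum $\frac{L-1}{2}$, while the product structure of $a^{2}\sum_{\ell}\beta_{\ell}^{4}$ contributes the minimum $\min\bigl(\lambda(a^{2}),\sum_{\ell}\lambda(\beta_{\ell}^{4})\bigr)$.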

Here, $\frac{1}{2}=\frac{L-1}{4}$ holds if and only if $L=3$, so the multiplicity is $m=2$ when $L=3$ and $m=1$ otherwise. This proves Theorem 5.1(1).

Furthermore, consider the case of Theorem 5.1(2), that is, the case where the prior distribution of the mixing ratio $a$ is the Dirichlet distribution with hyperparameter $\alpha>0$. Using that the prior of $a$ satisfies $\varphi(a)\propto a^{\alpha-1}(1-a)^{\alpha-1}$ and that the priors of the other parameters are positive and bounded, we obtain

λ(K3,φ)\displaystyle\lambda(K_{3},\varphi) ==1L1λ(δ2)+min(λ(a2,aα1),=1L1λ(β4))\displaystyle=\sum_{\ell=1}^{L-1}\lambda(\delta_{\ell}^{2})+\min\Bigl{(}\lambda(a^{2},a^{\alpha-1}),\ \sum_{\ell=1}^{L-1}\lambda(\beta_{\ell}^{4})\Bigr{)} (6.146)
\displaystyle=\frac{L-1}{2}+\min\Bigl(\frac{\alpha}{2},\ \frac{L-1}{4}\Bigr). (6.147)
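Compared with Eq.(6.145), only the contribution of $a$ changes: with the Dirichlet factor $a^{\alpha-1}$ (again a supplementary one-variable computation of ours),

\displaystyle\int_{0}^{\varepsilon}(a^{2})^{z}a^{\alpha-1}\,da=\frac{\varepsilon^{2z+\alpha}}{2z+\alpha},

so the pole lies at $z=-\frac{\alpha}{2}$ and $\lambda(a^{2},a^{\alpha-1})=\frac{\alpha}{2}$, which crosses $\sum_{\ell}\lambda(\beta_{\ell}^{4})=\frac{L-1}{4}$ exactly at $\alpha=\frac{L-1}{2}$.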

This completes the proof of Theorem 5.1(2). ∎

7 Phase transition due to prior distribution hyperparameters

In Bayesian statistics, if a prior distribution $\varphi(w;\theta)$ has a hyperparameter $\theta$ and the posterior distribution for sufficiently large $n$ changes drastically at $\theta=\theta_{c}$, then the posterior distribution is said to have a phase transition, and $\theta_{c}$ is called a critical point [9].

In the case of Theorem 5.1(2), the prior distribution of the mixing ratio $a$ of the multinomial mixture is the Dirichlet distribution with hyperparameter $\alpha$, and the asymptotic free energy $F_{n}(\alpha)$ is given by

\displaystyle\mathbb{E}[F_{n}(\alpha)]\simeq\begin{cases}nS+\frac{L-1+\alpha}{2}\log n&(\alpha<\frac{L-1}{2})\\ nS+\frac{3(L-1)}{4}\log n-\log\log n&(\alpha=\frac{L-1}{2})\\ nS+\frac{3(L-1)}{4}\log n&(\alpha>\frac{L-1}{2})\end{cases}. (7.1)

From Eq.(7.1), $\mathbb{E}[F_{n}(\alpha)]$ is not differentiable at $\alpha_{c}=\frac{L-1}{2}$, so $\alpha_{c}$ is a phase transition point. When a phase transition point exists, the support of the posterior distribution changes significantly between the two phases, which greatly affects the result of statistical inference.
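As a quick illustration (ours, not from the original), the following Python sketch evaluates the coefficient $\lambda(\alpha)$ of $\log n$ in Eq.(7.1) and exhibits the kink at $\alpha_{c}=\frac{L-1}{2}$; the value $L=5$ is an arbitrary example.

    def lam(alpha, L):
        # coefficient of log n: (L - 1)/2 + min(alpha/2, (L - 1)/4)
        return (L - 1) / 2 + min(alpha / 2, (L - 1) / 4)

    L = 5
    alpha_c = (L - 1) / 2  # critical point, here 2.0
    for alpha in (0.5, 1.0, 1.5, alpha_c, 2.5, 3.0):
        print(f"alpha = {alpha:4.2f}  lambda = {lam(alpha, L):5.3f}")
    # The slope in alpha is 1/2 below alpha_c and 0 above it, so
    # E[F_n(alpha)] is continuous but not differentiable at alpha_c.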

8 Conclusions

In this paper, we derived the real log canonical threshold and multiplicity in the case where the probability model is a multinomial mixture with two components, the prior is a Dirichlet distribution, and the true distribution is a multinomial distribution, and we clarified the asymptotic behaviors of the free energy and the generalization error. One direction for future work is to find the RLCTs and multiplicities of multinomial mixtures with a general number of components.

References

  • [1] Takeshi Watanabe and Einoshin Suzuki. Prototyping abnormal medical test values in hepatitis data with mixture multinomial distribution estimate. ICS, Vol. 2002, No. 45 (2002-ICS-128), pp. 49–54, 2002.
  • [2] Tomonari Masada, Atsuhiro Takasu, and Jun Adachi. Clustering for name disambiguation in author citations. DBSJ Letters, Vol. 6, No. 1, 2007.
  • [3] Tomonari Masada, Senya Kiyasu, and Sueharu Miyahara. Clustering images with multinomial mixture models. In International Symposium on Advanced Intelligent Systems, 2007.
  • [4] Keisuke Yamazaki and Sumio Watanabe. Resolution of singularities in mixture models and its stochastic complexity. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'02), Vol. 3, pp. 1355–1359. IEEE, 2002.
  • [5] Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, Vol. 19, No. 6, pp. 716–723, 1974.
  • [6] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, pp. 461–464, 1978.
  • [7] David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin, and Angelika Van Der Linde. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 64, No. 4, pp. 583–639, 2002.
  • [8] Sumio Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation, Vol. 13, No. 4, pp. 899–933, 2001.
  • [9] Sumio Watanabe. Mathematical Theory of Bayesian Statistics. CRC Press, 2018.
  • [10] Miki Aoyagi and Sumio Watanabe. Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Networks, Vol. 18, No. 7, pp. 924–933, 2005.
  • [11] Kenichiro Sato and Sumio Watanabe. Bayesian generalization error of Poisson mixture and simplex Vandermonde matrix type singularity. arXiv preprint arXiv:1912.13289, 2019.
  • [12] Keisuke Yamazaki and Daisuke Kaji. Comparing two Bayes methods based on the free energy functions in Bernoulli mixtures. Neural Networks, Vol. 44, pp. 36–43, 2013.
  • [13] Naoki Hayashi. The exact asymptotic form of Bayesian generalization error in latent Dirichlet allocation. Neural Networks, Vol. 137, pp. 127–137, 2021.
  • [14] Kenji Nagata and Sumio Watanabe. Asymptotic behavior of exchange ratio in exchange Monte Carlo method. Neural Networks, Vol. 21, No. 7, pp. 980–988, 2008.
  • [15] Mathias Drton and Martyn Plummer. A Bayesian information criterion for singular models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 79, No. 2, pp. 323–380, 2017.
  • [16] Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory. No. 25 in Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2009.
  • [17] Sumio Watanabe. Algebraic geometrical methods for hierarchical learning machines. Neural Networks, Vol. 14, No. 8, pp. 1049–1060, 2001.
  • [18] Takeshi Matsuda and Sumio Watanabe. Weighted blowups of Kullback information and application to multinomial distributions. IEICE Proceedings Series, Vol. 42, No. B2L-C2, 2008.
  • [19] Imre Csiszár. A simple proof of Sanov's theorem. Bulletin of the Brazilian Mathematical Society, Vol. 37, No. 4, pp. 453–459, 2006.
  • [20] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • [21] Takumi Watanabe and Sumio Watanabe. Asymptotic behavior of Bayesian generalization error in multinomial mixtures. IEICE Technical Report, Vol. 119, No. 360, pp. 1–8, 2020.
  • [22] Keisuke Yamazaki and Sumio Watanabe. Singularities in mixture models and upper bounds of stochastic complexity. Neural Networks, Vol. 16, No. 7, pp. 1029–1038, 2003.
  • [23] Miki Aoyagi. A Bayesian learning coefficient of generalization error and Vandermonde matrix-type singularities. Communications in Statistics - Theory and Methods, Vol. 39, No. 15, pp. 2667–2687, 2010.
  • [24] Heisuke Hironaka. Resolution of singularities of an algebraic variety over a field of characteristic zero: II. Annals of Mathematics, pp. 205–326, 1964.