
Asymptotic Behavior of Bayesian Generalization Error in Multinomial Mixtures

Takumi Watanabe watanabe.t.bv@m.titech.ac.jp Department of Mathematical and Computing Science, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8552, Japan
Sumio Watanabe swatanab@c.titech.ac.jp Department of Mathematical and Computing Science, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8552, Japan
Abstract

Multinomial mixtures are widely used in information engineering; however, their mathematical properties have not yet been clarified because they are singular learning models. In fact, these models are non-identifiable and their Fisher information matrices are not positive definite. In recent years, the mathematical foundation of singular statistical models has been clarified by algebraic geometric methods. In this paper, we determine the real log canonical thresholds and multiplicities of multinomial mixtures and derive the asymptotic behaviors of their generalization error and free energy.
Keywords: generalization error, free energy, multinomial mixtures, real log canonical thresholds

1 Introduction

A finite mixture model is a probability distribution defined by a linear superposition of a finite number of distributions. Examples such as normal mixtures, Poisson mixtures, and multinomial mixtures have been applied in many research areas. In this paper, we mainly study the multinomial mixture, which provides a richer class of statistical models than the single multinomial distribution. The multinomial mixture has been applied to document clustering [1], anomaly detection in medical data [2], and image clustering [3]. In spite of this wide range of applications, the mathematical properties of its generalization performance have not yet been clarified. One of the mathematical difficulties is caused by the fact that these models are not identifiable. If the map from the set of parameters to the set of probability distributions is one-to-one, a statistical model is called identifiable; otherwise, it is called non-identifiable [4]. Such models are classified as singular models. If a probability model is singular, the posterior distribution cannot be approximated by any normal distribution, and classical information criteria for regular statistical models such as AIC [5], BIC [6], or DIC [7] cannot be applied to estimate the generalization losses of singular models.

Recently, in order to establish the mathematical foundation of Bayesian inference for singular models, Watanabe derived the asymptotic behaviors of the generalization error $G_n$ and the free energy $F_n$ [8]. There exist a positive real number $\lambda$ and a natural number $m$ such that the asymptotic behaviors of $F_n$ and $G_n$ are respectively given by

\mathbb{E}[F_n] = nS + \lambda\log n - (m-1)\log\log n + O(1),  (1.1)
\mathbb{E}[G_n] = \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right),  (1.2)

where $\lambda$ is called a real log canonical threshold (RLCT), $m$ is called a multiplicity, and $\mathbb{E}[\cdot]$ denotes the expectation value over all datasets. If a learning model is identifiable and regular, $\lambda = d/2$ and $m = 1$, where $d$ is the dimension of the parameter space [6]. If it is singular, $\lambda$ and $m$ depend on the true distribution, the model, and the prior. In the singular case, it was shown by [9] that both the RLCT and the multiplicity can be found by using the desingularization theorem of algebraic geometry. In general, the parameter set $K(w)=0$ contains complicated singularities, so it is difficult to find the resolution map; nevertheless, RLCTs and multiplicities have been clarified for several statistical models and learning machines. Examples of models whose RLCTs have been found include normal mixtures [10], Poisson mixtures [11], Bernoulli mixtures [12], rank regression [10], Latent Dirichlet Allocation (LDA) [13], and so on. In addition, RLCTs have been used to analyze the exchange rate of the replica exchange method [14], which is one of the Markov chain Monte Carlo methods. Moreover, in recent years, the information criterion $\mathrm{sBIC}$, which uses RLCTs in its calculation, has also been proposed [15].
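As a small numerical illustration (a sketch of our own, not part of the original analysis), the leading terms of (1.1) and (1.2) can be evaluated for given $\lambda$, $m$, $n$, and $S$; the concrete values below are hypothetical.

```python
import numpy as np

def asymptotic_free_energy(n, S, lam, m):
    # Leading terms of E[F_n], Eq. (1.1), dropping the O(1) remainder
    return n * S + lam * np.log(n) - (m - 1) * np.log(np.log(n))

def asymptotic_gen_error(n, lam, m):
    # Leading terms of E[G_n], Eq. (1.2), dropping the o(1/(n log n)) remainder
    return lam / n - (m - 1) / (n * np.log(n))

for n in [100, 1000, 10000]:
    print(n, asymptotic_gen_error(n, lam=1.5, m=2))
```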

In this paper, we clarify the RLCT of multinomial mixtures and derive the asymptotic behaviors of the generalization error and the free energy. Our analysis also shows the effect of the hyperparameter when a Dirichlet distribution is employed as the prior. We begin in Section 2 with an introduction to the framework of Bayesian inference. In Section 3 we explain multinomial mixtures, and in Section 4 we introduce previous studies on the RLCTs and multiplicities of multinomial mixtures. In Section 5 we state the main theorem of this paper. In Section 6 we prove the theorem. In Section 7 we discuss the phase transition due to the hyperparameters, and in Section 8 we conclude this paper.

2 Bayes estimation

In this section, we introduce the framework of Bayesian inference. Let $q(x)$ be a true probability distribution and let $X^n=(X_1,\dots,X_n)$ be a set of training data generated from $q(x)$ independently and identically. Let $p(x|w)$ be a probability model, where $w\in W\subset\mathbb{R}^d$ is a parameter and $W$ is the parameter space. The prior probability distribution $\varphi(w)$ is a function on $W$. The posterior distribution $p(w|X^n)$ is defined by

p(w|X^n) = \frac{1}{Z_n}\varphi(w)\prod_{i=1}^{n}p(X_i|w),  (2.1)

where $Z_n$ is the normalizing constant:

Z_n = \int\varphi(w)\prod_{i=1}^{n}p(X_i|w)\,\mathrm{d}w.  (2.2)

The constant $Z_n$ is called the marginal likelihood. The free energy $F_n$ is defined as the negative log marginal likelihood:

F_n = -\log Z_n.  (2.3)

The predictive distribution $p(x|X^n)$ is given by

p(x|X^n) = \int p(x|w)\,p(w|X^n)\,\mathrm{d}w.  (2.4)

The generalization error $G_n$ is the Kullback-Leibler divergence from the true distribution $q(x)$ to the predictive distribution $p(x|X^n)$:

G_n = \int q(x)\log\frac{q(x)}{p(x|X^n)}\,\mathrm{d}x.  (2.5)

The generalization error measures how different the predictive distribution $p(x|X^n)$ is from the true distribution $q(x)$.
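For discrete distributions such as those treated in this paper, the integral in (2.5) is a finite sum. A minimal sketch (our own, with hypothetical distributions):

```python
import numpy as np

def kl_divergence(q, p):
    # sum_x q(x) log(q(x)/p(x)), the discrete form of Eq. (2.5);
    # assumes p(x) > 0 wherever q(x) > 0
    q, p = np.asarray(q), np.asarray(p)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q_true = [0.5, 0.3, 0.2]      # hypothetical true distribution
p_pred = [0.45, 0.35, 0.20]   # hypothetical predictive distribution
print(kl_divergence(q_true, p_pred))
```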

For an arbitrary function $f: x^n\mapsto f(x^n)$, the expectation value of $f(X^n)$ over all sets of training samples is denoted by $\mathbb{E}[\cdot]$, that is,

\mathbb{E}[f(X^n)] = \int\dots\int f(x_1,\dots,x_n)\prod_{i=1}^{n}q(x_i)\,\mathrm{d}x_i.  (2.6)

Let the mean error function $K(w)$ be the Kullback-Leibler divergence from the true distribution to the probability model:

K(w) = \int q(x)\log\frac{q(x)}{p(x|w)}\,\mathrm{d}x.  (2.7)

The entropy $S$ of the true distribution and the empirical entropy $S_n$ are defined respectively by

S = -\int q(x)\log q(x)\,\mathrm{d}x,  (2.8)
S_n = -\frac{1}{n}\sum_{i=1}^{n}\log q(X_i).  (2.9)

It is known that the following relationship holds between the free energy and the generalization error [16]:

\mathbb{E}[G_n] = \mathbb{E}[F_{n+1}] - \mathbb{E}[F_n] - S.  (2.10)

The relation (2.10) is important because in most cases we do not know the true distribution $q(x)$, whereas the free energy can be calculated from the prior $\varphi(w)$, the probability model $p(x|w)$, and a sample $X^n$.
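As a sanity check of this framework on a regular model (a sketch of our own using a conjugate Bernoulli model, not the multinomial mixture studied in this paper), $\mathbb{E}[G_n]$ can be computed exactly by averaging the KL divergence (2.5) over the binomial distribution of the sufficient statistic, and one observes $\mathbb{E}[G_n]\approx\lambda/n$ with $\lambda=d/2=1/2$:

```python
import numpy as np
from scipy.stats import binom

q, a0, b0 = 0.3, 1.0, 1.0   # true Bernoulli parameter, Beta(a0, b0) prior

def expected_gen_error(n):
    k = np.arange(n + 1)
    w = binom.pmf(k, n, q)            # distribution of the count k = sum_i X_i
    p_hat = (a0 + k) / (a0 + b0 + n)  # Bayes predictive P(x = 1 | X^n)
    kl = q * np.log(q / p_hat) + (1 - q) * np.log((1 - q) / (1 - p_hat))
    return float(np.sum(w * kl))

for n in [100, 1000, 10000]:
    print(n, expected_gen_error(n), 0.5 / n)   # E[G_n] vs lambda/n with lambda = 1/2
```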

Let $\mathrm{Re}(z)$ be the real part of a complex number $z$. Define the zeta function of statistical learning theory as

\zeta(z) = \int K(w)^{z}\varphi(w)\,\mathrm{d}w.  (2.11)

If $K(w)\geq 0$ is an analytic function of $w$, then the function $\zeta(z)$ is holomorphic in the region $\mathrm{Re}(z)>0$, and it can be analytically continued to a unique meromorphic function on the entire complex plane. Moreover, it is known that all of its poles are negative real numbers.

In the following, assume that the mean error function $K(w)$ is analytic and that the true distribution is feasible with the probability model. Here, the true distribution $q(x)$ is said to be feasible with the probability model $p(x|w)$ if there is a parameter $w^*\in W$ such that $q(x)=p(x|w^*)$ holds for all $x$. Assume that the maximum pole of the zeta function $\zeta(z)$ is $-\lambda$ and its order is $m$. By applying Hironaka's resolution theorem of algebraic geometry, the asymptotic behaviors of the free energy and the generalization error can be expressed as follows [8][17]:

\mathbb{E}[F_n] = nS + \lambda\log n - (m-1)\log\log n + O(1),  (2.12)
\mathbb{E}[G_n] = \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right).  (2.13)

The constants $\lambda$ and $m$ are called the real log canonical threshold (RLCT) and the multiplicity, respectively.
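As a minimal worked example (our own, for orientation), take the one-dimensional case $K(w)=w^2$ on $W=[0,1]$ with the uniform prior $\varphi(w)=1$. Then

\zeta(z) = \int_0^1 w^{2z}\,\mathrm{d}w = \frac{1}{2z+1},

which has a single pole at $z=-1/2$ of order $1$, so $\lambda=1/2$ and $m=1$, consistent with the regular case $\lambda=d/2$ for $d=1$.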

3 Multinomial Mixtures

3.1 Multinomial Distribution

Let $\mathbb{Z}_{\geq 0}$ be the set of all non-negative integers and $\mathbb{R}_{\geq 0}$ the set of all non-negative real numbers. Let $L$ and $M$ be natural numbers greater than or equal to two, and define the set $D$ by

D = \left\{x=(x_1,\dots,x_L)\in\left(\mathbb{Z}_{\geq 0}\right)^L : \sum_{\ell=1}^{L}x_\ell = M\right\}.  (3.1)

The parameter vectors $b\in\mathbb{R}^L$ belong to the set $B$:

B = \left\{b=(b_1,\dots,b_L)\in\left(\mathbb{R}_{\geq 0}\right)^L : \sum_{\ell=1}^{L}b_\ell = 1,\ 0\leq b_\ell\leq 1\right\}.  (3.2)

The probability distribution of $x\in D$ determined by the vector $b$,

\mathrm{Mul}_L(x|b) = \frac{M!}{\prod_{\ell=1}^{L}x_\ell!}\prod_{\ell=1}^{L}(b_\ell)^{x_\ell},

is called the multinomial distribution. Here we define $0^0=1$ and $0!=1$. The constant $M$ represents the number of independent trials of the multinomial distribution, and the parameter $b=(b_1,\dots,b_L)$ represents the corresponding probabilities. The multinomial distribution is a generalization of several discrete distributions. If $M=1$ and $L=2$, the multinomial distribution is called the Bernoulli distribution. If $M=1$ and $L\geq 2$, it is called the categorical distribution. If $M\geq 2$ and $L=2$, it is called the binomial distribution.
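A quick numerical sketch of this definition (our own; the counts and probabilities are hypothetical):

```python
from scipy.stats import multinomial

M, b = 5, [0.2, 0.3, 0.5]   # M trials, L = 3 categories
x = [1, 2, 2]               # counts with sum(x) == M
print(multinomial.pmf(x, n=M, p=b))   # Mul_L(x|b)
# Special cases: M=1, L=2 is Bernoulli; M=1, L>=2 is categorical; M>=2, L=2 is binomial.
```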

3.2 Multinomial Mixtures

Let $H$ be a natural number greater than or equal to two. The parameter set $W$ is defined by

W = \left\{(a,b) : \sum_{h=1}^{H}a_h = 1,\ 0\leq a_h\leq 1,\ b_h=(b_{h1},\dots,b_{hL})\in B\ (\forall h\in[H])\right\},  (3.3)

where $[H]$ denotes the set $\{h\in\mathbb{Z} : 1\leq h\leq H\}$.

The probability distribution on $x\in D$ determined by the parameter $w=(a,b)\in W$,

p(x|w) = \sum_{h=1}^{H}a_h\,\mathrm{Mul}_L(x|b_h),  (3.4)

is called a multinomial mixture. Here $H$ represents the number of components. The $H$-dimensional parameter $a=(a_1,\dots,a_H)$ represents the mixing ratio; each $a_h$ is assumed to be non-negative with $\sum_{h=1}^{H}a_h=1$, so $a_h$ represents the weight of the $h$-th component distribution. A higher mixing ratio $a_h$ means a stronger effect of the $h$-th component.
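Definition (3.4) translates directly into code (a sketch of our own; the mixing ratios and component parameters below are hypothetical):

```python
from scipy.stats import multinomial

def mixture_pmf(x, a, b, M):
    # p(x|w) = sum_h a_h Mul_L(x|b_h), Eq. (3.4)
    return sum(a_h * multinomial.pmf(x, n=M, p=b_h) for a_h, b_h in zip(a, b))

a = [0.4, 0.6]                           # mixing ratio, H = 2 components
b = [[0.1, 0.4, 0.5], [0.6, 0.2, 0.2]]   # component parameters, L = 3
print(mixture_pmf([1, 1, 1], a, b, M=3))
```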

4 Previous Studies

In this section, we introduce several previous studies on the real log canonical thresholds of multinomial mixtures. When the probability model is a binomial mixture, an upper bound on the RLCT for a general number of components and its exact value in special cases have been clarified [12].

Theorem 4.1 (the RLCT and multiplicity of binomial mixtures [12]).

Let $x=(y_1,\dots,y_M)\in\{0,1\}^M$ be an $M$-dimensional binary vector and let $p(x|w)$ be a binomial mixture,

p(x|w) = \sum_{h=1}^{H}a_h\prod_{m=1}^{M}p_{hm}^{y_m}(1-p_{hm})^{1-y_m}.  (4.1)

It is assumed that the true distribution $q(x)$ is a binomial mixture,

q(x) = \sum_{h=1}^{H_0}a^*_h\prod_{m=1}^{M}(p^*_{hm})^{y_m}(1-p^*_{hm})^{1-y_m}.  (4.2)

Let the prior distribution $\varphi(w)$ be

\varphi(w;\eta) = \varphi_0(a;\alpha)\prod_{h=1}^{H}\prod_{m=1}^{M}\varphi_1(p_{hm};\beta),  (4.3)

where $\eta=\{\alpha,\beta\}$ is a set of hyperparameters and $\varphi_0(a;\alpha)$ is the prior distribution of the mixing ratio $a$ with hyperparameter $\alpha>0$ (a Dirichlet distribution),

\varphi_0(a;\alpha) = \frac{\Gamma(H\alpha)}{\Gamma(\alpha)^H}\left(\prod_{h=1}^{H-1}a_h^{\alpha-1}\right)\left(1-\sum_{i=1}^{H-1}a_i\right)^{\alpha-1},  (4.4)

and $\varphi_1(p_{hm};\beta)$ ($\beta>0$) is a beta distribution for each $h\in[H]$, $m\in[M]$,

\varphi_1(p_{hm};\beta) = \frac{\Gamma(2\beta)}{\Gamma(\beta)^2}p_{hm}^{\beta-1}(1-p_{hm})^{\beta-1}.  (4.5)

A component is called deterministic when $p^*_{hm}$ is one or zero, and probabilistic otherwise. Let $H_1$ and $H_2$ be the numbers of probabilistic and deterministic components, respectively, where $H=H_1+H_2$. Under the above conditions, the asymptotic behavior of the free energy $F_n$ is expressed as follows:

F_n \leq nS + \mu\log n - (m_\mu-1)\log\log n + o(\log\log n),  (4.6)

where $S$ is the entropy of the true distribution, and $\mu$ and $m_\mu$ are defined as follows. For $M\geq 3$,

\mu = \frac{H_0-1+H_1M+H_2M\beta}{2} + \frac{H-H_0}{2}\min\left\{\alpha,\ \frac{M}{2},\ \frac{\beta M}{2}\right\},  (4.7)
m_\mu = \begin{cases}2 & (\alpha=\min\{M/2,\ \beta M/2\})\\ 1 & (\mathrm{otherwise})\end{cases}.  (4.8)

For $M=2$,

\mu = \frac{H_0-1+H_1M+H_2M\beta}{2} + \frac{H-H_0}{2}\min\{\alpha,\ 1,\ \beta\},  (4.9)
m_\mu = \begin{cases}3 & (\alpha=\min\{1,\ \beta\})\\ 2 & (\alpha>\min\{1,\ \beta\})\\ 1 & (\mathrm{otherwise})\end{cases}.  (4.10)

Furthermore, if $H=H_0+1$, that is, when the number of components in the probability model is one greater than that in the true distribution, $\mu$ equals the exact value of the RLCT $\lambda$, and $m_\mu$ also equals the multiplicity $m$.
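Equations (4.7)-(4.10) are straightforward to transcribe (a sketch of our own; the parameter values in the final call are hypothetical):

```python
def free_energy_coefficients(H, H0, H1, H2, M, alpha, beta):
    # mu and m_mu from Theorem 4.1, Eqs. (4.7)-(4.10)
    if M >= 3:
        thresh = min(M / 2, beta * M / 2)
        m_mu = 2 if alpha == thresh else 1
    else:  # M == 2
        thresh = min(1, beta)
        m_mu = 3 if alpha == thresh else (2 if alpha > thresh else 1)
    mu = (H0 - 1 + H1 * M + H2 * M * beta) / 2 + (H - H0) / 2 * min(alpha, thresh)
    return mu, m_mu

print(free_energy_coefficients(H=3, H0=2, H1=2, H2=0, M=4, alpha=1.0, beta=1.0))
```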

Matsuda analyzed the RLCT of trinomial mixtures with two components. The exact value of the RLCT was elucidated when the true distribution is a multinomial distribution and the probability model is a trinomial mixture with two components, that is, in the case $L=3$, $H=2$.

Theorem 4.2 (the RLCT and multiplicity of trinomial mixtures with two components [18]).

Let the probability model $p(x|w)$ be a trinomial mixture with two components:

p(x|w) = a\,\mathrm{Mul}_3(x|b_1) + (1-a)\,\mathrm{Mul}_3(x|b_2),\quad (a,b_1,b_2)\in W.  (4.11)

Here, $\mathrm{Mul}_3(x|b)$ denotes the probability mass function of the trinomial distribution with parameter $b=(b_1,b_2,b_3)$. Also, let the true distribution $q(x)$ be a trinomial distribution:

q(x) = \mathrm{Mul}_3(x|b^*).  (4.12)

Also, assume that the prior distribution $\varphi(w)$ is positive and bounded on the parameter set $W$, and that the true parameter $b^*=(b_1^*,b_2^*,b_3^*)$ satisfies

b_1^*b_2^*b_3^* \neq 0.  (4.13)

Under these conditions, the RLCT is as follows:

\lambda = \frac{3}{2}.  (4.14)

Matsuda clarified the RLCT of a trinomial mixture with two components, that is, the case $H=2$, $H^*=1$, $L=3$, by using an algebraic geometry technique called weighted blow-up.

5 Main Theorem

In this section, we state the main result of this paper, which is a generalization of Theorem 4.2. We clarify the RLCT and the multiplicity of general multinomial mixtures with two components. Furthermore, we also consider the case where the Dirichlet distribution is adopted as the prior distribution of the mixing ratio.

Theorem 5.1 (Main Theorem).

Let the probability model $p(x|w)$ be a multinomial mixture with two components:

p(x|w) = a\,\mathrm{Mul}_L(x|b_1) + (1-a)\,\mathrm{Mul}_L(x|b_2),\quad (a,b_1,b_2)\in W.  (5.1)

Also, let the true distribution be a multinomial distribution:

q(x) = \mathrm{Mul}_L(x|b^*).  (5.2)

Also, assume that the prior distribution of the parameter $b$ is positive and bounded on the set $W$, and that the true parameter $b^*=(b_1^*,\dots,b_L^*)$ satisfies

\prod_{\ell=1}^{L}b_\ell^* \neq 0.  (5.3)

We consider the following two cases for the prior distribution of the mixing ratio $a$:

  1. If the prior distribution $\varphi(a)$ of the mixing ratio $a$ is positive and bounded, the RLCT $\lambda$ and the multiplicity $m$ are given by

    \lambda = \frac{L-1}{2} + \min\left(\frac{1}{2},\ \frac{L-1}{4}\right),  (5.4)
    m = \begin{cases}2 & (L=3)\\ 1 & (\mathrm{otherwise})\end{cases}.  (5.5)

  2. If the prior distribution $\varphi(a)$ of the mixing ratio $a$ is the Dirichlet distribution with $\alpha\ (>0)$ as a hyperparameter:

    \varphi(a;\alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2}a^{\alpha-1}(1-a)^{\alpha-1},  (5.6)

    the RLCT $\lambda$ and the multiplicity $m$ are given by

    \lambda = \frac{L-1}{2} + \min\left(\frac{\alpha}{2},\ \frac{L-1}{4}\right),  (5.7)
    m = \begin{cases}2 & \left(\alpha=\frac{L-1}{2}\right)\\ 1 & (\mathrm{otherwise})\end{cases}.  (5.8)
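The two cases can be evaluated together, since case 1 coincides with case 2 at $\alpha=1$ (a transcription sketch of our own):

```python
def rlct_two_component_mixture(L, alpha=1.0):
    # Theorem 5.1: RLCT lambda and multiplicity m for an L-nomial mixture with
    # two components; alpha=1.0 reproduces case 1 (positive bounded prior).
    lam = (L - 1) / 2 + min(alpha / 2, (L - 1) / 4)
    m = 2 if alpha == (L - 1) / 2 else 1
    return lam, m

for L in [2, 3, 4, 10]:
    print(L, rlct_two_component_mixture(L), rlct_two_component_mixture(L, alpha=0.5))
```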

6 Proof of the Main Theorem

6.1 Properties of the RLCTs and the multiplicities

To prove Theorem 5.1, we introduce notation and explain some properties of the RLCTs and the multiplicities. Since the RLCT $\lambda$ and the multiplicity $m$ are determined by the mean error function $K(w)$ and the prior distribution $\varphi(w)$, they are written as $\lambda(K,\varphi)$, $m(K,\varphi)$, or $\lambda(K)$, $m(K)$, respectively. If the maximum poles and their orders of the two zeta functions $\zeta_1(z)$, $\zeta_2(z)$:

\zeta_1(z) = \int K(w)^{z}\varphi(w)\,\mathrm{d}w,  (6.1)
\zeta_2(z) = \int K'(w)^{z}\varphi(w)\,\mathrm{d}w,  (6.2)

are equal, they are written as

K(w) \sim K'(w),  (6.3)

or $\lambda(K,\varphi)=\lambda(K',\varphi)$, $m(K,\varphi)=m(K',\varphi)$.

The following properties hold for the RLCTs and multiplicities [9][18].

Lemma 6.1.

Let $K(w)$ be the mean error function and let $\varphi(w)$ be the prior distribution.

  1. If there exist a function $K'(w)$ and constants $c,c'>0$ such that

    cK'(w) \leq K(w) \leq c'K'(w)  (6.4)

    holds for any $w\in W$, then $K(w)\sim K'(w)$.

  2. If $w=(w_1,w_2)$, $K(w)=K_1(w_1)+K_2(w_2)$, and $\varphi(w)=\varphi_1(w_1)\varphi_2(w_2)$, the following holds:

    \lambda(K,\varphi) = \lambda(K_1,\varphi_1) + \lambda(K_2,\varphi_2),  (6.5)
    m(K,\varphi) = m(K_1,\varphi_1) + m(K_2,\varphi_2) - 1.  (6.6)

  3. If $w=(w_1,w_2)$, $K(w)=K_1(w_1)K_2(w_2)$, and $\varphi(w)=\varphi_1(w_1)\varphi_2(w_2)$, then

    \lambda(K,\varphi) = \min\bigl(\lambda(K_1,\varphi_1),\ \lambda(K_2,\varphi_2)\bigr),  (6.7)
    m(K,\varphi) = \begin{cases}m(K_1,\varphi_1) & (\lambda(K_1,\varphi_1)<\lambda(K_2,\varphi_2))\\ m(K_1,\varphi_1)+m(K_2,\varphi_2) & (\lambda(K_1,\varphi_1)=\lambda(K_2,\varphi_2))\\ m(K_2,\varphi_2) & (\lambda(K_1,\varphi_1)>\lambda(K_2,\varphi_2))\end{cases}.  (6.8)

  4. Let $I,J$ be natural numbers, and let $\{f_i(w)\}_{i=1}^{I}$, $\{g_j(w)\}_{j=1}^{J}$ be sets of analytic functions. If the ideal generated by $\{f_i(w)\}_{i=1}^{I}$ and the ideal generated by $\{g_j(w)\}_{j=1}^{J}$ are equal and

    K_1(w) = \sum_{i=1}^{I}f_i(w)^2,\quad K_2(w) = \sum_{j=1}^{J}g_j(w)^2,  (6.9)

    then $K_1(w)\sim K_2(w)$.

  5. For any bounded functions $F(w)$, $G(w)$, $H(w)$ on a compact set,

    H(w)^2 + \left(F(w)+H(w)G(w)\right)^2 \sim H(w)^2 + F(w)^2.  (6.10)

  6. Let $K'(w)$ be the following function:

    K'(w) = \sum_{x\in D}\left(p(x|w)-q(x)\right)^2,  (6.11)

    then $K(w)\sim K'(w)$.

6.2 The restriction on the parameter set of general multinomial mixtures

To prove Theorem 5.1, we prepare some lemmas. As mentioned in Section 2, the zeta function $\zeta(z)$ is determined by the prior distribution $\varphi(w)$ and the mean error function $K(w)$, and the mean error function is defined as the KL divergence between the true distribution and the probability model. Thus, the mean error function $K(w)$ is

K(w) = \sum_{x\in D}q(x)\log\frac{q(x)}{p(x|w)}.  (6.12)

However, in the case of multinomial mixtures, some problems arise when considering the mean error function $K(w)$ on the entire parameter set $W$. When the probability model is a multinomial mixture, $p(x|w)=0$ for some $w\in W$ and some $x\in D$. Since the true distribution $q(x)$ is not $0$ by assumption, at the points $w$ such that $p(x|w)=0$ the value $q(x)/p(x|w)$ is not finite and the mean error function $K(w)$ diverges. Thus, the results on the asymptotic behavior of the generalization error in the reference [8] cannot be applied directly. To solve this problem, we prove that even if the original parameter set $W$ is restricted to the set $W_1$, on which $p(x|w)>0$, the asymptotic behavior of the generalization error does not change.

Lemma 6.2 (the restriction on the parameter set).

Let $W$ be the parameter set of multinomial mixtures with $H$ components. Let the probability model be the multinomial mixture with $H$ components:

p(x|w) = \sum_{h=1}^{H}a_h\,\mathrm{Mul}_L(x|b_h)\quad (w\in W).  (6.13)

Let $q(x)$ be a multinomial mixture with $H^*$ components ($H\geq H^*$):

q(x) = \sum_{h=1}^{H^*}a_h^*\,\mathrm{Mul}_L(x|b_h^*).  (6.14)

Here, for any $h=1,\dots,H^*$,

\prod_{\ell=1}^{L}b^*_{h\ell} \neq 0.  (6.15)

Fix a sufficiently small number $0<\varepsilon<1$, let $W_1$ be the subset of $W$ such that $p(x|w)>\varepsilon$ for all $x\in D$, and let $W_2$ be the complement of $W_1$ (i.e., $W_2=W\setminus W_1$). Let $\lambda$ be minus the maximum pole, and $m$ its order, of the zeta function whose integration range is restricted to $W_1$:

\zeta(z) = \int_{W_1}K(w)^{z}\varphi(w)\,\mathrm{d}w.  (6.16)

Then, the asymptotic behavior of the generalization error is expressed by the following equation:

\mathbb{E}[G_n] = \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right).  (6.17)

Lemma 6.2 means that the asymptotic behavior of the generalization error of multinomial mixtures can be analyzed by finding the maximum pole of the zeta function whose integration range is restricted to $W_1$. Lemma 6.2 will be proved in Section 6.2.1.

6.2.1 The proof of Lemma 6.2

We now prove Lemma 6.2. By the definition of the generalization error,

\mathbb{E}[G_n] = \mathbb{E}\left[\log\frac{q(X)}{p(X|X^n)}\right].  (6.18)

Here $X_{n+1}$ is written as $X$. By the definitions of the predictive distribution and the posterior distribution, and by the assumption $q(x)>0$,

\frac{q(X_{n+1})}{p(X_{n+1}|X^n)} = \frac{q(X_{n+1})}{\int p(X_{n+1}|w)p(w|X^n)\,\mathrm{d}w}  (6.19)
= \frac{q(X_{n+1})}{\int\frac{1}{Z_n}\varphi(w)p(X_{n+1}|w)\prod_{i=1}^{n}p(X_i|w)\,\mathrm{d}w}  (6.20)
= \frac{q(X_{n+1})\int\varphi(w)\prod_{i=1}^{n}p(X_i|w)\,\mathrm{d}w}{\int\varphi(w)\prod_{i=1}^{n+1}p(X_i|w)\,\mathrm{d}w}  (6.21)
= \frac{\int\varphi(w)\prod_{i=1}^{n}\frac{p(X_i|w)}{q(X_i)}\,\mathrm{d}w}{\int\varphi(w)\prod_{i=1}^{n+1}\frac{p(X_i|w)}{q(X_i)}\,\mathrm{d}w}  (6.22)
= \frac{Z(X^n)}{Z(X^{n+1})},  (6.23)

where $Z(X^n)$ is the quantity defined by the following equation:

Z(X^n) = \int\varphi(w)\prod_{i=1}^{n}\frac{p(X_i|w)}{q(X_i)}\,\mathrm{d}w.  (6.24)

By Eq. (6.23),

\mathbb{E}[G_n] = \mathbb{E}\left[\log\frac{Z(X^n)}{Z(X^{n+1})}\right].  (6.25)

Here, fix a sufficiently small number $0<\varepsilon<1$, let $W_1$ be the subset of $W$ on which $p(x|w)>\varepsilon$ for all $x\in D$, and let $W_2$ be the rest of $W$. The integral over the parameter $w$ is divided into the two integration domains $W_1$ and $W_2$, and $Z_1(X^n)$ and $Z_2(X^n)$ are defined as follows:

Z_1(X^n) = \int_{W_1}\varphi(w)\prod_{i=1}^{n}\frac{p(X_i|w)}{q(X_i)}\,\mathrm{d}w,  (6.26)
Z_2(X^n) = \int_{W_2}\varphi(w)\prod_{i=1}^{n}\frac{p(X_i|w)}{q(X_i)}\,\mathrm{d}w.  (6.27)

Then, the relation $Z(X^n)=Z_1(X^n)+Z_2(X^n)$ holds, and

\mathbb{E}[G_n] = \mathbb{E}\left[\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right].  (6.28)

Since $p(x|w)>0$ on $W_1$, the result of [9] can be applied; for sufficiently large $n$, the following asymptotic behavior of $Z_1(X^n)$ holds:

Z_1(X^n) = \frac{(\log n)^{m-1}}{n^{\lambda}}Z_0(X^n),  (6.29)

where $Z_0(X^n)$ is a random variable of $X^n$ satisfying

\mathbb{E}\left[\log\frac{Z_0(X^{n+1})}{Z_0(X^n)}\right] = o\!\left(\frac{1}{n\log n}\right).  (6.30)

Here, we introduce the following Lemma.

Lemma 6.3.

There is a random variable $\Theta(X^n)$ that takes the value $1$ only on a certain event determined by $X^n$ and $0$ otherwise, such that the following equations hold:

\mathbb{E}\left[\Theta(X^{n+1})\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right] = O\left(\exp(-n)\right),  (6.31)
\mathbb{E}\left[\left(1-\Theta(X^{n+1})\right)\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right] = \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right).  (6.32)

If we can show Lemma 6.3, the following holds:

\mathbb{E}[G_n] = \mathbb{E}\left[\Theta(X^{n+1})\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right] + \mathbb{E}\left[\left(1-\Theta(X^{n+1})\right)\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right]  (6.33)
= \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right),  (6.34)

and the proof of Lemma 6.2 is completed. To prove Lemma 6.3, we prepare Sections 6.2.2 through 6.2.5.

6.2.2 Sanov’s theorem

Let $L$ be a natural number, and let $\mathcal{P}$ be the set of probability distributions on the finite set $\{1,2,\dots,L\}$:

\mathcal{P} = \left\{(p_1,\dots,p_L)\in\mathbb{R}_{\geq 0}^{L} : \sum_{\ell=1}^{L}p_\ell = 1,\ p_\ell\geq 0\right\}.  (6.35)

Let $q=(q_1,\dots,q_L)\in\mathcal{P}$ be the true distribution, which is fixed. Let $X_1,\dots,X_n$ be random variables independently and identically generated from the probability distribution $q$. Also, for each $\ell=1,\dots,L$, let the random variable $n_\ell$ be the number of $X_1,\dots,X_n$ whose value is $\ell$. Let the empirical distribution $r_n$ be

r_n = \left(\frac{n_1}{n},\dots,\frac{n_L}{n}\right).  (6.36)

Then, the next theorem holds.

Theorem 6.1 (Sanov’s Theorem [19]).

Let $A$ be a subset of $\mathcal{P}$ and let $\mathrm{Pr}(r_n\in A)$ be the probability that the empirical distribution $r_n$ is included in the set $A$. Then the following inequality holds:

\limsup_{n\to\infty}\frac{1}{n}\log\mathrm{Pr}(r_n\in A) \leq -\inf_{p\in A}D(p\|q).  (6.37)
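The exponential decay rate in (6.37) can be observed numerically (a Monte Carlo sketch of our own; the true distribution $q$ and the event $A$ below are hypothetical, and the agreement is slow because of polynomial prefactors):

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.5, 0.3, 0.2])
t = 0.6   # event A = { r in P : r_1 >= t }

# I-projection of q onto A: raise the first coordinate to t, rescale the rest.
p_star = np.concatenate(([t], q[1:] * (1 - t) / (1 - q[0])))
D_star = float(np.sum(p_star * np.log(p_star / q)))   # inf_{p in A} D(p||q)

for n in [50, 100, 200]:
    counts = rng.multinomial(n, q, size=200_000)
    prob = np.mean(counts[:, 0] / n >= t)
    print(n, np.log(prob) / n, -D_star)   # (1/n) log Pr(r_n in A) vs -inf D(p||q)
```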

6.2.3 The property of the multinomial distribution with one trial

Let $q=(q_1,\dots,q_L)\in\mathcal{P}$ be positive in all elements, i.e., $q_1,\dots,q_L>0$. Let the positive number $\varepsilon>0$ be sufficiently small and define the set $\mathcal{P}_\varepsilon$ as follows:

\mathcal{P}_\varepsilon = \{p=(p_1,\dots,p_L) : \text{there exists } \ell \text{ such that } p_\ell\leq\varepsilon\}.  (6.38)

Since $\varepsilon$ is assumed to be sufficiently small, we can take $\varepsilon$ so that $q$ is not included in the set $\mathcal{P}_\varepsilon$. By $q\notin\mathcal{P}_\varepsilon$ and the property of the KL divergence, $D(q\|s)>0$ holds for all $s\in\mathcal{P}_\varepsilon$. Next, the constant $c>0$ is fixed as an arbitrary number that satisfies $0<c<\inf_{s\in\mathcal{P}_\varepsilon}D(q\|s)$, and the set $A$ is defined by

A = \left\{r\in\mathcal{P} : \inf_{s\in\mathcal{P}_\varepsilon}D(r\|s) > D(r\|q)+\frac{c}{2}\right\}.  (6.39)

Now, we show that $q\in A$ and that $A$ and $\mathcal{P}_\varepsilon$ have no intersection. Indeed, $q\in A$ holds because $D(q\|q)=0$ and $\inf_{s\in\mathcal{P}_\varepsilon}D(q\|s)>c>c/2$. Next, suppose there exists a probability distribution $r\in A\cap\mathcal{P}_\varepsilon$. Since $r\in A$,

\inf_{s\in\mathcal{P}_\varepsilon}D(r\|s) > D(r\|q)+\frac{c}{2}.  (6.40)

However, since $r\in\mathcal{P}_\varepsilon$, $\inf_{s\in\mathcal{P}_\varepsilon}D(r\|s)=0$, which is a contradiction. Furthermore, we can take a sufficiently small positive number $\delta>0$ such that

\{r\in\mathcal{P} : D(r\|q)<\delta\}\subset A.  (6.41)

By the definition of the set $A$, if $r\in A^c$ then $D(r\|q)\geq\delta$. Applying Sanov's theorem to the set $A^c$,

\limsup_{n\to\infty}\frac{1}{n}\log\mathrm{Pr}(r_n\in A^c) \leq -\inf_{p\in A^c}D(p\|q).  (6.42)

Therefore, for sufficiently large $n$,

\mathrm{Pr}(r_n\in A^c) \leq \exp\left(-n\inf_{p\in A^c}D(p\|q)\right)  (6.43)
\leq \exp(-n\delta).  (6.44)

Thus,

\mathrm{Pr}(r_n\in A) \geq 1-\exp(-n\delta).  (6.45)

By the definition of the set $A$, with probability at least $1-\exp(-n\delta)$, for any $p\in\mathcal{P}_\varepsilon$,

D(r_n\|p) \geq D(r_n\|q)+\frac{c}{2}.  (6.46)

Since $X_1,\dots,X_n$ are random variables generated independently from the true distribution $q$, and $n_\ell$ is the number of $X_1,\dots,X_n$ whose value is $\ell$, Eq. (6.46) yields

\sum_{\ell=1}^{L}\frac{n_\ell}{n}\log\frac{q_\ell}{p_\ell} \geq \frac{c}{2}.  (6.47)

Rearranging Eq. (6.47),

\prod_{\ell=1}^{L}\frac{p_\ell^{n_\ell}}{q_\ell^{n_\ell}} \leq \exp\left(-\frac{nc}{2}\right).  (6.48)

Eq. (6.48) means that if $p(x)$ and $q(x)$ are probability mass functions of multinomial distributions with one trial, the following inequality holds with probability at least $1-\exp(-n\delta)$:

\prod_{i=1}^{n}\frac{p(X_i)}{q(X_i)} \leq \exp\left(-\frac{nc}{2}\right).  (6.49)

6.2.4 The properties of the predictive distribution of the multinomial mixtures

In this section, we consider the lower bound of the predictive distribution $p(x|X^n)$ when the probability model is the multinomial mixture:

p(x|a,b) = \frac{M!}{\prod_{\ell=1}^{L}x_\ell!}\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell}.  (6.50)

We introduce the latent variable $y=(y_1,\dots,y_H)$, a vector in which one element is $1$ and the others are $0$. By using the variable $y$, we can rewrite the model as follows:

p(x,y|a,b) = \frac{M!}{\prod_{\ell=1}^{L}x_\ell!}\prod_{h=1}^{H}\left(a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell}\right)^{y_h}.  (6.51)

Both the prior distribution of the multinomial parameters and that of the mixing ratio are Dirichlet distributions:

\varphi(a,b|\alpha,\beta) = \frac{1}{R(\alpha,\beta)}\prod_{h=1}^{H}\left\{a_h^{\alpha_h-1}\prod_{\ell=1}^{L}b_{h\ell}^{\beta_{h\ell}-1}\right\},  (6.52)

where $\alpha,\beta$ are hyperparameters and $R(\alpha,\beta)$ is the normalizing constant:

R(\alpha,\beta) = \frac{\prod_{h=1}^{H}\Gamma(\alpha_h)}{\Gamma\left(\sum_{h=1}^{H}\alpha_h\right)}\prod_{h=1}^{H}\frac{\prod_{\ell=1}^{L}\Gamma(\beta_{h\ell})}{\Gamma\left(\sum_{\ell=1}^{L}\beta_{h\ell}\right)}.  (6.53)

To calculate the predictive distribution, we first calculate the posterior distribution $p(a,b|X^n,Y^n)$:

p(a,b|X^n,Y^n) = \frac{1}{\hat{R}_n}\varphi(a,b|\alpha,\beta)\prod_{i=1}^{n}p(X_i,Y_i|a,b)  (6.54)
= \frac{1}{\hat{R}_n}\left[\frac{1}{R(\alpha,\beta)}\prod_{h=1}^{H}\left\{a_h^{\alpha_h-1}\prod_{\ell=1}^{L}b_{h\ell}^{\beta_{h\ell}-1}\right\}\right]\prod_{i=1}^{n}\left[\frac{M!}{\prod_{\ell=1}^{L}X_{i\ell}!}\prod_{h=1}^{H}\left(a_h\prod_{\ell=1}^{L}b_{h\ell}^{X_{i\ell}}\right)^{Y_{ih}}\right]  (6.55)
\propto \prod_{h=1}^{H}\left\{a_h^{\alpha_h+\sum_{i=1}^{n}Y_{ih}-1}\prod_{\ell=1}^{L}b_{h\ell}^{\beta_{h\ell}+\sum_{i=1}^{n}X_{i\ell}Y_{ih}-1}\right\},  (6.56)

where $\hat{R}_n$ is the normalizing constant of $p(a,b|X^n,Y^n)$:

\hat{R}_n = \iint\varphi(a,b|\alpha,\beta)\prod_{i=1}^{n}p(X_i,Y_i|a,b)\,\mathrm{d}a\,\mathrm{d}b  (6.57)
\propto \iint\prod_{h=1}^{H}\left\{a_h^{\alpha_h+\sum_{i=1}^{n}Y_{ih}-1}\prod_{\ell=1}^{L}b_{h\ell}^{\beta_{h\ell}+\sum_{i=1}^{n}X_{i\ell}Y_{ih}-1}\right\}\,\mathrm{d}a\,\mathrm{d}b  (6.58)
= R\left(\alpha+\sum_{i=1}^{n}Y_i,\ \beta+\sum_{i=1}^{n}X_iY_i\right).  (6.59)

Thus, the predictive distribution can be calculated as follows:

p(x,y|X^n,Y^n) = \iint p(x,y|a,b)\,p(a,b|X^n,Y^n)\,\mathrm{d}a\,\mathrm{d}b  (6.60)
\propto \iint\prod_{h=1}^{H}\left(a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell}\right)^{y_h}\frac{1}{\hat{R}_n}\prod_{h=1}^{H}\left\{a_h^{\alpha_h+\sum_{i=1}^{n}Y_{ih}-1}\prod_{\ell=1}^{L}b_{h\ell}^{\beta_{h\ell}+\sum_{i=1}^{n}X_{i\ell}Y_{ih}-1}\right\}\,\mathrm{d}a\,\mathrm{d}b  (6.61)
= \frac{1}{\hat{R}_n}\iint\prod_{h=1}^{H}\left\{a_h^{\alpha_h+\sum_{i=1}^{n}Y_{ih}+y_h-1}\prod_{\ell=1}^{L}b_{h\ell}^{\beta_{h\ell}+\sum_{i=1}^{n}X_{i\ell}Y_{ih}+x_\ell y_h-1}\right\}\,\mathrm{d}a\,\mathrm{d}b  (6.62)
= \frac{R\left(\alpha+\sum_{i=1}^{n}Y_i+y,\ \beta+\sum_{i=1}^{n}X_iY_i+xy\right)}{R\left(\alpha+\sum_{i=1}^{n}Y_i,\ \beta+\sum_{i=1}^{n}X_iY_i\right)}.  (6.63)

Since the latent variable $y$ is a vector in which one element is $1$ and the others are $0$, and the vector $x$ satisfies $\sum_{\ell=1}^{L}x_\ell=M$, applying the property of the Gamma function $\Gamma(x+1)=x\Gamma(x)$, we can show that

p(x,y|X^n,Y^n) = O\!\left(\frac{1}{n^M}\right).  (6.64)

Thus, there exists a positive constant $C>0$ such that for all $x,y$,

p(x,y|X^n,Y^n) > \frac{C}{n^M}.  (6.65)

By the definition of the marginal distribution,

p(x|X^n) = \sum_{y}\sum_{Y^n}p(x,y|X^n,Y^n) > \frac{C}{n^M}.  (6.66)

Therefore,

\log p(x|X^n) \geq \log\frac{C}{n^M} = \log C - M\log n.  (6.67)
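The ratio of normalizing constants in Eq. (6.63) is convenient to evaluate through log-Gamma functions. Below is a sketch of our own (hyperparameters, counts, and the query are hypothetical) showing that even a query assigned to a component with no data decays only polynomially in $n$, consistent with the lower bound (6.65):

```python
import numpy as np
from scipy.special import gammaln

def logR(alpha, beta):
    # log R(alpha, beta), Eq. (6.53): a product of multivariate Beta functions
    la = np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))
    lb = np.sum(np.sum(gammaln(beta), axis=1) - gammaln(np.sum(beta, axis=1)))
    return la + lb

def log_pred(x, h, Ysum, XYsum, alpha, beta):
    # log p(x, y | X^n, Y^n) with y = e_h, the ratio of R's in Eq. (6.63)
    y = np.zeros_like(alpha); y[h] = 1.0
    xy = np.zeros_like(beta); xy[h] = x
    return logR(alpha + Ysum + y, beta + XYsum + xy) - logR(alpha + Ysum, beta + XYsum)

alpha, beta, M = np.ones(2), np.ones((2, 3)), 4
for n in [10, 100, 1000]:
    Ysum = np.array([float(n), 0.0])                                  # all n points in component 0
    XYsum = np.array([[2.0 * n, 1.0 * n, 1.0 * n], [0.0, 0.0, 0.0]])  # their accumulated counts
    x = np.array([M, 0, 0])                                           # query in the empty component 1
    print(n, log_pred(x, 1, Ysum, XYsum, alpha, beta))                # ~ -log(n) + const
```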

In this section, the prior distribution of the parameters was taken to be a Dirichlet distribution. In the main theorem, we also consider the case of a positive and bounded prior distribution. For such a prior, since it does not affect the poles of the zeta function, the lower bound of the predictive distribution $p(x|X^n)$ is likewise not of exponential order, as in the above discussion.

6.2.5 Properties of the maximum likelihood estimator of multinomial mixtures

Multinomial mixtures with $M$ trials are finite distributions on $(\mathbb{Z}_{\geq 0})^L$ such that $x_1+\dots+x_L=M$. The set $\mathcal{J}$ is defined by

\mathcal{J} = \left\{(x_1,\dots,x_L)\in\left(\mathbb{Z}_{\geq 0}\right)^L : \sum_{\ell=1}^{L}x_\ell = M\right\}.  (6.68)

The set $\mathcal{J}$ is finite, and the number of its elements is $J$. Let $\mathcal{P}_J$ be the set of all discrete probability distributions on the finite set $\{1,2,\dots,J\}$:

\mathcal{P}_J = \left\{(p_1,\dots,p_J)\in\mathbb{R}_{\geq 0}^{J} : \sum_{j=1}^{J}p_j = 1,\ p_j\geq 0\right\}.  (6.69)

Every probability distribution on $\mathcal{J}$ that can be expressed by a multinomial mixture with $M$ trials is included in the set $\mathcal{P}_J$. Given that the probability model $p(x|w)$ and the true distribution $q(x)$ are both multinomial mixtures with $M$ trials, and that the corresponding distributions on $\mathcal{J}$ are $\bar{p}(x|w)$ and $\bar{q}(x)$, the mean error function $K(w)$ is expressed as follows:

K(w) = \sum_{x}\bar{q}(x)\log\frac{\bar{q}(x)}{\bar{p}(x|w)}.  (6.70)

Since the function $K(w)$ is the KL divergence between the true distribution $q(x)$ and the probability model $p(x|w)$, $K(w)=0$ if $p(x|w)=q(x)$, and otherwise $K(w)>0$. Now, we fix the positive constant $c>0$ of Section 6.2.3 and define the subset $A$ of $\mathcal{P}_J$ as follows:

A = \left\{p(x|w) : K(w)<\frac{c}{2}\right\}.  (6.71)

Since $q(x)$ is positive at all points $x\in D$ by assumption, fixing the positive constant $\varepsilon>0$ of Section 6.2.3, the subset $E$ of $\mathcal{P}_J$ is defined as follows:

E = \{p(x|w) : \exists x\in\{1,\dots,J\}\ \text{s.t.}\ \bar{p}(x|w)<\varepsilon\}.  (6.72)

The set $E$ can be defined so that the sets $E$ and $A$ have no intersection. Then, the log empirical loss $L_n$ can be calculated as follows:

L_n = \frac{1}{n}\sum_{i=1}^{n}\log\frac{\bar{q}(X_i)}{\bar{p}(X_i)}  (6.73)
= \sum_{j=1}^{J}\frac{n_j}{n}\log\frac{\bar{q}_j}{\bar{p}_j}  (6.74)
= -\sum_{j=1}^{J}\frac{n_j}{n}\log\frac{\bar{p}_j}{n_j/n} - \sum_{j=1}^{J}\frac{n_j}{n}\log\frac{n_j/n}{\bar{q}_j}.  (6.75)

Since $n_j/n\to\bar{q}_j$ as $n\to\infty$, the first term of Eq. (6.75) converges to a certain constant, and the second term converges to $0$. Thus, with probability at least $1-\exp(-n\delta)$,

\frac{1}{n}\log\prod_{i=1}^{n}\frac{\bar{p}(X_i)}{\bar{q}(X_i)} \leq -\frac{c}{2}.  (6.76)

Therefore,

\prod_{i=1}^{n}\frac{\bar{p}(X_i)}{\bar{q}(X_i)} \leq \exp\left(-\frac{nc}{2}\right).  (6.77)

Let $w_0$ be the maximum likelihood estimator within the multinomial mixtures with $M$ trials, and let $w_1$ be the maximum likelihood estimator over all the discrete distributions in $\mathcal{P}_J$. Since the set of distributions that can be expressed by multinomial mixtures with $M$ trials is included in the set of all discrete distributions in $\mathcal{P}_J$, $\prod_{i=1}^{n}p(X_i|w_0)\leq\prod_{i=1}^{n}p(X_i|w_1)$ holds. Therefore, with probability at least $1-\exp(-n\delta)$,

\prod_{i=1}^{n}\frac{p(X_i|w_0)}{q(X_i)} \leq \prod_{i=1}^{n}\frac{\bar{p}(X_i|w_1)}{\bar{q}(X_i)} \leq \exp\left(-\frac{nc}{2}\right).  (6.78)

6.2.6 The proof of Lemma 6.3

Proof.

We fix the positive constants $\varepsilon,c,\delta>0$ of Section 6.2.3 and define the set $A$ as in Eq. (6.71). The random variable $\Theta(X^n)$ is defined as follows:

\Theta(X^n) = \begin{cases}1 & (r_n\notin A)\\ 0 & (r_n\in A)\end{cases},  (6.79)

where $r_n$ is the empirical distribution defined in Eq. (6.36). From Eq. (6.44), the probability of $\Theta(X^n)=1$ is less than $\exp(-n\delta)$. Using the fact that the true distribution $q(X)$ does not depend on the sample size $n$ and using Eq. (6.67),

\mathbb{E}\left[\Theta(X^{n+1})\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right] = \mathbb{E}\left[\Theta(X^{n+1})\log\frac{q(X)}{p(X|X^n)}\right]  (6.80)
= O\left(\exp(-n\delta)\,M\log n\right)  (6.81)
= O(\exp(-n)).  (6.82)

Moreover, by Eq. (6.78),

\mathbb{E}\left[\left(1-\Theta(X^{n+1})\right)\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right] \leq \mathbb{E}\left[\log\frac{\frac{(\log n)^{m-1}}{n^{\lambda}}Z_0(X^n)+\exp(-nc/2)}{\frac{(\log(n+1))^{m-1}}{(n+1)^{\lambda}}Z_0(X^{n+1})+\exp(-(n+1)c/2)}\right]  (6.83)
= \frac{\lambda}{n} - \frac{m-1}{n\log n} + \mathbb{E}\left[\log\frac{Z_0(X^n)}{Z_0(X^{n+1})}\right] + o\!\left(\frac{1}{n\log n}\right).  (6.84)

By the reference [9],

\mathbb{E}\left[\log\frac{Z_0(X^n)}{Z_0(X^{n+1})}\right] = o\!\left(\frac{1}{n\log n}\right).  (6.85)

Therefore,

\mathbb{E}\left[\left(1-\Theta(X^{n+1})\right)\log\frac{Z_1(X^n)+Z_2(X^n)}{Z_1(X^{n+1})+Z_2(X^{n+1})}\right] \leq \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right).  (6.86)

From the above, Lemma 6.3 is shown. ∎

6.3 Properties of the general components

We prepare a lemma that holds for multinomial mixtures with a general number of components.

Lemma 6.4.

Let $p(x|w)$ be a multinomial mixture with $H$ components:

p(x|w) = \sum_{h=1}^{H}a_h\,\mathrm{Mul}_L(x|b_h)\quad (w\in W).  (6.87)

Also, let the true distribution $q(x)$ be a multinomial mixture with $H^*$ ($H\geq H^*$) components:

q(x) = \sum_{h=1}^{H^*}a^*_h\,\mathrm{Mul}_L(x|b^*_h).  (6.88)

Let $K(w)$ be the mean error function determined by the probability model $p(x|w)$ and the true distribution $q(x)$. Then $K(w)$ has the same RLCT and multiplicity as $K_1(w)$ defined below:

K_1(w) = \sum_{x\in D}\left\{\sum_{h=1}^{H}a_h\left(\prod_{\ell=1}^{L-1}b_{h\ell}^{x_\ell}\right)-\sum_{h=1}^{H^*}a^*_h\left(\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\right)\right\}^2,  (6.89)

that is, $K(w)\sim K_1(w)$.

Proof.

From Lemma 6.1 (6), the mean error function $K(w)$ is equivalent to $K'(w)$ of Eq. (6.11), that is, their RLCTs and multiplicities are equal. We calculate $p(x|w)-q(x)$ as follows:

p(x|w)-q(x) = \sum_{h=1}^{H}a_h\frac{M!}{\prod_{\ell=1}^{L}x_\ell!}\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\frac{M!}{\prod_{\ell=1}^{L}x_\ell!}\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell}  (6.90)
= \left(\frac{M!}{\prod_{\ell=1}^{L}x_\ell!}\right)\left\{\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell}\right\}.  (6.91)

Since $\frac{M!}{\prod_{\ell=1}^{L}x_\ell!}$ is positive and bounded for any $x\in D$,

K(w) \sim \sum_{x\in D}\left\{\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell}\right\}^2.  (6.92)

Furthermore, since $\sum_{\ell=1}^{L}b_{h\ell}=1$ for each $h\in[H]$ and $\sum_{\ell=1}^{L}b^*_{h\ell}=1$ for each $h\in[H^*]$, both $b_{hL}$ and $b^*_{hL}$ can be expressed in terms of the other $b_{h\ell}$ and $b^*_{h\ell}$:

\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell} = \sum_{h=1}^{H}a_h\left(\prod_{\ell=1}^{L-1}b_{h\ell}^{x_\ell}\right)\left(1-\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{x_L} - \sum_{h=1}^{H^*}a^*_h\left(\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\right)\left(1-\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{x_L}.  (6.93)

Here, by using the binomial theorem,

\left(1-\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{x_L} = \sum_{i=0}^{x_L}\binom{x_L}{i}\left(-\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{i}1^{x_L-i}  (6.94)
= 1+\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{i}.  (6.95)

Also,

\left(1-\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{x_L} = 1+\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{i}.  (6.96)

Therefore,

\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell} = \sum_{h=1}^{H}a_h\left(\prod_{\ell=1}^{L-1}b_{h\ell}^{x_\ell}\right)\left\{1+\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{i}\right\} - \sum_{h=1}^{H^*}a^*_h\left(\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\right)\left\{1+\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{i}\right\}  (6.97)
= \sum_{h=1}^{H}a_h\prod_{\ell=1}^{L-1}(b_{h\ell})^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell} + \sum_{h=1}^{H}a_h\left(\prod_{\ell=1}^{L-1}(b_{h\ell})^{x_\ell}\right)\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{i} - \sum_{h=1}^{H^*}a^*_h\left(\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\right)\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{i}.  (6.98)

For simplicity, for each $h\in[H]$ and $h\in[H^*]$, we define $A_h$ and $A^*_h$ as follows:

A_h = a_h\prod_{\ell=1}^{L-1}(b_{h\ell})^{x_\ell}\quad (h=1,\dots,H),  (6.99)
A^*_h = a^*_h\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\quad (h=1,\dots,H^*).  (6.100)

It follows that

\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell} = \sum_{h=1}^{H}A_h - \sum_{h=1}^{H^*}A^*_h + \sum_{h=1}^{H}A_h\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{i} - \sum_{h=1}^{H^*}A^*_h\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\left(\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{i}.  (6.101)

By using the multinomial theorem,

\left(\sum_{\ell=1}^{L-1}b_{h\ell}\right)^{i} = \sum_{i_1,\dots,i_{L-1}}\frac{i!}{i_1!\cdots i_{L-1}!}b_{h1}^{i_1}\cdots b_{hL-1}^{i_{L-1}},  (6.102)
\left(\sum_{\ell=1}^{L-1}b^*_{h\ell}\right)^{i} = \sum_{i_1,\dots,i_{L-1}}\frac{i!}{i_1!\cdots i_{L-1}!}(b^*_{h1})^{i_1}\cdots(b^*_{hL-1})^{i_{L-1}},  (6.103)

where the summation $\sum_{i_1,\dots,i_{L-1}}$ runs over all tuples of non-negative integers $(i_1,\dots,i_{L-1})$ such that $i_1+\dots+i_{L-1}=i$. We apply Eqs. (6.102) and (6.103) to Eq. (6.101). Let $C_{i_\ell}$ denote $\frac{i!}{i_1!\cdots i_{L-1}!}$. Then we obtain

\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell} = \sum_{h=1}^{H}A_h - \sum_{h=1}^{H^*}A^*_h + \sum_{h=1}^{H}A_h\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\sum_{i_1,\dots,i_{L-1}}C_{i_\ell}b_{h1}^{i_1}\cdots b_{hL-1}^{i_{L-1}} - \sum_{h=1}^{H^*}A^*_h\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\sum_{i_1,\dots,i_{L-1}}C_{i_\ell}(b^*_{h1})^{i_1}\cdots(b^*_{hL-1})^{i_{L-1}}  (6.104)
= \sum_{h=1}^{H}A_h - \sum_{h=1}^{H^*}A^*_h + \sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\sum_{i_1,\dots,i_{L-1}}C_{i_\ell}\left\{\sum_{h=1}^{H}A_hb_{h1}^{i_1}\cdots b_{hL-1}^{i_{L-1}} - \sum_{h=1}^{H^*}A^*_h(b^*_{h1})^{i_1}\cdots(b^*_{hL-1})^{i_{L-1}}\right\}.  (6.105)

By using Eqs.(6.99) and (6.100),

\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell} = \sum_{h=1}^{H}A_h - \sum_{h=1}^{H^*}A^*_h + \sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\sum_{i_1,\dots,i_{L-1}}C_{i_\ell}\left\{\sum_{h=1}^{H}a_hb_{h1}^{x_1+i_1}\cdots b_{hL-1}^{x_{L-1}+i_{L-1}} - \sum_{h=1}^{H^*}a^*_h(b^*_{h1})^{x_1+i_1}\cdots(b^*_{hL-1})^{x_{L-1}+i_{L-1}}\right\}.  (6.106)

We introduce a polynomial $f_{L-1}(x_1,\dots,x_{L-1};w)$ defined by

f_{L-1}(x_1,\dots,x_{L-1};w) = \sum_{h=1}^{H}A_h - \sum_{h=1}^{H^*}A^*_h  (6.107)
= \sum_{h=1}^{H}a_h\prod_{\ell=1}^{L-1}(b_{h\ell})^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\left(\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\right).  (6.108)

Then by Eq.(6.106), it follows that

\sum_{h=1}^{H}a_h\prod_{\ell=1}^{L}b_{h\ell}^{x_\ell} - \sum_{h=1}^{H^*}a^*_h\prod_{\ell=1}^{L}(b^*_{h\ell})^{x_\ell} = f_{L-1}(x_1,\dots,x_{L-1};w) + \sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\sum_{i_1,\dots,i_{L-1}}C_{i_\ell}\,f_{L-1}(x_1+i_1,\dots,x_{L-1}+i_{L-1};w).  (6.109)

The second term of Eq. (6.109) can be expressed as a linear combination of polynomials of the same form as the first term $f_{L-1}(x_1,\dots,x_{L-1};w)$. That is, in the second term, there are constants $C'(x_1,\dots,x_{L-1})$ that do not depend on the parameters such that

\sum_{i=1}^{x_L}\binom{x_L}{i}(-1)^{i}\sum_{i_1,\dots,i_{L-1}}C_{i_\ell}\,f_{L-1}(x_1+i_1,\dots,x_{L-1}+i_{L-1};w) = \sum_{x\in D}C'(x_1,\dots,x_{L-1})f_{L-1}(x_1,\dots,x_{L-1};w).  (6.110)

Therefore, the ideal generated from the set $\{f_{L-1}(x_1,\dots,x_{L-1})\}_{x\in D}$ and the ideal generated from the set $\{f_{L-1}(x_1,\dots,x_{L-1})+f_{L-1}(x_1+i_1,\dots,x_{L-1}+i_{L-1})\}_{x\in D}$ are equal, so the function $K_1(w)$ is defined as follows:

K_1(w) = \sum_{x\in D}f_{L-1}(x_1,\dots,x_{L-1};w)^2  (6.111)
= \sum_{x\in D}\left\{\sum_{h=1}^{H}a_h\left(\prod_{\ell=1}^{L-1}(b_{h\ell})^{x_\ell}\right) - \sum_{h=1}^{H^*}a^*_h\left(\prod_{\ell=1}^{L-1}(b^*_{h\ell})^{x_\ell}\right)\right\}^2.  (6.112)

From Lemma 6.1 (4), the two functions $K(w)$ and $K_1(w)$ are equivalent, that is, their RLCTs and multiplicities are equal. ∎

6.4 Properties of the two-component case

So far, we have prepared the lemma 6.4, which holds for multinomial mixtures with general components. Hereafter, we assume that the number of components of the multinomial mixtures of the probability model is 2 (i.e. H=2H=2) and that the true distribution is the multinomial distribution (i.e. H=1H^{*}=1). That is, the probability model p(x|w)p(x|w) and the true distribution q(x)q(x) are as follows:

p(x|w)\displaystyle p(x|w) =aMulL(x|b)+(1a)MulL(x|c),b,cB,\displaystyle=a\mathrm{Mul}_{L}(x|b)+(1-a)\mathrm{Mul}_{L}(x|c),\ b,c\in B, (6.113)
q(x)\displaystyle q(x) =MulL(x|b),bB,=1Lb0.\displaystyle=\mathrm{Mul}_{L}(x|b^{*}),\ b^{*}\in B,\ \prod_{\ell=1}^{L}b^{*}_{\ell}\neq 0. (6.114)

Then, the polynomial $f_{L}(x_{1},\cdots,x_{L};w)$, defined in the same way as Eq.(6.108) but with all $L$ coordinates, is expressed as follows:

fL(x1,,xL;w)=a=1Lbx+(1a)=1Lcx=1L(b)x.\displaystyle f_{L}(x_{1},\cdots,x_{L};w)=a\prod_{\ell=1}^{L}b_{\ell}^{x_{\ell}}+(1-a)\prod_{\ell=1}^{L}c_{\ell}^{x_{\ell}}-\prod_{\ell=1}^{L}(b_{\ell}^{*})^{x_{\ell}}.
Lemma 6.5.

For $j\in[2:L-1]$ and integers $x_{j}\geq 2$, the following holds:

fL1(x1,,xj,,xL1;w)=(bj+cj)fL1(x1,,xj1,,xL1;w)bjcjfL1(x1,,xj2,,xL1;w)(bjbj)(cjbj)(bj)xj2jL1(b)x,\displaystyle\begin{split}&f_{L-1}(x_{1},\cdots,x_{j},\cdots,x_{L-1};w)\\ &=(b_{j}+c_{j})f_{L-1}(x_{1},\cdots,x_{j}-1,\cdots,x_{L-1};w)\\ &\quad-b_{j}c_{j}f_{L-1}(x_{1},\cdots,x_{j}-2,\cdots,x_{L-1};w)\\ &\quad-(b_{j}-b_{j}^{*})(c_{j}-b_{j}^{*})(b_{j}^{*})^{x_{j}-2}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}},\end{split} (6.115)

where [2:L1][2:L-1] represents the set {:2L1}\{\ell\in\mathbb{Z}:2\leq\ell\leq L-1\}.

Proof.

By expanding each of the three terms on the right-hand side of Eq.(6.115),

(bj+cj)fL1(x1,,xj1,,xL1;w)=a=1L1bx+acjbjxj1jL1bx+(1a)bjcjxj1jL1cx+(1a)=1L1cx(bj+cj)(bj)xj1jL1(b)x,\displaystyle\begin{split}&(b_{j}+c_{j})f_{L-1}(x_{1},\cdots,x_{j}-1,\cdots,x_{L-1};w)\\ &\quad=a\prod_{\ell=1}^{L-1}b_{\ell}^{x_{\ell}}+ac_{j}b_{j}^{x_{j}-1}\prod_{\ell\neq j}^{L-1}b_{\ell}^{x_{\ell}}+(1-a)b_{j}c_{j}^{x_{j}-1}\prod_{\ell\neq j}^{L-1}c_{\ell}^{x_{\ell}}\\ &\quad+(1-a)\prod_{\ell=1}^{L-1}c_{\ell}^{x_{\ell}}-(b_{j}+c_{j})(b_{j}^{*})^{x_{j}-1}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}},\end{split} (6.116)
\displaystyle\begin{split}&-b_{j}c_{j}f_{L-1}(x_{1},\cdots,x_{j}-2,\cdots,x_{L-1};w)\\ &\quad=-ac_{j}b_{j}^{x_{j}-1}\prod_{\ell\neq j}^{L-1}b_{\ell}^{x_{\ell}}-(1-a)b_{j}c_{j}^{x_{j}-1}\prod_{\ell\neq j}^{L-1}c_{\ell}^{x_{\ell}}+b_{j}c_{j}(b_{j}^{*})^{x_{j}-2}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}},\end{split} (6.117)
\displaystyle\begin{split}&-(b_{j}-b_{j}^{*})(c_{j}-b_{j}^{*})(b_{j}^{*})^{x_{j}-2}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}}\\ &\quad=-b_{j}c_{j}(b_{j}^{*})^{x_{j}-2}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}}+b_{j}(b_{j}^{*})^{x_{j}-1}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}}\\ &\quad+c_{j}(b_{j}^{*})^{x_{j}-1}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}}-(b_{j}^{*})^{x_{j}}\prod_{\ell\neq j}^{L-1}(b_{\ell}^{*})^{x_{\ell}}.\end{split} (6.118)

Eq.(6.115) is then obtained by summing Eqs.(6.116), (6.117), and (6.118). ∎
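The recursion can also be checked mechanically. Below is a minimal sympy sketch of ours (not part of the paper), instantiated at $L=3$, $j=2$, and fixed exponents with $x_{2}\geq 2$; the symbols t1, t2 stand for $b_{1}^{*},b_{2}^{*}$.

    import sympy as sp

    a, b1, b2, c1, c2, t1, t2 = sp.symbols('a b1 b2 c1 c2 t1 t2')
    x1, x2 = 2, 4  # x2 >= 2 keeps every exponent below nonnegative

    def f2(y1, y2):
        # f_{L-1}(y1, y2; w) for H = 2, H* = 1
        return a*b1**y1*b2**y2 + (1 - a)*c1**y1*c2**y2 - t1**y1*t2**y2

    rhs = ((b2 + c2)*f2(x1, x2 - 1) - b2*c2*f2(x1, x2 - 2)
           - (b2 - t2)*(c2 - t2)*t2**(x2 - 2)*t1**x1)
    assert sp.expand(f2(x1, x2) - rhs) == 0
    print("Eq.(6.115) holds at the tested exponents.")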

Lemma 6.6.

Define the set DD^{\prime} of xx as follows:

D={(x1,x2,,xL1)L1|x{0,1}}\displaystyle D^{\prime}=\bigl{\{}(x_{1},x_{2},\cdots,x_{L-1})\in\mathbb{Z}^{L-1}\ |\ x_{\ell}\in\{0,1\}\bigr{\}} (6.119)

Define the function K2(w)K_{2}(w) of parameter ww as follows:

K2(w)==1L1(bb)2(cb)2+xDfL1(x1,,xL1;w)2.\displaystyle K_{2}(w)=\sum_{\ell=1}^{L-1}(b_{\ell}-b_{\ell}^{*})^{2}(c_{\ell}-b_{\ell}^{*})^{2}+\sum_{x\in D^{\prime}}f_{L-1}(x_{1},\cdots,x_{L-1};w)^{2}. (6.120)

Then, K(w)K2(w)K(w)\sim K_{2}(w).

Proof.

By applying Lemma 6.5 inductively, $f_{L-1}(x_{1},\cdots,x_{L-1};w)$ can be expressed in terms of

  • $f_{L-1}(1,0,0,\cdots,0;w),f_{L-1}(0,1,0,\cdots,0;w),\cdots$

  • $f_{L-1}(1,1,0,\cdots,0;w),f_{L-1}(1,0,1,\cdots,0;w),\cdots$

  • $\cdots$

  • $f_{L-1}(1,1,\cdots,1;w)$

  • $(b_{\ell}-b_{\ell}^{*})(c_{\ell}-b_{\ell}^{*})\ (\ell=1,\cdots,L-1)$.

Since the ideal generated by $\{f_{L-1}(x_{1},\cdots,x_{L-1};w)\}_{x\in D}\cup\{(b_{\ell}-b_{\ell}^{*})(c_{\ell}-b_{\ell}^{*})\}_{\ell=1}^{L-1}$ and the ideal generated by $\{f_{L-1}(x_{1},\cdots,x_{L-1};w)\}_{x\in D^{\prime}}\cup\{(b_{\ell}-b_{\ell}^{*})(c_{\ell}-b_{\ell}^{*})\}_{\ell=1}^{L-1}$ are equal, Lemma 6.1(4) can be applied. Thus, $K(w)\sim K_{2}(w)$. ∎
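As a numerical illustration (ours, not in the original), the following sketch evaluates $K_{2}(w)$ of Eq.(6.120) for $L=3$ at a few parameter points: it vanishes on the non-identifiable set of true parameters, for instance at $b=c=b^{*}$ or at $a=0$ with $c=b^{*}$ and $b$ arbitrary, and is positive away from it.

    import itertools

    def K2(a, b, c, t):
        # K_2(w) of Eq.(6.120) for L = 3; t plays the role of b*
        val = sum((b[l] - t[l])**2 * (c[l] - t[l])**2 for l in range(2))
        for x1, x2 in itertools.product((0, 1), repeat=2):
            f = (a * b[0]**x1 * b[1]**x2
                 + (1 - a) * c[0]**x1 * c[1]**x2
                 - t[0]**x1 * t[1]**x2)
            val += f * f
        return val

    t = (0.5, 0.3)                             # true b*_1, b*_2
    print(K2(0.7, t, t, t))                    # 0.0: both components at the truth
    print(K2(0.0, (0.9, 0.05), t, t))          # 0.0: vanishing mixture weight
    print(K2(0.3, (0.6, 0.2), (0.4, 0.3), t))  # positive: off the true set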

6.5 Proof of the main theorem

Let us prove Theorem 5.1.

Proof.

(Proof of Theorem 5.1) By Lemma 6.6, in order to obtain the RLCT and multiplicity of the multinomial mixture with two components, it suffices to find the maximum pole, and its order, of the zeta function determined by $K_{2}(w)$ and $\varphi(w)$.

\displaystyle\begin{split}K_{2}(w)&=\sum_{\ell=1}^{L-1}(b_{\ell}-b_{\ell}^{*})^{2}(c_{\ell}-b_{\ell}^{*})^{2}\\ &\quad+\sum_{\ell_{1}\in\{1,\cdots,L-1\}}(ab_{\ell_{1}}+(1-a)c_{\ell_{1}}-b^{*}_{\ell_{1}})^{2}\\ &\quad+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}(ab_{\ell_{1}}b_{\ell_{2}}+(1-a)c_{\ell_{1}}c_{\ell_{2}}-b^{*}_{\ell_{1}}b^{*}_{\ell_{2}})^{2}\\ &\quad+\dots\\ &\quad+\sum_{\ell_{1},\cdots,\ell_{L-1}\in\{1,\cdots,L-1\}}\Bigl(a\prod_{k=1}^{L-1}b_{\ell_{k}}+(1-a)\prod_{k=1}^{L-1}c_{\ell_{k}}-\prod_{k=1}^{L-1}b^{*}_{\ell_{k}}\Bigr)^{2},\end{split} (6.121)

where the summation $\displaystyle\sum_{\ell_{1},\ell_{2},\dots,\ell_{i}\in\{1,\cdots,L-1\}}$ runs over the set $\{(\ell_{1},\dots,\ell_{i})\in\{1,\dots,L-1\}^{i}:\ell_{j}\neq\ell_{j^{\prime}}\ \text{for all}\ j\neq j^{\prime}\}$, i.e. over tuples of distinct indices. Let us define a map $\Phi_{1}:u\mapsto w$, where for each $\ell\in[L-1]$,

{B=bβ=bBγ=cB.\displaystyle\begin{cases}B_{\ell}=b_{\ell}^{*}\\ \beta_{\ell}=b_{\ell}-B_{\ell}\\ \gamma_{\ell}=c_{\ell}-B_{\ell}\end{cases}. (6.122)

The parameter $u$ consists of $(a,\beta,\gamma)$, where $\beta=(\beta_{1},\dots,\beta_{L-1})$ and $\gamma=(\gamma_{1},\dots,\gamma_{L-1})$. Under this map,

\displaystyle\begin{split}&K_{2}(\Phi_{1}(u))\\ &=\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}\gamma_{\ell}^{2}\\ &\quad+\sum_{\ell_{1}\in\{1,\cdots,L-1\}}(a\beta_{\ell_{1}}+(1-a)\gamma_{\ell_{1}})^{2}\\ &\quad+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\Bigl\{a\beta_{\ell_{1}}\beta_{\ell_{2}}+(1-a)\gamma_{\ell_{1}}\gamma_{\ell_{2}}\\ &\qquad+B_{\ell_{2}}\{a\beta_{\ell_{1}}+(1-a)\gamma_{\ell_{1}}\}+B_{\ell_{1}}\{a\beta_{\ell_{2}}+(1-a)\gamma_{\ell_{2}}\}\Bigr\}^{2}\\ &\quad+\dots\\ &\quad+\sum_{\ell_{1},\cdots,\ell_{L-1}\in\{1,\cdots,L-1\}}\Bigl\{a\prod_{k=1}^{L-1}\beta_{\ell_{k}}+(1-a)\prod_{k=1}^{L-1}\gamma_{\ell_{k}}\\ &\qquad+\cdots+\sum_{j=1}^{L-1}\Bigl(\prod_{k\neq j}^{L-1}B_{\ell_{k}}\Bigr)(a\beta_{\ell_{j}}+(1-a)\gamma_{\ell_{j}})\Bigr\}^{2}.\end{split} (6.123)

From the symmetry of the parameters, we can restrict the integration range for aa as 0a120\leq a\leq\frac{1}{2} without loss of generality. Let us define a map Φ2:vu\Phi_{2}:v\to u, where

δ=aβ+(1a)γ(=1,,L1).\displaystyle\delta_{\ell}=a\beta_{\ell}+(1-a)\gamma_{\ell}\ \ (\ell=1,\cdots,L-1). (6.124)

The parameter $v$ consists of $(a,\beta,\delta)$, where $\delta=(\delta_{1},\dots,\delta_{L-1})$. Since we consider the range $\frac{1}{2}\leq 1-a\leq 1$, the Jacobian determinant of this transform is bounded away from zero; therefore, neither the maximum pole of the zeta function nor its order changes.
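For completeness, we record the computation behind this claim (a supplementary remark, not in the original text). With $a$ and $\beta$ held fixed, each $\delta_{\ell}$ depends on $\gamma_{\ell}$ linearly with coefficient $1-a$, so

\displaystyle\frac{\partial(a,\beta,\delta)}{\partial(a,\beta,\gamma)}=\det\begin{pmatrix}1&0&0\\ 0&I_{L-1}&0\\ \ast&\ast&(1-a)I_{L-1}\end{pmatrix}=(1-a)^{L-1}\in\Bigl[\frac{1}{2^{L-1}},1\Bigr],

and $\Phi_{2}$ is a diffeomorphism on this restricted range. We can obtain that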

\displaystyle\begin{split}&K_{2}(\Phi_{2}(\Phi_{1}(v)))\\ &=\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}\gamma_{\ell}^{2}\\ &\quad+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\Bigl\{a\beta_{\ell_{1}}\beta_{\ell_{2}}+(1-a)\gamma_{\ell_{1}}\gamma_{\ell_{2}}\\ &\qquad+B_{\ell_{2}}\{a\beta_{\ell_{1}}+(1-a)\gamma_{\ell_{1}}\}+B_{\ell_{1}}\{a\beta_{\ell_{2}}+(1-a)\gamma_{\ell_{2}}\}\Bigr\}^{2}\\ &\quad+\dots\\ &\quad+\sum_{\ell_{1},\cdots,\ell_{L-1}\in\{1,\cdots,L-1\}}\Bigl\{a\prod_{k=1}^{L-1}\beta_{\ell_{k}}+(1-a)\prod_{k=1}^{L-1}\gamma_{\ell_{k}}\\ &\qquad+\cdots+\sum_{j=1}^{L-1}\Bigl(\prod_{k\neq j}^{L-1}B_{\ell_{k}}\Bigr)(a\beta_{\ell_{j}}+(1-a)\gamma_{\ell_{j}})\Bigr\}^{2}.\end{split} (6.125)

Here, we eliminate γ\gamma by using γ=δaβ1a\displaystyle\gamma_{\ell}=\frac{\delta_{\ell}-a\beta_{\ell}}{1-a},

=1L1β2γ2\displaystyle\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}\gamma_{\ell}^{2} ==1L1β2(δaβ1a)2\displaystyle=\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}\Bigl{(}\frac{\delta_{\ell}-a\beta_{\ell}}{1-a}\Bigr{)}^{2} (6.126)
=1(1a)2=1L1β2(δaβ)2.\displaystyle=\frac{1}{(1-a)^{2}}\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}(\delta_{\ell}-a\beta_{\ell})^{2}. (6.127)

Since $\frac{1}{2}\leq 1-a\leq 1$, Lemma 6.1(5) gives

=1L1δ2+=1L1β2γ2\displaystyle\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}\gamma_{\ell}^{2} ==1L1δ2+1(1a)2=1L1β2(δaβ)2\displaystyle=\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\frac{1}{(1-a)^{2}}\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}(\delta_{\ell}-a\beta_{\ell})^{2} (6.128)
=1L1δ2+=1L1β2(aβ)2\displaystyle\sim\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}\beta_{\ell}^{2}(-a\beta_{\ell})^{2} (6.129)
==1L1δ2+=1L1a2β4.\displaystyle=\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}. (6.130)

Moreover,

=1L1δ2+=1L1a2β4+1,2{1,,L1}{aβ1β2+(1a)γ1γ2+B2{aβ1+(1a)γ1}+B1{aβ2+(1a)γ2}}2\displaystyle\begin{split}&\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}\\ &\quad+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\Bigl{\{}a\beta_{\ell_{1}}\beta_{\ell_{2}}+(1-a)\gamma_{\ell_{1}}\gamma_{\ell_{2}}\\ &\quad+B_{\ell_{2}}\{a\beta_{\ell_{1}}+(1-a)\gamma_{\ell_{1}}\}+B_{\ell_{1}}\{a\beta_{\ell_{2}}+(1-a)\gamma_{\ell_{2}}\}\Bigr{\}}^{2}\end{split} (6.131)
==1L1δ2+=1L1a2β4+1,2{1,,L1}{aβ1β2\displaystyle=\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\Bigl{\{}a\beta_{\ell_{1}}\beta_{\ell_{2}} (6.132)
+(1a)δ1aβ11aδ2aβ21a+B2δ1+B1δ2}2\displaystyle\quad+(1-a)\frac{\delta_{\ell_{1}}-a\beta_{\ell_{1}}}{1-a}\frac{\delta_{\ell_{2}}-a\beta_{\ell_{2}}}{1-a}+B_{\ell_{2}}\delta_{\ell_{1}}+B_{\ell_{1}}\delta_{\ell_{2}}\Bigr{\}}^{2} (6.133)
=1L1δ2+=1L1a2β4+1,2{1,,L1}{aβ1β2+(aβ1)(aβ2)}2\displaystyle\sim\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\Bigl{\{}a\beta_{\ell_{1}}\beta_{\ell_{2}}+\left(a\beta_{\ell_{1}}\right)\left(a\beta_{\ell_{2}}\right)\Bigr{\}}^{2} (6.134)
==1L1δ2+=1L1a2β4+1,2{1,,L1}(aβ1β2)2(1+a)2\displaystyle=\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}(a\beta_{\ell_{1}}\beta_{\ell_{2}})^{2}(1+a)^{2} (6.135)
=1L1δ2+=1L1a2β4+1,2{1,,L1}a2β12β22.\displaystyle\sim\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}a^{2}\beta_{\ell_{1}}^{2}\beta_{\ell_{2}}^{2}. (6.136)

By recursively applying lemma 6.1(5), we can obtain that

K2(Φ2(Φ1(v)))\displaystyle K_{2}(\Phi_{2}(\Phi_{1}(v))) =1L1δ2+=1L1a2β4+1,2{1,,L1}a2β12β22.\displaystyle\sim\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}a^{2}\beta_{\ell_{1}}^{2}\beta_{\ell_{2}}^{2}. (6.137)

Here, for all parameters vv,

K2(Φ2(Φ1(v)))=1L1δ2+=1L1a2β4.\displaystyle K_{2}(\Phi_{2}(\Phi_{1}(v)))\geq\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}. (6.138)

Also, since $\beta_{\ell_{1}}^{2},\beta_{\ell_{2}}^{2}\geq 0$, the inequality of arithmetic and geometric means gives

=1L1β4+1,2{1,,L1}β12β22=1L1β4+121,2{1,,L1}(β14+β24).\displaystyle\sum_{\ell=1}^{L-1}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\beta_{\ell_{1}}^{2}\beta_{\ell_{2}}^{2}\ \leq\ \sum_{\ell=1}^{L-1}\beta_{\ell}^{4}+\frac{1}{2}\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}(\beta_{\ell_{1}}^{4}+\beta_{\ell_{2}}^{4}). (6.139)

Therefore, there exists a constant $k\geq 1$ that does not depend on $v$ such that

=1L1β4+1,2{1,,L1}β12β22k=1L1β4.\displaystyle\sum_{\ell=1}^{L-1}\beta_{\ell}^{4}+\sum_{\ell_{1},\ell_{2}\in\{1,\cdots,L-1\}}\beta_{\ell_{1}}^{2}\beta_{\ell_{2}}^{2}\leq k\sum_{\ell=1}^{L-1}\beta_{\ell}^{4}. (6.140)

By Eqs. (6.138), (6.140),

=1L1δ2+=1L1a2β4K2(Φ2(Φ1(v)))k(=1L1δ2+=1L1a2β4).\displaystyle\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}\leq K_{2}(\Phi_{2}(\Phi_{1}(v)))\leq k\Bigl{(}\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}\Bigr{)}. (6.141)

Thus, defining $K_{3}(v)$ by

K3(v)==1L1δ2+=1L1a2β4,\displaystyle K_{3}(v)=\sum_{\ell=1}^{L-1}\delta_{\ell}^{2}+\sum_{\ell=1}^{L-1}a^{2}\beta_{\ell}^{4}, (6.142)

it follows from Lemma 6.1(1) and Eq.(6.141) that

K3(v)K2(Φ2(Φ1(v))).\displaystyle K_{3}(v)\sim K_{2}(\Phi_{2}(\Phi_{1}(v))). (6.143)

From Eq.(6.143), the main theorem can be derived by finding the maximum pole, and its order, of the zeta function determined by $K_{3}(v)$ and $\varphi(\Phi_{2}(\Phi_{1}(v)))$. If the prior distribution $\varphi(a)$ of the mixture ratio $a$ is positive and bounded, then by Lemma 6.1(2)(3),

λ(K3,φ)\displaystyle\lambda(K_{3},\varphi) ==1L1λ(δ2)+min(λ(a2),=1L1λ(β4))\displaystyle=\sum_{\ell=1}^{L-1}\lambda(\delta_{\ell}^{2})+\min\Bigl{(}\lambda(a^{2}),\ \sum_{\ell=1}^{L-1}\lambda(\beta_{\ell}^{4})\Bigr{)} (6.144)
=L12+min(12,L14).\displaystyle=\frac{L-1}{2}+\min\Bigl{(}\frac{1}{2},\ \frac{L-1}{4}\Bigr{)}. (6.145)
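The one-variable RLCTs used here follow from a standard computation, which we record for convenience (a supplementary remark, not in the original text): for $k\in\{1,2\}$ and a positive bounded prior,

\displaystyle\int_{0}^{\varepsilon}(t^{2k})^{z}\,dt=\frac{\varepsilon^{2kz+1}}{2kz+1},

which has its pole at $z=-\frac{1}{2k}$; hence $\lambda(\delta_{\ell}^{2})=\lambda(a^{2})=\frac{1}{2}$ and $\lambda(\beta_{\ell}^{4})=\frac{1}{4}$. By Lemma 6.1(2)(3), the separated squares $\delta_{\ell}^{2}$ contribute the sum $\frac{L-1}{2}$, while the product structure of $a^{2}\sum_{\ell}\beta_{\ell}^{4}$ contributes the minimum $\min\bigl(\lambda(a^{2}),\sum_{\ell}\lambda(\beta_{\ell}^{4})\bigr)$.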

Here, $\frac{1}{2}=\frac{L-1}{4}$ holds if and only if $L=3$, so the multiplicity is $m=2$ when $L=3$ and $m=1$ otherwise. This proves Theorem 5.1(1).

Furthermore, consider the case of Theorem 5.1(2), that is, the case where the prior distribution of the mixing ratio $a$ is the Dirichlet distribution with hyperparameter $\alpha>0$. Using that the prior of $a$ satisfies $\varphi(a)\propto a^{\alpha-1}(1-a)^{\alpha-1}$ and that the priors of the other parameters are positive and bounded, we obtain

λ(K3,φ)\displaystyle\lambda(K_{3},\varphi) ==1L1λ(δ2)+min(λ(a2,aα1),=1L1λ(β4))\displaystyle=\sum_{\ell=1}^{L-1}\lambda(\delta_{\ell}^{2})+\min\Bigl{(}\lambda(a^{2},a^{\alpha-1}),\ \sum_{\ell=1}^{L-1}\lambda(\beta_{\ell}^{4})\Bigr{)} (6.146)
\displaystyle=\frac{L-1}{2}+\min\Bigl(\frac{\alpha}{2},\ \frac{L-1}{4}\Bigr). (6.147)
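Compared with Eq.(6.145), only the contribution of $a$ changes: with the Dirichlet factor $a^{\alpha-1}$ (again a supplementary one-variable computation of ours),

\displaystyle\int_{0}^{\varepsilon}(a^{2})^{z}a^{\alpha-1}\,da=\frac{\varepsilon^{2z+\alpha}}{2z+\alpha},

so the pole lies at $z=-\frac{\alpha}{2}$ and $\lambda(a^{2},a^{\alpha-1})=\frac{\alpha}{2}$, which crosses $\sum_{\ell}\lambda(\beta_{\ell}^{4})=\frac{L-1}{4}$ exactly at $\alpha=\frac{L-1}{2}$.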

This completes the proof of Theorem 5.1(2). ∎

7 Phase transition due to prior distribution hyperparameters

In Bayesian statistics, if a prior distribution $\varphi(w;\theta)$ has a hyperparameter $\theta$ and the posterior distribution for sufficiently large $n$ changes drastically at $\theta=\theta_{c}$, then the posterior distribution is said to have a phase transition, and $\theta_{c}$ is called a critical point [9].

In the case of Theorem 5.1(2), the prior distribution of the mixing ratio $a$ of the multinomial mixture is the Dirichlet distribution with hyperparameter $\alpha$, and the asymptotic free energy $F_{n}(\alpha)$ is given by

\displaystyle\mathbb{E}[F_{n}(\alpha)]\simeq\begin{cases}nS+\frac{L-1+\alpha}{2}\log n&(\alpha<\frac{L-1}{2})\\ nS+\frac{3(L-1)}{4}\log n-\log\log n&(\alpha=\frac{L-1}{2})\\ nS+\frac{3(L-1)}{4}\log n&(\alpha>\frac{L-1}{2})\end{cases}. (7.1)

From Eq.(7.1), $\mathbb{E}[F_{n}(\alpha)]$ is not differentiable at $\alpha_{c}=\frac{L-1}{2}$, so $\alpha_{c}$ is a phase transition point. When a phase transition point exists, the support of the posterior distribution changes significantly between the two phases, which greatly affects the result of statistical inference.
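As a quick illustration (ours, not from the original), the following Python sketch evaluates the coefficient $\lambda(\alpha)$ of $\log n$ in Eq.(7.1) and exhibits the kink at $\alpha_{c}=\frac{L-1}{2}$; the value $L=5$ is an arbitrary example.

    def lam(alpha, L):
        # coefficient of log n: (L - 1)/2 + min(alpha/2, (L - 1)/4)
        return (L - 1) / 2 + min(alpha / 2, (L - 1) / 4)

    L = 5
    alpha_c = (L - 1) / 2  # critical point, here 2.0
    for alpha in (0.5, 1.0, 1.5, alpha_c, 2.5, 3.0):
        print(f"alpha = {alpha:4.2f}  lambda = {lam(alpha, L):5.3f}")
    # The slope in alpha is 1/2 below alpha_c and 0 above it, so
    # E[F_n(alpha)] is continuous but not differentiable at alpha_c.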

8 Conclusions

In this paper, we derived the real log canonical threshold and multiplicity in the case where the probability model is a multinomial mixture with two components, the prior is a Dirichlet distribution, and the true distribution is a multinomial distribution, and we clarified the asymptotic behaviors of the free energy and the generalization error. One direction for future work is to find the RLCTs and multiplicities of multinomial mixtures with a general number of components.

References

  • [1] Takeshi Watanabe and Einoshin Suzuki. Prototyping abnormal medical test values in hepatitis data with mixture multinomial distribution estimate. ICS, Vol. 2002, No. 45 (2002-ICS-128), pp. 49–54, 2002.
  • [2] Tomonari Masada, Atsuhiro Takasu, and Jun Adachi. Clustering for name disambiguation in author citations. DBSJ Letters, Vol. 6, No. 1, 2007.
  • [3] Tomonari Masada, Senya Kiyasu, and Sueharu Miyahara. Clustering images with multinomial mixture models. In International Symposium on Advanced Intelligent Systems, 2007.
  • [4] Keisuke Yamazaki and Sumio Watanabe. Resolution of singularities in mixture models and its stochastic complexity. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'02), Vol. 3, pp. 1355–1359. IEEE, 2002.
  • [5] Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, Vol. 19, No. 6, pp. 716–723, 1974.
  • [6] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, pp. 461–464, 1978.
  • [7] David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin, and Angelika Van Der Linde. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 64, No. 4, pp. 583–639, 2002.
  • [8] Sumio Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation, Vol. 13, No. 4, pp. 899–933, 2001.
  • [9] Sumio Watanabe. Mathematical Theory of Bayesian Statistics. CRC Press, 2018.
  • [10] Miki Aoyagi and Sumio Watanabe. Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Networks, Vol. 18, No. 7, pp. 924–933, 2005.
  • [11] Kenichiro Sato and Sumio Watanabe. Bayesian generalization error of Poisson mixture and simplex Vandermonde matrix type singularity. arXiv preprint arXiv:1912.13289, 2019.
  • [12] Keisuke Yamazaki and Daisuke Kaji. Comparing two Bayes methods based on the free energy functions in Bernoulli mixtures. Neural Networks, Vol. 44, pp. 36–43, 2013.
  • [13] Naoki Hayashi. The exact asymptotic form of Bayesian generalization error in latent Dirichlet allocation. Neural Networks, Vol. 137, pp. 127–137, 2021.
  • [14] Kenji Nagata and Sumio Watanabe. Asymptotic behavior of exchange ratio in exchange Monte Carlo method. Neural Networks, Vol. 21, No. 7, pp. 980–988, 2008.
  • [15] Mathias Drton and Martyn Plummer. A Bayesian information criterion for singular models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 79, No. 2, pp. 323–380, 2017.
  • [16] Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory. No. 25 in Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2009.
  • [17] Sumio Watanabe. Algebraic geometrical methods for hierarchical learning machines. Neural Networks, Vol. 14, No. 8, pp. 1049–1060, 2001.
  • [18] Takeshi Matsuda and Sumio Watanabe. Weighted blowups of Kullback information and application to multinomial distributions. IEICE Proceedings Series, Vol. 42, No. B2L-C2, 2008.
  • [19] Imre Csiszár. A simple proof of Sanov's theorem. Bulletin of the Brazilian Mathematical Society, Vol. 37, No. 4, pp. 453–459, 2006.
  • [20] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • [21] Takumi Watanabe and Sumio Watanabe. Asymptotic behavior of Bayesian generalization error in multinomial mixtures. IEICE Technical Report, Vol. 119, No. 360, pp. 1–8, 2020.
  • [22] Keisuke Yamazaki and Sumio Watanabe. Singularities in mixture models and upper bounds of stochastic complexity. Neural Networks, Vol. 16, No. 7, pp. 1029–1038, 2003.
  • [23] Miki Aoyagi. A Bayesian learning coefficient of generalization error and Vandermonde matrix-type singularities. Communications in Statistics - Theory and Methods, Vol. 39, No. 15, pp. 2667–2687, 2010.
  • [24] Heisuke Hironaka. Resolution of singularities of an algebraic variety over a field of characteristic zero: II. Annals of Mathematics, pp. 205–326, 1964.