Asymptotic Behavior of Bayesian Generalization Error in Multinomial Mixtures
Abstract
Multinomial mixtures are widely used in information engineering; however, their mathematical properties have not yet been clarified because they are singular learning models. In fact, the models are non-identifiable and their Fisher information matrices are not positive definite. In recent years, the mathematical foundation of singular statistical models has been clarified by algebraic geometric methods. In this paper, we clarify the real log canonical thresholds and multiplicities of multinomial mixtures and elucidate the asymptotic behavior of their generalization error and free energy.
Keywords: generalization error, free energy, multinomial mixtures, real log canonical thresholds
1 Introduction
A finite mixture model is a probability distribution defined by a linear superposition of a finite number of distributions. Examples such as normal mixtures, Poisson mixtures, and multinomial mixtures have been applied in many research areas. In this paper, we mainly study the multinomial mixture, which provides a richer class of statistical models than the single multinomial distribution. The multinomial mixture has been applied to document clustering [1], anomaly detection in medical data [2], and image clustering [3]. In spite of its wide range of applications, the mathematical properties of its generalization performance have not yet been clarified. One of the mathematical difficulties is caused by the fact that multinomial mixtures are not identifiable. If the map from the set of parameters to the set of probability distributions is one-to-one, a statistical model is called identifiable; otherwise, it is called non-identifiable [4]. Non-identifiable models are classified as singular models. If a probability model is singular, the posterior distribution cannot be approximated by any normal distribution, and classical information criteria for regular statistical models such as AIC [5], BIC [6], or DIC [7] cannot be applied to estimate the generalization losses of singular models.
Recently, in order to establish the mathematical foundation of Bayesian inference of singular models, Watanabe derived the asymptotic behavior of their generalization error and free energy [8]. There exist a positive real number $\lambda$ and a natural number $m$ such that the asymptotic behaviors of the expected generalization error $\mathbb{E}[G_n]$ and the free energy $F_n$ are respectively given by
$$\mathbb{E}[G_n] = \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right), \qquad (1.1)$$
$$\mathbb{E}[F_n] = nS + \lambda \log n - (m-1)\log\log n + O(1), \qquad (1.2)$$
where $\lambda$ is called a real log canonical threshold (RLCT), $m$ is called a multiplicity, $S$ is the entropy of the true distribution defined in Section 2, and $\mathbb{E}[\cdot]$ denotes the expectation value over all datasets. If a learning model is identifiable and regular, $\lambda = d/2$ and $m = 1$, where $d$ is the dimension of the parameter space [6]. If it is singular, $\lambda$ and $m$ depend on the true distribution, the model, and the prior. In the singular case, it was shown by [9] that both the RLCT and the multiplicity can be found by using the desingularization theorem of algebraic geometry. In general, the parameter set contains complicated singularities, hence it is difficult to find the resolution map; however, both RLCTs and multiplicities have been clarified for several statistical models and learning machines. Examples of models whose RLCTs are known include normal mixtures [10], Poisson mixtures [11], Bernoulli mixtures [12], rank regression [10], Latent Dirichlet Allocation (LDA) [13], and so on. In addition, RLCTs have been used in the analysis of the exchange rate of the replica exchange method [14], which is one of the Markov chain Monte Carlo methods. Moreover, in recent years, the information criterion sBIC, which uses RLCTs in its calculation, has also been proposed [15].
In this paper, we clarify the RLCT of multinomial mixtures and derive the asymptotic behaviors of the generalization error and the free energy. Our analysis also shows the effect of the hyperparameter when the Dirichlet distribution is employed as a prior. We begin in Section 2 with the introduction of the framework of Bayesian inference. In Section 3 we explain multinomial mixtures, and in Section 4 we introduce previous studies on the RLCTs and multiplicities of multinomial mixtures. In Section 5 we state the main theorem of this paper. In Section 6 we prove the theorem. In Section 7 we discuss the phase transition due to the hyperparameters, and in Section 8 we conclude the paper.
2 Bayes estimation
In this section, we introduce the framework of Bayesian inference. Let $q(x)$ be a true probability distribution and let $X^n = (X_1, \dots, X_n)$ be a set of training data generated from $q(x)$ independently and identically. Let $p(x \mid w)$ be a probability model, where $w$ is a parameter and $W$ is the parameter space. The prior probability distribution $\varphi(w)$ is a function on $W$. The posterior distribution is defined by
$$p(w \mid X^n) = \frac{1}{Z_n}\,\varphi(w)\prod_{i=1}^n p(X_i \mid w), \qquad (2.1)$$
where $Z_n$ is the normalizing constant:
$$Z_n = \int_W \varphi(w)\prod_{i=1}^n p(X_i \mid w)\,dw. \qquad (2.2)$$
The constant $Z_n$ is called the marginal likelihood. The free energy is defined as the negative log marginal likelihood:
$$F_n = -\log Z_n. \qquad (2.3)$$
The predictive distribution $p(x \mid X^n)$ is given by
$$p(x \mid X^n) = \int_W p(x \mid w)\, p(w \mid X^n)\, dw. \qquad (2.4)$$
The generalization error $G_n$ is the Kullback-Leibler divergence from the true distribution $q(x)$ to the predictive distribution $p(x \mid X^n)$:
$$G_n = \int q(x) \log\frac{q(x)}{p(x \mid X^n)}\, dx. \qquad (2.5)$$
The generalization error measures how different the predictive distribution $p(x \mid X^n)$ is from the true distribution $q(x)$.
For an arbitrary function $f(X^n)$, the expectation value of $f$ over all sets of training samples is denoted by $\mathbb{E}[f]$, that is,
$$\mathbb{E}[f] = \int f(X^n) \prod_{i=1}^n q(X_i)\, dX^n. \qquad (2.6)$$
Let the mean error function $K(w)$ be the Kullback-Leibler divergence from the true distribution to the probability model:
$$K(w) = \int q(x) \log\frac{q(x)}{p(x \mid w)}\, dx. \qquad (2.7)$$
The entropy $S$ of the true distribution and the empirical entropy $S_n$ are defined respectively by
$$S = -\int q(x)\log q(x)\, dx, \qquad (2.8)$$
$$S_n = -\frac{1}{n}\sum_{i=1}^n \log q(X_i). \qquad (2.9)$$
It is known that the following relationship holds between the free energy and the generalization error [16]:
$$\mathbb{E}[G_n] = \mathbb{E}[F_{n+1}] - \mathbb{E}[F_n] - S. \qquad (2.10)$$
The relation (2.10) is important because in most cases we do not know the true distribution $q(x)$, whereas the free energy can be calculated from the prior $\varphi(w)$, the probability model $p(x \mid w)$, and a sample $X^n$.
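The following is a minimal numerical sketch (our illustration, not part of the paper) of these quantities for a Bernoulli model with a uniform prior, computed by grid integration; the true parameter, grid resolution, and sample sizes are illustrative assumptions. The two averages printed at the end approximate the left- and right-hand sides of Eq. (2.10) and should nearly coincide.

```python
# Sketch: free energy F_n, generalization error G_n, and the check of Eq. (2.10)
# for p(x|w) = w^x (1-w)^(1-x), x in {0,1}, uniform prior on [0,1].
import numpy as np

rng = np.random.default_rng(0)
q = 0.3                                          # true Bernoulli parameter (assumed)
S = -(q * np.log(q) + (1 - q) * np.log(1 - q))   # entropy of the true distribution

w = np.linspace(1e-6, 1 - 1e-6, 10_000)          # grid over the parameter space W
dw = w[1] - w[0]

def free_energy(x):
    """F_n = -log Z_n, with Z_n = int prod_i p(x_i|w) dw, by grid integration."""
    k, n = x.sum(), len(x)
    log_lik = k * np.log(w) + (n - k) * np.log(1 - w)
    m = log_lik.max()                            # log-sum-exp for numerical stability
    return -(m + np.log(np.exp(log_lik - m).sum() * dw))

def gen_error(x):
    """G_n: KL divergence from q(x) to the predictive distribution p(x|X^n)."""
    k, n = x.sum(), len(x)
    post = np.exp(k * np.log(w) + (n - k) * np.log(1 - w))
    post /= post.sum() * dw                      # normalized posterior density
    p1 = (w * post).sum() * dw                   # predictive probability of x = 1
    return q * np.log(q / p1) + (1 - q) * np.log((1 - q) / (1 - p1))

n, trials = 50, 2000
G, dF = [], []
for _ in range(trials):
    x = rng.binomial(1, q, size=n + 1)
    G.append(gen_error(x[:n]))                   # generalization error at size n
    dF.append(free_energy(x) - free_energy(x[:n]))
print(np.mean(G), np.mean(dF) - S)               # both sides of Eq. (2.10)
```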
Let $\mathrm{Re}(z)$ be the real part of a complex number $z$. Define the zeta function of statistical learning theory as
$$\zeta(z) = \int_W K(w)^z \varphi(w)\, dw. \qquad (2.11)$$
If $K(w)$ is an analytic function of $w$, then the function $\zeta(z)$ is holomorphic in the region $\mathrm{Re}(z) > 0$, and it can be analytically continued to a unique meromorphic function on the entire complex plane. Moreover, it is known that all of its poles are negative real numbers.
In the following, assume that the mean error function $K(w)$ is analytic and that the true distribution is feasible with the probability model. Here, the true distribution is said to be feasible with the probability model if there is a parameter $w_0$ such that $p(x \mid w_0) = q(x)$ holds for all $x$. Assume that the maximum pole of the zeta function is $-\lambda$ and that its order is $m$. By applying Hironaka's resolution theorem of algebraic geometry, the asymptotic behaviors of the free energy and the generalization error can be expressed as follows [8] [17]:
$$F_n = nS_n + \lambda \log n - (m-1)\log\log n + O_p(1), \qquad (2.12)$$
$$\mathbb{E}[G_n] = \frac{\lambda}{n} - \frac{m-1}{n \log n} + o\!\left(\frac{1}{n\log n}\right). \qquad (2.13)$$
The constants $\lambda$ and $m$ are called the real log canonical threshold (RLCT) and the multiplicity, respectively.
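As a toy illustration of these definitions (our example, not taken from the paper), the RLCT and multiplicity can be read off directly when the mean error function is a monomial:

```latex
% Toy example: K(w) = w_1^2 w_2^2 on W = [0,1]^2 with uniform prior \varphi(w) = 1.
\zeta(z) = \int_0^1\!\!\int_0^1 (w_1^2 w_2^2)^z \, dw_1\, dw_2
         = \left( \int_0^1 w^{2z}\, dw \right)^{2} = \frac{1}{(2z+1)^2}.
% The unique pole is z = -1/2 with order 2, hence \lambda = 1/2 and m = 2,
% smaller than the regular value \lambda = d/2 = 1 for d = 2 parameters.
```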
3 Multinomial Mixtures
3.1 Multinomial Distribution
Let $\mathbb{Z}_{\geq 0}$ be the set of all non-negative integers and $\mathbb{R}_{\geq 0}$ the set of all non-negative real numbers. Let the constants $M \geq 2$ and $L \geq 1$ be natural numbers, and define the set $\mathcal{X}$ by
$$\mathcal{X} = \Big\{ x = (x_1, \dots, x_M) \in \mathbb{Z}_{\geq 0}^M \;:\; \sum_{j=1}^M x_j = L \Big\}. \qquad (3.1)$$
The parameter vector $b = (b_1, \dots, b_M)$ belongs to the set $\mathcal{B}$:
$$\mathcal{B} = \Big\{ b \in \mathbb{R}_{\geq 0}^M \;:\; \sum_{j=1}^M b_j = 1 \Big\}. \qquad (3.2)$$
The probability distribution of $x \in \mathcal{X}$ determined by the vector $b \in \mathcal{B}$,
$$P(x \mid b) = \frac{L!}{x_1! \cdots x_M!} \prod_{j=1}^M b_j^{x_j},$$
is called the multinomial distribution. Here, it is defined that $0^0 = 1$. The constant $L$ represents the number of independent trials of the multinomial distribution, and the parameter $b_j$ represents the corresponding occurrence probability. The multinomial distribution is a generalization of several discrete distributions. If $M = 2$ and $L = 1$, the multinomial distribution is called the Bernoulli distribution. If $M \geq 2$ and $L = 1$, it is called the categorical distribution. If $M = 2$ and $L \geq 2$, it is called the binomial distribution.
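As a concrete illustration (a sketch we add here, not from the paper), the mass function above can be evaluated directly; the data point and parameter values below are arbitrary:

```python
# Sketch: the multinomial mass function P(x | b) with L trials over M categories.
from math import factorial, prod

def multinomial_pmf(x, b):
    """P(x | b) = L!/(x_1! ... x_M!) * prod_j b_j^{x_j}, where L = sum(x)."""
    coef = factorial(sum(x))
    for xj in x:
        coef //= factorial(xj)
    return coef * prod(bj ** xj for xj, bj in zip(x, b))

# L = 4 trials over M = 3 categories: coefficient 4!/(2!1!1!) = 12
print(multinomial_pmf((2, 1, 1), (0.5, 0.3, 0.2)))  # 12 * 0.25 * 0.3 * 0.2 = 0.18
```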
3.2 Multinomial Mixtures
Let $H$ be a natural number greater than or equal to 2. The parameter set $W$ is defined by
$$W = \big\{ w = (a, b_1, \dots, b_H) \;:\; a \in \Delta_H,\; b_k \in \mathcal{B}\ (k = 1, \dots, H) \big\}, \qquad (3.3)$$
where $\Delta_H$ means the set $\{ a = (a_1, \dots, a_H) \in \mathbb{R}_{\geq 0}^H : \sum_{k=1}^H a_k = 1 \}$.
The probability distribution on $\mathcal{X}$ determined by the parameter $w = (a, b_1, \dots, b_H)$,
$$p(x \mid w) = \sum_{k=1}^H a_k\, P(x \mid b_k), \qquad (3.4)$$
is called a multinomial mixture. Here $H$ represents the number of components. The $H$-dimensional parameter $a$ represents the mixing ratio; each $a_k$ is non-negative and $\sum_{k=1}^H a_k = 1$. Then $a_k$ represents the weight of the $k$-th component distribution, and a higher mixing ratio means a stronger effect of the $k$-th component.
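Continuing the sketch above, a mixture simply averages the component mass functions with the mixing ratio; the values of $a$ and the component parameters below are arbitrary:

```python
# Sketch: a multinomial mixture with H components, here H = 2.
def mixture_pmf(x, a, B):
    """p(x | w) = sum_k a_k P(x | b_k) for w = (a, b_1, ..., b_H)."""
    return sum(ak * multinomial_pmf(x, bk) for ak, bk in zip(a, B))

a = (0.7, 0.3)                              # mixing ratio, sums to 1
B = ((0.5, 0.3, 0.2), (0.1, 0.1, 0.8))      # one probability vector per component
print(mixture_pmf((2, 1, 1), a, B))
```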
4 Previous Studies
In this section, we introduce several previous studies on the real log canonical thresholds (RLCTs) of multinomial mixtures. When the probability model is a binomial mixture, the upper bound of the RLCT for a general number of components and its exact value in special cases have been clarified [12].
Theorem 4.1 (the RLCT and multiplicity of binomial mixtures [12]).
Let $x = (x_1, \dots, x_M)$ be an $M$-dimensional binary vector and let the probability model $p(x \mid w)$ be a binomial mixture,
$$p(x \mid w) = \sum_{k=1}^H a_k \prod_{j=1}^M b_{kj}^{x_j} (1 - b_{kj})^{1 - x_j}. \qquad (4.1)$$
It is assumed that the true distribution is a binomial mixture with $H_0$ components,
$$q(x) = \sum_{k=1}^{H_0} a_k^* \prod_{j=1}^M (b_{kj}^*)^{x_j} (1 - b_{kj}^*)^{1 - x_j}. \qquad (4.2)$$
Let the prior distribution be
$$\varphi(w) = \varphi_a(a) \prod_{k=1}^H \prod_{j=1}^M \varphi_b(b_{kj}), \qquad (4.3)$$
where $(\phi, \beta_1, \beta_2)$ is a set of hyperparameters, $\varphi_a(a)$ is a prior distribution of the mixing ratio with $\phi$ as a hyperparameter (Dirichlet distribution),
$$\varphi_a(a) = \frac{\Gamma(H\phi)}{\Gamma(\phi)^H} \prod_{k=1}^H a_k^{\phi - 1}, \qquad (4.4)$$
and $\varphi_b$ is a beta distribution for each $b_{kj}$,
$$\varphi_b(b) = \frac{\Gamma(\beta_1 + \beta_2)}{\Gamma(\beta_1)\Gamma(\beta_2)}\, b^{\beta_1 - 1} (1 - b)^{\beta_2 - 1}. \qquad (4.5)$$
We refer to a component as deterministic when its parameter $b_{kj}^*$ is one or zero. Let $H_1$ and $H_2$ be the numbers of probabilistic and deterministic components, respectively, where $H_1 + H_2 = H_0$. Under the above conditions, the asymptotic behavior of the free energy is expressed as follows:
$$\mathbb{E}[F_n] \leq nS + \overline{\lambda} \log n - (\overline{m} - 1)\log\log n + O(1), \qquad (4.6)$$
where $S$ is the entropy of the true distribution, and $\overline{\lambda}$ and $\overline{m}$ are the constants determined by $H$, $H_0$, $M$, and the hyperparameters: depending on the range of the hyperparameters of the deterministic components, they are given by Eqs. (4.7)–(4.8) or by Eqs. (4.9)–(4.10), whose explicit forms are found in [12].
Furthermore, if $H = H_0 + 1$, that is, when the number of components in the probability model is one greater than that in the true distribution, $\overline{\lambda}$ equals the exact value of the RLCT $\lambda$, and $\overline{m}$ also equals the multiplicity $m$.
Matsuda analyzed the RLCT of trinomial mixtures with two components. The exact value of the RLCT was elucidated when the true distribution is a single multinomial distribution and the probability model is a trinomial mixture with two components, that is, in the case of $M = 3$, $H = 2$, and $H_0 = 1$.
Theorem 4.2 (the RLCT and multiplicity of trinomial mixtures with two components [18]).
Let the probability model be a trinomial mixture with two components:
$$p(x \mid w) = a\, P(x \mid b_1) + (1 - a)\, P(x \mid b_2). \qquad (4.11)$$
Here, $P(x \mid b)$ means the probability mass function of the trinomial distribution with $b$ as its parameter. Also, let the true distribution be a trinomial distribution:
$$q(x) = P(x \mid b^*). \qquad (4.12)$$
Also, assume that the prior distribution is positive and bounded on the parameter set $W$, and that the true distribution parameter $b^* = (b_1^*, b_2^*, b_3^*)$ satisfies
$$b_j^* > 0 \quad (j = 1, 2, 3). \qquad (4.13)$$
Under these conditions, the RLCT is as follows:
(4.14) |
Matsuda clarified the RLCT of a trinomial mixture with two components, which is the case of $M = 3$ and $H = 2$, by using an algebraic-geometric technique called weighted blow-up.
5 Main Theorem
In this section, we show the main result of this paper, which is a generalization of Theorem 4.2. We clarify the RLCT and the multiplicity of general multinomial mixtures with two components. Furthermore, we also consider the case where the Dirichlet distribution is adopted as the prior distribution of the mixing ratio.
Theorem 5.1 (Main Theorem).
Let the probability model be a multinomial mixture with two components:
$$p(x \mid w) = a\, P(x \mid b_1) + (1 - a)\, P(x \mid b_2). \qquad (5.1)$$
Also, let the true distribution be a multinomial distribution:
$$q(x) = P(x \mid b^*). \qquad (5.2)$$
Also, assume that the prior distribution of the parameters $b_1, b_2$ is positive and bounded on the set $\mathcal{B}$, and that the true distribution parameter satisfies
$$b_j^* > 0 \quad (j = 1, \dots, M). \qquad (5.3)$$
The prior distribution of the mixing ratio is considered in the following two cases.

1. If the prior distribution of the mixing ratio is positive and bounded, the RLCT and the multiplicity are given by
(5.4) (5.5)

2. If the prior distribution of the mixing ratio is the Dirichlet distribution with $\alpha = (\alpha_1, \alpha_2)$ as a hyperparameter,
$$\varphi_a(a) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\, a^{\alpha_1 - 1} (1 - a)^{\alpha_2 - 1}, \qquad (5.6)$$
the RLCT and the multiplicity are given by
(5.7) (5.8)
6 Proof of the Main Theorem
6.1 Properties of the RLCTs and the multiplicities
To prove Theorem 5.1, we introduce notation and explain some properties of the RLCTs and the multiplicities. Since the RLCT and the multiplicity are determined by the mean error function $K(w)$ and the prior distribution $\varphi(w)$, they are expressed as $\lambda(K\varphi)$ and $m(K\varphi)$, respectively. If the maximum poles and their orders of the two zeta functions
$$\zeta_1(z) = \int K_1(w)^z \varphi_1(w)\, dw, \qquad (6.1)$$
$$\zeta_2(z) = \int K_2(w)^z \varphi_2(w)\, dw \qquad (6.2)$$
are equal, we write
$$K_1 \varphi_1 \sim K_2 \varphi_2, \qquad (6.3)$$
and say that $K_1\varphi_1$ and $K_2\varphi_2$ are equivalent.
Lemma 6.1.
Let $K(w)$ be the mean error function and let $\varphi(w)$ be the prior function.

1. If there exist a function $K'(w)$ and positive constants $c_1, c_2$ such that
$$c_1 K'(w) \leq K(w) \leq c_2 K'(w) \qquad (6.4)$$
holds for any $w$, then $K\varphi \sim K'\varphi$.

2. If $K(w_1, w_2) = K_1(w_1) + K_2(w_2)$ and $\varphi(w_1, w_2) = \varphi_1(w_1)\varphi_2(w_2)$, the following holds:
$$\lambda(K\varphi) = \lambda(K_1\varphi_1) + \lambda(K_2\varphi_2), \qquad (6.5)$$
$$m(K\varphi) = m(K_1\varphi_1) + m(K_2\varphi_2) - 1. \qquad (6.6)$$

3. If $K(w_1, w_2) = K_1(w_1)\, K_2(w_2)$ and $\varphi(w_1, w_2) = \varphi_1(w_1)\varphi_2(w_2)$, then
$$\lambda(K\varphi) = \min\big(\lambda(K_1\varphi_1),\, \lambda(K_2\varphi_2)\big), \qquad (6.7)$$
$$m(K\varphi) = \begin{cases} m(K_1\varphi_1) + m(K_2\varphi_2) & (\lambda(K_1\varphi_1) = \lambda(K_2\varphi_2)) \\ m(K_1\varphi_1) & (\lambda(K_1\varphi_1) < \lambda(K_2\varphi_2)) \\ m(K_2\varphi_2) & (\lambda(K_1\varphi_1) > \lambda(K_2\varphi_2)). \end{cases} \qquad (6.8)$$

4. Let $J, J'$ be natural numbers, and let $\{f_j\}_{j=1}^J, \{g_k\}_{k=1}^{J'}$ be sets of analytic functions. If the ideal generated from $\{f_j\}$ and the ideal generated from $\{g_k\}$ are equal and
$$K(w) = \sum_{j=1}^J f_j(w)^2, \qquad K'(w) = \sum_{k=1}^{J'} g_k(w)^2, \qquad (6.9)$$
then $K\varphi \sim K'\varphi$.

5. For any function $\psi(w)$ that is bounded above and below by positive constants on a compact set,
$$K\varphi\psi \sim K\varphi. \qquad (6.10)$$

6. Let $K(w)$ and $\varphi(w)$ be the following functions of $w = (w_1, \dots, w_d)$:
$$K(w) = w_1^{2k_1} \cdots w_d^{2k_d}, \qquad \varphi(w) = |w_1|^{h_1} \cdots |w_d|^{h_d}; \qquad (6.11)$$
then $\lambda(K\varphi) = \min_j \dfrac{h_j + 1}{2k_j}$, and $m(K\varphi)$ is the number of indices $j$ attaining this minimum.
6.2 The restriction on the parameter set of general multinomial mixtures
To prove Theorem 5.1, we prepare some lemmas. As mentioned in Section 2, the zeta function is determined by the prior distribution $\varphi(w)$ and the mean error function $K(w)$, and the mean error function is defined by the KL divergence between the true distribution and the probability model. Thus, the mean error function is
$$K(w) = \sum_{x \in \mathcal{X}} q(x) \log \frac{q(x)}{p(x \mid w)}. \qquad (6.12)$$
However, in the case of the multinomial mixture, some problems arise when considering the mean error function on the entire parameter set $W$. When the probability model is a multinomial mixture, $p(x \mid w) = 0$ for some $w \in W$ and some $x \in \mathcal{X}$. Since the true distribution is not 0 by assumption, at the points where $p(x \mid w) = 0$ the value $\log(q(x)/p(x \mid w))$ is not finite and the mean error function diverges. Thus, the results on the asymptotic behavior of the generalization error in [8] cannot be applied directly. To solve this problem, we prove that even if the original parameter set is restricted to the set $W_\epsilon$, the parameter set on which $p(x \mid w) \geq \epsilon$ holds, the asymptotic behavior of the generalization error does not change.
Lemma 6.2 (the restriction on the parameter set).
Let $W$ be the parameter set of the multinomial mixtures with $H$ components. Let the probability model $p(x \mid w)$ be the multinomial mixture with $H$ components:
$$p(x \mid w) = \sum_{k=1}^H a_k P(x \mid b_k). \qquad (6.13)$$
Let the true distribution $q(x)$ be the multinomial mixture with $H_0$ components ($H_0 \leq H$):
$$q(x) = \sum_{k=1}^{H_0} a_k^* P(x \mid b_k^*). \qquad (6.14)$$
Here, for any $x \in \mathcal{X}$,
$$q(x) > 0. \qquad (6.15)$$
Fix $\epsilon > 0$ as a sufficiently small number, let $W_\epsilon$ be the subset of $W$ such that $p(x \mid w) \geq \epsilon$ for all $x \in \mathcal{X}$, and let $W_\epsilon^c$ be the complement of $W_\epsilon$ (i.e., $W_\epsilon^c = W \setminus W_\epsilon$). Let $\lambda$ be the minus maximum pole and $m$ be its order of the zeta function whose integration range is restricted to $W_\epsilon$:
$$\zeta_\epsilon(z) = \int_{W_\epsilon} K(w)^z \varphi(w)\, dw. \qquad (6.16)$$
Then, the asymptotic behavior of the generalization error is expressed by the following equation:
$$\mathbb{E}[G_n] = \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right). \qquad (6.17)$$
Lemma 6.2 means that the asymptotic behavior of the generalization error of the multinomial mixtures can be analyzed by finding the maximum pole of the zeta function whose integration range is restricted to $W_\epsilon$. Lemma 6.2 will be proved in Section 6.2.1.
6.2.1 The proof of Lemma 6.2
We now prove Lemma 6.2. By the definition of the generalization error,
$$G_n = \sum_{x \in \mathcal{X}} q(x) \log \frac{q(x)}{p(x \mid X^n)}. \qquad (6.18)$$
By the definitions of the predictive distribution and the posterior distribution, and by the assumption on the true distribution, we obtain the decomposition
(6.19)–(6.23)
where the remainder term is the quantity expressed by the following equation:
(6.24) |
By Eq. (6.23),
(6.25) |
Here, fix $\epsilon > 0$ as a sufficiently small number, let $W_\epsilon$ be the subset of $W$ on which $p(x \mid w) \geq \epsilon$ holds for all $x$, and let $W_\epsilon^c$ be the subset on which it does not. The integral over the parameter $w$ is divided into the two integration domains $W_\epsilon$ and $W_\epsilon^c$, and the corresponding partial marginal likelihoods $Z_n(W_\epsilon)$ and $Z_n(W_\epsilon^c)$ are defined as follows:
(6.26)–(6.27)
Then the relation $Z_n = Z_n(W_\epsilon) + Z_n(W_\epsilon^c)$ holds, and
(6.28) |
Since $K(w)$ is finite on $W_\epsilon$, the results of [9] can be applied; for sufficiently large $n$, the following asymptotic behavior of $Z_n(W_\epsilon)$ holds:
(6.29) |
where the random variable in Eq. (6.29) satisfies
(6.30) |
Here, we introduce the following Lemma.
Lemma 6.3.
There is a random variable that takes the value one only for certain events related to the sample $X^n$ and zero otherwise, such that the following equations hold:
(6.31)–(6.32)
6.2.2 Sanov’s theorem
Let $N$ be a natural number, and let $\mathcal{P}$ be the set consisting of all probability distributions on the finite set $\{1, \dots, N\}$:
$$\mathcal{P} = \Big\{ p = (p_1, \dots, p_N) \in \mathbb{R}_{\geq 0}^N \;:\; \sum_{j=1}^N p_j = 1 \Big\}. \qquad (6.35)$$
Let $q \in \mathcal{P}$ be called the true distribution, and fix it. Let $X_1, \dots, X_n$ be random variables independently and identically generated from the probability distribution $q$. Also, for each $j$, the random variable $n_j$ is the number of $X_i$ whose value is $j$. Let the empirical distribution be
$$\hat{q}_n = \Big( \frac{n_1}{n}, \dots, \frac{n_N}{n} \Big). \qquad (6.36)$$
Then, the next theorem holds.
Theorem 6.1 (Sanov’s Theorem [19]).
Let $A$ be a subset of $\mathcal{P}$. For the probability that the empirical distribution $\hat{q}_n$ is included in the set $A$, the following inequality holds:
$$\Pr(\hat{q}_n \in A) \leq (n+1)^N \exp\Big(-n \inf_{p \in A} K(p \,\|\, q)\Big), \qquad (6.37)$$
where $K(p \,\|\, q)$ denotes the KL divergence.
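A small Monte Carlo sketch (ours, not from the paper) makes the exponential decay visible; the true distribution $q$, the set $A = \{p : p_1 \leq 0.4\}$, and the trial counts are illustrative assumptions, and the infimum of the KL divergence over $A$ is attained at the I-projection of $q$ onto $A$:

```python
# Sketch: empirical P(hat{q}_n in A) versus the Sanov exponent exp(-n * rate).
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.6, 0.3, 0.1])             # true distribution on {1, 2, 3}

def kl(p, q):
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

# A = {p : p_1 <= 0.4}; the I-projection of q clamps p_1 = 0.4 and rescales
p_star = np.array([0.4, 0.45, 0.15])
rate = kl(p_star, q)                      # = inf over A of K(p || q)

trials = 100_000
for n in (25, 50, 100):
    counts = rng.multinomial(n, q, size=trials)
    prob = np.mean(counts[:, 0] / n <= 0.4)   # empirical P(hat{q}_n in A)
    print(n, prob, np.exp(-n * rate))         # decay is governed by the rate
```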
6.2.3 The property of the multinomial distribution with one trial
Let the true distribution $q$ be positive in all elements, i.e., $q_j > 0$ for all $j$. Let the positive number $\delta$ be sufficiently small and define the set $A$ as follows:
$$A = \{ p \in \mathcal{P} : p_j \leq \delta \ \text{for some } j \}. \qquad (6.38)$$
Since $\delta$ is assumed to be sufficiently small, we can take it so that $q$ is not included in the set $A$. Due to the positivity of $q$ and the property of the KL divergence, $K(p \,\|\, q) > 0$ holds for all $p \in A$. Next, the constant $c$ is fixed as an arbitrary number that satisfies $0 < c < \inf_{p \in A} K(p \,\|\, q)$, and the set $B$ is defined by
$$B = \{ p \in \mathcal{P} : K(p \,\|\, q) \geq c \}. \qquad (6.39)$$
Now, we show that $A \subset B$ and that $q$ is not contained in $B$. Assume that there exists a probability distribution $p \in A$ with $p \notin B$. Since $p \in A$,
$$K(p \,\|\, q) \geq \inf_{p' \in A} K(p' \,\|\, q) > c. \qquad (6.40)$$
However, since $p \notin B$, $K(p \,\|\, q) < c$, which is a contradiction. Furthermore, we can take a sufficiently small positive number such that
(6.41)
By the definition of the set $A$, if $\hat{q}_n \notin A$ then every element of $\hat{q}_n$ is greater than $\delta$. Applying Sanov's theorem to the set $B$,
$$\Pr(\hat{q}_n \in B) \leq (n+1)^N e^{-nc}. \qquad (6.42)$$
Therefore, for sufficiently large $n$,
(6.43)–(6.44)
Thus,
(6.45)
By the definition of the set $A$, with probability at least $1 - \delta_n$, where $\delta_n$ denotes the bound in Eq. (6.45),
(6.46)
Since $X_1, \dots, X_n$ are the random variables generated independently from the true distribution $q$, and $n_j$ is the number of $X_i$ whose value is $j$, by Eq. (6.46),
(6.47)
By calculating Eq. (6.47),
(6.48)
Eq. (6.48) means that if $q$ and $\hat{q}_n$ are regarded as the probability mass functions of multinomial distributions with one trial, the following inequality holds with probability at least $1 - \delta_n$:
(6.49)
6.2.4 The properties of the predictive distribution of the multinomial mixtures
In this section, we consider a lower bound of the predictive distribution when the probability model is the multinomial mixture:
$$p(x \mid w) = \sum_{k=1}^H a_k P(x \mid b_k). \qquad (6.50)$$
We introduce the latent variable $y = (y_1, \dots, y_H)$, a vector such that one element is 1 and the others are 0. By using the variable $y$, we can rewrite the model as follows:
$$p(x, y \mid w) = \prod_{k=1}^H \big( a_k P(x \mid b_k) \big)^{y_k}. \qquad (6.51)$$
Both the prior distribution of the parameters of the multinomials and that of the mixing ratio are Dirichlet distributions:
$$\varphi_a(a) = \frac{1}{C_a} \prod_{k=1}^H a_k^{\alpha - 1}, \qquad \varphi_b(b_k) = \frac{1}{C_b} \prod_{j=1}^M b_{kj}^{\beta - 1}, \qquad (6.52)$$
where $\alpha, \beta$ are hyperparameters and $C_a, C_b$ are normalizing constants such that
$$C_a = \frac{\Gamma(\alpha)^H}{\Gamma(H\alpha)}, \qquad C_b = \frac{\Gamma(\beta)^M}{\Gamma(M\beta)}. \qquad (6.53)$$
To calculate the predictive distribution, we calculate the posterior distribution of the parameters and the latent variables:
(6.54)–(6.56)
where the normalizing constant is given by
(6.57)–(6.59)
Thus, the predictive distribution can be calculated as follows:
(6.60)–(6.63)
Since each latent variable is a vector in which one element is 1 and the others are 0, and the counts of the latent variables sum to $n$, applying the property of the Gamma function $\Gamma(z+1) = z\Gamma(z)$, we can show that
(6.64)
Thus, there exists a positive constant such that for all $x$,
(6.65)
By the definition of the marginal distribution,
(6.66) |
Therefore,
(6.67) |
In this section, the prior distribution of the parameters was taken to be the Dirichlet distribution. In the main theorem, we also consider the case of a positive and bounded prior distribution. In that case, since such a prior does not affect the poles of the zeta function, a lower bound of the predictive distribution of the same exponential order as in the above discussion is obtained.
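To make the collapsed computation of this subsection concrete, the following sketch (our illustration; the data, $H = 2$, and the hyperparameters are assumptions) evaluates the marginal likelihood of a multinomial mixture exactly by summing the collapsed Dirichlet integrals over all latent assignments; the multinomial coefficients, which do not depend on the parameters, are omitted:

```python
# Sketch: exact marginal likelihood Z_n of a multinomial mixture with Dirichlet
# priors, by enumerating all H^n latent assignments z^n.
from itertools import product
from math import lgamma, exp, log
import numpy as np

def log_dir_mult(counts, conc):
    """log of int prod_j b_j^{counts_j} Dirichlet(b | conc) db (collapsed term)."""
    counts = np.asarray(counts, float)
    conc = np.asarray(conc, float)
    return (lgamma(conc.sum()) - lgamma((conc + counts).sum())
            + sum(lgamma(c + n) - lgamma(c) for c, n in zip(conc, counts)))

def log_marginal(X, alpha, beta, H=2):
    """log Z_n (multinomial coefficients omitted: a constant shift)."""
    X = np.asarray(X)                  # rows: count vectors x_i over M categories
    n, M = X.shape
    terms = []
    for z in product(range(H), repeat=n):      # all latent assignments
        z = np.array(z)
        t = log_dir_mult(np.bincount(z, minlength=H), [alpha] * H)
        for k in range(H):
            t += log_dir_mult(X[z == k].sum(axis=0), [beta] * M)
        terms.append(t)
    m = max(terms)                             # log-sum-exp over assignments
    return m + log(sum(exp(t - m) for t in terms))

# three observations with L = 3 trials over M = 3 categories
X = [(2, 1, 0), (0, 1, 2), (3, 0, 0)]
print(-log_marginal(X, alpha=1.0, beta=1.0))   # the free energy F_n = -log Z_n
```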
6.2.5 Properties of the maximum likelihood estimator of multinomial mixtures
Multinomial mixtures with $L$ trials are distributions on the finite set of vectors $x$ such that $\sum_{j=1}^M x_j = L$. The set $\mathcal{X}$ is defined by
$$\mathcal{X} = \Big\{ x \in \mathbb{Z}_{\geq 0}^M : \sum_{j=1}^M x_j = L \Big\}. \qquad (6.68)$$
The set $\mathcal{X}$ is finite, and the number of its elements is $\binom{L+M-1}{M-1}$. Let $\mathcal{Q}$ be the set of all discrete probability distributions on the finite set $\mathcal{X}$:
$$\mathcal{Q} = \Big\{ Q : \mathcal{X} \to \mathbb{R}_{\geq 0} \;:\; \sum_{x \in \mathcal{X}} Q(x) = 1 \Big\}. \qquad (6.69)$$
Every probability distribution on $\mathcal{X}$ that can be expressed by a multinomial mixture with $L$ trials is included in the set $\mathcal{Q}$. Given that the probability model and the true distribution are both multinomial mixtures with $L$ trials, and that their corresponding distributions in $\mathcal{Q}$ are $P_w$ and $Q^*$ respectively, the mean error function is expressed as follows:
$$K(w) = \sum_{x \in \mathcal{X}} Q^*(x) \log \frac{Q^*(x)}{P_w(x)}. \qquad (6.70)$$
Since the function $K(w)$ is the KL divergence between the true distribution and the probability model, $K(w) = 0$ if $P_w = Q^*$, and $K(w) > 0$ otherwise. Now, we fix the positive constant $c$ of Section 6.2.3 and define the following subset of $\mathcal{Q}$:
(6.71)
Since the elements of $Q^*$ are positive at all points of $\mathcal{X}$ by assumption, by fixing the positive constant $\delta$ of Section 6.2.3, another subset of $\mathcal{Q}$ is defined as follows:
(6.72)
The constants can be chosen so that the two sets defined above have no intersection. Then, the log empirical loss can be calculated as follows:
(6.73)–(6.75)
The first term of Eq. (6.75) converges to a certain constant, and the second term converges to 0. Thus, with probability at least $1 - \delta_n$,
(6.76)
Therefore,
(6.77) |
Let $\hat{P}$ be a maximum likelihood estimator within the multinomial mixtures with $L$ trials, and let $\hat{Q}$ be a maximum likelihood estimator within all the discrete distributions on $\mathcal{X}$. Since the set of distributions that can be expressed by multinomial mixtures with $L$ trials is included in the set of all the discrete distributions on $\mathcal{X}$, the likelihood of $\hat{P}$ is at most that of $\hat{Q}$. Therefore, with probability at least $1 - \delta_n$,
(6.78)
6.2.6 The proof of Lemma 6.3
Proof.
We fix the positive constants of Section 6.2.3 and the set defined in Eq. (6.71). The random variable is defined as follows:
(6.79)
where $\hat{q}_n$ is the empirical distribution defined in Eq. (6.36). From Eq. (6.43), the probability of the corresponding event is bounded. Using the fact that the true distribution does not depend on the sample size $n$ and using Eq. (6.67),
(6.80)–(6.82)
Moreover, by Eq. (6.78),
(6.83)–(6.84)
∎
6.3 Properties of the general components
We prepare a lemma that holds for multinomial mixtures with a general number of components.
Lemma 6.4.
Let $p(x \mid w)$ be a multinomial mixture with $H$ components:
$$p(x \mid w) = \sum_{k=1}^H a_k P(x \mid b_k). \qquad (6.87)$$
Also, let the true distribution be a multinomial mixture with $H_0$ components:
$$q(x) = \sum_{k=1}^{H_0} a_k^* P(x \mid b_k^*). \qquad (6.88)$$
Let $K(w)$ be the mean error function determined by the probability model and the true distribution. Then $K(w)$ has the same RLCT and multiplicity as the function $K'(w)$ defined below:
(6.89)
that is, $K\varphi \sim K'\varphi$.
Proof.
From Lemma 6.1(1), the mean error function $K(w)$ is equivalent to the sum of squares $\sum_{x \in \mathcal{X}} (p(x \mid w) - q(x))^2$, that is, the RLCTs and the multiplicities are equal. We calculate as follows:
(6.90)–(6.91)
Since $q(x)$ is positive and bounded for any $x$,
(6.92)
Furthermore, since $\sum_{j=1}^M x_j = L$ for each $x$ and $\sum_{j=1}^M b_{kj} = 1$ for each $k$, both $x_M$ and $b_{kM}$ can be represented by the other coordinates; for each $k$,
(6.93)
Here, by using the binomial theorem,
(6.94)–(6.95)
Also
(6.96) |
Therefore,
(6.97)–(6.98)
For simplicity, for each $k$ and $j$, we define the following quantities:
(6.99)–(6.100)
It follows that
(6.101) |
By using the multinomial theorem,
(6.102)–(6.103)
where the summation shows the sum over all sets of non-negative integers whose components sum to the stated order. We apply Eqs. (6.102) and (6.103) to Eq. (6.101). Then we obtain
(6.104)–(6.105)
By using Eqs.(6.99) and (6.100),
(6.106) |
We introduce a polynomial defined by
(6.107)–(6.108)
Then by Eq.(6.106), it follows that
(6.109) |
Each term of the second kind in Eq. (6.109) can be expressed as a linear combination of the terms of the first kind. That is, for the second term there is a constant that does not depend on the parameters such that
(6.110)
Therefore, the ideal generated from the first set of polynomials and the ideal generated from the second set are equal, so the function $K'(w)$ is defined as follows:
(6.111)–(6.112)
From Lemma 6.1(4), the two functions $K$ and $K'$ are equivalent, that is, their RLCTs and multiplicities are equal. ∎
6.4 Properties of the two components
So far, we have prepared Lemma 6.4, which holds for multinomial mixtures with a general number of components. Hereafter, we assume that the number of components of the multinomial mixture of the probability model is 2 (i.e., $H = 2$) and that the true distribution is a single multinomial distribution (i.e., $H_0 = 1$). That is, the probability model and the true distribution are as follows:
$$p(x \mid w) = a\, P(x \mid b_1) + (1 - a)\, P(x \mid b_2), \qquad (6.113)$$
$$q(x) = P(x \mid b^*). \qquad (6.114)$$
Then, the polynomial defined by the Eq.(6.107) is expressed as follows:
Lemma 6.5.
For $j \in [M-1]$, the following holds:
(6.115)
where $[M-1]$ represents the set $\{1, \dots, M-1\}$.
Proof. The claim follows by direct calculation (Eqs. (6.116)–(6.118)). ∎
Lemma 6.6.
Define the set of parameters as follows:
(6.119)
Define the function $K''$ of the parameter as follows:
(6.120)
Then, $K'\varphi \sim K''\varphi$.
6.5 Proof of the main theorem
Let us prove Theorem 5.1.
Proof.
(Proof of Theorem 5.1.) From Lemma 6.6, to clarify the RLCT and the multiplicity of the multinomial mixture with two components, we calculate the largest pole of the zeta function determined by $K''$ and $\varphi$:
(6.121)
where the summation represents the sum over the index sets. Let us define a map on the parameters, where for each index,
(6.122)
The new parameter consists of the transformed coordinates. Based on this map,
(6.123)
From the symmetry of the parameters, we can restrict the integration range of the mixing ratio without loss of generality. Let us define a map, where
(6.124)
Since we consider the restricted range, the Jacobian determinant of this transform is not equal to zero. Therefore, neither the maximum pole nor its order of the zeta function changes. We obtain
(6.125)
Here, we eliminate one variable by using the constraint:
(6.126)–(6.130)
Moreover,
(6.131)–(6.136)
By recursively applying Lemma 6.1(5), we obtain
(6.137) |
Here, for all parameters,
(6.138) |
Also, by the constraint on the parameters and the inequality between the arithmetic and geometric means,
(6.139) |
Therefore, there exists a constant that does not depend on the parameters such that
(6.140)–(6.141)
Thus, let us define the following function:
(6.142) |
From Lemma 6.1(1) and Eq. (6.141),
(6.143) |
From Eq. (6.143), the main theorem can be derived by finding the maximum pole and its order of the zeta function determined by this function and the prior. If the prior distribution of the mixing ratio is positive and bounded, then from Lemma 6.1(2)(3),
(6.144)–(6.145)
Here, the two candidate poles coincide only in the boundary case; hence the multiplicity is 2 in that case and 1 otherwise. Therefore, Theorem 5.1(1) is proved.
Furthermore, consider the case of Theorem 5.1(2), that is, the case where the prior distribution of the mixing ratio follows the Dirichlet distribution with $\alpha$ as the hyperparameter. Using the fact that the prior distribution of the mixing ratio is given by Eq. (5.6) and that the prior of the other parameters is positive and bounded,
(6.146)–(6.147)
∎
7 Phase transition due to prior distribution hyperparameters
In Bayesian statistics, if a prior distribution has a hyperparameter $\alpha$ and the posterior distribution for sufficiently large $n$ changes drastically at $\alpha = \alpha_c$, then the posterior distribution is said to have a phase transition, and $\alpha_c$ is called a critical point [9].
In the case of Theorem 5.1(2), the prior distribution of the mixing ratio of the multinomial mixture is assumed to be the Dirichlet distribution with hyperparameter $\alpha$, and the asymptotic free energy is given by
(7.1) |
From Eq. (7.1), the free energy is not differentiable with respect to $\alpha$ at the critical point, so that value of $\alpha$ is the phase transition point. If there is a phase transition point, the support of the posterior distribution changes significantly between the two phases, which greatly affects the result of statistical inference.
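Schematically (this is an illustration of the mechanism only, not the formula of Theorem 5.1(2)), a phase transition arises whenever the RLCT depends on the hyperparameter through a minimum of two branches:

```latex
% Schematic form (an assumption for illustration): an RLCT of the shape below,
% with constants c, \lambda_0 > 0, yields a critical point at \alpha = 2\lambda_0.
\lambda(\alpha) = c + \min\!\left(\frac{\alpha}{2},\, \lambda_0\right),
\qquad
F_n \approx n S_n + \lambda(\alpha) \log n .
% F_n is continuous in \alpha but not differentiable at \alpha = 2\lambda_0:
% below the critical point the pole arising from the mixing-ratio prior
% dominates, and above it the pole from the component parameters dominates.
```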
8 Conclusions
In this paper, we derived the real log canonical threshold and the multiplicity when the probability model is a multinomial mixture with two components, the prior is the Dirichlet distribution, and the true distribution is a multinomial distribution. The asymptotic behaviors of the free energy and the generalization error were clarified. One direction for future work is to find the RLCTs and multiplicities of multinomial mixtures with a general number of components.
References
- [1] Takeshi Watanabe and Einoshin Suzuki. Prototyping abnormal medical test values in hepatitis data with mixture multinomial distribution estimate. ICS, Vol. 2002, No. 45 (2002-ICS-128), pp. 49–54, 2002.
- [2] Tomonari Masada, Atsuhiro Takasu, and Jun Adachi. Clustering for name disambiguation in author citations. DBSJ Letters, Vol. 6, No. 1, 2007.
- [3] Tomonari Masada, Senya Kiyasu, and Sueharu Miyahara. Clustering images with multinomial mixture models. In International Symposium on Advanced Intelligent Systems, 2007.
- [4] Keisuke Yamazaki and Sumio Watanabe. Resolution of singularities in mixture models and its stochastic complexity. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'02), Vol. 3, pp. 1355–1359. IEEE, 2002.
- [5] Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, Vol. 19, No. 6, pp. 716–723, 1974.
- [6] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, pp. 461–464, 1978.
- [7] David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin, and Angelika Van Der Linde. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 64, No. 4, pp. 583–639, 2002.
- [8] Sumio Watanabe. Algebraic Analysis for Nonidentifiable Learning Machines. Neural Computation, Vol. 13, No. 4, pp. 899–933, 04 2001.
- [9] Sumio Watanabe. Mathematical theory of Bayesian statistics. CRC Press, 2018.
- [10] Miki Aoyagi and Sumio Watanabe. Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Networks, Vol. 18, No. 7, pp. 924–933, 2005.
- [11] Kenichiro Sato and Sumio Watanabe. Bayesian generalization error of Poisson mixture and simplex Vandermonde matrix type singularity. arXiv preprint arXiv:1912.13289, 2019.
- [12] Keisuke Yamazaki and Daisuke Kaji. Comparing two Bayes methods based on the free energy functions in Bernoulli mixtures. Neural Networks, Vol. 44, pp. 36–43, 2013.
- [13] Naoki Hayashi. The exact asymptotic form of Bayesian generalization error in Latent Dirichlet Allocation. Neural Networks, Vol. 137, pp. 127–137, 2021.
- [14] Kenji Nagata and Sumio Watanabe. Asymptotic behavior of exchange ratio in exchange Monte Carlo method. Neural Networks, Vol. 21, No. 7, pp. 980–988, 2008.
- [15] Mathias Drton and Martyn Plummer. A Bayesian information criterion for singular models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 79, No. 2, pp. 323–380, 2017.
- [16] Sumio Watanabe. Algebraic geometry and statistical learning theory. No. 25 in Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2009.
- [17] Sumio Watanabe. Algebraic geometrical methods for hierarchical learning machines. Neural Networks, Vol. 14, No. 8, pp. 1049–1060, 2001.
- [18] Takeshi Matsuda and Sumio Watanabe. Weighted blowups of Kullback information and application to multinomial distributions. IEICE Proceedings Series, Vol. 42, No. B2L-C2, 2008.
- [19] Imre Csiszár. A simple proof of Sanov's theorem. Bulletin of the Brazilian Mathematical Society, Vol. 37, No. 4, pp. 453–459, 2006.
- [20] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- [21] Takumi Watanabe and Sumio Watanabe. Asymptotic behavior of Bayesian generalization error in multinomial mixtures. IEICE Technical Report, Vol. 119, No. 360, pp. 1–8, 2020.
- [22] Keisuke Yamazaki and Sumio Watanabe. Singularities in mixture models and upper bounds of stochastic complexity. Neural Networks, Vol. 16, No. 7, pp. 1029–1038, 2003.
- [23] Miki Aoyagi. A Bayesian learning coefficient of generalization error and Vandermonde matrix-type singularities. Communications in Statistics—Theory and Methods, Vol. 39, No. 15, pp. 2667–2687, 2010.
- [24] Heisuke Hironaka. Resolution of singularities of an algebraic variety over a field of characteristic zero: II. Annals of Mathematics, pp. 205–326, 1964.