
Bayesian mixture modeling using a mixture of finite mixtures with normalized inverse Gaussian weights

Fumiya Iwashige and Shintaro Hashimoto
Department of Mathematics, Hiroshima University
Abstract

In Bayesian inference for mixture models with an unknown number of components, a finite mixture model that assumes prior distributions for the mixing weights and the number of components is usually employed. This model is called a mixture of finite mixtures (MFM). As a prior distribution for the weights, a (symmetric) Dirichlet distribution is widely used because of its conjugacy and computational simplicity, although the choice of its concentration parameter influences the estimate of the number of components. In this paper, we focus on estimating the number of components. As a robust alternative to Dirichlet weights, we present a method based on a mixture of finite mixtures with normalized inverse Gaussian weights. The motivation is similar to the use of normalized inverse Gaussian processes instead of Dirichlet processes in infinite mixture modeling. By introducing latent variables, posterior computation is carried out by block Gibbs sampling without resorting to the reversible jump algorithm. The performance of the proposed method is illustrated through some numerical experiments and real data examples, including clustering, density estimation, and community detection.

Keywords: Bayesian nonparametrics; Clustering; Dirichlet distribution; Inverse Gaussian distribution; Markov chain Monte Carlo; Mixture models

1 Introduction

We consider a finite mixture model with an unknown number of components:

\displaystyle f(y|\theta_{1},\dots,\theta_{M},\pi)=\sum_{j=1}^{M}\pi_{j}f(y|\theta_{j}), (1.1)

where M\in\{1,2,\dots\} is the number of components, y\in\mathbb{R}^{d} is a d-dimensional observation vector, and \pi=(\pi_{1},\dots,\pi_{M})^{\top} is the mixing weight vector, satisfying \sum_{j=1}^{M}\pi_{j}=1 and \pi_{j}>0. For j=1,\dots,M, f(\cdot|\theta_{j}) is a probability density or probability mass function parameterized by the component-specific parameter \theta_{j}. The function f(\cdot|\theta_{j}) is also called the kernel of the mixture model. Each observation is assumed to arise from one of the M components, and each component is weighted according to its frequency of occurrence. Mixture models are important statistical models used in model-based clustering and density estimation because they can model complex data-generating distributions in random phenomena by combining multiple components. Finite mixture models have been applied in many fields, for instance, sociology (Handcock et al., 2007), economics (Frühwirth-Schnatter and Kaufmann, 2008) and genetics (McLachlan et al., 2002). In applications, the determination of M is a very important issue. Correctly estimating M improves model interpretability, the accuracy of kernel parameter estimates and model predictions, and reduces computation time. Although various methods have been proposed for the selection of M, such as model selection criteria and hypothesis testing, using such criteria makes it difficult to quantify the uncertainty in M and also introduces the bias associated with carrying out model selection. Comprehensive reviews of (finite) mixture models are provided by McLachlan (2000); Frühwirth-Schnatter (2006); Frühwirth-Schnatter et al. (2019); McLachlan et al. (2019). In this paper, we focus on the case where M is finite and make a clear distinction between the number of components M and the number of clusters k. In general, the number of components is written as M=k+M_{na}, where k is the number of components to which data are actually assigned, and M_{na} is the number of empty components (see also Argiento and De Iorio, 2022).

In Bayesian analysis, a finite mixture model that assumes prior distributions for the mixing weights and the number of components is usually employed (see, e.g. Nobile, 1994; Miller and Harrison, 2018). Such a model is called a mixture of finite mixtures (MFM), and it is often used alongside infinite mixture models represented by Dirichlet process mixture models. Although the reversible jump algorithm (Green, 1995; Richardson and Green, 1997) has been used to obtain samples from the posterior distribution based on the MFM, it faces significant computational and implementation challenges. Methods based on marginal likelihoods have also been proposed (see, e.g. Nobile and Fearnside, 2007), but, as with reversible jump, the computational cost is an issue. A method called sparse MFM (Rousseau and Mengersen, 2011; Malsiner-Walli et al., 2016) has also been proposed, which focuses on estimating the number of clusters k by using an overfitted model with a large fixed M and carefully choosing the prior distribution of the component weights. Under some conditions, they showed that k is consistent for the true M. Since this approach allows for many empty components, it cannot estimate the number of components M, and it is not easy to quantify the associated uncertainty. Recently, the use of nonparametric Bayesian methods for the MFM to estimate M has attracted much attention. In Miller and Harrison (2018), the exchangeable partition distribution is derived by marginalizing out M, and a restaurant process is constructed to overcome computational difficulties. In Frühwirth-Schnatter et al. (2021), a generalized MFM in which the parameters of the Dirichlet distribution depend on M is proposed. Miller and Harrison (2018) derived the theoretical result that the marginal posterior probabilities of k and M are asymptotically equivalent when the sample size is large, but only samples of k are obtained, as with general nonparametric Bayesian methods. Thus, the number of components M cannot be sampled directly from the posterior distribution. Although the method of Frühwirth-Schnatter et al. (2021) can sample M directly, like the sparse MFM it has the disadvantage of producing a large number of empty components and overestimating M. In addition, most MFM studies use a symmetric Dirichlet distribution for the mixing weight vector, and the choice of its hyperparameter is very important. Although it is possible to place a prior distribution on it and learn it from the data, a Metropolis-Hastings step is then required.

The main purposes of this paper are 1) sampling M directly and more efficiently than the MFM based on the Dirichlet distribution; 2) estimating M reasonably by suppressing M_{na}; 3) proposing an MFM that is robust to the choice of the hyperparameters of the distribution on the simplex. To this end, we employ the normalized inverse Gaussian distribution as the prior distribution of the mixing weight vector in the finite mixture model. To the best of our knowledge, there are few studies using normalized inverse Gaussian distributions as mixing weights in the field of finite mixture models. In Bayesian nonparametric inference with infinite mixtures, Lijoi et al. (2005) proposed the normalized inverse Gaussian process and derived some analytical results for the posterior distribution under this process. Just as Miller and Harrison (2018) can be regarded as a finite version of the Dirichlet process, the model presented in this paper corresponds to a finite counterpart of Lijoi et al. (2005). Furthermore, leveraging data augmentation with a latent gamma random variable and the result of Argiento and De Iorio (2022), we construct an efficient posterior sampling algorithm.

This paper is organized as follows. In Section 2, we introduce the MFM and its equivalent representation using discrete probability measures. We also discuss the data augmentation and the conditional posterior distribution of M_{na}, which are crucial for constructing the computational algorithm. In Section 3, we present the proposed method as well as the posterior computation algorithm. In Section 4, we present some numerical studies to compare the proposed method with existing methods. R code implementing the proposed methods is available at the GitHub repository:

https://github.com/Fumiya-Iwashige/MFM-Inv-Ga

2 Mixture of finite mixtures

We introduce a mixture of finite mixtures and its equivalent representation of the discrete probability measure. The proposed model and the posterior computation algorithm presented in the next section are largely based on the basic model presented in this section.

2.1 Formulation

Let the observations Y_{1},\dots,Y_{n} be univariate or multivariate, taking values in a Euclidean space. An MFM is a statistical model defined by the following hierarchical representation.

\displaystyle\begin{split}&M-1\sim q_{M},\quad q_{M}\ \text{is a probability mass function (p.m.f.) on}\ \{0,1,2,\dots\},\\ &\pi|M\sim P_{\pi}(\pi|M),\\ &c_{i}|M,\pi\sim\mathrm{Categorical}_{M}(\pi),\\ &\xi_{m}|M\overset{\mathrm{i.i.d}}{\sim}p_{0}(\xi),\quad m=1,\dots,M,\\ &\theta_{i}|c_{i},\xi\overset{\mathrm{ind}}{\sim}\delta_{\xi_{c_{i}}}(\theta_{i}),\quad i=1,\dots,n,\\ &Y_{i}|\theta_{i}\overset{\mathrm{ind}}{\sim}f(y_{i}|\theta_{i}),\quad i=1,\dots,n,\end{split} (2.1)

where f(y|\theta_{i}) is a parametric model with parameter \theta_{i}, \delta_{x}(\cdot) is a point mass at x and \xi_{m} is an element of the vector \xi=(\xi_{1},\dots,\xi_{M})^{\top}. p_{0}(\xi) is a prior density function for the parameters of the mixture components. The parameter space is denoted by \Theta\subset\mathbb{R}^{d}. Each c_{i} is a latent allocation variable that indicates to which component observation i is allocated. Given the number of components M, P_{\pi}(\pi|M) is a probability distribution on the M-dimensional unit simplex; that is, it is the prior distribution of the mixing weights. We focus on the case where the mixing weights can be further hierarchically expressed as follows:

\displaystyle\begin{split}&S_{m}|M\overset{\mathrm{ind}}{\sim}h,\quad h\ \text{is a probability density function (p.d.f.) on}\ (0,\infty),\\ &\pi=\left(\frac{S_{1}}{T},\dots,\frac{S_{M}}{T}\right)\sim P_{\pi}(\pi|M),\quad T=\sum_{m=1}^{M}S_{m},\end{split} (2.2)

where S_{1},\dots,S_{M} are unnormalized weights and \pi is the (normalized) mixing weight vector. This representation is the most basic way to generate a probability distribution on the d-dimensional unit simplex \mathbb{S}_{d}:=\{w=(w_{1},\dots,w_{d});w_{i}\geq 0,\sum_{i=1}^{d}w_{i}=1\}. One of the most famous examples of a distribution on \mathbb{S}_{d} is the Dirichlet distribution. If S_{i}\sim\text{Gamma}(\gamma_{i},1) independently for i=1,\dots,d in (2.2), then \pi\sim\text{Dirichlet}(\gamma_{1},\dots,\gamma_{d}), where \text{Gamma}(a,b) is the gamma distribution with shape parameter a>0 and rate parameter b>0, and \text{Dirichlet}(\gamma_{1},\dots,\gamma_{d}) is the Dirichlet distribution with parameter \gamma=(\gamma_{1},\dots,\gamma_{d})>0. If \gamma_{1}=\cdots=\gamma_{d}=:\gamma, the distribution obtained by normalization is the symmetric Dirichlet distribution \text{Dirichlet}(\gamma,\dots,\gamma). The symmetric Dirichlet distribution is often used in the framework of the MFM because of its conjugacy with the categorical distribution and its computational simplicity. The symmetric structure of \text{Dirichlet}(\gamma,\dots,\gamma) is also essential for marginalizing out M and deriving the exchangeable partition distribution. However, it is known that the estimation results are sensitive to the choice of the shape parameter \gamma. For example, Miller and Harrison (2018) recommended using \gamma=1 as a default choice, and this value works well in many cases. The use of a small \gamma was recommended in the context of sparse or generalized MFMs (Rousseau and Mengersen, 2011; Malsiner-Walli et al., 2016; Frühwirth-Schnatter et al., 2021). A small value of \gamma implies that many empty components can be created. Although most previous studies allow for the appearance of empty components, there are several problems. First, M tends to be overestimated, making the interpretation of the model difficult. Second, the existence of empty components decreases the predictive performance of the model, as we will see in a later section through density estimation. Finally, empty components lead to an increase in computation time. When we focus on estimating the number of components M, it is desirable to have few empty components. In other words, the discrepancy between M and k should be small. To achieve this, we use the normalized inverse Gaussian distribution for the mixing weights instead of the Dirichlet distribution.
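As a minimal illustration of the normalization construction (2.2), the following R sketch (with illustrative variable names) draws symmetric Dirichlet weights by normalizing independent Gamma(\gamma,1) variables; the same recipe with inverse Gaussian draws yields the normalized inverse Gaussian weights used later.

```r
## Normalization construction (2.2): symmetric Dirichlet weights obtained by
## normalizing independent Gamma(gamma, 1) draws.
set.seed(1)
M    <- 4                                 # number of components (illustrative)
gam  <- 0.5                               # shape parameter of the unnormalized weights
S    <- rgamma(M, shape = gam, rate = 1)  # unnormalized weights S_1, ..., S_M
pi_w <- S / sum(S)                        # pi ~ Dirichlet(gam, ..., gam)
round(pi_w, 3)
```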

2.2 Equivalent representations using discrete probability measures

We give an equivalent representation of the MFM. We can construct a discrete measure P()=m=1Mπmδξm()P(\cdot)=\sum_{m=1}^{M}\pi_{m}\delta_{\xi_{m}}(\cdot) in the parameter space Θ\Theta, almost surely, where MM, ξ\xi and π\pi are realizations from the distributions qMq_{M}, p0p_{0} and PπP_{\pi}, respectively. Let θ1,,θn\theta_{1},\dots,\theta_{n} be random variables according to PP. Then, the model (2.1) is equivalent to the following hierarchical representation using a random measure PP.

\displaystyle\begin{split}&Y_{i}|\theta_{i}\overset{\mathrm{ind}}{\sim}f(y_{i}|\theta_{i}),\quad i=1,\dots,n,\\ &\theta_{1},\dots,\theta_{n}|P\overset{\mathrm{i.i.d}}{\sim}P,\\ &P\sim\mathcal{P}(q_{M},h,p_{0}),\end{split} (2.3)

where 𝒫\mathcal{P} is a probability distribution of PP with parameters qMq_{M}, hh and p0p_{0}. The representation (2.3) is the MFM described as a nonparametric Bayesian framework with infinite mixtures. If we replace 𝒫\mathcal{P} with the Dirichlet process, (2.3) represents the well-known Dirichlet process mixture models (e.g., Escobar and West, 1995). If we replace 𝒫\mathcal{P} with the normalized inverse Gaussian process, (2.3) represents the normalized inverse Gaussian process mixture models (Lijoi et al., 2005). For MFM, Argiento and De Iorio (2022) proposed a normalized independent finite point process (Norm-IFPP), which is a class of flexible prior distributions for PP. We employ the representation (2.3) with the Norm-IFPP. The advantages are as follows. We can directly estimate MM and kk. This is a major difference from Miller and Harrison (2018), which estimates MM indirectly through the number of clusters kk. Furthermore, an efficient Gibbs sampler can be constructed by incorporating the data augmentation with a latent Gamma random variable. This data augmentation enables us to overcome the lack of conjugacy in the categorical distribution. Instead of using the probability density function of PπP_{\pi}, we can use the density function of hh. This is the key to building an efficient MCMC algorithm for the proposed model.

Our study is in the spirit of nonparametric Bayes in that it accounts for uncertainty in the prior distribution of the parameters by means of a random measure based on the Norm-IFPP. The details of the Norm-IFPP and the independent finite point process (IFPP) are given in Argiento and De Iorio (2022).

2.3 Data augmentation and conditional distribution of MnaM_{na}

We illustrate the data augmentation and the conditional posterior distribution of M_{na}. We introduce the data augmentation in models (2.1) and (2.2). This technique is employed in James et al. (2009) and Argiento and De Iorio (2022). Let \mathcal{M}_{a} and \mathcal{M}_{na} be the allocated and unallocated index sets, respectively. The conditional joint distribution of the component parameters \tau=(\tau_{1},\dots,\tau_{M})^{\top} and S=(S_{1},\dots,S_{M})^{\top} given M and the label vector c is

\displaystyle\begin{split}p(S,\tau|M,c)&\propto\left(\prod_{i=1}^{n}\frac{S_{c_{i}}}{T}\right)\left(\prod_{m=1}^{M}h(s_{m})p(\tau_{m})\right)\\ &=\left(\prod_{m\in\mathcal{M}_{a}}\left(\frac{S_{m}}{T}\right)^{n_{m}}h(s_{m})p(\tau_{m})\right)\left(\prod_{m\in\mathcal{M}_{na}}h(s_{m})p(\tau_{m})\right),\end{split}

where n_{m}=\#\{i:c_{i}=m\}. Since this expression involves the categorical distribution, P_{\pi} would need to be conjugate to it for Gibbs sampling. This restriction is relaxed by using a latent random variable U_{n}|T\sim\mathrm{Gamma}(n,T), where T=\sum_{m=1}^{M}S_{m}. In fact,

\displaystyle p(S,\tau,U_{n}|M,c)\propto u^{n-1}\left(\prod_{m\in\mathcal{M}_{a}}e^{-S_{m}u}S_{m}^{n_{m}}h(s_{m})p(\tau_{m})\right)\left(\prod_{m\in\mathcal{M}_{na}}e^{-S_{m}u}h(s_{m})p(\tau_{m})\right).

Thus, the conditional distribution of each SmS_{m} is

\displaystyle S_{m}|U_{n},M,c\overset{\mathrm{ind}}{\sim}e^{-S_{m}u}S_{m}^{n_{m}}h(s_{m}),\quad m\in\mathcal{M}_{a}, (2.4)
\displaystyle S_{m}|U_{n},M,c\overset{\mathrm{i.i.d}}{\sim}e^{-S_{m}u}h(s_{m}),\quad m\in\mathcal{M}_{na}. (2.5)

If it is easy to generate random variables from the distributions in (2.4) and (2.5), an efficient Gibbs sampling algorithm can be constructed. For example, when the prior distribution of the unnormalized weights is an inverse Gaussian distribution, (2.4) and (2.5) are generalized inverse Gaussian distributions. By introducing U_{n}, the update of \pi is replaced by updates of the S_{m}. Thus, the selection of h is essential for the MCMC updates.
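For instance, when h is the \text{Inv-Ga}(\alpha,1) density used in Section 3, both (2.4) and (2.5) become generalized inverse Gaussian (GIG) laws and can be sampled directly. The following R sketch uses the GIGrvg package, whose rgig parameterization (lambda, chi, psi) corresponds to a density proportional to s^{\lambda-1}\exp(-(\chi/s+\psi s)/2); the parameter values are illustrative.

```r
## Full conditionals (2.4)-(2.5) when h = Inv-Ga(alpha, 1): both are GIG.
library(GIGrvg)

alpha <- 1     # shape parameter of the inverse Gaussian prior h
u     <- 0.8   # current value of the latent variable U_n
n_m   <- 12    # size of an allocated cluster

## allocated component (2.4): density proportional to
## s^(n_m - 3/2) * exp(-(alpha^2/s + (2u + 1) s) / 2)
S_alloc <- rgig(1, lambda = n_m - 1/2, chi = alpha^2, psi = 2 * u + 1)

## empty component (2.5): the same with n_m = 0
S_empty <- rgig(1, lambda = -1/2, chi = alpha^2, psi = 2 * u + 1)
```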

Under this data augmentation, the conditional distribution of MnaM_{na} is established in Theorem 5.1 in Argiento and De Iorio (2022). This theorem states that if 𝒫(qM,h,p0)\mathcal{P}(q_{M},h,p_{0}) follows a Norm-IFPP, then the posterior distribution of the random measure PP, given UnU_{n}, is a superposition of a finite point process with fixed points and an IFPP. The IFPP characterizes the process of unallocated jumps, where the discrete probability distribution that serves as its parameter corresponds to the distribution of MnaM_{na}. The conditional distribution of MnaM_{na} is given by

\displaystyle\mathbb{P}(M_{na}=m|\theta,U_{n}=u)\propto\frac{(m+k)!}{m!}\psi(u)^{m}q_{M}(m+k),\quad m=0,1,2,\dots, (2.6)

where \theta=(\theta_{1},\dots,\theta_{n})^{\top}, k is the number of clusters (unique values of \theta) and \psi is the Laplace transform of h. We sample M_{na} from (2.6) in the MCMC algorithm. Then, we straightforwardly obtain M by adding k to M_{na}.

Remark 2.1.

Although we use the data augmentation U_{n}|T\sim\text{Gamma}(n,T), the distributions (2.4) and (2.5) are generally complex. In addition, the result of Argiento and De Iorio (2022) applies only to h whose Laplace transform is available in closed form. Thus, the inverse Gaussian distribution is one of the few examples that satisfy both computational and theoretical constraints.

3 Methodology

In this section, we propose a mixture of finite mixtures with normalized inverse Gaussian weights. Moreover, we develop an efficient posterior sampling algorithm based on Argiento and De Iorio (2022).

3.1 Mixture of finite mixtures with normalized inverse Gaussian weights

We propose a mixture of finite mixtures with normalized inverse Gaussian weights (denoted by MFM-Inv-Ga), where the notation explicitly specifies h because it is essential for computation. Our proposed model only requires the weights in (2.2) to be specified as

\displaystyle S_{m}|M\overset{\mathrm{ind}}{\sim}\text{Inv-Ga}(\alpha,1),\quad\alpha>0,
\displaystyle\pi=\left(\frac{S_{1}}{T},\dots,\frac{S_{M}}{T}\right)\sim\text{Norm-Inv-Ga}(\alpha,\dots,\alpha),

where Inv-Ga(α,β)\text{Inv-Ga}(\alpha,\beta) is the inverse Gaussian distribution with shape parameter α>0\alpha>0 and scale parameter β>0\beta>0, and Norm-Inv-Ga(α1,,αd)\text{Norm-Inv-Ga}(\alpha_{1},\dots,\alpha_{d}) is the normalized inverse Gaussian distribution with parameters αi>0\alpha_{i}>0 for i=1,,di=1,\dots,d. The probability density function of the inverse Gaussian distribution is given by

\displaystyle h(s)=\frac{\alpha}{\sqrt{2\pi}}s^{-3/2}\exp\left(-\frac{1}{2}\left(\frac{\alpha^{2}}{s}+\beta^{2}s\right)+\beta\alpha\right), (3.1)

where α>0\alpha>0 is the shape parameter and β>0\beta>0 is the scale parameter. The mean and variance of SS are given by

\displaystyle\mathbb{E}[S]=\frac{\alpha}{\beta},\quad\mathrm{Var}\left[S\right]=\frac{\alpha}{\beta^{3}}. (3.2)
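For simulation, note that the density (3.1) coincides with the usual mean/shape parameterization of the inverse Gaussian distribution with mean \alpha/\beta and shape \alpha^{2}, which is consistent with the moments in (3.2). The following R sketch uses statmod::rinvgauss under this mapping; the mapping itself is our own reading of (3.1) and is stated here as an assumption.

```r
## Sampling from Inv-Ga(alpha, beta) in (3.1) via the mean/shape
## parameterization: mean = alpha / beta, shape = alpha^2 (assumed mapping).
library(statmod)

alpha <- 2; beta <- 1
S <- rinvgauss(1e5, mean = alpha / beta, shape = alpha^2)

c(mc_mean = mean(S), mc_var = var(S))          # Monte Carlo estimates
c(mean = alpha / beta, var = alpha / beta^3)   # theoretical values from (3.2)
```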

From Lijoi et al. (2005), the probability density function of the normalized inverse Gaussian distribution with parameters αi>0,i=1,,d\alpha_{i}>0,\ i=1,\dots,d is given by

\displaystyle\begin{split}f(\pi)=&\frac{e^{\sum_{i=1}^{d}\alpha_{i}}\prod_{i=1}^{d}\alpha_{i}}{2^{d/2-1}\pi^{d/2}}\times K_{-d/2}\left(\sqrt{\mathcal{A}_{d}(\pi_{1},\dots,\pi_{d-1})}\right)\\ &\times\left(\pi_{1}^{3/2}\cdots\pi_{d-1}^{3/2}\left(1-\sum_{i=1}^{d-1}\pi_{i}\right)^{3/2}\times\mathcal{A}_{d}(\pi_{1},\dots,\pi_{d-1})^{d/4}\right)^{-1},\end{split} (3.3)

where \mathcal{A}_{d}(\pi_{1},\dots,\pi_{d-1})=\sum_{i=1}^{d-1}\alpha_{i}^{2}/\pi_{i}+\alpha_{d}^{2}\left(1-\sum_{i=1}^{d-1}\pi_{i}\right)^{-1} and K_{m} is the modified Bessel function of the third kind of order m (see also Ghosal and van der Vaart, 2017). Because the conjugacy with the categorical distribution breaks down, constructing an efficient Gibbs sampler is challenging. Moreover, because of the complexity of (3.3), the calculation in Miller and Harrison (2018) is intractable: constructing an MCMC algorithm based on a restaurant process by marginalizing out M is extremely challenging. To overcome this difficulty, we use the data augmentation in Subsection 2.3 and the method proposed in Argiento and De Iorio (2022). As a result, it is sufficient to use (3.1) and (3.8) rather than (3.3) to construct our MCMC algorithm.

3.2 Posterior computation

We present a fast and efficient posterior computation algorithm for the proposed method. To this end, we adopt the blocked Gibbs sampling scheme of Argiento and De Iorio (2022). Let \eta be a hyperparameter of the prior distribution q_{M}, and let p(\cdot) denote the joint prior density of the remaining parameters (other than those of h and q_{M}). Here n_{m}=\#\{i;c_{i}=m\} is the size of the mth cluster, and \psi(u) is the Laplace transform of h, defined by \psi(u):=\mathbb{E}[e^{-uS_{m}}] for u\geq 0.

Algorithm 1 Block Gibbs sampler
  Step 0. Set initial values.
  Step 1. Sample U_{n} from \mathrm{Gamma}(n,T), where T=\sum_{m=1}^{M}S_{m}.
  Step 2. Sample c_{i}, for i=1,\dots,n, from the discrete distribution with probabilities
\mathbb{P}(c_{i}=j|\mathrm{rest})\propto S_{j}f(y_{i}|\tau_{j}),\quad j=1,\dots,M.
  Step 3. Sample the hyperparameter \eta of q_{M} from
p(\eta|\mathrm{rest})\propto\Psi(u,k)p(\eta),\quad\Psi(u,k):=\sum_{m=0}^{\infty}\frac{(m+k)!}{m!}\psi(u)^{m}q_{M}(m+k).
  Step 4. Sample M_{na} from the following discrete distribution
\mathbb{P}(M_{na}=m|\mathrm{rest})\propto\frac{(m+k)!}{m!}\psi(u)^{m}q_{M}(m+k),\quad m=0,1,\dots.
  Step 5. Sample S_{m}, for m=1,\dots,k, from \mathrm{GIG}(2u+1,\alpha^{2},n_{m}-1/2).
  Step 6. Sample \tau_{m}, for m=1,\dots,k, from
p(\tau_{m}|\mathrm{rest})\propto\left\{\prod_{i;\theta_{i}=\tau_{m}}f(y_{i}|\tau_{m})\right\}p(\tau).
  Step 7. Sample S_{m_{na}}, for m_{na}=k+1,\dots,M_{na}+k, from \mathrm{GIG}(2u+1,\alpha^{2},-1/2).
  Step 8. Sample \tau_{m_{na}}, for m_{na}=k+1,\dots,M_{na}+k, from the prior p(\tau).

We summarize the algorithm in Algorithm 1. The key to this algorithm is the data augmentation through the latent variable U_{n}|T\sim\mathrm{Gamma}(n,T), where n is the sample size and T=\sum_{m=1}^{M}S_{m}. It is important that the number of empty components M_{na} is sampled from (2.6) in Step 4. This step allows direct sampling of M by adding k, while the label variables are updated in Step 2 as in a finite mixture model with M given. This update scheme is the same as the telescoping sampler proposed by Frühwirth-Schnatter et al. (2021), and it is more efficient than the classical restaurant process. However, in implementation, it is important to choose q_{M} so that the series \Psi(u,k) in Steps 3 and 4 can be written analytically and random variables can easily be generated from the full conditional distributions of \eta and M_{na}. In Steps 5 and 7, if h is \text{Inv-Ga}(\alpha,1), each generalized inverse Gaussian distribution (denoted by GIG) is immediately derived from (2.4) and (2.5), and it is easy to generate random numbers from the GIG distribution. In Steps 5 and 6, the unnormalized weights and kernel parameters of the allocated components are updated. When M_{na} sampled in Step 4 is greater than or equal to 1, Steps 7 and 8 are executed. Thus, when many empty components are created, the computation takes longer.

3.3 Prior distributions for the number of mixture components

Assume that M1M-1 follows a discrete probability distribution with the support {0,1,2,}\{0,1,2,\dots\}. In this paper, we consider Poisson(Λ)\text{Poisson}(\Lambda) (Λ>0\Lambda>0) for M1M-1, because the constraints in Steps 3 and 4 are satisfied. In fact, from Argiento and De Iorio (2022), we have

\displaystyle\Psi(u,k)=\Lambda^{k-1}(\Lambda\psi(u)+k)\exp(\Lambda(\psi(u)-1)). (3.4)

Assuming ΛGamma(aΛ,bΛ)\Lambda\sim\text{Gamma}(a_{\Lambda},b_{\Lambda}) for aΛ,bΛ>0a_{\Lambda},b_{\Lambda}>0, the full conditional distributions of Λ\Lambda and MnaM_{na} are given by

\displaystyle\begin{split}p(\Lambda|\mathrm{rest})&\propto\psi(u)(k+a_{\Lambda}-1)\,\mathrm{Ga}(k+a_{\Lambda},1-\psi(u)+b_{\Lambda})+k(b_{\Lambda}+1-\psi(u))\,\mathrm{Ga}(k+a_{\Lambda}-1,1-\psi(u)+b_{\Lambda}),\\ \mathbb{P}(M_{na}=m|\mathrm{rest})&\propto\Lambda\psi(u)\,\textrm{Shifted-Poisson}(\Lambda\psi(u),1)+k\,\mathrm{Poisson}(\Lambda\psi(u)),\end{split}

respectively, where \text{Shifted-Poisson}(\Lambda,t) denotes the distribution of X+t with X\sim\text{Poisson}(\Lambda). If we consider the negative binomial distribution as a prior distribution for M-1, the full conditional distributions can also be obtained, but learning the hyperparameters becomes slightly more troublesome.
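To make Steps 3 and 4 of Algorithm 1 concrete under this Poisson prior, the following R sketch samples \Lambda and M_{na} from the two-component mixtures displayed above; psi_u stands for \psi(u), the function names are illustrative, and the sketch assumes the mixture representations as written.

```r
## Steps 3-4 under M - 1 | Lambda ~ Poisson(Lambda), Lambda ~ Gamma(a_L, b_L):
## both full conditionals are two-component mixtures.
sample_Lambda <- function(k, psi_u, a_L, b_L) {
  rate <- 1 - psi_u + b_L
  w    <- c(psi_u * (k + a_L - 1), k * rate)   # unnormalized mixture weights
  if (sample(1:2, 1, prob = w) == 1) {
    rgamma(1, shape = k + a_L, rate = rate)
  } else {
    rgamma(1, shape = k + a_L - 1, rate = rate)
  }
}

sample_Mna <- function(k, Lambda, psi_u) {
  lp <- Lambda * psi_u
  if (runif(1) < lp / (lp + k)) {
    1 + rpois(1, lp)   # shifted Poisson component
  } else {
    rpois(1, lp)       # Poisson component
  }
}

## example: k = 3 clusters, psi(u) from (3.8) with alpha = 1 and u = 0.8
psi_u  <- exp(1 * (1 - sqrt(1 + 2 * 0.8)))
Lambda <- sample_Lambda(k = 3, psi_u = psi_u, a_L = 1, b_L = 1)
M      <- 3 + sample_Mna(k = 3, Lambda = Lambda, psi_u = psi_u)
```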

3.4 Specification of kernels

The choice of kernel is important, and the appropriate kernel must be selected for the purpose. In this paper, although we do not discuss the details of the selection of kernels, we present some famous kernels for the sake of completeness.

3.4.1 Cluster analysis and density estimation

One of the most famous and useful kernels is the (multivariate) normal kernel N(\cdot|\mu,\sigma^{2}). In the univariate case, we often use the normal inverse gamma model N(\mu|m_{0},\sigma^{2}/\eta)\times\text{IG}(\sigma^{2}|c_{0},C_{0}) as a prior distribution for \xi=(\mu,\sigma^{2}), where m_{0}\in\mathbb{R} and \eta,c_{0},C_{0}>0. The parameter \eta is called a smoothing parameter and plays an important role in density estimation. It is possible to include a hierarchical prior for \eta. In the multivariate case, we often employ the normal inverse Wishart model N_{r}(\mu|m_{0},\Sigma/B_{0})\times\mathrm{W}^{-1}(\Sigma|c_{0},C_{0}) as a prior distribution for (\mu,\Sigma), where m_{0}\in\mathbb{R}^{r}, B_{0}>0, c_{0}>r-1 and C_{0} is a positive definite matrix. We will use normal kernels in the numerical experiments of Sections 4.1 and 4.2 for clustering and density estimation. In the context of finite mixture models, various kernels have been proposed. If we have prior information that the data are skewed, the skew-normal or skew-t kernel is also useful (see e.g., Frühwirth-Schnatter and Pyne, 2010). The skew-normal and skew-t kernels are easy to handle because they admit scale-mixture-of-normals representations, except for the degrees of freedom of the skew-t distribution.
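As a concrete instance of Step 6 of Algorithm 1 for the univariate normal kernel, the following R sketch draws (\mu,\sigma^{2}) for a single cluster from the posterior implied by the N(\mu|m_{0},\sigma^{2}/\eta)\times\text{IG}(\sigma^{2}|c_{0},C_{0}) prior. The update formulas are the standard normal inverse gamma conjugate ones, written out here as an illustration rather than taken from the paper.

```r
## Conjugate update of (mu, sigma^2) for one cluster under the
## normal inverse gamma prior N(mu | m0, sigma^2/eta) x IG(sigma^2 | c0, C0).
update_normal_invgamma <- function(y, m0, eta, c0, C0) {
  n      <- length(y)
  ybar   <- mean(y)
  eta_n  <- eta + n
  m_n    <- (eta * m0 + n * ybar) / eta_n
  c_n    <- c0 + n / 2
  C_n    <- C0 + 0.5 * sum((y - ybar)^2) + 0.5 * eta * n * (ybar - m0)^2 / eta_n
  sigma2 <- 1 / rgamma(1, shape = c_n, rate = C_n)  # sigma^2 | y ~ IG(c_n, C_n)
  mu     <- rnorm(1, mean = m_n, sd = sqrt(sigma2 / eta_n))
  c(mu = mu, sigma2 = sigma2)
}

## example call on the observations currently allocated to one cluster
update_normal_invgamma(y = rnorm(20, mean = 3, sd = 1), m0 = 0, eta = 1, c0 = 2, C0 = 1)
```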

3.4.2 Network analysis

As an application of the proposed method, we perform community detection on network data. Community detection is the task of identifying densely connected subgroups in network data; it corresponds to clustering together with estimating the number of components, which is called the number of communities in network analysis. Note that the number of components of the finite mixture model is equivalent to the number of communities in the network, and both are denoted by M. Estimating the number of communities is an important problem and various methods have been proposed (Shi and Malik, 2000; Newman, 2004; White and Smyth, 2005). From a model-based perspective, it is essentially the same as estimating the number of components in a finite mixture model, and we can apply the MFM. The stochastic block model is a famous statistical model for network data (Henze, 1986; Nowicki and Snijders, 2001; Geng et al., 2019), which assumes a stochastic block structure behind the data and specifies the community structure by estimating the probability that an edge is drawn between each pair of groups. We estimate the number of communities with the MFM in the framework of this stochastic block model. Geng et al. (2019) proposed a stochastic block model based on the MFM with Dirichlet weights and constructed an algorithm similar to that of Miller and Harrison (2018).

The MFM can easily be applied to community detection by modifying the kernel. The data y are replaced by the adjacency matrix A=(A_{ij})\in\{0,1\}^{n\times n}, where n is the number of nodes. When A_{ij}=1, an edge is drawn between the ith node and the jth node, and when A_{ij}=0, there is no edge. For simplicity, we assume that the network is undirected and has no self-loops, that is, A_{ij}=A_{ji} and A_{ii}=0, where 1\leq i<j\leq n. The stochastic block model is formulated as follows,

\displaystyle\begin{split}&A_{ij}|Q,M\sim\text{Bernoulli}(\theta_{ij}),\quad\theta_{ij}=Q_{c_{i}c_{j}},\quad 1\leq i<j\leq n,\\ &Q_{rs}\sim\text{Beta}(a_{Q},b_{Q}),\quad 1\leq r\leq s\leq M,\end{split} (3.5)

where aQ,bQ>0a_{Q},\ b_{Q}>0 and MM is the number of communities. Q=(Qrs)[0,1]M×MQ=(Q_{rs})\in[0,1]^{M\times M} is a symmetric matrix and defines the stochastic block structure of a network. Each element QrsQ_{rs} represents the probability that an edge is drawn between any node belonging to the community label rr and any node belonging to the community label ss. To perform community detection based on the proposed model, we just set the Bernoulli likelihood as the kernel.
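Since the Beta prior in (3.5) is conjugate to the Bernoulli likelihood, each block probability Q_{rs} can be updated in closed form given the allocations: Q_{rs}\mid A,c\sim\text{Beta}(a_{Q}+e_{rs},\,b_{Q}+n_{rs}-e_{rs}), where e_{rs} is the number of edges and n_{rs} the number of dyads between communities r and s (within r when r=s). The following R sketch, with illustrative names, assumes an undirected network without self-loops as described above.

```r
## Conjugate update of the block probability matrix Q in the SBM (3.5).
update_Q <- function(A, c_lab, M, a_Q = 1, b_Q = 1) {
  Q <- matrix(0, M, M)
  for (r in 1:M) for (s in r:M) {
    idx_r <- which(c_lab == r)
    idx_s <- which(c_lab == s)
    if (r == s) {
      n_rs <- length(idx_r) * (length(idx_r) - 1) / 2  # dyads within community r
      e_rs <- sum(A[idx_r, idx_r]) / 2                 # each edge counted twice
    } else {
      n_rs <- length(idx_r) * length(idx_s)            # dyads between r and s
      e_rs <- sum(A[idx_r, idx_s])
    }
    Q[r, s] <- Q[s, r] <- rbeta(1, a_Q + e_rs, b_Q + n_rs - e_rs)
  }
  Q
}
```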

3.5 Evaluation of the number of empty components

From the point of view of the interpretability of the model, its generalization performance, and computational efficiency, it is desirable that M_{na} be small. For the full conditional distribution of M_{na}, the following inequalities hold, where M-1|\Lambda\sim\mathrm{Poisson}(\Lambda) and \Lambda\sim\mathrm{Gamma}(a_{\Lambda},b_{\Lambda}) for a_{\Lambda},b_{\Lambda}>0.

Proposition 3.1.

For the full conditional distribution of MnaM_{na}, the inequalities

\displaystyle\mathbb{P}(M_{na}\geq 1|U_{n}=u,\ k,\ M=m,\ \Lambda)\leq\Lambda\psi(u)\left(1+\frac{1}{\Lambda\psi(u)+k}\right), (3.6)
\displaystyle\mathbb{P}(M_{na}\geq 1|U_{n}=u,\ k,\ M=m)\leq\psi(u)\frac{a_{\Lambda}}{b_{\Lambda}}\left(1+\frac{1}{k}\right) (3.7)

hold, where \psi is the Laplace transform of h.

Proof.

When m0m\neq 0,

\mathbb{P}(M_{na}=m|U_{n}=u,\ k,\ M=m,\ \Lambda)=\frac{k}{\Lambda\psi(u)+k}\mathcal{P}_{0}(m;\Lambda\psi(u))+\frac{\Lambda\psi(u)}{\Lambda\psi(u)+k}\mathcal{P}_{1}(m;\Lambda\psi(u)),
where \mathcal{P}_{0}(\cdot;\lambda) and \mathcal{P}_{1}(\cdot;\lambda) denote the probability mass functions of \text{Poisson}(\lambda) and \text{Shifted-Poisson}(\lambda,1), respectively.

Thus, the conditional expectation of MnaM_{na} is

\displaystyle\begin{split}\mathbb{E}[M_{na}|U_{n}=u,\ k,\ M=m,\ \Lambda]&=\frac{k}{\Lambda\psi(u)+k}\times\Lambda\psi(u)+\frac{\Lambda\psi(u)}{\Lambda\psi(u)+k}\times(\Lambda\psi(u)+1)\\ &=\Lambda\psi(u)\left(1+\frac{1}{\Lambda\psi(u)+k}\right).\end{split}

Furthermore,

\displaystyle\begin{split}\mathbb{E}[M_{na}|U_{n}=u,\ k,\ M=m]&=\mathbb{E}_{\Lambda}[\mathbb{E}[M_{na}|U_{n}=u,\ k,\ M=m,\ \Lambda]]\\ &=\int\left(\Lambda\psi(u)\left(1+\frac{1}{\Lambda\psi(u)+k}\right)\right)\frac{b_{\Lambda}^{a_{\Lambda}}}{\Gamma(a_{\Lambda})}\Lambda^{a_{\Lambda}-1}\exp(-b_{\Lambda}\Lambda)\,\mathrm{d}\Lambda\\ &\leq\psi(u)\times\frac{a_{\Lambda}}{b_{\Lambda}}+\frac{\psi(u)}{k}\times\frac{a_{\Lambda}}{b_{\Lambda}}\\ &=\psi(u)\times\frac{a_{\Lambda}}{b_{\Lambda}}\left(1+\frac{1}{k}\right).\end{split}

Finally, the result follows from Markov’s inequality. ∎

Proposition 3.1 shows that \psi(u) plays an important role in the generation of empty components. The critical difference between the inverse Gaussian and gamma distributions when estimating M in the MFM lies in their Laplace transforms. Let \psi_{\text{Inv-Ga}}(u) and \psi_{\text{Ga}}(u) be the Laplace transforms of \text{Inv-Ga}(\alpha,1) and \text{Gamma}(\gamma,\eta), respectively. Then we have

\displaystyle\psi_{\text{Inv-Ga}}(u)=\exp\left(\alpha\left(1-\sqrt{1+2u}\right)\right),\quad u\geq 0, (3.8)
\displaystyle\psi_{\text{Ga}}(u)=\left(\frac{\eta}{\eta+u}\right)^{\gamma},\quad u\geq 0. (3.9)

The Laplace transforms \psi_{\text{Inv-Ga}} and \psi_{\text{Ga}} are decreasing functions of u. The former decays exponentially fast, while the latter decays only polynomially. Figure 1 shows the Laplace transforms of the inverse Gaussian and gamma distributions when the shape parameters are \alpha,\gamma=1.0,0.2,10^{-1},10^{-2},10^{-3} and the scale parameters are 1. It can be seen that \psi_{\text{Inv-Ga}} decreases much faster than \psi_{\text{Ga}}. The speed of this decrease has a significant impact on the appearance of empty components when estimating the number of components.

Figure 1: Laplace transforms of \text{Inv-Ga}(\alpha,1) (left) and \text{Gamma}(\gamma,1) (right) for shape parameters \alpha=\gamma=1,0.2,10^{-1},10^{-2},10^{-3}.
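The curves in Figure 1 can be reproduced directly from (3.8) and (3.9); the following R snippet compares the two transforms for one pair of shape parameters (values chosen for illustration).

```r
## Laplace transforms (3.8) and (3.9), as compared in Figure 1.
psi_invga <- function(u, alpha)        exp(alpha * (1 - sqrt(1 + 2 * u)))
psi_ga    <- function(u, gam, eta = 1) (eta / (eta + u))^gam

curve(psi_invga(x, alpha = 0.2), from = 0, to = 50,
      xlab = "u", ylab = "Laplace transform")
curve(psi_ga(x, gam = 0.2), add = TRUE, lty = 2)
legend("topright", legend = c("Inv-Ga(0.2, 1)", "Gamma(0.2, 1)"), lty = 1:2)
```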

Inequalities (3.6) and (3.7) are not for marginal posterior distributions, but for full conditional distributions of MnaM_{na}. This is due to the fact that the marginal posterior distributions of UnU_{n} are analytically intractable.

4 Empirical demonstrations

We evaluate the performance of the MFM-Inv-Ga and MFM-Ga methods through some numerical experiments. Recall that \alpha is the shape parameter of the prior distribution of the unnormalized weights in the proposed method (MFM-Inv-Ga), while \gamma is that of MFM-Ga.

4.1 Inference for the number of mixture components and clustering

In this subsection, we illustrate the performance of clustering and inference for the number of components using artificial and real data.

4.1.1 Artificial data

In this simulation, we assume that M_{\mathrm{true}}=3. We generate data from the following mixture of bivariate normal distributions:

\displaystyle f(y|\mu_{1},\mu_{2},\mu_{3})=0.8\cdot N_{2}(y|\mu_{1},I_{2})+0.1\cdot N_{2}(y|\mu_{2},I_{2})+0.1\cdot N_{2}(y|\mu_{3},I_{2}),

where \mu_{1}=(0,0)^{\top}, \mu_{2}=(0,10)^{\top}, \mu_{3}=(7.5,10)^{\top} and I_{2} is the 2\times 2 identity matrix. We set n=300 and generate 50 datasets. Each MCMC run consists of 2000 iterations, and the first 1000 samples are discarded as a burn-in period. We assume that M-1|\Lambda\sim\text{Poisson}(\Lambda) and \Lambda\sim\text{Gamma}(1,1). We employ the multivariate normal kernel f(y|\mu,\Sigma)=N_{2}(y|\mu,\Sigma) and the normal inverse Wishart model N_{2}(\mu|m_{0},\Sigma/B_{0})\times\mathrm{W}^{-1}(\Sigma|c_{0},C_{0}) as the prior for the kernel parameters, where m_{0} is the sample mean vector, B_{0}=1, c_{0}=2+1.5 and C_{0} is the sample covariance matrix. The shape parameters are \alpha,\gamma=1.0,0.2,10^{-1},10^{-2},10^{-3}. This choice of \alpha and \gamma makes the mean and variance of \text{Inv-Ga}(\alpha,1) and \text{Gamma}(\gamma,1) equal, so the first and second moments of the inverse Gaussian and gamma distributions are matched for each shape parameter (see also Lijoi et al., 2005). To measure performance, we consider the following three criteria.

  • The posterior mean of the number of components MM.

  • The posterior mean of the Rand index. The Rand index is a measure of clustering accuracy and takes values in [0,1]; when it is close to 1, the estimated assignment agrees well with the true one (see the sketch after this list).

  • The posterior probability that the number of empty components MnaM_{na} is equal to 0.

The respective averages over 5050 repetitions are denoted by RI^\widehat{\mathrm{RI}} and ^(Mna=0)\widehat{\mathbb{P}}(M_{na}=0).
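To make the setup concrete, the following R sketch generates one artificial dataset from the mixture above and evaluates the Rand index between the true labels and a candidate partition. The rand_index function is our own implementation of the standard definition (the proportion of agreeing pairs), and the k-means fit is only an illustrative stand-in for an estimated partition.

```r
## One replicate of the artificial data in Section 4.1.1 and its Rand index.
library(MASS)
set.seed(1)

n   <- 300
mu  <- list(c(0, 0), c(0, 10), c(7.5, 10))
w   <- c(0.8, 0.1, 0.1)
lab <- sample(1:3, n, replace = TRUE, prob = w)   # true allocations
y   <- t(sapply(lab, function(j) mvrnorm(1, mu = mu[[j]], Sigma = diag(2))))

## Rand index: proportion of pairs of observations on which two partitions agree.
rand_index <- function(est, truth) {
  pairs      <- combn(length(est), 2)
  same_est   <- est[pairs[1, ]]   == est[pairs[2, ]]
  same_truth <- truth[pairs[1, ]] == truth[pairs[2, ]]
  mean(same_est == same_truth)
}

rand_index(kmeans(y, centers = 3)$cluster, lab)   # e.g. against a k-means fit
```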

We report the posterior mean of MM in Figure 2. It is observed that the results of the MFM-Inv-Ga method remain almost the same even if the shape parameter is varied. However, the MFM-Ga method tends to overestimate the number of mixture components as the shape parameter decreases.

Figure 2: Box plots of the posterior mean of the number of components MM for each shape parameter. From left to right, shape parameters are set as γ,α=1,0.2,101,102,103\gamma,\alpha=1,0.2,10^{-1},10^{-2},10^{-3}.

Table 1 shows the results of the clustering performance and the posterior probability that M_{na} is equal to 0. For clustering, the MFM-Inv-Ga and MFM-Ga methods are comparable. Both methods have reasonable clustering accuracy, which is only slightly better for MFM-Ga than for MFM-Inv-Ga. As the shape parameters \alpha and \gamma decrease, \widehat{\mathrm{RI}} increases slightly and its standard deviation decreases. This is natural because, for data in which one component has a much larger cluster size than the others, a suitable prior on the simplex is one that places large mass near the edges or vertices. It is important to note that Figure 2 and Table 1 show that the MFM-Inv-Ga method produces few empty components in all scenarios and provides reasonable estimates of M, whereas MFM-Ga does not.

Table 1: Results of the rand index RI^i\widehat{\mathrm{RI}}_{i} and posterior probability that the number of empty components is equal to zero ^(Mna=0)\widehat{\mathbb{P}}(M_{na}=0) (their standard deviations are shown in parentheses) averaged over 50 Monte Carlo replications for each shape parameters.
RI^i\widehat{\mathrm{RI}}_{i}
α\alpha and γ\gamma 11 0.20.2 10110^{-1} 10210^{-2} 10310^{-3}
MFM-Inv-Ga 0.9930.993 0.9940.994 0.9950.995 0.9950.995 0.9950.995
(0.013)(0.013) (0.006)(0.006) (0.006)(0.006) (0.004)(0.004) (0.005)(0.005)
MFM-Ga 0.9950.995 0.9950.995 0.9950.995 0.9960.996 0.9960.996
(0.006)(0.006) (0.004)(0.004) (0.003)(0.003) (0.004)(0.004) (0.003)(0.003)
^(Mna=0)\widehat{\mathbb{P}}(M_{na}=0)
α\alpha and γ\gamma 11 0.20.2 10110^{-1} 10210^{-2} 10310^{-3}
MFM-Inv-Ga 1.0001.000 0.9950.995 0.9820.982 0.9470.947 0.9520.952
(0.001)(0.001) (0.008)(0.008) (0.032)(0.032) (0.135)(0.135) (0.087)(0.087)
MFM-Ga 0.9790.979 0.5550.555 0.3280.328 0.0880.088 0.0660.066
(0.008)(0.008) (0.080)(0.080) (0.066)(0.066) (0.012)(0.012) (0.013)(0.013)

This can be seen in Table 2. It can also be seen that MFM-Inv-Ga assigns a much higher posterior probability to M_{\mathrm{true}} than MFM-Ga does and that, together with Table 1, the behavior of the posterior distributions of k and M is similar.

Table 2: Posterior probabilities and their standard deviations (shown in parentheses) of the number of components MM averaged over 50 Monte Carlo replications. The highest value is bolded.
MFM-Inv-Ga
M=1M=1 M=2M=2 M=3M=3 M=4M=4 M=5M=5 M6M\geq 6
α=1\alpha=1 0.0000.000 0.0000.000 0.973\bf{0.973} 0.0270.027 0.0000.000 0.0000.000
(0.000)(0.000) (0.000)(0.000) (0.093)(0.093) (0.093)(0.093) (0.000)(0.000) (0.000)(0.000)
α=0.2\alpha=0.2 0.0000.000 0.0800.080 0.886\bf{0.886} 0.0330.033 0.0010.001 0.0000.000
(0.000)(0.000) (0.274)(0.274) (0.268)(0.268) (0.048)(0.048) (0.003)(0.003) (0.000)(0.000)
α=101\alpha=10^{-1} 0.0000.000 0.0000.000 0.942\bf{0.942} 0.0510.051 0.0060.006 0.0010.001
(0.000)(0.000) (0.000)(0.000) (0.084)(0.084) (0.069)(0.069) (0.013)(0.013) (0.005)(0.005)
α=102\alpha=10^{-2} 0.0000.000 0.0200.020 0.890\bf{0.890} 0.0630.063 0.0150.015 0.0120.012
(0.000)(0.000) (0.141)(0.141) (0.199)(0.199) (0.078)(0.078) (0.039)(0.039) (0.048)(0.048)
α=103\alpha=10^{-3} 0.0000.000 0.0500.050 0.860\bf{0.860} 0.0680.068 0.0150.015 0.0070.007
(0.000)(0.000) (0.209)(0.209) (0.225)(0.225) (0.084)(0.084) (0.029)(0.029) (0.020)(0.020)
MFM-Ga
γ=1\gamma=1 0.0000.000 0.0000.000 0.948\bf{0.948} 0.0490.049 0.0020.002 0.0000.000
(0.000)(0.000) (0.000)(0.000) (0.055)(0.055) (0.049)(0.049) (0.006)(0.006) (0.001)(0.001)
γ=0.2\gamma=0.2 0.0000.000 0.0100.010 0.508\bf{0.508} 0.3010.301 0.1210.121 0.0610.061
(0.000)(0.000) (0.070)(0.070) (0.093)(0.093) (0.037)(0.037) (0.037)(0.037) (0.035)(0.035)
γ=101\gamma=10^{-1} 0.0000.000 0.0000.000 0.307\bf{0.307} 0.3060.306 0.1960.196 0.1910.191
(0.000)(0.000) (0.007)(0.007) (0.093)(0.093) (0.037)(0.037) (0.037)(0.037) (0.035)(0.035)
γ=102\gamma=10^{-2} 0.0000.000 0.0030.003 0.0880.088 0.1550.155 0.176\bf{0.176} 0.5780.578
(0.000)(0.000) (0.002)(0.002) (0.021)(0.021) (0.016)(0.016) (0.013)(0.013) (0.071)(0.071)
γ=103\gamma=10^{-3} 0.0000.000 0.0030.003 0.0660.066 0.1320.132 0.163\bf{0.163} 0.6360.636
(0.000)(0.000) (0.020)(0.020) (0.019)(0.019) (0.013)(0.013) (0.011)(0.011) (0.072)(0.072)

Table 3 shows the CPU times of MFM-Inv-Ga and MFM-Ga for 50 repetitions of MCMC with 2000 iterations, where \alpha,\gamma=1.0 and 10^{-3}. When \alpha=\gamma=1.0, MFM-Inv-Ga is faster than MFM-Ga on average, but MFM-Inv-Ga shows more variability than MFM-Ga. On the other hand, MFM-Ga is very time-consuming for \gamma=10^{-3}, because many empty components are created. As a result, MFM-Inv-Ga has the same clustering accuracy as MFM-Ga and outperforms MFM-Ga in terms of the estimation of M and CPU time.

Table 3: Comparison of CPU times (in seconds) and their standard deviations (shown in parentheses) for artificial data averaged over 5050 replications, α,γ=1.0,103\alpha,\ \gamma=1.0,10^{-3}
MFM-Inv-Ga MFM-Ga
α=γ=1\alpha=\gamma=1 88.54588.545 90.89890.898
(5.686)(5.686) (2.774)(2.774)
α=γ=103\alpha=\gamma=10^{-3} 87.06187.061 192.234192.234
(12.286)(12.286) (8.932)(8.932)

4.1.2 Thyroid Data

We apply the proposed method to the well-known thyroid data. The data are available from the R package mclust and are a standard benchmark for clustering. The sample size is 215 and the dimension of the data is 6. The thyroid condition of each patient is included and classified into three categories: normal, hypo and hyper, with 150, 30 and 35 patients, respectively. Using these labels as the true labels, the main interest is the accuracy of the clustering and whether the number of components is estimated to be 3.

We use the same model and shape parameters as in Section 4.1.1. We independently run 5 MCMC chains with different initial values. The number of iterations for each chain is 20000, and we finally obtain 10000 posterior samples. We compare MFM-Inv-Ga and MFM-Ga through the posterior mean of M, the posterior probability \mathbb{P}(M_{na}=0|y_{1},\dots,y_{n}) and the posterior mean of the Rand index (RI). Table 4 shows that the results of the real data analysis are similar to those of Section 4.1.1. MFM-Inv-Ga provides a reasonable estimate of M for all shape parameters by not producing empty components. However, MFM-Ga overestimates M. Furthermore, MFM-Inv-Ga is slightly more accurate than MFM-Ga.

Table 4: Posterior means of MM (M^\hat{M}), posterior probabilities of Mna=0M_{na}=0 ((Mna=0|y1,,yn)\mathbb{P}(M_{na}=0|y_{1},\dots,y_{n})) and posterior means of rand index (RI) for thyroid data.
MFM-Inv-Ga
M^\hat{M} (Mna=0|y1,,yn)\mathbb{P}(M_{na}=0|y_{1},\dots,y_{n}) RI
α=1\alpha=1 3.2003.200 1.0001.000 0.8950.895
α=0.2\alpha=0.2 3.2133.213 0.9940.994 0.8950.895
α=101\alpha=10^{-1} 3.2253.225 0.9870.987 0.8940.894
α=102\alpha=10^{-2} 3.2223.222 0.9870.987 0.8950.895
α=103\alpha=10^{-3} 3.4893.489 0.9600.960 0.8930.893
MFM-Ga
M^\hat{M} (Mna=0|y1,,yn)\mathbb{P}(M_{na}=0|y_{1},\dots,y_{n}) RI
γ=1\gamma=1 3.6703.670 0.9620.962 0.8930.893
γ=0.2\gamma=0.2 4.9444.944 0.4340.434 0.8900.890
γ=101\gamma=10^{-1} 4.9454.945 0.3940.394 0.8930.893
γ=102\gamma=10^{-2} 7.4257.425 0.0670.067 0.8930.893
γ=103\gamma=10^{-3} 7.6267.626 0.0460.046 0.8950.895

4.2 Density estimation

As seen in the previous subsection, the main difference between the MFM-Inv-Ga and MFM-Ga methods is the frequency with which empty components occur. The MFM-Ga method can achieve very high clustering accuracy by setting a small \gamma and allowing empty components. However, a small \gamma not only makes the model difficult to interpret but also degrades the predictive accuracy of the model. In this subsection, we examine this phenomenon through density estimation using predictive distributions.

We use the well-known galaxy dataset, a small data set consisting of 82 velocities (km/sec) of different galaxies. The data are widely used in nonparametric Bayesian statistics as a benchmark for density estimation and cluster analysis. The details of the data are found in Roeder (1990). We set M-1|\Lambda\sim\text{Poisson}(\Lambda) and \Lambda\sim\text{Ga}(1,1/5). We employ the univariate normal kernel N(y|\mu,\sigma^{2}) and the normal inverse gamma conjugate prior N(\mu|m_{0},\sigma^{2}/\tau)\times\text{IG}(\sigma^{2}|c_{0},C_{0}) as the prior for the kernel parameters. Moreover, we assume that the prior of C_{0} is \text{Gamma}(d_{0},D_{0}), and we set m_{0}=(\max(\mathrm{data})+\min(\mathrm{data}))/2, d_{0}=0.2, D_{0}=10/(\max(\mathrm{data})+\min(\mathrm{data}))^{2}, following Richardson and Green (1997). The parameter \tau controls the smoothness of the estimated density function. We assume that the prior of the smoothing parameter \tau is \text{Gamma}(w,W), where w=0.5 and W=50, following Escobar and West (1995). The parameters \tau and C_{0} should be carefully learned from the data, since density estimation is sensitive to their choice. The number of MCMC iterations is 100000, and the first 90000 samples are discarded as a burn-in period. The shape parameters are the same as in Section 4.1. We evaluate the results of density estimation and the posterior probabilities of k, M, and M_{na} for each shape parameter.

Figure 3: Results of density estimation for galaxy data using MFM-Inv-Ga (left) and MFM-Ga (right).

We show the estimated density functions using the posterior means in Figure 3. From the figure, it is observed that the shapes of the estimated densities using the MFM-Inv-Ga do not depend on the choice of the shape parameter α\alpha. However, results using the MFM-Ga method seem to be strongly influenced by the choice of the shape parameter γ\gamma. For γ=102\gamma=10^{-2} and 10310^{-3}, MFM-Ga cannot capture two large peaks in the middle.

Table 5: Posterior probabilities of the number of components MM and clusters kk for Galaxy data.
MFM-Inv-Ga
M3M\leq 3 M=4M=4 M=5M=5 M=6M=6 M=7M=7 M=8M=8 M=9M=9 M10M\geq 10
α=1.0\alpha=1.0 0.0000.000 0.0890.089 0.2240.224 0.310.31 0.1700.170 0.0970.097 0.0540.054 0.0560.056
α=0.2\alpha=0.2 0.1360.136 0.1940.194 0.1710.171 0.1520.152 0.1140.114 0.0860.086 0.0540.054 0.0910.091
α=101\alpha=10^{-1} 0.0710.071 0.0960.096 0.1540.154 0.1440.144 0.1370.137 0.1020.102 0.0790.079 0.2160.216
α=102\alpha=10^{-2} 0.0390.039 0.0780.078 0.1190.119 0.1510.151 0.1230.123 0.1020.102 0.0810.081 0.3080.308
α=103\alpha=10^{-3} 0.0570.057 0.1110.111 0.1700.170 0.1530.153 0.1220.122 0.1010.101 0.0710.071 0.2170.217
k3k\leq 3 k=4k=4 k=5k=5 k=6k=6 k=7k=7 k=8k=8 k=9k=9 k10k\geq 10
α=1.0\alpha=1.0 0.0000.000 0.0920.092 0.2410.241 0.3310.331 0.1640.164 0.0920.092 0.0460.046 0.0340.034
α=0.2\alpha=0.2 0.1560.156 0.2150.215 0.2060.206 0.1820.182 0.1180.118 0.0630.063 0.0330.033 0.0260.026
α=101\alpha=10^{-1} 0.0890.089 0.1270.127 0.2140.214 0.2240.224 0.1690.169 0.0940.094 0.0520.052 0.0310.031
α=102\alpha=10^{-2} 0.0880.088 0.1230.123 0.2010.201 0.2320.232 0.1600.160 0.1070.107 0.0540.054 0.0350.035
α=103\alpha=10^{-3} 0.0820.082 0.1520.152 0.2560.256 0.2210.221 0.1430.143 0.0810.081 0.0390.039 0.0250.025
MFM-Ga
M3M\leq 3 M=4M=4 M=5M=5 M=6M=6 M=7M=7 M=8M=8 M=9M=9 M10M\geq 10
γ=1.0\gamma=1.0 0.0380.038 0.1040.104 0.1910.191 0.2060.206 0.1760.176 0.1220.122 0.0720.072 0.0910.091
γ=0.2\gamma=0.2 0.0180.018 0.0420.042 0.0760.076 0.0980.098 0.1140.114 0.1170.117 0.1030.103 0.4320.432
γ=101\gamma=10^{-1} 0.0190.019 0.0400.040 0.0570.057 0.0710.071 0.0750.075 0.0830.083 0.0790.079 0.5770.577
γ=102\gamma=10^{-2} 0.0090.009 0.0230.023 0.0420.042 0.0590.059 0.0710.071 0.0760.076 0.0750.075 0.6450.645
γ=103\gamma=10^{-3} 0.0030.003 0.0040.004 0.0130.013 0.0190.019 0.0250.025 0.0330.033 0.0380.038 0.8650.865
k3k\leq 3 k=4k=4 k=5k=5 k=6k=6 k=7k=7 k=8k=8 k=9k=9 k10k\geq 10
γ=1.0\gamma=1.0 0.0440.044 0.1200.120 0.2320.232 0.2360.236 0.1880.188 0.1040.104 0.0440.044 0.0330.033
γ=0.2\gamma=0.2 0.1450.145 0.1980.198 0.3040.304 0.1860.186 0.0980.098 0.0400.040 0.0190.019 0.0090.009
γ=101\gamma=10^{-1} 0.1450.145 0.1980.198 0.3040.304 0.1860.186 0.0980.098 0.0400.040 0.0190.019 0.0090.009
γ=102\gamma=10^{-2} 0.7750.775 0.1950.195 0.0240.024 0.0050.005 0.0000.000 0.0000.000 0.0000.000 0.0000.000
γ=103\gamma=10^{-3} 0.9830.983 0.0180.018 0.0000.000 0.0000.000 0.0000.000 0.0000.000 0.0000.000 0.0000.000

The number of clusters in the galaxy data has been reported as 5 or 6 in existing studies. From Table 5, under MFM-Inv-Ga the posterior distributions of k place large probability on k=5 and 6 for all \alpha, and the posterior distributions induced by MFM-Inv-Ga are more similar to one another across shape parameters than those induced by MFM-Ga. For M, both MFM-Inv-Ga and MFM-Ga overestimate when \alpha and \gamma are small. This can be seen in Table 6. The reason is that the sample size n=82 is small, so empty components are easily produced. Tables 5 and 6 show that the degree of overestimation is much smaller for MFM-Inv-Ga than for MFM-Ga.

Table 6: Posterior probabilities of the number of empty components is equal to zero for Galaxy data
(Mna=0|y1,,yn)\mathbb{P}(M_{na}=0|y_{1},\dots,y_{n})
α\alpha and γ\gamma 11 0.20.2 10110^{-1} 10210^{-2} 10310^{-3}
MFM-Inv-Ga 0.8540.854 0.6540.654 0.4800.480 0.3470.347 0.4430.443
MFM-Ga 0.6480.648 0.0810.081 0.0400.040 0.0100.010 0.0020.002

Focusing on the number of clusters k, it is interesting that the estimates of k under MFM-Inv-Ga are more robust to the shape parameter than those under MFM-Ga. This suggests that the prior distribution of k induced by MFM-Inv-Ga is less informative than that induced by MFM-Ga; that is, the same relationship holds between MFM-Inv-Ga and MFM-Ga as between the normalized inverse Gaussian process and the Dirichlet process. In Lijoi et al. (2005), the prior of k induced by the normalized inverse Gaussian process is less informative with respect to the precision parameter than that induced by the Dirichlet process. However, the prior of k for the normalized inverse Gaussian process can be written in closed form, while that for MFM-Inv-Ga is only available in a complicated integral form.

The shape parameter of MFM-Ga should be chosen carefully because it has a significant impact on clustering, density estimation, and the appearance of empty components. By contrast, MFM-Inv-Ga is much more robust than MFM-Ga with respect to the choice of the shape parameter. Hence, MFM-Inv-Ga is superior to MFM-Ga in that it is much easier to use.

4.3 Community detection

We apply the proposed method to the community detection for network data. Similar comparisons as in Section 4.1 are made for both artificial and real data. Since the number of components of the finite mixture models is equivalent to the number of communities of the network, we use the same notation MM to denote the number of communities.

4.3.1 Artificial data

First, we illustrate the performance of the proposed method using simulation data. We assume that the true number of communities is 3, denoted by M_{\mathrm{true}}=3, and the number of nodes in the network is set as n=150. We consider a balanced network in which the true allocation consists of M_{\mathrm{true}} communities of 50 nodes each. For the true probability matrix Q, we assume that each component is expressed as Q_{rs}=q+(p-q)I(r=s), where q=0.1, p=0.8 and 1\leq r\leq s\leq M_{\mathrm{true}}. This assumption means that edges are drawn more easily within the same community than between different communities. Under this setup, we generate 50 datasets.
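For completeness, the following R sketch generates one network from this balanced stochastic block model (variable names are illustrative; the double loop favors clarity over speed).

```r
## One artificial network from the balanced SBM of Section 4.3.1.
set.seed(1)
n_per  <- 50                               # nodes per community
M_true <- 3
p <- 0.8; q <- 0.1
c_lab  <- rep(1:M_true, each = n_per)      # true community labels
n      <- length(c_lab)

Q <- matrix(q, M_true, M_true); diag(Q) <- p   # Q_rs = q + (p - q) I(r = s)
A <- matrix(0, n, n)
for (i in 1:(n - 1)) for (j in (i + 1):n) {
  A[i, j] <- A[j, i] <- rbinom(1, 1, Q[c_lab[i], c_lab[j]])
}
```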

We set M1|ΛPoisson(Λ)M-1|\Lambda\sim\text{Poisson}(\Lambda) and ΛGa(1,1)\Lambda\sim\text{Ga}(1,1), and employ (3.5) as a kernel and a prior, where aQ=1a_{Q}=1 and bQ=1b_{Q}=1. The values of the shape parameters, the evaluations and the MCMC setting are the same as those of Section 4.1.1.

Figure 4: Box plots of the posterior mean of the number of communities MM. From left to right, shape parameters are set as γ,α=1,0.2,101,102,103\gamma,\alpha=1,0.2,10^{-1},10^{-2},10^{-3}.

We report the posterior mean of the number of communities in Figure 4. It is observed that the results are almost the same as those of Figure 2.

Table 7: Results of the rand index RI^i\widehat{\mathrm{RI}}_{i} and posterior probability that the number of empty components is equal to zero ^(Mna=0)\widehat{\mathbb{P}}(M_{na}=0) (their standard deviations are shown in parentheses) averaged over 50 Monte Carlo replications.
RI^i\widehat{\mathrm{RI}}_{i}
α=γ=1\alpha=\gamma=1 α=γ=0.2\alpha=\gamma=0.2 α=γ=101\alpha=\gamma=10^{-1} α=γ=102\alpha=\gamma=10^{-2} α=γ=103\alpha=\gamma=10^{-3}
MFM-Inv-Ga 0.9970.997 0.9970.997 0.9950.995 0.9910.991 0.9910.991
(0.010)(0.010) (0.019)(0.019) (0.032)(0.032) (0.044)(0.044) (0.044)(0.044)
MFM-Ga 0.9950.995 0.9960.996 0.9800.980 0.9870.987 0.9100.910
(0.032)(0.032) (0.026)(0.026) (0.059)(0.059) (0.054)(0.054) (0.044)(0.044)
^(Mna=0)\widehat{\mathbb{P}}(M_{na}=0)
α=γ=1\alpha=\gamma=1 α=γ=0.2\alpha=\gamma=0.2 α=γ=101\alpha=\gamma=10^{-1} α=γ=102\alpha=\gamma=10^{-2} α=γ=103\alpha=\gamma=10^{-3}
MFM-Inv-Ga 1.0001.000 0.9940.994 0.9880.988 0.9750.975 0.9620.962
(0.001)(0.001) (0.009)(0.009) (0.029)(0.029) (0.044)(0.044) (0.094)(0.094)
MFM-Ga 0.9600.960 0.5000.500 0.3110.311 0.0890.089 0.0680.068
(0.011)(0.011) (0.071)(0.071) (0.064)(0.064) (0.024)(0.024) (0.018)(0.018)

Table 7 also shows that the results for the posterior probability of M_{na} are the same as those in Table 1. However, in terms of clustering accuracy for network data, MFM-Inv-Ga is more accurate than MFM-Ga.

Table 8: Posterior probabilities and their standard deviations (shown in parentheses) of the number of components MM averaged over 5050 Monte Carlo replications. The highest value is bolded.
MFM-Inv-Ga
                   $M=1$           $M=2$           $M=3$                    $M=4$           $M=5$                    $M\geq 6$
$\alpha=1$         0.000 (0.000)   0.000 (0.000)   \textbf{0.927} (0.243)   0.073 (0.243)   0.000 (0.000)            0.000 (0.000)
$\alpha=0.2$       0.000 (0.000)   0.012 (0.084)   \textbf{0.966} (0.111)   0.022 (0.073)   0.000 (0.001)            0.000 (0.000)
$\alpha=10^{-1}$   0.000 (0.000)   0.020 (0.141)   \textbf{0.950} (0.178)   0.027 (0.106)   0.002 (0.009)            0.000 (0.002)
$\alpha=10^{-2}$   0.000 (0.000)   0.036 (0.179)   \textbf{0.936} (0.184)   0.023 (0.035)   0.004 (0.008)            0.001 (0.004)
$\alpha=10^{-3}$   0.000 (0.000)   0.028 (0.143)   \textbf{0.938} (0.180)   0.023 (0.044)   0.007 (0.021)            0.004 (0.017)
MFM-Ga
$\gamma=1$         0.000 (0.000)   0.020 (0.139)   \textbf{0.930} (0.135)   0.049 (0.028)   0.002 (0.002)            0.000 (0.000)
$\gamma=0.2$       0.000 (0.000)   0.010 (0.069)   \textbf{0.480} (0.085)   0.307 (0.036)   0.133 (0.035)            0.070 (0.038)
$\gamma=10^{-1}$   0.000 (0.000)   0.035 (0.104)   \textbf{0.300} (0.059)   0.288 (0.042)   0.186 (0.040)            0.191 (0.070)
$\gamma=10^{-2}$   0.000 (0.000)   0.010 (0.041)   0.091 (0.036)            0.156 (0.014)   \textbf{0.175} (0.015)   0.566 (0.092)
$\gamma=10^{-3}$   0.000 (0.000)   0.006 (0.028)   0.069 (0.023)            0.131 (0.017)   \textbf{0.160} (0.012)   0.634 (0.085)

From Table 8, MFM-Inv-Ga assigns a much higher posterior probability to $M_{\mathrm{true}}$ than MFM-Ga for all shape parameters. As a result, in the context of community detection, MFM-Inv-Ga also achieves better clustering and better estimation of the number of communities than MFM-Ga.

We also compared the computation times with the method of Geng et al. (2019), denoted MFM-Geng, for shape parameters $1.0$ and $10^{-3}$. The results in Table 9 indicate that MFM-Inv-Ga is, on average, more than twice as fast as MFM-Geng. This is because our algorithm does not update the label variables through a restaurant-process scheme and because MFM-Inv-Ga has a structure that rarely produces empty components.

MFM-Geng can estimate the number of clusters $k$, but it cannot directly estimate the number of communities $M$. Furthermore, it is difficult for MFM-Geng to estimate the hyperparameter of $q_{M}$, because doing so requires complicated series calculations involving that parameter. In summary, MFM-Inv-Ga (and MFM-Ga) are superior to MFM-Geng in terms of computation time, direct estimation of $M$, and estimation of an essential hyperparameter.

Table 9: Comparison of CPU times in seconds (standard deviations are shown in parentheses) averaged over 50 replications.
                          MFM-Inv-Ga        MFM-Geng
$\alpha=\gamma=1$         38.793 (2.894)    87.519 (3.719)
$\alpha=\gamma=10^{-3}$   38.361 (4.431)    83.423 (4.673)

4.3.2 Dolphins social network data

The dolphins social network data set is often used as a benchmark. The data can be obtained at http://www-personal.umich.edu/mejn/netdata/. The data form an undirected graph representing a small-scale animal social network of 62 bottlenose dolphins living off Doubtful Sound, New Zealand. Each node represents a dolphin, and an edge is drawn between two dolphins that are observed to associate closely with each other. Previous studies have established that the network has two communities. Details of the data can be found in Lusseau et al. (2003).
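As a hedged sketch, the network could be loaded in Python with networkx as below; the file name dolphins.gml inside the downloaded archive is our assumption, so adjust the path as needed.

```python
import networkx as nx

# Read the GML file extracted from the archive at the URL above
# (the file name is an assumption) and form the adjacency matrix.
G = nx.read_gml("dolphins.gml")
A = nx.to_numpy_array(G, dtype=int)   # symmetric 0/1 adjacency matrix
```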

As before, we compare the posterior distributions of the number of communities between MFM-Inv-Ga and MFM-Ga, and construct the co-clustering matrix of MFM-Inv-Ga to quantify the uncertainty of the clustering. In this analysis, we set $M-1\sim\text{Poisson}(1)$ and $a_{Q}=b_{Q}=3.0$.
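The co-clustering matrix can be computed directly from the MCMC output; the following is a minimal sketch (ours, not the authors' code) that takes posterior draws of the label vector and returns the estimated pairwise co-clustering probabilities.

```python
import numpy as np

def coclustering_matrix(label_draws):
    """label_draws: array of shape (num_draws, n) whose (t, i) entry is the
    community label of node i in MCMC draw t. Entry (i, j) of the returned
    matrix estimates the posterior probability that nodes i and j share a
    community."""
    label_draws = np.asarray(label_draws)
    same = label_draws[:, :, None] == label_draws[:, None, :]
    return same.mean(axis=0)

# Example with three posterior draws over four nodes:
draws = np.array([[0, 0, 1, 1],
                  [0, 0, 0, 1],
                  [2, 2, 1, 1]])
print(coclustering_matrix(draws))
```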

Table 10: Posterior probabilities of the number of communities $M$ for the dolphins social network data.
MFM-Inv-Ga
                   $M=1$   $M=2$   $M=3$   $M=4$   $M=5$   $M\geq 6$
$\alpha=1$         0.000   0.997   0.003   0.000   0.000   0.000
$\alpha=0.2$       0.000   0.972   0.025   0.002   0.001   0.000
$\alpha=10^{-1}$   0.000   0.961   0.034   0.004   0.000   0.000
$\alpha=10^{-2}$   0.000   0.941   0.048   0.010   0.001   0.000
$\alpha=10^{-3}$   0.000   0.854   0.136   0.009   0.002   0.000
MFM-Ga
$\gamma=1$         0.000   0.949   0.049   0.002   0.000   0.000
$\gamma=0.2$       0.000   0.605   0.307   0.074   0.012   0.001
$\gamma=10^{-1}$   0.000   0.514   0.340   0.117   0.025   0.004
$\gamma=10^{-2}$   0.000   0.619   0.303   0.069   0.000   0.000
$\gamma=10^{-3}$   0.370   0.400   0.172   0.048   0.007   0.001

Table 10 shows that MFM-Inv-Ga successfully estimates the number of communities as $M=2$ regardless of the value of the shape parameter, whereas MFM-Ga does not. We report the co-clustering matrices of MFM-Inv-Ga with $\alpha=1.0$ and MFM-Geng with $\gamma=1.0$ in Figure 5. The result for MFM-Inv-Ga is almost identical to that reported in Geng et al. (2019). Furthermore, we confirmed that changing the value of $\alpha$ does not change the co-clustering matrices or the clustering solution based on the MAP estimate. This demonstrates the robustness of the proposed method.

Figure 5: Co-clustering matrices of MFM-Inv-Ga with $\alpha=1.0$ (left) and MFM-Geng with $\gamma=1.0$ (right) for the dolphins social network data.

5 Concluding remarks

We proposed a mixture of finite mixtures (MFM) model based on the normalized inverse Gaussian distribution and constructed an efficient posterior sampling algorithm based on Argiento and De Iorio (2022). The proposed method is a finite analog of the normalized inverse Gaussian process proposed by Lijoi et al. (2005). We illustrated the performance of the proposed method for clustering, density estimation, and community detection, in comparison with existing MFM models based on the Dirichlet distribution (e.g., Miller and Harrison, 2018; Geng et al., 2019; Argiento and De Iorio, 2022). The proposed method is robust to the choice of the hyperparameter $\alpha$ and provided reasonable estimates of the number of components and communities compared with the MFM based on the Dirichlet prior. Moreover, the proposed method achieved reasonable predictive performance in terms of density estimation by suppressing the appearance of empty components.

The drawbacks of the proposed method are as follows. Some parameters involved in the model do not have closed-form marginal distributions because it is not easy to marginalize out $U_{n}$. For example, when the focus is on clustering, obtaining an interpretable prior distribution for the number of clusters is very important for incorporating subjective prior information (see e.g., Zito et al., 2023). However, the proposed model does not lead to a tractable marginal prior distribution for the number of clusters. In addition, in the mixture of finite mixtures, the weight distribution over the simplex is constructed by normalizing independent random variables, so the correlation between categories cannot be modeled flexibly. The normalized inverse Gaussian distribution has negative covariances, as does the Dirichlet distribution. It may therefore be inappropriate for data with positive correlations between categories, such as the proportions of symbiotic organisms, disease complication data, or gene expression data. To address this problem, it may be necessary to construct the MFM in a more general framework that removes the independence assumption in the Norm-IFPP of Argiento and De Iorio (2022). The construction of MFM models using a more flexible distribution over the simplex that can also express positive correlations, such as the generalized Dirichlet distribution (Wong, 1998), is an interesting topic for future work. Furthermore, the proposed model can be applied to spatial data (e.g., Geng et al., 2021) and functional data (e.g., Hu et al., 2023). For network data, it would be of interest to extend the MFM to networks with weighted edges, degree-corrected stochastic block models, and mixed membership stochastic block models.

Acknowledgement

This work was supported by the Japan Society for the Promotion of Science through the establishment of university fellowships towards the creation of science technology innovation, Grant Number JPMJFS2129. This work was also partially supported by the Japan Society for the Promotion of Science (Grant Number 21K13835).

References

  • Argiento, R. and M. De Iorio (2022). Is infinity that far? A Bayesian nonparametric perspective of finite mixture models. The Annals of Statistics 50(5), 2641–2663.
  • Escobar, M. D. and M. West (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90(430), 577–588.
  • Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. Springer.
  • Frühwirth-Schnatter, S., G. Celeux, and C. P. Robert (2019). Handbook of mixture analysis. CRC Press.
  • Frühwirth-Schnatter, S. and S. Kaufmann (2008). Model-based clustering of multiple time series. Journal of Business & Economic Statistics 26(1), 78–89.
  • Frühwirth-Schnatter, S., G. Malsiner-Walli, and B. Grün (2021). Generalized mixtures of finite mixtures and telescoping sampling. Bayesian Analysis 16(4), 1279–1307.
  • Frühwirth-Schnatter, S. and S. Pyne (2010). Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions. Biostatistics 11(2), 317–336.
  • Geng, J., A. Bhattacharya, and D. Pati (2019). Probabilistic community detection with unknown number of communities. Journal of the American Statistical Association 114(526), 893–905.
  • Geng, J., W. Shi, and G. Hu (2021). Bayesian nonparametric nonhomogeneous Poisson process with applications to USGS earthquake data. Spatial Statistics 41, 100495.
  • Ghosal, S. and A. W. van der Vaart (2017). Fundamentals of nonparametric Bayesian inference, Volume 44. Cambridge University Press.
  • Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711–732.
  • Handcock, M. S., A. E. Raftery, and J. M. Tantrum (2007). Model-based clustering for social networks. Journal of the Royal Statistical Society Series A: Statistics in Society 170(2), 301–354.
  • Henze, N. (1986). A probabilistic representation of the 'skew-normal' distribution. Scandinavian Journal of Statistics 13(4), 271–275.
  • Hu, G., J. Geng, Y. Xue, and H. Sang (2023). Bayesian spatial homogeneity pursuit of functional data: an application to the U.S. income distribution. Bayesian Analysis 18(2), 579–605.
  • James, L. F., A. Lijoi, and I. Prünster (2009). Posterior analysis for normalized random measures with independent increments. Scandinavian Journal of Statistics 36(1), 76–97.
  • Lijoi, A., R. H. Mena, and I. Prünster (2005). Hierarchical mixture modeling with normalized inverse-Gaussian priors. Journal of the American Statistical Association 100(472), 1278–1291.
  • Lusseau, D., K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson (2003). The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations: can geographic isolation explain this unique trait? Behavioral Ecology and Sociobiology 54, 396–405.
  • Malsiner-Walli, G., S. Frühwirth-Schnatter, and B. Grün (2016). Model-based clustering based on sparse finite Gaussian mixtures. Statistics and Computing 26(1), 303–324.
  • McLachlan, G. (2000). Finite mixture models. A Wiley-Interscience publication.
  • McLachlan, G. J., R. W. Bean, and D. Peel (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18(3), 413–422.
  • McLachlan, G. J., S. X. Lee, and S. I. Rathnayake (2019). Finite mixture models. Annual Review of Statistics and Its Application 6(1), 355–378.
  • Miller, J. W. and M. T. Harrison (2018). Mixture models with a prior on the number of components. Journal of the American Statistical Association 113(521), 340–356.
  • Newman, M. E. (2004). Detecting community structure in networks. The European Physical Journal B 38, 321–330.
  • Nobile, A. (1994). Bayesian analysis of finite mixture distributions. Carnegie Mellon University.
  • Nobile, A. and A. T. Fearnside (2007). Bayesian finite mixtures with an unknown number of components: The allocation sampler. Statistics and Computing 17, 147–162.
  • Nowicki, K. and T. A. B. Snijders (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association 96(455), 1077–1087.
  • Richardson, S. and P. J. Green (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society Series B: Statistical Methodology 59(4), 731–792.
  • Roeder, K. (1990). Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association 85(411), 617–624.
  • Rousseau, J. and K. Mengersen (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society Series B: Statistical Methodology 73(5), 689–710.
  • Shi, J. and J. Malik (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905.
  • White, S. and P. Smyth (2005). A spectral clustering approach to finding communities in graphs. In Proceedings of the 2005 SIAM International Conference on Data Mining, pp. 274–285. SIAM.
  • Wong, T.-T. (1998). Generalized Dirichlet distribution in Bayesian analysis. Applied Mathematics and Computation 97(2-3), 165–181.
  • Zito, A., T. Rigon, and D. B. Dunson (2023). Bayesian nonparametric modeling of latent partitions via Stirling-gamma priors. arXiv preprint arXiv:2306.02360.