
A Model-based Semi-supervised Clustering Methodology

Jordan Yoder 
Department of Applied Mathematics and Statistics, Johns Hopkins University
and
Carey E. Priebe
Department of Applied Mathematics and Statistics, Johns Hopkins University
This work is partially funded by the National Security Science and Engineering Faculty Fellowship (NSSEFF), the Johns Hopkins University Human Language Technology Center of Excellence (JHU HLT COE), and the XDATA program of the Defense Advanced Research Projects Agency (DARPA) administered through Air Force Research Laboratory contract FA8750-12-2-0303.
Abstract

We consider an extension of model-based clustering to the semi-supervised case, where some of the data are pre-labeled. We provide a derivation of the Bayesian Information Criterion (BIC) approximation to the Bayes factor in this setting. We then use the BIC to select the number of clusters and the variables useful for clustering. We demonstrate the efficacy of this adaptation of the model-based clustering paradigm through two simulation examples and a fly larvae behavioral dataset in which lines of neurons are clustered into behavioral groups.


Keywords: BIC, Machine learning, Behavioral data, Model selection, GMM, Mclust

1 Introduction

Clustering is the art of partitioning data into distinct and scientifically relevant classes by assigning labels to observations. Clustering is typically performed in an unsupervised setting, where none of the observed data initially have labels. In semi-supervised learning, some of the data have labels, but others do not; the goal is to classify the unlabeled data by assigning them to the same classes as the labeled data. In the gap between these two settings, there is semi-supervised clustering, in which some of the data have labels, and there may not be labeled data from every class. Furthermore, the total number of classes may be unknown. Just as in semi-supervised learning, leveraging the known labels allows for a more informed clustering.

There are many strategies for solving clustering problems. Some clustering procedures are based on heuristics that lack a principled justification, and many of the basic questions of clustering (e.g., how many classes there are) are often left to the intuition of the practitioner. Fraley and Raftery [9] review model-based clustering, which recasts the task of clustering as a model selection problem, a problem that is well-studied in statistics. The main assumption in model-based clustering is that the data are drawn from one of $G$ distributions, where each distribution represents a cluster. Model selection can be accomplished by computing approximate Bayes factors for competing models with different numbers and/or types of components.

Much of the recent work in model-based clustering has centered on variable selection. Raftery and Dean [18] proposed a greedy algorithm for model-based clustering in the unsupervised setting; it used the BIC to choose the number of clusters and the features to consider while clustering by proposing linear relationships between variables that influence clustering and those independent of clustering. Murphy et al. [15] adapted Raftery and Dean’s algorithm to semi-supervised learning. Maugis et al. [13, 12] consider many more scenarios of dependency structures between the clustering variables and the remaining variables than in [18]. Taking a different approach, Witten et al. [22] offer a framework for variable selection in clustering using sparse k-means or sparse hierarchical algorithms by modifying the corresponding objective function to include penalty constraints. In their simulations, they outperform the methods of [18] considerably.

The BIC is generally denoted as

BIC=2\mathcal{L}-d\log(n),

where $\mathcal{L}$ is the maximized log-likelihood, $d$ is the number of parameters estimated, and $n$ is the number of observations used to estimate those parameters. Alternative information criteria generally differ by having different values for the penalty term. For example, the sample size adjusted BIC [5] is defined as

sBIC=2\mathcal{L}-d\log\left(\frac{n+2}{24}\right).

Because $BIC-sBIC=o(1)$, one may wonder about the efficacy of such a deviation from the BIC. This question is particularly salient when considering that the standard derivation of the BIC as the approximation to the integrated loglikelihood is only $O(1)$ accurate asymptotically (so that $o(1)$ terms are negligible). In this paper, we will present such a modified BIC that is asymptotically equivalent to the standard BIC and thoroughly explore an analytical example of how such a modification could be useful.
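As a concrete, minimal illustration, the sketch below computes both criteria for a Gaussian mixture fitted in R; it assumes the mclust package, and the toy data and two-component model are our own invention, not an example from this paper.

```r
library(mclust)

set.seed(1)
x <- c(rnorm(150, mean = 0), rnorm(150, mean = 3))   # toy two-component data

fit  <- Mclust(x, G = 2, modelNames = "E")           # equal-variance univariate GMM
logL <- fit$loglik                                   # maximized log-likelihood
d    <- fit$df                                       # number of estimated parameters
n    <- length(x)

bic  <- 2 * logL - d * log(n)             # standard BIC, in this paper's notation
sbic <- 2 * logL - d * log((n + 2) / 24)  # sample-size adjusted BIC of [5]
c(BIC = bic, sBIC = sbic)
```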

In this paper, we quickly review model-based clustering in Section 2. We state our derivation of a modified BIC for the semi-supervised case in Section 3 that is asymptotically equivalent to the standard BIC. We follow this with a detailed discussion of $o(1)$-equivalent information criteria, analytical calculations for a toy example, and simulations for a semi-supervised clustering example. Next, we apply our methods to the fly larvae behavioral dataset in Section 5. Finally, we conclude the paper with a discussion of the results and extensions to our method.

2 Review of Model-Based Clustering

Assume that the data $X_{1},X_{2},\dots,X_{n}$ are distributed i.i.d. according to a mixture model,

f_{\theta}(\cdot)=\sum_{k=1}^{G}\pi_{k}f_{\theta_{k}}(\cdot),

where $\theta_{k}$ are the parameters of the $k^{\text{th}}$ component of the mixture and $G$ is the total number of components. It can be shown that this is equivalent to the following generative process: for each $i\in\{1,2,\dots,n\}$,

  (1) draw $Z_{i}$ from $\text{multinomial}\left(\pi_{1},\pi_{2},\dots,\pi_{G}\right)$,

  (2) draw $X_{i}$ from $f_{\theta_{Z_{i}}}$.

If we consider each of the $G$ components as representing a single cluster, then the problem of clustering the data is that of unveiling each latent $Z_{i}$ from the generative process. The expectation-maximization (EM) algorithm can be used to find the maximum likelihood estimates for the parameters of the model and the posterior probabilities of cluster membership for each $X_{i}$ (cf. [8]).

With this formulation, we have a fully probabilistic clustering scheme. We can set the cluster of the $i^{\text{th}}$ observation to the maximum a posteriori estimate:

\hat{Z}_{i}=\text{argmax}_{k}\,\pi_{k}f_{\theta_{k}}(x_{i}).

Note that the posterior probability can give us a measure of confidence about the $i^{\text{th}}$ classification.
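As a small illustration of the MAP rule, the sketch below assigns one-dimensional toy data using made-up mixture parameters; it is not output from any model fitted in this paper.

```r
# MAP labeling under known mixture parameters (one-dimensional toy example).
map_assign <- function(x, pi, mu, sigma) {
  # n x G matrix of pi_k * f_k(x_i)
  w <- sapply(seq_along(pi), function(k) pi[k] * dnorm(x, mu[k], sigma[k]))
  list(z_hat = max.col(w),                 # argmax_k of pi_k * f_k(x_i)
       posterior = w / rowSums(w))         # confidence in each assignment
}

set.seed(2)
x <- c(rnorm(50, 0, 1), rnorm(50, 4, 1))
table(map_assign(x, pi = c(.5, .5), mu = c(0, 4), sigma = c(1, 1))$z_hat)
```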

Here, we have used $f_{\theta_{k}}$ as a general density function, but it should be noted that the multivariate Gaussian distribution is the most common choice of distribution. We will eventually use an information criterion, such as the Bayesian Information Criterion (BIC), to assess the relative quality of different clusterings. Then, the number of clusters can be chosen using the BIC, which rewards model fit and penalizes model complexity.

By considering different constraints on the covariance matrices of the Gaussian components, we can reduce the number of parameters estimated, thus lowering the influence of the penalty term in the BIC. This allows simpler cluster models to be chosen over more complex ones. Celeux and Govaert [6] consider a spectral decomposition of the $k^{\text{th}}$ component's covariance matrix,

\Sigma_{k}=\lambda_{k}D^{T}_{k}A_{k}D_{k},

where $\lambda_{k}$ represents the largest eigenvalue, $A_{k}$ is a diagonal matrix whose largest entry is $1$, and $D_{k}$ is a matrix of the corresponding eigenvectors. The interpretation of each term as it relates to cluster $k$ is as follows: $\lambda_{k}$ represents the volume, $D_{k}$ the orientation, and $A_{k}$ the shape. By forcing one or more of these terms to be the same across all clusters, we can reduce the number of parameters to be estimated from $Gd+G-1+G\frac{d(d+1)}{2}$ in the unconstrained case ($\Sigma_{k}=\lambda_{k}D^{T}_{k}A_{k}D_{k}$) to $Gd+G-1+\frac{d(d+1)}{2}$ in the most constrained case ($\Sigma_{k}=\lambda D^{T}AD$). We can also force additional constraints, such as $D_{k}=I$ and/or $A_{k}=I$, leading to the simplest model: $\Sigma_{k}=\lambda I$ with only $G(d+1)$ parameters to estimate.
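For concreteness, the sketch below tallies these parameter counts in R. It assumes the mclust package, whose model names VVV, EEE, and EII correspond, in the usual mclust nomenclature, to the unconstrained, fully shared, and spherical shared-volume covariance cases.

```r
library(mclust)

d <- 4; G <- 3
# mclust's own counts of free parameters
c(VVV = nMclustParams("VVV", d = d, G = G),   # Sigma_k unconstrained
  EEE = nMclustParams("EEE", d = d, G = G),   # Sigma_k = lambda D^T A D shared
  EII = nMclustParams("EII", d = d, G = G))   # Sigma_k = lambda I

# hand counts matching the text: (G-1) mixing proportions + G*d means + covariance terms
c(VVV = (G - 1) + G * d + G * d * (d + 1) / 2,
  EEE = (G - 1) + G * d + d * (d + 1) / 2,
  EII = (G - 1) + G * d + 1)
```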

3 Methodology

The main theoretical contribution of this paper is to derive an adjustment to the BIC for the semi-supervised clustering setting, followed by an exploration of $o(1)$-equivalent information criteria. We will set forth a formalization of a more general setting, derive the BIC in this broader setting, and then apply our result to the special case of semi-supervised clustering. The modified BIC then represents a principled measure by which to choose the number of clusters and the variables for clustering when performing semi-supervised clustering.

Consider $n=\sum_{i=1}^{C}n_{i}$ independent random variables

X_{1}^{(1)},X_{2}^{(1)},\dots,X_{n_{1}}^{(1)}\overset{\text{i.i.d.}}{\sim}f_{\theta_{1}},
X_{1}^{(2)},X_{2}^{(2)},\dots,X_{n_{2}}^{(2)}\overset{\text{i.i.d.}}{\sim}f_{\theta_{2}},
\vdots
X_{1}^{(C)},X_{2}^{(C)},\dots,X_{n_{C}}^{(C)}\overset{\text{i.i.d.}}{\sim}f_{\theta_{C}},

where $\theta_{1}\in\Theta_{1}\subset\mathbb{R}^{d_{1}}$ and $\theta_{j}\in\Theta_{j}\subset\Theta_{1}$ for $j=2,3,\dots,C$, and

(f_{\theta_{1}},f_{\theta_{2}},\dots,f_{\theta_{C}})\in M=\{(f_{\theta^{(1)}},f_{\theta^{(2)}},\dots,f_{\theta^{(C)}}):\theta=(\theta^{(1)},\dots,\theta^{(C)})\in\Theta_{1}\times\Theta_{2}\times\dots\times\Theta_{C}\}.

Collect all of the $X$'s into the set $D$.

Model $M$, while more general than strictly necessary, encompasses the semi-supervised case. Consider the first group of $n_{1}$ random variables as those for which we do not have labels and each of the other $C-1$ groups as those whose labels we know. Then, for a proposed total number of clusters $G\geq C-1$, we assume

f_{\theta_{1}}(x)=\sum_{j=1}^{G}\pi_{j}\phi(x;\mu_{j},\Sigma_{j}),

where $\phi(\cdot;\mu_{j},\Sigma_{j})$ is a multivariate normal pdf with mean $\mu_{j}$ and covariance matrix $\Sigma_{j}$, and $\sum_{j=1}^{G}\pi_{j}=1$. Also, for each $k\in\{2,3,\dots,C\}$,

f_{\theta_{k}}(x)=\phi(x;\mu_{j_{k}},\Sigma_{j_{k}}),

where the double subscript $j_{k}$ is to account for possible relabeling. It follows that

\theta_{1}=\left((\pi_{1},\pi_{2},\dots,\pi_{G}),(\mu_{1},\mu_{2},\dots,\mu_{G}),(\Sigma_{1},\Sigma_{2},\dots,\Sigma_{G})\right)\in\Theta_{1},

where

\Theta_{1}=\{(\pi_{1},\pi_{2},\dots,\pi_{G}):\sum_{i=1}^{G}\pi_{i}=1,\ \pi_{i}\in[0,1]\}\times\mathbb{R}^{d\times G}\times\{(\Sigma_{1},\Sigma_{2},\dots,\Sigma_{G}):\Sigma_{i}\succeq 0,\ \Sigma_{i}\in\mathbb{R}^{d\times d}\}.

Then, for each $k\in\{2,3,\dots,C\}$, $\Theta_{k}$ is the same as $\Theta_{1}$ except that the mixing coefficients $(\pi_{1},\pi_{2},\dots,\pi_{G})$ are constrained such that $\pi_{j_{k}}=1$ and all other $\pi_{i}=0$.

With this model, we can derive the Expectation step (E-step) and Maximization step (M-step) of the EM algorithm. Note that semi-supervised EM is not new; indeed, similar calculations can be found in Section 2.19 of [14] and in [19].
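The sketch below illustrates the E-step under this model: labeled observations keep a fixed one-hot responsibility for their known component, while unlabeled observations receive the usual posterior responsibilities. This is a minimal illustration assuming the mvtnorm package; it is not the ssClust implementation.

```r
library(mvtnorm)

# E-step sketch: X is an n x d matrix; labels is a length-n vector with NA for
# the unlabeled points and values in 1..G for the labeled ones.
e_step <- function(X, labels, pi, mu, Sigma) {
  G <- length(pi)
  dens  <- sapply(1:G, function(k) pi[k] * dmvnorm(X, mu[[k]], Sigma[[k]]))
  gamma <- dens / rowSums(dens)            # usual posterior responsibilities
  for (k in 1:G) {                         # pin down the points with known labels
    idx <- which(labels == k)
    gamma[idx, ]  <- 0
    gamma[idx, k] <- 1
  }
  gamma
}
```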

Here, we generalize the BIC to the semi-supervised case with the aim of not overly penalizing for supervised data. The modified BIC approximation to the integrated likelihood of model $M$ is

2\log(P(D|M))\approx 2\log(P(D|\hat{\theta},M))-d\log(n_{1})=BIC^{*}_{M},

where $\hat{\theta}$ is the maximum likelihood estimate (MLE).

We provide the derivation in the appendix. Compare the above result to the classical BIC approximation:

2\log(P(D|M))\approx 2\log(P(D|\hat{\theta},M))-d\log(n)=BIC.

The derivation of the BIC (adjusted and standard) involves an $O(1)$ term, which is dropped at the end. Due to this, the difference between the two BICs is a smaller penalty in the semi-supervised case (i.e., $d\log(n_{1})$ vs. $d\log(n)$, which is an $o(1)$ difference). This can be interpreted as not penalizing more complicated cluster structures for the supervised data. We will discuss the merit of such a deviation in Section 4.
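The following sketch makes the comparison concrete; the log-likelihood value, parameter count, and sample sizes are invented for illustration only.

```r
# Penalized log-likelihood comparison: modified vs standard penalty.
ss_bic  <- function(logLik, d, n_unlabeled) 2 * logLik - d * log(n_unlabeled)
std_bic <- function(logLik, d, n_total)     2 * logLik - d * log(n_total)

logL <- -350                                             # made-up maximized log-likelihood
c(standard = std_bic(logL, d = 20, n_total = 140),       # 100 labeled + 40 unlabeled
  modified = ss_bic(logL, d = 20, n_unlabeled = 40))
```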

It is known that finite mixture models do not meet the regularity conditions underlying the BIC approximation to the integrated likelihood. However, [9] comment that it is still an appropriate and effective approximation in the paradigm of model-based clustering. Thus, we will not be too remiss in considering the modified BIC derived above as applicable when performing semi-supervised clustering. Because the model $M$ is appropriate for the semi-supervised case, we can use the modified BIC as a way to calculate Bayes factors for different numbers of total clusters $G$, subsets of variables to be included in the clustering, and parameterizations of $\Sigma_{i}$. The largest BIC value will therefore correspond to our chosen model for clustering. Henceforth, we will call the semi-supervised clustering algorithm that uses this modified BIC to make the aforementioned selections ssClust.

4 o(1)-Equivalent Information Criteria

Here, we comment on the use of different penalty terms. Suppose that we define an adjusted BIC, say $BIC^{\prime}$, as a function of $m\in[1,\infty)$ by $BIC^{\prime}(m)=2\mathcal{L}-d\log(m)=BIC-d\log\left(\frac{m}{n}\right)$.

When does using such an adjustment matter? Consider two models, say $M_{0}$ and $M_{1}$, in competition for selection. Fix an $m$. Suppose that

  (i) $d_{1}>d_{0}$,

  (ii) $BIC_{0}>BIC_{1}$, and

  (iii) $BIC^{\prime}_{0}(m)<BIC^{\prime}_{1}(m)$.

In this case, we see that under the original definition of the BIC we would choose $M_{0}$ over $M_{1}$, and with the modified definition we would do the opposite.

In order to analyze this under some assumptions, define two statistical tests:

T(\omega)=\mathbbm{1}\{BIC_{1}>BIC_{0}\}

and

T^{\prime}_{m}(\omega)=\mathbbm{1}\{BIC^{\prime}_{1}(m)>BIC^{\prime}_{0}(m)\}.

When do $T$ and $T^{\prime}_{m}$ differ?

  Case 1: $T=1$ and $T^{\prime}_{m}=0$. For $m\leq n$ this is impossible.

  Case 2: $T=0$ and $T^{\prime}_{m}=1$. We have two subcases.

    (a) Suppose $M_{1}$ is the true model. We would be correct in choosing $T^{\prime}_{m}$ over $T$.

    (b) Suppose $M_{0}$ is the true model. Note that in this case, we would make a mistake by choosing $T^{\prime}_{m}$ over $T$.

We would like to analyze the probabilities in case 2. Preferably, (2.a) is much more probable than (2.b). Unfortunately, under (2.a) the distribution of $W_{1,0}$ (defined in Proposition 4.1 below) is only known for certain cases. For example, with local alternatives of the form $\theta_{n}=\theta_{0}+\frac{\Delta}{\sqrt{n}}$, $W_{1,0}(d_{1}-d_{0})$ is known to have a noncentral chi-square distribution with $df=d_{1}-d_{0}$ and the appropriate non-centrality parameter. Since this result requires rather quixotic assumptions to hold, we will not theoretically bound the probabilities for more general cases; instead, we will defer our analysis to a specific example.

We can say more about (2.b) with some assumptions.

Proposition 4.1.

Suppose the models are nested (so that $M_{0}$ is a submodel of $M_{1}$). Define $W_{1,0}:=\frac{2(\mathcal{L}_{1}-\mathcal{L}_{0})}{d_{1}-d_{0}}$. Under $M_{0}$, $T=0$ and $T^{\prime}_{m}=1$ with probability

Pr_{M_{0}}\left(\log(m)<W_{1,0}<\log(n)\right)\rightarrow F(\log(n))-F(\log(m)), (1)

where $F(\cdot)$ is the distribution function of an $F$ distribution with $df=(d_{1}-d_{0},\infty)$.

Proof.

We can do some algebra on (ii) and (iii) to obtain the equivalent condition that

(d_{1}-d_{0})\log(m)<2(\mathcal{L}_{1}-\mathcal{L}_{0})<(d_{1}-d_{0})\log(n).

If the models are nested (so that $M_{0}$ is a submodel of $M_{1}$), then the numerator of $W_{1,0}$ converges in law to a $\chi^{2}$ distribution with $df=d_{1}-d_{0}$ by Wilks' theorem. In this case, we note that by Slutsky's theorem, $W_{1,0}$ asymptotically has the distribution of an $F$-statistic with $df=(d_{1}-d_{0},\infty)$.

Hence, under the nested models assumption, the probability that (ii) and (iii) hold when they should not (i.e., choose $M_{1}$ when $M_{0}$ is the truth) is given by equation (1). ∎
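Since the $F(d_{1}-d_{0},\infty)$ distribution function evaluated at $x$ equals the $\chi^{2}_{d_{1}-d_{0}}$ distribution function evaluated at $(d_{1}-d_{0})x$, the limiting probability in equation (1) can be computed directly. A minimal R sketch, with illustrative values of $n$, $m$, $d_{1}$, and $d_{0}$:

```r
# Limiting probability of equation (1): F(log n) - F(log m), evaluated through
# the equivalent central chi-square with d1 - d0 degrees of freedom.
prob_flip <- function(n, m, d1, d0) {
  df <- d1 - d0
  pchisq(df * log(n), df) - pchisq(df * log(m), df)
}

prob_flip(n = 1000, m = exp(2), d1 = 200, d0 = 190)   # exp(2) is the AIC-style penalty
```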

4.1 Illustrative Example

We now present an analysis for specific null and alternative models where we can explicitly calculate the probabilities of cases (2.a) and (2.b). The purpose is to demonstrate that for finite $n$, the choice of penalty term is nontrivial.

Let $d,d_{0}\in\mathbb{N}$ with $d_{0}<d$. Consider the nested models:

  • $M_{1}=\{N(\mu,I):\mu\in\mathbb{R}^{d}\}$

  • $M_{0,d_{0}}=\{N(\mu^{\prime},I):\mu^{\prime}\in\mathbb{R}^{d},\ \mu^{\prime}_{j}=0\mbox{ for }j=d_{0}+1,d_{0}+2,\dots,d\},$

where $I$ is the $d$-dimensional identity matrix.

Let

\mu^{*}=\left[\begin{matrix}1&\frac{1}{\sqrt{2}}&\frac{1}{\sqrt{3}}&\dots&\frac{1}{\sqrt{d}}\end{matrix}\right]^{T}
\mu^{*}_{0,d_{0}}=\left[\begin{matrix}1&\frac{1}{\sqrt{2}}&\frac{1}{\sqrt{3}}&\dots&\frac{1}{\sqrt{d_{0}}}&0&0&\dots&0\end{matrix}\right]^{T}.

Then, $f_{1}:=N(\mu^{*},I)\in M_{1}$ and $f_{0,d_{0}}:=N(\mu^{*}_{0,d_{0}},I)\in M_{0,d_{0}}$. Suppose that $X_{1},X_{2},\dots,X_{n}\overset{\text{i.i.d.}}{\sim}f_{1}$ or $f_{0,d_{0}}$ according to whether or not $M_{1}$ is the true model.

Fix a realization of the data $(x_{1},x_{2},\dots,x_{n})$. Let $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}$ and let $\bar{x}_{0,d_{0}}$ be $\bar{x}$ with the coordinates after $d_{0}$ set to 0. Twice the maximized loglikelihood of the data under $M_{1}$ is (up to constants)

2\mathcal{L}_{1}:=-\sum_{i=1}^{n}(x_{i}-\bar{x})^{T}(x_{i}-\bar{x}).

Under $M_{0,d_{0}}$ it is

2\mathcal{L}_{0,d_{0}}:=-\sum_{i=1}^{n}(x_{i}-\bar{x}_{0,d_{0}})^{T}(x_{i}-\bar{x}_{0,d_{0}})=2\mathcal{L}_{1}-n\|\bar{x}-\bar{x}_{0,d_{0}}\|_{2}^{2},

since $2\sum_{i=1}^{n}(x_{i}-\bar{x})^{T}(\bar{x}-\bar{x}_{0,d_{0}})=0$. Thus we find that $\mathcal{L}_{0,d_{0}}<\mathcal{L}_{1}$.

  2a) Under $f_{1}$,

    n\|\bar{x}-\bar{x}_{0,d_{0}}\|_{2}^{2}=\sum_{j=d_{0}+1}^{d}(\sqrt{n}\bar{x}_{j})^{2}:=Y_{1},

    where $Y_{1}\sim$ noncentral $\chi^{2}_{df=(d-d_{0})}\left(n\sum_{j=d_{0}+1}^{d}\frac{1}{j}\right)$. Hence, for any $m>0$,

    Pr_{f_{1}}\left(2\mathcal{L}_{1}-d\log(m)<2\mathcal{L}_{0,d_{0}}-d_{0}\log(m)\right)=Pr\left(Y_{1}<(d-d_{0})\log(m)\right).

    Therefore,

    Pr_{f_{1}}\left(T=0\ \text{and}\ T^{\prime}_{m}=1\right)=Pr\left((d-d_{0})\log(m)<Y_{1}<(d-d_{0})\log(n)\right).

  2b) Under $f_{0,d_{0}}$,

    n\|\bar{x}-\bar{x}_{0,d_{0}}\|_{2}^{2}=Y_{d_{0}}, (2)

    where $Y_{d_{0}}\sim\chi^{2}_{df=(d-d_{0})}$. Thus,

    Pr_{f_{0,d_{0}}}\left(2\mathcal{L}_{1}-d\log(m)<2\mathcal{L}_{0,d_{0}}-d_{0}\log(m)\right)=Pr\left(Y_{d_{0}}<(d-d_{0})\log(m)\right).

    Hence,

    Pr_{f_{0,d_{0}}}\left(T=0\mbox{ and }T^{\prime}_{m}=1\right)=Pr\left((d-d_{0})\log(m)<Y_{d_{0}}<(d-d_{0})\log(n)\right).

Clearly, we can vary $n$, $m$, $d$, and $d_{0}$ to influence these probabilities. We will now do so to show how (a) some penalty is better than none; (b) the AIC penalty can result in mistakes; and (c) the optimal penalty depends on $n$, $m$, $d$, and $d_{0}$.
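A minimal R sketch of these two probabilities, evaluated at one illustrative setting of $(n,m,d,d_{0})$:

```r
# Probabilities of the events in (2a) and (2b): a noncentral chi-square under
# f_1 and a central chi-square under f_{0,d_0}, both with d - d_0 degrees of freedom.
case_probs <- function(n, m, d, d0) {
  df  <- d - d0
  ncp <- n * sum(1 / ((d0 + 1):d))
  lo  <- df * log(m); hi <- df * log(n)
  c(p_2a = pchisq(hi, df, ncp = ncp) - pchisq(lo, df, ncp = ncp),
    p_2b = pchisq(hi, df) - pchisq(lo, df))
}

case_probs(n = 1000, m = exp(2), d = 200, d0 = 190)
```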

Figure 1 depicts the probability of (2.a) and (2.b) when $n$ and $m$ vary for fixed $d=200$ and $d_{0}=190$ under the alternative (a) and null (b) hypotheses. Penalties lower than the one the BIC would use generally result in better decisions under $f_{1}$. However, with the penalty of $\exp(2)$, that of the AIC, there is approximately a $3\%$ chance of error under $f_{0}$ in situations where the BIC would be correct.

Figure 1: Probabilities of correctly choosing the alternative over the null when the BIC would not do so. For all curves, $d=200$ and $d_{0}=190$. Note that $7.389\ldots=\exp(2)$, the AIC penalty.

Figure 2 depicts the probability of (2.a) and (2.b) when $d$ and $m$ vary for fixed $n=1000$ and $d_{0}=d-10$ under the alternative (a) and null (b) hypotheses. Penalties lower than the one the BIC would use generally result in better decisions under $f_{1}$ without making too many new mistakes under $f_{0}$.

Figure 2: Probabilities of correctly choosing the alternative over the null when the BIC would not do so. For all curves, $n=1000$ and $d_{0}=d-10$.

Figure 3 depicts the probability of (2.a) and (2.b) when $d_{0}$ and $m$ vary for fixed $n=1000$ and $d=200$ under the alternative (a) and null (b) hypotheses. Penalties lower than the one the BIC would use generally result in better decisions under $f_{1}$ and not much worse decisions under $f_{0}$. However, for smaller $d_{0}$ there is a larger chance of error under $f_{0}$ with the AIC penalty in situations where the BIC would be correct.

Figure 3: Probabilities of correctly choosing the alternative over the null when the BIC would not do so. For all curves, $n=1000$ and $d=200$.

This example may seem overly simplistic. Indeed, it does not involve semi-supervised clustering at all. However, it does demonstrate the complexities of penalization in model selection with finite samples.

4.2 Simulation Study

We would like to demonstrate the impact of the model selection complexities discussed in the illustrative example on the inference task of semi-supervised clustering. To that end, consider the following procedure for constructing a dataset $\mathcal{X}\subset\mathbb{R}^{2}$, semi-supervising it, and clustering it using model-based clustering with different penalties. We will then compare the resulting ARI scores.

Let

\mu_{1}=\mu_{2}=\begin{bmatrix}0\\ 0\end{bmatrix}\mbox{ and }\mu_{3}=\begin{bmatrix}2\\ 2\end{bmatrix}.

Also, let

\Sigma_{1}=\begin{bmatrix}.5&.35\\ .35&.5\end{bmatrix}

and

\Sigma_{2}=\begin{bmatrix}.5&-.35\\ -.35&.5\end{bmatrix}.

Consider the following procedure for generating one Monte Carlo replicate; a rough R sketch of these steps appears after the list.

  1. Define the pdf of each datum as being from a mixture model $f(X)=\sum_{k=1}^{3}\pi_{k}f_{k}(X)$, where $f_{k}=N(\mu_{k},\Sigma_{k})$ and $6\Sigma_{3}$ is generated using the onion method of [11].

  2. Draw $n^{S}=100$ supervised observations from the first two components.

  3. Draw $n^{U}=5,10,20,40,80,\dots,640$ unsupervised observations from the full mixture model.

  4. Cluster using $2$-$5$ Gaussians with various constraints on the proposed covariance matrices.

  5. Choose the models for clustering based on different values of $m$ in the penalty term $d\log(m)$, where $m\in[n^{U},n^{U}+n^{S}]$.

  6. Calculate the resulting ARIs using the chosen models.
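A rough R sketch of steps 1-3 is given below. The mixing weights, the 50/50 split of the supervised sample between the two labeled components, and the use of clusterGeneration::rcorrmatrix as a stand-in for the onion method of [11] are assumptions of this sketch, not specifications from the procedure above.

```r
library(mvtnorm)
library(clusterGeneration)

set.seed(3)
mu    <- list(c(0, 0), c(0, 0), c(2, 2))
Sigma <- list(matrix(c(.5, .35, .35, .5), 2),
              matrix(c(.5, -.35, -.35, .5), 2),
              rcorrmatrix(2) / 6)         # so that 6 * Sigma_3 is a random correlation matrix

n_s <- 100; n_u <- 40
X_sup <- rbind(rmvnorm(n_s / 2, mu[[1]], Sigma[[1]]),   # labeled draws, components 1-2 only
               rmvnorm(n_s / 2, mu[[2]], Sigma[[2]]))
z_sup <- rep(1:2, each = n_s / 2)
z_u   <- sample(1:3, n_u, replace = TRUE)               # unlabeled draws, full mixture
X_un  <- t(sapply(z_u, function(k) rmvnorm(1, mu[[k]], Sigma[[k]])))

# Steps 4-6 would then fit candidate models, compare penalties d*log(m) for
# m in [n_u, n_u + n_s], and score each chosen model with the ARI.
```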

Figures 4 and 5 show how different penalties can affect the overall clustering performance on the unlabeled data. Generally, we see that with more unsupervised data, there are fewer differences among the different choices of penalty terms in the interval $[n^{U},n^{U}+n^{S}]$, which makes sense given the asymptotic results. In this particular example, less penalization allowed us to detect the third component sooner, improving the ARI values in the resulting clustering. Thus, we have shown that in an example closer to our problem of interest, the restricted penalization derived in the previous section can be efficacious as compared to the standard penalty. Additionally, as $n$ grows, the differences between information criteria that depend on $n$ to the same order become negligible even for relatively small values of $n$.

Figure 4: ARI vs penalty in $BIC^{\prime}(m)$ for (a) $n=105$, (b) $n=110$, (c) $n=125$, and (d) $n=150$. Higher is better. Red circles represent significant paired Wilcoxon tests for larger ARI values vs the standard BIC (at the .05 level). The area to the right of the green line represents the interval $[n^{U},n^{U}+n^{S}]$.
Figure 5: ARI vs penalty in $BIC^{\prime}(m)$ for the largest value of the total dataset simulated. Higher is better. Red circles represent significant paired Wilcoxon tests for larger ARI values vs the standard BIC (at the .05 level). The area to the right of the green line represents the interval $[n^{U},n^{U}+n^{S}]$.

5 Identifying Fly Behaviotypes

In one of the motivating applications for this work, classes of neurons in Drosophila larvae are controlled using optogenetics (cf. [1] regarding optogenetics). In [20], the authors observe the reactions of the affected larvae to stimuli in high-throughput behavioral assays. The goal is to determine which classes of neurons cause similar changes in behavior when deactivated.

We initially collected data on $n=37780$ larvae grouped into $b=2062$ dishes. By changing the optogenetic procedure, $\ell=11$ known lines are created: pbd1, 38a1, 61d0, ppk1, 11f0, pbd2, 38a2, pbd3, ppk2, iav1, and 20c0. Of these lines, we discarded the larvae in pbd3 and ppk2 because they had fewer than 40 larvae each. Further, we discarded all larvae with an unknown line. After curating the data, we have $n=7730$ larvae. Each larva is observed while responding to various stimuli. Vogelstein et al. [20] expand on the methods of [16] and [17] and describe how the observations are embedded into $\mathbb{R}^{d}$, where here $d=30$. We use the method presented in [24] to select the elbow of the scree plot to further reduce the data to $d=14$ dimensions. We did not perform any additional feature selection.

For each Monte Carlo replicate, we use a small subset of the data where the line was known (101 randomly chosen animals from each of the 9 remaining lines) along with $m=0,1,2,\dots,8$ pre-labeled data randomly chosen from the 101 animals in each line. We wrote an R package entitled ssClust to perform semi-supervised GMM clustering with options similar to the popular Mclust software, an R package for GMMs [10]. We cluster the points using both ssClust and Mclust. Then, we compute the ARI against the line type for both methods.
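The evaluation itself is conceptually simple; the sketch below shows the Mclust half of it on a toy stand-in for the 14-dimensional embedding (the real data and the ssClust call are omitted here, and the stand-in data and line labels are invented).

```r
library(mclust)

set.seed(4)
# toy stand-in for the curated embedding: two well-separated "lines" in 2-d
embedding <- rbind(matrix(rnorm(100 * 2), ncol = 2),
                   matrix(rnorm(100 * 2, mean = 3), ncol = 2))
line <- rep(c("pbd1", "ppk1"), each = 100)

fit <- Mclust(embedding, G = 2:12)             # unsupervised model-based clustering
adjustedRandIndex(fit$classification, line)    # ARI against the known lines
```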

We observe that the initialization strategy of using ss-k-means++ instead of hierarchical clustering results in a significant improvement to the ARI even with no supervision. We find that a single labeled animal per line significantly improves the clustering results, as expected (cf. Figure 6). Further, we can see that there are diminishing returns on additional supervision starting at 3 supervised examples per line.

Figure 6: Average value of the ARI for 500 Monte Carlo replicates of the experiment in Section 5. Error bars are $\pm 2$ standard errors.

5.1 Differentiating Lines

Vogelstein et al. [20] posed and answered questions of the form, “Is line X different than line Y (in terms of behaviotypes)?” As an illustrative example, we will perform a similar analysis comparing the ppk1 and pbd1 lines but will additionally incorporate some supervision. Here, our interpretation of the clusters will shift from lines to behaviotypes. Our proposed procedure for distinguishing between lines is as follows:

  (1) Sample $101$ animals from each line.

  (2) Label $3$ animals from each line according to a labeling strategy (see below).

  (3) Cluster the animals using ssClust and Mclust into between 2 and 12 clusters using the parameterizations EEE, VVV, VII, and EII.

  (4) Collect the results and construct the empirical probability of cluster membership for each line.

  (5) Compute the Hellinger distance between the two lines to be compared and store this as the statistic $H$.

  (6) Simulate the distribution of $H$ under the null hypothesis that the lines are the same by permuting the labels and computing the Hellinger distance $H_{i}$ for $i=1,2,\dots,B$ for some large integer $B$.

  (7) Return an empirical p-value based on steps (5) and (6).

In item (2) we did not specify a labeling strategy in detail. We propose a strategy that is reasonably realistic to execute for our particular dataset. Vogelstein et al. [20] used a hierarchical clustering scheme in which the first few layers were visually identifiable by watching the worms. Thus, by using their labels from an early layer (layer 2, with 4 clusters total), we have a plausible level of supervision for a human to have performed. Specifically, we sample at random a label from the true labels within a line, with weights proportional to the counts of each label in that line. Using that label, we sample 3 worms with that label in that line to be the supervised examples. Next, for the other line, we sample a different label and 3 examples with that true label.
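A minimal R sketch of steps (4)-(7), the Hellinger distance between the per-line cluster-membership distributions and its permutation null, is given below; the cluster assignments and line labels in the usage example are made up for illustration.

```r
# `cluster` and `line` stand in for the output of step (3) and the line labels.
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)

line_dist <- function(cluster, line, K = max(cluster)) {
  p <- tabulate(cluster[line == unique(line)[1]], K)
  q <- tabulate(cluster[line == unique(line)[2]], K)
  hellinger(p / sum(p), q / sum(q))
}

perm_pvalue <- function(cluster, line, B = 10000) {
  H  <- line_dist(cluster, line)
  Hb <- replicate(B, line_dist(cluster, sample(line)))   # permute the line labels
  mean(Hb >= H)                                          # empirical p-value
}

# made-up usage: two lines of 101 animals each with invented cluster assignments
set.seed(5)
cluster <- c(sample(1:4, 101, TRUE, prob = c(.4, .3, .2, .1)),
             sample(1:4, 101, TRUE, prob = c(.1, .2, .3, .4)))
line <- rep(c("ppk1", "pbd1"), each = 101)
perm_pvalue(cluster, line, B = 2000)
```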

We see that, based on all three labeling strategies, both ssClust and Mclust are able to corroborate the results from [20] even with small amounts of data: ppk1 and pbd1 are statistically different (p-value $\approx$ 1e-4 for both, over 500 MC replicates).

We now show that ssClust can answer these questions “sooner” than Mclust. That is, the p-value ($pVal$) will be below the significance level with fewer unsupervised examples. To quantify this concept, we introduce the “answering time” for algorithm $A$:

\tau_{A}:=\min_{n\in\mathbb{N}}\Big\{n:2\leq n\leq 30\mbox{ and }\prod_{q=n}^{30}\mathbbm{1}\{pVal\left(\mathcal{D}_{q}\right)\leq\alpha\}=1\Big\},

where here we use the notation

\mathcal{D}_{q}:=\{X_{1,k},\dots,X_{q,k},(X_{99,k},Y_{99,k}),\dots,(X_{101,k},Y_{101,k}):k\in\{1,2\}\}

to be the labeled and unlabeled data.

The p-value constraint bears some explanation; it says that we require a significant p-value for all datasets at least as large as the one with $n$ unsupervised examples per line. Note that here we assume our datasets are nested and that they all use the same supervised examples. Since the answering time will be dependent on the datasets used, we perform a Monte Carlo simulation with 500 random sequences of datasets and report the answering times for both ssClust and Mclust (cf. Figure 7). The median answering time for ssClust is significantly lower than for Mclust (p-value = 4.7e-12 for a paired Wilcoxon signed-rank test).
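A small R sketch of how the answering time could be computed from a sequence of p-values indexed by $q=2,\dots,30$ (the p-values in the usage example are made up):

```r
# Answering time: the smallest q (between 2 and 30 unlabeled animals per line)
# such that the p-value is significant for that q and every larger q.
answering_time <- function(pvals, alpha = 0.05, q_range = 2:30) {
  sig <- pvals <= alpha
  ok  <- rev(cumprod(rev(sig))) == 1       # significant from this q onward
  if (any(ok)) q_range[which(ok)[1]] else NA_integer_
}

# made-up p-values that become (and stay) significant at q = 12
set.seed(6)
pvals <- c(runif(10, 0.06, 0.5), runif(19, 0, 0.04))
answering_time(pvals)
```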

Figure 7: Frequency distribution of answering times for the experiment in Section 5.1, over 500 MC replicates.

6 Conclusion

Previously explored approaches to semi-supervised clustering include a modified K-means, which can be seen as our algorithm constrained to spherical and identical covariance matrices without the adjusted BIC for model selection (cf. [2]). Others have used latent variables representing clusters and cluster-dependent distributions to give a probabilistic approach to a slightly different problem, where instead of labels being known, only pairwise constraints in the form of must-link and cannot-link are given (cf. [3]). [21] applied an appropriately modified K-means to this problem. Finally, [7] and [23] offer a different approach to semi-supervised clustering that involves training a distance measure to account for constraints or labels.

The main contribution of this paper over previous works is presenting a probabilistic framework for semi-supervised clustering and deriving an information criterion consistent with that framework to allow selection of the number of clusters and/or the clustering variables. With the corrected BIC, we found that our method outperformed the most comparable method, Mclust, in the simulated examples and on the fly larvae dataset. In the fly dataset, ssClust was able to recover lines better and more often yielded a lower answering time for the question of whether two particular lines were behaviorally different. This indicates that incorporating even a small amount of label information can help guide the clustering process.

Future areas of research will involve handling the more nuanced must-link and cannot-link constraints, which are flexible enough to encompass the labels-known problem we have explored in this paper.

7 Appendix

A Derivation of the BIC for the Semi-supervised Model
Assume the data are distributed according to a member of model $M$ described in Section 3.

Consider the integrated likelihood:

P(D|M)=\int P(D|\theta,M)P(\theta|M)\,d\theta, (3)

where $P(\cdot)$ is a probability, pmf, or pdf where appropriate. Let

Q(\theta)=\log\left(P(D|\theta,M)P(\theta|M)\right),

the log of the posterior likelihood. Suppose that the posterior mode exists, say $\bar{\theta}$.

A second-order Taylor expansion about $\bar{\theta}$ gives

Q(\theta)=Q(\bar{\theta})+(\theta-\bar{\theta})^{T}\nabla Q(\bar{\theta})+\frac{1}{2}(\theta-\bar{\theta})^{T}\nabla^{2}Q(\bar{\theta})(\theta-\bar{\theta})+o(\|\theta-\bar{\theta}\|_{2}^{2}).

By the first-order necessary optimality conditions, we know $\nabla Q(\bar{\theta})=0$. Note that $o(\|\theta-\bar{\theta}\|_{2}^{2})\rightarrow 0$ faster than $\|\theta-\bar{\theta}\|_{2}^{2}$ as $\theta\rightarrow\bar{\theta}$. By ignoring the last term, we will approximate $Q(\theta)$ with the truncated Taylor expansion:

Q(\theta)\approx Q(\bar{\theta})+\frac{1}{2}(\theta-\bar{\theta})^{T}\nabla^{2}Q(\bar{\theta})(\theta-\bar{\theta}).

Recalling (3), we may approximate $P(D|M)$ using a saddle point approximation:

P(D|M) = \int\exp\left(\log(P(D|\theta,M)P(\theta|M))\right)d\theta
 = \int\exp(Q(\theta))\,d\theta
 \approx \int\exp\left(Q(\bar{\theta})+\frac{1}{2}(\theta-\bar{\theta})^{T}\nabla^{2}Q(\bar{\theta})(\theta-\bar{\theta})\right)d\theta
 = \exp\left(Q(\bar{\theta})\right)\int\exp\left(\frac{1}{2}(\theta-\bar{\theta})^{T}\nabla^{2}Q(\bar{\theta})(\theta-\bar{\theta})\right)d\theta
 = \exp\left(Q(\bar{\theta})\right)\int\exp\left(-\frac{1}{2}(\theta-\bar{\theta})^{T}(-\nabla^{2}Q(\bar{\theta}))(\theta-\bar{\theta})\right)d\theta.

Recognize the integrand as proportional to the density of a multivariate Gaussian with mean $\bar{\theta}$ and covariance $(-\nabla^{2}Q(\bar{\theta}))^{-1}$. Let $H=-\nabla^{2}Q(\bar{\theta})$. Then, we have

P(D|M) \approx \exp\left(Q(\bar{\theta})\right)(2\pi)^{\frac{d}{2}}\det(H^{-1})^{\frac{1}{2}}, (4)

where $d$ is the number of free parameters in $\theta$. If the conditions in Proposition 3.4.3 of [4] hold, we have, for $n$ sufficiently large, $H\approx I(\bar{\theta})$, where $I(\bar{\theta})$ is the Fisher information matrix.

Taking $2\log(\cdot)$ of both sides, (4) becomes

2\log(P(D|M)) \approx 2Q(\bar{\theta})+d\log(2\pi)+\log(\det(I(\bar{\theta})^{-1}))
 = 2\log(P(D|\bar{\theta},M))+2\log(P(\bar{\theta}|M))+d\log(2\pi)-\log(\det(I(\bar{\theta}))).

Now, we must calculate $-\log(\det(I(\bar{\theta})))$. Define $n\equiv\sum_{i=1}^{C}n_{i}$. Observe

I(\theta) = \text{Var}\left[\frac{\partial}{\partial\theta}\log p(X,\theta)\right]
 = \text{Var}\left[\sum_{i=1}^{n}\frac{\partial}{\partial\theta}\log p(X_{i},\theta)\right]
 = \sum_{i=1}^{C}n_{i}I(\theta_{i})\quad\text{by independence.}

Henceforth, we will use $I_{j}$ to denote $I(\theta_{j})$. We will assume that $I_{1}$ is positive definite, so that it can be written as

I_{1}=S_{1}^{T}S_{1}

for some non-singular matrix $S_{1}$. Let $J$ denote the $d$-dimensional identity matrix. For notational purposes, let

n^{*}:=\max_{2\leq j\leq C}(n_{j}).

Observe

\text{det}\left(\sum_{i=1}^{C}n_{i}I(\theta_{i})\right) = \text{det}\left(S_{1}^{T}\left(n_{1}J+\left(S_{1}^{T}\right)^{-1}\left(\sum_{i=2}^{C}n_{i}I(\theta_{i})\right)S_{1}^{-1}\right)S_{1}\right)
 = \text{det}(I_{1})\,\text{det}\left(n_{1}J+n^{*}\left(S_{1}^{T}\right)^{-1}\left(\sum_{i=2}^{C}\frac{n_{i}}{n^{*}}I(\theta_{i})\right)S_{1}^{-1}\right)
 = \text{det}(I_{1})\,n_{1}^{d}\,\text{det}\left(J+\frac{n^{*}}{n_{1}}\left(S_{1}^{T}\right)^{-1}\left(\sum_{i=2}^{C}\frac{n_{i}}{n^{*}}I(\theta_{i})\right)S_{1}^{-1}\right)
 = \text{det}(I_{1})\,n_{1}^{d}\,\text{det}\left(J+\frac{n^{*}}{n_{1}}B\right),

where $B=\left(S_{1}^{T}\right)^{-1}\left(\sum_{i=2}^{C}\frac{n_{i}}{n^{*}}I(\theta_{i})\right)S_{1}^{-1}$. It should be noted that $B$ is a positive semi-definite matrix, as it is a congruence transform of a sum of positive semi-definite matrices, so that Sylvester's law of inertia implies that $B$ has the same inertia as a positive semi-definite matrix.

Next, we would like to describe the growth of $\text{det}\left(J+\frac{n^{*}}{n_{1}}B\right)$. For any eigenvalue of $J+\frac{n^{*}}{n_{1}}B$, say $\lambda$, we have by Weyl's theorem

1\leq\lambda\leq 1+\frac{n^{*}}{n_{1}}\|B\|_{2}.

Observe

\|B\|_{2} = \max_{\{x\in\mathbb{R}^{d}:\|x\|_{2}=1\}}x^{T}Bx
 \leq \|S_{1}^{-1}\|_{2}^{2}\,\max_{\{x\in\mathbb{R}^{d}:\|x\|_{2}=1\}}x^{T}\left(\sum_{j=2}^{C}\frac{n_{j}}{n^{*}}I_{j}\right)x
 \leq \|S_{1}^{-1}\|_{2}^{2}\left(\sum_{j=2}^{C}\frac{n_{j}}{n^{*}}\|I_{j}\|_{2}\right)
 \leq \|S_{1}^{-1}\|_{2}^{2}\left(\sum_{j=2}^{C}\|I_{j}\|_{2}\right),

which is independent of $n$. Let $M_{2}\equiv\|S_{1}^{-1}\|_{2}^{2}\left(\sum_{j=2}^{C}\|I_{j}\|_{2}\right)$. Then, we have

\text{det}\left(J+\frac{n^{*}}{n_{1}}B\right) = \prod_{m=1}^{d}\lambda_{m}\left(J+\frac{n^{*}}{n_{1}}B\right)
 \leq \left(1+\frac{n^{*}}{n_{1}}M_{2}\right)^{d}
 \leq \left(1+\frac{n-n_{1}}{n_{1}}M_{2}\right)^{d}.

The only term growing with $n$ above is $\frac{n-n_{1}}{n_{1}}$, the ratio of the supervised data over the unsupervised data. In general, it is usually much more expensive to obtain additional supervised data than unsupervised; thus, we find it reasonable to posit that $\frac{n-n_{1}}{n_{1}}\rightarrow 0$ in $n$ (i.e., is $o(1)$). In this case, $\log\left(\text{det}\left(J+\frac{n^{*}}{n_{1}}B\right)\right)$ is $o(1)$ by continuity and our bounds.

Hence,

\log\left(\text{det}(I(\theta))\right) = \log\left(\text{det}(I_{1})\right)+d\log(n_{1})+\log\left(\text{det}\left(J+\frac{n^{*}}{n_{1}}B\right)\right) (5)
 = \log\left(\text{det}(I_{1})\right)+d\log(n_{1})+o(1). (6)

When the posterior mode is nearly or exactly equal to the MLE, as is the case when the prior on $\theta$ is uniform and $\Theta$ is finite (cf. Bickel and Doksum [4], p. 114), substitute $\hat{\theta}$, the MLE, for $\bar{\theta}$, the posterior mode. Then, we have

2\log(P(D|M)) \approx 2\log(P(D|\hat{\theta},M))+2\log(P(\hat{\theta}|M))+d\log(2\pi)-\log(\det(I(\hat{\theta})))
 = 2\log(P(D|\hat{\theta},M))-d\log(n_{1})+o(1)+O(1)\text{ using equation }(6).

Note that the $o(1)$ terms are washed out in the limit as $n\rightarrow\infty$, and the $O(1)$ terms are dropped just as in the standard BIC derivation. Dropping them, we have our derivation of the adjusted BIC. ∎

References

  • opt [2011] Method of the year 2010. Nat Meth, 8(1):1–1, 01 2011. URL http://dx.doi.org/10.1038/nmeth.f.321.
  • Basu et al. [2002] Sugato Basu, Arindam Banerjee, and Raymond Mooney. Semi-supervised clustering by seeding. In Proceedings of 19th International Conference on Machine Learning. Citeseer, 2002.
  • Basu et al. [2004] Sugato Basu, Mikhail Bilenko, and Raymond J Mooney. A probabilistic framework for semi-supervised clustering. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 59–68. ACM, 2004.
  • Bickel and Doksum [2001] Peter J. Bickel and Kjell A. Doksum. Mathematical Statistics, volume I. Prentice-Hall, Inc., 2001.
  • Boekee and Buss [1981] DE Boekee and HH Buss. Order estimation of autoregressive models. In Proceedings of the 4th Aachen colloquium: Theory and application of signal processing, pages 126–130, 1981.
  • Celeux and Govaert [1995] Gilles Celeux and Gérard Govaert. Gaussian parsimonious clustering models. Pattern recognition, 28(5):781–793, 1995.
  • Cohn et al. [2003] David Cohn, Rich Caruana, and Andrew McCallum. Semi-supervised clustering with user feedback. Constrained Clustering: Advances in Algorithms, Theory, and Applications, 4(1):17–32, 2003.
  • Dempster et al. [1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38, 1977.
  • Fraley and Raftery [2002] Chris Fraley and Adrian E Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611–631, 2002.
  • Fraley et al. [2012] Chris Fraley, Adrian E. Raftery, T. Brendan Murphy, and Luca Scrucca. mclust version 4 for R: normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report 597, University of Washington, Department of Statistics, June 2012.
  • Joe [2006] Harry Joe. Generating random correlation matrices based on partial correlations. Journal of Multivariate Analysis, 97(10):2177–2189, 2006.
  • Maugis et al. [2009a] Cathy Maugis, Gilles Celeux, and M-L Martin-Magniette. Variable selection in model-based clustering: A general variable role modeling. Computational Statistics & Data Analysis, 53(11):3872–3882, 2009a.
  • Maugis et al. [2009b] Cathy Maugis, Gilles Celeux, and Marie-Laure Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009b.
  • McLachlan and Peel [2004] Geoffrey McLachlan and David Peel. Finite mixture models. John Wiley & Sons, New York, 2004.
  • Murphy et al. [2010] Thomas Brendan Murphy, Nema Dean, and Adrian E Raftery. Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications. The annals of applied statistics, 4(1):396, 2010.
  • Priebe [2001] Carey E. Priebe. Olfactory classification via interpoint distance analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4):404–413, 2001.
  • Priebe et al. [2004] Carey E Priebe, David J Marchette, and Dennis M Healy Jr. Integrated sensing and processing decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):699–708, 2004.
  • Raftery and Dean [2006] Adrian E Raftery and Nema Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.
  • Shental et al. [2003] Noam Shental, Aharon Bar-Hillel, Tomer Hertz, and Daphna Weinshall. Computing Gaussian mixture models with EM using side-information. In Proc. of workshop on the continuum from labeled to unlabeled data in machine learning and data mining, 2003.
  • Vogelstein et al. [2014] Joshua T Vogelstein, Youngser Park, Tomoko Ohyama, Rex A Kerr, James W Truman, Carey E Priebe, and Marta Zlatic. Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science, 344(6182):386–392, 2014.
  • Wagstaff et al. [2001] Kiri Wagstaff, Claire Cardie, Seth Rogers, Stefan Schrödl, et al. Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning, volume 1, pages 577–584, 2001.
  • Witten and Tibshirani [2010] Daniela M Witten and Robert Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490), 2010.
  • Xing et al. [2002] Eric P Xing, Michael I Jordan, Stuart Russell, and Andrew Y Ng. Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems, pages 505–512, 2002.
  • Zhu and Ghodsi [2006] Mu Zhu and Ali Ghodsi. Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis, 51(2):918–930, 2006.