
Gibbs sampling for mixtures in order of appearance: the ordered allocation sampler

Pierpaolo De Blasi
Collegio Carlo Alberto and ESOMAS Department,
University of Torino, Italy
pierpaolo.deblasi@unito.it
   María F. Gil–Leyva
Department of Probability and Statistics, IIMAS–UNAM, México
marifer@sigma.iimas.unam.mx
Abstract

Gibbs sampling methods are standard tools to perform posterior inference for mixture models. These have been broadly classified into two categories: marginal and conditional methods. While conditional samplers are more widely applicable than marginal ones, they may suffer from slow mixing in infinite mixtures, where some form of truncation, either deterministic or random, is required. In mixtures with random number of components, the exploration of parameter spaces of different dimensions can also be challenging. We tackle these issues by expressing the mixture components in the random order of appearance in an exchangeable sequence directed by the mixing distribution. We derive a sampler that is straightforward to implement for mixing distributions with tractable size-biased ordered weights, and that can be readily adapted to mixture models for which marginal samplers are not available. In infinite mixtures, no form of truncation is necessary. As for finite mixtures with random dimension, a simple updating of the number of components is obtained by a blocking argument, thus, easing challenges found in trans-dimensional moves via Metropolis-Hastings steps. Additionally, sampling occurs in the space of ordered partitions with blocks labelled in the least element order, which endows the sampler with good mixing properties. The performance of the proposed algorithm is evaluated in a simulation study.

Keywords: Dirichlet process; Pitman-Yor process; size-biased permutations; stick-breaking construction; species sampling models.

1 Introduction

Mixture models represent one of the most successful applications of Bayesian methods. Bayesian inference proceeds by placing a prior on the mixing distribution, whose atoms and their sizes represent the mixture component parameters and the weights, respectively. An important issue is that the number of components is rarely known in advance. The nonparametric approach consists in modelling the mixing distribution with infinitely many support points and inferring the number of components through the number of groups observed in the data (e.g. Escobar and West, 1995). Another alternative is to assign a prior to this unknown quantity (cf. Richardson and Green, 1997; Miller and Harrison, 2018). Posterior inference for mixture models is customarily based on Gibbs sampling methods, which have been broadly classified into two categories: marginal and conditional samplers. Marginal methods (Escobar and West, 1995; Neal, 2000) are termed this way because they partially integrate out the mixing distribution and exploit the generalized Pólya urn scheme representation (Blackwell and MacQueen, 1973; Pitman, 2006) of the prediction rule of a sample from the mixing distribution. By doing so, marginal samplers avoid dealing with a potentially infinite or random model dimension. They are well suited for models such as the Dirichlet (Ferguson, 1973; Escobar and West, 1995), the Pitman-Yor (Pitman and Yor, 1997; Ishwaran and James, 2001) and mixtures of finite mixtures (Miller and Harrison, 2018). However, they are challenging to adapt to mixing priors without a tractable prediction rule. Alternatively, one can use conditional methods, which include the mixing distribution and update it as a component of the sampler. While more widely applicable, they bring some issues in their design and implementation. In infinite mixture models, some sort of truncation, either deterministic or random, is necessary to avoid dealing with infinitely many mixture components. Finite dimensional approximations of the mixing prior were proposed by Ishwaran and James (2001). Of the random truncation type are the exact conditional samplers derived by Walker (2007), Kalli et al. (2011) and Papaspiliopoulos and Roberts (2008). It has been observed that they require Metropolis-Hastings steps that swap components' labels so as to speed up mixing (Porteous et al., 2006; Papaspiliopoulos and Roberts, 2008). As for mixtures with a random number of components, the main challenge of conditional samplers is the need to explore parameter spaces of different dimensions. The standard method is the reversible jump MCMC algorithm (Richardson and Green, 1997), but this can be difficult to implement.

In this paper we contribute both methodologically and computationally to mitigate these issues by developing a novel conditional Gibbs sampling method named the ordered allocation sampler. The sampler works with the mixture components in the random order in which they are discovered. To derive it we use in depth the theory of species sampling models set forth in Pitman (1995, 1996a, 1996b) where, in particular, it is established that the law of the weights in order of appearance corresponds to the distribution of the weights that is invariant under size-biased permutations. The latter admits a simple stick-breaking representation for the Dirichlet and the Pitman-Yor processes. Working with this specific rearrangement of mixture components allows us to exploit a (conditional) prediction rule in the sampler. Thus, it bears similarities with marginal methods, such as the fact that mixing takes place in the space of partitions and not in the space of cluster labels, as occurs in other conditional samplers (Porteous et al., 2006). Empirical studies confirm that this endows our sampler with nice mixing properties. A second major advantage of our proposal is that, since n data points cannot be generated from more than n distinct components, the sampler needs to update at most the first n components in order of appearance. This is especially relevant for infinite mixture models as it avoids truncation. In particular, for Pitman-Yor processes with slowly decaying weights, our sampler proves to be very convenient computationally. A third important consequence is that marginalization over the weights in order of appearance yields exactly the exchangeable partition probability function (EPPF; Pitman, 1996b). For mixtures with random dimension, this translates into a simple way of updating the number of components without resorting to a reversible jump step. Finally, as other conditional methods, the ordered allocation sampler allows direct inference on the mixing distribution and can be adapted to a wide range of mixing priors.

The rest of the paper is organized as follows. In Section 2 we provide background theory on species sampling priors in mixture models, which sets the stage for the ordered allocation sampler. In Section 3 we first derive the sampler for models with tractable size-biased permuted weights, and then show how to adapt it when the law of this arrangement of the weights is not available in explicit form. In Section 4 we illustrate the performance of our sampler on well-known real and simulated datasets. Some concluding remarks and discussion points are presented in Section 5. Proofs, technical details and additional illustrations are in the Appendix.

2 Species sampling models

In Bayesian mixture models we model exchangeable data, (y_{i})=(y_{i})_{i=1}^{n}, taking values in a Borel space, (\mathbb{Y},\mathcal{B}(\mathbb{Y})), as conditionally independent and identically distributed (iid) from

Q(y)=\int_{\mathbb{X}}g(y\mid x)P(\mathrm{d}x)=\sum_{j=1}^{m}p_{j}\,g(y\mid x_{j}), (1)

where g(\cdot\mid x) is a density for each x, and the mixing distribution, P=\sum_{j=1}^{m}p_{j}\delta_{x_{j}}, is an almost surely discrete random probability measure over the Borel parameter space (\mathbb{X},\mathcal{B}(\mathbb{X})). Species sampling models, introduced and studied by Pitman (1996b), constitute a very general class of random probability measures that provide a convenient prior specification for the mixing distribution P. In species sampling models, P=\sum_{j=1}^{m}p_{j}\delta_{x_{j}}, the atoms, (x_{j})=(x_{1},\ldots,x_{m}), are iid from a diffuse distribution, \nu, over (\mathbb{X},\mathcal{B}(\mathbb{X})), and are independent of (m,(p_{j})). The weights (p_{j})=(p_{1},\ldots,p_{m}) are positive random variables with \sum_{j=1}^{m}p_{j}=1 almost surely, and the number of support points, m, can be finite, infinite or random. A sequence, (\theta_{i})=(\theta_{i})_{i=1}^{\infty}, is a species sampling sequence driven by P if it is exchangeable and the almost sure limit of the empirical distributions, P=\lim_{n\to\infty}n^{-1}\sum_{i=1}^{n}\delta_{\theta_{i}}, is a species sampling model. By de Finetti's theorem, the latter is equivalent to the existence of a species sampling model, P, such that, given P, (\theta_{i}) are conditionally iid according to P (cf. Theorem 1.1 and Proposition 1.4 of Kallenberg, 2005). The laws of (\theta_{i}) and P determine each other, and both are fully determined by that of (m,(p_{j})) and the diffuse distribution, \nu. A key aspect to note is that the law of P is invariant under permutations of the weights. That is, P=\sum_{j=1}^{m}p_{j}\delta_{x_{j}} is equal in distribution to \sum_{j=1}^{m}p_{\rho(j)}\delta_{x_{j}} for every permutation \rho of \{1,\ldots,m\}, which means that working with one ordering of the weights or another does not change the mixing prior. This is reflected through the so-called exchangeable partition probability function (EPPF), given by

\pi(n_{1},\ldots,n_{k})=\sum_{(j_{1},\ldots,j_{k})}\mathbb{E}\bigg(\prod_{i=1}^{k}p_{j_{i}}^{n_{i}}\bigg), (2)

where the sum ranges over all k-tuples of distinct positive integers, and p_{j}=0 for j>m. In fact, \pi(n_{1},\ldots,n_{k}) describes the probability that a sample, (\theta_{1},\ldots,\theta_{n}), of size n=\sum_{j=1}^{k}n_{j} exhibits exactly k distinct values, with corresponding frequencies, n_{1},\ldots,n_{k} (Pitman, 1996b, 2006). Whenever \pi(n_{1},\ldots,n_{k}) can be computed in closed form, the prediction rule, \mathbb{P}[\theta_{i+1}\in\cdot\mid\theta_{1},\ldots,\theta_{i}], of (\theta_{i}) is available and can be described in terms of a generalized Pólya urn scheme (cf. Blackwell and MacQueen, 1973).

The invariance under permutations of \sum_{j=1}^{m}p_{\rho(j)}\delta_{x_{j}}, as well as the complexity of computing the unordered sum in (2), have motivated the study of weight permutations that simplify the analysis. An ordering of the weights of paramount importance is the size-biased permutation, (\tilde{p}_{j})=(\tilde{p}_{1},\ldots,\tilde{p}_{m}), given by \tilde{p}_{j}=p_{\alpha_{j}}, with (\alpha_{j})=(\alpha_{1},\ldots,\alpha_{m}) defined by

\begin{split}\mathbb{P}[\alpha_{1}=j\mid(p_{j})]&=p_{j},\\ \mathbb{P}[\alpha_{l}=j\mid(p_{j}),\alpha_{1},\ldots,\alpha_{l-1}]&=\frac{p_{j}}{1-\sum_{i=1}^{l-1}p_{\alpha_{i}}}\mathds{1}_{\{j\not\in\{\alpha_{1},\ldots,\alpha_{l-1}\}\}},\quad 2\leq l\leq m.\end{split} (3)

In other words, (\alpha_{j}) and (\tilde{p}_{j}) are sampled without replacement from \{1,\ldots,m\} and (p_{j}), respectively, with probabilities (p_{j}). By construction the distribution of (\tilde{p}_{j}) is invariant under size-biased permutations. As shown by Pitman (1995, 1996a), the EPPF (2) can be computed through

\pi(n_{1},\ldots,n_{k})=\mathbb{E}\bigg[\prod_{j=1}^{k}\tilde{p}_{j}^{n_{j}-1}\prod_{j=1}^{k-1}\bigg(1-\sum_{l=1}^{j}\tilde{p}_{l}\bigg)\bigg], (4)

hence, if the distribution of (\tilde{p}_{j}) is available, it becomes easier to compute \pi(n_{1},\ldots,n_{k}). Another advantage of working with size-biased permutations is that \tilde{p}_{j} coincides with the long-run proportion of indexes i such that \theta_{i}=\tilde{x}_{j}, where \tilde{x}_{j} is the jth distinct value to appear in (\theta_{i}). Furthermore, the conditional law of (\theta_{i}) given (\tilde{x}_{j})=(\tilde{x}_{1},\ldots,\tilde{x}_{m}) and (\tilde{p}_{j}) admits a simple prediction rule, as detailed next.
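For readers who prefer a computational description, the following Python/NumPy sketch (ours, not part of the paper) implements the sequential scheme in (3): indexes are drawn without replacement with probabilities proportional to the remaining weights, and the weights are then reported in the resulting size-biased order.

    import numpy as np

    def size_biased_permutation(p, rng):
        """Sample (alpha_j) and p_tilde_j = p_{alpha_j} following the scheme in (3)."""
        p = np.asarray(p, dtype=float)
        remaining = list(range(len(p)))
        alpha = []
        for _ in range(len(p)):
            w = p[remaining]
            j = int(rng.choice(remaining, p=w / w.sum()))  # unused index, w.p. proportional to its weight
            alpha.append(j)
            remaining.remove(j)
        alpha = np.array(alpha)
        return alpha, p[alpha]

    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(5))                  # e.g. symmetric Dirichlet weights
    alpha, p_tilde = size_biased_permutation(p, rng)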

Theorem 1.

Let P=\sum_{j=1}^{m}p_{j}\delta_{x_{j}} be a species sampling model over the Borel space (\mathbb{X},\mathcal{B}(\mathbb{X})) and let (\theta_{i})=(\theta_{i})_{i=1}^{\infty} be a sequence with values in (\mathbb{X},\mathcal{B}(\mathbb{X})). Define the jth distinct value to appear in (\theta_{i}) through \tilde{x}_{j}=\theta_{M_{j}}, where M_{j}=\min\{i>M_{j-1}:\theta_{i}\not\in\{\tilde{x}_{1},\ldots,\tilde{x}_{j-1}\}\}, for j\geq 2, and M_{1}=1. Then (\theta_{i}) is a species sampling sequence driven by P if and only if the following hold:

  1. i.

    (\theta_{i}) exhibits m distinct values, (\tilde{x}_{j})=(\tilde{x}_{1},\ldots,\tilde{x}_{m}), in order of appearance, and (\tilde{x}_{j}) are iid from \nu. Furthermore, for j\leq m, \tilde{x}_{j}=x_{\alpha_{j}}, where (\alpha_{j}) satisfies (3).

  2. ii.

    The almost sure limits

    \tilde{p}_{j}=\lim_{n\to\infty}\frac{|\{i\leq n:\theta_{i}=\tilde{x}_{j}\}|}{n},\quad j\leq m,

    exist, \tilde{p}_{j}>0, \sum_{j=1}^{m}\tilde{p}_{j}=1 almost surely, and (\tilde{p}_{j})=(\tilde{p}_{1},\ldots,\tilde{p}_{m}) is invariant under size-biased permutations. Moreover, (\tilde{p}_{j}) is given by \tilde{p}_{j}=p_{\alpha_{j}}, with (\alpha_{j}) as in i.

  3. iii.

    \theta_{1}=\tilde{x}_{1}, and the conditional prediction rule of (\theta_{i}) given (\tilde{p}_{j}) and (\tilde{x}_{j}) is

    \mathbb{P}[\theta_{i+1}\in\cdot\mid(\tilde{p}_{j}),(\tilde{x}_{j}),\theta_{1},\ldots,\theta_{i}]=\sum_{j=1}^{k_{i}}\tilde{p}_{j}\delta_{\tilde{x}_{j}}+\bigg(1-\sum_{j=1}^{k_{i}}\tilde{p}_{j}\bigg)\delta_{\tilde{x}_{k_{i}+1}},

    for every i\geq 1, where k_{i} is the number of distinct values in (\theta_{1},\ldots,\theta_{i}).

  4. iv.

    m, (\tilde{p}_{j}), (p_{j}) and (\alpha_{j}) are independent of elements in (\tilde{x}_{j}).

Theorem 1 is based on theory laid down in Pitman (1995, 1996b); nonetheless, we provide a self-contained proof in Appendix A, due to the crucial role it plays in the derivation of the new sampler.

The canonical example of species sampling models in Bayesian nonparametric statistics is the Dirichlet process (Ferguson, 1973). It has m=\infty support points and its size-biased permuted weights, (\tilde{p}_{j}), admit the stick-breaking representation

\tilde{p}_{1}=v_{1},\quad\tilde{p}_{j}=v_{j}\prod_{i=1}^{j-1}(1-v_{i}),\quad j\geq 2, (5)

where (v_{j})=(v_{j})_{j=1}^{\infty} are iid from the Beta distribution \mathsf{Be}(1,\theta) (Sethuraman, 1994). The Dirichlet model can be generalized to the two-parameter (\sigma,\theta)-model (Pitman, 2006), which features size-biased permuted weights as in (5) with independent v_{j}\sim\mathsf{Be}(1-\sigma,\theta+j\sigma) according to one of the following two regimes:

  1. a)

    \sigma\in[0,1) and \theta>-\sigma. In this case m=\infty and the species sampling model P has been named the Pitman-Yor process by Ishwaran and James (2001) after Pitman and Yor (1997). Evidently the choice \sigma=0 reduces to a Dirichlet process.

  2. b)

    Given m\in\mathbb{N}, \sigma=-\gamma<0 and \theta=m\gamma. In agreement with the notation we have established, m stands for the number of support points of P. It turns out that the law of (\tilde{p}_{j}) corresponds to that of the size-biased permutation of symmetric Dirichlet weights, (p_{1},\ldots,p_{m})\sim\mathsf{Dir}(\gamma,\ldots,\gamma) (Pitman, 1996a). When \gamma is fixed and m is random, P belongs to the class of Gibbs-type priors (see De Blasi et al., 2015, for a recent review), while the allied mixture model corresponds to the mixture of finite mixtures of Miller and Harrison (2018).
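As a small numerical illustration (ours, not taken from the paper), both regimes can be simulated directly from the stick-breaking representation (5); in regime (b) the last stick-breaking variable equals one, so the weights sum to one after m breaks.

    import numpy as np

    def size_biased_weights(sigma, theta, J, rng):
        """First J size-biased weights of the (sigma, theta)-model via (5):
        p_tilde_j = v_j * prod_{i<j} (1 - v_i), v_j ~ Be(1 - sigma, theta + j*sigma)."""
        b = theta + sigma * np.arange(1, J + 1)
        v = np.array([rng.beta(1.0 - sigma, bj) if bj > 0 else 1.0 for bj in b])
        stick = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # prod_{i<j} (1 - v_i)
        return v * stick

    rng = np.random.default_rng(1)
    pt_dp = size_biased_weights(0.0, 1.0, 50, rng)            # regime (a): Dirichlet process
    pt_py = size_biased_weights(0.3, 0.7, 50, rng)            # regime (a): Pitman-Yor process
    m, gamma = 4, 1.0
    pt_dir = size_biased_weights(-gamma, m * gamma, m, rng)   # regime (b): size-biased Dirichlet weights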

Another type of species sampling models for which a stick-breaking characterization of the size-biased weights is available are homogeneous normalized random measures with independent increments (cf. Regazzini et al., 2003; Favaro et al., 2016). Unfortunately, such a characterization remains elusive for most species sampling models used in mixture modelling; examples are finite dimensional approximations of the Pitman-Yor process (Ishwaran and James, 2001), the Geometric process (Fuentes-García et al., 2010), the probit stick-breaking process (Rodríguez and Dunson, 2011) and the exchangeable stick-breaking processes studied by Gil–Leyva and Mena (2021). For all these species sampling priors the weights can be defined in terms of a stick-breaking decomposition, p_{j}=v_{j}\prod_{l=1}^{j-1}(1-v_{l}), for some sequence of random variables (v_{j}) with values in [0,1], yet (p_{j}) is not invariant under size-biased permutations.

3 The ordered allocation sampler

As mentioned in Section 2, in mixture models data points (y_{i})=(y_{i})_{i=1}^{n} are treated as conditionally iid from a random density Q as in (1). Whenever the mixing distribution, P, is a species sampling model we can equivalently assume y_{i}\mid\theta_{i}\sim g(\cdot\mid\theta_{i}), independently for i\leq n, where (\theta_{i}) is a species sampling sequence driven by P. In this setting, marginal samplers integrate out P and exploit the exchangeability of (\theta_{i}) as well as the prediction rule, \mathbb{P}[\theta_{i+1}\in\cdot\mid\theta_{1},\ldots,\theta_{i}], to derive an algorithm for posterior inference (cf. Neal, 2000; Favaro and Teh, 2013; Miller and Harrison, 2018). Instead, conditional samplers (e.g. Ishwaran and James, 2001; Papaspiliopoulos and Roberts, 2008; Kalli et al., 2011) include the mixing distribution, P, and update its atoms, (x_{j}), and weights, (p_{j}), as components of the sampler. The ordered allocation sampler is a conditional sampler as it includes the mixing distribution, P; however, similarly to marginal samplers, it relies on a prediction rule for species sampling sequences. Explicitly, motivated by Theorem 1, we work with the atoms, (\tilde{x}_{j}), and weights, (\tilde{p}_{j}), of P in the order in which they were discovered by (\theta_{i}). As commonly done in other samplers, we augment the model with latent allocation variables that identify each observation, y_{i}, with the mixture component it was sampled from. Here, in accordance with the order of appearance, we introduce what we call ordered allocation variables, (d_{i})=(d_{i})_{i=1}^{n}, given by d_{i}=j if and only if y_{i} was sampled from g(\cdot\mid\tilde{x}_{j}), i.e. \theta_{i}=\tilde{x}_{j}. Thus, \theta_{i}=\tilde{x}_{d_{i}}, and y_{i}\mid((\tilde{x}_{j}),d_{i})\sim g(\cdot\mid\tilde{x}_{d_{i}}), independently for i\leq n. If k_{i} denotes the number of distinct values in (\theta_{1},\ldots,\theta_{i}), then k_{i} coincides with \max\{d_{1},\ldots,d_{i}\}, and d_{i+1} necessarily takes a value in \{1,\ldots,k_{i}+1\}. More precisely, iii of Theorem 1 allows us to compute

d_{1}=1,\quad d_{i+1}\mid(\tilde{p}_{j}),d_{1},\ldots,d_{i}\sim\sum_{j=1}^{k_{i}}\tilde{p}_{j}\delta_{j}+\bigg(1-\sum_{j=1}^{k_{i}}\tilde{p}_{j}\bigg)\delta_{k_{i}+1}, (6)

for i\leq n, independently of elements in (\tilde{x}_{j}) (see also iv of Theorem 1). This yields the augmented likelihood

\mathbbm{p}[(y_{i}),(d_{i})\mid(\tilde{p}_{j}),(\tilde{x}_{j})]=\prod_{j=1}^{k_{n}}\tilde{p}_{j}^{n_{j}-1}\bigg(1-\sum_{l=1}^{j-1}\tilde{p}_{l}\bigg)\prod_{i\in D_{j}}g(y_{i}\mid\tilde{x}_{j})\,\mathds{1}_{\mathcal{D}}, (7)

where k_{n}=\max\{d_{1},\ldots,d_{n}\}, D_{j}=\{i\leq n:d_{i}=j\}, n_{j}=|D_{j}|, and \mathcal{D} is the event that \{D_{1},\ldots,D_{k_{n}}\} is a partition of \{1,\ldots,n\} with blocks in the least element order; in particular, D_{j}\neq\emptyset, for j\leq k_{n}, and \min(D_{1})<\min(D_{2})<\cdots<\min(D_{k_{n}}). The full conditional distributions required at each iteration of the sampler are proportional to the product of (7) times the prior distributions, \mathbbm{p}[(\tilde{x}_{j})] and \mathbbm{p}[(\tilde{p}_{j})], of the atoms and weights of P in order of appearance. We first derive the ordered allocation sampler for those species sampling mixing distributions where the prior of (\tilde{p}_{j}) can be modelled directly, as is the case of the (\sigma,\theta)-model. Later we explain how to adapt the sampler to the more general case where the law of (\tilde{p}_{j}) is not available.

3.1 Ordered allocation sampler for size-biased weights

Updating of the ordered allocation variables (d_{i}):

\mathbbm{p}[\,d_{i}=d\mid\cdots\,]\propto\tilde{p}_{d}\,g(y_{i}\mid\tilde{x}_{d})\times\prod_{j=1}^{k_{n}}\tilde{p}_{j}^{-1}\bigg(1-\sum_{l=1}^{j-1}\tilde{p}_{l}\bigg)\mathds{1}_{\mathcal{D}}. (8)

This is the fundamentally novel part of the algorithm. Differently from other conditional samplers, the allocation variables (d_{i}) cannot be updated independently of each other, for two main reasons: (i) k_{n}=\max\{d_{1},\ldots,d_{n}\} might change as a consequence of an update in d_{i}, and (ii) the least element order of D_{1},\ldots,D_{k_{n}} must be preserved, as specified by the indicator \mathds{1}_{\mathcal{D}}. Instead, the updating of (d_{i}) resembles the way marginal algorithms update allocation variables (cf. Neal, 2000), in the sense that we update one d_{i} at a time by conditioning on the current value of the remaining ordered allocation variables. To do so, we first identify the set, \mathcal{D}_{i}, of admissible moves for d_{i}, which contains all positive integers d for which the event \mathcal{D} remains true after setting d_{i}=d. That is, for d\in\mathbb{N}, define k_{n}^{(d)}=\max\{k_{-i},d\}, where k_{-i}=\max\{d_{l}:l\neq i\}, and add d to \mathcal{D}_{i} if, under the assumption d_{i}=d, the sets D_{j}=\{l\leq n:d_{l}=j\} are non-empty, for j\leq k_{n}^{(d)}, and satisfy \min(D_{1})<\min(D_{2})<\cdots<\min(D_{k_{n}^{(d)}}). An example that illustrates how to determine \mathcal{D}_{i} is available in Appendix D. With this notation at hand we can rewrite (8):

\mathbbm{p}[\,d_{i}=d\mid\cdots\,]\propto\tilde{p}_{d}\,g(y_{i}\mid\tilde{x}_{d})\times\prod_{j=1}^{k_{n}^{(d)}}\tilde{p}_{j}^{-1}\bigg(1-\sum_{l=1}^{j-1}\tilde{p}_{l}\bigg)\mathds{1}_{\{d\in\mathcal{D}_{i}\}}. (9)

Next, we need to weight the admissible moves according to (9). To this aim, note that for each d\in\mathcal{D}_{i}, either k_{n}^{(d)}=k_{-i} or k_{n}^{(d)}=k_{-i}+1. This means that we can divide (9) by \prod_{j=1}^{k_{-i}}\tilde{p}_{j}^{-1}(1-\sum_{l=1}^{j-1}\tilde{p}_{l}) and obtain

\mathbbm{p}[\,d_{i}=d\mid\cdots\,]\propto\begin{cases}\tilde{p}_{d}\,g(y_{i}\mid\tilde{x}_{d})&\text{ if }k_{n}^{(d)}=k_{-i},\\ \big(1-\sum_{l=1}^{k_{-i}}\tilde{p}_{l}\big)\,g(y_{i}\mid\tilde{x}_{d})&\text{ if }k_{n}^{(d)}=k_{-i}+1,\end{cases} (10)

for d\in\mathcal{D}_{i}, and \mathbbm{p}[\,d_{i}=d\mid\cdots\,]=0 for d\not\in\mathcal{D}_{i}. Once we have identified \mathcal{D}_{i}, sampling from (10) is straightforward, as its support is \mathcal{D}_{i}\subset\{1,\ldots,n\}.
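To make the step concrete, here is a schematic Python implementation (ours; the paper's example for determining \mathcal{D}_{i} is in Appendix D). It assumes 1-based labels stored in a NumPy array d, weights p_tilde and atoms x_tilde of length at least k_{-i}+1, and a generic, hypothetical kernel density g_lik(y, x).

    import numpy as np

    def admissible_moves(d, i):
        """Values of d_i that keep every D_j non-empty and in least element order."""
        k_minus = max(d[l] for l in range(len(d)) if l != i)
        moves = []
        for cand in range(1, k_minus + 2):
            d_try = d.copy()
            d_try[i] = cand
            mins, ok = [], True
            for j in range(1, d_try.max() + 1):
                block = np.flatnonzero(d_try == j)
                if block.size == 0:
                    ok = False
                    break
                mins.append(block[0])
            if ok and all(mins[a] < mins[a + 1] for a in range(len(mins) - 1)):
                moves.append(cand)
        return moves, k_minus

    def update_di(d, i, y, x_tilde, p_tilde, g_lik, rng):
        """Sample d_i from (10): an occupied component d gets weight p_tilde_d, the
        first unoccupied one gets the leftover mass 1 - sum_{l <= k_{-i}} p_tilde_l."""
        moves, k_minus = admissible_moves(d, i)
        leftover = 1.0 - p_tilde[:k_minus].sum()
        probs = np.array([(p_tilde[c - 1] if c <= k_minus else leftover) * g_lik(y[i], x_tilde[c - 1])
                          for c in moves])
        d[i] = rng.choice(moves, p=probs / probs.sum())
        return d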

Updating of component parameters in order of appearance (\tilde{x}_{j}):

\mathbbm{p}[\,\tilde{x}_{j}\mid\cdots\,]\propto\prod_{i\in D_{j}}g(y_{i}\mid\tilde{x}_{j})\,\nu(\tilde{x}_{j}). (11)

Since D_{j}=\emptyset for j>k_{n}, we simply sample \tilde{x}_{j}\sim\nu from its prior distribution. For j\leq k_{n}, sampling from (11) is easy if g and \nu form a conjugate pair. Otherwise, the problem of updating the non-empty component parameters is identical to that in conditional samplers and some marginal ones. The advantage with respect to conditional algorithms is that the occupied component parameters are precisely the first k_{n} in the sequence (\tilde{x}_{j}).

Updating of size-biased weights (\tilde{p}_{j}):

\mathbbm{p}[\,(\tilde{p}_{j})\mid\cdots\,]\propto\prod_{j=1}^{k_{n}}\tilde{p}_{j}^{n_{j}-1}\bigg(1-\sum_{l=1}^{j-1}\tilde{p}_{l}\bigg)\times\mathbbm{p}[(\tilde{p}_{j})]. (12)

If the stick-breaking representation (5) is available, we can update (\tilde{p}_{j}) via sampling (v_{j}) from its full conditional. Noting that 1-\sum_{l=1}^{k}\tilde{p}_{l}=\prod_{l=1}^{k}(1-v_{l}) for each k\geq 1, we find

\mathbbm{p}[\,(v_{j})\mid\cdots\,]\propto\bigg[\prod_{j=1}^{k_{n}}v_{j}^{n_{j}-1}(1-v_{j})^{\sum_{l>j}n_{l}}\bigg]\times\mathbbm{p}[(v_{j})].

For example, for the (\sigma,\theta)-model we know that a priori v_{j}\sim\mathsf{Be}(1-\sigma,\theta+j\sigma), independently, according to one of the two regimes (a) or (b) spelled out in Section 2. In this case, we update v_{j}\sim\mathsf{Be}(n_{j}-\sigma,\theta+j\sigma+\sum_{l>j}n_{l}), independently for j\leq k_{n}, and we sample v_{j}\sim\mathsf{Be}(1-\sigma,\theta+j\sigma), for j>k_{n}, as a priori.
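In code, this conjugate update reduces to drawing independent Beta variables; the sketch below (ours) assumes the (\sigma,\theta)-model and returns the first J stick-breaking variables, refreshing the unoccupied ones from the prior.

    import numpy as np

    def update_sticks(n_counts, sigma, theta, J, rng):
        """Conjugate update of v_1, ..., v_J given the cluster sizes n_counts = (n_1, ..., n_{k_n});
        sticks beyond k_n are drawn from the prior."""
        n_counts = np.asarray(n_counts, dtype=float)
        k_n = len(n_counts)
        tail = np.concatenate((np.cumsum(n_counts[::-1])[::-1][1:], [0.0]))  # sum_{l > j} n_l
        v = np.empty(J)
        for j in range(1, J + 1):
            a = n_counts[j - 1] - sigma if j <= k_n else 1.0 - sigma
            b = theta + j * sigma + (tail[j - 1] if j <= k_n else 0.0)
            v[j - 1] = rng.beta(a, b) if b > 0 else 1.0   # b = 0 only at the last stick of regime (b)
        return v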

Before we move on, there are two points worth highlighting concerning the updating of (\tilde{p}_{j}). The first one is that, while the stick-breaking decomposition simplifies this step, it is not a requirement; what is needed is a posterior characterization of the weights in order of appearance. The Pitman-Yor multinomial process studied by Lijoi et al. (2020) illustrates this point. The second crucial remark is that sampling \tilde{p}_{j} and \tilde{x}_{j}, for j>k_{n}, is needed only for the few j required by the occupation of new components when updating (d_{i}). Since d_{i}\leq n, at most we will require updating \tilde{p}_{j} and \tilde{x}_{j} for j\leq\min\{n,m\}. This is especially relevant for infinite mixture models as it ensures that the sampler unfolds in a finite dimensional space even when the model dimension, m, is infinite. In fact, if the model dimension is deterministic, the ordered allocation sampler is practically identical for finite and infinite mixture models. The case of random m is treated next.

Updating the model dimension m:

If the model dimension is random, our proposal here is to update m, (\tilde{p}_{j}) and the non-occupied component parameters, (\tilde{x}_{j})_{j>k_{n}}=(\tilde{x}_{k_{n}+1},\ldots,\tilde{x}_{m}), as a block from

\mathbbm{p}[\,m,(\tilde{p}_{j}),(\tilde{x}_{j})_{j>k_{n}}\mid\cdots\,]\propto\prod_{j=1}^{k_{n}}\tilde{p}_{j}^{n_{j}-1}\bigg(1-\sum_{l=1}^{j-1}\tilde{p}_{l}\bigg)\prod_{j=k_{n}+1}^{m}\nu(\tilde{x}_{j})\times\mathbbm{p}[\,(\tilde{p}_{j})\mid m]\,\mathbbm{p}[m]. (13)

Here we keep “\cdots” to denote all random terms other than m, (\tilde{p}_{j}) and (\tilde{x}_{j})_{j>k_{n}}. We first sample m from its marginal, i.e. (13) after integrating over (\tilde{p}_{j}) and (\tilde{x}_{j})_{j>k_{n}}:

\mathbbm{p}[\,m\mid\cdots\,]\propto\mathbb{E}\bigg[\prod\nolimits_{j=1}^{k_{n}}\tilde{p}_{j}^{n_{j}-1}\bigg(1-\sum\nolimits_{l=1}^{j-1}\tilde{p}_{l}\bigg)\,\bigg|\,m\bigg]\,\mathbbm{p}[m]. (14)

The expectation is taken with respect to the conditional distribution of (\tilde{p}_{j}) given m and treating k_{n},n_{1},\ldots,n_{k_{n}} as constants. In particular, since \sum_{j=1}^{m}\tilde{p}_{j}=1, (14) equals zero for m<k_{n}. Taking this into account, and recognizing in the conditional expectation the EPPF of the species sampling model given m, cf. (4), we obtain

\mathbbm{p}[\,m\mid\cdots\,]\propto\pi(n_{1},\ldots,n_{k_{n}}\mid m)\,\mathbbm{p}[m]\,\mathds{1}_{\{k_{n}\leq m\}}. (15)

This is a remarkably simple expression for the updating of the model dimension, as it only requires the conditional EPPF given m. In Appendix B we provide an example of how to update m for mixtures of finite mixtures and for the choice of \mathbbm{p}[m] detailed by Gnedin (2010). After updating m, we sample (\tilde{p}_{j}) and (\tilde{x}_{j})_{j>k_{n}}, conditioning on m. Thus (\tilde{p}_{j}) is sampled from (12) as detailed before, and the m-k_{n} non-occupied component parameters, \tilde{x}_{k_{n}+1},\ldots,\tilde{x}_{m}, from the prior \nu, cf. (11). Note that the blocking argument is remarkably simple when compared with the Metropolis-Hastings steps of the reversible jump MCMC algorithm.
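As a rough sketch of this step (ours, not the derivation of Appendix B), consider the mixture of finite mixtures of regime (b) in Section 2, whose conditional EPPF is the standard symmetric Dirichlet expression \pi(n_{1},\ldots,n_{k}\mid m)=\frac{m!}{(m-k)!}\frac{\Gamma(m\gamma)}{\Gamma(m\gamma+n)}\prod_{j=1}^{k}\frac{\Gamma(\gamma+n_{j})}{\Gamma(\gamma)}; the factors that do not depend on m cancel in (15). The code assumes a user-supplied log-prior for m and, for simplicity, truncates the support at a conservative m_max.

    import numpy as np
    from scipy.special import gammaln

    def sample_m(n_counts, gamma, log_prior_m, m_max, rng):
        """Sample m from (15) for symmetric Dir(gamma,...,gamma) weights, with the
        support truncated to {k_n, ..., m_max}."""
        n_counts = np.asarray(n_counts, dtype=float)
        k_n, n = len(n_counts), n_counts.sum()
        m_grid = np.arange(k_n, m_max + 1)
        # log conditional EPPF given m, dropping factors that do not depend on m
        log_eppf = (gammaln(m_grid + 1.0) - gammaln(m_grid - k_n + 1.0)
                    + gammaln(m_grid * gamma) - gammaln(m_grid * gamma + n))
        log_post = log_eppf + np.array([log_prior_m(m) for m in m_grid])
        probs = np.exp(log_post - log_post.max())
        return int(rng.choice(m_grid, p=probs / probs.sum()))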

3.2 Ordered allocation sampler for non size-biased weights

In this section we adapt the ordered allocation sampler to species sampling priors that do not enjoy an explicit characterization of the size-biased weights (\tilde{p}_{j}). This makes our sampler applicable to mixture models for which marginal samplers are not available, thus increasing substantially its scope (examples can be found at the end of Section 2). To this aim recall that (\tilde{p}_{j}) is a rearrangement of the weights in an arbitrary order, (p_{j}), i.e. \tilde{p}_{j}=p_{\alpha_{j}}, where (\alpha_{j}) is sampled without replacement from \{1,\ldots,m\} with probabilities (p_{j}), as defined in (3). The key idea is to include (\alpha_{j}) as part of the sampler. As we will see, this augmentation yields a conditional sampler that inherits the advantages of the algorithm in Section 3.1 without requiring a closed-form expression for the law of (\tilde{p}_{j}) or the EPPF. As for the component parameters in order of appearance, (\tilde{x}_{j}), we will continue to model them directly, as iid from \nu, independently of (p_{j}) and (\alpha_{j}), cf. iv in Theorem 1. Thus, instead of (7), we work with the augmented likelihood

\mathbbm{p}[\,(y_{i}),(d_{i})\mid(p_{j}),(\alpha_{j}),(\tilde{x}_{j})\,]=\prod_{j=1}^{k_{n}}p_{\alpha_{j}}^{n_{j}-1}\bigg(1-\sum_{l=1}^{j-1}p_{\alpha_{l}}\bigg)\prod_{i\in D_{j}}g(y_{i}\mid\tilde{x}_{j})\,\mathds{1}_{\mathcal{D}}. (16)

It is straightforward to see that the updating of the ordered allocation variables (d_{i}) and of the component parameters (\tilde{x}_{j}) remains identical. Hence, we will only explain how to update the weights in order of appearance via (p_{j}) and (\alpha_{j}), as well as the model dimension, m, whenever this quantity is random.

Updating of (\tilde{p}_{j}) through (p_{j}) and (\alpha_{j}):

From (3) and (16) we find

\mathbbm{p}[\,(p_{j}),(\alpha_{j})\mid\cdots\,]\propto\prod_{j=1}^{k_{n}}p_{\alpha_{j}}^{n_{j}}\times\prod_{j=k_{n}+1}^{m}p_{\alpha_{j}}\bigg(1-\sum_{l=1}^{j-1}p_{\alpha_{l}}\bigg)^{-1}\mathds{1}_{\mathcal{A}}\times\mathbbm{p}[(p_{j})], (17)

where \mathcal{A} is the event that \alpha_{i}\neq\alpha_{j} for every i\neq j. In this part, as a notational device, we keep “\cdots” to denote all random terms other than (p_{j}),(\alpha_{j}). Also, we distinguish the indexes of the occupied components, (\alpha_{j})_{j\leq k_{n}}, from the remaining ones, (\alpha_{j})_{j>k_{n}}. The key idea to attain simple updating steps is to sample (\alpha_{j})_{j\leq k_{n}} from its full conditional, which can be expressed as a weighted permutation of k_{n} indexes, and separately (p_{j}) and (\alpha_{j})_{j>k_{n}} as a block.

We first focus on the updating of (\alpha_{j})_{j\leq k_{n}}. From (17) we get

\mathbbm{p}[\,(\alpha_{j})_{j\leq k_{n}}\mid(\alpha_{j})_{j>k_{n}},(p_{j}),\cdots\,]\propto\prod_{j=1}^{k_{n}}p_{\alpha_{j}}^{n_{j}}\,\mathds{1}_{\mathcal{A}}, (18)

after noting that \prod_{j=k_{n}+1}^{m}(1-\sum_{l=1}^{j-1}p_{\alpha_{l}})^{-1} is a constant with respect to (\alpha_{j})_{j\leq k_{n}}, because 1-\sum_{l=1}^{j-1}p_{\alpha_{l}}=\sum_{l=j}^{m}p_{\alpha_{l}}. The event \mathcal{A} indicates that this amounts to sampling a weighted permutation \rho of the k_{n} integers corresponding to the current values of (\alpha_{j})_{j\leq k_{n}}. Namely, we sample \rho from

\pi(\rho)=\frac{1}{Z}\prod_{j=1}^{k_{n}}w_{j,\rho(j)}\,,\quad\rho\in\mathcal{S},

where w_{j,l}=p^{n_{j}}_{\alpha_{l}}, Z is the normalizing constant and \mathcal{S} is the space of permutations of \{1,\ldots,k_{n}\}. Afterwards we simply apply \rho to the indexes of the current value of (\alpha_{j})_{j\leq k_{n}} so as to obtain the updated value (\alpha_{\rho(j)})_{j\leq k_{n}}. Now, to sample \rho from \pi we follow Zanella (2020) by adopting a Metropolis–Hastings scheme with a locally-balanced informed proposal distribution, cf. Example 3 therein. For the reader's convenience, we briefly recall how it works. Let N(\rho) be the neighbourhood of \rho\in\mathcal{S} given by all permutations obtained by switching two indexes (i.e. \rho^{*}\in N(\rho) if and only if there exist i\neq j such that \rho^{*}(i)=\rho(j), \rho^{*}(j)=\rho(i) and \rho^{*}(l)=\rho(l) for all l\not\in\{i,j\}). Instead of using a random walk scheme, which consists in proposing a new value of \rho, say \rho^{*}, uniformly over N(\rho) and accepting it with probability a(\rho,\rho^{*})=\min\{1,\pi(\rho^{*})/\pi(\rho)\}, we bias the proposal towards high probability regions of the target. To do so, we set the proposal distribution to Q(\rho,\rho^{*})=\sqrt{\pi(\rho^{*})}\,\mathds{1}_{N(\rho)}(\rho^{*})/Z(\rho), where Z(\rho)=\sum_{z\in N(\rho)}\sqrt{\pi(z)} is the normalizing constant. Then the new value, \rho^{*}, is accepted with probability a(\rho,\rho^{*})=\min\big\{1,\frac{\pi(\rho^{*})Q(\rho^{*},\rho)}{\pi(\rho)Q(\rho,\rho^{*})}\big\}. In the simulation study we initialized \rho as the identity permutation and performed k_{n} Metropolis-Hastings steps at each iteration. As explained by Zanella (2020), the appeal of a locally-balanced proposal such as Q is that it is roughly \pi-reversible when \mathcal{S} is large with respect to N(\rho), so the acceptance probability a(\rho,\rho^{*}) tends to be high. Otherwise, if k_{n} is small, one can opt to sample exactly from (18) by enumerating all possible permutations of (\alpha_{j})_{j\leq k_{n}}.
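The following self-contained sketch (ours) spells out one such locally-balanced move in Python. It assumes a 0-based permutation rho of \{0,\ldots,k_{n}-1\}, the cluster sizes n_counts = (n_{1},\ldots,n_{k_{n}}) and an array lp holding \log p_{\alpha_{l}} for the currently occupied indexes, and it works on the log scale to avoid overflow.

    import numpy as np
    from itertools import combinations

    def log_pi(rho, n_counts, lp):
        """log pi(rho) up to a constant, with pi(rho) proportional to prod_j p_{alpha_{rho(j)}}^{n_j}."""
        return float(np.sum(n_counts * lp[rho]))

    def neighbourhood(rho):
        """All permutations obtained from rho by switching two positions."""
        nbs = []
        for a, b in combinations(range(len(rho)), 2):
            nb = rho.copy()
            nb[a], nb[b] = nb[b], nb[a]
            nbs.append(nb)
        return nbs

    def lb_step(rho, n_counts, lp, rng):
        """One locally-balanced Metropolis-Hastings move with proposal Q proportional to sqrt(pi)."""
        if len(rho) < 2:
            return rho
        nbs = neighbourhood(rho)
        logq = np.array([0.5 * log_pi(nb, n_counts, lp) for nb in nbs])
        w = np.exp(logq - logq.max())
        prop = nbs[rng.choice(len(nbs), p=w / w.sum())]
        logq_back = np.array([0.5 * log_pi(nb, n_counts, lp) for nb in neighbourhood(prop)])
        log_acc = (0.5 * (log_pi(prop, n_counts, lp) - log_pi(rho, n_counts, lp))
                   + np.logaddexp.reduce(logq) - np.logaddexp.reduce(logq_back))
        return prop if np.log(rng.uniform()) < log_acc else rho

    # e.g. k_n steps per iteration, starting from the identity permutation:
    #   rho = np.arange(k_n)
    #   for _ in range(k_n): rho = lb_step(rho, n_counts, lp, rng)
    # the occupied indexes are then relabelled as alpha_new[j] = alpha[rho[j]].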

As for the updating of (p_{j}) and (\alpha_{j})_{j>k_{n}}, we first sample (p_{j}) from the conditional distribution \mathbbm{p}[\,(p_{j})\mid(\alpha_{j})_{j\leq k_{n}},\cdots\,] obtained from \mathbbm{p}[\,(p_{j}),(\alpha_{j})_{j>k_{n}}\mid(\alpha_{j})_{j\leq k_{n}},\cdots\,] after integrating over (\alpha_{j})_{j>k_{n}}. We get

\mathbbm{p}[\,(p_{j})\mid(\alpha_{j})_{j\leq k_{n}},\cdots\,]\propto\prod_{j=1}^{\overline{\alpha}}p_{j}^{r_{j}}\times\mathbbm{p}[(p_{j})],

where \overline{\alpha}=\max\{\alpha_{j}:j\leq k_{n}\} and r_{j}=\sum_{l=1}^{k_{n}}n_{l}\mathds{1}_{\{\alpha_{l}=j\}}; that is, r_{j}=n_{l} if there exists l\leq k_{n} such that \alpha_{l}=j, and r_{j}=0 otherwise. When p_{j}=v_{j}\prod_{l=1}^{j-1}(1-v_{l}), we can update (p_{j}) via sampling (v_{j}) from

\mathbbm{p}[\,(v_{j})\mid(\alpha_{j})_{j\leq k_{n}},\cdots\,]\propto\prod_{j=1}^{\overline{\alpha}}v_{j}^{r_{j}}(1-v_{j})^{\sum_{l>j}r_{l}}\times\mathbbm{p}[(v_{j})].

For instance, if a priori v_{j}\sim\mathsf{Be}(a_{j},b_{j}), independently for j\geq 1, then a posteriori v_{j}\sim\mathsf{Be}(r_{j}+a_{j},\sum_{l>j}r_{l}+b_{j}) for j\leq\overline{\alpha}, and v_{j}\sim\mathsf{Be}(a_{j},b_{j}) for j>\overline{\alpha}. Further examples on how to update (p_{j}) can be found in Appendix C. After updating (p_{j}) we sample (\alpha_{j})_{j>k_{n}} from

\mathbbm{p}[\,(\alpha_{j})_{j>k_{n}}\mid(\alpha_{j})_{j\leq k_{n}},(p_{j}),\cdots\,]=\prod_{j=k_{n}+1}^{m}p_{\alpha_{j}}\bigg(1-\sum_{l=1}^{j-1}p_{\alpha_{l}}\bigg)^{-1}\mathds{1}_{\mathcal{A}}. (19)

That is, \alpha_{k_{n}+1},\alpha_{k_{n}+2},\ldots are sampled without replacement from \{j\leq m:\ j\not\in(\alpha_{j})_{j\leq k_{n}}\} with probabilities proportional to \{p_{j}:\ j\not\in(\alpha_{j})_{j\leq k_{n}}\}. This can be achieved by sampling sequentially as a priori, cf. (3).

This way of updating (\alpha_{j}), although theoretically valid, has the disadvantage that switches between indexes in (\alpha_{j})_{j\leq k_{n}} and indexes in (\alpha_{j})_{j>k_{n}} only occur when k_{n} changes as a consequence of an update in (d_{i}). To facilitate mixing, one can include the following acceleration step after updating (\alpha_{j})_{j\leq k_{n}} from (18) and before updating (\alpha_{j})_{j>k_{n}} from (19). We suggest sampling each \alpha_{j} with j\leq k_{n} from

\mathbbm{p}[\,\alpha_{j}\mid(p_{j}),(\alpha_{l})_{l\leq k_{n},l\neq j},\cdots\,]\propto p_{\alpha_{j}}^{n_{j}}\,\mathds{1}_{\mathcal{A}}, (20)

i.e. conditioning on the current values of \alpha_{l}, for l\leq k_{n} and l\neq j, with (\alpha_{j})_{j>k_{n}} integrated out. Hence, the indicator \mathds{1}_{\mathcal{A}} above only dictates \alpha_{j}\neq\alpha_{l}, with l\leq k_{n}. If the number of components is finite, the support of (20) consists of m-k_{n}+1 positive integers and sampling directly from this distribution is trivial. Otherwise, when m=\infty, we can treat (\alpha_{j})_{j>k_{n}} as latent variables with distribution as in (19), and update the pair (\alpha_{j},\alpha_{k_{n}+1}) from

\mathbbm{p}[\,(\alpha_{j},\alpha_{k_{n}+1})\mid(p_{j}),(\alpha_{l})_{l\not\in\{j,k_{n}+1\}},\cdots\,]\propto p^{n_{j}}_{\alpha_{j}}\,p_{\alpha_{k_{n}+1}}\bigg(1-\sum_{l=1}^{k_{n}}p_{\alpha_{l}}\bigg)^{-1}\mathds{1}_{\mathcal{A}}. (21)

In practice, it is enough to sample \alpha_{k_{n}+1} from \mathbbm{p}[\,\alpha_{k_{n}+1}\mid\alpha_{1},\ldots,\alpha_{k_{n}},(p_{j}),\cdots\,] as in (3), and later either leave (\alpha_{j},\alpha_{k_{n}+1}) unchanged or switch the values of \alpha_{j} and \alpha_{k_{n}+1}, with probabilities determined by (21). It is worth emphasizing that this procedure has to be repeated for all j\leq k_{n}, and that each time we discard \alpha_{k_{n}+1} because it only plays the role of an auxiliary variable used to draw samples from (20).
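A compact sketch (ours) of this auxiliary-variable move follows; it assumes that enough weights p have already been instantiated to cover the candidate indexes, stores the occupied indexes in a 0-based list alpha, and computes the switch probability implied by (21) on the log scale.

    import numpy as np

    def acceleration_move(j, alpha, p, n_counts, rng):
        """Draw the auxiliary alpha_{k_n+1} as in (3) and swap it with alpha_j
        with the probability implied by (21)."""
        free = [l for l in range(len(p)) if l not in set(alpha)]
        w = np.array([p[l] for l in free])
        aux = free[int(rng.choice(len(free), p=w / w.sum()))]     # candidate alpha_{k_n+1}
        a = alpha[j]
        rest = sum(p[l] for l in alpha) - p[a]                    # weights of the other occupied indexes
        log_r = ((n_counts[j] - 1.0) * (np.log(p[aux]) - np.log(p[a]))
                 + np.log(1.0 - rest - p[a]) - np.log(1.0 - rest - p[aux]))
        if rng.uniform() < 1.0 / (1.0 + np.exp(-log_r)):          # switching probability from (21)
            alpha[j] = aux                                        # the auxiliary index is then discarded
        return alpha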

Similarly to the sampler in Section 3.1, this sampler unfolds in a finite dimensional space even when the model dimension is infinite. In general, we will only need to update \tilde{x}_{j} and \alpha_{j}, for j>k_{n}, when required by the updating of the ordered allocation variables, (d_{i}). At most iterations this will be necessary only for a few indexes j. As for the weights, it is the updating of (\alpha_{j}) that determines how many entries of (p_{j}) must be updated. Thus, at most we will need to update p_{j} for j\leq\max\{\alpha_{1},\ldots,\alpha_{J}\}, where J is the latest entry of (\alpha_{j}) we were required to update.

Updating of m:

If the model dimension is random, our proposal is to update m, (\tilde{x}_{j})_{j>k_{n}}, (p_{j}) and (\alpha_{j})_{j>k_{n}} as a block from the full conditional \mathbbm{p}[\,m,(p_{j}),(\tilde{x}_{j})_{j>k_{n}},(\alpha_{j})_{j>k_{n}}\mid\cdots\,]. Here we use “\cdots” to denote all random terms other than m, (\tilde{x}_{j})_{j>k_{n}}, (p_{j}) and (\alpha_{j})_{j>k_{n}}. Integrating over (\tilde{x}_{j})_{j>k_{n}}, (p_{j}) and (\alpha_{j})_{j>k_{n}}, we first sample m from

\mathbbm{p}[\,m\mid\cdots\,]\propto\mathbb{E}\bigg[\prod\nolimits_{j=1}^{k_{n}}p_{\alpha_{j}}^{n_{j}}\,\bigg|\,m\bigg]\mathds{1}_{\{m\geq k_{n}\}}\,\mathbbm{p}[m], (22)

where the expectation is taken with respect to the conditional distribution of (p_{j}) given m, treating (n_{j})_{j\leq k_{n}} and (\alpha_{j})_{j\leq k_{n}} as constants. Later we sample (p_{j}) and (\alpha_{j})_{j>k_{n}} from (17) as previously explained, and the m-k_{n} empty component parameters, \tilde{x}_{j}, from the prior, cf. (11). In contrast to (15), the EPPF does not appear in (22); instead, it is enough to compute an expectation of the weights. This is very convenient, given that when the law of (\tilde{p}_{j}) is not available, typically the EPPF is hard to compute, as mentioned in Section 2.

3.3 Acceleration step

There is a very simple modification of the ordered allocation sampler that can greatly improve its performance. To motivate it, first note that the set of admissible moves, \mathcal{D}_{i}, of d_{i} is always contained in \{1,\ldots,k_{i-1}+1\}, with k_{0}=0 and k_{i}=\max\{d_{1},\ldots,d_{i}\}. Recalling that d_{i} indicates from which component of the mixture y_{i} was sampled, this means that while the latest data points are able to reallocate to virtually all observed components, the first data points will rarely be reassigned to a different component. Furthermore, since component parameters and weights are labelled in the order in which they were discovered by (y_{i}), the initial order of data points can dictate how often labels of components switch, and thus affect the mixing properties of the sampler. To overcome this, it is enough to exploit the exchangeability of (y_{i}) and add a step, after updating (d_{i}), in which we randomly permute the data points, obtaining (y^{\prime}_{i})=(y_{\rho(i)}), where \rho is a uniform permutation of \{1,\ldots,n\}. Accordingly, we modify (d_{i}), obtaining (d_{i}^{\prime}) defined by d^{\prime}_{i}=j if and only if d_{\rho(i)} equals the jth distinct value to appear in (d_{\rho(i)}). This way, the ordered allocation variables, (d^{\prime}_{i}), that correspond to the permuted data set, (y^{\prime}_{i}), preserve the induced clustering structure, and the least element order dictated by the event \mathcal{D} now holds for (D^{\prime}_{j}) with D^{\prime}_{j}=\{i:d^{\prime}_{i}=j\} (see Appendix D for an example). In accordance, for the sampler in Section 3.2 we also need to change the values of (\alpha_{j})_{j\leq k_{n}} so as to obtain (\alpha^{\prime}_{j})_{j\leq k_{n}}, where \alpha^{\prime}_{j} now indicates which weight in (p_{j}) is the jth one to be discovered by (y^{\prime}_{i}). To do so we simply set \alpha^{\prime}_{j}=\alpha_{l} if and only if the jth distinct value to appear in (d_{\rho(i)}) equals l. After doing so we can move on with the updating of m (if it is random) and of each of the observed component parameters, \tilde{x}^{\prime}_{j}, and weights, \tilde{p}^{\prime}_{j}=p_{\alpha^{\prime}_{j}}, exactly as before, although now they are labelled in the order in which they were discovered by (y^{\prime}_{i}).
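In code, this relabelling is a simple bookkeeping operation; the sketch below (ours) permutes the indices, relabels (d_{i}) by order of appearance in the permuted sequence and, for the sampler of Section 3.2, realigns the occupied entries of (\alpha_{j}); the data themselves are reordered as y[rho].

    import numpy as np

    def permute_and_relabel(d, rng, alpha=None):
        """Random permutation of the data indices plus relabelling of the (1-based)
        ordered allocation variables and, optionally, of the occupied alpha's."""
        rho = rng.permutation(len(d))
        d_perm = d[rho]
        relabel = {}                                   # old label -> order of appearance in d_perm
        for lab in d_perm:
            if lab not in relabel:
                relabel[lab] = len(relabel) + 1
        d_new = np.array([relabel[lab] for lab in d_perm])
        alpha_new = None
        if alpha is not None:
            order = sorted(relabel, key=relabel.get)   # old labels sorted by their new position
            alpha_new = np.array([alpha[lab - 1] for lab in order])
        return rho, d_new, alpha_new                   # reorder the data as y[rho]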

For the simulation study we present in the following section, this acceleration step was included in all implementations of the ordered allocation sampler. Nonetheless, in Appendix E we present a few runs of the ordered allocation sampler without it, to illustrate its effect.

4 Simulation study

In this section we present a simulation study to compare the mixing of the ordered allocation sampler against that of a marginal sampler and a conditional sampler. Following Kalli et al. (2011), three different data sets have been considered (histograms are displayed in Figure 1). The first data set is the \mathsf{galaxy} data, consisting of the velocities of 82 distinct galaxies diverging away from our galaxy. The other two data sets are the \mathsf{leptokurtic} and \mathsf{bimodal} data sets first introduced in Green and Richardson (2001). The \mathsf{leptokurtic} data set consists of 100 data points simulated from the mixture 0.67\,\mathsf{N}(0,1)+0.33\,\mathsf{N}(0.3,0.25^{2}). In the \mathsf{bimodal} data set the 100 observations come from the mixture 0.5\,\mathsf{N}(-1,0.5^{2})+0.5\,\mathsf{N}(1,0.5^{2}). To each data set we fitted a mixture of Gaussian distributions with random location and scale parameters, i.e. g(y\mid x)=\mathsf{N}(y\mid\mu,\sigma^{2}), and with five different mixing prior specifications. First we consider a mixture of finite mixtures (MFM, Miller and Harrison, 2018), specifically a mixing prior with random dimension, m, and symmetric Dirichlet weights, (p_{1},\ldots,p_{m})\sim\mathsf{Dir}(\gamma,\ldots,\gamma), with \gamma=1. As for m, we used the prior \mathbbm{p}[m]=\lambda(1-\lambda)_{m-1}/m! (Gnedin, 2010) with \lambda=0.1. The remaining mixing priors we considered are a Dirichlet process (DP) with total mass parameter \theta=1, a Pitman-Yor process (PY) with parameters (\sigma,\theta)=(0.3,0.7), a Geometric process (GP, Fuentes-García et al., 2010) and an Exchangeable Stick-Breaking process (ESB, Gil–Leyva and Mena, 2021). Further specifications of the Geometric and the Exchangeable stick-breaking processes can be found in Appendix C. In all cases the distribution \nu of the component parameters was fixed to \nu(\mu,\sigma^{2})=\mathsf{N}(\mu\mid\mu_{0},\lambda_{0}^{-1}\sigma^{2})\,\Gamma^{-1}(\sigma^{2}\mid a_{0},b_{0}) with hyperparameters \mu_{0}=n^{-1}\sum_{i=1}^{n}y_{i}, \lambda_{0}=1/100 and a_{0}=b_{0}=0.5.

The marginal sampler we have implemented is Algorithm 8 in Neal (2000) for DP, PY and MFM. In particular for MFM, Algorithm 8 was adapted following Miller and Harrison (2018). No marginal samplers are available for GP and ESB as these priors lack a Pólya urn scheme representation. As for the conditional sampler, we implemented the (dependent) slice-efficient sampler, as described by Kalli et al. (2011), for all models but MFM. The ordered allocation sampler (OAS in short) was used to implement all considered mixing priors. In particular, we used the algorithm in Section 3.1 for MFM, DP and PY, as these priors enjoy a tractable law of the size-biased permuted weights. The algorithm of Section 3.2 was used for GP and ESB. In Appendix C we explain how to update the weights of the GP and ESB priors.

Figure 1: Histogram of the \mathsf{galaxy} (left), \mathsf{leptokurtic} (middle) and \mathsf{bimodal} (right) datasets. The lines represent the estimated densities using the ESB mixing prior and the ordered allocation sampler in Section 3.2.

To monitor algorithmic performance we explored the convergence of the number of occupied components, k_{n}, and of the deviance, D_{v}, of the estimated density (cf. Green and Richardson, 2001). The deviance can be computed as D_{v}=-2\sum_{i=1}^{n}\log\sum_{j=1}^{m}\frac{n_{j}}{n}g(y_{i}\mid x_{j}), where n_{j} is the number of data points associated to g(y_{i}\mid x_{j}). More precisely, we considered the chains (k_{n}^{t})_{t=1}^{T} and (D_{v}^{t})_{t=1}^{T} attained from T iterations after the burn-in period. In each case we estimated the integrated autocorrelation time (IAT), \tau=1/2+\sum_{l=1}^{\infty}\rho_{l}, where \rho_{l} stands for the l-lag autocorrelation of the monitored chain. As done by Kalli et al. (2011), \tau was estimated through \hat{\tau}=1/2+\sum_{l=1}^{C-1}\hat{\rho}_{l}, where \hat{\rho}_{l} is the estimated autocorrelation at lag l and C=\min\{l:|\hat{\rho}_{l}|<2/\sqrt{T}\}. This is a very useful summary statistic for quantifying the convergence of an MCMC algorithm, with smaller values of \hat{\tau} corresponding to better performance. For each sampler and mixing prior, we considered 2\times 10^{6} iterations after a burn-in period of 10^{5} iterations. Table 1 reports estimates of the IAT for k_{n} and D_{v}, with standard errors appearing in parentheses, the latter computed following Section 3 of Sokal (1997).
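For reference, a minimal implementation (ours) of this IAT estimate is given below; it uses one common estimate of the empirical autocorrelations of the monitored chain together with the cutoff rule C=\min\{l:|\hat{\rho}_{l}|<2/\sqrt{T}\} described above.

    import numpy as np

    def iat(chain):
        """Estimate tau = 1/2 + sum_{l=1}^{C-1} rho_hat_l with C = min{l : |rho_hat_l| < 2/sqrt(T)}."""
        x = np.asarray(chain, dtype=float)
        T = len(x)
        x = x - x.mean()
        acf = np.correlate(x, x, mode="full")[T - 1:] / (np.arange(T, 0, -1) * x.var())
        below = np.abs(acf[1:]) < 2.0 / np.sqrt(T)
        C = int(np.argmax(below)) + 1 if below.any() else T   # first lag with negligible autocorrelation
        return 0.5 + acf[1:C].sum()

    # e.g. iat(k_chain) for the chain of k_n values, iat(dev_chain) for the deviance chain.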

\mathsf{galaxy} data
            Marginal        OAS             Conditional
MFM   D_v   14.43 (0.38)    26.17 (0.93)    --
      k_n   33.55 (0.92)    89.42 (3.07)    --
DP    D_v   12.30 (0.23)    23.76 (0.57)    119.2 (9.95)
      k_n   13.68 (0.25)    32.49 (0.81)    190.2 (16.3)
PY    D_v   13.48 (0.24)    21.59 (0.52)    83.33 (4.63)
      k_n   12.43 (0.23)    35.62 (0.84)    115.7 (6.23)
GP    D_v   --              11.01 (0.33)    40.76 (2.50)
      k_n   --              61.67 (1.89)    621.2 (172.5)
ESB   D_v   --              24.29 (0.68)    245.4 (28.3)
      k_n   --              59.27 (2.16)    632.4 (88.9)

\mathsf{leptokurtic} data
            Marginal        OAS             Conditional
MFM   D_v   563.5 (61.3)    855.5 (94.8)    --
      k_n   337.7 (26.3)    535.2 (64.7)    --
DP    D_v   22.42 (0.54)    25.81 (0.63)    120.5 (10.3)
      k_n   9.26 (0.13)     18.99 (0.41)    100.8 (7.18)
PY    D_v   53.85 (1.45)    62.91 (2.10)    322.9 (37.0)
      k_n   12.02 (0.24)    20.96 (0.56)    138.7 (13.1)
GP    D_v   --              50.64 (1.55)    158.7 (7.86)
      k_n   --              45.34 (1.21)    129.8 (6.43)
ESB   D_v   --              61.29 (1.97)    184.5 (18.6)
      k_n   --              26.78 (0.91)    120.1 (10.7)

\mathsf{bimodal} data
            Marginal        OAS             Conditional
MFM   D_v   122.9 (8.04)    143.0 (9.28)    --
      k_n   65.15 (3.89)    109.4 (7.99)    --
DP    D_v   7.84 (0.16)     13.87 (0.35)    35.61 (1.68)
      k_n   6.30 (0.07)     13.38 (0.22)    52.00 (2.18)
PY    D_v   39.91 (1.33)    58.11 (2.21)    257.1 (22.8)
      k_n   6.24 (0.14)     12.40 (0.30)    58.10 (3.35)
GP    D_v   --              57.45 (1.76)    148.4 (5.81)
      k_n   --              55.85 (1.65)    146.0 (5.94)
ESB   D_v   --              48.48 (1.52)    76.51 (3.48)
      k_n   --              19.93 (0.58)    35.53 (1.42)

Table 1: Results for the three datasets by model and sampler. Dashes mark sampler-prior combinations that are not available (no marginal sampler for GP and ESB, and the slice sampler was not implemented for MFM).

We observe that, when applicable, Algorithm 8 outperforms the other samplers, and that the OAS has better mixing properties than the slice sampler. On average, the IAT corresponding to the OAS is roughly 1.8 times larger than that of Algorithm 8, and the IAT of the slice sampler is approximately 4.7 times larger than that of the OAS. In general, it has been found that conditional algorithms perform worse than marginal samplers. This can be explained by the fact that in conditional algorithms mixing takes place in the space of all possible values of the (usual) allocation variables, (c_{i})_{i=1}^{n}, given by c_{i}=j if and only if \theta_{i}=x_{j}. Instead, in marginal algorithms the labels of the allocation variables are irrelevant, which means that the sampler searches the space of partitions of \{1,\ldots,n\} generated by the ties among allocation variables (cf. Porteous et al., 2006). Now, in the OAS, mixing occurs in the space of all possible values of the ordered allocation variables, (d_{i}), which unequivocally define an ordered partition of \{1,\ldots,n\} with blocks in the least element order. Since there is a one-to-one correspondence between (unordered) partitions of \{1,\ldots,n\} and partitions of the same set ordered according to the least element, marginal samplers and the OAS search in exactly the same space. This explains the better mixing properties of the OAS when compared with the slice sampler. Still, Algorithm 8 performs better than the OAS, which is mainly due to the restricted support of (d_{i}) in the OAS.

In Appendix E we extend this study for the DP model to further compare the distinct versions of the OAS in Sections 3.1 and 3.2, as well as the effect of the acceleration step in Section 3.3. There we also show the graph of the estimated weighted densities by component, so as to illustrate in more detail how the different samplers mix over component labels.

5 Discussion

The ordered allocation sampler exploits the conditional law of a species sampling sequence given the atoms and the weights in order of appearance. The idea of sorting the parameters by order of appearance is analogous to that of Chopin (2007) for devising sequential Monte Carlo algorithms for hidden Markov models. A key difference is that in our framework we retain exchangeability of the data, while in hidden Markov models the data possess a precise temporal order.

Mixture models with a random dimension have long been known for their appeal from a modelling perspective and for their optimal asymptotic properties (Rousseau and Mengersen, 2011; Shen et al., 2013). However, posterior computation had remained somewhat elusive until the advent of the marginal sampler by Miller and Harrison (2018). The ordered allocation sampler is a valid alternative: it is simple to implement and more broadly applicable. The sampler has been illustrated for mixtures of finite mixtures, but it readily applies to symmetric Dirichlet distributed weights whose parameter \gamma can depend on the number of components. For example, we can use it to implement Dirichlet-multinomial mixing priors, or “sparse finite mixtures” as termed by Frühwirth-Schnatter and Malsiner-Walli (2019), where \gamma=\theta/m. As for mixtures with infinitely many components, the sampler completely avoids the truncation problem. In fact, it is practically identical for the case where m is a finite fixed number and the case where m is infinite. Other conditional samplers are designed for the case where m<\infty is fixed, m=\infty, or m is random, and there are clear distinctions between samplers designed for one case or another. To the best of our knowledge, the ordered allocation sampler is the first conditional sampler that treats the distinct assumptions on m in a unified manner.

As highlighted throughout the paper, the ordered allocation sampler enjoys nice properties in terms of applicability and mixing performance. Nonetheless, there are areas of improvement. In other conditional samplers allocation variables are updated independently of each other in a block, rather than one at a time. In big data settings this is a significant advantage over the ordered allocation sampler. Another drawback is that the number of occupied components, k_{n}, cannot change as freely from one iteration to the next as it does in other samplers. This explains the higher IAT of k_{n} when compared against the marginal method. The ordered allocation sampler may need additional modifications to address these issues, which is an interesting direction for future research.

References

  • Blackwell and MacQueen (1973) Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Polya urn schemes, Ann. Statist. 1(2): 353 – 355.
  • Chopin (2007) Chopin, N. (2007). Inference and model choice for sequentially ordered hidden Markov models, JRSSB 69(2): 269–284.
  • De Blasi et al. (2015) De Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Prünster, I. and Ruggiero, M. (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process?, IEEE Transactions on Pattern Analysis and Machine Intelligence 37: 212–229.
  • Escobar and West (1995) Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures, J. Amer. Statist. Assoc. 90: 577–588.
  • Favaro et al. (2016) Favaro, S., Lijoi, A., Nava, C., Nipoti, B., Prünster, I. and Teh, Y. W. (2016). On the stick-breaking representation for homogeneous NRMIs, Bayesian Anal. 11: 697–724.
  • Favaro and Teh (2013) Favaro, S. and Teh, Y. W. (2013). MCMC for normalized random measure mixture models, Statistical Science 28(3): 335 – 359.
  • Ferguson (1973) Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems, Ann. Statist. 1(2): 209–230.
  • Frühwirth-Schnatter and Malsiner-Walli (2019) Frühwirth-Schnatter, S. and Malsiner-Walli, G. (2019). From here to infinity: sparse finite versus Dirichlet process mixtures in model-based clustering, Advances in Data Analysis and Classification 13: 33–64.
  • Fuentes-García et al. (2010) Fuentes-García, R., Mena, R. H. and Walker, S. G. (2010). A new Bayesian nonparametric mixture model, Communications in Statistics - Simulation and Computation 39(4): 669–682.
  • Gil–Leyva and Mena (2021) Gil–Leyva, M. F. and Mena, R. H. (2021). Stick-breaking processes with exchangeable length variables, Journal of the American Statistical Association, in press.
  • Gnedin (2010) Gnedin, A. (2010). A species sampling model with finitely many types, Electronic Communications in Probability 15: 79–88.
  • Green and Richardson (2001) Green, P. J. and Richardson, S. (2001). Modeling heterogeneity with and without the Dirichlet process, Scand. J. Stat. 28: 355–375.
  • Ishwaran and James (2001) Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors, J. Amer. Statist. Assoc. 96: 161–173.
  • Kallenberg (2005) Kallenberg, O. (2005). Probabilistic Symmetries and Invariance Principles, first edn, Springer.
  • Kalli et al. (2011) Kalli, M., Griffin, J. E. and Walker, S. (2011). Slice sampling mixtures models, Statist. Comput. 21: 93–105.
  • Lijoi et al. (2020) Lijoi, A., Prünster, I. and Rigon, T. (2020). The Pitman–Yor multinomial process for mixture modelling, Biometrika 107(4): 891–906.
  • Miller and Harrison (2018) Miller, J. W. and Harrison, M. T. (2018). Mixture models with a prior on the number of components, J. Amer. Statist. Assoc. 113(521): 340–356.
  • Neal (2000) Neal, R. M. (2000). Markov Chain Sampling Methods for Dirichlet Process Mixture Models, J. Comput. Graph. Statist. 9(2): 249–265.
  • Papaspiliopoulos and Roberts (2008) Papaspiliopoulos, O. and Roberts, G. O. (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models, Biometrika 95: 169–186.
  • Pitman (1995) Pitman, J. (1995). Exchangeable and partially exchangeable random partitions, Probab. Theory Relat. Fields 102: 145–158.
  • Pitman (1996a) Pitman, J. (1996a). Random discrete distributions invariant under size-biased permutation, Adv. Appl. Probab. 28(2): 525–539.
  • Pitman (1996b) Pitman, J. (1996b). Some developments of the Blackwell-MacQueen urn scheme, in T. F. et al. (ed.), Statistics, Probability and Game Theory; Papers in honor of David Blackwell, Vol. 30 of Lecture Notes-Monograph Series, Institute of Mathematical Statistics, Hayward, California, pp. 245–267.
  • Pitman (2006) Pitman, J. (2006). Combinatorial Stochastic Processes, Vol. 1875 of École d’été de probabilités de Saint-Flour, first edn, Springer-Verlag Berlin Heidelberg, New York.
  • Pitman and Yor (1997) Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator, Ann. Probab. 25(2): 855–900.
  • Porteous et al. (2006) Porteous, I., Ihler, A., Smyth, P. and Welling, M. (2006). Gibbs sampling for (coupled) infinite mixture models in the stick breaking representation, Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI2006), pp. 385–392.
  • Regazzini et al. (2003) Regazzini, E., Lijoi, A. and Prünster, I. (2003). Distributional results for means of normalized random measures with independent increments, Ann. Statist. 31(2): 560–585.
  • Richardson and Green (1997) Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components, J. R. Stat. Soc. Ser. B 59: 731–792.
  • Rodríguez and Dunson (2011) Rodríguez, A. and Dunson, D. B. (2011). Nonparametric Bayesian models through probit stick-breaking processes, Bayesian Anal. 6(1): 145–178.
  • Rousseau and Mengersen (2011) Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models, J. R. Stat. Soc. Ser. B 73(5): 689–710.
  • Sethuraman (1994) Sethuraman, J. (1994). A constructive definition of Dirichlet priors, Stat. Sin. 4: 639–650.
  • Shen et al. (2013) Shen, W., Tokdar, S. T. and Ghosal, S. (2013). Adaptive Bayesian multivariate density estimation with Dirichlet mixtures, Biometrika 100: 623–640.
  • Sokal (1997) Sokal, A. (1997). Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms, in C. DeWitt-Morette, P. Cartier and A. Folacci (eds), Functional Integration: Basics and Applications, Springer US, Boston, MA, pp. 131–192.
  • Walker (2007) Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices, Communications in Statistics-Simulation and Computation 36(1): 45–54.
  • Zanella (2020) Zanella, G. (2020). Informed proposals for local MCMC in discrete spaces, J. Amer. Statist. Assoc. 115(530): 852–865.

Appendix

Appendix A Proof of Theorem 1

Lemma A.1.

Let mm be a random variable taking values in {1,2,}{}\{1,2,\ldots\}\cup\{\infty\} and let (p~j)=(p~1,,p~m)(\tilde{p}_{j})=(\tilde{p}_{1},\ldots,\tilde{p}_{m}) be a sequence in [0,1][0,1] with j=1mp~j=1\sum_{j=1}^{m}\tilde{p}_{j}=1. Let π:kk[0,1]\pi:\bigcup_{k\in{\mathbb{N}}}{\mathbb{N}}^{k}\to[0,1] be defined by

π(n1,,nk)=𝔼[j=1kp~jnj1j=1k1(1l=1jp~l)].\pi(n_{1},\ldots,n_{k})=\mathbb{E}\bigg{[}\prod_{j=1}^{k}\tilde{p}_{j}^{n_{j}-1}\prod_{j=1}^{k-1}\bigg{(}1-\sum_{l=1}^{j}\tilde{p}_{l}\bigg{)}\bigg{]}.

Then (p~j)(\tilde{p}_{j}) is invariant under size-biased permutations if and only if π\pi is a symmetric function of (n1,,nk)(n_{1},\ldots,n_{k}).

The proof of Lemma A.1 can be found in Pitman (1995, 1996a). Actually, Pitman derived it more generally for a sequence (p~j)j=1(\tilde{p}_{j})_{j=1}^{\infty} taking values in the infinite dimensional simplex Δ={(wj)j=1:wj0,j=1wj=1}\Delta_{\infty}=\left\{(w_{j})_{j=1}^{\infty}:w_{j}\geq 0,\sum_{j=1}^{\infty}w_{j}=1\right\}. The statement in Lemma A.1 easily follows by transforming (p~1,,p~m)(\tilde{p}_{1},\ldots,\tilde{p}_{m}) into a sequence in Δ\Delta_{\infty} by appending zeros, i.e. p~j=0\tilde{p}_{j}=0 for j>mj>m. For simplicity, we first take Lemma A.1 for granted; later, in Remark A.1, we explain how to derive a self-contained proof.

Proof of Theorem 1:

(Sufficiency): Assume (θi)(\theta_{i}) is a species sampling sequence driven by the species sampling model P=j=1mpjδxjP=\sum_{j=1}^{m}p_{j}\delta_{x_{j}}. By de Finetti’s theorem,

limn1ni=1nδθi=j=1mpjδxj,\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\delta_{\theta_{i}}=\sum_{j=1}^{m}p_{j}\delta_{x_{j}},

almost surely. As j=1mpj=1\sum_{j=1}^{m}p_{j}=1 and pj>0p_{j}>0, we get that outside a \mathbb{P}-null event, θi{x1,,xm}\theta_{i}\in\{x_{1},\ldots,x_{m}\} for every i1i\geq 1, and for each jmj\leq m there exists i1i\geq 1 such that θi=xj\theta_{i}=x_{j}. This, together with the diffuseness of ν\nu, yields that (θi)(\theta_{i}) exhibits exactly mm distinct values, x1,,xmx_{1},\ldots,x_{m}, almost surely. This means that we can define (αj)=(α1,,αm)(\alpha_{j})=(\alpha_{1},\ldots,\alpha_{m}) given by αl=j\alpha_{l}=j if and only if θMl=xj\theta_{M_{l}}=x_{j}, recalling that M1=1M_{1}=1 and for 2lm2\leq l\leq m, Ml=min{i>Ml1:θi{θM1,,θMl1}}M_{l}=\min\{i>M_{l-1}:\theta_{i}\not\in\{\theta_{M_{1}},\ldots,\theta_{M_{l-1}}\}\}. This way, x~j=xαj\tilde{x}_{j}=x_{\alpha_{j}} is the jjth distinct value of (θi)(\theta_{i}) in order of appearance. Next we prove that (αj)(\alpha_{j}) satisfies equation (3) in the main document. To this aim, note that PP is {(pj),(xj)}\{(p_{j}),(x_{j})\}-measurable and vice versa. Moreover, (θM1,,θMl)(\theta_{M_{1}},\ldots,\theta_{M_{l}}) is {(α1,,αl),(xj)}\{(\alpha_{1},\dots,\alpha_{l}),(x_{j})\}-measurable and (α1,,αl)(\alpha_{1},\dots,\alpha_{l}) is {(θM1,,θMl),(xj)}\{(\theta_{M_{1}},\ldots,\theta_{M_{l}}),(x_{j})\}-measurable. As (θi)(\theta_{i}) is conditionally iid from PP, this implies

[α1=j(pj),(xj)]=[θ1=xjP]=pj\mathbb{P}[\alpha_{1}=j\mid(p_{j}),(x_{j})]=\mathbb{P}[\theta_{1}=x_{j}\mid P]=p_{j}

and for 2lm2\leq l\leq m,

[αl=j(pj),(xj),α1,,αl1]\displaystyle\mathbb{P}[\alpha_{l}=j\mid(p_{j}),(x_{j}),\alpha_{1},\ldots,\alpha_{l-1}] =[θMl=xjP,θM1,,θMl1]\displaystyle=\mathbb{P}[\theta_{M_{l}}=x_{j}\mid P,\theta_{M_{1}},\ldots,\theta_{M_{l-1}}]
pj𝟙{xj{θM1,,θMl1}}\displaystyle\propto p_{j}\mathds{1}_{\{x_{j}\not\in\{\theta_{M_{1}},\ldots,\theta_{M_{l-1}}\}\}}

That is,

[αl=jm,(pj),(xj),α1,,αl1]=pj1i=1l1pαi𝟙{j{α1,,αl1}}.\mathbb{P}[\alpha_{l}=j\mid m,(p_{j}),(x_{j}),\alpha_{1},\ldots,\alpha_{l-1}]=\frac{p_{j}}{1-\sum_{i=1}^{l-1}p_{\alpha_{i}}}\mathds{1}_{\{j\not\in\{\alpha_{1},\ldots,\alpha_{l-1}\}\}}.

Hence, given (pj)(p_{j}), (αj)(\alpha_{j}) satisfies (3). Moreover, as (pj)(p_{j}) is independent of (xj)(x_{j}), we also get that (αj)(\alpha_{j}) is independent of (xj)(x_{j}), which are iid from ν\nu. This yields that (x~j)=(xαj)(\tilde{x}_{j})=(x_{\alpha_{j}}) are also iid from ν\nu, so i is proved.

Using de Finetti’s theorem once more, we get

limnj=1kn|{in:θi=x~j}|nδx~j=limn1ni=1nδθi=j=1mpjδxj=j=1mpαjδxαj,\lim_{n\to\infty}\sum_{j=1}^{k_{n}}\frac{|\{i\leq n:\theta_{i}=\tilde{x}_{j}\}|}{n}\delta_{\tilde{x}_{j}}=\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\delta_{\theta_{i}}=\sum_{j=1}^{m}p_{j}\delta_{x_{j}}=\sum_{j=1}^{m}p_{\alpha_{j}}\delta_{x_{\alpha_{j}}},

where knk_{n} is the number of distinct values in {θ1,,θn}\{\theta_{1},\ldots,\theta_{n}\}. Since the directing random measure of an exchangeable sequence is unique almost surely, this assures knmk_{n}\to m, and for jmj\leq m, the long run proportion of indexes ii such that θi=x~j=xαj\theta_{i}=\tilde{x}_{j}=x_{\alpha_{j}} is

p~j=limn|{in:θi=x~j}|n=pαj,\tilde{p}_{j}=\lim_{n\to\infty}\frac{|\{i\leq n:\theta_{i}=\tilde{x}_{j}\}|}{n}=p_{\alpha_{j}},

almost surely. As (3) holds for (αj)(\alpha_{j}), (p~j)=(pαj)(\tilde{p}_{j})=({p}_{\alpha_{j}}) is a size-biased permutation of (pj)(p_{j}), which yields ii.

As for iii, first note that by definition, x~1,,x~kn\tilde{x}_{1},\ldots,\tilde{x}_{k_{n}} are the knk_{n} distinct values in order of appearance in {θ1,,θn}\{\theta_{1},\ldots,\theta_{n}\}, for every n1n\geq 1, in particular θ1=x~1\theta_{1}=\tilde{x}_{1}. Now, as (θi)(\theta_{i}) is conditionally iid from P=j=1mpjδxj=j=1mp~jδx~jP=\sum_{j=1}^{m}p_{j}\delta_{x_{j}}=\sum_{j=1}^{m}\tilde{p}_{j}\delta_{\tilde{x}_{j}}, we get that for each n1n\geq 1 and every jknj\leq k_{n},

[θn+1=x~j(p~j),(x~j),θ1,,θn]=p~j.\mathbb{P}[\theta_{n+1}=\tilde{x}_{j}\mid(\tilde{p}_{j}),(\tilde{x}_{j}),\theta_{1},\ldots,\theta_{n}]=\tilde{p}_{j}.

Thus,

[θn+1{x~1,,x~kn}(p~j),(x~j),θ1,,θn]=1j=1knp~j.\mathbb{P}[\theta_{n+1}\not\in\{\tilde{x}_{1},\ldots,\tilde{x}_{k_{n}}\}\mid(\tilde{p}_{j}),(\tilde{x}_{j}),\theta_{1},\ldots,\theta_{n}]=1-\sum_{j=1}^{k_{n}}\tilde{p}_{j}.

By definition, under the event θn+1{x~1,,x~kn}\theta_{n+1}\not\in\{\tilde{x}_{1},\ldots,\tilde{x}_{k_{n}}\} we must have θn+1=x~kn+1\theta_{n+1}=\tilde{x}_{k_{n}+1}, i.e.

[θn+1(x~j),(p~j),θ1,,θn]=j=1knp~jδx~j+(1j=1knp~j)δx~kn+1.\mathbb{P}[\theta_{n+1}\in\cdot\mid(\tilde{x}_{j}),(\tilde{p}_{j}),\theta_{1},\ldots,\theta_{n}]=\sum_{j=1}^{k_{n}}\tilde{p}_{j}\delta_{\tilde{x}_{j}}+\bigg{(}1-\sum_{j=1}^{k_{n}}\tilde{p}_{j}\bigg{)}\delta_{\tilde{x}_{k_{n}+1}}.

As for iv, first note that (p~j)(\tilde{p}_{j}) is {(pj),(αj)}\{(p_{j}),(\alpha_{j})\}-measurable and (pj)(p_{j}) is {(p~j),(αj)}\{(\tilde{p}_{j}),(\alpha_{j})\}-measurable. The last assertion relies on pj=p~αj1p_{j}=\tilde{p}_{\alpha^{-1}_{j}}, where (αj1)(\alpha^{-1}_{j}) is the inverse permutation of (αj)(\alpha_{j}). From the proof of i and the hypothesis, we have that (αj)(\alpha_{j}), (pj)(p_{j}) and mm are independent of (xj)(x_{j}). Thus, for every B(𝕏)B\in\mathcal{B}(\mathbb{X}),

[x~jBm,(p~j),(αj)]=[x~jBm,(pj),(αj)]=[xαjBm,(pj),(αj)]=ν(B),\mathbb{P}[\tilde{x}_{j}\in B\mid m,(\tilde{p}_{j}),(\alpha_{j})]=\mathbb{P}[\tilde{x}_{j}\in B\mid m,(p_{j}),(\alpha_{j})]=\mathbb{P}[x_{\alpha_{j}}\in B\mid m,(p_{j}),(\alpha_{j})]=\nu(B),

which proves iv.

(Necessity): Assume i–iv hold. We first prove that (θi)(\theta_{i}) is exchangeable. Fix n1n\geq 1 and define the random partition Πn\Pi_{n} of [n]={1,,n}[n]=\{1,\ldots,n\} generated by the random equivalence relation iji\sim j if and only if θi=θj\theta_{i}=\theta_{j}. In other words, Πn={D1,,Dkn}\Pi_{n}=\{D_{1},\ldots,D_{k_{n}}\} where Dj={in:θi=x~j}D_{j}=\{i\leq n:\theta_{i}=\tilde{x}_{j}\}. Using iii, a simple counting argument implies

\mathbb{P}[\Pi_{n}=\{A_{1},\ldots,A_{k}\}\mid(\tilde{p}_{j}),(\tilde{x}_{j})]=\prod_{j=1}^{k}\tilde{p}_{j}^{n_{j}-1}\prod_{j=1}^{k-1}\bigg{(}1-\sum_{l=1}^{j}\tilde{p}_{l}\bigg{)}, (A1)

for every partition {A1,,Ak}\{A_{1},\ldots,A_{k}\} of [n][n], and where nj=|Aj|n_{j}=|A_{j}|. Taking expectations in (A1),

\mathbb{P}[\Pi_{n}=\{A_{1},\ldots,A_{k}\}]=\mathbb{E}\bigg{[}\prod_{j=1}^{k}\tilde{p}_{j}^{n_{j}-1}\prod_{j=1}^{k-1}\bigg{(}1-\sum_{l=1}^{j}\tilde{p}_{l}\bigg{)}\bigg{]}.

By ii and Lemma A.1, the function

(n_{1},\ldots,n_{k})\mapsto\pi(n_{1},\ldots,n_{k})=\mathbb{E}\bigg{[}\prod_{j=1}^{k}\tilde{p}_{j}^{n_{j}-1}\prod_{j=1}^{k-1}\bigg{(}1-\sum_{l=1}^{j}\tilde{p}_{l}\bigg{)}\bigg{]}

is symmetric. This shows that [Πn={A1,,Ak}]=π(n1,,nk)\mathbb{P}[\Pi_{n}=\{A_{1},\ldots,A_{k}\}]=\pi(n_{1},\ldots,n_{k}) depends at most on the number of blocks kk of {A1,,Ak}\{A_{1},\ldots,A_{k}\} and on the frequencies, n1,,nkn_{1},\ldots,n_{k}, of each block, through a symmetric function. In other words, Πn\Pi_{n} is exchangeable, in the sense that for every permutation ρ\rho of [n][n], Πn\Pi_{n} is equal in distribution to ρ(Πn)\rho(\Pi_{n}), where

ρ(Πn)={ρ(Dj):DjΠn} and ρ(Dj)={ρ(i):iDj}.\rho(\Pi_{n})=\{\rho(D_{j}):D_{j}\in\Pi_{n}\}\quad\text{ and }\quad\rho(D_{j})=\{\rho(i):i\in D_{j}\}.

Now, fix B1,,Bn(𝕏)B_{1},\ldots,B_{n}\in\mathcal{B}(\mathbb{X}) and note

[θρ(1)B1,,θρ(n)BnΠn]\displaystyle\mathbb{P}[\theta_{\rho(1)}\in B_{1},\ldots,\theta_{\rho(n)}\in B_{n}\mid\Pi_{n}] =[x~jBl,lρ(Dj),jknΠn]=j=1knν(iρ(Dj)Bi).\displaystyle=\mathbb{P}[\tilde{x}_{j}\in B_{l},\forall\,l\in\rho(D_{j}),j\leq k_{n}\mid\Pi_{n}]=\prod_{j=1}^{k_{n}}\nu\bigg{(}\bigcap_{i\in\rho(D_{j})}B_{i}\bigg{)}.

The last equality follows from the fact that (x~j)(\tilde{x}_{j}) are iid from ν\nu, and x~j\tilde{x}_{j} is independent of mm and (p~j)(\tilde{p}_{j}), which together with (A1) imply x~j\tilde{x}_{j} is independent of Πn\Pi_{n}. By taking expectations in the last equation we find,

[θρ(1)B1,,θρ(n)Bn]=𝔼[j=1knν(iρ(Dj)Bi)].\mathbb{P}[\theta_{\rho(1)}\in B_{1},\ldots,\theta_{\rho(n)}\in B_{n}]=\mathbb{E}\bigg{[}\prod_{j=1}^{k_{n}}\nu\bigg{(}\bigcap_{i\in\rho(D_{j})}B_{i}\bigg{)}\bigg{]}.

As Πn\Pi_{n} is exchangeable,

𝔼[j=1knν(iρ(Dj)Bi)]=𝔼[j=1knν(iDjBi)],\mathbb{E}\bigg{[}\prod_{j=1}^{k_{n}}\nu\bigg{(}\bigcap_{i\in\rho(D_{j})}B_{i}\bigg{)}\bigg{]}=\mathbb{E}\bigg{[}\prod_{j=1}^{k_{n}}\nu\bigg{(}\bigcap_{i\in D_{j}}B_{i}\bigg{)}\bigg{]},

hence

[θρ(1)B1,,θρ(n)Bn]=[θ1B1,,θnBn],\mathbb{P}[\theta_{\rho(1)}\in B_{1},\ldots,\theta_{\rho(n)}\in B_{n}]=\mathbb{P}[\theta_{1}\in B_{1},\ldots,\theta_{n}\in B_{n}],

which proves that (θi)(\theta_{i}) is exchangeable. Finally, by de Finetti’s theorem we know that the directing random measure of (θi)(\theta_{i}) is given by

P=\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\delta_{\theta_{i}}=\lim_{n\to\infty}\sum_{j=1}^{k_{n}}\frac{|\{i\leq n:\theta_{i}=\tilde{x}_{j}\}|}{n}\delta_{\tilde{x}_{j}},

and by i and ii we conclude

P=j=1mp~jδx~j=j=1mpαjδxαj=j=1mpjδxj.P=\sum_{j=1}^{m}\tilde{p}_{j}\delta_{\tilde{x}_{j}}=\sum_{j=1}^{m}p_{\alpha_{j}}\delta_{x_{\alpha_{j}}}=\sum_{j=1}^{m}p_{j}\delta_{x_{j}}.

Remark A.1.

The sufficiency of Lemma A.1, which we require to prove the necessity of Theorem 1, can be easily derived using the sufficiency of Theorem 1, thus providing a self-contained proof of Theorem 1. Namely, in the context of Lemma A.1 let (p~j)(\tilde{p}_{j}) be invariant under size-biased permutations, and let (pj)=(p1,,pm)(p^{\prime}_{j})=(p^{\prime}_{1},\ldots,p^{\prime}_{m^{\prime}}) be any sequence of weights whose size-biased permutation has the law of (p~j)(\tilde{p}_{j}). Then we can construct a species sampling model P=j=1mpjδxjP^{\prime}=\sum_{j=1}^{m^{\prime}}p^{\prime}_{j}\delta_{x^{\prime}_{j}} over a Borel space (𝕏,(𝕏))(\mathbb{X},\mathcal{B}(\mathbb{X})) and a species sampling sequence (θi)(\theta^{\prime}_{i}) driven by PP^{\prime}. By the sufficiency of Theorem 1, we get that for every n1n\geq 1,

\mathbb{P}[\theta^{\prime}_{n+1}\in\cdot\mid(\tilde{p}^{\prime}_{j}),(\tilde{x}^{\prime}_{j}),\theta^{\prime}_{1},\ldots,\theta^{\prime}_{n}]=\sum_{j=1}^{k^{\prime}_{n}}\tilde{p}^{\prime}_{j}\delta_{\tilde{x}^{\prime}_{j}}+\bigg{(}1-\sum_{j=1}^{k^{\prime}_{n}}\tilde{p}^{\prime}_{j}\bigg{)}\delta_{\tilde{x}^{\prime}_{k^{\prime}_{n}+1}}, (A2)

where (x~j)(\tilde{x}^{\prime}_{j}) are the distinct values that (θi)(\theta^{\prime}_{i}) exhibits in order of appearance, and (p~j)(\tilde{p}^{\prime}_{j}) denotes the size-biased permutation of (pj)(p^{\prime}_{j}). Now, let Πn\Pi^{\prime}_{n} be the random partition of [n][n] generated by the random equivalence relation iji\sim j if and only if θi=θj\theta^{\prime}_{i}=\theta^{\prime}_{j}. Then, by construction Πn\Pi^{\prime}_{n} is exchangeable, and a simple counting argument, using (A2), yields

\mathbb{P}[\Pi^{\prime}_{n}=\{A_{1},\ldots,A_{k}\}]=\mathbb{E}\bigg{[}\prod_{j=1}^{k}\big{(}\tilde{p}^{\prime}_{j}\big{)}^{n_{j}-1}\prod_{j=1}^{k-1}\bigg{(}1-\sum_{l=1}^{j}\tilde{p}^{\prime}_{l}\bigg{)}\bigg{]},

where nj=|Aj|n_{j}=|A_{j}|. Since (p~j)(\tilde{p}^{\prime}_{j}) is equal in distribution to (p~j)(\tilde{p}_{j}) and Πn\Pi^{\prime}_{n} is exchangeable, we conclude

\pi(n_{1},\ldots,n_{k})=\mathbb{E}\bigg{[}\prod_{j=1}^{k}\tilde{p}_{j}^{n_{j}-1}\prod_{j=1}^{k-1}\bigg{(}1-\sum_{l=1}^{j}\tilde{p}_{l}\bigg{)}\bigg{]}=\mathbb{E}\bigg{[}\prod_{j=1}^{k}\big{(}\tilde{p}^{\prime}_{j}\big{)}^{n_{j}-1}\prod_{j=1}^{k-1}\bigg{(}1-\sum_{l=1}^{j}\tilde{p}^{\prime}_{l}\bigg{)}\bigg{]}

is a symmetric function of (n1,,nk)(n_{1},\ldots,n_{k}).

Appendix B Mixtures of finite mixtures with Gnedin (2010) prior on mm

In this section we illustrate how to update the model dimension mm by sampling from (15). We consider a mixture model with symmetric Dirichlet weights (p1,,pm)𝖣𝗂𝗋(1,,1)(p_{1},\ldots,p_{m})\sim\mathsf{Dir}(1,\ldots,1), and random mm with prior distribution

𝕡[m]=λ(1λ)m1m!,\mathbbm{p}[m]=\frac{\lambda(1-\lambda)_{m-1}}{m!},

where λ(0,1)\lambda\in(0,1) is a known constant. As mentioned in Section 2, the size-biased permuted weights (p~j)(\tilde{p}_{j}) admit stick-breaking representation p~1=v1\tilde{p}_{1}=v_{1}, and p~j=vji=1j1(1vi)\tilde{p}_{j}=v_{j}\prod_{i=1}^{j-1}(1-v_{i}) for independent random variables, vj𝖡𝖾(1σ,θ+jσ)v_{j}\sim\mathsf{Be}(1-\sigma,\theta+j\sigma) where σ=1\sigma=-1 and θ=m\theta=m. Thus, the ordered allocation sampler as derived in Section 3.1 can be used to implement this model.

First note that using (4) and the stick-breaking decomposition of (p~j)(\tilde{p}_{j}), we can compute the conditional EPPF given mm:

π(n1,,nknm)=j=1kn1(θ+jσ)(θ+1)n1j=1kn(1σ)nj1=(mkn+1)kn1(m+1)n1j=1knnj!\pi(n_{1},\ldots,n_{k_{n}}\mid m)=\frac{\prod_{j=1}^{k_{n}-1}(\theta+j\sigma)}{(\theta+1)_{n-1}}\prod_{j=1}^{k_{n}}(1-\sigma)_{n_{j}-1}=\frac{(m-k_{n}+1)_{k_{n}-1}}{(m+1)_{n-1}}\prod_{j=1}^{k_{n}}n_{j}!

where n=j=1knnjn=\sum_{j=1}^{k_{n}}n_{j} and (z)r=i=0r1(z+i)(z)_{r}=\prod_{i=0}^{r-1}(z+i), using the convention that the empty product equals one. Hence, (15) simplifies to

𝕡[m](mkn+1)kn1(m+1)n1×λ(1λ)m1m!𝟙{knm}.\mathbbm{p}[\,m\mid\cdots\,]\propto\frac{(m-k_{n}+1)_{k_{n}-1}}{(m+1)_{n-1}}\times\frac{\lambda(1-\lambda)_{m-1}}{m!}\mathds{1}_{\{k_{n}\leq m\}}.

Following Gnedin (2010), we obtain

m=kn(mkn+1)kn1(m+1)n1λ(1λ)m1m!=(kn1)!(1λ)kn1(λ)nkn(n1)!(1+λ)n1.\sum_{m=k_{n}}^{\infty}\frac{(m-k_{n}+1)_{k_{n}-1}}{(m+1)_{n-1}}\frac{\lambda(1-\lambda)_{m-1}}{m!}=\frac{(k_{n}-1)!(1-\lambda)_{k_{n}-1}(\lambda)_{n-k_{n}}}{(n-1)!(1+\lambda)_{n-1}}.

Thus, we can explicitly compute

𝕡[m\displaystyle\mathbbm{p}[\,m\mid ]=λ(1λ)m1(mkn+1)kn1(m+n1)!×(n1)!(1+λ)n1(kn1)!(1λ)kn1(λ)nkn𝟙{knm}.\displaystyle\cdots\,]=\frac{{\lambda}(1-\lambda)_{m-1}(m-k_{n}+1)_{k_{n}-1}}{(m+n-1)!}\times\frac{(n-1)!(1+\lambda)_{n-1}}{(k_{n}-1)!(1-\lambda)_{k_{n}-1}(\lambda)_{n-k_{n}}}\mathds{1}_{\{k_{n}\leq m\}}.

In particular, using notation qr=𝕡[m=r]q_{r}=\mathbbm{p}[\,m=r\mid\cdots\,]

qkn=(λ+nkn)kn(n)kn,q_{k_{n}}=\frac{(\lambda+n-k_{n})_{k_{n}}}{(n)_{k_{n}}},

and recursively for rknr\geq k_{n},

qr+1=qrr(rλ)(rkn+1)(r+n).q_{r+1}=q_{r}\frac{r(r-\lambda)}{(r-k_{n}+1)(r+n)}.

Thus, to update mm, sample u𝖴𝗇𝗂𝖿(0,1)u\sim\mathsf{Unif}(0,1), and set m=rm=r when l=knr1ql<ul=knrql\sum_{l=k_{n}}^{r-1}q_{l}<u\leq\sum_{l=k_{n}}^{r}q_{l}.
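As an illustration, a minimal sketch of this update (in Python, with illustrative names such as sample_m; not taken from the paper's implementation) is the following. It builds the CDF term by term from the closed form of q_{k_n} and the recursion above, and inverts it at a uniform draw.

```python
import numpy as np

def sample_m(k_n, n, lam, rng=np.random.default_rng()):
    """Draw m from p[m | ...] under the Gnedin prior by inverting its CDF."""
    u = rng.uniform()
    # q_{k_n} = (lam + n - k_n)_{k_n} / (n)_{k_n}, with (z)_r the ascending factorial
    q = np.prod([(lam + n - k_n + i) / (n + i) for i in range(k_n)])
    r, cum = k_n, q
    while cum < u:
        # recursion: q_{r+1} = q_r * r (r - lam) / ((r - k_n + 1)(r + n))
        q *= r * (r - lam) / ((r - k_n + 1) * (r + n))
        r += 1
        cum += q
    return r
```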

Appendix C Geometric and exchangeable stick-breaking processes

To illustrate the ordered allocation sampler derived in Section 3.2 we chose two species sampling mixing priors for which the law of (p~j)(\tilde{p}_{j}) is not available. These are the geometric process and the exchangeable stick-breaking process. The geometric process (Fuentes-García et al.; 2010) is a species sampling model P=j=1pjδxjP=\sum_{j=1}^{\infty}p_{j}\delta_{x_{j}} with decreasingly ordered weights, (pj)(p_{j}), given by pj=v(1v)j1p_{j}=v(1-v)^{j-1} where vv is a random variable taking values in (0,1)(0,1). The exchangeable stick-breaking process (Gil–Leyva and Mena; 2021) instead has weights pj=vjl=1j1(1vl)p_{j}=v_{j}\prod_{l=1}^{j-1}(1-v_{l}), where (vj)=(vj)j=1(v_{j})=(v_{j})_{j=1}^{\infty} is an exchangeable sequence with values in (0,1)(0,1). Here we consider (vj)(v_{j}) to be a species sampling sequence driven by a Dirichlet process PP^{\prime} over ([0,1],([0,1]))([0,1],\mathcal{B}([0,1])), with total mass parameter θ\theta^{\prime} and base measure ν=𝖡𝖾(a,b)\nu^{\prime}=\mathsf{Be}(a,b). Next we refer to P=j=1pjδxjP=\sum_{j=1}^{\infty}p_{j}\delta_{x_{j}} as Dirichlet driven exchangeable stick-breaking process.

To fully specialize the ordered allocation sampler in Section 3.2 for these two mixing priors it is enough to explain how to update (pj)(p_{j}) via sampling (vj)(v_{j}) from

𝕡[(vj)]j=1α¯vjrj(1vj)Rj𝕡[(vj)],\mathbbm{p}[\,(v_{j})\mid\cdots\,]\propto\prod_{j=1}^{\overline{\alpha}}v_{j}^{r_{j}}(1-v_{j})^{R_{j}}\mathbbm{p}[(v_{j})], (C3)

where rj=l=1knnl𝟙{αl=j}r_{j}=\sum_{l=1}^{k_{n}}n_{l}\mathds{1}_{\{\alpha_{l}=j\}}, Rj=l>jrlR_{j}=\sum_{l>j}r_{l}, α¯=max{α1,,αkn}\overline{\alpha}=\max\{\alpha_{1},\ldots,\alpha_{k_{n}}\}, and “ \cdots ” refers to all the random variables involved excluding (αj)j>kn(\alpha_{j})_{j>k_{n}} and (pj)(p_{j}). Note that by excluding (pj)(p_{j}) we are also excluding (vj)(v_{j}) because these two sequences characterize each other. It is worth noting that the following description can be readily adapted to the updating of (vj)(v_{j}) in the slice-efficient sampler.

For the geometric process we have that vj=vv_{j}=v for every j1j\geq 1, hence it suffices to update vv from

𝕡[v]vn(1v)i=1ncin×𝕡[v],\mathbbm{p}[\,v\mid\cdots\,]\propto v^{n}(1-v)^{\sum_{i=1}^{n}c_{i}-n}\times\mathbbm{p}[v],

where ci=αdic_{i}=\alpha_{d_{i}} for each ini\leq n. In particular if v𝖡𝖾(a,b)v\sim\mathsf{Be}(a,b) a priori, then we update v𝖡𝖾(a+n,b+i=1ncin)v\sim\mathsf{Be}(a+n,b+\sum_{i=1}^{n}c_{i}-n).
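A minimal sketch of this conjugate update, with illustrative names and assuming v∼Be(a,b) a priori, is the following.

```python
import numpy as np

def update_v_geometric(c, a=1.0, b=1.0, rng=np.random.default_rng()):
    """One Gibbs update of v for the geometric process, where c[i] = alpha_{d_i}."""
    n = len(c)
    return rng.beta(a + n, b + sum(c) - n)
```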

Now, for Dirichlet driven exchangeable stick-breaking processes, the updating of (vj)(v_{j}) is more delicate due to the non-trivial dependence among elements in (vj)(v_{j}). We will first focus on updating (vj)jα¯(v_{j})_{j\leq\overline{\alpha}}. To this aim note that since (vj)(v_{j}) is a species sampling sequence driven by a Dirichlet process with total mass parameter θ\theta^{\prime} and base measure ν=𝖡𝖾(a,b)\nu^{\prime}=\mathsf{Be}(a,b), we can compute

\mathbbm{p}[(v_{j})_{j\leq\overline{\alpha}}]=\frac{(\theta^{\prime})^{k_{\overline{\alpha}}}}{(\theta^{\prime})_{\overline{\alpha}}}\prod_{l=1}^{k_{\overline{\alpha}}}(m_{l}-1)!\,\mathsf{Be}(v^{*}_{l}\mid a,b),

where (vl)=(v1,,vkα¯)(v^{*}_{l})=(v^{*}_{1},\ldots,v^{*}_{k_{\overline{\alpha}}}) are the distinct values that (vj)jα¯(v_{j})_{j\leq\overline{\alpha}} exhibits, ml=|El|m_{l}=|E_{l}| and El={jα¯:vj=vl}={jα¯:ej=l}E_{l}=\{j\leq\overline{\alpha}:v_{j}=v^{*}_{l}\}=\{j\leq\overline{\alpha}:e_{j}=l\}, with ej=le_{j}=l if and only if vj=vlv_{j}=v^{*}_{l} (cf. Pitman; 1996b; Neal; 2000). Thus (C3) yields

\mathbbm{p}[\,(v^{*}_{l}),(e_{j})\mid\cdots\,]\propto\frac{(\theta^{\prime})^{k_{\overline{\alpha}}}}{(\theta^{\prime})_{\overline{\alpha}}}\prod_{l=1}^{k_{\overline{\alpha}}}(m_{l}-1)!\,(v^{*}_{l})^{\sum_{j\in E_{l}}r_{j}}(1-v^{*}_{l})^{\sum_{j\in E_{l}}R_{j}}\mathsf{Be}(v^{*}_{l}\mid a,b).

Now to update (vj)(v_{j}) we can first sample (vl)(v^{*}_{l}) from

𝕡[(vl)(ej),]l=1kα¯𝖡𝖾(vl|a+jElrj,b+jElRj),\mathbbm{p}[\,(v^{*}_{l})\mid(e_{j}),\cdots\,]\propto\prod_{l=1}^{k_{\overline{\alpha}}}\mathsf{Be}\bigg{(}v^{*}_{l}\,\bigg{|}\,a+\sum_{j\in E_{l}}r_{j},b+\sum_{j\in E_{l}}R_{j}\bigg{)},

which is a product of independent Beta distributions. Afterwards, for each jα¯j\leq\overline{\alpha}, we can update which value vjv_{j} takes among the ones observed in the rest of the vjv_{j}’s, or whether it takes a new unobserved value. Say that (vl)j=(v1,,vkj)(v^{*}_{l})_{-j}=(v^{*}_{1},\ldots,v^{*}_{k_{-j}}) are the distinct values in (vi:iα¯,ij)(v_{i}:i\leq\overline{\alpha},i\neq j), in no particular order, and assume without loss of generality that ei=le_{i}=l if and only if vi=vlv_{i}=v^{*}_{l} for each iji\neq j. Then it is enough to sample from

𝕡[ej=e(ei)ij,(vl)j,]{me(ve)rj(1ve)Rj if e{1,,kj}θvrj(1v)Rj𝖡𝖾(dva,b) if e=kj+10otherwise,\mathbbm{p}[e_{j}=e\mid(e_{i})_{i\neq j},(v^{*}_{l})_{-j},\cdots\,]\propto\begin{cases}m_{e}(v^{*}_{e})^{r_{j}}(1-v^{*}_{e})^{R_{j}}&\text{ if }e\in\{1,\ldots,k_{-j}\}\\ \theta^{\prime}\int v^{r_{j}}(1-v)^{R_{j}}\mathsf{Be}(dv\mid a,b)&\text{ if }e=k_{-j}+1\\ 0&\text{otherwise},\end{cases}

where me=|{ij:ei=e}|m_{e}=|\{i\neq j:e_{i}=e\}| and

\int v^{r_{j}}(1-v)^{R_{j}}\mathsf{Be}(dv\mid a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\frac{\Gamma(r_{j}+a)\Gamma(R_{j}+b)}{\Gamma(r_{j}+R_{j}+a+b)}.

If the updated value ej{1,,kj}e_{j}\in\{1,\ldots,k_{-j}\}, we simply set vj=vejv_{j}=v^{*}_{e_{j}}; otherwise, if ej=kj+1e_{j}=k_{-j}+1, we sample vj𝖡𝖾(a+rj,b+Rj)v_{j}\sim\mathsf{Be}(a+r_{j},b+R_{j}). Once we have updated (vj)jα¯(v_{j})_{j\leq\overline{\alpha}}, we can update vjv_{j} for j>α¯j>\overline{\alpha} by sampling sequentially from 𝕡[vj(vi)i<j,]\mathbbm{p}[\,v_{j}\mid(v_{i})_{i<j},\cdots\,], which happens to coincide with the prior prediction rule 𝕡[vj(vi)i<j]\mathbbm{p}[\,v_{j}\mid(v_{i})_{i<j}\,]. As (vj)(v_{j}) is a species sampling sequence driven by a Dirichlet process, PP^{\prime}, with total mass parameter θ\theta^{\prime} and base measure ν=𝖡𝖾(a,b)\nu^{\prime}=\mathsf{Be}(a,b), it is well known that

\mathbb{P}[v_{j}\in\cdot\mid(v_{i})_{i<j}]=\sum_{l=1}^{k_{j-1}}\frac{m_{l}}{\theta^{\prime}+j-1}\delta_{v^{*}_{l}}+\frac{\theta^{\prime}}{\theta^{\prime}+j-1}\nu^{\prime},

where v1,,vkj1v^{*}_{1},\ldots,v^{*}_{k_{j-1}} are the distinct values in (vi)i<j(v_{i})_{i<j}, and ml=|{i<j:vi=vl}|m_{l}=|\{i<j:v_{i}=v^{*}_{l}\}| (cf. Pitman; 1996b). Thus updating vjv_{j} for j>α¯j>\overline{\alpha} is easy, and it will be required only for a few j>α¯j>\overline{\alpha}.
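To fix ideas, the following minimal sketch (Python, illustrative names, 0-based labels; not the authors' code) outlines the two steps just described: a Beta draw for each distinct value v*_l, followed by an Algorithm 2-type update of each label e_j. It assumes the statistics r_j and R_j entering (C3) have already been computed.

```python
import numpy as np
from scipy.special import betaln

def update_v_esb(r, R, e, v_star, theta_p, a, b, rng=np.random.default_rng()):
    """r[j], R[j]: exponents of v_{j+1} and (1 - v_{j+1}) in (C3); e[j]: 0-based label
    of v_{j+1}; v_star: list of distinct values. Returns updated (v, v_star, e)."""
    A = len(r)  # A plays the role of alpha_bar
    # (i) update each distinct value v*_l from its Beta full conditional
    for l in range(len(v_star)):
        members = [j for j in range(A) if e[j] == l]
        v_star[l] = rng.beta(a + sum(r[j] for j in members),
                             b + sum(R[j] for j in members))
    # (ii) update the label e_j of each v_j, allowing for a new unobserved value
    for j in range(A):
        e[j] = -1  # temporarily remove v_j from its cluster
        counts = np.array([sum(ei == l for ei in e) for l in range(len(v_star))])
        keep = counts > 0
        v_star = [v for v, k in zip(v_star, keep) if k]   # drop empty clusters
        relabel = np.cumsum(keep) - 1
        e = [int(relabel[ei]) if ei >= 0 else -1 for ei in e]
        counts = counts[keep]
        # log-weights: m_e (v*_e)^{r_j} (1 - v*_e)^{R_j} for existing values, and
        # theta' * B(a + r_j, b + R_j) / B(a, b) for a new value drawn from Be(a, b)
        logw = np.log(counts) + r[j] * np.log(v_star) + R[j] * np.log1p(-np.array(v_star))
        logw = np.append(logw, np.log(theta_p) + betaln(a + r[j], b + R[j]) - betaln(a, b))
        w = np.exp(logw - logw.max())
        e[j] = int(rng.choice(len(w), p=w / w.sum()))
        if e[j] == len(v_star):  # a new cluster was opened
            v_star.append(rng.beta(a + r[j], b + R[j]))
    v = [v_star[e[j]] for j in range(A)]  # reconstruct v_1, ..., v_alpha_bar
    return v, v_star, e
```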

In general, this way of updating (vj)(v_{j}) for Dirichlet driven exchangeable stick-breaking processes is actually an adaptation of Algorithm 2 in Neal (2000); however, other marginal methods, such as Algorithm 8, can also be exploited. In fact, by taking into account the underlying Dirichlet process, PP^{\prime}, of (vj)(v_{j}), even a version of the slice sampler or the ordered allocation sampler could have been used. To conclude, we mention that for the simulation study in Section 4 of the main document we fixed the hyperparameters θ=1\theta^{\prime}=1 and a=b=1a=b=1 for both the geometric and the Dirichlet driven exchangeable stick-breaking models.

Appendix D Ordered allocation variables

Here we discuss the set 𝒟i\mathcal{D}_{i} of admissible moves for the updating of the ordered allocation variable did_{i}. Some general rules for determining 𝒟i\mathcal{D}_{i} can be envisioned: (i) if did_{i} is different from any other djd_{j}, that is Ddi={i}D_{d_{i}}=\{i\}, then did_{i} cannot change, unless di=knd_{i}=k_{n}; and (ii) 𝒟i{1,,ki1+1}\mathcal{D}_{i}\subset\{1,\ldots,k_{i-1}+1\}, so for larger ii there are more possible admissible moves; in particular, d1=1d_{1}=1 cannot change. As an illustration, let n=5n=5 and say that before updating (di)(d_{i}), d1,,d5d_{1},\ldots,d_{5} are such that the blocks of the partition in the least element order are D1={1,3}D_{1}=\{1,3\}, D2={2,4}D_{2}=\{2,4\} and D3={5}D_{3}=\{5\}. Clearly d1=1d_{1}=1 cannot change. As for d2d_{2}, the admissible moves are 𝒟2={1,2}\mathcal{D}_{2}=\{1,2\}, thus d2d_{2} will be sampled from

𝕡(d2=1rest)p~1g(y2x~1),𝕡(d2=2rest)p~2g(y2x~2).\mathbbm{p}(d_{2}=1\mid\mathrm{rest})\propto\tilde{p}_{1}g(y_{2}\mid\tilde{x}_{1}),\quad\mathbbm{p}(d_{2}=2\mid\text{rest})\propto\tilde{p}_{2}g(y_{2}\mid\tilde{x}_{2}).

Say that we sample d2=1d_{2}=1, so that now D1={1,2,3}D_{1}=\{1,2,3\}, D2={4}D_{2}=\{4\}, D3={5}D_{3}=\{5\}. Since the blocks must be in least element order, the admissible moves for d3d_{3} are 𝒟3={1,2}\mathcal{D}_{3}=\{1,2\}, hence d3d_{3} will be sampled from

𝕡(d3=1rest)p~1g(y3x~1),𝕡(d3=2rest)p~2g(y3x~2).\mathbbm{p}(d_{3}=1\mid\mathrm{rest})\propto\tilde{p}_{1}g(y_{3}\mid\tilde{x}_{1}),\quad\mathbbm{p}(d_{3}=2\mid\text{rest})\propto\tilde{p}_{2}g(y_{3}\mid\tilde{x}_{2}).

Assume we sample d3=1d_{3}=1, so now D1={1,2,3}D_{1}=\{1,2,3\}, D2={4}D_{2}=\{4\}, and D3={5}D_{3}=\{5\}. Given that D2D_{2} cannot be an empty set, under the current configuration the only admissible move for d4d_{4} is 𝒟4={2}\mathcal{D}_{4}=\{2\}, i.e. d4d_{4} cannot change. Finally, the admissible moves for d5d_{5} are 𝒟5={1,2,3}\mathcal{D}_{5}=\{1,2,3\}, and we will sample d5d_{5} from

𝕡(d5=1rest)p~1g(y5x~1),𝕡(d5=2rest)p~2g(y5x~2),\mathbbm{p}(d_{5}=1\mid\text{rest})\propto\tilde{p}_{1}g(y_{5}\mid\tilde{x}_{1}),\quad\mathbbm{p}(d_{5}=2\mid\text{rest})\propto\tilde{p}_{2}g(y_{5}\mid\tilde{x}_{2}),
𝕡(d5=3rest)(1p~1p~2)g(y5x~3).\mathbbm{p}(d_{5}=3\mid\text{rest})\propto(1-\tilde{p}_{1}-\tilde{p}_{2})g(y_{5}\mid\tilde{x}_{3}).

Finally, assuming that we sample d5=2d_{5}=2, the initial configuration D1={1,3}D_{1}=\{1,3\}, D2={2,4}D_{2}=\{2,4\} and D3={5}D_{3}=\{5\}, is updated to D1={1,2,3}D_{1}=\{1,2,3\} and D2={4,5}D_{2}=\{4,5\}.
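The determination of the admissible sets and the corresponding sampling step can be sketched as follows (Python, illustrative names such as update_d; not the paper's code). Admissibility of a candidate label is checked by requiring the tentative label vector to remain in least element (order of appearance) form, which reproduces the moves of the example above; p_tilde and x_tilde are assumed to contain at least k_n + 1 weights and component parameters, and g denotes the kernel density.

```python
import numpy as np

def is_ordered(d):
    """d is in least element order iff d[0] = 1 and each d[i] <= max(d[:i]) + 1."""
    m = 0
    for di in d:
        if di > m + 1:
            return False
        m = max(m, di)
    return True

def update_d(d, y, p_tilde, x_tilde, g, rng=np.random.default_rng()):
    """One Gibbs sweep over the ordered allocation variables d_1, ..., d_n."""
    d = list(d)
    n = len(d)
    for i in range(n):
        k_prev = max(d[:i], default=0)          # k_{i-1}
        moves, weights = [], []
        for j in range(1, k_prev + 2):          # candidate labels 1, ..., k_{i-1}+1
            cand = d[:i] + [j] + d[i + 1:]
            if not is_ordered(cand):            # move would break the ordering
                continue
            occupied = j in d[:i] + d[i + 1:]   # is block j occupied by other points?
            w = p_tilde[j - 1] if occupied else 1.0 - sum(p_tilde[:j - 1])
            moves.append(j)
            weights.append(w * g(y[i], x_tilde[j - 1]))
        probs = np.array(weights) / np.sum(weights)
        d[i] = moves[rng.choice(len(moves), p=probs)]
    return d
```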

In Section 3.3 we discuss an acceleration step that consists in randomly permuting the data points at each iteration of the sampler. Next we provide an example of how to modify the ordered allocation variables so as to preserve the induced clustering structure, as well as the least element order of the partition induced by the modified variables. Let us consider again, as starting values of (di)(d_{i}), the ones corresponding to the partition D1={1,3}D_{1}=\{1,3\}, D2={2,4}D_{2}=\{2,4\} and D3={5}D_{3}=\{5\}, so

(di)=(1,2,1,2,3).(d_{i})=(1,2,1,2,3).

Also consider the permutation ρ=(2,1,3,5,4)\rho=(2,1,3,5,4), i.e. ρ(1)=2,ρ(2)=1,ρ(5)=4\rho(1)=2,\rho(2)=1,\ldots\rho(5)=4, so that the permuted data set is (yi)=(yρ(i))=(y2,y1,y3,y5,y4)(y_{i}^{\prime})=(y_{\rho(i)})=(y_{2},y_{1},y_{3},y_{5},y_{4}). The ordered allocation variables, (di)(d_{i}), induce the following clustering of data points:

{y1,y3}={y2,y3},{y2,y4}={y1,y5},{y5}={y4}.\{y_{1},y_{3}\}=\{y^{\prime}_{2},y^{\prime}_{3}\},\quad\{y_{2},y_{4}\}=\{y^{\prime}_{1},y^{\prime}_{5}\},\quad\{y_{5}\}=\{y^{\prime}_{4}\}.

Thus, the original partition ({1,3},{2,4},{5})(\{1,3\},\{2,4\},\{5\}) becomes (D1,D2,D3)=({1,5},{2,3},{4})(D^{\prime}_{1},D^{\prime}_{2},D^{\prime}_{3})=(\{1,5\},\{2,3\},\{4\}) with respect to the new labeling of the observations. Let us check that the ordered allocation variables, (di)=(1,2,2,3,1)(d^{\prime}_{i})=(1,2,2,3,1), that correspond to (Dj)(D^{\prime}_{j}), can be obtained as explained in Section 3.3, that is di=jd^{\prime}_{i}=j if and only if dρ(i)d_{\rho(i)} equals the jjth distinct value to appear in (dρ(i))(d_{\rho(i)}). We have

(dρ(i))=(d2,d1,d3,d5,d4)=(2,1,1,3,2)(d_{\rho(i)})=(d_{2},d_{1},d_{3},d_{5},d_{4})=(2,1,1,3,2)

so that the distinct values of (dρ(i))(d_{\rho(i)}) in order of appearance are 2,1,32,1,3. Then,

dρ(1)=d2=2 is the first distinct value to appear in (dρ(i)) hence d1=1,\displaystyle d_{\rho(1)}=d_{2}=2\text{ is the first distinct value to appear in }(d_{\rho(i)})\text{ hence }d^{\prime}_{1}=1,
dρ(2)=d1=1 is the second distinct value to appear in (dρ(i)) hence d2=2,\displaystyle d_{\rho(2)}=d_{1}=1\text{ is the second distinct value to appear in }(d_{\rho(i)})\text{ hence }d^{\prime}_{2}=2,
dρ(3)=d3=1 is the second distinct value to appear in (dρ(i)) hence d3=2,\displaystyle d_{\rho(3)}=d_{3}=1\text{ is the second distinct value to appear in }(d_{\rho(i)})\text{ hence }d^{\prime}_{3}=2,
dρ(4)=d5=3 is the third distinct value to appear in (dρ(i)) hence d4=3,\displaystyle d_{\rho(4)}=d_{5}=3\text{ is the third distinct value to appear in }(d_{\rho(i)})\text{ hence }d^{\prime}_{4}=3,
dρ(5)=d4=2 is the first distinct value to appear in (dρ(i)) hence d5=1.\displaystyle d_{\rho(5)}=d_{4}=2\text{ is the first distinct value to appear in }(d_{\rho(i)})\text{ hence }d^{\prime}_{5}=1.

We conclude that

(di)=(1,2,2,3,1),(d^{\prime}_{i})=(1,2,2,3,1),

which in fact yields the partition in least element order D1={1,5},D2={2,3},D3={4}D^{\prime}_{1}=\{1,5\},D^{\prime}_{2}=\{2,3\},D^{\prime}_{3}=\{4\}.
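The relabelling just illustrated amounts to ranking d_{ρ(i)} by order of appearance; a short sketch (illustrative names) that reproduces the example is the following.

```python
def relabel_after_permutation(d, rho):
    """d: ordered allocation variables (1-based labels); rho: permutation given as a
    list with rho[i-1] = rho(i). Returns d' in least element order for the permuted data."""
    permuted = [d[r - 1] for r in rho]          # the sequence (d_{rho(i)})
    order = {}                                  # old label -> order of appearance
    for lab in permuted:
        if lab not in order:
            order[lab] = len(order) + 1
    return [order[lab] for lab in permuted]

# Example from the text: d = (1,2,1,2,3) and rho = (2,1,3,5,4) give d' = (1,2,2,3,1).
print(relabel_after_permutation([1, 2, 1, 2, 3], [2, 1, 3, 5, 4]))
```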

Appendix E Extended simulation study for the DP model

Figure E1: Estimated density and weighted densities by component (colored lines) for the 𝗀𝖺𝗅𝖺𝗑𝗒\mathsf{galaxy} (left column), 𝗅𝖾𝗉𝗍𝗈𝗄𝗎𝗋𝗍𝗂𝖼\mathsf{leptokurtic} (middle column) and 𝖻𝗂𝗆𝗈𝖽𝖺𝗅\mathsf{bimodal} (right column) data sets, assuming a Dirichlet process mixing distribution. Implementation was made using a Marginal sampler (1st row), the ordered allocation samplers in Sections 3.1 and 3.2 (2nd and 3rd row, respectively) including data permutations, the ordered allocation sampler in Section 3.1 without data permutations (4th row) and the slice-efficient sampler (5th row).

In this section we provide further illustrations of the two versions of the ordered allocation sampler derived in Sections 3.1 and 3.2 of the main paper, and of the importance of the acceleration step in Section 3.3. We consider the Dirichlet process mixing prior and repeat the simulation study of Section 4, implementing Algorithm 8 in Neal (2000) (Marginal), the dependent slice-efficient sampler by Kalli et al. (2011) (Conditional), the sampler of Section 3.1 with the acceleration step (OAS1), the sampler of Section 3.2 with the acceleration step (OAS2), and the sampler of Section 3.1 without the acceleration step (OAS1*). Table E1 reports the IAT of the deviance (DvD_{v}) and of the number of occupied components (knk_{n}) for these five samplers. In Figure E1 we show the estimated weighted densities, Q^j\widehat{Q}_{j}, of each component jKj\leq K, as well as the estimated density, Q^=j=1KQ^j\widehat{Q}=\sum_{j=1}^{K}\widehat{Q}_{j}, where KK is the largest index jj for which Q^j\widehat{Q}_{j} is not identically zero. We have computed

Q^j=1Tt=1Tnj(t)ng(|xj(t)),\widehat{Q}_{j}=\frac{1}{T}\sum_{t=1}^{T}\ \frac{n_{j}^{(t)}}{n}g\big{(}\cdot\,\big{|}\,x_{j}^{(t)}\big{)},

for a window of T=104T=10^{4} iterations after the burn-in period has elapsed. Here x1(t),x2(t),x_{1}^{(t)},x_{2}^{(t)},\ldots are the sampled component parameters as labelled (or indexed) at iteration tt, and nj(t)n_{j}^{(t)} is the number of data points assigned to the jjth component at iteration tt. In particular, for the ordered allocation samplers, components are labelled in the order in which they were discovered by the (possibly permuted) dataset at each iteration. For the marginal sampler, components are labelled this way at the first iteration, and at subsequent iterations relabelling only occurred to delete gaps, i.e. so that the first knk_{n} indexes, jj, refer to the observed components. As for the slice sampler, components were never relabelled.
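As an illustration of how the above estimates can be computed from the stored MCMC output, the following sketch (Python, illustrative names; a univariate Gaussian kernel is assumed, as in the simulation study) accumulates the weighted densities over iterations.

```python
import numpy as np
from scipy.stats import norm

def weighted_density_estimates(grid, counts, params, n):
    """counts[t][j]: number of points in component j at iteration t; params[t][j]:
    (mean, sd) of component j at iteration t. Returns a (K, len(grid)) array Q with
    Q[j] the estimate of the j-th weighted density; Q.sum(axis=0) estimates the density."""
    T = len(counts)
    K = max(len(row) for row in counts)
    Q = np.zeros((K, len(grid)))
    for t in range(T):
        for j, (n_j, (mu, sd)) in enumerate(zip(counts[t], params[t])):
            Q[j] += (n_j / n) * norm.pdf(grid, mu, sd) / T
    return Q
```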

We first focus on exploring the effects of the acceleration step in Section 3.3. As can be observed in Figure E1, the ordered allocation sampler without data permutations (OAS1*, 4th row) is more prone to label components consistently throughout iterations; this is reflected in the propensity of Q^j\widehat{Q}_{j} to be unimodal. In contrast, for the ordered allocation sampler that includes the acceleration step of Section 3.3 (OAS1, 2nd row), Q^j\widehat{Q}_{j} has the shape of the estimated density Q^\widehat{Q}. Thus Q^j\widehat{Q}_{j} is multimodal when Q^\widehat{Q} is (as is the case for the 𝗅𝖾𝗉𝗍𝗈𝗄𝗎𝗋𝗍𝗂𝖼\mathsf{leptokurtic} dataset). This indicates that more label-switches occur in the OAS1, which is an expected consequence of the fact that, by permuting data points, components are discovered in a distinct order. Comparing against the graphs of the marginal sampler (1st row), we see that by including the acceleration step, label-switches occur in a way more similar to how they naturally do in marginal samplers, the only type of sampler where labels are completely irrelevant. In terms of algorithmic performance, in Table E1 we see that the inclusion of data permutations at each iteration represents a significant improvement of the mixing properties, as the IAT values corresponding to the OAS1 are much smaller than those of the OAS1*. The one exception is the IAT of DvD_{v} for the 𝗅𝖾𝗉𝗍𝗈𝗄𝗎𝗋𝗍𝗂𝖼\mathsf{leptokurtic} dataset, which is very similar for the OAS1 and the OAS1*. This is due to the fact that, although this dataset comes from more than one Gaussian component, it only has one mode. Thus, it is not clear which component each data point comes from; this in turn leads to frequent label-switches even if one does not permute the dataset.

We now turn to explore the algorithmic performance of the OAS2 compared against that of the OAS1, when the latter applies, i.e. when the distribution of the size-biased permuted weights (p~j)(\tilde{p}_{j}) is available. In Table E1 we see that for the 𝗀𝖺𝗅𝖺𝗑𝗒\mathsf{galaxy} dataset, the IAT values of the OAS2 compare very well with those of the OAS1. For the other two datasets, instead, there is a difference between the IAT values of the OAS1 and the OAS2. To explain why this happens, recall that the key distinction between the OAS1 and the OAS2 is that the first one updates (p~j)(\tilde{p}_{j}) directly, while the OAS2 relies on the indexes (αj)(\alpha_{j}). In particular, for the DP model, the OAS2 ignores the fact that the distribution of (p~j)(\tilde{p}_{j}) is available. As mentioned in the main paper, to update (αj)(\alpha_{j}) the OAS2 first updates (αj)jkn(\alpha_{j})_{j\leq k_{n}} (i.e. those of the weights of occupied components) and later (αj)j>kn(\alpha_{j})_{j>k_{n}}, hence swaps between αj\alpha_{j} and αi\alpha_{i}, with jkn<ij\leq k_{n}<i, mainly occur when occupied components are created or removed as a consequence of an update of (di)(d_{i}). Now, the 𝗅𝖾𝗉𝗍𝗈𝗄𝗎𝗋𝗍𝗂𝖼\mathsf{leptokurtic} and 𝖻𝗂𝗆𝗈𝖽𝖺𝗅\mathsf{bimodal} datasets were both simulated from a mixture of two (more or less) balanced Gaussian components. Since we have implemented a Gaussian mixture, at most iterations the sampler will effectively recognize that there are only two large occupied components (see the 2nd and 3rd columns of Figure E1 for an illustration). Thus the mixing of (αj)(\alpha_{j}) will be affected because components are rarely created or removed. Furthermore, as typically there will be only two large “occupied” weights and the rest of them will be very small, it is extremely unlikely that their indexes get swapped. Instead, the 𝗀𝖺𝗅𝖺𝗑𝗒\mathsf{galaxy} dataset was not generated from a Gaussian mixture, which forces the model to rely on different Gaussian components of varying sizes to estimate the density (cf. 1st column of Figure E1). This means that the number of occupied components will change frequently and some of the “occupied” weights will be small at many iterations, thus facilitating the mixing of weights’ indexes.

Marginal OAS1 OAS2 OAS1* Conditional
𝗀𝖺𝗅𝖺𝗑𝗒\mathsf{galaxy} DvD_{v} 12.30(0.23) 23.76(0.57) 26.76(0.84) 106.9(7.98) 119.2(9.95)
knk_{n} 13.68(0.25) 32.49(0.81) 36.36(1.09) 115.2(7.18) 190.2(16.3)
𝗅𝖾𝗉𝗍𝗈𝗄𝗎𝗋𝗍𝗂𝖼\mathsf{leptokurtic} DvD_{v} 22.42(0.54) 25.81(0.63) 35.98(1.00) 26.92(0.75) 120.5(10.3)
knk_{n} 9.26(0.13) 18.99(0.41) 27.97(0.88) 44.48(0.60) 100.8(7.18)
𝖻𝗂𝗆𝗈𝖽𝖺𝗅\mathsf{bimodal} DvD_{v} 7.84(0.16) 13.87(0.35) 20.82(0.57) 50.47(3.20) 35.61(1.68)
knk_{n} 6.30(0.07) 13.38(0.22) 25.43(0.87) 39.28(1.33) 52.00(2.18)
Table E1: IAT of DvD_{v} and knk_{n} (standard errors are shown in parenthesis) for Algorithm 8 in Neal (2000) (Marginal), the ordered allocation samplers in Sections 3.1 and 3.2 with data permutation (OAS1 and OAS2, respectively), the ordered allocation sampler in Section 3.1 without data permutations (OAS1*) and the dependent slice-efficient sampler (Conditional), obtained by fitting a DP model to the 𝗀𝖺𝗅𝖺𝗑𝗒\mathsf{galaxy}, 𝗅𝖾𝗉𝗍𝗈𝗄𝗎𝗋𝗍𝗂𝖼\mathsf{leptokurtic} and 𝖻𝗂𝗆𝗈𝖽𝖺𝗅\mathsf{bimodal} datasets.