
Posteriors, conjugacy, and exponential families for completely random measures

Tamara Broderick    Ashia C. Wilson    Michael I. Jordan
Abstract

We demonstrate how to calculate posteriors for general Bayesian nonparametric priors and likelihoods based on completely random measures (CRMs). We further show how to represent Bayesian nonparametric priors as a sequence of finite draws using a size-biasing approach—and how to represent full Bayesian nonparametric models via finite marginals. Motivated by conjugate priors based on exponential family representations of likelihoods, we introduce a notion of exponential families for CRMs, which we call exponential CRMs. This construction allows us to specify automatic Bayesian nonparametric conjugate priors for exponential CRM likelihoods. We demonstrate that our exponential CRMs allow particularly straightforward recipes for size-biased and marginal representations of Bayesian nonparametric models. Along the way, we prove that the gamma process is a conjugate prior for the Poisson likelihood process and the beta prime process is a conjugate prior for a process we call the odds Bernoulli process. We deliver a size-biased representation of the gamma process and a marginal representation of the gamma process coupled with a Poisson likelihood process.

1 Introduction

An important milestone in Bayesian analysis was the development of a general strategy for obtaining conjugate priors based on exponential family representations of likelihoods (DeGroot, 1970). While slavish adherence to exponential-family conjugacy can be criticized, conjugacy continues to occupy an important place in Bayesian analysis, for its computational tractability in high-dimensional problems and for its role in inspiring investigations into broader classes of priors (e.g., via mixtures, limits, or augmentations). The exponential family is, however, a parametric class of models, and it is of interest to consider whether similar general notions of conjugacy can be developed for Bayesian nonparametric models. Indeed, the nonparametric literature is replete with nomenclature that suggests the exponential family, including familiar names such as “Dirichlet,” “beta,” “gamma,” and “Poisson.” These names refer to aspects of the random measures underlying Bayesian nonparametrics, either the Lévy measure used in constructing certain classes of random measures or properties of marginals obtained from random measures. In some cases, conjugacy results have been established that parallel results from classical exponential families; in particular, the Dirichlet process is known to be conjugate to a multinomial process likelihood (Ferguson, 1973), the beta process is conjugate to a Bernoulli process (Kim, 1999; Thibaux and Jordan, 2007) and to a negative binomial process (Broderick et al., 2015). Moreover, various useful representations for marginal distributions, including stick-breaking and size-biased representations, have been obtained by making use of properties that derive from exponential families. It is striking, however, that these results have been obtained separately, and with significant effort; a general formalism that encompasses these individual results has not yet emerged. In this paper, we provide the single, holistic framework so strongly suggested by the nomenclature. Within this single framework, we show that it is straightforward to calculate posteriors and establish conjugacy. Our framework includes the specification of a Bayesian nonparametric analog of the finite exponential family, which allows us to provide automatic and constructive nonparametric conjugate priors given a likelihood specification as well as general recipes for marginal and size-biased representations.

A broad class of Bayesian nonparametric priors—including those built on the Dirichlet process (Ferguson, 1973), the beta process (Hjort, 1990), the gamma process (Ferguson, 1973; Lo, 1982; Titsias, 2008), and the negative binomial process (Zhou et al., 2012; Broderick et al., 2015)—can be viewed as models for the allocation of data points to traits. These processes give us pairs of traits together with rates or frequencies with which the traits occur in some population. Corresponding likelihoods assign each data point in the population to some finite subset of traits conditioned on the trait frequencies. What makes these models nonparametric is that the number of traits in the prior is countably infinite. Then the (typically random) number of traits to which any individual data point is allocated is unbounded, but also there are always new traits to which as-yet-unseen data points may be allocated. That is, such a model allows the number of traits in any data set to grow with the size of that data set.

A principal challenge of working with such models arises in posterior inference. There is a countable infinity of trait frequencies in the prior which we must integrate over to calculate the posterior of trait frequencies given allocations of data points to traits. Bayesian nonparametric models sidestep the full infinite-dimensional integration in three principal ways: conjugacy, size-biased representations, and marginalization.

In its most general form, conjugacy simply asserts that the prior is in the same family of distributions as the posterior. When the prior and likelihood are in finite-dimensional conjugate exponential families, conjugacy can turn posterior calculation into, effectively, vector addition. As a simple example, consider a model with beta-distributed prior, $\theta\sim\mathrm{Beta}(\theta|\alpha,\beta)$, for some fixed hyperparameters $\alpha$ and $\beta$. For the likelihood, let each observation $x_{n}$ with $n\in\{1,\ldots,N\}$ be iid Bernoulli-distributed conditional on parameter $\theta$: $x_{n}\stackrel{iid}{\sim}\mathrm{Bern}(x|\theta)$. Then the posterior is simply another beta distribution, $\mathrm{Beta}(\theta|\alpha_{post},\beta_{post})$, with parameters updated via addition: $\alpha_{post}:=\alpha+\sum_{n=1}^{N}x_{n}$ and $\beta_{post}:=\beta+N-\sum_{n=1}^{N}x_{n}$. While conjugacy is certainly useful and popular in the case of finite parameter cardinality, there is arguably a stronger computational imperative for its use in the infinite-parameter case. Indeed, the core prior-likelihood pairs of Bayesian nonparametrics are generally proven (Hjort, 1990; Kim, 1999; Lo, 1982; Thibaux and Jordan, 2007; Broderick et al., 2015), or assumed to be (Titsias, 2008; Thibaux, 2008), conjugate. When such proofs exist, though, thus far they have been specialized to specific pairs of processes. In what follows, we demonstrate a general way to calculate posteriors for a class of distributions that includes all of these classical Bayesian nonparametric models. We also define a notion of exponential family representation for the infinite-dimensional case and show that, given a Bayesian nonparametric exponential family likelihood, we can readily construct a Bayesian nonparametric conjugate prior.
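
The following is a minimal sketch of this finite-dimensional beta-Bernoulli update; the function name and hyperparameter values are illustrative, not from the paper.

    import numpy as np

    def beta_bernoulli_posterior(alpha, beta, x):
        """Return (alpha_post, beta_post) after observing x_1, ..., x_N in {0, 1}."""
        x = np.asarray(x)
        alpha_post = alpha + x.sum()           # alpha + sum_n x_n
        beta_post = beta + len(x) - x.sum()    # beta + N - sum_n x_n
        return float(alpha_post), float(beta_post)

    # With a Beta(1, 1) prior and observations (1, 0, 1, 1): alpha_post = 4, beta_post = 2.
    print(beta_bernoulli_posterior(1.0, 1.0, [1, 0, 1, 1]))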

Size-biased sampling provides a finite-dimensional distribution for each of the individual prior trait frequencies (Thibaux and Jordan, 2007; Paisley et al., 2010). Such a representation has played an important role in Bayesian nonparametrics in recent years, allowing for either exact inference via slice sampling (Damien et al., 1999; Neal, 2003)—as demonstrated by Teh et al. (2007); Broderick et al. (2015)—or approximate inference via truncation (Doshi et al., 2009; Paisley et al., 2011). This representation is particularly useful for building hierarchical models (Thibaux and Jordan, 2007). We show that our framework yields such representations in general, and we show that our construction is especially straightforward to use in the exponential family framework that we develop.

Marginal processes avoid directly representing the infinite-dimensional prior and posterior altogether by integrating out the trait frequencies. Since the trait allocations are finite for each data point, the marginal processes are finite for any finite set of data points. Again, thus far, such processes have been shown to exist separately in special cases; for example, the Indian buffet process (Griffiths and Ghahramani, 2006) is the marginal process for the beta process prior paired with a Bernoulli process likelihood (Thibaux and Jordan, 2007). We show that the integration that generates the marginal process from the full Bayesian model can be generally applied in Bayesian nonparametrics and takes a particularly straightforward form when using conjugate exponential family priors and likelihoods. We further demonstrate that, in this case, a basic, constructive recipe exists for the general marginal process in terms of only finite-dimensional distributions.

Our results are built on the general class of stochastic processes known as completely random measures (CRMs) (Kingman, 1967). We review CRMs in Section 2.1 and we discuss what assumptions are needed to form a full Bayesian nonparametric model from CRMs in Section 2.3. Given a general Bayesian nonparametric prior and likelihood (Section 2.2), we demonstrate in Section 3 how to calculate the posterior. Although the development up to this point is more general, we next introduce a concept of exponential families for CRMs (Section 4.1) and call such models exponential CRMs. We show that we can generate automatic conjugate priors given exponential CRM likelihoods in Section 4.2. Finally, we show how we can generate recipes for size-biased representations (Section 5) and marginal processes (Section 6), which are particularly straightforward in the exponential CRM case (Corollary 5.2 in Section 5 and Corollary 6.2 in Section 6). We illustrate our results on a number of examples and derive new conjugacy results, size-biased representations, and marginal processes along the way.

We note that some similar results have been obtained by Orbanz (2010) and James (2014). In the present work, we focus on creating representations that allow tractable inference.

2 Bayesian models based on completely random measures

As we have discussed, we view Bayesian nonparametric models as being composed of two parts: (1) a collection of pairs of traits together with their frequencies or rates and (2) for each data point, an allocation to different traits. Both parts can be expressed as random measures. Recall that a random measure is a random element whose values are measures.

We represent each trait by a point $\psi$ in some space $\Psi$ of traits. Further, let $\theta_{k}$ be the frequency, or rate, of the trait represented by $\psi_{k}$, where $k$ indexes the countably many traits. In particular, $\theta_{k}\in\mathbb{R}_{+}$. Then $(\theta_{k},\psi_{k})$ is a tuple consisting of the frequency of the $k$th trait together with its trait descriptor. We can represent the full collection of pairs of traits with their frequencies by the discrete measure on $\Psi$ that places weight $\theta_{k}$ at location $\psi_{k}$:

\Theta=\sum_{k=1}^{K}\theta_{k}\delta_{\psi_{k}}, \qquad (1)

where the cardinality $K$ may be finite or infinity.

Next, we form data point $X_{n}$ for the $n$th individual. The data point $X_{n}$ is viewed as a discrete measure. Each atom of $X_{n}$ represents a pair consisting of (1) a trait to which the $n$th individual is allocated and (2) a degree to which the $n$th individual is allocated to this particular trait. That is,

X_{n}=\sum_{k=1}^{K_{n}}x_{n,k}\delta_{\psi_{n,k}}, \qquad (2)

where again $\psi_{n,k}\in\Psi$ represents a trait and now $x_{n,k}\in\mathbb{R}_{+}$ represents the degree to which the $n$th data point belongs to trait $\psi_{n,k}$. $K_{n}$ is the total number of traits to which the $n$th data point belongs.

Here and in what follows, we treat $X_{1:N}=\{X_{n}:n\in[N]\}$ as our observed data points for $[N]:=\{1,2,3,\ldots,N\}$. In practice $X_{1:N}$ is often incorporated into a more complex Bayesian hierarchical model. For instance, in topic modeling, $\psi_{k}$ represents a topic; that is, $\psi_{k}$ is a distribution over words in a vocabulary (Blei et al., 2003; Teh et al., 2006). $\theta_{k}$ might represent the frequency with which the topic $\psi_{k}$ occurs in a corpus of documents. $x_{n,k}$ might be a positive integer and represent the number of words in topic $\psi_{n,k}$ that occur in the $n$th document. So the $n$th document has a total length of $\sum_{k=1}^{K_{n}}x_{n,k}$ words. In this case, the actual observation consists of the words in each document, and the topics are latent. Not only are the results concerning posteriors, conjugacy, and exponential family representations that we develop below useful for inference in such models, but in fact our results are especially useful in such models—where the traits and any ordering on the traits are not known in advance.

Next, we want to specify a full Bayesian model for our data points $X_{1:N}$. To do so, we must first define a prior distribution for the random measure $\Theta$ as well as a likelihood for each random measure $X_{n}$ conditioned on $\Theta$. We let $\Sigma_{\Psi}$ be a $\sigma$-algebra of subsets of $\Psi$, where we assume all singletons are in $\Sigma_{\Psi}$. Then we consider random measures $\Theta$ and $X_{n}$ whose values are measures on $\Psi$. Note that for any random measure $\Theta$ and any measurable set $A\in\Sigma_{\Psi}$, $\Theta(A)$ is a random variable.

2.1 Completely random measures

We can see from Eqs. (1) and (2) that we desire a distribution on random measures that yields discrete measures almost surely. A particularly simple form of random measure called a completely random measure can be used to generate a.s. discrete random measures (Kingman, 1967).

A completely random measure $\Theta$ is defined as a random measure that satisfies one additional property: for any disjoint, measurable sets $A_{1},A_{2},\ldots,A_{K}\in\Sigma_{\Psi}$, we require that $\Theta(A_{1}),\Theta(A_{2}),\ldots,\Theta(A_{K})$ be independent random variables. Kingman (1967) showed that a completely random measure can always be decomposed into a sum of three independent parts:

\Theta=\Theta_{det}+\Theta_{fix}+\Theta_{ord}. \qquad (3)

Here, $\Theta_{det}$ is the deterministic component, $\Theta_{fix}$ is the fixed-location component, and $\Theta_{ord}$ is the ordinary component. In particular, $\Theta_{det}$ is any deterministic measure. We define the remaining two parts next.

The fixed-location component is called the “fixed component” by Kingman (1967), but we expand the name slightly here to emphasize that $\Theta_{fix}$ is defined to be constructed from a set of random weights at fixed (i.e., deterministic) locations. That is,

\Theta_{fix}=\sum_{k=1}^{K_{fix}}\theta_{fix,k}\delta_{\psi_{fix,k}}, \qquad (4)

where the number of fixed-location atoms, $K_{fix}$, may be either finite or infinity; $\psi_{fix,k}$ is deterministic, and $\theta_{fix,k}$ is a non-negative, real-valued random variable (since $\Theta$ is a measure). Without loss of generality, we assume that the locations $\psi_{fix,k}$ are all distinct. Then, by the independence assumption of CRMs, we must have that the $\theta_{fix,k}$ are independent random variables across $k$. Although the fixed-location atoms are often ignored in the Bayesian nonparametrics literature, we will see that the fixed-location component has a key role to play in establishing Bayesian nonparametric conjugacy and in the CRM representations we present.

The third and final component is the ordinary component. Let $\#(A)$ denote the cardinality of some countable set $A$. Let $\mu$ be any $\sigma$-finite, deterministic measure on $\mathbb{R}_{+}\times\Psi$, where $\mathbb{R}_{+}$ is equipped with the Borel $\sigma$-algebra and $\Sigma_{\mathbb{R}_{+}\times\Psi}$ is the resulting product $\sigma$-algebra given $\Sigma_{\Psi}$. Recall that a Poisson point process with rate measure $\mu$ on $\mathbb{R}_{+}\times\Psi$ is a random countable subset $\Pi$ of $\mathbb{R}_{+}\times\Psi$ such that two properties hold (Kingman, 1992):

  1. For any $A\in\Sigma_{\mathbb{R}_{+}\times\Psi}$, $\#(\Pi\cap A)\sim\mathrm{Poisson}(\mu(A))$.

  2. For any disjoint $A_{1},A_{2},\ldots,A_{K}\in\Sigma_{\mathbb{R}_{+}\times\Psi}$, $\#(\Pi\cap A_{1}),\#(\Pi\cap A_{2}),\ldots,\#(\Pi\cap A_{K})$ are independent random variables.

To generate an ordinary component, start with a Poisson point process on $\mathbb{R}_{+}\times\Psi$, characterized by its rate measure $\mu(d\theta\times d\psi)$. This process yields $\Pi$, a random and countable set of points: $\Pi=\{(\theta_{ord,k},\psi_{ord,k})\}_{k=1}^{K_{ord}}$, where $K_{ord}$ may be finite or infinity. Form the ordinary component measure by letting $\theta_{ord,k}$ be the weight of the atom located at $\psi_{ord,k}$:

\Theta_{ord}=\sum_{k=1}^{K_{ord}}\theta_{ord,k}\delta_{\psi_{ord,k}}. \qquad (5)

Recall that we stated at the start of Section 2.1 that CRMs may be used to produce a.s. discrete random measures. To check this assertion, note that $\Theta_{fix}$ is a.s. discrete by construction (Eq. (4)) and $\Theta_{ord}$ is a.s. discrete by construction (Eq. (5)). $\Theta_{det}$ is the one component that may not be a.s. atomic. Thus the prevailing norm in using models based on CRMs is to set $\Theta_{det}\equiv 0$; in what follows, we adopt this norm. If the reader is concerned about missing any atoms in $\Theta_{det}$, note that it is straightforward to adapt the treatment of $\Theta_{fix}$ to include the case where the atom weights are deterministic. When we set $\Theta_{det}\equiv 0$, we are left with $\Theta=\Theta_{fix}+\Theta_{ord}$ by Eq. (3). So $\Theta$ is also discrete, as desired.
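
As a concrete illustration, the following sketch (ours, not from the paper; the particular rate measure and location distribution are illustrative choices) generates an ordinary component when the weight rate measure has finite total mass, so that the Poisson point process has finitely many points. Infinite-mass rate measures, which Assumption A1 below requires, instead call for truncation or the size-biased representations of Section 5.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative finite-mass choice: nu(dtheta) = gamma_mass * Exp(theta | 1) dtheta
    # on R_+, and G = Uniform[0, 1] on Psi.
    gamma_mass = 5.0                                    # total mass nu(R_+)
    K_ord = rng.poisson(gamma_mass)                     # number of atoms of Theta_ord
    theta_ord = rng.exponential(scale=1.0, size=K_ord)  # weights drawn from normalized nu
    psi_ord = rng.uniform(0.0, 1.0, size=K_ord)         # locations drawn iid from G

    # Theta_ord is the discrete measure sum_k theta_ord[k] * delta_{psi_ord[k]}.
    print(list(zip(theta_ord.round(2), psi_ord.round(2))))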

2.2 Prior and likelihood

The prior that we place on $\Theta$ will be a fully general CRM (minus any deterministic component) with one additional assumption on the rate measure of the ordinary component. Before incorporating the additional assumption, we say that $\Theta$ has a fixed-location component with $K_{fix}$ atoms, where the $k$th atom has arbitrary distribution $F_{fix,k}$: $\theta_{fix,k}\stackrel{indep}{\sim}F_{fix,k}(d\theta)$. $K_{fix}$ may be finite or infinity, and $\Theta$ has an ordinary component characterized by rate measure $\mu(d\theta\times d\psi)$. The additional assumption we make is that the distribution on the weights in the ordinary component is decoupled from the distribution on the locations. That is, the rate measure decomposes as

\mu(d\theta\times d\psi)=\nu(d\theta)\cdot G(d\psi), \qquad (6)

where $\nu$ is any $\sigma$-finite, deterministic measure on $\mathbb{R}_{+}$ and $G$ is any proper distribution on $\Psi$. While the distribution over locations has been discussed extensively elsewhere (Neal, 2000; Wang and Blei, 2013), it is the weights that affect the allocation of data points to traits.

Given the factorization of $\mu$ in Eq. (6), the ordinary component of $\Theta$ can be generated by letting $\{\theta_{ord,k}\}_{k=1}^{K_{ord}}$ be the points of a Poisson point process generated on $\mathbb{R}_{+}$ with rate $\nu$. (Recall that $K_{ord}$ may be finite or infinity depending on $\nu$ and is random when taking finite values.) We then draw the locations $\{\psi_{ord,k}\}_{k=1}^{K_{ord}}$ iid according to $G(d\psi)$: $\psi_{ord,k}\stackrel{iid}{\sim}G(d\psi)$. Finally, for each $k$, $\theta_{ord,k}\delta_{\psi_{ord,k}}$ is an atom in $\Theta_{ord}$. This factorization will allow us to focus our attention on the trait frequencies, and not the trait locations, in what follows. Moreover, going forward, we will assume $G$ is diffuse (i.e., $G$ has no atoms) so that the ordinary component atoms are all at a.s. distinct locations, which are further a.s. distinct from the fixed locations.

Since we have seen that $\Theta$ is an a.s. discrete random measure, we can write it as

\Theta=\sum_{k=1}^{K}\theta_{k}\delta_{\psi_{k}}, \qquad (7)

where $K:=K_{fix}+K_{ord}$ may be finite or infinity, and every $\psi_{k}$ is a.s. unique. That is, we will sometimes find it helpful notationally to use Eq. (7) instead of separating the fixed and ordinary components. At this point, we have specified the prior for $\Theta$ in our general model.

Next, we specify the likelihood; i.e., we specify how to generate the data points $X_{n}$ given $\Theta$. We will assume each $X_{n}$ is generated iid given $\Theta$ across the data indices $n$. We will let $X_{n}$ be a CRM with only a fixed-location component given $\Theta$. In particular, the atoms of $X_{n}$ will be located at the atom locations of $\Theta$, which are fixed when we condition on $\Theta$:

X_{n}:=\sum_{k=1}^{K}x_{n,k}\delta_{\psi_{k}}.

Here, $x_{n,k}$ is drawn according to some distribution $H$ that may take $\theta_{k}$, the weight of $\Theta$ at location $\psi_{k}$, as a parameter; i.e.,

x_{n,k}\stackrel{indep}{\sim}H(dx|\theta_{k})\quad\textrm{independently across $n$ and $k$}. \qquad (8)

Note that while every atom of $X_{n}$ is located at an atom of $\Theta$, it is not necessarily the case that every atom of $\Theta$ has a corresponding atom in $X_{n}$. In particular, if $x_{n,k}$ is zero for any $k$, there is no atom in $X_{n}$ at $\psi_{k}$.

We highlight that the model above stands in contrast to Bayesian nonparametric partition models, for which there is a large literature. In partition models (or clustering models), $\Theta$ is a random probability measure (Ferguson, 1974); in this case, the probability constraint precludes $\Theta$ from being a completely random measure, but it is often chosen to be a normalized completely random measure (James et al., 2009; Lijoi and Prünster, 2010). The choice of Dirichlet process (a normalized gamma process) for $\Theta$ is particularly popular due to a number of useful properties that coincide in this single choice (Doksum, 1974; Escobar, 1994; Escobar and West, 1995, 1998; Ferguson, 1973; Lo, 1984; MacEachern, 1994; Perman et al., 1992; Pitman, 1996a, b; Sethuraman, 1994; West and Escobar, 1994). In partition models, $X_{n}$ is a draw from the probability distribution described by $\Theta$. If we think of such $X_{n}$ as a random measure, it is a.s. a single unit mass at a point $\psi$ with strictly positive probability in $\Theta$.

One potential connection between these two types of models is provided by combinatorial clustering (Broderick et al., 2015). In partition models, we might suppose that we have a number of data sets, all of which we would like to partition. For instance, in a document modeling scenario, each document might be a data set; in particular each data point is a word in the document. And we might wish to partition the words in each document. An alternative perspective is to suppose that there is a single data set, where each data point is a document. Then the document exhibits traits with multiplicities, where the multiplicities might be the number of words from each trait; typically a trait in this application would be a topic. In this case, there are a number of other names besides feature or trait model that may be applied to the overarching model—such as admixture model or mixed membership model (Airoldi et al., 2014).

2.3 Bayesian nonparametrics

So far we have described a prior and likelihood that may be used to form a Bayesian model. We have already stated above that forming a Bayesian nonparametric model imposes some restrictions on the prior and likelihood. We formalize these restrictions in Assumptions A0, A1, and A2 below.

Recall that the premise of Bayesian nonparametrics is that the number of traits represented in a collection of data can grow with the number of data points. More explicitly, we achieve the desideratum that the number of traits is unbounded, and may always grow as new data points are collected, by modeling a countable infinity of traits. This assumption requires that the prior have a countable infinity of atoms. These must either be fixed-location atoms or ordinary component atoms. Fixed-location atoms represent known traits in some sense since we must know the fixed locations of the atoms in advance. Conversely, ordinary component atoms represent unknown traits, as yet to be discovered, since both their locations and associated rates are unknown a priori. Since we cannot know (or represent) a countable infinity of traits a priori, we cannot start with a countable infinity of fixed-location atoms.

  A0. The number of fixed-location atoms in $\Theta$ is finite.

Since we require a countable infinity of traits in total and they cannot come from the fixed-location atoms by Assumption A0, the ordinary component must contain a countable infinity of atoms. This assumption will be true if and only if the rate measure on the trait frequencies has infinite mass.

  A1. $\nu(\mathbb{R}_{+})=\infty$.

Finally, an implicit part of the starting premise is that each data point be allocated to only a finite number of traits; we do not expect to glean an infinite amount of information from finitely represented data. Thus, we require that the number of atoms in every $X_{n}$ be finite. By Assumption A0, the number of atoms in $X_{n}$ that correspond to fixed-location atoms in $\Theta$ is finite. But by Assumption A1, the number of atoms in $\Theta$ from the ordinary component is infinite. So there must be some restriction on the distribution of values of $X$ at the atoms of $\Theta$ (that is, some restriction on $H$ in Eq. (8)) such that only finitely many of these values are nonzero.

In particular, note that if $H(dx|\theta)$ does not contain an atom at zero for any $\theta$, then a.s. every one of the countable infinity of atoms of $X$ will be nonzero. Conversely, it follows that, for our desiderata to hold, we must have that $H(dx|\theta)$ exhibits an atom at zero. One consequence of this observation is that $H(dx|\theta)$ cannot be purely continuous for all $\theta$. Though this line of reasoning does not necessarily preclude a mixed continuous and discrete $H$, we henceforth assume that $H(dx|\theta)$ is discrete, with support $\mathbb{Z}_{*}=\{0,1,2,\ldots\}$, for all $\theta$.

In what follows, we write $h(x|\theta)$ for the probability mass function of $x$ given $\theta$. So our requirement that each data point be allocated to only a finite number of traits translates into a requirement that the number of atoms of $X_{n}$ with values in $\mathbb{Z}_{+}=\{1,2,\ldots\}$ be finite. Note that, by construction, the pairs $\{(\theta_{ord,k},x_{ord,k})\}_{k=1}^{K_{ord}}$ form a marked Poisson point process with rate measure $\mu_{mark}(d\theta\times dx):=\nu(d\theta)h(x|\theta)$. And the pairs with $x_{ord,k}$ equal to any particular value $x\in\mathbb{Z}_{+}$ further form a thinned Poisson point process with rate measure $\nu_{x}(d\theta):=\nu(d\theta)h(x|\theta)$. In particular, the number of atoms of $X$ with weight $x$ is Poisson-distributed with mean $\nu_{x}(\mathbb{R}_{+})$. So the number of atoms of $X$ is finite if and only if the following assumption holds. (When we have the more general case of a mixed continuous and discrete $H$, Assumption A2 becomes A2b: $\int_{x>0}\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)H(dx|\theta)<\infty$.)

  A2. $\sum_{x=1}^{\infty}\nu_{x}(\mathbb{R}_{+})<\infty$ for $\nu_{x}(d\theta):=\nu(d\theta)h(x|\theta)$.

Thus Assumptions A0, A1, and A2 capture our Bayesian nonparametric desiderata. We illustrate the development so far with an example.

Example 2.1.

The beta process (Hjort, 1990) provides an example distribution for $\Theta$. In its most general form, sometimes called the three-parameter beta process (Teh and Görür, 2009; Broderick et al., 2012), the beta process has an ordinary component whose weight rate measure has a beta distribution kernel,

\nu(d\theta)=\gamma\theta^{-\alpha-1}(1-\theta)^{c+\alpha-1}\;d\theta, \qquad (9)

with support on $(0,1]$. Here, the three fixed hyperparameters are $\gamma$, the mass parameter; $c$, the concentration parameter; and $\alpha$, the discount parameter. (In (Teh and Görür, 2009; Broderick et al., 2012), the ordinary component features the beta distribution kernel in Eq. (9) multiplied not only by $\gamma$ but also by a more complex, positive, real-valued expression in $c$ and $\alpha$. Since all of $\gamma$, $c$, and $\alpha$ are fixed hyperparameters, and $\gamma$ is an arbitrary positive real value, any other constant factors containing the hyperparameters can be absorbed into $\gamma$, as in the main text here.) Moreover, each of its $K_{fix}$ fixed-location atoms, $\theta_{fix,k}\delta_{\psi_{fix,k}}$, has a beta-distributed weight (Broderick et al., 2015):

\theta_{fix,k}\sim\mathrm{Beta}(\theta|\rho_{fix,k},\sigma_{fix,k}), \qquad (10)

where $\rho_{fix,k},\sigma_{fix,k}>0$ are fixed hyperparameters of the model.

By Assumption A0, $K_{fix}$ is finite. By Assumption A1, $\nu(\mathbb{R}_{+})=\infty$. To achieve this infinite-mass restriction, the beta kernel in Eq. (9) must be improper; i.e., either $-\alpha\leq 0$ or $c+\alpha\leq 0$. Also, note that we must have $\gamma>0$ since $\nu$ is a measure (and the case $\gamma=0$ would be trivial).

Often the beta process is used as a prior paired with a Bernoulli process likelihood (Thibaux and Jordan, 2007). The Bernoulli process specifies that, given $\Theta=\sum_{k=1}^{\infty}\theta_{k}\delta_{\psi_{k}}$, we draw

x_{n,k}\stackrel{indep}{\sim}\mathrm{Bern}(x|\theta_{k}),

which is well-defined since every atom weight $\theta_{k}$ of $\Theta$ is in $(0,1]$ by the beta process construction. Thus,

X_{n}=\sum_{k=1}^{\infty}x_{n,k}\delta_{\psi_{k}}.

The marginal distribution of the $X_{1:N}$ in this case is often called an Indian buffet process (Griffiths and Ghahramani, 2006; Thibaux and Jordan, 2007). The locations of atoms in $X_{n}$ are thought of as the dishes sampled by the $n$th customer.

We take a moment to highlight the fact that continuous distributions for $H(dx|\theta)$ are precluded based on the Bayesian nonparametric desiderata by considering an alternative likelihood. Consider instead if $H(dx|\theta)$ were continuous here. Then $X_{1}$ would have atoms at every atom of $\Theta$. In the Indian buffet process analogy, any customer would sample an infinite number of dishes, which contradicts our assumption that our data are finite. Indeed, any customer would sample all of the dishes at once. It is quite often the case in practical applications, though, that the $X_{n}$ are merely latent variables, with the observed variables chosen according to a (potentially continuous) distribution given $X_{n}$ (Griffiths and Ghahramani, 2006; Thibaux and Jordan, 2007); consider, e.g., mixture and admixture models. These cases are not precluded by our development.

Finally, then, we may apply Assumption A2, which specifies that the number of atoms in each observation $X_{n}$ is finite; in this case, the assumption means

\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\cdot h(x|\theta)=\int_{\theta\in(0,1]}\nu(d\theta)\cdot h(1|\theta)
\qquad\text{since $\theta$ is supported on $(0,1]$ and $x$ is supported on $\{0,1\}$}
=\int_{\theta\in(0,1]}\gamma\theta^{-\alpha-1}(1-\theta)^{c+\alpha-1}\;d\theta\cdot\theta=\gamma\int_{\theta\in(0,1]}\theta^{1-\alpha-1}(1-\theta)^{c+\alpha-1}\;d\theta<\infty.

The integral here is finite if and only if $1-\alpha$ and $c+\alpha$ are the parameters of a proper beta distribution: i.e., if and only if $\alpha<1$ and $c>-\alpha$. Together with the restrictions above, these restrictions imply the following allowable parameter ranges for the beta process fixed hyperparameters:

\gamma>0,\quad\alpha\in[0,1),\quad c>-\alpha,\quad\rho_{fix,k},\sigma_{fix,k}>0\quad\textrm{for all $k\in[K_{fix}]$}. \qquad (11)

These correspond to the hyperparameter ranges previously found in (Teh and Görür, 2009; Broderick et al., 2012). \blacksquare
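
As a quick numerical sanity check (ours, not from the paper), the following sketch verifies that the Assumption A2 integral above matches its closed form $\gamma\,\mathrm{B}(1-\alpha,c+\alpha)$ for hyperparameters satisfying Eq. (11); the particular values below are illustrative.

    import numpy as np
    from scipy import integrate, special

    gamma_, alpha_, c_ = 2.0, 0.3, 1.5   # an arbitrary choice satisfying Eq. (11)

    # Expected number of atoms per observation under the beta process-Bernoulli pair:
    # gamma * int_0^1 theta^{(1 - alpha) - 1} (1 - theta)^{(c + alpha) - 1} dtheta.
    integrand = lambda th: gamma_ * th ** (-alpha_) * (1.0 - th) ** (c_ + alpha_ - 1.0)
    numeric, _ = integrate.quad(integrand, 0.0, 1.0)
    closed_form = gamma_ * special.beta(1.0 - alpha_, c_ + alpha_)
    print(numeric, closed_form)  # the two values should agree closely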

3 Posteriors

In Section 2, we defined a full Bayesian model consisting of a CRM prior for $\Theta$ and a CRM likelihood for an observation $X$ conditional on $\Theta$. Now we would like to calculate the posterior distribution of $\Theta|X$.

Theorem 3.1 (Bayesian nonparametric posteriors).

Let $\Theta$ be a completely random measure that satisfies Assumptions A0 and A1; that is, $\Theta$ is a CRM with $K_{fix}$ fixed atoms such that $K_{fix}<\infty$ and such that the $k$th atom can be written $\theta_{fix,k}\delta_{\psi_{fix,k}}$ with

\theta_{fix,k}\stackrel{indep}{\sim}F_{fix,k}(d\theta)

for proper distribution $F_{fix,k}$ and deterministic $\psi_{fix,k}$. Let the ordinary component of $\Theta$ have rate measure

\mu(d\theta\times d\psi)=\nu(d\theta)\cdot G(d\psi),

where $G$ is a proper distribution and $\nu(\mathbb{R}_{+})=\infty$. Write $\Theta=\sum_{k=1}^{\infty}\theta_{k}\delta_{\psi_{k}}$, and let $X$ be generated conditional on $\Theta$ according to $X=\sum_{k=1}^{\infty}x_{k}\delta_{\psi_{k}}$ with $x_{k}\stackrel{indep}{\sim}h(x|\theta_{k})$ for proper, discrete probability mass function $h$. And suppose $X$ and $\Theta$ jointly satisfy Assumption A2 so that

\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)h(x|\theta)<\infty.

Then let $\Theta_{post}$ be a random measure with the distribution of $\Theta|X$. $\Theta_{post}$ is a completely random measure with three parts.

  1. For each $k\in[K_{fix}]$, $\Theta_{post}$ has a fixed-location atom at $\psi_{fix,k}$ with weight $\theta_{post,fix,k}$ distributed according to the finite-dimensional posterior $F_{post,fix,k}(d\theta)$ that comes from prior $F_{fix,k}$, likelihood $h$, and observation $X(\{\psi_{fix,k}\})$.

  2. Let $\{x_{new,k}\delta_{\psi_{new,k}}:k\in[K_{new}]\}$ be the atoms of $X$ that are not at fixed locations in the prior of $\Theta$. $K_{new}$ is finite by Assumption A2. Then $\Theta_{post}$ has a fixed-location atom at $\psi_{new,k}$ with random weight $\theta_{post,new,k}$, whose distribution $F_{post,new,k}(d\theta)$ is proportional to

     \nu(d\theta)h(x_{new,k}|\theta).

  3. The ordinary component of $\Theta_{post}$ has rate measure

     \nu_{post}(d\theta):=\nu(d\theta)h(0|\theta).
Proof.

To prove the theorem, we consider in turn each of the two parts of the prior: the fixed-location component and the ordinary component. First, consider any fixed-location atom, $\theta_{fix,k}\delta_{\psi_{fix,k}}$, in the prior. All of the other fixed-location atoms in the prior, as well as the prior ordinary component, are independent of the random weight $\theta_{fix,k}$. So it follows that all of $X$ except $x_{fix,k}:=X(\{\psi_{fix,k}\})$ is independent of $\theta_{fix,k}$. Thus the posterior has a fixed atom located at $\psi_{fix,k}$ whose weight, which we denote $\theta_{post,fix,k}$, has distribution

F_{post,fix,k}(d\theta)\propto F_{fix,k}(d\theta)h(x_{fix,k}|\theta),

which follows from the usual finite Bayes Theorem.

Next, consider the ordinary component in the prior. Let

\Psi_{fix}=\{\psi_{fix,1},\ldots,\psi_{fix,K_{fix}}\}

be the set of fixed-location atoms in the prior. Recall that $\Psi_{fix}$ is deterministic, and since $G$ is continuous, all of the fixed-location atoms and ordinary component atoms of $\Theta$ are at a.s. distinct locations. So the measure $X_{fix}$ defined by

X_{fix}(A):=X(A\cap\Psi_{fix})

can be derived purely from $X$, without knowledge of $\Theta$. It follows that the measure $X_{ord}$ defined by

X_{ord}(A):=X(A\cap(\Psi\backslash\Psi_{fix}))

can be derived purely from $X$ without knowledge of $\Theta$. $X_{ord}$ is the same as the observed data measure $X$ but with atoms only at atoms of the ordinary component of $\Theta$ and not at the fixed-location atoms of $\Theta$.

Now for any value $x\in\mathbb{Z}_{+}$, let

\{\psi_{new,x,1},\ldots,\psi_{new,x,K_{new,x}}\}

be all of the locations of atoms of size $x$ in $X_{ord}$. By Assumption A2, the number of such atoms, $K_{new,x}$, is finite. Further let $\theta_{new,x,k}:=\Theta(\{\psi_{new,x,k}\})$. Then the values $\{\theta_{new,x,k}\}_{k=1}^{K_{new,x}}$ are generated from a thinned Poisson point process with rate measure

\nu_{x}(d\theta):=\nu(d\theta)h(x|\theta). \qquad (12)

And since $\nu_{x}(\mathbb{R}_{+})<\infty$ by assumption, each $\theta_{new,x,k}$ has distribution equal to the normalized rate measure in Eq. (12). Note that $\theta_{new,x,k}\delta_{\psi_{new,x,k}}$ is a fixed-location atom in the posterior now that its location is known from the observed $X_{ord}$.

By contrast, if a likelihood draw at an ordinary component atom in the prior returns a zero, that atom is not observed in $X_{ord}$. Such atom weights in $\Theta_{post}$ thus form a thinned Poisson point process with rate measure

\nu(d\theta)h(0|\theta),

as was to be shown. ∎
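
To make the three-part recipe concrete, the following hedged sketch (the function and variable names are ours) applies Theorem 3.1 to the beta process prior of Example 2.1 paired with a Bernoulli process likelihood, $h(x|\theta)=\theta^{x}(1-\theta)^{1-x}$, for a single observation $X$.

    def beta_bernoulli_crm_posterior(gamma, alpha, c, fixed_atoms, observed):
        """fixed_atoms: dict psi -> (rho, sigma), beta hyperparameters of prior fixed atoms.
        observed: dict psi -> x in {0, 1}, the values of X at its atom locations and at
        the prior fixed locations (absent keys are treated as x = 0)."""
        posterior_fixed = {}
        # Part 1: prior fixed-location atoms get the finite beta-Bernoulli update.
        for psi, (rho, sigma) in fixed_atoms.items():
            x = observed.get(psi, 0)
            posterior_fixed[psi] = (rho + x, sigma + 1 - x)
        # Part 2: atoms of X at new locations become fixed atoms with weight density
        # proportional to nu(dtheta) h(1|theta), i.e. Beta(1 - alpha, c + alpha).
        for psi, x in observed.items():
            if psi not in fixed_atoms and x == 1:
                posterior_fixed[psi] = (1.0 - alpha, c + alpha)
        # Part 3: the ordinary component keeps the beta process form of Eq. (9) with
        # nu_post(dtheta) = nu(dtheta) h(0|theta), i.e. the update c -> c + 1.
        posterior_ordinary = dict(gamma=gamma, alpha=alpha, c=c + 1.0)
        return posterior_fixed, posterior_ordinary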

In Theorem 3.1, we consider generating $\Theta$ and then a single data point $X$ conditional on $\Theta$. Now suppose we generate $\Theta$ and then $N$ data points, $X_{1},\ldots,X_{N}$, iid conditional on $\Theta$. In this case, Theorem 3.1 may be iterated to find the posterior $\Theta|X_{1:N}$. In particular, Theorem 3.1 gives the ordinary component and fixed atoms of the random measure $\Theta_{1}:=\Theta|X_{1}$. Then, using $\Theta_{1}$ as the prior measure and $X_{2}$ as the data point, another application of Theorem 3.1 gives $\Theta_{2}:=\Theta|X_{1:2}$. We continue recursively using $\Theta|X_{1:n}$ for $n$ between 1 and $N-1$ as the prior measure until we find $\Theta|X_{1:N}$. The result is made explicit in the following corollary.

Corollary 3.2 (Bayesian nonparametric posteriors given multiple data points).

Let $\Theta$ be a completely random measure that satisfies Assumptions A0 and A1; that is, $\Theta$ is a CRM with $K_{fix}$ fixed atoms such that $K_{fix}<\infty$ and such that the $k$th atom can be written $\theta_{fix,k}\delta_{\psi_{fix,k}}$ with

\theta_{fix,k}\stackrel{indep}{\sim}F_{fix,k}(d\theta)

for proper distribution $F_{fix,k}$ and deterministic $\psi_{fix,k}$. Let the ordinary component of $\Theta$ have rate measure

\mu(d\theta\times d\psi)=\nu(d\theta)\cdot G(d\psi),

where $G$ is a proper distribution and $\nu(\mathbb{R}_{+})=\infty$. Write $\Theta=\sum_{k=1}^{\infty}\theta_{k}\delta_{\psi_{k}}$, and let $X_{1},\ldots,X_{N}$ be generated iid conditional on $\Theta$ according to $X_{n}=\sum_{k=1}^{\infty}x_{n,k}\delta_{\psi_{k}}$ with $x_{n,k}\stackrel{indep}{\sim}h(x|\theta_{k})$ for proper, discrete probability mass function $h$. And suppose $X_{1}$ and $\Theta$ jointly satisfy Assumption A2 so that

\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)h(x|\theta)<\infty.

It is enough to make the assumption for $X_{1}$ since the $X_{n}$ are iid conditional on $\Theta$.

Then let $\Theta_{post}$ be a random measure with the distribution of $\Theta|X_{1:N}$. $\Theta_{post}$ is a completely random measure with three parts.

  1. For each $k\in[K_{fix}]$, $\Theta_{post}$ has a fixed-location atom at $\psi_{fix,k}$ with weight $\theta_{post,fix,k}$ distributed according to the finite-dimensional posterior $F_{post,fix,k}(d\theta)$ that comes from prior $F_{fix,k}$, likelihood $h$, and observations $X_{1}(\{\psi_{fix,k}\}),\ldots,X_{N}(\{\psi_{fix,k}\})$.

  2. Let $\{\psi_{new,k}:k\in[K_{new}]\}$ be the union of atom locations across $X_{1},X_{2},\ldots,X_{N}$ minus the fixed locations in the prior of $\Theta$. $K_{new}$ is finite. Let $x_{new,n,k}$ be the weight of the atom in $X_{n}$ located at $\psi_{new,k}$. Note that at least one of the $x_{new,n,k}$ across $n$ must be non-zero, but in general $x_{new,n,k}$ may equal zero. Then $\Theta_{post}$ has a fixed-location atom at $\psi_{new,k}$ with random weight $\theta_{post,new,k}$, whose distribution $F_{post,new,k}(d\theta)$ is proportional to

     \nu(d\theta)\prod_{n=1}^{N}h(x_{new,n,k}|\theta).

  3. The ordinary component of $\Theta_{post}$ has rate measure

     \nu_{post,N}(d\theta):=\nu(d\theta)\left[h(0|\theta)\right]^{N}.
Proof.

Corollary 3.2 follows from recursive application of Theorem 3.1. In order to recursively apply Theorem 3.1, we need to verify that Assumptions A0, A1, and A2 hold for the posterior $\Theta|X_{1:(n+1)}$ when they hold for the prior $\Theta|X_{1:n}$. Note that the number of fixed atoms in the posterior is the number of fixed atoms in the prior plus the number of new atoms in the posterior. By Theorem 3.1, these counts are both finite as long as $\Theta|X_{1:n}$ satisfies Assumptions A0 and A2, which both hold for $n=0$ by assumption and for $n>0$ by the recursive assumption. So Assumption A0 holds for $\Theta|X_{1:(n+1)}$.

Next we notice that since Assumption A1 implies that there is an infinite number of ordinary component atoms in $\Theta|X_{1:n}$ and only finitely many become fixed atoms in the posterior by Assumption A2, it must be that $\Theta|X_{1:(n+1)}$ has infinitely many ordinary component atoms. So Assumption A1 holds for $\Theta|X_{1:(n+1)}$.

Finally, we note that

\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_{+}}\nu_{post,n}(d\theta)h(x|\theta)=\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\left[h(0|\theta)\right]^{n}h(x|\theta)\leq\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)h(x|\theta)<\infty,

where the penultimate inequality follows since $h(0|\theta)\in[0,1]$ and where the final inequality follows by Assumption A2 on the original $\Theta$ (conditioned on no data). So Assumption A2 holds for $\Theta|X_{1:(n+1)}$. ∎

We now illustrate the results of the theorem with an example.

Example 3.3.

Suppose we again start with a beta process prior for $\Theta$ as in Example 2.1. This time we consider a negative binomial process likelihood (Zhou et al., 2012; Broderick et al., 2015). The negative binomial process specifies that, given $\Theta=\sum_{k=1}^{\infty}\theta_{k}\delta_{\psi_{k}}$, we draw $X=\sum_{k=1}^{\infty}x_{k}\delta_{\psi_{k}}$ with

x_{k}\stackrel{indep}{\sim}\mathrm{NegBin}(x|r,\theta_{k}),

for some fixed hyperparameter $r>0$. So

X_{n}=\sum_{k=1}^{\infty}x_{n,k}\delta_{\psi_{k}}.

In this case, Assumption A2 translates into the following restriction.

\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\cdot h(x|\theta)=\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\cdot\left[1-h(0|\theta)\right]=\int_{\theta\in(0,1]}\gamma\theta^{-\alpha-1}(1-\theta)^{c+\alpha-1}\;d\theta\cdot\left[1-(1-\theta)^{r}\right]<\infty,

where the penultimate equality follows since the support of $\nu(d\theta)$ is $(0,1]$.

By a Taylor expansion, we have $1-(1-\theta)^{r}=r\theta+o(\theta)$ as $\theta\rightarrow 0$, so we require

\int_{\theta\in(0,1]}\theta^{1-\alpha-1}(1-\theta)^{c+\alpha-1}\;d\theta<\infty,

which is satisfied if and only if $1-\alpha$ and $c+\alpha$ are the parameters of a proper beta distribution. Thus, we have the same parameter restrictions as in Eq. (11).

Now we calculate the posterior given the beta process prior on $\Theta$ and the negative binomial process likelihood for $X$ conditional on $\Theta$. In particular, the posterior has the distribution of $\Theta_{post}$, a CRM with three parts given by Theorem 3.1.

First, at each fixed atom $\psi_{fix,k}$ of the prior with weight $\theta_{fix,k}$ given by Eq. (10), there is a fixed atom in the posterior with weight $\theta_{post,fix,k}$. Let $x_{post,fix,k}:=X(\{\psi_{fix,k}\})$. Then $\theta_{post,fix,k}$ has distribution

F_{post,fix,k}(d\theta) \propto F_{fix,k}(d\theta)\cdot h(x_{post,fix,k}|\theta)
= \mathrm{Beta}(\theta|\rho_{fix,k},\sigma_{fix,k})\;d\theta\cdot\mathrm{NegBin}(x_{post,fix,k}|r,\theta)
\propto \theta^{\rho_{fix,k}-1}(1-\theta)^{\sigma_{fix,k}-1}\;d\theta\cdot\theta^{x_{post,fix,k}}(1-\theta)^{r}
\propto \mathrm{Beta}\left(\theta\left|\rho_{fix,k}+x_{post,fix,k},\sigma_{fix,k}+r\right.\right)\;d\theta. \qquad (13)

Second, for any atom $x_{new,k}\delta_{\psi_{new,k}}$ in $X$ that is not at a fixed location in the prior, $\Theta_{post}$ has a fixed atom at $\psi_{new,k}$ whose weight $\theta_{post,new,k}$ has distribution

F_{post,new,k}(d\theta) \propto \nu(d\theta)\cdot h(x_{new,k}|\theta)
= \nu(d\theta)\cdot\mathrm{NegBin}(x_{new,k}|r,\theta)
\propto \theta^{-\alpha-1}(1-\theta)^{c+\alpha-1}\;d\theta\cdot\theta^{x_{new,k}}(1-\theta)^{r}
\propto \mathrm{Beta}\left(\theta\left|-\alpha+x_{new,k},c+\alpha+r\right.\right)\;d\theta, \qquad (14)

which is a proper distribution since we have the following restrictions on its parameters. For one, by assumption, $x_{new,k}\geq 1$. And further, by Eq. (11), we have $\alpha\in[0,1)$ as well as $c+\alpha>0$ and $r>0$.

Third, the ordinary component of $\Theta_{post}$ has rate measure

\nu(d\theta)h(0|\theta)=\gamma\theta^{-\alpha-1}(1-\theta)^{c+\alpha-1}\;d\theta\cdot(1-\theta)^{r}=\gamma\theta^{-\alpha-1}(1-\theta)^{c+r+\alpha-1}\;d\theta.

Not only have we found the posterior distribution $\Theta_{post}$ above, but now we can note that the posterior is in the same form as the prior with updated ordinary component hyperparameters:

\gamma_{post}=\gamma,\quad\alpha_{post}=\alpha,\quad c_{post}=c+r.

The posterior also has old and new beta-distributed fixed atoms with beta distribution hyperparameters given in Eq. (13) and Eq. (14), respectively. Thus, we have proven that the beta process is, in fact, conjugate to the negative binomial process. An alternative proof was first given by Broderick et al. (2015). \blacksquare
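
The resulting hyperparameter bookkeeping is simple enough to state as a short sketch (ours; the names are illustrative): given the beta process hyperparameters $(\gamma,\alpha,c)$, beta fixed-atom hyperparameters $(\rho_{fix,k},\sigma_{fix,k})$, and a single negative binomial process observation with parameter $r$, it returns the posterior hyperparameters from Eqs. (13) and (14) and the updated ordinary component.

    def beta_negbin_crm_posterior(gamma, alpha, c, r, fixed_atoms, observed):
        """fixed_atoms: dict psi -> (rho, sigma); observed: dict psi -> x in {0, 1, 2, ...}."""
        posterior_fixed = {}
        # Old fixed atoms: Beta(rho + x, sigma + r), as in Eq. (13).
        for psi, (rho, sigma) in fixed_atoms.items():
            x = observed.get(psi, 0)
            posterior_fixed[psi] = (rho + x, sigma + r)
        # New fixed atoms at observed locations: Beta(-alpha + x, c + alpha + r), Eq. (14).
        for psi, x in observed.items():
            if psi not in fixed_atoms and x >= 1:
                posterior_fixed[psi] = (-alpha + x, c + alpha + r)
        # Ordinary component: same beta process form with c -> c + r.
        posterior_ordinary = dict(gamma=gamma, alpha=alpha, c=c + r)
        return posterior_fixed, posterior_ordinary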

As in Example 3.3, we can use Theorem 3.1 not only to calculate posteriors but also, once those posteriors are calculated, to check for conjugacy. This approach unifies existing disparate approaches to Bayesian nonparametric conjugacy. However, it still requires the practitioner to guess the right conjugate prior for a given likelihood. In the next section, we define a notion of exponential families for CRMs, and we show how to automatically construct a conjugate prior for any exponential family likelihood.

4 Exponential families

Exponential families are what typically make conjugacy so powerful in the finite case. For one, when a finite likelihood belongs to an exponential family, then existing results give an automatic conjugate, exponential family prior for that likelihood. In this section, we review finite exponential families, define exponential CRMs, and show that analogous automatic conjugacy results can be obtained for exponential CRMs. Our development of exponential CRMs will also allow particularly straightforward results for size-biased representations (Corollary 5.2 in Section 5) and marginal processes (Corollary 6.2 in Section 6).

In the finite-dimensional case, suppose we have some (random) parameter $\theta$ and some (random) observation $x$ whose distribution is conditioned on $\theta$. We say the distribution $H_{exp,like}$ of $x$ conditional on $\theta$ is in an exponential family if

H_{exp,like}(dx|\theta)=h_{exp,like}(x|\theta)\;\mu(dx)=\kappa(x)\exp\left\{\langle\eta(\theta),\phi(x)\rangle-A(\theta)\right\}\;\mu(dx), \qquad (15)

where $\eta(\theta)$ is the natural parameter, $\phi(x)$ is the sufficient statistic, $\kappa(x)$ is the base density, and $A(\theta)$ is the log partition function. We denote the density of $H_{exp,like}$ here, which exists by definition, by $h_{exp,like}$. The measure $\mu$—with respect to which the density $h_{exp,like}$ exists—is typically Lebesgue measure when $H_{exp,like}$ is diffuse or counting measure when $H_{exp,like}$ is atomic. $A(\theta)$ is determined by the condition that $H_{exp,like}(dx|\theta)$ have unit total mass on its support.

It is a classic result (Diaconis and Ylvisaker, 1979) that the following distribution for $\theta\in\mathbb{R}^{D}$ constitutes a conjugate prior:

F_{exp,prior}(d\theta)=f_{exp,prior}(\theta)\;d\theta=\exp\left\{\langle\xi,\eta(\theta)\rangle+\lambda\left[-A(\theta)\right]-B(\xi,\lambda)\right\}\;d\theta. \qquad (16)

$F_{exp,prior}$ is another exponential family distribution, now with natural parameter $(\xi^{\prime},\lambda)^{\prime}$, sufficient statistic $(\eta(\theta)^{\prime},-A(\theta))^{\prime}$, and log partition function $B(\xi,\lambda)$. Note that the logarithms of the densities in both Eq. (15) and Eq. (16) are linear in $\eta(\theta)$ and $-A(\theta)$. So, by Bayes Theorem, the posterior $F_{exp,post}$ also has these quantities as sufficient statistics in $\theta$, and we can see $F_{exp,post}$ must have the following form.

F_{exp,post}(d\theta|x)=f_{exp,post}(\theta|x)\;d\theta=\exp\left\{\langle\xi+\phi(x),\eta(\theta)\rangle+(\lambda+1)\left[-A(\theta)\right]-B(\xi+\phi(x),\lambda+1)\right\}\;d\theta. \qquad (17)

Thus we see that $F_{exp,post}$ belongs to the same exponential family as $F_{exp,prior}$ in Eq. (16), and hence $F_{exp,prior}$ is a conjugate prior for $H_{exp,like}$ in Eq. (15).
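
In code, the finite-dimensional conjugate update of Eq. (17) is just addition in the hyperparameters; the sketch below (ours; names illustrative) iterates it over $N$ conditionally iid observations, sending $(\xi,\lambda)$ to $(\xi+\sum_{n}\phi(x_{n}),\lambda+N)$.

    import numpy as np

    def exp_family_posterior(xi, lam, xs, phi):
        """Apply the Eq. (17) update once per observation in xs."""
        xi_post = np.asarray(xi, dtype=float)
        lam_post = float(lam)
        for x in xs:
            xi_post = xi_post + np.asarray(phi(x), dtype=float)
            lam_post = lam_post + 1.0
        return xi_post, lam_post

    # For the Bernoulli likelihood written in exponential family form, phi(x) = x,
    # and this update reproduces (up to reparameterization) the beta-Bernoulli
    # addition from Section 1.
    print(exp_family_posterior(xi=[1.0], lam=2.0, xs=[1, 0, 1, 1], phi=lambda x: [x]))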

4.1 Exponential families for completely random measures

In the finite-dimensional case, we saw that for any exponential family likelihood, as in Eq. (15), we can always construct a conjugate exponential family prior, given by Eq. (16).

In order to prove a similar result for CRMs, we start by defining a notion of exponential families for CRMs.

Definition 4.1.

We say that a CRM $\Theta$ is an exponential CRM if it has the following two parts. First, let $\Theta$ have $K_{fix}$ fixed-location atoms, where $K_{fix}$ may be finite or infinite. The $k$th fixed-location atom is located at any $\psi_{fix,k}$, unique from the other fixed locations, and has random weight $\theta_{fix,k}$, whose distribution has density $f_{fix,k}$:

f_{fix,k}(\theta)=\kappa(\theta)\exp\left\{\langle\eta(\zeta_{k}),\phi(\theta)\rangle-A(\zeta_{k})\right\},

for some base density $\kappa$, natural parameter function $\eta$, sufficient statistic $\phi$, and log partition function $A$ shared across atoms. Here, $\zeta_{k}$ is an atom-specific parameter.

Second, let $\Theta$ have an ordinary component with rate measure $\mu(d\theta\times d\psi)=\nu(d\theta)\cdot G(d\psi)$ for some proper distribution $G$ and weight rate measure $\nu$ of the form

\nu(d\theta)=\gamma\exp\left\{\langle\eta(\zeta),\phi(\theta)\rangle\right\}.

In particular, $\eta$ and $\phi$ are shared with the fixed-location atoms, and fixed hyperparameters $\gamma$ and $\zeta$ are unique to the ordinary component.
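
For instance, one can check (this worked example is ours, not quoted from the paper) that the three-parameter beta process weight rate measure of Eq. (9) has this exponential CRM form with respect to Lebesgue measure $d\theta$, with a two-dimensional sufficient statistic:

\gamma\theta^{-\alpha-1}(1-\theta)^{c+\alpha-1}\;d\theta=\gamma\exp\left\{\langle\eta(\zeta),\phi(\theta)\rangle\right\}\;d\theta,\qquad\phi(\theta)=(\log\theta,\ \log(1-\theta))^{\prime},\quad\eta(\zeta)=(-\alpha-1,\ c+\alpha-1)^{\prime}.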

4.2 Automatic conjugacy for completely random measures

With Definition 4.1 in hand, we can specify an automatic Bayesian nonparametric conjugate prior for an exponential CRM likelihood.

Theorem 4.2 (Automatic conjugacy).

Let $\Theta=\sum_{k=1}^{\infty}\theta_{k}\delta_{\psi_{k}}$, in accordance with Assumption A1. Let $X$ be generated conditional on $\Theta$ according to an exponential CRM with fixed-location atoms at $\{\psi_{k}\}_{k=1}^{\infty}$ and no ordinary component. In particular, the distribution of the weight $x_{k}$ at $\psi_{k}$ of $X$ has the following density conditional on the weight $\theta_{k}$ at $\psi_{k}$ of $\Theta$:

h(x|\theta_{k})=\kappa(x)\exp\left\{\langle\eta(\theta_{k}),\phi(x)\rangle-A(\theta_{k})\right\}.

Then a conjugate prior for $\Theta$ is the following exponential CRM distribution. First, let $\Theta$ have $K_{prior,fix}$ fixed-location atoms, in accordance with Assumption A0. The $k$th such atom has random weight $\theta_{fix,k}$ with proper density

f_{prior,fix,k}(\theta)=\exp\left\{\langle\xi_{fix,k},\eta(\theta)\rangle+\lambda_{fix,k}\left[-A(\theta)\right]-B(\xi_{fix,k},\lambda_{fix,k})\right\},

where $(\eta^{\prime},-A)^{\prime}$ here is the sufficient statistic and $B$ is the log partition function. $\xi_{fix,k}$ and $\lambda_{fix,k}$ are fixed hyperparameters for this atom weight.

Second, let $\Theta$ have an ordinary component characterized by any proper distribution $G$ and weight rate measure

\nu(d\theta)=\gamma\exp\left\{\langle\xi,\eta(\theta)\rangle+\lambda\left[-A(\theta)\right]\right\},

where $\gamma$, $\xi$, and $\lambda$ are fixed hyperparameters of the weight rate measure chosen to satisfy Assumptions A1 and A2.

Proof.

To prove the conjugacy of the prior for $\Theta$ with the likelihood for $X$, we calculate the posterior distribution of $\Theta|X$ using Theorem 3.1. Let $\Theta_{post}$ be a CRM with the distribution of $\Theta|X$. Then, by Theorem 3.1, $\Theta_{post}$ has the following three parts.

First, at any fixed location $\psi_{fix,k}$ in the prior, let $x_{fix,k}$ be the value of $X$ at that location. Then $\Theta_{post}$ has a fixed-location atom at $\psi_{fix,k}$, and its weight $\theta_{post,fix,k}$ has distribution

F_{post,fix,k}(d\theta) \propto f_{prior,fix,k}(\theta)\;d\theta\cdot h(x_{fix,k}|\theta)
\propto \exp\left\{\langle\xi_{fix,k},\eta(\theta)\rangle+\lambda_{fix,k}\left[-A(\theta)\right]\right\}\;d\theta\cdot\exp\left\{\langle\eta(\theta),\phi(x_{fix,k})\rangle-A(\theta)\right\}
= \exp\left\{\langle\xi_{fix,k}+\phi(x_{fix,k}),\eta(\theta)\rangle+(\lambda_{fix,k}+1)\left[-A(\theta)\right]\right\}\;d\theta.

It follows, from putting in the normalizing constant, that the distribution of $\theta_{post,fix,k}$ has density

fpost,fix,k(θ)\displaystyle f_{post,fix,k}(\theta) =exp{ξfix,k+ϕ(xfix,k),η(θ)+(λfix,k+1)[A(θ)]\displaystyle=\exp\left\{\langle\xi_{fix,k}+\phi(x_{fix,k}),\eta(\theta)\rangle+(\lambda_{fix,k}+1)\left[-A(\theta)\right]\right.
B(ξfix,k+ϕ(xfix,k),λfix,k+1)}.\displaystyle\left.{}-B(\xi_{fix,k}+\phi(x_{fix,k}),\lambda_{fix,k}+1)\right\}.

Second, for any atom xnew,kδψnew,kx_{new,k}\delta_{\psi_{new,k}} in XX that is not at a fixed location in the prior, Θpost\Theta_{post} has a fixed atom at ψnew,k\psi_{new,k} whose weight θpost,new,k\theta_{post,new,k} has distribution

F_{post,new,k}(d\theta) \displaystyle\propto\nu(d\theta)\cdot h(x_{new,k}|\theta)
exp{ξ,η(θ)+λ[A(θ)]}exp{η(θ),ϕ(xnew,k)A(θ)}dθ\displaystyle\propto\exp\left\{\langle\xi,\eta(\theta)\rangle+\lambda\left[-A(\theta)\right]\right\}\cdot\exp\left\{\langle\eta(\theta),\phi(x_{new,k})\rangle-A(\theta)\right\}\;d\theta
=exp{ξ+ϕ(xnew,k),η(θ)+(λ+1)[A(θ)]}dθ\displaystyle=\exp\left\{\langle\xi+\phi(x_{new,k}),\eta(\theta)\rangle+(\lambda+1)\left[-A(\theta)\right]\right\}\;d\theta

and hence density

fpost,new,k(θ)\displaystyle f_{post,new,k}(\theta) =exp{ξ+ϕ(xnew,k),η(θ)+(λ+1)[A(θ)]B(ξ+ϕ(xnew,k),λ+1)}.\displaystyle=\exp\left\{\langle\xi+\phi(x_{new,k}),\eta(\theta)\rangle+(\lambda+1)\left[-A(\theta)\right]-B(\xi+\phi(x_{new,k}),\lambda+1)\right\}.

Third, the ordinary component of Θpost\Theta_{post} has weight rate measure

ν(dθ)h(0|θ)\displaystyle\nu(d\theta)\cdot h(0|\theta)
=γexp{ξ,η(θ)+λ[A(θ)]}κ(0)exp{η(θ),ϕ(0)A(θ)}\displaystyle=\gamma\exp\left\{\langle\xi,\eta(\theta)\rangle+\lambda\left[-A(\theta)\right]\right\}\cdot\kappa(0)\exp\left\{\langle\eta(\theta),\phi(0)\rangle-A(\theta)\right\}
=γκ(0)exp{ξ+ϕ(0),η(θ)+(λ+1)[A(θ)]}.\displaystyle=\gamma\kappa(0)\cdot\exp\left\{\langle\xi+\phi(0),\eta(\theta)\rangle+(\lambda+1)\left[-A(\theta)\right]\right\}.

Thus, the posterior rate measure is in the same exponential CRM form as the prior rate measure with updated hyperparameters:

γpost=γκ(0),ξpost=ξ+ϕ(0),λpost=λ+1.\displaystyle\gamma_{post}=\gamma\kappa(0),\quad\xi_{post}=\xi+\phi(0),\quad\lambda_{post}=\lambda+1.

Since we see that the posterior fixed-location atoms are likewise in the same exponential CRM form as the prior, we have shown that conjugacy holds, as desired. ∎
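As a concrete illustration of the updates in Theorem 4.2, the hyperparameter recursions can be written in a few lines of code. The following Python sketch is ours and not part of the theorem; the function names and example values are purely illustrative, and the statistics phi and kappa are assumed to come from the particular exponential CRM likelihood at hand.

# Sketch of the conjugate hyperparameter updates in Theorem 4.2 (illustrative only).

def update_fixed_atom(xi_fix, lam_fix, phi_x):
    """Posterior hyperparameters of one fixed-location atom weight after
    observing a likelihood weight with sufficient statistic phi_x = phi(x)."""
    return xi_fix + phi_x, lam_fix + 1

def update_ordinary(gamma, xi, lam, kappa_0, phi_0):
    """Posterior hyperparameters (gamma, xi, lam) of the ordinary-component
    weight rate measure, using kappa_0 = kappa(0) and phi_0 = phi(0)."""
    return gamma * kappa_0, xi + phi_0, lam + 1

# Example with a Poisson likelihood, where phi(x) = x and kappa(0) = 1:
print(update_fixed_atom(xi_fix=0.5, lam_fix=2.0, phi_x=3))                  # (3.5, 3.0)
print(update_ordinary(gamma=1.0, xi=-1.5, lam=2.0, kappa_0=1.0, phi_0=0))   # (1.0, -1.5, 3.0)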

We next use Theorem 4.2 to give proofs of conjugacy in cases where conjugacy has not previously been established in the Bayesian nonparametrics literature.

Example 4.3.

Let X be generated according to a Poisson likelihood process conditional on Θ. (We use the term “Poisson likelihood process” to distinguish this specific Bayesian nonparametric likelihood from the Poisson point process.) That is, X=\sum_{k=1}^{\infty}x_{k}\delta_{\psi_{k}} conditional on Θ=\sum_{k=1}^{\infty}\theta_{k}\delta_{\psi_{k}} has an exponential CRM distribution with only a fixed-location component. The weight x_k at location ψ_k has support on \mathbb{Z}_{*} and has a Poisson density with parameter θ_k\in\mathbb{R}_{+}:

h(x|θk)=1x!θkxeθk=1x!exp{xlog(θk)θk}.\displaystyle\begin{split}h(x|\theta_{k})=\frac{1}{x!}\theta_{k}^{x}e^{-\theta_{k}}=\frac{1}{x!}\exp\left\{x\log(\theta_{k})-\theta_{k}\right\}.\end{split} (18)

The final line is rewritten to emphasize the exponential family form of this density, with

κ(x)=1x!,ϕ(x)=x,η(θ)=log(θ),A(θ)=θ.\displaystyle\kappa(x)=\frac{1}{x!},\quad\phi(x)=x,\quad\eta(\theta)=\log(\theta),\quad A(\theta)=\theta.

By Theorem 4.2, this Poisson likelihood process has a Bayesian nonparametric conjugate prior for Θ\Theta with two parts.

First, Θ\Theta has a set of Kprior,fixK_{prior,fix} fixed-location atoms, where Kprior,fix<K_{prior,fix}<\infty by Assumption A0. The kkth such atom has random weight θfix,k\theta_{fix,k} with density

fprior,fix,k(θ)\displaystyle f_{prior,fix,k}(\theta) =exp{ξfix,k,η(θ)+λfix,k[A(θ)]B(ξfix,k,λfix,k)}\displaystyle=\exp\left\{\langle\xi_{fix,k},\eta(\theta)\rangle+\lambda_{fix,k}\left[-A(\theta)\right]-B(\xi_{fix,k},\lambda_{fix,k})\right\}
=θξfix,keλfix,kθexp{B(ξfix,k,λfix,k)}\displaystyle=\theta^{\xi_{fix,k}}e^{-\lambda_{fix,k}\theta}\exp\left\{-B(\xi_{fix,k},\lambda_{fix,k})\right\}
=Gamma(θ|ξfix,k+1,λfix,k),\displaystyle=\mathrm{Gamma}(\theta\left|\xi_{fix,k}+1,\lambda_{fix,k}\right.), (19)

where Gamma(θ|a,b)\mathrm{Gamma}(\theta|a,b) denotes the gamma density with shape parameter a>0a>0 and rate parameter b>0b>0. So we must have fixed hyperparameters ξfix,k>1\xi_{fix,k}>-1 and λfix,k>0\lambda_{fix,k}>0. Further,

exp{B(ξfix,k,λfix,k)}=λfix,kξfix,k+1/Γ(ξfix,k+1)\exp\left\{-B(\xi_{fix,k},\lambda_{fix,k})\right\}=\lambda_{fix,k}^{\xi_{fix,k}+1}/\Gamma(\xi_{fix,k}+1)

to ensure normalization.

Second, Θ\Theta has an ordinary component characterized by any proper distribution GG and weight rate measure

ν(dθ)=γexp{ξ,η(θ)+λ[A(θ)]}dθ=γθξeλθdθ.\displaystyle\nu(d\theta)=\gamma\exp\left\{\langle\xi,\eta(\theta)\rangle+\lambda\left[-A(\theta)\right]\right\}\;d\theta=\gamma\theta^{\xi}e^{-\lambda\theta}\;d\theta. (20)

Note that Theorem 4.2 guarantees that the weight rate measure will have the same distributional kernel in θ\theta as the fixed-location atoms.

Finally, we need to choose the allowable hyperparameter ranges for γ\gamma, ξ\xi, and λ\lambda. First, γ>0\gamma>0 to ensure ν\nu is a measure. By Assumption A1, we must have ν(+)=\nu(\mathbb{R}_{+})=\infty, so ν\nu must represent an improper gamma distribution. As such, we require either ξ+10\xi+1\leq 0 or λ0\lambda\leq 0. By Assumption A2, we must have

x=1θ+ν(dθ)h(x|θ)=θ+ν(dθ)[1h(0|θ)]=θ+γθξeλθ𝑑θ[1eθ]<.\displaystyle\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\cdot h(x|\theta)=\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\cdot[1-h(0|\theta)]=\int_{\theta\in\mathbb{R}_{+}}\gamma\theta^{\xi}e^{-\lambda\theta}\;d\theta\cdot\left[1-e^{-\theta}\right]<\infty.

To ensure the integral over [1,)[1,\infty) is finite, we must have λ>0\lambda>0. To ensure the integral over (0,1)(0,1) is finite, we note that 1eθ=θ+o(θ)1-e^{-\theta}=\theta+o(\theta) as θ0\theta\rightarrow 0. So we require

θ(0,1)γθξ+1eλθ𝑑θ<,\int_{\theta\in(0,1)}\gamma\theta^{\xi+1}e^{-\lambda\theta}\;d\theta<\infty,

which is satisfied if and only if ξ+2>0\xi+2>0.

The hyperparameter restrictions can thus be summarized as:

γ>0,ξ(2,1],λ>0;ξfix,k>1 and λfix,k>0for all k[Kprior,fix].\displaystyle\gamma>0,\quad\xi\in(-2,-1],\quad\lambda>0;\quad\xi_{fix,k}>-1\textrm{ and }\lambda_{fix,k}>0\quad\textrm{for all $k\in[K_{prior,fix}]$}.

The ordinary component of the conjugate prior for Θ\Theta discovered in this example is typically called a gamma process. Here, we have for the first time specified the distribution of the fixed-location atoms of the gamma process and, also for the first time, proved that the gamma process is conjugate to the Poisson likelihood process. We highlight this result as a corollary to Theorem 4.2.

Corollary 4.4.

Let the Poisson likelihood process be a CRM with fixed-location atom weight distributions as in Eq. (18). Let the gamma process be a CRM with fixed-location atom weight distributions as in Eq. (19) and ordinary component weight measure as in Eq. (20). Then the gamma process is a conjugate Bayesian nonparametric prior for the Poisson likelihood process.

\blacksquare
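To make the gamma-Poisson conjugacy of Corollary 4.4 concrete at a single fixed-location atom, the following short numerical check compares the closed-form posterior Gamma(ξ_{fix,k}+x+1, λ_{fix,k}+1) implied by Theorem 4.2 with a direct normalization of prior times likelihood on a grid. This sketch is ours; it uses SciPy, and the hyperparameter and observation values are arbitrary illustrative choices.

# Sketch: finite-dimensional check of gamma-Poisson conjugacy at one atom.
import numpy as np
from scipy import stats

xi, lam, x = -0.5, 2.0, 4                       # illustrative values only
theta = np.linspace(1e-3, 20.0, 2000)
dtheta = theta[1] - theta[0]

# Unnormalized posterior: Gamma(xi+1, lam) prior density times Poisson(x | theta) likelihood.
unnorm = stats.gamma.pdf(theta, a=xi + 1, scale=1.0 / lam) * stats.poisson.pmf(x, theta)
posterior_numeric = unnorm / (unnorm.sum() * dtheta)
posterior_closed = stats.gamma.pdf(theta, a=xi + x + 1, scale=1.0 / (lam + 1))

print(np.max(np.abs(posterior_numeric - posterior_closed)))   # small (quadrature error only)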

Example 4.5.

Next, let XX be generated according to a new process we call an odds Bernoulli process. We have previously seen a typical Bernoulli process likelihood in Example 2.1. In the odds Bernoulli process, we say that XX, conditional on Θ\Theta, has an exponential CRM distribution. In this case, the weight of the kkth atom, xkx_{k}, conditional on θk\theta_{k} has support on {0,1}\{0,1\} and has a Bernoulli density with odds parameter θk+\theta_{k}\in\mathbb{R}_{+}:

h(x|θk)=θkx(1+θk)1=exp{xlog(θk)log(1+θk)}.\displaystyle\begin{split}h(x|\theta_{k})&=\theta_{k}^{x}(1+\theta_{k})^{-1}\\ &=\exp\left\{x\log(\theta_{k})-\log(1+\theta_{k})\right\}.\end{split} (21)

That is, if ρ is the probability of a successful Bernoulli draw, then θ=ρ/(1-ρ) is the odds of success, that is, the ratio of the probability of success to the probability of failure.

The final line of Eq. (21) is written to emphasize the exponential family form of this density, with

κ(x)=1,ϕ(x)=x,η(θ)=log(θ),A(θ)=log(1+θ).\displaystyle\kappa(x)=1,\quad\phi(x)=x,\quad\eta(\theta)=\log(\theta),\quad A(\theta)=\log(1+\theta).

By Theorem 4.2, the likelihood for XX has a Bayesian nonparametric conjugate prior for Θ\Theta. This conjugate prior has two parts.

First, Θ\Theta has a set of Kprior,fixK_{prior,fix} fixed-location atoms. The kkth such atom has random weight θfix,k\theta_{fix,k} with density

fprior,fix,k(θ)\displaystyle f_{prior,fix,k}(\theta) =exp{ξfix,k,η(θ)+λfix,k[A(θ)]B(ξfix,k,λfix,k)}\displaystyle=\exp\left\{\langle\xi_{fix,k},\eta(\theta)\rangle+\lambda_{fix,k}\left[-A(\theta)\right]-B(\xi_{fix,k},\lambda_{fix,k})\right\}
=θξfix,k(1+θ)λfix,kexp{B(ξfix,k,λfix,k)}\displaystyle=\theta^{\xi_{fix,k}}(1+\theta)^{-\lambda_{fix,k}}\exp\left\{-B(\xi_{fix,k},\lambda_{fix,k})\right\}
=BetaPrime(θ|ξfix,k+1,λfix,kξfix,k1),\displaystyle=\mathrm{BetaPrime}\left(\theta\left|\xi_{fix,k}+1,\lambda_{fix,k}-\xi_{fix,k}-1\right.\right), (22)

where BetaPrime(θ|a,b)\mathrm{BetaPrime}(\theta|a,b) denotes the beta prime density with shape parameters a>0a>0 and b>0b>0. Further,

exp{B(ξfix,k,λfix,k)}=Γ(λfix,k)Γ(ξfix,k+1)Γ(λfix,kξfix,k1)\exp\left\{-B(\xi_{fix,k},\lambda_{fix,k})\right\}=\frac{\Gamma(\lambda_{fix,k})}{\Gamma(\xi_{fix,k}+1)\Gamma(\lambda_{fix,k}-\xi_{fix,k}-1)}

to ensure normalization.

Second, Θ\Theta has an ordinary component characterized by any proper distribution GG and weight rate measure

ν(dθ)=γexp{ξ,η(θ)+λ[A(θ)]}dθ=γθξ(1+θ)λdθ.\displaystyle\nu(d\theta)=\gamma\exp\left\{\langle\xi,\eta(\theta)\rangle+\lambda\left[-A(\theta)\right]\right\}\;d\theta=\gamma\theta^{\xi}(1+\theta)^{-\lambda}\;d\theta. (23)

We need to choose the allowable hyperparameter ranges for γ\gamma, ξ\xi, and λ\lambda. First, γ>0\gamma>0 to ensure ν\nu is a measure. By Assumption A1, we must have ν(+)=\nu(\mathbb{R}_{+})=\infty, so ν\nu must represent an improper beta prime distribution. As such, we require either ξ+10\xi+1\leq 0 or λξ10\lambda-\xi-1\leq 0. By Assumption A2, we must have

x=1θ+ν(dθ)h(x|θ)=θ+ν(dθ)h(1|θ)\displaystyle\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\cdot h(x|\theta)=\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\cdot h(1|\theta)
since the support of xx is {0,1}\{0,1\}
=θ+γθξ(1+θ)λ𝑑θθ1(1+θ)1=γθ+θξ+1(1+θ)λ1𝑑θ<.\displaystyle=\int_{\theta\in\mathbb{R}_{+}}\gamma\theta^{\xi}(1+\theta)^{-\lambda}\;d\theta\cdot\theta^{1}(1+\theta)^{-1}=\gamma\int_{\theta\in\mathbb{R}_{+}}\theta^{\xi+1}(1+\theta)^{-\lambda-1}\;d\theta<\infty.

Since the integrand is the kernel of a beta prime distribution, we simply require that this distribution be proper; i.e., ξ+2>0\xi+2>0 and λξ1>0\lambda-\xi-1>0.

The hyperparameter restrictions can be summarized as:

γ>0,ξ(2,1],λ>ξ+1;ξfix,k>1 and λfix,k>ξfix,k+1 for all k[Kprior,fix].\displaystyle\gamma>0,\xi\in(-2,-1],\lambda>\xi+1;\xi_{fix,k}>-1\textrm{ and }\lambda_{fix,k}>\xi_{fix,k}+1\textrm{ for all $k\in[K_{prior,fix}]$}.

We call the distribution for Θ described in this example the beta prime process. Its ordinary component was previously defined by Broderick et al. (2015), but this is the first full description of the beta prime process, including parameter restrictions and fixed-location atoms, and the first proof of its conjugacy with the odds Bernoulli process. We highlight the latter result as a corollary to Theorem 4.2 below.

Corollary 4.6.

Let the odds Bernoulli process be a CRM with fixed-location atom weight distributions as in Eq. (21). Let the beta prime process be a CRM with fixed-location atom weight distributions as in Eq. (22) and ordinary component weight measure as in Eq. (23). Then the beta prime process is a conjugate Bayesian nonparametric prior for the odds Bernoulli process.

\blacksquare
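A small illustration of the odds Bernoulli likelihood of Eq. (21) and the beta prime update implied by Theorem 4.2 for one fixed-location atom appears below. This Python sketch is ours and not part of the corollary; the function names and hyperparameter values are illustrative assumptions.

# Sketch: odds Bernoulli draws and the beta prime conjugate update at one atom.
import numpy as np

def odds_bernoulli_sample(theta, rng):
    """Draw x in {0,1} with success probability theta/(1+theta), i.e., odds theta."""
    return int(rng.random() < theta / (1.0 + theta))

def beta_prime_posterior(xi_fix, lam_fix, x):
    """BetaPrime shape parameters (a, b) of the posterior atom weight after
    observing x: the update xi -> xi + x, lam -> lam + 1 of Theorem 4.2 gives
    a = xi + x + 1 and b = (lam + 1) - (xi + x) - 1."""
    a = xi_fix + x + 1
    b = (lam_fix + 1) - (xi_fix + x) - 1
    return a, b

rng = np.random.default_rng(0)
print(odds_bernoulli_sample(theta=3.0, rng=rng))            # success with probability 0.75
print(beta_prime_posterior(xi_fix=0.0, lam_fix=2.0, x=1))   # -> (2.0, 1.0)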

5 Size-biased representations

We have shown in Section 4.2 that our exponential CRM (Definition 4.1) is useful in that we can find an automatic Bayesian nonparametric conjugate prior given an exponential CRM likelihood. We will see in this section and the next that exponential CRMs also let us build representations that support tractable inference despite the infinite-dimensional nature of the models we are using.

The best-known size-biased representation of a random measure in Bayesian nonparametrics is the stick-breaking representation of the Dirichlet process ΘDP\Theta_{DP} (Sethuraman, 1994):

\displaystyle\begin{split}\Theta_{DP}&=\sum_{k=1}^{\infty}\theta_{DP,k}\delta_{\psi_{k}};\\ \textrm{ For $k=1,2,\ldots$, }\theta_{DP,k}&=\beta_{k}\prod_{j=1}^{k-1}(1-\beta_{j}),\quad\beta_{k}\stackrel{{\scriptstyle iid}}{{\sim}}\mathrm{Beta}(1,c),\quad\psi_{k}\stackrel{{\scriptstyle iid}}{{\sim}}G,\end{split} (24)

where cc is a fixed hyperparameter satisfying c>0c>0.

The name “stick-breaking” originates from thinking of the unit interval as a stick of length one. At each round k, only some of the stick remains; β_k describes the proportion of the remaining stick that is broken off in round k, and θ_{DP,k} is the absolute length of stick broken off in round k. By construction, not only is each θ_{DP,k}∈(0,1) but in fact the θ_{DP,k} sum to one (the total stick length) and thus describe a probability distribution.
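For readers who find a constructive view helpful, a truncated version of Eq. (24) can be simulated directly. The Python sketch below is ours; it keeps only the first K atoms (an approximation) and takes G to be Uniform(0,1) purely for illustration.

# Sketch: truncated stick-breaking construction of Eq. (24).
import numpy as np

def truncated_dp_stick_breaking(c, K, rng):
    betas = rng.beta(1.0, c, size=K)                               # beta_k ~ Beta(1, c)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining                                    # theta_{DP,k}
    locations = rng.uniform(0.0, 1.0, size=K)                      # psi_k ~ G (illustrative G)
    return weights, locations

rng = np.random.default_rng(0)
w, psi = truncated_dp_stick_breaking(c=2.0, K=50, rng=rng)
print(w.sum())   # close to 1 for moderately large K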

Eq. (24) is called a size-biased representation for the following reason. Since the weights {θDP,k}k=1\{\theta_{DP,k}\}_{k=1}^{\infty} describe a distribution, we can make draws from this distribution; each such draw is sometimes thought of as a multinomial draw with a single trial. In that vein, typically we imagine that our data points Xmult,nX_{mult,n} are described as iid draws conditioned on ΘDP\Theta_{DP}, where Xmult,nX_{mult,n} is a random measure with just a single atom:

Xmult,n=δψmult,n;ψmult,n=ψk with probability θDP,k.\displaystyle\begin{split}X_{mult,n}=\delta_{\psi_{mult,n}};\quad\psi_{mult,n}=\psi_{k}\textrm{ with probability }\theta_{DP,k}.\end{split} (25)

Then the limiting proportion of data points X_{mult,n} with an atom at ψ_{mult,1} (the first atom location chosen) is θ_{DP,1}. The limiting proportion of data points with an atom at the next unique atom location chosen is θ_{DP,2}, and so on (Broderick et al., 2013).

The representation in Eq. (24) is so useful because there is a familiar, finite-dimensional distribution for each of the atom weights θDP,k\theta_{DP,k} of the random measure ΘDP\Theta_{DP}. This representation allows approximate inference via truncation (Ishwaran and James, 2001) or exact inference via slice sampling (Walker, 2007; Kalli et al., 2011).

Since the weights {θ_{DP,k}}_{k=1}^{\infty} are constrained to sum to one, the Dirichlet process is not a CRM; in fact, the Dirichlet process is a normalized gamma process (cf. Example 4.3) (Ferguson, 1973). Indeed, there has been much work on size-biased representations for more general normalized random measures, which include the Dirichlet process as just one example (Perman et al., 1992; Pitman, 1996a, b, 2003).

By contrast, we here wish to explore size-biasing for non-normalized CRMs. In the normalized CRM case, we considered which atom of a random discrete probability measure was drawn first and what the distribution of that atom’s size is. In the non-normalized CRM case considered in the present work, when drawing X conditional on Θ, there may be multiple atoms (or one atom or no atoms) of Θ that correspond to non-zero atoms in X; by Assumption A2, though, the number of such atoms is always finite. In this non-normalized CRM case, we wish to consider the sizes of all such atoms in Θ. Size-biased representations have been developed in the past for particular CRM examples, notably the beta process (Paisley et al., 2010; Broderick et al., 2012). And even though there is typically no interpretation of these representations in terms of a single stick representing a unit probability mass, they are sometimes referred to as stick-breaking representations as a nod to the popularity of Dirichlet process stick-breaking.

In the beta process case, such size-biased representations have already been shown to allow approximate inference via truncation (Doshi et al., 2009; Paisley et al., 2011) or exact inference via slice sampling (Teh et al., 2007; Broderick et al., 2015). Here we provide general recipes for the creation of these representations and illustrate our recipes by discovering previously unknown size-biased representations.

We have seen that a general CRM Θ\Theta takes the form of an a.s. discrete random measure:

k=1θkδψk.\sum_{k=1}^{\infty}\theta_{k}\delta_{\psi_{k}}. (26)

The fixed-location atoms are straightforward to simulate; there are finitely many by Assumption A0, their locations are fixed, and their weights are assumed to come from finite-dimensional distributions. The infinite-dimensionality of the Bayesian nonparametric CRM comes from the ordinary component (cf. Section 2.3 and Assumption A1). So far the only description we have of the ordinary component is its generation from the countable infinity of points in a Poisson point process. The next result constructively demonstrates that we can represent the distributions of the CRM weights {θk}k=1\{\theta_{k}\}_{k=1}^{\infty} in Eq. (26) as a sequence of finite-dimensional distributions, much as in the familiar Dirichlet process case.

Theorem 5.1 (Size-biased representations).

Let Θ\Theta be a completely random measure that satisfies Assumptions A0 and A1; that is, Θ\Theta is a CRM with KfixK_{fix} fixed atoms such that Kfix<K_{fix}<\infty and such that the kkth atom can be written θfix,kδψfix,k\theta_{fix,k}\delta_{\psi_{fix,k}}. The ordinary component of Θ\Theta has rate measure

μ(dθ×dψ)=ν(dθ)G(dψ),\mu(d\theta\times d\psi)=\nu(d\theta)\cdot G(d\psi),

where GG is a proper distribution and ν(+)=\nu(\mathbb{R}_{+})=\infty. Write Θ=k=1θkδψk\Theta=\sum_{k=1}^{\infty}\theta_{k}\delta_{\psi_{k}}, and let XnX_{n} be generated iid given Θ\Theta according to Xn=k=1xn,kδψkX_{n}=\sum_{k=1}^{\infty}x_{n,k}\delta_{\psi_{k}} with xn,kindeph(x|θk)x_{n,k}\stackrel{{\scriptstyle indep}}{{\sim}}h(x|\theta_{k}) for proper, discrete probability mass function hh. And suppose XnX_{n} and Θ\Theta jointly satisfy Assumption A2 so that

x=1θ+ν(dθ)h(x|θ)<.\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)h(x|\theta)<\infty.

Then we can write

\displaystyle\begin{split}\Theta&=\sum_{m=1}^{\infty}\sum_{x=1}^{\infty}\sum_{j=1}^{\rho_{m,x}}\theta_{m,x,j}\delta_{\psi_{m,x,j}}\\ \psi_{m,x,j}&\stackrel{{\scriptstyle iid}}{{\sim}}G\textrm{ iid across $m,x,j$}\\ \rho_{m,x}&\stackrel{{\scriptstyle indep}}{{\sim}}\mathrm{Poisson}\left(\rho\left|\int_{\theta}\nu(d\theta)h(0|\theta)^{m-1}h(x|\theta)\right.\right)\textrm{ across $m,x$}\\ \theta_{m,x,j}&\stackrel{{\scriptstyle indep}}{{\sim}}F_{size,m,x}(d\theta)\propto\nu(d\theta)h(0|\theta)^{m-1}h(x|\theta)\\ &\textrm{ iid across $j$ and independently across $m,x$}.\end{split} (27)
Proof.

By construction, Θ is an a.s. discrete random measure with a countable infinity of atoms. Without loss of generality, suppose that for every (non-zero) value of an atom weight θ, there is a non-zero probability of generating an atom with non-zero weight x in the likelihood. Now suppose we generate X_1, X_2, \ldots. Then, for every atom θδ_ψ of Θ, there almost surely exists some finite n such that X_n has an atom at ψ. Therefore, we can enumerate all of the atoms of Θ by enumerating

  • Each atom θδψ\theta\delta_{\psi} such that there is an atom in X1X_{1} at ψ\psi.

  • Each atom θδψ\theta\delta_{\psi} such that there is an atom in X2X_{2} at ψ\psi but there is not an atom in X1X_{1} at ψ\psi.
    \vdots

  • Each atom θδψ\theta\delta_{\psi} such that there is an atom in XmX_{m} at ψ\psi but there is not an atom in any of X1,X2,,Xm1X_{1},X_{2},\ldots,X_{m-1} at ψ\psi.
    \vdots

Moreover, on the mmth round of this enumeration, we can further break down the enumeration by the value of the observation XmX_{m} at the atom location:

  • Each atom θδψ\theta\delta_{\psi} such that there is an atom in XmX_{m} of weight 11 at ψ\psi but there is not an atom in any of X1,X2,,Xm1X_{1},X_{2},\ldots,X_{m-1} at ψ\psi.

  • Each atom θδψ\theta\delta_{\psi} such that there is an atom in XmX_{m} of weight 22 at ψ\psi but there is not an atom in any of X1,X2,,Xm1X_{1},X_{2},\ldots,X_{m-1} at ψ\psi.
    \vdots

  • Each atom θδψ\theta\delta_{\psi} such that there is an atom in XmX_{m} of weight xx at ψ\psi but there is not an atom in any of X1,X2,,Xm1X_{1},X_{2},\ldots,X_{m-1} at ψ\psi.
    \vdots

Recall that the values θk\theta_{k} that form the weights of Θ\Theta are generated according to a Poisson point process with rate measure ν(dθ)\nu(d\theta). So, on the first round, the values of θk\theta_{k} such that x1,k=xx_{1,k}=x also holds are generated according to a thinned Poisson point process with rate measure

ν(dθ)h(x|θ).\nu(d\theta)h(x|\theta).

In particular, since the rate measure has finite total mass by Assumption A2, we can define

M1,x:=θν(dθ)h(x|θ),M_{1,x}:=\int_{\theta}\nu(d\theta)h(x|\theta),

which will be finite. Then the number of atoms θk\theta_{k} for which x1,k=xx_{1,k}=x is

ρ1,xPoisson(ρ|M1,x).\rho_{1,x}\sim\mathrm{Poisson}(\rho|M_{1,x}).

And each such θk\theta_{k} has weight with distribution

Fsize,1,x(dθ)ν(dθ)h(x|θ).F_{size,1,x}(d\theta)\propto\nu(d\theta)h(x|\theta).

Finally, note from Theorem 3.1 that the posterior Θ|X1\Theta|X_{1} has weight rate measure

ν1(dθ):=ν(dθ)h(0|θ).\nu_{1}(d\theta):=\nu(d\theta)h(0|\theta).

Now take any m>1m>1. Suppose, inductively, that the ordinary component of the posterior Θ|X1,,Xm1\Theta|X_{1},\ldots,X_{m-1} has weight rate measure

νm1(dθ):=ν(dθ)h(0|θ)m1.\nu_{m-1}(d\theta):=\nu(d\theta)h(0|\theta)^{m-1}.

The atoms in this ordinary component have been selected precisely because they have not appeared in any of X1,,Xm1X_{1},\ldots,X_{m-1}. As for m=1m=1, we have that the atoms θk\theta_{k} in this ordinary component with corresponding weight in XmX_{m} equal to xx are formed by a thinned Poisson point process, with rate measure

νm1(dθ)h(x|θ)=ν(dθ)h(0|θ)m1h(x|θ).\nu_{m-1}(d\theta)h(x|\theta)=\nu(d\theta)h(0|\theta)^{m-1}h(x|\theta).

Since the rate measure has finite total mass by Assumption A2, we can define

Mm,x:=θν(dθ)h(0|θ)m1h(x|θ),M_{m,x}:=\int_{\theta}\nu(d\theta)h(0|\theta)^{m-1}h(x|\theta),

which will be finite. Then the number of these atoms θ_k for which x_{m,k}=x is

ρm,xPoisson(ρ|Mm,x).\rho_{m,x}\sim\mathrm{Poisson}(\rho|M_{m,x}).

And each such θ_k has weight with distribution

F_{size,m,x}(d\theta)\propto\nu(d\theta)h(0|\theta)^{m-1}h(x|\theta).

Finally, note from Theorem 3.1 that the posterior Θ|X1:m\Theta|X_{1:m}, which can be thought of as generated by prior Θ|X1:(m1)\Theta|X_{1:(m-1)} and likelihood Xm|ΘX_{m}|\Theta, has weight rate measure

ν(dθ)h(0|θ)m1h(0|θ)=νm(dθ),\nu(d\theta)h(0|\theta)^{m-1}h(0|\theta)=\nu_{m}(d\theta),

confirming the inductive hypothesis.

Recall that every atom of Θ\Theta is found in exactly one of these rounds and that x+x\in\mathbb{Z}_{+}. Also recall that the atom locations may be generated independently and identically across atoms, and independently from all the weights, according to proper distribution GG (Section 2.2). To summarize, we have then

Θ=m=1x=1j=1ρm,xθm,x,jδψm,x,j,\Theta=\sum_{m=1}^{\infty}\sum_{x=1}^{\infty}\sum_{j=1}^{\rho_{m,x}}\theta_{m,x,j}\delta_{\psi_{m,x,j}},

where

\psi_{m,x,j} \displaystyle\stackrel{{\scriptstyle iid}}{{\sim}}G\textrm{ iid across $m,x,j$}
Mm,x\displaystyle M_{m,x} =θν(dθ)h(0|θ)m1h(x|θ) across m,x\displaystyle=\int_{\theta}\nu(d\theta)h(0|\theta)^{m-1}h(x|\theta)\textrm{ across $m,x$}
ρm,x\displaystyle\rho_{m,x} indepPoisson(ρ|Mm,x) across m,x\displaystyle\stackrel{{\scriptstyle indep}}{{\sim}}\mathrm{Poisson}(\rho|M_{m,x})\textrm{ across $m,x$}
Fsize,m,x(dθ)\displaystyle F_{size,m,x}(d\theta) ν(dθ)h(0|θ)m1h(x|θ) across m,x\displaystyle\propto\nu(d\theta)h(0|\theta)^{m-1}h(x|\theta)\textrm{ across $m,x$}
θm,x,j\displaystyle\theta_{m,x,j} indepFsize,m,x(dθ) iid across j and independently across m,x,\displaystyle\stackrel{{\scriptstyle indep}}{{\sim}}F_{size,m,x}(d\theta)\textrm{ iid across $j$ and independently across $m,x$},

as was to be shown. ∎
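The construction in Eq. (27) translates directly into a simulation recipe once the rates M_{m,x} and the size-biased distributions F_{size,m,x} are available. The following Python sketch is ours: it truncates the sums over m and x at user-chosen bounds (an approximation), and the arguments M, sample_theta, and sample_psi are assumed to be supplied by the user for the particular CRM prior and likelihood at hand.

# Sketch: truncated simulation of the size-biased representation in Eq. (27).
# M(m, x)            -- the rate  integral of nu(dtheta) h(0|theta)^(m-1) h(x|theta)
# sample_theta(m, x) -- one draw from F_{size,m,x}
# sample_psi()       -- one draw from G
import numpy as np

def size_biased_atoms(M, sample_theta, sample_psi, max_round, max_x, rng):
    atoms = []                                       # list of (theta, psi) pairs
    for m in range(1, max_round + 1):
        for x in range(1, max_x + 1):
            count = rng.poisson(M(m, x))             # rho_{m,x}
            for _ in range(count):
                atoms.append((sample_theta(m, x), sample_psi()))
    return atoms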

The following corollary gives a more detailed recipe for the calculations in Theorem 5.1 when the prior is in a conjugate exponential CRM to the likelihood.

Corollary 5.2 (Exponential CRM size-biased representations).

Let Θ\Theta be an exponential CRM with no fixed-location atoms (thereby trivially satisfying Assumption A0) such that Assumption A1 holds.

Let XX be generated conditional on Θ\Theta according to an exponential CRM with fixed-location atoms at {ψk}k=1\{\psi_{k}\}_{k=1}^{\infty} and no ordinary component. Let the distribution of the weight xn,kx_{n,k} at ψk\psi_{k} have probability mass function

h(x|θk)=κ(x)exp{η(θk),ϕ(x)A(θk)}.h(x|\theta_{k})=\kappa(x)\exp\left\{\langle\eta(\theta_{k}),\phi(x)\rangle-A(\theta_{k})\right\}.

Suppose that Θ\Theta and XX jointly satisfy Assumption A2. And let Θ\Theta be conjugate to XX as in Theorem 4.2. Then we can write

Θ=m=1x=1j=1ρm,xθm,x,jδψm,x,jψm,x,jiidGiid across m,x,jMm,x=γκ(0)m1κ(x)exp{B(ξ+(m1)ϕ(0)+ϕ(x),λ+m)}ρm,xindepPoisson(ρ|Mm,x)independently across m,xθm,x,jindepfsize,m,x(θ)dθ=exp{ξ+(m1)ϕ(0)+ϕ(x),η(θ)+(λ+m)[A(θ)]B(ξ+(m1)ϕ(0)+ϕ(x),λ+m)}iid across j and independently across m,x.\displaystyle\begin{split}\Theta&=\sum_{m=1}^{\infty}\sum_{x=1}^{\infty}\sum_{j=1}^{\rho_{m,x}}\theta_{m,x,j}\delta_{\psi_{m,x,j}}\\ \psi_{m,x,j}&\stackrel{{\scriptstyle iid}}{{\sim}}G\quad\textrm{iid across $m,x,j$}\\ M_{m,x}&=\gamma\cdot\kappa(0)^{m-1}\kappa(x)\cdot\exp\left\{B(\xi+(m-1)\phi(0)+\phi(x),\lambda+m)\right\}\\ \rho_{m,x}&\stackrel{{\scriptstyle indep}}{{\sim}}\mathrm{Poisson}\left(\rho|M_{m,x}\right)\\ &\textrm{independently across $m,x$}\\ \theta_{m,x,j}&\stackrel{{\scriptstyle indep}}{{\sim}}f_{size,m,x}(\theta)\;d\theta\\ &=\exp\left\{\langle\xi+(m-1)\phi(0)+\phi(x),\eta(\theta)\rangle+(\lambda+m)[-A(\theta)]\right.\\ &\quad\left.{}-B(\xi+(m-1)\phi(0)+\phi(x),\lambda+m)\right\}\\ &\textrm{iid across $j$ and independently across $m,x$}.\end{split} (28)
Proof.

The corollary follows from Theorem 5.1 by plugging in the particular forms for ν(dθ)\nu(d\theta) and h(x|θ)h(x|\theta).

In particular,

Mm,x\displaystyle M_{m,x} =θ+ν(dθ)h(0|θ)m1h(x|θ)\displaystyle=\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)h(0|\theta)^{m-1}h(x|\theta)
=θ+γexp{ξ,η(θ)+λ[A(θ)]}\displaystyle=\int_{\theta\in\mathbb{R}_{+}}\gamma\exp\left\{\langle\xi,\eta(\theta)\rangle+\lambda\left[-A(\theta)\right]\right\}
[κ(0)exp{η(θ),ϕ(0)A(θ)}]m1\displaystyle\quad{}\cdot\left[\kappa(0)\exp\left\{\langle\eta(\theta),\phi(0)\rangle-A(\theta)\right\}\right]^{m-1}
κ(x)exp{η(θ),ϕ(x)A(θ)}dθ\displaystyle\quad{}\cdot\kappa(x)\exp\left\{\langle\eta(\theta),\phi(x)\rangle-A(\theta)\right\}\;d\theta
\displaystyle=\gamma\kappa(0)^{m-1}\kappa(x)\exp\left\{B\left(\xi+(m-1)\phi(0)+\phi(x),\lambda+m\right)\right\}.

The density f_{size,m,x} in Eq. (28) then follows by normalizing \nu(d\theta)h(0|\theta)^{m-1}h(x|\theta) by M_{m,x} and canceling the common factors \gamma, \kappa(0)^{m-1}, and \kappa(x). ∎

Corollary 5.2 can be used to find the known size-biased representation of the beta process (Thibaux and Jordan, 2007); we demonstrate this derivation in detail in Example B.1 in Appendix B. Here we use Corollary 5.2 to discover a new size-biased representation of the gamma process.

Example 5.3.

Let Θ\Theta be a gamma process, and let XnX_{n} be iid Poisson likelihood processes conditioned on Θ\Theta for each nn as in Example 4.3. That is, we have

ν(dθ)=γθξeλθdθ.\nu(d\theta)=\gamma\theta^{\xi}e^{-\lambda\theta}\;d\theta.

And

h(x|θk)=1x!θkxeθkh(x|\theta_{k})=\frac{1}{x!}\theta_{k}^{x}e^{-\theta_{k}}

with

γ>0,ξ(2,1],λ>0;ξfix,k>1 and λfix,k>0for all k[Kprior,fix]\displaystyle\gamma>0,\quad\xi\in(-2,-1],\quad\lambda>0;\quad\xi_{fix,k}>-1\textrm{ and }\lambda_{fix,k}>0\quad\textrm{for all $k\in[K_{prior,fix}]$}

by Example 4.3.

We can pick out the following components of hh:

κ(x)=1x!,ϕ(x)=x,η(θ)=log(θ),A(θ)=θ.\displaystyle\kappa(x)=\frac{1}{x!},\quad\phi(x)=x,\quad\eta(\theta)=\log(\theta),\quad A(\theta)=\theta.

Thus, by Corollary 5.2, we have

fsize,m,x(θ)θξ+xe(λ+m)θGamma(θ|ξ+x+1,λ+m).\displaystyle f_{size,m,x}(\theta)\propto\theta^{\xi+x}e^{-(\lambda+m)\theta}\propto\mathrm{Gamma}\left(\theta\left|\xi+x+1,\lambda+m\right.\right).

We summarize the representation that follows from Corollary 5.2 in the following result.

Corollary 5.4.

Let the gamma process be a CRM Θ\Theta with fixed-location atom weight distributions as in Eq. (19) and ordinary component weight measure as in Eq. (20). Then we may write

Θ\displaystyle\Theta =m=1x=1j=1ρm,xθm,x,jδψm,x,j\displaystyle=\sum_{m=1}^{\infty}\sum_{x=1}^{\infty}\sum_{j=1}^{\rho_{m,x}}\theta_{m,x,j}\delta_{\psi_{m,x,j}}
ψm,x,j\displaystyle\psi_{m,x,j} iidG iid across m,x,j\displaystyle\stackrel{{\scriptstyle iid}}{{\sim}}G\quad\textrm{ iid across $m,x,j$}
Mm,x\displaystyle M_{m,x} =γ1x!Γ(ξ+x+1)(λ+m)(ξ+x+1) across m,x\displaystyle=\gamma\cdot\frac{1}{x!}\cdot\Gamma(\xi+x+1)\cdot(\lambda+m)^{-(\xi+x+1)}\textrm{ across $m,x$}
ρm,x\displaystyle\rho_{m,x} indepPoisson(ρ|Mm,x) across m,x\displaystyle\stackrel{{\scriptstyle indep}}{{\sim}}\mathrm{Poisson}\left(\rho|M_{m,x}\right)\textrm{ across $m,x$}
θm,x,j\displaystyle\theta_{m,x,j} indepGamma(θ|ξ+x+1,λ+m)\displaystyle\stackrel{{\scriptstyle indep}}{{\sim}}\mathrm{Gamma}\left(\theta\left|\xi+x+1,\lambda+m\right.\right)
iid across j and independently across m,x.\displaystyle\textrm{ iid across $j$ and independently across $m,x$}.

\blacksquare
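Corollary 5.4 can be turned into a simple truncated simulator for the gamma process. The Python sketch below is ours; it takes G to be Uniform(0,1) purely for illustration and truncates the sums over m and x at max_round and max_x.

# Sketch: truncated size-biased simulation of the gamma process (Corollary 5.4).
import math
import numpy as np

def gamma_process_size_biased(gamma_, xi, lam, max_round, max_x, rng):
    atoms = []                                       # list of (theta, psi) pairs
    for m in range(1, max_round + 1):
        for x in range(1, max_x + 1):
            # M_{m,x} = gamma / x! * Gamma(xi + x + 1) * (lam + m)^{-(xi + x + 1)}
            M = gamma_ / math.factorial(x) * math.gamma(xi + x + 1) * (lam + m) ** (-(xi + x + 1))
            for _ in range(rng.poisson(M)):          # rho_{m,x} new atoms
                theta = rng.gamma(shape=xi + x + 1, scale=1.0 / (lam + m))
                atoms.append((theta, rng.uniform()))
    return atoms

rng = np.random.default_rng(0)
atoms = gamma_process_size_biased(gamma_=3.0, xi=-1.0, lam=1.0, max_round=20, max_x=10, rng=rng)
print(len(atoms), sum(t for t, _ in atoms))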

6 Marginal processes

In Section 5, although we conceptually made use of the observations {X1,X2,}\{X_{1},X_{2},\ldots\}, we focused on a representation of the prior Θ\Theta: cf. Eqs. (27) and (28). In this section, we provide a representation of the marginal of X1:NX_{1:N}, with Θ\Theta integrated out.

The canonical example of a marginal process again comes from the Dirichlet process (DP). In this case, the full model consists of the DP-distributed prior on ΘDP\Theta_{DP} (as in Eq. (24)) together with the likelihood for Xmult,nX_{mult,n} conditional on ΘDP\Theta_{DP} (iid across nn) described by Eq. (25). Then the marginal distribution of Xmult,1:NX_{mult,1:N} is described by the Chinese restaurant process. This marginal takes the following form.

For each n=1,2,,Nn=1,2,\ldots,N,

  1. 1.

Let \{\psi_{k}\}_{k=1}^{K_{n-1}} be the union of atom locations in X_{mult,1},\ldots,X_{mult,n-1}. Then X_{mult,n}|X_{mult,1},\ldots,X_{mult,n-1} has a single atom at ψ (a simulation sketch in code follows this construction), where

\displaystyle\psi=\left\{\begin{array}[]{ll}\psi_{k}&\textrm{ with probability $\propto$ }\sum_{m=1}^{n-1}X_{mult,m}(\{\psi_{k}\})\\ \psi_{new}&\textrm{ with probability $\propto$ }c\end{array}\right.
    ψnew\displaystyle\psi_{new} G\displaystyle\sim G
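The following Python sketch (ours, with G taken to be Uniform(0,1) purely for illustration) simulates the Chinese restaurant process construction above: an existing atom is chosen with weight equal to the number of previous draws at that atom, and a new atom is chosen with weight c.

# Sketch: simulating the Chinese restaurant process marginal.
import numpy as np

def chinese_restaurant_process(c, N, rng):
    locations, counts = [], []              # unique atom locations and their draw counts
    draws = []
    for _ in range(N):
        weights = np.array(counts + [c], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(locations):             # new atom: psi_new ~ G
            locations.append(rng.uniform())
            counts.append(1)
        else:
            counts[k] += 1
        draws.append(locations[k])
    return draws

rng = np.random.default_rng(0)
print(chinese_restaurant_process(c=1.0, N=10, rng=rng))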

In the case of CRMs, the canonical example of a marginal process is the Indian buffet process (Griffiths and Ghahramani, 2006). Both the Chinese restaurant process and Indian buffet process have proven popular for inference since the underlying infinite-dimensional prior is integrated out in these processes and only the finite-dimensional marginal remains. By Assumption A2, we know that the marginal will generally be finite-dimensional for our CRM Bayesian models. And thus we have the following general marginal representations for such models.

Theorem 6.1 (Marginal representations).

Let Θ\Theta be a completely random measure that satisfies Assumptions A0 and A1; that is, Θ\Theta is a CRM with KfixK_{fix} fixed atoms such that Kfix<K_{fix}<\infty and such that the kkth atom can be written θfix,kδψfix,k\theta_{fix,k}\delta_{\psi_{fix,k}}. The ordinary component of Θ\Theta has rate measure

μ(dθ×dψ)=ν(dθ)G(dψ),\mu(d\theta\times d\psi)=\nu(d\theta)\cdot G(d\psi),

where GG is a proper distribution and ν(+)=\nu(\mathbb{R}_{+})=\infty. Write Θ=k=1θkδψk\Theta=\sum_{k=1}^{\infty}\theta_{k}\delta_{\psi_{k}}, and let XnX_{n} be generated iid given Θ\Theta according to Xn=k=1xn,kδψkX_{n}=\sum_{k=1}^{\infty}x_{n,k}\delta_{\psi_{k}} with xn,kindeph(x|θk)x_{n,k}\stackrel{{\scriptstyle indep}}{{\sim}}h(x|\theta_{k}) for proper, discrete probability mass function hh. And suppose XnX_{n} and Θ\Theta jointly satisfy Assumption A2 so that

x=1θ+ν(dθ)h(x|θ)<.\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)h(x|\theta)<\infty.

Then the marginal distribution of X1:NX_{1:N} is the same as that provided by the following construction.

For each n=1,2,,Nn=1,2,\ldots,N,

  1. 1.

    Let {ψk}k=1Kn1\{\psi_{k}\}_{k=1}^{K_{n-1}} be the union of atom locations in X1,,Xn1X_{1},\ldots,X_{n-1}. Let xm,k:=Xm({ψk})x_{m,k}:=X_{m}(\{\psi_{k}\}). Let xn,kx_{n,k} denote the weight of Xn|X1,,Xn1X_{n}|X_{1},\ldots,X_{n-1} at ψk\psi_{k}. Then xn,kx_{n,k} has distribution described by the following probability mass function:

    hcond(xn,k=x|x1:(n1),k)=θ+ν(dθ)h(x|θ)m=1n1h(xm,k|θ)θ+ν(dθ)m=1n1h(xm,k|θ).h_{cond}\left(x_{n,k}=x\left|x_{1:(n-1),k}\right.\right)=\frac{\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)h(x|\theta)\prod_{m=1}^{n-1}h(x_{m,k}|\theta)}{\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\prod_{m=1}^{n-1}h(x_{m,k}|\theta)}.
  2. 2.

    For each x=1,2,x=1,2,\ldots

    • XnX_{n} has ρn,x\rho_{n,x} new atoms. That is, XnX_{n} has atoms at locations {ψn,x,j}j=1ρn,x\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}}, where

      {ψn,x,j}j=1ρn,x{ψk}k=1Kn1= a.s.\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}}\cap\{\psi_{k}\}_{k=1}^{K_{n-1}}=\emptyset\quad\textrm{ a.s.}

      Moreover,

      ρn,x\displaystyle\rho_{n,x} indepPoisson(ρ|θν(dθ)h(0|θ)n1h(x|θ)) across n,x\displaystyle\stackrel{{\scriptstyle indep}}{{\sim}}\mathrm{Poisson}\left(\rho\left|\int_{\theta}\nu(d\theta)h(0|\theta)^{n-1}h(x|\theta)\right.\right)\textrm{ across $n,x$}
      ψn,x,j\displaystyle\psi_{n,x,j} iidG(dψ) across n,x,j.\displaystyle\stackrel{{\scriptstyle iid}}{{\sim}}G(d\psi)\textrm{ across $n,x,j$}.
Proof.

We saw in the proof of Theorem 5.1 that the marginal for X1X_{1} can be expressed as follows. For each x+x\in\mathbb{Z}_{+}, there are ρ1,x\rho_{1,x} atoms of X1X_{1} with weight xx, where

ρ1,x\displaystyle\rho_{1,x} indepPoisson(θν(dθ)h(x|θ)) across x.\displaystyle\stackrel{{\scriptstyle indep}}{{\sim}}\mathrm{Poisson}\left(\int_{\theta}\nu(d\theta)h(x|\theta)\right)\textrm{ across $x$}.

These atoms have locations {ψ1,x,j}j=1ρ1,x\{\psi_{1,x,j}\}_{j=1}^{\rho_{1,x}}, where

ψ1,x,j\displaystyle\psi_{1,x,j} iidG(dψ) across x,j.\displaystyle\stackrel{{\scriptstyle iid}}{{\sim}}G(d\psi)\textrm{ across $x,j$}.

For the upcoming induction, let K1:=x=1ρ1,xK_{1}:=\sum_{x=1}^{\infty}\rho_{1,x}. And let {ψk}k=1K1\{\psi_{k}\}_{k=1}^{K_{1}} be the (a.s. disjoint by assumption) union of the sets {ψ1,x,j}j=1ρ1,x\{\psi_{1,x,j}\}_{j=1}^{\rho_{1,x}} across xx. Note that K1K_{1} is finite by Assumption A2.

We will also find it useful in the upcoming induction to let Θpost,1\Theta_{post,1} have the distribution of Θ|X1\Theta|X_{1}. Let θpost,1,x,j=Θpost,1({ψ1,x,j})\theta_{post,1,x,j}=\Theta_{post,1}(\{\psi_{1,x,j}\}). By Theorem 3.1 or the proof of Theorem 5.1, we have that

θpost,1,x,j\displaystyle\theta_{post,1,x,j} indepFpost,1,x,j(dθ)ν(dθ)h(x|θ)\displaystyle\stackrel{{\scriptstyle indep}}{{\sim}}F_{post,1,x,j}(d\theta)\propto\nu(d\theta)h(x|\theta)
 independently across x and iid across j.\displaystyle\quad\textrm{ independently across $x$ and iid across $j$}.

Now take any n>1n>1. Inductively, we assume {ψn1,k}k=1Kn1\{\psi_{n-1,k}\}_{k=1}^{K_{n-1}} is the union of all the atom locations of X1,,Xn1X_{1},\ldots,X_{n-1}. Further assume Kn1K_{n-1} is finite. Let Θpost,n1\Theta_{post,n-1} have the distribution of Θ|X1,,Xn1\Theta|X_{1},\ldots,X_{n-1}. Let θn1,k\theta_{n-1,k} be the weight of Θpost,n1\Theta_{post,n-1} at ψn1,k\psi_{n-1,k}. And, for any m[n1]m\in[n-1], let xm,kx_{m,k} be the weight of XmX_{m} at ψn1,k\psi_{n-1,k}. We inductively assume that

θn1,kindepFn1,k(dθ)ν(dθ)m=1n1h(xm,k|θ)independently across k.\displaystyle\begin{split}\theta_{n-1,k}&\stackrel{{\scriptstyle indep}}{{\sim}}F_{n-1,k}(d\theta)\propto\nu(d\theta)\prod_{m=1}^{n-1}h(x_{m,k}|\theta)\\ &\textrm{independently across $k$}.\end{split} (29)

Now let ψn,k\psi_{n,k} equal ψn1,k\psi_{n-1,k} for k[Kn1]k\in[K_{n-1}]. Let xn,kx_{n,k} denote the weight of XnX_{n} at ψn,k\psi_{n,k} for k[Kn1]k\in[K_{n-1}]. Conditional on the atom weight of Θ\Theta at ψn,k\psi_{n,k}, the atom weights of X1,,Xn1,XnX_{1},\ldots,X_{n-1},X_{n} are independent. Since the atom weights of Θ\Theta are independent as well, we have that xn,k|X1,,Xn1x_{n,k}|X_{1},\ldots,X_{n-1} has the same distribution as xn,k|x1,k,,xn1,kx_{n,k}|x_{1,k},\ldots,x_{n-1,k}. We can write the probability mass function of this distribution as follows.

hcond(xn,k=x|x1,k,,xn1,k)\displaystyle h_{cond}\left(x_{n,k}=x\left|x_{1,k},\ldots,x_{n-1,k}\right.\right)
=θ+Fn1,k(dθ)h(x|θ)\displaystyle=\int_{\theta\in\mathbb{R}_{+}}F_{n-1,k}(d\theta)h(x|\theta)
=θ+[ν(dθ)m=1n1h(xm,k|θ)]h(x|θ)θ+ν(dθ)m=1n1h(xm,k|θ),\displaystyle=\frac{\int_{\theta\in\mathbb{R}_{+}}\left[\nu(d\theta)\prod_{m=1}^{n-1}h(x_{m,k}|\theta)\right]\cdot h(x|\theta)}{\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\prod_{m=1}^{n-1}h(x_{m,k}|\theta)},

where the last line follows from Eq. (29).

We next show that the inductive hypothesis in Eq. (29) holds for n and k∈[K_{n-1}]. Let θ_{n,k} be the weight of Θ_{post,n} at ψ_{n,k} for k∈[K_{n-1}]. Let F_{n,k}(dθ) denote the distribution of θ_{n,k} and note that

Fn,k(dθ)\displaystyle F_{n,k}(d\theta) Fn1,k(dθ)h(xn,k|θ)\displaystyle\propto F_{n-1,k}(d\theta)\cdot h(x_{n,k}|\theta)
=ν(dθ)m=1nh(xm,k|θ),\displaystyle=\nu(d\theta)\prod_{m=1}^{n}h(x_{m,k}|\theta),

which agrees with Eq. (29) for nn when we assume the result for n1n-1.

The previous development covers atoms that are present in at least one of X1,,Xn1X_{1},\ldots,X_{n-1}. Next we consider new atoms in XnX_{n}; that is, we consider atoms in XnX_{n} for which there are no atoms at the same location in any of X1,,Xn1X_{1},\ldots,X_{n-1}.

We saw in the proof of Theorem 5.1 that, for each x+x\in\mathbb{Z}_{+}, there are ρn,x\rho_{n,x} new atoms of XnX_{n} with weight xx such that

ρn,x\displaystyle\rho_{n,x} indepPoisson(ρ|θν(dθ)h(0|θ)n1h(x|θ)) across x.\displaystyle\stackrel{{\scriptstyle indep}}{{\sim}}\mathrm{Poisson}\left(\rho\left|\int_{\theta}\nu(d\theta)h(0|\theta)^{n-1}h(x|\theta)\right.\right)\textrm{ across $x$}.

These new atoms have locations {ψn,x,j}j=1ρn,x\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}} with

ψn,x,j\displaystyle\psi_{n,x,j} iidG(dψ) across x,j.\displaystyle\stackrel{{\scriptstyle iid}}{{\sim}}G(d\psi)\textrm{ across $x,j$}.

By Assumption A2, x=1ρn,x<\sum_{x=1}^{\infty}\rho_{n,x}<\infty. So

Kn:=Kn1+x=1ρn,xK_{n}:=K_{n-1}+\sum_{x=1}^{\infty}\rho_{n,x}

remains finite. Let ψ_{n,k} for k∈\{K_{n-1}+1,\ldots,K_{n}\} index these new locations. Let θ_{n,k} be the weight of Θ_{post,n} at ψ_{n,k} for k∈\{K_{n-1}+1,\ldots,K_{n}\}. And let x_{n,k} be the value of X_n at ψ_{n,k}.

We check that the inductive hypothesis holds. By repeated application of Theorem 3.1, the ordinary component of Θ|X1,,Xn1\Theta|X_{1},\ldots,X_{n-1} has rate measure

ν(dθ)h(0|θ)n1.\nu(d\theta)h(0|\theta)^{n-1}.

So, again by Theorem 3.1, we have that

\theta_{n,k} \displaystyle\stackrel{{\scriptstyle indep}}{{\sim}}F_{n,k}(d\theta)\propto\nu(d\theta)h(0|\theta)^{n-1}h(x_{n,k}|\theta).

Since XmX_{m} has value 0 at ψn,k\psi_{n,k} for m{1,,n1}m\in\{1,\ldots,n-1\} by construction, we have that the inductive hypothesis holds. ∎
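As with the size-biased recipe, the construction of Theorem 6.1 can be written as a generic simulator once h_cond and the new-atom rates are available. The sketch below is ours; sample_old, M, and sample_psi are assumed to be supplied for the particular model, and the sum over new-atom weights x is truncated at max_x (an approximation).

# Sketch: generic simulation of the marginal construction in Theorem 6.1.
# sample_old(history) -- draw x_{n,k} from h_cond given previous weights x_{1:(n-1),k}
# M(n, x)             -- Poisson rate for new atoms of weight x at round n
# sample_psi()        -- one draw from G
import numpy as np

def marginal_process(sample_old, M, sample_psi, N, max_x, rng):
    atoms = {}                                    # psi -> list of past weights
    observations = []
    for n in range(1, N + 1):
        X_n = {}
        for psi, history in atoms.items():        # step 1: previously seen atoms
            X_n[psi] = sample_old(history)
        for x in range(1, max_x + 1):             # step 2: new atoms of weight x
            for _ in range(rng.poisson(M(n, x))):
                psi = sample_psi()
                atoms[psi] = [0] * (n - 1)        # this atom had weight 0 in earlier rounds
                X_n[psi] = x
        for psi in atoms:                         # record this round's weight at every atom
            atoms[psi].append(X_n.get(psi, 0))
        observations.append(X_n)
    return observations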

As in the case of size-biased representations (Section 5 and Corollary 5.2), we can find a more detailed recipe when the prior is in a conjugate exponential CRM to the likelihood.

Corollary 6.2 (Exponential CRM marginal representations).

Let Θ\Theta be an exponential CRM with no fixed-location atoms (thereby trivially satisfying Assumption A0) such that Assumption A1 holds.

Let XX be generated conditional on Θ\Theta according to an exponential CRM with fixed-location atoms at {ψk}k=1\{\psi_{k}\}_{k=1}^{\infty} and no ordinary component. Let the distribution of the weight xn,kx_{n,k} at ψk\psi_{k} have probability mass function

h(x|θk)=κ(x)exp{η(θk),ϕ(x)A(θk)}.h(x|\theta_{k})=\kappa(x)\exp\left\{\langle\eta(\theta_{k}),\phi(x)\rangle-A(\theta_{k})\right\}.

Suppose that Θ\Theta and XX jointly satisfy Assumption A2. And let Θ\Theta be conjugate to XX as in Theorem 4.2. Then the marginal distribution of X1:NX_{1:N} is the same as that provided by the following construction.

For each n=1,2,,Nn=1,2,\ldots,N,

  1. 1.

    Let {ψk}k=1Kn1\{\psi_{k}\}_{k=1}^{K_{n-1}} be the union of atom locations in X1,,Xn1X_{1},\ldots,X_{n-1}. Let xm,k:=Xm({ψk})x_{m,k}:=X_{m}(\{\psi_{k}\}). Let xn,kx_{n,k} denote the weight of Xn|X1,,Xn1X_{n}|X_{1},\ldots,X_{n-1} at ψk\psi_{k}. Then xn,kx_{n,k} has distribution described by the following probability mass function:

h_{cond}\left(x_{n,k}=x\left|x_{1:(n-1),k}\right.\right)
\displaystyle=\kappa(x)\exp\left\{-B\left(\xi+\sum_{m=1}^{n-1}\phi(x_{m,k}),\lambda+n-1\right)+B\left(\xi+\sum_{m=1}^{n-1}\phi(x_{m,k})+\phi(x),\lambda+n\right)\right\}.
  2. 2.

    For each x=1,2,x=1,2,\ldots

    • XnX_{n} has ρn,x\rho_{n,x} new atoms. That is, XnX_{n} has atoms at locations {ψn,x,j}j=1ρn,x\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}}, where

      {ψn,x,j}j=1ρn,x{ψk}k=1Kn1= a.s.\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}}\cap\{\psi_{k}\}_{k=1}^{K_{n-1}}=\emptyset\quad\textrm{ a.s.}

      Moreover,

      Mn,x\displaystyle M_{n,x} :=γκ(0)n1κ(x)exp{B(ξ+(n1)ϕ(0)+ϕ(x),λ+n)}\displaystyle:=\gamma\cdot\kappa(0)^{n-1}\kappa(x)\cdot\exp\left\{B(\xi+(n-1)\phi(0)+\phi(x),\lambda+n)\right\}
      across n,xn,x
      ρn,x\displaystyle\rho_{n,x} indepPoisson(ρ|Mn,x) across n,x\displaystyle\stackrel{{\scriptstyle indep}}{{\sim}}\mathrm{Poisson}\left(\rho\left|M_{n,x}\right.\right)\textrm{ across $n,x$}
      ψn,x,j\displaystyle\psi_{n,x,j} iidG(dψ) across n,x,j.\displaystyle\stackrel{{\scriptstyle iid}}{{\sim}}G(d\psi)\textrm{ across $n,x,j$}.
Proof.

The corollary follows from Theorem 6.1 by plugging in the forms for ν(dθ)\nu(d\theta) and h(x|θ)h(x|\theta).

In particular,

\displaystyle\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\prod_{m=1}^{n}h(x_{m,k}|\theta)
\displaystyle=\int_{\theta\in\mathbb{R}_{+}}\gamma\exp\left\{\langle\xi,\eta(\theta)\rangle+\lambda\left[-A(\theta)\right]\right\}\cdot\left[\prod_{m=1}^{n}\kappa(x_{m,k})\exp\left\{\langle\eta(\theta),\phi(x_{m,k})\rangle-A(\theta)\right\}\right]\;d\theta
\displaystyle=\gamma\left[\prod_{m=1}^{n}\kappa(x_{m,k})\right]\exp\left\{B\left(\xi+\sum_{m=1}^{n}\phi(x_{m,k}),\lambda+n\right)\right\}.

So

hcond(xn,k=x|x1:(n1),k)\displaystyle h_{cond}\left(x_{n,k}=x\left|x_{1:(n-1),k}\right.\right)
=θ+ν(dθ)h(x|θ)m=1n1h(xm,k|θ)θ+ν(dθ)m=1n1h(xm,k|θ)\displaystyle=\frac{\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)h(x|\theta)\prod_{m=1}^{n-1}h(x_{m,k}|\theta)}{\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\prod_{m=1}^{n-1}h(x_{m,k}|\theta)}
\displaystyle=\kappa(x)\exp\left\{-B\left(\xi+\sum_{m=1}^{n-1}\phi(x_{m,k}),\lambda+n-1\right)+B\left(\xi+\sum_{m=1}^{n-1}\phi(x_{m,k})+\phi(x),\lambda+n\right)\right\}.

The expression for M_{n,x} is exactly M_{m,x} from the proof of Corollary 5.2 with m replaced by n. ∎

In Example C.1 in Appendix C we show that Corollary 6.2 can be used to recover the Indian buffet process marginal from a beta process prior together with a Bernoulli process likelihood. In the following example, we discover a new marginal for the Poisson likelihood process with gamma process prior.

Example 6.3.

Let Θ\Theta be a gamma process, and let XnX_{n} be iid Poisson likelihood processes conditioned on Θ\Theta for each nn as in Example 4.3. That is, we have

ν(dθ)=γθξeλθdθ and h(x|θk)=1x!θkxeθk\nu(d\theta)=\gamma\theta^{\xi}e^{-\lambda\theta}\;d\theta\quad\textrm{ and }\quad h(x|\theta_{k})=\frac{1}{x!}\theta_{k}^{x}e^{-\theta_{k}}

with

γ>0,ξ(2,1],λ>0;ξfix,k>1 and λfix,k>0for all k[Kprior,fix]\displaystyle\gamma>0,\quad\xi\in(-2,-1],\quad\lambda>0;\quad\xi_{fix,k}>-1\textrm{ and }\lambda_{fix,k}>0\quad\textrm{for all $k\in[K_{prior,fix}]$}

by Example 4.3.

We can pick out the following components of hh:

κ(x)=1x!,ϕ(x)=x,η(θ)=log(θ),A(θ)=θ.\displaystyle\kappa(x)=\frac{1}{x!},\quad\phi(x)=x,\quad\eta(\theta)=\log(\theta),\quad A(\theta)=\theta.

And we calculate

\displaystyle\exp\left\{B(\xi,\lambda)\right\}=\int_{\theta\in\mathbb{R}_{+}}\exp\left\{\langle\xi,\eta(\theta)\rangle+\lambda[-A(\theta)]\right\}\;d\theta=\int_{\theta\in\mathbb{R}_{+}}\theta^{\xi}e^{-\lambda\theta}\;d\theta=\Gamma(\xi+1)\lambda^{-(\xi+1)}.

So, for x\in\mathbb{Z}_{*}, we have

(xn=x)\displaystyle\mathbb{P}(x_{n}=x) =κ(x)exp{B(ξ+m=1n1xm,λ+n1)+B(ξ+m=1n1xm+x,λ+n)}\displaystyle=\kappa(x)\exp\left\{-B(\xi+\sum_{m=1}^{n-1}x_{m},\lambda+n-1)+B(\xi+\sum_{m=1}^{n-1}x_{m}+x,\lambda+n)\right\}
=1x!(λ+n1)ξ+m=1n1xm+1Γ(ξ+m=1n1xm+1)Γ(ξ+m=1n1xm+x+1)(λ+n)ξ+m=1n1xm+x+1\displaystyle=\frac{1}{x!}\cdot\frac{(\lambda+n-1)^{\xi+\sum_{m=1}^{n-1}x_{m}+1}}{\Gamma(\xi+\sum_{m=1}^{n-1}x_{m}+1)}\cdot\frac{\Gamma(\xi+\sum_{m=1}^{n-1}x_{m}+x+1)}{(\lambda+n)^{\xi+\sum_{m=1}^{n-1}x_{m}+x+1}}
\displaystyle=\frac{\Gamma(\xi+\sum_{m=1}^{n-1}x_{m}+x+1)}{\Gamma(x+1)\Gamma(\xi+\sum_{m=1}^{n-1}x_{m}+1)}\cdot\left(\frac{\lambda+n-1}{\lambda+n}\right)^{\xi+\sum_{m=1}^{n-1}x_{m}+1}\left(\frac{1}{\lambda+n}\right)^{x}
=NegBin(x|ξ+m=1n1xm+1,(λ+n)1).\displaystyle=\mathrm{NegBin}\left(x\left|\xi+\sum_{m=1}^{n-1}x_{m}+1,(\lambda+n)^{-1}\right.\right).

And

Mn,x\displaystyle M_{n,x} :=γκ(0)n1κ(x)exp{B(ξ+(n1)ϕ(0)+ϕ(x),λ+n)}\displaystyle:=\gamma\cdot\kappa(0)^{n-1}\kappa(x)\cdot\exp\left\{B(\xi+(n-1)\phi(0)+\phi(x),\lambda+n)\right\}
=γ1x!Γ(ξ+x+1)(λ+n)(ξ+x+1).\displaystyle=\gamma\cdot\frac{1}{x!}\cdot\Gamma(\xi+x+1)(\lambda+n)^{-(\xi+x+1)}.

We summarize the marginal distribution representation of X1:NX_{1:N} that follows from Corollary 6.2 in the following result.

Corollary 6.4.

Let Θ\Theta be a gamma process with fixed-location atom weight distributions as in Eq. (19) and ordinary component weight measure as in Eq. (20). Let XnX_{n} be drawn, iid across nn, conditional on Θ\Theta according to a Poisson likelihood process with fixed-location atom weight distributions as in Eq. (18). Then X1:NX_{1:N} has the same distribution as the following construction.

For each n=1,2,,Nn=1,2,\ldots,N,

  1. 1.

    Let {ψk}k=1Kn1\{\psi_{k}\}_{k=1}^{K_{n-1}} be the union of atom locations in X1,,Xn1X_{1},\ldots,X_{n-1}. Let xm,k:=Xm({ψk})x_{m,k}:=X_{m}(\{\psi_{k}\}). Let xn,kx_{n,k} denote the weight of Xn|X1,,Xn1X_{n}|X_{1},\ldots,X_{n-1} at ψk\psi_{k}. Then xn,kx_{n,k} has distribution described by the following probability mass function:

    hcond(xn,k=x|x1:(n1),k)=NegBin(x|ξ+m=1n1xm,k+1,(λ+n)1).\displaystyle h_{cond}\left(x_{n,k}=x\left|x_{1:(n-1),k}\right.\right)=\mathrm{NegBin}\left(x\left|\xi+\sum_{m=1}^{n-1}x_{m,k}+1,(\lambda+n)^{-1}\right.\right).
  2. 2.

    For each x=1,2,x=1,2,\ldots

    • XnX_{n} has ρn,x\rho_{n,x} new atoms. That is, XnX_{n} has atoms at locations {ψn,x,j}j=1ρn,x\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}}, where

      {ψn,x,j}j=1ρn,x{ψk}k=1Kn1= a.s.\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}}\cap\{\psi_{k}\}_{k=1}^{K_{n-1}}=\emptyset\quad\textrm{ a.s.}

      Moreover,

      Mn,x\displaystyle M_{n,x} :=γ1x!Γ(ξ+x+1)(λ+n)ξ+x+1\displaystyle:=\gamma\cdot\frac{1}{x!}\cdot\frac{\Gamma(\xi+x+1)}{(\lambda+n)^{\xi+x+1}}
      across n,xn,x
      ρn,x\displaystyle\rho_{n,x} indepPoisson(ρ|Mn,x) independently across n,x\displaystyle\stackrel{{\scriptstyle indep}}{{\sim}}\mathrm{Poisson}\left(\rho\left|M_{n,x}\right.\right)\textrm{ independently across $n,x$}
      ψn,x,j\displaystyle\psi_{n,x,j} iidG(dψ) independently across n,x and iid across j.\displaystyle\stackrel{{\scriptstyle iid}}{{\sim}}G(d\psi)\textrm{ independently across $n,x$ and iid across $j$}.

\blacksquare
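Corollary 6.4 gives everything needed to simulate X_{1:N} directly. The following Python sketch is ours; it takes G to be Uniform(0,1) purely for illustration, truncates the new-atom weights at max_x, and uses the fact that NumPy's negative_binomial(n, p) counts failures before n successes, so the NegBin(x | r, (λ+n)^{-1}) above corresponds to n = r and p = (λ+n-1)/(λ+n).

# Sketch: simulating the gamma-Poisson marginal of Corollary 6.4.
import math
import numpy as np

def gamma_poisson_marginal(gamma_, xi, lam, N, max_x, rng):
    atoms = {}                                    # psi -> list of past weights
    observations = []
    for n in range(1, N + 1):
        X_n = {}
        for psi, history in atoms.items():        # step 1: previously seen atoms
            r = xi + sum(history) + 1
            X_n[psi] = rng.negative_binomial(r, (lam + n - 1) / (lam + n))
        for x in range(1, max_x + 1):             # step 2: new atoms of weight x
            M = gamma_ / math.factorial(x) * math.gamma(xi + x + 1) / (lam + n) ** (xi + x + 1)
            for _ in range(rng.poisson(M)):
                psi = rng.uniform()
                atoms[psi] = [0] * (n - 1)
                X_n[psi] = x
        for psi in atoms:
            atoms[psi].append(X_n.get(psi, 0))
        observations.append(X_n)
    return observations

rng = np.random.default_rng(0)
obs = gamma_poisson_marginal(gamma_=2.0, xi=-1.0, lam=1.0, N=5, max_x=20, rng=rng)
print([sum(X_n.values()) for X_n in obs])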

7 Discussion

In the preceding sections, we have shown how to calculate posteriors for general CRM-based priors and likelihoods for Bayesian nonparametric models. We have also shown how to represent Bayesian nonparametric priors as a sequence of finite draws, and full Bayesian nonparametric models via finite marginals. We have introduced a notion of exponential families for CRMs, which we call exponential CRMs, that has allowed us to specify automatic Bayesian nonparametric conjugate priors for exponential CRM likelihoods. And we have demonstrated that our exponential CRMs allow particularly straightforward recipes for size-biased and marginal representations of Bayesian nonparametric models. Along the way, we have proved that the gamma process is a conjugate prior for the Poisson likelihood process and the beta prime process is a conjugate prior for the odds Bernoulli process. We have discovered a size-biased representation of the gamma process and a marginal representation of the gamma process coupled with a Poisson likelihood process.

All of this work has relied heavily on the description of Bayesian nonparametric models in terms of completely random measures. As such, we have worked very particularly with pairings of real values—the CRM atom weights, which we have interpreted as trait frequencies or rates—together with trait descriptors—the CRM atom locations. However, all of our proofs broke into essentially two parts: the fixed-location atom part and the ordinary component part. The fixed-location atom development essentially translated into the usual finite version of Bayes Theorem and could easily be extended to full Bayesian models where the prior describes a random element that need not be real-valued. Moreover, the ordinary component development relied entirely on its generation as a Poisson point process over a product space. It seems reasonable to expect that our development might carry through when the first element in this tuple need not be real-valued. And thus we believe our results are suggestive of broader results over more general spaces.

Acknowledgements

Support for this project was provided by ONR under the Multidisciplinary University Research Initiative (MURI) program (N00014-11-1-0688). T. Broderick was supported by a Berkeley Fellowship. A. C. Wilson was supported by an NSF Graduate Research Fellowship.

Appendix A Further automatic conjugate priors

We use Theorem 4.2 to calculate automatic conjugate priors for further exponential CRMs.

Example A.1.

Let XX be generated according to a Bernoulli process as in Example 2.1. That is, XX has an exponential CRM distribution with Klike,fixK_{like,fix} fixed-location atoms, where Klike,fix<K_{like,fix}<\infty in accordance with Assumption A0:

X=k=1Klike,fixxlike,kδψlike,k.X=\sum_{k=1}^{K_{like,fix}}x_{like,k}\delta_{\psi_{like,k}}.

The weight of the kkth atom, xlike,kx_{like,k}, has support on {0,1}\{0,1\} and has a Bernoulli density with parameter θk(0,1]\theta_{k}\in(0,1]:

h(x|θk)\displaystyle h(x|\theta_{k}) =θkx(1θk)1x\displaystyle=\theta_{k}^{x}(1-\theta_{k})^{1-x}
=exp{xlog(θk/(1θk))+log(1θk)}.\displaystyle=\exp\left\{x\log(\theta_{k}/(1-\theta_{k}))+\log(1-\theta_{k})\right\}.

The final line is rewritten to emphasize the exponential family form of this density, with

κ(x)\displaystyle\kappa(x) =1\displaystyle=1
ϕ(x)\displaystyle\phi(x) =x\displaystyle=x
η(θ)\displaystyle\eta(\theta) =log(θ1θ)\displaystyle=\log\left(\frac{\theta}{1-\theta}\right)
A(θ)\displaystyle A(\theta) =log(1θ).\displaystyle=-\log(1-\theta).

Then, by Theorem 4.2, XX has a Bayesian nonparametric conjugate prior for

Θ:=k=1Klike,fixθkδψk.\Theta:=\sum_{k=1}^{K_{like,fix}}\theta_{k}\delta_{\psi_{k}}.

This conjugate prior has two parts.

First, Θ\Theta has a set of Kprior,fixK_{prior,fix} fixed-location atoms at some subset of the Klike,fixK_{like,fix} fixed locations of XX. The kkth such atom has random weight θfix,k\theta_{fix,k} with density

fprior,fix,k(θ)\displaystyle f_{prior,fix,k}(\theta) =exp{ξfix,k,η(θ)+λfix,k[A(θ)]B(ξfix,k,λfix,k)}\displaystyle=\exp\left\{\langle\xi_{fix,k},\eta(\theta)\rangle+\lambda_{fix,k}\left[-A(\theta)\right]-B(\xi_{fix,k},\lambda_{fix,k})\right\}
=θξfix,k(1θ)λfix,kξfix,kexp{B(ξfix,k,λfix,k)}\displaystyle=\theta^{\xi_{fix,k}}(1-\theta)^{\lambda_{fix,k}-\xi_{fix,k}}\exp\left\{-B(\xi_{fix,k},\lambda_{fix,k})\right\}
=Beta(θ|ξfix,k+1,λfix,kξfix,k+1),\displaystyle=\mathrm{Beta}\left(\theta\left|\xi_{fix,k}+1,\lambda_{fix,k}-\xi_{fix,k}+1\right.\right),

where Beta(θ|a,b)\mathrm{Beta}(\theta|a,b) denotes the beta density with shape parameters a>0a>0 and b>0b>0. So we must have fixed hyperparameters ξfix,k>1\xi_{fix,k}>-1 and λfix,k>ξfix,k1\lambda_{fix,k}>\xi_{fix,k}-1. Further,

exp{B(ξfix,k,λfix,k)}=Γ(λfix,k+2)Γ(ξfix,k+1)Γ(λfix,kξfix,k+1)\exp\left\{-B(\xi_{fix,k},\lambda_{fix,k})\right\}=\frac{\Gamma(\lambda_{fix,k}+2)}{\Gamma(\xi_{fix,k}+1)\Gamma(\lambda_{fix,k}-\xi_{fix,k}+1)}

to ensure normalization.

Second, Θ\Theta has an ordinary component characterized by any proper distribution GG and weight rate measure

ν(dθ)\displaystyle\nu(d\theta) =γexp{ξ,η(θ)+λ[A(θ)]}dθ\displaystyle=\gamma\exp\left\{\langle\xi,\eta(\theta)\rangle+\lambda\left[-A(\theta)\right]\right\}\;d\theta
=γθξ(1θ)λξdθ.\displaystyle=\gamma\theta^{\xi}(1-\theta)^{\lambda-\xi}\;d\theta.

Finally, we need to choose the allowable hyperparameter ranges for γ\gamma, ξ\xi, and λ\lambda. γ>0\gamma>0 ensures ν\nu is a measure. By Assumption A1, we must have ν(+)=\nu(\mathbb{R}_{+})=\infty, so ν\nu must represent an improper beta distribution. As such, we require either ξ+10\xi+1\leq 0 or λξ0\lambda-\xi\leq 0. By Assumption A2, we must have

x=1θ+ν(dθ)h(x|θ)\displaystyle\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_{+}}\nu(d\theta)\cdot h(x|\theta)
=θ(0,1]ν(dθ)h(1|θ)\displaystyle=\int_{\theta\in(0,1]}\nu(d\theta)h(1|\theta)
since the support of xx is {0,1}\{0,1\} and the support of θ\theta is (0,1](0,1]
\displaystyle=\gamma\int_{\theta\in(0,1]}\theta^{\xi+1}(1-\theta)^{\lambda-\xi}\;d\theta<\infty.

Since the integrand is the kernel of a beta distribution, the integral is finite if and only if ξ+2>0\xi+2>0 and λξ+1>0\lambda-\xi+1>0.

The hyperparameter restrictions can thus be summarized as:

γ\displaystyle\gamma >0\displaystyle>0
ξ\displaystyle\xi (2,1]\displaystyle\in(-2,-1]
λ\displaystyle\lambda >ξ1\displaystyle>\xi-1
ξfix,k\displaystyle\xi_{fix,k} >1 and λfix,k>ξfix,k1for all k[Kprior,fix]\displaystyle>-1\textrm{ and }\lambda_{fix,k}>\xi_{fix,k}-1\quad\textrm{for all $k\in[K_{prior,fix}]$}

By setting α=ξ+1\alpha=\xi+1, c=λ+2c=\lambda+2, ρfix,k=ξfix,k+1\rho_{fix,k}=\xi_{fix,k}+1, and σfix,k=λfix,kξfix,k+1\sigma_{fix,k}=\lambda_{fix,k}-\xi_{fix,k}+1, we recover the hyperparameters of Eq. (11) in Example 2.1. Here, by contrast to Example 2.1, we found the conjugate prior and its hyperparameter settings given just the Bernoulli process likelihood. Henceforth, we use the parameterization of the beta process above. \blacksquare

Appendix B Further size-biased representations

Example B.1.

Let Θ\Theta be a beta process, and let XnX_{n} be iid Bernoulli processes conditioned on Θ\Theta for each nn as in Example A.1. That is, we have

ν(dθ)=γθξ(1θ)λξdθ.\nu(d\theta)=\gamma\theta^{\xi}(1-\theta)^{\lambda-\xi}\;d\theta.

And

h(x|θk)=θkx(1θk)1xh(x|\theta_{k})=\theta_{k}^{x}(1-\theta_{k})^{1-x}

with

γ\displaystyle\gamma >0\displaystyle>0
ξ\displaystyle\xi (2,1]\displaystyle\in(-2,-1]
λ\displaystyle\lambda >ξ1\displaystyle>\xi-1
ξfix,k\displaystyle\xi_{fix,k} >1 and λfix,k>ξfix,k1for all k[Kprior,fix]\displaystyle>-1\textrm{ and }\lambda_{fix,k}>\xi_{fix,k}-1\quad\textrm{for all $k\in[K_{prior,fix}]$}

by Example A.1.

We can pick out the following components of hh:

κ(x)\displaystyle\kappa(x) =1\displaystyle=1
ϕ(x)\displaystyle\phi(x) =x\displaystyle=x
η(θ)\displaystyle\eta(\theta) =log(θ1θ)\displaystyle=\log\left(\frac{\theta}{1-\theta}\right)
A(θ)\displaystyle A(\theta) =log(1θ).\displaystyle=-\log(1-\theta).

Thus, by Corollary 5.2,

\Theta = \sum_{m=1}^{\infty}\sum_{x=1}^{\infty}\sum_{j=1}^{\rho_{m,x}}\theta_{m,x,j}\delta_{\psi_{m,x,j}}
\psi_{m,x,j} \stackrel{iid}{\sim} G \quad \textrm{iid across } m, x, j
\theta_{m,x,j} \stackrel{indep}{\sim} f_{size,m,x}(\theta)\,d\theta
\propto \theta^{\xi+x}(1-\theta)^{\lambda+m-\xi-x}\,d\theta
\propto \mathrm{Beta}\left(\theta \,\middle|\, \xi+x+1,\, \lambda-\xi+m-x+1\right)\,d\theta
\quad \textrm{iid across } j \textrm{ and independently across } m, x
M_{m,x} := \gamma\cdot\frac{\Gamma(\xi+x+1)\,\Gamma(\lambda-\xi+m-x+1)}{\Gamma(\lambda+m+2)}
\rho_{m,x} \stackrel{indep}{\sim} \mathrm{Poisson}\left(M_{m,x}\right) \quad \textrm{independently across } m, x
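For intuition, this construction can be simulated directly. The following is a minimal Python sketch, not part of the paper: the truncation level M_MAX, the hyperparameter values, and the uniform stand-in for the base distribution $G$ are illustrative assumptions, and only the $x = 1$ terms are generated since a Bernoulli observation first reveals an atom with weight one.

# Minimal simulation sketch of the size-biased beta process representation above.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)
gamma_mass, xi, lam = 2.0, -1.5, 0.0   # gamma > 0, xi in (-2, -1], lam > xi - 1
M_MAX = 50                             # truncation of the (infinite) sum over m

atoms = []  # list of (weight theta, location psi) pairs
for m in range(1, M_MAX + 1):
    x = 1
    # Poisson mean M_{m,x} = gamma * Gamma(xi+x+1) Gamma(lam-xi+m-x+1) / Gamma(lam+m+2),
    # computed on the log scale for stability.
    log_M = (np.log(gamma_mass)
             + gammaln(xi + x + 1) + gammaln(lam - xi + m - x + 1)
             - gammaln(lam + m + 2))
    rho = rng.poisson(np.exp(log_M))
    # Each new atom gets a Beta(xi+x+1, lam-xi+m-x+1) weight and a uniform
    # location standing in for a draw from the base distribution G.
    for _ in range(rho):
        theta = rng.beta(xi + x + 1, lam - xi + m - x + 1)
        psi = rng.uniform()
        atoms.append((theta, psi))

print(len(atoms), "atoms; total mass approx", sum(t for t, _ in atoms))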

Broderick et al. (2012) and Paisley et al. (2012) have previously noted that this size-biased representation of the beta process arises from the Poisson point process. \blacksquare

Appendix C Further marginals

Example C.1.

Let $\Theta$ be a beta process, and let $X_{n}$ be iid Bernoulli processes conditioned on $\Theta$ for each $n$ as in Examples A.1 and B.1.

We calculate the main components of Corollary 6.2 for this pair of processes. In particular, we have

\mathbb{P}(x_{n}=1) = \kappa(1)\exp\left\{-B\left(\xi+\sum_{m=1}^{n-1}x_{m},\,\lambda+n-1\right)+B\left(\xi+\sum_{m=1}^{n-1}x_{m}+1,\,\lambda+n\right)\right\}
= \frac{\Gamma(\lambda+n+1)}{\Gamma\left(\xi+\sum_{m=1}^{n-1}x_{m}+1\right)\Gamma\left(\lambda+n-\xi-\sum_{m=1}^{n-1}x_{m}\right)} \cdot \frac{\Gamma\left(\xi+\sum_{m=1}^{n-1}x_{m}+2\right)\Gamma\left(\lambda+n-\xi-\sum_{m=1}^{n-1}x_{m}\right)}{\Gamma(\lambda+n+2)}
= \frac{\xi+\sum_{m=1}^{n-1}x_{m}+1}{\lambda+n+1},

where the last equality uses $\Gamma(z+1) = z\,\Gamma(z)$ and the cancellation of the remaining gamma factors.
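For instance, under the illustrative hyperparameter values $\xi = -1.5$ and $\lambda = 0$, the probability that $x_{2} = 1$ at a location where $x_{1} = 1$ is $(-1.5 + 1 + 1)/(0 + 2 + 1) = 1/6$.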

We also have

M_{n,1} := \gamma\cdot\kappa(0)^{n-1}\kappa(1)\cdot\exp\left\{B\left(\xi+(n-1)\phi(0)+\phi(1),\,\lambda+n\right)\right\}
= \gamma\cdot\frac{\Gamma(\xi+2)\,\Gamma(\lambda+n-\xi)}{\Gamma(\lambda+n+2)}.

Thus, the marginal distribution of $X_{1:N}$ is the same as that provided by the following construction.

For each $n = 1, 2, \ldots, N$,

  1. At any location $\psi$ for which there is some atom in $X_{1},\ldots,X_{n-1}$, let $x_{m}$ be the weight of $X_{m}$ at $\psi$ for $m\in[n-1]$. Then $X_{n}|X_{1},\ldots,X_{n-1}$ has weight $x_{n}$ at $\psi$, where

     \mathbb{P}(dx_{n}) = \mathrm{Bern}\left(x_{n}\,\middle|\,\frac{\xi+\sum_{m=1}^{n-1}x_{m}+1}{\lambda+n+1}\right)
  2. $X_{n}$ has $\rho_{n,1}$ atoms at locations $\{\psi_{n,1,j}\}$, $j\in[\rho_{n,1}]$, at which no atom has yet appeared in any of $X_{1},\ldots,X_{n-1}$. Moreover,

     M_{n,1} := \gamma\cdot\frac{\Gamma(\xi+2)\,\Gamma(\lambda+n-\xi)}{\Gamma(\lambda+n+2)} \quad \textrm{across } n
     \rho_{n,1} \stackrel{indep}{\sim} \mathrm{Poisson}\left(M_{n,1}\right) \quad \textrm{independently across } n
     \psi_{n,1,j} \stackrel{iid}{\sim} G(d\psi) \quad \textrm{iid across } n, j

Here, we have recovered the three-parameter extension of the Indian buffet process (Teh and Görür, 2009; Broderick et al., 2013). \blacksquare
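For intuition, the following minimal Python sketch simulates this marginal construction. It is not part of the paper: the hyperparameter values, the number of rounds N, and the uniform stand-in for the base distribution $G$ are illustrative assumptions.

# Minimal simulation sketch of the marginal (three-parameter IBP) construction above.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)
gamma_mass, xi, lam = 2.0, -1.5, 0.0   # gamma > 0, xi in (-2, -1], lam > xi - 1
N = 10

locations = []   # location psi of each atom seen so far
counts = []      # number of previous X_m with weight 1 at each location

for n in range(1, N + 1):
    # Step 1: revisit each existing location with probability
    # (xi + sum_{m < n} x_m + 1) / (lambda + n + 1).
    for k in range(len(counts)):
        p = (xi + counts[k] + 1.0) / (lam + n + 1.0)
        counts[k] += rng.binomial(1, p)
    # Step 2: a Poisson(M_{n,1}) number of brand-new atoms, each of weight 1,
    # at fresh locations drawn from the base distribution (uniform here).
    log_M = (np.log(gamma_mass)
             + gammaln(xi + 2.0) + gammaln(lam + n - xi)
             - gammaln(lam + n + 2.0))
    for _ in range(rng.poisson(np.exp(log_M))):
        locations.append(rng.uniform())
        counts.append(1)

print(len(locations), "features after", N, "rounds")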

References

  • Airoldi et al. (2014) E. M. Airoldi, D. Blei, E. A. Erosheva, and S. E. Fienberg. Handbook of Mixed Membership Models and Their Applications. CRC Press, 2014.
  • Blei et al. (2003) D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
  • Broderick et al. (2012) T. Broderick, M. I. Jordan, and J. Pitman. Beta processes, stick-breaking, and power laws. Bayesian Analysis, 7(2):439–476, 2012.
  • Broderick et al. (2013) T. Broderick, M. I. Jordan, and J. Pitman. Cluster and feature modeling from combinatorial stochastic processes. Statistical Science, 2013.
  • Broderick et al. (2015) T. Broderick, L. Mackey, J. Paisley, and M. I. Jordan. Combinatorial clustering and the beta negative binomial process. IEEE TPAMI, 2015.
  • Damien et al. (1999) P. Damien, J. Wakefield, and S. Walker. Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society: Series B, 61(2):331–344, 1999.
  • DeGroot (1970) M. H. DeGroot. Optimal Statistical Decisions. John Wiley & Sons, Inc, 1970.
  • Diaconis and Ylvisaker (1979) P. Diaconis and D. Ylvisaker. Conjugate priors for exponential families. The Annals of Statistics, 7(2):269–281, 1979.
  • Doksum (1974) K. Doksum. Tailfree and neutral random probabilities and their posterior distributions. The Annals of Probability, pages 183–201, 1974.
  • Doshi et al. (2009) F. Doshi, K. T. Miller, J. Van Gael, and Y. W. Teh. Variational inference for the Indian buffet process. In AISTATS, 2009.
  • Escobar (1994) M. D. Escobar. Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association, 89(425):268–277, 1994.
  • Escobar and West (1995) M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.
  • Escobar and West (1998) M. D. Escobar and M. West. Computing nonparametric hierarchical models. In Practical nonparametric and semiparametric Bayesian statistics, pages 1–22. Springer, 1998.
  • Ferguson (1973) T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pages 209–230, 1973.
  • Ferguson (1974) T. S. Ferguson. Prior distributions on spaces of probability measures. The Annals of Statistics, pages 615–629, 1974.
  • Griffiths and Ghahramani (2006) T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, 2006.
  • Hjort (1990) N. L. Hjort. Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics, pages 1259–1294, 1990.
  • Ishwaran and James (2001) H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453), 2001.
  • James (2014) L. F. James. Poisson latent feature calculus for generalized Indian buffet processes. arXiv preprint arXiv:1411.2936, 2014.
  • James et al. (2009) L. F. James, A. Lijoi, and I. Prünster. Posterior analysis for normalized random measures with independent increments. Scandinavian Journal of Statistics, 36(1):76–97, 2009.
  • Kalli et al. (2011) M. Kalli, J. E. Griffin, and S. G. Walker. Slice sampling mixture models. Statistics and Computing, 21(1):93–105, 2011.
  • Kim (1999) Y. Kim. Nonparametric Bayesian estimators for counting processes. Annals of Statistics, pages 562–588, 1999.
  • Kingman (1967) J. F. C. Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59–78, 1967.
  • Kingman (1992) J. F. C. Kingman. Poisson Processes, volume 3. Oxford University Press, 1992.
  • Lijoi and Prünster (2010) A. Lijoi and I. Prünster. Models beyond the Dirichlet process. In N. L. Hjort, C. Holmes, P. Müller, and S. G. Walker, editors, Bayesian Nonparametrics. Cambridge Series in Statistical and Probabilistic Mathematics, 2010.
  • Lo (1982) A. Y. Lo. Bayesian nonparametric statistical inference for Poisson point processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 59(1):55–66, 1982.
  • Lo (1984) A. Y. Lo. On a class of Bayesian nonparametric estimates: I. Density estimates. Annals of Statistics, 12(1):351–357, 1984.
  • MacEachern (1994) S. N. MacEachern. Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics-Simulation and Computation, 23(3):727–741, 1994.
  • Neal (2000) R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
  • Neal (2003) R. M. Neal. Slice sampling. Annals of Statistics, pages 705–741, 2003.
  • Orbanz (2010) P. Orbanz. Conjugate projective limits. arXiv preprint arXiv:1012.0363, 2010.
  • Paisley et al. (2010) J. W. Paisley, A. K. Zaas, C. W. Woods, G. S. Ginsburg, and L. Carin. A stick-breaking construction of the beta process. In ICML, pages 847–854, 2010.
  • Paisley et al. (2011) J. W. Paisley, L. Carin, and D. M. Blei. Variational inference for stick-breaking beta process priors. In ICML, pages 889–896, 2011.
  • Paisley et al. (2012) J. W. Paisley, D. M. Blei, and M. I. Jordan. Stick-breaking beta processes and the Poisson process. In AISTATS, pages 850–858, 2012.
  • Perman et al. (1992) M. Perman, J. Pitman, and M. Yor. Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21–39, 1992.
  • Pitman (1996a) J. Pitman. Random discrete distributions invariant under size-biased permutation. Advances in Applied Probability, pages 525–539, 1996a.
  • Pitman (1996b) J. Pitman. Some developments of the Blackwell-MacQueen urn scheme. Lecture Notes-Monograph Series, pages 245–267, 1996b.
  • Pitman (2003) J. Pitman. Poisson-Kingman partitions. Lecture Notes-Monograph Series, pages 1–34, 2003.
  • Sethuraman (1994) J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
  • Teh and Görür (2009) Y. W. Teh and D. Görür. Indian buffet processes with power-law behavior. In NIPS, pages 1838–1846, 2009.
  • Teh et al. (2006) Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
  • Teh et al. (2007) Y. W. Teh, D. Görür, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process. In AISTATS, pages 556–563, 2007.
  • Thibaux and Jordan (2007) R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In AISTATS, pages 564–571, 2007.
  • Thibaux (2008) R. J. Thibaux. Nonparametric Bayesian Models for Machine Learning. PhD thesis, UC Berkeley, 2008.
  • Titsias (2008) M. K. Titsias. The infinite gamma-Poisson feature model. In NIPS, pages 1513–1520, 2008.
  • Walker (2007) S. G. Walker. Sampling the Dirichlet mixture model with slices. Communications in Statistics—Simulation and Computation, 36(1):45–54, 2007.
  • Wang and Blei (2013) C. Wang and D. M. Blei. Variational inference in nonconjugate models. The Journal of Machine Learning Research, 14(1):1005–1031, 2013.
  • West and Escobar (1994) M. West and M. D. Escobar. Hierarchical priors and mixture models, with application in regression and density estimation. In P. R. Freeman and A. F. M. Smith, editors, Aspects of Uncertainty: A Tribute to D. V. Lindley. Institute of Statistics and Decision Sciences, Duke University, 1994.
  • Zhou et al. (2012) M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. AISTATS, 2012.