
A Bayesian theory for estimation of biodiversity

Tommaso Rigon (Department of Economics, Management and Statistics, University of Milano–Bicocca, 20126 Milano, Italy), Ching-Lung Hsu (Department of Statistical Science, Duke University, Durham, U.S.A.), and David B. Dunson (Department of Statistical Science, Duke University, Durham, U.S.A.)
Abstract

Statistical inference on biodiversity has a rich history going back to R. A. Fisher. An influential ecological theory suggests the existence of a fundamental biodiversity number, denoted $\alpha$, which coincides with the precision parameter of a Dirichlet process (dp). In this paper, motivated by this theory, we develop Bayesian nonparametric methods for statistical inference on biodiversity, building on the literature on Gibbs-type priors. We argue that $\sigma$-diversity is the most natural extension of the fundamental biodiversity number and discuss strategies for its estimation. Furthermore, we develop novel theory and methods starting with an Aldous-Pitman (ap) process, which serves as the building block for any Gibbs-type prior with a square-root growth rate. We propose a modeling framework that accommodates the hierarchical structure of Linnean taxonomy, offering a more refined approach to quantifying biodiversity. The analysis of a large and comprehensive dataset on Amazon tree flora provides a motivating application.

1 Introduction

The decline in biodiversity represents a significant global concern, with potential implications for entire ecosystems and profound impacts on human well-being (e.g. Ceballos et al., 2015). Consequently, assessing diversity is a primary goal in ecology, although its practical measurement is a notoriously complex task (Colwell, 2009; Magurran and McGill, 2011). Taxon richness, the number of taxa within a community, is perhaps the simplest way to describe diversity, although alternative indices such as the Simpson and Shannon indices, or Fisher's $\hat{\alpha}_{\textsc{f}}$, are often considered. The estimation of richness involves the analysis of taxon accumulation curves, a statistical methodology with roots that go back to the seminal work of Fisher et al. (1943), Good (1953), and Good and Toulmin (1956). Modern approaches are discussed in Bunge and Fitzpatrick (1993), Colwell (2009), and Magurran and McGill (2011). See also Zito et al. (2023) for a recent model-based approach to analyzing accumulation curves and estimating richness.

Figure 1: (a) Geographical distribution of the 1,170 tree plots analyzed in ter Steege et al. (2013); colors represent the country of each tree plot. (b) Coarsened posterior distribution ($\rho=0.01$) of the fundamental biodiversity number $\alpha$, using the Amazonian tree dataset of ter Steege et al. (2013); the dotted lines represent 98% credible intervals and the dashed line is the posterior mean.

In an influential contribution, Hubbell (2001) postulated the existence of a fundamental biodiversity number $\alpha$, which lies at the core of his unified neutral theory of biodiversity. This number, $\alpha$, represents twice the population size multiplied by the speciation rate. The theory is conceptually attractive as it encapsulates biodiversity into a single number with a clear biological interpretation. In fact, $\alpha$ is closely linked to accumulation curves, and Hubbell (2001) demonstrated that Fisher's $\hat{\alpha}_{\textsc{f}}$ is asymptotically equivalent to the fundamental number $\alpha$ for large population sizes. The theory has found a successful application in predicting the number of tree species in the Amazon basin, as discussed in Hubbell et al. (2008) and ter Steege et al. (2013); the data are shown in Figure 1(a). These studies reported estimates of $\hat{\alpha}=743$ and $\hat{\alpha}=754$ for the fundamental biodiversity number, predicting approximately 12,500 and 15,000 tree species in the Amazon basin, respectively.

Interestingly, the fundamental biodiversity number $\alpha$ in Hubbell's theory corresponds to the precision parameter of a dp (Hubbell, 2001). The dp has been widely studied and has applications well beyond ecology and biodiversity (Hjort et al., 2010; Ghosal and Van der Vaart, 2017). Hubbell's theory generated considerable controversy (e.g. McGill, 2003; Chisholm and Burgman, 2004; Ricklefs, 2006), being incompatible with data and ignoring mechanisms regarded as important by ecologists (McGill, 2003). In fact, it is well known that the dp is restrictive in depending on a single parameter $\alpha$ and enforcing a logarithmic growth rate for the number of taxa (Lijoi et al., 2007a, b). To address these limitations, Gibbs-type priors have emerged as the most natural extension of the dp (Gnedin and Pitman, 2005; De Blasi et al., 2015) due to their balance between flexibility and tractability. This class includes the dp, the Pitman-Yor process (Perman et al., 1992; Pitman and Yor, 1997), the normalized inverse Gaussian process (Lijoi et al., 2005), and the normalized generalized gamma process (Lijoi et al., 2007b).

We argue that the most natural generalization of Hubbell's fundamental biodiversity number $\alpha$ is the so-called $\sigma$-diversity of Pitman (2003). This retains the main appealing characteristic of the unified neutral theory, namely the ability to describe biodiversity with a single interpretable number, while improving upon Hubbell by allowing for various growth rates of taxon accumulation curves. Classical estimation of $\sigma$-diversity is possible, but, as we shall see, the Gibbs-type framework naturally calls for Bayesian estimates. We contribute to the theory of Gibbs-type priors with novel statistical results and by pointing out connections among the classical work of Fisher et al. (1943), accumulation curves, and other measures of biodiversity. Moreover, we investigate in depth a Gibbs process we term Aldous-Pitman after the work of Aldous and Pitman (1998) and Pitman (2003), and we show that a suitable data augmentation enables posterior inference.

Gibbs-type priors are natural tools for modeling biodiversity when focusing on a single level of the Linnean taxonomy, such as family, genus or species. However, each statistical unit often comprises a collection of $L$ different taxa, which are organized in a nested fashion; see, e.g., Zito et al. (2023). As a crude exemplification, one might consider using a different Gibbs-type model for each layer of the Linnean taxonomy. However, this approach would overlook the rich and informative nested structure of the data. Bayesian nonparametric models for such data are less developed. When $L=2$, a sensible proposal is the enriched Dirichlet process (edp) of Wade et al. (2011), subsequently extended by Rigon et al. (2025) to the Pitman-Yor case. A more general prior with $L>2$, relying on a Pitman-Yor specification, is implicitly employed in Zito et al. (2023). We propose a general modelling structure that accounts for the complexities of the available data, focusing on the quantification of biodiversity. We call this novel approach taxonomic Gibbs-type priors; it combines the advantages of the enriched Dirichlet process of Wade et al. (2011) with the flexibility of Gibbs processes. In practice, this more refined approach results in taxon-specific indices of biodiversity that may be useful for comparing the biodiversity within branches of the taxonomic tree.

In Section 2, we discuss the Bayesian nonparametric foundations of Hubbell (2001)'s theory of biodiversity. In Section 3, we use the Aldous-Pitman process to characterize biodiversity. In Section 4, we propose taxonomic Gibbs-type priors. In Section 5.1, we compare our Bayesian methodology with the analysis of ter Steege et al. (2013). As depicted in Figure 1(b), our estimate for $\alpha$ is in close agreement with Hubbell et al. (2008) and ter Steege et al. (2013), with the benefit of uncertainty quantification. Bayesian inference for the number of tree species is also feasible, as will be discussed in Section 5.1. In Section 5.2, we analyze the Amazonian dataset in more depth through our taxonomic Gibbs-type prior, providing new insights on within-genus and within-family biodiversity.

2 Bayesian nonparametric modeling of taxon diversity

2.1 Gibbs-type priors

In this section, we provide an overview of Gibbs-type priors (Gnedin and Pitman, 2005), focusing on their relevance for ecological applications and the quantification of biodiversity. For a mathematical exposition, we refer to De Blasi et al. (2015), while key theoretical developments are described in Lijoi et al. (2007a, b, 2008a, 2008b) and De Blasi et al. (2013). Our data set comprises taxon labels $X_{1},\dots,X_{n}$, representing, for example, species or families, and taking values in a set $\mathds{X}$. We assume that $(X_{n})_{n\geq 1}$ is an (ideally) infinite sequence of exchangeable observations which, according to the de Finetti theorem (de Finetti, 1937), is equivalent to the following hierarchical specification

$$X_{n}\mid\tilde{p}\overset{\textup{iid}}{\sim}\tilde{p},\qquad n\geq 1,\qquad\quad \tilde{p}\sim\mathcal{Q}, \tag{1}$$

where $\tilde{p}$ is a (random) probability measure whose law $\mathcal{Q}$ is the prior distribution in Bayesian statistics. In our setting, we specify a multinomial model for $\tilde{p}$, namely

$$\tilde{p}(\cdot)=\sum_{h=1}^{\infty}\pi_{h}\delta_{Z_{h}}(\cdot), \tag{2}$$

where $\delta_{x}$ represents the Dirac delta measure at $x$, whereas the $\pi_{h}$ are random probability weights such that $\sum_{h=1}^{\infty}\pi_{h}=1$ almost surely, and $(Z_{h})_{h\geq 1}$ are random $\mathds{X}$-valued locations, denoting distinct taxon labels. This approach is referred to as nonparametric because there are potentially infinitely many probabilities $(\pi_{h})_{h\geq 1}$, for which a suitable prior will be specified. We assume that the $Z_{h}$'s are iid samples from a diffuse probability distribution $P$ on $\mathds{X}$ and that $(Z_{h})_{h\geq 1}$ and $(\pi_{h})_{h\geq 1}$ are independent. The diffuseness of $P$ implies that the labels $Z_{h}$ are all different. Thus, $\tilde{p}$ belongs to the class of proper species sampling models (ssms, Pitman, 1996). The baseline measure $P$ serves as a technical tool that simplifies the mathematical exposition, but we are not interested in “learning” it. Indeed, without loss of generality, in species sampling problems, we may let $\mathds{X}=\mathds{R}$ and let $P$ be a uniform distribution, meaning that each value $Z_{h}$ is the numerical encoding of the associated taxon.

Remark 1.

The sampling mechanism in equation (1) makes explicit the assumptions behind species sampling models, of which Hubbell (2001)'s theory represents a special case. The core assumptions are: (i) ssms are tailored for the analysis of individual-based accumulation curves (Gotelli and Colwell, 2001), since the taxa $X_{n}$ are sampled sequentially, one at a time; (ii) taxon labels are independently sampled from $\tilde{p}$ (iid), implying that the probability of observing the $h$th species $Z_{h}$ remains constant across data points $X_{1},\dots,X_{n}$. These simplifying assumptions facilitate the definition of biodiversity indices, but are often violated in practice, particularly when analyzing meta-community data. For instance, observations $X_{1},\dots,X_{n}$ may originate from different geographical regions, as in ter Steege et al. (2013), consequently leading to variations in the probability of observing a given taxon. Nevertheless, ssms serve as a useful approximation of reality, offering robust predictive capabilities when a reasonable degree of homogeneity across observations is plausible.

The primary focus of inference typically lies not in the probabilities $(\pi_{h})_{h\geq 1}$ but rather in the combinatorial structure induced by the multinomial model of equation (2). Due to the discrete nature of $\tilde{p}$, there will be identical values among $X_{1},\dots,X_{n}$ with positive probability, comprising a total of $K_{n}=k$ distinct values with labels $X_{1}^{*},\dots,X_{k}^{*}$ and frequencies $n_{1},\dots,n_{k}$ such that $\sum_{j=1}^{k}n_{j}=n$. The random variable $K_{n}$ represents the observed taxon richness, whereas $n_{1},\dots,n_{k}$ denote the associated abundances. The ties among the observations $X_{1},\dots,X_{n}$ induce a random partition $\Psi_{n}=\{C_{1},\dots,C_{k}\}$ of the indices $\{1,\dots,n\}$, where $C_{j}=\{i:X_{i}=X_{j}^{*}\}$ and $n_{j}=|C_{j}|$. A species sampling model $\tilde{p}$ is a Gibbs-type process if the law of the random partition, called a Gibbs partition, is such that

$$\Pi_{n}\left(n_{1},\dots,n_{k}\right)=\mathds{P}\left(\Psi_{n}=\{C_{1},\dots,C_{k}\}\right)=V_{n,k}\prod_{j=1}^{k}(1-\sigma)_{n_{j}-1}, \tag{3}$$

where $(a)_{n}=a(a+1)\cdots(a+n-1)$ denotes a rising factorial, $\sigma<1$ is a discount parameter, and the $V_{n,k}$ are non-negative weights satisfying the forward recursion $V_{n,k}=(n-\sigma k)V_{n+1,k}+V_{n+1,k+1}$ for any $n\geq 1$ and $1\leq k\leq n$. The function $\Pi_{n}(n_{1},\ldots,n_{k})$ is called the exchangeable partition probability function (eppf) and, in conjunction with $P$, characterizes the distribution of the random probability measure $\tilde{p}$ (Pitman, 1996). Of considerable importance is the predictive distribution induced by $\Pi_{n}$ (Lijoi et al., 2007a):

$$\mathds{P}(X_{n+1}\in\cdot\mid X_{1},\dots,X_{n})=\frac{V_{n+1,k+1}}{V_{n,k}}P(\cdot)+\frac{V_{n+1,k}}{V_{n,k}}\sum_{j=1}^{k}(n_{j}-\sigma)\delta_{X^{*}_{j}}(\cdot). \tag{4}$$

The above predictive rule provides a sampling mechanism for the data $X_{1},\dots,X_{n}$. The term $\mathds{P}(X_{n+1}=\text{“new”}\mid X_{1},\dots,X_{n})=V_{n+1,k+1}/V_{n,k}$ denotes the probability of discovering a new taxon, i.e., the probability of $X_{n+1}$ not being observed among the previous $X_{1},\dots,X_{n}$. Conversely, the probability $\mathds{P}(X_{n+1}=\text{“old”}\mid X_{1},\dots,X_{n})=1-V_{n+1,k+1}/V_{n,k}$ corresponds to a Bayesian estimate of the sample coverage, namely the fraction of already observed taxa. The probability of re-observing the $j$th taxon $X^{*}_{j}$ is proportional to $n_{j}-\sigma$, explaining why $\sigma<1$ is sometimes referred to as the “discount parameter”, as it diminishes (or augments) the observed frequencies. Notable Gibbs-type priors are discussed below.

Example 1 (Dirichlet–multinomial, case $\sigma<0$).

Consider $\sigma<0$, and let $H\in\mathds{N}$. A valid set of Gibbs coefficients is defined as:

$$V_{n,k}(\sigma,H):=\frac{|\sigma|^{k-1}\prod_{j=1}^{k-1}(H-j)}{(H|\sigma|+1)_{n-1}}\mathds{1}(k\leq H), \tag{5}$$

where $\mathds{1}(\cdot)$ denotes the indicator function. This model assumes a finite number of taxa at the population level because the in-sample species richness $K_{n}=k$ is upper-bounded by $H$. Its predictive scheme is:

$$\mathds{P}(X_{n+1}\in\cdot\mid H,X_{1},\dots,X_{n})=\frac{(H-k)|\sigma|}{H|\sigma|+n}P(\cdot)+\frac{1}{H|\sigma|+n}\sum_{j=1}^{k}(n_{j}+|\sigma|)\delta_{X^{*}_{j}}(\cdot).$$

Thus, the probability of discovering a new taxon is zero whenever $H=k$. The corresponding process is called the Dirichlet-multinomial process, denoted by $\tilde{p}=\sum_{h=1}^{H}W_{h}\delta_{Z_{h}}$, with $(W_{1},\dots,W_{H-1})\sim\text{Dir}(|\sigma|,\dots,|\sigma|)$.

Example 2 (Dirichlet process, case $\sigma=0$).

For $\sigma=0$ and $\alpha>0$, a valid set of Gibbs coefficients is:

$$V_{n,k}(\alpha):=\frac{\alpha^{k}}{(\alpha)_{n}}. \tag{6}$$

The corresponding process $\tilde{p}$ is the Dirichlet process of Ferguson (1973), with precision parameter $\alpha$, the fundamental biodiversity parameter of Hubbell (2001). The general predictive rule in (4) becomes:

$$\mathds{P}(X_{n+1}\in\cdot\mid\alpha,X_{1},\dots,X_{n})=\frac{\alpha}{\alpha+n}P(\cdot)+\frac{1}{\alpha+n}\sum_{j=1}^{k}n_{j}\delta_{X^{*}_{j}}(\cdot),$$

the Blackwell and MacQueen (1973) urn scheme. The Dirichlet process $\tilde{p}$ admits a stick-breaking representation (Sethuraman, 1994): $\tilde{p}=\sum_{h=1}^{\infty}\pi_{h}\delta_{Z_{h}}$ with $\pi_{1}=\xi_{1}$, $\pi_{h}=\xi_{h}\prod_{\ell<h}(1-\xi_{\ell})$ for $h\geq 2$, and $\xi_{h}\overset{\text{iid}}{\sim}\text{Beta}(1,\alpha)$ for $h\geq 1$. This highlights that (i) the dp assumes infinitely many taxa at the population level, and (ii) the probabilities of these taxa decrease exponentially fast at a rate determined by $\alpha$.
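To make the two representations concrete, here is a minimal simulation sketch (in Python with NumPy; the truncation level and the numerical values of $\alpha$ and $n$ are illustrative choices, not taken from the paper): it draws truncated stick-breaking weights and samples taxon labels from the Blackwell–MacQueen urn scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_stick_breaking(alpha, H_trunc=5_000, rng=rng):
    """Truncated stick-breaking weights of a Dirichlet process (Sethuraman, 1994)."""
    xi = rng.beta(1.0, alpha, size=H_trunc)
    # pi_h = xi_h * prod_{l < h} (1 - xi_l); the weights sum to ~1 for a large truncation level
    pi = xi * np.cumprod(np.concatenate(([1.0], 1.0 - xi[:-1])))
    return pi

def blackwell_macqueen(alpha, n, rng=rng):
    """Sample X_1, ..., X_n from the urn scheme; new taxa receive fresh integer labels."""
    labels = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):          # probability of a new taxon
            labels.append(len(set(labels)))
        else:                                           # re-observe an old taxon, prob. prop. to n_j
            labels.append(labels[rng.integers(i)])
    return np.array(labels)

pi = dp_stick_breaking(alpha=5.0)
x = blackwell_macqueen(alpha=5.0, n=10_000)
print(pi[:5].round(4), "sum of truncated weights:", pi.sum().round(4))
print("observed richness K_n:", len(set(x)))
```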

Example 3 ($\sigma$-stable Poisson-Kingman process, case $\sigma\in(0,1)$).

For $\sigma\in(0,1)$ and $\gamma>0$, we have

$$V_{n,k}(\sigma,\gamma):=\frac{\sigma^{k}\gamma^{k}}{\Gamma(n-k\sigma)\,f_{\sigma}\big(\gamma^{-1/\sigma}\big)}\int_{0}^{1}s^{n-1-k\sigma}f_{\sigma}\big((1-s)\gamma^{-1/\sigma}\big)\,\mathrm{d}s, \tag{7}$$

where $\Gamma(\cdot)$ is the gamma function and $f_{\sigma}(t)=\pi^{-1}\sum_{h=1}^{\infty}(-1)^{h+1}\sin(h\pi\sigma)\,\Gamma(h\sigma+1)/(h!\,t^{h\sigma+1})$ represents the density of a $\sigma$-stable distribution. The corresponding process $\tilde{p}$ is a $\sigma$-stable Poisson-Kingman (pk) process, conditional on the “total mass” $\gamma^{-1/\sigma}$ (Kingman, 1975; Pitman, 2003). The distribution of the associated probability weights $(\pi_{h})_{h\geq 1}$ is complex, but implies infinitely many species.

Remarkably, the aforementioned sets of weights $V_{n,k}(\sigma,H)$, $V_{n,k}(\alpha)$ and $V_{n,k}(\sigma,\gamma)$ form the foundation of any Gibbs-type prior. In fact, Gibbs partitions can always be expressed as a mixture with respect to the parameters $H$, $\alpha$, and $\gamma$. This result was established by Gnedin and Pitman (2005).

Proposition 1 (Gnedin and Pitman (2005)).

The Gibbs coefficients $V_{n,k}$ satisfy the recursive equation $V_{n,k}=(n-\sigma k)V_{n+1,k}+V_{n+1,k+1}$ for any $n\geq 1$ and $1\leq k\leq n$ in the following three cases:

1. If $\sigma<0$, whenever $V_{n,k}=\sum_{h=1}^{\infty}V_{n,k}(\sigma,h)p(h)$, for some discrete random variable $H\in\mathds{N}$ with probability distribution function $p(h)$, where the $V_{n,k}(\sigma,h)$'s are defined as in equation (5).

2. If $\sigma=0$, whenever $V_{n,k}=\int_{\mathds{R}^{+}}V_{n,k}(\alpha)p(\mathrm{d}\alpha)$, for some positive random variable $\alpha$ with probability measure $p(\mathrm{d}\alpha)$, where the $V_{n,k}(\alpha)$'s are defined as in equation (6).

3. If $\sigma\in(0,1)$, whenever $V_{n,k}=\int_{\mathds{R}^{+}}V_{n,k}(\sigma,\gamma)p(\mathrm{d}\gamma)$, for some positive random variable $\gamma$ with probability measure $p(\mathrm{d}\gamma)$, where the $V_{n,k}(\sigma,\gamma)$'s are defined as in equation (7).

Any Gibbs-type process can be represented hierarchically, involving a suitable prior distribution for the key parameters $H$, $\alpha$, and $\gamma$. Prior distributions for $H$ in the $\sigma<0$ case are discussed in Gnedin (2010) and De Blasi et al. (2013); see also Miller and Harrison (2018) for applications to mixture models and Legramanti et al. (2022) for employment in stochastic block models. In the $\sigma=0$ case, a popular choice is the semi-conjugate gamma prior for $\alpha$ as in Escobar and West (1995), with the Stirling-gamma prior of Zito et al. (2024) a recent alternative. Finally, in the $\sigma\in(0,1)$ regime, a polynomially tilted stable prior for $\gamma^{-1/\sigma}$ leads to the Pitman-Yor process (Pitman and Yor, 1997), whereas an exponentially tilted stable density leads to the normalized generalized gamma process (Lijoi et al., 2007b); refer to Lijoi et al. (2008b) and Favaro et al. (2009) for a thorough investigation and to Favaro et al. (2015) for a general class of priors for $\gamma$.

2.2 The quantification of biodiversity

The simplest measure of biodiversity is arguably the taxon richness. In our notation, we observe $K_{n}=k$ different taxa among $n$ data points $X_{1},\dots,X_{n}$; we refer to this value as the in-sample richness. A priori, the distribution of $K_{n}$ induced by a Gibbs-type prior has a simple form

$$\mathds{P}(K_{n}=k)=V_{n,k}\frac{\mathscr{C}(n,k;\sigma)}{\sigma^{k}},$$

where $\mathscr{C}(n,k;\sigma)$ denotes a generalized factorial coefficient (Charalambides, 2002); the case $\sigma=0$ is recovered as the limit of the above formula, since $\lim_{\sigma\rightarrow 0}\mathscr{C}(n,k;\sigma)\sigma^{-k}=|s(n,k)|$, the signless Stirling number of the first kind. The expected values $\mathds{E}(K_{1}),\dots,\mathds{E}(K_{n})$ can be interpreted as a model-based rarefaction curve (Zito et al., 2023). Moreover, we may want to predict the taxon richness $K_{n+m}\geq k$ after observing an additional sample $X_{n+1},\dots,X_{n+m}$; we call this value the out-of-sample richness. The posterior distribution of $K_{n+m}$ given the data $K_{n}=k$, or equivalently of the number of previously unobserved taxa $K_{m}^{(n)}=K_{n+m}-K_{n}$, has been derived by Lijoi et al. (2007a) and is

$$\mathds{P}(K_{m}^{(n)}=j\mid X_{1},\dots,X_{n})=\frac{V_{n+m,k+j}}{V_{n,k}}\frac{\mathscr{C}(m,j;\sigma,-n+k\sigma)}{\sigma^{j}},\qquad j=0,\dots,m,$$

where $\mathscr{C}(m,j;\sigma,-n+k\sigma)$ is a noncentral generalized factorial coefficient; see Lijoi et al. (2007a). The collection $\mathds{E}(K_{n+1}\mid K_{n}=k),\dots,\mathds{E}(K_{n+m}\mid K_{n}=k)$ represents a model-based extrapolation of the accumulation curve. Note that $K_{n}=k$ is a sufficient statistic for predictions.

Example 4 (Dirichlet-multinomial, cont’d).

For the Dirichlet-multinomial, the rarefaction curve is

$$\mathds{E}(K_{n}\mid H)=H-H\frac{(H|\sigma|-|\sigma|)_{n}}{(H|\sigma|)_{n}},$$

as shown in Pitman (2006, Chap. 3). Moreover, in the Appendix we provide a simple proof to show that:

$$\mathds{E}(K_{n+m}\mid H,X_{1},\dots,X_{n})=H-(H-k)\frac{(n+H|\sigma|-|\sigma|)_{m}}{(n+H|\sigma|)_{m}}.$$
Example 5 (Dirichlet process, cont’d).

In the Dirichlet process case, the rarefaction and the extrapolation curves are given by:

$$\mathds{E}(K_{n}\mid\alpha)=\sum_{i=1}^{n}\frac{\alpha}{\alpha+i-1},\qquad\mathds{E}(K_{n+m}\mid\alpha,X_{1},\dots,X_{n})=k+\sum_{i=1}^{m}\frac{\alpha}{\alpha+n+i-1}.$$

For their practical implementation, one can use $\sum_{i=1}^{n}1/(a+i-1)=\psi(a+n)-\psi(a)$ for any $a>0$, where $\psi(\cdot)$ is the digamma function. See, for example, Zito et al. (2023).
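The following minimal sketch evaluates both curves with the digamma identity above; it relies on SciPy's digamma, and the values of $\alpha$, $n$, $k$ and $m$ are illustrative rather than taken from the paper.

```python
import numpy as np
from scipy.special import digamma

def dp_rarefaction(alpha, n):
    """E(K_i | alpha) for i = 1, ..., n, via sum_{j<=i} alpha/(alpha+j-1) = alpha*{psi(alpha+i) - psi(alpha)}."""
    i = np.arange(1, n + 1)
    return alpha * (digamma(alpha + i) - digamma(alpha))

def dp_extrapolation(alpha, n, k, m):
    """E(K_{n+i} | alpha, K_n = k) for i = 1, ..., m."""
    i = np.arange(1, m + 1)
    return k + alpha * (digamma(alpha + n + i) - digamma(alpha + n))

# Illustrative values; k is set to the expected in-sample richness for simplicity
rare = dp_rarefaction(alpha=750.0, n=5_000)
extra = dp_extrapolation(alpha=750.0, n=5_000, k=rare[-1], m=5_000)
print(rare[-1].round(1), extra[-1].round(1))
```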

While rarefaction and extrapolation curves are valuable tools (Gotelli and Colwell, 2001), it is useful to summarize biodiversity with a single number. The concept of richness, defined as $\lim_{n\rightarrow\infty}K_{n}$, is problematic since $\lim_{n\to\infty}K_{n}=\infty$ for $\sigma\geq 0$ regardless of the observed data, whereas $K_{n}$ remains finite for $\sigma<0$. Avoiding $\sigma\geq 0$ is tempting, but leads to poor fit and predictions for some datasets. Gibbs-type priors with $\sigma\geq 0$ often excel in predicting future values $K_{m}^{(n)}$ for highly diverse taxa, compared to models with $\sigma\leq 0$ (Lijoi et al., 2007a; Favaro et al., 2009). Additionally, the total number of taxa present in a given area can be estimated even in models where $\sigma\geq 0$, e.g. following the strategy of ter Steege et al. (2013). This discussion motivates embracing an alternative concept of diversity, called $\sigma$-diversity, introduced by Pitman (2003), which encompasses Hubbell (2001)'s fundamental biodiversity number when $\sigma=0$.

Proposition 2 ($\sigma$-diversity, Pitman (2003)).

Let $K_{n}$ be the number of distinct values arising from a Gibbs-type prior in (3):

1. Let $\sigma<0$ and $V_{n,k}(\sigma,H)$ be defined as in equation (5); then $K_{n}\rightarrow H$ almost surely.

2. Let $\sigma=0$ and $V_{n,k}(\alpha)$ be defined as in equation (6); then $K_{n}/\log(n)\rightarrow\alpha$ almost surely.

3. Let $\sigma\in(0,1)$ and $V_{n,k}(\sigma,\gamma)$ be defined as in equation (7); then $K_{n}/n^{\sigma}\rightarrow\gamma$ almost surely.

For a generic set of weights $V_{n,k}$, let $c_{\sigma}(n)$ be a function such that $c_{\sigma}(n)=1$ if $\sigma<0$, $c_{\sigma}(n)=\log(n)$ if $\sigma=0$, and $c_{\sigma}(n)=n^{\sigma}$ if $\sigma\in(0,1)$. Then, as $n\rightarrow\infty$,

$$\frac{K_{n}}{c_{\sigma}(n)}\overset{a.s.}{\longrightarrow}S_{\sigma}. \tag{8}$$

The random variable $S_{\sigma}$ is called the $\sigma$-diversity and its distribution coincides with the prior for $H$, $\alpha$ and $\gamma$, respectively, implied by the mixture representation of the weights $V_{n,k}$ in Proposition 1.

The $\sigma$-diversity can be seen as a richness measure that has been appropriately rescaled. Proposition 2 highlights the central role of the Dirichlet-multinomial, Dirichlet, and $\sigma$-stable pk processes among Gibbs-type priors. It shows that their parameters $H$, $\alpha$, and $\gamma$ are biodiversity indices. In these three processes, the $\sigma$-diversity is deterministic and assumed to be known. However, the $\sigma$-diversities $H$, $\alpha$, or $\gamma$ are typically unknown, and can be estimated by employing a prior distribution, leading to a Gibbs-type process for $\tilde{p}$ thanks to Proposition 1. In light of this, the posterior law of the $\sigma$-diversity is a key quantity for measuring biodiversity. In addition, the posterior distribution of $S_{\sigma}$ has an elegant connection with accumulation curves, as shown in the following theorem.

Theorem 1.

Let $X_{1},\dots,X_{n+m}$ be a sample from a Gibbs-type prior (3) with $K_{n+m}$ distinct values and let the function $c_{\sigma}(m)$ be defined as in Proposition 2. Then, as $m\rightarrow\infty$,

$$\left(\frac{K_{n+m}}{c_{\sigma}(m)}\mid X_{1},\dots,X_{n}\right)\overset{a.s.}{\longrightarrow}S_{\sigma}. \tag{9}$$

The random variable $S_{\sigma}$ is the $\sigma$-diversity and its distribution coincides with the posterior for $H$, $\alpha$ and $\gamma$, respectively. Moreover, $K_{n}=k$ is a sufficient statistic for $S_{\sigma}$ given the data $X_{1},\dots,X_{n}$.

Theorem 1 states that the posterior law of $S_{\sigma}$ coincides with the $\sigma$-diversity associated with the extrapolation of the accumulation curve. Related results for the normalized generalized gamma model are in Favaro et al. (2012). In practice, deriving the posterior distribution of the $\sigma$-diversities $H$, $\alpha$ and $\gamma$ is based on the Bayes theorem, where the eppf in (3) acts as the likelihood function. Notably, in Gibbs-type priors, the abundances $n_{1},\dots,n_{k}$ appearing in (3) do not provide insights about the diversity, and they do not appear in the posterior for $S_{\sigma}$ because the number of observed species $k$ is a sufficient statistic. This would not be the case for general species sampling models beyond the Gibbs type.

Example 6 (Dirichlet-multinomial, cont’d).

Bayesian inference for the richness $H$, i.e. the $\sigma<0$ case, is based on the posterior distribution

$$p(h\mid X_{1},\dots,X_{n})\propto p(h)\frac{|\sigma|^{k-1}\prod_{j=1}^{k-1}(h-j)}{(h|\sigma|+1)_{n-1}},\qquad h=k,k+1,\dots,$$

where $p(h)$ denotes a discrete prior distribution for $H\in\{1,2,\dots\}$. If the prior has bounded support, sampling from the posterior is trivial. In general, acceptance-rejection or truncation strategies can be considered. When $\sigma=-1$, shifted geometric and shifted Poisson prior specifications have been investigated in De Blasi et al. (2013), whereas Gnedin (2010) proposed a heavy-tailed prior distribution, which leads to a closed-form expression for the posterior distribution of $H$ and the weights $V_{n,k}$.
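As an illustration of the bounded-support case, the sketch below evaluates the posterior probability mass function of $H$ on a grid in log-space; the uniform prior on $\{k,\dots,h_{\max}\}$, the value of $|\sigma|$ and all numerical summaries are illustrative assumptions, not choices made in the paper.

```python
import numpy as np
from scipy.special import gammaln

def log_posterior_H(h_grid, n, k, sigma_abs, log_prior):
    """Unnormalized log-posterior of H over a grid of values h >= k (sigma < 0 case)."""
    h = np.asarray(h_grid, dtype=float)
    # log |sigma|^{k-1} + log prod_{j=1}^{k-1}(h - j) - log (h|sigma| + 1)_{n-1}
    log_lik = ((k - 1) * np.log(sigma_abs)
               + gammaln(h) - gammaln(h - k + 1)
               - (gammaln(h * sigma_abs + n) - gammaln(h * sigma_abs + 1)))
    return log_prior + log_lik

# Illustrative summaries and a uniform prior on {k, ..., h_max}
n, k, sigma_abs, h_max = 1_000, 150, 1.0, 2_000
h_grid = np.arange(k, h_max + 1)
lp = log_posterior_H(h_grid, n, k, sigma_abs, log_prior=np.zeros(len(h_grid)))
post = np.exp(lp - lp.max()); post /= post.sum()
print("posterior mean of H:", (h_grid * post).sum().round(1))
```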

Example 7 (Dirichlet process, cont'd).

Bayesian inference about the fundamental biodiversity number $\alpha$, i.e. the $\sigma=0$ case, is based on the posterior distribution

$$p(\mathrm{d}\alpha\mid X_{1},\dots,X_{n})\propto p(\mathrm{d}\alpha)\frac{\alpha^{k}}{(\alpha)_{n}},$$

where $p(\mathrm{d}\alpha)$ denotes the prior distribution for $\alpha$. As noted in Zito et al. (2023), this problem is equivalent to a logistic regression with an offset. The choice $\alpha\sim\text{Gamma}(a,b)$ has been advocated by Escobar and West (1995) because it leads to a semi-conjugate Gibbs-sampling scheme. More recently, Zito et al. (2024) proposed the (asymptotically equivalent) Stirling-gamma prior, which has the advantage of being highly interpretable, especially within the context of species sampling models.
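A minimal sketch of posterior inference for $\alpha$ is given below; rather than the Escobar–West Gibbs step, it simply evaluates the (unnormalized) posterior density on a grid, using the sample summaries $n$ and $k$ reported in Section 5.1 together with an illustrative vague Gamma prior; the grid range is also an assumption.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import gamma as gamma_prior

def log_posterior_alpha(alpha, n, k, a, b):
    """Unnormalized log-posterior: Gamma(a, b) prior (rate b) times alpha^k / (alpha)_n."""
    return (gamma_prior.logpdf(alpha, a, scale=1.0 / b)
            + k * np.log(alpha)
            - (gammaln(alpha + n) - gammaln(alpha)))

# Sample summaries from Section 5.1; the Gamma(1, 0.001) prior and the grid range are illustrative
n, k = 553_949, 4_962
grid = np.linspace(500.0, 1_100.0, 5_000)
lp = log_posterior_alpha(grid, n, k, a=1.0, b=0.001)
w = np.exp(lp - lp.max()); w /= w.sum()
print("posterior mean of alpha:", (grid * w).sum().round(1))
```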

Remark 2.

The discount parameter $\sigma$ characterizes the asymptotic behavior of the Gibbs partition. Although one could theoretically estimate $\sigma$ from the data using a prior distribution, leading to a species sampling model beyond the Gibbs type, we argue that the choice of $\sigma$ should be regarded as a model selection problem. This avoids difficulties in interpretation, since comparing $\sigma$-diversities across locations makes sense only if they are based on the same asymptotic regime. In practice, based on the data, $\sigma$ can be chosen from the following values: $\sigma=-1$ (Dirichlet distribution with uniform weights), $\sigma=0$ (Dirichlet process) and $\sigma=1/2$ (Aldous-Pitman process).

Remark 3.

Fisher's $\hat{\alpha}_{\textsc{f}}$ (Fisher et al., 1943) is an estimator for the parameter $\alpha$ of the Dirichlet process (Hubbell, 2001). We compare Fisher's $\hat{\alpha}_{\textsc{f}}$ with the maximum likelihood estimator $\hat{\alpha}_{\textsc{ml}}=\arg\max_{\alpha}\Pi_{n}(n_{1},\dots,n_{k}\mid\alpha)=\arg\max_{\alpha}\alpha^{k}/(\alpha)_{n}$ of the diversity, with these estimators solving

$$\hat{\alpha}_{\textsc{f}}\log\left(1+\frac{n}{\hat{\alpha}_{\textsc{f}}}\right)=k\qquad\text{ and }\qquad\sum_{i=1}^{n}\frac{\hat{\alpha}_{\textsc{ml}}}{\hat{\alpha}_{\textsc{ml}}+i-1}=k.$$

For large $n$, the two estimates are nearly identical, due to the inequalities $\alpha\log\left(1+n/\alpha\right)\leq\sum_{i=1}^{n}\alpha/(\alpha+i-1)\leq 1+\alpha\log\left(1+n/\alpha\right)$ (see, e.g., Ghosal and Van der Vaart, 2017, Proposition 4.8). This striking similarity between $\hat{\alpha}_{\textsc{f}}$ and $\hat{\alpha}_{\textsc{ml}}$ arises because the eppf of a Dirichlet process can be regarded as a conditional likelihood for Fisher's model, in which the sample size $n$ is fixed and not random, contrary to Fisher's original formulation; see McCullagh (2016) for a detailed historical account and further considerations. Hence, the posterior law of $\alpha$ provides a Bayesian analogue of Fisher's $\hat{\alpha}_{\textsc{f}}$. The Bayesian perspective not only establishes an insightful link between Fisher's $\hat{\alpha}_{\textsc{f}}$ and accumulation curves through Proposition 2 and Theorem 1, but also facilitates uncertainty quantification and testing.
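The two estimating equations can be solved numerically in a few lines; the sketch below uses SciPy's brentq root finder and the sample summaries $(n,k)$ of Section 5.1, purely for illustration.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

def fisher_alpha(n, k):
    """Solve alpha * log(1 + n / alpha) = k."""
    return brentq(lambda a: a * np.log1p(n / a) - k, 1e-8, 1e8)

def ml_alpha(n, k):
    """Solve sum_{i=1}^n alpha / (alpha + i - 1) = alpha * {psi(alpha + n) - psi(alpha)} = k."""
    return brentq(lambda a: a * (digamma(a + n) - digamma(a)) - k, 1e-8, 1e8)

n, k = 553_949, 4_962     # sample summaries reported in Section 5.1
print(round(fisher_alpha(n, k), 1), round(ml_alpha(n, k), 1))
```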

2.3 Relationship between $\sigma$-diversity and other biodiversity measures

We show that many classical biodiversity indices can be expressed in terms of $H$, $\alpha$, or $\gamma$, based on the connection with the accumulation curves discussed earlier. Consider the popular Simpson similarity index $\mathcal{S}$ (e.g., Colwell, 2009), which is the probability that two randomly chosen individuals belong to the same taxon. Within the framework of species sampling models, $\mathcal{S}:=\sum_{h=1}^{\infty}\pi_{h}^{2}$, and due to exchangeability $\mathds{E}(\mathcal{S})=\mathds{P}(X_{1}=X_{2})$. In Gibbs-type priors, it follows that $\mathds{E}(\mathcal{S})=V_{2,1}$. We can specialize this formula in the Dirichlet-multinomial and Dirichlet process cases, yielding respectively

$$\mathds{E}\{\mathcal{S}(\sigma,H)\mid H\}=V_{2,1}(\sigma,H)=\frac{1}{1+H|\sigma|},\qquad\mathds{E}\{\mathcal{S}(\alpha)\mid\alpha\}=V_{2,1}(\alpha)=\frac{1}{1+\alpha}.$$

Thus, there is an inverse relationship between the expected Simpson similarity and the $\sigma$-diversity. The following section shows a similar relationship in the polynomial regime, when $\sigma=1/2$. More generally, most biodiversity indices can be expressed as $\mathcal{I}_{g}:=\sum_{h=1}^{\infty}g(\pi_{h})$, for some function $g:(0,1)\rightarrow\mathds{R}^{+}$. For instance, $g(\pi)=\pi^{2}$ corresponds to the Simpson index, $g(\pi)=-\pi\log{\pi}$ is the Shannon diversity, and $g(\pi)=\pi^{\tau}$ for some $\tau>0$ is the (unnormalized) Tsallis or Hill diversity (Colwell, 2009; Magurran and McGill, 2011). The expectations $\mathds{E}(\mathcal{I}_{g})$ are a function of $H$, $\alpha$ and $\gamma$. Additionally, their relationship with $\mathcal{I}_{g}$ can often be explicitly determined: if the random probabilities $(\pi_{h})_{h\geq 1}$ are in size-biased order, then we have the simplification $\mathds{E}(\mathcal{I}_{g})=\mathds{E}\left\{\sum_{h=1}^{\infty}g(\pi_{h})\right\}=\mathds{E}\{g(\pi_{1})/\pi_{1}\}$. For example, in the dp case, the stick-breaking weights are in size-biased order, and we have $\pi_{1}\sim\text{Beta}(1,\alpha)$, which makes computations straightforward.
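As a quick numerical illustration of the size-biased simplification in the dp case, the following Monte Carlo sketch estimates $\mathds{E}(\mathcal{I}_{g})=\mathds{E}\{g(\pi_{1})/\pi_{1}\}$ with $\pi_{1}\sim\text{Beta}(1,\alpha)$; the choices of $g$ and $\alpha$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def expected_index_dp(g, alpha, n_sims=200_000, rng=rng):
    """Monte Carlo estimate of E(I_g) = E{g(pi_1) / pi_1} for a dp, with pi_1 ~ Beta(1, alpha)."""
    pi1 = rng.beta(1.0, alpha, size=n_sims)
    return np.mean(g(pi1) / pi1)

alpha = 750.0
simpson = expected_index_dp(lambda p: p**2, alpha)            # should be close to 1 / (1 + alpha)
shannon = expected_index_dp(lambda p: -p * np.log(p), alpha)  # expected Shannon diversity
print(round(simpson, 5), round(1.0 / (1.0 + alpha), 5), round(shannon, 3))
```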

2.4 Model validation

We propose two approaches to assess the fit of the model. The first approach compares the observed numbers of distinct values $K_{1},\dots,K_{n}$ with the model-based rarefaction curve $\mathds{E}(K_{1}),\dots,\mathds{E}(K_{n})$. As shown above, the expectations $\mathds{E}(K_{i})$ can often be computed explicitly, and are typically evaluated given an estimate for the diversity parameter, e.g. $\mathds{E}(K_{i}\mid\hat{\alpha}_{\textsc{ml}})$ in the $\sigma=0$ case. Due to exchangeability, the expectations $\mathds{E}(K_{i})$ do not depend on the order of the data. However, in most cases, the data $X_{1},\dots,X_{n}$ are not observed in a specific order. Thus, a common practice is to reshuffle the order of $X_{1},\dots,X_{n}$ and then consider the averages $\bar{K}_{1},\dots,\bar{K}_{n}$ over all possible permutations of the data. Conveniently, this combinatorial problem has an explicit solution, which is:

$$\bar{K}_{i}=k-\binom{n}{i}^{-1}\sum_{j=1}^{k}\binom{n-n_{j}}{i},\qquad i=1,\dots,n,$$

where $K_{n}=k$ and $n_{1},\dots,n_{k}$ are the abundances; see Smith and Grassle (1977) and Colwell et al. (2012). The values $\bar{K}_{i}$ define the “classical rarefaction”, which can be regarded as a frequentist nonparametric estimator arising from a multinomial model. Our Bayesian nonparametric approach instead depends on a few parameters, imposing some rigidity on the functional shape of the rarefaction, which is critical for extrapolation. Graphically comparing $\bar{K}_{1},\dots,\bar{K}_{n}$ with $\mathds{E}(K_{1}),\dots,\mathds{E}(K_{n})$ can give a sense of the suitability of the chosen model. We show a concrete example in Section 5.1.
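A numerically stable way to evaluate $\bar{K}_{i}$ is to work with log-binomial coefficients; the sketch below does so via gammaln, using a toy abundance vector that is not part of the paper's data.

```python
import numpy as np
from scipy.special import gammaln

def log_binom(n, r):
    return gammaln(n + 1) - gammaln(r + 1) - gammaln(n - r + 1)

def classical_rarefaction(abundances):
    """Compute K_bar_i = k - C(n, i)^{-1} sum_j C(n - n_j, i) for i = 1, ..., n."""
    nj = np.asarray(abundances, dtype=float)
    n, k = int(nj.sum()), len(nj)
    i = np.arange(1, n + 1)[:, None]          # subsample sizes, as a column
    # C(n - n_j, i) = 0 whenever i > n - n_j; gammaln of a non-positive integer is +inf,
    # so the corresponding term correctly evaluates to exp(-inf) = 0.
    terms = np.exp(log_binom(n - nj[None, :], i) - log_binom(n, i))
    return k - terms.sum(axis=1)

print(classical_rarefaction([10, 5, 2, 1, 1]))   # toy abundance vector with k = 5, n = 19
```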

The second model-checking approach is even simpler and has been used, e.g., in Favaro et al. (2021); see also Thisted and Efron (1987) for early ideas. The abundances $n_{1},\dots,n_{k}$ can be reformulated in terms of frequency counts $m_{1},\dots,m_{n}$, where $m_{r}$ is the number of taxa appearing with frequency $r$ in the sample. Hence, $m_{1}$ represents the number of singletons, $m_{2}$ the number of doubletons, and so on. We denote by $M_{1,n},\dots,M_{n,n}$ the associated random variables. To assess goodness of fit, we compare the empirical counts $m_{1},\dots,m_{n}$ with their model-based expectations $\mathds{E}(M_{1,n}),\dots,\mathds{E}(M_{n,n})$. In the Dirichlet process case, these expectations have a simple analytical formula

$$\mathds{E}(M_{r,n}\mid\alpha)=\alpha\frac{(\alpha)_{n-r}}{(\alpha)_{n}}\binom{n}{r}(r-1)!,\qquad r=1,\dots,n,$$

and therefore, when $r=1$, corresponding to the expected number of singletons implied by Hubbell (2001)'s theory, we get $\mathds{E}(M_{1,n}\mid\alpha)=n\alpha/(n+\alpha)=n\,\text{pr}(X_{n+1}=\text{“new”}\mid\alpha,X_{1},\dots,X_{n})$. General analytical formulas for the expectations $\mathds{E}(M_{r,n})$ for any prior of Gibbs type are given in Favaro et al. (2013). These expectations can also be approximated via Monte Carlo, by sampling $X_{1},\dots,X_{n}$ from the urn scheme in equation (4).
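The sketch below illustrates the Monte Carlo route for the dp case, simulating the urn scheme repeatedly and comparing the resulting average frequency counts with the closed-form expectations; the values of $\alpha$, $n$ and the number of replicates are illustrative.

```python
import numpy as np
from collections import Counter
from scipy.special import gammaln

rng = np.random.default_rng(2)

def expected_counts_mc(alpha, n, n_sims=2_000, rng=rng):
    """Monte Carlo estimate of E(M_{r,n} | alpha): average frequency counts over urn simulations."""
    em = np.zeros(n + 1)
    for _ in range(n_sims):
        labels = []
        for i in range(n):
            if rng.random() < alpha / (alpha + i):
                labels.append(len(set(labels)))          # new taxon
            else:
                labels.append(labels[rng.integers(i)])   # old taxon, prob. proportional to its abundance
        for r, m_r in Counter(Counter(labels).values()).items():
            em[r] += m_r
    return em / n_sims

def expected_counts_exact(alpha, n):
    """Closed-form E(M_{r,n} | alpha) for the Dirichlet process, r = 1, ..., n."""
    r = np.arange(1, n + 1)
    log_e = (np.log(alpha) + gammaln(alpha + n - r) - gammaln(alpha + n)
             + gammaln(n + 1) - gammaln(n - r + 1) - np.log(r))
    return np.exp(log_e)

alpha, n = 5.0, 200       # illustrative values
print(expected_counts_mc(alpha, n)[1:6].round(2))
print(expected_counts_exact(alpha, n)[:5].round(2))
```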

3 The Aldous-Pitman process ($\sigma=1/2$)

The Dirichlet-multinomial and Dirichlet process cases, corresponding to $\sigma\leq 0$, have been extensively investigated in Bayesian nonparametrics. There exist suitable priors and well-established estimation procedures for $H$ and $\alpha$, as discussed in Section 2. In contrast, relatively less attention has been devoted to the case $\sigma>0$, which corresponds to rapidly growing accumulation curves, with notable exceptions such as Lijoi et al. (2007b) and Favaro et al. (2009). The main mathematical challenge lies in the density $f_{\sigma}(\cdot)$ in equation (7), which lacks a simple analytical expression. However, in the $\sigma=1/2$ case, this density becomes $f_{1/2}(t)=(\sqrt{4\pi})^{-1}t^{-3/2}\exp\{-1/(4t)\}$, significantly simplifying the calculations. This leads to the definition of what we term the Aldous-Pitman process.

Definition 1 (Aldous-Pitman process).

Let $\gamma>0$, and let $P$ be a diffuse probability measure on $\mathds{X}$. Additionally, let $(Z_{n})_{n\geq 1}$ be iid samples from $P$ and $(Y_{n})_{n\geq 1}$ be iid samples from a standard Gaussian distribution. Define a species sampling model $\tilde{p}=\sum_{h=1}^{\infty}\pi_{h}\delta_{Z_{h}}$ where

$$\pi_{h}=R_{h-1}-R_{h},\qquad R_{h}=\frac{\gamma^{2}/2}{\gamma^{2}/2+\sum_{j=1}^{h}Y_{j}^{2}},\qquad h\geq 1,$$

and $R_{0}=1$. We will say that $\tilde{p}\mid\gamma$ follows an Aldous-Pitman process with parameters $\gamma$ and $P$.
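Definition 1 can be simulated directly by truncating the sequence of weights; the minimal sketch below does this, with the value of $\gamma$ and the truncation level chosen purely for illustration (the returned residual mass quantifies the truncation error).

```python
import numpy as np

rng = np.random.default_rng(3)

def aldous_pitman_weights(gamma, H_trunc=10_000, rng=rng):
    """Truncated simulation of Definition 1: pi_h = R_{h-1} - R_h with
    R_h = (gamma^2/2) / (gamma^2/2 + sum_{j<=h} Y_j^2), Y_j iid standard Gaussian, R_0 = 1."""
    y2 = rng.standard_normal(H_trunc) ** 2
    r = (gamma**2 / 2.0) / (gamma**2 / 2.0 + np.cumsum(y2))
    pi = np.concatenate(([1.0], r[:-1])) - r
    return pi, r[-1]                       # r[-1] is the residual mass beyond the truncation level

pi, residual = aldous_pitman_weights(gamma=2.0)
print(pi[:5].round(4), "residual mass:", round(float(residual), 6))
```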

The Aldous-Pitman process, introduced by Aldous and Pitman (1998), is a special case of the $\sigma$-stable pk process described in Example 3, when $\sigma=1/2$ (Pitman, 2003). Consequently, the Aldous-Pitman process results in a Gibbs partition, where the parameter $\gamma$ corresponds to the $\sigma$-diversity. Notably, the weights satisfy $\pi_{h}>\pi_{h-1}$, where $R_{h}=\sum_{\ell>h}\pi_{\ell}$ represents the cumulative probability of rare taxa, which is directly influenced by the $\sigma$-diversity parameter $\gamma$. Furthermore, the associated Gibbs-type coefficients $V_{n,k}$ can be explicitly obtained.

Example 8 (Aldous-Pitman process, case $\sigma=1/2$).

Suppose $\sigma=1/2$ and let $\gamma>0$. Then a valid set of Gibbs coefficients is given by

$$V_{n,k}(\gamma)=2^{n-k/2-1/2}\gamma^{k-1}h_{k+1-2n}(\sqrt{2}\gamma), \tag{10}$$

where $h_{\nu}(t)=\{2\Gamma(-\nu)\}^{-1}\sum_{h=0}^{\infty}\{(-t)^{h}/h!\}\,\Gamma\{(h-\nu)/2\}$ denotes the Hermite function of order $\nu\in\mathds{R}$ (Lebedev, 1965, §10.2). The corresponding process $\tilde{p}$ is an Aldous-Pitman process, and its $\sigma$-diversity is $\gamma$. Moreover, the associated urn scheme is as follows:

$$\mathds{P}(X_{n+1}\in\cdot\mid\gamma,X_{1},\dots,X_{n})=\sqrt{2}\gamma\frac{h_{k-2n}(\sqrt{2}\gamma)}{h_{k+1-2n}(\sqrt{2}\gamma)}P(\cdot)+2\frac{h_{k-2n-1}(\sqrt{2}\gamma)}{h_{k+1-2n}(\sqrt{2}\gamma)}\sum_{j=1}^{k}(n_{j}-1/2)\delta_{X^{*}_{j}}(\cdot).$$

It can be shown (Pitman, 2003) that the expected Simpson index of an Aldous-Pitman process is

$$\mathds{E}\{\mathcal{S}(\gamma)\mid\gamma\}=\mathds{P}(X_{1}=X_{2}\mid\gamma)=\mathds{E}\big(e^{-\gamma\sqrt{V}}\big),\qquad V\sim\text{Exp}(1),$$

revealing the close relationship between the $\sigma$-diversity $\gamma$ and the Simpson index.

The ap process serves as a building block for Gibbs-type priors with growth rate $\sqrt{n}$ and $\sigma$-diversity $\gamma$. So far, two specific priors for $\gamma$ have been investigated: the prior implicitly utilized in the Pitman-Yor process (Favaro et al., 2009) and the one implied by the normalized generalized gamma (ngg) process (Lijoi et al., 2007b). Both options yield tractable Gibbs coefficients $V_{n,k}$ for all values of $\sigma\in(0,1)$. Furthermore, when $\sigma=1/2$ these priors reduce to the following specifications

$$\gamma^{2}\sim\text{Gamma}(\theta+1/2,1/4)\quad\text{(Pitman-Yor)},\qquad\quad \gamma^{-2}\sim\text{Inverse-Gaussian}(1/\sqrt{2},\beta/\sqrt{2})\quad\text{(normalized inverse Gaussian)}, \tag{11}$$

where $\theta>-1/2$ and $\beta>0$ are hyperparameters, and the densities for $\gamma$ are given by $f_{\textsc{py}}(\gamma)=4^{-\theta}\Gamma(\theta+1/2)^{-1}\gamma^{2\theta}\exp\{-\gamma^{2}/4\}$ and $f_{\textsc{ig}}(\gamma)=(\sqrt{\pi})^{-1}\exp\{\beta-\beta^{2}/\gamma^{2}-\gamma^{2}/4\}$ for the Pitman-Yor and the inverse Gaussian case, respectively. These results can be deduced from Favaro et al. (2009) and Lijoi et al. (2005, 2007b). When $\sigma=1/2$, the ngg process is also known as the normalized inverse Gaussian process (Lijoi et al., 2005).

The prior distributions in (11) are somewhat restrictive since, having fixed the asymptotic regime to $\sigma=1/2$, a single parameter ($\theta$ or $\beta$) controls both the prior mean and the prior variance of the $\sigma$-diversity. Here, we consider a generic prior law $p(\mathrm{d}\gamma)$. The posterior for $\gamma$ takes the form $p(\mathrm{d}\gamma\mid X_{1},\dots,X_{n})\propto p(\mathrm{d}\gamma)V_{n,k}(\gamma)$, where $V_{n,k}(\gamma)$ is defined in (10). Naïve sampling algorithms for $p(\mathrm{d}\gamma\mid X_{1},\dots,X_{n})$ require evaluating the Hermite function $h_{\nu}(t)$, leading to numerical instabilities. For negative integers $\nu\in\{-1,-2,\dots\}$, the function $h_{\nu}(t)$ may be computed recursively using $h_{\nu+1}(t)=t h_{\nu}(t)-\nu h_{\nu-1}(t)$, with $h_{0}(t)=1$ and $h_{-1}(t)=\Phi(t)/\phi(t)$, where $\Phi(t)$ and $\phi(t)$ are the cumulative distribution function and the density of a standard Gaussian, respectively. Unfortunately, this recursion is only useful for relatively small values of $n$ and $k$. As a more robust alternative, we propose a data augmentation strategy arising from an integral representation of the Hermite function: $h_{\nu}(t)=\Gamma(-\nu)^{-1}\int_{0}^{\infty}e^{-u^{2}/2-tu}u^{-\nu-1}\mathrm{d}u$ for any $\nu<0$ (Lebedev, 1965, §10.5). We introduce a positive latent variable $U_{n,k}$, conditionally on which inference on $\gamma$ becomes straightforward and standard sampling strategies can be employed. For $n\geq 2$, we consider the following joint likelihood for $X_{1},\dots,X_{n}$ and $U_{n,k}$:

$$\Pi(n_{1},\dots,n_{k},u\mid\gamma)=\frac{2^{n-k/2-1/2}}{\Gamma(2n-k-1)}\gamma^{k-1}u^{2n-k-2}e^{-u^{2}/2-\sqrt{2}\gamma u}\prod_{j=1}^{k}(1/2)_{n_{j}-1}. \tag{12}$$

It is easy to check that $\int_{0}^{\infty}\Pi(n_{1},\dots,n_{k},u\mid\gamma)\mathrm{d}u=\Pi(n_{1},\dots,n_{k}\mid\gamma)$, the latter being the eppf of the Aldous-Pitman model. Although equation (12) is helpful for posterior inference under any prior choice, a particularly simple Gibbs sampling algorithm is available if we let $\gamma\sim\text{Gamma}(a_{\gamma},b_{\gamma})$, because the corresponding full conditional densities for $\gamma$ and $U_{n,k}$ are

$$f_{\gamma}(\gamma\mid-)\propto\gamma^{a_{\gamma}+k-2}e^{-(b_{\gamma}+\sqrt{2}u)\gamma},\qquad f_{U_{n,k}}(u\mid-)\propto u^{2n-k-2}e^{-(u/\sqrt{2}+\gamma)^{2}},$$

which means that $(\gamma\mid-)\sim\text{Gamma}(a_{\gamma}+k-1,b_{\gamma}+\sqrt{2}u)$ is conditionally conjugate. Additionally, the conditional distribution of $(U_{n,k}\mid-)$ belongs to the family of modified half-normal distributions (Sun et al., 2023), making it straightforward to simulate due to its log-concave density (Devroye, 1986). Alternatively, we can circumvent the need for mcmc. Through simple calculus, one can derive an explicit expression for the density of $U_{n,k}\mid X_{1},\dots,X_{n}$, which is $f_{U_{n,k}}(u\mid X_{1},\dots,X_{n})\propto u^{2n-k-2}e^{-u^{2}/2}(b_{\gamma}+\sqrt{2}u)^{-(a_{\gamma}+k-1)}$. This random variable is also easy to simulate, allowing for iid sampling from the posterior distribution of $\gamma\mid X_{1},\dots,X_{n}$ in two steps: first, we sample from $U_{n,k}\mid X_{1},\dots,X_{n}$, and then from $\gamma\mid U_{n,k},X_{1},\dots,X_{n}$.
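A minimal sketch of the data-augmentation Gibbs sampler is given below. Instead of an exact modified half-normal sampler, the full conditional of $U_{n,k}$ is drawn with a simple grid-based inverse-CDF step built around its mode (an implementation shortcut, not the authors' algorithm); the hyperparameters and the sample summaries $(n,k)$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_u(n, k, gamma, rng, grid_size=2_000):
    """Grid-based draw from f(u | -) proportional to u^{2n-k-2} exp{-(u/sqrt(2) + gamma)^2} (log-concave)."""
    m = 2 * n - k - 2
    mode = 0.5 * (-np.sqrt(2.0) * gamma + np.sqrt(2.0 * gamma**2 + 4.0 * m))
    sd = 1.0 / np.sqrt(m / mode**2 + 1.0)        # Laplace-style scale from the curvature at the mode
    u = np.linspace(max(mode - 10 * sd, 1e-8), mode + 10 * sd, grid_size)
    log_f = m * np.log(u) - (u / np.sqrt(2.0) + gamma) ** 2
    w = np.exp(log_f - log_f.max())
    return rng.choice(u, p=w / w.sum())

def gibbs_gamma(n, k, a_gamma=1.0, b_gamma=0.1, n_iter=2_000, rng=rng):
    """Alternate gamma | u ~ Gamma(a + k - 1, rate = b + sqrt(2) u) and u | gamma (grid step)."""
    gamma, draws = 1.0, []
    for _ in range(n_iter):
        u = sample_u(n, k, gamma, rng)
        gamma = rng.gamma(a_gamma + k - 1, 1.0 / (b_gamma + np.sqrt(2.0) * u))
        draws.append(gamma)
    return np.array(draws)

draws = gibbs_gamma(n=10_000, k=500)             # illustrative sample summaries
print(draws[500:].mean().round(3), np.quantile(draws[500:], [0.05, 0.95]).round(3))
```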

The data-augmentation strategy we just proposed has further applications. In fact, it can be verified that the predictive scheme of the Aldous-Pitman process, which enables the Monte Carlo approximation of the taxon accumulation curve, can also be expressed in terms of the latent variable $U_{n,k}$. The predictive distribution is given by:

$$\mathds{P}(X_{n+1}\in\cdot\mid\gamma,X_{1},\dots,X_{n})=\frac{\sqrt{2}\gamma}{2n-k-1}\mathds{E}(U_{n,k})P(\cdot)+\frac{\mathds{E}(U_{n,k}^{2})}{(n-k/2)(2n-k-1)}\sum_{j=1}^{k}(n_{j}-1/2)\delta_{X^{*}_{j}}(\cdot),$$

where the expectations are taken with respect to $f_{U_{n,k}}(u\mid-)\propto u^{2n-k-2}e^{-(u/\sqrt{2}+\gamma)^{2}}$. With this representation in hand, applying some probability calculus leads to a sampling procedure for $(X_{n+1}\mid X_{1},\dots,X_{n})$ described in Algorithm 1; see the Appendix for further details.

4 Taxonomic Gibbs-type priors

4.1 A modeling framework for taxonomic data

In this section, we move away from the classical species sampling models described in Section 2. Here, we consider a vector of labels $\bm{X}_{n}=(X_{n,1},\dots,X_{n,L})$ representing multiple taxonomic levels, where each $X_{n,\ell}\in\mathds{X}_{\ell}$ denotes the value of the $n$th statistical unit at the $\ell$th layer of the taxonomy. This richer data structure is used to provide a multifaceted description of biodiversity. We extend the enriched constructions of Wade et al. (2011) and Rigon et al. (2025), which are specific to $L=2$ and limited to Dirichlet or Pitman-Yor priors. Instead, we will rely on general Gibbs-type priors. None of these works aims to infer biodiversity, which is the main goal of this paper.

As before, we assume that $(\bm{X}_{n})_{n\geq 1}$ is an infinite sequence of exchangeable observations, implying

$$(X_{n,1},\dots,X_{n,L})\mid\tilde{q}\overset{\textup{iid}}{\sim}\tilde{q},\qquad n\geq 1,\qquad\quad \tilde{q}\sim\mathcal{Q}_{L}, \tag{13}$$

where $\tilde{q}$ is a random probability measure on the product space $\mathds{X}_{1}\times\cdots\times\mathds{X}_{L}$ and $\mathcal{Q}_{L}$ is its prior law. Directly using a dp for $\tilde{q}$ would be inappropriate as it disregards the hierarchical structure of the taxonomy. Here, we decompose $\tilde{q}$ in a Markovian fashion, that is, we assume

$$\tilde{q}(\mathrm{d}x_{1},\dots,\mathrm{d}x_{L})=\tilde{p}_{1}(\mathrm{d}x_{1})\tilde{p}_{2}(\mathrm{d}x_{2}\mid x_{1})\cdots\tilde{p}_{L}(\mathrm{d}x_{L}\mid x_{L-1}). \tag{14}$$

Each $\tilde{p}_{\ell}(\cdot\mid x_{\ell-1})$ represents a conditional random probability measure on $\mathds{X}_{\ell}$ that depends solely on the value of the parent level $x_{\ell-1}$. Thus, the prior distribution for $\tilde{q}(\mathrm{d}x_{1},\dots,\mathrm{d}x_{L})$ is induced by choosing suitable priors for $\tilde{p}_{1}$ and the $\tilde{p}_{\ell}(\cdot\mid x_{\ell-1})$. Let $\tilde{p}\sim\text{Gibbs}(S_{\sigma},\sigma;P)$ denote a Gibbs-type prior with deterministic $\sigma$-diversity $S_{\sigma}$ and weights $V_{n,k}(S_{\sigma})$, so that $\sigma=0$ corresponds to a Dirichlet process ($S_{\sigma}=\alpha$), whereas the cases $\sigma<0$ and $\sigma\in(0,1)$ correspond to a Dirichlet-multinomial ($S_{\sigma}=H$) and a $\sigma$-stable Poisson-Kingman ($S_{\sigma}=\gamma$) process, respectively. Moreover, let each $P_{\ell}$ be a diffuse probability measure encoding the taxa of the corresponding layer. A taxonomic Gibbs-type prior is then defined as follows

$$\tilde{p}_{1}\mid S_{\sigma_{1}}\sim\text{Gibbs}(S_{\sigma_{1}},\sigma_{1};P_{1}),\qquad \tilde{p}_{\ell}(\cdot\mid x_{\ell-1})\mid S_{\sigma_{\ell}}(x_{\ell-1})\overset{\text{ind}}{\sim}\text{Gibbs}(S_{\sigma_{\ell}}(x_{\ell-1}),\sigma_{\ell};P_{\ell}),\qquad \ell=2,\dots,L. \tag{15}$$

There are potentially infinitely many $\tilde{p}_{\ell}(\cdot\mid x_{\ell-1})$, that is, one for each label $x_{\ell-1}$, and they are independent among themselves. In addition, there are no shared values between observations belonging to different parents, i.e. the same species cannot belong to multiple genera. A crucial assumption of (15) is that the values $\sigma_{\ell}$, governing the asymptotic behavior, do not depend on $x_{\ell-1}$, albeit they are allowed to change across layers $\ell=1,\dots,L$. Thus, the conditional $\sigma$-diversities $S_{\sigma_{\ell}}(x_{\ell-1})$ are comparable within the same layer $\ell$, because they are expressed on the same scale. If this were not the case, the interpretation of the diversities would be problematic.

Figure 2: A realization from a taxonomic Gibbs-type prior. In this example, the sample size is $n=20$, and there are $k_{1}=2$ distinct families, $k_{2}=5$ genera and $k_{3}=8$ species. The frequencies $n_{j,\ell}$ associated with the distinct taxa $X^{*}_{j,\ell}$ are indicated at the bottom of each circle. Grey circles denote the possibility of discovering new taxa.

The model outlined in equations (13)-(15) admits a nested representation, depicted in Figure 2, which better clarifies the role of each $\tilde{p}_{\ell}(\cdot\mid x_{\ell-1})$. The observations can be sampled as follows:

$$X_{n,1}\mid\tilde{p}_{1}\overset{\text{iid}}{\sim}\tilde{p}_{1},\qquad X_{n,\ell}\mid X_{n,\ell-1}=x_{\ell-1},\>\tilde{p}_{\ell}(\cdot\mid x_{\ell-1})\overset{\text{ind}}{\sim}\tilde{p}_{\ell}(\cdot\mid x_{\ell-1}),$$

for $\ell=2,\dots,L$ and $n\geq 1$. The data of the first layer $(X_{n,1})_{n\geq 1}$ follow the same Gibbs-type mechanism that has been extensively described in Section 2. In the subsequent layers, say the $\ell$th, the observations $X_{n,\ell}$ are grouped according to the parent's value $X_{n,\ell-1}=x_{\ell-1}$. The data $X_{n,\ell}$ with the same parent are iid samples from $\tilde{p}_{\ell}(\cdot\mid x_{\ell-1})$, which, in turn, is distributed as a Gibbs-type process with weights $V_{n,k}\{S_{\sigma_{\ell}}(x_{\ell-1})\}$. This representation leads to a predictive mechanism that extends (4). The discreteness of each $\tilde{p}_{\ell}$ implies that there will be ties in the realization of $\bm{X}_{1},\dots,\bm{X}_{n}$ with positive probability. At the $\ell$th layer, there will be $K_{n,\ell}=k_{\ell}$ distinct taxa with labels $X^{*}_{1,\ell},\dots,X^{*}_{k_{\ell},\ell}$ and frequencies $n_{1,\ell},\dots,n_{k_{\ell},\ell}$, which may be grouped according to their parent. Let $n(x_{\ell-1})$ be the number of observations generated by the parent $x_{\ell-1}$, let $k_{\ell}(x_{\ell-1})$ be the number of distinct labels generated by $x_{\ell-1}$, and define the set $C(x_{\ell-1})=\{j:\text{ label }X^{*}_{j,\ell}\text{ is a child of }x_{\ell-1}\}$, so that $k_{\ell}(x_{\ell-1})=|C(x_{\ell-1})|$ and $n(x_{\ell-1})=\sum_{j\in C(x_{\ell-1})}n_{j,\ell}$. The predictive law is presented in the following proposition.

Proposition 3.

Suppose $(\bm{X}_{n})_{n\geq 1}$ is an exchangeable sequence as in equation (13), with $\tilde{q}$ distributed as in (15). Then, for any $n\geq 1$,

$$\mathds{P}(X_{n+1,1}\in\cdot\mid S_{\sigma_{1}},\bm{X}_{1},\dots,\bm{X}_{n})=\frac{V_{n+1,k_{1}+1}(S_{\sigma_{1}})}{V_{n,k_{1}}(S_{\sigma_{1}})}P_{1}(\cdot)+\frac{V_{n+1,k_{1}}(S_{\sigma_{1}})}{V_{n,k_{1}}(S_{\sigma_{1}})}\sum_{j=1}^{k_{1}}(n_{j,1}-\sigma_{1})\delta_{X^{*}_{j,1}}(\cdot).$$

Moreover, for $\ell=2,\dots,L$ and $n\geq 1$,

$$\begin{aligned}
&\mathds{P}(X_{n+1,\ell}\in\cdot\mid X_{n+1,\ell-1}=x_{\ell-1},S_{\sigma_{\ell}}(x_{\ell-1}),\bm{X}_{1},\dots,\bm{X}_{n})=\\
&\quad=\frac{V_{n(x_{\ell-1})+1,k_{\ell}(x_{\ell-1})+1}\{S_{\sigma_{\ell}}(x_{\ell-1})\}}{V_{n(x_{\ell-1}),k_{\ell}(x_{\ell-1})}\{S_{\sigma_{\ell}}(x_{\ell-1})\}}P_{\ell}(\cdot)+\frac{V_{n(x_{\ell-1})+1,k_{\ell}(x_{\ell-1})}\{S_{\sigma_{\ell}}(x_{\ell-1})\}}{V_{n(x_{\ell-1}),k_{\ell}(x_{\ell-1})}\{S_{\sigma_{\ell}}(x_{\ell-1})\}}\sum_{j\in C(x_{\ell-1})}(n_{j,\ell}-\sigma_{\ell})\delta_{X^{*}_{j,\ell}}(\cdot).
\end{aligned}$$

Proposition 3 expresses the learning mechanism in full generality, and it highlights the nested structure induced by taxonomic Gibbs-type priors. For specific choices of $\sigma_{\ell}$, the predictive law can be specialized by replacing the $V_{n,k}\{S_{\sigma_{\ell}}(x_{\ell-1})\}$'s with their specific values. For instance, if $\sigma_{\ell}=0$ for $\ell=2,\dots,L$, we get the following formula

$$\begin{aligned}
&\mathds{P}(X_{n+1,\ell}\in\cdot\mid X_{n+1,\ell-1}=x_{\ell-1},\alpha(x_{\ell-1}),\bm{X}_{1},\dots,\bm{X}_{n})=\\
&\qquad\qquad=\frac{\alpha(x_{\ell-1})}{n(x_{\ell-1})+\alpha(x_{\ell-1})}P_{\ell}(\cdot)+\frac{1}{n(x_{\ell-1})+\alpha(x_{\ell-1})}\sum_{j\in C(x_{\ell-1})}n_{j,\ell}\delta_{X^{*}_{j,\ell}}(\cdot),
\end{aligned}$$

which is a taxon-specific urn scheme. The diversity $\alpha(x_{\ell-1})$ refers to observations whose parent is $x_{\ell-1}$. The nested predictive mechanism described in Proposition 3 essentially states that (i) the data $X_{1,1},\dots,X_{n,1}$ in the first layer of the taxonomy follow a Gibbs-type process; (ii) the data $X_{1,\ell},\dots,X_{n,\ell}$ in the subsequent branches follow independent predictive mechanisms that are specific to the value $x_{\ell-1}$ of the parent level. As depicted in Figure 2, if a new taxon is observed at a certain layer of the taxonomy, it leads to the discovery of new values at all the children levels.
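To illustrate the nested sampling mechanism and the taxon-specific urn above, the following minimal sketch simulates taxonomic observations assuming $\sigma_{\ell}=0$ at every layer (Dirichlet-process urns) and, for simplicity, a single illustrative diversity value per layer rather than parent-specific diversities; the integer labels stand in for taxon names.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(5)

def crp_step(child_counts, alpha, rng):
    """One draw from a Dirichlet-process urn given the current child counts of a parent."""
    n_tot = sum(child_counts.values())
    if rng.random() < alpha / (alpha + n_tot):
        return len(child_counts)                              # a brand new child label
    labels, freqs = zip(*child_counts.items())
    return labels[rng.choice(len(labels), p=np.array(freqs) / n_tot)]

def taxonomic_sample(n, alphas, rng=rng):
    """Simulate n vectors (X_{n,1}, ..., X_{n,L}); alphas[l] is the diversity used at layer l + 1."""
    L = len(alphas)
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(L)]   # layer -> parent -> child counts
    data = []
    for _ in range(n):
        path, parent = [], "root"
        for level, alpha in enumerate(alphas):
            child = crp_step(counts[level][parent], alpha, rng)
            counts[level][parent][child] += 1
            path.append(child)
            parent = (parent, child)          # children are nested within their own parent
        data.append(tuple(path))
    return data

data = taxonomic_sample(n=20, alphas=[1.0, 2.0, 3.0])    # e.g. families, genera, species
print(len({d[0] for d in data}), "families,",
      len({d[:2] for d in data}), "genera,",
      len(set(data)), "species")
```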

4.2 Estimation of conditional biodiversity

Compared to classical species sampling models, taxonomic Gibbs-type priors allow us to describe the conditional biodiversities $S_{\sigma_{\ell}}(x)$ of specific branches of the taxonomy. The first layer is modeled as a classical Gibbs-type prior, with a single $\sigma$-diversity $S_{\sigma_{1}}$, a case that has been extensively discussed in Section 2. Conversely, at the generic $\ell$th layer there are, potentially, infinitely many $\sigma$-diversities $S_{\sigma_{\ell}}(x)$, one for each value of $x\in\mathds{X}_{\ell}$. So far, we have treated the diversities $S_{\sigma_{\ell}}(x)$ as deterministic, but in practice they must be learned from the data, and this requires careful prior elicitation. A sample $\bm{X}_{1},\dots,\bm{X}_{n}$ provides information only about the diversities $S_{\sigma_{\ell}}(X^{*}_{1,\ell-1}),\dots,S_{\sigma_{\ell}}(X^{*}_{k_{\ell-1},\ell-1})$ associated with the observed taxa $X^{*}_{j,\ell-1}$, for $\ell=2,\dots,L$. Moreover, from the independence of the random conditional distributions $\tilde{p}_{\ell}(\cdot\mid x_{\ell-1})$, the likelihood function of a taxonomic Gibbs-type model factorizes as follows

(𝑿1,,𝑿n𝑺)Vn,k1(Sσ1)=2Lj=1k1Vn(Xj,1),k(Xj,1){Sσ(Xj,1)},\mathscr{L}(\bm{X}_{1},\dots,\bm{X}_{n}\mid\bm{S})\propto V_{n,k_{1}}(S_{\sigma_{1}})\prod_{\ell=2}^{L}\prod_{j=1}^{k_{\ell-1}}V_{n(X^{*}_{j,\ell-1}),k(X^{*}_{j,\ell-1})}\{S_{\sigma_{\ell}}(X^{*}_{j,\ell-1})\}, (16)

where 𝑺=(Sσ1,Sσ2(X1,1),,Sσ(Xj,1),,SσL(XkL1,L1))\bm{S}=(S_{\sigma_{1}},S_{\sigma_{2}}(X^{*}_{1,1}),\dots,S_{\sigma_{\ell}}(X^{*}_{j,\ell-1}),\dots,S_{\sigma_{L}}(X^{*}_{k_{L-1},L-1})). The likelihood function underlines that the data alone are not informative about the unobserved branches of the taxonomy, i.e., about Sσ(x)S_{\sigma_{\ell}}(x) whenever the taxon x𝕏x\in\mathds{X}_{\ell} has not been observed in the sample.

We offer some suggestions on modeling the infinite collection of conditional diversities Sσ(x)S_{\sigma_{\ell}}(x). A simple approach assumes that the Sσ(x)S_{\sigma_{\ell}}(x)’s are iid samples from a layer-specific prior law p~\tilde{p}_{\ell}, for any x𝕏x\in\mathds{X}_{\ell}, independently over =2,,L\ell=2,\dots,L. This is computationally convenient because the posterior distributions of the conditional σ\sigma-diversities are independent due to the factorized likelihood in (16), and they can be inferred by applying the tools of Section 2 for each Sσ(x)S_{\sigma_{\ell}}(x). However, the posterior law of Sσ(x)S_{\sigma_{\ell}}(x) for any unobserved taxon x𝕏x\in\mathds{X}_{\ell} coincides with its prior distribution because there is no information in the likelihood. Hence, it is appealing to borrow information across conditional diversities belonging to the same layer through hierarchical modelling. For example, when σ=1/2\sigma_{\ell}=1/2 we may specify Sσ(x)a,biidGamma(a,b)S_{\sigma_{\ell}}(x)\mid a_{\ell},b_{\ell}\overset{\textup{iid}}{\sim}\text{Gamma}(a_{\ell},b_{\ell}), where each (a,b)(a_{\ell},b_{\ell}) follows a hyperprior. Such a specification borrows strength among the diversities within the same layer \ell, but not across. The factorized structure of (16) ensures that mcmc algorithms can be carried out separately for each layer. We will consider an example in Section 5.2. More sophisticated approaches may induce dependence across layers, for example, incorporating phylogenetic information, but we do not pursue this here.

Remark 4.

Marginal and conditional σ\sigma-diversities provide two different, but complementary, summaries of a complex phenomenon. Conditional diversities strongly depend on the chosen taxonomic structure. Potentially, the models developed in this article could be used in concert with models characterizing variation in genetic sequences and morphology data for automatic taxonomic classification of samples, allowing for the discovery of new taxa, in a manner related to Zito et al. (2023).

5 The tree flora in the Amazonian Basin

5.1 The estimation of σ\sigma-diversity

We consider the comprehensive tree dataset from the Amazon Basin and the Guiana Shield (Amazonia) provided by ter Steege et al. (2013), which is openly accessible online. This dataset integrates multiple sources and includes 11701170 sampling plots well distributed among regions and forest types of the Amazon Basin, whose locations are shown in Figure 1(a). The immense size of the Amazon Basin has restricted research on its tree communities to local and regional scales. Scientists still lack a complete understanding of the number of species in the Amazon, as well as their abundance and rarity (Hubbell et al., 2008; ter Steege et al., 2013), which leaves the world’s largest tropical carbon reserve mostly unknown to ecologists. The plots included a total of k=4,962k=4,962 tree species, comprising 747747 genera and 115115 families, based on n=553,949n=553,949 observations. In ter Steege et al. (2013), Fisher’s log-series model was applied to estimate the total number of species, yielding an estimate of approximately 15,00015,000 tree species. About 200200 of these species were classified as hyperdominant, which means they are so common that, collectively, they represent half of all trees in the Amazon. These estimates suggest that there may be over 10,00010,000 rare and poorly known species in the Amazon that could be at risk. We aim to provide a more refined statistical treatment of these data based on a Bayesian nonparametric analysis of the σ\sigma-diversity. In practice, our model-based approach allows us to: (i) check the validity of the underlying assumptions of the model; (ii) provide formal tools for uncertainty quantification even in the presence of model misspecification; (iii) tie together multiple aspects of the analysis within a unified model, making it clear and coherent.

Figure 3: Rarefaction curves for the Amazonian tree species dataset from ter Steege et al. (2013). Black dots represent the “classical rarefaction” values K¯1,,K¯n\bar{K}_{1},\dots,\bar{K}_{n}, as described in Section 2.4; for clarity, only a subset of points is shown. The orange dashed line represents the model-based rarefaction 𝔼(Knα^ml)=i=1nα^ml/(α^ml+i1)\mathds{E}(K_{n}\mid\hat{\alpha}_{\textsc{ml}})=\sum_{i=1}^{n}\hat{\alpha}_{\textsc{ml}}/(\hat{\alpha}_{\textsc{ml}}+i-1) from a Dirichlet process model.
Figure 4: Frequency of abundance plot for the Amazonian tree dataset from ter Steege et al. (2013). Black dots represent the observed frequency counts mrm_{r} for sizes r=1,,nr=1,\dots,n, where mrm_{r} denotes the number of tree species that appear with frequency rr in the observed sample. The orange dashed line shows the model-based expectations 𝔼(Mr,nα^ml)\mathds{E}(M_{r,n}\mid\hat{\alpha}_{\textsc{ml}}) for r=1,,nr=1,\dots,n, with the formula provided in Section 2.4. Both axes are displayed on a log scale for clarity.
Figure 5: Rank in abundance diagram (rad) for the Amazonian tree dataset from ter Steege et al. (2013). Black dots represent the observed abundances n1,,nkn_{1},\dots,n_{k}, ranked from the most common to the least common, in log scale. The dashed orange line represents the expected abundances from a Dirichlet process model with parameter α^ml\hat{\alpha}_{\textsc{ml}}, ranked from the most common to the least common. Model-based values are obtained via Monte Carlo by sampling from the Blackwell and MacQueen (1973) urn scheme with parameter α^ml\hat{\alpha}_{\textsc{ml}} and averaging the ranked abundances over 1,0001,000 replicates.

The observations X1,,XnX_{1},\dots,X_{n} represent tree species and are assumed to be exchangeable under a Gibbs-type species sampling model. In particular, we will show that it is reasonable to assume σ=0\sigma=0, leading to the following hierarchical model:

Xip~\displaystyle X_{i}\mid\tilde{p} iidp~,\displaystyle\overset{\text{iid}}{\sim}\tilde{p},\qquad i1,\displaystyle i\geq 1,
p~α\displaystyle\tilde{p}\mid\alpha dp(αP),\displaystyle\sim\textsc{dp}(\alpha P),\qquad αp(dα).\displaystyle\alpha\sim p(\mathrm{d}\alpha).

Recall that this specification is equivalent to Fisher’s log-series model. As a preliminary step, we obtain a maximum likelihood estimate for the fundamental biodiversity number of α^ml=751.23\hat{\alpha}_{\textsc{ml}}=751.23 and a Fisher’s estimate of α^f=751.32\hat{\alpha}_{\textsc{f}}=751.32. The former was calculated using the BNPvegan R package (Zito et al., 2023), while the latter was obtained through the vegan R package (Oksanen et al., 2024). As previously mentioned, these two estimates are asymptotically equivalent and nearly indistinguishable for large nn. To assess the validity of the Dirichlet process specification, we follow the guidelines outlined in Section 2.4 and consider the quantities 𝔼(Knα^ml)\mathds{E}(K_{n}\mid\hat{\alpha}_{\textsc{ml}}) and 𝔼(Mr,nα^ml)\mathds{E}(M_{r,n}\mid\hat{\alpha}_{\textsc{ml}}). We then compare these expectations with their empirical counterparts, as displayed in Figure 3 (rarefaction) and Figure 4 (frequency of abundance). Both plots lead to similar conclusions: although the fit is not perfect, the predictions are remarkably close to the observed data. Indeed, the growth rate for the number of distinct species in Figure 3 looks roughly logarithmic, supporting the choice σ=0\sigma=0. At the same time, there are mild discrepancies between the data and the model predictions: the empirical rarefaction curve has a slightly higher curvature than the model fit. Moreover, Figure 4 shows that the Dirichlet process model slightly overestimates the number of singletons: in particular, we obtain 𝔼(M1,nα^ml)=750.22\mathds{E}(M_{1,n}\mid\hat{\alpha}_{\textsc{ml}})=750.22, whereas the observed value is m1=645m_{1}=645. Finally, we show in Figure 5 the rank in abundance diagram (rad), which is the main graphical tool considered in ter Steege et al. (2013) to validate the adequacy of Fisher’s log-series model. Once again, we conclude that the fit is quite good.
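For readers who wish to reproduce these checks without the packages mentioned above, the sketch below relies only on base R and on the standard Dirichlet-process (Ewens sampling formula) expressions for the expected number of distinct species and for the expected frequency-of-abundance counts; the maximum likelihood estimate of the precision parameter solves the moment equation in which the observed number of species is matched to its expectation. The grid of sample sizes is an arbitrary choice made for plotting purposes.

# Model checks for the Dirichlet process fit, using only base R.
# Under a Dirichlet process with precision alpha:
#   E(K_n | alpha)     = sum_{i=1}^n alpha / (alpha + i - 1)
#                      = alpha * (digamma(alpha + n) - digamma(alpha)),
#   E(M_{r,n} | alpha) = (alpha / r) * n! / (n - r)! * Gamma(alpha + n - r) / Gamma(alpha + n),
# the standard Ewens-sampling-formula expressions. k and n are the values reported above.
n <- 553949
k <- 4962

expected_Kn <- function(alpha, n) alpha * (digamma(alpha + n) - digamma(alpha))

# Maximum likelihood estimate: the value of alpha solving k = E(K_n | alpha).
alpha_ml <- uniroot(function(a) expected_Kn(a, n) - k,
                    interval = c(1e-3, 1e5), tol = 1e-8)$root

# Model-based rarefaction curve E(K_i | alpha_ml) on a grid of sample sizes.
size_grid   <- round(seq(1, n, length.out = 200))
rarefaction <- expected_Kn(alpha_ml, size_grid)

# Expected frequency-of-abundance counts, computed on the log scale to avoid overflow.
expected_Mrn <- function(alpha, n, r) {
  exp(log(alpha) - log(r) + lgamma(n + 1) - lgamma(n - r + 1) +
        lgamma(alpha + n - r) - lgamma(alpha + n))
}

c(alpha_ml = alpha_ml, expected_singletons = expected_Mrn(alpha_ml, n, 1))

The two returned values should be close to the estimate and to the expected number of singletons reported above, while the vectors size_grid and rarefaction reproduce the dashed curve of Figure 3.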

Despite its good predictive performance, the Dirichlet process is likely misspecified in this context. The data originate from multiple studies, each examining different regions and forests, which makes the exchangeability assumption for (Xn)n1(X_{n})_{n\geq 1} unlikely to hold. Supporting this concern, classical nonparametric and frequentist estimators such as those by Chao (1984), Chao and Lee (1992), and Chao and Bunge (2002) yield predictions, with associated confidence intervals, between 5,000 and 6,000 for the total number of distinct species. These values were computed using the SPECIES R package (Wang, 2011). However, ter Steege et al. (2013) criticized these predictions, labeling them as “severe underestimations”, due to sharp disagreement with previous studies and expert assessments. This failure of standard nonparametric methods is not unusual in ecological studies (Brose et al., 2003), where assumptions often do not hold. Such estimators typically perform well in localized settings, but here we are considering the entire Amazon Basin. In contrast, the inherent “rigidity” of the Dirichlet process model enforces a logarithmic growth rate, which helps to avoid overfitting to the observed data, thus potentially mitigating the effects of misspecification.

With these considerations in mind, we proceed to conduct a full Bayesian analysis, accounting for potential model misspecification through the use of coarsened posteriors (Miller and Dunson, 2019). In practice, this is approximately achieved by raising the likelihood function to a power ρ(0,1]\rho\in(0,1], which effectively deflates the sample size and increases the uncertainty; taking ρ=1\rho=1 recovers the standard posterior. For our analysis, the coarsened posterior distribution for α\alpha becomes:

pρ(dαX1,,Xn)p(dα)[αk(α)n]ρ.p_{\rho}(\mathrm{d}\alpha\mid X_{1},\dots,X_{n})\propto p(\mathrm{d}\alpha)\left[\frac{\alpha^{k}}{(\alpha)_{n}}\right]^{\rho}.

A natural prior choice is αsg(a,b,n)\alpha\sim\textsc{sg}(a,b,n), the Stirling-gamma distribution of Zito et al. (2024) with parameters (a,b,n)(a,b,n), where 1a/bn1\leq a/b\leq n and a,b>0a,b>0. This prior interacts well with coarsened posteriors, remaining conjugate even after tempering the likelihood. Specifically, under a Stirling-gamma prior, the coarsened posterior is given by pρ(dαX1,,Xn)αa+ρk/{(α)n}b+ρp_{\rho}(\mathrm{d}\alpha\mid X_{1},\dots,X_{n})\propto\alpha^{a+\rho k}/\{(\alpha)_{n}\}^{b+\rho}, resulting in:

(αX1,,Xn)sg(a+ρk,b+ρ,n).(\alpha\mid X_{1},\dots,X_{n})\sim\textsc{sg}(a+\rho k,b+\rho,n).

Here, a/ba/b serves as the location of the prior distribution, while bb acts as a precision (Zito et al., 2024). In particular, we have 𝔼(Kn)=a/b\mathds{E}(K_{n})=a/b, the prior estimate of the number of distinct species in a sample of size nn. In our analysis, we set a/b=5,000a/b=5,000 and b=0.002b=0.002, implying a=10a=10, thus providing a non-informative prior centered on a plausible guess. Thanks to conjugacy, i.i.d. sampling from the coarsened posterior is straightforward through the algorithm of Zito et al. (2024).
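Readers without access to the Stirling-gamma sampler can still explore the coarsened posterior through a crude grid approximation of the density displayed above; the sketch below does exactly this, using the sample size, number of observed species, coarsening level, and prior settings reported in this section. It is only a stand-in for the exact i.i.d. algorithm of Zito et al. (2024), and the grid bounds are arbitrary.

# Grid approximation to the coarsened posterior of alpha,
#   p_rho(alpha | data) proportional to alpha^(a + rho * k) / {(alpha)_n}^(b + rho),
# used as a simple stand-in for the exact Stirling-gamma sampler.
n   <- 553949
k   <- 4962
rho <- 0.01
b   <- 0.002
a   <- 5000 * b                        # prior location a / b = 5,000

log_pochhammer <- function(alpha, m) lgamma(alpha + m) - lgamma(alpha)

alpha_grid <- seq(100, 2500, by = 0.5)
log_post   <- (a + rho * k) * log(alpha_grid) -
              (b + rho) * log_pochhammer(alpha_grid, n)
post <- exp(log_post - max(log_post))  # stabilize before normalizing
post <- post / sum(post)

# Approximate posterior draws and summaries.
draws <- sample(alpha_grid, size = 1e5, replace = TRUE, prob = post)
c(mean = mean(draws),
  quantile(draws, c(0.01, 0.99)),      # endpoints of a 98% credible interval
  simpson = mean(1 / (1 + draws)))     # posterior mean of the Simpson diversity

The same draws can be transformed into any functional of alpha, in the spirit of the Simpson and Shannon summaries discussed below.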

ρ\rho Equivalent sample size (nρn\rho) 1% 25% 50% Mean 75% 99%
Fundamental biodiversity number α\alpha
1 553,949 725 743 751 751 759 779
0.25 138,487 699 736 751 751 767 806
0.1 55,395 669 726 751 751 776 839
0.01 5,539 514 673 747 753 827 1,048
0.001 554 208 517 713 766 956 1,792
Total number of tree species KNK_{N}
1 553,949 14,378 14,841 15,065 15,051 15,267 15,678
0.25 138,487 14,139 14,777 15,052 15,052 15,327 15,976
0.1 55,395 13,814 14,675 15,045 15,054 15,422 16,371
0.01 5,539 11,824 13,981 14,990 15,077 16,077 19,097
0.001 554 7,752 11,906 14,533 15,246 17,800 29,058
Table 1: Posterior mean and quantiles for α\alpha and the total number of tree species KNK_{N}, for various choices of the coarsening parameter ρ\rho, for the Amazonian tree dataset from ter Steege et al. (2013). These values are based on 10610^{6} Monte Carlo replicates using the Stirling-gamma sampling algorithm of Zito et al. (2024) for α\alpha and a Poisson approximation for KNK_{N} (see the main text).

We report the results in Table 1, which shows posterior means and quantiles for α\alpha under various choices of the tempering parameter ρ{0.001,0.01,0.1,0.25,1}\rho\in\{0.001,0.01,0.1,0.25,1\}. Choosing ρ\rho is challenging, as it requires accurately quantifying the degree of misspecification. Table 1 also includes the equivalent sample size nρn\rho; we note that, except for the extreme case of ρ=0.001\rho=0.001, the posterior means and medians remain stable, approximately matching the maximum likelihood estimate of 751751. The value of ρ\rho has a substantial impact on posterior variability, with smaller ρ\rho values representing conservative choices that lead to greater uncertainty. Assuming the model is correctly specified, the 98%98\% credible interval for α\alpha is (725,779)(725,779). However, with a moderate level of coarsening, say ρ=0.01\rho=0.01, the 98% credible interval broadens to (514,1048)(514,1048). Notably, the independent estimate α^=743\hat{\alpha}=743 for the Amazon Basin from Hubbell et al. (2008) lies within both intervals. We provide graphical support for ρ=0.01\rho=0.01 in the Supplementary Material, using an informal elbow rule to identify the smallest ρ\rho that does not lead to strong incompatibilities with the observed data, as measured by the likelihood function (Miller and Dunson, 2019). Figure 1(b) displays the coarsened posterior for α\alpha with ρ=0.01\rho=0.01. The posterior distribution of α\alpha, the fundamental biodiversity number, encapsulates various aspects of the data. For example, for large nn, the expected number of singletons is 𝔼(M1,nα)α\mathds{E}(M_{1,n}\mid\alpha)\approx\alpha, so the interval (514,1048)(514,1048) also serves as an estimate of the number of singletons. This further supports the choice ρ=0.01\rho=0.01, given that the observed number of singletons is m1=645m_{1}=645, which lies well outside the credible interval of the standard posterior. As detailed in Section 2.3, other biodiversity measures can be derived from the posterior distribution of α\alpha. For instance, the posterior mean for the Simpson diversity, (X1=X2)=1/(1+α)\mathds{P}(X_{1}=X_{2})=1/(1+\alpha), is 0.001360.00136, while the posterior mean for Shannon diversity is 7.18847.1884.

Figure 6: Coarsened posterior distribution (ρ=0.01\rho=0.01) of the total number of distinct tree species KNK_{N}, using the Amazonian tree dataset of ter Steege et al. (2013). The dotted lines represent 98%98\% credible intervals. The dashed line is the posterior mean. These values are based on 10610^{6} Monte Carlo replicates using the Stirling-gamma sampling algorithm of Zito et al. (2024) combined with a Poisson approximation for KNK_{N} (see the main text).

The most interesting quantity we wish to estimate is arguably the total number of tree species in the Amazon Basin. In ter Steege et al. (2013), this is estimated using Fisher’s log-series model in a relatively ad hoc manner, by extrapolating the rad shown in Figure 5 to obtain approximately 15,000 to 16,000 tree species, depending on the extrapolation method used. Here, we present a more mathematically rigorous approach that also properly quantifies the associated uncertainty. Let us first recall that in the Dirichlet process model, KnK_{n} diverges as nn grows. However, this should not be viewed as a shortcoming of the model. The population size NN, representing the total number of trees in the Amazon Basin, is finite, implying that the total number of tree species is KNK_{N}. Furthermore, ter Steege et al. (2013) provides a reliable estimate of the population size, N^=3.949×1011\hat{N}=3.949\times 10^{11}, based on observed tree density. To be robust against potential errors in this estimate, we specify a prior for NN centered on N^\hat{N} by letting NUniform(0.5N^,1.5N^)N\sim\text{Uniform}(0.5\hat{N},1.5\hat{N}). This prior choice is roughly consistent with the standard errors for N^\hat{N} given in ter Steege et al. (2013). Note that we cannot learn NN from the publicly available data, so we rely on external information for this inferential step. In summary, the coarsened posterior distribution for the total number of tree species can be obtained through the following sampling steps:

(KNα,N,X1,,Xn)pK,(αX1,,Xn)sg(a+ρk,b+ρ,n),NUniform(0.5N^,1.5N^).(K_{N}\mid\alpha,N,X_{1},\dots,X_{n})\sim p_{K},\>\>(\alpha\mid X_{1},\dots,X_{n})\sim\textsc{sg}(a+\rho k,b+\rho,n),\>\>N\sim\text{Uniform}(0.5\hat{N},1.5\hat{N}).

In Section 2, we described general formulas for the law pKp_{K}, which could, in principle, be applied here. Additionally, note that the posterior expectation of KNK_{N} is 𝔼(KNα,N,X1,,Xn)=k+i=1Nnα/(α+n+i1)\mathds{E}(K_{N}\mid\alpha,N,X_{1},\dots,X_{n})=k+\sum_{i=1}^{N-n}\alpha/(\alpha+n+i-1). For simplicity and computational efficiency, we rely on a well-known, highly accurate Poisson approximation valid for large values of NN:

(Km(n)α,N,X1,,Xn).Poisson(i=1Nnαα+n+i1),(K_{m}^{(n)}\mid\alpha,N,X_{1},\dots,X_{n})\overset{.}{\sim}\text{Poisson}\left(\sum_{i=1}^{N-n}\frac{\alpha}{\alpha+n+i-1}\right),

where .\overset{.}{\sim} means “approximately distributed as”, recalling that, given the data, we have KN=k+Km(n)K_{N}=k+K_{m}^{(n)}, with m=Nnm=N-n. This approximation can be rigorously justified in terms of total variation convergence, which holds for large values of NN; see Proposition 4.8 in Ghosal and Van der Vaart (2017). We report the results in Table 1, which shows posterior means and quantiles for KNK_{N} under various choices of the tempering parameter. Figure 6 displays the coarsened posterior of KNK_{N} with ρ=0.01\rho=0.01. Our analysis suggests that the Bayesian estimate for the total number of tree species is approximately 15,000, consistent with the main finding of ter Steege et al. (2013). Additionally, we estimate a 98% credible interval for this total to be between 11,800 and 19,100 when ρ=0.01\rho=0.01. Notably, due to the logarithmic growth rate, the uncertainty in the posterior distribution of KNK_{N} is primarily influenced by the variability in α\alpha, rather than by potential misspecification of the population size NN.
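Continuing from the grid draws of the previous sketch, the following lines illustrate the sampling scheme just described. The long sum in the Poisson rate is evaluated through the identity sum_{i=1}^{N-n} alpha/(alpha+n+i-1) = alpha {digamma(alpha+N) - digamma(alpha+n)}, so that population sizes of order 10^{11} pose no computational difficulty.

# Coarsened posterior of K_N via the Poisson approximation; requires the
# objects `draws`, `n` and `k` from the previous sketch.
N_hat <- 3.949e11
N     <- runif(length(draws), 0.5 * N_hat, 1.5 * N_hat)    # prior draws for the population size
rate  <- draws * (digamma(draws + N) - digamma(draws + n)) # sum_{i=1}^{N-n} alpha / (alpha + n + i - 1)
K_N   <- k + rpois(length(draws), rate)                    # K_N = k + number of new species

c(mean = mean(K_N), quantile(K_N, c(0.01, 0.99)))

The resulting mean and 98% interval should be broadly comparable to the corresponding row of Table 1, up to the error introduced by the grid approximation of the posterior of alpha.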

5.2 Taxonomic σ\sigma-diversity

We now describe a more nuanced analysis of the Amazon tree dataset that incorporates the taxonomic structure of the data. This analysis uses the taxonomic Gibbs-type priors introduced in Section 4 with L=3L=3, where the exchangeable observations 𝑿1,,𝑿n\bm{X}_{1},\dots,\bm{X}_{n} are triplets, denoted 𝑿i=(Xi,1,Xi,2,Xi,3)\bm{X}_{i}=(X_{i,1},X_{i,2},X_{i,3}), whose elements represent the family, genus, and species of the iith tree, respectively. The last element of each triplet, Xi,3X_{i,3}, corresponds to the previously analyzed tree species. We let the family labels Xn,1p~1X_{n,1}\mid\tilde{p}_{1} be iid samples from p~1\tilde{p}_{1}, where p~1dp(α1P1)\tilde{p}_{1}\sim\textsc{dp}(\alpha_{1}P_{1}) for n1n\geq 1. Moreover, for the genera and species, we assume the observations are generated as follows

Xn,2Xn,1=x1,p~2(x1)indp~2(x1),Xn,3Xn,2=x2,p~3(x2)indp~3(x2),n1.X_{n,2}\mid X_{n,1}=x_{1},\>\tilde{p}_{2}(\cdot\mid x_{1})\overset{\text{ind}}{\sim}\tilde{p}_{2}(\cdot\mid x_{1}),\quad X_{n,3}\mid X_{n,2}=x_{2},\>\tilde{p}_{3}(\cdot\mid x_{2})\overset{\text{ind}}{\sim}\tilde{p}_{3}(\cdot\mid x_{2}),\quad n\geq 1.

We specify different priors for the second and third layers of the taxonomy. Denoting by α={α(x)}x𝕏1\alpha=\{\alpha(x)\}_{x\in\mathds{X}_{1}} and γ={γ(x)}x𝕏2\gamma=\{\gamma(x)\}_{x\in\mathds{X}_{2}}, we assume

p~2(x1)α(x1)inddp(α(x1)P(x1)),\displaystyle\tilde{p}_{2}(\cdot\mid x_{1})\mid\alpha(x_{1})\overset{\text{ind}}{\sim}\textsc{dp}(\alpha(x_{1})P(\cdot\mid x_{1})),\qquad p~3(x2)γ(x2)indap(γ(x2)P(x2)),\displaystyle\tilde{p}_{3}(\cdot\mid x_{2})\mid\gamma(x_{2})\overset{\text{ind}}{\sim}\textsc{ap}(\gamma(x_{2})P(\cdot\mid x_{2})),
α𝒬α,\displaystyle\alpha\sim\mathcal{Q}_{\alpha},\qquad γ𝒬γ.\displaystyle\gamma\sim\mathcal{Q}_{\gamma}.

In other words, we assume a logarithmic growth rate for genera within families and a faster polynomial growth rate for species within each genus, induced by the Aldous-Pitman model. This taxonomic approach allows us to compare the genus-level biodiversity within each family, denoted by α(x)\alpha(x), as well as the species-level biodiversity within each genus, denoted by γ(x)\gamma(x). To obtain Bayesian estimates of α(x)\alpha(x) and γ(x)\gamma(x) we need to specify priors 𝒬α\mathcal{Q}_{\alpha} and 𝒬γ\mathcal{Q}_{\gamma}. In the former case, we let α(x)iidsg(aα,bα,m)\alpha(x)\overset{\textup{iid}}{\sim}\textsc{sg}(a_{\alpha},b_{\alpha},m), with aα/bα=3a_{\alpha}/b_{\alpha}=3, bα=0.1b_{\alpha}=0.1, m=100m=100, a fairly informative prior under which we expect, a priori and on average, 3 distinct genera per family in a hypothetical sample of size m=100m=100 from that family. On the other hand, we assume a hierarchical prior for γ(x)\gamma(x), to borrow strength across species-level biodiversities, which is needed due to the sparsity of the data. Specifically, we let

γ(x)aγ,bγiidGamma(aγ,bγ),(logaγ,logbγ)Normal2(𝝁γ,σγ2I2),\gamma(x)\mid a_{\gamma},b_{\gamma}\overset{\textup{iid}}{\sim}\text{Gamma}(a_{\gamma},b_{\gamma}),\qquad(\log{a_{\gamma}},\log{b_{\gamma}})\sim\text{Normal}_{2}(\bm{\mu}_{\gamma},\sigma^{2}_{\gamma}I_{2}),

where we set 𝝁γ=(0,0)\bm{\mu}_{\gamma}=(0,0) and σγ2=102\sigma^{2}_{\gamma}=10^{2}.

We ran an mcmc algorithm for 10,000 iterations, discarding the first 1,000 as burn-in; the computational details are provided in the Supplementary Material. The sampling algorithm is particularly straightforward due to the factorized likelihood (16). This, combined with suitable priors 𝒬α\mathcal{Q}_{\alpha} and 𝒬γ\mathcal{Q}_{\gamma}, results in separate blocks of the model that can be estimated independently. The posterior distribution for the most relevant part of the model, the one describing species-within-genus biodiversity, has been coarsened with ρ=0.25\rho=0.25 to address potential misspecification. As before, this choice of coarsening is supported by an informal elbow rule.

Figure 7: Taxon biodiversity: genera within family. We plot the posterior expectations 𝔼(α(x)𝑿1,,𝑿n)\mathds{E}(\alpha(x)\mid\bm{X}_{1},\dots,\bm{X}_{n}), approximated via mcmc, of the 1616 most diverse families out of 115115 as measured by α(x)\alpha(x).
Figure 8: Taxon biodiversity: species within genus, for two randomly selected families (Moraceae and Myrtaceae). Dots represent the posterior means 𝔼(γ(x)𝑿1,,𝑿n)\mathds{E}(\gamma(x)\mid\bm{X}_{1},\dots,\bm{X}_{n}), while lines represent 98% credible intervals, approximated via mcmc.

We report in Figure 7 the posterior expectations 𝔼(α(x)𝑿1,,𝑿n)\mathds{E}(\alpha(x)\mid\bm{X}_{1},\dots,\bm{X}_{n}) for selected families. Specifically, we show the 16 most diverse of the 115 families, as measured by the fundamental biodiversity numbers α(x)\alpha(x). These values represent family-specific biodiversity that accounts for the variety of genera within the same taxonomic branch. However, it is important to note that α(x)\alpha(x) does not reflect the number of species within each genus. The most biodiverse family is Fabaceae, which is expected, since it is also the most abundant and species-rich family in the dataset. Furthermore, as observed in ter Steege et al. (2013), Rubiaceae, the second most diverse family according to α(x)\alpha(x), has relatively few hyperdominant species, which aligns with its high diversity number. Moreover, in Figure 8, we plot the posterior distributions of the genus-specific biodiversities γ(x)\gamma(x) for all genera in two randomly chosen families, Moraceae and Myrtaceae. As the plot suggests, with few exceptions, there is substantial uncertainty in each γ(x)\gamma(x), though this is partially reduced by borrowing strength across genera. In particular, in our hierarchical model, the estimated average biodiversity is 𝔼(aγ/bγ𝑿1,,𝑿n)=0.15\mathds{E}(a_{\gamma}/b_{\gamma}\mid\bm{X}_{1},\dots,\bm{X}_{n})=0.15 with a standard deviation of 𝔼(aγ/bγ2𝑿1,,𝑿n)=0.11\mathds{E}(\sqrt{a_{\gamma}/b_{\gamma}^{2}}\mid\bm{X}_{1},\dots,\bm{X}_{n})=0.11. Two genera that stand out from the average behavior are Ficus and Pseudolmedia. The former ranks among the most diverse genera of trees, as evidenced by the extensive list of species documented in Plants of the World Online (https://powo.science.kew.org); see Baker et al. (2022). To date, there are almost a thousand accepted species of Ficus spread across Amazonia and other regions of the world. In contrast, Pseudolmedia has only 11 known species globally (four of them observed in our dataset), all situated in the Amazon basin and adjacent areas; some of these species may be at risk of extinction.

6 Discussion

In this paper, we discuss what we believe to be one of the most natural definitions of biodiversity from a Bayesian perspective, unifying accumulation curves and most existing biodiversity measures under a model-based approach. Mathematically, this approach takes advantage of Gibbs-type priors as defined in Gnedin and Pitman (2005), although most of the inferential results crucial for species discovery were developed in a subsequent paper by Lijoi et al. (2007a). This theory has significant connections with ideas developed over the years, particularly by Fisher et al. (1943) and Hubbell (2001). A key natural question is understanding generalizations to different sampling mechanisms, such as those where only species presence is recorded (i.e. incidence data). This has recently been addressed in Ghilotti et al. (2024), which introduced a broad class of models sharing structural properties with Gibbs-type priors, leading to a natural notion of biodiversity for incidence data.

We also envision several promising research directions. First, we observe that a systematic comparison of priors for HH, representing species richness in a Dirichlet-multinomial model, is currently lacking. This is a straightforward next step that could be effectively explored with moderate effort, at least from an empirical standpoint. A second and more complex issue involves the inclusion of covariates. Estimating localized biodiversity metrics for specific regions or tracking biodiversity changes over time is both more interesting and realistic than estimating a single biodiversity number. However, directly applying covariate-dependent Dirichlet processes (e.g., Quintana et al., 2022) appears unsuitable and requires careful modifications. The work of Zito et al. (2023) goes in that direction by incorporating DNA sequencing into the model, but it does not provide synthetic measures of biodiversity. A third relevant issue concerns a clear definition of the so-called “beta” diversity, namely the heterogeneity of species across different sampling regions. Recently, there have been notable developments in models that can account for shared species, in particular the hierarchical constructions described in Camerlenghi et al. (2017) and Camerlenghi et al. (2019). However, some computational challenges remain open, as does the need for a precise notion of “beta” diversity that is as clear and unifying as the σ\sigma-diversity described in this paper.

Acknowledgments

Tommaso Rigon acknowledges support of MUR - Prin 2022 - Grant no. 2022CLTYP4, funded by the European Union - Next Generation EU. This work was partially supported by the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 856506).

References

  • Aldous and Pitman (1998) Aldous, D. and J. Pitman (1998). The standard additive coalescent. Ann. Probab. 26(4), 1703–1726.
  • Baker et al. (2022) Baker, W. J., P. Bailey, V. Barber, A. Barker, S. Bellot, D. Bishop, L. R. Botigué, G. Brewer, T. Carruthers, J. J. Clarkson, J. Cook, R. S. Cowan, S. Dodsworth, N. Epitawalage, E. Françoso, B. Gallego, M. G. Johnson, J. T. Kim, K. Leempoel, O. Maurin, C. McGinnie, L. Pokorny, S. Roy, M. Stone, E. Toledo, N. J. Wickett, A. R. Zuntini, W. L. Eiserhardt, P. J. Kersey, I. J. Leitch, and F. Forest (2022). A Comprehensive Phylogenomic Platform for Exploring the Angiosperm Tree of Life. Systematic Biology 71(2), 301–319.
  • Blackwell and MacQueen (1973) Blackwell, D. and J. B. MacQueen (1973). Ferguson distributions via Pólya urn schemes. Ann. Statist. 1(2), 353–355.
  • Brose et al. (2003) Brose, U., N. D. Martinez, and R. J. Williams (2003). Estimating species richness: Sensitivity to sample coverage and insensitivity to spatial patterns. Ecology 84(9), 2364–2377.
  • Bunge and Fitzpatrick (1993) Bunge, J. and M. Fitzpatrick (1993). Estimating the number of species: a review. J. Am. Statist. Ass. 88(421), 364–373.
  • Camerlenghi et al. (2019) Camerlenghi, F., A. Lijoi, P. Orbanz, and I. Prünster (2019). Distribution theory for hierarchical processes. Ann. Statist. 47(1), 67–92.
  • Camerlenghi et al. (2017) Camerlenghi, F., A. Lijoi, and I. Prünster (2017). Bayesian prediction with multiple-samples information. J. Multiv. Anal. 156, 18–28.
  • Ceballos et al. (2015) Ceballos, G., P. R. Ehrlich, A. D. Barnosky, A. García, R. M. Pringle, and T. M. Palmer (2015). Accelerated modern human-induced species losses: Entering the sixth mass extinction. Science Advances 1(5), 9–13.
  • Chao (1984) Chao, A. (1984). Nonparametric estimation of the number of classes in a population. Scand. J. Statist. 11(4), 265–270.
  • Chao and Bunge (2002) Chao, A. and J. Bunge (2002). Estimating the number of species in a stochastic abundance model. Biometrics 58(3), 531–539.
  • Chao and Lee (1992) Chao, A. and S. M. Lee (1992). Estimating the number of classes via sample coverage. J. Am. Statist. Ass. 87(417), 210–217.
  • Charalambides (2002) Charalambides, C. A. (2002). Enumerative combinatorics. Springer.
  • Chisholm and Burgman (2004) Chisholm, R. A. and M. A. Burgman (2004). The unified neutral theory of biodiversity and biogeography: comment. Ecology 85(11), 3172–3174.
  • Colwell (2009) Colwell, R. K. (2009). Biodiversity: concepts, patterns, and measurement, pp. 257–263. Princeton University Press.
  • Colwell et al. (2012) Colwell, R. K., A. Chao, N. J. Gotelli, S.-Y. Lin, C. X. Mao, R. L. Chazdon, and J. T. Longino (2012). Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. J. Plant Ecol. 5(1), 3–21.
  • De Blasi et al. (2015) De Blasi, P., S. Favaro, A. Lijoi, R. H. Mena, I. Prünster, and M. Ruggiero (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Trans. Pattern Anal. Mach. Intell. 37(2), 212–229.
  • De Blasi et al. (2013) De Blasi, P., A. Lijoi, and I. Prünster (2013). An asymptotic analysis of a class of discrete nonparametric priors. Statist. Sin. 23(3), 1299–1321.
  • de Finetti (1937) de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Annales de l’institut Henri Poincaré 7, 1–68.
  • Devroye (1986) Devroye, L. (1986). Non-Uniform Random Variate Generation. Springer.
  • Escobar and West (1995) Escobar, M. D. and M. West (1995). Bayesian density estimation and inference using mixtures. J. Am. Statist. Ass. 90(430), 577–588.
  • Favaro et al. (2009) Favaro, S., A. Lijoi, R. H. Mena, and I. Prünster (2009). Bayesian non-parametric inference for species variety with a two-parameter Poisson-Dirichlet process prior. J. R. Statist. Soc. B 71(5), 993–1008.
  • Favaro et al. (2012) Favaro, S., A. Lijoi, and I. Prünster (2012). Asymptotics for a Bayesian nonparametric estimator of species variety. Bernoulli 18(4), 1267–1283.
  • Favaro et al. (2013) Favaro, S., A. Lijoi, and I. Prünster (2013). Conditional formulae for Gibbs-type exchangeable random partitions. Ann. Appl. Probab. 23(5), 1721–1754.
  • Favaro et al. (2015) Favaro, S., M. Lomeli, and Y. W. Teh (2015). On a class of sigma-stable Poisson–Kingman models and an effective marginalized sampler. Statist. Comp. 25(1), 67–78.
  • Favaro et al. (2021) Favaro, S., F. Panero, and T. Rigon (2021). Bayesian nonparametric disclosure risk assessment. Electron. J. Statist. 15(2), 5626–5651.
  • Ferguson (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1(2), 209–230.
  • Fisher et al. (1943) Fisher, R. A., A. S. Corbet, and C. B. Williams (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol. 12(1), 42–58.
  • Ghilotti et al. (2024) Ghilotti, L., F. Camerlenghi, and T. Rigon (2024). Bayesian analysis of product feature allocation models.  arXiv:2408.15806.
  • Ghosal and Van der Vaart (2017) Ghosal, S. and A. Van der Vaart (2017). Fundamentals of nonparametric Bayesian inference. Cambridge University Press.
  • Gnedin (2010) Gnedin, A. (2010). Species sampling model with finitely many types. Electron. Comm. Prob. 15, 79–88.
  • Gnedin and Pitman (2005) Gnedin, A. and J. Pitman (2005). Exchangeable Gibbs partitions and Stirling triangles. Zapiski Nauchnykh Seminarov, POMI 325, 83–102.
  • Good (1953) Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika 40(3-4), 237–264.
  • Good and Toulmin (1956) Good, I. J. and G. H. Toulmin (1956). The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43(1-2), 45–63.
  • Gotelli and Colwell (2001) Gotelli, N. J. and R. K. Colwell (2001). Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecol. Lett. 4, 379–391.
  • Hjort et al. (2010) Hjort, N. L., C. Holmes, P. Müller, and S. G. Walker (2010). Bayesian nonparametrics, Volume 28. Cambridge University Press.
  • Hubbell (2001) Hubbell, S. P. (2001). The Unified Neutral Theory of Biodiversity and Biogeography. Princeton University Press.
  • Hubbell et al. (2008) Hubbell, S. P., F. He, R. Condit, L. Borda-de Água, J. Kellner, and H. Ter Steege (2008). How many tree species are there in the Amazon and how many of them will go extinct? Proceedings of the National Academy of Sciences of the United States of America 105, 11498–11504.
  • Kingman (1975) Kingman, J. F. (1975). Random discrete distributions. J. R. Statist. Soc. B 37(1), 1–15.
  • Korwar and Hollander (1973) Korwar, R. M. and M. Hollander (1973). Contributions to the theory of Dirichlet processes. Ann. Probab. 1(4), 705–711.
  • Lebedev (1965) Lebedev, N. N. (1965). Special functions and their applications. Prentice-Hall, Inc.
  • Legramanti et al. (2022) Legramanti, S., T. Rigon, D. Durante, and D. B. Dunson (2022). Extended stochastic block models with application to criminal networks. Ann. Appl. Stat. 16(4), 2369–2395.
  • Lijoi et al. (2005) Lijoi, A., R. H. Mena, and I. Prünster (2005). Hierarchical mixture modeling with normalized inverse-Gaussian priors. J. Am. Statist. Ass. 100(472), 1278–1291.
  • Lijoi et al. (2007a) Lijoi, A., R. H. Mena, and I. Prünster (2007a). Bayesian nonparametric estimation of the probability of discovering new species. Biometrika 94(4), 769–786.
  • Lijoi et al. (2007b) Lijoi, A., R. H. Mena, and I. Prünster (2007b). Controlling the reinforcement in Bayesian non-parametric mixture models. J. R. Statist. Soc. B 69(4), 715–740.
  • Lijoi et al. (2008a) Lijoi, A., I. Prünster, and S. G. Walker (2008a). Bayesian nonparametric estimators derived from conditional Gibbs structures. Ann. Appl. Probab. 18(4), 1519–1547.
  • Lijoi et al. (2008b) Lijoi, A., I. Prünster, and S. G. Walker (2008b). Investigating nonparametric priors with Gibbs structure. Statist. Sin. 18(4), 1653–1668.
  • Magurran and McGill (2011) Magurran, A. E. and B. J. McGill (2011). Biological Diversity: frontiers in measurement and assessment. Oxford Biology.
  • McCullagh (2016) McCullagh, P. (2016). Two early contributions to the Ewens saga. Statist. Sci. 31(1), 23–26.
  • McGill (2003) McGill, B. J. (2003). A test of the unified neutral theory of biodiversity. Nature 422(6934), 881–885.
  • Miller and Dunson (2019) Miller, J. W. and D. B. Dunson (2019). Robust Bayesian inference via coarsening. J. Am. Statist. Ass. 114(527), 1113–1125.
  • Miller and Harrison (2018) Miller, J. W. and M. T. Harrison (2018). Mixture models with a prior on the number of components. J. Am. Statist. Ass. 113(521), 340–356.
  • Oksanen et al. (2024) Oksanen, J., G. L. Simpson, F. G. Blanchet, R. Kindt, P. Legendre, P. R. Minchin, R. O’Hara, P. Solymos, M. H. H. Stevens, E. Szoecs, H. Wagner, M. Barbour, M. Bedward, B. Bolker, D. Borcard, G. Carvalho, M. Chirico, M. De Caceres, S. Durand, H. B. A. Evangelista, R. FitzJohn, M. Friendly, B. Furneaux, G. Hannigan, M. O. Hill, L. Lahti, D. McGlinn, M.-H. Ouellette, E. Ribeiro Cunha, T. Smith, A. Stier, C. J. Ter Braak, and J. Weedon (2024). vegan: Community Ecology Package. R package version 2.6-8.
  • Perman et al. (1992) Perman, M., J. Pitman, and M. Yor (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields 92(1), 21–39.
  • Pitman (1996) Pitman, J. (1996). Some developments of the Blackwell-MacQueen urn scheme. In Statistics, probability and game theory, Volume 30 of IMS Lecture Notes Monogr. Ser., pp.  245–267. Inst. Math. Statist., Hayward, CA.
  • Pitman (2003) Pitman, J. (2003). Poisson-Kingman partitions. Lecture Notes-Monograph Series 40, 1–34.
  • Pitman (2006) Pitman, J. (2006). Combinatorial Stochastic Processes: École d’Été de Probabilités de Saint-Flour XXXII-2002. Springer.
  • Pitman and Yor (1997) Pitman, J. and M. Yor (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Ann. Probab. 25(2), 855–900.
  • Quintana et al. (2022) Quintana, F. A., P. Müller, A. Jara, and S. N. MacEachern (2022). The Dependent Dirichlet Process and Related Models. Statist. Sc. 37(1), 24–41.
  • Ricklefs (2006) Ricklefs, R. E. (2006). The unified neutral theory of biodiversity: do the numbers add up? Ecology 87(6), 1424–1431.
  • Rigon et al. (2025) Rigon, T., S. Petrone, and B. Scarpa (2025). Enriched Pitman-Yor processes. Scand. J. Statist.
  • Sethuraman (1994) Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sin. 4(2), 639–650.
  • Smith and Grassle (1977) Smith, W. and J. F. Grassle (1977). Sampling properties of a family of diversity measures. Biometrics 33(2), 283–292.
  • Sun et al. (2023) Sun, J., M. Kong, and S. Pal (2023). The Modified-Half-Normal distribution: Properties and an efficient sampling scheme. Commun. Statist. Th. Meth. 52(5), 1591–1613.
  • ter Steege et al. (2013) ter Steege, H. et al. (2013). Hyperdominance in the Amazonian tree flora. Science 342(6156), 1243092.
  • Thisted and Efron (1987) Thisted, R. and B. Efron (1987). Did Shakespeare write a newly-discovered poem? Biometrika 74(3), 445–455.
  • Wade et al. (2011) Wade, S., S. Mongelluzzo, S. Petrone, et al. (2011). An enriched conjugate prior for Bayesian nonparametric inference. Bayesian Analysis 6(3), 359–385.
  • Wang (2011) Wang, J. P. (2011). SPECIES: An R package for species richness estimation. J. Statist. Soft. 40(9), 1–15.
  • Zito et al. (2023) Zito, A., T. Rigon, and D. B. Dunson (2023). Inferring taxonomic placement from DNA barcoding allowing discovery of new taxa. Meth. Ecol. Evol. 14, 529–542.
  • Zito et al. (2024) Zito, A., T. Rigon, and D. B. Dunson (2024). Bayesian nonparametric modeling of latent partitions via Stirling-gamma priors. Bayesian Analysis (forthcoming).
  • Zito et al. (2023) Zito, A., T. Rigon, O. Ovaskainen, and D. B. Dunson (2023). Bayesian modeling of sequential discoveries. J. Am. Statist. Ass. 118(544), 2521–2532.

Appendix A Proofs

A.1 Rarefaction and extrapolation curve of a Dirichlet-multinomial model

We present here the proof of the results in Example 4, concerning the expectations 𝔼(KnH)\mathds{E}(K_{n}\mid H) and 𝔼(Km(n)H,X1,,Xn)\mathds{E}(K_{m}^{(n)}\mid H,X_{1},\dots,X_{n}) in the Dirichlet-multinomial model. To begin, let us recall that the process is defined as p~=h=1HWhδZh\tilde{p}=\sum_{h=1}^{H}W_{h}\delta_{Z_{h}}, where (W1,,WH)Dir(|σ|,,|σ|)(W_{1},\dots,W_{H})\sim\textup{Dir}(|\sigma|,\dots,|\sigma|). The a priori expected value of the number of distinct values is well known (e.g. Pitman, 2006, Chap. 3), but we provide here a simple alternative proof that does not require combinatorial calculus. Specifically, we have:

𝔼(KnH)=𝔼{𝔼(KnH,p~)}=𝔼[h=1H{1(1Wh)n}H].\mathds{E}(K_{n}\mid H)=\mathds{E}\{\mathds{E}(K_{n}\mid H,\tilde{p})\}=\mathds{E}\left[\sum_{h=1}^{H}\{1-(1-W_{h})^{n}\}\mid H\right].

where the quantity 1(1Wh)n1-(1-W_{h})^{n} is the probability of observing the hhth species at least once in a sample of size nn (Smith and Grassle, 1977) given the vector of probabilities W1,,WHW_{1},\dots,W_{H}. Hence:

𝔼(KnH)=𝔼[h=1H{1(1Wh)n}H]=H𝔼{h=1H(1Wh)nH}=HH𝔼{(1W1)nH},\mathds{E}(K_{n}\mid H)=\mathds{E}\left[\sum_{h=1}^{H}\{1-(1-W_{h})^{n}\}\mid H\right]=H-\mathds{E}\left\{\sum_{h=1}^{H}(1-W_{h})^{n}\mid H\right\}=H-H\mathds{E}\left\{(1-W_{1})^{n}\mid H\right\},

where the last step follows because W1,,WHW_{1},\dots,W_{H} are identically distributed beta random variables. In fact, (1W1)Beta(H|σ||σ|,|σ|)(1-W_{1})\sim\text{Beta}(H|\sigma|-|\sigma|,|\sigma|), and the result follows from well-known properties of the beta distribution:

𝔼(KnH)=HH(H|σ||σ|)n(H|σ|)n.\mathds{E}(K_{n}\mid H)=H-H\frac{(H|\sigma|-|\sigma|)_{n}}{(H|\sigma|)_{n}}.

This proof technique can be extended to the a posteriori expectation 𝔼(Km(n)H,X1,,Xn)\mathds{E}(K_{m}^{(n)}\mid H,X_{1},\dots,X_{n}). In fact, by conjugacy, the posterior distribution of W1,,WHW_{1},\dots,W_{H} is available in closed form, that is

(W1,,WHH,X1,,Xn)Dir(|σ|+n¯1,,|σ|+n¯H),(W_{1},\dots,W_{H}\mid H,X_{1},\dots,X_{n})\sim\text{Dir}(|\sigma|+\bar{n}_{1},\dots,|\sigma|+\bar{n}_{H}),

where n¯1,,n¯H\bar{n}_{1},\dots,\bar{n}_{H} are the labeled frequencies of the species Z1,,ZHZ_{1},\dots,Z_{H}. Note that we could have n¯h=0\bar{n}_{h}=0, and the number of non-zero entries, that is, the number of observed species, is Kn=kK_{n}=k. Without loss of generality, we assume the non-zero abundances correspond to the first kk values, so that nj=n¯jn_{j}=\bar{n}_{j} for j=1,,kj=1,\dots,k and n¯j=0\bar{n}_{j}=0 for j=k+1,,Hj={k+1},\dots,H. Reasoning as before, we obtain

𝔼(Km(n)H,X1,,Xn)\displaystyle\mathds{E}(K_{m}^{(n)}\mid H,X_{1},\dots,X_{n}) =𝔼{𝔼(Km(n)H,p~,X1,,Xn)}\displaystyle=\mathds{E}\{\mathds{E}(K_{m}^{(n)}\mid H,\tilde{p},X_{1},\dots,X_{n})\}
=𝔼[j=k+1H{1(1Wj)m}H,X1,,Xn]\displaystyle=\mathds{E}\left[\sum_{j=k+1}^{H}\{1-(1-W_{j})^{m}\}\mid H,X_{1},\dots,X_{n}\right]

where each 1(1Wj)m1-(1-W_{j})^{m} for j=k+1,,Hj=k+1,\dots,H is the probability of sampling the jjth unobserved species at least once in an additional sample of size mm, given the vector of probabilities W1,,WHW_{1},\dots,W_{H} and the data X1,,XnX_{1},\dots,X_{n}. A posteriori, the random variables Wk+1,,WHW_{k+1},\dots,W_{H} are still identically distributed beta random variables and in particular (1WjX1,,Xn)Beta(n+H|σ||σ|,|σ|)(1-W_{j}\mid X_{1},\dots,X_{n})\sim\text{Beta}(n+H|\sigma|-|\sigma|,|\sigma|) leading to

𝔼(Km(n)H,X1,,Xn)\displaystyle\mathds{E}(K_{m}^{(n)}\mid H,X_{1},\dots,X_{n}) =Hk(Hk)𝔼{(1Wk+1)mH}\displaystyle=H-k-(H-k)\mathds{E}\left\{(1-W_{k+1})^{m}\mid H\right\}
=Hk(Hk)(n+H|σ||σ|)m(n+H|σ|)m.\displaystyle=H-k-(H-k)\frac{(n+H|\sigma|-|\sigma|)_{m}}{(n+H|\sigma|)_{m}}.

Interestingly, this corresponds to the estimator of a Pitman–Yor process with a negative discount parameter; see equation (6) in Favaro et al. (2009). We also note that this result could alternatively be obtained, with some effort, by specializing the general theorems of Lijoi et al. (2007a). A further proof strategy based on combinatorial calculus is discussed in Appendices A.1 and A.2 of Favaro et al. (2009).
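The two closed-form expressions above are easy to verify numerically. The sketch below evaluates them through log-gamma functions and compares the a priori one with a Monte Carlo average over simulated Dirichlet-multinomial samples; the values of H, |σ|, n, m, and of the number of observed species k are arbitrary and purely illustrative.

# Numerical check of the Dirichlet-multinomial rarefaction and extrapolation
# formulas derived above. All parameter values are illustrative.
set.seed(2)
H <- 20; abs_sigma <- 0.7; n <- 50; m <- 30

log_pochhammer <- function(x, n) lgamma(x + n) - lgamma(x)

# E(K_n | H) = H - H * (H|sigma| - |sigma|)_n / (H|sigma|)_n
EKn <- H - H * exp(log_pochhammer(H * abs_sigma - abs_sigma, n) -
                     log_pochhammer(H * abs_sigma, n))

# Monte Carlo counterpart: Dirichlet weights via normalized Gamma variables,
# then count the distinct labels in a sample of size n.
mc <- replicate(5000, {
  w <- rgamma(H, shape = abs_sigma)
  w <- w / sum(w)
  length(unique(sample.int(H, n, replace = TRUE, prob = w)))
})
c(exact = EKn, monte_carlo = mean(mc))

# Posterior extrapolation: expected number of new species in m further draws,
# having observed k distinct species in the first n observations.
extrapolation <- function(H, abs_sigma, n, k, m) {
  (H - k) * (1 - exp(log_pochhammer(n + H * abs_sigma - abs_sigma, m) -
                       log_pochhammer(n + H * abs_sigma, m)))
}
extrapolation(H, abs_sigma, n, k = 12, m = m)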

A.2 Proof of Theorem 1

First of all, the quantity Kn=kK_{n}=k is a sufficient statistic for the diversity, as can be seen by direct inspection of the likelihood function, that is, the eppf. In fact, the diversity SσS_{\sigma} only appears in the terms Vn,kV_{n,k}, which depend solely on nn and kk and not on the abundances. From Proposition 2, we know that

Kncσ(n)a.s.Sσ,n,\frac{K_{n}}{c_{\sigma}(n)}\overset{\text{a.s.}}{\longrightarrow}S_{\sigma},\qquad n\to\infty,

that is, (limmKm/cσ(m)=Sσ)=1\mathds{P}(\lim_{m\rightarrow\infty}K_{m}/c_{\sigma}(m)=S_{\sigma})=1. Let A={limmKm/cσ(m)=Sσ}A=\left\{\lim_{m\rightarrow\infty}K_{m}/c_{\sigma}(m)=S_{\sigma}\right\}, and B={Kn=k}B=\{K_{n}=k\} and note that (B)>0\mathds{P}(B)>0 and (A)=1\mathds{P}(A)=1, therefore (AB)=(B)\mathds{P}(A\cap B)=\mathds{P}(B). The proof follows from a very simple argument:

(AB)=(AB)(B)=(B)(B)=1.\mathds{P}(A\mid B)=\frac{\mathds{P}(A\cap B)}{\mathds{P}(B)}=\frac{\mathds{P}(B)}{\mathds{P}(B)}=1.

In other words, we have shown that (Km/cσ(m)Kn=k)a.s.Sσ(K_{m}/c_{\sigma}(m)\mid K_{n}=k)\overset{\text{a.s.}}{\longrightarrow}S_{\sigma}. By a continuity argument, the sequences

(Kn+mcσ(m)Kn=k)and(Kn+mcσ(n+m)Kn=k)\left(\frac{K_{n+m}}{c_{\sigma}(m)}\mid K_{n}=k\right)\quad\text{and}\quad\left(\frac{K_{n+m}}{c_{\sigma}(n+m)}\mid K_{n}=k\right)

also converge to SσS_{\sigma} almost surely as mm\to\infty. We now clarify that the distribution of the random variable SσS_{\sigma} is indeed the posterior law of the parameters HH, α\alpha, and γ\gamma. Let us first consider the three building blocks of Gibbs-type priors, namely when SσS_{\sigma} is a positive constant. The above result implies the convergence of the Laplace functionals, that is, for any λ>0\lambda>0,

𝔼(eλKn+m/cσ(m)Kn=k)eλSσ.\mathds{E}\left(e^{-\lambda K_{n+m}/c_{\sigma}(m)}\mid K_{n}=k\right)\to e^{-\lambda S_{\sigma}}.

In a general Gibbs-type prior, (Kn+m/cσ(m)Kn=k)\left(K_{n+m}/c_{\sigma}(m)\mid K_{n}=k\right) converges to a random variable whose distribution is the posterior law of the diversity. In fact, as an application of the tower rule,

𝔼(eλKn+m/cσ(m)Kn=k)=𝔼{𝔼(eλKn+m/cσ(m)Kn=k,Sσ)}𝔼(eλSσKn=k),\mathds{E}\left(e^{-\lambda K_{n+m}/c_{\sigma}(m)}\mid K_{n}=k\right)=\mathds{E}\left\{\mathds{E}\left(e^{-\lambda K_{n+m}/c_{\sigma}(m)}\mid K_{n}=k,S_{\sigma}\right)\right\}\to\mathds{E}(e^{-\lambda S_{\sigma}}\mid K_{n}=k),

which concludes the proof.

A.2.1 Alternative proof of Theorem 1 for σ=0\sigma=0

We describe an alternative proof of Theorem 1, expressed in terms of weak convergence, which applies when σ=0\sigma=0. This proof technique is more direct, relying on calculus and combinatorics. In contrast, the previous proof is more abstract and depends on Proposition 2.

Let us consider the Dirichlet process case with parameter α\alpha. We begin by studying the a priori asymptotic behavior of Kn/log(n)K_{n}/\log(n), a result that was established by Korwar and Hollander (1973). By definition, the Laplace transform of this scaled random variable is

𝔼(eλKn/logn)\displaystyle\mathds{E}\left(e^{-\lambda K_{n}/\log{n}}\right) =j=1neλj/lognαj(α)n|s(n,j)|=1(α)nj=1n(αeλ/logn)j|s(n,j)|=\displaystyle=\sum_{j=1}^{n}e^{-\lambda j/\log{n}}\frac{\alpha^{j}}{(\alpha)_{n}}|s(n,j)|=\frac{1}{(\alpha)_{n}}\sum_{j=1}^{n}\left(\alpha e^{-\lambda/\log{n}}\right)^{j}|s(n,j)|=
=(αeλ/logn)n(α)n=Γ(αeλ/logn+n)Γ(αeλ/logn)Γ(α)Γ(α+n).\displaystyle=\frac{(\alpha e^{-\lambda/\log{n}})_{n}}{(\alpha)_{n}}=\frac{\Gamma(\alpha e^{-\lambda/\log{n}}+n)}{\Gamma(\alpha e^{-\lambda/\log{n}})}\frac{\Gamma(\alpha)}{\Gamma(\alpha+n)}.

The third equality follows by definition of signless Stirling numbers of the first kind. Now, let g(n)=αeλ/log(n)g(n)=\alpha e^{-\lambda/\log(n)} and note that limng(n)=α\lim_{n\to\infty}g(n)=\alpha. Thus

limn𝔼(eλKn/logn)\displaystyle\lim_{n\to\infty}\mathds{E}\left(e^{-\lambda K_{n}/\log{n}}\right) =limnΓ(g(n)+n)Γ(g(n))Γ(α)Γ(α+n)=limnΓ(g(n)+n)Γ(α+n)\displaystyle=\lim_{n\to\infty}\frac{\Gamma(g(n)+n)}{\cancel{\Gamma(g(n))}}\frac{\cancel{\Gamma(\alpha)}}{\Gamma(\alpha+n)}=\lim_{n\to\infty}\frac{\Gamma(g(n)+n)}{\Gamma(\alpha+n)}
=limn2πeg(n)n(g(n)+n)g(n)+n1/22πeαn(α+n)α+n1/2\displaystyle=\lim_{n\to\infty}\frac{\sqrt{2\pi}e^{-g(n)-n}(g(n)+n)^{g(n)+n-1/2}}{\sqrt{2\pi}e^{-\alpha-n}(\alpha+n)^{\alpha+n-1/2}}
=limn(g(n)+n)g(n)+n1/2(α+n)α+n1/2\displaystyle=\lim_{n\to\infty}\frac{(g(n)+n)^{g(n)+n-1/2}}{(\alpha+n)^{\alpha+n-1/2}}
=limn((g(n)+n)g(n)+n1/2(g(n)+n)α+n1/2)((g(n)+n)α+n1/2(α+n)α+n1/2)\displaystyle=\lim_{n\to\infty}\left(\frac{(g(n)+n)^{g(n)+n-1/2}}{(g(n)+n)^{\alpha+n-1/2}}\right)\left(\frac{(g(n)+n)^{\alpha+n-1/2}}{(\alpha+n)^{\alpha+n-1/2}}\right)
=limn(g(n)+n)g(n)α(g(n)+nα+n)α+n1/2.\displaystyle=\lim_{n\to\infty}(g(n)+n)^{g(n)-\alpha}\left(\frac{g(n)+n}{\alpha+n}\right)^{\alpha+n-1/2}.

The second step follows by applying Stirling’s asymptotic formula for the Gamma function. Moreover, note that limn(eλ/logn1)/(λ/logn)=1\lim_{n\to\infty}(e^{-\lambda/\log{n}}-1)/(-\lambda/\log{n})=1 therefore

limn(g(n)+n)g(n)α\displaystyle\lim_{n\to\infty}(g(n)+n)^{g(n)-\alpha} =limn(αeλ/logn+n)α(eλ/logn1)\displaystyle=\lim_{n\to\infty}(\alpha e^{-\lambda/\log{n}}+n)^{\alpha(e^{-\lambda/\log{n}}-1)}
=limn(αλ/logn+n)αλ/logn=eαλ,\displaystyle=\lim_{n\to\infty}(-\alpha\lambda/\log{n}+n)^{-\alpha\lambda/\log{n}}=e^{-\alpha\lambda},

whereas

limn(g(n)+nα+n)α+n1/2\displaystyle\lim_{n\to\infty}\left(\frac{g(n)+n}{\alpha+n}\right)^{\alpha+n-1/2} =limn(1+g(n)/n)α+n1/2(1+α/n)α+n1/2=limneα(1+g(n)/n)α+n1/2\displaystyle=\lim_{n\to\infty}\frac{(1+g(n)/n)^{\alpha+n-1/2}}{(1+\alpha/n)^{\alpha+n-1/2}}=\lim_{n\to\infty}e^{-\alpha}(1+g(n)/n)^{\alpha+n-1/2}
=eαeα=1.\displaystyle=e^{-\alpha}e^{\alpha}=1.

Summing up, we have shown that limn𝔼(eλKn/logn)=eαλ\lim_{n\to\infty}\mathds{E}\left(e^{-\lambda K_{n}/\log{n}}\right)=e^{-\alpha\lambda}, which means that Kn/logndαK_{n}/\log{n}\overset{\textup{d}}{\longrightarrow}\alpha. The proof for the conditional case follows from similar steps. Note that (Km(n)/cσ(m)Kn=k)(K_{m}^{(n)}/c_{\sigma}(m)\mid K_{n}=k) has the same asymptotic behavior as (Kn+m/cσ(m)Kn=k)(K_{n+m}/c_{\sigma}(m)\mid K_{n}=k), since Kn+m=k+Km(n)K_{n+m}=k+K_{m}^{(n)}. Moreover, its Laplace transform is

𝔼(eλKm(n)/logmKn=k)\displaystyle\mathds{E}\left(e^{-\lambda K_{m}^{(n)}/\log{m}}\mid K_{n}=k\right) =j=1m(eλ/logm)jαj(α)n(α)n+m=jm(m)|s(,j)|(n)m\displaystyle=\sum_{j=1}^{m}\left(e^{-\lambda/\log{m}}\right)^{j}\alpha^{j}\frac{(\alpha)_{n}}{(\alpha)_{n+m}}\sum_{\ell=j}^{m}\binom{m}{\ell}|s(\ell,j)|(n)_{m-\ell}
=(α)n(α)n+mj=1m(αeλ/logm)j|s(m,j;n)|\displaystyle=\frac{(\alpha)_{n}}{(\alpha)_{n+m}}\sum_{j=1}^{m}\left(\alpha e^{-\lambda/\log{m}}\right)^{j}|s(m,j;n)|
=(α)n(α)n+m(αeλ/logm+n)m\displaystyle=\frac{(\alpha)_{n}}{(\alpha)_{n+m}}\left(\alpha e^{-\lambda/\log{m}}+n\right)_{m}

The first step follows from the combinatorial identity established in Lijoi et al. (2007a). In the second step, we recognized that |s(m,j;n)|==jm(m)|s(,j)|(n)m|s(m,j;n)|=\sum_{\ell=j}^{m}\binom{m}{\ell}|s(\ell,j)|(n)_{m-\ell} is an alternative representation of the signless non-central Stirling numbers of the first kind; see equation (8.60) in Charalambides (2002). The third simplification follows by definition of |s(m,j;n)||s(m,j;n)|. Taking the limit, we obtain

limm𝔼(eλKm(n)/logmKn=k)\displaystyle\lim_{m\to\infty}\mathds{E}\left(e^{-\lambda K_{m}^{(n)}/\log{m}}\mid K_{n}=k\right) =limmΓ(α+n)Γ(α)Γ(α)Γ(α+n+m)Γ(αeλ/logm+n+m)Γ(αeλ/logm+n)\displaystyle=\lim_{m\to\infty}\frac{\Gamma(\alpha+n)}{\Gamma(\alpha)}\frac{\Gamma(\alpha)}{\Gamma(\alpha+n+m)}\frac{\Gamma(\alpha e^{-\lambda/\log{m}}+n+m)}{\Gamma(\alpha e^{-\lambda/\log{m}}+n)}
=limmΓ(αeλ/logm+n+m)Γ(α+n+m)\displaystyle=\lim_{m\to\infty}\frac{\Gamma(\alpha e^{-\lambda/\log{m}}+n+m)}{\Gamma(\alpha+n+m)}
=eαλ.\displaystyle=e^{-\alpha\lambda}.

The last step follows from calculations analogous to those for the a priori case. This concludes the proof, as the case of a general Gibbs-type prior with random α\alpha and σ=0\sigma=0 follows as an application of the tower rule, as before.
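Both limits rest on the logarithmic growth of the number of distinct species under a Dirichlet process, which is easy to visualize by simulation: under the Blackwell and MacQueen (1973) urn, the indicator that the i-th observation is a new species is Bernoulli with success probability alpha/(alpha + i - 1), independently over i, so a whole trajectory of K_n is a cumulative sum of independent coin flips. The value of alpha below is arbitrary, and the ratio approaches its limit only slowly, in line with the logarithmic rate.

# Simulating K_n under a Dirichlet process and monitoring K_n / log(n).
set.seed(3)
alpha <- 2                            # arbitrary illustrative value
n_max <- 1e6
new_species <- rbinom(n_max, size = 1, prob = alpha / (alpha + seq_len(n_max) - 1))
K <- cumsum(new_species)

check_at <- 10^(3:6)
round(K[check_at] / log(check_at), 2) # creeps up towards alpha; convergence is logarithmic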

A.3 Aldous-Pitman data augmentation

In this Section we provide further details on the data augmentation for the Aldous-Pitman model described in Section 3. Let us begin by recalling the main result, that is, the augmented likelihood

Π(n1,,nk,uγ)=2nk/21/2Γ(2nk1)γk1u2nk2eu2/22γuj=1k(1/2)nj1.\Pi(n_{1},\dots,n_{k},u\mid\gamma)=\frac{2^{n-k/2-1/2}}{{\Gamma(2n-k-1)}}\gamma^{k-1}u^{2n-k-2}e^{-u^{2}/2-\sqrt{2}\gamma u}\prod_{j=1}^{k}(1/2)_{n_{j}-1}.

Integrating with respect to uu gives the eppf of an Aldous-Pitman model with weights in equation (10), because

hk+12n(2γ)=1Γ(2nk1)0eu2/22γuu2nk2du.h_{k+1-2n}(\sqrt{2}\gamma)=\frac{1}{\Gamma(2n-k-1)}\int_{0}^{\infty}e^{-u^{2}/2-\sqrt{2}\gamma u}u^{2n-k-2}\mathrm{d}u.

From the augmented representation we immediately obtain, through direct inspection of the joint law, the full conditional distributions for γ\gamma and Un,kU_{n,k}. In particular:

fUn,k(u)=eu2/22γuu2nk20eu2/22γuu2nk2du.f_{U_{n,k}}(u\mid-)=\frac{e^{-u^{2}/2-\sqrt{2}\gamma u}u^{2n-k-2}}{\int_{0}^{\infty}e^{-u^{2}/2-\sqrt{2}\gamma u}u^{2n-k-2}\mathrm{d}u}.

Moreover, let us recall that the predictive distribution is

(Xn+1γ,X1,,Xn)=2γhk2n(2γ)hk+12n(2γ)P()+2hk12n(2γ)hk+12n(2γ)j=1k(nj1/2)δXj().\mathds{P}(X_{n+1}\in\cdot\mid\gamma,X_{1},\dots,X_{n})=\sqrt{2}\gamma\frac{h_{k-2n}(\sqrt{2}\gamma)}{h_{k+1-2n}(\sqrt{2}\gamma)}P(\cdot)+2\frac{h_{k-1-2n}(\sqrt{2}\gamma)}{h_{k+1-2n}(\sqrt{2}\gamma)}\sum_{j=1}^{k}(n_{j}-1/2)\delta_{X^{*}_{j}}(\cdot).

Substituting hk+12n(2γ)h_{k+1-2n}(\sqrt{2}\gamma) and hk2n(2γ)h_{k-2n}(\sqrt{2}\gamma) with their integral representation, we obtain

2γhk2n(2γ)hk+12n(2γ)\displaystyle\sqrt{2}\gamma\frac{h_{k-2n}(\sqrt{2}\gamma)}{h_{k+1-2n}(\sqrt{2}\gamma)} =2γΓ(2nk1)Γ(2nk)0u2nk1eu2/22γudu0u2nk2eu2/22γudu\displaystyle=\sqrt{2}\gamma\frac{\Gamma(2n-k-1)}{\Gamma(2n-k)}\frac{\int_{0}^{\infty}u^{2n-k-1}e^{-u^{2}/2-\sqrt{2}\gamma u}\mathrm{d}u}{\int_{0}^{\infty}u^{2n-k-2}e^{-u^{2}/2-\sqrt{2}\gamma u}\mathrm{d}u}
=2γ2nk10ufUn,k(u)du\displaystyle=\frac{\sqrt{2}\gamma}{2n-k-1}\int_{0}^{\infty}uf_{U_{n,k}}(u\mid-)\mathrm{d}u
=2γ2nk1𝔼(Un,k).\displaystyle=\frac{\sqrt{2}\gamma}{2n-k-1}\mathds{E}(U_{n,k}).

and

2hk12n(2γ)hk+12n(2γ)\displaystyle 2\frac{h_{k-1-2n}(\sqrt{2}\gamma)}{h_{k+1-2n}(\sqrt{2}\gamma)} =2Γ(2nk1)Γ(2nk+1)0u2nkeu2/22γudu0u2nk2eu2/22γudu\displaystyle=2\frac{\Gamma(2n-k-1)}{\Gamma(2n-k+1)}\frac{\int_{0}^{\infty}u^{2n-k}e^{-u^{2}/2-\sqrt{2}\gamma u}\mathrm{d}u}{\int_{0}^{\infty}u^{2n-k-2}e^{-u^{2}/2-\sqrt{2}\gamma u}\mathrm{d}u}
=2(2nk)(2nk1)0u2fUn,k(u)du\displaystyle=\frac{2}{(2n-k)(2n-k-1)}\int_{0}^{\infty}u^{2}f_{U_{n,k}}(u\mid-)\mathrm{d}u
=2(2nk)(2nk1)𝔼(Un,k2).\displaystyle=\frac{2}{(2n-k)(2n-k-1)}\mathds{E}(U^{2}_{n,k}).

Note that this predictive scheme still does not provide a manageable sampling algorithm for Xn+1X1,,XnX_{n+1}\mid X_{1},\dots,X_{n} because the expectations 𝔼(Un,k)\mathds{E}(U_{n,k}) and 𝔼(Un,k2)\mathds{E}(U^{2}_{n,k}) are not easily available. However, one can resort to Algorithm 1, which is based on the following additional data augmentation. Let us define a binary random variable Dn+1D_{n+1} such that Dn+1=1D_{n+1}=1 iff Xn+1=“new”X_{n+1}=\text{``new''} and 0 otherwise. Moreover, let U~n,k\tilde{U}_{n,k} be a positive random variable such that the joint law of Dn+1D_{n+1} and U~n,k\tilde{U}_{n,k} is:

(Dn+1=1,U~n,kdu)=2γ2nk1ufUn,k(u)du,\mathds{P}(D_{n+1}=1,\tilde{U}_{n,k}\in\mathrm{d}u)=\frac{\sqrt{2}\gamma}{2n-k-1}uf_{U_{n,k}}(u\mid-)\mathrm{d}u,

and

(Dn+1=0,U~n,kdu)=u22nk1fUn,k(u)du.\mathds{P}(D_{n+1}=0,\tilde{U}_{n,k}\in\mathrm{d}u)=\frac{u^{2}}{2n-k-1}f_{U_{n,k}}(u\mid-)\mathrm{d}u.

Then, the sampling strategy is as follows: sample U~n,k\tilde{U}_{n,k} from its marginal distribution and then Dn+1U~n,kD_{n+1}\mid\tilde{U}_{n,k}. Both distributions are easy to sample from and their laws are described in Algorithm 1. Once it has been determined whether the observation Xn+1X_{n+1} is either new or old, the remaining part of the sampling algorithm is trivial.

Algorithm 1 Sampling procedure for (Xn+1X1,,Xn)(X_{n+1}\mid X_{1},\dots,X_{n}) for an Aldous-Pitman model.
Input: X1,,XnX_{1},\dots,X_{n}.
1. Sample a positive latent variable U~n,k\tilde{U}_{n,k} from the log-concave density
fU~n,k(u)(γ2u(nk/21/2)+u2(2nk1))u2nk2e(u/2+γ)2.f_{\tilde{U}_{n,k}}(u)\propto\left(\frac{\gamma}{\sqrt{2}}\frac{u}{(n-k/2-1/2)}+\frac{u^{2}}{(2n-k-1)}\right)u^{2n-k-2}e^{-(u/\sqrt{2}+\gamma)^{2}}.
2. Sample a binary random variable D_{n+1} indicating whether X_{n+1} is new, according to
(Dn+1=1U~n,k=u)γ2u(nk/21/2),(Dn+1=0U~n,k=u)u2(2nk1).\mathds{P}(D_{n+1}=1\mid\tilde{U}_{n,k}=u)\propto\frac{\gamma}{\sqrt{2}}\frac{u}{(n-k/2-1/2)},\quad\mathds{P}(D_{n+1}=0\mid\tilde{U}_{n,k}=u)\propto\frac{u^{2}}{(2n-k-1)}.
3. Sample Xn+1X_{n+1} given the binary variable Dn+1D_{n+1}:
if D_{n+1}=1 (i.e. if X_{n+1}=\text{``new''}) then
  3.a Sample a new value Xn+1X_{n+1} from the baseline PP;
else if D_{n+1}=0 (i.e. if X_{n+1}=\text{``old''}) then
  3.b Sample Xn+1X_{n+1} from the discrete distribution (Xn+1=XjDn+1=0)(nj1/2)\mathds{P}(X_{n+1}=X^{*}_{j}\mid D_{n+1}=0)\propto(n_{j}-1/2).
end if
return the sampled value Xn+1X_{n+1}.
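To make Algorithm 1 concrete, the following is a minimal sketch in Python with numpy. The function name sample_next and the grid-based inverse-cdf draw in step 1 are illustrative choices of ours; since the density of \tilde{U}_{n,k} is log-concave, the grid step could equally be replaced by an adaptive rejection sampler.

import numpy as np

def sample_next(atoms, counts, gamma, base_sampler, grid_size=4000, rng=None):
    """One draw from (X_{n+1} | X_1, ..., X_n) following Algorithm 1.
    atoms: the k distinct observed values X*_1, ..., X*_k;
    counts: their multiplicities n_1, ..., n_k;
    base_sampler: a function drawing a fresh atom from the baseline measure P."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(), len(counts)
    # Step 1: draw the latent variable from a density proportional to
    # (sqrt(2)*gamma*u + u^2) * u^(2n-k-2) * exp(-u^2/2 - sqrt(2)*gamma*u),
    # i.e. proportional to the log-concave density in Algorithm 1,
    # here approximated on a grid and sampled by inverting the cdf.
    grid = np.linspace(1e-8, np.sqrt(2 * n) + 6.0, grid_size)
    logf = (np.log(np.sqrt(2) * gamma * grid + grid**2)
            + (2 * n - k - 2) * np.log(grid)
            - grid**2 / 2 - np.sqrt(2) * gamma * grid)
    weights = np.exp(logf - logf.max())
    cdf = np.cumsum(weights)
    u = np.interp(rng.uniform(), cdf / cdf[-1], grid)
    # Step 2: P(D_{n+1}=1 | u) is proportional to sqrt(2)*gamma*u and
    # P(D_{n+1}=0 | u) to u^2, so the probability of a new value
    # equals sqrt(2)*gamma / (sqrt(2)*gamma + u).
    if rng.uniform() < np.sqrt(2) * gamma / (np.sqrt(2) * gamma + u):
        # Step 3.a: a new value drawn from the baseline P.
        return base_sampler(rng)
    # Step 3.b: an existing value, chosen with probability proportional to n_j - 1/2.
    probs = counts - 0.5
    return atoms[rng.choice(k, p=probs / probs.sum())]

For instance, calling sample_next(['A', 'B'], [3, 1], gamma=1.0, base_sampler=lambda rng: 'new') repeatedly approximates the predictive law for a sample with n=4 and k=2.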

A.4 Proof of Proposition 3

The proof of this proposition follows by repeatedly applying the argument discussed in Rigon et al. (2025) for the enriched Pitman-Yor process, corresponding to the case where L=2L=2 and the involved random measures are either Pitman–Yor processes or Dirichlet processes.

A.5 Details of the Gibbs sampling algorithm of Section 5.2

The augmented and coarsened likelihood of the model in Section 5.2 is

(𝑿1,,𝑿n𝑺,𝑼)α1k1(α1)nlayer 1: families\displaystyle\mathscr{L}(\bm{X}_{1},\dots,\bm{X}_{n}\mid\bm{S},\bm{U})\propto\underbrace{\frac{\alpha_{1}^{k_{1}}}{(\alpha_{1})_{n}}}_{\textup{layer 1: families}} ×j=1k1α(Xj,1)k(Xj,1)(α(Xj,1))n(Xj,1)layer 2: genera within family\displaystyle\times\overbrace{\prod_{j=1}^{k_{1}}\frac{\alpha(X^{*}_{j,1})^{k(X^{*}_{j,1})}}{(\alpha(X^{*}_{j,1}))_{n(X^{*}_{j,1})}}}^{\textup{layer 2: genera within family}}
×j=1k2[γ(Xj,2)k(Xj,2)1uj2n(Xj,2)k(Xj,2)2euj2/22γ(Xj,2)uj]ρlayer 3: species within genus,\displaystyle\times\underbrace{\prod_{j=1}^{k_{2}}\left[\gamma(X^{*}_{j,2})^{k(X^{*}_{j,2})-1}u_{j}^{2n(X^{*}_{j,2})-k(X^{*}_{j,2})-2}e^{-u_{j}^{2}/2-\sqrt{2}\gamma(X^{*}_{j,2})u_{j}}\right]^{\rho}}_{\textup{layer 3: species within genus}},

where \bm{U}=(u_{1},\dots,u_{k_{2}}) denotes the collection of latent variables. Note that the likelihood function factorizes, so inference can be performed independently for each layer of the model. For layer 2, inference proceeds separately for each \alpha(X^{*}_{j,1}) as described in Section 5.1, since the diversities \alpha(X^{*}_{j,1}) are assigned independent Stirling-gamma priors. For layer 3, instead, a Gibbs-sampling algorithm is required, alternating between the following steps; a schematic sketch of steps 1 and 2 is provided after the list.

1. For j=1,\dots,k_{2}, sample independently from the (coarsened) full conditionals of the diversities, (\gamma(X^{*}_{j,2})\mid-)\sim\text{Gamma}(a_{\gamma}+\rho\{k(X^{*}_{j,2})-1\},\,b_{\gamma}+\rho\sqrt{2}u_{j}).

2. For j=1,\dots,k_{2}, sample independently from the (coarsened) full conditionals of (U_{j}\mid-), whose densities are f_{U_{j}}(u)\propto u^{\rho\{2n(X^{*}_{j,2})-k(X^{*}_{j,2})-2\}}e^{-\rho u^{2}/2-\rho\sqrt{2}\gamma(X^{*}_{j,2})u}. Exact sampling from this distribution is feasible using the algorithm described in Sun et al. (2023) or, alternatively, through a ratio-of-uniforms acceptance-rejection algorithm.

3. Sample the hyperparameters (\log{a_{\gamma}},\log{b_{\gamma}}\mid-) from their full conditional distribution. This step is equivalent to sampling from the posterior of the parameters of i.i.d. Gamma-distributed observations under a multivariate Gaussian prior on the log-parameters. A standard Metropolis step is employed, with a carefully tuned Gaussian proposal distribution to ensure good mixing.
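A minimal sketch of steps 1 and 2, in Python with numpy, is reported below. The array names n_g, k_g and u (holding, for each genus, n(X^{*}_{j,2}), k(X^{*}_{j,2}) and the current latent value u_{j}) are our own notation, the grid-based draw for U_{j} is an illustrative stand-in for the exact sampler of Sun et al. (2023) or a ratio-of-uniforms step, and the hyperparameter update of step 3 is omitted.

import numpy as np

def layer3_sweep(n_g, k_g, u, a_gamma, b_gamma, rho, grid_size=4000, rng=None):
    """One Gibbs sweep over the k_2 genera for steps 1 and 2 of the layer-3 sampler."""
    rng = np.random.default_rng() if rng is None else rng
    gam = np.empty(len(n_g))
    for j in range(len(n_g)):
        # Step 1: coarsened full conditional of gamma_j,
        # Gamma(a_gamma + rho*(k_j - 1), rate = b_gamma + rho*sqrt(2)*u_j).
        shape = a_gamma + rho * (k_g[j] - 1)
        rate = b_gamma + rho * np.sqrt(2) * u[j]
        gam[j] = rng.gamma(shape, 1.0 / rate)  # numpy uses the scale, not the rate
        # Step 2: full conditional of U_j, proportional to
        # u^{rho*(2*n_j - k_j - 2)} * exp(-rho*u^2/2 - rho*sqrt(2)*gamma_j*u),
        # drawn on a grid whose upper bound is a rough heuristic (wider for small rho).
        grid = np.linspace(1e-8, np.sqrt(2 * n_g[j]) + 6.0 / np.sqrt(rho), grid_size)
        logf = (rho * (2 * n_g[j] - k_g[j] - 2) * np.log(grid)
                - rho * grid**2 / 2 - rho * np.sqrt(2) * gam[j] * grid)
        weights = np.exp(logf - logf.max())
        cdf = np.cumsum(weights)
        u[j] = np.interp(rng.uniform(), cdf / cdf[-1], grid)
    return gam, u

Step 3 would then update (\log{a_{\gamma}},\log{b_{\gamma}}) through a random-walk Metropolis step, as described above.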

Appendix B Additional plots for the application in Section 5.1

Additional plots for the application discussed in Section 5.1. Simulated values are based on 10^{6} Monte Carlo replicates using the Stirling-gamma sampling algorithm of Zito et al. (2024), combined with a Poisson approximation for K_{N}, as discussed in the main text.

Figure 9: Calibration plot comparing the coarsening level ρ\rho and the expected log-likelihood \mathds{E}\{k\log{\alpha}-\log{(\alpha)_{n}}\}, where the expectation is taken with respect to the coarsened posterior of α\alpha, for various levels of ρ\rho.
Figure 10: Posterior distribution (ρ=1\rho=1) of the fundamental biodiversity number α\alpha, using the Amazonian tree dataset of ter Steege et al. (2013). The dotted lines represent 98%98\% credible intervals. The dashed line is the posterior mean.
Figure 11: Posterior distribution (ρ=1\rho=1) of the total number of distinct tree species KNK_{N}, using the Amazonian tree dataset of ter Steege et al. (2013). The dotted lines represent 98%98\% credible intervals. The dashed line is the posterior mean.
Figure 12: Credible intervals of the posterior distribution of the fundamental biodiversity number, using the Amazonian tree dataset of ter Steege et al. (2013), for various choices of ρ{0.001,0.01,0.1,0.25,1}\rho\in\{0.001,0.01,0.1,0.25,1\}.
Figure 13: Credible intervals of the posterior distribution of the total number of distinct tree species KNK_{N}, using the Amazonian tree dataset of ter Steege et al. (2013), for various choices of ρ{0.001,0.01,0.1,0.25,1}\rho\in\{0.001,0.01,0.1,0.25,1\}.