
Logistic regression models: practical induced prior specification

Ken B. Newman Biomathematics & Statistics Scotland, Edinburgh, UK. School of Mathematics and Maxwell Institute for Mathematical Sciences, University of Edinburgh, Edinburgh, UK. Cristiano Villa Division of Natural and Applied Sciences, Duke-Kunshan University, Kunshan, China School of Mathematics, Statistics and Physics, Newcastle University, Newcastle-upon-Tyne, UK Ruth King School of Mathematics and Maxwell Institute for Mathematical Sciences, University of Edinburgh, Edinburgh, UK.
Abstract

Bayesian inference for statistical models with a hierarchical structure is often characterized by specification of priors for parameters at different levels of the hierarchy. When higher level parameters are functions of the lower level parameters, specifying a prior on the lower level parameters leads to induced priors on the higher level parameters. However, what are deemed uninformative priors for lower level parameters can induce strikingly non-vague priors for higher level parameters. Depending on the sample size and specific model parameterization, these priors can then have unintended effects on the posterior distribution of the higher level parameters.

Here we focus on Bayesian inference for the Bernoulli distribution parameter θ, which is modeled as a function of covariates via a logistic regression, where the coefficients are the lower level parameters for which priors are specified. A specific area of interest and application is the modeling of survival probabilities in capture-recapture data and of occupancy and detection probabilities in presence-absence data. In particular, we propose alternative priors for the coefficients that yield specific induced priors for θ. We address three induced prior cases. The simplest is when the induced prior for θ is Uniform(0,1). The second is when the induced prior for θ is an arbitrary Beta(α, β) distribution. The third is one where the intercept in the logistic model is treated distinctly from the partial slope coefficients; e.g., E[θ] equals a specified value on (0,1) when all covariates equal 0. Simulation studies were carried out to evaluate the performance of these priors, and the methods were applied to a real presence/absence data set and occupancy modelling.

Keywords— Bayesian inference, Induced priors, Logistic regression, Objective priors, Occupancy modeling.

1 Introduction

Many processes with binary outcomes, say 0 or 1, are modeled as Bernoulli random variables Y, where Pr(Y=1) = θ and Pr(Y=0) = 1 − θ. The value Y=1 can correspond to a variety of nominal outcomes such as success, survival or presence, while the value Y=0 corresponds to the opposite: failure, death or absence, respectively. Such Bernoulli processes appear in several areas of statistical ecology, such as modeling the probability of an animal surviving over some time interval, the probability of an animal producing young, the probability of an animal's or plant's presence at a given location and the probability of detecting such presence.

Often the Bernoulli probabilities are modelled as functions of covariates, x1, …, xp, and to ensure that the probability θ satisfies the condition 0 < θ < 1, a logistic transformation of θ is often used:

ln(θ/(1−θ)) = β0 + β1x1 + … + βpxp.

We label the logistic transformation above logit(θ). When Bayesian inference is carried out, prior distributions are specified for the coefficients β0, β1, …, βp. These prior distributions induce a prior distribution on θ.

The choice of priors for Bayesian inference and their effect has been long and frequently discussed (Gelman et al., 2014; Hobbs and Hooten, 2015). Several approaches have been developed for selecting and formulating priors (Jeffreys, 1946; Bernardo, 1979; Datta and Sweeting, 2005; Kass and Wasserman, 1996; Simpson et al., 2017; Leisen et al., 2020; Walker and Villa, 2021; Antoniano-Villalobos et al., 2024), with advice on approaches to prior selection given by, e.g., Banner et al. (2020) and van de Schoot et al. (2021). Concerns about the choice of priors, and criticism of Bayesian methods in general, have been voiced by many, particularly in ecological applications (Lele and Dennis, 2009; Lele, 2020), and the use of so-called default priors, those built into various computer software, is a general concern (Banner et al., 2020).

A specific concern is that the use of seemingly innocuous “non-informative” priors at one level of a hierarchical model can induce strikingly non-innocuous, i.e., relatively concentrated, priors at a higher level. This has been noted specifically for the priors for the β’s in logistic regression models (Seaman III et al., 2012; Northrup and Gerber, 2018; Lele, 2020). For example, Seaman III et al. (2012) show how Normal(0, 25²) priors for α and β in the simplest logistic model, logit(θ) = α + βx, induce a so-called bathtub shaped prior on θ with most of the probability mass concentrated at the extremes 0 and 1. Similarly, Northrup and Gerber (2018) show the effects of a variety of such priors in the analysis of presence/absence data and occupancy modeling, where there are two Bernoulli probabilities, one for presence and another for detection conditional on presence, both often modeled as functions of covariates via logistic regression. In the context of occupancy modeling, Lele (2020) examined posterior inference for a function of the presence and detection probabilities, namely the probability that a site is occupied but is observed to be unoccupied on all visits. He pointed out a sizable lack of invariance of the posterior distribution of this function to two alternative, but mathematically equivalent, parameterizations of the occupancy model, with samples of size at or around 30: for one parameterization the point estimate is 0.30 and for the other it is 0.67.

While sometimes the sample size is sufficient to yield a posterior distribution for θ that is dominated by the data, and therefore relatively unaffected by the prior, the practitioner might prefer to start with induced priors that more closely match their opinion, such as θ ~ Uniform(0,1) or, more generally, θ ~ Beta(α, β).

We do note that there are, of course, cases where informative priors that reflect expert opinion are desired (O’Hagan, 2008, 2019) and that the use of subjective priors can be defensible (Banner et al., 2020).

The purpose of this paper is to present procedures for specifying priors for the coefficients of a logistic regression model with p covariates which induce priors for a Bernoulli probability θ that approximate a specified distribution, in particular arbitrary Beta distributions. This includes the specific case of Beta(1,1), or Uniform(0,1). While Northrup and Gerber (2018) provide some advice for specifying priors for an intercept-only model, namely p = 0, they do not address the effects of multiple covariates on the induced prior for θ. We also consider the setting where the intercept is to be treated differently than the coefficients for the covariates; e.g., E[θ] equals a specified value on (0,1) when all covariates equal 0.

In Section 2 we begin with the simplest case of an intercept-only model, logit(θ) = β. We derive the induced prior for the Bernoulli probability θ given the prior for the intercept, π(β), and vice versa, and present a normal prior for β that yields approximately a Uniform(0,1) prior for θ. Section 3 extends the approach to p ≥ 1 covariates to induce a Uniform(0,1) prior on θ. In Section 4 the case where the induced prior for θ is any Beta distribution is addressed, along with an approach for distinct handling of the intercept. Simulation analyses are presented in Section 5 and the method is applied to a real occupancy data set in Section 6.

2 Logistic regression with constant mean model

We focus on logistic regression where the logit link function is used to model the probability of “success” with a Bernoulli random variable:

Y ~ Bernoulli(θ);
η = g(θ) = ln(θ/(1−θ)) = β0x0 + β1x1 + … + βpxp. (1)

The focus in this section is to derive priors for the coefficients of the above logistic regression model such that the induced prior for θ is a Uniform(0,1) distribution. These simple results will be built on in later sections to yield a Uniform(0,1) prior for θ when there are p ≥ 2 covariates in the regression model, and then are extended to the case where the induced prior is a Beta(α, β) for arbitrary shape parameters.

2.1 From π(β) to π(θ)

Regarding the induced priors for θ, for maximum simplicity we begin with a linear model for η that is a constant, i.e., η = β:

η = g(θ) = ln(θ/(1−θ)) = β.

Let π(β) be the prior distribution for β, which in this simple case is also the prior for η, i.e., π(η) = π(β). The induced prior for θ can be found by the change of variable theorem (Siegrist, 2024):

π_θ(θ) = π_β(h⁻¹(θ)) |dh⁻¹(θ)/dθ|,

where h(β) = θ. We define

θ = h(β) = exp(β)/(1 + exp(β)),

which leads to the induced pdf for θ:

π_θ(θ) = π_β(ln[θ/(1−θ)]) |1/(θ(1−θ))|.

Example.

Suppose a Normal(0, σ²) prior is chosen for β. The induced prior for θ is then

π(θ) = (1/√(2πσ²)) exp[−(ln(θ/(1−θ)))²/(2σ²)] · 1/(θ(1−θ)).

Figure 1(a) is a plot of the induced prior for θ given a Normal(0, σ² = 3²) prior for β. This is a “bathtub prior” for θ, which concentrates the prior probability around the extremes 0 and 1; the concentration becomes more extreme the larger the variance for β, as can be seen in Figure 1(b). Seaman III et al. (2012), Northrup and Gerber (2018) and Lele (2020) provide additional examples.

Figure 1: Examples of the induced prior for Bernoulli(θ) with logit linear model = β, where π(β) is Normal(0, σ²): (a) σ² = 3²; (b) increasing σ².
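This bathtub behaviour is easy to reproduce numerically. The following Python sketch (illustrative only; not the paper's R code) draws β from a Normal(0, 3²) prior, maps each draw through the inverse logit, and tallies how much of the induced prior mass for θ lands near the extremes versus the middle:

```python
import math
import random

random.seed(1)

def expit(b):
    # Inverse logit: maps beta on the real line to theta in (0, 1)
    return 1.0 / (1.0 + math.exp(-b))

# theta draws induced by a Normal(0, 3^2) prior on beta
draws = [expit(random.gauss(0.0, 3.0)) for _ in range(100_000)]

extreme = sum(t < 0.1 or t > 0.9 for t in draws) / len(draws)
middle = sum(0.4 < t < 0.6 for t in draws) / len(draws)
```

Roughly 46% of the induced prior mass falls outside (0.1, 0.9), while only about 11% lies in (0.4, 0.6), matching the bathtub shape in Figure 1(a).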

2.2 From π(θ) to π(β)

Our objective is to determine the induced prior for β given a desired, or target, prior for θ; the above process is reversed. For example, if the desired prior for θ is Uniform(0,1), the problem is to determine what prior for β would yield such a prior for θ. To begin,

β = g(θ) = ln(θ/(1−θ)).

Then the inverse operation is θ = g⁻¹(β) = exp(β)/(1 + exp(β)), and dg⁻¹(β)/dβ = exp(β)/(1 + exp(β))². Consequently the induced pdf for β is:

π_β(β) = exp(β)/(1 + exp(β))², −∞ < β < ∞. (2)

Figure 2 is a plot of the induced prior for β given a Uniform(0,1) prior for θ.

Figure 2: Induced prior for β when β = logit(θ) and θ ~ Uniform(0,1).

2.3 Logistic Distribution

The pdf of the induced prior for β in Equation (2) is a special case of the logistic distribution, which has location parameter μ and scale parameter s:

Logistic(x; μ, s) = exp(−(x−μ)/s) / ( s (1 + exp(−(x−μ)/s))² ).

The corresponding expectation and variance are:

E[X] = μ, V[X] = s²π²/3.

The pdf in Equation (2) is then Logistic(μ = 0, s = 1); due to the symmetry of the pdf around 0, the pdf is the same for β and −β, E[β] = 0 (as Figure 2 indicates) and V[β] = π²/3. We can conclude that if a Logistic(0,1) prior is used for β, the induced prior for θ is Uniform(0,1). Depending on the context, the target induced prior for θ may not be Uniform(0,1), e.g., it might be Beta(α, β); this case is addressed later.
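A quick Monte Carlo check of this claim (a sketch under the setup above, not the paper's code): sample β from Logistic(0,1) by inverse-CDF sampling, transform to θ, and compare the moments of θ with those of a Uniform(0,1), whose mean and variance are 1/2 and 1/12.

```python
import math
import random
import statistics

random.seed(2)

def rlogistic(mu=0.0, s=1.0):
    # Inverse-CDF draw: if U ~ Uniform(0,1), then mu + s*ln(U/(1-U)) ~ Logistic(mu, s)
    u = random.random()
    return mu + s * math.log(u / (1.0 - u))

betas = [rlogistic() for _ in range(200_000)]
thetas = [1.0 / (1.0 + math.exp(-b)) for b in betas]

var_beta = statistics.pvariance(betas)    # Logistic(0,1) variance is pi^2/3
mean_theta = statistics.fmean(thetas)     # Uniform(0,1) mean is 1/2
var_theta = statistics.pvariance(thetas)  # Uniform(0,1) variance is 1/12
```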

2.4 Normal approximation to Logistic(0,1)

Given the shape of the Logistic(0,1) distribution and its symmetry about zero, a Normal(0, π²/3) distribution could be used as an approximation. As Figure 3 shows, the two densities are close, although the extreme tails of the logistic are heavier than those of the variance-matched normal.

Figure 3: Induced prior (Logistic(0,1)) for β in the logit linear model for a Bernoulli parameter θ when π(θ) is Uniform(0,1), together with a Normal(0, π²/3) approximation.

3 Moment matching for π(β)’s for multivariate logistic regression

Now we consider the more general logistic regression setting where θ is a function of p covariates, as shown in Equation (1). From Section 2, if the prior for η is Logistic(0,1), then the induced prior for θ is Uniform(0,1). Therefore we need

η = β0 + β1x1 + … + βpxp ~ Logistic(0,1),

where we have assumed x0 = 1. Given the p+1 β parameters, there are p+1 prior distributions to specify, and there are two complications here. One is that there is a many-to-one mapping from the β’s to the single θ: in theory there are an infinite number of combinations of prior distributions for the β’s that induce a Logistic(0,1) distribution. The second is that the values of x1, …, xp will affect the distribution of η. Here we examine approximate solutions to these two problems, aiming to match the expected value and variance of η.

3.1 Linear model with all xi = 1

We begin with an unrealistic setting, to make the approach clearer, by assuming that the linear model for η is simply a sum of p+1 coefficients; equivalently, assume that all xi = 1:

η = g(θ) = β0 + β1·1 + … + βp·1 = Σ_{i=0}^p βi.

Hence there is a (p+1)-dimensional prior, π(β0, …, βp). The many-to-one problem can be bypassed by assuming that the prior distributions are all the same and independent. By moment matching we can determine what the mean, μ_β, and variance, σ²_β, of the in-common prior, denoted π(β), should be:

E[η] = 0 = E[β0] + E[β1] + … + E[βp] = (p+1)μ_β;
V[η] = π²/3 = V[β0] + V[β1] + … + V[βp] = (p+1)σ²_β.

So μ_β = 0 and σ²_β = π²/(3(p+1)). When p = 0, we obtain the variance of π²/3 considered in Section 2.4.
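This moment matching can be checked by simulation (an illustrative sketch; the choice p = 4 is ours): drawing p + 1 iid Normal(0, π²/(3(p+1))) coefficients and summing them reproduces the Logistic(0,1) mean and variance for η, and the induced θ is then approximately, though not exactly, Uniform(0,1), since a sum of normals is normal rather than logistic.

```python
import math
import random
import statistics

random.seed(3)
p = 4  # number of slope coefficients (an arbitrary choice for the sketch)
sigma_beta = math.sqrt(math.pi ** 2 / (3 * (p + 1)))

# eta is the sum of p + 1 iid Normal(0, pi^2/(3(p+1))) coefficients, all x_i = 1
etas = [sum(random.gauss(0.0, sigma_beta) for _ in range(p + 1))
        for _ in range(100_000)]
thetas = [1.0 / (1.0 + math.exp(-e)) for e in etas]

var_eta = statistics.pvariance(etas)      # target: pi^2/3
var_theta = statistics.pvariance(thetas)  # Uniform(0,1) would give 1/12
```

The variance of θ comes out close to, but slightly above, the Uniform(0,1) value of 1/12, reflecting the mild bathtub shape of the normal approximation.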

3.2 Moment matching with standardised covariates

We now examine the more realistic case where there are covariates in the linear model. We will, however, assume that the covariates x1, …, xp are standardised, xi = (xi* − mean(xi*))/sd(xi*), where xi* are the original values; thus each xi has sample mean 0 and sample standard deviation 1. To bypass the many-to-one problem, we again assume in-common, independent prior distributions. To begin, we assume that all the coefficients β0, …, βp are iid with prior π_β(β).

We again use moment matching to determine what conditions are necessary for E[η] = 0 and V[η] = π²/3 to be satisfied:

E[η] = 0 = E[β0 + β1x1 + … + βpxp] = E[β0] + Σ_{i=1}^p E[βi]E[Xi].

Given that the covariate values have been standardised, their expected values E[Xi] equal 0, so the second term above equals 0. (This treats the expectation as taken over a finite space of at most n values for each xi, with each xi implicitly drawn at random independently of xj, i ≠ j, which does not recognize any within-sample dependencies.) To ensure E[η] = 0, E[β0] must then also equal 0; if the β’s are iid, their expected values all equal 0.

We next examine V[η], assuming V[xi] = 1:

V[η] = π²/3 = V[β0 + Σ_{i=1}^p βixi]
 = V_β E_{x|β}[β0 + Σ_{i=1}^p βixi | β] + E_β V_{x|β}[β0 + Σ_{i=1}^p βixi | β]
 = V_β[β0 + Σ_{i=1}^p βiE[xi]] + E_β[Σ_{i=1}^p βi²V[xi]]
 = V_β[β0] + 0 + Σ_{i=1}^p E_β[βi²] × 1
 = σ²_β(1 + p).

To match the Logistic(0,1) variance of π²/3, we require

σ²_β = π²/(3(p+1)).

Therefore, as in the case with all xi = 1, μ_β = 0 and σ²_β = π²/(3(p+1)). The results are identical because the variance of the standardised xi’s (namely, 1) is the same as the assigned constant 1 in the previous case. Newman (2003) proposed these same normal prior distributions for an analysis of release-recovery data, but the derivation details were not shown.
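The result can be verified with actual standardised covariates. In the sketch below (hypothetical data; the skewed covariate distribution is an arbitrary choice), each column is standardised to mean 0 and population standard deviation 1, and the prior-predictive variance of η, over both the coefficient priors and a randomly chosen covariate row, again lands on π²/3:

```python
import math
import random
import statistics

random.seed(4)
n, p = 200, 3
sigma_beta = math.sqrt(math.pi ** 2 / (3 * (p + 1)))

# Skewed raw covariates (each a sum of exponentials), then column-standardised
raw = [[sum(random.expovariate(1.0) for _ in range(3)) for _ in range(p)]
       for _ in range(n)]
cols = list(zip(*raw))
means = [statistics.fmean(c) for c in cols]
sds = [statistics.pstdev(c) for c in cols]
x = [[(row[i] - means[i]) / sds[i] for i in range(p)] for row in raw]

etas = []
for _ in range(100_000):
    betas = [random.gauss(0.0, sigma_beta) for _ in range(p + 1)]
    row = random.choice(x)
    etas.append(betas[0] + sum(b * v for b, v in zip(betas[1:], row)))

var_eta = statistics.pvariance(etas)  # target: pi^2/3
```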

4 Arbitrary Beta distribution for θ

Now we consider the case where an arbitrary Beta distribution has been chosen for the Bernoulli parameter θ, and we determine the induced distribution for the logit transform. If the prior for θ is Beta(α, β) and η = logit(θ), the induced pdf for η is

p(η) = (1/B(α, β)) · exp(ηα)/(1 + exp(η))^(α+β), (3)

where B(α, β) is the beta function, Γ(α)Γ(β)/Γ(α+β). This is the pdf of the Type IV generalized logistic distribution (Johnson et al., 1995). Examples of the pdf for η, Equation (3), given different Beta(α, β) distributions for θ, are shown in Figure 4.

Figure 4: Density plots for η = logit(θ), where θ ~ Beta(α, β), for a variety of Beta distribution shape parameters. Numerically calculated means and variances are shown below each plot, and a normal approximation is shown with a dashed blue line.

The values of E[η] and Var[η] shown in Figure 4 were found by numerically integrating

E[η^k] = (1/B(α, β)) ∫_{−∞}^{∞} η^k exp(ηα)/(1 + exp(η))^(α+β) dη, k = 1, 2.

Hereafter we denote E[η] by μ_η and Var[η] by σ²_η.
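The same moments can also be estimated by Monte Carlo with only standard-library tools (exact expressions exist as well, in terms of polygamma functions). This sketch, which is not the paper's code, samples θ from the chosen Beta distribution and takes the sample mean and variance of logit(θ):

```python
import math
import random
import statistics

random.seed(5)

def logit_moments(alpha, beta, n=200_000):
    # Monte Carlo estimate of (mu_eta, sigma^2_eta) for eta = logit(theta),
    # where theta ~ Beta(alpha, beta)
    etas = []
    for _ in range(n):
        t = random.betavariate(alpha, beta)
        etas.append(math.log(t / (1.0 - t)))
    return statistics.fmean(etas), statistics.pvariance(etas)

mu_eta, var_eta = logit_moments(1.0, 1.0)  # Beta(1,1): exact values are 0 and pi^2/3
```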

4.1 Normal approximation: intercept-only model

As for the special case of Uniform(0,1), i.e., Beta(1,1), a normal approximation to the distribution of η can be obtained by matching the moments μ_η and σ²_η.

For an intercept-only model, η = β0, and the prior for β0 is

β0 ~ Normal(μ_η, σ²_η).

Examples of such normal approximations to the pdf of η are shown by dashed blue lines in Figure 4, which show that the adequacy of the approximation is a function of α and β.

4.2 Normal approximation: Multiple covariates

In the case of one or more covariates, as for the Uniform(0,1) case, there is again the many-to-one problem: there are an infinite number of combinations of normal priors for which the expected value and variance of η would equal μ_η and σ²_η. Furthermore, given that standardizing the covariates to have mean zero implies E[βiXi] = 0, the mean μ_η must come entirely from the prior mean for the intercept β0. Again, we can avoid the many-to-one problem by using p independent and identically distributed normal priors for the slope coefficients. Namely,

β0 ~ Normal(μ_η, σ²_η/(p+1)); (4)
βi ~ Normal(0, σ²_η/(p+1)), i = 1, …, p. (5)

Setting the mean of the priors for the βi terms equal to 0 is arbitrary, and could be any value, since E[βiXi] = 0 regardless. Then

E[η] = E[β0] + Σ_{i=1}^p E[βixi] = μ_η + 0 = μ_η;
Var[η] = Var[β0] + Σ_{i=1}^p Var[βixi] = σ²_η/(p+1) + p·σ²_η/(p+1) = σ²_η.

As p increases, the priors for both the intercept and the slope coefficients become increasingly concentrated around μ_η and 0, respectively. We next present an alternative which does not cause such shrinkage in the prior for the intercept.

We note that while there are different classes of so-called shrinkage priors, e.g., spike and slab priors, their intent differs somewhat: to shrink small effects to zero while maintaining true large effects (Van Erp et al., 2019). The intent here is instead to yield a desired induced distribution on a particular parameter.

4.3 Distinct handling of the intercept

In some cases there may be a prior opinion regarding the value of θ even if the covariates have no effect, or for an “average” covariate setting, and a distinct handling of the intercept can then be considered. In particular, one may not want the number of covariates to affect the prior for the intercept, i.e., to have the prior for β0 become increasingly concentrated around zero.

We begin by assuming a more general prior for θ, Beta(α, β), rather than the special case dealt with so far, namely α = β = 1, or Uniform(0,1). Direct specification of the α and β parameters could be done as well, but thinking in terms of the average value of θ and its variation may be easier. Specifically, we first specify the expected value of θ and a measure of its variation, the coefficient of variation (CV), and then determine what values of α and β match these specified values. For example, if we specify E[θ] = 0.7 and CV[θ] = 0.30, then the prior for θ is Beta(2.633, 1.129). Given the Beta distribution parameters, and referring to Equation (3), the expected value and variance of η can be numerically calculated: in this example μ_η ≈ 1.146 and σ²_η ≈ 1.785. Again assuming that the priors for β0 and βi, i = 1, …, p, are normal distributions, the resulting E[η] and Var[η] guide the choice of the normal distribution parameters.
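The mean/CV parameterisation can be inverted in closed form using the standard Beta moment identities E[θ] = α/(α+β) and Var[θ] = E[θ](1−E[θ])/(α+β+1). A small helper (our sketch; the function name is ours, not the paper's) recovers the Beta(2.633, 1.129) example above:

```python
def beta_from_mean_cv(mean, cv):
    """Return Beta(alpha, beta) shape parameters matching E[theta] and CV[theta].

    Solves E = a/(a+b) and Var = E(1-E)/(a+b+1) with Var = (mean*cv)^2;
    requires Var < mean*(1-mean) for a valid Beta distribution.
    """
    var = (mean * cv) ** 2
    total = mean * (1.0 - mean) / var - 1.0  # total = alpha + beta
    if total <= 0.0:
        raise ValueError("CV too large for a Beta distribution with this mean")
    return mean * total, (1.0 - mean) * total

alpha, beta = beta_from_mean_cv(0.7, 0.30)  # approximately (2.633, 1.129)
```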

We note that, except when α = β (thus E[θ] = 0.5 and E[η] = 0), the contribution of μ_η to the linear combination (Equation 1) must come entirely from the prior mean for the intercept, β0. This is because the covariates have been standardized to have mean 0, so E[βixi] = 0 whatever the expected value of βi under its normal prior.

To bypass the many-to-one issue we again assume that the priors for the slope coefficients are identical. Let μ0 and σ²_0 denote the mean and variance of the normal prior for β0, and let μ1 and σ²_1 denote the common mean and variance of the normal priors for the slope coefficients. The mean μ1 can be anything, given that it does not affect the expected value of η; while there might be some argument for μ1 ≠ 0, we cannot think of one, so for simplicity we set μ1 = 0. Subsequently,

β0 ~ Normal(μ_η, σ²_0), (6)
βi ~ Normal(0, σ²_1), i = 1, …, p. (7)

The constraint on the linear combination is that its expected value equals μ_η and its variance equals σ²_η. Redisplaying the linear combination for convenience, with a randomly drawn sequence of values xi,j,

η_j = β0 + Σ_{i=1}^p βi xi,j,

with E[η_j] = μ_η and

Var[η_j] = σ²_0 + pσ²_1,

where the latter sum must equal σ²_η. Given values for μ_η and σ²_η, once σ²_0 is specified, the variance for the slope coefficient priors is

σ²_1 = (σ²_η − σ²_0)/p.

Given the assumption of identical priors for all slope coefficients, the only remaining decision is the specification of σ²_0. There are numerous ways this can be done, and that shown below is just one possibility.

If the aim is that the number of covariates does not affect the prior for β0, in particular the prior variance for β0, then σ²_0 should not be a function of p. One approach is to define σ²_0 relative to σ²_η: specify a constant k, 0 < k < 1, and set σ²_0 = kσ²_η. Then

σ²_1 = (1 − k)σ²_η/p.

The choice of k is discussed in Section 7.
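The weighted scheme reduces to two lines of arithmetic. The helper below (a sketch; the function name and the choice k = 0.5 are ours) returns the normal prior means and variances for the intercept and the common slope prior, using the example values μ_η ≈ 1.146 and σ²_η ≈ 1.785 from Section 4.3:

```python
def weighted_priors(mu_eta, var_eta, p, k):
    """Normal prior settings for (intercept, slopes) with distinct intercept handling.

    The intercept keeps the fixed share k of the target variance var_eta,
    and the remaining (1 - k) * var_eta is split equally among the p slopes.
    """
    if not (0.0 < k < 1.0) or p < 1:
        raise ValueError("need 0 < k < 1 and p >= 1")
    var0 = k * var_eta                   # sigma^2_0 = k * sigma^2_eta
    var_slope = (1.0 - k) * var_eta / p  # sigma^2_1 = (1 - k) * sigma^2_eta / p
    return (mu_eta, var0), (0.0, var_slope)

(int_mean, int_var), (slope_mean, slope_var) = weighted_priors(1.146, 1.785, p=3, k=0.5)
```

By construction int_var + p·slope_var equals σ²_η, so the induced variance for η is unchanged while the intercept prior no longer shrinks as p grows.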

5 Simulations

In this section, we present the results of three simulation experiments, one for each of three scenarios (conducted using R (R Core Team, 2024)). We consider:

  1. Scenario 1: one covariate and a Uniform(0,1) prior for θ.

  2. Scenario 2: three covariates and a Uniform(0,1) prior for θ.

  3. Scenario 3: three covariates and a Beta(α, β) prior for θ, where α, β ≠ 1.

In all three scenarios different types of normal priors were used for the intercept β0 and the slope coefficient(s). We refer to the “Vague prior”, with identical diffuse Normal(0, σ²) priors for the intercept and slope coefficients, where σ² = 1000²; and the “Logistic prior”, where σ is based on the induced logistic distribution, the details of which vary depending on the number of covariates and the Beta(α, β) prior for θ. For example, for the Logistic prior, when the induced prior for θ is Uniform(0,1) and there are p covariates, σ = √(π²/(3(1+p))). Finally, when the prior for the intercept is handled differently (Section 4.3), the prior is called the “Weighted prior”.

For each scenario, we performed experiments on single samples (detailed in the Appendix), as well as on multiple data sets, to allow frequentist analyses of performance.

5.1 Scenario 1: One covariate and θ ~ U(0,1)

Multiple data sets of Bernoulli responses were generated from the following model:

yi ~ Bernoulli(θi), i = 1, …, 15; (8)
logit(θi) = −0.5 + 0.3xi, i = 1, …, 15,

where the values of x were simulated from a Gamma(3, 0.2) and then standardised. In the Appendix, the resulting response and covariate values are shown in Table 10 and the functional relationship between θ and x is shown in Figure 10.
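A data-generating sketch for this scenario (assumptions: Python rather than the paper's R, and a rate parameterisation for the Gamma(3, 0.2) draws; the rate/scale choice is immaterial here because the covariate is standardised afterwards):

```python
import math
import random

random.seed(6)

def simulate_dataset(n=15, b0=-0.5, b1=0.3):
    # Covariate: Gamma(shape=3, rate=0.2) as a sum of three Exponential(0.2) draws,
    # then standardised to mean 0 and sample standard deviation 1
    x = [sum(random.expovariate(0.2) for _ in range(3)) for _ in range(n)]
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    x = [(v - m) / s for v in x]
    # Bernoulli responses with logit(theta_i) = b0 + b1 * x_i
    y = [1 if random.random() < 1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) else 0
         for xi in x]
    return x, y

x, y = simulate_dataset()
```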

We compared the following two priors for β0 and β1:

Vague: β0, β1 ~ Normal(0, 1000²); (9)
Logistic: β0, β1 ~ Normal(0, π²/(3×2) = 1.283²).

The induced priors for θ are shown in Figure 11 in the Appendix.

We simulated 100 data sets from the model with n = 15, n = 50 and n = 100, and computed the following indices: the mean squared error from the maximum likelihood estimate (MSE*), the mean squared error from the posterior mean (MSE), and the coverage of the 95% credible interval. The above mean squared errors are based on the formula

MSE(θ̂) = (1/n) Σ_{i=1}^n (θ̂i − θ)²,

where θ̂i is the point estimate (i.e., the posterior mean or MAP, for example) for the i-th simulated data set and θ is either the MLE (for MSE*) or the true value of the parameter (for MSE).
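The MSE computation is a one-liner over the simulated data sets; the sketch below (with made-up estimates purely for illustration) applies the formula with the truth β0 = −0.5 as the reference value:

```python
import statistics

def mse(estimates, reference):
    # Mean squared error of per-dataset point estimates around a reference value
    # (the MLE for MSE*, or the true parameter value for MSE)
    return statistics.fmean((e - reference) ** 2 for e in estimates)

# Hypothetical posterior means for beta_0 from five simulated data sets
posterior_means = [-0.61, -0.42, -0.55, -0.48, -0.51]
err = mse(posterior_means, reference=-0.5)
```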

For the case n = 15, from Table 1 we note that the MSE for the Logistic prior is smaller (by about one order of magnitude) than that computed for the Vague normal prior. In addition, while the posterior coverage for the proposed prior is within the expected limits, under the Vague prior we note slightly lower than expected coverage for the intercept (0.92) and considerably lower coverage for the slope (0.78).

                      Vague                      Logistic
Parameter   Truth   MSE*      MSE      Cov     MSE*       MSE      Cov
β0          -0.50   0.0253    1.240    0.92    0.0123     0.2371   0.97
β1           0.30   0.0347    1.094    0.78    0.0203     0.2005   0.93
Table 1: MSE*, MSE and coverage of the 95% posterior credible interval (Cov), for n = 15, for the case with one covariate.

An interesting comparison is between the average MLEs and the average posterior means for both parameters. The average MLEs are -0.6184 and 0.4203 for β0 and β1, respectively. The average posterior mean for the intercept is -0.7775 under the Vague prior and -0.5076 under the Logistic prior; for the slope coefficient, the values are 0.6065 and 0.2780, respectively. The distances between the MLEs and the posterior means are smaller for the Logistic prior than for the Vague prior.

For the larger sample size n=50, the corresponding results are presented in Table 2 where, as expected given the greater influence of the data on the posterior distributions, the MSE from the MLE and from the posterior mean are lower. Note that, unlike in the case n=15, the coverages of the coefficients under the Vague prior are close to the nominal value of 95%.

            Vague                        Logistic
Parameter  Truth  MSE*    MSE     Cov    MSE*    MSE     Cov
β0         -0.50  0.0007  0.1165  0.93   0.0005  0.0989  0.94
β1          0.30  0.0017  0.1189  0.94   0.0002  0.0972  0.96
Table 2: MSE*, MSE and coverage of the 95% posterior credible interval (Cov), for n=50, for the case with one covariate.

In this case, the average MLE for the intercept is -0.5104 and for the slope coefficient is 0.3184. The average posterior means are, for the intercept, -0.5290 (Vague prior) and -0.4950 (Logistic prior) and, for the slope coefficient, 0.3425 (Vague prior) and 0.3165 (Logistic prior). Here, while the distances of the posterior means from the MLE are virtually the same for the intercept, those for the slope coefficient are smaller for the Logistic prior.

To conclude this experiment, we ran the above simulation with a larger sample size (n=100). The results are presented in Table 3, where we note that the priors' behaviours are virtually the same, as the information in the data dominates. The overall performance of the Logistic prior remains better than that of the Vague prior in terms of the distance between the MLE and the posterior mean, although the difference diminishes, as one would expect as the sample size increases.

            Vague                        Logistic
Parameter  Truth  MSE*    MSE     Cov    MSE*    MSE     Cov
β0         -0.50  0.0001  0.0496  0.94   0.0001  0.0458  0.95
β1          0.30  0.0003  0.0497  0.97   0.0000  0.0457  0.96
Table 3: MSE*, MSE and coverage of the 95% posterior credible interval (Cov), for n=100, for the case with one covariate.

For this final simulation study, the average MLE for the intercept is -0.5265 and for the slope coefficient is 0.2914. The average posterior means are, for the intercept, -0.5355 (Vague prior) and -0.5206 (Logistic prior) and, for the slope coefficient, 0.3013 (Vague prior) and 0.2914 (Logistic prior).

5.2 Scenarios 2 and 3: Three covariates

In the second and third scenarios, we drew samples from the model given in Equation (10), considering θ ∼ U(0,1) (Scenario 2) and its generalisation θ ∼ Beta(α, β) with distinct handling of the intercept (Scenario 3):

logit(θ) = 1.5 + 0.3x1 − 0.6x2 + 0.02x3,   (10)

where the covariates were simulated independently from three Gamma distributions, Gamma(10,2), Gamma(12,6), and Gamma(3,3), with the resulting θ values ranging from 0.39 to 0.93.
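A minimal sketch of this data-generating process, assuming the Gamma distributions are parameterized by shape and rate (so NumPy's scale is the reciprocal of the rate) and using the coefficient values from the simulation tables:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 50

# Covariates: Gamma(shape, rate) assumed, hence scale = 1 / rate
x1 = rng.gamma(10, 1 / 2, n)
x2 = rng.gamma(12, 1 / 6, n)
x3 = rng.gamma(3, 1 / 3, n)

# Linear predictor with the coefficients used in Tables 5, 7 and 9
eta = 1.5 + 0.3 * x1 - 0.6 * x2 + 0.02 * x3
theta = 1 / (1 + np.exp(-eta))   # inverse logit
y = rng.binomial(1, theta)       # Bernoulli responses

print(theta.min().round(2), theta.max().round(2))
```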

The prior distributions we compared are the Vague prior and the Logistic prior (see Equations (4) and (5)) for the three-covariate setting in Scenario 2:

Vague:    β0, β1, β2, β3 ∼ Normal(0, 1000²)   (11)
Logistic: β0, β1, β2, β3 ∼ Normal(0, π²/(3×(1+3)) = 0.907²),

and, for Scenario 3, the additional prior

β0 ∼ Normal(1.150, 0.4×1.843 = 0.7372)   (12)
βi ∼ Normal(0, 0.6×1.843/3 = 0.3686),  i = 1, 2, 3.

The induced θ priors, as well as an illustration of the model applied to a single sample, can be found in the Appendix.

The frequentist analysis for Scenario 2 was performed on 100 samples of sizes n=30, n=50 and n=100. We defer discussion of the results to below, where the Vague, Logistic and Weighted priors are shown together.

Similarly to Scenario 2, the frequentist analysis of Scenario 3 was performed on 100 samples of sizes n=30, n=50 and n=100.

Table 4, for n=50, shows the MSE*, the MSE and the interval coverage for the three priors. Both the Logistic prior and the Weighted prior perform better than the Vague prior, as the MSE is systematically smaller for the two proposed priors across all parameters. Comparing the coverages, the Weighted prior tends to give more extreme results, either close to 100% or dipping to about 90%, suggesting a less precise posterior. Given that the Weighted prior assumes a different prior for the intercept than for the slope coefficients, an increase in uncertainty is expected.

            Vague              Logistic           Weighted
Parameter  MSE*  MSE   Cov    MSE*  MSE   Cov    MSE*  MSE   Cov
β0         0.18  1.30  0.89   0.34  0.11  0.96   0.36  0.09  0.97
β1         0.03  0.45  0.89   0.07  0.10  0.97   0.20  0.05  0.97
β2         0.04  0.45  0.90   0.10  0.10  0.96   0.28  0.12  0.89
β3         0.10  1.05  0.91   0.28  0.09  0.99   0.38  0.03  1.00
Table 4: MSE*, MSE and coverage of the posterior 95% credible interval for the Vague prior, the Logistic prior and the Weighted prior. The summaries refer to 100 samples of size n=50.

For the above simulation study, the average MLEs and the average posterior means (under each prior) are reported in Table 5.

                            Posterior Means
Parameter  Truth   MLE      Vague    Logistic  Weighted
β0          1.5     1.6905   1.9587   1.3887    1.4235
β1          0.3     0.4563   0.5580   0.3324    0.1793
β2         -0.6    -0.6801  -0.7931  -0.5125   -0.2997
β3          0.02    0.0441   0.1010   0.0115    0.0013
Table 5: Average maximum likelihood estimates and average posterior means for the n=50 simulation study.

Similarly to the case with one covariate, we examined and compared the priors' performance for a relatively small sample size. Using the rule of thumb of ten data points per covariate, we drew 100 samples of size n=30. The results, in terms of MSE*, MSE and posterior coverage, are reported in Table 6.

            Vague                    Logistic                 Weighted
Parameter  MSE*    MSE     Cov      MSE*    MSE     Cov      MSE*    MSE     Cov
β0         0.8154  3.0434  0.75     0.5970  0.1512  0.96     0.5543  0.1103  0.98
β1         0.1865  1.4455  0.88     0.2760  0.1422  0.97     0.5686  0.0646  0.99
β2         0.3677  1.8433  0.82     0.3053  0.1593  0.98     0.7505  0.1611  0.86
β3         0.1253  1.1679  0.87     0.1421  0.1674  0.97     0.3440  0.0334  1.00
Table 6: MSE*, MSE and coverage of the posterior 95% credible interval for the Vague prior, the Logistic prior and the Weighted prior. The summaries refer to 100 samples of size n=30.

The MLE averages and the posterior mean averages are reported in Table 7.

                            Posterior Means
Parameter  Truth   MLE      Vague    Logistic  Weighted
β0          1.5     1.6265   1.7309   1.5111    1.5077
β1          0.3     0.3392   0.3748   0.2971    0.2023
β2         -0.6    -0.6352  -0.6782  -0.5634   -0.4065
β3          0.02    0.0303   0.0533   0.0337    0.0191
Table 7: Average maximum likelihood estimates and average posterior means for the n=30 simulation study.

Finally, in Table 8 we report the results for the larger sample size (n=100), where the differences between the priors are reduced, although the better performance of the Logistic and Weighted priors remains.

            Vague                    Logistic                 Weighted
Parameter  MSE*    MSE     Cov      MSE*    MSE     Cov      MSE*    MSE     Cov
β0         0.0123  0.1647  0.92     0.0182  0.0562  0.97     0.0259  0.0448  0.97
β1         0.0020  0.1160  0.94     0.0054  0.0658  0.96     0.0393  0.0396  0.98
β2         0.0025  0.0934  0.95     0.0078  0.0547  0.99     0.0672  0.0628  0.91
β3         0.0014  0.0952  0.97     0.0021  0.0586  0.99     0.0144  0.0281  1.00
Table 8: MSE*, MSE and coverage of the posterior 95% credible interval for the Vague prior, the Logistic prior and the Weighted prior. The summaries refer to 100 samples of size n=100.

The average summaries are reported in Table 9.

                            Posterior Means
Parameter  Truth   MLE      Vague    Logistic  Weighted
β0          1.5     1.9354   2.5906   1.3435    1.4266
β1          0.3     0.5153   0.7479   0.2586    0.1092
β2         -0.6    -0.8578  -1.1913  -0.5110   -0.2388
β3          0.02    0.0833   0.1658   0.0438    0.0154
Table 9: Average maximum likelihood estimates and average posterior means for the n=100 simulation study.

6 Application to an occupancy model

We examined the influence of priors on models for presence/absence or occupancy data. In occupancy modelling, Bernoulli random variables appear twice (MacKenzie et al., , 2017): once to indicate the true, but a priori unknown, presence (occupancy) of a species at a site or its absence, and again to indicate the detection of the species or the failure to detect it. Letting Z be an indicator for true presence and Y be an indicator for detection, the basic occupancy model structure is the following:

Z ∼ Bernoulli(ψ);
Y | Z = 1 ∼ Bernoulli(p),

where the probability of detection is here defined conditional on presence. Equivalently, the distribution for Y can be specified conditional on Z (rather than Z=1) as Y | Z ∼ Bernoulli(pZ). Note also that the unconditional distribution for Y is Bernoulli(pψ). Both the occupancy probability, ψ, and the detection probability, p, are commonly modelled as logistic functions of covariates.
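A minimal simulation from this two-level structure; the values of ψ and p here are hypothetical, not estimates from the goldfinch data discussed below:

```python
import numpy as np

rng = np.random.default_rng(7)
n_sites, n_visits = 266, 3
psi, p = 0.55, 0.4   # hypothetical occupancy and detection probabilities

z = rng.binomial(1, psi, n_sites)    # true (latent) presence at each site

# Detection only possible where z = 1: Y | Z ~ Bernoulli(p * Z)
y = rng.binomial(1, p * z[:, None], (n_sites, n_visits))

# Unconditionally, P(Y = 1) = p * psi, so the detection rate is near 0.22
print(y.mean().round(2))
```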

Here we compare the effects of different priors for the coefficients of the logistic models on a real data set. The data are presence/absence records of the European goldfinch made at 266 locations in Switzerland at three different times during the breeding season (Jeffrey Doser, personal communication; see also Chapter 11 of Kéry and Royle, (2015), the R package AHMbook, and the dataset MHB2014 (Kéry et al., , 2024)). Three covariates were used for modelling occupancy (elevation, elevation², and percent forest cover) and three for modelling detection (day-of-year, day-of-year², and the duration of the observation period at a given location and visit).

We use the function PGOcc in the R package spOccupancy (Doser et al., , 2022) to fit the occupancy models. The default prior in PGOcc for the intercept and slope coefficients is Normal(0, 1.65²) (labelled the spOcc prior), which is relatively narrow compared to vague priors such as Normal(0, 1000²). This choice is partially based on results from Northrup and Gerber, (2018) (Jeffrey Doser, personal communication). We used the same Logistic prior for the intercept and the slope coefficients as in Section 3, largely for simplicity, as PGOcc is set up to use the same priors for the intercepts and slopes (allowing for differences between occupancy probability and detection probability). Given p=3 covariates, the Logistic model priors were then Normal(0, 0.91²). A third set of priors, labelled wide priors, was also used, Normal(0, 40²); these replaced the Vague priors used previously, Normal(0, 1000²), which caused PGOcc to fail.

The usual bathtub shape of the induced priors for the occupancy and detection probabilities was observed for both the wide and spOcc priors, while the induced priors from the Logistic model priors were relatively flat (Figure 5). The resulting posterior distributions for the intercepts and slopes differed slightly between the three priors; for example, the posterior distributions for the occupancy slope coefficients (Figure 6) under the Logistic prior were shifted slightly more positive than under the other two priors. The resulting posterior mean probabilities of occupancy as a function of elevation, shown in Figure 7, were relatively similar across the range of elevations. The posterior credible intervals (not shown) were relatively wide, particularly at the extremes of elevation, which had fewer data.

The relative similarity of the posterior distributions can be attributed to the influence of the data overshadowing that of the prior. This can be seen in Figure 8, which shows the resulting posterior mean occupancy probability as a function of elevation when the model was fit to a simple random sample of n=50 drawn from the 266 survey locations. The discrepancy between the three priors is more pronounced with n=50 than when all n=266 sites were used, with the spOcc and Logistic priors yielding somewhat similar results but the wide prior less so.

Figure 5: Induced prior distributions for occupancy and detection probabilities of the goldfinch data set for three different priors for the logistic regression intercept and partial slopes.
Figure 6: Posterior distributions for Occupancy model intercept and slope coefficients for three different priors.
Figure 7: Posterior means for the probability of occupancy as a function of elevation (with percent forest cover set at average value) for the three different priors.
Figure 8: Resulting posterior means for the probability of occupancy as a function of elevation for a subset of n=50 locations from the goldfinch occupancy survey.

7 Discussion

We have proposed priors for the coefficients in a multiple-covariate logistic regression for a Bernoulli probability θ that induce a desired prior for θ. To the best of our knowledge, previous work has focused largely on single-covariate models or has applied the same prior to each regression coefficient regardless of the number of covariates. The proposed priors include ones such that the induced prior for θ is a Beta(α, β) with arbitrary shape parameters, as well as priors that allow for distinct handling of the intercept in the regression model. This work has raised additional questions that suggest the areas of research discussed below.

7.1 Weighting in the Weighted prior

For the Weighted prior, the proposed solution for determining the variance of the prior for the intercept was to set it equal to some fraction of σ²_η: σ²_0 = kσ²_η, 0 ≤ k ≤ 1. Can some guidance be provided regarding the choice of k?

If k is near 0, then the prior for the intercept is concentrated around μ_η, and the priors for the slope coefficients are near their maximum dispersion, σ²_η/p. At the extreme of k=0, the intercept β0 is a fixed value. On the other hand, if k is near 1, then the variance of the prior for the intercept is near its maximum, σ²_η, and the priors for the slope coefficients are concentrated around zero. At the extreme of k=1, the slope coefficients are fixed at zero and the covariates drop out of the model.
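The variance split described above can be written as a small helper; this is a sketch, with the numerical check using the Scenario 3 values σ²_η = 1.843, p = 3 and k = 0.4:

```python
def split_variance(sigma2_eta, p, k):
    """Split the total linear-predictor variance between the intercept
    (sigma2_0 = k * sigma2_eta) and each of the p slope coefficients
    ((1 - k) * sigma2_eta / p)."""
    assert 0.0 <= k <= 1.0
    sigma2_intercept = k * sigma2_eta
    sigma2_slope = (1 - k) * sigma2_eta / p
    return sigma2_intercept, sigma2_slope

# Scenario 3 values: recovers the variances 0.7372 and 0.3686 in Equation (12)
s0, s1 = split_variance(1.843, 3, 0.4)
print(round(s0, 4), round(s1, 4))  # 0.7372 0.3686
```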

Therefore k balances the effect of the intercept against that of the covariates, which raises the question of how to choose it. One thought is to treat k as a parameter to estimate and specify a hierarchical prior distribution on it. Another is to view k as a tuning parameter chosen against a goodness-of-fit measure. Further investigation of this issue is a topic for future research.

7.2 Concentration of priors around 0

As the number of covariates increases, the priors for the slope coefficients become increasingly concentrated around 0; that is, there is shrinkage toward 0. Two statistical procedures that have some similarity with this result are the Lasso (Tibshirani, , 1996) and penalized complexity priors (Simpson et al., , 2017). The Lasso can shrink coefficients to the point that a covariate is completely removed. Penalized complexity priors aim, given a set of nested models, to give the highest prior probability to a “base” model, e.g., an intercept-only model. Whether or not this shrinkage is desirable, and whether more direct connections can be made with the Lasso or penalized complexity priors, are additional areas for research.

7.3 Generating functions

The linear combination of random variables, namely the intercept and partial slope coefficients in the linear model, is a convolution. Dropping the intercept, setting all covariate values equal to 1, and assuming identical distributions for the slope coefficients (which is one way to address the many-to-one problem mentioned previously), the moment generating function (MGF) of the linear combination is the product of p identical MGFs:

E[exp(tη)] = E[exp(t(β1 + … + βp))] = (E[exp(tβ)])^p,

where β, β1, …, βp are independent and identically distributed. This results in the MGF for β,

E[exp(tβ)] = (E[exp(tη)])^(1/p).

If the MGF for η is known, the MGF for β is the pth root of that MGF. Inversion of an arbitrary MGF to obtain the underlying probability distribution is usually not a simple task, let alone inversion of the pth root of an MGF (Walker, , 2017). While interesting, this approach seems impractical.
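For Normal distributions, by contrast, the root is tractable, which sketches why moment matching with Normal coefficient priors (as in Equations (4) and (5)) is attractive: the pth root of a Normal MGF is again a Normal MGF,

```latex
M_\eta(t) = \exp\!\left(\mu t + \tfrac{1}{2}\,\sigma^2_\eta\, t^2\right)
\quad\Longrightarrow\quad
M_\eta(t)^{1/p} = \exp\!\left(\tfrac{\mu}{p}\, t + \tfrac{1}{2}\,\tfrac{\sigma^2_\eta}{p}\, t^2\right),
```

so each iid coefficient is Normal(μ/p, σ²_η/p); with μ = 0 this recovers the Normal(0, σ²_η/p) slope priors.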

7.4 Multi-Bernoulli distributions

Another topic for research is the extension of these results to multi-Bernoulli distributions. For example, an animal at location A at time t could, at time t+1, remain at A, move to B, or move to C. The location at the next time step can be modelled by a multi-Bernoulli distribution:

y_{t+1} | y_t ∼ Multi-Bernoulli(p1, p2, p3).

The probabilities are modelled by a multi-logistic transformation:

p1 ∝ exp(β1,0 + β1,1 x1)
p2 ∝ exp(β2,0 + β2,1 x2)
p3 = 1 − p1 − p2.

If uniform priors for p1, p2, and p3 are desired, what do the priors for the above β's need to be?
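One plausible reading of the transformation above treats the third location as the baseline category, so the normalizing constant is 1 + exp(·) + exp(·) and p3 = 1 − p1 − p2 follows automatically; a sketch with hypothetical coefficient values:

```python
import numpy as np

def multi_logistic(beta0, beta1, x):
    """Multi-logistic (baseline-category) transformation for three
    destinations; the third category is the baseline."""
    e1 = np.exp(beta0[0] + beta1[0] * x[0])
    e2 = np.exp(beta0[1] + beta1[1] * x[1])
    denom = 1.0 + e1 + e2      # baseline category contributes exp(0) = 1
    p1, p2 = e1 / denom, e2 / denom
    return p1, p2, 1.0 - p1 - p2

# Hypothetical coefficients and covariate values
probs = multi_logistic(beta0=(0.2, -0.1), beta1=(0.5, 0.3), x=(1.0, 2.0))
print([round(v, 3) for v in probs])
```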

7.5 General problem of lack of parameterization invariance

Here we have only addressed potentially problematic induced priors for Bernoulli parameters. Both Seaman III et al., (2012) and Lele, (2020) discussed this problem in non-Bernoulli settings. In particular, Lele, (2020) compared two mathematically equivalent parameterizations of a population dynamics model, the Ricker model, in a state-space model framework. The two formulations are

Model A: N_{t+1} = N_t × exp(a − b × N_t)   (13)
Model B: N_{t+1} = N_t × exp(a × (1 − N_t / K)),   (14)

where a is a measure of the intrinsic population growth rate, b is a measure of density dependence, and K is the carrying capacity; for Model A, K can be calculated as a/b. He showed how apparently non-informative priors for the parameters yielded quite different posterior means for the population growth rate and the carrying capacity.

One likely explanation for the discrepancy in the posterior means is that the induced priors for functions of the parameters with specified priors are quite different from the explicitly specified priors for those parameters in the other parameterization. Figure 9 shows how the induced distribution for the carrying capacity K in Model A is highly skewed compared to the specified distribution for K in Model B, with a similar result for the induced prior for the growth parameter a in Model B compared to the specified prior in Model A.

Figure 9: Priors, explicit and induced, for two parameterizations of the Ricker model used by Lele, (2020). Model A formulation is Equation (13) and Model B formulation is Equation (14).
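The qualitative skew of the induced prior for K = a/b under Model A is easy to reproduce by simulation; the uniform ranges below are hypothetical stand-ins for "non-informative" priors on a and b:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 100_000

# Hypothetical flat priors on Model A's parameters (illustrative ranges)
a = rng.uniform(0.0, 2.0, m)      # intrinsic growth rate
b = rng.uniform(0.001, 0.1, m)    # density dependence

k_induced = a / b                 # induced prior on carrying capacity K = a/b

# Strong right skew: the mean far exceeds the median
print(round(float(np.median(k_induced)), 1), round(float(np.mean(k_induced)), 1))
```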

In principle, the same concepts used to obtain a desired induced prior distribution for the Bernoulli parameter could be applied here: specify the desired induced priors for the growth rate and carrying capacity parameters, and use the change-of-variables approach to work backwards to the priors on the lower level parameters that would induce them. However, the resulting derived prior distributions will typically be nonstandard, and Monte Carlo procedures will likely be required to generate samples from them.

Acknowledgements.

For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising from this submission.

References

  • Antoniano-Villalobos et al., (2024) Antoniano-Villalobos, I., Villa, C., and Walker, S. G. (2024). A multidimensional objective prior distribution from a scoring rule. Journal of Statistical Planning and Inference, 231:106122.
  • Banner et al., (2020) Banner, K. M., Irvine, K. M., and Rodhouse, T. J. (2020). The use of Bayesian priors in ecology: The good, the bad and the not great. Methods in Ecology and Evolution, 11(8):882–889.
  • Bernardo, (1979) Bernardo, J. (1979). Reference posterior distributions for bayesian inference. Journal of the Royal Statistical Society: Series B (Methodological), 41:113–128.
  • Datta and Sweeting, (2005) Datta, G. S. and Sweeting, T. J. (2005). Probability matching priors. In Dey, D. and Rao, C., editors, Bayesian Thinking, Handbook of Statistics, pages 91–114. Elsevier.
  • Doser et al., (2022) Doser, J. W., Finley, A. O., Kéry, M., and Zipkin, E. F. (2022). spOccupancy: An R package for single-species, multi-species, and integrated spatial occupancy models. Methods in Ecology and Evolution, 13(8):1670–1678.
  • Gelman et al., (2014) Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D., Veharti, A., and Rubin, D. B. (2014). Bayesian data analysis, 3rd Edition. Taylor & Francis Group.
  • Hobbs and Hooten, (2015) Hobbs, N. T. and Hooten, M. B. (2015). Bayesian models: a statistical primer for ecologists. Princeton University Press.
  • Jeffreys, (1946) Jeffreys, H. (1946). An Invariant Form for the Prior Probability in Estimation Problems. Proceedings of the Royal Society of London Series A, 186:453–461.
  • Johnson et al., (1995) Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995). Continuous univariate distributions, volume 2, volume 289. John wiley & sons.
  • Kass and Wasserman, (1996) Kass, R. E. and Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435):1343–1370.
  • Kéry and Royle, (2015) Kéry, M. and Royle, J. A. (2015). Applied hierarchical modeling in ecology: Volume 1: Prelude and static models. Elsevier Science.
  • Kéry et al., (2024) Kéry, M., Royle, A., and Meredith, M. (2024). AHMbook: Functions and Data for the Book ’Applied Hierarchical Modeling in Ecology’ Vols 1 and 2. R package version 0.2.10.
  • Leisen et al., (2020) Leisen, F., Villa, C., and Walker, S. G. (2020). On a Class of Objective Priors from Scoring Rules (with Discussion). Bayesian Analysis, 15(4):1345 – 2767.
  • Lele, (2020) Lele, S. R. (2020). Consequences of lack of parameterization invariance of non-informative Bayesian analysis for wildlife management: Survival of San Joaquin kit fox and declines in amphibian populations. Frontiers in Ecology and Evolution, 7:501.
  • Lele and Dennis, (2009) Lele, S. R. and Dennis, B. (2009). Bayesian methods for hierarchical models: are ecologists making a Faustian bargain? Ecological applications, 19(3):581–584.
  • MacKenzie et al., (2017) MacKenzie, D. I., Nichols, J. D., Royle, J. A., Pollock, K. H., Bailey, L., and Hines, J. E. (2017). Occupancy estimation and modeling: inferring patterns and dynamics of species occurrence. Elsevier.
  • Newman, (2003) Newman, K. B. (2003). Modelling paired release-recovery data in the presence of survival and capture heterogeneity with application to marked juvenile salmon. Statistical modelling, 3(3):157–177.
  • Northrup and Gerber, (2018) Northrup, J. M. and Gerber, B. D. (2018). A comment on priors for Bayesian occupancy models. PloS One, 13(2):e0192819.
  • O’Hagan, (2008) O’Hagan, A. (2008). The Bayesian approach to statistics. In Handbook of probability: Theory and applications, volume 6.
  • O’Hagan, (2019) O’Hagan, A. (2019). Expert knowledge elicitation: subjective but scientific. The American Statistician, 73(sup1):69–81.
  • R Core Team, (2024) R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Seaman III et al., (2012) Seaman III, J. W., Seaman Jr, J. W., and Stamey, J. D. (2012). Hidden dangers of specifying noninformative priors. The American Statistician, 66(2):77–84.
  • Siegrist, (2024) Siegrist, K. (2024). Transformations of random variables. https://stats.libretexts.org/Bookshelves/Probability\Theory/ProbabilityMathematicalStatisticsandStochasticProcesses [Accessed: 2024-05-20].
  • Simpson et al., (2017) Simpson, D., Rue, H., Riebler, A., Martins, T. G., and Sørbye, S. H. (2017). Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 32(1):1–28.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288.
  • van de Schoot et al., (2021) van de Schoot, R., Depaoli, S., King, R., Kramer, B., Märtens, K., Tadesse, M. G., Vannucci, M., Gelman, A., Veen, D., Willemsen, J., et al. (2021). Bayesian statistics and modelling. Nature Reviews Methods Primers, 1(1):1.
  • Van Erp et al., (2019) Van Erp, S., Oberski, D. L., and Mulder, J. (2019). Shrinkage priors for Bayesian penalized regression. Journal of Mathematical Psychology, 89:31–50.
  • Walker, (2017) Walker, S. G. (2017). A Laplace transform inversion method for probability distribution functions. Statistics and Computing, 27(2):439–448.
  • Walker and Villa, (2021) Walker, S. G. and Villa, C. (2021). An objective prior from a scoring rule. Entropy, 23(7):833.

Appendix - Simulation studies details

Additional details and output from the simulations discussed in Section 5 are given below.

Scenario 1: One covariate and θ ∼ U(0,1)

Single Sample Analysis

We drew a sample of size n=15 from the model given in Equation (8) in Section 5.1. The simulated covariate values and related responses are shown in Table 10, while the relationship between θ and x is plotted in Figure 10.

y x1.std
1 1.00 -1.47
2 0.00 -1.13
3 1.00 -1.01
4 0.00 -1.01
5 1.00 -0.90
6 1.00 -0.31
7 1.00 -0.15
8 1.00 -0.03
9 0.00 0.26
10 1.00 0.28
11 1.00 0.61
12 1.00 0.73
13 1.00 0.84
14 1.00 1.19
15 1.00 2.10
Table 10: Simulated responses and covariates for single covariate model.
Figure 10: Scenario 1 logit model: θ vs x.

The induced priors on θ obtained from the priors on the βs (see Equation (9)) can be found in Figure 11. We note a difference in shape between the two: the prior induced by the Vague prior assigns almost all of its mass towards 0 and 1, whilst the induced prior resulting from the proposed method is well approximated by a uniform density (which was the aim).

Figure 11: Induced priors for θ based on a single covariate with Normal(0, 1000²) and Normal(0, π²/(3×2)) priors, based on n=15 observations.
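The contrast in Figure 11 can be reproduced by pushing prior draws of the coefficients through the inverse logit; a sketch, with the covariate value taken from Table 10:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 100_000
x = -1.47   # a standardized covariate value from Table 10

results = {}
for label, sd in [("vague", 1000.0), ("logistic", np.sqrt(np.pi**2 / (3 * 2)))]:
    b0 = rng.normal(0.0, sd, m)
    b1 = rng.normal(0.0, sd, m)
    with np.errstate(over="ignore"):   # exp overflows harmlessly to inf
        theta = 1 / (1 + np.exp(-(b0 + b1 * x)))
    # Fraction of prior mass within 0.05 of the endpoints 0 and 1
    results[label] = np.mean((theta < 0.05) | (theta > 0.95))
    print(label, round(float(results[label]), 2))
```

Under the Vague prior essentially all the mass piles up at the endpoints (the bathtub shape), while under the Logistic prior only a modest fraction does.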

Side-by-side boxplots of the posterior distributions for β0 and β1 resulting from the Vague and Logistic priors are shown in Figure 12, while histograms of the posterior distributions for the intercept and the slope coefficient, under both priors, along with the MCMC trace plots, are shown in Figure 14. The MCMC was run for 5000 iterations with a burn-in of 2000, using 4 independent chains. No evidence of a lack of convergence was identified.

Figure 12: Posterior distributions for β0 and β1 with Vague and Logistic priors.

In Table 11 we show the posterior summaries, that is, the posterior mean and the posterior 95% credible interval, as well as the true values and the MLEs. The true parameters are contained in both posterior 95% credible intervals, and the intervals are wider under the Logistic prior, meaning that it carries less information than the Vague prior; this supports our claim that it is a “more objective” prior. Although the posterior distributions are reasonably symmetric, as seen in Figure 14, the maximum a posteriori (MAP) estimates are -0.13 and -0.15 for the intercept, under the Vague and Logistic priors respectively, and 0.64 and 0.65 for the slope coefficient. The latter results suggest negatively skewed posteriors.

                          Vague                   Logistic
Parameter  Truth   MLE    Mean   C.I.             Mean   C.I.
β0         -0.50  -0.14  -0.13  (-1.28, 0.94)    -0.13  (-0.74, 1.67)
β1          0.30   0.35   0.43  (-1.13, 0.86)     0.34  (-0.70, 1.43)
Table 11: Scenario 1: posterior means and posterior 95% credible intervals with n=15. The table also includes the true values and the MLEs.
Figure 14: Scenario 1: posterior histograms and chains for the intercept and slope coefficient under (a) the Vague prior and (b) the proposed prior. Each histogram contains three vertical lines showing the true value (solid), the MLE (dotted) and the posterior mean (dashed). We sampled n=15 data points from a model with parameters β0 = -0.5 and β1 = 0.3.

Scenarios 2 and 3

Single Sample Analysis

For this experiment, we simulated a sample of size n=50 from the model given in Equation (10) in Section 5.2. The prior distributions on the βs were the Vague prior, the Logistic prior and the Weighted prior (see Equations (11) and (12)). The induced priors on θ corresponding to the Vague and the Logistic priors are shown in Figure 15, while the one for the Weighted prior is shown in Figure 16. We recall that the Weighted prior is designed so that the induced prior on θ is an arbitrary Beta(α, β) and, in addition, the prior for the intercept β0 is distinct from the one on the slope coefficients. In our case, the prior distribution for θ was determined by specifying an expected value of 0.7 and a CV of 0.3, which leads to a Beta(2.633, 1.129). The induced moments for η were μ_η = 1.150 and σ²_η = 1.843. A weight of k=0.4 was selected for the variance of the prior for β0. The resulting prior is Equation (12) in Section 5.2.
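The Beta parameters and the induced logit-scale moments quoted above can be checked numerically; in this sketch the logit-scale moments are estimated by Monte Carlo rather than in closed form:

```python
import numpy as np

def beta_from_mean_cv(mean, cv):
    """Shape parameters of a Beta(alpha, beta) distribution with the
    given mean and coefficient of variation."""
    var = (cv * mean) ** 2
    s = mean * (1 - mean) / var - 1      # s = alpha + beta
    return mean * s, (1 - mean) * s

alpha, beta = beta_from_mean_cv(0.7, 0.3)
print(round(alpha, 3), round(beta, 3))   # 2.633 1.129

# Monte Carlo estimate of the induced logit-scale moments mu_eta, sigma2_eta
rng = np.random.default_rng(5)
theta = rng.beta(alpha, beta, 1_000_000)
eta = np.log(theta / (1 - theta))
print(round(float(eta.mean()), 2), round(float(eta.var()), 2))  # near 1.15 and 1.84
```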

Figure 15: Induced θ priors with 3 covariates, with coefficient priors Normal(0, 1000²) and Normal(0, π²/(3×(1+3)) = 0.907²), based on n=50 observations.
Figure 16: Histogram of the induced prior for θ given Weighted priors for the intercept and slope coefficients, along with the Beta(2.633, 1.129) pdf (blue dashed line).

The MCMC settings are as in the previous example: we ran 4 chains for 5000 iterations with a burn-in of 2000, giving a usable posterior sample of 3000 iterations per chain. The posterior summaries are reported in Table 12, and the histograms of the posterior distributions are in Figure 17. Given that some of the posterior distributions appear to be asymmetric, we have also reported the MAPs in Table 13, along with the corresponding true values and MLEs.

The prior performances vary across the parameters, although the true values are all within the 95% posterior credible intervals. There is no clear dominance of one prior over the others, either in recovering the true value or in the proximity of the posterior mean to the MLE.

                         Vague                    Logistic                Weighted
Parameter  Truth   MLE   Mean   C.I.              Mean   C.I.             Mean   C.I.
β0          1.50   1.80   2.05  (1.19, 3.07)      1.59  (0.91, 2.33)      1.65  (1.01, 2.35)
β1          0.30   0.11   0.12  (-0.83, 1.09)     0.08  (-0.67, 0.84)     0.04  (-0.49, 0.57)
β2         -0.60  -0.62  -0.70  (-1.63, 0.15)    -0.49  (-1.22, 0.21)    -0.29  (-0.82, 0.24)
β3          0.02   0.27   0.37  (-0.56, 1.43)     0.26  (-0.45, 1.04)     0.15  (-0.38, 0.70)
Table 12: Posterior means and 95% credible intervals for the Vague, Logistic, and Weighted priors (n=50).
Parameter  Truth   MLE    Vague  Logistic  Weighted
β0          1.50   1.80    2.24    1.49     1.75
β1          0.30   0.11    0.20   -0.06     0.24
β2         -0.60  -0.70   -1.01   -0.60    -0.13
β3          0.02   0.37    0.27    0.04     0.04
Table 13: MAP estimates under the Vague, Logistic, and Weighted priors for the case n=50.
Figure 17: Scenarios 2 and 3: posterior histograms of the intercept and partial slope coefficients for a single sample of size n=50 from a model with β0=1.5, β1=0.3, β2=-0.6 and β3=0.02 (top row to bottom row, respectively). The first column is based on the Vague prior, the second on the Logistic prior with identical priors for the intercept and slope coefficients (Equations (4) and (5)), and the third on the Weighted prior (Equations (6) and (7)). Each histogram contains three vertical lines showing the true value (solid), the MLE (dotted) and the posterior mean (dashed).