
Logistic regression models: practical induced prior specification

Ken B. Newman Biomathematics & Statistics Scotland, Edinburgh, UK. School of Mathematics and Maxwell Institute for Mathematical Sciences, University of Edinburgh, Edinburgh, UK. Cristiano Villa Division of Natural and Applied Sciences, Duke-Kunshan University, Kunshan, China School of Mathematics, Statistics and Physics, Newcastle University, Newcastle-upon-Tyne, UK Ruth King School of Mathematics and Maxwell Institute for Mathematical Sciences, University of Edinburgh, Edinburgh, UK.
Abstract

Bayesian inference for statistical models with a hierarchical structure is often characterized by specification of priors for parameters at different levels of the hierarchy. When higher level parameters are functions of the lower level parameters, specifying a prior on the lower level parameters leads to induced priors on the higher level parameters. However, what are deemed uninformative priors for lower level parameters can induce strikingly non-vague priors for higher level parameters. Depending on the sample size and specific model parameterization, these priors can then have unintended effects on the posterior distribution of the higher level parameters.

Here we focus on Bayesian inference for the Bernoulli distribution parameter θ, which is modeled as a function of covariates via a logistic regression, where the coefficients are the lower level parameters for which priors are specified. A specific area of interest and application is the modeling of survival probabilities in capture-recapture data and of occupancy and detection probabilities in presence-absence data. In particular, we propose alternative priors for the coefficients that yield specific induced priors for θ. We address three induced prior cases. The simplest is when the induced prior for θ is Uniform(0,1). The second is when the induced prior for θ is an arbitrary Beta(α, β) distribution. The third is one where the intercept in the logistic model is treated distinctly from the partial slope coefficients; e.g., E[θ] equals a specified value on (0,1) when all covariates equal 0. Simulation studies were carried out to evaluate the performance of these priors, and the methods were applied to a real presence/absence data set and occupancy modelling.

Keywords— Bayesian inference, Induced priors, Logistic regression, Objective priors, Occupancy modeling.

1 Introduction

Many processes with binary outcomes, say 0 or 1, are modeled as Bernoulli random variables Y, where Pr(Y=1) = θ and Pr(Y=0) = 1 − θ. The value Y=1 can correspond to a variety of nominal outcomes such as success, survival or presence, while the value Y=0 corresponds to the opposite: failure, death or absence, respectively. Such Bernoulli processes appear in several areas of statistical ecology, such as modeling the probability of an animal surviving over some time interval, the probability of an animal producing young, the probability of an animal's or plant's presence at a given location and the probability of detecting such presence.

Often the Bernoulli probabilities are modelled as functions of covariates, x1, …, xp, and to ensure that the probability θ satisfies the condition 0 < θ < 1, a logistic transformation of θ is often used:

ln(θ/(1−θ)) = β0 + β1x1 + … + βpxp.

We label the logistic transformation above logit(θ). When Bayesian inference is carried out, prior distributions are specified for the coefficients β0, β1, …, βp. These prior distributions induce a prior distribution on θ.

The choice of priors for Bayesian inference and their effect has been long and frequently discussed (Gelman et al., 2014; Hobbs and Hooten, 2015). Several approaches have been developed for selecting and formulating priors (Jeffreys, 1946; Bernardo, 1979; Datta and Sweeting, 2005; Kass and Wasserman, 1996; Simpson et al., 2017; Leisen et al., 2020; Walker and Villa, 2021; Antoniano-Villalobos et al., 2024), with advice on approaches to prior selection given by, e.g., Banner et al. (2020) and van de Schoot et al. (2021). Concerns about the choice of priors, and criticism of Bayesian methods in general, have been voiced by many, particularly in ecological applications (Lele and Dennis, 2009; Lele, 2020), and the use of so-called default priors, those built into various computer software, is a general concern (Banner et al., 2020).

A specific concern is that the use of seemingly innocuous “non-informative” priors at one level of a hierarchical model can induce strikingly non-innocuous, i.e., relatively concentrated, priors at a higher level. This has been noted specifically for the priors for the β’s in logistic regression models (Seaman III et al., 2012; Northrup and Gerber, 2018; Lele, 2020). For example, Seaman III et al. (2012) show how Normal(0, 25²) priors for α and β in the simplest logistic model, logit(θ) = α + βx, induce a so-called bathtub shaped prior on θ with most of the probability mass concentrated at the extremes 0 and 1. Similarly, Northrup and Gerber (2018) show the effects of a variety of such priors in the analysis of presence/absence data and occupancy modeling, where there are two Bernoulli probabilities, one for presence and another for detection conditional on presence, both often modeled as functions of covariates via logistic regression. In the context of occupancy modeling, Lele (2020) examined posterior inference for a function of the presence and detection probabilities, namely the probability that a site is occupied but is observed to be unoccupied on all visits. He pointed out a sizable lack of invariance of the posterior distribution of this function to two alternative, but mathematically equivalent, parameterizations of the occupancy model, with samples of size at or around 30: for one parameterization the point estimate is 0.30 and for the other it is 0.67.

While sometimes the sample size is sufficient to yield a posterior distribution for θ that is dominated by the data, and therefore relatively unaffected by the prior, the practitioner might prefer to start with induced priors that more closely match their opinion, such as θ ~ Uniform(0,1) or, more generally, θ ~ Beta(α, β).

We do note that there are, of course, cases where informative priors that reflect expert opinion are desired (O’Hagan, 2008, 2019) and that the use of subjective priors can be defensible (Banner et al., 2020).

The purpose of this paper is to present procedures for specifying priors for the coefficients of a logistic regression model with p covariates which induce priors for a Bernoulli probability θ that approximate a specified distribution, in particular arbitrary Beta distributions. This includes the specific case of Beta(1,1), or Uniform(0,1). While Northrup and Gerber (2018) provide some advice for specifying priors for an intercept-only model, namely p = 0, they do not address the effects of multiple covariates on the induced prior for θ. We also consider the setting where the intercept is to be treated differently than the coefficients for the covariates; e.g., E[θ] equals a specified value on (0,1) when all covariates equal 0.

In Section 2 we begin with the simplest case of an intercept-only model, logit(θ) = β. We derive the induced prior for the Bernoulli probability θ given the prior for the intercept, π(β), and vice versa, and present a normal prior for β that yields approximately a Uniform(0,1) prior for θ. Section 3 extends the approach to p ≥ 1 covariates to induce a Uniform(0,1) prior on θ. In Section 4 the case where the induced prior for θ is any Beta distribution is addressed, along with an approach for distinct handling of the intercept. Simulation analyses are presented in Section 5 and the method is applied to a real occupancy data set in Section 6.

2 Logistic regression with constant mean model

We focus on logistic regression where the logit link function is used to model the probability of “success” with a Bernoulli random variable:

Y ~ Bernoulli(θ);
η = g(θ) = ln(θ/(1−θ)) = β0x0 + β1x1 + … + βpxp. (1)

The focus in this section is to derive priors for the coefficients of the above logistic regression model such that the induced prior for θ is a Uniform(0,1) distribution. These simple results will be built on in later sections to yield a Uniform(0,1) prior for θ when there are p ≥ 2 covariates in the regression model, and then are extended to the case where the induced prior is a Beta(α, β) for arbitrary shape parameters.

2.1 From π(β) to π(θ)

Regarding the induced priors for θ, for maximum simplicity we begin with a linear model for η that is a constant, i.e., η = β:

η = g(θ) = ln(θ/(1−θ)) = β.

Let π(β) be the prior distribution for β, which in this simple case is also the prior for η, i.e., π(η) = π(β). The induced prior for θ can be found by the change of variable theorem (Siegrist, 2024):

π_θ(θ) = π_β(h⁻¹(θ)) |dh⁻¹(θ)/dθ|,

where h(β) = θ. We define

θ = h(β) = exp(β)/(1 + exp(β)),

which leads to the induced pdf for θ:

π_θ(θ) = π_β(ln[θ/(1−θ)]) |1/(θ(1−θ))|.

Example.

Suppose a Normal(0, σ²) prior is chosen for β. The induced prior for θ is then

π(θ) = (1/√(2πσ²)) exp[−(ln(θ/(1−θ)))²/(2σ²)] · 1/(θ(1−θ)).

Figure 1(a) is a plot of the induced prior for θ given a Normal(0, σ² = 3²) prior for β. This is a “bathtub prior” for θ, which concentrates the prior probability around the extremes 0 and 1; the concentration becomes more extreme the larger the variance for β, as can be seen in Figure 1(b). Seaman III et al. (2012), Northrup and Gerber (2018) and Lele (2020) provide additional examples.

Figure 1: Examples of the induced prior for Bernoulli(θ) with logit linear model = β, where π(β) is Normal(0, σ²): (a) σ² = 3²; (b) increasing σ².
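This bathtub behaviour is easy to reproduce numerically. The following Python sketch (illustrative only; not the paper's R code) draws β from a Normal(0, 3²) prior, maps each draw through the inverse logit, and tallies how much of the induced prior mass for θ lands near the extremes versus the middle:

```python
import math
import random

random.seed(1)

def expit(b):
    # Inverse logit: maps beta on the real line to theta in (0, 1)
    return 1.0 / (1.0 + math.exp(-b))

# theta draws induced by a Normal(0, 3^2) prior on beta
draws = [expit(random.gauss(0.0, 3.0)) for _ in range(100_000)]

extreme = sum(t < 0.1 or t > 0.9 for t in draws) / len(draws)
middle = sum(0.4 < t < 0.6 for t in draws) / len(draws)
```

Roughly 46% of the induced prior mass falls outside (0.1, 0.9), while only about 11% lies in (0.4, 0.6), matching the bathtub shape in Figure 1(a).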

2.2 From π(θ) to π(β)

Our objective is to determine the induced prior for β given a desired, or target, prior for θ; the above process is reversed. For example, if the desired prior for θ is Uniform(0,1), the problem is to determine what prior for β would yield such a prior for θ. To begin,

β = g(θ) = ln(θ/(1−θ)).

Then the inverse operation is θ = g⁻¹(β) = exp(β)/(1 + exp(β)), and dg⁻¹(β)/dβ = exp(β)/(1 + exp(β))². Consequently the induced pdf for β is:

π_β(β) = exp(β)/(1 + exp(β))², −∞ < β < ∞. (2)

Figure 2 is a plot of the induced prior for β given a Uniform(0,1) prior for θ.

Figure 2: Induced prior for β when β = logit(θ) and θ ~ Uniform(0,1).

2.3 Logistic Distribution

The pdf of the induced prior for β in Equation (2) is a special case of the logistic distribution, which has location parameter μ and scale parameter s:

Logistic(x; μ, s) = exp(−(x−μ)/s) / ( s (1 + exp(−(x−μ)/s))² ).

The corresponding expectation and variance are:

E[X] = μ, V[X] = s²π²/3.

The pdf in Equation (2) is then Logistic(μ = 0, s = 1); due to the symmetry of the pdf around 0, the pdf is the same for β and −β, E[β] = 0 (as Figure 2 indicates) and V[β] = π²/3. We can conclude that if a Logistic(0,1) prior is used for β, the induced prior for θ is Uniform(0,1). Depending on the context, the target induced prior for θ may not be Uniform(0,1), e.g., it might be Beta(α, β); this case is addressed later.
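A quick Monte Carlo check of this claim (a sketch under the setup above, not the paper's code): sample β from Logistic(0,1) by inverse-CDF sampling, transform to θ, and compare the moments of θ with those of a Uniform(0,1), whose mean and variance are 1/2 and 1/12.

```python
import math
import random
import statistics

random.seed(2)

def rlogistic(mu=0.0, s=1.0):
    # Inverse-CDF draw: if U ~ Uniform(0,1), then mu + s*ln(U/(1-U)) ~ Logistic(mu, s)
    u = random.random()
    return mu + s * math.log(u / (1.0 - u))

betas = [rlogistic() for _ in range(200_000)]
thetas = [1.0 / (1.0 + math.exp(-b)) for b in betas]

var_beta = statistics.pvariance(betas)    # Logistic(0,1) variance is pi^2/3
mean_theta = statistics.fmean(thetas)     # Uniform(0,1) mean is 1/2
var_theta = statistics.pvariance(thetas)  # Uniform(0,1) variance is 1/12
```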

2.4 Normal approximation to Logistic(0,1)

Given the shape of the Logistic(0,1) distribution and its symmetry about zero, a Normal(0, π²/3) distribution could be used as an approximation. As Figure 3 shows, the two densities are close, although the extreme tails of the logistic are heavier than those of the variance-matched normal.

Figure 3: Induced prior (Logistic(0,1)) for β in the logit linear model for a Bernoulli parameter θ when π(θ) is Uniform(0,1), together with a Normal(0, π²/3) approximation.

3 Moment matching for π(β)’s for multivariate logistic regression

Now we consider the more general logistic regression setting where θ is a function of p covariates, as shown in Equation (1). From Section 2, if the prior for η is Logistic(0,1), then the induced prior for θ is Uniform(0,1). Therefore we need

η = β0 + β1x1 + … + βpxp ~ Logistic(0,1),

where we have assumed x0 = 1. Given the p+1 β parameters, there are p+1 prior distributions to specify, and there are two complications here. One is that there is a many-to-one mapping from the β’s to the single θ: in theory there are an infinite number of combinations of prior distributions for the β’s that induce a Logistic(0,1) distribution. The second is that the values of x1, …, xp will affect the distribution of η. Here we examine approximate solutions to these two problems, aiming to match the expected value and variance of η.

3.1 Linear model with all xi = 1

We begin with an unrealistic setting, to make the approach clearer, by assuming that the linear model for η is simply a sum of p+1 coefficients; equivalently, assume that all xi = 1:

η = g(θ) = β0 + β1·1 + … + βp·1 = Σ_{i=0}^p βi.

Hence there is a (p+1)-dimensional prior, π(β0, …, βp). The many-to-one problem can be bypassed by assuming that the prior distributions are all the same and independent. By moment matching we can determine what the mean, μ_β, and variance, σ²_β, of the in-common prior, denoted π(β), should be:

E[η] = 0 = E[β0] + E[β1] + … + E[βp] = (p+1)μ_β;
V[η] = π²/3 = V[β0] + V[β1] + … + V[βp] = (p+1)σ²_β.

So μ_β = 0 and σ²_β = π²/(3(p+1)). When p = 0, we obtain the variance of π²/3 considered in Section 2.4.
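This moment matching can be checked by simulation (an illustrative sketch; the choice p = 4 is ours): drawing p + 1 iid Normal(0, π²/(3(p+1))) coefficients and summing them reproduces the Logistic(0,1) mean and variance for η, and the induced θ is then approximately, though not exactly, Uniform(0,1), since a sum of normals is normal rather than logistic.

```python
import math
import random
import statistics

random.seed(3)
p = 4  # number of slope coefficients (an arbitrary choice for the sketch)
sigma_beta = math.sqrt(math.pi ** 2 / (3 * (p + 1)))

# eta is the sum of p + 1 iid Normal(0, pi^2/(3(p+1))) coefficients, all x_i = 1
etas = [sum(random.gauss(0.0, sigma_beta) for _ in range(p + 1))
        for _ in range(100_000)]
thetas = [1.0 / (1.0 + math.exp(-e)) for e in etas]

var_eta = statistics.pvariance(etas)      # target: pi^2/3
var_theta = statistics.pvariance(thetas)  # Uniform(0,1) would give 1/12
```

The variance of θ comes out close to, but slightly above, the Uniform(0,1) value of 1/12, reflecting the mild bathtub shape of the normal approximation.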

3.2 Moment matching with standardised covariates

We now examine the more realistic case where there are covariates in the linear model. We will, however, assume that the covariates x1, …, xp are standardised, xi = (xi* − mean(xi*))/sd(xi*), where xi* are the original values; thus each xi has sample mean 0 and sample standard deviation 1. To bypass the many-to-one problem, we again assume in-common, independent prior distributions. To begin, we assume that all the coefficients β0, …, βp are iid with prior π_β(β).

We again use moment matching to determine what conditions are necessary for E[η] = 0 and V[η] = π²/3 to be satisfied:

E[η] = 0 = E[β0 + β1x1 + … + βpxp] = E[β0] + Σ_{i=1}^p E[βi]E[Xi].

Given that the covariate values have been standardised, their expected values E[Xi] equal 0, so the second term above equals 0. (This treats the expectation as taken over a finite space of at most n values for each xi, with each xi implicitly drawn at random independently of xj, i ≠ j, which does not recognize any within-sample dependencies.) To ensure E[η] = 0, E[β0] must then also equal 0; if the β’s are iid, their expected values all equal 0.

We next examine V[η], assuming V[xi] = 1:

V[η] = π²/3 = V[β0 + Σ_{i=1}^p βixi]
 = V_β E_{x|β}[β0 + Σ_{i=1}^p βixi | β] + E_β V_{x|β}[β0 + Σ_{i=1}^p βixi | β]
 = V_β[β0 + Σ_{i=1}^p βiE[xi]] + E_β[Σ_{i=1}^p βi²V[xi]]
 = V_β[β0] + 0 + Σ_{i=1}^p E_β[βi²] × 1
 = σ²_β(1 + p).

To match the Logistic(0,1) variance of π²/3, we require

σ²_β = π²/(3(p+1)).

Therefore, as in the case with all xi = 1, μ_β = 0 and σ²_β = π²/(3(p+1)). The results are identical because the variance of the standardised xi’s (namely, 1) is the same as the assigned constant 1 in the previous case. Newman (2003) proposed these same normal prior distributions for an analysis of release-recovery data, but the derivation details were not shown.
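The result can be verified with actual standardised covariates. In the sketch below (hypothetical data; the skewed covariate distribution is an arbitrary choice), each column is standardised to mean 0 and population standard deviation 1, and the prior-predictive variance of η, over both the coefficient priors and a randomly chosen covariate row, again lands on π²/3:

```python
import math
import random
import statistics

random.seed(4)
n, p = 200, 3
sigma_beta = math.sqrt(math.pi ** 2 / (3 * (p + 1)))

# Skewed raw covariates (each a sum of exponentials), then column-standardised
raw = [[sum(random.expovariate(1.0) for _ in range(3)) for _ in range(p)]
       for _ in range(n)]
cols = list(zip(*raw))
means = [statistics.fmean(c) for c in cols]
sds = [statistics.pstdev(c) for c in cols]
x = [[(row[i] - means[i]) / sds[i] for i in range(p)] for row in raw]

etas = []
for _ in range(100_000):
    betas = [random.gauss(0.0, sigma_beta) for _ in range(p + 1)]
    row = random.choice(x)
    etas.append(betas[0] + sum(b * v for b, v in zip(betas[1:], row)))

var_eta = statistics.pvariance(etas)  # target: pi^2/3
```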

4 Arbitrary Beta distribution for θ

Now we consider the case where an arbitrary Beta distribution has been chosen for the Bernoulli parameter θ, and we determine the induced distribution for the logit transform. If the prior for θ is Beta(α, β) and η = logit(θ), the induced pdf for η is

p(η) = (1/B(α, β)) · exp(ηα)/(1 + exp(η))^(α+β), (3)

where B(α, β) is the beta function, Γ(α)Γ(β)/Γ(α+β). This is the pdf of the Type IV generalized logistic distribution (Johnson et al., 1995). Examples of the pdf for η, Equation (3), given different Beta(α, β) distributions for θ, are shown in Figure 4.

Figure 4: Density plots for η = logit(θ), where θ ~ Beta(α, β), for a variety of Beta distribution shape parameters. Numerically calculated means and variances are shown below each plot, and a normal approximation is shown with a dashed blue line.

The values of E[η] and Var[η] shown in Figure 4 were found by numerically integrating

E[η^k] = (1/B(α, β)) ∫_{−∞}^{∞} η^k exp(ηα)/(1 + exp(η))^(α+β) dη, k = 1, 2.

Hereafter we denote E[η] by μ_η and Var[η] by σ²_η.
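The same moments can also be estimated by Monte Carlo with only standard-library tools (exact expressions exist as well, in terms of polygamma functions). This sketch, which is not the paper's code, samples θ from the chosen Beta distribution and takes the sample mean and variance of logit(θ):

```python
import math
import random
import statistics

random.seed(5)

def logit_moments(alpha, beta, n=200_000):
    # Monte Carlo estimate of (mu_eta, sigma^2_eta) for eta = logit(theta),
    # where theta ~ Beta(alpha, beta)
    etas = []
    for _ in range(n):
        t = random.betavariate(alpha, beta)
        etas.append(math.log(t / (1.0 - t)))
    return statistics.fmean(etas), statistics.pvariance(etas)

mu_eta, var_eta = logit_moments(1.0, 1.0)  # Beta(1,1): exact values are 0 and pi^2/3
```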

4.1 Normal approximation: intercept-only model

As for the special case of Uniform(0,1), i.e., Beta(1,1), a normal approximation to the distribution of η can be obtained by matching the moments μ_η and σ²_η.

For an intercept-only model, η = β0, and the prior for β0 is

β0 ~ Normal(μ_η, σ²_η).

Examples of such normal approximations to the pdf of η are shown by dashed blue lines in Figure 4, which show that the adequacy of the approximation is a function of α and β.

4.2 Normal approximation: Multiple covariates

In the case of one or more covariates, as for the Uniform(0,1) case, there is again the many-to-one problem: there are an infinite number of combinations of normal priors for which the expected value and variance of η would equal μ_η and σ²_η. Furthermore, given that standardizing the covariates to have mean zero implies E[βiXi] = 0, the mean μ_η must come entirely from the prior mean for the intercept β0. Again, we can avoid the many-to-one problem by using p independent and identically distributed normal priors for the slope coefficients. Namely,

β0 ~ Normal(μ_η, σ²_η/(p+1)); (4)
βi ~ Normal(0, σ²_η/(p+1)), i = 1, …, p. (5)

Setting the mean of the priors for the βi terms equal to 0 is arbitrary, and could be any value, since E[βiXi] = 0 regardless. Then

E[η] = E[β0] + Σ_{i=1}^p E[βixi] = μ_η + 0 = μ_η;
Var[η] = Var[β0] + Σ_{i=1}^p Var[βixi] = σ²_η/(p+1) + p·σ²_η/(p+1) = σ²_η.

As p increases, the priors for both the intercept and the slope coefficients become increasingly concentrated around μ_η and 0, respectively. We next present an alternative which does not cause such shrinkage in the prior for the intercept.

We note that while there are different classes of so-called shrinkage priors, e.g., spike and slab priors, their intent differs somewhat: to shrink small effects to zero while maintaining true large effects (Van Erp et al., 2019). The intent here is instead to yield a desired induced distribution on a particular parameter.

4.3 Distinct handling of the intercept

In some cases there may be a prior opinion regarding the value of θ even if the covariates have no effect, or for an “average” covariate setting, and a distinct handling of the intercept can then be considered. In particular, one may not want the number of covariates to affect the prior for the intercept, i.e., to have the prior for β0 become increasingly concentrated around zero.

We begin by assuming a more general prior for θ, Beta(α, β), rather than the special case dealt with so far, namely α = β = 1, or Uniform(0,1). Direct specification of the α and β parameters could be done as well, but thinking in terms of the average value of θ and its variation may be easier. Specifically, we first specify the expected value of θ and a measure of its variation, the coefficient of variation (CV), and then determine what values of α and β match these specified values. For example, if we specify E[θ] = 0.7 and CV[θ] = 0.30, then the prior for θ is Beta(2.633, 1.129). Given the Beta distribution parameters, and referring to Equation (3), the expected value and variance of η can be numerically calculated: in this example μ_η ≈ 1.146 and σ²_η ≈ 1.785. Again assuming that the priors for β0 and βi, i = 1, …, p, are normal distributions, the resulting E[η] and Var[η] guide the choice of the normal distribution parameters.
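The mean/CV parameterisation can be inverted in closed form using the standard Beta moment identities E[θ] = α/(α+β) and Var[θ] = E[θ](1−E[θ])/(α+β+1). A small helper (our sketch; the function name is ours, not the paper's) recovers the Beta(2.633, 1.129) example above:

```python
def beta_from_mean_cv(mean, cv):
    """Return Beta(alpha, beta) shape parameters matching E[theta] and CV[theta].

    Solves E = a/(a+b) and Var = E(1-E)/(a+b+1) with Var = (mean*cv)^2;
    requires Var < mean*(1-mean) for a valid Beta distribution.
    """
    var = (mean * cv) ** 2
    total = mean * (1.0 - mean) / var - 1.0  # total = alpha + beta
    if total <= 0.0:
        raise ValueError("CV too large for a Beta distribution with this mean")
    return mean * total, (1.0 - mean) * total

alpha, beta = beta_from_mean_cv(0.7, 0.30)  # approximately (2.633, 1.129)
```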

We note that, except when α = β (thus E[θ] = 0.5 and E[η] = 0), the contribution of μ_η to the linear combination (Equation 1) must come entirely from the prior mean for the intercept, β0. This is because the covariates have been standardized to have mean 0, so E[βixi] = 0 whatever the expected value of βi under its normal prior.

To bypass the many-to-one issue we again assume that the priors for the slope coefficients are identical. Let μ0 and σ²_0 denote the mean and variance of the normal prior for β0, and let μ1 and σ²_1 denote the common mean and variance of the normal priors for the slope coefficients. The mean μ1 can be anything, given that it does not affect the expected value of η; while there might be some argument for μ1 ≠ 0, we cannot think of one, so for simplicity we set μ1 = 0. Subsequently,

β0 ~ Normal(μ_η, σ²_0), (6)
βi ~ Normal(0, σ²_1), i = 1, …, p. (7)

The constraint on the linear combination is that its expected value equals μ_η and its variance equals σ²_η. Redisplaying the linear combination for convenience, with a randomly drawn sequence of values xi,j,

η_j = β0 + Σ_{i=1}^p βi xi,j,

with E[η_j] = μ_η and

Var[η_j] = σ²_0 + pσ²_1,

where the latter sum must equal σ²_η. Given values for μ_η and σ²_η, once σ²_0 is specified, the variance for the slope coefficient priors is

σ²_1 = (σ²_η − σ²_0)/p.

Given the assumption of identical priors for all slope coefficients, the only remaining decision is the specification of σ²_0. There are numerous ways this can be done, and that shown below is just one possibility.

If the aim is that the number of covariates does not affect the prior for β0, in particular the prior variance for β0, then σ²_0 should not be a function of p. One approach is to define σ²_0 relative to σ²_η: specify a constant k, 0 < k < 1, and set σ²_0 = kσ²_η. Then

σ²_1 = (1 − k)σ²_η/p.

The choice of k is discussed in Section 7.
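The weighted scheme reduces to two lines of arithmetic. The helper below (a sketch; the function name and the choice k = 0.5 are ours) returns the normal prior means and variances for the intercept and the common slope prior, using the example values μ_η ≈ 1.146 and σ²_η ≈ 1.785 from Section 4.3:

```python
def weighted_priors(mu_eta, var_eta, p, k):
    """Normal prior settings for (intercept, slopes) with distinct intercept handling.

    The intercept keeps the fixed share k of the target variance var_eta,
    and the remaining (1 - k) * var_eta is split equally among the p slopes.
    """
    if not (0.0 < k < 1.0) or p < 1:
        raise ValueError("need 0 < k < 1 and p >= 1")
    var0 = k * var_eta                   # sigma^2_0 = k * sigma^2_eta
    var_slope = (1.0 - k) * var_eta / p  # sigma^2_1 = (1 - k) * sigma^2_eta / p
    return (mu_eta, var0), (0.0, var_slope)

(int_mean, int_var), (slope_mean, slope_var) = weighted_priors(1.146, 1.785, p=3, k=0.5)
```

By construction int_var + p·slope_var equals σ²_η, so the induced variance for η is unchanged while the intercept prior no longer shrinks as p grows.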

5 Simulations

In this section, we present the results of three simulation experiments, one for each of three scenarios (conducted using R (R Core Team, 2024)). We consider:

  1. Scenario 1: one covariate and a Uniform(0,1) prior for θ.

  2. Scenario 2: three covariates and a Uniform(0,1) prior for θ.

  3. Scenario 3: three covariates and a Beta(α, β) prior for θ, where α, β ≠ 1.

In all three scenarios different types of normal priors were used for the intercept β0 and the slope coefficient(s). We refer to the “Vague prior”, with identical diffuse Normal(0, σ²) priors for the intercept and slope coefficients, where σ² = 1000²; and the “Logistic prior”, where σ is based on the induced logistic distribution, the details of which vary depending on the number of covariates and the Beta(α, β) prior for θ. For example, for the Logistic prior, when the induced prior for θ is Uniform(0,1) and there are p covariates, σ = √(π²/(3(1+p))). Finally, when the prior for the intercept is handled differently (Section 4.3), the prior is called the “Weighted prior”.

For each scenario, we performed experiments on single samples (detailed in the Appendix), as well as on multiple data sets, to allow frequentist analyses of performance.

5.1 Scenario 1: One covariate and θ ~ U(0,1)

Multiple data sets of Bernoulli responses were generated from the following model:

yi ~ Bernoulli(θi), i = 1, …, 15; (8)
logit(θi) = −0.5 + 0.3xi, i = 1, …, 15,

where the values of x were simulated from a Gamma(3, 0.2) and then standardised. In the Appendix, the resulting response and covariate values are shown in Table 10 and the functional relationship between θ and x is shown in Figure 10.
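A data-generating sketch for this scenario (assumptions: Python rather than the paper's R, and a rate parameterisation for the Gamma(3, 0.2) draws; the rate/scale choice is immaterial here because the covariate is standardised afterwards):

```python
import math
import random

random.seed(6)

def simulate_dataset(n=15, b0=-0.5, b1=0.3):
    # Covariate: Gamma(shape=3, rate=0.2) as a sum of three Exponential(0.2) draws,
    # then standardised to mean 0 and sample standard deviation 1
    x = [sum(random.expovariate(0.2) for _ in range(3)) for _ in range(n)]
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    x = [(v - m) / s for v in x]
    # Bernoulli responses with logit(theta_i) = b0 + b1 * x_i
    y = [1 if random.random() < 1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) else 0
         for xi in x]
    return x, y

x, y = simulate_dataset()
```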

We compared the following two priors for β0 and β1:

Vague: β0, β1 ~ Normal(0, 1000²); (9)
Logistic: β0, β1 ~ Normal(0, π²/(3×2) = 1.283²).

The induced priors for θ are shown in Figure 11 in the Appendix.

We simulated 100 data sets from the model with n = 15, n = 50 and n = 100, and computed the following indices: the mean squared error from the maximum likelihood estimate (MSE*), the mean squared error from the posterior mean (MSE), and the coverage of the 95% credible interval. The above mean squared errors are based on the formula

MSE(θ̂) = (1/n) Σ_{i=1}^n (θ̂i − θ)²,

where θ̂i is the point estimate (i.e., the posterior mean or MAP, for example) for the i-th simulated data set and θ is either the MLE (for MSE*) or the true value of the parameter (for MSE).
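The MSE computation is a one-liner over the simulated data sets; the sketch below (with made-up estimates purely for illustration) applies the formula with the truth β0 = −0.5 as the reference value:

```python
import statistics

def mse(estimates, reference):
    # Mean squared error of per-dataset point estimates around a reference value
    # (the MLE for MSE*, or the true parameter value for MSE)
    return statistics.fmean((e - reference) ** 2 for e in estimates)

# Hypothetical posterior means for beta_0 from five simulated data sets
posterior_means = [-0.61, -0.42, -0.55, -0.48, -0.51]
err = mse(posterior_means, reference=-0.5)
```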

For the case n = 15, from Table 1 we note that the MSE for the Logistic prior is smaller (by about one order of magnitude) than that computed for the Vague normal prior. In addition, while the posterior coverage for the proposed prior is within the expected limits, under the Vague prior we note slightly lower than expected coverage for the intercept (0.92) and considerably lower coverage for the slope (0.78).

                      Vague                      Logistic
Parameter   Truth   MSE*      MSE      Cov     MSE*       MSE      Cov
β0          -0.50   0.0253    1.240    0.92    0.0123     0.2371   0.97
β1           0.30   0.0347    1.094    0.78    0.0203     0.2005   0.93
Table 1: MSE*, MSE and coverage of the 95% posterior credible interval (Cov), for n = 15, for the case with one covariate.

An interesting comparison is between the average MLEs and the average posterior means for both parameters. The average MLEs are -0.6184 and 0.4203 for β0 and β1, respectively. The average posterior mean for the intercept is -0.7775 under the Vague prior and -0.5076 under the Logistic prior; for the slope coefficient, the values are 0.6065 and 0.2780, respectively. The distances between the MLEs and the posterior means are smaller for the Logistic prior than for the Vague prior.

For the larger sample size n=50, the corresponding results are presented in Table 2 where, as expected given the greater influence of the data on the posterior distributions, the MSE from the MLE and from the posterior mean are lower. Note that, unlike in the case n=15, the coverages of the coefficients under the Vague prior are close to the nominal value of 95%.

            Vague                        Logistic
Parameter  Truth  MSE*    MSE     Cov    MSE*    MSE     Cov
β0         -0.50  0.0007  0.1165  0.93   0.0005  0.0989  0.94
β1          0.30  0.0017  0.1189  0.94   0.0002  0.0972  0.96
Table 2: MSE*, MSE and coverage of the 95% posterior credible interval (Cov), for n=50, for the case with one covariate.

In this case, the average MLE for the intercept is -0.5104 and for the slope coefficient is 0.3184. The average posterior means are, for the intercept, -0.5290 (Vague prior) and -0.4950 (Logistic prior) and, for the slope coefficient, 0.3425 (Vague prior) and 0.3165 (Logistic prior). Here, while the distances of the posterior means from the MLE are virtually the same for the intercept, those for the slope coefficient are smaller for the Logistic prior.

To conclude this experiment, we ran the above simulation with a larger sample size (n=100). The results are presented in Table 3, where we note that the priors' behaviours are virtually the same, as the information in the data dominates. The overall performance of the Logistic prior remains better than that of the Vague prior in terms of the distance between the MLE and the posterior mean, although the difference diminishes, as one would expect as the sample size increases.

            Vague                        Logistic
Parameter  Truth  MSE*    MSE     Cov    MSE*    MSE     Cov
β0         -0.50  0.0001  0.0496  0.94   0.0001  0.0458  0.95
β1          0.30  0.0003  0.0497  0.97   0.0000  0.0457  0.96
Table 3: MSE*, MSE and coverage of the 95% posterior credible interval (Cov), for n=100, for the case with one covariate.

For this final simulation study, the average MLE for the intercept is -0.5265 and for the slope coefficient is 0.2914. The average posterior means are, for the intercept, -0.5355 (Vague prior) and -0.5206 (Logistic prior) and, for the slope coefficient, 0.3013 (Vague prior) and 0.2914 (Logistic prior).

5.2 Scenarios 2 and 3: Three covariates

In the second and third scenarios, we drew samples from the model given in Equation (10), considering θ ∼ U(0,1) (Scenario 2) and its generalisation θ ∼ Beta(α, β) with distinct handling of the intercept (Scenario 3):

logit(θ) = 1.5 + 0.3x1 − 0.6x2 + 0.02x3,   (10)

where the covariates were simulated independently from three Gamma distributions, Gamma(10,2), Gamma(12,6), and Gamma(3,3), with the resulting θ values ranging from 0.39 to 0.93.
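A minimal sketch of this data-generating process, assuming the Gamma distributions are parameterized by shape and rate (so NumPy's scale is the reciprocal of the rate) and using the coefficient values from the simulation tables:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 50

# Covariates: Gamma(shape, rate) assumed, hence scale = 1 / rate
x1 = rng.gamma(10, 1 / 2, n)
x2 = rng.gamma(12, 1 / 6, n)
x3 = rng.gamma(3, 1 / 3, n)

# Linear predictor with the coefficients used in Tables 5, 7 and 9
eta = 1.5 + 0.3 * x1 - 0.6 * x2 + 0.02 * x3
theta = 1 / (1 + np.exp(-eta))   # inverse logit
y = rng.binomial(1, theta)       # Bernoulli responses

print(theta.min().round(2), theta.max().round(2))
```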

The prior distributions we compared are the Vague prior and the Logistic prior (see Equations (4) and (5)) for the three-covariate setting in Scenario 2:

Vague:    β0, β1, β2, β3 ∼ Normal(0, 1000²)   (11)
Logistic: β0, β1, β2, β3 ∼ Normal(0, π²/(3×(1+3)) = 0.907²),

and, for Scenario 3, the additional prior

β0 ∼ Normal(1.150, 0.4×1.843 = 0.7372)   (12)
βi ∼ Normal(0, 0.6×1.843/3 = 0.3686),  i = 1, 2, 3.

The induced θ priors, as well as an illustration of the model applied to a single sample, can be found in the Appendix.

The frequentist analysis for Scenario 2 was performed on 100 samples of sizes n=30, n=50 and n=100. We defer discussion of the results to below, where the Vague, Logistic and Weighted priors are shown together.

Similarly to Scenario 2, the frequentist analysis of Scenario 3 was performed on 100 samples of sizes n=30, n=50 and n=100.

Table 4, for n=50, shows the MSE*, the MSE and the interval coverage for the three priors. Both the Logistic prior and the Weighted prior perform better than the Vague prior, as the MSE is systematically smaller for the two proposed priors across all parameters. Comparing the coverages, the Weighted prior tends to give more extreme results, either close to 100% or dipping to about 90%, suggesting a less precise posterior. Given that the Weighted prior assumes a different prior for the intercept than for the slope coefficients, an increase in uncertainty is expected.

            Vague              Logistic           Weighted
Parameter  MSE*  MSE   Cov    MSE*  MSE   Cov    MSE*  MSE   Cov
β0         0.18  1.30  0.89   0.34  0.11  0.96   0.36  0.09  0.97
β1         0.03  0.45  0.89   0.07  0.10  0.97   0.20  0.05  0.97
β2         0.04  0.45  0.90   0.10  0.10  0.96   0.28  0.12  0.89
β3         0.10  1.05  0.91   0.28  0.09  0.99   0.38  0.03  1.00
Table 4: MSE*, MSE and coverage of the posterior 95% credible interval for the Vague prior, the Logistic prior and the Weighted prior. The summaries refer to 100 samples of size n=50.

For the above simulation study, the average MLEs and the average posterior means (under each prior) are reported in Table 5.

                            Posterior Means
Parameter  Truth   MLE      Vague    Logistic  Weighted
β0          1.5     1.6905   1.9587   1.3887    1.4235
β1          0.3     0.4563   0.5580   0.3324    0.1793
β2         -0.6    -0.6801  -0.7931  -0.5125   -0.2997
β3          0.02    0.0441   0.1010   0.0115    0.0013
Table 5: Average maximum likelihood estimates and average posterior means for the n=50 simulation study.

Similarly to the case with one covariate, we examined and compared the priors' performance for a relatively small sample size. Using the rule of thumb of ten data points per covariate, we drew 100 samples of size n=30. The results, in terms of MSE*, MSE and posterior coverage, are reported in Table 6.

            Vague                    Logistic                 Weighted
Parameter  MSE*    MSE     Cov      MSE*    MSE     Cov      MSE*    MSE     Cov
β0         0.8154  3.0434  0.75     0.5970  0.1512  0.96     0.5543  0.1103  0.98
β1         0.1865  1.4455  0.88     0.2760  0.1422  0.97     0.5686  0.0646  0.99
β2         0.3677  1.8433  0.82     0.3053  0.1593  0.98     0.7505  0.1611  0.86
β3         0.1253  1.1679  0.87     0.1421  0.1674  0.97     0.3440  0.0334  1.00
Table 6: MSE*, MSE and coverage of the posterior 95% credible interval for the Vague prior, the Logistic prior and the Weighted prior. The summaries refer to 100 samples of size n=30.

The MLE averages and the posterior mean averages are reported in Table 7.

                            Posterior Means
Parameter  Truth   MLE      Vague    Logistic  Weighted
β0          1.5     1.6265   1.7309   1.5111    1.5077
β1          0.3     0.3392   0.3748   0.2971    0.2023
β2         -0.6    -0.6352  -0.6782  -0.5634   -0.4065
β3          0.02    0.0303   0.0533   0.0337    0.0191
Table 7: Average maximum likelihood estimates and average posterior means for the n=30 simulation study.

Finally, in Table 8 we report the results for the larger sample size (n=100), where the differences between the priors are reduced, although the better performance of the Logistic and Weighted priors remains.

            Vague                    Logistic                 Weighted
Parameter  MSE*    MSE     Cov      MSE*    MSE     Cov      MSE*    MSE     Cov
β0         0.0123  0.1647  0.92     0.0182  0.0562  0.97     0.0259  0.0448  0.97
β1         0.0020  0.1160  0.94     0.0054  0.0658  0.96     0.0393  0.0396  0.98
β2         0.0025  0.0934  0.95     0.0078  0.0547  0.99     0.0672  0.0628  0.91
β3         0.0014  0.0952  0.97     0.0021  0.0586  0.99     0.0144  0.0281  1.00
Table 8: MSE*, MSE and coverage of the posterior 95% credible interval for the Vague prior, the Logistic prior and the Weighted prior. The summaries refer to 100 samples of size n=100.

The average summaries are reported in Table 9.

                            Posterior Means
Parameter  Truth   MLE      Vague    Logistic  Weighted
β0          1.5     1.9354   2.5906   1.3435    1.4266
β1          0.3     0.5153   0.7479   0.2586    0.1092
β2         -0.6    -0.8578  -1.1913  -0.5110   -0.2388
β3          0.02    0.0833   0.1658   0.0438    0.0154
Table 9: Average maximum likelihood estimates and average posterior means for the n=100 simulation study.

6 Application to an occupancy model

We examined the influence of priors on models for presence/absence or occupancy data. In occupancy modelling, Bernoulli random variables appear twice (MacKenzie et al., , 2017): once to indicate the true, but a priori unknown, presence (occupancy) of a species at a site or its absence, and again to indicate the detection of the species or the failure to detect it. Letting Z be an indicator for true presence and Y be an indicator for detection, the basic occupancy model structure is the following:

Z ∼ Bernoulli(ψ);
Y | Z = 1 ∼ Bernoulli(p),

where the probability of detection is here defined conditional on presence. Equivalently, the distribution for Y can be specified conditional on Z (rather than Z=1) as Y | Z ∼ Bernoulli(pZ). Note also that the unconditional distribution for Y is Bernoulli(pψ). Both the occupancy probability, ψ, and the detection probability, p, are commonly modelled as logistic functions of covariates.
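A minimal simulation from this two-level structure; the values of ψ and p here are hypothetical, not estimates from the goldfinch data discussed below:

```python
import numpy as np

rng = np.random.default_rng(7)
n_sites, n_visits = 266, 3
psi, p = 0.55, 0.4   # hypothetical occupancy and detection probabilities

z = rng.binomial(1, psi, n_sites)    # true (latent) presence at each site

# Detection only possible where z = 1: Y | Z ~ Bernoulli(p * Z)
y = rng.binomial(1, p * z[:, None], (n_sites, n_visits))

# Unconditionally, P(Y = 1) = p * psi, so the detection rate is near 0.22
print(y.mean().round(2))
```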

Here we compare the effects of different priors for the coefficients of the logistic models on a real data set. The data are presence/absence records of the European goldfinch made at 266 locations in Switzerland at three different times during the breeding season (Jeffrey Doser, personal communication; see also Chapter 11 of Kéry and Royle, (2015), the R package AHMbook, and the dataset MHB2014 (Kéry et al., , 2024)). Three covariates were used for modelling occupancy (elevation, elevation², and percent forest cover) and three for modelling detection (day-of-year, day-of-year², and the duration of the observation period at a given location and visit).

We use the function PGOcc in the R package spOccupancy (Doser et al., , 2022) to fit the occupancy models. The default prior in PGOcc for the intercept and slope coefficients is Normal(0, 1.65²) (labelled the spOcc prior), which is relatively narrow compared to vague priors such as Normal(0, 1000²). This choice is partially based on results from Northrup and Gerber, (2018) (Jeffrey Doser, personal communication). We used the same Logistic prior for the intercept and the slope coefficients as in Section 3, largely for simplicity, as PGOcc is set up to use the same priors for the intercepts and slopes (allowing for differences between occupancy probability and detection probability). Given p=3 covariates, the Logistic model priors were then Normal(0, 0.91²). A third set of priors, labelled wide priors, was also used, Normal(0, 40²); these replaced the Vague priors used previously, Normal(0, 1000²), which caused PGOcc to fail.

The usual bathtub shape of the induced priors for the occupancy and detection probabilities was observed for both the wide and spOcc priors, while the induced priors from the Logistic model priors were relatively flat (Figure 5). The resulting posterior distributions for the intercepts and slopes differed slightly between the three priors; for example, the posterior distributions for the occupancy slope coefficients (Figure 6) under the Logistic prior were shifted slightly more positive than under the other two priors. The resulting posterior mean probabilities of occupancy as a function of elevation, shown in Figure 7, were relatively similar across the range of elevations. The posterior credible intervals (not shown) were relatively wide, particularly at the extremes of elevation, which had fewer data.

The relative similarity of the posterior distributions can be attributed to the influence of the data overshadowing that of the prior. This can be seen in Figure 8, which shows the resulting posterior mean occupancy probability as a function of elevation when the model was fit to a simple random sample of n=50 drawn from the 266 survey locations. The discrepancy between the three priors is more pronounced with n=50 than when all n=266 sites were used, with the spOcc and Logistic priors yielding somewhat similar results but the wide prior less so.

Figure 5: Induced prior distributions for occupancy and detection probabilities of the goldfinch data set for three different priors for the logistic regression intercept and partial slopes.
Figure 6: Posterior distributions for Occupancy model intercept and slope coefficients for three different priors.
Figure 7: Posterior means for the probability of occupancy as a function of elevation (with percent forest cover set at average value) for the three different priors.
Figure 8: Resulting posterior means for the probability of occupancy as a function of elevation for a subset of n=50 locations from the goldfinch occupancy survey.

7 Discussion

We have proposed priors for the coefficients in a multiple-covariate logistic regression for a Bernoulli probability θ that induce a desired prior for θ. To the best of our knowledge, previous work has focused largely on single-covariate models or has applied the same prior to each regression coefficient regardless of the number of covariates. The proposed priors include ones such that the induced prior for θ is a Beta(α, β) with arbitrary shape parameters, as well as priors that allow for distinct handling of the intercept in the regression model. This work has raised additional questions that suggest the areas of research discussed below.

7.1 Weighting in the Weighted prior

For the Weighted prior, the proposed solution for determining the variance of the prior for the intercept was to set it equal to some fraction of σ²_η: σ²_0 = kσ²_η, 0 ≤ k ≤ 1. Can some guidance be provided regarding the choice of k?

If k is near 0, then the prior for the intercept is concentrated around μ_η, and the priors for the slope coefficients are near their maximum dispersion, σ²_η/p. At the extreme of k=0, the intercept β0 is a fixed value. On the other hand, if k is near 1, then the variance of the prior for the intercept is near its maximum, σ²_η, and the priors for the slope coefficients are concentrated around zero. At the extreme of k=1, the slope coefficients are fixed at zero and the covariates drop out of the model.
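The variance split described above can be written as a small helper; this is a sketch, with the numerical check using the Scenario 3 values σ²_η = 1.843, p = 3 and k = 0.4:

```python
def split_variance(sigma2_eta, p, k):
    """Split the total linear-predictor variance between the intercept
    (sigma2_0 = k * sigma2_eta) and each of the p slope coefficients
    ((1 - k) * sigma2_eta / p)."""
    assert 0.0 <= k <= 1.0
    sigma2_intercept = k * sigma2_eta
    sigma2_slope = (1 - k) * sigma2_eta / p
    return sigma2_intercept, sigma2_slope

# Scenario 3 values: recovers the variances 0.7372 and 0.3686 in Equation (12)
s0, s1 = split_variance(1.843, 3, 0.4)
print(round(s0, 4), round(s1, 4))  # 0.7372 0.3686
```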

Therefore k balances the effect of the intercept against that of the covariates, which raises the question of how to choose it. One thought is to treat k as a parameter to estimate and specify a hierarchical prior distribution on it. Another is to view k as a tuning parameter chosen against a goodness-of-fit measure. Further investigation of this issue is a topic for future research.

7.2 Concentration of priors around 0

As the number of covariates increases, the priors for the slope coefficients become increasingly concentrated around 0; that is, there is shrinkage toward 0. Two statistical procedures that have some similarity with this result are the Lasso (Tibshirani, , 1996) and penalized complexity priors (Simpson et al., , 2017). The Lasso can shrink coefficients to the point that a covariate is completely removed. Penalized complexity priors aim, given a set of nested models, to give the highest prior probability to a “base” model, e.g., an intercept-only model. Whether or not this shrinkage is desirable, and whether more direct connections can be made with the Lasso or penalized complexity priors, are additional areas for research.

7.3 Generating functions

The linear combination of random variables, namely the intercept and partial slope coefficients in the linear model, is a convolution. Dropping the intercept, setting all covariate values equal to 1, and assuming identical distributions for the slope coefficients (which is one way to address the many-to-one problem mentioned previously), the moment generating function (MGF) of the linear combination is the product of p identical MGFs:

E[exp(tη)] = E[exp(t(β1 + … + βp))] = (E[exp(tβ)])^p,

where β, β1, …, βp are independent and identically distributed. This results in the MGF for β,

E[exp(tβ)] = (E[exp(tη)])^(1/p).

If the MGF for η is known, the MGF for β is the pth root of that MGF. Inversion of an arbitrary MGF to obtain the underlying probability distribution is usually not a simple task, let alone inversion of the pth root of an MGF (Walker, , 2017). While interesting, this approach seems impractical.
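For Normal distributions, by contrast, the root is tractable, which sketches why moment matching with Normal coefficient priors (as in Equations (4) and (5)) is attractive: the pth root of a Normal MGF is again a Normal MGF,

```latex
M_\eta(t) = \exp\!\left(\mu t + \tfrac{1}{2}\,\sigma^2_\eta\, t^2\right)
\quad\Longrightarrow\quad
M_\eta(t)^{1/p} = \exp\!\left(\tfrac{\mu}{p}\, t + \tfrac{1}{2}\,\tfrac{\sigma^2_\eta}{p}\, t^2\right),
```

so each iid coefficient is Normal(μ/p, σ²_η/p); with μ = 0 this recovers the Normal(0, σ²_η/p) slope priors.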

7.4 Multi-Bernoulli distributions

Another topic for research is the extension of these results to multi-Bernoulli distributions. For example, an animal at location A at time t could, at time t+1, remain at A, move to B, or move to C. The location at the next time step can be modelled by a multi-Bernoulli distribution:

y_{t+1} | y_t ∼ Multi-Bernoulli(p1, p2, p3).

The probabilities are modelled by a multi-logistic transformation:

p1 ∝ exp(β1,0 + β1,1 x1)
p2 ∝ exp(β2,0 + β2,1 x2)
p3 = 1 − p1 − p2.

If uniform priors for p1, p2, and p3 are desired, what do the priors for the above β's need to be?
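One plausible reading of the transformation above treats the third location as the baseline category, so the normalizing constant is 1 + exp(·) + exp(·) and p3 = 1 − p1 − p2 follows automatically; a sketch with hypothetical coefficient values:

```python
import numpy as np

def multi_logistic(beta0, beta1, x):
    """Multi-logistic (baseline-category) transformation for three
    destinations; the third category is the baseline."""
    e1 = np.exp(beta0[0] + beta1[0] * x[0])
    e2 = np.exp(beta0[1] + beta1[1] * x[1])
    denom = 1.0 + e1 + e2      # baseline category contributes exp(0) = 1
    p1, p2 = e1 / denom, e2 / denom
    return p1, p2, 1.0 - p1 - p2

# Hypothetical coefficients and covariate values
probs = multi_logistic(beta0=(0.2, -0.1), beta1=(0.5, 0.3), x=(1.0, 2.0))
print([round(v, 3) for v in probs])
```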

7.5 General problem of lack of parameterization invariance

Here we have only addressed potentially problematic induced priors for Bernoulli parameters. Both Seaman III et al., (2012) and Lele, (2020) discussed this problem in non-Bernoulli settings. In particular, Lele, (2020) compared two mathematically equivalent parameterizations of a population dynamics model, the Ricker model, in a state-space model framework. The two formulations are

Model A: N_{t+1} = N_t × exp(a − b × N_t)   (13)
Model B: N_{t+1} = N_t × exp(a × (1 − N_t / K)),   (14)

where a is a measure of the intrinsic population growth rate, b is a measure of density dependence, and K is the carrying capacity; for Model A, K can be calculated as a/b. He showed how apparently non-informative priors for the parameters yielded quite different posterior means for the population growth rate and the carrying capacity.

One likely explanation for the discrepancy in the posterior means is that the induced priors for functions of the parameters with specified priors are quite different from the explicitly specified priors for those parameters in the other parameterization. Figure 9 shows how the induced distribution for the carrying capacity K in Model A is highly skewed compared to the specified distribution for K in Model B, with a similar result for the induced prior for the growth parameter a in Model B compared to the specified prior in Model A.

Figure 9: Priors, explicit and induced, for two parameterizations of the Ricker model used by Lele, (2020). Model A formulation is Equation (13) and Model B formulation is Equation (14).
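The qualitative skew of the induced prior for K = a/b under Model A is easy to reproduce by simulation; the uniform ranges below are hypothetical stand-ins for "non-informative" priors on a and b:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 100_000

# Hypothetical flat priors on Model A's parameters (illustrative ranges)
a = rng.uniform(0.0, 2.0, m)      # intrinsic growth rate
b = rng.uniform(0.001, 0.1, m)    # density dependence

k_induced = a / b                 # induced prior on carrying capacity K = a/b

# Strong right skew: the mean far exceeds the median
print(round(float(np.median(k_induced)), 1), round(float(np.mean(k_induced)), 1))
```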

In principle, the same concepts used to obtain a desired induced prior distribution for the Bernoulli parameter could be applied here: specify the desired induced priors for the growth rate and carrying capacity parameters, and use the change-of-variables approach to work backwards to the priors on the lower level parameters that would induce them. However, the resulting derived prior distributions will typically be nonstandard, and Monte Carlo procedures will likely be required to generate samples from them.

Acknowledgements.

For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising from this submission.

References

  • Antoniano-Villalobos et al., (2024) Antoniano-Villalobos, I., Villa, C., and Walker, S. G. (2024). A multidimensional objective prior distribution from a scoring rule. Journal of Statistical Planning and Inference, 231:106122.
  • Banner et al., (2020) Banner, K. M., Irvine, K. M., and Rodhouse, T. J. (2020). The use of Bayesian priors in ecology: The good, the bad and the not great. Methods in Ecology and Evolution, 11(8):882–889.
  • Bernardo, (1979) Bernardo, J. (1979). Reference posterior distributions for bayesian inference. Journal of the Royal Statistical Society: Series B (Methodological), 41:113–128.
  • Datta and Sweeting, (2005) Datta, G. S. and Sweeting, T. J. (2005). Probability matching priors. In Dey, D. and Rao, C., editors, Bayesian Thinking, Handbook of Statistics, pages 91–114. Elsevier.
  • Doser et al., (2022) Doser, J. W., Finley, A. O., Kéry, M., and Zipkin, E. F. (2022). spOccupancy: An R package for single-species, multi-species, and integrated spatial occupancy models. Methods in Ecology and Evolution, 13(8):1670–1678.
  • Gelman et al., (2014) Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D., Veharti, A., and Rubin, D. B. (2014). Bayesian data analysis, 3rd Edition. Taylor & Francis Group.
  • Hobbs and Hooten, (2015) Hobbs, N. T. and Hooten, M. B. (2015). Bayesian models: a statistical primer for ecologists. Princeton University Press.
  • Jeffreys, (1946) Jeffreys, H. (1946). An Invariant Form for the Prior Probability in Estimation Problems. Proceedings of the Royal Society of London Series A, 186:453–461.
  • Johnson et al., (1995) Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995). Continuous univariate distributions, volume 2, volume 289. John wiley & sons.
  • Kass and Wasserman, (1996) Kass, R. E. and Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435):1343–1370.
  • Kéry and Royle, (2015) Kéry, M. and Royle, J. A. (2015). Applied hierarchical modeling in ecology: Volume 1: Prelude and static models. Elsevier Science.
  • Kéry et al., (2024) Kéry, M., Royle, A., and Meredith, M. (2024). AHMbook: Functions and Data for the Book ’Applied Hierarchical Modeling in Ecology’ Vols 1 and 2. R package version 0.2.10.
  • Leisen et al., (2020) Leisen, F., Villa, C., and Walker, S. G. (2020). On a Class of Objective Priors from Scoring Rules (with Discussion). Bayesian Analysis, 15(4):1345 – 2767.
  • Lele, (2020) Lele, S. R. (2020). Consequences of lack of parameterization invariance of non-informative Bayesian analysis for wildlife management: Survival of San Joaquin kit fox and declines in amphibian populations. Frontiers in Ecology and Evolution, 7:501.
  • Lele and Dennis, (2009) Lele, S. R. and Dennis, B. (2009). Bayesian methods for hierarchical models: are ecologists making a Faustian bargain? Ecological applications, 19(3):581–584.
  • MacKenzie et al., (2017) MacKenzie, D. I., Nichols, J. D., Royle, J. A., Pollock, K. H., Bailey, L., and Hines, J. E. (2017). Occupancy estimation and modeling: inferring patterns and dynamics of species occurrence. Elsevier.
  • Newman, (2003) Newman, K. B. (2003). Modelling paired release-recovery data in the presence of survival and capture heterogeneity with application to marked juvenile salmon. Statistical modelling, 3(3):157–177.
  • Northrup and Gerber, (2018) Northrup, J. M. and Gerber, B. D. (2018). A comment on priors for Bayesian occupancy models. PloS One, 13(2):e0192819.
  • O’Hagan, (2008) O’Hagan, A. (2008). The Bayesian approach to statistics. In Handbook of probability: Theory and applications, volume 6.
  • O’Hagan, (2019) O’Hagan, A. (2019). Expert knowledge elicitation: subjective but scientific. The American Statistician, 73(sup1):69–81.
  • R Core Team, (2024) R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Seaman III et al., (2012) Seaman III, J. W., Seaman Jr, J. W., and Stamey, J. D. (2012). Hidden dangers of specifying noninformative priors. The American Statistician, 66(2):77–84.
  • Siegrist, (2024) Siegrist, K. (2024). Transformations of random variables. https://stats.libretexts.org/Bookshelves/Probability\Theory/ProbabilityMathematicalStatisticsandStochasticProcesses [Accessed: 2024-05-20].
  • Simpson et al., (2017) Simpson, D., Rue, H., Riebler, A., Martins, T. G., and Sørbye, S. H. (2017). Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 32(1):1–28.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288.
  • van de Schoot et al., (2021) van de Schoot, R., Depaoli, S., King, R., Kramer, B., Märtens, K., Tadesse, M. G., Vannucci, M., Gelman, A., Veen, D., Willemsen, J., et al. (2021). Bayesian statistics and modelling. Nature Reviews Methods Primers, 1(1):1.
  • Van Erp et al., (2019) Van Erp, S., Oberski, D. L., and Mulder, J. (2019). Shrinkage priors for Bayesian penalized regression. Journal of Mathematical Psychology, 89:31–50.
  • Walker, (2017) Walker, S. G. (2017). A Laplace transform inversion method for probability distribution functions. Statistics and Computing, 27(2):439–448.
  • Walker and Villa, (2021) Walker, S. G. and Villa, C. (2021). An objective prior from a scoring rule. Entropy, 23(7):833.

Appendix - Simulation studies details

Additional details and output from the simulations discussed in Section 5 are given below.

Scenario 1: One covariate and θ ∼ U(0,1)

Single Sample Analysis

We drew a sample of size n=15 from the model given in Equation (8) in Section 5.1. The simulated covariate values and related responses are shown in Table 10, while the relationship between θ and x is plotted in Figure 10.

y x1.std
1 1.00 -1.47
2 0.00 -1.13
3 1.00 -1.01
4 0.00 -1.01
5 1.00 -0.90
6 1.00 -0.31
7 1.00 -0.15
8 1.00 -0.03
9 0.00 0.26
10 1.00 0.28
11 1.00 0.61
12 1.00 0.73
13 1.00 0.84
14 1.00 1.19
15 1.00 2.10
Table 10: Simulated responses and covariates for single covariate model.
Figure 10: Scenario 1 logit model: θ vs x.

The induced priors on θ obtained from the priors on the βs (see Equation (9)) can be found in Figure 11. We note a difference in shape between the two: the prior induced by the Vague prior assigns almost all of its mass towards 0 and 1, whilst the induced prior resulting from the proposed method is well approximated by a uniform density (which was the aim).

Figure 11: Induced priors for θ based on a single covariate with Normal(0, 1000²) and Normal(0, π²/(3×2)) priors, based on n=15 observations.
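The contrast in Figure 11 can be reproduced by pushing prior draws of the coefficients through the inverse logit; a sketch, with the covariate value taken from Table 10:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 100_000
x = -1.47   # a standardized covariate value from Table 10

results = {}
for label, sd in [("vague", 1000.0), ("logistic", np.sqrt(np.pi**2 / (3 * 2)))]:
    b0 = rng.normal(0.0, sd, m)
    b1 = rng.normal(0.0, sd, m)
    with np.errstate(over="ignore"):   # exp overflows harmlessly to inf
        theta = 1 / (1 + np.exp(-(b0 + b1 * x)))
    # Fraction of prior mass within 0.05 of the endpoints 0 and 1
    results[label] = np.mean((theta < 0.05) | (theta > 0.95))
    print(label, round(float(results[label]), 2))
```

Under the Vague prior essentially all the mass piles up at the endpoints (the bathtub shape), while under the Logistic prior only a modest fraction does.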

Side-by-side boxplots of the posterior distributions for β0 and β1 resulting from the Vague and Logistic priors are shown in Figure 12, while histograms of the posterior distributions for the intercept and the slope coefficient, under both priors, along with the MCMC trace plots, are shown in Figure 14. The MCMC was run for 5000 iterations with a burn-in of 2000, using 4 independent chains. No evidence of a lack of convergence was identified.

Figure 12: Posterior distributions for β0 and β1 with Vague and Logistic priors.

In Table 11 we show the posterior summaries, that is, the posterior mean and the posterior 95% credible interval, as well as the true values and the MLEs. The true parameters are contained in both posterior 95% credible intervals, and the intervals are wider under the Logistic prior, meaning that it carries less information than the Vague prior; this supports our claim that it is a “more objective” prior. Although the posterior distributions are reasonably symmetric, as seen in Figure 14, the maximum a posteriori (MAP) estimates are -0.13 and -0.15 for the intercept, under the Vague and Logistic priors respectively, and 0.64 and 0.65 for the slope coefficient. The latter results suggest negatively skewed posteriors.

                          Vague                   Logistic
Parameter  Truth   MLE    Mean   C.I.             Mean   C.I.
β0         -0.50  -0.14  -0.13  (-1.28, 0.94)    -0.13  (-0.74, 1.67)
β1          0.30   0.35   0.43  (-1.13, 0.86)     0.34  (-0.70, 1.43)
Table 11: Scenario 1: posterior means and posterior 95% credible intervals with n=15. The table also includes the true values and the MLEs.
Figure 14: Scenario 1: posterior histograms and chains for the intercept and slope coefficient under (a) the Vague prior and (b) the proposed prior. Each histogram contains three vertical lines showing the true value (solid), the MLE (dotted) and the posterior mean (dashed). We sampled n=15 data points from a model with parameters β0 = -0.5 and β1 = 0.3.

Scenarios 2 and 3

Single Sample Analysis

For this experiment, we simulated a sample of size n=50 from the model given in Equation (10) in Section 5.2. The prior distributions on the βs were the Vague prior, the Logistic prior and the Weighted prior (see Equations (11) and (12)). The induced priors on θ corresponding to the Vague and the Logistic priors are shown in Figure 15, while the one for the Weighted prior is shown in Figure 16. We recall that the Weighted prior is designed so that the induced prior on θ is an arbitrary Beta(α, β) and, in addition, the prior for the intercept β0 is distinct from the one on the slope coefficients. In our case, the prior distribution for θ was determined by specifying an expected value of 0.7 and a CV of 0.3, which leads to a Beta(2.633, 1.129). The induced moments for η were μ_η = 1.150 and σ²_η = 1.843. A weight of k=0.4 was selected for the variance of the prior for β0. The resulting prior is Equation (12) in Section 5.2.
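The Beta parameters and the induced logit-scale moments quoted above can be checked numerically; in this sketch the logit-scale moments are estimated by Monte Carlo rather than in closed form:

```python
import numpy as np

def beta_from_mean_cv(mean, cv):
    """Shape parameters of a Beta(alpha, beta) distribution with the
    given mean and coefficient of variation."""
    var = (cv * mean) ** 2
    s = mean * (1 - mean) / var - 1      # s = alpha + beta
    return mean * s, (1 - mean) * s

alpha, beta = beta_from_mean_cv(0.7, 0.3)
print(round(alpha, 3), round(beta, 3))   # 2.633 1.129

# Monte Carlo estimate of the induced logit-scale moments mu_eta, sigma2_eta
rng = np.random.default_rng(5)
theta = rng.beta(alpha, beta, 1_000_000)
eta = np.log(theta / (1 - theta))
print(round(float(eta.mean()), 2), round(float(eta.var()), 2))  # near 1.15 and 1.84
```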

Figure 15: Induced θ priors with 3 covariates, with coefficient priors Normal(0, 1000²) and Normal(0, π²/(3×(1+3)) = 0.907²), based on n=50 observations.
Figure 16: Histogram of the induced prior for θ given Weighted priors for the intercept and slope coefficients, along with the Beta(2.633, 1.129) pdf (blue dashed line).

The MCMC settings are as in the previous example: we ran 4 chains for 5000 iterations with a burn-in of 2000, giving a usable posterior sample of 3000 iterations per chain. The posterior summaries are reported in Table 12, and the histograms of the posterior distributions are in Figure 17. Given that some of the posterior distributions appear to be asymmetric, we have also reported the MAPs in Table 13, along with the corresponding true values and MLEs.

The prior performances vary across the parameters, although the true values are all within the 95% posterior credible intervals. There is no clear dominance of one prior over the others, either in recovering the true value or in the proximity of the posterior mean to the MLE.

                         Vague                    Logistic                Weighted
Parameter  Truth   MLE   Mean   C.I.              Mean   C.I.             Mean   C.I.
β0          1.50   1.80   2.05  (1.19, 3.07)      1.59  (0.91, 2.33)      1.65  (1.01, 2.35)
β1          0.30   0.11   0.12  (-0.83, 1.09)     0.08  (-0.67, 0.84)     0.04  (-0.49, 0.57)
β2         -0.60  -0.62  -0.70  (-1.63, 0.15)    -0.49  (-1.22, 0.21)    -0.29  (-0.82, 0.24)
β3          0.02   0.27   0.37  (-0.56, 1.43)     0.26  (-0.45, 1.04)     0.15  (-0.38, 0.70)
Table 12: Posterior means and 95% credible intervals for the Vague, Logistic, and Weighted priors (n=50).
Parameter  Truth   MLE    Vague  Logistic  Weighted
β0          1.50   1.80    2.24    1.49     1.75
β1          0.30   0.11    0.20   -0.06     0.24
β2         -0.60  -0.70   -1.01   -0.60    -0.13
β3          0.02   0.37    0.27    0.04     0.04
Table 13: MAP estimates under the Vague, Logistic, and Weighted priors for the case n=50.
Figure 17: Scenarios 2 and 3: posterior histograms of the intercept and partial slope coefficients for a single sample of size n=50 from a model with β0=1.5, β1=0.3, β2=-0.6 and β3=0.02 (top row to bottom row, respectively). The first column is based on the Vague prior, the second on the Logistic prior with identical priors for the intercept and slope coefficients (Equations (4) and (5)), and the third on the Weighted prior (Equations (6) and (7)). Each histogram contains three vertical lines showing the true value (solid), the MLE (dotted) and the posterior mean (dashed).