
On the Use of Cauchy Prior Distributions for Bayesian Logistic Regression

Joyee Ghosh, The University of Iowa, Iowa City, IA. Email: joyee-ghosh@uiowa.edu    Yingbo Li, Clemson University, Clemson, SC. Email: ybli@clemson.edu. These authors contributed equally.    Robin Mitra, University of Southampton, Southampton, UK. Email: R.Mitra@soton.ac.uk
Abstract

In logistic regression, separation occurs when a linear combination of the predictors can perfectly classify part or all of the observations in the sample, and as a result, finite maximum likelihood estimates of the regression coefficients do not exist. Gelman et al. (2008) recommended independent Cauchy distributions as default priors for the regression coefficients in logistic regression, even in the case of separation, and reported posterior modes in their analyses. As the mean does not exist for the Cauchy prior, a natural question is whether the posterior means of the regression coefficients exist under separation. We prove theorems that provide necessary and sufficient conditions for the existence of posterior means under independent Cauchy priors for the logit link and a general family of link functions, including the probit link. We also study the existence of posterior means under multivariate Cauchy priors. For full Bayesian inference, we develop a Gibbs sampler based on Pólya-Gamma data augmentation to sample from the posterior distribution under independent Student-t priors, including Cauchy priors, and provide a companion R package in the supplement. We demonstrate empirically that even when the posterior means of the regression coefficients exist under separation, the magnitude of the posterior samples for Cauchy priors may be unusually large, and the corresponding Gibbs sampler shows extremely slow mixing. While alternative algorithms such as the No-U-Turn Sampler in Stan can greatly improve mixing, in order to resolve the issue of extremely heavy tailed posteriors for Cauchy priors under separation, one would need to consider lighter tailed priors such as normal priors or Student-t priors with degrees of freedom larger than one.

1 Introduction

In Bayesian linear regression, the choice of prior distribution for the regression coefficients is a key component of the analysis. Noninformative priors are convenient when the analyst does not have much prior information, but these prior distributions are often improper which can lead to improper posterior distributions in certain situations. Fernández and Steel (2000) investigated the propriety of the posterior distribution and the existence of posterior moments of regression and scale parameters for a linear regression model, with errors distributed as scale mixtures of normals, under the independence Jeffreys prior. For a design matrix of full column rank, they showed that posterior propriety holds under mild conditions on the sample size; however, the existence of posterior moments is affected by the design matrix and the mixing distribution. Further, there is not always a unique choice of noninformative prior (Yang and Berger 1996). On the other hand, proper prior distributions for the regression coefficients guarantee the propriety of posterior distributions. Among them, normal priors are commonly used in normal linear regression models, as conjugacy permits efficient posterior computation. The normal priors are informative because the prior mean and covariance can be specified to reflect the analyst’s prior information, and the posterior mean of the regression coefficients is the weighted average of the maximum likelihood estimator and the prior mean, with the weight on the latter decreasing as the prior variance increases.

A natural alternative to the normal prior is the Student-t prior distribution, which can be viewed as a scale mixture of normals. The Student-t prior has tails heavier than the normal prior, and hence is more appealing when weakly informative priors are desirable. The Student-t prior is considered robust, because when it is used for location parameters, outliers have vanishing influence on posterior distributions (Dawid 1973). The Cauchy distribution is a special case of the Student-t distribution with 1 degree of freedom. It has been recommended as a prior for normal mean parameters in point null hypothesis testing (Jeffreys 1961), because if the observations are overwhelmingly far from zero (the value of the mean specified under the point null hypothesis), the Bayes factor favoring the alternative hypothesis tends to infinity. Multivariate Cauchy priors have also been proposed for regression coefficients (Zellner and Siow 1980).

While the choice of prior distributions has been extensively studied for normal linear regression, there has been comparatively less work for generalized linear models. Propriety of the posterior distribution and the existence of posterior moments for binary response models under different noninformative prior choices have been considered (Ibrahim and Laud 1991; Chen and Shao 2001).

Regression models for binary response variables may suffer from a particular problem known as separation, which is the focus of this paper. For example, complete separation occurs if there exists a linear function of the covariates for which positive values of the function correspond to those units with response values of 1, while negative values of the function correspond to units with response values of 0. Formal definitions of separation (Albert and Anderson 1984), including complete separation and its closely related counterpart quasicomplete separation, are reviewed in Section 2. Separation is not a rare problem in practice, and has the potential to become increasingly common in the era of big data, with analysis often being made on data with a modest sample size but a large number of covariates. When separation is present in the data, Albert and Anderson (1984) showed that the maximum likelihood estimates (MLEs) of the regression coefficients do not exist (i.e., are infinite). Removing certain covariates from the regression model may appear to be an easy remedy for the problem of separation, but this ad-hoc strategy has been shown to often result in the removal of covariates with strong relationships with the response (Zorn 2005).

In the frequentist literature, various solutions based on penalized or modified likelihoods have been proposed to obtain finite parameter estimates (Firth 1993; Heinze and Schemper 2002; Heinze 2006; Rousseeuw and Christmann 2003). The problem has also been noted when fitting Bayesian logistic regression models (Clogg et al. 1991), where posterior inferences would be similarly affected by the problem of separation if using improper priors, with the possibility of improper posterior distributions (Speckman et al. 2009).

Gelman et al. (2008) recommended using independent Cauchy prior distributions as a default weakly informative choice for the regression coefficients in a logistic regression model, because these heavy tailed priors avoid over-shrinking large coefficients, but provide shrinkage (unlike improper uniform priors) that enables inferences even in the presence of complete separation. Gelman et al. (2008) developed an approximate EM algorithm to obtain the posterior mode of regression coefficients with Cauchy priors. While inferences based on the posterior mode are convenient, often other summaries of the posterior distribution are also of interest. For example, posterior means under Cauchy priors estimated via Monte Carlo and other approximations have been reported in Bardenet et al. (2014); Chopin and Ridgway (2015). It is well-known that the mean does not exist for the Cauchy distribution, so clearly the prior means of the regression coefficients do not exist. In the presence of separation, where the maximum likelihood estimates are not finite, it is not clear whether the posterior means will exist. To the best of our knowledge, there has been no investigation of the existence of the posterior mean under Cauchy priors, and our research fills this gap. We establish a necessary and sufficient condition under which independent Cauchy priors result in finite posterior means. In doing so we provide further theoretical underpinning of the approach recommended by Gelman et al. (2008), and additionally provide further insights on their suggestion of centering the covariates before fitting the regression model, which can have an impact on the existence of posterior means.

When the conditions for existence of the posterior mean are satisfied, we also empirically compare different prior choices (including the Cauchy prior) through various simulated and real data examples. In general, posterior computation for logistic regression is known to be more challenging than for probit regression. Several MCMC algorithms for logistic regression have been proposed (O’Brien and Dunson 2004; Holmes and Held 2006; Gramacy and Polson 2012), and the most recent Pólya-Gamma data augmentation scheme of Polson et al. (2013) has emerged as superior to the other methods. Thus we extend this Pólya-Gamma Gibbs sampler for normal priors to accommodate independent Student-t priors and provide an R package to implement the corresponding Gibbs sampler.

The remainder of this article is organized as follows. In Section 2 we derive the theoretical results: a necessary and sufficient condition for the existence of posterior means for coefficients under independent Cauchy priors in a logistic regression model in the presence of separation, and extend our investigation to binary regression models with other link functions such as the probit link, and multivariate Cauchy priors. In Section 3 we develop a Gibbs sampler for the logistic regression model under independent Student-t prior distributions (of which the Cauchy distribution is a special case) and briefly describe the NUTS algorithm of Hoffman and Gelman (2014) which forms the basis of the software Stan. In Section 4 we illustrate via simulated data that Cauchy priors may lead to coefficients of extremely large magnitude under separation, accompanied by slow mixing Gibbs samplers, compared to lighter tailed priors such as Student-t priors with 7 degrees of freedom (t_7) or normal priors. In Section 5 we compare Cauchy, t_7, and normal priors based on two real datasets, the SPECT data with quasicomplete separation and the Pima Indian Diabetes data without separation. Overall, Cauchy priors exhibit slow mixing under the Gibbs sampler compared to the other two priors. Although mixing can be improved by the NUTS algorithm in Stan, normal priors seem to be the most preferable in terms of producing more reasonable scales for posterior samples of the regression coefficients accompanied by competitive predictive performance, under separation. In Section 6 we conclude with a discussion and our recommendations.

2 Existence of Posterior Means Under Cauchy Priors

In this section, we begin with a review of the concepts of complete and quasicomplete separation proposed by Albert and Anderson (1984). Then based on a new concept of solitary separators, we introduce the main theoretical result of this paper, a necessary and sufficient condition for the existence of posterior means of regression coefficients under independent Cauchy priors in the case of separation. Finally, we extend our investigation to binary regression models with other link functions, and Cauchy priors with different scale parameter structures.

Let 𝐲=(y1,y2,,yn)T\mathbf{y}=(y_{1},y_{2},\ldots,y_{n})^{T} denote a vector of independent Bernoulli response variables with success probabilities π1,π2,,πn\pi_{1},\pi_{2},\ldots,\pi_{n}. For each of the observations, i=1,2,,ni=1,2,\ldots,n, let 𝐱i=(xi1,xi2,,xip)T\mathbf{x}_{i}=(x_{i1},x_{i2},\ldots,x_{ip})^{T} denote a vector of pp covariates, whose first component is assumed to accommodate the intercept, i.e., xi,1=1x_{i,1}=1. Let 𝐗\mathbf{X} denote the n×pn\times p design matrix with 𝐱iT\mathbf{x}_{i}^{T} as its iith row. We assume that the column rank of 𝐗\mathbf{X} is greater than 1. In this paper, we mainly focus on the logistic regression model, which is expressed as:

\log\left(\frac{\pi_{i}}{1-\pi_{i}}\right)=\mathbf{x}_{i}^{T}\bm{\beta},\quad i=1,2,\ldots,n, \qquad (1)

where 𝜷=(β1,β2,,βp)T\bm{\beta}=(\beta_{1},\beta_{2},\ldots,\beta_{p})^{T} is the vector of regression coefficients.

2.1 A Brief Review of Separation

We denote two disjoint subsets of sample points based on their response values: A0={i:yi=0}A_{0}=\{i:y_{i}=0\} and A1={i:yi=1}A_{1}=\{i:y_{i}=1\}. According to the definition of Albert and Anderson (1984), complete separation occurs in the sample if there exists a vector 𝜶=(α1,α2,,αp)T\bm{\alpha}=(\alpha_{1},\alpha_{2},\ldots,\alpha_{p})^{T}, such that for all i=1,2,,ni=1,2,\ldots,n,

\mathbf{x}_{i}^{T}\bm{\alpha}>0\ \text{ if }i\in A_{1},\quad\mathbf{x}_{i}^{T}\bm{\alpha}<0\ \text{ if }i\in A_{0}. \qquad (2)

Consider a simple example in which we wish to predict whether subjects in a study have a certain kind of infection based on model (1). Let y_{i}=1 if the ith subject is infected and 0 otherwise. The model includes an intercept (x_{i1}=1) and the other covariates are age (x_{i2}), gender (x_{i3}), and previous records of being infected (x_{i4}). Suppose in the sample, all infected subjects are older than 25 (x_{i2}>25), and all subjects who are not infected are younger than 25 (x_{i2}<25). This is an example of complete separation because (2) is satisfied for \bm{\alpha}=(-25,1,0,0)^{T}.
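As a quick numeric check, condition (2) can be verified directly. The following small R snippet uses made-up ages and covariate values (purely hypothetical, for illustration only):

# Hypothetical data for the infection example; with alpha = (-25, 1, 0, 0),
# x_i' alpha = age_i - 25 is positive for every infected subject (y = 1)
# and negative for every uninfected subject (y = 0).
y      <- c(1, 1, 1, 0, 0, 0)
age    <- c(30, 41, 57, 22, 19, 24)
gender <- c(0, 1, 0, 1, 0, 1)
prev   <- c(1, 0, 1, 0, 0, 1)
X      <- cbind(1, age, gender, prev)
alpha  <- c(-25, 1, 0, 0)
score  <- drop(X %*% alpha)
all(score[y == 1] > 0) && all(score[y == 0] < 0)   # TRUE: complete separation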

If the sample points cannot be completely separated, Albert and Anderson (1984) introduced another notion of separation called quasicomplete separation. There is quasicomplete separation in the sample if there exists a non-null vector 𝜶=(α1,α2,,αp)T\bm{\alpha}=(\alpha_{1},\alpha_{2},\ldots,\alpha_{p})^{T}, such that for all i=1,2,,ni=1,2,\ldots,n,

\mathbf{x}_{i}^{T}\bm{\alpha}\geq 0\ \text{ if }i\in A_{1},\quad\mathbf{x}_{i}^{T}\bm{\alpha}\leq 0\ \text{ if }i\in A_{0}, \qquad (3)

and equality holds for at least one i. Consider the set up of the previous example where the goal is to predict whether a person is infected or not. Suppose we have the same model but there is a slight modification in the dataset: all infected subjects are at least 25 years old (x_{i2}\geq 25), all uninfected subjects are no more than 25 years old (x_{i2}\leq 25), and there are two subjects aged exactly 25, one of whom is infected and the other is not. This is an example of quasicomplete separation because (3) is satisfied for \bm{\alpha}=(-25,1,0,0)^{T} and equality holds for the two observations with age exactly 25.

Let 𝒞\mathcal{C} and 𝒬\mathcal{Q} denote the set of all vectors 𝜶\bm{\alpha} that satisfy (2) and (3), respectively. For any 𝜶𝒞\bm{\alpha}\in\mathcal{C}, all sample points must satisfy (2), so 𝜶\bm{\alpha} cannot lead to quasicomplete separation which requires at least one equality in (3). This implies that 𝒞\mathcal{C} and 𝒬\mathcal{Q} are disjoint sets, while both can be non-empty for a certain dataset. Note that Albert and Anderson (1984) define quasicomplete separation only when the sample points cannot be separated using complete separation. Thus according to their definition, only one of 𝒞\mathcal{C} and 𝒬\mathcal{Q} can be non-empty for a certain dataset. However, in our slightly modified definition of quasicomplete separation, the absence of complete separation is not required. This permits both 𝒞\mathcal{C} and 𝒬\mathcal{Q} to be non-empty for a dataset. In the remainder of the paper, for simplicity we use the term “separation” to refer to either complete or quasicomplete separation, so that 𝒞𝒬\mathcal{C}\cup\mathcal{Q} is non-empty.

2.2 Existence of Posterior Means Under Independent Cauchy Priors

When Markov chain Monte Carlo (MCMC) is applied to sample from the posterior distribution, the posterior mean is a commonly used summary statistic. We aim to study whether the marginal posterior mean E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exists under the independent Cauchy priors suggested by Gelman et al. (2008). Let C(μ,σ)C(\mu,\sigma) denote a Cauchy distribution with location parameter μ\mu and scale parameter σ\sigma. The default prior suggested by Gelman et al. (2008) corresponds to βjindC(0,σj)\beta_{j}\stackrel{{\scriptstyle\text{ind}}}{{\sim}}C(0,\sigma_{j}), for j=1,2,,pj=1,2,\dots,p.

For a design matrix with full column rank, Albert and Anderson (1984) showed that a finite maximum likelihood estimate of 𝜷\bm{\beta} does not exist when there is separation in the data. However, even in the case of separation and/or a rank deficient design matrix, the posterior means for some or all βj\beta_{j}’s may exist because they incorporate the information from the prior distribution. Following Definition 2.2.1 of Casella and Berger (1990, pp. 55), we say E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exists if E(|βj|𝐲)<E(|\beta_{j}|\mid\mathbf{y})<\infty, and in this case, E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) is given by

E(\beta_{j}\mid\mathbf{y})=\int_{0}^{\infty}\beta_{j}~p(\beta_{j}\mid\mathbf{y})~d\beta_{j}+\int_{-\infty}^{0}\beta_{j}~p(\beta_{j}\mid\mathbf{y})~d\beta_{j}. \qquad (4)

Note that alternative definitions may require only one of the integrals in (4) to be finite for the mean to exist, e.g., Bickel and Doksum (2001, pp. 455). However, according to the definition used in this paper, both integrals in (4) have to be finite for the posterior mean to exist. Our main result shows that for each j=1,2,,pj=1,2,\ldots,p, the existence of E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) depends on whether the predictor 𝐗j\mathbf{X}_{j} is a solitary separator or not, which is defined as follows:

Definition 1.

The predictor 𝐗j\mathbf{X}_{j} is a solitary separator, if there exists an 𝛂(𝒞𝒬)\bm{\alpha}\in(\mathcal{C}\cup\mathcal{Q}) such that

\alpha_{j}\neq 0,\quad\alpha_{r}=0\text{ for all }r\neq j. \qquad (5)

This definition implies that for a solitary separator 𝐗j\mathbf{X}_{j}, if αj>0\alpha_{j}>0, then xi,j0x_{i,j}\geq 0 for all iA1i\in A_{1}, and xi,j0x_{i,j}\leq 0 for all iA0i\in A_{0}; if αj<0\alpha_{j}<0, then xi,j0x_{i,j}\leq 0 for all iA1i\in A_{1}, and xi,j0x_{i,j}\geq 0 for all iA0i\in A_{0}. Therefore, the hyperplane {𝐱p:xj=0}\{\mathbf{x}\in\mathbb{R}^{p}:x_{j}=0\} in the predictor space separates the data into two groups A1A_{1} and A0A_{0} (except for the points located on the hyperplane). The following theorem provides a necessary and sufficient condition for the existence of marginal posterior means of regression coefficients in a logistic regression model.

Theorem 1.

In a logistic regression model, suppose the regression coefficients (including the intercept) βjindC(0,σj)\beta_{j}\stackrel{{\scriptstyle ind}}{{\sim}}C(0,\sigma_{j}) with σj>0\sigma_{j}>0 for j=1,2,,pj=1,2,\dots,p, so that

p(\bm{\beta})=\prod_{j=1}^{p}p(\beta_{j})=\prod_{j=1}^{p}\frac{1}{\pi\sigma_{j}(1+\beta_{j}^{2}/\sigma_{j}^{2})}. \qquad (6)

Then for each j=1,2,,pj=1,2,\ldots,p, the posterior mean E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exists if and only if 𝐗j\mathbf{X}_{j} is not a solitary separator.

A proof of Theorem 1 is available in Appendices A and B.

Remark 1.

Theorem 1 implies that under independent Cauchy priors in logistic regression, the posterior means of all coefficients exist if there is no separation, or if there is separation with no solitary separators.

Remark 2.

Gelman et al. (2008) suggested centering all predictors (except interaction terms) in the pre-processing step. A consequence of Theorem 1 is that centering may have a crucial role in the existence of the posterior mean E(βj𝐲)E(\beta_{j}\mid\mathbf{y}).

We expand on the second remark with a toy example where a predictor is a solitary separator before centering but not after centering. Consider a dataset with n=100, \mathbf{y}=(\underbrace{0,\dots,0}_{25},\underbrace{1,\dots,1}_{75})^{T} and a binary predictor \mathbf{X}_{j}=(\underbrace{0,\dots,0}_{50},\underbrace{1,\dots,1}_{50})^{T}. Here \mathbf{X}_{j} is a solitary separator which leads to quasicomplete separation before centering. However, the centered predictor \mathbf{X}_{j}=(\underbrace{-0.5,\dots,-0.5}_{50},\underbrace{0.5,\dots,0.5}_{50})^{T} is no longer a solitary separator, because after centering the hyperplane \{\mathbf{x}:x_{j}=-0.5\} separates the data but \{\mathbf{x}:x_{j}=0\} does not. Consequently, the posterior mean E(\beta_{j}\mid\mathbf{y}) does not exist before centering but it exists after centering.
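To make this check concrete, the following minimal R sketch implements the condition implied by Definition 1 (a single column is a solitary separator exactly when its signs alone give complete or quasicomplete separation) and applies it to the toy example above:

# Test whether a single predictor column x is a solitary separator for y.
is_solitary_separator <- function(x, y) {
  a1 <- x[y == 1]; a0 <- x[y == 0]
  (all(a1 >= 0) && all(a0 <= 0)) || (all(a1 <= 0) && all(a0 >= 0))
}
y  <- c(rep(0, 25), rep(1, 75))
xj <- c(rep(0, 50), rep(1, 50))
is_solitary_separator(xj, y)              # TRUE: solitary separator before centering
is_solitary_separator(xj - mean(xj), y)   # FALSE: no longer one after centering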

2.3 Extensions of the Theoretical Result

So far we have mainly focused on the logistic regression model, which is one of the most widely used binary regression models because of the interpretability of its regression coefficients in terms of odds ratios. We now generalize Theorem 1 to binary regression models with link functions other than the logit. Following the definition in McCullagh and Nelder (1989, pp. 27), we assume that for i=1,2,,ni=1,2,\ldots,n, the linear predictor 𝐱iT𝜷\mathbf{x}_{i}^{T}\bm{\beta} and the success probability πi\pi_{i} are connected by a monotonic and differentiable link function g()g(\cdot) such that g(πi)=𝐱iT𝜷g(\pi_{i})=\mathbf{x}_{i}^{T}\bm{\beta}. We further assume that g(.)g(.) is a one-to-one function, which means that g(.)g(.) is strictly monotonic. This is satisfied by many commonly used link functions including the probit. Without loss of generality, we assume that g()g(\cdot) is a strictly increasing function.

Theorem 2.

In a binary regression model with link function g(.)g(.) described above, suppose the regression coefficients have independent Cauchy priors in (6). Then for each j=1,2,,pj=1,2,\ldots,p,

  • (1)

    a necessary condition for the existence of the posterior mean E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) is that 𝐗j\mathbf{X}_{j} is not a solitary separator;

  • (2)

    a sufficient condition for the existence of E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) consists of the following:

    • (i)

      𝐗j\mathbf{X}_{j} is not a solitary separator, and

    • (ii)

      ϵ>0\forall\epsilon>0,

      \int_{0}^{\infty}\beta_{j}p(\beta_{j})g^{-1}(-\epsilon\beta_{j})d\beta_{j}<\infty,\quad\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[1-g^{-1}(\epsilon\beta_{j})\right]d\beta_{j}<\infty. \qquad (7)

Note that (7) in the sufficient condition of Theorem 2 imposes constraints on the link function g(.)g(.), and hence the likelihood function. A proof of this theorem is given in Appendix C. Moreover, it is shown that condition (7) holds for the probit link function.
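To give a sense of why (7) holds for the probit link (the complete argument is in Appendix C), note that g^{-1}=\Phi, so both integrals in (7) equal \int_{0}^{\infty}\beta_{j}p(\beta_{j})\Phi(-\epsilon\beta_{j})d\beta_{j}. Splitting the range at \beta_{j}=1, bounding \Phi(-\epsilon\beta_{j})\leq 1 on (0,1], and using the standard normal tail bound \Phi(-t)\leq\phi(t)/t for t>0 together with p(\beta_{j})\leq 1/(\pi\sigma_{j}) gives the rough bound

\int_{0}^{\infty}\beta_{j}\,p(\beta_{j})\,\Phi(-\epsilon\beta_{j})\,d\beta_{j}\;\leq\;\int_{0}^{1}\beta_{j}\,p(\beta_{j})\,d\beta_{j}\;+\;\frac{1}{\pi\sigma_{j}\epsilon}\int_{1}^{\infty}\phi(\epsilon\beta_{j})\,d\beta_{j}\;<\;\infty.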

In certain applications, to incorporate available prior information, it may be desirable to use Cauchy priors with nonzero location parameters. The following corollary states that for both logistic and probit regression, the condition for existence of posterior means derived in Theorems 1 and 2 continues to hold under independent Cauchy priors with nonzero location parameters.

Corollary 1.

In logistic and probit regression models, suppose the regression coefficients βjindC(μj,σj)\beta_{j}\stackrel{{\scriptstyle\text{ind}}}{{\sim}}C(\mu_{j},\sigma_{j}), for j=1,2,,pj=1,2,\dots,p. Then a necessary and sufficient condition for the existence of the posterior mean E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) is that 𝐗j\mathbf{X}_{j} is not a solitary separator, for j=1,2,,pj=1,2,\dots,p.

A proof of Corollary 1 is available in Appendix D.

In some applications it could be more natural to allow the regression coefficients to be dependent, a priori. Thus in addition to independent Cauchy priors, we also study the existence of posterior means under a multivariate Cauchy prior, with the following density function:

p(\bm{\beta})=\frac{\Gamma\left(\frac{1+p}{2}\right)}{\Gamma\left(\frac{1}{2}\right)\pi^{\frac{p}{2}}|\bm{\Sigma}|^{\frac{1}{2}}\left[1+(\bm{\beta}-\bm{\mu})^{T}\bm{\Sigma}^{-1}(\bm{\beta}-\bm{\mu})\right]^{\frac{1+p}{2}}}, \qquad (8)

where \bm{\beta}\in\mathbb{R}^{p}, \bm{\mu} is a p\times 1 location parameter and \bm{\Sigma} is a p\times p positive-definite scale matrix. A special case of the multivariate Cauchy prior is the Zellner-Siow prior (Zellner and Siow 1980). It can be viewed as a scale mixture of g-priors, where conditional on g, \bm{\beta} has a multivariate normal prior with a covariance matrix proportional to g(\mathbf{X}^{T}\mathbf{X})^{-1}, and the hyperparameter g has an inverse gamma prior, \text{IG}(1/2,n/2). Based on generalizations of the g-prior to binary regression models (Fouskakis et al. 2009; Sabanés Bové and Held 2011; Hanson et al. 2014), the Zellner-Siow prior, which has density (8) with \bm{\Sigma}\propto n(\mathbf{X}^{T}\mathbf{X})^{-1}, can be a desirable objective prior as it preserves the covariance structure of the data and is free of tuning parameters.

Theorem 3.

In logistic and probit regression models, suppose the vector of regression coefficients 𝛃\bm{\beta} has a multivariate Cauchy prior as in (8). If there is no separation, then all posterior means E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exist, for j=1,2,,pj=1,2,\ldots,p. If there is complete separation, then none of the posterior means E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exist, for j=1,2,,pj=1,2,\ldots,p.

A proof of Theorem 3 is available in Appendices E and F. The study of existence of posterior means under multivariate Cauchy priors in the presence of quasicomplete separation has proved to be more challenging. We hope to study this problem in future work. Note that although under (8), the induced marginal prior of βj\beta_{j} is a univariate Cauchy distribution for each j=1,2,,pj=1,2,\ldots,p, the multivariate Cauchy prior is different from independent Cauchy priors, even with a diagonal scale matrix 𝚺=diag(σ12,σ22,,σp2)\bm{\Sigma}=\text{diag}(\sigma_{1}^{2},\sigma_{2}^{2},\ldots,\sigma_{p}^{2}). In fact, as a rotation invariant distribution, the multivariate Cauchy prior places less probability mass along axes than the independent Cauchy priors (see Figure 1). Therefore, it is not surprising that solitary separators no longer play an important role for existence of posterior means under multivariate Cauchy priors, as evident from Theorem 3.

Figure 1: Contour plots of log-density functions of independent Cauchy distributions with both scale parameters being 1 (left) and a bivariate Cauchy distribution with scale matrix \mathbf{I}_{2} (right). These plots suggest that independent Cauchy priors place more probability mass along the axes than a multivariate Cauchy prior, and thus impose stronger shrinkage. Hence, if complete separation occurs, E(\beta_{j}\mid\mathbf{y}) may exist under independent Cauchy priors for some or all j=1,2,\ldots,p (Theorem 1), but does not exist under a multivariate Cauchy prior (Theorem 3).

So far we have considered Cauchy priors, which are t distributions with 1 degree of freedom. We close this section with a remark on lighter tailed t priors (with degrees of freedom greater than 1) and normal priors, for which the prior means exist.

Remark 3.

In a binary regression model, suppose that the regression coefficients have independent Student-t priors with degrees of freedom greater than one, or independent normal priors. Then it is straightforward to show that the posterior means of the coefficients exist because the likelihood is bounded above by one and the prior means exist. The same result holds under multivariate t priors with degrees of freedom greater than one, and multivariate normal priors.
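For completeness, the bounding argument is one line: since the Bernoulli likelihood satisfies p(\mathbf{y}\mid\bm{\beta})\leq 1, and p(\mathbf{y})>0 because the likelihood is strictly positive and the prior is proper,

E(|\beta_{j}|\mid\mathbf{y})=\frac{\int|\beta_{j}|\,p(\mathbf{y}\mid\bm{\beta})\,p(\bm{\beta})\,d\bm{\beta}}{p(\mathbf{y})}\leq\frac{E(|\beta_{j}|)}{p(\mathbf{y})}<\infty,

where E(|\beta_{j}|) is finite for t priors with degrees of freedom greater than one and for normal priors.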

3 MCMC Sampling for Logistic Regression

In this section we discuss two algorithms for sampling from the posterior distribution of logistic regression coefficients under independent Student-t priors. We first develop a Gibbs sampler and then briefly describe the No-U-Turn Sampler (NUTS) implemented in the freely available software Stan (Carpenter et al. 2016).

3.1 Pólya-Gamma Data Augmentation Gibbs Sampler

Polson et al. (2013) showed that the likelihood for logistic regression can be written as a mixture of normals with respect to a Pólya-Gamma (PG) distribution. Based on this result, they developed an efficient Gibbs sampler for logistic regression with a multivariate normal prior on \bm{\beta}. Choi and Hobert (2013) showed that their Gibbs sampler is uniformly ergodic. This guarantees the existence of central limit theorems for Monte Carlo averages of functions of \bm{\beta} which are square integrable with respect to the posterior distribution p(\bm{\beta}\mid\mathbf{y}). Choi and Hobert (2013) developed a latent data model which also led to the Gibbs sampler of Polson et al. (2013). We adopt their latent data formulation to develop a Gibbs sampler for logistic regression with independent Student-t priors on \bm{\beta}.

Let U=(2/\pi^{2})\sum_{l=1}^{\infty}W_{l}/{(2l-1)}^{2}, where W_{1},W_{2},\dots is a sequence of i.i.d. Exponential random variables with rate parameters equal to 1. The density of U is given by

h(u)=\sum_{l=0}^{\infty}{(-1)}^{l}\frac{(2l+1)}{\sqrt{2\pi u^{3}}}e^{-\frac{(2l+1)^{2}}{8u}},\quad 0<u<\infty. \qquad (9)

Then for k\geq 0, the Pólya-Gamma (PG) distribution is constructed by exponential tilting of h(u) as follows:

p(u;k)=\cosh\left(\frac{k}{2}\right)e^{-\frac{k^{2}u}{2}}h(u),\quad 0<u<\infty. \qquad (10)

A random variable with density p(u;k) has a PG(1,k) distribution.
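For intuition, approximate PG(1,k) draws can be generated directly from the infinite Gamma-sum representation of the Pólya-Gamma family in Polson et al. (2013), which generalizes the series defining U above; the R sketch below truncates that sum and is meant only for illustration, since exact samplers (for instance, in the BayesLogit package) are used in practice:

# Approximate draw from PG(1, k): truncated version of the Gamma-sum
# representation (1 / (2 pi^2)) * sum_l g_l / ((l - 1/2)^2 + k^2 / (4 pi^2)),
# with g_l i.i.d. Exponential(1). Illustration only; not an exact sampler.
rpg1_approx <- function(k, trunc = 200) {
  l <- seq_len(trunc)
  g <- rexp(trunc)
  sum(g / ((l - 0.5)^2 + k^2 / (4 * pi^2))) / (2 * pi^2)
}
# With k = 0 this reduces to a truncation of the series defining U.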

Let t_{v}(0,\sigma_{j}) denote the Student-t distribution with v degrees of freedom, location parameter 0, and scale parameter \sigma_{j}. Since Student-t distributions can be expressed as inverse-gamma (IG) scale mixtures of normal distributions, for j=1,2,\ldots,p, we have:

\beta_{j}\sim t_{v}(0,\sigma_{j})\Longleftrightarrow\begin{cases}\beta_{j}\mid\gamma_{j}\sim\text{N}(0,\gamma_{j}),\\ \gamma_{j}\sim\text{IG}\left(\frac{v}{2},\frac{v\sigma_{j}^{2}}{2}\right).\end{cases}

Conditional on \bm{\beta} and \bm{\Gamma}=\text{diag}(\gamma_{1},\gamma_{2},\ldots,\gamma_{p}), let (y_{1},z_{1}),(y_{2},z_{2}),\ldots,(y_{n},z_{n}) be n independent random vectors such that y_{i} has a Bernoulli distribution with success probability \exp(\mathbf{x}_{i}^{T}\bm{\beta})/(1+\exp(\mathbf{x}_{i}^{T}\bm{\beta})), z_{i}\sim\text{PG}(1,|\mathbf{x}_{i}^{T}\bm{\beta}|), and y_{i} and z_{i} are independent, for i=1,2,\ldots,n. Let \bm{Z}_{D}=\text{diag}(z_{1},z_{2},\ldots,z_{n}); then the augmented posterior density is p(\bm{\beta},\bm{\Gamma},\bm{Z}_{D}\mid\mathbf{y}). We develop a Gibbs sampler with target distribution p(\bm{\beta},\bm{\Gamma},\bm{Z}_{D}\mid\mathbf{y}), which cycles through the following sequence of distributions iteratively:

  1.

    \bm{\beta}\mid\bm{\Gamma},\bm{Z}_{D},\mathbf{y}\sim\text{N}\left((\mathbf{X}^{T}\bm{Z}_{D}\mathbf{X}+\bm{\Gamma}^{-1})^{-1}\mathbf{X}^{T}\mathbf{\tilde{y}},\ (\mathbf{X}^{T}\bm{Z}_{D}\mathbf{X}+\bm{\Gamma}^{-1})^{-1}\right), where \tilde{y}_{i}=y_{i}-1/2 and \mathbf{\tilde{y}}=(\tilde{y}_{1},\tilde{y}_{2},\ldots,\tilde{y}_{n})^{T},

  2.

    \gamma_{j}\mid\bm{\beta},\bm{Z}_{D},\mathbf{y}\stackrel{\text{ind}}{\sim}\text{IG}\left(\frac{v+1}{2},\frac{\beta_{j}^{2}+v\sigma_{j}^{2}}{2}\right), for j=1,2,\ldots,p,

  3.

    z_{i}\mid\bm{\Gamma},\bm{\beta},\mathbf{y}\stackrel{\text{ind}}{\sim}\text{PG}(1,|\mathbf{x}_{i}^{T}\bm{\beta}|), for i=1,2,\ldots,n.

Steps 1 and 3 follow immediately from Choi and Hobert (2013) and Polson et al. (2013), and step 2 follows from straightforward algebra. In the next section, for comparison of posterior distributions under Student-t priors with different degrees of freedom, we implement the above Gibbs sampler, and for normal priors we apply the Gibbs sampler of Polson et al. (2013). Both Gibbs samplers can be implemented using the R package tglm, available in the supplement.
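For concreteness, a minimal R sketch of the three-step sampler above is given below. It is a simplified illustration rather than the tglm implementation, and it assumes that the rpg function of the BayesLogit package (with interface rpg(num, h, z)) is available for exact PG(1, k) draws.

# Gibbs sampler for logistic regression with independent t_v(0, sigma_j) priors,
# following steps 1-3 above; v = 1 gives the Cauchy priors of Gelman et al. (2008).
library(BayesLogit)   # assumed: provides rpg() for Polya-Gamma draws
library(MASS)         # for mvrnorm

gibbs_t_logistic <- function(y, X, v = 1,
                             sigma = c(10, rep(2.5, ncol(X) - 1)),
                             n_iter = 5000) {
  n <- nrow(X); p <- ncol(X)
  beta  <- rep(0, p)
  gamma <- sigma^2                       # latent scales in the t prior mixture
  draws <- matrix(NA_real_, n_iter, p)
  for (s in seq_len(n_iter)) {
    # Step 3: z_i | beta ~ PG(1, |x_i' beta|)
    z <- rpg(n, 1, abs(drop(X %*% beta)))
    # Step 1: beta | Gamma, Z_D, y ~ N(m, V), with V = (X' Z_D X + Gamma^{-1})^{-1}
    V <- chol2inv(chol(crossprod(X, z * X) + diag(1 / gamma, nrow = p)))
    m <- drop(V %*% crossprod(X, y - 0.5))
    beta <- drop(mvrnorm(1, m, V))
    # Step 2: gamma_j | beta ~ IG((v + 1)/2, (beta_j^2 + v sigma_j^2)/2)
    gamma <- 1 / rgamma(p, shape = (v + 1) / 2, rate = (beta^2 + v * sigma^2) / 2)
    draws[s, ] <- beta
  }
  draws
}

The default scales above (10 for the intercept and 2.5 for the other coefficients) match the prior specification used in Sections 4 and 5.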

3.2 Stan

Our empirical results in the next section suggest that the Gibbs sampler exhibits extremely slow mixing for posterior simulation under Cauchy priors for data with separation. Thus we consider alternative MCMC sampling algorithms in the hope of improving mixing. A random walk Metropolis algorithm shows some improvement over the Gibbs sampler in the p=2 case. However, it is not efficient for exploring higher dimensional spaces. This motivated us to use the software Stan (Carpenter et al. 2016), which implements the No-U-Turn Sampler (NUTS) of Hoffman and Gelman (2014), a tuning free extension of the Hamiltonian Monte Carlo (HMC) algorithm (Neal 2011).

It has been demonstrated that for continuous parameter spaces, HMC can improve over poorly mixing Gibbs samplers and random walk Metropolis algorithms. HMC is a Metropolis algorithm that generates proposals based on Hamiltonian dynamics, a concept borrowed from physics. In HMC, the parameter of interest is referred to as the “position” variable, representing a particle’s position in a p-dimensional space. A p-dimensional auxiliary parameter, the “momentum” variable, is introduced to represent the particle’s momentum. In each iteration, the momentum variable is generated from a Gaussian distribution, and then a proposal of the position-momentum pair is generated (approximately) along the trajectory of the Hamiltonian dynamics defined by the joint distribution of the position and momentum. Hamiltonian dynamics changing over time can be approximated by discretizing time via the “leapfrog” method. In practice, a proposal is generated by applying the leapfrog algorithm L times, with stepsize \epsilon, to the current state. The proposed state is accepted or rejected according to a Metropolis acceptance probability. Section 5.3.3 of the review paper by Neal (2011) illustrates the practical benefits of HMC over random walk Metropolis algorithms. The examples in that section demonstrate that the momentum variable may change only slowly along certain directions during leapfrog steps, permitting the position variable to move consistently in this direction for many steps. In this way, proposed states using Hamiltonian dynamics can be far away from current states but still achieve high acceptance probabilities, making HMC more efficient than traditional algorithms such as random walk Metropolis.

In spite of its advantages, HMC has not been very widely used in the statistics community until recently, because its performance can be sensitive to the choice of two tuning parameters: the leapfrog stepsize \epsilon and the number of leapfrog steps L. Very small \epsilon can lead to waste of computational power, whereas large \epsilon can yield large errors due to discretization. Regarding the number of leapfrog steps L, if it is too small, proposed states can be near current states and thus resemble a random walk. On the other hand, if L is too large, the Hamiltonian trajectory can retrace its path so that the proposal is brought closer to the current value, which again is a waste of computational power.
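To make the roles of \epsilon and L concrete, a single leapfrog update (with an identity mass matrix) can be sketched in R as follows; grad_log_post is a placeholder for a user-supplied gradient of the log posterior:

# One leapfrog step of size eps; an HMC proposal applies this L times and
# then accepts or rejects with a Metropolis correction on the joint density.
leapfrog_step <- function(beta, momentum, eps, grad_log_post) {
  momentum <- momentum + 0.5 * eps * grad_log_post(beta)   # half step for momentum
  beta     <- beta + eps * momentum                        # full step for position
  momentum <- momentum + 0.5 * eps * grad_log_post(beta)   # half step for momentum
  list(beta = beta, momentum = momentum)
}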

The NUTS algorithm tunes these two parameters automatically. To select L, the main idea is to run the leapfrog steps until the trajectory starts to retrace its path. More specifically, NUTS builds a binary tree based on a recursive doubling procedure that is similar in flavor to the doubling procedure used for slice sampling by Neal (2003), with nodes of the tree representing position-momentum pairs visited by the leapfrog steps along the path. The doubling procedure is stopped if the trajectory starts retracing its path, that is, making a “U-turn”, or if there is a large simulation error accumulated due to many steps of leapfrog discretization. NUTS consists of a carefully constructed transition kernel that leaves the target joint distribution invariant. It also proposes a way for adaptive tuning of the stepsize \epsilon.

We find that by implementing this tuning free NUTS algorithm, available in the freely available software Stan, substantially better mixing than the Gibbs sampler can be achieved in all of our examples in which posterior means exist. We still include the Gibbs sampler in this article for two main reasons. First, it illustrates that Stan can provide an incredible improvement in mixing over the Gibbs sampler in certain cases. Stan requires minimal coding effort, much less than developing a Gibbs sampler, which may be useful information for readers who are not yet familiar with Stan. Second, Stan currently works for continuous target distributions only, but discrete distributions for models and mixed distributions for regression coefficients frequently arise in Bayesian variable selection, for regression models with binary or categorical response variables (Holmes and Held 2006; Mitra and Dunson 2010; Ghosh and Clyde 2011; Ghosh et al. 2011; Ghosh and Reiter 2013; Li and Clyde 2015). Unlike HMC algorithms, Gibbs samplers can typically be extended via data augmentation to incorporate mixtures of a point mass and a continuous distribution, as priors for the regression coefficients, without much additional effort.

4 Simulated Data

In this section, we use two simulation examples to empirically demonstrate that under independent Cauchy priors, the aforementioned MCMC algorithm for logistic regression may suffer from extremely slow mixing in the presence of separation in the dataset.

For each simulation scenario, we first standardize the predictors following the recommendation of Gelman et al. (2008). Binary predictors (with 0/1 denoting the two categories) are centered to have mean 0, and other predictors are centered and scaled to have mean 0 and standard deviation 0.5. Their rationale is that such standardizing makes the scale of a continuous predictor comparable to that of a symmetric binary predictor, in the sense that they have the same sample mean and sample standard deviation. Gelman et al. (2008) made a distinction between input variables and predictors, and they suggested standardizing the input variables only. For example, temperature and humidity may be input variables as well as predictors in a model; however, their interaction term is a predictor but not an input variable. In our examples, except for the constant term for the intercept, all other predictors are input variables and standardized appropriately.
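A small R sketch of this standardization step is given below; it treats every non-constant column as an input variable, which matches the examples in this section but would need adjustment when interaction terms are present:

# Standardization of Gelman et al. (2008): binary inputs are centered,
# other inputs are centered and scaled to have standard deviation 0.5;
# the constant column for the intercept is left unchanged.
standardize_inputs <- function(X) {
  for (j in seq_len(ncol(X))) {
    x <- X[, j]
    if (all(x == 1)) next                        # intercept column
    if (all(x %in% c(0, 1))) {
      X[, j] <- x - mean(x)                      # binary: center only
    } else {
      X[, j] <- 0.5 * (x - mean(x)) / sd(x)      # continuous: mean 0, sd 0.5
    }
  }
  X
}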

We compare the posterior distributions under independent i) Cauchy, i.e., Student-t with 1 degree of freedom, ii) Student-t with 7 degrees of freedom (t_7), and iii) normal priors for the regression coefficients. In binary regression models, while the inverse cumulative distribution function (CDF) of the logistic distribution yields the logit link function, the inverse CDF of the Student-t distribution yields the robit link function. Liu (2004) showed that the logistic link can be well approximated by a robit link with 7 degrees of freedom. So a t_7 prior approximately matches the tail heaviness of the logistic likelihood underlying logistic regression. For Cauchy priors we use the default choice recommended by Gelman et al. (2008): all location parameters are set to 0 and scale parameters are set to 10 and 2.5 for the intercept and other coefficients, respectively. To be consistent we use the same location and scale parameters for the other two priors. Gelman et al. (2008) adopted a similar strategy in one of their analyses, to study the effect of tail heaviness of the priors. Among the priors considered here, the normal prior has the lightest tails, the Cauchy prior the heaviest, and the t_7 prior offers a compromise between the two extremes. For each simulated dataset, we run both the Gibbs sampler developed in Section 3.1 and Stan, for 1,000,000 iterations after a burn-in of 100,000 samples, under each of the three priors.

4.1 Complete Separation with a Solitary Separator

First, we generate a dataset with p=2 (including the intercept) and n=30. The continuous predictor \mathbf{X}_{2} is chosen to be a solitary separator (after standardizing), which leads to complete separation, whereas the constant term \mathbf{X}_{1} contains all ones and is not a solitary separator. A plot of \mathbf{y} versus \mathbf{X}_{2} in Figure 2 demonstrates this graphically. So by Theorem 1, under independent Cauchy priors, E(\beta_{1}\mid\mathbf{y}) exists but E(\beta_{2}\mid\mathbf{y}) does not.

Figure 2: Scatter plot of \mathbf{y} versus \mathbf{X}_{2} in the first simulated dataset, where \mathbf{X}_{2} is a solitary separator which completely separates the samples (the vertical line at zero separates the points corresponding to y=1 and y=0).

The results from the Gibbs sampler are reported in Figures 3 and 4. Figure 3 shows the posterior samples of 𝜷\bm{\beta} under the different priors. The scale of β2\beta_{2}, the coefficient corresponding to the solitary separator 𝐗2\mathbf{X}_{2}, is extremely large under Cauchy priors, less so under t7t_{7} priors, and the smallest under normal priors. In particular, under Cauchy priors, the posterior distribution of β2\beta_{2} seems to have an extremely long right tail. Moreover, although 𝐗1\mathbf{X}_{1} is not a solitary separator, under Cauchy priors, the posterior samples of β1\beta_{1} have a much larger spread. Figure 4 shows that the running means of both β1\beta_{1} and β2\beta_{2} converge rapidly under normal and t7t_{7} priors, whereas under Cauchy priors, the running mean of β1\beta_{1} does not converge after a million iterations and that of β2\beta_{2} clearly diverges. We also ran Stan for this example but do not report the results here, because it gave warning messages about divergent transitions for Cauchy priors, after the burn-in period. Given that the posterior mean of β2\beta_{2} does not exist in this case, the lack of convergence is not surprising.

Figure 3: Scatter plots (top) and box plots (bottom) of posterior samples of \beta_{1} and \beta_{2} from the Gibbs sampler, under independent Cauchy, t_7, and normal priors for the first simulated dataset.
Figure 4: Plots of running means of \beta_{1} (top) and \beta_{2} (bottom) sampled from the posterior distributions via the Gibbs sampler, under independent Cauchy, t_7, and normal priors for the first simulated dataset. Here E(\beta_{1}\mid\mathbf{y}) exists under independent Cauchy priors but E(\beta_{2}\mid\mathbf{y}) does not.

4.2 Complete Separation Without Solitary Separators

Now we generate a new dataset with p=2 and n=30 such that there is complete separation but there are no solitary separators (see Figure 5). This guarantees the existence of both E(\beta_{1}\mid\mathbf{y}) and E(\beta_{2}\mid\mathbf{y}) under independent Cauchy priors. The difference in the existence of E(\beta_{2}\mid\mathbf{y}) for the two simulated datasets is reflected by the posterior samples from the Gibbs sampler: under Cauchy priors, the samples of \beta_{2} in Figure 1 in the Appendix are more stable than those in Figure 3 in the manuscript. However, when comparing across prior distributions, we find that under Cauchy priors the posterior samples of neither \beta_{1} nor \beta_{2} are as stable as those under t_7 and normal priors, which is not surprising because among the three priors, Cauchy priors have the heaviest tails and thus yield the least shrinkage. Figure 2 in the Appendix shows that the convergence of the running means under Cauchy priors is slow. Although we have not verified the existence of the second or higher order posterior moments under Cauchy priors, for exploratory purposes we examine sample autocorrelation plots of the draws from the Gibbs sampler. Figure 6 shows that the autocorrelation decays extremely slowly for Cauchy priors, reasonably fast for t_7 priors, and rapidly for normal priors.

Some results from Stan are reported in Figures 3 and 4 in the Appendix. Figure 3 in the Appendix shows posterior distributions with nearly identical shapes as those obtained using Gibbs sampling in Figure 1 in the Appendix, with the only difference being that more extreme values appear under Stan. This is most likely due to faster mixing in Stan. As Stan traverses the parameter space more rapidly, values in the tails appear more quickly than under the Gibbs sampler. Figures 2 and 4 in the Appendix demonstrate that running means based on Stan are in good agreement with those based on the Gibbs sampler.

The autocorrelation plots for Stan in Figure 7 demonstrate a remarkable improvement over those for Gibbs in Figure 6 for all priors, and the difference in mixing is the most prominent for Cauchy priors.

To summarize, all the plots unequivocally suggest that Cauchy priors lead to an extremely slow mixing Gibbs sampler and unusually large scales for the regression coefficients, even when all the marginal posterior means are guaranteed to exist. While mixing can be improved tremendously with Stan, the heavy tailed posteriors under Stan are in agreement with those obtained from the Gibbs samplers. One may argue that in spite of the unnaturally large regression coefficients, Cauchy priors could lead to superior predictions. Thus in the next two sections we compare predictions based on posteriors under the three priors for two real datasets. As Stan generates nearly independent samples, we use Stan for MCMC simulations for the real datasets.

Figure 5: Scatter plot of \mathbf{y} versus \mathbf{X}_{2} for the second simulated dataset. The solid vertical line at -0.3 demonstrates complete separation of the samples. However, \mathbf{X}_{2} is not a solitary separator, because the dashed vertical line at zero does not separate the points corresponding to y=1 and y=0. The other predictor \mathbf{X}_{1} is a vector of ones corresponding to the intercept, which is not a solitary separator, either.
Figure 6: Autocorrelation plots of the posterior samples of \beta_{1} (top) and \beta_{2} (bottom) from the Gibbs sampler, under independent Cauchy, t_7, and normal priors for the second simulated dataset.
Figure 7: Autocorrelation plots of the posterior samples of \beta_{1} (top) and \beta_{2} (bottom) from Stan, under independent Cauchy, t_7, and normal priors for the second simulated dataset.

5 Real Data

5.1 SPECT Dataset

The “SPECT” dataset (Kurgan et al. 2001) is available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/SPECT+Heart). The binary response variable is whether a patient’s cardiac image is normal or abnormal, according to the diagnosis of cardiologists. The predictors are 22 binary features obtained from the cardiac images using a machine learning algorithm. The goal of the study is to determine if the predictors can correctly predict the diagnoses of cardiologists, so that the process could be automated to some extent.

Prior to centering, two of the binary predictors are solitary quasicomplete separators: xi,j=0x_{i,j}=0 iA0\forall i\in A_{0} and xi,j0x_{i,j}\geq 0 iA1\forall i\in A_{1}, for j=18,19j=18,19, with 𝐗1\mathbf{X}_{1} denoting the column of ones. Ghosh and Reiter (2013) analyzed this dataset with a Bayesian probit regression model which incorporated variable selection. As some of their proposed methods relied on an approximation of the marginal likelihood based on the MLE of 𝜷\bm{\beta}, they had to drop these potentially important predictors from the analysis. If one analyzed the dataset with the uncentered predictors, by Theorem 1, the posterior means E(β18𝐲)E(\beta_{18}\mid\mathbf{y}) and E(β19𝐲)E(\beta_{19}\mid\mathbf{y}) would not exist under independent Cauchy priors. However, after centering there are no solitary separators, so the posterior means of all coefficients exist.

The SPECT dataset is split into a training set of 80 observations and a test set of 187 observations by Kurgan et al. (2001). We use the former for model fitting and the latter for prediction. First, for each of the three priors (Cauchy, t7t_{7}, and normal), we run Stan on the training dataset, for 1,000,000 iterations after discarding 100,000 samples as burn-in.

As in the simulation study, MCMC draws from Stan show excellent mixing for all priors. However, the posterior means of the regression coefficients involved in separation are rather large under Cauchy priors compared to the other priors. For example, the posterior means of (\beta_{18},\beta_{19}) under Cauchy, t_7, and normal priors are (10.02, 5.57), (3.24, 1.68), and (2.73, 1.43), respectively. These results suggest that Cauchy priors are too diffuse for datasets with separation.

Next, for each i=1,2,\dots,n_{\text{test}} in the test set, we estimate the corresponding success probability \pi_{i} by the Monte Carlo average:

\widehat{\pi}_{i}^{\text{MC}}=\frac{1}{S}\sum_{s=1}^{S}\frac{e^{\mathbf{x}_{i}^{T}\bm{\beta}^{(s)}}}{1+e^{\mathbf{x}_{i}^{T}\bm{\beta}^{(s)}}}, \qquad (11)

where \bm{\beta}^{(s)} is the sampled value of \bm{\beta} in iteration s, after burn-in. Recall that here n_{\text{test}}=187 and S=10^{6}. We calculate two different types of summary measures to assess predictive performance. We classify the ith observation in the test set as a success if \widehat{\pi}_{i}^{\text{MC}}\geq 0.5, and as a failure otherwise, and compute the misclassification rates. Note that the misclassification rate does not fully take into account the magnitude of \widehat{\pi}_{i}^{\text{MC}}. For example, if y_{i}=1, both \widehat{\pi}_{i}^{\text{MC}}=0.5 and \widehat{\pi}_{i}^{\text{MC}}=0.9 would correctly classify the observation, while the latter is preferable. So we also consider the average squared difference between y_{i} and \widehat{\pi}_{i}^{\text{MC}}:

MSE^{\text{MC}}=\frac{1}{n_{\text{test}}}\sum_{i=1}^{n_{\text{test}}}{(\widehat{\pi}_{i}^{\text{MC}}-y_{i})}^{2}, \qquad (12)

which is always between 0 and 1, with a value closer to 0 being more preferable. Note that the Brier score (Brier 1950) equals 2\,MSE^{\text{MC}}, according to its original definition. Since in some modified definitions (Blattenberger and Lad 1985) it is the same as MSE^{\text{MC}}, we refer to MSE^{\text{MC}} as the Brier score.
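As a concrete sketch, the quantities in (11) and (12) can be computed from the MCMC output in R as follows, assuming draws is an S x p matrix of post-burn-in samples of \bm{\beta} and X_test, y_test hold the test data (hypothetical object names):

# Monte Carlo estimate of the success probabilities (11), then the
# misclassification rate and the Brier score MSE^MC in (12).
pi_hat_mc <- function(X_test, draws) {
  eta <- X_test %*% t(draws)        # n_test x S matrix of linear predictors
  rowMeans(plogis(eta))             # average the logistic probabilities over draws
}
p_mc     <- pi_hat_mc(X_test, draws)
misclass <- mean((p_mc >= 0.5) != y_test)   # classify as success if p_mc >= 0.5
brier    <- mean((p_mc - y_test)^2)         # MSE^MC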

Table 1: Misclassification rates based on \widehat{\pi}_{i}^{\text{MC}} and \widehat{\pi}_{i}^{\text{EM}}, under Cauchy, t_7, and normal priors for the SPECT data. Small values are preferable.
Cauchy t_7 normal
MCMC 0.273 0.257 0.251
EM 0.278 0.262 0.262
Table 2: Brier scores MSE^{\text{MC}} and MSE^{\text{EM}}, under Cauchy, t_7, and normal priors for the SPECT data. Small values are preferable.
Cauchy t_7 normal
MCMC 0.172 0.165 0.163
EM 0.179 0.178 0.178

To compare the Monte Carlo estimates with those based on the EM algorithm of Gelman et al. (2008), we also estimate the posterior mode, denoted by \widetilde{\bm{\beta}}, under identical priors and hyperparameters, using the R package arm (Gelman et al. 2015). The EM estimator of \pi_{i} is given by:

\widehat{\pi}_{i}^{\text{EM}}=\frac{e^{\mathbf{x}_{i}^{T}\widetilde{\bm{\beta}}}}{1+e^{\mathbf{x}_{i}^{T}\widetilde{\bm{\beta}}}}, \qquad (13)

and MSE^{\text{EM}} is calculated by replacing \widehat{\pi}_{i}^{\text{MC}} with \widehat{\pi}_{i}^{\text{EM}} in (12).
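For reference, a minimal sketch of this EM-based fit with the arm package is shown below; train and test are assumed data frames containing the response y and the standardized predictors, and the prior arguments shown (which we believe correspond to Cauchy priors with scale 2.5 on the slopes and 10 on the intercept) should be checked against the package documentation:

# Posterior mode via the approximate EM algorithm of Gelman et al. (2008),
# as implemented in arm::bayesglm, followed by the plug-in estimates in (13).
library(arm)
fit_em <- bayesglm(y ~ ., data = train, family = binomial(link = "logit"),
                   prior.df = 1, prior.scale = 2.5,
                   prior.df.for.intercept = 1, prior.scale.for.intercept = 10)
pi_em  <- predict(fit_em, newdata = test, type = "response")   # (13)
mse_em <- mean((pi_em - test$y)^2)                              # MSE^EM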

We report the misclassification rates in Table 1 and the Brier scores in Table 2. MCMC achieves somewhat smaller misclassification rates and Brier scores than EM, especially under t_7 and normal priors. This suggests that a full Bayesian analysis using MCMC may produce estimates that are closer to the truth than modal estimates based on the EM algorithm. The predictions are similar across the three prior distributions, with the normal and t_7 priors yielding slightly more accurate results than Cauchy priors.

5.2 Pima Indians Diabetes Dataset

We now analyze the “Pima Indians Diabetes” dataset in the R package MASS. This is a classic dataset without separation that has been analyzed by many authors in the past. Using this dataset we aim to compare predictions under different priors when there is no separation. Using the training data provided in the package we predict the class labels of the test data. In this case the difference between the priors is practically nil. The Brier scores are the same up to three decimal places, across all priors and all methods (EM and MCMC). The misclassification rates reported in Table 3 also show negligible differences between priors and methods. Here Cauchy priors have a slightly better misclassification rate compared to normal and t_7 priors, and MCMC provides slightly more accurate results compared to those obtained from EM. These results suggest that when there is no separation and maximum likelihood estimates exist, Cauchy priors may be preferable as default weakly informative priors in the absence of real prior information.

Table 3: Misclassification rates based on \widehat{\pi}_{i}^{\text{MC}} and \widehat{\pi}_{i}^{\text{EM}}, under Cauchy, t_7, and normal priors for the Pima Indians data. Small values are preferable.
Cauchy t_7 normal
MCMC 0.196 0.199 0.199
EM 0.202 0.202 0.202

6 Discussion

We have proved that posterior means of regression coefficients in logistic regression are not always guaranteed to exist under the independent Cauchy priors recommended by Gelman et al. (2008), if there is complete or quasicomplete separation in the data. In particular, we have introduced the notion of a solitary separator, which is a predictor capable of separating the samples on its own. Note that a solitary separator needs to be able to separate without the aid of any other predictor, not even the constant term corresponding to the intercept. We have proved that for independent Cauchy priors, the absence of solitary separators is a necessary condition for the existence of posterior means of all coefficients, for a general family of link functions in binary regression models. For logistic and probit regression, this has been shown to be a sufficient condition as well. In general, the sufficient condition depends on the form of the link function. We have also studied multivariate Cauchy priors, where the solitary separator no longer plays an important role. Instead, posterior means of all coefficients exist if there is no separation, while none of them exist if there is complete separation. The result under quasicomplete separation is still unclear and will be studied in future work.

In practice, after centering the input variables it is straightforward to check if there are solitary separators in the dataset. The absence of solitary separators guarantees the existence of posterior means of all regression coefficients in logistic regression under independent Cauchy priors. However, our empirical results have shown that even when the posterior means for Cauchy priors exist under separation, the posterior samples of the regression coefficients may be extremely large in magnitude. Separation is usually considered to be a sample phenomenon, so even if the predictors involved in separation are potentially important, some shrinkage of their coefficients is desirable through the prior. Our empirical results based on real datasets have demonstrated that the default Cauchy priors can lead to posterior means as large as 10, which is considered to be unusually large on the logit scale. Our impression is that Cauchy priors are good default choices in general because they contain weak prior information and let the data speak. However, under separation, when there is little information in the data about the logistic regression coefficients (the MLE is not finite), it seems that lighter tailed priors, such as Student-t priors with larger degrees of freedom or even normal priors, are more desirable in terms of producing more plausible posterior distributions.

From a computational perspective, we have observed very slow convergence of the Gibbs sampler under Cauchy priors in the presence of separation. Note that if the design matrix is not of full column rank, for example when p>np>n, the pp columns of 𝐗\mathbf{X} will be linearly dependent. This implies that the condition for quasicomplete separation (3) will be satisfied with equality for all observations. Empirical results (not reported here for brevity) demonstrated that independent Cauchy priors again lead to slower convergence of the Gibbs sampler in this case, compared to other lighter tailed priors. Out-of-sample predictive performance based on a real dataset with separation did not show the default Cauchy priors to be superior to t7t_{7} or normal priors.

In logistic regression, under a multivariate normal prior for 𝜷\bm{\beta}, Choi and Hobert (2013) showed that the Pólya-Gamma data augmentation Gibbs sampler of Polson et al. (2013) is uniformly ergodic, and the moment generating function of the posterior distribution p(𝜷𝐲)p(\bm{\beta}\mid\mathbf{y}) exists for all 𝐗,𝐲\mathbf{X},\mathbf{y}. In our examples of datasets with separation, the normal priors led to the fastest convergence of the Gibbs sampler, reasonable scales for the posterior draws of 𝜷\bm{\beta}, and comparable or even better predictive performance than other priors. The results from Stan show no problem in mixing under any of the priors. However, the problematic issue of posteriors with extremely heavy tails under Cauchy priors cannot be resolved without altering the prior. Thus, after taking into account all the above considerations, for a full Bayesian analysis we recommend the use of normal priors as a default, when there is separation. Alternatively, heavier tailed priors such as the t7t_{7} could also be used if robustness is a concern. On the other hand, if the goal of the analysis is to obtain point estimates rather than the entire posterior distribution, the posterior mode obtained from the EM algorithm of Gelman et al. (2015) under default Cauchy priors (Gelman et al. 2008) is a fast viable alternative.

Supplementary Material

In the supplementary material, we present additional simulation results for logistic and probit regression with complete separation, along with an appendix that contains the proofs of all theoretical results. The Gibbs sampler developed in the paper can be implemented with the R package tglm, available from the website: https://cran.r-project.org/web/packages/tglm/index.html.

Acknowledgement

The authors thank the Editor-in-Chief, Editor, Associate Editor and the reviewer for suggestions that led to a greatly improved paper. The authors also thank Drs. David Banks, James Berger, William Bridges, Merlise Clyde, Jon Forster, Jayanta Ghosh, Aixin Tan, and Shouqiang Wang for helpful discussions. The research of Joyee Ghosh was partially supported by the NSA Young Investigator grant H98230-14-1-0126.

Supplementary Material: On the Use of Cauchy Prior Distributions for Bayesian Logistic Regression

In this supplement, we first present an appendix with additional simulation results for logit and probit link functions, and then include an appendix that contains proofs of the theoretical results.

1 Appendix: Simulation Results for Complete Separation Without Solitary Separators

In this section we present some supporting figures for the analysis of the simulated dataset described in Section 4.2 of the manuscript under logit and probit links.

1.1 Logistic Regression for Complete Separation Without Solitary Separators

Figures 1 and 2 are based on the posterior samples from a Gibbs sampler under a logit link, whereas Figures 3 and 4 are corresponding results from Stan. A detailed discussion of the results is provided in Section 4.2 of the manuscript.

Figure 1: Scatter plots (top) and box plots (bottom) of posterior samples of β1\beta_{1} and β2\beta_{2} for a logistic regression model, from the Gibbs sampler, under independent Cauchy, t7t_{7}, and normal priors for the second simulated dataset.
Figure 2: Plots of running means of β1\beta_{1} (top) and β2\beta_{2} (bottom) sampled from the posterior distributions for a logistic regression model, via the Gibbs sampler, under independent Cauchy, t7t_{7}, and normal priors for the second simulated dataset. Here both E(β1𝐲)E(\beta_{1}\mid\mathbf{y}) and E(β2𝐲)E(\beta_{2}\mid\mathbf{y}) exist under independent Cauchy priors.
Figure 3: Scatter plots (top) and box plots (bottom) of posterior samples of β1\beta_{1} and β2\beta_{2} for a logistic regression model, from Stan, under independent Cauchy, t7t_{7}, and normal priors for the second simulated dataset.
Figure 4: Plots of running means of β1\beta_{1} (top) and β2\beta_{2} (bottom) sampled from the posterior distributions for a logistic regression model, via Stan, under independent Cauchy, t7t_{7}, and normal priors for the second simulated dataset. Here both E(β1𝐲)E(\beta_{1}\mid\mathbf{y}) and E(β2𝐲)E(\beta_{2}\mid\mathbf{y}) exist under independent Cauchy priors.

1.2 Probit Regression for Complete Separation Without Solitary Separators

In this section we analyze the simulated dataset described in Section 4.2 of the manuscript under a probit link, while keeping everything else the same. We have shown in Theorem 2 that the theoretical results hold for a probit link. The goal of this analysis is to demonstrate that the empirical results are similar under the logit and probit link functions. For this dataset, Theorem 2 guarantees the existence of both E(β1𝐲)E(\beta_{1}\mid\mathbf{y}) and E(β2𝐲)E(\beta_{2}\mid\mathbf{y}) under independent Cauchy priors and a probit link function. As in the case of logistic regression, the heavy tails of the Cauchy priors translate into extremely heavy right tails in the posterior distributions of β1\beta_{1} and β2\beta_{2}, compared to the lighter tailed priors (see Figures 5 and 6 here). Thus in the case of separation, normal priors seem to be reasonable for probit regression as well.

Figure 5: Scatter plots (top) and box plots (bottom) of posterior samples of β1\beta_{1} and β2\beta_{2}, for a probit regression model, under Cauchy, t7t_{7}, and normal priors for the second simulated dataset. Posterior samples were generated via Stan.
Figure 6: Plots of running means of β1\beta_{1} (top) and β2\beta_{2} (bottom) sampled from the posterior distributions for a probit regression model, under Cauchy, t7t_{7}, and normal priors for the second simulated dataset. Here both E(β1𝐲)E(\beta_{1}\mid\mathbf{y}) and E(β2𝐲)E(\beta_{2}\mid\mathbf{y}) exist under Cauchy priors. Posterior samples were generated via Stan.

2 Appendix: Proofs

First, we decompose the proof of Theorem 1 into two parts: in Appendix A we show that a necessary condition for the existence of E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) is that 𝐗j\mathbf{X}_{j} is not a solitary separator; and in Appendix B we show that it is also a sufficient condition. Then, we prove Theorem 2 in Appendix C, and Corollary 1 in Appendix D. Finally, we decompose the proof of Theorem 3 into two parts: in Appendix E we show that all posterior means exist if there is no separation; then in Appendix F we show that none of the posterior means exist if there is complete separation.

A Proof of the Necessary Condition for Theorem 1

Here we show that if 𝐗j\mathbf{X}_{j} is a solitary separator, then E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) does not exist, which is equivalent to the necessary condition for Theorem 1.

Proof.

For notational simplicity, we define the functional form of the success and failure probabilities in logistic regression as

f1(t)=et/(1+et),f0(t)=1f1(t)=1/(1+et),f_{1}(t)=e^{t}/(1+e^{t}),\quad f_{0}(t)=1-f_{1}(t)=1/(1+e^{t}), (A.1)

which are strictly increasing and decreasing functions of tt, respectively. In addition, both functions are bounded: 0<f1(t),f0(t)<10<f_{1}(t),f_{0}(t)<1 for tt\in\mathbb{R}. Let 𝜷(j)\bm{\beta}_{(-j)} and 𝐱i,(j)\mathbf{x}_{i,(-j)} denote the vectors 𝜷\bm{\beta} and 𝐱i\mathbf{x}_{i} after excluding their jjth entries βj\beta_{j} and xi,jx_{i,j}, respectively. Then the likelihood function can be written as

p(𝐲𝜷)=i=1np(yi𝜷)=iA1f1(xi,jβj+𝐱i,(j)T𝜷(j))kA0f0(xk,jβj+𝐱k,(j)T𝜷(j)).p(\mathbf{y}\mid\bm{\beta})=\prod_{i=1}^{n}p(y_{i}\mid\bm{\beta})=\prod_{i\in A_{1}}f_{1}\left(x_{i,j}\beta_{j}+\mathbf{x}_{i,(-j)}^{T}\bm{\beta}_{(-j)}\right)\cdot\prod_{k\in A_{0}}f_{0}\left(x_{k,j}\beta_{j}+\mathbf{x}_{k,(-j)}^{T}\bm{\beta}_{(-j)}\right). (A.2)

The posterior mean of βj\beta_{j} exists provided E(|βj|𝐲)<E(|\beta_{j}|\mid\mathbf{y})<\infty. When the posterior mean exists it is given by (4). Clearly if one of the two integrals in (4) is not finite, then E(|βj|𝐲)=E(|\beta_{j}|\mid\mathbf{y})=\infty. In this proof we will show that if αj>0\alpha_{j}>0, the first integral in (4) equals \infty. Similarly, it can be shown that if αj<0\alpha_{j}<0, the second integral in (4) equals -\infty.

If αj>0\alpha_{j}>0, by (2)-(5), xi,j0x_{i,j}\geq 0 for all iA1i\in A_{1}, and xk,j0x_{k,j}\leq 0 for all kA0k\in A_{0}. When βj>0\beta_{j}>0, by the monotonic property of f1(t)f_{1}(t) and f0(t)f_{0}(t) we have, p(𝐲𝜷)iA1f1(𝐱i,(j)T𝜷(j))kA0f0(𝐱k,(j)T𝜷(j))p(\mathbf{y}\mid\bm{\beta})\geq\prod_{i\in A_{1}}f_{1}\left(\mathbf{x}_{i,(-j)}^{T}\bm{\beta}_{(-j)}\right)\cdot\prod_{k\in A_{0}}f_{0}\left(\mathbf{x}_{k,(-j)}^{T}\bm{\beta}_{(-j)}\right), which is free of βj\beta_{j}. Therefore,

0βjp(βj𝐲)𝑑βj=0βj[p1p(𝐲𝜷)p(βj)p(𝜷(j))p(𝐲)𝑑𝜷(j)]𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}~p(\beta_{j}\mid\mathbf{y})~d\beta_{j}=\int_{0}^{\infty}\beta_{j}\left[\int_{\mathbb{R}^{p-1}}\frac{p(\mathbf{y}\mid\bm{\beta})p(\beta_{j})p(\bm{\beta}_{(-j)})}{p(\mathbf{y})}d\bm{\beta}_{(-j)}\right]~d\beta_{j}
=\displaystyle=~ 1p(𝐲)0βjp(βj)[p1p(𝐲𝜷)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle\frac{1}{p(\mathbf{y})}\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{\mathbb{R}^{p-1}}p(\mathbf{y}\mid\bm{\beta})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]~d\beta_{j}
\displaystyle\geq~ 1p(𝐲)0βjp(βj)[p1iA1f1(𝐱i,(j)T𝜷(j))kA0f0(𝐱k,(j)T𝜷(j))p(𝜷(j))d𝜷(j)]𝑑βj\displaystyle\frac{1}{p(\mathbf{y})}\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{\mathbb{R}^{p-1}}\prod_{i\in A_{1}}f_{1}\left(\mathbf{x}_{i,(-j)}^{T}\bm{\beta}_{(-j)}\right)\prod_{k\in A_{0}}f_{0}\left(\mathbf{x}_{k,(-j)}^{T}\bm{\beta}_{(-j)}\right)p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]~d\beta_{j}
=\displaystyle=~ 0βjp(βj)𝑑βjp(𝐲)[p1iA1f1(𝐱i,(j)T𝜷(j))kA0f0(𝐱k,(j)T𝜷(j))p(𝜷(j))d𝜷(j)].\displaystyle\frac{\int_{0}^{\infty}\beta_{j}p(\beta_{j})~d\beta_{j}}{p(\mathbf{y})}\left[\int_{\mathbb{R}^{p-1}}\prod_{i\in A_{1}}f_{1}\left(\mathbf{x}_{i,(-j)}^{T}\bm{\beta}_{(-j)}\right)\prod_{k\in A_{0}}f_{0}\left(\mathbf{x}_{k,(-j)}^{T}\bm{\beta}_{(-j)}\right)p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]. (A.3)

Here the first equation results from independent priors, i.e., p(𝜷(j)βj)=p(𝜷(j))p(\bm{\beta}_{(-j)}\mid\beta_{j})=p(\bm{\beta}_{(-j)}). Since p(𝐲𝜷)<1p(\mathbf{y}\mid\bm{\beta})<1, p(𝐲)=pp(𝐲𝜷)p(𝜷)𝑑𝜷<pp(𝜷)𝑑𝜷=1p(\mathbf{y})=\int_{\mathbb{R}^{p}}p(\mathbf{y}\mid\bm{\beta})p(\bm{\beta})d\bm{\beta}<\int_{\mathbb{R}^{p}}p(\bm{\beta})d\bm{\beta}=1. Moreover, p(𝐲𝜷)p(𝜷)>0p(\mathbf{y}\mid\bm{\beta})p(\bm{\beta})>0 for all 𝜷p\bm{\beta}\in\mathbb{R}^{p}, so we also have p(𝐲)>0p(\mathbf{y})>0, implying that 0<p(𝐲)<10<p(\mathbf{y})<1. For the independent Cauchy priors in (6), 0βjp(βj)𝑑βj=\int_{0}^{\infty}\beta_{j}p(\beta_{j})~d\beta_{j}=\infty and the second integral in (A.3) is positive, hence (A.3) equals \infty. ∎
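The divergence established above can also be seen numerically. In the R sketch below (a toy illustration with hypothetical data, not part of the proof), a single positive predictor with all responses equal to one acts as a solitary separator, and the truncated version of the first integral in (4) keeps growing with the upper limit of integration.

x <- c(0.5, 1.2, 2.0); y <- c(1, 1, 1)        # all successes, all x_i > 0
integrand <- function(b)                       # beta * Cauchy(0, 2.5) prior * likelihood
  b * dcauchy(b, 0, 2.5) * sapply(b, function(bb) prod(plogis(bb * x)))
sapply(c(1e2, 1e4, 1e6),
       function(M) integrate(integrand, 0, M, subdivisions = 1000L)$value)
# the truncated integrals keep increasing (roughly logarithmically in M)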

B Proof of the Sufficient Condition for Theorem 1

Here we show that if 𝐗j\mathbf{X}_{j} is not a solitary separator, then the posterior mean of βj\beta_{j} exists.

Proof.

When E(|βj|𝐲)<E(|\beta_{j}|\mid\mathbf{y})<\infty the posterior mean of βj\beta_{j} exists and is given by

E(βj𝐲)\displaystyle E(\beta_{j}\mid\mathbf{y}) =βjp(βj𝐲)𝑑βj\displaystyle=\int_{-\infty}^{\infty}\beta_{j}p(\beta_{j}\mid\mathbf{y})d\beta_{j}
=1p(𝐲)0βjp(βj)p(𝐲βj)𝑑βjdenoted by E(βj𝐲)++1p(𝐲)0βjp(βj)p(𝐲βj)𝑑βjdenoted by E(βj𝐲).\displaystyle=\frac{1}{p(\mathbf{y})}\underbrace{\int_{0}^{\infty}\beta_{j}p(\beta_{j})p(\mathbf{y}\mid\beta_{j})d\beta_{j}}_{\text{denoted by }E(\beta_{j}\mid\mathbf{y})^{+}}+\frac{1}{p(\mathbf{y})}\underbrace{\int_{-\infty}^{0}\beta_{j}p(\beta_{j})p(\mathbf{y}\mid\beta_{j})d\beta_{j}}_{\text{denoted by }E(\beta_{j}\mid\mathbf{y})^{-}}. (B.1)

Since 0<p(𝐲)<10<p(\mathbf{y})<1 it is enough to show that the positive term E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} has a finite upper bound, and the negative term E(βj𝐲)E(\beta_{j}\mid\mathbf{y})^{-} has a finite lower bound. For notational simplicity, in the remainder of the proof, we let 𝜶(j)\bm{\alpha}_{(-j)}, 𝐱i,(j)\mathbf{x}_{i,(-j)}, 𝜷(j)\bm{\beta}_{(-j)}, and 𝝈(j)\bm{\sigma}_{(-j)} denote the vectors 𝜶\bm{\alpha}, 𝐱i\mathbf{x}_{i}, 𝜷\bm{\beta}, and 𝝈=(σ1,,σp)T\bm{\sigma}=(\sigma_{1},\ldots,\sigma_{p})^{T} after excluding their jjth entries, respectively.

We first show that E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} has a finite upper bound. Because 𝐗j\mathbf{X}_{j} is not a solitary separator, there exists either

  1. (a)

    an iA1i^{\prime}\in A_{1} such that xi,j<0x_{i^{\prime},j}<0, or

  2. (b)

    a kA0k^{\prime}\in A_{0}, such that xk,j>0x_{k^{\prime},j}>0.

If both (a) and (b) are violated, then 𝐗j\mathbf{X}_{j} is a solitary separator, which leads to a contradiction. Furthermore, such an 𝐱i,(j)\mathbf{x}_{i^{\prime},(-j)} or 𝐱k,(j)\mathbf{x}_{k^{\prime},(-j)} contains at least one non-zero entry. This is because if j1j\neq 1, i.e., 𝐗j\mathbf{X}_{j} does not correspond to the intercept (the column of all ones), then the first entry in 𝐱i,(j)\mathbf{x}_{i^{\prime},(-j)} or 𝐱k,(j)\mathbf{x}_{k^{\prime},(-j)} equals 11. If j=1j=1, then by the assumption that 𝐗\mathbf{X} has column rank >1>1, there exists a row i{1,2,,n}i^{\diamond}\in\{1,2,\ldots,n\} such that 𝐱i,(1)\mathbf{x}_{i^{\diamond},(-1)} contains at least one non-zero entry. If iA0i^{\diamond}\in A_{0}, then we let k=ik^{\prime}=i^{\diamond} and condition (b) holds because xi,1=1>0x_{i^{\diamond},1}=1>0. If iA1i^{\diamond}\in A_{1}, we may first rescale 𝐗1\mathbf{X}_{1} by 1-1, which transforms β1\beta_{1} to β1-\beta_{1}. Since the Cauchy prior centered at zero is invariant to this rescaling, and E(β1𝐲)E(-\beta_{1}\mid\mathbf{y}) exists if and only if E(β1𝐲)E(\beta_{1}\mid\mathbf{y}) exists, we can simply apply this rescaling, after which xi,1=1<0x_{i^{\diamond},1}=-1<0. Then we let i=ii^{\prime}=i^{\diamond} and condition (a) holds.

We first assume that condition (a) is true. We define a positive constant ϵ=|xi,j|/2=xi,j/2\epsilon=|x_{i^{\prime},j}|/2=-x_{i^{\prime},j}/2. For any βj>0\beta_{j}>0, we define a subset of the domain of 𝜷(j)\bm{\beta}_{(-j)} as follows

G(βj)=def{𝜷(j)p1:𝐱i,(j)T𝜷(j)<ϵβj}.G(\beta_{j})\stackrel{{\scriptstyle\text{def}}}{{=}}\left\{\bm{\beta}_{(-j)}\in\mathbb{R}^{p-1}:\mathbf{x}_{i^{\prime},(-j)}^{T}\bm{\beta}_{(-j)}<\epsilon\beta_{j}\right\}. (B.2)

Then for any 𝜷(j)G(βj)\bm{\beta}_{(-j)}\in G(\beta_{j}), 𝐱iT𝜷=xi,jβj+𝐱i,(j)T𝜷(j)<(xi,j+ϵ)βj=ϵβj\mathbf{x}_{i^{\prime}}^{T}\bm{\beta}=x_{i^{\prime},j}\beta_{j}+\mathbf{x}_{i^{\prime},(-j)}^{T}\bm{\beta}_{(-j)}<(x_{i^{\prime},j}+\epsilon)\beta_{j}=-\epsilon\beta_{j}. Therefore,

f1(𝐱iT𝜷)<f1(ϵβj),for all 𝜷(j)G(βj).f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta})<f_{1}(-\epsilon\beta_{j}),\quad\text{for all }\bm{\beta}_{(-j)}\in G(\beta_{j}). (B.3)

As f1()f_{1}(\cdot) and f0()f_{0}(\cdot) are bounded above by 1, the likelihood function p(𝐲𝜷)p(\mathbf{y}\mid\bm{\beta}) in (A.2) is bounded above by f1(𝐱iT𝜷)f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta}). Thus

E(βj𝐲)+=0βjp(βj)[p1p(𝐲𝜷)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle E(\beta_{j}\mid\mathbf{y})^{+}=\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{\mathbb{R}^{p-1}}p(\mathbf{y}\mid\bm{\beta})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}
<\displaystyle<~ 0βjp(βj)[p1f1(𝐱iT𝜷)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{\mathbb{R}^{p-1}}f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}
=\displaystyle=~ 0βjp(βj)[G(βj)f1(𝐱iT𝜷)p(𝜷(j))𝑑𝜷(j)+p1\G(βj)f1(𝐱iT𝜷)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{G(\beta_{j})}f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}+\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j})}f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}
<\displaystyle<~ 0βjp(βj)[G(βj)f1(ϵβj)p(𝜷(j))𝑑𝜷(j)+p1\G(βj)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{G(\beta_{j})}f_{1}(-\epsilon\beta_{j})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}+\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j} (B.4)

Here the last inequality results from (B.3) and the fact that the function f1()f_{1}(\cdot) is bounded above by 11. An upper bound is obtained for the first term in the bracket in (B.4) using the fact that the integrand is non-negative as follows:

G(βj)f1(ϵβj)p(𝜷(j))𝑑𝜷(j)\displaystyle\int_{G(\beta_{j})}f_{1}(-\epsilon\beta_{j})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)} <p1f1(ϵβj)p(𝜷(j))𝑑𝜷(j)\displaystyle<\int_{\mathbb{R}^{p-1}}f_{1}(-\epsilon\beta_{j})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}
=f1(ϵβj)p1p(𝜷(j))𝑑𝜷(j)=1=eϵβj1+eϵβj<eϵβj.\displaystyle=f_{1}(-\epsilon\beta_{j})\underbrace{\int_{\mathbb{R}^{p-1}}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}}_{=1}=\frac{e^{-\epsilon\beta_{j}}}{1+e^{-\epsilon\beta_{j}}}<e^{-\epsilon\beta_{j}}. (B.5)

Recall that 𝐱i,(j)\mathbf{x}_{i^{\prime},(-j)} contains at least one non-zero entry. We assume that 𝐱i,r0\mathbf{x}_{i^{\prime},r}\neq 0. Then to simplify the second term in the bracket in (B.4), we transform 𝜷(j)\bm{\beta}_{(-j)} to (η,𝝃)(\eta,\bm{\xi}) via a linear transformation, such that η=𝐱i,(j)T𝜷(j)\eta=\mathbf{x}_{i^{\prime},(-j)}^{T}\bm{\beta}_{(-j)}, and 𝝃\bm{\xi} is the vector 𝜷(j)\bm{\beta}_{(-j)} after excluding βr\beta_{r}. The characteristic function of a Cauchy distribution C(μ,σ)C(\mu,\sigma) is φ(t)=eitμ|t|σ\varphi(t)=e^{it\mu-|t|\sigma}, where tt\in\mathbb{R}. Since a priori, η\eta is a linear combination of independent C(0,σ)C(0,\sigma_{\ell}) random variables, β\beta_{\ell}, for 1p,j1\leq\ell\leq p,\ell\neq j, its characteristic function is

φη(t)=E(eitη)=1p,jE[ei(txi,)β]=1p,jφβ(txi,)=e|t|1p,j|xi,|σ.\varphi_{\eta}(t)=E(e^{it\eta})=\prod_{1\leq\ell\leq p,\ell\neq j}E\left[e^{i(tx_{i^{\prime},\ell})\beta_{\ell}}\right]=\prod_{1\leq\ell\leq p,\ell\neq j}\varphi_{\beta_{\ell}}(tx_{i^{\prime},\ell})=e^{-|t|\sum_{1\leq\ell\leq p,\ell\neq j}|x_{i^{\prime},\ell}|\sigma_{\ell}}.

So the induced prior of η\eta is C(0,1p,j|xi,|σ)C(0,\sum_{1\leq\ell\leq p,\ell\neq j}|x_{i^{\prime},\ell}|\sigma_{\ell}). Let |𝐱i,(j)||\mathbf{x}_{i^{\prime},(-j)}| denote the vector obtained by taking absolute values of each element of 𝐱i,(j)\mathbf{x}_{i^{\prime},(-j)}, then the above scale parameter 1p,j|xi,|σ\sum_{1\leq\ell\leq p,\ell\neq j}|x_{i^{\prime},\ell}|\sigma_{\ell}= |𝐱i,(j)|T𝝈(j)|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}. By (B.2), for any 𝜷(j)G(βj)\bm{\beta}_{(-j)}\not\in G(\beta_{j}), the corresponding ηϵβj\eta\geq\epsilon\beta_{j} and 𝝃p2\bm{\xi}\in\mathbb{R}^{p-2}. An upper bound is calculated for the second term in the bracket in (B.4). Note that p(η)p(𝝃η)p(\eta)p(\bm{\xi}\mid\eta) is the joint density of η\eta and 𝝃\bm{\xi}. Since it incorporates the Jacobian of transformation from 𝜷(j)\bm{\beta}_{(-j)} to η\eta and 𝝃\bm{\xi}, a separate Jacobian term is not needed in the first equality below.

p1\G(βj)p(𝜷(j))𝑑𝜷(j)\displaystyle\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)} =ϵβjp2p(η)p(𝝃η)𝑑𝝃𝑑η\displaystyle=\int_{\epsilon\beta_{j}}^{\infty}\int_{\mathbb{R}^{p-2}}p(\eta)p(\bm{\xi}\mid\eta)d\bm{\xi}d\eta
=ϵβjp2p(𝝃η)𝑑𝝃π|𝐱i,(j)|T𝝈(j)[1+η2/(|𝐱i,(j)|T𝝈(j))2]𝑑η\displaystyle=\int_{\epsilon\beta_{j}}^{\infty}\frac{\int_{\mathbb{R}^{p-2}}p(\bm{\xi}\mid\eta)d\bm{\xi}}{\pi~|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}\left[1+\eta^{2}/\left(|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}\right)^{2}\right]}d\eta
=1π[π2arctan(ϵβj|𝐱i,(j)|T𝝈(j))]\displaystyle=\frac{1}{\pi}\left[\frac{\pi}{2}-\arctan\left(\frac{\epsilon\beta_{j}}{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}\right)\right]
=1πarctan(|𝐱i,(j)|T𝝈(j)ϵβj)<|𝐱i,(j)|T𝝈(j)πϵβj.\displaystyle=\frac{1}{\pi}\arctan\left(\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{\epsilon\beta_{j}}\right)<\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{\pi\epsilon\beta_{j}}. (B.6)

Here, the second equality holds because p2p(𝝃η)𝑑𝝃=1\int_{\mathbb{R}^{p-2}}p(\bm{\xi}\mid\eta)d\bm{\xi}=1; the last inequality holds because ϵ\epsilon and βj\beta_{j} are both positive, and for any t>0t>0, arctan(t)<t\arctan(t)<t.
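The distributional fact used above, that a fixed linear combination of independent Cauchy variables is again Cauchy with scale |𝐱i,(j)|T𝝈(j)|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}, can also be checked by simulation; a small R sketch with hypothetical coefficients and the default scale 2.5:

set.seed(1)
n_draws <- 1e5
x     <- c(1, -2.5, 0.7)                       # hypothetical entries of x_{i',(-j)}
sigma <- c(2.5, 2.5, 2.5)                      # prior scales sigma_{(-j)}
beta  <- sapply(seq_along(x), function(l) rcauchy(n_draws, 0, sigma[l]))
eta   <- as.vector(beta %*% x)                 # eta = x_{i',(-j)}^T beta_{(-j)}
s     <- sum(abs(x) * sigma)                   # claimed Cauchy scale of eta
round(rbind(empirical   = quantile(eta, c(0.1, 0.25, 0.75, 0.9)),
            theoretical = qcauchy(c(0.1, 0.25, 0.75, 0.9), 0, s)), 2)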

Then substituting the expression for p(βj)p(\beta_{j}) as in (6), we continue with (B.4) to find an upper bound.

E(βj𝐲)+\displaystyle E(\beta_{j}\mid\mathbf{y})^{+} <0βjπσj(1+βj2/σj2)[eϵβj+|𝐱i,(j)|T𝝈(j)πϵβj]𝑑βj\displaystyle<\int_{0}^{\infty}\frac{\beta_{j}}{\pi\sigma_{j}(1+\beta_{j}^{2}/\sigma_{j}^{2})}\left[e^{-\epsilon\beta_{j}}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{\pi\epsilon\beta_{j}}\right]d\beta_{j}
<0βjeϵβjπσj𝑑βj+0|𝐱i,(j)|T𝝈(j)π2σjϵ(1+βj2/σj2)𝑑βj=1πσjϵ2+|𝐱i,(j)|T𝝈(j)2πϵ.\displaystyle<\int_{0}^{\infty}\frac{\beta_{j}e^{-\epsilon\beta_{j}}}{\pi\sigma_{j}}d\beta_{j}+\int_{0}^{\infty}\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{\pi^{2}\sigma_{j}\epsilon(1+\beta_{j}^{2}/\sigma_{j}^{2})}d\beta_{j}=\frac{1}{\pi\sigma_{j}\epsilon^{2}}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}. (B.7)

On the other hand, if condition (b) holds, then we just need to slightly modify the above proof. We define ϵ=|xk,j|/2=xk,j/2\epsilon=|x_{k^{\prime},j}|/2=x_{k^{\prime},j}/2, and change (B.2) to

G(βj)=def{𝜷(j)p1:𝐱k,(j)T𝜷(j)>ϵβj}.G(\beta_{j})\stackrel{{\scriptstyle\text{def}}}{{=}}\left\{\bm{\beta}_{(-j)}\in\mathbb{R}^{p-1}:\mathbf{x}_{k^{\prime},(-j)}^{T}\bm{\beta}_{(-j)}>-\epsilon\beta_{j}\right\}.

Consequently, the terms f1(𝐱iT𝜷)f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta}) and f1(ϵβj)f_{1}(-\epsilon\beta_{j}) in (B.4) have to be changed to f0(𝐱kT𝜷)f_{0}(\mathbf{x}_{k^{\prime}}^{T}\bm{\beta}) and f0(ϵβj)f_{0}(\epsilon\beta_{j}), respectively. For the logit link, f0(ϵβj)=f1(ϵβj)f_{0}(\epsilon\beta_{j})=f_{1}(-\epsilon\beta_{j}). The range of the integral in (B.6) with respect to η\eta is from -\infty to ϵβj-\epsilon\beta_{j}; however, because the density of η\eta is symmetric around 0, the value of the integral stays the same. So it can be shown that an upper bound for E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} is [1/(πσjϵ2)]+[|𝐱k,(j)|T𝝈(j)/(2πϵ)]\left[1/\left(\pi\sigma_{j}\epsilon^{2}\right)\right]+\left[|\mathbf{x}_{k^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}/\left(2\pi\epsilon\right)\right].

We now show that the negative term E(βj𝐲)E(\beta_{j}\mid\mathbf{y})^{-} has a finite lower bound. For any βj<0\beta_{j}<0, by expressing βj=βj\beta_{j}^{*}=-\beta_{j}, we need to show that the positive term E(βj𝐲)=0βjp(βj)p(𝐲βj)𝑑βj=0βjp(βj)p(𝐲βj)𝑑βj-E(\beta_{j}\mid\mathbf{y})^{-}=-\int_{-\infty}^{0}\beta_{j}p(\beta_{j})p(\mathbf{y}\mid\beta_{j})d\beta_{j}=\int_{0}^{\infty}\beta_{j}^{*}p(\beta_{j}^{*})p(\mathbf{y}\mid-\beta_{j}^{*})d\beta_{j}^{*} has a finite upper bound. As the idea is very similar to the proof of existence of E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+}, we present less details here.

Since the predictor 𝐗j\mathbf{X}_{j} is not a solitary separator, there exists either

  1. (c)

    an iA1i^{*}\in A_{1} such that xi,j>0x_{i^{*},j}>0, or

  2. (d)

    a kA0k^{*}\in A_{0}, such that xk,j<0x_{k^{*},j}<0.

If (c) and (d) are both violated, then 𝐗j\mathbf{X}_{j} has to be a solitary separator, which leads to a contradiction. Without loss of generality, we assume that condition (c) is true, and as before 𝐱i,(j)\mathbf{x}_{i^{*},(-j)} must contain at least one non-zero entry, say, 𝐱i,s0\mathbf{x}_{i^{*},s}\neq 0. If instead condition (d) is true, then we can adopt a modification similar to the one used earlier to handle condition (b) starting from the proof under condition (a).

We define a positive constant ϵ=xi,j/2\epsilon=x_{i^{*},j}/2. For any βj>0\beta_{j}^{*}>0, we define a subset of p1\mathbb{R}^{p-1} as G(βj)=def{𝜷(j)p1:𝐱i,(j)T𝜷(j)<ϵβj}G(\beta_{j}^{*})\stackrel{{\scriptstyle\text{def}}}{{=}}\left\{\bm{\beta}_{(-j)}\in\mathbb{R}^{p-1}:\mathbf{x}_{i^{*},(-j)}^{T}\bm{\beta}_{(-j)}<\epsilon\beta_{j}^{*}\right\}. Then for all 𝜷(j)G(βj)\bm{\beta}_{(-j)}\in G(\beta_{j}^{*}), 𝐱iT𝜷=xi,jβj+𝐱i,(j)T𝜷(j)<(xi,j+ϵ)βj=ϵβj\mathbf{x}_{i^{*}}^{T}\bm{\beta}=-x_{i^{*},j}\beta_{j}^{*}+\mathbf{x}_{i^{*},(-j)}^{T}\bm{\beta}_{(-j)}<(-x_{i^{*},j}+\epsilon)\beta_{j}^{*}=-\epsilon\beta_{j}^{*}, hence f1(𝐱iT𝜷)<f1(ϵβj)f_{1}(\mathbf{x}_{i^{*}}^{T}\bm{\beta})<f_{1}\left(-\epsilon\beta_{j}^{*}\right). Since the likelihood function p(𝐲βj,𝜷(j))<f1(xi,j(βj)+𝐱i,(j)T𝜷(j))<1p(\mathbf{y}\mid-\beta_{j}^{*},\bm{\beta}_{(-j)})<f_{1}(x_{i^{*},j}(-\beta_{j}^{*})+\mathbf{x}_{i^{*},(-j)}^{T}\bm{\beta}_{(-j)})<1,

E(βj𝐲)=0βjp(βj)[p1p(𝐲βj,𝜷(j))p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle-E(\beta_{j}\mid\mathbf{y})^{-}=\int_{0}^{\infty}\beta_{j}^{*}p(\beta_{j}^{*})\left[\int_{\mathbb{R}^{p-1}}p(\mathbf{y}\mid-\beta_{j}^{*},\bm{\beta}_{(-j)})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}^{*}
<\displaystyle<~ 0βjp(βj)[G(βj)f1(ϵβj)p(𝜷(j))𝑑𝜷(j)+p1\G(βj)p(𝜷(j))𝑑𝜷(j)]𝑑βj.\displaystyle\int_{0}^{\infty}\beta_{j}^{*}p(\beta_{j}^{*})\left[\int_{G(\beta_{j}^{*})}f_{1}(-\epsilon\beta_{j}^{*})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}+\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j}^{*})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}^{*}. (B.8)

The first term in the bracket in (B.8) has an upper bound exp(ϵβj)\exp(-\epsilon\beta_{j}^{*}) as in (B.5). Recall that 𝐱i,s0\mathbf{x}_{i^{*},s}\neq 0. We now transform 𝜷(j)\bm{\beta}_{(-j)} to (η,𝝃)(\eta,\bm{\xi}) via a linear transformation, such that η=𝐱i,(j)T𝜷(j)\eta=\mathbf{x}_{i^{*},(-j)}^{T}\bm{\beta}_{(-j)} and 𝝃\bm{\xi} is the vector 𝜷(j)\bm{\beta}_{(-j)} after excluding βs\beta_{s}. The prior of η\eta is C(0,|𝐱i,(j)|T𝝈(j))C(0,|\mathbf{x}_{i^{*},(-j)}|^{T}\bm{\sigma}_{(-j)}). For any 𝜷(j)G(βj)\bm{\beta}_{(-j)}\not\in G(\beta_{j}^{*}), the corresponding ηϵβj\eta\geq\epsilon\beta_{j}^{*}. Therefore as in (B.6), we obtain an upper bound for the second term in the bracket in (B.8) as |𝐱i,(j)|T𝝈(j)/(πϵβj)|\mathbf{x}_{i^{*},(-j)}|^{T}\bm{\sigma}_{(-j)}/\left(\pi\epsilon\beta_{j}^{*}\right). Finally, following (B.7) an upper bound for E(βj𝐲)-E(\beta_{j}\mid\mathbf{y})^{-} is [1/(πσjϵ2)]+[|𝐱i,(j)|T𝝈(j)/(2πϵ)]\left[1/\left(\pi\sigma_{j}\epsilon^{2}\right)\right]+\left[|\mathbf{x}_{i^{*},(-j)}|^{T}\bm{\sigma}_{(-j)}/\left(2\pi\epsilon\right)\right]. ∎
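Conversely, when the separation is broken, the same kind of truncated integral settles down; the companion R sketch below (again a toy illustration with hypothetical data, not part of the proof) changes one response of the earlier example to a failure so that the predictor is no longer a solitary separator.

x <- c(0.5, 1.2, 2.0); y <- c(1, 1, 0)         # the failure at x = 2.0 breaks the separation
integrand <- function(b)
  b * dcauchy(b, 0, 2.5) *
    sapply(b, function(bb) prod(plogis(bb * x)^y * (1 - plogis(bb * x))^(1 - y)))
sapply(c(1e2, 1e4, 1e6),
       function(M) integrate(integrand, 0, M, subdivisions = 1000L)$value)
# the three values agree, consistent with a finite positive part in (B.1)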

C Proof of Theorem 2

Proof.

Following the proof of Theorem 1, we denote the success probability function by π=f1(𝐱T𝜷)\pi=f_{1}(\mathbf{x}^{T}\bm{\beta}), where f1()f_{1}(\cdot) is the inverse link function, i.e., f1()=g1()f_{1}(\cdot)=g^{-1}(\cdot). Similarly, let the failure probability function be f0(𝐱T𝜷)=1f1(𝐱T𝜷)f_{0}(\mathbf{x}^{T}\bm{\beta})=1-f_{1}(\mathbf{x}^{T}\bm{\beta}). Note that the proof of the necessary condition for Theorem 1 given in Appendix A only relies on the fact the f1()f_{1}(\cdot) is increasing, continuous, and bounded between 0 and 11. Since the link function g()g(\cdot) is assumed to be strictly increasing and differentiable, so is f1()f_{1}(\cdot). Moreover, the range of f1f_{1} is (0,1)(0,1). Therefore, the proof of the necessary condition for Theorem 2 follows immediately.

For the proof of the sufficient condition in Theorem 2, one can follow the proof in Appendix B and proceed with the specific choice of ϵ\epsilon used there, when condition (a) holds. The proof is identical until (B.4) because the explicit form of the inverse link function is not used until that step. We re-write the final step in (B.4) below and proceed from there:

E(βj𝐲)+\displaystyle E(\beta_{j}\mid\mathbf{y})^{+}
<\displaystyle< 0βjp(βj)[G(βj)f1(ϵβj)p(𝜷(j))𝑑𝜷(j)+p1\G(βj)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{G(\beta_{j})}f_{1}(-\epsilon\beta_{j})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}+\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}
=\displaystyle= 0βjp(βj)f1(ϵβj)[G(βj)p(𝜷(j))𝑑𝜷(j)]<p1p(𝜷(j))𝑑𝜷(j)=1𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})f_{1}(-\epsilon\beta_{j})\underbrace{\left[\int_{G(\beta_{j})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]}_{<\int_{\mathbb{R}^{p-1}}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}=1}d\beta_{j}
+0βjp(βj)[p1\G(βj)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle+\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}
<\displaystyle< 0βjp(βj)f1(ϵβj)=g1(ϵβj)𝑑βj+0βjp(βj)[p1\G(βj)p(𝜷(j))𝑑𝜷(j)]𝑑βj.\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})\underbrace{f_{1}(-\epsilon\beta_{j})}_{=g^{-1}(-\epsilon\beta_{j})}d\beta_{j}+\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}. (C.1)

The sufficient condition in Theorem 2 states that for every positive constant ϵ\epsilon,

0βjp(βj)g1(ϵβj)𝑑βj<.\int_{0}^{\infty}\beta_{j}p(\beta_{j})g^{-1}(-\epsilon\beta_{j})d\beta_{j}<\infty.

This implies that the integral will be bounded for the specific choice of ϵ\epsilon used in the above proof, and hence the first integral in (C.1) is bounded above. The second integral does not depend on the link function and its bound can be obtained exactly as in Appendix B. Thus E(βj𝐲)+<E(\beta_{j}\mid\mathbf{y})^{+}<\infty under condition (a). On the other hand if condition (b) holds, the proof follows similarly as in Appendix B and now we need to use the sufficient condition in Theorem 2 that for every positive constant ϵ\epsilon, 0βjp(βj)[1g1(ϵβj)]𝑑βj<\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[1-g^{-1}(\epsilon\beta_{j})\right]d\beta_{j}<\infty. A bound for E(βj𝐲)-E(\beta_{j}\mid\mathbf{y})^{-} can be obtained similarly, which completes the proof.

In probit regression, we first show that

gprobit1(t)=Φ(t)<et/(1+et)=glogit1(t),for any t<0,g^{-1}_{\text{probit}}(t)=\Phi(t)<e^{t}/(1+e^{t})=g^{-1}_{\text{logit}}(t),\quad\text{for any }t<0, (C.2)

where Φ(t)\Phi(t) is the standard normal cdf. It is equivalent to show that the difference function

u(t)=Φ(t)et1+et<0, for all t<0.u(t)=\Phi(t)-\frac{e^{t}}{1+e^{t}}<0,\text{ for all }t<0. (C.3)

Note that u(0)=1/21/2=0u(0)=1/2-1/2=0, and limtu(t)=0\lim_{t\rightarrow-\infty}u(t)=0. Since u(t)u(t) is differentiable, we have

u(t)=12πet22et(1+et)2.u^{\prime}(t)=\frac{1}{\sqrt{2\pi}}e^{-\frac{t^{2}}{2}}-\frac{e^{t}}{(1+e^{t})^{2}}.

Now u(0)=1/2π1/4>0u^{\prime}(0)=1/\sqrt{2\pi}-1/4>0, and when tt is very small, u(t)<0u^{\prime}(t)<0 since et2/2e^{-t^{2}/2} decays to zero at a faster speed than ete^{t}, i.e., there exists a t~<0\tilde{t}<0 such that u(t~)<0u^{\prime}(\tilde{t})<0. Since u(t)u^{\prime}(t) is a continuous function, the intermediate value theorem applied to [t~,0][\tilde{t},0] shows that there exists a t1<0t_{1}<0 such that u(t1)=0u^{\prime}(t_{1})=0. Therefore, to show (C.3), it is sufficient to show that u(t)u^{\prime}(t) has a unique root on \mathbb{R}^{-}, which is proved by contradiction as follows.

If u(t)u^{\prime}(t) has two distinct roots t1,t2<0t_{1},t_{2}<0, i.e., for i=1,2i=1,2, u(ti)=0u^{\prime}(t_{i})=0, then

12πeti22=eti(1+eti)2,i=1,2et122et222=et1et2(1+et2)2(1+et1)2\displaystyle\frac{1}{\sqrt{2\pi}}e^{-\frac{t_{i}^{2}}{2}}=\frac{e^{t_{i}}}{(1+e^{t_{i}})^{2}},\ i=1,2\Longleftrightarrow\frac{e^{-\frac{t_{1}^{2}}{2}}}{e^{-\frac{t_{2}^{2}}{2}}}=\frac{e^{t_{1}}}{e^{t_{2}}}\cdot\frac{(1+e^{t_{2}})^{2}}{(1+e^{t_{1}})^{2}}
\displaystyle\Longleftrightarrow e(t2+1)22e(t1+1)22=(1+et21+et1)2(t2+1)24log(1+et2)=(t1+1)24log(1+et1).\displaystyle\frac{e^{\frac{(t_{2}+1)^{2}}{2}}}{e^{\frac{(t_{1}+1)^{2}}{2}}}=\left(\frac{1+e^{t_{2}}}{1+e^{t_{1}}}\right)^{2}\Longleftrightarrow\frac{(t_{2}+1)^{2}}{4}-\log(1+e^{t_{2}})=\frac{(t_{1}+1)^{2}}{4}-\log(1+e^{t_{1}}). (C.4)

Note that the derivative of the function (t+1)2/4log(1+et)(t+1)^{2}/4-\log(1+e^{t}) is (t+1)/2et/(1+et)(t+1)/2-e^{t}/(1+e^{t}). It is straightforward to show that this derivative is strictly less than 0 for all t<0t<0, so (t+1)2/4log(1+et)(t+1)^{2}/4-\log(1+e^{t}) is a strictly decreasing function. Thus (C.4) holds only if t1=t2t_{1}=t_{2}, which leads to a contradiction.
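Inequality (C.2) can also be checked directly on a grid; the short R computation below (a numerical illustration, not a substitute for the argument above) confirms that u(t)u(t) is negative over a range of negative tt.

t_grid <- seq(-10, -0.001, by = 0.001)
max(pnorm(t_grid) - plogis(t_grid))   # maximum of u(t) on the grid; numerically negative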

Hence for any ϵ>0\epsilon>0,

0βjp(βj)gprobit1(ϵβj)𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})g^{-1}_{\text{probit}}(-\epsilon\beta_{j})d\beta_{j} <0βjp(βj)glogit1(ϵβj)𝑑βj<0βjp(βj)eϵβj𝑑βj\displaystyle<\int_{0}^{\infty}\beta_{j}p(\beta_{j})g^{-1}_{\text{logit}}(-\epsilon\beta_{j})d\beta_{j}<\int_{0}^{\infty}\beta_{j}p(\beta_{j}){e^{-\epsilon\beta_{j}}}d\beta_{j}
=0βjeϵβjπσj(1+βj2/σj2)𝑑βj\displaystyle=\int_{0}^{\infty}\frac{\beta_{j}e^{-\epsilon\beta_{j}}}{\pi\sigma_{j}(1+\beta_{j}^{2}/\sigma_{j}^{2})}d\beta_{j} <0βjeϵβjπσj𝑑βj=1πσjϵ2<.\displaystyle<\int_{0}^{\infty}\frac{\beta_{j}e^{-\epsilon\beta_{j}}}{\pi\sigma_{j}}d\beta_{j}=\frac{1}{\pi\sigma_{j}\epsilon^{2}}<\infty.

Since the probit link is symmetric, i.e., 1gprobit1(ϵβj)=1Φ(ϵβj)=Φ(ϵβj)=gprobit1(ϵβj)1-g^{-1}_{\text{probit}}(\epsilon\beta_{j})=1-\Phi(\epsilon\beta_{j})=\Phi(-\epsilon\beta_{j})=g^{-1}_{\text{probit}}(-\epsilon\beta_{j}), we also have 0βjp(βj)[1gprobit1(ϵβj)]𝑑βj<\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[1-g^{-1}_{\text{probit}}(\epsilon\beta_{j})\right]d\beta_{j}<\infty. ∎
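As a numerical sanity check on the sufficient condition of Theorem 2 for the probit link (with the default Cauchy scale 2.5 taken as an assumption for the illustration), the integral in the condition evaluates to a finite number for several values of ϵ\epsilon:

sapply(c(0.1, 0.5, 2), function(eps)
  integrate(function(b) b * dcauchy(b, 0, 2.5) * pnorm(-eps * b), 0, Inf)$value)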

D Proof of Corollary 1

To prove Corollary 1, we mainly use a similar strategy to the proof of Theorem 1. To show the necessary condition, we can use all of Appendix A without modification, for both logistic and probit regression models. To show the sufficient condition, we can follow the same proof outline as in Appendix B, with some minimal modification as described in the following proof.

Proof.

First, we denote the vector of prior location parameters by 𝝁=(μ1,μ2,,μp)T\bm{\mu}=(\mu_{1},\mu_{2},\ldots,\mu_{p})^{T}. If we shift the coefficients 𝜷\bm{\beta} by 𝝁\bm{\mu} units, then

𝜸=def𝜷𝝁γjindC(0,σj),j=1,2,,p,\bm{\gamma}\stackrel{{\scriptstyle\text{def}}}{{=}}\bm{\beta}-\bm{\mu}~\Longrightarrow~\gamma_{j}\stackrel{{\scriptstyle\text{ind}}}{{\sim}}\text{C}(0,\sigma_{j}),\quad j=1,2,\ldots,p,

that is, the resulting parameters γj\gamma_{j} have independent Cauchy priors with location parameters being zero. Since the original parameter βj=γj+μj\beta_{j}=\gamma_{j}+\mu_{j} for each j=1,2,,pj=1,2,\ldots,p, the existence of E(βj𝐘)E(\beta_{j}\mid\mathbf{Y}) is equivalent to the existence of E(γj𝐘)E(\gamma_{j}\mid\mathbf{Y}). So we just need to show that if 𝐗j\mathbf{X}_{j} is not a solitary separator, then E(γj𝐘)E(\gamma_{j}\mid\mathbf{Y}) exists. For simplicity, here we just show that the positive term

E(γj𝐲)+=def0γjp(γj)p(𝐲γj)𝑑γjE(\gamma_{j}\mid\mathbf{y})^{+}\stackrel{{\scriptstyle\text{def}}}{{=}}\int_{0}^{\infty}\gamma_{j}p(\gamma_{j})p(\mathbf{y}\mid\gamma_{j})d\gamma_{j}

has a finite upper bound. The other half of the result that the negative term E(γj𝐲)E(\gamma_{j}\mid\mathbf{y})^{-} has a finite lower bound will follow with a similar derivation.

As in Appendix B, we first assume that condition (a) is true, and define ϵ\epsilon in the same way. For any γj>0\gamma_{j}>0, we define a subset of the domain of 𝜸(j)\bm{\gamma}_{(-j)} as follows

G(γj)=def{𝜸(j)p1:𝐱i,(j)T𝜸(j)<ϵγj},G(\gamma_{j})\stackrel{{\scriptstyle\text{def}}}{{=}}\left\{\bm{\gamma}_{(-j)}\in\mathbb{R}^{p-1}:\mathbf{x}_{i^{\prime},(-j)}^{T}\bm{\gamma}_{(-j)}<\epsilon\gamma_{j}\right\},

then for any 𝜸(j)G(γj)\bm{\gamma}_{(-j)}\in G(\gamma_{j}), 𝐱iT𝜸<ϵγj\mathbf{x}_{i^{\prime}}^{T}\bm{\gamma}<-\epsilon\gamma_{j}. Since f1()f_{1}(\cdot) is an increasing function,

f1(𝐱iT𝜷)=f1(𝐱iT𝜸+𝐱iT𝝁)<f1(ϵγj+𝐱iT𝝁),for all 𝜸(j)G(γj).f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta})=f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\gamma}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu})<f_{1}(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}),\quad\text{for all }\bm{\gamma}_{(-j)}\in G(\gamma_{j}).

A similar derivation to (B.4) gives

E(γj𝐲)+\displaystyle E(\gamma_{j}\mid\mathbf{y})^{+}
<\displaystyle<~ 0γjp(γj)[G(γj)f1(ϵγj+𝐱iT𝝁)p(𝜸(j))𝑑𝜸(j)+p1\G(γj)p(𝜸(j))𝑑𝜸(j)]𝑑γj,\displaystyle\int_{0}^{\infty}\gamma_{j}p(\gamma_{j})\left[\int_{G(\gamma_{j})}f_{1}(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu})p(\bm{\gamma}_{(-j)})d\bm{\gamma}_{(-j)}+\int_{\mathbb{R}^{p-1}\backslash G(\gamma_{j})}p(\bm{\gamma}_{(-j)})d\bm{\gamma}_{(-j)}\right]d\gamma_{j},

where by (B.5) the first integral inside the bracket has an upper bound

G(γj)f1(ϵγj+𝐱iT𝝁)p(𝜸(j))𝑑𝜸(j)<f1(ϵγj+𝐱iT𝝁),\int_{G(\gamma_{j})}f_{1}(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu})p(\bm{\gamma}_{(-j)})d\bm{\gamma}_{(-j)}<f_{1}(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}), (D.1)

and by (B.6) the second integral inside the bracket also has an upper bound

p1\G(γj)p(𝜸(j))𝑑𝜸(j)<|𝐱i,(j)|T𝝈(j)πϵγj.\int_{\mathbb{R}^{p-1}\backslash G(\gamma_{j})}p(\bm{\gamma}_{(-j)})d\bm{\gamma}_{(-j)}<\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{\pi\epsilon\gamma_{j}}.

In logistic regression, the right hand side of (D.1) is further bounded

f1(ϵγj+𝐱iT𝝁)=eϵγj+𝐱iT𝝁1+eϵγj+𝐱iT𝝁<eϵγj+𝐱iT𝝁,f_{1}(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu})=\frac{e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}}{1+e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}}<e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}},

and hence by (B.7),

E(γj𝐲)+<e𝐱iT𝝁πσjϵ2+|𝐱i,(j)|T𝝈(j)2πϵ.E(\gamma_{j}\mid\mathbf{y})^{+}<\frac{e^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}}{\pi\sigma_{j}\epsilon^{2}}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}.

In probit regression, the function f1()f_{1}(\cdot) in the above derivations equals the standard normal cdf Φ()\Phi(\cdot). By (C.2), for any γj>𝐱iT𝝁/ϵ\gamma_{j}>\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon, we have

Φ(ϵγj+𝐱iT𝝁)<eϵγj+𝐱iT𝝁1+eϵγj+𝐱iT𝝁<eϵγj+𝐱iT𝝁.\Phi\left(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}\right)<\frac{e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}}{1+e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}}<e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}.

Hence for 𝐱iT𝝁/ϵ>0\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon>0 we have an upper bound

E(γj𝐲)+<0γjp(γj)[Φ(ϵγj+𝐱iT𝝁)+|𝐱i,(j)|T𝝈(j)πϵγj]𝑑γj\displaystyle E(\gamma_{j}\mid\mathbf{y})^{+}<\int_{0}^{\infty}\gamma_{j}p(\gamma_{j})\left[\Phi\left(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}\right)+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{\pi\epsilon\gamma_{j}}\right]d\gamma_{j}
=\displaystyle= 0γjp(γj)Φ(ϵγj+𝐱iT𝝁)𝑑γj+|𝐱i,(j)|T𝝈(j)2πϵ\displaystyle\int_{0}^{\infty}\gamma_{j}p(\gamma_{j})\Phi\left(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}\right)d\gamma_{j}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}
=\displaystyle= 0𝐱iT𝝁/ϵγjp(γj)Φ(ϵγj+𝐱iT𝝁)𝑑γj+𝐱iT𝝁/ϵγjp(γj)Φ(ϵγj+𝐱iT𝝁)𝑑γj+|𝐱i,(j)|T𝝈(j)2πϵ\displaystyle\int_{0}^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon}\gamma_{j}p(\gamma_{j})\Phi\left(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}\right)d\gamma_{j}+\int_{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon}^{\infty}\gamma_{j}p(\gamma_{j})\Phi\left(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}\right)d\gamma_{j}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}
<\displaystyle< 0𝐱iT𝝁/ϵγjp(γj)𝑑γj+𝐱iT𝝁/ϵγjp(γj)eϵγj+𝐱iT𝝁𝑑γj+|𝐱i,(j)|T𝝈(j)2πϵ\displaystyle\int_{0}^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon}\gamma_{j}p(\gamma_{j})d\gamma_{j}+\int_{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon}^{\infty}\gamma_{j}p(\gamma_{j})e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}d\gamma_{j}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}
<\displaystyle< 0𝐱iT𝝁/ϵγjπσj(1+γj2/σj2)𝑑γj+e𝐱iT𝝁0γjeϵγjπσj(1+γj2/σj2)𝑑γj+|𝐱i,(j)|T𝝈(j)2πϵ\displaystyle\int_{0}^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon}\frac{\gamma_{j}}{\pi\sigma_{j}(1+\gamma_{j}^{2}/\sigma_{j}^{2})}d\gamma_{j}+e^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}\int_{0}^{\infty}\frac{\gamma_{j}e^{-\epsilon\gamma_{j}}}{\pi\sigma_{j}(1+\gamma_{j}^{2}/\sigma_{j}^{2})}d\gamma_{j}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}
<\displaystyle< σj2πlog[1+(𝐱iT𝝁ϵσj)2]+e𝐱iT𝝁0γjeϵγjπσj𝑑γj+|𝐱i,(j)|T𝝈(j)2πϵ\displaystyle~\frac{\sigma_{j}}{2\pi}\log\left[1+\left(\frac{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}{\epsilon\sigma_{j}}\right)^{2}\right]+e^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}\int_{0}^{\infty}\frac{\gamma_{j}e^{-\epsilon\gamma_{j}}}{\pi\sigma_{j}}d\gamma_{j}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}
=\displaystyle= σj2πlog[1+(𝐱iT𝝁ϵσj)2]+e𝐱iT𝝁πσjϵ2+|𝐱i,(j)|T𝝈(j)2πϵ.\displaystyle~\frac{\sigma_{j}}{2\pi}\log\left[1+\left(\frac{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}{\epsilon\sigma_{j}}\right)^{2}\right]+\frac{e^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}}{\pi\sigma_{j}\epsilon^{2}}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}.

Note that a similar derivation also holds if 𝐱iT𝝁/ϵ<0\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon<0.

On the other hand, if condition (b) is true, we can follow the same modification in Appendix B to find upper bounds in a similar way. ∎

To show Theorem 3, we decompose its proof into two parts: in Appendix E we show that all posterior means exist if there is no separation; then in Appendix F we show that under a multivariate Cauchy prior, none of the posterior means exist if there is complete separation.

E Proof of Theorem 3, in the case of no separation

Proof.

For any j=1,2,,pj=1,2,\ldots,p, to show that E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exists, we just need to show the positive term E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} in (B.1) has an upper bound, because the negative term E(βj𝐲)E(\beta_{j}\mid\mathbf{y})^{-} in (B.1) having a lower bound follows a similar derivation.

When working on E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+}, we only need to consider positive βj\beta_{j}. Denote a new p1p-1 dimensional variable 𝜸=𝜷(j)/βj\bm{\gamma}=\bm{\beta}_{(-j)}/\beta_{j}, then for i=1,2,,ni=1,2,\ldots,n,

𝐱iT𝜷=βj(xi,j+𝐱i,(j)T𝜸).\mathbf{x}_{i}^{T}\bm{\beta}=\beta_{j}\left(x_{i,j}+\mathbf{x}_{i,(-j)}^{T}\bm{\gamma}\right).

If there is no separation, for any 𝜸p1\bm{\gamma}\in\mathbb{R}^{p-1}, there exists at least one i{1,2,,n}i\in\{1,2,\ldots,n\}, such that

xi,j+𝐱i,(j)T𝜸<0, if iA1, or xi,j+𝐱i,(j)T𝜸>0, if iA0.x_{i,j}+\mathbf{x}_{i,(-j)}^{T}\bm{\gamma}<0,\text{ if }i\in A_{1},\text{ or }x_{i,j}+\mathbf{x}_{i,(-j)}^{T}\bm{\gamma}>0,\text{ if }i\in A_{0}. (E.1)

For each i=1,2,,ni=1,2,\ldots,n, denote the vector 𝐳i\mathbf{z}_{i} and the function hi()h_{i}(\cdot) as follows,

𝐳i=def{𝐱i if iA1𝐱i if iA0,hi(𝜸)=defzi,j+𝐳i,(j)T𝜸,\mathbf{z}_{i}\stackrel{{\scriptstyle\text{def}}}{{=}}\begin{cases}\mathbf{x}_{i}&\text{ if }i\in A_{1}\\ -\mathbf{x}_{i}&\text{ if }i\in A_{0}\\ \end{cases},\quad h_{i}(\bm{\gamma})\stackrel{{\scriptstyle\text{def}}}{{=}}z_{i,j}+\mathbf{z}_{i,(-j)}^{T}\bm{\gamma}, (E.2)

then (E.1) can be rewritten as hi(𝜸)<0h_{i}(\bm{\gamma})<0. Denote for i=1,2,,ni=1,2,\ldots,n,

Bi=def{𝜸:hi(𝜸)<0}.B_{i}\stackrel{{\scriptstyle\text{def}}}{{=}}\{\bm{\gamma}:h_{i}(\bm{\gamma})<0\}.

Then each BiB_{i} is a non-empty subset of p1\mathbb{R}^{p-1}, unless zi,j0z_{i,j}\geq 0 and 𝐳i,(j)=𝟎\mathbf{z}_{i,(-j)}=\mathbf{0}. Let ={i:Bi}\mathcal{I}=\{i:B_{i}\neq\emptyset\} denote the set of indices ii for which the corresponding BiB_{i} are non-empty. Because there is no separation,

iBi=p1.\bigcup_{i\in\mathcal{I}}B_{i}=\mathbb{R}^{p-1}. (E.3)

Hence, the set \mathcal{I} is non-empty. We denote its size by q=def||q\stackrel{{\scriptstyle\text{def}}}{{=}}|\mathcal{I}|, and rewrite ={i1,i2,,iq}\mathcal{I}=\{i_{1},i_{2},\ldots,i_{q}\}.

Now we show that there exist positive constants ϵi1,ϵi2,,ϵiq\epsilon_{i_{1}},\epsilon_{i_{2}},\ldots,\epsilon_{i_{q}}, such that

k=1qB~ik=p1,\bigcup_{k=1}^{q}\tilde{B}_{i_{k}}=\mathbb{R}^{p-1}, (E.4)

where

B~ik=def{𝜸:hik(𝜸)<ϵik},\tilde{B}_{i_{k}}\stackrel{{\scriptstyle\text{def}}}{{=}}\{\bm{\gamma}:h_{i_{k}}(\bm{\gamma})<-\epsilon_{i_{k}}\}, (E.5)

are subsets of the corresponding BikB_{i_{k}}, for all k=1,2,,qk=1,2,\ldots,q.

If there exists an iki_{k}\in\mathcal{I} such that zik,j<0z_{i_{k},j}<0 and 𝐳ik,(j)=𝟎\mathbf{z}_{i_{k},(-j)}=\mathbf{0}, then Bik=p1B_{i_{k}}=\mathbb{R}^{p-1}. In this case, we just need to let ϵik=zik,j/2\epsilon_{i_{k}}=-z_{i_{k},j}/2, and ϵir=M\epsilon_{i_{r}}=M, for all rkr\neq k, where MM is an arbitrary positive number. Under this choice of ϵi\epsilon_{i}’s, the sets B~i\tilde{B}_{i}’s defined by (E.5) satisfy (E.4).

If, on the other hand, 𝐳ik,(j)𝟎\mathbf{z}_{i_{k},(-j)}\neq\mathbf{0} for all iki_{k}\in\mathcal{I}, i.e., all BikB_{i_{k}} are open half spaces in p1\mathbb{R}^{p-1}, then we can find the constants ϵi1,ϵi2,,ϵiq\epsilon_{i_{1}},\epsilon_{i_{2}},\ldots,\epsilon_{i_{q}} sequentially. For i1i_{1}, if k=2qBik=p1\bigcup_{k=2}^{q}B_{i_{k}}=\mathbb{R}^{p-1}, we can set ϵi1=M\epsilon_{i_{1}}=M. Then the resulting B~i1\tilde{B}_{i_{1}} defined by (E.5) satisfies

B~i1Bi2Bi3Biq=p1.\tilde{B}_{i_{1}}\cup B_{i_{2}}\cup B_{i_{3}}\cup\cdots\cup B_{i_{q}}=\mathbb{R}^{p-1}. (E.6)

If k=2qBikp1\bigcup_{k=2}^{q}B_{i_{k}}\neq\mathbb{R}^{p-1}, then (E.3) suggests

Bi1(k=2qBik)c=k=2qBikc.B_{i_{1}}\supset\left(\bigcup_{k=2}^{q}B_{i_{k}}\right)^{c}=\bigcap_{k=2}^{q}B_{i_{k}}^{c}. (E.7)

For (E.6) to hold, we just need to find a positive ϵi1\epsilon_{i_{1}} such that the resulting B~i1\tilde{B}_{i_{1}} has k=2qBikc\bigcap_{k=2}^{q}B_{i_{k}}^{c} as a subset, i.e., ϵi1-\epsilon_{i_{1}} should be larger than the maximum of hi1(𝜸)h_{i_{1}}(\bm{\gamma}) over the polyhedral region 𝜸k=2qBikc\bm{\gamma}\in\bigcap_{k=2}^{q}B_{i_{k}}^{c}. Note that maximizing hi1(𝜸)h_{i_{1}}(\bm{\gamma}) over the polyhedron can be represented as a linear programming problem,

maximize zi1,j+𝐳i1,(j)T𝜸\displaystyle z_{i_{1},j}+\mathbf{z}_{i_{1},(-j)}^{T}\bm{\gamma} (E.8)
subject to 𝐳i2,(j)T𝜸zi2,j\displaystyle\mathbf{z}_{i_{2},(-j)}^{T}\bm{\gamma}\geq-z_{i_{2},j}
\displaystyle\vdots
𝐳iq,(j)T𝜸ziq,j.\displaystyle\mathbf{z}_{i_{q},(-j)}^{T}\bm{\gamma}\geq-z_{i_{q},j}.

By Bertsimas and Tsitsiklis (1997, pp. 67, Corollary 2.3), for any linear programming problem over a non-empty polyhedron, including the one in (E.8) to maximize hi1(𝜸)=zi1,j+𝐳i1,(j)T𝜸h_{i_{1}}(\bm{\gamma})=z_{i_{1},j}+\mathbf{z}_{i_{1},(-j)}^{T}\bm{\gamma}, either the optimal hi1(𝜸)=h_{i_{1}}(\bm{\gamma})=\infty, or there exists an optimal solution, 𝜸\bm{\gamma}^{*}. Here, the latter case always occurs, because by (E.7), the maximum of hi1(𝜸)h_{i_{1}}(\bm{\gamma}) over the polyhedron k=2qBikc\bigcap_{k=2}^{q}B_{i_{k}}^{c} does not exceed zero, so it does not go to infinity. Hence, we just need to let

ϵi1=12[max𝜸k=2qBikczi1,j+𝐳i1,(j)T𝜸]=zi1,j+𝐳i1,(j)T𝜸2,\epsilon_{i_{1}}=-\frac{1}{2}\left[\max_{\bm{\gamma}\in\bigcap_{k=2}^{q}B_{i_{k}}^{c}}z_{i_{1},j}+\mathbf{z}_{i_{1},(-j)}^{T}\bm{\gamma}\right]=-\frac{z_{i_{1},j}+\mathbf{z}_{i_{1},(-j)}^{T}\bm{\gamma}^{*}}{2},

so that the resulting B~i1={𝜸:hi1(𝜸)<ϵi1}\tilde{B}_{i_{1}}=\{\bm{\gamma}:h_{i_{1}}(\bm{\gamma})<-\epsilon_{i_{1}}\} contains k=2qBikc\bigcap_{k=2}^{q}B_{i_{k}}^{c} as a subset, which yields (E.6). After finding ϵi1\epsilon_{i_{1}}, we can apply similar procedures sequentially, to find positive values ϵik\epsilon_{i_{k}}, for k=2,3,,qk=2,3,\ldots,q, such that

B~i1B~ikBik+1Biq=p1.\tilde{B}_{i_{1}}\cup\cdots\cup\tilde{B}_{i_{k}}\cup B_{i_{k+1}}\cup\cdots\cup B_{i_{q}}=\mathbb{R}^{p-1}.

After identifying all ϵik\epsilon_{i_{k}}’s, the resulting B~ik\tilde{B}_{i_{k}}’s satisfy (E.4). Note that the choice of the ϵik\epsilon_{i_{k}}’s depends only on the data 𝐳i\mathbf{z}_{i}, i=1,2,,ni=1,2,\ldots,n, so they are constants given the observed data.
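For readers who wish to compute such constants in practice, the linear program in (E.8) can be solved with standard software; the R sketch below (our own illustration, assuming the lpSolve package and a non-empty polyhedron as in the case treated above) writes the free vector 𝜸\bm{\gamma} as a difference of two non-negative vectors, since lpSolve assumes non-negative decision variables.

library(lpSolve)
# z1: the vector z_{i_1} (length p); Z_rest: a matrix whose rows are z_{i_2}, ..., z_{i_q}.
find_epsilon <- function(z1, Z_rest, j) {
  c_obj <- z1[-j]                              # objective coefficients z_{i_1,(-j)}
  A     <- Z_rest[, -j, drop = FALSE]          # constraint rows z_{i_k,(-j)}^T
  b     <- -Z_rest[, j]                        # right-hand sides -z_{i_k,j}
  sol <- lp(direction    = "max",              # maximize over gamma = u - v, with u, v >= 0
            objective.in = c(c_obj, -c_obj),
            const.mat    = cbind(A, -A),
            const.dir    = rep(">=", length(b)),
            const.rhs    = b)
  -(z1[j] + sol$objval) / 2                    # epsilon_{i_1} as defined above
}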

For each k=1,2,,qk=1,2,\ldots,q, next we show that for any 𝜸B~ik\bm{\gamma}\in\tilde{B}_{i_{k}}, the likelihood function of the iki_{k}th observation is bounded above by (βjϵike)1(\beta_{j}\epsilon_{i_{k}}e)^{-1}. This is because in a logistic regression, if ikA1i_{k}\in A_{1}, then

p(yikβj,𝜸)=f1(βjhik(𝜸))=eβjhik(𝜸)1+eβjhik(𝜸)<eβjhik(𝜸)<eβjϵik1βjϵike,p(y_{i_{k}}\mid\beta_{j},\bm{\gamma})=f_{1}\left(\beta_{j}h_{i_{k}}(\bm{\gamma})\right)=\frac{e^{\beta_{j}h_{i_{k}}(\bm{\gamma})}}{1+e^{\beta_{j}h_{i_{k}}(\bm{\gamma})}}<e^{\beta_{j}h_{i_{k}}(\bm{\gamma})}<e^{-\beta_{j}\epsilon_{i_{k}}}\leq\frac{1}{\beta_{j}\epsilon_{i_{k}}e}, (E.9)

if ikA0i_{k}\in A_{0}, then

p(yikβj,𝜸)=f0(βjhik(𝜸))=11+eβjhik(𝜸)<eβjhik(𝜸)<eβjϵik1βjϵike.p(y_{i_{k}}\mid\beta_{j},\bm{\gamma})=f_{0}\left(-\beta_{j}h_{i_{k}}(\bm{\gamma})\right)=\frac{1}{1+e^{-\beta_{j}h_{i_{k}}(\bm{\gamma})}}<e^{\beta_{j}h_{i_{k}}(\bm{\gamma})}<e^{-\beta_{j}\epsilon_{i_{k}}}\leq\frac{1}{\beta_{j}\epsilon_{i_{k}}e}. (E.10)

Here, the last inequality holds because ete1te^{-t}\leq\frac{e^{-1}}{t} for any t>0t>0. By (C.2), in a probit regression model, the inequalities (E.9) and (E.10) also hold.

Now we show that the positive term E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} has a finite upper bound.

E(βj𝐲)+\displaystyle E(\beta_{j}\mid\mathbf{y})^{+} =0βjp1p(𝐲βj,𝜸)p(βj,𝜸)𝑑𝜸𝑑βj\displaystyle=\int_{0}^{\infty}\beta_{j}\int_{\mathbb{R}^{p-1}}p(\mathbf{y}\mid\beta_{j},\bm{\gamma})p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
0βjk=1qB~ikp(𝐲βj,𝜸)p(βj,𝜸)𝑑𝜸𝑑βj\displaystyle\leq\int_{0}^{\infty}\beta_{j}\sum_{k=1}^{q}\int_{\tilde{B}_{i_{k}}}p(\mathbf{y}\mid\beta_{j},\bm{\gamma})p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
<0βjk=1qB~ikp(yikβj,𝜸)p(βj,𝜸)𝑑𝜸𝑑βj\displaystyle<\int_{0}^{\infty}\beta_{j}\sum_{k=1}^{q}\int_{\tilde{B}_{i_{k}}}p(y_{i_{k}}\mid\beta_{j},\bm{\gamma})p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
0βjk=1qB~ik1βjϵikep(βj,𝜸)𝑑𝜸𝑑βj\displaystyle\leq\int_{0}^{\infty}\beta_{j}\sum_{k=1}^{q}\int_{\tilde{B}_{i_{k}}}\frac{1}{\beta_{j}\epsilon_{i_{k}}e}p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
=k=1q1ϵike0B~ikp(βj,𝜸)𝑑𝜸𝑑βj\displaystyle=\sum_{k=1}^{q}\frac{1}{\epsilon_{i_{k}}e}\int_{0}^{\infty}\int_{\tilde{B}_{i_{k}}}p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
k=1q1ϵike0p1p(βj,𝜸)𝑑𝜸𝑑βj<k=1q1ϵike.\displaystyle\leq\sum_{k=1}^{q}\frac{1}{\epsilon_{i_{k}}e}\int_{0}^{\infty}\int_{\mathbb{R}^{p-1}}p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}<\sum_{k=1}^{q}\frac{1}{\epsilon_{i_{k}}e}.

Note that in Appendix E, the specific formula of the prior density of 𝜷\bm{\beta} is not used. Therefore, if there is no separation in logistic or probit regression, posterior means of all coefficients exist under all proper prior distributions.

F Proof of Theorem 3, in the case of complete separation

Proof.

Here we show that if there is complete separation, then none of the posterior means E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exist, for j=1,2,,pj=1,2,\ldots,p. Using the notation 𝐳i\mathbf{z}_{i}, defined in (E.2), we rewrite the set of all vectors satisfying the complete separation condition (2) as

𝒞=i=1n{𝜷p:𝐳iT𝜷>0}.\mathcal{C}=\bigcap_{i=1}^{n}\left\{\bm{\beta}\in\mathbb{R}^{p}:\mathbf{z}_{i}^{T}\bm{\beta}>0\right\}.

According to Albert and Anderson (1984), 𝒞\mathcal{C} is a convex cone; moreover, if 𝜷𝒞\bm{\beta}\in\mathcal{C}, then 𝜷+𝜹𝒞\bm{\beta}+\bm{\delta}\in\mathcal{C} for any 𝜹p\bm{\delta}\in\mathbb{R}^{p} that is small enough. Hence, the open set 𝒞\mathcal{C}, as a subset of the p\mathbb{R}^{p} Euclidean space, has positive Lebesgue measure.

To show that E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) does not exist, we argue as follows: if 𝒞\mathcal{C} projects onto the positive half of the βj\beta_{j} axis, we will show that E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} diverges; otherwise, we will show that E(βj𝐲)E(\beta_{j}\mid\mathbf{y})^{-} diverges (if 𝒞\mathcal{C} projects onto both halves, showing either is sufficient). Now we assume that the former is true, and denote the intersection

𝒞j+=def𝒞{𝜷p:βj>0},\mathcal{C}_{j}^{+}\stackrel{{\scriptstyle\text{def}}}{{=}}\mathcal{C}\cap\{\bm{\beta}\in\mathbb{R}^{p}:\beta_{j}>0\},

which is again an open convex cone. Since 𝒞j+\mathcal{C}_{j}^{+} has positive measure in p\mathbb{R}^{p}, under the change of variable from (βj,𝜷(j))(\beta_{j},\bm{\beta}_{(-j)}) to (βj,𝜸)(\beta_{j},\bm{\gamma}), where 𝜷(j)=βj𝜸\bm{\beta}_{(-j)}=\beta_{j}\bm{\gamma}, there exists an open set 𝒞~j+p1\tilde{\mathcal{C}}_{j}^{+}\in\mathbb{R}^{p-1} such that 𝒞j+\mathcal{C}_{j}^{+} can be written as

𝒞j+={(βj,βj𝜸):βj>0,𝜸𝒞~j+}.\mathcal{C}_{j}^{+}=\{(\beta_{j},\beta_{j}\bm{\gamma}):\beta_{j}>0,\bm{\gamma}\in\tilde{\mathcal{C}}_{j}^{+}\}.

Suppose that 𝜸\bm{\gamma} can be written as (γ1,,γj1,γj+1,,γp)T(\gamma_{1},\ldots,\gamma_{j-1},\gamma_{j+1},\ldots,\gamma_{p})^{T}. We define a variant of it by 𝜸~=def(γ1,,γj1,1,γj+1,,γp)T\tilde{\bm{\gamma}}\stackrel{{\scriptstyle\text{def}}}{{=}}(\gamma_{1},\ldots,\gamma_{j-1},1,\gamma_{j+1},\ldots,\gamma_{p})^{T}, such that 𝜷=βj𝜸~\bm{\beta}=\beta_{j}\tilde{\bm{\gamma}}. Under the multivariate Cauchy prior (8), the induced prior distribution of (βj,𝜸)(\beta_{j},\bm{\gamma}) is

p(βj,𝜸)\displaystyle p(\beta_{j},\bm{\gamma}) βjp1[1+(βj𝜸~𝝁)T𝚺1(βj𝜸~𝝁)]p+12\displaystyle\propto\frac{\beta_{j}^{p-1}}{\left[1+\left(\beta_{j}\tilde{\bm{\gamma}}-\bm{\mu}\right)^{T}\bm{\Sigma}^{-1}\left(\beta_{j}\tilde{\bm{\gamma}}-\bm{\mu}\right)\right]^{\frac{p+1}{2}}}
=βjp1[(𝜸~T𝚺1𝜸~)βj22(𝜸~T𝚺1𝝁)βj+(𝝁T𝚺1𝝁+1)]p+12.\displaystyle=\frac{\beta_{j}^{p-1}}{\left[\left(\tilde{\bm{\gamma}}^{T}\bm{\Sigma}^{-1}\tilde{\bm{\gamma}}\right)\beta_{j}^{2}-2\left(\tilde{\bm{\gamma}}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)\beta_{j}+\left(\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}+1\right)\right]^{\frac{p+1}{2}}}.

Inside 𝒞~j+\tilde{\mathcal{C}}_{j}^{+}, there must exist a closed rectangular box, denoted by 𝒟~j+={𝜸:γk[lk,uk],k=1,,j1,j+1,,p}𝒞~j+\tilde{\mathcal{D}}_{j}^{+}=\{\bm{\gamma}:\gamma_{k}\in[l_{k},u_{k}],k=1,\ldots,j-1,j+1,\ldots,p\}\subset\tilde{\mathcal{C}}_{j}^{+}. By Browder (1996, pp. 142, Corollary 6.57), a continuous function attains its maximum and minimum on a compact set. Since 𝒟~j+\tilde{\mathcal{D}}_{j}^{+} is a compact set (closed and bounded in p1\mathbb{R}^{p-1}),

a=defmax𝜸𝒟~j+𝜸~T𝚺1𝜸~,b=defmin𝜸𝒟~j+𝜸~T𝚺1𝝁a\stackrel{{\scriptstyle\text{def}}}{{=}}\max_{\bm{\gamma}\in\tilde{\mathcal{D}}_{j}^{+}}\tilde{\bm{\gamma}}^{T}\bm{\Sigma}^{-1}\tilde{\bm{\gamma}},\quad b\stackrel{{\scriptstyle\text{def}}}{{=}}\min_{\bm{\gamma}\in\tilde{\mathcal{D}}_{j}^{+}}\tilde{\bm{\gamma}}^{T}\bm{\Sigma}^{-1}\bm{\mu} (F.1)

both exist.

Recall that for all 𝜸𝒞~j+\bm{\gamma}\in\tilde{\mathcal{C}}_{j}^{+} (hence including all elements in 𝒟~j+\tilde{\mathcal{D}}_{j}^{+}), zi,j+𝐳i,(j)T𝜸>0z_{i,j}+\mathbf{z}_{i,(-j)}^{T}\bm{\gamma}>0; equivalently, if iA1i\in A_{1}, then xi,j+𝐱i,(j)T𝜸>0x_{i,j}+\mathbf{x}_{i,(-j)}^{T}\bm{\gamma}>0, and if iA0i\in A_{0}, then xi,j+𝐱i,(j)T𝜸<0x_{i,j}+\mathbf{x}_{i,(-j)}^{T}\bm{\gamma}<0. Now we show that in both logistic and probit regressions, the positive term E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} diverges.

E(βj𝐲)+=0βjp1p(βj,𝜸)p(𝐲βj,𝜸)𝑑𝜸𝑑βj\displaystyle E(\beta_{j}\mid\mathbf{y})^{+}=\int_{0}^{\infty}\beta_{j}\int_{\mathbb{R}^{p-1}}p(\beta_{j},\bm{\gamma})p(\mathbf{y}\mid\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
\displaystyle\geq 0βj𝒟~j+p(βj,𝜸)iA1f1(βj(xi,j+𝐱i,(j)T𝜸))kA0f0(βj(xk,j+𝐱k,(j)T𝜸))d𝜸dβj\displaystyle\int_{0}^{\infty}\beta_{j}\int_{\tilde{\mathcal{D}}_{j}^{+}}p(\beta_{j},\bm{\gamma})\prod_{i\in A_{1}}f_{1}(\beta_{j}(x_{i,j}+\mathbf{x}_{i,(-j)}^{T}\bm{\gamma}))\prod_{k\in A_{0}}f_{0}(\beta_{j}(x_{k,j}+\mathbf{x}_{k,(-j)}^{T}\bm{\gamma}))d\bm{\gamma}d\beta_{j}
\displaystyle\geq 0βj𝒟~j+p(βj,𝜸)iA1f1(0)kA0f0(0)d𝜸dβj\displaystyle\int_{0}^{\infty}\beta_{j}\int_{\tilde{\mathcal{D}}_{j}^{+}}p(\beta_{j},\bm{\gamma})\prod_{i\in A_{1}}f_{1}(0)\prod_{k\in A_{0}}f_{0}(0)d\bm{\gamma}d\beta_{j}
=\displaystyle= (12)n0βj𝒟~j+p(βj,𝜸)𝑑𝜸𝑑βj\displaystyle\left(\frac{1}{2}\right)^{n}\int_{0}^{\infty}\beta_{j}\int_{\tilde{\mathcal{D}}_{j}^{+}}p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
=\displaystyle= C0βj𝒟~j+βjp1[(𝜸~T𝚺1𝜸~)βj22(𝜸~T𝚺1𝝁)βj+(𝝁T𝚺1𝝁+1)]p+12𝑑𝜸𝑑βj\displaystyle~C\int_{0}^{\infty}\beta_{j}\int_{\tilde{\mathcal{D}}_{j}^{+}}\frac{\beta_{j}^{p-1}}{\left[\left(\tilde{\bm{\gamma}}^{T}\bm{\Sigma}^{-1}\tilde{\bm{\gamma}}\right)\beta_{j}^{2}-2\left(\tilde{\bm{\gamma}}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)\beta_{j}+\left(\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}+1\right)\right]^{\frac{p+1}{2}}}d\bm{\gamma}d\beta_{j}
\displaystyle\geq C0𝒟~j+βjp[aβj22bβj+(𝝁T𝚺1𝝁+1)]p+12𝑑𝜸𝑑βj\displaystyle~C\int_{0}^{\infty}\int_{\tilde{\mathcal{D}}_{j}^{+}}\frac{\beta_{j}^{p}}{\left[a\beta_{j}^{2}-2b\beta_{j}+\left(\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}+1\right)\right]^{\frac{p+1}{2}}}d\bm{\gamma}d\beta_{j}
=\displaystyle= C[k=1p1(uklk)]0βjp[a(βjb/a)2+c]p+12𝑑βj\displaystyle~C\left[\prod_{k=1}^{p-1}(u_{k}-l_{k})\right]\int_{0}^{\infty}\frac{\beta_{j}^{p}}{\left[a(\beta_{j}-b/a)^{2}+c\right]^{\frac{p+1}{2}}}d\beta_{j} (F.2)
=\displaystyle= ,\displaystyle~\infty,

where CC is a positive constant, aa and bb have been defined previously in (F.1), and cc is the constant produced by completing the square in βj\beta_{j}; the final equality follows from the tail comparison sketched below.
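For completeness, a minimal sketch of why the last integral in (F.2) is infinite, using only the definitions of aa and bb in (F.1): completing the square in βj\beta_{j} gives

a\beta_{j}^{2}-2b\beta_{j}+\left(\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}+1\right)=a\left(\beta_{j}-b/a\right)^{2}+c,\qquad c=\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}+1-\frac{b^{2}}{a},

and, since a>0a>0 (because \bm{\Sigma}^{-1} is positive definite and \tilde{\bm{\gamma}} is never the zero vector),

\frac{\beta_{j}^{p}}{\left[a\left(\beta_{j}-b/a\right)^{2}+c\right]^{\frac{p+1}{2}}}\sim\frac{1}{a^{\frac{p+1}{2}}\beta_{j}}\quad\text{as }\beta_{j}\to\infty,

so the integrand is eventually bounded below by a positive multiple of 1/βj1/\beta_{j}, and \int_{M}^{\infty}\beta_{j}^{-1}\,d\beta_{j}=\infty for any finite M>0M>0.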

On the other hand, if the set 𝒞\mathcal{C} of complete separation vectors projects only onto the negative half of the βj\beta_{j} axis, a similar derivation shows that E(βj𝐲)E(\beta_{j}\mid\mathbf{y})^{-} diverges to -\infty.

References

  • Albert and Anderson (1984) Albert, A. and Anderson, J. A. (1984). “On the Existence of Maximum Likelihood Estimates in Logistic Regression Models.” Biometrika, 71(1): 1–10.
  • Bardenet et al. (2014) Bardenet, R., Doucet, A., and Holmes, C. (2014). “Towards Scaling up Markov Chain Monte Carlo: An Adaptive Subsampling Approach.” Proceedings of the 31st International Conference on Machine Learning (ICML-14), 405–413.
  • Bertsimas and Tsitsiklis (1997) Bertsimas, D. and Tsitsiklis, J. N. (1997). Introduction to Linear Optimization. Athena Scientific, Belmont, MA.
  • Bickel and Doksum (2001) Bickel, P. J. and Doksum, K. A. (2001). Mathematical Statistics, volume I. Prentice Hall, Englewood Cliffs, NJ.
  • Blattenberger and Lad (1985) Blattenberger, G. and Lad, F. (1985). “Separating the Brier Score into Calibration and Refinement Components: A Graphical Exposition.” The American Statistician, 39(1): 26–32.
  • Brier (1950) Brier, G. W. (1950). “Verification of Forecasts Expressed in Terms of Probability.” Monthly Weather Review, 78: 1–3.
  • Browder (1996) Browder, A. (1996). Mathematical Analysis: An Introduction. Springer-Verlag.
  • Carpenter et al. (2016) Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2016). “Stan: A Probabilistic Programming Language.” Journal of Statistical Software, in press.
  • Casella and Berger (1990) Casella, G. and Berger, R. L. (1990). Statistical Inference. Duxbury Press.
  • Chen and Shao (2001) Chen, M.-H. and Shao, Q.-M. (2001). “Propriety of Posterior Distribution for Dichotomous Quantal Response Models.” Proceedings of the American Mathematical Society, 129(1): 293–302.
  • Choi and Hobert (2013) Choi, H. M. and Hobert, J. P. (2013). “The Polya-Gamma Gibbs Sampler for Bayesian Logistic Regression is Uniformly Ergodic.” Electronic Journal of Statistics, 7: 2054–2064.
  • Chopin and Ridgway (2015) Chopin, N. and Ridgway, J. (2015). “Leave Pima Indians Alone: Binary Regression as a Benchmark for Bayesian Computation.” arxiv.org.
  • Clogg et al. (1991) Clogg, C. C., Rubin, D. B., Schenker, N., Schultz, B., and Weidman, L. (1991). “Multiple Imputation of Industry and Occupation Codes in Census Public-Use Samples Using Bayesian Logistic Regression.” Journal of the American Statistical Association, 86(413): 68–78.
  • Dawid (1973) Dawid, A. P. (1973). “Posterior Expectations for Large Observations.” Biometrika, 60: 664–666.
  • Fernández and Steel (2000) Fernández, C. and Steel, M. F. (2000). “Bayesian Regression Analysis with Scale Mixtures of Normals.” Econometric Theory, 16: 80–101.
  • Firth (1993) Firth, D. (1993). “Bias Reduction of Maximum Likelihood Estimates.” Biometrika, 80(1): 27–38.
  • Fouskakis et al. (2009) Fouskakis, D., Ntzoufras, I., and Draper, D. (2009). “Bayesian Variable Selection Using Cost-Adjusted BIC, with Application to Cost-Effective Measurement of Quality of Health Care.” The Annals of Applied Statistics, 3(2): 663–690.
  • Gelman et al. (2008) Gelman, A., Jakulin, A., Pittau, M., and Su, Y. (2008). “A Weakly Informative Default Prior Distribution for Logistic and Other Regression Models.” The Annals of Applied Statistics, 2(4): 1360–1383.
  • Gelman et al. (2015) Gelman, A., Su, Y.-S., Yajima, M., Hill, J., Pittau, M. G., Kerman, J., Zheng, T., and Dorie, V. (2015). arm: Data Analysis Using Regression and Multilevel/Hierarchical Models. R package version 1.8-5.
    URL http://CRAN.R-project.org/package=arm
  • Ghosh and Clyde (2011) Ghosh, J. and Clyde, M. A. (2011). “Rao-Blackwellization for Bayesian Variable Selection and Model Averaging in Linear and Binary Regression: A Novel Data Augmentation Approach.” Journal of the American Statistical Association, 106(495): 1041–1052.
  • Ghosh et al. (2011) Ghosh, J., Herring, A. H., and Siega-Riz, A. M. (2011). “Bayesian Variable Selection for Latent Class Models.” Biometrics, 67: 917–925.
  • Ghosh and Reiter (2013) Ghosh, J. and Reiter, J. P. (2013). “Secure Bayesian Model Averaging for Horizontally Partitioned Data.” Statistics and Computing, 23: 311–322.
  • Gramacy and Polson (2012) Gramacy, R. B. and Polson, N. G. (2012). “Simulation-Based Regularized Logistic Regression.” Bayesian Analysis, 7(3): 567–590.
  • Hanson et al. (2014) Hanson, T. E., Branscum, A. J., and Johnson, W. O. (2014). “Informative g-Priors for Logistic Regression.” Bayesian Analysis, 9(3): 597–612.
  • Heinze (2006) Heinze, G. (2006). “A Comparative Investigation of Methods for Logistic Regression with Separated or Nearly Separated Data.” Statistics in Medicine, 25: 4216–4226.
  • Heinze and Schemper (2002) Heinze, G. and Schemper, M. (2002). “A Solution to the Problem of Separation in Logistic Regression.” Statistics in Medicine, 21: 2409–2419.
  • Hoffman and Gelman (2014) Hoffman, M. D. and Gelman, A. (2014). “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” The Journal of Machine Learning Research, 15(1): 1593–1623.
  • Holmes and Held (2006) Holmes, C. C. and Held, L. (2006). “Bayesian Auxiliary Variable Models for Binary and Multinomial Regression.” Bayesian Analysis, 1(1): 145–168.
  • Ibrahim and Laud (1991) Ibrahim, J. G. and Laud, P. W. (1991). “On Bayesian Analysis of Generalized Linear Models using Jeffreys’s Prior.” Journal of the American Statistical Association, 86(416): 981–986.
  • Jeffreys (1961) Jeffreys, H. (1961). Theory of Probability. Oxford Univ. Press.
  • Kurgan et al. (2001) Kurgan, L., Cios, K., Tadeusiewicz, R., Ogiela, M., and Goodenday, L. (2001). “Knowledge Discovery Approach to Automated Cardiac SPECT Diagnosis.” Artificial Intelligence in Medicine, 23(2): 149–169.
  • Li and Clyde (2015) Li, Y. and Clyde, M. A. (2015). “Mixtures of gg-Priors in Generalized Linear Models.” arxiv.org.
  • Liu (2004) Liu, C. (2004). “Robit Regression: A Simple Robust Alternative to Logistic and Probit Regression.” In Gelman, A. and Meng, X. (eds.), Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, 227–238. Wiley, London.
  • McCullagh and Nelder (1989) McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall.
  • Mitra and Dunson (2010) Mitra, R. and Dunson, D. B. (2010). “Two Level Stochastic Search Variable Selection in GLMs with Missing Predictors.” International Journal of Biostatistics, 6(1): Article 33.
  • Neal (2003) Neal, R. M. (2003). “Slice Sampling.” The Annals of Statistics, 31(3): 705–767.
  • Neal (2011) — (2011). “MCMC using Hamiltonian Dynamics.” In Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (eds.), Handbook of Markov Chain Monte Carlo. Chapman & Hall / CRC Press.
  • O’Brien and Dunson (2004) O’Brien, S. M. and Dunson, D. B. (2004). “Bayesian Multivariate Logistic Regression.” Biometrics, 60(3): 739–746.
  • Polson et al. (2013) Polson, N. G., Scott, J. G., and Windle, J. (2013). “Bayesian Inference for Logistic Models Using Pólya-Gamma Latent Variables.” Journal of the American Statistical Association, 108(504): 1339–1349.
  • Rousseeuw and Christmann (2003) Rousseeuw, P. J. and Christmann, A. (2003). “Robustness Against Separation and Outliers in Logistic Regression.” Computational Statistics and Data Analysis, 42: 315–332.
  • Sabanés Bové and Held (2011) Sabanés Bové, D. and Held, L. (2011). “Hyper-gg Priors for Generalized Linear Models.” Bayesian Analysis, 6(3): 387–410.
  • Speckman et al. (2009) Speckman, P. L., Lee, J., and Sun, D. (2009). “Existence of the MLE and Propriety of Posteriors for a General Multinomial Choice Model.” Statistica Sinica, 19: 731–748.
  • Yang and Berger (1996) Yang, R. and Berger, J. O. (1996). “A Catalog of Noninformative Priors.” Institute of Statistics and Decision Sciences, Duke University.
  • Zellner and Siow (1980) Zellner, A. and Siow, A. (1980). “Posterior Odds Ratios for Selected Regression Hypotheses.” In Bayesian Statistics: Proceedings of the First International Meeting Held in Valencia (Spain), 585–603. Valencia, Spain: University of Valencia Press.
  • Zorn (2005) Zorn, C. (2005). “A Solution to Separation in Binary Response Models.” Political Analysis, 13(2): 157–170.