
On the Use of Cauchy Prior Distributions for Bayesian Logistic Regression

Joyee Ghosh, The University of Iowa, Iowa City, IA. Email: joyee-ghosh@uiowa.edu    Yingbo Li, Clemson University, Clemson, SC. Email: ybli@clemson.edu. These authors contributed equally.    Robin Mitra, University of Southampton, Southampton, UK. Email: R.Mitra@soton.ac.uk
Abstract

In logistic regression, separation occurs when a linear combination of the predictors can perfectly classify part or all of the observations in the sample, and as a result, finite maximum likelihood estimates of the regression coefficients do not exist. Gelman et al. (2008) recommended independent Cauchy distributions as default priors for the regression coefficients in logistic regression, even in the case of separation, and reported posterior modes in their analyses. As the mean does not exist for the Cauchy prior, a natural question is whether the posterior means of the regression coefficients exist under separation. We prove theorems that provide necessary and sufficient conditions for the existence of posterior means under independent Cauchy priors for the logit link and a general family of link functions, including the probit link. We also study the existence of posterior means under multivariate Cauchy priors. For full Bayesian inference, we develop a Gibbs sampler based on Pólya-Gamma data augmentation to sample from the posterior distribution under independent Student-t priors, including Cauchy priors, and provide a companion R package in the supplement. We demonstrate empirically that even when the posterior means of the regression coefficients exist under separation, the magnitude of the posterior samples for Cauchy priors may be unusually large, and the corresponding Gibbs sampler shows extremely slow mixing. While alternative algorithms such as the No-U-Turn Sampler in Stan can greatly improve mixing, in order to resolve the issue of extremely heavy tailed posteriors for Cauchy priors under separation, one would need to consider lighter tailed priors such as normal priors or Student-t priors with degrees of freedom larger than one.

1 Introduction

In Bayesian linear regression, the choice of prior distribution for the regression coefficients is a key component of the analysis. Noninformative priors are convenient when the analyst does not have much prior information, but these prior distributions are often improper which can lead to improper posterior distributions in certain situations. Fernández and Steel (2000) investigated the propriety of the posterior distribution and the existence of posterior moments of regression and scale parameters for a linear regression model, with errors distributed as scale mixtures of normals, under the independence Jeffreys prior. For a design matrix of full column rank, they showed that posterior propriety holds under mild conditions on the sample size; however, the existence of posterior moments is affected by the design matrix and the mixing distribution. Further, there is not always a unique choice of noninformative prior (Yang and Berger 1996). On the other hand, proper prior distributions for the regression coefficients guarantee the propriety of posterior distributions. Among them, normal priors are commonly used in normal linear regression models, as conjugacy permits efficient posterior computation. The normal priors are informative because the prior mean and covariance can be specified to reflect the analyst’s prior information, and the posterior mean of the regression coefficients is the weighted average of the maximum likelihood estimator and the prior mean, with the weight on the latter decreasing as the prior variance increases.

A natural alternative to the normal prior is the Student-t prior distribution, which can be viewed as a scale mixture of normals. The Student-t prior has tails heavier than the normal prior, and hence is more appealing when weakly informative priors are desirable. The Student-t prior is considered robust, because when it is used for location parameters, outliers have vanishing influence on posterior distributions (Dawid 1973). The Cauchy distribution is a special case of the Student-t distribution with 1 degree of freedom. It has been recommended as a prior for normal mean parameters in point null hypothesis testing (Jeffreys 1961), because if the observations are overwhelmingly far from zero (the value of the mean specified under the point null hypothesis), the Bayes factor favoring the alternative hypothesis tends to infinity. Multivariate Cauchy priors have also been proposed for regression coefficients (Zellner and Siow 1980).

While the choice of prior distributions has been extensively studied for normal linear regression, there has been comparatively less work for generalized linear models. Propriety of the posterior distribution and the existence of posterior moments for binary response models under different noninformative prior choices have been considered (Ibrahim and Laud 1991; Chen and Shao 2001).

Regression models for binary response variables may suffer from a particular problem known as separation, which is the focus of this paper. For example, complete separation occurs if there exists a linear function of the covariates for which positive values of the function correspond to those units with response values of 1, while negative values of the function correspond to units with response values of 0. Formal definitions of separation (Albert and Anderson 1984), including complete separation and its closely related counterpart quasicomplete separation, are reviewed in Section 2. Separation is not a rare problem in practice, and has the potential to become increasingly common in the era of big data, with analysis often being made on data with a modest sample size but a large number of covariates. When separation is present in the data, Albert and Anderson (1984) showed that the maximum likelihood estimates (MLEs) of the regression coefficients do not exist (i.e., are infinite). Removing certain covariates from the regression model may appear to be an easy remedy for the problem of separation, but this ad-hoc strategy has been shown to often result in the removal of covariates with strong relationships with the response (Zorn 2005).

In the frequentist literature, various solutions based on penalized or modified likelihoods have been proposed to obtain finite parameter estimates (Firth 1993; Heinze and Schemper 2002; Heinze 2006; Rousseeuw and Christmann 2003). The problem has also been noted when fitting Bayesian logistic regression models (Clogg et al. 1991), where posterior inferences would be similarly affected by the problem of separation if using improper priors, with the possibility of improper posterior distributions (Speckman et al. 2009).

Gelman et al. (2008) recommended using independent Cauchy prior distributions as a default weakly informative choice for the regression coefficients in a logistic regression model, because these heavy tailed priors avoid over-shrinking large coefficients, but provide shrinkage (unlike improper uniform priors) that enables inferences even in the presence of complete separation. Gelman et al. (2008) developed an approximate EM algorithm to obtain the posterior mode of regression coefficients with Cauchy priors. While inferences based on the posterior mode are convenient, often other summaries of the posterior distribution are also of interest. For example, posterior means under Cauchy priors estimated via Monte Carlo and other approximations have been reported in Bardenet et al. (2014); Chopin and Ridgway (2015). It is well-known that the mean does not exist for the Cauchy distribution, so clearly the prior means of the regression coefficients do not exist. In the presence of separation, where the maximum likelihood estimates are not finite, it is not clear whether the posterior means will exist. To the best of our knowledge, there has been no investigation of the existence of the posterior mean under Cauchy priors, and our research fills this gap. We establish a necessary and sufficient condition under which independent Cauchy priors result in finite posterior means. In doing so we provide further theoretical underpinning of the approach recommended by Gelman et al. (2008), and additionally provide further insights on their suggestion of centering the covariates before fitting the regression model, which can have an impact on the existence of posterior means.

When the conditions for existence of the posterior mean are satisfied, we also empirically compare different prior choices (including the Cauchy prior) through various simulated and real data examples. In general, posterior computation for logistic regression is known to be more challenging than for probit regression. Several MCMC algorithms for logistic regression have been proposed (O’Brien and Dunson 2004; Holmes and Held 2006; Gramacy and Polson 2012), and the most recent Pólya-Gamma data augmentation scheme of Polson et al. (2013) has emerged as superior to the other methods. Thus we extend this Pólya-Gamma Gibbs sampler for normal priors to accommodate independent Student-t priors and provide an R package to implement the corresponding Gibbs sampler.

The remainder of this article is organized as follows. In Section 2 we derive the theoretical results: a necessary and sufficient condition for the existence of posterior means for coefficients under independent Cauchy priors in a logistic regression model in the presence of separation, and extend our investigation to binary regression models with other link functions such as the probit link, and multivariate Cauchy priors. In Section 3 we develop a Gibbs sampler for the logistic regression model under independent Student-t prior distributions (of which the Cauchy distribution is a special case) and briefly describe the NUTS algorithm of Hoffman and Gelman (2014) which forms the basis of the software Stan. In Section 4 we illustrate via simulated data that Cauchy priors may lead to coefficients of extremely large magnitude under separation, accompanied by slow mixing Gibbs samplers, compared to lighter tailed priors such as Student-t priors with 7 degrees of freedom (t_7) or normal priors. In Section 5 we compare Cauchy, t_7, and normal priors based on two real datasets, the SPECT data with quasicomplete separation and the Pima Indian Diabetes data without separation. Overall, Cauchy priors exhibit slow mixing under the Gibbs sampler compared to the other two priors. Although mixing can be improved by the NUTS algorithm in Stan, normal priors seem to be the most preferable in terms of producing more reasonable scales for posterior samples of the regression coefficients accompanied by competitive predictive performance, under separation. In Section 6 we conclude with a discussion and our recommendations.

2 Existence of Posterior Means Under Cauchy Priors

In this section, we begin with a review of the concepts of complete and quasicomplete separation proposed by Albert and Anderson (1984). Then based on a new concept of solitary separators, we introduce the main theoretical result of this paper, a necessary and sufficient condition for the existence of posterior means of regression coefficients under independent Cauchy priors in the case of separation. Finally, we extend our investigation to binary regression models with other link functions, and Cauchy priors with different scale parameter structures.

Let 𝐲=(y1,y2,,yn)T\mathbf{y}=(y_{1},y_{2},\ldots,y_{n})^{T} denote a vector of independent Bernoulli response variables with success probabilities π1,π2,,πn\pi_{1},\pi_{2},\ldots,\pi_{n}. For each of the observations, i=1,2,,ni=1,2,\ldots,n, let 𝐱i=(xi1,xi2,,xip)T\mathbf{x}_{i}=(x_{i1},x_{i2},\ldots,x_{ip})^{T} denote a vector of pp covariates, whose first component is assumed to accommodate the intercept, i.e., xi,1=1x_{i,1}=1. Let 𝐗\mathbf{X} denote the n×pn\times p design matrix with 𝐱iT\mathbf{x}_{i}^{T} as its iith row. We assume that the column rank of 𝐗\mathbf{X} is greater than 1. In this paper, we mainly focus on the logistic regression model, which is expressed as:

\log\left(\frac{\pi_{i}}{1-\pi_{i}}\right)=\mathbf{x}_{i}^{T}\bm{\beta},\quad i=1,2,\ldots,n, \qquad (1)

where 𝜷=(β1,β2,,βp)T\bm{\beta}=(\beta_{1},\beta_{2},\ldots,\beta_{p})^{T} is the vector of regression coefficients.

2.1 A Brief Review of Separation

We denote two disjoint subsets of sample points based on their response values: A0={i:yi=0}A_{0}=\{i:y_{i}=0\} and A1={i:yi=1}A_{1}=\{i:y_{i}=1\}. According to the definition of Albert and Anderson (1984), complete separation occurs in the sample if there exists a vector 𝜶=(α1,α2,,αp)T\bm{\alpha}=(\alpha_{1},\alpha_{2},\ldots,\alpha_{p})^{T}, such that for all i=1,2,,ni=1,2,\ldots,n,

\mathbf{x}_{i}^{T}\bm{\alpha}>0\ \text{ if }i\in A_{1},\quad\mathbf{x}_{i}^{T}\bm{\alpha}<0\ \text{ if }i\in A_{0}. \qquad (2)

Consider a simple example in which we wish to predict whether subjects in a study have a certain kind of infection based on model (1). Let y_{i}=1 if the ith subject is infected and 0 otherwise. The model includes an intercept (x_{i1}=1) and the other covariates are age (x_{i2}), gender (x_{i3}), and previous records of being infected (x_{i4}). Suppose in the sample, all infected subjects are older than 25 (x_{i2}>25), and all subjects who are not infected are younger than 25 (x_{i2}<25). This is an example of complete separation because (2) is satisfied for \bm{\alpha}=(-25,1,0,0)^{T}.
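As a quick numeric check, condition (2) can be verified directly. The following small R snippet uses made-up ages and covariate values (purely hypothetical, for illustration only):

# Hypothetical data for the infection example; with alpha = (-25, 1, 0, 0),
# x_i' alpha = age_i - 25 is positive for every infected subject (y = 1)
# and negative for every uninfected subject (y = 0).
y      <- c(1, 1, 1, 0, 0, 0)
age    <- c(30, 41, 57, 22, 19, 24)
gender <- c(0, 1, 0, 1, 0, 1)
prev   <- c(1, 0, 1, 0, 0, 1)
X      <- cbind(1, age, gender, prev)
alpha  <- c(-25, 1, 0, 0)
score  <- drop(X %*% alpha)
all(score[y == 1] > 0) && all(score[y == 0] < 0)   # TRUE: complete separation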

If the sample points cannot be completely separated, Albert and Anderson (1984) introduced another notion of separation called quasicomplete separation. There is quasicomplete separation in the sample if there exists a non-null vector 𝜶=(α1,α2,,αp)T\bm{\alpha}=(\alpha_{1},\alpha_{2},\ldots,\alpha_{p})^{T}, such that for all i=1,2,,ni=1,2,\ldots,n,

\mathbf{x}_{i}^{T}\bm{\alpha}\geq 0\ \text{ if }i\in A_{1},\quad\mathbf{x}_{i}^{T}\bm{\alpha}\leq 0\ \text{ if }i\in A_{0}, \qquad (3)

and equality holds for at least one i. Consider the set up of the previous example where the goal is to predict whether a person is infected or not. Suppose we have the same model but there is a slight modification in the dataset: all infected subjects are at least 25 years old (x_{i2}\geq 25), all uninfected subjects are no more than 25 years old (x_{i2}\leq 25), and there are two subjects aged exactly 25, one of whom is infected and the other is not. This is an example of quasicomplete separation because (3) is satisfied for \bm{\alpha}=(-25,1,0,0)^{T} and equality holds for the two observations with age exactly 25.

Let 𝒞\mathcal{C} and 𝒬\mathcal{Q} denote the set of all vectors 𝜶\bm{\alpha} that satisfy (2) and (3), respectively. For any 𝜶𝒞\bm{\alpha}\in\mathcal{C}, all sample points must satisfy (2), so 𝜶\bm{\alpha} cannot lead to quasicomplete separation which requires at least one equality in (3). This implies that 𝒞\mathcal{C} and 𝒬\mathcal{Q} are disjoint sets, while both can be non-empty for a certain dataset. Note that Albert and Anderson (1984) define quasicomplete separation only when the sample points cannot be separated using complete separation. Thus according to their definition, only one of 𝒞\mathcal{C} and 𝒬\mathcal{Q} can be non-empty for a certain dataset. However, in our slightly modified definition of quasicomplete separation, the absence of complete separation is not required. This permits both 𝒞\mathcal{C} and 𝒬\mathcal{Q} to be non-empty for a dataset. In the remainder of the paper, for simplicity we use the term “separation” to refer to either complete or quasicomplete separation, so that 𝒞𝒬\mathcal{C}\cup\mathcal{Q} is non-empty.

2.2 Existence of Posterior Means Under Independent Cauchy Priors

When Markov chain Monte Carlo (MCMC) is applied to sample from the posterior distribution, the posterior mean is a commonly used summary statistic. We aim to study whether the marginal posterior mean E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exists under the independent Cauchy priors suggested by Gelman et al. (2008). Let C(μ,σ)C(\mu,\sigma) denote a Cauchy distribution with location parameter μ\mu and scale parameter σ\sigma. The default prior suggested by Gelman et al. (2008) corresponds to βjindC(0,σj)\beta_{j}\stackrel{{\scriptstyle\text{ind}}}{{\sim}}C(0,\sigma_{j}), for j=1,2,,pj=1,2,\dots,p.

For a design matrix with full column rank, Albert and Anderson (1984) showed that a finite maximum likelihood estimate of 𝜷\bm{\beta} does not exist when there is separation in the data. However, even in the case of separation and/or a rank deficient design matrix, the posterior means for some or all βj\beta_{j}’s may exist because they incorporate the information from the prior distribution. Following Definition 2.2.1 of Casella and Berger (1990, pp. 55), we say E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exists if E(|βj|𝐲)<E(|\beta_{j}|\mid\mathbf{y})<\infty, and in this case, E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) is given by

E(\beta_{j}\mid\mathbf{y})=\int_{0}^{\infty}\beta_{j}~p(\beta_{j}\mid\mathbf{y})~d\beta_{j}+\int_{-\infty}^{0}\beta_{j}~p(\beta_{j}\mid\mathbf{y})~d\beta_{j}. \qquad (4)

Note that alternative definitions may require only one of the integrals in (4) to be finite for the mean to exist, e.g., Bickel and Doksum (2001, pp. 455). However, according to the definition used in this paper, both integrals in (4) have to be finite for the posterior mean to exist. Our main result shows that for each j=1,2,,pj=1,2,\ldots,p, the existence of E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) depends on whether the predictor 𝐗j\mathbf{X}_{j} is a solitary separator or not, which is defined as follows:

Definition 1.

The predictor 𝐗j\mathbf{X}_{j} is a solitary separator, if there exists an 𝛂(𝒞𝒬)\bm{\alpha}\in(\mathcal{C}\cup\mathcal{Q}) such that

\alpha_{j}\neq 0,\quad\alpha_{r}=0\text{ for all }r\neq j. \qquad (5)

This definition implies that for a solitary separator 𝐗j\mathbf{X}_{j}, if αj>0\alpha_{j}>0, then xi,j0x_{i,j}\geq 0 for all iA1i\in A_{1}, and xi,j0x_{i,j}\leq 0 for all iA0i\in A_{0}; if αj<0\alpha_{j}<0, then xi,j0x_{i,j}\leq 0 for all iA1i\in A_{1}, and xi,j0x_{i,j}\geq 0 for all iA0i\in A_{0}. Therefore, the hyperplane {𝐱p:xj=0}\{\mathbf{x}\in\mathbb{R}^{p}:x_{j}=0\} in the predictor space separates the data into two groups A1A_{1} and A0A_{0} (except for the points located on the hyperplane). The following theorem provides a necessary and sufficient condition for the existence of marginal posterior means of regression coefficients in a logistic regression model.

Theorem 1.

In a logistic regression model, suppose the regression coefficients (including the intercept) βjindC(0,σj)\beta_{j}\stackrel{{\scriptstyle ind}}{{\sim}}C(0,\sigma_{j}) with σj>0\sigma_{j}>0 for j=1,2,,pj=1,2,\dots,p, so that

p(\bm{\beta})=\prod_{j=1}^{p}p(\beta_{j})=\prod_{j=1}^{p}\frac{1}{\pi\sigma_{j}(1+\beta_{j}^{2}/\sigma_{j}^{2})}. \qquad (6)

Then for each j=1,2,,pj=1,2,\ldots,p, the posterior mean E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exists if and only if 𝐗j\mathbf{X}_{j} is not a solitary separator.

A proof of Theorem 1 is available in Appendices A and B.

Remark 1.

Theorem 1 implies that under independent Cauchy priors in logistic regression, the posterior means of all coefficients exist if there is no separation, or if there is separation with no solitary separators.

Remark 2.

Gelman et al. (2008) suggested centering all predictors (except interaction terms) in the pre-processing step. A consequence of Theorem 1 is that centering may have a crucial role in the existence of the posterior mean E(βj𝐲)E(\beta_{j}\mid\mathbf{y}).

We expand on the second remark with a toy example where a predictor is a solitary separator before centering but not after centering. Consider a dataset with n=100, \mathbf{y}=(\underbrace{0,\dots,0}_{25},\underbrace{1,\dots,1}_{75})^{T} and a binary predictor \mathbf{X}_{j}=(\underbrace{0,\dots,0}_{50},\underbrace{1,\dots,1}_{50})^{T}. Here \mathbf{X}_{j} is a solitary separator which leads to quasicomplete separation before centering. However, the centered predictor \mathbf{X}_{j}=(\underbrace{-0.5,\dots,-0.5}_{50},\underbrace{0.5,\dots,0.5}_{50})^{T} is no longer a solitary separator, because after centering the hyperplane \{\mathbf{x}:x_{j}=-0.5\} separates the data but \{\mathbf{x}:x_{j}=0\} does not. Consequently, the posterior mean E(\beta_{j}\mid\mathbf{y}) does not exist before centering but it exists after centering.
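To make this check concrete, the following minimal R sketch implements the condition implied by Definition 1 (a single column is a solitary separator exactly when its signs alone give complete or quasicomplete separation) and applies it to the toy example above:

# Test whether a single predictor column x is a solitary separator for y.
is_solitary_separator <- function(x, y) {
  a1 <- x[y == 1]; a0 <- x[y == 0]
  (all(a1 >= 0) && all(a0 <= 0)) || (all(a1 <= 0) && all(a0 >= 0))
}
y  <- c(rep(0, 25), rep(1, 75))
xj <- c(rep(0, 50), rep(1, 50))
is_solitary_separator(xj, y)              # TRUE: solitary separator before centering
is_solitary_separator(xj - mean(xj), y)   # FALSE: no longer one after centering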

2.3 Extensions of the Theoretical Result

So far we have mainly focused on the logistic regression model, which is one of the most widely used binary regression models because of the interpretability of its regression coefficients in terms of odds ratios. We now generalize Theorem 1 to binary regression models with link functions other than the logit. Following the definition in McCullagh and Nelder (1989, pp. 27), we assume that for i=1,2,,ni=1,2,\ldots,n, the linear predictor 𝐱iT𝜷\mathbf{x}_{i}^{T}\bm{\beta} and the success probability πi\pi_{i} are connected by a monotonic and differentiable link function g()g(\cdot) such that g(πi)=𝐱iT𝜷g(\pi_{i})=\mathbf{x}_{i}^{T}\bm{\beta}. We further assume that g(.)g(.) is a one-to-one function, which means that g(.)g(.) is strictly monotonic. This is satisfied by many commonly used link functions including the probit. Without loss of generality, we assume that g()g(\cdot) is a strictly increasing function.

Theorem 2.

In a binary regression model with link function g(.)g(.) described above, suppose the regression coefficients have independent Cauchy priors in (6). Then for each j=1,2,,pj=1,2,\ldots,p,

  • (1)

    a necessary condition for the existence of the posterior mean E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) is that 𝐗j\mathbf{X}_{j} is not a solitary separator;

  • (2)

    a sufficient condition for the existence of E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) consists of the following:

    • (i)

      𝐗j\mathbf{X}_{j} is not a solitary separator, and

    • (ii)

      ϵ>0\forall\epsilon>0,

      \int_{0}^{\infty}\beta_{j}p(\beta_{j})g^{-1}(-\epsilon\beta_{j})d\beta_{j}<\infty,\quad\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[1-g^{-1}(\epsilon\beta_{j})\right]d\beta_{j}<\infty. \qquad (7)

Note that (7) in the sufficient condition of Theorem 2 imposes constraints on the link function g(.)g(.), and hence the likelihood function. A proof of this theorem is given in Appendix C. Moreover, it is shown that condition (7) holds for the probit link function.
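To give a sense of why (7) holds for the probit link (the complete argument is in Appendix C), note that g^{-1}=\Phi, so both integrals in (7) equal \int_{0}^{\infty}\beta_{j}p(\beta_{j})\Phi(-\epsilon\beta_{j})d\beta_{j}. Splitting the range at \beta_{j}=1, bounding \Phi(-\epsilon\beta_{j})\leq 1 on (0,1], and using the standard normal tail bound \Phi(-t)\leq\phi(t)/t for t>0 together with p(\beta_{j})\leq 1/(\pi\sigma_{j}) gives the rough bound

\int_{0}^{\infty}\beta_{j}\,p(\beta_{j})\,\Phi(-\epsilon\beta_{j})\,d\beta_{j}\;\leq\;\int_{0}^{1}\beta_{j}\,p(\beta_{j})\,d\beta_{j}\;+\;\frac{1}{\pi\sigma_{j}\epsilon}\int_{1}^{\infty}\phi(\epsilon\beta_{j})\,d\beta_{j}\;<\;\infty.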

In certain applications, to incorporate available prior information, it may be desirable to use Cauchy priors with nonzero location parameters. The following corollary states that for both logistic and probit regression, the condition for existence of posterior means derived in Theorems 1 and 2 continues to hold under independent Cauchy priors with nonzero location parameters.

Corollary 1.

In logistic and probit regression models, suppose the regression coefficients βjindC(μj,σj)\beta_{j}\stackrel{{\scriptstyle\text{ind}}}{{\sim}}C(\mu_{j},\sigma_{j}), for j=1,2,,pj=1,2,\dots,p. Then a necessary and sufficient condition for the existence of the posterior mean E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) is that 𝐗j\mathbf{X}_{j} is not a solitary separator, for j=1,2,,pj=1,2,\dots,p.

A proof of Corollary 1 is available in Appendix D.

In some applications it could be more natural to allow the regression coefficients to be dependent, a priori. Thus in addition to independent Cauchy priors, we also study the existence of posterior means under a multivariate Cauchy prior, with the following density function:

p(\bm{\beta})=\frac{\Gamma\left(\frac{1+p}{2}\right)}{\Gamma\left(\frac{1}{2}\right)\pi^{\frac{p}{2}}|\bm{\Sigma}|^{\frac{1}{2}}\left[1+(\bm{\beta}-\bm{\mu})^{T}\bm{\Sigma}^{-1}(\bm{\beta}-\bm{\mu})\right]^{\frac{1+p}{2}}}, \qquad (8)

where \bm{\beta}\in\mathbb{R}^{p}, \bm{\mu} is a p\times 1 location parameter and \bm{\Sigma} is a p\times p positive-definite scale matrix. A special case of the multivariate Cauchy prior is the Zellner-Siow prior (Zellner and Siow 1980). It can be viewed as a scale mixture of g-priors, where conditional on g, \bm{\beta} has a multivariate normal prior with a covariance matrix proportional to g(\mathbf{X}^{T}\mathbf{X})^{-1}, and the hyperparameter g has an inverse gamma prior, \text{IG}(1/2,n/2). Based on generalizations of the g-prior to binary regression models (Fouskakis et al. 2009; Sabanés Bové and Held 2011; Hanson et al. 2014), the Zellner-Siow prior, which has density (8) with \bm{\Sigma}\propto n(\mathbf{X}^{T}\mathbf{X})^{-1}, can be a desirable objective prior as it preserves the covariance structure of the data and is free of tuning parameters.

Theorem 3.

In logistic and probit regression models, suppose the vector of regression coefficients 𝛃\bm{\beta} has a multivariate Cauchy prior as in (8). If there is no separation, then all posterior means E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exist, for j=1,2,,pj=1,2,\ldots,p. If there is complete separation, then none of the posterior means E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exist, for j=1,2,,pj=1,2,\ldots,p.

A proof of Theorem 3 is available in Appendices E and F. The study of existence of posterior means under multivariate Cauchy priors in the presence of quasicomplete separation has proved to be more challenging. We hope to study this problem in future work. Note that although under (8), the induced marginal prior of βj\beta_{j} is a univariate Cauchy distribution for each j=1,2,,pj=1,2,\ldots,p, the multivariate Cauchy prior is different from independent Cauchy priors, even with a diagonal scale matrix 𝚺=diag(σ12,σ22,,σp2)\bm{\Sigma}=\text{diag}(\sigma_{1}^{2},\sigma_{2}^{2},\ldots,\sigma_{p}^{2}). In fact, as a rotation invariant distribution, the multivariate Cauchy prior places less probability mass along axes than the independent Cauchy priors (see Figure 1). Therefore, it is not surprising that solitary separators no longer play an important role for existence of posterior means under multivariate Cauchy priors, as evident from Theorem 3.

Figure 1: Contour plots of log-density functions of independent Cauchy distributions with both scale parameters being 1 (left) and a bivariate Cauchy distribution with scale matrix \mathbf{I}_{2} (right). These plots suggest that independent Cauchy priors place more probability mass along the axes than a multivariate Cauchy prior, and thus impose stronger shrinkage. Hence, if complete separation occurs, E(\beta_{j}\mid\mathbf{y}) may exist under independent Cauchy priors for some or all j=1,2,\ldots,p (Theorem 1), but does not exist under a multivariate Cauchy prior (Theorem 3).

So far we have considered Cauchy priors, which are t distributions with 1 degree of freedom. We close this section with a remark on lighter tailed t priors (with degrees of freedom greater than 1) and normal priors, for which the prior means exist.

Remark 3.

In a binary regression model, suppose that the regression coefficients have independent Student-t priors with degrees of freedom greater than one, or independent normal priors. Then it is straightforward to show that the posterior means of the coefficients exist because the likelihood is bounded above by one and the prior means exist. The same result holds under multivariate t priors with degrees of freedom greater than one, and multivariate normal priors.
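For completeness, the bounding argument is one line: since the Bernoulli likelihood satisfies p(\mathbf{y}\mid\bm{\beta})\leq 1, and p(\mathbf{y})>0 because the likelihood is strictly positive and the prior is proper,

E(|\beta_{j}|\mid\mathbf{y})=\frac{\int|\beta_{j}|\,p(\mathbf{y}\mid\bm{\beta})\,p(\bm{\beta})\,d\bm{\beta}}{p(\mathbf{y})}\leq\frac{E(|\beta_{j}|)}{p(\mathbf{y})}<\infty,

where E(|\beta_{j}|) is finite for t priors with degrees of freedom greater than one and for normal priors.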

3 MCMC Sampling for Logistic Regression

In this section we discuss two algorithms for sampling from the posterior distribution of logistic regression coefficients under independent Student-t priors. We first develop a Gibbs sampler and then briefly describe the No-U-Turn Sampler (NUTS) implemented in the freely available software Stan (Carpenter et al. 2016).

3.1 Pólya-Gamma Data Augmentation Gibbs Sampler

Polson et al. (2013) showed that the likelihood for logistic regression can be written as a mixture of normals with respect to a Pólya-Gamma (PG) distribution. Based on this result, they developed an efficient Gibbs sampler for logistic regression with a multivariate normal prior on \bm{\beta}. Choi and Hobert (2013) showed that their Gibbs sampler is uniformly ergodic. This guarantees the existence of central limit theorems for Monte Carlo averages of functions of \bm{\beta} which are square integrable with respect to the posterior distribution p(\bm{\beta}\mid\mathbf{y}). Choi and Hobert (2013) developed a latent data model which also led to the Gibbs sampler of Polson et al. (2013). We adopt their latent data formulation to develop a Gibbs sampler for logistic regression with independent Student-t priors on \bm{\beta}.

Let U=(2/\pi^{2})\sum_{l=1}^{\infty}W_{l}/{(2l-1)}^{2}, where W_{1},W_{2},\dots is a sequence of i.i.d. Exponential random variables with rate parameters equal to 1. The density of U is given by

h(u)=\sum_{l=0}^{\infty}{(-1)}^{l}\frac{(2l+1)}{\sqrt{2\pi u^{3}}}e^{-\frac{(2l+1)^{2}}{8u}},\quad 0<u<\infty. \qquad (9)

Then for k\geq 0, the Pólya-Gamma (PG) distribution is constructed by exponential tilting of h(u) as follows:

p(u;k)=\cosh\left(\frac{k}{2}\right)e^{-\frac{k^{2}u}{2}}h(u),\quad 0<u<\infty. \qquad (10)

A random variable with density p(u;k) has a PG(1,k) distribution.
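For intuition, approximate PG(1,k) draws can be generated directly from the infinite Gamma-sum representation of the Pólya-Gamma family in Polson et al. (2013), which generalizes the series defining U above; the R sketch below truncates that sum and is meant only for illustration, since exact samplers (for instance, in the BayesLogit package) are used in practice:

# Approximate draw from PG(1, k): truncated version of the Gamma-sum
# representation (1 / (2 pi^2)) * sum_l g_l / ((l - 1/2)^2 + k^2 / (4 pi^2)),
# with g_l i.i.d. Exponential(1). Illustration only; not an exact sampler.
rpg1_approx <- function(k, trunc = 200) {
  l <- seq_len(trunc)
  g <- rexp(trunc)
  sum(g / ((l - 0.5)^2 + k^2 / (4 * pi^2))) / (2 * pi^2)
}
# With k = 0 this reduces to a truncation of the series defining U.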

Let t_{v}(0,\sigma_{j}) denote the Student-t distribution with v degrees of freedom, location parameter 0, and scale parameter \sigma_{j}. Since Student-t distributions can be expressed as inverse-gamma (IG) scale mixtures of normal distributions, for j=1,2,\ldots,p, we have:

\beta_{j}\sim t_{v}(0,\sigma_{j})\Longleftrightarrow\begin{cases}\beta_{j}\mid\gamma_{j}\sim\text{N}(0,\gamma_{j}),\\ \gamma_{j}\sim\text{IG}\left(\frac{v}{2},\frac{v\sigma_{j}^{2}}{2}\right).\end{cases}

Conditional on \bm{\beta} and \bm{\Gamma}=\text{diag}(\gamma_{1},\gamma_{2},\ldots,\gamma_{p}), let (y_{1},z_{1}),(y_{2},z_{2}),\ldots,(y_{n},z_{n}) be n independent random vectors such that y_{i} has a Bernoulli distribution with success probability \exp(\mathbf{x}_{i}^{T}\bm{\beta})/(1+\exp(\mathbf{x}_{i}^{T}\bm{\beta})), z_{i}\sim\text{PG}(1,|\mathbf{x}_{i}^{T}\bm{\beta}|), and y_{i} and z_{i} are independent, for i=1,2,\ldots,n. Let \bm{Z}_{D}=\text{diag}(z_{1},z_{2},\ldots,z_{n}); then the augmented posterior density is p(\bm{\beta},\bm{\Gamma},\bm{Z}_{D}\mid\mathbf{y}). We develop a Gibbs sampler with target distribution p(\bm{\beta},\bm{\Gamma},\bm{Z}_{D}\mid\mathbf{y}), which cycles through the following sequence of distributions iteratively:

  1.

    \bm{\beta}\mid\bm{\Gamma},\bm{Z}_{D},\mathbf{y}\sim\text{N}\left((\mathbf{X}^{T}\bm{Z}_{D}\mathbf{X}+\bm{\Gamma}^{-1})^{-1}\mathbf{X}^{T}\mathbf{\tilde{y}},\ (\mathbf{X}^{T}\bm{Z}_{D}\mathbf{X}+\bm{\Gamma}^{-1})^{-1}\right), where \tilde{y}_{i}=y_{i}-1/2 and \mathbf{\tilde{y}}=(\tilde{y}_{1},\tilde{y}_{2},\ldots,\tilde{y}_{n})^{T},

  2.

    \gamma_{j}\mid\bm{\beta},\bm{Z}_{D},\mathbf{y}\stackrel{\text{ind}}{\sim}\text{IG}\left(\frac{v+1}{2},\frac{\beta_{j}^{2}+v\sigma_{j}^{2}}{2}\right), for j=1,2,\ldots,p,

  3.

    z_{i}\mid\bm{\Gamma},\bm{\beta},\mathbf{y}\stackrel{\text{ind}}{\sim}\text{PG}(1,|\mathbf{x}_{i}^{T}\bm{\beta}|), for i=1,2,\ldots,n.

Steps 1 and 3 follow immediately from Choi and Hobert (2013) and Polson et al. (2013), and step 2 follows from straightforward algebra. In the next section, for comparison of posterior distributions under Student-t priors with different degrees of freedom, we implement the above Gibbs sampler, and for normal priors we apply the Gibbs sampler of Polson et al. (2013). Both Gibbs samplers can be implemented using the R package tglm, available in the supplement.
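For concreteness, a minimal R sketch of the three-step sampler above is given below. It is a simplified illustration rather than the tglm implementation, and it assumes that the rpg function of the BayesLogit package (with interface rpg(num, h, z)) is available for exact PG(1, k) draws.

# Gibbs sampler for logistic regression with independent t_v(0, sigma_j) priors,
# following steps 1-3 above; v = 1 gives the Cauchy priors of Gelman et al. (2008).
library(BayesLogit)   # assumed: provides rpg() for Polya-Gamma draws
library(MASS)         # for mvrnorm

gibbs_t_logistic <- function(y, X, v = 1,
                             sigma = c(10, rep(2.5, ncol(X) - 1)),
                             n_iter = 5000) {
  n <- nrow(X); p <- ncol(X)
  beta  <- rep(0, p)
  gamma <- sigma^2                       # latent scales in the t prior mixture
  draws <- matrix(NA_real_, n_iter, p)
  for (s in seq_len(n_iter)) {
    # Step 3: z_i | beta ~ PG(1, |x_i' beta|)
    z <- rpg(n, 1, abs(drop(X %*% beta)))
    # Step 1: beta | Gamma, Z_D, y ~ N(m, V), with V = (X' Z_D X + Gamma^{-1})^{-1}
    V <- chol2inv(chol(crossprod(X, z * X) + diag(1 / gamma, nrow = p)))
    m <- drop(V %*% crossprod(X, y - 0.5))
    beta <- drop(mvrnorm(1, m, V))
    # Step 2: gamma_j | beta ~ IG((v + 1)/2, (beta_j^2 + v sigma_j^2)/2)
    gamma <- 1 / rgamma(p, shape = (v + 1) / 2, rate = (beta^2 + v * sigma^2) / 2)
    draws[s, ] <- beta
  }
  draws
}

The default scales above (10 for the intercept and 2.5 for the other coefficients) match the prior specification used in Sections 4 and 5.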

3.2 Stan

Our empirical results in the next section suggest that the Gibbs sampler exhibits extremely slow mixing for posterior simulation under Cauchy priors for data with separation. Thus we consider alternative MCMC sampling algorithms in the hope of improving mixing. A random walk Metropolis algorithm shows some improvement over the Gibbs sampler in the p=2 case. However, it is not efficient for exploring higher dimensional spaces. This motivated us to use the software Stan (Carpenter et al. 2016), which implements the No-U-Turn Sampler (NUTS) of Hoffman and Gelman (2014), a tuning free extension of the Hamiltonian Monte Carlo (HMC) algorithm (Neal 2011).

It has been demonstrated that for continuous parameter spaces, HMC can improve over poorly mixing Gibbs samplers and random walk Metropolis algorithms. HMC is a Metropolis algorithm that generates proposals based on Hamiltonian dynamics, a concept borrowed from physics. In HMC, the parameter of interest is referred to as the “position” variable, representing a particle’s position in a p-dimensional space. A p-dimensional auxiliary parameter, the “momentum” variable, is introduced to represent the particle’s momentum. In each iteration, the momentum variable is generated from a Gaussian distribution, and then a proposal of the position-momentum pair is generated (approximately) along the trajectory of the Hamiltonian dynamics defined by the joint distribution of the position and momentum. Hamiltonian dynamics changing over time can be approximated by discretizing time via the “leapfrog” method. In practice, a proposal is generated by applying the leapfrog algorithm L times, with stepsize \epsilon, to the current state. The proposed state is accepted or rejected according to a Metropolis acceptance probability. Section 5.3.3 of the review paper by Neal (2011) illustrates the practical benefits of HMC over random walk Metropolis algorithms. The examples in that section demonstrate that the momentum variable may change only slowly along certain directions during leapfrog steps, permitting the position variable to move consistently in this direction for many steps. In this way, proposed states using Hamiltonian dynamics can be far away from current states but still achieve high acceptance probabilities, making HMC more efficient than traditional algorithms such as random walk Metropolis.

In spite of its advantages, HMC has not been very widely used in the statistics community until recently, because its performance can be sensitive to the choice of two tuning parameters: the leapfrog stepsize \epsilon and the number of leapfrog steps L. Very small \epsilon can lead to waste of computational power, whereas large \epsilon can yield large errors due to discretization. Regarding the number of leapfrog steps L, if it is too small, proposed states can be near current states and thus resemble a random walk. On the other hand, if L is too large, the Hamiltonian trajectory can retrace its path so that the proposal is brought closer to the current value, which again is a waste of computational power.
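To make the roles of \epsilon and L concrete, a single leapfrog update (with an identity mass matrix) can be sketched in R as follows; grad_log_post is a placeholder for a user-supplied gradient of the log posterior:

# One leapfrog step of size eps; an HMC proposal applies this L times and
# then accepts or rejects with a Metropolis correction on the joint density.
leapfrog_step <- function(beta, momentum, eps, grad_log_post) {
  momentum <- momentum + 0.5 * eps * grad_log_post(beta)   # half step for momentum
  beta     <- beta + eps * momentum                        # full step for position
  momentum <- momentum + 0.5 * eps * grad_log_post(beta)   # half step for momentum
  list(beta = beta, momentum = momentum)
}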

The NUTS algorithm tunes these two parameters automatically. To select L, the main idea is to run the leapfrog steps until the trajectory starts to retrace its path. More specifically, NUTS builds a binary tree based on a recursive doubling procedure that is similar in flavor to the doubling procedure used for slice sampling by Neal (2003), with nodes of the tree representing position-momentum pairs visited by the leapfrog steps along the path. The doubling procedure is stopped if the trajectory starts retracing its path, that is, making a “U-turn”, or if there is a large simulation error accumulated due to many steps of leapfrog discretization. NUTS consists of a carefully constructed transition kernel that leaves the target joint distribution invariant. It also proposes a way for adaptive tuning of the stepsize \epsilon.

We find that by implementing this tuning free NUTS algorithm, available in the freely available software Stan, substantially better mixing than the Gibbs sampler can be achieved in all of our examples in which posterior means exist. We still include the Gibbs sampler in this article for two main reasons. First, it illustrates that Stan can provide an incredible improvement in mixing over the Gibbs sampler in certain cases. Stan requires minimal coding effort, much less than developing a Gibbs sampler, which may be useful information for readers who are not yet familiar with Stan. Second, Stan currently works for continuous target distributions only, but discrete distributions for models and mixed distributions for regression coefficients frequently arise in Bayesian variable selection, for regression models with binary or categorical response variables (Holmes and Held 2006; Mitra and Dunson 2010; Ghosh and Clyde 2011; Ghosh et al. 2011; Ghosh and Reiter 2013; Li and Clyde 2015). Unlike HMC algorithms, Gibbs samplers can typically be extended via data augmentation to incorporate mixtures of a point mass and a continuous distribution, as priors for the regression coefficients, without much additional effort.

4 Simulated Data

In this section, we use two simulation examples to empirically demonstrate that under independent Cauchy priors, the aforementioned MCMC algorithm for logistic regression may suffer from extremely slow mixing in the presence of separation in the dataset.

For each simulation scenario, we first standardize the predictors following the recommendation of Gelman et al. (2008). Binary predictors (with 0/1 denoting the two categories) are centered to have mean 0, and other predictors are centered and scaled to have mean 0 and standard deviation 0.5. Their rationale is that such standardizing makes the scale of a continuous predictor comparable to that of a symmetric binary predictor, in the sense that they have the same sample mean and sample standard deviation. Gelman et al. (2008) made a distinction between input variables and predictors, and they suggested standardizing the input variables only. For example, temperature and humidity may be input variables as well as predictors in a model; however, their interaction term is a predictor but not an input variable. In our examples, except for the constant term for the intercept, all other predictors are input variables and standardized appropriately.
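A small R sketch of this standardization step is given below; it treats every non-constant column as an input variable, which matches the examples in this section but would need adjustment when interaction terms are present:

# Standardization of Gelman et al. (2008): binary inputs are centered,
# other inputs are centered and scaled to have standard deviation 0.5;
# the constant column for the intercept is left unchanged.
standardize_inputs <- function(X) {
  for (j in seq_len(ncol(X))) {
    x <- X[, j]
    if (all(x == 1)) next                        # intercept column
    if (all(x %in% c(0, 1))) {
      X[, j] <- x - mean(x)                      # binary: center only
    } else {
      X[, j] <- 0.5 * (x - mean(x)) / sd(x)      # continuous: mean 0, sd 0.5
    }
  }
  X
}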

We compare the posterior distributions under independent i) Cauchy, i.e., Student-t with 1 degree of freedom, ii) Student-t with 7 degrees of freedom (t_7), and iii) normal priors for the regression coefficients. In binary regression models, while the inverse cumulative distribution function (CDF) of the logistic distribution yields the logit link function, the inverse CDF of the Student-t distribution yields the robit link function. Liu (2004) showed that the logistic link can be well approximated by a robit link with 7 degrees of freedom. So a t_7 prior approximately matches the tail heaviness of the logistic likelihood underlying logistic regression. For Cauchy priors we use the default choice recommended by Gelman et al. (2008): all location parameters are set to 0 and scale parameters are set to 10 and 2.5 for the intercept and other coefficients, respectively. To be consistent we use the same location and scale parameters for the other two priors. Gelman et al. (2008) adopted a similar strategy in one of their analyses, to study the effect of tail heaviness of the priors. Among the priors considered here, the normal prior has the lightest tails, the Cauchy prior the heaviest, and the t_7 prior offers a compromise between the two extremes. For each simulated dataset, we run both the Gibbs sampler developed in Section 3.1 and Stan, for 1,000,000 iterations after a burn-in of 100,000 samples, under each of the three priors.

4.1 Complete Separation with a Solitary Separator

First, we generate a dataset with p=2 (including the intercept) and n=30. The continuous predictor \mathbf{X}_{2} is chosen to be a solitary separator (after standardizing), which leads to complete separation, whereas the constant term \mathbf{X}_{1} contains all ones and is not a solitary separator. A plot of \mathbf{y} versus \mathbf{X}_{2} in Figure 2 demonstrates this graphically. So by Theorem 1, under independent Cauchy priors, E(\beta_{1}\mid\mathbf{y}) exists but E(\beta_{2}\mid\mathbf{y}) does not.

Figure 2: Scatter plot of \mathbf{y} versus \mathbf{X}_{2} in the first simulated dataset, where \mathbf{X}_{2} is a solitary separator which completely separates the samples (the vertical line at zero separates the points corresponding to y=1 and y=0).

The results from the Gibbs sampler are reported in Figures 3 and 4. Figure 3 shows the posterior samples of 𝜷\bm{\beta} under the different priors. The scale of β2\beta_{2}, the coefficient corresponding to the solitary separator 𝐗2\mathbf{X}_{2}, is extremely large under Cauchy priors, less so under t7t_{7} priors, and the smallest under normal priors. In particular, under Cauchy priors, the posterior distribution of β2\beta_{2} seems to have an extremely long right tail. Moreover, although 𝐗1\mathbf{X}_{1} is not a solitary separator, under Cauchy priors, the posterior samples of β1\beta_{1} have a much larger spread. Figure 4 shows that the running means of both β1\beta_{1} and β2\beta_{2} converge rapidly under normal and t7t_{7} priors, whereas under Cauchy priors, the running mean of β1\beta_{1} does not converge after a million iterations and that of β2\beta_{2} clearly diverges. We also ran Stan for this example but do not report the results here, because it gave warning messages about divergent transitions for Cauchy priors, after the burn-in period. Given that the posterior mean of β2\beta_{2} does not exist in this case, the lack of convergence is not surprising.

Figure 3: Scatter plots (top) and box plots (bottom) of posterior samples of \beta_{1} and \beta_{2} from the Gibbs sampler, under independent Cauchy, t_7, and normal priors for the first simulated dataset.
Figure 4: Plots of running means of \beta_{1} (top) and \beta_{2} (bottom) sampled from the posterior distributions via the Gibbs sampler, under independent Cauchy, t_7, and normal priors for the first simulated dataset. Here E(\beta_{1}\mid\mathbf{y}) exists under independent Cauchy priors but E(\beta_{2}\mid\mathbf{y}) does not.

4.2 Complete Separation Without Solitary Separators

Now we generate a new dataset with p=2 and n=30 such that there is complete separation but there are no solitary separators (see Figure 5). This guarantees the existence of both E(\beta_{1}\mid\mathbf{y}) and E(\beta_{2}\mid\mathbf{y}) under independent Cauchy priors. The difference in the existence of E(\beta_{2}\mid\mathbf{y}) for the two simulated datasets is reflected by the posterior samples from the Gibbs sampler: under Cauchy priors, the samples of \beta_{2} in Figure 1 in the Appendix are more stable than those in Figure 3 in the manuscript. However, when comparing across prior distributions, we find that under Cauchy priors the posterior samples of neither \beta_{1} nor \beta_{2} are as stable as those under t_7 and normal priors, which is not surprising because among the three priors, Cauchy priors have the heaviest tails and thus yield the least shrinkage. Figure 2 in the Appendix shows that the convergence of the running means under Cauchy priors is slow. Although we have not verified the existence of the second or higher order posterior moments under Cauchy priors, for exploratory purposes we examine sample autocorrelation plots of the draws from the Gibbs sampler. Figure 6 shows that the autocorrelation decays extremely slowly for Cauchy priors, reasonably fast for t_7 priors, and rapidly for normal priors.

Some results from Stan are reported in Figures 3 and 4 in the Appendix. Figure 3 in the Appendix shows posterior distributions with nearly identical shapes as those obtained using Gibbs sampling in Figure 1 in the Appendix, with the only difference being that more extreme values appear under Stan. This is most likely due to faster mixing in Stan. As Stan traverses the parameter space more rapidly, values in the tails appear more quickly than under the Gibbs sampler. Figures 2 and 4 in the Appendix demonstrate that running means based on Stan are in good agreement with those based on the Gibbs sampler.

The autocorrelation plots for Stan in Figure 7 demonstrate a remarkable improvement over those for Gibbs in Figure 6 for all priors, and the difference in mixing is the most prominent for Cauchy priors.

To summarize, all the plots unequivocally suggest that Cauchy priors lead to an extremely slow mixing Gibbs sampler and unusually large scales for the regression coefficients, even when all the marginal posterior means are guaranteed to exist. While mixing can be improved tremendously with Stan, the heavy tailed posteriors under Stan are in agreement with those obtained from the Gibbs samplers. One may argue that in spite of the unnaturally large regression coefficients, Cauchy priors could lead to superior predictions. Thus in the next two sections we compare predictions based on posteriors under the three priors for two real datasets. As Stan generates nearly independent samples, we use Stan for MCMC simulations for the real datasets.

Figure 5: Scatter plot of \mathbf{y} versus \mathbf{X}_{2} for the second simulated dataset. The solid vertical line at -0.3 demonstrates complete separation of the samples. However, \mathbf{X}_{2} is not a solitary separator, because the dashed vertical line at zero does not separate the points corresponding to y=1 and y=0. The other predictor \mathbf{X}_{1} is a vector of ones corresponding to the intercept, which is not a solitary separator, either.
Figure 6: Autocorrelation plots of the posterior samples of \beta_{1} (top) and \beta_{2} (bottom) from the Gibbs sampler, under independent Cauchy, t_7, and normal priors for the second simulated dataset.
Figure 7: Autocorrelation plots of the posterior samples of \beta_{1} (top) and \beta_{2} (bottom) from Stan, under independent Cauchy, t_7, and normal priors for the second simulated dataset.

5 Real Data

5.1 SPECT Dataset

The “SPECT” dataset (Kurgan et al. 2001) is available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/SPECT+Heart). The binary response variable is whether a patient’s cardiac image is normal or abnormal, according to the diagnosis of cardiologists. The predictors are 22 binary features obtained from the cardiac images using a machine learning algorithm. The goal of the study is to determine if the predictors can correctly predict the diagnoses of cardiologists, so that the process could be automated to some extent.

Prior to centering, two of the binary predictors are solitary quasicomplete separators: xi,j=0x_{i,j}=0 iA0\forall i\in A_{0} and xi,j0x_{i,j}\geq 0 iA1\forall i\in A_{1}, for j=18,19j=18,19, with 𝐗1\mathbf{X}_{1} denoting the column of ones. Ghosh and Reiter (2013) analyzed this dataset with a Bayesian probit regression model which incorporated variable selection. As some of their proposed methods relied on an approximation of the marginal likelihood based on the MLE of 𝜷\bm{\beta}, they had to drop these potentially important predictors from the analysis. If one analyzed the dataset with the uncentered predictors, by Theorem 1, the posterior means E(β18𝐲)E(\beta_{18}\mid\mathbf{y}) and E(β19𝐲)E(\beta_{19}\mid\mathbf{y}) would not exist under independent Cauchy priors. However, after centering there are no solitary separators, so the posterior means of all coefficients exist.

The SPECT dataset is split into a training set of 80 observations and a test set of 187 observations by Kurgan et al. (2001). We use the former for model fitting and the latter for prediction. First, for each of the three priors (Cauchy, t7t_{7}, and normal), we run Stan on the training dataset, for 1,000,000 iterations after discarding 100,000 samples as burn-in.

As in the simulation study, MCMC draws from Stan show excellent mixing for all priors. However, the posterior means of the regression coefficients involved in separation are rather large under Cauchy priors compared to the other priors. For example, the posterior means of (\beta_{18},\beta_{19}) under Cauchy, t_7, and normal priors are (10.02, 5.57), (3.24, 1.68), and (2.73, 1.43), respectively. These results suggest that Cauchy priors are too diffuse for datasets with separation.

Next, for each i=1,2,\dots,n_{\text{test}} in the test set, we estimate the corresponding success probability \pi_{i} by the Monte Carlo average:

\widehat{\pi}_{i}^{\text{MC}}=\frac{1}{S}\sum_{s=1}^{S}\frac{e^{\mathbf{x}_{i}^{T}\bm{\beta}^{(s)}}}{1+e^{\mathbf{x}_{i}^{T}\bm{\beta}^{(s)}}}, \qquad (11)

where \bm{\beta}^{(s)} is the sampled value of \bm{\beta} in iteration s, after burn-in. Recall that here n_{\text{test}}=187 and S=10^{6}. We calculate two different types of summary measures to assess predictive performance. We classify the ith observation in the test set as a success if \widehat{\pi}_{i}^{\text{MC}}\geq 0.5, and as a failure otherwise, and compute the misclassification rates. Note that the misclassification rate does not fully take into account the magnitude of \widehat{\pi}_{i}^{\text{MC}}. For example, if y_{i}=1, both \widehat{\pi}_{i}^{\text{MC}}=0.5 and \widehat{\pi}_{i}^{\text{MC}}=0.9 would correctly classify the observation, while the latter is preferable. So we also consider the average squared difference between y_{i} and \widehat{\pi}_{i}^{\text{MC}}:

MSE^{\text{MC}}=\frac{1}{n_{\text{test}}}\sum_{i=1}^{n_{\text{test}}}{(\widehat{\pi}_{i}^{\text{MC}}-y_{i})}^{2}, \qquad (12)

which is always between 0 and 1, with a value closer to 0 being more preferable. Note that the Brier score (Brier 1950) equals 2\,MSE^{\text{MC}}, according to its original definition. Since in some modified definitions (Blattenberger and Lad 1985) it is the same as MSE^{\text{MC}}, we refer to MSE^{\text{MC}} as the Brier score.
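As a concrete sketch, the quantities in (11) and (12) can be computed from the MCMC output in R as follows, assuming draws is an S x p matrix of post-burn-in samples of \bm{\beta} and X_test, y_test hold the test data (hypothetical object names):

# Monte Carlo estimate of the success probabilities (11), then the
# misclassification rate and the Brier score MSE^MC in (12).
pi_hat_mc <- function(X_test, draws) {
  eta <- X_test %*% t(draws)        # n_test x S matrix of linear predictors
  rowMeans(plogis(eta))             # average the logistic probabilities over draws
}
p_mc     <- pi_hat_mc(X_test, draws)
misclass <- mean((p_mc >= 0.5) != y_test)   # classify as success if p_mc >= 0.5
brier    <- mean((p_mc - y_test)^2)         # MSE^MC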

Table 1: Misclassification rates based on \widehat{\pi}_{i}^{\text{MC}} and \widehat{\pi}_{i}^{\text{EM}}, under Cauchy, t_7, and normal priors for the SPECT data. Small values are preferable.
Cauchy t_7 normal
MCMC 0.273 0.257 0.251
EM 0.278 0.262 0.262
Table 2: Brier scores MSE^{\text{MC}} and MSE^{\text{EM}}, under Cauchy, t_7, and normal priors for the SPECT data. Small values are preferable.
Cauchy t_7 normal
MCMC 0.172 0.165 0.163
EM 0.179 0.178 0.178

To compare the Monte Carlo estimates with those based on the EM algorithm of Gelman et al. (2008), we also estimate the posterior mode, denoted by \widetilde{\bm{\beta}}, under identical priors and hyperparameters, using the R package arm (Gelman et al. 2015). The EM estimator of \pi_{i} is given by:

\widehat{\pi}_{i}^{\text{EM}}=\frac{e^{\mathbf{x}_{i}^{T}\widetilde{\bm{\beta}}}}{1+e^{\mathbf{x}_{i}^{T}\widetilde{\bm{\beta}}}}, \qquad (13)

and MSE^{\text{EM}} is calculated by replacing \widehat{\pi}_{i}^{\text{MC}} with \widehat{\pi}_{i}^{\text{EM}} in (12).
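For reference, a minimal sketch of this EM-based fit with the arm package is shown below; train and test are assumed data frames containing the response y and the standardized predictors, and the prior arguments shown (which we believe correspond to Cauchy priors with scale 2.5 on the slopes and 10 on the intercept) should be checked against the package documentation:

# Posterior mode via the approximate EM algorithm of Gelman et al. (2008),
# as implemented in arm::bayesglm, followed by the plug-in estimates in (13).
library(arm)
fit_em <- bayesglm(y ~ ., data = train, family = binomial(link = "logit"),
                   prior.df = 1, prior.scale = 2.5,
                   prior.df.for.intercept = 1, prior.scale.for.intercept = 10)
pi_em  <- predict(fit_em, newdata = test, type = "response")   # (13)
mse_em <- mean((pi_em - test$y)^2)                              # MSE^EM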

We report the misclassification rates in Table 1 and the Brier scores in Table 2. MCMC achieves somewhat smaller misclassification rates and Brier scores than EM, especially under t_7 and normal priors. This suggests that a full Bayesian analysis using MCMC may produce estimates that are closer to the truth than modal estimates based on the EM algorithm. The predictions are similar across the three prior distributions, with the normal and t_7 priors yielding slightly more accurate results than Cauchy priors.

5.2 Pima Indians Diabetes Dataset

We now analyze the “Pima Indians Diabetes” dataset in the R package MASS. This is a classic dataset without separation that has been analyzed by many authors in the past. Using this dataset we aim to compare predictions under different priors when there is no separation. Using the training data provided in the package we predict the class labels of the test data. In this case the difference between the priors is practically nil. The Brier scores are the same up to three decimal places, across all priors and all methods (EM and MCMC). The misclassification rates reported in Table 3 also show negligible differences between priors and methods. Here Cauchy priors have a slightly better misclassification rate compared to normal and t_7 priors, and MCMC provides slightly more accurate results compared to those obtained from EM. These results suggest that when there is no separation and maximum likelihood estimates exist, Cauchy priors may be preferable as default weakly informative priors in the absence of real prior information.

Table 3: Misclassification rates based on \widehat{\pi}_{i}^{\text{MC}} and \widehat{\pi}_{i}^{\text{EM}}, under Cauchy, t_7, and normal priors for the Pima Indians data. Small values are preferable.
Cauchy t_7 normal
MCMC 0.196 0.199 0.199
EM 0.202 0.202 0.202

6 Discussion

We have proved that posterior means of regression coefficients in logistic regression are not always guaranteed to exist under the independent Cauchy priors recommended by Gelman et al. (2008), if there is complete or quasicomplete separation in the data. In particular, we have introduced the notion of a solitary separator, which is a predictor capable of separating the samples on its own. Note that a solitary separator needs to be able to separate without the aid of any other predictor, not even the constant term corresponding to the intercept. We have proved that for independent Cauchy priors, the absence of solitary separators is a necessary condition for the existence of posterior means of all coefficients, for a general family of link functions in binary regression models. For logistic and probit regression, this has been shown to be a sufficient condition as well. In general, the sufficient condition depends on the form of the link function. We have also studied multivariate Cauchy priors, where the solitary separator no longer plays an important role. Instead, posterior means of all coefficients exist if there is no separation, while none of them exist if there is complete separation. The result under quasicomplete separation is still unclear and will be studied in future work.

In practice, after centering the input variables it is straightforward to check if there are solitary separators in the dataset. The absence of solitary separators guarantees the existence of posterior means of all regression coefficients in logistic regression under independent Cauchy priors. However, our empirical results have shown that even when the posterior means for Cauchy priors exist under separation, the posterior samples of the regression coefficients may be extremely large in magnitude. Separation is usually considered to be a sample phenomenon, so even if the predictors involved in separation are potentially important, some shrinkage of their coefficients is desirable through the prior. Our empirical results based on real datasets have demonstrated that the default Cauchy priors can lead to posterior means as large as 10, which is considered to be unusually large on the logit scale. Our impression is that Cauchy priors are good default choices in general because they contain weak prior information and let the data speak. However, under separation, when there is little information in the data about the logistic regression coefficients (the MLE is not finite), it seems that lighter tailed priors, such as Student-t priors with larger degrees of freedom or even normal priors, are more desirable in terms of producing more plausible posterior distributions.

From a computational perspective, we have observed very slow convergence of the Gibbs sampler under Cauchy priors in the presence of separation. Note that if the design matrix is not of full column rank, for example when p>np>n, the pp columns of 𝐗\mathbf{X} will be linearly dependent. This implies that the condition for quasicomplete separation (3) will be satisfied with equality for all observations. Empirical results (not reported here for brevity) demonstrated that independent Cauchy priors again lead to slower convergence of the Gibbs sampler in this case, compared to other lighter tailed priors. Out-of-sample predictive performance based on a real dataset with separation did not show the default Cauchy priors to be superior to t7t_{7} or normal priors.

In logistic regression, under a multivariate normal prior for 𝜷\bm{\beta}, Choi and Hobert (2013) showed that the Pólya-Gamma data augmentation Gibbs sampler of Polson et al. (2013) is uniformly ergodic, and the moment generating function of the posterior distribution p(𝜷𝐲)p(\bm{\beta}\mid\mathbf{y}) exists for all 𝐗,𝐲\mathbf{X},\mathbf{y}. In our examples of datasets with separation, the normal priors led to the fastest convergence of the Gibbs sampler, reasonable scales for the posterior draws of 𝜷\bm{\beta}, and comparable or even better predictive performance than other priors. The results from Stan show no problem in mixing under any of the priors. However, the problematic issue of posteriors with extremely heavy tails under Cauchy priors cannot be resolved without altering the prior. Thus, after taking into account all the above considerations, for a full Bayesian analysis we recommend the use of normal priors as a default, when there is separation. Alternatively, heavier tailed priors such as the t7t_{7} could also be used if robustness is a concern. On the other hand, if the goal of the analysis is to obtain point estimates rather than the entire posterior distribution, the posterior mode obtained from the EM algorithm of Gelman et al. (2015) under default Cauchy priors (Gelman et al. 2008) is a fast viable alternative.

Supplementary Material

In the supplementary material, we present additional simulation results for logistic and probit regression with complete separation, along with an appendix that contains the proofs of all theoretical results. The Gibbs sampler developed in the paper can be implemented with the R package tglm, available from the website: https://cran.r-project.org/web/packages/tglm/index.html.

Acknowledgement

The authors thank the Editor-in-Chief, Editor, Associate Editor and the reviewer for suggestions that led to a greatly improved paper. The authors also thank Drs. David Banks, James Berger, William Bridges, Merlise Clyde, Jon Forster, Jayanta Ghosh, Aixin Tan, and Shouqiang Wang for helpful discussions. The research of Joyee Ghosh was partially supported by the NSA Young Investigator grant H98230-14-1-0126.

Supplementary Material: On the Use of Cauchy Prior Distributions for Bayesian Logistic Regression

In this supplement, we first present an appendix with additional simulation results for logit and probit link functions, and then include an appendix that contains proofs of the theoretical results.

1 Appendix: Simulation Results for Complete Separation Without Solitary Separators

In this section we present some supporting figures for the analysis of the simulated dataset described in Section 4.2 of the manuscript under logit and probit links.

1.1 Logistic Regression for Complete Separation Without Solitary Separators

Figures 1 and 2 are based on the posterior samples from a Gibbs sampler under a logit link, whereas Figures 3 and 4 are corresponding results from Stan. A detailed discussion of the results is provided in Section 4.2 of the manuscript.

Figure 1: Scatter plots (top) and box plots (bottom) of posterior samples of β1\beta_{1} and β2\beta_{2} for a logistic regression model, from the Gibbs sampler, under independent Cauchy, t7t_{7}, and normal priors for the second simulated dataset.
Figure 2: Plots of running means of β1\beta_{1} (top) and β2\beta_{2} (bottom) sampled from the posterior distributions for a logistic regression model, via the Gibbs sampler, under independent Cauchy, t7t_{7}, and normal priors for the second simulated dataset. Here both E(β1𝐲)E(\beta_{1}\mid\mathbf{y}) and E(β2𝐲)E(\beta_{2}\mid\mathbf{y}) exist under independent Cauchy priors.
Figure 3: Scatter plots (top) and box plots (bottom) of posterior samples of β1\beta_{1} and β2\beta_{2} for a logistic regression model, from Stan, under independent Cauchy, t7t_{7}, and normal priors for the second simulated dataset.
Figure 4: Plots of running means of β1\beta_{1} (top) and β2\beta_{2} (bottom) sampled from the posterior distributions for a logistic regression model, via Stan, under independent Cauchy, t7t_{7}, and normal priors for the second simulated dataset. Here both E(β1𝐲)E(\beta_{1}\mid\mathbf{y}) and E(β2𝐲)E(\beta_{2}\mid\mathbf{y}) exist under independent Cauchy priors.

1.2 Probit Regression for Complete Separation Without Solitary Separators

In this section we analyze the simulated dataset described in Section 4.2 of the manuscript under a probit link, while keeping everything else the same. We have shown in Theorem 2 that the theoretical results hold for a probit link. The goal of this analysis is to demonstrate that the empirical results are similar under the logit and probit link functions. For this dataset, Theorem 2 guarantees the existence of both E(β1𝐲)E(\beta_{1}\mid\mathbf{y}) and E(β2𝐲)E(\beta_{2}\mid\mathbf{y}) under independent Cauchy priors and a probit link function. As in the case of logistic regression, the heavy tails of the Cauchy priors translate into extremely heavy right tails in the posterior distributions of β1\beta_{1} and β2\beta_{2}, compared to the lighter tailed priors (see Figures 5 and 6 here). Thus in the case of separation, normal priors seem to be reasonable for probit regression as well.

Figure 5: Scatter plots (top) and box plots (bottom) of posterior samples of β1\beta_{1} and β2\beta_{2}, for a probit regression model, under Cauchy, t7t_{7}, and normal priors for the second simulated dataset. Posterior samples were generated via Stan.
Figure 6: Plots of running means of β1\beta_{1} (top) and β2\beta_{2} (bottom) sampled from the posterior distributions for a probit regression model, under Cauchy, t7t_{7}, and normal priors for the second simulated dataset. Here both E(β1𝐲)E(\beta_{1}\mid\mathbf{y}) and E(β2𝐲)E(\beta_{2}\mid\mathbf{y}) exist under Cauchy priors. Posterior samples were generated via Stan.

2 Appendix: Proofs

First, we decompose the proof of Theorem 1 into two parts: in Appendix A we show that a necessary condition for the existence of E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) is that 𝐗j\mathbf{X}_{j} is not a solitary separator; and in Appendix B we show that it is also a sufficient condition. Then, we prove Theorem 2 in Appendix C, and Corollary 1 in Appendix D. Finally, we decompose the proof of Theorem 3 into two parts: in Appendix E we show that all posterior means exist if there is no separation; then in Appendix F we show that none of the posterior means exist if there is complete separation.

A Proof of the Necessary Condition for Theorem 1

Here we show that if 𝐗j\mathbf{X}_{j} is a solitary separator, then E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) does not exist, which is equivalent to the necessary condition for Theorem 1.

Proof.

For notational simplicity, we define the functional form of the success and failure probabilities in logistic regression as

f1(t)=et/(1+et),f0(t)=1f1(t)=1/(1+et),f_{1}(t)=e^{t}/(1+e^{t}),\quad f_{0}(t)=1-f_{1}(t)=1/(1+e^{t}), (A.1)

which are strictly increasing and decreasing functions of tt, respectively. In addition, both functions are bounded: 0<f1(t),f0(t)<10<f_{1}(t),f_{0}(t)<1 for tt\in\mathbb{R}. Let 𝜷(j)\bm{\beta}_{(-j)} and 𝐱i,(j)\mathbf{x}_{i,(-j)} denote the vectors 𝜷\bm{\beta} and 𝐱i\mathbf{x}_{i} after excluding their jjth entries βj\beta_{j} and xi,jx_{i,j}, respectively. Then the likelihood function can be written as

p(𝐲𝜷)=i=1np(yi𝜷)=iA1f1(xi,jβj+𝐱i,(j)T𝜷(j))kA0f0(xk,jβj+𝐱k,(j)T𝜷(j)).p(\mathbf{y}\mid\bm{\beta})=\prod_{i=1}^{n}p(y_{i}\mid\bm{\beta})=\prod_{i\in A_{1}}f_{1}\left(x_{i,j}\beta_{j}+\mathbf{x}_{i,(-j)}^{T}\bm{\beta}_{(-j)}\right)\cdot\prod_{k\in A_{0}}f_{0}\left(x_{k,j}\beta_{j}+\mathbf{x}_{k,(-j)}^{T}\bm{\beta}_{(-j)}\right). (A.2)

The posterior mean of βj\beta_{j} exists provided E(|βj|𝐲)<E(|\beta_{j}|\mid\mathbf{y})<\infty. When the posterior mean exists it is given by (4). Clearly if one of the two integrals in (4) is not finite, then E(|βj|𝐲)=E(|\beta_{j}|\mid\mathbf{y})=\infty. In this proof we will show that if αj>0\alpha_{j}>0, the first integral in (4) equals \infty. Similarly, it can be shown that if αj<0\alpha_{j}<0, the second integral in (4) equals -\infty.

If αj>0\alpha_{j}>0, by (2)-(5), xi,j0x_{i,j}\geq 0 for all iA1i\in A_{1}, and xk,j0x_{k,j}\leq 0 for all kA0k\in A_{0}. When βj>0\beta_{j}>0, by the monotonic property of f1(t)f_{1}(t) and f0(t)f_{0}(t) we have, p(𝐲𝜷)iA1f1(𝐱i,(j)T𝜷(j))kA0f0(𝐱k,(j)T𝜷(j))p(\mathbf{y}\mid\bm{\beta})\geq\prod_{i\in A_{1}}f_{1}\left(\mathbf{x}_{i,(-j)}^{T}\bm{\beta}_{(-j)}\right)\cdot\prod_{k\in A_{0}}f_{0}\left(\mathbf{x}_{k,(-j)}^{T}\bm{\beta}_{(-j)}\right), which is free of βj\beta_{j}. Therefore,

0βjp(βj𝐲)𝑑βj=0βj[p1p(𝐲𝜷)p(βj)p(𝜷(j))p(𝐲)𝑑𝜷(j)]𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}~p(\beta_{j}\mid\mathbf{y})~d\beta_{j}=\int_{0}^{\infty}\beta_{j}\left[\int_{\mathbb{R}^{p-1}}\frac{p(\mathbf{y}\mid\bm{\beta})p(\beta_{j})p(\bm{\beta}_{(-j)})}{p(\mathbf{y})}d\bm{\beta}_{(-j)}\right]~d\beta_{j}
=\displaystyle=~ 1p(𝐲)0βjp(βj)[p1p(𝐲𝜷)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle\frac{1}{p(\mathbf{y})}\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{\mathbb{R}^{p-1}}p(\mathbf{y}\mid\bm{\beta})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]~d\beta_{j}
\displaystyle\geq~ 1p(𝐲)0βjp(βj)[p1iA1f1(𝐱i,(j)T𝜷(j))kA0f0(𝐱k,(j)T𝜷(j))p(𝜷(j))d𝜷(j)]𝑑βj\displaystyle\frac{1}{p(\mathbf{y})}\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{\mathbb{R}^{p-1}}\prod_{i\in A_{1}}f_{1}\left(\mathbf{x}_{i,(-j)}^{T}\bm{\beta}_{(-j)}\right)\prod_{k\in A_{0}}f_{0}\left(\mathbf{x}_{k,(-j)}^{T}\bm{\beta}_{(-j)}\right)p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]~d\beta_{j}
=\displaystyle=~ 0βjp(βj)𝑑βjp(𝐲)[p1iA1f1(𝐱i,(j)T𝜷(j))kA0f0(𝐱k,(j)T𝜷(j))p(𝜷(j))d𝜷(j)].\displaystyle\frac{\int_{0}^{\infty}\beta_{j}p(\beta_{j})~d\beta_{j}}{p(\mathbf{y})}\left[\int_{\mathbb{R}^{p-1}}\prod_{i\in A_{1}}f_{1}\left(\mathbf{x}_{i,(-j)}^{T}\bm{\beta}_{(-j)}\right)\prod_{k\in A_{0}}f_{0}\left(\mathbf{x}_{k,(-j)}^{T}\bm{\beta}_{(-j)}\right)p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]. (A.3)

Here the first equation results from independent priors, i.e., p(𝜷(j)βj)=p(𝜷(j))p(\bm{\beta}_{(-j)}\mid\beta_{j})=p(\bm{\beta}_{(-j)}). Since p(𝐲𝜷)<1p(\mathbf{y}\mid\bm{\beta})<1, p(𝐲)=pp(𝐲𝜷)p(𝜷)𝑑𝜷<pp(𝜷)𝑑𝜷=1p(\mathbf{y})=\int_{\mathbb{R}^{p}}p(\mathbf{y}\mid\bm{\beta})p(\bm{\beta})d\bm{\beta}<\int_{\mathbb{R}^{p}}p(\bm{\beta})d\bm{\beta}=1. Moreover, p(𝐲𝜷)p(𝜷)>0p(\mathbf{y}\mid\bm{\beta})p(\bm{\beta})>0 for all 𝜷p\bm{\beta}\in\mathbb{R}^{p}, so we also have p(𝐲)>0p(\mathbf{y})>0, implying that 0<p(𝐲)<10<p(\mathbf{y})<1. For the independent Cauchy priors in (6), 0βjp(βj)𝑑βj=\int_{0}^{\infty}\beta_{j}p(\beta_{j})~d\beta_{j}=\infty and the second integral in (A.3) is positive, hence (A.3) equals \infty. ∎
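The divergence established above can also be seen numerically. In the R sketch below (a toy illustration with hypothetical data, not part of the proof), a single positive predictor with all responses equal to one acts as a solitary separator, and the truncated version of the first integral in (4) keeps growing with the upper limit of integration.

x <- c(0.5, 1.2, 2.0); y <- c(1, 1, 1)        # all successes, all x_i > 0
integrand <- function(b)                       # beta * Cauchy(0, 2.5) prior * likelihood
  b * dcauchy(b, 0, 2.5) * sapply(b, function(bb) prod(plogis(bb * x)))
sapply(c(1e2, 1e4, 1e6),
       function(M) integrate(integrand, 0, M, subdivisions = 1000L)$value)
# the truncated integrals keep increasing (roughly logarithmically in M)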

B Proof of the Sufficient Condition for Theorem 1

Here we show that if 𝐗j\mathbf{X}_{j} is not a solitary separator, then the posterior mean of βj\beta_{j} exists.

Proof.

When E(|βj|𝐲)<E(|\beta_{j}|\mid\mathbf{y})<\infty the posterior mean of βj\beta_{j} exists and is given by

E(βj𝐲)\displaystyle E(\beta_{j}\mid\mathbf{y}) =βjp(βj𝐲)𝑑βj\displaystyle=\int_{-\infty}^{\infty}\beta_{j}p(\beta_{j}\mid\mathbf{y})d\beta_{j}
=1p(𝐲)0βjp(βj)p(𝐲βj)𝑑βjdenoted by E(βj𝐲)++1p(𝐲)0βjp(βj)p(𝐲βj)𝑑βjdenoted by E(βj𝐲).\displaystyle=\frac{1}{p(\mathbf{y})}\underbrace{\int_{0}^{\infty}\beta_{j}p(\beta_{j})p(\mathbf{y}\mid\beta_{j})d\beta_{j}}_{\text{denoted by }E(\beta_{j}\mid\mathbf{y})^{+}}+\frac{1}{p(\mathbf{y})}\underbrace{\int_{-\infty}^{0}\beta_{j}p(\beta_{j})p(\mathbf{y}\mid\beta_{j})d\beta_{j}}_{\text{denoted by }E(\beta_{j}\mid\mathbf{y})^{-}}. (B.1)

Since 0<p(𝐲)<10<p(\mathbf{y})<1 it is enough to show that the positive term E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} has a finite upper bound, and the negative term E(βj𝐲)E(\beta_{j}\mid\mathbf{y})^{-} has a finite lower bound. For notational simplicity, in the remainder of the proof, we let 𝜶(j)\bm{\alpha}_{(-j)}, 𝐱i,(j)\mathbf{x}_{i,(-j)}, 𝜷(j)\bm{\beta}_{(-j)}, and 𝝈(j)\bm{\sigma}_{(-j)} denote the vectors 𝜶\bm{\alpha}, 𝐱i\mathbf{x}_{i}, 𝜷\bm{\beta}, and 𝝈=(σ1,,σp)T\bm{\sigma}=(\sigma_{1},\ldots,\sigma_{p})^{T} after excluding their jjth entries, respectively.

We first show that E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} has a finite upper bound. Because 𝐗j\mathbf{X}_{j} is not a solitary separator, there exists either

  1. (a)

    an iA1i^{\prime}\in A_{1} such that xi,j<0x_{i^{\prime},j}<0, or

  2. (b)

    a kA0k^{\prime}\in A_{0}, such that xk,j>0x_{k^{\prime},j}>0.

If both (a) and (b) are violated, then 𝐗j\mathbf{X}_{j} is a solitary separator, which leads to a contradiction. Furthermore, such an 𝐱i,(j)\mathbf{x}_{i^{\prime},(-j)} or 𝐱k,(j)\mathbf{x}_{k^{\prime},(-j)} contains at least one non-zero entry. This is because if j1j\neq 1, i.e., 𝐗j\mathbf{X}_{j} does not correspond to the intercept (the column of all ones), then the first entry in 𝐱i,(j)\mathbf{x}_{i^{\prime},(-j)} or 𝐱k,(j)\mathbf{x}_{k^{\prime},(-j)} equals 11. If j=1j=1, then by the assumption that 𝐗\mathbf{X} has column rank >1>1, there exists a row i{1,2,,n}i^{\diamond}\in\{1,2,\ldots,n\} such that 𝐱i,(1)\mathbf{x}_{i^{\diamond},(-1)} contains at least one non-zero entry. If iA0i^{\diamond}\in A_{0}, then we let k=ik^{\prime}=i^{\diamond} and condition (b) holds because xi,1=1>0x_{i^{\diamond},1}=1>0. If iA1i^{\diamond}\in A_{1}, we may first rescale 𝐗1\mathbf{X}_{1} by 1-1, which transforms β1\beta_{1} to β1-\beta_{1}. Since the Cauchy prior centered at zero is invariant to this rescaling, and E(β1𝐲)E(-\beta_{1}\mid\mathbf{y}) exists if and only if E(β1𝐲)E(\beta_{1}\mid\mathbf{y}) exists, we can simply apply this rescaling, after which xi,1=1<0x_{i^{\diamond},1}=-1<0. Then we let i=ii^{\prime}=i^{\diamond} and condition (a) holds.

We first assume that condition (a) is true. We define a positive constant ϵ=|xi,j|/2=xi,j/2\epsilon=|x_{i^{\prime},j}|/2=-x_{i^{\prime},j}/2. For any βj>0\beta_{j}>0, we define a subset of the domain of 𝜷(j)\bm{\beta}_{(-j)} as follows

G(βj)=def{𝜷(j)p1:𝐱i,(j)T𝜷(j)<ϵβj}.G(\beta_{j})\stackrel{{\scriptstyle\text{def}}}{{=}}\left\{\bm{\beta}_{(-j)}\in\mathbb{R}^{p-1}:\mathbf{x}_{i^{\prime},(-j)}^{T}\bm{\beta}_{(-j)}<\epsilon\beta_{j}\right\}. (B.2)

Then for any 𝜷(j)G(βj)\bm{\beta}_{(-j)}\in G(\beta_{j}), 𝐱iT𝜷=xi,jβj+𝐱i,(j)T𝜷(j)<(xi,j+ϵ)βj=ϵβj\mathbf{x}_{i^{\prime}}^{T}\bm{\beta}=x_{i^{\prime},j}\beta_{j}+\mathbf{x}_{i^{\prime},(-j)}^{T}\bm{\beta}_{(-j)}<(x_{i^{\prime},j}+\epsilon)\beta_{j}=-\epsilon\beta_{j}. Therefore,

f1(𝐱iT𝜷)<f1(ϵβj),for all 𝜷(j)G(βj).f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta})<f_{1}(-\epsilon\beta_{j}),\quad\text{for all }\bm{\beta}_{(-j)}\in G(\beta_{j}). (B.3)

As f1()f_{1}(\cdot) and f0()f_{0}(\cdot) are bounded above by 1, the likelihood function p(𝐲𝜷)p(\mathbf{y}\mid\bm{\beta}) in (A.2) is bounded above by f1(𝐱iT𝜷)f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta}). Thus

E(βj𝐲)+=0βjp(βj)[p1p(𝐲𝜷)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle E(\beta_{j}\mid\mathbf{y})^{+}=\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{\mathbb{R}^{p-1}}p(\mathbf{y}\mid\bm{\beta})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}
<\displaystyle<~ 0βjp(βj)[p1f1(𝐱iT𝜷)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{\mathbb{R}^{p-1}}f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}
=\displaystyle=~ 0βjp(βj)[G(βj)f1(𝐱iT𝜷)p(𝜷(j))𝑑𝜷(j)+p1\G(βj)f1(𝐱iT𝜷)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{G(\beta_{j})}f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}+\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j})}f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}
<\displaystyle<~ 0βjp(βj)[G(βj)f1(ϵβj)p(𝜷(j))𝑑𝜷(j)+p1\G(βj)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{G(\beta_{j})}f_{1}(-\epsilon\beta_{j})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}+\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j} (B.4)

Here the last inequality results from (B.3) and the fact that the function f1()f_{1}(\cdot) is bounded above by 11. An upper bound is obtained for the first term in the bracket in (B.4) using the fact that the integrand is non-negative as follows:

G(βj)f1(ϵβj)p(𝜷(j))𝑑𝜷(j)\displaystyle\int_{G(\beta_{j})}f_{1}(-\epsilon\beta_{j})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)} <p1f1(ϵβj)p(𝜷(j))𝑑𝜷(j)\displaystyle<\int_{\mathbb{R}^{p-1}}f_{1}(-\epsilon\beta_{j})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}
=f1(ϵβj)p1p(𝜷(j))𝑑𝜷(j)=1=eϵβj1+eϵβj<eϵβj.\displaystyle=f_{1}(-\epsilon\beta_{j})\underbrace{\int_{\mathbb{R}^{p-1}}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}}_{=1}=\frac{e^{-\epsilon\beta_{j}}}{1+e^{-\epsilon\beta_{j}}}<e^{-\epsilon\beta_{j}}. (B.5)

Recall that 𝐱i,(j)\mathbf{x}_{i^{\prime},(-j)} contains at least one non-zero entry. We assume that 𝐱i,r0\mathbf{x}_{i^{\prime},r}\neq 0. Then to simplify the second term in the bracket in (B.4), we transform 𝜷(j)\bm{\beta}_{(-j)} to (η,𝝃)(\eta,\bm{\xi}) via a linear transformation, such that η=𝐱i,(j)T𝜷(j)\eta=\mathbf{x}_{i^{\prime},(-j)}^{T}\bm{\beta}_{(-j)}, and 𝝃\bm{\xi} is the vector 𝜷(j)\bm{\beta}_{(-j)} after excluding βr\beta_{r}. The characteristic function of a Cauchy distribution C(μ,σ)C(\mu,\sigma) is φ(t)=eitμ|t|σ\varphi(t)=e^{it\mu-|t|\sigma}, where tt\in\mathbb{R}. Since a priori, η\eta is a linear combination of independent C(0,σ)C(0,\sigma_{\ell}) random variables, β\beta_{\ell}, for 1p,j1\leq\ell\leq p,\ell\neq j, its characteristic function is

φη(t)=E(eitη)=1p,jE[ei(txi,)β]=1p,jφβ(txi,)=e|t|1p,j|xi,|σ.\varphi_{\eta}(t)=E(e^{it\eta})=\prod_{1\leq\ell\leq p,\ell\neq j}E\left[e^{i(tx_{i^{\prime},\ell})\beta_{\ell}}\right]=\prod_{1\leq\ell\leq p,\ell\neq j}\varphi_{\beta_{\ell}}(tx_{i^{\prime},\ell})=e^{-|t|\sum_{1\leq\ell\leq p,\ell\neq j}|x_{i^{\prime},\ell}|\sigma_{\ell}}.

So the induced prior of η\eta is C(0,1p,j|xi,|σ)C(0,\sum_{1\leq\ell\leq p,\ell\neq j}|x_{i^{\prime},\ell}|\sigma_{\ell}). Let |𝐱i,(j)||\mathbf{x}_{i^{\prime},(-j)}| denote the vector obtained by taking absolute values of each element of 𝐱i,(j)\mathbf{x}_{i^{\prime},(-j)}, then the above scale parameter 1p,j|xi,|σ\sum_{1\leq\ell\leq p,\ell\neq j}|x_{i^{\prime},\ell}|\sigma_{\ell}= |𝐱i,(j)|T𝝈(j)|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}. By (B.2), for any 𝜷(j)G(βj)\bm{\beta}_{(-j)}\not\in G(\beta_{j}), the corresponding ηϵβj\eta\geq\epsilon\beta_{j} and 𝝃p2\bm{\xi}\in\mathbb{R}^{p-2}. An upper bound is calculated for the second term in the bracket in (B.4). Note that p(η)p(𝝃η)p(\eta)p(\bm{\xi}\mid\eta) is the joint density of η\eta and 𝝃\bm{\xi}. Since it incorporates the Jacobian of transformation from 𝜷(j)\bm{\beta}_{(-j)} to η\eta and 𝝃\bm{\xi}, a separate Jacobian term is not needed in the first equality below.

p1\G(βj)p(𝜷(j))𝑑𝜷(j)\displaystyle\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)} =ϵβjp2p(η)p(𝝃η)𝑑𝝃𝑑η\displaystyle=\int_{\epsilon\beta_{j}}^{\infty}\int_{\mathbb{R}^{p-2}}p(\eta)p(\bm{\xi}\mid\eta)d\bm{\xi}d\eta
=ϵβjp2p(𝝃η)𝑑𝝃π|𝐱i,(j)|T𝝈(j)[1+η2/(|𝐱i,(j)|T𝝈(j))2]𝑑η\displaystyle=\int_{\epsilon\beta_{j}}^{\infty}\frac{\int_{\mathbb{R}^{p-2}}p(\bm{\xi}\mid\eta)d\bm{\xi}}{\pi~|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}\left[1+\eta^{2}/\left(|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}\right)^{2}\right]}d\eta
=1π[π2arctan(ϵβj|𝐱i,(j)|T𝝈(j))]\displaystyle=\frac{1}{\pi}\left[\frac{\pi}{2}-\arctan\left(\frac{\epsilon\beta_{j}}{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}\right)\right]
=1πarctan(|𝐱i,(j)|T𝝈(j)ϵβj)<|𝐱i,(j)|T𝝈(j)πϵβj.\displaystyle=\frac{1}{\pi}\arctan\left(\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{\epsilon\beta_{j}}\right)<\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{\pi\epsilon\beta_{j}}. (B.6)

Here, the second equality holds because p2p(𝝃η)𝑑𝝃=1\int_{\mathbb{R}^{p-2}}p(\bm{\xi}\mid\eta)d\bm{\xi}=1; the last inequality holds because ϵ\epsilon and βj\beta_{j} are both positive, and for any t>0t>0, arctan(t)<t\arctan(t)<t.
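The distributional fact used above, that a fixed linear combination of independent Cauchy variables is again Cauchy with scale |𝐱i,(j)|T𝝈(j)|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}, can also be checked by simulation; a small R sketch with hypothetical coefficients and the default scale 2.5:

set.seed(1)
n_draws <- 1e5
x     <- c(1, -2.5, 0.7)                       # hypothetical entries of x_{i',(-j)}
sigma <- c(2.5, 2.5, 2.5)                      # prior scales sigma_{(-j)}
beta  <- sapply(seq_along(x), function(l) rcauchy(n_draws, 0, sigma[l]))
eta   <- as.vector(beta %*% x)                 # eta = x_{i',(-j)}^T beta_{(-j)}
s     <- sum(abs(x) * sigma)                   # claimed Cauchy scale of eta
round(rbind(empirical   = quantile(eta, c(0.1, 0.25, 0.75, 0.9)),
            theoretical = qcauchy(c(0.1, 0.25, 0.75, 0.9), 0, s)), 2)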

Then substituting the expression for p(βj)p(\beta_{j}) as in (6), we continue with (B.4) to find an upper bound.

E(βj𝐲)+\displaystyle E(\beta_{j}\mid\mathbf{y})^{+} <0βjπσj(1+βj2/σj2)[eϵβj+|𝐱i,(j)|T𝝈(j)πϵβj]𝑑βj\displaystyle<\int_{0}^{\infty}\frac{\beta_{j}}{\pi\sigma_{j}(1+\beta_{j}^{2}/\sigma_{j}^{2})}\left[e^{-\epsilon\beta_{j}}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{\pi\epsilon\beta_{j}}\right]d\beta_{j}
<0βjeϵβjπσj𝑑βj+0|𝐱i,(j)|T𝝈(j)π2σjϵ(1+βj2/σj2)𝑑βj=1πσjϵ2+|𝐱i,(j)|T𝝈(j)2πϵ.\displaystyle<\int_{0}^{\infty}\frac{\beta_{j}e^{-\epsilon\beta_{j}}}{\pi\sigma_{j}}d\beta_{j}+\int_{0}^{\infty}\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{\pi^{2}\sigma_{j}\epsilon(1+\beta_{j}^{2}/\sigma_{j}^{2})}d\beta_{j}=\frac{1}{\pi\sigma_{j}\epsilon^{2}}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}. (B.7)

On the other hand, if condition (b) holds, then we just need to slightly modify the above proof. We define ϵ=|xk,j|/2=xk,j/2\epsilon=|x_{k^{\prime},j}|/2=x_{k^{\prime},j}/2, and change (B.2) to

G(βj)=def{𝜷(j)p1:𝐱k,(j)T𝜷(j)>ϵβj}.G(\beta_{j})\stackrel{{\scriptstyle\text{def}}}{{=}}\left\{\bm{\beta}_{(-j)}\in\mathbb{R}^{p-1}:\mathbf{x}_{k^{\prime},(-j)}^{T}\bm{\beta}_{(-j)}>-\epsilon\beta_{j}\right\}.

Consequently, the terms f1(𝐱iT𝜷)f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta}) and f1(ϵβj)f_{1}(-\epsilon\beta_{j}) in (B.4) have to be changed to f0(𝐱kT𝜷)f_{0}(\mathbf{x}_{k^{\prime}}^{T}\bm{\beta}) and f0(ϵβj)f_{0}(\epsilon\beta_{j}), respectively. For the logit link, f0(ϵβj)=f1(ϵβj)f_{0}(\epsilon\beta_{j})=f_{1}(-\epsilon\beta_{j}). The range of the integral in (B.6) with respect to η\eta is from -\infty to ϵβj-\epsilon\beta_{j}; however, because the density of η\eta is symmetric around 0, the value of the integral stays the same. So it can be shown that an upper bound for E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} is [1/(πσjϵ2)]+[|𝐱k,(j)|T𝝈(j)/(2πϵ)]\left[1/\left(\pi\sigma_{j}\epsilon^{2}\right)\right]+\left[|\mathbf{x}_{k^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}/\left(2\pi\epsilon\right)\right].

We now show that the negative term E(βj𝐲)E(\beta_{j}\mid\mathbf{y})^{-} has a finite lower bound. For any βj<0\beta_{j}<0, by expressing βj=βj\beta_{j}^{*}=-\beta_{j}, we need to show that the positive term E(βj𝐲)=0βjp(βj)p(𝐲βj)𝑑βj=0βjp(βj)p(𝐲βj)𝑑βj-E(\beta_{j}\mid\mathbf{y})^{-}=-\int_{-\infty}^{0}\beta_{j}p(\beta_{j})p(\mathbf{y}\mid\beta_{j})d\beta_{j}=\int_{0}^{\infty}\beta_{j}^{*}p(\beta_{j}^{*})p(\mathbf{y}\mid-\beta_{j}^{*})d\beta_{j}^{*} has a finite upper bound. As the idea is very similar to the proof of existence of E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+}, we present less details here.

Since the predictor 𝐗j\mathbf{X}_{j} is not a solitary separator, there exists either

  1. (c)

    an iA1i^{*}\in A_{1} such that xi,j>0x_{i^{*},j}>0, or

  2. (d)

    a kA0k^{*}\in A_{0}, such that xk,j<0x_{k^{*},j}<0.

If (c) and (d) are both violated, then 𝐗j\mathbf{X}_{j} has to be a solitary separator, which leads to a contradiction. Without loss of generality, we assume that condition (c) is true, and as before 𝐱i,(j)\mathbf{x}_{i^{*},(-j)} must contain at least one non-zero entry, say, 𝐱i,s0\mathbf{x}_{i^{*},s}\neq 0. If instead condition (d) is true, then we can adopt a modification similar to the one used earlier to handle condition (b) starting from the proof under condition (a).

We define a positive constant ϵ=xi,j/2\epsilon=x_{i^{*},j}/2. For any βj>0\beta_{j}^{*}>0, we define a subset of p1\mathbb{R}^{p-1} as G(βj)=def{𝜷(j)p1:𝐱i,(j)T𝜷(j)<ϵβj}G(\beta_{j}^{*})\stackrel{{\scriptstyle\text{def}}}{{=}}\left\{\bm{\beta}_{(-j)}\in\mathbb{R}^{p-1}:\mathbf{x}_{i^{*},(-j)}^{T}\bm{\beta}_{(-j)}<\epsilon\beta_{j}^{*}\right\}. Then for all 𝜷(j)G(βj)\bm{\beta}_{(-j)}\in G(\beta_{j}^{*}), 𝐱iT𝜷=xi,jβj+𝐱i,(j)T𝜷(j)<(xi,j+ϵ)βj=ϵβj\mathbf{x}_{i^{*}}^{T}\bm{\beta}=-x_{i^{*},j}\beta_{j}^{*}+\mathbf{x}_{i^{*},(-j)}^{T}\bm{\beta}_{(-j)}<(-x_{i^{*},j}+\epsilon)\beta_{j}^{*}=-\epsilon\beta_{j}^{*}, hence f1(𝐱iT𝜷)<f1(ϵβj)f_{1}(\mathbf{x}_{i^{*}}^{T}\bm{\beta})<f_{1}\left(-\epsilon\beta_{j}^{*}\right). Since the likelihood function p(𝐲βj,𝜷(j))<f1(xi,j(βj)+𝐱i,(j)T𝜷(j))<1p(\mathbf{y}\mid-\beta_{j}^{*},\bm{\beta}_{(-j)})<f_{1}(x_{i^{*},j}(-\beta_{j}^{*})+\mathbf{x}_{i^{*},(-j)}^{T}\bm{\beta}_{(-j)})<1,

E(βj𝐲)=0βjp(βj)[p1p(𝐲βj,𝜷(j))p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle-E(\beta_{j}\mid\mathbf{y})^{-}=\int_{0}^{\infty}\beta_{j}^{*}p(\beta_{j}^{*})\left[\int_{\mathbb{R}^{p-1}}p(\mathbf{y}\mid-\beta_{j}^{*},\bm{\beta}_{(-j)})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}^{*}
<\displaystyle<~ 0βjp(βj)[G(βj)f1(ϵβj)p(𝜷(j))𝑑𝜷(j)+p1\G(βj)p(𝜷(j))𝑑𝜷(j)]𝑑βj.\displaystyle\int_{0}^{\infty}\beta_{j}^{*}p(\beta_{j}^{*})\left[\int_{G(\beta_{j}^{*})}f_{1}(-\epsilon\beta_{j}^{*})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}+\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j}^{*})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}^{*}. (B.8)

The first term in the bracket in (B.8) has an upper bound exp(ϵβj)\exp(-\epsilon\beta_{j}^{*}) as in (B.5). Recall that 𝐱i,s0\mathbf{x}_{i^{*},s}\neq 0. We now transform 𝜷(j)\bm{\beta}_{(-j)} to (η,𝝃)(\eta,\bm{\xi}) via a linear transformation, such that η=𝐱i,(j)T𝜷(j)\eta=\mathbf{x}_{i^{*},(-j)}^{T}\bm{\beta}_{(-j)} and 𝝃\bm{\xi} is the vector 𝜷(j)\bm{\beta}_{(-j)} after excluding βs\beta_{s}. The prior of η\eta is C(0,|𝐱i,(j)|T𝝈(j))C(0,|\mathbf{x}_{i^{*},(-j)}|^{T}\bm{\sigma}_{(-j)}). For any 𝜷(j)G(βj)\bm{\beta}_{(-j)}\not\in G(\beta_{j}^{*}), the corresponding ηϵβj\eta\geq\epsilon\beta_{j}^{*}. Therefore as in (B.6), we obtain an upper bound for the second term in the bracket in (B.8) as |𝐱i,(j)|T𝝈(j)/(πϵβj)|\mathbf{x}_{i^{*},(-j)}|^{T}\bm{\sigma}_{(-j)}/\left(\pi\epsilon\beta_{j}^{*}\right). Finally, following (B.7) an upper bound for E(βj𝐲)-E(\beta_{j}\mid\mathbf{y})^{-} is [1/(πσjϵ2)]+[|𝐱i,(j)|T𝝈(j)/(2πϵ)]\left[1/\left(\pi\sigma_{j}\epsilon^{2}\right)\right]+\left[|\mathbf{x}_{i^{*},(-j)}|^{T}\bm{\sigma}_{(-j)}/\left(2\pi\epsilon\right)\right]. ∎
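Conversely, when the separation is broken, the same kind of truncated integral settles down; the companion R sketch below (again a toy illustration with hypothetical data, not part of the proof) changes one response of the earlier example to a failure so that the predictor is no longer a solitary separator.

x <- c(0.5, 1.2, 2.0); y <- c(1, 1, 0)         # the failure at x = 2.0 breaks the separation
integrand <- function(b)
  b * dcauchy(b, 0, 2.5) *
    sapply(b, function(bb) prod(plogis(bb * x)^y * (1 - plogis(bb * x))^(1 - y)))
sapply(c(1e2, 1e4, 1e6),
       function(M) integrate(integrand, 0, M, subdivisions = 1000L)$value)
# the three values agree, consistent with a finite positive part in (B.1)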

C Proof of Theorem 2

Proof.

Following the proof of Theorem 1, we denote the success probability function by π=f1(𝐱T𝜷)\pi=f_{1}(\mathbf{x}^{T}\bm{\beta}), where f1()f_{1}(\cdot) is the inverse link function, i.e., f1()=g1()f_{1}(\cdot)=g^{-1}(\cdot). Similarly, let the failure probability function be f0(𝐱T𝜷)=1f1(𝐱T𝜷)f_{0}(\mathbf{x}^{T}\bm{\beta})=1-f_{1}(\mathbf{x}^{T}\bm{\beta}). Note that the proof of the necessary condition for Theorem 1 given in Appendix A only relies on the fact the f1()f_{1}(\cdot) is increasing, continuous, and bounded between 0 and 11. Since the link function g()g(\cdot) is assumed to be strictly increasing and differentiable, so is f1()f_{1}(\cdot). Moreover, the range of f1f_{1} is (0,1)(0,1). Therefore, the proof of the necessary condition for Theorem 2 follows immediately.

For the proof of the sufficient condition in Theorem 2, one can follow the proof in Appendix B and proceed with the specific choice of ϵ\epsilon used there, when condition (a) holds. The proof is identical until (B.4) because the explicit form of the inverse link function is not used until that step. We re-write the final step in (B.4) below and proceed from there:

E(βj𝐲)+\displaystyle E(\beta_{j}\mid\mathbf{y})^{+}
<\displaystyle< 0βjp(βj)[G(βj)f1(ϵβj)p(𝜷(j))𝑑𝜷(j)+p1\G(βj)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{G(\beta_{j})}f_{1}(-\epsilon\beta_{j})p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}+\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}
=\displaystyle= 0βjp(βj)f1(ϵβj)[G(βj)p(𝜷(j))𝑑𝜷(j)]<p1p(𝜷(j))𝑑𝜷(j)=1𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})f_{1}(-\epsilon\beta_{j})\underbrace{\left[\int_{G(\beta_{j})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]}_{<\int_{\mathbb{R}^{p-1}}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}=1}d\beta_{j}
+0βjp(βj)[p1\G(βj)p(𝜷(j))𝑑𝜷(j)]𝑑βj\displaystyle+\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}
<\displaystyle< 0βjp(βj)f1(ϵβj)=g1(ϵβj)𝑑βj+0βjp(βj)[p1\G(βj)p(𝜷(j))𝑑𝜷(j)]𝑑βj.\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})\underbrace{f_{1}(-\epsilon\beta_{j})}_{=g^{-1}(-\epsilon\beta_{j})}d\beta_{j}+\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[\int_{\mathbb{R}^{p-1}\backslash G(\beta_{j})}p(\bm{\beta}_{(-j)})d\bm{\beta}_{(-j)}\right]d\beta_{j}. (C.1)

The sufficient condition in Theorem 2 states that for every positive constant ϵ\epsilon,

0βjp(βj)g1(ϵβj)𝑑βj<.\int_{0}^{\infty}\beta_{j}p(\beta_{j})g^{-1}(-\epsilon\beta_{j})d\beta_{j}<\infty.

This implies that the integral will be bounded for the specific choice of ϵ\epsilon used in the above proof, and hence the first integral in (C.1) is bounded above. The second integral does not depend on the link function and its bound can be obtained exactly as in Appendix B. Thus E(βj𝐲)+<E(\beta_{j}\mid\mathbf{y})^{+}<\infty under condition (a). On the other hand if condition (b) holds, the proof follows similarly as in Appendix B and now we need to use the sufficient condition in Theorem 2 that for every positive constant ϵ\epsilon, 0βjp(βj)[1g1(ϵβj)]𝑑βj<\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[1-g^{-1}(\epsilon\beta_{j})\right]d\beta_{j}<\infty. A bound for E(βj𝐲)-E(\beta_{j}\mid\mathbf{y})^{-} can be obtained similarly, which completes the proof.

In probit regression, we first show that

gprobit1(t)=Φ(t)<et/(1+et)=glogit1(t),for any t<0,g^{-1}_{\text{probit}}(t)=\Phi(t)<e^{t}/(1+e^{t})=g^{-1}_{\text{logit}}(t),\quad\text{for any }t<0, (C.2)

where Φ(t)\Phi(t) is the standard normal cdf. It is equivalent to show that the difference function

u(t)=Φ(t)et1+et<0, for all t<0.u(t)=\Phi(t)-\frac{e^{t}}{1+e^{t}}<0,\text{ for all }t<0. (C.3)

Note that u(0)=1/21/2=0u(0)=1/2-1/2=0, and limtu(t)=0\lim_{t\rightarrow-\infty}u(t)=0. Since u(t)u(t) is differentiable, we have

u(t)=12πet22et(1+et)2.u^{\prime}(t)=\frac{1}{\sqrt{2\pi}}e^{-\frac{t^{2}}{2}}-\frac{e^{t}}{(1+e^{t})^{2}}.

Now u(0)=1/2π1/4>0u^{\prime}(0)=1/\sqrt{2\pi}-1/4>0, and when tt is very small, u(t)<0u^{\prime}(t)<0 since et2/2e^{-t^{2}/2} decays to zero at a faster speed than ete^{t}, i.e., there exists a t~<0\tilde{t}<0 such that u(t~)<0u^{\prime}(\tilde{t})<0. Since u(t)u^{\prime}(t) is a continuous function, the intermediate value theorem applied to [t~,0][\tilde{t},0] shows that there exists a t1<0t_{1}<0 such that u(t1)=0u^{\prime}(t_{1})=0. Therefore, to show (C.3), it is sufficient to show that u(t)u^{\prime}(t) has a unique root on \mathbb{R}^{-}, which is proved by contradiction as follows.

If u(t)u^{\prime}(t) has two distinct roots t1,t2<0t_{1},t_{2}<0, i.e., for i=1,2i=1,2, u(ti)=0u^{\prime}(t_{i})=0, then

12πeti22=eti(1+eti)2,i=1,2et122et222=et1et2(1+et2)2(1+et1)2\displaystyle\frac{1}{\sqrt{2\pi}}e^{-\frac{t_{i}^{2}}{2}}=\frac{e^{t_{i}}}{(1+e^{t_{i}})^{2}},\ i=1,2\Longleftrightarrow\frac{e^{-\frac{t_{1}^{2}}{2}}}{e^{-\frac{t_{2}^{2}}{2}}}=\frac{e^{t_{1}}}{e^{t_{2}}}\cdot\frac{(1+e^{t_{2}})^{2}}{(1+e^{t_{1}})^{2}}
\displaystyle\Longleftrightarrow e(t2+1)22e(t1+1)22=(1+et21+et1)2(t2+1)24log(1+et2)=(t1+1)24log(1+et1).\displaystyle\frac{e^{\frac{(t_{2}+1)^{2}}{2}}}{e^{\frac{(t_{1}+1)^{2}}{2}}}=\left(\frac{1+e^{t_{2}}}{1+e^{t_{1}}}\right)^{2}\Longleftrightarrow\frac{(t_{2}+1)^{2}}{4}-\log(1+e^{t_{2}})=\frac{(t_{1}+1)^{2}}{4}-\log(1+e^{t_{1}}). (C.4)

Note that the derivative of the function (t+1)2/4log(1+et)(t+1)^{2}/4-\log(1+e^{t}) is (t+1)/2et/(1+et)(t+1)/2-e^{t}/(1+e^{t}). It is straightforward to show that this derivative is strictly less than 0 for all t<0t<0, so (t+1)2/4log(1+et)(t+1)^{2}/4-\log(1+e^{t}) is a strictly decreasing function. Thus (C.4) holds only if t1=t2t_{1}=t_{2}, which leads to a contradiction.
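Inequality (C.2) can also be checked directly on a grid; the short R computation below (a numerical illustration, not a substitute for the argument above) confirms that u(t)u(t) is negative over a range of negative tt.

t_grid <- seq(-10, -0.001, by = 0.001)
max(pnorm(t_grid) - plogis(t_grid))   # maximum of u(t) on the grid; numerically negative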

Hence for any ϵ>0\epsilon>0,

0βjp(βj)gprobit1(ϵβj)𝑑βj\displaystyle\int_{0}^{\infty}\beta_{j}p(\beta_{j})g^{-1}_{\text{probit}}(-\epsilon\beta_{j})d\beta_{j} <0βjp(βj)glogit1(ϵβj)𝑑βj<0βjp(βj)eϵβj𝑑βj\displaystyle<\int_{0}^{\infty}\beta_{j}p(\beta_{j})g^{-1}_{\text{logit}}(-\epsilon\beta_{j})d\beta_{j}<\int_{0}^{\infty}\beta_{j}p(\beta_{j}){e^{-\epsilon\beta_{j}}}d\beta_{j}
=0βjeϵβjπσj(1+βj2/σj2)𝑑βj\displaystyle=\int_{0}^{\infty}\frac{\beta_{j}e^{-\epsilon\beta_{j}}}{\pi\sigma_{j}(1+\beta_{j}^{2}/\sigma_{j}^{2})}d\beta_{j} <0βjeϵβjπσj𝑑βj=1πσjϵ2<.\displaystyle<\int_{0}^{\infty}\frac{\beta_{j}e^{-\epsilon\beta_{j}}}{\pi\sigma_{j}}d\beta_{j}=\frac{1}{\pi\sigma_{j}\epsilon^{2}}<\infty.

Since the probit link is symmetric, i.e., 1gprobit1(ϵβj)=1Φ(ϵβj)=Φ(ϵβj)=gprobit1(ϵβj)1-g^{-1}_{\text{probit}}(\epsilon\beta_{j})=1-\Phi(\epsilon\beta_{j})=\Phi(-\epsilon\beta_{j})=g^{-1}_{\text{probit}}(-\epsilon\beta_{j}), we also have 0βjp(βj)[1gprobit1(ϵβj)]𝑑βj<\int_{0}^{\infty}\beta_{j}p(\beta_{j})\left[1-g^{-1}_{\text{probit}}(\epsilon\beta_{j})\right]d\beta_{j}<\infty. ∎
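As a numerical sanity check on the sufficient condition of Theorem 2 for the probit link (with the default Cauchy scale 2.5 taken as an assumption for the illustration), the integral in the condition evaluates to a finite number for several values of ϵ\epsilon:

sapply(c(0.1, 0.5, 2), function(eps)
  integrate(function(b) b * dcauchy(b, 0, 2.5) * pnorm(-eps * b), 0, Inf)$value)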

D Proof of Corollary 1

To prove Corollary 1, we mainly use a similar strategy to the proof of Theorem 1. To show the necessary condition, we can use all of Appendix A without modification, for both logistic and probit regression models. To show the sufficient condition, we can follow the same proof outline as in Appendix B, with some minimal modification as described in the following proof.

Proof.

First, we denote the vector of prior location parameters by 𝝁=(μ1,μ2,,μp)T\bm{\mu}=(\mu_{1},\mu_{2},\ldots,\mu_{p})^{T}. If we shift the coefficients 𝜷\bm{\beta} by 𝝁\bm{\mu} units, then

𝜸=def𝜷𝝁γjindC(0,σj),j=1,2,,p,\bm{\gamma}\stackrel{{\scriptstyle\text{def}}}{{=}}\bm{\beta}-\bm{\mu}~\Longrightarrow~\gamma_{j}\stackrel{{\scriptstyle\text{ind}}}{{\sim}}\text{C}(0,\sigma_{j}),\quad j=1,2,\ldots,p,

that is, the resulting parameters γj\gamma_{j} have independent Cauchy priors with location parameters being zero. Since the original parameter βj=γj+μj\beta_{j}=\gamma_{j}+\mu_{j} for each j=1,2,,pj=1,2,\ldots,p, the existence of E(βj𝐘)E(\beta_{j}\mid\mathbf{Y}) is equivalent to the existence of E(γj𝐘)E(\gamma_{j}\mid\mathbf{Y}). So we just need to show that if 𝐗j\mathbf{X}_{j} is not a solitary separator, then E(γj𝐘)E(\gamma_{j}\mid\mathbf{Y}) exists. For simplicity, here we just show that the positive term

E(γj𝐲)+=def0γjp(γj)p(𝐲γj)𝑑γjE(\gamma_{j}\mid\mathbf{y})^{+}\stackrel{{\scriptstyle\text{def}}}{{=}}\int_{0}^{\infty}\gamma_{j}p(\gamma_{j})p(\mathbf{y}\mid\gamma_{j})d\gamma_{j}

has a finite upper bound. The other half of the result that the negative term E(γj𝐲)E(\gamma_{j}\mid\mathbf{y})^{-} has a finite lower bound will follow with a similar derivation.

As in Appendix B, we first assume that condition (a) is true, and define ϵ\epsilon in the same way. For any γj>0\gamma_{j}>0, we define a subset of the domain of 𝜸(j)\bm{\gamma}_{(-j)} as follows

G(γj)=def{𝜸(j)p1:𝐱i,(j)T𝜸(j)<ϵγj},G(\gamma_{j})\stackrel{{\scriptstyle\text{def}}}{{=}}\left\{\bm{\gamma}_{(-j)}\in\mathbb{R}^{p-1}:\mathbf{x}_{i^{\prime},(-j)}^{T}\bm{\gamma}_{(-j)}<\epsilon\gamma_{j}\right\},

then for any 𝜸(j)G(γj)\bm{\gamma}_{(-j)}\in G(\gamma_{j}), 𝐱iT𝜸<ϵγj\mathbf{x}_{i^{\prime}}^{T}\bm{\gamma}<-\epsilon\gamma_{j}. Since f1()f_{1}(\cdot) is an increasing function,

f1(𝐱iT𝜷)=f1(𝐱iT𝜸+𝐱iT𝝁)<f1(ϵγj+𝐱iT𝝁),for all 𝜸(j)G(γj).f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\beta})=f_{1}(\mathbf{x}_{i^{\prime}}^{T}\bm{\gamma}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu})<f_{1}(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}),\quad\text{for all }\bm{\gamma}_{(-j)}\in G(\gamma_{j}).

A similar derivation to (B.4) gives

E(γj𝐲)+\displaystyle E(\gamma_{j}\mid\mathbf{y})^{+}
<\displaystyle<~ 0γjp(γj)[G(γj)f1(ϵγj+𝐱iT𝝁)p(𝜸(j))𝑑𝜸(j)+p1\G(γj)p(𝜸(j))𝑑𝜸(j)]𝑑γj,\displaystyle\int_{0}^{\infty}\gamma_{j}p(\gamma_{j})\left[\int_{G(\gamma_{j})}f_{1}(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu})p(\bm{\gamma}_{(-j)})d\bm{\gamma}_{(-j)}+\int_{\mathbb{R}^{p-1}\backslash G(\gamma_{j})}p(\bm{\gamma}_{(-j)})d\bm{\gamma}_{(-j)}\right]d\gamma_{j},

where by (B.5) the first integral inside the bracket has an upper bound

G(γj)f1(ϵγj+𝐱iT𝝁)p(𝜸(j))𝑑𝜸(j)<f1(ϵγj+𝐱iT𝝁),\int_{G(\gamma_{j})}f_{1}(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu})p(\bm{\gamma}_{(-j)})d\bm{\gamma}_{(-j)}<f_{1}(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}), (D.1)

and by (B.6) the second integral inside the bracket also has an upper bound

p1\G(γj)p(𝜸(j))𝑑𝜸(j)<|𝐱i,(j)|T𝝈(j)πϵγj.\int_{\mathbb{R}^{p-1}\backslash G(\gamma_{j})}p(\bm{\gamma}_{(-j)})d\bm{\gamma}_{(-j)}<\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{\pi\epsilon\gamma_{j}}.

In logistic regression, the right hand side of (D.1) is further bounded

f1(ϵγj+𝐱iT𝝁)=eϵγj+𝐱iT𝝁1+eϵγj+𝐱iT𝝁<eϵγj+𝐱iT𝝁,f_{1}(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu})=\frac{e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}}{1+e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}}<e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}},

and hence by (B.7),

E(γj𝐲)+<e𝐱iT𝝁πσjϵ2+|𝐱i,(j)|T𝝈(j)2πϵ.E(\gamma_{j}\mid\mathbf{y})^{+}<\frac{e^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}}{\pi\sigma_{j}\epsilon^{2}}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}.

In probit regression, the function f1()f_{1}(\cdot) in the above derivations equals the standard normal cdf Φ()\Phi(\cdot). By (C.2), for any γj>𝐱iT𝝁/ϵ\gamma_{j}>\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon, we have

Φ(ϵγj+𝐱iT𝝁)<eϵγj+𝐱iT𝝁1+eϵγj+𝐱iT𝝁<eϵγj+𝐱iT𝝁.\Phi\left(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}\right)<\frac{e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}}{1+e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}}<e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}.

Hence for 𝐱iT𝝁/ϵ>0\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon>0 we have an upper bound

E(γj𝐲)+<0γjp(γj)[Φ(ϵγj+𝐱iT𝝁)+|𝐱i,(j)|T𝝈(j)πϵγj]𝑑γj\displaystyle E(\gamma_{j}\mid\mathbf{y})^{+}<\int_{0}^{\infty}\gamma_{j}p(\gamma_{j})\left[\Phi\left(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}\right)+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{\pi\epsilon\gamma_{j}}\right]d\gamma_{j}
=\displaystyle= 0γjp(γj)Φ(ϵγj+𝐱iT𝝁)𝑑γj+|𝐱i,(j)|T𝝈(j)2πϵ\displaystyle\int_{0}^{\infty}\gamma_{j}p(\gamma_{j})\Phi\left(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}\right)d\gamma_{j}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}
=\displaystyle= 0𝐱iT𝝁/ϵγjp(γj)Φ(ϵγj+𝐱iT𝝁)𝑑γj+𝐱iT𝝁/ϵγjp(γj)Φ(ϵγj+𝐱iT𝝁)𝑑γj+|𝐱i,(j)|T𝝈(j)2πϵ\displaystyle\int_{0}^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon}\gamma_{j}p(\gamma_{j})\Phi\left(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}\right)d\gamma_{j}+\int_{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon}^{\infty}\gamma_{j}p(\gamma_{j})\Phi\left(-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}\right)d\gamma_{j}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}
<\displaystyle< 0𝐱iT𝝁/ϵγjp(γj)𝑑γj+𝐱iT𝝁/ϵγjp(γj)eϵγj+𝐱iT𝝁𝑑γj+|𝐱i,(j)|T𝝈(j)2πϵ\displaystyle\int_{0}^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon}\gamma_{j}p(\gamma_{j})d\gamma_{j}+\int_{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon}^{\infty}\gamma_{j}p(\gamma_{j})e^{-\epsilon\gamma_{j}+\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}d\gamma_{j}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}
<\displaystyle< 0𝐱iT𝝁/ϵγjπσj(1+γj2/σj2)𝑑γj+e𝐱iT𝝁0γjeϵγjπσj(1+γj2/σj2)𝑑γj+|𝐱i,(j)|T𝝈(j)2πϵ\displaystyle\int_{0}^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon}\frac{\gamma_{j}}{\pi\sigma_{j}(1+\gamma_{j}^{2}/\sigma_{j}^{2})}d\gamma_{j}+e^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}\int_{0}^{\infty}\frac{\gamma_{j}e^{-\epsilon\gamma_{j}}}{\pi\sigma_{j}(1+\gamma_{j}^{2}/\sigma_{j}^{2})}d\gamma_{j}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}
<\displaystyle< σj2πlog[1+(𝐱iT𝝁ϵσj)2]+e𝐱iT𝝁0γjeϵγjπσj𝑑γj+|𝐱i,(j)|T𝝈(j)2πϵ\displaystyle~\frac{\sigma_{j}}{2\pi}\log\left[1+\left(\frac{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}{\epsilon\sigma_{j}}\right)^{2}\right]+e^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}\int_{0}^{\infty}\frac{\gamma_{j}e^{-\epsilon\gamma_{j}}}{\pi\sigma_{j}}d\gamma_{j}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}
=\displaystyle= σj2πlog[1+(𝐱iT𝝁ϵσj)2]+e𝐱iT𝝁πσjϵ2+|𝐱i,(j)|T𝝈(j)2πϵ.\displaystyle~\frac{\sigma_{j}}{2\pi}\log\left[1+\left(\frac{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}{\epsilon\sigma_{j}}\right)^{2}\right]+\frac{e^{\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}}}{\pi\sigma_{j}\epsilon^{2}}+\frac{|\mathbf{x}_{i^{\prime},(-j)}|^{T}\bm{\sigma}_{(-j)}}{2\pi\epsilon}.

Note that a similar derivation also holds if 𝐱iT𝝁/ϵ<0\mathbf{x}_{i^{\prime}}^{T}\bm{\mu}/\epsilon<0.

On the other hand, if condition (b) is true, we can follow the same modification in Appendix B to find upper bounds in a similar way. ∎

To show Theorem 3, we decompose its proof into two parts: in Appendix E we show that all posterior means exist if there is no separation; then in Appendix F we show that under a multivariate Cauchy prior, none of the posterior means exist if there is complete separation.

E Proof of Theorem 3, in the case of no separation

Proof.

For any j=1,2,,pj=1,2,\ldots,p, to show that E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exists, we just need to show the positive term E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} in (B.1) has an upper bound, because the negative term E(βj𝐲)E(\beta_{j}\mid\mathbf{y})^{-} in (B.1) having a lower bound follows a similar derivation.

When working on E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+}, we only need to consider positive βj\beta_{j}. Denote a new p1p-1 dimensional variable 𝜸=𝜷(j)/βj\bm{\gamma}=\bm{\beta}_{(-j)}/\beta_{j}, then for i=1,2,,ni=1,2,\ldots,n,

𝐱iT𝜷=βj(xi,j+𝐱i,(j)T𝜸).\mathbf{x}_{i}^{T}\bm{\beta}=\beta_{j}\left(x_{i,j}+\mathbf{x}_{i,(-j)}^{T}\bm{\gamma}\right).

If there is no separation, for any 𝜸p1\bm{\gamma}\in\mathbb{R}^{p-1}, there exists at least one i{1,2,,n}i\in\{1,2,\ldots,n\}, such that

xi,j+𝐱i,(j)T𝜸<0, if iA1, or xi,j+𝐱i,(j)T𝜸>0, if iA0.x_{i,j}+\mathbf{x}_{i,(-j)}^{T}\bm{\gamma}<0,\text{ if }i\in A_{1},\text{ or }x_{i,j}+\mathbf{x}_{i,(-j)}^{T}\bm{\gamma}>0,\text{ if }i\in A_{0}. (E.1)

For each i=1,2,,ni=1,2,\ldots,n, denote the vector 𝐳i\mathbf{z}_{i} and the function hi()h_{i}(\cdot) as follows,

𝐳i=def{𝐱i if iA1𝐱i if iA0,hi(𝜸)=defzi,j+𝐳i,(j)T𝜸,\mathbf{z}_{i}\stackrel{{\scriptstyle\text{def}}}{{=}}\begin{cases}\mathbf{x}_{i}&\text{ if }i\in A_{1}\\ -\mathbf{x}_{i}&\text{ if }i\in A_{0}\\ \end{cases},\quad h_{i}(\bm{\gamma})\stackrel{{\scriptstyle\text{def}}}{{=}}z_{i,j}+\mathbf{z}_{i,(-j)}^{T}\bm{\gamma}, (E.2)

then (E.1) can be rewritten as hi(𝜸)<0h_{i}(\bm{\gamma})<0. Denote for i=1,2,,ni=1,2,\ldots,n,

Bi=def{𝜸:hi(𝜸)<0}.B_{i}\stackrel{{\scriptstyle\text{def}}}{{=}}\{\bm{\gamma}:h_{i}(\bm{\gamma})<0\}.

Then each BiB_{i} is a non-empty subset of p1\mathbb{R}^{p-1}, unless zi,j0z_{i,j}\geq 0 and 𝐳i,(j)=𝟎\mathbf{z}_{i,(-j)}=\mathbf{0}. Let ={i:Bi}\mathcal{I}=\{i:B_{i}\neq\emptyset\} denote the set of indices ii for which the corresponding BiB_{i} are non-empty. Because there is no separation,

iBi=p1.\bigcup_{i\in\mathcal{I}}B_{i}=\mathbb{R}^{p-1}. (E.3)

Hence, the set \mathcal{I} is non-empty. We denote its size by q=def||q\stackrel{{\scriptstyle\text{def}}}{{=}}|\mathcal{I}|, and rewrite ={i1,i2,,iq}\mathcal{I}=\{i_{1},i_{2},\ldots,i_{q}\}.

Now we show that there exist positive constants ϵi1,ϵi2,,ϵiq\epsilon_{i_{1}},\epsilon_{i_{2}},\ldots,\epsilon_{i_{q}}, such that

k=1qB~ik=p1,\bigcup_{k=1}^{q}\tilde{B}_{i_{k}}=\mathbb{R}^{p-1}, (E.4)

where

B~ik=def{𝜸:hik(𝜸)<ϵik},\tilde{B}_{i_{k}}\stackrel{{\scriptstyle\text{def}}}{{=}}\{\bm{\gamma}:h_{i_{k}}(\bm{\gamma})<-\epsilon_{i_{k}}\}, (E.5)

are subsets of the corresponding BikB_{i_{k}}, for all k=1,2,,qk=1,2,\ldots,q.

If there exists an iki_{k}\in\mathcal{I} such that zik,j<0z_{i_{k},j}<0 and 𝐳ik,(j)=𝟎\mathbf{z}_{i_{k},(-j)}=\mathbf{0}, then Bik=p1B_{i_{k}}=\mathbb{R}^{p-1}. In this case, we just need to let ϵik=zik,j/2\epsilon_{i_{k}}=-z_{i_{k},j}/2, and ϵir=M\epsilon_{i_{r}}=M, for all rkr\neq k, where MM is an arbitrary positive number. Under this choice of ϵi\epsilon_{i}’s, the sets B~i\tilde{B}_{i}’s defined by (E.5) satisfy (E.4).

If, on the other hand, 𝐳ik,(j)𝟎\mathbf{z}_{i_{k},(-j)}\neq\mathbf{0} for all iki_{k}\in\mathcal{I}, i.e., all BikB_{i_{k}} are open half spaces in p1\mathbb{R}^{p-1}, then we can find the constants ϵi1,ϵi2,,ϵiq\epsilon_{i_{1}},\epsilon_{i_{2}},\ldots,\epsilon_{i_{q}} sequentially. For i1i_{1}, if k=2qBik=p1\bigcup_{k=2}^{q}B_{i_{k}}=\mathbb{R}^{p-1}, we can set ϵi1=M\epsilon_{i_{1}}=M. Then the resulting B~i1\tilde{B}_{i_{1}} defined by (E.5) satisfies

B~i1Bi2Bi3Biq=p1.\tilde{B}_{i_{1}}\cup B_{i_{2}}\cup B_{i_{3}}\cup\cdots\cup B_{i_{q}}=\mathbb{R}^{p-1}. (E.6)

If k=2qBikp1\bigcup_{k=2}^{q}B_{i_{k}}\neq\mathbb{R}^{p-1}, then (E.3) suggests

Bi1(k=2qBik)c=k=2qBikc.B_{i_{1}}\supset\left(\bigcup_{k=2}^{q}B_{i_{k}}\right)^{c}=\bigcap_{k=2}^{q}B_{i_{k}}^{c}. (E.7)

For (E.6) to hold, we just need to find a positive ϵi1\epsilon_{i_{1}} such that the resulting B~i1\tilde{B}_{i_{1}} has k=2qBikc\bigcap_{k=2}^{q}B_{i_{k}}^{c} as a subset, i.e., ϵi1-\epsilon_{i_{1}} should be larger than the maximum of hi1(𝜸)h_{i_{1}}(\bm{\gamma}) over the polyhedral region 𝜸k=2qBikc\bm{\gamma}\in\bigcap_{k=2}^{q}B_{i_{k}}^{c}. Note that maximizing hi1(𝜸)h_{i_{1}}(\bm{\gamma}) over the polyhedron can be represented as a linear programming problem,

maximize zi1,j+𝐳i1,(j)T𝜸\displaystyle z_{i_{1},j}+\mathbf{z}_{i_{1},(-j)}^{T}\bm{\gamma} (E.8)
subject to 𝐳i2,(j)T𝜸zi2,j\displaystyle\mathbf{z}_{i_{2},(-j)}^{T}\bm{\gamma}\geq-z_{i_{2},j}
\displaystyle\vdots
𝐳iq,(j)T𝜸ziq,j.\displaystyle\mathbf{z}_{i_{q},(-j)}^{T}\bm{\gamma}\geq-z_{i_{q},j}.

By Bertsimas and Tsitsiklis (1997, pp. 67, Corollary 2.3), for any linear programming problem over a non-empty polyhedron, including the one in (E.8) to maximize hi1(𝜸)=zi1,j+𝐳i1,(j)T𝜸h_{i_{1}}(\bm{\gamma})=z_{i_{1},j}+\mathbf{z}_{i_{1},(-j)}^{T}\bm{\gamma}, either the optimal hi1(𝜸)=h_{i_{1}}(\bm{\gamma})=\infty, or there exists an optimal solution, 𝜸\bm{\gamma}^{*}. Here, the latter case always occurs, because by (E.7), the maximum of hi1(𝜸)h_{i_{1}}(\bm{\gamma}) over the polyhedron k=2qBikc\bigcap_{k=2}^{q}B_{i_{k}}^{c} does not exceed zero, so it does not go to infinity. Hence, we just need to let

ϵi1=12[max𝜸k=2qBikczi1,j+𝐳i1,(j)T𝜸]=zi1,j+𝐳i1,(j)T𝜸2,\epsilon_{i_{1}}=-\frac{1}{2}\left[\max_{\bm{\gamma}\in\bigcap_{k=2}^{q}B_{i_{k}}^{c}}z_{i_{1},j}+\mathbf{z}_{i_{1},(-j)}^{T}\bm{\gamma}\right]=-\frac{z_{i_{1},j}+\mathbf{z}_{i_{1},(-j)}^{T}\bm{\gamma}^{*}}{2},

so that the resulting B~i1={𝜸:hi1(𝜸)<ϵi1}\tilde{B}_{i_{1}}=\{\bm{\gamma}:h_{i_{1}}(\bm{\gamma})<-\epsilon_{i_{1}}\} contains k=2qBikc\bigcap_{k=2}^{q}B_{i_{k}}^{c} as a subset, which yields (E.6). After finding ϵi1\epsilon_{i_{1}}, we can apply similar procedures sequentially, to find positive values ϵik\epsilon_{i_{k}}, for k=2,3,,qk=2,3,\ldots,q, such that

B~i1B~ikBik+1Biq=p1.\tilde{B}_{i_{1}}\cup\cdots\cup\tilde{B}_{i_{k}}\cup B_{i_{k+1}}\cup\cdots\cup B_{i_{q}}=\mathbb{R}^{p-1}.

After identifying all ϵik\epsilon_{i_{k}}’s, the resulting B~ik\tilde{B}_{i_{k}}’s satisfy (E.4). Note that the choice of the ϵik\epsilon_{i_{k}}’s depends only on the data 𝐳i\mathbf{z}_{i}, i=1,2,,ni=1,2,\ldots,n, so they are constants given the observed data.
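For readers who wish to compute such constants in practice, the linear program in (E.8) can be solved with standard software; the R sketch below (our own illustration, assuming the lpSolve package and a non-empty polyhedron as in the case treated above) writes the free vector 𝜸\bm{\gamma} as a difference of two non-negative vectors, since lpSolve assumes non-negative decision variables.

library(lpSolve)
# z1: the vector z_{i_1} (length p); Z_rest: a matrix whose rows are z_{i_2}, ..., z_{i_q}.
find_epsilon <- function(z1, Z_rest, j) {
  c_obj <- z1[-j]                              # objective coefficients z_{i_1,(-j)}
  A     <- Z_rest[, -j, drop = FALSE]          # constraint rows z_{i_k,(-j)}^T
  b     <- -Z_rest[, j]                        # right-hand sides -z_{i_k,j}
  sol <- lp(direction    = "max",              # maximize over gamma = u - v, with u, v >= 0
            objective.in = c(c_obj, -c_obj),
            const.mat    = cbind(A, -A),
            const.dir    = rep(">=", length(b)),
            const.rhs    = b)
  -(z1[j] + sol$objval) / 2                    # epsilon_{i_1} as defined above
}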

For each k=1,2,,qk=1,2,\ldots,q, next we show that for any 𝜸B~ik\bm{\gamma}\in\tilde{B}_{i_{k}}, the likelihood function of the iki_{k}th observation is bounded above by (βjϵike)1(\beta_{j}\epsilon_{i_{k}}e)^{-1}. This is because in a logistic regression, if ikA1i_{k}\in A_{1}, then

p(yikβj,𝜸)=f1(βjhik(𝜸))=eβjhik(𝜸)1+eβjhik(𝜸)<eβjhik(𝜸)<eβjϵik1βjϵike,p(y_{i_{k}}\mid\beta_{j},\bm{\gamma})=f_{1}\left(\beta_{j}h_{i_{k}}(\bm{\gamma})\right)=\frac{e^{\beta_{j}h_{i_{k}}(\bm{\gamma})}}{1+e^{\beta_{j}h_{i_{k}}(\bm{\gamma})}}<e^{\beta_{j}h_{i_{k}}(\bm{\gamma})}<e^{-\beta_{j}\epsilon_{i_{k}}}\leq\frac{1}{\beta_{j}\epsilon_{i_{k}}e}, (E.9)

if ikA0i_{k}\in A_{0}, then

p(yikβj,𝜸)=f0(βjhik(𝜸))=11+eβjhik(𝜸)<eβjhik(𝜸)<eβjϵik1βjϵike.p(y_{i_{k}}\mid\beta_{j},\bm{\gamma})=f_{0}\left(-\beta_{j}h_{i_{k}}(\bm{\gamma})\right)=\frac{1}{1+e^{-\beta_{j}h_{i_{k}}(\bm{\gamma})}}<e^{\beta_{j}h_{i_{k}}(\bm{\gamma})}<e^{-\beta_{j}\epsilon_{i_{k}}}\leq\frac{1}{\beta_{j}\epsilon_{i_{k}}e}. (E.10)

Here, the last inequality holds because ete1te^{-t}\leq\frac{e^{-1}}{t} for any t>0t>0. By (C.2), in a probit regression model, the inequalities (E.9) and (E.10) also hold.

Now we show that the positive term E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} has a finite upper bound.

E(βj𝐲)+\displaystyle E(\beta_{j}\mid\mathbf{y})^{+} =0βjp1p(𝐲βj,𝜸)p(βj,𝜸)𝑑𝜸𝑑βj\displaystyle=\int_{0}^{\infty}\beta_{j}\int_{\mathbb{R}^{p-1}}p(\mathbf{y}\mid\beta_{j},\bm{\gamma})p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
0βjk=1qB~ikp(𝐲βj,𝜸)p(βj,𝜸)𝑑𝜸𝑑βj\displaystyle\leq\int_{0}^{\infty}\beta_{j}\sum_{k=1}^{q}\int_{\tilde{B}_{i_{k}}}p(\mathbf{y}\mid\beta_{j},\bm{\gamma})p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
<0βjk=1qB~ikp(yikβj,𝜸)p(βj,𝜸)𝑑𝜸𝑑βj\displaystyle<\int_{0}^{\infty}\beta_{j}\sum_{k=1}^{q}\int_{\tilde{B}_{i_{k}}}p(y_{i_{k}}\mid\beta_{j},\bm{\gamma})p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
0βjk=1qB~ik1βjϵikep(βj,𝜸)𝑑𝜸𝑑βj\displaystyle\leq\int_{0}^{\infty}\beta_{j}\sum_{k=1}^{q}\int_{\tilde{B}_{i_{k}}}\frac{1}{\beta_{j}\epsilon_{i_{k}}e}p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
=k=1q1ϵike0B~ikp(βj,𝜸)𝑑𝜸𝑑βj\displaystyle=\sum_{k=1}^{q}\frac{1}{\epsilon_{i_{k}}e}\int_{0}^{\infty}\int_{\tilde{B}_{i_{k}}}p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
k=1q1ϵike0p1p(βj,𝜸)𝑑𝜸𝑑βj<k=1q1ϵike.\displaystyle\leq\sum_{k=1}^{q}\frac{1}{\epsilon_{i_{k}}e}\int_{0}^{\infty}\int_{\mathbb{R}^{p-1}}p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}<\sum_{k=1}^{q}\frac{1}{\epsilon_{i_{k}}e}.

Note that in Appendix E, the specific formula of the prior density of 𝜷\bm{\beta} is not used. Therefore, if there is no separation in logistic or probit regression, posterior means of all coefficients exist under all proper prior distributions.

F Proof of Theorem 3, in the case of complete separation

Proof.

Here we show that if there is complete separation, then none of the posterior means E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) exist, for j=1,2,,pj=1,2,\ldots,p. Using the notation 𝐳i\mathbf{z}_{i}, defined in (E.2), we rewrite the set of all vectors satisfying the complete separation condition (2) as

𝒞=i=1n{𝜷p:𝐳iT𝜷>0}.\mathcal{C}=\bigcap_{i=1}^{n}\left\{\bm{\beta}\in\mathbb{R}^{p}:\mathbf{z}_{i}^{T}\bm{\beta}>0\right\}.

According to Albert and Anderson (1984), 𝒞\mathcal{C} is a convex cone; moreover, if 𝜷𝒞\bm{\beta}\in\mathcal{C}, then 𝜷+𝜹𝒞\bm{\beta}+\bm{\delta}\in\mathcal{C} for any 𝜹p\bm{\delta}\in\mathbb{R}^{p} that is small enough. Hence, the open set 𝒞\mathcal{C}, as a subset of the p\mathbb{R}^{p} Euclidean space, has positive Lebesgue measure.

To show that E(βj𝐲)E(\beta_{j}\mid\mathbf{y}) does not exist, we argue as follows: if 𝒞\mathcal{C} projects onto the positive half of the βj\beta_{j} axis, we will show that E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} diverges; otherwise, we will show that E(βj𝐲)E(\beta_{j}\mid\mathbf{y})^{-} diverges (if 𝒞\mathcal{C} projects onto both halves, showing either is sufficient). Now we assume that the former is true, and denote the intersection

𝒞j+=def𝒞{𝜷p:βj>0},\mathcal{C}_{j}^{+}\stackrel{{\scriptstyle\text{def}}}{{=}}\mathcal{C}\cap\{\bm{\beta}\in\mathbb{R}^{p}:\beta_{j}>0\},

which is again an open convex cone. Since 𝒞j+\mathcal{C}_{j}^{+} has positive measure in p\mathbb{R}^{p}, under the change of variable from (βj,𝜷(j))(\beta_{j},\bm{\beta}_{(-j)}) to (βj,𝜸)(\beta_{j},\bm{\gamma}), where 𝜷(j)=βj𝜸\bm{\beta}_{(-j)}=\beta_{j}\bm{\gamma}, there exists an open set 𝒞~j+p1\tilde{\mathcal{C}}_{j}^{+}\in\mathbb{R}^{p-1} such that 𝒞j+\mathcal{C}_{j}^{+} can be written as

𝒞j+={(βj,βj𝜸):βj>0,𝜸𝒞~j+}.\mathcal{C}_{j}^{+}=\{(\beta_{j},\beta_{j}\bm{\gamma}):\beta_{j}>0,\bm{\gamma}\in\tilde{\mathcal{C}}_{j}^{+}\}.

Suppose that 𝜸\bm{\gamma} can be written as (γ1,,γj1,γj+1,,γp)T(\gamma_{1},\ldots,\gamma_{j-1},\gamma_{j+1},\ldots,\gamma_{p})^{T}. We define a variant of it by 𝜸~=def(γ1,,γj1,1,γj+1,,γp)T\tilde{\bm{\gamma}}\stackrel{{\scriptstyle\text{def}}}{{=}}(\gamma_{1},\ldots,\gamma_{j-1},1,\gamma_{j+1},\ldots,\gamma_{p})^{T}, such that 𝜷=βj𝜸~\bm{\beta}=\beta_{j}\tilde{\bm{\gamma}}. Under the multivariate Cauchy prior (8), the induced prior distribution of (βj,𝜸)(\beta_{j},\bm{\gamma}) is

p(βj,𝜸)\displaystyle p(\beta_{j},\bm{\gamma}) βjp1[1+(βj𝜸~𝝁)T𝚺1(βj𝜸~𝝁)]p+12\displaystyle\propto\frac{\beta_{j}^{p-1}}{\left[1+\left(\beta_{j}\tilde{\bm{\gamma}}-\bm{\mu}\right)^{T}\bm{\Sigma}^{-1}\left(\beta_{j}\tilde{\bm{\gamma}}-\bm{\mu}\right)\right]^{\frac{p+1}{2}}}
=βjp1[(𝜸~T𝚺1𝜸~)βj22(𝜸~T𝚺1𝝁)βj+(𝝁T𝚺1𝝁+1)]p+12.\displaystyle=\frac{\beta_{j}^{p-1}}{\left[\left(\tilde{\bm{\gamma}}^{T}\bm{\Sigma}^{-1}\tilde{\bm{\gamma}}\right)\beta_{j}^{2}-2\left(\tilde{\bm{\gamma}}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)\beta_{j}+\left(\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}+1\right)\right]^{\frac{p+1}{2}}}.

Inside 𝒞~j+\tilde{\mathcal{C}}_{j}^{+}, there must exist a closed rectangular box, denoted by 𝒟~j+={𝜸:γk[lk,uk],k=1,,j1,j+1,,p}𝒞~j+\tilde{\mathcal{D}}_{j}^{+}=\{\bm{\gamma}:\gamma_{k}\in[l_{k},u_{k}],k=1,\ldots,j-1,j+1,\ldots,p\}\subset\tilde{\mathcal{C}}_{j}^{+}. By Browder (1996, pp. 142, Corollary 6.57), a continuous function attains its maximum and minimum on a compact set. Since 𝒟~j+\tilde{\mathcal{D}}_{j}^{+} is a compact set (closed and bounded in p1\mathbb{R}^{p-1}),

a=defmax𝜸𝒟~j+𝜸~T𝚺1𝜸~,b=defmin𝜸𝒟~j+𝜸~T𝚺1𝝁a\stackrel{{\scriptstyle\text{def}}}{{=}}\max_{\bm{\gamma}\in\tilde{\mathcal{D}}_{j}^{+}}\tilde{\bm{\gamma}}^{T}\bm{\Sigma}^{-1}\tilde{\bm{\gamma}},\quad b\stackrel{{\scriptstyle\text{def}}}{{=}}\min_{\bm{\gamma}\in\tilde{\mathcal{D}}_{j}^{+}}\tilde{\bm{\gamma}}^{T}\bm{\Sigma}^{-1}\bm{\mu} (F.1)

both exist.

Recall that for all 𝜸𝒞~j+\bm{\gamma}\in\tilde{\mathcal{C}}_{j}^{+} (hence including all elements in 𝒟~j+\tilde{\mathcal{D}}_{j}^{+}), zi,j+𝐳i,(j)T𝜸>0z_{i,j}+\mathbf{z}_{i,(-j)}^{T}\bm{\gamma}>0; equivalently, if iA1i\in A_{1}, then xi,j+𝐱i,(j)T𝜸>0x_{i,j}+\mathbf{x}_{i,(-j)}^{T}\bm{\gamma}>0, and if iA0i\in A_{0}, then xi,j+𝐱i,(j)T𝜸<0x_{i,j}+\mathbf{x}_{i,(-j)}^{T}\bm{\gamma}<0. Now we show that in both logistic and probit regressions, the positive term E(βj𝐲)+E(\beta_{j}\mid\mathbf{y})^{+} diverges.

E(βj𝐲)+=0βjp1p(βj,𝜸)p(𝐲βj,𝜸)𝑑𝜸𝑑βj\displaystyle E(\beta_{j}\mid\mathbf{y})^{+}=\int_{0}^{\infty}\beta_{j}\int_{\mathbb{R}^{p-1}}p(\beta_{j},\bm{\gamma})p(\mathbf{y}\mid\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
\displaystyle\geq 0βj𝒟~j+p(βj,𝜸)iA1f1(βj(xi,j+𝐱i,(j)T𝜸))kA0f0(βj(xk,j+𝐱k,(j)T𝜸))d𝜸dβj\displaystyle\int_{0}^{\infty}\beta_{j}\int_{\tilde{\mathcal{D}}_{j}^{+}}p(\beta_{j},\bm{\gamma})\prod_{i\in A_{1}}f_{1}(\beta_{j}(x_{i,j}+\mathbf{x}_{i,(-j)}^{T}\bm{\gamma}))\prod_{k\in A_{0}}f_{0}(\beta_{j}(x_{k,j}+\mathbf{x}_{k,(-j)}^{T}\bm{\gamma}))d\bm{\gamma}d\beta_{j}
\displaystyle\geq 0βj𝒟~j+p(βj,𝜸)iA1f1(0)kA0f0(0)d𝜸dβj\displaystyle\int_{0}^{\infty}\beta_{j}\int_{\tilde{\mathcal{D}}_{j}^{+}}p(\beta_{j},\bm{\gamma})\prod_{i\in A_{1}}f_{1}(0)\prod_{k\in A_{0}}f_{0}(0)d\bm{\gamma}d\beta_{j}
=\displaystyle= (12)n0βj𝒟~j+p(βj,𝜸)𝑑𝜸𝑑βj\displaystyle\left(\frac{1}{2}\right)^{n}\int_{0}^{\infty}\beta_{j}\int_{\tilde{\mathcal{D}}_{j}^{+}}p(\beta_{j},\bm{\gamma})d\bm{\gamma}d\beta_{j}
=\displaystyle= C0βj𝒟~j+βjp1[(𝜸~T𝚺1𝜸~)βj22(𝜸~T𝚺1𝝁)βj+(𝝁T𝚺1𝝁+1)]p+12𝑑𝜸𝑑βj\displaystyle~C\int_{0}^{\infty}\beta_{j}\int_{\tilde{\mathcal{D}}_{j}^{+}}\frac{\beta_{j}^{p-1}}{\left[\left(\tilde{\bm{\gamma}}^{T}\bm{\Sigma}^{-1}\tilde{\bm{\gamma}}\right)\beta_{j}^{2}-2\left(\tilde{\bm{\gamma}}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)\beta_{j}+\left(\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}+1\right)\right]^{\frac{p+1}{2}}}d\bm{\gamma}d\beta_{j}
\displaystyle\geq C0𝒟~j+βjp[aβj22bβj+(𝝁T𝚺1𝝁+1)]p+12𝑑𝜸𝑑βj\displaystyle~C\int_{0}^{\infty}\int_{\tilde{\mathcal{D}}_{j}^{+}}\frac{\beta_{j}^{p}}{\left[a\beta_{j}^{2}-2b\beta_{j}+\left(\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}+1\right)\right]^{\frac{p+1}{2}}}d\bm{\gamma}d\beta_{j}
=\displaystyle= C[k=1p1(uklk)]0βjp[a(βjb/a)2+c]p+12𝑑βj\displaystyle~C\left[\prod_{k=1}^{p-1}(u_{k}-l_{k})\right]\int_{0}^{\infty}\frac{\beta_{j}^{p}}{\left[a(\beta_{j}-b/a)^{2}+c\right]^{\frac{p+1}{2}}}d\beta_{j} (F.2)
=\displaystyle= ,\displaystyle~\infty,

where CC is a positive constant, aa and bb have been defined previously in (F.1), and cc is the constant produced by completing the square in βj\beta_{j}; the final equality follows from the tail comparison sketched below.
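For completeness, a minimal sketch of why the last integral in (F.2) is infinite, using only the definitions of aa and bb in (F.1): completing the square in βj\beta_{j} gives

a\beta_{j}^{2}-2b\beta_{j}+\left(\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}+1\right)=a\left(\beta_{j}-b/a\right)^{2}+c,\qquad c=\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}+1-\frac{b^{2}}{a},

and, since a>0a>0 (because \bm{\Sigma}^{-1} is positive definite and \tilde{\bm{\gamma}} is never the zero vector),

\frac{\beta_{j}^{p}}{\left[a\left(\beta_{j}-b/a\right)^{2}+c\right]^{\frac{p+1}{2}}}\sim\frac{1}{a^{\frac{p+1}{2}}\beta_{j}}\quad\text{as }\beta_{j}\to\infty,

so the integrand is eventually bounded below by a positive multiple of 1/βj1/\beta_{j}, and \int_{M}^{\infty}\beta_{j}^{-1}\,d\beta_{j}=\infty for any finite M>0M>0.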

On the other hand, if the set 𝒞\mathcal{C} of complete separation vectors projects only onto the negative half of the βj\beta_{j} axis, a similar derivation shows that E(βj𝐲)E(\beta_{j}\mid\mathbf{y})^{-} diverges to -\infty.

References

  • Albert and Anderson (1984) Albert, A. and Anderson, J. A. (1984). “On the Existence of Maximum Likelihood Estimates in Logistic Regression Models.” Biometrika, 71(1): 1–10.
  • Bardenet et al. (2014) Bardenet, R., Doucet, A., and Holmes, C. (2014). “Towards Scaling up Markov Chain Monte Carlo: An Adaptive Subsampling Approach.” Proceedings of the 31st International Conference on Machine Learning (ICML-14), 405–413.
  • Bertsimas and Tsitsiklis (1997) Bertsimas, D. and Tsitsiklis, J. N. (1997). Introduction to Linear Optimization. Athena Scientific, Belmont, MA.
  • Bickel and Doksum (2001) Bickel, P. J. and Doksum, K. A. (2001). Mathematical Statistics, volume I. Prentice Hall, Englewood Cliffs, NJ.
  • Blattenberger and Lad (1985) Blattenberger, G. and Lad, F. (1985). “Separating the Brier Score into Calibration and Refinement Components: A Graphical Exposition.” The American Statistician, 39(1): 26–32.
  • Brier (1950) Brier, G. W. (1950). “Verification of Forecasts Expressed in Terms of Probability.” Monthly Weather Review, 78: 1–3.
  • Browder (1996) Browder, A. (1996). Mathematical Analysis: An Introduction. Springer-Verlag.
  • Carpenter et al. (2016) Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2016). “Stan: A Probabilistic Programming Language.” Journal of Statistical Software, in press.
  • Casella and Berger (1990) Casella, G. and Berger, R. L. (1990). Statistical Inference. Duxbury Press.
  • Chen and Shao (2001) Chen, M.-H. and Shao, Q.-M. (2001). “Propriety of Posterior Distribution for Dichotomous Quantal Response Models.” Proceedings of the American Mathematical Society, 129(1): 293–302.
  • Choi and Hobert (2013) Choi, H. M. and Hobert, J. P. (2013). “The Polya-Gamma Gibbs Sampler for Bayesian Logistic Regression is Uniformly Ergodic.” Electronic Journal of Statistics, 7: 2054–2064.
  • Chopin and Ridgway (2015) Chopin, N. and Ridgway, J. (2015). “Leave Pima Indians Alone: Binary Regression as a Benchmark for Bayesian Computation.” arxiv.org.
  • Clogg et al. (1991) Clogg, C. C., Rubin, D. B., Schenker, N., Schultz, B., and Weidman, L. (1991). “Multiple Imputation of Industry and Occupation Codes in Census Public-Use Samples Using Bayesian Logistic Regression.” Journal of the American Statistical Association, 86(413): 68–78.
  • Dawid (1973) Dawid, A. P. (1973). “Posterior Expectations for Large Observations.” Biometrika, 60: 664–666.
  • Fernández and Steel (2000) Fernández, C. and Steel, M. F. (2000). “Bayesian Regression Analysis with Scale Mixtures of Normals.” Econometric Theory, 16: 80–101.
  • Firth (1993) Firth, D. (1993). “Bias Reduction of Maximum Likelihood Estimates.” Biometrika, 80(1): 27–38.
  • Fouskakis et al. (2009) Fouskakis, D., Ntzoufras, I., and Draper, D. (2009). “Bayesian Variable Selection Using Cost-Adjusted BIC, with Application to Cost-Effective Measurement of Quality of Health Care.” The Annals of Applied Statistics, 3(2): 663–690.
  • Gelman et al. (2008) Gelman, A., Jakulin, A., Pittau, M., and Su, Y. (2008). “A Weakly Informative Default Prior Distribution for Logistic and Other Regression Models.” The Annals of Applied Statistics, 2(4): 1360–1383.
  • Gelman et al. (2015) Gelman, A., Su, Y.-S., Yajima, M., Hill, J., Pittau, M. G., Kerman, J., Zheng, T., and Dorie, V. (2015). arm: Data Analysis Using Regression and Multilevel/Hierarchical Models. R package version 1.8-5.
    URL http://CRAN.R-project.org/package=arm
  • Ghosh and Clyde (2011) Ghosh, J. and Clyde, M. A. (2011). “Rao-Blackwellization for Bayesian Variable Selection and Model Averaging in Linear and Binary Regression: A Novel Data Augmentation Approach.” Journal of the American Statistical Association, 106(495): 1041–1052.
  • Ghosh et al. (2011) Ghosh, J., Herring, A. H., and Siega-Riz, A. M. (2011). “Bayesian Variable Selection for Latent Class Models.” Biometrics, 67: 917–925.
  • Ghosh and Reiter (2013) Ghosh, J. and Reiter, J. P. (2013). “Secure Bayesian Model Averaging for Horizontally Partitioned Data.” Statistics and Computing, 23: 311–322.
  • Gramacy and Polson (2012) Gramacy, R. B. and Polson, N. G. (2012). “Simulation-Based Regularized Logistic Regression.” Bayesian Analysis, 7(3): 567–590.
  • Hanson et al. (2014) Hanson, T. E., Branscum, A. J., and Johnson, W. O. (2014). “Informative g-Priors for Logistic Regression.” Bayesian Analysis, 9(3): 597–612.
  • Heinze (2006) Heinze, G. (2006). “A Comparative Investigation of Methods for Logistic Regression with Separated or Nearly Separated Data.” Statistics in Medicine, 25: 4216–4226.
  • Heinze and Schemper (2002) Heinze, G. and Schemper, M. (2002). “A Solution to the Problem of Separation in Logistic Regression.” Statistics in Medicine, 21: 2409–2419.
  • Hoffman and Gelman (2014) Hoffman, M. D. and Gelman, A. (2014). “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” The Journal of Machine Learning Research, 15(1): 1593–1623.
  • Holmes and Held (2006) Holmes, C. C. and Held, L. (2006). “Bayesian Auxiliary Variable Models for Binary and Multinomial Regression.” Bayesian Analysis, 1(1): 145–168.
  • Ibrahim and Laud (1991) Ibrahim, J. G. and Laud, P. W. (1991). “On Bayesian Analysis of Generalized Linear Models using Jeffreys’s Prior.” Journal of the American Statistical Association, 86(416): 981–986.
  • Jeffreys (1961) Jeffreys, H. (1961). Theory of Probability. Oxford Univ. Press.
  • Kurgan et al. (2001) Kurgan, L., Cios, K., Tadeusiewicz, R., Ogiela, M., and Goodenday, L. (2001). “Knowledge Discovery Approach to Automated Cardiac SPECT Diagnosis.” Artificial Intelligence in Medicine, 23(2): 149–169.
  • Li and Clyde (2015) Li, Y. and Clyde, M. A. (2015). “Mixtures of gg-Priors in Generalized Linear Models.” arxiv.org.
  • Liu (2004) Liu, C. (2004). “Robit Regression: A Simple Robust Alternative to Logistic and Probit Regression.” In Gelman, A. and Meng, X. (eds.), Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, 227–238. Wiley, London.
  • McCullagh and Nelder (1989) McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall.
  • Mitra and Dunson (2010) Mitra, R. and Dunson, D. B. (2010). “Two Level Stochastic Search Variable Selection in GLMs with Missing Predictors.” International Journal of Biostatistics, 6(1): Article 33.
  • Neal (2003) Neal, R. M. (2003). “Slice Sampling.” The Annals of Statistics, 31(3): 705–767.
  • Neal (2011) — (2011). “MCMC using Hamiltonian Dynamics.” In Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (eds.), Handbook of Markov Chain Monte Carlo. Chapman & Hall / CRC Press.
  • O’Brien and Dunson (2004) O’Brien, S. M. and Dunson, D. B. (2004). “Bayesian Multivariate Logistic Regression.” Biometrics, 60(3): 739–746.
  • Polson et al. (2013) Polson, N. G., Scott, J. G., and Windle, J. (2013). “Bayesian Inference for Logistic Models Using Pólya-Gamma Latent Variables.” Journal of the American Statistical Association, 108(504): 1339–1349.
  • Rousseeuw and Christmann (2003) Rousseeuw, P. J. and Christmann, A. (2003). “Robustness Against Separation and Outliers in Logistic Regression.” Computational Statistics and Data Analysis, 42: 315–332.
  • Sabanés Bové and Held (2011) Sabanés Bové, D. and Held, L. (2011). “Hyper-gg Priors for Generalized Linear Models.” Bayesian Analysis, 6(3): 387–410.
  • Speckman et al. (2009) Speckman, P. L., Lee, J., and Sun, D. (2009). “Existence of the MLE and Propriety of Posteriors for a General Multinomial Choice Model.” Statistica Sinica, 19: 731–748.
  • Yang and Berger (1996) Yang, R. and Berger, J. O. (1996). “A Catalog of Noninformative Priors.” Institute of Statistics and Decision Sciences, Duke University.
  • Zellner and Siow (1980) Zellner, A. and Siow, A. (1980). “Posterior Odds Ratios for Selected Regression Hypotheses.” In Bayesian Statistics: Proceedings of the First International Meeting Held in Valencia (Spain), 585–603. Valencia, Spain: University of Valencia Press.
  • Zorn (2005) Zorn, C. (2005). “A Solution to Separation in Binary Response Models.” Political Analysis, 13(2): 157–170.