This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Bayesian Distance Weighted Discrimination

\nameEric F. Lock \emailelock@umn.edu
\addrDivision of Biostatistics
School of Public Health
University of Minnesota
Minneapolis, MN 55446, USA
Abstract

Distance weighted discrimination (DWD) is a linear discrimination method that is particularly well-suited for classification tasks with high-dimensional data. The DWD coefficients minimize an intuitive objective function, which can solved very efficiently using state-of-the-art optimization techniques. However, DWD has not yet been cast into a model-based framework for statistical inference. In this article we show that DWD identifies the mode of a proper Bayesian posterior distribution, that results from a particular link function for the class probabilities and a shrinkage-inducing proper prior distribution on the coefficients. We describe a relatively efficient Markov chain Monte Carlo (MCMC) algorithm to simulate from the true posterior under this Bayesian framework. We show that the posterior is asymptotically normal and derive the mean and covariance matrix of its limiting distribution. Through several simulation studies and an application to breast cancer genomics we demonstrate how the Bayesian approach to DWD can be used to (1) compute well-calibrated posterior class probabilities, (2) assess uncertainty in the DWD coefficients and resulting sample scores, (3) improve power via semi-supervised analysis when not all class labels are available, and (4) automatically determine a penalty tuning parameter within the model-based framework. R code to perform Bayesian DWD is available at https://github.com/lockEF/BayesianDWD.

Keywords: Cancer genomics, distance weighted discrimination, high-dimensional data, probabilistic classification, support vector machines.

1 Introduction

High-dimensional data occur when a massive number of features are available for each unit of observation in a sample set. Such data are now very common across a variety of application areas, and this has motivated a large number of data analysis methods that were developed specifically for the high-dimensional context. For supervised analysis, where the task is to predict or describe an outcome, methods that are appropriate for the high-dimensional context often begin with an objective function to be optimized but lack a clear model-based framework for statistical inference. One reason for this is because fitting statistical models directly via (e.g.) maximum likelihood is prone to over-fitting and lack of identifiability in the high-dimensional context, especially when the number of features is larger than the sample size. Thus, objective functions that admit more regularization are required, and these often do not have a model-based interpretation in the frequentist framework. For example, a common approach in the high-dimensional context is to incorporate penalty terms that enforce shrinkage or sparsity (e.g., l2l_{2} or l1l_{1} penalization) within a classical predictive model (e.g., generalized linear models).

A Bayesian framework provides more flexibility for model-based supervised analysis of high-dimensional data, as appropriate regularization can be induced through the specified prior distribution. Indeed, there are a number of instances in which optimizing a commonly used penalized objective function was later shown to give the mode of a posterior distribution under a particular Bayesian model. For example, a Bayesian linear model with a normally distributed outcome and a normal prior on the coefficients corresponds to solving an l2l_{2} penalized least squares objective (i.e., ridge regression) (Hsiang, 1975). Similarly, the mode of the posterior under a Bayesian framework with a double-exponential prior on the model coefficients corresponds to l1l_{1} penalization (i.e., the lasso) (Park and Casella, 2008). Such equivalences can be very useful to better understand and interpret the original optimization problem, and also provide a potential framework for statistical inference that is based on the original objective.

In this article we introduce a Bayesian formulation for Distance Weighted Discrimination (DWD) (Marron et al., 2007). DWD is a popular approach for supervised analysis with a binary outcome and high-dimensional data. It aims to identify a linear combination of the features that distinguish the sample groups corresponding to the binary outcomes (i.e., projections onto a separating hyperplane); in this respect, it is comparable to various versions of the Support Vector Machine (SVM) (Cortes and Vapnik, 1995) or penalized Fisher’s Linear Discriminant Analysis (LDA) (Witten and Tibshirani, 2011). However, relative to its competitors, DWD often has superior generalizability for settings in which the sample size is small relative to the dimension, i.e., the high dimension low sample size (HDLSS) context. In its original formulation, DWD minimizes the inverse distances of the observed units from the separating hyperplane with a penalty term. This circumvents the “data piling” issue of SVM, wherein projections tend to pile up at the margins of the separating hyperplane for HDLSS data as a symptom of over-fitting and lack of generalizability. Since its inception, many other versions of DWD have been developed, including extensions that allow more than two classes (Huang et al., 2013), sparsity (Wang and Zou, 2016), non-linear kernels (Wang and Zou, 2019), multi-way (i.e., tensor) data (Lyu et al., 2017), and sample weighting for unbalanced data (Qiao et al., 2010). These useful extensions modify the optimization task used for DWD, but remain model-free. Resampling methods such as permutation testing (Wei et al., 2016), cross-validation, and bootstrapping (Lyu et al., 2017) have been used with DWD to address inferential questions that involve uncertainty. However, while various model-based versions of SVM have been proposed (Sollich, 2002; Henao et al., 2014), DWD is yet to be cast into a fully specified probability model.

We show that DWD identifies the mode of a proper Bayesian posterior distribution. The corresponding density of the posterior distribution is a monotone function of the DWD objective. The underlying model is given by a particular link function for the class probabilities and a shrinkage-inducing proper prior distribution on the model coefficients. In addition to providing a model-based context for DWD, we demonstrate how this Bayesian framework can be used to extend and enhance present applications of DWD in four specific ways:

  1. 1.

    Posterior class inclusion probabilities can be used for “soft” classification, and we find that these probabilities are well-calibrated across different contexts.

  2. 2.

    Posterior credible intervals can be used to visualize uncertainty in the coefficients and the sample scores (i.e., projections).

  3. 3.

    Semi-supervised analyses, in which only some of the class labels are observed in the training data, fit naturally into the Bayesian framework and can improve power.

  4. 4.

    A penalty tuning parameter for the DWD objective can be subsumed within the Bayesian framework and given its own prior, to allow for model-based selection of this parameter.

We describe a relatively efficient Markov chain Monte Carlo (MCMC) algorithm to simulate from the posterior distribution. Alternatively, we show that the posterior is asymptotically multivariate normal; we derive the asymptotic mean and covariance, which are readily computable using existing software.

In what follows, we introduce our formal framework and notation in Section 2, formally describe the DWD objective in Section 3, and then present and discuss the Bayesian model and its estimation in Sections 4, 5, 6, and 7. We present simulation results to illustrate and validate different aspects of the approach in Section 8, and we present an application to subtype classification in breast cancer genomics in Section 9.

2 Framework and notation

Throughout this article bold lower-case characters denote vectors, bold upper-case characters denote matrices, and greek characters denote unknown model parameters. Probability density functions are denoted by p()p(\cdot), and discrete probabilities by P()P(\cdot).

Let 𝐱i{\bf x}_{i} be a dd-dimensional vector and yi{1,1}y_{i}\in\{-1,1\} be a corresponding class indicator for sampled observations i=1,,ni=1,\ldots,n. Let 𝐗\mathbf{X} be the d×nd\times n matrix 𝐗=[𝐱1𝐱2𝐱n]\mathbf{X}=[{\bf x}_{1}\;{\bf x}_{2}\ldots{\bf x}_{n}]. Let 𝐲\mathbf{y} be the nn-dimensional vector of outcomes 𝐲=[y1y2yn]\mathbf{y}=[y_{1}\;y_{2}\ldots y_{n}]. Our task is to predict the outcomes yiy_{i} from the data 𝐱i{\bf x}_{i}.

3 Distance weighted discrimination

Distance weighted discrimination (DWD) identifies coefficients (weights) 𝜷=[β1β2βd]{\bm{\beta}}=[\beta_{1}\;\beta_{2}\ldots\beta_{d}] and an intercept β0\beta_{0}\in\mathbb{R} such that the linear combinations (scores) β0+𝐱iT𝜷\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}} discriminate the classes. The optimization problem for DWD was originally formulated as

argminβ0,𝜷,𝝃i=1n1ri+C1nξi, where ri=yi(β0+𝐱iT𝜷)+ξi\underset{\beta_{0},{\bm{\beta}},\bm{\xi}}{\mbox{argmin}}\sum_{i=1}^{n}\frac{1}{r_{i}}+C\sum_{1}^{n}\xi_{i},\text{ where }r_{i}=y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})+\xi_{i} (1)

subject to ri>0r_{i}>0 and ξi>0\xi_{i}>0 for i=1,,ni=1,\ldots,n and 𝜷221||{\bm{\beta}}||_{2}^{2}\leq 1. In practice the classification decision rule is determined by the sign of β0+𝐱iT𝜷\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}}, and thus the ξi\xi_{i} can be interpreted as a penalty for misclassification. The tuning parameter CC controls the relative weight of this misclassification penalty compared to the inverse distances from the separating hyperplane 1/ri1/r_{i}. An appealing property of DWD is that its objective accounts for the projections of all the data observations, rather than just those at the margins. Liu et al. (2011) show that the DWD objective (1) is equivalent to the following “loss+penalty” formulation

argminβ0,𝜷1ni=1nV(yi(β0+𝐱iT𝜷))+λ2𝜷22,\displaystyle\underset{\beta_{0},{\bm{\beta}}}{\mbox{argmin}}\frac{1}{n}\sum_{i=1}^{n}V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)+\frac{\lambda}{2}||{\bm{\beta}}||^{2}_{2}, (2)

where λ\lambda is a penalty tuning parameter, and

V(u)={1u if u1/2,1/(4u) if u>1/2.\displaystyle V(u)=\begin{cases}1-u&\text{ if }u\leq 1/2,\\ 1/(4u)&\text{ if }u>1/2.\end{cases} (3)

The objectives (1) and (2) are equivalent in that, with a one-to-one correspondence between the penalty parameters λ\lambda and CC, the estimated coefficients and resulting sample scores are proportional.

4 Bayesian model

Here we show that the DWD objective (2) gives the mode of a posterior distribution for {β0,𝜷}\{\beta_{0},{\bm{\beta}}\} under a Bayesian framework. Let ψ()\psi(\cdot) give the objective function that is minimized by DWD,

ψ(β0,𝜷,𝐲,𝐗,λ)=1ni=1nV(yi(β0+𝐱iT𝜷))+λ2𝜷22.\psi(\beta_{0},{\bm{\beta}},\mathbf{y},\mathbf{X},\lambda)=\frac{1}{n}\sum_{i=1}^{n}V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)+\frac{\lambda}{2}||{\bm{\beta}}||^{2}_{2}.

The posterior density for {β0,𝜷}\{\beta_{0},{\bm{\beta}}\} is the following monotone transformation of ψ\psi,

p(𝜷,β0𝐗,𝐲,λ)\displaystyle p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda) enψ(β0,𝜷,𝐲,𝐗,λ)\displaystyle\propto e^{-n\psi(\beta_{0},{\bm{\beta}},\mathbf{y},\mathbf{X},\lambda)}
=[i=1neV(yi(β0+𝐱iT𝜷))]e(λn/2)𝜷22\displaystyle=\left[\prod_{i=1}^{n}e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}\right]e^{-(\lambda n/2)||{\bm{\beta}}||^{2}_{2}} (4)

Under general conditions (4) is integrable over {β0,𝜷}\{\beta_{0},{\bm{\beta}}\}, and therefore defines a proper probability density function; we state this formally in Theorem 1.

Theorem 1

If 𝐗d×n\mathbf{X}\in\mathbb{R}^{d\times n}, λ0\lambda\geq 0, and 𝐲{1,1}n\mathbf{y}\in\{-1,1\}^{n} where yi=1y_{i}=-1 for some ii and yj=1y_{j}=1 for some jj, then p(𝛃,β0𝐗,𝐲,λ)p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda) in (4) gives a proper probability density function over {β0,𝛃}\{\beta_{0},{\bm{\beta}}\}.

In what follows, we “work backward” from this posterior to complete the specification of the model and discuss its implications.

4.1 Conditional distribution of 𝐲\mathbf{y}

Treating 𝐲\mathbf{y} as a discrete random vector, it follows directly from (4) that its entries yiy_{i} are conditionally independent and that

P(yi𝜷,β0,𝐱i)P(yi)eV(yn(β0+𝐱iT𝜷)).P(y_{i}\mid{\bm{\beta}},\beta_{0},{\bf x}_{i})\propto P(y_{i})e^{-V\left(y_{n}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}.

The conditional distribution for yi{1,1}y_{i}\in\{-1,1\} thus depends on its unconditional class probability P(yi=1)=1P(yi=1)P(y_{i}=1)=1-P(y_{i}=-1), and its corresponding score ui:=β0+𝐱iT𝜷u_{i}:=\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}}:

P(yi=1ui)=P(yi=1)eV(ui)P(yi=1)eV(ui)+P(yi=1)eV(ui).\displaystyle P(y_{i}=1\mid u_{i})=\frac{P(y_{i}=1)e^{-V\left(u_{i}\right)}}{P(y_{i}=1)e^{-V\left(u_{i}\right)}+P(y_{i}={-1})e^{-V\left(-u_{i}\right)}}. (5)

Choice of P(yi=1)P(y_{i}=1) does not influence posterior inference for {β0,𝜷}\{\beta_{0},{\bm{\beta}}\} in (4), but it is nevertheless important for the calibration of the predictive model and is discussed further in Section 6.1. Given P(yi=1)P(y_{i}=1), (5) defines a link function from uiu_{i}\in\mathbb{R} to a probability in [0,1][0,1]; this function is smooth and comparable to a probit or logit link. Figure 1 plots this function under equal class proportions, P(yi=1)=1/2P(y_{i}=1)=1/2.

Refer to caption
Figure 1: Class probability as a function of linear score under different links.

4.2 Prior distribution of β\beta

Estimating β0\beta_{0} and 𝜷{\bm{\beta}} via maximizing the likelihood under model (5) would perform poorly in the HDLSS context, analogous to the poor performance of unconstrained probit or logistic regression models. However, the posterior distribution (4) also implies a “prior” distribution for {β0,𝜷}\{\beta_{0},{\bm{\beta}}\} that is not conditional on 𝐲\mathbf{y}:

p(𝜷,β0𝐗,𝐲,λ)\displaystyle p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda) P(𝐲𝜷,β0,𝐗)p(𝜷,β0𝐗,λ)\displaystyle\propto P(\mathbf{y}\mid{\bm{\beta}},\beta_{0},\mathbf{X})\,p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\lambda)
p(𝜷,β0𝐗,λ)\displaystyle\rightarrow p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\lambda) AB\displaystyle\propto A\cdot B (6)

where

A=i=1n(P(yi=1)eV(β0+𝐱nT𝜷)+P(yi=1)eV(β0𝐱nT𝜷))\displaystyle A=\prod_{i=1}^{n}\left(P(y_{i}=1)e^{-V\left(\beta_{0}+{\bf x}_{n}^{T}{\bm{\beta}}\right)}+P(y_{i}=-1)e^{-V\left(-\beta_{0}-{\bf x}_{n}^{T}{\bm{\beta}}\right)}\right) (7)

and

B=e(λN/2)𝜷22.\displaystyle B=e^{-(\lambda N/2)||{\bm{\beta}}||^{2}_{2}}. (8)

This prior facilitates shrinkage in two important ways that improve performance in HDLSS settings. Term BB (8) gives the kernel of a multivariate normal distribution for 𝜷{\bm{\beta}}, in which the βi\beta_{i}’s are independent with mean 0 and variance 1/(λN)1/(\lambda N). This derives from the l2l_{2} penalty in (2). There is a vast literature on the connection between l2l_{2}-regularized regression (e.g., ridge regression) and using Gaussian priors on the coefficients in a Bayesian framework, as mentioned in Section 1. This shrinks the inferred coefficients toward 𝟎\mathbf{0}, which improves performance in the HDLSS setting or in the presence of multicollinearity. Term AA (7) favors 𝜷{\bm{\beta}} that correspond to directions in 𝐗\mathbf{X} with high variability or bimodality. To illustrate, Figure 2 plots a single term of the product AA as a function of the DWD score ui=β0+𝐱iT𝜷u_{i}=\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}} when P(yi=1)=1/2P(y_{i}=1)=1/2,

f(ui)=eV(ui)+eV(ui).\displaystyle f(u_{i})=e^{-V(u_{i})}+e^{-V(-u_{i})}. (9)
Refer to caption
Figure 2: Prior term f(ui)f(u_{i}) (9) as a function of DWD score uiu_{i}.

This term gives higher prior density to directions in 𝐗\mathbf{X} for which the corresponding scores {ui}i=1N\{u_{i}\}_{i=1}^{N} are not concentrated near 0, and are therefore more useful for discriminating a negative class from a positive class.

The function ABA\cdot B in (6) defines a proper density for 𝜷{\bm{\beta}} for any fixed intercept β0\beta_{0}, as stated in Theorem 2.

Theorem 2

If 𝐗d×n\mathbf{X}\in\mathbb{R}^{d\times n}, λ>0\lambda>0, and β0\beta_{0}\in\mathbb{R}, then

p(𝜷𝐗,β0,λ)ABp({\bm{\beta}}\mid\mathbf{X},\beta_{0},\lambda)\propto A\cdot B

gives a proper probability density function, where AA and BB are defined in (7) and (8).

The prior (6) does not define a proper density for β0\beta_{0}. Improper priors are commonly used for a variety of Bayesian applications, and still provide a valid framework for inference if the posterior is proper (Taraldsen and Lindqvist, 2010; Sun et al., 2001). Nevertheless, it would be straightforward to modify the model to allow for shrinkage of β0\beta_{0}, if one desired it to have a proper probability density without conditioning on 𝐲\mathbf{y}.

4.3 Distribution of XX

No explicit probabilistic assumptions on 𝐗\mathbf{X} are needed for the validity of the posterior and prior distributions in the previous two sections. However, it is informative to consider the broader implications of the model with respect to the distribution for 𝐗\mathbf{X}. Without loss of generality we set β0=0\beta_{0}=0 and assume P(yi=1)=0.5P(y_{i}=1)=0.5. Note that

p(𝜷𝐱i,λ)=(eV(𝐱iT𝜷)+eV(𝐱iT𝜷))e(λ/2)𝜷22/ϕ(𝐱i,λ),p({\bm{\beta}}\mid{\bf x}_{i},\lambda)=\left(e^{-V\left({\bf x}_{i}^{T}{\bm{\beta}}\right)}+e^{-V\left(-{\bf x}_{i}^{T}{\bm{\beta}}\right)}\right)e^{-(\lambda/2)||{\bm{\beta}}||^{2}_{2}}/\phi({\bf x}_{i},\lambda),

where

ϕ(𝐱i,λ)=𝜷(eV(𝐱iT𝜷)+eV(𝐱iT𝜷))e(λ/2)𝜷22𝑑𝜷,\phi({\bf x}_{i},\lambda)=\int_{{\bm{\beta}}}\left(e^{-V\left({\bf x}_{i}^{T}{\bm{\beta}}\right)}+e^{-V\left(-{\bf x}_{i}^{T}{\bm{\beta}}\right)}\right)e^{-(\lambda/2)||{\bm{\beta}}||^{2}_{2}}d{\bm{\beta}},

Given any marginal density for 𝐱{\bf x} (which may depend on λ\lambda), 𝐱ip(𝐱λ){\bf x}_{i}\sim p({\bf x}\mid\lambda),

p(𝜷,𝐱iλ)=p(𝜷𝐱i,λ)p(𝐱i),\displaystyle p({\bm{\beta}},{\bf x}_{i}\mid\lambda)=p({\bm{\beta}}\mid{\bf x}_{i},\lambda)p({\bf x}_{i}),
p(𝜷λ)=p(𝜷,𝐱iλ)𝑑𝐱i,\displaystyle p({\bm{\beta}}\mid\lambda)=\int p({\bm{\beta}},{\bf x}_{i}\mid\lambda)d{\bf x}_{i},
p(𝐱i𝜷,λ)=p(𝜷,𝐱iλ)/p(𝜷λ).\displaystyle p({\bf x}_{i}\mid{\bm{\beta}},\lambda)=p({\bm{\beta}},{\bf x}_{i}\mid\lambda)/p({\bm{\beta}}\mid\lambda).

Thus, a marginal prior for 𝜷{\bm{\beta}} exists but it depends on the distribution 𝐱i{\bf x}_{i}. Moreover,

p(𝐱i𝜷,λ)=p(𝐱i𝜷,yi=1,λ)+p(𝐱i𝜷,yi=1,λ)\displaystyle p({\bf x}_{i}\mid{\bm{\beta}},\lambda)=p({\bf x}_{i}\mid{\bm{\beta}},y_{i}=1,\lambda)+p({\bf x}_{i}\mid{\bm{\beta}},y_{i}=-1,\lambda)

implies that

p(𝐱i𝜷,yi)eV(yi𝐱iT𝜷)ϕ(𝐱i,λ)p(𝐱iλ)/ϕ(𝐱i,λ).p({\bf x}_{i}\mid{\bm{\beta}},y_{i})\propto e^{-V(y_{i}{\bf x}_{i}^{T}{\bm{\beta}})}\phi({\bf x}_{i},\lambda)p({\bf x}_{i}\mid\lambda)/\phi({\bf x}_{i},\lambda).

Thus, given any marginal distribution for the data 𝐱i{\bf x}_{i}, it is possible to obtain the conditional distribution of 𝐱i{\bf x}_{i} given its class membership 𝐲i\mathbf{y}_{i} and the DWD vector 𝜷{\bm{\beta}}.

5 Asymptotic normality of β\beta

It follows from the Bernstein-von Mises theorem (i.e., the “Bayesian central limit theorem”) (Ghosal, 1997) that the posterior distribution for 𝜷{\bm{\beta}} approximates a multivariate normal distribution as nn\rightarrow\infty. The limiting distribution can be derived precisely via a second-order Taylor expansion of the log of (4), about the mode 𝜷^\hat{{\bm{\beta}}}. The resulting mean and covariance matrix are provided in Theorem 3.

Theorem 3

If 𝐗d×n\mathbf{X}\in\mathbb{R}^{d\times n}, λ>0\lambda>0, and 𝐲{1,1}n\mathbf{y}\in\{-1,1\}^{n}, then

p(𝜷𝐗,𝐲,λ,β0)MVN(𝜷𝜷^,Vβ)p({\bm{\beta}}\mid\mathbf{X},\mathbf{y},\lambda,\beta_{0})\approx\mbox{MVN}({\bm{\beta}}\mid\hat{{\bm{\beta}}},V_{\beta})

as nn\rightarrow\infty, where 𝛃^\hat{{\bm{\beta}}} is the DWD solution (2), and

Vβ=(iS𝜷^𝐱i𝐱iT2(yi(β^0+𝐱iT𝜷^)3+nλ𝐈)1V_{\beta}=\left(\sum_{i\in S_{\hat{{\bm{\beta}}}}}\frac{{\bf x}_{i}{\bf x}_{i}^{T}}{2(y_{i}(\hat{\beta}_{0}+{\bf x}_{i}^{T}\hat{{\bm{\beta}}})^{3}}+n\lambda\mathbf{I}\right)^{-1}

where S𝛃^={i:yi(β^0+𝐱iT𝛃^)>0.5}S_{\hat{{\bm{\beta}}}}=\{i:y_{i}(\hat{\beta}_{0}+{\bf x}_{i}^{T}\hat{{\bm{\beta}}})>0.5\}.

Note that this asymptotic approximation always exists and is directly computable given the DWD solution {β0^,𝜷^}\{\hat{\beta_{0}},\hat{{\bm{\beta}}}\}.

6 Extensions

6.1 Marginal class probabilities

The prior class probabilities P(yi=1)P(y_{i}=1) and P(yi=1)P(y_{i}=-1) in (5) do not affect the posterior for {β0,𝜷}\{\beta_{0},{\bm{\beta}}\} in (4). Thus, they need not be specified for tasks that do not require a predictive model for yiy_{i}, such as exploratory visualization or batch adjustment (Huang et al., 2012). However, their specification is important to obtain appropriately calibrated probabilities for classification. In particular, they must be considered carefully when the proportions in the training data differ from that of the target population. The ability to adjust the marginal class probabilities for out-of-sample prediction is useful, as the decision boundary for DWD (β0\beta_{0}) is sensitive to the class proportions used for training (Qiao et al., 2010; Qiao and Zhang, 2015). We let each observation have the same class membership probability a priori:

P(yi=1)=P1 for i=1,,n,P(y_{i}=1)=P_{1}\;\text{ for }i=1,\ldots,n,

and by default we set P1=0.5P_{1}=0.5. In other situations P1P_{1} may be known and fixed at another value, or estimated as the sample proportion from class 11. Alternatively, one can give P1P_{1} its own prior (e.g., P1Uniform(0,1)P_{1}\sim\text{Uniform}(0,1)) and infer it with the rest of the Bayesian model; such an approach may be useful in (e.g.,) situation for which the yiy_{i}’s are only partially observed and the population class proportions are of interest.

6.2 Inference for λ\lambda

Rather than fixing the penalty parameter λ\lambda a priori, we recommend inferring its value within the Bayesian framework. For a given prior density λp(λ)\lambda\sim p(\lambda), the conditional density for λ\lambda is

p(λ𝐗,𝐲,𝜷,β0)p(λ)P(𝐲𝐗,𝜷,β0)p(𝜷,𝐗β0,λ)i=1np(λ)(P1eV(β0+𝐱iT𝜷)+(1P1)eV(β0𝐱iT𝜷))/ϕ(𝐗,β0,λ)\displaystyle\begin{split}p(\lambda\mid\mathbf{X},\mathbf{y},{\bm{\beta}},\beta_{0})&\propto p(\lambda)P(\mathbf{y}\mid\mathbf{X},{\bm{\beta}},\beta_{0})p({\bm{\beta}},\mathbf{X}\mid\beta_{0},\lambda)\\ &\propto\prod_{i=1}^{n}p(\lambda)\left(P_{1}e^{-V\left(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}}\right)}+(1-P_{1})e^{-V\left(-\beta_{0}-{\bf x}_{i}^{T}{\bm{\beta}}\right)}\right)/\phi(\mathbf{X},\beta_{0},\lambda)\end{split} (10)

where

ϕ(𝐗,λ,β0)=𝜷i=1n(P1eV(β0+𝐱iT𝜷)+(1P1)eV(β0𝐱iT𝜷))e(λ/2)𝜷22d𝜷.\displaystyle\phi(\mathbf{X},\lambda,\beta_{0})=\int_{{\bm{\beta}}}\prod_{i=1}^{n}\left(P_{1}e^{-V\left(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}}\right)}+(1-P_{1})e^{-V\left(-\beta_{0}-{\bf x}_{i}^{T}{\bm{\beta}}\right)}\right)e^{-(\lambda/2)||{\bm{\beta}}||^{2}_{2}}d{\bm{\beta}}. (11)

In practice we use a prior for λ\lambda that is uniform over a large range, by default, λUniform(1/128,128)\lambda\sim\text{Uniform}(1/128,128). The impact of λ\lambda depends on the scale of 𝐗\mathbf{X} and the number of observations nn, so the prior may need to be suitably modified for other scenarios.

6.3 Semi-supervised learning

The conditional distribution of 𝜷{\bm{\beta}} on 𝐗\mathbf{X} in (6) is not flat, and therefore observations for which yiy_{i} are not observed can still inform 𝜷{\bm{\beta}}. Consider the context in which non_{o} observations are fully observed {(𝐱i,𝐲i)}i=1no\{({\bf x}_{i},\mathbf{y}_{i})\}_{i=1}^{n_{o}}, nun_{u} have missing outcome {(𝐱iu,yi)}i=1nu\{({\bf x}_{i}^{u},y_{i})\}_{i=1}^{n_{u}}, and n=no+nun=n_{o}+n_{u}. The full posterior distribution for {β0,𝜷}\{\beta_{0},{\bm{\beta}}\} is

p(𝜷,β0𝐗,𝐲,λ)\displaystyle p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda) [i=1nu(P1eV(β0+𝐱nT𝜷)+(1P1)eV(β0𝐱nT𝜷))]\displaystyle\propto\left[\prod_{i=1}^{n_{u}}\left(P_{1}e^{-V\left(\beta_{0}+{\bf x}_{n}^{T}{\bm{\beta}}\right)}+(1-P_{1})e^{-V\left(-\beta_{0}-{\bf x}_{n}^{T}{\bm{\beta}}\right)}\right)\right]
[i=1noeV(yi(β0+𝐱iT𝜷))]e(λn/2)𝜷22.\displaystyle\;\;\;\;\;\;\left[\prod_{i=1}^{n_{o}}e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}\right]e^{-(\lambda n/2)||{\bm{\beta}}||^{2}_{2}}.

This full posterior can be used for semi-supervised learning.

7 Posterior approximation

7.1 MCMC algorithm, fixed λ\lambda

To simulate from the posterior distribution for {β,𝜷0}\{\beta,{\bm{\beta}}_{0}\} (4), we implement a “Metropolis-in-Gibbs” algorithm (Carlin and Louis, 2008). That is, we iteratively draw each βi\beta_{i} from its conditional distribution given the current values of the rest of the parameters 𝜷[i]{\bm{\beta}}[-i], p(βi𝜷[i],𝐗,𝐲,λ)p(\beta_{i}\mid{\bm{\beta}}[-i],\mathbf{X},\mathbf{y},\lambda); each conditional draw is accomplished via a Metropolis sub-step. For the Metropolis step, we generate proposals from a normal distribution with mean given by the previous value for βi\beta_{i} and a variance that is adaptively scaled over the course of the algorithm. We initialize the Markov chain at the posterior mode, which can be identified using existing software for DWD; we use the sdwd package for this initialization.

7.2 Inference for λ\lambda

Inference for λ\lambda can be accomplished by iteratively simulating from its full conditional distribution (10) to update its value within the Gibbs sampling algorithm in Section 7.1. However, this is not straightforward, because the integral ϕ(𝐗,β0,λ)\phi(\mathbf{X},\beta_{0},\lambda) (11) does not have a closed form. In our implementation, we first estimate ϕ(𝐗,λ,β0)\phi(\mathbf{X},\lambda,\beta_{0}) over a grid of values {λj:j=1,,J}\{\lambda_{j}:j=1,\ldots,J\} that span the range of the prior for λ\lambda. For this, we utilize the fact that e(λn/2)𝜷22e^{-(\lambda n/2)||{\bm{\beta}}||^{2}_{2}} resembles a Gaussian kernel to approximate the integral via Monte Carlo with simulated random normal vectors. Given 𝜷(t)MVN(𝟎,1/(λn)𝐈){\bm{\beta}}^{(t)}\sim\mbox{MVN}(\mathbf{0},1/(\lambda n)\mathbf{I}) for t=1,,Tt=1,\ldots,T,

ϕ(𝐗,λ,β0)(2πnλ)p/21Tt=1Ti=1n(P1eV(β0+𝐱iT𝜷(t))+(1P1)eV(β0𝐱iT𝜷(t))).\phi(\mathbf{X},\lambda,\beta_{0})\approx\left(\frac{2\pi}{n\lambda}\right)^{p/2}\frac{1}{T}\sum_{t=1}^{T}\prod_{i=1}^{n}\left(P_{1}e^{-V\left(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}}^{(t)}\right)}+(1-P_{1})e^{-V\left(-\beta_{0}-{\bf x}_{i}^{T}{\bm{\beta}}^{(t)}\right)}\right).

Using this approximation, we estimate ϕ^(𝐗,λj,β0)\hat{\phi}(\mathbf{X},\lambda_{j},\beta_{0}) for each jj, and then linearly interpolate on the log-scale to approximate the full function ϕ^(𝐗,λ,β0)\hat{\phi}(\mathbf{X},\lambda,\beta_{0}) over the range of the prior for λ\lambda. Posterior estimation then proceeds as in Section 7.1, with an additional Metropolis sub-step to update the value of λ\lambda. The full algorithm for posterior sampling is given in Appendix B.

7.3 Computational considerations

In practice, we find that 10001000 MCMC cycles yield appropriate coverage of the posterior when λ\lambda is fixed (see Section 8.1) and 1000010000 are sufficient when λ\lambda is inferred. On a single CPU, 1000010000 MCMC iterations takes approximately 55 minutes for a dataset of size 𝐗:500×500\mathbf{X}:500\times 500 (comparable to the data considered in Section 9) and computing time scales linearly with the dimension dd. Thus, full posterior inference via MCMC is feasible for most applications (with patience), but computing time can be a limitation.

8 Simulations

We consider three distinct simulation scenarios, and multiple conditions under each scenario, to illustrate and validate several aspects of Bayesian DWD.

In Section 8.1 we simulate from a data generating scheme that matches the assumed prior and likelihood exactly. Under this scenario, we confirm that posterior inferences using the MCMC algorithms in Section 7 are appropriately calibrated regardless of the marginal distribution for 𝐗\mathbf{X}. We compare to results using the normal approximation in Section 5 and a bootstrapping approach.

In Section 8.2 we consider a scenario in which the two classes are defined by different multivariate normal distributions. Under this scenario we demonstrate the robustness of Bayesian DWD to its model assumptions, assess the sensitivity of results to the tuning parameter λ\lambda, and validate model-based inference for λ\lambda.

In Section 8.3 we consider different scenarios for which not all of the outcomes are observed to illustrate potential benefits and limitations of the semi-supervised approach.

8.1 Assumed model simulation

Here we fix β0=0\beta_{0}=0 and simulate data from the assumed model as follows:

  1. 1.

    Generate 𝐗:d×n\mathbf{X}:d\times n according to a specified probability distribution,

  2. 2.

    Generate 𝜷true{\bm{\beta}}_{\text{true}} from its conditional prior (6)\eqref{bprior} (via a Metropolis-Hastings algorithm), given λ\lambda

  3. 3.

    For i=1,,ni=1,\ldots,n, compute the score ui=𝐱iT𝜷trueu_{i}={\bf x}_{i}^{T}{\bm{\beta}}_{\text{true}} and generate yiy_{i} according to its class probabilities defined by (5).

As manipulated conditions, we consider three values for λ\lambda (λ=0.1,1,\lambda=0.1,1, or 1010), two configurations of dd and nn (𝐗:20×100\mathbf{X}:20\times 100 or 𝐗:100×20\mathbf{X}:100\times 20), and three different probability distributions for 𝐗\mathbf{X}: (1) entries are simulated independently from a Uniform(1,1)(-1,1) distribution, (2) entries are simulated independently from an Exponential(1)(1) distribution centered to have mean 0, and (3) each column 𝐱i{\bf x}_{i} is simulated from the following mixture of multivariate normal distributions:

xi{MVN(𝝁0,𝐈) with probability 1/2MVN(𝝁1,𝐈) with probability 1/2,\displaystyle x_{i}\sim\begin{cases}\mbox{MVN}(\bm{\mu}_{0},\mathbf{I})\text{ with probability }1/2\\ \mbox{MVN}(\bm{\mu}_{1},\mathbf{I})\text{ with probability }1/2,\end{cases} (12)

where the entries of 𝝁0\bm{\mu}_{0} and 𝝁1\bm{\mu}_{1} are generated independently from a N(0,0.5)(0,0.5) distribution.

For each set of manipulated conditions, we generate 100100 datasets {𝐗,𝜷true,𝐲}\{\mathbf{X},{\bm{\beta}}_{\text{true}},\mathbf{y}\} under the assumed model and approximate the posterior distribution for 𝜷{\bm{\beta}} using the MCMC algorithm in Section 7.1. We construct 95% Bayesian credible intervals for 𝜷true{\bm{\beta}}_{\text{true}} and the scores 𝐮=𝐗T𝜷true\mathbf{u}=\mathbf{X}^{T}{\bm{\beta}}_{\text{true}} based on the posterior samples. We also consider credible intervals constructed using the posterior mode and the asymptotic approximation in Theorem 3 (i.e., the CLT approximation), and confidence intervals based on the percentiles of 200200 non-parametric bootstrap resamples as in Lyu et al. (2017).

Table 1 gives the average coverage rates for the true values under each set of conditions and the three different approaches to inference. The credible intervals using MCMC all have nominal coverage rates, which validates the algorithm. The CLT approximation gives appropriate coverage in several instances, even when p>np>n. However, it performs less well when λ\lambda is smaller and has relatively lower coverage rates for the scores 𝐮=𝐗T𝜷true\mathbf{u}=\mathbf{X}^{T}{\bm{\beta}}_{\text{true}}. The bootstrap percentile approach yields more narrow intervals and tends to undercover in all cases. As a representative example, Figure 3 plots the mode and 95% intervals for the coefficients for an instance under the Uniform scenario with λ=0.1\lambda=0.1, n=100n=100 and p=20p=20; this shows close agreement between the MCMC and CLT intervals, and the bootstrap intervals are universally more narrow.

To assess the calibration of estimated class probabilities, we generate a new set of 10001000 observations under the true model as a “test” set for each instance of the simulation. We compute the class probabilities for each test case, given the posterior fit to the n=20n=20 or n=100n=100 training cases. We group the estimated probabilities into narrow bins of length 0.020.02 ([0,0.02),[0.02,0.04)[0,0.02),[0.02,0.04), etc.) and consider the proportion of cases with y=1y=1 in each bin. Figure 4 plots these proportions, aggregated across all conditions, vs. the midpoint of each bin. Using the posterior mean, the empirical proportions match the estimated probabilities perfectly, confirming that the model is well-calibrated and again validating the MCMC procedure. Using the posterior mode, the relationship deviates slightly but still tends to provide a reasonable fit in most cases.

Table 1: Coverage rates for model coefficients (β\beta) and scores (uu) using 95% credible intervals under MCMC sampling from the posterior, the asymptotic normal approximation (CLT), and bootstrapping (BOOT).
MCMC CLT BOOT
Distribution nn dd λ\lambda β\beta uu   β\beta uu β\beta uu
Uniform 2020 100100 0.10.1 0.94 0.94 0.95 0.85 0.16 0.42
Uniform 2020 100100 11 0.94 0.94 0.95 0.85 0.23 0.50
Uniform 2020 100100 1010 0.94 0.94 0.95 0.96 0.20 0.45
Uniform 100100 2020 0.10.1 0.94 0.94 0.92 0.85 0.46 0.55
Uniform 100100 2020 11 0.95 0.95 0.95 0.95 0.53 0.65
Uniform 100100 2020 1010 0.94 0.94 0.95 0.95 0.24 0.32
Exponential 2020 100100 0.10.1 0.96 0.94 0.92 0.91 0.61 0.69
Exponential 2020 100100 11 0.95 0.95 0.96 0.95 0.62 0.71
Exponential 2020 100100 1010 0.94 0.94 0.96 0.96 0.34 0.41
Exponential 100100 2020 0.10.1 0.94 0.95 0.92 0.85 0.39 0.49
Exponential 100100 2020 0.10.1 0.94 0.95 0.94 0.91 0.53 0.63
Exponential 100100 2020 11 0.94 0.94 0.95 0.96 0.40 0.52
Bimodal 2020 100100 1010 0.95 0.96 0.94 0.91 0.61 0.66
Bimodal 2020 100100 11 0.94 0.95 0.96 0.95 0.66 0.73
Bimodal 2020 100100 1010 0.93 0.94 0.96 0.96 0.39 0.46
Bimodal 100100 2020 0.10.1 0.94 0.94 0.92 0.85 0.37 0.47
Bimodal 100100 2020 11 0.95 0.95 0.94 0.90 0.51 0.60
Bimodal 100100 2020 1010 0.94 0.95 0.96 0.96 0.40 0.61
Refer to caption
Figure 3: Inferential results and true values for the coefficients β\beta for a dataset generated under the Uniform scenario when λ=0.1\lambda=0.1, n=100n=100 and p=20p=20.
Refer to caption
Figure 4: Estimated probabilities vs. observed proportions on test observations, aggregated across all conditions, for different simulation scenarios using the posterior mean or mode.

8.2 Two-class multivariate normal simulation

Here we generate data according to a two-class multivariate normal distribution, in which each class has a different mean vector, as follows:

  1. 1.

    Generate dd-dimensional mean vectors 𝝁0\bm{\mu}_{0} and 𝝁1\bm{\mu}_{1}, with independent entries from a N(0,τ2)(0,\tau^{2}) distribution.

  2. 2.

    Set yi=1y_{i}=-1 for i=1,,n/2i=1,\ldots,n/2 and yi=1y_{i}=1 for i=n/2+1,,ni=n/2+1,\ldots,n.

  3. 3.

    For i=1,,ni=1,\ldots,n, generate 𝐱i{\bf x}_{i} via

    xi{MVN(𝝁0,𝐈) if y0=1MVN(𝝁1,𝐈) if yi=1.x_{i}\sim\begin{cases}\mbox{MVN}(\bm{\mu}_{0},\mathbf{I})\text{ if }y_{0}=-1\\ \mbox{MVN}(\bm{\mu}_{1},\mathbf{I})\text{ if }y_{i}=1.\end{cases}

Note that this scenario differs from the ‘Bimodal’ case in Section 8.1 (12); for that scenario, membership in a given mixture component (defined by 𝝁0\bm{\mu}_{0} and 𝝁1\bm{\mu}_{1}) does not necessarily correspond to the supervised class labels yiy_{i}. The present scenario does not precisely match our assumed model. The “true” (i.e., oracle) probabilistic link between yiy_{i} and 𝐱i{\bf x}_{i} in this scenario corresponds to a probit link with coefficients that are proportional to the mean difference vector 𝝁0𝝁1\bm{\mu}_{0}-\bm{\mu}_{1}; as shown in Figure 1 the link for Bayesian DWD is not a probit, but it is close.

We fix n=100n=100 and consider different levels of signal (τ=0,0.1,0.2,0.5\tau=0,0.1,0.2,0.5) and different dimensions (d=10,100,1000d=10,100,1000) as manipulated conditions. For each instance, we approximate the posterior separately with λ\lambda fixed at each of 1515 possible values, that range from 1/1281/128 to 128128 and are equally spaced on a log scale. For each case we compute the Kullback-Leibler divergence (KL-divergence) of the estimated class probabilities from the oracle model. For each case we also generate a test set of 50005000 samples and evaluate (1) the misclassification rate on the test samples using a posterior predictive probability of 0.50.5 as a threshold, and the mean squared error (MSE) of the predicted class probabilities for class 1 (pip_{i}) from the true class memberships as

MSE=1ntesti=1ntest(1pi)2𝟙{yi=1}+pi2𝟙{yi=1}.\text{MSE}=\frac{1}{n_{\text{test}}}\sum_{i=1}^{n_{\text{test}}}(1-p_{i})^{2}\mathbbm{1}_{\{y_{i}=-1\}}+p_{i}^{2}\mathbbm{1}_{\{y_{i}=1\}}.

Figure 5 plots these metrics as a function of λ\lambda for the different conditions. In general, lower values of λ\lambda perform better when the distinguishing signal is stronger. This is intuitive, as λ\lambda can be considered an inverse scale parameter for 𝜷{\bm{\beta}}, and larger magnitudes for 𝜷{\bm{\beta}} correspond to probabilities closer to 0 and 11. The misclassification rate is robust to the value of λ\lambda, but its specification is important to achieve well-calibrated probabilities.

Refer to caption
Figure 5: Different performance metrics (KL-divergence, test MSE, and test misclassification rate) are shown as a function of λ\lambda used for estimation under different conditions.

Motivated by the sensitivity of the estimated probabilities to λ\lambda, we also approximate the posterior with a uniform prior on λ\lambda as in Section 6.2. Table 2 summarizes the posterior mean for λ\lambda under each set of conditions, and compares the performance of the model with uniform prior on λ\lambda with that of the best performing fixed value of λ\lambda for each metric. The model with a uniform prior performs comparatively well across the different scenarios, and the posterior mean for λ\lambda appropriately ranges from very high values when the signal is nonexistent to low values when the signal is strong. However, for the conditions with p=1000p=1000 and strong signal the posterior mean is substantially larger than the best performing values of λ\lambda; this is likely because the model fits well and perfectly classifies the data with probabilities close to 0 or 11 for a wide range of λ\lambda values (see Figure 5), so there is not strong evidence to distinguish between different values of λ\lambda in this range.

Table 2: Comparison of model performance when λ\lambda is inferred with a uniform prior (λ^\hat{\lambda}) vs. the best performing fixed value of λ\lambda based on KL-divergence (λKL\lambda_{\text{KL}}) or MSE λMSE\lambda_{\text{MSE}}. Results for each set of condition are averages over 1010 replications.
dd s2n (τ)\tau) λ^\hat{\lambda} λKL\lambda_{\text{KL}} λMSE\lambda_{\text{MSE}} KL(λ^\hat{\lambda}) KL(λKL\lambda_{\text{KL}}) MSE(λ^\hat{\lambda}) MSE(λMSE\lambda_{\text{MSE}})
1010 0 70.21 128.00 128.00 0.00 0.00 0.25 0.25
1010 0.10.1 70.82 2.00 2.00 0.02 0.02 0.25 0.25
1010 0.20.2 31.27 0.50 1.00 0.06 0.02 0.24 0.22
1010 0.50.5 0.10 0.12 0.12 0.06 0.05 0.11 0.11
100 0 74.56 128.00 64.00 0.00 0.00 0.25 0.25
100 0.10.1 22.55 1.00 2.00 0.15 0.10 0.24 0.22
100 0.20.2 0.14 0.12 0.50 0.11 0.10 0.11 0.10
100 0.50.5 0.11 0.01 0.02 0.03 0.01 0.00 0.00
1000 0 44.79 128.00 128.00 0.02 0.00 0.25 0.25
1000 0.10.1 3.11 0.03 0.50 0.21 0.06 0.11 0.09
1000 0.20.2 0.63 0.01 0.03 0.07 0.01 0.00 0.00
1000 0.50.5 1.82 0.01 0.01 0.04 0.00 0.00 0.00

In Figure 6 we consider the overall calibration of the fitted probabilities across the different conditions, as in Section 8.1. This shows clearly that fixing λ\lambda at a large value (λ=128\lambda=128) can yield overly conservative estimated probabilities and a relatively small value (λ=1/128\lambda=1/128) can yield overly confident probabilities. The model in which λ\lambda is inferred with a uniform prior is mostly well-calibrated but does show some deviations from the estimated probabilities and empirical proportions. These deviations are likely in part due to the fact that the assumed probabilistic link does not perfectly match the true generative model.

Refer to caption
Figure 6: Estimated probabilities vs. observed proportions on test observations, aggregated across all conditions, for λ=1/128\lambda=1/128, λ=128\lambda=128, or inferring λ\lambda with a uniform prior.

8.3 Semi-supervised simulation

Here we generate data in which some of the class labels yiy_{i} may be unknown, to illustrate a semi-supervised approach. We consider two scenarios. For the first scenario, the observations 𝐱i{\bf x}_{i} are generated exactly as in Section 8.2 with d=100d=100 and τ=0.3\tau=0.3. For the second scenario, d=100d=100 and entries of each 𝐱i{\bf x}_{i} are generated independently from a N(0,1)N(0,1) distribution; then class labels yiy_{i} are defined deterministically by the sign of 𝐱iT𝜷sep{\bf x}_{i}^{T}{\bm{\beta}}_{\text{sep}}, where the entries of 𝜷sep{\bm{\beta}}_{\text{sep}} are generated from a standard normal distribution. The key difference between these two scenarios, for our purpose, is that the first scenario (bimodal) has an underlying cluster structure that distinguishes the two classes whereas the second (unimodal) does not. So, we expect to improve performance by considering unlabeled data for the bimodal scenario but not the unimodal scenario.

As manipulated conditions, we vary the number of observations with yiy_{i} observed (no=10,20,100,1000)n_{o}=10,20,100,1000) and with yiy_{i} unobserved (nu=0,10,20,100,1000n_{u}=0,10,20,100,1000) in our training set, for each of the bimodal and unimodal scenarios. For each combination of conditions, we generate 2020 datasets, use MCMC to approximate the posterior, and consider the resulting misclassification rates for a test set. The average test misclassification rates under each set of conditions are shown in Table 3. This shows that for the bimodal scenario a semi-supervised analysis with a large number of unlabeled observations can improve performance dramatically; the misclassification rates for no=10n_{o}=10 observed and nu=1000n_{u}=1000 unobserved are comparable to that with no=1000n_{o}=1000 observed. As expected, including data without observed class labels does not help in the unimodal scenario; however, it does not lead to a substantial decrease in performance.

Table 3: Test misclassification rates using Bayesian DWD with differing number of observations with yiy_{i} observed (non_{o}) or unobserved (nun_{u}).
NN unobserved (nun_{u})
Scenario NN observed (non_{o}) 0 1010 2020 100100 10001000
Bimodal 1010 0.164 0.138 0.145 0.081 0.019
Bimodal 2020 0.077 0.077 0.080 0.053 0.014
Bimodal 100100 0.038 0.035 0.032 0.036 0.020
Bimodal 10001000 0.019 0.021 0.019 0.020 0.017
Unimodal 1010 0.444 0.451 0.425 0.410 0.432
Unimodal 2020 0.392 0.390 0.402 0.395 0.412
Unimodal 100100 0.280 0.266 0.272 0.280 0.297
Unimodal 10001000 0.098 0.102 0.091 0.104 0.113

9 Application to breast cancer genomics

As a real data application, we consider classifying breast cancer (BRCA) tumor subtypes using data on microRNA (miRNA) abundance that are publicly available from the Cancer Genome Atlas (TCGA). We use miRNA-Seq data for n=1047n=1047 BRCA tumor samples from different individuals, that was previously curated for a pan-cancer clustering application (Hoadley et al., 2018). The tumors are classified into one of 44 canonical subtypes based on the PAM5050 classifier for gene (mRNA) expression (Parker et al., 2009): Luminal A (LumA) (n=572n=572), Luminal B (LumB) (n=203n=203), Her2 (n=83(n=83), and Basal (189189). Previous exploratory analysis have shown that miRNA data shares some of the same BRCA subtype distinctions as gene expression (Park and Lock, 2020; Lock and Dunson, 2013). Here, we examine the power of miRNAs to distinguish the subtypes using a supervised approach.

The miRNA expression read counts were log(1+x1+x)-transformed and centered to have mean 0. After transformation, we keep miRNAs with standard deviation greater than 0.50.5, leaving d=489d=489 miRNAs. We consider separate classification tasks to discriminate each pair of subtypes, giving 66 comparisons. For each comparison, we apply (1) Bayesian DWD (BayesDWD) with uniform prior on λ\lambda, (2) DWD (i.e., the posterior mode) via the R package sdwd, (3) SVM via the package e1071, (4) random forest classification (Breiman, 2001) via the package randomForest, and (5) LDA via the package MASS. We assess each method by the average misclassification rate on the test sets under 10-fold cross-validation. The results are shown in Table 4. Bayesian DWD, DWD, and SVM have similar classification performance across the different comparisons, while random forests and LDA perform less well by comparison. For methods that give a class probability we consider the calibration of the estimated probabilities on the test sets vs. their empirical class proportions, as in Section 8.1. Figure 7 shows that the probabilities inferred for Bayesian DWD are overall well-calibrated. The probabilities under the random forest are also reasonable, but tend toward being over-conservative. The probabilities for LDA are not well-calibrated. LDA suffers from over-fitting in situations with higher dimension and lower sample size; for this application, the vast majority (93%) of predicted probabilities using LDA are below 0.020.02 or above 0.980.98, leaving limited data for the remaining bins.

Table 4: Test classification rates for different subtype comparisons under cross validation using Bayesian DWD (BayesDWD), DWD, SVM, random forests (RF), and LDA.
Comparison BayesDWD DWD SVM RF LDA
LumA vs. LumB 0.168 0.156 0.158 0.186 0.296
LumA vs. Her2 0.054 0.045 0.059 0.070 0.178
LumA vs. Basal 0.026 0.028 0.025 0.030 0.084
LumB vs. Her2 0.129 0.124 0.155 0.169 0.188
LumB vs. Basal 0.046 0.043 0.044 0.047 0.137
Her2 vs. Basal 0.066 0.064 0.053 0.075 0.111
Refer to caption
Figure 7: Estimated probabilities vs. observed proportions on test observations, aggregated across all comparisons, using Bayesian DWD, random forests (RF), or LDA.

To illustrate the Bayesian DWD results, we focus on the comparison with the best classification accuracy from Table 4 (LumA vs. Basal) and the comparison with the worst classification accuracy (LumA vs. LumB). Figure 8 plots the posterior mean of the DWD scores and their 95%95\% credible intervals for each comparison. Note that the DWD scores are almost perfectly separated for LumA vs. Basal, but not for LumA vs. LumB. The credible intervals for each plot help to understand the level of uncertainty in the discriminating projections.

Refer to caption
Figure 8: Mean DWD scores and associated 95% credible intervals for LumA vs. Basal and for LumA vs. LumB.

10 Discussion

The Bayesian formulation of DWD is straightforward to modify or extend via changes to the likelihood or prior. As a future direction, we are interested in extending the model to accommodate multiple sources of data (e.g., multiple ’omics or clinical sources) or data with multiway structure (Lyu et al., 2017). In this article we have focused on the original and most widely used version of DWD, but the Bayesian framework may be extended for other versions including multi-class DWD (Huang et al., 2013), weighted DWD (Qiao et al., 2010), sparse DWD (Wei et al., 2016), and kernel DWD (Wang and Zou, 2018, 2019). Additionally, a generalized version of DWD was introduced in (Hall et al., 2005), wherein the distances rir_{i} in (2) are raised to a power qq. The choice of qq modifies the loss function V()V(\cdot) in (3) (Wang and Zou, 2018), which in turn affects the probabilistic link (5); the Bayesian formulation is potentially very useful in this context, because one can put a prior on qq and infer it within the model.

Computing time and resources can limit a fully Bayesian treatment of DWD for some applications, as discussed in Section 7.3. In such cases, the DWD solution can still be interpreted as a posterior mode to give point estimates for class probabilities, and the normal approximation in Section 5 can be used to quickly assess uncertainty in the estimated coefficients. Alternative computational approaches that facilitate a fully Bayesian treatment for massive datasets are worth pursuing.


Acknowledgments

This work was supported by the National Institutes of Health (NIH) grant R01-GM130622.

A Proofs

Theorem 1

If 𝐗d×n\mathbf{X}\in\mathbb{R}^{d\times n}, λ0\lambda\geq 0, and 𝐲{1,1}n\mathbf{y}\in\{-1,1\}^{n} where yi=1y_{i}=-1 for some ii and yj=1y_{j}=1 for some jj, then p(𝛃,β0𝐗,𝐲,λ)p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda) in (4) gives a proper probability density function.

Proof  Without loss of generality assume y1=1y_{1}=-1 and y2=1y_{2}=1. Define

h(𝜷,β0𝐗,𝐲,λ)=[i=1neV(yi(β0+𝐱iT𝜷))]e(λn/2)𝜷22.h({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)=\left[\prod_{i=1}^{n}e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}\right]e^{-(\lambda n/2)||{\bm{\beta}}||^{2}_{2}}.

By (4), p(𝜷,β0𝐗,𝐲,λ)h(𝜷,β0𝐗,𝐲,λ)p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)\propto h({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda). It suffices to show that (i) h(𝜷,β0𝐗,𝐲,λ)h({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda) is non-negative and (ii) h(𝜷,β0𝐗,𝐲,λ)h({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda) has finite integral over 𝜷{\bm{\beta}} and β0\beta_{0}. Condition (i) is trivially satisfied. For condition (ii), note that V(u)>0uV(u)>0\;\forall u\in\mathbb{R}, where V(u)V(u) is defined in (3). Thus,

eV(yi(β0+𝐱iT𝜷))<1 for i=1,,n.e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}<1\;\;\text{ for }i=1,\ldots,n.

Let ||||1||\cdot||_{1} define the l1l_{1}-norm, and note that ||𝐗||1||𝜷||1>max({|𝐱iT𝜷|:i=1,,n})||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}>\mbox{max}\left(\{|{\bf x}_{i}^{T}{\bm{\beta}}|:i=1,\ldots,n\}\right). For fixed 𝜷p{\bm{\beta}}\in\mathbb{R}^{p},

i=1neV(yi(β0+𝐱iT𝜷))dβ0\displaystyle\int\prod_{i=1}^{n}e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}\,d\beta_{0} =𝐗1𝜷1𝐗1𝜷1i=1neV(y1(β0+𝐱1T𝜷))dβ0\displaystyle=\int\displaylimits_{-||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}}^{||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}}\prod_{i=1}^{n}e^{-V\left(y_{1}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}\,d\beta_{0}
+𝐗1𝜷1i=1neV(y1(β0+𝐱1T𝜷))dβ0\displaystyle\;\;\;\;\;\;\;\;\;\;\;+\int\displaylimits_{-\infty}^{-||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}}\prod_{i=1}^{n}e^{-V\left(y_{1}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}\,d\beta_{0}
+𝐗1𝜷1i=1neV(y1(β0+𝐱1T𝜷))dβ0\displaystyle\;\;\;\;\;\;\;\;\;\;\;+\int\displaylimits_{||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}}^{\infty}\prod_{i=1}^{n}e^{-V\left(y_{1}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}\,d\beta_{0}
<2||𝐗||1𝜷1+𝐗1𝜷1eV(y2(β0+𝐱1T𝜷))𝑑β0\displaystyle<2||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}+\int\displaylimits_{-\infty}^{-||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}}e^{-V\left(y_{2}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}d\beta_{0}
+𝐗1𝜷1eV(y1(β0+𝐱1T𝜷))𝑑β0\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\int\displaylimits_{||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}}^{\infty}e^{-V\left(y_{1}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}d\beta_{0}
<2||𝐗||1𝜷1+20eV(u)𝑑u\displaystyle<2||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}+2\int\displaylimits_{0}^{\infty}e^{-V(-u)}du
=2𝐗1𝜷1+20e1u𝑑u\displaystyle=2||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}+2\int\displaylimits_{0}^{\infty}e^{1-u}du
=2𝐗1𝜷1+2e.\displaystyle=2||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}+2e.

It follows that

h(𝜷,β0𝐗,𝐲,λ)𝑑β0𝑑𝜷\displaystyle\int\int h({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)d\beta_{0}d{\bm{\beta}} =[i=1neV(yi(β0+𝐱iT𝜷))dβ0]e(λn/2)𝜷22𝑑𝜷\displaystyle=\int\left[\int\prod_{i=1}^{n}e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}\,d\beta_{0}\right]e^{-(\lambda n/2)||{\bm{\beta}}||^{2}_{2}}d{\bm{\beta}}
<(2𝐗1𝜷1+2e)e(λn/2)𝜷22𝑑𝜷\displaystyle<\int(2||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}+2e)e^{-(\lambda n/2)||{\bm{\beta}}||^{2}_{2}}d{\bm{\beta}}
=2𝐗1𝜷1e(λn/2)𝜷22𝑑𝜷+2ee(λn/2)𝜷22𝑑𝜷\displaystyle=2||\mathbf{X}||_{1}\int||{\bm{\beta}}||_{1}e^{-(\lambda n/2)||{\bm{\beta}}||^{2}_{2}}d{\bm{\beta}}+2e\int e^{-(\lambda n/2)||{\bm{\beta}}||^{2}_{2}}d{\bm{\beta}}
=2𝐗1p(λn2)(2πλn)p12+2e(2πλn)p2.\displaystyle=2||\mathbf{X}||_{1}\cdot p\left(\frac{\lambda n}{2}\right)\left(\frac{2\pi}{\lambda n}\right)^{\frac{p-1}{2}}+2e\left(\frac{2\pi}{\lambda n}\right)^{\frac{p}{2}}.

 

Theorem 2

If 𝐗d×n\mathbf{X}\in\mathbb{R}^{d\times n}, λ>0\lambda>0, and β0\beta_{0}\in\mathbb{R}, then

p(𝜷𝐗,β0,λ)ABp({\bm{\beta}}\mid\mathbf{X},\beta_{0},\lambda)\propto A\cdot B

gives a proper probability density function, where AA and BB are defined in (7) and (8).

Proof  Define h(𝜷,β0,𝐗,β0,λ)=ABh({\bm{\beta}},\beta_{0},\mathbf{X},\beta_{0},\lambda)=A\cdot B, where AA and BB are defined in (7) and (8). Note that AA and BB are always non-negative, and term AA is alway less than 11. Thus,

h(𝜷,β0,𝐗,β0,λ)𝑑𝜷\displaystyle\int h({\bm{\beta}},\beta_{0},\mathbf{X},\beta_{0},\lambda)d{\bm{\beta}} <e(λn/2)𝜷22𝑑𝜷\displaystyle<\int e^{-(\lambda n/2)||{\bm{\beta}}||^{2}_{2}}d{\bm{\beta}}
=(2πλn)p2.\displaystyle=\left(\frac{2\pi}{\lambda n}\right)^{\frac{p}{2}}.

 

B Algorithm

Here we describe in detail the algorithm to simulate draws {𝜷(t),β0(t),λ(t)}t=1T\{{\bm{\beta}}^{(t)},\beta_{0}^{(t)},\lambda^{(t)}\}_{t=1}^{T} from their joint posterior distribution p(𝜷(t),β0(t),λ(t)𝐗,𝐲)p({\bm{\beta}}^{(t)},\beta_{0}^{(t)},\lambda^{(t)}\mid\mathbf{X},\mathbf{y}). First, estimate the normalizing function ϕ^(𝐗,λ,β0)\hat{\phi}(\mathbf{X},\lambda,\beta_{0}) as described in Section 7.2. Then, initialize λ(0)\lambda^{(0)} to a positive value (e.g., λ(0)=1\lambda^{(0)}=1) and initialize {𝜷(0),β0(0)}\{{\bm{\beta}}^{(0)},\beta_{0}^{(0)}\} to the conditional posterior mode using the sdwd package (Wang and Zou, 2016) or other software. The algorithm is given below for iterations t=1,,Tt=1,\ldots,T.

  1. 1.

    For i=1,,di=1,\ldots,d, update βi\beta_{i} as follows:

    1. (a)

      Generate proposal βiN(0,σ2)\beta_{i}^{*}\sim N(0,\sigma_{*}^{2})

    2. (b)

      Set 𝜷~=[β1(t)βi1(t)βi(t1)βd(t1)]\tilde{{\bm{\beta}}}=[\beta_{1}^{(t)}\dots\beta_{i-1}^{(t)}\;\beta_{i}^{(t-1)}\dots\beta_{d}^{(t-1)}] and 𝜷=[β1(t)βi1(t)βiβd(t1)]{\bm{\beta}}^{*}=[\beta_{1}^{(t)}\dots\beta_{i-1}^{(t)}\;\beta_{i}^{*}\dots\beta_{d}^{(t-1)}]

    3. (c)

      Compute r=p(𝜷,β0(t1)𝐗,𝐲,λ(t1))p(𝜷~,β0(t1)𝐗,𝐲,λ(t1))r=\frac{p({\bm{\beta}}^{*},\beta_{0}^{(t-1)}\mid\mathbf{X},\mathbf{y},\lambda^{(t-1)})}{p(\tilde{{\bm{\beta}}},\beta_{0}^{(t-1)}\mid\mathbf{X},\mathbf{y},\lambda^{(t-1)})}, with the numerator and denominator given by (4)

    4. (d)

      Generate uUniform(0,1)u\sim\mbox{Uniform}(0,1)

    5. (e)

      If u<ru<r set βi(t)=βi\beta_{i}^{(t)}=\beta_{i}^{*}, otherwise set βi(t)=βi(t1)\beta_{i}^{(t)}=\beta_{i}^{(t-1)}.

  2. 2.

    Update β0\beta_{0} as follows:

    1. (a)

      Generate proposal β0N(0,0.25)\beta_{0}^{*}\sim N(0,0.25)

    2. (b)

      Compute r=p(𝜷(t),β0𝐗,𝐲,λ(t1))p(𝜷(t),β0(t1)𝐗,𝐲,λ(t1)r=\frac{p({\bm{\beta}}^{(t)},\beta_{0}^{*}\mid\mathbf{X},\mathbf{y},\lambda^{(t-1))}}{p({\bm{\beta}}^{(t)},\beta_{0}^{(t-1)}\mid\mathbf{X},\mathbf{y},\lambda^{(t-1)}}, with the numerator and denominator given by (4)

    3. (c)

      Generate uUniform(0,1)u\sim\mbox{Uniform}(0,1)

    4. (d)

      If u<ru<r set β0(t)=β0\beta_{0}^{(t)}=\beta_{0}^{*}, otherwise set βi(t)=β0(t1)\beta_{i}^{(t)}=\beta_{0}^{(t-1)}.

  3. 3.

    Update λ\lambda as follows:

    1. (a)

      Generate proposal λ\lambda^{*} by λ=elogλ(t1)+Z/4\lambda^{*}=e^{\mbox{log}\lambda^{(t-1)}+Z/4} where ZN(0,1)Z\sim N(0,1)

    2. (b)

      Compute r=λϕ^(𝐗,λ)e0.5nλ𝜷(t)2λ(t1)ϕ^(𝐗,λ(t1))e0.5nλ(t1)𝜷(t)2r=\frac{\lambda^{*}\hat{\phi}(\mathbf{X},\lambda^{*})e^{-0.5n\lambda^{*}||{\bm{\beta}}^{(t)}||^{2}}}{\lambda^{(t-1)}\hat{\phi}(\mathbf{X},\lambda^{(t-1)})e^{-0.5n\lambda^{(t-1)}||{\bm{\beta}}^{(t)}||^{2}}}

    3. (c)

      Generate uUniform(0,1)u\sim\mbox{Uniform}(0,1)

    4. (d)

      If u<ru<r set λ(t)=λ\lambda^{(t)}=\lambda^{*}, otherwise set λ(t)=λ(t1)\lambda^{(t)}=\lambda^{(t-1)}.

  4. 4.

    Set σ2\sigma^{2}_{*} to half of the sample variance of {βi}i=1d\{\beta_{i}\}_{i=1}^{d}.

Note that if λ\lambda is fixed, step (c) above can be skipped. The adaptive updating of the proposal variance in step (d) is not necessary for the validity of the algorithm, but can improve mixing considerably when the approximate scale of the coefficients 𝜷{\bm{\beta}} is uncertain.

References

  • Breiman (2001) L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • Carlin and Louis (2008) B. P. Carlin and T. A. Louis. Bayesian methods for data analysis. CRC Press, 2008.
  • Cortes and Vapnik (1995) C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
  • Ghosal (1997) S. Ghosal. A review of consistency and convergence of posterior distribution. In Varanashi Symposium in Bayesian Inference, Banaras Hindu University, 1997.
  • Hall et al. (2005) P. Hall, J. S. Marron, and A. Neeman. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):427–444, 2005.
  • Henao et al. (2014) R. Henao, X. Yuan, and L. Carin. Bayesian nonlinear support vector machines and discriminative factor modeling. In Advances in neural information processing systems, pages 1754–1762, 2014.
  • Hoadley et al. (2018) K. A. Hoadley, C. Yau, T. Hinoue, D. M. Wolf, A. J. Lazar, E. Drill, R. Shen, A. M. Taylor, A. D. Cherniack, V. Thorsson, et al. Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell, 173(2):291–304, 2018.
  • Hsiang (1975) T. Hsiang. A Bayesian view on ridge regression. Journal of the Royal Statistical Society: Series D (The Statistician), 24(4):267–268, 1975.
  • Huang et al. (2012) H. Huang, X. Lu, Y. Liu, P. Haaland, and J. Marron. R/DWD: distance-weighted discrimination for classification, visualization and batch adjustment. Bioinformatics, 28(8):1182–1183, 2012.
  • Huang et al. (2013) H. Huang, Y. Liu, Y. Du, C. M. Perou, D. N. Hayes, M. J. Todd, and J. S. Marron. Multiclass distance-weighted discrimination. Journal of Computational and Graphical Statistics, 22(4):953–969, 2013.
  • Liu et al. (2011) Y. Liu, H. H. Zhang, and Y. Wu. Hard or soft classification? large-margin unified machines. Journal of the American Statistical Association, 106(493):166–177, 2011.
  • Lock and Dunson (2013) E. F. Lock and D. B. Dunson. Bayesian consensus clustering. Bioinformatics, 29(20):2610–2616, 2013.
  • Lyu et al. (2017) T. Lyu, E. F. Lock, and L. E. Eberly. Discriminating sample groups with multi-way data. Biostatistics, 18(3):434–450, 2017.
  • Marron et al. (2007) J. Marron, M. J. Todd, and J. Ahn. Distance-weighted discrimination. Journal of the American Statistical Association, 102(480):1267–1271, 2007.
  • Park and Lock (2020) J. Y. Park and E. F. Lock. Integrative factorization of bidimensionally linked matrices. Biometrics, 76(1):61–74, 2020.
  • Park and Casella (2008) T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
  • Parker et al. (2009) J. S. Parker, M. Mullins, M. C. Cheang, S. Leung, D. Voduc, T. Vickery, S. Davies, C. Fauron, X. He, Z. Hu, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of clinical oncology, 27(8):1160, 2009.
  • Qiao and Zhang (2015) X. Qiao and L. Zhang. Flexible high-dimensional classification machines and their asymptotic properties. The Journal of Machine Learning Research, 16(1):1547–1572, 2015.
  • Qiao et al. (2010) X. Qiao, H. H. Zhang, Y. Liu, M. J. Todd, and J. S. Marron. Weighted distance weighted discrimination and its asymptotic properties. Journal of the American Statistical Association, 105(489):401–414, 2010.
  • Sollich (2002) P. Sollich. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine learning, 46(1-3):21–52, 2002.
  • Sun et al. (2001) D. Sun, R. K. Tsutakawa, and Z. He. Propriety of posteriors with improper priors in hierarchical linear mixed models. Statistica Sinica, pages 77–95, 2001.
  • Taraldsen and Lindqvist (2010) G. Taraldsen and B. H. Lindqvist. Improper priors are not improper. The American Statistician, 64(2):154–158, 2010.
  • Wang and Zou (2016) B. Wang and H. Zou. Sparse distance weighted discrimination. Journal of Computational and Graphical Statistics, 25(3):826–838, 2016.
  • Wang and Zou (2018) B. Wang and H. Zou. Another look at distance-weighted discrimination. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):177–198, 2018.
  • Wang and Zou (2019) B. Wang and H. Zou. A multicategory kernel distance weighted discrimination method for multiclass classification. Technometrics, 61(3):396–408, 2019.
  • Wei et al. (2016) S. Wei, C. Lee, L. Wichers, and J. Marron. Direction-projection-permutation for high-dimensional hypothesis tests. Journal of Computational and Graphical Statistics, 25(2):549–569, 2016.
  • Witten and Tibshirani (2011) D. M. Witten and R. Tibshirani. Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):753–772, 2011.