Bayesian Distance Weighted Discrimination

\nameEric F. Lock \emailelock@umn.edu
\addrDivision of Biostatistics
School of Public Health
University of Minnesota
Minneapolis, MN 55446, USA

Abstract

Distance weighted discrimination (DWD) is a linear discrimination method that is particularly well-suited for classification tasks with high-dimensional data. The DWD coefficients minimize an intuitive objective function, which can solved very efficiently using state-of-the-art optimization techniques. However, DWD has not yet been cast into a model-based framework for statistical inference. In this article we show that DWD identifies the mode of a proper Bayesian posterior distribution, that results from a particular link function for the class probabilities and a shrinkage-inducing proper prior distribution on the coefficients. We describe a relatively efficient Markov chain Monte Carlo (MCMC) algorithm to simulate from the true posterior under this Bayesian framework. We show that the posterior is asymptotically normal and derive the mean and covariance matrix of its limiting distribution. Through several simulation studies and an application to breast cancer genomics we demonstrate how the Bayesian approach to DWD can be used to (1) compute well-calibrated posterior class probabilities, (2) assess uncertainty in the DWD coefficients and resulting sample scores, (3) improve power via semi-supervised analysis when not all class labels are available, and (4) automatically determine a penalty tuning parameter within the model-based framework. R code to perform Bayesian DWD is available at https://github.com/lockEF/BayesianDWD.

Keywords: Cancer genomics, distance weighted discrimination, high-dimensional data, probabilistic classification, support vector machines.

1 Introduction

High-dimensional data occur when a massive number of features are available for each unit of observation in a sample set. Such data are now very common across a variety of application areas, and this has motivated a large number of data analysis methods that were developed specifically for the high-dimensional context. For supervised analysis, where the task is to predict or describe an outcome, methods that are appropriate for the high-dimensional context often begin with an objective function to be optimized but lack a clear model-based framework for statistical inference. One reason for this is because fitting statistical models directly via (e.g.) maximum likelihood is prone to over-fitting and lack of identifiability in the high-dimensional context, especially when the number of features is larger than the sample size. Thus, objective functions that admit more regularization are required, and these often do not have a model-based interpretation in the frequentist framework. For example, a common approach in the high-dimensional context is to incorporate penalty terms that enforce shrinkage or sparsity (e.g., $l_{2}$ or $l_{1}$ penalization) within a classical predictive model (e.g., generalized linear models).

A Bayesian framework provides more flexibility for model-based supervised analysis of high-dimensional data, as appropriate regularization can be induced through the specified prior distribution. Indeed, there are a number of instances in which optimizing a commonly used penalized objective function was later shown to give the mode of a posterior distribution under a particular Bayesian model. For example, a Bayesian linear model with a normally distributed outcome and a normal prior on the coefficients corresponds to solving an $l_{2}$ penalized least squares objective (i.e., ridge regression) (Hsiang, 1975). Similarly, the mode of the posterior under a Bayesian framework with a double-exponential prior on the model coefficients corresponds to $l_{1}$ penalization (i.e., the lasso) (Park and Casella, 2008). Such equivalences can be very useful to better understand and interpret the original optimization problem, and also provide a potential framework for statistical inference that is based on the original objective.

In this article we introduce a Bayesian formulation for Distance Weighted Discrimination (DWD) (Marron et al., 2007). DWD is a popular approach for supervised analysis with a binary outcome and high-dimensional data. It aims to identify a linear combination of the features that distinguish the sample groups corresponding to the binary outcomes (i.e., projections onto a separating hyperplane); in this respect, it is comparable to various versions of the Support Vector Machine (SVM) (Cortes and Vapnik, 1995) or penalized Fisher’s Linear Discriminant Analysis (LDA) (Witten and Tibshirani, 2011). However, relative to its competitors, DWD often has superior generalizability for settings in which the sample size is small relative to the dimension, i.e., the high dimension low sample size (HDLSS) context. In its original formulation, DWD minimizes the inverse distances of the observed units from the separating hyperplane with a penalty term. This circumvents the “data piling” issue of SVM, wherein projections tend to pile up at the margins of the separating hyperplane for HDLSS data as a symptom of over-fitting and lack of generalizability. Since its inception, many other versions of DWD have been developed, including extensions that allow more than two classes (Huang et al., 2013), sparsity (Wang and Zou, 2016), non-linear kernels (Wang and Zou, 2019), multi-way (i.e., tensor) data (Lyu et al., 2017), and sample weighting for unbalanced data (Qiao et al., 2010). These useful extensions modify the optimization task used for DWD, but remain model-free. Resampling methods such as permutation testing (Wei et al., 2016), cross-validation, and bootstrapping (Lyu et al., 2017) have been used with DWD to address inferential questions that involve uncertainty. However, while various model-based versions of SVM have been proposed (Sollich, 2002; Henao et al., 2014), DWD is yet to be cast into a fully specified probability model.

We show that DWD identifies the mode of a proper Bayesian posterior distribution. The corresponding density of the posterior distribution is a monotone function of the DWD objective. The underlying model is given by a particular link function for the class probabilities and a shrinkage-inducing proper prior distribution on the model coefficients. In addition to providing a model-based context for DWD, we demonstrate how this Bayesian framework can be used to extend and enhance present applications of DWD in four specific ways:

1.

Posterior class inclusion probabilities can be used for “soft” classification, and we find that these probabilities are well-calibrated across different contexts.
2.

Posterior credible intervals can be used to visualize uncertainty in the coefficients and the sample scores (i.e., projections).
3.

Semi-supervised analyses, in which only some of the class labels are observed in the training data, fit naturally into the Bayesian framework and can improve power.
4.

A penalty tuning parameter for the DWD objective can be subsumed within the Bayesian framework and given its own prior, to allow for model-based selection of this parameter.

We describe a relatively efficient Markov chain Monte Carlo (MCMC) algorithm to simulate from the posterior distribution. Alternatively, we show that the posterior is asymptotically multivariate normal; we derive the asymptotic mean and covariance, which are readily computable using existing software.

In what follows, we introduce our formal framework and notation in Section 2, formally describe the DWD objective in Section 3, and then present and discuss the Bayesian model and its estimation in Sections 4, 5, 6, and 7. We present simulation results to illustrate and validate different aspects of the approach in Section 8, and we present an application to subtype classification in breast cancer genomics in Section 9.

2 Framework and notation

Throughout this article bold lower-case characters denote vectors, bold upper-case characters denote matrices, and greek characters denote unknown model parameters. Probability density functions are denoted by $p(\cdot)$ , and discrete probabilities by $P(\cdot)$ .

Let ${\bf x}_{i}$ be a $d$ -dimensional vector and $y_{i}\in\{-1,1\}$ be a corresponding class indicator for sampled observations $i=1,\ldots,n$ . Let $\mathbf{X}$ be the $d\times n$ matrix $\mathbf{X}=[{\bf x}_{1}\;{\bf x}_{2}\ldots{\bf x}_{n}]$ . Let $\mathbf{y}$ be the $n$ -dimensional vector of outcomes $\mathbf{y}=[y_{1}\;y_{2}\ldots y_{n}]$ . Our task is to predict the outcomes $y_{i}$ from the data ${\bf x}_{i}$ .

3 Distance weighted discrimination

Distance weighted discrimination (DWD) identifies coefficients (weights) ${\bm{\beta}}=[\beta_{1}\;\beta_{2}\ldots\beta_{d}]$ and an intercept $\beta_{0}\in\mathbb{R}$ such that the linear combinations (scores) $\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}}$ discriminate the classes. The optimization problem for DWD was originally formulated as

\underset{\beta_{0},{\bm{\beta}},\bm{\xi}}{\mbox{argmin}}\sum_{i=1}^{n}\frac{1}{r_{i}}+C\sum_{1}^{n}\xi_{i},\text{ where }r_{i}=y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})+\xi_{i}

(1)

subject to $r_{i}>0$ and $\xi_{i}>0$ for $i=1,\ldots,n$ and $||{\bm{\beta}}||_{2}^{2}\leq 1$ . In practice the classification decision rule is determined by the sign of $\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}}$ , and thus the $\xi_{i}$ can be interpreted as a penalty for misclassification. The tuning parameter $C$ controls the relative weight of this misclassification penalty compared to the inverse distances from the separating hyperplane $1/r_{i}$ . An appealing property of DWD is that its objective accounts for the projections of all the data observations, rather than just those at the margins. Liu et al. (2011) show that the DWD objective (1) is equivalent to the following “loss+penalty” formulation

\displaystyle\underset{\beta_{0},{\bm{\beta}}}{\mbox{argmin}}\frac{1}{n}\sum_{i=1}^{n}V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)+\frac{\lambda}{2}||{\bm{\beta}}||^{2}_{2},

(2)

where $\lambda$ is a penalty tuning parameter, and

\displaystyle V(u)=\begin{cases}1-u&\text{ if }u\leq 1/2,\\ 1/(4u)&\text{ if }u>1/2.\end{cases}

(3)

The objectives (1) and (2) are equivalent in that, with a one-to-one correspondence between the penalty parameters $\lambda$ and $C$ , the estimated coefficients and resulting sample scores are proportional.

4 Bayesian model

Here we show that the DWD objective (2) gives the mode of a posterior distribution for $\{\beta_{0},{\bm{\beta}}\}$ under a Bayesian framework. Let $\psi(\cdot)$ give the objective function that is minimized by DWD,

\psi(\beta_{0},{\bm{\beta}},\mathbf{y},\mathbf{X},\lambda)=\frac{1}{n}\sum_{i=1}^{n}V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)+\frac{\lambda}{2}||{\bm{\beta}}||^{2}_{2}.

The posterior density for $\{\beta_{0},{\bm{\beta}}\}$ is the following monotone transformation of $\psi$ ,

	$\displaystyle p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)$	$\displaystyle\propto e^{-n\psi(\beta_{0},{\bm{\beta}},\mathbf{y},\mathbf{X},\lambda)}$
		$\displaystyle=\left[\prod_{i=1}^{n}e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}\right]e^{-(\lambda n/2)\|\|{\bm{\beta}}\|\|^{2}_{2}}$		(4)

Under general conditions (4) is integrable over $\{\beta_{0},{\bm{\beta}}\}$ , and therefore defines a proper probability density function; we state this formally in Theorem 1.

Theorem 1

If $\mathbf{X}\in\mathbb{R}^{d\times n}$ , $\lambda\geq 0$ , and $\mathbf{y}\in\{-1,1\}^{n}$ where $y_{i}=-1$ for some $i$ and $y_{j}=1$ for some $j$ , then $p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)$ in (4) gives a proper probability density function over $\{\beta_{0},{\bm{\beta}}\}$ .

In what follows, we “work backward” from this posterior to complete the specification of the model and discuss its implications.

4.1 Conditional distribution of $\mathbf{y}$

Treating $\mathbf{y}$ as a discrete random vector, it follows directly from (4) that its entries $y_{i}$ are conditionally independent and that

P(y_{i}\mid{\bm{\beta}},\beta_{0},{\bf x}_{i})\propto P(y_{i})e^{-V\left(y_{n}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}.

The conditional distribution for $y_{i}\in\{-1,1\}$ thus depends on its unconditional class probability $P(y_{i}=1)=1-P(y_{i}=-1)$ , and its corresponding score $u_{i}:=\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}}$ :

\displaystyle P(y_{i}=1\mid u_{i})=\frac{P(y_{i}=1)e^{-V\left(u_{i}\right)}}{P(y_{i}=1)e^{-V\left(u_{i}\right)}+P(y_{i}={-1})e^{-V\left(-u_{i}\right)}}.

(5)

Choice of $P(y_{i}=1)$ does not influence posterior inference for $\{\beta_{0},{\bm{\beta}}\}$ in (4), but it is nevertheless important for the calibration of the predictive model and is discussed further in Section 6.1. Given $P(y_{i}=1)$ , (5) defines a link function from $u_{i}\in\mathbb{R}$ to a probability in $[0,1]$ ; this function is smooth and comparable to a probit or logit link. Figure 1 plots this function under equal class proportions, $P(y_{i}=1)=1/2$ .

Refer to caption — Figure 1: Class probability as a function of linear score under different links.

4.2 Prior distribution of $\beta$

Estimating $\beta_{0}$ and ${\bm{\beta}}$ via maximizing the likelihood under model (5) would perform poorly in the HDLSS context, analogous to the poor performance of unconstrained probit or logistic regression models. However, the posterior distribution (4) also implies a “prior” distribution for $\{\beta_{0},{\bm{\beta}}\}$ that is not conditional on $\mathbf{y}$ :

	$\displaystyle p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)$	$\displaystyle\propto P(\mathbf{y}\mid{\bm{\beta}},\beta_{0},\mathbf{X})\,p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\lambda)$
	$\displaystyle\rightarrow p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\lambda)$	$\displaystyle\propto A\cdot B$		(6)

where

\displaystyle A=\prod_{i=1}^{n}\left(P(y_{i}=1)e^{-V\left(\beta_{0}+{\bf x}_{n}^{T}{\bm{\beta}}\right)}+P(y_{i}=-1)e^{-V\left(-\beta_{0}-{\bf x}_{n}^{T}{\bm{\beta}}\right)}\right)

(7)

and

\displaystyle B=e^{-(\lambda N/2)||{\bm{\beta}}||^{2}_{2}}.

(8)

This prior facilitates shrinkage in two important ways that improve performance in HDLSS settings. Term $B$ (8) gives the kernel of a multivariate normal distribution for ${\bm{\beta}}$ , in which the $\beta_{i}$ ’s are independent with mean $0$ and variance $1/(\lambda N)$ . This derives from the $l_{2}$ penalty in (2). There is a vast literature on the connection between $l_{2}$ -regularized regression (e.g., ridge regression) and using Gaussian priors on the coefficients in a Bayesian framework, as mentioned in Section 1. This shrinks the inferred coefficients toward $\mathbf{0}$ , which improves performance in the HDLSS setting or in the presence of multicollinearity. Term $A$ (7) favors ${\bm{\beta}}$ that correspond to directions in $\mathbf{X}$ with high variability or bimodality. To illustrate, Figure 2 plots a single term of the product $A$ as a function of the DWD score $u_{i}=\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}}$ when $P(y_{i}=1)=1/2$ ,

\displaystyle f(u_{i})=e^{-V(u_{i})}+e^{-V(-u_{i})}.

(9)

This term gives higher prior density to directions in $\mathbf{X}$ for which the corresponding scores $\{u_{i}\}_{i=1}^{N}$ are not concentrated near $0$ , and are therefore more useful for discriminating a negative class from a positive class.

The function $A\cdot B$ in (6) defines a proper density for ${\bm{\beta}}$ for any fixed intercept $\beta_{0}$ , as stated in Theorem 2.

Theorem 2

If $\mathbf{X}\in\mathbb{R}^{d\times n}$ , $\lambda>0$ , and $\beta_{0}\in\mathbb{R}$ , then

p({\bm{\beta}}\mid\mathbf{X},\beta_{0},\lambda)\propto A\cdot B

gives a proper probability density function, where $A$ and $B$ are defined in (7) and (8).

The prior (6) does not define a proper density for $\beta_{0}$ . Improper priors are commonly used for a variety of Bayesian applications, and still provide a valid framework for inference if the posterior is proper (Taraldsen and Lindqvist, 2010; Sun et al., 2001). Nevertheless, it would be straightforward to modify the model to allow for shrinkage of $\beta_{0}$ , if one desired it to have a proper probability density without conditioning on $\mathbf{y}$ .

4.3 Distribution of $X$

No explicit probabilistic assumptions on $\mathbf{X}$ are needed for the validity of the posterior and prior distributions in the previous two sections. However, it is informative to consider the broader implications of the model with respect to the distribution for $\mathbf{X}$ . Without loss of generality we set $\beta_{0}=0$ and assume $P(y_{i}=1)=0.5$ . Note that

p({\bm{\beta}}\mid{\bf x}_{i},\lambda)=\left(e^{-V\left({\bf x}_{i}^{T}{\bm{\beta}}\right)}+e^{-V\left(-{\bf x}_{i}^{T}{\bm{\beta}}\right)}\right)e^{-(\lambda/2)||{\bm{\beta}}||^{2}_{2}}/\phi({\bf x}_{i},\lambda),

where

\phi({\bf x}_{i},\lambda)=\int_{{\bm{\beta}}}\left(e^{-V\left({\bf x}_{i}^{T}{\bm{\beta}}\right)}+e^{-V\left(-{\bf x}_{i}^{T}{\bm{\beta}}\right)}\right)e^{-(\lambda/2)||{\bm{\beta}}||^{2}_{2}}d{\bm{\beta}},

Given any marginal density for ${\bf x}$ (which may depend on $\lambda$ ), ${\bf x}_{i}\sim p({\bf x}\mid\lambda)$ ,

	$\displaystyle p({\bm{\beta}},{\bf x}_{i}\mid\lambda)=p({\bm{\beta}}\mid{\bf x}_{i},\lambda)p({\bf x}_{i}),$
	$\displaystyle p({\bm{\beta}}\mid\lambda)=\int p({\bm{\beta}},{\bf x}_{i}\mid\lambda)d{\bf x}_{i},$
	$\displaystyle p({\bf x}_{i}\mid{\bm{\beta}},\lambda)=p({\bm{\beta}},{\bf x}_{i}\mid\lambda)/p({\bm{\beta}}\mid\lambda).$

Thus, a marginal prior for ${\bm{\beta}}$ exists but it depends on the distribution ${\bf x}_{i}$ . Moreover,

\displaystyle p({\bf x}_{i}\mid{\bm{\beta}},\lambda)=p({\bf x}_{i}\mid{\bm{\beta}},y_{i}=1,\lambda)+p({\bf x}_{i}\mid{\bm{\beta}},y_{i}=-1,\lambda)

implies that

p({\bf x}_{i}\mid{\bm{\beta}},y_{i})\propto e^{-V(y_{i}{\bf x}_{i}^{T}{\bm{\beta}})}\phi({\bf x}_{i},\lambda)p({\bf x}_{i}\mid\lambda)/\phi({\bf x}_{i},\lambda).

Thus, given any marginal distribution for the data ${\bf x}_{i}$ , it is possible to obtain the conditional distribution of ${\bf x}_{i}$ given its class membership $\mathbf{y}_{i}$ and the DWD vector ${\bm{\beta}}$ .

5 Asymptotic normality of $\beta$

It follows from the Bernstein-von Mises theorem (i.e., the “Bayesian central limit theorem”) (Ghosal, 1997) that the posterior distribution for ${\bm{\beta}}$ approximates a multivariate normal distribution as $n\rightarrow\infty$ . The limiting distribution can be derived precisely via a second-order Taylor expansion of the log of (4), about the mode $\hat{{\bm{\beta}}}$ . The resulting mean and covariance matrix are provided in Theorem 3.

Theorem 3

If $\mathbf{X}\in\mathbb{R}^{d\times n}$ , $\lambda>0$ , and $\mathbf{y}\in\{-1,1\}^{n}$ , then

p({\bm{\beta}}\mid\mathbf{X},\mathbf{y},\lambda,\beta_{0})\approx\mbox{MVN}({\bm{\beta}}\mid\hat{{\bm{\beta}}},V_{\beta})

as $n\rightarrow\infty$ , where $\hat{{\bm{\beta}}}$ is the DWD solution (2), and

V_{\beta}=\left(\sum_{i\in S_{\hat{{\bm{\beta}}}}}\frac{{\bf x}_{i}{\bf x}_{i}^{T}}{2(y_{i}(\hat{\beta}_{0}+{\bf x}_{i}^{T}\hat{{\bm{\beta}}})^{3}}+n\lambda\mathbf{I}\right)^{-1}

where $S_{\hat{{\bm{\beta}}}}=\{i:y_{i}(\hat{\beta}_{0}+{\bf x}_{i}^{T}\hat{{\bm{\beta}}})>0.5\}$ .

Note that this asymptotic approximation always exists and is directly computable given the DWD solution $\{\hat{\beta_{0}},\hat{{\bm{\beta}}}\}$ .

6 Extensions

6.1 Marginal class probabilities

The prior class probabilities $P(y_{i}=1)$ and $P(y_{i}=-1)$ in (5) do not affect the posterior for $\{\beta_{0},{\bm{\beta}}\}$ in (4). Thus, they need not be specified for tasks that do not require a predictive model for $y_{i}$ , such as exploratory visualization or batch adjustment (Huang et al., 2012). However, their specification is important to obtain appropriately calibrated probabilities for classification. In particular, they must be considered carefully when the proportions in the training data differ from that of the target population. The ability to adjust the marginal class probabilities for out-of-sample prediction is useful, as the decision boundary for DWD ( $\beta_{0}$ ) is sensitive to the class proportions used for training (Qiao et al., 2010; Qiao and Zhang, 2015). We let each observation have the same class membership probability a priori:

P(y_{i}=1)=P_{1}\;\text{ for }i=1,\ldots,n,

and by default we set $P_{1}=0.5$ . In other situations $P_{1}$ may be known and fixed at another value, or estimated as the sample proportion from class $1$ . Alternatively, one can give $P_{1}$ its own prior (e.g., $P_{1}\sim\text{Uniform}(0,1)$ ) and infer it with the rest of the Bayesian model; such an approach may be useful in (e.g.,) situation for which the $y_{i}$ ’s are only partially observed and the population class proportions are of interest.

6.2 Inference for $\lambda$

Rather than fixing the penalty parameter $\lambda$ a priori, we recommend inferring its value within the Bayesian framework. For a given prior density $\lambda\sim p(\lambda)$ , the conditional density for $\lambda$ is

\displaystyle\begin{split}p(\lambda\mid\mathbf{X},\mathbf{y},{\bm{\beta}},\beta_{0})&\propto p(\lambda)P(\mathbf{y}\mid\mathbf{X},{\bm{\beta}},\beta_{0})p({\bm{\beta}},\mathbf{X}\mid\beta_{0},\lambda)\\ &\propto\prod_{i=1}^{n}p(\lambda)\left(P_{1}e^{-V\left(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}}\right)}+(1-P_{1})e^{-V\left(-\beta_{0}-{\bf x}_{i}^{T}{\bm{\beta}}\right)}\right)/\phi(\mathbf{X},\beta_{0},\lambda)\end{split}

(10)

where

\displaystyle\phi(\mathbf{X},\lambda,\beta_{0})=\int_{{\bm{\beta}}}\prod_{i=1}^{n}\left(P_{1}e^{-V\left(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}}\right)}+(1-P_{1})e^{-V\left(-\beta_{0}-{\bf x}_{i}^{T}{\bm{\beta}}\right)}\right)e^{-(\lambda/2)||{\bm{\beta}}||^{2}_{2}}d{\bm{\beta}}.

(11)

In practice we use a prior for $\lambda$ that is uniform over a large range, by default, $\lambda\sim\text{Uniform}(1/128,128)$ . The impact of $\lambda$ depends on the scale of $\mathbf{X}$ and the number of observations $n$ , so the prior may need to be suitably modified for other scenarios.

6.3 Semi-supervised learning

The conditional distribution of ${\bm{\beta}}$ on $\mathbf{X}$ in (6) is not flat, and therefore observations for which $y_{i}$ are not observed can still inform ${\bm{\beta}}$ . Consider the context in which $n_{o}$ observations are fully observed $\{({\bf x}_{i},\mathbf{y}_{i})\}_{i=1}^{n_{o}}$ , $n_{u}$ have missing outcome $\{({\bf x}_{i}^{u},y_{i})\}_{i=1}^{n_{u}}$ , and $n=n_{o}+n_{u}$ . The full posterior distribution for $\{\beta_{0},{\bm{\beta}}\}$ is

	$\displaystyle p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)$	$\displaystyle\propto\left[\prod_{i=1}^{n_{u}}\left(P_{1}e^{-V\left(\beta_{0}+{\bf x}_{n}^{T}{\bm{\beta}}\right)}+(1-P_{1})e^{-V\left(-\beta_{0}-{\bf x}_{n}^{T}{\bm{\beta}}\right)}\right)\right]$
		$\displaystyle\;\;\;\;\;\;\left[\prod_{i=1}^{n_{o}}e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}\right]e^{-(\lambda n/2)\|\|{\bm{\beta}}\|\|^{2}_{2}}.$

This full posterior can be used for semi-supervised learning.

7 Posterior approximation

7.1 MCMC algorithm, fixed $\lambda$

To simulate from the posterior distribution for $\{\beta,{\bm{\beta}}_{0}\}$ (4), we implement a “Metropolis-in-Gibbs” algorithm (Carlin and Louis, 2008). That is, we iteratively draw each $\beta_{i}$ from its conditional distribution given the current values of the rest of the parameters ${\bm{\beta}}[-i]$ , $p(\beta_{i}\mid{\bm{\beta}}[-i],\mathbf{X},\mathbf{y},\lambda)$ ; each conditional draw is accomplished via a Metropolis sub-step. For the Metropolis step, we generate proposals from a normal distribution with mean given by the previous value for $\beta_{i}$ and a variance that is adaptively scaled over the course of the algorithm. We initialize the Markov chain at the posterior mode, which can be identified using existing software for DWD; we use the sdwd package for this initialization.

7.2 Inference for $\lambda$

Inference for $\lambda$ can be accomplished by iteratively simulating from its full conditional distribution (10) to update its value within the Gibbs sampling algorithm in Section 7.1. However, this is not straightforward, because the integral $\phi(\mathbf{X},\beta_{0},\lambda)$ (11) does not have a closed form. In our implementation, we first estimate $\phi(\mathbf{X},\lambda,\beta_{0})$ over a grid of values $\{\lambda_{j}:j=1,\ldots,J\}$ that span the range of the prior for $\lambda$ . For this, we utilize the fact that $e^{-(\lambda n/2)||{\bm{\beta}}||^{2}_{2}}$ resembles a Gaussian kernel to approximate the integral via Monte Carlo with simulated random normal vectors. Given ${\bm{\beta}}^{(t)}\sim\mbox{MVN}(\mathbf{0},1/(\lambda n)\mathbf{I})$ for $t=1,\ldots,T$ ,

\phi(\mathbf{X},\lambda,\beta_{0})\approx\left(\frac{2\pi}{n\lambda}\right)^{p/2}\frac{1}{T}\sum_{t=1}^{T}\prod_{i=1}^{n}\left(P_{1}e^{-V\left(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}}^{(t)}\right)}+(1-P_{1})e^{-V\left(-\beta_{0}-{\bf x}_{i}^{T}{\bm{\beta}}^{(t)}\right)}\right).

Using this approximation, we estimate $\hat{\phi}(\mathbf{X},\lambda_{j},\beta_{0})$ for each $j$ , and then linearly interpolate on the log-scale to approximate the full function $\hat{\phi}(\mathbf{X},\lambda,\beta_{0})$ over the range of the prior for $\lambda$ . Posterior estimation then proceeds as in Section 7.1, with an additional Metropolis sub-step to update the value of $\lambda$ . The full algorithm for posterior sampling is given in Appendix B.

7.3 Computational considerations

In practice, we find that $1000$ MCMC cycles yield appropriate coverage of the posterior when $\lambda$ is fixed (see Section 8.1) and $10000$ are sufficient when $\lambda$ is inferred. On a single CPU, $10000$ MCMC iterations takes approximately $5$ minutes for a dataset of size $\mathbf{X}:500\times 500$ (comparable to the data considered in Section 9) and computing time scales linearly with the dimension $d$ . Thus, full posterior inference via MCMC is feasible for most applications (with patience), but computing time can be a limitation.

8 Simulations

We consider three distinct simulation scenarios, and multiple conditions under each scenario, to illustrate and validate several aspects of Bayesian DWD.

In Section 8.1 we simulate from a data generating scheme that matches the assumed prior and likelihood exactly. Under this scenario, we confirm that posterior inferences using the MCMC algorithms in Section 7 are appropriately calibrated regardless of the marginal distribution for $\mathbf{X}$ . We compare to results using the normal approximation in Section 5 and a bootstrapping approach.

In Section 8.2 we consider a scenario in which the two classes are defined by different multivariate normal distributions. Under this scenario we demonstrate the robustness of Bayesian DWD to its model assumptions, assess the sensitivity of results to the tuning parameter $\lambda$ , and validate model-based inference for $\lambda$ .

In Section 8.3 we consider different scenarios for which not all of the outcomes are observed to illustrate potential benefits and limitations of the semi-supervised approach.

8.1 Assumed model simulation

Here we fix $\beta_{0}=0$ and simulate data from the assumed model as follows:

1.

Generate $\mathbf{X}:d\times n$ according to a specified probability distribution,
2.

Generate ${\bm{\beta}}_{\text{true}}$ from its conditional prior $\eqref{bprior}$ (via a Metropolis-Hastings algorithm), given $\lambda$
3.

For $i=1,\ldots,n$ , compute the score $u_{i}={\bf x}_{i}^{T}{\bm{\beta}}_{\text{true}}$ and generate $y_{i}$ according to its class probabilities defined by (5).

As manipulated conditions, we consider three values for $\lambda$ ( $\lambda=0.1,1,$ or $10$ ), two configurations of $d$ and $n$ ( $\mathbf{X}:20\times 100$ or $\mathbf{X}:100\times 20$ ), and three different probability distributions for $\mathbf{X}$ : (1) entries are simulated independently from a Uniform $(-1,1)$ distribution, (2) entries are simulated independently from an Exponential $(1)$ distribution centered to have mean $0$ , and (3) each column ${\bf x}_{i}$ is simulated from the following mixture of multivariate normal distributions:

\displaystyle x_{i}\sim\begin{cases}\mbox{MVN}(\bm{\mu}_{0},\mathbf{I})\text{ with probability }1/2\\ \mbox{MVN}(\bm{\mu}_{1},\mathbf{I})\text{ with probability }1/2,\end{cases}

(12)

where the entries of $\bm{\mu}_{0}$ and $\bm{\mu}_{1}$ are generated independently from a N $(0,0.5)$ distribution.

For each set of manipulated conditions, we generate $100$ datasets $\{\mathbf{X},{\bm{\beta}}_{\text{true}},\mathbf{y}\}$ under the assumed model and approximate the posterior distribution for ${\bm{\beta}}$ using the MCMC algorithm in Section 7.1. We construct 95% Bayesian credible intervals for ${\bm{\beta}}_{\text{true}}$ and the scores $\mathbf{u}=\mathbf{X}^{T}{\bm{\beta}}_{\text{true}}$ based on the posterior samples. We also consider credible intervals constructed using the posterior mode and the asymptotic approximation in Theorem 3 (i.e., the CLT approximation), and confidence intervals based on the percentiles of $200$ non-parametric bootstrap resamples as in Lyu et al. (2017).

Table 1 gives the average coverage rates for the true values under each set of conditions and the three different approaches to inference. The credible intervals using MCMC all have nominal coverage rates, which validates the algorithm. The CLT approximation gives appropriate coverage in several instances, even when $p>n$ . However, it performs less well when $\lambda$ is smaller and has relatively lower coverage rates for the scores $\mathbf{u}=\mathbf{X}^{T}{\bm{\beta}}_{\text{true}}$ . The bootstrap percentile approach yields more narrow intervals and tends to undercover in all cases. As a representative example, Figure 3 plots the mode and 95% intervals for the coefficients for an instance under the Uniform scenario with $\lambda=0.1$ , $n=100$ and $p=20$ ; this shows close agreement between the MCMC and CLT intervals, and the bootstrap intervals are universally more narrow.

To assess the calibration of estimated class probabilities, we generate a new set of $1000$ observations under the true model as a “test” set for each instance of the simulation. We compute the class probabilities for each test case, given the posterior fit to the $n=20$ or $n=100$ training cases. We group the estimated probabilities into narrow bins of length $0.02$ ( $[0,0.02),[0.02,0.04)$ , etc.) and consider the proportion of cases with $y=1$ in each bin. Figure 4 plots these proportions, aggregated across all conditions, vs. the midpoint of each bin. Using the posterior mean, the empirical proportions match the estimated probabilities perfectly, confirming that the model is well-calibrated and again validating the MCMC procedure. Using the posterior mode, the relationship deviates slightly but still tends to provide a reasonable fit in most cases.

Table 1: Coverage rates for model coefficients (

\beta

) and scores (

u

) using 95% credible intervals under MCMC sampling from the posterior, the asymptotic normal approximation (CLT), and bootstrapping (BOOT).

				MCMC		CLT		BOOT
Distribution	$n$	$d$	$\lambda$	$\beta$	$u$	$\beta$	$u$	$\beta$	$u$
Uniform	$20$	$100$	$0.1$	0.94	0.94	0.95	0.85	0.16	0.42
Uniform	$20$	$100$	$1$	0.94	0.94	0.95	0.85	0.23	0.50
Uniform	$20$	$100$	$10$	0.94	0.94	0.95	0.96	0.20	0.45
Uniform	$100$	$20$	$0.1$	0.94	0.94	0.92	0.85	0.46	0.55
Uniform	$100$	$20$	$1$	0.95	0.95	0.95	0.95	0.53	0.65
Uniform	$100$	$20$	$10$	0.94	0.94	0.95	0.95	0.24	0.32
Exponential	$20$	$100$	$0.1$	0.96	0.94	0.92	0.91	0.61	0.69
Exponential	$20$	$100$	$1$	0.95	0.95	0.96	0.95	0.62	0.71
Exponential	$20$	$100$	$10$	0.94	0.94	0.96	0.96	0.34	0.41
Exponential	$100$	$20$	$0.1$	0.94	0.95	0.92	0.85	0.39	0.49
Exponential	$100$	$20$	$0.1$	0.94	0.95	0.94	0.91	0.53	0.63
Exponential	$100$	$20$	$1$	0.94	0.94	0.95	0.96	0.40	0.52
Bimodal	$20$	$100$	$10$	0.95	0.96	0.94	0.91	0.61	0.66
Bimodal	$20$	$100$	$1$	0.94	0.95	0.96	0.95	0.66	0.73
Bimodal	$20$	$100$	$10$	0.93	0.94	0.96	0.96	0.39	0.46
Bimodal	$100$	$20$	$0.1$	0.94	0.94	0.92	0.85	0.37	0.47
Bimodal	$100$	$20$	$1$	0.95	0.95	0.94	0.90	0.51	0.60
Bimodal	$100$	$20$	$10$	0.94	0.95	0.96	0.96	0.40	0.61

8.2 Two-class multivariate normal simulation

Here we generate data according to a two-class multivariate normal distribution, in which each class has a different mean vector, as follows:

1.

Generate $d$ -dimensional mean vectors $\bm{\mu}_{0}$ and $\bm{\mu}_{1}$ , with independent entries from a N $(0,\tau^{2})$ distribution.
2.

Set $y_{i}=-1$ for $i=1,\ldots,n/2$ and $y_{i}=1$ for $i=n/2+1,\ldots,n$ .

For $i=1,\ldots,n$ , generate ${\bf x}_{i}$ via

x_{i}\sim\begin{cases}\mbox{MVN}(\bm{\mu}_{0},\mathbf{I})\text{ if }y_{0}=-1\\ \mbox{MVN}(\bm{\mu}_{1},\mathbf{I})\text{ if }y_{i}=1.\end{cases}

Note that this scenario differs from the ‘Bimodal’ case in Section 8.1 (12); for that scenario, membership in a given mixture component (defined by $\bm{\mu}_{0}$ and $\bm{\mu}_{1}$ ) does not necessarily correspond to the supervised class labels $y_{i}$ . The present scenario does not precisely match our assumed model. The “true” (i.e., oracle) probabilistic link between $y_{i}$ and ${\bf x}_{i}$ in this scenario corresponds to a probit link with coefficients that are proportional to the mean difference vector $\bm{\mu}_{0}-\bm{\mu}_{1}$ ; as shown in Figure 1 the link for Bayesian DWD is not a probit, but it is close.

We fix $n=100$ and consider different levels of signal ( $\tau=0,0.1,0.2,0.5$ ) and different dimensions ( $d=10,100,1000$ ) as manipulated conditions. For each instance, we approximate the posterior separately with $\lambda$ fixed at each of $15$ possible values, that range from $1/128$ to $128$ and are equally spaced on a log scale. For each case we compute the Kullback-Leibler divergence (KL-divergence) of the estimated class probabilities from the oracle model. For each case we also generate a test set of $5000$ samples and evaluate (1) the misclassification rate on the test samples using a posterior predictive probability of $0.5$ as a threshold, and the mean squared error (MSE) of the predicted class probabilities for class 1 ( $p_{i}$ ) from the true class memberships as

\text{MSE}=\frac{1}{n_{\text{test}}}\sum_{i=1}^{n_{\text{test}}}(1-p_{i})^{2}\mathbbm{1}_{\{y_{i}=-1\}}+p_{i}^{2}\mathbbm{1}_{\{y_{i}=1\}}.

Figure 5 plots these metrics as a function of $\lambda$ for the different conditions. In general, lower values of $\lambda$ perform better when the distinguishing signal is stronger. This is intuitive, as $\lambda$ can be considered an inverse scale parameter for ${\bm{\beta}}$ , and larger magnitudes for ${\bm{\beta}}$ correspond to probabilities closer to $0$ and $1$ . The misclassification rate is robust to the value of $\lambda$ , but its specification is important to achieve well-calibrated probabilities.

Motivated by the sensitivity of the estimated probabilities to $\lambda$ , we also approximate the posterior with a uniform prior on $\lambda$ as in Section 6.2. Table 2 summarizes the posterior mean for $\lambda$ under each set of conditions, and compares the performance of the model with uniform prior on $\lambda$ with that of the best performing fixed value of $\lambda$ for each metric. The model with a uniform prior performs comparatively well across the different scenarios, and the posterior mean for $\lambda$ appropriately ranges from very high values when the signal is nonexistent to low values when the signal is strong. However, for the conditions with $p=1000$ and strong signal the posterior mean is substantially larger than the best performing values of $\lambda$ ; this is likely because the model fits well and perfectly classifies the data with probabilities close to $0$ or $1$ for a wide range of $\lambda$ values (see Figure 5), so there is not strong evidence to distinguish between different values of $\lambda$ in this range.

Table 2: Comparison of model performance when

\lambda

is inferred with a uniform prior (

\hat{\lambda}

) vs. the best performing fixed value of

\lambda

based on KL-divergence (

\lambda_{\text{KL}}

) or MSE

\lambda_{\text{MSE}}

. Results for each set of condition are averages over

10

replications.

$d$	s2n ( $\tau)$	$\hat{\lambda}$	$\lambda_{\text{KL}}$	$\lambda_{\text{MSE}}$	KL( $\hat{\lambda}$ )	KL( $\lambda_{\text{KL}}$ )	MSE( $\hat{\lambda}$ )	MSE( $\lambda_{\text{MSE}}$ )
$10$	$0$	70.21	128.00	128.00	0.00	0.00	0.25	0.25
$10$	$0.1$	70.82	2.00	2.00	0.02	0.02	0.25	0.25
$10$	$0.2$	31.27	0.50	1.00	0.06	0.02	0.24	0.22
$10$	$0.5$	0.10	0.12	0.12	0.06	0.05	0.11	0.11
100	$0$	74.56	128.00	64.00	0.00	0.00	0.25	0.25
100	$0.1$	22.55	1.00	2.00	0.15	0.10	0.24	0.22
100	$0.2$	0.14	0.12	0.50	0.11	0.10	0.11	0.10
100	$0.5$	0.11	0.01	0.02	0.03	0.01	0.00	0.00
1000	$0$	44.79	128.00	128.00	0.02	0.00	0.25	0.25
1000	$0.1$	3.11	0.03	0.50	0.21	0.06	0.11	0.09
1000	$0.2$	0.63	0.01	0.03	0.07	0.01	0.00	0.00
1000	$0.5$	1.82	0.01	0.01	0.04	0.00	0.00	0.00

In Figure 6 we consider the overall calibration of the fitted probabilities across the different conditions, as in Section 8.1. This shows clearly that fixing $\lambda$ at a large value ( $\lambda=128$ ) can yield overly conservative estimated probabilities and a relatively small value ( $\lambda=1/128$ ) can yield overly confident probabilities. The model in which $\lambda$ is inferred with a uniform prior is mostly well-calibrated but does show some deviations from the estimated probabilities and empirical proportions. These deviations are likely in part due to the fact that the assumed probabilistic link does not perfectly match the true generative model.

8.3 Semi-supervised simulation

Here we generate data in which some of the class labels $y_{i}$ may be unknown, to illustrate a semi-supervised approach. We consider two scenarios. For the first scenario, the observations ${\bf x}_{i}$ are generated exactly as in Section 8.2 with $d=100$ and $\tau=0.3$ . For the second scenario, $d=100$ and entries of each ${\bf x}_{i}$ are generated independently from a $N(0,1)$ distribution; then class labels $y_{i}$ are defined deterministically by the sign of ${\bf x}_{i}^{T}{\bm{\beta}}_{\text{sep}}$ , where the entries of ${\bm{\beta}}_{\text{sep}}$ are generated from a standard normal distribution. The key difference between these two scenarios, for our purpose, is that the first scenario (bimodal) has an underlying cluster structure that distinguishes the two classes whereas the second (unimodal) does not. So, we expect to improve performance by considering unlabeled data for the bimodal scenario but not the unimodal scenario.

As manipulated conditions, we vary the number of observations with $y_{i}$ observed ( $n_{o}=10,20,100,1000)$ and with $y_{i}$ unobserved ( $n_{u}=0,10,20,100,1000$ ) in our training set, for each of the bimodal and unimodal scenarios. For each combination of conditions, we generate $20$ datasets, use MCMC to approximate the posterior, and consider the resulting misclassification rates for a test set. The average test misclassification rates under each set of conditions are shown in Table 3. This shows that for the bimodal scenario a semi-supervised analysis with a large number of unlabeled observations can improve performance dramatically; the misclassification rates for $n_{o}=10$ observed and $n_{u}=1000$ unobserved are comparable to that with $n_{o}=1000$ observed. As expected, including data without observed class labels does not help in the unimodal scenario; however, it does not lead to a substantial decrease in performance.

Table 3: Test misclassification rates using Bayesian DWD with differing number of observations with

y_{i}

observed (

n_{o}

) or unobserved (

n_{u}

		$N$ unobserved ( $n_{u}$ )
Scenario	$N$ observed ( $n_{o}$ )	$0$	$10$	$20$	$100$	$1000$
Bimodal	$10$	0.164	0.138	0.145	0.081	0.019
Bimodal	$20$	0.077	0.077	0.080	0.053	0.014
Bimodal	$100$	0.038	0.035	0.032	0.036	0.020
Bimodal	$1000$	0.019	0.021	0.019	0.020	0.017
Unimodal	$10$	0.444	0.451	0.425	0.410	0.432
Unimodal	$20$	0.392	0.390	0.402	0.395	0.412
Unimodal	$100$	0.280	0.266	0.272	0.280	0.297
Unimodal	$1000$	0.098	0.102	0.091	0.104	0.113

9 Application to breast cancer genomics

As a real data application, we consider classifying breast cancer (BRCA) tumor subtypes using data on microRNA (miRNA) abundance that are publicly available from the Cancer Genome Atlas (TCGA). We use miRNA-Seq data for $n=1047$ BRCA tumor samples from different individuals, that was previously curated for a pan-cancer clustering application (Hoadley et al., 2018). The tumors are classified into one of $4$ canonical subtypes based on the PAM $50$ classifier for gene (mRNA) expression (Parker et al., 2009): Luminal A (LumA) ( $n=572$ ), Luminal B (LumB) ( $n=203$ ), Her2 $(n=83$ ), and Basal ( $189$ ). Previous exploratory analysis have shown that miRNA data shares some of the same BRCA subtype distinctions as gene expression (Park and Lock, 2020; Lock and Dunson, 2013). Here, we examine the power of miRNAs to distinguish the subtypes using a supervised approach.

The miRNA expression read counts were log( $1+x$ )-transformed and centered to have mean $0$ . After transformation, we keep miRNAs with standard deviation greater than $0.5$ , leaving $d=489$ miRNAs. We consider separate classification tasks to discriminate each pair of subtypes, giving $6$ comparisons. For each comparison, we apply (1) Bayesian DWD (BayesDWD) with uniform prior on $\lambda$ , (2) DWD (i.e., the posterior mode) via the R package sdwd, (3) SVM via the package e1071, (4) random forest classification (Breiman, 2001) via the package randomForest, and (5) LDA via the package MASS. We assess each method by the average misclassification rate on the test sets under 10-fold cross-validation. The results are shown in Table 4. Bayesian DWD, DWD, and SVM have similar classification performance across the different comparisons, while random forests and LDA perform less well by comparison. For methods that give a class probability we consider the calibration of the estimated probabilities on the test sets vs. their empirical class proportions, as in Section 8.1. Figure 7 shows that the probabilities inferred for Bayesian DWD are overall well-calibrated. The probabilities under the random forest are also reasonable, but tend toward being over-conservative. The probabilities for LDA are not well-calibrated. LDA suffers from over-fitting in situations with higher dimension and lower sample size; for this application, the vast majority (93%) of predicted probabilities using LDA are below $0.02$ or above $0.98$ , leaving limited data for the remaining bins.

Table 4: Test classification rates for different subtype comparisons under cross validation using Bayesian DWD (BayesDWD), DWD, SVM, random forests (RF), and LDA.

Comparison	BayesDWD	DWD	SVM	RF	LDA
LumA vs. LumB	0.168	0.156	0.158	0.186	0.296
LumA vs. Her2	0.054	0.045	0.059	0.070	0.178
LumA vs. Basal	0.026	0.028	0.025	0.030	0.084
LumB vs. Her2	0.129	0.124	0.155	0.169	0.188
LumB vs. Basal	0.046	0.043	0.044	0.047	0.137
Her2 vs. Basal	0.066	0.064	0.053	0.075	0.111

To illustrate the Bayesian DWD results, we focus on the comparison with the best classification accuracy from Table 4 (LumA vs. Basal) and the comparison with the worst classification accuracy (LumA vs. LumB). Figure 8 plots the posterior mean of the DWD scores and their $95\%$ credible intervals for each comparison. Note that the DWD scores are almost perfectly separated for LumA vs. Basal, but not for LumA vs. LumB. The credible intervals for each plot help to understand the level of uncertainty in the discriminating projections.

10 Discussion

The Bayesian formulation of DWD is straightforward to modify or extend via changes to the likelihood or prior. As a future direction, we are interested in extending the model to accommodate multiple sources of data (e.g., multiple ’omics or clinical sources) or data with multiway structure (Lyu et al., 2017). In this article we have focused on the original and most widely used version of DWD, but the Bayesian framework may be extended for other versions including multi-class DWD (Huang et al., 2013), weighted DWD (Qiao et al., 2010), sparse DWD (Wei et al., 2016), and kernel DWD (Wang and Zou, 2018, 2019). Additionally, a generalized version of DWD was introduced in (Hall et al., 2005), wherein the distances $r_{i}$ in (2) are raised to a power $q$ . The choice of $q$ modifies the loss function $V(\cdot)$ in (3) (Wang and Zou, 2018), which in turn affects the probabilistic link (5); the Bayesian formulation is potentially very useful in this context, because one can put a prior on $q$ and infer it within the model.

Computing time and resources can limit a fully Bayesian treatment of DWD for some applications, as discussed in Section 7.3. In such cases, the DWD solution can still be interpreted as a posterior mode to give point estimates for class probabilities, and the normal approximation in Section 5 can be used to quickly assess uncertainty in the estimated coefficients. Alternative computational approaches that facilitate a fully Bayesian treatment for massive datasets are worth pursuing.

Acknowledgments

This work was supported by the National Institutes of Health (NIH) grant R01-GM130622.

A Proofs

Theorem 1

Proof Without loss of generality assume $y_{1}=-1$ and $y_{2}=1$ . Define

h({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)=\left[\prod_{i=1}^{n}e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}\right]e^{-(\lambda n/2)||{\bm{\beta}}||^{2}_{2}}.

By (4), $p({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)\propto h({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)$ . It suffices to show that (i) $h({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)$ is non-negative and (ii) $h({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)$ has finite integral over ${\bm{\beta}}$ and $\beta_{0}$ . Condition (i) is trivially satisfied. For condition (ii), note that $V(u)>0\;\forall u\in\mathbb{R}$ , where $V(u)$ is defined in (3). Thus,

e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}<1\;\;\text{ for }i=1,\ldots,n.

Let $||\cdot||_{1}$ define the $l_{1}$ -norm, and note that $||\mathbf{X}||_{1}||{\bm{\beta}}||_{1}>\mbox{max}\left(\{|{\bf x}_{i}^{T}{\bm{\beta}}|:i=1,\ldots,n\}\right)$ . For fixed ${\bm{\beta}}\in\mathbb{R}^{p}$ ,

	$\displaystyle\int\prod_{i=1}^{n}e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}\,d\beta_{0}$	$\displaystyle=\int\displaylimits_{-\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}}^{\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}}\prod_{i=1}^{n}e^{-V\left(y_{1}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}\,d\beta_{0}$
		$\displaystyle\;\;\;\;\;\;\;\;\;\;\;+\int\displaylimits_{-\infty}^{-\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}}\prod_{i=1}^{n}e^{-V\left(y_{1}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}\,d\beta_{0}$
		$\displaystyle\;\;\;\;\;\;\;\;\;\;\;+\int\displaylimits_{\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}}^{\infty}\prod_{i=1}^{n}e^{-V\left(y_{1}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}\,d\beta_{0}$
		$\displaystyle<2\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}+\int\displaylimits_{-\infty}^{-\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}}e^{-V\left(y_{2}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}d\beta_{0}$
		$\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\int\displaylimits_{\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}}^{\infty}e^{-V\left(y_{1}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}d\beta_{0}$
		$\displaystyle<2\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}+2\int\displaylimits_{0}^{\infty}e^{-V(-u)}du$
		$\displaystyle=2\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}+2\int\displaylimits_{0}^{\infty}e^{1-u}du$
		$\displaystyle=2\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}+2e.$

It follows that

	$\displaystyle\int\int h({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)d\beta_{0}d{\bm{\beta}}$	$\displaystyle=\int\left[\int\prod_{i=1}^{n}e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}\,d\beta_{0}\right]e^{-(\lambda n/2)\|\|{\bm{\beta}}\|\|^{2}_{2}}d{\bm{\beta}}$
		$\displaystyle<\int(2\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}+2e)e^{-(\lambda n/2)\|\|{\bm{\beta}}\|\|^{2}_{2}}d{\bm{\beta}}$
		$\displaystyle=2\|\|\mathbf{X}\|\|_{1}\int\|\|{\bm{\beta}}\|\|_{1}e^{-(\lambda n/2)\|\|{\bm{\beta}}\|\|^{2}_{2}}d{\bm{\beta}}+2e\int e^{-(\lambda n/2)\|\|{\bm{\beta}}\|\|^{2}_{2}}d{\bm{\beta}}$
		$\displaystyle=2\|\|\mathbf{X}\|\|_{1}\cdot p\left(\frac{\lambda n}{2}\right)\left(\frac{2\pi}{\lambda n}\right)^{\frac{p-1}{2}}+2e\left(\frac{2\pi}{\lambda n}\right)^{\frac{p}{2}}.$

Theorem 2

If $\mathbf{X}\in\mathbb{R}^{d\times n}$ , $\lambda>0$ , and $\beta_{0}\in\mathbb{R}$ , then

p({\bm{\beta}}\mid\mathbf{X},\beta_{0},\lambda)\propto A\cdot B

gives a proper probability density function, where $A$ and $B$ are defined in (7) and (8).

Proof Define $h({\bm{\beta}},\beta_{0},\mathbf{X},\beta_{0},\lambda)=A\cdot B$ , where $A$ and $B$ are defined in (7) and (8). Note that $A$ and $B$ are always non-negative, and term $A$ is alway less than $1$ . Thus,

	$\displaystyle\int h({\bm{\beta}},\beta_{0},\mathbf{X},\beta_{0},\lambda)d{\bm{\beta}}$	$\displaystyle<\int e^{-(\lambda n/2)\|\|{\bm{\beta}}\|\|^{2}_{2}}d{\bm{\beta}}$
		$\displaystyle=\left(\frac{2\pi}{\lambda n}\right)^{\frac{p}{2}}.$

B Algorithm

Here we describe in detail the algorithm to simulate draws $\{{\bm{\beta}}^{(t)},\beta_{0}^{(t)},\lambda^{(t)}\}_{t=1}^{T}$ from their joint posterior distribution $p({\bm{\beta}}^{(t)},\beta_{0}^{(t)},\lambda^{(t)}\mid\mathbf{X},\mathbf{y})$ . First, estimate the normalizing function $\hat{\phi}(\mathbf{X},\lambda,\beta_{0})$ as described in Section 7.2. Then, initialize $\lambda^{(0)}$ to a positive value (e.g., $\lambda^{(0)}=1$ ) and initialize $\{{\bm{\beta}}^{(0)},\beta_{0}^{(0)}\}$ to the conditional posterior mode using the sdwd package (Wang and Zou, 2016) or other software. The algorithm is given below for iterations $t=1,\ldots,T$ .

1.
For $i=1,\ldots,d$ , update $\beta_{i}$ as follows:
1. (a)
  
  Generate proposal $\beta_{i}^{*}\sim N(0,\sigma_{*}^{2})$
2. (b)
  
  Set $\tilde{{\bm{\beta}}}=[\beta_{1}^{(t)}\dots\beta_{i-1}^{(t)}\;\beta_{i}^{(t-1)}\dots\beta_{d}^{(t-1)}]$ and ${\bm{\beta}}^{*}=[\beta_{1}^{(t)}\dots\beta_{i-1}^{(t)}\;\beta_{i}^{*}\dots\beta_{d}^{(t-1)}]$
3. (c)
  
  Compute $r=\frac{p({\bm{\beta}}^{*},\beta_{0}^{(t-1)}\mid\mathbf{X},\mathbf{y},\lambda^{(t-1)})}{p(\tilde{{\bm{\beta}}},\beta_{0}^{(t-1)}\mid\mathbf{X},\mathbf{y},\lambda^{(t-1)})}$ , with the numerator and denominator given by (4)
4. (d)
  
  Generate $u\sim\mbox{Uniform}(0,1)$
5. (e)
  
  If $u<r$ set $\beta_{i}^{(t)}=\beta_{i}^{*}$ , otherwise set $\beta_{i}^{(t)}=\beta_{i}^{(t-1)}$ .
2.
Update $\beta_{0}$ as follows:
1. (a)
  
  Generate proposal $\beta_{0}^{*}\sim N(0,0.25)$
2. (b)
  
  Compute $r=\frac{p({\bm{\beta}}^{(t)},\beta_{0}^{*}\mid\mathbf{X},\mathbf{y},\lambda^{(t-1))}}{p({\bm{\beta}}^{(t)},\beta_{0}^{(t-1)}\mid\mathbf{X},\mathbf{y},\lambda^{(t-1)}}$ , with the numerator and denominator given by (4)
3. (c)
  
  Generate $u\sim\mbox{Uniform}(0,1)$
4. (d)
  
  If $u<r$ set $\beta_{0}^{(t)}=\beta_{0}^{*}$ , otherwise set $\beta_{i}^{(t)}=\beta_{0}^{(t-1)}$ .
3.
Update $\lambda$ as follows:
1. (a)
  
  Generate proposal $\lambda^{*}$ by $\lambda^{*}=e^{\mbox{log}\lambda^{(t-1)}+Z/4}$ where $Z\sim N(0,1)$
2. (b)
  
  Compute $r=\frac{\lambda^{*}\hat{\phi}(\mathbf{X},\lambda^{*})e^{-0.5n\lambda^{*}||{\bm{\beta}}^{(t)}||^{2}}}{\lambda^{(t-1)}\hat{\phi}(\mathbf{X},\lambda^{(t-1)})e^{-0.5n\lambda^{(t-1)}||{\bm{\beta}}^{(t)}||^{2}}}$
3. (c)
  
  Generate $u\sim\mbox{Uniform}(0,1)$
4. (d)
  
  If $u<r$ set $\lambda^{(t)}=\lambda^{*}$ , otherwise set $\lambda^{(t)}=\lambda^{(t-1)}$ .
4.

Set $\sigma^{2}_{*}$ to half of the sample variance of $\{\beta_{i}\}_{i=1}^{d}$ .

Note that if $\lambda$ is fixed, step (c) above can be skipped. The adaptive updating of the proposal variance in step (d) is not necessary for the validity of the algorithm, but can improve mixing considerably when the approximate scale of the coefficients ${\bm{\beta}}$ is uncertain.

References

Breiman (2001) L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
Carlin and Louis (2008) B. P. Carlin and T. A. Louis. Bayesian methods for data analysis. CRC Press, 2008.
Cortes and Vapnik (1995) C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
Ghosal (1997) S. Ghosal. A review of consistency and convergence of posterior distribution. In Varanashi Symposium in Bayesian Inference, Banaras Hindu University, 1997.
Hall et al. (2005) P. Hall, J. S. Marron, and A. Neeman. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):427–444, 2005.
Henao et al. (2014) R. Henao, X. Yuan, and L. Carin. Bayesian nonlinear support vector machines and discriminative factor modeling. In Advances in neural information processing systems, pages 1754–1762, 2014.
Hoadley et al. (2018) K. A. Hoadley, C. Yau, T. Hinoue, D. M. Wolf, A. J. Lazar, E. Drill, R. Shen, A. M. Taylor, A. D. Cherniack, V. Thorsson, et al. Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell, 173(2):291–304, 2018.
Hsiang (1975) T. Hsiang. A Bayesian view on ridge regression. Journal of the Royal Statistical Society: Series D (The Statistician), 24(4):267–268, 1975.
Huang et al. (2012) H. Huang, X. Lu, Y. Liu, P. Haaland, and J. Marron. R/DWD: distance-weighted discrimination for classification, visualization and batch adjustment. Bioinformatics, 28(8):1182–1183, 2012.
Huang et al. (2013) H. Huang, Y. Liu, Y. Du, C. M. Perou, D. N. Hayes, M. J. Todd, and J. S. Marron. Multiclass distance-weighted discrimination. Journal of Computational and Graphical Statistics, 22(4):953–969, 2013.
Liu et al. (2011) Y. Liu, H. H. Zhang, and Y. Wu. Hard or soft classification? large-margin unified machines. Journal of the American Statistical Association, 106(493):166–177, 2011.
Lock and Dunson (2013) E. F. Lock and D. B. Dunson. Bayesian consensus clustering. Bioinformatics, 29(20):2610–2616, 2013.
Lyu et al. (2017) T. Lyu, E. F. Lock, and L. E. Eberly. Discriminating sample groups with multi-way data. Biostatistics, 18(3):434–450, 2017.
Marron et al. (2007) J. Marron, M. J. Todd, and J. Ahn. Distance-weighted discrimination. Journal of the American Statistical Association, 102(480):1267–1271, 2007.
Park and Lock (2020) J. Y. Park and E. F. Lock. Integrative factorization of bidimensionally linked matrices. Biometrics, 76(1):61–74, 2020.
Park and Casella (2008) T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
Parker et al. (2009) J. S. Parker, M. Mullins, M. C. Cheang, S. Leung, D. Voduc, T. Vickery, S. Davies, C. Fauron, X. He, Z. Hu, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of clinical oncology, 27(8):1160, 2009.
Qiao and Zhang (2015) X. Qiao and L. Zhang. Flexible high-dimensional classification machines and their asymptotic properties. The Journal of Machine Learning Research, 16(1):1547–1572, 2015.
Qiao et al. (2010) X. Qiao, H. H. Zhang, Y. Liu, M. J. Todd, and J. S. Marron. Weighted distance weighted discrimination and its asymptotic properties. Journal of the American Statistical Association, 105(489):401–414, 2010.
Sollich (2002) P. Sollich. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine learning, 46(1-3):21–52, 2002.
Sun et al. (2001) D. Sun, R. K. Tsutakawa, and Z. He. Propriety of posteriors with improper priors in hierarchical linear mixed models. Statistica Sinica, pages 77–95, 2001.
Taraldsen and Lindqvist (2010) G. Taraldsen and B. H. Lindqvist. Improper priors are not improper. The American Statistician, 64(2):154–158, 2010.
Wang and Zou (2016) B. Wang and H. Zou. Sparse distance weighted discrimination. Journal of Computational and Graphical Statistics, 25(3):826–838, 2016.
Wang and Zou (2018) B. Wang and H. Zou. Another look at distance-weighted discrimination. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):177–198, 2018.
Wang and Zou (2019) B. Wang and H. Zou. A multicategory kernel distance weighted discrimination method for multiclass classification. Technometrics, 61(3):396–408, 2019.
Wei et al. (2016) S. Wei, C. Lee, L. Wichers, and J. Marron. Direction-projection-permutation for high-dimensional hypothesis tests. Journal of Computational and Graphical Statistics, 25(2):549–569, 2016.
Witten and Tibshirani (2011) D. M. Witten and R. Tibshirani. Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):753–772, 2011.

	$\displaystyle\int\prod_{i=1}^{n}e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}\,d\beta_{0}$	$\displaystyle=\int\displaylimits_{-\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}}^{\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}}\prod_{i=1}^{n}e^{-V\left(y_{1}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}\,d\beta_{0}$
		$\displaystyle\;\;\;\;\;\;\;\;\;\;\;+\int\displaylimits_{-\infty}^{-\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}}\prod_{i=1}^{n}e^{-V\left(y_{1}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}\,d\beta_{0}$
		$\displaystyle\;\;\;\;\;\;\;\;\;\;\;+\int\displaylimits_{\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}}^{\infty}\prod_{i=1}^{n}e^{-V\left(y_{1}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}\,d\beta_{0}$
		$\displaystyle<2\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}+\int\displaylimits_{-\infty}^{-\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}}e^{-V\left(y_{2}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}d\beta_{0}$
		$\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\int\displaylimits_{\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}}^{\infty}e^{-V\left(y_{1}(\beta_{0}+{\bf x}_{1}^{T}{\bm{\beta}})\right)}d\beta_{0}$
		$\displaystyle<2\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}+2\int\displaylimits_{0}^{\infty}e^{-V(-u)}du$
		$\displaystyle=2\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}+2\int\displaylimits_{0}^{\infty}e^{1-u}du$
		$\displaystyle=2\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}+2e.$

	$\displaystyle\int\int h({\bm{\beta}},\beta_{0}\mid\mathbf{X},\mathbf{y},\lambda)d\beta_{0}d{\bm{\beta}}$	$\displaystyle=\int\left[\int\prod_{i=1}^{n}e^{-V\left(y_{i}(\beta_{0}+{\bf x}_{i}^{T}{\bm{\beta}})\right)}\,d\beta_{0}\right]e^{-(\lambda n/2)\|\|{\bm{\beta}}\|\|^{2}_{2}}d{\bm{\beta}}$
		$\displaystyle<\int(2\|\|\mathbf{X}\|\|_{1}\|\|{\bm{\beta}}\|\|_{1}+2e)e^{-(\lambda n/2)\|\|{\bm{\beta}}\|\|^{2}_{2}}d{\bm{\beta}}$
		$\displaystyle=2\|\|\mathbf{X}\|\|_{1}\int\|\|{\bm{\beta}}\|\|_{1}e^{-(\lambda n/2)\|\|{\bm{\beta}}\|\|^{2}_{2}}d{\bm{\beta}}+2e\int e^{-(\lambda n/2)\|\|{\bm{\beta}}\|\|^{2}_{2}}d{\bm{\beta}}$
		$\displaystyle=2\|\|\mathbf{X}\|\|_{1}\cdot p\left(\frac{\lambda n}{2}\right)\left(\frac{2\pi}{\lambda n}\right)^{\frac{p-1}{2}}+2e\left(\frac{2\pi}{\lambda n}\right)^{\frac{p}{2}}.$

Bayesian Distance Weighted Discrimination

Abstract

1 Introduction

2 Framework and notation

3 Distance weighted discrimination

4 Bayesian model

Theorem 1

4.1 Conditional distribution of 𝐲\mathbf{y}

4.2 Prior distribution of β\beta

Theorem 2

4.3 Distribution of XX

5 Asymptotic normality of β\beta

Theorem 3

6 Extensions

6.1 Marginal class probabilities

6.2 Inference for λ\lambda

6.3 Semi-supervised learning

7 Posterior approximation

7.1 MCMC algorithm, fixed λ\lambda

7.2 Inference for λ\lambda

7.3 Computational considerations

8 Simulations

8.1 Assumed model simulation

8.2 Two-class multivariate normal simulation

8.3 Semi-supervised simulation

9 Application to breast cancer genomics

10 Discussion

A Proofs

Theorem 1

Theorem 2

B Algorithm

References

4.1 Conditional distribution of $\mathbf{y}$

4.2 Prior distribution of $\beta$

4.3 Distribution of $X$

5 Asymptotic normality of $\beta$

6.2 Inference for $\lambda$

7.1 MCMC algorithm, fixed $\lambda$

7.2 Inference for $\lambda$