This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Combining Biomarkers by Maximizing the True Positive Rate for a Fixed False Positive Rate

Abstract

Biomarkers abound in many areas of clinical research, and often investigators are interested in combining them for diagnosis, prognosis, or screening. In many applications, the true positive rate for a biomarker combination at a prespecified, clinically acceptable false positive rate is the most relevant measure of predictive capacity. We propose a distribution-free method for constructing biomarker combinations by maximizing the true positive rate while constraining the false positive rate. Theoretical results demonstrate desirable properties of biomarker combinations produced by the new method. In simulations, the biomarker combination provided by our method demonstrated improved operating characteristics in a variety of scenarios when compared with alternative methods for constructing biomarker combinations. Biomarker combinations; False positive rate; Sensitivity; Specificity; True positive rate.

ALLISON MEISNER

Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
ameisne1@jhu.edu

MARCO CARONE

Department of Biostatistics, University of Washington, Seattle, WA, USA

MARGARET S. PEPE

Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA

KATHLEEN F. KERR

Department of Biostatistics, University of Washington, Seattle, WA, USA


1 Introduction

As the number of available biomarkers has grown, so has the interest in combining them for the purposes of diagnosis, prognosis, or screening. Over the past decade, much work has been done to develop methods for constructing biomarker combinations by targeting measures of performance, including those related to the receiver operating characteristic, or ROC, curve. This is in contrast to more traditional methods that construct biomarker combinations by optimizing general global fit criteria, such as the maximum likelihood approach. While methods to construct both linear and nonlinear combinations have been proposed, linear biomarker combinations are more common than nonlinear combinations, due to their greater interpretability and ease of construction (Wang and Chang, 2011; Hsu and Hsueh, 2013).

Although the area under the ROC curve, the AUC, is arguably the most popular way to summarize the ROC curve, there is often interest in identifying a biomarker combination with a high true positive rate (TPR), the proportion of correctly classified diseased individuals, while setting the false positive rate (FPR), the proportion of incorrectly classified nondiseased individuals, at some clinically acceptable level. A common practice among applied researchers is to construct linear biomarker combinations using logistic regression, and then calculate the TPR for the prespecified FPR, e.g., Moore et al. (2008). While methods for constructing biomarker combinations by maximizing the AUC or the partial AUC have been developed, these methods do not directly target the TPR for a specified FPR.

We propose a distribution-free method for constructing linear biomarker combinations by maximizing the TPR while constraining the FPR. We demonstrate desirable theoretical properties of the resulting combination, and provide empirical evidence of good small-sample performance through simulations. To illustrate our method, we consider data from a prospective study of diabetes mellitus in 532 adult women with Pima Indian heritage (Smith et al., 1988). Several variables were measured for each participant, and criteria from the World Health Organization were used to identify women who developed diabetes. A primary goal of the study was to predict the onset of diabetes within five years.

2 Background

2.1 ROC Curve and Related Measures

The ROC curve provides a means to evaluate the ability of a biomarker or, equivalently, biomarker combination ZZ to identify individuals who have or will experience a binary outcome DD. For example, in a diagnostic setting, DD denotes the presence or absence of disease and ZZ may be used to identify individuals with the disease. The ROC curve provides information about how well the biomarker discriminates between individuals who have or will experience the outcome, that is, the cases, and individuals who do not have or will not experience the outcome, that is, the controls (Pepe, 2003). Mathematically, if larger values of ZZ are more indicative of having or experiencing the outcome, for each threshold δ\delta we can define the TPR as P(Z>δD=1)P(Z>\delta\mid D=1) and the FPR as P(Z>δD=0)P(Z>\delta\mid D=0) (Pepe, 2003). For a given δ\delta, the TPR is also referred to as the sensitivity, and one minus the specificity equals the FPR (Pepe, 2003). The ROC curve is a plot of the TPR versus the FPR as δ\delta ranges over all possible values; as such, it is non-decreasing and takes values in the unit square (Pepe, 2003). A perfect biomarker has an ROC curve that reaches the upper left corner of the unit square, and a useless biomarker has an ROC curve on the 45-degree line (Pepe, 2003).

The most common summary of the ROC curve is the AUC, the area under the ROC curve. The AUC ranges between 0.5 for a useless biomarker and 1 for a perfect biomarker (Pepe, 2003). The AUC has a probabilistic interpretation: it is the probability that the biomarker value for a randomly chosen case is larger than that for a randomly chosen control, assuming that higher biomarker values are more indicative of having or experiencing the outcome (Pepe, 2003). Both the ROC curve and the AUC are invariant to monotone increasing transformations of the biomarker ZZ (Pepe, 2003).

The AUC summarizes the entire ROC curve, but in many situations it is more appropriate to only consider certain FPR values. For example, screening tests require a very low FPR, while diagnostic tests for fatal diseases may allow for a slightly higher FPR if the corresponding TPR is very high (Hsu and Hsueh, 2013). Such considerations led to the development of the partial AUC, the area under the ROC curve over some range (t0,t1)(t_{0},t_{1}) of FPR values (Pepe, 2003). Rather than considering a range of FPR values, there may be interest in fixing the FPR at a single value, determining the corresponding threshold δ\delta, and evaluating the TPR for that threshold. As opposed to the AUC and the partial AUC, this method returns a single classifier, or decision rule, which may appeal to researchers seeking a tool for clinical decision-making.

2.2 Biomarker Combinations

Many methods to combine biomarkers have been proposed, and they can be divided into two categories. The first includes indirect methods that seek to optimize a measure other than the performance measure of interest, while the second category includes direct methods that optimize the target performance measure. We focus on the latter.

Targeting the entire ROC curve (that is, constructing a combination that produces an ROC curve that dominates the ROC curve for all other linear combinations at all points) is very challenging and is only possible under special circumstances. Su and Liu (1993) demonstrated that, when the vector X of biomarkers has a multivariate normal distribution conditional on DD with proportional covariance matrices, it is possible to identify the linear combination that maximizes the TPR uniformly over the entire range of FPRs; this linear combination is Fisher’s linear discriminant function.

McIntosh and Pepe (2002) used the Neyman-Pearson lemma to demonstrate optimality (in terms of the ROC curve) of the likelihood ratio function and, consequently, of the risk score P(D=1|X=x)P(D=1|\textbf{X}=\textbf{x}) and monotone transformations of P(D=1|X=x)P(D=1|\textbf{X}=\textbf{x}). Thus, if the biomarkers are conditionally multivariate normal and the DD-specific covariance matrices are equal, the optimal linear combination dominates not just every other linear combination, but also every nonlinear combination. This results from the fact that in this case, the linear logistic model logit{P(D=1|X=x)}=𝜽x\mbox{logit}\{P(D=1|\textbf{X}=\textbf{x})\}={\boldsymbol{\theta}}^{\top}\textbf{x} holds for some pp-dimensional 𝜽{\boldsymbol{\theta}}, where pp is the dimension of X. If the covariance matrices are proportional but not equal, the likelihood ratio is a nonlinear function of the biomarkers, as shown in the Appendix A for p=2p=2, and the optimal biomarker combination with respect to the ROC curve is nonlinear.

In general, there is no linear combination that dominates all others in terms of the TPR over the entire range of FPR values (Su and Liu, 1993; Anderson and Bahadur, 1962). Thus, methods to optimize the AUC have been proposed. When the biomarkers are conditionally multivariate normal with nonproportional covariance matrices, Su and Liu (1993) gave an explicit form for the best linear combination with respect to the AUC. Others have targeted the AUC without any assumption on the distribution of the biomarkers; many of these methods rely on smooth approximations to the empirical AUC, which involves indicator functions (Ma and Huang, 2007; Fong, Yin, and Huang, 2016; Lin et al., 2011).

Acknowledging that often only a range of FPR values is of interest clinically, methods have been proposed to target the partial AUC for some FPR range (t0,t1)(t_{0},t_{1}). Some methods make parametric assumptions about the joint distribution of the biomarkers (Yu and Park, 2015; Hsu and Hsueh, 2013; Yan et al., 2018) while others do not (Wang and Chang, 2011; Komori and Eguchi, 2010; Yan et al., 2018). The latter group of methods generally uses a smooth approximation to the partial AUC, similar to some of the methods that aim to maximize the AUC (Wang and Chang, 2011; Komori and Eguchi, 2010; Yan et al., 2018). One challenge faced in partial AUC maximization is that for narrow intervals, that is, when t0t_{0} is close to t1t_{1}, the partial AUC is often very close to 0, which can make optimization difficult (Hsu and Hsueh, 2013).

Some work in constructing biomarker combinations by maximizing the TPR has been done for conditionally multivariate normal biomarkers. In this setting, procedures for constructing a linear combination that maximizes the TPR for a fixed FPR (Anderson and Bahadur, 1962; Gao et al., 2008) as well as methods for constructing a linear combination by maximizing the TPR for a range of FPR values (Liu, Schisterman, and Zhu, 2005) have been proposed. Importantly, in the method proposed by Liu, Schisterman, and Zhu (2005), the range of FPR values over which the fitted combination is optimal may depend on the combination itself; that is, the range of FPR values may be determined by the combination and so may not be fixed in advance. Thus, this method does not optimize the TPR for a prespecified FPR. Baker (2000) proposed a flexible nonparametric method for combining biomarkers by optimizing the ROC curve over a narrow target region of FPR values. However, this method is not well-suited to situations in which more than a few biomarkers are to be combined.

An important benefit of constructing linear biomarker combinations by targeting the performance measure of interest is that the performance of the combination will be at least as good as the performance of the individual biomarkers (Pepe, Cai, and Longton, 2006). Indeed, several authors have recommended matching the objective function to the performance measure, i.e., constructing biomarker combinations by optimizing the relevant measure of performance (Hwang et al., 2013; Liu, Schisterman, and Zhu, 2005; Wang and Chang, 2011; Ricamato and Tortorella, 2011). To that end, we propose a distribution-free method to construct biomarker combinations by maximizing the TPR for a given FPR.

Figure 1 illustrates the importance of targeting the measure of interest in constructing biomarker combinations. In this example, combinations of three biomarkers are constructed by (i) maximizing the logistic likelihood, (ii) maximizing the AUC via the optAUC package in R (i.e., the method of Huang, Qin, and Fang (2011)), and (iii) maximizing the TPR for an FPR of 20% using the proposed method. The ROC curves for the three combinations differ markedly near the prespecified FPR of 20%. In particular, the TPRs at an FPR of 20% for the three combinations are 18.0%, 24.0%, and 34.0% for maximum likelihood, AUC optimization, and maximization of the TPR for a given FPR, respectively. This example highlights the utility of methods that target the TPR for a specific FPR as opposed to methods that target other measures.

Refer to caption
Figure 1: Biomarker combinations obtained by targeting different measures. In this illustrative example, there is interest in the TPR for an FPR of 20%. Combinations of three biomarkers were constructed in training data (400 cases and 400 controls) and evaluated in a large test dataset (10,000 cases and 10,000 controls). Two of the biomarkers, X1X_{1} and X2X_{2}, were distributed as conditional bivariate lognormal random variables such that, among controls, E(log(X1))=1.1,E(log(X2))=1.1,Var(log(X1))=0.04,Var(log(X2))=0.5,Cov(log(X1),log(X2))=0.09E(\mbox{log}(X_{1}))=1.1,E(\mbox{log}(X_{2}))=1.1,Var(\mbox{log}(X_{1}))=0.04,Var(\mbox{log}(X_{2}))=0.5,Cov(\mbox{log}(X_{1}),\mbox{log}(X_{2}))=0.09 and, among cases, E(log(X1))=1,E(log(X2))=1,Var(log(X1))=0.05,Var(log(X2))=0.05,Cov(log(X1),log(X2))=0.015E(\mbox{log}(X_{1}))=1,E(\mbox{log}(X_{2}))=1,Var(\mbox{log}(X_{1}))=0.05,Var(\mbox{log}(X_{2}))=0.05,Cov(\mbox{log}(X_{1}),\mbox{log}(X_{2}))=0.015. The third biomarker, X3X_{3}, was distributed as an independent lognormal random variable such that, among both cases and controls, E(log(X3))=1.65,Var(log(X3))=4.66E(\mbox{log}(X_{3}))=1.65,Var(\mbox{log}(X_{3}))=4.66. The combinations were constructed by maximizing the TPR for an FPR of 20% (“Optimal TPR”; green), by maximizing the logistic likelihood (“Logistic Regression”; blue), and by maximizing the AUC (“Optimal AUC”; red). The gray line indicates the ROC curve for a useless marker (FPR and TPR are equal), the black dashed line denotes an FPR of 20%, and the dot-dashed lines indicate the TPRs for an FPR of 20%. This figure appears in color in the electronic version of this article.

3 Methodology

3.1 Description

Cases are denoted by the subscript 11 and controls are denoted by the subscript 0. Let X1i\textbf{X}_{1i} denote the vector of biomarkers for the ithi^{th} case, and let X0j\textbf{X}_{0j} denote the vector of biomarkers for the jthj^{th} control.

We propose constructing a linear biomarker combination of the form 𝜽X{\boldsymbol{\theta}}^{\top}\textbf{X} for a pp-dimensional X by maximizing the TPR when the FPR is below some prespecified, clinically acceptable value tt. We define the true and false positive rates for a given X as a function of 𝜽{\boldsymbol{\theta}} and δ\delta:

TPR(𝜽,δ)=P(𝜽X>δ|D=1),FPR(𝜽,δ)=P(𝜽X>δ|D=0).\mbox{TPR}({\boldsymbol{\theta}},\delta)=P({\boldsymbol{\theta}}^{\top}\textbf{X}>\delta|D=1),\;\;\;\;\mbox{FPR}({\boldsymbol{\theta}},\delta)=P({\boldsymbol{\theta}}^{\top}\textbf{X}>\delta|D=0).

Since the true and false positive rates for a given combination 𝜽{\boldsymbol{\theta}} and threshold δ\delta are invariant to scaling of the parameters (𝜽,δ)({\boldsymbol{\theta}},\delta), we must restrict (𝜽,δ)({\boldsymbol{\theta}},\delta) to ensure identifiability. Specifically, we constrain 𝜽=1||\boldsymbol{\theta}||=1 as in Fong, Yin, and Huang (2016). For any fixed t(0,1)t\in(0,1), we can consider

(𝜽t,δt)argmax(𝜽,δ)ΩtTPR(𝜽,δ),(\boldsymbol{\theta}_{t},\delta_{t})\in\operatorname*{arg\,max}_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta),

where Ωt={𝜽p,δ:𝜽=1,FPR(𝜽,δ)t}.\Omega_{t}=\{{\boldsymbol{\theta}}\in\mathbb{R}^{p},\delta\in\mathbb{R}:||{\boldsymbol{\theta}}||=1,\mbox{FPR}({\boldsymbol{\theta}},\delta)\leq t\}. This provides the optimal combination 𝜽t\boldsymbol{\theta}_{t} and threshold δt\delta_{t}. We define (𝜽t,δt)(\boldsymbol{\theta}_{t},\delta_{t}) to be an element of argmax(𝜽,δ)ΩtTPR(𝜽,δ)\operatorname*{arg\,max}_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta), where argmax(𝜽,δ)ΩtTPR(𝜽,δ)\operatorname*{arg\,max}_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta) may be a set.

Of course, in practice, the true and false positive rates are unknown, so 𝜽t\boldsymbol{\theta}_{t} and δt\delta_{t} cannot be computed. We can replace these unknowns by their empirical estimates,

TPR^n1(𝜽,δ)=1n1i=1n11(𝜽X1i>δ),FPR^n0(𝜽,δ)=1n0j=1n01(𝜽X0j>δ),\hat{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}1({\boldsymbol{\theta}}^{\top}\textbf{X}_{1i}>\delta),\;\;\hat{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)=\frac{1}{n_{0}}\sum_{j=1}^{n_{0}}1({\boldsymbol{\theta}}^{\top}\textbf{X}_{0j}>\delta),

where n1n_{1} is the number of cases and n0n_{0} is the number of controls, giving the total sample size n=n1+n0n=n_{1}+n_{0}. We can then define

(𝜽^t,δ^t)argmax(𝜽,δ)Ω^t,n0TPR^n1(𝜽,δ)(\hat{{\boldsymbol{\theta}}}_{t},\hat{\delta}_{t})\in\operatorname*{arg\,max}_{(\boldsymbol{\theta},\delta)\in\hat{\Omega}_{t,n_{0}}}\hat{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)

where Ω^t,n0={𝜽p,δ:𝜽=1,FPR^n0(𝜽,δ)t}.\hat{\Omega}_{t,n_{0}}=\{{\boldsymbol{\theta}}\in\mathbb{R}^{p},\delta\in\mathbb{R}:||{\boldsymbol{\theta}}||=1,\hat{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)\leq t\}. It is possible to conduct a grid search over (𝜽,δ)({\boldsymbol{\theta}},\delta) to perform this constrained optimization, though this becomes computationally demanding when combining more than two or three biomarkers.

Since the objective function involves indicator functions, it is not a smooth function of the parameters (𝜽,δ)({\boldsymbol{\theta}},\delta) and thus not amenable to derivative-based methods. However, smooth approximations to indicator functions have been used for AUC maximization (Ma and Huang, 2007; Fong, Yin, and Huang, 2016; Lin et al., 2011). One such smooth approximation is 1(w>0)Φ(w/h)1(w>0)\approx\Phi(w/h), where Φ\Phi is the standard normal distribution function, and hh is a tuning parameter representing the trade-off between approximation accuracy and estimation feasibility such that hh tends to zero as the sample size grows (Lin et al., 2011). We can use this smooth approximation to implement the method described above, writing the smooth approximations to the empirical true and false positive rates as

TPR~n1(𝜽,δ)=1n1i=1n1Φ(𝜽X1iδh),FPR~n0(𝜽,δ)=1n0j=1n0Φ(𝜽X0jδh).\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\Phi\left(\frac{{\boldsymbol{\theta}}^{\top}\textbf{X}_{1i}-\delta}{h}\right),\;\;\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)=\frac{1}{n_{0}}\sum_{j=1}^{n_{0}}\Phi\left(\frac{{\boldsymbol{\theta}}^{\top}\textbf{X}_{0j}-\delta}{h}\right).

Thus, we propose to compute

(𝜽~t,δ~t)argmax(𝜽,δ)Ω~t,n0TPR~n1(𝜽,δ),(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})\in\operatorname*{arg\,max}_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta), (1)

where Ω~t,n0={𝜽p,δ:𝜽=1,FPR~n0(𝜽,δ)t}.\tilde{\Omega}_{t,n_{0}}=\{{\boldsymbol{\theta}}\in\mathbb{R}^{p},\delta\in\mathbb{R}:||{\boldsymbol{\theta}}||=1,\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)\leq t\}. Since both TPR~n1\tilde{\mbox{TPR}}_{n_{1}} and FPR~n0\tilde{\mbox{FPR}}_{n_{0}} are smooth functions, we can use gradient-based methods that incorporate the necessary constraints, e.g., Lagrange multipliers. In particular, (𝜽~t,δ~t)(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t}) can be obtained using existing software for constrained optimization of smooth functions, such as the Rsolnp package in R. An R package including code for our method based on Rsolnp, maxTPR, is available on CRAN. Other details related to implementation, including the choice of tuning parameter hh, are discussed below.

3.2 Asymptotic Properties

We present a theorem establishing that, under certain conditions, the combination obtained by optimizing the smooth approximation to the empirical TPR while constraining the smooth approximation to the empirical FPR has desirable operating characteristics. In particular, its FPR is bounded almost surely by the acceptable level tt in large samples. In addition, its TPR converges almost surely to the supremum of the TPR over the set where the FPR is constrained. We focus on the operating characteristics of (𝜽~t,δ~t)(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t}) since these are of primary interest to clinicians.

Rather than enforcing (𝜽~t,δ~t)(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t}) to be a strict maximizer, in the theoretical study below we allow it to be a near-maximizer of TPR~n1(𝜽,δ)\tilde{\mbox{TPR}}_{n_{1}}(\boldsymbol{\theta},\delta) within Ω~t,n0\tilde{\Omega}_{t,n_{0}} in the sense that

TPR~n1(𝜽~t,δ~t)sup(𝜽,δ)Ω~t,n0TPR~n1(𝜽,δ)an,\tilde{\mbox{TPR}}_{n_{1}}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})\ \geq\ \sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)-a_{n}\ ,

where ana_{n} is a decreasing sequence of positive real numbers tending to zero. This provides some flexibility to accommodate situations in which a strict maximizer either does not exist or is numerically difficult to identify.

Before stating our key theorem, we give the following conditions.

  1. (1)

    Observations are randomly sampled conditional on disease status DD, and the group sizes tend to infinity proportionally, in the sense that n=n1+n0n=n_{1}+n_{0}\rightarrow\infty and n1/n0ρ(0,1)n_{1}/n_{0}\rightarrow\rho\in(0,1).

  2. (2)

    For each d{0,1}d\in\{0,1\}, observations Xdi\textbf{X}_{di}, i=1,2,,ndi=1,2,\ldots,n_{d}, are independent and identically distributed pp-dimensional random vectors with distribution function FdF_{d}.

  3. (3)

    For each d{0,1}d\in\{0,1\}, no proper linear subspace SpS\subset\mathbb{R}^{p} is such that P(XSD=d)=1.P(\textbf{X}\in S\mid D=d)=1.

  4. (4)

    For each d{0,1}d\in\{0,1\}, the distribution and quantile functions of 𝜽X{\boldsymbol{\theta}}^{\top}\textbf{X} given D=dD=d are globally Lipschitz continuous uniformly over 𝜽p{\boldsymbol{\theta}}\in\mathbb{R}^{p} such that 𝜽=1\|{\boldsymbol{\theta}}\|=1.

  5. (5)

    The map (𝜽,δ)TPR(𝜽,δ)(\boldsymbol{\theta},\delta)\mapsto\mbox{TPR}({\boldsymbol{\theta}},\delta) is globally Lipschitz continuous over Ω={𝜽p,δ:𝜽=1}\Omega=\{{\boldsymbol{\theta}}\in\mathbb{R}^{p},\delta\in\mathbb{R}:||{\boldsymbol{\theta}}||=1\}.

Theorem 1.

Under conditions (1)–(5), for every fixed t(0,1)t\in(0,1), we have that

  • (a)

    lim supnFPR(𝜽~t,δ~t)t\limsup_{n}\mbox{FPR}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})\leq t almost surely; and

  • (b)

    |TPR(𝜽~t,δ~t)sup(𝜽,δ)ΩtTPR(𝜽,δ)||\mbox{TPR}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})-\sup_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta)| tends to zero almost surely.

The proof of Theorem 1 is given in Appendix B. The proof relies on two lemmas, also in Appendix B. Lemma 1 demonstrates almost sure convergence to zero of the difference between the supremum of a function over a fixed set and the supremum of the function over a stochastic set that converges to the fixed set in an appropriate sense. Lemma 2 establishes the almost sure uniform convergence to zero of the difference between the FPR and the smooth approximation to the empirical FPR and the difference between the TPR and the smooth approximation to the empirical TPR. The proof of Theorem 1 demonstrates that Lemma 1 holds for the relevant function and sets, relying in part on the conclusions of Lemma 2. The conclusions of Lemmas 1 and 2 then demonstrate the claims of Theorem 1.

3.3 Implementation Details

Certain considerations must be addressed to implement the proposed method, including the choice of tuning parameter hh and starting values (𝜽~,δ~)(\tilde{{\boldsymbol{\theta}}},\tilde{\delta}) for the optimization routine. In using similar methods to maximize the AUC, Lin et al. (2011) proposed using h=σ~n1/3h=\tilde{\sigma}n^{-1/3}, where σ~\tilde{\sigma} is the sample standard error of 𝜽~X\tilde{{\boldsymbol{\theta}}}^{\top}\textbf{X}. In simulations, we considered both h=σ~n1/3h=\tilde{\sigma}n^{-1/3} and h=σ~n1/2h=\tilde{\sigma}n^{-1/2} and found similar behavior for the convergence of the optimization routine. Thus, we use h=σ~n1/2h=\tilde{\sigma}n^{-1/2}. We must also identify initial values (𝜽~,δ~)(\tilde{{\boldsymbol{\theta}}},\tilde{\delta}) for our procedure. As done in Fong, Yin, and Huang (2016), we use normalized estimates from robust logistic regression, which is described in greater detail below. Based on this initial value 𝜽~\tilde{{\boldsymbol{\theta}}}, we choose δ~\tilde{\delta} such that FPR~n0(𝜽~,δ~)=t\tilde{\mbox{FPR}}_{n_{0}}(\tilde{{\boldsymbol{\theta}}},\tilde{\delta})=t. In addition, we have also found that when FPR~n0\tilde{\mbox{FPR}}_{n_{0}} is bounded by tt, the performance of the optimization routine can be poor. Thus, we introduce another tuning parameter, α\alpha, which allows for a small amount of relaxation in the constraint on the smooth approximation to the empirical FPR, imposing instead FPR~n0(𝜽,δ)t+α.\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)\leq t+\alpha. Since the effective sample size for the smooth approximation to the empirical FPR is n0n_{0}, we chose to scale α\alpha with n0n_{0}, and have found α=1/(2n0)\alpha=1/(2n_{0}) to work well.

Our method involves computing the gradient of the smooth approximations to the true and false positive rates defined above, which is fast regardless of the number of biomarkers involved. This is in contrast with methods that rely on brute force (e.g., grid search), which typically become computationally infeasible for combinations of more than two or three biomarkers. However, we note that for any method, the risk of overfitting is expected to grow as the number of biomarkers increases relative to the sample size. We emphasize that our method does not impose constraints on the distribution of the biomarkers that can be included, except for weak conditions that allow us to establish its large-sample properties.

4 Simulations

4.1 Bivariate Normal Biomarkers with Contamination

First, we considered bivariate normal biomarkers with contamination, similar to a scenario described by Croux and Haesbroeck (2003). In particular, we considered a setting where two biomarkers (X1,X2)(X_{1},X_{2}) were independently normally distributed with mean zero and variance one. DD was then defined as D=1(2X1+2X2+ζ>0),D=1(2X_{1}+2X_{2}+\zeta>0), where ζ\zeta was distributed as a logistic random variable with location parameter zero and scale parameter one. Next, the sample was contaminated by a set of points with D=0,X1=6,D=0,X_{1}=6, and X2=6X_{2}=6. We consider simulations where the training set consisted of 800 or 1600 “typical” observations and 50 or 100, respectively, contaminating observations (this yielded a disease prevalence of approximately 47%). The test set consisted of 10610^{6} “typical” observations and 62,500 contaminating observations. The maximum acceptable FPR, tt, was 0.2 or 0.3. We performed 1000 simulations.

We considered five approaches: (1) logistic regression, (2) the robust logistic regression method proposed by Bianco and Yohai (1996), (3) grid search, (4) the method proposed by Su and Liu (1993) and (5) the proposed method. As discussed above, the method proposed by Su and Liu (1993) yields a combination with maximum AUC when the biomarkers have a conditionally multivariate normal distribution. We did not consider the optimal AUC method proposed by Huang, Qin, and Fang (2011) as the implementation provided in R is too slow for use in simulations (and, as illustrated in Figure 1, may not yield a combination with optimal TPR). While the methods recently proposed by Yan et al. (2018) to optimize the partial AUC are compelling and may yield a combination with high TPR at the specified FPR value, implementation of their method, particularly the nonparametric kernel-based method, is non-trivial, and so is not included here. Finally, the method of Liu, Schisterman, and Zhu (2005), discussed above, may also yield a combination with high TPR at a particular FPR. However, given the shortcomings of this method described above (namely, that the range of FPRs over which the combination is optimal cannot be fixed in advanced and the biomarkers are assumed to have a conditionally multivariate normal distribution), we do not include this as a comparison method. Above all, none of these methods specifically target the TPR for a specified FPR which, as indicated by Figure 1, may lead to combinations with reduced TPR at the specified FPR.

We focused on evaluating the operating characteristics of the fitted combination rather than the biomarker coefficients as the former is typically of primary interest. In particular, we evaluated the TPR in the test data for FPR = tt in the test data. In other words, for each combination, the threshold used to calculate the TPR in the test data was chosen such that the FPR in the test data was equal to tt. Evaluating the TPR in this way puts the combinations on equal footing in terms of the FPR, and so allows a fair comparison of the TPR. We evaluated the FPR of the fitted combinations in the test data using the thresholds estimated in the training data, i.e., the (1t)(1-t)th quantile of the fitted biomarker combination among controls in the training data. While we could have used the estimate of δt\delta_{t} provided by our method in the evaluation, we found improved performance (that is, better control of the FPR) when re-estimating the threshold based on the fitted combination in the training data.

Table 1 summarizes the results. For both sample sizes and FPR thresholds, all methods adequately controlled the FPR, while for the TPR, the proposed method outperformed logistic regression, robust logistic regression, and the method of Su and Liu (1993). Furthermore, the results from the proposed method were comparable to those from the grid search, which may be regarded as a performance reference but is infeasible for more than two or three biomarkers.

Table 1: Mean TPR and FPR and corresponding standard deviation (in parentheses) in the test data across 1000 simulations for contaminated data with two biomarkers. The TPR is based on the threshold corresponding to an FPR of tt in the test data whereas the FPR is based on the thresholds estimated in the training data. nn, size of the training dataset; tt, acceptable FPR; TPR, true positive rate; FPR, false positive rate; GLM, standard logistic regression; rGLM, robust logistic regression; sTPR, proposed method. All numbers are percentages.
nn Measure Method
GLM rGLM Grid Search Su & Liu sTPR
t=t= 0.20
800 TPR 52.7 (15.9) 55.8 (15.1) 72.5 (0.9) 53.6 (15.6) 72.0 (4.5)
FPR 20.4 (1.4) 20.4 (1.4) 20.6 (1.3) 20.4 (1.4) 20.4 (1.3)
1600 TPR 57.5 (13.2) 60.0 (12.1) 72.7 (0.5) 58.3 (12.8) 72.8 (0.5)
FPR 20.3 (1.0) 20.3 (1.0) 20.4 (1.0) 20.3 (1.0) 20.3 (1.0)
t=t= 0.30
800 TPR 68.5 (14.3) 71.4 (13.6) 86.0 (0.5) 69.3 (13.9) 86.0 (1.2)
FPR 30.6 (1.8) 30.6 (1.8) 30.6 (1.8) 30.6 (1.8) 30.4 (1.8)
1600 TPR 74.0 (11.5) 76.1 (10.3) 86.1 (0.3) 74.7 (11.1) 86.1 (0.3)
FPR 30.4 (1.3) 30.3 (1.3) 30.4 (1.3) 30.4 (1.3) 30.2 (1.3)

4.2 Conditionally Multivariate Lognormal Biomarkers

We also considered simulations with conditionally multivariate lognormal biomarkers (Mishra, 2019). In particular, we considered three biomarkers (X1,X2,X3)(X_{1},X_{2},X_{3}). Among controls, log(X1)\mbox{log}(X_{1}) and log(X2)\mbox{log}(X_{2}) had a bivariate normal distribution with E(log(X1))=1.1,E(log(X2))=1.1,Var(log(X1))=0.04,Var(log(X2))=0.5E(\mbox{log}(X_{1}))=1.1,E(\mbox{log}(X_{2}))=1.1,Var(\mbox{log}(X_{1}))=0.04,Var(\mbox{log}(X_{2}))=0.5 and Cov(log(X1),log(X2))=0.09Cov(\mbox{log}(X_{1}),\mbox{log}(X_{2}))=0.09. Among cases, log(X1)\mbox{log}(X_{1}) and log(X2)\mbox{log}(X_{2}) had a bivariate normal distribution with E(log(X1))=1,E(log(X2))=1,Var(log(X1))=0.05,Var(log(X2))=0.05E(\mbox{log}(X_{1}))=1,E(\mbox{log}(X_{2}))=1,Var(\mbox{log}(X_{1}))=0.05,Var(\mbox{log}(X_{2}))=0.05 and Cov(log(X1),log(X2))=0.015Cov(\mbox{log}(X_{1}),\mbox{log}(X_{2}))=0.015. The third biomarker, X3X_{3} was simulated from an independent lognormal distribution with E(log(X3))=1.65E(\mbox{log}(X_{3}))=1.65 and Var(log(X3))=4.66Var(\mbox{log}(X_{3}))=4.66 among both cases and controls. Given the performance of the method of Su and Liu (1993) and the performance of the proposed method relative to grid search observed above (and the computational challenges of implementing grid search for three biomarkers), we considered three methods here: (1) logistic regression, (2) robust logistic regression, and (3) the proposed method. Although neither logistic regression nor robust logistic regression performed particularly well in the simulations in Section 4.1, these methods represent the most commonly used approach for constructing biomarker combinations and the method used to provide starting values for the proposed method, respectively. Thus, it was important to include them here.

The maximum acceptable FPR, tt, was 0.2 and 1000 simulations were performed. The training data consisted of either 400 cases and 400 controls, or 800 cases and 800 controls. The test data consisted of 10610^{6} observations. The TPR and FPR were evaluated as described above. We present the results in Table 2. All three methods did well in controlling the FPR at the specified value. Furthermore, the proposed method substantially outperformed logistic regression and robust logistic regression: the mean TPR based on the proposed method was at least 20% larger than the mean TPRs from logistic regression and robust logistic regression.

Table 2: Mean TPR and FPR and corresponding standard deviation (in parentheses) in the test data across 1000 simulations for three conditionally lognormal biomarkers and t=0.30t=0.30. The TPR is based on the threshold corresponding to an FPR of tt in the test data whereas the FPR is based on the thresholds estimated in the training data. nn, size of the training dataset; tt, acceptable FPR; TPR, true positive rate; FPR, false positive rate; GLM, standard logistic regression; rGLM, robust logistic regression; sTPR, proposed method. All numbers are percentages.
nn Measure Method
GLM rGLM sTPR
800 TPR 34.1 (6.0) 34.1 (6.0) 41.5 (5.7)
FPR 30.3 (2.3) 30.3 (2.3) 31.2 (2.4)
1600 TPR 34.7 (4.2) 34.7 (4.2) 41.9 (4.9)
FPR 30.2 (1.6) 30.2 (1.6) 30.7 (1.7)

4.3 Bivariate Normal Biomarkers and Bivariate Normal Mixture Biomarkers

The above simulations demonstrate superiority of our approach relative to alternative methods in particular scenarios. We conducted further simulations to demonstrate the feasibility of our approach in other settings (for instance, small sample size, small and large prevalence, and low FPR cutoffs) relative to logistic regression and robust logistic regression.

We considered simulations with and without outliers in the data-generating distribution, and simulated data under a model similar to that used by Fong, Yin, and Huang (2016). We considered two biomarkers X1X_{1} and X2X_{2} constructed as

(X1X2)=(1Δ)×Z0+Δ×Z1,\left(\begin{array}[]{c}X_{1}\\ X_{2}\end{array}\right)=(1-\Delta)\times Z_{0}+\Delta\times Z_{1},

where Δ\Delta was a Bernoulli random variable with success probability π=0.05\pi=0.05 when outliers were simulated and π=0\pi=0 otherwise, and Z0Z_{0} and Z1Z_{1} were independent bivariate normal random variables with mean zero and respective covariance matrices

0.2×(10.90.91),  2×(1001).0.2\times\left(\begin{array}[]{cc}1&0.9\\ 0.9&1\end{array}\right),\,\,2\times\left(\begin{array}[]{cc}1&0\\ 0&1\end{array}\right).

DD was then simulated as a Bernoulli random variable with success probability f{β0+4X13X2f\left\{\beta_{0}+4X_{1}-3X_{2}\right. 0.8(X1X2)3}\left.-0.8(X_{1}-X_{2})^{3}\right\}. We considered two ff functions: f1(v)=expit(v)=ev/(1+ev)f_{1}(v)=\mbox{expit}(v)=e^{v}/(1+e^{v}) and a piecewise logistic function,

f2(v)=1(v<0)×11+ev/3+1(v0)×11+e3v.f_{2}(v)=1(v<0)\times\frac{1}{1+e^{-v/3}}+1(v\geq 0)\times\frac{1}{1+e^{-3v}}.

We varied β0\beta_{0} to reflect varying prevalences, with a prevalence of approximately 50–60% for β0=\beta_{0}= 0, 16–18% for β0<\beta_{0}< 0, and 77–82% for β0>\beta_{0}> 0. We considered t=t= 0.05, 0.1, and 0.2. A plot illustrating the data-generating distribution with f=f1f=f_{1} and β0=\beta_{0}= 0, with and without outliers, is given in Appendix D.

The training data consisted of 200, 400, or 800 observations while the test set included 10610^{6} observations. The TPR and FPR were evaluated as described above. The results are presented in Appendix C. When no outliers were present, the proposed method was comparable to logistic regression and robust logistic regression in terms of both the TPR and FPR. In the presence of outliers, robust logistic regression tended to provide combinations with higher TPRs than did logistic regression, and the TPRs of the combinations provided by the proposed method tended to be comparable to or somewhat better than the results from robust logistic regression. In all scenarios, all three methods controlled the FPR, particularly as sample size increased. In addition to demonstrating feasibility of our approach, these simulations highlight the fact that logistic regression is relatively robust to violations of the linear-logistic model (e.g., nonlinear biomarker combinations and deviations from the logit link).

4.4 Convergence

In most simulation settings, convergence of the proposed method was achieved in more than 96%96\% of simulations. For some of the more extreme outlier scenarios considered in Section 4.3, convergence failed in up to 7.3%7.3\% of simulations.

5 Application to Diabetes Data

We applied the method we have developed to a study of diabetes in women with Pima Indian heritage (Smith et al., 1988). We considered seven predictors measured in this study: number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skin fold thickness, body mass index, age, and diabetes pedigree function (a measure of family history of diabetes (Smith et al., 1988)). We used 332 observations as training data and reserved the remaining 200 observations for testing. The training and test datasets had 109 and 68 diabetes cases, respectively. We scaled the variables to have equal variance. The distribution of predictors is depicted in Appendix E. The combinations were fitted using the training data and evaluated using the test data. We fixed the acceptable FPR at t=0.10t=0.10. We used logistic regression, robust logistic regression, and the proposed method to construct the combinations, giving the results in Table 3, where the fitted combinations from logistic regression and robust logistic regression have been normalized to aid in comparison.

Table 3: Fitted combinations of the scaled predictors in the diabetes study. GLM, standard logistic regression; rGLM, robust logistic regression; sTPR, proposed method with t=t= 0.10.
Predictor GLM rGLM sTPR
Number of pregnancies 0.321 0.320 0.403
Plasma glucose 0.793 0.792 0.627
Blood pressure -0.077 -0.073 -0.026
Skin fold thickness 0.089 0.090 -0.146
Body mass index 0.399 0.400 0.609
Diabetes pedigree 0.280 0.281 0.191
Age 0.133 0.134 0.123

Using thresholds based on an FPR of 10% in the test data, the estimated TPR in the test data was 54.4% for both logistic regression and robust logistic regression, and 55.9% for the proposed method. The estimated FPR in the test data using thresholds corresponding to an FPR of 10% in the training data was 18.2% for both logistic regression and robust logistic regression and 26.5% for the proposed method. The fact that these FPRs exceeded the target value for all three methods indicates potentially important differences in the controls between the training and test data.

6 Discussion

We have proposed a distribution-free method for constructing linear biomarker combinations by maximizing a smooth approximation to the TPR while constraining a smooth approximation to the FPR. Ours is the first distribution-free approach targeting the TPR for a specified FPR that can be used with more than two or three biomarkers. While we do not expect our method to outperform every other approach in every dataset, we have demonstrated broad feasibility of our method and, importantly, we have identified scenarios where the performance of our method is superior to alternative approaches.

The proposed method could be adapted to minimize the FPR while controlling the TPR to be above some acceptable level. Since the TPR and FPR condition on disease status, the proposed method can be used with case-control data. For case-control data matched on a covariate, however, it becomes necessary to consider the covariate-adjusted ROC curve and corresponding covariate-adjusted summaries, and thus the methods presented here are not immediately applicable (Janes and Pepe, 2008).

As our smooth approximation function is non-convex, the choice of starting values should be considered further. Extensions of convex methods, such as the ramp function method proposed by Fong, Yin, and Huang (2016) for the AUC, could also be considered. The idea of partitioning the search space, proposed by Yan et al. (2018), may also be useful. Further research could investigate methods for evaluating the true and false positive rates of biomarker combinations after estimation, for example, sample-splitting, bootstrapping, or kk-fold cross-validation.

7 Software

An R package containing code to implement the proposed method, maxTPR, is publicly available via CRAN.

Funding

This work was supported by the National Institutes of Health [F31 DK108356, R01 CA152089, and R01 HL085757]; and the University of Washington Department of Biostatistics Career Development Fund [to M.C.]. The opinions, results, and conclusions reported in this article are those of the authors and are independent of the funding sources.

Appendix A

Proposition 1.

If the biomarkers (X1,X2)(X_{1},X_{2}) are conditionally multivariate normal with proportional covariance matrices given DD, that is,

(X1,X2D=0)N(μ0,Σ),(X1,X2D=1)N(μ1,σ2Σ),(X_{1},X_{2}\mid D=0)\sim N({\mu}_{0},\Sigma),\;\;(X_{1},X_{2}\mid D=1)\sim N({\mu}_{1},\sigma^{2}\Sigma),

then the optimal biomarker combination in the sense of the ROC curve is of the form

β0+β1X1+β2X2+β3X1X2+β4X12+β5X22\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{1}X_{2}+\beta_{4}X_{1}^{2}+\beta_{5}X_{2}^{2}

for some vector (β0,β1,β2,β3,β4,β5)5(\beta_{0},\beta_{1},\beta_{2},\beta_{3},\beta_{4},\beta_{5})\in\mathbb{R}^{5}.

Proof.

It is known that the optimal combination of (X1,X2)(X_{1},X_{2}) in terms of the ROC curve is the likelihood ratio, f(X1,X2D=1)/f(X1,X2D=0)f(X_{1},X_{2}\mid D=1)/f(X_{1},X_{2}\mid D=0), or any monotone increasing function thereof (McIntosh and Pepe, 2002). Let M=(X1,X2)\textbf{M}=(X_{1},X_{2}). Without loss of generality, let μ0=0{\mu}_{0}=0 and μ1=μ=(μX1,μX2){\mu}_{1}={\mu}=(\mu_{X_{1}},\mu_{X_{2}}). Then

f(MD=1)f(MD=0)\displaystyle\frac{f(M\mid D=1)}{f(M\mid D=0)} =|σ2Σ|1/2exp{12(Mμ)(σ2Σ)1(Mμ)}|Σ|1/2exp{12MΣ1M}\displaystyle=\frac{|\sigma^{2}\Sigma|^{-1/2}\mbox{exp}\left\{-\frac{1}{2}(M-{\mu})^{\top}(\sigma^{2}\Sigma)^{-1}(M-{\mu})\right\}}{|\Sigma|^{-1/2}\mbox{exp}\left\{-\frac{1}{2}M^{\top}\Sigma^{-1}M\right\}}
=exp{12(Mμ)(σ2Σ)1(Mμ)}σ2exp{12MΣ1M}\displaystyle=\frac{\mbox{exp}\left\{-\frac{1}{2}(M-{\mu})^{\top}(\sigma^{2}\Sigma)^{-1}(M-{\mu})\right\}}{\sigma^{2}\mbox{exp}\left\{-\frac{1}{2}M^{\top}\Sigma^{-1}M\right\}}
=1σ2exp{(Mμ)Σ1(Mμ)2σ2+MΣ1M2}.\displaystyle=\frac{1}{\sigma^{2}}\mbox{exp}\left\{-\frac{(M-{\mu})^{\top}\Sigma^{-1}(M-{\mu})}{2\sigma^{2}}+\frac{M^{\top}\Sigma^{-1}M}{2}\right\}.

Denote the entries of Σ1\Sigma^{-1} by

Σ1=(S11S12S21S22).\Sigma^{-1}=\left(\begin{array}[]{cc}S_{11}&S_{12}\\ S_{21}&S_{22}\end{array}\right).

Then, we can write that

12σ2(Mμ)Σ1(Mμ)+12MΣ1M\displaystyle-\frac{1}{2\sigma^{2}}(M-{\mu})^{\top}\Sigma^{-1}(M-{\mu})+\frac{1}{2}M^{\top}\Sigma^{-1}M
=12[1σ2{S11(X122X1μX1+μX12)S21(X1X2X2μX1X1μX2+μX1μX2)\displaystyle\hskip 36.135pt=\ \frac{1}{2}\left[\frac{1}{\sigma^{2}}\left\{-S_{11}(X_{1}^{2}-2X_{1}\mu_{X_{1}}+\mu_{X_{1}}^{2})-S_{21}(X_{1}X_{2}-X_{2}\mu_{X_{1}}-X_{1}\mu_{X_{2}}+\mu_{X_{1}}\mu_{X_{2}})\right.\right.
S12(X1X2X1μX2X2μX1+μX1μX2)S22(X222X2μX2+μX22)}\displaystyle\hskip 36.135pt\hskip 21.68121pt\left.\left.-S_{12}(X_{1}X_{2}-X_{1}\mu_{X_{2}}-X_{2}\mu_{X_{1}}+\mu_{X_{1}}\mu_{X_{2}})-S_{22}(X_{2}^{2}-2X_{2}\mu_{X_{2}}+\mu_{X_{2}}^{2})\right\}\right.
+S11X12+S21X1X2+S12X1X2+S22X22]\displaystyle\hskip 36.135pt\hskip 21.68121pt+S_{11}X_{1}^{2}+S_{21}X_{1}X_{2}+S_{12}X_{1}X_{2}+S_{22}X_{2}^{2}\bigg{]}
=12{(S11S11σ2)X12+(S22S22σ2)X22+(S12+S21S12σ2S21σ2)X1X2\displaystyle\hskip 36.135pt=\ \frac{1}{2}\left\{\left(S_{11}-\frac{S_{11}}{\sigma^{2}}\right)X_{1}^{2}+\left(S_{22}-\frac{S_{22}}{\sigma^{2}}\right)X_{2}^{2}+\left(S_{12}+S_{21}-\frac{S_{12}}{\sigma^{2}}-\frac{S_{21}}{\sigma^{2}}\right)X_{1}X_{2}\right.
+(2S11μX1+S21μX2+S12μX2σ2)X1+(S21μX1+S12μX1+2S22μX2σ2)X2\displaystyle\hskip 36.135pt\hskip 21.68121pt\left.+\left(\frac{2S_{11}\mu_{X_{1}}+S_{21}\mu_{X_{2}}+S_{12}\mu_{X_{2}}}{\sigma^{2}}\right)X_{1}+\left(\frac{S_{21}\mu_{X_{1}}+S_{12}\mu_{X_{1}}+2S_{22}\mu_{X_{2}}}{\sigma^{2}}\right)X_{2}\right.
+S11μX12S21μX1μX2S12μX1μX2S22μX22σ2}\displaystyle\hskip 36.135pt\hskip 21.68121pt\left.+\frac{-S_{11}\mu_{X_{1}}^{2}-S_{21}\mu_{X_{1}}\mu_{X_{2}}-S_{12}\mu_{X_{1}}\mu_{X_{2}}-S_{22}\mu_{X_{2}}^{2}}{\sigma^{2}}\right\}
=β0+β1X1+β2X2+β3X1X2+β4X12+β5X22\displaystyle\hskip 36.135pt=\ \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{1}X_{2}+\beta_{4}X_{1}^{2}+\beta_{5}X_{2}^{2}

as claimed, where

β0\displaystyle\beta_{0} =S11μX12S21μX1μX2S12μX1μX2S22μX22σ2\displaystyle\ =\ \frac{-S_{11}\mu_{X_{1}}^{2}-S_{21}\mu_{X_{1}}\mu_{X_{2}}-S_{12}\mu_{X_{1}}\mu_{X_{2}}-S_{22}\mu_{X_{2}}^{2}}{\sigma^{2}}
β1\displaystyle\beta_{1} =(2S11μX1+S21μX2+S12μX2σ2)\displaystyle\ =\ \left(\frac{2S_{11}\mu_{X_{1}}+S_{21}\mu_{X_{2}}+S_{12}\mu_{X_{2}}}{\sigma^{2}}\right)
β2\displaystyle\beta_{2} =(S21μX1+S12μX1+2S22μX2σ2)\displaystyle\ =\ \left(\frac{S_{21}\mu_{X_{1}}+S_{12}\mu_{X_{1}}+2S_{22}\mu_{X_{2}}}{\sigma^{2}}\right)
β3\displaystyle\beta_{3} =(S12+S21S12σ2S21σ2)\displaystyle\ =\ \left(S_{12}+S_{21}-\frac{S_{12}}{\sigma^{2}}-\frac{S_{21}}{\sigma^{2}}\right)
β4\displaystyle\beta_{4} =(S11S11σ2)\displaystyle\ =\ \left(S_{11}-\frac{S_{11}}{\sigma^{2}}\right)
β5\displaystyle\beta_{5} =(S22S22σ2).\displaystyle\ =\ \left(S_{22}-\frac{S_{22}}{\sigma^{2}}\right).

Appendix B

The proof of Theorem 1 relies on Lemmas 1 and 2, which are stated and proved below.

Lemma 1.

Say that a bounded function f:df:\mathbb{R}^{d}\rightarrow\mathbb{R} and possibly random sets Ω0,Ω1,Ω2,d\Omega_{0},\Omega_{1},\Omega_{2},\ldots\subseteq\mathbb{R}^{d} are given, and let {an}n1\{a_{n}\}_{n\geq 1} be a decreasing sequence of positive real numbers tending to zero. For each n1n\geq 1, suppose that ω0,nΩ0\omega_{0,n}\in\Omega_{0} and ωnΩn\omega_{n}\in\Omega_{n} are near-maximizers of ff over Ω0\Omega_{0} and Ωn\Omega_{n}, respectively, in the sense that f(ω0,n)supωΩ0f(ω)anf(\omega_{0,n})\geq\sup_{\omega\in\Omega_{0}}f(\omega)-a_{n} and f(ωn)supωΩnf(ω)anf(\omega_{n})\geq\sup_{\omega\in\Omega_{n}}f(\omega)-a_{n}. Further, define

dn=supωΩninfω~Ω0d(ω,ω~),en=supωΩ0infω~Ωnd(ω,ω~),d_{n}=\adjustlimits{\sup}_{\omega\in\Omega_{n}}{\inf}_{\tilde{\omega}\in\Omega_{0}}d(\omega,\tilde{\omega}),\;\;e_{n}=\adjustlimits{\sup}_{\omega\in\Omega_{0}}{\inf}_{\tilde{\omega}\in\Omega_{n}}d(\omega,\tilde{\omega}),

where dd is the Euclidean distance in d\mathbb{R}^{d}. If dnd_{n} and ene_{n} tend to zero almost surely, and ff is globally Lipschitz continuous, then |f(ω0,n)f(ωn)||f(\omega_{0,n})-f(\omega_{n})| tends to zero almost surely. In particular, this implies that

|supωΩ0f(ω)supωΩnf(ω)|0\bigg{|}\sup_{\omega\in\Omega_{0}}f(\omega)-\sup_{\omega\in\Omega_{n}}f(\omega)\bigg{|}\longrightarrow 0

almost surely.

Proof.

Say that both dnd_{n} and ene_{n} tend to zero almost surely, and denote by K>0K>0 the Lipschitz constant of ff. Suppose that for some ϵ>0\epsilon>0 we have that

P{lim supn|f(ωn)f(ω0,n)|>ϵ}>0.P\bigg{\{}\limsup_{n}|f(\omega_{n})-f(\omega_{0,n})|>\epsilon\bigg{\}}>0\ .

We will show that this leads to a contradiction, and thus that it must be true that

P{lim supn|f(ωn)f(ω0,n)|>ϵ}=0P\{\limsup_{n}|f(\omega_{n})-f(\omega_{0,n})|>\epsilon\}=0 for each ϵ>0\epsilon>0, thus establishing the desired result.

On a set of probability one, there exists an nϵ1n_{\epsilon}\geq 1 such that, for each nnϵn\geq n_{\epsilon}, there exists ωnΩ0\omega^{*}_{n}\in\Omega_{0} and ω0,nΩn\omega^{*}_{0,n}\in\Omega_{n} satisfying d(ωn,ωn)<ϵ/(2K)d(\omega^{*}_{n},\omega_{n})<\epsilon/(2K) and d(ω0,n,ω0,n)<ϵ/(2K)d(\omega^{*}_{0,n},\omega_{0,n})<\epsilon/(2K). Then, on this same set, for nnϵn\geq n_{\epsilon}, |f(ωn)f(ωn)|ϵ/2|f(\omega^{*}_{n})-f(\omega_{n})|\leq\epsilon/2 and |f(ω0,n)f(ω0,n)|ϵ/2|f(\omega^{*}_{0,n})-f(\omega_{0,n})|\leq\epsilon/2, so that f(ω0,n)f(ω0,n)+ϵ/2f(\omega_{0,n})\leq f(\omega^{*}_{0,n})+\epsilon/2 and f(ωn)f(ωn)+ϵ/2f(\omega_{n})\leq f(\omega^{*}_{n})+\epsilon/2 in particular. Since ω0,nΩn\omega^{*}_{0,n}\in\Omega_{n} and ωnΩ0\omega_{n}^{*}\in\Omega_{0}, it must also be true that f(ω0,n)f(ωn)+anf(\omega^{*}_{0,n})\leq f(\omega_{n})+a_{n} and f(ωn)f(ω0,n)+anf(\omega^{*}_{n})\leq f(\omega_{0,n})+a_{n}. This then implies that |f(ω0,n)f(ωn)|ϵ/2+an|f(\omega_{0,n})-f(\omega_{n})|\leq\epsilon/2+a_{n} for all nnϵn\geq n_{\epsilon} on a set of probability one. Since ana_{n} tends to zero deterministically, this yields the sought contradiction.

To establish the last portion of the Lemma, we simply use the first part along with the fact that

|supωΩnf(ω)supωΩ0f(ω)||f(ω0,n)f(ωn)|+2an.\bigg{|}\sup_{\omega\in\Omega_{n}}f(\omega)-\sup_{\omega\in\Omega_{0}}f(\omega)\bigg{|}\ \leq\ |f(\omega_{0,n})-f(\omega_{n})|+2a_{n}\ .

Lemma 2.

Under conditions (1)–(5), we have that

sup(𝜽,δ)Ω|FPR~n0(𝜽,δ)FPR(𝜽,δ)|0,sup(𝜽,δ)Ω|TPR~n1(𝜽,δ)TPR(𝜽,δ)|0\sup_{({\boldsymbol{\theta}},\delta)\in\Omega}\left|\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)-\mbox{FPR}({\boldsymbol{\theta}},\delta)\right|\longrightarrow 0,\,\,\sup_{({\boldsymbol{\theta}},\delta)\in\Omega}\left|\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)-\mbox{TPR}({\boldsymbol{\theta}},\delta)\right|\longrightarrow 0

almost surely as nn tends to ++\infty, where Ω={(𝛉,δ)p×:𝛉=1}\Omega=\{{(\boldsymbol{\theta}},\delta)\in\mathbb{R}^{p}\times\mathbb{R}:||{\boldsymbol{\theta}}||=1\}.

Proof.

We prove the claim for the false positive rate (FPR); the proof for the true positive rate (TPR) is analogous. We can write

sup(𝜽,δ)Ω|FPR~n0(𝜽,δ)FPR(𝜽,δ)|\displaystyle\sup_{({\boldsymbol{\theta}},\delta)\in\Omega}\left|\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)-\mbox{FPR}({\boldsymbol{\theta}},\delta)\right| sup(𝜽,δ)Ω|FPR~n0(𝜽,δ)E{FPR~n0(𝜽,δ)}|\displaystyle\leq\ \sup_{({\boldsymbol{\theta}},\delta)\in\Omega}\left|\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)-E\{\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)\}\right|
+sup(𝜽,δ)Ω|E{FPR~n0(𝜽,δ)}FPR(𝜽,δ)|.\displaystyle\hskip 21.68121pt+\sup_{({\boldsymbol{\theta}},\delta)\in\Omega}\left|E\{\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)\}-\mbox{FPR}({\boldsymbol{\theta}},\delta)\right|.

First, we consider FPR~n0(𝜽,δ)E{FPR~n0(𝜽,δ)}\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)-E\{\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)\}. We can write this as

FPR~n0(𝜽,δ)E{FPR~n0(𝜽,δ)}=1n0j=1n0Φ(𝜽X0jδh)Φ(𝜽xδh)𝑑F0(x).\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)-E\{\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)\}=\frac{1}{n_{0}}\sum_{j=1}^{n_{0}}\Phi\left(\frac{{\boldsymbol{\theta}}^{\top}\textbf{X}_{0j}-\delta}{h}\right)-\int\Phi\left(\frac{{\boldsymbol{\theta}}^{\top}\textbf{x}-\delta}{h}\right)dF_{0}(\textbf{x})\ .

The class of functions 𝒢1={(𝜽,δ)𝜽xδ:𝜽p,δ,xp}\mathcal{G}_{1}=\{({\boldsymbol{\theta}},\delta)\mapsto{\boldsymbol{\theta}}^{\top}\textbf{x}-\delta:{\boldsymbol{\theta}}\in\mathbb{R}^{p},\delta\in\mathbb{R},\textbf{x}\in\mathbb{R}^{p}\} is a Vapnik–Chervonenkis (VC) class. Since uΦ(u/h)u\mapsto\Phi(u/h) is monotone for each h>0h>0, the class of functions 𝒢2={(𝜽,δ)Φ{(𝜽xδ)/h}:𝜽p,δ,xp,h>0}\mathcal{G}_{2}=\{({\boldsymbol{\theta}},\delta)\mapsto\Phi\{({\boldsymbol{\theta}}^{\top}\textbf{x}-\delta)/h\}:{\boldsymbol{\theta}}\in\mathbb{R}^{p},\delta\in\mathbb{R},\textbf{x}\in\mathbb{R}^{p},h>0\} is also VC (Kosorok, 2008; van der Vaart, 1998; van der Vaart and Wellner, 2000). Since the constant 1 is an applicable envelope function for this class, 𝒢2\mathcal{G}_{2} is F0F_{0}–Glivenko-Cantelli, giving that (Kosorok, 2008; van der Vaart and Wellner, 2000)

sup(𝜽,δ)Ω|FPR~n0(𝜽,δ)E{FPR~n0(𝜽,δ)}|0\sup_{({\boldsymbol{\theta}},\delta)\in\Omega}\left|\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)-E\{\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)\}\right|\longrightarrow 0

almost surely.

Next, we consider E{FPR~n0(𝜽,δ)}FPR(𝜽,δ)E\{\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)\}-\mbox{FPR}({\boldsymbol{\theta}},\delta). We can write this as

E{FPR~n0(𝜽,δ)}FPR(𝜽,δ)=Φ(𝜽xδh)𝑑F0(x)P(𝜽X>δD=0).E\{\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)\}-\mbox{FPR}({\boldsymbol{\theta}},\delta)=\int\Phi\left(\frac{{\boldsymbol{\theta}}^{\top}\textbf{x}-\delta}{h}\right)dF_{0}(\textbf{x})-P({\boldsymbol{\theta}}^{\top}\textbf{X}>\delta\mid D=0).

For a general random variable VV with distribution function FF that is Lipschitz continuous, say with constant M>0M>0, we can write

E{Φ(sVh)}=Φ(svh)𝑑F(v)=hΦ(u)f(shu)𝑑uE\left\{\Phi\left(\frac{s-V}{h}\right)\right\}=\int\Phi\left(\frac{s-v}{h}\right)dF(v)=h\int\Phi(u)f(s-hu)du

with u=(sv)/hu=(s-v)/h. Using integration by parts and Lemma 2.1 from Winter (1979), this becomes

hΦ(u)f(shu)𝑑u=ϕ(u)F(shu)𝑑u,h\int\Phi(u)f(s-hu)du=\int\phi(u)F(s-hu)du\ ,

and so, we find that

|E{Φ(sVh)}F(s)|\displaystyle\left|E\left\{\Phi\left(\frac{s-V}{h}\right)\right\}-F(s)\right| =|ϕ(u)F(shu)𝑑uF(s)|\displaystyle=\ \left|\int\phi(u)F(s-hu)du-F(s)\right|
|F(shu)F(s)|ϕ(u)𝑑u\displaystyle\leq\ \int\left|F(s-hu)-F(s)\right|\phi(u)du
M|hu|ϕ(u)𝑑u=Mh(2π)1/2.\displaystyle\leq\ M\int|hu|\phi(u)du=Mh\left(\frac{2}{\pi}\right)^{1/2}.

Since hh tends to zero as nn tends to infinity, this implies that

sups|E{Φ(sVh)}F(s)|=o(1).\sup_{s}\left|E\left\{\Phi\left(\frac{s-V}{h}\right)\right\}-F(s)\right|=o(1)\ .

We now return to 𝜽X{\boldsymbol{\theta}}^{\top}\textbf{X} and consider the case p=2p=2, so that 𝜽X=θ1X1+θ2X2.{\boldsymbol{\theta}}^{\top}\textbf{X}=\theta_{1}X_{1}+\theta_{2}X_{2}. Let Y1=θ1X1+θ2X2Y_{1}=\theta_{1}X_{1}+\theta_{2}X_{2} and Y2=θ2X2Y_{2}=\theta_{2}X_{2}. Then, we have that fY1,Y2(y1,y2)=fX1,X2(x1,x2)|θ1θ2|1f_{Y_{1},Y_{2}}(y_{1},y_{2})=f_{X_{1},X_{2}}(x_{1},x_{2})|\theta_{1}\theta_{2}|^{-1}, where x1=x1(y1,y2)=(y1y2)/θ1x_{1}=x_{1}(y_{1},y_{2})=(y_{1}-y_{2})/\theta_{1} and x2=x2(y1,y2)=y2/θ2x_{2}=x_{2}(y_{1},y_{2})=y_{2}/\theta_{2}. We find that

Φ(s𝜽xh)𝑑FX(x)=Φ(sy1h)𝑑FY(y)=Φ(sy1h)𝑑FY1(y1)\int\Phi\left(\frac{s-{\boldsymbol{\theta}}^{\top}\textbf{x}}{h}\right)dF_{\textbf{X}}(\textbf{x})=\int\Phi\left(\frac{s-y_{1}}{h}\right)dF_{Y}(y)=\int\Phi\left(\frac{s-y_{1}}{h}\right)dF_{Y_{1}}(y_{1})

for any ss\in\mathbb{R}. Since P(𝜽XδD=0)=P(Y1δD=0)P({\boldsymbol{\theta}}^{\top}\textbf{X}\leq\delta\mid D=0)=P(Y_{1}\leq\delta\mid D=0), we can write

sup(𝜽,δ)Ω|Φ(𝜽xδh)dF0(x)P(𝜽X>δD=0)|\displaystyle\sup_{({\boldsymbol{\theta}},\delta)\in\Omega}\left|\int\Phi\left(\frac{{\boldsymbol{\theta}}^{\top}\textbf{x}-\delta}{h}\right)dF_{0}(\textbf{x})-P({\boldsymbol{\theta}}^{\top}\textbf{X}>\delta\mid D=0)\right|
=supδ|Φ(y1δh)dFY1|D=0(y1)P(Y1>δD=0)|\displaystyle\hskip 14.45377pt=\ \sup_{\delta\in\mathbb{R}}\left|\int\Phi\left(\frac{y_{1}-\delta}{h}\right)dF_{Y_{1}|D=0}(y_{1})-P(Y_{1}>\delta\mid D=0)\right|
=supδ|Φ(δy1h)dFY1|D=0(y1)P(Y1δD=0)|,\displaystyle\hskip 14.45377pt=\ \sup_{\delta\in\mathbb{R}}\left|\int\Phi\left(\frac{\delta-y_{1}}{h}\right)dF_{Y_{1}|D=0}(y_{1})-P(Y_{1}\leq\delta\mid D=0)\right|\ ,

implying, in view of condition (4) and the results above, that

sup(𝜽,δ)Ω|Φ(𝜽xδh)dF0(x)P(𝜽X>δD=0)|=o(1).\sup_{({\boldsymbol{\theta}},\delta)\in\Omega}\left|\int\Phi\left(\frac{{\boldsymbol{\theta}}^{\top}\textbf{x}-\delta}{h}\right)dF_{0}(\textbf{x})-P({\boldsymbol{\theta}}^{\top}\textbf{X}>\delta\mid D=0)\right|=o(1)\ .

The result for p>2p>2 can be proved analogously.

Combining these results, we conclude that sup(𝜽,δ)Ω|FPR~n0(𝜽,δ)FPR(𝜽,δ)|\sup_{({\boldsymbol{\theta}},\delta)\in\Omega}|\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)-\mbox{FPR}({\boldsymbol{\theta}},\delta)| tends to zero almost surely, as claimed. ∎

Proof of Theorem 1.

First, we show that lim supnFPR(𝜽~t,δ~t)t\limsup_{n}\mbox{FPR}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})\leq t almost surely. We can write

FPR(𝜽~t,δ~t)\displaystyle\mbox{FPR}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})\ =FPR~n0(𝜽~t,δ~t)+{FPR(𝜽~t,δ~t)FPR~n0(𝜽~t,δ~t)}\displaystyle=\ \tilde{\mbox{FPR}}_{n_{0}}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})+\{\mbox{FPR}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})-\tilde{\mbox{FPR}}_{n_{0}}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})\}
FPR~n0(𝜽~t,δ~t)+|FPR(𝜽~t,δ~t)FPR~n0(𝜽~t,δ~t)|\displaystyle\hskip-21.68121pt\leq\ \tilde{\mbox{FPR}}_{n_{0}}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})+|\mbox{FPR}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})-\tilde{\mbox{FPR}}_{n_{0}}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})|
FPR~n0(𝜽~t,δ~t)+sup(𝜽,δ)Ω|FPR(𝜽,δ)FPR~n0(𝜽,δ)|t+sup(𝜽,δ)Ω|FPR(𝜽,δ)FPR~n0(𝜽,δ)|.\displaystyle\hskip-21.68121pt\leq\ \tilde{\mbox{FPR}}_{n_{0}}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})+\sup_{({\boldsymbol{\theta}},\delta)\in\Omega}|\mbox{FPR}({\boldsymbol{\theta}},\delta)-\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)|\ \leq\ t+\sup_{({\boldsymbol{\theta}},\delta)\in\Omega}|\mbox{FPR}({\boldsymbol{\theta}},\delta)-\tilde{\mbox{FPR}}_{n_{0}}({\boldsymbol{\theta}},\delta)|\ .

As such, it follows that

P{lim supnFPR(𝜽~t,δ~t)t}P{lim supnsup(𝜽,δ)Ω|FPR(𝜽,δ)FPR~n0(𝜽,δ)|=0}= 1P\{\limsup_{n}\mbox{FPR}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})\leq t\}\ \geq\ P\{\limsup_{n}\sup_{(\boldsymbol{\theta},\delta)\in\Omega}|\mbox{FPR}(\boldsymbol{\theta},\delta)-\tilde{\mbox{FPR}}_{n_{0}}(\boldsymbol{\theta},\delta)|=0\}\ =\ 1

in view of Lemma 2, thereby establishing the first part of the theorem.

Let t(0,1)t\in(0,1) be fixed. We now establish that

|TPR(𝜽~t,δ~t)sup(𝜽,δ)ΩtTPR(𝜽,δ)|0\bigg{|}\mbox{TPR}(\tilde{{\boldsymbol{\theta}}}_{t},\tilde{\delta}_{t})-\sup_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta)\bigg{|}\longrightarrow 0

almost surely. For convenience, denote (𝜽,δ)(\boldsymbol{\theta},\delta) by ω\omega. Consider the function ff defined pointwise as f(ω)=TPR(𝜽,δ)f(\omega)=\mbox{TPR}(\boldsymbol{\theta},\delta), and set Ω0=Ωt\Omega_{0}=\Omega_{t} and Ωn=Ω~t,n0\Omega_{n}=\tilde{\Omega}_{t,n_{0}} for each n1n\geq 1. We verify that the conditions of Lemma 1 hold for these particular choices. We have that f(ω)=TPR(𝜽,δ)f(\omega)=\mbox{TPR}(\boldsymbol{\theta},\delta) is a bounded function. We must show dn0d_{n_{0}} and en0e_{n_{0}} tend to zero almost surely, where

dn0=supωΩ~t,n0infω~Ωtd(ω,ω~),en0=supωΩtinfω~Ω~t,n0d(ω,ω~),d_{n_{0}}=\adjustlimits{\sup}_{\omega\in\tilde{\Omega}_{t,n_{0}}}{\inf}_{\tilde{\omega}\in\Omega_{t}}d(\omega,\tilde{\omega}),\;\;\;\;e_{n_{0}}=\adjustlimits{\sup}_{\omega\in\Omega_{t}}{\inf}_{\tilde{\omega}\in\tilde{\Omega}_{t,n_{0}}}d(\omega,\tilde{\omega}),

and dd is the Euclidean distance in p+1\mathbb{R}^{p+1}. We consider dn0d_{n_{0}} first. Denote by G𝜽G_{\boldsymbol{\theta}} the conditional distribution function of 𝜽X\boldsymbol{\theta}^{\top}\textbf{X} given D=0D=0. By assumption, the corresponding conditional quantile function, denoted by G𝜽1G_{\boldsymbol{\theta}}^{-1}, is uniformly Lipschitz continuous over {𝜽p:𝜽=1}\{\boldsymbol{\theta}\in\mathbb{R}^{p}:\|\boldsymbol{\theta}\|=1\}, say with constant C>0C>0 independent of 𝜽\boldsymbol{\theta}. Suppose that, for some κ>0\kappa>0, supωΩ~t,n0|FPR~n0(ω)FPR(ω)|κ\sup_{\omega\in\tilde{\Omega}_{t,n_{0}}}|\tilde{\mbox{FPR}}_{n_{0}}(\omega)-\mbox{FPR}(\omega)|\leq\kappa. Because it is true that

κsupωΩ~t,n0|FPR~n0(ω)FPR(ω)||supωΩ~t,n0FPR~n0(ω)supωΩ~t,n0FPR(ω)|,\kappa\ \geq\ \sup_{\omega\in\tilde{\Omega}_{t,n_{0}}}|\tilde{\mbox{FPR}}_{n_{0}}(\omega)-\mbox{FPR}(\omega)|\ \geq\ \bigg{|}\sup_{\omega\in\tilde{\Omega}_{t,n_{0}}}\tilde{\mbox{FPR}}_{n_{0}}(\omega)-\sup_{\omega\in\tilde{\Omega}_{t,n_{0}}}\mbox{FPR}(\omega)\bigg{|}\ ,

then supωΩ~t,n0FPR(ω)κ+t\sup_{\omega\in\tilde{\Omega}_{t,n_{0}}}\mbox{FPR}(\omega)\leq\kappa+t, giving FPR~n0(ω)t\tilde{\mbox{FPR}}_{n_{0}}(\omega)\leq t and FPR(ω)κ+t\mbox{FPR}(\omega)\leq\kappa+t for each ωΩ~t,n0\omega\in\tilde{\Omega}_{t,n_{0}}.

For any given ω=(𝜽,δ)Ω~t,n0\omega=({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}, write t(ω)=G𝜽(δ)t_{*}(\omega)=G_{\boldsymbol{\theta}}(\delta), giving t(ω)=FPR(ω)κ+tt_{*}(\omega)=\mbox{FPR}(\omega)\leq\kappa+t. If t(ω)tt_{*}(\omega)\leq t, note also that ωΩt\omega\in\Omega_{t} and set ω=ω\omega_{*}=\omega. Otherwise, find δ\delta_{*} such that 1G𝜽(δ)=t1-G_{\boldsymbol{\theta}}(\delta_{*})=t, namely by taking δ=G𝜽1(1t)\delta_{*}=G_{\boldsymbol{\theta}}^{-1}(1-t). Defining ω=(𝜽,δ)Ωt\omega_{*}=({\boldsymbol{\theta}},\delta_{*})\in\Omega_{t}, observe that

d(ω,ω)=|δδ|=|G𝜽1(1t(ω))G𝜽1(1t)|C|tt(ω)|Cκ.d(\omega,\omega_{*})\ =\ |\delta-\delta_{*}|\ =\ |G_{\boldsymbol{\theta}}^{-1}(1-t_{*}(\omega))-G_{\boldsymbol{\theta}}^{-1}(1-t)|\ \leq\ C|t-t_{*}(\omega)|\ \leq\ C\kappa\ .

Thus, for each ωΩ~t,n0\omega\in\tilde{\Omega}_{t,n_{0}}, it is true that infω~Ωtd(ω,ω~)Cκ\inf_{\tilde{\omega}\in\Omega_{t}}d(\omega,\tilde{\omega})\leq C\kappa and therefore dn0Cκd_{n_{0}}\leq C\kappa. As such, if dn0>ϵd_{n_{0}}>\epsilon for some ϵ>0\epsilon>0, then supωΩ~t,n0|FPR~n0(ω)FPR(ω)|>κϵ\sup_{\omega\in\tilde{\Omega}_{t,n_{0}}}|\tilde{\mbox{FPR}}_{n_{0}}(\omega)-\mbox{FPR}(\omega)|>\kappa_{\epsilon} for κϵ=ϵ/C\kappa_{\epsilon}=\epsilon/C. This implies that

P(supmn0dm>ϵ)P(supmn0supωΩ~t,m|FPR~m(ω)FPR(ω)|>κϵ) 0P\bigg{(}\sup_{m\geq n_{0}}d_{m}>\epsilon\bigg{)}\ \leq\ P\bigg{(}\sup_{m\geq n_{0}}\sup_{\omega\in\tilde{\Omega}_{t,m}}|\tilde{\mbox{FPR}}_{m}(\omega)-\mbox{FPR}(\omega)|>\kappa_{\epsilon}\bigg{)}\ \longrightarrow\ 0

by Lemma 2. Thus, dnd_{n} tends to zero almost surely since, for each ϵ>0\epsilon>0,

P(lim supm{dmϵ})P(lim supmdmϵ)= 0.P\bigg{(}\limsup_{m}\{d_{m}\geq\epsilon\}\bigg{)}\ \leq\ P\bigg{(}\limsup_{m}d_{m}\geq\epsilon\bigg{)}\ =\ 0\ .

Using similar arguments, we may show that ene_{n} also tends to zero almost surely.

The fact that dnd_{n} and ene_{n} tend to zero almost surely implies, in view of Lemma 1, that we have that |sup(𝜽,δ)ΩtTPR(𝜽,δ)sup(𝜽,δ)Ω~t,n0TPR(𝜽,δ)||\sup_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta)-\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\mbox{TPR}({\boldsymbol{\theta}},\delta)| tends to zero almost surely. Combining this with an application of Lemma 2, we have that

|sup(𝜽,δ)ΩtTPR(𝜽,δ)sup(𝜽,δ)Ω~t,n0TPR~n1(𝜽,δ)|\displaystyle\bigg{|}\sup_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta)\,-\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)\bigg{|}
|sup(𝜽,δ)ΩtTPR(𝜽,δ)sup(𝜽,δ)Ω~t,n0TPR(𝜽,δ)|+|sup(𝜽,δ)Ω~t,n0TPR(𝜽,δ)sup(𝜽,δ)Ω~t,n0TPR~n1(𝜽,δ)|\displaystyle\hskip 21.68121pt\leq\ \bigg{|}\sup_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta)-\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\mbox{TPR}({\boldsymbol{\theta}},\delta)\bigg{|}+\bigg{|}\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\mbox{TPR}({\boldsymbol{\theta}},\delta)-\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)\bigg{|}
|sup(𝜽,δ)ΩtTPR(𝜽,δ)sup(𝜽,δ)Ω~t,n0TPR(𝜽,δ)|+sup(𝜽,δ)Ω~t,n0|TPR(𝜽,δ)TPR~n1(𝜽,δ)| 0\displaystyle\hskip 21.68121pt\leq\ \bigg{|}\sup_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta)-\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\mbox{TPR}({\boldsymbol{\theta}},\delta)\bigg{|}+\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}|\mbox{TPR}({\boldsymbol{\theta}},\delta)-\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)|\ \longrightarrow\ 0

almost surely. Since |TPR(𝜽~t,δ~t)TPR~n1(𝜽~t,δ~t)|sup(𝜽,δ)Ω|TPR(𝜽,δ)TPR~n1(𝜽,δ)||\mbox{TPR}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})-\tilde{\mbox{TPR}}_{n_{1}}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})|\leq\sup_{({\boldsymbol{\theta}},\delta)\in\Omega}|\mbox{TPR}({\boldsymbol{\theta}},\delta)-\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)| and, by Lemma 2, sup(𝜽,δ)Ω|TPR(𝜽,δ)TPR~n1(𝜽,δ)|\sup_{({\boldsymbol{\theta}},\delta)\in\Omega}|\mbox{TPR}({\boldsymbol{\theta}},\delta)-\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)| tends to zero almost surely, |TPR(𝜽~t,δ~t)TPR~n1(𝜽~t,δ~t)||\mbox{TPR}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})-\tilde{\mbox{TPR}}_{n_{1}}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})| tends to zero almost surely. In addition, since (𝜽~t,δ~t)(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t}) is a near-maximizer of TPR~n1\tilde{\mbox{TPR}}_{n_{1}}, sup(𝜽,δ)Ω~t,n0TPR~n1(𝜽,δ)TPR~n1(𝜽~t,δ~t)+an\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)\leq\tilde{\mbox{TPR}}_{n_{1}}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})+a_{n}, giving

|sup(𝜽,δ)ΩtTPR(𝜽,δ)TPR(𝜽~t,δ~t)|\displaystyle\bigg{|}\sup_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta)-\mbox{TPR}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})\bigg{|}
|sup(𝜽,δ)ΩtTPR(𝜽,δ)sup(𝜽,δ)Ω~t,n0TPR~n1(𝜽,δ)|+|sup(𝜽,δ)Ω~t,n0TPR~n1(𝜽,δ)TPR(𝜽~t,δ~t)|\displaystyle\hskip 21.68121pt\leq\ \bigg{|}\sup_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta)-\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)\bigg{|}+\bigg{|}\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)-\mbox{TPR}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})\bigg{|}
|sup(𝜽,δ)ΩtTPR(𝜽,δ)sup(𝜽,δ)Ω~t,n0TPR~n1(𝜽,δ)|+|sup(𝜽,δ)Ω~t,n0TPR~n1(𝜽,δ)TPR~n1(𝜽~t,δ~t)|\displaystyle\hskip 21.68121pt\leq\ \bigg{|}\sup_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta)-\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)\bigg{|}+\bigg{|}\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)-\tilde{\mbox{TPR}}_{n_{1}}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})\bigg{|}
+|TPR~n1(𝜽~t,δ~t)TPR(𝜽~t,δ~t)|\displaystyle\hskip 36.135pt+\bigg{|}\tilde{\mbox{TPR}}_{n_{1}}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})-\mbox{TPR}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})\bigg{|}
|sup(𝜽,δ)ΩtTPR(𝜽,δ)sup(𝜽,δ)Ω~t,n0TPR~n1(𝜽,δ)|+an+|TPR~n1(𝜽~t,δ~t)TPR(𝜽~t,δ~t)| 0\displaystyle\hskip 21.68121pt\leq\ \bigg{|}\sup_{({\boldsymbol{\theta}},\delta)\in\Omega_{t}}\mbox{TPR}({\boldsymbol{\theta}},\delta)-\sup_{({\boldsymbol{\theta}},\delta)\in\tilde{\Omega}_{t,n_{0}}}\tilde{\mbox{TPR}}_{n_{1}}({\boldsymbol{\theta}},\delta)\bigg{|}+a_{n}+\bigg{|}\tilde{\mbox{TPR}}_{n_{1}}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})-\mbox{TPR}(\tilde{\boldsymbol{\theta}}_{t},\tilde{\delta}_{t})\bigg{|}\ \longrightarrow 0

almost surely, completing the proof. ∎

Appendix C: Simulation results for data with outliers

Note: When simulating with outliers, the true biomarker combination was occasionally so large that it returned a non-value for the outcome DD; for example, with f1(v)=expit(v)f_{1}(v)=\mbox{expit}(v), this occurs in R when v>800v>800. These observations had to be removed from the simulated dataset, though this affected an extremely small fraction of observations.

Table 4: Mean TPR and FPR and corresponding standard deviation (in parentheses) for f(v)=f1(v)expit(v)=ev/(1+ev)f(v)=f_{1}(v)\equiv\mbox{expit}(v)=e^{v}/(1+e^{v}) and β0\beta_{0} = 0 across 1000 simulations. The TPR is based on the threshold corresponding to a FPR of tt in the test data whereas the FPR is based on the thresholds estimated in the training data. nn, size of the training dataset; tt, acceptable FPR; TPR, true positive rate; FPR, false positive rate; GLM, standard logistic regression; rGLM, robust logistic regression; sTPR, proposed method. All numbers are percentages.
Outliers nn Measure Method
GLM rGLM sTPR
t=t= 0.05
Yes 200 TPR 12.2 (2.1) 13.6 (2.6) 13.4 (2.7)
FPR 5.7 (2.2) 5.9 (2.3) 6.4 (2.4)
400 TPR 12.1 (1.7) 14.1 (2.3) 13.9 (2.4)
FPR 5.4 (1.6) 5.4 (1.6) 5.9 (1.7)
800 TPR 11.8 (1.2) 14.4 (2.2) 14.4 (2.3)
FPR 5.1 (1.1) 5.2 (1.1) 5.5 (1.2)
No 200 TPR 18.3 (0.6) 18.3 (0.6) 17.8 (1.8)
FPR 5.5 (2.2) 5.5 (2.2) 6.2 (2.4)
400 TPR 18.5 (0.3) 18.5 (0.3) 18.1 (1.6)
FPR 5.3 (1.5) 5.3 (1.5) 5.7 (1.6)
800 TPR 18.6 (0.2) 18.6 (0.2) 18.4 (1.2)
FPR 5.2 (1.1) 5.2 (1.1) 5.5 (1.2)
t=t= 0.10
Yes 200 TPR 22.5 (3.8) 24.6 (4.3) 24.6 (4.2)
FPR 10.9 (3.1) 11.1 (3.0) 11.7 (3.2)
400 TPR 21.8 (2.8) 25.1 (4.0) 25.2 (4.0)
FPR 10.4 (2.0) 10.5 (2.1) 11.0 (2.1)
800 TPR 21.4 (2.0) 25.7 (3.6) 25.8 (3.6)
FPR 10.1 (1.5) 10.1 (1.5) 10.5 (1.5)
No 200 TPR 29.4 (0.8) 29.5 (0.8) 28.9 (2.2)
FPR 10.5 (3.1) 10.5 (3.1) 11.4 (3.2)
400 TPR 29.8 (0.4) 29.8 (0.4) 29.5 (1.3)
FPR 10.4 (2.1) 10.4 (2.1) 10.9 (2.2)
800 TPR 29.9 (0.2) 29.9 (0.2) 29.7 (1.5)
FPR 10.2 (1.5) 10.2 (1.5) 10.6 (1.5)
t=t= 0.20
Yes 200 TPR 38.0 (5.1) 40.8 (5.8) 41.0 (5.7)
FPR 20.9 (4.0) 21.1 (4.0) 21.8 (4.0)
400 TPR 37.4 (3.9) 41.7 (5.3) 41.9 (5.2)
FPR 20.5 (2.8) 20.6 (2.9) 21.1 (2.9)
800 TPR 36.9 (2.9) 42.4 (4.6) 43.0 (4.4)
FPR 20.2 (2.0) 20.4 (2.0) 20.7 (2.0)
No 200 TPR 46.1 (0.9) 46.1 (0.9) 45.7 (1.5)
FPR 20.7 (4.1) 20.8 (4.1) 21.7 (4.2)
400 TPR 46.4 (0.5) 46.4 (0.5) 46.2 (0.8)
FPR 20.3 (2.8) 20.3 (2.8) 21.0 (2.8)
800 TPR 46.5 (0.2) 46.5 (0.3) 46.4 (0.6)
FPR 20.1 (2.0) 20.1 (2.0) 20.5 (2.0)
Table 5: Mean TPR and FPR and corresponding standard deviation (in parentheses) for f(v)=f2(v)1(v<0)×(1+ev/3)1+1(v0)×(1+e3v)1f(v)=f_{2}(v)\equiv 1(v<0)\times(1+e^{-v/3})^{-1}+1(v\geq 0)\times(1+e^{-3v})^{-1} and β0\beta_{0} = 0 across 1000 simulations. The TPR is based on the threshold corresponding to a FPR of tt in the test data whereas the FPR is based on the thresholds estimated in the training data. nn, size of the training dataset; tt, acceptable FPR; TPR, true positive rate; FPR, false positive rate; GLM, standard logistic regression; rGLM, robust logistic regression; sTPR, proposed method. All numbers are percentages.
Outliers nn Measure Method
GLM rGLM sTPR
t=t= 0.05
Yes 200 TPR 20.2 (7.3) 26.4 (9.1) 27.7 (9.2)
FPR 5.9 (2.6) 6.0 (2.6) 6.5 (2.8)
400 TPR 19.0 (5.9) 27.6 (8.5) 29.3 (8.2)
FPR 5.5 (1.8) 5.5 (1.7) 5.8 (1.8)
800 TPR 17.9 (4.1) 29.4 (7.5) 30.8 (7.3)
FPR 5.3 (1.3) 5.3 (1.2) 5.5 (1.3)
No 200 TPR 37.9 (1.7) 37.8 (1.9) 37.5 (3.1)
FPR 5.8 (2.7) 5.7 (2.7) 6.5 (2.9)
400 TPR 38.6 (0.9) 38.5 (1.0) 38.3 (2.1)
FPR 5.3 (1.8) 5.3 (1.8) 5.8 (1.8)
800 TPR 38.9 (0.4) 38.9 (0.5) 38.6 (2.2)
FPR 5.2 (1.3) 5.2 (1.3) 5.5 (1.3)
t=t= 0.10
Yes 200 TPR 31.1 (8.9) 37.4 (10.8) 39.3 (11.0)
FPR 11.0 (3.5) 11.3 (3.6) 12.0 (3.6)
400 TPR 30.3 (7.1) 39.9 (9.8) 41.5 (9.6)
FPR 10.5 (2.5) 10.7 (2.4) 11.0 (2.5)
800 TPR 28.9 (5.0) 41.1 (8.9) 43.1 (8.6)
FPR 10.1 (1.7) 10.3 (1.7) 10.6 (1.8)
No 200 TPR 48.2 (1.8) 48.0 (1.9) 48.2 (2.0)
FPR 10.9 (3.5) 10.9 (3.5) 11.7 (3.6)
400 TPR 48.8 (0.9) 48.7 (1.0) 48.7 (1.1)
FPR 10.4 (2.4) 10.4 (2.4) 10.9 (2.5)
800 TPR 49.2 (0.4) 49.1 (0.5) 49.0 (0.6)
FPR 10.2 (1.7) 10.2 (1.7) 10.7 (1.8)
t=t= 0.20
Yes 200 TPR 45.0 (8.1) 50.4 (9.8) 51.9 (9.7)
FPR 21.2 (4.6) 21.5 (4.7) 22.0 (4.8)
400 TPR 44.4 (6.3) 52.8 (8.6) 54.0 (8.5)
FPR 20.4 (3.2) 20.8 (3.3) 21.2 (3.4)
800 TPR 44.1 (4.8) 54.8 (7.3) 56.5 (6.6)
FPR 20.2 (2.3) 20.3 (2.3) 20.7 (2.3)
No 200 TPR 59.5 (1.3) 59.4 (1.4) 59.3 (1.8)
FPR 21.1 (4.6) 21.1 (4.6) 22.1 (4.7)
400 TPR 60.0 (0.6) 59.9 (0.7) 59.8 (0.9)
FPR 20.5 (3.4) 20.6 (3.4) 21.2 (3.4)
800 TPR 60.2 (0.4) 60.1 (0.4) 60.1 (0.5)
FPR 20.3 (2.2) 20.3 (2.2) 20.7 (2.3)
Table 6: Mean TPR and FPR and corresponding standard deviation (in parentheses) in the test data for f(v)=f1(v)expit(v)=ev/(1+ev)f(v)=f_{1}(v)\equiv\mbox{expit}(v)=e^{v}/(1+e^{v}) and β0\beta_{0} = -1.75 across 1000 simulations. The TPR is based on the threshold corresponding to a FPR of tt in the test data whereas the FPR is based on the thresholds estimated in the training data. nn, size of the training dataset; tt, acceptable FPR; TPR, true positive rate; FPR, false positive rate; GLM, standard logistic regression; rGLM, robust logistic regression; sTPR, proposed method. All numbers are percentages.
Outliers nn Measure Method
GLM rGLM sTPR
t=t= 0.05
Yes 200 TPR 13.0 (2.8) 13.4 (3.4) 13.5 (3.4)
FPR 5.3 (1.7) 5.4 (1.7) 5.7 (1.8)
400 TPR 12.7 (1.9) 13.4 (2.7) 13.6 (2.9)
FPR 5.2 (1.2) 5.2 (1.2) 5.4 (1.2)
800 TPR 12.5 (1.3) 13.2 (2.1) 13.6 (2.5)
FPR 5.1 (0.8) 5.2 (0.8) 5.2 (0.9)
No 200 TPR 18.1 (1.0) 18.1 (1.1) 17.5 (2.2)
FPR 5.5 (1.8) 5.5 (1.8) 5.9 (1.8)
400 TPR 18.5 (0.6) 18.5 (0.6) 18.2 (1.6)
FPR 5.1 (1.2) 5.2 (1.2) 5.4 (1.3)
800 TPR 18.7 (0.3) 18.7 (0.3) 18.5 (1.1)
FPR 5.1 (0.9) 5.1 (0.9) 5.3 (0.9)
t=t= 0.10
Yes 200 TPR 22.1 (4.5) 22.7 (5.3) 23.1 (5.3)
FPR 10.4 (2.4) 10.5 (2.4) 10.8 (2.4)
400 TPR 21.9 (3.6) 22.8 (4.7) 23.4 (4.8)
FPR 10.1 (1.7) 10.2 (1.7) 10.4 (1.8)
800 TPR 21.4 (2.3) 22.3 (3.4) 23.3 (4.3)
FPR 10.1 (1.2) 10.1 (1.2) 10.3 (1.2)
No 200 TPR 29.5 (1.3) 29.4 (1.3) 28.8 (2.5)
FPR 10.3 (2.3) 10.4 (2.3) 10.9 (2.3)
400 TPR 29.8 (0.7) 29.8 (0.7) 29.5 (1.5)
FPR 10.2 (1.7) 10.2 (1.7) 10.6 (1.7)
800 TPR 30.1 (0.4) 30.1 (0.4) 29.8 (1.1)
FPR 10.1 (1.1) 10.1 (1.1) 10.3 (1.1)
t=t= 0.20
Yes 200 TPR 36.4 (6.6) 37.2 (7.8) 38.1 (7.4)
FPR 20.5 (3.2) 20.6 (3.1) 21.0 (3.2)
400 TPR 36.2 (4.7) 37.3 (6.2) 38.5 (6.4)
FPR 20.1 (2.2) 20.2 (2.3) 20.4 (2.2)
800 TPR 35.7 (3.0) 37.0 (4.6) 38.8 (5.7)
FPR 20.2 (1.5) 20.2 (1.5) 20.4 (1.5)
No 200 TPR 46.1 (1.7) 46.1 (1.7) 45.5 (2.6)
FPR 20.4 (3.1) 20.5 (3.2) 21.0 (3.2)
400 TPR 46.7 (0.8) 46.7 (0.8) 46.4 (1.3)
FPR 20.1 (2.1) 20.2 (2.1) 20.5 (2.1)
800 TPR 47.0 (0.4) 47.0 (0.4) 46.8 (0.7)
FPR 20.0 (1.6) 20.0 (1.6) 20.2 (1.6)
Table 7: Mean TPR and FPR and corresponding standard deviation (in parentheses) in the test data for f(v)=f1(v)expit(v)=ev/(1+ev)f(v)=f_{1}(v)\equiv\mbox{expit}(v)=e^{v}/(1+e^{v}) and β0\beta_{0} = 1.75 across 1000 simulations. The TPR is based on the threshold corresponding to a FPR of tt in the test data whereas the FPR is based on the thresholds estimated in the training data. nn, size of the training dataset; tt, acceptable FPR; TPR, true positive rate; FPR, false positive rate; GLM, standard logistic regression; rGLM, robust logistic regression; sTPR, proposed method. All numbers are percentages.
Outliers nn Measure Method
GLM rGLM sTPR
t=t= 0.05
Yes 200 TPR 8.4 (1.2) 8.4 (1.4) 8.2 (1.8)
FPR 7.3 (4.0) 6.9 (3.9) 7.7 (4.6)
400 TPR 8.6 (0.9) 8.5 (1.1) 8.3 (1.6)
FPR 6.3 (2.7) 6.3 (2.7) 6.7 (2.9)
800 TPR 8.7 (0.6) 8.6 (0.7) 8.5 (1.5)
FPR 5.8 (1.8) 5.8 (1.8) 6.1 (2.0)
No 200 TPR 18.7 (1.0) 18.7 (1.0) 17.2 (3.5)
FPR 6.3 (4.1) 6.1 (4.0) 7.4 (4.5)
400 TPR 19.0 (0.5) 19.0 (0.6) 17.9 (2.9)
FPR 5.7 (2.7) 5.6 (2.7) 6.4 (3.0)
800 TPR 19.2 (0.3) 19.2 (0.3) 18.3 (2.9)
FPR 5.3 (1.9) 5.3 (1.9) 5.9 (2.0)
t=t= 0.10
Yes 200 TPR 18.6 (3.9) 19.1 (4.7) 19.4 (4.8)
FPR 12.4 (5.1) 12.4 (5.0) 13.5 (5.4)
400 TPR 18.6 (2.5) 19.2 (3.5) 19.8 (3.8)
FPR 11.1 (3.4) 11.1 (3.5) 12.0 (3.7)
800 TPR 18.4 (1.4) 19.2 (2.6) 19.8 (3.4)
FPR 10.8 (2.6) 10.8 (2.6) 11.3 (2.7)
No 200 TPR 29.9 (1.3) 29.9 (1.3) 28.7 (3.6)
FPR 11.7 (5.2) 11.5 (5.2) 13.1 (5.6)
400 TPR 30.4 (0.6) 30.3 (0.7) 29.4 (3.4)
FPR 10.7 (3.6) 10.6 (3.6) 11.7 (3.8)
800 TPR 30.6 (0.3) 30.6 (0.3) 30.2 (2.0)
FPR 10.4 (2.5) 10.4 (2.5) 11.1 (2.5)
t=t= 0.20
Yes 200 TPR 34.2 (6.4) 34.9 (7.7) 35.9 (7.1)
FPR 22.5 (6.5) 22.7 (6.3) 24.0 (6.7)
400 TPR 34.2 (4.3) 35.0 (5.6) 36.3 (5.9)
FPR 21.4 (4.7) 21.5 (4.7) 22.4 (4.8)
800 TPR 33.9 (2.8) 35.0 (4.4) 36.2 (5.0)
FPR 20.6 (3.3) 20.7 (3.3) 21.3 (3.4)
No 200 TPR 46.4 (1.6) 46.4 (1.6) 45.6 (3.3)
FPR 22.2 (7.0) 22.0 (7.0) 23.9 (7.1)
400 TPR 47.0 (0.8) 47.0 (0.8) 46.5 (2.2)
FPR 20.8 (5.0) 20.7 (4.9) 22.0 (5.0)
800 TPR 47.2 (0.4) 47.2 (0.4) 46.9 (1.9)
FPR 20.6 (3.4) 20.6 (3.4) 21.4 (3.5)
Table 8: Mean TPR and FPR and corresponding standard deviation (in parentheses) in the test data for f(v)=f2(v)1(v<0)×(1/(1+ev/3))+1(v0)×(1/(1+e3v))f(v)=f_{2}(v)\equiv 1(v<0)\times(1/(1+e^{-v/3}))+1(v\geq 0)\times(1/(1+e^{-3v})) and β0\beta_{0} = -5.25 across 1000 simulations. The TPR is based on the threshold corresponding to a FPR of tt in the test data whereas the FPR is based on the thresholds estimated in the training data. nn, size of the training dataset; tt, acceptable FPR; TPR, true positive rate; FPR, false positive rate; GLM, standard logistic regression; rGLM, robust logistic regression; sTPR, proposed method. All numbers are percentages.
Outliers nn Measure Method
GLM rGLM sTPR
t=t= 0.05
Yes 200 TPR 7.1 (1.1) 7.1 (1.1) 7.1 (1.1)
FPR 5.7 (1.8) 5.7 (1.8) 5.9 (1.9)
400 TPR 7.4 (1.0) 7.3 (0.9) 7.3 (1.0)
FPR 5.3 (1.2) 5.4 (1.2) 5.5 (1.2)
800 TPR 7.6 (0.8) 7.5 (0.8) 7.5 (0.9)
FPR 5.1 (0.8) 5.2 (0.8) 5.2 (0.9)
No 200 TPR 7.3 (1.4) 7.3 (1.4) 7.2 (1.4)
FPR 5.5 (1.7) 5.6 (1.7) 5.9 (1.8)
400 TPR 7.8 (0.9) 7.8 (1.0) 7.7 (1.1)
FPR 5.2 (1.2) 5.2 (1.1) 5.4 (1.2)
800 TPR 8.1 (0.4) 8.1 (0.4) 8.0 (0.7)
FPR 5.1 (0.9) 5.1 (0.9) 5.2 (0.9)
t=t= 0.10
Yes 200 TPR 12.4 (2.0) 12.3 (2.0) 12.4 (2.0)
FPR 10.6 (2.3) 10.6 (2.3) 10.9 (2.4)
400 TPR 12.6 (1.7) 12.4 (1.7) 12.6 (1.8)
FPR 10.4 (1.7) 10.4 (1.7) 10.6 (1.7)
800 TPR 12.8 (1.5) 12.5 (1.5) 12.7 (1.6)
FPR 10.2 (1.2) 10.2 (1.1) 10.3 (1.2)
No 200 TPR 13.9 (2.2) 13.9 (2.2) 13.6 (2.3)
FPR 10.7 (2.3) 10.8 (2.3) 11.2 (2.4)
400 TPR 14.5 (1.5) 14.5 (1.5) 14.4 (1.6)
FPR 10.2 (1.6) 10.2 (1.6) 10.5 (1.6)
800 TPR 15.0 (0.8) 15.0 (0.8) 14.9 (1.0)
FPR 10.2 (1.2) 10.2 (1.2) 10.4 (1.2)
t=t= 0.20
Yes 200 TPR 22.4 (3.6) 22.2 (3.7) 22.5 (3.7)
FPR 20.9 (3.1) 20.9 (3.2) 21.3 (3.2)
400 TPR 22.6 (3.3) 22.3 (3.3) 22.7 (3.3)
FPR 20.6 (2.2) 20.6 (2.2) 20.9 (2.2)
800 TPR 22.8 (2.7) 22.3 (2.8) 22.8 (2.8)
FPR 20.2 (1.5) 20.2 (1.6) 20.4 (1.6)
No 200 TPR 25.8 (3.5) 25.7 (3.5) 25.5 (3.6)
FPR 20.9 (3.1) 20.9 (3.1) 21.4 (3.1)
400 TPR 26.9 (2.3) 26.9 (2.3) 26.8 (2.3)
FPR 20.5 (2.1) 20.5 (2.1) 20.8 (2.1)
800 TPR 27.7 (1.1) 27.7 (1.1) 27.5 (1.3)
FPR 20.3 (1.6) 20.3 (1.6) 20.5 (1.6)
Table 9: Mean TPR and FPR and corresponding standard deviation (in parentheses) in the test data for f(v)=f2(v)1(v<0)×(1/(1+ev/3))+1(v0)×(1/(1+e3v))f(v)=f_{2}(v)\equiv 1(v<0)\times(1/(1+e^{-v/3}))+1(v\geq 0)\times(1/(1+e^{-3v})) and β0\beta_{0} = 0.6 across 1000 simulations. The TPR is based on the threshold corresponding to a FPR of tt in the test data whereas the FPR is based on the thresholds estimated in the training data. nn, size of the training dataset; tt, acceptable FPR; TPR, true positive rate; FPR, false positive rate; GLM, standard logistic regression; rGLM, robust logistic regression; sTPR, proposed method. All numbers are percentages.
Outliers nn Measure Method
GLM rGLM sTPR
t=t= 0.05
Yes 200 TPR 23.0 (8.6) 30.5 (10.9) 31.9 (10.8)
FPR 6.4 (3.3) 6.3 (3.4) 6.8 (3.7)
Yes 400 TPR 21.5 (6.9) 31.8 (10.5) 33.5 (10.1)
FPR 5.8 (2.3) 5.8 (2.4) 6.2 (2.6)
Yes 800 TPR 20.0 (4.4) 34.6 (9.2) 35.8 (8.5)
FPR 5.4 (1.6) 5.3 (1.6) 5.7 (1.7)
No 200 TPR 49.7 (1.5) 49.5 (1.7) 48.6 (4.4)
FPR 6.0 (3.5) 5.9 (3.5) 6.8 (3.7)
No 400 TPR 50.3 (0.7) 50.1 (0.8) 49.7 (2.3)
FPR 5.5 (2.5) 5.4 (2.5) 6.1 (2.6)
No 800 TPR 50.5 (0.4) 50.5 (0.5) 50.1 (2.0)
FPR 5.2 (1.6) 5.2 (1.6) 5.6 (1.7)
t=t= 0.10
Yes 200 TPR 37.3 (11.0) 45.7 (13.7) 48.4 (13.2)
FPR 11.5 (4.5) 11.6 (4.5) 12.4 (4.6)
Yes 400 TPR 35.2 (8.5) 47.5 (12.9) 50.6 (12.1)
FPR 10.8 (3.1) 10.9 (3.2) 11.4 (3.3)
Yes 800 TPR 34.5 (6.6) 51.3 (10.7) 53.6 (10.2)
FPR 10.4 (2.2) 10.4 (2.2) 10.8 (2.3)
No 200 TPR 61.3 (1.4) 61.1 (1.6) 60.7 (3.2)
FPR 10.9 (4.5) 10.9 (4.5) 12.1 (4.7)
No 400 TPR 61.8 (0.7) 61.6 (0.8) 61.4 (1.2)
FPR 10.6 (3.2) 10.6 (3.2) 11.4 (3.3)
No 800 TPR 62.0 (0.4) 62.0 (0.4) 61.8 (0.8)
FPR 10.3 (2.3) 10.3 (2.3) 10.9 (2.4)
t=t= 0.20
Yes 200 TPR 53.2 (10.6) 60.9 (13.0) 64.2 (12.3)
FPR 21.2 (5.9) 21.8 (6.0) 22.8 (6.0)
Yes 400 TPR 52.0 (8.5) 63.5 (11.8) 65.4 (11.3)
FPR 20.7 (4.1) 21.1 (4.2) 21.7 (4.1)
Yes 800 TPR 51.1 (6.0) 66.3 (9.7) 68.6 (8.2)
FPR 20.4 (3.0) 20.6 (3.0) 21.1 (3.0)
No 200 TPR 73.3 (1.1) 73.1 (1.3) 73.0 (1.5)
FPR 21.4 (6.4) 21.4 (6.4) 22.5 (6.3)
No 400 TPR 73.6 (0.6) 73.5 (0.7) 73.5 (0.8)
FPR 20.7 (4.4) 20.7 (4.4) 21.6 (4.4)
No 800 TPR 73.8 (0.3) 73.8 (0.4) 73.8 (0.4)
FPR 20.4 (3.0) 20.4 (3.0) 21.0 (3.0)

Appendix D: Illustration of data simulated with outliers

Refer to caption
Figure 2: Figure: Datasets with f(v)=f1(v)expit(v)f(v)=f_{1}(v)\equiv\mbox{expit}(v), β0=0\beta_{0}=0, without (left plot) and with outliers (right plot). Cases are represented by red circles, and controls are represented by blue triangles. The plot with outliers also includes an ellipse (dashed black line) indicating the 99% confidence region for the distribution of (X1,X2)(X_{1},X_{2}) without outliers.

Appendix E: Biomarker distribution in Pima Indian diabetes dataset

Refer to caption
Figure 3: Figure: Stratified distributions of the scaled predictors measured in the diabetes study for the observations in the training data. The predictors are number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skin fold thickness, body mass index, diabetes pedigree function, and age. The predictor values are shown on the x-axis of each plot. The red solid line represents the distribution among diabetes cases and the blue dotted line represents the distribution among controls.

References

  • Anderson and Bahadur (1962) Anderson, T. W. and Bahadur, R. R. (1962). Classification into two multivariate normal distributions with different covariance matrices. The Annals of Mathematical Statistics 89, 315–331.
  • Baker (2000) Baker, S. G. (2000). Identifying combinations of cancer markers for further study as triggers of early intervention. Biometrics 56, 1082–1087.
  • Bianco and Yohai (1996) Bianco, A. M. and Yohai, V. J. (1996). Robust Estimation in the Logistic Regression Model. In Robust Statistics, Data Analysis, and Computer Intensive Methods, pages 17–34. New York: Springer-Verlag.
  • Croux and Haesbroeck (2003) Croux, C. and Haesbroeck, G. (2003). Implementing the Bianco and Yohai estimatorfor logistic regression. Computational Statistics & Data Analysis 44, 273–295.
  • Fong, Yin, and Huang (2016) Fong, Y., Yin, S., and Huang, Y. (2016). Combining biomarkers linearly and nonlinearly for classification using the area under the ROC curve. Statistics in Medicine 35, 3792–3809.
  • Gao et al. (2008) Gao, F., Xiong, C., Yan, Y., Yu, K. and Zhang, Z. (2008). Estimating optimum linear combination of multiple correlated diagnostic tests at a fixed specificity with receiver operating characteristic curves. Journal of Data Science 6, 105–123.
  • Hsu and Hsueh (2013) Hsu, M.-J. and Hsueh, H.-M. (2013). The linear combinations of biomarkers which maximize the partial area under the ROC curves. ‎Computational Statistics 28, 647–666.
  • Huang, Qin, and Fang (2011) Huang, X., Qin G., and Fang, Y. (2011). Optimal combinations of diagnostic tests based on AUC. Biometrics 67, 568–576.
  • Hwang et al. (2013) Hwang, K.-B., Ha, B.-Y., Ju, S., and Kim, S. (2013). Partial AUC maximization for essential gene prediction using genetic algorithms. BMB Reports 46, 41–46.
  • Janes and Pepe (2008) Janes, H. and Pepe, M. S. (2008). Adjusting for covariates in studies of diagnostic, screening or prognostic markers: an old concept in a new setting. American Journal of Epidemiology 168, 89–97.
  • Komori and Eguchi (2010) Komori, O. and Eguchi, S. (2010). A boosting method for maximizing the partial area under the ROC curve. BMC Bioinformatics.
  • Kosorok (2008) Kosorok, M.R. (2008). Introduction to Empirical Processes and Semiparametric Inference, pp 155–178. Springer-Verlag New York.
  • Lin et al. (2011) Lin, H., Zhou, L., Peng, H., and Zhou, X.-H. (2011). Selection and combination of biomarkers using ROC method for disease classification and prediction. The Canadian Journal of Statistics 39, 324–343.
  • Liu, Schisterman, and Zhu (2005) Liu, A., Schisterman, E. F. and Zhu, Y. (2005). On linear combinations of biomarkers to improve diagnostic accuracy. Statistics in Medicine 24, 37–47.
  • Ma and Huang (2007) Ma, S. and Huang, J. (2007). Combining multiple markers for classification using ROC. Biometrics 63, 751–757.
  • McIntosh and Pepe (2002) McIntosh, M. W. and Pepe, M. S. (2002). Combining several screening tests: optimality of the risk score. Biometrics 58, 657–664.
  • Mishra (2019) Mishra, A. (2019). Methods for risk markers that incorporate clinical utility. [PhD Thesis], University of Washington.
  • Moore et al. (2008) Moore, R. G., Brown, A. K., Miller, M. C., Skates, S., Allard, W. J., Verch, T., Steinhoff M., Messerlian G., DiSilvestro P., Granai C, O., Bast R. C. Jr (2008). The use of multiple novel tumor biomarkers for the detection of ovarian carcinoma in patients with a pelvic mass. Gynecologic Oncology 108, 402–408.
  • Pepe (2003) Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction, pages 66–95. Oxford University Press.
  • Pepe et al. (2006) Pepe, M. S., Cai, T. and Longton, G. (2006). Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics 62, 221–229.
  • Ricamato and Tortorella (2011) Ricamato, M. T. and Tortorella, F. (2011). Partial AUC maximization in a linear combination of dichotomizers. Pattern Recognition 44, 2669–2677.
  • Smith et al. (1988) Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C. and Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Proceedings of the Symposium on Computer Applications and Medical Care, 261–265.
  • Su and Liu (1993) Su, J. Q. and Liu, J. S. (1993). Linear combinations of multiple diagnostic markers. Journal of the American Statistical Association 88, 1350–1355.
  • van der Vaart (1998) van der Vaart, A. W. (1998). Asymptotic Statistics, pp 265–290. Cambridge University Press.
  • van der Vaart and Wellner (2000) van der Vaart, A. W. and Wellner, J. A. (2000). Weak Convergence and Empirical Processes, pp 166–168. Springer Series in Statistics.
  • Wang and Chang (2011) Wang, Z. and Chang, Y.-C. I. (2011). Marker selection via maximizing the partial area under the ROC curve of linear risk scores. Biostatistics 12, 369–385.
  • Winter (1979) Winter, B. B. (1979). Convergence rate of perturbed empirical distribution functions. J. Appl. Probab. 16, 163–173.
  • Yan et al. (2018) Yan, Q., Bantis, L.-E., Stanford, J. L., and Feng, Z. (2013). Combining multiple biomarkers linearly to maximize the partial area under the ROC curve. Statistics in Medicine 37, 627–642.
  • Yu and Park (2015) Yu, W. and Park, T. (2015). Two simple algorithms on linear combination of multiple biomarkers to maximize partial area under the ROC curve. Computational Statistics & Data Analysis 88, 15–27.