Identification of prognostic and predictive biomarkers in high-dimensional data with PPLasso

Wencan Zhu UMR MIA-Paris, AgroParisTech, INRAE, Université Paris-Saclay, 75005, Paris, France wencan.zhu@agroparistech.fr , Céline Lévy-Leduc UMR MIA-Paris, AgroParisTech, INRAE, Université Paris-Saclay, 75005, Paris, France celine.levy-leduc@agroparistech.fr and Nils Ternès Biostatistics and Programming department, Sanofi R&D, 91380 Chilly Mazarin, France nils.ternes@sanofi.com

Abstract.

In clinical development, identification of prognostic and predictive biomarkers is essential to precision medicine. Prognostic biomarkers can be useful for anticipating the prognosis of individual patients, and predictive biomarkers can be used to identify patients more likely to benefit from a given treatment. Previous researches were mainly focused on clinical characteristics, and the use of genomic data in such an area is hardly studied. A new method is required to simultaneously select prognostic and predictive biomarkers in high dimensional genomic data where biomarkers are highly correlated. We propose a novel approach called PPLasso (Prognostic Predictive Lasso) integrating prognostic and predictive effects into one statistical model. PPLasso also takes into account the correlations between biomarkers that can alter the biomarker selection accuracy. Our method consists in transforming the design matrix to remove the correlations between the biomarkers before applying the generalized Lasso. In a comprehensive numerical evaluation, we show that PPLasso outperforms the Lasso type approaches on both prognostic and predictive biomarker identification in various scenarios. Finally, our method is applied to publicly available transcriptomic data from clinical trial RV144. Our method is implemented in the PPLasso R package which is available from the Comprehensive R Archive Network (CRAN).

Key words and phrases:

variable selection; highly correlated predictors; genomic data

1. Introduction

With the advancement of precision medicine, there has been an increasing interest in identifying prognostic or predictive biomarkers in clinical development. A prognostic biomarker informs about a likely clinical outcome (e.g., disease recurrence, disease progression, death) in the absence of therapy or with a standard therapy that patients are likely to receive, while a predictive biomarker is associated with a response or a lack of response to a specific therapy. Ballman (2015) and Clark (2008) provided a comprehensive explanation and concrete examples to distinguish prognostic from predictive biomarkers, respectively.

Concerning the biomarker selection, the high dimensionality of genomic data is one of the main challenges as explained in Fan and Li (2006). To identify effective biomarkers in high-dimensional settings, several approaches can be considered including hypothesis-based tests described in McDonald (2009), wrapper approaches proposed in Saeys et al. (2007), and penalized approaches such as Lasso designed by Tibshirani (1996) among others. Hypothesis-based tests consider each biomarker independently and thus ignore potential correlations between them. Wrapper approaches often show high risk of overfitting and are computationally expensive for high-dimensional data as explained in Smith (2018). More efforts have been devoted to penalized methods given their ability to automatically perform variable selection and coefficient estimation simultaneously as highlighted in Fan and Lv (2009). However, Lasso showed some potential drawbacks when biomarkers are highly correlated. Particularly, when the Irrepresentable Condition (IC) proposed by Zhao and Yu (2006) is violated, Lasso can not guarantee to correctly identify true effective biomarkers. In genomic data, biomarkers are usually highly correlated such that this condition can hardly be satisfied, see Wang et al. (2019). Several methods have been proposed to adress this issue. Elastic Net (Zou and Hastie, 2005) combines the $\ell_{1}$ and $\ell_{2}$ penalties and is particularly effective in tackling correlation issues and can generally outperform Lasso. Adaptive Lasso (Zou, 2006) proposes to assign adaptive weights for penalizing different coefficients in the $\ell_{1}$ penalty, and its oracle property was demonstrated. Wang and Leng (2016) proposed the HOLP approach which consists in removing the correlation between the columns of the design matrix; Wang et al. (2019) proposed to handle the correlation by assigning similar weights to correlated variables in their approach called Precision Lasso; Zhu et al. (2021) proposed to remove the correlations by applying a whitening transformation to the data before using the generalized Lasso criterion designed by Tibshirani and Taylor (2011).

The challenge of finding prognostic biomarkers has been extensively explored with previously introduced methods, however, the discovery of predictive biomarkers has seen much less attention. Limited to binary endpoint, Foster et al. (2011) proposed to first predict response probabilities for treatment and use this probability as the response in a classification problem to find effective biomarkers. Tian et al. (2012) proposed a new method to detect interaction between the treatment and the biomarkers by modifying the covariates. This method can be implemented on continuous/binary/time-to-event endpoint. Lipkovich et al. (2011) proposed a method called SIDES, which adopts a recursive partitioning algorithm for screening treatment-by-biomarker interactions. This method was further improved in Lipkovich and Dmitrienko (2014) by adding another step of preselection on predictive biomarkers based on variable importance. The method was demonstrated with continuous endpoint. More recently, Sechidis et al. (2018) applied approaches coming from information theory for ranking biomarkers on their prognostic/predictive strength. Their method is applicable only for binary or time-to-event endpoint. Moreover, all of these methods were assessed under the situation where the sample size is relatively large and the number of biomarkers is limited, which is hardly the case for genomic data.

In the literature mentioned above, the authors focused on one of the problematic of identifying prognostic or predictive biomarkers, but rarely on both. Even if predictive biomarkers is of major importance for identifying patients more likely to benefit from a treatment, the identification of prognostic biomarkers is also key in this context. Indeed, the clinical impact of a treatment can be judged only with the knowledge of the prognosis of a patient. It is thus of importance to reliably predict the prognosis of patients to assist treatment counseling (Windeler, 2000). In this paper, we propose a novel approach called PPLasso (Prognostic Predictive Lasso) to simultaneously identify prognostic and predictive biomarkers in a high dimensional setting with continuous endpoints, as presented in Section 2. Extensive numerical experiments are given in Section 3 to assess the performance of our approach and to compare it to other methods. PPLasso is also applied to the clinical trial RV144 in Section 4. Finally, we give concluding remarks in Section 5.

2. Methods

In this section, we propose a novel approach called PPLasso (Predictive Prognostic Lasso) which consists in writing the identification of predictive and prognostic biomarkers as a variable selection problem in an ANCOVA (Analysis of Covariance) type model mentioned for instance in Faraway (2002).

2.1. Statistical modeling

Let $\mathbf{y}$ be a continuous response or endpoint and $t_{1}$ , $t_{2}$ two treatments. Let also $\mathbf{X}_{1}$ (resp. $\mathbf{X}_{2}$ ) denote the design matrix for the $n_{1}$ (resp. $n_{2}$ ) patients with treatment $t_{1}$ (resp. $t_{2}$ ), each containing measurements on $p$ candidate biomarkers:

(1)

\mathbf{X}_{1}=\begin{bmatrix}X_{11}^{1}&X_{11}^{2}&\ldots&X_{11}^{p}\\ X_{12}^{1}&X_{12}^{2}&\ldots&X_{12}^{p}\\ ...\\ X_{1n_{1}}^{1}&X_{1n_{1}}^{2}&\ldots&X_{1n_{1}}^{p}\end{bmatrix},\mathbf{X}_{2}=\begin{bmatrix}X_{21}^{1}&X_{21}^{2}&\ldots&X_{21}^{p}\\ X_{22}^{1}&X_{22}^{2}&\ldots&X_{22}^{p}\\ ...\\ X_{2n_{2}}^{1}&X_{2n_{2}}^{2}&\ldots&X_{2n_{2}}^{p}\end{bmatrix}.

To take into account the potential correlation that may exist between the biomarkers in the different treatments, we shall assume that the rows of $\mathbf{X}_{1}$ (resp. $\mathbf{X}_{2}$ ) are independent centered Gaussian random vectors with a covariance matrice equal to $\boldsymbol{\Sigma_{1}}$ (resp. $\boldsymbol{\Sigma_{2}}$ ).

To model the link that exists between $\mathbf{y}$ and the different types of biomarkers we propose using the following model:

(2)

\mathbf{y}=\begin{pmatrix}y_{11}\\ y_{12}\\ \vdots\\ y_{1n_{1}}\\ y_{21}\\ y_{22}\\ \vdots\\ y_{2n_{2}}\end{pmatrix}=\mathbf{X}\begin{pmatrix}\alpha_{1}\\ \alpha_{2}\\ \beta_{11}\\ \beta_{12}\\ \vdots\\ \beta_{1p}\\ \beta_{21}\\ \beta_{22}\\ \vdots\\ \beta_{2p}\end{pmatrix}+\begin{pmatrix}\epsilon_{11}\\ \epsilon_{12}\\ \vdots\\ \epsilon_{1n_{1}}\\ \epsilon_{21}\\ \epsilon_{22}\\ \vdots\\ \epsilon_{2n_{2}}\end{pmatrix},

where $(y_{i1},\dots,y_{in_{i}})$ corresponds to the response of patients with treatment $t_{i}$ , $i$ being equal to 1 or 2,

\mathbf{X}=\begin{bmatrix}1&0&X_{11}^{1}&X_{11}^{2}&\ldots&X_{11}^{p}&0&0&\ldots&0\\ 1&0&X_{12}^{1}&X_{12}^{2}&\ldots&X_{12}^{p}&0&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&&\vdots&&&\\ 1&0&X_{1n_{1}}^{1}&X_{1n_{1}}^{2}&\ldots&X_{1n_{1}}^{p}&0&0&\ldots&0\\ 0&1&0&0&\ldots&0&X_{21}^{1}&X_{21}^{2}&\ldots&X_{21}^{p}\\ 0&1&0&0&\ldots&0&X_{22}^{1}&X_{22}^{2}&\ldots&X_{22}^{p}\\ \vdots&\vdots&\vdots&\vdots&&\vdots&\vdots&\vdots&&\vdots\\ 0&1&0&0&\ldots&0&X_{2n_{2}}^{1}&X_{2n_{2}}^{2}&\ldots&X_{2n_{2}}^{p}\end{bmatrix},

with $\alpha_{1}$ (resp. $\alpha_{2}$ ) corresponding to the effects of treatment $t_{1}$ (resp. $t_{2}$ ). Moreover, $\boldsymbol{\beta}_{1}=(\beta_{11},\beta_{12},\ldots,\beta_{1p})^{\prime}$ (resp. $\boldsymbol{\beta}_{2}=(\beta_{21},\beta_{22},\ldots,\beta_{2p})^{\prime}$ ) are the coefficients associated to each of the $p$ biomarkers in treatment $t_{1}$ (resp. $t_{2}$ ) group, ^′ denoting the matrix transposition and $\epsilon_{11},\dots,\epsilon_{2n_{2}}$ are standard independent Gaussian random variables independent of $\mathbf{X}_{1}$ and $\mathbf{X}_{2}$ . When $t_{1}$ stands for the standard treatment or placebo, prognostic biomarkers are defined as those having non-zero coefficients in $\boldsymbol{\beta}_{1}$ . According to the definition of prognostic biomarkers, their effect should indeed be demonstrated in the absence of therapy or with a standard therapy that patients are likely to receive. On the other hand, predictive biomarkers are defined as those having non-zero coefficients in $\boldsymbol{\beta}_{2}-\boldsymbol{\beta}_{1}$ because they aim to highlight different effects between two different treatments.

Model (2) can be written as:

(3)

\mathbf{y}=\mathbf{X}\boldsymbol{\gamma}+\boldsymbol{\epsilon},

with $\boldsymbol{\gamma}=(\alpha_{1},\alpha_{2},\boldsymbol{\beta}_{1}^{\prime},\boldsymbol{\beta}_{2}^{\prime})^{\prime}$ . The Lasso penalty is a well-known approach to estimate coefficients with a sparsity enforcing constraint allowing variable selection by estimating some coefficients by zero. It consists in minimizing the following penalized least-squares criterion (Tibshirani (1996)):

(4)

\frac{1}{2}\left\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\gamma}\right\rVert_{2}^{2}+\lambda\left\lVert\boldsymbol{\gamma}\right\rVert_{1},

where $\left\lVert\textbf{u}\right\rVert_{2}^{2}=\sum_{i=1}^{n}u_{i}^{2}$ and $\left\lVert\textbf{u}\right\rVert_{1}=\sum_{i=1}^{n}|u_{i}|$ for $\textbf{u}=(u_{1},\dots,u_{n})$ . A different sparsity constraint was applied to $\boldsymbol{\beta}_{1}$ and $\boldsymbol{\beta}_{2}-\boldsymbol{\beta}_{1}$ to allow different sparsity levels. Hence we propose to replace the penalty $\lambda\left\lVert\boldsymbol{\gamma}\right\rVert_{1}$ in (4) by

(5)

\lambda_{1}\left\lVert\boldsymbol{\beta}_{1}\right\rVert_{1}+\lambda_{2}\left\lVert\boldsymbol{\beta}_{2}-\boldsymbol{\beta}_{1}\right\rVert_{1}.

Thus, a first estimator of $\boldsymbol{\gamma}$ could be found by minimizing the following criterion with respect to $\boldsymbol{\gamma}$ :

(6)

\frac{1}{2}\left\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\gamma}\right\rVert_{2}^{2}+\lambda_{1}\left\lVert\begin{bmatrix}\mathbf{0}_{p,1}&\mathbf{0}_{p,1}&D_{1}\\ \mathbf{0}_{p,1}&\mathbf{0}_{p,1}&\frac{\lambda_{2}}{\lambda_{1}}D_{2}\end{bmatrix}\boldsymbol{\gamma}\right\rVert_{1},

where $D_{1}=[\mathbf{\textrm{Id}}_{p},\mathbf{0}_{p,p}]$ and $D_{2}=[-\mathbf{\textrm{Id}}_{p},\mathbf{\textrm{Id}}_{p}]$ , with $\mathbf{\textrm{Id}}_{p}$ denoting the identity matrix of size $p$ and $\mathbf{0}_{i,j}$ denoting a matrix having $i$ rows and $j$ columns and containing only zeros. However, since the inconsistency of Lasso biomarker selection is originated from the correlations between the biomarkers, we propose to remove the correlation by “whitening” the matrix $\mathbf{X}$ . More precisely, we consider $\widetilde{\mathbf{X}}=\mathbf{X}\boldsymbol{\Sigma}^{-1/2}$ , where

(7)

\boldsymbol{\Sigma}=\begin{bmatrix}1&0&0&0\\ 0&1&0&0\\ 0&0&\boldsymbol{\Sigma}_{1}&0\\ 0&0&0&\boldsymbol{\Sigma}_{2}\end{bmatrix}

and define $\boldsymbol{\Sigma}^{-1/2}$ by replacing in (7) $\boldsymbol{\Sigma}_{i}$ by $\boldsymbol{\Sigma}_{i}^{-1/2}$ , where $\boldsymbol{\Sigma}_{i}^{-1/2}=\mathbf{U}_{i}\mathbf{D}_{i}^{-1/2}\mathbf{U}_{i}^{T}$ , $\mathbf{U}_{i}$ and $\mathbf{D}_{i}$ being the matrices involved in the spectral decomposition of $\boldsymbol{\Sigma}_{i}$ for $i=1$ or 2. With such a transformation the columns of $\widetilde{\mathbf{X}}$ are decorrelated and Model (3) can be rewritten as follows:

(8)

\mathbf{y}=\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\gamma}}+\boldsymbol{\epsilon}

where $\widetilde{\boldsymbol{\gamma}}=\boldsymbol{\Sigma}^{1/2}\boldsymbol{\gamma}$ . The objective function (6) thus becomes:

(9)

L_{\lambda_{1},\lambda_{2}}^{\textrm{PPLasso}}(\widetilde{\boldsymbol{\gamma}})=\frac{1}{2}\left\lVert\mathbf{y}-\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\gamma}}\right\rVert_{2}^{2}+\lambda_{1}\left\lVert\begin{bmatrix}\mathbf{0}_{p,1}&\mathbf{0}_{p,1}&D_{1}\\ \mathbf{0}_{p,1}&\mathbf{0}_{p,1}&\frac{\lambda_{2}}{\lambda_{1}}D_{2}\end{bmatrix}\boldsymbol{\Sigma}^{-1/2}\widetilde{\boldsymbol{\gamma}}\right\rVert_{1}.

2.2. Estimation of $\widetilde{\boldsymbol{\gamma}}$

Let us define a first estimator of $\widetilde{\boldsymbol{\gamma}}=(\widetilde{\alpha}_{1},\widetilde{\alpha}_{2},\widetilde{\boldsymbol{\beta}}_{1}^{\prime},\widetilde{\boldsymbol{\beta}}_{2}^{\prime})$ as follows:

(10)

\widehat{\widetilde{\boldsymbol{\gamma}}}_{0}(\lambda_{1},\lambda_{2})=(\widehat{\widetilde{\alpha}}_{1},\widehat{\widetilde{\alpha}}_{2},\widehat{\widetilde{\boldsymbol{\beta}}}_{10}^{\prime},\widehat{\widetilde{\boldsymbol{\beta}}}_{20}^{\prime})=\mathop{\rm Argmin}\nolimits_{\widetilde{\boldsymbol{\gamma}}}L_{\lambda_{1},\lambda_{2}}^{\textrm{PPLasso}}(\widetilde{\boldsymbol{\gamma}}),

for each fixed $\lambda_{1}$ and $\lambda_{2}$ . To better estimate $\widetilde{\boldsymbol{\beta}}_{1}$ and $\widetilde{\boldsymbol{\beta}}_{2}$ , a thresholding was applied to $\widehat{\widetilde{\boldsymbol{\beta}}}_{0}(\lambda_{1},\lambda_{2})=(\widehat{\widetilde{\boldsymbol{\beta}}}_{10}(\lambda_{1},\lambda_{2})^{\prime},\widehat{\widetilde{\boldsymbol{\beta}}}_{20}(\lambda_{1},\lambda_{2})^{\prime})^{\prime}$ . For $K_{1}$ (resp. $K_{2}$ ) in $\{1,\ldots,p\}$ , let $\textrm{Top}_{K_{1}}$ (resp. $\textrm{Top}_{K_{2}}$ ) be the set of indices corresponding to the $K_{1}$ (resp. $K_{2}$ ) largest values of the components of $|\widehat{\widetilde{\boldsymbol{\beta}}}_{10}(\lambda_{1},\lambda_{2})|$ (resp. $|\widehat{\widetilde{\boldsymbol{\beta}}}_{20}(\lambda_{1},\lambda_{2})|$ ), then the estimator of $\widetilde{\boldsymbol{\beta}}=(\widetilde{\boldsymbol{\beta}}_{1}^{\prime},\widetilde{\boldsymbol{\beta}}_{2}^{\prime})$ after the correction is denoted by $\widehat{\widetilde{\boldsymbol{\beta}}}(\lambda_{1},\lambda_{2})=(\widehat{\widetilde{\boldsymbol{\beta}}}_{1}^{(\widehat{K}_{1})}(\lambda_{1},\lambda_{2}),\widehat{\widetilde{\boldsymbol{\beta}}}_{2}^{(\widehat{K}_{2})}(\lambda_{1},\lambda_{2}))$ where the $j$ th component of $\widehat{\widetilde{\boldsymbol{\beta}}}_{i}^{(K_{i})}(\lambda_{1},\lambda_{2})$ , for $i=1$ or 2, is defined by:

(11)

\widehat{\widetilde{\boldsymbol{\beta}}}_{ij}^{(K_{i})}(\lambda_{1},\lambda_{2})=\begin{cases}\widehat{\widetilde{\boldsymbol{\beta}}}_{i0j}(\lambda_{1},\lambda_{2}),&j\in\textrm{Top}_{K_{i}}\\ \textrm{$K_{1}$th largest value of }|\widehat{\widetilde{\boldsymbol{\beta}}}_{i0j}(\lambda_{1},\lambda_{2})|,&j\not\in\textrm{Top}_{K_{i}}.\end{cases}

Note that the corrections are only performed on $\widehat{\widetilde{\boldsymbol{\beta}}}_{0}$ , the estimators $\widehat{\widetilde{\alpha}}_{1}$ and $\widehat{\widetilde{\alpha}}_{2}$ were not modified. The choice of $K_{1}$ and $K_{2}$ will be explained in Section 2.4.

To illustrate the interest of using a thresholding step, we generated a dataset based on Model 3 with parameters described in Section 3.1 and $p=500$ . Moreover, to simplify the graphical illustrations, we focus on the case where $\lambda_{1}=\lambda_{2}=\lambda$ . Figure 1 displays the estimation error associated to the estimators of $\widetilde{\boldsymbol{\beta}}(\lambda)$ before and after the thresholding. We can see from this figure that the estimation of $\widetilde{\boldsymbol{\beta}}(\lambda)$ is less biased after the correction. Moreover, we observed that this thresholding strongly improves the final estimation of $\boldsymbol{\gamma}$ and the variable selection performance of our method.

Refer to caption — Figure 1. Estimation error $\left\lVert\widehat{\widetilde{\boldsymbol{\beta}}}_{0}(\lambda)-\widetilde{\boldsymbol{\beta}}\right\rVert_{2}$ (gray) and $\left\lVert\widehat{\widetilde{\boldsymbol{\beta}}}(\lambda)-\widetilde{\boldsymbol{\beta}}\right\rVert_{2}$ (red) for all $\lambda$ .

2.3. Estimation of $\boldsymbol{\gamma}$

With $\widehat{\widetilde{\boldsymbol{\beta}}}=(\widehat{\widetilde{\boldsymbol{\beta}}}_{1}^{\prime},\widehat{\widetilde{\boldsymbol{\beta}}}_{2}^{\prime})$ , the estimators of $\boldsymbol{\beta}_{1}$ and $\boldsymbol{\beta}_{2}-\boldsymbol{\beta}_{1}$ can be obtained by $\widehat{\boldsymbol{\beta}}_{10}=\boldsymbol{\Sigma}_{1}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}}_{1}$ and $(\widehat{\boldsymbol{\beta}}_{20}-\widehat{\boldsymbol{\beta}}_{10})=\boldsymbol{\Sigma}_{2}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}}_{2}-\boldsymbol{\Sigma}_{1}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}}_{1}$ . As previously, another thresholding was applied to $\widehat{\boldsymbol{\beta}}_{10}$ and $\widehat{\boldsymbol{\beta}}_{20}$ : for $i=1$ or 2,

(12)

\widehat{\boldsymbol{\beta}}_{ij}^{(M_{i})}(\lambda_{1},\lambda_{2})=\begin{cases}\widehat{\boldsymbol{\beta}}_{i0j}(\lambda_{1},\lambda_{2}),&j\in\textrm{Top}_{M_{i}}\\ 0,&j\not\in\textrm{Top}_{M_{i}},\end{cases}

for each fixed $\lambda_{1}$ and $\lambda_{2}$ . The biomarkers with non-zero coefficients in $\widehat{\boldsymbol{\beta}}_{1}=\widehat{\boldsymbol{\beta}}_{1}^{(M_{1})}$ (resp. $\widehat{\boldsymbol{\beta}}_{2}^{(M_{2})}-\widehat{\boldsymbol{\beta}}_{1}^{(M_{1})}$ ) are considered as prognostic (resp. predictive) biomarkers, where the choice of $M_{1}$ and $M_{2}$ is explained in Section 2.4.

To illustrate the benefits of using an additional thresholding step, we used the dataset described in Section 2.2. Moreover, to simplify the graphical illustrations, we also focus on the case where $\lambda_{1}=\lambda_{2}=\lambda$ . Figure 8 in the Supplementary material displays the number of True Positive (TP) and False Positive (FP) in prognostic and predictive biomarker identification with and without the second thresholding. We can see from this figure that the thresholding stage limits the number of false positives. Note that $\alpha_{1}$ and $\alpha_{2}$ are estimated by $\widehat{\widetilde{\alpha}}_{1}$ and $\widehat{\widetilde{\alpha}}_{2}$ defined in (10).

2.4. Choice of the parameters $K_{1},K_{2}$ , $M_{1}$ and $M_{2}$

For each $(\lambda_{1},\lambda_{2})$ and each $K_{1}$ , we computed:

(13)

\widetilde{\textrm{MSE}}_{K_{1},K_{2}}(\lambda_{1},\lambda_{2})=\|\mathbf{y}-\widetilde{\mathbf{X}}\widehat{\widetilde{\boldsymbol{\gamma}}}^{(K1,K2)}(\lambda_{1},\lambda_{2})\|_{2}^{2},

where $\widehat{\widetilde{\boldsymbol{\gamma}}}^{(K1,K2)}(\lambda_{1},\lambda_{2})=(\widehat{\widetilde{\alpha}}_{1},\widehat{\widetilde{\alpha}}_{2},\widehat{\widetilde{\boldsymbol{\beta}}}_{1}^{(K_{1})^{\prime}},\widehat{\widetilde{\boldsymbol{\beta}}}_{2}^{(K_{2})^{\prime}})$ defined in (10) and in (11). It is displayed in the left part of Figure 2.

For each $\lambda_{1}$ , $\lambda_{2}$ and a given $\delta\in(0,1)$ , the parameter $\widehat{K_{2}}$ is then chosen as follows for each $K_{1}$ :

\widehat{K_{2}}(\lambda_{1},\lambda_{2})=\mathop{\rm Argmin}\nolimits\left\{K_{2}\geq 1\textrm{ s.t. }\frac{\widetilde{\textrm{MSE}}_{(K_{1},K_{2}+1})(\lambda_{1},\lambda_{2})}{\widetilde{\textrm{MSE}}_{(K_{1},K_{2})}(\lambda_{1},\lambda_{2})}\geq\delta\right\}.

The $\widehat{K}_{2}$ associated to each $K_{1}$ are displayed with ’*’ in the left part of Figure 2. Then $\widehat{K_{1}}$ is chosen by using a similar criterion:

\widehat{K_{1}}(\lambda_{1},\lambda_{2})=\mathop{\rm Argmin}\nolimits\left\{K_{1}\geq 1\textrm{ s.t. }\frac{\widetilde{\textrm{MSE}}_{(K_{1}+1,\widehat{K}_{2}})(\lambda_{1},\lambda_{2})}{\widetilde{\textrm{MSE}}_{(K_{1},\widehat{K}_{2})}(\lambda_{1},\lambda_{2})}\geq\delta\right\}.

The values of $\widetilde{\textrm{MSE}}_{(K_{1},\widehat{K}_{2})}(\lambda_{1},\lambda_{2})$ are displayed in the right part of Figure 2 in the particular case where $\lambda_{1}=\lambda_{2}=\lambda$ , $\delta=0.95$ and with the same dataset as the one used in Section 2.2. $\widehat{K}_{1}$ is displayed with a red star. This value of $\delta$ will be used in the following sections. However, choosing $\delta$ in the range (0.9,0.99) does not have a strong impact on the variable selection performance of our approach.

The parameters $\widehat{M}_{1}$ and $\widehat{M}_{2}$ are chosen in a similar way except that $\widetilde{\textrm{MSE}}_{K_{1},K_{2}}(\lambda_{1},\lambda_{2})$ is replaced by $\widehat{\textrm{MSE}}_{M_{1},M_{2}}(\lambda_{1},\lambda_{2})$ where:

\widehat{\textrm{MSE}}_{M_{1},M_{2}}(\lambda_{1},\lambda_{2})=\|\mathbf{y}-\mathbf{X}\widehat{\boldsymbol{\gamma}}^{(M_{1},M_{2})}(\lambda_{1},\lambda_{2})\|_{2}^{2},

with $\widehat{\boldsymbol{\gamma}}^{(M_{1},M_{2})}(\lambda_{1},\lambda_{2})=(\widehat{\widetilde{\alpha}}_{1},\widehat{\widetilde{\alpha}}_{2},\widehat{\boldsymbol{\beta}}_{1}^{(M_{1})^{\prime}},\widehat{\boldsymbol{\beta}}_{2}^{(M_{2})^{\prime}})$ defined in (10) and (12). In the following, $\widehat{\boldsymbol{\gamma}}(\lambda_{1},\lambda_{2})=\widehat{\boldsymbol{\gamma}}^{(\widehat{M}_{1},\widehat{M}_{2})}(\lambda_{1},\lambda_{2})$ .

2.5. Estimation of $\boldsymbol{\Sigma}_{1}$ and $\boldsymbol{\Sigma}_{2}$

As the empirical correlation matrix is known to be a non accurate estimator of $\boldsymbol{\Sigma}$ when $p$ is larger than $n$ , a new estimator has to be used. Thus, for estimating $\boldsymbol{\Sigma}$ we adopted a cross-validation based method designed by Boileau et al. (2021) and implemented in the cvCovEst R package (Boileau et al., 2021). This method chooses the estimator having the smallest estimation error among several compared methods (sample correlation matrix, POET (Fan et al. (2013)) and Tapering (Cai et al. (2010)) as examples). Since the samples in treatments $t_{1}$ and $t_{2}$ are assumed to be collected from the same population, $\boldsymbol{\Sigma}_{1}$ and $\boldsymbol{\Sigma}_{2}$ are assumed to be equal.

2.6. Choice of the parameters $\lambda_{1}$ and $\lambda_{2}$

For the sake of simplicity, we limit ourselves to the case where $\lambda_{1}=\lambda_{2}=\lambda$ . For choosing $\lambda$ we used BIC (Bayesian Information Criterion) which is widely used in the variable selection field and which consists in minimizing the following criterion with respect to $\lambda$ :

\textrm{BIC}(\lambda)=n\log(\textrm{MSE}(\lambda)/n)+k(\lambda)\log(n),

where $n$ is the total number of samples, $\textrm{MSE}(\lambda)=\|\mathbf{y}-\mathbf{X}\widehat{\boldsymbol{\gamma}}(\lambda)\|_{2}^{2}$ and $k(\lambda)$ is the number of non null coefficients in the OLS estimator $\widehat{\boldsymbol{\gamma}}$ obtained by re-estimating only the non null components of $\widehat{\boldsymbol{\beta}}_{1}$ and $\widehat{\boldsymbol{\beta}}_{2}-\widehat{\boldsymbol{\beta}}_{1}$ . The values of the BIC criterion as well as those of the MSE obtained from the dataset described in Section 2.2 are displayed in Figure 3.

Table 2 in the supplementary material provides the True Positive Rate (TPR) and False Positive Rate (FPR) when $\lambda$ is chosen either by minimizing the MSE or the BIC criterion for this dataset. We can see from this table that both of them have TPR=1 (all true positives are identified). However, the FPR based on the BIC criterion is smaller than the one obtained by using the MSE.

3. Numerical experiments

This section presents a comprehensive numerical study by comparing the performance of our method with other regularized approaches in terms of prognostic and predictive biomarker selection. Besides the Lasso, we also compared with Elastic Net and Adaptive Lasso since they also take into account the correlations. For Lasso, Elastic Net and Adaptive Lasso, in order to directly estimate prognostic and predictive effects, $\mathbf{X}$ and $\boldsymbol{\gamma}$ in Model (3) were replaced by

\mathbf{X}^{*}=\begin{bmatrix}\mathbf{1}_{n_{1},1}&\mathbf{0}_{n_{1},1}&\mathbf{X}_{1}&\mathbf{0}_{n_{1},p}\\ \mathbf{0}_{n_{2},1}&\mathbf{1}_{n_{2},1}&\mathbf{X}_{2}&\mathbf{X}_{2}\end{bmatrix},

and $\boldsymbol{\gamma}^{*}=(\alpha_{1},\alpha_{2},\boldsymbol{\beta}_{1}^{*},\boldsymbol{\beta}_{2}^{*})$ , respectively, where $\mathbf{X}_{1}$ and $\mathbf{X}_{2}$ are defined in (1), $\mathbf{0}_{i,j}$ (resp. $\mathbf{1}_{i,j}$ ) denotes a matrix having $i$ rows and $j$ columns and containing only zeros (resp. ones). Note that this is the modeling proposed by Lipkovich et al. (2017). The sparsity enforcing constraint was put on the coefficients $\boldsymbol{\beta}_{1}^{*}$ and $\boldsymbol{\beta}_{2}^{*}$ which boils down to putting a sparsity enforcing constraint on $\boldsymbol{\beta}_{1}$ and $\boldsymbol{\beta}_{2}-\boldsymbol{\beta}_{1}$ .

3.1. Simulation setting

All simulated datasets were generated from Model (3) where the $n_{1}$ ( $n_{2}$ ) rows of $\mathbf{X}_{1}$ ( $\mathbf{X}_{2}$ ) are assumed to be independent Gaussian random vectors with a covariance matrix $\boldsymbol{\Sigma}_{1}=\boldsymbol{\Sigma}_{2}=\boldsymbol{\Sigma}_{bm}$ , and $\boldsymbol{\epsilon}$ is a standard Gaussian random vector independent of $\mathbf{X}_{1}$ and $\mathbf{X}_{2}$ . We defined $\boldsymbol{\Sigma}_{bm}$ as:

(14)

\boldsymbol{\Sigma}_{bm}=\begin{bmatrix}\boldsymbol{\Sigma}_{11}&\boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{12}^{T}&\boldsymbol{\Sigma}_{22}\end{bmatrix}

where $\boldsymbol{\Sigma}_{11}$ (resp. $\boldsymbol{\Sigma}_{22}$ ) are the correlation matrix of prognostic (resp. non-prognostic) biomarkers with off-diagonal entries equal to $a_{1}$ (resp. $a_{3}$ ). Morever, $\boldsymbol{\Sigma}_{12}$ is the correlation matrix between prognostic and non-prognostic variables with entries equal to $a_{2}$ . In our simulations $(a_{1},a_{2},a_{3})=(0.3,0.5,0.7)$ , which is a framework proposed by Xue and Qu (2017). We checked that the Irrepresentable Condition (IC) of Zhao and Yu (2006) is violated and thus the standard Lasso cannot recover the positions of the null and non null variables. For each dataset we assumed randomized treatment allocation between standard and experimental arm with a 1:1 ratio, i.e. $n_{1}=n_{2}=50$ . We further assume a relative treatment effect of 1 ( $\alpha_{1}=0$ and $\alpha_{2}=1$ ). The number of biomarkers $p$ varies from 200 to 2000. The number of active biomarkers was set to 10 (i.e. 5 purely prognostic biomarkers with $\boldsymbol{\beta}_{1j}=\boldsymbol{\beta}_{2j}=b_{1}=1$ $(j=1,...,5)$ and 5 biomarkers both prognostic and predictive with $\boldsymbol{\beta}_{1j}=b_{1}$ and $\boldsymbol{\beta}_{2j}=b_{2}=2$ $(j=6,...,10)$ ).

3.2. Evaluation criteria

We considered several evaluation criteria to assess the performance of the methods in selecting the prognostic and predictive biomarkers: the $\textrm{TPR}_{\textrm{prog}}$ as the true positive rate (i.e. rate of active biomarkers selected) and $\textrm{FPR}_{\textrm{prog}}$ the false positive rate (i.e. rate of inactive biomarkers selected) of the selection of prognostic biomarkers, and similarly for predictive biomarkers with $\textrm{TPR}_{\textrm{pred}}$ and $\textrm{FPR}_{\textrm{pred}}$ . We further note $\textrm{TPR}_{\textrm{all}}$ and $\textrm{FPR}_{\textrm{all}}$ the criterion of overall selection among all candidate biomarkers regardless their prognostic or predictive effect. The objective of the selection is to maximize the $\textrm{TPR}_{\textrm{all}}$ and minimize the $\textrm{FPR}_{\textrm{all}}$ . All metrics were calculated by averaging the results of 100 replications for each scenario.

3.3. Biomarker selection results

For the proposed method, different results are presented. $\textrm{PPLasso}_{\Sigma}$ (resp. PPLasso) corresponds to the results of the method by considering the true (resp. estimated) matrix $\boldsymbol{\Sigma}_{bm}$ . For estimating $\boldsymbol{\Sigma}_{bm}$ , we used the approach explained in Section 2.5. Two choices of parameters are also given: “optimal” and “min(bic)”. The former uses as a value for $\lambda$ the one that maximizes $(\textrm{TPR}_{\textrm{all}}-\textrm{FPR}_{\textrm{all}})$ and the latter uses the approach presented in Section 2.6. In order to compare the performance of our approach to the best performance that could be reached by Elastic Net, Lasso and Adaptive Lasso, we used for these methods the “optimal” parameters namely those maximizing $(\textrm{TPR}_{\textrm{all}}-\textrm{FPR}_{\textrm{all}})$ . All these three methods were implemented with the glmnet R package, the best parameter $\alpha$ involved in Elastic Net was chosen in the set $\{0.1,0.2,\dots,0.9\}$ . The choice of “min(bic)” is only applied to our method and corresponds to a choice of $\lambda$ that could be used in practical situations. For ease of presentation, the abbreviation EN (resp. AdLasso) refers to Elastic Net (resp. Adaptive Lasso) in the following.

Figure 4 shows the selection performance of PPLasso and other compared methods in the simulation scenario presented in Section 3.1. PPLasso achieved to select all prognostic biomarkers ( $\textrm{TPR}_{\textrm{prog}}$ almost 1) even for large $p$ , with limited false positive prognostic biomarkers selected. As compared to the optimal $\lambda$ maximizing $(\textrm{TPR}_{\textrm{all}}-\textrm{FPR}_{\textrm{all}})$ , the one selected with the BIC tends to select some false positives (average: 33 ( $\textrm{FPR}_{\textrm{prog}}=0.17$ ) for $p=200$ and 10 ( $\textrm{FPR}_{\textrm{prog}}=0.005$ ) for $p=2000$ ). The results obtained from the oracle and estimated $\boldsymbol{\Sigma}_{bm}$ are comparable. Selection performance of predictive biomarkers is slightly lowered as compared to prognostic biomarkers. Even if the false positive selection is quite similar between prognostic and predictive biomarkers, PPLasso missed some true predictive biomarkers when $\lambda$ is selected with the BIC criterion (average $\textrm{TPR}_{\textrm{pred}}=$ 0.98 and 0.80 for oracle and estimated $\boldsymbol{\Sigma}_{bm}$ , respectively, with $p=2000$ ). In this scenario where the IC is violated, PPLasso globally outperforms Lasso, Elastic Net and Adaptive Lasso. Although Elastic Net showed higher TPR than Lasso and Adaptive Lasso, they all failed in selecting all truly prognostic and predictive biomarkers, and the number of missed active biomarkers increased with the dimension $p$ . For example, for Elastic Net, $\textrm{TPR}_{\textrm{prog}}$ = 0.85 and 0.53, $\textrm{TPR}_{\textrm{pred}}$ = 0.81 and 0.61 for $p=200$ and $2000$ , respectively.

3.3.1. Impact of the correlation matrix $\boldsymbol{\Sigma}$

To evaluate the impact of the correlation matrix on the selection performance of the methods, additional scenarios are presented where the IC is satisfied:

(1)

Compound symmetry structure where all biomarkers are equally correlated with a correlation $\rho=0.5$ ;
(2)

Independent setting where $\boldsymbol{\Sigma}_{bm}$ is the identity matrix.

For the scenario with compound symmetry structure displayed in Figure 5, all the methods successfully identified the true prognostic biomarkers ( $\textrm{TPR}_{\textrm{prog}}$ close to 1 even for large $p$ ) with limited false positive selection. On the other hand, the compared methods (Lasso, ELastic Net, Adaptive Lasso) missed some predictive biomarkers especially when $p$ increases. On the contrary, PPLasso successfully identified almost all predictive biomarkers with the optimal choice of $\lambda$ . Moreover, even when $\lambda$ is selected by minimizing the BIC criterion (min(bic)), $\textrm{PPLasso}_{\textrm{est}}$ outperformed Lasso and Adaptive Lasso when $p>500$ with relatively stable $\textrm{TPR}_{\textrm{pred}}$ and $\textrm{FPR}_{\textrm{pred}}$ as $p$ increases.

For the independent setting, as displayed in Figure 6, prognostic biomarkers were globally well identified by all the compared methods with a slightly higher $\textrm{TPR}_{\textrm{prog}}$ for Lasso and ELastic Net as compared to PPLasso but also with a slightly higher $\textrm{FPR}_{\textrm{prog}}$ . With regards to predictive biomarkers, PPLasso using $\boldsymbol{\Sigma}_{bm}$ (oracle) performed also similarly to the Lasso, which is reasonable since no transformation has been used in PPLasso. On the other hand, even if PPLasso with $\lambda$ selected with “min(bic)” performed similarly with PPLasso with optimal $\lambda$ for relatively small $p$ , the selection performance is altered for large $p$ and even if the performance is higher than Lasso and Adaptive Lasso, it is smaller than the one of Elastic Net.

3.3.2. Impact of the effect size of active biomarkers

To evaluate the impact of the effect size on biomarker selection performance, the scenario presented in Section 3.1 was considered with different values of $b_{2}$ : 1.5, 2 and 2.5.

Since the effect size of prognostic biomarkers did not change, the comparison focused on predictive biomarkers. As expected, the reduction of the effect size makes the biomarker selection harder, especially for Lasso, Elastic Net and Adaptive Lasso where the predictive biomarker selection is limited when $b_{2}=1.5$ : for Lasso when $p=2000$ , $\textrm{TPR}_{\textrm{pred}}$ = 0.45 (resp. 0.22) for $b_{2}=2$ (resp. 1.5), see Figure 4 and Figure 9 of the supplementary material. The selection performance of PPLasso when $\lambda$ is selected with min(bic) is also reduced by decreasing $b_{2}$ , especially when $\boldsymbol{\Sigma}_{bm}$ is also estimated. Nevertheless, the selection performance of PPLasso remains better than for the other compared methods for which the performance displayed are associated to the optimal value of $\lambda$ . On the other hand, even with limited effect size, PPLasso with optimal $\lambda$ identified all predictive biomarkers with very limited false positive selection. When $b_{2}$ was increased to 2.5, the selection performance for all methods is improved and the results for PPLasso with estimated $\lambda$ was close to the ones with the optimal $\lambda$ as displayed in Figure 10 of the supplementary material. As compared with PPLasso, for which the selection performance remained stable as $p$ increased, Lasso, Elastic Net and Adaptive Lasso were more impacted by the value of $p$ since the true positive selection decreased as $p$ increased. As an example, for the Lasso, $\textrm{TPR}_{\textrm{pred}}$ =0.95 (resp. 0.65) for $p=200$ (resp. 2000).

3.3.3. Impact of the number of predictive biomarkers

The impact of the number of true predictive biomarkers was assessed by increasing the number of predictive biomarkers from 5 to 10 in the scenario presented in Section 3.1. When the number of predictive biomarkers increased, the impact on PPLasso is almost negligible, especially for prognostic biomarker identification. However, for the other methods, we can see from Figure 11 of the supplementary material that it became even harder to identify predictive biomarkers. $\textrm{TPR}_{\textrm{pred}}$ decreased compared to Figure 4, especially for large $p$ (e.g. $\textrm{TPR}_{\textrm{pred}}=$ 0.12, 0.18, and 0.02 for Lasso, Elastic Net and Adaptive Lasso respectively when $p=2000$ ).

3.3.4. Impact of the dimension of the dataset

In this section, we studied a different sample size: $n$ =50 with $n_{1}=n_{2}=25$ and a different number of biomarkers: $p$ =5000.

We can see from Figure 12 of the supplementary material that for $p=5000$ , the selection performance of PPLasso is not altered as compared with $p=2000$ while the compared methods have more difficulties to identify both prognostic and predictive biomarkers.

When the sample size is smaller ( $n$ =50), we can see from Figure 13 of the supplementary material that the ability to identify prognostic and predictive biomarkers decreased for all the methods. However, PPLasso still outperformed the others with higher $\textrm{TPR}_{\textrm{prog}}$ and $\textrm{TPR}_{\textrm{pred}}$ and lower $\textrm{FPR}_{\textrm{prog}}$ and $\textrm{FPR}_{\textrm{pred}}$ .

4. Application to transcriptomic data in RV144 clinical trial

We applied the previously described methods to publicly available transcriptomic data from the RV144 vaccine trial (Rerks-Ngarm et al. (2009)). This trial showed reduced risk of HIV-1 acquisition by 31.2% with vaccination with ALVAC and AIDSVAX as compared to placebo. Transcriptomic profiles of in vitro HIV-1 Env-stimulated peripheral blood mononuclear cells (PBMCs) obtained pre-immunization and 15 days after the immunization (D15) from both 40 vaccinees and 10 placebo recipients were generated to better understand underlying biological mechanisms (Fourati et al. (2019), Gene Expression Omnibus accession code: GSE103671).

For illustration purpose, the absolute change at D15 in gene mTOR was considered as the continuous endpoint (response). mTOR plays a key role in mTORC1 signaling pathway which has been shown to be associated with risk of HIV-1 acquisition (Fourati et al. (2019), Akbay et al. (2020)). The gene expression has been normalized as in the original publication of Fourati et al. (2019). After removing non-annotated genes (LOCxxxx and HS.xxxx), the top 2000 genes with the highest empirical variances were included as candidate biomarkers for prognostic and predictive identification from PPLasso and the compared methods. The penalty parameter $\lambda$ for the Lasso and Adaptive Lasso, the parameters $\lambda$ and $\alpha$ for Elastic Net were selected through the classical cross-validation approach. For PPLasso, $\lambda$ was selected based on the criterion described in Section 2.6.

The estimation of $\boldsymbol{\Sigma}$ was obtained by comparing several candidate estimators from the cvCovEst R package and by selecting the estimator having the smallest estimation error. In this application, the combination of the sample covariance matrix and a dense target matrix (denseLinearShrinkEst) derived by Ledoit and Wolf (2020) provides the smallest estimation error. Figure 7 displays the estimated $\boldsymbol{\Sigma}$ and highlights the strong correlation between the genes. Table 3 of the Supplementary material gives details on the compared estimators.

Prognostic and predictive genes selected by PPLasso, Lasso, Elastic Net and Adaptive Lasso are listed in Table 1. The number of genes selected are similar for all the compared methods, except for a slightly higher number of predictive genes selected by PPLasso. Lasso, Elastic Net and Adaptive Lasso selected very similar sets of prognostic and predictive genes. The intersection between PPLasso and others is moderate (2 prognostic genes (SLAMF7 and TNFRSF6B), 3 predictive genes (YTHDC1, MS4A7 and RPL21)). Interestingly, some genes selected by most methods such as SLAMF7, TNFRSF6B, TNFRSF18 or NUCKS1 have already been discussed in the HIV-1 field. Moreover, among the predictive genes selected by the PPLasso, some are linked to pathways that have been highlighted as possible target for HIV-1 such as BIRC3 and TLR8.

Table 1. Selected genes from PPLasso, Lasso, Elastic Net and Adaptive Lasso. Commonly selected genes are in bold.

	prognostic genes	predictive genes
PPLasso	HAPLN3, SLAMF7, GTF3C5, FAM46A, SH3PXD2B, TM4SF1, TNFRSF6B, TNFRSF18, TRPM2	TLR8, YTHDC1, NUCKS1, BIRC3, SLAMF7, NFATC2IP, BOK, MGRN1, KIAA0492, SLC25A36, HMGN2, P2RY5, RPL21, MS4A7, RPL12P6
Lasso	DKFZp434K191, NUCKS1, MAFF, SLAMF7, HIST2H2AC, HIST1H4C, IL8, TNFRSF6B, TNFRSF18, SCAND1	DKFZp434K191, YTHDC1, VMO1, BOLA2, HIST1H4C, RPL21, MS4A7
Elastic Net	DKFZp434K191, NUCKS1,SNURF, MAFF, SLAMF7, IL8, ZBP1, TNFRSF6B, ZAK, TNFRSF18, SCAND1, NME1-NME2, DNM1L, RNF146, NPEPL1	DKFZp434K191, YTHDC1, PMP22, VMO1, BOLA2, HIST1H4C, RPL21, MS4A7,RAB11FIP1
Adaptive Lasso	NUCKS1,SNURF, MAFF, SLAMF7, IL8, ZBP1, TNFRSF6B, NME1-NME2, DNM1L, RNF146	YTHDC1, PMP22, VMO1, BOLA2, HIST1H4C, MS4A7, RPL21

5. Conclusion

We propose a new method named PPLasso to simultaneously identify prognostic and predictive biomarkers. PPLasso is particularly interesting for dealing with high dimensional omics data when the biomarkers are highly correlated, which is a framework that has not been thoroughly investigated yet. From various numerical studies with or whithout strong correlation between biomarkers, we highlighted the strength of PPLasso in well identifying both prognostic and predictive biomarkers with limited false positive selection. The current method is only dedicated to the analysis of continuous responses through ANCOVA type models. However, it will be the subject of a future work to extend it to other challenging contexts, such as classification or survival analysis.

Funding

This work was supported by the Association Nationale de la Recherche et de la Technologie (ANRT).

References

Akbay et al. (2020) Akbay, B., Shmakova, A., Vassetzky, Y., and Dokudovskaya, S. (2020). Modulation of mtorc1 signaling pathway by hiv-1. Cells, 9, 1090.
Ballman (2015) Ballman, K. V. (2015). Biomarker: Predictive or prognostic? Journal of clinical oncology, 33(33), 3968–3971.
Boileau et al. (2021) Boileau, P., Hejazi, N. S., van der Laan, M. J., and Dudoit, S. (2021). Cross-validated loss-based covariance matrix estimator selection in high dimensions.
Boileau et al. (2021) Boileau, P., Hejazi, N. S., van der Laan, M. J., and Dudoit, S. (2021). cvCovEst: Cross-validated covariance matrix estimator selection and evaluation in R. Journal of Open Source Software, 6 (63), 3273.
Cai et al. (2010) Cai, T., Zhang, C.-H., and Zhou, H. (2010). Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38.
Clark (2008) Clark, G. (2008). Prognostic factors versus predictive factors: Examples from a clinical trial of erlotinib. Molecular oncology, 1, 406–12.
Fan and Li (2006) Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. Proc. Madrid Int. Congress of Mathematicians, 3.
Fan and Lv (2009) Fan, J. and Lv, J. (2009). A selective overview of variable selection in high dimensional feature space. Stat. Sinica, 20 1, 101–148.
Fan et al. (2013) Fan, J., Liao, Y., and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society. Series B, Statistical methodology, 75.
Faraway (2002) Faraway, J. J. (2002). Practical regression and ANOVA using R. University of Bath.
Foster et al. (2011) Foster, J., Taylor, J., and Ruberg, S. (2011). Subgroup identification from randomized clinical trial data. Statistics in medicine, 30, 2867–80.
Fourati et al. (2019) Fourati, S., Ribeiro, S., Blasco Lopes, F., Talla, A., Lefebvre, F., Cameron, M., Kaewkungwal, J., Pitisuttithum, P., Nitayaphan, S., Rerks-Ngarm, S., Kim, J., Thomas, R., Gilbert, P., Tomaras, G., Koup, R., Michael, N., McElrath, M., Gottardo, R., and Sékaly, R. (2019). Integrated systems approach defines the antiviral pathways conferring protection by the rv144 hiv vaccine. Nature Communications, 10.
Ledoit and Wolf (2020) Ledoit, O. and Wolf, M. (2020). The Power of (Non-)Linear Shrinking: A Review and Guide to Covariance Matrix Estimation. Journal of Financial Econometrics.
Lipkovich and Dmitrienko (2014) Lipkovich, I. and Dmitrienko, A. (2014). Strategies for identifying predictive biomarkers and subgroups with enhanced treatment effect in clinical trials using sides. Journal of biopharmaceutical statistics, 24, 130–53.
Lipkovich et al. (2011) Lipkovich, I., Dmitrienko, A., Denne, J., and Enas, G. (2011). Subgroup identification based on differential effect search (sides) – a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in medicine, 30, 2601–21.
Lipkovich et al. (2017) Lipkovich, I., Dmitrienko, A., and B. D’Agostino Sr., R. (2017). Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in Medicine, 36(1), 136–196.
McDonald (2009) McDonald, J. (2009). Handbook of Biological Statistics 2nd Edition. Sparky House Publishing Baltimore.
Rerks-Ngarm et al. (2009) Rerks-Ngarm, S., Pitisuttithum, P., Nitayaphan, S., Kaewkungwal, J., Chiu, J., Paris, R., Premsri, N., Namwat, C., De Souza, M., Benenson, M., Gurunathan, S., Tartaglia, J., McNeil, J., Francis, D., Stablein, D., Birx, D., Chunsuttiwat, S., Khamboonruang, C., and Kim, J. (2009). Vaccination with alvac and aidsvax to prevent hiv-1 infection in thailand. The New England journal of medicine, 361, 2209–20.
Saeys et al. (2007) Saeys, Y., Inza, I., and Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507–2517.
Sechidis et al. (2018) Sechidis, K., Papangelou, K., Metcalfe, P. D., Svensson, D., Weatherall, J., and Brown, G. (2018). Distinguishing prognostic and predictive biomarkers: an information theoretic approach. Bioinformatics, 34(19), 3365–3376.
Smith (2018) Smith, G. (2018). Step away from stepwise. J. Big Data, 5(32), 1–12.
Tian et al. (2012) Tian, L., Alizadeh, A., Gentles, A., and Tibshirani, R. (2012). A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association, 109.
Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Stat. Methodol.), 58(1), 267–288.
Tibshirani and Taylor (2011) Tibshirani, R. J. and Taylor, J. (2011). The solution path of the generalized lasso. Ann. Stat., 39(3), 1335–1371.
Wang et al. (2019) Wang, H., Lengerich, B., Aragam, B., and Xing, E. (2019). Precision lasso: Accounting for correlations and linear dependencies in high-dimensional genomic data. Bioinformatics, 35(7), 1181–1187.
Wang and Leng (2016) Wang, X. and Leng, C. (2016). High dimensional ordinary least squares projection for screening variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.), 78(3), 589–611.
Windeler (2000) Windeler, J. (2000). Prognosis - what does the clinician associate with this notion? Statistics in medicine, 19, 425–30.
Xue and Qu (2017) Xue, F. and Qu, A. (2017). Variable selection for highly correlated predictors.
Zhao and Yu (2006) Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. J. Machine Learn. Res., 7, 2541–2563.
Zhu et al. (2021) Zhu, W., Lévy-Leduc, C., and Ternès, N. (2021). A variable selection approach for highly correlated predictors in high-dimensional genomic data. Bioinformatics, 37(16), 2238–2244.
Zou and Hastie (2005) Zou, H. and Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2), 301-320.
Zou (2006) Zou, H. (2006). The adaptive lasso and its oracle properties.. Journal of the American statistical association 101(476), 1418-1429.

Supplementary material

This supplementary material provides additional numerical experiments, figures and tables for the paper: “Identification of prognostic and predictive biomarkers in high-dimensional data with PPLasso”.

	MSE	BIC
TPR(prognostic)	1.000	1.000
FPR(prognostic)	0.038	0.024
TPR(predictive)	1.000	1.000
FPR(predctive)	0.008	0.006

Table 2. TPR and FPR associated to prognostic and predictive biomarker identification with the

\lambda

chosen in Figure 3.

Estimator	Hyperparameters	Empirical risk
denseLinearShrinkEst	-	102546
sampleCovEst	-	102547
linearShrinkLWEst	-	103496
poetEst	lambda=0.1, k=2	104522
poetEst	lambda=0.2, k=2	105358
poetEst	lambda=0.1, k=1	105972
poetEst	lambda=0.2, k=1	108222
thresholdingEst	gamma=0.2	137798
thresholdingEst	gamma=0.4	186844

Table 3. Empirical risk of tested methods with different hyperparameters.