This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Identification of prognostic and predictive biomarkers in high-dimensional data with PPLasso

Wencan Zhu UMR MIA-Paris, AgroParisTech, INRAE, Université Paris-Saclay, 75005, Paris, France wencan.zhu@agroparistech.fr Céline Lévy-Leduc UMR MIA-Paris, AgroParisTech, INRAE, Université Paris-Saclay, 75005, Paris, France celine.levy-leduc@agroparistech.fr  and  Nils Ternès Biostatistics and Programming department, Sanofi R&D, 91380 Chilly Mazarin, France nils.ternes@sanofi.com
Abstract.

In clinical development, identification of prognostic and predictive biomarkers is essential to precision medicine. Prognostic biomarkers can be useful for anticipating the prognosis of individual patients, and predictive biomarkers can be used to identify patients more likely to benefit from a given treatment. Previous researches were mainly focused on clinical characteristics, and the use of genomic data in such an area is hardly studied. A new method is required to simultaneously select prognostic and predictive biomarkers in high dimensional genomic data where biomarkers are highly correlated. We propose a novel approach called PPLasso (Prognostic Predictive Lasso) integrating prognostic and predictive effects into one statistical model. PPLasso also takes into account the correlations between biomarkers that can alter the biomarker selection accuracy. Our method consists in transforming the design matrix to remove the correlations between the biomarkers before applying the generalized Lasso. In a comprehensive numerical evaluation, we show that PPLasso outperforms the Lasso type approaches on both prognostic and predictive biomarker identification in various scenarios. Finally, our method is applied to publicly available transcriptomic data from clinical trial RV144. Our method is implemented in the PPLasso R package which is available from the Comprehensive R Archive Network (CRAN).

Key words and phrases:
variable selection; highly correlated predictors; genomic data

1. Introduction

With the advancement of precision medicine, there has been an increasing interest in identifying prognostic or predictive biomarkers in clinical development. A prognostic biomarker informs about a likely clinical outcome (e.g., disease recurrence, disease progression, death) in the absence of therapy or with a standard therapy that patients are likely to receive, while a predictive biomarker is associated with a response or a lack of response to a specific therapy. Ballman (2015) and Clark (2008) provided a comprehensive explanation and concrete examples to distinguish prognostic from predictive biomarkers, respectively.

Concerning the biomarker selection, the high dimensionality of genomic data is one of the main challenges as explained in Fan and Li (2006). To identify effective biomarkers in high-dimensional settings, several approaches can be considered including hypothesis-based tests described in McDonald (2009), wrapper approaches proposed in Saeys et al. (2007), and penalized approaches such as Lasso designed by Tibshirani (1996) among others. Hypothesis-based tests consider each biomarker independently and thus ignore potential correlations between them. Wrapper approaches often show high risk of overfitting and are computationally expensive for high-dimensional data as explained in Smith (2018). More efforts have been devoted to penalized methods given their ability to automatically perform variable selection and coefficient estimation simultaneously as highlighted in Fan and Lv (2009). However, Lasso showed some potential drawbacks when biomarkers are highly correlated. Particularly, when the Irrepresentable Condition (IC) proposed by Zhao and Yu (2006) is violated, Lasso can not guarantee to correctly identify true effective biomarkers. In genomic data, biomarkers are usually highly correlated such that this condition can hardly be satisfied, see Wang et al. (2019). Several methods have been proposed to adress this issue. Elastic Net (Zou and Hastie, 2005) combines the 1\ell_{1} and 2\ell_{2} penalties and is particularly effective in tackling correlation issues and can generally outperform Lasso. Adaptive Lasso (Zou, 2006) proposes to assign adaptive weights for penalizing different coefficients in the 1\ell_{1} penalty, and its oracle property was demonstrated. Wang and Leng (2016) proposed the HOLP approach which consists in removing the correlation between the columns of the design matrix; Wang et al. (2019) proposed to handle the correlation by assigning similar weights to correlated variables in their approach called Precision Lasso; Zhu et al. (2021) proposed to remove the correlations by applying a whitening transformation to the data before using the generalized Lasso criterion designed by Tibshirani and Taylor (2011).

The challenge of finding prognostic biomarkers has been extensively explored with previously introduced methods, however, the discovery of predictive biomarkers has seen much less attention. Limited to binary endpoint, Foster et al. (2011) proposed to first predict response probabilities for treatment and use this probability as the response in a classification problem to find effective biomarkers. Tian et al. (2012) proposed a new method to detect interaction between the treatment and the biomarkers by modifying the covariates. This method can be implemented on continuous/binary/time-to-event endpoint. Lipkovich et al. (2011) proposed a method called SIDES, which adopts a recursive partitioning algorithm for screening treatment-by-biomarker interactions. This method was further improved in Lipkovich and Dmitrienko (2014) by adding another step of preselection on predictive biomarkers based on variable importance. The method was demonstrated with continuous endpoint. More recently, Sechidis et al. (2018) applied approaches coming from information theory for ranking biomarkers on their prognostic/predictive strength. Their method is applicable only for binary or time-to-event endpoint. Moreover, all of these methods were assessed under the situation where the sample size is relatively large and the number of biomarkers is limited, which is hardly the case for genomic data.

In the literature mentioned above, the authors focused on one of the problematic of identifying prognostic or predictive biomarkers, but rarely on both. Even if predictive biomarkers is of major importance for identifying patients more likely to benefit from a treatment, the identification of prognostic biomarkers is also key in this context. Indeed, the clinical impact of a treatment can be judged only with the knowledge of the prognosis of a patient. It is thus of importance to reliably predict the prognosis of patients to assist treatment counseling (Windeler, 2000). In this paper, we propose a novel approach called PPLasso (Prognostic Predictive Lasso) to simultaneously identify prognostic and predictive biomarkers in a high dimensional setting with continuous endpoints, as presented in Section 2. Extensive numerical experiments are given in Section 3 to assess the performance of our approach and to compare it to other methods. PPLasso is also applied to the clinical trial RV144 in Section 4. Finally, we give concluding remarks in Section 5.

2. Methods

In this section, we propose a novel approach called PPLasso (Predictive Prognostic Lasso) which consists in writing the identification of predictive and prognostic biomarkers as a variable selection problem in an ANCOVA (Analysis of Covariance) type model mentioned for instance in Faraway (2002).

2.1. Statistical modeling

Let 𝐲\mathbf{y} be a continuous response or endpoint and t1t_{1}, t2t_{2} two treatments. Let also 𝐗1\mathbf{X}_{1} (resp. 𝐗2\mathbf{X}_{2}) denote the design matrix for the n1n_{1} (resp. n2n_{2}) patients with treatment t1t_{1} (resp. t2t_{2}), each containing measurements on pp candidate biomarkers:

(1) 𝐗1=[X111X112X11pX121X122X12pX1n11X1n12X1n1p],𝐗2=[X211X212X21pX221X222X22pX2n21X2n22X2n2p].\mathbf{X}_{1}=\begin{bmatrix}X_{11}^{1}&X_{11}^{2}&\ldots&X_{11}^{p}\\ X_{12}^{1}&X_{12}^{2}&\ldots&X_{12}^{p}\\ ...\\ X_{1n_{1}}^{1}&X_{1n_{1}}^{2}&\ldots&X_{1n_{1}}^{p}\end{bmatrix},\mathbf{X}_{2}=\begin{bmatrix}X_{21}^{1}&X_{21}^{2}&\ldots&X_{21}^{p}\\ X_{22}^{1}&X_{22}^{2}&\ldots&X_{22}^{p}\\ ...\\ X_{2n_{2}}^{1}&X_{2n_{2}}^{2}&\ldots&X_{2n_{2}}^{p}\end{bmatrix}.

To take into account the potential correlation that may exist between the biomarkers in the different treatments, we shall assume that the rows of 𝐗1\mathbf{X}_{1} (resp. 𝐗2\mathbf{X}_{2}) are independent centered Gaussian random vectors with a covariance matrice equal to 𝚺𝟏\boldsymbol{\Sigma_{1}} (resp. 𝚺𝟐\boldsymbol{\Sigma_{2}}).

To model the link that exists between 𝐲\mathbf{y} and the different types of biomarkers we propose using the following model:

(2) 𝐲=(y11y12y1n1y21y22y2n2)=𝐗(α1α2β11β12β1pβ21β22β2p)+(ϵ11ϵ12ϵ1n1ϵ21ϵ22ϵ2n2),\mathbf{y}=\begin{pmatrix}y_{11}\\ y_{12}\\ \vdots\\ y_{1n_{1}}\\ y_{21}\\ y_{22}\\ \vdots\\ y_{2n_{2}}\end{pmatrix}=\mathbf{X}\begin{pmatrix}\alpha_{1}\\ \alpha_{2}\\ \beta_{11}\\ \beta_{12}\\ \vdots\\ \beta_{1p}\\ \beta_{21}\\ \beta_{22}\\ \vdots\\ \beta_{2p}\end{pmatrix}+\begin{pmatrix}\epsilon_{11}\\ \epsilon_{12}\\ \vdots\\ \epsilon_{1n_{1}}\\ \epsilon_{21}\\ \epsilon_{22}\\ \vdots\\ \epsilon_{2n_{2}}\end{pmatrix},

where (yi1,,yini)(y_{i1},\dots,y_{in_{i}}) corresponds to the response of patients with treatment tit_{i}, ii being equal to 1 or 2,

𝐗=[10X111X112X11p00010X121X122X12p00010X1n11X1n12X1n1p00001000X211X212X21p01000X221X222X22p01000X2n21X2n22X2n2p],\mathbf{X}=\begin{bmatrix}1&0&X_{11}^{1}&X_{11}^{2}&\ldots&X_{11}^{p}&0&0&\ldots&0\\ 1&0&X_{12}^{1}&X_{12}^{2}&\ldots&X_{12}^{p}&0&0&\ldots&0\\ \vdots&\vdots&\vdots&\vdots&&\vdots&&&\\ 1&0&X_{1n_{1}}^{1}&X_{1n_{1}}^{2}&\ldots&X_{1n_{1}}^{p}&0&0&\ldots&0\\ 0&1&0&0&\ldots&0&X_{21}^{1}&X_{21}^{2}&\ldots&X_{21}^{p}\\ 0&1&0&0&\ldots&0&X_{22}^{1}&X_{22}^{2}&\ldots&X_{22}^{p}\\ \vdots&\vdots&\vdots&\vdots&&\vdots&\vdots&\vdots&&\vdots\\ 0&1&0&0&\ldots&0&X_{2n_{2}}^{1}&X_{2n_{2}}^{2}&\ldots&X_{2n_{2}}^{p}\end{bmatrix},

with α1\alpha_{1} (resp. α2\alpha_{2}) corresponding to the effects of treatment t1t_{1} (resp. t2t_{2}). Moreover, 𝜷1=(β11,β12,,β1p)\boldsymbol{\beta}_{1}=(\beta_{11},\beta_{12},\ldots,\beta_{1p})^{\prime} (resp. 𝜷2=(β21,β22,,β2p)\boldsymbol{\beta}_{2}=(\beta_{21},\beta_{22},\ldots,\beta_{2p})^{\prime}) are the coefficients associated to each of the pp biomarkers in treatment t1t_{1} (resp. t2t_{2}) group, denoting the matrix transposition and ϵ11,,ϵ2n2\epsilon_{11},\dots,\epsilon_{2n_{2}} are standard independent Gaussian random variables independent of 𝐗1\mathbf{X}_{1} and 𝐗2\mathbf{X}_{2}. When t1t_{1} stands for the standard treatment or placebo, prognostic biomarkers are defined as those having non-zero coefficients in 𝜷1\boldsymbol{\beta}_{1}. According to the definition of prognostic biomarkers, their effect should indeed be demonstrated in the absence of therapy or with a standard therapy that patients are likely to receive. On the other hand, predictive biomarkers are defined as those having non-zero coefficients in 𝜷2𝜷1\boldsymbol{\beta}_{2}-\boldsymbol{\beta}_{1} because they aim to highlight different effects between two different treatments.

Model (2) can be written as:

(3) 𝐲=𝐗𝜸+ϵ,\mathbf{y}=\mathbf{X}\boldsymbol{\gamma}+\boldsymbol{\epsilon},

with 𝜸=(α1,α2,𝜷1,𝜷2)\boldsymbol{\gamma}=(\alpha_{1},\alpha_{2},\boldsymbol{\beta}_{1}^{\prime},\boldsymbol{\beta}_{2}^{\prime})^{\prime}. The Lasso penalty is a well-known approach to estimate coefficients with a sparsity enforcing constraint allowing variable selection by estimating some coefficients by zero. It consists in minimizing the following penalized least-squares criterion (Tibshirani (1996)):

(4) 12𝐲𝐗𝜸22+λ𝜸1,\frac{1}{2}\left\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\gamma}\right\rVert_{2}^{2}+\lambda\left\lVert\boldsymbol{\gamma}\right\rVert_{1},

where u22=i=1nui2\left\lVert\textbf{u}\right\rVert_{2}^{2}=\sum_{i=1}^{n}u_{i}^{2} and u1=i=1n|ui|\left\lVert\textbf{u}\right\rVert_{1}=\sum_{i=1}^{n}|u_{i}| for u=(u1,,un)\textbf{u}=(u_{1},\dots,u_{n}). A different sparsity constraint was applied to 𝜷1\boldsymbol{\beta}_{1} and 𝜷2𝜷1\boldsymbol{\beta}_{2}-\boldsymbol{\beta}_{1} to allow different sparsity levels. Hence we propose to replace the penalty λ𝜸1\lambda\left\lVert\boldsymbol{\gamma}\right\rVert_{1} in (4) by

(5) λ1𝜷11+λ2𝜷2𝜷11.\lambda_{1}\left\lVert\boldsymbol{\beta}_{1}\right\rVert_{1}+\lambda_{2}\left\lVert\boldsymbol{\beta}_{2}-\boldsymbol{\beta}_{1}\right\rVert_{1}.

Thus, a first estimator of 𝜸\boldsymbol{\gamma} could be found by minimizing the following criterion with respect to 𝜸\boldsymbol{\gamma}:

(6) 12𝐲𝐗𝜸22+λ1[𝟎p,1𝟎p,1D1𝟎p,1𝟎p,1λ2λ1D2]𝜸1,\frac{1}{2}\left\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\gamma}\right\rVert_{2}^{2}+\lambda_{1}\left\lVert\begin{bmatrix}\mathbf{0}_{p,1}&\mathbf{0}_{p,1}&D_{1}\\ \mathbf{0}_{p,1}&\mathbf{0}_{p,1}&\frac{\lambda_{2}}{\lambda_{1}}D_{2}\end{bmatrix}\boldsymbol{\gamma}\right\rVert_{1},

where D1=[Idp,𝟎p,p]D_{1}=[\mathbf{\textrm{Id}}_{p},\mathbf{0}_{p,p}] and D2=[Idp,Idp]D_{2}=[-\mathbf{\textrm{Id}}_{p},\mathbf{\textrm{Id}}_{p}], with Idp\mathbf{\textrm{Id}}_{p} denoting the identity matrix of size pp and 𝟎i,j\mathbf{0}_{i,j} denoting a matrix having ii rows and jj columns and containing only zeros. However, since the inconsistency of Lasso biomarker selection is originated from the correlations between the biomarkers, we propose to remove the correlation by “whitening” the matrix 𝐗\mathbf{X}. More precisely, we consider 𝐗~=𝐗𝚺1/2\widetilde{\mathbf{X}}=\mathbf{X}\boldsymbol{\Sigma}^{-1/2}, where

(7) 𝚺=[1000010000𝚺10000𝚺2]\boldsymbol{\Sigma}=\begin{bmatrix}1&0&0&0\\ 0&1&0&0\\ 0&0&\boldsymbol{\Sigma}_{1}&0\\ 0&0&0&\boldsymbol{\Sigma}_{2}\end{bmatrix}

and define 𝚺1/2\boldsymbol{\Sigma}^{-1/2} by replacing in (7) 𝚺i\boldsymbol{\Sigma}_{i} by 𝚺i1/2\boldsymbol{\Sigma}_{i}^{-1/2}, where 𝚺i1/2=𝐔i𝐃i1/2𝐔iT\boldsymbol{\Sigma}_{i}^{-1/2}=\mathbf{U}_{i}\mathbf{D}_{i}^{-1/2}\mathbf{U}_{i}^{T}, 𝐔i\mathbf{U}_{i} and 𝐃i\mathbf{D}_{i} being the matrices involved in the spectral decomposition of 𝚺i\boldsymbol{\Sigma}_{i} for i=1i=1 or 2. With such a transformation the columns of 𝐗~\widetilde{\mathbf{X}} are decorrelated and Model (3) can be rewritten as follows:

(8) 𝐲=𝐗~𝜸~+ϵ\mathbf{y}=\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\gamma}}+\boldsymbol{\epsilon}

where 𝜸~=𝚺1/2𝜸\widetilde{\boldsymbol{\gamma}}=\boldsymbol{\Sigma}^{1/2}\boldsymbol{\gamma}. The objective function (6) thus becomes:

(9) Lλ1,λ2PPLasso(𝜸~)=12𝐲𝐗~𝜸~22+λ1[𝟎p,1𝟎p,1D1𝟎p,1𝟎p,1λ2λ1D2]𝚺1/2𝜸~1.L_{\lambda_{1},\lambda_{2}}^{\textrm{PPLasso}}(\widetilde{\boldsymbol{\gamma}})=\frac{1}{2}\left\lVert\mathbf{y}-\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\gamma}}\right\rVert_{2}^{2}+\lambda_{1}\left\lVert\begin{bmatrix}\mathbf{0}_{p,1}&\mathbf{0}_{p,1}&D_{1}\\ \mathbf{0}_{p,1}&\mathbf{0}_{p,1}&\frac{\lambda_{2}}{\lambda_{1}}D_{2}\end{bmatrix}\boldsymbol{\Sigma}^{-1/2}\widetilde{\boldsymbol{\gamma}}\right\rVert_{1}.

2.2. Estimation of 𝜸~\widetilde{\boldsymbol{\gamma}}

Let us define a first estimator of 𝜸~=(α~1,α~2,𝜷~1,𝜷~2)\widetilde{\boldsymbol{\gamma}}=(\widetilde{\alpha}_{1},\widetilde{\alpha}_{2},\widetilde{\boldsymbol{\beta}}_{1}^{\prime},\widetilde{\boldsymbol{\beta}}_{2}^{\prime}) as follows:

(10) 𝜸~^0(λ1,λ2)=(α~^1,α~^2,𝜷~^10,𝜷~^20)=Argmin𝜸~Lλ1,λ2PPLasso(𝜸~),\widehat{\widetilde{\boldsymbol{\gamma}}}_{0}(\lambda_{1},\lambda_{2})=(\widehat{\widetilde{\alpha}}_{1},\widehat{\widetilde{\alpha}}_{2},\widehat{\widetilde{\boldsymbol{\beta}}}_{10}^{\prime},\widehat{\widetilde{\boldsymbol{\beta}}}_{20}^{\prime})=\mathop{\rm Argmin}\nolimits_{\widetilde{\boldsymbol{\gamma}}}L_{\lambda_{1},\lambda_{2}}^{\textrm{PPLasso}}(\widetilde{\boldsymbol{\gamma}}),

for each fixed λ1\lambda_{1} and λ2\lambda_{2}. To better estimate 𝜷~1\widetilde{\boldsymbol{\beta}}_{1} and 𝜷~2\widetilde{\boldsymbol{\beta}}_{2}, a thresholding was applied to 𝜷~^0(λ1,λ2)=(𝜷~^10(λ1,λ2),𝜷~^20(λ1,λ2))\widehat{\widetilde{\boldsymbol{\beta}}}_{0}(\lambda_{1},\lambda_{2})=(\widehat{\widetilde{\boldsymbol{\beta}}}_{10}(\lambda_{1},\lambda_{2})^{\prime},\widehat{\widetilde{\boldsymbol{\beta}}}_{20}(\lambda_{1},\lambda_{2})^{\prime})^{\prime}. For K1K_{1} (resp. K2K_{2}) in {1,,p}\{1,\ldots,p\}, let TopK1\textrm{Top}_{K_{1}} (resp. TopK2\textrm{Top}_{K_{2}}) be the set of indices corresponding to the K1K_{1} (resp. K2K_{2}) largest values of the components of |𝜷~^10(λ1,λ2)||\widehat{\widetilde{\boldsymbol{\beta}}}_{10}(\lambda_{1},\lambda_{2})| (resp. |𝜷~^20(λ1,λ2)||\widehat{\widetilde{\boldsymbol{\beta}}}_{20}(\lambda_{1},\lambda_{2})|), then the estimator of 𝜷~=(𝜷~1,𝜷~2)\widetilde{\boldsymbol{\beta}}=(\widetilde{\boldsymbol{\beta}}_{1}^{\prime},\widetilde{\boldsymbol{\beta}}_{2}^{\prime}) after the correction is denoted by 𝜷~^(λ1,λ2)=(𝜷~^1(K^1)(λ1,λ2),𝜷~^2(K^2)(λ1,λ2))\widehat{\widetilde{\boldsymbol{\beta}}}(\lambda_{1},\lambda_{2})=(\widehat{\widetilde{\boldsymbol{\beta}}}_{1}^{(\widehat{K}_{1})}(\lambda_{1},\lambda_{2}),\widehat{\widetilde{\boldsymbol{\beta}}}_{2}^{(\widehat{K}_{2})}(\lambda_{1},\lambda_{2})) where the jjth component of 𝜷~^i(Ki)(λ1,λ2)\widehat{\widetilde{\boldsymbol{\beta}}}_{i}^{(K_{i})}(\lambda_{1},\lambda_{2}), for i=1i=1 or 2, is defined by:

(11) 𝜷~^ij(Ki)(λ1,λ2)={𝜷~^i0j(λ1,λ2),jTopKiK1th largest value of |𝜷~^i0j(λ1,λ2)|,jTopKi.\widehat{\widetilde{\boldsymbol{\beta}}}_{ij}^{(K_{i})}(\lambda_{1},\lambda_{2})=\begin{cases}\widehat{\widetilde{\boldsymbol{\beta}}}_{i0j}(\lambda_{1},\lambda_{2}),&j\in\textrm{Top}_{K_{i}}\\ \textrm{$K_{1}$th largest value of }|\widehat{\widetilde{\boldsymbol{\beta}}}_{i0j}(\lambda_{1},\lambda_{2})|,&j\not\in\textrm{Top}_{K_{i}}.\end{cases}

Note that the corrections are only performed on 𝜷~^0\widehat{\widetilde{\boldsymbol{\beta}}}_{0}, the estimators α~^1\widehat{\widetilde{\alpha}}_{1} and α~^2\widehat{\widetilde{\alpha}}_{2} were not modified. The choice of K1K_{1} and K2K_{2} will be explained in Section 2.4.

To illustrate the interest of using a thresholding step, we generated a dataset based on Model 3 with parameters described in Section 3.1 and p=500p=500. Moreover, to simplify the graphical illustrations, we focus on the case where λ1=λ2=λ\lambda_{1}=\lambda_{2}=\lambda. Figure 1 displays the estimation error associated to the estimators of 𝜷~(λ)\widetilde{\boldsymbol{\beta}}(\lambda) before and after the thresholding. We can see from this figure that the estimation of 𝜷~(λ)\widetilde{\boldsymbol{\beta}}(\lambda) is less biased after the correction. Moreover, we observed that this thresholding strongly improves the final estimation of 𝜸\boldsymbol{\gamma} and the variable selection performance of our method.

Refer to caption
Figure 1. Estimation error 𝜷~^0(λ)𝜷~2\left\lVert\widehat{\widetilde{\boldsymbol{\beta}}}_{0}(\lambda)-\widetilde{\boldsymbol{\beta}}\right\rVert_{2} (gray) and 𝜷~^(λ)𝜷~2\left\lVert\widehat{\widetilde{\boldsymbol{\beta}}}(\lambda)-\widetilde{\boldsymbol{\beta}}\right\rVert_{2} (red) for all λ\lambda.

2.3. Estimation of 𝜸\boldsymbol{\gamma}

With 𝜷~^=(𝜷~^1,𝜷~^2)\widehat{\widetilde{\boldsymbol{\beta}}}=(\widehat{\widetilde{\boldsymbol{\beta}}}_{1}^{\prime},\widehat{\widetilde{\boldsymbol{\beta}}}_{2}^{\prime}), the estimators of 𝜷1\boldsymbol{\beta}_{1} and 𝜷2𝜷1\boldsymbol{\beta}_{2}-\boldsymbol{\beta}_{1} can be obtained by 𝜷^10=𝚺11/2𝜷~^1\widehat{\boldsymbol{\beta}}_{10}=\boldsymbol{\Sigma}_{1}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}}_{1} and (𝜷^20𝜷^10)=𝚺21/2𝜷~^2𝚺11/2𝜷~^1(\widehat{\boldsymbol{\beta}}_{20}-\widehat{\boldsymbol{\beta}}_{10})=\boldsymbol{\Sigma}_{2}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}}_{2}-\boldsymbol{\Sigma}_{1}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}}_{1}. As previously, another thresholding was applied to 𝜷^10\widehat{\boldsymbol{\beta}}_{10} and 𝜷^20\widehat{\boldsymbol{\beta}}_{20}: for i=1i=1 or 2,

(12) 𝜷^ij(Mi)(λ1,λ2)={𝜷^i0j(λ1,λ2),jTopMi0,jTopMi,\widehat{\boldsymbol{\beta}}_{ij}^{(M_{i})}(\lambda_{1},\lambda_{2})=\begin{cases}\widehat{\boldsymbol{\beta}}_{i0j}(\lambda_{1},\lambda_{2}),&j\in\textrm{Top}_{M_{i}}\\ 0,&j\not\in\textrm{Top}_{M_{i}},\end{cases}

for each fixed λ1\lambda_{1} and λ2\lambda_{2}. The biomarkers with non-zero coefficients in 𝜷^1=𝜷^1(M1)\widehat{\boldsymbol{\beta}}_{1}=\widehat{\boldsymbol{\beta}}_{1}^{(M_{1})} (resp. 𝜷^2(M2)𝜷^1(M1)\widehat{\boldsymbol{\beta}}_{2}^{(M_{2})}-\widehat{\boldsymbol{\beta}}_{1}^{(M_{1})}) are considered as prognostic (resp. predictive) biomarkers, where the choice of M1M_{1} and M2M_{2} is explained in Section 2.4.

To illustrate the benefits of using an additional thresholding step, we used the dataset described in Section 2.2. Moreover, to simplify the graphical illustrations, we also focus on the case where λ1=λ2=λ\lambda_{1}=\lambda_{2}=\lambda. Figure 8 in the Supplementary material displays the number of True Positive (TP) and False Positive (FP) in prognostic and predictive biomarker identification with and without the second thresholding. We can see from this figure that the thresholding stage limits the number of false positives. Note that α1\alpha_{1} and α2\alpha_{2} are estimated by α~^1\widehat{\widetilde{\alpha}}_{1} and α~^2\widehat{\widetilde{\alpha}}_{2} defined in (10).

2.4. Choice of the parameters K1,K2K_{1},K_{2}, M1M_{1} and M2M_{2}

For each (λ1,λ2)(\lambda_{1},\lambda_{2}) and each K1K_{1}, we computed:

(13) MSE~K1,K2(λ1,λ2)=𝐲𝐗~𝜸~^(K1,K2)(λ1,λ2)22,\widetilde{\textrm{MSE}}_{K_{1},K_{2}}(\lambda_{1},\lambda_{2})=\|\mathbf{y}-\widetilde{\mathbf{X}}\widehat{\widetilde{\boldsymbol{\gamma}}}^{(K1,K2)}(\lambda_{1},\lambda_{2})\|_{2}^{2},

where 𝜸~^(K1,K2)(λ1,λ2)=(α~^1,α~^2,𝜷~^1(K1),𝜷~^2(K2))\widehat{\widetilde{\boldsymbol{\gamma}}}^{(K1,K2)}(\lambda_{1},\lambda_{2})=(\widehat{\widetilde{\alpha}}_{1},\widehat{\widetilde{\alpha}}_{2},\widehat{\widetilde{\boldsymbol{\beta}}}_{1}^{(K_{1})^{\prime}},\widehat{\widetilde{\boldsymbol{\beta}}}_{2}^{(K_{2})^{\prime}}) defined in (10) and in (11). It is displayed in the left part of Figure 2.

Refer to caption
Figure 2. Illustration of how to choose K1K_{1} and K2K_{2} (δ=0.95\delta=0.95), final choice is marked with ’*’.

For each λ1\lambda_{1}, λ2\lambda_{2} and a given δ(0,1)\delta\in(0,1), the parameter K2^\widehat{K_{2}} is then chosen as follows for each K1K_{1}:

K2^(λ1,λ2)=Argmin{K21 s.t. MSE~(K1,K2+1)(λ1,λ2)MSE~(K1,K2)(λ1,λ2)δ}.\widehat{K_{2}}(\lambda_{1},\lambda_{2})=\mathop{\rm Argmin}\nolimits\left\{K_{2}\geq 1\textrm{ s.t. }\frac{\widetilde{\textrm{MSE}}_{(K_{1},K_{2}+1})(\lambda_{1},\lambda_{2})}{\widetilde{\textrm{MSE}}_{(K_{1},K_{2})}(\lambda_{1},\lambda_{2})}\geq\delta\right\}.

The K^2\widehat{K}_{2} associated to each K1K_{1} are displayed with ’*’ in the left part of Figure 2. Then K1^\widehat{K_{1}} is chosen by using a similar criterion:

K1^(λ1,λ2)=Argmin{K11 s.t. MSE~(K1+1,K^2)(λ1,λ2)MSE~(K1,K^2)(λ1,λ2)δ}.\widehat{K_{1}}(\lambda_{1},\lambda_{2})=\mathop{\rm Argmin}\nolimits\left\{K_{1}\geq 1\textrm{ s.t. }\frac{\widetilde{\textrm{MSE}}_{(K_{1}+1,\widehat{K}_{2}})(\lambda_{1},\lambda_{2})}{\widetilde{\textrm{MSE}}_{(K_{1},\widehat{K}_{2})}(\lambda_{1},\lambda_{2})}\geq\delta\right\}.

The values of MSE~(K1,K^2)(λ1,λ2)\widetilde{\textrm{MSE}}_{(K_{1},\widehat{K}_{2})}(\lambda_{1},\lambda_{2}) are displayed in the right part of Figure 2 in the particular case where λ1=λ2=λ\lambda_{1}=\lambda_{2}=\lambda, δ=0.95\delta=0.95 and with the same dataset as the one used in Section 2.2. K^1\widehat{K}_{1} is displayed with a red star. This value of δ\delta will be used in the following sections. However, choosing δ\delta in the range (0.9,0.99) does not have a strong impact on the variable selection performance of our approach.

The parameters M^1\widehat{M}_{1} and M^2\widehat{M}_{2} are chosen in a similar way except that MSE~K1,K2(λ1,λ2)\widetilde{\textrm{MSE}}_{K_{1},K_{2}}(\lambda_{1},\lambda_{2}) is replaced by MSE^M1,M2(λ1,λ2)\widehat{\textrm{MSE}}_{M_{1},M_{2}}(\lambda_{1},\lambda_{2}) where:

MSE^M1,M2(λ1,λ2)=𝐲𝐗𝜸^(M1,M2)(λ1,λ2)22,\widehat{\textrm{MSE}}_{M_{1},M_{2}}(\lambda_{1},\lambda_{2})=\|\mathbf{y}-\mathbf{X}\widehat{\boldsymbol{\gamma}}^{(M_{1},M_{2})}(\lambda_{1},\lambda_{2})\|_{2}^{2},

with 𝜸^(M1,M2)(λ1,λ2)=(α~^1,α~^2,𝜷^1(M1),𝜷^2(M2))\widehat{\boldsymbol{\gamma}}^{(M_{1},M_{2})}(\lambda_{1},\lambda_{2})=(\widehat{\widetilde{\alpha}}_{1},\widehat{\widetilde{\alpha}}_{2},\widehat{\boldsymbol{\beta}}_{1}^{(M_{1})^{\prime}},\widehat{\boldsymbol{\beta}}_{2}^{(M_{2})^{\prime}}) defined in (10) and (12). In the following, 𝜸^(λ1,λ2)=𝜸^(M^1,M^2)(λ1,λ2)\widehat{\boldsymbol{\gamma}}(\lambda_{1},\lambda_{2})=\widehat{\boldsymbol{\gamma}}^{(\widehat{M}_{1},\widehat{M}_{2})}(\lambda_{1},\lambda_{2}).

2.5. Estimation of 𝚺1\boldsymbol{\Sigma}_{1} and 𝚺2\boldsymbol{\Sigma}_{2}

As the empirical correlation matrix is known to be a non accurate estimator of 𝚺\boldsymbol{\Sigma} when pp is larger than nn, a new estimator has to be used. Thus, for estimating 𝚺\boldsymbol{\Sigma} we adopted a cross-validation based method designed by Boileau et al. (2021) and implemented in the cvCovEst R package (Boileau et al., 2021). This method chooses the estimator having the smallest estimation error among several compared methods (sample correlation matrix, POET (Fan et al. (2013)) and Tapering (Cai et al. (2010)) as examples). Since the samples in treatments t1t_{1} and t2t_{2} are assumed to be collected from the same population, 𝚺1\boldsymbol{\Sigma}_{1} and 𝚺2\boldsymbol{\Sigma}_{2} are assumed to be equal.

2.6. Choice of the parameters λ1\lambda_{1} and λ2\lambda_{2}

For the sake of simplicity, we limit ourselves to the case where λ1=λ2=λ\lambda_{1}=\lambda_{2}=\lambda. For choosing λ\lambda we used BIC (Bayesian Information Criterion) which is widely used in the variable selection field and which consists in minimizing the following criterion with respect to λ\lambda:

BIC(λ)=nlog(MSE(λ)/n)+k(λ)log(n),\textrm{BIC}(\lambda)=n\log(\textrm{MSE}(\lambda)/n)+k(\lambda)\log(n),

where nn is the total number of samples, MSE(λ)=𝐲𝐗𝜸^(λ)22\textrm{MSE}(\lambda)=\|\mathbf{y}-\mathbf{X}\widehat{\boldsymbol{\gamma}}(\lambda)\|_{2}^{2} and k(λ)k(\lambda) is the number of non null coefficients in the OLS estimator 𝜸^\widehat{\boldsymbol{\gamma}} obtained by re-estimating only the non null components of 𝜷^1\widehat{\boldsymbol{\beta}}_{1} and 𝜷^2𝜷^1\widehat{\boldsymbol{\beta}}_{2}-\widehat{\boldsymbol{\beta}}_{1}. The values of the BIC criterion as well as those of the MSE obtained from the dataset described in Section 2.2 are displayed in Figure 3.

Refer to caption
Figure 3. MSE and BIC for all λ\lambda. The λ\lambda minimizing each criterion is displayed with a vertical line.

Table 2 in the supplementary material provides the True Positive Rate (TPR) and False Positive Rate (FPR) when λ\lambda is chosen either by minimizing the MSE or the BIC criterion for this dataset. We can see from this table that both of them have TPR=1 (all true positives are identified). However, the FPR based on the BIC criterion is smaller than the one obtained by using the MSE.

3. Numerical experiments

This section presents a comprehensive numerical study by comparing the performance of our method with other regularized approaches in terms of prognostic and predictive biomarker selection. Besides the Lasso, we also compared with Elastic Net and Adaptive Lasso since they also take into account the correlations. For Lasso, Elastic Net and Adaptive Lasso, in order to directly estimate prognostic and predictive effects, 𝐗\mathbf{X} and 𝜸\boldsymbol{\gamma} in Model (3) were replaced by

𝐗=[𝟏n1,1𝟎n1,1𝐗1𝟎n1,p𝟎n2,1𝟏n2,1𝐗2𝐗2],\mathbf{X}^{*}=\begin{bmatrix}\mathbf{1}_{n_{1},1}&\mathbf{0}_{n_{1},1}&\mathbf{X}_{1}&\mathbf{0}_{n_{1},p}\\ \mathbf{0}_{n_{2},1}&\mathbf{1}_{n_{2},1}&\mathbf{X}_{2}&\mathbf{X}_{2}\end{bmatrix},

and 𝜸=(α1,α2,𝜷1,𝜷2)\boldsymbol{\gamma}^{*}=(\alpha_{1},\alpha_{2},\boldsymbol{\beta}_{1}^{*},\boldsymbol{\beta}_{2}^{*}), respectively, where 𝐗1\mathbf{X}_{1} and 𝐗2\mathbf{X}_{2} are defined in (1), 𝟎i,j\mathbf{0}_{i,j} (resp. 𝟏i,j\mathbf{1}_{i,j}) denotes a matrix having ii rows and jj columns and containing only zeros (resp. ones). Note that this is the modeling proposed by Lipkovich et al. (2017). The sparsity enforcing constraint was put on the coefficients 𝜷1\boldsymbol{\beta}_{1}^{*} and 𝜷2\boldsymbol{\beta}_{2}^{*} which boils down to putting a sparsity enforcing constraint on 𝜷1\boldsymbol{\beta}_{1} and 𝜷2𝜷1\boldsymbol{\beta}_{2}-\boldsymbol{\beta}_{1}.

3.1. Simulation setting

All simulated datasets were generated from Model (3) where the n1n_{1} (n2n_{2}) rows of 𝐗1\mathbf{X}_{1} (𝐗2\mathbf{X}_{2}) are assumed to be independent Gaussian random vectors with a covariance matrix 𝚺1=𝚺2=𝚺bm\boldsymbol{\Sigma}_{1}=\boldsymbol{\Sigma}_{2}=\boldsymbol{\Sigma}_{bm}, and ϵ\boldsymbol{\epsilon} is a standard Gaussian random vector independent of 𝐗1\mathbf{X}_{1} and 𝐗2\mathbf{X}_{2}. We defined 𝚺bm\boldsymbol{\Sigma}_{bm} as:

(14) 𝚺bm=[𝚺11𝚺12𝚺12T𝚺22]\boldsymbol{\Sigma}_{bm}=\begin{bmatrix}\boldsymbol{\Sigma}_{11}&\boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{12}^{T}&\boldsymbol{\Sigma}_{22}\end{bmatrix}

where 𝚺11\boldsymbol{\Sigma}_{11} (resp. 𝚺22\boldsymbol{\Sigma}_{22}) are the correlation matrix of prognostic (resp. non-prognostic) biomarkers with off-diagonal entries equal to a1a_{1} (resp. a3a_{3}). Morever, 𝚺12\boldsymbol{\Sigma}_{12} is the correlation matrix between prognostic and non-prognostic variables with entries equal to a2a_{2}. In our simulations (a1,a2,a3)=(0.3,0.5,0.7)(a_{1},a_{2},a_{3})=(0.3,0.5,0.7), which is a framework proposed by Xue and Qu (2017). We checked that the Irrepresentable Condition (IC) of Zhao and Yu (2006) is violated and thus the standard Lasso cannot recover the positions of the null and non null variables. For each dataset we assumed randomized treatment allocation between standard and experimental arm with a 1:1 ratio, i.e. n1=n2=50n_{1}=n_{2}=50. We further assume a relative treatment effect of 1 (α1=0\alpha_{1}=0 and α2=1\alpha_{2}=1). The number of biomarkers pp varies from 200 to 2000. The number of active biomarkers was set to 10 (i.e. 5 purely prognostic biomarkers with 𝜷1j=𝜷2j=b1=1\boldsymbol{\beta}_{1j}=\boldsymbol{\beta}_{2j}=b_{1}=1 (j=1,,5)(j=1,...,5) and 5 biomarkers both prognostic and predictive with 𝜷1j=b1\boldsymbol{\beta}_{1j}=b_{1} and 𝜷2j=b2=2\boldsymbol{\beta}_{2j}=b_{2}=2 (j=6,,10)(j=6,...,10)).

3.2. Evaluation criteria

We considered several evaluation criteria to assess the performance of the methods in selecting the prognostic and predictive biomarkers: the TPRprog\textrm{TPR}_{\textrm{prog}} as the true positive rate (i.e. rate of active biomarkers selected) and FPRprog\textrm{FPR}_{\textrm{prog}} the false positive rate (i.e. rate of inactive biomarkers selected) of the selection of prognostic biomarkers, and similarly for predictive biomarkers with TPRpred\textrm{TPR}_{\textrm{pred}} and FPRpred\textrm{FPR}_{\textrm{pred}}. We further note TPRall\textrm{TPR}_{\textrm{all}} and FPRall\textrm{FPR}_{\textrm{all}} the criterion of overall selection among all candidate biomarkers regardless their prognostic or predictive effect. The objective of the selection is to maximize the TPRall\textrm{TPR}_{\textrm{all}} and minimize the FPRall\textrm{FPR}_{\textrm{all}}. All metrics were calculated by averaging the results of 100 replications for each scenario.

3.3. Biomarker selection results

For the proposed method, different results are presented. PPLassoΣ\textrm{PPLasso}_{\Sigma} (resp. PPLasso) corresponds to the results of the method by considering the true (resp. estimated) matrix 𝚺bm\boldsymbol{\Sigma}_{bm}. For estimating 𝚺bm\boldsymbol{\Sigma}_{bm}, we used the approach explained in Section 2.5. Two choices of parameters are also given: “optimal” and “min(bic)”. The former uses as a value for λ\lambda the one that maximizes (TPRallFPRall)(\textrm{TPR}_{\textrm{all}}-\textrm{FPR}_{\textrm{all}}) and the latter uses the approach presented in Section 2.6. In order to compare the performance of our approach to the best performance that could be reached by Elastic Net, Lasso and Adaptive Lasso, we used for these methods the “optimal” parameters namely those maximizing (TPRallFPRall)(\textrm{TPR}_{\textrm{all}}-\textrm{FPR}_{\textrm{all}}). All these three methods were implemented with the glmnet R package, the best parameter α\alpha involved in Elastic Net was chosen in the set {0.1,0.2,,0.9}\{0.1,0.2,\dots,0.9\}. The choice of “min(bic)” is only applied to our method and corresponds to a choice of λ\lambda that could be used in practical situations. For ease of presentation, the abbreviation EN (resp. AdLasso) refers to Elastic Net (resp. Adaptive Lasso) in the following.

Refer to caption
Figure 4. Average of (TPR-FPR) and the corresponding True Positive Rate (TPR) and False Positive Rate (FPR) for prognostic (left) and predictive (right) biomarkers.

Figure 4 shows the selection performance of PPLasso and other compared methods in the simulation scenario presented in Section 3.1. PPLasso achieved to select all prognostic biomarkers (TPRprog\textrm{TPR}_{\textrm{prog}} almost 1) even for large pp, with limited false positive prognostic biomarkers selected. As compared to the optimal λ\lambda maximizing (TPRallFPRall)(\textrm{TPR}_{\textrm{all}}-\textrm{FPR}_{\textrm{all}}), the one selected with the BIC tends to select some false positives (average: 33 (FPRprog=0.17\textrm{FPR}_{\textrm{prog}}=0.17) for p=200p=200 and 10 (FPRprog=0.005\textrm{FPR}_{\textrm{prog}}=0.005) for p=2000p=2000). The results obtained from the oracle and estimated 𝚺bm\boldsymbol{\Sigma}_{bm} are comparable. Selection performance of predictive biomarkers is slightly lowered as compared to prognostic biomarkers. Even if the false positive selection is quite similar between prognostic and predictive biomarkers, PPLasso missed some true predictive biomarkers when λ\lambda is selected with the BIC criterion (average TPRpred=\textrm{TPR}_{\textrm{pred}}= 0.98 and 0.80 for oracle and estimated 𝚺bm\boldsymbol{\Sigma}_{bm}, respectively, with p=2000p=2000). In this scenario where the IC is violated, PPLasso globally outperforms Lasso, Elastic Net and Adaptive Lasso. Although Elastic Net showed higher TPR than Lasso and Adaptive Lasso, they all failed in selecting all truly prognostic and predictive biomarkers, and the number of missed active biomarkers increased with the dimension pp. For example, for Elastic Net, TPRprog\textrm{TPR}_{\textrm{prog}} = 0.85 and 0.53, TPRpred\textrm{TPR}_{\textrm{pred}} = 0.81 and 0.61 for p=200p=200 and 20002000, respectively.

3.3.1. Impact of the correlation matrix 𝚺\boldsymbol{\Sigma}

To evaluate the impact of the correlation matrix on the selection performance of the methods, additional scenarios are presented where the IC is satisfied:

  1. (1)

    Compound symmetry structure where all biomarkers are equally correlated with a correlation ρ=0.5\rho=0.5;

  2. (2)

    Independent setting where 𝚺bm\boldsymbol{\Sigma}_{bm} is the identity matrix.

For the scenario with compound symmetry structure displayed in Figure 5, all the methods successfully identified the true prognostic biomarkers (TPRprog\textrm{TPR}_{\textrm{prog}} close to 1 even for large pp) with limited false positive selection. On the other hand, the compared methods (Lasso, ELastic Net, Adaptive Lasso) missed some predictive biomarkers especially when pp increases. On the contrary, PPLasso successfully identified almost all predictive biomarkers with the optimal choice of λ\lambda. Moreover, even when λ\lambda is selected by minimizing the BIC criterion (min(bic)), PPLassoest\textrm{PPLasso}_{\textrm{est}} outperformed Lasso and Adaptive Lasso when p>500p>500 with relatively stable TPRpred\textrm{TPR}_{\textrm{pred}} and FPRpred\textrm{FPR}_{\textrm{pred}} as pp increases.

Refer to caption
Figure 5. Average of (TPR-FPR) and the corresponding True Positive Rate (TPR) and False Positive Rate (FPR) for prognostic (left) and predictive (right) biomarkers for the compound symmetry correlation structure.
Refer to caption
Figure 6. Average of (TPR-FPR) and the corresponding True Positive Rate (TPR) and False Positive Rate (FPR) for prognostic (left) and predictive (right) biomarkers (independent setting).

For the independent setting, as displayed in Figure 6, prognostic biomarkers were globally well identified by all the compared methods with a slightly higher TPRprog\textrm{TPR}_{\textrm{prog}} for Lasso and ELastic Net as compared to PPLasso but also with a slightly higher FPRprog\textrm{FPR}_{\textrm{prog}}. With regards to predictive biomarkers, PPLasso using 𝚺bm\boldsymbol{\Sigma}_{bm} (oracle) performed also similarly to the Lasso, which is reasonable since no transformation has been used in PPLasso. On the other hand, even if PPLasso with λ\lambda selected with “min(bic)” performed similarly with PPLasso with optimal λ\lambda for relatively small pp, the selection performance is altered for large pp and even if the performance is higher than Lasso and Adaptive Lasso, it is smaller than the one of Elastic Net.

3.3.2. Impact of the effect size of active biomarkers

To evaluate the impact of the effect size on biomarker selection performance, the scenario presented in Section 3.1 was considered with different values of b2b_{2}: 1.5, 2 and 2.5.

Since the effect size of prognostic biomarkers did not change, the comparison focused on predictive biomarkers. As expected, the reduction of the effect size makes the biomarker selection harder, especially for Lasso, Elastic Net and Adaptive Lasso where the predictive biomarker selection is limited when b2=1.5b_{2}=1.5: for Lasso when p=2000p=2000, TPRpred\textrm{TPR}_{\textrm{pred}} = 0.45 (resp. 0.22) for b2=2b_{2}=2 (resp. 1.5), see Figure 4 and Figure 9 of the supplementary material. The selection performance of PPLasso when λ\lambda is selected with min(bic) is also reduced by decreasing b2b_{2}, especially when 𝚺bm\boldsymbol{\Sigma}_{bm} is also estimated. Nevertheless, the selection performance of PPLasso remains better than for the other compared methods for which the performance displayed are associated to the optimal value of λ\lambda. On the other hand, even with limited effect size, PPLasso with optimal λ\lambda identified all predictive biomarkers with very limited false positive selection. When b2b_{2} was increased to 2.5, the selection performance for all methods is improved and the results for PPLasso with estimated λ\lambda was close to the ones with the optimal λ\lambda as displayed in Figure 10 of the supplementary material. As compared with PPLasso, for which the selection performance remained stable as pp increased, Lasso, Elastic Net and Adaptive Lasso were more impacted by the value of pp since the true positive selection decreased as pp increased. As an example, for the Lasso, TPRpred\textrm{TPR}_{\textrm{pred}} =0.95 (resp. 0.65) for p=200p=200 (resp. 2000).

3.3.3. Impact of the number of predictive biomarkers

The impact of the number of true predictive biomarkers was assessed by increasing the number of predictive biomarkers from 5 to 10 in the scenario presented in Section 3.1. When the number of predictive biomarkers increased, the impact on PPLasso is almost negligible, especially for prognostic biomarker identification. However, for the other methods, we can see from Figure 11 of the supplementary material that it became even harder to identify predictive biomarkers. TPRpred\textrm{TPR}_{\textrm{pred}} decreased compared to Figure 4, especially for large pp (e.g. TPRpred=\textrm{TPR}_{\textrm{pred}}= 0.12, 0.18, and 0.02 for Lasso, Elastic Net and Adaptive Lasso respectively when p=2000p=2000).

3.3.4. Impact of the dimension of the dataset

In this section, we studied a different sample size: nn=50 with n1=n2=25n_{1}=n_{2}=25 and a different number of biomarkers: pp=5000.

We can see from Figure 12 of the supplementary material that for p=5000p=5000, the selection performance of PPLasso is not altered as compared with p=2000p=2000 while the compared methods have more difficulties to identify both prognostic and predictive biomarkers.

When the sample size is smaller (nn=50), we can see from Figure 13 of the supplementary material that the ability to identify prognostic and predictive biomarkers decreased for all the methods. However, PPLasso still outperformed the others with higher TPRprog\textrm{TPR}_{\textrm{prog}} and TPRpred\textrm{TPR}_{\textrm{pred}} and lower FPRprog\textrm{FPR}_{\textrm{prog}} and FPRpred\textrm{FPR}_{\textrm{pred}}.

4. Application to transcriptomic data in RV144 clinical trial

We applied the previously described methods to publicly available transcriptomic data from the RV144 vaccine trial (Rerks-Ngarm et al. (2009)). This trial showed reduced risk of HIV-1 acquisition by 31.2% with vaccination with ALVAC and AIDSVAX as compared to placebo. Transcriptomic profiles of in vitro HIV-1 Env-stimulated peripheral blood mononuclear cells (PBMCs) obtained pre-immunization and 15 days after the immunization (D15) from both 40 vaccinees and 10 placebo recipients were generated to better understand underlying biological mechanisms (Fourati et al. (2019), Gene Expression Omnibus accession code: GSE103671).

For illustration purpose, the absolute change at D15 in gene mTOR was considered as the continuous endpoint (response). mTOR plays a key role in mTORC1 signaling pathway which has been shown to be associated with risk of HIV-1 acquisition (Fourati et al. (2019), Akbay et al. (2020)). The gene expression has been normalized as in the original publication of Fourati et al. (2019). After removing non-annotated genes (LOCxxxx and HS.xxxx), the top 2000 genes with the highest empirical variances were included as candidate biomarkers for prognostic and predictive identification from PPLasso and the compared methods. The penalty parameter λ\lambda for the Lasso and Adaptive Lasso, the parameters λ\lambda and α\alpha for Elastic Net were selected through the classical cross-validation approach. For PPLasso, λ\lambda was selected based on the criterion described in Section 2.6.

Refer to caption
Figure 7. Heatmaps of the correlation matrix estimated by the cvCovEst R package.

The estimation of 𝚺\boldsymbol{\Sigma} was obtained by comparing several candidate estimators from the cvCovEst R package and by selecting the estimator having the smallest estimation error. In this application, the combination of the sample covariance matrix and a dense target matrix (denseLinearShrinkEst) derived by Ledoit and Wolf (2020) provides the smallest estimation error. Figure 7 displays the estimated 𝚺\boldsymbol{\Sigma} and highlights the strong correlation between the genes. Table 3 of the Supplementary material gives details on the compared estimators.

Prognostic and predictive genes selected by PPLasso, Lasso, Elastic Net and Adaptive Lasso are listed in Table 1. The number of genes selected are similar for all the compared methods, except for a slightly higher number of predictive genes selected by PPLasso. Lasso, Elastic Net and Adaptive Lasso selected very similar sets of prognostic and predictive genes. The intersection between PPLasso and others is moderate (2 prognostic genes (SLAMF7 and TNFRSF6B), 3 predictive genes (YTHDC1, MS4A7 and RPL21)). Interestingly, some genes selected by most methods such as SLAMF7, TNFRSF6B, TNFRSF18 or NUCKS1 have already been discussed in the HIV-1 field. Moreover, among the predictive genes selected by the PPLasso, some are linked to pathways that have been highlighted as possible target for HIV-1 such as BIRC3 and TLR8.

Table 1. Selected genes from PPLasso, Lasso, Elastic Net and Adaptive Lasso. Commonly selected genes are in bold.
prognostic genes predictive genes
PPLasso HAPLN3, SLAMF7, GTF3C5, FAM46A, SH3PXD2B, TM4SF1, TNFRSF6B, TNFRSF18, TRPM2 TLR8, YTHDC1, NUCKS1, BIRC3, SLAMF7, NFATC2IP, BOK, MGRN1, KIAA0492, SLC25A36, HMGN2, P2RY5, RPL21, MS4A7, RPL12P6
Lasso DKFZp434K191, NUCKS1, MAFF, SLAMF7, HIST2H2AC, HIST1H4C, IL8, TNFRSF6B, TNFRSF18, SCAND1 DKFZp434K191, YTHDC1, VMO1, BOLA2, HIST1H4C, RPL21, MS4A7
Elastic Net DKFZp434K191, NUCKS1,SNURF, MAFF, SLAMF7, IL8, ZBP1, TNFRSF6B, ZAK, TNFRSF18, SCAND1, NME1-NME2, DNM1L, RNF146, NPEPL1 DKFZp434K191, YTHDC1, PMP22, VMO1, BOLA2, HIST1H4C, RPL21, MS4A7,RAB11FIP1
Adaptive Lasso NUCKS1,SNURF, MAFF, SLAMF7, IL8, ZBP1, TNFRSF6B, NME1-NME2, DNM1L, RNF146 YTHDC1, PMP22, VMO1, BOLA2, HIST1H4C, MS4A7, RPL21

5. Conclusion

We propose a new method named PPLasso to simultaneously identify prognostic and predictive biomarkers. PPLasso is particularly interesting for dealing with high dimensional omics data when the biomarkers are highly correlated, which is a framework that has not been thoroughly investigated yet. From various numerical studies with or whithout strong correlation between biomarkers, we highlighted the strength of PPLasso in well identifying both prognostic and predictive biomarkers with limited false positive selection. The current method is only dedicated to the analysis of continuous responses through ANCOVA type models. However, it will be the subject of a future work to extend it to other challenging contexts, such as classification or survival analysis.

Funding

This work was supported by the Association Nationale de la Recherche et de la Technologie (ANRT).

References

  • Akbay et al. (2020) Akbay, B., Shmakova, A., Vassetzky, Y., and Dokudovskaya, S. (2020). Modulation of mtorc1 signaling pathway by hiv-1. Cells, 9, 1090.
  • Ballman (2015) Ballman, K. V. (2015). Biomarker: Predictive or prognostic? Journal of clinical oncology, 33(33), 3968–3971.
  • Boileau et al. (2021) Boileau, P., Hejazi, N. S., van der Laan, M. J., and Dudoit, S. (2021). Cross-validated loss-based covariance matrix estimator selection in high dimensions.
  • Boileau et al. (2021) Boileau, P., Hejazi, N. S., van der Laan, M. J., and Dudoit, S. (2021). cvCovEst: Cross-validated covariance matrix estimator selection and evaluation in R. Journal of Open Source Software, 6 (63), 3273.
  • Cai et al. (2010) Cai, T., Zhang, C.-H., and Zhou, H. (2010). Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38.
  • Clark (2008) Clark, G. (2008). Prognostic factors versus predictive factors: Examples from a clinical trial of erlotinib. Molecular oncology, 1, 406–12.
  • Fan and Li (2006) Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. Proc. Madrid Int. Congress of Mathematicians, 3.
  • Fan and Lv (2009) Fan, J. and Lv, J. (2009). A selective overview of variable selection in high dimensional feature space. Stat. Sinica, 20 1, 101–148.
  • Fan et al. (2013) Fan, J., Liao, Y., and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society. Series B, Statistical methodology, 75.
  • Faraway (2002) Faraway, J. J. (2002). Practical regression and ANOVA using R. University of Bath.
  • Foster et al. (2011) Foster, J., Taylor, J., and Ruberg, S. (2011). Subgroup identification from randomized clinical trial data. Statistics in medicine, 30, 2867–80.
  • Fourati et al. (2019) Fourati, S., Ribeiro, S., Blasco Lopes, F., Talla, A., Lefebvre, F., Cameron, M., Kaewkungwal, J., Pitisuttithum, P., Nitayaphan, S., Rerks-Ngarm, S., Kim, J., Thomas, R., Gilbert, P., Tomaras, G., Koup, R., Michael, N., McElrath, M., Gottardo, R., and Sékaly, R. (2019). Integrated systems approach defines the antiviral pathways conferring protection by the rv144 hiv vaccine. Nature Communications, 10.
  • Ledoit and Wolf (2020) Ledoit, O. and Wolf, M. (2020). The Power of (Non-)Linear Shrinking: A Review and Guide to Covariance Matrix Estimation. Journal of Financial Econometrics.
  • Lipkovich and Dmitrienko (2014) Lipkovich, I. and Dmitrienko, A. (2014). Strategies for identifying predictive biomarkers and subgroups with enhanced treatment effect in clinical trials using sides. Journal of biopharmaceutical statistics, 24, 130–53.
  • Lipkovich et al. (2011) Lipkovich, I., Dmitrienko, A., Denne, J., and Enas, G. (2011). Subgroup identification based on differential effect search (sides) – a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in medicine, 30, 2601–21.
  • Lipkovich et al. (2017) Lipkovich, I., Dmitrienko, A., and B. D’Agostino Sr., R. (2017). Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in Medicine, 36(1), 136–196.
  • McDonald (2009) McDonald, J. (2009). Handbook of Biological Statistics 2nd Edition. Sparky House Publishing Baltimore.
  • Rerks-Ngarm et al. (2009) Rerks-Ngarm, S., Pitisuttithum, P., Nitayaphan, S., Kaewkungwal, J., Chiu, J., Paris, R., Premsri, N., Namwat, C., De Souza, M., Benenson, M., Gurunathan, S., Tartaglia, J., McNeil, J., Francis, D., Stablein, D., Birx, D., Chunsuttiwat, S., Khamboonruang, C., and Kim, J. (2009). Vaccination with alvac and aidsvax to prevent hiv-1 infection in thailand. The New England journal of medicine, 361, 2209–20.
  • Saeys et al. (2007) Saeys, Y., Inza, I., and Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507–2517.
  • Sechidis et al. (2018) Sechidis, K., Papangelou, K., Metcalfe, P. D., Svensson, D., Weatherall, J., and Brown, G. (2018). Distinguishing prognostic and predictive biomarkers: an information theoretic approach. Bioinformatics, 34(19), 3365–3376.
  • Smith (2018) Smith, G. (2018). Step away from stepwise. J. Big Data, 5(32), 1–12.
  • Tian et al. (2012) Tian, L., Alizadeh, A., Gentles, A., and Tibshirani, R. (2012). A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association, 109.
  • Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Stat. Methodol.), 58(1), 267–288.
  • Tibshirani and Taylor (2011) Tibshirani, R. J. and Taylor, J. (2011). The solution path of the generalized lasso. Ann. Stat., 39(3), 1335–1371.
  • Wang et al. (2019) Wang, H., Lengerich, B., Aragam, B., and Xing, E. (2019). Precision lasso: Accounting for correlations and linear dependencies in high-dimensional genomic data. Bioinformatics, 35(7), 1181–1187.
  • Wang and Leng (2016) Wang, X. and Leng, C. (2016). High dimensional ordinary least squares projection for screening variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.), 78(3), 589–611.
  • Windeler (2000) Windeler, J. (2000). Prognosis - what does the clinician associate with this notion? Statistics in medicine, 19, 425–30.
  • Xue and Qu (2017) Xue, F. and Qu, A. (2017). Variable selection for highly correlated predictors.
  • Zhao and Yu (2006) Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. J. Machine Learn. Res., 7, 2541–2563.
  • Zhu et al. (2021) Zhu, W., Lévy-Leduc, C., and Ternès, N. (2021). A variable selection approach for highly correlated predictors in high-dimensional genomic data. Bioinformatics, 37(16), 2238–2244.
  • Zou and Hastie (2005) Zou, H. and Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2), 301-320.
  • Zou (2006) Zou, H. (2006). The adaptive lasso and its oracle properties.. Journal of the American statistical association 101(476), 1418-1429.

Supplementary material

This supplementary material provides additional numerical experiments, figures and tables for the paper: “Identification of prognostic and predictive biomarkers in high-dimensional data with PPLasso”.

Refer to caption
Figure 8. Number of True Positives and True Negatives for 𝜷^\widehat{\boldsymbol{\beta}} and 𝜷^0\widehat{\boldsymbol{\beta}}_{0} on prognostic/predictive biomarkers.
MSE BIC
TPR(prognostic) 1.000 1.000
FPR(prognostic) 0.038 0.024
TPR(predictive) 1.000 1.000
FPR(predctive) 0.008 0.006

Table 2. TPR and FPR associated to prognostic and predictive biomarker identification with the λ\lambda chosen in Figure 3.
Refer to caption
Figure 9. Average of (TPR-FPR) and the corresponding True Positive Rate (TPR) and False Positive Rate (FPR) for prognostic (left) and predictive (right) biomarkers (b2=1.5b_{2}=1.5).
Refer to caption
Figure 10. Average of (TPR-FPR) and the corresponding True Positive Rate (TPR) and False Positive Rate (FPR) for prognostic (left) and predictive (right) biomarkers (b2=2.5b_{2}=2.5).
Refer to caption
Figure 11. Average of (TPR-FPR) and the corresponding True Positive Rate (TPR) and False Positive Rate (FPR) for prognostic (left) and predictive (right) biomarkers (10 predictive biomarkers).
Refer to caption
Figure 12. Average of (TPR-FPR) and the corresponding True Positive Rate (TPR) and False Positive Rate (FPR) for prognostic (left) and predictive (right) biomarkers (with p=5000p=5000).
Refer to caption
Figure 13. Average of (TPR-FPR) and the corresponding True Positive Rate (TPR) and False Positive Rate (FPR) for prognostic (left) and predictive (right) biomarkers (n1=n2=25n_{1}=n_{2}=25).
Estimator Hyperparameters Empirical risk
denseLinearShrinkEst - 102546
sampleCovEst - 102547
linearShrinkLWEst - 103496
poetEst lambda=0.1, k=2 104522
poetEst lambda=0.2, k=2 105358
poetEst lambda=0.1, k=1 105972
poetEst lambda=0.2, k=1 108222
thresholdingEst gamma=0.2 137798
thresholdingEst gamma=0.4 186844

Table 3. Empirical risk of tested methods with different hyperparameters.