
Chirag Nagpal (chiragn@cs.cmu.edu)
Vedant Sanil (vsanil@andrew.cmu.edu)
Artur Dubrawski (awd@cs.cmu.edu)
Auton Lab, Carnegie Mellon University

Recovering Sparse and Interpretable Subgroups with Heterogeneous Treatment Effects with Censored Time-to-Event Outcomes

Abstract

Both randomized experiments and observational studies frequently involve time-to-event outcomes such as time-to-failure, death, or onset of an adverse condition. Such outcomes are typically subject to censoring due to loss of follow-up, and established statistical practice involves comparing treatment efficacy in terms of hazard ratios between the treated and control groups. In this paper we propose a statistical approach to recovering sparse phenogroups (or subtypes) that demonstrate differential treatment effects as compared to the study population. Our approach involves modelling the data as a mixture while enforcing parameter shrinkage through structured sparsity regularization. We propose a novel inference procedure for the proposed model and demonstrate its efficacy in recovering sparse phenotypes across large landmark real-world clinical studies in cardiovascular health.

keywords:
Time-to-Event, Survival Analysis, Heterogeneous Treatment Effects, Hazard Ratio

1 Introduction

Data-driven decision making across multiple disciplines, including healthcare, epidemiology, econometrics and prognostics, often involves establishing the efficacy of an intervention when outcomes are measured in terms of the time to an adverse event, such as death, failure, or onset of a critical condition. Typically, the analysis of such studies involves assigning a patient population to two or more treatment arms, often called the ‘treated’ (or ‘exposed’) group and the ‘control’ (or ‘placebo’) group, and observing whether the treated population experiences the adverse event (for instance death or onset of a disease) over the study period at a rate that is higher (or lower) than the control group. Efficacy of a treatment is thus established by comparing the relative difference in the rate of event incidence between the two arms, called the hazard ratio. However, not all individuals benefit equally from an intervention. Thus, potentially beneficial interventions are often discarded even though there may exist individuals who benefit, because the population-level estimates of treatment efficacy are inconclusive.

In this paper we assume that patient responses to an intervention are typically heterogeneous and that there exist patient subgroups that are unaffected by (or, worse, harmed by) the intervention being assessed. The ability to discover or phenotype these patients is thus clinically useful, as it would allow for more precise clinical decision making by identifying the individuals who actually benefit from the intervention being assessed.

Towards this end, we propose Sparse Cox Subgrouping (SCS), a latent variable approach to model patient subgroups that demonstrate heterogeneous responses to an intervention. As opposed to existing literature on modelling heterogeneous treatment effects with censored time-to-event outcomes, our approach involves structured regularization of the covariates that assign individuals to subgroups, leading to parsimonious models and phenogroups that are interpretable. We release a python implementation of the proposed SCS approach as part of the auton-survival package (Nagpal et al., 2022b) for survival analysis.

2 Related Work

Large studies, especially in clinical medicine and epidemiology, involve outcomes that are time-to-events, such as death or the onset of an adverse clinical condition like stroke or cancer. Treatment efficacy is typically estimated by comparing event rates between the treated and control arms using the Proportional Hazards model (Cox, 1972) and its extensions.

Identification of subgroups in such scenarios has been the subject of a large body of traditional statistical literature. A large number of such approaches involve estimation of the factual and counterfactual outcomes using separate regression models (T-learner), followed by regressing the difference between these estimated potential outcomes. Within this category of approaches, Lipkovich et al. (2011) propose the subgroup identification based on differential effect search (SIDES) algorithm, Su et al. (2009) propose a recursive partitioning method for subgroup discovery, Dusseldorp and Mechelen (2014) propose the qualitative interaction trees (QUINT) algorithm, and Foster et al. (2011) propose the virtual twins (VT) method for subgroup discovery involving decision tree ensembles. We include a parametric version of such an approach as a competing baseline.

Identification of heterogeneous treatment effects (HTEs) is also of growing interest to the machine learning community, with multiple approaches involving deep neural networks with balanced representations (Shalit et al., 2017; Johansson et al., 2020), generative models (Louizos et al., 2017), as well as non-parametric methods involving random forests (Wager and Athey, 2018) and Gaussian processes (Alaa and Van Der Schaar, 2017). There is also growing interest in estimating HTEs from an interpretable and trustworthy standpoint (Lee et al., 2020; Nagpal et al., 2020; Morucci et al., 2020; Wu et al., 2022; Crabbé et al., 2022). Wang and Rudin (2022) propose a sampling based approach to discovering interpretable rule sets demonstrating HTEs.

However, a large part of this work has focused on outcomes that are binary or continuous; the estimation of HTEs in the presence of censored time-to-event outcomes has received limited attention. Xu et al. (2022) explore the problem and describe standard approaches for estimating treatment effect heterogeneity with survival outcomes. They also describe challenges associated with existing risk models when assessing treatment effect heterogeneity in the case of cardiovascular health.

There have been initial attempts to use neural networks for causal inference with censored time-to-event outcomes. Curth et al. (2021) propose a discrete-time method along with regularization to match the treated and control representations. Chapfuwa et al. (2021)’s approach is related and involves the use of normalizing flows to estimate the potential time-to-event distributions under treatment and control. While our contribution is similar to Nagpal et al. (2022a), in that we model treatment effect heterogeneity through a latent variable model, it differs in that 1) our approach is free of the expensive Monte Carlo sampling procedure, and 2) our generalized EM inference procedure allows us to naturally incorporate structured sparsity regularization, which helps recover phenogroups that are parsimonious in the features that define them.

Survival and time-to-event outcomes occur pre-eminently in areas of cardiovascular health. One such area is reducing the combined risk of adverse outcomes from atherosclerotic disease, a class of related clinical conditions arising from increasing deposits of plaque in the arteries and leading to stroke, myocardial infarction, and other coronary heart diseases (Herrington et al., 2016; Furberg et al., 2002; Group, 2009; Buse et al., 2007). The ability to recover groups with differential benefits from interventions can thus lead to improved patient care through the framing of optimal clinical guidelines.

3 Proposed Model: Sparse Cox Subgrouping

Figure 1: Potential outcome distributions under the assumptions of treatment effect heterogeneity. Case 1 ($Z\in\{0,+1\}$): amongst the treated population, conditioned on the latent $Z$, there are two subgroups that benefit from and are unaffected by the intervention, respectively. Case 2 ($Z\in\{0,+1,-1\}$): there is an additional latent subgroup conditioned on which the treated population is harmed, with a worse survival rate.
Notation

As is standard in survival analysis, we assume that we observe either the true time-to-event or the time of censoring, $U=\min\{T,C\}$, as indicated by the censoring indicator defined as $\Delta=\bm{1}\{T<C\}$. We thus work with a dataset of right-censored observations in the form of 4-tuples, $\mathcal{D}=\{({\bm{x}}_{i},\delta_{i},u_{i},{\bm{a}}_{i})\}_{i=1}^{n}$, where $u_{i}\in\mathbb{R}^{+}$ is the time-to-event or censoring as indicated by $\delta_{i}\in\{0,1\}$, ${\bm{a}}_{i}\in\{0,1\}$ is the indicator of treatment assignment, and ${\bm{x}}_{i}$ are individual covariates that confound the treatment assignment and the outcome.

Assumption 1 (Independent Censoring)

The time-to-event $T$ and the censoring distribution $C$ are independent conditional on the covariates $X$ and the intervention $A$.

Model

Consider a maximum likelihood approach to model the data $\mathcal{D}$ with the set of parameters $\bm{\Omega}$. Under Assumption 1, the likelihood of the data $\mathcal{D}$ can be given as,

$$\mathcal{L}(\bm{\Omega};\mathcal{D}) \propto \prod_{i=1}^{|\mathcal{D}|}\bm{\lambda}(u_{i}|X={\bm{x}}_{i},A={\bm{a}}_{i})^{\delta_{i}}\,\bm{S}(u_{i}|X={\bm{x}}_{i},A={\bm{a}}_{i}), \quad (1)$$

here $\bm{\lambda}(t)=\lim_{\Delta t\to 0}\frac{\mathbb{P}(t\leq T<t+\Delta t\,|\,T\geq t)}{\Delta t}$ is the hazard and $\bm{S}(t)=\mathbb{P}(T>t)$ is the survival rate.

Assumption 2 (PH)

The distribution of the time-to-event $T$ conditional on the covariates and the treatment assignment obeys proportional hazards.

From Assumption 2 (Proportional Hazards), the hazard for an individual with covariates $(X={\bm{x}})$ under intervention $(A={\bm{a}})$ in a Cox model with parameters $\bm{\beta}$ and treatment effect $\bm{\omega}$ is given as

$$\bm{\lambda}(t|A={\bm{a}},X={\bm{x}})=\bm{\lambda}_{0}(t)\exp\big(\bm{\beta}^{\top}{\bm{x}}+\bm{\omega}\cdot\bm{a}\big), \quad (2)$$

Here, $\bm{\lambda}_{0}(\cdot)$ is an infinite-dimensional parameter known as the base hazard rate. In practice, in the Cox model the base hazard is a nuisance parameter and is estimated non-parametrically. In order to model the heterogeneity of treatment response, we now introduce a latent variable $Z\in\{0,1,-1\}$ that mediates treatment response in the model,

$$\bm{\lambda}(t|A={\bm{a}},X={\bm{x}},Z={\bm{k}})=\bm{\lambda}_{0}(t)\exp(\bm{\beta}^{\top}{\bm{x}})\exp(\bm{\omega})^{{\bm{k}}{\bm{a}}},$$
$$\text{and, }\;\;\mathbb{P}(Z={\bm{k}}|X={\bm{x}})=\frac{\exp(\bm{\theta}_{k}^{\top}{\bm{x}})}{\sum_{j}\exp(\bm{\theta}_{j}^{\top}{\bm{x}})}. \quad (3)$$

Here, $\bm{\omega}\in\mathbb{R}$ is the treatment effect, and $\bm{\theta}\in\mathbb{R}^{k\times d}$ is the set of parameters that mediate assignment to the latent group $Z$ conditioned on the confounding features ${\bm{x}}$. Note that the above choice of parameterization naturally enforces the requirements of the model as in Figure 1. Consider the following scenarios,

Case 1: The study population consists of two sub-strata, i.e. $Z\in\{0,+1\}$, that benefit from and are unaffected by treatment, respectively.

Case 2: The study population consists of three sub-strata, i.e. $Z\in\{0,+1,-1\}$, that benefit from, are harmed by, or are unaffected by treatment, respectively.

Following from Equations 1 & 3, the complete likelihood of the data $\mathcal{D}$ under this model is,

$$\mathcal{L}(\bm{\Omega};\mathcal{D})=\prod_{i=1}^{|\mathcal{D}|}\sum_{k\in Z}\bigg(\bm{\lambda}_{0}(u_{i})\,\bm{h}({\bm{x}},{\bm{a}},{\bm{k}})\bigg)^{\delta_{i}}{\bm{S}}_{0}(u_{i})^{\bm{h}({\bm{x}},{\bm{a}},{\bm{k}})}\,\mathbb{P}(Z=k|X={\bm{x}}_{i}),$$
$$\text{where, }\ln\bm{h}({\bm{x}},{\bm{a}},{\bm{k}})=\bm{\beta}^{\top}{\bm{x}}+{\bm{k}}\cdot\bm{a}\cdot\bm{w}\;\text{ and }\;\ln{\bm{S}}_{0}(\cdot)=-\bm{\Lambda}_{0}(\cdot). \quad (4)$$

Note that $\bm{\Lambda}_{0}(t)=\int_{0}^{t}\bm{\lambda}_{0}(s)\,ds$ is the infinite-dimensional cumulative hazard and is inferred when learning the model. We will notate the set of all learnable parameters as $\bm{\Omega}=\{\bm{\theta},\bm{\beta},\bm{w},\bm{\Lambda}_{0}\}$.
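To make the parameterization concrete, the following is a minimal numpy sketch of the gating function $\mathbb{P}(Z=k|X={\bm{x}})$ and the hazard multiplier $\bm{h}({\bm{x}},{\bm{a}},{\bm{k}})$ from Equations 3 and 4. The ordering of the latent groups and all variable names are illustrative conventions, not the released auton-survival implementation.

```python
import numpy as np

# Latent group labels k; the ordering (0: unaffected, +1: benefit, -1: harm) is an illustrative convention.
GROUPS = np.array([0.0, 1.0, -1.0])

def gate_probabilities(x, theta):
    """Softmax gating P(Z = k | X = x) of Equation 3; theta holds one row of coefficients per latent group."""
    logits = theta @ x
    logits = logits - logits.max()          # numerical stability
    expl = np.exp(logits)
    return expl / expl.sum()

def hazard_multiplier(x, a, beta, omega):
    """h(x, a, k) = exp(beta^T x + k * a * omega) for every latent group k (Equation 4)."""
    return np.exp(beta @ x + GROUPS * a * omega)

# Toy usage on a single individual with four covariates.
rng = np.random.default_rng(0)
x, a = rng.normal(size=4), 1
theta = np.vstack([np.zeros(4), rng.normal(size=4), rng.normal(size=4)])   # row for Z = 0 fixed at zero
beta, omega = rng.normal(size=4), -0.5
print(gate_probabilities(x, theta), hazard_multiplier(x, a, beta, omega))
```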

Shrinkage

In retrospective analyses to recover treatment effect heterogeneity, a natural requirement is parsimony of the recovered subgroups in terms of the covariates, in order to promote model interpretability. Such parsimony can be naturally enforced through appropriate shrinkage of the coefficients that promotes sparsity. We want to recover phenogroups that are ‘sparse’ in $\bm{\theta}$, and we enforce sparsity in the parameters of the latent $Z$ gating function via a group $\ell_{1}$ (Lasso) penalty. The final loss function to be optimized, including the group sparsity regularization term, is

$$\mathcal{L}(\bm{\Omega};\mathcal{D})+\bm{\epsilon}\cdot\mathcal{R}(\bm{\theta}),\;\text{ where }\;\mathcal{R}(\bm{\theta})=\sum_{d}\sqrt{\sum_{k\in\mathcal{Z}}\big(\bm{\theta}^{k}_{d}\big)^{2}}$$
$$\text{and }\bm{\epsilon}>0\text{ is the strength of the shrinkage parameter.} \quad (5)$$
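To illustrate the structure of $\mathcal{R}(\bm{\theta})$, the short sketch below (assuming $\bm{\theta}$ is stored as a matrix with one row per latent group and one column per covariate) computes the penalty; a covariate contributes nothing when its coefficients are zero in every group, which is what drives entire features out of the gating function.

```python
import numpy as np

def group_sparsity_penalty(theta):
    """R(theta): sum over covariates d of the l2 norm of theta[:, d] across latent groups (Equation 5)."""
    return np.sqrt((theta ** 2).sum(axis=0)).sum()

# Illustrative gating coefficients: rows are latent groups (Z = 0 fixed at zero for identifiability),
# columns are covariates; only the second and fourth covariates enter the penalty.
theta = np.array([[0.0, 0.0, 0.0, 0.0],
                  [0.0, 1.2, 0.0, -0.4],
                  [0.0, -0.7, 0.0, 0.1]])
print(group_sparsity_penalty(theta))
```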
Identifiability

Further, to ensure identifiability we restrict the gating parameters for the $Z=0$ group to be $0$. Thus $\bm{\theta}_{1}=0$.

Inference

We will present a variation of the Expectation Maximization algorithm to infer the parameters in Equation 3. Our approach differs from Nagpal et al. (2022a, 2021) in that it does not require stochastic Monte Carlo sampling. Further, our generalized EM inference allows for incorporation of the structured sparsity penalty in the M-Step.

A Semi-Parametric $Q(\cdot)$

Note that the likelihood in Equation 3 is semi-parametric and consists of parametric components and the infinite-dimensional base hazard $\bm{\Lambda}(\cdot)$. We define the $Q(\cdot)$ as:

$$Q(\bm{\Omega};\mathcal{D})=\sum_{i=1}^{n}\sum_{k\in\mathcal{Z}}\bm{\gamma}^{k}_{i}\bigg(\ln\bm{p}_{\bm{\theta}}(Z=k|X={\bm{x}}_{i})+\ln\bm{p}_{\bm{w},\bm{\beta},\bm{\Lambda}}(T|Z=k,X={\bm{x}}_{i})\bigg)+\mathcal{R}(\bm{\theta})$$
The E-Step

Requires computation of the posterior counts $\bm{\gamma}:=\bm{p}(Z=k\,|\,T,X=\bm{x},A={\bm{a}})$.

Result 1 (Posterior Counts)

The posterior counts $\bm{\gamma}$ for the latent $Z$ are estimated as,

$$\bm{\gamma}^{k}=\widehat{\mathbb{P}}(Z=k|X={\bm{x}},A={\bm{a}},{\bm{u}})=\frac{\mathbb{P}({\bm{u}}|Z={\bm{k}},X={\bm{x}},A={\bm{a}})\,\mathbb{P}(Z={\bm{k}}|X={\bm{x}})}{\sum_{k}\mathbb{P}({\bm{u}}|Z={\bm{k}},X={\bm{x}},A={\bm{a}})\,\mathbb{P}(Z={\bm{k}}|X={\bm{x}})}=\frac{{\bm{h}}({\bm{x}},{\bm{a}},{\bm{k}})^{\delta_{i}}\,\widehat{{\bm{S}}}_{0}({\bm{u}})^{{\bm{h}}({\bm{x}},{\bm{a}},{\bm{k}})}\exp(\bm{\theta}_{\bm{k}}^{\top}\bm{x})}{\sum_{j\in\mathcal{Z}}{\bm{h}}({\bm{x}},{\bm{a}},{\bm{j}})^{\delta_{i}}\,\widehat{{\bm{S}}}_{0}({\bm{u}})^{{\bm{h}}({\bm{x}},{\bm{a}},{\bm{j}})}\exp(\bm{\theta}_{\bm{j}}^{\top}\bm{x})}. \quad (6)$$

For a full discussion of the derivation of the $Q(\cdot)$ and the posterior counts, please refer to Appendix B.
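Since Equation 6 factorizes over individuals, the E-step reduces to a few elementwise operations; the following is a minimal numpy sketch for a single individual (the group ordering, argument names, and the baseline survival estimate s0_u are illustrative assumptions, not the released implementation).

```python
import numpy as np

GROUPS = np.array([0.0, 1.0, -1.0])   # illustrative ordering of the latent groups k

def posterior_counts(x, a, s0_u, delta, theta, beta, omega):
    """E-step of Equation 6 for a single individual.

    s0_u  : baseline survival estimate S0_hat evaluated at the observed time u
    delta : event indicator (1 = event observed, 0 = censored)
    """
    h = np.exp(beta @ x + GROUPS * a * omega)            # h(x, a, k) for every latent group
    logits = theta @ x
    prior = np.exp(logits - logits.max())
    prior = prior / prior.sum()                          # gating P(Z = k | X = x)
    lik = (h ** delta) * (s0_u ** h)                     # P(u | Z = k, ...) up to the factor lambda_0(u)
    unnorm = lik * prior
    return unnorm / unnorm.sum()                         # posterior counts gamma^k

# Example: posterior_counts(np.ones(4), a=1, s0_u=0.7, delta=1,
#                           theta=np.zeros((3, 4)), beta=np.zeros(4), omega=-0.5)
```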

The M-Step

Involves maximizing the $Q(\cdot)$ function. Rewriting the $Q(\cdot)$ as a sum of two terms,

$$Q(\bm{\Omega})=\underbrace{\sum_{i=1}^{n}\sum_{k\in\mathcal{Z}}\bm{\gamma}^{k}_{i}\ln\bm{p}_{\bm{w},\bm{\beta},\bm{\Lambda}_{0}}(T|Z=k,X={\bm{x}}_{i},A={\bm{a}}_{i})}_{{\bm{A}}(\bm{w},\bm{\beta},\bm{\Lambda}_{0})}+\underbrace{\sum_{i=1}^{n}\sum_{k\in\mathcal{Z}}\bm{\gamma}^{k}_{i}\ln\bm{p}_{\bm{\theta}}(Z=k|X={\bm{x}}_{i})+\mathcal{R}(\bm{\theta})}_{{\bm{B}}(\bm{\theta})}$$
Result 2 (Weighted Cox model)

The term ${\bm{A}}$ can be rewritten as a weighted Cox model and thus optimized using the corresponding ‘partial likelihood’.

Updates for $\{\bm{\beta},\bm{\omega}\}$: The partial likelihood $\mathcal{PL}(\cdot)$ under sampling weights (Binder, 1992) is

$$\mathcal{PL}(\bm{\Omega};\mathcal{D})=\sum_{i=1,\,\delta_{i}=1}^{n}\sum_{k\in\mathcal{Z}}\bm{\gamma}^{k}_{i}\bigg(\ln\bm{h}_{k}({\bm{x}}_{i},\bm{a}_{i},{\bm{k}})-\ln\sum_{j\in\mathsf{RiskSet}(u_{i})}\sum_{k\in\mathcal{Z}}\bm{\gamma}^{k}_{j}\,\bm{h}_{k}({\bm{x}}_{j},\bm{a}_{j},{\bm{k}})\bigg) \quad (7)$$

Here $\mathsf{RiskSet}(\cdot)$ is the ‘risk set’, i.e. the set of all individuals who have not yet experienced the event at the corresponding time, $\mathsf{RiskSet}(t):=\{i:u_{i}>t\}$. $\mathcal{PL}(\cdot)$ is convex in $\{\bm{\beta},\bm{\omega}\}$ and we update these parameters with a gradient step.

Updates for $\bm{\Lambda}_{0}$: The base hazard $\bm{\Lambda}_{0}$ is updated using a weighted Breslow's estimate (Breslow, 1972; Lin, 2007), treating the posterior counts $\bm{\gamma}$ as sampling weights (Chen, 2009),

$$\widehat{\bm{\Lambda}}_{0}(t)^{+}=\sum_{i=1}^{n}\sum_{k\in\mathcal{Z}}\bm{1}\{u_{i}<t\}\frac{\bm{\gamma}^{k}_{i}\cdot\delta_{i}}{\sum_{j\in\mathsf{RiskSet}(u_{i})}\sum_{k\in\mathcal{Z}}\bm{\gamma}^{k}_{j}\,\bm{h}_{k}({\bm{x}}_{j},\bm{a}_{j},{\bm{k}})} \quad (8)$$
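A minimal vectorized sketch of the weighted Breslow update of Equation 8 is given below; it assumes the per-individual, per-group posterior weights (gamma) and hazard multipliers (h_all) have already been computed, and is illustrative rather than the released implementation.

```python
import numpy as np

def weighted_breslow(times, deltas, gamma, h_all, t_grid):
    """Weighted Breslow estimate of the cumulative base hazard (Equation 8).

    times  : (n,) observed times u_i
    deltas : (n,) event indicators delta_i
    gamma  : (n, K) posterior counts gamma_i^k, used as sampling weights
    h_all  : (n, K) hazard multipliers h(x_i, a_i, k)
    t_grid : times at which the estimate is evaluated
    """
    # Weighted risk-set total at each observed time: sum_{j: u_j > u_i} sum_k gamma_j^k h_k(x_j, a_j, k).
    denom = np.array([(gamma[times > u] * h_all[times > u]).sum() for u in times])
    denom = np.maximum(denom, 1e-12)                       # guard against empty risk sets
    increments = deltas * gamma.sum(axis=1) / denom        # delta_i * sum_k gamma_i^k / denominator
    return np.array([increments[times < t].sum() for t in t_grid])
```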

Term ${\bm{B}}$ is a function of the gating parameters $\bm{\theta}$ that determine the latent assignment $Z$, along with the sparsity regularization. We optimize ${\bm{B}}$ using a proximal gradient update, as in Iterative Soft Thresholding (ISTA) for group-sparse $\ell_{1}$ regression.

Updates for $\bm{\theta}$: The proximal update for $\bm{\theta}$ including the group regularization term $\mathcal{R}(\cdot)$ (Friedman et al., 2010) is,

$$\widehat{\bm{\theta}}^{+}=\mathsf{prox}_{\eta\epsilon}\bigg(\bm{\theta}-\eta\frac{d}{d\bm{\theta}}{\bm{B}}(\bm{\theta})\bigg),\;\;\;\text{ where }\mathsf{prox}_{\eta\epsilon}({\bm{y}})=\frac{{\bm{y}}}{||{\bm{y}}||_{2}}\max\{0,||\bm{y}||_{2}-\eta\bm{\epsilon}\}. \quad (9)$$
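The proximal step has a simple closed form; below is a hedged numpy sketch of the group soft-thresholding operator, assuming the gating coefficients are stored as a (groups × covariates) matrix with one group per covariate (the names eta and eps for the step size and shrinkage strength are illustrative).

```python
import numpy as np

def prox_group_l1(theta, eta, eps):
    """Group soft-thresholding operator of Equation 9, applied to each covariate's
    column of gating coefficients across the latent groups."""
    out = np.zeros_like(theta)
    for d in range(theta.shape[1]):
        col = theta[:, d]
        norm = np.linalg.norm(col)
        if norm > 0:
            out[:, d] = col / norm * max(0.0, norm - eta * eps)
    return out

# After the gradient step on B(theta), every covariate whose coefficient norm across the latent
# groups falls below eta * eps is zeroed out entirely, which yields the sparse gating function.
```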

Altogether, the inference procedure is described in Algorithm 1.

Algorithm 1: Parameter Learning for SCS with a Generalized EM

Input: Training set $\mathcal{D}=\{({\bm{x}}_{i},u_{i},a_{i},\delta_{i})\}_{i=1}^{n}$; maximum EM iterations $B$; step size $\eta$

While <not converged>:
    For $b\in\{1,2,...,B\}$:
        E-Step:
            Compute the posterior counts $\bm{\gamma}_{i}^{k}$ via Equation 6.
        M-Step:
            $\widehat{\bm{\beta}}^{+}\leftarrow\widehat{\bm{\beta}}-\eta\cdot\nabla_{\bm{\beta}}\mathcal{PL}(\bm{\beta},{\bm{w}};\mathcal{D})$
            $\widehat{\bm{w}}^{+}\leftarrow\widehat{\bm{w}}-\eta\cdot\nabla_{\bm{w}}\mathcal{PL}(\bm{\beta},{\bm{w}};\mathcal{D})$   ▷ Gradient descent update.
            Update $\widehat{\bm{\Lambda}}_{0}(t)^{+}$ via the weighted Breslow (1972) estimator of Equation 8.
            $\widehat{\bm{\theta}}^{+}\leftarrow\widehat{\bm{\theta}}-\eta\cdot\nabla_{\bm{\theta}}{\bm{B}}(\bm{\theta})$   ▷ Update $\bm{\theta}$ with the gradient of $\widehat{Q}$.
            $\widehat{\bm{\theta}}^{+}\leftarrow\mathsf{prox}_{\epsilon\eta}(\widehat{\bm{\theta}}^{+})$   ▷ Proximal update.

Return: learnt parameters $\bm{\Omega}$.

4 Experiments

In this section we describe the experiments conducted to benchmark the performance of SCS against alternative models for heterogeneous treatment effect estimation on multiple datasets, including a synthetic dataset and multiple large landmark clinical trials for cardiovascular diseases.

4.1 Simulation

Figure 2: a) Population-level Kaplan-Meier estimates of the survival distribution stratified by the treatment assignment. b) Distribution of the latent $Z$ in $X$ and the decision boundary recovered by SCS. c) Receiver Operating Characteristic of SCS in recovering the true phenotype.
Figure 3: The phenotypes recovered with Sparse Cox Subgrouping on the synthetic data. As expected, the recovered phenotypes conform to the modelling assumptions as in Figure 4.

In this section we first describe the performance of the proposed Sparse Cox Subgrouping approach on a synthetic dataset designed to demonstrate heterogeneous treatment effects. We randomly assign individuals to the treated or control group. The latent variable $Z$ is drawn from a uniform categorical distribution that determines the subgroup,

$$A\sim\mathrm{Bernoulli}(\nicefrac{1}{2}),\quad Z\sim\mathrm{Categorical}(\nicefrac{1}{3})$$

Conditioned on $Z$ we sample $X_{1:2}\sim\mathrm{Normal}(\bm{\mu}_{z},\bm{\sigma}_{z})$ as in Figure 2, which determine the conditional hazard ratios $\mathrm{HR}(k)$, and we additionally sample noisy covariates $X_{3:6}\sim\mathrm{Uniform}(-1,1)$. The true time-to-event $T$ and censoring times $C$ are then sampled as,

$$T|(X={\bm{x}},A={\bm{a}},Z={\bm{k}})\sim\mathrm{Gompertz}(\beta=1,\,\eta=0.25\cdot\mathrm{HR}(k)^{\bm{a}}),\quad C|T\sim\mathrm{Uniform}(0,T)$$

Finally, we sample the censoring indicator $\Delta\sim\mathrm{Bernoulli}(0.8)$ and set the observed time-to-event $U=T$ if $\Delta=1$; otherwise we set $U=C$.
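For concreteness, the following is a minimal sketch of this generative process. It assumes scipy's Gompertz parameterization matches the shape/rate convention above, and the group means, scales, and conditional hazard ratios are illustrative placeholders, since the paper does not list them.

```python
import numpy as np
from scipy.stats import gompertz

rng = np.random.default_rng(0)
n = 5000

# Illustrative (not the paper's) group-specific means and conditional hazard ratios HR(k).
mu = {0: [0.0, 0.0], 1: [1.5, 1.5], -1: [-1.5, 1.5]}
hr = {0: 1.0, 1: 0.5, -1: 2.0}

a = rng.binomial(1, 0.5, size=n)                           # A ~ Bernoulli(1/2)
z = rng.choice([0, 1, -1], size=n)                         # Z ~ Categorical(1/3)
x_signal = np.array([rng.normal(mu[k], 0.5) for k in z])   # X_{1:2} | Z
x_noise = rng.uniform(-1, 1, size=(n, 4))                  # X_{3:6}, noise covariates
x = np.hstack([x_signal, x_noise])

eta = 0.25 * np.array([hr[k] for k in z]) ** a             # Gompertz rate conditional on (Z, A), shape beta = 1
t = gompertz(c=eta).rvs(size=n, random_state=rng)          # true time-to-event
c = rng.uniform(0, t)                                      # censoring time C | T ~ Uniform(0, T)
delta = rng.binomial(1, 0.8, size=n)                       # censoring indicator
u = np.where(delta == 1, t, c)                             # observed time
```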

Figure 2 presents the ROC curves for SCS's ability to identify the groups with enhanced and diminished treatment effects, respectively. In Figure 3 we present Kaplan-Meier estimates of the time-to-event distributions conditioned on the $Z$ predicted by SCS. Clearly, SCS is able to identify the phenogroups corresponding to differential benefits.

4.2 Recovering subgroups demonstrating Heterogeneous Treatment Effects from Landmark studies of Cardiovascular Health

                 ALLHAT                     BARI2D
Size             18,102                     2,368
Outcome          Combined CVD               Death, MI or Stroke
Intervention     Lisinopril                 Medical Therapy
Control          Amlodipine                 Early Revascularization
Hazard Ratio     1.06 (1.01, 1.12)          1.02 (0.81, 1.14)
5-year RMST      -24.86 (-37.35, -8.89)     23.26 (-27.01, 64.84)

Figure 4: Event-free Kaplan-Meier survival curves stratified by the treatment assignment and summary statistics for the ALLHAT and BARI2D studies. (Combined CVD: Coronary Heart Disease, Stroke, other treated angina, fatal or non-fatal Heart Failure, and Peripheral Arterial Disease.)
Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack

(Furberg et al., 2002)

The ALLHAT study was a large randomized experiment conducted to assess the efficacy of multiple classes of blood pressure lowering medicines for patients with hypertension in reducing the risk of adverse cardiovascular conditions. We considered a subset of patients from the original ALLHAT study who were randomized to receive either Amlodipine (a calcium channel blocker) or Lisinopril (an angiotensin-converting enzyme inhibitor). Overall, Amlodipine was found to be more efficacious than Lisinopril in reducing the combined risk of cardiovascular disease.

Bypass Angioplasty Revascularization Investigation in Type II Diabetes

(Group, 2009)

Diabetic patients have traditionally been known to be at higher risk of cardiovascular disease; however, the appropriate intervention for diabetics with ischemic heart disease, between surgical coronary revascularization and management with medical therapy, is widely debated. BARI2D was a large landmark experiment conducted to assess efficacy between these two possible medical interventions. Overall, BARI2D was inconclusive in establishing the appropriate therapy, between coronary revascularization and medical management, for patients with Type-II diabetes.

Figure 4 presents the event-free survival rates as well as the summary statistics for the studies. In our experiments, we included a large set of confounders collected at the baseline visit of the patients, which we utilize to train the proposed model. A full list of these features is in Appendix A.

4.3 Baselines

Cox PH with 1\ell_{1} Regularized Treatment Interaction (cox-int)


We model treatment effect heterogeneity via interaction terms in a proportional hazards model of the time-to-event distribution, as in Kehl and Ulm (2006). Thus,

$$\bm{\lambda}(t|X={\bm{x}},A={\bm{a}})=\bm{\lambda}_{0}(t)\exp\big(\bm{\beta}^{\top}{\bm{x}}+{\bm{a}}\cdot\bm{\theta}^{\top}{\bm{x}}\big) \quad (10)$$

The interaction effects $\bm{\theta}$ are regularized with a lasso penalty in order to recover a sparse phenotyping rule defined as $G(\bm{x})=\bm{\theta}^{\top}\bm{x}$.
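A minimal Python sketch of this baseline is given below; the paper fits these models with glmnet in R, so lifelines' penalized Cox regression is used here only as a stand-in, the column names are illustrative, and (unlike the baseline described above) the penalty below is applied to all coefficients rather than only the interaction terms.

```python
import pandas as pd
from lifelines import CoxPHFitter

def fit_cox_int(df, covariates, penalizer=0.1):
    """Cox PH with l1-regularized treatment-interaction terms (Equation 10).

    df must contain the covariates plus 'time', 'event' and a binary 'treatment' column.
    """
    data = df[covariates + ["time", "event", "treatment"]].copy()
    for c in covariates:                                   # build a * x interaction features
        data[f"{c}_x_treatment"] = data[c] * data["treatment"]
    cph = CoxPHFitter(penalizer=penalizer, l1_ratio=1.0)   # pure lasso penalty
    cph.fit(data, duration_col="time", event_col="event")
    interaction_cols = [f"{c}_x_treatment" for c in covariates]
    return cph, cph.params_[interaction_cols]              # theta: coefficients of the phenotyping rule

# The phenotyping rule G(x) = theta^T x is the dot product of the returned interaction
# coefficients with a new patient's covariate vector.
```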

Binary Classifier with 1\ell_{1} Regularized Treatment Interaction (bin-int)


Instead of modelling the time-to-event distribution, we directly model the thresholded survival outcome $Y=\bm{1}\{T<t\}$ at a five-year time horizon using a log-linear parameterization with a logit link function. As compared to cox-int, this model ignores the data points that were right-censored prior to the thresholded time horizon; however, it is not sensitive to the strong assumption of proportional hazards.

$$\mathbb{E}[T>t|X={\bm{x}},A={\bm{a}}]=\sigma(\bm{\beta}^{\top}{\bm{x}}+\bm{\beta}_{0}+{\bm{a}}\cdot\bm{\theta}^{\top}{\bm{x}}),\quad\text{where }\sigma(\cdot)\text{ is the logistic link function.} \quad (11)$$
Cox PH T-Learner with 1\ell_{1} Regularized Logistic Regression (cox-tlr)


We train two separate Cox regression models on the treated and control arms (T-Learner) to estimate the potential outcomes under treatment $(A=1)$ and control $(A=0)$. Motivated by the ‘Virtual Twins’ approach of Foster et al. (2011), a logistic regression with an $\ell_{1}$ penalty is trained to estimate whether the risk of the potential outcome under treatment is higher than under control. This logistic regression is then employed as the phenotyping function $G(\cdot)$, given as,

$$G({\bm{x}})=\mathbb{E}[\bm{1}\{f_{1}({\bm{x}},t)>f_{0}({\bm{x}},t)\}|X={\bm{x}}],\quad\text{where }f_{\bm{a}}({\bm{x}},t)=\mathbb{P}(T>t|\text{do}(A=\bm{a}),X={\bm{x}}). \quad (12)$$

The above models involving sparse 1\ell_{1} regularization were trained with the glmnet (Friedman et al., 2009) package in R.
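As a hedged illustration of the cox-tlr baseline, the sketch below substitutes lifelines and scikit-learn for the glmnet fits used in the paper; the five-year horizon and column names are assumptions.

```python
from lifelines import CoxPHFitter
from sklearn.linear_model import LogisticRegression

def fit_cox_tlr(df, covariates, horizon=5 * 365.25):
    """Virtual-twins style T-learner: two Cox models plus an l1 logistic phenotyping function."""
    arms = {}
    for a in (0, 1):                                       # separate Cox model per treatment arm
        cph = CoxPHFitter()
        cph.fit(df.loc[df["treatment"] == a, covariates + ["time", "event"]],
                duration_col="time", event_col="event")
        arms[a] = cph

    # Estimated potential survival at the horizon under control (f_0) and treatment (f_1).
    s0 = arms[0].predict_survival_function(df[covariates], times=[horizon]).values.ravel()
    s1 = arms[1].predict_survival_function(df[covariates], times=[horizon]).values.ravel()

    # Phenotyping function G: l1-regularized logistic regression on 1{f_1 > f_0}.
    y = (s1 > s0).astype(int)
    pheno = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    pheno.fit(df[covariates], y)
    return arms, pheno
```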

The ACC/AHA Long-term Atherosclerotic Cardiovascular Risk Estimate
(https://tools.acc.org/ascvd-risk-estimator-plus/)


The American College of Cardiology and the American Heart Association model for estimation of the risk of atherosclerotic disease (Goff Jr et al., 2014) involves pooling data from multiple observational cohorts of patients, followed by modelling the 10-year risk of an adverse cardiovascular condition, including death from coronary heart disease, non-fatal myocardial infarction, or non-fatal stroke. While the risk model was originally developed to assess factual risk in the observational sense, in practice it is also employed to assess risk when making counterfactual decisions.

Amlodipine versus Lisinopril in the ALLHAT Study

Figure 5: Conditional Average Treatment Effect in Hazard Ratio versus subgroup size for the latent phenogroups extracted from the ALLHAT study, at sparsity levels $||\bm{\theta}||_{0}\leq 5$, $||\bm{\theta}||_{0}\leq 10$, and with no sparsity.

Early Revascularization versus Medical Therapy in the BARI2D Study

Figure 6: Conditional Average Treatment Effect in Hazard Ratio versus subgroup size for the latent phenogroups extracted from the BARI2D study, at sparsity levels $||\bm{\theta}||_{0}\leq 5$, $||\bm{\theta}||_{0}\leq 10$, and with no sparsity.

4.4 Results and Discussion

Protocol

We compare the performance of SCS and the competing methods in the recovery of subgroups with enhanced (or diminished) treatment effects. For each of these studies we stratify the study population into equal-sized sets for training and validation, while preserving the proportion of individuals that were assigned to treatment and that experienced the outcome in the follow-up period. The models were trained on the training set and validated on the held-out test set. For each of the methods we experiment with models that do not enforce any sparsity ($\bm{\epsilon}=0$), as well as tune the level of sparsity to recover phenotyping functions that involve 5 and 10 features. The subgroup sizes are varied by controlling the threshold at which an individual is assigned to a group. Finally, the treatment effect is compared in terms of hazard ratios, risk differences, as well as restricted mean survival time over a 5-year event period.
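As an illustration of this protocol, the sketch below (using lifelines, with illustrative column names) estimates the treatment effect in hazard ratio within the top-q fraction of patients ranked by a phenotyping score; sweeping q over {0.2, 0.4, 0.6, 0.8} corresponds to the subgroup sizes reported in the figures.

```python
import numpy as np
from lifelines import CoxPHFitter

def subgroup_hazard_ratio(df, score, q=0.2):
    """Hazard ratio of treatment vs control within the top-q fraction by phenotyping score.

    df    : held-out data with 'time', 'event' and a binary 'treatment' column
    score : per-patient phenotyping score (e.g., G(x)); larger means more likely in the subgroup
    """
    threshold = np.quantile(score, 1 - q)                  # vary q to sweep the subgroup size
    subgroup = df.loc[score >= threshold, ["time", "event", "treatment"]]
    cph = CoxPHFitter()
    cph.fit(subgroup, duration_col="time", event_col="event")
    return np.exp(cph.params_["treatment"])                # conditional hazard ratio in the subgroup
```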

Results

We present the results of SCS versus the baselines in terms of hazard ratios on the ALLHAT and BARI2D datasets in Figures 5 and 6. In the case of ALLHAT, SCS consistently recovered phenogroups with more pronounced (or diminished) treatment effects. On external validation on the held-out dataset, we found a subgroup of patients that had similar outcomes whether assigned to Lisinopril or Amlodipine, whereas the other subgroup clearly identified patients that were harmed by Lisinopril. The group harmed by Lisinopril had higher diastolic blood pressure. On the other hand, patients with lower kidney function did not seem to benefit from Amlodipine.

In the case of BARI2D, SCS recovered phenogroups that were both harmed by and benefitted from medical therapy alone, without revascularization. The patients who were harmed by medical therapy were typically older; on the other hand, the patients who benefitted primarily included patients who were otherwise assigned to receive PCI instead of CABG revascularization, suggesting PCI to be harmful for diabetic patients.

Tables 3 and 4 present the features that were selected by the proposed model for the studies. Additionally, we report tabulated results involving metrics like risk difference and restricted mean survival time in Appendix C.

5 Concluding Remarks

We presented Sparse Cox Subgrouping (SCS), a latent variable approach to recover subgroups of patients that respond differentially to an intervention in the presence of censored time-to-event outcomes. As compared to alternative approaches to learning parsimonious hypotheses in such settings, our proposed model recovered hypotheses with more pronounced treatment effects, which we validated on multiple studies of cardiovascular health.

While powerful in its ability to recover parsimonious subgroups, SCS has limitations in its current form. The model relies on proportional hazards and may be mis-specified when the proportional hazards assumption is violated, as is evident in many real-world clinical studies (Maron et al., 2018; Bretthauer et al., 2022). Another limitation is that SCS in its current form considers only a single endpoint (typically death, or a composite of multiple adverse outcomes). In practice, however, real-world studies typically involve multiple endpoints. We envision that extensions of SCS would allow patient subgrouping across multiple endpoints, leading to the discovery of actionable sub-populations that similarly benefit from the intervention under assessment.

References

  • Alaa and Van Der Schaar (2017) Ahmed M Alaa and Mihaela Van Der Schaar. Bayesian inference of individualized treatment effects using multi-task gaussian processes. Advances in neural information processing systems, 30, 2017.
  • Binder (1992) David A Binder. Fitting cox’s proportional hazards models from survey data. Biometrika, 79(1):139–147, 1992.
  • Breslow (1972) Norman E Breslow. Contribution to discussion of paper by dr cox. J. Roy. Statist. Soc., Ser. B, 34:216–217, 1972.
  • Bretthauer et al. (2022) Michael Bretthauer, Magnus Løberg, Paulina Wieszczy, Mette Kalager, Louise Emilsson, Kjetil Garborg, Maciej Rupinski, Evelien Dekker, Manon Spaander, Marek Bugajski, et al. Effect of colonoscopy screening on risks of colorectal cancer and related death. New England Journal of Medicine, 2022.
  • Buse et al. (2007) John B Buse, ACCORD Study Group, et al. Action to control cardiovascular risk in diabetes (accord) trial: design and methods. The American journal of cardiology, 99(12):S21–S33, 2007.
  • Chapfuwa et al. (2021) Paidamoyo Chapfuwa, Serge Assaad, Shuxi Zeng, Michael J Pencina, Lawrence Carin, and Ricardo Henao. Enabling counterfactual survival analysis with balanced representations. In Proceedings of the Conference on Health, Inference, and Learning, pages 133–145, 2021.
  • Chen (2009) Yi-Hau Chen. Weighted breslow-type and maximum likelihood estimation in semiparametric transformation models. Biometrika, 96(3):591–600, 2009.
  • Cox (1972) David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202, 1972.
  • Crabbé et al. (2022) Jonathan Crabbé, Alicia Curth, Ioana Bica, and Mihaela van der Schaar. Benchmarking heterogeneous treatment effect models through the lens of interpretability. arXiv preprint arXiv:2206.08363, 2022.
  • Curth et al. (2021) Alicia Curth, Changhee Lee, and Mihaela van der Schaar. Survite: Learning heterogeneous treatment effects from time-to-event data. Advances in Neural Information Processing Systems, 34:26740–26753, 2021.
  • Dusseldorp and Mechelen (2014) Elise Dusseldorp and Iven Mechelen. Qualitative interaction trees: A tool to identify qualitative treatment-subgroup interactions. Statistics in medicine, 33, 01 2014. 10.1002/sim.5933.
  • Foster et al. (2011) Jared C Foster, Jeremy MG Taylor, and Stephen J Ruberg. Subgroup identification from randomized clinical trial data. Statistics in medicine, 30(24):2867–2880, 2011.
  • Friedman et al. (2009) Jerome Friedman, Trevor Hastie, Rob Tibshirani, et al. glmnet: Lasso and elastic-net regularized generalized linear models. R package version, 1(4):1–24, 2009.
  • Friedman et al. (2010) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736, 2010.
  • Furberg et al. (2002) Curt D Furberg et al. Major outcomes in high-risk hypertensive patients randomized to angiotensin-converting enzyme inhibitor or calcium channel blocker vs diuretic: the antihypertensive and lipid-lowering treatment to prevent heart attack trial (allhat). Journal of the American Medical Association, 2002.
  • Goff Jr et al. (2014) David C Goff Jr, Donald M Lloyd-Jones, Glen Bennett, Sean Coady, Ralph B D’agostino, Raymond Gibbons, Philip Greenland, Daniel T Lackland, Daniel Levy, Christopher J O’donnell, et al. 2013 acc/aha guideline on the assessment of cardiovascular risk: a report of the american college of cardiology/american heart association task force on practice guidelines. Circulation, 129(25_suppl_2):S49–S73, 2014.
  • Group (2009) BARI 2D Study Group. A randomized trial of therapies for type 2 diabetes and coronary artery disease. New England Journal of Medicine, 360(24):2503–2515, 2009.
  • Herrington et al. (2016) William Herrington, Ben Lacey, Paul Sherliker, Jane Armitage, and Sarah Lewington. Epidemiology of atherosclerosis and the potential to reduce the global burden of atherothrombotic disease. Circulation research, 118(4):535–546, 2016.
  • Johansson et al. (2020) Fredrik D Johansson, Uri Shalit, Nathan Kallus, and David Sontag. Generalization bounds and representation learning for estimation of potential outcomes and causal effects. arXiv preprint arXiv:2001.07426, 2020.
  • Kehl and Ulm (2006) Victoria Kehl and Kurt Ulm. Responder identification in clinical trials with censored data. Computational Statistics & Data Analysis, 50(5):1338–1355, 2006.
  • Lee et al. (2020) Kwonsang Lee, Falco J Bargagli-Stoffi, and Francesca Dominici. Causal rule ensemble: Interpretable inference of heterogeneous treatment effects. arXiv preprint arXiv:2009.09036, 2020.
  • Lin (2007) DY Lin. On the breslow estimator. Lifetime data analysis, 13(4):471–480, 2007.
  • Lipkovich et al. (2011) Ilya Lipkovich, Alex Dmitrienko, Jonathan Denne, and Gregory Enas. Subgroup identification based on differential effect search (sides) – a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in medicine, 30:2601–21, 07 2011. 10.1002/sim.4289.
  • Louizos et al. (2017) Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. Advances in neural information processing systems, 30, 2017.
  • Maron et al. (2018) David J Maron, Judith S Hochman, Sean M O’Brien, Harmony R Reynolds, William E Boden, Gregg W Stone, Sripal Bangalore, John A Spertus, Daniel B Mark, Karen P Alexander, et al. International study of comparative health effectiveness with medical and invasive approaches (ischemia) trial: rationale and design. American heart journal, 201:124–135, 2018.
  • Morucci et al. (2020) Marco Morucci, Vittorio Orlandi, Sudeepa Roy, Cynthia Rudin, and Alexander Volfovsky. Adaptive hyper-box matching for interpretable individualized treatment effect estimation. In Conference on Uncertainty in Artificial Intelligence, pages 1089–1098. PMLR, 2020.
  • Nagpal et al. (2020) Chirag Nagpal, Dennis Wei, Bhanukiran Vinzamuri, Monica Shekhar, Sara E Berger, Subhro Das, and Kush R Varshney. Interpretable subgroup discovery in treatment effect estimation with application to opioid prescribing guidelines. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 19–29, 2020.
  • Nagpal et al. (2021) Chirag Nagpal, Steve Yadlowsky, Negar Rostamzadeh, and Katherine Heller. Deep cox mixtures for survival regression. In Machine Learning for Healthcare Conference, pages 674–708. PMLR, 2021.
  • Nagpal et al. (2022a) Chirag Nagpal, Mononito Goswami, Keith Dufendach, and Artur Dubrawski. Counterfactual phenotyping with censored time-to-events. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, page 3634–3644, New York, NY, USA, 2022a. Association for Computing Machinery. ISBN 9781450393850. 10.1145/3534678.3539110. URL https://doi.org/10.1145/3534678.3539110.
  • Nagpal et al. (2022b) Chirag Nagpal, Willa Potosnak, and Artur Dubrawski. auton-survival: an open-source package for regression, counterfactual estimation, evaluation and phenotyping with censored time-to-event data. In Proceedings of the 7th Machine Learning for Healthcare Conference, volume 182 of Proceedings of Machine Learning Research, pages 585–608. PMLR, 05–06 Aug 2022b. URL https://proceedings.mlr.press/v182/nagpal22a.html.
  • Shalit et al. (2017) Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, pages 3076–3085. PMLR, 2017.
  • Su et al. (2009) Xiaogang Su, Chih-Ling Tsai, Hansheng Wang, David Nickerson, and Bogong Li. Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10:141–158, 02 2009. 10.2139/ssrn.1341380.
  • Wager and Athey (2018) Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
  • Wang and Rudin (2022) Tong Wang and Cynthia Rudin. Causal rule sets for identifying subgroups with enhanced treatment effects. INFORMS Journal on Computing, 2022.
  • Wu et al. (2022) Han Wu, Sarah Tan, Weiwei Li, Mia Garrard, Adam Obeng, Drew Dimmery, Shaun Singh, Hanson Wang, Daniel Jiang, and Eytan Bakshy. Interpretable personalized experimentation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4173–4183, 2022.
  • Xu et al. (2022) Yizhe Xu, Nikolaos Ignatiadis, Erik Sverdrup, Scott Fleming, Stefan Wager, and Nigam Shah. Treatment heterogeneity with survival outcomes. arXiv preprint arXiv:2207.07758, 2022.

Appendix A Additional Details on the ALLHAT and BARI 2D Case Studies

Tables 1 and 2 list the confounding variables used from the ALLHAT and BARI2D trials, respectively.

Name Description
ETHNIC Ethnicity
SEX Sex of Participant
ESTROGEN Estrogen supplementation
BLMEDS Antihypertensive treatment
MISTROKE History of Stroke
HXCABG History of coronary artery bypass
STDEPR Prior ST depression/T-wave inversion
OASCVD Other atherosclerotic cardiovascular disease
DIABETES Prior history of Diabetes
HDLLT35 HDL cholesterol <35mg/dl; 2x in past 5 years
LVHECG LVH by ECG in past 2 years
WALL25 LVH by ECG in past 2 years
LCHD History of CHD at baseline
CURSMOKE Current smoking status.
ASPIRIN Aspirin use
LLT Lipid-lowering trial
AGE Age upon entry
BLWGT Weight upon entry
BLHGT Height upon entry
BLBMI Body Mass Index upon entry
BV2SBP Baseline SBP
BV2DBP Baseline DBP
APOTAS Baseline serum potassium
BLGFR Baseline est glomerular filtration rate
ACHOL Total Cholesterol
AHDL Baseline HDL Cholesterol
AFGLUC Baseline fasting serum glucose
Table 1: List of confounding variables used for experiments involving the ALLHAT dataset.
Name Description
hxmi History of MI
age Age upon entry
dbp_stand Standing diastolic BP
sbp_stand Standing systolic BP
sex Sex
asp Aspirin use
smkcat Cigarette smoking category
betab Beta blocker use
ccb Calcium blocker use
hxhtn History of hypertension requiring tx
insulin Insulin use
weight Weight (kg) upon entry
bmi BMI upon entry
qabn Abnormal Q-Wave
trig Triglycerides (mg/dl) upon entry
dmdur Duration of diabetes mellitus
ablvef Left ventricular ejection fraction <50%
race Race
priorrev Prior revascularization
hxcva Cerebrovascular accident
screat Serum creatinine (mg/dl)
hmg Statin
hxhypo History of hypoglycemic episode
hba1c Hemoglobin A1c(%)
priorstent Prior stent
spotass Serum Potassium(mEq/L)
hispanic Hispanic ethnicity
tchol Total Cholesterol
hdl HDL Cholesterol
insul_circ Circulating insulin (IU/ml)
tzd Thiazolidinedione
ldl LDL Cholesterol
tabn Abnormal T-waves
nsgn Nonsublingual nitrate
sulf Sulfonylurea
hxchf History of congestive heart failure req tx
arb Angiotensin receptor blocker
acr Urine albumin/creatinine ratio mg/g
diur Diuretic
apa Anti-platelet
hxchl Hypercholesterolemia req tx
acei ACE inhibitor
abilow Low ABI (<= 0.9)
biguanide Biguanide
stabn Abnormal ST depression
Table 2: List of confounding variables used for experiments involving the BARI2D dataset.

ALLHAT
Name Description
BV2SBP Baseline Seated Diastolic Pressure
BLGFR Baseline est Glomerular Filtration Rate
BLMEDS Antihypertensive Treatment
CURSMOKE Current Smoking Status
SEX Sex of Participant

BARI2D
Name Description
age Age upon entry
asp Aspirin use
hxhtn History of hypertension requiring tx
hxchl Hypercholesterolemia req tx
priorstent Prior stent

Table 3: List of selected features with sparsity level: $||\bm{\theta}||_{0}\leq 5$

ALLHAT
Name Description
BV2SBP Baseline Seated Diastolic Pressure
BLGFR Baseline est Glomerular Filtration Rate
BLMEDS Antihypertensive Treatment
CURSMOKE Current Smoking Status
SEX Sex of Participant
ASPIRIN Aspirin Use
ACHOL Total Cholesterol
BLWGT Weight upon entry
BMI Body mass index upon entry
OASCVD Other atherosclerotic cardiovascular disease

BARI2D
Name Description
age Age upon entry
asp Aspirin use
hxhtn History of hypertension requiring tx
hxchl Hypercholesterolemia req tx
priorstent Prior stent
acei ACE Inhibitor
acr Urine albumin/creatinine ratio mg/g
insul_circ Circulating insulin
screat Serum creatinine (mg/dl)
tchol Total Cholesterol

Table 4: List of selected features with sparsity level: $||\bm{\theta}||_{0}\leq 10$

Appendix B Derivation of the Inference Algorithm

Censored Instances: Note that in the case of the censored instances we condition on the thresholded survival $(T>{\bm{u}})$. The posterior counts thus reduce to:

$$\bm{\gamma}^{k}=\mathbb{P}(Z=k|X={\bm{x}},A={\bm{a}},T>{\bm{u}})=\frac{\mathbb{P}(T>{\bm{t}}|Z={\bm{k}},X={\bm{x}},A={\bm{a}})\,\bm{p}(Z={\bm{k}}|X={\bm{x}})}{\sum_{k}\mathbb{P}(T>{\bm{t}}|Z={\bm{k}},X={\bm{x}},A={\bm{a}})\,\mathbb{P}(Z={\bm{k}}|X={\bm{x}})} \quad (13)$$
Here, $\mathbb{P}(T>{\bm{t}}|Z={\bm{k}},X={\bm{x}},A={\bm{a}})=\exp\big(-\bm{\Lambda}(t)\big)^{\bm{h}({\bm{x}},{\bm{a}},k)}$.

Uncensored Instances: The posteriors are $\bm{\gamma}^{k}=\bm{p}_{\bm{\theta}}(Z=k|X={\bm{x}},T={\bm{u}})$.

Posteriors for the uncensored data are more involved and involve the base hazard $\bm{\lambda}_{0}(\cdot)$; however, as it cancels in the ratio below, they are in fact independent of the base hazard function $\bm{\lambda}_{0}(\cdot)$:

$$\bm{\gamma}^{k}=\frac{\bm{\lambda}_{0}(u)\,\bm{h}_{k}({\bm{x}},\bm{a})\,{\bm{S}}_{0}(u)^{\bm{h}_{k}({\bm{x}},\bm{a})}}{\sum_{k}\bm{\lambda}_{0}(u)\,\bm{h}_{k}({\bm{x}},\bm{a})\,{\bm{S}}_{0}(u)^{\bm{h}_{k}({\bm{x}},\bm{a})}}=\frac{\bm{h}_{k}({\bm{x}},\bm{a})\,{\bm{S}}_{0}(u)^{\bm{h}_{k}({\bm{x}},\bm{a})}}{\sum_{k}\bm{h}_{k}({\bm{x}},\bm{a})\,{\bm{S}}_{0}(u)^{\bm{h}_{k}({\bm{x}},\bm{a})}}$$

Combining Equation 13 with the expression above, we arrive at the following estimate for the posterior counts,

$$\bm{\gamma}^{k}=\widehat{\mathbb{P}}(Z=k|X={\bm{x}},A={\bm{a}},{\bm{u}})=\frac{\mathbb{P}({\bm{u}}|Z={\bm{k}},X={\bm{x}},A={\bm{a}})\,\mathbb{P}(Z={\bm{k}}|X={\bm{x}})}{\sum_{k}\mathbb{P}({\bm{u}}|Z={\bm{k}},X={\bm{x}},A={\bm{a}})\,\mathbb{P}(Z={\bm{k}}|X={\bm{x}})}=\frac{{\bm{h}}({\bm{x}},{\bm{a}},{\bm{k}})^{\delta_{i}}\,\widehat{{\bm{S}}}_{0}({\bm{u}})^{{\bm{h}}({\bm{x}},{\bm{a}},{\bm{k}})}\exp(\bm{\theta}_{\bm{k}}^{\top}\bm{x})}{\sum_{j\in\mathcal{Z}}{\bm{h}}({\bm{x}},{\bm{a}},{\bm{j}})^{\delta_{i}}\,\widehat{{\bm{S}}}_{0}({\bm{u}})^{{\bm{h}}({\bm{x}},{\bm{a}},{\bm{j}})}\exp(\bm{\theta}_{\bm{j}}^{\top}\bm{x})}. \quad (14)$$

Appendix C Additional Results

Figures 7, 8, and 9 present tabulated metrics on ALLHAT in terms of Hazard Ratio, Risk Difference and Restricted Mean Survival Time, respectively. Figures 10, 11, and 12 present tabulated metrics on BARI2D in terms of Hazard Ratio, Risk Difference and Restricted Mean Survival Time, respectively.

$||\bm{\theta}||_{0}\leq 5$

20% 40% 60% 80%
SCS 1.31±0.11 1.22±0.09 1.13±0.04 1.07±0.06
COX-INT 1.17±0.12 1.09±0.06 1.06±0.06 1.05±0.05
COX-TLR 1.26±0.12 1.11±0.08 1.05±0.06 1.04±0.04
BIN-INT 1.12±0.11 1.08±0.06 1.08±0.06 1.05±0.05
ASCVD 1.07±0.1 1.09±0.07 1.02±0.06 1.06±0.05

20% 40% 60% 80%
SCS 1.03±0.08 0.95±0.06 0.98±0.05 1.01±0.04
COX-INT 1.05±0.1 1.04±0.05 1.04±0.05 1.04±0.05
COX-TLR 1.07±0.09 1.05±0.06 1.02±0.05 1.02±0.05
BIN-INT 1.08±0.1 1.03±0.08 1.05±0.05 1.05±0.04
ASCVD 1.07±0.1 1.09±0.07 1.02±0.06 1.06±0.05

$||\bm{\theta}||_{0}\leq 10$

20% 40% 60% 80%
SCS 1.51±0.14 1.26±0.08 1.07±0.06 1.07±0.05
COX-INT 1.25±0.12 1.22±0.08 1.09±0.06 1.04±0.05
COX-TLR 1.43±0.15 1.11±0.08 1.07±0.05 1.04±0.05
BIN-INT 1.13±0.11 1.05±0.05 1.06±0.05 1.04±0.04
ASCVD 1.07±0.1 1.09±0.07 1.02±0.06 1.06±0.05

20% 40% 60% 80%
SCS 1.05±0.07 1.04±0.07 0.99±0.04 1.0±0.04
COX-INT 1.08±0.1 1.05±0.06 0.98±0.05 1.02±0.05
COX-TLR 1.12±0.1 1.05±0.05 1.05±0.06 1.02±0.05
BIN-INT 1.07±0.1 1.06±0.06 1.07±0.06 1.03±0.05
ASCVD 1.07±0.1 1.09±0.07 1.02±0.06 1.06±0.05

No Sparsity

20% 40% 60% 80%
SCS 1.37±0.16 1.22±0.09 1.1±0.07 1.06±0.04
COX-INT 1.42±0.17 1.2±0.08 1.09±0.05 1.06±0.05
COX-TLR 1.37±0.14 1.12±0.07 1.06±0.06 1.05±0.05
BIN-INT 1.13±0.11 1.05±0.07 1.02±0.06 1.02±0.04
ASCVD 1.07±0.1 1.09±0.07 1.02±0.06 1.06±0.05

20% 40% 60% 80%
SCS 1.07±0.07 1.02±0.06 1.0±0.04 1.01±0.05
COX-INT 1.05±0.06 1.02±0.05 1.01±0.05 0.99±0.04
COX-TLR 1.1±0.1 1.05±0.06 1.03±0.05 1.0±0.04
BIN-INT 1.15±0.09 1.09±0.05 1.07±0.06 1.03±0.05
ASCVD 1.07±0.1 1.09±0.07 1.02±0.06 1.06±0.05

Figure 7: Conditional Average Treatment Effect in Hazard Ratio versus subgroup size for the latent phenogroups extracted from the ALLHAT study.

$||\bm{\theta}||_{0}\leq 5$

20% 40% 60% 80%
SCS -0.05±0.02 -0.03±0.02 -0.01±0.01 -0.0±0.01
COX-INT -0.06±0.02 -0.02±0.01 -0.01±0.01 -0.0±0.01
COX-TLR -0.06±0.02 -0.02±0.02 -0.0±0.01 -0.0±0.01
BIN-INT -0.01±0.02 -0.01±0.01 0.0±0.01 0.0±0.01
ASCVD -0.01±0.03 -0.01±0.02 0.01±0.02 -0.01±0.01

20% 40% 60% 80%
SCS -0.02±0.02 -0.0±0.02 0.0±0.01 0.01±0.01
COX-INT -0.02±0.02 -0.0±0.02 0.0±0.01 0.01±0.01
COX-TLR -0.02±0.02 -0.01±0.02 0.0±0.01 0.01±0.01
BIN-INT -0.02±0.02 -0.01±0.01 -0.01±0.02 -0.0±0.01
ASCVD -0.01±0.03 -0.01±0.02 0.01±0.02 -0.01±0.01

$||\bm{\theta}||_{0}\leq 10$

20% 40% 60% 80%
SCS -0.07±0.02 -0.04±0.02 -0.0±0.01 -0.0±0.01
COX-INT -0.03±0.02 -0.03±0.01 -0.01±0.01 0.0±0.01
COX-TLR -0.06±0.02 -0.01±0.02 -0.0±0.01 -0.0±0.01
BIN-INT -0.02±0.02 -0.01±0.01 -0.01±0.01 -0.0±0.01
ASCVD -0.01±0.03 -0.01±0.02 0.01±0.02 -0.01±0.01

20% 40% 60% 80%
SCS -0.02±0.02 -0.01±0.02 0.01±0.01 0.01±0.01
COX-INT -0.02±0.03 -0.01±0.02 0.01±0.01 0.0±0.01
COX-TLR -0.02±0.03 -0.01±0.02 -0.01±0.02 0.0±0.01
BIN-INT -0.01±0.02 -0.0±0.02 -0.01±0.02 -0.0±0.01
ASCVD -0.01±0.03 -0.01±0.02 0.01±0.02 -0.01±0.01

No Sparsity

20% 40% 60% 80%
SCS -0.05±0.02 -0.04±0.02 -0.02±0.01 -0.01±0.01
COX-INT -0.02±0.02 -0.0±0.01 -0.0±0.01 -0.0±0.01
COX-TLR -0.04±0.02 -0.01±0.01 -0.0±0.01 0.0±0.01
BIN-INT -0.02±0.02 -0.01±0.02 -0.01±0.01 -0.0±0.01
ASCVD -0.01±0.03 -0.01±0.02 0.01±0.02 -0.01±0.01

20% 40% 60% 80%
SCS -0.01±0.02 0.02±0.02 0.01±0.01 0.0±0.01
COX-INT -0.01±0.03 -0.01±0.01 -0.01±0.01 -0.0±0.01
COX-TLR -0.03±0.02 -0.01±0.02 -0.0±0.01 0.0±0.01
BIN-INT -0.02±0.02 -0.01±0.02 -0.0±0.01 -0.0±0.01
ASCVD -0.01±0.03 -0.01±0.02 0.01±0.02 -0.01±0.01

Figure 8: Conditional Average Treatment Effect in Risk versus subgroup size for the latent phenogroups extracted from the ALLHAT study.

$||\bm{\theta}||_{0}\leq 5$

20% 40% 60% 80%
SCS -80.91±24.81 -60.39±17.44 -32.92±15.67 -24.33±11.35
COX-INT -79.47±22.1 -55.03±15.93 -30.2±13.11 -22.46±11.68
COX-TLR -88.58±22.92 -37.01±15.97 -24.48±15.29 -22.49±12.61
BIN-INT -19.52±22.12 -14.56±15.51 -14.59±15.14 -12.92±10.25
ASCVD -18.81±30.57 -29.74±18.53 -13.37±18.52 -28.73±14.45

20% 40% 60% 80%
SCS -38.04±22.85 -16.19±18.4 -4.54±14.54 -9.46±12.27
COX-INT -22.11±23.05 -19.93±18.72 -9.21±15.22 -7.63±14.01
COX-TLR -37.29±28.39 -22.86±19.1 -16.42±13.25 -8.0±12.11
BIN-INT -57.22±25.47 -34.3±16.07 -35.18±16.41 -22.37±14.36
ASCVD -18.81±30.57 -29.74±18.53 -13.37±18.52 -28.73±14.45

$||\bm{\theta}||_{0}\leq 10$

20% 40% 60% 80%
SCS -101.97±20.57 -69.01±17.72 -28.38±14.82 -26.13±12.52
COX-INT -57.8±22.46 -53.56±16.89 -27.46±13.79 -17.04±13.01
COX-TLR -74.04±19.52 -28.19±17.6 -25.25±13.57 -18.17±13.06
BIN-INT -21.45±33.0 -15.78±13.54 -18.12±14.34 -21.48±12.17
ASCVD -18.81±30.57 -29.74±18.53 -13.37±18.52 -28.73±14.45

20% 40% 60% 80%
SCS -27.85±23.85 -15.3±20.0 -2.78±12.58 -8.69±12.72
COX-INT -35.66±31.92 -26.32±19.65 -5.07±16.91 -14.43±13.43
COX-TLR -50.44±29.24 -26.65±15.8 -25.8±19.21 -18.25±13.94
BIN-INT -30.1±26.83 -36.65±17.93 -34.47±16.25 -21.25±13.13
ASCVD -18.81±30.57 -29.74±18.53 -13.37±18.52 -28.73±14.45

No Sparsity

20% 40% 60% 80%
SCS -85.16±23.76 -69.9±20.1 -44.31±10.2 -31.01±15.13
COX-INT -44.56±24.49 -28.22±15.64 -23.32±15.2 -21.78±13.45
COX-TLR -60.94±24.58 -38.11±18.07 -25.09±14.44 -20.61±11.56
BIN-INT -20.17±27.07 -21.04±17.57 -24.72±14.28 -20.06±12.56
ASCVD -18.81±30.57 -29.74±18.53 -13.37±18.52 -28.73±14.45

20% 40% 60% 80%
SCS 1.74±24.81 7.5±18.37 -0.1±13.34 -11.88±12.05
COX-INT -27.49±32.86 -22.41±19.13 -20.47±15.24 -22.05±13.78
COX-TLR -28.94±28.29 -23.05±19.46 -15.34±15.18 -14.23±14.52
BIN-INT -40.82±27.02 -25.63±22.29 -28.62±14.63 -27.02±11.01
ASCVD -18.81±30.57 -29.74±18.53 -13.37±18.52 -28.73±14.45

Figure 9: Conditional Average Treatment Effect in Restricted Mean Survival Time versus subgroup size for the latent phenogroups extracted from the ALLHAT study.

$||\bm{\theta}||_{0}\leq 5$

20% 40% 60% 80%
SCS 1.5±0.36 1.29±0.2 1.19±0.16 1.14±0.15
COX-INT 1.61±0.4 1.28±0.22 1.05±0.15 1.07±0.16
COX-TLR 1.24±0.22 1.2±0.19 1.16±0.16 1.09±0.13
BIN-INT 0.9±0.2 0.98±0.16 1.05±0.13 1.08±0.15
ASCVD 0.86±0.27 0.86±0.16 1.05±0.18 1.06±0.15

20% 40% 60% 80%
SCS 0.71±0.24 0.66±0.16 0.82±0.14 0.95±0.15
COX-INT 0.76±0.3 0.86±0.14 0.86±0.14 0.87±0.1
COX-TLR 0.67±0.26 0.81±0.17 0.85±0.16 0.98±0.14
BIN-INT 0.9±0.2 0.98±0.16 1.05±0.13 1.08±0.15
ASCVD 0.86±0.27 0.86±0.16 1.05±0.18 1.06±0.15

$||\bm{\theta}||_{0}\leq 10$

20% 40% 60% 80%
SCS 1.42±0.4 1.24±0.22 1.22±0.19 1.11±0.14
COX-INT 1.22±0.25 1.28±0.2 1.06±0.17 1.07±0.13
COX-TLR 1.15±0.27 1.29±0.22 1.18±0.2 1.09±0.15
BIN-INT 0.9±0.2 0.98±0.16 1.05±0.13 1.08±0.15
ASCVD 0.86±0.27 0.86±0.16 1.05±0.18 1.06±0.15

20% 40% 60% 80%
SCS 1.21±0.3 1.35±0.22 1.38±0.25 1.1±0.16
COX-INT 1.37±0.31 1.29±0.25 1.15±0.18 1.1±0.14
COX-TLR 1.34±0.4 1.18±0.26 1.16±0.17 1.05±0.14
BIN-INT 0.9±0.2 0.98±0.16 1.05±0.13 1.08±0.15
ASCVD 0.86±0.27 0.86±0.16 1.05±0.18 1.06±0.15

No Sparsity

20% 40% 60% 80%
SCS 1.21±0.3 1.35±0.22 1.38±0.25 1.1±0.16
COX-INT 1.37±0.31 1.29±0.25 1.15±0.18 1.1±0.14
COX-TLR 1.34±0.4 1.18±0.26 1.16±0.17 1.05±0.14
BIN-INT 0.9±0.2 0.98±0.16 1.05±0.13 1.08±0.15
ASCVD 0.86±0.27 0.86±0.16 1.05±0.18 1.06±0.15

20% 40% 60% 80%
SCS 0.7±0.26 0.68±0.15 0.75±0.11 0.9±0.11
COX-INT 0.7±0.21 0.8±0.16 0.88±0.19 0.92±0.14
COX-TLR 0.95±0.33 0.79±0.17 0.95±0.15 0.94±0.13
BIN-INT 0.9±0.2 0.98±0.16 1.05±0.13 1.08±0.15
ASCVD 0.86±0.27 0.86±0.16 1.05±0.18 1.06±0.15

Figure 10: Conditional Average Treatment Effect in Hazard Ratio versus subgroup size for the latent phenogroups extracted from the BARI 2D study.

$||\bm{\theta}||_{0}\leq 5$

20% 40% 60% 80%
SCS -0.08±0.05 -0.04±0.04 -0.03±0.03 -0.01±0.03
COX-INT -0.1±0.07 -0.04±0.05 0.01±0.04 -0.01±0.03
COX-TLR -0.03±0.05 -0.02±0.05 -0.02±0.03 -0.01±0.03
BIN-INT 0.06±0.06 0.03±0.04 0.01±0.03 -0.01±0.03
ASCVD 0.02±0.07 0.05±0.04 -0.0±0.04 -0.01±0.04

20% 40% 60% 80%
SCS 0.05±0.05 0.07±0.04 0.03±0.03 0.02±0.03
COX-INT 0.03±0.05 0.02±0.04 0.02±0.03 0.03±0.02
COX-TLR 0.05±0.06 0.02±0.04 0.02±0.03 0.01±0.03
BIN-INT 0.06±0.06 0.03±0.04 0.01±0.03 -0.01±0.03
ASCVD 0.02±0.07 0.05±0.04 -0.0±0.04 -0.01±0.04

$||\bm{\theta}||_{0}\leq 10$

20% 40% 60% 80%
SCS -0.07±0.06 -0.03±0.04 -0.04±0.04 -0.02±0.03
COX-INT -0.05±0.05 -0.04±0.04 0.01±0.04 -0.01±0.03
COX-TLR -0.02±0.06 -0.05±0.04 -0.02±0.04 -0.01±0.03
BIN-INT 0.06±0.06 0.03±0.04 0.01±0.03 -0.01±0.03
ASCVD 0.02±0.07 0.05±0.04 -0.0±0.04 -0.01±0.04

20% 40% 60% 80%
SCS -0.07±0.06 -0.03±0.04 -0.04±0.04 -0.02±0.03
COX-INT -0.05±0.05 -0.04±0.04 0.01±0.04 -0.01±0.03
COX-TLR -0.02±0.06 -0.05±0.04 -0.02±0.04 -0.01±0.03
BIN-INT 0.06±0.06 0.03±0.04 0.01±0.03 -0.01±0.03
ASCVD 0.02±0.07 0.05±0.04 -0.0±0.04 -0.01±0.04

No Sparsity

20% 40% 60% 80%
SCS 0.07±0.06 0.09±0.04 0.07±0.03 0.03±0.03
COX-INT 0.06±0.07 0.03±0.04 0.03±0.04 0.03±0.03
COX-TLR 0.01±0.07 0.05±0.04 0.01±0.04 0.02±0.03
BIN-INT 0.06±0.06 0.03±0.04 0.01±0.03 -0.01±0.03
ASCVD 0.02±0.07 0.05±0.04 -0.0±0.04 -0.01±0.04

20% 40% 60% 80%
SCS 0.07±0.06 0.09±0.04 0.07±0.03 0.03±0.03
COX-INT 0.06±0.07 0.03±0.04 0.03±0.04 0.03±0.03
COX-TLR 0.01±0.07 0.05±0.04 0.01±0.04 0.02±0.03
BIN-INT 0.06±0.06 0.03±0.04 0.01±0.03 -0.01±0.03
ASCVD 0.02±0.07 0.05±0.04 -0.0±0.04 -0.01±0.04

Figure 11: Conditional Average Treatment Effect in Risk versus subgroup size for the latent phenogroups extracted from the BARI 2D study.

$||\bm{\theta}||_{0}\leq 5$

20% 40% 60% 80%
SCS -30.51±71.06 -34.71±41.55 -11.77±43.75 -2.73±36.52
COX-INT -106.06±74.9 -31.3±55.18 16.38±43.89 8.93±42.04
COX-TLR -16.18±53.18 0.65±60.94 -0.7±46.54 1.29±35.33
BIN-INT 11.84±65.18 10.39±46.33 12.23±33.8 -0.6±34.6
ASCVD 108.54±109.62 48.36±56.77 -5.7±49.77 3.93±36.04

20% 40% 60% 80%
SCS 61.73±68.48 61.76±53.61 52.87±37.98 24.86±45.35
COX-INT 41.07±68.66 33.99±51.0 43.48±38.35 44.84±31.96
COX-TLR 60.48±71.86 23.63±51.48 32.45±39.39 25.35±34.38
BIN-INT 11.84±65.18 10.39±46.33 12.23±33.8 -0.6±34.6
ASCVD 108.54±109.62 48.36±56.77 -5.7±49.77 3.93±36.04

$||\bm{\theta}||_{0}\leq 10$

20% 40% 60% 80%
SCS -30.2±67.44 -14.45±55.58 -6.03±43.99 -3.8±39.02
COX-INT -23.18±69.92 -26.64±53.34 12.09±41.39 8.87±36.87
COX-TLR -10.22±73.05 -17.07±54.39 2.7±49.05 0.31±33.17
BIN-INT 11.84±65.18 10.39±46.33 12.23±33.8 -0.6±34.6
ASCVD 108.54±109.62 48.36±56.77 -5.7±49.77 3.93±36.04

20% 40% 60% 80%
SCS 130.48±58.03 54.3±45.22 23.0±34.47 32.71±29.64
COX-INT 26.58±74.93 20.7±50.38 37.69±33.38 27.18±33.95
COX-TLR 71.81±70.62 27.41±57.71 52.85±39.85 28.33±32.79
BIN-INT 11.84±65.18 10.39±46.33 12.23±33.8 -0.6±34.6
ASCVD 108.54±109.62 48.36±56.77 -5.7±49.77 3.93±36.04

No Sparsity

20% 40% 60% 80%
SCS -15.46±67.61 -63.38±45.83 -50.14±44.42 1.99±40.43
COX-INT -40.54±64.8 -35.58±51.4 -5.82±49.97 4.42±36.41
COX-TLR -34.51±82.71 -11.28±57.62 -12.3±40.85 20.28±38.2
BIN-INT 11.84±65.18 10.39±46.33 12.23±33.8 -0.6±34.6
ASCVD 108.54±109.62 48.36±56.77 -5.7±49.77 3.93±36.04

20% 40% 60% 80%
SCS 102.43±64.9 113.99±41.35 76.49±38.53 49.76±31.25
COX-INT 74.04±63.72 57.82±50.04 50.8±52.89 41.27±38.38
COX-TLR 1.96±72.63 70.7±53.71 36.51±40.91 33.98±36.9
BIN-INT 11.84±65.18 10.39±46.33 12.23±33.8 -0.6±34.6
ASCVD 108.54±109.62 48.36±56.77 -5.7±49.77 3.93±36.04

Figure 12: Conditional Average Treatment Effect in Restricted Mean Survival Time versus subgroup size for the latent phenogroups extracted from the BARI 2D study.