
Optimal Sparse Recovery with Decision Stumps

Kiarash Banihashem, MohammadTaghi Hajiaghayi, Max Springer
Abstract

Decision trees are widely used for their low computational cost, good predictive performance, and ability to assess the importance of features. Though often used in practice for feature selection, the theoretical guarantees of these methods are not well understood. We here obtain a tight finite sample bound for the feature selection problem in linear regression using single-depth decision trees. We examine the statistical properties of these "decision stumps" for the recovery of the $s$ active features from $p$ total features, where $s\ll p$. Our analysis provides tight sample performance guarantees on high-dimensional sparse systems which align with the finite sample bound of $O(s\log p)$ as obtained by Lasso, improving upon previous bounds for both the median and optimal splitting criteria. Our results extend to the non-linear regime as well as arbitrary sub-Gaussian distributions, demonstrating that tree based methods attain strong feature selection properties under a wide variety of settings and further shedding light on the success of these methods in practice. As a byproduct of our analysis, we show that we can provably guarantee recovery even when the number of active features $s$ is unknown. We further validate our theoretical results and proof methodology using computational experiments.

1 Introduction

Decision trees are one of the most popular tools used in machine learning. Due to their simplicity and interpretability, trees are widely implemented by data scientists, both individually and in aggregation with ensemble methods such as random forests and gradient boosting (Friedman 2001).

In addition to their predictive accuracy, tree based methods are an important tool used for the variable selection problem: identifying a relevant small subset of a high-dimensional feature space of the input variables that can accurately predict the output. When the relationship between the variables is linear, it has long been known that Lasso achieves the optimal sample complexity rate for this problem (Wainwright 2009a). In practice, however, tree-based methods have been shown to be preferable as they scale linearly with the size of the data and can capture non-linear relationships between the variables (Xu et al. 2014).

Notably, various tree structured systems are implemented for this variable selection task across wide-spanning domains. For example, tree based importance measures have been used for prediction of financial distress (Qian et al. 2022), wireless signal recognition (Li and Wang 2018), credit scoring (Xia et al. 2017), and the discovery of genes in the field of bioinformatics (Breiman 2001; Bureau et al. 2005; Huynh-Thu et al. 2012; Lunetta et al. 2004) to name a small fraction.

Despite this empirical success however, the theoretical properties of these tree based methods for feature selection are not well understood. While several papers have considered variants of this problem (see Section 3 for an overview), even in the simple linear case, the sample complexity of the decision tree is not well-characterized.

In this paper, we attempt to bridge this gap and analyze the variable selection properties of single-level decision trees, commonly referred to as “decision stumps” (DStump ). Considering both linear and non-linear settings, we show that DStump achieves nearly-tight sample complexity rates for a variety of practical sample distributions. Compared to prior work, our analysis is simpler and applies to different variants of the decision tree, as well as more general function classes.

The remainder of the paper is organized as follows: in Section 2 we introduce our main results on the finite sample guarantees of DStump, in Section 3 we discuss important prior results in the field of sparse recovery and where they fall short, in Section 4 we present the recovery algorithm, and in Section 5 we provide theoretical guarantees for the procedure under progressively more general problem assumptions. The proofs of these results are provided in Section 6. We supplement the theoretical results with computational experiments in Section 7 and conclude the paper in Section 8.

2 Our Results

We assume that we are given a dataset $\mathcal{D}=\{(\mathbf{X}_{i,:},Y_{i})\}_{i=1}^{n}$ consisting of $n$ samples from the non-parametric regression model $Y_{i}=\sum_{k}f_{k}(\mathbf{X}_{i,k})+W_{i}$, where $i$ denotes the sample number, $\mathbf{X}_{i,:}\in\mathbb{R}^{p}$ is the input vector with corresponding output $Y_{i}\in\mathbb{R}$, $W_{i}\in\mathbb{R}$ are i.i.d noise values, and $f_{k}:\mathbb{R}\to\mathbb{R}$ are univariate functions that are $s$-sparse:

Definition 2.1.

A set of univariate functions $\{f_{k}\}_{k\in[p]}$ is $s$-sparse on feature set $[p]$ if there exists a set $S\subseteq[p]$ with size $s=|S|\ll p$ such that

f_{k}=0\iff k\notin S.

Given access to $(\mathbf{X}_{i,:},Y_{i})_{i=1}^{n}$, we consider the sparse recovery problem, where we attempt to reveal the set $S$ with only a minimal number of samples. Note that this is different from the prediction problem, where the goal is to learn the functions $f_{k}$. In accordance with prior work (Han et al. 2020; Kazemitabar et al. 2017; Klusowski 2020), we assume the $\mathbf{X}_{i,j}$ are i.i.d draws from the uniform distribution on $[0,1]$ with Gaussian noise $W_{i}\sim\mathcal{N}(0,1)$. In Section 5, we will discuss how this assumption can be relaxed using our non-parametric results to consider more general distributions.

For the recovery problem, we consider the DStump algorithm as first proposed by (Kazemitabar et al. 2017). The algorithm, shown in Algorithm 1, fits a single-level decision tree (stump) on each feature using either the "median split" or the "optimal split" strategy and ranks the features by the error of the corresponding trees. The median split provides a substantially simplified implementation as we do not need to optimize the stump construction, further providing an improved run time over the comparable optimal splitting procedure. Indeed, the median and optimal split have time complexity at most $\mathcal{O}(np)$ and $\mathcal{O}(np\log(n))$ respectively.

In spite of this simplification, we show that in the widely considered case of linear design, where the $f_{k}$ are linear, DStump can correctly recover $S$ with a sample complexity bound of $\mathcal{O}(s\log p)$, matching the minimax optimal lower bound for the problem as achieved by Lasso (Wainwright 2009a, b). This result is noteworthy and surprising since the DStump algorithm (and decision trees in general) is not designed with a linearity assumption, as is the case with Lasso. For this reason, trees are generally valued for their predictive power in non-linear models, yet our work proves their value in the linear setting as well. We further extend these results to non-linear models and general sub-Gaussian distributions, improving previously known results using a simpler analysis. In addition, our results do not require the sparsity level $s$ to be known in advance. We summarize our main technical results as follows:

  • We obtain a sample complexity bound of $\mathcal{O}(s\log p)$ for the DStump algorithm in the linear design case, matching the optimal rate of Lasso and improving prior bounds in the literature for both the median and optimal split. This is the first tight bound on the sample complexity of single-depth decision trees used for sparse recovery and significantly improves on the existing results.

  • We extend our results to the case of non-linear $f_{k}$, obtaining tighter results for a wider class of functions compared to the existing literature. We further use this extension to generalize our analysis to sub-Gaussian input distributions.

  • As a byproduct of our improved analysis, we show that our results hold for the case where the number of active features $s$ is not known. This is the first theoretical guarantee on decision stumps that does not require $s$ to be known.

  • We validate our theoretical results using numerical simulations that show the necessity of our analytic techniques.

3 Related Work

While our model framework and the sparsity problem as a whole have been studied extensively (Fan and Lv 2006; Lafferty and Wasserman 2008; Wainwright 2009a, b), none have replicated the well known optimal lower bound for the classification problem under the given set of assumptions. Our work provides improved finite sample guarantees on DStump for the regression problem that nearly match that of Lasso by means of weak learners.

Most closely related to our work is that of (Kazemitabar et al. 2017), which formulates the DStump algorithm and the theoretical approach for finite sample guarantees of these weak learners. Unlike our nearly tight result on the number of samples required for recovery, (Kazemitabar et al. 2017) provides a weaker $\mathcal{O}(s^{2}\log p)$ bound when using the median splitting criterion. We here demonstrate that the procedure can obtain near-optimal finite sample guarantees by highlighting a subtle nuance in the analysis of the stump splitting (potentially of independent interest to the reader): instead of using the variance of one sub-tree as an impurity measure, we use the variance of both sub-trees. As we will show both theoretically and experimentally, this consideration is vital for obtaining tight bounds. Our analysis is also more general than that of (Kazemitabar et al. 2017), with applications to both median and optimal splits, a wider class of functions $f_{k}$, and more general distributions.

In a recent work, (Klusowski and Tian 2021) provide an indirect analysis of the DStump formulation with the optimal splitting criterion, by studying its relation to the SIS algorithm of (Fan and Lv 2006). Designed for linear models specifically, SIS sorts the features based on their Pearson correlation with the output, and has optimal sample complexity for the linear setting. (Klusowski and Tian 2021) show that when using the optimal splitting criterion, DStump is equivalent to SIS up to logarithmic factors, which leads to a sample complexity of

n\gtrsim s\log(p)\cdot\left(\log(s)+\log(\log(p))\right).

This improves the results of (Kazemitabar et al. 2017), though the analysis is more involved. A similar technique is also used to study the non-linear case, but the conditions assumed on $f_{k}$ are hard to interpret and the bounds are weaker. Indeed, instantiating the non-parametric results for the linear case implies a sub-optimal sample complexity rate of $\mathcal{O}(s^{2}\log p)$. In addition, (Klusowski and Tian 2021) describe a heuristic algorithm for the case of unknown $|S|$, though they do not prove any guarantees on its performance.

In contrast, we provide a direct analysis of DStump . This allows us to obtain optimal bounds for both the median and optimal split in the linear case, as well as improved and generalized bounds in the non-linear case. Our novel approach further allows us to analyze the case of unknown sparsity level, as we provide the first formal proof for the heuristic algorithm suggested in (Klusowski and Tian 2021) and (Fan, Feng, and Song 2011). Compared to prior work, our analysis is considerably simpler and applies to more general settings.

Additionally, various studies have leveraged the simplicity of median splits in decision trees to produce sharp bounds on mean-squared error for the regression problem with random forests (Duroux, Roxane and Scornet, Erwan 2018; Klusowski 2021). In each of these studies, analysis under the median split assumption allows for improvements in both asymptotic and finite sample bounds on prediction error for these ensemble weak learners. In the present work, we extend this intuition to the feature selection problem for high-dimensional sparse systems, and further emphasize the utility of the median split even in the singular decision stump case.

4 Algorithm Description

4.1 Notation and Problem Setup

For mathematical convenience, we adopt matrix notation and use $\mathbf{X}=(\mathbf{X}_{ij})\in\mathbb{R}^{n\times p}$ and $Y=(Y_{i})\in\mathbb{R}^{n}$ to denote the input matrix and the output vector respectively. We use $\mathbf{X}_{i,:}$ and $X^{k}$ to refer to the $i$-th row and $k$-th column of the matrix $\mathbf{X}$. We will also extend the definition of the univariate functions $f_{k}$ to vectors by assuming that the function is applied to each coordinate separately: for $v\in\mathbb{R}^{d}$, we define $f_{k}(v)\in\mathbb{R}^{d}$ as $\left(f_{k}(v)\right)_{i}=f_{k}(v_{i})$.

We let $\mathbb{E}\left[\cdot\right]$ and $\text{Var}(\cdot)$ denote the expectation and variance of random variables, with $\widehat{\mathbb{E}}\left[\cdot\right]$ and $\widehat{\text{Var}}\left(\cdot\right)$ denoting their empirical counterparts, i.e., for a generic vector $v\in\mathbb{R}^{d}$,

\widehat{\mathbb{E}}\left[v\right]=\frac{\sum_{i=1}^{d}{v_{i}}}{d}\quad\text{and}\quad\widehat{\text{Var}}\left(v\right)=\frac{\sum_{i=1}^{d}(v_{i}-\widehat{\mathbb{E}}\left[v\right])^{2}}{d}.
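As a small illustration, these empirical quantities correspond to the following numpy sketch (note that the $1/d$ normalization matches numpy's default population variance rather than the unbiased $1/(d-1)$ version):

import numpy as np

# Empirical mean and variance exactly as defined above (1/d normalization).
def emp_mean(v):
    return np.sum(v) / v.size

def emp_var(v):
    return np.sum((v - emp_mean(v)) ** 2) / v.size

v = np.array([1.0, 2.0, 4.0])
assert np.isclose(emp_var(v), np.var(v))   # np.var uses ddof=0 by default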

We will also use $[d]$ to denote the set $\{1,\dots,d\}$. Finally, we let $\textnormal{Unif}(a,b)$ denote the uniform distribution over $[a,b]$ and use $C,c>0$ to denote generic universal constants.

Throughout the paper, we will use the concept of sub-Gaussian random variables for stating and proving our results. A random variable $Z\in\mathbb{R}$ is called sub-Gaussian if there exists a $t$ for which $\mathbb{E}\left[e^{(Z/t)^{2}}\right]\leq 2$, and its sub-Gaussian norm is defined as

\left\lVert Z\right\rVert_{\psi_{2}}=\inf\left\{t>0:\mathbb{E}\left[e^{(Z/t)^{2}}\right]\leq 2\right\}.

Sub-Gaussian random variables are well-studied and have many desirable properties (see (Vershynin 2018) for a comprehensive overview), some of which we outline below as they are leveraged throughout our analysis.

  1. (P1) (Hoeffding's Inequality) If $Z$ is a sub-Gaussian random variable, then for any $t>0$,

     \Pr\left(\left|{Z-\mathbb{E}\left[Z\right]}\right|\geq t\right)\leq 2e^{-c\left({t}/{\left\lVert Z\right\rVert_{\psi_{2}}}\right)^{2}}.

  2. (P2) If $Z_{1},\dots,Z_{n}$ are a sequence of independent sub-Gaussian random variables, then $\sum Z_{i}$ is also sub-Gaussian with norm satisfying

     \left\lVert\sum Z_{i}\right\rVert_{\psi_{2}}^{2}\leq\sum\left\lVert Z_{i}\right\rVert_{\psi_{2}}^{2}.

  3. (P3) If $Z$ is a sub-Gaussian random variable, then so is $Z-\mathbb{E}\left[Z\right]$, and $\left\lVert Z-\mathbb{E}\left[Z\right]\right\rVert_{\psi_{2}}\leq c\left\lVert Z\right\rVert_{\psi_{2}}$.

4.2 DStump Algorithm

We here present the recovery algorithm DStump. For each feature $k\in[p]$, the algorithm fits a single-level decision tree, or "stump", on the given feature and defines the impurity of the feature as the error of this decision tree. Intuitively, the active features are expected to be better predictors of $Y$ and therefore have lower impurity values. Thus, the algorithm sorts the features based on these values and outputs the $|S|$ features with smallest impurity. A formal pseudo-code of our approach is given in Algorithm 1.

Algorithm 1 Scoring using DStump
1: Input: $\mathbf{X}\in\mathbb{R}^{n\times p}$, $Y\in\mathbb{R}^{n}$, $s\in\mathbb{N}$
2: Output: Estimate of $S$
3: for $k\in\{1,\dots,p\}$ do
4:     $\tau^{k}=\text{argsort}(X^{k})$
5:     for $n_{L}\in\{1,\dots,n\}$ do
6:         $n_{R}=n-n_{L}$
7:         $Y^{k}_{L}=(Y_{\tau^{k}(1)},\dots,Y_{\tau^{k}(n_{L})})^{T}$
8:         $Y^{k}_{R}=(Y_{\tau^{k}(n_{L}+1)},\dots,Y_{\tau^{k}(n)})^{T}$
9:         $\textnormal{imp}_{k,n_{L}}=\frac{n_{L}}{n}\widehat{\text{Var}}(Y^{k}_{L})+\frac{n_{R}}{n}\widehat{\text{Var}}(Y^{k}_{R})$
10:     if median split then
11:         $\textnormal{imp}_{k}=\textnormal{imp}_{k,\lfloor n/2\rfloor}$
12:     else
13:         $\textnormal{imp}_{k}=\min_{\ell}\textnormal{imp}_{k,\ell}$
14: return $\tau=\text{argsort}(\textnormal{imp})$

Formally, for each feature $k$, the $k$-th column is sorted in increasing order such that $X^{k}_{\tau^{k}(1)}\leq X^{k}_{\tau^{k}(2)}\leq\dots\leq X^{k}_{\tau^{k}(n)}$, with ties broken randomly. The samples are then split into two groups: the left half, consisting of the samples with $X^{k}_{i}\leq X^{k}_{\tau^{k}(n_{L})}$, and the right half, consisting of those with $X^{k}_{i}>X^{k}_{\tau^{k}(n_{L})}$. Conceptually, this is the same as splitting a single-depth tree on the $k$-th column with an $n_{L}$ to $n_{R}$ ratio and collecting the samples in each group. The algorithm then calculates the variance of the output in each group, which represents the optimal prediction error for this group with a single value, i.e., $\widehat{\text{Var}}\left({Y^{k}_{L}}\right)=\min_{t}\frac{1}{n_{L}}\sum_{i}(Y^{k}_{L,i}-t)^{2}$. The weighted average of these two variances is taken as the impurity. Formally,

\textnormal{imp}_{k,n_{L}}=\frac{n_{L}}{n}\widehat{\text{Var}}(Y^{k}_{L})+\frac{n_{R}}{n}\widehat{\text{Var}}(Y^{k}_{R}). \quad (1)

For the median split algorithm, the value of $n_{L}$ is simply chosen as $\lfloor\frac{n}{2}\rfloor$ or $\lceil\frac{n}{2}\rceil$, whereas for the optimal split, the value is chosen in order to minimize the impurity of the split. Lastly, the features are sorted by their impurity values and the $|S|$ features with lowest impurity are predicted as $S$. In Section 5.3, we discuss the case of unknown $|S|$ and obtain an algorithm with nearly matching guarantees.
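To make the procedure concrete, a minimal Python/numpy sketch of Algorithm 1 is given below. It is only illustrative: the optimal split is implemented by naively scanning all split points rather than with the faster $\mathcal{O}(np\log(n))$ implementation mentioned in Section 2, and the function names are ours.

import numpy as np

def dstump_scores(X, Y, split="median"):
    """Impurity of a single-level stump fit on each feature (Algorithm 1, lines 3-13)."""
    n, p = X.shape
    imp = np.empty(p)
    for k in range(p):
        Y_sorted = Y[np.argsort(X[:, k])]            # sort outputs by feature k (tau^k)
        if split == "median":
            candidates = [n // 2]                    # single median split
        else:                                        # "optimal": scan all split points
            candidates = range(1, n)
        best = np.inf
        for n_L in candidates:
            Y_L, Y_R = Y_sorted[:n_L], Y_sorted[n_L:]
            score = (n_L / n) * Y_L.var() + ((n - n_L) / n) * Y_R.var()
            best = min(best, score)
        imp[k] = best
    return imp

def dstump_select(X, Y, s, split="median"):
    """Return the s features with smallest impurity (the estimate of S)."""
    return np.argsort(dstump_scores(X, Y, split))[:s]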

5 Main Results

5.1 Linear Design

For our first results, we consider the simplest setting of linear models with a uniform distribution on the inputs. Formally, we assume that there is a vector $\beta\in\mathbb{R}^{p}$ such that $f_{k}(x)=\beta_{k}\cdot x$ for all $k$. This is equivalent to considering the linear regression model $Y=\sum_{k}\beta_{k}X^{k}+W$. We further assume that each entry of the matrix $\mathbf{X}$ is an i.i.d draw from the uniform distribution on $[0,1]$. This basic setting is important from a theoretical perspective as it allows us to compare with existing results from the literature before extending to more general contexts. Our initial result, Theorem 5.1, provides an upper bound on the failure probability of DStump in this setting.

Theorem 5.1.

Assume that each entry of the input matrix $X$ is sampled i.i.d from a uniform distribution on $[0,1]$. Assume further that the output vector satisfies the linear regression model $Y=\sum_{k}\beta_{k}X^{k}+W$, where the $W_{i}$ are sampled i.i.d from $\mathcal{N}(0,\sigma_{w}^{2})$. Algorithm 1 correctly recovers the active feature set $S$ with probability at least

1-4se^{-c\cdot n}-4pe^{-c\cdot n\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}}

for the median split, and with probability at least

1-4se^{-c\cdot n}-4npe^{-c\cdot n\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}}

for the optimal split.

The above theorem provides a finite sample guarantee for the DStump algorithm and does not make any assumptions on the parameters or their asymptotic relationship. In order to compare with the existing literature (Kazemitabar et al. 2017; Klusowski and Tian 2021; Wainwright 2009a, b), we consider these bounds in the asymptotic regime $\min_{k\in S}\beta_{k}^{2}\in\Omega(\frac{1}{s})$ and $\left\lVert\beta\right\rVert_{2}\in\mathcal{O}(1)$.

Corollary 5.2.

In the same setting as Theorem 5.1, assume that $\left\lVert\beta\right\rVert_{2}^{2}\in\mathcal{O}(1)$ and $\min_{k\in S}\beta_{k}^{2}\in\Omega(\frac{1}{s})$. Then Algorithm 1 correctly recovers the active feature set $S$ with high probability as long as $n\gtrsim s\log p$.

The proof of the Corollary is presented in the Appendix. The above result shows that DStump is optimal for recovery when the data obeys a linear model. This is interesting considering that tree based methods are known for their strength in capturing non-linear relationships and, unlike Lasso, are not designed with a linearity assumption. In the next section, we further extend our finite sample bound analysis to non-linear models.

5.2 Additive Design

We here consider the case of non-linear fkf_{k} and obtain theoretical guarantees for the original DStump algorithm. Our main result is Theorem 5.3 stated below.

Theorem 5.3.

Assume that each entry of the input matrix $X$ is sampled i.i.d from a uniform distribution on $[0,1]$ and that $Y=\sum_{k}f_{k}(X^{k})+W$, where the $W_{i}$ are sampled i.i.d from $\mathcal{N}(0,\sigma_{w}^{2})$. Assume further that each $f_{k}$ is monotone and $f_{k}(\textnormal{Unif}(0,1))$ is sub-Gaussian with norm satisfying $\left\lVert f_{k}(\textnormal{Unif}(0,1))\right\rVert_{\psi_{2}}^{2}\leq\sigma_{k}^{2}$. For $k\in S$, define $g_{k}$ as

g_{k}:=\mathbb{E}\left[f_{k}(\textnormal{Unif}(\tfrac{1}{2},1))\right]-\mathbb{E}\left[f_{k}(\textnormal{Unif}(0,\tfrac{1}{2}))\right] \quad (2)

and define $\sigma^{2}=\sigma_{w}^{2}+\sum_{k}\sigma_{k}^{2}$. Algorithm 1 correctly recovers the set $S$ with probability at least

1-4se^{-cn}-4pe^{-cn\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the median split and

1-4se^{-cn}-4npe^{-cn\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the optimal split.

Note that, by instantiating Theorem 5.3 for linear models, we obtain the same bound as in Theorem 5.1, implying that the above bounds are tight in the linear setting.

The extension to all monotone functions in Theorem 5.3 has an important theoretical consequence: since the DStump algorithm is invariant under monotone transformations of the input, we can obtain the same results for any distribution of $\mathbf{X}_{i,:}$. As a simple example, consider $\mathbf{X}_{ij}\sim\mathcal{N}(0,1)$ and assume that we are interested in bounds for linear models. Define the matrix $\mathbf{Z}$ as $\mathbf{Z}_{ij}=F_{\mathcal{N}}(\mathbf{X}_{ij})$, where $F_{\mathcal{N}}(\cdot)$ denotes the CDF of the Gaussian distribution. Since the CDF is an increasing function, running the DStump algorithm with $(\mathbf{Z},Y)$ produces the same output as running it with $(\mathbf{X},Y)$. Furthermore, applying the CDF of a random variable to itself yields a uniform random variable. Therefore, the $\mathbf{Z}_{ij}$ are i.i.d draws from the $\textnormal{Unif}(0,1)$ distribution. Setting $f_{k}(t)=\beta_{k}\cdot F_{\mathcal{N}}^{-1}(t)$, the results of Theorem 5.3 for $(\mathbf{Z},Y)$ imply the same bound as Theorem 5.1. Notably, we obtain the same sample complexity bound of $\mathcal{O}(s\log p)$ for the Gaussian distribution as well. In the appendix, we discuss a generalization of this idea, which effectively allows us to remove the uniform distribution condition in Theorem 5.3.
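As a small numerical illustration of this invariance, the following sketch (using the dstump_scores function from the sketch in Section 4.2 and scipy for the Gaussian CDF; the constants are arbitrary) checks that a Gaussian design and its CDF-transformed uniform counterpart yield identical impurities:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p, s = 500, 20, 3
X = rng.standard_normal((n, p))                 # Gaussian design
beta = np.zeros(p); beta[:s] = 1.0              # first s features active (arbitrary choice)
Y = X @ beta + 0.1 * rng.standard_normal(n)

Z = norm.cdf(X)                                 # monotone map: Z_ij ~ Unif(0,1)
imp_X = dstump_scores(X, Y, split="median")
imp_Z = dstump_scores(Z, Y, split="median")
assert np.allclose(imp_X, imp_Z)                # same column orderings, same impurities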

5.3 Unknown sparsity level

A drawback of the previous results is that they assume $|S|$ is given when, in general, this is not the case. Even if $|S|$ is not known, however, Theorem 5.3 guarantees that the active features are ranked higher than non-active ones in $\tau$, i.e., $\tau(k)<\tau(k^{\prime})$ for all $k\in S,k^{\prime}\notin S$. In order to recover $S$, it suffices to find a threshold $\gamma$ such that $\max_{k\in S}\textnormal{imp}_{k}\leq\gamma\leq\min_{k\notin S}\textnormal{imp}_{k}$.

To solve this, we use the so-called "permutation algorithm", a well known heuristic in the statistics literature (Barber and Candès 2015; Chung and Romano 2013, 2016; Fan, Feng, and Song 2011) that was also discussed (without proof) by (Klusowski and Tian 2021). Formally, we apply a random permutation $\sigma$ to the rows of $X$, obtaining the matrix $\sigma(X)$ where $\sigma(X)_{ij}=X_{\sigma(i),j}$. We then rerun Algorithm 1 with $\sigma(X)$ and $Y$ as input. The random permutation means that $X$ and $Y$ are "decoupled" from each other and, effectively, all of the features are now inactive. We therefore expect $\min_{i,t}\textnormal{imp}_{i}(\sigma(X),y)$ to be close to $\min_{k\notin S}\textnormal{imp}_{k}(X,y)$. Since this estimate may be somewhat conservative, we repeat the sampling and take the minimum value across these repetitions. A formal pseudocode is provided in Algorithm 2. The StumpScore method is the same algorithm as Algorithm 1, with the distinction that it returns imp in Line 14.

Algorithm 2 Unknown $|S|$
1: Input: $\mathbf{X}\in\mathbb{R}^{n\times p}$, $Y\in\mathbb{R}^{n}$
2: Output: $\gamma\in[\max_{k\in S}\textnormal{imp}_{k},\min_{k\notin S}\textnormal{imp}_{k}]$
3: for $t\leftarrow 1,\dots,T-1$ do
4:     $\sigma^{(t)}\leftarrow$ random permutation on $[n]$
5:     $\textnormal{imp}^{(t)}\leftarrow\textsc{StumpScore}(\sigma^{(t)}(X),y)$
6: return $\min_{i,t}\textnormal{imp}_{i}^{(t)}$

Assuming we have used $T$ repetitions in the algorithm, the probability that $\min_{k\notin S}\textnormal{imp}_{k}(X,y)\leq\gamma$ is at most $\frac{1}{T}$. While we provide a formal proof in the appendix, the main intuition behind the result is that the $\textnormal{imp}_{i}^{(t)}$ and the $\textnormal{imp}_{k^{\prime}}$ for $k^{\prime}\notin S$ are all impurities corresponding to inactive features. Thus, the probability that the minimum across all of these impurities falls in $[p]\backslash S$ is at most $\frac{p-s}{Tp}\leq\frac{1}{T}$. Ensuring that $\max_{k\in S}\textnormal{imp}_{k}(X,y)\leq\gamma$ involves treating the extra impurity scores as $(T-1)p$ extra inactive features. This means that we can use the same results of Theorem 5.3 with $p$ set to $Tp$, since our sample complexity bound is logarithmic in $p$. The formal result follows, with proof in the appendix.
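A minimal sketch of Algorithm 2 in the same style as the earlier DStump sketch (the function names and the default number of repetitions are ours):

import numpy as np

def permutation_threshold(X, Y, T=20, split="median", seed=None):
    """Estimate gamma by rerunning DStump on row-permuted copies of X,
    for which every feature is effectively inactive."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    null_scores = [dstump_scores(X[rng.permutation(n)], Y, split) for _ in range(T - 1)]
    return np.min(null_scores)

def dstump_select_unknown_s(X, Y, T=20, split="median", seed=None):
    """Estimate the active set S without knowing |S|."""
    gamma = permutation_threshold(X, Y, T, split, seed)
    return np.flatnonzero(dstump_scores(X, Y, split) <= gamma)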

Theorem 5.4.

In the same setting as Theorem 5.3, let $\gamma$ be the output of Algorithm 2 and set $\widehat{S}:=\{k:\textnormal{imp}_{k}\leq\gamma\}$. The probability that $\widehat{S}=S$ is at least

1-T^{-1}-4se^{-cn}-4Tpe^{-cn\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the median split and at least

1-T^{-1}-4se^{-cn}-4nTpe^{-cn\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the optimal split.

We note that if we set $T=n^{c}$ for some constant $c>0$, we obtain the same $\mathcal{O}(s\log p)$ sample complexity as before.

6 Proofs

In this section, we prove Theorem 5.3, as it is the more general version of Theorem 5.1, and defer the remainder of the proofs to the appendix.

To prove that the algorithm succeeds, we need to show that $\textnormal{imp}_{k}<\textnormal{imp}_{k^{\prime}}$ for all $k\in S,k^{\prime}\notin S$. We proceed by first proving an upper bound on $\textnormal{imp}_{k}$ for all $k\in S$.

Lemma 6.1.

In the setting of Theorem 5.3, for any active feature $k\in S$, $\textnormal{imp}_{k}\leq\widehat{\text{Var}}\left(Y\right)-\frac{\min_{k}g_{k}^{2}}{720}$ with probability at least

1-4e^{-c\cdot n}-4e^{-c\cdot n\cdot\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}.

Subsequently, we need to prove an analogous lower bound on $\textnormal{imp}_{k^{\prime}}$ for all $k^{\prime}\notin S$.

Lemma 6.2.

In the setting of Theorem 5.3, for any inactive feature $k^{\prime}\notin S$, $\textnormal{imp}_{k^{\prime}}>\widehat{\text{Var}}\left(Y\right)-\frac{\min_{k}g_{k}^{2}}{720}$ with probability at least

1-4e^{-c\cdot n\cdot\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the median split and

1-4ne^{-c\cdot n\cdot\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the optimal split.

Taking the union bound over all $k,k^{\prime}$, Lemmas 6.1 and 6.2 prove the theorem, as they show that $\textnormal{imp}_{k}<\textnormal{imp}_{k^{\prime}}$ for all $k\in S,k^{\prime}\notin S$ with the desired probabilities.

We now focus on proving Lemma 6.1. We assume without loss of generality that $f_{k}$ is increasing (the case of decreasing $f_{k}$ follows by a symmetric argument or by mapping $X^{k}\to-X^{k}$). We further assume that $\mathbb{E}\left[f_{k}(X^{k}_{i})\right]=0$, as DStump is invariant under constant shifts of the output. Finally, we assume that $n>5$, as for $n\leq 5$ the theorem's statement can be made vacuous by choosing a large $c$.
We will assume $\{n_{L},n_{R}\}=\{\lfloor\frac{n}{2}\rfloor,\lceil\frac{n}{2}\rceil\}$ throughout the proof; as such, our results will hold for both the optimal and median splitting criteria. As noted before, a key point for obtaining a tight bound is considering both sub-trees in the analysis instead of considering them individually. Formally, while the impurity is usually defined via the variance as in (1), it has the following equivalent definition:

\textnormal{imp}_{k}=\widehat{\text{Var}}(Y)-\frac{n_{L}\cdot n_{R}}{n^{2}}\cdot\left(\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]-\widehat{\mathbb{E}}\left[Y^{k}_{R}\right]\right)^{2}. \quad (3)

The above identity is commonly used for the analysis of decision trees and their properties (Breiman et al. 1983; Li et al. 2019; Klusowski 2020; Klusowski and Tian 2021). From an analytic perspective, the key difference between (3) and (1) is that the empirical averaging is calculated before taking the square, allowing us to more simply analyze the first moment rather than the second.
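The identity is also easy to check numerically; a quick sanity check for a single split point (using the $1/d$ variance convention of Section 4.1):

import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal(101)
n = Y.size
n_L = n // 2
n_R = n - n_L
Y_L, Y_R = Y[:n_L], Y[n_L:]

imp_def = (n_L / n) * Y_L.var() + (n_R / n) * Y_R.var()                  # definition (1)
imp_alt = Y.var() - (n_L * n_R / n**2) * (Y_L.mean() - Y_R.mean()) ** 2  # identity (3)
assert np.isclose(imp_def, imp_alt)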

Intuitively, we want to use concentration inequalities to show that $\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]$ and $\widehat{\mathbb{E}}\left[Y^{k}_{R}\right]$ concentrate around their expectations and to lower bound $|\mathbb{E}\left[Y^{k}_{L}\right]-\mathbb{E}\left[Y^{k}_{R}\right]|$. This is challenging, however, as concentration results typically require an i.i.d assumption but, as we will see, the $Y^{k}_{L,i}$ are not i.i.d. More formally, for each $k$, define the random variable $X^{k}_{L}\in\mathbb{R}^{n_{L}}$ as $(X^{k}_{\tau^{k}(1)},\dots,X^{k}_{\tau^{k}(n_{L})})^{T}$ and thus $Y^{k}_{L,i}=f_{k}(X^{k}_{L,i})+\sum_{j\neq k}f_{j}(X^{j}_{\tau^{k}(i)})+W_{\tau^{k}(i)}$. While the random vectors $X^{j\neq k}$ and $W$ have i.i.d entries, $X^{k}_{L}$ was obtained by sorting the coordinates of $X^{k}$. Thus, its coordinates are non-identically distributed and dependent. To solve the first problem, observe that the empirical mean is invariant under permutation and we can thus randomly shuffle the elements of $Y^{k}_{L}$ in order to obtain a vector with identically distributed coordinates. Furthermore, by De Finetti's Theorem, any random vector with coordinates that are identically distributed (but not necessarily independent) can be viewed as a mixture of i.i.d vectors, effectively solving the second problem. Formally, the following result holds.

Lemma 6.3 (Lemma 1 in (Kazemitabar et al. 2017)).

Let $\tilde{\tau}:[n_{L}]\to[n_{L}]$ be a random permutation on $[n_{L}]$ independent of $(\mathbf{X},W)$, and define $\widetilde{X}^{k}_{L}\in\mathbb{R}^{n_{L}}$ as $\widetilde{X}^{k}_{L,i}:=X^{k}_{L,\tilde{\tau}(i)}$. The random vector $\widetilde{X}^{k}_{L}$ is distributed as a mixture of i.i.d uniform vectors of size $n_{L}$ on $[0,\Theta]$, with $\Theta$ sampled from $\text{Beta}(n_{L}+1,n_{R})$.

Defining $\widetilde{Y}^{k}_{L}\in\mathbb{R}^{n_{L}}$ as $\widetilde{Y}^{k}_{L,i}:=Y^{k}_{L,\tilde{\tau}(i)}$, it is clear that $\widehat{\mathbb{E}}\left[\widetilde{Y}^{k}_{L}\right]=\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]$ and we can therefore analyze $\widetilde{Y}^{k}_{L}$ instead of $Y^{k}_{L}$, as

\widetilde{Y}^{k}_{L,i}:=f_{k}(\widetilde{X}^{k}_{L,i})+\sum_{j\neq k}f_{j}(X^{j}_{\tau^{k}\circ\tilde{\tau}(i)})+W_{\tau^{k}\circ\tilde{\tau}(i)},

which, given Lemma 6.3, can be seen as a mixture of i.i.d random variables.

Lemma 6.3 shows that there are two sources of randomness in the distribution of $\widetilde{Y}^{k}_{L,i}$: the mixing variable $\Theta$ and the sub-Gaussian randomness of $\widetilde{X}^{k}_{L}|\Theta$ and $(X^{j\neq k},W)$. For the second source, it is possible to use standard concentration inequalities to show that, conditioned on $\Theta$, $\widehat{\mathbb{E}}\left[\widetilde{Y}^{k}_{L}\right]$ concentrates around $\mathbb{E}\left[\widetilde{Y}^{k}_{L,1}|\Theta\right]$. We will formally do this in Lemma 6.5. Before we do this, however, we focus on the first source and how $\Theta$ affects the distribution of $\widetilde{Y}^{k}_{L,i}$.

Since $\Theta$ is sampled from $\text{Beta}(n_{L}+1,n_{R})$, we can use standard techniques to show that it concentrates around $\frac{1}{2}$. More formally, we can use the following lemma, the proof of which is in the appendix.

Lemma 6.4.

If $n\geq 5$, we have $\Theta\in[\frac{1}{4},\frac{3}{4}]$ with probability at least $1-2e^{-cn}$.
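A quick Monte Carlo check of this concentration (the sample sizes below are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
for n in (10, 50, 200):
    n_L = n // 2
    n_R = n - n_L
    theta = rng.beta(n_L + 1, n_R, size=100_000)      # Theta ~ Beta(n_L + 1, n_R)
    outside = np.mean((theta < 0.25) | (theta > 0.75))
    print(n, outside)                                  # fraction outside [1/4, 3/4] decays with n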

Given the above result, we can analyze $\widetilde{Y}^{k}_{L}$ assuming $\Theta\in[1/4,3/4]$. In this case, we can use concentration inequalities to show that, with high probability, $\widehat{\mathbb{E}}\left[\widetilde{Y}^{k}_{L}\right]$ concentrates around $\mathbb{E}\left[f_{k}(\textnormal{Unif}(0,\Theta))\right]$. Since $f_{k}$ was assumed to be increasing, this can be further bounded by $\mathbb{E}\left[f_{k}(\textnormal{Unif}(0,\frac{3}{4}))\right]$. Formally, we obtain the following result.

Lemma 6.5.

Let $k\in S$ be an active feature. For any $\theta\in[\frac{1}{4},\frac{3}{4}]$,

\Pr\left[\widehat{\mathbb{E}}\left[\widetilde{Y}^{k}_{L}\right]-\mathbb{E}\left[\widetilde{Y}^{k}_{L,1}\,\Big|\,\Theta=\theta\right]\geq t\,\Big|\,\Theta=\theta\right]\leq 2e^{-cn\frac{t^{2}}{\sigma^{2}}}.

Furthermore, letting $\overline{f}^{k}_{a,b}$ denote $\mathbb{E}\left[f_{k}(\textnormal{Unif}(a,b))\right]$,

\Pr\left[\widehat{\mathbb{E}}\left[{Y}^{k}_{L}\right]\geq\overline{f}^{k}_{0,\frac{3}{4}}+g_{k}/8\right]\leq 2e^{-c\cdot n}+2e^{-c\cdot n\cdot\frac{g_{k}^{2}}{\sigma^{2}}}. \quad (4)
Proof.

For ease of notation, we will fix $\theta\in[\frac{1}{4},\frac{3}{4}]$ and let the random variables $\widehat{X}^{k}_{L}$ and $\widehat{Y}^{k}_{L}$ denote $\widetilde{X}^{k}_{L}|\Theta=\theta$ and $\widetilde{Y}^{k}_{L}|\Theta=\theta$ respectively. Recall that for all $j$, $f_{j}(X^{j}_{i})$ is sub-Gaussian with parameter $\sigma_{j}$ by assumption. It is straightforward to show (see the Appendix) that this implies $f_{k}(\widehat{X}^{k}_{L,i})-\mathbb{E}\left[f_{k}(\widehat{X}^{k}_{L,i})\right]$ is also sub-Gaussian with squared norm at most $C\cdot\sigma_{k}^{2}$. Thus,

\left\lVert\widehat{Y}^{k}_{L,i}\right\rVert_{\psi_{2}}^{2}=\left\lVert f_{k}(\widehat{X}^{k}_{L,i})+\sum_{j\neq k}f_{j}(X^{j}_{\tau^{k}\circ\tilde{\tau}(i)})+W_{\tau^{k}\circ\tilde{\tau}(i)}\right\rVert_{\psi_{2}}^{2}\overset{(i)}{=}\left\lVert f_{k}(\widehat{X}^{k}_{L,i})+\sum_{j\neq k}f_{j}(X^{j}_{i})+W_{i}\right\rVert_{\psi_{2}}^{2}\overset{(ii)}{\leq}\left\lVert f_{k}(\widehat{X}^{k}_{L,i})\right\rVert_{\psi_{2}}^{2}+\sum_{j\neq k}\left\lVert f_{j}(X^{j}_{i})\right\rVert_{\psi_{2}}^{2}+\left\lVert W_{i}\right\rVert_{\psi_{2}}^{2}\leq C\cdot\sigma_{k}^{2}+\sum_{j\neq k}\sigma_{j}^{2}+\sigma_{w}^{2}\leq C\cdot\sigma^{2}.

In the above analysis, $(i)$ follows from the independence of $(\widehat{X}^{k}_{L},X^{j\neq k},W)$ together with the i.i.d assumption on $(X^{j\neq k}_{i},W_{i})$. As for $(ii)$, it follows from (P2) together with the independence of $(\widehat{X}^{k}_{L},X^{j\neq k},W)$. Property (P3) further implies that $\left\lVert\widehat{Y}^{k}_{L,i}-\mathbb{E}\left[\widehat{Y}^{k}_{L,i}\right]\right\rVert_{\psi_{2}}^{2}$ is upper bounded by $C\cdot\sigma^{2}$, proving the first equation in the Lemma.

Now, using Hoeffding’s inequality, we obtain

\Pr\left[\widehat{\mathbb{E}}\left[\widehat{Y}^{k}_{L}\right]-\mathbb{E}\left[\widehat{Y}^{k}_{L,i}\right]\geq g_{k}/8\right]\leq 2e^{-c\cdot n\cdot\frac{g_{k}^{2}}{\sigma^{2}}}.

Using Lemma 6.4 together with the fact that $\Pr(A)\leq\Pr(B)+\Pr(A|B^{C})$ for any two events $A,B$, we obtain

\Pr\left[\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]-\mathbb{E}\left[\widehat{Y}^{k}_{L,i}\right]\geq g_{k}/8\right]\leq 2e^{-c\cdot n}+2e^{-c\cdot n\cdot\frac{g_{k}^{2}}{\sigma^{2}}}.

Note, however, that $\mathbb{E}\left[\widehat{Y}^{k}_{L,i}\right]=\overline{f}^{k}_{0,\theta}$ which, as we show in the appendix, can further be upper bounded by $\overline{f}^{k}_{0,\frac{3}{4}}$, concluding the proof. ∎

Using the symmetry of the decision tree algorithm, we can further obtain that

\Pr\left[\widehat{\mathbb{E}}\left[{Y}^{k}_{R}\right]\leq\overline{f}^{k}_{\frac{1}{4},1}-g_{k}/8\right]\leq 2e^{-c\cdot n}+2e^{-c\cdot n\cdot\frac{g_{k}^{2}}{\sigma^{2}}} \quad (5)

from (4) with the change of variables $X^{k}\to-X^{k}$ and $f_{k}\to-f_{k}$. Taking a union bound over (4) and (5), it follows that with probability at least $1-4e^{-c\cdot n}-4e^{-c\cdot n\cdot\frac{g_{k}^{2}}{\sigma^{2}}}$,

\widehat{\mathbb{E}}\left[Y^{k}_{R}\right]-\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]\geq\overline{f}^{k}_{\frac{1}{4},1}-\overline{f}^{k}_{0,\frac{3}{4}}-g_{k}/4.

As we show in the appendix, however, a simple application of conditional expectations implies $\overline{f}^{k}_{\frac{1}{4},1}-\overline{f}^{k}_{0,\frac{3}{4}}\geq g_{k}/3$. Therefore, with probability at least $1-4e^{-c\cdot n}-4e^{-c\cdot n\cdot\frac{g_{k}^{2}}{\sigma^{2}}}$, we have $\widehat{\mathbb{E}}\left[Y^{k}_{R}\right]-\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]\geq\frac{g_{k}}{12}$. Assuming $n\geq 5$, we can further conclude that $\frac{n_{L}\cdot n_{R}}{n^{2}}\geq\frac{1}{5}$, which together with (3) proves the lemma. ∎

7 Experimental Results

Figure 1: Optimal sample count to recover 95% of the active features, where the design matrix is sampled i.i.d from $U(-1,1)$ or $\mathcal{N}(0,1)$ with additive Gaussian noise $\mathcal{N}(0,0.1)$, comparing three methods: DStump with optimal split, DStump with median split, and Lasso.

In this section, we provide further justification of our theoretical results in the form of simulations on the finite sample count for active feature recovery under different regimes, as well as the predictive power of a single sub-tree as compared to the full tree. We additionally contrast DStump with the widely studied optimal Lasso .

We first validate the result of Theorem 5.1 and consider the linear design with $p=200$ and design matrix entries sampled i.i.d. from $U(0,1)$ with additive Gaussian noise $\mathcal{N}(0,0.1)$. Concretely, we examine the optimal number of samples required to recover approximately $95\%$ of the $s$ active features. This is achieved by conducting a binary search on the number of samples to find the minimal value that recovers the desired fraction of the active feature set, averaged across 25 independent replications. In the leftmost plot of Figure 1, we plot the sample count as a function of the sparsity level $s\in[5,100]$ for DStump with the median split, DStump with the optimal split, and Lasso for bench-marking (with penalty parameter selected through standard cross-validation). By fixing $p$, we evaluate the dependence of $n$ on the sparsity level $s$ alone. The results validate the theoretical $\mathcal{O}(s\log p)$ bound that nearly matches the optimal Lasso. Also of note, the number of samples required by the median split is less than that of the optimal split. Thus, in the linear setting, DStump with median splitting is both simpler and computationally cheaper. This result is repeated with Gaussian data samples in the rightmost plot of Figure 1; notably, in this setting the optimal-split decision stumps perform better than the median split, demonstrating their varied utility under different problem contexts.
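For reference, a condensed sketch of this experiment built on the dstump_select function from the sketch in Section 4.2 (the Lasso baseline is omitted and the constants below are simplified stand-ins for our implementation):

import numpy as np

def recovery_rate(n, p, s, split, reps=25, seed=None):
    """Average fraction of the s active features recovered from n samples."""
    rng = np.random.default_rng(seed)
    hits = []
    for _ in range(reps):
        X = rng.uniform(0.0, 1.0, size=(n, p))
        beta = np.zeros(p); beta[:s] = 1.0
        Y = X @ beta + rng.normal(0.0, 0.1, size=n)
        selected = dstump_select(X, Y, s, split)
        hits.append(len(set(selected) & set(range(s))) / s)
    return np.mean(hits)

def minimal_sample_count(p, s, split, target=0.95, lo=10, hi=20_000):
    """Binary search for the smallest n reaching the target recovery rate."""
    while lo < hi:
        mid = (lo + hi) // 2
        if recovery_rate(mid, p, s, split) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo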

We additionally reiterate that the prior work of (Kazemitabar et al. 2017) attempted to simplify the analysis of the sparse recovery problem with DStump by examining only the left sub-tree, which produced the non-optimal $\mathcal{O}(s^{2}\log p)$ finite sample bound. To analyze the effect of this choice, the middle plot of Figure 1 presents the optimal sample recovery count when using only the left sub-tree, subject to the additive model of Theorem 5.1. In accordance with our expectation and the previous literature's analysis, we see a clear quadratic relationship between $n$ and $s$ when fixing the feature count $p$.

Overall, these simulations further validate the practicality and predictive power of decision stumps. Benchmarked against the optimal Lasso, we see a slight decrease in performance but a computational reduction and analytic simplification.

8 Conclusion

In this paper, we presented a simple and consistent feature selection algorithm for the regression setting with single-depth decision trees, and derived finite-sample performance guarantees in a high-dimensional sparse system. Our theoretical results demonstrate that this very simple class of weak learners is nearly optimal compared to the gold standard Lasso. We have provided strong theoretical evidence for the success of binary decision tree based methods in practice and a framework on which to extend the analysis of these structures to arbitrary height, a potential direction for future work.

9 Acknowledgements

The work is partially supported by DARPA QuICC, NSF AF:Small #2218678, and NSF AF:Small #2114269. Max Springer was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1840340. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

  • Barber and Candès (2015) Barber, R. F.; and Candès, E. J. 2015. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5): 2055–2085.
  • Breiman (2001) Breiman, L. 2001. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3): 199 – 231.
  • Breiman et al. (1983) Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1983. Classification and Regression Trees.
  • Bureau et al. (2005) Bureau, A.; Dupuis, J.; Falls, K.; Lunetta, K.; Hayward, B.; Keith, T.; and Eerdewegh, P. 2005. Identifying SNPs predictive of phenotype using random forests. Genetic epidemiology, 28: 171–82.
  • Chung and Romano (2013) Chung, E.; and Romano, J. P. 2013. Exact and asymptotically robust permutation tests. The Annals of Statistics, 41(2): 484–507.
  • Chung and Romano (2016) Chung, E.; and Romano, J. P. 2016. Multivariate and multiple permutation tests. Journal of econometrics, 193(1): 76–91.
  • Duroux, Roxane and Scornet, Erwan (2018) Duroux, Roxane; and Scornet, Erwan. 2018. Impact of subsampling and tree depth on random forests. ESAIM: PS, 22: 96–128.
  • Fan, Feng, and Song (2011) Fan, J.; Feng, Y.; and Song, R. 2011. Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494): 544–557.
  • Fan and Lv (2006) Fan, J.; and Lv, J. 2006. Sure independence screening for ultrahigh dimensional feature space. Journal of The Royal Statistical Society Series B-statistical Methodology, 70: 849–911.
  • Friedman (2001) Friedman, J. H. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29: 1189–1232.
  • Han et al. (2020) Han, C.; Rao, N. S.; Sorokina, D.; and Subbian, K. 2020. Scalable Feature Selection for (Multitask) Gradient Boosted Trees. ArXiv, abs/2109.01965.
  • Huynh-Thu et al. (2012) Huynh-Thu, V. A.; Irrthum, A.; Wehenkel, L.; and Geurts, P. 2012. Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. PLOS ONE, 5(9): 1–10.
  • Kazemitabar et al. (2017) Kazemitabar, S. J.; Amini, A. A.; Bloniarz, A.; and Talwalkar, A. S. 2017. Variable Importance Using Decision Trees. In NIPS.
  • Klusowski (2021) Klusowski, J. 2021. Sharp Analysis of a Simple Model for Random Forests. In Banerjee, A.; and Fukumizu, K., eds., Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, 757–765. PMLR.
  • Klusowski (2020) Klusowski, J. M. 2020. Sparse learning with CART. ArXiv, abs/2006.04266.
  • Klusowski and Tian (2021) Klusowski, J. M.; and Tian, P. M. 2021. Nonparametric Variable Screening with Optimal Decision Stumps. In AISTATS.
  • Lafferty and Wasserman (2008) Lafferty, J. D.; and Wasserman, L. A. 2008. Rodeo: Sparse, greedy nonparametric regression. Annals of Statistics, 36: 28–63.
  • Li and Wang (2018) Li, L.; and Wang, J. 2018. Research on feature importance evaluation of wireless signal recognition based on decision tree algorithm in cognitive computing. Cognitive Systems Research, 52: 882–890.
  • Li et al. (2019) Li, X.; Wang, Y.; Basu, S.; Kumbier, K.; and Yu, B. 2019. A Debiased mDI Feature Importance Measure for Random Forests. ArXiv, abs/1906.10845.
  • Lunetta et al. (2004) Lunetta, K. L.; Hayward, L. B.; Segal, J.; and Eerdewegh, P. V. 2004. Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics, 5(1): 32.
  • Qian et al. (2022) Qian, H.; Wang, B.; Yuan, M.; Gao, S.; and Song, Y. 2022. Financial distress prediction using a corrected feature selection measure and gradient boosted decision tree. Expert Systems with Applications, 190: 116202.
  • Rundel (2012) Rundel, C. 2012. Lecture 15: Order statistics. https://www2.stat.duke.edu/courses/Spring12/sta104.1/Lectures/Lec15.pdf.
  • Vershynin (2018) Vershynin, R. 2018. High-Dimensional Probability: An Introduction with Applications in Data Science.
  • Wainwright (2009a) Wainwright, M. J. 2009a. Information-Theoretic Limits on Sparsity Recovery in the High-Dimensional and Noisy Setting. IEEE Transactions on Information Theory, 55: 5728–5741.
  • Wainwright (2009b) Wainwright, M. J. 2009b. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory.
  • Xia et al. (2017) Xia, Y.; Liu, C.; Li, Y.; and Liu, N. 2017. A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Expert Systems with Applications, 78: 225–241.
  • Xu et al. (2014) Xu, Z.; Huang, G.; Weinberger, K. Q.; and Zheng, A. X. 2014. Gradient boosted feature selection. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 522–531.

Appendix A Appendix

A.1 Extension to non-uniform distributions

In this section, we extend our main result, i.e., Theorem 5.3, to non-uniform distributions.

Theorem A.1.

Assume that the entries of the input matrix $X$ are sampled independently, that the entries in each column are i.i.d., and that $Y=\sum_{k}f_{k}(X^{k})+W$ where the $W_{i}$ are sampled i.i.d from $\mathcal{N}(0,\sigma_{w}^{2})$. Assume further that each $f_{k}$ is monotone and $f_{k}(\mathbf{X}_{i,k})$ is sub-Gaussian with norm satisfying $\left\lVert f_{k}(\mathbf{X}_{i,k})\right\rVert_{\psi_{2}}^{2}\leq\sigma_{k}^{2}$.
Let $F_{k}$ denote the CDF of the distribution of $X^{k}_{i}$. Define $F_{k}^{-1,+}(z):=\inf\{t:F_{k}(t)>z\}$ and $h_{k}:=f_{k}\circ F_{k}^{-1,+}$. For $k\in S$, define $g_{k}$ as

g_{k}:=\mathbb{E}\left[h_{k}(\textnormal{Unif}(\tfrac{1}{2},1))\right]-\mathbb{E}\left[h_{k}(\textnormal{Unif}(0,\tfrac{1}{2}))\right], \quad (6)

and define $\sigma^{2}=\sigma_{w}^{2}+\sum_{k}\sigma_{k}^{2}$. Algorithm 1 correctly recovers the set $S$ with probability at least

1-4se^{-cn}-4pe^{-cn\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the median split and

1-4se^{-cn}-4npe^{-cn\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the optimal split.

Proof.

We first claim that if $Z\sim\textnormal{Unif}(0,1)$, then $F_{k}^{-1,+}(Z)$ has the same distribution as $X^{k}_{i}$. Formally, for all $z\in[0,1]$,

F_{k}^{-1,+}(z)\leq t\iff\inf\{\tilde{t}:F_{k}(\tilde{t})>z\}\leq t\overset{(i)}{\iff}\forall\delta>0:\exists\tilde{t}\leq t+\delta:F_{k}(\tilde{t})>z\overset{(ii)}{\iff}\lim_{\epsilon\to 0,\epsilon>0}F_{k}(t+\epsilon)>z\overset{(iii)}{\iff}F_{k}(t)>z,

where $(i)$ follows from the definition of $\inf$, $(ii)$ follows from the fact that $F_{k}$ is increasing, and $(iii)$ follows from the fact that $F_{k}$ is right continuous. Therefore,

\Pr\left(F_{k}^{-1,+}(Z)\leq t\right)=\Pr\left(Z<F_{k}(t)\right)\overset{(i)}{=}F_{k}(t),

where for $(i)$ we have used the fact that $Z$ is sampled uniformly on $[0,1]$. Since $F_{k}(t)=\Pr(X^{k}_{i}\leq t)$, the claim is proved.

Given this result, we can assume that the $X^{k}_{i}$ were set as $F_{k}^{-1,+}(Z^{k}_{i})$ for latent variables $Z^{k}_{i}\sim\textnormal{Unif}(0,1)$. While this means $Z^{k}_{i}$ is no longer a function of $X^{k}_{i}$, $\text{argsort}(Z^{k})$ still has the same distribution as $\text{argsort}(X^{k})$. To see why, consider a fixed value of $X^{k}$ and sample $Z^{k}$ from the conditional distribution. If $X^{k}_{i}<X^{k}_{j}$ for some $i,j$, then $Z^{k}_{i}<Z^{k}_{j}$ as well since $F_{k}^{-1,+}$ is non-decreasing. As for the case of $X^{k}_{i}=X^{k}_{j}$, the events $Z^{k}_{i}<Z^{k}_{j}$ and $Z^{k}_{i}>Z^{k}_{j}$ are equi-probable, which is consistent with the assumption that argsort breaks ties randomly.
We now note that

f_{k}(X^{k}_{i})=f_{k}(F_{k}^{-1,+}(Z^{k}_{i}))=h_{k}(Z^{k}_{i}).

Since $\text{argsort}(Z^{k})$ and $\text{argsort}(X^{k})$ have the same distribution, Algorithm 1 has the same output on $(\mathbf{X},f_{k})$ as on $(\mathbf{Z},h_{k})$. Invoking Theorem 5.3 therefore completes the proof. ∎
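As a small numerical illustration of the generalized inverse $F_{k}^{-1,+}$ used above, the following sketch checks the distributional claim for a discrete example, where right continuity of the CDF matters (the distribution itself is arbitrary):

import numpy as np

rng = np.random.default_rng(3)
support = np.array([0.0, 1.0, 2.0])
probs = np.array([0.2, 0.5, 0.3])
cdf = np.cumsum(probs)

def F_inv_plus(z):
    """Generalized inverse F^{-1,+}(z) = inf{t : F(t) > z} for the discrete CDF above."""
    return support[np.searchsorted(cdf, z, side="right")]

Z = rng.uniform(size=200_000)                     # Z ~ Unif(0,1); Z = 1 has probability zero
samples = F_inv_plus(Z)
print([np.mean(samples == v) for v in support])   # approximately [0.2, 0.5, 0.3]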

A.2 Auxiliary lemmas

Lemma A.2.

Let $U\sim\textnormal{Unif}(0,1)$ be a uniform random variable on $[0,1]$ and let $f$ be a function such that $f(U)$ is sub-Gaussian with norm $\left\lVert f(U)\right\rVert_{\psi_{2}}$. Let $\delta$ be a value in the range $[\frac{1}{4},\frac{3}{4}]$, and let $\tilde{U}\sim\textnormal{Unif}(0,\delta)$ be a uniform random variable on $[0,\delta]$. Then $f(\tilde{U})$ is sub-Gaussian as well, with norm at most a constant times that of $f(U)$, i.e.,

\left\lVert f(\tilde{U})\right\rVert_{\psi_{2}}\leq C\cdot\left\lVert f(U)\right\rVert_{\psi_{2}}.
Proof.

Define $K:=\left\lVert f(U)\right\rVert_{\psi_{2}}$. Then

2\overset{(i)}{\geq}\mathbb{E}\left[e^{\frac{f(U)^{2}}{K^{2}}}\right]=\mathbb{E}\left[e^{\frac{f(U)^{2}}{K^{2}}}\,\Big|\,U\leq\delta\right]\cdot\Pr\left[U\leq\delta\right]+\mathbb{E}\left[e^{\frac{f(U)^{2}}{K^{2}}}\,\Big|\,U>\delta\right]\cdot\Pr\left[U>\delta\right]\overset{(ii)}{\geq}\mathbb{E}\left[e^{\frac{f(U)^{2}}{K^{2}}}\,\Big|\,U\leq\delta\right]\cdot\Pr\left[U\leq\delta\right]=\mathbb{E}\left[e^{\frac{f(\tilde{U})^{2}}{K^{2}}}\right]\cdot\delta, \quad (7)

where $(i)$ follows from the definition of the sub-Gaussian norm and $(ii)$ follows from the fact that $e^{x}\geq 0$ for all $x$. Define $\widetilde{K}:=K\cdot\frac{\ln\left(2/\delta\right)}{\ln(2)}$. Note that $\delta<1$ and therefore $\frac{K}{\widetilde{K}}<1$. We therefore obtain

\mathbb{E}\left[e^{\frac{f(\tilde{U})^{2}}{\widetilde{K}^{2}}}\right]=\mathbb{E}\left[e^{\frac{f(\tilde{U})^{2}}{K^{2}}\cdot\frac{K^{2}}{\widetilde{K}^{2}}}\right]=\mathbb{E}\left[\left(e^{\frac{f(\tilde{U})^{2}}{K^{2}}}\right)^{\frac{K^{2}}{\widetilde{K}^{2}}}\right]\overset{(i)}{\leq}\mathbb{E}\left[e^{\frac{f(\tilde{U})^{2}}{K^{2}}}\right]^{\frac{K^{2}}{\widetilde{K}^{2}}}\overset{(ii)}{\leq}\left(2/\delta\right)^{\frac{K^{2}}{\widetilde{K}^{2}}}\overset{(iii)}{\leq}e^{\ln(2/\delta)\cdot\frac{\ln(2)}{\ln(2/\delta)}}=2,

where for $(i)$ we have used Jensen's inequality for the concave function $t\to t^{\frac{K^{2}}{\widetilde{K}^{2}}}$, for $(ii)$ we have used (7), and for $(iii)$ we have used the definition of $\widetilde{K}$ together with the fact that $2/\delta>1$. Therefore the claim is proved with $C=\frac{\ln\left(2/\delta\right)}{\ln(2)}$. ∎

Lemma A.3.

If $f_{k}$ is increasing, then for any $0\leq a\leq c\leq b\leq 1$, $\overline{f}^{k}_{a,c}\leq\overline{f}^{k}_{c,b}$.

Proof.

\overline{f}^{k}_{a,c}=\mathbb{E}_{Z\sim\textnormal{Unif}(a,c)}\left[f_{k}(Z)\right]\leq f_{k}(c)\leq\mathbb{E}_{Z\sim\textnormal{Unif}(c,b)}\left[f_{k}(Z)\right]=\overline{f}^{k}_{c,b}.

Lemma A.4.

If $f_{k}$ is increasing, then for any $0\leq a\leq c\leq b\leq 1$, $\overline{f}^{k}_{a,c}\leq\overline{f}^{k}_{a,b}$.

Proof.

\overline{f}^{k}_{a,b}=\mathbb{E}_{Z\sim\textnormal{Unif}(a,b)}\left[f_{k}(Z)\right]=\Pr\left[Z\in[a,c]\right]\cdot\mathbb{E}_{Z\sim\textnormal{Unif}(a,c)}\left[f_{k}(Z)\right]+\Pr\left[Z\in[c,b]\right]\cdot\mathbb{E}_{Z\sim\textnormal{Unif}(c,b)}\left[f_{k}(Z)\right]=\Pr\left[Z\in[a,c]\right]\cdot\overline{f}^{k}_{a,c}+\Pr\left[Z\in[c,b]\right]\cdot\overline{f}^{k}_{c,b}\geq\overline{f}^{k}_{a,c},

where the last inequality uses Lemma A.3 and the fact that the two probabilities sum to one.

Lemma A.5.
\overline{f}^{k}_{\frac{1}{4},1}-\overline{f}^{k}_{0,\frac{3}{4}}\geq g_{k}/3.
Proof.

We have

\overline{f}^{k}_{0,\frac{3}{4}}=\frac{2}{3}\cdot\overline{f}^{k}_{\frac{1}{4},\frac{3}{4}}+\frac{1}{3}\cdot\overline{f}^{k}_{0,\frac{1}{4}},

and

\overline{f}^{k}_{\frac{1}{4},1}=\frac{2}{3}\cdot\overline{f}^{k}_{\frac{1}{4},\frac{3}{4}}+\frac{1}{3}\cdot\overline{f}^{k}_{\frac{3}{4},1}.

It therefore follows that

\overline{f}^{k}_{\frac{1}{4},1}-\overline{f}^{k}_{0,\frac{3}{4}}=\frac{1}{3}\cdot\left(\overline{f}^{k}_{\frac{3}{4},1}-\overline{f}^{k}_{0,\frac{1}{4}}\right).

Note however that

g_{k}=\overline{f}^{k}_{\frac{1}{2},1}-\overline{f}^{k}_{0,\frac{1}{2}}=\left(\frac{1}{2}\cdot\overline{f}^{k}_{\frac{1}{2},\frac{3}{4}}+\frac{1}{2}\cdot\overline{f}^{k}_{\frac{3}{4},1}\right)-\left(\frac{1}{2}\cdot\overline{f}^{k}_{0,\frac{1}{4}}+\frac{1}{2}\cdot\overline{f}^{k}_{\frac{1}{4},\frac{1}{2}}\right)=\left(\frac{1}{2}\cdot\overline{f}^{k}_{\frac{3}{4},1}-\frac{1}{2}\cdot\overline{f}^{k}_{0,\frac{1}{4}}\right)+\left(\frac{1}{2}\cdot\overline{f}^{k}_{\frac{1}{2},\frac{3}{4}}-\frac{1}{2}\cdot\overline{f}^{k}_{\frac{1}{4},\frac{1}{2}}\right)\leq\overline{f}^{k}_{\frac{3}{4},1}-\overline{f}^{k}_{0,\frac{1}{4}},

where the last inequality follows from Lemma A.3, which gives $\overline{f}^{k}_{\frac{1}{2},\frac{3}{4}}\leq\overline{f}^{k}_{\frac{3}{4},1}$ and $\overline{f}^{k}_{\frac{1}{4},\frac{1}{2}}\geq\overline{f}^{k}_{0,\frac{1}{4}}$. Combined with the previous display, this proves the claim. ∎
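The following short Python check (illustration only; the test functions are arbitrary increasing functions on $[0,1]$) numerically verifies the conclusion of Lemmas A.3–A.5.

import numpy as np

# Illustration only: verify f_bar(1/4,1) - f_bar(0,3/4) >= g/3 for a few increasing
# functions f on [0,1], where g = f_bar(1/2,1) - f_bar(0,1/2) and f_bar(a,b) denotes
# the average of f over Unif(a, b).
def f_bar(f, a, b, m=200_001):
    return f(np.linspace(a, b, m)).mean()     # numerical average of f over [a, b]

for f in (lambda x: x, np.sqrt, lambda x: x**3, lambda x: np.exp(3 * x)):
    lhs = f_bar(f, 0.25, 1.0) - f_bar(f, 0.0, 0.75)
    g = f_bar(f, 0.50, 1.0) - f_bar(f, 0.0, 0.50)
    assert lhs >= g / 3 - 1e-9, (lhs, g / 3)
print("Lemma A.5 inequality verified on all test functions.")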

A.3 Proof of Corollary 5.2

Proof.

We need to show that for nslog(p)n\gtrsim s\log(p),

limp14secn4npecnminkβk2β22+σw2=1\displaystyle\lim_{p\to\infty}1-4se^{-c\cdot n}-4npe^{-c\cdot n\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}}=1

First, note that since minkβk2Ω(1s)\min_{k}\beta_{k}^{2}\in\Omega(\frac{1}{s}) and β2𝒪(1)\left\lVert\beta\right\rVert_{2}\in\mathcal{O}(1), it follows that minkβk2β22+σw2Ω(1s)\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}\in\Omega(\frac{1}{s}). Assuming that minkβk2β22+σw2Cs\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}\geq\frac{C}{s} for some CC, set n=max{4/C,2}cslog(p)n=\frac{\max\{4/C,2\}}{c}s\log(p). This implies that

4secn4sp2s4p.\displaystyle 4se^{-cn}\leq 4\frac{s}{p^{2s}}\leq\frac{4}{p}.

This goes to zero as $p$ grows. It therefore remains to show that $4npe^{-cn\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}}$ goes to zero, which is equivalent to showing that $cn\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}-\log(np)$ goes to $\infty$. Note however that

cnminkβk2β22+σw2log(np)\displaystyle cn\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}-\log(np) =cnminkβk2β22+σw2log(n)log(p)\displaystyle=cn\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}-\log(n)-\log(p)
c4cCslog(p)Cslog(p)log(n)\displaystyle\geq c\cdot\frac{4}{c\cdot C}s\log(p)\cdot\frac{C}{s}-\log(p)-\log(n)
=3log(p)log(n)\displaystyle=3\log(p)-\log(n)

Now, observe that by our choice of nn,

n(4cC+2c)plog(p)(4cC+2c)p2.\displaystyle n\leq(\frac{4}{c\cdot C}+\frac{2}{c})\cdot p\log(p)\leq(\frac{4}{c\cdot C}+\frac{2}{c})p^{2}.

Therefore, log(n)2log(p)+log(C)\log(n)\leq 2\log(p)+\log(C^{\prime}), where C:=4cC+2cC^{\prime}:=\frac{4}{c\cdot C}+\frac{2}{c}, implying that

3log(p)log(n)log(p)log(C),\displaystyle 3\log(p)-\log(n)\geq\log(p)-\log(C^{\prime}),

which goes to $\infty$ as $p\to\infty$. ∎
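For concreteness, the following snippet evaluates the failure probability bound above at $n=\frac{\max\{4/C,2\}}{c}s\log(p)$ with illustrative (hypothetical) values of the constants $c$ and $C$, showing that the bound vanishes as $p$ grows.

import numpy as np

# Illustration only: c, C, and s are placeholder values, not constants from the paper.
c, C, s = 1.0, 1.0, 10
for p in (1e2, 1e4, 1e6, 1e8):
    n = max(4 / C, 2) / c * s * np.log(p)
    term1 = 4 * s * np.exp(-c * n)                 # corresponds to 4 s e^{-cn}
    term2 = 4 * n * p * np.exp(-c * n * C / s)     # corresponds to 4 n p e^{-cn C / s}
    print(f"p = {p:.0e}:  failure bound <= {term1 + term2:.3e}")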

A.4 Proof of Lemma 6.2

In this section, we provide the proof of Lemma 6.2, which is a generalization of Lemma 1 in (Li et al. 2019).

Proof.

Since $k^{\prime}$ is independent of $Y$, so is the permutation $\tau^{k^{\prime}}$. Since the $Y_{i}$ were assumed to be i.i.d., this implies that the $Y_{\tau^{k}(i)}$, and by extension the $Y^{k}_{L,i}$, are i.i.d. as well and have the same distribution as the $Y_{i}$. Moreover, the $Y^{k}_{i}$ are zero-mean and sub-Gaussian with norm at most $\sigma$, since

\displaystyle\left\lVert Y_{i}\right\rVert_{\psi_{2}}^{2}=\left\lVert f_{k}(X^{k}_{i})+\sum_{j\neq k}f_{j}(X^{j}_{i})+W\right\rVert_{\psi_{2}}^{2}\overset{(i)}{\leq}\left\lVert f_{k}(X^{k}_{i})\right\rVert_{\psi_{2}}^{2}+\sum_{j\neq k}\left\lVert f_{j}(X^{j}_{i})\right\rVert_{\psi_{2}}^{2}+\left\lVert W\right\rVert_{\psi_{2}}^{2}=\sigma^{2},

where $(i)$ follows from (P2). Focusing on the median split, Hoeffding’s inequality therefore implies that for any $t>0$, $\Pr\left(\left|\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]\right|>t\right)\leq 2e^{-c\cdot n_{L}t^{2}/\sigma^{2}}$ and $\Pr\left(\left|\widehat{\mathbb{E}}\left[Y^{k}_{R}\right]\right|>t\right)\leq 2e^{-c\cdot n_{R}t^{2}/\sigma^{2}}$. Setting $t=\frac{g_{k}}{30}$ and taking a union bound over both sub-trees implies that with probability at least $1-4e^{-c\cdot n\cdot\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}$,

|𝔼^[YRk]𝔼^[YLk]|gk15.\displaystyle\left|\widehat{\mathbb{E}}\left[Y^{k}_{R}\right]-\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]\right|\leq\frac{g_{k}}{15}.

Since nLnRn214\frac{n_{L}\cdot n_{R}}{n^{2}}\leq\frac{1}{4}, the claim follows from (3).

As for the optimal split, the analysis needs to change since the split point is not necessarily in the middle and also depends on the data. For a fixed $n_{L}$, however, the same analysis as above can be used to bound $\textnormal{imp}_{k,n_{L}}$ with minor modifications, as shown by (Li et al. 2019); while the concentration bound for $\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]$ is weaker since $\frac{n_{L}}{n}$ can be small, this is offset by the $\frac{n_{L}\cdot n_{R}}{n^{2}}$ term in (3), and ultimately the same bound on $\textnormal{imp}_{k,n_{L}}$ can be obtained. Taking a union bound over all $n$ possible splitting points proves the result. ∎
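The concentration step for the median split can also be illustrated empirically; the snippet below is a sketch with arbitrary choices of $n$, $\sigma$, and $t$, simulating the left and right half-means of $Y$ under a permutation drawn independently of $Y$ and comparing the empirical failure rate to a Hoeffding-style tail bound.

import numpy as np

# Illustration only: n, sigma, t, and the Gaussian responses are arbitrary choices.
rng = np.random.default_rng(1)
n, sigma, t, trials = 2000, 1.0, 0.1, 10_000
failures = 0
for _ in range(trials):
    Y = rng.normal(scale=sigma, size=n)        # zero-mean responses with norm sigma
    perm = rng.permutation(n)                  # permutation independent of Y
    left, right = Y[perm[: n // 2]], Y[perm[n // 2:]]
    if abs(left.mean()) > t or abs(right.mean()) > t:
        failures += 1
# Each half-mean is Gaussian with variance sigma^2 / (n/2); union bound over both halves.
bound = 4 * np.exp(-(n / 2) * t**2 / (2 * sigma**2))
print(f"empirical failure rate: {failures / trials:.4f},  tail bound: {bound:.4f}")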

A.5 Proof of Lemma 6.4

Lemma A.6.

Let the random variable Θ\Theta be distributed as Beta(nL+1,nR)\text{Beta}(n_{L}+1,n_{R}) for {nL,nR}={n2,n2}\{n_{L},n_{R}\}=\{\lfloor\frac{n}{2}\rfloor,\lceil\frac{n}{2}\rceil\}. For n5n\geq 5 and t23t\geq\frac{2}{3},

Pr(Θt)en(t23)2.\displaystyle\Pr\left(\Theta\geq t\right)\leq e^{-n(t-\frac{2}{3})^{2}}.
Proof.

Let $U_{1},\dots,U_{n}$ be i.i.d. $\textnormal{Unif}(0,1)$ random variables and let $\tau$ be their sorting permutation, i.e.,

Uτ(1)Uτ(n)\displaystyle U_{\tau(1)}\leq\dots\leq U_{\tau(n)}

It is well-known (e.g., see (Rundel 2012)) that for any kk, Uτ(k)U_{\tau(k)} is distributed as Beta(k,n+1k)\text{Beta}(k,n+1-k). Therefore, Θ\Theta has the same distribution as Uτ(nL+1)U_{\tau(n_{L}+1)}. This means that for any t[0,1]t\in[0,1],

Pr[Θt]\displaystyle\Pr\left[\Theta\geq t\right] =Pr[Uτ(nL+1)t]\displaystyle=\Pr\left[U_{\tau(n_{L}+1)}\geq t\right]
=Pr[|{k:Uk<t}|nL]\displaystyle=\Pr\left[\left|\{k:U_{k}<t\}\right|\leq n_{L}\right]
=Pr[|{k:Ukt}|nR]\displaystyle=\Pr\left[\left|\{k:U_{k}\geq t\}\right|\geq n_{R}\right]
=Pr[k𝟙[Ukt]nR]\displaystyle=\Pr\left[\sum_{k}\mathds{1}\left[U_{k}\geq t\right]\geq n_{R}\right]
=Pr[1n(k𝟙[Ukt])nRn]\displaystyle=\Pr\left[\frac{1}{n}\cdot\left(\sum_{k}\mathds{1}\left[U_{k}\geq t\right]\right)\geq\frac{n_{R}}{n}\right]

Defining $Z_{k}:=\mathds{1}\left[U_{k}\geq t\right]$, it is clear that the $Z_{k}$ are i.i.d. Bernoulli random variables with parameter $1-t$. Since $t\geq\frac{2}{3}$ and $\frac{n_{R}}{n}\geq\frac{1}{3}$, we have $1-t\leq\frac{1}{3}\leq\frac{n_{R}}{n}$, and Hoeffding’s inequality implies that

\displaystyle\Pr\left[\Theta\geq t\right]\leq\Pr\left[\frac{1}{n}\cdot\left(\sum_{k}Z_{k}\right)\geq\frac{1}{3}\right]\leq e^{-2n\left(\frac{1}{3}-(1-t)\right)^{2}}=e^{-2n\left(t-\frac{2}{3}\right)^{2}}\leq e^{-n\left(t-\frac{2}{3}\right)^{2}},

which completes the proof. ∎

By symmetry, we also have $\Pr\left(\Theta\leq t\right)\leq e^{-n(t-\frac{1}{3})^{2}}$ for any $t\leq\frac{1}{3}$, which implies Lemma 6.4 via a union bound.
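The bound of Lemma A.6 can also be checked directly against the Beta survival function; the snippet below (illustration only, using SciPy) verifies it for a few values of $n$ and $t$.

import numpy as np
from scipy.stats import beta

# Illustration only: check Pr(Theta >= t) <= exp(-n (t - 2/3)^2) for Theta ~ Beta(n_L+1, n_R)
# with n_L = floor(n/2), n_R = ceil(n/2), n >= 5, and t >= 2/3.
for n in (5, 20, 100, 500):
    n_L, n_R = n // 2, n - n // 2
    for t in (0.70, 0.80, 0.90):
        tail = beta.sf(t, n_L + 1, n_R)               # Pr(Theta >= t)
        bound = np.exp(-n * (t - 2.0 / 3.0) ** 2)
        assert tail <= bound, (n, t, tail, bound)
print("Beta tail bound of Lemma A.6 verified on all tested (n, t) pairs.")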

A.6 Proof of Theorem 5.4

Proof.

Consider the distribution of $\textnormal{imp}_{k}$ for an inactive $k$. Since $k$ is inactive, the sorting permutation $\tau_{k}$ is independent of $\mathbf{X},Y$. Therefore, the distribution of $\textnormal{imp}_{k}$ is the distribution of $\textnormal{imp}(\tau(Y))$, where $\tau$ is a random permutation on $[n]$ and $\textnormal{imp}(Y)$ is defined as in Lines 9, 11 and 13, i.e.,

imp(Y)=n2nVar^(Yn2)+nn2nVar^(Y>n2)\displaystyle\textnormal{imp}(Y)=\frac{\lfloor\frac{n}{2}\rfloor}{n}\widehat{\text{Var}}(Y_{\leq\lfloor\frac{n}{2}\rfloor})+\frac{n-\lfloor\frac{n}{2}\rfloor}{n}\widehat{\text{Var}}(Y_{>\lfloor\frac{n}{2}\rfloor})

for the median split and

imp(Y)=minnLnLnVar^(YnL)+nnLnVar^(Y>nL)\displaystyle\textnormal{imp}(Y)=\min_{n_{L}}\frac{n_{L}}{n}\widehat{\text{Var}}(Y_{\leq n_{L}})+\frac{n-n_{L}}{n}\widehat{\text{Var}}(Y_{>n_{L}})

for the optimal split. In the above equations, $Y_{\leq i}$ and $Y_{>i}$ refer to $(Y_{1},\dots,Y_{i})^{T}$ and $(Y_{i+1},\dots,Y_{n})^{T}$, respectively.

Similarly, for $t\in[T]$ and $i\in[p]$, $\tau^{(t)}_{i}$ is a random permutation independent of $\mathbf{X},Y$. Therefore, $\textnormal{imp}_{i}^{(t)}$ is also distributed as $\textnormal{imp}(\tau(Y))$ for a random permutation $\tau$ on $[n]$. Furthermore, all of these random permutations (and therefore all values $\textnormal{imp}_{i}^{(t)}$ and $\textnormal{imp}_{k}$ for inactive $k$) are independent of each other. This is because (a) all of the inactive features are independent of each other by assumption and (b) the permutations $\sigma^{(t)}$ are independent by design, as they were sampled independently in Line 4. We now observe that $\min_{k\notin S}\textnormal{imp}_{k}\leq\min_{i,t}\textnormal{imp}_{i}^{(t)}$ happens if and only if the minimum across all of these $(T-1)p+p-s$ numbers is attained among the $p-s$ numbers $\textnormal{imp}_{k}$ for $k\notin S$. By symmetry, this probability is exactly $\frac{p-s}{(T-1)p+p-s}\leq\frac{1}{T}$.

As for bounding the probability that $\max_{k\in S}\textnormal{imp}_{k}\geq\min_{i,t}\textnormal{imp}_{i}^{(t)}$, we note that, by the discussion above, the extra $\textnormal{imp}_{i}^{(t)}$ can be thought of as the $\textnormal{imp}_{k}$ of extra inactive features indexed by $(i,t)$. The probability that $\max_{k\in S}\textnormal{imp}_{k}\geq\min_{k\notin S}\textnormal{imp}_{k}$ can therefore be bounded as in Theorem 5.3. Taking a union bound with $\Pr\left[\min_{k\notin S}\textnormal{imp}_{k}\leq\min_{i,t}\textnormal{imp}_{i}^{(t)}\right]$ proves the theorem’s statement. ∎
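The symmetry argument above can be illustrated with a small simulation; in the sketch below, the values of $p$, $s$, and $T$ are arbitrary, and i.i.d. uniform draws stand in for the exchangeable impurity values. The chance that the overall minimum falls among the $p-s$ real inactive impurities matches the stated ratio.

import numpy as np

# Illustration only: i.i.d. Unif(0,1) draws play the role of the exchangeable impurities.
rng = np.random.default_rng(2)
p, s, T, trials = 50, 5, 10, 200_000
hits = 0
for _ in range(trials):
    real = rng.random(p - s)             # stand-ins for imp_k, k not in S
    shadow = rng.random((T - 1) * p)     # stand-ins for the shadow impurities imp_i^(t)
    hits += real.min() <= shadow.min()
print(f"empirical: {hits / trials:.4f},  "
      f"exact: {(p - s) / ((T - 1) * p + p - s):.4f},  1/T: {1 / T:.4f}")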

A.7 Experimental Configurations

All numerical simulations were conducted in Python, using the SkLearn package to implement both Lasso and the single-depth decision trees. The Lasso regression was fit using a regularization strength of $C=2$ and the liblinear solver. Splitting points for the single-depth decision trees were identified by variance reduction, as described in the pseudocode of Algorithm 1. Data was generated randomly using the PyTorch package, and all simulations were performed on an NVIDIA RTX A4000 GPU through Python’s CUDA ecosystem.
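For reference, the following is a minimal sketch of such a recovery simulation in Python; it is illustrative rather than the exact configuration used in our experiments (the constants, noise level, and data distribution are arbitrary, and scikit-learn’s DecisionTreeRegressor stands in for Algorithm 1).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative sketch of the recovery experiment (not the exact configuration used above).
rng = np.random.default_rng(3)
n, p, s = 500, 200, 5
X = rng.random((n, p))                               # features drawn from Unif(0, 1)
beta = np.zeros(p); beta[:s] = 1.0                   # s active coordinates
y = X @ beta + 0.5 * rng.normal(size=n)              # sparse linear response plus noise

# Fit one decision stump per feature and score it by its reduction in variance.
scores = np.empty(p)
for k in range(p):
    stump = DecisionTreeRegressor(max_depth=1).fit(X[:, [k]], y)
    scores[k] = np.var(y) - np.mean((y - stump.predict(X[:, [k]])) ** 2)

selected = np.sort(np.argsort(scores)[-s:])          # top-s features by variance reduction
print("recovered active set:", selected)             # should be [0 1 2 3 4] w.h.p.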