Optimal Sparse Recovery with Decision Stumps
Abstract
Decision trees are widely used for their low computational cost, good predictive performance, and ability to assess the importance of features. Though often used in practice for feature selection, the theoretical guarantees of these methods are not well understood. We here obtain a tight finite sample bound for the feature selection problem in linear regression using single-depth decision trees. We examine the statistical properties of these “decision stumps” for recovering the set of active features from a much larger total feature set. Our analysis provides tight sample performance guarantees on high-dimensional sparse systems which align with the finite sample bound obtained by Lasso, improving upon previous bounds for both the median and optimal splitting criteria. Our results extend to the non-linear regime as well as arbitrary sub-Gaussian distributions, demonstrating that tree-based methods attain strong feature selection properties under a wide variety of settings and further shedding light on the success of these methods in practice. As a byproduct of our analysis, we show that we can provably guarantee recovery even when the number of active features is unknown. We further validate our theoretical results and proof methodology using computational experiments.
1 Introduction
Decision trees are one of the most popular tools used in machine learning. Due to their simplicity and interpretability, trees are widely implemented by data scientists, both individually and in aggregation with ensemble methods such as random forests and gradient boosting (Friedman 2001).
In addition to their predictive accuracy, tree-based methods are an important tool for the variable selection problem: identifying a small subset of a high-dimensional input feature space that can accurately predict the output. When the relationship between the variables is linear, it has long been known that Lasso achieves the optimal sample complexity rate for this problem (Wainwright 2009a). In practice, however, tree-based methods have been shown to be preferable as they scale linearly with the size of the data and can capture non-linear relationships between the variables (Xu et al. 2014).
Notably, various tree-structured methods are employed for this variable selection task across a wide range of domains. For example, tree-based importance measures have been used for the prediction of financial distress (Qian et al. 2022), wireless signal recognition (Li and Wang 2018), credit scoring (Xia et al. 2017), and the discovery of genes in bioinformatics (Breiman 2001; Bureau et al. 2005; Huynh-Thu et al. 2012; Lunetta et al. 2004), to name only a few.
Despite this empirical success however, the theoretical properties of these tree based methods for feature selection are not well understood. While several papers have considered variants of this problem (see Section 3 for an overview), even in the simple linear case, the sample complexity of the decision tree is not well-characterized.
In this paper, we attempt to bridge this gap and analyze the variable selection properties of single-level decision trees, commonly referred to as “decision stumps” (DStump ). Considering both linear and non-linear settings, we show that DStump achieves nearly-tight sample complexity rates for a variety of practical sample distributions. Compared to prior work, our analysis is simpler and applies to different variants of the decision tree, as well as more general function classes.
The remainder of the paper is organized as follows: in Section 2 we introduce our main results on the finite sample guarantees of DStump, in Section 3 we discuss important prior results in the field of sparse recovery and where they fall short, in Section 4 we present the recovery algorithm, and in Section 5 we provide theoretical guarantees for the procedure under progressively more general problem assumptions. The proofs of these results are provided in Section 6. We supplement the theoretical results with computational experiments in Section 7 and conclude the paper in Section 8.
2 Our Results
We assume that we are given a dataset of samples from a non-parametric regression model in which each sample consists of an input vector and a corresponding output, the noise values are i.i.d., and the underlying univariate functions are sparse:
Definition 2.1.
A set of univariate functions is -sparse on feature set if there exists a set with size such that
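In symbols, and with placeholder notation of our own (S for the active feature set, s for its size, p for the total number of features, and w_i for the noise), a sparse additive model of this form can be written as
\[
y_i \;=\; \sum_{j=1}^{p} f_j(x_{ij}) + w_i, \qquad f_j \equiv 0 \ \text{for all } j \notin S, \qquad |S| = s \ll p,
\]
so that only the features indexed by S influence the output.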
Given access to this dataset, we consider the sparse recovery problem, where we attempt to reveal the active set using only a minimal number of samples. Note that this is different from the prediction problem, where the goal is to learn the functions themselves. In accordance with prior work (Han et al. 2020; Kazemitabar et al. 2017; Klusowski 2020), we assume the inputs are i.i.d. draws from the uniform distribution with additive Gaussian noise. In Section 5, we will discuss how this assumption can be relaxed using our non-parametric results to consider more general distributions.
For the recovery problem, we consider the DStump algorithm as first proposed by (Kazemitabar et al. 2017). The algorithm, shown in Algorithm 1, fits a single-level decision tree (stump) on each feature using either the “median split” or the “optimal split” strategy and ranks the features by the error of the corresponding trees. The median split yields a substantially simpler implementation, as we do not need to optimize the stump construction, and further provides an improved run time over the comparable optimal splitting procedure. Indeed, the time complexity of the median split is lower than that of the optimal split.
In spite of this simplification, we show that in the widely considered case of linear design, where the underlying functions are linear, DStump correctly recovers the active set with a sample complexity matching the minimax optimal lower bound for the problem as achieved by Lasso (Wainwright 2009a, b). This result is noteworthy and surprising since the DStump algorithm (and decision trees in general) is not designed around a linearity assumption, as Lasso is. Trees are generally utilized for their predictive power on non-linear models, yet our work proves their value in the linear regime as well. We further extend these results to non-linear models and general sub-Gaussian distributions, improving previously known results using simpler analysis. In addition, our results do not require the sparsity level to be known in advance. We summarize our main technical results as follows:
• We obtain a sample complexity bound for the DStump algorithm in the linear design case that matches the optimal rate of Lasso, improving prior bounds in the literature for both the median and optimal split. This is the first tight bound on the sample complexity of single-depth decision trees used for sparse recovery and significantly improves on existing results.
• We extend our results to the case of non-linear models, obtaining tighter bounds for a wider class of functions than the existing literature. We further use this non-linear extension to generalize our analysis to sub-Gaussian input distributions.
• As a byproduct of our improved analysis, we show that our results hold for the case where the number of active features is not known. This is the first theoretical guarantee on decision stumps that does not require the sparsity level to be known.
• We validate our theoretical results using numerical simulations that show the necessity of our analytic techniques.
3 Related Work
While our model framework and the sparsity problem as a whole have been studied extensively (Fan and Lv 2006; Lafferty and Wasserman 2008; Wainwright 2009a, b), none have replicated the well known optimal lower bound for the classification problem under the given set of assumptions. Our work provides improved finite sample guarantees on DStump for the regression problem that nearly match that of Lasso by means of weak learners.
Most closely related to our work is that of (Kazemitabar et al. 2017), which formulates the DStump algorithm and theoretical approach for finite sample guarantees of these weak learners. Unlike our nearly tight result on the number of samples required for recovery, (Kazemitabar et al. 2017) provides a weaker bound when using the median splitting criterion. We here demonstrate that the procedure can obtain the near optimal finite sample guarantees by highlighting a subtle nuance in the analysis of the stump splitting (potentially of independent interest to the reader); instead of using the variance of one sub-tree as an impurity measure, we use the variance of both sub-trees. As we will show both theoretically and experimentally, this consideration is vital for obtaining tight bounds. Our analysis is also more general than that of (Kazemitabar et al. 2017), with applications to both median and optimal splits, a wider class of functions , and more general distributions.
In a recent work, (Klusowski and Tian 2021) provide an indirect analysis of the DStump formulation with the optimal splitting criterion by studying its relation to the SIS algorithm of (Fan and Lv 2006). Designed specifically for linear models, SIS sorts the features based on their Pearson correlation with the output and has optimal sample complexity in the linear setting. (Klusowski and Tian 2021) show that when using the optimal splitting criterion, DStump is equivalent to SIS up to logarithmic factors, which yields a corresponding sample complexity bound.
This improves the results of (Kazemitabar et al. 2017), though the analysis is more involved. A similar technique is also used to study the non-linear case, but the conditions assumed there are hard to interpret and the bounds are weaker. Indeed, instantiating the non-parametric results for the linear case implies a sub-optimal sample complexity rate. In addition, (Klusowski and Tian 2021) describe a heuristic algorithm for the case of unknown sparsity level, though they do not prove any guarantees on its performance.
In contrast, we provide a direct analysis of DStump . This allows us to obtain optimal bounds for both the median and optimal split in the linear case, as well as improved and generalized bounds in the non-linear case. Our novel approach further allows us to analyze the case of unknown sparsity level, as we provide the first formal proof for the heuristic algorithm suggested in (Klusowski and Tian 2021) and (Fan, Feng, and Song 2011). Compared to prior work, our analysis is considerably simpler and applies to more general settings.
Additionally, various studies have leveraged the simplicity of median splits in decision trees to produce sharp bounds on the mean-squared error of random forests for the regression problem (Duroux, Roxane and Scornet, Erwan 2018; Klusowski 2021). In each of these studies, analysis under the median split assumption allows for improvements in both asymptotic and finite sample bounds on prediction error for these ensembles of weak learners. In the present work, we extend this intuition to the feature selection problem for high-dimensional sparse systems, and further emphasize the utility of the median split even for a single decision stump.
4 Algorithm Description
4.1 Notation and Problem Setup
For mathematical convenience, we adopt matrix notation and use and to denote the input matrix and the output vector respectively. We use and to refer to the -th row and -th column of the matrix . We will also extend the definition of the univariate functions to vectors by assuming that the function is applied to each coordinate separately: for , we define as .
We let and denote the expectation and variance for random variables, with and denoting their empirical counterparts, i.e., for a generic vector ,
We will also use to denote the set . Finally, we let denote the uniform distribution over and use to denote generic universal constants.
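For concreteness, the empirical operators above are the standard ones; in hat notation of our own, for a generic vector u of length n they read
\[
\hat{\mathbb{E}}[u] \;=\; \frac{1}{n}\sum_{i=1}^{n} u_i,
\qquad
\widehat{\operatorname{Var}}(u) \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(u_i - \hat{\mathbb{E}}[u]\bigr)^2 .
\]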
Throughout the paper, we will use the concept of sub-Gaussian random variables for stating and proving our results. A random variable is called sub-Gaussian if there exists a for which and its sub-Gaussian norm is defined as
Sub-Gaussian random variables are well studied and have many desirable properties (see (Vershynin 2018) for a comprehensive overview), some of which we outline below as they are leveraged throughout our analysis.
(P1) (Hoeffding’s Inequality) If is a sub-Gaussian random variable, then for any ,
(P2) If are a sequence of independent sub-Gaussian random variables, then is also sub-Gaussian with norm satisfying
(P3) If is a sub-Gaussian random variable, then so is and .
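For reference, the standard forms of these properties that we rely on read as follows (see Vershynin 2018); here c and C denote absolute constants, and the sub-Gaussian norm is the usual \(\psi_2\) norm, \(\|X\|_{\psi_2} = \inf\{K > 0 : \mathbb{E}\exp(X^2/K^2) \le 2\}\):
\[
\text{(P1)}\quad \Pr\bigl(|X - \mathbb{E}X| \ge t\bigr) \;\le\; 2\exp\!\bigl(-c\,t^2/\|X\|_{\psi_2}^2\bigr) \quad \text{for all } t \ge 0,
\]
\[
\text{(P2)}\quad \Bigl\|\textstyle\sum_{i=1}^{n} X_i\Bigr\|_{\psi_2}^2 \;\le\; C \sum_{i=1}^{n} \|X_i\|_{\psi_2}^2 \quad \text{for independent } X_1,\dots,X_n,
\]
\[
\text{(P3)}\quad \|X - \mathbb{E}X\|_{\psi_2} \;\le\; C\,\|X\|_{\psi_2}.
\]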
4.2 DStump Algorithm
We here present the recovery algorithm DStump. For each feature, the algorithm fits a single-level decision tree, or “stump”, on the given feature and defines the impurity of the feature as the error of this decision tree. Intuitively, the active features are expected to be better predictors of the output and therefore have lower impurity values. Thus, the algorithm sorts the features based on these values and outputs the features with the smallest impurity. A formal pseudo-code of our approach is given in Algorithm 1.
Formally, for each feature , the -th column is sorted in increasing order such that with ties broken randomly. The samples are then split into two groups: the left half, consisting of and the right half, consisting of . Conceptually, this is the same as splitting a single-depth tree on the -th column with a to ratio and collecting the samples in each group. The algorithm then calculates the variance of the output in each group, which represents the optimal prediction error when predicting the group with a single value. The average of these two variances is taken as the impurity. Formally,
(1)
For the median split algorithm, the value of is simply chosen as or , whereas for the optimal split, the value is chosen in order to minimize the impurity of the split. Lastly, the features are sorted by their impurity values and the features with the lowest impurity are predicted to form the active set. In Section 5.3, we discuss the case of unknown sparsity level and obtain an algorithm with nearly matching guarantees.
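To make the procedure concrete, the following minimal Python sketch mirrors the description above; the helper names such as stump_impurity and dstump are our own, and details may differ from the paper's Algorithm 1.

    import numpy as np

    def stump_impurity(y_sorted, split):
        # Weighted within-group variance of the two sub-trees (cf. Eq. (1));
        # for the median split the two groups have equal size and this is
        # simply the average of the two variances.
        left, right = y_sorted[:split], y_sorted[split:]
        n = len(y_sorted)
        return (len(left) * left.var() + len(right) * right.var()) / n

    def dstump(X, y, k, optimal=False):
        n, p = X.shape
        imp = np.empty(p)
        for j in range(p):
            y_sorted = y[np.argsort(X[:, j])]   # order the outputs by feature j
            if optimal:
                # optimal split: scan every cut point and keep the best impurity
                imp[j] = min(stump_impurity(y_sorted, s) for s in range(1, n))
            else:
                # median split: cut the sorted samples into two equal halves
                imp[j] = stump_impurity(y_sorted, n // 2)
        return np.argsort(imp)[:k]              # the k features with lowest impurity

A call such as dstump(X, y, k) then returns the indices predicted to form the active set.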
5 Main Results
5.1 Linear Design
For our first results, we consider the simplest setting of linear models with a uniform distribution of the inputs. Formally, we assume that there is a vector such that for all . This is equivalent to considering the linear regression model . We further assume that each entry of the matrix is an i.i.d draw from the uniform distribution on . This basic setting is important from a theoretical perspective as it allows us to compare with existing results from the literature before extending to more general contexts. Our initial result, Theorem 5.1, provides an upper bound on the failure probability of DStump in this setting.
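Concretely, writing \(\beta\) for the (unnamed) coefficient vector and \(w\) for the noise (placeholder symbols of ours), the linear design amounts to
\[
f_j(x) = \beta_j x \ \text{ with } \beta_j \neq 0 \iff j \in S,
\qquad\text{i.e.}\qquad
y_i = \sum_{j \in S} \beta_j x_{ij} + w_i .
\]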
Theorem 5.1.
Assume that each entry of the input matrix is sampled i.i.d from a uniform distribution on . Assume further that the output vectors satisfy the linear regression model where are sampled i.i.d from . Algorithm 1 correctly recovers the active feature set with probability at least
for the median split, and with probability at least
for the optimal split.
The above theorem provides a finite sample guarantee for the DStump algorithm and does not make any assumptions on the parameters or their asymptotic relationship. In order to compare with the existing literature (Kazemitabar et al. 2017; Klusowski and Tian 2021; Wainwright 2009a, b), we consider these bounds in the asymptotic regime.
Corollary 5.2.
The proof of the Corollary is presented in the Appendix. The above result shows that DStump is optimal for recovery when the data obeys a linear model. This is notable given that tree-based methods are known for their strength in capturing non-linear relationships and, unlike Lasso, are not designed around a linearity assumption. In the next section, we extend our finite sample analysis to non-linear models.
5.2 Additive Design
We here consider the case of non-linear and obtain theoretical guarantees for the original DStump algorithm. Our main result is Theorem 5.3 stated below.
Theorem 5.3.
Assume that each entry of the input matrix is sampled i.i.d from a uniform distribution on and where are sampled i.i.d from . Assume further that each is monotone and is sub-Gaussian with sub-Gaussian norm . For , define as
(2)
and define as . Algorithm 1 correctly recovers the set with probability at least
for the median split and
for the optimal split.
Note that, by instantiating Theorem 5.3 for linear models, we obtain the same bound as in Theorem 5.1 implying the above bounds are tight in the linear setting.
The extension to all monotone functions in Theorem 5.3 has an important theoretical consequence: since the DStump algorithm is invariant under monotone transformations of the input, we can obtain the same results for any distribution of the inputs. As a simple example, assume that we are interested in bounds for linear models with Gaussian inputs. Define a transformed matrix by applying the CDF of the Gaussian distribution to each entry of the input matrix. Since the CDF is an increasing function, running the DStump algorithm on the transformed matrix produces the same output as running it on the original. Furthermore, applying the CDF of a random variable to itself yields a uniform random variable, so the transformed entries are i.i.d uniform draws. The results of Theorem 5.3 then imply the same bound as Theorem 5.1. Notably, we therefore obtain the same sample complexity bound for the Gaussian distribution as well. In the appendix, we discuss a generalization of this idea, which effectively allows us to remove the uniform distribution condition in Theorem 5.3.
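A small numerical illustration of this argument, with scipy's Gaussian CDF standing in for the transformation described above (the variable names are ours):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 5))   # Gaussian design matrix
    U = norm.cdf(X)                      # entrywise CDF: entries are now Uniform(0, 1)

    # Every stump split depends on the data only through the ordering of each
    # column, and a monotone map leaves that ordering unchanged, so DStump's
    # output on U coincides with its output on X.
    for j in range(X.shape[1]):
        assert np.array_equal(np.argsort(X[:, j]), np.argsort(U[:, j]))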
5.3 Unknown sparsity level
A drawback of the previous results is that they assume the sparsity level is given when, in general, this is not the case. Even if the sparsity level is not known, however, Theorem 5.3 guarantees that the active features are ranked ahead of the inactive ones, i.e., every active feature has a smaller impurity than every inactive one. In order to recover the active set, it then suffices to find a threshold on the impurity values that separates the two groups.
To solve this, we use the so-called “permutation algorithm”, which is a well known heuristic in the statistics literature (Barber and Candès 2015; Chung and Romano 2013, 2016; Fan, Feng, and Song 2011) and was discussed (without proof) by (Klusowski and Tian 2021) as well. Formally, we apply a random permutation to the rows of the input matrix, obtaining a permuted matrix, and rerun Algorithm 1 with this permuted matrix and the original output as input. The random permutation “decouples” the inputs from the outputs, so that effectively all of the features are now inactive; we therefore expect the resulting impurity scores to behave like those of inactive features. Since this estimate may be somewhat conservative, we repeat the sampling and take the minimum value across these repetitions. A formal pseudocode is provided in Algorithm 2. The StumpScore method is the same algorithm as Algorithm 1, with the distinction that it returns imp in Line 14.
Assuming we have used repetitions in the algorithm, the probability that is at most . While we provide a formal proof in the appendix, the main intuition behind the result is that and for are all the impurities corresponding to inactive features. Thus, the probability that the maximum across all of these impurities falls in is at most . Ensuring that involves treating the extra impurity scores as extra inactive features. This means that we can use the same results of Theorem 5.3 with set to since our sample complexity bound is logarithmic in . The formal result follows with proof in the appendix.
Theorem 5.4.
We note that if we set for some constant , we obtain the same as before.
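A sketch of this permutation procedure is given below; it assumes a helper impurity_scores(X, y) that returns the per-feature impurity values computed by Algorithm 1 (the helper and all names here are ours, and Algorithm 2 may differ in details).

    import numpy as np

    def recover_unknown_sparsity(X, y, impurity_scores, n_reps=10, seed=None):
        rng = np.random.default_rng(seed)
        imp = impurity_scores(X, y)                   # impurities on the original data
        # Shuffling the rows of X decouples the inputs from the outputs, so every
        # feature behaves like an inactive one; the smallest impurity observed on
        # the shuffled data serves as a threshold.  Repeating and taking the
        # minimum makes the threshold less conservative.
        threshold = min(impurity_scores(X[rng.permutation(len(y))], y).min()
                        for _ in range(n_reps))
        return np.flatnonzero(imp < threshold)        # predicted active features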
6 Proofs
In this section, we prove Theorem 5.3 as it is the more general version of Theorem 5.1, and defer the remainder of the proofs to the appendix.
To prove that the algorithm succeeds, we need to show that for all . We proceed by first proving an upper bound for all
Lemma 6.1.
In the setting of Theorem 5.3, for any active feature , with probability at least
Subsequently, we need to prove an analogous lower bound on for all .
Lemma 6.2.
In the setting of Theorem 5.3, for any inactive feature , with probability at least
for the median split and
for the optimal split.
Taking the union bound over all , Lemmas 6.1 and 6.2 prove the theorem as they show that for all with the desired probabilities.
We now focus on proving Lemma 6.1.
We assume without loss of generality that is increasing (the case of decreasing follows by a symmetric argument or by mapping ). We further assume that as DStump is invariant under constant shifts of the output. Finally, we assume that , as for , the theorem’s statement can be made vacuous by choosing large . We will assume throughout the proof; as such, our results will hold for both the optimal and median splitting criteria. As noted before, a key point for obtaining a tight bound is considering both sub-trees in the analysis instead of considering them individually. Formally, while the impurity is usually defined via variance as in (1), it has the following equivalent definition.
(3)
The above identity is commonly used for the analysis of decision trees and their properties (Breiman et al. 1983; Li et al. 2019; Klusowski 2020; Klusowski and Tian 2021). From an analytic perspective, the key difference between (3) and (1) is that the empirical averaging is calculated before taking the square, allowing us to more simply analyze the first moment rather than the second.
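For the median split (equal-size sub-trees), one standard way of writing such an identity, with \(y_L\) and \(y_R\) denoting the outputs in the left and right sub-tree and hat notation as above (our reconstruction, not necessarily the exact form of (3)), is
\[
\tfrac{1}{2}\widehat{\operatorname{Var}}(y_L) + \tfrac{1}{2}\widehat{\operatorname{Var}}(y_R)
\;=\;
\widehat{\operatorname{Var}}(y) \;-\; \tfrac{1}{4}\bigl(\hat{\mathbb{E}}[y_L] - \hat{\mathbb{E}}[y_R]\bigr)^2 ,
\]
so a small impurity is equivalent to a large gap between the two sub-tree means; for an uneven split with \(n_L\) and \(n_R\) samples the cross term carries weight \(n_L n_R / n^2\) instead of \(1/4\).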
Intuitively, we want to use concentration inequalities to show that and concentrate around their expectations and lower bound . This is challenging, however, as concentration results typically require an i.i.d assumption but, as we will see, the relevant random variables are not i.i.d. More formally, for each , define the random variable as and thus . While the random vectors and have i.i.d entries, was obtained by sorting the coordinates of . Thus, its coordinates are non-identically distributed and dependent. To solve the first problem, observe that the empirical mean is invariant under permutation and we can thus randomly shuffle the elements of in order to obtain a vector with identically distributed coordinates. Furthermore, by De Finetti’s Theorem, any random vector with coordinates that are identically distributed (but not necessarily independent) can be viewed as a mixture of i.i.d vectors, effectively solving the second problem. Formally, the following result holds.
Lemma 6.3 (Lemma 1 in (Kazemitabar et al. 2017)).
Let be a random permutation on independent of and define as . The random vector is distributed as a mixture of i.i.d uniform vectors of size on with sampled from .
Defining as , it is clear that and therefore we can analyze instead of as
which, given Lemma 6.3, can be seen as a mixture of i.i.d random variables.
Lemma 6.3 shows that there are two sources of randomness in the distribution of : the mixing variable and the sub-Gaussian randomness of and . For the second source, it is possible to use standard concentration inequalities to show that conditioned on , concentrates around . We will formally do this in Lemma 6.5. Before we do this however, we focus on the first source and how affects the distribution of .
Since is sampled from , we can use standard techniques to show that it concentrates around . More formally, we can use the following lemma, the proof of which is in the appendix.
Lemma 6.4.
If , we have with probability at least .
Given the above result, we can analyze assuming . In this case, we can use concentration inequalities to show that with high probability, concentrates around . Since was assumed to be increasing, this can be further bounded by . Formally, we obtain the following result.
Lemma 6.5.
Let be an active feature. For any ,
Furthermore, letting denote ,
(4)
Proof.
For ease of notation, we will fix and let the random variables and denote and respectively. Recall that for all , was sub-Gaussian with parameter by assumption. It is straightforward to show (see the Appendix) that this means is also sub-Gaussian with norm at most . Thus,
In the above analysis, follows from the independence assumption of together with the i.i.d assumption on . As for , it follows from (P2) together with the independence assumption of . Property (P3) further implies that is upper bounded by , proving the first Equation in the Lemma.
Now, using Hoeffding’s inequality, we obtain
Using Lemma 6.4 with for any two events , we obtain
Note however that which as we show in the appendix, can further be upper bounded by , concluding the proof. ∎
Using the symmetry of the decision tree algorithm, we can further obtain that
(5)
from (4) with the change of variable and . Taking union bound over (4) and (5), it follows that with probability at least ,
As we show in the appendix, however, a simple application of conditional expectations implies . Therefore, with probability at least , we have the stated inequality. Assuming , we can further conclude that , which, together with (3), proves the lemma.
7 Experimental Results
[Figure 1: minimal sample count required for recovery as a function of the sparsity level, comparing DStump with the median split, DStump with the optimal split, and Lasso. Left: uniform design; middle: impurity computed from the left sub-tree only; right: Gaussian design.]
In this section, we provide further justification of our theoretical results in the form of simulations on the finite sample count for active feature recovery under different regimes, as well as the predictive power of a single sub-tree as compared to the full tree. We additionally contrast DStump with the widely studied optimal Lasso.
We first validate the result of Theorem 5.1 and consider the linear design with design matrix entries sampled i.i.d. from the uniform distribution with additive Gaussian noise. Concretely, we examine the optimal number of samples required to recover approximately the desired fraction of the active features. This is achieved by conducting a binary search on the number of samples to find the minimal value that recovers the desired fraction of the active feature set, averaged across 25 independent replications. In the leftmost plot of Figure 1, we plot the sample count as a function of varying sparsity levels for DStump with the median split, DStump with the optimal split, as well as Lasso for benchmarking (with penalty parameter selected through standard cross-validation). By fixing the number of features, we evaluate the dependence of the required sample count on the sparsity level alone. The results here validate the theoretical bound that nearly matches the optimal Lasso. Also of note, the number of samples required by the median split is less than that of the optimal split. Thus, in the linear setting, we see that DStump with median splitting is both simpler and computationally cheaper. This optimal bound result is repeated with Gaussian data samples in the rightmost plot of Figure 1. Notably, in this setting the optimal split decision stumps perform better than the median split, demonstrating their varied utility under different problem contexts.
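The binary search itself can be sketched as follows; run_recovery(n) is a placeholder for one simulated trial (draw n samples, run DStump or Lasso, and return the indices it selects), and the recovery fraction frac stands in for the target fraction used in our experiments.

    def minimal_sample_count(run_recovery, active_set, frac=0.9, lo=10, hi=100_000):
        """Smallest n in [lo, hi] whose trial recovers at least frac of active_set."""
        active = set(active_set)
        target = frac * len(active)       # frac=0.9 is an illustrative default
        while lo < hi:
            mid = (lo + hi) // 2
            recovered = set(run_recovery(mid))
            if len(recovered & active) >= target:
                hi = mid        # enough of the active set recovered: try fewer samples
            else:
                lo = mid + 1    # not enough: more samples are needed
        return lo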
We additionally reiterate that the prior work of (Kazemitabar et al. 2017) attempted to simplify the analysis of the sparse recovery problem using DStump by examining only the left sub-tree, which produced the non-optimal finite sample bound. To analyze the effect of this choice, the middle plot of Figure 1 presents the optimal sample recovery count when using only the left sub-tree subject to the additive model of Theorem 5.1. In accordance with our expectation and previous literature’s analysis, we see a clear quadratic relationship between and when fixing the feature count .
Overall, these simulations further validate the practicality and predictive power of decision stumps. Benchmarked against the optimal Lasso, we see a slight decrease in performance but a reduction in computation and a simplification of the analysis.
8 Conclusion
In this paper, we presented a simple and consistent feature selection algorithm for the regression setting based on single-depth decision trees, and derived finite-sample performance guarantees in a high-dimensional sparse system. Our theoretical results demonstrate that this very simple class of weak learners is nearly optimal compared to the gold standard Lasso. We have provided strong theoretical evidence for the success of binary decision tree based methods in practice, along with a framework on which to extend the analysis of these structures to arbitrary height, a potential direction for future work.
9 Acknowledgements
The work is partially supported by DARPA QuICC, NSF AF:Small #2218678, and NSF AF:Small #2114269. Max Springer was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1840340. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References
- Barber and Candès (2015) Barber, R. F.; and Candès, E. J. 2015. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5): 2055–2085.
- Breiman (2001) Breiman, L. 2001. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3): 199 – 231.
- Breiman et al. (1983) Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1983. Classification and Regression Trees.
- Bureau et al. (2005) Bureau, A.; Dupuis, J.; Falls, K.; Lunetta, K.; Hayward, B.; Keith, T.; and Eerdewegh, P. 2005. Identifying SNPs predictive of phenotype using random forests. Genetic epidemiology, 28: 171–82.
- Chung and Romano (2013) Chung, E.; and Romano, J. P. 2013. Exact and asymptotically robust permutation tests. The Annals of Statistics, 41(2): 484–507.
- Chung and Romano (2016) Chung, E.; and Romano, J. P. 2016. Multivariate and multiple permutation tests. Journal of econometrics, 193(1): 76–91.
- Duroux, Roxane and Scornet, Erwan (2018) Duroux, Roxane; and Scornet, Erwan. 2018. Impact of subsampling and tree depth on random forests. ESAIM: PS, 22: 96–128.
- Fan, Feng, and Song (2011) Fan, J.; Feng, Y.; and Song, R. 2011. Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494): 544–557.
- Fan and Lv (2006) Fan, J.; and Lv, J. 2006. Sure independence screening for ultrahigh dimensional feature space. Journal of The Royal Statistical Society Series B-statistical Methodology, 70: 849–911.
- Friedman (2001) Friedman, J. H. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29: 1189–1232.
- Han et al. (2020) Han, C.; Rao, N. S.; Sorokina, D.; and Subbian, K. 2020. Scalable Feature Selection for (Multitask) Gradient Boosted Trees. ArXiv, abs/2109.01965.
- Huynh-Thu et al. (2012) Huynh-Thu, V. A.; Irrthum, A.; Wehenkel, L.; and Geurts, P. 2012. Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. PLOS ONE, 5(9): 1–10.
- Kazemitabar et al. (2017) Kazemitabar, S. J.; Amini, A. A.; Bloniarz, A.; and Talwalkar, A. S. 2017. Variable Importance Using Decision Trees. In NIPS.
- Klusowski (2021) Klusowski, J. 2021. Sharp Analysis of a Simple Model for Random Forests. In Banerjee, A.; and Fukumizu, K., eds., Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, 757–765. PMLR.
- Klusowski (2020) Klusowski, J. M. 2020. Sparse learning with CART. ArXiv, abs/2006.04266.
- Klusowski and Tian (2021) Klusowski, J. M.; and Tian, P. M. 2021. Nonparametric Variable Screening with Optimal Decision Stumps. In AISTATS.
- Lafferty and Wasserman (2008) Lafferty, J. D.; and Wasserman, L. A. 2008. Rodeo: Sparse, greedy nonparametric regression. Annals of Statistics, 36: 28–63.
- Li and Wang (2018) Li, L.; and Wang, J. 2018. Research on feature importance evaluation of wireless signal recognition based on decision tree algorithm in cognitive computing. Cognitive Systems Research, 52: 882–890.
- Li et al. (2019) Li, X.; Wang, Y.; Basu, S.; Kumbier, K.; and Yu, B. 2019. A Debiased mDI Feature Importance Measure for Random Forests. ArXiv, abs/1906.10845.
- Lunetta et al. (2004) Lunetta, K. L.; Hayward, L. B.; Segal, J.; and Eerdewegh, P. V. 2004. Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics, 5(1): 32.
- Qian et al. (2022) Qian, H.; Wang, B.; Yuan, M.; Gao, S.; and Song, Y. 2022. Financial distress prediction using a corrected feature selection measure and gradient boosted decision tree. Expert Systems with Applications, 190: 116202.
- Rundel (2012) Rundel, C. 2012. Lecture 15: Order statistics. https://www2.stat.duke.edu/courses/Spring12/sta104.1/Lectures/Lec15.pdf.
- Vershynin (2018) Vershynin, R. 2018. High-Dimensional Probability: An Introduction with Applications in Data Science.
- Wainwright (2009a) Wainwright, M. J. 2009a. Information-Theoretic Limits on Sparsity Recovery in the High-Dimensional and Noisy Setting. IEEE Transactions on Information Theory, 55: 5728–5741.
- Wainwright (2009b) Wainwright, M. J. 2009b. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory.
- Xia et al. (2017) Xia, Y.; Liu, C.; Li, Y.; and Liu, N. 2017. A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Expert Systems with Applications, 78: 225–241.
- Xu et al. (2014) Xu, Z.; Huang, G.; Weinberger, K. Q.; and Zheng, A. X. 2014. Gradient boosted feature selection. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 522–531.
Appendix A Appendix
A.1 Extension to non-uniform distributions
In this section, we extend our main result, i.e., Theorem 5.3, to non-uniform distributions.
Theorem A.1.
Assume that the entries of the input matrix are sampled independently, the entries in each column are i.i.d., and where are sampled i.i.d from . Assume further that each is monotone and is sub-Gaussian with sub-Gaussian norm . Let denote the CDF of the distribution of . Define and define . For , define as
(6)
and define as . Algorithm 1 correctly recovers the set with probability at least
for the median split and
for the optimal split.
Proof.
We first claim that if , then has the same distribution as . Formally, for all ,
where follows from the definition of , follows from the fact that is increasing and follows from the fact that is right continuous. Therefore,
where for we have used the fact that is sampled uniformly on . Since , the claim is proved.
Given this result, we can assume that were set as for a latent variable . While this means is no longer a function of , still has the same distribution as . To see why, consider a fixed value of and sample based on the conditional distribution. If for some , then as well since is a function. As for the case of , then and are equi-probable, which is consistent with the assumption that argsort breaks ties randomly. We now note that
Since and have the same distribution, Algorithm 1 has the same output on as . Invoking Theorem 5.3 therefore completes the proof.
∎
A.2 Auxiliary lemmas
Lemma A.2.
Let be a uniform random variable on and let be a function such that is sub-Gaussian with norm . Let be a value in the range , and let be a random variable on . Then is sub-Gaussian as well with norm at most a constant times that of , i.e.,
Proof.
Define ,
(7)
where follows from the definition of the sub-Gaussian norm and follows from the fact that for all . Define . Note that and therefore . We can therefore obtain.
where for we have used Jensen’s inequality for the concave function , for we have used (7), and for we have used the definition of . Therefore the claim is proved with . ∎
Lemma A.3.
If is increasing, then for any , .
Proof.
∎
Lemma A.4.
If is increasing, then for any , .
Proof.
∎
Lemma A.5.
Proof.
We have
and
It therefore follows that
Note however that
which proves the claim. ∎
A.3 Proof of Corollary 5.2
Proof.
We need to show that for ,
First, note that since and , it follows that . Assuming that for some , set . This implies that
which goes to zero for large . It therefore remains to show that goes to zero which is equivalent to showing that goes to . Note however that
Now, observe that by our choice of ,
Therefore, , where , implying that
which goes to as large . ∎
A.4 Proof of Lemma 6.2
In this section, we provide the proof of Lemma 6.2, which is a generalization of Lemma 1 in (Li et al. 2019).
Proof.
Since is independent of , so is the permutation . Since were assumed to be i.i.d, this implies that , and by extension , are i.i.d as well and have the same distribution as . are zero-mean and sub-Gaussian with norm at most however as
where follows from (P2). Focusing on the median split, Hoeffding’s inequality therefore implies that for any , and . Setting and taking a union bound over both sub-trees implies that with probability at least ,
Since , the claim follows from (3).
As for the optimal split, the analysis needs to change as the split point is not necessarily in the middle and is also dependent on the data. For a fixed split point, however, the same analysis as above can be used for bounding with minor adjustments, as shown by (Li et al. 2019); while the bound on the concentration of is weaker since can be small, this is offset by the term in (3) and ultimately the same bound on can be obtained. Taking a union bound over all possible splitting points proves the result. ∎
A.5 Proof of Lemma 6.4
Lemma A.6.
Let the random variable be distributed as for . For and ,
Proof.
Let be i.i.d random variables and let be their sorting permutation, i.e,
It is well-known (e.g., see (Rundel 2012)) that for any , is distributed as . Therefore, has the same distribution as . This means that for any ,
Defining , it is clear that are i.i.d Bernoulli random variables with parameter . If , Hoeffding’s inequality implies that
which completes the proof. ∎
By symmetry, we also have for any , which implies Lemma 6.4 via a union bound.
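A quick Monte Carlo check of the order-statistic fact invoked above can be written as follows (parameter values are illustrative only):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, k, trials = 20, 5, 50_000
    # k-th smallest of n i.i.d. Uniform(0, 1) draws, across many independent trials
    kth_smallest = np.sort(rng.random((trials, n)), axis=1)[:, k - 1]

    # Its distribution should match Beta(k, n + 1 - k) (Rundel 2012): the
    # Kolmogorov-Smirnov statistic is close to zero.
    ks_stat, _ = stats.kstest(kth_smallest, stats.beta(k, n + 1 - k).cdf)
    print(f"KS statistic: {ks_stat:.4f}")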
A.6 Proof of Theorem 5.4
Proof.
Consider the distribution of for an inactive . Since is inactive, the sorting permutation is independent of . Therefore, the distribution of is the distribution of where is a random permutation on and is defined as in Lines 9, 11, 13 i.e,
for the median split and
for the optimal split. In the above equations, and refer to and respectively.
Similarly, for and , is a random permutation independent of . Therefore, is also distributed as for a random permutation on . Furthermore, all of these random permutations (and therefore all values and for inactive ) are independent of each other. This is because (a) all of the inactive features are independent of each other by assumption and (b) the permutations are independent by design, as they were sampled independently in Line 4. We now observe that happens if and only if the minimum across all of these numbers is among the numbers for . By symmetry, this probability is exactly .
As for bounding the probability that , we note that by the discussion above, the extra scores can be thought of as belonging to extra inactive features indexed by . The probability that can be bounded as in Theorem 5.3. Taking a union bound with proves the theorem’s statement. ∎
A.7 Experimental Configurations
All numerical simulations were conducted using Python and the scikit-learn package to implement both Lasso and the single-depth decision trees. The Lasso regression was fit using a regularization strength of and the liblinear solver. The splitting points of the single-depth decision trees were identified by variance reduction as described in the pseudocode of Algorithm 1. Data was generated randomly using the PyTorch package, and all simulations were performed using an NVIDIA RTX A4000 GPU with Python’s CUDA ecosystem.