
Optimal Sparse Recovery with Decision Stumps

Kiarash Banihashem, MohammadTaghi Hajiaghayi, Max Springer
Abstract

Decision trees are widely used for their low computational cost, good predictive performance, and ability to assess the importance of features. Though often used in practice for feature selection, the theoretical guarantees of these methods are not well understood. We here obtain a tight finite sample bound for the feature selection problem in linear regression using single-depth decision trees. We examine the statistical properties of these "decision stumps" for the recovery of the $s$ active features from $p$ total features, where $s\ll p$. Our analysis provides tight sample performance guarantees on high-dimensional sparse systems which align with the finite sample bound of $O(s\log p)$ as obtained by Lasso, improving upon previous bounds for both the median and optimal splitting criteria. Our results extend to the non-linear regime as well as arbitrary sub-Gaussian distributions, demonstrating that tree based methods attain strong feature selection properties under a wide variety of settings and further shedding light on the success of these methods in practice. As a byproduct of our analysis, we show that we can provably guarantee recovery even when the number of active features $s$ is unknown. We further validate our theoretical results and proof methodology using computational experiments.

1 Introduction

Decision trees are one of the most popular tools used in machine learning. Due to their simplicity and interpretability, trees are widely implemented by data scientists, both individually and in aggregation with ensemble methods such as random forests and gradient boosting (Friedman 2001).

In addition to their predictive accuracy, tree based methods are an important tool used for the variable selection problem: identifying a relevant small subset of a high-dimensional feature space of the input variables that can accurately predict the output. When the relationship between the variables is linear, it has long been known that Lasso achieves the optimal sample complexity rate for this problem (Wainwright 2009a). In practice, however, tree-based methods have been shown to be preferable as they scale linearly with the size of the data and can capture non-linear relationships between the variables (Xu et al. 2014).

Notably, various tree structured systems are implemented for this variable selection task across wide-spanning domains. For example, tree based importance measures have been used for prediction of financial distress (Qian et al. 2022), wireless signal recognition (Li and Wang 2018), credit scoring (Xia et al. 2017), and the discovery of genes in the field of bioinformatics (Breiman 2001; Bureau et al. 2005; Huynh-Thu et al. 2012; Lunetta et al. 2004) to name a small fraction.

Despite this empirical success however, the theoretical properties of these tree based methods for feature selection are not well understood. While several papers have considered variants of this problem (see Section 3 for an overview), even in the simple linear case, the sample complexity of the decision tree is not well-characterized.

In this paper, we attempt to bridge this gap and analyze the variable selection properties of single-level decision trees, commonly referred to as “decision stumps” (DStump ). Considering both linear and non-linear settings, we show that DStump achieves nearly-tight sample complexity rates for a variety of practical sample distributions. Compared to prior work, our analysis is simpler and applies to different variants of the decision tree, as well as more general function classes.

The remainder of the paper is organized as follows: in Section 2 we introduce our main results on the finite sample guarantees of DStump, in Section 3 we discuss important prior results in the field of sparse recovery and where they fall short, in Section 4 we present the recovery algorithm, and in Section 5 we provide theoretical guarantees for the procedure under progressively more general problem assumptions. The proofs of these results are provided in Section 6. We supplement the theoretical results with computational experiments in Section 7 and conclude the paper in Section 8.

2 Our Results

We assume that we are given a dataset $\mathcal{D}=\{(\mathbf{X}_{i,:},Y_{i})\}_{i=1}^{n}$ consisting of $n$ samples from the non-parametric regression model $Y_{i}=\sum_{k}f_{k}(\mathbf{X}_{i,k})+W_{i}$, where $i$ denotes the sample number, $\mathbf{X}_{i,:}\in\mathbb{R}^{p}$ is the input vector with corresponding output $Y_{i}\in\mathbb{R}$, $W_{i}\in\mathbb{R}$ are i.i.d noise values, and $f_{k}:\mathbb{R}\to\mathbb{R}$ are univariate functions that are $s$-sparse:

Definition 2.1.

A set of univariate functions $\{f_{k}\}_{k\in[p]}$ is $s$-sparse on feature set $[p]$ if there exists a set $S\subseteq[p]$ with size $s=|S|\ll p$ such that

f_{k}=0\iff k\notin S.

Given access to $(\mathbf{X}_{i,:},Y_{i})_{i=1}^{n}$, we consider the sparse recovery problem, where we attempt to reveal the set $S$ with only a minimal number of samples. Note that this is different from the prediction problem, where the goal is to learn the functions $f_{k}$. In accordance with prior work (Han et al. 2020; Kazemitabar et al. 2017; Klusowski 2020), we assume the $\mathbf{X}_{i,j}$ are i.i.d draws from the uniform distribution on $[0,1]$ with Gaussian noise $W_{i}\sim\mathcal{N}(0,1)$. In Section 5, we will discuss how this assumption can be relaxed using our non-parametric results to consider more general distributions.

For the recovery problem, we consider the DStump algorithm as first proposed by (Kazemitabar et al. 2017). The algorithm, shown in Algorithm 1, fits a single-level decision tree (stump) on each feature using either the "median split" or the "optimal split" strategy and ranks the features by the error of the corresponding trees. The median split provides a substantially simplified implementation as we do not need to optimize the stump construction, further providing an improved run time over the comparable optimal splitting procedure. Indeed, the median and optimal split have time complexity at most $\mathcal{O}(np)$ and $\mathcal{O}(np\log(n))$ respectively.

In spite of this simplification, we show that in the widely considered case of linear design, where the $f_{k}$ are linear, DStump can correctly recover $S$ with a sample complexity bound of $\mathcal{O}(s\log p)$, matching the minimax optimal lower bound for the problem as achieved by Lasso (Wainwright 2009a, b). This result is noteworthy and surprising since the DStump algorithm (and decision trees in general) is not designed with a linearity assumption, as is the case with Lasso. For this reason, trees are generally valued for their predictive power in non-linear models, yet our work proves their value in the linear setting as well. We further extend these results to non-linear models and general sub-Gaussian distributions, improving previously known results using a simpler analysis. In addition, our results do not require the sparsity level $s$ to be known in advance. We summarize our main technical results as follows:

  • We obtain a sample complexity bound of $\mathcal{O}(s\log p)$ for the DStump algorithm in the linear design case, matching the optimal rate of Lasso and improving prior bounds in the literature for both the median and optimal split. This is the first tight bound on the sample complexity of single-depth decision trees used for sparse recovery and significantly improves on the existing results.

  • We extend our results to the case of non-linear $f_{k}$, obtaining tighter results for a wider class of functions compared to the existing literature. We further use this extension to generalize our analysis to sub-Gaussian input distributions.

  • As a byproduct of our improved analysis, we show that our results hold for the case where the number of active features $s$ is not known. This is the first theoretical guarantee on decision stumps that does not require $s$ to be known.

  • We validate our theoretical results using numerical simulations that show the necessity of our analytic techniques.

3 Related Work

While our model framework and the sparsity problem as a whole have been studied extensively (Fan and Lv 2006; Lafferty and Wasserman 2008; Wainwright 2009a, b), none have replicated the well known optimal lower bound for the classification problem under the given set of assumptions. Our work provides improved finite sample guarantees on DStump for the regression problem that nearly match that of Lasso by means of weak learners.

Most closely related to our work is that of (Kazemitabar et al. 2017), which formulates the DStump algorithm and the theoretical approach for finite sample guarantees of these weak learners. Unlike our nearly tight result on the number of samples required for recovery, (Kazemitabar et al. 2017) provides a weaker $\mathcal{O}(s^{2}\log p)$ bound when using the median splitting criterion. We here demonstrate that the procedure can obtain near-optimal finite sample guarantees by highlighting a subtle nuance in the analysis of the stump splitting (potentially of independent interest to the reader): instead of using the variance of one sub-tree as an impurity measure, we use the variance of both sub-trees. As we will show both theoretically and experimentally, this consideration is vital for obtaining tight bounds. Our analysis is also more general than that of (Kazemitabar et al. 2017), with applications to both median and optimal splits, a wider class of functions $f_{k}$, and more general distributions.

In a recent work, (Klusowski and Tian 2021) provide an indirect analysis of the DStump formulation with the optimal splitting criterion, by studying its relation to the SIS algorithm of (Fan and Lv 2006). Designed for linear models specifically, SIS sorts the features based on their Pearson correlation with the output, and has optimal sample complexity for the linear setting. (Klusowski and Tian 2021) show that when using the optimal splitting criterion, DStump is equivalent to SIS up to logarithmic factors, which leads to a sample complexity of

n\gtrsim s\log(p)\cdot\left(\log(s)+\log(\log(p))\right).

This improves the results of (Kazemitabar et al. 2017), though the analysis is more involved. A similar technique is also used to study the non-linear case, but the conditions assumed on $f_{k}$ are hard to interpret and the bounds are weaker. Indeed, instantiating the non-parametric results for the linear case implies a sub-optimal sample complexity rate of $\mathcal{O}(s^{2}\log p)$. In addition, (Klusowski and Tian 2021) describe a heuristic algorithm for the case of unknown $|S|$, though they do not prove any guarantees on its performance.

In contrast, we provide a direct analysis of DStump . This allows us to obtain optimal bounds for both the median and optimal split in the linear case, as well as improved and generalized bounds in the non-linear case. Our novel approach further allows us to analyze the case of unknown sparsity level, as we provide the first formal proof for the heuristic algorithm suggested in (Klusowski and Tian 2021) and (Fan, Feng, and Song 2011). Compared to prior work, our analysis is considerably simpler and applies to more general settings.

Additionally, various studies have leveraged the simplicity of median splits in decision trees to produce sharp bounds on mean-squared error for the regression problem with random forests (Duroux, Roxane and Scornet, Erwan 2018; Klusowski 2021). In each of these studies, analysis under the median split assumption allows for improvements in both asymptotic and finite sample bounds on prediction error for these ensemble weak learners. In the present work, we extend this intuition to the feature selection problem for high-dimensional sparse systems, and further emphasize the utility of the median split even in the singular decision stump case.

4 Algorithm Description

4.1 Notation and Problem Setup

For mathematical convenience, we adopt matrix notation and use $\mathbf{X}=(\mathbf{X}_{ij})\in\mathbb{R}^{n\times p}$ and $Y=(Y_{i})\in\mathbb{R}^{n}$ to denote the input matrix and the output vector respectively. We use $\mathbf{X}_{i,:}$ and $X^{k}$ to refer to the $i$-th row and $k$-th column of the matrix $\mathbf{X}$. We will also extend the definition of the univariate functions $f_{k}$ to vectors by assuming that the function is applied to each coordinate separately: for $v\in\mathbb{R}^{d}$, we define $f_{k}(v)\in\mathbb{R}^{d}$ as $\left(f_{k}(v)\right)_{i}=f_{k}(v_{i})$.

We let $\mathbb{E}\left[\cdot\right]$ and $\text{Var}(\cdot)$ denote the expectation and variance of random variables, with $\widehat{\mathbb{E}}\left[\cdot\right]$ and $\widehat{\text{Var}}\left(\cdot\right)$ denoting their empirical counterparts, i.e., for a generic vector $v\in\mathbb{R}^{d}$,

\widehat{\mathbb{E}}\left[v\right]=\frac{\sum_{i=1}^{d}{v_{i}}}{d}\quad\text{and}\quad\widehat{\text{Var}}\left(v\right)=\frac{\sum_{i=1}^{d}(v_{i}-\widehat{\mathbb{E}}\left[v\right])^{2}}{d}.
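As a small illustration, these empirical quantities correspond to the following numpy sketch (note that the $1/d$ normalization matches numpy's default population variance rather than the unbiased $1/(d-1)$ version):

import numpy as np

# Empirical mean and variance exactly as defined above (1/d normalization).
def emp_mean(v):
    return np.sum(v) / v.size

def emp_var(v):
    return np.sum((v - emp_mean(v)) ** 2) / v.size

v = np.array([1.0, 2.0, 4.0])
assert np.isclose(emp_var(v), np.var(v))   # np.var uses ddof=0 by default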

We will also use $[d]$ to denote the set $\{1,\dots,d\}$. Finally, we let $\textnormal{Unif}(a,b)$ denote the uniform distribution over $[a,b]$ and use $C,c>0$ to denote generic universal constants.

Throughout the paper, we will use the concept of sub-Gaussian random variables for stating and proving our results. A random variable $Z\in\mathbb{R}$ is called sub-Gaussian if there exists a $t$ for which $\mathbb{E}\left[e^{(Z/t)^{2}}\right]\leq 2$, and its sub-Gaussian norm is defined as

\left\lVert Z\right\rVert_{\psi_{2}}=\inf\left\{t>0:\mathbb{E}\left[e^{(Z/t)^{2}}\right]\leq 2\right\}.

Sub-Gaussian random variables are well-studied and have many desirable properties (see (Vershynin 2018) for a comprehensive overview), some of which we outline below as they are leveraged throughout our analysis.

  1. (P1) (Hoeffding's Inequality) If $Z$ is a sub-Gaussian random variable, then for any $t>0$,

     \Pr\left(\left|{Z-\mathbb{E}\left[Z\right]}\right|\geq t\right)\leq 2e^{-c\left({t}/{\left\lVert Z\right\rVert_{\psi_{2}}}\right)^{2}}.

  2. (P2) If $Z_{1},\dots,Z_{n}$ are a sequence of independent sub-Gaussian random variables, then $\sum Z_{i}$ is also sub-Gaussian with norm satisfying

     \left\lVert\sum Z_{i}\right\rVert_{\psi_{2}}^{2}\leq\sum\left\lVert Z_{i}\right\rVert_{\psi_{2}}^{2}.

  3. (P3) If $Z$ is a sub-Gaussian random variable, then so is $Z-\mathbb{E}\left[Z\right]$, and $\left\lVert Z-\mathbb{E}\left[Z\right]\right\rVert_{\psi_{2}}\leq c\left\lVert Z\right\rVert_{\psi_{2}}$.

4.2 DStump Algorithm

We here present the recovery algorithm DStump. For each feature $k\in[p]$, the algorithm fits a single-level decision tree, or "stump", on the given feature and defines the impurity of the feature as the error of this decision tree. Intuitively, the active features are expected to be better predictors of $Y$ and therefore have lower impurity values. Thus, the algorithm sorts the features based on these values and outputs the $|S|$ features with smallest impurity. A formal pseudo-code of our approach is given in Algorithm 1.

Algorithm 1 Scoring using DStump
1: Input: $\mathbf{X}\in\mathbb{R}^{n\times p}$, $Y\in\mathbb{R}^{n}$, $s\in\mathbb{N}$
2: Output: Estimate of $S$
3: for $k\in\{1,\dots,p\}$ do
4:     $\tau^{k}=\text{argsort}(X^{k})$
5:     for $n_{L}\in\{1,\dots,n\}$ do
6:         $n_{R}=n-n_{L}$
7:         $Y^{k}_{L}=(Y_{\tau^{k}(1)},\dots,Y_{\tau^{k}(n_{L})})^{T}$
8:         $Y^{k}_{R}=(Y_{\tau^{k}(n_{L}+1)},\dots,Y_{\tau^{k}(n)})^{T}$
9:         $\textnormal{imp}_{k,n_{L}}=\frac{n_{L}}{n}\widehat{\text{Var}}(Y^{k}_{L})+\frac{n_{R}}{n}\widehat{\text{Var}}(Y^{k}_{R})$
10:     if median split then
11:         $\textnormal{imp}_{k}=\textnormal{imp}_{k,\lfloor n/2\rfloor}$
12:     else
13:         $\textnormal{imp}_{k}=\min_{\ell}\textnormal{imp}_{k,\ell}$
14: return $\tau=\text{argsort}(\textnormal{imp})$

Formally, for each feature $k$, the $k$-th column is sorted in increasing order such that $X^{k}_{\tau^{k}(1)}\leq X^{k}_{\tau^{k}(2)}\leq\dots\leq X^{k}_{\tau^{k}(n)}$, with ties broken randomly. The samples are then split into two groups: the left half, consisting of the samples with $X^{k}_{i}\leq X^{k}_{\tau^{k}(n_{L})}$, and the right half, consisting of those with $X^{k}_{i}>X^{k}_{\tau^{k}(n_{L})}$. Conceptually, this is the same as splitting a single-depth tree on the $k$-th column with an $n_{L}$ to $n_{R}$ ratio and collecting the samples in each group. The algorithm then calculates the variance of the output in each group, which represents the optimal prediction error for this group with a single value, i.e., $\widehat{\text{Var}}\left({Y^{k}_{L}}\right)=\min_{t}\frac{1}{n_{L}}\sum_{i}(Y^{k}_{L,i}-t)^{2}$. The weighted average of these two variances is taken as the impurity. Formally,

\textnormal{imp}_{k,n_{L}}=\frac{n_{L}}{n}\widehat{\text{Var}}(Y^{k}_{L})+\frac{n_{R}}{n}\widehat{\text{Var}}(Y^{k}_{R}). \quad (1)

For the median split algorithm, the value of $n_{L}$ is simply chosen as $\lfloor\frac{n}{2}\rfloor$ or $\lceil\frac{n}{2}\rceil$, whereas for the optimal split, the value is chosen in order to minimize the impurity of the split. Lastly, the features are sorted by their impurity values and the $|S|$ features with lowest impurity are predicted as $S$. In Section 5.3, we discuss the case of unknown $|S|$ and obtain an algorithm with nearly matching guarantees.
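To make the procedure concrete, a minimal Python/numpy sketch of Algorithm 1 is given below. It is only illustrative: the optimal split is implemented by naively scanning all split points rather than with the faster $\mathcal{O}(np\log(n))$ implementation mentioned in Section 2, and the function names are ours.

import numpy as np

def dstump_scores(X, Y, split="median"):
    """Impurity of a single-level stump fit on each feature (Algorithm 1, lines 3-13)."""
    n, p = X.shape
    imp = np.empty(p)
    for k in range(p):
        Y_sorted = Y[np.argsort(X[:, k])]            # sort outputs by feature k (tau^k)
        if split == "median":
            candidates = [n // 2]                    # single median split
        else:                                        # "optimal": scan all split points
            candidates = range(1, n)
        best = np.inf
        for n_L in candidates:
            Y_L, Y_R = Y_sorted[:n_L], Y_sorted[n_L:]
            score = (n_L / n) * Y_L.var() + ((n - n_L) / n) * Y_R.var()
            best = min(best, score)
        imp[k] = best
    return imp

def dstump_select(X, Y, s, split="median"):
    """Return the s features with smallest impurity (the estimate of S)."""
    return np.argsort(dstump_scores(X, Y, split))[:s]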

5 Main Results

5.1 Linear Design

For our first results, we consider the simplest setting of linear models with a uniform distribution on the inputs. Formally, we assume that there is a vector $\beta\in\mathbb{R}^{p}$ such that $f_{k}(x)=\beta_{k}\cdot x$ for all $k$. This is equivalent to considering the linear regression model $Y=\sum_{k}\beta_{k}X^{k}+W$. We further assume that each entry of the matrix $\mathbf{X}$ is an i.i.d draw from the uniform distribution on $[0,1]$. This basic setting is important from a theoretical perspective as it allows us to compare with existing results from the literature before extending to more general contexts. Our initial result, Theorem 5.1, provides an upper bound on the failure probability of DStump in this setting.

Theorem 5.1.

Assume that each entry of the input matrix $X$ is sampled i.i.d from a uniform distribution on $[0,1]$. Assume further that the output vector satisfies the linear regression model $Y=\sum_{k}\beta_{k}X^{k}+W$, where the $W_{i}$ are sampled i.i.d from $\mathcal{N}(0,\sigma_{w}^{2})$. Algorithm 1 correctly recovers the active feature set $S$ with probability at least

1-4se^{-c\cdot n}-4pe^{-c\cdot n\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}}

for the median split, and with probability at least

1-4se^{-c\cdot n}-4npe^{-c\cdot n\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}}

for the optimal split.

The above theorem provides a finite sample guarantee for the DStump algorithm and does not make any assumptions on the parameters or their asymptotic relationship. In order to compare with the existing literature (Kazemitabar et al. 2017; Klusowski and Tian 2021; Wainwright 2009a, b), we consider these bounds in the asymptotic regime $\min_{k\in S}\beta_{k}^{2}\in\Omega(\frac{1}{s})$ and $\left\lVert\beta\right\rVert_{2}\in\mathcal{O}(1)$.

Corollary 5.2.

In the same setting as Theorem 5.1, assume that $\left\lVert\beta\right\rVert_{2}^{2}\in\mathcal{O}(1)$ and $\min_{k\in S}\beta_{k}^{2}\in\Omega(\frac{1}{s})$. Then Algorithm 1 correctly recovers the active feature set $S$ with high probability as long as $n\gtrsim s\log p$.

The proof of the Corollary is presented in the Appendix. The above result shows that DStump is optimal for recovery when the data obeys a linear model. This is interesting considering that tree based methods are known for their strength in capturing non-linear relationships and, unlike Lasso, are not designed with a linearity assumption. In the next section, we further extend our finite sample bound analysis to non-linear models.

5.2 Additive Design

We here consider the case of non-linear fkf_{k} and obtain theoretical guarantees for the original DStump algorithm. Our main result is Theorem 5.3 stated below.

Theorem 5.3.

Assume that each entry of the input matrix $X$ is sampled i.i.d from a uniform distribution on $[0,1]$ and that $Y=\sum_{k}f_{k}(X^{k})+W$, where the $W_{i}$ are sampled i.i.d from $\mathcal{N}(0,\sigma_{w}^{2})$. Assume further that each $f_{k}$ is monotone and $f_{k}(\textnormal{Unif}(0,1))$ is sub-Gaussian with norm satisfying $\left\lVert f_{k}(\textnormal{Unif}(0,1))\right\rVert_{\psi_{2}}^{2}\leq\sigma_{k}^{2}$. For $k\in S$, define $g_{k}$ as

g_{k}:=\mathbb{E}\left[f_{k}(\textnormal{Unif}(\tfrac{1}{2},1))\right]-\mathbb{E}\left[f_{k}(\textnormal{Unif}(0,\tfrac{1}{2}))\right] \quad (2)

and define $\sigma^{2}=\sigma_{w}^{2}+\sum_{k}\sigma_{k}^{2}$. Algorithm 1 correctly recovers the set $S$ with probability at least

1-4se^{-cn}-4pe^{-cn\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the median split and

1-4se^{-cn}-4npe^{-cn\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the optimal split.

Note that, by instantiating Theorem 5.3 for linear models, we obtain the same bound as in Theorem 5.1, implying that the above bounds are tight in the linear setting.

The extension to all monotone functions in Theorem 5.3 has an important theoretical consequence: since the DStump algorithm is invariant under monotone transformations of the input, we can obtain the same results for any distribution of $\mathbf{X}_{i,:}$. As a simple example, consider $\mathbf{X}_{ij}\sim\mathcal{N}(0,1)$ and assume that we are interested in bounds for linear models. Define the matrix $\mathbf{Z}$ as $\mathbf{Z}_{ij}=F_{\mathcal{N}}(\mathbf{X}_{ij})$, where $F_{\mathcal{N}}(\cdot)$ denotes the CDF of the Gaussian distribution. Since the CDF is an increasing function, running the DStump algorithm with $(\mathbf{Z},Y)$ produces the same output as running it with $(\mathbf{X},Y)$. Furthermore, applying the CDF of a random variable to itself yields a uniform random variable. Therefore, the $\mathbf{Z}_{ij}$ are i.i.d draws from the $\textnormal{Unif}(0,1)$ distribution. Setting $f_{k}(t)=\beta_{k}\cdot F_{\mathcal{N}}^{-1}(t)$, the results of Theorem 5.3 for $(\mathbf{Z},Y)$ imply the same bound as Theorem 5.1. Notably, we obtain the same sample complexity bound of $\mathcal{O}(s\log p)$ for the Gaussian distribution as well. In the appendix, we discuss a generalization of this idea, which effectively allows us to remove the uniform distribution condition in Theorem 5.3.
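As a small numerical illustration of this invariance, the following sketch (using the dstump_scores function from the sketch in Section 4.2 and scipy for the Gaussian CDF; the constants are arbitrary) checks that a Gaussian design and its CDF-transformed uniform counterpart yield identical impurities:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p, s = 500, 20, 3
X = rng.standard_normal((n, p))                 # Gaussian design
beta = np.zeros(p); beta[:s] = 1.0              # first s features active (arbitrary choice)
Y = X @ beta + 0.1 * rng.standard_normal(n)

Z = norm.cdf(X)                                 # monotone map: Z_ij ~ Unif(0,1)
imp_X = dstump_scores(X, Y, split="median")
imp_Z = dstump_scores(Z, Y, split="median")
assert np.allclose(imp_X, imp_Z)                # same column orderings, same impurities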

5.3 Unknown sparsity level

A drawback of the previous results is that they assume $|S|$ is given when, in general, this is not the case. Even if $|S|$ is not known, however, Theorem 5.3 guarantees that the active features are ranked higher than non-active ones in $\tau$, i.e., $\tau(k)<\tau(k^{\prime})$ for all $k\in S,k^{\prime}\notin S$. In order to recover $S$, it suffices to find a threshold $\gamma$ such that $\max_{k\in S}\textnormal{imp}_{k}\leq\gamma\leq\min_{k\notin S}\textnormal{imp}_{k}$.

To solve this, we use the so-called "permutation algorithm", a well known heuristic in the statistics literature (Barber and Candès 2015; Chung and Romano 2013, 2016; Fan, Feng, and Song 2011) that was also discussed (without proof) by (Klusowski and Tian 2021). Formally, we apply a random permutation $\sigma$ to the rows of $X$, obtaining the matrix $\sigma(X)$ where $\sigma(X)_{ij}=X_{\sigma(i),j}$. We then rerun Algorithm 1 with $\sigma(X)$ and $Y$ as input. The random permutation means that $X$ and $Y$ are "decoupled" from each other and, effectively, all of the features are now inactive. We therefore expect $\min_{i,t}\textnormal{imp}_{i}(\sigma(X),y)$ to be close to $\min_{k\notin S}\textnormal{imp}_{k}(X,y)$. Since this estimate may be somewhat conservative, we repeat the sampling and take the minimum value across these repetitions. A formal pseudocode is provided in Algorithm 2. The StumpScore method is the same algorithm as Algorithm 1, with the distinction that it returns imp in Line 14.

Algorithm 2 Unknown $|S|$
1: Input: $\mathbf{X}\in\mathbb{R}^{n\times p}$, $Y\in\mathbb{R}^{n}$
2: Output: $\gamma\in[\max_{k\in S}\textnormal{imp}_{k},\min_{k\notin S}\textnormal{imp}_{k}]$
3: for $t\leftarrow 1,\dots,T-1$ do
4:     $\sigma^{(t)}\leftarrow$ random permutation on $[n]$
5:     $\textnormal{imp}^{(t)}\leftarrow\textsc{StumpScore}(\sigma^{(t)}(X),y)$
6: return $\min_{i,t}\textnormal{imp}_{i}^{(t)}$

Assuming we have used $T$ repetitions in the algorithm, the probability that $\min_{k\notin S}\textnormal{imp}_{k}(X,y)\leq\gamma$ is at most $\frac{1}{T}$. While we provide a formal proof in the appendix, the main intuition behind the result is that the $\textnormal{imp}_{i}^{(t)}$ and the $\textnormal{imp}_{k^{\prime}}$ for $k^{\prime}\notin S$ are all impurities corresponding to inactive features. Thus, the probability that the minimum across all of these impurities falls in $[p]\backslash S$ is at most $\frac{p-s}{Tp}\leq\frac{1}{T}$. Ensuring that $\max_{k\in S}\textnormal{imp}_{k}(X,y)\leq\gamma$ involves treating the extra impurity scores as $(T-1)p$ extra inactive features. This means that we can use the same results of Theorem 5.3 with $p$ set to $Tp$, since our sample complexity bound is logarithmic in $p$. The formal result follows, with proof in the appendix.
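A minimal sketch of Algorithm 2 in the same style as the earlier DStump sketch (the function names and the default number of repetitions are ours):

import numpy as np

def permutation_threshold(X, Y, T=20, split="median", seed=None):
    """Estimate gamma by rerunning DStump on row-permuted copies of X,
    for which every feature is effectively inactive."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    null_scores = [dstump_scores(X[rng.permutation(n)], Y, split) for _ in range(T - 1)]
    return np.min(null_scores)

def dstump_select_unknown_s(X, Y, T=20, split="median", seed=None):
    """Estimate the active set S without knowing |S|."""
    gamma = permutation_threshold(X, Y, T, split, seed)
    return np.flatnonzero(dstump_scores(X, Y, split) <= gamma)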

Theorem 5.4.

In the same setting as Theorem 5.3, let $\gamma$ be the output of Algorithm 2 and set $\widehat{S}:=\{k:\textnormal{imp}_{k}\leq\gamma\}$. The probability that $\widehat{S}=S$ is at least

1-T^{-1}-4se^{-cn}-4Tpe^{-cn\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the median split and at least

1-T^{-1}-4se^{-cn}-4nTpe^{-cn\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the optimal split.

We note that if we set $T=n^{c}$ for some constant $c>0$, we obtain the same $\mathcal{O}(s\log p)$ sample complexity as before.

6 Proofs

In this section, we prove Theorem 5.3, as it is the more general version of Theorem 5.1, and defer the remainder of the proofs to the appendix.

To prove that the algorithm succeeds, we need to show that $\textnormal{imp}_{k}<\textnormal{imp}_{k^{\prime}}$ for all $k\in S,k^{\prime}\notin S$. We proceed by first proving an upper bound on $\textnormal{imp}_{k}$ for all $k\in S$.

Lemma 6.1.

In the setting of Theorem 5.3, for any active feature $k\in S$, $\textnormal{imp}_{k}\leq\widehat{\text{Var}}\left(Y\right)-\frac{\min_{k}g_{k}^{2}}{720}$ with probability at least

1-4e^{-c\cdot n}-4e^{-c\cdot n\cdot\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}.

Subsequently, we need to prove an analogous lower bound on $\textnormal{imp}_{k^{\prime}}$ for all $k^{\prime}\notin S$.

Lemma 6.2.

In the setting of Theorem 5.3, for any inactive feature $k^{\prime}\notin S$, $\textnormal{imp}_{k^{\prime}}>\widehat{\text{Var}}\left(Y\right)-\frac{\min_{k}g_{k}^{2}}{720}$ with probability at least

1-4e^{-c\cdot n\cdot\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the median split and

1-4ne^{-c\cdot n\cdot\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the optimal split.

Taking the union bound over all $k,k^{\prime}$, Lemmas 6.1 and 6.2 prove the theorem, as they show that $\textnormal{imp}_{k}<\textnormal{imp}_{k^{\prime}}$ for all $k\in S,k^{\prime}\notin S$ with the desired probabilities.

We now focus on proving Lemma 6.1. We assume without loss of generality that $f_{k}$ is increasing (the case of decreasing $f_{k}$ follows by a symmetric argument or by mapping $X^{k}\to-X^{k}$). We further assume that $\mathbb{E}\left[f_{k}(X^{k}_{i})\right]=0$, as DStump is invariant under constant shifts of the output. Finally, we assume that $n>5$, as for $n\leq 5$ the theorem's statement can be made vacuous by choosing a large $c$.
We will assume $\{n_{L},n_{R}\}=\{\lfloor\frac{n}{2}\rfloor,\lceil\frac{n}{2}\rceil\}$ throughout the proof; as such, our results will hold for both the optimal and median splitting criteria. As noted before, a key point for obtaining a tight bound is considering both sub-trees in the analysis instead of considering them individually. Formally, while the impurity is usually defined via the variance as in (1), it has the following equivalent definition:

\textnormal{imp}_{k}=\widehat{\text{Var}}(Y)-\frac{n_{L}\cdot n_{R}}{n^{2}}\cdot\left(\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]-\widehat{\mathbb{E}}\left[Y^{k}_{R}\right]\right)^{2}. \quad (3)

The above identity is commonly used for the analysis of decision trees and their properties (Breiman et al. 1983; Li et al. 2019; Klusowski 2020; Klusowski and Tian 2021). From an analytic perspective, the key difference between (3) and (1) is that the empirical averaging is calculated before taking the square, allowing us to more simply analyze the first moment rather than the second.
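The identity is also easy to check numerically; a quick sanity check for a single split point (using the $1/d$ variance convention of Section 4.1):

import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal(101)
n = Y.size
n_L = n // 2
n_R = n - n_L
Y_L, Y_R = Y[:n_L], Y[n_L:]

imp_def = (n_L / n) * Y_L.var() + (n_R / n) * Y_R.var()                  # definition (1)
imp_alt = Y.var() - (n_L * n_R / n**2) * (Y_L.mean() - Y_R.mean()) ** 2  # identity (3)
assert np.isclose(imp_def, imp_alt)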

Intuitively, we want to use concentration inequalities to show that $\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]$ and $\widehat{\mathbb{E}}\left[Y^{k}_{R}\right]$ concentrate around their expectations and to lower bound $|\mathbb{E}\left[Y^{k}_{L}\right]-\mathbb{E}\left[Y^{k}_{R}\right]|$. This is challenging, however, as concentration results typically require an i.i.d assumption but, as we will see, the $Y^{k}_{L,i}$ are not i.i.d. More formally, for each $k$, define the random variable $X^{k}_{L}\in\mathbb{R}^{n_{L}}$ as $(X^{k}_{\tau^{k}(1)},\dots,X^{k}_{\tau^{k}(n_{L})})^{T}$ and thus $Y^{k}_{L,i}=f_{k}(X^{k}_{L,i})+\sum_{j\neq k}f_{j}(X^{j}_{\tau^{k}(i)})+W_{\tau^{k}(i)}$. While the random vectors $X^{j\neq k}$ and $W$ have i.i.d entries, $X^{k}_{L}$ was obtained by sorting the coordinates of $X^{k}$. Thus, its coordinates are non-identically distributed and dependent. To solve the first problem, observe that the empirical mean is invariant under permutation and we can thus randomly shuffle the elements of $Y^{k}_{L}$ in order to obtain a vector with identically distributed coordinates. Furthermore, by De Finetti's Theorem, any random vector with coordinates that are identically distributed (but not necessarily independent) can be viewed as a mixture of i.i.d vectors, effectively solving the second problem. Formally, the following result holds.

Lemma 6.3 (Lemma 1 in (Kazemitabar et al. 2017)).

Let $\tilde{\tau}:[n_{L}]\to[n_{L}]$ be a random permutation on $[n_{L}]$ independent of $(\mathbf{X},W)$, and define $\widetilde{X}^{k}_{L}\in\mathbb{R}^{n_{L}}$ as $\widetilde{X}^{k}_{L,i}:=X^{k}_{L,\tilde{\tau}(i)}$. The random vector $\widetilde{X}^{k}_{L}$ is distributed as a mixture of i.i.d uniform vectors of size $n_{L}$ on $[0,\Theta]$, with $\Theta$ sampled from $\text{Beta}(n_{L}+1,n_{R})$.

Defining $\widetilde{Y}^{k}_{L}\in\mathbb{R}^{n_{L}}$ as $\widetilde{Y}^{k}_{L,i}:=Y^{k}_{L,\tilde{\tau}(i)}$, it is clear that $\widehat{\mathbb{E}}\left[\widetilde{Y}^{k}_{L}\right]=\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]$ and we can therefore analyze $\widetilde{Y}^{k}_{L}$ instead of $Y^{k}_{L}$, as

\widetilde{Y}^{k}_{L,i}:=f_{k}(\widetilde{X}^{k}_{L,i})+\sum_{j\neq k}f_{j}(X^{j}_{\tau^{k}\circ\tilde{\tau}(i)})+W_{\tau^{k}\circ\tilde{\tau}(i)},

which, given Lemma 6.3, can be seen as a mixture of i.i.d random variables.

Lemma 6.3 shows that there are two sources of randomness in the distribution of $\widetilde{Y}^{k}_{L,i}$: the mixing variable $\Theta$ and the sub-Gaussian randomness of $\widetilde{X}^{k}_{L}|\Theta$ and $(X^{j\neq k},W)$. For the second source, it is possible to use standard concentration inequalities to show that, conditioned on $\Theta$, $\widehat{\mathbb{E}}\left[\widetilde{Y}^{k}_{L}\right]$ concentrates around $\mathbb{E}\left[\widetilde{Y}^{k}_{L,1}|\Theta\right]$. We will formally do this in Lemma 6.5. Before we do this, however, we focus on the first source and how $\Theta$ affects the distribution of $\widetilde{Y}^{k}_{L,i}$.

Since $\Theta$ is sampled from $\text{Beta}(n_{L}+1,n_{R})$, we can use standard techniques to show that it concentrates around $\frac{1}{2}$. More formally, we can use the following lemma, the proof of which is in the appendix.

Lemma 6.4.

If $n\geq 5$, we have $\Theta\in[\frac{1}{4},\frac{3}{4}]$ with probability at least $1-2e^{-cn}$.
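A quick Monte Carlo check of this concentration (the sample sizes below are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
for n in (10, 50, 200):
    n_L = n // 2
    n_R = n - n_L
    theta = rng.beta(n_L + 1, n_R, size=100_000)      # Theta ~ Beta(n_L + 1, n_R)
    outside = np.mean((theta < 0.25) | (theta > 0.75))
    print(n, outside)                                  # fraction outside [1/4, 3/4] decays with n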

Given the above result, we can analyze $\widetilde{Y}^{k}_{L}$ assuming $\Theta\in[1/4,3/4]$. In this case, we can use concentration inequalities to show that, with high probability, $\widehat{\mathbb{E}}\left[\widetilde{Y}^{k}_{L}\right]$ concentrates around $\mathbb{E}\left[f_{k}(\textnormal{Unif}(0,\Theta))\right]$. Since $f_{k}$ was assumed to be increasing, this can be further bounded by $\mathbb{E}\left[f_{k}(\textnormal{Unif}(0,\frac{3}{4}))\right]$. Formally, we obtain the following result.

Lemma 6.5.

Let $k\in S$ be an active feature. For any $\theta\in[\frac{1}{4},\frac{3}{4}]$,

\Pr\left[\widehat{\mathbb{E}}\left[\widetilde{Y}^{k}_{L}\right]-\mathbb{E}\left[\widetilde{Y}^{k}_{L,1}\,\Big|\,\Theta=\theta\right]\geq t\,\Big|\,\Theta=\theta\right]\leq 2e^{-cn\frac{t^{2}}{\sigma^{2}}}.

Furthermore, letting $\overline{f}^{k}_{a,b}$ denote $\mathbb{E}\left[f_{k}(\textnormal{Unif}(a,b))\right]$,

\Pr\left[\widehat{\mathbb{E}}\left[{Y}^{k}_{L}\right]\geq\overline{f}^{k}_{0,\frac{3}{4}}+g_{k}/8\right]\leq 2e^{-c\cdot n}+2e^{-c\cdot n\cdot\frac{g_{k}^{2}}{\sigma^{2}}}. \quad (4)
Proof.

For ease of notation, we will fix $\theta\in[\frac{1}{4},\frac{3}{4}]$ and let the random variables $\widehat{X}^{k}_{L}$ and $\widehat{Y}^{k}_{L}$ denote $\widetilde{X}^{k}_{L}|\Theta=\theta$ and $\widetilde{Y}^{k}_{L}|\Theta=\theta$ respectively. Recall that for all $j$, $f_{j}(X^{j}_{i})$ is sub-Gaussian with parameter $\sigma_{j}$ by assumption. It is straightforward to show (see the Appendix) that this implies $f_{k}(\widehat{X}^{k}_{L,i})-\mathbb{E}\left[f_{k}(\widehat{X}^{k}_{L,i})\right]$ is also sub-Gaussian with squared norm at most $C\cdot\sigma_{k}^{2}$. Thus,

\left\lVert\widehat{Y}^{k}_{L,i}\right\rVert_{\psi_{2}}^{2}=\left\lVert f_{k}(\widehat{X}^{k}_{L,i})+\sum_{j\neq k}f_{j}(X^{j}_{\tau^{k}\circ\tilde{\tau}(i)})+W_{\tau^{k}\circ\tilde{\tau}(i)}\right\rVert_{\psi_{2}}^{2}\overset{(i)}{=}\left\lVert f_{k}(\widehat{X}^{k}_{L,i})+\sum_{j\neq k}f_{j}(X^{j}_{i})+W_{i}\right\rVert_{\psi_{2}}^{2}\overset{(ii)}{\leq}\left\lVert f_{k}(\widehat{X}^{k}_{L,i})\right\rVert_{\psi_{2}}^{2}+\sum_{j\neq k}\left\lVert f_{j}(X^{j}_{i})\right\rVert_{\psi_{2}}^{2}+\left\lVert W_{i}\right\rVert_{\psi_{2}}^{2}\leq C\cdot\sigma_{k}^{2}+\sum_{j\neq k}\sigma_{j}^{2}+\sigma_{w}^{2}\leq C\cdot\sigma^{2}.

In the above analysis, $(i)$ follows from the independence of $(\widehat{X}^{k}_{L},X^{j\neq k},W)$ together with the i.i.d assumption on $(X^{j\neq k}_{i},W_{i})$. As for $(ii)$, it follows from (P2) together with the independence of $(\widehat{X}^{k}_{L},X^{j\neq k},W)$. Property (P3) further implies that $\left\lVert\widehat{Y}^{k}_{L,i}-\mathbb{E}\left[\widehat{Y}^{k}_{L,i}\right]\right\rVert_{\psi_{2}}^{2}$ is upper bounded by $C\cdot\sigma^{2}$, proving the first equation in the Lemma.

Now, using Hoeffding’s inequality, we obtain

\Pr\left[\widehat{\mathbb{E}}\left[\widehat{Y}^{k}_{L}\right]-\mathbb{E}\left[\widehat{Y}^{k}_{L,i}\right]\geq g_{k}/8\right]\leq 2e^{-c\cdot n\cdot\frac{g_{k}^{2}}{\sigma^{2}}}.

Using Lemma 6.4 together with the fact that $\Pr(A)\leq\Pr(B)+\Pr(A|B^{C})$ for any two events $A,B$, we obtain

\Pr\left[\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]-\mathbb{E}\left[\widehat{Y}^{k}_{L,i}\right]\geq g_{k}/8\right]\leq 2e^{-c\cdot n}+2e^{-c\cdot n\cdot\frac{g_{k}^{2}}{\sigma^{2}}}.

Note, however, that $\mathbb{E}\left[\widehat{Y}^{k}_{L,i}\right]=\overline{f}^{k}_{0,\theta}$ which, as we show in the appendix, can further be upper bounded by $\overline{f}^{k}_{0,\frac{3}{4}}$, concluding the proof. ∎

Using the symmetry of the decision tree algorithm, we can further obtain that

\Pr\left[\widehat{\mathbb{E}}\left[{Y}^{k}_{R}\right]\leq\overline{f}^{k}_{\frac{1}{4},1}-g_{k}/8\right]\leq 2e^{-c\cdot n}+2e^{-c\cdot n\cdot\frac{g_{k}^{2}}{\sigma^{2}}} \quad (5)

from (4) with the change of variables $X^{k}\to-X^{k}$ and $f_{k}\to-f_{k}$. Taking a union bound over (4) and (5), it follows that with probability at least $1-4e^{-c\cdot n}-4e^{-c\cdot n\cdot\frac{g_{k}^{2}}{\sigma^{2}}}$,

\widehat{\mathbb{E}}\left[Y^{k}_{R}\right]-\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]\geq\overline{f}^{k}_{\frac{1}{4},1}-\overline{f}^{k}_{0,\frac{3}{4}}-g_{k}/4.

As we show in the appendix, however, a simple application of conditional expectations implies $\overline{f}^{k}_{\frac{1}{4},1}-\overline{f}^{k}_{0,\frac{3}{4}}\geq g_{k}/3$. Therefore, with probability at least $1-4e^{-c\cdot n}-4e^{-c\cdot n\cdot\frac{g_{k}^{2}}{\sigma^{2}}}$, we have $\widehat{\mathbb{E}}\left[Y^{k}_{R}\right]-\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]\geq\frac{g_{k}}{12}$. Assuming $n\geq 5$, we can further conclude that $\frac{n_{L}\cdot n_{R}}{n^{2}}\geq\frac{1}{5}$, which together with (3) proves the lemma. ∎

7 Experimental Results

Figure 1: Optimal sample count to recover 95% of the active features, where the design matrix is sampled i.i.d from $U(-1,1)$ or $\mathcal{N}(0,1)$ with additive Gaussian noise $\mathcal{N}(0,0.1)$, comparing three methods: DStump with optimal split, DStump with median split, and Lasso.

In this section, we provide further justification of our theoretical results in the form of simulations on the finite sample count for active feature recovery under different regimes, as well as the predictive power of a single sub-tree as compared to the full tree. We additionally contrast DStump with the widely studied optimal Lasso .

We first validate the result of Theorem 5.1 and consider the linear design with $p=200$ and design matrix entries sampled i.i.d. from $U(0,1)$ with additive Gaussian noise $\mathcal{N}(0,0.1)$. Concretely, we examine the optimal number of samples required to recover approximately $95\%$ of the $s$ active features. This is achieved by conducting a binary search on the number of samples to find the minimal value that recovers the desired fraction of the active feature set, averaged across 25 independent replications. In the leftmost plot of Figure 1, we plot the sample count as a function of the sparsity level $s\in[5,100]$ for DStump with the median split, DStump with the optimal split, and Lasso for bench-marking (with penalty parameter selected through standard cross-validation). By fixing $p$, we evaluate the dependence of $n$ on the sparsity level $s$ alone. The results validate the theoretical $\mathcal{O}(s\log p)$ bound that nearly matches the optimal Lasso. Also of note, the number of samples required by the median split is less than that of the optimal split. Thus, in the linear setting, DStump with median splitting is both simpler and computationally cheaper. This result is repeated with Gaussian data samples in the rightmost plot of Figure 1; notably, in this setting the optimal-split decision stumps perform better than the median split, demonstrating their varied utility under different problem contexts.
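For reference, a condensed sketch of this experiment built on the dstump_select function from the sketch in Section 4.2 (the Lasso baseline is omitted and the constants below are simplified stand-ins for our implementation):

import numpy as np

def recovery_rate(n, p, s, split, reps=25, seed=None):
    """Average fraction of the s active features recovered from n samples."""
    rng = np.random.default_rng(seed)
    hits = []
    for _ in range(reps):
        X = rng.uniform(0.0, 1.0, size=(n, p))
        beta = np.zeros(p); beta[:s] = 1.0
        Y = X @ beta + rng.normal(0.0, 0.1, size=n)
        selected = dstump_select(X, Y, s, split)
        hits.append(len(set(selected) & set(range(s))) / s)
    return np.mean(hits)

def minimal_sample_count(p, s, split, target=0.95, lo=10, hi=20_000):
    """Binary search for the smallest n reaching the target recovery rate."""
    while lo < hi:
        mid = (lo + hi) // 2
        if recovery_rate(mid, p, s, split) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo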

We additionally reiterate that the prior work of (Kazemitabar et al. 2017) attempted to simplify the analysis of the sparse recovery problem with DStump by examining only the left sub-tree, which produced the non-optimal $\mathcal{O}(s^{2}\log p)$ finite sample bound. To analyze the effect of this choice, the middle plot of Figure 1 presents the optimal sample recovery count when using only the left sub-tree, subject to the additive model of Theorem 5.1. In accordance with our expectation and the previous literature's analysis, we see a clear quadratic relationship between $n$ and $s$ when fixing the feature count $p$.

Overall, these simulations further validate the practicality and predictive power of decision stumps. Benchmarked against the optimal Lasso, we see a slight decrease in performance but a computational reduction and analytic simplification.

8 Conclusion

In this paper, we presented a simple and consistent feature selection algorithm for the regression setting with single-depth decision trees, and derived finite-sample performance guarantees in a high-dimensional sparse system. Our theoretical results demonstrate that this very simple class of weak learners is nearly optimal compared to the gold standard Lasso. We have provided strong theoretical evidence for the success of binary decision tree based methods in practice and a framework on which to extend the analysis of these structures to arbitrary height, a potential direction for future work.

9 Acknowledgements

The work is partially supported by DARPA QuICC, NSF AF:Small #2218678, and NSF AF:Small #2114269. Max Springer was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1840340. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

  • Barber and Candès (2015) Barber, R. F.; and Candès, E. J. 2015. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5): 2055–2085.
  • Breiman (2001) Breiman, L. 2001. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3): 199 – 231.
  • Breiman et al. (1983) Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1983. Classification and Regression Trees.
  • Bureau et al. (2005) Bureau, A.; Dupuis, J.; Falls, K.; Lunetta, K.; Hayward, B.; Keith, T.; and Eerdewegh, P. 2005. Identifying SNPs predictive of phenotype using random forests. Genetic epidemiology, 28: 171–82.
  • Chung and Romano (2013) Chung, E.; and Romano, J. P. 2013. Exact and asymptotically robust permutation tests. The Annals of Statistics, 41(2): 484–507.
  • Chung and Romano (2016) Chung, E.; and Romano, J. P. 2016. Multivariate and multiple permutation tests. Journal of econometrics, 193(1): 76–91.
  • Duroux, Roxane and Scornet, Erwan (2018) Duroux, Roxane; and Scornet, Erwan. 2018. Impact of subsampling and tree depth on random forests. ESAIM: PS, 22: 96–128.
  • Fan, Feng, and Song (2011) Fan, J.; Feng, Y.; and Song, R. 2011. Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494): 544–557.
  • Fan and Lv (2006) Fan, J.; and Lv, J. 2006. Sure independence screening for ultrahigh dimensional feature space. Journal of The Royal Statistical Society Series B-statistical Methodology, 70: 849–911.
  • Friedman (2001) Friedman, J. H. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29: 1189–1232.
  • Han et al. (2020) Han, C.; Rao, N. S.; Sorokina, D.; and Subbian, K. 2020. Scalable Feature Selection for (Multitask) Gradient Boosted Trees. ArXiv, abs/2109.01965.
  • Huynh-Thu et al. (2012) Huynh-Thu, V. A.; Irrthum, A.; Wehenkel, L.; and Geurts, P. 2012. Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. PLOS ONE, 5(9): 1–10.
  • Kazemitabar et al. (2017) Kazemitabar, S. J.; Amini, A. A.; Bloniarz, A.; and Talwalkar, A. S. 2017. Variable Importance Using Decision Trees. In NIPS.
  • Klusowski (2021) Klusowski, J. 2021. Sharp Analysis of a Simple Model for Random Forests. In Banerjee, A.; and Fukumizu, K., eds., Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, 757–765. PMLR.
  • Klusowski (2020) Klusowski, J. M. 2020. Sparse learning with CART. ArXiv, abs/2006.04266.
  • Klusowski and Tian (2021) Klusowski, J. M.; and Tian, P. M. 2021. Nonparametric Variable Screening with Optimal Decision Stumps. In AISTATS.
  • Lafferty and Wasserman (2008) Lafferty, J. D.; and Wasserman, L. A. 2008. Rodeo: Sparse, greedy nonparametric regression. Annals of Statistics, 36: 28–63.
  • Li and Wang (2018) Li, L.; and Wang, J. 2018. Research on feature importance evaluation of wireless signal recognition based on decision tree algorithm in cognitive computing. Cognitive Systems Research, 52: 882–890.
  • Li et al. (2019) Li, X.; Wang, Y.; Basu, S.; Kumbier, K.; and Yu, B. 2019. A Debiased mDI Feature Importance Measure for Random Forests. ArXiv, abs/1906.10845.
  • Lunetta et al. (2004) Lunetta, K. L.; Hayward, L. B.; Segal, J.; and Eerdewegh, P. V. 2004. Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics, 5(1): 32.
  • Qian et al. (2022) Qian, H.; Wang, B.; Yuan, M.; Gao, S.; and Song, Y. 2022. Financial distress prediction using a corrected feature selection measure and gradient boosted decision tree. Expert Systems with Applications, 190: 116202.
  • Rundel (2012) Rundel, C. 2012. Lecture 15: Order statistics. https://www2.stat.duke.edu/courses/Spring12/sta104.1/Lectures/Lec15.pdf.
  • Vershynin (2018) Vershynin, R. 2018. High-Dimensional Probability: An Introduction with Applications in Data Science.
  • Wainwright (2009a) Wainwright, M. J. 2009a. Information-Theoretic Limits on Sparsity Recovery in the High-Dimensional and Noisy Setting. IEEE Transactions on Information Theory, 55: 5728–5741.
  • Wainwright (2009b) Wainwright, M. J. 2009b. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory.
  • Xia et al. (2017) Xia, Y.; Liu, C.; Li, Y.; and Liu, N. 2017. A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Expert Systems with Applications, 78: 225–241.
  • Xu et al. (2014) Xu, Z.; Huang, G.; Weinberger, K. Q.; and Zheng, A. X. 2014. Gradient boosted feature selection. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 522–531.

Appendix A Appendix

A.1 Extension to non-uniform distributions

In this section, we extend our main result, i.e., Theorem 5.3, to non-uniform distributions.

Theorem A.1.

Assume that the entries of the input matrix $X$ are sampled independently, that the entries in each column are i.i.d., and that $Y=\sum_{k}f_{k}(X^{k})+W$ where the $W_{i}$ are sampled i.i.d from $\mathcal{N}(0,\sigma_{w}^{2})$. Assume further that each $f_{k}$ is monotone and $f_{k}(\mathbf{X}_{i,k})$ is sub-Gaussian with norm satisfying $\left\lVert f_{k}(\mathbf{X}_{i,k})\right\rVert_{\psi_{2}}^{2}\leq\sigma_{k}^{2}$.
Let $F_{k}$ denote the CDF of the distribution of $X^{k}_{i}$. Define $F_{k}^{-1,+}(z):=\inf\{t:F_{k}(t)>z\}$ and $h_{k}:=f_{k}\circ F_{k}^{-1,+}$. For $k\in S$, define $g_{k}$ as

g_{k}:=\mathbb{E}\left[h_{k}(\textnormal{Unif}(\tfrac{1}{2},1))\right]-\mathbb{E}\left[h_{k}(\textnormal{Unif}(0,\tfrac{1}{2}))\right], \quad (6)

and define $\sigma^{2}=\sigma_{w}^{2}+\sum_{k}\sigma_{k}^{2}$. Algorithm 1 correctly recovers the set $S$ with probability at least

1-4se^{-cn}-4pe^{-cn\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the median split and

1-4se^{-cn}-4npe^{-cn\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}

for the optimal split.

Proof.

We first claim that if $Z\sim\textnormal{Unif}(0,1)$, then $F_{k}^{-1,+}(Z)$ has the same distribution as $X^{k}_{i}$. Formally, for all $z\in[0,1]$,

F_{k}^{-1,+}(z)\leq t\iff\inf\{\tilde{t}:F_{k}(\tilde{t})>z\}\leq t\overset{(i)}{\iff}\forall\delta>0:\exists\tilde{t}\leq t+\delta:F_{k}(\tilde{t})>z\overset{(ii)}{\iff}\lim_{\epsilon\to 0,\epsilon>0}F_{k}(t+\epsilon)>z\overset{(iii)}{\iff}F_{k}(t)>z,

where $(i)$ follows from the definition of $\inf$, $(ii)$ follows from the fact that $F_{k}$ is increasing, and $(iii)$ follows from the fact that $F_{k}$ is right continuous. Therefore,

\Pr\left(F_{k}^{-1,+}(Z)\leq t\right)=\Pr\left(Z<F_{k}(t)\right)\overset{(i)}{=}F_{k}(t),

where for $(i)$ we have used the fact that $Z$ is sampled uniformly on $[0,1]$. Since $F_{k}(t)=\Pr(X^{k}_{i}\leq t)$, the claim is proved.

Given this result, we can assume that the $X^{k}_{i}$ were set as $F_{k}^{-1,+}(Z^{k}_{i})$ for latent variables $Z^{k}_{i}\sim\textnormal{Unif}(0,1)$. While this means $Z^{k}_{i}$ is no longer a function of $X^{k}_{i}$, $\text{argsort}(Z^{k})$ still has the same distribution as $\text{argsort}(X^{k})$. To see why, consider a fixed value of $X^{k}$ and sample $Z^{k}$ from the conditional distribution. If $X^{k}_{i}<X^{k}_{j}$ for some $i,j$, then $Z^{k}_{i}<Z^{k}_{j}$ as well since $F_{k}^{-1,+}$ is non-decreasing. As for the case of $X^{k}_{i}=X^{k}_{j}$, the events $Z^{k}_{i}<Z^{k}_{j}$ and $Z^{k}_{i}>Z^{k}_{j}$ are equi-probable, which is consistent with the assumption that argsort breaks ties randomly.
We now note that

f_{k}(X^{k}_{i})=f_{k}(F_{k}^{-1,+}(Z^{k}_{i}))=h_{k}(Z^{k}_{i}).

Since $\text{argsort}(Z^{k})$ and $\text{argsort}(X^{k})$ have the same distribution, Algorithm 1 has the same output on $(\mathbf{X},f_{k})$ as on $(\mathbf{Z},h_{k})$. Invoking Theorem 5.3 therefore completes the proof. ∎
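As a small numerical illustration of the generalized inverse $F_{k}^{-1,+}$ used above, the following sketch checks the distributional claim for a discrete example, where right continuity of the CDF matters (the distribution itself is arbitrary):

import numpy as np

rng = np.random.default_rng(3)
support = np.array([0.0, 1.0, 2.0])
probs = np.array([0.2, 0.5, 0.3])
cdf = np.cumsum(probs)

def F_inv_plus(z):
    """Generalized inverse F^{-1,+}(z) = inf{t : F(t) > z} for the discrete CDF above."""
    return support[np.searchsorted(cdf, z, side="right")]

Z = rng.uniform(size=200_000)                     # Z ~ Unif(0,1); Z = 1 has probability zero
samples = F_inv_plus(Z)
print([np.mean(samples == v) for v in support])   # approximately [0.2, 0.5, 0.3]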

A.2 Auxiliary lemmas

Lemma A.2.

Let $U\sim\textnormal{Unif}(0,1)$ be a uniform random variable on $[0,1]$ and let $f$ be a function such that $f(U)$ is sub-Gaussian with norm $\left\lVert f(U)\right\rVert_{\psi_{2}}$. Let $\delta$ be a value in the range $[\frac{1}{4},\frac{3}{4}]$, and let $\tilde{U}\sim\textnormal{Unif}(0,\delta)$ be a uniform random variable on $[0,\delta]$. Then $f(\tilde{U})$ is sub-Gaussian as well, with norm at most a constant times that of $f(U)$, i.e.,

\left\lVert f(\tilde{U})\right\rVert_{\psi_{2}}\leq C\cdot\left\lVert f(U)\right\rVert_{\psi_{2}}.
Proof.

Define $K:=\left\lVert f(U)\right\rVert_{\psi_{2}}$. Then

2\overset{(i)}{\geq}\mathbb{E}\left[e^{\frac{f(U)^{2}}{K^{2}}}\right]=\mathbb{E}\left[e^{\frac{f(U)^{2}}{K^{2}}}\,\Big|\,U\leq\delta\right]\cdot\Pr\left[U\leq\delta\right]+\mathbb{E}\left[e^{\frac{f(U)^{2}}{K^{2}}}\,\Big|\,U>\delta\right]\cdot\Pr\left[U>\delta\right]\overset{(ii)}{\geq}\mathbb{E}\left[e^{\frac{f(U)^{2}}{K^{2}}}\,\Big|\,U\leq\delta\right]\cdot\Pr\left[U\leq\delta\right]=\mathbb{E}\left[e^{\frac{f(\tilde{U})^{2}}{K^{2}}}\right]\cdot\delta, \quad (7)

where $(i)$ follows from the definition of the sub-Gaussian norm and $(ii)$ follows from the fact that $e^{x}\geq 0$ for all $x$. Define $\widetilde{K}:=K\cdot\frac{\ln\left(2/\delta\right)}{\ln(2)}$. Note that $\delta<1$ and therefore $\frac{K}{\widetilde{K}}<1$. We therefore obtain

\mathbb{E}\left[e^{\frac{f(\tilde{U})^{2}}{\widetilde{K}^{2}}}\right]=\mathbb{E}\left[e^{\frac{f(\tilde{U})^{2}}{K^{2}}\cdot\frac{K^{2}}{\widetilde{K}^{2}}}\right]=\mathbb{E}\left[\left(e^{\frac{f(\tilde{U})^{2}}{K^{2}}}\right)^{\frac{K^{2}}{\widetilde{K}^{2}}}\right]\overset{(i)}{\leq}\mathbb{E}\left[e^{\frac{f(\tilde{U})^{2}}{K^{2}}}\right]^{\frac{K^{2}}{\widetilde{K}^{2}}}\overset{(ii)}{\leq}\left(2/\delta\right)^{\frac{K^{2}}{\widetilde{K}^{2}}}\overset{(iii)}{\leq}e^{\ln(2/\delta)\cdot\frac{\ln(2)}{\ln(2/\delta)}}=2,

where for $(i)$ we have used Jensen's inequality for the concave function $t\to t^{\frac{K^{2}}{\widetilde{K}^{2}}}$, for $(ii)$ we have used (7), and for $(iii)$ we have used the definition of $\widetilde{K}$ together with the fact that $2/\delta>1$. Therefore the claim is proved with $C=\frac{\ln\left(2/\delta\right)}{\ln(2)}$. ∎

Lemma A.3.

If $f_{k}$ is increasing, then for any $0\leq a\leq c\leq b\leq 1$, $\overline{f}^{k}_{a,c}\leq\overline{f}^{k}_{c,b}$.

Proof.

\overline{f}^{k}_{a,c}=\mathbb{E}_{Z\sim\textnormal{Unif}(a,c)}\left[f_{k}(Z)\right]\leq f_{k}(c)\leq\mathbb{E}_{Z\sim\textnormal{Unif}(c,b)}\left[f_{k}(Z)\right]=\overline{f}^{k}_{c,b}.

Lemma A.4.

If $f_{k}$ is increasing, then for any $0\leq a\leq c\leq b\leq 1$, $\overline{f}^{k}_{a,c}\leq\overline{f}^{k}_{a,b}$.

Proof.

\overline{f}^{k}_{a,b}=\mathbb{E}_{Z\sim\textnormal{Unif}(a,b)}\left[f_{k}(Z)\right]=\Pr\left[Z\in[a,c]\right]\cdot\mathbb{E}_{Z\sim\textnormal{Unif}(a,c)}\left[f_{k}(Z)\right]+\Pr\left[Z\in[c,b]\right]\cdot\mathbb{E}_{Z\sim\textnormal{Unif}(c,b)}\left[f_{k}(Z)\right]=\Pr\left[Z\in[a,c]\right]\cdot\overline{f}^{k}_{a,c}+\Pr\left[Z\in[c,b]\right]\cdot\overline{f}^{k}_{c,b}\geq\overline{f}^{k}_{a,c},

where the last inequality uses Lemma A.3 and the fact that the two probabilities sum to one.

Lemma A.5.
\overline{f}^{k}_{\frac{1}{4},1}-\overline{f}^{k}_{0,\frac{3}{4}}\geq g_{k}/3.
Proof.

We have

\overline{f}^{k}_{0,\frac{3}{4}}=\frac{2}{3}\cdot\overline{f}^{k}_{\frac{1}{4},\frac{3}{4}}+\frac{1}{3}\cdot\overline{f}^{k}_{0,\frac{1}{4}},

and

\overline{f}^{k}_{\frac{1}{4},1}=\frac{2}{3}\cdot\overline{f}^{k}_{\frac{1}{4},\frac{3}{4}}+\frac{1}{3}\cdot\overline{f}^{k}_{\frac{3}{4},1}.

It therefore follows that

\overline{f}^{k}_{\frac{1}{4},1}-\overline{f}^{k}_{0,\frac{3}{4}}=\frac{1}{3}\cdot\left(\overline{f}^{k}_{\frac{3}{4},1}-\overline{f}^{k}_{0,\frac{1}{4}}\right).

Note however that

g_{k}=\overline{f}^{k}_{\frac{1}{2},1}-\overline{f}^{k}_{0,\frac{1}{2}}=\left(\frac{1}{2}\cdot\overline{f}^{k}_{\frac{1}{2},\frac{3}{4}}+\frac{1}{2}\cdot\overline{f}^{k}_{\frac{3}{4},1}\right)-\left(\frac{1}{2}\cdot\overline{f}^{k}_{0,\frac{1}{4}}+\frac{1}{2}\cdot\overline{f}^{k}_{\frac{1}{4},\frac{1}{2}}\right)=\left(\frac{1}{2}\cdot\overline{f}^{k}_{\frac{3}{4},1}-\frac{1}{2}\cdot\overline{f}^{k}_{0,\frac{1}{4}}\right)+\left(\frac{1}{2}\cdot\overline{f}^{k}_{\frac{1}{2},\frac{3}{4}}-\frac{1}{2}\cdot\overline{f}^{k}_{\frac{1}{4},\frac{1}{2}}\right)\leq\overline{f}^{k}_{\frac{3}{4},1}-\overline{f}^{k}_{0,\frac{1}{4}},

where the last inequality follows from Lemma A.3, which gives $\overline{f}^{k}_{\frac{1}{2},\frac{3}{4}}\leq\overline{f}^{k}_{\frac{3}{4},1}$ and $\overline{f}^{k}_{\frac{1}{4},\frac{1}{2}}\geq\overline{f}^{k}_{0,\frac{1}{4}}$. Combined with the previous display, this proves the claim. ∎
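The following short Python check (illustration only; the test functions are arbitrary increasing functions on $[0,1]$) numerically verifies the conclusion of Lemmas A.3–A.5.

import numpy as np

# Illustration only: verify f_bar(1/4,1) - f_bar(0,3/4) >= g/3 for a few increasing
# functions f on [0,1], where g = f_bar(1/2,1) - f_bar(0,1/2) and f_bar(a,b) denotes
# the average of f over Unif(a, b).
def f_bar(f, a, b, m=200_001):
    return f(np.linspace(a, b, m)).mean()     # numerical average of f over [a, b]

for f in (lambda x: x, np.sqrt, lambda x: x**3, lambda x: np.exp(3 * x)):
    lhs = f_bar(f, 0.25, 1.0) - f_bar(f, 0.0, 0.75)
    g = f_bar(f, 0.50, 1.0) - f_bar(f, 0.0, 0.50)
    assert lhs >= g / 3 - 1e-9, (lhs, g / 3)
print("Lemma A.5 inequality verified on all test functions.")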

A.3 Proof of Corollary 5.2

Proof.

We need to show that for nslog(p)n\gtrsim s\log(p),

limp14secn4npecnminkβk2β22+σw2=1\displaystyle\lim_{p\to\infty}1-4se^{-c\cdot n}-4npe^{-c\cdot n\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}}=1

First, note that since minkβk2Ω(1s)\min_{k}\beta_{k}^{2}\in\Omega(\frac{1}{s}) and β2𝒪(1)\left\lVert\beta\right\rVert_{2}\in\mathcal{O}(1), it follows that minkβk2β22+σw2Ω(1s)\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}\in\Omega(\frac{1}{s}). Assuming that minkβk2β22+σw2Cs\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}\geq\frac{C}{s} for some CC, set n=max{4/C,2}cslog(p)n=\frac{\max\{4/C,2\}}{c}s\log(p). This implies that

4secn4sp2s4p.\displaystyle 4se^{-cn}\leq 4\frac{s}{p^{2s}}\leq\frac{4}{p}.

This goes to zero as $p$ grows. It therefore remains to show that $4npe^{-cn\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}}$ goes to zero, which is equivalent to showing that $cn\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}-\log(np)$ goes to $\infty$. Note however that

cnminkβk2β22+σw2log(np)\displaystyle cn\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}-\log(np) =cnminkβk2β22+σw2log(n)log(p)\displaystyle=cn\frac{\min_{k}\beta_{k}^{2}}{\left\lVert\beta\right\rVert_{2}^{2}+\sigma_{w}^{2}}-\log(n)-\log(p)
c4cCslog(p)Cslog(p)log(n)\displaystyle\geq c\cdot\frac{4}{c\cdot C}s\log(p)\cdot\frac{C}{s}-\log(p)-\log(n)
=3log(p)log(n)\displaystyle=3\log(p)-\log(n)

Now, observe that by our choice of nn,

n(4cC+2c)plog(p)(4cC+2c)p2.\displaystyle n\leq(\frac{4}{c\cdot C}+\frac{2}{c})\cdot p\log(p)\leq(\frac{4}{c\cdot C}+\frac{2}{c})p^{2}.

Therefore, log(n)2log(p)+log(C)\log(n)\leq 2\log(p)+\log(C^{\prime}), where C:=4cC+2cC^{\prime}:=\frac{4}{c\cdot C}+\frac{2}{c}, implying that

3log(p)log(n)log(p)log(C),\displaystyle 3\log(p)-\log(n)\geq\log(p)-\log(C^{\prime}),

which goes to $\infty$ as $p\to\infty$. ∎
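For concreteness, the following snippet evaluates the failure probability bound above at $n=\frac{\max\{4/C,2\}}{c}s\log(p)$ with illustrative (hypothetical) values of the constants $c$ and $C$, showing that the bound vanishes as $p$ grows.

import numpy as np

# Illustration only: c, C, and s are placeholder values, not constants from the paper.
c, C, s = 1.0, 1.0, 10
for p in (1e2, 1e4, 1e6, 1e8):
    n = max(4 / C, 2) / c * s * np.log(p)
    term1 = 4 * s * np.exp(-c * n)                 # corresponds to 4 s e^{-cn}
    term2 = 4 * n * p * np.exp(-c * n * C / s)     # corresponds to 4 n p e^{-cn C / s}
    print(f"p = {p:.0e}:  failure bound <= {term1 + term2:.3e}")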

A.4 Proof of Lemma 6.2

In this section, we provide the proof of Lemma 6.2, which is a generalization of Lemma 1 in (Li et al. 2019).

Proof.

Since $k^{\prime}$ is independent of $Y$, so is the permutation $\tau^{k^{\prime}}$. Since the $Y_{i}$ were assumed to be i.i.d., this implies that the $Y_{\tau^{k}(i)}$, and by extension the $Y^{k}_{L,i}$, are i.i.d. as well and have the same distribution as the $Y_{i}$. Moreover, the $Y^{k}_{i}$ are zero-mean and sub-Gaussian with norm at most $\sigma$, since

\displaystyle\left\lVert Y_{i}\right\rVert_{\psi_{2}}^{2}=\left\lVert f_{k}(X^{k}_{i})+\sum_{j\neq k}f_{j}(X^{j}_{i})+W\right\rVert_{\psi_{2}}^{2}\overset{(i)}{\leq}\left\lVert f_{k}(X^{k}_{i})\right\rVert_{\psi_{2}}^{2}+\sum_{j\neq k}\left\lVert f_{j}(X^{j}_{i})\right\rVert_{\psi_{2}}^{2}+\left\lVert W\right\rVert_{\psi_{2}}^{2}=\sigma^{2},

where $(i)$ follows from (P2). Focusing on the median split, Hoeffding’s inequality therefore implies that for any $t>0$, $\Pr\left(\left|\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]\right|>t\right)\leq 2e^{-c\cdot n_{L}t^{2}/\sigma^{2}}$ and $\Pr\left(\left|\widehat{\mathbb{E}}\left[Y^{k}_{R}\right]\right|>t\right)\leq 2e^{-c\cdot n_{R}t^{2}/\sigma^{2}}$. Setting $t=\frac{g_{k}}{30}$ and taking a union bound over both sub-trees implies that with probability at least $1-4e^{-c\cdot n\cdot\frac{\min_{k}g_{k}^{2}}{\sigma^{2}}}$,

|𝔼^[YRk]𝔼^[YLk]|gk15.\displaystyle\left|\widehat{\mathbb{E}}\left[Y^{k}_{R}\right]-\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]\right|\leq\frac{g_{k}}{15}.

Since nLnRn214\frac{n_{L}\cdot n_{R}}{n^{2}}\leq\frac{1}{4}, the claim follows from (3).

As for the optimal split, the analysis needs to change since the split point is not necessarily in the middle and also depends on the data. For a fixed $n_{L}$, however, the same analysis as above can be used to bound $\textnormal{imp}_{k,n_{L}}$ with minor modifications, as shown by (Li et al. 2019); while the concentration bound for $\widehat{\mathbb{E}}\left[Y^{k}_{L}\right]$ is weaker since $\frac{n_{L}}{n}$ can be small, this is offset by the $\frac{n_{L}\cdot n_{R}}{n^{2}}$ term in (3), and ultimately the same bound on $\textnormal{imp}_{k,n_{L}}$ can be obtained. Taking a union bound over all $n$ possible splitting points proves the result. ∎
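The concentration step for the median split can also be illustrated empirically; the snippet below is a sketch with arbitrary choices of $n$, $\sigma$, and $t$, simulating the left and right half-means of $Y$ under a permutation drawn independently of $Y$ and comparing the empirical failure rate to a Hoeffding-style tail bound.

import numpy as np

# Illustration only: n, sigma, t, and the Gaussian responses are arbitrary choices.
rng = np.random.default_rng(1)
n, sigma, t, trials = 2000, 1.0, 0.1, 10_000
failures = 0
for _ in range(trials):
    Y = rng.normal(scale=sigma, size=n)        # zero-mean responses with norm sigma
    perm = rng.permutation(n)                  # permutation independent of Y
    left, right = Y[perm[: n // 2]], Y[perm[n // 2:]]
    if abs(left.mean()) > t or abs(right.mean()) > t:
        failures += 1
# Each half-mean is Gaussian with variance sigma^2 / (n/2); union bound over both halves.
bound = 4 * np.exp(-(n / 2) * t**2 / (2 * sigma**2))
print(f"empirical failure rate: {failures / trials:.4f},  tail bound: {bound:.4f}")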

A.5 Proof of Lemma 6.4

Lemma A.6.

Let the random variable Θ\Theta be distributed as Beta(nL+1,nR)\text{Beta}(n_{L}+1,n_{R}) for {nL,nR}={n2,n2}\{n_{L},n_{R}\}=\{\lfloor\frac{n}{2}\rfloor,\lceil\frac{n}{2}\rceil\}. For n5n\geq 5 and t23t\geq\frac{2}{3},

Pr(Θt)en(t23)2.\displaystyle\Pr\left(\Theta\geq t\right)\leq e^{-n(t-\frac{2}{3})^{2}}.
Proof.

Let $U_{1},\dots,U_{n}$ be i.i.d. $\textnormal{Unif}(0,1)$ random variables and let $\tau$ be their sorting permutation, i.e.,

Uτ(1)Uτ(n)\displaystyle U_{\tau(1)}\leq\dots\leq U_{\tau(n)}

It is well-known (e.g., see (Rundel 2012)) that for any kk, Uτ(k)U_{\tau(k)} is distributed as Beta(k,n+1k)\text{Beta}(k,n+1-k). Therefore, Θ\Theta has the same distribution as Uτ(nL+1)U_{\tau(n_{L}+1)}. This means that for any t[0,1]t\in[0,1],

Pr[Θt]\displaystyle\Pr\left[\Theta\geq t\right] =Pr[Uτ(nL+1)t]\displaystyle=\Pr\left[U_{\tau(n_{L}+1)}\geq t\right]
=Pr[|{k:Uk<t}|nL]\displaystyle=\Pr\left[\left|\{k:U_{k}<t\}\right|\leq n_{L}\right]
=Pr[|{k:Ukt}|nR]\displaystyle=\Pr\left[\left|\{k:U_{k}\geq t\}\right|\geq n_{R}\right]
=Pr[k𝟙[Ukt]nR]\displaystyle=\Pr\left[\sum_{k}\mathds{1}\left[U_{k}\geq t\right]\geq n_{R}\right]
=Pr[1n(k𝟙[Ukt])nRn]\displaystyle=\Pr\left[\frac{1}{n}\cdot\left(\sum_{k}\mathds{1}\left[U_{k}\geq t\right]\right)\geq\frac{n_{R}}{n}\right]

Defining $Z_{k}:=\mathds{1}\left[U_{k}\geq t\right]$, it is clear that the $Z_{k}$ are i.i.d. Bernoulli random variables with parameter $1-t$. Since $t\geq\frac{2}{3}$ and $\frac{n_{R}}{n}\geq\frac{1}{3}$, we have $1-t\leq\frac{1}{3}\leq\frac{n_{R}}{n}$, and Hoeffding’s inequality implies that

\displaystyle\Pr\left[\Theta\geq t\right]\leq\Pr\left[\frac{1}{n}\cdot\left(\sum_{k}Z_{k}\right)\geq\frac{1}{3}\right]\leq e^{-2n\left(\frac{1}{3}-(1-t)\right)^{2}}=e^{-2n\left(t-\frac{2}{3}\right)^{2}}\leq e^{-n\left(t-\frac{2}{3}\right)^{2}},

which completes the proof. ∎

By symmetry, we also have $\Pr\left(\Theta\leq t\right)\leq e^{-n(t-\frac{1}{3})^{2}}$ for any $t\leq\frac{1}{3}$, which implies Lemma 6.4 via a union bound.
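The bound of Lemma A.6 can also be checked directly against the Beta survival function; the snippet below (illustration only, using SciPy) verifies it for a few values of $n$ and $t$.

import numpy as np
from scipy.stats import beta

# Illustration only: check Pr(Theta >= t) <= exp(-n (t - 2/3)^2) for Theta ~ Beta(n_L+1, n_R)
# with n_L = floor(n/2), n_R = ceil(n/2), n >= 5, and t >= 2/3.
for n in (5, 20, 100, 500):
    n_L, n_R = n // 2, n - n // 2
    for t in (0.70, 0.80, 0.90):
        tail = beta.sf(t, n_L + 1, n_R)               # Pr(Theta >= t)
        bound = np.exp(-n * (t - 2.0 / 3.0) ** 2)
        assert tail <= bound, (n, t, tail, bound)
print("Beta tail bound of Lemma A.6 verified on all tested (n, t) pairs.")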

A.6 Proof of Theorem 5.4

Proof.

Consider the distribution of $\textnormal{imp}_{k}$ for an inactive $k$. Since $k$ is inactive, the sorting permutation $\tau_{k}$ is independent of $\mathbf{X},Y$. Therefore, the distribution of $\textnormal{imp}_{k}$ is the distribution of $\textnormal{imp}(\tau(Y))$, where $\tau$ is a random permutation on $[n]$ and $\textnormal{imp}(Y)$ is defined as in Lines 9, 11 and 13, i.e.,

imp(Y)=n2nVar^(Yn2)+nn2nVar^(Y>n2)\displaystyle\textnormal{imp}(Y)=\frac{\lfloor\frac{n}{2}\rfloor}{n}\widehat{\text{Var}}(Y_{\leq\lfloor\frac{n}{2}\rfloor})+\frac{n-\lfloor\frac{n}{2}\rfloor}{n}\widehat{\text{Var}}(Y_{>\lfloor\frac{n}{2}\rfloor})

for the median split and

imp(Y)=minnLnLnVar^(YnL)+nnLnVar^(Y>nL)\displaystyle\textnormal{imp}(Y)=\min_{n_{L}}\frac{n_{L}}{n}\widehat{\text{Var}}(Y_{\leq n_{L}})+\frac{n-n_{L}}{n}\widehat{\text{Var}}(Y_{>n_{L}})

for the optimal split. In the above equations, $Y_{\leq i}$ and $Y_{>i}$ refer to $(Y_{1},\dots,Y_{i})^{T}$ and $(Y_{i+1},\dots,Y_{n})^{T}$, respectively.

Similarly, for $t\in[T]$ and $i\in[p]$, $\tau^{(t)}_{i}$ is a random permutation independent of $\mathbf{X},Y$. Therefore, $\textnormal{imp}_{i}^{(t)}$ is also distributed as $\textnormal{imp}(\tau(Y))$ for a random permutation $\tau$ on $[n]$. Furthermore, all of these random permutations (and therefore all values $\textnormal{imp}_{i}^{(t)}$ and $\textnormal{imp}_{k}$ for inactive $k$) are independent of each other. This is because (a) all of the inactive features are independent of each other by assumption and (b) the permutations $\sigma^{(t)}$ are independent by design, as they were sampled independently in Line 4. We now observe that $\min_{k\notin S}\textnormal{imp}_{k}\leq\min_{i,t}\textnormal{imp}_{i}^{(t)}$ happens if and only if the minimum across all of these $(T-1)p+p-s$ numbers is attained among the $p-s$ numbers $\textnormal{imp}_{k}$ for $k\notin S$. By symmetry, this probability is exactly $\frac{p-s}{(T-1)p+p-s}\leq\frac{1}{T}$.

As for bounding the probability that $\max_{k\in S}\textnormal{imp}_{k}\geq\min_{i,t}\textnormal{imp}_{i}^{(t)}$, we note that, by the discussion above, the extra $\textnormal{imp}_{i}^{(t)}$ can be thought of as the $\textnormal{imp}_{k}$ of extra inactive features indexed by $(i,t)$. The probability that $\max_{k\in S}\textnormal{imp}_{k}\geq\min_{k\notin S}\textnormal{imp}_{k}$ can therefore be bounded as in Theorem 5.3. Taking a union bound with $\Pr\left[\min_{k\notin S}\textnormal{imp}_{k}\leq\min_{i,t}\textnormal{imp}_{i}^{(t)}\right]$ proves the theorem’s statement. ∎
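The symmetry argument above can be illustrated with a small simulation; in the sketch below, the values of $p$, $s$, and $T$ are arbitrary, and i.i.d. uniform draws stand in for the exchangeable impurity values. The chance that the overall minimum falls among the $p-s$ real inactive impurities matches the stated ratio.

import numpy as np

# Illustration only: i.i.d. Unif(0,1) draws play the role of the exchangeable impurities.
rng = np.random.default_rng(2)
p, s, T, trials = 50, 5, 10, 200_000
hits = 0
for _ in range(trials):
    real = rng.random(p - s)             # stand-ins for imp_k, k not in S
    shadow = rng.random((T - 1) * p)     # stand-ins for the shadow impurities imp_i^(t)
    hits += real.min() <= shadow.min()
print(f"empirical: {hits / trials:.4f},  "
      f"exact: {(p - s) / ((T - 1) * p + p - s):.4f},  1/T: {1 / T:.4f}")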

A.7 Experimental Configurations

All numerical simulations were conducted in Python, using the SkLearn package to implement both Lasso and the single-depth decision trees. The Lasso regression was fit using a regularization strength of $C=2$ and the liblinear solver. Splitting points for the single-depth decision trees were identified by variance reduction, as described in the pseudocode of Algorithm 1. Data was generated randomly using the PyTorch package, and all simulations were performed on an NVIDIA RTX A4000 GPU through Python’s CUDA ecosystem.
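For reference, the following is a minimal sketch of such a recovery simulation in Python; it is illustrative rather than the exact configuration used in our experiments (the constants, noise level, and data distribution are arbitrary, and scikit-learn’s DecisionTreeRegressor stands in for Algorithm 1).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative sketch of the recovery experiment (not the exact configuration used above).
rng = np.random.default_rng(3)
n, p, s = 500, 200, 5
X = rng.random((n, p))                               # features drawn from Unif(0, 1)
beta = np.zeros(p); beta[:s] = 1.0                   # s active coordinates
y = X @ beta + 0.5 * rng.normal(size=n)              # sparse linear response plus noise

# Fit one decision stump per feature and score it by its reduction in variance.
scores = np.empty(p)
for k in range(p):
    stump = DecisionTreeRegressor(max_depth=1).fit(X[:, [k]], y)
    scores[k] = np.var(y) - np.mean((y - stump.predict(X[:, [k]])) ** 2)

selected = np.sort(np.argsort(scores)[-s:])          # top-s features by variance reduction
print("recovered active set:", selected)             # should be [0 1 2 3 4] w.h.p.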