
ControlBurn: Feature Selection by Sparse Forests

Brian Liu, Cornell University, Ithaca, New York, USA (bl462@cornell.edu); Miaolan Xie, Cornell University, Ithaca, New York, USA (mx229@cornell.edu); and Madeleine Udell, Cornell University, Ithaca, New York, USA (udell@cornell.edu)

Tree ensembles distribute feature importance evenly amongst groups of correlated features. The average feature ranking of the correlated group is suppressed, which reduces interpretability and complicates feature selection. In this paper we present ControlBurn, a feature selection algorithm that uses a weighted LASSO-based feature selection method to prune unnecessary features from tree ensembles, just as low-intensity fire reduces overgrown vegetation. Like the linear LASSO, ControlBurn assigns all the feature importance of a correlated group of features to a single feature. Moreover, the algorithm is efficient and only requires a single training iteration to run, unlike iterative wrapper-based feature selection methods. We show that ControlBurn performs substantially better than feature selection methods with comparable computational costs on datasets with correlated features.

Figure 1. A sparse forest.

1. Introduction

Tree ensembles are well-loved in machine learning for balancing excellent predictive performance with model interpretability. Importantly, tree ensemble models tend not to overfit even when the number of features is large. And even complex boosting ensembles admit feature importance scores that are cheap to compute. These properties make tree ensembles a popular choice for feature selection: simply train a model on all the features, and select the most important features according to the feature importance score.

However, when tree ensembles are trained on data with correlated features, the resulting feature rankings can be unreliable. This correlation bias arises because tree ensembles split feature importance scores evenly amongst groups of correlated features. Figure 2 shows an example. On the left plot, feature importance is calculated for a random forest classifier fit on the Titanic dataset. The forest identifies age, sex, and passenger class as the most influential features. As an experiment, we modify the Titanic dataset by adding the synthetic features “height”, “weight”, “femur” length, and “tibia” length, which we generate from a plausible distribution that depends on age and sex, following measurements from (howell2000demography, ). As a result, “height”, “weight”, “sex”, “age”, “femur”, and “tibia” are highly correlated. While these features together are important, the feature importance of each variable in this cluster drops. More generally, when large groups of correlated features are present in a dataset, important features may appear much less influential than they actually are. Simple feature selection methods may select the whole group, omit the entire group, or pick the wrong features from the group. Here, “weight” and “height”, synthetic noisy features, are now elevated above more important features like “age” and “sex”. In addition, “pclass” is elevated above “age”.



Figure 2. Effect of correlation bias on the top feature ranks in the Titanic dataset. The relative importance of “pclass” is inflated above “age”.

In this paper we develop ControlBurn, a feature selection algorithm that eliminates correlation bias by sparsifying the forest, just as fire is used in forest management to maintain a healthy (biological) forest by reducing density. We first grow a forest using techniques such as bagging and boosting. Then we reduce the number of trees by choosing sparse weights for each tree via a group-lasso penalized objective that balances feature sparsity with predictive performance.

Importantly, our algorithm is efficient and scalable. We only need to fit a tree ensemble once and solve a group LASSO problem, which scales linearly with the number of data points, to select a sparse subset of features.

On our modified Titanic dataset, ControlBurn selects “sex”, “pclass”, and either “age” or a continuous proxy for age (“weight”, “height”, “femur” or “tibia”) as the three most important features. Since ControlBurn is robust to correlation bias, these selected features align with the three most important features in the original data.

1.1. Applications

1.1.1. Interpretable machine learning

Correlation bias reduces the interpretability of feature rankings. When feature importance scores are split evenly amongst groups of correlated features, influential features are harder to detect: feature ranks can become cluttered with many features that represent similar relationships. ControlBurn allows a data scientist to quickly identify a sparse subset of influential features for further analysis.

1.1.2. Optimal experimental design

In many experimental design problems, features are costly to observe. For example, in medical settings, certain tests are expensive and risky to the patient. In these situations it is crucial to avoid redundant tests. In this setting, ControlBurn can be used to identify a subset of $k$ features (for any integer $k>0$) with nearly the best model performance, eliminating redundancy.

1.2. Main contributions

The main contributions of this paper follow.

  • We develop ControlBurn, a feature selection algorithm robust to correlation bias that can recover a sparse set of features from highly intercorrelated data.

  • We present empirical evidence that ControlBurn outperforms traditional feature selection algorithms when applied to datasets with correlated features.

1.3. Paper organization

We first examine related work concerning tree ensembles and feature selection techniques. Following this literature review, we explain in detail our implementation of ControlBurn. We then describe our experimental procedure to evaluate ControlBurn against traditional feature selection methods. Finally we present our findings and discuss interesting directions for future research.

2. Background

2.1. Tree ensembles

The following section provides an overview of tree ensembles and details how feature importance is computed in tree models.

2.1.1. Decision trees

form the base learners used in tree ensembles. We give a quick overview of classification trees below. A classification tree of depth $d$ separates a dataset into at most $2^{d}$ distinct partitions. Within each partition, all data points are predicted as the majority class. Classification trees are grown greedily, and splits are selected to minimize the Gini impurity, misclassification error, or deviance in the directly resulting partitions. The textbook (friedman2001elements, , Chapter 9) provides a comprehensive reference on how to construct decision trees.

2.1.2. Bagging

reduces the variance of tree ensembles through averaging the predictions of many trees. We sample the training data uniformly with replacement to create bags (bootstrapped datasets with the same size as the original) and fit decision trees on each bag. The final prediction is the average (or majority vote, for classification) of the predictions from all the trees. Reducing correlation between trees improves the effectiveness of bagging. To do so, we can restrict the number of features considered each time a tree is built or a split is selected (friedman2001elements, , Chapter 15).

2.1.3. Boosting

fits an additive sequence of trees to reduce the bias of a tree ensemble. Each decision tree is fit on the residuals of its predecessor; weak learners are sequentially combined to improve performance. Popular boosting algorithms include AdaBoost (freund1999short, ), gradient boosting (friedman2001greedy, ), and XGBoost (chen2016xgboost, ). The textbook (friedman2001elements, , Chapter 10) provides a comprehensive reference on boosting algorithms.

2.1.4. Feature importance scores

can be computed from decision trees by calculating the decrease in node impurity for each tree split. Gini index and entropy are two commonly used metrics to measure this impurity. Averaging this metric over all the splits for each feature yields that feature’s importance score. This score is known as Mean Decrease Impurity (MDI); features with larger MDI scores are considered more important. MDI scores can be calculated for tree ensembles by averaging over all the trees in the ensemble (2001.04295, ).

2.2. Feature selection

Machine learning algorithms are sensitive to the underlying structure of the data. In datasets with linear relationships between the features and the response, linear regression, logistic regression, and support vector machines with linear kernels (cortes1995support, ) are all suitable candidates for a model. For nonlinear datasets, tree ensembles such as random forests, gradient boosted trees, and explainable boosting machines (EBMs) (nori2019interpretml, ) are commonly used.

Feature selection is fairly straightforward in the linear setting. Methods such as LASSO (tibshirani1996regression, ), regularized logistic regression (lee2006efficient, ), and SVMs with no kernels (lee2006efficient, ) can select sparse models. Our work aims to extend regularization techniques commonly used in linear models to the non-linear setting. Existing feature selection algorithms for nonlinear models can be divided into the following categories: filter-based, wrapper-based, and embedded (chandrashekar2014survey, ).

2.2.1. Filter-based

feature selection algorithms remove features suspected to be irrelevant based on metrics such as correlation or chi-square similarity (ambusaidi2016building, ; lee2011filter, ). Features that are correlated with the response are preserved while uncorrelated features are removed. Groups of intercorrelated features correlated with the response are entirely preserved.

Filter-based algorithms are model-agnostic; their performance depends on the relevance of the similarity metric used. For example, using Pearson’s correlation to select features from highly non-linear data will yield poor results.

2.2.2. Wrapper-based

feature selection algorithms iteratively select features by retraining the model on different feature subsets (mafarja2018whale, ; maldonado2009wrapper, ).

For example, in recursive feature elimination (RFE) the model is initially trained on the full feature set. The features are ranked by feature importance scores and the least important feature is pruned. The model is retrained on the remaining features and the procedure is repeated until the desired number of features is reached. Other wrapper-based feature selection algorithms include sequential and stepwise feature selection. All of these algorithms are computationally expensive since the model must be retrained several times.

2.2.3. Embedded

feature selection algorithms select important features as the model is trained (langley1994selection, ; blum1997selection, ). For example, in LASSO regression, the regularization parameter $\lambda$ controls the number of coefficients forced to zero, i.e. the number of features excluded from the model. For tree ensembles, MDI feature importance scores are computed during the training process. These scores can be used to select important features; however, this approach is highly susceptible to correlation bias since MDI feature importance is distributed evenly amongst correlated features. ControlBurn extends LASSO to the nonlinear setting and is an embedded feature selection algorithm robust to correlation bias.

2.3. Sparsity vs. stability

Like LASSO in the linear setting, ControlBurn trades stability for sparsity. To build a sparse model on a dataset with correlated features, ControlBurn selects a subset of features to represent the group. The algorithm is useful for selecting an independent set of features, but the selected set is not stable. Indeed, an arbitrarily small perturbation to the dataset can lead the algorithm to select a different set of features when the features are highly correlated.

The tradeoff between sparsity and stability is fundamental to machine learning (xu2011sparse, ). Feature selection and explanation algorithms such as LASSO regression, best-subset selection (bertsimas2016best, ), and LIME (ribeiro2016should, ) encourage sparsity, while algorithms such as ridge regression and SHAP (lundberg2017unified, ) are more stable but not sparse. To achieve both sparsity and stability, the dataset may need to be very large (zhou2021, ).

3. ControlBurn algorithm

3.1. Overview

The ControlBurn algorithm works as follows. Given feature matrix $X\in\mathbb{R}^{m\times p}$ and response $y\in\mathbb{R}^{m}$, we build a forest consisting of $n$ decision trees to predict $y$ from $X$. We describe various approaches to build this forest in detail in the next section. Each tree $t_{i}$ for $i\in\{1,\ldots,n\}$ generates a vector of predictions $\alpha_{i}\in\mathbb{R}^{m}$. For each tree $t_{i}$, let $g_{i}\in\{0,1\}^{p}$ be a binary indicator vector, with $g_{ij}=1$ if feature $j\in\{1,\ldots,p\}$ generates a split in tree $t_{i}$.

Fix a regularization parameter $\lambda>0$. We introduce a non-negative variable $w\in\mathbb{R}^{n}$ to represent the weight of each tree in the forest; a tree $t_{i}$ with $w_{i}=0$ plays no role in prediction and can be pruned. Let $A\in\mathbb{R}^{m\times n}$ be the matrix with vectors $\alpha_{i}$ as columns and let $G\in\mathbb{R}^{p\times n}$ be the matrix with vectors $g_{i}$ as columns. Hence $Aw\in\mathbb{R}^{m}$ gives the predictions of the forest on each instance given weights $w$, while the elements of $Gw\in\mathbb{R}^{p}$ give the total weight of each feature in the forest.
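To make the construction concrete, the sketch below assembles $A$ and $G$ from a small list of fitted scikit-learn trees on synthetic data. This is a minimal illustration in our own notation (the helper build_A_G and the toy data are assumptions, not part of any released ControlBurn code); it uses regression trees, but the same construction applies to classifiers.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def build_A_G(trees, X, p):
    """Stack per-tree predictions into A (m x n) and feature-usage indicators into G (p x n)."""
    A = np.column_stack([t.predict(X) for t in trees])  # columns alpha_i
    G = np.zeros((p, len(trees)))
    for i, t in enumerate(trees):
        used = t.tree_.feature[t.tree_.feature >= 0]  # internal (split) nodes only
        G[np.unique(used), i] = 1.0  # g_ij = 1 if feature j splits tree i
    return A, G

# toy example: a tiny bagged forest of shallow trees
rng = np.random.default_rng(0)
m, p = 200, 5
X = rng.normal(size=(m, p))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=m)
trees = []
for d in (1, 2, 3):
    idx = rng.choice(m, size=m, replace=True)  # bootstrap bag
    trees.append(DecisionTreeRegressor(max_depth=d, random_state=0).fit(X[idx], y[idx]))
A, G = build_A_G(trees, X, p)  # A: (200, 3), G: (5, 3)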

We seek a feature-sparse forest (sparse $Gw$) that yields good predictions. To this end, define a loss function $L(A,w,y)$ for the prediction problem at hand, for example the square loss for regression, $L(A,w,y)=\left\|y-Aw\right\|_{2}^{2}$, or the logistic or hinge loss for classification.

Inspired by the group LASSO (tibshirani1996regression, ), our objective is to minimize the loss together with a sparsifying regularizer, over all possible weights $w\geq 0$ for each tree:

(1)   $\begin{array}{ll}\mbox{minimize} & \frac{1}{m}L(A,w,y)+\lambda\left\|Gw\right\|_{1}\\ \mbox{subject to} & w\geq 0,\end{array}$

Since $w_{i}\geq 0$ for all $i$, we can equivalently express this problem as

(2)   $\begin{array}{ll}\mbox{minimize} & \frac{1}{m}L(A,w,y)+\lambda\sum_{i=1}^{n}u_{i}w_{i}\\ \mbox{subject to} & w\geq 0,\end{array}$

where $u_{i}$ is the number of features used in tree $i$. This objective encourages sparse models, as the regularization penalty increases as more features are used in the ensemble. The hyperparameter $\lambda$ controls the sparsity of the solution.
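For illustration, Problem (2) with square loss can be solved directly with CVXPY, which our experiments use as a solver front end. The sketch below is our own minimal rendering, assuming $A$, $G$, and $y$ have already been built as above; the function name and the value of $\lambda$ are illustrative, not the released implementation.

import numpy as np
import cvxpy as cp

def sparsify_forest(A, G, y, lam):
    """Solve Problem (2) with square loss: minimize (1/m)||y - Aw||^2 + lam * sum_i u_i w_i over w >= 0."""
    m, n = A.shape
    u = G.sum(axis=0)  # u_i = number of features used by tree i
    w = cp.Variable(n, nonneg=True)
    objective = cp.sum_squares(y - A @ w) / m + lam * cp.sum(cp.multiply(u, w))
    cp.Problem(cp.Minimize(objective)).solve()
    w_star = w.value
    selected = np.flatnonzero(G @ w_star > 1e-6)  # features j with (G w*)_j > 0
    return w_star, selected

# usage: w_star, selected_features = sparsify_forest(A, G, y, lam=0.05)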

Let $w^{\star}\in\mathbb{R}^{n}$ denote the solution to Problem 2. A tree $i$ with a non-zero weight $w^{\star}_{i}>0$ is used by the ensemble, while a tree with weight $w^{\star}_{i}=0$ is omitted. We say feature $j$ was selected by the ensemble if $(Gw^{\star})_{j}$ is non-zero.

The final step of our algorithm fits a new tree ensemble model on the selected features. This step improves solution quality, just as refitting a least squares regression model on features selected from LASSO regression reduces the bias of the final model (chzhen2019lasso, ).

Our algorithm operates in two major steps: first we grow a forest, and then we sparsify the forest by solving Problem (2). We discuss each step in detail below.


Figure 3. Selecting a feature-sparse forest with ControlBurn.

3.2. Growing a forest

ControlBurn works best when executed on a diverse forest, with trees that use different sets of features. Traditional forest-growing methods such as bagging often fail to generate sufficient diversity. Using standard methods to grow the forest, we often find that no subset of the trees yields a feature-sparse model, or that the resulting model has poor accuracy.

For example, if trees are grown to full depth, every tree in the forest generally uses the full feature set. Hence the only sparse solution $w^{\star}$ to Problem 2, for any $\lambda$, is $w^{\star}=\mathbf{0}$: either all or none of the features will be selected.

We have developed two algorithms to grow diverse forests: incremental depth bagging and incremental depth bag-boosting. Both are simple to tune and match the accuracy of other state-of-the-art tree ensembles.

3.2.1. Incremental depth bagging

(Algorithm 1) increases forest diversity by incrementally increasing the depth $d$ of the trees grown. A tree of depth $d$ can use at most $2^{d}-1$ features. As we increase $d$, the number of features used per tree increases, allowing the model to represent more complex feature interactions.

The algorithm has a single hyperparameter, $d^{\max}$, the maximum depth of trees to grow. This hyperparameter is easy to choose: bagging prevents overfitting, so bigger is better, and $d^{\max}=p$ is allowed.

We represent a tree ensemble as a set of trees $F=\{t_{1},\ldots,t_{n}\}$. With some abuse of notation, we let $F(x)$ be the output of the tree ensemble given input $x\in\mathbb{R}^{p}$.

Input: maximum depth $d^{\max}$
Initialize $d\leftarrow 1$, $F\leftarrow\emptyset$
while $d\leq d^{\max}$ do
      Sample bag $X^{\prime}$ from $X$ with replacement
      Fit tree $t$ of depth $d$ on $X^{\prime},y$
      Add tree $t$ to forest: $F\leftarrow F\cup\{t\}$
      Compute train error of forest $F$
      If train error has converged (§3.2.1), increment $d\leftarrow d+1$
Output: Forest $F$
Algorithm 1 Incremental Depth Bagging.


Figure 4. Incremental depth bagging with $d^{\max}=3$.

Incremental depth bagging increments the depth of trees grown when the performance of the model saturates at the lower depths. At each depth level, the training loss is recorded each time a new tree is added to the ensemble. When the training loss converges, the algorithm increments the depth $d$ and begins adding deeper trees. We say the loss converges if the tail $N$ elements of the sequence fall within an $\epsilon$ tube. Experimentally, choosing $N=5$ and $\epsilon=10^{-3}$ yields good results. This methodology differs from typical early stopping algorithms used in machine learning, since it stops based on training loss rather than test loss. We use training loss because it is fast to evaluate and overfitting is not a concern, since bagged tree ensembles tend not to overfit with respect to the number of trees used.
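A compact rendering of Algorithm 1 with this convergence rule ($N=5$, $\epsilon=10^{-3}$) is sketched below for regression with squared error. It is our simplified reading of the pseudocode, not the reference implementation; in particular, we restart the convergence check whenever the depth is incremented.

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

def incremental_depth_bagging(X, y, d_max, N=5, eps=1e-3, seed=0):
    """Algorithm 1 (sketch): add bagged trees of depth d until the train loss converges, then deepen."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    forest, losses = [], []
    d = 1
    while d <= d_max:
        idx = rng.choice(m, size=m, replace=True)  # bootstrap bag
        forest.append(DecisionTreeRegressor(max_depth=d).fit(X[idx], y[idx]))
        pred = np.mean([t.predict(X) for t in forest], axis=0)  # bagged prediction
        losses.append(mean_squared_error(y, pred))
        # converged if the last N training losses fall within an eps tube
        if len(losses) >= N and max(losses[-N:]) - min(losses[-N:]) < eps:
            d += 1
            losses = []  # restart the tube check at the new depth
    return forest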

3.2.2. Incremental depth bag-boosting

(Algorithm 2) uses out-of-bag (OOB) error to automatically determine the maximum depth of trees to grow, and requires no hyperparameters. The algorithm uses bagging ensembles as the base learner in gradient boosting to increase diversity without sacrificing accuracy.

Initialize $d\leftarrow 1$, $F^{\prime}\leftarrow\emptyset$, $F\leftarrow\emptyset$, $\delta>0$
Initialize $F(x)$ by predicting the mean/majority class of $y$; set $e\in\mathbb{R}^{m}$ as the vector of residuals
while $\delta>0$ do
      Sample bag $X^{\prime}$ from $X$ with replacement
      Fit tree $t$ of depth $d$ on $X^{\prime},e$
      $F^{\prime}\leftarrow F^{\prime}\cup\{t\}$
      Compute train error of forest $F\cup F^{\prime}$
      if train error has converged (§3.2.1) then
            $F\leftarrow F\cup F^{\prime}$
            Set $e$ as the negative gradient of the loss
            Increment $d\leftarrow d+1$
            Set $\delta$, the improvement in OOB error from $F^{\prime}$
            $F^{\prime}\leftarrow\emptyset$
Output: Forest $F$
Algorithm 2 Incremental Depth Bag-Boosting.


Figure 5. Iterations of incremental depth bag-boosting.

The number of trees grown per boosting iteration is determined by how fast the training error of the ensemble converges, as in §3.2.1. Each boosting iteration increments the depth of trees grown and updates the pseudo-residuals. This is done by setting $e_{i}=-\frac{\partial L(y_{i},z)}{\partial z}\big|_{z=F(x_{i})}$ for $i\in\{1,\ldots,m\}$.
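For the two losses mentioned in Section 3.1, this update reduces to familiar residuals. A minimal sketch in our own notation (labels in {0,1} for the logistic case):

import numpy as np

def pseudo_residuals_square(y, F_x):
    """Square loss L = (1/2)(y - z)^2: the negative gradient is the ordinary residual y - F(x)."""
    return y - F_x

def pseudo_residuals_logistic(y, F_x):
    """Logistic loss on raw scores z = F(x) with labels y in {0,1}: e = y - sigmoid(F(x))."""
    return y - 1.0 / (1.0 + np.exp(-F_x))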

For each row $x\in X$, the OOB prediction is obtained by only using the trees in $F^{\prime}$ fit on bags $\{X^{\prime}\mid x\not\in X^{\prime}\}$. Let $\delta$ be the improvement in OOB error between subsequent boosting iterations. The algorithm terminates when $\delta<0$.

In Figure 6, incremental depth bag-boosting is conducted on the Abalone dataset from the UCI Machine Learning Repository (Dua:2019, ). The scatterplot shows how the improvement in OOB error, $\delta$, changes as the number of boosting iterations increases. The procedure terminates at the sixth boosting iteration, when $\delta<0$. The line plot shows how out-of-sample (OOS) test error varies with the number of boosting iterations. OOS error is minimized at the seventh boosting iteration, remarkably close to the OOB stopping point, which is determined on the fly.


Figure 6. Incremental depth bag-boosting OOB early stopping vs. test error

3.2.3. Theoretical guarantees

The following lemma shows that the average performance improvement calculated using OOB samples is a good estimator of the true performance improvement for the data distribution. The lemma applies to any pair of ensembles $F$ and $\hat{F}$ where every element in ensemble $F$ appears in ensemble $\hat{F}$.

Lemma 3.1.

Let $F_{n}$ be an ensemble with $n$ trees $\{T_{1},\dots,T_{n}\}$, and let $F_{n+1}$ be an ensemble with one more tree, $\{T_{1},\dots,T_{n},T_{n+1}\}$. Suppose the data (including the feature $x\in\mathbb{R}^{p}$ and label $y\in\mathbb{R}$) follows the distribution $(x,y)\sim\mathcal{X}$. Define the improvement $\textup{Imp}(x,y)=|y-F_{n}(x)|-|y-F_{n+1}(x)|$. Let $\bar{I}=\mathbb{E}_{(x,y)\sim\mathcal{X}}\,\textup{Imp}(x,y)$ be the expected improvement of the model $F_{n+1}$ over $F_{n}$, and let $\mathcal{O}$ be the set of OOB samples of tree $T_{n+1}$. Suppose $\bar{I}$ is finite and each $(x,y)\in\mathcal{O}$ independently follows the distribution $\mathcal{X}$. Then

$\mathbb{E}\left(\frac{1}{|\mathcal{O}|}\sum_{(x,y)\in\mathcal{O}}\textup{Imp}(x,y)\right)=\bar{I}.$

If the labels $y$ are all bounded, $\textup{Imp}(x,y)$ is a sub-exponential random variable with some parameters $(\nu,b)$, and for all $\epsilon\geq 0$,

$\mathbb{P}\left(\left|\frac{1}{|\mathcal{O}|}\sum_{(x,y)\in\mathcal{O}}\textup{Imp}(x,y)-\bar{I}\right|\geq\epsilon\right)\leq 2e^{-\min\left\{\frac{\epsilon^{2}|\mathcal{O}|}{2\nu^{2}},\,\frac{\epsilon|\mathcal{O}|}{2b}\right\}}.$

We found empirically that incremental depth bag-boosting performs better than incremental depth bagging when used in ControlBurn. In addition, the OOB early stopping rule eliminates all hyperparameters and limits the size of the forest. This reduces computation time in the following optimization step. Forests grown by incremental depth bag-boosting are more feature diverse than those grown via incremental depth bagging. Consider the case where a single highly influential binary feature dominates a dataset. While each iteration of incremental depth bagging will include this feature, the feature will only appear in the first boosting iteration in incremental depth bag-boosting. Boosting algorithms fit on the errors of previous iterations, so once the relationship between the feature and the outcome is learned, the feature may be ignored in subsequent iterations.

3.2.4. Other forest growing algorithms

include feature-sparse decision trees and explainable boosting machines (EBMs). These algorithms do not perform as well as incremental depth bag-boosting, but may be more interpretable.

The procedure to build feature-sparse decision trees is a simple extension of the traditional tree-building algorithm. For classification, decision trees typically choose splits to minimize the Gini impurity of the resulting nodes. Sparse decision trees add an additional cost $c$ to this impurity function when the tree splits on a feature that has not been used before. As a result, when groups of correlated features are present, the algorithm consistently splits on a single feature from each group. The cost parameter $c$ controls the sparsity of each tree. Ensembles of feature-sparse trees are not necessarily sparse, so ControlBurn must still be used to obtain a final sparse model.

Explainable boosting machines are generalized additive models composed of boosted trees, each trained on a single feature (lou2013accurate, ). Each tree in an EBM is feature-sparse with sparsity $1$. The additive nature of EBMs improves interpretability, as the contribution of each feature to the model can be easily obtained. However, this comes at a cost in predictive performance, since EBMs cannot capture higher-order interactions. When ControlBurn is used on EBMs, the matrix $G\in\mathbb{R}^{p\times p}$ in the optimization step is the identity matrix.

3.3. Optimization variants

3.3.1. Feature groupings

In many applications, it may be necessary to restrict feature selection to sets of features rather than individual ones. For example, in medical diagnostics, a single blood test provides multiple features for a fixed cost. We can easily incorporate this feature-grouping requirement into our framework by redefining $u_{i}$. Suppose features $1,\dots,p$ are partitioned into various groups. Instead of defining $u_{i}$ to be the number of features tree $i$ uses, we now define

$u_{i}=\sum_{k\in T_{i}}c_{k},$

where $c_{k}$ is the cost of feature group $k$ and $T_{i}$ is the set of feature groups tree $i$ uses. Our original formulation corresponds to the case where all groups are singletons.

3.3.2. Non-homogeneous feature costs

In some problems the cost of obtaining one feature may be very different from the cost of obtaining another. We can easily accommodate non-homogeneous feature costs in our framework as well. Instead of defining $u_{i}$ to be the number of features used by tree $t_{i}$, we define

$u_{i}=\sum_{j\in T_{i}}c_{j},$

where $c_{j}$ is the cost of feature $j$ and $T_{i}$ is the set of features tree $i$ uses. Our original formulation corresponds to the case where all $c_{j}=1$.
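Both variants change only how $u_{i}$ is computed before solving Problem (2). A small sketch, assuming $G$ has been built as in Section 3.1 (the group map and cost vectors here are illustrative placeholders):

import numpy as np

def tree_costs(G, feature_cost=None, feature_group=None, group_cost=None):
    """Compute u_i for each tree (column of G) under per-feature or per-group costs."""
    p, n = G.shape
    u = np.zeros(n)
    for i in range(n):
        used = np.flatnonzero(G[:, i])  # features tree i splits on
        if feature_group is not None:  # grouped features: pay once per group used
            groups_used = {feature_group[j] for j in used}
            u[i] = sum(group_cost[g] for g in groups_used)
        elif feature_cost is not None:  # non-homogeneous per-feature costs
            u[i] = sum(feature_cost[j] for j in used)
        else:  # original formulation: every feature costs 1
            u[i] = len(used)
    return u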

3.3.3. Sketching

The optimization step of ControlBurn scales linearly with the number of rows in our data. We can sketch $X$ to improve the runtime. Let $S\in\mathbb{R}^{s\times m}$ be the sampled (e.g., Gaussian) sketching matrix. Our optimization problem becomes:

(3)   $\begin{array}{ll}\mbox{minimize} & \frac{1}{s}\left\|Sy-SAw\right\|_{2}^{2}+\lambda\sum_{i=1}^{n}u_{i}w_{i}\\ \mbox{subject to} & w\geq 0,\end{array}$

A relatively small $s$, around $O(\log(m))$, can provide a good solution (pilanci2015randomized, ).
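A minimal sketch of Problem (3) with square loss: draw a Gaussian sketching matrix $S$ and solve the compressed problem. The helper name and the choice of $s$ as a small multiple of $\log(m)$ are our own assumptions, following the guidance above.

import numpy as np
import cvxpy as cp

def solve_sketched(A, y, u, lam, s=None, seed=0):
    """Solve Problem (3): replace y and A with Sy and SA for a Gaussian sketch S with s rows."""
    m, n = A.shape
    s = s or max(10, int(10 * np.log(m)))  # on the order of log(m)
    rng = np.random.default_rng(seed)
    S = rng.normal(scale=1.0 / np.sqrt(s), size=(s, m))  # Gaussian sketching matrix
    w = cp.Variable(n, nonneg=True)
    objective = cp.sum_squares(S @ y - (S @ A) @ w) / s + lam * cp.sum(cp.multiply(u, w))
    cp.Problem(cp.Minimize(objective)).solve()
    return w.value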

4. Experimental setup

We will evaluate how well ControlBurn selects features for binary classification problems.

4.1. Data

We report results on three types of datasets.

4.1.1. Semi-synthetic

To evaluate how well ControlBurn handles correlation bias, we synthetically add correlated features to existing data by duplicating columns and adding Gaussian white noise to each replicated column. This procedure experimentally controls the degree of feature correlation in the data.
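A sketch of this duplication procedure (the noise scale $\sigma=0.1$ matches Section 5.1; the helper and its column naming are illustrative, not the code used for the experiments):

import numpy as np
import pandas as pd

def add_correlated_copies(df, columns, n_copies, sigma=0.1, seed=0):
    """Duplicate the given columns and add Gaussian white noise to each replicated column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in columns:
        for c in range(n_copies):
            out[f"{col}_copy{c}"] = df[col] + rng.normal(scale=sigma, size=len(df))
    return out

# e.g., duplicating the 3 most influential features 3 times each adds 9 correlated columns:
# df_noisy = add_correlated_copies(df, top3_features, n_copies=3)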

4.1.2. Benchmark

We consider a wide range of benchmark datasets selected from the Penn Machine Learning Benchmark Suite (olson2017pmlb, ). This curated suite aggregates datasets from well-known sources such as the UCI Machine Learning Repository (Dua:2019, ). We select all datasets in this suite with a binary target, 43 datasets in total.

4.1.3. Case studies

Finally, we select several real-world datasets known to contain correlated features, based on machine learning threads on Stack Overflow and Reddit. We select the Audit dataset, where the risk scores are collinear, and the Adult Income dataset, where age, gender, and marital status form a group of correlated features (Dua:2019, ). We also evaluate ControlBurn on the DNA microarray data used in West et al. (2001) (West:2001ft, ), since gene expression levels are highly correlated.

4.2. ControlBurn for experiments

We first generate a forest of classification trees using incremental depth bag-boosting. The algorithm requires no hyperparameters and uses out-of-bag error to determine how many trees to grow. We use logistic loss as LL in optimization problem 2. After solving the optimization problem, we eliminate all features with weight 0 and refit a random forest classifier on the remaining features. We evaluate the test ROC-AUC of this final model.

Input: desired sparsity $k$, maximum depth $d^{\max}$, initial regularization parameter $\lambda$, feature matrix $X$, labels $y$
Build forest $F$ with incremental depth bag-boosting (Algorithm 2)
repeat
      Use bisection (§4.2.1) to set $\lambda$
      Solve Problem 2 using $\lambda$ to prune $F$
      Form sub-forest $F^{*}$ from the trees with non-zero weights
until desired sparsity of $F^{*}$ is reached
Output: Forest $F^{*}$
Algorithm 3 ControlBurn.

4.2.1. Bisection

The regularization parameter $\lambda$ in Problem 2 controls the number of features selected by ControlBurn, $k$. We use bisection to find the value of $\lambda$ that yields the desired number of selected features $k$. Our goal in feature selection is to produce an interpretable model, so we mandate that $k\leq k^{\max}:=10$ in our experiments. Tree ensembles with more features can be difficult to interpret.

In fact, we can use bisection with ControlBurn to generate feature sets of any sparsity between 0 and $k^{\max}$. Start with a value of $\lambda$ large enough to guarantee $k=0$. Let $k^{\prime}$ denote the number of features we want ControlBurn to select in the following iteration and initialize $k^{\prime}=1$. Bisect $\lambda\leftarrow\lambda/2$ and solve Problem 2 to obtain $k$. If $k<k^{\prime}$, update $\lambda\leftarrow\lambda/2$; otherwise, if $k>k^{\prime}$, update $\lambda\leftarrow\lambda+\lambda/2$. If $k=k^{\prime}$, update $k^{\prime}\leftarrow k^{\prime}+1$. Repeat until $k^{\prime}>k^{\max}$.
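This search can be written as a short loop. The sketch below is our rendering of the rule above; solve_for_k is a hypothetical stand-in that solves Problem 2 at a given $\lambda$ and returns the number of selected features, and no safeguard against oscillation is included.

def lambda_sweep(solve_for_k, lam0, k_max=10):
    """Sweep lambda to hit every sparsity level 1..k_max (sketch of the rule in Sec. 4.2.1).

    lam0 should be large enough that no features are selected (k = 0).
    """
    lam, k_target = lam0 / 2, 1  # first bisection: lambda <- lambda / 2
    schedule = {}  # sparsity level -> a lambda that achieves it
    while k_target <= k_max:
        k = solve_for_k(lam)
        if k < k_target:
            lam = lam / 2  # too few features selected: shrink lambda
        elif k > k_target:
            lam = lam + lam / 2  # too many features selected: grow lambda
        else:
            schedule[k_target] = lam  # hit the target sparsity; move to the next level
            k_target += 1
    return schedule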

4.3. Baseline feature selection

We evaluate ControlBurn against the following baseline, an embedded feature selection method we call the random forest baseline. Given feature matrix $X$ and response $y$, fit a random forest classifier and select the top $k$ features ranked by MDI feature importance scores. This feature selection algorithm is embedded since MDI feature importance scores are computed during training.
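In scikit-learn terms, the baseline amounts to the following sketch (the function name and forest size are our assumptions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def random_forest_baseline(X, y, k, n_estimators=100, seed=0):
    """Fit a random forest and return the indices of the top-k features by MDI importance."""
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed).fit(X, y)
    return np.argsort(rf.feature_importances_)[::-1][:k]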

We fit a random forest classifier on the features selected by ControlBurn and on the features selected by our random forest baseline and compare the difference in test performance. We use 5-fold cross-validation to estimate test error.

Our results in Section 5 omit comparisons between ControlBurn and filter-based and wrapper-based feature selection algorithms. We comment on comparisons here.

Experimentally, ControlBurn consistently outperforms filter-based algorithms across all datasets.

Wrapper-based algorithms are computationally expensive compared to ControlBurn since the model must be retrained each time a new subset of features is evaluated. For example, selecting $k$ features using Recursive Feature Elimination (RFE) from a dataset with $p$ features requires $p-k$ training iterations. In Best Subset Selection, the number of iterations required scales exponentially with $p$. In contrast, like all embedded feature selection algorithms, ControlBurn only requires a single training iteration to select $k$ features. To illustrate this advantage, we subsample our PMLB datasets to 1000 rows and compare computation time between ControlBurn and RFE. Figure 7 demonstrates that the runtime of RFE increases drastically with $p$ compared to ControlBurn.


Figure 7. Computation time of RFE vs. ControlBurn as the number of features increases.

4.4. Implementation

We use scikit-learn (scikit-learn, ) to implement all random forest and decision tree classifiers in our experiment, and we use CVXPY (diamond2016cvxpy, ) and MOSEK (mosek, ) to formulate and solve optimization problems.

The code and data to reproduce our experiments can be found in this git repository.

5. Results

ControlBurn performs better than our random forest baseline on datasets that contain correlated features. In Figure 10, we plot the performance of our selected models against the number of features selected by our two algorithms. On datasets with correlated features, such as the Cleveland Heart Disease dataset, ControlBurn performs better than the random forest baseline. On datasets where all of the features are independent, such as Breiman’s synthetic TwoNorm dataset, there is no difference in performance.

In the following sections, we first discuss how ControlBurn performs when correlated features are added to semi-synthetic data. We then examine our results across our survey of PMLB datasets. Finally, we present several case studies that show the benefit of using ControlBurn on real world data.

5.1. Semi-synthetic dataset

To create our semi-synthetic dataset, we start with the Chess dataset from the UCI Machine Learning Repository (Dua:2019, ). This dataset contains very few correlated features, since all features correspond to chessboard positions. We observe in Figure 8 that ControlBurn performs nearly identically to our random forest baseline on the original data. We take the 3 most influential features, determined by random forest MDI importance, and create noisy copies by adding Gaussian noise with $\sigma=0.1$. We repeat this process, duplicating the influential features 3, 5, and 7 times, which adds 9, 15, and 21 highly correlated features. Figure 8 shows that ControlBurn performs substantially better than the random forest baseline as the number of correlated features increases: MDI feature importance is diluted amongst the groups of correlated features, depressing the “importance” rank of any feature in the correlated group. This semi-synthetic example illustrates that ControlBurn can effectively select features even when groups of strongly correlated features exist.

Figure 8. ControlBurn performs well even with many highly correlated synthetic features.

5.2. PMLB suite

We evaluate ControlBurn on all 43 binary classification datasets in the PMLB suite. On each dataset, we select $k$ features using ControlBurn and our random forest baseline and record the difference in ROC-AUC between the random forest classifiers fit on the selected features. We average these differences across sparsity levels $k\in\{1,\ldots,10\}$ and present the distribution in Figure 9.

In this histogram, we see that ControlBurn performs similarly to the random forest baseline for most of these datasets: most of the distribution is supported on $[-0.01,0.01]$, since most of the datasets in PMLB do not contain correlated features. Still, ControlBurn performs better than our random forest baseline on average: the mean of the distribution is positive, and the distribution is right-skewed. Most of the datasets along the right tail, such as Musk, German, and Sonar, contain correlated features.

Figure 9. Differences in performance between ControlBurn and our random forest baseline.

5.3. Case studies


Figure 10. The performance of ControlBurn and our random forest baseline plotted against the number of features selected. ControlBurn always matches or exceeds the performance of the baseline (up to statistical error).

5.3.1. Adult income dataset

We use ControlBurn to build a classification model to predict whether an adult’s annual income is above $50k. The Adult Income plot in Figure 10 shows how well ControlBurn performs compared to our random forest baseline. We see that ControlBurn performs substantially better than our baseline when $k=3$ features are selected. ControlBurn selects “EducationNum”, “CapitalGain”, and “MaritalStatus_Married-civ-spouse”, while our random forest baseline selects “fnlwgt”, “Age”, and “CapitalGain”.


Figure 11. Adult Income Dataset: Top 20 features sorted by MDI importance. The top 3 features selected by ControlBurn are in bold.

Figure 11 shows a correlation heat map of the top 20 features in the Adult Income dataset ranked by MDI importance. The top 3 features selected by ControlBurn are in bold. From this plot, it is apparent that the MDI feature importance of “MaritalStatus_Married-civ-spouse” is diluted because the feature is correlated with many other influential features. ControlBurn is able to select this feature while our random forest baseline ignores it.

It is also interesting to note that the feature “fnlwgt” is uninformative: it represents the population size of a row and is unrelated to income. This feature is given high importance by our random forest baseline but ignored by ControlBurn. We discuss this phenomenon further in Section 6.1.

5.3.2. Audit dataset

The Audit dataset demonstrates that highly correlated and even duplicated features sometimes appear in real-world data. Many of the risk scores in this dataset are highly correlated, and several scores are linear transformations of other scores. We use ControlBurn to build a classification model to predict whether a firm is fraudulent based on its risk scores. Figure 10 shows that ControlBurn outperforms our baseline for every sparsity level $k$.

5.3.3. Microarray gene expression

In this experiment, we use ControlBurn to determine which gene expressions are most influential for predicting the hormone receptor status of breast cancer tumors. Our data contains 49 tumor samples and over 7000 distinct gene expressions. We formulate this task as a binary classification problem to predict receptor status using gene expressions as features. The Gene Expression plot in Figure 10 shows the difference in accuracy between a model selected using ControlBurn and our random forest baseline. ControlBurn outperforms our baseline for every sparsity $k$. These preliminary findings suggest that ControlBurn is effective at finding a small number of genes with important effects on tumor growth. We discuss implications and extensions appropriate for gene expression data further in Section 6.2.

6. Discussion and future work


Figure 12. Rank of uninformative continuous feature vs. the proportion of features selected by ControlBurn.

6.1. Bias from uninformative continuous features

Breiman et al. (1984) (brieman1984classification, ) and Strobl et al. (2007) (strobl2007bias, ) note that MDI feature importance is biased towards features with more potential splits. The feature importance of continuous features, which can be split at multiple values, is inflated compared to binary features.

We attempt to quantify whether ControlBurn suffers from this bias due to uninformative continuous features. We replicate the experiment conducted by Zhou and Hooker (2019) (zhou2019unbiased, ) by adding a random continuous feature to a classification dataset. We use ControlBurn to select a relevant subset of features and compute the feature rank, via MDI importance, of the uninformative feature in the model. Figure 12 shows that when the full feature set is selected, the uninformative feature often appears in the top feature rankings due to this bias. As we increase $\lambda$ and select fewer features, the uninformative feature is demoted. This experiment demonstrates that ControlBurn mitigates bias from uninformative continuous features. It would be interesting to compare ControlBurn against other feature importance algorithms robust to this bias, such as Unbiased Feature Importance (zhou2019unbiased, ).

6.2. Limits of LASSO: big pp, small mm

Zou and Hastie (2005) (zou2005regularization, ) note a limitation of LASSO: the number of features selected by LASSO is bounded by the number of rows $m$. When ControlBurn is used for a problem with more features than rows ($p\gg m$), the number of nonzero entries in $w^{\star}$ is bounded by $m$, so at most $m$ trees can be included in the selected model. If all trees are grown to full depth, the selected model can use at most $m^{2}$ features. In our gene expression example, $m=49$ while $p>7000$. The upper bound on the number of features that can be selected via ControlBurn is much higher than the corresponding upper bound for LASSO regression; however, ControlBurn may still be unable to select the entire feature set if $p>m^{2}$. ControlBurn also selects a single feature from a group of correlated features. This makes ControlBurn robust to correlation bias but ineffective for gene selection, where it is desirable to include entire groups of correlated gene expressions.

We propose extending ControlBurn by adding an $\ell_{2}$ regularization term to our optimization problem, just as the elastic net extends LASSO regression in the linear setting. Group elastic nets for gene selection have been explored in Münch et al. (2018) (munch2018adaptive, ); using this approach to select tree ensembles presents an interesting direction for future research.

7. Conclusion

In this work, we develop ControlBurn, a weighted LASSO-based feature selection algorithm for tree ensembles. Our algorithm is robust to correlation bias; when presented with a group of correlated features, ControlBurn will assign all importance to a single feature to represent the group. This improves the interpretability of the selected model, as a sparse subset of features can be recovered for further analysis. ControlBurn is flexible and can incorporate various feature groupings and costs. In addition, the algorithm is computationally efficient and scalable.

Our results indicate that ControlBurn performs better than traditional embedded feature selection algorithms across a wide variety of datasets. We also found our algorithm to be robust to bias from uninformative continuous features. ControlBurn is an effective and useful algorithm for selecting important features from correlated data.

References

  • (1) Ambusaidi, M. A., He, X., Nanda, P., and Tan, Z. Building an intrusion detection system using a filter-based feature selection algorithm. IEEE transactions on computers 65, 10 (2016), 2986–2998.
  • (2) ApS, M. MOSEK Fusion API for Python 9.2.37, 2021.
  • (3) Bertsimas, D., King, A., Mazumder, R., et al. Best subset selection via a modern optimization lens. Annals of statistics 44, 2 (2016), 813–852.
  • (4) Blum, A. L., and Langley, P. Selection of relevant features and examples in machine learning. Artificial intelligence 97, 1-2 (1997), 245–271.
  • (5) Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and regression trees. Wadsworth, Inc., Monterey, California, USA (1984).
  • (6) Chandrashekar, G., and Sahin, F. A survey on feature selection methods. Computers & Electrical Engineering 40, 1 (2014), 16–28.
  • (7) Chen, T., and Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (2016), pp. 785–794.
  • (8) Chzhen, E., Hebiri, M., Salmon, J., et al. On lasso refitting strategies. Bernoulli 25, 4A (2019), 3175–3200.
  • (9) Cortes, C., and Vapnik, V. Support-vector networks. Machine learning 20, 3 (1995), 273–297.
  • (10) Diamond, S., and Boyd, S. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research 17, 83 (2016), 1–5.
  • (11) Dua, D., and Graff, C. UCI machine learning repository, 2017.
  • (12) Freund, Y., Schapire, R., and Abe, N. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence 14, 771-780 (1999), 1612.
  • (13) Friedman, J., Hastie, T., and Tibshirani, R. The elements of statistical learning, vol. 1. Springer series in statistics New York, 2001.
  • (14) Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of statistics (2001), 1189–1232.
  • (15) Howell, N. Demography of the Dobe !Kung, 2nd edn. New Brunswick, 2000.
  • (16) Langley, P., et al. Selection of relevant features in machine learning. In Proceedings of the AAAI Fall symposium on relevance (1994), vol. 184, Citeseer, pp. 245–271.
  • (17) Lee, I.-H., Lushington, G. H., and Visvanathan, M. A filter-based feature selection approach for identifying potential biomarkers for lung cancer. Journal of clinical Bioinformatics 1, 1 (2011), 11.
  • (18) Lee, S.-I., Lee, H., Abbeel, P., and Ng, A. Y. Efficient L1 regularized logistic regression. In AAAI (2006), vol. 6, pp. 401–408.
  • (19) Lou, Y., Caruana, R., Gehrke, J., and Hooker, G. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (2013), pp. 623–631.
  • (20) Lundberg, S., and Lee, S.-I. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874 (2017).
  • (21) Mafarja, M., and Mirjalili, S. Whale optimization approaches for wrapper feature selection. Applied Soft Computing 62 (2018), 441–453.
  • (22) Maldonado, S., and Weber, R. A wrapper method for feature selection using support vector machines. Information Sciences 179, 13 (2009), 2208–2217.
  • (23) Münch, M. M., Peeters, C. F., Van Der Vaart, A. W., and Van De Wiel, M. A. Adaptive group-regularized logistic elastic net regression. Biostatistics (2018).
  • (24) Nori, H., Jenkins, S., Koch, P., and Caruana, R. Interpretml: A unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223 (2019).
  • (25) Olson, R. S., La Cava, W., Orzechowski, P., Urbanowicz, R. J., and Moore, J. H. Pmlb: a large benchmark suite for machine learning evaluation and comparison. BioData mining 10, 1 (2017), 1–13.
  • (26) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
  • (27) Pilanci, M., and Wainwright, M. J. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory 61, 9 (2015), 5096–5115.
  • (28) Ribeiro, M. T., Singh, S., and Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (2016), pp. 1135–1144.
  • (29) Scornet, E. Trees, forests, and impurity-based variable importance, 2020.
  • (30) Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics 8, 1 (2007), 25.
  • (31) Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 1 (1996), 267–288.
  • (32) West, M. M., Blanchette, C. C., Dressman, H. H., Huang, E. E., Ishida, S. S., Spang, R. R., Zuzan, H. H., Olson, J. A. J., Marks, J. R. J., and Nevins, J. R. J. Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America 98, 20 (Sept. 2001), 11462–11467.
  • (33) Xu, H., Caramanis, C., and Mannor, S. Sparse algorithms are not stable: A no-free-lunch theorem. IEEE transactions on pattern analysis and machine intelligence 34, 1 (2011), 187–193.
  • (34) Zhou, Z., and Hooker, G. Unbiased measurement of feature importance in tree-based methods. arXiv preprint arXiv:1903.05179 (2019).
  • (35) Zhou, Z., Hooker, G., and Wang, F. S-lime: Stabilized-lime for model explanation. In Proceedings of the 27th ACM SIGKDD international conference on knowledge discovery and data mining (2021).
  • (36) Zou, H., and Hastie, T. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology) 67, 2 (2005), 301–320.

Appendix A Benchmark Datasets

dataset Rows Columns
analcatdata_cyyoung8092 97 10
australian 690 14
biomed 209 8
breast_cancer_wisconsin 569 30
churn 5000 20
cleve 303 13
colic 368 22
diabetes 768 8
glass2 163 9
Hill_Valley_without_noise 1212 100
horse_colic 368 22
ionosphere 351 34
ring 7400 20
spambase 4601 57
tokyo1 959 44
Table 1. Sample of PMLB benchmark datasets with dimensions.