
Variable Selection Using Bayesian Additive Regression Trees

Chuji Luo (Data Scientist, Google LLC, Mountain View, California 94043, USA; email: cjluo@ufl.edu) and Michael J. Daniels (Professor and Chair, Department of Statistics, University of Florida, Gainesville, Florida 32611, USA; email: daniels@ufl.edu)
Abstract

Variable selection is an important statistical problem. This problem becomes more challenging when the candidate predictors are of mixed type (e.g. continuous and binary) and impact the response variable in nonlinear and/or non-additive ways. In this paper, we review existing variable selection approaches for the Bayesian additive regression trees (BART) model, a nonparametric regression model, which is flexible enough to capture the interactions between predictors and nonlinear relationships with the response. An emphasis of this review is on the capability of identifying relevant predictors. We also propose two variable importance measures which can be used in a permutation-based variable selection approach, and a backward variable selection procedure for BART. We present simulations demonstrating that our approaches exhibit improved performance in terms of the ability to recover all the relevant predictors in a variety of data settings, compared to existing BART-based variable selection methods.

Keywords and phrases: Variable selection, BART, nonparametric regression.

1 Introduction

Variable selection, also known as feature selection in machine learning, is the process of selecting a subset of relevant variables for use in model construction. It has been, and continues to be, a major focus of research and practice because researchers and practitioners often seek low-cost, interpretable models that do not overfit. For example, in causal inference, model-based estimation of the effects of an exposure on an outcome is generally sensitive to the choice of confounders included in the model. If there are no unmeasured confounders, the full model with all the confounders and many non-confounders generally yields an unbiased estimate, but with a large standard error ([28]). In contrast, a model with fewer non-confounders not only produces an unbiased estimate with a smaller standard error, but also saves computation time.

Variable selection is often carried out in parametric settings, such as the linear regression model ([10, 15, 26, 30, 7, 4]). However, variable selection approaches based on a linear model often fail when the underlying relationship between the predictors and the response variable is nonlinear and/or non-additive, and it is generally challenging to extend them to nonparametric models, such as tree-based models which incorporate both main effects and interaction effects of varying orders.

Tree-based models have been developed from both the frequentist and Bayesian perspectives, including but not limited to random forests ([6]), stochastic gradient boosting ([13]), reinforcement learning trees ([29]) and Bayesian additive regression trees ([9]), the last of which is the focus of this paper. Variable selection for the tree-based models in machine learning is often achieved through variable importance. Taking random forests as an example, there are two types of variable importance measures typically used: Gini importance ([12]) and permutation importance ([6]). Gini importance evaluates the importance of a variable by adding up the weighted decrease in impurity of all the nodes using the variable as a split variable, averaged over all the trees, while permutation importance evaluates the importance of a variable by adding up the difference in out-of-bag error before and after the permutation of the values of the variable in the training data, also averaged over all the trees. Though widely used, as pointed out by [25], both Gini importance and permutation importance are biased in favor of continuous and high cardinality predictors when the random forest is applied to data with mixed-type predictors. Methods such as growing unbiased trees ([25]) and partial permutations ([2]) can be used to solve this issue.

Compared to the tree-based models in machine learning, Bayesian additive regression trees (BART) not only possesses competitive predictive performance, but also enables the user to make inference on it due to its fully Bayesian construction. The original BART paper suggests using variable inclusion proportions to measure the importance of predictors, and later [5] propose a principled permutation-based inferential approach to determine how large the variable inclusion proportion has to be in order to select a predictor. This approach exhibits superior performance in a variety of data settings, compared to many existing variable selection procedures including random forests with permutation importance. In fact, we note that the variable inclusion proportion produced by BART is a special case of Gini importance, which treats the weight and the decrease in impurity at each node as constant. We also note that the approach of [5] is reminiscent of the partial permutation approach of [2], which also repeatedly permutes the response vector to estimate the distribution of the variable importance for each predictor in a non-informative setting where the relationship between the response and predictors is removed and the dependencies among the predictors are maintained. Although the approach of [2] is designed to mitigate the bias of Gini importance, we find that the approach of [5] is still biased against predictors with low cardinality, such as binary predictors, especially when the number of predictors is not large. In addition to the approach of [5], two variable selection approaches have been proposed based on variants of the BART model. One variant of BART is DART ([19]), which modifies BART by placing a Dirichlet hyper-prior on the splitting proportions of the regression tree prior to encourage sparsity. To conduct variable selection, they suggest selecting predictors in the median probability model ([3]), i.e., selecting predictors with a marginal posterior variable inclusion probability of at least 50%. This approach is more computationally efficient than other BART-based variable selection methods, as it does not require fitting the model multiple times. In general, we find that DART works better than other BART-based variable selection methods, but it becomes less stable in the presence of correlated predictors and in the probit BART model. Another variant of the BART prior is the spike-and-forest prior ([23]), which wraps the BART prior with a spike-and-slab prior on the model space. [20] provide model selection consistency results for the spike-and-forest prior and propose approximate Bayesian computation (ABC) Bayesian forest, a modified ABC sampling method, to conduct variable selection in practice. Similar to DART, variables are selected based on their marginal posterior variable inclusion probabilities. ABC Bayesian forest shows good performance in terms of excluding irrelevant predictors, but its ability to include relevant predictors is relatively poor when predictors are correlated.

The main issues of existing BART-based variable selection methods include being biased against categorical predictors with fewer levels and being conservative in terms of including relevant predictors in the model. The goal of this paper is to develop variable selection approaches for BART to overcome these issues. For simplicity, we assume that the possible types of potential predictors only include binary and continuous types. This paper is organized as follows. In Section 2, we review the BART model and existing BART-based variable selection approaches. An example where the approach of [5] fails to include relevant binary predictors is also provided. In Section 3, we present the proposed variable selection approaches. In Section 4, we compare our approaches with existing BART-based variable selection approaches through simulated examples. Finally, we conclude this paper with a discussion in Section 5.

2 Review of BART

2.1 Model Specification

Motivated by the boosting algorithm and built on the Bayesian classification and regression tree algorithm ([8]), BART ([9]) is a Bayesian approach to nonparametric function estimation and inference using a sum of trees. Consider the problem of making inference about an unknown function f_{0} that predicts a continuous response y using a p-dimensional vector of predictors \boldsymbol{x}=(x_{1},\cdots,x_{p})^{\intercal} when

y=f_{0}(\boldsymbol{x})+\epsilon,\quad\epsilon\sim N(0,\sigma^{2}). \qquad (1)

BART models f_{0} by a sum of M Bayesian regression trees, i.e.,

f\left(\boldsymbol{x}\right)=\sum\limits_{m=1}^{M}g\left(\boldsymbol{x};T_{m},\boldsymbol{\mu}_{m}\right), \qquad (2)

where g(\boldsymbol{x};T_{m},\boldsymbol{\mu}_{m}) is the output of \boldsymbol{x} from a single regression tree. Each g(\boldsymbol{x};T_{m},\boldsymbol{\mu}_{m}) is determined by the binary tree structure T_{m}, consisting of a set of splitting rules and a set of terminal nodes, and the vector of parameters \boldsymbol{\mu}_{m}=(\mu_{m,1},\cdots,\mu_{m,b_{m}}) associated with the b_{m} terminal nodes of T_{m}, such that g(\boldsymbol{x};T_{m},\boldsymbol{\mu}_{m})=\mu_{m,l} if \boldsymbol{x} is associated with the l^{\text{th}} terminal node of the tree T_{m}.

A regularization prior consisting of three components, (1) the M independent tree structures \{T_{m}\}_{m=1}^{M}, (2) the parameters \{\boldsymbol{\mu}_{m}\}_{m=1}^{M} associated with the terminal nodes given the trees \{T_{m}\}_{m=1}^{M}, and (3) the error variance \sigma^{2}, which is assumed to be independent of the former two, is specified for BART in a hierarchical manner. The posterior distribution of BART is sampled through a Metropolis-within-Gibbs algorithm ([17, 14, 18]). The Gibbs sampler involves M successive draws of (T_{m},\boldsymbol{\mu}_{m}) followed by a draw of \sigma^{2}. A key feature of this sampler is that it employs Bayesian backfitting ([16]) to sample each pair (T_{m},\boldsymbol{\mu}_{m}).

BART estimates f_{0}(\boldsymbol{x})=E(Y\mid\boldsymbol{x}) by taking the average of the posterior samples of f(\boldsymbol{x}) after a burn-in period. As the tree structures are updated through the MCMC chain, BART searches the model space. In light of this, the estimate of f_{0}(\boldsymbol{x}) can be considered a model averaging estimate, with each posterior sample of the tree structures treated as a model, and therefore BART can be regarded as a Bayesian model selection approach.

The continuous BART model (1)-(2) can be extended to a binary regression model for a binary response Y by specifying

P\left(Y=1\mid\boldsymbol{x}\right)=\Phi\left[f\left(\boldsymbol{x}\right)\right], \qquad (3)

where f is the sum-of-trees function in (2) and \Phi is the cumulative distribution function of the standard normal distribution. Posterior sampling can be done by applying the Metropolis-within-Gibbs sampler of continuous BART to the latent variables obtained via the data augmentation approach of [1].

2.2 Existing BART-Based Variable Selection Methods

In this section, we review the BART variable inclusion proportion, a variable importance measure produced by BART, and three existing variable selection methods based on BART.

2.2.1 BART variable inclusion proportions

The MCMC algorithm of BART returns posterior samples of the tree structures, from which we can obtain the variable inclusion proportion for each predictor.

Definition 2.1 (BART Variable Inclusion Proportion).

Let K be the number of posterior samples obtained from a BART model. For each j=1,\cdots,p and k=1,\cdots,K, let c_{jk} be the number of splitting rules using the predictor x_{j} as the split variable in the k^{\text{th}} posterior sample of the tree structures, and let c_{\cdot k}=\sum_{j=1}^{p}c_{jk} be the total number of splitting rules in the k^{\text{th}} posterior sample. The BART variable inclusion proportion (VIP) for the predictor x_{j} is defined as

v_{j}=\frac{1}{K}\sum\limits_{k=1}^{K}\frac{c_{jk}}{c_{\cdot k}}.

The BART VIP v_{j} describes the average usage per splitting rule of the predictor x_{j}. Intuitively, a large VIP is suggestive of the predictor being an important driver of the response. Therefore, the original BART paper uses VIPs to rank predictors in terms of relative importance and recommends conducting variable selection using VIPs with a small number of trees in order to introduce competition among predictors.
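
As a concrete illustration, the VIPs can be computed directly from the posterior split counts. The following R sketch assumes the predictors and response are stored in X and y, and that the object returned by wbart in the CRAN package BART exposes a varcount matrix with one row per posterior sample and one column per predictor (holding the counts c_{jk}); if a different implementation is used, only the extraction of the counts changes.

library(BART)

## Fit BART with a small ensemble, as recommended for variable selection.
fit <- wbart(x.train = X, y.train = y, ntree = 20, ndpost = 1000, nskip = 1000)

counts <- fit$varcount                     # K x p matrix of split counts c_jk
vip <- colMeans(counts / rowSums(counts))  # v_j = (1/K) * sum_k c_jk / c_.k
sort(vip, decreasing = TRUE)               # rank predictors by VIP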

We note that BART VIP is a special case of Gini importance ([21]), which evaluates the importance of a variable by adding up the weighted decrease in impurity of all the nodes using the variable as a split variable, averaged over all the trees in the ensemble:

\text{IMP}_{\text{Gini}}\left(x_{j}\right)=\frac{1}{M}\sum_{m=1}^{M}\sum_{\eta\in\phi_{m}}\mathbb{I}\left(j_{\eta}=j\right)\left[w(\eta)\Delta i(\eta)\right],

where \phi_{m} is the set of all the splitting nodes in the m^{\text{th}} tree, j_{\eta} is the index of the variable used for the splitting node \eta, w(\eta) is the weight of the node \eta and \Delta i(\eta) is the decrease in impurity of the node \eta. In random forests, the weight w(\eta) is typically the proportion of the samples reaching the node \eta and the impurity i(\eta) is typically the variance of the observed response values of the samples reaching the node \eta. BART VIP can be regarded as Gini importance with the weight w(\eta) set to 1/c_{\cdot k} and the impurity i(\eta) set to 1 for all the nodes.

Gini importance is known to be biased against variables with low cardinality in random forests ([25]). As a variant of Gini importance, BART VIP is also a biased variable importance measure. To illustrate this, we consider the following example.

Example 1.

For each i=1,\cdots,500, sample x_{i1},\cdots,x_{i10} from Bernoulli(0.5) independently, x_{i11},\cdots,x_{i20} from Uniform(0,1) independently, and y_{i} from \mathrm{Normal}(f_{0}(\boldsymbol{x}_{i}),1) independently, where

f_{0}(\boldsymbol{x})=10\sin(\pi x_{1}x_{11})+20(x_{13}-0.5)^{2}+10x_{2}+5x_{12}.

This example is a variant of Friedman's example ([11]), where the predictors impact the response in a nonlinear and non-additive way. We modify Friedman's example by changing the predictors x_{1} and x_{2} from continuous to binary. We fit a BART model on the observations \{y_{i},\boldsymbol{x}_{i}\}_{i=1}^{500} with M=20 trees and obtain the VIPs based on 1000 posterior samples after burning in 1000 posterior samples. Since all the predictors are on the same scale [0,1], the binary predictor x_{1} contributes to the fit of the response y as much as the continuous predictor x_{11}, and the binary predictor x_{2} shifts the response y more than the continuous predictor x_{12}. However, Figure 1 shows that x_{11} and x_{12} have higher VIPs than x_{1} and x_{2}, respectively, suggesting that BART VIP is biased against binary predictors. This issue occurs because a binary predictor has only one available splitting value while a continuous predictor has many more splitting values. As a result, continuous predictors appear more often in the ensemble.
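
For reference, the data for Example 1 can be generated with a few lines of R; the seed and object names below are illustrative choices, not part of the example's definition.

set.seed(1)
n <- 500
X <- cbind(matrix(rbinom(n * 10, 1, 0.5), n, 10),  # x1,...,x10: binary
           matrix(runif(n * 10), n, 10))            # x11,...,x20: continuous
f0 <- 10 * sin(pi * X[, 1] * X[, 11]) + 20 * (X[, 13] - 0.5)^2 +
      10 * X[, 2] + 5 * X[, 12]
y <- rnorm(n, mean = f0, sd = 1)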

Figure 1: BART VIPs for different predictors in Example 1. Red dots are for continuous predictors and green dots are for binary predictors.

2.2.2 Permutation-based variable selection with BART VIP

Apart from the bias issue of BART VIP, the original BART paper does not provide a complete variable selection approach in the sense that it does not provide a way of setting the threshold for BART VIPs to select predictors. In fact, [5] show that it is not possible to decide on an appropriate threshold for VIPs based on a single BART fit, because BART may fit noise due to its nonparametric flexibility. To deal with this, they develop a principled permutation-based variable selection approach which facilitates determining how large a BART VIP has to be in order to select a predictor.

Specifically, this approach first creates L permutations \{\boldsymbol{y}_{l}^{*}\}_{l=1}^{L} of the response vector \boldsymbol{y} to form L null datasets \{\boldsymbol{y}_{l}^{*},X\}_{l=1}^{L}, where X is a matrix of the observed predictors with each row corresponding to an observation, and then fits a BART model on each null dataset (\boldsymbol{y}_{l}^{*},X). As a result, L vectors of BART VIPs \boldsymbol{v}_{l}^{*}=(v_{l,1}^{*},\cdots,v_{l,p}^{*}), l=1,\cdots,L, can be obtained from the L BART models. Since the null datasets remove the relationship between the response and the predictors, the L-dimensional vector \{v_{1,j}^{*},\cdots,v_{L,j}^{*}\} can be regarded as a sample from the null distribution of the BART VIP for the predictor x_{j}, which is the distribution of the BART VIP when the predictor x_{j} is unrelated to the response y. If the predictor x_{j} is indeed not related to the response y, then the true BART VIP obtained from the BART model on the original dataset (\boldsymbol{y},X) follows the null distribution. Hence, predictors can be selected according to the relative location of their true BART VIPs in the corresponding null distributions. Since BART may fit noise, the true VIPs are estimated by the averaged VIPs \bar{\boldsymbol{v}}=(\bar{v}_{1},\cdots,\bar{v}_{p}), which is the mean of multiple vectors of VIPs, each obtained from a replication of BART on the original dataset (\boldsymbol{y},X). [5] propose three criteria to select a subset of predictors. The least stringent one is the local threshold, i.e., the predictor x_{j} is selected if the averaged VIP \bar{v}_{j} exceeds the 1-\alpha quantile of the corresponding empirical distribution \sum_{l=1}^{L}\delta_{v_{l,j}^{*}}(\cdot) of the null distribution, where \delta_{v_{l,j}^{*}}(\cdot) is a degenerate distribution at v_{l,j}^{*}. The algorithm for a general permutation-based variable selection approach, not restricted to the one using BART VIP as the variable importance measure, is summarized in Algorithm 1.

0:  X: predictors; \boldsymbol{y}: response; L: number of permutations; L_{\text{rep}}: number of repetitions; \alpha: selection threshold
0:  A subset of predictors
1:  for s=1,\cdots,L_{\text{rep}} do
2:     Run the model on the original dataset (\boldsymbol{y},X) and compute the variable importance scores (v_{s,1},\cdots,v_{s,p})
3:  end for
4:  Compute the averaged variable importance scores \bar{v}_{j}=(\sum_{s=1}^{L_{\text{rep}}}v_{s,j})/L_{\text{rep}} for j=1,\cdots,p
5:  for l=1,\cdots,L do
6:     Run the model on the null dataset (\boldsymbol{y}^{*}_{l},X) and compute the variable importance scores (v_{l,1}^{*},\cdots,v_{l,p}^{*})
7:  end for
8:  for j=1,\cdots,p do
9:     Compute the 1-\alpha quantile v_{j}^{\alpha} of the empirical distribution \sum_{l=1}^{L}\delta_{v^{*}_{l,j}}(\cdot) for the predictor x_{j}
10:     if \bar{v}_{j}>v_{j}^{\alpha} then
11:        Select x_{j}
12:     end if
13:  end for

Algorithm 1 Permutation-based variable selection approach
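
A compact R sketch of Algorithm 1 is given below. It takes the variable importance measure as a user-supplied function var_imp(X, y) returning a length-p vector (for instance, the VIPs computed from a BART fit as in Section 2.2.1); the function name and arguments are our own illustrative choices, not part of the original algorithm.

permute_select <- function(X, y, var_imp, L = 100, L_rep = 10, alpha = 0.05) {
  # Averaged importance scores over L_rep fits on the original data (line 4)
  v_bar <- rowMeans(replicate(L_rep, var_imp(X, y)))
  # Importance scores from L fits on null datasets with permuted responses (lines 5-7)
  v_null <- replicate(L, var_imp(X, sample(y)))              # p x L matrix
  # Local threshold: the 1 - alpha quantile of each predictor's null scores (line 9)
  thresholds <- apply(v_null, 1, quantile, probs = 1 - alpha)
  which(v_bar > thresholds)                                   # selected predictors
}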

This approach works well when all the predictors are continuous; however, it often fails to include relevant binary predictors when the predictors belong to different types and the number of predictors is small. Consider the following variant of Friedman’s example.

Example 2.

For each i=1,\cdots,500, sample x_{i1},\cdots,x_{i10} from Bernoulli(0.5) independently, x_{i11},\cdots,x_{i20} from Uniform(0,1) independently, and y_{i} from \mathrm{Normal}(f_{0}(\boldsymbol{x}_{i}),1) independently, where

f_{0}(\boldsymbol{x})=10\sin(\pi x_{11}x_{12})+20(x_{13}-0.5)^{2}+10x_{1}+5x_{2}.

Figure S.1 of the Supplementary Material shows that neither of the two relevant binary predictors, x_{1} and x_{2}, is selected by this method. Further analysis is provided in Section 3.1.

2.2.3 Variable selection with DART

DART ([19]) is a variant of BART, which replaces the discrete uniform distribution for selecting a split variable with a categorical distribution of which the event probabilities (s1,,sp)(s_{1},\cdots,s_{p}) follow a Dirichlet distribution. With the Dirichlet hyper-prior, probability mass is accumulated towards the predictors that are more frequently used along the MCMC iterations, thereby inducing sparsity and improving the predictive performance in high dimensional settings.

As discussed in Section 2.1, BART can be considered as a Bayesian model selection approach, so predictors can be selected by using the marginal posterior variable inclusion probability (MPVIP)

\pi_{j}=P(x_{j}\text{ in the model}\mid\boldsymbol{y}). \qquad (4)

However, the model space explored by BART consists not only of all the possible combinations of the predictors, but also of all the possible relationships that the sum-of-trees model can express using the p predictors. This implies that the cardinality of the model space is much bigger than 2^{p} and that the variable selection results can be susceptible to the influence of noise predictors, especially when the number of predictors is large. DART avoids the influence of noise predictors by employing the shrinkage Dirichlet hyper-prior. DART estimates the MPVIP for a predictor by the proportion of the posterior samples of the tree structures in which the predictor is used as a split variable at least once, and selects predictors with MPVIP at least 0.5, yielding a median probability model. [3] show that the median probability model, rather than the most probable model, is the optimal choice in the sense that its predictions for future observations are closest to the Bayesian model averaging prediction in the squared error sense. However, the assumptions for this property are quite strong and often not realistic, including the assumption of an orthogonal predictor matrix.
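
The DART-based selection can be carried out with a single fit. The sketch below assumes that setting sparse = TRUE in wbart activates the Dirichlet split-probability prior and that varcount again records per-draw split counts; both assumptions should be checked against the installed version of the BART package.

fit_dart <- wbart(x.train = X, y.train = y, sparse = TRUE,
                  ntree = 20, ndpost = 1000, nskip = 1000)

mpvip <- colMeans(fit_dart$varcount > 0)  # share of draws using each predictor
selected <- which(mpvip >= 0.5)           # median probability model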

2.2.4 Variable selection with ABC Bayesian forest

[23] introduce a spike-and-forest prior which wraps the BART prior with a spike-and-slab prior on the model space:

\mathcal{S}\sim\pi\left(\mathcal{S}\right),\,\forall\mathcal{S}\subseteq\left\{1,\cdots,p\right\},
\left\{T_{m},\boldsymbol{\mu}_{m}\right\}_{m=1}^{M},\sigma^{2}\mid\mathcal{S}\sim\text{BART prior}.

The tree-based model under the spike-and-forest prior can be used not only for estimation but also for variable selection, as shown in [20]. Due to the intractable marginal likelihood, they propose an approximate Bayesian computation (ABC) sampling method based on data-splitting to help sample from the model space with higher ABC acceptance rate. Specifically, at each iteration, the dataset is randomly split into a training set and a test set according to a certain split ratio. The algorithm proceeds by sampling a subset 𝒮\mathcal{S} from the prior π(𝒮)\pi(\mathcal{S}), fitting a BART model on the training set only with the predictors in 𝒮\mathcal{S}, and computing the root mean squared errors (RMSE) for the test set based on a posterior sample from the fitted BART model. Only those subsets that result in a low RMSE on the test set are kept for selection. Similar to DART, ABC Bayesian forest selects predictors based on their MPVIP πj\pi_{j} defined as (4) which is estimated by computing the proportion of ABC accepted BART posterior samples that use the predictor xjx_{j} at least one time. Given the πj\pi_{j}’s, predictors with πj\pi_{j} exceeding a pre-specified threshold are selected.

ABC Bayesian forest exhibits good performance in excluding irrelevant predictors, but it does not perform well in including relevant predictors in the presence of correlated predictors, as shown in Section 4. One possible reason is that ABC Bayesian forest internally uses only one posterior sample after a short burn-in period and a small number of trees for BART, making it easy to miss relevant predictors. Furthermore, given that ABC Bayesian forest computes \pi_{j} based on the ABC accepted BART posterior samples rather than the ABC accepted subsets, it appears that the good performance in excluding irrelevant predictors may be primarily due to the variable selection capability of BART itself when the number of trees is small.
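
The ABC Bayesian forest procedure can be sketched in R as follows. This is a simplified illustration only: the beta-binomial subset prior, the 50% data split, the small ensemble and the use of a single post-burn-in draw follow the description above, but the helper name and all tuning arguments are our own assumptions rather than the authors' implementation.

abc_forest <- function(X, y, n_abc = 1000, keep = 0.1, ntree = 10) {
  p <- ncol(X); n <- nrow(X)
  draws <- replicate(n_abc, {
    theta <- rbeta(1, 1, 1)                        # beta-binomial prior on subsets
    S <- which(runif(p) < theta)
    if (length(S) == 0) S <- sample.int(p, 1)
    idx <- sample.int(n, floor(n / 2))             # 50% train / 50% test split
    fit <- wbart(x.train = X[idx, S, drop = FALSE], y.train = y[idx],
                 x.test = X[-idx, S, drop = FALSE],
                 ntree = ntree, ndpost = 1, nskip = 200)
    rmse <- sqrt(mean((y[-idx] - fit$yhat.test[1, ])^2))
    used <- as.integer(seq_len(p) %in% S[fit$varcount[1, ] > 0])
    c(rmse, used)
  })
  top <- draws[, order(draws[1, ])[seq_len(ceiling(keep * n_abc))]]
  rowMeans(top[-1, , drop = FALSE])                # estimated MPVIPs
}
## Predictors whose estimated MPVIP exceeds 0.25 or 0.5 would then be selected.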

3 New Approaches

3.1 BART VIP with Type Information

In Section 2.2.2, we show an example where the approach of [5] fails to include relevant binary predictors. The intuition of this approach is that a relevant predictor is expected to appear more often in the model built on the original dataset than that built on a null dataset. In this section, we revisit Example 2 and show that the intuition does hold for both relevant continuous and relevant binary predictors, but that the increase in usage of a relevant binary predictor is often offset by the increase of relevant continuous predictors, when the number of predictors is not large, thereby making the true VIP of a relevant binary predictor insignificant with respect to (w.r.t.) the corresponding null VIPs. In light of this observation, we modify BART VIP with the type information of predictors to help identify relevant binary predictors.

Studying BART VIP directly is challenging because each v_{j} consists of K different components \{c_{jk}/c_{\cdot k}\}_{k=1}^{K}. To make the analysis more interpretable, we introduce an approximation of BART VIP which only contains one component.

Definition 3.1 (Approximation of BART VIP).

Following the notation in Definition 2.1, for every j=1,\cdots,p and k=1,\cdots,K, let c_{j\cdot}=\sum_{k=1}^{K}c_{jk} be the number of splitting nodes using the predictor x_{j} over all the posterior samples of the tree structures and c_{\cdot\cdot}=\sum_{j,k=1}^{p,K}c_{jk} be the total number of splitting nodes over all the posterior samples. For each predictor x_{j}, define

\tilde{v}_{j}=\frac{c_{j\cdot}}{c_{\cdot\cdot}}.

The following lemma shows that \tilde{v}_{j} defined above can be used to approximate the BART VIP v_{j} under mild conditions.

Lemma 3.2.

For any j=1,\cdots,p and k=1,\cdots,K, let c_{\cdot k}=\sum_{j=1}^{p}c_{jk} be the total number of splitting nodes in the k^{\text{th}} posterior sample and \bar{c}_{\cdot K}=c_{\cdot\cdot}/K be the average number of splitting nodes in a posterior sample. If \sum_{k=1}^{K}(c_{jk}/c_{\cdot k})^{2}/K\leq\delta_{1} and [\sum_{k=1}^{K}(c_{\cdot k}-\bar{c}_{\cdot K})^{2}/K]^{1/2}/\bar{c}_{\cdot K}\leq\delta_{2} for some positive numbers \delta_{1},\delta_{2}>0, then the difference between \tilde{v}_{j} and the corresponding BART VIP v_{j} is bounded, i.e., \lvert\tilde{v}_{j}-v_{j}\rvert\leq\delta_{1}^{1/2}\delta_{2}.

Remark.

The first condition above means that the variance of c_{jk}/c_{\cdot k}, the proportion of the usage of the predictor x_{j} in an ensemble, is bounded. The second condition means that the coefficient of variation of c_{\cdot k}, the number of splitting nodes in an ensemble, is also bounded. Since [5] use posterior samples after a burn-in period, the variance of c_{jk}/c_{\cdot k} and the coefficient of variation of c_{\cdot k} are small (see Figure S.2 of the Supplementary Material). Hence, Lemma 3.2 implies that \tilde{v}_{j} can be used as an alternative to the corresponding BART VIP v_{j}.

Proof.

See Section S.1 of the Supplementary Material. ∎

Consider the approach of [5]. For every j=1,\cdots,p, k=1,\cdots,K, r=1,\cdots,L_{\text{rep}} and l=1,\cdots,L, let c_{l,jk}^{*} be the number of splitting nodes using the predictor x_{j} in the k^{\text{th}} posterior sample of the BART model built on the l^{\text{th}} null dataset and c_{r,jk} be that of the r^{\text{th}} repeated BART model built on the original dataset. Following the notation in Definition 3.1, write c_{l,j\cdot}^{*}=\sum_{k=1}^{K}c_{l,jk}^{*}, c_{l,\cdot\cdot}^{*}=\sum_{j,k=1}^{p,K}c_{l,jk}^{*}, c_{r,j\cdot}=\sum_{k=1}^{K}c_{r,jk}, and c_{r,\cdot\cdot}=\sum_{j,k=1}^{p,K}c_{r,jk}. By Lemma 3.2, for each predictor x_{j}, we can use \tilde{v}_{lj}^{*}=c_{l,j\cdot}^{*}/c_{l,\cdot\cdot}^{*} to approximate v_{lj}^{*}, the VIP obtained from the BART model on the l^{\text{th}} null dataset, and \tilde{v}_{rj}=c_{r,j\cdot}/c_{r,\cdot\cdot} to approximate v_{rj}, the VIP obtained from the r^{\text{th}} repeated BART model on the original dataset. Denote by \bar{\tilde{v}}_{j}=(\sum_{r=1}^{L_{\text{rep}}}\tilde{v}_{rj})/L_{\text{rep}} the averaged approximate VIP for the predictor x_{j} across the L_{\text{rep}} repetitions of BART on the original dataset. According to the local threshold of [5], a predictor x_{j} is selected only if the averaged VIP \bar{\tilde{v}}_{j} exceeds the 1-\alpha quantile of the empirical distribution \sum_{l=1}^{L}\delta_{\tilde{v}_{lj}^{*}}(\cdot), i.e.,

\frac{1}{L}\sum\limits_{l=1}^{L}\mathbb{I}\left(\frac{1}{L_{\text{rep}}}\sum\limits_{r=1}^{L_{\text{rep}}}\frac{c_{r,j\cdot}}{c_{r,\cdot\cdot}}>\frac{c_{l,j\cdot}^{*}}{c_{l,\cdot\cdot}^{*}}\right)\,>\,1-\alpha. \qquad (5)

Similar to the approximation of BART VIP, we can approximate (\sum_{r=1}^{L_{\text{rep}}}c_{r,j\cdot}/c_{r,\cdot\cdot})/L_{\text{rep}} in (5) by \bar{c}_{L_{\text{rep}},j\cdot}/\bar{c}_{L_{\text{rep}},\cdot\cdot}, and therefore the selection criterion (5) can be rewritten as

\frac{1}{L}\sum\limits_{l=1}^{L}\mathbb{I}\left(\frac{\bar{c}_{L_{\text{rep}},j\cdot}}{c_{l,j\cdot}^{*}}>\frac{\bar{c}_{L_{\text{rep}},\cdot\cdot}}{c_{l,\cdot\cdot}^{*}}\right)\,>\,1-\alpha, \qquad (6)

where \bar{c}_{L_{\text{rep}},j\cdot}=(\sum_{r=1}^{L_{\text{rep}}}c_{r,j\cdot})/L_{\text{rep}} and \bar{c}_{L_{\text{rep}},\cdot\cdot}=(\sum_{r=1}^{L_{\text{rep}}}c_{r,\cdot\cdot})/L_{\text{rep}}.

The ratio c¯Lrep,j/cl,j\bar{c}_{L_{\text{rep}},j\cdot}/c_{l,j\cdot}^{*} on the left hand side (LHS) of the inequality inside the sum of (6) represents the average increment in usage of a predictor xjx_{j} in a BART model built on the original dataset, compared to a null dataset; the ratio c¯Lrep,/cl,\bar{c}_{L_{\text{rep}},\cdot\cdot}/c_{l,\cdot\cdot}^{*} on the right hand side (RHS) represents the average increment in the total number of the splitting nodes over all the posterior samples of a BART model built on the original dataset, compared to a null dataset. Figure S.3 of the Supplementary Material shows the overall counts ratio c¯Lrep,/cl,\bar{c}_{L_{\text{rep}},\cdot\cdot}/c_{l,\cdot\cdot}^{*} and the counts ratio c¯Lrep,j/cl,j\bar{c}_{L_{\text{rep}},j\cdot}/c_{l,j\cdot}^{*} for each predictor in Example 2. Both relevant continuous (x11x_{11}, x12x_{12} and x13x_{13}) and relevant binary (x1x_{1} and x2x_{2}) predictors are indeed used more frequently in the BART model built on the original dataset, i.e., c¯Lrep,j/cl,j>1\bar{c}_{L_{\text{rep}},j\cdot}/c_{l,j\cdot}^{*}>1, but the increment in usage of a relevant binary predictor is not always greater than the increment in the total number of splits, i.e., c¯Lrep,j/cl,jc¯Lrep,/cl,\bar{c}_{L_{\text{rep}},j\cdot}/c_{l,j\cdot}^{*}\ngtr\bar{c}_{L_{\text{rep}},\cdot\cdot}/c_{l,\cdot\cdot}^{*}. A binary predictor can only be used once at most in each path of a tree because it only has one split value, and a few splits are sufficient to provide the information about a binary predictor. As a result, c¯Lrep,j\bar{c}_{L_{\text{rep}},j\cdot}, the number of splits using a binary predictor, is limited by its low cardinality, even if it is relevant. In addition, the number of splits cl,jc_{l,j\cdot}^{*} using a binary predictor in a BART model built on a null dataset is roughly 1/p1/p of the total number of splits in the ensemble ([5]). As such, when the number of predictors is small, cl,jc_{l,j\cdot}^{*} is close to c¯Lrep,j\bar{c}_{L_{\text{rep}},j\cdot} (see the left subfigure of Figure S.4 of the Supplementary Material), thereby making the counts ratio c¯Lrep,j/cl,j\bar{c}_{L_{\text{rep}},j\cdot}/c_{l,j\cdot}^{*} for a binary predictor close to 11. The overall counts ratio c¯Lrep,/cl,\bar{c}_{L_{\text{rep}},\cdot\cdot}/c_{l,\cdot\cdot}^{*} is also close to 11 because of the BART regularization prior. In fact, given a fixed number of trees in an ensemble, the total number of splits in the ensemble is controlled by the splitting probability, the probability of splitting a node into two child nodes, which is the same for BART models built on the original dataset and null datasets, so the total number of splits for the original dataset and a null dataset, c¯Lrep,\bar{c}_{L_{\text{rep}},\cdot\cdot} and cl,c_{l,\cdot\cdot}^{*}, are of similar magnitude (see the right subfigure of Figure S.4). As such, the increment in usage of a relevant binary predictor c¯Lrep,j/cl,j\bar{c}_{L_{\text{rep}},j\cdot}/c_{l,j\cdot}^{*} is easily offset by the increment in the total number of splits c¯Lrep,/cl,\bar{c}_{L_{\text{rep}},\cdot\cdot}/c_{l,\cdot\cdot}^{*}.

Figure S.4 of the Supplementary Material also shows that the increment in the total number of splits is primarily driven by those relevant continuous predictors, so essentially the increment in usage of a relevant binary predictor is offset by that of relevant continuous predictors. To alleviate this behavior, we modify Definition 2.1 using the type information of predictors.

Definition 3.3 (BART Within-Type Variable Inclusion Proportion).

Denote by \mathcal{S}_{\text{cts}} (or \mathcal{S}_{\text{bin}}) the set of indices of continuous (or binary) predictors. Following the notation in Definition 2.1, for every k=1,\cdots,K, let c_{\text{cts,}k}=\sum_{j\in\mathcal{S}_{\text{cts}}}c_{jk} (or c_{\text{bin,}k}=\sum_{j\in\mathcal{S}_{\text{bin}}}c_{jk}) be the total number of splitting nodes using continuous (or binary) predictors in the k^{\text{th}} posterior sample. For every j=1,\cdots,p, define the BART within-type VIP for the predictor x_{j} as follows:

v_{j}^{\text{w.t.}}\,=\,\frac{1}{K}\sum\limits_{k=1}^{K}\frac{c_{jk}}{c_{\text{cts,}k}\cdot\mathbb{I}(j\in\mathcal{S}_{\text{cts}})+c_{\text{bin,}k}\cdot\mathbb{I}(j\in\mathcal{S}_{\text{bin}})}.

Define \delta_{jj^{\prime}}=\mathbb{I}(x_{j}\text{ and }x_{j^{\prime}}\text{ are of the same type}) for any j=1,\cdots,p and j^{\prime}=1,\cdots,p. Write \bar{c}_{L_{\text{rep}},\text{type of }x_{j},\cdot}=(\sum_{r=1}^{L_{\text{rep}}}\sum_{j^{\prime}=1}^{p}c_{r,j^{\prime}\cdot}\cdot\delta_{jj^{\prime}})/L_{\text{rep}} and c_{l,\text{type of }x_{j},\cdot}^{*}=\sum_{j^{\prime}=1}^{p}c_{l,j^{\prime}\cdot}^{*}\cdot\delta_{jj^{\prime}}. The permutation-based variable selection approach (Algorithm 1) using BART within-type VIP as the variable importance measure selects a predictor x_{j} only if

\frac{1}{L}\sum\limits_{l=1}^{L}\mathbb{I}\left(\frac{\bar{c}_{L_{\text{rep}},j\cdot}}{c_{l,j\cdot}^{*}}>\frac{\bar{c}_{L_{\text{rep}},\text{type of }x_{j},\cdot}}{c_{l,\text{type of }x_{j},\cdot}^{*}}\right)\,>\,1-\alpha. \qquad (7)

For a binary predictor, the selection criterion (7) removes the impact of continuous predictors, thereby not diluting the effect of a relevant binary predictor. Figure 2 shows the variable selection result of this approach. As opposed to the approach of [5], the two relevant binary predictors x_{1} and x_{2} are both clearly identified by this approach.
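
Given the same K x p matrix of split counts used for the ordinary VIP, the within-type VIP of Definition 3.3 only changes the denominator. A minimal R sketch, assuming counts holds the c_{jk} and is_bin is a logical vector flagging the binary predictors:

within_type_vip <- function(counts, is_bin) {
  # Build, column by column, the within-type totals c_type(j),k
  # (assumes each type is split on at least once in every posterior sample)
  denom <- counts
  denom[, is_bin]  <- rowSums(counts[, is_bin,  drop = FALSE])
  denom[, !is_bin] <- rowSums(counts[, !is_bin, drop = FALSE])
  colMeans(counts / denom)   # average of c_jk / c_type(j),k over posterior samples
}
## These within-type VIPs simply replace the ordinary VIPs inside Algorithm 1.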

Figure 2: Variable selection results of the permutation-based approach using BART within-type VIP as the variable importance measure for Example 2. The same setting (L=100, L_{\text{rep}}=10, \alpha=0.05 and M=20) and BART posterior samples as Figure S.1 of the Supplementary Material are used. Red (or green) dots are for continuous (or binary) predictors. Solid (or open) dots are for selected (or not selected) predictors. Each vertical grey line is the local threshold for the corresponding predictor. Both relevant binary predictors, x_{1} and x_{2}, are selected and have similar variable importance to the relevant continuous ones.

3.2 BART Variable Importance Using Metropolis Ratios

While BART within-type VIP works well in including relevant predictors in the presence of a similar number of binary and continuous predictors, it becomes problematic if the number of predictors of one type is low. For example, when there are 99 continuous predictors and 1 binary predictor, the BART within-type VIP of the binary predictor is always 1 for both the original dataset and a null dataset. As such, the binary predictor is never selected by the permutation-based approach, whether it is relevant or not. Furthermore, when there are more than two types of predictors, the computation of BART within-type VIP becomes complicated. Hence, we propose a new type of variable importance measure based on the Metropolis ratios calculated in the Metropolis-Hastings steps for sampling new trees, which is not affected by the issues above.

As mentioned in Section 2.1, a key feature of the Metropolis-within-Gibbs sampler is the Bayesian backfitting procedure for sampling (T_{m},\boldsymbol{\mu}_{m}), 1\leq m\leq M, where each T_{m} is fit iteratively using the residual response r_{-m}=y-\sum_{m^{\prime}\neq m}g(\boldsymbol{x};T_{m^{\prime}},\boldsymbol{\mu}_{m^{\prime}}), with all the other trees T_{-m}=\{T_{m^{\prime}}\}_{m^{\prime}\neq m} held constant. Thus, each (T_{m},\boldsymbol{\mu}_{m}) can be obtained in two sequential steps:

T_{m}\mid\boldsymbol{r}_{-m},\sigma^{2},
\boldsymbol{\mu}_{m}\mid T_{m},\boldsymbol{r}_{-m},\sigma^{2},

where \boldsymbol{r}_{-m} is the vector of residual responses. The distribution of T_{m}\mid\boldsymbol{r}_{-m},\sigma^{2} has a closed form up to a normalizing constant and therefore can be sampled by a Metropolis-Hastings algorithm. [8] develop a Metropolis-Hastings algorithm that proposes a new tree based on the current tree using one of four moves: BIRTH, DEATH, CHANGE and SWAP. The BIRTH and DEATH proposals are the essential moves for sufficient mixing of the Gibbs sampler, so both the CRAN R package BART ([24]) and our package BartMixVs ([22]) implement these two proposals, each with equal probability. A BIRTH proposal turns a terminal node into a splitting node and a DEATH proposal replaces a splitting node leading to two terminal nodes by a terminal node. Thus, the proposed tree T^{*} is identical to the current tree T_{m} except for growing/pruning one splitting node.

We consider using the Metropolis ratio for accepting a BIRTH proposal to construct a variable importance measure. For convenience, we suppress the subscript m in the following. The Metropolis ratio for accepting a BIRTH proposal at a terminal node \eta of the current tree T can be written as

\pi_{\text{BIRTH}}\left(\eta\right)=\min\left\{1,r(\eta)\right\}, \qquad (8)

where

r(\eta)=\frac{P\left(T\mid T^{*}\right)P\left(T^{*}\mid\boldsymbol{r},\sigma^{2}\right)}{P\left(T^{*}\mid T\right)P\left(T\mid\boldsymbol{r},\sigma^{2}\right)}. \qquad (9)

The BART prior splits a node of depth d into two child nodes of depth d+1 with probability \gamma(1+d)^{-\beta} for some \gamma\in(0,1) and \beta\in[0,\infty). Given that, the untruncated Metropolis ratio r(\eta) can be explicitly expressed as the product of three ratios:

\text{Nodes Ratio} = \frac{2b}{b+2},
\text{Depth Ratio} = \frac{\gamma\left[1-\gamma/(2+d_{\eta})^{\beta}\right]^{2}}{(1+d_{\eta})^{\beta}-\gamma},
\text{Likelihood Ratio} = \frac{P\left(\boldsymbol{r}\mid T^{*},\sigma^{2}\right)}{P\left(\boldsymbol{r}\mid T,\sigma^{2}\right)},

where b is the number of terminal nodes in the current tree T and d_{\eta} is the depth of the node \eta in the current tree T. The derivation can be found in Section S.2 of the Supplementary Material.
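
For intuition, the two prior-driven factors of r(\eta) are simple functions of the tree size and the proposed depth; a small R sketch under the default hyper-parameters \gamma=0.95 and \beta=2:

nodes_ratio <- function(b) 2 * b / (b + 2)
depth_ratio <- function(d, gamma = 0.95, beta = 2) {
  gamma * (1 - gamma / (2 + d)^beta)^2 / ((1 + d)^beta - gamma)
}
# e.g. proposing a split at the root (depth 0) of a single-leaf tree (b = 1):
nodes_ratio(1) * depth_ratio(0)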

Figure S.5 of the Supplementary Material shows the nodes ratios for different numbers of terminal nodes and the depth ratios for different depths when the hyper-parameters \gamma and \beta are set to their defaults, i.e., \gamma=0.95 and \beta=2. From the figure, we can see that the product of the nodes ratio and the depth ratio is mostly affected by the depth ratio, which decreases as the proposed depth d_{\eta} increases, implying that r(\eta) assigns a greater value to a shallower node, which typically contains more samples. Since the likelihood ratio indicates the conditional improvement in fit brought by the new split, the untruncated Metropolis ratio r(\eta) considers both the proportion of samples affected at the node \eta and the conditional improvement in fit brought by the new split at the node \eta. However, it is not appropriate to directly use r(\eta) to construct a variable importance measure, due to the occurrence of extremely large values of r(\eta) which dominate the others. Instead, we use the Metropolis ratio in (8) to construct the following BART Metropolis importance.

Definition 3.4 (BART Metropolis Importance).

Let K be the number of posterior samples obtained from a BART model. For every k=1,\cdots,K and j=1,\cdots,p, define the average Metropolis acceptance ratio per splitting rule using the predictor x_{j} at the k^{\text{th}} posterior sample as follows:

\tilde{u}_{jk}\,=\,\frac{\sum\limits_{m=1}^{M}\sum\limits_{\eta\in\phi_{mk}}\left[\mathbb{I}\left(j_{\eta}=j\right)\pi_{\text{BIRTH}}\left(\eta\right)\right]}{\sum\limits_{m=1}^{M}\sum\limits_{\eta\in\phi_{mk}}\mathbb{I}\left(j_{\eta}=j\right)},

where \phi_{mk} is the set of splitting nodes in the k^{\text{th}} posterior sample of the m^{\text{th}} tree and j_{\eta} is the index of the split variable at the node \eta. The BART Metropolis Importance (MI) for the predictor x_{j} is defined as the normalized \tilde{u}_{jk}, averaged across the K posterior samples:

v_{j}^{\text{MI}}\,=\,\frac{1}{K}\sum\limits_{k=1}^{K}\frac{\tilde{u}_{jk}}{\sum\limits_{j^{\prime}=1}^{p}\tilde{u}_{j^{\prime}k}}. \qquad (10)

The intuition behind BART MI is that a relevant predictor is expected to be consistently accepted with a high Metropolis ratio. Figure 3 shows that the Metropolis ratios of the splits using relevant predictors (x_{1}, x_{2}, x_{11}, x_{12} and x_{13}) as the split variable have a higher median and smaller standard error than those using irrelevant predictors, implying that relevant predictors are consistently accepted with higher Metropolis ratios. BART MI and BART VIP have similar forms. The key difference is the “kernel” they use. While BART VIP uses the number of splits c_{jk}, which tends to be biased against binary predictors, BART MI uses the average Metropolis acceptance ratio \tilde{u}_{jk}, which is similar between relevant binary predictors and relevant continuous predictors, as shown in Figure 3. Hence, the average Metropolis acceptance ratio \tilde{u}_{jk} not only helps BART MI distinguish relevant and irrelevant predictors, but also helps BART MI avoid bias against predictors of a certain type.

The vector of average Metropolis acceptance ratios (\tilde{u}_{1k},\cdots,\tilde{u}_{pk}) does not sum to 1, and each element varies from 0 to 1. We normalize the vector (\tilde{u}_{1k},\cdots,\tilde{u}_{pk}) in (10) based on the idea that the \tilde{u}_{jk}'s are correlated, in the sense that once some predictors that explain the response more are accepted with higher Metropolis ratios, the remaining predictors will be accepted with lower Metropolis ratios or not be used. In addition, we take the average over the K posterior samples in (10) because averaging further helps identify relevant predictors. In fact, an irrelevant predictor can be accepted with a high Metropolis ratio at some MCMC iterations, but it can be quickly removed from the model by the DEATH proposal. As a result, an irrelevant predictor can have a few large \tilde{u}_{jk}'s and many small \tilde{u}_{jk}'s, thereby resulting in a small BART MI v_{j}^{\text{MI}}, as shown in Figure 3.
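
Computing BART MI requires recording the BIRTH Metropolis ratios during sampling, which is what the modified sampler in BartMixVs does internally; standard BART output does not retain them. The R sketch below therefore assumes the ratios have already been summarized per posterior sample into two K x p matrices, acc_sum (the summed ratios \pi_{\text{BIRTH}}(\eta) of splits on each predictor) and acc_n (the corresponding split counts); both names are hypothetical.

bart_mi <- function(acc_sum, acc_n) {
  u <- acc_sum / pmax(acc_n, 1)   # average acceptance ratio per split, u_jk
  u[acc_n == 0] <- 0              # predictors not used in sample k contribute 0
  colMeans(u / rowSums(u))        # normalize within each sample, then average over k
}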

Figure 3: Each barplot (w.r.t. the left y-axis) depicts c_{j\cdot}, the number of splits using a predictor as the split variable over all the posterior samples. Each boxplot (w.r.t. the right y-axis) depicts the Metropolis ratios \pi_{\text{BIRTH}}(\eta) of the splits using a predictor as the split variable, over all the posterior samples. Each red triangle (w.r.t. the right y-axis) displays the BART MI v_{j}^{\text{MI}} for a predictor. Posterior samples are obtained from a BART model for Example 2.

We further explore BART MI under null settings. We create a null dataset of Example 2 and fit a BART model on it. Figure S.6 of the Supplementary Material shows that not all the MIs, v_{j}^{\text{MI}}'s, converge to 1/p=0.05, implying that 1/p may not be a good selection threshold for the v_{j}^{\text{MI}}'s. We repeat the experiment of [5] for BART MI under null settings to explore the variation among the v_{j}^{\text{MI}}'s. Specifically, we create 100 null datasets of the data generated from Example 2, and for each null dataset, we run BART 50 times with different initial values of the hyper-parameters. Let v_{ijk}^{\text{MI}} be the BART MI for the predictor x_{j} from the k^{\text{th}} BART model on the i^{\text{th}} null dataset. We investigate the three nested variances listed in Table 1.

Table 1: Three nested variances
Variance Description
s_{ij}^{2}=\frac{1}{49}\sum\limits_{k=1}^{50}(v_{ijk}^{\text{MI}}-\bar{v}_{ij\cdot}^{\text{MI}})^{2} The variability of BART MI for the predictor x_{j} in the i^{\text{th}} dataset; \bar{v}_{ij\cdot}^{\text{MI}}=(\sum_{k=1}^{50}v_{ijk}^{\text{MI}})/50.
s_{j}^{2}=\frac{1}{99}\sum\limits_{i=1}^{100}(\bar{v}_{ij\cdot}^{\text{MI}}-\bar{v}_{\cdot j\cdot}^{\text{MI}})^{2} The variability due to chance capitalization, i.e., fitting noise, of BART for the predictor x_{j} across datasets; \bar{v}_{\cdot j\cdot}^{\text{MI}}=(\sum_{i,k=1}^{100,50}v_{ijk}^{\text{MI}})/(100\times 50).
s^{2}=\frac{1}{19}\sum\limits_{j=1}^{20}(\bar{v}_{\cdot j\cdot}^{\text{MI}}-\bar{v}_{\cdot\cdot\cdot}^{\text{MI}})^{2} The variability of BART MI across predictors; \bar{v}_{\cdot\cdot\cdot}^{\text{MI}}=(\sum_{i,j,k=1}^{100,20,50}v_{ijk}^{\text{MI}})/(100\times 20\times 50).

Results of the experiment are shown in Figure S.7 of the Supplementary Material. The non-zero s_{ij}'s imply that there is variation in the v_{j}^{\text{MI}}'s among the repetitions of BART with different initial values, so it is necessary to average MIs over a certain number of repetitions of BART to obtain stable MIs. In practice, we find that among the repetitions of BART, irrelevant predictors occasionally produce outlying MIs. Thus, instead of averaging MIs over BART repetitions, we take their median. Similar to [5], we also find that 10 repetitions of BART are sufficient to obtain stable MIs (see Figure S.8 of the Supplementary Material). Figure S.7 also shows that the second type of standard deviation, s_{j}, is significantly greater than the median of the s_{ij}'s for each predictor, which suggests that BART tends to overfit noise. Since the overall MI \bar{v}_{\cdot\cdot\cdot}^{\text{MI}} is always 1/p, the relatively large s indicates that not all the \bar{v}_{\cdot j\cdot}^{\text{MI}} are approximately 1/p, suggesting that it is not possible to determine a single selection threshold for all the MIs. Therefore, we employ the permutation-based approach (Algorithm 1) to select a threshold for each MI. A slight modification is applied to line 4 of Algorithm 1: we take the median, rather than the average, of the BART MIs over the repetitions of BART on the original dataset. The variable selection result of this approach for Example 2 is shown in Figure 4.

Figure 4: Variable selection results of the permutation-based approach using BART MI as the variable importance measure for Example 2. The same setting and BART posterior samples as Figure S.1 of the Supplementary Material and Figure 2 are used. With BART MI, all the relevant predictors (x_{1}, x_{2}, x_{11}, x_{12}, and x_{13}) are identified.

3.3 Backward Selection with Two Filters

In this section, we propose an approach to select a subset of predictors with which the BART model gives the best prediction. In general, to select the best subset of predictors from the model space consisting of 2^{p} possible models, one needs a search strategy over the model space and decision-making rules to compare models. Here we use the backward elimination approach to search the model space, mainly because it is deterministic and is efficient with a moderate number of predictors. Typically, backward selection starts with the full model with all the predictors, then compares the deletion of each predictor using a chosen model fit criterion and deletes the predictor whose removal causes the least significant deterioration of the model fit. This process is repeated until no further predictors can be deleted without a statistically significant loss of fit. Some popular model fit criteria such as AIC and BIC are not available for BART, because the maximum likelihood estimates are unavailable and the number of parameters in the model is hard to determine. To overcome this issue, we propose a backward selection approach with two easy-to-compute selection criteria as the decision-making rules.

We first split the dataset \{y_{i},\boldsymbol{x}_{i}\}_{i=1}^{n} into a training set and a test set before starting the backward procedure. To measure the model fit, we make use of the mean squared error (MSE) of the test set

\text{MSE}_{\text{test}}=\frac{1}{n_{\text{test}}}\sum\limits_{i\in\text{test set}}\left[y_{i}-\hat{f}\left(\boldsymbol{x}_{i}\right)\right]^{2},

where \hat{f} is the estimated sum-of-trees function. Figure 5 shows that \text{MSE}_{\text{test}} can distinguish between acceptable models and unacceptable models, where acceptable models are defined as those including all the relevant predictors and unacceptable models are those missing some relevant predictors. At each backward selection step, we run BART models on the training set and compute the MSE of the test set. The model with the smallest \text{MSE}_{\text{test}} is chosen as the “winner” at that step. However, due to the lack of a stopping rule in such a nonparametric setting, the backward selection has to continue until there is only one predictor left in the model, and ultimately returns p “winner” models with model sizes ranging from 1 to p.

Figure 5: \text{MSE}_{\text{test}}'s of all the models evaluated in the backward selection approach for Example 2. Each green (or red) boxplot depicts the \text{MSE}_{\text{test}}'s of all the acceptable (or unacceptable) models evaluated at a backward selection step. The \text{MSE}_{\text{test}}'s for acceptable models are close to 0 and have tiny standard errors, so the green boxplots are very narrow and near 0.

Given the p “winner” models, we compare them using the expected log pointwise predictive density based on leave-one-out (LOO) cross validation:

\text{elpd}_{\text{loo}}=\sum\limits_{i=1}^{n}\log\left[f\left(y_{i}\mid\boldsymbol{y}_{-i}\right)\right]=\sum\limits_{i=1}^{n}\log\left[\int f\left(y_{i}\mid\theta\right)f\left(\theta\mid\boldsymbol{y}_{-i}\right)d\theta\right], \qquad (11)

where \boldsymbol{y}_{-i}=\{y_{j}\}_{j\neq i}, \theta=\{\{T_{m},\boldsymbol{\mu}_{m}\}_{m=1}^{M},\sigma^{2}\} represents the set of BART parameters, f\left(y_{i}\mid\boldsymbol{y}_{-i}\right) is the predictive density of y_{i} given \boldsymbol{y}_{-i}, f(y_{i}\mid\theta) is the likelihood of y_{i}, and f(\theta\mid\boldsymbol{y}_{-i}) is the posterior density of BART given the observations \boldsymbol{y}_{-i}. The quantity \text{elpd}_{\text{loo}} not only measures the predictive capability of a model, but also penalizes the complexity of the model, i.e., the number of predictors used in the model. Hence, we choose the model with the largest \text{elpd}_{\text{loo}} as the best model.

Directly computing (11) involves fitting BART n times. To avoid this, we use the approach of [27], which estimates each f(y_{i}\mid\boldsymbol{y}_{-i}) using importance sampling with the full posterior distribution f(\theta\mid\boldsymbol{y}) as the sampling distribution and the smoothed ratios \{f(\theta^{k}\mid\boldsymbol{y}_{-i})/f(\theta^{k}\mid\boldsymbol{y})\propto 1/f(y_{i}\mid\theta^{k})\}_{k=1}^{K} as the importance ratios, where \{\theta^{k}\}_{k=1}^{K} are the K posterior samples from the full posterior distribution f(\theta\mid\boldsymbol{y}). The smoothing is achieved by fitting a generalized Pareto distribution to the upper tail of each set of importance ratios \{1/f(y_{i}\mid\theta^{k})\}_{k=1}^{K}, i=1,\cdots,n. Thus, \text{elpd}_{\text{loo}} in (11) can be estimated by using the likelihoods \{f(y_{i}\mid\theta^{k})=\phi(y_{i}\mid\hat{f}^{k}(\boldsymbol{x}_{i}),\sigma^{k})\}_{i,k=1}^{n,K} based on the K posterior samples \{\hat{f}^{k}\}_{k=1}^{K} from the BART model fit on the dataset \{y_{i},\boldsymbol{x}_{i}\}_{i=1}^{n}, where \hat{f}^{k} is the k^{\text{th}} posterior sample of the sum-of-trees function (2) and \phi(\cdot\mid\mu,\sigma) is the normal density with mean \mu and standard deviation \sigma.
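
In practice, the PSIS-LOO estimate can be obtained from the pointwise log-likelihood matrix with the loo R package. The sketch below assumes yhat is the K x n matrix of posterior draws of f(x_i) and sigma the length-K vector of posterior draws of the error standard deviation, both taken from the BART fit on the training data.

library(loo)

## K x n matrix of pointwise log-likelihoods, one row per posterior draw
log_lik <- sapply(seq_along(y), function(i) dnorm(y[i], yhat[, i], sigma, log = TRUE))
elpd_loo <- loo(log_lik)$estimates["elpd_loo", "Estimate"]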

The approach above can also be extended to the probit BART model (3) by changing \text{MSE}_{\text{test}} to the mean log loss (MLL) of the test set

\text{MLL}_{\text{test}}=-\frac{1}{n_{\text{test}}}\sum\limits_{i\in\text{test set}}\left[y_{i}\log\left(\hat{p}_{i}\right)+\left(1-y_{i}\right)\log\left(1-\hat{p}_{i}\right)\right] \qquad (12)

and replacing the normal likelihoods with the Bernoulli densities \{f(y_{i}\mid\theta^{k})=\hat{p}_{i}^{k}y_{i}+(1-\hat{p}_{i}^{k})(1-y_{i})\}_{i,k=1}^{n,K}, where \hat{p}_{i}^{k}=\Phi(\hat{f}^{k}(x_{i})) is the estimated probability for the i^{\text{th}} observation based on the k^{\text{th}} posterior sample of probit BART and \hat{p}_{i}=\Phi(\sum_{k=1}^{K}\hat{f}^{k}(x_{i})/K). We summarize this algorithm in Algorithm 2. Although this algorithm requires fitting BART p(p-1)/2 times, models at the same backward selection step can be fitted in parallel, which implies that the time complexity O(p(p-1)/2) can be reduced to O(p) if there are sufficient computing resources.

Input:  X: predictors; \boldsymbol{y}: response; s: split ratio
Output: A subset of predictors.
1:  Randomly split the data (\boldsymbol{y},X) into (\boldsymbol{y}_{\text{train}},X_{\text{train}}) and (\boldsymbol{y}_{\text{test}},X_{\text{test}}) according to the split ratio s
2:  Run BART, say \text{Model}_{1}, on (\boldsymbol{y}_{\text{train}},X_{\text{train}}) with all the predictors and compute the LOO score \text{elpd}_{\text{loo}}^{1}
3:  for l=1,\cdots,p-1 do
4:     for t=1,\cdots,p-l+1 do
5:        Remove the t^{\text{th}} predictor from the predictors used in \text{Model}_{l}
6:        Run BART, say \text{Model}_{l+1,t}, with the remaining predictors on (\boldsymbol{y}_{\text{train}},X_{\text{train}})
7:        Compute the MSE of the test set, \text{MSE}_{\text{test}}^{l+1,t} (or \text{MLL}_{\text{test}}^{l+1,t} if \boldsymbol{y} is binary)
8:        Compute the LOO score on the training set, \text{elpd}_{\text{loo}}^{l+1,t}
9:     end for
10:    Find t^{*} that minimizes \{\text{MSE}_{\text{test}}^{l+1,t}\}_{t=1}^{p-l+1} (or \{\text{MLL}_{\text{test}}^{l+1,t}\}_{t=1}^{p-l+1} if \boldsymbol{y} is binary)
11:    Denote \text{Model}_{l+1,t^{*}} by \text{Model}_{l+1} and \text{elpd}_{\text{loo}}^{l+1,t^{*}} by \text{elpd}_{\text{loo}}^{l+1}
12: end for
13: Find l^{*} that maximizes \{\text{elpd}_{\text{loo}}^{l}\}_{l=1}^{p}
14: Return \text{Model}_{l^{*}}
Algorithm 2: Backward selection with two filters
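To make the structure of Algorithm 2 concrete, the following R sketch implements a minimal serial version for a continuous response; it assumes the wbart() interface (yhat.train, yhat.test.mean, sigma) of the CRAN BART package and loo() from the loo package, and it is a simplified illustration rather than the BartMixVs implementation (which also parallelizes the inner loop).

library(BART); library(loo)

# x: matrix with named columns; y: continuous response; s: split ratio.
backward_vs <- function(x, y, s = 0.8, ntree = 50L) {
  n <- length(y); train <- sample(n, floor(s * n))

  fit_candidate <- function(keep) {
    fit <- wbart(x.train = as.matrix(x[train, keep, drop = FALSE]), y.train = y[train],
                 x.test  = as.matrix(x[-train, keep, drop = FALSE]),
                 ntree = ntree, nskip = 1000L, ndpost = 1000L)
    sig <- tail(fit$sigma, nrow(fit$yhat.train))             # kept draws of sigma
    ll  <- sapply(seq_along(train), function(i)              # K x n_train log-likelihoods
      dnorm(y[train][i], fit$yhat.train[, i], sig, log = TRUE))
    list(mse  = mean((y[-train] - fit$yhat.test.mean)^2),     # first filter: test MSE
         elpd = loo(ll)$estimates["elpd_loo", "Estimate"])    # second filter: LOO score
  }

  active <- colnames(x)
  models <- list(active)
  elpd   <- fit_candidate(active)$elpd                        # Model_1 uses all predictors
  for (l in seq_len(ncol(x) - 1)) {
    cand   <- lapply(active, function(v) fit_candidate(setdiff(active, v)))
    winner <- which.min(sapply(cand, `[[`, "mse"))            # drop the predictor whose removal hurts least
    active <- setdiff(active, active[winner])
    models[[l + 1]] <- active
    elpd   <- c(elpd, cand[[winner]]$elpd)
  }
  models[[which.max(elpd)]]                                   # winner model with the largest LOO score
}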

4 Simulations

In this section, we compare the proposed variable selection approaches, (1) the permutation-based approach using within-type BART VIP, (2) the permutation-based approach using BART MI, and (3) backward selection with two filters, with the existing BART-based variable selection approaches, (4) the median probability model from DART, (5) the permutation-based approach using BART VIP, and (6) ABC Bayesian forest. For the permutation-based approaches (1), (2) and (5), we set L_{\text{rep}}=10, L=100 and \alpha=0.05 (see Algorithm 1), and use M=20 trees for BART to encourage competition among the predictors. The three permutation-based approaches are based on the same posterior samples, and the one using within-type BART VIP is only applied to scenarios with mixed-type predictors. For the backward selection procedure (3), we set s=80\% (see Algorithm 2) and use M=50 trees for BART to ensure its predictive power. For DART (4), we consider M=20 and M=200 trees, respectively. For ABC Bayesian forest (6), as [20] suggest, we assume the prior \pi(\mathcal{S}) on the model space to be the beta-binomial prior with \theta\sim\mathrm{Beta}(1,1), set the split ratio to 50\%, and use M=10 and M=20 trees for BART, respectively; we generate 1000 ABC samples, rank the samples by MSE in ascending order if the response variable is continuous (or by MLL if the response variable is binary), and keep the top 10\% of samples for selection; we report selection results for two selection thresholds, 0.25 and 0.5. All the methods above, except ABC Bayesian forest, discard the first 1000 posterior samples as burn-in and keep 1000 posterior samples in each BART run. ABC Bayesian forest burns in 200 posterior samples and saves the 201^{\text{st}} posterior sample, as [20] suggest. The other hyper-parameters of BART are set to the defaults recommended by [9]. The simulation results were obtained using the R package BartMixVs ([22]), which is briefly introduced in Section S.3 of the Supplementary Material.

To evaluate variable selection results, we consider precision (prec), recall (rec), the F_{1} score, and the proportion of replications that miss at least one relevant predictor (r_{\text{miss}}), given by

\text{prec}=\frac{\text{TP}}{\text{TP}+\text{FP}},\quad\text{rec}=\frac{\text{TP}}{\text{TP}+\text{FN}},\quad F_{1}=\frac{2\cdot\text{prec}\cdot\text{rec}}{\text{prec}+\text{rec}},
r_{\text{miss}}=\frac{\text{\# of replications missing at least one relevant predictor}}{\text{\# of total replications}},

where TP is the number of relevant predictors correctly selected, FP is the number of predictors incorrectly selected, and FN is the number of relevant predictors not selected. High precision indicates good ability to exclude irrelevant predictors, and high recall indicates good ability to include relevant predictors. The F_{1} score balances precision and recall. The lowest possible r_{\text{miss}} score is 0, implying that the method does not miss any relevant predictor across all the replications.
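These metrics are straightforward to compute from the selected and relevant index sets; the helper below is a hypothetical illustration in base R (the function name and arguments are ours, not part of BartMixVs).

# Hypothetical helper: selection metrics for one replication, given the indices of the
# truly relevant predictors and the indices selected by a method.
selection_metrics <- function(selected, relevant) {
  tp <- length(intersect(selected, relevant))
  fp <- length(setdiff(selected, relevant))
  fn <- length(setdiff(relevant, selected))
  prec <- tp / (tp + fp); rec <- tp / (tp + fn)
  c(prec = prec, rec = rec, f1 = 2 * prec * rec / (prec + rec),
    miss = as.numeric(fn > 0))   # did this replication miss a relevant predictor?
}

# Averaging over replications gives prec, rec, F1 and r_miss; sel_list would hold the
# selected index sets from, e.g., 100 replications of one method:
# res <- t(sapply(sel_list, selection_metrics, relevant = 1:5))
# colMeans(res)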

We evaluate the aforementioned variable selection approaches under the four combinations of \{continuous response y, binary response y\} \times \{continuous predictors \boldsymbol{x}, mixed-type predictors \boldsymbol{x}\}. Each scenario is replicated 100 times. The simulation results for the scenario with a binary response and continuous predictors can be found in Section S.4.2 of the Supplementary Material.

4.1 Continuous Response and Continuous Predictors

In this setting, we consider two scenarios. Scenario C.C.1 is an example from [11]; Scenario C.C.2, which includes correlation between predictors, is borrowed from [29]. For both, we consider n=500, p\in\{50,200\} and \sigma^{2}=1.

Scenario C.C.1.

Sample the predictors x_{1},\cdots,x_{p} independently from \text{Uniform}(0,1) and the response y from \text{Normal}(f_{0}(\boldsymbol{x}),\sigma^{2}), where

f_{0}\left(\boldsymbol{x}\right)=10\sin\left(\pi x_{1}x_{2}\right)+20\left(x_{3}-0.5\right)^{2}+10x_{4}+5x_{5}. (13)
Scenario C.C.2.

Sample the predictors x_{1},\cdots,x_{p} from \text{Normal}(0,\Sigma) with \Sigma_{jk}=0.3^{|j-k|}, j,k=1,\cdots,p, and the response y from \text{Normal}(f_{0}(\boldsymbol{x}),\sigma^{2}), where

f_{0}\left(\boldsymbol{x}\right)=2x_{1}x_{4}+2x_{7}x_{10}. (14)
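A minimal R sketch of how these two data-generating mechanisms can be simulated (MASS::mvrnorm is used for the correlated predictors; this is an illustration, not the exact simulation code):

library(MASS)  # mvrnorm() for multivariate normal predictors

n <- 500; p <- 50; sigma2 <- 1

# Scenario C.C.1: independent Uniform(0,1) predictors, Friedman's function (13)
x1 <- matrix(runif(n * p), n, p)
f1 <- 10 * sin(pi * x1[, 1] * x1[, 2]) + 20 * (x1[, 3] - 0.5)^2 + 10 * x1[, 4] + 5 * x1[, 5]
y1 <- rnorm(n, f1, sqrt(sigma2))

# Scenario C.C.2: correlated normal predictors with Sigma_jk = 0.3^|j-k|, interaction-only function (14)
Sigma <- 0.3^abs(outer(1:p, 1:p, "-"))
x2 <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
f2 <- 2 * x2[, 1] * x2[, 4] + 2 * x2[, 7] * x2[, 10]
y2 <- rnorm(n, f2, sqrt(sigma2))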

Results are given in Table 2. When the predictors are independent (Scenario C.C.1) and p=50, none of the methods miss any relevant predictors, i.e., r_{\text{miss}}=0 and \text{rec}\equiv 1. The permutation-based approach using BART VIP and the ABC Bayesian forest methods, except the one using M=20 trees and selection threshold 0.25 (i.e., ABC-20-0.25), are the best in terms of precision and F_{1} scores, followed by the permutation-based approach using BART MI and DART with M=20 trees. When p=200, only ABC Bayesian forest with M=10 trees and selection threshold 0.5 (i.e., ABC-10-0.50) fails to include all the relevant predictors. As discussed in Section 2.2.4, ABC Bayesian forest using a small number of trees and keeping a small number of posterior samples can easily miss relevant predictors, especially when the number of predictors is large and the selection threshold is high. As shown in Table 2, increasing the number of trees to M=20 or decreasing the selection threshold to 0.25 improves the result. The other three ABC Bayesian forest methods achieve the best precision and F_{1} scores, followed by the two permutation-based approaches. The backward selection procedure is not a top choice in terms of precision, but its precision (0.75 for p=50 and 0.74 for p=200) is acceptable in the sense that it includes only about 1.7 irrelevant predictors on average. In fact, we find that the LOO score used in the backward selection, though penalizing model complexity, appears to have difficulty distinguishing between the true model and an acceptable model with a similar number of predictors, resulting in a relatively low precision score.

Table 2: Simulation results for the scenarios with a continuous y and continuous \boldsymbol{x}. Scenario C.C.1 is Friedman's example; Scenario C.C.2 includes correlation between predictors. Each score is the average across 100 replications, and the Monte Carlo standard error is given in parentheses.
Scenario C.C.1
n=500,p=50,\sigma^{2}=1    n=500,p=200,\sigma^{2}=1
r_{\text{miss}} rec prec F_{1}    r_{\text{miss}} rec prec F_{1}
DART-20 0 1(0) 0.87(0.01) 0.93(0.01) 0 1(0) 0.86(0.01) 0.92(0.01)
DART-200 0 1(0) 0.7(0.02) 0.81(0.01) 0 1(0) 0.63(0.02) 0.76(0.01)
ABC-10-0.25 0 1(0) 1(0) 1(0) 0 1(0) 1(0) 1(0)
ABC-10-0.50 0 1(0) 1(0) 1(0) 0.33 0.93(0.01) 1(0) 0.96(0.01)
ABC-20-0.25 0 1(0) 0.5(0.01) 0.66(0.01) 0 1(0) 1(0) 1(0)
ABC-20-0.50 0 1(0) 1(0) 1(0) 0 1(0) 1(0) 1(0)
Permute (VIP) 0 1(0) 1(0) 1(0) 0 1(0) 0.95(0.01) 0.97(0)
Permute (MI) 0 1(0) 0.98(0.01) 0.99(0) 0 1(0) 0.94(0.01) 0.97(0.01)
Backward 0 1(0) 0.75(0.02) 0.83(0.02) 0 1(0) 0.74(0.02) 0.83(0.02)
Scenario C.C.2
n=500,p=50,\sigma^{2}=1    n=500,p=200,\sigma^{2}=1
r_{\text{miss}} rec prec F_{1}    r_{\text{miss}} rec prec F_{1}
DART-20 0.43 0.8(0.03) 0.86(0.02) 0.8(0.02) 0.97 0.37(0.03) 0.47(0.04) 0.39(0.03)
DART-200 0 1(0) 0.83(0.02) 0.9(0.01) 0.64 0.65(0.03) 0.49(0.03) 0.53(0.03)
ABC-10-0.25 0.02 1(0) 0.69(0.02) 0.8(0.01) 1 0.08(0.01) 0.28(0.04) 0.13(0.02)
ABC-10-0.50 0.82 0.61(0.02) 0.98(0.01) 0.73(0.02) 1 0(0) 0.02(0.01) 0.01(0.01)
ABC-20-0.25 0 1(0) 0.12(0) 0.22(0) 0.99 0.28(0.02) 0.45(0.04) 0.33(0.03)
ABC-20-0.50 0.1 0.97(0.01) 0.97(0.01) 0.97(0.01) 1 0.01(0) 0.04(0.02) 0.02(0.01)
Permute (VIP) 0 1(0) 0.98(0.01) 0.99(0) 0.71 0.68(0.03) 0.26(0.01) 0.37(0.01)
Permute (MI) 0 1(0) 0.89(0.01) 0.93(0.01) 0.93 0.43(0.03) 0.27(0.02) 0.32(0.02)
Backward 0 1(0) 0.86(0.02) 0.92(0.01) 0.02 0.99(0.01) 0.79(0.03) 0.84(0.03)

When the predictors are correlated (Scenario C.C.2) and p=50, the two permutation-based approaches, the backward selection approach, DART with M=200 trees, and the ABC-20-0.25 approach successfully identify all the relevant predictors in every replication; the first four of these have competitive precision and F_{1} scores. When more correlated and irrelevant predictors are included (p=200), the backward selection procedure has superior performance on all metrics: it identifies all the relevant predictors 98\% of the time and achieves the highest precision (0.79) and F_{1} (0.84) scores. Compared with the independent setting (Scenario C.C.1), ABC Bayesian forest suffers the most from the multicollinearity, as we find that this approach often does not select any predictors.

Considering the first three examples in Table 2, the two proposed approaches, along with the approach of [5] and DART with M=200 trees, consistently perform well in identifying all the relevant predictors. For Scenario C.C.2 with p=200 predictors, the backward selection approach significantly outperforms the other methods.

4.2 Continuous Response and Mixed-Type Predictors

In this setting, we consider Scenario C.M.1, a modified Friedman's example, and Scenario C.M.2, an example that includes correlation between predictors. We consider n\in\{500,1000\}, p\in\{50,200\} and \sigma^{2}\in\{1,10\} for Scenario C.M.1, and n\in\{500,1000\} and \sigma^{2}\in\{1,10\} for Scenario C.M.2.

Scenario C.M.1.

Sample the predictors x_{1},\cdots,x_{\left\lceil p/2\right\rceil} independently from \text{Bernoulli}(0.5), x_{\left\lceil p/2\right\rceil+1},\cdots,x_{p} independently from \text{Uniform}(0,1), and the response y from \text{Normal}(f_{0}(\boldsymbol{x}),\sigma^{2}), where

f_{0}\left(\boldsymbol{x}\right)=10\sin\left(\pi x_{\left\lceil p/2\right\rceil+1}x_{\left\lceil p/2\right\rceil+2}\right)+20\left(x_{\left\lceil p/2\right\rceil+3}-0.5\right)^{2}+10x_{1}+5x_{2}. (15)
Scenario C.M.2.

Sample the predictors x_{1},\cdots,x_{20} independently from \text{Bernoulli}(0.2), x_{21},\cdots,x_{40} independently from \text{Bernoulli}(0.5), x_{41},\cdots,x_{84} from a multivariate normal distribution with mean 0, variance 1 and correlation 0.3, and the response y from \text{Normal}(f_{0}(\boldsymbol{x}),\sigma^{2}), where

f_{0}\left(\boldsymbol{x}\right)=-4+x_{1}+\sin\left(\pi x_{1}x_{44}\right)-x_{21}+0.6x_{41}x_{42}-\exp\left[-2\left(x_{42}+1\right)^{2}\right]-x_{43}^{2}+0.5x_{44}. (16)
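A minimal R sketch of the mixed-type, correlated design in Scenario C.M.2 (again using MASS::mvrnorm for the equicorrelated continuous block; this is an illustration rather than the exact simulation code):

library(MASS)  # mvrnorm() for the correlated continuous block

n <- 500; sigma2 <- 1

# Scenario C.M.2: 40 binary predictors plus 44 equicorrelated normal predictors
x_bin1 <- matrix(rbinom(n * 20, 1, 0.2), n, 20)      # x1,...,x20
x_bin2 <- matrix(rbinom(n * 20, 1, 0.5), n, 20)      # x21,...,x40
R      <- matrix(0.3, 44, 44); diag(R) <- 1          # variance 1, pairwise correlation 0.3
x_cts  <- mvrnorm(n, mu = rep(0, 44), Sigma = R)     # x41,...,x84
x      <- cbind(x_bin1, x_bin2, x_cts)

# regression function (16) and the continuous response
f0 <- -4 + x[, 1] + sin(pi * x[, 1] * x[, 44]) - x[, 21] + 0.6 * x[, 41] * x[, 42] -
      exp(-2 * (x[, 42] + 1)^2) - x[, 43]^2 + 0.5 * x[, 44]
y  <- rnorm(n, f0, sqrt(sigma2))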

We report the simulation results for Scenario C.M.1 with n=500, p\in\{50,200\} and \sigma^{2}=1, and for Scenario C.M.2 with n\in\{500,1000\} and \sigma^{2}=1, in Table 3. Simulation results for the other settings are reported in Table S.1 of the Supplementary Material.

Table 3: A subset of the simulation results for Scenario C.M.1, where predictors are of mixed type and independent, and Scenario C.M.2, where predictors are of mixed type and correlated. The remaining simulation results can be found in Table S.1 of the Supplementary Material.
Scenario C.M.1
n=500,p=50,\sigma^{2}=1    n=500,p=200,\sigma^{2}=1
r_{\text{miss}} rec prec F_{1}    r_{\text{miss}} rec prec F_{1}
DART-20 0 1(0) 0.96(0.01) 0.98(0) 0 1(0) 0.95(0.01) 0.97(0)
DART-200 0 1(0) 0.65(0.02) 0.78(0.01) 0 1(0) 0.63(0.02) 0.76(0.01)
ABC-10-0.25 0 1(0) 0.99(0) 0.99(0) 0 1(0) 1(0) 1(0)
ABC-10-0.50 0 1(0) 1(0) 1(0) 0.56 0.89(0.01) 1(0) 0.94(0.01)
ABC-20-0.25 0 1(0) 0.62(0.01) 0.76(0.01) 0 1(0) 0.95(0.01) 0.97(0)
ABC-20-0.50 0 1(0) 1(0) 1(0) 0 1(0) 1(0) 1(0)
Permute (VIP) 0.29 0.94(0.01) 1(0) 0.97(0.01) 0 1(0) 0.85(0.01) 0.91(0.01)
Permute (Within-Type VIP) 0 1(0) 0.99(0.01) 0.99(0) 0 1(0) 0.73(0.02) 0.84(0.01)
Permute (MI) 0 1(0) 0.97(0.01) 0.98(0) 0 1(0) 0.82(0.01) 0.9(0.01)
Backward 0 1(0) 0.76(0.02) 0.84(0.02) 0 1(0) 0.82(0.02) 0.88(0.02)
Scenario C.M.2
n=500,p=84,\sigma^{2}=1    n=1000,p=84,\sigma^{2}=1
r_{\text{miss}} rec prec F_{1}    r_{\text{miss}} rec prec F_{1}
DART-20 0.37 0.94(0.01) 0.91(0.01) 0.92(0.01) 0.09 0.98(0) 0.91(0.01) 0.94(0.01)
DART-200 0.17 0.97(0.01) 0.49(0.01) 0.64(0.01) 0.01 1(0) 0.58(0.01) 0.73(0.01)
ABC-10-0.25 0.95 0.81(0.01) 0.94(0.01) 0.87(0.01) 0.54 0.91(0.01) 0.99(0) 0.94(0)
ABC-10-0.50 1 0.69(0.01) 1(0) 0.81(0.01) 0.96 0.82(0.01) 1(0) 0.9(0)
ABC-20-0.25 0.57 0.9(0.01) 0.44(0.01) 0.58(0.01) 0.05 0.99(0) 0.69(0.01) 0.8(0.01)
ABC-20-0.50 1 0.78(0.01) 0.99(0) 0.87(0.01) 0.7 0.88(0.01) 1(0) 0.93(0)
Permute (VIP) 0.18 0.97(0.01) 0.93(0.01) 0.95(0.01) 0.51 0.9(0.01) 0.96(0.01) 0.93(0.01)
Permute (Within-Type VIP) 0.44 0.93(0.01) 0.86(0.02) 0.89(0.01) 0.06 0.99(0.01) 0.92(0.01) 0.95(0.01)
Permute (MI) 0.07 0.99(0) 0.85(0.01) 0.91(0.01) 0 1(0) 0.89(0.01) 0.94(0.01)
Backward 0.11 0.98(0.01) 0.47(0.03) 0.57(0.03) 0 1(0) 0.75(0.02) 0.83(0.02)

When the predictors are independent (Scenario C.M.1) and p=50, all the methods, except the permutation-based approach using BART VIP, are able to identify all the relevant predictors. The two proposed permutation-based approaches, DART with M=20 trees, and the ABC Bayesian forest approaches other than ABC-20-0.25 have comparably best precision and F_{1} scores, followed by the backward selection procedure. When the dimension is increased to p=200, the ABC-10-0.50 method again fails to include all the relevant predictors, as in Scenario C.C.1. The permutation-based approach using BART VIP does not perform poorly in this case because, as p gets larger, \bar{c}_{L_{\text{rep}},j\cdot} stays at a similar magnitude while c_{l,j\cdot}^{*} gets smaller (see the left subfigure of Figure S.4 of the Supplementary Material); as such, the offset effect discussed in Section 3.1 disappears. The three proposed approaches are still able to identify all the relevant predictors, but their precision scores are slightly worse than those of the other methods.

When the mixed-type predictors are correlated (Scenario C.M.2), variable selection becomes more challenging. When \sigma^{2}=1, the permutation-based approach using BART MI and the backward selection procedure are the best in terms of the r_{\text{miss}} and recall scores. Furthermore, only these two methods successfully identify all the relevant predictors over all the replications when n=1000. When high noise (\sigma^{2}=10) is included, all the methods have difficulty identifying all the true predictors.

In terms of the recall and r_{\text{miss}} scores, the permutation-based approach using BART MI and the backward selection approach are always the top choices for all the examples in Table 3 and Table S.1, except Scenario C.M.2 with \sigma^{2}=10. Compared to the other two proposed approaches, the permutation-based approach using BART within-type VIP suffers from multicollinearity, though it does improve on the approach of [5] in the presence of a small number of mixed-type predictors.

4.3 Binary Response and Mixed-Type Predictors

In this setting, we consider two scenarios, Scenario B.M.1 and Scenario B.M.2. For Scenario B.M.1, we consider n\in\{500,1000\} and p\in\{50,200\}; for Scenario B.M.2, we consider n\in\{500,1000\}.

Scenario B.M.1.

Sample the predictors x_{1},\cdots,x_{\left\lceil p/2\right\rceil} independently from \text{Bernoulli}(0.5), x_{\left\lceil p/2\right\rceil+1},\cdots,x_{p} independently from \text{Uniform}(0,1), and the response y from \text{Bernoulli}(\Phi(f_{0}(\boldsymbol{x}))), where f_{0}(\boldsymbol{x}) is defined in (15).

Scenario B.M.2.

Sample the predictors x_{1},\cdots,x_{20} independently from \text{Bernoulli}(0.2), x_{21},\cdots,x_{40} independently from \text{Bernoulli}(0.5), x_{41},\cdots,x_{84} from a multivariate normal distribution with mean 0, variance 1 and correlation 0.3, and the response y from \text{Bernoulli}(\Phi(f_{0}(\boldsymbol{x}))), where f_{0}(\boldsymbol{x}) is defined in (16).
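In both scenarios the only change from the continuous-response versions is that the response is drawn through the probit link; a one-line R sketch, reusing a design matrix and regression values f0 generated as in the earlier sketches:

# Binary response through the probit link: y_i ~ Bernoulli(Phi(f0(x_i)))
y_bin <- rbinom(length(f0), size = 1, prob = pnorm(f0))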

Table 4: Simulation results for the scenarios using binary response and mixed-type predictors. Predictors in Scenario B.M.1 are independent and predictors in Scenario B.M.2 are correlated.
Scenario B.M.1
n=500,p=50    n=500,p=200
r_{\text{miss}} rec prec F_{1}    r_{\text{miss}} rec prec F_{1}
DART-20 0.26 0.95(0.01) 0.99(0.01) 0.96(0.01) 0.65 0.87(0.01) 0.99(0) 0.92(0.01)
DART-200 0.13 0.97(0.01) 0.75(0.02) 0.83(0.01) 0.53 0.89(0.01) 0.79(0.02) 0.83(0.01)
ABC-10-0.25 0.55 0.89(0.01) 0.87(0.01) 0.87(0.01) 1 0.8(0) 0.98(0.01) 0.88(0)
ABC-10-0.50 0.97 0.81(0) 1(0) 0.89(0) 1 0.69(0.01) 1(0) 0.81(0.01)
ABC-20-0.25 0.03 0.99(0) 0.18(0) 0.31(0) 0.92 0.82(0.01) 0.85(0.01) 0.83(0.01)
ABC-20-0.50 0.85 0.83(0.01) 0.98(0.01) 0.9(0) 1 0.78(0.01) 0.99(0) 0.87(0)
Permute (VIP) 0.1 0.98(0.01) 0.99(0) 0.99(0) 0.16 0.97(0.01) 0.81(0.01) 0.88(0.01)
Permute (Within-Type VIP) 0.08 0.98(0.01) 1(0) 0.99(0) 0.22 0.96(0.01) 0.8(0.02) 0.87(0.01)
Permute (MI) 0.1 0.98(0.01) 0.96(0.01) 0.97(0.01) 0.25 0.95(0.01) 0.79(0.01) 0.86(0.01)
Backward 0.09 0.98(0.01) 0.93(0.02) 0.94(0.01) 0.37 0.93(0.01) 0.87(0.02) 0.88(0.01)
n=1000,p=50    n=1000,p=200
r_{\text{miss}} rec prec F_{1}    r_{\text{miss}} rec prec F_{1}
DART-20 0.01 1(0) 0.98(0.01) 0.99(0) 0.17 0.97(0.01) 0.98(0.01) 0.97(0.01)
DART-200 0 1(0) 0.95(0.01) 0.97(0.01) 0.03 0.99(0) 0.95(0.01) 0.97(0.01)
ABC-10-0.25 0.01 1(0) 0.95(0.01) 0.97(0.01) 0.82 0.84(0.01) 0.99(0) 0.9(0)
ABC-10-0.50 0.53 0.89(0.01) 1(0) 0.94(0.01) 1 0.8(0) 1(0) 0.89(0)
ABC-20-0.25 0 1(0) 0.24(0) 0.39(0) 0.22 0.96(0.01) 0.91(0.01) 0.93(0.01)
ABC-20-0.50 0.02 1(0) 0.99(0) 0.99(0) 0.96 0.81(0) 1(0) 0.89(0)
Permute (VIP) 0 1(0) 1(0) 1(0) 0 1(0) 0.85(0.01) 0.92(0.01)
Permute (Within-Type VIP) 0 1(0) 0.99(0) 1(0) 0 1(0) 0.85(0.02) 0.91(0.01)
Permute (MI) 0 1(0) 0.96(0.01) 0.98(0) 0 1(0) 0.82(0.01) 0.89(0.01)
Backward 0 1(0) 0.91(0.01) 0.95(0.01) 0 1(0) 0.9(0.01) 0.94(0.01)
Scenario B.M.2
n=500,p=84    n=1000,p=84
r_{\text{miss}} rec prec F_{1}    r_{\text{miss}} rec prec F_{1}
DART-20 0.95 0.75(0.01) 0.91(0.01) 0.82(0.01) 0.62 0.89(0.01) 0.93(0.01) 0.91(0.01)
DART-200 0.98 0.75(0.01) 0.32(0.01) 0.44(0.01) 0.56 0.9(0.01) 0.52(0.02) 0.65(0.01)
ABC-10-0.25 1 0.71(0.01) 0.85(0.01) 0.76(0.01) 1 0.81(0.01) 0.93(0.01) 0.86(0.01)
ABC-10-0.50 1 0.53(0.01) 1(0) 0.69(0.01) 1 0.71(0.01) 1(0) 0.83(0.01)
ABC-20-0.25 0.98 0.8(0.01) 0.22(0) 0.34(0) 0.78 0.87(0.01) 0.33(0.01) 0.48(0.01)
ABC-20-0.50 1 0.62(0.01) 0.96(0.01) 0.75(0.01) 1 0.77(0.01) 0.98(0.01) 0.86(0.01)
Permute (VIP) 0.93 0.8(0.01) 0.81(0.01) 0.8(0.01) 0.28 0.95(0.01) 0.91(0.01) 0.93(0.01)
Permute (Within-Type VIP) 0.96 0.79(0.01) 0.83(0.02) 0.8(0.01) 0.52 0.91(0.01) 0.87(0.02) 0.88(0.01)
Permute (MI) 0.94 0.8(0.01) 0.79(0.01) 0.79(0.01) 0.36 0.94(0.01) 0.84(0.01) 0.88(0.01)
Backward 0.67 0.84(0.01) 0.54(0.03) 0.57(0.03) 0.15 0.97(0.01) 0.55(0.03) 0.63(0.03)

Simulation results are shown in Table 4. In the presence of binary responses and mixed-type predictors, the variable selection results clearly improve when the sample size is increased from n=500 to n=1000. Under the independent setting (Scenario B.M.1), the three proposed approaches, along with the approach of [5], achieve r_{\text{miss}}=0 and \text{rec}\equiv 1 for both p=50 and p=200 when n=1000. DART with M=200 trees obtains slightly worse r_{\text{miss}} and recall scores under these two settings. However, when correlation is introduced, none of the methods work very well in identifying all the relevant predictors, although the backward selection procedure still obtains the best recall and r_{\text{miss}} scores. In fact, we find that the result of the backward selection procedure can be greatly improved by thinning the MCMC chain of BART: for Scenario B.M.2 with n=1000, we obtain r_{\text{miss}}=0.06, \text{rec}=0.99(0.01), \text{prec}=0.67(0.05) and F_{1}=0.73(0.05) by keeping every 10^{\text{th}} posterior sample in each BART run.
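Thinning here simply means retaining every 10th MCMC draw; assuming the keepevery argument of the CRAN BART package behaves as we describe in the comment below, a sketch for one probit BART run within the backward selection is:

library(BART)

# Keep ndpost = 1000 draws while retaining only every 10th MCMC iteration (thinning),
# i.e., 10 * 1000 iterations are run after the 1000 burn-in iterations.
fit <- pbart(x.train = x_train, y.train = y_train, x.test = x_test,
             ntree = 50L, nskip = 1000L, ndpost = 1000L, keepevery = 10L)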

4.4 Conclusion of Simulation Results

We summarize the simulation results above and in Section S.4 of the Supplementary Material in Table 5. We call a variable selection approach successful if it attains r_{\text{miss}} no greater than 0.1 and precision no less than 0.6, i.e., if it has excellent capability of including all the relevant predictors and acceptable capability of excluding irrelevant predictors; successful approaches are checked in Table 5. From the table, we see that the backward selection approach achieves the highest success rate, followed by the two new permutation-based approaches, meaning that the three proposed approaches consistently perform well in identifying all the relevant predictors and excluding irrelevant predictors. A drawback of the three proposed approaches is that, like the existing BART-based variable selection approaches, they suffer from multicollinearity (Scenario C.M.2, Scenario B.C.2 and Scenario B.M.2), especially when the noise is high or the response is binary. Another shortcoming of the backward selection approach is the computational cost of running BART multiple times, but it should be noted that at each step of the backward selection, BART models can be fitted in parallel on multiple cores.

Table 5: Summary of the simulation studies based on Tables 2–4 and Tables S.1–S.2 of the Supplementary Material. Variable selection approaches with r_{\text{miss}} no greater than 0.1 and precision no less than 0.6 are checked.
Evaluated Setting Evaluated Variable Selection Approaches
Scenario (n,p,\sigma^{2}) DART-20 DART-200 ABC-10-0.25 ABC-10-0.50 ABC-20-0.25 ABC-20-0.50 Permute (VIP) Permute (Within-Type VIP) Permute (MI) Backward
C.C.1 (500,50,1) \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark - \checkmark \checkmark
C.C.1 (500,200,1) \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark - \checkmark \checkmark
C.C.2 (500,50,1) \checkmark \checkmark \checkmark \checkmark - \checkmark \checkmark
C.C.2 (500,200,1) - \checkmark
C.M.1 (500,50,1) \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark
C.M.1 (500,200,1) \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark
C.M.1 (1000,50,1) \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark
C.M.1 (1000,200,1) \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark
C.M.1 (1000,50,10) \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark
C.M.1 (1000,200,10) \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark
C.M.2 (500,84,1) \checkmark
C.M.2 (500,84,10)
C.M.2 (1000,84,1) \checkmark \checkmark \checkmark \checkmark \checkmark
C.M.2 (1000,84,10)
B.C.1 (500,50,-) \checkmark \checkmark \checkmark \checkmark \checkmark - \checkmark \checkmark
B.C.1 (500,200,-) \checkmark - \checkmark \checkmark
B.C.2 (500,50,-) - \checkmark
B.C.2 (500,200,-) -
B.M.1 (500,50,-) \checkmark \checkmark \checkmark \checkmark
B.M.1 (500,200,-)
B.M.1 (1000,50,-) \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark \checkmark
B.M.1 (1000,200,-) \checkmark \checkmark \checkmark \checkmark \checkmark
B.M.2 (500,84,-)
B.M.2 (1000,84,-)
Success rate 45.8\% 45.8\% 45.8\% 20.8\% 29.2\% 45.8\% 45.8\% \boldsymbol{62.5\%} \boldsymbol{66.7\%} \boldsymbol{70.8\%}

5 Discussion and Future Work

This paper reviews and explores existing BART-based variable selection methods and introduces three new variable selection approaches. These new approaches are designed to adapt to data with mixed-type predictors and, more importantly, to better ensure that all the relevant predictors are included in the model.

We outline some interesting areas for future work. First, the distribution of the null importance scores in the permutation-based approach is unknown; if this distribution can be approximated appropriately, then a better threshold can be used. Second, from the simulation studies for the backward selection approach, we note that there is only a slight difference in \text{MSE}_{\text{test}} between the winning models from two consecutive steps when both of them include all the relevant predictors. However, if one of the winning models drops a relevant predictor, the difference can become very large. In light of this, it would be advantageous to develop a formal stopping rule, which would also improve the precision score and the efficiency of the backward selection approach. Finally, a potential direction for conducting variable selection based on BART is to take further advantage of the model selection property of BART itself. As discussed in Section 2.1, BART can be regarded as a Bayesian model selection approach, but it does not perform well in the presence of a large number of noise predictors. We may alleviate this issue by muting noise predictors adaptively and externally ([29]).

Acknowledgements

Luo and Daniels were partially supported by NIH R01CA 183854.

References

  • [1] James H. Albert and Siddhartha Chib “Bayesian analysis of binary and polychotomous response data” In J. Amer. Statist. Assoc. 88.422, 1993, pp. 669–679 URL: http://links.jstor.org/sici?sici=0162-1459(199306)88:422%3C669:BAOBAP%3E2.0.CO;2-T&origin=MSN
  • [2] André Altmann, Laura Toloşi, Oliver Sander and Thomas Lengauer “Permutation importance: a corrected feature importance measure” In Bioinformatics 26.10, 2010, pp. 1340–1347 DOI: 10.1093/bioinformatics/btq134
  • [3] Maria Maddalena Barbieri and James O. Berger “Optimal predictive model selection” In Ann. Statist. 32.3, 2004, pp. 870–897 DOI: 10.1214/009053604000000238
  • [4] Anirban Bhattacharya, Debdeep Pati, Natesh S. Pillai and David B. Dunson “Dirichlet-Laplace priors for optimal shrinkage” In J. Amer. Statist. Assoc. 110.512, 2015, pp. 1479–1490 DOI: 10.1080/01621459.2014.960967
  • [5] Justin Bleich, Adam Kapelner, Edward I. George and Shane T. Jensen “Variable selection for BART: an application to gene regulation” In Ann. Appl. Stat. 8.3, 2014, pp. 1750–1781 DOI: 10.1214/14-AOAS755
  • [6] Leo Breiman “Random forests” In Machine learning 45.1 Springer, 2001, pp. 5–32
  • [7] Carlos M. Carvalho, Nicholas G. Polson and James G. Scott “The horseshoe estimator for sparse signals” In Biometrika 97.2, 2010, pp. 465–480 DOI: 10.1093/biomet/asq017
  • [8] Hugh A. Chipman, Edward I. George and Robert E. McCulloch “Bayesian CART Model Search” In J. Amer. Statist. Assoc. 93.443 Taylor & Francis, 1998, pp. 935–948 DOI: 10.1080/01621459.1998.10473750
  • [9] Hugh A. Chipman, Edward I. George and Robert E. McCulloch “BART: Bayesian additive regression trees” In Ann. Appl. Stat. 4.1, 2010, pp. 266–298 DOI: 10.1214/09-AOAS285
  • [10] M.. Efroymson “Multiple regression analysis” In Mathematical methods for digital computers Wiley, New York, 1960, pp. 191–203
  • [11] Jerome H. Friedman “Multivariate adaptive regression splines” With discussion and a rejoinder by the author In Ann. Statist. 19.1, 1991, pp. 1–141 DOI: 10.1214/aos/1176347963
  • [12] Jerome H. Friedman “Greedy function approximation: a gradient boosting machine” In Ann. Statist. 29.5, 2001, pp. 1189–1232 DOI: 10.1214/aos/1013203451
  • [13] Jerome H. Friedman “Stochastic gradient boosting” In Comput. Statist. Data Anal. 38.4, 2002, pp. 367–378 DOI: 10.1016/S0167-9473(01)00065-2
  • [14] Stuart Geman and Donald Geman “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images” In IEEE PAMI PAMI-6.6, 1984, pp. 721–741 DOI: 10.1109/TPAMI.1984.4767596
  • [15] Edward I George and Robert E McCulloch “Variable Selection via Gibbs Sampling” In J. Amer. Statist. Assoc. 88.423 Taylor & Francis, 1993, pp. 881–889 DOI: 10.1080/01621459.1993.10476353
  • [16] Trevor Hastie and Robert Tibshirani “Bayesian backfitting (with comments and a rejoinder by the authors)” In Statist. Sci. 15.3 Institute of Mathematical Statistics, 2000, pp. 196–223 DOI: 10.1214/ss/1009212815
  • [17] W.. Hastings “Monte Carlo sampling methods using Markov chains and their applications” In Biometrika 57.1, 1970, pp. 97–109 DOI: 10.1093/biomet/57.1.97
  • [18] Adam Kapelner and Justin Bleich “bartMachine: Machine Learning with Bayesian Additive Regression Trees” In J. Stat. Softw. 70.4, 2016, pp. 1–40 DOI: 10.18637/jss.v070.i04
  • [19] Antonio R. Linero “Bayesian regression trees for high-dimensional prediction and variable selection” In J. Amer. Statist. Assoc. 113.522, 2018, pp. 626–636 DOI: 10.1080/01621459.2016.1264957
  • [20] Yi Liu, Veronika Ročková and Yuexi Wang “Variable selection with ABC Bayesian forests” In J. R. Stat. Soc. Ser. B. Stat. Methodol. 83.3, 2021, pp. 453–481 DOI: 10.1111/rssb.12423
  • [21] Gilles Louppe “Understanding random forests” In Cornell University Library, 2014
  • [22] Chuji Luo and Michael J. Daniels “The BartMixVs R package”, 2021 URL: https://github.com/chujiluo/BartMixVs
  • [23] Veronika Ročková and Stéphanie Pas “Posterior concentration for Bayesian regression trees and forests” In Ann. Statist. 48.4, 2020, pp. 2108–2131 DOI: 10.1214/19-AOS1879
  • [24] Rodney Sparapani, Charles Spanbauer and Robert McCulloch “Nonparametric machine learning and efficient computation with bayesian additive regression trees: the BART R package” In J. Stat. Softw. 97.1, 2021, pp. 1–66
  • [25] Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis and Torsten Hothorn “Bias in random forest variable importance measures: Illustrations, sources and a solution” In BMC bioinformatics 8.1 Springer, 2007, pp. 25 DOI: 10.1186/1471-2105-8-25
  • [26] Robert Tibshirani “Regression shrinkage and selection via the lasso” In J. Roy. Statist. Soc. Ser. B 58.1, 1996, pp. 267–288 URL: http://links.jstor.org/sici?sici=0035-9246(1996)58:1%3C267:RSASVT%3E2.0.CO;2-G&origin=MSN
  • [27] Aki Vehtari, Andrew Gelman and Jonah Gabry “Erratum to: Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC [ MR3647105]” In Stat. Comput. 27.5, 2017, pp. 1433 DOI: 10.1007/s11222-016-9709-3
  • [28] Chi Wang, Giovanni Parmigiani and Francesca Dominici “Bayesian effect estimation accounting for adjustment uncertainty” In Biometrics 68.3, 2012, pp. 680–686 DOI: 10.1111/j.1541-0420.2011.01735.x
  • [29] Ruoqing Zhu, Donglin Zeng and Michael R. Kosorok “Reinforcement learning trees” In J. Amer. Statist. Assoc. 110.512, 2015, pp. 1770–1784 DOI: 10.1080/01621459.2015.1036994
  • [30] Hui Zou and Trevor Hastie “Regularization and variable selection via the elastic net” In J. R. Stat. Soc. Ser. B Stat. Methodol. 67.5, 2005, pp. 768 DOI: 10.1111/j.1467-9868.2005.00527.x

Supplement to "Variable Selection Using Bayesian Additive Regression Trees"

S.1 Proof of Lemma 3.2

Proof.

From the Cauchy-Schwarz inequality and Definitions 2.1 and 3.1 of the main paper, it is clear that

\left|\tilde{v}_{j}-v_{j}\right| = \left|\sum\limits_{k=1}^{K}\left(c_{jk}\frac{c_{\cdot k}-\bar{c}_{\cdot K}}{K\bar{c}_{\cdot K}c_{\cdot k}}\right)\right| \leq \sum\limits_{k=1}^{K}\left(\frac{1}{K}\frac{c_{jk}}{c_{\cdot k}}\frac{\left|c_{\cdot k}-\bar{c}_{\cdot K}\right|}{\bar{c}_{\cdot K}}\right)
\leq \left[\frac{1}{K}\sum\limits_{k=1}^{K}\left(\frac{c_{jk}}{c_{\cdot k}}\right)^{2}\right]^{1/2}\cdot\frac{\left[\sum\limits_{k=1}^{K}\left(c_{\cdot k}-\bar{c}_{\cdot K}\right)^{2}/K\right]^{1/2}}{\bar{c}_{\cdot K}}
\leq \delta_{1}^{1/2}\delta_{2}.

S.2 Decomposition of Untruncated Metropolis Ratio

The untruncated Metropolis ratio in Equation (9) of the main paper can be explicitly expressed as

r(\eta)= \frac{P\left(T\mid T^{*}\right)}{P\left(T^{*}\mid T\right)}\cdot\frac{P\left(T^{*}\right)}{P\left(T\right)}\cdot\frac{P\left(\boldsymbol{r}\mid T^{*},\sigma^{2}\right)}{P\left(\boldsymbol{r}\mid T,\sigma^{2}\right)}
= \frac{P\left(\text{DEATH}\right)/w_{2}^{*}}{P\left(\text{BIRTH}\right)/\left[b\cdot p_{\text{adj}}(\eta)\cdot n_{j,\text{adj}}(\eta)\right]}\cdot\frac{\gamma\cdot\left[1-\gamma/(2+d_{\eta})^{\beta}\right]^{2}}{\left[(1+d_{\eta})^{\beta}-\gamma\right]\cdot p_{\text{adj}}(\eta)\cdot n_{j,\text{adj}}(\eta)}\cdot\frac{P\left(\boldsymbol{r}\mid T^{*},\sigma^{2}\right)}{P\left(\boldsymbol{r}\mid T,\sigma^{2}\right)}
= \frac{b}{w_{2}^{*}}\cdot\frac{\gamma\cdot\left[1-\gamma/(2+d_{\eta})^{\beta}\right]^{2}}{(1+d_{\eta})^{\beta}-\gamma}\cdot\frac{P\left(\boldsymbol{r}\mid T^{*},\sigma^{2}\right)}{P\left(\boldsymbol{r}\mid T,\sigma^{2}\right)}
= \frac{2b}{b+2}\cdot\frac{\gamma\cdot\left[1-\gamma/(2+d_{\eta})^{\beta}\right]^{2}}{(1+d_{\eta})^{\beta}-\gamma}\cdot\frac{P\left(\boldsymbol{r}\mid T^{*},\sigma^{2}\right)}{P\left(\boldsymbol{r}\mid T,\sigma^{2}\right)},

where w_{2}^{*} is the number of second-generation internal nodes (i.e., nodes with two terminal child nodes) in the proposed tree T^{*}, b is the number of terminal nodes in the current tree T, p_{\text{adj}}(\eta) is the number of predictors available to split on at the node \eta in the current tree T, n_{j,\text{adj}}(\eta) is the number of available values to choose from for the selected predictor x_{j} at the node \eta in the current tree T, and d_{\eta} is the depth of the node \eta in the current tree T.

The second equality holds because the proposed tree T^{*} is identical to the current tree T except for the new split node. The third equality follows from P(\text{BIRTH})=P(\text{DEATH})=0.5. The last equality is due to the properties of a binary tree, b=2w_{2} and w_{2}^{*}=w_{2}+1, where w_{2} is the number of second-generation internal nodes in the current tree T.

S.3 Brief Introduction to BartMixVs

Built upon the CRAN R package BART ([24]), the R package BartMixVs ([22]) implements the three proposed variable selection approaches of the main paper and three existing BART-based variable selection approaches: the permutation-based variable selection approach using BART VIP ([5]), DART ([19]) and ABC Bayesian forest ([20]). The simulation results of this work were obtained using this package, which is available at https://github.com/chujiluo/BartMixVs.

S.4 More Simulations

S.4.1 Additional Simulations for Continuous Response and Mixed-Type Predictors

In this section, we provide additional simulation results for Scenario C.M.1 and Scenario C.M.2 of the main paper, as shown in Table S.1.

S.4.2 Simulations for Binary Response and Continuous Predictors

In this setting, we consider the following two scenarios with n=500 and p\in\{50,200\}.

Scenario B.C.1.

Sample the predictors x_{1},\cdots,x_{p} independently from \text{Uniform}(0,1) and the response y from \text{Bernoulli}(\Phi(f_{0}(\boldsymbol{x}))), where f_{0}(\boldsymbol{x}) is defined in (13) of the main paper.

Scenario B.C.2.

Sample the predictors x_{1},\cdots,x_{p} from \text{Normal}(0,\Sigma) with \Sigma_{jk}=0.3^{|j-k|}, j,k=1,\cdots,p, and the response y from \text{Bernoulli}(\Phi(f_{0}(\boldsymbol{x}))), where f_{0}(\boldsymbol{x}) is defined in (14) of the main paper.

Table S.2 shows the simulation results. When the predictors are independent (Scenario B.C.1) and p=50, all the methods except ABC-10-0.50 show similar performance. As the dimension increases, the two permutation-based approaches and the backward selection approach significantly outperform the other methods in terms of the r_{\text{miss}} and recall scores. When correlation is introduced (Scenario B.C.2), the backward selection approach is the best on all metrics; moreover, it is the only method that identifies all the relevant predictors over all the replications when p=50.

Table S.1: Additional simulation results for Scenario C.M.1 where predictors are of mixed type and independent, and Scenario C.M.2 where predictors are of mixed type and correlated.
Scenario C.M.1
n=1000,p=50,\sigma^{2}=1    n=1000,p=200,\sigma^{2}=1
r_{\text{miss}} rec prec F_{1}    r_{\text{miss}} rec prec F_{1}
DART-20 0 1(0) 0.93(0.01) 0.96(0.01) 0 1(0) 0.91(0.01) 0.95(0.01)
DART-200 0 1(0) 0.6(0.01) 0.75(0.01) 0 1(0) 0.65(0.02) 0.78(0.01)
ABC-10-0.25 0 1(0) 1(0) 1(0) 0 1(0) 1(0) 1(0)
ABC-10-0.50 0 1(0) 1(0) 1(0) 0 1(0) 1(0) 1(0)
ABC-20-0.25 0 1(0) 0.79(0.01) 0.88(0.01) 0 1(0) 0.98(0.01) 0.99(0)
ABC-20-0.50 0 1(0) 1(0) 1(0) 0 1(0) 1(0) 1(0)
Permute (VIP) 1 0.76(0.01) 1(0) 0.86(0.01) 0 1(0) 0.94(0.01) 0.97(0.01)
Permute (Within-Type VIP) 0 1(0) 0.99(0.01) 0.99(0) 0 1(0) 0.69(0.02) 0.81(0.01)
Permute (MI) 0 1(0) 0.98(0.01) 0.99(0) 0 1(0) 0.89(0.01) 0.94(0.01)
Backward 0 1(0) 0.8(0.02) 0.87(0.01) 0 1(0) 0.8(0.02) 0.87(0.01)
n=1000,p=50,\sigma^{2}=10    n=1000,p=200,\sigma^{2}=10
r_{\text{miss}} rec prec F_{1}    r_{\text{miss}} rec prec F_{1}
DART-20 0 1(0) 0.98(0.01) 0.99(0) 0 1(0) 0.96(0.01) 0.97(0.01)
DART-200 0 1(0) 0.67(0.02) 0.79(0.01) 0 1(0) 0.54(0.02) 0.69(0.01)
ABC-10-0.25 0 1(0) 0.97(0.01) 0.98(0) 0 1(0) 0.99(0) 1(0)
ABC-10-0.50 0 1(0) 1(0) 1(0) 0.19 0.96(0.01) 1(0) 0.98(0)
ABC-20-0.25 0 1(0) 0.45(0.01) 0.61(0.01) 0 1(0) 0.92(0.01) 0.96(0.01)
ABC-20-0.50 0 1(0) 0.99(0) 0.99(0) 0 1(0) 1(0) 1(0)
Permute (VIP) 0.11 0.98(0.01) 0.98(0.01) 0.98(0) 0 1(0) 0.69(0.01) 0.81(0.01)
Permute (Within-Type VIP) 0 1(0) 0.95(0.01) 0.97(0.01) 0 1(0) 0.64(0.02) 0.77(0.01)
Permute (MI) 0 1(0) 0.84(0.01) 0.91(0.01) 0 1(0) 0.62(0.01) 0.76(0.01)
Backward 0 1(0) 0.68(0.03) 0.78(0.02) 0 1(0) 0.60(0.02) 0.74(0.02)
Scenario C.M.2
n=500,p=84,\sigma^{2}=10    n=1000,p=84,\sigma^{2}=10
r_{\text{miss}} rec prec F_{1}    r_{\text{miss}} rec prec F_{1}
DART-20 1 0.5(0.02) 0.75(0.02) 0.58(0.01) 0.95 0.68(0.01) 0.85(0.02) 0.75(0.01)
DART-200 1 0.61(0.01) 0.26(0.01) 0.36(0.01) 0.91 0.77(0.01) 0.36(0.01) 0.48(0.01)
ABC-10-0.25 1 0.52(0.01) 0.7(0.02) 0.58(0.01) 1 0.64(0.01) 0.84(0.02) 0.72(0.01)
ABC-10-0.50 1 0.28(0.01) 0.98(0.01) 0.43(0.01) 1 0.43(0.01) 0.99(0) 0.59(0.01)
ABC-20-0.25 0.99 0.7(0.01) 0.18(0) 0.29(0.01) 0.93 0.79(0.01) 0.31(0.01) 0.45(0.01)
ABC-20-0.50 1 0.39(0.01) 0.89(0.02) 0.53(0.01) 1 0.55(0.01) 0.97(0.01) 0.7(0.01)
Permute (VIP) 0.99 0.62(0.02) 0.67(0.02) 0.64(0.01) 0.91 0.79(0.01) 0.77(0.02) 0.77(0.01)
Permute (Within-Type VIP) 1 0.61(0.02) 0.69(0.03) 0.64(0.02) 0.98 0.75(0.02) 0.79(0.02) 0.76(0.01)
Permute (MI) 0.99 0.62(0.01) 0.65(0.02) 0.63(0.01) 0.91 0.8(0.01) 0.73(0.02) 0.75(0.01)
Backward 0.75 0.65(0.03) 0.36(0.03) 0.36(0.02) 0.67 0.81(0.02) 0.31(0.03) 0.37(0.02)
Table S.2: Simulation results for the scenarios using binary responses and continuous predictors. Predictors in Scenario B.C.1 are independent and predictors in Scenario B.C.2 are correlated.
Scenario B.C.1
n=500,p=50    n=500,p=200
r_{\text{miss}} rec prec F_{1}    r_{\text{miss}} rec prec F_{1}
DART-20 0 1(0) 0.98(0.01) 0.99(0) 0.28 0.94(0.01) 0.98(0.01) 0.96(0.01)
DART-200 0 1(0) 0.97(0.01) 0.98(0) 0.11 0.98(0.01) 0.96(0.01) 0.97(0.01)
ABC-10-0.25 0 1(0) 0.95(0.01) 0.97(0) 0.67 0.86(0.01) 1(0) 0.92(0.01)
ABC-10-0.50 0.33 0.93(0.01) 1(0) 0.96(0.01) 0.99 0.74(0.01) 1(0) 0.84(0.01)
ABC-20-0.25 0 1(0) 0.17(0) 0.29(0) 0.15 0.97(0.01) 0.98(0.01) 0.97(0.01)
ABC-20-0.50 0.03 0.99(0) 1(0) 0.99(0) 0.91 0.8(0.01) 1(0) 0.88(0.01)
Permute (VIP) 0 1(0) 1(0) 1(0) 0 1(0) 0.88(0.01) 0.94(0.01)
Permute (MI) 0 1(0) 0.96(0.01) 0.98(0) 0 1(0) 0.91(0.01) 0.95(0.01)
Backward 0.03 0.99(0) 0.92(0.02) 0.94(0.02) 0.05 0.99(0) 0.94(0.02) 0.95(0.01)
Scenario B.C.2
n=500,p=50    n=500,p=200
r_{\text{miss}} rec prec F_{1}    r_{\text{miss}} rec prec F_{1}
DART-20 0.87 0.49(0.03) 0.71(0.04) 0.57(0.03) 1 0.06(0.02) 0.09(0.02) 0.07(0.02)
DART-200 0.42 0.78(0.03) 0.67(0.03) 0.69(0.02) 0.92 0.35(0.03) 0.09(0.01) 0.11(0.02)
ABC-10-0.25 0.46 0.82(0.02) 0.54(0.02) 0.64(0.02) 1 0(0) 0.01(0.01) 0(0)
ABC-10-0.50 1 0.05(0.01) 0.14(0.04) 0.07(0.02) 1 0(0) 0(0) 0(0)
ABC-20-0.25 0 1(0) 0.1(0) 0.18(0) 1 0.01(0) 0.02(0.01) 0.01(0.01)
ABC-20-0.50 0.95 0.43(0.03) 0.77(0.04) 0.53(0.03) 1 0(0) 0(0) 0(0)
Permute (VIP) 0.32 0.9(0.02) 0.85(0.02) 0.86(0.01) 0.99 0.16(0.02) 0.07(0.01) 0.1(0.01)
Permute (MI) 0.7 0.65(0.03) 0.74(0.02) 0.67(0.02) 1 0.03(0.01) 0.04(0.01) 0.03(0.01)
Backward 0 1(0) 0.9(0.02) 0.94(0.01) 0.29 0.86(0.02) 0.62(0.04) 0.61(0.04)

S.5 More Figures

In this section, we list Figures S.1–S.8, which are discussed in the main paper.

Figure S.4 is based on the following example, a generalized example of Example 2 of the main paper.

Example S.1.

For each i=1,\cdots,500, sample x_{i,1},\cdots,x_{i,\left\lceil p/2\right\rceil} independently from \text{Bernoulli}(0.5), x_{i,\left\lceil p/2\right\rceil+1},\cdots,x_{i,p} independently from \text{Uniform}(0,1), and y_{i} from \mathrm{Normal}(f_{0}(\boldsymbol{x}_{i}),1), where

f_{0}(\boldsymbol{x})=10\sin(\pi x_{\left\lceil p/2\right\rceil+1}x_{\left\lceil p/2\right\rceil+2})+20(x_{\left\lceil p/2\right\rceil+3}-0.5)^{2}+10x_{1}+5x_{2}.

We consider Example S.1 with p\in\{20,50,100\} in Figure S.4, where p=20 corresponds to Example 2 of the main paper.

Figure S.1: Variable selection results of [5] with the setting L=100, L_{\text{rep}}=10, \alpha=0.05 and M=20, for Example 2 of the main paper. Red (or green) dots are for continuous (or binary) predictors. Solid (or open) dots are for selected (or not selected) predictors. Each vertical grey line is the local threshold for the corresponding predictor. The relevant binary predictors x_{1} and x_{2} are not identified by this method.
Figure S.2: The top two subfigures depict the error bars of \{c_{\cdot k}\}_{k=1}^{K}, the number of splitting nodes in a posterior sample, for different BART models. The bottom two subfigures depict [\sum_{k=1}^{K}(c_{jk}/c_{\cdot k})^{2}/K]^{1/2}, the square root of the second moment of the proportion of usage c_{jk}/c_{\cdot k}, for different predictors. Each line is for a BART model. The posterior samples are obtained from the BART models fitted in the approach of [5] applied to Example 2 of the main paper. The left two subfigures are based on the 10 repeated BART models (with different initial values) on the original dataset. The right two subfigures are based on the 100 BART models on the null datasets.
Figure S.3: The two subfigures depict the overall counts ratios \bar{c}_{L_{\text{rep}},\cdot\cdot}/c_{l,\cdot\cdot}^{*} (black dots) and the counts ratios \bar{c}_{L_{\text{rep}},j\cdot}/c_{l,j\cdot}^{*} of a predictor (non-black dots) for L=100 permutations. The counts ratios of the binary predictor x_{1} that do not satisfy \bar{c}_{L_{\text{rep}},1\cdot}/c_{l,1\cdot}^{*}>\bar{c}_{L_{\text{rep}},\cdot\cdot}/c_{l,\cdot\cdot}^{*} are marked in red. The posterior samples are obtained from the BART models fitted in the approach of [5] applied to Example 2 of the main paper.
Figure S.4: The left subfigure depicts \bar{c}_{L_{\text{rep}},j\cdot} (bar) and \{c_{l,j\cdot}^{*}\}_{l=1}^{L} (boxplot) for each relevant predictor; the right subfigure depicts \bar{c}_{L_{\text{rep}},\text{cts}\cdot}=(\sum_{r=1}^{L_{\text{rep}}}\sum_{j\in\mathcal{S}_{\text{cts}}}c_{r,j\cdot})/L_{\text{rep}}, \bar{c}_{L_{\text{rep}},\text{bin}\cdot}=(\sum_{r=1}^{L_{\text{rep}}}\sum_{j\in\mathcal{S}_{\text{bin}}}c_{r,j\cdot})/L_{\text{rep}}, and \bar{c}_{L_{\text{rep}},\cdot\cdot} (bars), and \{c_{l,\text{cts}\cdot}^{*}=\sum_{j\in\mathcal{S}_{\text{cts}}}c_{l,j\cdot}^{*}\}_{l=1}^{L}, \{c_{l,\text{bin}\cdot}^{*}=\sum_{j\in\mathcal{S}_{\text{bin}}}c_{l,j\cdot}^{*}\}_{l=1}^{L}, and \{c_{l,\cdot\cdot}^{*}\}_{l=1}^{L} (boxplots). The posterior samples are obtained from the BART models fitted in the approach of [5] applied to Example S.1 with p\in\{20,50,100\}.
Figure S.5: The left subfigure depicts the nodes ratios for different values of b, the number of terminal nodes; the right subfigure depicts the depth ratios for different depths d_{\eta} when \gamma=0.95 and \beta=2.
Figure S.6: Each line depicts the BART MIs of a predictor for different numbers of posterior samples (after burning in 1000 samples) obtained from a BART model built on a null dataset of Example 2 of the main paper.
Figure S.7: Each boxplot depicts \{s_{ij}\}_{i=1}^{100} for the predictor x_{j}. Each blue dot represents s_{j} for the predictor x_{j}. The black dashed line represents s. The red dashed line represents the s within continuous predictors and the green dashed line represents the s within binary predictors.
Figure S.8: Each line depicts the median of BART MIs for a predictor over different numbers of repetitions of BART built on the data generated from Example 2 of the main paper.