
Adaptive Split Balancing for Optimal Random Forest

Yuqian Zhang Institute of Statistics and Big Data, Renmin University of China    Weijie Ji School of Statistics and Management, Shanghai University of Finance and Economics    Jelena Bradic Department of Mathematics and Halicioglu Data Science Institute, University of California, San Diego, E-mail: jbradic@ucsd.edu
Abstract

In this paper, we propose a new random forest algorithm that constructs the trees using a novel adaptive split-balancing method. Rather than relying on the widely used random feature selection, we propose a permutation-based balanced splitting criterion. The adaptive split balancing forest (ASBF) achieves minimax optimality under the Lipschitz class. Its localized version, which fits local regressions at the leaf level, attains the minimax rate under the broad Hölder class $\mathcal{H}^{q,\beta}$ of problems for any $q\in\mathbb{N}$ and $\beta\in(0,1]$. We identify that over-reliance on auxiliary randomness in tree construction may compromise the approximation power of trees, leading to suboptimal results. Conversely, the proposed less random, permutation-based approach demonstrates optimality over a wide range of models. Although random forests are known to perform well empirically, their theoretical convergence rates are slow. Simplified versions that construct trees without data dependence offer faster rates but lack adaptability during tree growth. Our proposed method achieves optimality in simple, smooth scenarios while adaptively learning the tree structure from the data. Additionally, we establish uniform upper bounds and demonstrate that ASBF improves dimensionality dependence in average treatment effect estimation problems. Simulation studies and real-world applications demonstrate our methods' superior performance over existing random forests.

1 Introduction

Renowned for their empirical success, random forests are the preferred method in numerous applied scientific domains. Their versatility has greatly contributed to their popularity, extending well beyond conditional mean problems to include quantile estimation [36], survival analysis [30, 29], and feature selection [24, 37, 34, 33, 4]. Despite their widespread use, the theoretical analysis of random forests remains largely incomplete, even in terms of minimax optimality.

Consider estimating the conditional mean function $m(\mathbf{x}):=\mathbb{E}[Y\mid\mathbf{X}=\mathbf{x}]$ for any $\mathbf{x}\in[0,1]^{d}$, where $Y\in\mathbb{R}$ is the response and $\mathbf{X}\in[0,1]^{d}$ is the covariate vector. Let $\mathbb{S}_{N}:=(Y_{i},\mathbf{X}_{i})_{i=1}^{N}$ be independent and identically distributed (i.i.d.) samples, with $(Y_{i},\mathbf{X}_{i})_{i=1}^{N}\overset{d}{\sim}(Y,\mathbf{X})$. Let $\widehat{m}(\cdot)$ be the random forest estimate constructed from $\mathbb{S}_{N}$. This paper focuses on the integrated mean squared error (IMSE) $\mathbb{E}_{\mathbf{x}}[\widehat{m}(\mathbf{x})-m(\mathbf{x})]^{2}$, where the expectation is taken with respect to the new observation $\mathbf{x}$.

Breiman's original algorithm [8] is based on Classification And Regression Trees (CART) [9]. Its essential components introduce auxiliary randomness through bagging and random feature selection, and play a crucial role in the algorithm's effectiveness. At each node of every tree, the algorithm optimizes the Gini index (for classification) or the prediction squared error (for regression) by selecting axis-aligned splitting directions and locations from a randomly chosen subset of features, using a subset of samples. The overall forest is obtained by averaging over an ensemble of trees. [42] established consistency of Breiman's original algorithm under additive models with continuous components, but did not specify a rate. Recently, [14] established the rate under a "sufficient impurity decrease" (SID) condition for high-dimensional settings, and [32] extended this to cases where the covariates' dimension grows sub-exponentially with the sample size. These results suggest that Breiman's original algorithm retains consistency with discontinuous conditional mean functions and high-dimensional covariates. However, the established consistency rates for smooth functions are slow. For instance, for simple linear functions, the rates are no faster than $N^{-1/(16d^{2})}$ and $1/\log(N)$, respectively; see Table 1.1. Given the theoretical complexities associated with the CART-split criterion, [5] (whose results require prior knowledge of the active features, information that is typically unknown in practice; as an alternative, the author also suggested the honest technique, albeit without providing any theoretical guarantees), [1], and [31] investigated a simplified version named the "centered forest," where the splitting directions are chosen randomly and the splitting points are selected as midpoints of parent nodes. A more advanced variant, the "median forest" [31, 17], selects sample medians as splitting points. Importantly, both centered and median forests' consistency rates are slow, with minimax rates attained only when $d=1$; see Table 1.1 and Figure 1.

Table 1.1: IMSE rates. Splitting criteria that use only $\mathbf{X}_{i}$ are labeled unsupervised, while those using both $\mathbf{X}_{i}$ and $Y_{i}$ are labeled supervised. Note: [12] refers to point-wise MSE at interior points, [10] to in-sample excess risk, and [3, 19] to normality at a specific point.

Methods | Consistency rate | Functional class | Random forest | Splitting criterion
[23] | $N^{-2/3}$ | $\mathcal{H}^{0,1}$, $d=1$ | Purely uniform | Data-independent
[5] | $N^{-\frac{3/4}{q\log(2)+3/4}}$ | $q$-sparse $\mathcal{H}^{0,1}$ | Centered | Data-independent
[1] | $N^{\frac{-2\log(1-1/(2d))}{2\log(1-1/(2d))-\log(2)}}$ | $\mathcal{H}^{1,1}$, $d\leq 3$ | Centered | Data-independent
[1] | $N^{\delta+2\log(\frac{2d-1}{2d})}$, $\delta>0$ | $\mathcal{H}^{1,1}$, $d\geq 4$ | Centered | Data-independent
[38] | $N^{-\frac{2(q+\beta)}{d+2(q+\beta)}}$ | $\mathcal{H}^{q,\beta}$, $q+\beta\leq 1.5$ | Mondrian | Data-independent
[40] | $N^{-\frac{2(q+\beta)}{d+2(q+\beta)}}$ | $\mathcal{H}^{q,\beta}$, $q\in\{0,1\}$, $\beta\in(0,1]$ | Tessellation | Data-independent
[12] | $N^{-\frac{2(q+\beta)}{d+2(q+\beta)}}$ | $\mathcal{H}^{q,\beta}$, $q\in\mathbb{N}$, $\beta\in(0,1]$ | Debiased Mondrian | Data-independent
[10] | $(N/\log(N))^{-\frac{2(q+\beta)}{d+2(q+\beta)}}$ | $\mathcal{H}^{q,\beta}$, $q\in\mathbb{N}$, $\beta\in(0,1]$ | Extrapolated tree | Data-independent
[31] | $(N\log^{(d-1)/2}(N))^{-r}$, $r=\frac{2\log(1-1/(2d))}{2\log(1-1/(2d))-\log(2)}$ | $\mathcal{H}^{0,1}$ | Centered | Data-independent
[31] | $N^{-\frac{2\log(1-1/(2d))}{2\log(1-1/(2d))-\log(2)}}$ | $\mathcal{H}^{0,1}$ | Median | Unsupervised
[17] | $N^{-\frac{\log(1-3/(4d))}{\log(1-3/(4d))-\log(2)}}$ | $\mathcal{H}^{0,1}$ | Median | Unsupervised
[42] | Only $o_p(1)$ | Additive model | Breiman | Supervised
[32] | $O_p(1/\log(N))$ | Additive model | Breiman | Supervised
[14] | $N^{-\frac{c}{\alpha}\land\eta}$, $c<1/4$, $\eta<1/8$ | SID($\alpha$), $\alpha\geq 1$ | Breiman | Supervised
[3] | $N^{\delta-\frac{\log(1-\alpha)}{d\log(\alpha)+\log(1-\alpha)}}$, $\alpha\leq 1$, $\delta>0$ | $\mathcal{H}^{0,1}$ | Honest | Unsupervised if $\alpha=0.5$, supervised if $\alpha<0.5$
[19] | $N^{\delta-\frac{1.3\log(1-\alpha)}{d\log(\alpha)+1.3(1-\alpha)}}$, $\alpha\leq 0.2$, $\delta>0$ | $\mathcal{H}^{1,1}$ | Local linear honest | Supervised
This paper | $N^{\frac{-2\log(1-\alpha)}{d\log(\alpha)+2\log(1-\alpha)}}$ | $\mathcal{H}^{0,1}$ | ASBF | Unsupervised if $\alpha=0.5$, supervised if $\alpha<0.5$
This paper | $N^{\frac{-2(q+\beta)\log(1-\alpha)}{d\log(\alpha)+2(q+\beta)\log(1-\alpha)}}$ | $\mathcal{H}^{q,\beta}$, $q\in\mathbb{N}$, $\beta\in(0,1]$ | L-ASBF | Unsupervised if $\alpha=0.5$, supervised if $\alpha<0.5$

Centered forests and other "purely random forests" [38, 40, 6, 1, 31] grow trees independently of the samples. Few studies among these have achieved minimax rates for smooth functions. [21] achieved near-minimax rates exclusively for Lipschitz functions by using an "early stopping" technique, though their splitting criterion remains data-independent. Mondrian forests [38] attain the IMSE minimax optimal rate for the Hölder class $\mathcal{H}^{q,\beta}$ when $s=q+\beta\leq 1.5$. [40] introduced Tessellation forests, extending minimax optimality to $s\leq 2$. [12] proposed debiased Mondrian forests, establishing minimax optimal rates for any $q\in\mathbb{N}$ and $\beta\in(0,1]$ in terms of the point-wise mean squared error (MSE) $\mathbb{E}[\widehat{m}(\mathbf{x})-m(\mathbf{x})]^{2}$ at fixed interior points. However, this debiasing procedure only corrects for interior bias and not boundary bias, preventing it from achieving minimax optimality in terms of the IMSE. [10] allows arbitrary $q\in\mathbb{N}$ and establishes nearly optimal in-sample excess risk; however, they did not provide upper bounds for the out-of-sample IMSE. All aforementioned works use data-independent splits, restricting the use of data during tree growth. As a result, purely random forests lose the unique advantages of tree-based methods.

Recently, [3, 47, 46, 19] explored an "honest forest" variant that differs from Breiman's forest in two key aspects: (a) each splitting point ensures that the child nodes contain at least an $\alpha\leq 0.5$ fraction of the parent's samples, and (b) the forest is "honest," using two independent sub-samples per tree. One sub-sample's outcomes, together with all the covariates, are used for splitting, while the other sub-sample's outcomes are used for local averaging. Now the splits depend on both the covariates and the outcomes as long as $\alpha<0.5$. For $\alpha=0.5$, their method degenerates to the median forest. Without the $\alpha$-fraction constraint, splits tend to concentrate at the parent node's endpoints, resulting in inaccurate local averaging; see [28, 9, 11]. However, the consistency rates for honest forests remain sub-optimal; see Table 1.1 and Figure 1.

Figure 1: IMSE rates for the Hölder class $\mathcal{H}^{q,\beta}$ with $s=q+\beta$ and $d\in\{2,4\}$ (panels (a) and (b)). Here, $y$ denotes the IMSE exponent, i.e., IMSE $=O_{p}(N^{-y})$ excluding log terms; see Table 1.1. Klu, A&G, D&S, Biau, and Chietal refer to [31], [1], [17], [5], and [14], which provide rates for certain integer $s$. L-ASBF is the proposed local ASBF with $\alpha=0.5$, achieving the minimax optimal rate for any $s>0$.

We aim to improve upon existing methods, which either fail to achieve minimax rates or do not fully utilize the data's information. To address these limitations, we propose a new splitting criterion that eliminates the auxiliary randomness of random feature selection. Instead, our approach employs a permutation-based splitting criterion that balances the depth of the tree branches, resulting in minimax optimal rates in simple, smooth situations. The detailed construction is deferred to Algorithm 1. A special case of the proposed method is the balanced median forest, which attains the minimax optimal rate for Lipschitz continuous functions. For the Hölder class $\mathcal{H}^{q,\beta}$, we propose a local ASBF (L-ASBF) that performs $q$-th order local polynomial regression within the terminal leaves for any $q\geq 1$. A specific case is the local linear approach of [19], where the addition of balanced splitting now ensures a faster convergence rate; see Table 1.1. We show that L-ASBF reaches minimax optimality over the broad Hölder class $\mathcal{H}^{q,\beta}$ for any $q\in\mathbb{N}$ and $\beta\in(0,1]$. To the best of our knowledge, this marks the first random forest that attains minimax optimal rates in terms of the IMSE for the Hölder smoothness class with any $q\geq 2$ and $\beta\in(0,1]$, as seen in Figure 1. We also derive uniform rates and demonstrate minimax optimality for any $q\in\mathbb{N}$; see Section 3.3.

Random forests have been extensively applied to causal inference, with much of the literature focusing on estimating the conditional average treatment effect (CATE); see [46, 3]. Estimating the average treatment effect (ATE), however, poses unique challenges; see Remark 1. We use ASBF with augmented inverse propensity weighting (AIPW) to estimate the ATE. In contrast to existing random forest methods, whose ATE inferential guarantees would confine the feature dimension to one, our methods support multi-dimensional features, significantly broadening the practical applications of forest-based ATE inference.

2 Adaptive Split Balancing Forest

A regression tree models the function $m(\cdot)$ by recursively partitioning the feature space $[0,1]^{d}$ into non-overlapping rectangles, generally called leaves or nodes. For any given point $\mathbf{x}\in[0,1]^{d}$, a regression tree estimates $m(\mathbf{x})$ using the average of responses for those samples in the same leaf as $\mathbf{x}$: $\mathrm{T}(\mathbf{x},\xi)=\sum_{i\in\mathcal{I}}\mathbbm{1}_{\{\mathbf{X}_{i}\in L(\mathbf{x},\xi)\}}Y_{i}/\sum_{i\in\mathcal{I}}\mathbbm{1}_{\{\mathbf{X}_{i}\in L(\mathbf{x},\xi)\}}$, where $\xi$ denotes all the auxiliary randomness in the tree-growing process and is independent of the samples, $\Xi$ denotes the support of $\xi$, $\mathcal{I}\subseteq\{1,\dots,N\}$ is the set of indices of training samples used for local averaging (possibly depending on $\xi$), and $L(\mathbf{x},\xi)$ represents the terminal leaf containing the point $\mathbf{x}$.

Random forests consider ensembles of regression trees, where the forest's prediction is the average of all the tree predictions. Let $\{\mathrm{T}(\mathbf{x},\xi_{j}),j=1,\dots,B\}$ denote the collection of regression trees in a forest, where $B$ is the number of trees and $\xi_{1},\dots,\xi_{B}\in\Xi$ are i.i.d. auxiliary variables. For any $B\geq 1$, random forests estimate the conditional mean as $\widehat{m}(\mathbf{x}):=B^{-1}\sum_{j=1}^{B}\mathrm{T}(\mathbf{x},\xi_{j})=\mathbb{E}_{\xi}[\mathrm{T}(\mathbf{x},\xi)]$, where for any function $f(\cdot)$, $\mathbb{E}_{\xi}[f(\xi)]:=B^{-1}\sum_{j=1}^{B}f(\xi_{j})$ denotes the empirical average over the auxiliary variables, and we omit the dependence of such an expectation on $B$ for notational simplicity. Using the introduced notation, random forests can also be represented as a weighted average of the outcomes:

$$\widehat{m}(\mathbf{x})=\mathbb{E}_{\xi}\left[\sum_{i\in\mathcal{I}}\omega_{i}(\mathbf{x},\xi)Y_{i}\right],\quad\text{where}\quad\omega_{i}(\mathbf{x},\xi):=\frac{\mathbbm{1}_{\{\mathbf{X}_{i}\in L(\mathbf{x},\xi)\}}}{\sum_{l\in\mathcal{I}}\mathbbm{1}_{\{\mathbf{X}_{l}\in L(\mathbf{x},\xi)\}}}. \tag{2.1}$$
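For concreteness, the weighted-average representation (2.1) can be written in a few lines. The sketch below is a minimal illustration only: it assumes a hypothetical helper `leaf_id` that returns the identifier of the terminal leaf containing a point for an already-grown tree, and is not the full Algorithm 1.

```python
import numpy as np

def tree_predict(x, X_I, Y_I, leaf_id):
    # Single-tree prediction T(x, xi): average the responses of the I-sample
    # points that fall in the same terminal leaf as x.
    in_leaf = np.array([leaf_id(row) == leaf_id(x) for row in X_I])
    weights = in_leaf / in_leaf.sum()          # omega_i(x, xi) from (2.1)
    return float(weights @ Y_I)

def forest_predict(x, trees):
    # Forest prediction: empirical average E_xi[T(x, xi)] over B trees, each
    # supplied as an (X_I, Y_I, leaf_id) triple with its own randomness xi_b.
    return float(np.mean([tree_predict(x, X_I, Y_I, lid) for X_I, Y_I, lid in trees]))
```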

To study the estimation rates of random forests, we consider the following decomposition of the IMSE: $\mathbb{E}_{\mathbf{x}}[\widehat{m}(\mathbf{x})-m(\mathbf{x})]^{2}\leq 2R_{1}+2R_{2}$, where $R_{1}:=\mathbb{E}_{\mathbf{x}}[\mathbb{E}_{\xi}[\sum_{i\in\mathcal{I}}\omega_{i}(\mathbf{x},\xi)\varepsilon_{i}]]^{2}$ is the stochastic error originating from the random noise $\varepsilon_{i}=Y_{i}-m(\mathbf{X}_{i})$, and $R_{2}:=\mathbb{E}_{\mathbf{x}}[\mathbb{E}_{\xi}[\sum_{i\in\mathcal{I}}\omega_{i}(\mathbf{x},\xi)(m(\mathbf{X}_{i})-m(\mathbf{x}))]]^{2}$ is the approximation error. Let $k$ be the minimum leaf size. Standard techniques lead to $R_{1}=O_{p}(1/k)$; see the Supplement. The control of the remaining approximation error is the key to reaching an optimal overall IMSE.

2.1 Auxiliary randomness and approximation error

We focus on the Lipschitz class here, with the Hölder class explored in Section 3. In the following, we illustrate how the auxiliary randomness introduced by the widely-used random feature selection affects the approximation error of tree models.

Assumption 1 (Lipschitz continuous).

Assume that $m(\cdot)$ satisfies $|m(\mathbf{x})-m(\mathbf{x}^{\prime})|\leq L_{0}\|\mathbf{x}-\mathbf{x}^{\prime}\|$ for all $\mathbf{x},\mathbf{x}^{\prime}\in[0,1]^{d}$ with some constant $L_{0}>0$.

For any leaf $L\subseteq[0,1]^{d}$, let $\mathrm{diam}(L):=\sup_{\mathbf{x},\mathbf{x}^{\prime}\in L}\|\mathbf{x}-\mathbf{x}^{\prime}\|$ be its diameter. Under the Lipschitz condition, the approximation error can be controlled by the leaves' diameters: $R_{2}\leq L_{0}^{2}\mathbb{E}_{\mathbf{x}}[\mathbb{E}_{\xi}[\mathrm{diam}^{2}(L(\mathbf{x},\xi))]]$. Therefore, it suffices to obtain an upper bound for the diameters. Although Breiman's algorithm selects up to $\mathrm{mtry}\leq d$ features at random, we find it worthwhile to study the special case with $\mathrm{mtry}=1$, as seen in centered and median forests. Recall that in the centered forest, a splitting direction is randomly selected (with probability $1/d$) for each split, and the splitting location is chosen as the center point of the parent node. For the moment, let $B=\infty$, meaning the forest is the ensemble of infinitely many trees. Suppose that each terminal leaf has been split $\mathcal{M}$ times, and consider a sequence of $\mathcal{M}$ consecutive leaves containing $\mathbf{x}$, with the smallest (terminal) leaf denoted as $L(\mathbf{x},\xi)$.

For each $m\leq\mathcal{M}$ and $j\leq d$, let $\delta_{j,m}(\mathbf{x},\xi)=1$ if the $m$-th split is performed along the $j$-th coordinate, and $\delta_{j,m}(\mathbf{x},\xi)=0$ otherwise. Here, $\mathbb{P}_{\xi}$ and $\mathbb{E}_{\xi}$ represent the corresponding probability measure and the expectation taken with respect to $\xi$, respectively. For any given $j\leq d$, the sequence $(\delta_{j,m}(\mathbf{x},\xi))_{m=1}^{\mathcal{M}}$ is i.i.d., with $\mathbb{P}_{\xi}(\delta_{j,m}(\mathbf{x},\xi)=1)=1/d$ for the centered forest. Let $\mathrm{diam}_{j}(L(\mathbf{x},\xi))$ be the length of the longest segment parallel to the $j$-th axis that is a subset of $L(\mathbf{x},\xi)$. Then,

$$\mathbb{E}_{\xi}[\mathrm{diam}^{2}(L(\mathbf{x},\xi))]=\sum_{j=1}^{d}\mathbb{E}_{\xi}[\mathrm{diam}_{j}^{2}(L(\mathbf{x},\xi))]=\sum_{j=1}^{d}\mathbb{E}_{\xi}\left[\prod_{m=1}^{\mathcal{M}}2^{-2\delta_{j,m}(\mathbf{x},\xi)}\right]=\sum_{j=1}^{d}\mathbb{E}_{\xi}\left[2^{-2c_{j}(\mathbf{x},\xi)}\right]\overset{(i)}{>}\sum_{j=1}^{d}2^{-2\mathbb{E}_{\xi}[c_{j}(\mathbf{x},\xi)]}=\sum_{j=1}^{d}2^{-2\sum_{m=1}^{\mathcal{M}}1/d}=d\,2^{-2\mathcal{M}/d}. \tag{2.2}$$

The discrepancy introduced by the strict inequality (i), the "Jensen gap," stems from the variation in the quantity $c_{j}(\mathbf{x},\xi):=\sum_{m=1}^{\mathcal{M}}\delta_{j,m}(\mathbf{x},\xi)$, representing the count of splits along the $j$-th direction induced by the auxiliary randomness $\xi$. This discrepancy results in a relatively large approximation error $R_{2}\asymp(1-1/(2d))^{2\mathcal{M}}$ (up to logarithmic terms). By selecting an optimal $\mathcal{M}$ (or $k$) that strikes a balance between stochastic and approximation errors, the centered forest yields an overall IMSE of size $N^{\frac{-2\log(1-1/(2d))}{2\log(1-1/(2d))-\log(2)}}$, which is not minimax optimal for Lipschitz functions whenever $d>1$, as depicted in Figure 1.
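The Jensen gap in (2.2) is easy to see numerically. In the following minimal simulation (illustrative only, not part of the proposed algorithm), the split count $c_{j}(\mathbf{x},\xi)$ of a centered forest follows a Binomial($\mathcal{M}$, $1/d$) distribution, and the Monte Carlo average of $2^{-2c_{j}}$ is compared with the balanced benchmark $2^{-2\mathcal{M}/d}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 2, 10                       # dimension and number of splits per terminal leaf
B = 100_000                        # Monte Carlo replications of one tree path

# c_j: number of splits along coordinate j when directions are picked uniformly at random
c_j = rng.binomial(M, 1.0 / d, size=B)

lhs = np.mean(4.0 ** (-c_j))       # E_xi[2^{-2 c_j}]; exactly (1 - 3/(4d))^M ~ 0.0091 here
rhs = 4.0 ** (-M / d)              # 2^{-2 E_xi[c_j]} ~ 0.0010: what balanced splits attain
print(lhs, rhs)
```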

The sub-optimality stems from the forests’ excessive reliance on auxiliary randomness, leading to a significant number of redundant and inefficient splits. When splitting directions are chosen randomly, there is a non-negligible probability that some directions are overly selected while others are scarcely chosen. Consequently, terminal leaves tend to be excessively wide in some directions and overly narrow in others. For centered and median forests, this long and narrow leaf structure is solely due to auxiliary randomness and is unrelated to the data. This prevalence of long and narrow leaves increases the expected leaf diameter, resulting in a significant approximation error that deviates from the optimal one. Averaging across the trees stabilizes the forest, but to attain minimax optimal rates, controlling the number of splits at the tree level is essential to mitigate the Jensen gap, as shown in (2.2).

Instead of selecting splitting directions randomly, we adopt a permutation-based and more balanced approach. By ensuring a sufficiently large $c_{j}(\mathbf{x},\xi)$ for each $j\leq d$ and reducing dependence on the auxiliary randomness $\xi$, we achieve an approximation error of $O((N/k)^{-2/d})$, as shown in Lemma 2.1. This upper bound mirrors the right-hand side of (2.2) (whereas centered or median forests cannot improve the left-hand side) when $\mathcal{M}\approx\log_{2}(N/k)$, resulting in a minimax optimal rate for the overall IMSE when $\mathcal{M}$ (or $k$) is chosen to balance the approximation and the stochastic error.

2.2 Adaptive Split Balancing

To reduce the large approximation error caused by auxiliary randomness, we propose a permutation-based splitting method. Each time a leaf is split, we randomly select a direction among those along which it has been split the least number of times. Moreover, the splitting directions are chosen in a balanced fashion: on any given tree path, we split once in each direction before proceeding to the next round.

Algorithm 1 Adaptive split balancing forests
1: Observations $\mathbb{S}_{N}=(\mathbf{X}_{i},Y_{i})_{i=1}^{N}$, with $B\geq 1$, $\alpha\in(0,0.5]$, $w\in(0,1]$, and $k\leq\lfloor wN\rfloor$.
2: for $b=1,\dots,B$ do
3:     Divide $\mathbb{S}_{N}$ into disjoint $\mathbb{S}_{\mathcal{I}}^{(b)}$ and $\mathbb{S}_{\mathcal{J}}^{(b)}$ with $|\mathcal{I}^{(b)}|=\lfloor wN\rfloor$ and $|\mathcal{J}^{(b)}|=N-\lfloor wN\rfloor$.
4:     repeat for each current node $L\subseteq[0,1]^{d}$:
5:         Select a direction $j$ along which the node has been split the least number of times.
6:         Partition along the $j$-th direction to minimize the mean squared error (MSE) on $\mathbb{S}_{\mathcal{J}}^{(b)}$:
           $$\sum_{i\in\mathcal{J}^{(b)}}(Y_{i}-\overline{Y}_{1})^{2}\mathbbm{1}\{\mathbf{X}_{i}\in L_{1}\}+\sum_{i\in\mathcal{J}^{(b)}}(Y_{i}-\overline{Y}_{2})^{2}\mathbbm{1}\{\mathbf{X}_{i}\in L_{2}\}, \tag{2.3}$$
           $$\text{ensuring }\#\{i\in\mathcal{I}^{(b)}:\mathbf{X}_{i}\in L_{l}\}\geq\alpha\,\#\{i\in\mathcal{I}^{(b)}:\mathbf{X}_{i}\in L\},\quad l=1,2. \tag{2.4}$$
           Here, $\overline{Y}_{1}$ and $\overline{Y}_{2}$ are the average responses within the child nodes $L_{1}$ and $L_{2}$.
7:     until each node contains $k$ to $2k-1$ samples of $\mathbb{S}_{\mathcal{I}}^{(b)}$.
8:     Estimate $m(\mathbf{x})$ with the $b$-th adaptive split balancing tree:
       $$T(\mathbf{x},\xi_{b}):=\frac{\sum_{i\in\mathcal{I}^{(b)}}\mathbbm{1}_{\{\mathbf{X}_{i}\in L(\mathbf{x},\xi_{b})\}}Y_{i}}{\sum_{i\in\mathcal{I}^{(b)}}\mathbbm{1}_{\{\mathbf{X}_{i}\in L(\mathbf{x},\xi_{b})\}}}. \tag{2.5}$$
9: end for
10: return the adaptive split balancing forest estimate $\widehat{m}(\mathbf{x}):=B^{-1}\sum_{b=1}^{B}T(\mathbf{x},\xi_{b})$.
Figure 2: Illustrations of the balanced tree-growing process with $d=2$. The purple shading represents the splitting range that satisfies the $\alpha$-fraction constraint. The red lines indicate the chosen splits.

To further enhance practical performance, we employ the data-dependent splitting rules of the "honest forests" approach, which depend on both $\mathbf{X}_{i}$ and $Y_{i}$. This flexibility is particularly valuable when the local smoothness level varies across directions and locations. For each tree, we partition the samples into two sets, denoted by $\mathcal{I}$ and $\mathcal{J}$. Additionally, we impose constraints on the child nodes' $\alpha$-fraction and the terminal leaf size $k$. With $\alpha\in(0,0.5]$ and $k\in\mathbb{N}$, we require the following conditions to hold: (a) each child node contains at least an $\alpha$-fraction of the observations within the parent node, and (b) the number of observations within each terminal leaf is between $k$ and $2k-1$. The splitting locations are then determined to minimize the empirical mean squared error (or maximize the impurity gain) within each parent node, selected from the set of points satisfying the above conditions. For more details, refer to Algorithm 1 and the illustration in Figure 2.
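A minimal sketch of a single split (Steps 5 and 6 of Algorithm 1) is given below; it is illustrative only, assumes the node's $\mathcal{I}$- and $\mathcal{J}$-sample coordinates are already extracted, and leaves the recursion and the leaf-size stopping rule (Step 7) to the caller.

```python
import numpy as np

def balanced_split(xI, xJ_all, yJ, split_counts, alpha):
    # xI: I-sample covariates in the current node (n_I x d); xJ_all, yJ: J-sample
    # covariates and responses in the node. `split_counts[j]` counts how many times
    # the node's path has already been split along coordinate j.
    j = int(np.argmin(split_counts))               # least-split direction (ties: first)
    zI, zJ = np.sort(xI[:, j]), xJ_all[:, j]
    nI = len(zI)

    lo = max(1, int(np.ceil(alpha * nI)))          # alpha-fraction constraint (2.4) on I
    hi = nI - lo
    if hi < lo:                                    # too few points: fall back to a median cut
        lo = hi = nI // 2
    cuts = [(zI[t - 1] + zI[t]) / 2 for t in range(lo, hi + 1)]

    def child_mse(c):                              # criterion (2.3) evaluated on J
        sse = lambda y: float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0
        return sse(yJ[zJ <= c]) + sse(yJ[zJ > c])

    cut = min(cuts, key=child_mse)
    new_counts = np.array(split_counts, copy=True)
    new_counts[j] += 1
    return j, cut, new_counts
```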

Algorithm 2 Sparse adaptive split balancing forests
1: Observations $\mathbb{S}_{N}=(\mathbf{X}_{i},Y_{i})_{i=1}^{N}$, with $B\geq 1$, $\alpha\in(0,0.5]$, $w\in(0,1]$, $k\leq\lfloor wN\rfloor$, and $\mathrm{mtry}\in\{1,\dots,d\}$.
2: for $b=1,\dots,B$ do
3:     Divide $\mathbb{S}_{N}$ into disjoint $\mathbb{S}_{\mathcal{I}}^{(b)}$ and $\mathbb{S}_{\mathcal{J}}^{(b)}$ with $|\mathcal{I}^{(b)}|=\lfloor wN\rfloor$ and $|\mathcal{J}^{(b)}|=N-\lfloor wN\rfloor$.
4:     Create a collection of index sets, $\mathcal{Q}([0,1]^{d})=\{Q_{1},\dots,Q_{d}\}$, with each $Q_{j}\subset\{1,\dots,d\}$ containing $\mathrm{mtry}$ directions, while ensuring that each direction $1,\dots,d$ appears in exactly $\mathrm{mtry}$ of the sets $Q_{1},\dots,Q_{d}$.
5:     repeat for each current node $L\subseteq[0,1]^{d}$:
6:         Randomly select a set $Q\in\mathcal{Q}(L)$.
7:         Partition along a direction $j\in Q$ to minimize the MSE as in (2.3) and (2.4).
8:         if $\#\mathcal{Q}(L)>1$ then
9:             $\mathcal{Q}(L_{1})=\mathcal{Q}(L_{2})=\mathcal{Q}(L)\setminus\{Q\}$.
10:        else
11:            Randomly reinitialize $\mathcal{Q}(L_{1})$ and $\mathcal{Q}(L_{2})$ as in Step 4.
12:        end if
13:    until each current node contains $k$ to $2k-1$ samples of $\mathbb{S}_{\mathcal{I}}^{(b)}$.
14:    The $b$-th sparse adaptive split balancing tree estimates $m(\mathbf{x})$ as in (2.5).
15: end for
16: return the sparse adaptive split balancing forest as the average of the $B$ trees.

The parameter $\alpha$ controls the balance in leaf lengths, and any $\alpha>0$ prevents greedy splits near the endpoints of the parent node. For any $\alpha<0.5$, the terminal leaves may still vary in width, but they will be more balanced than before, hence the term "adaptive split-balancing." Unlike centered and median forests, the long and narrow leaf structure now depends on the data. This distinction sets our method apart from methods with data-independent splitting rules (e.g., [38, 40, 21]), allowing us to improve empirical performance, particularly when covariates have distinct local effects on the outcome.

While Algorithm 1 selects splitting points data-dependently and can adaptively learn local smoothness levels for different directions, it still encounters challenges in the presence of certain sparse structures. Notably, even if some covariates are entirely independent of the outcome, Algorithm 1 may still make splits along such directions. These splits are redundant, as they do not contribute to reducing the approximation error but increase the stochastic error, due to the smaller sample sizes in the resulting child nodes.

To address sparse situations, we introduce a tuning parameter, $\mathrm{mtry}\in\{1,\dots,d\}$, representing the number of candidate directions for each split. Unlike existing methods, we select the candidate directions in a balanced fashion instead of randomly. In each round, we initiate a collection of $d$ index sets, each containing $\mathrm{mtry}$ distinct directions, ensuring each direction appears in exactly $\mathrm{mtry}$ sets. At each split, we randomly select one unused index set for the current round. Alternatively, we can achieve the same result by shuffling the directions before each round and setting the index sets as $\{1,2,\dots,\mathrm{mtry}\}$, $\{2,3,\dots,\mathrm{mtry}+1\}$, etc., taken in a random order; see Figure 3. The selected index set provides the candidate splitting directions for the current node. Details are provided in Algorithm 2.

[Flowchart: shuffle the directions; successive splits then select from the candidate direction sets $\{1,2,3\}$, $\{2,3,4\}$, $\{3,4,5\}$, $\{4,5,1\}$, $\{5,1,2\}$ (selected directions 2, 2, 4, 4, 2), after which the directions are shuffled again.]
Figure 3: An illustration of the splitting directions undergone by a leaf in Algorithm 2.
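The shuffled round-robin bookkeeping of Algorithm 2 (Steps 4 and 11, illustrated in Figure 3) can be sketched as follows; this is a minimal illustration of the candidate-set construction only, not of the splits themselves.

```python
import numpy as np

def new_round(d, mtry, rng):
    # Shuffle the d directions and form d candidate sets of size mtry as
    # consecutive (cyclic) windows, so each direction appears in exactly mtry sets.
    perm = rng.permutation(d)
    sets = [perm[np.arange(j, j + mtry) % d] for j in range(d)]
    rng.shuffle(sets)                      # consume the sets in a random order
    return sets

rng = np.random.default_rng(1)
queue = new_round(d=5, mtry=3, rng=rng)
for split in range(7):                     # e.g. seven consecutive splits on one tree path
    if not queue:                          # round exhausted: reinitialize (Step 11)
        queue = new_round(5, 3, rng)
    candidates = queue.pop()               # candidate splitting directions for this node
    print(split, sorted(candidates.tolist()))
```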

When the true conditional mean function is sparse, setting a relatively large $\mathrm{mtry}$ avoids splits on redundant directions. Conversely, when all directions contribute similarly, a smaller $\mathrm{mtry}$ prevents inefficient splits due to sample randomness. However, existing methods select candidate sets randomly, causing some directions to be considered as candidates much more frequently than others. This randomness can reduce approximation power by neglecting important features. In contrast, our balanced approach ensures all features are equally considered for selection at the tree level, making all the trees perform similarly and leading to a smaller estimation error for the forest. Due to the Jensen gap, the error of the averaged forest mainly stems from the poorer-performing trees, as the approximation power depends non-linearly on the number of splits in important features. The tuning parameters $\alpha$ (location) and $\mathrm{mtry}$ (direction) control the "greediness" of each split during the tree-growing procedure. Greedy splits, characterized by a small $\alpha$ and a large $\mathrm{mtry}$, lead to better adaptability but are sensitive to noise, making them sub-optimal under the Lipschitz or Hölder class. Conversely, non-greedy splits, which use a larger $\alpha$ and a smaller $\mathrm{mtry}$, achieve minimax optimal results under smooth functional classes but are non-adaptive to complex (e.g., sparse) scenarios.

2.3 Theoretical results

For simplicity, we consider uniformly distributed $\mathbf{X}$ with support $[0,1]^{d}$, as seen in [1, 23, 5, 31, 10, 17, 42, 19, 11, 36, 35, 46]. We first highlight the advantages of the balanced approach.

Lemma 2.1.

For any $r>0$ and $\alpha\in(0,0.5]$, the leaves constructed by Algorithm 1 satisfy

$$\sup_{\mathbf{x}\in[0,1]^{d},\,\xi\in\Xi}\mathbb{E}_{\mathbb{S}_{\mathcal{I}}}[\mathrm{diam}^{r}(L(\mathbf{x},\xi))]<2^{r}d^{\max\{r/2,1\}}\exp(r^{2}+r+1)\left(\lfloor wN\rfloor/(2k-1)\right)^{-\frac{r\log(1-\alpha)}{d\log(\alpha)}}.$$

The rightmost term of the above expression dominates the rest and determines the expected size of the diameter: $(\lfloor wN\rfloor/(2k-1))^{-\frac{r\log(1-\alpha)}{d\log(\alpha)}}\asymp(w/2)^{-\frac{r\log(1-\alpha)}{d\log(\alpha)}}\cdot(N/k)^{-\frac{r\log(1-\alpha)}{d\log(\alpha)}}\asymp(k/N)^{\frac{r\log(1-\alpha)}{d\log(\alpha)}}$. Recall that the sample-splitting fraction $w\in(0,1]$ remains constant and that the minimum node size $k$ satisfies $k\leq\lfloor wN\rfloor$. According to Lemma 2.1, the proposed forests' approximation error is of the order $(k/N)^{\frac{2\log(1-\alpha)}{d\log(\alpha)}}$. Our balanced median forest, with $\alpha=0.5$, yields a rate of $(k/N)^{\frac{2}{d}}$, whereas a standard median forest with random feature selection has a strictly larger rate of $(k/N)^{2\log_{2}\left(\frac{2d}{2d-1}\right)}$ [31] whenever $d>1$; for $d=1$, the rates are the same (there is no need to choose a splitting direction). Centered forests achieve comparable upper bounds, which suggests the sub-optimality of auxiliary randomness.

We further assume the following standard condition for the noise variable.

Assumption 2.

Assume that $\mathbb{E}[\varepsilon^{2}\mid\mathbf{X}]\leq M$ almost surely for some constant $M>0$.

The following theorem characterizes the IMSE of the proposed forest in Algorithm 1.

Theorem 2.2.

Let Assumptions 1 and 2 hold. Suppose that $w\in(0,1]$ and $\alpha\in(0,0.5]$ are both constants. Choose any $B\in\mathbb{N}$ and $k\leq\lfloor wN\rfloor$. Then, as $N\to\infty$, the adaptive split balancing forest proposed in Algorithm 1 satisfies $\mathbb{E}_{\mathbf{x}}[\widehat{m}(\mathbf{x})-m(\mathbf{x})]^{2}=O_{p}\left(1/k+(k/N)^{\frac{2\log(1-\alpha)}{d\log(\alpha)}}\right)$. Moreover, choosing $k\asymp N^{\frac{2\log(1-\alpha)}{d\log(\alpha)+2\log(1-\alpha)}}$, we have

$$\mathbb{E}_{\mathbf{x}}[\widehat{m}(\mathbf{x})-m(\mathbf{x})]^{2}=O_{p}\left(N^{-\frac{2\log(1-\alpha)}{d\log(\alpha)+2\log(1-\alpha)}}\right).$$
Figure 4: IMSE convergence rates from Theorem 2.2 for different feature dimensions: $N^{-\frac{2\log(1-\alpha)}{d\log(\alpha)+2\log(1-\alpha)}}$ (top) and $N^{-\frac{2}{d+2}}$ (bottom). The darker plot represents the minimax optimal rate; the darker the overlap, the closer the rates.

The issue of excess auxiliary randomness is inherent at the tree level, meaning that merely collecting more samples or further averaging does not address the optimality of the approximation error. We resolve this through a new splitting criterion that yields minimax optimal rates.

For instance, the split-balancing median forest is the first minimax optimal median forest. When setting $\alpha=0.5$, the split-balancing random forest becomes a new, balanced median forest with the minimax optimal rate of $N^{-\frac{2}{d+2}}$. In contrast, existing median and centered forests [31, 17, 5, 1] do not achieve the minimax optimal rate; see Table 1.1 and Figure 1. The conclusions of Theorem 2.2 remain valid for any constant $\alpha$ in the range $(0,0.5]$. Greedy splits (small $\alpha$) enhance adaptability by leveraging information from both $\mathbf{X}_{i}$ and $Y_{i}$, whereas non-greedy splits attain minimax optimality, necessitating a balance in practice. The rates are depicted in Figure 4.
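To make the trade-off concrete, the small helper below (illustrative only) evaluates the IMSE exponent of Theorem 2.2 and the order of the corresponding leaf-size choice for given $d$ and $\alpha$; at $\alpha=0.5$ it returns the minimax exponent $2/(d+2)$, matching Figure 4.

```python
import numpy as np

def imse_exponent(d, alpha=0.5):
    # r such that the IMSE in Theorem 2.2 is O_p(N^{-r}):
    # r = 2 log(1 - alpha) / (d log(alpha) + 2 log(1 - alpha)); equals 2/(d+2) at alpha = 0.5.
    return 2 * np.log(1 - alpha) / (d * np.log(alpha) + 2 * np.log(1 - alpha))

def leaf_size_order(N, d, alpha=0.5):
    # k should grow like N^r (up to constants) to balance the 1/k and (k/N)^{...} terms.
    return int(round(N ** imse_exponent(d, alpha)))

for d in (2, 5, 10):
    print(d, round(imse_exponent(d, 0.5), 3), round(imse_exponent(d, 0.3), 3))
```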

3 Local Adaptive Split Balancing Forest

In this section, we extend our focus to more general Hölder smooth functions and introduce local adaptive split-balancing forests capable of exploiting higher-order smoothness levels.

3.1 Local polynomial forests

Algorithm 3 Local adaptive split balancing forests
1: Observations $\mathbb{S}_{N}=(\mathbf{X}_{i},Y_{i})_{i=1}^{N}$, with $B\geq 1$, $\alpha\in(0,0.5]$, $w\in(0,1]$, $k\leq\lfloor wN\rfloor$, and $q\in\mathbb{N}$.
2: Calculate the polynomial basis $\mathbf{G}(\mathbf{X}_{i})\in\mathbb{R}^{\bar{d}}$ for each $i\leq N$.
3: for $b=1,\dots,B$ do
4:     Divide $\mathbb{S}_{N}$ into disjoint $\mathbb{S}_{\mathcal{I}}^{(b)}$ and $\mathbb{S}_{\mathcal{J}}^{(b)}$ with $|\mathcal{I}^{(b)}|=\lfloor wN\rfloor$ and $|\mathcal{J}^{(b)}|=N-\lfloor wN\rfloor$.
5:     repeat for each node $L\subseteq[0,1]^{d}$:
6:         Select a direction $j$ along which the node has been split the least number of times.
7:         Partition along the $j$-th direction to minimize the MSE on $\mathbb{S}_{\mathcal{J}}^{(b)}$ as in (3.2) and (2.4).
8:     until each current node contains $k$ to $2k-1$ samples of $\mathbb{S}_{\mathcal{I}}^{(b)}$.
9:     Estimate $\widehat{\bm{\beta}}(\mathbf{x},\xi_{b})$ as defined in (3.1) using the observations $\mathbb{S}_{\mathcal{I}}^{(b)}$.
10:    The $b$-th local adaptive split balancing tree: $T_{\mathrm{L}}(\mathbf{x},\xi_{b}):=\mathbf{G}(\mathbf{x})^{\top}\widehat{\bm{\beta}}(\mathbf{x},\xi_{b})$.
11: end for
12: return the local adaptive split balancing forest $\widehat{m}_{\mathrm{L}}(\mathbf{x}):=B^{-1}\sum_{b=1}^{B}T_{\mathrm{L}}(\mathbf{x},\xi_{b})$.

To capture the higher-order smoothness of the conditional mean function $m(\cdot)$, we propose to fit a $q$-th order local polynomial regression within the leaves of the adaptive split balancing forest of Section 2.2. We first introduce the polynomial basis of order $q\in\mathbb{N}$. For any $\mathbf{x}\in[0,1]^{d}$ and $j\in\{0,1,\dots,q\}$, let $\mathbf{g}_{j}(\mathbf{x}):=(\mathbf{x}^{\bm{\gamma}})_{\bm{\gamma}\in\mathcal{A}_{j}}\in\mathbb{R}^{d^{j}}$, where $\mathcal{A}_{j}:=\{\bm{\gamma}=(\bm{\gamma}_{1},\dots,\bm{\gamma}_{d})\in\mathbb{N}^{d}:|\bm{\gamma}|=j\}$. For instance, $\mathbf{g}_{0}(\mathbf{x})=1$, $\mathbf{g}_{1}(\mathbf{x})=(\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{d})^{\top}$, and $\mathbf{g}_{2}(\mathbf{x})=(\mathbf{x}_{1}^{2},\mathbf{x}_{1}\mathbf{x}_{2},\dots,\mathbf{x}_{d}^{2})^{\top}$. Denote by $\mathbf{G}(\mathbf{x}):=(\mathbf{g}_{0}(\mathbf{x}),\mathbf{g}_{1}(\mathbf{x})^{\top},\dots,\mathbf{g}_{q}(\mathbf{x})^{\top})^{\top}\in\mathbb{R}^{\bar{d}}$ the $q$-th order polynomial basis, where $\bar{d}:=\sum_{j=0}^{q}d^{j}$. For any $\xi\in\Xi$, define the weights $\omega_{i}(\mathbf{x},\xi)$ as in (2.1). Using the training samples indexed by $\mathcal{I}$, we consider the weighted polynomial regression:

$$\widehat{\bm{\beta}}(\mathbf{x},\xi):=\operatorname*{argmin}_{\bm{\beta}\in\mathbb{R}^{\bar{d}}}\sum_{i\in\mathcal{I}}\omega_{i}(\mathbf{x},\xi)\left(Y_{i}-\mathbf{G}(\mathbf{X}_{i})^{\top}\bm{\beta}\right)^{2}. \tag{3.1}$$

The $q$-th order local adaptive split balancing forest is proposed as

$$\widehat{m}_{\mathrm{L}}(\mathbf{x}):=\mathbb{E}_{\xi}\left[\mathbf{G}(\mathbf{x})^{\top}\widehat{\bm{\beta}}(\mathbf{x},\xi)\right].$$

Unlike Algorithm 1, where local averages are used as tree predictions, our approach here involves conducting polynomial regressions within the terminal leaves. To optimize the behavior of the final polynomial regressions, we minimize the following during the leaf construction:

$$\sum_{i\in\mathcal{J}^{(b)}}(\widehat{Y}_{i}-\widetilde{Y}_{1})^{2}\mathbbm{1}\{\mathbf{X}_{i}\in L_{1}\}+\sum_{i\in\mathcal{J}^{(b)}}(\widehat{Y}_{i}-\widetilde{Y}_{2})^{2}\mathbbm{1}\{\mathbf{X}_{i}\in L_{2}\}, \tag{3.2}$$

where $\widehat{Y}_{i}:=Y_{i}-\mathbf{G}(\mathbf{X}_{i})^{\top}\widehat{\bm{\beta}}$, with $\widehat{\bm{\beta}}$ denoting the least squares estimate within the parent node $L$, and $\widetilde{Y}_{j}$ is the average of the $\widehat{Y}_{i}$ within the child node $L_{j}$ for each $j\in\{1,2\}$. As in Section 2.2, we also require that both child nodes contain at least an $\alpha$-fraction of samples from the parent node. Additional specifics are outlined in Algorithm 3. Note that Algorithm 1 is a special case of Algorithm 3 with $q=0$. To further improve the forests' empirical behavior in the presence of certain sparse structures, we also introduce a generalized sparse version of the local adaptive split balancing forest; see Algorithm 4 in the Supplement for more details.
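A minimal sketch of the within-leaf fit of Algorithm 3 is given below. It assumes a boolean mask `in_leaf` marking the $\mathcal{I}$-sample points in the terminal leaf of $\mathbf{x}$, and, for simplicity, builds the basis from distinct monomials of degree at most $q$ (these span the same polynomial space as $\mathbf{G}$, although the dimension differs from $\bar{d}$). Since the weights $\omega_{i}(\mathbf{x},\xi)$ are constant within the leaf, the weighted regression (3.1) reduces to ordinary least squares on the in-leaf points.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_basis(X, q):
    # Polynomial basis of order q, evaluated row-wise; distinct monomials per degree.
    N, d = X.shape
    cols = [np.ones(N)]                                       # g_0(x) = 1
    for j in range(1, q + 1):
        for idx in combinations_with_replacement(range(d), j):
            cols.append(np.prod(X[:, list(idx)], axis=1))     # monomial with |gamma| = j
    return np.column_stack(cols)

def local_poly_tree_predict(x, X_I, Y_I, in_leaf, q=1):
    # One tree of Algorithm 3: fit (3.1) inside the leaf, then predict G(x)^T beta_hat.
    G = poly_basis(X_I[in_leaf], q)
    beta, *_ = np.linalg.lstsq(G, Y_I[in_leaf], rcond=None)
    return (poly_basis(x.reshape(1, -1), q) @ beta).item()
```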

3.2 Theoretical results with higher-order smoothness levels

Assumption 3 (Hölder smooth).

Assume that $m(\cdot)\in\mathcal{H}^{q,\beta}$ with $q\in\mathbb{N}$ and $\beta\in(0,1]$. The Hölder class $\mathcal{H}^{q,\beta}$ contains all functions $f:[0,1]^{d}\to\mathbb{R}$ that are $q$ times continuously differentiable, with (a) $|D^{\bm{\gamma}}f(\mathbf{x})|\leq L_{0}$ for all $\mathbf{x}\in[0,1]^{d}$ and multi-indices $\bm{\gamma}$ satisfying $|\bm{\gamma}|\leq q$, and (b) $|D^{\bm{\gamma}}f(\mathbf{x})-D^{\bm{\gamma}}f(\mathbf{x}^{\prime})|\leq L_{0}\|\mathbf{x}-\mathbf{x}^{\prime}\|^{\beta}$ for all $\mathbf{x},\mathbf{x}^{\prime}\in[0,1]^{d}$ and $\bm{\gamma}$ satisfying $|\bm{\gamma}|=q$, where $L_{0}>0$ is a constant.

The following theorem characterizes the convergence rate of the local adaptive split balancing forest proposed in Algorithm 3.

Theorem 3.1.

Let Assumptions 2 and 3 hold. Suppose that $w\in(0,1]$ and $\alpha\in(0,0.5]$ are both constants. Choose any $B\in\mathbb{N}$ and $k\leq\lfloor wN\rfloor$ satisfying $k\gg\log(N)$. Then, as $N\to\infty$, the local adaptive split balancing forest proposed in Algorithm 3 satisfies

$$\mathbb{E}_{\mathbf{x}}[\widehat{m}_{\mathrm{L}}(\mathbf{x})-m(\mathbf{x})]^{2}=O_{p}\left(1/k+(k/N)^{\frac{2(q+\beta)\log(1-\alpha)}{d\log(\alpha)}}\right).$$

Moreover, choosing $k\asymp N^{\frac{2(q+\beta)\log(1-\alpha)}{d\log(\alpha)+2(q+\beta)\log(1-\alpha)}}$, we have

$$\mathbb{E}_{\mathbf{x}}\left[\widehat{m}_{\mathrm{L}}(\mathbf{x})-m(\mathbf{x})\right]^{2}=O_{p}\left(N^{-\frac{2(q+\beta)\log(1-\alpha)}{d\log(\alpha)+2(q+\beta)\log(1-\alpha)}}\right). \tag{3.3}$$

When $\alpha=0.5$, the rate given by (3.3) is $N^{-\frac{2(q+\beta)}{d+2(q+\beta)}}$, which is minimax optimal for the Hölder class $\mathcal{H}^{q,\beta}$; see, e.g., [44]. This accomplishment represents the first random forest with IMSE reaching minimax optimality when $q>1$; refer to Table 1.1 and Figure 1.

Theorem 3.1 is built upon several key technical components. To mitigate the issue of multicollinearity during local polynomial regression within terminal leaves, we first demonstrate that $\sup_{\mathbf{x}\in[0,1]^{d},\xi\in\Xi}\kappa_{N}(\mathbf{x})=O_{p}(1)$, where $\kappa_{N}(\mathbf{x}):=[1-\mathbf{d}(\mathbf{x})^{\top}\mathbf{S}^{-1}(\mathbf{x})\mathbf{d}(\mathbf{x})]^{-1}$ assesses the degree of multicollinearity. Here, $\mathbf{d}(\mathbf{x}):=\sum_{i\in\mathcal{I}}\omega_{i}(\mathbf{x},\xi)\mathbf{G}_{-1}(\mathbf{X}_{i}-\mathbf{x})$, $\mathbf{S}(\mathbf{x}):=\sum_{i\in\mathcal{I}}\omega_{i}(\mathbf{x},\xi)\mathbf{G}_{-1}(\mathbf{X}_{i}-\mathbf{x})\mathbf{G}_{-1}(\mathbf{X}_{i}-\mathbf{x})^{\top}$, and $\mathbf{G}(\mathbf{x})=(1,\mathbf{G}_{-1}(\mathbf{x})^{\top})^{\top}$. Unlike [19], which imposes an upper bound on the random quantity $\kappa_{N}$ by assumption, we rigorously prove that this quantity is uniformly bounded above with high probability; see the equivalent form in Lemma S.4. Additionally, by employing localized polynomial regression, we reduce the approximation error to $O(\mathbb{E}_{\mathbf{x}}[\mathrm{diam}^{2(q+\beta)}(L(\mathbf{x},\xi))])$ under the Hölder class $\mathcal{H}^{q,\beta}$, as detailed in the Supplement. Together with our proposed permutation-based splitting rule, the previous results on leaf diameters (Lemma 2.1) ensure a sufficiently small rate.
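For intuition, $\kappa_{N}(\mathbf{x})$ can be evaluated directly from its definition. The sketch below is illustrative only: it uses equal within-leaf weights and a user-supplied $\mathbf{G}_{-1}$, here the identity map corresponding to $q=1$.

```python
import numpy as np

def kappa_N(x, X_leaf, basis_minus1):
    # kappa_N(x) = [1 - d(x)^T S(x)^{-1} d(x)]^{-1} with equal weights 1/#leaf;
    # `basis_minus1(u)` returns G_{-1}(u), the basis without its leading constant.
    U = np.array([basis_minus1(Xi - x) for Xi in X_leaf])     # rows G_{-1}(X_i - x)
    w = np.full(len(X_leaf), 1.0 / len(X_leaf))
    d_vec = U.T @ w                                           # d(x)
    S = (U * w[:, None]).T @ U                                # S(x)
    return float(1.0 / (1.0 - d_vec @ np.linalg.solve(S, d_vec)))

rng = np.random.default_rng(2)
X_leaf = 0.1 * rng.uniform(size=(50, 3))                      # a small leaf in [0,1]^3
print(kappa_N(X_leaf.mean(axis=0), X_leaf, lambda u: u))      # close to 1 at the leaf center
```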

[38] achieved minimax optimal IMSE for $q=1$ and $\beta\in(0,1/2]$ by leveraging the forest structure and concentration across trees. However, for $\beta\in(1/2,1]$, their optimality is limited to interior points where $\mathbb{E}_{\xi}[\mathbb{E}_{\mathbf{x}^{\prime}}[\mathbf{x}^{\prime}\mid\mathbf{x}^{\prime}\in L(\mathbf{x},\xi)]]\approx\mathbf{x}$. Our approach broadens the range of Hölder smooth functions addressed, using polynomial approximation and split-balancing to eliminate the approximation error at the tree level.

The Tessellation forest of [40] addressed this boundary issue, achieving minimax optimal IMSE for $q=1$ and any $\beta\in(0,1]$. Theorem 3.1 further highlights that the local split-balancing median forest (with $\alpha=0.5$) is minimax optimal not only for $q=1$ but also for any arbitrary $q>1$. This signifies a major breakthrough in attaining minimax optimality for higher-order smoothness levels.

Debiased Mondrian forests [12] achieve minimax optimality for any $q\in\mathbb{N}$ and $\beta\in(0,1]$ in terms of the point-wise MSE at interior points. However, their results do not yield an optimal rate for the IMSE, as their debiasing technique does not extend to boundary points. Additionally, [10] achieved nearly optimal rates for any $q\in\mathbb{N}$ and $\beta\in(0,1]$, but their findings are limited to in-sample excess risk.

It is important to note that while all previous works grow trees completely independently of the samples, our approach employs supervised splitting rules with $\alpha<0.5$ after tuning, significantly enhancing forest performance. Only [7, 19] considered data-dependent splitting rules and studied local linear forests under the special case $q=1$. However, [7] only demonstrated the consistency of their method, without providing any explicit rate of convergence. [19] provided asymptotic normality results at a given $\mathbf{x}\in[0,1]^{d}$. However, their established upper bound for the asymptotic variance is no faster than $N^{-(1+\frac{d\log(0.2)}{1.3\log(0.8)})^{-1}}\approx N^{-(1+5.55d)^{-1}}$, which is slow, as the splitting directions are chosen randomly.

3.3 Uniform consistency rate

In the following, we extend our analysis to include uniform-type results for the estimation error of the forests. These are useful for our particular applications to causal inference and more broadly are of independent interest. As Algorithm 1 is a specific instance of Algorithm 3, we focus on presenting results for the latter.

To begin, we establish a uniform bound on the diameter of the leaves as follows.

Lemma 3.2.

Suppose that $r\geq 1$, $w\in(0,1]$, and $\alpha\in(0,0.5]$ are constants. Choose any $k\leq\lfloor wN\rfloor$ satisfying $k\gg\log^{3}(N)$. Then,

$$\sup_{\mathbf{x}\in[0,1]^{d},\,\xi\in\Xi}\mathrm{diam}^{r}(L(\mathbf{x},\xi))\leq C(N/k)^{-\frac{r\log(1-\alpha)}{d\log(\alpha)}}$$

with probability at least $1-\log(\lfloor wN\rfloor/k)/\left(\sqrt{n}\log\left((1-\alpha)^{-1}\right)\right)$ and some constant $C>0$.

The result in Lemma 3.2 holds for a single tree and also for an ensemble forest. In fact, obtaining uniform results for a single tree is relatively simple: it suffices to control the diameter of each terminal leaf with high probability and take the uniform bound over all the leaves, as the number of terminal leaves is at most $N/k$. However, such an approach is invalid for forests, since the number of terminal leaves in a forest grows with the number of trees $B$. Hence, such a method can be used only when $B$ is relatively small; in practice, however, we would like to set $B$ as large as possible unless constrained by computational limits. To obtain a uniform result for forests, we instead approximate all the possible leaves through a collection of rectangles.

Subsequently, we present a uniform upper bound for the estimation error of the forests.

Theorem 3.3.

Let Assumption 3 hold. Suppose that $|Y|\leq M$. Let $M>0$, $w\in(0,1]$, and $\alpha\in(0,0.5]$ be constants. Choose any $B\in\mathbb{N}$ and $k\leq\lfloor wN\rfloor$ satisfying $k\gg\log^{3}(N)$. Then, as $N\to\infty$,

$$\sup_{\mathbf{x}\in[0,1]^{d}}|\widehat{m}_{\mathrm{L}}(\mathbf{x})-m(\mathbf{x})|=O_{p}\left(\sqrt{\log(N)/k}+(k/N)^{\frac{(q+\beta)\log(1-\alpha)}{d\log(\alpha)}}\right).$$

Moreover, choosing $k\asymp N^{\frac{2(q+\beta)\log(1-\alpha)}{d\log(\alpha)+2(q+\beta)\log(1-\alpha)}}\left(\log(N)\right)^{\frac{d\log(\alpha)}{d\log(\alpha)+2(q+\beta)\log(1-\alpha)}}$, we have

$$\sup_{\mathbf{x}\in[0,1]^{d}}|\widehat{m}_{\mathrm{L}}(\mathbf{x})-m(\mathbf{x})|=O_{p}\left((\log(N)/N)^{\frac{(q+\beta)\log(1-\alpha)}{d\log(\alpha)+2(q+\beta)\log(1-\alpha)}}\right).$$

Compared with the results in Theorem 3.1, the rates above contain additional logarithmic terms. This is the cost of seeking uniform bounds. When $\alpha=0.5$, an optimally tuned $k$ leads to the rate $(\log(N)/N)^{\frac{q+\beta}{d+2(q+\beta)}}$, which is minimax optimal for sup-norms; see, e.g., [44]. To the best of our knowledge, we are the first to establish minimax optimal uniform bounds for forests over the Hölder class $\mathcal{H}^{q,\beta}$ for any $q\in\mathbb{N}$.

3.4 Application: causal inference for ATE

In this section, we apply the proposed forests to estimate the average treatment effect (ATE) in the context of causal inference. We consider i.i.d. samples $(\mathbf{W}_{i})_{i=1}^{N}:=(Y_{i},\mathbf{X}_{i},A_{i})_{i=1}^{N}$, and denote by $\mathbf{W}=(Y,\mathbf{X},A)$ an independent copy. Here, $Y\in\mathbb{R}$ denotes the outcome of interest, $A\in\{0,1\}$ is a binary treatment variable, and $\mathbf{X}\in\mathbb{R}^{d}$ represents a vector of covariates uniformly distributed on $[0,1]^{d}$. We operate within the potential outcome framework, assuming the existence of potential outcomes $Y(1)$ and $Y(0)$, where $Y(a)$ represents the outcome that would be observed if an individual received treatment $a\in\{0,1\}$. The ATE is defined as $\theta:=\mathbb{E}[Y(1)-Y(0)]$, representing the average effect of the treatment $A$ on the outcome $Y$. To identify causal effects, we make the standard assumptions seen in [41, 16, 27].

Assumption 4.

(a) Unconfoundedness: $\{Y(0),Y(1)\}\perp\!\!\!\perp A\mid\mathbf{X}$. (b) Consistency: $Y=Y(A)$. (c) Overlap: $\mathbb{P}(c_{0}<\pi^{*}(\mathbf{X})<1-c_{0})=1$, where $c_{0}\in(0,1/2)$ is a constant and the propensity score (PS) function is defined as $\pi^{*}(\mathbf{x}):=\mathbb{P}(A=1\mid\mathbf{X}=\mathbf{x})$ for any $\mathbf{x}\in[0,1]^{d}$.

Define the true outcome regression functions $\mu_{a}^{*}(\mathbf{x}):=\mathbb{E}[Y(a)\mid\mathbf{X}=\mathbf{x}]$ for $a\in\{0,1\}$ and consider the doubly robust score function: for any $\eta=(\mu_{1},\mu_{0},\pi)$,

$$\psi(\mathbf{W};\eta):=\mu_{1}(\mathbf{X})-\mu_{0}(\mathbf{X})+\frac{A(Y-\mu_{1}(\mathbf{X}))}{\pi(\mathbf{X})}-\frac{(1-A)(Y-\mu_{0}(\mathbf{X}))}{1-\pi(\mathbf{X})}.$$

Since the ATE parameter can be represented as $\theta=\mathbb{E}[\psi(\mathbf{W};\eta^{*})]$, it can be estimated as the empirical average of the score functions as long as we plug in appropriate estimates of the nuisance functions $\eta^{*}=(\mu_{1}^{*},\mu_{0}^{*},\pi^{*})$. In the following, we introduce a forest-based ATE estimator using the double machine-learning framework of [13].

For any fixed integer $K\geq 2$, split the samples into $K$ equal-sized parts, indexed by $(\mathcal{I}_{k})_{k=1}^{K}$. For the sake of simplicity, we assume $n:=\#\mathcal{I}_{k}=N/K\in\mathbb{N}$. For each $k\leq K$, denote $\mathcal{I}_{-k}=\mathcal{I}\setminus\mathcal{I}_{k}$. Under Assumption 4, we can identify the outcome regression function as $\mu_{a}^{*}(\mathbf{x})=\mathbb{E}(Y\mid\mathbf{X}=\mathbf{x},A=a)$ for each $a\in\{0,1\}$. Hence, we construct $\widehat{\mu}_{a}^{-k}(\cdot)$ using Algorithm 3, based on the samples $(Y_{i},\mathbf{X}_{i})_{i\in\{i\in\mathcal{I}_{-k}:A_{i}=a\}}$. Additionally, we construct $\widehat{\pi}^{-k}(\cdot)$ using Algorithm 3, based on the samples $(A_{i},\mathbf{X}_{i})_{i\in\mathcal{I}_{-k}}$. For the sake of simplicity, we denote $\mu_{2}(\cdot):=\pi(\cdot)$. The number of trees $B$ and the orders of the polynomial forests are chosen in advance, where we use $q_{j}$ to denote the polynomial order considered in the estimation of $\mu_{j}(\cdot)$ for each $j\in\{0,1,2\}$. Further denote by $h_{j}:=(\alpha_{j},w_{j},k_{j})$ the hyperparameters for estimating $\mu_{j}(\cdot)$. To appropriately select $h_{j}$, we further split the samples indexed by $\mathcal{I}_{-k}$ into training and validation sets. After obtaining the nuisance estimates $\widehat{\eta}^{-k}:=(\widehat{\mu}_{1}^{-k},\widehat{\mu}_{0}^{-k},\widehat{\pi}^{-k})$ for each $k\leq K$, we define the ATE estimator as $\widehat{\theta}:=N^{-1}\sum_{k=1}^{K}\sum_{i\in\mathcal{I}_{k}}\psi(\mathbf{W}_{i};\widehat{\eta}^{-k})$.
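A minimal sketch of the cross-fitted estimator $\widehat{\theta}$ is shown below; the callbacks `fit_mu` and `fit_pi` are placeholders for the forest regressions of Algorithm 3 (their hyperparameter tuning on $\mathcal{I}_{-k}$ is omitted), the clipping of the estimated propensity scores is only a numerical safeguard rather than part of the proposed procedure, and the reported standard error uses the empirical variance of the scores (cf. $\widehat{\sigma}^{2}$ in Theorem 3.4 below).

```python
import numpy as np

def aipw_ate(Y, A, X, fit_mu, fit_pi, K=2, seed=0):
    # Cross-fitted AIPW estimate: theta_hat = N^{-1} sum_k sum_{i in I_k} psi(W_i; eta_hat^{-k}).
    N = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(N), K)
    psi = np.empty(N)
    for test in folds:
        train = np.setdiff1d(np.arange(N), test)
        mu1 = fit_mu(X[train][A[train] == 1], Y[train][A[train] == 1])
        mu0 = fit_mu(X[train][A[train] == 0], Y[train][A[train] == 0])
        pi = fit_pi(X[train], A[train])
        m1, m0 = mu1(X[test]), mu0(X[test])
        p = np.clip(pi(X[test]), 1e-3, 1 - 1e-3)      # numerical safeguard only
        psi[test] = (m1 - m0
                     + A[test] * (Y[test] - m1) / p
                     - (1 - A[test]) * (Y[test] - m0) / (1 - p))
    theta_hat = psi.mean()
    return theta_hat, psi.std() / np.sqrt(N)          # estimate and its standard error
```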

While the double machine-learning framework offers a flexible and robust approach for estimating causal parameters like the ATE, the validity of statistical inference based on these methods requires careful consideration, as asymptotic normality depends on sufficiently fast convergence rates in nuisance estimation. For instance, even when minimax optimal nuisance estimation is achieved within the Lipschitz class, inference can still be inaccurate under worst-case scenarios when $d>1$. To address this, we focus on the widely used random forest method and develop techniques that enhance convergence rates. By integrating the split-balancing technique with local polynomial regression, we enhance the method's convergence rate, thereby providing solid theoretical guarantees for forest-based ATE estimation and delivering more reliable inference in practice. Additional discussion on the technical challenges can be found in Remark 1.

Now, we introduce theoretical properties of the ATE estimator.

Theorem 3.4.

Let Assumption 4 hold, $|Y|\leq M$, and $\mathbb{E}[\mathbbm{1}_{\{A=a\}}(Y(a)-\mu_{a}^{*})]^{2}\geq C_{0}$ for each $a\in\{0,1\}$, with some positive constants $M$ and $C_{0}$. Suppose that $\mu_{0}^{*}\in\mathcal{H}^{q_{0},\beta_{0}}$, $\mu_{1}^{*}\in\mathcal{H}^{q_{1},\beta_{1}}$, and $\pi^{*}\in\mathcal{H}^{q_{2},\beta_{2}}$, where $q_{j}\in\mathbb{N}$ and $\beta_{j}\in(0,1]$ for each $j\in\{0,1,2\}$. Let $w_{j}\in(0,1]$ and $\alpha_{j}\in(0,0.5]$ be constants. Choose any $B\geq 1$ and $k_{j}\asymp N^{\frac{2(q_{j}+\beta_{j})\log(1-\alpha_{j})}{d\log(\alpha_{j})+2(q_{j}+\beta_{j})\log(1-\alpha_{j})}}$. Moreover, let

$$d^{2}\log(\alpha_{a})\log(\alpha_{2})<4(q_{a}+\beta_{a})(q_{2}+\beta_{2})\log(1-\alpha_{a})\log(1-\alpha_{2})$$

for each $a\in\{0,1\}$. Then, as $N\to\infty$,

$$\sigma^{-1}\sqrt{N}(\widehat{\theta}-\theta)\leadsto N(0,1)$$

and $\widehat{\sigma}^{-1}\sqrt{N}(\widehat{\theta}-\theta)\leadsto N(0,1)$, where $\widehat{\sigma}^{2}:=N^{-1}\sum_{k=1}^{K}\sum_{i\in\mathcal{I}_{k}}[\psi(\mathbf{W}_{i};\widehat{\eta}^{-k})-\widehat{\theta}]^{2}$.

Remark 1 (Technical challenges of forest-based ATE estimation).

It is worth emphasizing that the following aspects are the main challenges in our analysis:

(a) Establish convergence rates for the integrated mean squared error (IMSE) of the nuisance estimates. As the ATE is a parameter defined through integration over the entire population, we require nuisance convergence results in the sense of IMSE; point-wise mean squared error results are insufficient. This distinguishes our work from [46, 3], which focused on the estimation and inference for the conditional average treatment effect (CATE).

(b) Develop sufficiently fast convergence rates through higher-order smoothness. The asymptotic normality of the double machine-learning method [13] requires a product-rate condition for the nuisance estimation errors. If we only utilize the Lipschitz continuity of the nuisance functions, root-$N$ inference is ensured only when $d=1$. In other words, we need to establish methods that can exploit the higher-order smoothness of the nuisance functions as long as $d>1$. As shown in Theorem 3.4, the higher the smoothness levels, the larger the dimension $d$ we can allow for.

(c) Constructing stable propensity score (PS) estimates. As demonstrated in Lemma S.6 of the Supplement, as long as we ensure a sufficiently large minimum leaf size $k_{2}\gg\log^{3}(N)$ for the forests used in PS estimation, we can guarantee that each terminal leaf contains a non-negligible fraction of samples from both treatment groups, provided the overlap condition holds for the true PS function as in Assumption 4. Consequently, the PS estimates are stabilized, avoiding values close to zero or one.

4 Numerical Experiments

In this section, we assess the numerical performance of the proposed methods in non-parametric regression through both simulation studies and real-data analysis. Additional results for the ATE estimation problem are provided in Section B of the Supplement.

4.1 Simulations for the conditional mean estimation

We first focus on the estimation of the conditional mean function $m(\mathbf{x})=\mathbb{E}[Y\mid\mathbf{X}=\mathbf{x}]$. Generate i.i.d. covariates $\mathbf{X}_{i}\sim\mathrm{Uniform}[0,1]^{d}$ and noise $\varepsilon_{i}\sim N(0,1)$ for each $i\leq N$.

Consider the following models: (a) $Y_{i}=10\sin(\pi\mathbf{X}_{i1}\mathbf{X}_{i2})+20(\mathbf{X}_{i3}-0.5)^{2}+10\mathbf{X}_{i4}+5\mathbf{X}_{i5}+\varepsilon_{i}$, (b) $Y_{i}=20\exp((\sum_{j=1}^{s}\mathbf{X}_{ij}-0.5s)/\sqrt{s})+\varepsilon_{i}$. In Setting (a), we utilize the well-known Friedman function proposed by [20], which serves as a common benchmark for assessing non-parametric regression methods [50, 26, 35]. We set the covariates' dimension to $d=5$ and consider sample sizes $N\in\{500,1000\}$. In Setting (b), we investigate the performance of the forests under various sparsity levels, keeping $d=10$ and $N=1000$, and choosing $s\in\{2,6,10\}$.
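For concreteness, a minimal data-generating sketch for the two settings is given below; the function names are illustrative only, and the Friedman benchmark is written in its standard form.

import numpy as np

def generate_setting_a(N, d=5, rng=None):
    # Friedman-type regression function of Setting (a).
    rng = np.random.default_rng(rng)
    X = rng.uniform(size=(N, d))
    m = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4])
    return X, m + rng.standard_normal(N)

def generate_setting_b(N, d=10, s=6, rng=None):
    # Exponential regression function of Setting (b) with sparsity level s.
    rng = np.random.default_rng(rng)
    X = rng.uniform(size=(N, d))
    m = 20 * np.exp((X[:, :s].sum(axis=1) - 0.5 * s) / np.sqrt(s))
    return X, m + rng.standard_normal(N)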

We implement the proposed adaptive split balancing forest (ASBF, Algorithm 1), local linear adaptive split balancing forest (LL-ASBF, Algorithm 3 with $q=1$), and local quadratic adaptive split balancing forest (LQ-ASBF, Algorithm 3 with $q=2$). In Setting (b), where various sparsity levels are considered, we further evaluate the numerical performance of the sparse adaptive split balancing forest (S-ASBF, Algorithm 2), which is more suitable for scenarios with sparse structures. We choose $B=200$ and utilize $80\%$ of the samples for training, reserving the remaining $20\%$ for validation to determine the optimal tuning parameters $(\alpha,k)$, as well as mtry for the sparse versions. For simplicity, we fix the honest fraction $w=0.5$ and do not perform additional subsampling.
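The validation-based tuning described above can be summarized by the following sketch, where fit_asbf is a hypothetical stand-in for Algorithms 1 and 3 and the candidate grids are purely illustrative.

import numpy as np

def tune_asbf(X, Y, fit_asbf, alphas=(0.1, 0.2, 0.3, 0.4, 0.5),
              leaf_sizes=(5, 10, 20, 40), B=200, w=0.5, seed=0):
    # Select (alpha, k) by validation RMSE on a random 80/20 split.
    rng = np.random.default_rng(seed)
    N = len(Y)
    perm = rng.permutation(N)
    train, valid = perm[: int(0.8 * N)], perm[int(0.8 * N):]
    best, best_rmse = None, np.inf
    for alpha in alphas:
        for k in leaf_sizes:
            forest = fit_asbf(X[train], Y[train], B=B, alpha=alpha, k=k, w=w)
            rmse = np.sqrt(np.mean((forest.predict(X[valid]) - Y[valid]) ** 2))
            if rmse < best_rmse:
                best, best_rmse = (alpha, k), rmse
    return best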

Figure 5: Boxplots of $\log(\text{RMSE}+1)$ under Setting (a) with a varying sample size (panels (a)–(b)).

We also consider Breiman's original forest (BOF), the honest random forest (HRF), the local linear forest (LLF), and Bayesian additive regression trees (BART). BOF is implemented using the R package ranger [49], HRF and LLF are implemented using the R package grf [45], and BART is implemented using the BART package [43]. The HRF and LLF methods involve the tuning parameter mtry, denoting the number of directions tried for each split. For comparison purposes, we also consider modified versions with fixed $\mathrm{mtry}=1$. This corresponds to the case where splitting directions are chosen randomly and is the only case that has been thoroughly studied theoretically [46, 19]. We denote the modified versions of HRF and LLF as HRF1 and LLF1, respectively. The only difference between HRF1 and the proposed ASBF is that ASBF selects splitting directions in a balanced manner rather than fully at random; a parallel difference exists between LLF1 and LL-ASBF. Additionally, we introduce a modified version of BOF with the splitting direction decided through random selection, denoted as BOF1.

We evaluate the root mean square error (RMSE) at 1000 test points and repeat the procedure 200 times. Figures 5 and 6 depict boxplots comparing the log-transformed RMSE of all the considered methods across the settings introduced above.

Figure 6: Boxplots of $\log(\text{RMSE}+1)$ under Setting (b) with a varying sparsity level (panels (a)–(c)).

As shown in Figures 5 and 6, LQ-ASBF consistently exhibits the best performance across all the considered settings. When we focus on the choice of mtry=1\mbox{mtry}=1, the proposed ASBF method consistently outperforms the other local averaging methods BOF1 and HRF1, highlighting the distinct advantages offered by our balanced method in contrast to random feature selection. In addition, when the true model is dense, as shown in Figures 5 and 6(c), the ASBF method (with a fixed mtry=1\mbox{mtry}=1) outperforms the general BOF and HRF methods, even when their mtry parameters are appropriately tuned. The only exception is when N=500N=500 under Setting (a), where ASBF and BOF show similar performance. In sparse scenarios, as demonstrated in Figures 6(a)-(b), the proposed generalized sparse version S-ASBF, with an appropriately tuned mtry, clearly outperforms ASBF, especially when the sparsity level is small. Overall, the S-ASBF method consistently leads to a smaller RMSE than BOF and HRF methods in Figure 6 for all considered sparsity levels. The only exception occurs when s=2s=2, where S-ASBF and HRF exhibit similar behaviors. This similarity arises because, in scenarios with a small true sparsity level, the optimal mtry parameter is close to the dimension dd; otherwise, it is likely that all candidate directions are redundant for certain splits. Meanwhile, when mtry=d\mbox{mtry}=d, there is no difference between the balanced and random approaches, as we always need to consider all directions as candidate directions for each split. Lastly, for forest-based local linear methods, we observe that the proposed balanced method LL-ASBF consistently outperforms both LLF1 and LLF under all considered scenarios.

4.2 Application to wine quality and abalone datasets

We further assess the performance of the considered methods in Section 4.1 using the wine quality and abalone datasets, both available from the UCI repository [2].

The wine quality dataset comprises red and white variants of the Portuguese “Vinho Verde” wine, with 4898 observations for white and 1599 observations for red. The quality variable serves as the response, measured on a scale from 0 (indicating the worst quality) to 10 (representing the highest quality). Additionally, the dataset includes 11 continuous features. For detailed information about the data, refer to [15].

The abalone dataset consists of 1 categorical feature and 7 continuous features, along with the age of abalones determined by cutting through the shell cone and counting the number of rings. This age is treated as the response variable. The categorical feature pertains to sex, classifying the entire dataset into three categories: male (1528 observations), female (1307 observations), and infant (1342 observations). For more details, refer to [39].

Table 4.1: Root mean square error across methods for wine quality and age of abalone
Method BOF1 BOF HRF1 HRF ASBF S-ASBF BART LLF1 LLF LL-ASBF SLL-ASBF LQ-ASBF
Wine (overall) 0.830 0.826 0.833 0.828 0.808 0.804 0.824 0.773 0.767 0.728 0.725 0.715
Red wine 0.809 0.796 0.819 0.809 0.794 0.791 0.799 0.733 0.735 0.729 0.719 0.719
White wine 0.837 0.835 0.837 0.834 0.812 0.809 0.832 0.786 0.778 0.728 0.727 0.714
Abalone (overall) 2.619 2.629 2.608 2.600 2.551 2.550 2.611 2.557 2.556 2.539 2.537 2.497
Male abalone 2.685 2.697 2.687 2.682 2.679 2.677 2.717 2.635 2.633 2.622 2.622 2.570
Female abalone 3.063 3.066 3.019 3.005 2.990 2.989 3.004 2.955 2.957 2.936 2.931 2.889
Infant abalone 2.006 2.025 2.025 2.018 1.839 1.838 2.016 1.993 1.988 1.962 1.961 1.944

Based on the categorical features, we initially divide the wine quality dataset into two groups (red and white) and the abalone dataset into three groups (male, female, and infant). Random forests are then constructed based on the samples within each sub-group. We standardize the continuous features using min-max scaling, ensuring that all features fall within the range $[0,1]$. Each group of the data is randomly partitioned into three parts. With a total group size of $N$, $\lceil 3N/5\rceil$ observations are used for training, $\lceil N/5\rceil$ observations for validation to determine the optimal tuning parameters, and the prediction performance of the considered methods is reported based on the remaining testing observations. The number of trees and the honest subsampling fraction are fixed in advance at $B=200$ and $w=0.5$.
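The preprocessing and data-splitting steps can be summarized by the following minimal sketch; the function names are illustrative only.

import numpy as np

def min_max_scale(X):
    # Rescale each continuous column of X to [0, 1].
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.maximum(hi - lo, 1e-12)

def train_valid_test_split(N, seed=0):
    # Indices for ceil(3N/5) training, ceil(N/5) validation, and the rest testing.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(N)
    n_tr = int(np.ceil(3 * N / 5))
    n_va = int(np.ceil(N / 5))
    return perm[:n_tr], perm[n_tr:n_tr + n_va], perm[n_tr + n_va:]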

Figure 7: Boxplots of $\log(\text{absolute errors}+1)$ for wine quality and age of abalone (panels (a)–(b)).

Table 4.1 reports the prediction performance of the considered random forest methods within each of the sub-groups. The proposed ASBF method and its sparse version S-ASBF outperform the existing local averaging methods (BOF1, BOF, HRF1, and HRF), as well as BART, across all the sub-groups. The LL-ASBF method further outperforms the existing local linear methods (LLF1 and LLF), while LQ-ASBF provides the most accurate predictions overall. Boxplots of the log-transformed absolute errors are also included in Figure 7, illustrating the overall performance for the wine quality dataset (white and red combined) and the abalone dataset (male, female, and infant combined).

5 Discussion

Since the introduction of random forest methods by [8], the infusion of randomness has played a pivotal role in mitigating overfitting and reducing the variance associated with individual greedy trees. However, this work raises pertinent concerns and queries regarding the over-reliance on such auxiliary randomness. Even for a simple median forest, opting for completely random splitting directions does not yield optimal results. Conversely, when we choose directions in a less random, or more balanced manner, we can achieve minimax results for smooth functions. Notably, as auxiliary randomness lacks information about the conditional distribution Y𝐗\mathbb{P}_{Y\mid\mathbf{X}} of interest, overemphasizing its role in constructing regression methods does not necessarily improve results; rather, it can compromise the approximation power of tree models. Our theoretical and numerical findings suggest that, especially for low-dimensional problems, adopting a more balanced approach in constructing trees and forests leads to more efficient outcomes. While our numerical results also indicate the efficacy of the proposed balanced method in complex scenarios, such as those involving sparse structures, further in-depth investigation is needed to understand its performance comprehensively in intricate situations.

References

  • [1] S. Arlot and R. Genuer. Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939, 2014.
  • [2] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
  • [3] S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
  • [4] M. Behr, Y. Wang, X. Li, and B. Yu. Provable Boolean interaction recovery from tree ensemble obtained via random forests. Proceedings of the National Academy of Sciences, 119(22):e2118636119, 2022.
  • [5] G. Biau. Analysis of a random forests model. The Journal of Machine Learning Research, 13(1):1063–1095, 2012.
  • [6] G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(9), 2008.
  • [7] A. Bloniarz, A. Talwalkar, B. Yu, and C. Wu. Supervised neighborhoods for distributed nonparametric regression. In Artificial Intelligence and Statistics, pages 1450–1459. PMLR, 2016.
  • [8] L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
  • [9] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen. Classification and Regression Trees. CRC Press, 1984.
  • [10] Y. Cai, Y. Ma, Y. Dong, and H. Yang. Extrapolated random tree for regression. In International Conference on Machine Learning, pages 3442–3468. PMLR, 2023.
  • [11] M. D. Cattaneo, J. M. Klusowski, and P. M. Tian. On the pointwise behavior of recursive partitioning and its implications for heterogeneous causal effect estimation. arXiv preprint arXiv:2211.10805, 2022.
  • [12] M. D. Cattaneo, J. M. Klusowski, and W. G. Underwood. Inference with Mondrian random forests. arXiv preprint arXiv:2310.09702, 2023.
  • [13] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. Newey. Double/debiased/Neyman machine learning of treatment effects. American Economic Review, 107(5):261–65, 2017.
  • [14] C.-M. Chi, P. Vossler, Y. Fan, and J. Lv. Asymptotic properties of high-dimensional random forests. The Annals of Statistics, 50(6):3415–3438, 2022.
  • [15] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.
  • [16] R. K. Crump, V. J. Hotz, G. W. Imbens, and O. A. Mitnik. Dealing with limited overlap in estimation of average treatment effects. Biometrika, 96(1):187–199, 2009.
  • [17] R. Duroux and E. Scornet. Impact of subsampling and tree depth on random forests. ESAIM: Probability and Statistics, 22:96–128, 2018.
  • [18] H. Federer. Geometric measure theory. Springer, 2014.
  • [19] R. Friedberg, J. Tibshirani, S. Athey, and S. Wager. Local linear forests. Journal of Computational and Graphical Statistics, 30(2):503–517, 2020.
  • [20] J. H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67, 1991.
  • [21] W. Gao, F. Xu, and Z.-H. Zhou. Towards convergence rate analysis of random forests for classification. Artificial Intelligence, 313:103788, 2022.
  • [22] W. Gautschi. Some elementary inequalities relating to the gamma and incomplete gamma function. J. Math. Phys, 38(1):77–81, 1959.
  • [23] R. Genuer. Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24(3):543–562, 2012.
  • [24] B. A. Goldstein, E. C. Polley, and F. B. Briggs. Random forests for genetic association studies. Statistical Applications in Genetics and Molecular Biology, 10(1), 2011.
  • [25] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • [26] T. Hothorn and A. Zeileis. Predictive distribution modeling using transformation forests. Journal of Computational and Graphical Statistics, 30(4):1181–1196, 2021.
  • [27] G. W. Imbens and D. B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
  • [28] H. Ishwaran. The effect of splitting on random forests. Machine Learning, 99:75–118, 2015.
  • [29] H. Ishwaran and U. B. Kogalur. Consistency of random survival forests. Statistics & Probability Letters, 80(13-14):1056–1064, 2010.
  • [30] H. Ishwaran, U. B. Kogalur, E. H. Blackstone, and M. S. Lauer. Random survival forests. The Annals of Applied Statistics, pages 841–860, 2008.
  • [31] J. Klusowski. Sharp analysis of a simple model for random forests. In International Conference on Artificial Intelligence and Statistics, pages 757–765. PMLR, 2021.
  • [32] J. M. Klusowski and P. M. Tian. Large scale prediction with decision trees. Journal of the American Statistical Association, pages 1–27, 2023.
  • [33] X. Li, Y. Wang, S. Basu, K. Kumbier, and B. Yu. A debiased MDI feature importance measure for random forests. Advances in Neural Information Processing Systems, 32, 2019.
  • [34] G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts. Understanding variable importances in forests of randomized trees. Advances in Neural Information Processing Systems, 26, 2013.
  • [35] B. Lu and J. Hardin. A unified framework for random forest prediction error estimation. The Journal of Machine Learning Research, 22(1):386–426, 2021.
  • [36] N. Meinshausen and G. Ridgeway. Quantile regression forests. Journal of Machine Learning Research, 7(6), 2006.
  • [37] L. Mentch and G. Hooker. Ensemble trees and CLTs: Statistical inference for supervised learning. Stat, 1050:25, 2014.
  • [38] J. Mourtada, S. Gaïffas, and E. Scornet. Minimax optimal rates for Mondrian trees and forests. The Annals of Statistics, 48(4):2253–2276, 2020.
  • [39] W. J. Nash, T. L. Sellers, S. R. Talbot, A. J. Cawthorn, and W. B. Ford. The population biology of abalone (Haliotis species) in Tasmania. I. Blacklip abalone (H. rubra) from the north coast and islands of Bass Strait. Sea Fisheries Division, Technical Report, 48:p411, 1994.
  • [40] E. O’Reilly and N. M. Tran. Minimax rates for high-dimensional random tessellation forests. Journal of Machine Learning Research, 25(89):1–32, 2024.
  • [41] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • [42] E. Scornet, G. Biau, and J.-P. Vert. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015.
  • [43] R. Sparapani, C. Spanbauer, and R. McCulloch. Nonparametric machine learning and efficient computation with Bayesian additive regression trees: the BART R package. Journal of Statistical Software, 97:1–66, 2021.
  • [44] C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pages 1040–1053, 1982.
  • [45] J. Tibshirani, S. Athey, R. Friedberg, V. Hadad, D. Hirshberg, L. Miner, E. Sverdrup, S. Wager, M. Wright, and M. J. Tibshirani. Package ‘grf’, 2023.
  • [46] S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
  • [47] S. Wager and G. Walther. Adaptive concentration of regression trees, with application to random forests. arXiv preprint arXiv:1503.06388, 2015.
  • [48] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • [49] M. N. Wright and A. Ziegler. ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv preprint arXiv:1508.04409, 2015.
  • [50] G. Zhang and Y. Lu. Bias-corrected random forests in regression. Journal of Applied Statistics, 39(1):151–160, 2012.

SUPPLEMENTARY MATERIALS FOR “ADAPTIVE SPLIT BALANCING FOR OPTIMAL RANDOM FOREST”

Notation

We denote rectangles $L\subseteq[0,1]^{d}$ of the form $L=\bigotimes_{j=1}^{d}[a_{j},b_{j}]$, where $0\leq a_{j}<b_{j}\leq 1$ for all $j=1,\dots,d$, and write the Lebesgue measure of $L$ as $\lambda(L)=\prod_{j=1}^{d}(b_{j}-a_{j})$. The indicator function of a subset $A$ of a set $X$ is the function $\mathbbm{1}_{A}$ defined as $\mathbbm{1}_{A}(x)=1$ if $x\in A$, and $\mathbbm{1}_{A}(x)=0$ if $x\notin A$. For any rectangle $L\subseteq[0,1]^{d}$, we denote $\mu(L):=\mathbb{E}[\mathbbm{1}_{\{\mathbf{X}\in L\}}]$ as the expected fraction of training examples falling within $L$, and $\#L:=\sum_{i\in\mathcal{I}}\mathbbm{1}_{\{\mathbf{X}_{i}\in L\}}$ as the number of training samples $\mathbf{X}_{i}$ falling within $L$. For any $n\times n$ matrix $\mathbf{A}$, let $\Lambda_{\min}(\mathbf{A})$ and $\Lambda_{\max}(\mathbf{A})$ denote the smallest and largest eigenvalues of $\mathbf{A}$, respectively. A $d$-dimensional vector of all ones is denoted by $\mathbf{1}_{d}$. A tree grown by recursive partitioning is called $(\alpha,k)$-regular for some $\alpha\in(0,0.5]$ and $k\in\mathbb{N}$ if the following conditions hold for the samples indexed by $\mathcal{I}$: (a) each child node contains at least an $\alpha$-fraction of the observations within its parent node, and (b) the number of observations within each terminal leaf is between $k$ and $2k-1$.
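As a concrete reading of the $(\alpha,k)$-regularity definition, the following sketch checks conditions (a) and (b) for a single split and for a terminal leaf; it is an illustration of the definition only and is not part of any algorithm.

def regular_split(n_parent, n_left, n_right, alpha):
    # Condition (a): each child keeps at least an alpha-fraction of the parent's samples.
    return n_left + n_right == n_parent and min(n_left, n_right) >= alpha * n_parent

def regular_leaf(n_leaf, k):
    # Condition (b): terminal leaves hold between k and 2k - 1 observations.
    return k <= n_leaf <= 2 * k - 1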

Appendix A The sparse local adaptive split balancing forests

In the following, we provide a generalized sparse version of the local adaptive split balancing forests proposed in Algorithm 3.

Algorithm 4 Sparse local adaptive split balancing forests
1:Observations 𝕊N=(𝐗i,Yi)i=1N\mathbb{S}_{N}=(\mathbf{X}_{i},Y_{i})_{i=1}^{N}, with parameters B1B\geq 1, α(0,0.5]\alpha\in(0,0.5], w(0,1]w\in(0,1], kwNk\leq\lfloor wN\rfloor, mtry{1,,d}\mbox{mtry}\in\{1,\dots,d\}, and qq\in\mathbb{N}.
2:Calculate the polynomial basis G(𝐗i)d¯G(\mathbf{X}_{i})\in\mathbb{R}^{\bar{d}} for each iNi\leq N.
3:for b=1,,Bb=1,\dots,B do
4:     Divide 𝕊N\mathbb{S}_{N} into disjoint 𝕊(b)\mathbb{S}_{\mathcal{I}}^{(b)} and 𝕊𝒥(b)\mathbb{S}_{\mathcal{J}}^{(b)} with |(b)|=wN|\mathcal{I}^{(b)}|=\lfloor wN\rfloor and |𝒥(b)|=NwN|\mathcal{J}^{(b)}|=N-\lfloor wN\rfloor.
5:     Create a collection of index sets, 𝒬([0,1]d)={Q1,,Qd}\mathcal{Q}([0,1]^{d})=\{Q_{1},\dots,Q_{d}\}, with each Qj{1,,d}Q_{j}\subset\{1,\dots,d\} containing mtry directions while ensuring that each direction 1,,d1,\dots,d appears in exactly mtry of the sets Q1,,QdQ_{1},\dots,Q_{d}.
6:     repeat For each current node L[0,1]dL\subseteq[0,1]^{d}:
7:         Randomly select a set Q𝒬(L)Q\in\mathcal{Q}(L).
8:         Split the node $L$ into children $L_{1}$ and $L_{2}$ along some direction $j\in Q$, choosing the direction in $Q$ and the split location to minimize
\sum_{i\in\mathcal{J}^{(b)}}(\widehat{Y}_{i}-\widetilde{Y}_{1})^{2}\mathbbm{1}\{\mathbf{X}_{i}\in L_{1}\}+\sum_{i\in\mathcal{J}^{(b)}}(\widehat{Y}_{i}-\widetilde{Y}_{2})^{2}\mathbbm{1}\{\mathbf{X}_{i}\in L_{2}\},
\mbox{ensuring }\#\{i\in\mathcal{I}^{(b)}:\mathbf{X}_{i}\in L_{l}\}\geq\alpha\#\{i\in\mathcal{I}^{(b)}:\mathbf{X}_{i}\in L\},\;l=1,2.
9:         if #𝒬(L)>1\#\mathcal{Q}(L)>1 then
10:              𝒬(L1)=𝒬(L2)=𝒬(L){Q}\mathcal{Q}(L_{1})=\mathcal{Q}(L_{2})=\mathcal{Q}(L)\setminus\{Q\}.
11:         else
12:              Randomly reinitialize $\mathcal{Q}(L_{1})$ and $\mathcal{Q}(L_{2})$ as in Step 5.
13:         end if
14:     until each current node contains between $k$ and $2k-1$ samples from $\mathbb{S}_{\mathcal{I}}^{(b)}$.
15:     Estimate 𝜷^(𝐱,ξb)\widehat{\bm{\beta}}(\mathbf{x},\xi_{b}) as defined in (3.1) using observations 𝕊(b)\mathbb{S}_{\mathcal{I}}^{(b)}.
16:     The bb-th sparse local adaptive split balancing tree: TL(𝐱,ξb):=G(𝐱)𝜷^(𝐱,ξb)T_{\mathrm{L}}(\mathbf{x},\xi_{b}):=G(\mathbf{x})^{\top}\widehat{\bm{\beta}}(\mathbf{x},\xi_{b}).
17:end for
18:return The sparse local adaptive split balancing forest m^L(𝐱):=B1b=1BTL(𝐱,ξb)\widehat{m}_{\mathrm{L}}(\mathbf{x}):=B^{-1}\sum_{b=1}^{B}T_{\mathrm{L}}(\mathbf{x},\xi_{b}).

The generalized version involves an extra tuning parameter mtry, which was also introduced in Algorithm 2, and performs local polynomial regressions within the terminal leaves as in Algorithm 3. It is worth noting that Algorithms 1–3 are all special cases of this most general version, Algorithm 4.
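The direction-set bookkeeping that enforces balanced splits in Algorithm 4 (Steps 5, 7, and 9–13) can be sketched as follows; this is a simplified illustration only (the split-point search, honesty, and leaf-level regressions are omitted), with directions indexed from 0 and illustrative function names.

import random

def make_direction_sets(d, mtry, rng):
    # Build Q_1, ..., Q_d, each of size mtry, so that every direction appears
    # in exactly mtry of the sets (here via cyclic blocks of consecutive directions).
    start = rng.randrange(d)
    return [{(start + j + s) % d for s in range(mtry)} for j in range(d)]

def child_direction_sets(node_sets, Q, d, mtry, rng):
    # Collections inherited by the two children after the parent split used Q.
    remaining = [S for S in node_sets if S is not Q]
    if remaining:                                   # Step 10: pass down Q(L) \ {Q}
        return list(remaining), list(remaining)
    # Collection exhausted: reinitialize for each child (Step 12).
    return make_direction_sets(d, mtry, rng), make_direction_sets(d, mtry, rng)

# Example with d = 4 directions and mtry = 2 candidates per split.
rng = random.Random(0)
sets = make_direction_sets(4, 2, rng)               # Step 5
Q = rng.choice(sets)                                # Step 7: candidates for this split
left_sets, right_sets = child_direction_sets(sets, Q, 4, 2, rng)

Because each direction appears in exactly mtry of the sets, every coordinate is offered as a candidate the same number of times before the collection is refreshed, which is the balancing mechanism distinguishing this construction from fully random feature selection.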

Appendix B Simulations for the ATE estimation

In this section, we evaluate the behavior of the forest-based ATE estimator proposed in Section 3.4 through simulation studies.

We focus on the estimation of $\theta=\mathbb{E}[Y(1)-Y(0)]$ and describe the considered data-generating processes below; a simulation sketch of these designs is provided after the list. Generate i.i.d. covariates $\mathbf{X}_{i}\sim\mathrm{Uniform}[0,1]^{d}$ and noise $\varepsilon_{i}\sim N(0,1)$ for each $i\leq N$. Let $A_{i}\mid\mathbf{X}_{i}\sim\mathrm{Bernoulli}(\pi^{*}(\mathbf{X}_{i}))$ for each $i\leq N$. The outcome variables are generated as $Y_{i}=A_{i}Y_{i}(1)+(1-A_{i})Y_{i}(0)$. Consider the following models for the propensity score and outcomes:

  • (a)

    Consider $\pi^{*}(\mathbf{X}_{i})=((\sum_{j=1}^{d}\mathbf{X}_{ij})/d+1.1)/((\sum_{j=1}^{d}\mathbf{X}_{ij})/d+2)$, $Y_{i}(1)=(\sum_{j=1}^{d-1}\mathbf{X}_{ij}\mathbf{X}_{i(j+1)}+\mathbf{X}_{id}\mathbf{X}_{i1})/d+\varepsilon_{i}$, and $Y_{i}(0)=-(\sum_{j=1}^{d-1}\mathbf{X}_{ij}\mathbf{X}_{i(j+1)}+\mathbf{X}_{id}\mathbf{X}_{i1})/d+\varepsilon_{i}$.

  • (b)

    Consider $\pi^{*}(\mathbf{X}_{i})=(\sum_{j=1}^{d}\mathbf{X}_{ij})/(\sum_{j=1}^{d}\mathbf{X}_{ij}+d)$, $Y_{i}(1)=2(\sum_{j=1}^{d-1}\mathbf{X}_{ij}\mathbf{X}_{i(j+1)}+\mathbf{X}_{id}\mathbf{X}_{i1})+\varepsilon_{i}$, and $Y_{i}(0)=-2(\sum_{j=1}^{d-1}\mathbf{X}_{ij}\mathbf{X}_{i(j+1)}+\mathbf{X}_{id}\mathbf{X}_{i1})+\varepsilon_{i}$.
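A minimal sketch of these designs is given below (Setting (a) is shown; Setting (b) only changes the propensity score and replaces the scaled interaction term by twice its unscaled value); the function name is illustrative only.

import numpy as np

def generate_ate_setting_a(N, d=5, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(N, d))
    s = X.sum(axis=1) / d
    pi = (s + 1.1) / (s + 2.0)                                    # propensity score
    ring = (X[:, :-1] * X[:, 1:]).sum(axis=1) + X[:, -1] * X[:, 0]
    tau = ring / d                                                # m1 = tau, m0 = -tau
    A = rng.binomial(1, pi)
    eps = rng.standard_normal(N)                                  # shared noise for Y(1), Y(0)
    Y = np.where(A == 1, tau + eps, -tau + eps)
    return X, A, Y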

Table B.1: Simulations for the forest-based ATE estimation. Bias: empirical bias; RMSE: root mean square error; Length: average length of the 95%95\% confidence intervals; Coverage: average coverage of the 95%95\% confidence intervals. All the reported values (except Coverage) are based on robust (median) estimates.
Method Bias RMSE Length Coverage Bias RMSE Length Coverage
Setting (a): $N=1000$, $d=5$  Setting (b): $N=500$, $d=5$
BOF1 0.020 0.060 0.298 0.950 -0.018 0.108 0.698 0.955
BOF 0.020 0.062 0.303 0.970 -0.012 0.106 0.714 0.945
HRF1 0.022 0.052 0.270 0.960 0.020 0.100 0.631 0.955
HRF 0.018 0.053 0.270 0.950 0.015 0.100 0.634 0.955
ASBF 0.016 0.051 0.263 0.955 0.011 0.098 0.627 0.950
BART 0.022 0.055 0.269 0.960 -0.015 0.104 0.632 0.940
LLF1 0.016 0.052 0.271 0.960 -0.011 0.099 0.619 0.935
LLF 0.015 0.051 0.271 0.950 -0.016 0.102 0.622 0.930
LL-ASBF 0.013 0.049 0.267 0.960 -0.010 0.097 0.616 0.945
LQ-ASBF 0.009 0.047 0.261 0.950 -0.005 0.092 0.613 0.940

In Settings (a) and (b), we set the sample sizes to 1000 and 500, respectively, with the covariate dimension fixed at $d=5$. Each setting is replicated 200 times. The results, presented in Table B.1, show that in both settings, all considered methods exhibit coverages close to the desired $95\%$. In terms of estimation, the ATE estimator based on our proposed ASBF method outperforms the other local averaging methods (BOF, BOF1, HRF, HRF1), as well as BART: across both settings, it yields smaller absolute biases and RMSEs, highlighting the advantage of the balancing technique. Furthermore, the proposed LL-ASBF consistently outperforms the existing local linear methods LLF and LLF1. Notably, LQ-ASBF exhibits the best performance across the considered settings.

Appendix C Auxiliary Lemmas

Lemma S.1 (Theorem 7 of [47]).

Let 𝒟={1,2,,d}\mathcal{D}=\{1,2,\dots,d\} and ω,ϵ(0,1)\omega,\epsilon\in(0,1). Then, there exists a set of rectangles 𝒟,ω,ϵ\mathcal{R}_{\mathcal{D},\omega,\epsilon} such that the following properties hold. Any rectangle LL of volume λ(L)ω\lambda(L)\geq\omega can be well approximated by elements in 𝒟,ω,ϵ\mathcal{R}_{\mathcal{D},\omega,\epsilon} from both above and below in terms of Lebesgue measure. Specifically, there exist rectangles R,R+𝒟,ω,ϵR_{-},R_{+}\in\mathcal{R}_{\mathcal{D},\omega,\epsilon} such that

RLR+andexp{ϵ}λ(R+)λ(L)exp{ϵ}λ(R).\displaystyle R_{-}\subseteq L\subseteq R_{+}\;\;\text{and}\;\;\exp\{-\epsilon\}\lambda(R_{+})\leq\lambda(L)\leq\exp\{\epsilon\}\lambda(R_{-}).

Moreover, the set 𝒟,ω,ϵ\mathcal{R}_{\mathcal{D},\omega,\epsilon} has cardinality bounded by

#𝒟,ω,ϵ=1ω(8d2ϵ2(1+log21ω))d(1+O(ϵ)).\displaystyle\#\mathcal{R}_{\mathcal{D},\omega,\epsilon}=\frac{1}{\omega}\left(\frac{8d^{2}}{\epsilon^{2}}\left(1+\log_{2}\left\lfloor\frac{1}{\omega}\right\rfloor\right)\right)^{d}\cdot(1+O(\epsilon)).
Lemma S.2 (Theorem 10 of [47]).

Suppose that $w\in(0,1]$ and $\alpha\in(0,0.5]$ are constants. Choose any $k\leq n=\lfloor wN\rfloor$ satisfying $k\gg\log(N)$. Let $\mathcal{L}$ be the collection of all possible leaves of $(\alpha,k)$-regular partitions. Let $\mathcal{R}_{\mathcal{D},\omega,\epsilon}$ be as defined in Lemma S.1, with $\omega$ and $\epsilon$ chosen as

ω=k2nandϵ=1k.\displaystyle\omega=\frac{k}{2n}\;\;\text{and}\;\;\epsilon=\frac{1}{\sqrt{k}}. (C.1)

Then, there exists an $n_{0}\in\mathbb{N}$ such that, for every $n\geq n_{0}$, the following statement holds with probability at least $1-n^{-1/2}$: for each leaf $L\in\mathcal{L}$, we can select a rectangle $\bar{R}\in\mathcal{R}_{\mathcal{D},\omega,\epsilon}$ such that $\bar{R}\subseteq L$, $\lambda(L)\leq\exp\{\epsilon\}\lambda(\bar{R})$, and

#L#R¯3ϵ#L+23log(#𝒟,ω,ϵ)#L+O(log(#𝒟,ω,ϵ)).\displaystyle\#L-\#\bar{R}\leq 3\epsilon\#L+2\sqrt{3\log(\#\mathcal{R}_{\mathcal{D},\omega,\epsilon})\#L}+O\left(\log(\#\mathcal{R}_{\mathcal{D},\omega,\epsilon})\right).
Lemma S.3 (Lemma 12 of [47]).

Fix a sequence δ(n)>0\delta(n)>0, and define the event

𝒜:={sup{|#Rnμ(R)|nμ(R):R,μ(R)μmin}3log(#δ)}\displaystyle\mathcal{A}:=\left\{\sup\left\{\frac{|\#R-n\mu(R)|}{\sqrt{n\mu(R)}}:R\in\mathcal{R},\mu(R)\geq\mu_{\min}\right\}\leq\sqrt{3\log\left(\frac{\#\mathcal{R}}{\delta}\right)}\right\}

for any set of rectangles \mathcal{R} and threshold μmin\mu_{\min}, where #R:=#{i:𝐗iR}\#R:=\#\{i\in\mathcal{I}:\mathbf{X}_{i}\in R\} and #\#\mathcal{R} is the number of rectangles of the set \mathcal{R}. Then, for any sequence of problems indexed by nn with

limnlog(#)nμmin=0andlimnδ1#=0,\displaystyle\lim_{n\to\infty}\frac{\log(\#\mathcal{R})}{n\mu_{\min}}=0\;\;\text{and}\;\;\lim_{n\to\infty}\frac{\delta^{-1}}{\#\mathcal{R}}=0, (C.2)

there is a threshold n0n_{0}\in\mathbb{N} such that, for all nn0n\geq n_{0}, we have (𝒜)1δ\mathbb{P}(\mathcal{A})\geq 1-\delta. Note that, above, 𝒜\mathcal{A}, \mathcal{R}, μmin\mu_{\min} and δ\delta are all implicitly changing with nn.

Lemma S.4.

Suppose that $w\in(0,1]$ and $\alpha\in(0,0.5]$ are constants. Choose any $k\leq n=\lfloor wN\rfloor$ satisfying $k\gg\log(N)$. Then, there exists a positive constant $\Lambda_{0}>0$ such that the event

:={inf𝐱[0,1]d,ξΞΛmin(𝐒L𝐝L𝐝L)Λ0}\displaystyle\mathcal{B}:=\left\{\inf_{\mathbf{x}\in[0,1]^{d},\xi\in\Xi}\Lambda_{\min}\left(\mathbf{S}_{L}-\mathbf{d}_{L}\mathbf{d}_{L}^{\top}\right)\geq\Lambda_{0}\right\} (C.3)

satisfies limN𝕊()=1\lim_{N\to\infty}\mathbb{P}_{\mathbb{S}_{\mathcal{I}}}\left(\mathcal{B}\right)=1, where 𝐝L\mathbf{d}_{L} and 𝐒L\mathbf{S}_{L} are defined as in (LABEL:def:d_S). In addition, on the event \mathcal{B}, the matrices 𝐒L𝐝L𝐝L\mathbf{S}_{L}-\mathbf{d}_{L}\mathbf{d}_{L}^{\top} and 𝐒L\mathbf{S}_{L} are both positive-definite, and we also have

sup𝐱[0,1]d,ξΞ𝐝L(𝐒L𝐝L𝐝L)1𝐝Ld¯Λ0.\displaystyle\sup_{\mathbf{x}\in[0,1]^{d},\xi\in\Xi}\mathbf{d}_{L}^{\top}(\mathbf{S}_{L}-\mathbf{d}_{L}\mathbf{d}_{L}^{\top})^{-1}\mathbf{d}_{L}\leq\frac{\bar{d}}{\Lambda_{0}}. (C.4)
Lemma S.5.

Let the assumptions in Lemma S.4 hold. Define the event

𝒞:={diamj(L(𝐱,ξ))0,for all  1jd,𝐱[0,1]d,ξΞ}.\displaystyle\mathcal{C}:=\left\{\mathrm{diam}_{j}(L(\mathbf{x},\xi))\neq 0,\;\;\text{for all}\;\;1\leq j\leq d,\;\;\mathbf{x}\in[0,1]^{d},\;\;\xi\in\Xi\right\}. (C.5)

Then, we have $\mathbb{P}_{\mathbb{S}_{\mathcal{I}}}(\mathcal{C})=1$. Moreover, on the event $\mathcal{B}\cap\mathcal{C}$, the matrices $\mathbf{S}$, $\sum_{i\in\mathcal{I}}\omega_{i}(\mathbf{x},\xi)\bm{\Delta}_{i}\bm{\Delta}_{i}^{\top}$, and $\sum_{i\in\mathcal{I}}\omega_{i}(\mathbf{x},\xi)\mathbf{G}(\mathbf{X}_{i})\mathbf{G}(\mathbf{X}_{i})^{\top}$ are all positive-definite, where $\bm{\Delta}_{i}=\mathbf{G}(\mathbf{X}_{i}-\mathbf{x})$.

In the following, we turn to the average treatment effect (ATE) estimation problem and show that the proposed forests provide stable propensity score estimates that are bounded away from zero and one with high probability.

Lemma S.6.

Let Assumption 4(c) hold and $\pi^{*}\in\mathcal{H}^{q_{2},\beta_{2}}$, where $q_{2}\in\mathbb{N}$ and $\beta_{2}\in(0,1]$. Let $M>0$, $w_{2}\in(0,1]$, and $\alpha_{2}\in(0,0.5]$ be constants. Choose any $B\geq 1$ and $k_{2}\gg\log^{3}(N)$. Then, as $N\to\infty$,

𝐗(c1<π^k(𝐗)1c1)=1,for eachkK,\mathbb{P}_{\mathbf{X}}(c_{1}<\widehat{\pi}^{-k}(\mathbf{X})\leq 1-c_{1})=1,\;\;\mbox{for each}\;\;k\leq K, (C.6)

with probability approaching one, for some constant $c_{1}\in(0,1/2)$. Note that the left-hand side of (C.6) is a random quantity, as the probability is only taken with respect to a new observation $\mathbf{X}$.

Lemma S.6 demonstrates the stability of the inverse PS estimates, a requirement often assumed in the context of non-parametric nuisance estimation, as discussed in [13]. The above result suggests that, under the overlap assumption, there is typically no need to employ any form of trimming or truncation on the estimated propensities, provided the chosen tuning parameter $k_{2}$ is not too small. In fact, the minimum leaf size $k_{2}$ can itself be viewed as a truncation parameter, since it ensures, with high probability, that the propensity score estimates stay away from zero and one.

Appendix D Proofs of the results for the adaptive split balancing forests

Proof of Lemma 2.1.

For any $\mathbf{x}\in[0,1]^{d}$ and $\xi\in\Xi$, let $c(\mathbf{x},\xi)$ be the number of splits leading to the leaf $L(\mathbf{x},\xi)$, and let $c_{j}(\mathbf{x},\xi)$ be the number of such splits along the $j$-th coordinate. Define $t=\min_{1\leq j\leq d}c_{j}(\mathbf{x},\xi)$. By the balanced splitting rule, the number of splits along different coordinates differs by at most one; that is, $c_{j}(\mathbf{x},\xi)\in\{t,t+1\}$ for all $1\leq j\leq d$. Since $c(\mathbf{x},\xi)=\sum_{j=1}^{d}c_{j}(\mathbf{x},\xi)$, $c(\mathbf{x},\xi)$ can be expressed as $c(\mathbf{x},\xi)=td+l$, with $l=c(\mathbf{x},\xi)-td\in\{0,1,\dots,d-1\}$ denoting the number of splits in the last round when $l\neq 0$. Let $L_{0}(\mathbf{x},\xi)\supseteq L_{1}(\mathbf{x},\xi)\supseteq\dots\supseteq L_{c(\mathbf{x},\xi)}(\mathbf{x},\xi)$ be the successive nodes leading to $L(\mathbf{x},\xi)$, where $L_{0}(\mathbf{x},\xi)=[0,1]^{d}$ and $L_{c(\mathbf{x},\xi)}(\mathbf{x},\xi)=L(\mathbf{x},\xi)$. Let $n_{0},n_{1},\dots,n_{c(\mathbf{x},\xi)}$ be random variables denoting the number of points in $\mathbb{S}_{\mathcal{I}}$ located within the successive nodes $L_{0}(\mathbf{x},\xi),L_{1}(\mathbf{x},\xi),\dots,L_{c(\mathbf{x},\xi)}(\mathbf{x},\xi)$, where $n_{0}=\lfloor wN\rfloor$. Since the tree is $(\alpha,k)$-regular, we know that $\alpha n_{i-1}\leq n_{i}\leq(1-\alpha)n_{i-1}$ for each $1\leq i\leq td+l$, and hence the following deterministic inequalities hold:

αin0ni(1α)in0,\displaystyle\alpha^{i}n_{0}\leq n_{i}\leq(1-\alpha)^{i}n_{0}, (D.1)
αtd+linintd+l(1α)td+lini.\displaystyle\alpha^{td+l-i}n_{i}\leq n_{td+l}\leq(1-\alpha)^{td+l-i}n_{i}. (D.2)

It follows that αtd+lwNntd+l(1α)td+lwN\alpha^{td+l}\lfloor wN\rfloor\leq n_{td+l}\leq(1-\alpha)^{td+l}\lfloor wN\rfloor. Moreover, note that ntd+l[k,2k1]n_{td+l}\in[k,2k-1]. Hence, we have k(1α)td+lwNk\leq(1-\alpha)^{td+l}\lfloor wN\rfloor and αtd+lwN2k1\alpha^{td+l}\lfloor wN\rfloor\leq 2k-1, which implies that

T:=log((2k1)/wN)dlog(α)1log((2k1)/wN)dlog(α)\displaystyle T:=\left\lceil\frac{\log((2k-1)/\lfloor wN\rfloor)}{d\log(\alpha)}-1\right\rceil\leq\frac{\log((2k-1)/\lfloor wN\rfloor)}{d\log(\alpha)}-\frac{l}{d}\leq t\leq\frac{\log(k/\lfloor wN\rfloor)}{d\log(1-\alpha)}-\frac{l}{d}. (D.3)