Oblique and rotation double random forest
Abstract
Random forest is an ensemble of decision trees based on the bagging and random subspace concepts. As suggested by Breiman, the strength of unstable learners and the diversity among them are the core strengths of ensemble models. In this paper, we propose two approaches known as oblique and rotation double random forests. In the first approach, we propose a rotation based double random forest. In rotation based double random forests, a transformation or rotation of the feature space is generated at each node. Since a different random feature subspace is chosen for evaluation at each node, the transformation at each node is different. Different transformations result in better diversity among the base learners and hence better generalization performance. With the double random forest as the base learner, the data at each node are transformed via two different transformations, namely principal component analysis and linear discriminant analysis. In the second approach, we propose an oblique double random forest. Decision trees in random forest and double random forest are univariate, which results in axis-parallel splits that fail to capture the geometric structure of the data. Also, the standard random forest may not grow sufficiently large decision trees, resulting in suboptimal performance. To capture the geometric properties and to grow decision trees of sufficient depth, we propose the oblique double random forest. The oblique double random forest models are multivariate decision trees. At each non-leaf node, a multisurface proximal support vector machine generates the optimal plane for better generalization performance. Also, different regularization techniques (Tikhonov regularisation, axis-parallel split regularisation and null space regularisation) are employed to tackle the small sample size problems in the decision trees of the oblique double random forest. The proposed ensembles of decision trees produce larger trees than the standard ensembles of decision trees, as bagging is used at each node, which results in improved performance. The evaluation of the baseline models and the proposed oblique and rotation double random forest models is performed on benchmark UCI datasets and real-world fisheries datasets. Both the statistical analysis and the experimental results demonstrate the efficacy of the proposed oblique and rotation double random forest models compared to the baseline models on the benchmark datasets.
keywords: Double random forest, Oblique random forest, Support vector machine, Bias, Ensemble, Oblique, Orthogonal, Classification, Classifiers, Ensemble learning, Random forest, Bootstrap, Decision tree.

1 Introduction
The perturb and combine approach [breiman1996bias] is the core of the ensemble strategy [dietterich2000ensemble] and hence it has been used across different domains such as machine learning [wiering2008ensemble], pattern recognition in computer vision tasks [goerss2000tropical], mining big data [lulli2019mining] and the biomedical domain [pal2021prediction]. Both theoretical and empirical aspects of ensemble learning have been explored in the literature. Multiple classifier systems [zhou2013multiple], or ensemble learning, perturb the input data to induce diversity among the base learners of an ensemble and use a combination strategy to aggregate the outputs of the base learners, so that the generalization of the ensemble model is superior to that of the individual learners.
To analyze why ensemble learning performs better than individual models, studies on the reduction in variance among the base learners [breiman1996bias, geurts2006extremely, zhang2008rotboost] have been put forth. In the bias and variance reduction theory [breiman1996bias, kohavi1996bias], the classification error is expressed in terms of bias and variance. The bias measures how far the average guess of each base learner is from the target class over the perturbed training sets generated from a given training set, and the variance measures how much the base learners' guesses fluctuate across the perturbations of the given training set.
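For intuition, the regression analogue of this decomposition under squared error (a standard result stated here only for illustration; the classification-specific definitions of [breiman1996bias, kohavi1996bias] are more involved) is

$$\mathbb{E}\big[(y - \hat{f}_D(x))^2\big] = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_D\big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\big]}_{\text{variance}} + \sigma^2,$$

where the expectation is taken over the perturbed training sets $D$ and the noise in $y$, $\hat{f}_D$ is the learner trained on $D$, $f$ is the target function and $\sigma^2$ is the irreducible noise. Averaging many high-variance, low-bias learners, as bagging does, primarily attacks the variance term.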
The decision tree algorithm is a commonly used classification model due to its simplicity and good interpretability. A decision tree uses the divide and conquer approach to recursively partition the data. The recursive partitioning is sensitive to perturbations of the input data and results in an unstable classifier. Hence, it is said to have high variance and low bias. The ensemble methodology can be applied to unstable classifiers to further improve classification performance.
Random forest [breiman2001random] and rotation forest [rodriguez2006rotation] are well-known classification models, widely used in the literature. Both models are based on the ensemble methodology and use the decision tree as the base classifier. Due to its better generalization performance, random forest proved to be one of the best performing models among the classifiers evaluated across a wide range of datasets [fernandez2014we].
Random forest, an ensemble of decision trees, uses the bagging [breiman1996bagging] and random subspace [ho1998random] strategies. These two approaches induce diversity among the base learners, here decision trees, for better generalization. Bagging, also known as bootstrap aggregation, generates multiple bags of a given training set such that each decision tree is trained on a given bag of the data. Each tree uses a bag of training data whose distribution is akin to that of the whole population and hence each classifier shows good generalization performance. Within each decision tree, the random subspace approach is used at each non-terminal node to further boost the diversity among the base models. Random forest has been successfully applied to the analysis of gene expression data [jiang2004joint], EEG classification [shen2007feature], spectral data classification [menze2009comparison], object recognition, image segmentation [ho1998random, hothorn2005design] and chimera identification [ganaie2020identification]. Other applications include feature selection [menze2009comparison], analysis of sample proximities [menze2007multivariate] and so on. Random forest has also been adapted to Spark based distributed and scalable environments [lulli2017crack, lulli2017reforest]. With growing privacy concerns, random forest models have been improved to meet privacy expectations. Differential privacy [dwork2008differential] has been widely adopted in random forests [patil2014differential, fletcher2017differentially, guan2020differentially, xin2019differentially].
To obtain better generalization performance, several hyperparameters of the random forest need to be chosen optimally. These hyperparameters include the number of base learners (here, decision trees) in a forest (ntree), the number of candidate features evaluated at a given non-leaf node (mtry), and the number of samples allowed in an impure node (nodesize or minleaf) (we use minleaf and nodesize interchangeably). To set these parameters optimally, different studies have been carried out. Analyses of the tuning process [probst2017tune, freeman2016random], of the sensitivity of the parameters [huang2016parameter] and of the effect of the number of trees in an ensemble [banfield2006comparison, hernandez2013large, oshiro2012many] provide insight into how these parameters affect the model performance. To obtain the optimal number of candidate features, different methods [boulesteix2012overview, han2019optimal] have been proposed. Analysis of the optimal sample size in bagging [martinez2010out] and estimation of tree size via a combination of random forest with adaptive nearest neighbours [lin2006random] result in a better choice of the hyperparameters.
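As a concrete illustration, these three hyperparameters correspond to arguments of scikit-learn's RandomForestClassifier; the mapping below is a hedged sketch, and the argument names belong to that library rather than to the original papers (the semantics of min_samples_leaf only approximate nodesize/minleaf).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# ntree            -> n_estimators      (number of decision trees in the forest)
# mtry             -> max_features      (candidate features evaluated at each non-leaf node)
# nodesize/minleaf -> min_samples_leaf  (closest analogue: smallest number of samples in a leaf)
clf = RandomForestClassifier(n_estimators=500, max_features="sqrt", min_samples_leaf=1)
clf.fit(X, y)
print(clf.score(X, y))
```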
Broadly speaking, there are two approaches for generating decision trees, namely univariate decision trees [banfield2006comparison] and multivariate decision trees [murthy1995growing]. Univariate decision trees, also known as axis parallel or orthogonal decision trees, use an impurity criterion to choose the best univariate split feature among a randomly chosen subspace of features. Multivariate decision trees, also known as oblique decision trees, perform the node splitting using all or a subset of the features. In general, the decision boundary of an oblique decision tree can only be approximated by a large number of stair-like decision boundaries of univariate decision trees.
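To make the distinction concrete, the two split tests can be written as follows; this is an illustrative sketch, and the feature index, weights and thresholds are hypothetical placeholders.

```python
import numpy as np

x = np.array([0.7, 1.2, -0.3])            # a single sample with three features

# Univariate (axis-parallel) test: compare one feature against a cut point.
j, cut = 1, 0.5                            # hypothetical feature index and threshold
goes_left_axis = x[j] <= cut

# Multivariate (oblique) test: compare a linear combination of all features.
w, b = np.array([0.4, -1.1, 0.2]), 0.05    # hypothetical hyperplane parameters
goes_left_oblique = np.dot(w, x) + b <= 0
```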
Random forest is a univariate model and builds a hyperplane at each non-terminal node such that splitting at the child nodes becomes easier in a given decision tree. At a given non-leaf node, the splitting hyperplane may not be a good classifier [manwani2011geometric]. Criteria like the entropy measure, the Gini index and the twoing rule are used in most decision tree based models for choosing, among a set of candidate splits, the best split, i.e., the one with the lowest impurity score. At each non-leaf node, the impurity criterion measures how skewed the distribution of samples over the different categories is. A nearly uniform class distribution is assigned a high impurity score, while a low impurity score is given to a distribution wherein the samples of a particular class dominate the other classes. In most decision tree induction algorithms, some impurity measure is optimized for generating the tree. However, due to the non-differentiability of the impurity measures with respect to the hyperplane parameters, different search techniques are employed for generating the decision trees, such as the deterministic hill-climbing model in CART-LC [breiman1984classification] and the randomized search based on CART-LC in OC1 [murthy1993oc1]. In high dimensional feature spaces, both these methods suffer because they search one dimension at a time and are prone to local optima. Thus, multiple trials or restarts are used to minimize the chances of ending up in a local optimum. Evolutionary approaches have also been used to optimize over all dimensions simultaneously [pedrycz2005genetically, cha2009genetic]; they tolerate noisy evaluations of a rating function and can also optimize multiple rating functions simultaneously [cantu2003inducing, pangilinan2011pareto]. Extremely randomized trees [geurts2006extremely] and their oblique version [zhang2014towards] strongly randomize the attribute choice and its cut point. Other approaches include fuzzy decision trees [wang2008induction, wang2008improving], ensembles of feature spaces [zhang2014random] and the decision tree support vector machine [zhang2007decision]. Random feature weights for decision tree ensembles [maudes2012random] associate a weight with each attribute for better diversity of the model. Recent studies have evaluated the interpretability of decision forests so that their decisions can be interpreted for better understanding [sagi2020explainable, fernandez2020random]. For more literature about decision trees, we refer the readers to [rokach2016decision].
As noted in [manwani2011geometric], the issue with all these impurity measures is that they are functions only of the class distributions on the two sides of the hyperplane and ignore the geometric structure of the class regions: the impurity score remains unaffected if the data labels are rearranged, as long as the number of samples of each class on either side of the hyperplane is unchanged.
To incorporate the geometric structure of the class distributions, support vector machines (SVM) [cortes1995support] have been employed to generate decision trees [manwani2011geometric]. The multisurface proximal support vector machine (MPSVM) [mangasarian2005multisurface] generates proximal hyperplanes in such a manner that each plane is proximal to the samples of one class and farthest from the samples of the other class. The authors of [manwani2011geometric] generated the two clustering planes at each non-leaf node and chose the angle bisector of these planes that makes the child nodes purer. MPSVM is a binary class algorithm; hence, they decomposed the multiclass problem into a binary one by grouping the majority class samples into one class and the remaining samples into the other class. As a node becomes purer with the growth of the tree, the subsequent nodes receive a smaller number of samples. To avoid the resulting small sample size problem, the null space method [chen2000new] is used in [manwani2011geometric]. Also, the MPSVM based oblique decision tree ensemble [zhang2014oblique] employed regularisation approaches like Tikhonov regularization [marroquin1987probabilistic] and axis-parallel split regularization. In [ganaie2020oblique], the twin bounded SVM (TBSVM) [shao2011improvements] resulted in better generalization performance, as no explicit regularisation methods are needed to handle these problems. Both the MPSVM based oblique decision tree ensemble [zhang2014oblique] and the TBSVM based oblique decision tree ensemble [ganaie2020oblique] use a single base learner at each non-leaf node to search for the optimal split among the candidate splits. The oblique decision tree ensembles showed better generalization than the standard random forest [zhang2017benchmarking]. The heterogeneous oblique random forest [katuwal2020heterogeneous] generates hyperplanes via MPSVM, logistic regression, linear discriminant analysis, least squares SVM and ridge regression; the optimal hyperplane for the best split is chosen among the generated planes, which results in purer nodes.
A recent study of the double random forest [han2020double] evaluated the effect of node size on the performance of the model. The study revealed that the prediction performance may improve if deeper decision trees are generated. The authors showed that the largest tree grown on a given dataset by the standard random forest might not be sufficiently large to give the optimal performance. Hence, the double random forest [han2020double] generates decision trees that are bigger than the ones in the standard random forest. The maximum performance of the random forest is achieved at the minimum node size, which generates the largest trees [zhang2014oblique]. This supports the hypothesis that the larger the trees of an ensemble, the better the performance of the model. Instead of training each decision tree on a different bag of the training set obtained via the bagging approach at the root node, the authors of [han2020double] generated each tree with the original training set and used bootstrap aggregation at each non-terminal node of the decision tree to obtain the best split. However, both the random forest and the double random forest are univariate decision trees and hence ignore the geometric class distributions, resulting in lower generalization performance. To overcome these issues, we propose the oblique double random forest. Oblique double random forest models integrate the benefits of the double random forest and the geometric structure information of the class distribution for better generalization performance. To generate more diverse ensemble learners in the double random forest, the feature space is rotated or transformed at each non-leaf node using two transformations, namely linear discriminant analysis and principal component analysis. Applying these transformations at each non-leaf node on different randomly chosen feature subspaces improves the diversity among the base models and leads to better generalization performance.
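A minimal sketch of the node-level rotation idea follows, assuming PCA from scikit-learn as the transformation; the helper name rotate_node_subspace and the toy data are illustrative and not the authors' implementation (LDA would be applied analogously on the same randomly chosen subspace).

```python
import numpy as np
from sklearn.decomposition import PCA

def rotate_node_subspace(X_node, mtry, rng):
    """Rotate a randomly chosen feature subspace of the node's data via PCA.

    Illustrative sketch only: the node is assumed to hold at least `mtry`
    samples, and candidate splits would then be searched on the rotated columns.
    """
    n_features = X_node.shape[1]
    subspace = rng.choice(n_features, size=mtry, replace=False)   # random feature subset
    pca = PCA(n_components=mtry).fit(X_node[:, subspace])         # node-specific rotation
    return subspace, pca, pca.transform(X_node[:, subspace])      # rotated candidate features

rng = np.random.default_rng(0)
X_node = rng.normal(size=(50, 10))     # samples reaching a hypothetical node
subspace, pca, X_rot = rotate_node_subspace(X_node, mtry=3, rng=rng)
```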
The main highlights of this paper are:
1. We use different rotations (principal component analysis and linear discriminant analysis) at each non-leaf node to generate diverse double random forest ensembles (DRaF-PCA and DRaF-LDA).
2. The proposed oblique double random forest variants (MPDRaF-T, MPDRaF-P and MPDRaF-N) use MPSVM to obtain the optimal separating hyperplanes at each non-terminal node of the decision tree ensembles.
3. The proposed ensembles of double random forest generate larger trees compared to the variants of the standard random forest.
4. Statistical analysis reveals that the average rank of the proposed double random forest models is superior to that of the standard random forest. Moreover, the average accuracy of the proposed DRaF-LDA, DRaF-PCA and MPDRaF-P is higher than that of the standard random forest and the standard double random forest models. Also, the average rank of the proposed DRaF-LDA, MPDRaF-P and DRaF-PCA is better than that of the standard double random forest.
2 Related work
In this section, we briefly review the related work on ensembles of decision trees.
ABBREVIATION | DEFINITION |
---|---|
PCA | Principal component analysis |
LDA | Linear discriminant analysis |
SVM | Support vector machines |
RaF | Standard Random Forest |
DRaF | Standard double Random Forest |
MPSVM | Multisurface proximal support vector machines |
RaF-PCA | Principal component analysis based ensemble of decision trees |
RaF-LDA | Linear discriminant analysis based ensemble of decision trees |
MPRaF-T | MPSVM based oblique decision tree ensemble with Tikhonov regularisation |
MPRaF-P | MPSVM based oblique decision tree ensemble with axis parallel regularisation |
MPRaF-N | MPSVM based oblique decision tree ensemble with NULL space regularisation |
DRaF-PCA | Rotation based double random forest with principal component analysis |
DRaF-LDA | Rotation based double random forest with linear discriminant analysis |
MPDRaF-T | Oblique double random forest with MPSVM via Tikhonov regularisation |
MPDRaF-P | Oblique double random forest with MPSVM via axis parallel regularisation |
MPDRaF-N | Oblique double random forest with MPSVM via NULL space regularisation |
2.1 Handling multiclass problems
MPSVM is a binary classification model, while finding the optimal separating hyperplane at each non-terminal node of a decision tree may be a multiclass problem. To handle a multiclass problem via a binary class approach, different methods like one-versus-all [bottou1994comparison], one-versus-one [knerr1990single], decision directed acyclic graphs [platt1999large], error correcting output codes [dietterich1994solving] and so on have been proposed. The data partitioning rule applied at each non-leaf node of a decision tree proves handy here compared to other binary decomposition approaches [zhang2014oblique]. Simply separating the classes by taking the majority class samples as one class and the remaining samples as the other class results in an inefficient model, as it fails to capture the geometric structure of the data samples [manwani2011geometric]. To incorporate the geometric structure, the authors in [zhang2014oblique] decomposed the multiclass problem into a binary one by using class separability information, measured via the Bhattacharyya distance. In statistics, the Bhattacharyya distance measures the similarity between two discrete or continuous probability distributions and is deemed to give a good indication of the separability of two normally distributed classes $\mathcal{N}(\mu_i, \Sigma_i)$, where $\mu_i$ and $\Sigma_i$ are the parameters of the normal distribution of class $i$. Following a similar approach to [zhang2014oblique], we model each class by a multivariate Gaussian distribution [jiang2011linear]. Motivated by [zhang2014oblique, jiang2011linear], we use the Bhattacharyya distance to measure the class separability for decomposing the multiclass problem into a binary class problem (Algorithm 1; a brief sketch of this grouping procedure follows the algorithm).
Input:
$D$: the training dataset with $n$ data points and feature size $d$.
$Y$: the target labels, with classes $1, \dots, C$.
Output:
$G_+$ and $G_-$: the two hyperclasses or groups.
1. For each pair of classes $i$ and $j$, with $i, j \in \{1, \dots, C\}$ and $i \neq j$, compute the Bhattacharyya distance

$$B(i,j) = \frac{1}{8}(\mu_i - \mu_j)^\top \Big(\frac{\Sigma_i + \Sigma_j}{2}\Big)^{-1} (\mu_i - \mu_j) + \frac{1}{2}\ln\frac{\det\big((\Sigma_i + \Sigma_j)/2\big)}{\sqrt{\det\Sigma_i \, \det\Sigma_j}}. \qquad (1)$$

2. Find the pair of classes $(i^*, j^*)$ with the maximum Bhattacharyya distance, and assign them to $G_+$ and $G_-$, respectively.
3. For every other class $k$, if $B(k, i^*) < B(k, j^*)$, then assign $k$ to $G_+$; otherwise assign $k$ to $G_-$.
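A compact sketch of this grouping procedure is given below, assuming each class is summarized by the sample mean and covariance of its training points; the function names (bhattacharyya, group_classes) are illustrative, and in practice a small ridge may be added to the covariances to keep the determinants well defined.

```python
import numpy as np
from itertools import combinations

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians, as in (1)."""
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def group_classes(X, y):
    """Split a multiclass problem into two hyperclasses (Algorithm 1 sketch)."""
    classes = np.unique(y)
    stats = {c: (X[y == c].mean(axis=0), np.cov(X[y == c], rowvar=False)) for c in classes}
    # Steps 1-2: the most separable pair of classes seeds the two groups.
    seed_a, seed_b = max(combinations(classes, 2),
                         key=lambda p: bhattacharyya(*stats[p[0]], *stats[p[1]]))
    group_a, group_b = [seed_a], [seed_b]
    # Step 3: every remaining class joins the seed class it is closer to.
    for c in classes:
        if c in (seed_a, seed_b):
            continue
        if bhattacharyya(*stats[c], *stats[seed_a]) < bhattacharyya(*stats[c], *stats[seed_b]):
            group_a.append(c)
        else:
            group_b.append(c)
    return group_a, group_b
```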
2.2 Multisurface proximal support vector machine
The multisurface proximal support vector machine (MPSVM) [mangasarian2005multisurface] is a binary class algorithm. Suppose $A \in \mathbb{R}^{n_1 \times d}$ and $B \in \mathbb{R}^{n_2 \times d}$ are the data points belonging to the positive and negative class, respectively. Here, $n_1 + n_2 = n$, and each sample lies in $\mathbb{R}^{d}$. MPSVM generates two hyperplanes as

$$x^\top w_1 + b_1 = 0 \quad \text{and} \quad x^\top w_2 + b_2 = 0, \qquad (2)$$

where $x^\top w_1 + b_1 = 0$ and $x^\top w_2 + b_2 = 0$ are the planes closer to the samples of the positive and negative class, respectively. For the first plane, MPSVM minimises the sum of squared two-norm distances from the samples of the positive class divided by the sum of squared distances from the samples of the negative class (and symmetrically for the second plane). Thus, the optimization problems of MPSVM are given as follows:

$$\min_{(w_1, b_1) \neq 0} \; \frac{\|A w_1 + e_1 b_1\|^2}{\|B w_1 + e_2 b_1\|^2} \qquad (3)$$

and

$$\min_{(w_2, b_2) \neq 0} \; \frac{\|B w_2 + e_2 b_2\|^2}{\|A w_2 + e_1 b_2\|^2}, \qquad (4)$$

where $\|\cdot\|$ is the two-norm and $e_1$, $e_2$ are vectors of ones of appropriate dimensions.

Suppose

$$G = [A \;\; e_1]^\top [A \;\; e_1], \qquad H = [B \;\; e_2]^\top [B \;\; e_2], \qquad (5)$$

then the optimization problem (3) is given as

$$\min_{z_1 \neq 0} \; \frac{z_1^\top G z_1}{z_1^\top H z_1}. \qquad (6)$$

Similarly, the optimization problem (4) is given as follows:

$$\min_{z_2 \neq 0} \; \frac{z_2^\top H z_2}{z_2^\top G z_2}, \qquad (7)$$

where $z_1 = [w_1;\, b_1]$ and $z_2 = [w_2;\, b_2]$.

The clustering hyperplanes are obtained by solving the following generalized eigenvalue problems:

$$G z_1 = \lambda_1 H z_1, \qquad (8)$$
$$H z_2 = \lambda_2 G z_2. \qquad (9)$$

The optimal hyperplanes are given by the eigenvectors corresponding to the smallest eigenvalues of (8) and (9), respectively.
The way (8) and (9) are defined, the clustering hyperplanes are able to capture the geometric properties of the data, which is helpful when discriminating between the classes.
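A minimal numerical sketch of this procedure is shown below, assuming NumPy and SciPy; the function name mpsvm_planes is illustrative, and the small delta added to the matrices is our own conditioning device (playing the same role as the Tikhonov regularisation discussed elsewhere in the paper) rather than part of the original formulation.

```python
import numpy as np
from scipy.linalg import eig

def mpsvm_planes(A, B, delta=1e-6):
    """Fit the two MPSVM clustering hyperplanes by solving the generalized
    eigenvalue problems (8) and (9). `delta` is a small conditioning term."""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    Ae, Be = np.hstack([A, e1]), np.hstack([B, e2])
    G = Ae.T @ Ae + delta * np.eye(Ae.shape[1])
    H = Be.T @ Be + delta * np.eye(Be.shape[1])

    # Plane proximal to the positive class: smallest eigenvalue of G z = lambda H z.
    vals1, vecs1 = eig(G, H)
    z1 = np.real(vecs1[:, np.argmin(np.real(vals1))])

    # Plane proximal to the negative class: smallest eigenvalue of H z = lambda G z.
    vals2, vecs2 = eig(H, G)
    z2 = np.real(vecs2[:, np.argmin(np.real(vals2))])

    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])   # (w1, b1), (w2, b2)

# Illustrative usage on toy Gaussian clusters.
rng = np.random.default_rng(0)
A = rng.normal(loc=[1.0, 1.0], size=(40, 2))
B = rng.normal(loc=[-1.0, -1.0], size=(40, 2))
(w1, b1), (w2, b2) = mpsvm_planes(A, B)
```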
2.2.1 Random Forest
Random forest [breiman2001random] is an ensemble with decision trees as base learners, generated using the concepts of bagging and the random subspace method. Both bagging and the random subspace method induce diversity among the decision trees of the ensemble. Each decision tree of the ensemble chooses the optimal split among a randomly selected candidate feature subset at a given non-leaf node. The optimal split is chosen using an impurity criterion such as information gain or Gini impurity [breiman1984classification].
The algorithm of the random forest is given in Algorithm 2. The classification and regression tree (CART) [breiman1984classification] performs the test split using only one feature and is hence known as a univariate decision tree [murthy1995growing].
Training Phase:
Given:
- $D$: the training dataset with $n$ data points and feature size $d$.
- $Y$: the target labels.
- $L$: the number of base learners.
- "mtry": the number of candidate features to be evaluated at each non-leaf node.
- "nodesize" or "minleaf": the maximum number of samples in an impure node.
For each base learner $l = 1, \dots, L$:
1. Generate a bootstrap sample $D_l$ from $D$.
2. Generate the decision tree using $D_l$. For a given node $i$:
   (i) Choose "mtry" features at random from the given feature space.
   (ii) Select the best split feature and the cut point among the random feature subset.
   (iii) With the optimal split feature and the cut point, divide the data.
   Repeat steps (i)-(iii) until the stopping criterion is met.
Classification Phase:
For a test data point, use the base learners of the forest to generate labels of the test sample. The predicted label of the test data point is given by the majority vote of the decision trees of the ensemble.
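The training and classification phases of Algorithm 2 can be sketched as follows, using scikit-learn's DecisionTreeClassifier as the base learner and its max_features argument as the analogue of mtry; the forest size, data and variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# Training phase: each tree is grown on a bootstrap sample (step 1) and evaluates
# a random feature subset at every node (steps (i)-(iii), via max_features).
forest = []
for _ in range(25):
    boot = rng.integers(0, len(X), size=len(X))
    forest.append(DecisionTreeClassifier(max_features="sqrt").fit(X[boot], y[boot]))

# Classification phase: majority vote over the trees' predictions.
votes = np.stack([tree.predict(X) for tree in forest])
y_pred = np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
print("training accuracy:", (y_pred == y).mean())
```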
The double random forest [han2020double] differs from Algorithm 2 in that each tree is grown on the full training data and bootstrap aggregation is applied at each non-terminal node to select the best split, as summarised below.
Training Phase:
Given:
- $D$: the training dataset with $n$ data points and feature size $d$.
- $D_i$: the training samples reaching a node $i$, with $n_i$ samples and feature size $d$.
- $Y$: the target labels.
- $L$: the number of base learners.
- "mtry": the number of candidate features to be evaluated at each non-leaf node.
- "nodesize" or "minleaf": the maximum number of data samples to be placed in an impure node.
For each base learner $l = 1, \dots, L$:
1. Use the full training data $D$ (no bagging at the root).
2. Generate the decision tree with randomly chosen feature subsets and randomised bootstrap instances. For a given node $i$ with data $D_i$:
   If the node-level bootstrapping criterion of [han2020double] is satisfied, generate a bootstrap sample $D_i^B$ from $D_i$; otherwise set $D_i^B = D_i$.
   (i) Choose "mtry" features at random from the given feature space of $D_i^B$.
   (ii) Select the best split feature and the cut point among the random feature subset of $D_i^B$.
   (iii) With the optimal split feature and the cut point, split the original node data $D_i$ into child nodes.
   Repeat steps (i)-(iii) until either of the following is satisfied:
   (a) The node reaches its purest form.
   (b) The number of samples reaching a given node is less than or equal to minleaf.
Classification Phase:
For a test data point, use the decision trees of the forest to generate labels of the test sample. The predicted class of the test data point is given by the majority vote of the decision trees of the ensemble.
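A self-contained sketch of the node-level bagging idea follows, under the simplifying assumption that a bootstrap sample is drawn at every non-leaf node (the exact criterion for when to bootstrap is deferred to [han2020double]); all function names and the toy data are illustrative.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, mtry, rng):
    """Best axis-parallel (feature, cut point) pair on the given data by weighted Gini."""
    best, n = (None, None, np.inf), len(y)
    for j in rng.choice(X.shape[1], size=mtry, replace=False):
        for cut in np.unique(X[:, j])[:-1]:          # drop the max to keep both children non-empty
            left = X[:, j] <= cut
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, cut, score)
    return best[0], best[1]

def grow_double_rf_tree(X_node, y_node, mtry, minleaf, rng):
    """One double random forest tree: the split is searched on a bootstrap of the
    node's data, but the ORIGINAL node data are passed down to the children."""
    if len(np.unique(y_node)) == 1 or len(y_node) <= minleaf:
        return {"leaf": True, "label": np.bincount(y_node).argmax()}
    boot = rng.integers(0, len(y_node), size=len(y_node))     # bootstrap at this node
    feature, cut = best_split(X_node[boot], y_node[boot], mtry, rng)
    if feature is None:
        return {"leaf": True, "label": np.bincount(y_node).argmax()}
    left = X_node[:, feature] <= cut                           # split the original node data
    if left.all() or not left.any():
        return {"leaf": True, "label": np.bincount(y_node).argmax()}
    return {"leaf": False, "feature": feature, "cut": cut,
            "left": grow_double_rf_tree(X_node[left], y_node[left], mtry, minleaf, rng),
            "right": grow_double_rf_tree(X_node[~left], y_node[~left], mtry, minleaf, rng)}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
tree = grow_double_rf_tree(X, y, mtry=2, minleaf=5, rng=rng)   # one tree grown on the full data
```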