
Sparsity in Optimal Randomized Classification Trees

Rafael Blanquero rblanquero@us.es Emilio Carrizosa ecarrizosa@us.es Cristina Molero-Río mmolero@us.es Dolores Romero Morales drm.eco@cbs.dk Instituto de Matemáticas de la Universidad de Sevilla (IMUS), Seville, Spain Copenhagen Business School, Frederiksberg, Denmark
Abstract

Decision trees are popular Classification and Regression tools and, when small-sized, easy to interpret. Traditionally, a greedy approach has been used to build the trees, yielding a very fast training process; however, controlling sparsity (a proxy for interpretability) is challenging. In recent studies, optimal decision trees, where all decisions are optimized simultaneously, have shown a better learning performance, especially when oblique cuts are implemented. In this paper, we propose a continuous optimization approach to build sparse optimal classification trees, based on oblique cuts, with the aim of using fewer predictor variables in the cuts as well as along the whole tree. Both types of sparsity, namely local and global, are modeled by means of regularizations with polyhedral norms. The computational experience reported supports the usefulness of our methodology. In all our data sets, local and global sparsity can be improved without harming classification accuracy. Moreover, unlike greedy approaches, ours can easily trade in some classification accuracy for a gain in global sparsity.

keywords:
Data mining , Optimal Classification Trees , Global and Local Sparsity , Nonlinear Programming
journal: European Journal of Operational Research


1 Introduction

Decision trees [40] are a popular non-parametric tool for Classification and Regression in Statistics and Machine Learning [21]. Since they are rule-based, when small-sized, they are deemed to be leaders in terms of interpretability [1, 2, 9, 15, 17, 23, 27, 28, 32, 36].

It is well-known that the problem of building optimal decision trees is NP-complete [22].

For this reason, classic decision trees have been traditionally designed using greedy procedures in which, at each branch node of the tree, some purity criterion is (locally) optimized. For instance, CARTs [8] employ a greedy and recursive partitioning procedure which is computationally cheap, especially since orthogonal cuts are implemented, i.e., one single predictor variable is involved in each branching rule. These rules are of maximal sparsity at each branching node (excellent local sparsity), making classic decision trees locally easy to interpret. However, when deep, they become harder to interpret, since many predictor variables are, in general, involved across all branching rules (not so good global sparsity).

Addressing global sparsity is a challenge in decision trees and, to the best of our knowledge, this has not been tackled appropriately in the literature. Standard CARTs or Random Forests (RFs) [5, 7, 13, 16] cannot manage it due to the greedy construction of the trees. Nonetheless, some attempts have been made, see [11, 12]. Classic decision trees usually select their orthogonal cuts at each branch node by optimizing an information theory criterion among all possible predictor variables and thresholds. The regularization framework in [11] considers a penalty to this criterion for predictor variables that have not appeared yet in the tree. This approach is refined in [12], by also including the importance scores of the predictor variables, obtained in a preprocessing step running a preliminary RF.

The mainstream trend of using a greedy strategy in the construction of decision trees may lead to myopic decisions, which, in turn, may affect the overall learning performance. The major advances in Mathematical Optimization [10, 30, 33] have led to different approaches to build decision trees with some overall optimality criterion, called hereafter optimal classification trees. It is worth mentioning recent proposals which grow optimal classification trees of a pre-established depth, both deterministic [4, 14, 18, 37, 38] and randomized [6]. The deterministic approaches formulate the problem of building the tree as a mixed-integer linear optimization problem. Such an approach is the most natural, since many discrete decisions are to be made when building a decision tree. Although the results of such optimal classification trees are encouraging, the inclusion of integer decision variables makes the computing times explode, giving rise to models trained over a small subsample of the data set [18] and, as customary, with a CPU time limit being imposed on the optimization solver. On the other hand, a continuous optimization-based approach to build optimal randomized classification trees is proposed in [6]. This is achieved by replacing the yes/no decisions in traditional trees by probabilistic decisions, i.e., instead of deciding at each branch node whether an individual goes either to the left or to the right child node in the tree, the probability of going to the left is sought. The numerical results in [6] illustrate the good performance achieved in very short time. All these optimization-based approaches are flexible enough to address critical issues that the greedy nature of classic decision trees would find difficult to handle, such as preferences on the classification performance in some class where misclassifying is more damaging [6, 37, 38], or controlling the number of predictor variables used along the tree (local and global sparsity).

Optimal classification trees have been grown with both orthogonal [4, 14, 18] and oblique cuts [3, 4, 6, 29, 37, 38]. Oblique cuts are more flexible than orthogonal ones since a combination of several predictor variables is allowed in the branching. Trees based on oblique cuts lead to similar or even better learning performance than those based on orthogonal cuts, and, at the same time, they exhibit a shallow depth, since several orthogonal cuts may be reduced to one single oblique cut. Apart from the flexibility that we can borrow from them, many integer decision variables associated with orthogonal cuts are not present in the oblique ones, which eases the optimization. Therefore, optimal classification trees based on oblique cuts require a lower training computing time while showing much more promising results in terms of accuracy. However, this comes at the expense of damaging interpretability, since, in principle, all the predictor variables could appear in each branching rule. In this paper, we tackle this issue.

We propose a novel optimal classification tree, based on the methodology in [6] and, therefore, on oblique cuts, which yields sparser rules/trees and thus enhances interpretability. We model this as a continuous optimization problem.

As in the classic LASSO model [35], sparsity is sought by means of regularization terms.

We model local sparsity with the $\ell_{1}$-norm, and global sparsity with the $\ell_{\infty}$-norm. The $\ell_{\infty}$ regularization has been applied to other classifiers, for instance, Support Vector Machines [25, 26, 41], but the $\ell_{1}$ is more popular. A novel continuous-based approach for building this sparse optimal randomized classification tree is provided. Theoretical results on the range of the sparsity parameters are shown. Our numerical results, where well-known real data sets are used, illustrate the effectiveness of our methodology: sparsity in optimal classification trees improves without harming learning performance. In addition, we show our ability to trade in some of our classification accuracy, while still being superior to CART, to become comparable to CART in terms of global sparsity.

The remainder of the paper is organized as follows. In Section 2 we detail the construction of the Sparse Optimal Randomized Classification Tree. Some theoretical properties are given in Section 3. In Section 4, our numerical experience is reported.

Finally, conclusions and possible lines of future research are provided in Section 5.

2 Sparsity in Optimal Randomized Classification Trees

2.1 Introduction

We assume given a training sample $\left\{\left(\bm{x}_{i},y_{i}\right)\right\}_{1\leq i\leq N}$, where $\bm{x}_{i}$ represents the $p$-dimensional vector of predictor variables of individual $i$, and $y_{i}\in\left\{1,\ldots,K\right\}$ indicates the class membership. Without loss of generality, we assume $\bm{x}_{i}\in\left[0,1\right]^{p}$, $i=1,\ldots,N$.

Sparse Optimal Randomized Classification Trees, addressed in this paper, extend the Optimal Randomized Classification Trees (ORCTs) in [6]. An ORCT is an optimal binary classification tree of a given depth $D$, obtained by minimizing the expected misclassification cost over the training sample.

Figure 1: Optimal Randomized Classification Tree of depth $D=2$.

Figure 1 shows the structure of an ORCT of depth $D=2$. Unlike classic decision trees, oblique cuts, in which more than one predictor variable takes part, are performed. ORCTs are modeled by means of a Non-Linear Continuous Optimization formulation. The usual deterministic yes/no rule at each branch node is replaced by a smoother rule: a probabilistic decision rule at each branch node, induced by a cumulative distribution function (CDF) $F$, is obtained. Therefore, the movements in ORCTs can be seen as randomized: at a given branch node of an ORCT, a random variable will be generated to indicate by which branch an individual has to continue. Since binary trees are built, the Bernoulli distribution is appropriate, whose probability of success will be determined by the value of this CDF, evaluated over the vector of predictor variables. More precisely, at a given branch node $t$ of the tree, an individual with predictor variables $\bm{x}$ will go either to the left or to the right child node with probabilities $F\left(\dfrac{1}{p}\bm{a}_{\cdot t}^{T}\bm{x}-\mu_{t}\right)$ and $1-F\left(\dfrac{1}{p}\bm{a}_{\cdot t}^{T}\bm{x}-\mu_{t}\right)$, respectively, where $\bm{a}_{\cdot t}$ and $\mu_{t}$ are decision variables. For further details on the construction of ORCTs, the reader is referred to [6]. The Sparse ORCT, S-ORCT, minimizes the expected misclassification cost over the training sample, regularized with two polyhedral norms.

The following notation is needed:

Parameters
$D$ depth of the binary tree,
$N$ number of individuals in the training sample,
$p$ number of predictor variables,
$K$ number of classes,
$\left\{\left(\bm{x}_{i},y_{i}\right)\right\}_{1\leq i\leq N}$ training sample, where $\bm{x}_{i}\in\left[0,1\right]^{p}$ and $y_{i}\in\left\{1,\ldots,K\right\}$,
$I_{k}$ set of individuals in the training sample belonging to class $k$, $k=1,\ldots,K$,
$W_{y_{i}k}$ misclassification cost incurred when classifying an individual $i$, whose class is $y_{i}$, in class $k$, $i=1,\ldots,N$, $k=1,\ldots,K$,
$F\left(\cdot\right)$ univariate continuous CDF centered at 0, used to define the probabilities for an individual to go to the left or the right child node in the tree. We will assume that $F$ is the CDF of a continuous random variable with density $f$,
$\lambda^{L}\geq 0$ local sparsity regularization parameter,
$\lambda^{G}\geq 0$ global sparsity regularization parameter,
Nodes
$\tau_{B}$ set of branch nodes,
$\tau_{L}$ set of leaf nodes,
$N_{L}\left(t\right)$ set of ancestor nodes of leaf node $t$ whose left branch takes part in the path from the root node to leaf node $t$, $t\in\tau_{L}$,
$N_{R}\left(t\right)$ set of ancestor nodes of leaf node $t$ whose right branch takes part in the path from the root node to leaf node $t$, $t\in\tau_{L}$,
Decision variables
$a_{jt}\in\left[-1,1\right]$ coefficient of predictor variable $j$ in the oblique cut at branch node $t\in\tau_{B}$, with $\bm{a}$ being the $p\times\left\lvert\tau_{B}\right\rvert$ matrix of these coefficients, $\bm{a}=\left(a_{jt}\right)_{j=1,\ldots,p,\,\,t\in\tau_{B}}$. The expressions $\bm{a}_{j\cdot}$ and $\bm{a}_{\cdot t}$ will denote the $j$-th row and the $t$-th column of $\bm{a}$, respectively,
$\mu_{t}\in\left[-1,1\right]$ location parameter at branch node $t\in\tau_{B}$, with $\bm{\mu}$ being the vector that comprises every $\mu_{t}$, i.e., $\bm{\mu}=\left(\mu_{t}\right)_{t\in\tau_{B}}$,
$C_{kt}$ probability of being assigned to class label $k\in\left\{1,\ldots,K\right\}$ for an individual at leaf node $t$, $t\in\tau_{L}$, with $\bm{C}$ being the $K\times\left\lvert\tau_{L}\right\rvert$ matrix such that $\bm{C}=\left(C_{kt}\right)_{k=1,\ldots,K,\,\,t\in\tau_{L}}$.
Probabilities
$p_{it}\left(\bm{a}_{\cdot t},\mu_{t}\right)$ probability of individual $i$ going down the left branch at branch node $t$. Its expression is $p_{it}\left(\bm{a}_{\cdot t},\mu_{t}\right)=F\left(\dfrac{1}{p}\bm{a}_{\cdot t}^{T}\bm{x}_{i}-\mu_{t}\right)$, $i=1,\ldots,N$, $t\in\tau_{B}$,
$P_{it}\left(\bm{a},\bm{\mu}\right)$ probability of individual $i$ falling into leaf node $t$. Its expression is $P_{it}\left(\bm{a},\bm{\mu}\right)=\prod_{t_{l}\in N_{L}(t)}p_{it_{l}}\left(\bm{a}_{\cdot t_{l}},\mu_{t_{l}}\right)\prod_{t_{r}\in N_{R}(t)}\left(1-p_{it_{r}}\left(\bm{a}_{\cdot t_{r}},\mu_{t_{r}}\right)\right)$, $i=1,\ldots,N$, $t\in\tau_{L}$,
$g\left(\bm{a},\bm{\mu},\bm{C}\right)$ expected misclassification cost over the training sample. Its expression is $g\left(\bm{a},\bm{\mu},\bm{C}\right)=\dfrac{1}{N}\sum_{i=1}^{N}\sum_{t\in\tau_{L}}P_{it}\left(\bm{a},\bm{\mu}\right)\sum_{k=1}^{K}W_{y_{i}k}C_{kt}$.
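As an aside not in the original paper, the following minimal NumPy sketch evaluates $p_{it}$, $P_{it}$ and $g$ for a fixed tree; the function names, the encoding of $N_L(t)$ and $N_R(t)$ as lists of branch-node indices, and the cost-matrix layout are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def split_probs(X, a, mu, F):
    """p_{it} = F(a_{.t}^T x_i / p - mu_t): probability of going left at each branch node t."""
    N, p = X.shape
    return F(X @ a / p - mu)                       # shape (N, |tau_B|)

def leaf_probs(P_split, left_anc, right_anc):
    """P_{it}: probability of individual i falling into leaf t.

    left_anc[t] / right_anc[t] list the branch nodes (columns of P_split) whose
    left / right branch lies on the path from the root to leaf t."""
    N = P_split.shape[0]
    P = np.ones((N, len(left_anc)))
    for t, (L, R) in enumerate(zip(left_anc, right_anc)):
        for tl in L:
            P[:, t] *= P_split[:, tl]
        for tr in R:
            P[:, t] *= 1.0 - P_split[:, tr]
    return P

def expected_cost(P_leaf, W_row, C):
    """g = (1/N) sum_i sum_t P_{it} sum_k W_{y_i k} C_{kt}, with W_row[i, k] = W_{y_i k}."""
    return float(np.mean(np.einsum('it,ik,kt->i', P_leaf, W_row, C)))

# For the depth-2 tree of Figure 1 (branch nodes 0, 1, 2; leaves ordered left to right):
# left_anc  = [[0, 1], [0], [2], []]
# right_anc = [[],     [1], [0], [0, 2]]
```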

2.2 The formulation

With these parameters and decision variables, the S-ORCT is formulated as follows:

$\min\quad g\left(\bm{a},\bm{\mu},\bm{C}\right)+\lambda^{L}\sum_{j=1}^{p}\left\lVert\bm{a}_{j\cdot}\right\rVert_{1}+\lambda^{G}\sum_{j=1}^{p}\left\lVert\bm{a}_{j\cdot}\right\rVert_{\infty}$ (1)
s.t. $\sum_{k=1}^{K}C_{kt}=1,\,\,t\in\tau_{L},$ (2)
$\sum_{t\in\tau_{L}}C_{kt}\geq 1,\,\,k=1,\ldots,K,$ (3)
$a_{jt}\in\left[-1,1\right],\,\,j=1,\ldots,p,\,\,t\in\tau_{B},$ (4)
$\mu_{t}\in\left[-1,1\right],\,\,t\in\tau_{B},$ (5)
$C_{kt}\in\left[0,1\right],\,\,k=1,\ldots,K,\,\,t\in\tau_{L}.$ (6)

In the objective function we have three terms, the first being the expected misclassification cost in the training sample, while the second and the third are regularization terms. The second term addresses local sparsity, since it penalizes the coefficients of the predictor variables used in the cuts along the tree. Instead, the third term controls whether a given predictor variable is ever used across the whole tree, thus addressing global sparsity. The $\ell_{\infty}$-norm is used as a group penalty function, by forcing the coefficients linked to the same predictor variable to be shrunk simultaneously along all branch nodes. Note that both local and global sparsity are equivalent when dealing with depth $D=1$, as there is a single cut across the whole tree.

In terms of the feasible region, for each leaf node $t\in\tau_{L}$, $C_{kt}$ represents the probability that an individual at node $t$ is assigned to class $k\in\left\{1,\ldots,K\right\}$. Constraints (2) force such probabilities to sum to $1$, while constraints (3) force the sum of the probabilities along all leaf nodes $t\in\tau_{L}$ assigned to class $k$ to be at least one.

Theorem 1 guarantees the existence of an optimal deterministic solution, i.e., such probabilities $C_{kt}$ will all be in $\left\{0,1\right\}$, and thus (6) can be replaced by

$C_{kt}\in\left\{0,1\right\},\,\,k=1,\ldots,K,\,\,t\in\tau_{L}.$ (7)

Constraints (6) and (7) will be used interchangeably when needed.

Theorem 1.

There exists an optimal solution to (1)-(6) such that $C_{kt}\in\left\{0,1\right\}$, $k=1,\ldots,K$, $t\in\tau_{L}$.

Proof.

The continuity of the objective function (1), defined over a compact set, ensures the existence of an optimal solution of the optimization problem (1)-(6), by the Weierstrass Theorem. Let $\bm{a}^{*}=\left(a^{*}_{jt}\right)_{j=1,\ldots,p,\,\,t\in\tau_{B}}$, $\bm{\mu}^{*}=\left(\mu_{t}^{*}\right)_{t\in\tau_{B}}$, $\bm{C}^{*}=\left(C^{*}_{kt}\right)_{k=1,\ldots,K,\,\,t\in\tau_{L}}$ be an optimal solution. With $\bm{a}^{*}$ and $\bm{\mu}^{*}$ fixed, $\bm{C}^{*}$ is optimal to the following problem in the decision variables $C_{kt}$, $k=1,\ldots,K$, $t\in\tau_{L}$:

$\min\quad\dfrac{1}{N}\sum_{i=1}^{N}\sum_{t\in\tau_{L}}P_{it}\left(\bm{a}^{*},\bm{\mu}^{*}\right)\sum_{k=1}^{K}W_{y_{i}k}C_{kt}+\lambda^{L}\sum_{j=1}^{p}\left\lVert\bm{a}^{*}_{j\cdot}\right\rVert_{1}+\lambda^{G}\sum_{j=1}^{p}\left\lVert\bm{a}^{*}_{j\cdot}\right\rVert_{\infty}$
s.t. $\sum_{k=1}^{K}C_{kt}=1,\,\,t\in\tau_{L},$
$\sum_{t\in\tau_{L}}C_{kt}\geq 1,\,\,k=1,\ldots,K,$
$C_{kt}\in\left[0,1\right],\,\,k=1,\ldots,K,\,\,t\in\tau_{L}.$

This is a transportation problem, for which the integrality of an optimal solution is well known to hold, i.e., there exists $\overline{\bm{C}}=\left(\overline{C}_{kt}\right)_{k=1,\ldots,K,\,\,t\in\tau_{L}}$ with $\overline{C}_{kt}\in\left\{0,1\right\}$ for all $k,\,t$ such that $\left(\bm{a}^{*},\bm{\mu}^{*},\overline{\bm{C}}\right)$ is also optimal for (1)-(6). ∎

Theorem 1 gives a new interpretation of constraints (2)-(3): if (7) is used instead of (6), when $C_{kt}$ takes the value $1$, then all the individuals at node $t\in\tau_{L}$ are labelled as $k$; and $0$, otherwise. Constraints (2) state that any leaf node $t\in\tau_{L}$ must be labelled with exactly one class label, and constraints (3) state that each class $k$ has at least one node $t$ with such label.

Once the optimization problem is solved, the S-ORCT predicts the class of a new unlabeled observation with predictor vector $\bm{x}$ with a probabilistic rule, namely, we estimate the probability of being in class $k$ as $\sum_{t\in\tau_{L}}C_{kt}\cdot P_{\bm{x}t}\left(\bm{a},\bm{\mu}\right)$. If a deterministic classification rule is sought, we allocate to the most probable class. Moreover, if prior probabilities $\Pi_{k}\left(\bm{x}\right)$ are given, one can also use the Bayes rule.
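For illustration only, and reusing the hypothetical split_probs and leaf_probs helpers sketched in Section 2.1, the probabilistic and deterministic prediction rules could look as follows; the names and array layouts are assumptions, not the authors' code.

```python
import numpy as np

def predict_proba(X, a, mu, C, F, left_anc, right_anc):
    """Estimated class probabilities sum_t C_{kt} P_{xt}(a, mu) for each row of X."""
    P_leaf = leaf_probs(split_probs(X, a, mu, F), left_anc, right_anc)
    return P_leaf @ C.T                            # shape (N, K)

def predict(X, a, mu, C, F, left_anc, right_anc):
    """Deterministic rule: allocate each observation to its most probable class."""
    return np.argmax(predict_proba(X, a, mu, C, F, left_anc, right_anc), axis=1)
```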

ORCTs were also shown to deal effectively with controlling the correct classification rate on different classes. This idea can also be applied to S-ORCTs. Hence, given the classes $k=1,\ldots,K$ to be controlled and their corresponding desired performances $\rho_{k}$, each performance guarantee can be imposed in expectation, provided that the following set of constraints is added to the model:

$\sum_{i\in I_{k}}\sum_{t\in\tau_{L}}P_{it}\left(\bm{a},\bm{\mu}\right)C_{kt}\geq\rho_{k}\lvert I_{k}\rvert,\,\,k=1,\ldots,K.$ (8)

With these constraints we have a direct control on the classification performance in each class separately. This is useful when dealing with imbalanced data sets.

2.3 A smooth reformulation

Problem (1)-(6) is non-smooth due to the norms $\left\lVert\cdot\right\rVert_{1}$ and $\left\lVert\cdot\right\rVert_{\infty}$ appearing in the objective function. A smooth version is easily obtained by rewriting both regularization terms using new decision variables. Since the first regularization term includes absolute values,

$\left\lVert\bm{a}_{j\cdot}\right\rVert_{1}=\sum_{t\in\tau_{B}}\left\lvert a_{jt}\right\rvert,\,\,j=1,\ldots,p,$

decision variables $a_{jt}\in\left[-1,1\right]$, $j=1,\ldots,p$, $t\in\tau_{B}$, are split into their positive and negative counterparts $a^{+}_{jt},\,a^{-}_{jt}\in\left[0,1\right]$, $j=1,\ldots,p$, $t\in\tau_{B}$, respectively, holding $a_{jt}=a^{+}_{jt}-a^{-}_{jt}$ and $\left\lvert a_{jt}\right\rvert=a^{+}_{jt}+a^{-}_{jt}$. Similarly, we denote $\bm{a}^{+}=\left(a^{+}_{jt}\right)_{j=1,\ldots,p,\,\,t\in\tau_{B}}$ and $\bm{a}^{-}=\left(a^{-}_{jt}\right)_{j=1,\ldots,p,\,\,t\in\tau_{B}}$. Regarding the second regularization term, new decision variables $\beta_{j}\in\left[0,1\right]$ are needed:

$\left\lVert\bm{a}_{j\cdot}\right\rVert_{\infty}=\max_{t\in\tau_{B}}\left\lvert a_{jt}\right\rvert=\beta_{j}\in\left[0,1\right],\,\,j=1,\ldots,p,$

and we have to force $\beta_{j}\geq\left\lvert a_{jt}\right\rvert=a^{+}_{jt}+a^{-}_{jt}$, $j=1,\ldots,p$, $t\in\tau_{B}$.

We can now formulate S-ORCT as a smooth problem, thus solvable with standard continuous optimization solvers, as done in our computational section. Indeed, we have that (1)-(6) is equivalent to

$\min\quad g\left(\bm{a}^{+}-\bm{a}^{-},\bm{\mu},\bm{C}\right)+\lambda^{L}\sum_{j=1}^{p}\sum_{t\in\tau_{B}}\left(a_{jt}^{+}+a_{jt}^{-}\right)+\lambda^{G}\sum_{j=1}^{p}\beta_{j}$ (9)
s.t. $\sum_{k=1}^{K}C_{kt}=1,\,\,t\in\tau_{L},$ (10)
$\sum_{t\in\tau_{L}}C_{kt}\geq 1,\,\,k=1,\ldots,K,$ (11)
$\beta_{j}\geq a_{jt}^{+}+a_{jt}^{-},\,\,j=1,\ldots,p,\,\,t\in\tau_{B},$ (12)
$a_{jt}^{+},\,a_{jt}^{-}\in\left[0,1\right],\,\,j=1,\ldots,p,\,\,t\in\tau_{B},$ (13)
$\beta_{j}\in\left[0,1\right],\,\,j=1,\ldots,p,$ (14)
$\mu_{t}\in\left[-1,1\right],\,\,t\in\tau_{B},$ (15)
$C_{kt}\in\left[0,1\right],\,\,k=1,\ldots,K,\,\,t\in\tau_{L}.$ (16)

Observe that, if we are only concerned about global sparsity, and thus we set $\lambda^{L}=0$, the rewriting of the decision variables $a_{jt}$, $j=1,\ldots,p$, $t\in\tau_{B}$, is no longer necessary; (4) replaces (13), and (12) turns into

$\beta_{j}\geq a_{jt},\,\,j=1,\ldots,p,\,\,t\in\tau_{B},$ (17)
$\beta_{j}\geq-a_{jt},\,\,j=1,\ldots,p,\,\,t\in\tau_{B}.$ (18)
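To make the smooth formulation concrete, the sketch below builds (9)-(16) for the simplest case, depth $D=1$ with $K=2$ classes, in Pyomo; it is a hedged illustration assuming the data are scaled to $[0,1]^{p}$, classes are coded 0/1, misclassification costs equal 0.5 and the logistic CDF of Section 4 is used. The function and variable names are our own, not taken from the paper.

```python
import pyomo.environ as pyo

def build_sorct_depth1(X, y, lam_L, lam_G, gamma=512.0):
    """Sketch of the smooth S-ORCT (9)-(16) for depth D = 1 and K = 2 classes."""
    N, p = X.shape
    K, leaves = 2, [0, 1]                              # leaf 0 = left child, leaf 1 = right child
    W = {(c, k): 0.0 if c == k else 0.5 for c in range(K) for k in range(K)}

    m = pyo.ConcreteModel()
    m.J = pyo.RangeSet(0, p - 1)
    m.a_plus = pyo.Var(m.J, bounds=(0, 1))             # positive part of a_{j1}
    m.a_minus = pyo.Var(m.J, bounds=(0, 1))            # negative part of a_{j1}
    m.beta = pyo.Var(m.J, bounds=(0, 1))               # beta_j >= |a_{j1}|
    m.mu = pyo.Var(bounds=(-1, 1))
    m.C = pyo.Var(range(K), leaves, bounds=(0, 1))     # class-assignment probabilities C_{kt}

    def F(z):                                          # logistic CDF with steepness gamma
        return 1.0 / (1.0 + pyo.exp(-gamma * z))

    def p_left(i):                                     # probability of individual i going left
        return F(sum((m.a_plus[j] - m.a_minus[j]) * float(X[i, j]) for j in m.J) / p - m.mu)

    expected_cost = (1.0 / N) * sum(
        p_left(i) * sum(W[y[i], k] * m.C[k, 0] for k in range(K))
        + (1 - p_left(i)) * sum(W[y[i], k] * m.C[k, 1] for k in range(K))
        for i in range(N))

    m.obj = pyo.Objective(
        expr=expected_cost
        + lam_L * sum(m.a_plus[j] + m.a_minus[j] for j in m.J)     # local sparsity term
        + lam_G * sum(m.beta[j] for j in m.J),                     # global sparsity term
        sense=pyo.minimize)

    m.leaf_prob = pyo.Constraint(leaves, rule=lambda m, t: sum(m.C[k, t] for k in range(K)) == 1)
    m.class_cover = pyo.Constraint(range(K), rule=lambda m, k: sum(m.C[k, t] for t in leaves) >= 1)
    m.beta_link = pyo.Constraint(m.J, rule=lambda m, j: m.beta[j] >= m.a_plus[j] + m.a_minus[j])
    return m
```

The resulting nonlinear program can then be handed to a standard continuous solver, e.g. pyo.SolverFactory('ipopt').solve(m), as done in the computational section.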

3 Theoretical properties

This section discusses some theoretical properties enjoyed by the S-ORCT. Let us consider the objective function of (1)-(6). When taking $\lambda^{L}$ and $\lambda^{G}$ large enough, the first term, related to the performance of the classifier, becomes negligible and therefore $\bm{a}$ will shrink to $\bm{0}$. The tree with $\bm{a}=\bm{0}$ is the sparsest possible tree, though not a promising one from the accuracy point of view, since none of the predictor variables is used to classify. In this case, the probability of an individual with predictor variables $\bm{x}$ being assigned to class $k$ is independent of $\bm{x}$, and nothing more than the distribution of classes is available. In this section, we derive upper bounds for the sparsity parameters, $\lambda^{L}$ and $\lambda^{G}$, in the sense that above these bounds the sparsest tree (with $\bm{a}^{*}=\bm{0}$) is a stationary point of the S-ORCT, that is, there exists $\left(\bm{a}^{*}=\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right)$ such that the necessary optimality condition with respect to $\bm{a}$ is satisfied. This is done in Theorems 2 and 3.

Theorem 2.

Let $\sigma\in\left[0,1\right]$. For

$\lambda^{L}\geq\left(1-\sigma\right)\max_{\bm{\mu}\in\left[-1,1\right]^{\left\lvert\tau_{B}\right\rvert},\,\bm{C}\in\left\{0,1\right\}^{K\times\left\lvert\tau_{L}\right\rvert}}\,\max_{j=1,\ldots,p}\left\lVert\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu},\bm{C}\right)\right\rVert_{\infty}$ and
$\lambda^{G}\geq\sigma\,\max_{\bm{\mu}\in\left[-1,1\right]^{\left\lvert\tau_{B}\right\rvert},\,\bm{C}\in\left\{0,1\right\}^{K\times\left\lvert\tau_{L}\right\rvert}}\,\max_{j=1,\ldots,p}\left\lVert\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu},\bm{C}\right)\right\rVert_{1},$

$\bm{a}^{*}=\bm{0}$ is a stationary point of the S-ORCT.

Proof.

Let $\sigma$, $\lambda^{L}$, $\lambda^{G}$ be such that they satisfy the assumptions.

By Theorem 1, there exists an optimal solution $\left(\bm{a}^{*},\bm{\mu}^{*},\bm{C}^{*}\right)$ to (1)-(6) satisfying $C^{*}_{kt}\in\left\{0,1\right\}$ for all $k=1,\ldots,K$, $t\in\tau_{L}$. In the following we will show that $\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right)$ is a stationary point of the S-ORCT, i.e.,

$-\nabla_{\bm{a}}g\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right)\in\partial_{\bm{a}}\left(\lambda^{L}\sum_{j=1}^{p}\left\lVert\bm{a}_{j\cdot}\right\rVert_{1}+\lambda^{G}\sum_{j=1}^{p}\left\lVert\bm{a}_{j\cdot}\right\rVert_{\infty}\right)\left(\bm{0}\right)$ (19)

where $\partial_{\bm{a}}$ is the subdifferential operator.

For every $\bm{a}_{j\cdot}$, $j=1,\ldots,p$, we have that

$\partial_{\bm{a}_{j\cdot}}\left(\left\lVert\bm{a}_{j\cdot}\right\rVert_{1}\right)\left(\bm{0}\right)=\mathbb{B}_{\infty}=\left\{\bm{q}\in\mathbb{R}^{\left\lvert\tau_{B}\right\rvert}:\left\lVert\bm{q}\right\rVert_{\infty}\leq 1\right\},$
$\partial_{\bm{a}_{j\cdot}}\left(\left\lVert\bm{a}_{j\cdot}\right\rVert_{\infty}\right)\left(\bm{0}\right)=\mathbb{B}_{1}=\left\{\bm{q}\in\mathbb{R}^{\left\lvert\tau_{B}\right\rvert}:\left\lVert\bm{q}\right\rVert_{1}\leq 1\right\}.$

Hence,

$-\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right)\in\lambda^{L}\partial_{\bm{a}_{j\cdot}}\left(\left\lVert\bm{a}_{j\cdot}\right\rVert_{1}\right)\left(\bm{0}\right)+\lambda^{G}\partial_{\bm{a}_{j\cdot}}\left(\left\lVert\bm{a}_{j\cdot}\right\rVert_{\infty}\right)\left(\bm{0}\right),$

if, and only if,

$-\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right)\in\lambda^{L}\mathbb{B}_{\infty}+\lambda^{G}\mathbb{B}_{1},$

if, and only if, there exist $\bm{q}^{L}_{j},\,\bm{q}^{G}_{j}\in\mathbb{R}^{\left\lvert\tau_{B}\right\rvert}$ such that

$\left\lVert\bm{q}_{j}^{L}\right\rVert_{\infty}\leq 1,$
$\left\lVert\bm{q}_{j}^{G}\right\rVert_{1}\leq 1,$
$-\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right)=\lambda^{L}\bm{q}^{L}_{j}+\lambda^{G}\bm{q}^{G}_{j},$

if, and only if, there exist $\tilde{\bm{q}}^{L}_{j},\,\tilde{\bm{q}}^{G}_{j}\in\mathbb{R}^{\left\lvert\tau_{B}\right\rvert}$ such that

$\left\lVert\tilde{\bm{q}}_{j}^{L}\right\rVert_{\infty}\leq\lambda^{L},$
$\left\lVert\tilde{\bm{q}}_{j}^{G}\right\rVert_{1}\leq\lambda^{G},$
$-\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right)=\tilde{\bm{q}}^{L}_{j}+\tilde{\bm{q}}^{G}_{j}.$

Let us consider

$\tilde{\bm{q}}^{L}_{j}=-\left(1-\sigma\right)\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right),$
$\tilde{\bm{q}}^{G}_{j}=-\sigma\,\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right),$

and check that the conditions are satisfied:

$\left\lVert\tilde{\bm{q}}_{j}^{L}\right\rVert_{\infty}=\left(1-\sigma\right)\left\lVert\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right)\right\rVert_{\infty}\leq\left(1-\sigma\right)\max_{\bm{\mu}\in\left[-1,1\right]^{\left\lvert\tau_{B}\right\rvert},\,\bm{C}\in\left\{0,1\right\}^{K\times\left\lvert\tau_{L}\right\rvert}}\,\max_{j=1,\ldots,p}\left\lVert\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu},\bm{C}\right)\right\rVert_{\infty}\leq\lambda^{L},$
$\left\lVert\tilde{\bm{q}}_{j}^{G}\right\rVert_{1}=\sigma\left\lVert\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right)\right\rVert_{1}\leq\sigma\max_{\bm{\mu}\in\left[-1,1\right]^{\left\lvert\tau_{B}\right\rvert},\,\bm{C}\in\left\{0,1\right\}^{K\times\left\lvert\tau_{L}\right\rvert}}\,\max_{j=1,\ldots,p}\left\lVert\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu},\bm{C}\right)\right\rVert_{1}\leq\lambda^{G},$
$\tilde{\bm{q}}^{L}_{j}+\tilde{\bm{q}}^{G}_{j}=-\left(1-\sigma\right)\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right)-\sigma\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right)=-\nabla_{\bm{a}_{j\cdot}}g\left(\bm{0},\bm{\mu}^{*},\bm{C}^{*}\right).$

Therefore, the desired result follows. ∎

A stronger result is proven for the S-ORCT of depth $D=1$ and $K=2$. Since local and global sparsity are equivalent for the S-ORCT of depth $D=1$, without loss of generality, we can assume that $\lambda^{G}=0$. Therefore, the objective function of the S-ORCT of depth $D=1$ can be written as:

$g_{1}\left(\bm{a}_{\cdot 1},\mu_{1},\bm{C}\right)=g\left(\bm{a}_{\cdot 1},\mu_{1},\bm{C}\right)+\lambda^{L}\left\lVert\bm{a}_{\cdot 1}\right\rVert_{1},$

where

$g\left(\bm{a}_{\cdot 1},\mu_{1},\bm{C}\right)=\dfrac{1}{N}\sum_{i=1}^{N}\left[p_{i1}\left(\bm{a}_{\cdot 1},\mu_{1}\right)\sum_{k=1}^{2}W_{y_{i}k}C_{k2}+\left(1-p_{i1}\left(\bm{a}_{\cdot 1},\mu_{1}\right)\right)\sum_{k=1}^{2}W_{y_{i}k}C_{k3}\right]$
$=\dfrac{1}{N}\sum_{k=1}^{2}\sum_{i\in I_{k}}\left[p_{i1}\left(\bm{a}_{\cdot 1},\mu_{1}\right)\sum_{k^{\prime}\neq k}W_{kk^{\prime}}C_{k^{\prime}2}+\left(1-p_{i1}\left(\bm{a}_{\cdot 1},\mu_{1}\right)\right)\sum_{k^{\prime}\neq k}W_{kk^{\prime}}C_{k^{\prime}3}\right]$ (20)

and

$p_{i1}\left(\bm{a}_{\cdot 1},\mu_{1}\right)=F\left(\dfrac{1}{p}\bm{a}_{\cdot 1}^{T}\bm{x}_{i}-\mu_{1}\right),\,\,i=1,\ldots,N.$

A technical lemma is needed to prove the desired result.

Lemma 1.

For any allocation rule $\bm{C}$, the objective function of the S-ORCT of depth $D=1$, $g_{1}$, is monotonic in $\mu_{1}$ when $\bm{a}_{\cdot 1}=\bm{0}$.

Proof.

With $\bm{a}_{\cdot 1}=\left(a_{j1}\right)_{j=1,\ldots,p}$ and $\bm{C}=\left(C_{kt}\right)_{k=1,2,\,\,t=2,3}$ fixed,

$\left.\dfrac{\partial g_{1}}{\partial\mu_{1}}\right|_{\bm{a}_{\cdot 1}=\bm{0}}=\dfrac{1}{N}\sum_{k=1}^{K}\sum_{i\in I_{k}}\left(\sum_{k^{\prime}\neq k}W_{kk^{\prime}}C_{k^{\prime}2}-\sum_{k^{\prime}\neq k}W_{kk^{\prime}}C_{k^{\prime}3}\right)\left.\dfrac{\partial p_{i1}\left(\bm{a}_{\cdot 1},\mu_{1}\right)}{\partial\mu_{1}}\right|_{\bm{a}_{\cdot 1}=\bm{0}},$

where

$\dfrac{\partial p_{i1}\left(\bm{a}_{\cdot 1},\mu_{1}\right)}{\partial\mu_{1}}=\dfrac{\partial F\left(\dfrac{1}{p}\bm{a}_{\cdot 1}^{T}\bm{x}_{i}-\mu_{1}\right)}{\partial\left(\dfrac{1}{p}\bm{a}_{\cdot 1}^{T}\bm{x}_{i}-\mu_{1}\right)}\,\dfrac{\partial\left(\dfrac{1}{p}\bm{a}_{\cdot 1}^{T}\bm{x}_{i}-\mu_{1}\right)}{\partial\mu_{1}}=-f\left(\dfrac{1}{p}\bm{a}_{\cdot 1}^{T}\bm{x}_{i}-\mu_{1}\right),\,\,i=1,\ldots,N,$

and

$\left.\dfrac{\partial p_{i1}\left(\bm{a}_{\cdot 1},\mu_{1}\right)}{\partial\mu_{1}}\right|_{\bm{a}_{\cdot 1}=\bm{0}}=-f\left(-\mu_{1}\right),\,\,i=1,\ldots,N.$

Thus,

$\left.\dfrac{\partial g_{1}\left(\bm{a}_{\cdot 1},\mu_{1},\bm{C}\right)}{\partial\mu_{1}}\right|_{\bm{a}_{\cdot 1}=\bm{0}}=\dfrac{1}{N}f\left(-\mu_{1}\right)\left(\sum_{i\in I_{1}}W_{12}\left(C_{23}-C_{22}\right)+\sum_{i\in I_{2}}W_{21}\left(C_{13}-C_{12}\right)\right)$
$=\dfrac{1}{N}f\left(-\mu_{1}\right)\left(W_{12}\left(C_{23}-C_{22}\right)\lvert I_{1}\rvert+W_{21}\left(1-C_{23}-1+C_{22}\right)\lvert I_{2}\rvert\right)$
$=\dfrac{1}{N}f\left(-\mu_{1}\right)\left(C_{23}-C_{22}\right)\left(W_{12}\lvert I_{1}\rvert-W_{21}\lvert I_{2}\rvert\right).$

Since $f$ is a probability density function, the expression $\left.\dfrac{\partial g_{1}\left(\bm{a}_{\cdot 1},\mu_{1},\bm{C}\right)}{\partial\mu_{1}}\right|_{\bm{a}_{\cdot 1}=\bm{0}}$ will always have the same sign for any value of $\mu_{1}$, and the desired result follows. ∎

Theorem 3.

For

$\lambda^{L}\geq\dfrac{1}{Np}\max_{j=1,\ldots,p}\left|-W_{21}\sum_{i\in I_{2}}x_{ij}+W_{12}\sum_{i\in I_{1}}x_{ij}\right|\max_{\mu_{1}\in\left\{-1,1\right\}}f\left(\mu_{1}\right),$ (21)

$\bm{a}_{\cdot 1}^{*}=\bm{0}$ is a stationary point of the S-ORCT of depth $D=1$.

Proof.

Using the monotonicity in $\mu_{1}$ proven in Lemma 1 and Theorem 2 with $\sigma=0$, we have that for

$\lambda^{L}\geq\max_{\mu_{1}\in\left\{-1,1\right\},\,\bm{C}\in\left\{0,1\right\}^{2\times 2}}\,\max_{j=1,\ldots,p}\left\lvert\nabla_{a_{j1}}g\left(\bm{0},\mu_{1},\bm{C}\right)\right\rvert$
$=\max_{\mu_{1}\in\left\{-1,1\right\},\,\bm{C}\in\left\{0,1\right\}^{2\times 2}}\left\lVert\nabla_{\bm{a}_{\cdot 1}}g\left(\bm{0},\mu_{1},\bm{C}\right)\right\rVert_{\infty},$ (22)

where $g$ is as in (20), $\bm{a}_{\cdot 1}^{*}=\bm{0}$ is a stationary point of the S-ORCT. The remainder of the proof is devoted to rewriting (22) as in (21).

We proceed with the calculation of the gradient.

For $j=1,\ldots,p$:

$\dfrac{\partial g\left(\bm{0},\mu_{1},\bm{C}\right)}{\partial a_{j1}}=\left.\dfrac{\partial g\left(\bm{a}_{\cdot 1},\mu_{1},\bm{C}\right)}{\partial a_{j1}}\right|_{\bm{a}_{\cdot 1}=\bm{0}}=\dfrac{1}{N}\sum_{k=1}^{2}\sum_{i\in I_{k}}\left(\sum_{k^{\prime}\neq k}W_{kk^{\prime}}C_{k^{\prime}2}-\sum_{k^{\prime}\neq k}W_{kk^{\prime}}C_{k^{\prime}3}\right)\left.\dfrac{\partial p_{i1}\left(\bm{a}_{\cdot 1},\mu_{1}\right)}{\partial a_{j1}}\right|_{\bm{a}_{\cdot 1}=\bm{0}},$

where

$\dfrac{\partial p_{i1}\left(\bm{a}_{\cdot 1},\mu_{1}\right)}{\partial a_{j1}}=\dfrac{\partial F\left(\dfrac{1}{p}\bm{a}_{\cdot 1}^{T}\bm{x}_{i}-\mu_{1}\right)}{\partial\left(\dfrac{1}{p}\bm{a}_{\cdot 1}^{T}\bm{x}_{i}-\mu_{1}\right)}\,\dfrac{\partial\left(\dfrac{1}{p}\bm{a}_{\cdot 1}^{T}\bm{x}_{i}-\mu_{1}\right)}{\partial a_{j1}}=\dfrac{x_{ij}}{p}f\left(\dfrac{1}{p}\bm{a}_{\cdot 1}^{T}\bm{x}_{i}-\mu_{1}\right),\,\,i=1,\ldots,N,$

and

$\left.\dfrac{\partial p_{i1}\left(\bm{a}_{\cdot 1},\mu_{1}\right)}{\partial a_{j1}}\right|_{\bm{a}_{\cdot 1}=\bm{0}}=\dfrac{x_{ij}}{p}f\left(-\mu_{1}\right),\,\,i=1,\ldots,N.$

Thus,

$\dfrac{\partial g\left(\bm{0},\mu_{1},\bm{C}\right)}{\partial a_{j1}}=\dfrac{1}{Np}f\left(-\mu_{1}\right)\left(W_{12}\sum_{i\in I_{1}}x_{ij}\left(C_{22}-C_{23}\right)+W_{21}\sum_{i\in I_{2}}x_{ij}\left(C_{12}-C_{13}\right)\right).$

Now, we look for the maximum $\lambda^{L}$ among every possible allocation of the decision variables $\bm{C}$, i.e.:

$\lambda^{L}_{\mu_{1}}=\max_{\bm{C}\in\left\{0,1\right\}^{2\times 2}}\left\lVert\nabla_{\bm{a}_{\cdot 1}}g\left(\bm{0},\mu_{1},\bm{C}\right)\right\rVert_{\infty}=\max_{\overline{\bm{C}}\in\left\{0,1\right\}^{4\times 1}}\,\lVert D\overline{\bm{C}}\rVert_{\infty},$

where

$D=\dfrac{1}{Np}f\left(-\mu_{1}\right)\begin{pmatrix}-W_{21}\sum_{i\in I_{2}}x_{i1}&W_{21}\sum_{i\in I_{2}}x_{i1}&-W_{12}\sum_{i\in I_{1}}x_{i1}&W_{12}\sum_{i\in I_{1}}x_{i1}\\ \vdots&\vdots&\vdots&\vdots\\ -W_{21}\sum_{i\in I_{2}}x_{ip}&W_{21}\sum_{i\in I_{2}}x_{ip}&-W_{12}\sum_{i\in I_{1}}x_{ip}&W_{12}\sum_{i\in I_{1}}x_{ip}\end{pmatrix}$

and $\overline{\bm{C}}=\left(C_{12},C_{13},C_{22},C_{23}\right)^{T}$. Denoting by $d_{j}^{T}$ the $j$-th row of $D$,

$\max_{\overline{\bm{C}}\in\left\{0,1\right\}^{4\times 1}}\lVert D\overline{\bm{C}}\rVert_{\infty}=\max_{\overline{\bm{C}}\in\left\{0,1\right\}^{4\times 1}}\,\max\left\{|d_{1}^{T}\overline{\bm{C}}|,\ldots,|d_{p}^{T}\overline{\bm{C}}|\right\}$
$=\max_{\overline{\bm{C}}\in\left\{0,1\right\}^{4\times 1}}\,\max\left\{d_{1}^{T}\overline{\bm{C}},-d_{1}^{T}\overline{\bm{C}},\ldots,d_{p}^{T}\overline{\bm{C}},-d_{p}^{T}\overline{\bm{C}}\right\}$
$=\max\left\{\max_{\overline{\bm{C}}\in\left\{0,1\right\}^{4\times 1}}d_{1}^{T}\overline{\bm{C}},\max_{\overline{\bm{C}}\in\left\{0,1\right\}^{4\times 1}}-d_{1}^{T}\overline{\bm{C}},\ldots,\max_{\overline{\bm{C}}\in\left\{0,1\right\}^{4\times 1}}d_{p}^{T}\overline{\bm{C}},\max_{\overline{\bm{C}}\in\left\{0,1\right\}^{4\times 1}}-d_{p}^{T}\overline{\bm{C}}\right\}.$

A finite number of transportation problems is to be solved, with the form:

$z=\max_{\overline{\bm{C}}\in\left\{0,1\right\}^{4\times 1}}\left\{\pm d_{j}^{T}\overline{\bm{C}}\right\}$
s.t. $C_{12}+C_{22}=1$
$C_{13}+C_{23}=1$
$C_{12}+C_{13}\geq 1$
$C_{22}+C_{23}\geq 1,$

for which the integrality property holds. Then, the only possible solutions are $\overline{\bm{C}}=\left(1,0,0,1\right)^{T}$ or $\overline{\bm{C}}=\left(0,1,1,0\right)^{T}$. Thus, the optimal objective is obtained as follows:

$z_{\textit{opt}}=\max\left\{\left.\pm d_{j}^{T}\overline{\bm{C}}\right|_{\overline{\bm{C}}=\left(1,0,0,1\right)^{T}},\left.\pm d_{j}^{T}\overline{\bm{C}}\right|_{\overline{\bm{C}}=\left(0,1,1,0\right)^{T}}\right\}$
$=\max\left\{\dfrac{1}{Np}f\left(-\mu_{1}\right)\left(-W_{21}\sum_{i\in I_{2}}x_{ij}+W_{12}\sum_{i\in I_{1}}x_{ij}\right),\dfrac{1}{Np}f\left(-\mu_{1}\right)\left(W_{21}\sum_{i\in I_{2}}x_{ij}-W_{12}\sum_{i\in I_{1}}x_{ij}\right)\right\}$
$=\dfrac{1}{Np}f\left(-\mu_{1}\right)\left|-W_{21}\sum_{i\in I_{2}}x_{ij}+W_{12}\sum_{i\in I_{1}}x_{ij}\right|.$

Let us define

$\lambda^{L}_{\mu_{1}}=\dfrac{1}{Np}f\left(-\mu_{1}\right)\max_{j=1,\ldots,p}\left|-W_{21}\sum_{i\in I_{2}}x_{ij}+W_{12}\sum_{i\in I_{1}}x_{ij}\right|,$

and the result holds when

$\lambda^{L}\geq\max\left\{\lambda^{L}_{\mu_{1}=-1},\lambda^{L}_{\mu_{1}=1}\right\}.$ ∎
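As an illustration of Theorem 3, the snippet below evaluates the threshold on $\lambda^{L}$ for a two-class data set, following the expression derived in the proof; the logistic density matching the $F$ used in Section 4 is assumed, and the function names are ours, not the authors'.

```python
import numpy as np

def logistic_pdf(z, gamma):
    """Density of the logistic CDF F(z) = 1 / (1 + exp(-gamma z))."""
    e = np.exp(-gamma * z)
    return gamma * e / (1.0 + e) ** 2

def lambda_L_threshold(X, y, gamma, W12=0.5, W21=0.5):
    """Bound (21): above this value, a_{.1} = 0 is a stationary point (classes coded 1 and 2)."""
    N, p = X.shape
    s1 = X[y == 1].sum(axis=0)                 # column sums over I_1
    s2 = X[y == 2].sum(axis=0)                 # column sums over I_2
    max_abs = np.max(np.abs(-W21 * s2 + W12 * s1))
    return max_abs * max(logistic_pdf(-1.0, gamma), logistic_pdf(1.0, gamma)) / (N * p)
```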

4 Computational experience

4.1 Introduction

The aim of this section is to illustrate the performance of our sparse optimal randomized classification trees, S-ORCTs. We have run our model for a grid of values of the sparsity regularization parameters $\lambda^{L}$ and $\lambda^{G}$. The message that can be drawn from our computational experience is twofold. First, we show empirically that our S-ORCT can gain in both local and global sparsity, without harming classification accuracy. Second, we benchmark our approach against CART, the classic approach to build decision trees, which considers orthogonal cuts and therefore has the best possible local sparsity. We show that we are able to trade in some of our classification accuracy, still being superior to CART, to be comparable to CART in terms of global sparsity.

The S-ORCT smooth formulation (9)-(16) has been implemented using the Pyomo optimization modeling language [19, 20] in Python 3.5 [31]. As solver, we have used IPOPT 3.11.1 [39], and have followed a multistart approach, where the process is repeated 20 times starting from different random initial solutions. For CART, the implementation in the rpart R package [34] is used. Our experiments have been conducted on a PC with an Intel® Core™ i7-2600 CPU at 3.40 GHz and 16 GB RAM, running a 64-bit operating system.
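A hedged sketch of the multistart scheme described above is given below; build_model stands for any user-supplied function returning a fresh Pyomo model with an objective named obj and bounded variables (both assumptions of this sketch), and IPOPT reads the current variable values as its initial point.

```python
import random
import pyomo.environ as pyo

def multistart_solve(build_model, n_starts=20, seed=0):
    """Solve the S-ORCT NLP from n_starts random initial points and keep the best one."""
    rng = random.Random(seed)
    solver = pyo.SolverFactory('ipopt')
    best_model, best_obj = None, float('inf')
    for _ in range(n_starts):
        m = build_model()
        for v in m.component_data_objects(pyo.Var):
            lb = v.lb if v.lb is not None else -1.0
            ub = v.ub if v.ub is not None else 1.0
            v.value = rng.uniform(lb, ub)       # random starting point passed to IPOPT
        solver.solve(m)
        obj = pyo.value(m.obj)                  # assumes the objective component is named 'obj'
        if obj < best_obj:
            best_model, best_obj = m, obj
    return best_model, best_obj
```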

The remainder of the section is structured as follows. Section 4.2 gives details on the procedure followed to test S-ORCT. In Sections 4.3 and 4.4, respectively, we discuss the results for local and global sparsities separately, while in Section 4.5 we present results when both sparsities are simultaneously taken into account. Finally, Section 4.6 statistically compares S-ORCT versus CART in terms of classification accuracy and global sparsity.

4.2 Setup

An assorted collection of well-known real data sets from the UCI Machine Learning Repository [24] has been chosen for the computational experiments. Table 2 lists their names together with their number of observations, number of predictor variables and number of classes with the corresponding class distribution. In our pursuit of building small and, therefore, less complex trees, the construction of S-ORCTs has been restricted to depth $D=1$ for two-class problems and depth $D=2$ for three- and four-class problems.

Table 2: Information about the data sets considered.
Data set Abbrev. $N$ $p$ $K$ Class distribution
Monks-problems-3 Monks-3 122 11 2 51% - 49%
Monks-problems-1 Monks-1 124 11 2 50% - 50%
Monks-problems-2 Monks-2 169 11 2 62% - 38%
Connectionist-bench-sonar Sonar 208 60 2 55% - 45%
Ionosphere Ionosphere 351 34 2 64% - 36%
Breast-cancer-Wisconsin Wisconsin 569 30 2 63% - 37%
Credit-approval Creditapproval 653 37 2 55% - 45%
Pima-indians-diabetes Pima 768 8 2 65% - 35%
Statlog-project-German-credit Germancredit 1000 48 2 70% - 30%
Banknote-authentification Banknote 1372 4 2 56% - 44%
Ozone-level-detection-one Ozone 1848 72 2 97% - 3%
Spambase Spam 4601 57 2 61% - 39%
Iris Iris 150 4 3 33.3%-33.3%-33.3%
Wine Wine 178 13 3 40%-33%-27%
Seeds Seeds 210 7 3 33.3%-33.3%-33.3%
Balance-scale Balance 625 16 3 46%-46%-8%
Thyroid-disease-ann-thyroid Thyroid 3772 21 3 92.5%-5%-2.5%
Car-evaluation Car 1728 15 4 70%-22%-4%-4%

Each data set has been split into two subsets: the training subset (75%) and the test subset (25%). The corresponding S-ORCT is built on the training subset and, then, accuracy, local and global sparsities are measured. The out-of-sample accuracy over the test subset is denoted by acc. Local sparsity is denoted by $\delta^{L}$ and reads as the average percentage of predictor variables not used per branch node:

$\delta^{L}=\dfrac{1}{\left\lvert\tau_{B}\right\rvert}\sum_{t\in\tau_{B}}\dfrac{\left\lvert\left\{a_{jt}=0,\,\,j=1,\ldots,p\right\}\right\rvert}{p}\times 100.$

Global sparsity, $\delta^{G}$, is measured as the percentage of predictor variables not used at any of the branch nodes, i.e., across the whole tree:

$\delta^{G}=\dfrac{\left\lvert\left\{\bm{a}_{j\cdot}=\bm{0},\,\,j=1,\ldots,p\right\}\right\rvert}{p}\times 100.$
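A small sketch, not part of the paper, of how $\delta^{L}$ and $\delta^{G}$ can be computed from the fitted coefficient matrix $\bm{a}$ (the tolerance for numerical zeros is an assumption on our side):

```python
import numpy as np

def sparsity_measures(a, tol=1e-6):
    """a is the p x |tau_B| matrix of cut coefficients a_{jt}."""
    p = a.shape[0]
    zero = np.abs(a) <= tol                           # coefficients treated as zero
    delta_L = 100.0 * zero.sum(axis=0).mean() / p     # avg % of unused variables per branch node
    delta_G = 100.0 * np.all(zero, axis=1).sum() / p  # % of variables unused in the whole tree
    return delta_L, delta_G
```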

Note that when $D=1$, local and global sparsity measure the same thing, since there is a single cut across the whole tree. The training/testing procedure has been repeated ten times in order to avoid the effect of the initial split of the data. The results shown in the tables represent the average of these ten runs for each of the three performance criteria.

In what follows, we describe the choices made for the parameters in S-ORCT. Equal misclassification weights, $W_{y_{i}k}=0.5$, $k=1,\ldots,K$, $k\neq y_{i}$, have been used for the experiments. We have added the set of constraints (8) with $\rho_{k}=0.1$, $k=1,\ldots,K$. The logistic CDF has been chosen for our experiments:

$F\left(\cdot\right)=\dfrac{1}{1+\exp\left(-\left(\cdot\right)\gamma\right)},$

with a large value of $\gamma$, namely, $\gamma=512$. The larger the value of $\gamma$, the closer the decision rule defined by $F$ is to a deterministic rule. We will illustrate that a small level of randomization is enough for obtaining good results. We have trained S-ORCT, as formulated in (9)-(16), for $17\times 17$ pairs of values for $\left(\lambda^{L},\lambda^{G}\right)$, starting from $\lambda^{L}=0$ followed by the grid $\left\{\dfrac{2^{r}}{p\left\lvert\tau_{B}\right\rvert},\,\,-12\leq r\leq 3,\,\,r\in\mathbb{Z}\right\}$, and, similarly, $\lambda^{G}=0$ followed by the grid $\left\{\dfrac{2^{r}}{p},\,\,-12\leq r\leq 3,\,\,r\in\mathbb{Z}\right\}$. We start solving the optimization problem with $\left(\lambda^{L},\lambda^{G}\right)=\left(0,0\right)$, where the multistart approach uses 20 random initial solutions. We continue solving the optimization problem for $\lambda^{L}=0$ but with larger values of $\lambda^{G}$. Once all values of $\lambda^{G}$ are executed, we start the process all over again with the next value of $\lambda^{L}$ in the grid. For each pair $\left(\lambda^{L},\lambda^{G}\right)$, we feed the corresponding optimization problem with the 20 solutions resulting from the problem solved for the previous pair. For a given initial solution, the computing time taken by the S-ORCT typically ranges from 0.33 seconds (in Monks-1) to 22.27 seconds (in Thyroid).
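For reference, a minimal sketch, not taken from the paper, reproducing the logistic CDF with $\gamma=512$ and the two 17-value grids described above (names are ours):

```python
import numpy as np

GAMMA = 512.0

def F(z):
    """Logistic CDF with steepness GAMMA, as used in the experiments."""
    return 1.0 / (1.0 + np.exp(-GAMMA * z))

def lambda_grids(p, n_branch_nodes):
    """17 values each: 0 followed by 2^r / (p |tau_B|) (resp. 2^r / p), r = -12, ..., 3."""
    lam_L = [0.0] + [2.0 ** r / (p * n_branch_nodes) for r in range(-12, 4)]
    lam_G = [0.0] + [2.0 ** r / p for r in range(-12, 4)]
    return lam_L, lam_G
```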

For CART, the default parameter setting in rpart is used.

4.3 Results for local sparsity

Tables 3 and 4 present the results of the so-called local S-ORCT, i.e., when $\lambda^{G}=0$ and thus only local sparsity is taken into account. Figures 2 and 3 depict these results per data set, by showing simultaneously $\delta^{L}$ (blue solid line) and acc (red dashed line) as a function of the grid of $\lambda^{L}$'s considered. As expected, the larger the $\lambda^{L}$, the larger the $\delta^{L}$. The sparsest tree is obtained in most of the data sets for large values of the parameter $\lambda^{L}$, where the best solution in terms of sparsity is found but the worst possible one in terms of accuracy. In terms of accuracy, the best rates are sometimes achieved when not all the predictor variables are included in the model. For instance, the best performance is reached when sparsity is about 9-25% for Pima, 30% for Monks-1, 32% for Monks-2, 44% for Germancredit, 47% for Car, 52-56% for Thyroid, 54% for Monks-3, 55-60% for Iris, 72-90% for Sonar, 81% for both Wine and Seeds, and 87% for Ionosphere. We highlight the Creditapproval data set, on which one single predictor variable can already guarantee very good accuracy. For Ozone, accuracy remains over 96% for the grid of $\lambda^{L}$'s considered. Accuracy might be slightly damaged but a great gain in sparsity is obtained. This is the case for Banknote, Spam, Balance and Wisconsin, which present a loss of accuracy lower than 1 percentage point (p.p.), 4 p.p., 6 p.p. and 1 p.p., while 25%, 52%, 63% and 85% of local sparsity is reached, respectively.

Table 3: Results for the local S-ORCT of depth $D=1$ as a function of $\lambda^{L}$, where $\delta^{L}$ represents the average percentage of predictor variables not used per branch node in the tree over the ten runs and acc, the average out-of-sample accuracy.
$\lambda^{L}$ Monks-3 Monks-1 Monks-2 Sonar Ionosphere Wisconsin Creditapproval Pima Germancredit Banknote Ozone Spam
$\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc
0 0 89.7 1 77.7 3 74.3 0 75.8 0 84.1 0 96.2 1 84.1 0 75.8 0 73.5 0 99.0 0 96.5 0 89.8
$2^{-12}$ 1 91.0 21 80.6 28 77.1 0 75.8 4 84.2 1 96.4 9 84.0 0 75.8 0 72.8 0 99.0 4 96.6 0 89.8
$2^{-11}$ 0 91.0 21 79.0 28 77.1 0 77.5 3 84.7 4 96.1 9 83.7 0 75.6 0 72.9 0 99.0 10 96.5 0 89.8
$2^{-10}$ 0 90.0 28 80.0 28 77.1 1 77.5 4 84.5 7 96.0 11 83.9 0 75.6 1 73.3 0 99.1 18 96.4 1 89.8
$2^{-9}$ 0 89.3 27 82.9 28 77.1 2 77.5 4 84.4 10 96.1 13 83.7 0 76 1 73.2 3 99.1 29 96.5 2 89.8
$2^{-8}$ 2 90.0 30 81.6 28 77.1 2 77.7 4 85.2 16 96.3 15 84.2 0 75.7 1 73.2 3 99.1 44 96.4 3 89.8
$2^{-7}$ 0 90.7 23 78.4 28 77.1 5 76.9 8 84.6 28 96.3 16 84.2 0 75.9 2 73.4 3 99.0 62 96.4 5 89.6
$2^{-6}$ 7 90.3 34 80.6 28 77.1 9 77.1 10 85.3 39 96.3 20 84.1 1 75.8 3 73.8 3 98.7 78 96.6 9 89.4
$2^{-5}$ 2 90.3 32 78.4 28 77.1 18 75.4 19 85.9 50 96.3 29 84.6 9 76.2 0 74.2 23 98.5 83 96.6 25 88.8
$2^{-4}$ 3 92.0 29 81.3 28 77.1 28 76.3 32 86.3 59 96.5 44 85.1 20 76.1 31 73.9 15 98.3 87 96.6 44 88.5
$2^{-3}$ 15 92.7 30 83.5 28 77.1 40 77.1 49 86.2 67 96.3 62 86.1 25 75.9 37 73.4 10 98.0 90 96.7 52 86.1
$2^{-2}$ 45 94.3 38 81.6 28 77.1 56 76.9 57 86.1 74 95.8 75 85.4 44 75.3 50 73.8 25 98.0 92 96.7 71 83
$2^{-1}$ 54 94.7 39 81.0 32 78.6 72 78.6 74 85.6 85 95.7 95 86.3 61 74.9 69 71.8 25 97.5 95 96.7 82 78.6
$2^{0}$ 54 94.7 62 81.0 39 76.7 85 78.1 87 86.8 87 94.7 97 86.7 81 73.7 93 69.6 25 96.7 96 96.7 97 64.4
$2^{1}$ 54 94.7 71 78.4 95 63.3 90 78.3 91 84.7 91 92.7 97 86.7 94 65.8 98 69.5 50 85.8 97 96.7 100 60.4
$2^{2}$ 77 74.3 84 72.6 100 64.3 98 62.9 94 75.1 95 91.2 97 86.7 100 63.4 100 69.5 50 84.0 100 96.7 100 60.4
$2^{3}$ 93 55.7 91 72.2 100 64.3 100 51.5 100 61.1 99 64.1 97 86.7 100 63.4 100 69.5 100 56.3 100 96.7 100 60.4
Table 4: Results for the local S-ORCT of depth $D=2$ as a function of $\lambda^{L}$, where $\delta^{L}$ represents the average percentage of predictor variables not used per branch node in the tree over the ten runs and acc, the average out-of-sample accuracy.
$\lambda^{L}$ Iris Wine Seeds Balance Thyroid Car
$\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc $\delta^{L}$ acc
0 8 95.9 15 96.6 10 94.4 33 96.6 57 92.8 20 92.7
$2^{-12}$ 42 95.9 51 98.6 33 93.8 58 92.0 61 92.7 36 91.5
$2^{-11}$ 42 95.9 54 98.4 38 93.8 60 91.1 59 92.9 33 91.9
$2^{-10}$ 42 96.2 54 97.3 38 94.0 65 91.0 64 92.6 36 91.5
$2^{-9}$ 42 95.9 56 97.5 43 93.8 67 91.2 62 92.7 36 91.4
$2^{-8}$ 42 95.9 56 96.8 48 93.2 60 91.9 65 92.5 36 91.4
$2^{-7}$ 42 95.9 59 96.8 48 91.3 60 91.7 70 92.1 36 91.3
$2^{-6}$ 42 95.9 59 96.8 52 94.0 65 92.2 72 92.1 38 91.6
$2^{-5}$ 42 95.4 59 96.8 52 94.4 58 92.6 74 92.2 40 91.3
$2^{-4}$ 42 95.9 59 97.3 57 93.8 58 92.4 79 92.2 42 91.1
$2^{-3}$ 42 93.2 62 97.5 67 94.6 63 91.1 83 92.1 40 91.7
$2^{-2}$ 50 89.7 62 97.7 67 94.4 65 90.6 87 92.3 47 90.4
$2^{-1}$ 50 92.7 64 98.2 71 93.6 67 89.2 90 92.0 51 90.2
$2^{0}$ 58 90.0 69 96.8 76 93.6 71 88.1 91 91.9 64 87.6
$2^{1}$ 67 90.5 77 95.2 81 90.2 75 87.2 92 92.0 71 85.4
$2^{2}$ 75 91.1 82 89.5 81 88.5 77 82.6 95 91.8 80 80.8
$2^{3}$ 83 88.6 90 76.4 91 73.6 83 77.3 100 92.2 91 68.2
Figure 2: Graphical representation, for each data set, of the average percentage of predictor variables not used per branch node, $\delta^{L}$, together with the average out-of-sample accuracy obtained, acc, as a function of the values of $\lambda^{L}$ considered in the local S-ORCT construction.
Figure 3: Graphical representation, for each data set, of the average percentage of predictor variables not used per branch node, $\delta^{L}$, together with the average out-of-sample accuracy obtained, acc, as a function of the values of $\lambda^{L}$ considered in the local S-ORCT construction.

4.4 Results for global sparsity

This section is devoted to the global S-ORCT, i.e., when $\lambda^{L}=0$ and thus only global sparsity is taken into account. We focus on depth $D=2$, since for $D=1$ global sparsity is equal to local sparsity. Similarly to Subsection 4.3, Table 5

Table 5: Results for the global S-ORCT of depth $D=2$ as a function of $\lambda^{G}$, where $\delta^{G}$ represents the average percentage of predictor variables not used per tree over the ten runs and acc, the average out-of-sample accuracy.
$\lambda^{G}$ Iris Wine Seeds Balance Thyroid Car
$\delta^{G}$ acc $\delta^{G}$ acc $\delta^{G}$ acc $\delta^{G}$ acc $\delta^{G}$ acc $\delta^{G}$ acc
0 0 95.9 0 96.6 0 94.4 0 96.6 1 92.8 0 92.7
$2^{-12}$ 0 96.2 18 97.7 0 94.0 0 96.7 3 93.0 0 93.4
$2^{-11}$ 0 96.2 15 97.5 0 93.8 0 95.4 5 93.9 0 93.7
$2^{-10}$ 0 96.2 15 97.5 0 94.0 0 95.9 5 93.9 0 94.1
$2^{-9}$ 0 95.9 15 97.3 0 93.8 0 96.7 7 94.0 0 94.0
$2^{-8}$ 0 95.9 15 97.7 0 93.8 0 96.2 12 94.1 0 94.7
$2^{-7}$ 0 95.9 15 97.9 14 94.6 0 95.8 17 94.0 0 95.0
$2^{-6}$ 0 95.4 15 98.2 14 95.4 0 96.1 26 94.0 0 94.9
$2^{-5}$ 2 95.7 15 98.2 14 95.4 0 96.7 40 93.9 0 94.9
$2^{-4}$ 0 95.4 15 98.4 14 94.6 0 96.5 57 93.8 0 94.7
$2^{-3}$ 0 95.7 23 98.4 29 93.6 0 94.7 65 93.5 7 94.6
$2^{-2}$ 25 95.4 23 97.9 29 95.2 0 91.1 73 91.5 7 94.1
$2^{-1}$ 25 95.7 31 96.6 29 94.2 19 87.4 81 90.6 13 92.2
$2^{0}$ 50 96.2 39 95.7 43 92.5 25 87.0 83 90.0 27 86.7
$2^{1}$ 50 96.2 46 94.3 57 90.2 44 80.5 87 92.4 47 79.8
$2^{2}$ 50 96.5 62 93.6 71 85.8 56 71.3 95 91.7 73 68.2
$2^{3}$ 75 96.2 85 71.1 86 72.5 94 48.8 100 92.2 80 68.2

presents the results of the global S-ORCT, while Figure 4

Figure 4: Graphical representation, for each data set, of the average percentage of predictor variables not used per tree, $\delta^{G}$, together with the average out-of-sample accuracy obtained, acc, as a function of the values of $\lambda^{G}$ considered in the global S-ORCT construction.

visualizes these results by showing simultaneously, per data set, $\delta^{G}$ (blue solid line) and acc (red dashed line) as a function of the grid of $\lambda^{G}$'s considered. As for local sparsity, as $\lambda^{G}$ grows, $\delta^{G}$ increases. For Iris and Seeds, a classification accuracy similar to that with all of the predictor variables is obtained while removing 75% and 29% of them, respectively. For Wine, the best rates of accuracy are obtained with 15%-23% of global sparsity. A loss of less than 10 p.p. of accuracy is observed for Balance, but 25% of the predictor variables are not being used. Car remains around an accuracy rate of 80% while using half of the predictor variables. Thyroid, an imbalanced data set, stays above 90% accuracy for the whole grid of $\lambda^{G}$'s considered.

4.5 Results for local and global sparsity

In this section, results enforcing both local and global sparsity are presented by means of heatmaps, as seen in Figure 5. The experiment has been conducted on data sets with $K=3$ and $4$ classes, for which S-ORCTs of depth $D=2$ are built. For each data set, three heatmaps are depicted as a function of the grid of the sparsity regularization parameters, $\lambda^{L}$ and $\lambda^{G}$: the average out-of-sample accuracy, acc, and the local and global sparsities, $\delta^{L}$ and $\delta^{G}$, respectively, obtained over the ten runs performed. The color bar of each heatmap goes from light green to dark blue, with the latter being the maximum accuracy, local sparsity or global sparsity achieved, respectively. As a general behavior, the best rates of accuracy are not achieved only for $\left(\lambda^{L},\lambda^{G}\right)=\left(0,0\right)$, but also for other pairs of the chosen grid, i.e., the data set remains equally well explained while needing less information. As before, regarding local sparsity, for a fixed $\lambda^{G}$, $\delta^{L}$ has a growing trend. A similar behavior is observed for $\delta^{G}$ when $\lambda^{L}$ is fixed. It is also worth mentioning that small changes of $\lambda^{L}$ quickly lead to a gain in $\delta^{L}$. Nevertheless, as expected, the gain in $\delta^{G}$ is slower for the same range in $\lambda^{G}$.

[Figure 5, three heatmaps per data set: (a) Iris, (b) Wine, (c) Seeds, (d) Balance, (e) Thyroid, (f) Car]
Figure 5: Heatmap representation, for each data set, of the average out-of-sample accuracy, acc, the average percentage of predictor variables not used per branch node, δ^L, and the average percentage of predictor variables not used per tree, δ^G, respectively, as a function of the grid of the sparsity parameters, λ^L and λ^G, considered in the construction of the S-ORCT of depth D=2.

4.6 Comparison S-ORCT versus CART

A statistical comparison between the proposed S-ORCT and CART, the classic approach to building decision trees, is provided in this section. As stated in the introduction of the paper, CARTs, like many other approaches that implement orthogonal cuts [4, 14, 18], are leaders in terms of local sparsity. Thus, the comparison of S-ORCT versus CART is performed in terms of accuracy and global sparsity. The S-ORCT results reported in Tables 3 and 5 have been considered for the experiment.

CART has been trained and tested over the same ten runs as S-ORCT. For each pair S-ORCT(λ^G) versus CART, two hypothesis tests for the equality of means of paired samples were carried out, one for accuracy and another for global sparsity, assuming normality, at a 5% significance level. For this task, the t.test function in R has been used. Figure 6 depicts, for each data set, the resulting confidence intervals (blue solid line) at the 95% confidence level for the difference in average accuracy (on the left) and global sparsity (on the right) between S-ORCT(λ^G) and CART. The red dashed horizontal line represents the null hypothesis in each case. Except for Creditapproval and Thyroid, for the smaller values of λ^G our approach is significantly better than, or at least comparable to, CART in terms of accuracy, while CART is significantly better than, or at least comparable to, S-ORCT in terms of global sparsity. For the larger values of λ^G, our approach first becomes comparable to and then dominates CART in terms of global sparsity, at the cost of accuracy.
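As a point of reference, the same kind of paired test can be reproduced outside R; the following Python sketch uses hypothetical per-run accuracies for one data set and one value of λ^G (the accuracy values and the scipy-based code are ours; the paper's analysis relies on R's t.test):

    import numpy as np
    from scipy import stats

    # Hypothetical out-of-sample accuracies (%) over the same ten runs for
    # S-ORCT(lambda^G) and CART; illustrative values only.
    acc_sorct = np.array([96.2, 95.8, 96.5, 95.4, 96.0, 96.7, 95.9, 96.1, 95.6, 96.3])
    acc_cart  = np.array([94.1, 94.8, 95.0, 93.9, 94.5, 95.2, 94.0, 94.7, 94.3, 94.9])

    # Paired two-sided t-test for equality of means at the 5% significance level.
    t_stat, p_value = stats.ttest_rel(acc_sorct, acc_cart)

    # 95% confidence interval for the mean paired difference.
    diff = acc_sorct - acc_cart
    se = diff.std(ddof=1) / np.sqrt(len(diff))
    ci_low, ci_high = stats.t.interval(0.95, df=len(diff) - 1, loc=diff.mean(), scale=se)

    print(f"t = {t_stat:.3f}, p = {p_value:.4f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
    # The null hypothesis of equal means is rejected whenever p < 0.05, i.e.,
    # whenever the confidence interval does not contain zero.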

[Figure 6, two panels (accuracy, global sparsity) per data set: (a) Monks-3, (b) Monks-1, (c) Monks-2, (d) Sonar, (e) Ionosphere, (f) Wisconsin, (g) Creditapproval, (h) Pima, (i) Germancredit, (j) Banknote, (k) Ozone, (l) Spam, (m) Iris, (n) Wine, (o) Seeds, (p) Balance, (q) Thyroid, (r) Car]
Figure 6: Graphical representation, for each data set, of the confidence intervals (blue solid line) at the 95% confidence level for the difference in average accuracy (on the left) and global sparsity (on the right) between S-ORCT(λ^G) and CART. The red dashed horizontal line represents the null hypothesis in each case.

5 Conclusions and future research

Several proposals for building optimal classification trees have recently appeared in the literature to address the shortcomings of the classic greedy approaches. In this paper, we have proposed a novel continuous optimization-based approach, the Sparse Optimal Randomized Classification Tree (S-ORCT), in which a compromise between good classification accuracy and sparsity is pursued. Local and global sparsity in the tree are modeled by including in the objective function norm-like regularizations, namely \ell_1 and \ell_\infty, respectively. Our numerical results illustrate that our approach can improve both types of sparsity without harming classification accuracy. Unlike CART, we are able to easily trade some classification accuracy for a gain in global sparsity.
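In schematic form (a simplified sketch in our own notation, not necessarily the paper's exact formulation), with a_{jt} the coefficient of predictor variable j in the oblique cut at branch node t, \tau_B the set of branch nodes and p the number of predictor variables, the two regularizations enter the objective roughly as

\min_{a,\mu} \; L(a,\mu) \;+\; \lambda^{L} \sum_{t \in \tau_B} \sum_{j=1}^{p} |a_{jt}| \;+\; \lambda^{G} \sum_{j=1}^{p} \max_{t \in \tau_B} |a_{jt}|,

where L(a,\mu) stands for the expected classification error of the randomized tree. The \ell_1 term switches off individual coefficients at each branch node (local sparsity), whereas the grouped \ell_\infty term vanishes for predictor variable j only when all of its coefficients across the tree are zero, thus pushing entire predictor variables out of the tree (global sparsity).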

Some extensions of our approach are of interest. First, this methodology can be extended straightforwardly to a regression tree counterpart, where the response variable is continuous. Second, categorical data are addressed in this paper through the inclusion of dummy predictor variables. For a given categorical predictor variable, and by means of an \ell_\infty-norm regularization, one can link all of its dummies across all the branch nodes of the tree, with the aim of better modeling its contribution to the classifier. Third, it is known that bagging trees tends to enhance accuracy. An appropriate bagging scheme for our approach, in which sparsity is a key point, is a nontrivial design question.

Acknowledgements. This research has been financed in part by research projects EC H2020 MSCA RISE NeEDS (Grant agreement ID: 822214), COSECLA - Fundación BBVA, MTM2015-65915R, Spain, P11-FQM-7603 and FQM-329, Junta de Andalucía, the last three with EU ERF funds. This support is gratefully acknowledged.

References

  • Athey [2018] Athey, S. (2018). The impact of machine learning on economics. In The Economics of Artificial Intelligence: An Agenda. University of Chicago Press.
  • Baesens et al. [2003] Baesens, B., Setiono, R., Mues, C., & Vanthienen, J. (2003). Using neural network rule extraction and decision tables for credit-risk evaluation. Management Science, 49, 312–329.
  • Bennett & Blue [1996] Bennett, K. P., & Blue, J. (1996). Optimal decision trees. Rensselaer Polytechnic Institute Math Report, 214.
  • Bertsimas & Dunn [2017] Bertsimas, D., & Dunn, J. (2017). Optimal classification trees. Machine Learning, 106, 1039–1082.
  • Biau & Scornet [2016] Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25, 197–227.
  • Blanquero et al. [2018] Blanquero, R., Carrizosa, E., Molero-Río, C., & Romero Morales, D. (2018). Optimal Randomized Classification Trees. https://www.researchgate.net/publication/326901224_Optimal_Randomized_Classification_Trees.
  • Breiman [2001] Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
  • Breiman et al. [1984] Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. CRC press.
  • Carrizosa et al. [2011] Carrizosa, E., Martín-Barragán, B., & Romero Morales, D. (2011). Detecting relevant variables and interactions in supervised classification. European Journal of Operational Research, 213, 260–269.
  • Carrizosa & Romero Morales [2013] Carrizosa, E., & Romero Morales, D. (2013). Supervised classification and mathematical optimization. Computers & Operations Research, 40, 150–165.
  • Deng & Runger [2012] Deng, H., & Runger, G. (2012). Feature selection via regularized trees. In The 2012 International Joint Conference on Neural Networks (IJCNN) (pp. 1–8). IEEE.
  • Deng & Runger [2013] Deng, H., & Runger, G. (2013). Gene selection with guided regularized random forest. Pattern Recognition, 46, 3483–3489.
  • Fernández-Delgado et al. [2014] Fernández-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15, 3133–3181.
  • Firat et al. [2018] Firat, M., Crognier, G., Gabor, A. F., Zhang, Y., & Hurkens, C. (2018). Constructing classification trees using column generation. arXiv preprint arXiv:1810.06684.
  • Freitas [2014] Freitas, A. (2014). Comprehensible classification models: a position paper. ACM SIGKDD Explorations Newsletter, 15, 1–10.
  • Genuer et al. [2017] Genuer, R., Poggi, J.-M., Tuleau-Malot, C., & Villa-Vialaneix, N. (2017). Random Forests for Big Data. Big Data Research, 9, 28–46.
  • Goodman & Flaxman [2016] Goodman, B., & Flaxman, S. (2016). European Union regulations on algorithmic decision-making and a “right to explanation”. arXiv preprint arXiv:1606.08813.
  • Günlük et al. [2018] Günlük, O., Kalagnanam, J., Menickelly, M., & Scheinberg, K. (2018). Optimal Decision Trees for Categorical Data via Integer Programming. arXiv preprint arXiv:1612.03225v2.
  • Hart et al. [2017] Hart, W. E., Laird, C. D., Watson, J.-P., Woodruff, D. L., Hackebeil, G. A., Nicholson, B. L., & Siirola, J. D. (2017). Pyomo–Optimization Modeling in Python volume 67. (2nd ed.). Springer Science & Business Media.
  • Hart et al. [2011] Hart, W. E., Watson, J.-P., & Woodruff, D. L. (2011). Pyomo: modeling and solving mathematical programs in Python. Mathematical Programming Computation, 3, 219–260.
  • Hastie et al. [2009] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. (2nd ed.). New York: Springer.
  • Hyafil & Rivest [1976] Hyafil, L., & Rivest, R. L. (1976). Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5, 15–17.
  • Jung et al. [2017] Jung, J., Concannon, C., Shroff, R., Goel, S., & Goldstein, D. G. (2017). Simple rules for complex decisions. arXiv preprint arXiv:1702.04690.
  • Lichman [2013] Lichman, M. (2013). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. University of California, Irvine, School of Information and Computer Sciences.
  • Maldonado et al. [2017] Maldonado, S., Bravo, C., Lopez, J., & Perez, J. (2017). Integrated framework for profit-based feature selection and SVM classification in credit scoring. Decision Support Systems, 104, 113–121.
  • Maldonado & Lopez [2017] Maldonado, S., & Lopez, J. (2017). Synchronized feature selection for support vector machines with twin hyperplanes. Knowledge-Based Systems, 132, 119–128.
  • Martens et al. [2007] Martens, D., Baesens, B., Van Gestel, T., & Vanthienen, J. (2007). Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research, 183, 1466–1476.
  • Martín-Barragán et al. [2014] Martín-Barragán, B., Lillo, R., & Romo, J. (2014). Interpretable support vector machines for functional data. European Journal of Operational Research, 232, 146–155.
  • Norouzi et al. [2015] Norouzi, M., Collins, M., Johnson, M. A., Fleet, D. J., & Kohli, P. (2015). Efficient non-greedy optimization of decision trees. In Advances in Neural Information Processing Systems (pp. 1729–1737).
  • Olafsson et al. [2008] Olafsson, S., Li, X., & Wu, S. (2008). Operations research and data mining. European Journal of Operational Research, 187, 1429–1448.
  • Python Core Team [2015] Python Core Team (2015). Python: A dynamic, open source programming language. Python Software Foundation. URL: https://www.python.org.
  • Ridgeway [2013] Ridgeway, G. (2013). The pitfalls of prediction. National Institute of Justice Journal, 271, 34–40.
  • Silva [2017] Silva, A. P. D. (2017). Optimization approaches to supervised classification. European Journal of Operational Research, 261, 772–788.
  • Therneau et al. [2015] Therneau, T., Atkinson, B., & Ripley, B. (2015). rpart: Recursive Partitioning and Regression Trees. URL: https://CRAN.R-project.org/package=rpart. R package version 4.1-10.
  • Tibshirani et al. [2015] Tibshirani, R., Wainwright, M., & Hastie, T. (2015). Statistical Learning with Sparsity. The Lasso and Generalizations. Chapman and Hall/CRC.
  • Ustun & Rudin [2016] Ustun, B., & Rudin, C. (2016). Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102, 349–391.
  • Verwer & Zhang [2017] Verwer, S., & Zhang, Y. (2017). Learning decision trees with flexible constraints and objectives using integer optimization. In International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems (pp. 94–103). Springer.
  • Verwer et al. [2017] Verwer, S., Zhang, Y., & Ye, Q. C. (2017). Auction optimization using regression trees and linear models as integer programs. Artificial Intelligence, 244, 368–395.
  • Wächter & Biegler [2006] Wächter, A., & Biegler, L. T. (2006). On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106, 25–57.
  • Yang et al. [2017] Yang, L., Liu, S., Tsoka, S., & Papageorgiou, L. G. (2017). A regression tree approach using mathematical programming. Expert Systems with Applications, 78, 347–357.
  • Zou & Yuan [2008] Zou, H., & Yuan, M. (2008). The F-infinity norm support vector machine. Statistica Sinica, 18, 379–398.