On the boosting ability of top-down decision tree learning algorithm for multiclass classification

Anna Choromanska (achoroma@cims.nyu.edu), Courant Institute of Mathematical Sciences, NYU, NY, USA
Krzysztof Choromanski (kchoro@google.com), Google Research New York, NY, USA
Mariusz Bojarski (mbojarski@nvidia.com), NVIDIA Corporation, NJ, USA
Equal contribution.
Abstract

We analyze the performance of the top-down multiclass classification algorithm for decision tree learning called LOMtree, recently proposed in Choromanska and Langford (2014) for efficiently solving classification problems with a very large number of classes. The algorithm optimizes, in an online fashion, an objective function which simultaneously controls the depth of the tree and its statistical accuracy. We prove important properties of this objective and explore its connection to three well-known entropy-based decision tree objectives, i.e. the Shannon entropy, the Gini-entropy, and a modified version of the Gini-entropy, for which online optimization schemes have not yet been developed. We show, via boosting-type guarantees, that maximizing the considered objective also reduces all of these entropy-based objectives. The bounds we obtain critically depend on the strong-concavity properties of the entropy-based criteria, where the mildest dependence on the number of classes (only logarithmic) corresponds to the Shannon entropy.

Keywords: multiclass classification, decision trees, boosting, online learning

1 Introduction

This paper focuses on the multiclass classification problem with a very large number of classes, which is of increasing importance with the recent widespread development of data-acquisition web services and devices. Straightforward extensions of binary approaches to the multiclass setting, such as the one-against-all approach Rifkin and Klautau (2004), often do not work in the presence of strict computational constraints. In this case, hierarchical approaches seem particularly favorable since, due to their structure, they can potentially reduce the computational costs significantly. This paper is motivated by very recent advances in the area of multiclass classification, and considers a hierarchical approach for learning a multiclass decision tree structure in a top-down fashion, where splitting the data in every node of the tree is based on the value of a very particular objective function. This objective function controls the balancedness of the splits (thus the depth of the tree) and the statistical error they induce (thus the statistical error of the tree), and was initially introduced in Choromanska and Langford (2014) along with the algorithm optimizing it in an online fashion, called LOMtree. The algorithm was empirically shown to obtain high-quality trees in logarithmic (in the label complexity) train and test running times, simultaneously outperforming state-of-the-art comparators, yet the objective underlying it is still not well understood. The main contribution of this work is an extensive theoretical analysis of the properties of this objective and of the algorithm optimizing it (we do not discuss the algorithm here; we refer the reader to the original paper). In particular, the analysis explores, via the boosting framework, the relation of this objective to some more standard entropy-based decision tree objectives, i.e. the Shannon entropy (throughout the paper we refer to the Shannon entropy simply as the entropy), the Gini-entropy, and a modified version of the Gini-entropy, for which online optimization schemes in the context of multiclass classification have not yet been developed.

The multiclass classification problem with a very large number of classes has been explored in the literature only relatively recently, and there exist only a few works addressing it. In this work we focus on decision tree-based approaches. The Filter tree Beygelzimer et al. (2009b) considers a simplified instance of the problem where the tree structure over the labels is assumed to be given. It is provably consistent and achieves a regret bound which depends logarithmically on the number of classes. The conditional probability tree Beygelzimer et al. (2009a) instead learns the tree structure and uses a node splitting criterion which trades off obtaining a balanced split in a tree node against violating the split recommendation of the node regressor. The authors also provide regret bounds which scale with the tree depth. Other works, which come with no guarantees, consider splitting the data in every tree node by optimizing efficiency with accuracy constraints allowing fine-grained control of the efficiency-accuracy tradeoff Deng et al. (2011), or by performing clustering Bengio et al. (2010); Madzarov et al. (2009). The splitting criterion (objective function) analyzed in this paper differs from the criteria considered in previous works and comes with a much stronger theoretical justification, given in Section 2.

The main theoretical analysis of this paper is carried out in the boosting framework Schapire and Freund (2012) and relies on the assumption of the existence of weak learners in the tree nodes, where the top-down algorithm we study amplifies this weak advantage to build a tree achieving any desired level of accuracy with respect to entropy-based criteria. We add new theoretical results to the theory of boosting for the multiclass classification problem (multiclass boosting is still largely not understood; we refer the reader to Mukherjee and Schapire (2013) for a comprehensive review), and we show that LOMtree is a boosting algorithm reducing standard entropy-based criteria, where the obtained bounds depend on the strong concavity properties of these criteria. Our work extends two previous works: it significantly adds to the theoretical analysis of Choromanska and Langford (2014), where only the Shannon entropy is considered and in which case we also slightly improve their bound, and it extends beyond the binary-case boosting analysis of Kearns and Mansour (1999). The main theoretical results are presented in Section 3. Numerical experiments (Section 4) and a brief discussion (Section 5) conclude the paper.

2 Objective function and its theoretical properties

In this section we describe the objective function that is of central interest to this paper, and we provide its theoretical properties.

2.1 Objective function

We receive examples x\in\mathcal{X}\subseteq\mathbb{R}^{d}, with labels y\in\{1,2,\ldots,k\}. We assume access to a hypothesis class \mathcal{H} where each h\in\mathcal{H} is a binary classifier, h:\mathcal{X}\mapsto\{-1,1\}, and each node in the tree consists of a classifier from \mathcal{H}. The classifiers are trained in such a way that h_{n}(x)=1 (h_{n} denotes the classifier in node n of the tree; for fixed node n we will refer to h_{n} simply as h) means that the example x is sent to the right subtree of node n, while h_{n}(x)=-1 sends x to the left subtree. When we reach a leaf node, we predict according to the label with the highest frequency amongst the examples reaching that leaf.

Notice that from the perspective of reducing the computational complexity, we want to encourage the number of examples going to the left and to the right to be balanced. Furthermore, for maintaining good statistical accuracy, we want to send examples of class i almost exclusively to either the left or the right subtree. A measure of whether the examples of each class reaching the node are mostly sent to one of its child nodes (pure split) or to both children (impure split) is referred to as the purity of a tree node. These two criteria, purity and balancedness, were discussed in Choromanska and Langford (2014). That work also proposes a (convex) objective expressing both criteria, and thus measuring the quality of a hypothesis h\in\mathcal{H} in creating partitions at a fixed node n in the tree. The objective is given as follows

J(h)=2\sum_{i=1}^{k}\pi_{i}\left|P(h(x)>0)-P(h(x)>0|i)\right|, (1)

where \pi_{i} denotes the proportion of label i amongst the examples reaching this node, and P(h(x)>0) and P(h(x)>0|i) denote the fraction of examples reaching n for which h(x)>0, marginally and conditionally on class i respectively. It was shown that this objective can be effectively maximized over hypotheses h\in\mathcal{H} in an online fashion, giving high-quality partitions (recall that it remains unclear how to optimize online some of the more standard decision tree objectives, such as the entropy-based ones). Despite that, this objective and its properties (including its relation to the more standard entropy-based objectives) remain not fully understood. Its exhaustive analysis is instead provided in this paper.
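As a concrete illustration (not part of the original algorithm), the following sketch estimates J(h) from a finite sample of examples reaching a node; the function name and its arguments are our own choices made purely for this example.

```python
import numpy as np

def objective_J(h_values, labels, k):
    """Empirical estimate of J(h) at a single node (illustrative sketch).

    h_values : array of predictions of the node hypothesis h in {-1, +1}
               on the examples reaching the node.
    labels   : array of class labels in {0, ..., k-1} for the same examples.
    k        : number of classes.
    """
    h_pos = (np.asarray(h_values) > 0).astype(float)
    labels = np.asarray(labels)
    beta = h_pos.mean()                      # empirical P(h(x) > 0)
    J = 0.0
    for i in range(k):
        mask = labels == i
        if not mask.any():
            continue
        pi_i = mask.mean()                   # proportion of class i at the node
        P_i = h_pos[mask].mean()             # empirical P(h(x) > 0 | i)
        J += pi_i * abs(beta - P_i)
    return 2.0 * J

# A maximally pure and balanced split of two equally frequent classes gives J(h) = 1.
labels = np.array([0] * 50 + [1] * 50)
h_vals = np.where(labels == 0, -1, 1)
print(objective_J(h_vals, labels, k=2))      # prints 1.0
```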

2.2 Theoretical properties of the objective function

We first define the concepts of balancedness and purity of a split, which are crucial for stating the theoretical properties of the objective function under consideration in this paper.

Definition 1 (Purity and balancedness, Choromanska and Langford (2014))

The hypothesis h\in\mathcal{H} induces a pure split if

\alpha:=\sum_{i=1}^{k}\pi_{i}\min(P(h(x)>0|i),P(h(x)<0|i))\leq\delta,

where \delta\in[0,0.5), and \alpha is called the purity factor.

The hypothesis h\in\mathcal{H} induces a balanced split if

c\leq\underbrace{P(h(x)>0)}_{=\beta}\leq 1-c,

where c\in(0,0.5], and \beta is called the balancing factor.

A partition is called maximally pure if \alpha=0 (each class is sent exclusively either to the left or to the right). A partition is called maximally balanced if \beta=0.5 (equal numbers of examples are sent to the left and to the right).
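For illustration, a minimal sketch (with our own naming) computing the empirical purity factor \alpha and balancing factor \beta of Definition 1 from a sample of examples at a node:

```python
import numpy as np

def purity_and_balance(h_values, labels, k):
    """Empirical purity factor alpha and balancing factor beta (Definition 1)."""
    h_pos = (np.asarray(h_values) > 0).astype(float)
    labels = np.asarray(labels)
    beta = h_pos.mean()                       # P(h(x) > 0)
    alpha = 0.0
    for i in range(k):
        mask = labels == i
        if not mask.any():
            continue
        pi_i = mask.mean()                    # pi_i
        P_i = h_pos[mask].mean()              # P(h(x) > 0 | i)
        alpha += pi_i * min(P_i, 1.0 - P_i)   # pi_i * min(P(h>0|i), P(h<0|i))
    return alpha, beta

# A maximally pure and balanced split gives alpha = 0 and beta = 0.5.
```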

Next we show the first theoretical property of the objective function J(h). Lemma 2 contains a stronger statement than the one in the original paper Choromanska and Langford (2014) (Lemma 2).

Lemma 2

For any hypothesis h:\mathcal{X}\mapsto\{-1,1\}, the objective J(h) satisfies J(h)\in[0,1]. Furthermore, h induces a maximally pure and balanced partition iff J(h)=1.

Lemma 2 characterizes the behavior of the objective J(h) at the optimum, where J(h)=1. In practice, however, we do not expect to have hypotheses producing maximally pure and balanced splits, thus it is important to show that larger values of the objective correspond simultaneously to more pure and more balanced splits. This statement would fully justify why it is desirable to maximize J(h). We next focus on showing this property. We start by showing that increasing the value of the objective leads to more balanced splits.

Lemma 3

For any hypothesis h, and any distribution over examples (x,y), the balancing factor \beta satisfies \beta\in\left[0.5(1-\sqrt{1-J(h)}),0.5(1+\sqrt{1-J(h)})\right].

Thus the larger (closer to 1) the value of J(h) is, the narrower the interval from Lemma 3 is, leading to more balanced splits (\beta closer to 0.5).

The next lemma, which we borrow from the literature, relates the balancing and purity factors, and it will be used to show that increasing the value of the objective function corresponds not only to more balanced splits, but also to more pure splits.

Lemma 4 (Choromanska and Langford (2014))

For any hypothesis h, and any distribution over examples (x,y), the purity factor \alpha and the balancing factor \beta satisfy \alpha\leq\min\left\{(2-J(h))/(4\beta)-\beta,0.5\right\}.

Recall that Lemma 3 shows that increasing the value of J(h) leads to a more balanced split (\beta closer to 0.5). From this fact and Lemma 4 it follows that increasing the value of J(h) leads to the upper-bound on \alpha being closer to 0, which corresponds to a more pure split. Thus maximizing the objective recovers more balanced and more pure splits.
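As a purely numerical illustration of the two lemmas (the value of J(h) below is chosen arbitrarily): if J(h)=0.84, then Lemma 3 gives \beta\in[0.5(1-\sqrt{0.16}),0.5(1+\sqrt{0.16})]=[0.3,0.7], and if additionally the split is maximally balanced (\beta=0.5), Lemma 4 gives \alpha\leq(2-0.84)/(4\cdot 0.5)-0.5=0.08. As J(h) approaches 1, both the interval for \beta and the upper-bound on \alpha shrink towards the maximally balanced and maximally pure split.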

Figure 1: Left: the blue curve captures the behavior of the upper-bound on the balancing factor as a function of J(h), the red curve captures the behavior of the lower-bound on the balancing factor as a function of J(h), and the green intervals correspond to the intervals where the balancing factor lies for different values of J(h). Right: the red line captures the behavior of the upper-bound on the purity factor as a function of J(h) when the balancing factor is fixed to \frac{1}{2}. Figure should be read in color.

Proof [Lemma 2] The proof that J(h)\in[0,1] and that if h induces a maximally pure and balanced partition then J(h)=1 was given in Choromanska and Langford (2014) (Lemma 2). We therefore prove here the remaining statement of Lemma 2, that if J(h)=1 then h induces a maximally pure and balanced partition.

Without loss of generality assume each \pi_{i}\in(0,1). Recall that \beta=P(h(x)>0), and let P_{i}=P(h(x)>0|i). Also recall that \beta=\sum_{i=1}^{k}\pi_{i}P_{i}. Thus J(h)=2\sum_{i=1}^{k}\pi_{i}\left|\sum_{j=1}^{k}\pi_{j}P_{j}-P_{i}\right|. The objective is certainly maximized at the extremes of the interval [0,1], where each P_{i} is either 0 or 1 (also note that at the maximum, where J(h)=1, it cannot be that all P_{i}'s are 0 or all P_{i}'s are 1). The function J(h) is differentiable at these extremes (J(h) is non-differentiable only when \sum_{j=1}^{k}\pi_{j}P_{j}=P_{i}, but at the considered extremes the left-hand side of this equality is in (0,1) whereas the right-hand side is either 0 or 1). We then write

J(h)=2\sum_{i\in\mathcal{P}}\pi_{i}\left(\sum_{j=1}^{k}\pi_{j}P_{j}-P_{i}\right)+2\sum_{i\in\mathcal{N}}\pi_{i}\left(P_{i}-\sum_{j=1}^{k}\pi_{j}P_{j}\right),

where \mathcal{P}=\{i:\sum_{j=1}^{k}\pi_{j}P_{j}\geq P_{i}\} and \mathcal{N}=\{i:\sum_{j=1}^{k}\pi_{j}P_{j}<P_{i}\}. Also let \mathcal{P}^{+}=\{i:\sum_{j=1}^{k}\pi_{j}P_{j}>P_{i}\} (clearly \sum_{i\in\mathcal{P}^{+}}\pi_{i}\neq 1 and \sum_{i\in\mathcal{N}}\pi_{i}\neq 1 at the extremes of the interval [0,1] where J(h) is maximized). We can then compute the derivatives of J(h) with respect to P_{r}, where r\in\{1,2,\dots,k\}, everywhere where the function is differentiable as follows

\frac{\partial J}{\partial P_{r}}=\begin{cases}2\pi_{r}\left(\sum_{i\in\mathcal{P}^{+}}\pi_{i}-1\right)&\text{if }r\in\mathcal{P}^{+}\\ 2\pi_{r}\left(1-\sum_{i\in\mathcal{N}}\pi_{i}\right)&\text{if }r\in\mathcal{N},\end{cases}

and note that at the extremes of the interval [0,1] where J(h) is maximized \frac{\partial J}{\partial P_{r}}\neq 0, since \sum_{i\in\mathcal{P}^{+}}\pi_{i}\neq 1, \sum_{i\in\mathcal{N}}\pi_{i}\neq 1, and each \pi_{i}\in(0,1). Since J(h) is convex, and since in particular the derivative of J(h) with respect to any P_{r} cannot be 0 at the extremes of the interval [0,1] where J(h) is maximized, it follows that J(h) can only be maximized (J(h)=1) at the extremes of the [0,1] interval. Thus we have already proved that if J(h)=1 then h induces a maximally pure partition. We are left with showing that if J(h)=1 then h also induces a maximally balanced partition. We prove it by contradiction. Assume \beta\neq 0.5. Denote \mathcal{I}_{0}=\{i:P(h(x)>0|i)=0\} and \mathcal{I}_{1}=\{i:P(h(x)>0|i)=1\}. Recall \beta=\sum_{i=1}^{k}\pi_{i}P_{i}=\sum_{i\in\mathcal{I}_{0}}\pi_{i}\cdot 0+\sum_{i\in\mathcal{I}_{1}}\pi_{i}\cdot 1=\sum_{i\in\mathcal{I}_{1}}\pi_{i}. Thus

J(h)=1=2\sum_{i\in\mathcal{I}_{0}}\pi_{i}\left|\beta\right|+2\sum_{i\in\mathcal{I}_{1}}\pi_{i}\left|\beta-1\right|=2\beta\sum_{i\in\mathcal{I}_{0}}\pi_{i}+2(1-\beta)\sum_{i\in\mathcal{I}_{1}}\pi_{i}
=2\beta\Big(1-\sum_{i\in\mathcal{I}_{1}}\pi_{i}\Big)+2(1-\beta)\sum_{i\in\mathcal{I}_{1}}\pi_{i}=2\beta(1-\beta)+2(1-\beta)\beta=-4\beta^{2}+4\beta<1,

where the last inequality comes from the fact that the quadratic form -4\beta^{2}+4\beta is equal to 1 only when \beta=0.5, and otherwise it is smaller than 1. Thus we obtain the contradiction, which ends the proof.

Proof [Lemma 3] As before we use the following notation: \beta=P(h(x)>0), and P_{i}=P(h(x)>0|i). Also let \mathcal{P}=\{i:\beta\geq P_{i}\} and \mathcal{N}=\{i:\beta<P_{i}\}. Recall that \beta=\sum_{i\in\{\mathcal{P}\cup\mathcal{N}\}}\pi_{i}P_{i}, and \sum_{i\in\{\mathcal{P}\cup\mathcal{N}\}}\pi_{i}=1. We split the proof into two cases.

  • Let \sum_{i\in\mathcal{P}}\pi_{i}\leq 1-\beta. Then

    J(h) = 2\sum_{i=1}^{k}\pi_{i}\left|\beta-P_{i}\right| = 2\sum_{i\in\mathcal{P}}\pi_{i}(\beta-P_{i})+2\sum_{i\in\mathcal{N}}\pi_{i}(P_{i}-\beta)
         = 2\sum_{i\in\mathcal{P}}\pi_{i}\beta-2\sum_{i\in\mathcal{P}}\pi_{i}P_{i}+2\Big(\beta-\sum_{i\in\mathcal{P}}\pi_{i}P_{i}\Big)-2\beta\Big(1-\sum_{i\in\mathcal{P}}\pi_{i}\Big)
         = 4\beta\sum_{i\in\mathcal{P}}\pi_{i}-4\sum_{i\in\mathcal{P}}\pi_{i}P_{i}\leq 4\beta\sum_{i\in\mathcal{P}}\pi_{i}\leq 4\beta(1-\beta).

    Thus -4\beta^{2}+4\beta-J(h)\geq 0 which, when solved, yields the lemma.

  • Let \sum_{i\in\mathcal{P}}\pi_{i}\geq 1-\beta (thus \sum_{i\in\mathcal{N}}\pi_{i}\leq\beta). Note that J(h) can be written as

    J(h)=2\sum_{i=1}^{k}\pi_{i}\left|P(h(x)\leq 0)-P(h(x)\leq 0|i)\right|,

    since P(h(x)\leq 0)=1-P(h(x)>0) and P(h(x)\leq 0|i)=1-P(h(x)>0|i). Let \beta^{\prime}=P(h(x)\leq 0)=1-\beta, and P_{i}^{\prime}=P(h(x)\leq 0|i)=1-P_{i}. Note that \mathcal{P}=\{i:\beta\geq P_{i}\}=\{i:\beta^{\prime}\leq P_{i}^{\prime}\} and \mathcal{N}=\{i:\beta<P_{i}\}=\{i:\beta^{\prime}>P_{i}^{\prime}\}. Also note that \beta^{\prime}=\sum_{i\in\{\mathcal{P}\cup\mathcal{N}\}}\pi_{i}P_{i}^{\prime}. Thus

    J(h) = 2\sum_{i=1}^{k}\pi_{i}\left|\beta^{\prime}-P_{i}^{\prime}\right| = 2\sum_{i\in\mathcal{P}}\pi_{i}(P_{i}^{\prime}-\beta^{\prime})+2\sum_{i\in\mathcal{N}}\pi_{i}(\beta^{\prime}-P_{i}^{\prime})
         = 2\Big(\beta^{\prime}-\sum_{i\in\mathcal{N}}\pi_{i}P_{i}^{\prime}\Big)-2\beta^{\prime}\Big(1-\sum_{i\in\mathcal{N}}\pi_{i}\Big)+2\sum_{i\in\mathcal{N}}\pi_{i}\beta^{\prime}-2\sum_{i\in\mathcal{N}}\pi_{i}P_{i}^{\prime}
         = 4\beta^{\prime}\sum_{i\in\mathcal{N}}\pi_{i}-4\sum_{i\in\mathcal{N}}\pi_{i}P_{i}^{\prime}\leq 4\beta^{\prime}\sum_{i\in\mathcal{N}}\pi_{i}=4(1-\beta)\sum_{i\in\mathcal{N}}\pi_{i}\leq 4\beta(1-\beta).

    Thus as before we obtain -4\beta^{2}+4\beta-J(h)\geq 0 which, when solved, yields the lemma.

 

We next consider the quality of the entire tree as we add more nodes. We aim to maximize the objective function in each node we split. In the next section we show that optimizing the objective J(h) leads to the reduction of the more standard entropy-based decision tree objectives. We consider three different objectives in this paper. We focus on the boosting framework, where the analysis depends on the weak learning assumption. The three different entropy-based criteria lead to three different theoretical statements, where we bound the number of splits required to reduce the value of the criterion below a given level. The bounds we obtain, and their dependence on k, critically depend on the strong concavity properties of the considered entropy-based criteria. In our analysis we use elements of the proof techniques from Kearns and Mansour (1999) (the proof of Theorem 10) and Choromanska and Langford (2014) (the proof of Theorem 1). We show all the steps for completeness as we make modifications compared to these works.

3 Main theoretical results

We begin by explaining the notation. Let \mathcal{T} denote the tree under consideration. \pi_{l,i} denotes the probability that a randomly chosen data point x drawn from \mathcal{P}, where \mathcal{P} is a fixed target distribution over \mathcal{X}, has label i given that x reaches node l (note that \sum_{i=1}^{k}\pi_{l,i}=1), \mathcal{L} denotes the set of all tree leaves, t denotes the number of internal tree nodes, and w_{l} is the weight of leaf l, defined as the probability that a randomly chosen x drawn from \mathcal{P} reaches leaf l (note that \sum_{l\in\mathcal{L}}w_{l}=1). We study a tree construction algorithm where we recursively find the leaf node with the highest weight, and choose to split it into two children. Consider the tree constructed over t steps, where in each step we take one leaf node and split it into two (t=1 corresponds to splitting the root, thus in this step the tree consists of one node (the root) and its two children (leaves)). Let n be the heaviest node at time t and let its weight w_{n} be denoted by w for brevity. We measure the quality of the tree at any given time t using three different entropy-based criteria (a short computational sketch of all three is given after the list below):

  • The entropy function G_{t}^{e}: G_{t}^{e}=\sum_{l\in\mathcal{L}}w_{l}\sum_{i=1}^{k}\pi_{l,i}\ln\left(\frac{1}{\pi_{l,i}}\right)

  • The Gini-entropy function G_{t}^{g}: G_{t}^{g}=\sum_{l\in\mathcal{L}}w_{l}\sum_{i=1}^{k}\pi_{l,i}(1-\pi_{l,i})

  • The modified Gini-entropy G_{t}^{m}: G_{t}^{m}=\sum_{l\in\mathcal{L}}w_{l}\sum_{i=1}^{k}\sqrt{\pi_{l,i}(\mathcal{C}-\pi_{l,i})},
    where \mathcal{C} is a constant such that \mathcal{C}>2.
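The sketch below (our own code; the representation of the tree as a list of weight/class-distribution pairs is an assumption made purely for illustration) evaluates the three criteria:

```python
import numpy as np

def tree_criteria(leaves, C=3.0):
    """Evaluate G_t^e, G_t^g and G_t^m for a tree (illustrative sketch).

    leaves : list of (w_l, pi_l) pairs, where w_l is the leaf weight and
             pi_l the vector of class probabilities at that leaf
             (the w_l sum to 1 and each pi_l sums to 1).
    C      : the constant of the modified Gini-entropy (C > 2).
    """
    Ge = Gg = Gm = 0.0
    for w, pi in leaves:
        pi = np.asarray(pi, dtype=float)
        nz = pi[pi > 0]                                # treat 0 * ln(1/0) as 0
        Ge += w * np.sum(nz * np.log(1.0 / nz))        # entropy term
        Gg += w * np.sum(pi * (1.0 - pi))              # Gini-entropy term
        Gm += w * np.sum(np.sqrt(pi * (C - pi)))       # modified Gini-entropy term
    return Ge, Gg, Gm

# A single leaf (the root) with k = 4 uniform classes: G^e = ln 4 and G^g = 0.75.
print(tree_criteria([(1.0, [0.25, 0.25, 0.25, 0.25])]))
```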

These criteria are natural extensions, to the multiclass classification setting, of the criteria used in Kearns and Mansour (1999) in the context of binary classification (note that there is more than one way of extending the entropy-based criteria from Kearns and Mansour (1999) to the multiclass classification setting, e.g. the modified Gini-entropy could as well be defined as G_{t}^{m}=\sum_{l\in\mathcal{L}}w_{l}\sum_{i=1}^{k}\sqrt{\pi_{l,i}(\mathcal{C}-\pi_{l,i})} with \mathcal{C}\in[1,2]; this and other extensions will be investigated in future work). We will next present the main results of this paper, which will be followed by their proofs. We begin by introducing the weak hypothesis assumption.

Our theoretical analysis is carried out in the boosting framework and critically depends on the weak hypothesis assumption, which ensures that the hypothesis class \mathcal{H} is rich enough to guarantee a 'weakly' pure and 'weakly' balanced split at any given node.

Definition 5 (Weak Hypothesis Assumption, Choromanska and Langford (2014))

Let m denote any node of the tree \mathcal{T}, and let \beta_{m}=P(h_{m}(x)>0) and P_{m,i}=P(h_{m}(x)>0|i). Furthermore, let \gamma\in\mathbb{R}^{+} be such that for all m, \gamma\in(0,\min(\beta_{m},1-\beta_{m})]. We say that the weak hypothesis assumption is satisfied when for any distribution \mathcal{P} over \mathcal{X}, at each node m of the tree \mathcal{T} there exists a hypothesis h_{m}\in\mathcal{H} such that J(h_{m})/2=\sum_{i=1}^{k}\pi_{m,i}|P_{m,i}-\beta_{m}|\geq\gamma.

We next state the three main theoretical results of this paper, captured in Theorems 6, 7, and 8. They guarantee that a top-down decision tree algorithm which optimizes J(h), such as the one in Choromanska and Langford (2014), will amplify the weak advantage captured in the weak hypothesis assumption to build a tree achieving any desired level of accuracy with respect to the entropy-based criteria.

Theorem 6

Under the Weak Hypothesis Assumption, for any \alpha\in[0,2\ln k], to obtain G_{t}^{e}\leq\alpha it suffices to make

t\geq\left(\frac{2\ln k}{\alpha}\right)^{\frac{4(1-\gamma)^{2}}{\gamma^{2}\log_{2}e}\ln k}\quad\text{splits.}
Theorem 7

Under the Weak Hypothesis Assumption, for any \alpha\in\left[0,2\left(1-\frac{1}{k}\right)\right], to obtain G_{t}^{g}\leq\alpha it suffices to make

t\geq\left(\frac{2\left(1-\frac{1}{k}\right)}{\alpha}\right)^{\frac{2(1-\gamma)^{2}}{\gamma^{2}\log_{2}e}(k-1)}\quad\text{splits.}
Theorem 8

Under the Weak Hypothesis Assumption, for any \alpha\in[\sqrt{\mathcal{C}-1},2\sqrt{k\mathcal{C}-1}], to obtain G_{t}^{m}\leq\alpha it suffices to make

t\geq\left(\frac{2\sqrt{k\mathcal{C}-1}}{\alpha}\right)^{\frac{2(1-\gamma)^{2}\mathcal{C}^{3}}{\gamma^{2}(\mathcal{C}-2)^{2}\log_{2}e}k\sqrt{k\mathcal{C}-1}}\quad\text{splits.}

Clearly, different criteria lead to bounds with a different dependence on the number of classes k, where the most advantageous dependence (only logarithmic in k) is obtained for the entropy criterion. This is a consequence of the strong concavity properties of the entropy-based criteria considered in this paper. We next discuss in detail the mathematical properties of the entropy-based criteria, which are important to prove the above theorems.
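To get a feel for how the three bounds scale with k, the following sketch (our own helper, directly transcribing the exponents of Theorems 6-8) evaluates the sufficient numbers of splits; the values are returned in log scale because they grow astronomically as \gamma decreases, so they should be read as worst-case sufficient conditions rather than practical estimates.

```python
import numpy as np

def log10_splits_sufficient(k, gamma, alpha, criterion="entropy", C=3.0):
    """log10 of the sufficient number of splits from Theorems 6-8 (sketch)."""
    c = 4.0 * (1.0 - gamma) ** 2 / (gamma ** 2 * np.log2(np.e))
    if criterion == "entropy":                       # Theorem 6
        base, exponent = 2.0 * np.log(k) / alpha, c * np.log(k)
    elif criterion == "gini":                        # Theorem 7
        base, exponent = 2.0 * (1.0 - 1.0 / k) / alpha, 0.5 * c * (k - 1)
    else:                                            # Theorem 8 (modified Gini)
        base = 2.0 * np.sqrt(k * C - 1.0) / alpha
        exponent = 0.5 * c * C ** 3 / (C - 2.0) ** 2 * k * np.sqrt(k * C - 1.0)
    return exponent * np.log10(base)

# The exponent, and hence the bound, grows only logarithmically with k for the
# entropy criterion and polynomially with k for the other two criteria.
for crit in ("entropy", "gini", "modified"):
    print(crit, log10_splits_sufficient(k=16, gamma=0.45, alpha=1.5, criterion=crit))
```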

3.1 Properties of the entropy-based criteria

Each of the presented entropy-based criteria has a number of useful properties that we give next with their proofs.

Bounds on the entropy-based criteria

We first give bounds on the values of the entropy-based functions.

Lemma 9

The entropy function G_{t}^{e} at time t is bounded as

0\leq G_{t}^{e}\leq(t+1)w\ln k.

Proof  The lower-bound follows from the fact that the entropy of each leaf, \sum_{i=1}^{k}\pi_{l,i}\ln\left(\frac{1}{\pi_{l,i}}\right), is non-negative. We next prove the upper-bound. Note that

G_{t}^{e}=\sum_{l\in\mathcal{L}}w_{l}\sum_{i=1}^{k}\pi_{l,i}\ln\left(\frac{1}{\pi_{l,i}}\right)\leq\sum_{l\in\mathcal{L}}w_{l}\ln k\leq w\ln k\sum_{l\in\mathcal{L}}1=(t+1)w\ln k,

where the first inequality comes from the fact that the uniform distribution maximizes the entropy, and the last equality comes from the fact that a tree with t internal nodes has t+1 leaves (also recall that w is the weight of the heaviest node in the tree at time t, which is what we will also use in the next lemmas).

Before proceeding to the Gini-entropy criterion we first introduce the helpful result captured in Lemma 10 and Corollary 11.

Lemma 10 (The inequality between Euclidean and arithmetic mean)

Let x_{1},x_{2},\dots,x_{k} be a set of non-negative numbers. Then the Euclidean mean upper-bounds the arithmetic mean as follows: \sqrt{\frac{\sum_{i=1}^{k}x_{i}^{2}}{k}}\geq\frac{\sum_{i=1}^{k}x_{i}}{k}.

Corollary 11

Let \{x_{1},x_{2},\dots,x_{k}\} be a set of non-negative numbers. Then \sum_{i=1}^{k}x_{i}^{2}\geq\frac{1}{k}\left(\sum_{i=1}^{k}x_{i}\right)^{2}.

Proof  By Lemma 10 we have \sqrt{\frac{\sum_{i=1}^{k}x_{i}^{2}}{k}}\geq\frac{\sum_{i=1}^{k}x_{i}}{k}\Leftrightarrow\sum_{i=1}^{k}x_{i}^{2}\geq\frac{1}{k}\left(\sum_{i=1}^{k}x_{i}\right)^{2}.

We next proceed to the Gini-entropy.

Lemma 12

The Gini-entropy function G_{t}^{g} at time t is bounded as

0\leq G_{t}^{g}\leq(t+1)w\left(1-\frac{1}{k}\right).

Proof  The lower-bound is straightforward since all \pi_{l,i}'s are non-negative. The upper-bound can be shown as follows (the last inequality results from Corollary 11):

G_{t}^{g} = \sum_{l\in\mathcal{L}}w_{l}\sum_{i=1}^{k}\pi_{l,i}(1-\pi_{l,i})\leq w\sum_{l\in\mathcal{L}}\sum_{i=1}^{k}(\pi_{l,i}-\pi_{l,i}^{2})=w\sum_{l\in\mathcal{L}}\left(1-\sum_{i=1}^{k}\pi_{l,i}^{2}\right)
\leq w\sum_{l\in\mathcal{L}}\left(1-\frac{1}{k}\left(\sum_{i=1}^{k}\pi_{l,i}\right)^{2}\right)=w\sum_{l\in\mathcal{L}}\left(1-\frac{1}{k}\right)=(t+1)w\left(1-\frac{1}{k}\right).
Lemma 13

The modified Gini-entropy function G_{t}^{m} at time t is bounded as

\sqrt{\mathcal{C}-1}\leq G_{t}^{m}\leq(t+1)w\sqrt{k\mathcal{C}-1}.

Proof  The lower-bound can be shown as follows. Recall that the function \sum_{i=1}^{k}\sqrt{\pi_{l,i}(\mathcal{C}-\pi_{l,i})} is concave and therefore it is certainly minimized at the extremes of the [0,1] interval, meaning where each \pi_{l,i} is either 0 or 1. Let I_{0}=\{i:\pi_{l,i}=0\} and let I_{1}=\{i:\pi_{l,i}=1\}. Thus \sum_{i=1}^{k}\sqrt{\pi_{l,i}(\mathcal{C}-\pi_{l,i})}=\sum_{i\in I_{1}}\sqrt{\mathcal{C}-1}\geq\sqrt{\mathcal{C}-1}. Combining this result with the fact that \sum_{l\in\mathcal{L}}w_{l}=1 gives the lower-bound. We next prove the upper-bound. Recall that from Lemma 10 it follows that

\frac{\sum_{i=1}^{k}\sqrt{\pi_{l,i}(\mathcal{C}-\pi_{l,i})}}{k}\leq\sqrt{\frac{\sum_{i=1}^{k}\pi_{l,i}(\mathcal{C}-\pi_{l,i})}{k}},\quad\text{thus}
G_{t}^{m} = \sum_{l\in\mathcal{L}}w_{l}\sum_{i=1}^{k}\sqrt{\pi_{l,i}(\mathcal{C}-\pi_{l,i})}\leq\sum_{l\in\mathcal{L}}w_{l}\sqrt{k\sum_{i=1}^{k}\pi_{l,i}(\mathcal{C}-\pi_{l,i})}
=\sum_{l\in\mathcal{L}}w_{l}\sqrt{k\left(\mathcal{C}-\sum_{i=1}^{k}\pi_{l,i}^{2}\right)}=\sum_{l\in\mathcal{L}}w_{l}\sqrt{k\mathcal{C}-k^{2}\sum_{i=1}^{k}\frac{1}{k}\pi_{l,i}^{2}}

By Jensen's inequality \sum_{i=1}^{k}\frac{1}{k}\pi_{l,i}^{2}\geq\left(\sum_{i=1}^{k}\frac{1}{k}\pi_{l,i}\right)^{2}=\frac{1}{k^{2}}. Thus

G_{t}^{m}\leq\sum_{l\in\mathcal{L}}w_{l}\sqrt{k\mathcal{C}-1}\leq(t+1)w\sqrt{k\mathcal{C}-1}.

So far we have been focusing on the time step t, where n was the heaviest node and had weight w. Consider splitting this leaf into two children n_{0} and n_{1}. For ease of notation let w_{0}=w_{n_{0}} and w_{1}=w_{n_{1}}, \beta=P(h_{n}(x)>0) and P_{i}=P(h_{n}(x)>0|i), and furthermore let \pi_{i} and h be shorthands for \pi_{n,i} and h_{n} respectively. Recall that \beta=\sum_{i=1}^{k}\pi_{i}P_{i} and \sum_{i=1}^{k}\pi_{i}=1. Notice that w_{0}=w(1-\beta) and w_{1}=w\beta. Let \bm{\pi} be the k-element vector with i^{th} entry equal to \pi_{i}. Finally, let \tilde{G}^{e}(\bm{\pi})=\sum_{i=1}^{k}\pi_{i}\ln\left(\frac{1}{\pi_{i}}\right), \tilde{G}^{g}(\bm{\pi})=\sum_{i=1}^{k}\pi_{i}(1-\pi_{i}), and \tilde{G}^{m}(\bm{\pi})=\sum_{i=1}^{k}\sqrt{\pi_{i}(\mathcal{C}-\pi_{i})}. Before the split the contribution of node n to resp. G_{t}^{e}, G_{t}^{g}, and G_{t}^{m} was resp. w\tilde{G}^{e}(\bm{\pi}), w\tilde{G}^{g}(\bm{\pi}), and w\tilde{G}^{m}(\bm{\pi}). Note that \pi_{n_{0},i}=\frac{\pi_{i}(1-P_{i})}{1-\beta} and \pi_{n_{1},i}=\frac{\pi_{i}P_{i}}{\beta} are the probabilities that a randomly chosen x drawn from \mathcal{P} has label i given that x reaches nodes n_{0} and n_{1} respectively. For brevity, let \pi_{n_{0},i} and \pi_{n_{1},i} be denoted respectively as \pi_{0,i} and \pi_{1,i}. Let \bm{\pi}_{0} be the k-element vector with i^{th} entry equal to \pi_{0,i} and let \bm{\pi}_{1} be the k-element vector with i^{th} entry equal to \pi_{1,i}. Notice that \bm{\pi}=(1-\beta)\bm{\pi}_{0}+\beta\bm{\pi}_{1}. After the split the contribution of the same, now internal, node n changes to resp. w((1-\beta)\tilde{G}^{e}(\bm{\pi}_{0})+\beta\tilde{G}^{e}(\bm{\pi}_{1})), w((1-\beta)\tilde{G}^{g}(\bm{\pi}_{0})+\beta\tilde{G}^{g}(\bm{\pi}_{1})), and w((1-\beta)\tilde{G}^{m}(\bm{\pi}_{0})+\beta\tilde{G}^{m}(\bm{\pi}_{1})). We denote the difference between the contribution of node n to the value of the entropy-based objectives at times t and t+1 as

\Delta_{t}^{e}:=G_{t}^{e}-G_{t+1}^{e}=w\left[\tilde{G}^{e}(\bm{\pi})-(1-\beta)\tilde{G}^{e}(\bm{\pi}_{0})-\beta\tilde{G}^{e}(\bm{\pi}_{1})\right]. (2)
\Delta_{t}^{g}:=G_{t}^{g}-G_{t+1}^{g}=w\left[\tilde{G}^{g}(\bm{\pi})-(1-\beta)\tilde{G}^{g}(\bm{\pi}_{0})-\beta\tilde{G}^{g}(\bm{\pi}_{1})\right]. (3)
\Delta_{t}^{m}:=G_{t}^{m}-G_{t+1}^{m}=w\left[\tilde{G}^{m}(\bm{\pi})-(1-\beta)\tilde{G}^{m}(\bm{\pi}_{0})-\beta\tilde{G}^{m}(\bm{\pi}_{1})\right]. (4)
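To make Equations 2-4 concrete, here is a small sketch (our own code) that computes the child distributions \bm{\pi}_{0}, \bm{\pi}_{1} and the three contribution decreases for a given node split; it assumes 0<\beta<1 so that both children receive some mass.

```python
import numpy as np

def split_decrease(w, pi, P, C=3.0):
    """Delta_t^e, Delta_t^g, Delta_t^m for one node split (Equations 2-4, sketch).

    w  : weight of the node n being split.
    pi : class distribution pi at the node (sums to 1).
    P  : vector with P_i = P(h(x) > 0 | i) for the node hypothesis h.
    Assumes 0 < beta < 1.
    """
    pi, P = np.asarray(pi, float), np.asarray(P, float)
    beta = float(np.dot(pi, P))              # P(h(x) > 0)
    pi0 = pi * (1.0 - P) / (1.0 - beta)      # class distribution in the left child
    pi1 = pi * P / beta                      # class distribution in the right child

    def Ge(p):                               # entropy of a distribution
        nz = p[p > 0]
        return float(np.sum(nz * np.log(1.0 / nz)))

    Gg = lambda p: float(np.sum(p * (1.0 - p)))          # Gini-entropy
    Gm = lambda p: float(np.sum(np.sqrt(p * (C - p))))   # modified Gini-entropy

    gap = lambda G: w * (G(pi) - (1.0 - beta) * G(pi0) - beta * G(pi1))
    return gap(Ge), gap(Gg), gap(Gm)

# A maximally pure and balanced split of two equal classes reduces the entropy
# contribution of the node by w * ln 2.
print(split_decrease(w=1.0, pi=[0.5, 0.5], P=[0.0, 1.0]))
```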

Strong concavity properties of the entropy-based criteria

The next three lemmas, Lemmas 14, 16, and 18, describe the strong concavity properties of the entropy, the Gini-entropy, and the modified Gini-entropy, which can be used to lower-bound \Delta_{t}^{e}, \Delta_{t}^{g}, and \Delta_{t}^{m} (Equations 2, 3, and 4 correspond to the gap in Jensen's inequality applied to the corresponding strongly concave function).

Lemma 14

The entropy function \tilde{G}^{e} is strongly concave with respect to the l_{1}-norm with modulus 1, and thus the following holds

\tilde{G}^{e}(\bm{\pi})-(1-\beta)\tilde{G}^{e}(\bm{\pi}_{0})-\beta\tilde{G}^{e}(\bm{\pi}_{1})\geq\frac{1}{2}\beta(1-\beta)\|\bm{\pi}_{0}-\bm{\pi}_{1}\|_{1}^{2}.

Proof  Lemma 14 is proven in Shalev-Shwartz (2012) (Example 2.5).  

We introduce one more lemma and then proceed with Gini-entropy.

Lemma 15 (Shalev-Shwartz (2007), Lemma 14)

If the function \Phi(\bm{\pi}) is twice differentiable, then a sufficient condition for the strong concavity of \Phi is that for all \bm{\pi}, \bm{x}, \left<\nabla^{2}\Phi(\bm{\pi})\bm{x},\bm{x}\right>\leq-\sigma\|\bm{x}\|^{2}, where \nabla^{2}\Phi(\bm{\pi}) is the Hessian matrix of \Phi at \bm{\pi}, and \sigma>0 is the strong concavity modulus.

Lemma 16

The Gini-entropy function \tilde{G}^{g} is strongly concave with respect to the l_{2}-norm with modulus 2, and thus the following holds

\tilde{G}^{g}(\bm{\pi})-(1-\beta)\tilde{G}^{g}(\bm{\pi}_{0})-\beta\tilde{G}^{g}(\bm{\pi}_{1})\geq\beta(1-\beta)\|\bm{\pi}_{0}-\bm{\pi}_{1}\|_{2}^{2}.

Proof  Note that \left<\nabla^{2}\tilde{G}^{g}(\bm{\pi})\bm{x},\bm{x}\right>\leq-2\|\bm{x}\|_{2}^{2}, and apply Lemma 15.

Before showing the strong concavity guarantee for the modified Gini-entropy, we first state a result that will be useful in proving the lemma.

Lemma 17 (Zhukovskiy (2003), Remark 2.2.4.)

The sum of strongly concave functions on \mathbb{R}^{n} with modulus \sigma is strongly concave with the same modulus.

Lemma 18

The modified Gini-entropy function \tilde{G}^{m} is strongly concave with respect to the l_{2}-norm with modulus \frac{2(\mathcal{C}-2)^{2}}{\mathcal{C}^{3}}, and thus the following holds

\tilde{G}^{m}(\bm{\pi})-(1-\beta)\tilde{G}^{m}(\bm{\pi}_{0})-\beta\tilde{G}^{m}(\bm{\pi}_{1})\geq\frac{(\mathcal{C}-2)^{2}}{\mathcal{C}^{3}}\beta(1-\beta)\|\bm{\pi}_{0}-\bm{\pi}_{1}\|_{2}^{2}.
Figure 2: Functions G^{e}_{*}(\pi_{1})=\tilde{G}^{e}(\pi_{1})/\ln 2=\left(\pi_{1}\ln\left(\frac{1}{\pi_{1}}\right)+(1-\pi_{1})\ln\left(\frac{1}{1-\pi_{1}}\right)\right)/\ln 2, G^{g}_{*}(\pi_{1})=2\tilde{G}^{g}(\pi_{1})=4\pi_{1}(1-\pi_{1}), and G^{m}_{*}(\pi_{1})=(\tilde{G}^{m}(\pi_{1})-\sqrt{\mathcal{C}-1})/(\sqrt{2\mathcal{C}-1}-\sqrt{\mathcal{C}-1})=(\sqrt{\pi_{1}(\mathcal{C}-\pi_{1})}+\sqrt{(1-\pi_{1})(\mathcal{C}-1+\pi_{1})}-\sqrt{\mathcal{C}-1})/(\sqrt{2\mathcal{C}-1}-\sqrt{\mathcal{C}-1}) (functions \tilde{G}^{e}(\pi_{1}), \tilde{G}^{g}(\pi_{1}), and \tilde{G}^{m}(\pi_{1}) were re-scaled to have values in [0,1]) as functions of \pi_{1}. Figure is recommended to be read in color.

Proof  Consider the functions g(\pi_{i})=\sqrt{f(\pi_{i})}, where f(\pi_{i})=\pi_{i}(\mathcal{C}-\pi_{i}), \mathcal{C}\geq 2, and \pi_{i}\in[0,1]. Also let h(x)=\sqrt{x}, where x\in[0,\frac{\mathcal{C}^{2}}{4}]. It is easy to see, using Lemma 15, that the function f is strongly concave with respect to the l_{2}-norm with modulus 2, thus

f(\theta\pi_{i}^{\prime}+(1-\theta)\pi_{i}^{\prime\prime})\geq\theta f(\pi_{i}^{\prime})+(1-\theta)f(\pi_{i}^{\prime\prime})+\theta(1-\theta)\|\pi_{i}^{\prime}-\pi_{i}^{\prime\prime}\|_{2}^{2}, (5)

where \pi_{i}^{\prime},\pi_{i}^{\prime\prime}\in[0,1] and \theta\in[0,1]. Also note that h is strongly concave with modulus \frac{2}{\mathcal{C}^{3}} on its domain [0,\frac{\mathcal{C}^{2}}{4}] (the second derivative of h is h^{\prime\prime}(x)=-\frac{1}{4\sqrt{x^{3}}}\leq-\frac{2}{\mathcal{C}^{3}}). The strong concavity of h implies that

\sqrt{\theta x_{1}+(1-\theta)x_{2}}\geq\theta\sqrt{x_{1}}+(1-\theta)\sqrt{x_{2}}+\frac{1}{\mathcal{C}^{3}}\theta(1-\theta)\|x_{1}-x_{2}\|_{2}^{2},

where x_{1},x_{2}\in[0,\frac{\mathcal{C}^{2}}{4}]. Let x_{1}=f(\pi_{i}^{\prime}) and x_{2}=f(\pi_{i}^{\prime\prime}). Then we obtain

\sqrt{\theta f(\pi_{i}^{\prime})+(1-\theta)f(\pi_{i}^{\prime\prime})}\geq\theta\sqrt{f(\pi_{i}^{\prime})}+(1-\theta)\sqrt{f(\pi_{i}^{\prime\prime})}+\frac{1}{\mathcal{C}^{3}}\theta(1-\theta)\|f(\pi_{i}^{\prime})-f(\pi_{i}^{\prime\prime})\|_{2}^{2}. (6)

Note that

\sqrt{f(\theta\pi_{i}^{\prime}+(1-\theta)\pi_{i}^{\prime\prime})} \geq \sqrt{f(\theta\pi_{i}^{\prime}+(1-\theta)\pi_{i}^{\prime\prime})-\theta(1-\theta)\|\pi_{i}^{\prime}-\pi_{i}^{\prime\prime}\|_{2}^{2}}
\geq \sqrt{\theta f(\pi_{i}^{\prime})+(1-\theta)f(\pi_{i}^{\prime\prime})}
\geq \theta\sqrt{f(\pi_{i}^{\prime})}+(1-\theta)\sqrt{f(\pi_{i}^{\prime\prime})}+\frac{1}{\mathcal{C}^{3}}\theta(1-\theta)\|f(\pi_{i}^{\prime})-f(\pi_{i}^{\prime\prime})\|_{2}^{2},

where the second inequality results from Equation 5 and the last (third) inequality results from Equation 6. Finally note that the first derivative of f is f^{\prime}(\pi_{i})=\mathcal{C}-2\pi_{i}\in[\mathcal{C}-2,\mathcal{C}], thus

\frac{|f(\pi_{i}^{\prime})-f(\pi_{i}^{\prime\prime})|}{|\pi_{i}^{\prime}-\pi_{i}^{\prime\prime}|}\geq\mathcal{C}-2\Leftrightarrow\|f(\pi_{i}^{\prime})-f(\pi_{i}^{\prime\prime})\|^{2}\geq(\mathcal{C}-2)^{2}\|\pi_{i}^{\prime}-\pi_{i}^{\prime\prime}\|^{2},

and combining this result with the previous statement yields

\sqrt{f(\theta\pi_{i}^{\prime}+(1-\theta)\pi_{i}^{\prime\prime})}\geq\theta\sqrt{f(\pi_{i}^{\prime})}+(1-\theta)\sqrt{f(\pi_{i}^{\prime\prime})}+\frac{(\mathcal{C}-2)^{2}}{\mathcal{C}^{3}}\theta(1-\theta)\|\pi_{i}^{\prime}-\pi_{i}^{\prime\prime}\|^{2},

thus g(\pi_{i}) is strongly concave with modulus \frac{2(\mathcal{C}-2)^{2}}{\mathcal{C}^{3}}. By Lemma 17, \tilde{G}^{m}(\bm{\pi}) is also strongly concave with the same modulus.
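A simple randomized sanity check (not a proof, and entirely our own code) of the Jensen-gap lower bounds stated in Lemmas 14, 16 and 18:

```python
import numpy as np

def check_concavity_gaps(k=5, C=3.0, trials=2000, seed=0):
    """Numerically verify the lower bounds of Lemmas 14, 16 and 18 on random inputs."""
    rng = np.random.default_rng(seed)

    def Ge(p):
        nz = p[p > 0]
        return np.sum(nz * np.log(1.0 / nz))
    Gg = lambda p: np.sum(p * (1.0 - p))
    Gm = lambda p: np.sum(np.sqrt(p * (C - p)))

    for _ in range(trials):
        p0, p1 = rng.dirichlet(np.ones(k)), rng.dirichlet(np.ones(k))
        beta = rng.uniform(0.05, 0.95)
        p = (1.0 - beta) * p0 + beta * p1           # pi = (1-beta) pi_0 + beta pi_1
        l1sq = np.sum(np.abs(p0 - p1)) ** 2         # ||pi_0 - pi_1||_1^2
        l2sq = np.sum((p0 - p1) ** 2)               # ||pi_0 - pi_1||_2^2
        gap = lambda G: G(p) - (1.0 - beta) * G(p0) - beta * G(p1)
        assert gap(Ge) >= 0.5 * beta * (1 - beta) * l1sq - 1e-9               # Lemma 14
        assert gap(Gg) >= beta * (1 - beta) * l2sq - 1e-9                     # Lemma 16
        assert gap(Gm) >= (C - 2) ** 2 / C ** 3 * beta * (1 - beta) * l2sq - 1e-9  # Lemma 18

check_concavity_gaps()
```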

3.2 Proof of the main theorems

We finally proceed to proving all three theorems. We first introduce some mathematical tools that will be used in the following proofs. The next two lemmas are fundamental. The first one relates the l_{1}-norm and the l_{2}-norm, and the second one is a simple property of the exponential function.

Lemma 19

Let x\in\mathbb{R}^{k}. Then \|x\|_{1}\leq\sqrt{k}\|x\|_{2}.

Lemma 20

For x\geq 1 the following holds: \left(1-\frac{1}{x}\right)^{x}\leq\frac{1}{e}.

We next proceed to proving Theorems 6, 7, and 8.

Proof  For the entropy it follows from Equation 2 and Lemma 14 that

\Delta_{t}^{e}\geq\frac{1}{2}w\beta(1-\beta)\|\bm{\pi}_{0}-\bm{\pi}_{1}\|_{1}^{2}=\frac{1}{2}\frac{w}{\beta(1-\beta)}\left(\sum_{i=1}^{k}\left|\pi_{i}(P_{i}-\beta)\right|\right)^{2}=\frac{wJ(h)^{2}}{8\beta(1-\beta)} (7)
\geq\frac{J(h)^{2}G_{t}^{e}}{8\beta(1-\beta)(t+1)\ln k}\geq\frac{\gamma^{2}G_{t}^{e}}{2(1-\gamma)^{2}(t+1)\ln k},

where the last inequality comes from the fact that 1-\gamma\geq\beta\geq\gamma (see the definition of \gamma in the weak hypothesis assumption) and J(h)\geq 2\gamma (see the weak hypothesis assumption). For the Gini-entropy criterion notice that from Equation 3, Lemma 16 and Lemma 19 it follows that

\Delta_{t}^{g}\geq w\beta(1-\beta)\|\bm{\pi}_{0}-\bm{\pi}_{1}\|_{2}^{2}\geq\frac{1}{k}w\beta(1-\beta)\|\bm{\pi}_{0}-\bm{\pi}_{1}\|_{1}^{2}\geq\frac{\gamma^{2}G_{t}^{g}}{(1-\gamma)^{2}(t+1)(k-1)}, (8)

where the last inequality is obtained similarly to the last inequality in Equation 7. And finally for the modified Gini-entropy it follows from Equation 4, Lemma 18 and Lemma 19 that

\Delta_{t}^{m}\geq w\frac{(\mathcal{C}-2)^{2}}{\mathcal{C}^{3}}\beta(1-\beta)\|\bm{\pi}_{0}-\bm{\pi}_{1}\|_{2}^{2}\geq\frac{1}{k}w\frac{(\mathcal{C}-2)^{2}}{\mathcal{C}^{3}}\beta(1-\beta)\|\bm{\pi}_{0}-\bm{\pi}_{1}\|_{1}^{2} (9)
\geq\frac{\gamma^{2}G_{t}^{m}}{\frac{\mathcal{C}^{3}}{(\mathcal{C}-2)^{2}}(1-\gamma)^{2}(t+1)k\sqrt{k\mathcal{C}-1}},

where the last inequality is obtained as before.

Clearly, the larger the objective J(h) is at time t, the larger the entropy reduction ends up being, which confirms the plausibility of the approach in Choromanska and Langford (2014), where the goal is to maximize J(h). Let

\eta^{e}=\frac{2\sqrt{2}\gamma}{(1-\gamma)\sqrt{\ln k}},\quad\eta^{g}=\frac{4\gamma}{(1-\gamma)\sqrt{k-1}},\quad\eta^{m}=\frac{4\gamma}{(1-\gamma)\sqrt{\frac{\mathcal{C}^{3}}{(\mathcal{C}-2)^{2}}k\sqrt{k\mathcal{C}-1}}}. (10)

For simplicity of notation assume \Delta_{t} corresponds to either \Delta_{t}^{e}, \Delta_{t}^{g}, or \Delta_{t}^{m}, and G_{t} stands for G_{t}^{e}, G_{t}^{g}, or G_{t}^{m}. Thus \Delta_{t}>\frac{\eta^{2}G_{t}}{16(t+1)}, and we obtain the recurrence inequality

G_{t+1}\leq G_{t}-\Delta_{t}<G_{t}-\frac{\eta^{2}G_{t}}{16(t+1)}=G_{t}\left(1-\frac{\eta^{2}}{16(t+1)}\right).

One can now compute the minimum number of splits required to reduce G_{t} below \alpha, where \alpha\in[0,1]. Assume \log_{2}(t+1)\in\mathbb{Z}^{+}.

G_{t+1}\leq G_{t}\left(1-\frac{\eta^{2}}{16(t+1)}\right)\leq G_{1}\left(1-\frac{\eta^{2}}{16\cdot 2}\right)\left(1-\frac{\eta^{2}}{16\cdot 3}\right)\dots\left(1-\frac{\eta^{2}}{16\cdot(t+1)}\right)
=G_{1}\left(1-\frac{\eta^{2}}{16\cdot 2}\right)\prod_{t^{\prime}=3}^{4}\left(1-\frac{\eta^{2}}{16\cdot t^{\prime}}\right)\dots\prod_{t^{\prime}=(2^{r}/2)+1}^{2^{r}}\left(1-\frac{\eta^{2}}{16\cdot t^{\prime}}\right)\dots\prod_{t^{\prime}=(2^{\log_{2}(t+1)}/2)+1}^{2^{\log_{2}(t+1)}}\left(1-\frac{\eta^{2}}{16\cdot t^{\prime}}\right),

where r\in\{2,3,\dots,\log_{2}(t+1)\}. Recall that

\prod_{t^{\prime}=(2^{r}/2)+1}^{2^{r}}\left(1-\frac{\eta^{2}}{16\cdot t^{\prime}}\right)\leq\prod_{t^{\prime}=(2^{r}/2)+1}^{2^{r}}\left(1-\frac{\eta^{2}}{16\cdot 2^{r}}\right)=\left(1-\frac{\eta^{2}}{16\cdot 2^{r}}\right)^{2^{r}/2}\leq e^{-\eta^{2}/32},

where the last step follows from Lemma 20. Also note that by the same lemma \left(1-\frac{\eta^{2}}{16\cdot 2}\right)\leq e^{-\eta^{2}/32}. Thus

G_{t+1}\leq G_{1}e^{-\eta^{2}\log_{2}(t+1)/32}. (11)

Therefore to reduce G_{t+1} below \alpha (where the \alpha's are defined in Theorems 6, 7, and 8) it suffices to make t+1 splits such that \log_{2}(t+1)\geq\frac{32}{\eta^{2}}\ln\left(\frac{G_{1}}{\alpha}\right). Since \log_{2}(t+1)=\ln(t+1)\cdot\log_{2}(e), where e=\exp(1), we thus obtain

\ln(t+1)\geq\frac{32}{\eta^{2}\log_{2}(e)}\ln\left(\frac{G_{1}}{\alpha}\right)\Leftrightarrow t+1\geq\left(\frac{G_{1}}{\alpha}\right)^{\frac{32}{\eta^{2}\log_{2}(e)}}. (12)

Recall that by resp. Lemmas 9, 12, and 13 we have resp. G_{1}^{e}\leq 2\ln k, G_{1}^{g}\leq 2(1-\frac{1}{k}), and G_{1}^{m}\leq 2\sqrt{k\mathcal{C}-1}. We consider the worst case setting (giving the largest possible number of splits), thus we assume G_{1}^{e}=2\ln k, G_{1}^{g}=2(1-\frac{1}{k}), and G_{1}^{m}=2\sqrt{k\mathcal{C}-1}. Combining this with Equation 10 and Equation 12 yields the statements of the main theorems.
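The recurrence G_{t+1}\leq G_{t}\left(1-\frac{\eta^{2}}{16(t+1)}\right) above can also be iterated directly. The sketch below (our own code; the parameter values are chosen only so that the direct iteration terminates quickly) does so for the entropy criterion and prints, next to it, the closed-form sufficient condition of Equation 12.

```python
import numpy as np

def splits_by_recurrence(G1, eta, alpha, max_splits=10**7):
    """Iterate G_{t+1} <= G_t * (1 - eta^2 / (16 (t+1))) until the bound on G_t
    drops below alpha; returns the number of splits used (illustrative sketch)."""
    G, t = G1, 1
    while G > alpha and t < max_splits:
        G *= 1.0 - eta ** 2 / (16.0 * (t + 1))
        t += 1
    return t

# Example with the entropy criterion: k classes, weak advantage gamma.
k, gamma, alpha = 4, 0.45, 1.0
eta = 2.0 * np.sqrt(2.0) * gamma / ((1.0 - gamma) * np.sqrt(np.log(k)))  # eta^e
G1 = 2.0 * np.log(k)                                                     # worst-case G_1^e
print(splits_by_recurrence(G1, eta, alpha))                  # iterated recurrence
print((G1 / alpha) ** (32.0 / (eta ** 2 * np.log2(np.e))))   # sufficient t+1, Equation 12
```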

4 Numerical experiments

We run the LOMtree algorithm, which is implemented in the open source learning system Vowpal Wabbit Langford et al. (2007), on four benchmark multiclass datasets: Mnist (10 classes, downloaded from http://yann.lecun.com/exdb/mnist/), Isolet (26 classes, downloaded from http://www.cs.huji.ac.il/~shais/datasets/ClassificationDatasets.html), Sector (105 classes, downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html), and Aloi (1000 classes, downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html). The datasets were divided into training (90%) and testing (10%), where 10% of the training dataset was used as a validation set. The regressors in the tree nodes are linear and were trained by SGD Bottou (1998) with 20 epochs and the learning rate chosen from the set \{0.25,0.5,0.75,1,2,4,8\}. We investigated different swap resistances (see Choromanska and Langford (2014) for details) chosen from the set \{4,8,16,32,64,128,256\}. We selected the learning rate and the swap resistance as the ones minimizing the validation error, and the number of splits in all experiments was set to 10k.

Figure 3: Functions G_{t}^{e}, G_{t}^{g}, and G_{t}^{m}, and the test error, all normalized to the interval [0,1], versus the number of splits. Figure is recommended to be read in color.

Figure 3 shows the entropy, the Gini-entropy, the modified Gini-entropy, and the error, all normalized to the interval [0,1], as a function of the number of splits. The behavior of the entropy and the Gini-entropy matches the theoretical findings. The modified Gini-entropy, however, drops the fastest with the number of splits, which suggests that in this case tighter bounds could possibly be proved (for the binary case a tighter analysis was shown in Kearns and Mansour (1999), but it is highly non-trivial to generalize this analysis to the multiclass classification setting). Furthermore, it can be observed that the behavior of the error closely mimics the behavior of the Gini-entropy.

5 Conclusions

This paper focuses on the properties of the recently proposed LOMtree algorithm. We provide an exhaustive theoretical analysis of the objective function underlying the algorithm. We show a unified framework for analyzing the boosting ability of the algorithm by exploring the connection of its objective to entropy-based criteria, such as the entropy, the Gini-entropy, and a modified version of the Gini-entropy. We show that the strong concavity properties of these criteria have a critical impact on the character of the obtained bounds. The experiments suggest that a tighter bound may be possible, in particular for the modified version of the Gini-entropy.


References

  • Bengio et al. (2010) S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In NIPS, 2010.
  • Beygelzimer et al. (2009a) A. Beygelzimer, J. Langford, Y. Lifshits, G. B. Sorkin, and A. L. Strehl. Conditional probability tree estimation analysis and algorithms. In UAI, 2009a.
  • Beygelzimer et al. (2009b) A. Beygelzimer, J. Langford, and P. D. Ravikumar. Error-correcting tournaments. In ALT, 2009b.
  • Bottou (1998) L. Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.
  • Choromanska and Langford (2014) Anna Choromanska and John Langford. Logarithmic time online multiclass prediction. CoRR, abs/1406.1822, 2014.
  • Deng et al. (2011) J. Deng, S. Satheesh, A. C. Berg, and L. Fei-Fei. Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS, 2011.
  • Kearns and Mansour (1999) M. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. Journal of Computer and Systems Sciences, 58(1):109–128 (also In STOC, 1996), 1999.
  • Langford et al. (2007) J. Langford, L. Li, and A. Strehl. http://hunch.net/~vw, 2007.
  • Madzarov et al. (2009) G. Madzarov, D. Gjorgjevikj, and I. Chorbev. A multi-class svm classifier utilizing binary decision tree. Informatica, 33(2):225–233, 2009.
  • Mukherjee and Schapire (2013) I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. Journal of Machine Learning Research, 14:437–497, 2013.
  • Rifkin and Klautau (2004) R. Rifkin and A. Klautau. In defense of one-vs-all classification. J. Mach. Learn. Res., 5:101–141, 2004. ISSN 1532-4435.
  • Schapire and Freund (2012) R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
  • Shalev-Shwartz (2007) S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
  • Shalev-Shwartz (2012) S. Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Mach. Learn., 4(2):107–194, 2012.
  • Zhukovskiy (2003) V.I. Zhukovskiy. Lyapunov Functions in Differential Games. Stability and Control: Theory, Methods and Applications. Taylor & Francis, 2003.