
1 Beihang University, Beijing 100191, China. 2 NICTA (NICTA is funded by the Australian Government's Department of Communications, Information Technology, and the Arts and the Australian Research Council through the Backing Australia's Ability initiative and the ICT Research Center of Excellence programs.), Canberra Research Laboratory, Canberra, ACT 2601, Australia. 3 Australian National University, Canberra, ACT 0200, Australia.

Asymmetric Totally-corrective Boosting for Real-time Object Detection

Peng Wang1 (work was done while P. W. was visiting NICTA Canberra Research Laboratory and Australian National University), Chunhua Shen2,3, Nick Barnes2,3, Hong Zheng1, Zhang Ren1
Abstract

Real-time object detection is one of the core problems in computer vision. The cascade boosting framework proposed by Viola and Jones has become the standard for this problem. In this framework, the learning goal for each node is asymmetric, which is required to achieve a high detection rate and a moderate false positive rate. We develop new boosting algorithms to address this asymmetric learning problem. We show that our methods explicitly optimize asymmetric loss objectives in a totally corrective fashion. The methods are totally corrective in the sense that the coefficients of all selected weak classifiers are updated at each iteration. In contrast, conventional boosting like AdaBoost is stage-wise in that only the current weak classifier's coefficient is updated. At the heart of the totally corrective boosting is the column generation technique. Experiments on face detection show that our methods outperform the state-of-the-art asymmetric boosting methods.

1 Introduction

Due to its important applications in video surveillance, interactive human-machine interfaces, etc., real-time object detection has attracted extensive research recently [1, 2, 3, 4, 5, 6]. Although it was introduced a decade ago, the boosted cascade classifier framework of Viola and Jones [2] is still considered the most promising approach for object detection, and it is the basis that many subsequent papers have extended.

One difficulty in object detection is that the problem is highly asymmetric. A common method to detect objects in an image is to exhaustively search all sub-windows at all possible scales and positions in the image, and use a trained model to detect target objects. Typically, there are only a few targets among millions of searched sub-windows. The cascade classifier framework partially solves the asymmetry problem by splitting the detection process into several nodes. Only those sub-windows passing through all nodes are classified as true targets. At each node, we want to train a classifier with a very high detection rate (e.g., 99.5%) and a moderate false positive rate (e.g., around 50%). The learning goal of each node should be asymmetric in order to achieve optimal detection performance. A drawback of standard boosting such as AdaBoost in the context of the cascade framework is that it is designed to minimize the overall error rate: the losses for misclassifying a positive example and a negative example are equal, so it cannot build an optimal classifier for the asymmetric learning goal.

Many subsequent works attempt to improve the performance of object detectors by introducing asymmetric loss functions into boosting algorithms. Viola and Jones proposed asymmetric AdaBoost [3], which applies an asymmetric multiplier to one of the classes. However, this asymmetry is absorbed immediately by the first weak classifier because AdaBoost's optimization strategy is greedy. In practice, they manually apply the n-th root of the multiplier at each iteration to keep the asymmetric effect throughout the entire training process; here n is the number of weak classifiers. This heuristic cannot guarantee an optimal solution, and the number of weak classifiers needs to be specified before training. AdaCost, presented by Fan et al. [7], adds a cost adjustment function to the weight-updating strategy of AdaBoost. They also pointed out that the weight-updating rule should take the cost into account not only in the initial weights but also at each iteration. Li and Zhang [8] proposed FloatBoost to reduce the redundancy of greedy search by incorporating floating search into AdaBoost; in FloatBoost, poorly performing weak classifiers are removed when a new weak classifier is added. Xiao et al. [9] improved the backtrack technique in [8] and incorporated the historical information of preceding nodes into successive node learning. Hou et al. [10] used varying asymmetric factors for training different weak classifiers; however, because the asymmetric factor changes during training, the loss function remains unclear. Pham et al. [11] presented a method which trains the asymmetric AdaBoost [3] classifiers under a new cascade structure, namely the multi-exit cascade. Like the soft cascade [12], boosting chain [9] and dynamic cascade [13], the multi-exit cascade is a cascade structure which takes historical information into consideration: the n-th node "inherits" the weak classifiers selected at the preceding n-1 nodes. Wu et al. [14] stated that feature selection and ensemble classifier learning can be decoupled; they designed a linear asymmetric classifier (LAC) to adjust the linear coefficients of the selected weak classifiers. Kullback-Leibler Boosting [15] iteratively learns robust linear features by maximizing the Kullback-Leibler divergence.

Much of the previous work is based on AdaBoost and achieves the asymmetric learning goal by heuristic weight manipulation or post-processing techniques. It is not trivial to assess how these heuristics affect the original loss function of AdaBoost. In this work, we construct new boosting algorithms directly from asymmetric losses, with the optimization carried out by column generation. Experiments on toy data and real data show that our algorithms indeed achieve the asymmetric learning goal without any heuristic manipulation, and can outperform previous methods.

Therefore, the main contributions of this work are as follows.

  1.

    We utilize a general and systematic framework (column generation) to construct new asymmetric boosting algorithms, which can be applied to a variety of asymmetric losses. Our algorithms contain no heuristic strategies that may lead to suboptimal solutions; instead, the globally optimal solution is guaranteed.

    Unlike Viola-Jones’ asymmetric AdaBoost [3], the asymmetric effect of our methods spreads over the entire training process. The coefficients of all weak classifiers are updated at each iteration, which prevents the first weak classifier from absorbing the asymmetry. The number of weak classifiers does not need to be specified before training.

  2.

    The asymmetric totally-corrective boosting algorithms introduce the asymmetric learning goal into both feature selection and ensemble classifier learning. Both the example weights and the linear classifier coefficients are learned in an asymmetric way.

  3.

    In practice, L-BFGS-B [16] is used to solve the primal problem, which runs much faster than solving the dual problem and requires less memory.

  4.

    We demonstrate that, with the totally corrective optimization, the linear coefficients of some weak classifiers are set to zero by the algorithm, so that fewer weak classifiers are needed. We analyze the theoretical condition for this and show how useful the historical information of preceding nodes is for training successive nodes.

2 Asymmetric losses

In this section, we propose two asymmetric losses, which are motivated by asymmetric AdaBoost [3] and cost-sensitive LogitBoost [17], respectively.

We first introduce an asymmetric cost in the following form:

{\rm ACost}=\begin{cases}C_{1} & \text{if } y=+1 \text{ and } \mathrm{sign}(F(\boldsymbol{x}))=-1,\\ C_{2} & \text{if } y=-1 \text{ and } \mathrm{sign}(F(\boldsymbol{x}))=+1,\\ 0 & \text{if } y=\mathrm{sign}(F(\boldsymbol{x})).\end{cases} \qquad (4)

Here $\boldsymbol{x}$ is the input data, $y$ is the label and $F(\boldsymbol{x})$ is the learned classifier. Viola and Jones [3] directly take the product of $\rm ACost$ and the exponential loss $E_{\boldsymbol{X},Y}[\exp(-yF(\boldsymbol{x}))]$ as the asymmetric loss:

E_{\boldsymbol{X},Y}\big[\big({\bf I}(y=1)C_{1}+{\bf I}(y=-1)C_{2}\big)\exp\big(-yF(\boldsymbol{x})\big)\big],

where ${\bf I}(\cdot)$ is the indicator function. In a similar manner, we can also form an asymmetric loss from the logistic loss $E_{\boldsymbol{X},Y}[\mathrm{logit}(yF(\boldsymbol{x}))]$:

{\rm ALoss_{1}}=E_{\boldsymbol{X},Y}\big[\big({\bf I}(y=1)C_{1}+{\bf I}(y=-1)C_{2}\big)\mathrm{logit}\big(yF(\boldsymbol{x})\big)\big], \qquad (5)

where $\mathrm{logit}(x)=\log(1+\exp(-x))$ is the logistic loss function.

Masnadi-Shirazi and Vasconcelos [17] proposed cost-sensitive boosting algorithms which optimize different versions of cost-sensitive losses by means of gradient descent. They proved that the optimal cost-sensitive predictor minimizes the expected loss:

-E_{\boldsymbol{X},Y}\big[{\bf I}(y=1)\log(p_{c}(\boldsymbol{x}))+{\bf I}(y=-1)\log(1-p_{c}(\boldsymbol{x}))\big],
\text{where}\quad p_{c}(\boldsymbol{x})=\frac{e^{\gamma F(\boldsymbol{x})+\eta}}{e^{\gamma F(\boldsymbol{x})+\eta}+e^{-\gamma F(\boldsymbol{x})-\eta}},\quad\text{with }\gamma=\frac{C_{1}+C_{2}}{2},\ \eta=\frac{1}{2}\log\frac{C_{2}}{C_{1}}.

Fixing $\gamma$ to $1$, the expected loss can be reformulated as

{\rm ALoss_{2}}=E_{\boldsymbol{X},Y}\big[\mathrm{logit}\big(yF(\boldsymbol{x})+2y\eta\big)\big]. \qquad (6)
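As a concrete illustration, the two losses (5) and (6) can be evaluated directly on a sample; the following is a minimal numpy sketch, where the function names and the use of sample means to approximate the expectations are our own choices rather than part of the original formulation.

    import numpy as np

    def aloss1(z, y, C1, C2):
        # ALoss_1 of Eq. (5): per-class costs multiply the logistic loss of the
        # margins z_i = y_i F(x_i); the sample mean approximates the expectation.
        cost = np.where(y == 1, C1, C2)
        return np.mean(cost * np.logaddexp(0.0, -z))   # logit(v) = log(1 + exp(-v))

    def aloss2(z, y, C1, C2):
        # ALoss_2 of Eq. (6): the asymmetry enters as a margin shift 2*y*eta,
        # with eta = 0.5 * log(C2 / C1) and gamma fixed to 1.
        eta = 0.5 * np.log(C2 / C1)
        return np.mean(np.logaddexp(0.0, -(z + 2.0 * y * eta)))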

3 Asymmetric totally-corrective boosting

In this section, we construct asymmetric totally-corrective boosting algorithms (termed AsymBoostTC here) from the losses (5) and (6) discussed previously. In contrast to the methods constructing boosting-like algorithms in [17], [18] and [19], we use column generation to design our totally corrective boosting algorithms, inspired by [20] and [5].

Suppose there are $M$ training examples ($M_{1}$ positives and $M_{2}$ negatives), arranged according to their labels (positives first). The pool $\mathcal{H}$ contains $N$ available weak classifiers. The matrix $H\in\mathbb{Z}^{M\times N}$ contains the binary outputs of the weak classifiers in $\mathcal{H}$ on the training examples, namely $H_{ij}=h_{j}(\boldsymbol{x}_{i})$. We aim to learn a linear combination $F_{\boldsymbol{w}}(\cdot)=\sum_{j=1}^{N}w_{j}h_{j}(\cdot)$. $C_{1}$ and $C_{2}$ are the costs of misclassifying positives and negatives, respectively. We set the asymmetric factor $k=C_{2}/C_{1}$ and restrict $\gamma=(C_{1}+C_{2})/2$ to $1$, so $C_{1}$ and $C_{2}$ are fixed for a given $k$.

The problems of the two AsymBoostTC algorithms can be expressed as:

\min_{\boldsymbol{w}}\ \sum_{i=1}^{M}l_{i}\,\mathrm{logit}(z_{i})+\theta\boldsymbol{1}^{\!\top}\boldsymbol{w} \quad \mathrm{s.t.}\ \boldsymbol{w}\succcurlyeq\boldsymbol{0},\ z_{i}=y_{i}H_{i}\boldsymbol{w}, \qquad (7)

where $\boldsymbol{l}=[C_{1}/M_{1},\cdots,C_{2}/M_{2},\cdots]^{\!\top}$, and

\min_{\boldsymbol{w}}\ \sum_{i=1}^{M}e_{i}\,\mathrm{logit}(z_{i}+2y_{i}\eta)+\theta\boldsymbol{1}^{\!\top}\boldsymbol{w} \quad \mathrm{s.t.}\ \boldsymbol{w}\succcurlyeq\boldsymbol{0},\ z_{i}=y_{i}H_{i}\boldsymbol{w}, \qquad (8)

where $\boldsymbol{e}=[1/M_{1},\cdots,1/M_{2},\cdots]^{\!\top}$. In both (7) and (8), $z_{i}$ stands for the margin of the $i$-th training example. We refer to (7) as AsymBoostTC1 and to (8) as AsymBoostTC2. Note that both optimization problems are $\ell_{1}$-norm regularized; other forms of regularization, such as the $\ell_{2}$-norm, are also possible.
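For concreteness, the objective of (7) and its gradient with respect to $\boldsymbol{w}$ can be written in a few lines, in the form expected by a box-constrained solver such as the one discussed below. This is only an illustrative sketch: the helper name and the use of scipy's expit are our own choices.

    import numpy as np
    from scipy.special import expit

    def primal_tc1_obj_grad(w, H, y, l, theta):
        # Objective of (7): sum_i l_i * logit(z_i) + theta * 1^T w, with z = y * (H w).
        z = y * (H @ w)
        obj = np.sum(l * np.logaddexp(0.0, -z)) + theta * np.sum(w)
        # Gradient: d obj / d w_j = -sum_i l_i * sigma_i * y_i * H_ij + theta,
        # where sigma_i = exp(-z_i) / (1 + exp(-z_i)) = expit(-z_i).
        grad = -(H.T @ (l * expit(-z) * y)) + theta
        return obj, grad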

We first use the fact that the Fenchel conjugate [21] of the logistic loss function $\mathrm{logit}(x)$ is

\mathrm{logit}^{*}(u)=\begin{cases}(-u)\log(-u)+(1+u)\log(1+u), & 0\geq u\geq -1;\\ \infty, & \text{otherwise}.\end{cases}

Now we derive the Lagrange dual [21] of AsymBoostTC1. The Lagrangian of (7) is

L(\underbrace{\boldsymbol{w},\boldsymbol{z}}_{\text{primal}},\underbrace{\boldsymbol{\lambda},\boldsymbol{u}}_{\text{dual}})=\sum_{i=1}^{M}l_{i}\,\mathrm{logit}(z_{i})+\theta\boldsymbol{1}^{\!\top}\boldsymbol{w}-\boldsymbol{\lambda}^{\!\top}\boldsymbol{w}+\sum_{i=1}^{M}u_{i}(z_{i}-y_{i}H_{i}\boldsymbol{w}).

The dual function

g(\boldsymbol{\lambda},\boldsymbol{u})=\inf_{\boldsymbol{w},\boldsymbol{z}}L(\boldsymbol{w},\boldsymbol{z},\boldsymbol{\lambda},\boldsymbol{u})
=-\sum_{i=1}^{M}\underbrace{\sup_{z_{i}}\big(-u_{i}z_{i}-l_{i}\,\mathrm{logit}(z_{i})\big)}_{l_{i}\mathrm{logit}^{*}(-u_{i}/l_{i})}+\inf_{\boldsymbol{w}}\underbrace{\Big(\theta\boldsymbol{1}^{\!\top}-\boldsymbol{\lambda}^{\!\top}-\sum_{i=1}^{M}u_{i}y_{i}H_{i}\Big)}_{\text{must be }\boldsymbol{0}}\boldsymbol{w}.

The dual problem is

\max_{\boldsymbol{u}}\ -\sum_{i=1}^{M}\Big[u_{i}\log(u_{i})+(l_{i}-u_{i})\log(l_{i}-u_{i})\Big]
\mathrm{s.t.}\ \sum_{i=1}^{M}u_{i}y_{i}H_{i}\preccurlyeq\theta\boldsymbol{1}^{\!\top},\quad \boldsymbol{0}\preccurlyeq\boldsymbol{u}\preccurlyeq\boldsymbol{l}. \qquad (9)

Since problem (7) is convex and Slater's conditions are satisfied [21], the duality gap between the primal (7) and the dual (9) is zero, so the solutions of (7) and (9) coincide. By the KKT conditions, the gradients of the Lagrangian with respect to the primal variable $\boldsymbol{z}$ and the dual variable $\boldsymbol{u}$ must vanish at the optimum. Therefore, we obtain the relationship between the optimal values of $\boldsymbol{z}$ and $\boldsymbol{u}$:

u_{i}^{*}=\frac{l_{i}\exp(-z_{i}^{*})}{1+\exp(-z_{i}^{*})}. \qquad (10)

Similarly, we can get the dual problem of AsymBoostTC2, which is expressed as:

\max_{\boldsymbol{u}}\ -\sum_{i=1}^{M}\Big[u_{i}\log(u_{i})+(e_{i}-u_{i})\log(e_{i}-u_{i})+2u_{i}y_{i}\eta\Big]
\mathrm{s.t.}\ \sum_{i=1}^{M}u_{i}y_{i}H_{i}\preccurlyeq\theta\boldsymbol{1}^{\!\top},\quad \boldsymbol{0}\preccurlyeq\boldsymbol{u}\preccurlyeq\boldsymbol{e}, \qquad (11)

with

u_{i}^{*}=\frac{e_{i}\exp(-z_{i}^{*}-2y_{i}\eta)}{1+\exp(-z_{i}^{*}-2y_{i}\eta)}. \qquad (12)
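The updates (10) and (12) translate directly into code; a small illustrative sketch (the helper names are ours) is:

    from scipy.special import expit   # expit(v) = 1 / (1 + exp(-v))

    def weights_tc1(z, l):
        # Eq. (10): u_i = l_i * exp(-z_i) / (1 + exp(-z_i)) = l_i * expit(-z_i)
        return l * expit(-z)

    def weights_tc2(z, y, e, eta):
        # Eq. (12): u_i = e_i * exp(-z_i - 2 y_i eta) / (1 + exp(-z_i - 2 y_i eta))
        return e * expit(-(z + 2.0 * y * eta))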

In practice, the total number of weak classifiers, $N$, can be extremely large, so we cannot solve the primal problems (7) and (8) directly. Equivalently, however, we can optimize the duals (9) and (11) iteratively using column generation [20]. In each round, we add the most violated constraint by finding a weak classifier satisfying:

h^{\star}(\cdot)=\mathop{\mathrm{argmax}}_{h(\cdot)}\ \sum_{i=1}^{M}u_{i}y_{i}h(\boldsymbol{x}_{i}). \qquad (13)

This step is the same as training a weak classifier in AdaBoost and LPBoost, in which one seeks the weak classifier with the maximal edge (i.e., the minimal weighted error). The edge of $h_{j}$ is defined as $\sum_{i=1}^{M}u_{i}y_{i}h_{j}(\boldsymbol{x}_{i})$, which is inversely related to the weighted error. We then solve the restricted dual problem with one more constraint than in the previous round, and update the linear coefficients of the weak classifiers ($\boldsymbol{w}$) and the weights of the training examples ($\boldsymbol{u}$). Adding one constraint to the dual problem corresponds to adding one variable to the primal problem. Since the primal and dual problems are equivalent, we can solve either the restricted dual or the restricted primal in practice. The training procedures of AsymBoostTC1 and AsymBoostTC2 are summarized in Algorithm 1. Note that, in practice, in order to achieve a specific false negative rate (FNR) or false positive rate (FPR), an offset $b$ needs to be added to the final strong classifier: $F(\boldsymbol{x})=\sum_{j=1}^{n}w_{j}h_{j}(\boldsymbol{x})-b$, which can be obtained by a simple line search. Each new weak classifier $h^{\prime}(\cdot)$ corresponds to an extra variable in the primal and an extra constraint in the dual. Thus, the minimal value of the primal decreases as variables are added, and the maximal value of the dual decreases as constraints are added. Furthermore, since the optimization problems involved are convex, Algorithm 1 is guaranteed to converge to the global optimum.
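As an illustration of the column-generation step (13) with decision stumps, the following sketch scans candidate thresholds at midpoints of consecutive sorted feature values and returns the stump with the largest edge under the current weights $\boldsymbol{u}$. This is an assumed, simplified stump learner for exposition, not necessarily the one used in our experiments.

    import numpy as np

    def best_stump(X, y, u):
        # Find the decision stump maximizing the edge sum_i u_i * y_i * h(x_i).
        M, D = X.shape
        target = u * y
        total = target.sum()
        best_edge, best_params = -np.inf, None
        for d in range(D):
            order = np.argsort(X[:, d])
            cum = np.cumsum(target[order])
            for i in range(M - 1):
                # Stump "predict +1 if x_d > thr, else -1", thr between samples i and i+1.
                edge = total - 2.0 * cum[i]
                pol = 1 if edge >= 0 else -1        # flip polarity if the edge is negative
                if pol * edge > best_edge:
                    thr = 0.5 * (X[order[i], d] + X[order[i + 1], d])
                    best_edge, best_params = pol * edge, (d, thr, pol)
        return best_edge, best_params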

Next we show how AsymBoostTC introduces asymmetric learning into both feature selection and ensemble classifier learning. Decision stumps are the most commonly used type of weak classifier, and each stump uses only one dimension of the features, so training weak classifiers (decision stumps) is equivalent to feature selection. In our framework, the weak classifier with the maximal edge (i.e., the minimal weighted error) is selected. From (10) and (12), the weight of the $i$-th example, namely $u_{i}$, is affected by two factors: the asymmetric factor $k$ and the current margin $z_{i}$. If we set $k=1$, the weighting strategy reverts to being symmetric. On the other hand, the coefficients of the linear classifier, namely $\boldsymbol{w}$, are updated by solving the restricted primal problem at each iteration, and the asymmetric factor $k$ in the primal is absorbed by all the weak classifiers learned so far. Hence both feature selection and ensemble classifier learning take the asymmetric factor $k$ into account.

Input: A training set with $M$ labeled examples ($M_{1}$ positives and $M_{2}$ negatives); termination tolerance $\varepsilon>0$; regularization parameter $\theta$; asymmetric factor $k$; maximum number of weak classifiers $N_{\rm max}$.
Initialization: $N=0$; $\boldsymbol{w}=\boldsymbol{0}$; $u_{i}=l_{i}/2$ (AsymBoostTC1) or $u_{i}=e_{i}/(1+k^{-y_{i}})$ (AsymBoostTC2), $i=1,\dots,M$.
for $\mathrm{iteration}=1:N_{\mathrm{max}}$ do
- Train a weak classifier $h^{\prime}(\cdot)=\mathrm{argmax}_{h(\cdot)}\sum_{i=1}^{M}u_{i}y_{i}h(\boldsymbol{x}_{i})$.
- Check the termination condition: if $\mathrm{iteration}>1$ and $\sum_{i=1}^{M}u_{i}y_{i}h^{\prime}(\boldsymbol{x}_{i})<\theta+\varepsilon$, then break.
- Increment the number of weak classifiers: $N=N+1$.
- Add $h^{\prime}(\cdot)$ to the restricted master problem.
- Solve the primal problem (7) or (8) (or the dual problem (9) or (11)) and update $u_{i}$ ($i=1,\dots,M$) and $w_{j}$ ($j=1,\dots,N$).
Output: The selected weak classifiers $h_{1},h_{2},\dots,h_{N}$ and the final strong classifier $F(\boldsymbol{x})=\sum_{j=1}^{N}w_{j}h_{j}(\boldsymbol{x})$.
Algorithm 1: The training algorithm of AsymBoostTC1 and AsymBoostTC2.
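To make the flow of Algorithm 1 concrete, the sketch below assembles the illustrative helpers given earlier (best_stump, primal_tc1_obj_grad, weights_tc1) into a training loop for AsymBoostTC1. It is an approximation under those assumptions, not the implementation used in our experiments.

    import numpy as np
    from scipy.optimize import minimize

    def stump_predict(X, params):
        d, thr, pol = params
        return pol * np.where(X[:, d] > thr, 1.0, -1.0)

    def asym_boost_tc1(X, y, C1, C2, theta, eps, n_max):
        M1, M2 = np.sum(y == 1), np.sum(y == -1)
        l = np.where(y == 1, C1 / M1, C2 / M2)
        u = l / 2.0                                   # initialization of Algorithm 1
        stumps, w = [], np.zeros(0)
        for _ in range(n_max):
            edge, params = best_stump(X, y, u)        # most violated dual constraint
            if stumps and edge < theta + eps:
                break                                 # no violated constraint: stop
            stumps.append(params)
            H = np.column_stack([stump_predict(X, p) for p in stumps])
            # Solve the restricted primal (7) with L-BFGS-B, warm-started from the
            # previous coefficients plus a zero for the newly added weak classifier.
            res = minimize(primal_tc1_obj_grad, np.append(w, 0.0),
                           args=(H, y, l, theta), jac=True,
                           method='L-BFGS-B', bounds=[(0.0, None)] * H.shape[1])
            w = res.x
            u = weights_tc1(y * (H @ w), l)           # example weights via Eq. (10)
        return stumps, w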


The number of variables in the primal problem equals the number of weak classifiers, while in the dual problem it equals the number of training examples. In cascade classifiers for face detection, the number of weak classifiers is usually much smaller than the number of training examples, so solving the primal is much cheaper than solving the dual. Since the primal problem has only simple box constraints, we can employ L-BFGS-B [16], a quasi-Newton method for bound-constrained optimization, to solve it.

Instead of maintaining the full Hessian matrix, L-BFGS-B keeps only the several most recent values and gradients of the cost function to approximate the Hessian, so it requires much less memory. In column generation, we can use the solution from the previous iteration as the starting point of the current problem, which further reduces computation time.

The complementary slackness condition [21] implies that $\lambda_{j}w_{j}=0$, which gives the following condition for sparseness:

\text{If }\lambda_{j}=\theta-\sum_{i=1}^{M}u_{i}y_{i}H_{i,j}>0,\ \text{then }w_{j}=0. \qquad (14)

This means that, if the weak classifier $h_{j}(\cdot)$ is so "weak" that its edge is less than $\theta$ under the current distribution $\boldsymbol{u}$, its contribution to the ensemble classifier is zero. From another viewpoint, the $\ell_{1}$-norm regularization term in the primals (7) and (8) leads to a sparse result. The parameter $\theta$ controls the degree of sparseness: the larger $\theta$ is, the sparser the result.
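As a small diagnostic, condition (14) can be checked numerically for the currently selected weak classifiers; the helper below (our own, illustrative) returns the indices whose edges reach $\theta$ and may therefore keep non-zero coefficients.

    import numpy as np

    def effective_indices(H, y, u, theta, tol=1e-12):
        # Edge of each selected weak classifier under the current weights u.
        edges = H.T @ (u * y)
        # By (14), columns with edge < theta have w_j = 0; the rest may be non-zero.
        return np.where(edges >= theta - tol)[0]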

4 Experiments

4.1 Results on synthetic data

To show the behavior of our algorithms, we construct a 2D data set in which the positive data follow a 2D normal distribution ($N(0,0.1\mathbf{I})$), and the negative data form a ring with uniformly distributed angles and normally distributed radius ($N(1.0,0.2)$). In total, 2000 examples are generated (1000 positives and 1000 negatives); half of the data are used for training and the other half for testing. We compare AdaBoost, AsymBoostTC1 and AsymBoostTC2 on this data set. All training processes are stopped at 100 decision stumps. For AsymBoostTC1 and AsymBoostTC2, we fix $\theta$ to 0.01 and use a group of $k$ values $\{1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0\}$.
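For reference, the toy set can be generated along the following lines. This sketch reflects our own reading of the distributions: we treat $0.1\mathbf{I}$ as a covariance matrix and 0.2 as the standard deviation of the radius, which may differ from the exact convention used above.

    import numpy as np

    def make_toy_data(n_pos=1000, n_neg=1000, seed=0):
        rng = np.random.default_rng(seed)
        # Positives: 2D Gaussian N(0, 0.1 I).
        pos = rng.multivariate_normal(mean=[0.0, 0.0], cov=0.1 * np.eye(2), size=n_pos)
        # Negatives: ring with uniform angle and radius ~ N(1.0, 0.2).
        angle = rng.uniform(0.0, 2.0 * np.pi, size=n_neg)
        radius = rng.normal(loc=1.0, scale=0.2, size=n_neg)
        neg = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
        return X, y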

From Figure 1 (1) and (2), we find that the larger $k$ is, the bigger the area of positive output becomes, which means that the asymmetric LogitBoost tends to make a positive decision in the region where positive and negative data are mixed together. Another observation is that AsymBoostTC1 and AsymBoostTC2 have almost the same decision boundaries on this data set for the same $k$'s.

Figure 1 (3) and (4) show the trends of the false rates as the asymmetric factor $k$ grows; the results of AdaBoost are taken as the baseline. For all $k$'s, AsymBoostTC1 and AsymBoostTC2 achieve lower false negative rates and higher false positive rates than AdaBoost. As $k$ grows, AsymBoostTC1 and AsymBoostTC2 become more aggressive in reducing the false negative rate, at the cost of a higher false positive rate.

Figure 1: Results on the synthetic data for AsymBoostTC1 and AsymBoostTC2 with a group of asymmetric factors $k$; the results for AdaBoost are shown as the baseline. Panels (1) "AsymBoostTC1 vs AdaBoost" and (2) "AsymBoostTC2 vs AdaBoost" show the decision boundaries learned with $k$ set to 2.0 or 3.0, where $\times$'s and $\square$'s denote training negatives and training positives, respectively. Panels (3) and (4) show the false rates (FR), false positive rates (FPR) and false negative rates (FNR) on the test set for $k\in\{1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0\}$, with the corresponding rates for AdaBoost shown as dashed lines.

4.2 Face detection

We collect 9832 mirrored frontal face images and about 10115 large background images. 5000 face images and 7000 background images are used for training, and 4832 face images and 3115 background images for validation. Five basic types of Haar features are calculated on each 24x24 image, generating 162336 features in total. Decision stumps on these 162336 features constitute the pool of weak classifiers.

Single-node detectors. Single-node classifiers are trained with AdaBoost, AsymBoostTC1 and AsymBoostTC2. The parameters $\theta$ and $k$ are simply set to 0.001 and 7.0. 5000 faces and 5000 non-faces are used for training, while 4832 faces and 5000 non-faces are used for testing. The training/validation non-faces are randomly cropped from the training/validation background images.

Figure 2 (1) shows curves of the detection rate with the false positive rate fixed at 0.25, while curves of the false positive rate with the detection rate fixed at 0.995 are shown in Figure 2 (2). We fix the false positive rate at 0.25 rather than the commonly used 0.5 in order to slow down the increase of the detection rate; otherwise the detection rate would converge to 1.0 immediately. The detection rate increases (and the false positive rate decreases) faster than reported in [8] and [9], possibly because we use 10000 examples for training and 9832 for testing, which are fewer than the data used in [8] and [9] (18000 training examples and 15000 test examples). We can see that, in both settings, our algorithms perform better than AdaBoost in most cases.

The benefits of our algorithms are two-fold: (1) Given the same learning goal, our algorithms tend to use fewer weak classifiers. For example, from Figure 2 (2), to obtain a classifier with a 0.995 detection rate and a 0.2 false positive rate, AdaBoost needs at least 43 weak classifiers, while AsymBoostTC1 needs 32 and AsymBoostTC2 needs only 22. (2) Using the same number of weak classifiers, our algorithms achieve a higher detection rate or a lower false positive rate. For example, from Figure 2 (1), using 30 weak classifiers, both AsymBoostTC1 and AsymBoostTC2 achieve higher detection rates (0.9965 and 0.9975) than AdaBoost (0.9945).

Figure 2: Test curves of single-node classifiers for AdaBoost, AsymBoostTC1 and AsymBoostTC2; all classifiers use the same training and test sets. Panel (1) shows curves of the detection rate (DR) with the false positive rate (FPR) fixed at 0.25; panel (2) shows curves of FPR with DR fixed at 0.995. FPR or DR is evaluated at each weak classifier.

Complete detectors. Secondly, we train complete face detectors with AdaBoost, asymmetric AdaBoost, AsymBoostTC1 and AsymBoostTC2. All detectors are trained using the same training set. We use two types of cascade framework for detector training: the traditional cascade of Viola and Jones [2] and the multi-exit cascade presented in [11]; the latter utilizes the decision information of previous nodes when judging instances in the current node. For a fair comparison, all detectors use 24 nodes and 3332 weak classifiers. For each node, 5000 faces and 5000 non-faces are used for training, and 4832 faces and 5000 non-faces for validation. All non-faces are cropped from background images. The asymmetric factor $k$ for asymmetric AdaBoost, AsymBoostTC1 and AsymBoostTC2 is selected from $\{1.2, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0\}$. The regularization factor $\theta$ for AsymBoostTC1 and AsymBoostTC2 is chosen from $\{\frac{1}{50}, \frac{1}{60}, \frac{1}{70}, \frac{1}{80}, \frac{1}{90}, \frac{1}{100}, \frac{1}{200}, \frac{1}{400}, \frac{1}{800}, \frac{1}{1000}\}$. It takes about four hours to train an AsymBoostTC face detector on a machine with 8 Intel Xeon E5520 cores and 32GB of memory. Compared with AdaBoost, only around 0.5 hour of extra time in total is spent on solving the primal problem at each iteration. In the context of face detection, the training time of AsymBoostTC is thus nearly the same as that of AdaBoost.

ROC curves on the MIT+CMU data set are shown in Figure 3. Images containing ambiguous faces are removed and 120 images are retained. From the figure, we can see that asymmetric AdaBoost outperforms AdaBoost in both the Viola-Jones cascade and the multi-exit cascade, which coincides with the results reported in [3]. Our algorithms perform better than all other methods at all points, and the improvements are more significant when the number of false positives is less than 100, which is the most commonly used region in practice.

As mentioned in the previous section, our algorithms produce sparse results to some extent: some linear coefficients become zero when the corresponding weak classifiers satisfy condition (14). In the multi-exit cascade, this sparseness becomes more apparent. Since correctly classified negative data are discarded after each node is trained, the training data differ from node to node: "closer" nodes share more common training examples, while nodes "far away" from each other have more distinct training data. The greater the distance between two nodes, the more uncorrelated they become. Therefore, the weak classifiers selected in early nodes may perform poorly on the last node and thus tend to receive zero coefficients. We call the weak classifiers with non-zero coefficients "effective" weak classifiers. Table 1 shows the ratios of "effective" weak classifiers contributed by one node to a specific successive node; to save space, only the first 15 nodes are shown. We can see that the ratio decreases as the node index grows, which means that the farther a preceding node is from the current node, the less useful it is for the current node. For example, the first node contributes almost nothing after the eighth node. Table 2 shows the number of effective weak classifiers used by our algorithm and by traditional stage-wise boosting. All weak classifiers in stage-wise boosting have non-zero coefficients, while our totally-corrective algorithm uses far fewer effective weak classifiers.

Figure 3: Performance of the cascades evaluated by ROC curves on the MIT+CMU data set. AdaBoost is referred to as "Ada", and asymmetric AdaBoost [3] as "Asym". "Viola-Jones cascade" denotes the traditional cascade used in [2].
Table 1: The ratio of weak classifiers selected at the $i$-th node (column) appearing with non-zero coefficients in the $j$-th node (row). In each column, the ratios decrease as the node index grows.
Node Index 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 1.00
2 1.00 1.00
3 1.00 1.00 1.00
4 0.86 1.00 0.97 1.00
5 0.43 0.93 0.97 0.97 1.00
6 0.71 0.93 0.90 1.00 0.96 1.00
7 0.43 0.87 0.87 0.97 0.92 0.92 1.00
8 0.29 0.40 0.70 0.73 0.74 0.88 0.74 1.00
9 0.00 0.27 0.50 0.60 0.76 0.72 0.66 0.67 1.00
10 0.14 0.27 0.43 0.60 0.62 0.70 0.62 0.66 0.60 1.00
11 0.00 0.20 0.33 0.50 0.52 0.54 0.60 0.59 0.56 0.48 1.00
12 0.14 0.20 0.40 0.40 0.56 0.50 0.54 0.61 0.55 0.46 0.36 1.00
13 0.00 0.13 0.33 0.37 0.36 0.54 0.40 0.47 0.47 0.46 0.43 0.25 1.00
14 0.00 0.07 0.17 0.40 0.28 0.50 0.42 0.49 0.50 0.53 0.45 0.43 0.35 1.00
15 0.00 0.13 0.20 0.27 0.36 0.38 0.46 0.41 0.52 0.42 0.49 0.44 0.34 0.27 1.00
Table 2: Comparison of the numbers of the effective weak classifiers for the stage-wise boosting (SWB) and the totally-corrective boosting (TCB). We take AdaBoost and AsymBoostTC1 as representative types of SWB and TCB, both of which are trained in the multi-exit cascade for face detection.
Node Index 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
SWB 7 22 52 82 132 182 232 332 452 592 752 932 1132 1332 1532 1732 1932 2132
TCB 7 22 52 80 125 174 213 269 331 441 464 538 570 681 717 744 742 879

5 Conclusion

We have proposed two asymmetric totally-corrective boosting algorithms for object detection, which are implemented by the column generation technique in convex optimization. Our algorithms introduce asymmetry into both feature selection and ensemble classifier learning in a systematic way.

Both of our algorithms achieve better results on face detection than AdaBoost and Viola-Jones' asymmetric AdaBoost. One observation is that we do not see large differences in performance between AsymBoostTC1 and AsymBoostTC2 in our experiments. For the face detection task, AdaBoost already achieves very promising results, so the improvements brought by our methods are not dramatic.

One drawback of our algorithms is that two parameters need to be tuned. The optimal parameters are unlikely to be the same for different nodes, yet in this work we have used the same parameters for all nodes. Since the probability of negative examples decreases with the node index, the degree of asymmetry between positive and negative examples also decreases, so the optimal $k$ may decline with the node index.

The framework for constructing totally-corrective boosting algorithms is general, so we can consider other asymmetric losses (e.g., asymmetric exponential loss) to form new asymmetric boosting algorithms. In column generation, there is no restriction that only one constraint is added at each iteration. Actually, we can add several violated constraints at each iteration, which means that we can produce multiple weak classifiers in one round. By doing this, we can speed up the learning process.

Motivated by the analysis of sparseness, we find that the very early nodes contribute little information to the training of later nodes. Based on this, we can exclude some of these uninformative nodes as the node index grows, which will simplify the multi-exit structure and shorten the testing time.

References

  • [1] Paisitkriangkrai, S., Shen, C., Zhang, J.: Fast pedestrian detection using a cascade of boosted covariance features. IEEE Trans. Circuits Syst. Video Technol. 18 (2008) 1140–1151
  • [2] Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comp. Vis. 57 (2004) 137–154
  • [3] Viola, P., Jones, M.: Fast and robust classification using asymmetric AdaBoost and a detector cascade. In: Proc. Adv. Neural Inf. Process. Syst., MIT Press (2002) 1311–1318
  • [4] Paisitkriangkrai, S., Shen, C., Zhang, J.: Efficiently training a better visual detector with sparse Eigenvectors. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Miami, Florida, US (2009)
  • [5] Shen, C., Li, H.: On the dual formulation of boosting algorithms. IEEE Trans. Pattern Anal. Mach. Intell. (2010) Online 25 Feb. 2010. IEEE Computer Society Digital Library. http://doi.ieeecomputersociety.org/10.1109/TPAMI.2010.47.
  • [6] Shen, C., Wang, P., Li, H.: LACBoost and FisherBoost: Optimally building cascade classifiers. In: Proc. Eur. Conf. Comp. Vis. Volume 2., Crete Island, Greece, Lecture Notes in Computer Science (LNCS) 6312, Springer-Verlag (2010) 608–621
  • [7] Fan, W., Stolfo, S., Zhang, J., Chan, P.: AdaCost: Misclassification cost-sensitive boosting. In: Proc. Int. Conf. Mach. Learn. (1999) 97–105
  • [8] Li, S.Z., Zhang, Z.: FloatBoost learning and statistical face detection. IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1112–1123
  • [9] Xiao, R., Zhu, L., Zhang, H.: Boosting chain learning for object detection. In: Proc. IEEE Int. Conf. Comp. Vis. (2003) 709–715
  • [10] Hou, X., Liu, C., Tan, T.: Learning boosted asymmetric classifiers for object detection. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2006)
  • [11] Pham, M.T., Hoang, V.D.D., Cham, T.J.: Detection with multi-exit asymmetric boosting. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2008)
  • [12] Bourdev, L., Brandt, J.: Robust object detection via soft cascade. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn., San Diego, CA, US (2005) 236–243
  • [13] Xiao, R., Zhu, H., Sun, H., Tang, X.: Dynamic cascades for face detection. In: Proc. IEEE Int. Conf. Comp. Vis., Rio de Janeiro, Brazil (2007)
  • [14] Wu, J., Brubaker, S.C., Mullin, M.D., Rehg, J.M.: Fast asymmetric learning for cascade face detection. IEEE Trans. Pattern Anal. Mach. Intell. 30 (2008) 369–382
  • [15] Liu, C., Shum, H.Y.: Kullback-Leibler boosting. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. Volume 1., Madison, Wisconsin (2003) 587–594
  • [16] Zhu, C., Byrd, R.H., Nocedal, J.: L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization. ACM Trans. Mathematical Software 23 (1997) 550–560
  • [17] Masnadi-Shirazi, H., Vasconcelos, N.: Cost-sensitive boosting. IEEE Trans. Pattern Anal. Mach. Intell. (2010)
  • [18] Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Ann. Statist. 28 (2000) 337–407
  • [19] Rätsch, G., Mika, S., Schölkopf, B., Müller, K.R.: Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 1184–1199
  • [20] Demiriz, A., Bennett, K., Shawe-Taylor, J.: Linear programming boosting via column generation. Mach. Learn. 46 (2002) 225–254
  • [21] Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004)