
A Feature Selection Method for Multivariate Performance Measures

Qi Mao  and  Ivor W. Tsang
Abstract.

Feature selection with specific multivariate performance measures is the key to the success of many applications, such as image retrieval and text classification. Existing feature selection methods are usually designed for classification error. In this paper, we propose a generalized sparse regularizer. Based on this regularizer, we present a unified feature selection framework for general loss functions. In particular, we study a novel feature selection paradigm that directly optimizes multivariate performance measures. The resultant formulation is challenging for high-dimensional data, so a two-layer cutting plane algorithm is proposed to solve it, and its convergence is presented. In addition, we adapt the proposed method to optimize multivariate measures for multiple instance learning problems. Analytical comparisons with state-of-the-art feature selection methods show the advantages of the proposed method. Extensive experiments on large-scale and high-dimensional real-world datasets show that the proposed method outperforms l_1-SVM and SVM-RFE when choosing a small subset of features, and achieves significantly improved performance over SVM^{perf} in terms of F_1-score.

Qi Mao and Ivor W. Tsang are with School of Computer Engineering, Nanyang Technological University, Singapore 639798, e-mail {QMAO1,IvorTsang}@ntu.edu.sg

1. Introduction

Machine learning methods have been widely applied to a variety of learning tasks (e.g., classification, ranking, and structured prediction) arising in computer vision, text mining, natural language processing, and bioinformatics. Depending on the application, specific performance measures are required to evaluate the success of a learning algorithm. For instance, the error rate is a sound criterion for evaluating the classification performance of a learning method on datasets with balanced positive and negative examples. In contrast, in text classification, where positive examples are usually very few, one can simply assign the negative (majority) class to all testing examples; this trivial solution easily achieves a very low error rate due to the extreme class imbalance, yet the goal of text classification is to correctly detect positive examples. Hence, the error rate is a poor criterion for problems with highly skewed class distributions [11]. To address this issue, the F_1-score and the Precision/Recall Break-Even Point (PRBEP) are employed as evaluation criteria for text classification. Moreover, in information retrieval, search engine systems are required to return the top k documents (images) with the highest precision because most users only scan the first few results presented by the system, so precision/recall at k are the preferred choices.
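To make this contrast concrete, the following minimal sketch (synthetic labels chosen purely for illustration) shows how a trivial all-negative predictor attains a very low error rate yet a zero F_1-score on an imbalanced sample.

import numpy as np

# Synthetic, highly imbalanced labels: 2 positives among 100 examples.
y_true = np.array([1] * 2 + [-1] * 98)
y_pred = np.full_like(y_true, -1)          # trivial "always negative" predictor

error_rate = np.mean(y_pred != y_true)     # 0.02: looks excellent
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == -1))
fn = np.sum((y_pred == -1) & (y_true == 1))
precision = tp / (tp + fp) if tp + fp > 0 else 0.0
recall = tp / (tp + fn) if tp + fn > 0 else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
print(error_rate, f1)                      # 0.02 error rate, but F1 = 0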

Instead of optimizing the error rate, Support Vector Machine for multivariate performance measures (SVM^{perf}) [11] was proposed to directly optimize losses based on a variety of multivariate performance measures. A smoothed version of SVM^{perf} [37] was proposed to accelerate the convergence of the optimization problems specially designed for PRBEP and the area under the Receiver Operating Characteristic curve (AUC). Structural SVMs are considered a general framework for optimizing a variety of loss functions [27, 13, 28]. Other works optimize specific multivariate performance measures, such as the F-score [21], normalized discounted cumulative gain (NDCG) [29], ordinal regression [12], and ranking loss [16].

For some real applications, such as image and document retrievals, a set of sparse yet discriminative features is a necessity for rapid prediction on massive databases. However, the learned weight vector of the aforementioned methods is usually non-sparse. In addition, there are many noisy or non-informative features in text documents and images. Even though the task-specific performance measures can be optimized directly, learning with these noisy or non-informative features may still hurt both prediction performance and efficiency. To alleviate these issues, one can resort to embedded feature selection methods [15], which can be categorized into the following two major directions.

One way is to consider the sparsity of a decision weight vector w by replacing the l_2-norm regularizer \|\textbf{w}\|_2 in the structural risk functional (e.g., SVM, logistic regression) with the l_1-norm \|\textbf{w}\|_1 [39, 8, 23]. A thorough study comparing several recently developed l_1-regularized algorithms has been conducted in [33]. According to this study, the coordinate descent method using one-dimensional Newton directions (CDN) achieves state-of-the-art performance for solving l_1-regularized models on large-scale and high-dimensional datasets. To achieve a sparser solution, the Approximation of the zeRO norm Minimization (AROM) was proposed [30] to optimize l_0 models. The resultant problem is non-convex, so it easily suffers from local optima. However, recent results [18] and theoretical studies [17, 36] have shown that l_p models (with p<1), even with a locally optimal solution, can achieve better prediction performance than convex l_1 models, which are asymptotically biased [18].

Another way is to sort the weights of a SVM classifier and remove the smallest weights iteratively, which is known as SVM with Recursive Feature Elimination (SVM-RFE) [9]. However, as discussed in [32], such nested “monotonic” feature selection scheme leads to suboptimal performance. Non-monotonic feature selection (NMMKL) [32] has been proposed to solve this problem, but each feature corresponding to one kernel makes NMMKL infeasible for high-dimensional problems. Recently, Tan et al.[26] proposed Feature Generating Machine (FGM), which shows great scalability to non-monotonic feature selection on large-scale and very high-dimensional datasets.

The aforementioned feature selection methods [33, 30, 9, 32, 26] are usually designed for optimizing classification error only. To fulfill the needs of different applications, it is imperative to have a feature selection method designed for optimizing task-specific performance measures.

To this end, we first propose a generalized sparse regularizer for feature selection. A unified feature selection framework is then presented for general loss functions based on the proposed regularizer. In particular, optimizing multivariate performance measures is studied in this framework. To our knowledge, this is the first work to optimize multivariate performance measures for feature selection. Due to the exponential number of constraints brought by non-smooth multivariate loss functions [11, 13] and the exponential number of feature subset combinations [26], the resultant optimization problem is very challenging for high-dimensional data. To tackle this challenge, we propose a two-layer cutting plane algorithm, consisting of group feature generation (see Section 5.1) and group feature selection (see Section 5.2), to solve this problem effectively and efficiently. Specifically, Multiple Kernel Learning (MKL) trained in the primal by a cutting plane algorithm is proposed to deal with the exponential number of constraints induced by multivariate losses.

This paper is an extension of our preliminary work [19]. The main contributions of this paper are listed as follows.

  • The implementation details and the convergence proof of the proposed two-layer cutting plane algorithm and MKL algorithm trained in the primal are presented.

  • Connections to a variety of the state-of-the-art feature selection methods including SKM [3], NMMKL [32], l1l_{1}-SVM [33], l0l_{0}-SVM [30] and FGM [26] are discussed in details. By comparing with these methods, the advantages of our proposed methods are summarized as follows:

    1. (1)

      The tradeoff parameter C in l_1-SVM [33] is too sensitive to tune properly since it controls both the margin loss and the sparsity of \mathbf{w}. Our method alleviates this problem by introducing an additional parameter B to control the sparsity of \mathbf{w}. This separation makes parameter tuning for our method much easier than for SKM [3] and l_1-SVM.

    2. (2)

      NMMKL [32] uses a similar parameter separation strategy, but it is intractable for NMMKL to handle high-dimensional datasets, let alone to optimize multivariate losses. The proposed method can readily optimize multivariate losses for high-dimensional problems.

    3. (3)

      FGM [26] is a special case of the proposed framework when optimizing the squared hinge loss with indicator variables in the integer domain. The proposed framework is formulated in the real domain for general loss functions. In particular, we provide a natural extension of FGM for multivariate losses.

    4. (4)

      The proposed framework can be interpreted via an l_0-norm constraint, so it can be considered one of the l_0 methods. This gives another interpretation of the additional parameter B.

  • Recall that Multiple-Instance Learning via Embedded instance Selection (MILES) [6], which transforms multiple instance learning (MIL) into a feature selection problem by embedding bags into an instance-based feature space and selecting the most important features, achieves state-of-the-art performance on MIL problems. Under our unified feature selection framework, we extend MILES and study MIL for multivariate performance measures. To the best of our knowledge, this is seldom studied in MIL scenarios, but it is important for real-world applications of MIL tasks.

  • Extensive experiments on several challenging and very high-dimensional real world datasets show that the proposed method yields better performance than the state-of-the-art feature selection methods, and outperforms SVMperf using all features in terms of multivariate performance measures. The experimental results on the multiple instance dataset show that our proposed method achieves promising results.

The rest of the paper is organized as follows. We briefly review SVM^{perf} in Section 2 and introduce the proposed generalized sparse regularizer in Section 3. The feature selection framework for multivariate performance measures, its algorithm, and its application to multiple instance learning are studied in Sections 4, 5, and 7, respectively. Section 6 analyzes the connections to a variety of feature selection methods. Extensive empirical results are reported in Section 8. Finally, concluding remarks are presented in the last section.

In the sequel, \mathbf{A}\succeq{\bf 0} means that the matrix \mathbf{A} is symmetric and positive semidefinite (psd). We denote the transpose of a vector/matrix by the superscript T and the l_p norm of a vector \textbf{v} by \|\textbf{v}\|_p. The binary operator \odot represents the elementwise product between two vectors/matrices.

2. SVM for Multivariate Performance Measure

Given a training sample of input-output pairs (\textbf{x}_{i},y_{i})\in\mathcal{X}\times\mathcal{Y}, i=1,\ldots,n, drawn from some fixed but unknown probability distribution with \mathcal{X}\subseteq R^{m} and \mathcal{Y}=\{-1,+1\}, the learning problem is treated as a multivariate prediction problem by defining hypotheses \overline{h}:\overline{\mathcal{X}}\rightarrow\overline{\mathcal{Y}} that map a tuple \overline{\textbf{x}}=(\textbf{x}_{1},\ldots,\textbf{x}_{n})\in\overline{\mathcal{X}} of n feature vectors to a tuple \overline{y}=(y_{1},\ldots,y_{n})\in\overline{\mathcal{Y}} of n labels, where \overline{\mathcal{X}}=\mathcal{X}\times\ldots\times\mathcal{X} and \overline{\mathcal{Y}}\subseteq\{-1,+1\}^{n}. The linear discriminant function of SVM^{perf} is defined as

(1) \overline{h}_{\textbf{w}}(\overline{\textbf{x}})=\arg\max_{\overline{y}^{\prime}\in\overline{\mathcal{Y}}}f(\overline{\textbf{x}},\overline{y}^{\prime})=\arg\max_{\overline{y}^{\prime}\in\overline{\mathcal{Y}}}\sum_{i=1}^{n}y^{\prime}_{i}\textbf{w}^{T}\textbf{x}_{i},

where \textbf{w}=[w_{1},\ldots,w_{m}]^{T} is the weight vector.
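Since the joint score in (1) decouples over the n examples, the argmax over all candidate labelings reduces to taking the sign of each per-example score whenever \overline{\mathcal{Y}} is the full product space \{-1,+1\}^{n}. A minimal sketch (not the authors' code) is:

import numpy as np

def predict_multivariate(X, w):
    """Prediction with the linear multivariate hypothesis (1).

    Because sum_i y_i' * w^T x_i decouples over examples, the argmax over
    2^n labelings reduces to an elementwise sign of the per-example scores
    (assuming the label space is the full product space).
    """
    scores = X @ w                       # shape (n,)
    return np.where(scores >= 0, 1, -1)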

To learn the hypothesis (1) from training data, a large margin method is employed to obtain good generalization performance by enforcing the constraints that the decision value of the ground-truth labeling \overline{y} should be larger than that of any other labeling \overline{y}^{\prime}\in\overline{\mathcal{Y}}\setminus\{\overline{y}\}, i.e., f(\overline{\textbf{x}},\overline{y})\geq f(\overline{\textbf{x}},\overline{y}^{\prime})+\Delta(\overline{y},\overline{y}^{\prime}), where \Delta(\overline{y},\overline{y}^{\prime}) is some multivariate loss function (several instantiated losses are presented in Section 5.4). Structural SVMs [28, 13] solve the corresponding soft-margin case via the 1-slack formulation,

(2) \min_{\textbf{w},\xi\geq 0}\;\frac{1}{2}\|\textbf{w}\|^{2}_{2}+C\xi
s.t. \forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}:{\textbf{w}}^{T}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})\textbf{x}_{i}\geq\Delta(\overline{y},\overline{y}^{\prime})-\xi,

where CC is a regularization parameter that trades off the empirical risk and the model complexity.

The optimization problem (2) is convex, but it has an exponential number of constraints. Fortunately, it can be solved in polynomial time by adopting the sparse approximation algorithm of structural SVMs. As shown in [11], optimizing the learning model with respect to a specific multivariate measure can substantially boost the performance on that measure.

3. Generalized Sparse Regularizer

In this paper, we focus on minimizing the regularized empirical loss functional as

(3) \min_{\mathbf{w}}\Omega(\mathbf{w})+C\ell(\mathbf{w}),

where \Omega(\cdot) is a regularization function and \ell(\cdot) is any loss function, including multivariate performance measure losses.

Since the l_2-norm regularizer is used in (2), the learned weight vector \textbf{w} is non-sparse, so the linear discriminant function in (1) would involve many features in the prediction. As discussed in Section 1, selecting a small set of discriminative features is crucial to many real applications. In order to enforce sparsity on \textbf{w}, we propose a new sparse regularizer

\Omega(\mathbf{w})=\min_{\mathbf{d}\in\mathcal{D}}\frac{1}{2}\sum_{j=1}^{m}\frac{|w_{j}|^{p}}{d_{j}},

where \mathbf{d} lies in the real domain \mathcal{D}=\{\mathbf{d}|\sum_{j=1}^{m}d_{j}=B,0\leq d_{j}\leq 1,\forall j=1,\ldots,m\}, and p>0 and B>0 are two parameters. The optimal solution of the proposed regularizer must satisfy w_{j}=0 whenever d_{j}=0: since p>0, any w_{j}\not=0 would make |w_{j}|^{p}/d_{j}, and hence the objective value, approach infinity. The constraints \sum_{j=1}^{m}d_{j}=B and 0\leq d_{j}\leq 1 force some d_{j} to be zero, and thus the corresponding w_{j} to be zero, \forall j=1,\ldots,m. Hence, the parameter B can be interpreted as a budget that controls the sparsity of \mathbf{w}.

This regularizer is similar to SimpleMKL [24] with each feature corresponding to one kernel; however, SimpleMKL corresponds to the special case of \mathcal{D} with B=1, which can also be interpreted via the quadratic variational formulation of the l_1 norm [2]. The regularizer differs from the l_1 norm when B\not=1. To explain the difference, we consider Problem (2) under the general framework (3). In the separable case, the parameter C does not affect the optimal solution since the error \xi=0. If the l_1 norm replaces the l_2 norm in Problem (2), the sparsity of \mathbf{w} is fixed once the optimal solution is reached. Hence, the parameter B in \mathcal{D} can be considered the only factor that enforces sparsity on \mathbf{w}. In the non-separable case, where errors are allowed, the parameter C also influences the sparsity of \mathbf{w}, but B is expected to enforce the sparsity of \mathbf{w} more explicitly as C becomes larger. This argument will be empirically justified in Section 8.1.

The learning algorithm with the proposed generalized sparse regularizer is formulated as

(4) \min_{\mathbf{d}\in\mathcal{D}}\min_{\textbf{w}}\;\frac{1}{2}\sum_{j=1}^{m}\frac{|w_{j}|^{p}}{d_{j}}+C\ell(\mathbf{w}).

This formulation is more general for feature selection.

Lemma 1.

If p\geq 2, Problem (4) is jointly convex with respect to \textbf{w} and \mathbf{d}; otherwise, it is not jointly convex.

Proof.

We only need to prove that, if p\geq 2, g(w_{j},d_{j})=\frac{|w_{j}|^{p}}{d_{j}} with d_{j}>0 is jointly convex with respect to w_{j} and d_{j}. The convexity of g in its domain is established when the following holds: \nabla^{2}g=\left[\begin{array}{cc}\frac{2|w_{j}|^{p}}{d_{j}^{3}}&-\frac{p|w_{j}|^{p-1}}{d_{j}^{2}}\\ -\frac{p|w_{j}|^{p-1}}{d_{j}^{2}}&\frac{p(p-1)|w_{j}|^{p-2}}{d_{j}}\end{array}\right]\succeq{\bf 0}\Leftrightarrow\left[\begin{array}{cc}2|w_{j}|^{2}&-p|w_{j}|d_{j}\\ -p|w_{j}|d_{j}&p(p-1)d_{j}^{2}\end{array}\right]\succeq{\bf 0}, which is equivalent to {\bf v}^{T}\nabla^{2}g{\bf v}\geq 0 for any nonzero vector {\bf v}. WLOG, we assume {\bf v}=[1\;a]^{T} where a is any real number; then this condition reduces to 2|w_{j}|^{2}-2ap|w_{j}|d_{j}+a^{2}p(p-1)d_{j}^{2}\geq 0\Leftrightarrow 2\Big{(}|w_{j}|-\frac{apd_{j}}{2}\Big{)}^{2}\geq\frac{a^{2}d_{j}^{2}p(2-p)}{2}. This condition always holds when p\geq 2, which completes the proof. ∎
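As a quick numerical companion to Lemma 1 (illustration only, not part of the proof), the sketch below evaluates the Hessian of g(w_j, d_j) = |w_j|^p / d_j at a few points and checks its smallest eigenvalue: it is (numerically) non-negative for p >= 2 and becomes negative for p < 2.

import numpy as np

def hessian_g(w, d, p):
    """Hessian of g(w, d) = w^p / d for w > 0, d > 0 (cf. Lemma 1)."""
    return np.array([
        [p * (p - 1) * w**(p - 2) / d,   -p * w**(p - 1) / d**2],
        [-p * w**(p - 1) / d**2,          2 * w**p / d**3],
    ])

for p in (2.0, 3.0, 1.5):
    min_eig = min(np.linalg.eigvalsh(hessian_g(w, d, p)).min()
                  for w in (0.1, 1.0, 5.0) for d in (0.2, 0.7, 1.0))
    # near zero (up to rounding) for p = 2, positive for p = 3, negative for p = 1.5
    print(p, min_eig)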

In what follows, we focus on the convex formulation with p=2. In Section 6, we will discuss the relationships with a variety of the state-of-the-art feature selection methods.

4. Feature Selection for Multivariate Performance Measures

To optimize multivariate loss functions and learn a sparse feature representation simultaneously, we propose to solve the following problem, which is jointly convex over \mathbf{d} and (\mathbf{w},\xi) in the case of p=2:

(5) \min_{\mathbf{d}\in\mathcal{D}}\min_{\textbf{w},\xi\geq 0}\;\frac{1}{2}\sum_{j=1}^{m}\frac{|w_{j}|^{2}}{d_{j}}+C\xi
s.t. \forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}:{\textbf{w}}^{T}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})\textbf{x}_{i}\geq\Delta(\overline{y},\overline{y}^{\prime})-\xi.

The partial dual with respect to (\mathbf{w},\xi) is obtained from the Lagrangian \mathcal{L}(\mathbf{w},\xi,\alpha,\tau) with dual variables \alpha\geq 0 and \tau\geq 0: \mathcal{L}=\frac{1}{2}\sum_{j=1}^{m}\frac{|w_{j}|^{2}}{d_{j}}+C\xi-\tau\xi-\sum_{\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}}\alpha_{\overline{y}^{\prime}}\big{(}{\textbf{w}}^{T}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})\textbf{x}_{i}-\Delta(\overline{y},\overline{y}^{\prime})+\xi\big{)}. Since the gradients of the Lagrangian with respect to (\mathbf{w},\xi) vanish at the optimum, we obtain the KKT conditions: w_{j}=d_{j}\sum_{\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}}\alpha_{\overline{y}^{\prime}}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})x_{j,i}\mbox{ and }\sum_{\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}}\alpha_{\overline{y}^{\prime}}\leq C. Substituting the KKT conditions back into \mathcal{L}(\mathbf{w},\xi,\alpha,\tau), we obtain the dual problem

(6) \min_{\mathbf{d}\in\mathcal{D}}\max_{\alpha\in\mathcal{A}}-\frac{1}{2}\sum_{\overline{y}^{\prime}}\sum_{\overline{y}^{\prime\prime}}\alpha_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime\prime}}Q_{\overline{y}^{\prime},\overline{y}^{\prime\prime}}^{\textbf{d}}+\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}b_{\overline{y}^{\prime}},

where Δ(y¯,y¯)=0\Delta(\overline{y},\overline{y})=0, Δ(y¯,y¯)>0\Delta(\overline{y},\overline{y}^{\prime})>0 if y¯y¯\overline{y}\not=\overline{y}^{\prime},

Q_{\overline{y}^{\prime},\overline{y}^{\prime\prime}}^{\textbf{d}}=\sum_{j=1}^{m}d_{j}\bigg{(}\sum_{\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}}\alpha_{\overline{y}^{\prime}}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})x_{j,i}\bigg{)}^{2}=\sum_{j=1}^{m}\bigg{(}\sum_{\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}}\alpha_{\overline{y}^{\prime}}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})x_{j,i}\sqrt{d_{j}}\bigg{)}^{2}=\langle\textbf{a}_{\overline{y}^{\prime}},\textbf{a}_{\overline{y}^{\prime\prime}}\rangle,

\textbf{a}_{\overline{y}^{\prime}}=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})(\textbf{x}_{i}\odot\sqrt{\textbf{d}}), b_{\overline{y}^{\prime}}=\frac{1}{n}\Delta(\overline{y},\overline{y}^{\prime}), and \mathcal{A}=\{\alpha|\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}\leq C,\alpha\geq 0\}. Problem (6) is challenging because of the exponential size of \alpha and the high-dimensional vector \mathbf{d} for high-dimensional problems.

5. Two-Layer Cutting Plane Algorithm

In this section, we propose a two-layer cutting plane algorithm to solve Problem (6) efficiently and effectively. The two layers, namely group feature generation and group feature selection, are described in Sections 5.1 and 5.2, respectively. The overall two-layer cutting plane algorithm is presented in Section 5.3, and the required inference steps in Section 5.4.

5.1. Group Feature Generation

By denoting S(α,d)=12y¯y¯′′αy¯αy¯′′Qy¯,y¯′′d+y¯αy¯by¯S(\alpha,\textbf{d})=-\frac{1}{2}\sum_{\overline{y}^{\prime}}\sum_{\overline{y}^{\prime\prime}}\alpha_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime\prime}}Q_{\overline{y}^{\prime},\overline{y}^{\prime\prime}}^{\textbf{d}}+\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}b_{\overline{y}^{\prime}}, Problem (6) turns out to be

mind𝒟maxα𝒜S(α,d).\min_{\textbf{d}\in\mathcal{D}}\max_{\alpha\in\mathcal{A}}S(\alpha,\textbf{d}).

Since the domains \mathcal{D} and \mathcal{A} are nonempty, the function S(\alpha^{*},\mathbf{d}) is closed and convex in \mathbf{d}\in\mathcal{D} for any given \alpha^{*}\in\mathcal{A}, and the function S(\alpha,\mathbf{d}^{*}) is closed and concave in \alpha\in\mathcal{A} for any given \mathbf{d}^{*}\in\mathcal{D}. Therefore, the saddle-point property \min_{\textbf{d}\in\mathcal{D}}\max_{\alpha\in\mathcal{A}}S(\alpha,\textbf{d})=\max_{\alpha\in\mathcal{A}}\min_{\textbf{d}\in\mathcal{D}}S(\alpha,\textbf{d}) holds [4].

We further denote d(α)=S(α,d)\mathcal{F}_{\textbf{d}}(\alpha)=-S(\alpha,\textbf{d}), and then the equivalent optimization problems are obtained as

(7) \min_{\alpha\in\mathcal{A}}\max_{\textbf{d}\in\mathcal{D}}\mathcal{F}_{\textbf{d}}(\alpha) \quad\textrm{or}\quad \min_{\alpha\in\mathcal{A},\gamma}\gamma:\gamma\geq\mathcal{F}_{\textbf{d}}(\alpha),\;\forall\textbf{d}\in\mathcal{D}.

The cutting plane algorithm [14] can be used to solve this problem. Since \max_{\textbf{d}\in\mathcal{D}}\mathcal{F}_{\textbf{d}}(\alpha)\geq\mathcal{F}_{\textbf{d}^{t}}(\alpha),\forall\textbf{d}^{t}\in\mathcal{D}, a lower bound approximation of (7) is obtained via \max_{\textbf{d}\in\mathcal{D}}\mathcal{F}_{\textbf{d}}(\alpha)\geq\max_{t=1,\ldots,T}\mathcal{F}_{\textbf{d}^{t}}(\alpha). We then minimize Problem (7) over the set \{\mathbf{d}^{t}\}_{t=1}^{T} as

(8) \min_{\alpha\in\mathcal{A}}\max_{t=1,\ldots,T}\mathcal{F}_{\textbf{d}^{t}}(\alpha) \quad\textrm{or}\quad \min_{\alpha\in\mathcal{A},\gamma}\gamma:\gamma\geq\mathcal{F}_{\textbf{d}^{t}}(\alpha),\forall t=1,\ldots,T.

As shown in [22], such a cutting plane algorithm can converge to a robust optimal solution within tens of iterations, provided the worst-case analysis is done exactly. Specifically, for a fixed \alpha^{t}, the worst-case analysis is done by solving

(9) \textbf{d}^{t}=\arg\max_{\textbf{d}\in\mathcal{D}}\mathcal{F}_{\textbf{d}}(\alpha^{t}),

which is referred to as the group generation procedure. Even though Problems (8) and (9) cannot be solved directly due to the exponential size of \alpha, we will show that they can be readily solved in Sections 5.2 and 5.4, respectively.

5.2. Group Feature Selection

By introducing dual variables \mu=[\mu_{1},\mu_{2},\ldots,\mu_{T}]^{T}\geq 0, we can transform (8) into the following MKL problem,

(10) \max_{\alpha\in\mathcal{A}}\min_{\mu\in\mathcal{M}_{T}}-\frac{1}{2}\sum_{\overline{y}^{\prime}}\sum_{\overline{y}^{\prime\prime}}\alpha_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime\prime}}\left(\sum_{t=1}^{T}\mu_{t}Q_{\overline{y}^{\prime},\overline{y}^{\prime\prime}}^{\textbf{d}^{t}}\right)+\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}b_{\overline{y}^{\prime}},

where T={t=1Tμt=1,μt0,t=1,,T}\mathcal{M}_{T}=\{\sum_{t=1}^{T}\mu_{t}=1,\mu_{t}\geq 0,\forall t=1,\ldots,T\}.

However, due to the exponential size of \alpha, the complexity of Problem (10) remains. In this case, state-of-the-art multiple kernel learning algorithms [25, 24, 31] no longer work. The following proposition shows that we can instead solve Problem (10) in the primal form.

Proposition 1.

The primal form of Problem (10) is

(11) \min_{\textbf{w}_{1},\ldots,\textbf{w}_{T},\xi\geq 0}\;\frac{1}{2}\bigg{(}\sum_{t=1}^{T}\|\textbf{w}_{t}\|_{2}\bigg{)}^{2}+C\xi
s.t. \xi\geq b_{\overline{y}^{\prime}}-\sum_{t=1}^{T}\langle\textbf{w}_{t},\textbf{a}_{\overline{y}^{\prime}}^{t}\rangle,\forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}.

According to the KKT conditions, the solution of (11) is

(12) wt=μty¯αy¯ay¯t\displaystyle\textbf{w}_{t}=\mu_{t}\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}\textbf{a}_{\overline{y}^{\prime}}^{t}

where μt\mu_{t} is a dual value of the ttht^{th} constraint of (8).

The detailed proof of Proposition 1 is given in the supplementary material.

Here, we define the regularization term as Ω(w¯)=12(t=1Twt2)2\Omega(\overline{\textbf{w}})=\frac{1}{2}\big{(}\sum_{t=1}^{T}\|\textbf{w}_{t}\|_{2}\big{)}^{2} with w¯=[w1,,wT]T\overline{\textbf{w}}=[\textbf{w}_{1},\ldots,\textbf{w}_{T}]^{T}and the empirical risk function as

(13) R_{emp}(\overline{\textbf{w}})=\max\bigg{(}0,\max_{\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}}b_{\overline{y}^{\prime}}-\sum_{t=1}^{T}\langle\textbf{w}_{t},\textbf{a}_{\overline{y}^{\prime}}^{t}\rangle\bigg{)},

which is a convex but non-smooth function w.r.t. \overline{\textbf{w}}. We can then apply the bundle method [27] to solve this primal problem. Problem (11) is transformed into

minw¯𝒥(w¯)=Ω(w¯)+CRemp(w¯).\displaystyle\min_{\overline{\textbf{w}}}\mathcal{J}(\overline{\textbf{w}})=\Omega(\overline{\textbf{w}})+CR_{emp}(\overline{\textbf{w}}).

Since R_{emp}(\overline{\textbf{w}}) is a convex function, its subgradient exists everywhere in its domain [10]. Suppose \overline{\textbf{w}}^{k} is a point where R_{emp}(\overline{\textbf{w}}) is finite; by the definition of the subgradient, we can form the lower bound

R_{emp}(\overline{\textbf{w}})\geq R_{emp}(\overline{\textbf{w}}^{k})+\langle\overline{\textbf{w}}-\overline{\textbf{w}}^{k},\textbf{p}^{k}\rangle=\langle\overline{\textbf{w}},\textbf{p}^{k}\rangle+R_{emp}(\overline{\textbf{w}}^{k})-\langle\overline{\textbf{w}}^{k},\textbf{p}^{k}\rangle,

where subgradient pkw¯Remp(w¯k)\textbf{p}^{k}\in\partial_{\overline{\textbf{w}}}R_{emp}(\overline{\textbf{w}}^{k}) is at w¯k\overline{\textbf{w}}^{k}. In order to obtain pk\textbf{p}^{k}, we need to solve the following inference problem

(14) y¯k=argmaxy¯𝒴¯\y¯by¯t=1Twt,ay¯t\overline{y}^{k}=\arg\max_{\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}}b_{\overline{y}^{\prime}}-\sum_{t=1}^{T}\langle\textbf{w}_{t},\textbf{a}_{\overline{y}^{\prime}}^{t}\rangle

which is an integer programming problem. We defer the discussion of this problem to Section 5.4. After that, we can obtain the subgradient \textbf{p}^{k}_{t}=-\mathbf{a}_{\overline{y}^{k}}^{t}, so that R_{emp}(\overline{\textbf{w}}^{k})=b_{\overline{y}^{k}}-\sum_{t=1}^{T}\langle\textbf{w}_{t},\textbf{a}_{\overline{y}^{k}}^{t}\rangle=b_{\overline{y}^{k}}+\langle\overline{\textbf{w}}^{k},\textbf{p}^{k}\rangle.

Given the subgradient sequence \textbf{p}^{1},\textbf{p}^{2},\ldots,\textbf{p}^{K}, a tighter lower bound for R_{emp}(\overline{\textbf{w}}) can be formulated as

R_{emp}(\overline{\textbf{w}})\geq R_{emp}^{K}(\overline{\textbf{w}})=\max\Big{(}0,\max_{1\leq k\leq K}\langle\overline{\textbf{w}},\textbf{p}^{k}\rangle+q^{k}\Big{)},

where q^{k}=R_{emp}(\overline{\textbf{w}}^{k})-\langle\overline{\textbf{w}}^{k},\textbf{p}^{k}\rangle=b_{\overline{y}^{k}}. Following the bundle method [27], the criterion for selecting the next point \overline{\textbf{w}}^{K+1} is to solve the following problem,

(15) \min_{\textbf{w}_{1},\ldots,\textbf{w}_{T},\xi\geq 0}\;\frac{1}{2}\bigg{(}\sum_{t=1}^{T}\|\textbf{w}_{t}\|_{2}\bigg{)}^{2}+C\xi
s.t. \xi\geq\langle\overline{\textbf{w}},\textbf{p}^{k}\rangle+q^{k},\forall k=1,\ldots,K.

The following Corollary shows that Problem (15) can be easily solved by QCQP solvers, and the number of variables is independent of the number of examples.

Corollary 1.

In terms of Proposition 1, the dual form of Problem (15) is

(16) \max_{\alpha\in\mathcal{A}_{K}}\max_{\theta}\;-\theta+\sum_{k=1}^{K}\alpha_{k}q^{k}
s.t. \frac{1}{2}\Bigg{\|}\sum_{k=1}^{K}\alpha_{k}\textbf{p}_{t}^{k}\Bigg{\|}_{2}^{2}\leq\theta,\forall t=1,\ldots,T,

where \mathcal{A}_{K}=\{\alpha|\sum_{k=1}^{K}\alpha_{k}\leq C,\alpha_{k}\geq 0,\forall k=1,\ldots,K\}. This is a QCQP problem with T+1 constraints and K+1 variables.

The proof of Corollary 1 follows the same derivation as Proposition 1 with \textbf{p}^{k}_{t}=-\mathbf{a}_{\overline{y}^{k}}^{t}, q^{k}=b_{\overline{y}^{k}}, and \alpha of size K. Consequently, the primal variables are recovered by \mathbf{w}_{t}=-\mu_{t}\sum_{k}\alpha_{k}\textbf{p}_{t}^{k}.
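As an illustration of Corollary 1 and of the recovery rule above, the reduced dual (16) can be handed to a generic convex solver. The following minimal sketch uses cvxpy, which is an assumption on my part (the authors mention Mosek); the array layout of P and q is also hypothetical.

import numpy as np
import cvxpy as cp

def solve_group_qcqp(P, q, C):
    """Sketch of the reduced dual (16).

    P: array of shape (T, dim, K) with P[t][:, k] = p_t^k,
    q: length-K vector of offsets q^k, C: regularization parameter.
    """
    T, _, K = P.shape
    alpha = cp.Variable(K, nonneg=True)
    theta = cp.Variable()
    cons = [cp.sum(alpha) <= C]
    cons += [0.5 * cp.sum_squares(P[t] @ alpha) <= theta for t in range(T)]
    prob = cp.Problem(cp.Maximize(-theta + q @ alpha), cons)
    prob.solve()
    # Recover w_t = -mu_t * sum_k alpha_k p_t^k, where mu_t is the dual value
    # of the t-th quadratic constraint (these duals sum to one by the
    # stationarity condition in theta).
    mu = np.array([cons[1 + t].dual_value for t in range(T)])
    W = [-mu[t] * (P[t] @ alpha.value) for t in range(T)]
    return alpha.value, W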

Algorithm 1 Group_feature_selection
1: Input: x¯=(x1,,xn),y¯=(y1,,yn)\overline{\textbf{x}}=(\textbf{x}_{1},\ldots,\textbf{x}_{n}),\overline{y}=(y_{1},\ldots,y_{n}), an initial group set 𝒲\mathcal{W}, ϵ\epsilon, CC
2:𝒴¯=\overline{\mathcal{Y}}=\emptyset, k=0k=0
3:repeat
4:  k=k+1k=k+1
5:  Finding the most violated y¯\overline{y}^{\prime}
6:  Compute pk\textbf{p}^{k} and qkq^{k}
7:  𝒴¯=𝒴¯{y¯}\overline{\mathcal{Y}}=\overline{\mathcal{Y}}\cup\{\overline{y}^{\prime}\}
8:  Solving Problem (16) over 𝒲\mathcal{W} and 𝒴¯\overline{\mathcal{Y}}
9:until ϵ\epsilon-optimal

Let \mathcal{J}_{K}(\overline{\textbf{w}})=\Omega(\overline{\textbf{w}})+CR_{emp}^{K}(\overline{\textbf{w}}); the \epsilon-optimal condition in Algorithm 1 is \min_{0\leq k\leq K}\mathcal{J}(\overline{\textbf{w}}^{k})-\mathcal{J}_{K}(\overline{\textbf{w}}^{K})\leq\epsilon. The convergence proof in [27] does not apply in this case because the Fenchel dual of \Omega(\overline{\textbf{w}}) fails to satisfy the strong convexity assumption if T>1. When T=1, Algorithm 1 is exactly the bundle method [27]. When T\geq 2, we can adapt the proof of Theorem 5 in [13] to obtain the following convergence results.

Theorem 1.

For any 0<C, 0<\epsilon\leq 4R^{2}C and any training sample (\textbf{x}_{1},y_{1}),\ldots,(\textbf{x}_{n},y_{n}), Algorithm 1 converges to the desired precision \epsilon after at most

\left\lceil\log_{2}\left(\frac{\Delta}{4R^{2}C}\right)\right\rceil+\left\lceil\frac{16R^{2}C}{\epsilon}\right\rceil

iterations, where R^{2}=\max_{\textbf{d}^{t},\overline{y}^{\prime}}\|\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})(\textbf{x}_{i}\odot\sqrt{\textbf{d}^{t}})\|^{2}, \Delta=\max_{\overline{y}^{\prime}}\Delta(\overline{y}^{\prime},\overline{y}), and \lceil\cdot\rceil is the integer ceiling function.

Proof.

We adapt the proof of Theorem 5 in [13] and sketch the necessary changes corresponding to Problem (11). For a given set \mathcal{W}_{T}, the dual objective of (8) can be reformulated as

\max_{\alpha\in\mathcal{A}}\min_{\textbf{d}\in\mathcal{W}_{T}}\Theta_{\textbf{d}}(\alpha)=-\frac{1}{2}\sum_{\overline{y}^{\prime}}\sum_{\overline{y}^{\prime\prime}}\alpha_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime\prime}}Q_{\overline{y}^{\prime},\overline{y}^{\prime\prime}}^{\textbf{d}}+\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}b_{\overline{y}^{\prime}}.

Since there are T constrained quadratic problems, we consider each \textbf{d}\in\mathcal{W}_{T} in turn as \max_{\alpha\in\mathcal{A}}\Theta_{\textbf{d}}(\alpha), where Q^{\textbf{d}} is positive semi-definite and the derivative is \partial\Theta_{\textbf{d}}(\alpha)=\textbf{b}-Q^{\textbf{d}}\alpha. Lemma 2 in [13] states that a line search starting at \alpha along an ascent direction \eta with maximum step-size C>0 improves the objective by at least \max_{0\leq\beta\leq C}\big{\{}\Theta_{\textbf{d}}(\alpha+\beta\eta)-\Theta_{\textbf{d}}(\alpha)\big{\}}\geq\frac{1}{2}\min\left\{C,\frac{\partial\Theta_{\textbf{d}}(\alpha)^{T}\eta}{\eta^{T}Q^{\textbf{d}}\eta}\right\}\partial\Theta_{\textbf{d}}(\alpha)^{T}\eta. If we consider the subgradient descent method, the line search is along the subgradient of the objective, \partial\Theta_{\textbf{d}^{*}}(\alpha), where \textbf{d}^{*}=\arg\min_{\textbf{d}\in\mathcal{W}_{T}}\Theta_{\textbf{d}}(\alpha). Therefore, the maximum improvement is

(17) \max_{0\leq\beta\leq C}\{\Theta_{\textbf{d}^{*}}(\alpha+\beta\eta)-\Theta_{\textbf{d}^{*}}(\alpha)\}\geq\frac{1}{2}\min\left\{C,\frac{\partial\Theta_{\textbf{d}^{*}}(\alpha)^{T}\eta}{\eta^{T}Q^{\textbf{d}^{*}}\eta}\right\}\partial\Theta_{\textbf{d}^{*}}(\alpha)^{T}\eta\geq\frac{1}{2}\min_{\textbf{d}\in\mathcal{W}_{T}}\left\{C,\frac{\partial\Theta_{\textbf{d}}(\alpha)^{T}\eta}{\eta^{T}Q^{\textbf{d}}\eta}\right\}\partial\Theta_{\textbf{d}}(\alpha)^{T}\eta.

This reduces to the special case of [13] when T=1. According to Theorem 5 in [13], for a newly added constraint \widehat{y} and some \gamma_{\textbf{d}}>0, we can obtain \partial\Theta_{\textbf{d}}(\alpha)^{T}\eta=\gamma_{\textbf{d}} by setting the ascent direction \eta_{\widehat{y}}=1 for the newly added \widehat{y} and \eta_{\overline{y}}=-\frac{1}{C}\alpha_{\overline{y}} for the others. Here, we set \gamma=\min_{\textbf{d}\in\mathcal{W}_{T}}\gamma_{\textbf{d}} so that it is a lower bound of \partial\Theta_{\textbf{d}}(\alpha)^{T}\eta,\forall\textbf{d}\in\mathcal{W}_{T}. In addition, the upper bound \eta^{T}Q^{\textbf{d}}\eta\leq 4R^{2},\forall\textbf{d}\in\mathcal{W}_{T} follows from \eta^{T}Q^{\textbf{d}}\eta=Q^{\textbf{d}}_{\widehat{y},\widehat{y}}-\frac{2}{C}\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}Q^{\textbf{d}}_{\overline{y}^{\prime},\widehat{y}}+\frac{1}{C^{2}}\sum_{\overline{y}^{\prime}}\sum_{\overline{y}^{\prime\prime}}\alpha_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime\prime}}Q^{\textbf{d}}_{\overline{y}^{\prime},\overline{y}^{\prime\prime}}\leq R^{2}+\frac{2}{C}CR^{2}+\frac{1}{C^{2}}C^{2}R^{2}=4R^{2},\forall\textbf{d}\in\mathcal{W}_{T}. Substituting these bounds back into (17), a similar argument shows that the increase of the objective is at least

min{Cγ2,γ28R2}.\min\left\{\frac{C\gamma}{2},\frac{\gamma^{2}}{8R^{2}}\right\}.

Moreover, the initial optimality gap is at most CΔC\Delta. Following the remaining derivation in [13], the overall bound results are obtained. ∎
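For a rough sense of scale, with hypothetical values R^2=1, C=10, \Delta=100 and \epsilon=0.1 (chosen purely for illustration), the bound of Theorem 1 evaluates to

\left\lceil\log_{2}\left(\frac{100}{4\cdot 1\cdot 10}\right)\right\rceil+\left\lceil\frac{16\cdot 1\cdot 10}{0.1}\right\rceil=\lceil\log_{2}2.5\rceil+1600=2+1600=1602,

so the second term, which scales as 16R^{2}C/\epsilon, dominates the bound.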

Remark 1: Problem (15) is similar to the Support Kernel Machine (SKM) [3], in which multiple Gaussian kernels are built on random subsets of features with varying widths. However, our method automatically chooses the most violated subset of features as a group instead of a random subset of features. Such random features lead to a local optimum, while our method guarantees the \epsilon-optimality stated in Theorem 1. However, due to the extra cost of computing nonlinear kernels, the current model is only implemented for the linear kernel with learned subsets of features.

Remark 2: The original Problem (7) could be formulated as a QCQP problem, but it would involve an exponential number of variables \alpha and a huge number of base kernels in the quadratic term. Unfortunately, standard MKL methods cannot handle Problem (7) even for a small dataset, let alone a standard QCQP solver. Corollary 1, however, makes it practical to solve a sequence of small QCQP problems directly using off-the-shelf QCQP solvers, such as Mosek. Note that state-of-the-art MKL solvers could also be used to solve the small QCQP problems, but they are not preferred because their solutions are less accurate than those of standard QCQP solvers, which can solve Problem (16) more accurately in this case.

5.3. The Proposed Algorithm

Algorithm 1 obtains an \epsilon-optimal solution of the original dual problem (8). By denoting \mathcal{G}_{\textbf{d}}(\alpha)=\frac{1}{2}\|\sum_{k=1}^{K}\alpha_{k}\textbf{p}^{k}\|_{2}^{2}-\sum_{k=1}^{K}\alpha_{k}q^{k}, the group feature generation layer can directly use the \epsilon-optimal solution of the objective \mathcal{G}_{\textbf{d}}(\alpha) to approximate the original objective \mathcal{F}_{\textbf{d}}(\alpha). The two-layer cutting plane algorithm is presented in Algorithm 2.

Algorithm 2 The Two-Layer Cutting Plane Algorithm
1: Input: x¯=(x1,,xn),y¯=(y1,,yn)\overline{\textbf{x}}=(\textbf{x}_{1},\ldots,\textbf{x}_{n}),\overline{y}=(y_{1},\ldots,y_{n}), ϵ\epsilon, CC
2:𝒲=\mathcal{W}=\emptyset, t=0t=0
3:repeat
4:  t=t+1t=t+1
5:  Finding the most violated dt\textbf{d}^{t}
6:  𝒲=𝒲{dt}\mathcal{W}=\mathcal{W}\cup\{\textbf{d}^{t}\}
7:  Call group_feature_selection(x¯\overline{\textbf{x}}, y¯\overline{y}, 𝒲\mathcal{W}, ϵ\epsilon, CC)
8:until ϵ\epsilon-optimal

From the description of Algorithm 2, it is clear that groups are dynamically generated and added to the active set \mathcal{W} for group selection. A structural sketch of this outer loop is given below.
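The following minimal Python sketch is an illustration only, not the authors' implementation: the two helpers passed in as arguments are hypothetical stand-ins for the worst-case analysis (9) of Section 5.4 and for Algorithm 1, and the stopping test shown is a simple objective-gap surrogate rather than the exact epsilon-optimal condition.

def two_layer_cutting_plane(X, y, B, C, eps,
                            find_most_violated_d,      # hypothetical: worst-case analysis (9)
                            group_feature_selection,   # hypothetical: Algorithm 1 over the active set W
                            max_outer=100):
    """Structural sketch of the outer loop of Algorithm 2."""
    W = []                                   # active set of group indicators d^t
    alpha, prev_obj = None, float('inf')
    for t in range(max_outer):
        # The helper is assumed to handle the first call (alpha is None)
        # by returning an arbitrary feasible d.
        d_t = find_most_violated_d(X, y, alpha, B)
        W.append(d_t)
        alpha, obj = group_feature_selection(X, y, W, C, eps)
        if prev_obj - obj <= eps:            # surrogate for the eps-optimal test
            break
        prev_obj = obj
    return alpha, W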

Based on the convergence proof of FGM in [26] and Theorem 1, we obtain the following theorem, which shows that the approximation attains an \epsilon-optimal solution of the original problem.

Theorem 2.

When Algorithm 2 stops after a finite number of steps, the difference between the optimal solution (\textbf{d}^{*},\alpha^{*}) of Problem (10) and the solution (\textbf{d},\alpha) of Algorithm 2 satisfies \mathcal{F}_{\textbf{d}}(\alpha)-\mathcal{F}_{\textbf{d}^{*}}(\alpha^{*})\leq\epsilon.

The detailed proof of Theorem 2 is given in the supplementary material.

5.4. Finding the Most Violated y¯\overline{y}^{\prime} and d

Algorithms 1 and 2 need to find the most violated \overline{y}^{\prime} and \textbf{d}, respectively. In this subsection, we discuss how to obtain these quantities efficiently. Algorithm 1 needs to calculate the subgradient of the empirical risk function R_{emp}^{K}(\overline{\textbf{w}}). Since R_{emp}^{K}(\overline{\textbf{w}}) is a pointwise supremum function, the subgradient should lie in the convex hull of the gradients of the decomposed functions attaining the largest objective. Here, we simply take one of these subgradients by solving

(18) y¯k=argmaxy¯𝒴¯\y¯Δ(y¯,y¯)i=1n(yiyi)vi,\displaystyle\overline{y}^{k}=\arg\max_{\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}}\Delta(\overline{y}^{\prime},\overline{y})-\sum_{i=1}^{n}(y_{i}-y_{i}^{\prime})v_{i},

where v_{i}=\sum_{t=1}^{T}\textbf{w}_{t}^{T}(\textbf{x}_{i}\odot\sqrt{\textbf{d}^{t}}). After obtaining \overline{y}^{k}, it is easy to compute \textbf{p}^{k}_{t}=-\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y_{i}^{k})(\textbf{x}_{i}\odot\sqrt{\textbf{d}^{t}}) and q^{k}=\frac{1}{n}\Delta(\overline{y}^{k},\overline{y}).

Finding the most violated \overline{y}^{\prime} depends on how the loss \Delta(\overline{y},\overline{y}^{\prime}) in Problem (18) is defined. One instance is the Hamming loss, which decomposes and can be computed independently for each example, i.e., \Delta(\overline{y},\overline{y}^{\prime})=\sum_{i=1}^{n}\delta(y_{i},y^{\prime}_{i}), where \delta is an indicator function with \delta(y_{i},y^{\prime}_{i})=0 if y_{i}=y^{\prime}_{i} and 1 otherwise. However, some multivariate performance measures cannot be decomposed in this way. Fortunately, a series of structured loss functions, such as the Area Under the ROC curve (AUC), Average Precision (AP), ranking and contingency table scores, and other measures listed in [11, 34, 27], can be handled efficiently in our algorithms. In this paper, we only use several multivariate performance measures based on the contingency table as showcases; for these, the most violated \overline{y}^{k} can be found in O(n^{2}) time [11].
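As a minimal illustration (not the authors' implementation), the decomposable Hamming-loss case of the inference (18), together with the resulting p^k_t and q^k from above, can be sketched as follows; the argument names and array layout are assumptions.

import numpy as np

def most_violated_hamming(X, y, Ws, Ds):
    """Sketch of (18) for the Hamming loss, plus the cutting-plane pieces.

    X: (n, m) data, y: (n,) labels in {-1, +1},
    Ws: list of T weight vectors w_t, Ds: list of T 0/1 indicator arrays d^t.
    """
    n = X.shape[0]
    # v_i = sum_t w_t^T (x_i * sqrt(d^t))
    v = sum((X * np.sqrt(d)) @ w for w, d in zip(Ws, Ds))
    # Flipping y_i gains loss 1 but pays margin 2 * y_i * v_i.
    flip = (1.0 - 2.0 * y * v) > 0
    y_k = np.where(flip, -y, y)
    q_k = np.mean(y != y_k)                  # q^k = (1/n) * Hamming loss
    p_k = [-(1.0 / n) * ((y - y_k) @ (X * np.sqrt(d))) for d in Ds]
    return y_k, p_k, q_k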

Given the true labels \textbf{y} and predicted labels \textbf{y}^{\prime}, the contingency table is defined as follows:

              y = 1    y = -1
  y' = 1        a         b
  y' = -1       c         d

F_1-score: The F_{\beta}-score is a weighted harmonic mean of Precision and Recall. According to the contingency table, F_{\beta}=\frac{(1+\beta^{2})a}{(1+\beta^{2})a+b+\beta^{2}c}. The most common choice is \beta=1. The corresponding balanced F_1 loss can be written as \Delta_{F_{1}}(a,b,c,d)=100(1-F_{1}). Then, Algorithm 2 in [11] can be directly applied.
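The following is a rough sketch in the spirit of that O(n^2) search (a direct enumeration over contingency tables with prefix sums; handling of ties and the exclusion of the true labeling itself is omitted), not a reproduction of the algorithm in [11].

import numpy as np

def most_violated_f1(y, v):
    """Labeling maximizing Delta_F1(y, y') - sum_i (y_i - y_i') v_i."""
    pos = np.where(y == 1)[0]
    neg = np.where(y == -1)[0]
    pos = pos[np.argsort(-v[pos])]           # positives by score, descending
    neg = neg[np.argsort(-v[neg])]           # negatives by score, descending
    n_pos, n_neg = len(pos), len(neg)
    cum_pos = np.concatenate(([0.0], np.cumsum(v[pos])))
    cum_neg = np.concatenate(([0.0], np.cumsum(v[neg])))

    best, best_ab = -np.inf, (n_pos, 0)
    for a in range(n_pos + 1):               # a = true positives predicted +1
        c = n_pos - a                        # c = false negatives
        for b in range(n_neg + 1):           # b = false positives
            denom = 2 * a + b + c
            f1 = 2.0 * a / denom if denom > 0 else 0.0
            loss = 100.0 * (1.0 - f1)
            # Flipped positives (the c lowest-scored) contribute -2*v_i,
            # predicted-positive negatives (the b highest-scored) +2*v_i.
            margin = -2.0 * (cum_pos[n_pos] - cum_pos[a]) + 2.0 * cum_neg[b]
            if loss + margin > best:
                best, best_ab = loss + margin, (a, b)

    a, b = best_ab
    y_k = -np.ones_like(y)
    y_k[pos[:a]] = 1
    y_k[neg[:b]] = 1
    return y_k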

Precision/Recall@k: In search engine systems, most users scan only the first few links presented. In this situation, Prec@k and Rec@k measure the precision and recall of a classifier that predicts exactly k documents as positive, i.e., Prec@k=\frac{a}{a+b} and Rec@k=\frac{a}{a+c}, subject to a+b=k. The corresponding losses are defined as \Delta_{Prec@k}=100(1-Prec@k) and \Delta_{Rec@k}=100(1-Rec@k). The procedure for finding the most violated \textbf{y} is similar to that for the F-score; the only difference is enforcing the constraint a+b=k and discarding labelings with a+b\not=k.

Precision/Recall Break-Even Point (PRBEP): The Precision/Recall Break-Even Point requires that the precision equals the recall. From the definitions above, PRBEP simply adds the constraint a+b=a+c, i.e., b=c. The corresponding loss is defined as \Delta_{PRBEP}=100(1-PRBEP). Finding the most violated \textbf{y} should enforce the constraint b=c.

After t iterations of Algorithm 2, the exponentially sized \alpha in Problem (9) is replaced by the small vector \alpha^{t} returned by Algorithm 1. Finding the most violated \textbf{d} then becomes

\textbf{d}^{t}=\arg\max_{\textbf{d}\in\mathcal{D}}\mathcal{G}_{\textbf{d}}(\alpha^{t})=\arg\max_{\textbf{d}\in\mathcal{D}}\frac{1}{2}\bigg{\|}\sum_{k=1}^{K}\alpha_{k}^{t}\textbf{p}^{k}\bigg{\|}_{2}^{2}-\sum_{k=1}^{K}\alpha_{k}^{t}q^{k}=\arg\max_{\textbf{d}\in\mathcal{D}}\frac{1}{2}\bigg{\|}\frac{1}{n}\sum_{k=1}^{K}\alpha_{k}^{t}\sum_{i=1}^{n}(y_{i}-y^{k}_{i})(\textbf{x}_{i}\odot\sqrt{\textbf{d}})\bigg{\|}^{2}=\arg\max_{\textbf{d}\in\mathcal{D}}\frac{1}{2n^{2}}\sum_{j=1}^{m}c_{j}^{2}d_{j},

where c_{j}=\sum_{k=1}^{K}\alpha_{k}^{t}\sum_{i=1}^{n}(y_{i}-y^{k}_{i})x_{i,j}. With the budget constraint \sum_{j=1}^{m}d_{j}=B in \mathcal{D}, this problem can be solved by sorting the c_{j}^{2} in descending order and setting the d_{j}^{t} corresponding to the B largest values to 1 and the rest to 0. This takes only O(m\log m) operations.
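A minimal sketch of this worst-case analysis (illustration only; the array layout is an assumption) is:

import numpy as np

def most_violated_d(X, y, Yk, alpha, B):
    """Pick the B features with the largest squared scores c_j^2.

    X: (n, m) data, y: (n,) true labels, Yk: (K, n) violated labelings found
    so far, alpha: (K,) current dual values, B: feature budget.
    """
    # c_j = sum_k alpha_k sum_i (y_i - y_i^k) x_{i,j}
    c = ((alpha[:, None] * (y[None, :] - Yk)) @ X).sum(axis=0)   # shape (m,)
    d = np.zeros(X.shape[1])
    d[np.argsort(-c**2)[:B]] = 1.0
    return d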

6. Relations To Existing Methods

In this section, we discuss the relationships between our proposed method for multivariate losses (5) and the state-of-the-art feature selection methods, including SKM [3], NMMKL [32], l_1-SVM [33], l_0-SVM [30] and FGM [26]. The discussion can be easily adapted to the general framework (4).

6.1. Connections to SKM and l1l_{1} SVM

Let \mathcal{D}_{1}=\{\mathbf{d}|\sum_{j=1}^{m}d_{j}=1,d_{j}\geq 0,\forall j=1,\ldots,m\} be the corresponding real domain. We observe that \mathcal{D}=\mathcal{D}_{1} when B=1. According to [24], Problem (5) in the special case of B=1 can be transformed into the following equivalent optimization problem,

(20) \min_{\textbf{w},\xi\geq 0}\;\frac{1}{2}\Bigg{(}\sum_{j=1}^{m}|w_{j}|\Bigg{)}^{2}+C\xi
s.t. \forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}:{\textbf{w}}^{T}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})\textbf{x}_{i}\geq\Delta(\overline{y},\overline{y}^{\prime})-\xi.

SKM [3] attempts to obtain sparsity of \mathbf{w} by penalizing the square of a weighted block l_1-norm, (\sum_{j=1}^{k}\gamma_{j}\|\textbf{w}_{j}\|_{2})^{2}, where k is the number of groups and \mathbf{w}_{j} is the weight vector for the features in the j-th group. The regularizer (\sum_{j=1}^{m}|w_{j}|)^{2} used in (20) is the square of the l_1 norm, (\|\mathbf{w}\|_{1})^{2}, which is a special case of SKM with k=m and \gamma_{j}=1, i.e., each group contains only one feature. Minimizing the square of the l_1-norm is very similar to the l_1-norm SVM [33], which sets \Omega(\mathbf{w})=\|\mathbf{w}\|_{1} with a non-negative (convex) loss function.

Regardless of whether the l_1-norm or its square is used, the parameter C is too sensitive to tune properly since it controls both the margin loss and the sparsity of \mathbf{w}. Our method alleviates this problem with two parameters, C and B, which control the margin loss and the sparsity of \mathbf{w}, respectively. This separation makes parameter tuning of our method easier than that of SKM and l_1-SVM.

6.2. Connection to NMMKL

Instead of directly solving Problem (20), we formulate the more general problem (5) by introducing an additional budget parameter B, which directly controls the sparsity of \mathbf{w}. The advantage is that parameter tuning becomes easy, since C no longer governs the sparsity of \mathbf{w}. This strategy is also used in NMMKL [32], but associating each feature with one base kernel makes NMMKL intractable for high-dimensional problems. Multivariate losses are even harder for NMMKL to optimize, since the exponential number of constraints induces exponentially many dual variables in its dual form. In contrast, our method can readily optimize multivariate losses on high-dimensional data.

6.3. Connection to FGM

According to the work [40], we can reformulate Problem (20) as an equivalent optimization problem

(21) \min_{\mathbf{d}\in\mathcal{D}_{1}}\min_{\textbf{w},\xi\geq 0}\;\frac{1}{2}\sum_{j=1}^{m}d_{j}|w_{j}|^{2}+C\xi
s.t. \forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}:\frac{1}{n}\sum_{j=1}^{m}d_{j}w_{j}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})\textbf{x}_{j,i}\geq\Delta(\overline{y},\overline{y}^{\prime})-\xi.

After substituting v_{j}=\sqrt{d_{j}}w_{j},\forall j=1,\ldots,m, and considering the general domain \mathcal{D}, we obtain the following problem

(22) \min_{\textbf{d}\in\mathcal{D}}\min_{\textbf{v},\xi\geq 0}\;\frac{1}{2}\|\textbf{v}\|^{2}_{2}+C\xi
s.t. \forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}:\textbf{v}^{T}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})(\textbf{x}_{i}\odot\sqrt{\textbf{d}})\geq\widetilde{\Delta}(\overline{y},\overline{y}^{\prime})-\xi,

where \mathbf{v}=[v_{1},\ldots,v_{m}]^{T}. Deriving the Lagrangian dual of (22), we observe that it is the same as Problem (6). The worst-case analysis in Section 5.4 always finds the most violated \mathbf{d} in the integer domain \{0,1\}^{m}, so the solution of the following problem obtained by the proposed two-layer cutting plane algorithm is the same as the solution of Problem (6):

(23) \min_{\textbf{d}\in\mathcal{D}_{2}}\min_{\textbf{v},\xi\geq 0}\;\frac{1}{2}\|\textbf{v}\|^{2}_{2}+C\xi
s.t. \forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}:\textbf{v}^{T}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})(\textbf{x}_{i}\odot\textbf{d})\geq\widetilde{\Delta}(\overline{y},\overline{y}^{\prime})-\xi,

where the integer domain is \mathcal{D}_{2}=\{\mathbf{d}|\sum_{j=1}^{m}d_{j}\leq B,\mathbf{d}\in\{0,1\}^{m}\}. This formulation can equivalently be derived as the extension of FGM to multivariate performance measures by defining the new hypotheses

(24) h~v(x)=argmaxy¯𝒴¯i=1nyi(vd)Txi,\widetilde{h}_{\textbf{v}}(\textbf{x})=\arg\max_{\overline{y}^{\prime}\in\overline{\mathcal{Y}}}\sum_{i=1}^{n}y^{\prime}_{i}(\textbf{v}\odot\textbf{d})^{T}\textbf{x}_{i},

where \widetilde{h}_{\textbf{v}}:\overline{\mathcal{X}}\rightarrow\overline{\mathcal{Y}} and \mathbf{d}\in\mathcal{D}_{2}. Extending FGM to optimize multivariate losses is non-trivial, because the original FGM [26] cannot directly handle the exponential number of constraints. Moreover, our domain of \mathbf{d} is the real domain \mathcal{D}, which is more general than the integer domain \mathcal{D}_{2} used in FGM and in the proposed extension (23), even though the final solutions of (5) and (23) are the same.

6.4. Connection to l0l_{0} SVM

The following Lemma indicates that the proposed formula can be interpreted by l0l_{0}-norm constraint.

Lemma 2.

(23) is equivalent to the following problem:

(25) \min_{\tilde{\textbf{w}},\xi\geq 0}\;\frac{1}{2}\|\tilde{\textbf{w}}\|^{2}_{2}+C\xi
s.t. \forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y}:\tilde{\textbf{w}}^{T}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y^{\prime}_{i})\textbf{x}_{i}\geq\widetilde{\Delta}(\overline{y},\overline{y}^{\prime})-\xi,
\|\tilde{\textbf{w}}\|_{0}\leq B.
Proof.

Note that, at the optimality of (23), WLOG, if d_{j}=0 then the corresponding v_{j} must be 0. Thus, \|{\textbf{v}}\|_{0}\leq\|\textbf{d}\|_{0}. Let \tilde{\textbf{w}}=\textbf{v}\odot\textbf{d}; we have \|\tilde{\textbf{w}}\|_{0}=\|{\textbf{v}}\odot\textbf{d}\|_{0}\leq\min\{\|{\textbf{v}}\|_{0},\|\textbf{d}\|_{0}\}\leq\|\textbf{d}\|_{0}=\sum_{j=1}^{m}d_{j}\leq B. Moreover, \|\tilde{\textbf{w}}\|_{2}^{2}=\|{\textbf{v}}\odot\textbf{d}\|_{2}^{2}=\|{\textbf{v}}\|_{2}^{2} at the optimality. Therefore, the optimal solution of (23) is a feasible solution of (25). On the other hand, for the optimal \tilde{\textbf{w}} in (25), let {\textbf{v}}=\tilde{\textbf{w}} and d_{j}=\delta(\tilde{w}_{j}), where \delta(t)=1 if t\neq 0 and 0 otherwise. Hence, the optimal solution of (25) is a feasible solution of (23). ∎

This gives another interpretation of the parameter B from the perspective of the l_0-norm. Since the l_0-norm \|\widetilde{\mathbf{w}}\|_{0} is the number of non-zero entries of \widetilde{\mathbf{w}}, B in our method can be considered a parameter that directly controls the sparsity of \mathbf{w}.

7. Multiple Instance Learning for Multivariate Performance Measures

We have illustrated the proposed framework by optimizing multivariate performance measures for feature selection in Section 4. In this section, we extend this approach to multiple instance learning problems, which have been employed in a variety of applications, e.g., drug activity prediction [7], image retrieval [35], natural scene classification [20] and text categorization [1], but are seldom optimized for multivariate performance measures in the literature. However, optimizing task-specific performance measures is crucial; e.g., the F-score is widely considered the most important evaluation criterion for a learning method in image retrieval.

Multiple instance learning was formally introduced in the context of drug activity prediction [7]. In this learning scenario, a bag is represented by a set of instances, where each instance is represented by a feature vector. The classification label is assigned only to each bag instead of to the instances in the bag. A bag is called a positive bag if it contains at least one positive instance; otherwise it is called a negative bag. The learning problem is to decide whether a given unlabeled bag is positive or not. By defining a similarity measure between a bag and an instance, Multiple-Instance Learning via Embedded instance Selection (MILES) [6] successfully transforms multiple instance learning into a feature selection problem by embedding bags into an instance-based feature space and selecting the most important features.

Before discussing the transformation used in MILES, we first introduce the notation for the multiple instance learning problem. Following [6], we denote the i-th positive bag as \mathbf{B}^{+}_{i}=\{\mathbf{x}_{i,j}^{+}\}_{j=1}^{n_{i}^{+}}, which consists of n_{i}^{+} instances \mathbf{x}_{i,j}^{+},j=1,\ldots,n_{i}^{+}. Similarly, the i-th negative bag is denoted as \mathbf{B}_{i}^{-}=\{\mathbf{x}_{i,j}^{-}\}_{j=1}^{n_{i}^{-}}. All instances belong to the same feature space \mathcal{X}. The numbers of positive and negative bags are \ell^{+} and \ell^{-}, respectively. The instances in all bags are rearranged as \{\mathbf{x}^{1},\ldots,\mathbf{x}^{n}\}, where n=\sum_{i=1}^{\ell^{+}}n_{i}^{+}+\sum_{i=1}^{\ell^{-}}n_{i}^{-}.

By considering each instance in the training bags as a candidate for target concepts, the embedded feature space is represented as

(26) 𝐱^i=[s(𝐱1,𝐁i),,s(𝐱n,𝐁i)]TRn,\widehat{\mathbf{x}}_{i}=[s(\mathbf{x}^{1},\mathbf{B}_{i}),\ldots,s(\mathbf{x}^{n},\mathbf{B}_{i})]^{T}\in R^{n},

where the similarity measure between the bag 𝐁i\mathbf{B}_{i} and the instance 𝐱k\mathbf{x}^{k} is defined as the most-likely-cause estimator

(27) s(𝐱k,𝐁i)=maxjexp(𝐱i,j𝐱k22σ2).s(\mathbf{x}^{k},\mathbf{B}_{i})=\max_{j}\exp\left(-\frac{||\mathbf{x}_{i,j}-\mathbf{x}^{k}||^{2}}{2\sigma^{2}}\right).

It follows the intuition that the similarity between a concept and a bag is determined by the concept and the closest instance in that bag. The corresponding labels are constructed as follows: y^i=1\widehat{y}_{i}=1 if 𝐁i\mathbf{B}_{i} is a positive bag, and y^i=1\widehat{y}_{i}=-1 otherwise. For the given +\ell^{+} positive bags and \ell^{-} negative bags, we thus form a new classification representation of the multiple instance learning problem as {𝐱^i,y^i}i=1++\{\widehat{\mathbf{x}}_{i},\widehat{y}_{i}\}_{i=1}^{\ell^{+}+\ell^{-}}. For each instance 𝐱k\mathbf{x}^{k}, the values of the kkth feature variable s(𝐱k,)s(\mathbf{x}^{k},\cdot) over all bags are

[s(𝐱k,𝐁1+),,s(𝐱k,𝐁++),s(𝐱k,𝐁1),,s(𝐱k,𝐁)][s(\mathbf{x}^{k},\mathbf{B}_{1}^{+}),\ldots,s(\mathbf{x}^{k},\mathbf{B}_{\ell^{+}}^{+}),s(\mathbf{x}^{k},\mathbf{B}_{1}^{-}),\ldots,s(\mathbf{x}^{k},\mathbf{B}_{\ell^{-}}^{-})]

where the feature induced by 𝐱k\mathbf{x}^{k} provides useful information for separating the positive and negative bags. The linear discriminant function is

(28) y^=sign(𝐰,x^+b)\widehat{y}=\textrm{sign}(\langle\mathbf{w},\widehat{\textbf{x}}\rangle+b)

where 𝐰\mathbf{w} and bb are the model parameters. The embedding induces a possibly high-dimensional space when the number of instances in the training set is large. Since some instances may not be responsible for the labels of the bags or may be similar to each other, many features are redundant or irrelevant, so MILES employs L1L_{1}-SVM to select the subset of mapped features that is most relevant to the classification problem. However, L1L_{1}-SVM cannot achieve high performance on task-specific measures because it only optimizes the zero-one loss. Our proposed Algorithm 2 is a natural alternative feature selection method for multivariate performance measures. The proposed algorithm for multiple instance learning that optimizes multivariate measures is shown in Algorithm 3 below.
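To make the embedding concrete, the following NumPy sketch builds the instance-based feature space of (26)-(27); the helper names (similarity, miles_embedding) and the way bags are stored are illustrative assumptions, not the implementation of [6]. Stacking the embeddings of the positive and negative bags row-wise, together with labels +1/-1, gives exactly the training pairs constructed in step 2 of Algorithm 3 below.

    import numpy as np

    def similarity(concepts, bag, sigma):
        # s(x^k, B_i) of (27): most-likely-cause similarity of every concept x^k to one bag
        d2 = ((concepts[:, None, :] - bag[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2)).max(axis=1)

    def miles_embedding(bags, sigma):
        # Embed every bag into the instance-based feature space of (26).
        # bags: list of (n_i, d) arrays; returns an array of shape (#bags, n),
        # where n is the total number of instances acting as candidate concepts.
        concepts = np.vstack(bags)
        return np.vstack([similarity(concepts, bag, sigma) for bag in bags])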

Algorithm 3 Learning a bag classifier
1: Input: positive bags {𝐁i+}i=1+\{\mathbf{B}_{i}^{+}\}_{i=1}^{\ell^{+}}, negative bags {𝐁i}i=1\{\mathbf{B}_{i}^{-}\}_{i=1}^{\ell^{-}}, CC, and ϵ\epsilon
2: Construct the embedding representation of training data
{(𝐱^i,y^i)},i=1,,++\{(\widehat{\mathbf{x}}_{i},\widehat{y}_{i})\},\forall i=1,\ldots,\ell^{+}+\ell^{-}
3: 𝐱¯=[𝐱^1,,𝐱^++]\overline{\mathbf{x}}=[\widehat{\mathbf{x}}_{1},\ldots,\widehat{\mathbf{x}}_{\ell^{+}+\ell^{-}}] and 𝐲¯=[y^1,,y^++]\overline{\mathbf{y}}=[\widehat{y}_{1},\ldots,\widehat{y}_{\ell^{+}+\ell^{-}}]
4: Call Algorithm 2 with arguments (𝐱¯\overline{\mathbf{x}},𝐲¯\overline{\mathbf{y}},CC,ϵ\epsilon)
5: Output: parameters 𝐰\mathbf{w}

According to Algorithm 3, we do not need the model parameter bb, since the structural SVM prediction is invariant to the offset bb, i.e., y^=argmaxy{1,+1}y𝐰^,𝐱^\widehat{y}=\arg\max_{y\in\{-1,+1\}}y\langle\widehat{\mathbf{w}},\widehat{\mathbf{x}}\rangle.
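Once Algorithm 3 returns 𝐰\mathbf{w}, a previously unseen bag can be classified by embedding it against the same training concepts and taking the sign of the inner product, with no offset bb. The short sketch below reuses the hypothetical similarity helper from the previous sketch and is shown only for illustration.

    def classify_bag(bag, concepts, w, sigma):
        # y_hat = sign(<w, x_hat>) with no offset b, as discussed above
        x_hat = similarity(concepts, bag, sigma)   # embed the new bag via (26)-(27)
        return 1 if float(w @ x_hat) >= 0 else -1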

8. Experiments

Table 1. Datasets used in our experiments
Dataset       | #classes | #features | #train points | #test points
News20.binary | 2        | 1,355,191 | 11,997        | 7,999
URL1          | 2        | 3,231,961 | 20,000        | 20,000
Image         | 5        | 10,800    | 1,200         | 800
Sector        | 105      | 55,197    | 6,412         | 3,207
News20        | 20       | 62,061    | 15,935        | 3,993

In this section, we conduct extensive experiments to evaluate the performance of our proposed method against state-of-the-art feature selection methods: 1) SVM-RFE [9]; 2) l1l_{1}-SVM; 3) FGM [26]; 4) l1l_{1}-bmrm-F1111http://users.cecs.anu.edu.au/˜chteo/BMRM.html, which is an l1l_{1}-regularized SVM that optimizes the F1 score via the bundle method [27]. SVM-RFE and FGM use the Liblinear software 222http://www.csie.ntu.edu.tw/˜cjlin/liblinear/ as the QP solver for their SVM subproblems. For l1l_{1}-SVM, we also use Liblinear, which implements the state-of-the-art l1l_{1}-SVM algorithm [33]. In addition to the comparison for the 0-1 loss, we also perform experiments on image data for the F1 measure. Furthermore, several specific measures on the contingency table are investigated on text datasets by comparing with SVMperf\textrm{SVM}^{perf} [11]. All the datasets shown in Table 1 are of high dimensions.

For convenience, we name our proposed two-layer cutting plane algorithm FSmultiΔ\textrm{FS}_{multi}^{\Delta}, where Δ\Delta denotes the type of multivariate performance measure. We implemented Algorithm 2 in MATLAB for all the multivariate performance measures listed above, using Mosek as the QCQP solver for Problem (16), which has a worst-case complexity of O(KT2)O(KT^{2}). Removing inactive constraints from the working set [13] in the inner layer is employed to speed up the QCQP. Since the values of both KK and TT are much smaller than the number of examples nn and the dimensionality mm, the QCQP is very efficient as well as accurate for large-scale and high-dimensional datasets. Furthermore, the code solves the primal and its dual form simultaneously, so the optimal μ\mu and α\alpha can be obtained after solving Problem (16).

For a test pattern x, the discriminant function is f(x)=wd~,xf(\textbf{x})=\langle\textbf{w}\odot\widetilde{\textbf{d}},\textbf{x}\rangle, where w=i=1nβixi\textbf{w}=\sum_{i=1}^{n}\beta_{i}\textbf{x}_{i}, βi=1nk=1Kαk(yiyik)\beta_{i}=\frac{1}{n}\sum_{k=1}^{K}\alpha_{k}(y_{i}-y_{i}^{k}), and d~=t=1Tμtdt\widetilde{\textbf{d}}=\sum_{t=1}^{T}\mu_{t}\sqrt{\textbf{d}^{t}}. This leads to fast prediction since only the few selected features are involved. After computing pk\textbf{p}^{k}, the matrices of Problem (16) can be updated incrementally, so the update can be done in O(TK2)O(TK^{2}) in total.
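As a minimal sketch of how this prediction can be assembled from the quantities returned by the solver, the following NumPy snippet computes w, the mixed indicator vector, and f(x); the variable names (alpha, mu, D, Y_k) are assumptions made for illustration, not the names used in our MATLAB code.

    import numpy as np

    def discriminant(X_train, y_train, Y_k, alpha, mu, D, x_test):
        # X_train : (n, m) training patterns;  y_train : (n,) true labels in {-1, +1}
        # Y_k     : (K, n) labelings y^k from the K outer cutting planes
        # alpha   : (K,) dual variables;  mu : (T,) mixing weights;  D : (T, m) 0/1 vectors d^t
        n = X_train.shape[0]
        beta = (alpha[:, None] * (y_train[None, :] - Y_k)).sum(axis=0) / n   # beta_i
        w = X_train.T @ beta                                                 # w = sum_i beta_i x_i
        d_tilde = (mu[:, None] * np.sqrt(D)).sum(axis=0)                     # sum_t mu_t sqrt(d^t)
        return float((w * d_tilde) @ x_test)                                 # f(x) = <w ⊙ d~, x>

Only the nonzero entries of the mixed indicator vector contribute, so in practice the final dot product can be restricted to the selected features.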

Figure 1. First row (a-b): Testing accuracy and F1F_{1} scores as well as the number of selected features of the proposed method FSmultiΔ\textrm{FS}_{multi}^{\Delta} on the News20.binary dataset by varying CC and BB. Second row (c-d): The corresponding relationship among parameters BB, CC, and the number of selected features.

8.1. Parameter Sensitivity Analysis

Before comparing FSmultiΔ\textrm{FS}^{\Delta}_{multi} with other methods, we first conduct an empirical parameter sensitivity study on News20.binary. The goal is to examine the relationships among the parameters CC and BB, the performance measures, and the number of selected features, with CC ranging over [0.1,1,10,100]×n[0.1,1,10,100]\times n and BB over [2,5,10,50,100,150,200,250][2,5,10,50,100,150,200,250].

Figure 1(a-b) shows the testing accuracy and F1 scores as well as the number of selected features when varying CC and BB. We observe that the results are very sensitive to CC when BB is very small. This indicates that the l1l_{1} model, which is equivalent to the proposed method in the case of B=1B=1, is vulnerable to the choice of CC. On the other hand, the results are rather insensitive to CC when BB is large. Hence, the proposed method is less sensitive to CC than the l1l_{1} model. We also observe that the proposed method prefers a large CC value for better performance. Figure 1(c-d) shows the corresponding relationships among the parameters BB, CC and the number of selected features of Figure 1(a-b). We observe that BB and the number of selected features always exhibit a linear trend with a constant slope. Moreover, the slope remains the same when C10C\geq 10, but a small CC increases the slope. This means that, compared with BB, parameter CC has less influence on the sparsity of 𝐰\mathbf{w}, and the learned feature selection model becomes stable when C10C\geq 10. These empirical results are consistent with the discussion of parameter BB in Section 3.

Since a large CC needs more iterations to converge according to Theorem 1, a reasonable compromise is to set CC not too large and let BB dominate the selection of features. Based on these observations, we can safely fix CC and study the results by varying BB when comparing with other methods in the following experiments.
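A minimal sketch of this sensitivity sweep is given below, assuming a hypothetical train_fs_multi(X, y, C, B) wrapper around Algorithm 2 and a generic score_fn; only the grids themselves come from the study above.

    import numpy as np

    # Grids used in the sensitivity study on News20.binary (Section 8.1).
    C_grid = [0.1, 1, 10, 100]                      # later scaled by the number of examples n
    B_grid = [2, 5, 10, 50, 100, 150, 200, 250]

    def sensitivity_sweep(X_train, y_train, X_test, y_test, train_fs_multi, score_fn):
        # Record test score and sparsity for every (C, B) pair.
        n = X_train.shape[0]
        results = {}
        for C in C_grid:
            for B in B_grid:
                w = train_fs_multi(X_train, y_train, C * n, B)
                y_pred = np.sign(X_test @ w)
                results[(C, B)] = (score_fn(y_test, y_pred), int(np.count_nonzero(w)))
        return results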

8.2. Time Complexity Analysis

Figure 2. Training time on different datasets: (a) News20.binary; (b) Image (Desert).

We empirically study the time complexity of FSmultiF1\textrm{FS}_{multi}^{F_{1}} by comparing it with other methods. Two datasets, News20.binary and Image (Desert), are used for illustration, with the detailed settings given in Section 8.3 and Section 8.4, respectively. Figure 2 reports the training time of the five different methods. On the News20.binary dataset, we cannot report the training time for l1l_{1}-bmrm-F1 since it did not terminate after more than two days with the maximum number of iterations set to 10001000 and the parameter λ[107,102]\lambda\in[10^{-7},10^{2}], due to the extremely high dimensionality. We observe that the proposed methods are slower than l1l_{1}-SVM, but much faster than SVM-RFE and l1l_{1}-bmrm-F1. In addition, on the Image dataset, when the termination condition requires the relative difference between the objective and its convex linear lower bound to fall below 0.10.1, l1l_{1}-bmrm-F1 also fails to converge within the maximum number of iterations. This is consistent with the discussion in Appendix C of [27] that the bundle method with an l1l_{1} regularizer cannot guarantee convergence, and it leads to a similar number of selected features (e.g., 9898 in Figure 2(b)) even though λ\lambda decreases gradually.

These observations imply that our proposed two-layer cutting plane method needs less training time than the bundle method while having guaranteed convergence. Moreover, our method can work on large-scale and high-dimensional data when optimizing a user-specified measure, whereas the bundle method cannot. As mentioned above, l1l_{1}-bmrm-F1 is much slower on the high-dimensional datasets in our experiments, so we only report its results in Section 8.4.

8.3. Feature Selection for Accuracy

As shown in [11], SVMmultiΔ\textrm{SVM}_{multi}^{\Delta} with the Hamming loss, namely ΔErr(y¯,y¯)=2(b+c)\Delta_{Err}(\overline{y},\overline{y}^{\prime})=2(b+c), is the same as SVM. In this subsection, we therefore evaluate the accuracy of FSmultiΔ\textrm{FS}_{multi}^{\Delta} with the Hamming loss, namely FSmultihamming\textrm{FS}_{multi}^{hamming}, as well as other state-of-the-art feature selection methods. We compare these methods on the two binary datasets in Table 1, News20.binary 333http://www.csie.ntu.edu.tw/˜cjlin/libsvmtools/datasets and URL1. Both datasets are used in [26], and they are already split into training and testing sets.
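For concreteness, the following sketch evaluates losses defined on the contingency table of a predicted labeling, where a, b, c, d denote true positives, false positives, false negatives and true negatives; it is only an illustration of the multivariate losses discussed in this and the following subsections, not the loss implementation inside Algorithm 2.

    import numpy as np

    def contingency(y_true, y_pred):
        # Contingency table of a labeling: (a, b, c, d) = (TP, FP, FN, TN), labels in {-1, +1}.
        a = int(np.sum((y_true == 1) & (y_pred == 1)))
        b = int(np.sum((y_true == -1) & (y_pred == 1)))
        c = int(np.sum((y_true == 1) & (y_pred == -1)))
        d = int(np.sum((y_true == -1) & (y_pred == -1)))
        return a, b, c, d

    def hamming_loss(y_true, y_pred):
        # Delta_Err = 2(b + c): twice the number of misclassified examples.
        _, b, c, _ = contingency(y_true, y_pred)
        return 2 * (b + c)

    def f1_loss(y_true, y_pred):
        # 1 - F1 (often reported scaled by 100 as a percentage).
        a, b, c, _ = contingency(y_true, y_pred)
        f1 = 2 * a / (2 * a + b + c) if (2 * a + b + c) > 0 else 0.0
        return 1.0 - f1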

We test FGM and SVM-RFE over the grid CFGM=[0.001,0.01,0.1,1,5,10]C_{FGM}=[0.001,0.01,0.1,1,5,10] and choose CFGM=5C_{FGM}=5, which gives good performance for both FGM and SVM-RFE; this is the same setting as in [26]. For FSmultihamming\textrm{FS}_{multi}^{hamming}, we fix CFGMmultiC_{FGM_{multi}} to 0.1×n0.1\times n for URL1 and 1.0×n1.0\times n for News20.binary. The budget parameter is set to B=[2,5,10,50,100,150,200,250]B=[2,5,10,50,100,150,200,250] for News20.binary and B=[2,5,10,20,30,40,50,60]B=[2,5,10,20,30,40,50,60] for URL1. The feature elimination scheme for SVM-RFE follows [26]. For l1l_{1}-SVM, we report the results of different CC values so as to obtain different numbers of selected features.

Figure 3. Testing accuracy on different datasets: (a) News20.binary; (b) URL1.
Figure 4. The sparsity of features of FSmultihamming\textrm{FS}_{multi}^{hamming} with varying BB on different datasets: (a) News20.binary; (b) URL1. Each row bar with a different color represents the subset of features selected under the current BB, where the white region means the features are not selected.

Figure 3 reports the testing accuracy on the two datasets. The testing accuracy is comparable among the different methods, but both FSmultihamming\textrm{FS}_{multi}^{hamming} and FGM obtain better prediction performance than SVM-RFE with a small number (fewer than 20) of selected features on both News20.binary and URL1. These results show that the proposed method with the Hamming loss works well on feature selection tasks, especially when choosing only a few features. FSmultihamming\textrm{FS}_{multi}^{hamming} also performs better than l1l_{1}-SVM on News20.binary over most of the range of selected features. This is possibly because l1l_{1} models are more sensitive to noisy or redundant features on the News20.binary dataset.

Figure 4 shows that our method selects fewer features with a small BB than with a large BB. We also observe that most of the features selected with a small BB also appear in the subset of features selected with a large BB; this phenomenon is particularly evident on News20.binary. This suggests that FSmultihamming\textrm{FS}_{multi}^{hamming} can select the important features in the given datasets and is insensitive to the parameter BB. However, not all features selected with a smaller BB fall into the subset selected with a larger BB, so our method performs non-monotonic feature selection. This argument is consistent with the test accuracy in Figure 3: News20.binary appears to be a monotonic dataset in Figure 4, since FSmultihamming\textrm{FS}_{multi}^{hamming}, FGM and SVM-RFE demonstrate similar performance, whereas URL1 is more likely non-monotonic, as our method and FGM outperform SVM-RFE. These facts imply that the proposed method is comparable with FGM and SVM-RFE, and that it also exhibits the non-monotonic property for feature selection.

8.4. Feature Selection for Image Retrieval

Figure 5. Testing F1F_{1} scores on the Image dataset: (a) Desert; (b) Mountains; (c) Sea; (d) Sunset; (e) Trees.
Table 2. The macro-average testing performance comparison among different methods. The quantities in parentheses represent the won/lost counts of the current method compared with FSmultiΔ\textrm{FS}_{multi}^{\Delta}. The last column indicates the average number of features actually used by the current method for each specific measure. The symbol '*' indicates significance at the 0.950.95 level according to a t-test applied to pairs of results over classes.
Dataset | Method | F1F_{1} | Rec@2pRec@2p | PRBEPPRBEP | #selected features
Sector | FSmultiΔ\textrm{FS}_{multi}^{\Delta} | 92.07 | 95.77 | 93.25 | 787.6/658.9/508.3
Sector | FSmultihamming\textrm{FS}_{multi}^{hamming} | 84.99 (12/91)(12/91)^{*} | 90.01 (0/71)(0/71)^{*} | 85.54 (0/86)(0/86)^{*} | 689.2
Sector | SVMmultiΔ\textrm{SVM}_{multi}^{\Delta} | 33.35 (1/104)(1/104)^{*} | 95.52 (11/19)(11/19) | 91.24 (11/47)(11/47)^{*} | 55,197
News20 | FSmultiΔ\textrm{FS}_{multi}^{\Delta} | 77.56 | 91.21 | 81.46 | 1,301 / 1,186 / 931
News20 | FSmultihamming\textrm{FS}_{multi}^{hamming} | 49.61 (0/20)(0/20)^{*} | 66.32 (0/20)(0/20)^{*} | 52.14 (0/20)(0/20)^{*} | 485.1
News20 | SVMmultiΔ\textrm{SVM}_{multi}^{\Delta} | 55.53 (0/20)(0/20)^{*} | 93.08 (16/2) | 80.83 (6/11) | 62,061

In this subsection, we demonstrate that specific multivariate performance measures are important for selecting features in real applications. In particular, we evaluate the F1F_{1} measure (a commonly used performance measure) for the task of image retrieval. Due to the success of transforming multiple instance learning into a feature selection problem by embedded instance selection, we use the strategy in Algorithm 4.1 of [6] to construct a dense and high-dimensional dataset from preprocessed image data 444http://lamda.nju.edu.cn/data_MIMLimage.ashx, which was used in [38] for multi-instance learning. It contains five categories and 2,0002,000 images, where each image is represented as a bag of nine 15-dimensional instances generated by the SBN method [20]. Following [6], the natural scene image retrieval problem then becomes a feature selection task that selects the relevant embedded instances for prediction. The Image dataset is split randomly into 60% for training and 40% for testing (Table 1). Since the F1F_{1}-score is used as the performance metric, we run FSmultiΔ\textrm{FS}_{multi}^{\Delta} for the F1F_{1}-score, namely FSmultiF1\textrm{FS}_{multi}^{F_{1}}, as well as other state-of-the-art feature selection methods. As mentioned above, FGM and FSmultihamming\textrm{FS}_{multi}^{hamming} have similar performance, so we do not report the results of FGM here. FSmultihamming\textrm{FS}_{multi}^{hamming} and FSmultiΔ\textrm{FS}_{multi}^{\Delta} use a fixed C=10×nC=10\times n. For the other methods, we use the previous settings. The testing F1F_{1} values of all methods on each category are reported in Figure 5.

From Figure 5, we observe that FSmultiF1\textrm{FS}_{multi}^{F_{1}} and FSmultihamming\textrm{FS}_{multi}^{hamming} achieve significantly better performance than l1l_{1}-SVM in terms of F1F_{1}-score, especially when choosing fewer than 100100 features. Moreover, SVM-RFE also outperforms l1l_{1}-SVM on three of the five categories. This verifies that the 1\ell_{1} penalty does not perform as well as 0\ell_{0}-style methods like FSmultiF1\textrm{FS}_{multi}^{F_{1}} and FSmultihamming\textrm{FS}_{multi}^{hamming} on dense and high-dimensional datasets, possibly because the 1\ell_{1}-norm penalty is very sensitive to dense and noisy features. We also observe that FSmultiF1\textrm{FS}_{multi}^{F_{1}} performs better than FSmultihamming\textrm{FS}_{multi}^{hamming} and SVM-RFE on four of the five categories. l1l_{1}-bmrm-F1 performs competitively, but it is unstable and time-consuming, as shown in Section 8.2. All these facts imply that directly optimizing the F1F_{1} measure is useful for boosting the F1F_{1} performance, and our proposed FSmultiF1\textrm{FS}_{multi}^{F_{1}} is efficient and effective.

8.5. Multivariate Performance Measures for Document Retrieval

In this subsection, we focus on feature selection for different multivariate performance measures on the imbalanced text data shown in Table 1. For multiclass classification problems, the one-vs.-rest strategy is used. We compare against SVMperf\textrm{SVM}^{perf} 555www.cs.cornell.edu/People/tj/svm_light/svm_perf.html. Following [11], we use the same notation SVMmultiΔ\textrm{SVM}_{multi}^{\Delta} for different multivariate performance measures. The command used for training SVMperf\textrm{SVM}^{perf} can handle different measures via the -l option 666 svm_perf_learn -c CperfC_{perf} -w 3 –b 0 train_file train_model. In our experiments, we search for CperfC_{perf} in the same range [26,,26][2^{-6},\ldots,2^{6}] as in [11] and, for each multivariate performance measure, choose the value that gives the best performance of SVMmultiΔ\textrm{SVM}_{multi}^{\Delta} for comparison. FSmultiΔ\textrm{FS}_{multi}^{\Delta} and FSmultihamming\textrm{FS}_{multi}^{hamming} fix CFGMmulti=0.1×nC_{FGM_{multi}}=0.1\times n for News20 and 5.0×n5.0\times n for Sector. For Rec@kRec@k, we set kk to twice the number of positive examples, namely Rec@2p. To evaluate this measure, the test examples with the top 2p scores are labeled as positive, and Rec@2p is then calculated.
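The Rec@2p evaluation described above can be sketched as follows; the thresholding rule (take the 2p highest-scoring test examples as positive, where p is the number of true positives in the test set) is our reading of the protocol, stated here only for illustration.

    import numpy as np

    def rec_at_2p(y_true, scores):
        # Recall when the 2p top-scoring test examples are predicted positive.
        p = int(np.sum(y_true == 1))
        k = min(2 * p, len(scores))
        top_k = np.argsort(-scores)[:k]            # indices of the k highest scores
        y_pred = -np.ones_like(y_true)
        y_pred[top_k] = 1
        true_pos = np.sum((y_true == 1) & (y_pred == 1))
        return true_pos / p if p > 0 else 0.0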

Table 2 shows the macro-average performance over all classes in each collection, where both FSmultiΔ\textrm{FS}_{multi}^{\Delta} and FSmultihamming\textrm{FS}_{multi}^{hamming} are reported at B=250B=250. The improvement of FSmultiΔ\textrm{FS}_{multi}^{\Delta} over FSmultihamming\textrm{FS}_{multi}^{hamming} and SVMmultiΔ\textrm{SVM}_{multi}^{\Delta} for different BB values is reported in Figure 6. From Table 2, FSmultiΔ\textrm{FS}_{multi}^{\Delta} is consistently better than FSmultihamming\textrm{FS}_{multi}^{hamming} on all multivariate performance measures and both multiclass datasets. Similar results hold in comparison with SVMmultiΔ\textrm{SVM}_{multi}^{\Delta}; the only exception is Rec@2p on News20, where SVMmultiΔ\textrm{SVM}_{multi}^{\Delta} is slightly better than FSmultiΔ\textrm{FS}_{multi}^{\Delta}. The largest gains are observed for the F1F_{1} score on both text classification tasks. This implies that the small number of features selected by FSmultiΔ\textrm{FS}_{multi}^{\Delta} is enough to obtain comparable or even better performance on different measures than SVMmultiΔ\textrm{SVM}_{multi}^{\Delta} using all features.

Figure 6. The average performance improvement of FSmultiΔ\textrm{FS}_{multi}^{\Delta} with varying BB on different datasets: (a) Sector; (b) News20.
Figure 7. Testing F1F_{1} in terms of the average number of selected features on (a) Sector and (b) News20.

From Figure 6, FSmultiΔ\textrm{FS}_{multi}^{\Delta} consistently performs better than FSmultihamming\textrm{FS}_{multi}^{hamming} for all of the multivariate performance measures, as shown in the figures on the left-hand side. Moreover, the figures on the right-hand side show that a small number of features is good for the F1F_{1} measure but poor for the other measures. As the number of features increases, Rec@2pRec@2p and PRBEP approach the results of SVMmultiΔ\textrm{SVM}_{multi}^{\Delta} and all curves become flat. The performance of PRBEPPRBEP and Rec@2pRec@2p is relatively stable once sufficient features are selected, yet our method can choose very few features for fast prediction. For the F1F_{1} measure, our method is consistently better than SVMmultiΔ\textrm{SVM}_{multi}^{\Delta}, and the results show significant improvement over the entire range of BB. This improvement may be due to the reduction of noisy or non-informative features. Furthermore, FSmultiΔ\textrm{FS}_{multi}^{\Delta} achieves better performance measures than FSmultihamming\textrm{FS}_{multi}^{hamming}.

We also compare different feature selection algorithms such as SVM-RFE and l1l_{1}-SVM on Sector and News20 under the same settings as in the previous sections. The results in terms of the F1 measure are reported in Figure 7. We clearly observe that FSmultiΔ\textrm{FS}_{multi}^{\Delta} outperforms l1l_{1}-SVM on both datasets, and is comparable to or even better than SVM-RFE. Even with a small number of features, FSmultiΔ\textrm{FS}_{multi}^{\Delta} still achieves a very good F1 measure.

9. Conclusion

In this paper, we propose a generalized sparse regularizer for feature selection and a unified feature selection framework for general loss functions, with a particular focus on multivariate losses. To solve the resultant optimization problem, a two-layer cutting plane algorithm is proposed, and its convergence property is studied. Moreover, connections to a variety of state-of-the-art feature selection methods are discussed in detail. The analyses comparing with various feature selection methods show that the proposed method is superior to others. Experimental results show that the proposed method is comparable with FGM and SVM-RFE and better than l1l_{1} models on feature selection tasks, and outperforms SVM for multivariate performance measures trained on the full set of features.

Acknowledgements

This work was supported by Singapore A*STAR under Grant SERC 112 280 4005.

Appendix A Proof of Proposition 1

Since the loss term Δ(y¯,y¯)=0\Delta(\overline{y}^{\prime},\overline{y}^{\prime})=0 for all y¯𝒴\overline{y}^{\prime}\in\mathcal{Y}, we can equivalently transform Problem

minw1,,wT,ξ0\displaystyle\min_{\textbf{w}_{1},\ldots,\textbf{w}_{T},\xi\geq 0}\!\!\!\!\! 12(t=1Twt2)2+Cξ\displaystyle\!\!\!\!\!\frac{1}{2}\left(\sum_{t=1}^{T}\|\textbf{w}_{t}\|_{2}\right)^{2}+C\xi
s.t. ξby¯t=1Twt,ay¯t,y¯𝒴¯\y¯,\displaystyle\!\!\!\!\!\xi\geq b_{\overline{y}^{\prime}}-\sum_{t=1}^{T}\langle\textbf{w}_{t},\textbf{a}_{\overline{y}^{\prime}}^{t}\rangle,\forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}\backslash\overline{y},

into the following optimization problem

minw1,,wT,ξ0\displaystyle\min_{\textbf{w}_{1},\ldots,\textbf{w}_{T},\xi\geq 0}\!\!\!\!\! 12(t=1Twt2)2+Cξ\displaystyle\!\!\!\!\!\frac{1}{2}\left(\sum_{t=1}^{T}\|\textbf{w}_{t}\|_{2}\right)^{2}+C\xi
s.t. ξby¯t=1Twt,ay¯t,y¯𝒴¯.\displaystyle\!\!\!\!\!\xi\geq b_{\overline{y}^{\prime}}-\sum_{t=1}^{T}\langle\textbf{w}_{t},\textbf{a}_{\overline{y}^{\prime}}^{t}\rangle,\forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}.

By introducing a new variable uu\in\mathbb{R} and moving the summation operator from the objective into a constraint, we can obtain the equivalent optimization problem as

minw,ξ0\displaystyle\min_{\textbf{w},\xi\geq 0} 12u2+Cξ\displaystyle\frac{1}{2}u^{2}+C\xi
s.t. ξby¯t=1Twt,ay¯t,y¯𝒴¯\displaystyle\xi\geq b_{\overline{y}^{\prime}}-\sum_{t=1}^{T}\langle\textbf{w}_{t},\textbf{a}_{\overline{y}^{\prime}}^{t}\rangle,\forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}
t=1Twtu.\displaystyle\sum_{t=1}^{T}\|\textbf{w}_{t}\|\leq u.

We can further simplify the above problem by introducing additional variables ρT\rho\in\mathbb{R}^{T} such that wtρt\|\textbf{w}_{t}\|\leq\rho_{t}, t=1,,T\forall t=1,\ldots,T, to obtain

minw,u,ρ,ξ0\displaystyle\min_{\textbf{w},u,\rho,\xi\geq 0} 12u2+Cξ\displaystyle\frac{1}{2}u^{2}+C\xi
s.t. ξby¯t=1Twt,ay¯t,y¯𝒴¯\displaystyle\xi\geq b_{\overline{y}^{\prime}}-\sum_{t=1}^{T}\langle\textbf{w}_{t},\textbf{a}_{\overline{y}^{\prime}}^{t}\rangle,\forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}
t=1Tρtu\displaystyle\sum_{t=1}^{T}\rho_{t}\leq u
wtρt,t=1,,T.\displaystyle||\textbf{w}_{t}||\leq\rho_{t},\forall t=1,\ldots,T.

We know that for each tt, wtρt\|\textbf{w}_{t}\|\leq\rho_{t} is a second-order cone constraint. Following the recipe of [5], the self-dual cone vt2ηt,t=1,,T\|\textbf{v}_{t}\|_{2}\leq\eta_{t},\forall t=1,\ldots,T can be introduced to form the Lagrangian function as follows

(w,ξ,u,ρ;α,τ,γ,v,η)\displaystyle\mathcal{L}(\textbf{w},\xi,u,\rho;\alpha,\tau,\gamma,\textbf{v},\eta)
=\displaystyle= 12u2+Cξy¯αy¯(ξby¯+t=1Twt,ay¯t)τξ\displaystyle\frac{1}{2}u^{2}+C\xi-\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}\left(\xi-b_{\overline{y}^{\prime}}+\sum_{t=1}^{T}\langle\textbf{w}_{t},\textbf{a}_{\overline{y}^{\prime}}^{t}\rangle\right)-\tau\xi
+γ(t=1Tρtu)t=1T(vt,wt+ηtρt),\displaystyle+\gamma\Bigg{(}\sum_{t=1}^{T}\rho_{t}-u\Bigg{)}-\sum_{t=1}^{T}(\langle\textbf{v}_{t},\textbf{w}_{t}\rangle+\eta_{t}\rho_{t}),

with dual variables αy¯+\alpha_{\overline{y}^{\prime}}\in\mathbb{R}_{+}, τ+\tau\in\mathbb{R}_{+}, and γ+\gamma\in\mathbb{R}_{+}. The derivatives of the Lagrangian with respect to the primal variables must vanish, which leads to the following KKT conditions:

vt=y¯αy¯ay¯t,t=1,,T\displaystyle\textbf{v}_{t}=-\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}\textbf{a}_{\overline{y}^{\prime}}^{t},\forall t=1,\ldots,T
Cy¯αy¯τ=0\displaystyle C-\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}-\tau=0
u=γ\displaystyle u=\gamma
γ=ηt,t=1,,T\displaystyle\gamma=\eta_{t},\forall t=1,\ldots,T

By substituting all the primal variables with dual variables using the above KKT conditions, we obtain the following dual problem,

maxα,γ\displaystyle\max_{\alpha,\gamma} 12γ2+y¯αy¯by¯\displaystyle-\frac{1}{2}\gamma^{2}+\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}b_{\overline{y}^{\prime}}
s.t. y¯αy¯ay¯tγ,t=1,,T\displaystyle\Big{\|}\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}\textbf{a}_{\overline{y}^{\prime}}^{t}\Big{\|}\leq\gamma,\forall t=1,\ldots,T
y¯αy¯C,αy¯0,y¯𝒴¯\displaystyle\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}\leq C,~\alpha_{\overline{y}^{\prime}}\geq 0,\forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}

By setting θ=12γ2\theta=\frac{1}{2}\gamma^{2} and 𝒜={y¯αy¯C,αy¯0,y¯𝒴¯}\mathcal{A}=\{\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}\leq C,\alpha_{\overline{y}^{\prime}}\geq 0,\forall\overline{y}^{\prime}\in\overline{\mathcal{Y}}\}, we can reformulate the above problem as

maxθ,α𝒜\displaystyle\max_{\theta,\alpha\in\mathcal{A}} θ+y¯αy¯by¯\displaystyle-\theta+\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}b_{\overline{y}^{\prime}}
s.t. 12αTQtαθ,t=1,,T\displaystyle\frac{1}{2}\alpha^{T}Q^{t}\alpha\leq\theta,\forall t=1,\ldots,T

where Qy¯,y¯′′t=ay¯t,ay¯′′tQ_{\overline{y}^{\prime},\overline{y}^{\prime\prime}}^{t}=\langle\textbf{a}_{\overline{y}^{\prime}}^{t},\textbf{a}_{\overline{y}^{\prime\prime}}^{t}\rangle. According to the property of self-dual cones [3], we can obtain the primal solution from the dual as wt=μtvt=μty¯αy¯ay¯t\textbf{w}_{t}=-\mu_{t}\textbf{v}_{t}=\mu_{t}\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}\textbf{a}^{t}_{\overline{y}^{\prime}}, where μt\mu_{t} is the dual variable of the ttht^{th} quadratic constraint and satisfies t=1Tμt=1,μt+,t=1,,T\sum_{t=1}^{T}\mu_{t}=1,\mu_{t}\in\mathbb{R}_{+},\forall t=1,\ldots,T. By constructing the Lagrangian with respect to θ\theta using the dual variables μ\mu, we can recover Problem

(29) maxα𝒜minμT12y¯y¯′′αy¯αy¯′′(t=1TμtQy¯,y¯′′dt)+y¯αy¯by¯,\max_{\alpha\in\mathcal{A}}\min_{\mu\in\mathcal{M}_{T}}-\frac{1}{2}\sum_{\overline{y}^{\prime}}\sum_{\overline{y}^{\prime\prime}}\alpha_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime\prime}}\left(\sum_{t=1}^{T}\mu_{t}Q_{\overline{y}^{\prime},\overline{y}^{\prime\prime}}^{\textbf{d}^{t}}\right)+\sum_{\overline{y}^{\prime}}\alpha_{\overline{y}^{\prime}}b_{\overline{y}^{\prime}},

where T={t=1Tμt=1,μt0,t=1,,T}\mathcal{M}_{T}=\{\sum_{t=1}^{T}\mu_{t}=1,\mu_{t}\geq 0,\forall t=1,\ldots,T\}. This completes the proof.
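As an illustration only, the intermediate dual above (the QCQP whose quadratic constraints are bounded by a single θ\theta) can be prototyped with an off-the-shelf convex solver; this sketch uses CVXPY rather than the Mosek interface used in our experiments, and the inputs A_list (one matrix whose columns are the vectors a^t for the active labelings) and b are whatever the cutting-plane layer has accumulated.

    import cvxpy as cp

    def solve_reduced_dual(A_list, b, C):
        # max  -theta + sum_y alpha_y b_y
        # s.t. 0.5 * ||A_t alpha||^2 <= theta for each t, sum(alpha) <= C, alpha >= 0
        K = len(b)                                   # number of active labelings y'
        alpha = cp.Variable(K, nonneg=True)
        theta = cp.Variable()
        constraints = [cp.sum(alpha) <= C]
        for A_t in A_list:                           # columns of A_t are the vectors a^t_{y'}
            constraints.append(0.5 * cp.sum_squares(A_t @ alpha) <= theta)
        prob = cp.Problem(cp.Maximize(-theta + b @ alpha), constraints)
        prob.solve()
        return alpha.value, theta.value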

Appendix B Proof of Theorem 2

Given the Problem

(30) minα𝒜maxd𝒟d(α)\displaystyle\min_{\alpha\in\mathcal{A}}\max_{\textbf{d}\in\mathcal{D}}\mathcal{F}_{\textbf{d}}(\alpha)\! or minα𝒜,γγ:γd(α),d𝒟,\displaystyle\!\min_{\alpha\in\mathcal{A},\gamma}\gamma\;:\;\gamma\geq\mathcal{F}_{\textbf{d}}(\alpha),\;\forall\textbf{d}\in\mathcal{D},

we have the equivalent optimization problem as

maxα𝒜,γ\displaystyle\max_{\alpha\in\mathcal{A},\gamma} γ\displaystyle-\gamma
s.t. γ𝐝(α),𝐝𝒟.\displaystyle\gamma\geq\mathcal{F}_{\mathbf{d}}(\alpha),\forall\mathbf{d}\in\mathcal{D}.

The outer layer of Algorithm 2 generates a sequence of configurations of 𝐝\mathbf{d}, {𝐝1,,𝐝k}\{\mathbf{d}^{1},\ldots,\mathbf{d}^{k}\}, after kk iterations. In the kkth iteration, the most violated constraint 𝐝k+1\mathbf{d}^{k+1} is found with respect to αk\alpha_{k}, so that 𝐝k+1(αk)=max𝐝𝒟𝐝(αk)\mathcal{F}_{\mathbf{d}^{k+1}}(\alpha_{k})=\max_{\mathbf{d}\in\mathcal{D}}\mathcal{F}_{\mathbf{d}}(\alpha_{k}), according to the subproblem dt=argmaxd𝒟d(αt)\textbf{d}^{t}=\arg\max_{\textbf{d}\in\mathcal{D}}\mathcal{F}_{\textbf{d}}(\alpha^{t}). Hence, we can construct two sequences {γ¯k}\{\underline{\gamma}_{k}\} and {γ¯k}\{\overline{\gamma}_{k}\} such that

(31) γ¯k\displaystyle\underline{\gamma}_{k} =\displaystyle= max1tk𝐝t(αt)\displaystyle\max_{1\leq t\leq k}\mathcal{F}_{\mathbf{d}^{t}}(\alpha_{t})
(32) γ¯k\displaystyle\overline{\gamma}_{k} =\displaystyle= min1tk𝐝t+1(αt)=min1tkmax𝐝𝒟𝐝(αt)\displaystyle\min_{1\leq t\leq k}\mathcal{F}_{\mathbf{d}^{t+1}}(\alpha_{t})=\min_{1\leq t\leq k}\max_{\mathbf{d}\in\mathcal{D}}\mathcal{F}_{\mathbf{d}}(\alpha_{t})

Suppose that we can solve minα𝒜max1tk𝐝t(α)\min_{\alpha\in\mathcal{A}}\max_{1\leq t\leq k}\mathcal{F}_{\mathbf{d}^{t}}(\alpha) exactly. Due to the equivalence to Problem (29), this means that we can obtain the exact solution of Problem (29). Under this assumption, equation (31) can be further rewritten as

(33) γ¯k=max1tk𝐝t(αt)=minα𝒜max1tk𝐝t(α).\displaystyle\underline{\gamma}_{k}=\max_{1\leq t\leq k}\mathcal{F}_{\mathbf{d}^{t}}(\alpha_{t})=\min_{\alpha\in\mathcal{A}}\max_{1\leq t\leq k}\mathcal{F}_{\mathbf{d}^{t}}(\alpha).

This turns out to be the same problem of FGM [26]. For self-completeness, we give the theorem as follows,

Theorem 3 ([26]).

Let (α,γ)(\alpha^{*},\gamma^{*}) be the globally optimal solution pair of Problem (30). Then the sequences {γ¯k}\{\underline{\gamma}_{k}\} and {γ¯k}\{\overline{\gamma}_{k}\} have the following property

(34) γ¯kγkγ¯k.\underline{\gamma}_{k}\leq\gamma_{k}\leq\overline{\gamma}_{k}.

As kk increases, {γ¯k}\{\underline{\gamma}_{k}\} is monotonically increasing and {γ¯k}\{\overline{\gamma}_{k}\} is monotonically decreasing.

Based on the above theorem, the globally optimal solution can be obtained after a finite number of iterations. However, the assumption that (29) is solved exactly usually has no formal guarantee. We have already proven in Theorem 1 that the inner problem of Algorithm 2 can reach the desired precision ϵ\epsilon after a finite number of iterations of Algorithm 1. Therefore, according to Algorithm 2, we can construct the following sequence

(35) γ¯k=max1tk𝐝t(αt)minα𝒜max1tk𝐝t(α)+ϵ.\displaystyle\underline{\gamma}_{k}^{\prime}=\max_{1\leq t\leq k}\mathcal{F}_{\mathbf{d}^{t}}(\alpha_{t})\leq\min_{\alpha\in\mathcal{A}}\max_{1\leq t\leq k}\mathcal{F}_{\mathbf{d}^{t}}(\alpha)+\epsilon.

By combining inequalities (34) and (35), we obtain the following inequalities

(36) γ¯kϵγ¯kγkγ¯k.\underline{\gamma}_{k}^{\prime}-\epsilon\leq\underline{\gamma}_{k}\leq\gamma_{k}\leq\overline{\gamma}_{k}.

After a finite number of iterations, the global optimal solution satisfies γ=γ¯k=γk=γ¯k\gamma^{*}=\underline{\gamma}_{k}=\gamma_{k}=\overline{\gamma}_{k}. Hence, the solution of Algorithm 2 falls below the lower bound γ¯k\underline{\gamma}_{k}^{\prime} by at most ϵ\epsilon. This completes the proof of Theorem 2.
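The sketch below mirrors how the bounds in (31)-(36) can be tracked to stop the outer loop; most_violated_d and solve_inner are hypothetical handles for the corresponding steps of Algorithm 2, shown only to make the ϵ\epsilon-gap argument concrete.

    def outer_loop(most_violated_d, solve_inner, F, alpha0, eps, max_iter=100):
        # Track lower/upper bounds of Theorem 2 and stop when their gap is at most eps.
        # most_violated_d(alpha)  -> d maximizing F(d, alpha)
        # solve_inner(active_set) -> alpha solving min_alpha max_{d in active_set} F(d, alpha) up to eps
        # F(d, alpha)             -> objective value F_d(alpha)
        alpha, active_set = alpha0, []
        lower, upper = -float("inf"), float("inf")
        for _ in range(max_iter):
            d_new = most_violated_d(alpha)                 # most violated constraint for current alpha
            upper = min(upper, F(d_new, alpha))            # upper bound, cf. (32)
            active_set.append(d_new)
            alpha = solve_inner(active_set)                # inner problem solved to precision eps
            lower = max(lower, max(F(d, alpha) for d in active_set))   # lower bound, cf. (31)
            if upper - lower <= eps:                       # epsilon-optimal, cf. (36)
                break
        return alpha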

References

  • [1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS, 2003.
  • [2] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4:1–106, 2012.
  • [3] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
  • [4] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2000.
  • [5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK., 2004.
  • [6] Y. Chen, J. Bi, and J. Z. Wang. MILES: Multiple-instance learning via embedded instance selection. TPAMI, 28:1931–1947, 2006.
  • [7] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31–71, 1997.
  • [8] G. M. Fung and O. L. Mangasarian. A feature selection newton method for support vector machine classification. Computational Optimization and Applications, 28:185–202, 2004.
  • [9] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389–422, 2002.
  • [10] J. B. Hiriart-Urruty and C. Lemarechal. Convex Analysis and Minimization Algorithms. Springer-Verlag, 1993.
  • [11] T. Joachims. A support vector method for multivariate performance measures. In ICML, 2005.
  • [12] T. Joachims. Training linear SVMs in linear time. In SIGKDD, 2006.
  • [13] T. Joachims, T. Finley, and C. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77:27–59, 2009.
  • [14] J. E. Kelley. The cutting plane algorithm for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.
  • [15] T. N. Lal, O. Chapelle, J. Weston, and A. Elisseeff. Embedded methods. In I. Guyon, S. Gunn, M. Nikravesh, and L. A. Zadeh, editors, Feature Extraction: Foundations and Applications, Studies in Fuzziness and Soft Computing, number 207, pages 137 –165. Springer, 2006.
  • [16] Q. V. Le and A. Smola. Direct optimization of ranking measures. JMLR, 1:1–48, 2007.
  • [17] D. Lin, D. P. Foster, and L. H. Ungar. A risk ratio comparison of l0l_{0} and l1l_{1} penalized regressions. Technical report, University of Pennsylvania, 2010.
  • [18] Z. Liu, F. Jiang, G. Tian, S. Wang, F. Sato, S. J. Meltzer, and M. Tan. Sparse logistic regression with lpl_{p} penalty for biomarker identification. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007.
  • [19] Q. Mao and I. W. Tsang. Optimizing performance measures for feature selection. In ICDM, 2011.
  • [20] O. Maron and A. L. Ratan. Multiple-instance learning for natural scene classification. In ICML, 1998.
  • [21] D. R. Musicant, V. Kumar, and A. Ozgur. Optimizing f-measure with support vector machines. In Proceedings of the 16th International Florida Artificial Intelligence Research Society Conference, 2003.
  • [22] A. Mutapcic and S. Boyd. Cutting-set methods for robust convex optimization with pessimizing oracles. Optimization Methods & Software, 24(3):381–406, 2009.
  • [23] A. Y. Ng. Feature selection, 1\ell_{1} vs. 2\ell_{2} regularization, and rotational invariance. In ICML, 2004.
  • [24] A. Rakotomamonjy, F. R. Bach, Y. Grandvalet, and S. Canu. SimpleMKL. JMLR, 3:1439–1461, 2008.
  • [25] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. JMLR, 7, 2006.
  • [26] M. Tan, L. Wang, and I. W. Tsang. Learning sparse SVM for feature selection on very high dimensional datasets. In ICML, 2010.
  • [27] C. H. Teo, S.V.N. Vishwanathan, A. Smola, and Quoc V. Le. Bundle methods for regularized risk minimization. JMLR, pages 311–365, 2010.
  • [28] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453–1484, 2005.
  • [29] H. Valizadengan, R. Jin, R. Zhang, and J. Mao. Learning to rank by optimizing ndcg measure. In NIPS, 2009.
  • [30] J. Weston, A. Elisseeff, and B. Schölkopf. Use of the zero-norm with linear models and kernel methods. JMLR, 3:1439–1461, 2003.
  • [31] Z. Xu, R. Jin, I. King, and M. R. Lyu. An extended level method for efficient multiple kernel learning. In NIPS, 2008.
  • [32] Z. Xu, R. Jin, J. Ye, Michael R. Lyu, and I. King. Non-monotonic feature selection. In ICML, 2009.
  • [33] G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A comparison of optimization methods and software for large-scale l1l_{1}-regularized linear classification. JMLR, 11:3183–3234, 2010.
  • [34] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In SIGIR, 2007.
  • [35] Q. Zhang, S.A. Goldman, W. Yu, and J. Fritts. Content-based image retrieval using multiple-instance learning. In ICML, 2002.
  • [36] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. JMLR, 11:1081–1107, Mar 2010.
  • [37] X. Zhang, A. Saha, and S.V.N. Vishwanathan. Smoothing multivariate performance measures. In UAI, 2011.
  • [38] Z.-H. Zhou and M.-L. Zhang. Multi-instance multi-label learning with application to scene classification. In NIPS, 2007.
  • [39] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machine. In NIPS, 2003.
  • [40] A. Zien and C. S. Ong. Multiclass multiple kernel learning. In ICML, 2007.