Optimizing Evaluation Metrics for Multi-Task Learning via the Alternating Direction Method of Multipliers

Ge-Yang Ke, Yan Pan, Jian Yin, Chang-Qin Huang

Ge-Yang Ke, Yan Pan, and Jian Yin are with the School of Data Science and Computer Science, Sun Yat-sen University, Guangzhou 510006, China. Corresponding author: Yan Pan (panyan5@mail.sysu.edu.cn). Chang-Qin Huang is with the School of Information Technology in Education, South China Normal University, Guangzhou 510631, China.

Abstract

Multi-task learning (MTL) aims to improve the generalization performance of multiple tasks by exploiting the shared factors among them. Various metrics (e.g., F-score, Area Under the ROC Curve) are used to evaluate the performance of MTL methods. Most existing MTL methods try to minimize either the misclassification errors for classification or the mean squared errors for regression. In this paper, we propose a method to directly optimize the evaluation metrics for a large family of MTL problems. The formulation of MTL that directly optimizes evaluation metrics is the combination of two parts: (1) a regularizer defined on the weight matrix over all tasks, in order to capture the relatedness of these tasks; (2) a sum of multiple structured hinge losses, each corresponding to a surrogate of some evaluation metric on one task. This formulation is challenging to optimize because both of its parts are non-smooth. To tackle this issue, we propose a novel optimization procedure based on the alternating direction method of multipliers (ADMM), where we decompose the whole optimization problem into a sub-problem corresponding to the regularizer and another sub-problem corresponding to the structured hinge losses. For a large family of MTL problems, the first sub-problem has closed-form solutions. To solve the second sub-problem, we propose an efficient primal-dual algorithm via coordinate ascent. Extensive evaluation results demonstrate that, in a large family of MTL problems, the proposed MTL method of directly optimizing evaluation metrics achieves clear performance gains over the corresponding baseline methods.

Index Terms:
Multi-Task Learning, Evaluation Metrics, Structured Outputs, Coordinate Ascent, Alternating Direction Method of Multipliers.

1 Introduction

Recently, considerable research has been devoted to Multi-Task Learning (MTL), the problem of improving the generalization performance of multiple tasks by utilizing the shared information among them. MTL has been widely used in various applications, such as natural language processing [1], handwritten character recognition [30, 34], scene recognition [29] and medical diagnosis [3]. Many MTL methods have been proposed in the literature [8, 11, 49, 51, 13, 21, 28, 30, 53, 1, 9, 10, 33, 29, 15, 46, 18, 52, 26].

In this paper, we consider MTL for classification or regression problems. Note that either a multi-class classification problem or a multi-label learning problem can be regarded as an MTL problem (see Footnote 1). Most of the existing MTL methods focus on minimizing either a convex surrogate (e.g., the hinge loss or the logistic loss) of the 0-1 errors for multi-task classification, or the mean squared errors for multi-task regression. On the other hand, in practice, several evaluation metrics other than misclassification errors or mean squared errors are used in the evaluation of MTL methods, e.g., F-score, Precision, Recall, Area Under the ROC Curve (AUC), and Mean Average Precision. For example, in the cases of MTL on imbalanced data (e.g., in a task, the number of negative samples is much larger than that of the positive samples), cost-sensitive MTL, or MTL for ranking, these metrics are more effective in performance evaluation than the standard misclassification errors or the mean squared errors. However, due to the computational difficulties, few learning techniques have been developed to directly optimize these evaluation metrics in MTL.

Footnote 1: As an illustrative example, consider a multi-label classification problem with instances $\{x_{1},x_{2},x_{3},x_{4},x_{5}\}$, where $x_{1}$ belongs to classes $a$ and $b$, $x_{2}$ belongs to classes $b$ and $c$, $x_{3}$ belongs to class $c$, $x_{4}$ belongs to class $a$, and $x_{5}$ belongs to classes $a$, $b$ and $c$. This problem can be regarded as an MTL problem with three tasks, whose training sets are:

$$\begin{aligned}
&(x_{1},1),\ (x_{2},0),\ (x_{3},0),\ (x_{4},1),\ (x_{5},1)\\
&(x_{1},1),\ (x_{2},1),\ (x_{3},0),\ (x_{4},0),\ (x_{5},1)\\
&(x_{1},0),\ (x_{2},1),\ (x_{3},1),\ (x_{4},0),\ (x_{5},1)
\end{aligned}$$

The first/second/third task is the binary classification problem of whether an instance belongs to class $a$/$b$/$c$. Hence, a multi-label learning problem is a special case of an MTL problem. Similarly, we can verify that a multi-class classification problem can also be regarded as an MTL problem.
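The multi-label example with instances $x_1,\dots,x_5$ can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `multilabel_to_tasks` and the string identifiers for instances and classes are assumptions made here, not part of the paper.

```python
def multilabel_to_tasks(instances, labels_per_instance, classes):
    """Turn a multi-label data set into one binary task per class.

    Each task is a list of (instance, 0/1) pairs, where 1 means that the
    instance carries the class associated with that task.
    """
    tasks = {}
    for c in classes:
        tasks[c] = [(x, 1 if c in labels else 0)
                    for x, labels in zip(instances, labels_per_instance)]
    return tasks


# Hypothetical identifiers mirroring the example in the text.
instances = ["x1", "x2", "x3", "x4", "x5"]
labels_per_instance = [{"a", "b"}, {"b", "c"}, {"c"}, {"a"}, {"a", "b", "c"}]

tasks = multilabel_to_tasks(instances, labels_per_instance, ["a", "b", "c"])
for c, pairs in tasks.items():
    print(c, pairs)
```

Each of the three resulting lists is the training set of one binary task, which is exactly the sense in which the multi-label problem becomes an MTL problem with three tasks.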

In this paper, we propose an approach to directly optimizing the evaluation metrics in MTL, which can be applied to a large family of MTL problems. Specifically, for an MTL problem with $m$ tasks (the $i$th task is associated with a training set $\{(\mathbf{x}_{j}^{(i)}, y_{j}^{(i)})\}_{j=1}^{n_{i}}$, $i=1,2,\dots,m$, where $n_{i}$ is the number of training samples for the $i$th task), we consider the following generic formulation:

$$\min_{\mathbf{W}}\ \Omega(\mathbf{W})+\lambda\sum_{i=1}^{m}\mathcal{L}\big(\mathbf{W}_{\cdot i};\{(\mathbf{x}_{j}^{(i)},y_{j}^{(i)})\}_{j=1}^{n_{i}}\big), \tag{1}$$

where $\mathbf{W}$ is the weight matrix with $m$ columns $\mathbf{W}_{\cdot 1}$, $\mathbf{W}_{\cdot 2}$, …, $\mathbf{W}_{\cdot m}$, and $\lambda>0$ is a trade-off parameter. This formulation is a linear combination of two parts. The first part is a regularizer $\Omega(\mathbf{W})$ defined on the weight matrix $\mathbf{W}$ over all tasks, in order to leverage the relatedness of these tasks. Examples of this kind of regularizer include the trace-norm, the $\ell_{1,1}$-norm and the $\ell_{2,1}$-norm on $\mathbf{W}$. The second part is a sum of multiple loss functions, each corresponding to one task. In order to directly optimize a specific evaluation metric, we consider the hinge loss functions for structured outputs [39, 20, 50, 48, 47], which are surrogates of a specific evaluation metric.

The formulation in (1) includes a large family of MTL problems. Since the two parts in (1) are usually non-smooth, the optimization problem (1) is difficult to solve. To tackle this issue, we propose a novel optimization procedure based on the alternating direction method of multipliers (ADMM [6, 25]), which is widely used in various machine learning problems (e.g., [31, 32, 33, 44]). We decompose the whole optimization problem in (1) into two simpler sub-problems. The first sub-problem corresponds to the regularizer. For commonly used regularizers in MTL (e.g., the trace-norm and the $\ell_{2,1}$-norm), this sub-problem has a closed-form solution. The second sub-problem corresponds to the structured hinge losses. To solve the second sub-problem, we propose an efficient primal-dual algorithm via coordinate ascent.

We conduct extensive experiments to evaluate the performances of the proposed MTL method. Experimental results show that the proposed method that optimizes a specific evaluation metric outperforms the corresponding MTL classification or MTL regression baseline methods by a clear margin.

2 Related Work

MTL is a wide class of learning problems. Roughly speaking, the existing MTL methods can be divided into three main categories: parameter sharing, common feature sharing, and low-rank subspace sharing.

In the methods with parameter sharing, all tasks are assumed to explicitly share some common parameters. Representative methods in this category include shared weight vectors [11], hidden units in neural networks [8], and common priors in Bayesian models [49, 51].

In the methods with common features sharing, task relatedness is modeled by enforcing all tasks to share a common set of features [2, 28, 22, 30, 13, 21, 53]. Representative examples are the methods which constrain the model parameters (i.e., a weight matrix) of all tasks to have certain sparsity patterns, for example, cardinality sparsity [30], group sparsity [28, 13], or clustered structure [21, 53].

The methods in the third category assume that all tasks lie in a shared low-rank subspace [1, 9, 10]. A common assumption in this category of methods is that most of the tasks are relevant while (optionally) there may exist a small number of irrelevant (outlier) tasks. These methods pursue a low-rank weight matrix that captures the underlying shared factors among tasks. Trace-norm regularization is commonly used in these methods to encourage a low-rank structure on the model parameters.

Most of the existing MTL methods focus on designing regularizers or parameter sharing patterns to utilize the intrinsic relationships among multiple related tasks. These MTL methods usually try to optimize the classification errors or the mean squared errors for regression. In practice, various other metrics (such as F-score and AUC) are used in the evaluation of MTL methods. However, little effort has been devoted to optimizing these evaluation metrics in the context of MTL, except for the work [14], in which the author proposed a hierarchical MTL formulation for structured output prediction in sequence segmentation. Since the regularizer used in [14] is decomposable, the hierarchical MTL problem can be decomposed into multiple independent tasks, each of which is a structured output learning problem with a simple regularizer. In this paper, we seek to directly optimize commonly used evaluation metrics for MTL with a possibly indecomposable regularizer, resulting in a generic approach that can be applied to a large family of MTL problems. Our formulation can be regarded as MTL for structured output prediction with an indecomposable regularizer.

The proposed methods in this paper are also related to multi-label algorithms. Various multi-label algorithms have been proposed in the literature, e.g., the RAkEL method that uses random $k$-label sets [41], the MLCSSP method that spans the original label space by a subset of labels [4], the AdaBoostMH method based on AdaBoost [37], the HOMER method based on a hierarchy of multi-label learners [40], the binary relevance (BR) method [42], the label power-set (LP) method [42], and the ensembles of classifier chains (ECC) method [35].

The proposed approach in this paper is to optimize the evaluation metrics in MTL. We refer the readers to Section 4 for the detailed introduction to the evaluation metrics related to the proposed approach.

3 Notations

We first introduce the notations to be used throughout this paper. We use bold upper-case characters (e.g., $\mathbf{M}$, $\mathbf{X}$, $\mathbf{W}$) to represent matrices, and bold lower-case characters (e.g., $\mathbf{x}$, $\mathbf{y}$) to represent vectors. For a matrix $\mathbf{M}\in\mathbb{R}^{d\times m}$, we denote by $\mathbf{M}_{ij}$ the element at the intersection of the $i$th row and the $j$th column of $\mathbf{M}$. We denote by $\mathbf{M}_{i\cdot}\in\mathbb{R}^{1\times m}$ the $i$th row of $\mathbf{M}$, and by $\mathbf{M}_{\cdot j}\in\mathbb{R}^{d\times 1}$ the $j$th column of $\mathbf{M}$.

We denote by $\|\mathbf{M}\|_{F}=\sqrt{\sum_{i=1}^{d}\sum_{j=1}^{m}(\mathbf{M}_{ij})^{2}}$ the Frobenius norm of $\mathbf{M}$. Let $\|\mathbf{M}\|_{1,1}=\sum_{i=1}^{d}\sum_{j=1}^{m}|\mathbf{M}_{ij}|$ be the $\ell_{1,1}$-norm of $\mathbf{M}$, where $|\mathbf{M}_{ij}|$ is the absolute value of $\mathbf{M}_{ij}$. Let $\|\mathbf{M}\|_{2,1}=\sum_{i=1}^{d}\|\mathbf{M}_{i\cdot}\|_{2}$ be the $\ell_{2,1}$-norm of $\mathbf{M}$, where $\|\mathbf{M}_{i\cdot}\|_{2}=\sqrt{\sum_{j=1}^{m}\mathbf{M}_{ij}^{2}}$ is the $\ell_{2}$-norm of $\mathbf{M}_{i\cdot}$. Let $\|\mathbf{M}\|_{\infty}=\max_{i,j}|\mathbf{M}_{ij}|$ be the infinity norm of $\mathbf{M}$. The trace-norm of $\mathbf{M}$ is defined by $\|\mathbf{M}\|_{*}=\sum_{k=1}^{\mathrm{rank}(\mathbf{M})}\sigma_{k}(\mathbf{M})$, where $\{\sigma_{k}(\mathbf{M})\}_{k=1}^{\mathrm{rank}(\mathbf{M})}$ are the non-zero singular values of $\mathbf{M}$ and $\mathrm{rank}(\mathbf{M})$ is the rank of $\mathbf{M}$. We denote by $\mathbf{M}^{T}$ the transpose of $\mathbf{M}$. For a vector $\mathbf{x}$, $\|\mathbf{x}\|_{2}$ represents its $\ell_{2}$-norm.
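As a quick numerical check of these definitions, the entry-wise and row-wise norms can be computed directly. The sketch below uses plain Python lists as a stand-in for matrices; the function names are chosen here for illustration. The trace-norm is left out because it requires a singular value decomposition routine.

```python
import math


def frobenius_norm(M):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in M for v in row))


def l11_norm(M):
    """Sum of absolute values of all entries."""
    return sum(abs(v) for row in M for v in row)


def l21_norm(M):
    """Sum over rows of the l2-norm of each row."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in M)


def inf_norm(M):
    """Largest absolute entry (the infinity norm as defined in the text)."""
    return max(abs(v) for row in M for v in row)


# A small 3 x 2 example matrix (d = 3 features, m = 2 tasks).
M = [[3.0, -4.0],
     [0.0, 0.0],
     [1.0, 0.0]]

print(frobenius_norm(M))  # sqrt(9 + 16 + 1) = sqrt(26)
print(l11_norm(M))        # 3 + 4 + 1 = 8
print(l21_norm(M))        # 5 + 0 + 1 = 6
print(inf_norm(M))        # 4
```

The all-zero second row contributes nothing to the $\ell_{2,1}$-norm, which is why this norm is used to encourage entire rows (features) of the weight matrix to vanish across all tasks.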

In the context of MTL, we assume we are given $m$ learning tasks. The $i$th ($i=1,\dots,m$) task is associated with a training set $(\mathbf{X}^{(i)},\mathbf{y}^{(i)})$, where $\mathbf{X}^{(i)}\in\mathbb{R}^{n_{i}\times d}$ denotes the data matrix with each row being a sample, $\mathbf{y}^{(i)}\in\{-1,+1\}^{n_{i}}$ denotes the target labels on $\mathbf{X}^{(i)}$, $d$ is the feature dimensionality, and $n_{i}$ is the number of samples for the $i$th task. For $i=1,2,\dots,m$, we define $\mathbb{E}_{i}=\{-1,+1\}^{n_{i}}$ as the set of all possible $n_{i}$-dimensional vectors whose elements are either $-1$ or $1$. To simplify presentation, we write $\mathbb{E}_{i}=\{\mathbf{y}_{1},\mathbf{y}_{2},\dots,\mathbf{y}_{p}\}$, where $p=2^{n_{i}}$ and each $\mathbf{y}_{j}$ is one of the possible vectors in $\{-1,1\}^{n_{i}}$.

We define a weight matrix $\mathbf{W}=[\mathbf{W}_{\cdot 1},\dots,\mathbf{W}_{\cdot m}]\in\mathbb{R}^{d\times m}$ on all of the $m$ tasks. The goal of (linear) MTL is to simultaneously learn $m$ (linear) predictors $\mathbf{W}_{\cdot i}$ ($i=1,\dots,m$) to minimize some loss function $\mathcal{L}(\mathbf{W}_{\cdot i};\mathbf{X}^{(i)},\mathbf{y}^{(i)})$ (e.g., the least squares loss $\|\mathbf{y}^{(i)}-\mathbf{X}^{(i)}\mathbf{W}_{\cdot i}\|_{2}^{2}$), where $\mathbf{W}_{\cdot i}\in\mathbb{R}^{d}$ is a column vector. Note that for each task, we have $\mathbf{X}^{(i)}=[\mathbf{x}_{1}^{(i)},\mathbf{x}_{2}^{(i)},\cdots,\mathbf{x}_{n_{i}}^{(i)}]^{T}$ and $\mathbf{y}^{(i)}=[y_{1}^{(i)},y_{2}^{(i)},\cdots,y_{n_{i}}^{(i)}]^{T}$.

4 Problem Formulations

The linear MTL problem can be formulated as the generic form in (1). The objective functions in many existing MTL methods are special cases of such a formulation. The following are two examples:

  • With the regularizer $\Omega(\mathbf{W})$ being the $\ell_{2,1}$-norm $\|\mathbf{W}\|_{2,1}$ and each loss function $\mathcal{L}(\mathbf{W}_{\cdot i};\mathbf{X}^{(i)},\mathbf{y}^{(i)})$ being the mean squared loss $\frac{1}{2}\|\mathbf{y}^{(i)}-\mathbf{X}^{(i)}\mathbf{W}_{\cdot i}\|_{2}^{2}$, the problem in (1) is the same as the objective used in [28].

  • If the regularizer $\Omega(\mathbf{W})$ is set to be the trace-norm $\|\mathbf{W}\|_{*}$ and each loss function $\mathcal{L}(\mathbf{W}_{\cdot i};\mathbf{X}^{(i)},\mathbf{y}^{(i)})$ is smooth (e.g., the mean squared loss $\frac{1}{2}\|\mathbf{y}^{(i)}-\mathbf{X}^{(i)}\mathbf{W}_{\cdot i}\|_{2}^{2}$), the problem in (1) becomes the objective used in [17].

The existing MTL methods mainly focus on the design of good regularizers (i.e., $\Omega(\mathbf{W})$) to capture the shared factors among multiple related tasks. The loss functions used in these methods minimize either the misclassification errors (for classification) or the mean squared errors (for regression). On the other hand, in practice, several evaluation metrics other than misclassification errors or mean squared errors are used in the evaluation of MTL methods, such as F-score and AUC. Particularly, in the cases of MTL on imbalanced data (e.g., in a task, the number of negative samples is much larger than that of the positive samples), these metrics are more effective in performance evaluation than the standard misclassification errors or the mean squared errors.

Learning techniques that directly optimize evaluation metrics, also known as learning with structured outputs, have been developed for many (single-task) problems, e.g., classification [39, 20] and ranking [50]. However, despite the acknowledged importance of metrics like F-score or AUC, little effort has been made to design MTL methods that directly optimize these evaluation metrics. The main reason is that MTL with direct optimization of the evaluation metrics usually results in a non-smooth objective function that is difficult to minimize.

In this paper, we focus on MTL with structured outputs and propose a generic optimization procedure based on ADMM. This optimization procedure can be applied to solving a large family of MTL problems that directly optimize some evaluation metric (e.g., F-score, AUC). We call the proposed method Structured MTL (SMTL for short).

The formulation of SMTL is also a special case of (1). In order to optimize some evaluation metric, we define the loss function for each task as the structured hinge loss:

$$\mathcal{L}(\mathbf{W}_{\cdot i};\mathbf{X}^{(i)},\mathbf{y}^{(i)})=\max_{\mathbf{y}_{j}\in\mathbb{E}_{i}}\big[\Delta(\mathbf{y}^{(i)},\mathbf{y}_{j})-\mathbf{W}_{\cdot i}^{T}{\mathbf{X}^{(i)}}^{T}(\mathbf{y}^{(i)}-\mathbf{y}_{j})\big],$$

where $\mathbf{y}_{j}$ represents any possible label assignment on $\mathbf{X}^{(i)}$, and $\Delta(\mathbf{y}^{(i)},\mathbf{y}_{j})$ represents an evaluation metric that measures the distance between the true labels $\mathbf{y}^{(i)}$ and the other labels $\mathbf{y}_{j}$. For example, $\Delta(\cdot,\cdot)$ can be 1 - F-score or 1 - AUC.
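For a very small task, the structured hinge loss can be evaluated by enumerating all $2^{n_i}$ label assignments. The sketch below does exactly that; it is meant only to make the definition concrete, not as a practical algorithm (the enumeration is exponential in $n_i$). The helper names, the toy data, and the use of 1 - F1 as $\Delta$ (with F1 taken to be 1 when neither vector contains a positive label) are assumptions made for this illustration.

```python
from itertools import product


def delta_f1(y_true, y_cand):
    """1 - F1 between two label vectors with entries in {-1, +1}."""
    tp = sum(1 for a, b in zip(y_true, y_cand) if a == 1 and b == 1)
    fp = sum(1 for a, b in zip(y_true, y_cand) if a == -1 and b == 1)
    fn = sum(1 for a, b in zip(y_true, y_cand) if a == 1 and b == -1)
    denom = 2 * tp + fp + fn
    f1 = 1.0 if denom == 0 else 2.0 * tp / denom  # assumed convention for 0/0
    return 1.0 - f1


def structured_hinge(w, X, y, delta):
    """Brute-force max over all y_j of delta(y, y_j) - w^T X^T (y - y_j).

    Returns the loss value and the maximizing label assignment.
    """
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for x in X]  # entries of X w
    best_val, best_y = None, None
    for cand in product((-1, 1), repeat=len(y)):
        # w^T X^T (y - cand) = sum_k (X w)_k * (y_k - cand_k)
        margin = sum(s * (a - b) for s, a, b in zip(scores, y, cand))
        val = delta(y, cand) - margin
        if best_val is None or val > best_val:
            best_val, best_y = val, cand
    return best_val, best_y


# Toy task with n_i = 3 samples and d = 2 features (illustrative values).
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
y = (1, -1, 1)

loss_zero, _ = structured_hinge([0.0, 0.0], X, y, delta_f1)
loss_good, argmax_good = structured_hinge([10.0, -10.0], X, y, delta_f1)
print(loss_zero, loss_good, argmax_good)
```

Because the candidate $\mathbf{y}_j=\mathbf{y}^{(i)}$ always contributes the value 0, the loss is non-negative; it reaches 0 only when the true labeling beats every other labeling by a margin of at least the corresponding $\Delta$.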

The formulation of SMTL is defined as:

$$\min_{\mathbf{W}}\ \Omega(\mathbf{W})+\lambda\sum_{i=1}^{m}\max_{\mathbf{y}_{j}\in\mathbb{E}_{i}}\big[\Delta(\mathbf{y}^{(i)},\mathbf{y}_{j})-\mathbf{W}_{\cdot i}^{T}{\mathbf{X}^{(i)}}^{T}(\mathbf{y}^{(i)}-\mathbf{y}_{j})\big]. \tag{2}$$

In this paper, we only focus on the MTL problems in the form of (2) that satisfy the following conditions:

  • Condition 1: With respect to $\Omega(\mathbf{W})$, there is a closed-form solution for the following sub-problem:

    $$\min_{\mathbf{W}}\ \Omega(\mathbf{W})+\frac{\mu}{2}\left\|\mathbf{W}-\mathbf{M}\right\|_{F}^{2}, \tag{3}$$

    where $\mathbf{M}\in\mathbb{R}^{d\times m}$ and $\mu$ is a positive constant.

  • Condition 2: For the evaluation metric $\Delta(\mathbf{y}^{(i)},\mathbf{y}_{j})$, the following sub-problem can be solved in polynomial time:

    $$\operatorname*{argmax}_{\mathbf{y}_{j}\in\mathbb{E}_{i}}\big[\Delta(\mathbf{y}^{(i)},\mathbf{y}_{j})-\mathbf{W}_{\cdot i}^{T}{\mathbf{X}^{(i)}}^{T}(\mathbf{y}^{(i)}-\mathbf{y}_{j})\big]. \tag{4}$$

The first condition restricts the regularizer $\Omega(\mathbf{W})$ and the second one restricts the evaluation metric function $\Delta(\mathbf{y}^{(i)},\mathbf{y}_{j})$. Even under these conditions, the formulation in (2) includes a large family of MTL problems. On the one hand, for the regularizer $\Omega(\mathbf{W})$, the following norms that are commonly used in MTL satisfy Condition 1:

  • $\ell_{1,1}$-norm. For the MTL problems with $\Omega(\mathbf{W})=\|\mathbf{W}\|_{1,1}$, the sub-problem in (3) is known to have the closed-form solution

    $$\mathbf{W}=\mathcal{S}_{\frac{1}{\mu}}(\mathbf{M}), \tag{5}$$

    where $\mathcal{S}_{\delta}(\mathbf{M})=\max(\mathbf{M}-\delta,0)+\min(\mathbf{M}+\delta,0)$ is the shrinkage operator [25], applied entry-wise.

  • $\ell_{2,1}$-norm. For the MTL problems with $\Omega(\mathbf{W})=\|\mathbf{W}\|_{2,1}$, the sub-problem in (3) is also known to have a closed-form solution:

    $$\mathbf{W}_{j\cdot}=\begin{cases}\dfrac{\|\mathbf{M}_{j\cdot}\|_{2}-\frac{1}{\mu}}{\|\mathbf{M}_{j\cdot}\|_{2}}\,\mathbf{M}_{j\cdot} & \text{if } \frac{1}{\mu}<\|\mathbf{M}_{j\cdot}\|_{2},\\[2mm] \mathbf{0} & \text{otherwise.}\end{cases} \tag{6}$$

  • Trace-norm. For the MTL problems with $\Omega(\mathbf{W})=\|\mathbf{W}\|_{*}$, the sub-problem in (3) also has a closed-form solution, given by the Singular Value Thresholding method [7]. Specifically, letting $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}$ be the SVD of $\mathbf{M}$, the closed-form solution is given by:

    $$\mathbf{W}=\mathbf{U}\,\mathcal{S}_{1/\mu}(\mathbf{\Sigma})\,\mathbf{V}^{T}. \tag{7}$$
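The closed-form solutions for the $\ell_{1,1}$-norm and the $\ell_{2,1}$-norm are simple enough to write out directly. The following is a minimal sketch using plain Python lists; the function names `soft_threshold` and `row_shrink` and the example matrix are assumptions for illustration. The trace-norm case applies the same shrinkage to the singular values of $\mathbf{M}$ and is omitted here because it needs an SVD routine.

```python
import math


def soft_threshold(M, delta):
    """Entry-wise shrinkage: max(M - delta, 0) + min(M + delta, 0), as in (5)."""
    return [[max(v - delta, 0.0) + min(v + delta, 0.0) for v in row]
            for row in M]


def row_shrink(M, delta):
    """Row-wise shrinkage for the l2,1-norm, as in (6).

    A row whose l2-norm exceeds delta is scaled toward zero;
    every other row is set to zero.
    """
    out = []
    for row in M:
        norm = math.sqrt(sum(v * v for v in row))
        if norm > delta:
            scale = (norm - delta) / norm
            out.append([scale * v for v in row])
        else:
            out.append([0.0 for _ in row])
    return out


mu = 2.0  # penalty parameter, so the threshold is 1 / mu = 0.5
M = [[3.0, -4.0],
     [0.3, -0.1],
     [-1.0, 0.0]]

W_l11 = soft_threshold(M, 1.0 / mu)
W_l21 = row_shrink(M, 1.0 / mu)
print(W_l11)
print(W_l21)
```

In this example the second row has $\ell_2$-norm below $1/\mu$, so the row-wise operator removes it entirely, whereas the entry-wise operator treats every entry independently.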

On the other hand, many commonly-used metric functions satisfy the second condition. The following are two examples which we will consider in this paper:

  • MTL by directly optimizing F-score. The F-score is a typical performance metric for binary classification, particularly in learning tasks on imbalanced data. It is a trade-off between Precision and Recall. Specifically, given the true labels $\mathbf{y}^{(i)}$ and a label assignment $\mathbf{y}_{j}$, we define the precision as the fraction of samples labeled positive by $\mathbf{y}_{j}$ that are truly positive:

    $$Precision=\frac{\sum_{k=1}^{n_{i}}I(y_{k}^{(i)}=1\ \text{and}\ (\mathbf{y}_{j})_{k}=1)}{\sum_{k=1}^{n_{i}}I((\mathbf{y}_{j})_{k}=1)},$$

    and the recall as the fraction of truly positive samples that are labeled positive by $\mathbf{y}_{j}$:

    $$Recall=\frac{\sum_{k=1}^{n_{i}}I(y_{k}^{(i)}=1\ \text{and}\ (\mathbf{y}_{j})_{k}=1)}{\sum_{k=1}^{n_{i}}I(y_{k}^{(i)}=1)},$$

    where $I(\textit{condition})$ represents the indicator function, i.e., $I(\textit{condition})=1$ if the condition is true and $I(\textit{condition})=0$ otherwise. Then the F-score on $\mathbf{y}^{(i)}$ and $\mathbf{y}_{j}$ is defined as:

    $$F_{\beta}=\frac{(1+\beta)\times Precision\times Recall}{Precision+\beta\, Recall}, \tag{8}$$

    where $\beta$ is a trade-off parameter. Hereafter, we simply set $\beta=1$. Finally, the metric function $\Delta(\cdot,\cdot)$ with respect to the F-score is defined by:

    $$\Delta(\mathbf{y}^{(i)},\mathbf{y}_{j})=1-F_{\beta}. \tag{9}$$

    With such a form of $\Delta(\mathbf{y}^{(i)},\mathbf{y}_{j})$, the sub-problem in (4) can be solved in polynomial time [20].

  • MTL by directly optimizing AUC. AUC is also a popular performance metric for binary classification, particularly in imbalanced learning. Given $\mathbf{y}^{(i)}$ and $\mathbf{y}_{j}$, the AUC metric can be calculated by:

    $$AUC=1-\frac{Swapped}{Pos\times Neg}, \tag{10}$$

    where $Swapped$ represents the number of "inverted" pairs in $\mathbf{y}^{(i)}$ compared to $\mathbf{y}_{j}$:

    $$Swapped=\sum_{l=1}^{n_{i}}\sum_{k=1}^{n_{i}}I(y_{l}^{(i)}=1\ \text{and}\ y_{k}^{(i)}=-1)\times I((\mathbf{y}_{j})_{l}=-1\ \text{and}\ (\mathbf{y}_{j})_{k}=1),$$

    and $Pos$/$Neg$ represents the number of positive/negative samples in the $i$th task:

    $$Pos=\sum_{k=1}^{n_{i}}I(y_{k}^{(i)}=1),\qquad Neg=\sum_{k=1}^{n_{i}}I(y_{k}^{(i)}=-1).$$

    The corresponding $\Delta(\cdot,\cdot)$ can be defined as:

    $$\Delta(\mathbf{y}^{(i)},\mathbf{y}_{j})=1-AUC. \tag{11}$$

    With such a form of $\Delta(\mathbf{y}^{(i)},\mathbf{y}_{j})$, there also exist polynomial-time algorithms to solve the sub-problem in (4) [20].

Note that here the Precision, Recall, F-Score and AUC are defined for a particular task.
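Both metric functions can be computed for a single task in a few lines. The sketch below follows the definitions in (8)-(11) with $\beta=1$; the function names and the handling of degenerate cases (F1 taken to be 0 when no positive sample is recovered, and both classes assumed present for AUC) are assumptions of this illustration rather than choices stated in the text.

```python
def delta_f1(y_true, y_cand):
    """Delta = 1 - F1 for label vectors with entries in {-1, +1}."""
    tp = sum(1 for a, b in zip(y_true, y_cand) if a == 1 and b == 1)
    if tp == 0:
        return 1.0  # assumed convention: F1 = 0 when nothing is recovered
    n_true_pos = sum(1 for a in y_true if a == 1)
    n_cand_pos = sum(1 for b in y_cand if b == 1)
    precision = tp / n_cand_pos
    recall = tp / n_true_pos
    # F1 is symmetric in precision and recall.
    return 1.0 - 2.0 * precision * recall / (precision + recall)


def delta_auc(y_true, y_cand):
    """Delta = 1 - AUC = Swapped / (Pos * Neg).

    Assumes y_true contains at least one positive and one negative label.
    """
    n = len(y_true)
    pos = sum(1 for a in y_true if a == 1)
    neg = sum(1 for a in y_true if a == -1)
    swapped = sum(
        1
        for l in range(n)
        for k in range(n)
        if y_true[l] == 1 and y_true[k] == -1
        and y_cand[l] == -1 and y_cand[k] == 1
    )
    return swapped / (pos * neg)


y_true = (1, 1, -1, -1)
y_cand = (1, -1, 1, -1)
print(delta_f1(y_true, y_cand))   # precision = recall = 0.5, so Delta = 0.5
print(delta_auc(y_true, y_cand))  # one inverted pair out of four, so Delta = 0.25
```

Both functions return 0 when the candidate labeling equals the true labeling, which is what the structured hinge loss requires of $\Delta$.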

5 Proposed Optimization Procedure

5.1 Overview

In this section, we present the proposed optimization procedure to solve the problem (2). Our procedure is based on the scheme of ADMM.

For ease of presentation, we define

$$\mathcal{G}_{i}(\mathbf{W}_{\cdot i})=\max_{\mathbf{y}_{j}\in\mathbb{E}_{i}}\big[\Delta(\mathbf{y}^{(i)},\mathbf{y}_{j})-\mathbf{W}_{\cdot i}^{T}{\mathbf{X}^{(i)}}^{T}(\mathbf{y}^{(i)}-\mathbf{y}_{j})\big],$$

and

$$\mathcal{G}(\mathbf{W})=\sum_{i=1}^{m}\mathcal{G}_{i}(\mathbf{W}_{\cdot i}).$$

Then, the problem in (2) can be re-formulated to its equivalent form in the following:

$$\min_{\mathbf{S},\mathbf{W}}\ \Omega(\mathbf{S})+\lambda\mathcal{G}(\mathbf{W})\quad \text{s.t.}\quad \mathbf{W}-\mathbf{S}=\mathbf{0}, \tag{12}$$

where $\mathbf{S}\in\mathbb{R}^{d\times m}$ is an auxiliary variable.

The corresponding augmented Lagrangian function with respect to (12) is:

$$\mathcal{A}(\mathbf{W},\mathbf{S},\mathbf{Z})=\Omega(\mathbf{S})+\lambda\mathcal{G}(\mathbf{W})+\langle\mathbf{Z},\mathbf{W}-\mathbf{S}\rangle+\frac{\mu}{2}\|\mathbf{W}-\mathbf{S}\|_{F}^{2}, \tag{13}$$

where $\mathbf{Z}$ is the Lagrangian multiplier, $\langle\cdot,\cdot\rangle$ represents the inner product of two matrices (i.e., given matrices $\mathbf{A}$ and $\mathbf{B}$, we have $\langle\mathbf{A},\mathbf{B}\rangle=Tr(\mathbf{A}^{T}\mathbf{B})$, where $Tr(\mathbf{M})$ is the trace of the matrix $\mathbf{M}$), and $\mu>0$ is an adaptive penalty parameter.

Based on the ADMM scheme, a sketch of the proposed optimization procedure is shown in Algorithm 1. In each iteration, we alternately update $\mathbf{S}$ and $\mathbf{W}$ by minimizing the augmented Lagrangian function in (13) with the other variables fixed, and then update the multiplier $\mathbf{Z}$. The update rules are as follows:

$$\begin{aligned}
\mathbf{S}^{\{t+1\}}&\leftarrow\operatorname*{argmin}_{\mathbf{S}}\ \mathcal{A}(\mathbf{W}^{\{t\}},\mathbf{S},\mathbf{Z}^{\{t\}});\\
\mathbf{W}^{\{t+1\}}&\leftarrow\operatorname*{argmin}_{\mathbf{W}}\ \mathcal{A}(\mathbf{W},\mathbf{S}^{\{t+1\}},\mathbf{Z}^{\{t\}});\\
\mathbf{Z}^{\{t+1\}}&\leftarrow\mathbf{Z}^{\{t\}}+\mu(\mathbf{W}^{\{t+1\}}-\mathbf{S}^{\{t+1\}}).
\end{aligned}$$

Note that hereafter we use $\mathbf{M}^{\{t\}}$ to represent the value of the variable $\mathbf{M}$ in the $t$-th iteration.

Next, we will present the details of solving the sub-problems with respect to 𝕊\mathbb{S} or 𝕎\mathbb{W}, respectively, with other variables being fixed.

Algorithm 1 The proposed ADMM procedure for the structured MTL problem (2)
Input: training set $\{(\mathbf{X}^{(i)},\mathbf{y}^{(i)})\}_{i=1}^{m}$, desired error tolerance $\epsilon$,
            maximal iteration number $T$.
Output: weight matrix $\mathbf{W}=[\mathbf{W}_{\cdot 1},\cdots,\mathbf{W}_{\cdot m}]$
1. Initialize: $\mathbf{Z}=\mathbf{S}=\mathbf{W}\leftarrow\mathbf{0}^{d\times m}$, $t\leftarrow 0$.
2. Repeat:
3.     Update
         $\mathbf{S}^{\{t+1\}}\leftarrow\operatorname*{argmin}_{\mathbf{S}}\ \Omega(\mathbf{S})+\frac{\mu}{2}\|\mathbf{S}-\mathbf{W}^{\{t\}}-\mathbf{Z}^{\{t\}}/\mu\|_{F}^{2}$
         by solving (15), (17) or (18) accordingly.
4.     For $i=1$ to $m$
5.          Update $\mathbf{W}_{\cdot i}^{\{t+1\}}\leftarrow\operatorname*{argmin}_{\mathbf{W}_{\cdot i}}\ \lambda\mathcal{G}_{i}(\mathbf{W}_{\cdot i})+\frac{\mu}{2}\|\mathbf{W}_{\cdot i}-\mathbf{S}_{\cdot i}^{\{t+1\}}+\mathbf{Z}_{\cdot i}^{\{t\}}/\mu\|_{2}^{2}$
              by Algorithm 2.
6.     End For
7.     Update $\mathbf{Z}^{\{t+1\}}\leftarrow\mathbf{Z}^{\{t\}}+\mu(\mathbf{W}^{\{t+1\}}-\mathbf{S}^{\{t+1\}})$ and set $t\leftarrow t+1$.
8. Until $\|\mathbf{S}^{\{t\}}-\mathbf{W}^{\{t\}}\|_{\infty}\leq\epsilon$ or $t=T$.
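The control flow of Algorithm 1 can be sketched in a few lines. To keep the example self-contained and checkable, the sketch below makes two substitutions that are not part of the proposed method: the structured hinge loss is replaced by a stand-in squared loss $\mathcal{G}(\mathbf{W})=\frac{1}{2}\|\mathbf{W}-\mathbf{A}\|_F^2$ (so that the $\mathbf{W}$-step has a closed form instead of requiring Algorithm 2), and the penalty $\mu$ is kept fixed rather than adapted. The regularizer is the $\ell_{1,1}$-norm, and the names `admm_l11` and `A` are hypothetical.

```python
def soft_threshold(M, delta):
    """Entry-wise shrinkage operator used for the l1,1-norm S-step."""
    return [[max(v - delta, 0.0) + min(v + delta, 0.0) for v in row]
            for row in M]


def admm_l11(A, lam, mu=1.0, eps=1e-8, T=1000):
    """ADMM loop with the same S / W / Z structure as Algorithm 1.

    Solves min ||S||_{1,1} + (lam / 2) ||W - A||_F^2  s.t.  W = S.
    The squared loss is a STAND-IN for the structured hinge loss.
    """
    d, m = len(A), len(A[0])
    W = [[0.0] * m for _ in range(d)]
    S = [[0.0] * m for _ in range(d)]
    Z = [[0.0] * m for _ in range(d)]
    for _ in range(T):
        # S-step: shrinkage applied to W + Z / mu with threshold 1 / mu.
        S = soft_threshold(
            [[W[r][c] + Z[r][c] / mu for c in range(m)] for r in range(d)],
            1.0 / mu)
        # W-step (stand-in): minimize (lam/2)(w - a)^2 + (mu/2)(w - b)^2
        # with b = S - Z / mu, which gives w = (lam*a + mu*b) / (lam + mu).
        W = [[(lam * A[r][c] + mu * (S[r][c] - Z[r][c] / mu)) / (lam + mu)
              for c in range(m)] for r in range(d)]
        # Multiplier update.
        Z = [[Z[r][c] + mu * (W[r][c] - S[r][c]) for c in range(m)]
             for r in range(d)]
        # Stopping rule: infinity norm of the primal residual W - S.
        if max(abs(W[r][c] - S[r][c])
               for r in range(d) for c in range(m)) <= eps:
            break
    return W, S


A = [[3.0, -0.2], [-1.0, 0.6]]
W, S = admm_l11(A, lam=2.0)
print(S)
```

For this stand-in problem the exact minimizer is known to be the shrinkage of $\mathbf{A}$ with threshold $1/\lambda$, which gives a simple way to confirm that the loop converges to the right point. In the actual method, only the $\mathbf{W}$-step changes: each column is updated by solving the structured sub-problem.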

5.2 Solving the Sub-Problem for $\mathbf{S}$

In the $t$-th iteration (in the outer loop) of Algorithm 1, the sub-problem for $\mathbf{S}$ with respect to (13) can be simplified as:

$$\begin{aligned}
\mathbf{S}^{\{t+1\}}&\leftarrow\operatorname*{argmin}_{\mathbf{S}}\ \mathcal{A}(\mathbf{W}^{\{t\}},\mathbf{S},\mathbf{Z}^{\{t\}})\\
&=\operatorname*{argmin}_{\mathbf{S}}\ \Omega(\mathbf{S})+\frac{\mu}{2}\left\|\mathbf{W}^{\{t\}}-\mathbf{S}+\mathbf{Z}^{\{t\}}/\mu\right\|_{F}^{2}.
\end{aligned} \tag{14}$$
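The step from the augmented Lagrangian (13) to the form in (14) is a completion of the square. Keeping only the terms of (13) that depend on $\mathbf{S}$, a short derivation (written out here for convenience) is:

```latex
\begin{aligned}
&\Omega(\mathbf{S}) + \langle \mathbf{Z}, \mathbf{W}-\mathbf{S}\rangle
  + \frac{\mu}{2}\,\|\mathbf{W}-\mathbf{S}\|_F^2 \\
&\quad= \Omega(\mathbf{S})
  + \frac{\mu}{2}\,\Big\|\mathbf{W}-\mathbf{S}+\frac{\mathbf{Z}}{\mu}\Big\|_F^2
  - \frac{1}{2\mu}\,\|\mathbf{Z}\|_F^2 ,
\end{aligned}
% since
% (mu/2) ||W - S + Z/mu||_F^2
%   = (mu/2) ||W - S||_F^2 + <Z, W - S> + (1/(2 mu)) ||Z||_F^2 .
```

The last term does not depend on $\mathbf{S}$ and can be dropped, which leaves a problem of exactly the form (3) with $\mathbf{M}=\mathbf{W}^{\{t\}}+\mathbf{Z}^{\{t\}}/\mu$.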

The solution to (14) depends on the choice of the regularizer $\Omega(\mathbf{S})$:

  • Case 1: the $\ell_{1,1}$-norm. With $\Omega(\mathbf{S})$ being $\|\mathbf{S}\|_{1,1}$, by applying (5) to (14), we have:

    $$\begin{aligned}
    &\operatorname*{argmin}_{\mathbf{S}}\ \|\mathbf{S}\|_{1,1}+\frac{\mu}{2}\left\|\mathbf{W}^{\{t\}}-\mathbf{S}+\mathbf{Z}^{\{t\}}/\mu\right\|_{F}^{2}\\
    &=\max(0,\mathbf{W}^{\{t\}}+\mathbf{Z}^{\{t\}}/\mu-1/\mu)+\min(0,\mathbf{W}^{\{t\}}+\mathbf{Z}^{\{t\}}/\mu+1/\mu).
    \end{aligned} \tag{15}$$

  • Case 2: the $\ell_{2,1}$-norm. When $\Omega(\mathbf{S})=\|\mathbf{S}\|_{2,1}$, (14) can be rewritten as:

    $$\operatorname*{argmin}_{\mathbf{S}}\ \|\mathbf{S}\|_{2,1}+\frac{\mu}{2}\left\|\mathbf{W}^{\{t\}}-\mathbf{S}+\mathbf{Z}^{\{t\}}/\mu\right\|_{F}^{2}. \tag{16}$$

    By applying (6) to (16), we obtain the following closed-form solution:

    $$\mathbf{S}_{j\cdot}=\begin{cases}\dfrac{\|\mathbf{M}_{j\cdot}\|_{2}-\frac{1}{\mu}}{\|\mathbf{M}_{j\cdot}\|_{2}}\,\mathbf{M}_{j\cdot} & \text{if } \frac{1}{\mu}<\|\mathbf{M}_{j\cdot}\|_{2},\\[2mm] \mathbf{0} & \text{otherwise,}\end{cases} \tag{17}$$

    where $\mathbf{M}=\mathbf{W}^{\{t\}}+\mathbf{Z}^{\{t\}}/\mu$.

  • Case 3: the trace-norm. When $\Omega(\mathbf{S})=\|\mathbf{S}\|_{*}$, we can apply (7) to (14) and obtain the following closed-form solution:

    $$\begin{aligned}
    &\operatorname*{argmin}_{\mathbf{S}}\ \|\mathbf{S}\|_{*}+\frac{\mu}{2}\left\|\mathbf{W}^{\{t\}}-\mathbf{S}+\mathbf{Z}^{\{t\}}/\mu\right\|_{F}^{2}\\
    &=\mathbf{U}\big(\max(0,\mathbf{\Sigma}-1/\mu)+\min(0,\mathbf{\Sigma}+1/\mu)\big)\mathbf{V}^{T},
    \end{aligned} \tag{18}$$

    where $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}$ is the SVD of $\mathbf{W}^{\{t\}}+\mathbf{Z}^{\{t\}}/\mu$.

5.3 Solving the Sub-Problem for $\mathbf{W}$

5.3.1 Formulation

In the $t$-th outer iteration of Algorithm 1, the sub-problem for $\mathbf{W}$ with respect to (13) can be reformulated as:

$$\begin{aligned}
\mathbf{W}^{\{t+1\}}&\leftarrow\operatorname*{argmin}_{\mathbf{W}}\ \mathcal{A}(\mathbf{W},\mathbf{S}^{\{t+1\}},\mathbf{Z}^{\{t\}})\\
&=\operatorname*{argmin}_{\mathbf{W}}\ \lambda\mathcal{G}(\mathbf{W})+\frac{\mu}{2}\left\|\mathbf{W}-\mathbf{S}^{\{t+1\}}+\mathbf{Z}^{\{t\}}/\mu\right\|_{F}^{2}\\
&=\operatorname*{argmin}_{\mathbf{W}}\ \sum_{i=1}^{m}\Big[\lambda\mathcal{G}_{i}(\mathbf{W}_{\cdot i})+\frac{\mu}{2}\left\|\mathbf{W}_{\cdot i}-\mathbf{S}_{\cdot i}^{\{t+1\}}+\mathbf{Z}_{\cdot i}^{\{t\}}/\mu\right\|_{F}^{2}\Big].
\end{aligned} \tag{19}$$

To simplify the presentation, we denote $\mathbf{b}_{i} = \mathbf{S}_{.i}^{\{t+1\}} - \mathbf{Z}_{.i}^{\{t\}}/\mu$. Then, the problem in (19) can be separated into $m$ independent sub-problems, one per task:

$$\min_{\mathbf{W}_{.i}} \; \lambda\mathcal{G}_{i}(\mathbf{W}_{.i}) + \frac{\mu}{2}\left\|\mathbf{W}_{.i} - \mathbf{b}_{i}\right\|_{F}^{2}, \quad i = 1, \ldots, m. \tag{20}$$

For $j = 1, 2, \ldots, p$, we define $\mathbf{K} = [\mathbf{K}_{.1}, \mathbf{K}_{.2}, \ldots, \mathbf{K}_{.p}]$ with $\mathbf{K}_{.j} = {\mathbf{X}^{(i)}}^{T}(\mathbf{y}^{(i)} - \mathbf{y}_{j}) + \frac{\mu}{\lambda}\mathbf{b}_{i}$, and $\mathbf{\Delta} = (\mathbf{\Delta}_{1}, \mathbf{\Delta}_{2}, \ldots, \mathbf{\Delta}_{p})^{T}$ with $\mathbf{\Delta}_{j} = \Delta(\mathbf{y}^{(i)}, \mathbf{y}_{j})$. Then, the problem in (20) can be rewritten as:

$$\begin{aligned} &\min_{\mathbf{W}_{.i}} \; \lambda\mathcal{G}_{i}(\mathbf{W}_{.i}) + \frac{\mu}{2}\left\|\mathbf{W}_{.i} - \mathbf{b}_{i}\right\|_{F}^{2} \\ &= \min_{\mathbf{W}_{.i}} \; \frac{\mu}{2}\left(\|\mathbf{W}_{.i}\|_{F}^{2} + \|\mathbf{b}_{i}\|_{F}^{2} - 2\mathbf{W}_{.i}^{T}\mathbf{b}_{i}\right) \\ &\qquad + \lambda \max_{\mathbf{y}_{j} \in \mathbb{E}_{i}} \left[\Delta(\mathbf{y}^{(i)}, \mathbf{y}_{j}) - \mathbf{W}_{.i}^{T}{\mathbf{X}^{(i)}}^{T}(\mathbf{y}^{(i)} - \mathbf{y}_{j})\right]. \end{aligned} \tag{21}$$

By dividing the objective in (21) by $\mu$ and dropping the terms that are independent of $\mathbf{W}_{.i}$ and $\mathbf{y}_{j}$, we have:

$$\min_{\mathbf{W}_{.i}} \; \frac{1}{2}\|\mathbf{W}_{.i}\|_{2}^{2} + \frac{\lambda}{\mu} \max_{j} \left[\mathbf{\Delta}_{j} - (\mathbf{W}_{.i}^{T}\mathbf{K})_{j}\right] \tag{22}$$

The max operator ranges over an exponential number of elements, which makes the objective in (22) difficult to optimize directly. To tackle this issue, in the next two subsections we derive the Fenchel dual [36] form of (22) and develop a coordinate ascent algorithm to solve this dual form.
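To make the size of the problem in (22) concrete, the sketch below enumerates all $p = 2^{n}$ labelings for a tiny task and evaluates the objective by brute force. It is an illustration only: the helper names are hypothetical, the data are made up, and the loss is assumed to be twice the number of mismatched labels (the MTL-CLS special case mentioned in Section 6.1) rather than the F-score or AUC losses.

```python
from itertools import product

def hamming_loss(y_true, y_pred):
    # Assumed loss for this sketch: 2 * (number of mismatched labels).
    return 2.0 * sum(1 for a, b in zip(y_true, y_pred) if a != b)

def build_K_delta(X, y, b, lam, mu, loss=hamming_loss):
    """Enumerate all 2^n labelings y_j; return the columns K_{.j} and the values Delta_j."""
    n, d = len(X), len(X[0])
    cols, deltas = [], []
    for y_j in product([-1, 1], repeat=n):
        diff = [y[k] - y_j[k] for k in range(n)]
        # K_{.j} = X^T (y - y_j) + (mu / lam) * b
        col = [sum(X[k][f] * diff[k] for k in range(n)) + (mu / lam) * b[f]
               for f in range(d)]
        cols.append(col)
        deltas.append(loss(y, y_j))
    return cols, deltas

def primal(w, cols, deltas, lam, mu):
    """Objective (22): 0.5 * ||w||^2 + (lam / mu) * max_j [Delta_j - w^T K_{.j}]."""
    worst = max(dl - sum(wf * cf for wf, cf in zip(w, col))
                for col, dl in zip(cols, deltas))
    return 0.5 * sum(wf * wf for wf in w) + (lam / mu) * worst

# Toy task with n = 3 examples and d = 2 features (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, -1, 1]
b = [0.5, -0.5]
cols, deltas = build_K_delta(X, y, b, lam=1.0, mu=1.0)
value_at_zero = primal([0.0, 0.0], cols, deltas, lam=1.0, mu=1.0)
```

Already for $n = 3$ there are 8 columns; the count doubles with every additional example, which is why the explicit matrix $\mathbf{K}$ is never formed in practice.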

5.3.2 Fenchel Dual Form of (22)

In this subsection, we derive the Fenchel dual [36] form of (22). To simplify the presentation, we use $\mathbf{w}$ to represent $\mathbf{W}_{.i}$. Then we re-formulate the primal form in (22) as:

$$\begin{aligned} \min_{\mathbf{w}} \mathcal{P}(\mathbf{w}) &= \mathcal{M}(\mathbf{w}) + \mathcal{N}(-\mathbf{w}^{T}\mathbf{K}) \\ &= \frac{1}{2}\|\mathbf{w}\|_{2}^{2} + \frac{\lambda}{\mu}\max_{j}(\mathbf{\Delta}^{T} - \mathbf{w}^{T}\mathbf{K})_{j}, \end{aligned} \tag{23}$$

where we define $\mathcal{M}(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_{2}^{2}$ and $\mathcal{N}(-\mathbf{w}^{T}\mathbf{K}) = \frac{\lambda}{\mu}\max_{j}(\mathbf{\Delta}^{T} - \mathbf{w}^{T}\mathbf{K})_{j}$.

Before deriving the dual form of (23), we first introduce the definition (Definition 1) and the main properties (Theorems 1 and 2) of Fenchel duality.

Definition 1.

The Fenchel conjugate of a function $f(\boldsymbol{x})$ is defined as $f^{*}(\boldsymbol{\theta}) = \max_{\boldsymbol{x} \in dom(f)} \left(\langle\boldsymbol{\theta}, \boldsymbol{x}\rangle - f(\boldsymbol{x})\right)$.

Theorem 1.

(Fenchel-Young inequality: [5], Proposition 3.3.4) Any points $\boldsymbol{\theta}$ in the domain of the function $f^{*}$ and $\boldsymbol{x}$ in the domain of the function $f$ satisfy the inequality:

$$f(\boldsymbol{x}) + f^{*}(\boldsymbol{\theta}) \geq \langle\boldsymbol{\theta}, \boldsymbol{x}\rangle \tag{24}$$

The equality holds if and only if $\boldsymbol{\theta} \in \partial f(\boldsymbol{x})$.

Theorem 2.

(Fenchel Duality inequality, see e.g., Theorem 3.3.5 in [5]) Let $\mathcal{M}: \mathbb{R}^{d} \rightarrow (-\infty, +\infty]$ and $\mathcal{N}: \mathbb{R}^{p} \rightarrow (-\infty, +\infty]$ be two closed and convex functions, and let $\mathbf{K} \in \mathbb{R}^{d \times p}$ be a matrix. Then we have

$$\sup_{\boldsymbol{\alpha}} \; -\mathcal{M}^{*}(\mathbf{K}\boldsymbol{\alpha}) - \mathcal{N}^{*}(\boldsymbol{\alpha}) \;\leq\; \inf_{\boldsymbol{w}} \; \mathcal{M}(\boldsymbol{w}) + \mathcal{N}(-\boldsymbol{w}^{T}\mathbf{K}), \tag{25}$$

where $\boldsymbol{\alpha} \in \mathbb{R}^{p}$ and $\boldsymbol{w} \in \mathbb{R}^{d}$. The equality holds if and only if $0 \in (dom(\mathcal{N}) - \mathbf{K}^{T}dom(\mathcal{M}))$.

Note that the right-hand side of the inequality in (25) is called the primal form and the left-hand side of (25) is the corresponding dual form.

With Definition 1, it is known (see, e.g., [38], Appendix B) that the Fenchel conjugate of the squared $\ell_{2}$-norm $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|_{2}^{2}$ is again the squared $\ell_{2}$-norm $f^{*}(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|_{2}^{2}$. Hence, the Fenchel conjugate of $\mathcal{M}(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_{2}^{2}$, evaluated at $\mathbf{K}\alpha$, is

$$\mathcal{M}^{*}(\mathbf{K}\alpha) = \frac{1}{2}\|\mathbf{K}\alpha\|_{2}^{2} \tag{26}$$
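For readers who want the intermediate step, the conjugate of the squared $\ell_{2}$-norm can be obtained directly from Definition 1. The following short derivation is a standard argument written out here for convenience; it is not taken from [38].

```latex
\begin{aligned}
\mathcal{M}^{*}(\boldsymbol{\theta})
  &= \max_{\mathbf{w}} \; \langle \boldsymbol{\theta}, \mathbf{w} \rangle - \tfrac{1}{2}\|\mathbf{w}\|_{2}^{2}
  && \text{(Definition 1)} \\
  &= \langle \boldsymbol{\theta}, \boldsymbol{\theta} \rangle - \tfrac{1}{2}\|\boldsymbol{\theta}\|_{2}^{2}
  && \text{(the gradient } \boldsymbol{\theta} - \mathbf{w} \text{ vanishes at } \mathbf{w} = \boldsymbol{\theta}\text{)} \\
  &= \tfrac{1}{2}\|\boldsymbol{\theta}\|_{2}^{2}.
\end{aligned}
```

Because the result depends on $\boldsymbol{\theta}$ only through its norm, evaluating it at $\boldsymbol{\theta} = \mathbf{K}\alpha$ or at $\boldsymbol{\theta} = -\mathbf{K}\alpha$ gives the same value $\frac{1}{2}\|\mathbf{K}\alpha\|_{2}^{2}$.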

It is known ([38], Appendix B) that the Fenchel conjugate of $f(\mathbf{x} + \mathbf{y})$ is $f^{*}(\boldsymbol{\theta}) - \langle\boldsymbol{\theta}, \mathbf{y}\rangle$, and that the Fenchel conjugate of $cf(\mathbf{x})$ ($c > 0$) is $cf^{*}(\boldsymbol{\theta}/c)$. Then we can derive that the Fenchel conjugate of $cf(\mathbf{x} + \mathbf{y})$ is

$$cf^{*}(\boldsymbol{\theta}/c) - \langle\boldsymbol{\theta}, \mathbf{y}\rangle. \tag{27}$$

In addition, the Fenchel conjugate of $f(\mathbf{x}) = \max_{j}(\mathbf{x}_{j})$ is $I_{\theta_{i} \geq 0, \sum_{i}\theta_{i} = 1}(\theta)$, where $I_{condition}(\cdot)$ is the indicator function with $I_{condition}(\theta) = 0$ if $condition$ is true and $I_{condition}(\theta) = +\infty$ otherwise (see [38], Appendix B). For convenience, we denote $\mathcal{Q}(\mathbf{x}) = \max_{j}(\mathbf{x}_{j})$. It is easy to verify that $\mathcal{N}(-\mathbf{w}^{T}\mathbf{K}) = \frac{\lambda}{\mu}\max_{j}(\mathbf{\Delta}^{T} - \mathbf{w}^{T}\mathbf{K})_{j} = \frac{\lambda}{\mu}\mathcal{Q}(\mathbf{\Delta}^{T} - \mathbf{w}^{T}\mathbf{K})$. Hence, by using (27), the Fenchel conjugate of $\mathcal{N}$ is:

$$\mathcal{N}^{*}(\alpha) = \frac{\lambda}{\mu}\mathcal{Q}^{*}\!\left(\alpha \Big/ \frac{\lambda}{\mu}\right) - \langle\alpha, \mathbf{\Delta}\rangle = \begin{cases} -\mathbf{\Delta}^{T}\alpha, & \text{if } \sum_{k=1}^{p}\alpha_{k} = \frac{\lambda}{\mu} \text{ and } \alpha_{k} \geq 0,\ k = 1, \cdots, p; \\ +\infty, & \text{otherwise}. \end{cases} \tag{28}$$

With (26), (28) and (25), we have that the dual form of (23) is:

$$\begin{aligned} \max_{\alpha} \mathcal{D}(\alpha) &= \max_{\alpha} \; -\mathcal{M}^{*}(\mathbf{K}\alpha) - \mathcal{N}^{*}(\alpha) \\ &= \max_{\alpha} \; -\frac{1}{2}\alpha^{T}\mathbf{K}^{T}\mathbf{K}\alpha + \mathbf{\Delta}^{T}\alpha \\ &\text{s.t.} \;\; \sum_{k=1}^{p}\alpha_{k} = \frac{\lambda}{\mu} \text{ and } \alpha_{k} \geq 0,\ k = 1, \cdots, p \end{aligned} \tag{29}$$

The dual form in (29) is a smooth quadratic function with linear constraints, which is easier to optimize than its primal form in (23).
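The relation between (23) and (29) can be checked numerically on a toy instance. The sketch below assumes that the columns $\mathbf{K}_{.j}$ and the values $\mathbf{\Delta}_{j}$ are given explicitly as small Python lists (made-up numbers, with `c` standing for $\lambda/\mu$); the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal(w, cols, deltas, c):
    # P(w) in (23), with c = lam / mu.
    return 0.5 * dot(w, w) + c * max(dl - dot(w, col) for col, dl in zip(cols, deltas))

def dual(alpha, cols, deltas):
    # D(alpha) in (29); alpha is assumed feasible: alpha_k >= 0 and sum(alpha) = lam / mu.
    d = len(cols[0])
    K_alpha = [sum(a * col[f] for a, col in zip(alpha, cols)) for f in range(d)]
    return -0.5 * dot(K_alpha, K_alpha) + dot(deltas, alpha), K_alpha

# Toy columns K_{.j} and losses Delta_j (illustrative values only).
cols = [[0.0, 0.0], [2.0, 0.0], [0.0, -2.0], [2.0, -2.0]]
deltas = [0.0, 2.0, 2.0, 4.0]
c = 1.0

# A feasible but poor dual point: all mass on the first coordinate.
d_vertex, w_vertex = dual([1.0, 0.0, 0.0, 0.0], cols, deltas)
p_vertex = primal(w_vertex, cols, deltas, c)

# A better feasible dual point: uniform weights.
d_uniform, w_uniform = dual([0.25, 0.25, 0.25, 0.25], cols, deltas)
p_uniform = primal(w_uniform, cols, deltas, c)
```

In both cases the dual value does not exceed the primal value evaluated at $\mathbf{w} = \mathbf{K}\alpha$, as (25) requires; for the uniform point on this particular toy instance the two values coincide, so that point is optimal.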

5.3.3 Primal-Dual Algorithm via Coordinate Ascent

In this subsection, we develop a coordinate ascent algorithm to optimize the objective in (29), where we use the primal-dual gap $\mathcal{P}(\mathbf{w}) - \mathcal{D}(\alpha)$ as the early stopping criterion. Coordinate ascent is a widely used method in various machine learning problems (e.g., [12, 38, 23, 45]).

Algorithm 2 Primal-dual algorithm via coordinate ascent
Input: $\mathbf{b}_{i}$, $\epsilon_{F}$, $\lambda$, $\mu$, maximal iteration number $T_{F}$
Output: $\mathbf{w}$
1. Initialize: $v \leftarrow 0$, $\hat{\mathbf{w}} \leftarrow 0$
2. Repeat:
3.    Find the largest element $(g_{\alpha})_{j}$ in the gradient vector $g_{\alpha} = \nabla\mathcal{D}(\alpha)$ by solving (30) via Algorithm 3 for F-score (or Algorithm 4 for AUC).
4.    $\mathbf{\Delta}_{j} \leftarrow \Delta(\mathbf{y}^{(i)}, \mathbf{y}_{j})$
5.    $\mathbf{K}_{.j} \leftarrow {\mathbf{X}^{(i)}}^{T}(\mathbf{y}^{(i)} - \mathbf{y}_{j}) + \frac{\mu}{\lambda}\mathbf{b}_{i}$
6.    Calculate $\gamma$ by (37).
7.    Update $\hat{\mathbf{w}}$ by (35).
8.    Update $v$ by (36).
9. Until $\hat{\mathbf{w}}^{T}\hat{\mathbf{w}} + \frac{\lambda}{\mu}\max_{j}(g_{\alpha})_{j} - v \leq \epsilon_{F}$ or the iteration number reaches $T_{F}$
10. $\mathbf{w} \leftarrow \hat{\mathbf{w}}$

The proposed coordinate ascent algorithm is shown in Algorithm 2. We sketch its main steps as follows:

Repeat

  • Select the index $j$ of the largest element $(\nabla_{\alpha}\mathcal{D}(\alpha))_{j}$ of the gradient vector $\nabla_{\alpha}\mathcal{D}(\alpha)$.

  • Update $\alpha_{j}$ with the other $\alpha_{k}$ ($k \neq j$) fixed, so as to greedily increase the value of $\mathcal{D}(\alpha)$.

Until the early stopping criterion $\mathcal{P}(\mathbf{w}) - \mathcal{D}(\alpha) \leq \epsilon_{F}$ is satisfied.

In each iteration, the proposed algorithm has three main building blocks:

The First Step is to select the index $j$ of the largest element of the gradient vector of the dual objective $\mathcal{D}(\alpha)$. Specifically, the gradient of $\mathcal{D}(\alpha)$ with respect to $\alpha$ is:

$$g_{\alpha} = \nabla_{\alpha}\mathcal{D}(\alpha) = -\mathbf{K}^{T}\mathbf{K}\alpha + \mathbf{\Delta},$$

and the largest element in $\nabla_{\alpha}\mathcal{D}(\alpha)$ is:

$$(g_{\alpha})_{j} = (\nabla_{\alpha}\mathcal{D}(\alpha))_{j} = \max_{j}\left[\mathbf{\Delta}_{j} - (\mathbf{K}\alpha)^{T}\mathbf{K}_{.j}\right].$$

We denote $\hat{\mathbf{w}} = \mathbf{K}\alpha$. Then, with the definitions of $\mathbf{\Delta}_{j}$ and $\mathbf{K}_{.j}$, we have:

$$(\nabla_{\alpha}\mathcal{D}(\alpha))_{j} = \max_{j}\left[\Delta(\mathbf{y}^{(i)}, \mathbf{y}_{j}) - \hat{\mathbf{w}}^{T}{\mathbf{X}^{(i)}}^{T}(\mathbf{y}^{(i)} - \mathbf{y}_{j})\right] - \frac{\mu}{\lambda}\hat{\mathbf{w}}^{T}\mathbf{b}_{i}. \tag{30}$$

The last term does not depend on $j$, so the maximizing index is determined by the bracketed term alone.

Interestingly, the problem in (30) is essentially the same as the problem of “finding the most violated constraint” in Structured-SVMs (e.g., the problem (7) in [20]). For several commonly used evaluation metrics $\Delta(\cdot, \cdot)$, efficient polynomial-time algorithms have been proposed to solve the problem of “finding the most violated constraint”. One can directly use these inference algorithms to solve (30), i.e., to select the largest element from the gradient vector $\nabla_{\alpha}\mathcal{D}(\alpha)$. For example, when $\Delta(\cdot, \cdot)$ corresponds to F-score, one can use Algorithm 2 in [20] to solve (30); when $\Delta(\cdot, \cdot)$ corresponds to AUC, one can use Algorithm 3 in [20] to solve (30). For completeness, we show these two algorithms in our notation in Algorithms 3 and 4. Note that Algorithms 3 and 4 have time complexity $O(n_{i}^{2})$ and $O(n_{i}\log n_{i})$, respectively.

Algorithm 3 Algorithm to solve (30) with loss function defined on F-score
Input: $n = n_{i}$, $\mathbf{X}^{(i)} = (\mathbf{x}^{(i)}_{1}, \ldots, \mathbf{x}^{(i)}_{n})^{T}$, $\mathbf{y}^{(i)} = (\mathbf{y}^{(i)}_{1}, \ldots, \mathbf{y}^{(i)}_{n})^{T}$, $\mathbf{w}$
Output: $\mathbf{y}_{j}$
1. Initialize: $(k^{p}_{1}, \ldots, k^{p}_{Pos}) \leftarrow sort\{k : \mathbf{y}^{(i)}_{k} = 1\}$ by $\mathbf{w}^{T}\mathbf{x}^{(i)}_{k}$
               $(k^{n}_{1}, \ldots, k^{n}_{Neg}) \leftarrow sort\{k : \mathbf{y}^{(i)}_{k} = -1\}$ by $\mathbf{w}^{T}\mathbf{x}^{(i)}_{k}$
2. For $a \in [0, \ldots, Pos]$ do:
3.    $c \leftarrow Pos - a$
4.    Set $l_{k_{1}^{p}}, \ldots, l_{k_{a}^{p}}$ to $1$ and set $l_{k_{a+1}^{p}}, \ldots, l_{k_{Pos}^{p}}$ to $-1$
5.    For $d \in [0, \ldots, Neg]$ do:
6.        $b \leftarrow Neg - d$
7.        Set $l_{k_{1}^{n}}, \ldots, l_{k_{b}^{n}}$ to $1$ and set $l_{k_{b+1}^{n}}, \ldots, l_{k_{Neg}^{n}}$ to $-1$
8.        $v \leftarrow \Delta(\mathbf{y}^{(i)}, (l_{1}, \ldots, l_{n})^{T}) + \mathbf{w}^{T}\sum_{k=1}^{n} l_{k}\mathbf{x}^{(i)}_{k}$, where $\Delta(\cdot, \cdot)$ is defined by (11)
9.        If $v$ is the largest so far, then:
10.          $\mathbf{y}_{j} \leftarrow (l_{1}, \ldots, l_{n})^{T}$
11.        End if
12.    End for
13. End for
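The search in Algorithm 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the loss is assumed to be $1 - F_{1}$ (the definition in (11) is not reproduced here), the inputs are the precomputed scores $\mathbf{w}^{T}\mathbf{x}^{(i)}_{k}$, both groups are sorted by decreasing score, and all function names are hypothetical. For clarity the objective is recomputed from scratch for every pair $(a, b)$, so this version is slower than the $O(n_{i}^{2})$ bound quoted above.

```python
from itertools import product

def f1_loss(y_true, y_pred):
    # Assumed loss for this sketch: 1 - F1, with F1 = 0 when it is undefined.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    denom = 2 * tp + fp + fn
    return 1.0 - (2.0 * tp / denom if denom > 0 else 0.0)

def objective(y_true, labels, scores):
    # Delta(y, l) + sum_k l_k * (w^T x_k), the quantity maximized in step 8.
    return f1_loss(y_true, labels) + sum(l * s for l, s in zip(labels, scores))

def most_violated_f1(scores, y_true):
    """Try every (a, b): a positives and b negatives labeled +1, highest scores first."""
    pos = sorted((k for k, t in enumerate(y_true) if t == 1), key=lambda k: -scores[k])
    neg = sorted((k for k, t in enumerate(y_true) if t == -1), key=lambda k: -scores[k])
    best_val, best_lab = None, None
    for a in range(len(pos) + 1):
        for b in range(len(neg) + 1):
            lab = [-1] * len(y_true)
            for k in pos[:a] + neg[:b]:
                lab[k] = 1
            val = objective(y_true, lab, scores)
            if best_val is None or val > best_val:
                best_val, best_lab = val, lab
    return best_lab, best_val

def brute_force(scores, y_true):
    # Reference check over all 2^n labelings; only feasible for very small n.
    return max(objective(y_true, lab, scores)
               for lab in product([-1, 1], repeat=len(y_true)))

scores = [0.3, -0.2, 0.1, -0.5]   # illustrative values of w^T x_k
y_true = [1, 1, -1, -1]
labels, value = most_violated_f1(scores, y_true)
reference = brute_force(scores, y_true)
```

The search is exact because the loss depends on the labeling only through the counts $(a, b)$, while for fixed counts the linear term is largest when the highest-scoring examples of each group receive the label $+1$.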
Algorithm 4 Algorithm to solve (30) with loss function defined on AUC
Input: $n = n_{i}$, $\mathbf{X}^{(i)} = (\mathbf{x}^{(i)}_{1}, \ldots, \mathbf{x}^{(i)}_{n})^{T}$, $\mathbf{y}^{(i)} = (\mathbf{y}^{(i)}_{1}, \ldots, \mathbf{y}^{(i)}_{n})^{T}$, $\mathbf{w}$
Output: $\mathbf{y}_{j}$
1. Initialize: for $k \in \{k : \mathbf{y}_{k}^{(i)} = 1\}$ do $q_{k} \leftarrow -0.25 + \mathbf{w}^{T}\mathbf{x}^{(i)}_{k}$
               for $k \in \{k : \mathbf{y}_{k}^{(i)} = -1\}$ do $q_{k} \leftarrow 0.25 + \mathbf{w}^{T}\mathbf{x}^{(i)}_{k}$
2. $(r_{1}, \ldots, r_{n}) \leftarrow$ sort $\{1, \ldots, n\}$ by $q_{k}$
3. $q_{Pos} \leftarrow Pos$, $q_{Neg} \leftarrow 0$
4. For $k \in [1, \ldots, n]$ do:
5.    If $\mathbf{y}^{(i)}_{r_{k}} > 0$, then:
6.       $l_{r_{k}} \leftarrow (Neg - 2q_{Neg})$
7.       $q_{Pos} \leftarrow q_{Pos} - 1$
8.    else
9.       $l_{r_{k}} \leftarrow (-Pos + 2q_{Pos})$
10.      $q_{Neg} \leftarrow q_{Neg} + 1$
11.    End if
12. End for
13. Convert $(l_{1}, \ldots, l_{n})$ to $\mathbf{y}_{j}$ according to some threshold value.

The Second Step is to update $\alpha_{j}$ with the other variables $\alpha_{k}$ ($k \neq j$) fixed, given the selected index $j$.

We define the update rule for $\alpha$ as:

$$\alpha \leftarrow (1 - \gamma)\alpha + \frac{\gamma\lambda}{\mu}e_{j}, \tag{31}$$

where $0 \leq \gamma \leq 1$ and $e_{j}$ denotes the $p$-dimensional vector with the $j$-th element being one and the other elements being zero. It is worth noting that, given $\alpha_{j} \geq 0$ and $\sum_{j}\alpha_{j} = \lambda/\mu$ before updating, and $0 \leq \gamma \leq 1$, the rule in (31) guarantees that $\alpha_{j} \geq 0$ and $\sum_{j}\alpha_{j} = \lambda/\mu$ still hold after updating.

By substituting (31) into (29), we obtain the corresponding optimization problem with respect to $\gamma$:

$$\begin{aligned} \max_{\gamma} \; &-\frac{1}{2}\left[(1 - \gamma)\alpha + \frac{\gamma\lambda}{\mu}e_{j}\right]^{T}\mathbf{K}^{T}\mathbf{K}\left[(1 - \gamma)\alpha + \frac{\gamma\lambda}{\mu}e_{j}\right] \\ &+ \left[(1 - \gamma)\alpha + \frac{\gamma\lambda}{\mu}e_{j}\right]^{T}\mathbf{\Delta} \end{aligned} \tag{32}$$

Intuitively, our goal is to find $\gamma \in [0, 1]$ that increases the dual objective $\mathcal{D}(\alpha)$ as much as possible. By setting the gradient of (32) with respect to $\gamma$ to zero, we have

$$\|\mathbf{K}(e_{j}\lambda/\mu - \alpha)\|_{2}^{2}\,\gamma + (e_{j}\lambda/\mu - \alpha)^{T}\mathbf{K}^{T}\mathbf{K}\alpha - (e_{j}\lambda/\mu - \alpha)^{T}\mathbf{\Delta} = 0$$

By simple algebra, we have

$$\gamma = -\frac{(e_{j}\lambda/\mu - \alpha)^{T}(\mathbf{K}^{T}\mathbf{K}\alpha - \mathbf{\Delta})}{\|\mathbf{K}(e_{j}\lambda/\mu - \alpha)\|_{2}^{2}} \tag{33}$$

To ensure that $0 \leq \gamma \leq 1$, we further restrict $\gamma$:

$$\gamma = \max\left(\min\left(-\frac{(e_{j}\lambda/\mu - \alpha)^{T}(\mathbf{K}^{T}\mathbf{K}\alpha - \mathbf{\Delta})}{\|\mathbf{K}(e_{j}\lambda/\mu - \alpha)\|_{2}^{2}}, 1\right), 0\right) \tag{34}$$

The calculation of $\gamma$ in (34) depends on $\mathbf{K}\alpha$ and $\alpha^{T}\mathbf{\Delta}$. However, since $\mathbf{K} \in \mathbb{R}^{d \times p}$, $\mathbf{\Delta}, \alpha \in \mathbb{R}^{p}$ and $p = 2^{n_{i}}$, the time needed to directly calculate either $\mathbf{K}\alpha$ or $\alpha^{T}\mathbf{\Delta}$ grows exponentially with $n_{i}$, which is often unaffordable. To improve efficiency, we maintain auxiliary variables to reduce the computation cost. Recall that we have defined $\hat{\mathbf{w}} = \mathbf{K}\alpha$. We also define $v = \alpha^{T}\mathbf{\Delta}$. We maintain $\hat{\mathbf{w}}$ and $v$ during the iterations.

With the update rule (31) for $\alpha$, we can easily derive the corresponding update rules for $\hat{\mathbf{w}}$ and $v$, respectively:

$$\hat{\mathbf{w}} \leftarrow (1 - \gamma)\hat{\mathbf{w}} + \frac{\gamma\lambda}{\mu}\mathbf{K}_{.j}, \tag{35}$$

$$v \leftarrow (1 - \gamma)v + \frac{\gamma\lambda}{\mu}\mathbf{\Delta}_{j}. \tag{36}$$

The update rules for $\hat{\mathbf{w}}$ and $v$ have time complexity $O(d)$ and $O(1)$, respectively.

With the maintained $\hat{\mathbf{w}}$ and $v$, the update rule in (34) can be simplified to:

$$\gamma \leftarrow \max\left(\min\left(-\frac{\frac{\lambda}{\mu}(\mathbf{K}_{.j}^{T}\hat{\mathbf{w}} - \mathbf{\Delta}_{j}) - \hat{\mathbf{w}}^{T}\hat{\mathbf{w}} + v}{\left\|\frac{\lambda}{\mu}\mathbf{K}_{.j} - \hat{\mathbf{w}}\right\|_{2}^{2}}, 1\right), 0\right), \tag{37}$$

where the time complexity of updating $\gamma$ in (37) is reduced to $O(d)$.
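The equivalence between (34) and (37) can be checked on a small example. The sketch below is illustrative only: it assumes explicit toy columns $\mathbf{K}_{.j}$ (which would not be available at realistic sizes), writes `c` for $\lambda/\mu$, assumes the denominator is nonzero, and uses hypothetical function names.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def clip01(x):
    return max(min(x, 1.0), 0.0)

def gamma_direct(alpha, cols, deltas, j, c):
    """Step size (34), formed from the full vectors alpha and Delta (cost grows with p)."""
    p, d = len(cols), len(cols[0])
    direction = [(c if k == j else 0.0) - alpha[k] for k in range(p)]   # e_j * c - alpha
    K_alpha = [sum(alpha[k] * cols[k][f] for k in range(p)) for f in range(d)]
    K_dir = [sum(direction[k] * cols[k][f] for k in range(p)) for f in range(d)]
    neg_grad = [dot(cols[k], K_alpha) - deltas[k] for k in range(p)]    # K^T K alpha - Delta
    return clip01(-dot(direction, neg_grad) / dot(K_dir, K_dir))

def gamma_maintained(w_hat, v, col_j, delta_j, c):
    """Step size (37), using only w_hat = K alpha and v = alpha^T Delta (cost O(d))."""
    num = c * (dot(col_j, w_hat) - delta_j) - dot(w_hat, w_hat) + v
    diff = [c * kf - wf for kf, wf in zip(col_j, w_hat)]
    return clip01(-num / dot(diff, diff))

# Toy instance (illustrative values only).
cols = [[0.0, 0.0], [2.0, 0.0], [0.0, -2.0], [2.0, -2.0]]
deltas = [0.0, 2.0, 2.0, 4.0]
c = 1.0
alpha = [0.5, 0.5, 0.0, 0.0]
j = 3

w_hat = [sum(a * col[f] for a, col in zip(alpha, cols)) for f in range(2)]
v = dot(alpha, deltas)
g_direct = gamma_direct(alpha, cols, deltas, j, c)
g_fast = gamma_maintained(w_hat, v, cols[j], deltas[j], c)
```

Both routes give the same step size, but only the second avoids touching all $p$ coordinates of $\alpha$.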

The early stopping criterion is defined on the primal-dual gap: we stop when $\mathcal{P}(\mathbf{w}) - \mathcal{D}(\alpha) \leq \epsilon_{F}$, where the parameter $\epsilon_{F}$ is a pre-defined tolerance. Let $\mathcal{P}(\mathbf{w}^{\star})$ be the optimal value of the primal objective (23). According to Theorem 2, we have:

$$\mathcal{P}(\mathbf{w}) - \mathcal{P}(\mathbf{w}^{\star}) \leq \mathcal{P}(\mathbf{w}) - \mathcal{D}(\alpha) \leq \epsilon_{F}.$$

It is worth noting that, by using the update rule (31) with $0 \leq \gamma \leq 1$, Algorithm 2 guarantees that $\alpha$ satisfies the constraints $\alpha_{k} \geq 0$ and $\sum_{k}\alpha_{k} = \lambda/\mu$ in all of the iterations. In other words, we have $\mathcal{N}^{*}(\alpha) < \infty$ in all of the iterations. Hence, with (23) and (29), we have:

$$\begin{aligned} &\mathcal{P}(\mathbf{w}) - \mathcal{D}(\alpha) \\ &= \mathcal{M}(\mathbf{w}) + \mathcal{M}^{*}(\mathbf{K}\alpha) + \mathcal{N}(-\mathbf{w}^{T}\mathbf{K}) + \mathcal{N}^{*}(\alpha) \end{aligned} \tag{38}$$

With Theorem 1, we have $\mathcal{M}(\mathbf{w}) + \mathcal{M}^{*}(\mathbf{K}\alpha) \geq \langle\mathbf{w}, \mathbf{K}\alpha\rangle$, where the equality holds when $\mathbf{w} = \mathbf{K}\alpha = \hat{\mathbf{w}}$. In order to greedily upper-bound the gap $\mathcal{D}(\alpha^{\star}) - \mathcal{D}(\alpha)$, we set $\mathbf{w} = \mathbf{K}\alpha = \hat{\mathbf{w}}$ in (38) and obtain:

$$\begin{aligned} &\mathcal{P}(\mathbf{w}) - \mathcal{D}(\alpha) \\ &= \langle\hat{\mathbf{w}}, \mathbf{K}\alpha\rangle + \mathcal{N}(-\hat{\mathbf{w}}^{T}\mathbf{K}) + \mathcal{N}^{*}(\alpha) \\ &= \hat{\mathbf{w}}^{T}\hat{\mathbf{w}} + \frac{\lambda}{\mu}\max_{j}(g_{\alpha})_{j} - v \end{aligned} \tag{39}$$

Here the second term follows from the definition of $\mathcal{N}$ in (23), which carries the factor $\frac{\lambda}{\mu}$, and the third from (28) with $v = \alpha^{T}\mathbf{\Delta}$. Consequently, the early stopping criterion is set to be $\hat{\mathbf{w}}^{T}\hat{\mathbf{w}} + \frac{\lambda}{\mu}\max_{j}(g_{\alpha})_{j} - v \leq \epsilon_{F}$, which can be calculated in time $O(d)$ once the largest gradient element is available.
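Putting the two steps and the stopping rule together, the loop of Algorithm 2 can be sketched as follows. This is a hedged illustration, not the authors' code: the columns $\mathbf{K}_{.j}$ are assumed to be given explicitly as a small list, so the largest gradient element is found by brute force instead of Algorithms 3 or 4; `c` stands for $\lambda/\mu$; the gap is computed directly from the definitions in (23) and (29); and, unlike step 1 of Algorithm 2, the iterate starts from the feasible vertex $\alpha = \frac{\lambda}{\mu}e_{1}$ so that the constraints of (29) hold from the first iteration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def coordinate_ascent(cols, deltas, c, eps=1e-6, max_iter=1000):
    """Sketch of Algorithm 2 with explicit columns K_{.j}; c stands for lam / mu."""
    # Feasible start: alpha = c * e_1, hence w_hat = c * K_{.1} and v = c * Delta_1.
    w_hat = [c * x for x in cols[0]]
    v = c * deltas[0]
    gap = float("inf")
    for _ in range(max_iter):
        # First step: largest gradient entry g_j = Delta_j - w_hat^T K_{.j}.
        grads = [dl - dot(w_hat, col) for col, dl in zip(cols, deltas)]
        j = max(range(len(cols)), key=lambda k: grads[k])
        # Primal-dual gap P(w_hat) - D(alpha), from the definitions in (23) and (29).
        gap = dot(w_hat, w_hat) + c * grads[j] - v
        if gap <= eps:
            break
        # Second step: clipped step size (37), then the updates (35) and (36).
        num = c * (dot(cols[j], w_hat) - deltas[j]) - dot(w_hat, w_hat) + v
        diff = [c * kf - wf for kf, wf in zip(cols[j], w_hat)]
        denom = dot(diff, diff)
        if denom == 0.0:
            break   # the chosen vertex coincides with the current iterate
        gamma = max(min(-num / denom, 1.0), 0.0)
        w_hat = [(1 - gamma) * wf + gamma * c * kf for wf, kf in zip(w_hat, cols[j])]
        v = (1 - gamma) * v + gamma * c * deltas[j]
    return w_hat, v, gap

# Toy instance (illustrative values only).
cols = [[0.0, 0.0], [2.0, 0.0], [0.0, -2.0], [2.0, -2.0]]
deltas = [0.0, 2.0, 2.0, 4.0]
w_hat, v, gap = coordinate_ascent(cols, deltas, c=1.0)
```

Note that $\alpha$ itself is never stored: only $\hat{\mathbf{w}}$ and $v$ are carried across iterations, which is what keeps the per-iteration cost independent of $p$ apart from the search for the largest gradient element.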

5.4 Convergence Analysis

For the sub-problem w.r.t. $\mathbf{W}$ (see Section 5.3), the proposed coordinate ascent method is similar to those in [38, 23]. By using proof techniques similar to those of [38, 23] (e.g., see the proof of Theorem 1 in [23]), we can derive that, after $T$ iterations of Algorithm 2, we have $\mathcal{D}(\alpha^{\star}) - \mathcal{D}(\alpha) \leq \mathcal{P}(\mathbf{w}) - \mathcal{D}(\alpha) \leq \epsilon_{F} = O(\frac{1}{T})$. Note that $\mathcal{D}(\alpha^{\star}) = \mathcal{P}(\mathbf{w}^{\star})$, where $\mathcal{D}(\alpha^{\star})$ and $\mathcal{P}(\mathbf{w}^{\star})$ are the optimal values of (29) and (23), respectively. Ideally, for all the tasks, if we set the iteration number $T$ to be sufficiently large, we can solve the sub-problem w.r.t. $\mathbf{W}$ exactly (ignoring small numerical errors).

In addition, as discussed in Section 5.2, the sub-problem w.r.t. $\mathbf{S}$ can be solved exactly by closed-form solutions. Hence, the objective (12) is convex subject to linear constraints, and both of its sub-problems can be solved exactly. Based on existing theoretical results [6, 16], Algorithm 1 converges to the global optimum with an $O(1/\epsilon)$ convergence rate.

6 Experiments

6.1 Overview

In this section, we evaluate and compare the performance of the proposed SMTL method on several benchmark datasets. For the regularizer $\Omega(\mathbf{S})$ in (12), we consider $\|\mathbf{S}\|_{1,1}$, $\|\mathbf{S}\|_{2,1}$ and $\|\mathbf{S}\|_{*}$, respectively. For the evaluation metric $\Delta(\cdot, \cdot)$ used in $\mathcal{G}(\mathbf{W})$ in (12), we consider the $F_{1}$-score (with $\beta = 1$) and AUC. These settings lead to six variants of SMTL.
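Tables II and III report macro- and micro-averaged $F_{1}$ over tasks. As a reading aid, the sketch below shows the usual definitions of these two averages: macro-$F_{1}$ averages the per-task $F_{1}$ scores, while micro-$F_{1}$ pools the true-positive, false-positive and false-negative counts of all tasks first. It assumes labels in $\{-1, +1\}$ and treats an undefined $F_{1}$ as zero; the exact evaluation protocol used for the tables may differ, and the function names are hypothetical.

```python
def counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom > 0 else 0.0

def macro_micro_f1(tasks):
    """tasks: list of (y_true, y_pred) pairs, one per task, labels in {-1, +1}."""
    per_task = [counts(t, p) for t, p in tasks]
    macro = sum(f1_from_counts(*c) for c in per_task) / len(per_task)
    pooled = [sum(c[i] for c in per_task) for i in range(3)]
    micro = f1_from_counts(*pooled)
    return macro, micro

# Two toy tasks (illustrative labels only).
tasks = [
    ([1, 1, -1, -1], [1, -1, -1, -1]),   # tp=1, fp=0, fn=1
    ([1, -1, -1, -1], [1, 1, 1, -1]),    # tp=1, fp=2, fn=0
]
macro, micro = macro_micro_f1(tasks)
```

The two averages differ whenever tasks have unequal numbers of positive predictions or labels, which is why the tables list them in separate columns.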

Here we focus on MTL for classification. Given a specific regularizer (i.e., $\|\mathbf{S}\|_{1,1}$, $\|\mathbf{S}\|_{2,1}$ or $\|\mathbf{S}\|_{*}$), we choose the following methods as baselines: (1) StructSVM [20], a single-task structured SVM that directly optimizes AUC; we train it on each of the individual tasks and average the results. (2) MTL-CLS, MTL with hinge loss for classification. (3) MTL-REG, MTL with least squares loss for regression. (4) RAkEL, a meta algorithm using random $k$-label sets [41]. (5) MLCSSP, a method spanning the original label space by a subset of labels [4]. (6) AdaBoostMH, a method based on AdaBoost [37]. (7) HOMER, a method based on a hierarchy of multi-label learners [40]. (8) BR, the binary relevance method [42]. (9) LP, the label power-set method [42]. (10) ECC, the ensembles of classifier chains method [35]. Note that the classification problem can be regarded as a regression problem (footnote 2: for a binary classification dataset in which each positive example has the label $+1$ and each negative example has the label $-1$, one can regard these labels as real numbers, i.e., $1.0$ for each positive example and $-1.0$ for each negative example. This dataset can then be used in an MTL method for regression to learn a regressor. Given a test example $x$, if the label predicted by the regressor is larger than $0$, one regards $x$ as a positive example; if the predicted label is smaller than $0$, one regards $x$ as a negative example).

The proposed methods and the baselines MTL-CLS and MTL-REG were implemented in Python 2.7. For MTL-REG, our implementations are based on the algorithms in [28] (for the $\ell_{2,1}$ norm) and [17] (for the trace norm). According to Theorem 3 in [20], the problem of MTL-CLS is equivalent to a special form of SMTL in (2) (with $\Delta(y^{(i)}, y) = 2 \times t$, where $t$ is the number of indices $k$ that satisfy $y^{(i)}_{k} \neq y_{k}$). Hence, our implementation of MTL-CLS is based on the framework of Algorithm 1. For StructSVM, we use the open-source implementation of SVM-Perf [20]. All the experiments were conducted on a Dell PowerEdge R320 server with 16 GB of memory and a 1.9 GHz E5-2420 CPU.

We report the experimental results on $9$ real-world datasets. The statistics of these datasets are summarized in Table I. In the Emotions dataset, the labels are $6$ kinds of emotions, and the features are rhythmic and timbre features extracted from music wave files. In the Yeast dataset, the labels are localization sites of proteins, and the features are protein properties. In the Flags dataset, the labels are religions of countries, and the features are extracted from flag images. In the Cal500 dataset, the labels are semantic meanings of popular songs, and the features are extracted from audio data. In the Segmentation dataset, the labels are the contents of image regions, and the features are pixel properties of image regions. In the Optdigits dataset, the labels are the handwritten digits $0$ to $9$, and the features are pixels. In the MediaMill dataset, the labels are semantic concepts of each video, and the features are extracted from videos. In the TMC2007 dataset, the labels are the document topics, and the features are discrete attributes about terms. In the Scene dataset, the labels are scene types, and the features are spatial color moments in LUV space. All of these datasets are normalized.

TABLE I: Statistics of 9 datasets

| Dataset | Type | Features | Samples | Tasks |
|---|---|---|---|---|
| Emotions | music | 72 | 593 | 6 |
| Yeast | gene | 103 | 2417 | 14 |
| Flags | image | 19 | 194 | 7 |
| Cal500 | songs | 68 | 502 | 174 |
| Segmentation | image | 19 | 2310 | 7 |
| Optdigits | image | 64 | 5620 | 10 |
| MediaMill | multimedia | 120 | 10000 | 12 |
| TMC2007 | text | 500 | 10000 | 6 |
| Scene | image | 294 | 2407 | 6 |

TABLE II: Comparison results on Cal500, Segmentation and Optdigits.

| Method | Macro $F_{1}$ | Micro $F_{1}$ | Average AUC |
|---|---|---|---|
| Cal500 | | | |
| SMTL($\ell_{2,1}$+AUC) | 21.722±0.456 | 38.452±0.610 | 56.505±0.511 |
| SMTL($\ell_{2,1}$+$F_{1}$) | 21.495±0.232 | 40.127±0.173 | 53.690±0.293 |
| MTL-CLS($\ell_{2,1}$) | 13.157±0.449 | 37.357±0.180 | 55.764±0.820 |
| MTL-REG($\ell_{2,1}$) | 12.500±0.129 | 36.438±0.176 | 52.964±0.758 |
| SMTL($\ell_{1,1}$+AUC) | 21.721±0.807 | 35.52±0.811 | 56.716±0.500 |
| SMTL($\ell_{1,1}$+$F_{1}$) | 21.138±0.191 | 38.386±0.456 | 53.358±0.827 |
| MTL-CLS($\ell_{1,1}$) | 12.176±0.445 | 37.387±0.845 | 56.316±0.216 |
| MTL-REG($\ell_{1,1}$) | 12.447±0.297 | 36.66±0.638 | 53.628±0.264 |
| SMTL(TraceNorm+AUC) | 21.772±0.545 | 35.204±0.585 | 56.798±0.358 |
| SMTL(TraceNorm+$F_{1}$) | 21.768±0.333 | 38.559±0.394 | 54.987±0.823 |
| MTL-CLS(TraceNorm) | 12.884±0.353 | 37.402±0.501 | 55.635±0.511 |
| MTL-REG(TraceNorm) | 8.348±0.999 | 34.832±0.698 | 55.69±0.636 |
| StructSVM | 20.864±1.150 | 35.408±1.150 | 51.427±0.841 |
| RAkEL | 20.628±0.611 | 33.689±0.843 | 54.637±0.656 |
| MLCSSP | 21.677±0.514 | 27.093±0.537 | 52.69±0.983 |
| AdaBoostMH | 0.923±0.274 | 6.492±0.146 | 50.734±0.538 |
| HOMER | 13.850±0.163 | 30.332±1.313 | 52.461±0.937 |
| BR | 17.094±0.634 | 33.619±0.375 | 50.563±0.153 |
| LP | 15.257±0.428 | 32.978±0.668 | 52.117±0.685 |
| ECC | 9.600±0.666 | 34.789±0.482 | 52.117±0.625 |
| Segmentation | | | |
| SMTL($\ell_{2,1}$+AUC) | 72.832±1.567 | 68.445±1.543 | 97.195±0.4549 |
| SMTL($\ell_{2,1}$+$F_{1}$) | 85.61±1.304 | 84.149±1.684 | 96.967±0.647 |
| MTL-CLS($\ell_{2,1}$) | 85.114±1.946 | 84.228±4.508 | 96.93±0.560 |
| MTL-REG($\ell_{2,1}$) | 75.547±1.215 | 81.702±2.456 | 96.757±0.645 |
| SMTL($\ell_{1,1}$+AUC) | 73.378±1.564 | 68.424±1.787 | 97.527±0.286 |
| SMTL($\ell_{1,1}$+$F_{1}$) | 85.105±1.830 | 83.693±1.192 | 96.757±0.192 |
| MTL-CLS($\ell_{1,1}$) | 83.712±3.513 | 82.518±4.003 | 96.781±0.828 |
| MTL-REG($\ell_{1,1}$) | 76.253±2.564 | 82.606±0.156 | 96.798±0.231 |
| SMTL(TraceNorm+AUC) | 72.265±1.453 | 67.655±1.978 | 97.134±0.457 |
| SMTL(TraceNorm+$F_{1}$) | 85.356±1.092 | 83.462±1.805 | 96.863±0.322 |
| MTL-CLS(TraceNorm) | 82.703±3.865 | 82.150±5.439 | 96.705±0.612 |
| MTL-REG(TraceNorm) | 76.602±1.286 | 82.805±1.877 | 96.698±0.147 |
| StructSVM | 44.632±1.828 | 53.992±1.828 | 89.355±0.311 |
| RAkEL | 75.592±0.243 | 70.980±0.398 | 91.333±0.082 |
| MLCSSP | 79.821±8.533 | 78.923±14.036 | 93.810±0.329 |
| AdaBoostMH | 75.633±0.209 | 71.018±0.376 | 96.148±0.089 |
| HOMER | 72.920±2.505 | 69.969±1.651 | 91.225±1.543 |
| BR | 84.236±0.638 | 78.796±0.708 | 96.870±0.194 |
| LP | 84.394±0.603 | 83.411±0.615 | 96.240±0.124 |
| ECC | 84.183±0.550 | 82.942±0.542 | 96.782±0.269 |
| Optdigits | | | |
| SMTL($\ell_{2,1}$+AUC) | 92.722±0.595 | 92.734±0.712 | 99.657±0.0528 |
| SMTL($\ell_{2,1}$+$F_{1}$) | 93.963±0.164 | 93.964±0.235 | 99.589±0.054 |
| MTL-CLS($\ell_{2,1}$) | 93.701±0.403 | 92.773±0.440 | 99.206±0.044 |
| MTL-REG($\ell_{2,1}$) | 88.901±0.306 | 89.268±0.875 | 99.32±0.089 |
| SMTL($\ell_{1,1}$+AUC) | 92.526±0.624 | 92.213±0.670 | 99.653±0.078 |
| SMTL($\ell_{1,1}$+$F_{1}$) | 93.692±0.508 | 94.626±0.520 | 99.554±0.047 |
| MTL-CLS($\ell_{1,1}$) | 92.961±0.608 | 94.009±0.356 | 98.658±0.067 |
| MTL-REG($\ell_{1,1}$) | 88.762±0.845 | 89.203±0.865 | 99.269±0.045 |
| SMTL(TraceNorm+AUC) | 92.862±0.543 | 92.802±0.944 | 99.654±0.036 |
| SMTL(TraceNorm+$F_{1}$) | 94.206±0.202 | 94.139±0.266 | 99.566±0.027 |
| MTL-CLS(TraceNorm) | 93.701±0.435 | 93.773±0.267 | 99.182±0.065 |
| MTL-REG(TraceNorm) | 88.777±0.765 | 89.173±0.946 | 99.293±0.048 |
| StructSVM | 36.276±0.905 | 38.289±2.218 | 98.400±0.366 |
| RAkEL | 82.450±0.168 | 80.967±0.311 | 94.543±0.070 |
| MLCSSP | 75.191±2.245 | 82.129±3.195 | 88.879±0.195 |
| AdaBoostMH | 93.083±0.695 | 93.108±0.669 | 98.594±0.119 |
| HOMER | 74.869±4.151 | 75.713±3.663 | 93.391±0.964 |
| BR | 92.625±0.348 | 92.714±0.383 | 99.370±0.122 |
| LP | 88.875±0.212 | 88.915±0.269 | 94.941±0.329 |
| ECC | 93.043±0.206 | 94.019±0.213 | 99.019±0.156 |

TABLE III: Comparison results on Scene, MediaMill and TMC2007.

| Method | Macro $F_{1}$ | Micro $F_{1}$ | Average AUC |
|---|---|---|---|
| Scene | | | |
| SMTL($\ell_{2,1}$+AUC) | 54.013±1.124 | 54.746±1.231 | 89.99±0.820 |
| SMTL($\ell_{2,1}$+$F_{1}$) | 55.787±0.756 | 56.434±0.567 | 87.652±0.280 |
| MTL-CLS($\ell_{2,1}$) | 54.722±1.590 | 54.508±1.176 | 86.738±1.102 |
| MTL-REG($\ell_{2,1}$) | 51.157±0.343 | 52.810±0.345 | 85.194±0.712 |
| SMTL($\ell_{1,1}$+AUC) | 54.296±0.977 | 54.333±0.025 | 88.358±0.467 |
| SMTL($\ell_{1,1}$+$F_{1}$) | 55.501±1.92 | 56.007±2.34 | 87.364±1.801 |
| MTL-CLS($\ell_{1,1}$) | 54.387±0.730 | 54.805±1.488 | 85.952±1.116 |
| MTL-REG($\ell_{1,1}$) | 50.748±0.546 | 51.280±0.619 | 85.032±0.779 |
| SMTL(TraceNorm+AUC) | 54.227±0.660 | 55.384±0.804 | 88.421±1.103 |
| SMTL(TraceNorm+$F_{1}$) | 55.396±1.089 | 56.304±1.119 | 87.071±0.682 |
| MTL-CLS(TraceNorm) | 55.104±0.298 | 55.481±0.506 | 86.205±0.471 |
| MTL-REG(TraceNorm) | 50.832±0.226 | 51.236±0.264 | 85.275±0.852 |
| StructSVM | 49.826±0.815 | 49.951±0.755 | 82.375±0.393 |
| RAkEL | 54.592±0.613 | 55.719±0.565 | 78.981±0.535 |
| MLCSSP | 42.764±0.080 | 47.178±0.181 | 65.830±2.240 |
| AdaBoostMH | 36.506±0.404 | 40.681±0.449 | 87.617±0.470 |
| HOMER | 60.980±2.470 | 58.251±2.592 | 80.744±0.360 |
| BR | 54.579±1.813 | 55.019±1.843 | 82.888±1.164 |
| LP | 54.902±1.503 | 55.818±1.595 | 75.900±1.362 |
| ECC | 55.347±0.893 | 55.831±0.881 | 88.153±0.298 |
| MediaMill | | | |
| SMTL($\ell_{2,1}$+AUC) | 18.030±0.294 | 22.058±0.257 | 66.068±0.426 |
| SMTL($\ell_{2,1}$+$F_{1}$) | 22.851±5.093 | 56.424±2.761 | 78.705±2.280 |
| MTL-CLS($\ell_{2,1}$) | 10.613±1.733 | 55.441±3.647 | 76.216±2.474 |
| MTL-REG($\ell_{2,1}$) | 6.366±0.065 | 55.515±0.465 | 53.867±0.496 |
| SMTL($\ell_{1,1}$+AUC) | 18.012±0.286 | 22.232±0.211 | 65.405±0.503 |
| SMTL($\ell_{1,1}$+$F_{1}$) | 22.386±5.326 | 56.169±2.436 | 78.907±1.854 |
| MTL-CLS($\ell_{1,1}$) | 8.542±1.672 | 55.838±2.229 | 74.037±1.219 |
| MTL-REG($\ell_{1,1}$) | 6.393±0.033 | 55.687±0.439 | 53.036±0.181 |
| SMTL(TraceNorm+AUC) | 18.201±0.221 | 22.684±0.354 | 66.847±1.015 |
| SMTL(TraceNorm+$F_{1}$) | 27.973±3.006 | 56.031±4.924 | 79.730±1.850 |
| MTL-CLS(TraceNorm) | 15.800±0.589 | 50.098±5.569 | 75.968±2.144 |
| MTL-REG(TraceNorm) | 6.380±0.045 | 55.333±0.425 | 53.825±0.493 |
| StructSVM | 17.847±0.318 | 22.030±0.284 | 64.761±0.487 |
| RAkEL | 19.874±0.156 | 26.686±0.189 | 63.241±0.398 |
| MLCSSP | 15.129±0.633 | 20.124±0.723 | 52.473±1.884 |
| AdaBoostMH | 17.939±0.469 | 41.991±0.425 | 61.914±0.167 |
| HOMER | 17.939±0.469 | 41.991±0.425 | 61.914±0.167 |
| BR | 19.769±0.196 | 26.515±0.166 | 69.032±0.854 |
| LP | 24.135±0.959 | 50.170±0.402 | 60.597±0.502 |
| ECC | 24.879±0.590 | 56.214±0.363 | 78.067±0.705 |
| TMC2007 | | | |
| SMTL($\ell_{2,1}$+AUC) | 59.432±0.581 | 68.02±1.042 | 90.138±0.17 |
| SMTL($\ell_{2,1}$+$F_{1}$) | 64.321±0.955 | 74.159±0.255 | 90.561±0.669 |
| MTL-CLS($\ell_{2,1}$) | 60.517±1.363 | 71.284±0.387 | 88.382±0.398 |
| MTL-REG($\ell_{2,1}$) | 37.106±0.416 | 70.181±0.221 | 85.218±0.529 |
| SMTL($\ell_{1,1}$+AUC) | 60.249±0.147 | 67.654±0.234 | 90.441±0.077 |
| SMTL($\ell_{1,1}$+$F_{1}$) | 65.436±1.239 | 73.984±0.533 | 90.238±0.732 |
| MTL-CLS($\ell_{1,1}$) | 62.919±0.802 | 72.745±0.464 | 89.074±0.59 |
| MTL-REG($\ell_{1,1}$) | 37.709±0.32 | 70.431±0.414 | 86.612±0.592 |
| SMTL(TraceNorm+AUC) | 58.595±0.148 | 68.056±0.45 | 88.325±0.182 |
| SMTL(TraceNorm+$F_{1}$) | 61.867±1.014 | 72.588±0.350 | 89.328±0.815 |
MTL-CLS(TraceNorm) 59.752±\pm0.951 71.863±\pm0.628 87.933±\pm0.428
MTL-REG(TraceNorm) 36.64±\pm0.314 70.118±\pm0.437 84.54±\pm0.743 \bigstrut[b]
StructSVM 37.19±\pm0.652 45.027±\pm0.601 88.072±\pm0.289 \bigstrut[t]
RAkEL 57.331±\pm0.592 69.813±\pm0.179 81.994±\pm0.134
MLCSSP 56.717±\pm0.790 60.417±\pm1.665 75.246±\pm1.093
AdaBoostMH 15.170±\pm1.893 56.004±\pm1.103 61.466±\pm0.206
HOMER 61.144±\pm0.238 71.429±\pm0.104 84.998±\pm0.589
BR 51.939±\pm1.225 67.873±\pm0.374 84.616±\pm0.528
LP 52.683±\pm0.832 62.672±\pm0.526 73.063±\pm0.637
ECC 58.368±\pm0.714 68.223±\pm0.096 86.287±\pm0.664 \bigstrut[b]
TABLE IV: Comparison results on Emotions, Yeast and Flags.
METHOD Macro $F_{1}$ Micro $F_{1}$ Average AUC
Emotions      \bigstrut
SMTL(𝟐,𝟏\ell_{2,1}+AUC) 65.498±\pm2.047 67.067±\pm1.956 83.378±\pm0.466 \bigstrut[t]
SMTL(𝟐,𝟏\ell_{2,1}+F𝟏F_{1}) 66.244±\pm1.584 66.358±\pm1.255 81.986±\pm0.495
MTL-CLS(2,1\ell_{2,1}) 63.343±\pm1.688 65.684±\pm1.327 80.065±\pm0.490
MTL-REG(2,1\ell_{2,1}) 62.621±\pm1.543 63.701±\pm1.054 81.32±\pm0.396 \bigstrut[b]
SMTL(𝟏,𝟏\ell_{1,1}+AUC) 65.622±\pm1.984 67.143±\pm1.629 83.358±\pm0.345 \bigstrut[t]
SMTL(𝟏,𝟏\ell_{1,1}+F𝟏F_{1}) 67.696±\pm0.348 67.923±\pm0.578 83.106±\pm0.596
MTL-CLS(1,1\ell_{1,1}) 64.969±\pm0.822 66.584±\pm1.049 80.03±\pm0.574
MTL-REG(1,1\ell_{1,1}) 62.976±\pm0.547 64.404±\pm1.535 81.811±\pm0.587 \bigstrut[b]
SMTL(TraceNorm+AUC) 65.902±\pm1.904 67.405±\pm1.848 83.362±\pm0.618 \bigstrut[t]
SMTL(TraceNorm+F𝟏F_{1}) 67.600±\pm0.574 67.858±\pm0.984 83.000±\pm0.236
MTL-CLS(TraceNorm) 63.805±\pm2.339 66.602±\pm2.063 80.485±\pm0.597
MTL-REG(TraceNorm) 63.243±\pm1.574 64.869±\pm2.574 82.834±\pm0.266 \bigstrut[b]
StructSVM 46.367±\pm5.531 49.902±\pm19.032 62.908±\pm4.361 \bigstrut[t]
RAkEL 64.998±\pm1.387 65.835±\pm1.136 75.206±\pm0.875
MLCSSP 62.980±\pm2.780 63.593±\pm2.603 76.054±\pm2.495
AdaBoostMH 4.291±\pm1.429 7.577±\pm2.627 55.111±\pm0.328
HOMER 59.039±\pm2.431 61.830±\pm1.642 71.212±\pm1.167
BR 61.358±\pm2.578 62.635±\pm2.332 79.146±\pm1.250
LP 53.384±\pm1.858 54.618±\pm1.543 68.506±\pm0.652
ECC 62.694±\pm1.645 64.138±\pm1.216 82.589±\pm1.131 \bigstrut[b]
Yeast      \bigstrut
SMTL(𝟐,𝟏\ell_{2,1}+AUC) 43.593±\pm1.120 46.261±\pm0.872 63.018±\pm1.504 \bigstrut[t]
SMTL(𝟐,𝟏\ell_{2,1}+F𝟏F_{1}) 44.353±\pm1.080 55.451±\pm0.457 61.285±\pm1.246
MTL-CLS(2,1\ell_{2,1}) 36.308±\pm0.974 43.908±\pm0.499 56.686±\pm0.539
MTL-REG(2,1\ell_{2,1}) 28.187±\pm1.544 47.029±\pm0.645 62.757±\pm1.745 \bigstrut[b]
SMTL(𝟏,𝟏\ell_{1,1}+AUC) 43.132±\pm1.349 45.729±\pm1.643 62.626±\pm1.709 \bigstrut[t]
SMTL(𝟏,𝟏\ell_{1,1}+F𝟏F_{1}) 44.647±\pm1.058 54.971±\pm1.187 61.569±\pm1.945
MTL-CLS(1,1\ell_{1,1}) 36.89±\pm0.699 44.620±\pm0.553 58.221±\pm0.424
MTL-REG(1,1\ell_{1,1}) 33.720±\pm1.634 54.682±\pm1.846 50.050±\pm1.563 \bigstrut[b]
SMTL(TraceNorm+AUC) 43.58±\pm1.046 46.395±\pm1.067 63.058±\pm0.634 \bigstrut[t]
SMTL(TraceNorm+F𝟏F_{1}) 44.972±\pm0.765 50.471±\pm0.968 61.819±\pm0.395
MTL-CLS(TraceNorm) 42.275±\pm1.006 44.542±\pm0.460 61.528±\pm0.590
MTL-REG(TraceNorm) 28.178±\pm1.043 47.046±\pm0.126 62.920±\pm0.326 \bigstrut[b]
StructSVM 42.669±\pm 2.48 46.298±\pm2.048 61.894±\pm2.488 \bigstrut[t]
RAkEL 44.101±\pm0.389 46.086±\pm0.450 61.971±\pm0.753
MLCSSP 41.511±\pm0.837 46.200±\pm1.272 50.756±\pm0.451
AdaBoostMH 12.255±\pm0.041 48.144±\pm0.315 50.805±\pm0.050
HOMER 40.054±\pm1.063 53.745±\pm0.867 62.311±\pm1.265
BR 39.209±\pm0.891 54.153±\pm0.543 62.375±\pm0.408
LP 37.029±\pm0.584 53.059±\pm0.514 56.616±\pm1.394
ECC 37.523±\pm0.310 54.632±\pm0.325 62.105±\pm0.627 \bigstrut[b]
Flags      \bigstrut
SMTL(𝟐,𝟏\ell_{2,1}+AUC) 60.473±\pm1.951 61.666±\pm2.226 73.875±\pm2.563 \bigstrut[t]
SMTL(𝟐,𝟏\ell_{2,1}+F𝟏F_{1}) 70.279±\pm1.744 75.047±\pm0.945 75.000±\pm0.745
MTL-CLS(2,1\ell_{2,1}) 65.233±\pm1.930 71.709±\pm0.955 72.928±\pm1.479
MTL-REG(2,1\ell_{2,1}) 66.073±\pm0.276 73.005±\pm1.307 71.429±\pm1.105 \bigstrut[b]
SMTL(𝟏,𝟏\ell_{1,1}+AUC) 60.187±\pm1.971 61.618±\pm1.714 74.136±\pm2.805 \bigstrut[t]
SMTL(𝟏,𝟏\ell_{1,1}+F𝟏F_{1}) 69.122±\pm1.975 74.259±\pm1.378 74.168±\pm1.513
MTL-CLS(1,1\ell_{1,1}) 65.532±\pm1.210 72.666±\pm1.752 72.725±\pm0.497
MTL-REG(1,1\ell_{1,1}) 65.256±\pm0.739 72.246±\pm0.928 71.299±\pm0.998 \bigstrut[b]
SMTL(TraceNorm+AUC) 61.435±\pm1.616 62.84±\pm1.481 74.367±\pm2.373 \bigstrut[t]
SMTL(TraceNorm+F𝟏F_{1}) 68.704±\pm1.650 73.132±\pm1.891 73.145±\pm1.973
MTL-CLS(TraceNorm) 65.236±\pm3.507 72.688±\pm2.156 73.307±\pm2.155
MTL-REG(TraceNorm) 65.257±\pm2.647 72.437±\pm1.918 71.495±\pm0.783 \bigstrut[b]
StructSVM 55.683±\pm5.777 51.957±\pm2.048 72.178±\pm3.604 \bigstrut[t]
RAkEL 60.696±\pm5.216 64.749±\pm4.688 61.260±\pm3.805
MLCSSP 59.629±\pm1.619 63.215±\pm1.326 55.865±\pm1.909
AdaBoostMH 56.457±\pm4.288 71.268±\pm1.400 69.329±\pm2.043
HOMER 59.018±\pm1.269 63.855±\pm2.259 64.826±\pm0.569
BR 59.421±\pm2.163 67.287±\pm1.876 66.823±\pm2.860
LP 61.801±\pm3.822 69.132±\pm3.200 60.540±\pm4.149
ECC 64.936±\pm3.023 72.715±\pm1.675 73.913±\pm2.339 \bigstrut[b]

Following the settings in [9], we use AUC, Macro $F_{1}$-score, and Micro $F_{1}$-score as the evaluation metrics (details on the computation of AUC and $F_{1}$ can be found in Section 4; see also footnote 3).

Footnote 3: In MTL, the Macro $F_{1}$ is calculated by first computing the $F_{1}$ score of each individual task and then averaging these $F_{1}$ scores over all tasks. The Micro $F_{1}$ in MTL is calculated as $\frac{2\times P\times R}{P+R}$, where

$$P=\frac{\sum_{i=1}^{m}\sum_{k=1}^{n_{i}}I(\mathbf{y}_{k}^{(i)}=1\ \textit{and}\ (\mathbf{y}_{j})_{k}=1)}{\sum_{i=1}^{m}\sum_{k=1}^{n_{i}}I(\mathbf{y}_{k}^{(i)}=1)},\qquad R=\frac{\sum_{i=1}^{m}\sum_{k=1}^{n_{i}}I(\mathbf{y}_{k}^{(i)}=1\ \textit{and}\ (\mathbf{y}_{j})_{k}=1)}{\sum_{i=1}^{m}\sum_{k=1}^{n_{i}}I((\mathbf{y}_{j})_{k}=1)}.$$
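The difference between the Macro and Micro $F_{1}$ definitions can be illustrated with a minimal sketch. The function names and the toy labels below are hypothetical, and returning 0 when there are no true positives is an assumed convention, not one stated in the paper. Since $F_{1}$ is symmetric in $P$ and $R$, the sketch does not depend on which label vector is taken as the prediction.

```python
from typing import List, Sequence, Tuple

def counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int]:
    """Return (true positives, predicted positives, actual positives) for one task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    true_pos = sum(1 for t in y_true if t == 1)
    return tp, pred_pos, true_pos

def f1_from_counts(tp: int, pred_pos: int, true_pos: int) -> float:
    """F1 = 2PR/(P+R); defined as 0.0 here when tp == 0 (assumed convention)."""
    if tp == 0:
        return 0.0
    precision = tp / pred_pos
    recall = tp / true_pos
    return 2 * precision * recall / (precision + recall)

def macro_f1(tasks_true: List[Sequence[int]], tasks_pred: List[Sequence[int]]) -> float:
    """Average of the per-task F1 scores."""
    scores = [f1_from_counts(*counts(t, p)) for t, p in zip(tasks_true, tasks_pred)]
    return sum(scores) / len(scores)

def micro_f1(tasks_true: List[Sequence[int]], tasks_pred: List[Sequence[int]]) -> float:
    """F1 computed from counts pooled over all tasks."""
    tp = pred_pos = true_pos = 0
    for t, p in zip(tasks_true, tasks_pred):
        a, b, c = counts(t, p)
        tp, pred_pos, true_pos = tp + a, pred_pos + b, true_pos + c
    return f1_from_counts(tp, pred_pos, true_pos)

# Toy example with two tasks (hypothetical labels).
truth = [[1, 1, 0, 0], [1, 0, 0, 0]]
preds = [[1, 0, 0, 0], [1, 1, 1, 0]]
print(macro_f1(truth, preds), micro_f1(truth, preds))
```

In this toy example the per-task scores are 2/3 and 1/2, so the Macro $F_{1}$ is 7/12, while pooling the counts gives a Micro $F_{1}$ of 4/7; the two metrics weight tasks differently.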

For each dataset, we first generate 10 partitions with a 60%:40% split. In each partition, the 60% part is used as the training set and the 40% part is used as the test set. We then run each of the methods (the baselines and the proposed methods) on these 10 partitions and report the results averaged over the 10 trials. Note that, for a fair comparison, all methods use the same 10 partitions of a dataset to produce their results. After the training set is determined, we conduct 10-fold cross validation on the training set to choose the trade-off parameter $\lambda$ from $\{10^{-3}\times i\}_{i=1}^{10}\cup\{10^{-2}\times i\}_{i=1}^{10}\cup\{10^{-1}\times i\}_{i=1}^{10}\cup\{2\times i\}_{i=1}^{10}\cup\{40\times i\}_{i=1}^{20}$.
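The protocol above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how the partitions are drawn (for example, whether they are stratified), so the seeded shuffle, the helper names `lambda_grid` and `split_60_40`, and the sample count of 100 are all hypothetical.

```python
import random
from typing import List, Tuple

def lambda_grid() -> List[float]:
    """Candidate values of the trade-off parameter, as the union of the five sets."""
    grid = []
    for scale in (1e-3, 1e-2, 1e-1, 2.0):
        grid += [scale * i for i in range(1, 11)]
    grid += [40.0 * i for i in range(1, 21)]
    # The sets overlap at 0.01 and 0.1, so round and deduplicate.
    return sorted(set(round(v, 10) for v in grid))

def split_60_40(n_samples: int, seed: int) -> Tuple[List[int], List[int]]:
    """Return (train indices, test indices) for one 60%:40% partition."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # assumed: plain random shuffle
    cut = int(round(0.6 * n_samples))
    return idx[:cut], idx[cut:]

# Ten fixed partitions, reused by every method for a fair comparison.
partitions = [split_60_40(100, seed) for seed in range(10)]
grid = lambda_grid()
print(len(grid), grid[0], grid[-1])
```

Fixing the seeds is one simple way to guarantee that every method sees the same ten partitions.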

In Algorithm 2, we set the maximum number of iterations to $T_{F}=5000$ and the optimization tolerance to $\epsilon_{F}=10^{-5}$.

Figure 1: Comparison results on Segmentation, Emotions and Optdigits w.r.t. AUC.
Figure 2: Comparison results on Segmentation, Emotions and Optdigits w.r.t. Macro F1 (top) and Micro F1 (bottom).

6.2 Results on real-world datasets

The evaluation results w.r.t. Micro $F_{1}$, Macro $F_{1}$ and AUC (with standard deviations) of the proposed SMTL are shown in Tables II, III and IV. As can be seen, with the same regularizer, the proposed SMTL variants that optimize the $F_{1}$-score or AUC generally outperform the baselines. In most cases, the SMTL variant that optimizes a specific metric achieves the best results on that metric. Here are some statistics. On the Yeast dataset, the Macro $F_{1}$ of SMTL($\ell_{2,1}$+$F_{1}$) is 44.353%, a 22.16% relative increase over the best MTL baseline MTL-CLS($\ell_{2,1}$); the Micro $F_{1}$ of SMTL($\ell_{2,1}$+$F_{1}$) is 55.451%, a 17.91% relative increase over the best MTL baseline MTL-REG($\ell_{2,1}$); the averaged AUC of SMTL($\ell_{1,1}$+AUC) is 62.626%, a 7.57% relative increase over the best MTL baseline MTL-CLS($\ell_{1,1}$). On the Emotions dataset, SMTL($\ell_{2,1}$+$F_{1}$) achieves a Macro $F_{1}$ of 66.244%, a 4.58% relative increase over the best MTL baseline MTL-CLS($\ell_{2,1}$); SMTL($\ell_{2,1}$+AUC) achieves an AUC of 83.378%, a 2.53% relative increase over the best MTL baseline MTL-REG($\ell_{2,1}$); SMTL(TraceNorm+$F_{1}$) achieves a Macro $F_{1}$ of 67.6%, a 5.95% relative increase over the best MTL baseline MTL-CLS(TraceNorm). On the Cal500 dataset, SMTL($\ell_{1,1}$+AUC) achieves a Macro $F_{1}$ of 21.721%, compared to 12.447% for MTL-REG($\ell_{1,1}$), a 74.51% relative increase; SMTL($\ell_{2,1}$+$F_{1}$) achieves a Micro $F_{1}$ of 40.127%, compared to 37.357% for MTL-CLS($\ell_{2,1}$), a 7.41% relative increase.
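The relative increases quoted above follow from the table entries as (new − baseline) / baseline. A small sketch that recomputes a few of them from the Yeast and Emotions rows of Table IV (the function name is hypothetical):

```python
def relative_increase(new: float, baseline: float) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# (metric, SMTL value, best MTL baseline value), copied from Table IV.
cases = [
    ("Yeast Macro F1", 44.353, 36.308),
    ("Yeast Micro F1", 55.451, 47.029),
    ("Yeast AUC", 62.626, 58.221),
    ("Emotions Macro F1", 66.244, 63.343),
]
for name, smtl, base in cases:
    print(f"{name}: {relative_increase(smtl, base):.2f}%")
```

Note that these are relative gains; the absolute differences (e.g., about 8 points of Macro $F_{1}$ on Yeast) are smaller.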

In addition, we conduct $t$-tests and Wilcoxon's signed rank tests [43] on the 9 datasets to investigate whether the improvements of the SMTL methods over the baselines are statistically significant. The $p$-values of the $t$-tests are shown in Tables V and VI. The $p$-values of the Wilcoxon's tests are shown in Tables VII and VIII. As can be seen, most of the $p$-values are smaller than 0.05, which indicates that the improvements are statistically significant. These results verify the effectiveness of directly optimizing evaluation metrics in MTL problems.
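As an aid to reading the Wilcoxon tables, the sketch below computes a two-sided signed rank $p$-value with the normal approximation (no continuity correction). The paper does not state how the test was configured, so pairing the 10 trials of two methods and using this approximation are assumptions; the function names and toy differences are hypothetical. Under these assumptions, the smallest attainable $p$-value with 10 pairs is about 5.06E-03 (all differences of the same sign), which is consistent with the value that recurs in Tables VII and VIII.

```python
import math
from typing import Sequence, Tuple

def signed_rank_statistic(diffs: Sequence[float]) -> Tuple[float, int]:
    """Return (W, n): W = min(W+, W-) over nonzero differences, n = their count."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    return min(w_plus, w_minus), len(nonzero)

def wilcoxon_p_normal(w: float, n: int) -> float:
    """Two-sided p-value from the normal approximation of W (no tie correction)."""
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical per-trial differences (method A minus method B), all positive.
diffs = [0.5, 1.2, 0.3, 2.0, 0.7, 1.1, 0.9, 0.4, 1.5, 0.8]
w, n = signed_rank_statistic(diffs)
print(w, n, wilcoxon_p_normal(w, n))
```

With so few paired trials the test has a coarse set of attainable $p$-values, so identical entries across many rows do not by themselves indicate identical underlying results.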

TABLE V: $t$-test: $p$-values of SMTL against the baselines
Two methods for comparison Optdigits TMC2007 MediaMill Segmentation
Average AUC      \bigstrut
2,1\ell_{2,1}: SMTL(AUC) vs. MTL-CLS 4.86E-07 1.49E-13 2.12E-02 4.74E-03 \bigstrut[t]
2,1\ell_{2,1}: SMTL(AUC) vs. MTL-REG 2.70E-12 1.44E-18 6.58E-01 4.30E-03
Trace: SMTL(AUC) vs. MTL-CLS 6.88E-03 5.20E-03 2.12E-02 4.74E-03
Trace: SMTL(AUC) vs. MTL-REG 3.85E-12 4.59E-11 6.61E-01 4.25E-03
1,1\ell_{1,1}: SMTL(AUC) vs. MTL-CLS 5.71E-03 2.95E-05 5.52E-03 4.75E-03
1,1\ell_{1,1}: SMTL(AUC) vs. MTL-REG 1.46E-12 1.92E-12 1.57E-08 4.35E-03
Trace: SMTL(AUC) vs. RAkEL 1.87E-26 1.50E-14 2.79E-13 3.27E-14
Trace: SMTL(AUC) vs. MLCSSP 6.02E-10 1.05E-14 9.63E-15 3.24E-01
Trace: SMTL(AUC) vs. AdaBoostMH 2.36E-04 5.42E-20 4.44E-08 3.05E-14
Trace: SMTL(AUC) vs. HOMER 5.23E-12 6.97E-09 4.65E-08 1.04E-12
Trace: SMTL(AUC) vs. BR 2.91E-10 1.31E-16 2.43E-13 5.14E-07
Trace: SMTL(AUC) vs. LP 6.82E-22 1.41E-20 1.44E-03 9.30E-01
Trace: SMTL(AUC) vs. ECC 8.16E-03 9.67E-19 7.52E-03 6.19E-03 \bigstrut[b]
Micro F1F_{1}      \bigstrut
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-CLS 8.37E-02 8.37E-02 3.98E-10 4.89E-02 \bigstrut[t]
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-REG 3.28E-20 3.28E-20 2.54E-18 1.24E-02
Trace: SMTL(F1F_{1}) vs. MTL-CLS 4.68E-03 4.68E-03 4.00E-10 4.96E-02
Trace: SMTL(F1F_{1}) vs. MTL-REG 3.01E-14 3.01E-14 2.30E-18 4.92E-03
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-CLS 9.54E-03 9.54E-03 4.19E-10 4.75E-01
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-REG 6.16E-12 6.16E-12 2.56E-18 1.03E-01
Trace: SMTL(F1F_{1}) vs. RAkEL 7.30E-33 4.64E-25 4.93E-09 9.28E-19
Trace: SMTL(F1F_{1}) vs. MLCSSP 2.28E-30 1.90E-18 3.28E-14 2.90E-13
Trace: SMTL(F1F_{1}) vs. AdaBoostMH 4.53E-16 9.97E-35 9.38E-12 3.65E-06
Trace: SMTL(F1F_{1}) vs. HOMER 5.37E-14 1.13E-12 9.68E-12 8.31E-10
Trace: SMTL(F1F_{1}) vs. BR 1.61E-06 3.09E-14 5.20E-05 9.79E-03
Trace: SMTL(F1F_{1}) vs. LP 3.94E-20 9.73E-24 1.06E-12 1.39E-05
Trace: SMTL(F1F_{1}) vs. ECC 3.99E-07 2.76E-08 1.75E-16 1.45E-01 \bigstrut[b]
Macro F1F_{1}      \bigstrut
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-CLS 4.09E-21 1.61E-10 3.98E-10 4.09E-02 \bigstrut[t]
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-REG 1.47E-26 3.09E-16 2.54E-18 2.98E-12
Trace: SMTL(F1F_{1}) vs. MTL-CLS 1.04E-21 1.82E-02 4.00E-10 4.13E-02
Trace: SMTL(F1F_{1}) vs. MTL-REG 3.85E-19 5.87E-12 2.30E-18 3.19E-12
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-CLS 7.04E-22 9.26E-07 4.19E-10 4.04E-02
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-REG 1.94E-24 8.74E-14 2.56E-18 2.57E-12
Trace: SMTL(F1F_{1}) vs. RAkEL 6.35E-29 3.99E-10 4.93E-09 3.08E-16
Trace: SMTL(F1F_{1}) vs. MLCSSP 6.50E-16 2.29E-10 3.28E-14 4.68E-02
Trace: SMTL(F1F_{1}) vs. AdaBoostMH 1.21E-04 2.99E-23 9.38E-12 3.17E-16
Trace: SMTL(F1F_{1}) vs. HOMER 1.78E-11 4.33E-02 9.68E-12 2.55E-11
Trace: SMTL(F1F_{1}) vs. BR 1.28E-08 1.12E-13 5.20E-05 1.19E-02
Trace: SMTL(F1F_{1}) vs. LP 1.67E-19 1.52E-14 1.06E-12 7.49E-01
Trace: SMTL(F1F_{1}) vs. ECC 2.83E-01 4.46E-08 1.75E-16 9.52E-01 \bigstrut[b]
TABLE VI: $t$-test: $p$-values of SMTL against the baselines
Two methods for comparison Cal500 Yeast Emotions Scene Flags
Average AUC      \bigstrut
2,1\ell_{2,1}: SMTL(AUC) vs. MTL-CLS 2.62E-01 1.02E-12 2.62E-01 4.92E-02 4.30E-02 \bigstrut[t]
2,1\ell_{2,1}: SMTL(AUC) vs. MTL-REG 7.48E-05 1.47E-09 7.48E-05 4.74E-11 4.21E-02
Trace: SMTL(AUC) vs. MTL-CLS 1.01E-01 8.97E-13 1.01E-01 4.48E-02 4.21E-02
Trace: SMTL(AUC) vs. MTL-REG 3.04E-03 1.53E-09 3.04E-03 5.00E-11 4.37E-02
1,1\ell_{1,1}: SMTL(AUC) vs. MTL-CLS 2.18E-03 1.00E-12 2.18E-03 4.56E-02 4.27E-02
1,1\ell_{1,1}: SMTL(AUC) vs. MTL-REG 2.55E-06 1.71E-09 2.55E-06 4.67E-11 4.28E-02
Trace: SMTL(AUC) vs. RAkEL 2.62E-12 1.65E-10 4.55E-04 1.48E-01 5.58E-05
Trace: SMTL(AUC) vs. MLCSSP 1.49E-21 1.05E-07 1.22E-04 1.60E-15 6.81E-11
Trace: SMTL(AUC) vs. AdaBoostMH 4.10E-33 1.03E-06 3.61E-23 3.21E-19 2.12E-02
Trace: SMTL(AUC) vs. HOMER 2.30E-13 2.54E-04 8.57E-09 4.33E-02 9.89E-09
Trace: SMTL(AUC) vs. BR 2.23E-16 7.57E-10 3.84E-06 7.92E-02 1.63E-06
Trace: SMTL(AUC) vs. LP 1.12E-14 6.42E-07 9.15E-15 4.97E-01 3.19E-03
Trace: SMTL(AUC) vs. ECC 2.00E-13 2.09E-12 6.45E-07 3.28E-01 9.78E-01 \bigstrut[b]
Micro F1F_{1}      \bigstrut
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-CLS 4.09E-21 2.70E-05 2.62E-01 1.64E-05 3.14E-02 \bigstrut[t]
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-REG 1.47E-26 5.00E-02 7.48E-05 1.09E-06 1.87E-03
Trace: SMTL(F1F_{1}) vs. MTL-CLS 1.04E-21 3.45E-05 1.01E-01 1.42E-05 3.10E-02
Trace: SMTL(F1F_{1}) vs. MTL-REG 3.85E-19 4.39E-02 3.04E-03 1.19E-06 1.76E-03
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-CLS 7.04E-22 2.16E-05 2.18E-03 1.54E-05 3.13E-02
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-REG 1.94E-24 4.21E-02 2.55E-06 1.35E-06 1.87E-03
Trace: SMTL(F1F_{1}) vs. RAkEL 4.16E-08 2.54E-03 4.55E-04 3.26E-15 2.85E-08
Trace: SMTL(F1F_{1}) vs. MLCSSP 2.82E-10 9.31E-21 1.22E-04 1.80E-16 2.01E-13
Trace: SMTL(F1F_{1}) vs. AdaBoostMH 8.68E-17 3.36E-22 3.61E-23 4.76E-02 7.77E-05
Trace: SMTL(F1F_{1}) vs. HOMER 5.69E-11 1.08E-01 8.57E-09 4.53E-14 3.25E-10
Trace: SMTL(F1F_{1}) vs. BR 7.35E-21 1.08E-01 3.84E-06 2.51E-09 4.96E-06
Trace: SMTL(F1F_{1}) vs. LP 1.64E-13 9.74E-11 9.15E-15 1.25E-14 3.39E-08
Trace: SMTL(F1F_{1}) vs. ECC 6.21E-14 5.30E-01 6.45E-07 1.05E-02 9.09E-01 \bigstrut[b]
Macro F1F_{1}      \bigstrut
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-CLS 2.43E-02 2.51E-06 7.55E-12 4.09E-02 1.12E-02 \bigstrut[t]
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-REG 4.24E-10 2.71E-19 2.83E-09 1.40E-10 2.58E-03
Trace: SMTL(F1F_{1}) vs. MTL-CLS 1.77E-05 2.34E-06 3.53E-09 4.37E-02 1.13E-02
Trace: SMTL(F1F_{1}) vs. MTL-REG 1.43E-04 3.33E-19 2.69E-02 1.55E-10 2.67E-03
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-CLS 3.99E-02 2.66E-06 6.38E-12 4.43E-02 1.11E-02
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-REG 1.01E-12 2.76E-19 1.09E-06 1.32E-10 2.54E-03
Trace: SMTL(F1F_{1}) vs. RAkEL 5.45E-05 5.52E-03 3.24E-05 5.68E-02 2.11E-04
Trace: SMTL(F1F_{1}) vs. MLCSSP 6.64E-01 1.57E-08 6.89E-05 2.34E-18 2.75E-10
Trace: SMTL(F1F_{1}) vs. AdaBoostMH 1.28E-29 1.36E-28 3.03E-28 5.89E-21 1.16E-07
Trace: SMTL(F1F_{1}) vs. HOMER 3.52E-23 6.16E-10 2.60E-09 3.77E-06 1.84E-11
Trace: SMTL(F1F_{1}) vs. BR 7.28E-14 6.97E-12 6.40E-07 2.49E-01 2.86E-09
Trace: SMTL(F1F_{1}) vs. LP 1.42E-18 9.45E-16 7.63E-15 4.04E-01 5.68E-05
Trace: SMTL(F1F_{1}) vs. ECC 5.69E-21 1.98E-16 5.51E-08 9.15E-01 4.96E-03 \bigstrut[b]
TABLE VII: Wilcoxon's test: $p$-values of SMTL against the baselines
Two methods for comparison Optdigits TMC2007 MediaMill Segmentation
Average AUC      \bigstrut
2,1\ell_{2,1}: SMTL(AUC) vs. MTL-CLS 1.25E-02 1.25E-02 5.06E-03 5.75E-01
2,1\ell_{2,1}: SMTL(AUC) vs. MTL-REG 5.06E-03 5.06E-03 4.45E-01 2.84E-02
Trace: SMTL(AUC) vs. MTL-CLS 2.84E-02 2.18E-02 2.18E-02 4.69E-02
Trace: SMTL(AUC) vs. MTL-REG 5.06E-03 5.06E-03 8.79E-01 3.86E-01
1,1\ell_{1,1}: SMTL(AUC) vs. MTL-CLS 2.84E-02 2.84E-02 4.69E-02 2.18E-02
1,1\ell_{1,1}: SMTL(AUC) vs. MTL-REG 5.06E-03 5.06E-03 5.75E-01 2.18E-02
Trace: SMTL(AUC) vs. RAkEL 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(AUC) vs. MLCSSP 5.06E-03 5.06E-03 5.06E-03 2.85E-01
Trace: SMTL(AUC) vs. AdaBoostMH 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(AUC) vs. HOMER 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(AUC) vs. BR 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(AUC) vs. LP 5.06E-03 5.06E-03 1.25E-02 8.79E-01
Trace: SMTL(AUC) vs. ECC 7.45E-02 5.06E-03 3.67E-02 1.66E-02 \bigstrut[b]
Micro F1F_{1}      \bigstrut
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-CLS 5.06E-03 5.93E-02 5.06E-03 9.34E-03 \bigstrut[t]
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-REG 5.06E-03 5.06E-03 5.06E-03 2.84E-02
Trace: SMTL(F1F_{1}) vs. MTL-CLS 5.06E-03 1.66E-02 5.06E-03 9.26E-02
Trace: SMTL(F1F_{1}) vs. MTL-REG 5.06E-03 5.06E-03 5.06E-03 6.91E-03
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-CLS 5.06E-03 4.69E-02 5.06E-03 5.93E-02
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-REG 5.06E-03 5.06E-03 5.06E-03 2.84E-02
Trace: SMTL(F1F_{1}) vs. RAkEL 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(F1F_{1}) vs. MLCSSP 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(F1F_{1}) vs. AdaBoostMH 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(F1F_{1}) vs. HOMER 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL($F_{1}$) vs. BR 5.06E-03 5.06E-03 9.34E-03 1.69E-01
Trace: SMTL($F_{1}$) vs. LP 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL($F_{1}$) vs. ECC 5.06E-03 5.06E-03 5.06E-03 2.03E-01
Macro F1F_{1}      \bigstrut
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-CLS 9.34E-03 5.06E-03 5.06E-03 4.69E-02 \bigstrut[t]
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-REG 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(F1F_{1}) vs. MTL-CLS 9.34E-03 5.06E-03 5.06E-03 5.93E-02
Trace: SMTL(F1F_{1}) vs. MTL-REG 5.06E-03 5.06E-03 5.06E-03 5.06E-03
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-CLS 9.34E-03 5.06E-03 5.06E-03 1.14E-01
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-REG 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(F1F_{1}) vs. RAkEL 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(F1F_{1}) vs. MLCSSP 5.06E-03 5.06E-03 5.06E-03 7.45E-02
Trace: SMTL(F1F_{1}) vs. AdaBoostMH 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(F1F_{1}) vs. HOMER 5.06E-03 3.67E-02 5.06E-03 5.06E-03
Trace: SMTL($F_{1}$) vs. BR 5.06E-03 5.06E-03 5.06E-03 2.84E-02
Trace: SMTL($F_{1}$) vs. LP 5.06E-03 5.06E-03 1.69E-01 8.79E-01
Trace: SMTL($F_{1}$) vs. ECC 4.69E-02 5.06E-03 1.66E-02 9.59E-01
TABLE VIII: Wilcoxon's test: $p$-values of SMTL against the baselines
Two methods for comparison Cal500 Yeast Emotions Scene Flags
Average AUC      \bigstrut
2,1\ell_{2,1}: SMTL(AUC) vs. MTL-CLS 5.06E-03 5.06E-03 1.69E-01 1.25E-02 1.25E-02 \bigstrut[t]
2,1\ell_{2,1}: SMTL(AUC) vs. MTL-REG 5.06E-03 5.06E-03 1.25E-02 5.06E-03 5.06E-03
Trace: SMTL(AUC) vs. MTL-CLS 6.91E-03 5.06E-03 1.14E-01 4.69E-02 5.08E-01
Trace: SMTL(AUC) vs. MTL-REG 5.06E-03 5.06E-03 1.25E-02 5.06E-03 3.67E-02
1,1\ell_{1,1}: SMTL(AUC) vs. MTL-CLS 5.06E-03 5.06E-03 1.25E-02 2.84E-02 1.25E-02
1,1\ell_{1,1}: SMTL(AUC) vs. MTL-REG 5.06E-03 5.06E-03 1.25E-02 5.06E-03 1.25E-02
Trace: SMTL(AUC) vs. RAkEL 5.06E-03 5.06E-03 1.25E-02 2.84E-02 5.06E-03
Trace: SMTL(AUC) vs. MLCSSP 5.06E-03 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(AUC) vs. AdaBoostMH 5.06E-03 5.06E-03 5.06E-03 5.06E-03 1.25E-02
Trace: SMTL(AUC) vs. HOMER 5.06E-03 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(AUC) vs. BR 5.06E-03 5.06E-03 5.06E-03 9.26E-02 5.06E-03
Trace: SMTL(AUC) vs. LP 5.06E-03 6.91E-03 5.06E-03 7.21E-01 9.34E-03
Trace: SMTL(AUC) vs. ECC 5.06E-03 5.06E-03 5.06E-03 1.69E-01 9.59E-01 \bigstrut[b]
Micro F1F_{1}      \bigstrut
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-CLS 5.06E-03 6.91E-03 5.06E-03 5.06E-03 5.06E-03 \bigstrut[t]
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-REG 9.34E-03 2.84E-02 2.84E-02 5.06E-03 1.25E-02
Trace: SMTL(F1F_{1}) vs. MTL-CLS 5.06E-03 6.91E-03 5.06E-03 5.06E-03 4.45E-01
Trace: SMTL(F1F_{1}) vs. MTL-REG 5.06E-03 2.84E-02 1.25E-02 5.06E-03 1.25E-02
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-CLS 5.06E-03 5.06E-03 5.06E-03 5.06E-03 3.67E-02
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-REG 5.06E-03 2.84E-02 3.67E-02 5.06E-03 2.84E-02
Trace: SMTL(F1F_{1}) vs. RAkEL 5.06E-03 9.34E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(F1F_{1}) vs. MLCSSP 5.06E-03 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(F1F_{1}) vs. AdaBoostMH 5.06E-03 5.06E-03 5.06E-03 5.93E-02 5.06E-03
Trace: SMTL(F1F_{1}) vs. HOMER 5.06E-03 2.03E-01 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL($F_{1}$) vs. BR 5.06E-03 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL($F_{1}$) vs. LP 5.06E-03 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL($F_{1}$) vs. ECC 5.06E-03 3.86E-01 2.41E-01 4.69E-02 8.79E-01
Macro F1F_{1}      \bigstrut
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-CLS 5.06E-03 5.06E-03 6.91E-03 3.67E-02 3.67E-02 \bigstrut[t]
2,1\ell_{2,1}: SMTL(F1F_{1}) vs. MTL-REG 5.06E-03 5.06E-03 5.06E-03 5.06E-03 6.91E-03
Trace: SMTL(F1F_{1}) vs. MTL-CLS 5.06E-03 5.06E-03 5.06E-03 1.66E-02 5.93E-02
Trace: SMTL(F1F_{1}) vs. MTL-REG 5.06E-03 5.06E-03 5.06E-03 5.06E-03 5.06E-03
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-CLS 5.06E-03 5.06E-03 5.06E-03 3.33E-01 3.67E-02
1,1\ell_{1,1}: SMTL(F1F_{1}) vs. MTL-REG 5.06E-03 5.06E-03 5.06E-03 5.06E-03 2.84E-02
Trace: SMTL(F1F_{1}) vs. RAkEL 5.06E-03 1.25E-02 9.34E-03 9.26E-02 5.06E-03
Trace: SMTL(F1F_{1}) vs. MLCSSP 3.67E-02 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(F1F_{1}) vs. AdaBoostMH 5.06E-03 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL(F1F_{1}) vs. HOMER 5.06E-03 5.06E-03 5.06E-03 5.06E-03 5.06E-03
Trace: SMTL($F_{1}$) vs. BR 5.06E-03 5.06E-03 5.06E-03 1.69E-01 5.06E-03
Trace: SMTL($F_{1}$) vs. LP 5.06E-03 5.06E-03 5.06E-03 5.75E-01 6.91E-03
Trace: SMTL($F_{1}$) vs. ECC 5.06E-03 5.06E-03 5.06E-03 7.99E-01 6.91E-03

6.3 Results on imbalanced data

When learning classifiers on imbalanced data (e.g., when the number of positive training samples is much smaller than the number of negative training samples), metrics such as the F-score or AUC are more informative for evaluation than the misclassification error. This is one of the motivations for the SMTL method proposed in this paper. In MTL, the imbalance can be measured by first calculating the imbalance ratio of each individual task (i.e., $\frac{\text{number of positive instances}}{\text{number of negative instances}}$ for each task) and then averaging these ratios.
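The averaged imbalance ratio described above can be computed as in the following sketch. The function names and toy labels are hypothetical, and treating every label other than 1 as negative is an assumption about the label encoding.

```python
from typing import List, Sequence

def imbalance_ratio(labels: Sequence[int]) -> float:
    """Positives divided by negatives for one task (label 1 = positive)."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if neg == 0:
        raise ValueError("task has no negative instances; ratio is undefined")
    return pos / neg

def average_imbalance(tasks: List[Sequence[int]]) -> float:
    """Mean of the per-task imbalance ratios."""
    return sum(imbalance_ratio(t) for t in tasks) / len(tasks)

# Two toy tasks: ratios 1/5 and 2/2, so the average is 0.6.
tasks = [[1, 0, 0, 0, 0, 0], [1, 1, 0, 0]]
print(average_imbalance(tasks))
```

A value close to 1 indicates balanced tasks on average, while values much smaller than 1 indicate that positives are rare.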

We conduct simulated experiments on 3 datasets (Segmentation, Emotions and Optdigits) to investigate the behavior of the proposed SMTL methods on imbalanced data. From each dataset, we generate imbalanced datasets by randomly selecting (with replacement) positive and negative samples from the original dataset, with positive-to-negative ratios of 1:1, 1:5 and 1:10, respectively. As can be seen in Fig. 1 and Fig. 2, in most cases the proposed SMTL variants outperform the baseline method. For example, on Emotions with $\frac{\text{negative samples}}{\text{positive samples}}=10:1$, the proposed SMTL achieves relative increases of 9.7% / 12.9% / 11.1% over the baseline w.r.t. AUC / Macro F1 / Micro F1, respectively. In addition, as the ratio of negative to positive samples increases, the improvement of SMTL over the baseline method also increases.
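One way to generate such imbalanced samples for a single task is sketched below. The paper does not specify how many positives are drawn or how the sampling is coordinated across tasks, so the parameters `n_pos` and `neg_per_pos`, the seed, and the function name are hypothetical choices for illustration.

```python
import random
from typing import List, Sequence

def resample_to_ratio(labels: Sequence[int], n_pos: int, neg_per_pos: int,
                      seed: int = 0) -> List[int]:
    """Sample indices with replacement so that positives:negatives = 1:neg_per_pos."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y != 1]
    chosen_pos = rng.choices(pos_idx, k=n_pos)                 # with replacement
    chosen_neg = rng.choices(neg_idx, k=n_pos * neg_per_pos)   # with replacement
    return chosen_pos + chosen_neg

# Toy task with 20 positives and 80 negatives, resampled to a 1:5 ratio.
labels = [1] * 20 + [0] * 80
idx = resample_to_ratio(labels, n_pos=10, neg_per_pos=5)
n_selected_pos = sum(1 for i in idx if labels[i] == 1)
print(n_selected_pos, len(idx) - n_selected_pos)
```

Sampling with replacement allows ratios such as 1:10 even when the original dataset contains too few negatives to reach them without repetition.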

6.4 Training Time Comparison

To investigate the training speed of the proposed method, we provide a running time comparison in Table IX. We can see that SMTL is slower to train than most of the baseline methods, by a factor of roughly 30 at most. It is worth noting that training time is usually not a critical issue in practice, because training is typically performed off-line.

TABLE IX: Training time comparison
Method Optdigits Emotions Segmentation
SMTL($\ell_{1,1}$+AUC) 105.200s 30.001s 1.888s
SMTL($\ell_{1,1}$+$F_{1}$) 510.900s 29.797s 2.964s
MTL-CLS($\ell_{1,1}$) 356.200s 24.674s 2.023s
MTL-REG($\ell_{1,1}$) 19.030s 7.427s 0.450s
StructSVM 17.762s 46.468s 5.015s
RAkEL 28.428s 4.117s 4.310s
AdaBoostMH 17.157s 1.024s 0.641s
MLCSSP 121.779s 1.563s 6.410s
HOMER 20.643s 1.354s 0.880s
BR 20.852s 1.859s 1.835s
LP 16.131s 22.561s 2.103s
ECC 17.852s 2.834s 1.891s

7 Conclusion

In this paper, we developed Structured-MTL, an MTL method that directly optimizes evaluation metrics. To solve the optimization problem of Structured-MTL, we developed an optimization procedure based on the ADMM scheme. This optimization procedure can be applied to a large family of MTL problems with structured outputs.

In future work, we plan to investigate Structured-MTL on problems other than classification (e.g., MTL for ranking problems). We also plan to improve the efficiency of Structured-MTL on large-scale learning problems.

References

  • [1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817-1853, 2005.
  • [2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
  • [3] J-B. Bi, T. Xiong, S-P. Yu, M. Dundar, and R. Rao. An improved multi-task learning approach with applications in medical diagnosis. In Machine Learning and Knowledge Discovery in Databases, pages 117-132, 2008.
  • [4] W. Bi and J. Kwok. Efficient Multi-label Classification with Many Labels. In International Conference on Machine Learning, pages 405-413, 2013.
  • [5] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
  • [6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
  • [7] J-F. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion. UCLA CAM Report, 2008.
  • [8] R. Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997.
  • [9] J-H. Chen, J. Liu, and J-P. Ye. Learning incoherent sparse and low-rank patterns from multiple tasks. In International Conference on Knowledge Discovery and Data Mining, pages 1179-1188, 2010.
  • [10] J-H. Chen, J-Y. Zhou, and J-P. Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In International Conference on Knowledge Discovery and Data Mining, pages 42-50, 2011.
  • [11] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615-637, 2005.
  • [12] R-E. Fan, K-W. Chang, C-J. Hsieh, X-R. Wang, C-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
  • [13] P-H. Gong, J-P. Ye, and C-S. Zhang. Robust multi-task feature learning. In International Conference on Knowledge Discovery and Data Mining, pages 895-903, 2012.
  • [14] N. Gornitz, C. Widmer, G. Zeller, A. Kahles, S. Sonnenburg, and G. Ratsch. Hierarchical Multitask Structured Output Learning for Large-Scale Sequence Segmentation. In Advances in Neural Information Processing Systems, 2011.
  • [15] X. Gu, F-L. Chung, H. Ishibuchi, and S-T. Wang. Multitask Coupled Logistic Regression and Its Fast Implementation for Large Multitask Datasets. In IEEE Transactions on Cybernetics, 45(9): 1953-1966, 2015.
  • [16] B. He and X. Yuan. On the O(1/n) Convergence Rate of the Douglas-Rachford Alternating Direction Method. SIAM Journal on Numerical Analysis, 50(2):700-709, 2012.
  • [17] S-W. Ji and J-P. Ye. An Accelerated Gradient Method for Trace Norm Minimization. In International Conference on Machine Learning, pages 457-464, 2009.
  • [18] Y-Z. Jiang, F-L. Chung, H. Ishibuchi, Z-H. Deng, and S-T. Wang. Multitask TSK Fuzzy System Modeling by Mining Intertask Common Hidden Structure. In IEEE Transactions on Cybernetics, 45(3): 548-561, 2015.
  • [19] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas. Mulan: A Java library for multi-label learning. Journal of Machine Learning Research, 12:2411-2414, 2011.
  • [20] T. Joachims. A Support Vector Method for Multivariate Performance Measures. In International Conference on Machine Learning, 2005.
  • [21] Z. Kang, K. Grauman, and F. Sha. Learning with whom to share in multi-task feature learning. In International Conference on Machine Learning, pages 521-528, 2011.
  • [22] S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In International Conference on Machine Learning, pages 543-550, 2010.
  • [23] H-J. Lai, Y. Pan, C. Liu, L. Lin, and J. Wu. Sparse Learning-to-rank via an Efficient Primal-Dual Algorithm. IEEE Transactions on Computers, 62(6):1221-1233, 2013.
  • [24] H-J. Lai, Y. Pan, Y. Tang, and R. Yu. FSMRank: Feature Selection Method for Learning to Rank. IEEE Transactions on Neural Networks and Learning Systems, 24(6):940-952, 2013.
  • [25] Z-C. Lin, M-M. Chen, and Y. Ma. The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrix. Technical Report, UIUC, October 2009.
  • [26] A-A. Liu, Y-T. Su, P-P. Jia, Z. Gao, T. Hao, and Z-X. Yang. Multiple/Single-View Human Action Recognition via Part-Induced Multitask Structural Learning. IEEE Transactions on Cybernetics, 45(6):1194-1208, 2016.
  • [27] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In International Conference on Machine Learning, pages 663-670, 2010.
  • [28] J. Liu, S-W. Ji, and J-P. Ye. Multi-task feature learning via efficient $\ell_{2,1}$-norm minimization. In Conference on Uncertainty in Artificial Intelligence, pages 339-348, 2009.
  • [29] X-Q. Lu, X-L. Li, and L-C. Mou. Semi-Supervised Multitask Learning for Scene Recognition. IEEE Transactions on Cybernetics, 45(9): 1967-1976, 2015.
  • [30] G. Obozinski, B. Taskar, and M.I. Jordan. Multi-task feature selection. Technical report, Statistics Department, UC Berkeley, 2006.
  • [31] Y. Pan, H-J. Lai, C. Liu, and S-C. Yan. A Divide-and-Conquer Method for Scalable Low-Rank Latent Matrix Pursuit. In International Conference on Computer Vision and Pattern Recognition, 2013.
  • [32] Y. Pan, H-J. Lai, C. Liu, Y. Tang, and S-C. Yan. Rank Aggregation via Low-Rank and Structured-Sparse Decomposition. In AAAI Conference on Artificial Intelligence, 2013.
  • [33] Y. Pan, R-K. Xia, J. Yin, and N. Liu. A Divide-and-Conquer Method for Scalable Robust Multitask Learning. IEEE Transactions on Neural Networks and Learning Systems, 26(12): 3163-3175, 2015.
  • [34] N. Quadrianto, A. Smola, T. Caetano, S. Vishwanathan, and J. Petterson. Multitask learning without label correspondences. In Advances in Neural Information Processing Systems, pages 1957-1965, 2010.
  • [35] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier Chains for Multi-label Classification. Machine Learning, 85(3): 333-359, 2011.
  • [36] R.M. Rifkin and R.A. Lippert. Value Regularization and Fenchel Duality. Journal of Machine Learning Research, 8:441-479, 2007.
  • [37] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2):135-168, 2000.
  • [38] S. Shalev-Shwartz and Y. Singer. On the Equivalence of Weak Learnability and Linear Separability: New Relaxations and Efficient Boosting Algorithms. Machine Learning, 80(2): 141-163, 2010.
  • [39] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In International Conference on Machine Learning, 2004.
  • [40] G. Tsoumakas, I. Katakis, and I. Vlahavas. Effective and Efficient Multilabel Classification in Domains with Large Number of Labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data, pages 30-44, 2008.
  • [41] G. Tsoumakas, I. Katakis, and I. Vlahavas. Random k-Labelsets for Multi-Label Classification. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079-1089, 2011.
  • [42] E. Gibaja and S. Ventura. A tutorial on multilabel learning. ACM Computing Surveys, 47(3): 52, 2015.
  • [43] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6): 80-83, 1945.
  • [44] R-K. Xia, Y. Pan, L. Du, and J. Yin. Robust Multi-View Clustering via Low-Rank and Sparse Decomposition. In AAAI Conference on Artificial Intelligence, 2014.
  • [45] R-K. Xia, Y. Pan, H-J. Lai, C. Liu, and S-C. Yan. Supervised Hashing for Image Retrieval via Image Representation Learning. In AAAI Conference on Artificial Intelligence, 2014.
  • [46] Y. Yang, Z-G. Ma, Y. Yang, F-P. Nie, and H-T. Shen. Multitask Spectral Clustering by Exploring Intertask Correlation. IEEE Transactions on Cybernetics, 45(5): 1069-1080, 2015.
  • [47] Y-J. Yin, D. Xu, X-G. Wang, and M-R. Bai. Online State-Based Structured SVM Combined With Incremental PCA for Robust Visual Tracking. IEEE Transactions on Cybernetics, 45(9): 1988-2000, 2015.
  • [48] J. Yu, D-C. Tao, M. Wang, and Y. Rui. Learning to Rank Using User Clicks and Visual Features for Image Retrieval. IEEE Transactions on Cybernetics, 45(4): 767-779, 2015.
  • [49] K. Yu, V. Tresp, and A. Schwaighofer. Learning gaussian processes from multiple tasks. In International Conference on Machine Learning, pages 1012-1019, 2005.
  • [50] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A Support Vector Method for Optimizing Average Precision. In International Conference on Research and Development in Information Retrieval, 2007.
  • [51] J. Zhang, Z. Ghahramani, and Y-M. Yang. Learning multiple related tasks using latent independent component analysis. In Advances in Neural Information Processing Systems, pages 1585-1592, 2006.
  • [52] W-Q. Zhao, Q-G. Meng, and P. W. H. Chung. A Heuristic Distributed Task Allocation Method for Multivehicle Multitask Problems and Its Application to Search and Rescue Scenario. IEEE Transactions on Cybernetics, 46(4): 902-915, 2016.
  • [53] J-Y. Zhou, J-H. Chen, and J-P. Ye. Clustered multi-task learning via alternating structure optimization. In Advances in Neural Information Processing Systems, pages 702-710, 2011.