Optimizing Evaluation Metrics for Multi-Task Learning via the Alternating Direction Method of Multipliers

Ge-Yang Ke, Yan Pan, Jian Yin, Chang-Qin Huang

Ge-Yang Ke, Yan Pan, and Jian Yin are with the School of Data Science and Computer Science, Sun Yat-sen University, Guangzhou 510006, China. Corresponding author: Yan Pan (panyan5@mail.sysu.edu.cn). Chang-Qin Huang is with the School of Information Technology in Education, South China Normal University, Guangzhou 510631, China.

Abstract

Multi-task learning (MTL) aims to improve the generalization performance of multiple tasks by exploiting the shared factors among them. Various metrics (e.g., F-score, Area Under the ROC Curve) are used to evaluate the performance of MTL methods. Most existing MTL methods try to minimize either the misclassification errors for classification or the mean squared errors for regression. In this paper, we propose a method to directly optimize the evaluation metrics for a large family of MTL problems. The formulation of MTL that directly optimizes evaluation metrics is the combination of two parts: (1) a regularizer defined on the weight matrix over all tasks, in order to capture the relatedness of these tasks; (2) a sum of multiple structured hinge losses, each corresponding to a surrogate of some evaluation metric on one task. This formulation is challenging to optimize because both of its parts are non-smooth. To tackle this issue, we propose a novel optimization procedure based on the alternating direction method of multipliers, in which we decompose the whole optimization problem into a sub-problem corresponding to the regularizer and another sub-problem corresponding to the structured hinge losses. For a large family of MTL problems, the first sub-problem has closed-form solutions. To solve the second sub-problem, we propose an efficient primal-dual algorithm via coordinate ascent. Extensive evaluation results demonstrate that, in a large family of MTL problems, the proposed MTL method of directly optimizing evaluation metrics achieves superior performance over the corresponding baseline methods.

Index Terms:

Multi-Task Learning, Evaluation Metrics, Structured Outputs, Coordinate Ascent, Alternating Direction Method of Multipliers.

1 Introduction

Recently, considerable research has been devoted to Multi-Task Learning (MTL), the problem of improving the generalization performance of multiple tasks by utilizing the shared information among them. MTL has been widely used in various applications, such as natural language processing [1], handwritten character recognition [30, 34], scene recognition [29] and medical diagnosis [3]. Many MTL methods have been proposed in the literature [8, 11, 49, 51, 13, 21, 28, 30, 53, 1, 9, 10, 33, 29, 15, 46, 18, 52, 26].

In this paper, we consider MTL for classification or regression problems. Note that either a multi-class classification problem or a multi-label learning problem can be regarded as an MTL problem (see Footnote 1). Most of the existing MTL methods focus on minimizing either a convex surrogate (e.g., the hinge loss or the logistic loss) of the $0$-$1$ errors for multi-task classification, or the mean squared errors for multi-task regression. On the other hand, in practice, several evaluation metrics other than misclassification errors or mean squared errors are used in the evaluation of MTL methods, e.g., F-score, Precision, Recall, Area Under the ROC Curve (AUC), and Mean Average Precision. For example, in the cases of MTL on imbalanced data (e.g., in a task, the number of negative samples is much larger than that of the positive samples), cost-sensitive MTL, or MTL for ranking, these metrics are more effective in performance evaluation than the standard misclassification errors or the mean squared errors. However, due to the computational difficulties, few learning techniques have been developed to directly optimize these evaluation metrics in MTL.

Footnote 1: As an illustrative example, consider a multi-label classification problem with instances $\{x_{1},x_{2},x_{3},x_{4},x_{5}\}$ in which $x_{1}$ belongs to classes $a$ and $b$, $x_{2}$ belongs to classes $b$ and $c$, $x_{3}$ belongs to class $c$, $x_{4}$ belongs to class $a$, and $x_{5}$ belongs to classes $a$, $b$ and $c$. This problem can be regarded as an MTL problem with three tasks, whose training sets are:

$\begin{split}(x_{1},{\color[rgb]{1,0,0}1}),(x_{2},{\color[rgb]{1,0,0}0}),(x_{3},{\color[rgb]{1,0,0}0}),(x_{4},{\color[rgb]{1,0,0}1}),(x_{5},{\color[rgb]{1,0,0}1})\\ (x_{1},{\color[rgb]{0,1,0}1}),(x_{2},{\color[rgb]{0,1,0}1}),(x_{3},{\color[rgb]{0,1,0}0}),(x_{4},{\color[rgb]{0,1,0}0}),(x_{5},{\color[rgb]{0,1,0}1})\\ (x_{1},{\color[rgb]{0,0,1}0}),(x_{2},{\color[rgb]{0,0,1}1}),(x_{3},{\color[rgb]{0,0,1}1}),(x_{4},{\color[rgb]{0,0,1}0}),(x_{5},{\color[rgb]{0,0,1}1})\\ \end{split}$

The first/second/third task is a binary classification problem of whether an instance belongs to class $a$/$b$/$c$. Hence, a multi-label learning problem is a special case of an MTL problem. Similarly, we can verify that a multi-class classification problem can also be regarded as an MTL problem.
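The conversion in the illustrative multi-label example can be sketched in a few lines of Python. This is only a minimal illustration: the dictionary `label_sets` and the helper `multilabel_to_tasks` are hypothetical names introduced here, and the labels are encoded as $0$/$1$ to match the example above.

```python
import numpy as np

# Label sets from the illustrative example: instance -> classes it belongs to.
label_sets = {
    "x1": {"a", "b"},
    "x2": {"b", "c"},
    "x3": {"c"},
    "x4": {"a"},
    "x5": {"a", "b", "c"},
}
classes = ["a", "b", "c"]


def multilabel_to_tasks(label_sets, classes):
    """Build one binary label vector per class, i.e., one per task."""
    instances = sorted(label_sets)  # x1, x2, ..., x5
    return {
        c: np.array([1 if c in label_sets[x] else 0 for x in instances])
        for c in classes
    }


tasks = multilabel_to_tasks(label_sets, classes)
for c in classes:
    print(c, tasks[c])
```

Each entry of `tasks` is the label vector of one binary task over the same five instances, which is exactly the sense in which the multi-label problem becomes an MTL problem with three tasks.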

In this paper, we propose an approach to directly optimizing the evaluation metrics in MTL, which can be applied to a large family of MTL problems. Specifically, for an MTL problem with $m$ tasks, where the $i$ th task ($i=1,2,...,m$) is associated with a training set $\{(\mathbb{x}_{j}^{(i)},\mathbb{y}_{j}^{(i)})\}_{j=1}^{n_{i}}$ and $n_{i}$ is the number of training samples for the $i$ th task, we consider the following generic formulation:

\min_{\mathbb{W}}{\Omega}(\mathbb{W})+\lambda\sum_{i=1}^{m}\mathcal{L}(\mathbb{W}_{.i};\{(\mathbb{x}_{j}^{(i)},\mathbb{y}_{j}^{(i)})\}_{j=1}^{n_{i}}),

(1)

where $\mathbb{W}$ is the weight matrix with $m$ columns $\mathbb{W}_{.1}$, $\mathbb{W}_{.2}$, …, $\mathbb{W}_{.m}$, and $\lambda>0$ is a trade-off parameter. This formulation is the linear combination of two parts. The first part is a regularizer ${\Omega}(\mathbb{W})$ defined on the weight matrix $\mathbb{W}$ over all tasks, in order to leverage the relatedness of these tasks. Examples of this kind of regularizer include the trace-norm, the $\ell_{1,1}$-norm or the $\ell_{2,1}$-norm on $\mathbb{W}$. The second part in the formulation is a sum of multiple loss functions, each corresponding to one task. In order to directly optimize a specific evaluation metric, we consider the hinge loss functions for structured outputs [39, 20, 50, 48, 47], which are surrogates of a specific evaluation metric.

The formulation in (1) includes a large family of MTL problems. Since the two parts in (1) are usually non-smooth, the optimization problem (1) is difficult to solve. To tackle this issue, we propose a novel optimization procedure based on the alternating direction method of multipliers (ADMM [6, 25]), which is widely used in various machine learning problems (e.g., [31, 32, 33, 44]). We decompose the whole optimization problem in (1) into two simpler sub-problems. The first sub-problem corresponds to the regularizer. For commonly used regularizers in MTL (e.g., the trace-norm, the $\ell_{2,1}$-norm), this sub-problem has a closed-form solution. The second sub-problem corresponds to the structured hinge losses. To solve the second sub-problem, we propose an efficient primal-dual algorithm via coordinate ascent.

We conduct extensive experiments to evaluate the performances of the proposed MTL method. Experimental results show that the proposed method that optimizes a specific evaluation metric outperforms the corresponding MTL classification or MTL regression baseline methods by a clear margin.

2 Related Work

MTL is a wide class of learning problems. Roughly speaking, the existing MTL methods can be divided into three main categories: parameters sharing, common features sharing, and low-rank subspace sharing.

In the methods with parameter sharing, all tasks are assumed to explicitly share some common parameters. Representative methods in this category include shared weight vectors [11], hidden units in neural networks [8], and common priors in Bayesian models [49, 51].

In the methods with common features sharing, task relatedness is modeled by enforcing all tasks to share a common set of features [2, 28, 22, 30, 13, 21, 53]. Representative examples are the methods which constrain the model parameters (i.e., a weight matrix) of all tasks to have certain sparsity patterns, for example, cardinality sparsity [30], group sparsity [28, 13], or clustered structure [21, 53].

The methods in the third category assume that all tasks lie in a shared low-rank subspace [1, 9, 10]. A common assumption in this category of methods is that most of the tasks are relevant while (optionally) there may exist a small number of irrelevant (outlier) tasks. These methods pursue a low-rank weight matrix that captures the underlying shared factors among tasks. Trace-norm regularization is commonly used in these methods to encourage the low-rank structure on the model parameters.

Most of the existing MTL methods focus on designing regularizers or parameter sharing patterns to utilize the intrinsic relationships among multiple related tasks. These MTL methods usually try to optimize the classification errors or the mean squared errors for regression. In practice, various other metrics (such as F-score and AUC) are used in the evaluation of MTL methods. However, little effort has been devoted to optimizing these evaluation metrics in the context of MTL, except for the work in [14], in which the author proposed a hierarchical MTL formulation for structured output prediction in sequence segmentation. Since the regularizer used in [14] is decomposable, the hierarchical MTL problem can be decomposed into multiple independent tasks, each of which is a structured output learning problem with a simple regularizer. In this paper, we seek to directly optimize commonly used evaluation metrics for MTL with a possibly indecomposable regularizer, resulting in a generic approach that can be applied to a large family of MTL problems. Our formulation can be regarded as MTL for structured output prediction with an indecomposable regularizer.

The proposed methods in this paper are also related to multi-label algorithms. Various multi-label algorithms have been proposed in the literature, e.g., the RAkEL method that uses random $k$-label sets [41], the MLCSSP method that spans the original label space by a subset of labels [4], the AdaBoostMH method based on AdaBoost [37], the HOMER method based on the hierarchy of multi-label learners [40], the binary relevance (BR) [42] method, the label power-set (LP) [42] method, and the ensembles of classifier chains (ECC) [35] method.

The proposed approach in this paper is to optimize the evaluation metrics in MTL. We refer the readers to Section 4 for the detailed introduction to the evaluation metrics related to the proposed approach.

3 Notations

We first introduce the notations to be used throughout this paper. We use bold upper-case characters (e.g., $\mathbb{M}$, $\mathbb{X}$, $\mathbb{W}$) to represent matrices, and bold lower-case characters (e.g., $\mathbb{x}$, $\mathbb{y}$) to represent vectors. For a matrix $\mathbf{M}\in{\mathbb{R}^{d\times m}}$, we denote by $\mathbf{M}_{ij}$ the element at the intersection of the $i$ th row and the $j$ th column of $\mathbf{M}$. We denote by $\mathbf{M}_{i\cdot}\in\mathbb{R}^{1\times m}$ the $i$ th row of $\mathbf{M}$, and by $\mathbf{M}_{\cdot j}\in\mathbb{R}^{d\times 1}$ the $j$ th column of $\mathbf{M}$.

We denote by $\|\mathbf{M}\|_{F}=\sqrt{\sum_{i=1}^{d}\sum_{j=1}^{m}(\mathbf{M}_{ij})^{2}}$ the Frobenius norm of $\mathbf{M}$. Let $\|\mathbb{M}\|_{1,1}=\sum_{i=1}^{d}\sum_{j=1}^{m}|\mathbb{M}_{ij}|$ be the $\ell_{1,1}$-norm of $\mathbb{M}$, where $|\mathbb{M}_{ij}|$ is the absolute value of $\mathbb{M}_{ij}$. Let $\|\mathbb{M}\|_{2,1}=\sum_{i=1}^{d}\|\mathbb{M}_{i.}\|_{2}$ be the $\ell_{2,1}$-norm of $\mathbb{M}$, where $\|\mathbb{M}_{i.}\|_{2}=\sqrt{\sum_{j=1}^{m}\mathbb{M}_{ij}^{2}}$ is the $\ell_{2}$-norm of the row $\mathbb{M}_{i.}$. Let $\|\mathbb{M}\|_{\infty}=\mathop{\max}\limits_{i,j}|\mathbb{M}_{ij}|$ be the infinity norm of $\mathbb{M}$. The trace-norm of $\mathbb{M}$ is defined by $\|\mathbb{M}\|_{*}=\sum_{k=1}^{rank(\mathbb{M})}\sigma_{k}(\mathbb{M})$, where $\{\sigma_{k}(\mathbb{M})\}_{k=1}^{rank(\mathbb{M})}$ are the non-zero singular values of $\mathbb{M}$ and $rank(\mathbb{M})$ is the rank of $\mathbb{M}$. We denote by $\mathbb{M}^{T}$ the transpose of $\mathbb{M}$. For a vector $\mathbb{x}$, $\|\mathbb{x}\|_{2}$ represents its $\ell_{2}$-norm.
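As a quick sanity check of these definitions, the norms can be computed with NumPy as follows. This is a small sketch; the function names are chosen here for illustration, and `infinity_norm` follows the entrywise definition used in this paper rather than an induced matrix norm.

```python
import numpy as np


def frobenius_norm(M):
    return np.sqrt(np.sum(M ** 2))


def l11_norm(M):
    # Sum of absolute values of all entries.
    return np.sum(np.abs(M))


def l21_norm(M):
    # Sum over rows of the l2-norm of each row.
    return np.sum(np.sqrt(np.sum(M ** 2, axis=1)))


def infinity_norm(M):
    # Largest absolute entry.
    return np.max(np.abs(M))


def trace_norm(M):
    # Sum of singular values.
    return np.sum(np.linalg.svd(M, compute_uv=False))


M = np.array([[3.0, 4.0], [0.0, 0.0]])
print(frobenius_norm(M), l11_norm(M), l21_norm(M), infinity_norm(M), trace_norm(M))
```

For this rank-one matrix with a single non-zero row, the Frobenius norm, the $\ell_{2,1}$-norm and the trace-norm coincide, while the $\ell_{1,1}$-norm is larger.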

In the context of MTL, we assume we are given $m$ learning tasks. The $i$ th ($i=1,\dots,m$) task is associated with a training set $({\mathbf{X}}^{(i)},{\mathbf{y}}^{(i)})$, where ${\mathbf{X}}^{(i)}\in\mathbb{R}^{n_{i}\times d}$ denotes the data matrix with each row being a sample, ${\mathbf{y}}^{(i)}\in\{-1,+1\}^{n_{i}}$ denotes the target labels on $\mathbb{X}^{(i)}$, $d$ is the feature dimensionality, and $n_{i}$ is the number of samples for the $i$ th task. For $i=1,2,...,m$, we define $\mathbb{E}_{i}=\{-1,+1\}^{n_{i}}$ as the set of all possible $n_{i}$-dimensional vectors whose elements are either $-1$ or $+1$. To simplify presentation, we write $\mathbb{E}_{i}=\{\mathbb{y}_{1},\mathbb{y}_{2},...,\mathbb{y}_{p}\}$, where $p=2^{n_{i}}$ and each $\mathbb{y}_{j}$ is one of the possible vectors in $\{-1,+1\}^{n_{i}}$.

We define a weight matrix $\mathbf{W}=[\mathbb{W}_{\cdot 1},\dots,\mathbb{W}_{\cdot m}]\in\mathbb{R}^{d\times m}$ on all of the $m$ tasks. The goal of (linear) MTL is to simultaneously learn $m$ (linear) predictors $\mathbb{W}_{\cdot i}\ (i=1,\dots,m)$ to minimize some loss function $\mathcal{L}(\mathbb{W}_{\cdot i};{\mathbf{X}}^{(i)},{\mathbf{y}}^{(i)})$ (e.g. the least square loss $||{\mathbf{y}}^{(i)}-{\mathbf{X}}^{(i)}\mathbb{W}_{\cdot i}||_{2}^{2}$ ), where $\mathbb{W}_{\cdot i}\in\mathbb{R}^{d}$ is in the form of a column vector. Note that for each task, we have $\mathbb{X}^{(i)}=[\mathbb{x}_{1}^{(i)},\mathbb{x}_{2}^{(i)},\cdots,\mathbb{x}_{n_{i}}^{(i)}]^{T}$ and $\mathbb{y}^{(i)}=[\mathbb{y}_{1}^{(i)},\mathbb{y}_{2}^{(i)},\cdots,\mathbb{y}_{n_{i}}^{(i)}]^{T}$ .

4 Problem Formulations

The linear MTL problem can be formulated as the generic form in (1). The objective functions in many existing MTL methods are special cases of such a formulation. The following are two examples:

•

With the regularizer ${\Omega}(\mathbb{W})$ being the $\ell_{2,1}$ -norm $||\mathbb{W}||_{2,1}$ and each loss function $\mathcal{L}(\mathbb{W}_{\cdot i};{\mathbf{X}}^{(i)},{\mathbf{y}}^{(i)})$ being the mean squared loss $\frac{1}{2}||\mathbb{y}^{(i)}-\mathbb{X}^{(i)}\mathbb{W}_{.i}||_{2}^{2}$ , the problem in (1) is the same as the objective used in [28].
•

If the regularizer ${\Omega}(\mathbb{W})$ is set to be the trace-norm $||\mathbb{W}||_{*}$ and each loss function $\mathcal{L}(\mathbb{W}_{\cdot i};{\mathbf{X}}^{(i)},{\mathbf{y}}^{(i)})$ is smooth (e.g., the mean squared loss $\frac{1}{2}||\mathbb{y}^{(i)}-\mathbb{X}^{(i)}\mathbb{W}_{.i}||_{2}^{2}$ ), the problem in (1) becomes the objective used in [17].

The existing MTL methods mainly focus on the design of good regularizers (i.e., ${\Omega}(\mathbb{W})$) to capture the shared factors among multiple related tasks. The loss functions used in these methods minimize either the misclassification errors (for classification) or the mean squared errors (for regression). On the other hand, in practice, several evaluation metrics other than misclassification errors or mean squared errors are used in the evaluation of MTL methods, such as F-score and AUC. In particular, in the cases of MTL on imbalanced data (e.g., in a task, the number of negative samples is much larger than that of the positive samples), these metrics are more effective in performance evaluation than the standard misclassification errors or the mean squared errors.

Learning techniques for directly optimizing evaluation metrics, also known as learning with structured outputs, have been developed for many (single-task) problems, e.g., classification [39, 20] and ranking [50]. However, despite the acknowledged importance of metrics like F-score or AUC, little effort has been made to design MTL methods that directly optimize these evaluation metrics. The main reason is that MTL with direct optimization of evaluation metrics usually results in a non-smooth objective function, which is difficult to optimize.

In this paper, we focus on MTL with structured outputs and propose a generic optimization procedure based on ADMM. This optimization procedure can be applied to solving a large family of MTL problems that directly optimize some evaluation metric (e.g., F-score, AUC). We call the proposed method Structured MTL (SMTL for short).

The formulation of SMTL is also a special case of (1). In order to optimize some evaluation metric, we define the loss function for each task as the structured hinge loss:

\begin{split}&\mathcal{L}(\mathbb{W}_{.i};\mathbb{X}^{(i)},\mathbb{y}^{(i)})\\ =&\mathop{\max}\limits_{{\mathbf{y}_{j}}\in\mathbb{E}_{i}}[\Delta({{\bf{y}}^{(i)}},{\mathbf{y}_{j}})-\mathbb{W}_{.i}^{T}{{\bf{X}}^{(i)}}^{T}({{\bf{y}}^{(i)}}-{\mathbf{y}_{j}})],\\ \end{split}

where $\mathbf{y}_{j}$ represents any possible label assignment on $\mathbf{X}^{(i)}$, and $\Delta({\mathbf{y}^{(i)}},\mathbf{y}_{j})$ represents an evaluation metric that measures the distance between the true labels ${\mathbf{y}^{(i)}}$ and the other labels $\mathbf{y}_{j}$. For example, $\Delta(.,.)$ can be $1-$F-score or $1-$AUC.
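For very small $n_{i}$, the structured hinge loss can be evaluated by enumerating all $2^{n_{i}}$ label assignments, which may help make the definition concrete. The sketch below is only illustrative: the brute-force enumeration is not the efficient procedure used in this paper, and `hamming_delta` (the fraction of mismatched labels) is an assumed stand-in for $\Delta(.,.)$.

```python
import itertools

import numpy as np


def structured_hinge_loss(w, X, y, delta):
    """Evaluate max_j [delta(y, y_j) - w^T X^T (y - y_j)] by enumeration."""
    n = len(y)
    best = -np.inf
    for labels in itertools.product([-1, 1], repeat=n):
        y_j = np.array(labels)
        value = delta(y, y_j) - w @ X.T @ (y - y_j)
        best = max(best, value)
    return best


def hamming_delta(y, y_j):
    # Illustrative stand-in for Delta: fraction of mismatched labels.
    return float(np.mean(y != y_j))


X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y = np.array([1, 1, -1])

loss_at_zero = structured_hinge_loss(np.zeros(2), X, y, hamming_delta)
loss_at_good_w = structured_hinge_loss(np.array([1.0, 1.0]), X, y, hamming_delta)
print(loss_at_zero, loss_at_good_w)
```

With $\mathbb{w}=0$ the loss equals the largest value of $\Delta$, while a weight vector that separates the samples with a sufficient margin drives the loss to zero, since $\mathbf{y}_{j}=\mathbf{y}^{(i)}$ is always a candidate in the maximization.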

The formulation of SMTL is defined as:

\begin{split}&\min_{\mathbb{W}}{\Omega}(\mathbb{W})\\ &+\lambda\sum_{i=1}^{m}\mathop{\max}\limits_{{\mathbf{y}_{j}}\in\mathbb{E}_{i}}[\Delta({{\bf{y}}^{(i)}},{\mathbf{y}_{j}})-\mathbb{W}_{.i}^{T}{{\bf{X}}^{(i)}}^{T}({{\bf{y}}^{(i)}}-{\mathbf{y}_{j}})].\end{split}

(2)

In this paper, we only focus on the MTL problems in the form of (2) that satisfy the following conditions:

•

Condition 1: With respect to $\Omega\left(\mathbf{W}\right)$, there is a closed-form solution for the following sub-problem:

$\displaystyle\min_{\mathbf{W}}\Omega(\mathbf{W})+\frac{\mu}{2}\left\|{\mathbf{W}-\mathbf{M}}\right\|_{F}^{2}$ (3)

where $\mathbf{M}\in{\mathbb{R}^{d\times m}}$ and $\mu$ is a positive constant.

•

Condition 2: For the evaluation metric $\Delta({{\bf{y}}^{(i)}},{\mathbf{y}_{j}})$, the following sub-problem can be solved in polynomial time:

\mathop{\operatorname*{argmax}}\limits_{{\mathbf{y}_{j}}\in\mathbb{E}_{i}}[\Delta({{\bf{y}}^{(i)}},{\mathbf{y}_{j}})-\mathbb{W}_{.i}^{T}{{\bf{X}}^{(i)}}^{T}({{\bf{y}}^{(i)}}-{\mathbf{y}_{j}})]

(4)

The first condition is to restrict the regularizer $\Omega(\mathbb{W})$ and the second one is to restrict the evaluation metric function $\Delta(\mathbb{y^{(i)}},\mathbf{y}_{j})$ . Even under these conditions, the formulation in (2) includes a large family of MTL problems. On the one hand, for the regularizer $\Omega(\mathbb{W})$ , the following norms that are commonly-used in MTL satisfy Condition 1:

•

$\ell_{1,1}$-norm: For the MTL problems with $\Omega(\mathbb{W})=||\mathbb{W}||_{1,1}$, the sub-problem in (3) is known to have the closed-form solution

$\mathbb{W}=\mathcal{S}_{\frac{1}{\mu}}(\mathbb{M}),$ (5)

where $\mathcal{S}_{\delta}(\mathbb{M})=\max(\mathbb{M}-\delta,0)+\min(\mathbb{M}+\delta,0)$ is the shrinkage operator [25].

•

$\ell_{2,1}$-norm: For the MTL problems with $\Omega(\mathbb{W})=||\mathbb{W}||_{2,1}$, the sub-problem in (3) is also known to have a closed-form solution, given row by row as:

\mathbb{W}_{j.}=\left\{\begin{array}[]{ll}\frac{||\mathbb{M}_{j.}||_{2}-\frac{1}{\mu}}{||\mathbb{M}_{j.}||_{2}}\mathbb{M}_{j.}&\textrm{if $\frac{1}{\mu}<||\mathbb{M}_{j.}||_{2}$},\\ 0&\textrm{otherwise},\end{array}\right.

(6)

•

Trace-norm: For the MTL problems with $\Omega(\mathbb{W})=||\mathbb{W}||_{*}$, the sub-problem in (3) also has a closed-form solution, given by the Singular Value Thresholding method [7]. Specifically, letting $\mathbb{U}\mathbb{\Sigma}\mathbb{V}^{T}$ be the SVD of $\mathbb{M}$, the closed-form solution is given by:

$\mathbb{W}=\mathbb{U}\mathcal{S}_{{1}/{\mu}}(\mathbb{\Sigma})\mathbb{V}^{T}.$ (7)
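The three closed-form solutions (5), (6) and (7) can be sketched with NumPy as follows. The function names are illustrative choices made here, and the small constant guarding the division in `prox_l21` is an implementation detail assumed for numerical safety, not part of the formulas above.

```python
import numpy as np


def soft_threshold(M, delta):
    """Entrywise shrinkage operator S_delta used in (5)."""
    return np.maximum(M - delta, 0.0) + np.minimum(M + delta, 0.0)


def prox_l11(M, mu):
    """Solution of (3) for the l_{1,1}-norm, as in (5)."""
    return soft_threshold(M, 1.0 / mu)


def prox_l21(M, mu):
    """Solution of (3) for the l_{2,1}-norm: row-wise shrinkage, as in (6)."""
    row_norms = np.linalg.norm(M, axis=1, keepdims=True)
    # Rows with norm <= 1/mu are set to zero; the others are scaled down.
    scale = np.maximum(row_norms - 1.0 / mu, 0.0) / np.maximum(row_norms, 1e-12)
    return scale * M


def prox_trace(M, mu):
    """Solution of (3) for the trace-norm: singular value thresholding, as in (7)."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(soft_threshold(sigma, 1.0 / mu)) @ Vt


print(prox_l11(np.array([[3.0, -0.5], [0.2, 0.1]]), mu=1.0))
print(prox_l21(np.array([[3.0, 4.0], [0.3, 0.4]]), mu=1.0))
print(prox_trace(np.diag([3.0, 0.5]), mu=1.0))
```

In all three cases the threshold is $1/\mu$: entries, rows, or singular values whose magnitude falls below it are set to zero, which is how the corresponding sparsity or low-rank structure arises.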

On the other hand, many commonly-used metric functions satisfy the second condition. The following are two examples which we will consider in this paper:

•

MTL by directly optimizing F-Score F-Score is a typical performance metric for binary classification, particularly in learning tasks on imbalanced data. F-Score is a trade-off between Precision and Recall. Specifically, given ${\mathbf{y}^{(i)}}$ and ${\mathbf{y}_{j}}$ , we define the precision as:

Precision=\frac{\sum_{k=1}^{n_{i}}I({\mathbf{y}_{k}^{(i)}}=1\ \textit{and}\ (\mathbf{y}_{j})_{k}=1)}{\sum_{k=1}^{n_{i}}I({(\mathbf{y}_{j})_{k}}=1)},

and the recall as:

Recall=\frac{\sum_{k=1}^{n_{i}}I({\mathbf{y}_{k}^{(i)}}=1\ \textit{and}\ (\mathbf{y}_{j})_{k}=1)}{\sum_{k=1}^{n_{i}}I({\mathbf{y}_{k}^{(i)}}=1)},

where $I(\textit{condition})$ represents the indicator function, i.e., $I(\textit{condition})=1$ if the condition is true and $I(\textit{condition})=0$ otherwise. Then the F-score on ${\mathbf{y}^{(i)}}$ and ${\mathbf{y}_{j}}$ is defined as:

\displaystyle F_{\beta}=\frac{(1+\beta)\times Precision\times Recall}{\beta Precision+Recall},

(8)

where $\beta$ is a trade-off parameter. Hereafter, we simply set $\beta=1$ . Finally, the metric function $\Delta(.,.)$ with respect to the F-score is defined by:

\displaystyle\Delta({\mathbf{y}^{(i)}},{\mathbf{y}_{j}})=1-{F_{\beta}}.

(9)

With such a form of $\Delta({\mathbf{y}^{(i)}},{\mathbf{y}_{j}})$ , the sub-problem in (4) can be solved in polynomial time [20].

•

MTL by directly optimizing AUC: AUC is also a popular performance metric for binary classification, particularly in imbalanced learning. Given ${\mathbf{y}^{(i)}}$ and ${\mathbf{y}_{j}}$, the AUC metric can be calculated by:

\displaystyle AUC=1-\frac{Swapped}{Pos\times Neg}

(10)

where $Swapped$ represents the number of “inverted” pairs in $\mathbb{y}^{(i)}$ compared to $\mathbf{y}_{j}$ :

\begin{split}Swapped=&\sum_{l=1}^{n_{i}}\sum_{k=1}^{n_{i}}I(\mathbb{y}_{l}^{(i)}=1\ \textit{and}\ \mathbb{y}_{k}^{(i)}=-1)\\ &\times I((\mathbf{y}_{j})_{l}=-1\ \textit{and}\ (\mathbf{y}_{j})_{k}=1).\end{split}

$Pos$ / $Neg$ represents the number of positive/negative samples in the $i$ th task:

\begin{split}&Pos=\sum_{k=1}^{n_{i}}I(\mathbb{y}_{k}^{(i)}=1),\\ &Neg=\sum_{k=1}^{n_{i}}I(\mathbb{y}_{k}^{(i)}=-1).\\ \end{split}

The corresponding $\Delta(.,.)$ can be defined as:

\displaystyle\Delta({\mathbf{y}^{(i)}},{\mathbf{y}_{j}})=1-AUC.

(11)

With such a form of $\Delta({\mathbf{y}^{(i)}},{\mathbf{y}_{j}})$, there also exist polynomial-time algorithms to solve the sub-problem in (4) [20].

Note that here the Precision, Recall, F-Score and AUC are defined for a particular task.
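The two metric functions can be sketched in Python for a single task with labels in $\{-1,+1\}$. This is a minimal illustration under two assumptions made here: $\beta=1$, so that the F-score reduces to $2\,TP/(\text{predicted positives}+\text{actual positives})$, and the task contains at least one positive and one negative sample so that the denominators are non-zero. The function names are hypothetical.

```python
import numpy as np


def delta_f1(y_true, y_other):
    """1 - F-score with beta = 1, for labels in {-1, +1}."""
    tp = np.sum((y_true == 1) & (y_other == 1))
    denom = np.sum(y_other == 1) + np.sum(y_true == 1)
    # Assumption: the task has at least one positive sample, so denom > 0.
    return 1.0 - 2.0 * tp / denom


def delta_auc(y_true, y_other):
    """1 - AUC = Swapped / (Pos * Neg), following (10)."""
    pos = np.sum(y_true == 1)
    neg = np.sum(y_true == -1)
    # Entry (l, k) is True when l is a true positive and k a true negative.
    true_pairs = (y_true == 1)[:, None] & (y_true == -1)[None, :]
    # Entry (l, k) is True when the pair is inverted in y_other.
    inverted = (y_other == -1)[:, None] & (y_other == 1)[None, :]
    swapped = np.sum(true_pairs & inverted)
    return swapped / (pos * neg)


y_true = np.array([1, 1, -1, -1])
y_other = np.array([1, -1, 1, -1])
print(delta_f1(y_true, y_other), delta_auc(y_true, y_other))
```

Both functions return $0$ when the two label vectors coincide, which is the property that makes the structured hinge loss non-negative.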

5 Proposed Optimization Procedure

5.1 Overview

In this section, we present the proposed optimization procedure to solve the problem (2). Our procedure is based on the scheme of ADMM.

For ease of presentation, we define

{\mathcal{G}_{i}}({\mathbb{W}_{.i}})=\mathop{\max}\limits_{{\mathbf{y}_{j}}}[\Delta({{\bf{y}}^{(i)}},{\mathbf{y}_{j}})-\mathbb{W}_{.i}^{T}{{\bf{X}}^{(i)}}^{T}({{\bf{y}}^{(i)}}-{\mathbf{y}_{j}})],

and

\mathcal{G}(\mathbf{W})=\sum\limits_{i=1}^{m}{{\mathcal{G}_{i}}({\mathbb{W}_{.i}})}.

Then, the problem in (2) can be re-formulated to its equivalent form in the following:

\begin{split}&\min_{\mathbb{S},\mathbb{W}}\ \ \ \Omega(\mathbf{S})+\lambda\mathcal{G}(\mathbf{W})\\ &s.t\ \ \ \mathbf{W}-\mathbf{S}=0,\end{split}

(12)

where $\mathbb{S}\in\mathbb{R}^{d\times m}$ is an auxiliary variable.

The corresponding augmented Lagrangian function with respect to (12) is:

\begin{split}\begin{array}[]{l}\mathcal{A}(\mathbf{W},\mathbf{S},\mathbf{Z})\\ =\Omega(\mathbf{S})+\lambda\mathcal{G}(\mathbf{W})+\langle\mathbf{Z},\mathbf{W}-\mathbf{S}\rangle+\frac{\mu}{2}||{\mathbf{W}-\mathbf{S}}||_{F}^{2}\\ \end{array}\end{split}

(13)

where $\mathbf{Z}$ is the Lagrangian multiplier, $\langle\cdot,\cdot\rangle$ represents the inner product of two matrices (i.e., given matrices $\mathbb{A}$ and $\mathbb{B}$, we have $\langle\mathbb{A},\mathbb{B}\rangle=Tr(\mathbb{A}^{T}\mathbb{B})$, where $Tr(\mathbb{M})$ is the trace of the matrix $\mathbb{M}$), and $\mu>0$ is an adaptive penalty parameter.

Based on the ADMM scheme, the sketch of the proposed optimization procedure is shown in Algorithm 1, where in each iteration we alternately update $\mathbb{S}$ and $\mathbb{W}$ by minimizing the augmented Lagrangian function in (13) with the other variables fixed, and then update the multiplier $\mathbb{Z}$. The update rules for $\mathbb{S}$, $\mathbb{W}$ and $\mathbb{Z}$ are as follows:

\begin{array}[]{l}{\mathbf{S}^{\{t+1\}}}\leftarrow\mathop{\operatorname*{argmin}}\limits_{\mathbf{S}}\mathcal{A}({\mathbf{W}^{\{t\}}},\mathbf{S},{\mathbf{Z}^{\{t\}}});\\ {\mathbf{W}^{\{t+1\}}}\leftarrow\mathop{\operatorname*{argmin}}\limits_{\mathbf{W}}\mathcal{A}(\mathbf{W},{\mathbf{S}^{\{t+1\}}},{\mathbf{Z}^{\{t\}}});\\ {\mathbf{Z}^{\{t+1\}}}\leftarrow{\mathbf{Z}^{\{t\}}}+\mu({\mathbf{W}^{\{t+1\}}}-{\mathbf{S}^{\{t+1\}}}).\end{array}

Note that hereafter we use $\mathbb{M}^{\{t\}}$ to represent the value of variable $\mathbb{M}$ in the $t$-th iteration.

Next, we will present the details of solving the sub-problems with respect to $\mathbb{S}$ or $\mathbb{W}$ , respectively, with other variables being fixed.

Algorithm 1 The proposed ADMM procedure for the structured MTL problem (2)

Input: training set $\{({\mathbb{X}}^{(i)},{\mathbb{y}}^{(i)})\}_{i=1}^{m}$, desired tolerant error $\epsilon$, maximal iteration number $T$

Output: weight matrix $\mathbb{W}=[\mathbb{W}_{.1},\cdots,\mathbb{W}_{.m}]$

1. Initialize: $\mathbb{Z}=\mathbb{S}=\mathbb{W}\leftarrow\mathbb{0}^{d\times m}$, $t\leftarrow 0$

2. Repeat:

3. Update $\mathbb{S}^{\{t+1\}}\leftarrow\mathop{\operatorname*{argmin}}\limits_{\mathbf{S}}\Omega(\mathbf{S})+\frac{\mu}{2}||\mathbf{S}-\mathbf{W}^{\{t\}}-\mathbf{Z}^{\{t\}}/{\mu}||_{F}^{2}$ by solving (15), (17) or (18) accordingly.

4. For $i=1$ to $m$

5. Update $\mathbb{W}^{\{t+1\}}_{.i}\leftarrow\mathop{\operatorname*{argmin}}\limits_{\mathbb{W}_{.i}}\lambda\mathcal{G}_{i}(\mathbb{W}_{.i})+\frac{\mu}{2}||\mathbb{W}_{.i}-\mathbb{S}_{.i}^{\{t+1\}}+\frac{\mathbb{Z}_{.i}^{\{t\}}}{\mu}||_{2}^{2}$ by Algorithm 2.

6. End For

7. Update $\mathbf{Z}^{\{t+1\}}\leftarrow\mathbf{Z}^{\{t\}}+\mu(\mathbf{W}^{\{t+1\}}-\mathbf{S}^{\{t+1\}})$ and set $t\leftarrow t+1$

8. Until $||\mathbb{S}-\mathbb{W}||_{\infty}\leq\epsilon$ or $t=T$
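The outer loop of Algorithm 1 can be sketched in Python as below. This is a hedged skeleton, not the full method: the callables `prox_omega` and `solve_w` are hypothetical interfaces for the two sub-problems, and the demonstration swaps in a simple squared loss $\mathcal{G}_{i}(\mathbb{w})=\frac{1}{2}\|\mathbb{Y}_{.i}-\mathbb{w}\|_{2}^{2}$ (data matrix equal to the identity) in place of the structured hinge loss, purely because its $\mathbb{W}$ update has a one-line closed form that can be checked against a known answer.

```python
import numpy as np


def soft_threshold(M, delta):
    """Entrywise shrinkage operator, as in (5)."""
    return np.maximum(M - delta, 0.0) + np.minimum(M + delta, 0.0)


def admm(prox_omega, solve_w, d, m, mu=1.0, eps=1e-6, max_iter=500):
    """Skeleton of Algorithm 1.

    prox_omega(M, mu) returns argmin_S Omega(S) + mu/2 ||S - M||_F^2.
    solve_w(i, b, mu) returns argmin_w lam * G_i(w) + mu/2 ||w - b||_2^2.
    """
    W = np.zeros((d, m))
    S = np.zeros((d, m))
    Z = np.zeros((d, m))
    for _ in range(max_iter):
        S = prox_omega(W + Z / mu, mu)                                     # step 3
        B = S - Z / mu
        W = np.column_stack([solve_w(i, B[:, i], mu) for i in range(m)])  # steps 4-6
        Z = Z + mu * (W - S)                                               # step 7
        if np.max(np.abs(S - W)) <= eps:                                   # step 8
            break
    return W, S


# Demonstration with a stand-in squared loss (NOT the structured hinge loss).
lam = 2.0
Y = np.array([[1.0, -0.2], [0.2, -3.0]])  # column i holds the targets of task i


def solve_w_squared(i, b, mu):
    # Closed-form minimizer of lam/2 ||Y[:, i] - w||^2 + mu/2 ||w - b||^2.
    return (lam * Y[:, i] + mu * b) / (lam + mu)


W, S = admm(lambda M, mu: soft_threshold(M, 1.0 / mu), solve_w_squared, d=2, m=2)
print(W)
```

For this toy problem the minimizer of $\|\mathbb{W}\|_{1,1}+\frac{\lambda}{2}\|\mathbb{Y}-\mathbb{W}\|_{F}^{2}$ is the entrywise shrinkage of $\mathbb{Y}$ with threshold $1/\lambda$, so the ADMM iterates can be compared against it directly. Using the structured hinge loss would only change `solve_w`.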

5.2 Solving the Sub-Problem for $\mathbf{S}$

In the $t$ -th iteration (in the outer loop) of Algorithm 1, the sub-problem for $\mathbb{S}$ with respect to (13) can be simplified as:

\begin{split}\begin{array}[]{l}\mathbf{S}^{\{t+1\}}\leftarrow\mathop{\operatorname*{argmin}}\limits_{\mathbf{S}}\mathcal{A}(\mathbf{W}^{\{t\}},\mathbf{S},\mathbf{Z}^{\{t\}})\\ =\mathop{\operatorname*{argmin}}\limits_{\mathbf{S}}\Omega(\mathbf{S})+\frac{\mu}{2}\left\|{\mathbf{W}^{\{t\}}-\mathbf{S}+\mathbf{Z}^{\{t\}}/{\mu}}\right\|_{F}^{2}\\ \end{array}\end{split}

(14)
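The simplification from (13) to (14) is a completion of the square. As a short worked derivation, using only the symbols defined above:

```latex
\begin{aligned}
\mathcal{A}(\mathbf{W}^{\{t\}},\mathbf{S},\mathbf{Z}^{\{t\}})
&=\Omega(\mathbf{S})+\lambda\mathcal{G}(\mathbf{W}^{\{t\}})
 +\langle\mathbf{Z}^{\{t\}},\mathbf{W}^{\{t\}}-\mathbf{S}\rangle
 +\frac{\mu}{2}\|\mathbf{W}^{\{t\}}-\mathbf{S}\|_{F}^{2}\\
&=\Omega(\mathbf{S})
 +\frac{\mu}{2}\left\|\mathbf{W}^{\{t\}}-\mathbf{S}+\mathbf{Z}^{\{t\}}/\mu\right\|_{F}^{2}
 +\lambda\mathcal{G}(\mathbf{W}^{\{t\}})
 -\frac{1}{2\mu}\|\mathbf{Z}^{\{t\}}\|_{F}^{2}.
\end{aligned}
```

The last two terms do not depend on $\mathbf{S}$, so they can be dropped from the minimization, which leaves exactly the objective in (14). This objective has the form of (3) with $\mathbf{M}=\mathbf{W}^{\{t\}}+\mathbf{Z}^{\{t\}}/\mu$.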

The solution to (14) depends on the choice of the regularizer $\Omega(\mathbb{S})$:

•

Case 1: the $\ell_{1,1}$ -norm With $\Omega(\mathbf{S})$ being $||\mathbf{S}||_{1,1}$ , by applying (5) to (14), we have:

\begin{split}\begin{array}[]{l}{\operatorname*{argmin}}_{\mathbf{S}}||\mathbf{S}||_{1,1}+\frac{\mu}{2}\left\|{\mathbf{W}^{\{t\}}-\mathbf{S}+\mathbf{Z}^{\{t\}}/{\mu}}\right\|_{F}^{2}\\ =\max(0,\mathbf{W}^{\{t\}}+\mathbf{Z}^{\{t\}}/{\mu}-1/{\mu})\\ +\min(0,\mathbf{W}^{\{t\}}+\mathbf{Z}^{\{t\}}/{\mu}+1/{\mu}).\end{array}\end{split}

(15)

•

Case 2: the $\ell_{2,1}$-norm. When $\Omega(\mathbf{S})=||\mathbf{S}||_{2,1}$, (14) can be rewritten as:

\begin{split}\begin{array}[]{l}{\operatorname*{argmin}}_{\mathbf{S}}||\mathbf{S}||_{2,1}+\frac{\mu}{2}\left\|{\mathbf{W}^{\{t\}}-\mathbf{S}+\mathbf{Z}^{\{t\}}/{\mu}}\right\|_{F}^{2}.\\ \end{array}\end{split}

(16)

By applying (6) to (16), we obtain the following closed-form solution:

\mathbb{S}_{j.}=\left\{\begin{array}[]{ll}\frac{||\mathbb{M}_{j.}||_{2}-\frac{1}{\mu}}{||\mathbb{M}_{j.}||_{2}}\mathbb{M}_{j.}&\textrm{if $\frac{1}{\mu}<||\mathbb{M}_{j.}||_{2}$},\\ 0&\textrm{otherwise},\end{array}\right.

(17)

where $\mathbb{M}={\mathbf{W}^{\{t\}}+\mathbf{Z}^{\{t\}}/{\mu}}$ .

•

Case 3: the trace-norm. When $\Omega(\mathbf{S})=||\mathbf{S}||_{*}$, we can apply (7) to (14) and obtain the following closed-form solution:

\begin{split}\begin{array}[]{l}{\operatorname*{argmin}}_{\mathbf{S}}||\mathbf{S}||_{*}+\frac{\mu}{2}\left\|{\mathbf{W}^{\{t\}}-\mathbf{S}+\mathbf{Z}^{\{t\}}/{\mu}}\right\|_{F}^{2}\\ =\mathbb{U}(\max(0,\mathbf{\Sigma}-1/\mu)+\min(0,\mathbf{\Sigma}+1/\mu))\mathbb{V}^{T},\end{array}\end{split}

(18)

where $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}$ is the SVD form of ${\mathbf{W}^{\{t\}}+\mathbf{Z}^{\{t\}}/{\mu}}$ .

5.3 Solving the Sub-Problem for $\mathbf{W}$

5.3.1 Formulation

In the $t$ -th outer iteration in Algorithm 1, the sub-problem for $\mathbb{W}$ with respect to (13) can be reformulated as:

\begin{split}\begin{array}[]{l}\mathbf{W}^{\{t+1\}}\leftarrow\mathop{\operatorname*{argmin}}\limits_{\mathbf{W}}\mathcal{A}(\mathbf{W},\mathbf{S}^{\{t+1\}},\mathbf{Z}^{\{t\}})\\ =\mathop{\operatorname*{argmin}}\limits_{\mathbf{W}}\lambda\mathcal{G}(\mathbf{W})+\frac{\mu}{2}\left\|{\mathbf{W}-\mathbf{S}^{\{t+1\}}+\mathbf{Z}^{\{t\}}/{\mu}}\right\|_{F}^{2}\\ =\mathop{\operatorname*{argmin}}\limits_{\mathbf{W}}\sum_{i=1}^{m}\left[\lambda\mathcal{G}_{i}(\mathbf{W}_{.i})+\frac{\mu}{2}\left\|{\mathbf{W}_{.i}-\mathbf{S}_{.i}^{\{t+1\}}+\mathbf{Z}_{.i}^{\{t\}}/{\mu}}\right\|_{F}^{2}\right]\\ \end{array}\end{split}

(19)

To simplify presentation, we denote $\mathbb{b}_{i}=\mathbf{S}_{.i}^{\{t+1\}}-\mathbf{Z}_{.i}^{\{t\}}/{\mu}$ . Then, the problem in (19) can be separated into $m$ independent sub-tasks:

\begin{split}\begin{array}[]{l}\mathop{\min}\limits_{\mathbf{W}_{.i}}\lambda\mathcal{G}_{i}(\mathbf{W}_{.i})+\frac{\mu}{2}\left\|{\mathbf{W}_{.i}-\mathbf{b}_{i}}\right\|_{F}^{2},i=1,...,m.\\ \end{array}\end{split}

(20)

For $j=1,2,...,p$ , we define $\mathbf{K}=[\mathbb{K}_{.1},\mathbb{K}_{.2},...,\mathbb{K}_{.p}]$ with $\mathbb{K}_{.j}={\mathbf{X}^{(i)}}^{T}(\mathbf{y}^{(i)}-\mathbb{y}_{j})+\frac{\mu}{\lambda}\mathbb{b}_{i}$ , $\mathbf{\Delta}=(\mathbf{\Delta}_{1},\mathbf{\Delta}_{2},...,\mathbf{\Delta}_{p})^{T}$ with $\mathbf{\Delta}_{j}=\Delta({\mathbf{y}^{(i)}},{{\mathbf{y}}_{j}})$ . Then, the problem in (20) can be simplified as:

\begin{split}\begin{array}[]{l}\mathop{\min}\limits_{\mathbf{W}_{.i}}\lambda\mathcal{G}_{i}(\mathbf{W}_{.i})+\frac{\mu}{2}\left\|{\mathbf{W}_{.i}-\mathbf{b}_{i}}\right\|_{F}^{2}\\ =\mathop{\min}\limits_{\mathbf{W}_{.i}}\frac{\mu}{2}(||\mathbf{W}_{.i}||_{F}^{2}+||\mathbf{b}_{i}||_{F}^{2}-2\mathbf{W}_{.i}^{T}\mathbf{b}_{i})\\ +\lambda\mathop{\max}\limits_{\mathbb{y}_{j}\in\mathbb{E}_{i}}[\Delta({{\bf{y}}^{(i)}},{\mathbb{y}_{j}})-\mathbf{W}_{.i}^{T}{{\bf{X}}^{(i)}}^{T}({{\bf{y}}^{(i)}}-{\mathbb{y}_{j}})].\\ \end{array}\end{split}

(21)

By dividing the objective in (21) by $\mu$ and dropping the terms independent of $\mathbf{W}_{.i}$ and $\mathbb{y}_{j}$, we have:

\begin{split}\begin{array}[]{l}\mathop{\min}\limits_{\mathbf{W}_{.i}}\frac{1}{2}||{\mathbf{W}_{.i}}||_{2}^{2}+\frac{\lambda}{\mu}\mathop{\max}\limits_{j}[{\mathbf{\Delta}_{j}}-{({\mathbf{W}_{.i}^{T}}\mathbf{K})_{j}}]\\ \end{array}\end{split}

(22)

The max operator over an exponential number of elements makes it difficult to optimize the objective in (22). To tackle this issue, in the next two subsections, we derive the Fenchel dual [36] form of (22) and develop a coordinate ascent algorithm to solve this dual form.
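The step from (20) to (22) can be checked numerically on a tiny instance by enumerating $\mathbb{E}_{i}$ explicitly. The sketch below is illustrative only: the random data, the values of $\lambda$ and $\mu$, and the use of the mismatch fraction as a stand-in for $\Delta(.,.)$ are all assumptions made here. It verifies that the objective in (20) divided by $\mu$ differs from the objective in (22) by the constant $\frac{1}{2}\|\mathbf{b}_{i}\|_{2}^{2}$, so both have the same minimizer.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n, d, lam, mu = 3, 2, 0.7, 1.3
X = rng.normal(size=(n, d))
y = np.array([1, -1, 1])
b = rng.normal(size=d)

# Enumerate E_i = {-1, +1}^n, so p = 2^n.
labelings = [np.array(l) for l in itertools.product([-1, 1], repeat=n)]
# Stand-in for Delta_j: fraction of mismatched labels (an assumption).
delta = np.array([np.mean(y != y_j) for y_j in labelings])
# Columns K_{.j} = X^T (y - y_j) + (mu / lam) * b, as defined before (21).
K = np.column_stack([X.T @ (y - y_j) + (mu / lam) * b for y_j in labelings])


def objective_20(w):
    hinge = max(delta[j] - w @ X.T @ (y - labelings[j]) for j in range(len(labelings)))
    return lam * hinge + 0.5 * mu * np.sum((w - b) ** 2)


def objective_22(w):
    return 0.5 * w @ w + (lam / mu) * np.max(delta - w @ K)


gaps = [objective_20(w) / mu - objective_22(w) for w in rng.normal(size=(5, d))]
print(gaps, 0.5 * b @ b)
```

The printed gaps are identical for every sampled $\mathbf{w}$, which confirms that the term $\frac{\mu}{\lambda}\mathbf{b}_{i}$ folded into $\mathbf{K}$ accounts for the cross term $-\mathbf{W}_{.i}^{T}\mathbf{b}_{i}$ in (21).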

5.3.2 Fenchel Dual Form of (22)

In this subsection, we derive the Fenchel dual [36] form of (22). To simplify presentation, we use $\mathbf{w}$ to represent $\mathbf{W}_{.i}$ . Then we re-formulate the primal form in (22) as:

\begin{split}\mathop{\min}\limits_{\mathbf{w}}\mathcal{P}(\mathbb{w})=\mathcal{M}(\mathbb{w})+\mathcal{N}(-\mathbf{w}^{T}\mathbf{K})\\ =\frac{1}{2}||{\mathbf{w}}||_{2}^{2}+\frac{\lambda}{\mu}\max_{j}(\mathbf{\Delta}^{T}-\mathbf{w}^{T}\mathbf{K})_{j},\\ \end{split}

(23)

where we define $\mathcal{M}(\mathbb{w})=\frac{1}{2}||{\mathbf{w}}||_{2}^{2}$ and $\mathcal{N}(-\mathbf{w}^{T}\mathbf{K})=\frac{\lambda}{\mu}\max_{j}(\mathbf{\Delta}^{T}-\mathbf{w}^{T}\mathbf{K})_{j}$ .

Before deriving the dual form of (23), we first introduce the definition (Definition 1) and the main properties (Theorem 1 and 2) of Fenchel duality.

Definition 1.

The Fenchel conjugate of function $f(\boldsymbol{x})$ is defined as $f^{*}(\boldsymbol{\theta})=\max_{\boldsymbol{x}\in dom(f)}(\langle\boldsymbol{\theta},\boldsymbol{x}\rangle-f(\boldsymbol{x}))$ .

Theorem 1.

(Fenchel-Young inequality: [5], Proposition 3.3.4) Any points $\boldsymbol{\theta}$ in the domain of function $f^{*}$ and $\boldsymbol{x}$ in the domain of function $f$ satisfy the inequality:

f(\boldsymbol{x})+f^{*}(\boldsymbol{\theta})\geq\langle\boldsymbol{\theta},\boldsymbol{x}\rangle

(24)

The equality holds if and only if $\boldsymbol{\theta}\in\partial f(\boldsymbol{x})$ .

Theorem 2.

(Fenchel duality inequality, see, e.g., Theorem 3.3.5 in [5]) Let $\mathcal{M}:\mathbb{R}^{d}\rightarrow(-\infty,+\infty]$ and $\mathcal{N}:\mathbb{R}^{p}\rightarrow(-\infty,+\infty]$ be two closed and convex functions, and let $\mathbf{K}$ be a matrix in $\mathbb{R}^{d\times p}$. Then we have

\sup_{\boldsymbol{\alpha}}-\mathcal{M}^{*}(\mathbf{K}\boldsymbol{\alpha})-\mathcal{N}^{*}(\boldsymbol{\alpha})\leq\inf_{\boldsymbol{w}}\mathcal{M}(\boldsymbol{w})+\mathcal{N}(-\boldsymbol{w}^{T}\mathbf{K}),

(25)

where $\boldsymbol{\alpha}\in\mathbb{R}^{p}$ and $\boldsymbol{w}\in\mathbb{R}^{d}$ . The equality holds if and only if $0\in(dom(\mathcal{N})-\mathbf{K}^{T}dom(\mathcal{M}))$ .

Note that the right hand side of the inequality in (25) is called the primal form and the left hand side of (25) is the corresponding dual form.

With Definition 1, it is known (see, e.g., [38], Appendix B) that the Fenchel conjugate of the squared $\ell_{2}$-norm $f(\mathbf{x})=\frac{1}{2}||\mathbf{x}||_{2}^{2}$ is again the squared $\ell_{2}$-norm $f^{*}(\boldsymbol{\theta})=\frac{1}{2}||\boldsymbol{\theta}||_{2}^{2}$. Hence, the Fenchel conjugate of $\mathcal{M}(\mathbf{w})=\frac{1}{2}||\mathbf{w}||_{2}^{2}$, evaluated at the argument $\mathbf{K}\boldsymbol{\alpha}$ that appears in (25), is

\begin{split}\mathcal{M}^{*}(\mathbf{K}\boldsymbol{\alpha})=\frac{1}{2}||\mathbf{K}\boldsymbol{\alpha}||_{2}^{2}\end{split}

(26)

It is known ([38], Appendix B) that the Fenchel conjugate of $f(\mathbf{x}+\mathbf{y})$ is $f^{*}(\boldsymbol{\theta})-\langle\boldsymbol{\theta},\mathbf{y}\rangle$, and that the Fenchel conjugate of $cf(\mathbf{x})$ ($c>0$) is $cf^{*}(\boldsymbol{\theta}/c)$. Then we can derive that the Fenchel conjugate of $cf(\mathbf{x}+\mathbf{y})$ is

cf^{*}(\boldsymbol{\theta}/c)-\langle\boldsymbol{\theta},\mathbf{y}\rangle.

(27)

In addition, the Fenchel conjugate of $f(\mathbf{x})=\max_{j}(\mathbf{x}_{j})$ is $I_{\theta_{k}\geq 0,\sum_{k}\theta_{k}=1}(\theta)$, where $I_{condition}(.)$ is the indicator function with $I_{condition}(\theta)=0$ if $condition$ is true and $I_{condition}(\theta)=+\infty$ otherwise (see [38], Appendix B). For convenience, we denote $\mathcal{Q}(\mathbf{x})=\max_{j}(\mathbf{x}_{j})$. It is easy to verify that $\mathcal{N}(-\mathbf{w}^{T}\mathbf{K})=\frac{\lambda}{\mu}\max_{j}(\mathbf{\Delta}^{T}-\mathbf{w}^{T}\mathbf{K})_{j}=\frac{\lambda}{\mu}\mathcal{Q}(\mathbf{\Delta}^{T}-\mathbf{w}^{T}\mathbf{K})$, i.e., $\mathcal{N}$ has the form $c\mathcal{Q}(\mathbf{x}+\mathbf{y})$ with $c=\frac{\lambda}{\mu}$ and shift $\mathbf{y}=\mathbf{\Delta}$. Hence, by using (27), the Fenchel conjugate of $\mathcal{N}$ is:

\begin{split}&\mathcal{N}^{*}(\mathbf{\alpha})=\frac{\lambda}{\mu}\mathcal{Q}^{*}((\mathbb{\alpha})/(\frac{\lambda}{\mu}))-\langle\mathbf{\alpha},\mathbb{\Delta}\rangle\\ &=\left\{\begin{array}[]{l}-{\mathbf{\Delta}^{T}}\mathbf{\alpha},\ \sum\limits_{k=1}^{p}{\mathbf{\alpha}_{k}=\frac{\lambda}{\mu}\ and\ \mathbf{\alpha}_{k}\geq 0,\ k=1,\cdots,p};\\ +\infty,\ \ \ \ otherwise.\end{array}\right.\end{split}

(28)

With (26), (28) and (25), we have that the dual form of (23) is:

\begin{split}&\mathop{\max}\limits_{\mathbf{\alpha}}\mathcal{D}(\mathbf{\alpha})\\ &=\mathop{\max}\limits_{\mathbf{\alpha}}-\mathcal{M}^{*}(\mathbf{K}\alpha)-\mathcal{N}^{*}(\alpha)\\ &=\mathop{\max}\limits_{\mathbf{\alpha}}-\frac{1}{2}{\mathbf{\alpha}^{T}}{\mathbf{K}^{T}}\mathbf{K}\mathbf{\alpha}+\mathbf{\Delta}^{T}\mathbf{\alpha}\\ &s.t.\ \sum\limits_{k=1}^{p}\mathbf{\alpha}_{k}=\frac{\lambda}{\mu}\ and\ \mathbf{\alpha}_{k}\geq 0,\ k=1,\cdots,p\\ \end{split}

(29)

The dual form in (29) is a smooth quadratic function with linear constraints, which is easier to optimize than its primal form in (23).
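The relation between (23) and (29) can be illustrated on a toy problem in which $\mathbf{K}$ and $\mathbf{\Delta}$ are small enough to store explicitly. The following is a minimal sketch, not the implementation used in this paper: the matrix `cols`, the vector `delta` and the constant `c` (standing for $\lambda/\mu$) are made-up values, and the function names are hypothetical. For any feasible $\alpha$ and any $\mathbf{w}$ the dual value should not exceed the primal value, since $\frac{\lambda}{\mu}\max_{j}(\mathbf{\Delta}_{j}-\mathbf{w}^{T}\mathbf{K}_{.j})\geq\sum_{j}\alpha_{j}(\mathbf{\Delta}_{j}-\mathbf{w}^{T}\mathbf{K}_{.j})$ whenever $\alpha$ satisfies the constraints in (29).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal(w, cols, delta, c):
    """P(w) in (23); cols[j] is the column K_{.j}, c stands for lambda/mu."""
    return 0.5 * dot(w, w) + c * max(dj - dot(w, kj) for dj, kj in zip(delta, cols))

def dual(alpha, cols, delta, c, tol=1e-12):
    """D(alpha) in (29); returns -inf when alpha violates the constraints."""
    if min(alpha) < -tol or abs(sum(alpha) - c) > tol:
        return float("-inf")
    k_alpha = [sum(a * kj[f] for a, kj in zip(alpha, cols)) for f in range(len(cols[0]))]
    return -0.5 * dot(k_alpha, k_alpha) + dot(delta, alpha)

cols = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # toy K with d = 2 and p = 3
delta = [1.0, 1.0, 0.2]
c = 0.5
alpha = [0.2, 0.2, 0.1]   # non-negative and sums to c, hence feasible
w = [0.1, -0.3]           # an arbitrary primal point
print(dual(alpha, cols, delta, c), primal(w, cols, delta, c))
```

In the actual problem $p$ is exponential in $n_{i}$, so neither function could be evaluated this way; the sketch only serves to make the inequality in (25) concrete.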

5.3.3 Primal-Dual Algorithm via Coordinate Ascent

In this subsection, we develop a coordinate ascent algorithm to optimize the objective in (29), where we use the primal-dual gap $\mathcal{P}(\mathbb{w})-\mathcal{D}(\mathbb{\alpha})$ as the early stopping criterion. Coordinate ascent is a widely-used method in various machine learning problems (e.g., [12, 38, 23, 45]).

Algorithm 2 Primal-dual algorithm via coordinate ascent

Input: $\mathbf{b}_{i}$, $\epsilon_{F}$, $\lambda$, $\mu$, maximal iteration number $T_{F}$

Output: $\mathbf{w}$

1. Initialize: $v\leftarrow 0$, $\hat{\mathbf{w}}\leftarrow 0$

2. Repeat:

3. Find the largest element $(g_{\alpha})_{j}$ in the gradient vector $g_{\alpha}=\nabla\mathcal{D}(\alpha)$ by solving (30) via Algorithm 3 for F-score (or Algorithm 4 for AUC).

4. $\mathbf{\Delta}_{j}\leftarrow\Delta({\mathbf{y}^{(i)}},{{\mathbf{y}}_{j}})$

5. $\mathbf{K}_{.j}\leftarrow{\mathbf{X}^{(i)}}^{T}(\mathbf{y}^{(i)}-\mathbf{y}_{j})+\frac{\mu}{\lambda}\mathbf{b}_{i}$

6. Calculate $\gamma$ by (37).

7. Update $\hat{\mathbf{w}}$ by (35).

8. Update $v$ by (36).

9. Until the primal-dual gap $\mathcal{P}(\hat{\mathbf{w}})-\mathcal{D}(\alpha)$ in (39) is at most $\epsilon_{F}$, or the iteration number reaches $T_{F}$

10. $\mathbf{w}\leftarrow\hat{\mathbf{w}}$

The proposed coordinate ascent algorithm is shown in Algorithm 2. Next, we sketch the main steps of the proposed algorithm:

Repeat

• Select the index $j$ of the largest element $(\nabla_{\alpha}\mathcal{D}(\alpha))_{j}$ of the gradient vector $\nabla_{\alpha}\mathcal{D}(\alpha)$.

• Update $\alpha_{j}$ so as to greedily increase the value of $\mathcal{D}(\alpha)$, rescaling the other $\alpha_{k}$ ($k\neq j$) so that the constraints in (29) remain satisfied.

Until the early stopping criterion $\mathcal{P}(\mathbf{w})-\mathcal{D}(\alpha)\leq\epsilon_{F}$ is satisfied.

In each iteration, the proposed algorithm has three main building blocks:

The First Step is to select the index $j$ of the largest element in the gradient vector of the dual objective $\mathcal{D}({\alpha})$. Specifically, the gradient vector of $\mathcal{D}({\alpha})$ with respect to $\alpha$ is:

g_{\alpha}=\nabla_{\alpha}\mathcal{D}({\alpha})=-\mathbf{K}^{T}\mathbf{K}\alpha+\mathbf{\Delta},

and the largest element in $\nabla_{\alpha}\mathcal{D}({\alpha})$ is:

\begin{split}\max_{j}(g_{\alpha})_{j}=\max_{j}\left[\mathbf{\Delta}_{j}-(\mathbf{K}\alpha)^{T}\mathbf{K}_{.j}\right].\end{split}

We denote $\hat{\mathbf{w}}=\mathbf{K}\alpha$. With the definitions of $\mathbf{\Delta}_{j}$ and $\mathbf{K}_{.j}$, we have $\hat{\mathbf{w}}^{T}\mathbf{K}_{.j}=\hat{\mathbf{w}}^{T}{\mathbf{X}^{(i)}}^{T}(\mathbf{y}^{(i)}-\mathbf{y}_{j})+\frac{\mu}{\lambda}\hat{\mathbf{w}}^{T}\mathbf{b}_{i}$, where the last term does not depend on $j$. Hence, the maximizing index is obtained by solving:

\begin{split}\mathop{\arg\max}\limits_{j}\ \Delta({\mathbf{y}^{(i)}},{\mathbf{y}_{j}})-\hat{\mathbf{w}}^{T}{\mathbf{X}^{(i)}}^{T}({\mathbf{y}^{(i)}}-{\mathbf{y}_{j}}).\\ \end{split}

(30)

Interestingly, the problem in (30) is essentially the same as the problem of “finding the most violated constraint” in Structured-SVMs (e.g., the problem (7) in [20]). For several commonly-used evaluation metrics $\Delta(.,.)$, efficient polynomial-time algorithms have been proposed to solve the problem of “finding the most violated constraint”. One can directly use these inference algorithms to solve (30), i.e., to select the largest element of the gradient vector $\nabla_{\alpha}\mathcal{D}({\alpha})$. For example, when $\Delta(.,.)$ corresponds to F-score, one can use Algorithm 2 in [20] to solve (30); when $\Delta(.,.)$ corresponds to AUC, one can use Algorithm 3 in [20] to solve (30). For self-containedness, we show these two algorithms with our notations in Algorithms 3 and 4. Note that Algorithms 3 and 4 have time complexity $O(n^{2}_{i})$ and $O(n_{i}\log n_{i})$, respectively.

Algorithm 3 Algorithm to solve (30) with loss function defined on F-score

Input: $n=n_{i}$, ${\mathbf{X}^{(i)}}=(\mathbf{x}^{(i)}_{1},\ldots,\mathbf{x}^{(i)}_{n})^{T}$, $\mathbf{y}^{(i)}=(\mathbf{y}^{(i)}_{1},\ldots,\mathbf{y}^{(i)}_{n})^{T}$, $\mathbf{w}$

Output: $\mathbf{y}_{j}$

1. Initialize: $(k^{p}_{1},\ldots,k^{p}_{Pos})\leftarrow sort\{k:\mathbf{y}^{(i)}_{k}=1\}$ by $\mathbf{w}^{T}\mathbf{x}^{(i)}_{k}$; $(k^{n}_{1},\ldots,k^{n}_{Neg})\leftarrow sort\{k:\mathbf{y}^{(i)}_{k}=-1\}$ by $\mathbf{w}^{T}\mathbf{x}^{(i)}_{k}$

2. For $a\in[0,\ldots,Pos]$ do:

3. $c\leftarrow Pos-a$

4. Set $l_{k_{1}^{p}},\ldots,l_{k_{a}^{p}}$ to $1$ and set $l_{k_{a+1}^{p}},\ldots,l_{k_{Pos}^{p}}$ to $-1$

5. For $d\in[0,\ldots,Neg]$ do:

6. $b\leftarrow Neg-d$

7. Set $l_{k_{1}^{n}},\ldots,l_{k_{b}^{n}}$ to $1$ and set $l_{k_{b+1}^{n}},\ldots,l_{k_{Neg}^{n}}$ to $-1$

8. $v\leftarrow\Delta({\mathbf{y}^{(i)}},(l_{1},\ldots,l_{n})^{T})+{\mathbf{w}^{T}}\sum\limits_{k=1}^{n}{l_{k}{\mathbf{x}^{(i)}_{k}}}$, where $\Delta(\cdot,\cdot)$ is defined by (11)

9. If $v$ is the largest so far, then:

10. $\mathbf{y}_{j}\leftarrow(l_{1},\ldots,l_{n})^{T}$

11. End if

12. End for

13. End for
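The search in Algorithm 3 can be sketched in a few lines of Python. This is an illustrative sketch under explicit assumptions, not the implementation used in the experiments: since (11) is not restated here, the loss is taken to be $1-F_{1}$ (with $F_{1}$ set to zero when it is undefined); examples are sorted in descending order of their scores, so that the first $a$ positives and first $b$ negatives are the highest-scoring ones; and prefix sums replace the explicit label assignments of steps 4 and 7. The names `most_violated_f1`, `f1_loss`, `objective` and `brute_force` are hypothetical. The loss depends only on the counts $(a,b)$, so for each pair of counts the best labeling gives label $+1$ to the top-scoring examples of each group.

```python
from itertools import product

def f1_loss(tp, fp, fn):
    """Stand-in for Delta in (11): 1 - F1 (an assumption; F1 := 0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 1.0 - (2.0 * tp / denom if denom else 0.0)

def objective(l, scores, y):
    """Delta(y, l) + sum_k l_k * scores[k], the quantity v in step 8."""
    tp = sum(1 for t, q in zip(y, l) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y, l) if t == -1 and q == 1)
    fn = sum(1 for t, q in zip(y, l) if t == 1 and q == -1)
    return f1_loss(tp, fp, fn) + sum(q * s for q, s in zip(l, scores))

def most_violated_f1(scores, y):
    """scores[k] plays the role of w^T x_k; y[k] is the true label (+1 or -1)."""
    n = len(y)
    pos = sorted((k for k in range(n) if y[k] == 1), key=lambda k: -scores[k])
    neg = sorted((k for k in range(n) if y[k] == -1), key=lambda k: -scores[k])
    pre_pos, pre_neg = [0.0], [0.0]   # prefix sums of the sorted scores
    for k in pos:
        pre_pos.append(pre_pos[-1] + scores[k])
    for k in neg:
        pre_neg.append(pre_neg[-1] + scores[k])
    total = sum(scores)
    best_val, best_ab = float("-inf"), (0, 0)
    for a in range(len(pos) + 1):        # a positives labelled +1
        c = len(pos) - a                 # remaining positives labelled -1
        for b in range(len(neg) + 1):    # b negatives labelled +1
            # sum_k l_k s_k = 2 * (scores labelled +1) - (all scores)
            val = f1_loss(a, b, c) + 2.0 * (pre_pos[a] + pre_neg[b]) - total
            if val > best_val:
                best_val, best_ab = val, (a, b)
    labels = [-1] * n
    for k in pos[:best_ab[0]] + neg[:best_ab[1]]:
        labels[k] = 1
    return labels, best_val

def brute_force(scores, y):
    """Exhaustive reference over all 2^n labelings (tiny n only)."""
    return max(objective(l, scores, y) for l in product([-1, 1], repeat=len(y)))

scores = [0.3, -0.2, 0.9, -0.7, 0.1, 0.4]
y = [1, -1, 1, 1, -1, -1]
labels, val = most_violated_f1(scores, y)
print(labels, val, brute_force(scores, y))
```

The double loop visits $(Pos+1)(Neg+1)$ count pairs, which is consistent with the $O(n_{i}^{2})$ complexity stated above.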

Algorithm 4 Algorithm to solve (30) with loss function defined on AUC

Input: $n=n_{i}$, ${\mathbf{X}^{(i)}}=(\mathbf{x}^{(i)}_{1},\ldots,\mathbf{x}^{(i)}_{n})^{T}$, $\mathbf{y}^{(i)}=(\mathbf{y}^{(i)}_{1},\ldots,\mathbf{y}^{(i)}_{n})^{T}$, $\mathbf{w}$

Output: $\mathbf{y}_{j}$

1. Initialize: for $k\in\{k:{\mathbf{y}}_{k}^{(i)}=1\}$, ${q_{k}}\leftarrow-0.25+{\mathbf{w}^{T}}{\mathbf{x}^{(i)}_{k}}$; for $k\in\{k:{\mathbf{y}}_{k}^{(i)}=-1\}$, ${q_{k}}\leftarrow 0.25+{\mathbf{w}^{T}}{\mathbf{x}^{(i)}_{k}}$

2. $({r_{1}},\ldots,{r_{n}})\leftarrow$ sort $\{1,\ldots,n\}$ by ${q_{k}}$

3. ${q_{Pos}}\leftarrow Pos$, $q_{Neg}\leftarrow 0$

4. For $k\in[1,\ldots,n]$ do:

5. If $\mathbf{y}^{(i)}_{r_{k}}>0$, then:

6. ${l_{{r_{k}}}}\leftarrow(Neg-2{q_{Neg}})$

7. ${q_{Pos}}\leftarrow{q_{Pos}}-1$

8. else

9. ${l_{{r_{k}}}}\leftarrow(-Pos+2{q_{Pos}})$

10. ${q_{Neg}}\leftarrow{q_{Neg}}+1$

11. End if

12. End for

13. Convert $(l_{1},\ldots,l_{n})$ to $\mathbf{y}_{j}$ according to some threshold value.
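The sweep in steps 1 to 12 of Algorithm 4 can be sketched as follows. The sketch rests on one reading that should be treated as an assumption: the shifts of $\mp 0.25$ mean that a (positive, negative) pair is counted as correctly ordered exactly when the positive example's score exceeds the negative example's score by more than $0.5$, and $l_{k}$ is the net count of such pairs for example $k$. Sorting is done in descending order of the shifted scores. The conversion to $\mathbf{y}_{j}$ in step 13 is left out because its threshold is not specified; the function names are hypothetical, and the quadratic-time reference is included only to check the sweep.

```python
def auc_coefficients(scores, y):
    """Sweep of steps 1-12; scores[k] plays the role of w^T x_k."""
    n = len(y)
    n_pos = sum(1 for t in y if t == 1)
    n_neg = n - n_pos
    # Step 1: shift positives down and negatives up by 0.25.
    q = [s - 0.25 if t == 1 else s + 0.25 for s, t in zip(scores, y)]
    # Step 2: visit examples in descending order of the shifted score.
    order = sorted(range(n), key=lambda k: -q[k])
    pos_left, neg_seen = n_pos, 0      # q_Pos and q_Neg in Algorithm 4
    l = [0] * n
    for k in order:
        if y[k] > 0:
            l[k] = n_neg - 2 * neg_seen
            pos_left -= 1
        else:
            l[k] = -n_pos + 2 * pos_left
            neg_seen += 1
    return l

def auc_coefficients_pairwise(scores, y):
    """O(n^2) reference: decide every (positive, negative) pair separately."""
    n = len(y)
    l = [0] * n
    for i in range(n):
        if y[i] != 1:
            continue
        for j in range(n):
            if y[j] != -1:
                continue
            pair = 1 if scores[i] - scores[j] > 0.5 else -1
            l[i] += pair
            l[j] -= pair
    return l

scores = [1.3, 0.2, -0.4, 0.7, 0.05, -1.1]
y = [1, -1, 1, -1, 1, -1]
l = auc_coefficients(scores, y)
print(l, auc_coefficients_pairwise(scores, y))
```

Apart from the sort, the sweep touches each example once, which matches the $O(n_{i}\log n_{i})$ complexity stated above. Exact ties between shifted scores are not handled specially in this sketch.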

The Second Step is to update $\alpha_{j}$ given the selected index $j$, with the other variables $\alpha_{k}$ ($k\neq j$) only rescaled by a common factor.

We define the update rules for $\alpha$ as:

\alpha\leftarrow(1-\gamma)\alpha+\frac{\gamma\lambda}{\mu}e_{j},

(31)

where $0\leq\gamma\leq 1$ and $e_{j}$ denotes the $p$-dimensional vector with the $j$-th element being one and the other elements being zero. It is worth noting that, given $\alpha_{k}\geq 0$ for all $k$ and $\sum_{k}\alpha_{k}=\lambda/\mu$ before updating, and $0\leq\gamma\leq 1$, the rule in (31) guarantees that $\alpha_{k}\geq 0$ and $\sum_{k}\alpha_{k}=\lambda/\mu$ still hold after updating, since the new $\alpha$ is a convex combination of the old $\alpha$ and $\frac{\lambda}{\mu}e_{j}$.

By substituting (31) into (29), we obtain the corresponding optimization problem with respect to $\gamma$ :

\begin{split}\begin{array}[]{l}\mathop{\max}\limits_{\gamma}-\frac{1}{2}{[(1-\gamma)\alpha+\frac{\gamma\lambda}{\mu}{e_{j}}]^{T}}{\mathbf{K}^{T}}\mathbf{K}[(1-\gamma)\alpha+\frac{\gamma\lambda}{\mu}{e_{j}}]\\ \ \ \ \ \ \ +{[(1-\gamma)\alpha+\frac{\gamma\lambda}{\mu}{e_{j}}]^{T}}\mathbf{\Delta}\end{array}\end{split}

(32)

Intuitively, our goal is to find $\gamma\in[0,1]$ to increase the dual objective $\mathcal{D}(\alpha)$ as much as possible. By setting the gradient of (32) with respect to $\gamma$ to zero, we have

\begin{array}[]{l}||\mathbf{K}({e_{j}}\lambda/\mu-\alpha)||_{2}^{2}\gamma+{({e_{j}}\lambda/\mu-\alpha)^{T}}{\mathbf{K}^{T}}\mathbf{K}\alpha\\ -{({e_{j}}\lambda/\mu-\alpha)^{T}}\mathbf{\Delta}=0\end{array}

By simple algebra, we have

\displaystyle\begin{array}[]{l}\gamma=-\frac{{{{({e_{j}}\lambda/\mu-\alpha)}^{T}}({\mathbf{K}^{T}}\mathbf{K}\alpha-\mathbf{\Delta})}}{{||\mathbf{K}({e_{j}}\lambda/\mu-\alpha)||_{2}^{2}}}\\ \end{array}

(33)

To ensure that $0\leq\gamma\leq 1$ , we make further restriction on $\gamma$ :

\displaystyle\gamma=\max(\min(-\frac{{{{({e_{j}}\lambda/\mu-\alpha)}^{T}}({\mathbf{K}^{T}}\mathbf{K}\alpha-\mathbf{\Delta})}}{{||\mathbf{K}({e_{j}}\lambda/\mu-\alpha)||_{2}^{2}}},1),0)

(34)
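The update (31) with the step size (34) can be illustrated on a toy problem with an explicit $\mathbf{K}$. The sketch below is a minimal illustration with made-up numbers and hypothetical names (`step`, `dual`, `k_times`), where `c` stands for $\lambda/\mu$. It also handles the case of a zero denominator in (33), which the text does not discuss; treating it by moving fully to the vertex when the slope is positive is a choice of this sketch.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k_times(cols, vec):
    """K @ vec, with K stored as a list of p columns of length d."""
    return [sum(x * col[f] for x, col in zip(vec, cols)) for f in range(len(cols[0]))]

def dual(alpha, cols, delta):
    """Objective of (29), assuming alpha is feasible."""
    k_alpha = k_times(cols, alpha)
    return -0.5 * dot(k_alpha, k_alpha) + dot(delta, alpha)

def step(alpha, j, cols, delta, c):
    """Apply (31) with gamma from (34); c stands for lambda/mu."""
    direction = [(c if k == j else 0.0) - a for k, a in enumerate(alpha)]
    k_dir = k_times(cols, direction)
    # Numerator of (33) with the leading minus sign applied.
    slope = dot(direction, delta) - dot(k_dir, k_times(cols, alpha))
    curvature = dot(k_dir, k_dir)
    if curvature > 0.0:
        gamma = max(min(slope / curvature, 1.0), 0.0)
    else:  # objective is linear along the direction (not covered by (33))
        gamma = 1.0 if slope > 0.0 else 0.0
    # alpha + gamma * direction equals (1 - gamma) * alpha + gamma * c * e_j.
    return [a + gamma * dk for a, dk in zip(alpha, direction)], gamma

cols = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # toy K with d = 2 and p = 3
delta = [1.0, 1.0, 0.2]
c = 1.0
alpha = [1.0, 0.0, 0.0]                        # feasible: sums to c
new_alpha, gamma = step(alpha, 2, cols, delta, c)
print(gamma, new_alpha, dual(alpha, cols, delta), dual(new_alpha, cols, delta))
```

On this instance the step size works out to $0.52$, the updated $\alpha$ still sums to $\lambda/\mu$, and the dual value increases, as (31) to (34) predict.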

The calculation of $\gamma$ in (34) depends on the calculation of $\mathbf{K}\alpha$ and $\alpha^{T}\mathbf{\Delta}$. However, since $\mathbf{K}\in\mathbb{R}^{d\times p}$, $\mathbf{\Delta},\alpha\in\mathbb{R}^{p}$ and $p=2^{n_{i}}$, the time of directly calculating either $\mathbf{K}\alpha$ or $\alpha^{T}\mathbf{\Delta}$ depends exponentially on $n_{i}$, which is often unaffordable. In order to improve efficiency, we maintain auxiliary variables to reduce the computation cost. Recall that we have defined $\hat{\mathbf{w}}=\mathbf{K}\alpha$. We also define ${v}=\alpha^{T}\mathbf{\Delta}$. We maintain $\hat{\mathbf{w}}$ and $v$ during the iterations.

With the update rule (31) for $\alpha$, we can easily derive the corresponding update rules for $\hat{\mathbf{w}}$ and $v$, respectively:

\displaystyle\mathbb{\hat{w}}\leftarrow(1-\gamma)\mathbb{\hat{w}}+\frac{\gamma\lambda}{\mu}\mathbb{K}_{.j},

(35)

\displaystyle{v}\leftarrow(1-\gamma){v}+\frac{\gamma\lambda}{\mu}\mathbf{\Delta}_{j}.

(36)

Obviously, the update rule for $\mathbb{\hat{w}}$ (or $v$ ) has the time complexity $O(d)$ (or $O(1)$ ).

With the maintained $\mathbb{\hat{w}}$ and $v$ , the update rule in (34) can be simplified to:

\displaystyle\gamma\leftarrow\max(\min(-\frac{{\frac{\lambda}{\mu}(\mathbf{K}_{.j}^{T}\hat{\mathbf{w}}-\mathbf{\Delta}_{j})-\hat{\mathbf{w}}^{T}\hat{\mathbf{w}}+v}}{{||\frac{\lambda}{\mu}\mathbf{K}_{.j}-\hat{\mathbf{w}}||_{2}^{2}}},1),0),

(37)

where the time complexity of updating $\gamma$ in (37) is reduced to $O(d)$.
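The equivalence between the direct step size (34) and its maintained form (37), together with the updates (35) and (36), can be checked on a toy problem. The following sketch uses made-up numbers and hypothetical function names, with `c` standing for $\lambda/\mu$; it assumes a non-zero denominator. The point is that `gamma_maintained` only touches the selected column $\mathbf{K}_{.j}$, the vector $\hat{\mathbf{w}}$ and the scalar $v$, never the full $\alpha$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gamma_direct(alpha, j, cols, delta, c):
    """gamma from (34), forming K @ alpha and alpha^T Delta explicitly."""
    d = len(cols[0])
    direction = [(c if k == j else 0.0) - a for k, a in enumerate(alpha)]
    k_alpha = [sum(a * col[f] for a, col in zip(alpha, cols)) for f in range(d)]
    k_dir = [sum(x * col[f] for x, col in zip(direction, cols)) for f in range(d)]
    numer = dot(k_dir, k_alpha) - dot(direction, delta)
    return max(min(-numer / dot(k_dir, k_dir), 1.0), 0.0)

def gamma_maintained(w_hat, v, k_col, delta_j, c):
    """gamma from (37), using only w_hat, v and the selected column."""
    diff = [c * kc - wh for kc, wh in zip(k_col, w_hat)]
    numer = c * (dot(k_col, w_hat) - delta_j) - dot(w_hat, w_hat) + v
    return max(min(-numer / dot(diff, diff), 1.0), 0.0)

def update_maintained(w_hat, v, k_col, delta_j, gamma, c):
    """Updates (35) and (36)."""
    new_w = [(1.0 - gamma) * wh + gamma * c * kc for wh, kc in zip(w_hat, k_col)]
    new_v = (1.0 - gamma) * v + gamma * c * delta_j
    return new_w, new_v

cols = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # toy K with d = 2 and p = 3
delta = [1.0, 1.0, 0.2]
c = 0.5
alpha = [0.3, 0.2, 0.0]
j = 2
w_hat = [sum(a * col[f] for a, col in zip(alpha, cols)) for f in range(2)]
v = dot(alpha, delta)
gamma = gamma_maintained(w_hat, v, cols[j], delta[j], c)
new_w, new_v = update_maintained(w_hat, v, cols[j], delta[j], gamma, c)
# Reference: apply (31) to alpha itself and recompute both quantities.
new_alpha = [(1.0 - gamma) * a + (gamma * c if k == j else 0.0) for k, a in enumerate(alpha)]
ref_w = [sum(a * col[f] for a, col in zip(new_alpha, cols)) for f in range(2)]
print(gamma, gamma_direct(alpha, j, cols, delta, c), new_w, ref_w)
```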

The early stopping criterion is defined based on the primal-dual gap $\mathcal{P}(\mathbb{w})-\mathcal{D}(\mathbb{\alpha})\leq\epsilon_{F}$ where the parameter $\epsilon_{F}$ is the pre-defined tolerance. Assume $\mathcal{P}(\mathbb{w}^{\star})$ is the optimal value of the primal objective (23). According to Theorem 2, we have:

\mathcal{P}(\mathbb{w})-\mathcal{P}(\mathbb{w}^{\star})\leq\mathcal{P}(\mathbb{w})-\mathcal{D}(\mathbb{\alpha})\leq\epsilon_{F}.

It is worth noting that, by using the update rule (31) with $0\leq\gamma\leq 1$, Algorithm 2 guarantees that $\alpha$ satisfies the constraints $\alpha_{k}\geq 0$ and $\sum_{k}\alpha_{k}=\lambda/\mu$ in all of the iterations. In other words, we have $\mathcal{N}^{*}(\alpha)<\infty$ in all of the iterations. Hence, with (23) and (29), we have:

\begin{split}&\mathcal{P}(\mathbb{w})-\mathcal{D}(\mathbb{\alpha})\\ &=\mathcal{M}(\mathbb{w})+\mathcal{M}^{*}(\mathbb{K\alpha})+\mathcal{N}(-\mathbf{w}^{T}\mathbf{K})+\mathcal{N}^{*}(\alpha)\end{split}

(38)

With Theorem 1, we have $\mathcal{M}(\mathbf{w})+\mathcal{M}^{*}(\mathbf{K}\alpha)\geq\langle\mathbf{w},\mathbf{K}\alpha\rangle$, where the equality holds when $\mathbf{w}=\mathbf{K}\alpha=\hat{\mathbf{w}}$. In order to greedily tighten this upper bound on the gap $\mathcal{D}(\alpha^{\star})-\mathcal{D}(\alpha)$, we set $\mathbf{w}=\mathbf{K}\alpha=\hat{\mathbf{w}}$ in (38). Since $\mathcal{N}(-\hat{\mathbf{w}}^{T}\mathbf{K})=\frac{\lambda}{\mu}\max_{j}(\mathbf{\Delta}_{j}-\hat{\mathbf{w}}^{T}\mathbf{K}_{.j})=\frac{\lambda}{\mu}\max_{j}(g_{\alpha})_{j}$ and $\mathcal{N}^{*}(\alpha)=-\mathbf{\Delta}^{T}\alpha=-v$ for feasible $\alpha$, we obtain:

\begin{split}&\mathcal{P}(\hat{\mathbf{w}})-\mathcal{D}(\alpha)\\ &=\langle\hat{\mathbf{w}},\mathbf{K}\alpha\rangle+\mathcal{N}(-\hat{\mathbf{w}}^{T}\mathbf{K})+\mathcal{N}^{*}(\alpha)\\ &=\hat{\mathbf{w}}^{T}\hat{\mathbf{w}}+\frac{\lambda}{\mu}\max_{j}(g_{\alpha})_{j}-v\end{split}

(39)

Consequently, the early stopping criterion is set to be $\hat{\mathbf{w}}^{T}\hat{\mathbf{w}}+\frac{\lambda}{\mu}\max_{j}(g_{\alpha})_{j}-v\leq\epsilon_{F}$, which can be calculated in time $O(d)$ once the largest gradient element is available from the First Step.
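Putting the pieces together, the loop of Algorithm 2 can be sketched on a toy problem whose columns $\mathbf{K}_{.j}$ are stored explicitly. This is an illustrative sketch, not the authors' implementation, and it departs from Algorithm 2 in three labeled ways: the largest gradient element is found by a plain scan over the columns instead of Algorithm 3 or 4; the iterate starts at the vertex $\frac{\lambda}{\mu}e_{j}$ selected by the gradient at zero (rather than at zero), so that the constraints of (29) hold from the first iterate; and the gap is computed directly from the definitions of $\mathcal{P}$ in (23) and $\mathcal{D}$ in (29) with $\mathbf{w}=\hat{\mathbf{w}}$, which gives $\hat{\mathbf{w}}^{T}\hat{\mathbf{w}}+\frac{\lambda}{\mu}\max_{j}(g_{\alpha})_{j}-v$. All names are hypothetical and `c` stands for $\lambda/\mu$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal(w, cols, delta, c):
    """P(w) in (23), evaluated by enumerating the columns of K."""
    return 0.5 * dot(w, w) + c * max(dj - dot(w, kj) for dj, kj in zip(delta, cols))

def coordinate_ascent(cols, delta, c, eps=1e-3, max_iter=200000):
    """Returns (w_hat, v, gap); only w_hat = K @ alpha and v = alpha^T Delta are kept."""
    p = len(cols)
    j = max(range(p), key=lambda k: delta[k])  # gradient at alpha = 0 is Delta
    w_hat = [c * x for x in cols[j]]           # alpha = c * e_j (feasible start)
    v = c * delta[j]
    gap = float("inf")
    for _ in range(max_iter):
        # First Step: largest element of g = Delta - K^T w_hat (plain scan).
        grad = [delta[k] - dot(w_hat, cols[k]) for k in range(p)]
        j = max(range(p), key=lambda k: grad[k])
        # Early stopping on the primal-dual gap P(w_hat) - D(alpha).
        gap = dot(w_hat, w_hat) + c * grad[j] - v
        if gap <= eps:
            break
        # Second Step: the numerator of (37), negated, equals the gap,
        # which is positive here, so only the upper clipping is needed.
        diff = [c * kc - wh for kc, wh in zip(cols[j], w_hat)]
        denom = dot(diff, diff)
        gamma = min(gap / denom, 1.0) if denom > 0.0 else 1.0
        w_hat = [wh + gamma * df for wh, df in zip(w_hat, diff)]  # (35)
        v = (1.0 - gamma) * v + gamma * c * delta[j]              # (36)
    return w_hat, v, gap

cols = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # toy K with d = 2 and p = 3
delta = [1.0, 1.0, 0.2]
w, v, gap = coordinate_ascent(cols, delta, c=1.0)
print(w, gap, primal(w, cols, delta, 1.0))
```

When the loop stops on the gap criterion, the returned $\hat{\mathbf{w}}$ is within $\epsilon_{F}$ of the primal optimum of (23), which is the guarantee discussed above.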

5.4 Convergence Analysis

For the sub-problem w.r.t. $\mathbf{W}$ (see Section 5.3), the proposed coordinate ascent method is similar to those in [38, 23]. By using proof techniques similar to those of [38, 23] (e.g., see the proof of Theorem 1 in [23]), we can derive that, after $T$ iterations of Algorithm 2, we have $\mathcal{D}(\alpha^{\star})-\mathcal{D}(\alpha)\leq\mathcal{P}(\mathbf{w})-\mathcal{D}(\alpha)\leq\epsilon_{F}=O(\frac{1}{T})$. Note that $\mathcal{D}(\alpha^{\star})=\mathcal{P}(\mathbf{w}^{\star})$, where $\mathcal{D}(\alpha^{\star})$ and $\mathcal{P}(\mathbf{w}^{\star})$ are the optimal values of (29) and (23), respectively. Ideally, for all the tasks, if we set the iteration number $T$ to be sufficiently large, we can solve the sub-problem w.r.t. $\mathbf{W}$ exactly (ignoring small numerical errors).

In addition, as discussed in Section 5.2, the sub-problem w.r.t. $\mathbf{S}$ can be solved exactly by closed-form solutions. Since the objective (12) is convex subject to linear constraints and both of its sub-problems can be solved exactly, existing theoretical results [6, 16] imply that Algorithm 1 converges to a global optimum with an $O(1/\epsilon)$ convergence rate.

6 Experiments

6.1 Overview

In this section, we evaluate and compare the performance of the proposed SMTL method on several benchmark datasets. For the regularizer $\Omega(\mathbb{S})$ in (12), we consider $||\mathbb{S}||_{1,1}$ , $||\mathbb{S}||_{2,1}$ and $||\mathbb{S}||_{*}$ , respectively. For the evaluation metric $\Delta(.,.)$ used in $\mathcal{G}(\mathbb{W})$ in (12), we consider $F_{1}$ -score (with $\beta=1$ ) and AUC. These settings lead to six variants of SMTL.

Here we focus on MTL for classification. Given a specific regularizer (i.e., $||\mathbb{S}||_{1,1}$, $||\mathbb{S}||_{2,1}$ or $||\mathbb{S}||_{*}$), we choose the following methods as baselines: (1) StructSVM, the single-task structured SVM that directly optimizes AUC [20]; we train it on each of the individual tasks and average the results. (2) MTL-CLS, MTL with hinge loss for classification. (3) MTL-REG, MTL with least squares loss for regression. (4) RAkEL, a meta algorithm using random $k$-label sets [41]. (5) MLCSSP, a method spanning the original label space by a subset of labels [4]. (6) AdaBoostMH, a method based on AdaBoost [37]. (7) HOMER, a method based on the hierarchy of multi-label learners [40]. (8) BR, the binary relevance method [42]. (9) LP, the label power-set method [42]. (10) ECC, the ensembles of classifier chains method [35]. Note that the classification problem can be regarded as a regression problem.²

² For a binary classification dataset in which each positive example has a label $+1$ and each negative example has a label $-1$, one can regard these labels as real numbers (i.e., $1.0$ for each of the positive examples and $-1.0$ for each of the negative examples). Then, this dataset can be used in an MTL method for regression to learn a regressor. After obtaining the regressor, for a test example $x$, if the predicted label of $x$ (by the regressor) is larger than $0$, one can regard $x$ as a positive example. On the other hand, if the predicted label of $x$ is smaller than $0$, then one can regard $x$ as a negative example.

The proposed methods and the baselines MTL-CLS and MTL-REG were implemented in Python 2.7. For MTL-REG, our implementations are based on the algorithms in [28] (for the $\ell_{2,1}$ norm) and [17] (for the trace norm). According to Theorem 3 in [20], the problem of MTL-CLS is equivalent to a special form of SMTL in (2) (with $\Delta(y^{(i)},y)=2\times t$, where $t$ is the number of indices $k$ that satisfy $y^{(i)}_{k}\neq y_{k}$). Hence, our implementation of MTL-CLS is based on the framework of Algorithm 1. For StructSVM, we use the open-source implementation of SVM-Perf [20]. All the experiments were conducted on a Dell PowerEdge R320 server with 16 GB of memory and a 1.9 GHz E5-2420 CPU.

We report the experimental results on $9$ real-world datasets. The statistics of these datasets are summarized in Table I. In the Emotions dataset, the labels are $6$ kinds of emotions, and the features are rhythmic and timbre features extracted from music wave files. In the Yeast dataset, the labels are localization sites of proteins, and the features are protein properties. In the Flags dataset, the labels are religions of countries, and the features are extracted from flag images. In the Cal500 dataset, the labels are semantic meanings of popular songs, and the features are extracted from audio data. In the Segmentation dataset, the labels are the contents of image regions, and the features are pixel properties of image regions. In the Optdigits dataset, the labels are handwritten digits $0$ to $9$, and the features are pixels. In the MediaMill dataset, the labels are semantic concepts of each video, and the features are extracted from videos. In the TMC2007 dataset, the labels are document topics, and the features are discrete attributes about terms. In the Scene dataset, the labels are scene types, and the features are spatial color moments in LUV space. All of these datasets are normalized.

TABLE I: Statistics of 9 datasets

Dataset	Type	Features	Samples	Tasks
Emotions	music	72	593	6
Yeast	gene	103	2417	14
Flags	image	19	194	7
Cal500	songs	68	502	174
Segmentation	image	19	2310	7
Optdigits	image	64	5620	10
MediaMill	multimedia	120	10000	12
TMC2007	text	500	10000	6
Scene	image	294	2407	6

TABLE II: Comparison results on Cal500, Segmentation and Optdigits.

METHOD	Macro $F_{1}$	Micro $F_{1}$	Average AUC
Cal500
SMTL($\ell_{2,1}$+AUC)	21.722 $\pm$ 0.456	38.452 $\pm$ 0.610	56.505 $\pm$ 0.511
SMTL($\ell_{2,1}$+$F_{1}$)	21.495 $\pm$ 0.232	40.127 $\pm$ 0.173	53.690 $\pm$ 0.293
MTL-CLS($\ell_{2,1}$)	13.157 $\pm$ 0.449	37.357 $\pm$ 0.180	55.764 $\pm$ 0.820
MTL-REG($\ell_{2,1}$)	12.500 $\pm$ 0.129	36.438 $\pm$ 0.176	52.964 $\pm$ 0.758
SMTL($\ell_{1,1}$+AUC)	21.721 $\pm$ 0.807	35.52 $\pm$ 0.811	56.716 $\pm$ 0.500
SMTL($\ell_{1,1}$+$F_{1}$)	21.138 $\pm$ 0.191	38.386 $\pm$ 0.456	53.358 $\pm$ 0.827
MTL-CLS($\ell_{1,1}$)	12.176 $\pm$ 0.445	37.387 $\pm$ 0.845	56.316 $\pm$ 0.216
MTL-REG($\ell_{1,1}$)	12.447 $\pm$ 0.297	36.66 $\pm$ 0.638	53.628 $\pm$ 0.264
SMTL(TraceNorm+AUC)	21.772 $\pm$ 0.545	35.204 $\pm$ 0.585	56.798 $\pm$ 0.358
SMTL(TraceNorm+$F_{1}$)	21.768 $\pm$ 0.333	38.559 $\pm$ 0.394	54.987 $\pm$ 0.823
MTL-CLS(TraceNorm)	12.884 $\pm$ 0.353	37.402 $\pm$ 0.501	55.635 $\pm$ 0.511
MTL-REG(TraceNorm)	8.348 $\pm$ 0.999	34.832 $\pm$ 0.698	55.69 $\pm$ 0.636
StructSVM	20.864 $\pm$ 1.150	35.408 $\pm$ 1.150	51.427 $\pm$ 0.841
RAkEL	20.628 $\pm$ 0.611	33.689 $\pm$ 0.843	54.637 $\pm$ 0.656
MLCSSP	21.677 $\pm$ 0.514	27.093 $\pm$ 0.537	52.69 $\pm$ 0.983
AdaBoostMH	0.923 $\pm$ 0.274	6.492 $\pm$ 0.146	50.734 $\pm$ 0.538
HOMER	13.850 $\pm$ 0.163	30.332 $\pm$ 1.313	52.461 $\pm$ 0.937
BR	17.094 $\pm$ 0.634	33.619 $\pm$ 0.375	50.563 $\pm$ 0.153
LP	15.257 $\pm$ 0.428	32.978 $\pm$ 0.668	52.117 $\pm$ 0.685
ECC	9.600 $\pm$ 0.666	34.789 $\pm$ 0.482	52.117 $\pm$ 0.625
Segmentation
SMTL($\ell_{2,1}$+AUC)	72.832 $\pm$ 1.567	68.445 $\pm$ 1.543	97.195 $\pm$ 0.4549
SMTL($\ell_{2,1}$+$F_{1}$)	85.61 $\pm$ 1.304	84.149 $\pm$ 1.684	96.967 $\pm$ 0.647
MTL-CLS($\ell_{2,1}$)	85.114 $\pm$ 1.946	84.228 $\pm$ 4.508	96.93 $\pm$ 0.560
MTL-REG($\ell_{2,1}$)	75.547 $\pm$ 1.215	81.702 $\pm$ 2.456	96.757 $\pm$ 0.645
SMTL($\ell_{1,1}$+AUC)	73.378 $\pm$ 1.564	68.424 $\pm$ 1.787	97.527 $\pm$ 0.286
SMTL($\ell_{1,1}$+$F_{1}$)	85.105 $\pm$ 1.830	83.693 $\pm$ 1.192	96.757 $\pm$ 0.192
MTL-CLS($\ell_{1,1}$)	83.712 $\pm$ 3.513	82.518 $\pm$ 4.003	96.781 $\pm$ 0.828
MTL-REG($\ell_{1,1}$)	76.253 $\pm$ 2.564	82.606 $\pm$ 0.156	96.798 $\pm$ 0.231
SMTL(TraceNorm+AUC)	72.265 $\pm$ 1.453	67.655 $\pm$ 1.978	97.134 $\pm$ 0.457
SMTL(TraceNorm+$F_{1}$)	85.356 $\pm$ 1.092	83.462 $\pm$ 1.805	96.863 $\pm$ 0.322
MTL-CLS(TraceNorm)	82.703 $\pm$ 3.865	82.150 $\pm$ 5.439	96.705 $\pm$ 0.612
MTL-REG(TraceNorm)	76.602 $\pm$ 1.286	82.805 $\pm$ 1.877	96.698 $\pm$ 0.147
StructSVM	44.632 $\pm$ 1.828	53.992 $\pm$ 1.828	89.355 $\pm$ 0.311
RAkEL	75.592 $\pm$ 0.243	70.980 $\pm$ 0.398	91.333 $\pm$ 0.082
MLCSSP	79.821 $\pm$ 8.533	78.923 $\pm$ 14.036	93.810 $\pm$ 0.329
AdaBoostMH	75.633 $\pm$ 0.209	71.018 $\pm$ 0.376	96.148 $\pm$ 0.089
HOMER	72.920 $\pm$ 2.505	69.969 $\pm$ 1.651	91.225 $\pm$ 1.543
BR	84.236 $\pm$ 0.638	78.796 $\pm$ 0.708	96.870 $\pm$ 0.194
LP	84.394 $\pm$ 0.603	83.411 $\pm$ 0.615	96.240 $\pm$ 0.124
ECC	84.183 $\pm$ 0.550	82.942 $\pm$ 0.542	96.782 $\pm$ 0.269
Optdigits
SMTL($\ell_{2,1}$+AUC)	92.722 $\pm$ 0.595	92.734 $\pm$ 0.712	99.657 $\pm$ 0.0528
SMTL($\ell_{2,1}$+$F_{1}$)	93.963 $\pm$ 0.164	93.964 $\pm$ 0.235	99.589 $\pm$ 0.054
MTL-CLS($\ell_{2,1}$)	93.701 $\pm$ 0.403	92.773 $\pm$ 0.440	99.206 $\pm$ 0.044
MTL-REG($\ell_{2,1}$)	88.901 $\pm$ 0.306	89.268 $\pm$ 0.875	99.32 $\pm$ 0.089
SMTL($\ell_{1,1}$+AUC)	92.526 $\pm$ 0.624	92.213 $\pm$ 0.670	99.653 $\pm$ 0.078
SMTL($\ell_{1,1}$+$F_{1}$)	93.692 $\pm$ 0.508	94.626 $\pm$ 0.520	99.554 $\pm$ 0.047
MTL-CLS($\ell_{1,1}$)	92.961 $\pm$ 0.608	94.009 $\pm$ 0.356	98.658 $\pm$ 0.067
MTL-REG($\ell_{1,1}$)	88.762 $\pm$ 0.845	89.203 $\pm$ 0.865	99.269 $\pm$ 0.045
SMTL(TraceNorm+AUC)	92.862 $\pm$ 0.543	92.802 $\pm$ 0.944	99.654 $\pm$ 0.036
SMTL(TraceNorm+$F_{1}$)	94.206 $\pm$ 0.202	94.139 $\pm$ 0.266	99.566 $\pm$ 0.027
MTL-CLS(TraceNorm)	93.701 $\pm$ 0.435	93.773 $\pm$ 0.267	99.182 $\pm$ 0.065
MTL-REG(TraceNorm)	88.777 $\pm$ 0.765	89.173 $\pm$ 0.946	99.293 $\pm$ 0.048
StructSVM	36.276 $\pm$ 0.905	38.289 $\pm$ 2.218	98.400 $\pm$ 0.366
RAkEL	82.450 $\pm$ 0.168	80.967 $\pm$ 0.311	94.543 $\pm$ 0.070
MLCSSP	75.191 $\pm$ 2.245	82.129 $\pm$ 3.195	88.879 $\pm$ 0.195
AdaBoostMH	93.083 $\pm$ 0.695	93.108 $\pm$ 0.669	98.594 $\pm$ 0.119
HOMER	74.869 $\pm$ 4.151	75.713 $\pm$ 3.663	93.391 $\pm$ 0.964
BR	92.625 $\pm$ 0.348	92.714 $\pm$ 0.383	99.370 $\pm$ 0.122
LP	88.875 $\pm$ 0.212	88.915 $\pm$ 0.269	94.941 $\pm$ 0.329
ECC	93.043 $\pm$ 0.206	94.019 $\pm$ 0.213	99.019 $\pm$ 0.156

TABLE III: Comparison results on Scene, MediaMill and TMC2007.

METHOD	Macro $F_{1}$	Micro $F_{1}$	Average AUC
Scene
SMTL($\ell_{2,1}$+AUC)	54.013 $\pm$ 1.124	54.746 $\pm$ 1.231	89.99 $\pm$ 0.820
SMTL($\ell_{2,1}$+$F_{1}$)	55.787 $\pm$ 0.756	56.434 $\pm$ 0.567	87.652 $\pm$ 0.280
MTL-CLS($\ell_{2,1}$)	54.722 $\pm$ 1.590	54.508 $\pm$ 1.176	86.738 $\pm$ 1.102
MTL-REG($\ell_{2,1}$)	51.157 $\pm$ 0.343	52.810 $\pm$ 0.345	85.194 $\pm$ 0.712
SMTL($\ell_{1,1}$+AUC)	54.296 $\pm$ 0.977	54.333 $\pm$ 0.025	88.358 $\pm$ 0.467
SMTL($\ell_{1,1}$+$F_{1}$)	55.501 $\pm$ 1.92	56.007 $\pm$ 2.34	87.364 $\pm$ 1.801
MTL-CLS($\ell_{1,1}$)	54.387 $\pm$ 0.730	54.805 $\pm$ 1.488	85.952 $\pm$ 1.116
MTL-REG($\ell_{1,1}$)	50.748 $\pm$ 0.546	51.280 $\pm$ 0.619	85.032 $\pm$ 0.779
SMTL(TraceNorm+AUC)	54.227 $\pm$ 0.660	55.384 $\pm$ 0.804	88.421 $\pm$ 1.103
SMTL(TraceNorm+$F_{1}$)	55.396 $\pm$ 1.089	56.304 $\pm$ 1.119	87.071 $\pm$ 0.682
MTL-CLS(TraceNorm)	55.104 $\pm$ 0.298	55.481 $\pm$ 0.506	86.205 $\pm$ 0.471
MTL-REG(TraceNorm)	50.832 $\pm$ 0.226	51.236 $\pm$ 0.264	85.275 $\pm$ 0.852
StructSVM	49.826 $\pm$ 0.815	49.951 $\pm$ 0.755	82.375 $\pm$ 0.393
RAkEL	54.592 $\pm$ 0.613	55.719 $\pm$ 0.565	78.981 $\pm$ 0.535
MLCSSP	42.764 $\pm$ 0.080	47.178 $\pm$ 0.181	65.830 $\pm$ 2.240
AdaBoostMH	36.506 $\pm$ 0.404	40.681 $\pm$ 0.449	87.617 $\pm$ 0.470
HOMER	60.980 $\pm$ 2.470	58.251 $\pm$ 2.592	80.744 $\pm$ 0.360
BR	54.579 $\pm$ 1.813	55.019 $\pm$ 1.843	82.888 $\pm$ 1.164
LP	54.902 $\pm$ 1.503	55.818 $\pm$ 1.595	75.900 $\pm$ 1.362
ECC	55.347 $\pm$ 0.893	55.831 $\pm$ 0.881	88.153 $\pm$ 0.298
MediaMill
SMTL($\ell_{2,1}$+AUC)	18.030 $\pm$ 0.294	22.058 $\pm$ 0.257	66.068 $\pm$ 0.426
SMTL($\ell_{2,1}$+$F_{1}$)	22.851 $\pm$ 5.093	56.424 $\pm$ 2.761	78.705 $\pm$ 2.280
MTL-CLS($\ell_{2,1}$)	10.613 $\pm$ 1.733	55.441 $\pm$ 3.647	76.216 $\pm$ 2.474
MTL-REG($\ell_{2,1}$)	6.366 $\pm$ 0.065	55.515 $\pm$ 0.465	53.867 $\pm$ 0.496
SMTL($\ell_{1,1}$+AUC)	18.012 $\pm$ 0.286	22.232 $\pm$ 0.211	65.405 $\pm$ 0.503
SMTL($\ell_{1,1}$+$F_{1}$)	22.386 $\pm$ 5.326	56.169 $\pm$ 2.436	78.907 $\pm$ 1.854
MTL-CLS($\ell_{1,1}$)	8.542 $\pm$ 1.672	55.838 $\pm$ 2.229	74.037 $\pm$ 1.219
MTL-REG($\ell_{1,1}$)	6.393 $\pm$ 0.033	55.687 $\pm$ 0.439	53.036 $\pm$ 0.181
SMTL(TraceNorm+AUC)	18.201 $\pm$ 0.221	22.684 $\pm$ 0.354	66.847 $\pm$ 1.015
SMTL(TraceNorm+$F_{1}$)	27.973 $\pm$ 3.006	56.031 $\pm$ 4.924	79.730 $\pm$ 1.850
MTL-CLS(TraceNorm)	15.800 $\pm$ 0.589	50.098 $\pm$ 5.569	75.968 $\pm$ 2.144
MTL-REG(TraceNorm)	6.380 $\pm$ 0.045	55.333 $\pm$ 0.425	53.825 $\pm$ 0.493
StructSVM	17.847 $\pm$ 0.318	22.030 $\pm$ 0.284	64.761 $\pm$ 0.487
RAkEL	19.874 $\pm$ 0.156	26.686 $\pm$ 0.189	63.241 $\pm$ 0.398
MLCSSP	15.129 $\pm$ 0.633	20.124 $\pm$ 0.723	52.473 $\pm$ 1.884
AdaBoostMH	17.939 $\pm$ 0.469	41.991 $\pm$ 0.425	61.914 $\pm$ 0.167
HOMER	17.939 $\pm$ 0.469	41.991 $\pm$ 0.425	61.914 $\pm$ 0.167
BR	19.769 $\pm$ 0.196	26.515 $\pm$ 0.166	69.032 $\pm$ 0.854
LP	24.135 $\pm$ 0.959	50.170 $\pm$ 0.402	60.597 $\pm$ 0.502
ECC	24.879 $\pm$ 0.590	56.214 $\pm$ 0.363	78.067 $\pm$ 0.705
TMC2007
SMTL($\ell_{2,1}$+AUC)	59.432 $\pm$ 0.581	68.02 $\pm$ 1.042	90.138 $\pm$ 0.17
SMTL($\ell_{2,1}$+$F_{1}$)	64.321 $\pm$ 0.955	74.159 $\pm$ 0.255	90.561 $\pm$ 0.669
MTL-CLS($\ell_{2,1}$)	60.517 $\pm$ 1.363	71.284 $\pm$ 0.387	88.382 $\pm$ 0.398
MTL-REG($\ell_{2,1}$)	37.106 $\pm$ 0.416	70.181 $\pm$ 0.221	85.218 $\pm$ 0.529
SMTL($\ell_{1,1}$+AUC)	60.249 $\pm$ 0.147	67.654 $\pm$ 0.234	90.441 $\pm$ 0.077
SMTL($\ell_{1,1}$+$F_{1}$)	65.436 $\pm$ 1.239	73.984 $\pm$ 0.533	90.238 $\pm$ 0.732
MTL-CLS($\ell_{1,1}$)	62.919 $\pm$ 0.802	72.745 $\pm$ 0.464	89.074 $\pm$ 0.59
MTL-REG($\ell_{1,1}$)	37.709 $\pm$ 0.32	70.431 $\pm$ 0.414	86.612 $\pm$ 0.592
SMTL(TraceNorm+AUC)	58.595 $\pm$ 0.148	68.056 $\pm$ 0.45	88.325 $\pm$ 0.182
SMTL(TraceNorm+$F_{1}$)	61.867 $\pm$ 1.014	72.588 $\pm$ 0.350	89.328 $\pm$ 0.815
MTL-CLS(TraceNorm)	59.752 $\pm$ 0.951	71.863 $\pm$ 0.628	87.933 $\pm$ 0.428
MTL-REG(TraceNorm)	36.64 $\pm$ 0.314	70.118 $\pm$ 0.437	84.54 $\pm$ 0.743
StructSVM	37.19 $\pm$ 0.652	45.027 $\pm$ 0.601	88.072 $\pm$ 0.289
RAkEL	57.331 $\pm$ 0.592	69.813 $\pm$ 0.179	81.994 $\pm$ 0.134
MLCSSP	56.717 $\pm$ 0.790	60.417 $\pm$ 1.665	75.246 $\pm$ 1.093
AdaBoostMH	15.170 $\pm$ 1.893	56.004 $\pm$ 1.103	61.466 $\pm$ 0.206
HOMER	61.144 $\pm$ 0.238	71.429 $\pm$ 0.104	84.998 $\pm$ 0.589
BR	51.939 $\pm$ 1.225	67.873 $\pm$ 0.374	84.616 $\pm$ 0.528
LP	52.683 $\pm$ 0.832	62.672 $\pm$ 0.526	73.063 $\pm$ 0.637
ECC	58.368 $\pm$ 0.714	68.223 $\pm$ 0.096	86.287 $\pm$ 0.664

TABLE IV: Comparison results on Emotions, Yeast and Flags.

METHOD	Macro $F_{1}$	Micro $F_{1}$	Average AUC
Emotions
SMTL($\ell_{2,1}$+AUC)	65.498 $\pm$ 2.047	67.067 $\pm$ 1.956	83.378 $\pm$ 0.466
SMTL($\ell_{2,1}$+$F_{1}$)	66.244 $\pm$ 1.584	66.358 $\pm$ 1.255	81.986 $\pm$ 0.495
MTL-CLS($\ell_{2,1}$)	63.343 $\pm$ 1.688	65.684 $\pm$ 1.327	80.065 $\pm$ 0.490
MTL-REG($\ell_{2,1}$)	62.621 $\pm$ 1.543	63.701 $\pm$ 1.054	81.32 $\pm$ 0.396
SMTL($\ell_{1,1}$+AUC)	65.622 $\pm$ 1.984	67.143 $\pm$ 1.629	83.358 $\pm$ 0.345
SMTL($\ell_{1,1}$+$F_{1}$)	67.696 $\pm$ 0.348	67.923 $\pm$ 0.578	83.106 $\pm$ 0.596
MTL-CLS($\ell_{1,1}$)	64.969 $\pm$ 0.822	66.584 $\pm$ 1.049	80.03 $\pm$ 0.574
MTL-REG($\ell_{1,1}$)	62.976 $\pm$ 0.547	64.404 $\pm$ 1.535	81.811 $\pm$ 0.587
SMTL(TraceNorm+AUC)	65.902 $\pm$ 1.904	67.405 $\pm$ 1.848	83.362 $\pm$ 0.618
SMTL(TraceNorm+$F_{1}$)	67.600 $\pm$ 0.574	67.858 $\pm$ 0.984	83.000 $\pm$ 0.236
MTL-CLS(TraceNorm)	63.805 $\pm$ 2.339	66.602 $\pm$ 2.063	80.485 $\pm$ 0.597
MTL-REG(TraceNorm)	63.243 $\pm$ 1.574	64.869 $\pm$ 2.574	82.834 $\pm$ 0.266
StructSVM	46.367 $\pm$ 5.531	49.902 $\pm$ 19.032	62.908 $\pm$ 4.361
RAkEL	64.998 $\pm$ 1.387	65.835 $\pm$ 1.136	75.206 $\pm$ 0.875
MLCSSP	62.980 $\pm$ 2.780	63.593 $\pm$ 2.603	76.054 $\pm$ 2.495
AdaBoostMH	4.291 $\pm$ 1.429	7.577 $\pm$ 2.627	55.111 $\pm$ 0.328
HOMER	59.039 $\pm$ 2.431	61.830 $\pm$ 1.642	71.212 $\pm$ 1.167
BR	61.358 $\pm$ 2.578	62.635 $\pm$ 2.332	79.146 $\pm$ 1.250
LP	53.384 $\pm$ 1.858	54.618 $\pm$ 1.543	68.506 $\pm$ 0.652
ECC	62.694 $\pm$ 1.645	64.138 $\pm$ 1.216	82.589 $\pm$ 1.131
Yeast
SMTL($\ell_{2,1}$+AUC)	43.593 $\pm$ 1.120	46.261 $\pm$ 0.872	63.018 $\pm$ 1.504
SMTL($\ell_{2,1}$+$F_{1}$)	44.353 $\pm$ 1.080	55.451 $\pm$ 0.457	61.285 $\pm$ 1.246
MTL-CLS($\ell_{2,1}$)	36.308 $\pm$ 0.974	43.908 $\pm$ 0.499	56.686 $\pm$ 0.539
MTL-REG($\ell_{2,1}$)	28.187 $\pm$ 1.544	47.029 $\pm$ 0.645	62.757 $\pm$ 1.745
SMTL($\ell_{1,1}$+AUC)	43.132 $\pm$ 1.349	45.729 $\pm$ 1.643	62.626 $\pm$ 1.709
SMTL($\ell_{1,1}$+$F_{1}$)	44.647 $\pm$ 1.058	54.971 $\pm$ 1.187	61.569 $\pm$ 1.945
MTL-CLS($\ell_{1,1}$)	36.89 $\pm$ 0.699	44.620 $\pm$ 0.553	58.221 $\pm$ 0.424
MTL-REG( $\ell_{1,1}$ )	33.720 $\pm$ 1.634	54.682 $\pm$ 1.846	50.050 $\pm$ 1.563 \bigstrut[b]
SMTL(TraceNorm+AUC)	43.58 $\pm$ 1.046	46.395 $\pm$ 1.067	63.058 $\pm$ 0.634 \bigstrut[t]
SMTL(TraceNorm+ $F_{1}$ )	44.972 $\pm$ 0.765	50.471 $\pm$ 0.968	61.819 $\pm$ 0.395
MTL-CLS(TraceNorm)	42.275 $\pm$ 1.006	44.542 $\pm$ 0.460	61.528 $\pm$ 0.590
MTL-REG(TraceNorm)	28.178 $\pm$ 1.043	47.046 $\pm$ 0.126	62.920 $\pm$ 0.326 \bigstrut[b]
StructSVM	42.669 $\pm$ 2.48	46.298 $\pm$ 2.048	61.894 $\pm$ 2.488 \bigstrut[t]
RAkEL	44.101 $\pm$ 0.389	46.086 $\pm$ 0.450	61.971 $\pm$ 0.753
MLCSSP	41.511 $\pm$ 0.837	46.200 $\pm$ 1.272	50.756 $\pm$ 0.451
AdaBoostMH	12.255 $\pm$ 0.041	48.144 $\pm$ 0.315	50.805 $\pm$ 0.050
HOMER	40.054 $\pm$ 1.063	53.745 $\pm$ 0.867	62.311 $\pm$ 1.265
BR	39.209 $\pm$ 0.891	54.153 $\pm$ 0.543	62.375 $\pm$ 0.408
LP	37.029 $\pm$ 0.584	53.059 $\pm$ 0.514	56.616 $\pm$ 1.394
ECC	37.523 $\pm$ 0.310	54.632 $\pm$ 0.325	62.105 $\pm$ 0.627 \bigstrut[b]
Flags \bigstrut
SMTL( $\ell_{2,1}$ +AUC)	60.473 $\pm$ 1.951	61.666 $\pm$ 2.226	73.875 $\pm$ 2.563 \bigstrut[t]
SMTL( $\ell_{2,1}$ + $F_{1}$ )	70.279 $\pm$ 1.744	75.047 $\pm$ 0.945	75.000 $\pm$ 0.745
MTL-CLS( $\ell_{2,1}$ )	65.233 $\pm$ 1.930	71.709 $\pm$ 0.955	72.928 $\pm$ 1.479
MTL-REG( $\ell_{2,1}$ )	66.073 $\pm$ 0.276	73.005 $\pm$ 1.307	71.429 $\pm$ 1.105 \bigstrut[b]
SMTL( $\ell_{1,1}$ +AUC)	60.187 $\pm$ 1.971	61.618 $\pm$ 1.714	74.136 $\pm$ 2.805 \bigstrut[t]
SMTL( $\ell_{1,1}$ + $F_{1}$ )	69.122 $\pm$ 1.975	74.259 $\pm$ 1.378	74.168 $\pm$ 1.513
MTL-CLS( $\ell_{1,1}$ )	65.532 $\pm$ 1.210	72.666 $\pm$ 1.752	72.725 $\pm$ 0.497
MTL-REG( $\ell_{1,1}$ )	65.256 $\pm$ 0.739	72.246 $\pm$ 0.928	71.299 $\pm$ 0.998 \bigstrut[b]
SMTL(TraceNorm+AUC)	61.435 $\pm$ 1.616	62.84 $\pm$ 1.481	74.367 $\pm$ 2.373 \bigstrut[t]
SMTL(TraceNorm+ $F_{1}$ )	68.704 $\pm$ 1.650	73.132 $\pm$ 1.891	73.145 $\pm$ 1.973
MTL-CLS(TraceNorm)	65.236 $\pm$ 3.507	72.688 $\pm$ 2.156	73.307 $\pm$ 2.155
MTL-REG(TraceNorm)	65.257 $\pm$ 2.647	72.437 $\pm$ 1.918	71.495 $\pm$ 0.783 \bigstrut[b]
StructSVM	55.683 $\pm$ 5.777	51.957 $\pm$ 2.048	72.178 $\pm$ 3.604 \bigstrut[t]
RAkEL	60.696 $\pm$ 5.216	64.749 $\pm$ 4.688	61.260 $\pm$ 3.805
MLCSSP	59.629 $\pm$ 1.619	63.215 $\pm$ 1.326	55.865 $\pm$ 1.909
AdaBoostMH	56.457 $\pm$ 4.288	71.268 $\pm$ 1.400	69.329 $\pm$ 2.043
HOMER	59.018 $\pm$ 1.269	63.855 $\pm$ 2.259	64.826 $\pm$ 0.569
BR	59.421 $\pm$ 2.163	67.287 $\pm$ 1.876	66.823 $\pm$ 2.860
LP	61.801 $\pm$ 3.822	69.132 $\pm$ 3.200	60.540 $\pm$ 4.149
ECC	64.936 $\pm$ 3.023	72.715 $\pm$ 1.675	73.913 $\pm$ 2.339 \bigstrut[b]

Following the settings in [9], we use AUC, Macro $F_{1}$ -score, and Micro $F_{1}$ -score as the evaluation metrics (the details of how AUC and $F_{1}$ are computed can be found in Section 4; see also footnote 3).

Footnote 3: In MTL, the Macro $F_{1}$ is calculated by first computing the $F_{1}$ score of each individual task and then averaging these $F_{1}$ scores over all tasks. The Micro $F_{1}$ in MTL is calculated as $\frac{2\times P\times R}{P+R}$ , where $P=\frac{\sum_{i=1}^{m}\sum_{k=1}^{n_{i}}I({\mathbf{y}_{k}^{(i)}}=1\ \textit{and}\ (\mathbf{y}_{j})_{k}=1)}{\sum_{i=1}^{m}\sum_{k=1}^{n_{i}}I({\mathbf{y}_{k}^{(i)}}=1)}$ and $R=\frac{\sum_{i=1}^{m}\sum_{k=1}^{n_{i}}I({\mathbf{y}_{k}^{(i)}}=1\ \textit{and}\ (\mathbf{y}_{j})_{k}=1)}{\sum_{i=1}^{m}\sum_{k=1}^{n_{i}}I({(\mathbf{y}_{j})_{k}}=1)}$ .
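The difference between the two averaging schemes can be illustrated with a minimal sketch. It assumes the standard definitions (precision = TP / predicted positives, recall = TP / actual positives), treats label 1 as positive and any other label as negative, and returns $F_{1}=0$ when a task has no positives at all; the function names are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for one task; label 1 is positive, anything else negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR/(P+R), which simplifies to 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    # Assumption: define F1 as 0 when there are no positives of any kind.
    return 2 * tp / denom if denom else 0.0


def macro_f1(tasks_true, tasks_pred) -> float:
    """Average the per-task F1 scores over all tasks."""
    scores = [f1_from_counts(*confusion_counts(t, p))
              for t, p in zip(tasks_true, tasks_pred)]
    return sum(scores) / len(scores)


def micro_f1(tasks_true, tasks_pred) -> float:
    """Pool TP/FP/FN over all tasks, then compute a single F1."""
    tp = fp = fn = 0
    for t, p in zip(tasks_true, tasks_pred):
        a, b, c = confusion_counts(t, p)
        tp, fp, fn = tp + a, fp + b, fn + c
    return f1_from_counts(tp, fp, fn)


# Two toy tasks with four instances each.
tasks_true = [[1, 1, 0, 0], [1, 0, 0, 0]]
tasks_pred = [[1, 0, 0, 0], [1, 1, 1, 0]]
print(macro_f1(tasks_true, tasks_pred))  # (2/3 + 1/2) / 2 = 7/12
print(micro_f1(tasks_true, tasks_pred))  # 2*2 / (2*2 + 2 + 1) = 4/7
```

In this toy case the two scores differ because Macro $F_{1}$ weights every task equally, whereas Micro $F_{1}$ weights every instance equally.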

For each dataset, we first generate 10 partitions with a $60\%$ : $40\%$ split. In each partition, the “ $60\%$ ” part is used as the training set and the “ $40\%$ ” part is used as the test set. We then run each method (the baselines and the proposed methods) on these 10 partitions and report the results averaged over the $10$ trials. Note that, for a fair comparison, every method uses the same ten partitions of a given dataset to produce its results. Once the training set is determined, we conduct 10-fold cross validation on the training set to choose the trade-off parameter $\lambda$ within $\{{10^{-3}}\times i\}_{i=1}^{10}\cup\{{10^{-2}}\times i\}_{i=1}^{10}\cup\{{10^{-1}}\times i\}_{i=1}^{10}\cup\{2\times i\}_{i=1}^{10}\cup\{40\times i\}_{i=1}^{20}$ .
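The protocol above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how the partitions are drawn, so the sketch assumes uniformly random shuffles with a fixed seed, and the names `lambda_grid` and `make_partitions` are hypothetical.

```python
import random
from fractions import Fraction


def lambda_grid():
    """Candidate values of lambda: the union of the five sets given in the text."""
    values = set()
    # {1e-3 * i}, {1e-2 * i}, {1e-1 * i}, {2 * i} for i = 1..10
    for scale in (Fraction(1, 1000), Fraction(1, 100), Fraction(1, 10), Fraction(2)):
        values.update(scale * i for i in range(1, 11))
    # {40 * i} for i = 1..20
    values.update(Fraction(40) * i for i in range(1, 21))
    # Exact fractions avoid spurious duplicates such as 0.01 appearing twice.
    return sorted(float(v) for v in values)


def make_partitions(n_samples, n_partitions=10, train_fraction=0.6, seed=0):
    """Return (train_indices, test_indices) pairs; random shuffling is an assumption."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n_samples))
    partitions = []
    for _ in range(n_partitions):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        partitions.append((idx[:n_train], idx[n_train:]))
    return partitions


grid = lambda_grid()
print(len(grid), grid[0], grid[-1])
```

Because the sets overlap at $0.01$ and $0.1$, the union contains 58 distinct candidates ranging from $0.001$ to $800$. Fixing the seed is one simple way to let every method reuse the same ten partitions.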

In Algorithm 2, we set the maximum iterations $T_{F}=5000$ and the optimization tolerance $\epsilon_{F}=10^{-5}$ .

Figure 1: Comparison results on Segmentation, Emotions and Optdigits w.r.t. AUC.

6.2 Results on real-world datasets

The evaluation results w.r.t. Micro $F_{1}$ , Macro $F_{1}$ and AUC (with standard deviations) of the proposed SMTL are shown in Tables II, III and IV. As can be seen, with the same regularizer, the proposed SMTL variants that optimize the $F_{1}$ -score or AUC show performance gains over the corresponding baselines. In most cases, the SMTL variant that optimizes a specific metric achieves the best results on that metric. Here are some statistics. On the Yeast dataset, the Macro $F_{1}$ of SMTL( $\ell_{2,1}$ + $F_{1}$ ) is $44.353\%$ , a $22.16\%$ relative increase over the best MTL baseline MTL-CLS( $\ell_{2,1}$ ); the Micro $F_{1}$ of SMTL( $\ell_{2,1}$ + $F_{1}$ ) is $55.451\%$ , a $17.91\%$ relative increase over the best MTL baseline MTL-REG( $\ell_{2,1}$ ); the average AUC of SMTL( $\ell_{1,1}$ +AUC) is $62.626\%$ , a $7.57\%$ relative increase over the best MTL baseline MTL-CLS( $\ell_{1,1}$ ). On the Emotions dataset, SMTL( $\ell_{2,1}$ + $F_{1}$ ) achieves a Macro $F_{1}$ of $66.244\%$ , a $4.58\%$ relative increase over the best MTL baseline MTL-CLS( $\ell_{2,1}$ ); SMTL( $\ell_{2,1}$ +AUC) achieves an AUC of $83.378\%$ , a $2.53\%$ relative increase over the best MTL baseline MTL-REG( $\ell_{2,1}$ ); SMTL(TraceNorm+ $F_{1}$ ) achieves a Macro $F_{1}$ of $67.6\%$ , a $5.95\%$ relative increase over the best MTL baseline MTL-CLS(TraceNorm). On the Cal500 dataset, SMTL( $\ell_{1,1}$ +AUC) achieves a Macro $F_{1}$ of $21.721\%$ , compared to $12.447\%$ for MTL-REG( $\ell_{1,1}$ ), a $74.51\%$ relative increase; SMTL( $\ell_{2,1}$ + $F_{1}$ ) achieves a Micro $F_{1}$ of $40.127\%$ , compared to $37.357\%$ for MTL-CLS( $\ell_{2,1}$ ), a $7.41\%$ relative increase.
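The relative increases quoted in this paragraph follow the usual formula $100\times(\text{new}-\text{baseline})/\text{baseline}$. A minimal sketch, using the Yeast entries of Table IV as inputs (the helper name `relative_increase` is hypothetical):

```python
def relative_increase(new: float, baseline: float) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Yeast, Table IV: SMTL vs. the best MTL baseline with the same regularizer.
print(round(relative_increase(44.353, 36.308), 2))  # Macro F1, l2,1: 22.16
print(round(relative_increase(55.451, 47.029), 2))  # Micro F1, l2,1: 17.91
print(round(relative_increase(62.626, 58.221), 2))  # Average AUC, l1,1: 7.57
```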

In addition, we conduct $t$ -tests and Wilcoxon’s signed rank tests [43] on $9$ datasets to investigate whether the improvements of the SMTL methods over the baselines are statistically significant. The $p$ -values of the $t$ -tests are shown in Tables V and VI. The $p$ -values of Wilcoxon’s tests are shown in Tables VII and VIII. As can be seen, most of the $p$ -values are smaller than 0.05, which indicates that the improvements are statistically significant. These results verify the effectiveness of directly optimizing evaluation metrics in MTL problems.
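For readers who want to reproduce this kind of test, the following is a minimal standard-library sketch of a two-sided Wilcoxon signed rank test on paired per-partition scores. It assumes the normal approximation without continuity or tie correction; the paper does not state which variant or software was used, so this is an illustration rather than the authors' procedure, and the function name is hypothetical.

```python
import math
from typing import Sequence


def wilcoxon_signed_rank_p(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided p-value via the normal approximation (no continuity or tie correction)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Sum of ranks of the positive differences and its null mean and std.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Ten partitions on which method A beats method B every time (made-up scores).
a = [0.60 + 0.01 * i for i in range(10)]
b = [0.50] * 10
print(wilcoxon_signed_rank_p(a, b))
```

With ten pairs whose differences all have the same sign, this approximation gives $p\approx 5.06\times 10^{-3}$, which is the smallest value it can produce for ten pairs. This matches the value that recurs throughout Tables VII and VIII, although that correspondence is an observation and is not stated in the paper.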

TABLE V: $t$-test: $p$-values of SMTL against the baselines

Two methods for comparison	Optdigits	TMC2007	MediaMill	Segmentation
Average AUC \bigstrut
$\ell_{2,1}$ : SMTL(AUC) vs. MTL-CLS	4.86E-07	1.49E-13	2.12E-02	4.74E-03 \bigstrut[t]
$\ell_{2,1}$ : SMTL(AUC) vs. MTL-REG	2.70E-12	1.44E-18	6.58E-01	4.30E-03
Trace: SMTL(AUC) vs. MTL-CLS	6.88E-03	5.20E-03	2.12E-02	4.74E-03
Trace: SMTL(AUC) vs. MTL-REG	3.85E-12	4.59E-11	6.61E-01	4.25E-03
$\ell_{1,1}$ : SMTL(AUC) vs. MTL-CLS	5.71E-03	2.95E-05	5.52E-03	4.75E-03
$\ell_{1,1}$ : SMTL(AUC) vs. MTL-REG	1.46E-12	1.92E-12	1.57E-08	4.35E-03
Trace: SMTL(AUC) vs. RAkEL	1.87E-26	1.50E-14	2.79E-13	3.27E-14
Trace: SMTL(AUC) vs. MLCSSP	6.02E-10	1.05E-14	9.63E-15	3.24E-01
Trace: SMTL(AUC) vs. AdaBoostMH	2.36E-04	5.42E-20	4.44E-08	3.05E-14
Trace: SMTL(AUC) vs. HOMER	5.23E-12	6.97E-09	4.65E-08	1.04E-12
Trace: SMTL(AUC) vs. BR	2.91E-10	1.31E-16	2.43E-13	5.14E-07
Trace: SMTL(AUC) vs. LP	6.82E-22	1.41E-20	1.44E-03	9.30E-01
Trace: SMTL(AUC) vs. ECC	8.16E-03	9.67E-19	7.52E-03	6.19E-03 \bigstrut[b]
Micro $F_{1}$ \bigstrut
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	8.37E-02	8.37E-02	3.98E-10	4.89E-02 \bigstrut[t]
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	3.28E-20	3.28E-20	2.54E-18	1.24E-02
Trace: SMTL( $F_{1}$ ) vs. MTL-CLS	4.68E-03	4.68E-03	4.00E-10	4.96E-02
Trace: SMTL( $F_{1}$ ) vs. MTL-REG	3.01E-14	3.01E-14	2.30E-18	4.92E-03
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	9.54E-03	9.54E-03	4.19E-10	4.75E-01
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	6.16E-12	6.16E-12	2.56E-18	1.03E-01
Trace: SMTL( $F_{1}$ ) vs. RAkEL	7.30E-33	4.64E-25	4.93E-09	9.28E-19
Trace: SMTL( $F_{1}$ ) vs. MLCSSP	2.28E-30	1.90E-18	3.28E-14	2.90E-13
Trace: SMTL( $F_{1}$ ) vs. AdaBoostMH	4.53E-16	9.97E-35	9.38E-12	3.65E-06
Trace: SMTL( $F_{1}$ ) vs. HOMER	5.37E-14	1.13E-12	9.68E-12	8.31E-10
Trace: SMTL( $F_{1}$ ) vs. BR	1.61E-06	3.09E-14	5.20E-05	9.79E-03
Trace: SMTL( $F_{1}$ ) vs. LP	3.94E-20	9.73E-24	1.06E-12	1.39E-05
Trace: SMTL( $F_{1}$ ) vs. ECC	3.99E-07	2.76E-08	1.75E-16	1.45E-01 \bigstrut[b]
Macro $F_{1}$ \bigstrut
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	4.09E-21	1.61E-10	3.98E-10	4.09E-02 \bigstrut[t]
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	1.47E-26	3.09E-16	2.54E-18	2.98E-12
Trace: SMTL( $F_{1}$ ) vs. MTL-CLS	1.04E-21	1.82E-02	4.00E-10	4.13E-02
Trace: SMTL( $F_{1}$ ) vs. MTL-REG	3.85E-19	5.87E-12	2.30E-18	3.19E-12
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	7.04E-22	9.26E-07	4.19E-10	4.04E-02
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	1.94E-24	8.74E-14	2.56E-18	2.57E-12
Trace: SMTL( $F_{1}$ ) vs. RAkEL	6.35E-29	3.99E-10	4.93E-09	3.08E-16
Trace: SMTL( $F_{1}$ ) vs. MLCSSP	6.50E-16	2.29E-10	3.28E-14	4.68E-02
Trace: SMTL( $F_{1}$ ) vs. AdaBoostMH	1.21E-04	2.99E-23	9.38E-12	3.17E-16
Trace: SMTL( $F_{1}$ ) vs. HOMER	1.78E-11	4.33E-02	9.68E-12	2.55E-11
Trace: SMTL( $F_{1}$ ) vs. BR	1.28E-08	1.12E-13	5.20E-05	1.19E-02
Trace: SMTL( $F_{1}$ ) vs. LP	1.67E-19	1.52E-14	1.06E-12	7.49E-01
Trace: SMTL( $F_{1}$ ) vs. ECC	2.83E-01	4.46E-08	1.75E-16	9.52E-01 \bigstrut[b]

TABLE VI: $t$-test: $p$-values of SMTL against the baselines

Two methods for comparison	Cal500	Yeast	Emotions	Scene	Flags
Average AUC \bigstrut
$\ell_{2,1}$ : SMTL(AUC) vs. MTL-CLS	2.62E-01	1.02E-12	2.62E-01	4.92E-02	4.30E-02 \bigstrut[t]
$\ell_{2,1}$ : SMTL(AUC) vs. MTL-REG	7.48E-05	1.47E-09	7.48E-05	4.74E-11	4.21E-02
Trace: SMTL(AUC) vs. MTL-CLS	1.01E-01	8.97E-13	1.01E-01	4.48E-02	4.21E-02
Trace: SMTL(AUC) vs. MTL-REG	3.04E-03	1.53E-09	3.04E-03	5.00E-11	4.37E-02
$\ell_{1,1}$ : SMTL(AUC) vs. MTL-CLS	2.18E-03	1.00E-12	2.18E-03	4.56E-02	4.27E-02
$\ell_{1,1}$ : SMTL(AUC) vs. MTL-REG	2.55E-06	1.71E-09	2.55E-06	4.67E-11	4.28E-02
Trace: SMTL(AUC) vs. RAkEL	2.62E-12	1.65E-10	4.55E-04	1.48E-01	5.58E-05
Trace: SMTL(AUC) vs. MLCSSP	1.49E-21	1.05E-07	1.22E-04	1.60E-15	6.81E-11
Trace: SMTL(AUC) vs. AdaBoostMH	4.10E-33	1.03E-06	3.61E-23	3.21E-19	2.12E-02
Trace: SMTL(AUC) vs. HOMER	2.30E-13	2.54E-04	8.57E-09	4.33E-02	9.89E-09
Trace: SMTL(AUC) vs. BR	2.23E-16	7.57E-10	3.84E-06	7.92E-02	1.63E-06
Trace: SMTL(AUC) vs. LP	1.12E-14	6.42E-07	9.15E-15	4.97E-01	3.19E-03
Trace: SMTL(AUC) vs. ECC	2.00E-13	2.09E-12	6.45E-07	3.28E-01	9.78E-01 \bigstrut[b]
Micro $F_{1}$ \bigstrut
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	4.09E-21	2.70E-05	2.62E-01	1.64E-05	3.14E-02 \bigstrut[t]
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	1.47E-26	5.00E-02	7.48E-05	1.09E-06	1.87E-03
Trace: SMTL( $F_{1}$ ) vs. MTL-CLS	1.04E-21	3.45E-05	1.01E-01	1.42E-05	3.10E-02
Trace: SMTL( $F_{1}$ ) vs. MTL-REG	3.85E-19	4.39E-02	3.04E-03	1.19E-06	1.76E-03
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	7.04E-22	2.16E-05	2.18E-03	1.54E-05	3.13E-02
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	1.94E-24	4.21E-02	2.55E-06	1.35E-06	1.87E-03
Trace: SMTL( $F_{1}$ ) vs. RAkEL	4.16E-08	2.54E-03	4.55E-04	3.26E-15	2.85E-08
Trace: SMTL( $F_{1}$ ) vs. MLCSSP	2.82E-10	9.31E-21	1.22E-04	1.80E-16	2.01E-13
Trace: SMTL( $F_{1}$ ) vs. AdaBoostMH	8.68E-17	3.36E-22	3.61E-23	4.76E-02	7.77E-05
Trace: SMTL( $F_{1}$ ) vs. HOMER	5.69E-11	1.08E-01	8.57E-09	4.53E-14	3.25E-10
Trace: SMTL( $F_{1}$ ) vs. BR	7.35E-21	1.08E-01	3.84E-06	2.51E-09	4.96E-06
Trace: SMTL( $F_{1}$ ) vs. LP	1.64E-13	9.74E-11	9.15E-15	1.25E-14	3.39E-08
Trace: SMTL( $F_{1}$ ) vs. ECC	6.21E-14	5.30E-01	6.45E-07	1.05E-02	9.09E-01 \bigstrut[b]
Macro $F_{1}$ \bigstrut
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	2.43E-02	2.51E-06	7.55E-12	4.09E-02	1.12E-02 \bigstrut[t]
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	4.24E-10	2.71E-19	2.83E-09	1.40E-10	2.58E-03
Trace: SMTL( $F_{1}$ ) vs. MTL-CLS	1.77E-05	2.34E-06	3.53E-09	4.37E-02	1.13E-02
Trace: SMTL( $F_{1}$ ) vs. MTL-REG	1.43E-04	3.33E-19	2.69E-02	1.55E-10	2.67E-03
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	3.99E-02	2.66E-06	6.38E-12	4.43E-02	1.11E-02
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	1.01E-12	2.76E-19	1.09E-06	1.32E-10	2.54E-03
Trace: SMTL( $F_{1}$ ) vs. RAkEL	5.45E-05	5.52E-03	3.24E-05	5.68E-02	2.11E-04
Trace: SMTL( $F_{1}$ ) vs. MLCSSP	6.64E-01	1.57E-08	6.89E-05	2.34E-18	2.75E-10
Trace: SMTL( $F_{1}$ ) vs. AdaBoostMH	1.28E-29	1.36E-28	3.03E-28	5.89E-21	1.16E-07
Trace: SMTL( $F_{1}$ ) vs. HOMER	3.52E-23	6.16E-10	2.60E-09	3.77E-06	1.84E-11
Trace: SMTL( $F_{1}$ ) vs. BR	7.28E-14	6.97E-12	6.40E-07	2.49E-01	2.86E-09
Trace: SMTL( $F_{1}$ ) vs. LP	1.42E-18	9.45E-16	7.63E-15	4.04E-01	5.68E-05
Trace: SMTL( $F_{1}$ ) vs. ECC	5.69E-21	1.98E-16	5.51E-08	9.15E-01	4.96E-03 \bigstrut[b]

TABLE VII: Wilcoxon’s test: $p$-values of SMTL against the baselines

Two methods for comparison	Optdigits	TMC2007	MediaMill	Segmentation
Average AUC \bigstrut
$\ell_{2,1}$ : SMTL(AUC) vs. MTL-CLS	1.25E-02	1.25E-02	5.06E-03	5.75E-01
$\ell_{2,1}$ : SMTL(AUC) vs. MTL-REG	5.06E-03	5.06E-03	4.45E-01	2.84E-02
Trace: SMTL(AUC) vs. MTL-CLS	2.84E-02	2.18E-02	2.18E-02	4.69E-02
Trace: SMTL(AUC) vs. MTL-REG	5.06E-03	5.06E-03	8.79E-01	3.86E-01
$\ell_{1,1}$ : SMTL(AUC) vs. MTL-CLS	2.84E-02	2.84E-02	4.69E-02	2.18E-02
$\ell_{1,1}$ : SMTL(AUC) vs. MTL-REG	5.06E-03	5.06E-03	5.75E-01	2.18E-02
Trace: SMTL(AUC) vs. RAkEL	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL(AUC) vs. MLCSSP	5.06E-03	5.06E-03	5.06E-03	2.85E-01
Trace: SMTL(AUC) vs. AdaBoostMH	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL(AUC) vs. HOMER	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL(AUC) vs. BR	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL(AUC) vs. LP	5.06E-03	5.06E-03	1.25E-02	8.79E-01
Trace: SMTL(AUC) vs. ECC	7.45E-02	5.06E-03	3.67E-02	1.66E-02 \bigstrut[b]
Micro $F_{1}$ \bigstrut
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	5.06E-03	5.93E-02	5.06E-03	9.34E-03 \bigstrut[t]
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	5.06E-03	5.06E-03	5.06E-03	2.84E-02
Trace: SMTL( $F_{1}$ ) vs. MTL-CLS	5.06E-03	1.66E-02	5.06E-03	9.26E-02
Trace: SMTL( $F_{1}$ ) vs. MTL-REG	5.06E-03	5.06E-03	5.06E-03	6.91E-03
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	5.06E-03	4.69E-02	5.06E-03	5.93E-02
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	5.06E-03	5.06E-03	5.06E-03	2.84E-02
Trace: SMTL( $F_{1}$ ) vs. RAkEL	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. MLCSSP	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. AdaBoostMH	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. HOMER	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. BR	5.06E-03	5.06E-03	9.34E-03	1.69E-01
Trace: SMTL( $F_{1}$ ) vs. LP	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. ECC	5.06E-03	5.06E-03	5.06E-03	2.03E-01
Macro $F_{1}$ \bigstrut
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	9.34E-03	5.06E-03	5.06E-03	4.69E-02 \bigstrut[t]
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. MTL-CLS	9.34E-03	5.06E-03	5.06E-03	5.93E-02
Trace: SMTL( $F_{1}$ ) vs. MTL-REG	5.06E-03	5.06E-03	5.06E-03	5.06E-03
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	9.34E-03	5.06E-03	5.06E-03	1.14E-01
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. RAkEL	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. MLCSSP	5.06E-03	5.06E-03	5.06E-03	7.45E-02
Trace: SMTL( $F_{1}$ ) vs. AdaBoostMH	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. HOMER	5.06E-03	3.67E-02	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. BR	5.06E-03	5.06E-03	5.06E-03	2.84E-02
Trace: SMTL( $F_{1}$ ) vs. LP	5.06E-03	5.06E-03	1.69E-01	8.79E-01
Trace: SMTL( $F_{1}$ ) vs. ECC	4.69E-02	5.06E-03	1.66E-02	9.59E-01

TABLE VIII: Wilcoxon’s test: $p$-values of SMTL against the baselines

Two methods for comparison	Cal500	Yeast	Emotions	Scene	Flags
Average AUC \bigstrut
$\ell_{2,1}$ : SMTL(AUC) vs. MTL-CLS	5.06E-03	5.06E-03	1.69E-01	1.25E-02	1.25E-02 \bigstrut[t]
$\ell_{2,1}$ : SMTL(AUC) vs. MTL-REG	5.06E-03	5.06E-03	1.25E-02	5.06E-03	5.06E-03
Trace: SMTL(AUC) vs. MTL-CLS	6.91E-03	5.06E-03	1.14E-01	4.69E-02	5.08E-01
Trace: SMTL(AUC) vs. MTL-REG	5.06E-03	5.06E-03	1.25E-02	5.06E-03	3.67E-02
$\ell_{1,1}$ : SMTL(AUC) vs. MTL-CLS	5.06E-03	5.06E-03	1.25E-02	2.84E-02	1.25E-02
$\ell_{1,1}$ : SMTL(AUC) vs. MTL-REG	5.06E-03	5.06E-03	1.25E-02	5.06E-03	1.25E-02
Trace: SMTL(AUC) vs. RAkEL	5.06E-03	5.06E-03	1.25E-02	2.84E-02	5.06E-03
Trace: SMTL(AUC) vs. MLCSSP	5.06E-03	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL(AUC) vs. AdaBoostMH	5.06E-03	5.06E-03	5.06E-03	5.06E-03	1.25E-02
Trace: SMTL(AUC) vs. HOMER	5.06E-03	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL(AUC) vs. BR	5.06E-03	5.06E-03	5.06E-03	9.26E-02	5.06E-03
Trace: SMTL(AUC) vs. LP	5.06E-03	6.91E-03	5.06E-03	7.21E-01	9.34E-03
Trace: SMTL(AUC) vs. ECC	5.06E-03	5.06E-03	5.06E-03	1.69E-01	9.59E-01 \bigstrut[b]
Micro $F_{1}$ \bigstrut
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	5.06E-03	6.91E-03	5.06E-03	5.06E-03	5.06E-03 \bigstrut[t]
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	9.34E-03	2.84E-02	2.84E-02	5.06E-03	1.25E-02
Trace: SMTL( $F_{1}$ ) vs. MTL-CLS	5.06E-03	6.91E-03	5.06E-03	5.06E-03	4.45E-01
Trace: SMTL( $F_{1}$ ) vs. MTL-REG	5.06E-03	2.84E-02	1.25E-02	5.06E-03	1.25E-02
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	5.06E-03	5.06E-03	5.06E-03	5.06E-03	3.67E-02
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	5.06E-03	2.84E-02	3.67E-02	5.06E-03	2.84E-02
Trace: SMTL( $F_{1}$ ) vs. RAkEL	5.06E-03	9.34E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. MLCSSP	5.06E-03	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. AdaBoostMH	5.06E-03	5.06E-03	5.06E-03	5.93E-02	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. HOMER	5.06E-03	2.03E-01	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. BR	5.06E-03	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. LP	5.06E-03	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. ECC	5.06E-03	3.86E-01	2.41E-01	4.69E-02	8.79E-01
Macro $F_{1}$ \bigstrut
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	5.06E-03	5.06E-03	6.91E-03	3.67E-02	3.67E-02 \bigstrut[t]
$\ell_{2,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	5.06E-03	5.06E-03	5.06E-03	5.06E-03	6.91E-03
Trace: SMTL( $F_{1}$ ) vs. MTL-CLS	5.06E-03	5.06E-03	5.06E-03	1.66E-02	5.93E-02
Trace: SMTL( $F_{1}$ ) vs. MTL-REG	5.06E-03	5.06E-03	5.06E-03	5.06E-03	5.06E-03
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-CLS	5.06E-03	5.06E-03	5.06E-03	3.33E-01	3.67E-02
$\ell_{1,1}$ : SMTL( $F_{1}$ ) vs. MTL-REG	5.06E-03	5.06E-03	5.06E-03	5.06E-03	2.84E-02
Trace: SMTL( $F_{1}$ ) vs. RAkEL	5.06E-03	1.25E-02	9.34E-03	9.26E-02	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. MLCSSP	3.67E-02	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. AdaBoostMH	5.06E-03	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. HOMER	5.06E-03	5.06E-03	5.06E-03	5.06E-03	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. BR	5.06E-03	5.06E-03	5.06E-03	1.69E-01	5.06E-03
Trace: SMTL( $F_{1}$ ) vs. LP	5.06E-03	5.06E-03	5.06E-03	5.75E-01	6.91E-03
Trace: SMTL( $F_{1}$ ) vs. ECC	5.06E-03	5.06E-03	5.06E-03	7.99E-01	6.91E-03

6.3 Results on imbalanced data

When learning classifiers on imbalanced data (e.g., when the number of positive training samples is much smaller than the number of negative training samples), metrics such as F-score or AUC are more informative for evaluation than the misclassification error. This is one of the motivations for the SMTL method proposed in this paper. In MTL, the imbalance can be measured by first calculating the imbalance ratio of each individual task (i.e., $\frac{\text{number of positive instances}}{\text{number of negative instances}}$ for each task), and then averaging these ratios over all tasks.
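As a small illustration of this measure, the sketch below computes the per-task ratio of positive to negative instances and averages it over tasks. It assumes label 1 marks a positive instance and any other label a negative one; the function names are hypothetical.

```python
from typing import Sequence


def imbalance_ratio(labels: Sequence[int]) -> float:
    """Ratio (#positive instances) / (#negative instances) for one task."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if neg == 0:
        raise ValueError("task has no negative instances; ratio is undefined")
    return pos / neg


def average_imbalance_ratio(tasks_labels: Sequence[Sequence[int]]) -> float:
    """Average the per-task imbalance ratios over all tasks."""
    ratios = [imbalance_ratio(labels) for labels in tasks_labels]
    return sum(ratios) / len(ratios)


# Task 1 has 1 positive and 4 negatives (0.25); task 2 has 2 and 2 (1.0).
print(average_imbalance_ratio([[1, 0, 0, 0, 0], [1, 1, 0, 0]]))  # 0.625
```

A value far below 1 indicates that positive instances are rare on average across the tasks.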

We conduct simulated experiments on 3 datasets (Segmentation, Emotions and Optdigits) to investigate the behavior of the proposed SMTL methods on imbalanced data. For each dataset, we generate imbalanced datasets by randomly selecting (with replacement) positive and negative samples from the original dataset, with positive-to-negative ratios of $1:1$ , $1:5$ and $1:10$ , respectively. As can be seen in Fig. 1 and Fig. 2, in most cases the proposed SMTL variants outperform the baseline method. For example, on Emotions with the ratio $\frac{\text{negative samples}}{\text{positive samples}}=10:1$ , the proposed SMTL shows a relative increase of $9.7\%$ / $12.9\%$ / $11.1\%$ over the baseline w.r.t. AUC / Macro $F_{1}$ / Micro $F_{1}$ , respectively. In addition, as the ratio $\frac{\text{negative samples}}{\text{positive samples}}$ increases, the improvement of SMTL over the baseline method also increases.
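The resampling step can be sketched as follows. This is a hedged illustration only: the paper does not specify the sample sizes or the random generator, so the number of positives `n_pos`, the seed, and the function name are all assumptions.

```python
import random
from typing import List, Sequence


def resample_with_ratio(labels: Sequence[int], n_pos: int,
                        neg_per_pos: int, seed: int = 0) -> List[int]:
    """Draw indices with replacement so that positives : negatives = 1 : neg_per_pos."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y != 1]
    if not pos_idx or not neg_idx:
        raise ValueError("need at least one positive and one negative sample")
    # random.Random.choices samples with replacement.
    chosen_pos = rng.choices(pos_idx, k=n_pos)
    chosen_neg = rng.choices(neg_idx, k=n_pos * neg_per_pos)
    return chosen_pos + chosen_neg


labels = [1] * 5 + [0] * 20
for k in (1, 5, 10):  # the three ratios used in the text
    idx = resample_with_ratio(labels, n_pos=4, neg_per_pos=k)
    n_positive = sum(1 for i in idx if labels[i] == 1)
    print(k, n_positive, len(idx) - n_positive)
```

Sampling with replacement makes it possible to reach a $1:10$ ratio even when the original dataset contains fewer than ten negatives per positive.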

6.4 Training Time Comparison

To investigate the training speed of the proposed method, we provide a running-time comparison in Table IX. SMTL trains more slowly than the baseline methods, by a factor of up to roughly 30 in the worst case. It is worth noting that the training cost is usually not a critical issue in practice, because training is typically performed off-line.

TABLE IX: Training Time Comparison

Method	Training time on Optdigits	Training time on Emotions	Training time on Segmentation
SMTL( $\ell_{1,1}$ +AUC)	105.200s	30.001s	1.888s \bigstrut[t]
SMTL( $\ell_{1,1}$ + $F_{1}$ )	510.900s	29.797s	2.964s
MTL-CLS( $\ell_{1,1}$ )	356.200s	24.674s	2.023s
MTL-REG( $\ell_{1,1}$ )	19.030s	7.427s	0.450s
StructSVM	17.762s	46.468s	5.015s
RAkEL	28.428s	4.117s	4.310s
AdaBoostMH	17.157s	1.024s	0.641s
MLCSSP	121.779s	1.563s	6.410s
HOMER	20.643s	1.354s	0.880s
BR	20.852s	1.859s	1.835s
LP	16.131s	22.561s	2.103s
ECC	17.852s	2.834s	1.891s \bigstrut[b]

7 Conclusion

In this paper, we developed Structured-MTL, an MTL method that directly optimizes evaluation metrics. To solve the optimization problem of Structured-MTL, we developed an optimization procedure based on the ADMM scheme. This optimization procedure can be applied to a large family of MTL problems with structured outputs.

In future work, we plan to investigate Structured-MTL on problems other than classification (e.g., MTL for ranking problems). We also plan to improve the efficiency of Structured-MTL on large-scale learning problems.

References

[1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817-1853, 2005.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
[3] J-B. Bi, T. Xiong, S-P. Yu, M. Dundar, and R. Rao. An improved multi-task learning approach with applications in medical diagnosis. In Machine Learning and Knowledge Discovery in Databases, pages 117-132, 2008.
[4] W. Bi, J. Kwok. Efficient Multi-label Classification with Many Labels. Proceedings of the 30th International Conference on Machine Learning. 405-413, 2013.
[5] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[7] J-F. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion. UCLA CAM Report, 2008.
[8] R. Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997.
[9] J-H. Chen, J. Liu, and J-P. Ye. Learning incoherent sparse and low-rank patterns from multiple tasks. In International Conference on Knowledge Discovery and Data Mining, pages 1179-1188, 2010.
[10] J-H. Chen, J-Y. Zhou, and J-P. Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In International Conference on Knowledge Discovery and Data Mining, pages 42-50, 2011.
[11] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615-637, 2005.
[12] R-E. Fan, K-W. Chang, C-J. Hsieh, X-R. Wang, C-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
[13] P-H. Gong, J-P. Ye, and C-S. Zhang. Robust multi-task feature learning. In International Conference on Knowledge Discovery and Data Mining, pages 895-903, 2012.
[14] N. Görnitz, C. Widmer, G. Zeller, A. Kahles, S. Sonnenburg, and G. Rätsch. Hierarchical Multitask Structured Output Learning for Large-Scale Sequence Segmentation. In Advances in Neural Information Processing Systems, 2011.
[15] X. Gu, F-L. Chung, H. Ishibuchi, and S-T. Wang. Multitask Coupled Logistic Regression and Its Fast Implementation for Large Multitask Datasets. In IEEE Transactions on Cybernetics, 45(9): 1953-1966, 2015.
[16] B. He and X. Yuan. On the O(1/n) Convergence Rate of the Douglas-Rachford Alternating Direction Method. SIAM Journal on Numerical Analysis, 50(2): 700-709, 2012.
[17] S-W. Ji and J-P. Ye. An Accelerated Gradient Method for Trace Norm Minimization. In International Conference on Machine Learning, pages 457-464, 2009.
[18] Y-Z. Jiang, F-L. Chung, H. Ishibuchi, Z-H. Deng, and S-T. Wang. Multitask TSK Fuzzy System Modeling by Mining Intertask Common Hidden Structure. In IEEE Transactions on Cybernetics, 45(3): 548-561, 2015.
[19] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas. Mulan: A java library for multi-label learning. Journal of Machine Learning Research, 12:2411-2414, 2011.
[20] T. Joachims. A Support Vector Method for Multivariate Performance Measures. In International Conference on Machine Learning, 2005.
[21] Z. Kang, K. Grauman, and F. Sha. Learning with whom to share in multi-task feature learning. In International Conference on Machine Learning, pages 521-528, 2011.
[22] S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In International Conference on Machine Learning, pages 543-550, 2010.
[23] H-J. Lai, Y. Pan, C. Liu, L. Lin, and J. Wu. Sparse Learning-to-rank via an Efficient Primal-Dual Algorithm. IEEE Transactions on Computers, 62(6):1221-1233, 2013.
[24] H-J. Lai, Y. Pan, Y. Tang, and R. Yu. FSMRank: Feature Selection Method for Learning to Rank. IEEE Transactions on Neural Networks and Learning Systems, 24(6):940-952, 2013.
[25] Z-C. Lin, M-M. Chen, and Y. Ma. The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrix. Technical Report, UIUC, October 2009.
[26] A-A. Liu, Y-T. Su, P-P. Jia, Z. Gao, T. Hao, Z-X. Yang. Multiple/Single-View Human Action Recognition via Part-Induced Multitask Structural Learning. IEEE Transactions on Cybernetics, 45(6): 1194-1208, 2016.
[27] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In International Conference on Machine Learning, pages 663-670, 2010.
[28] J. Liu, S-W. Ji, and J-P. Ye. Multi-task feature learning via efficient $\ell_{2,1}$ -norm minimization. In Conference on Uncertainty in Artificial Intelligence, pages 339-348, 2009.
[29] X-Q. Lu, X-L. Li, and L-C. Mou. Semi-Supervised Multitask Learning for Scene Recognition. In IEEE Transactions on Cybernetics, 45(9): 1967-1976, 2015.
[30] G. Obozinski, B. Taskar, and M.I. Jordan. Multi-task feature selection. Technical report, Statistics Department, UC Berkeley, 2006.
[31] Y. Pan, H-J. Lai, C. Liu, S-C. Yan. A Divide-and-Conquer Method for Scalable Low-Rank Latent Matrix Pursuit. In International Conference on Computer Vision and Pattern Recognition, 2013.
[32] Y. Pan, H-J. Lai, C. Liu, Y. Tang, S-C. Yan. Rank Aggregation via Low-Rank and Structured-Sparse Decomposition. In AAAI Conference on Artificial Intelligence, 2013.
[33] Y. Pan, R-K. Xia, J. Yin, N. Liu. A Divide-and-Conquer Method for Scalable Robust Multitask Learning. In IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 12, pp. 3163-3175, 2015.
[34] N. Quadrianto, A. Smola, T. Caetano, S. Vishwanathan, and J. Petterson. Multitask learning without label correspondences. In Advances in Neural Information Processing Systems, pages 1957-1965, 2010.
[35] J. Read, B. Pfahringer, G. Holmes and E. Frank. Classifier Chains for Multi-label Classification. Machine learning, 85(3): 333-359, 2011.
[36] R.M. Rifkin and R.A. Lippert. Value Regularization and Fenchel Duality. Journal of Machine Learning Research, 8:441-479, 2007.
[37] R. E. Schapire, Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2):135-168, 2000.
[38] S. Shalev-Shwartz and Y. Singer. On the Equivalence of Weak Learnability and Linear Separability: New Relaxations and Efficient Boosting Algorithms. Machine Learning, 80(2):141-163, 2010.
[39] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In International Conference on Machine Learning, 2004.
[40] G. Tsoumakas, I. Katakis and I. Vlahavas. Effective and Efficient Multilabel Classification in Domains with Large Number of Labels. Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data, pages 30-44, 2008.
[41] G. Tsoumakas, I. Katakis and I. Vlahavas. Random k-Labelsets for Multi-Label Classification. IEEE Transactions on Knowledge and Data Engineering. 23(7):1079-1089, 2011.
[42] E. Gibaja, S. Ventura. A tutorial on multilabel learning. ACM Computing Surveys, 47(3): 52, 2015.
[43] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6): 80-83, 1945.
[44] R-K. Xia, Y. Pan, L. Du, J. Yin. Robust Multi-View Clustering via Low-Rank and Sparse Decomposition. In AAAI Conference on Artificial Intelligence, 2014.
[45] R-K. Xia, Y. Pan, H-J. Lai, C. Liu, S-C. Yan. Supervised Hashing for Image Retrieval via Image Representation Learning. In AAAI Conference on Artificial Intelligence, 2014.
[46] Y. Yang, Z-G. Ma, Y. Yang, F-P. Nie, and H-T. Shen. Multitask Spectral Clustering by Exploring Intertask Correlation. In IEEE Transactions on Cybernetics, 45(5): 1069-1080, 2015.
[47] Y-J. Yin, D. Xu, X-G. Wang, and M-R. Bai. Online State-Based Structured SVM Combined With Incremental PCA for Robust Visual Tracking. In IEEE Transactions on Cybernetics, 45(9): 1988-2000, 2015.
[48] J. Yu, D-C. Tao, M. Wang, and Y. Rui. Learning to Rank Using User Clicks and Visual Features for Image Retrieval. In IEEE Transactions on Cybernetics, 45(4): 767-779, 2015.
[49] K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In International Conference on Machine Learning, pages 1012-1019, 2005.
[50] Y. Yue, T. Finley, F. Radlinski, T. Joachims. A Support Vector Method for Optimizing Average Precision. In International Conference on Research and Development in Information Retrieval, 2007.
[51] J. Zhang, Z. Ghahramani, and Y-M. Yang. Learning multiple related tasks using latent independent component analysis. In Advances in Neural Information Processing Systems, pages 1585-1592, 2006.
[52] W-Q. Zhao, Q-G. Meng and P. W. H. Chung. A Heuristic Distributed Task Allocation Method for Multivehicle Multitask Problems and Its Application to Search and Rescue Scenario. IEEE Transactions on Cybernetics, 46(4): 902-915, 2016.
[53] J-Y. Zhou, J-H. Chen, and J-P. Ye. Clustered multi-task learning via alternating structure optimization. In Advances in Neural Information Processing Systems, pages 702-710, 2011.
