Optimizing Evaluation Metrics for Multi-Task Learning via the Alternating Direction Method of Multipliers
Abstract
Multi-task learning (MTL) aims to improve the generalization performance of multiple tasks by exploiting the shared factors among them. Various metrics (e.g., F-score, Area Under the ROC Curve) are used to evaluate the performance of MTL methods. Most existing MTL methods try to minimize either the misclassified errors for classification or the mean squared errors for regression. In this paper, we propose a method to directly optimize the evaluation metrics for a large family of MTL problems. The formulation of MTL that directly optimizes evaluation metrics is the combination of two parts: (1) a regularizer defined on the weight matrix over all tasks, in order to capture the relatedness of these tasks; (2) a sum of multiple structured hinge losses, each corresponding to a surrogate of some evaluation metric on one task. This formulation is challenging in optimization because both of its parts are non-smooth. To tackle this issue, we propose a novel optimization procedure based on the alternating direction method of multipliers, where we decompose the whole optimization problem into a sub-problem corresponding to the regularizer and another sub-problem corresponding to the structured hinge losses. For a large family of MTL problems, the first sub-problem has closed-form solutions. To solve the second sub-problem, we propose an efficient primal-dual algorithm via coordinate ascent. Extensive evaluation results demonstrate that, in a large family of MTL problems, the proposed MTL method of directly optimizing evaluation metrics achieves clear performance gains over the corresponding baseline methods.
Index Terms:
Multi-Task Learning, Evaluation Metrics, Structured Outputs, Coordinate Ascent, Alternating Direction Method of Multipliers.

1 Introduction
Recently, considerable research has been devoted to Multi-Task Learning (MTL), a problem of improving the generalization performance of multiple tasks by utilizing the shared information among them. MTL has been widely-used in various applications, such as natural language processing [1], handwritten character recognition [30, 34], scene recognition [29] and medical diagnosis [3]. Many MTL methods have been proposed in the literature [8, 11, 49, 51, 13, 21, 28, 30, 53, 1, 9, 10, 33, 29, 15, 46, 18, 52, 26].
In this paper, we consider MTL for classification or regression problems. Note that either a multi-class classification problem or a multi-label learning problem can be regarded as an MTL problem.¹ As an illustrative example, we consider a multi-label classification problem whose instances each belong to one or more of three classes. This problem can be regarded as an MTL problem with three tasks, where the $t$-th task is a binary classification problem of whether an instance belongs to the $t$-th class or not. Hence, a multi-label learning problem is a special case of an MTL problem. Similarly, we can verify that a multi-class classification problem can also be regarded as an MTL problem. Most of the existing MTL methods focus on minimizing either a convex surrogate (e.g., the hinge loss or the logistic loss) of the 0-1 errors for multi-task classification, or the mean squared errors for multi-task regression. On the other hand, in practice, several evaluation metrics other than misclassified errors or mean squared errors are used in the evaluation of MTL methods, e.g., F-score, Precision, Recall, Area Under the ROC Curve (AUC), and Mean Average Precision. For example, in the cases of MTL on imbalanced data (e.g., in a task, the number of negative samples is much larger than that of the positive samples), cost-sensitive MTL, or MTL for ranking, these metrics are more effective in performance evaluation than the standard misclassified errors or the mean squared errors. However, due to computational difficulties, few learning techniques have been developed to directly optimize these evaluation metrics in MTL.
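The multi-label-to-MTL reduction described above can be sketched in a few lines; a minimal Python illustration (the instances, class names, and label sets here are hypothetical, not from the paper):

```python
# Convert a multi-label dataset into one binary task per label.
# Each instance carries a set of class labels (illustrative data).
instances = {
    "x1": {"A", "B"},
    "x2": {"B", "C"},
    "x3": {"A"},
}
labels = sorted({c for cs in instances.values() for c in cs})  # ["A", "B", "C"]

# Task for label t: binary classification "does the instance carry label t?"
tasks = {
    lab: [(x, 1 if lab in cs else -1) for x, cs in sorted(instances.items())]
    for lab in labels
}
print(tasks["A"])  # [('x1', 1), ('x2', -1), ('x3', 1)]
```

Each entry of `tasks` is then a standard binary training set, which is exactly the per-task input assumed by the MTL formulations below.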
In this paper, we propose an approach to directly optimizing the evaluation metrics in MTL, which can be applied to a large family of MTL problems. Specifically, for an MTL problem with $T$ tasks (where the $t$-th task is associated with a training set of $n_t$ training samples), we consider a generic formulation in the following:
$$\min_{\mathbf{W}}\; \Omega(\mathbf{W}) + \lambda \sum_{t=1}^{T} \ell_t(\mathbf{w}_t) \qquad (1)$$
where $\mathbf{W}$ is the weight matrix with columns $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_T$, and $\lambda$ is a trade-off parameter. This formulation is the linear combination of two parts. The first part, $\Omega(\mathbf{W})$, is a regularizer defined on the weight matrix over all tasks, in order to leverage the relatedness of these tasks; examples of this kind of regularizer include the trace-norm and sparsity-inducing matrix norms on $\mathbf{W}$. The second part is a sum of multiple loss functions $\ell_t$, each corresponding to one task. In order to directly optimize a specific evaluation metric, we consider the hinge loss functions for structured outputs [39, 20, 50, 48, 47], which are surrogates of a specific evaluation metric.
Such a formulation in (1) includes a large family of MTL problems. Since the two parts in (1) are usually non-smooth, the optimization problem (1) is difficult to solve. To tackle this issue, we propose a novel optimization procedure based on the alternating direction method of multipliers (ADMM [6, 25]), which is widely used in various machine learning problems (e.g., [31, 32, 33, 44]). We decompose the whole optimization problem in (1) into two simpler sub-problems. The first sub-problem corresponds to the regularizer; for commonly-used regularizers in MTL (e.g., the trace-norm), this sub-problem admits closed-form solutions. The second sub-problem corresponds to the structured hinge losses; to solve it, we propose an efficient primal-dual algorithm via coordinate ascent.
We conduct extensive experiments to evaluate the performances of the proposed MTL method. Experimental results show that the proposed method that optimizes a specific evaluation metric outperforms the corresponding MTL classification or MTL regression baseline methods by a clear margin.
2 Related Work
MTL is a wide class of learning problems. Roughly speaking, the existing MTL methods can be divided into three main categories: parameter sharing, common feature sharing, and low-rank subspace sharing.
In the methods with parameter sharing, all tasks are assumed to explicitly share some common parameters. Representative methods in this category include shared weight vectors [11], hidden units in neural networks [8], and common priors in Bayesian models [49, 51].
In the methods with common feature sharing, task relatedness is modeled by enforcing all tasks to share a common set of features [2, 28, 22, 30, 13, 21, 53]. Representative examples are the methods which constrain the model parameters (i.e., a weight matrix) of all tasks to have certain sparsity patterns, for example, cardinality sparsity [30], group sparsity [28, 13], or clustered structure [21, 53].
The methods in the third category assume that all tasks lie in a shared low-rank subspace [1, 9, 10]. A common assumption in this category of methods is that most of the tasks are relevant while (optionally) there may exist a small number of irrelevant (outlier) tasks. These methods pursue a low-rank weight matrix that captures the underlying shared factors among tasks. Trace-norm regularization is commonly used in these methods to encourage the low-rank structure on the model parameters.
Most of the existing MTL methods are focused on designing regularizers or parameter sharing patterns to utilize the intrinsic relationships among multiple related tasks. These MTL methods usually try to optimize the classification errors or the mean squared errors for regression. In practice, various other metrics (such as F-score and AUC) are used in the evaluation of MTL methods. However, little effort has been devoted to optimizing these evaluation metrics in the context of MTL, except for the work [14], in which the author proposed a hierarchical MTL formulation for structured output prediction in sequence segmentation. Since the regularizer used in [14] is decomposable, the hierarchical MTL problem can be decomposed into multiple independent tasks, each of which is a structured output learning problem with a simple regularizer. In this paper, we seek to directly optimize commonly-used evaluation metrics for MTL with a possibly indecomposable regularizer, resulting in a generic approach that can be applied to a large family of MTL problems. Our formulation can be regarded as MTL for structured output prediction with an indecomposable regularizer.
The proposed methods in this paper are also related to multi-label algorithms. There are various multi-label algorithms proposed in the literature, e.g., the RAkEL method that uses random $k$-label sets [41], the MLCSSP method that spans the original label space by a subset of labels [4], the AdaBoostMH method based on AdaBoost [37], the HOMER method based on the hierarchy of multi-label learners [40], the binary relevance (BR) method [42], the label power-set (LP) method [42], and the ensembles of classifier chains (ECC) method [35].
The proposed approach in this paper is to optimize the evaluation metrics in MTL. We refer the readers to Section 4 for the detailed introduction to the evaluation metrics related to the proposed approach.
3 Notations
We first introduce the notations to be used throughout this paper. We use bold upper-case characters to represent matrices, and bold lower-case characters to represent vectors, respectively. For a matrix, we denote by a subscripted entry the element at the intersection of the $i$-th row and the $j$-th column, and we denote its $i$-th row and $j$-th column accordingly.
We denote the Frobenius norm, the entry-wise $\ell_1$-norm (the sum of absolute values of the entries), the $\ell_{2,1}$-norm (the sum of the $\ell_2$-norms of the rows), and the infinity norm (the maximum absolute entry) of a matrix in the standard way. The trace-norm of a matrix is defined as the sum of its non-zero singular values, whose number equals the rank of the matrix. We denote the transpose of a matrix with the superscript $\top$. For a vector, the $\ell_2$-norm is defined analogously.
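The matrix norms just listed can be checked numerically; a short NumPy sketch (the example matrix is ours, and the $\ell_{2,1}$ grouping is assumed here to be row-wise, which the extraction does not confirm):

```python
import numpy as np

A = np.array([[3.0, -4.0],
              [0.0,  5.0]])

fro = np.linalg.norm(A, "fro")                        # Frobenius norm
l1 = np.abs(A).sum()                                  # entry-wise l1-norm
l21 = np.linalg.norm(A, axis=1).sum()                 # l_{2,1}: sum of row-wise l2-norms
linf = np.abs(A).max()                                # infinity (max absolute entry) norm
tracenorm = np.linalg.svd(A, compute_uv=False).sum()  # trace norm: sum of singular values
```

For this matrix, `fro` is $\sqrt{50}$, `l1` is $12$, `l21` is $10$, and `linf` is $5$.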
In the context of MTL, we assume we are given $T$ learning tasks. The $t$-th ($t = 1, \dots, T$) task is associated with a training set $(\mathbf{X}_t, \mathbf{y}_t)$, where $\mathbf{X}_t$ denotes the data matrix with each row being a sample, $\mathbf{y}_t$ denotes the target labels on $\mathbf{X}_t$, $d$ is the feature dimensionality, and $n_t$ is the number of samples for the $t$-th task. For classification, we consider the set of all possible $n_t$-dimensional label vectors, each of whose elements is either $-1$ or $+1$. To simplify presentation, we assume $\mathbf{y}_t$ is one of the possible vectors that belong to this set.
We define a weight matrix $\mathbf{W}$ on all of the $T$ tasks. The goal of (linear) MTL is to simultaneously learn $T$ (linear) predictors to minimize some loss function (e.g., the least squares loss), where each predictor $\mathbf{w}_t$ is in the form of a column vector. Note that for each task, $\mathbf{w}_t \in \mathbb{R}^d$.
4 Problem Formulations
The linear MTL problem can be formulated as the generic form in (1). The objective functions in many existing MTL methods, e.g., trace-norm-regularized or group-sparsity-regularized formulations with hinge or least squares losses, are special cases of such a formulation.
The existing MTL methods mainly focus on the design of good regularizers (i.e., $\Omega$) to capture the shared factors among multiple related tasks. The loss functions used in these methods are either to minimize the misclassified errors (for classification) or the mean squared errors (for regression). On the other hand, in practice, several evaluation metrics other than misclassified errors or mean squared errors are used in the evaluation of MTL methods, such as F-score and AUC. Particularly, in the cases of MTL on imbalanced data (e.g., in a task, the number of negative samples is much larger than that of the positive samples), these metrics are more effective in performance evaluation than the standard misclassified errors or the mean squared errors.
Learning techniques of directly optimizing evaluation metrics, also known as learning with structured outputs, have been developed for many (single-task) problems, e.g., classification [39, 20] and ranking [50]. However, despite the acknowledged importance of metrics like F-score or AUC, little effort has been made to design MTL methods that directly optimize these evaluation metrics. The main reason is that MTL of optimizing the evaluation metrics usually results in a non-smooth objective function which is difficult to solve.
In this paper, we focus on MTL with structured outputs and propose a generic optimization procedure based on ADMM. This optimization procedure can be applied to solving a large family of MTL problems that directly optimize some evaluation metric (e.g., F-score, AUC). We call the proposed method Structured MTL (SMTL for short).
The formulation of SMTL is also a special case of (1). In order to optimize some evaluation metric, we define the loss function for each task as the structured hinge loss:

$$\ell_t(\mathbf{w}_t) = \max_{\bar{\mathbf{y}} \in \{-1,+1\}^{n_t}} \left[ \Delta(\mathbf{y}_t, \bar{\mathbf{y}}) + \mathbf{w}_t^{\top} \mathbf{X}_t^{\top} (\bar{\mathbf{y}} - \mathbf{y}_t) \right],$$

where $\bar{\mathbf{y}}$ represents any possible label assignment on the training samples of the $t$-th task, and $\Delta(\mathbf{y}_t, \bar{\mathbf{y}})$ represents an evaluation metric to measure the distance between the true labels $\mathbf{y}_t$ and the other labels $\bar{\mathbf{y}}$. For example, $\Delta$ can be $1 - \text{F-score}$ or $1 - \text{AUC}$.
The formulation of SMTL is defined as:

$$\min_{\mathbf{W}}\; \Omega(\mathbf{W}) + \lambda \sum_{t=1}^{T} \max_{\bar{\mathbf{y}} \in \{-1,+1\}^{n_t}} \left[ \Delta(\mathbf{y}_t, \bar{\mathbf{y}}) + \mathbf{w}_t^{\top} \mathbf{X}_t^{\top} (\bar{\mathbf{y}} - \mathbf{y}_t) \right] \qquad (2)$$
In this paper, we only focus on the MTL problems in the form of (2) that satisfy the following conditions:
- Condition 1: With respect to $\Omega$, there is a closed-form solution for the following sub-problem:

$$\min_{\mathbf{W}}\; \Omega(\mathbf{W}) + \frac{c}{2}\|\mathbf{W} - \mathbf{A}\|_F^2 \qquad (3)$$

where $\mathbf{A}$ is a given matrix and $c$ is a positive constant.
- Condition 2: For the evaluation metric $\Delta$, the following sub-problem can be solved in polynomial time for any given vector $\mathbf{v}$:

$$\max_{\bar{\mathbf{y}} \in \{-1,+1\}^{n_t}}\; \Delta(\mathbf{y}_t, \bar{\mathbf{y}}) + \mathbf{v}^{\top}\bar{\mathbf{y}} \qquad (4)$$
The first condition restricts the regularizer $\Omega$ and the second one restricts the evaluation metric function $\Delta$. Even under these conditions, the formulation in (2) includes a large family of MTL problems. On the one hand, the following norms that are commonly used in MTL satisfy Condition 1:
- Trace-norm, for which the sub-problem in (3) has a closed-form solution given by singular value thresholding;
- Entry-wise sparsity norms (e.g., the $\ell_1$-norm), for which the sub-problem in (3) is also known to have closed-form solutions, given by element-wise soft-thresholding;
- Group-sparsity norms (e.g., the $\ell_{2,1}$-norm), for which the sub-problem in (3) is solved in closed form by group-wise shrinkage.
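The closed-form solutions of (3) for these regularizers are standard proximal operators; a NumPy sketch (function names and the threshold parameter `tau` are ours, and `tau` plays the role of the regularization weight divided by the constant $c$ in (3)):

```python
import numpy as np

def prox_l1(A, tau):
    """Entry-wise soft-thresholding: argmin_W tau*||W||_1 + 0.5*||W - A||_F^2."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def prox_l21(A, tau):
    """Row-wise shrinkage: argmin_W tau*||W||_{2,1} + 0.5*||W - A||_F^2."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * A

def prox_trace(A, tau):
    """Singular value thresholding: argmin_W tau*||W||_* + 0.5*||W - A||_F^2."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```

Each call costs one pass over the entries (or one SVD for the trace norm), which is why the regularizer sub-problem in the proposed ADMM procedure is cheap.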
On the other hand, many commonly-used metric functions satisfy the second condition. The following are two examples which we will consider in this paper:
- MTL by directly optimizing F-score. F-score is a typical performance metric for binary classification, particularly in learning tasks on imbalanced data. F-score is a trade-off between Precision and Recall. Specifically, given the true labels $\mathbf{y}$ and a label assignment $\bar{\mathbf{y}}$, we define the precision as

$$\mathrm{Prec}(\mathbf{y}, \bar{\mathbf{y}}) = \frac{\sum_i \mathbb{I}[y_i = 1 \wedge \bar{y}_i = 1]}{\sum_i \mathbb{I}[\bar{y}_i = 1]}$$

and the recall as

$$\mathrm{Rec}(\mathbf{y}, \bar{\mathbf{y}}) = \frac{\sum_i \mathbb{I}[y_i = 1 \wedge \bar{y}_i = 1]}{\sum_i \mathbb{I}[y_i = 1]},$$

where $\mathbb{I}[\cdot]$ represents the indicator function that equals $1$ if the condition inside is true and $0$ otherwise. Then the F-score on $\mathbf{y}$ and $\bar{\mathbf{y}}$ is defined as

$$F_\beta(\mathbf{y}, \bar{\mathbf{y}}) = \frac{(1+\beta^2)\,\mathrm{Prec}\cdot\mathrm{Rec}}{\beta^2\,\mathrm{Prec} + \mathrm{Rec}}, \qquad (8)$$

where $\beta$ is a trade-off parameter. Hereafter, we simply set $\beta = 1$. Finally, the metric function with respect to the F-score is defined by $\Delta(\mathbf{y}, \bar{\mathbf{y}}) = 1 - F_\beta(\mathbf{y}, \bar{\mathbf{y}})$.
- MTL by directly optimizing AUC. AUC is also a popular performance metric for binary classification, particularly in imbalanced learning. Given $\mathbf{y}$ and $\bar{\mathbf{y}}$, the AUC metric can be calculated by

$$\mathrm{AUC}(\mathbf{y}, \bar{\mathbf{y}}) = 1 - \frac{\mathrm{SwappedPairs}(\mathbf{y}, \bar{\mathbf{y}})}{n_t^{+}\, n_t^{-}}, \qquad (10)$$

where $\mathrm{SwappedPairs}(\mathbf{y}, \bar{\mathbf{y}})$ represents the number of "inverted" pairs in $\bar{\mathbf{y}}$ compared to $\mathbf{y}$, and $n_t^{+}$ / $n_t^{-}$ represents the number of positive/negative samples in the $t$-th task. The corresponding metric function can be defined as $\Delta(\mathbf{y}, \bar{\mathbf{y}}) = 1 - \mathrm{AUC}(\mathbf{y}, \bar{\mathbf{y}})$.
Note that here the Precision, Recall, F-Score and AUC are defined for a particular task.
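Both metric-based losses $\Delta$ can be computed directly from a pair of label/score vectors; a minimal Python sketch (function names are ours, and ties in the AUC pair count are treated as inverted here, a detail the paper does not specify):

```python
def f_score_loss(y_true, y_pred):
    """Delta = 1 - F1 between label vectors in {-1, +1} (beta = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    true_pos = sum(1 for t in y_true if t == 1)
    if tp == 0:
        return 1.0  # zero (or undefined) precision/recall
    precision, recall = tp / pred_pos, tp / true_pos
    return 1.0 - 2 * precision * recall / (precision + recall)

def auc_loss(y_true, scores):
    """Delta = 1 - AUC: fraction of (positive, negative) pairs ranked wrongly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == -1]
    inverted = sum(1 for p in pos for n in neg if p <= n)  # ties counted as inverted
    return inverted / (len(pos) * len(neg))
```

For instance, predicting only one of two positives correctly on a balanced 4-sample task gives an F-score loss of 0.5, and one mis-ranked pair out of four gives an AUC loss of 0.25.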
5 Proposed Optimization Procedure
5.1 Overview
In this section, we present the proposed optimization procedure to solve the problem (2). Our procedure is based on the scheme of ADMM.
For ease of presentation, we separate the regularizer term from the sum of structured hinge losses. Then, the problem in (2) can be re-formulated into the following equivalent form:

$$\min_{\mathbf{W}, \mathbf{Z}}\; \Omega(\mathbf{Z}) + \lambda \sum_{t=1}^{T} \ell_t(\mathbf{w}_t) \quad \text{s.t. } \mathbf{W} = \mathbf{Z}, \qquad (12)$$

where $\mathbf{Z}$ is an auxiliary variable.
The corresponding augmented Lagrangian function with respect to (12) is:

$$\mathcal{L}(\mathbf{W}, \mathbf{Z}, \boldsymbol{\Lambda}) = \Omega(\mathbf{Z}) + \lambda \sum_{t=1}^{T} \ell_t(\mathbf{w}_t) + \langle \boldsymbol{\Lambda}, \mathbf{W} - \mathbf{Z} \rangle + \frac{\rho}{2}\|\mathbf{W} - \mathbf{Z}\|_F^2, \qquad (13)$$

where $\boldsymbol{\Lambda}$ is the Lagrangian multiplier, $\langle \cdot, \cdot \rangle$ represents the inner product of two matrices (i.e., given matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size, $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{tr}(\mathbf{A}^{\top}\mathbf{B})$, where $\mathrm{tr}(\cdot)$ is the trace of a matrix), and $\rho$ is an adaptive penalty parameter.
Based on the ADMM scheme, the sketch of the proposed optimization procedure is shown in Algorithm 1, where in each iteration we alternately update $\mathbf{Z}$, $\mathbf{W}$ and $\boldsymbol{\Lambda}$ by minimizing the Lagrangian function in (13) with the other variables fixed; the multiplier is updated as $\boldsymbol{\Lambda} \leftarrow \boldsymbol{\Lambda} + \rho(\mathbf{W} - \mathbf{Z})$. Note that hereafter we use a superscript $k$ to represent the value of a variable in the $k$-th iteration.
Next, we will present the details of solving the sub-problems with respect to or , respectively, with other variables being fixed.
Algorithm 1 The proposed ADMM procedure for the structured MTL problem (2)

Input: training set, desired tolerance, maximal iteration number.
Output: weight matrix $\mathbf{W}$
1. Initialize: $\mathbf{W}^0$, $\boldsymbol{\Lambda}^0$.
2. Repeat:
3. Update $\mathbf{Z}$ by solving (15), (17) or (18) accordingly.
4. For $t = 1$ to $T$:
5. Update $\mathbf{w}_t$ by Algorithm 2.
6. End For
7. Update $\boldsymbol{\Lambda}$.
8. Until the stopping criterion is met or the iteration number reaches the maximum.
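The outer loop of Algorithm 1 can be sketched generically in a few lines of Python; here `prox_omega` stands in for the closed-form regularizer update and `solve_W_subproblem` for the task-wise solver (both are hypothetical placeholders, as are the function and variable names):

```python
import numpy as np

def admm_smtl(prox_omega, solve_W_subproblem, d, T, rho=1.0, tol=1e-4, max_iter=100):
    """Generic ADMM loop for min Omega(Z) + losses(W) s.t. W = Z (d x T matrices)."""
    W = np.zeros((d, T))
    Z = np.zeros((d, T))
    Lam = np.zeros((d, T))  # Lagrange multiplier
    for _ in range(max_iter):
        Z = prox_omega(W + Lam / rho, 1.0 / rho)     # regularizer sub-problem (closed form)
        W = solve_W_subproblem(Z - Lam / rho, rho)   # structured-loss sub-problem
        Lam = Lam + rho * (W - Z)                    # dual (multiplier) update
        if np.linalg.norm(W - Z) <= tol:             # primal residual as stopping test
            break
    return W
```

As a sanity check, plugging in an $\ell_1$ soft-thresholding for `prox_omega` and a simple quadratic loss for `solve_W_subproblem` recovers the usual lasso-style shrinkage of the target weights.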
5.2 Solving the Sub-Problem for $\mathbf{Z}$
In the $k$-th iteration (in the outer loop) of Algorithm 1, the sub-problem for $\mathbf{Z}$ with respect to (13) can be simplified as:

$$\min_{\mathbf{Z}}\; \Omega(\mathbf{Z}) + \frac{\rho}{2}\Big\|\mathbf{Z} - \Big(\mathbf{W} + \frac{1}{\rho}\boldsymbol{\Lambda}\Big)\Big\|_F^2 \qquad (14)$$
For different regularizers $\Omega$, the solution to (14) differs.
- For the $\ell_1$-norm, (14) is solved by entry-wise soft-thresholding, giving the update in (15).
- For the $\ell_{2,1}$-norm, (14) is solved by row-wise shrinkage, giving the update in (17).
- For the trace-norm, (14) is solved by singular value thresholding, giving the update in (18).
5.3 Solving the Sub-Problem for $\mathbf{W}$
5.3.1 Formulation
In the $k$-th outer iteration in Algorithm 1, the sub-problem for $\mathbf{W}$ with respect to (13) can be reformulated as:

$$\min_{\mathbf{W}}\; \lambda \sum_{t=1}^{T} \ell_t(\mathbf{w}_t) + \frac{\rho}{2}\Big\|\mathbf{W} - \Big(\mathbf{Z} - \frac{1}{\rho}\boldsymbol{\Lambda}\Big)\Big\|_F^2 \qquad (19)$$

To simplify presentation, we denote $\mathbf{B} = \mathbf{Z} - \frac{1}{\rho}\boldsymbol{\Lambda}$ with columns $\mathbf{b}_1, \dots, \mathbf{b}_T$. Then, the problem in (19) can be separated into $T$ independent sub-tasks:

$$\min_{\mathbf{w}_t}\; \lambda\, \ell_t(\mathbf{w}_t) + \frac{\rho}{2}\|\mathbf{w}_t - \mathbf{b}_t\|_2^2, \quad t = 1, \dots, T \qquad (20)$$
5.3.2 Fenchel Dual Form of (22)
In this subsection, we derive the Fenchel dual [36] form of (22). To simplify the presentation, we drop the task subscript $t$. Then we re-formulate the primal form in (22) as:

(23)

where the two component functions correspond to the smooth quadratic term and the structured hinge loss term, respectively.
Before deriving the dual form of (23), we first introduce the definition (Definition 1) and the main properties (Theorem 1 and 2) of Fenchel duality.
Definition 1.

The Fenchel conjugate of a function $f$ is defined as $f^*(\mathbf{u}) = \sup_{\mathbf{x}} \left\{ \langle \mathbf{x}, \mathbf{u} \rangle - f(\mathbf{x}) \right\}$.
Theorem 1.
(Fenchel-Young inequality; [5], Proposition 3.3.4) Any point $\mathbf{x}$ in the domain of a function $f$ and any point $\mathbf{u}$ in the domain of its conjugate $f^*$ satisfy the inequality:

$$f(\mathbf{x}) + f^*(\mathbf{u}) \geq \langle \mathbf{x}, \mathbf{u} \rangle \qquad (24)$$

The equality holds if and only if $\mathbf{u} \in \partial f(\mathbf{x})$.
Theorem 2.
(Fenchel duality inequality; see, e.g., Theorem 3.3.5 in [5]) Let $f$ and $g$ be two closed and convex functions, and $\mathbf{A}$ be a matrix. Then we have

$$\max_{\mathbf{u}}\; -f^*(\mathbf{A}^{\top}\mathbf{u}) - g^*(-\mathbf{u}) \;\leq\; \min_{\mathbf{x}}\; f(\mathbf{x}) + g(\mathbf{A}\mathbf{x}), \qquad (25)$$

where $f^*$ and $g^*$ are the Fenchel conjugates of $f$ and $g$, respectively. The equality holds under a mild regularity condition on the domains of $f$ and $g$.
Note that the right hand side of the inequality in (25) is called the primal form and the left hand side of (25) is the corresponding dual form.
With Definition 1, it is known (see, e.g., [38], Appendix B) that the Fenchel dual norm (i.e., the Fenchel conjugate) of the $\ell_2$-norm is also the $\ell_2$-norm. Hence, the Fenchel conjugate of the quadratic term is

(26)

It is also known ([38], Appendix B) how the Fenchel conjugate behaves under scaling and translation of a function. Then we can derive that the Fenchel conjugate of the smooth part is

(27)

In addition, the Fenchel conjugate of the loss term involves the indicator function $\delta(\cdot)$, where $\delta$ equals $0$ if its condition is true and $+\infty$ otherwise (see [38], Appendix B). Hence, by using (27), the Fenchel conjugate of the objective in (23) is:

(28)
5.3.3 Primal-Dual Algorithm via Coordinate Ascent
In this subsection, we develop a coordinate ascent algorithm to optimize the objective in (29), where we use the primal-dual gap as the early stopping criterion. Coordinate ascent is a widely-used method in various machine learning problems (e.g., [12, 38, 23, 45]).
Algorithm 2 Primal-dual algorithm via coordinate ascent

Input: task data, penalty parameter $\rho$, tolerance, maximal iteration number
Output: $\mathbf{w}_t$
1. Initialize the dual variables and the maintained auxiliary quantities.
2. Repeat:
3. Find the largest element in the gradient vector by solving (30) via Algorithm 3 for F-score (or Algorithm 4 for AUC).
4. Record the selected index $j$.
5. Form the corresponding coordinate direction $\mathbf{e}_j$.
6. Calculate $\delta$ by (37).
7. Update the first maintained quantity by (35).
8. Update the second maintained quantity by (36).
9. Until the primal-dual gap is below the tolerance or the iteration number reaches the maximum.
10. Return $\mathbf{w}_t$.
The proposed coordinate ascent algorithm is shown in Algorithm 2. Next, we sketch the main steps of the proposed algorithm in the following:
Repeat

- Select the index $j$ whose element in the gradient vector is the largest.
- Update the $j$-th dual coordinate with the other coordinates fixed, in a manner of greedily increasing the value of the dual objective.

Until the early stopping criterion is satisfied.
In each iteration, the proposed algorithm has three main building blocks:
The First Step is to select the index of the largest element in the gradient vector of the dual objective. Specifically, the gradient vector with respect to the dual variables is computed, and its largest element is selected, which amounts to solving:

(30)

Interestingly, the problem in (30) is essentially the same as the problem of "finding the most violated constraint" in Structured-SVMs (e.g., the problem (7) in [20]). For several commonly-used evaluation metrics, efficient polynomial-time algorithms have been proposed to solve this problem. One can directly use these inference algorithms to solve (30), i.e., to select the largest element of the gradient vector. For example, when $\Delta$ corresponds to the F-score, one can use Algorithm 2 in [20] to solve (30); when $\Delta$ corresponds to AUC, one can use Algorithm 3 in [20]. For self-containedness, we show these two algorithms with our notations in Algorithms 3 and 4, which have time complexities of $O(n_t^2)$ and $O(n_t \log n_t)$, respectively.
Algorithm 3 Algorithm to solve (30) with loss function defined on F-score

Input: true labels and per-sample scores for the $t$-th task
Output: the maximizer $\bar{\mathbf{y}}$ of (30)
1. Initialize: sort the positive samples by score in descending order; sort the negative samples by score in descending order.
2. For each number $a$ of positives labeled positive, do:
3. Label the $a$ top-scored positive samples as $+1$.
4. Set the remaining positive samples to $-1$.
5. For each number $b$ of negatives labeled positive, do:
6. Label the $b$ top-scored negative samples as $+1$.
7. Set the remaining negative samples to $-1$.
8. Compute the objective of (30) for the current labeling, where the loss is defined by (11).
9. If the objective is the largest so far, then:
10. Record the current labeling as $\bar{\mathbf{y}}$.
11. End if
12. End for
13. End for
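As a sanity check on such inference routines, the "most violated constraint" problem (30) can also be solved by brute force on tiny instances; a hedged sketch (the scoring vector `v` and the exhaustive search are illustrative, not the paper's polynomial-time method):

```python
from itertools import product

def f1(y, yb):
    """F1 between label tuples in {-1, +1}; 0 if there is no true positive."""
    tp = sum(1 for a, b in zip(y, yb) if a == 1 and b == 1)
    pp = sum(1 for b in yb if b == 1)
    tpos = sum(1 for a in y if a == 1)
    return 0.0 if tp == 0 else 2 * tp / (pp + tpos)

def most_violated(y, v):
    """argmax over y_bar of Delta(y, y_bar) + <v, y_bar>, with Delta = 1 - F1."""
    best, best_val = None, float("-inf")
    for yb in product([-1, 1], repeat=len(y)):  # exponential: only for tiny n
        val = (1.0 - f1(y, yb)) + sum(vi * yi for vi, yi in zip(v, yb))
        if val > best_val:
            best, best_val = yb, val
    return best
```

When the score term dominates, the maximizer follows the signs of `v`; when the scores vanish, the maximizer is a labeling with zero F1, which maximizes the loss term.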
Algorithm 4 Algorithm to solve (30) with loss function defined on AUC

Input: true labels and per-sample scores for the $t$-th task
Output: the maximizer $\bar{\mathbf{y}}$ of (30)
1. Initialize: assign an adjusted score to each positive sample and to each negative sample.
2. Sort the samples by their adjusted scores.
3. Initialize the counters of positives and negatives seen so far.
4. For each sample in the sorted order, do:
5. If the sample is positive, then:
6. Update its rank-based quantity.
7. Increment the positive counter.
8. else
9. Update its rank-based quantity.
10. Increment the negative counter.
11. End if
12. End for
13. Convert the resulting ranking to $\bar{\mathbf{y}}$ according to some threshold value.
The Second Step is to update the dual variables given the selected index $j$, with the other coordinates fixed. We define the update rule as:

$$\boldsymbol{\alpha} \leftarrow (1 - \delta)\,\boldsymbol{\alpha} + \delta\,\mathbf{e}_{j}, \qquad (31)$$

where $\delta \in [0, 1]$ and $\mathbf{e}_{j}$ denotes the vector with the $j$-th element being one and all other elements being zeros. It is worth noting that, given the non-negativity and normalization constraints on the dual variables before updating, this form of the rule in (31) guarantees that both constraints still hold after updating. Substituting (31) into the dual objective yields a one-dimensional problem in $\delta$:

(32)
Intuitively, our goal is to find the step size $\delta$ that increases the dual objective as much as possible. By setting the gradient of (32) with respect to $\delta$ to zero and applying simple algebra, we have

(33)

To ensure that $\delta \in [0, 1]$, we impose a further restriction on $\delta$:

(34)
The calculation of $\delta$ in (34) depends on two quantities defined as sums over all possible label assignments. Since the number of possible assignments grows exponentially with the number of samples, directly calculating either quantity is often unaffordable. In order to improve efficiency, we maintain auxiliary variables to reduce the computation cost: recall the quantity defined above; we define a second one analogously, and maintain both during the iterations.
With the update rule (31) for the dual variables, we can easily derive the corresponding update rules for the two maintained quantities:

(35)

(36)

The update rules in (35) and (36) take time linear in the size of the maintained quantities. With the maintained quantities, the update rule in (34) can be simplified to:

(37)

where the time complexity of the update in (37) is reduced accordingly.
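The bookkeeping trick can be illustrated with a running weighted sum: instead of recomputing $\sum_j \alpha_j \bar{\mathbf{y}}_j$ from scratch, the update rule of the form (31) lets us refresh it in place (names are ours, and the unit-sum constraint on `alpha` is assumed for illustration):

```python
import numpy as np

def coordinate_step(alpha, s, idx, delta, labelings):
    """One greedy coordinate step: alpha <- (1 - delta) * alpha + delta * e_idx.
    The maintained sum s = sum_j alpha[j] * labelings[j] is updated incrementally,
    avoiding a full pass over all labelings."""
    alpha = (1.0 - delta) * alpha
    alpha[idx] += delta
    # incremental refresh: s_new = (1 - delta) * s + delta * labelings[idx]
    s = (1.0 - delta) * s + delta * labelings[idx]
    return alpha, s
```

Because the convex-combination step only rescales the old iterate and adds one coordinate, the maintained sum obeys the same recursion, so no sum over the (exponentially many) labelings is ever re-evaluated.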
The early stopping criterion is defined based on the primal-dual gap, where the tolerance parameter is pre-defined. Assume $P^\star$ is the optimal value of the primal objective (23). According to Theorem 2, the dual objective value is always a lower bound of $P^\star$. It is worth noting that, by using the update rule (31) with $\delta \in [0, 1]$, Algorithm 2 guarantees that the dual variables satisfy their constraints in all of the iterations; in other words, the dual iterates remain feasible throughout. Hence, with (23) and (29), we have:

(38)

With Theorem 1, we have the Fenchel-Young bound, where the equality holds at the corresponding primal-dual pair. In order to greedily upper-bound the gap, we substitute the primal point induced by the current dual iterate into (38) and obtain:

(39)

Consequently, the early stopping criterion is set to be that this gap falls below the tolerance, and the gap can be calculated efficiently.
5.4 Convergence Analysis
For the sub-problem w.r.t. $\mathbf{W}$ (see Section 5.3), the proposed coordinate ascent method is similar to those in [38, 23]. By using similar proof techniques to those of [38, 23] (e.g., see the proofs of Theorem 1 in [23]), we can derive a bound on the duality gap after a given number of iterations of Algorithm 2, where the gap is measured against the optimal solutions of (29) and (23), respectively. Ideally, for all the tasks, if we set the iteration number to be sufficiently large, we can solve the sub-problem w.r.t. $\mathbf{W}$ exactly (by ignoring the small numerical errors).
In addition, as discussed in Section 5.2, the sub-problem w.r.t. $\mathbf{Z}$ can be solved exactly by closed-form solutions. Hence, the objective (12) is convex subject to linear constraints, and both of its sub-problems can be solved exactly. Based on existing theoretical results [6, 16], we have that Algorithm 1 converges to the global optimum with an $O(1/k)$ convergence rate.
6 Experiments
6.1 Overview
In this section, we evaluate and compare the performance of the proposed SMTL method on several benchmark datasets. For the regularizer $\Omega$ in (12), we consider three choices of matrix norms (including the trace-norm). For the evaluation metric used in (12), we consider the F-score (with $\beta = 1$) and AUC. These settings lead to six variants of SMTL.
Here we focus on MTL for classification. Given a specific regularizer, we choose these methods as baselines: (1) single-task structured SVM that directly optimizes AUC (StructSVM) [20]; we train it on each of the individual tasks and average the results; (2) MTL with hinge loss for classification (MTL-CLS); (3) MTL with least squares loss for regression (MTL-REG); (4) RAkEL, a meta algorithm using random $k$-label sets [41]; (5) MLCSSP, a method spanning the original label space by a subset of labels [4]; (6) AdaBoostMH, a method based on AdaBoost [37]; (7) HOMER, a method based on the hierarchy of multi-label learners [40]; (8) BR, the binary relevance method [42]; (9) LP, the label power-set method [42]; (10) ECC, the ensembles of classifier chains method [35]. Note that a classification problem can be regarded as a regression problem.² For a binary classification dataset in which each positive example has label $+1$ and each negative example has label $-1$, one can regard these labels as real numbers. Then, this dataset can be used in an MTL method for regression to learn a regressor. After obtaining the regressor, for a test example, if the predicted value is larger than $0$, one can regard it as a positive example; otherwise, one can regard it as a negative example.
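The classification-as-regression reduction from the footnote can be sketched directly (the $\pm 1$ coding and the threshold $0$ follow the usual convention; the class names are illustrative):

```python
def labels_to_targets(labels):
    """Encode binary class labels as regression targets +1 / -1."""
    return [1.0 if lab == "pos" else -1.0 for lab in labels]

def predict_class(regressor_output, threshold=0.0):
    """Threshold the regressor's real-valued prediction back to a class."""
    return "pos" if regressor_output > threshold else "neg"
```

This is exactly how the MTL-REG baselines are applied to the classification benchmarks: train on the $\pm 1$ targets, then threshold the real-valued outputs.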
The proposed methods and the baselines MTL-CLS and MTL-REG were implemented with Python 2.7. For MTL-REG, our implementations are based on the algorithms in [28] (for the norm) and [17] (for the trace norm). According to Theorem 3 in [20], the problem of MTL-CLS is equivalent to a special form of SMTL in (2), where the loss counts the number of indices on which the prediction disagrees with the true label. Hence, our implementation of MTL-CLS is based on the framework of Algorithm 1. For StructSVM, we use the open-source implementation of SVM-Perf [20]. All the experiments were conducted on a Dell PowerEdge R320 server with 16 GB of memory and a 1.9 GHz E5-2420 CPU.
We report the experimental results on real-world datasets. The statistics of these datasets are summarized in Table I. In the Emotions dataset, the labels are kinds of emotions, and the features are rhythmic and timbre features extracted from music wave files. In the Yeast dataset, the labels are localization sites of proteins, and the features are protein properties. In the Flags dataset, the labels are religions of countries, and the features are extracted from flag images. In the Cal500 dataset, the labels are semantic descriptions of popular songs, and the features are extracted from audio data. In the Segmentation dataset, the labels are the contents of image regions, and the features are pixel properties of the image regions. In the Optdigits dataset, the labels are handwritten digits 0 to 9, and the features are pixels. In the MediaMill dataset, the labels are semantic concepts of each video, and the features are extracted from videos. In the TMC2007 dataset, the labels are the document topics, and the features are discrete attributes about terms. In the Scene dataset, the labels are scene types, and the features are spatial color moments in LUV space. All of these datasets are normalized.
| Dataset | Type | Features | Samples | Tasks |
|---|---|---|---|---|
| Emotions | music | 72 | 593 | 6 |
| Yeast | gene | 103 | 2417 | 14 |
| Flags | image | 19 | 194 | 7 |
| Cal500 | songs | 68 | 502 | 174 |
| Segmentation | image | 19 | 2310 | 7 |
| Optdigits | image | 64 | 5620 | 10 |
| MediaMill | multimedia | 120 | 10000 | 12 |
| TMC2007 | text | 500 | 10000 | 6 |
| Scene | image | 294 | 2407 | 6 |
| METHOD | MACRO F-score | MICRO F-score | Average AUC |
|---|---|---|---|
| **Cal500** | | | |
| SMTL(+AUC) | 21.722±0.456 | 38.452±0.610 | 56.505±0.511 |
| SMTL(+) | 21.495±0.232 | 40.127±0.173 | 53.690±0.293 |
| MTL-CLS() | 13.157±0.449 | 37.357±0.180 | 55.764±0.820 |
| MTL-REG() | 12.500±0.129 | 36.438±0.176 | 52.964±0.758 |
| SMTL(+AUC) | 21.721±0.807 | 35.52±0.811 | 56.716±0.500 |
| SMTL(+) | 21.138±0.191 | 38.386±0.456 | 53.358±0.827 |
| MTL-CLS() | 12.176±0.445 | 37.387±0.845 | 56.316±0.216 |
| MTL-REG() | 12.447±0.297 | 36.66±0.638 | 53.628±0.264 |
| SMTL(TraceNorm+AUC) | 21.772±0.545 | 35.204±0.585 | 56.798±0.358 |
| SMTL(TraceNorm+) | 21.768±0.333 | 38.559±0.394 | 54.987±0.823 |
| MTL-CLS(TraceNorm) | 12.884±0.353 | 37.402±0.501 | 55.635±0.511 |
| MTL-REG(TraceNorm) | 8.348±0.999 | 34.832±0.698 | 55.69±0.636 |
| StructSVM | 20.864±1.150 | 35.408±1.150 | 51.427±0.841 |
| RAkEL | 20.628±0.611 | 33.689±0.843 | 54.637±0.656 |
| MLCSSP | 21.677±0.514 | 27.093±0.537 | 52.69±0.983 |
| AdaBoostMH | 0.923±0.274 | 6.492±0.146 | 50.734±0.538 |
| HOMER | 13.850±0.163 | 30.332±1.313 | 52.461±0.937 |
| BR | 17.094±0.634 | 33.619±0.375 | 50.563±0.153 |
| LP | 15.257±0.428 | 32.978±0.668 | 52.117±0.685 |
| ECC | 9.600±0.666 | 34.789±0.482 | 52.117±0.625 |
| **Segmentation** | | | |
| SMTL(+AUC) | 72.832±1.567 | 68.445±1.543 | 97.195±0.4549 |
| SMTL(+) | 85.61±1.304 | 84.149±1.684 | 96.967±0.647 |
| MTL-CLS() | 85.114±1.946 | 84.228±4.508 | 96.93±0.560 |
| MTL-REG() | 75.547±1.215 | 81.702±2.456 | 96.757±0.645 |
| SMTL(+AUC) | 73.378±1.564 | 68.424±1.787 | 97.527±0.286 |
| SMTL(+) | 85.105±1.830 | 83.693±1.192 | 96.757±0.192 |
| MTL-CLS() | 83.712±3.513 | 82.518±4.003 | 96.781±0.828 |
| MTL-REG() | 76.253±2.564 | 82.606±0.156 | 96.798±0.231 |
| SMTL(TraceNorm+AUC) | 72.265±1.453 | 67.655±1.978 | 97.134±0.457 |
| SMTL(TraceNorm+) | 85.356±1.092 | 83.462±1.805 | 96.863±0.322 |
| MTL-CLS(TraceNorm) | 82.703±3.865 | 82.150±5.439 | 96.705±0.612 |
| MTL-REG(TraceNorm) | 76.602±1.286 | 82.805±1.877 | 96.698±0.147 |
| StructSVM | 44.632±1.828 | 53.992±1.828 | 89.355±0.311 |
| RAkEL | 75.592±0.243 | 70.980±0.398 | 91.333±0.082 |
| MLCSSP | 79.821±8.533 | 78.923±14.036 | 93.810±0.329 |
| AdaBoostMH | 75.633±0.209 | 71.018±0.376 | 96.148±0.089 |
| HOMER | 72.920±2.505 | 69.969±1.651 | 91.225±1.543 |
| BR | 84.236±0.638 | 78.796±0.708 | 96.870±0.194 |
| LP | 84.394±0.603 | 83.411±0.615 | 96.240±0.124 |
| ECC | 84.183±0.550 | 82.942±0.542 | 96.782±0.269 |
| **Optdigits** | | | |
| SMTL(+AUC) | 92.722±0.595 | 92.734±0.712 | 99.657±0.0528 |
| SMTL(+) | 93.963±0.164 | 93.964±0.235 | 99.589±0.054 |
| MTL-CLS() | 93.701±0.403 | 92.773±0.440 | 99.206±0.044 |
| MTL-REG() | 88.901±0.306 | 89.268±0.875 | 99.32±0.089 |
| SMTL(+AUC) | 92.526±0.624 | 92.213±0.670 | 99.653±0.078 |
| SMTL(+) | 93.692±0.508 | 94.626±0.520 | 99.554±0.047 |
| MTL-CLS() | 92.961±0.608 | 94.009±0.356 | 98.658±0.067 |
| MTL-REG() | 88.762±0.845 | 89.203±0.865 | 99.269±0.045 |
| SMTL(TraceNorm+AUC) | 92.862±0.543 | 92.802±0.944 | 99.654±0.036 |
| SMTL(TraceNorm+) | 94.206±0.202 | 94.139±0.266 | 99.566±0.027 |
| MTL-CLS(TraceNorm) | 93.701±0.435 | 93.773±0.267 | 99.182±0.065 |
| MTL-REG(TraceNorm) | 88.777±0.765 | 89.173±0.946 | 99.293±0.048 |
| StructSVM | 36.276±0.905 | 38.289±2.218 | 98.400±0.366 |
| RAkEL | 82.450±0.168 | 80.967±0.311 | 94.543±0.070 |
| MLCSSP | 75.191±2.245 | 82.129±3.195 | 88.879±0.195 |
| AdaBoostMH | 93.083±0.695 | 93.108±0.669 | 98.594±0.119 |
| HOMER | 74.869±4.151 | 75.713±3.663 | 93.391±0.964 |
| BR | 92.625±0.348 | 92.714±0.383 | 99.370±0.122 |
| LP | 88.875±0.212 | 88.915±0.269 | 94.941±0.329 |
| ECC | 93.043±0.206 | 94.019±0.213 | 99.019±0.156 |
MACRO | MICRO | Average \bigstrut[t] | |
METHOD | AUC \bigstrut[b] | ||
Scene \bigstrut | |||
SMTL(+AUC) | 54.013 ± 1.124 | 54.746 ± 1.231 | 89.99 ± 0.820 |
SMTL(+) | 55.787 ± 0.756 | 56.434 ± 0.567 | 87.652 ± 0.280 |
MTL-CLS() | 54.722 ± 1.590 | 54.508 ± 1.176 | 86.738 ± 1.102 |
MTL-REG() | 51.157 ± 0.343 | 52.810 ± 0.345 | 85.194 ± 0.712 |
SMTL(+AUC) | 54.296 ± 0.977 | 54.333 ± 0.025 | 88.358 ± 0.467 |
SMTL(+) | 55.501 ± 1.92 | 56.007 ± 2.34 | 87.364 ± 1.801 |
MTL-CLS() | 54.387 ± 0.730 | 54.805 ± 1.488 | 85.952 ± 1.116 |
MTL-REG() | 50.748 ± 0.546 | 51.280 ± 0.619 | 85.032 ± 0.779 |
SMTL(TraceNorm+AUC) | 54.227 ± 0.660 | 55.384 ± 0.804 | 88.421 ± 1.103 |
SMTL(TraceNorm+) | 55.396 ± 1.089 | 56.304 ± 1.119 | 87.071 ± 0.682 |
MTL-CLS(TraceNorm) | 55.104 ± 0.298 | 55.481 ± 0.506 | 86.205 ± 0.471 |
MTL-REG(TraceNorm) | 50.832 ± 0.226 | 51.236 ± 0.264 | 85.275 ± 0.852 |
StructSVM | 49.826 ± 0.815 | 49.951 ± 0.755 | 82.375 ± 0.393 |
RAkEL | 54.592 ± 0.613 | 55.719 ± 0.565 | 78.981 ± 0.535 |
MLCSSP | 42.764 ± 0.080 | 47.178 ± 0.181 | 65.830 ± 2.240 |
AdaBoostMH | 36.506 ± 0.404 | 40.681 ± 0.449 | 87.617 ± 0.470 |
HOMER | 60.980 ± 2.470 | 58.251 ± 2.592 | 80.744 ± 0.360 |
BR | 54.579 ± 1.813 | 55.019 ± 1.843 | 82.888 ± 1.164 |
LP | 54.902 ± 1.503 | 55.818 ± 1.595 | 75.900 ± 1.362 |
ECC | 55.347 ± 0.893 | 55.831 ± 0.881 | 88.153 ± 0.298 |
MediaMill | | |
SMTL(+AUC) | 18.030 ± 0.294 | 22.058 ± 0.257 | 66.068 ± 0.426 |
SMTL(+) | 22.851 ± 5.093 | 56.424 ± 2.761 | 78.705 ± 2.280 |
MTL-CLS() | 10.613 ± 1.733 | 55.441 ± 3.647 | 76.216 ± 2.474 |
MTL-REG() | 6.366 ± 0.065 | 55.515 ± 0.465 | 53.867 ± 0.496 |
SMTL(+AUC) | 18.012 ± 0.286 | 22.232 ± 0.211 | 65.405 ± 0.503 |
SMTL(+) | 22.386 ± 5.326 | 56.169 ± 2.436 | 78.907 ± 1.854 |
MTL-CLS() | 8.542 ± 1.672 | 55.838 ± 2.229 | 74.037 ± 1.219 |
MTL-REG() | 6.393 ± 0.033 | 55.687 ± 0.439 | 53.036 ± 0.181 |
SMTL(TraceNorm+AUC) | 18.201 ± 0.221 | 22.684 ± 0.354 | 66.847 ± 1.015 |
SMTL(TraceNorm+) | 27.973 ± 3.006 | 56.031 ± 4.924 | 79.730 ± 1.850 |
MTL-CLS(TraceNorm) | 15.800 ± 0.589 | 50.098 ± 5.569 | 75.968 ± 2.144 |
MTL-REG(TraceNorm) | 6.380 ± 0.045 | 55.333 ± 0.425 | 53.825 ± 0.493 |
StructSVM | 17.847 ± 0.318 | 22.030 ± 0.284 | 64.761 ± 0.487 |
RAkEL | 19.874 ± 0.156 | 26.686 ± 0.189 | 63.241 ± 0.398 |
MLCSSP | 15.129 ± 0.633 | 20.124 ± 0.723 | 52.473 ± 1.884 |
AdaBoostMH | 17.939 ± 0.469 | 41.991 ± 0.425 | 61.914 ± 0.167 |
HOMER | 17.939 ± 0.469 | 41.991 ± 0.425 | 61.914 ± 0.167 |
BR | 19.769 ± 0.196 | 26.515 ± 0.166 | 69.032 ± 0.854 |
LP | 24.135 ± 0.959 | 50.170 ± 0.402 | 60.597 ± 0.502 |
ECC | 24.879 ± 0.590 | 56.214 ± 0.363 | 78.067 ± 0.705 |
TMC2007 | | |
SMTL(+AUC) | 59.432 ± 0.581 | 68.02 ± 1.042 | 90.138 ± 0.17 |
SMTL(+) | 64.321 ± 0.955 | 74.159 ± 0.255 | 90.561 ± 0.669 |
MTL-CLS() | 60.517 ± 1.363 | 71.284 ± 0.387 | 88.382 ± 0.398 |
MTL-REG() | 37.106 ± 0.416 | 70.181 ± 0.221 | 85.218 ± 0.529 |
SMTL(+AUC) | 60.249 ± 0.147 | 67.654 ± 0.234 | 90.441 ± 0.077 |
SMTL(+) | 65.436 ± 1.239 | 73.984 ± 0.533 | 90.238 ± 0.732 |
MTL-CLS() | 62.919 ± 0.802 | 72.745 ± 0.464 | 89.074 ± 0.59 |
MTL-REG() | 37.709 ± 0.32 | 70.431 ± 0.414 | 86.612 ± 0.592 |
SMTL(TraceNorm+AUC) | 58.595 ± 0.148 | 68.056 ± 0.45 | 88.325 ± 0.182 |
SMTL(TraceNorm+) | 61.867 ± 1.014 | 72.588 ± 0.350 | 89.328 ± 0.815 |
MTL-CLS(TraceNorm) | 59.752 ± 0.951 | 71.863 ± 0.628 | 87.933 ± 0.428 |
MTL-REG(TraceNorm) | 36.64 ± 0.314 | 70.118 ± 0.437 | 84.54 ± 0.743 |
StructSVM | 37.19 ± 0.652 | 45.027 ± 0.601 | 88.072 ± 0.289 |
RAkEL | 57.331 ± 0.592 | 69.813 ± 0.179 | 81.994 ± 0.134 |
MLCSSP | 56.717 ± 0.790 | 60.417 ± 1.665 | 75.246 ± 1.093 |
AdaBoostMH | 15.170 ± 1.893 | 56.004 ± 1.103 | 61.466 ± 0.206 |
HOMER | 61.144 ± 0.238 | 71.429 ± 0.104 | 84.998 ± 0.589 |
BR | 51.939 ± 1.225 | 67.873 ± 0.374 | 84.616 ± 0.528 |
LP | 52.683 ± 0.832 | 62.672 ± 0.526 | 73.063 ± 0.637 |
ECC | 58.368 ± 0.714 | 68.223 ± 0.096 | 86.287 ± 0.664 |
METHOD | Macro F1 | Micro F1 | Average AUC |
Emotions | | |
SMTL(+AUC) | 65.498 ± 2.047 | 67.067 ± 1.956 | 83.378 ± 0.466 |
SMTL(+) | 66.244 ± 1.584 | 66.358 ± 1.255 | 81.986 ± 0.495 |
MTL-CLS() | 63.343 ± 1.688 | 65.684 ± 1.327 | 80.065 ± 0.490 |
MTL-REG() | 62.621 ± 1.543 | 63.701 ± 1.054 | 81.32 ± 0.396 |
SMTL(+AUC) | 65.622 ± 1.984 | 67.143 ± 1.629 | 83.358 ± 0.345 |
SMTL(+) | 67.696 ± 0.348 | 67.923 ± 0.578 | 83.106 ± 0.596 |
MTL-CLS() | 64.969 ± 0.822 | 66.584 ± 1.049 | 80.03 ± 0.574 |
MTL-REG() | 62.976 ± 0.547 | 64.404 ± 1.535 | 81.811 ± 0.587 |
SMTL(TraceNorm+AUC) | 65.902 ± 1.904 | 67.405 ± 1.848 | 83.362 ± 0.618 |
SMTL(TraceNorm+) | 67.600 ± 0.574 | 67.858 ± 0.984 | 83.000 ± 0.236 |
MTL-CLS(TraceNorm) | 63.805 ± 2.339 | 66.602 ± 2.063 | 80.485 ± 0.597 |
MTL-REG(TraceNorm) | 63.243 ± 1.574 | 64.869 ± 2.574 | 82.834 ± 0.266 |
StructSVM | 46.367 ± 5.531 | 49.902 ± 19.032 | 62.908 ± 4.361 |
RAkEL | 64.998 ± 1.387 | 65.835 ± 1.136 | 75.206 ± 0.875 |
MLCSSP | 62.980 ± 2.780 | 63.593 ± 2.603 | 76.054 ± 2.495 |
AdaBoostMH | 4.291 ± 1.429 | 7.577 ± 2.627 | 55.111 ± 0.328 |
HOMER | 59.039 ± 2.431 | 61.830 ± 1.642 | 71.212 ± 1.167 |
BR | 61.358 ± 2.578 | 62.635 ± 2.332 | 79.146 ± 1.250 |
LP | 53.384 ± 1.858 | 54.618 ± 1.543 | 68.506 ± 0.652 |
ECC | 62.694 ± 1.645 | 64.138 ± 1.216 | 82.589 ± 1.131 |
Yeast | | |
SMTL(+AUC) | 43.593 ± 1.120 | 46.261 ± 0.872 | 63.018 ± 1.504 |
SMTL(+) | 44.353 ± 1.080 | 55.451 ± 0.457 | 61.285 ± 1.246 |
MTL-CLS() | 36.308 ± 0.974 | 43.908 ± 0.499 | 56.686 ± 0.539 |
MTL-REG() | 28.187 ± 1.544 | 47.029 ± 0.645 | 62.757 ± 1.745 |
SMTL(+AUC) | 43.132 ± 1.349 | 45.729 ± 1.643 | 62.626 ± 1.709 |
SMTL(+) | 44.647 ± 1.058 | 54.971 ± 1.187 | 61.569 ± 1.945 |
MTL-CLS() | 36.89 ± 0.699 | 44.620 ± 0.553 | 58.221 ± 0.424 |
MTL-REG() | 33.720 ± 1.634 | 54.682 ± 1.846 | 50.050 ± 1.563 |
SMTL(TraceNorm+AUC) | 43.58 ± 1.046 | 46.395 ± 1.067 | 63.058 ± 0.634 |
SMTL(TraceNorm+) | 44.972 ± 0.765 | 50.471 ± 0.968 | 61.819 ± 0.395 |
MTL-CLS(TraceNorm) | 42.275 ± 1.006 | 44.542 ± 0.460 | 61.528 ± 0.590 |
MTL-REG(TraceNorm) | 28.178 ± 1.043 | 47.046 ± 0.126 | 62.920 ± 0.326 |
StructSVM | 42.669 ± 2.48 | 46.298 ± 2.048 | 61.894 ± 2.488 |
RAkEL | 44.101 ± 0.389 | 46.086 ± 0.450 | 61.971 ± 0.753 |
MLCSSP | 41.511 ± 0.837 | 46.200 ± 1.272 | 50.756 ± 0.451 |
AdaBoostMH | 12.255 ± 0.041 | 48.144 ± 0.315 | 50.805 ± 0.050 |
HOMER | 40.054 ± 1.063 | 53.745 ± 0.867 | 62.311 ± 1.265 |
BR | 39.209 ± 0.891 | 54.153 ± 0.543 | 62.375 ± 0.408 |
LP | 37.029 ± 0.584 | 53.059 ± 0.514 | 56.616 ± 1.394 |
ECC | 37.523 ± 0.310 | 54.632 ± 0.325 | 62.105 ± 0.627 |
Flags | | |
SMTL(+AUC) | 60.473 ± 1.951 | 61.666 ± 2.226 | 73.875 ± 2.563 |
SMTL(+) | 70.279 ± 1.744 | 75.047 ± 0.945 | 75.000 ± 0.745 |
MTL-CLS() | 65.233 ± 1.930 | 71.709 ± 0.955 | 72.928 ± 1.479 |
MTL-REG() | 66.073 ± 0.276 | 73.005 ± 1.307 | 71.429 ± 1.105 |
SMTL(+AUC) | 60.187 ± 1.971 | 61.618 ± 1.714 | 74.136 ± 2.805 |
SMTL(+) | 69.122 ± 1.975 | 74.259 ± 1.378 | 74.168 ± 1.513 |
MTL-CLS() | 65.532 ± 1.210 | 72.666 ± 1.752 | 72.725 ± 0.497 |
MTL-REG() | 65.256 ± 0.739 | 72.246 ± 0.928 | 71.299 ± 0.998 |
SMTL(TraceNorm+AUC) | 61.435 ± 1.616 | 62.84 ± 1.481 | 74.367 ± 2.373 |
SMTL(TraceNorm+) | 68.704 ± 1.650 | 73.132 ± 1.891 | 73.145 ± 1.973 |
MTL-CLS(TraceNorm) | 65.236 ± 3.507 | 72.688 ± 2.156 | 73.307 ± 2.155 |
MTL-REG(TraceNorm) | 65.257 ± 2.647 | 72.437 ± 1.918 | 71.495 ± 0.783 |
StructSVM | 55.683 ± 5.777 | 51.957 ± 2.048 | 72.178 ± 3.604 |
RAkEL | 60.696 ± 5.216 | 64.749 ± 4.688 | 61.260 ± 3.805 |
MLCSSP | 59.629 ± 1.619 | 63.215 ± 1.326 | 55.865 ± 1.909 |
AdaBoostMH | 56.457 ± 4.288 | 71.268 ± 1.400 | 69.329 ± 2.043 |
HOMER | 59.018 ± 1.269 | 63.855 ± 2.259 | 64.826 ± 0.569 |
BR | 59.421 ± 2.163 | 67.287 ± 1.876 | 66.823 ± 2.860 |
LP | 61.801 ± 3.822 | 69.132 ± 3.200 | 60.540 ± 4.149 |
ECC | 64.936 ± 3.023 | 72.715 ± 1.675 | 73.913 ± 2.339 |
Following the settings in [9], we use AUC, Macro F1-score, and Micro F1-score as the evaluation metrics; the details of computing AUC can be found in Section 4. In MTL, the Macro score is computed by first calculating the score of each individual task and then averaging these scores over all tasks, while the Micro score is computed from the statistics pooled over all tasks.
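The macro/micro distinction above can be made concrete in code. The sketch below follows the standard definitions (per-task F1 averaged for Macro; TP/FP/FN pooled over tasks for Micro) rather than any implementation released by the authors; all function names are ours.

```python
from typing import List, Sequence

def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0

def task_counts(y_true: Sequence[int], y_pred: Sequence[int]):
    """True-positive, false-positive, and false-negative counts for one task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn

def macro_f1(trues: List[Sequence[int]], preds: List[Sequence[int]]) -> float:
    # Macro: compute F1 per task, then average over tasks.
    return sum(f1_from_counts(*task_counts(t, p))
               for t, p in zip(trues, preds)) / len(trues)

def micro_f1(trues: List[Sequence[int]], preds: List[Sequence[int]]) -> float:
    # Micro: pool TP/FP/FN over all tasks, then compute a single F1.
    tp = fp = fn = 0
    for t, p in zip(trues, preds):
        a, b, c = task_counts(t, p)
        tp, fp, fn = tp + a, fp + b, fn + c
    return f1_from_counts(tp, fp, fn)
```

Macro weights every task equally, so rare-positive tasks count as much as common ones; Micro weights every prediction equally, which is why the two scores can diverge sharply on imbalanced datasets such as MediaMill.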
For each dataset, we first generate 10 random partitions; in each partition, one part is used as the training set and the other as the test set. We then run each method (the baselines and the proposed methods) on these 10 partitions and report the results averaged over the trials. Note that, for a fair comparison, on each dataset every method uses the same ten partitions. Once the training set is fixed, we conduct 10-fold cross validation on it to choose the trade-off parameter.
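The partitioning protocol above can be sketched as follows. The train/test split ratio is not specified in the extracted text, so `train_frac=0.6` below is an illustrative assumption; fixing the random seed is what gives every method the same ten partitions.

```python
import random

def random_partitions(n_samples, n_partitions=10, train_frac=0.6, seed=0):
    """Yield (train_idx, test_idx) pairs over a dataset of n_samples points.
    All methods should reuse the same seed so they see identical partitions."""
    rng = random.Random(seed)
    n_train = round(train_frac * n_samples)
    for _ in range(n_partitions):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]

def kfold(indices, k=10):
    """Split a training index list into k folds for cross-validation:
    fold i is held out for validation, the rest are used for fitting."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        fit = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield fit, val
```

The trade-off parameter is then chosen by running `kfold` on each partition's training indices and keeping the value with the best averaged validation score.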
In Algorithm 2, we fix the maximum number of iterations and the optimization tolerance.



6.2 Results on real-world datasets
The evaluation results w.r.t. Micro F1, Macro F1, and AUC (with standard deviations) of the proposed SMTL are shown in Tables II, III, and IV. As can be seen, under the same regularizer, the proposed SMTL variants that optimize the F1-score or AUC show superior performance gains over the baselines. In most cases, the SMTL variant that optimizes a specific metric achieves the best results on that metric. Here are some statistics. On the Yeast dataset, SMTL(+) achieves the best Macro F1, a relative increase over the best MTL baseline MTL-CLS(); SMTL(+) achieves the best Micro F1, a relative increase over the best MTL baseline MTL-REG(); and SMTL(+AUC) achieves the best averaged AUC, a relative increase over the best MTL baseline MTL-CLS(). On the Emotions dataset, the proposed SMTL(+) achieves a relative increase in Macro F1 over the best MTL baseline MTL-CLS(); SMTL(+) achieves a relative increase in AUC over the best MTL baseline MTL-CLS(); and SMTL(TraceNorm+) achieves a relative increase in Macro F1 over the best MTL baseline MTL-CLS(TraceNorm). On the Cal500 dataset, SMTL(+AUC) achieves a relative increase in Macro F1 over MTL-REG(), and SMTL(+) achieves a relative increase in Micro F1 over MTL-CLS().
In addition, we conduct t-tests and Wilcoxon's signed-rank tests [43] on the datasets to investigate whether the improvements of the SMTL methods over the baselines are statistically significant. The p-values of the t-tests are shown in Tables V and VI, and the p-values of the Wilcoxon tests are shown in Tables VII and VIII. As can be seen, most of the p-values are smaller than 0.05, which indicates that the improvements are statistically significant. These results verify the effectiveness of directly optimizing evaluation metrics in MTL problems.
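The statistics underlying the two tests can be sketched from first principles as follows (in practice one would simply call `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` on the per-trial scores). The sketch computes the paired t statistic and the Wilcoxon signed-rank statistic W; converting them into the p-values reported in Tables V-VIII additionally requires the t and signed-rank reference distributions, which are omitted here.

```python
import math

def paired_t_statistic(a, b):
    """t = mean(d) / (std(d) / sqrt(n)) over per-trial differences d = a - b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # unbiased sample variance
    return mean / math.sqrt(var / n)

def wilcoxon_w(a, b):
    """Wilcoxon signed-rank statistic W: rank |d| ascending (zero differences
    dropped, tied |d| given their average rank); W is the smaller of the
    positive-rank and negative-rank sums."""
    d = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average the ranks of a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_pos = sum(r for r, x in zip(ranks, d) if x > 0)
    w_neg = sum(r for r, x in zip(ranks, d) if x < 0)
    return min(w_pos, w_neg)
```

When one method beats the other on all 10 trials, W is 0 (its smallest possible value), which is the situation behind the many identical small Wilcoxon p-values in Tables VII and VIII.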
Two methods for comparison | Optdigits | TMC2007 | MediaMill | Segmentation
Average AUC | | | |
: SMTL(AUC) vs. MTL-CLS | 4.86E-07 | 1.49E-13 | 2.12E-02 | 4.74E-03 |
: SMTL(AUC) vs. MTL-REG | 2.70E-12 | 1.44E-18 | 6.58E-01 | 4.30E-03 |
Trace: SMTL(AUC) vs. MTL-CLS | 6.88E-03 | 5.20E-03 | 2.12E-02 | 4.74E-03 |
Trace: SMTL(AUC) vs. MTL-REG | 3.85E-12 | 4.59E-11 | 6.61E-01 | 4.25E-03 |
: SMTL(AUC) vs. MTL-CLS | 5.71E-03 | 2.95E-05 | 5.52E-03 | 4.75E-03 |
: SMTL(AUC) vs. MTL-REG | 1.46E-12 | 1.92E-12 | 1.57E-08 | 4.35E-03 |
Trace: SMTL(AUC) vs. RAkEL | 1.87E-26 | 1.50E-14 | 2.79E-13 | 3.27E-14 |
Trace: SMTL(AUC) vs. MLCSSP | 6.02E-10 | 1.05E-14 | 9.63E-15 | 3.24E-01 |
Trace: SMTL(AUC) vs. AdaBoostMH | 2.36E-04 | 5.42E-20 | 4.44E-08 | 3.05E-14 |
Trace: SMTL(AUC) vs. HOMER | 5.23E-12 | 6.97E-09 | 4.65E-08 | 1.04E-12 |
Trace: SMTL(AUC) vs. BR | 2.91E-10 | 1.31E-16 | 2.43E-13 | 5.14E-07 |
Trace: SMTL(AUC) vs. LP | 6.82E-22 | 1.41E-20 | 1.44E-03 | 9.30E-01 |
Trace: SMTL(AUC) vs. ECC | 8.16E-03 | 9.67E-19 | 7.52E-03 | 6.19E-03 |
Micro F1 | | | |
: SMTL() vs. MTL-CLS | 8.37E-02 | 8.37E-02 | 3.98E-10 | 4.89E-02 |
: SMTL() vs. MTL-REG | 3.28E-20 | 3.28E-20 | 2.54E-18 | 1.24E-02 |
Trace: SMTL() vs. MTL-CLS | 4.68E-03 | 4.68E-03 | 4.00E-10 | 4.96E-02 |
Trace: SMTL() vs. MTL-REG | 3.01E-14 | 3.01E-14 | 2.30E-18 | 4.92E-03 |
: SMTL() vs. MTL-CLS | 9.54E-03 | 9.54E-03 | 4.19E-10 | 4.75E-01 |
: SMTL() vs. MTL-REG | 6.16E-12 | 6.16E-12 | 2.56E-18 | 1.03E-01 |
Trace: SMTL() vs. RAkEL | 7.30E-33 | 4.64E-25 | 4.93E-09 | 9.28E-19 |
Trace: SMTL() vs. MLCSSP | 2.28E-30 | 1.90E-18 | 3.28E-14 | 2.90E-13 |
Trace: SMTL() vs. AdaBoostMH | 4.53E-16 | 9.97E-35 | 9.38E-12 | 3.65E-06 |
Trace: SMTL() vs. HOMER | 5.37E-14 | 1.13E-12 | 9.68E-12 | 8.31E-10 |
Trace: SMTL() vs. BR | 1.61E-06 | 3.09E-14 | 5.20E-05 | 9.79E-03 |
Trace: SMTL() vs. LP | 3.94E-20 | 9.73E-24 | 1.06E-12 | 1.39E-05 |
Trace: SMTL() vs. ECC | 3.99E-07 | 2.76E-08 | 1.75E-16 | 1.45E-01 |
Macro F1 | | | |
: SMTL() vs. MTL-CLS | 4.09E-21 | 1.61E-10 | 3.98E-10 | 4.09E-02 |
: SMTL() vs. MTL-REG | 1.47E-26 | 3.09E-16 | 2.54E-18 | 2.98E-12 |
Trace: SMTL() vs. MTL-CLS | 1.04E-21 | 1.82E-02 | 4.00E-10 | 4.13E-02 |
Trace: SMTL() vs. MTL-REG | 3.85E-19 | 5.87E-12 | 2.30E-18 | 3.19E-12 |
: SMTL() vs. MTL-CLS | 7.04E-22 | 9.26E-07 | 4.19E-10 | 4.04E-02 |
: SMTL() vs. MTL-REG | 1.94E-24 | 8.74E-14 | 2.56E-18 | 2.57E-12 |
Trace: SMTL() vs. RAkEL | 6.35E-29 | 3.99E-10 | 4.93E-09 | 3.08E-16 |
Trace: SMTL() vs. MLCSSP | 6.50E-16 | 2.29E-10 | 3.28E-14 | 4.68E-02 |
Trace: SMTL() vs. AdaBoostMH | 1.21E-04 | 2.99E-23 | 9.38E-12 | 3.17E-16 |
Trace: SMTL() vs. HOMER | 1.78E-11 | 4.33E-02 | 9.68E-12 | 2.55E-11 |
Trace: SMTL() vs. BR | 1.28E-08 | 1.12E-13 | 5.20E-05 | 1.19E-02 |
Trace: SMTL() vs. LP | 1.67E-19 | 1.52E-14 | 1.06E-12 | 7.49E-01 |
Trace: SMTL() vs. ECC | 2.83E-01 | 4.46E-08 | 1.75E-16 | 9.52E-01 |
Two methods for comparison | Cal500 | Yeast | Emotions | Scene | Flags
Average AUC | | | | |
: SMTL(AUC) vs. MTL-CLS | 2.62E-01 | 1.02E-12 | 2.62E-01 | 4.92E-02 | 4.30E-02 |
: SMTL(AUC) vs. MTL-REG | 7.48E-05 | 1.47E-09 | 7.48E-05 | 4.74E-11 | 4.21E-02 |
Trace: SMTL(AUC) vs. MTL-CLS | 1.01E-01 | 8.97E-13 | 1.01E-01 | 4.48E-02 | 4.21E-02 |
Trace: SMTL(AUC) vs. MTL-REG | 3.04E-03 | 1.53E-09 | 3.04E-03 | 5.00E-11 | 4.37E-02 |
: SMTL(AUC) vs. MTL-CLS | 2.18E-03 | 1.00E-12 | 2.18E-03 | 4.56E-02 | 4.27E-02 |
: SMTL(AUC) vs. MTL-REG | 2.55E-06 | 1.71E-09 | 2.55E-06 | 4.67E-11 | 4.28E-02 |
Trace: SMTL(AUC) vs. RAkEL | 2.62E-12 | 1.65E-10 | 4.55E-04 | 1.48E-01 | 5.58E-05 |
Trace: SMTL(AUC) vs. MLCSSP | 1.49E-21 | 1.05E-07 | 1.22E-04 | 1.60E-15 | 6.81E-11 |
Trace: SMTL(AUC) vs. AdaBoostMH | 4.10E-33 | 1.03E-06 | 3.61E-23 | 3.21E-19 | 2.12E-02 |
Trace: SMTL(AUC) vs. HOMER | 2.30E-13 | 2.54E-04 | 8.57E-09 | 4.33E-02 | 9.89E-09 |
Trace: SMTL(AUC) vs. BR | 2.23E-16 | 7.57E-10 | 3.84E-06 | 7.92E-02 | 1.63E-06 |
Trace: SMTL(AUC) vs. LP | 1.12E-14 | 6.42E-07 | 9.15E-15 | 4.97E-01 | 3.19E-03 |
Trace: SMTL(AUC) vs. ECC | 2.00E-13 | 2.09E-12 | 6.45E-07 | 3.28E-01 | 9.78E-01 |
Micro F1 | | | | |
: SMTL() vs. MTL-CLS | 4.09E-21 | 2.70E-05 | 2.62E-01 | 1.64E-05 | 3.14E-02 |
: SMTL() vs. MTL-REG | 1.47E-26 | 5.00E-02 | 7.48E-05 | 1.09E-06 | 1.87E-03 |
Trace: SMTL() vs. MTL-CLS | 1.04E-21 | 3.45E-05 | 1.01E-01 | 1.42E-05 | 3.10E-02 |
Trace: SMTL() vs. MTL-REG | 3.85E-19 | 4.39E-02 | 3.04E-03 | 1.19E-06 | 1.76E-03 |
: SMTL() vs. MTL-CLS | 7.04E-22 | 2.16E-05 | 2.18E-03 | 1.54E-05 | 3.13E-02 |
: SMTL() vs. MTL-REG | 1.94E-24 | 4.21E-02 | 2.55E-06 | 1.35E-06 | 1.87E-03 |
Trace: SMTL() vs. RAkEL | 4.16E-08 | 2.54E-03 | 4.55E-04 | 3.26E-15 | 2.85E-08 |
Trace: SMTL() vs. MLCSSP | 2.82E-10 | 9.31E-21 | 1.22E-04 | 1.80E-16 | 2.01E-13 |
Trace: SMTL() vs. AdaBoostMH | 8.68E-17 | 3.36E-22 | 3.61E-23 | 4.76E-02 | 7.77E-05 |
Trace: SMTL() vs. HOMER | 5.69E-11 | 1.08E-01 | 8.57E-09 | 4.53E-14 | 3.25E-10 |
Trace: SMTL() vs. BR | 7.35E-21 | 1.08E-01 | 3.84E-06 | 2.51E-09 | 4.96E-06 |
Trace: SMTL() vs. LP | 1.64E-13 | 9.74E-11 | 9.15E-15 | 1.25E-14 | 3.39E-08 |
Trace: SMTL() vs. ECC | 6.21E-14 | 5.30E-01 | 2.41E-01 | 4.69E-02 | 9.09E-01 |
Macro F1 | | | | |
: SMTL() vs. MTL-CLS | 2.43E-02 | 2.51E-06 | 7.55E-12 | 4.09E-02 | 1.12E-02 |
: SMTL() vs. MTL-REG | 4.24E-10 | 2.71E-19 | 2.83E-09 | 1.40E-10 | 2.58E-03 |
Trace: SMTL() vs. MTL-CLS | 1.77E-05 | 2.34E-06 | 3.53E-09 | 4.37E-02 | 1.13E-02 |
Trace: SMTL() vs. MTL-REG | 1.43E-04 | 3.33E-19 | 2.69E-02 | 1.55E-10 | 2.67E-03 |
: SMTL() vs. MTL-CLS | 3.99E-02 | 2.66E-06 | 6.38E-12 | 4.43E-02 | 1.11E-02 |
: SMTL() vs. MTL-REG | 1.01E-12 | 2.76E-19 | 1.09E-06 | 1.32E-10 | 2.54E-03 |
Trace: SMTL() vs. RAkEL | 5.45E-05 | 5.52E-03 | 3.24E-05 | 5.68E-02 | 2.11E-04 |
Trace: SMTL() vs. MLCSSP | 6.64E-01 | 1.57E-08 | 6.89E-05 | 2.34E-18 | 2.75E-10 |
Trace: SMTL() vs. AdaBoostMH | 1.28E-29 | 1.36E-28 | 3.03E-28 | 5.89E-21 | 1.16E-07 |
Trace: SMTL() vs. HOMER | 3.52E-23 | 6.16E-10 | 2.60E-09 | 3.77E-06 | 1.84E-11 |
Trace: SMTL() vs. BR | 7.28E-14 | 6.97E-12 | 6.40E-07 | 2.49E-01 | 2.86E-09 |
Trace: SMTL() vs. LP | 1.42E-18 | 9.45E-16 | 7.63E-15 | 4.04E-01 | 5.68E-05 |
Trace: SMTL() vs. ECC | 5.69E-21 | 1.98E-16 | 5.51E-08 | 9.15E-01 | 4.96E-03 |
Two methods for comparison | Optdigits | TMC2007 | MediaMill | Segmentation
Average AUC | | | |
: SMTL(AUC) vs. MTL-CLS | 1.25E-02 | 1.25E-02 | 5.06E-03 | 5.75E-01 |
: SMTL(AUC) vs. MTL-REG | 5.06E-03 | 5.06E-03 | 4.45E-01 | 2.84E-02 |
Trace: SMTL(AUC) vs. MTL-CLS | 2.84E-02 | 2.18E-02 | 2.18E-02 | 4.69E-02 |
Trace: SMTL(AUC) vs. MTL-REG | 5.06E-03 | 5.06E-03 | 8.79E-01 | 3.86E-01 |
: SMTL(AUC) vs. MTL-CLS | 2.84E-02 | 2.84E-02 | 4.69E-02 | 2.18E-02 |
: SMTL(AUC) vs. MTL-REG | 5.06E-03 | 5.06E-03 | 5.75E-01 | 2.18E-02 |
Trace: SMTL(AUC) vs. RAkEL | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL(AUC) vs. MLCSSP | 5.06E-03 | 5.06E-03 | 5.06E-03 | 2.85E-01 |
Trace: SMTL(AUC) vs. AdaBoostMH | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL(AUC) vs. HOMER | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL(AUC) vs. BR | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL(AUC) vs. LP | 5.06E-03 | 5.06E-03 | 1.25E-02 | 8.79E-01 |
Trace: SMTL(AUC) vs. ECC | 7.45E-02 | 5.06E-03 | 3.67E-02 | 1.66E-02 |
Micro F1 | | | |
: SMTL() vs. MTL-CLS | 5.06E-03 | 5.93E-02 | 5.06E-03 | 9.34E-03 |
: SMTL() vs. MTL-REG | 5.06E-03 | 5.06E-03 | 5.06E-03 | 2.84E-02 |
Trace: SMTL() vs. MTL-CLS | 5.06E-03 | 1.66E-02 | 5.06E-03 | 9.26E-02 |
Trace: SMTL() vs. MTL-REG | 5.06E-03 | 5.06E-03 | 5.06E-03 | 6.91E-03 |
: SMTL() vs. MTL-CLS | 5.06E-03 | 4.69E-02 | 5.06E-03 | 5.93E-02 |
: SMTL() vs. MTL-REG | 5.06E-03 | 5.06E-03 | 5.06E-03 | 2.84E-02 |
Trace: SMTL() vs. RAkEL | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. MLCSSP | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. AdaBoostMH | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. HOMER | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. BR | 5.06E-03 | 5.06E-03 | 9.34E-03 | 1.69E-01 |
Trace: SMTL() vs. LP | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. ECC | 5.06E-03 | 5.06E-03 | 5.06E-03 | 2.03E-01 |
Macro F1 | | | |
: SMTL() vs. MTL-CLS | 9.34E-03 | 5.06E-03 | 5.06E-03 | 4.69E-02 |
: SMTL() vs. MTL-REG | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. MTL-CLS | 9.34E-03 | 5.06E-03 | 5.06E-03 | 5.93E-02 |
Trace: SMTL() vs. MTL-REG | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
: SMTL() vs. MTL-CLS | 9.34E-03 | 5.06E-03 | 5.06E-03 | 1.14E-01 |
: SMTL() vs. MTL-REG | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. RAkEL | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. MLCSSP | 5.06E-03 | 5.06E-03 | 5.06E-03 | 7.45E-02 |
Trace: SMTL() vs. AdaBoostMH | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. HOMER | 5.06E-03 | 3.67E-02 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. BR | 5.06E-03 | 5.06E-03 | 5.06E-03 | 2.84E-02 |
Trace: SMTL() vs. LP | 5.06E-03 | 5.06E-03 | 1.69E-01 | 8.79E-01 |
Trace: SMTL() vs. ECC | 4.69E-02 | 5.06E-03 | 1.66E-02 | 9.59E-01 |
Two methods for comparison | Cal500 | Yeast | Emotions | Scene | Flags
Average AUC | | | | |
: SMTL(AUC) vs. MTL-CLS | 5.06E-03 | 5.06E-03 | 1.69E-01 | 1.25E-02 | 1.25E-02 |
: SMTL(AUC) vs. MTL-REG | 5.06E-03 | 5.06E-03 | 1.25E-02 | 5.06E-03 | 5.06E-03 |
Trace: SMTL(AUC) vs. MTL-CLS | 6.91E-03 | 5.06E-03 | 1.14E-01 | 4.69E-02 | 5.08E-01 |
Trace: SMTL(AUC) vs. MTL-REG | 5.06E-03 | 5.06E-03 | 1.25E-02 | 5.06E-03 | 3.67E-02 |
: SMTL(AUC) vs. MTL-CLS | 5.06E-03 | 5.06E-03 | 1.25E-02 | 2.84E-02 | 1.25E-02 |
: SMTL(AUC) vs. MTL-REG | 5.06E-03 | 5.06E-03 | 1.25E-02 | 5.06E-03 | 1.25E-02 |
Trace: SMTL(AUC) vs. RAkEL | 5.06E-03 | 5.06E-03 | 1.25E-02 | 2.84E-02 | 5.06E-03 |
Trace: SMTL(AUC) vs. MLCSSP | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL(AUC) vs. AdaBoostMH | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 1.25E-02 |
Trace: SMTL(AUC) vs. HOMER | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL(AUC) vs. BR | 5.06E-03 | 5.06E-03 | 5.06E-03 | 9.26E-02 | 5.06E-03 |
Trace: SMTL(AUC) vs. LP | 5.06E-03 | 6.91E-03 | 5.06E-03 | 7.21E-01 | 9.34E-03 |
Trace: SMTL(AUC) vs. ECC | 5.06E-03 | 5.06E-03 | 5.06E-03 | 1.69E-01 | 9.59E-01 |
Micro F1 | | | | |
: SMTL() vs. MTL-CLS | 5.06E-03 | 6.91E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
: SMTL() vs. MTL-REG | 9.34E-03 | 2.84E-02 | 2.84E-02 | 5.06E-03 | 1.25E-02 |
Trace: SMTL() vs. MTL-CLS | 5.06E-03 | 6.91E-03 | 5.06E-03 | 5.06E-03 | 4.45E-01 |
Trace: SMTL() vs. MTL-REG | 5.06E-03 | 2.84E-02 | 1.25E-02 | 5.06E-03 | 1.25E-02 |
: SMTL() vs. MTL-CLS | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 3.67E-02 |
: SMTL() vs. MTL-REG | 5.06E-03 | 2.84E-02 | 3.67E-02 | 5.06E-03 | 2.84E-02 |
Trace: SMTL() vs. RAkEL | 5.06E-03 | 9.34E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. MLCSSP | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. AdaBoostMH | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.93E-02 | 5.06E-03 |
Trace: SMTL() vs. HOMER | 5.06E-03 | 2.03E-01 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. BR | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. LP | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. ECC | 5.06E-03 | 3.86E-01 | 2.41E-01 | 4.69E-02 | 8.79E-01 |
Macro F1 | | | | |
: SMTL() vs. MTL-CLS | 5.06E-03 | 5.06E-03 | 6.91E-03 | 3.67E-02 | 3.67E-02 |
: SMTL() vs. MTL-REG | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 6.91E-03 |
Trace: SMTL() vs. MTL-CLS | 5.06E-03 | 5.06E-03 | 5.06E-03 | 1.66E-02 | 5.93E-02 |
Trace: SMTL() vs. MTL-REG | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
: SMTL() vs. MTL-CLS | 5.06E-03 | 5.06E-03 | 5.06E-03 | 3.33E-01 | 3.67E-02 |
: SMTL() vs. MTL-REG | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 2.84E-02 |
Trace: SMTL() vs. RAkEL | 5.06E-03 | 1.25E-02 | 9.34E-03 | 9.26E-02 | 5.06E-03 |
Trace: SMTL() vs. MLCSSP | 3.67E-02 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. AdaBoostMH | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. HOMER | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.06E-03 |
Trace: SMTL() vs. BR | 5.06E-03 | 5.06E-03 | 5.06E-03 | 1.69E-01 | 5.06E-03 |
Trace: SMTL() vs. LP | 5.06E-03 | 5.06E-03 | 5.06E-03 | 5.75E-01 | 6.91E-03 |
Trace: SMTL() vs. ECC | 5.06E-03 | 5.06E-03 | 5.06E-03 | 7.99E-01 | 6.91E-03 |
6.3 Results on imbalanced data
In scenarios of learning classifiers on imbalanced data (e.g., when the number of positive training samples is much smaller than the number of negative ones), metrics like the F-score or AUC are more informative for evaluation than the misclassification error. This is one of the motivations for the proposed SMTL method. In MTL, the imbalance can be measured by first calculating the imbalance ratio within each individual task and then averaging these ratios over all tasks.
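A minimal sketch of this imbalance measure, assuming binary labels per task with 1 marking the positive class and the per-task ratio defined as #negatives / #positives:

```python
def average_imbalance_ratio(task_labels):
    """Imbalance of an MTL dataset: the per-task ratio of negative to
    positive samples, averaged over all tasks."""
    ratios = []
    for labels in task_labels:
        pos = sum(1 for y in labels if y == 1)
        neg = len(labels) - pos
        ratios.append(neg / pos)  # assumes every task has at least one positive
    return sum(ratios) / len(ratios)
```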
We conduct simulated experiments on 3 datasets (Segmentation, Emotions, and Optdigits) to investigate the behavior of the proposed SMTL methods on imbalanced data. From each dataset, we generate an imbalanced dataset by randomly selecting (with replacement) positive and negative samples from the original dataset at several imbalance ratios. As can be seen in Fig. 1 and Fig. 2, in most cases the proposed SMTL variants consistently outperform the baseline methods. For example, on Emotions, the proposed SMTL yields relative increases over the baseline w.r.t. AUC, Macro F1, and Micro F1, respectively. In addition, as the imbalance ratio increases, the improvement of SMTL over the baseline also increases.
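The resampling scheme described above can be sketched as follows; the function signature and the way the target ratio is specified (negatives per positive) are our assumptions, since the exact ratios used in the experiments are elided in the text.

```python
import random

def resample_with_ratio(pos_samples, neg_samples, ratio, n_total, seed=0):
    """Build a dataset of n_total points drawn with replacement from the
    original positive and negative pools, so that the resulting
    #negatives : #positives is approximately `ratio` : 1."""
    rng = random.Random(seed)
    n_pos = round(n_total / (1 + ratio))
    n_neg = n_total - n_pos
    pos = [rng.choice(pos_samples) for _ in range(n_pos)]
    neg = [rng.choice(neg_samples) for _ in range(n_neg)]
    return pos + neg
```

Sampling with replacement keeps the total size fixed while the ratio varies, so differences across ratios reflect the imbalance itself rather than the amount of training data.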
6.4 Training Time Comparison
To investigate the training speed of the proposed method, we report the running time comparison in Table IX. The training of SMTL is slower than that of the baseline methods, but by less than a factor of 30. It is worth noting that training time is rarely a critical issue in practice, because training is usually performed off-line.
Method | Training time on Optdigits | Training time on Emotions | Training time on Segmentation |
---|---|---|---|
SMTL(+AUC) | 105.200s | 30.001s | 1.888s |
SMTL(+) | 510.900s | 29.797s | 2.964s |
MTL-CLS() | 356.200s | 24.674s | 2.023s |
MTL-REG() | 19.030s | 7.427s | 0.450s |
StructSVM | 17.762s | 46.468s | 5.015s |
RAkEL | 28.428s | 4.117s | 4.310s |
AdaBoostMH | 17.157s | 1.024s | 0.641s |
MLCSSP | 121.779s | 1.563s | 6.410s |
HOMER | 20.643s | 1.354s | 0.880s |
BR | 20.852s | 1.859s | 1.835s |
LP | 16.131s | 22.561s | 2.103s |
ECC | 17.852s | 2.834s | 1.891s |
7 Conclusion
In this paper, we developed Structured-MTL, an MTL method that directly optimizes evaluation metrics. To solve the resulting optimization problem, we developed a procedure based on the ADMM scheme. This procedure can be applied to a large family of MTL problems with structured outputs.
In future work, we plan to investigate Structured-MTL on problems other than classification (e.g., MTL for ranking). We also plan to improve the efficiency of Structured-MTL on large-scale learning problems.
References
- [1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817-1853, 2005.
- [2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
- [3] J-B. Bi, T. Xiong, S-P. Yu, M. Dundar, and R. Rao. An improved multi-task learning approach with applications in medical diagnosis. In Machine Learning and Knowledge Discovery in Databases, pages 117-132, 2008.
- [4] W. Bi and J. Kwok. Efficient multi-label classification with many labels. In Proceedings of the 30th International Conference on Machine Learning, pages 405-413, 2013.
- [5] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
- [6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
- [7] J-F. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion. UCLA CAM Report, 2008.
- [8] R. Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997.
- [9] J-H. Chen, J. Liu, and J-P. Ye. Learning incoherent sparse and low-rank patterns from multiple tasks. In International Conference on Knowledge Discovery and Data Mining, pages 1179-1188, 2010.
- [10] J-H. Chen, J-Y. Zhou, and J-P. Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In International Conference on Knowledge Discovery and Data Mining, pages 42-50, 2011.
- [11] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615-637, 2005.
- [12] R-E. Fan, K-W. Chang, C-J. Hsieh, X-R. Wang, C-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
- [13] P-H. Gong, J-P. Ye, and C-S. Zhang. Robust multi-task feature learning. In International Conference on Knowledge Discovery and Data Mining, pages 895-903, 2012.
- [14] N. Gornitz, C. Widmer, G. Zeller, A. Kahles, S. Sonnenburg, and G. Ratsch. Hierarchical multitask structured output learning for large-scale sequence segmentation. In Advances in Neural Information Processing Systems, 2011.
- [15] X. Gu, F-L. Chung, H. Ishibuchi, and S-T. Wang. Multitask Coupled Logistic Regression and Its Fast Implementation for Large Multitask Datasets. In IEEE Transactions on Cybernetics, 45(9): 1953-1966, 2015.
- [16] B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700-709, 2012.
- [17] S-W. Ji and J-P. Ye. An Accelerated Gradient Method for Trace Norm Minimization. In International Conference on Machine Learning, pages 457-464, 2009
- [18] Y-Z. Jiang, F-L. Chung, H. Ishibuchi, Z-H. Deng, and S-T. Wang. Multitask TSK Fuzzy System Modeling by Mining Intertask Common Hidden Structure. In IEEE Transactions on Cybernetics, 45(3): 548-561, 2015.
- [19] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas. Mulan: A java library for multi-label learning. Journal of Machine Learning Research, 12:2411-2414, 2011.
- [20] T. Joachims. A Support Vector Method for Multivariate Performance Measures. In International Conference on Machine Learning, 2005.
- [21] Z. Kang, K. Grauman, and F. Sha. Learning with whom to share in multi-task feature learning. In International Conference on Machine Learning, pages 521-528, 2011.
- [22] S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In International Conference on Machine Learning, pages 543-550, 2010.
- [23] H-J. Lai, Y. Pan, C. Liu, L. Lin, and J. Wu. Sparse learning-to-rank via an efficient primal-dual algorithm. IEEE Transactions on Computers, 62(6):1221-1233, 2013.
- [24] H-J. Lai, Y. Pan, Y. Tang, and R. Yu. FSMRank: Feature selection method for learning to rank. IEEE Transactions on Neural Networks and Learning Systems, 24(6):940-952, 2013.
- [25] Z-C. Lin, M-M. Chen, and Y. Ma. The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrix. Technical Report, UIUC, October 2009.
- [26] A-A. Liu, Y-T. Su, P-P. Jia, Z. Gao, T. Hao, and Z-X. Yang. Multiple/single-view human action recognition via part-induced multitask structural learning. IEEE Transactions on Cybernetics, 45(6):1194-1208, 2016.
- [27] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In International Conference on Machine Learning, pages 663-670, 2010.
- [28] J. Liu, S-W. Ji, and J-P. Ye. Multi-task feature learning via efficient -norm minimization. In Conference on Uncertainty in Artificial Intelligence, pages 339-348, 2009.
- [29] X-Q. Lu, X-L. Li, and L-C. Mou. Semi-Supervised Multitask Learning for Scene Recognition. In IEEE Transactions on Cybernetics, 45(9): 1967-1976, 2015.
- [30] G. Obozinski, B. Taskar, and M.I. Jordan. Multi-task feature selection. Technical report, Statistics Department, UC Berkeley, 2006.
- [31] Y. Pan, H-J. Lai, C. Liu, S-C. Yan. A Divide-and-Conquer Method for Scalable Low-Rank Latent Matrix Pursuit. In International Conference on Computer Vision and Pattern Recognition, 2013.
- [32] Y. Pan, H-J. Lai, C. Liu, Y. Tang, S-C. Yan. Rank Aggregation via Low-Rank and Structured-Sparse Decomposition. In AAAI Conference on Artificial Intelligence, 2013.
- [33] Y. Pan, R-K. Xia, J. Yin, and N. Liu. A divide-and-conquer method for scalable robust multitask learning. IEEE Transactions on Neural Networks and Learning Systems, 26(12):3163-3175, 2015.
- [34] N. Quadrianto, A. Smola, T. Caetano, S. Vishwanathan, and J. Petterson. Multitask learning without label correspondences. In Advances in Neural Information Processing Systems, pages 1957-1965, 2010.
- [35] J. Read, B. Pfahringer, G. Holmes and E. Frank. Classifier Chains for Multi-label Classification. Machine learning, 85(3): 333-359, 2011.
- [36] R.M. Rifkin and R.A. Lippert. Value Regularization and Fenchel Duality. Journal of Machine Learning Research, 8:441-479, 2007.
- [37] R. E. Schapire, Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2):135-168, 2000.
- [38] S. Shalev-Shwartz and Y. Singer. On the Equivalence of Weak Learnability and Linear Separability: New Relaxations and Efficient Boosting Algorithms. Machine Learning Journal, 80(2):141-163, 2010.
- [39] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In International Conference on Machine Learning, 2004.
- [40] G. Tsoumakas, I. Katakis and I. Vlahavas. Effective and Efficient Multilabel Classification in Domains with Large Number of Labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data, pages 30-44, 2008.
- [41] G. Tsoumakas, I. Katakis and I. Vlahavas. Random k-Labelsets for Multi-Label Classification. IEEE Transactions on Knowledge and Data Engineering. 23(7):1079-1089, 2011.
- [42] E. Gibaja, S. Ventura. A tutorial on multilabel learning. ACM Computing Surveys, 47(3): 52, 2015.
- [43] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics bulletin, 1(6): 80-83, 1945.
- [44] R-K. Xia, Y. Pan, L. Du, J. Yin. Robust Multi-View Clustering via Low-Rank and Sparse Decomposition. In AAAI Conference on Artificial Intelligence, 2014.
- [45] R-K. Xia, Y. Pan, H-J. Lai, C. Liu, S-C. Yan. Supervised Hashing for Image Retrieval via Image Representation Learning. In AAAI Conference on Artificial Intelligence, 2014.
- [46] Y. Yang, Z-G. Ma, Y. Yang, F-P. Nie, and H-T. Shen. Multitask Spectral Clustering by Exploring Intertask Correlation. IEEE Transactions on Cybernetics, 45(5):1069-1080, 2015.
- [47] Y-J. Yin, D. Xu, X-G. Wang, and M-R. Bai. Online State-Based Structured SVM Combined With Incremental PCA for Robust Visual Tracking. In IEEE Transactions on Cybernetics, 45(9): 1988-2000, 2015.
- [48] J. Yu, D-C. Tao, M. Wang, and Y. Rui. Learning to Rank Using User Clicks and Visual Features for Image Retrieval. In IEEE Transactions on Cybernetics, 45(4): 767-779, 2015.
- [49] K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In International Conference on Machine Learning, pages 1012-1019, 2005.
- [50] Y. Yue, T. Finley, F. Radlinski, T. Joachims. A Support Vector Method for Optimizing Average Precision. In International Conference on Research and Development in Information Retrieval, 2007.
- [51] J. Zhang, Z. Ghahramani, and Y-M. Yang. Learning multiple related tasks using latent independent component analysis. In Advances in Neural Information Processing Systems, pages 1585-1592, 2006.
- [52] W-Q. Zhao, Q-G. Meng and P. W. H. Chung. A Heuristic Distributed Task Allocation Method for Multivehicle Multitask Problems and Its Application to Search and Rescue Scenario. IEEE Transactions on Cybernetics, 46(4):902-915, 2016.
- [53] J-Y. Zhou, J-H. Chen, and J-P. Ye. Clustered multi-task learning via alternating structure optimization. In Advances in Neural Information Processing Systems, pages 702-710, 2011.