
Semi-Supervised Multi-Modal Multi-Instance Multi-Label Deep Network with Optimal Transport

Yang Yang, Zhao-Yang Fu, De-Chuan Zhan, Zhi-Bin Liu, and Yuan Jiang
Yang Yang, Zhao-Yang Fu, De-Chuan Zhan and Yuan Jiang are with the National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. E-mail: {yangy, fuzy, zhandc, jiangy}@lamda.nju.edu.cn
Zhi-Bin Liu is with Tencent WXG, ShenZhen 518057, China. E-mail: lewiszbliu@tencent.com
De-Chuan Zhan is the corresponding author.
Abstract

Complex objects are usually associated with multiple labels and can be represented by multiple modalities, e.g., a complex article contains text and image information as well as multiple annotations. Previous methods assume that homogeneous multi-modal data are consistent, while in real applications the raw data are disordered, e.g., an article consists of a variable number of inconsistent text and image instances. Multi-modal Multi-instance Multi-label (M3) learning provides a framework for handling such tasks and has exhibited excellent performance. However, M3 learning faces two main challenges: 1) how to effectively utilize label correlation; 2) how to take advantage of multi-modal learning to process unlabeled instances. To solve these problems, we first propose a novel Multi-modal Multi-instance Multi-label Deep Network (M3DN), which considers M3 learning in an end-to-end multi-modal deep network and utilizes the consistency principle among the bag-level predictions of different modalities. Based on M3DN, we learn the latent ground label metric with optimal transport. Moreover, we introduce extrinsic unlabeled multi-modal multi-instance data and propose M3DNS, which adopts an instance-level auto-encoder for each modality and a modified bag-level optimal transport to strengthen the consistency among modalities. Thereby M3DNS can better predict labels and exploit label correlation simultaneously. Experiments on benchmark datasets and the real-world WKG Game-Hub dataset validate the effectiveness of the proposed methods.

Index Terms:
Semi-supervised Learning, Multi-Modal Multi-Instance Multi-label Learning, Modal consistency, Optimal Transport.

1 Introduction

With the development of data collection techniques, objects can often be represented by multiple modal features, e.g., in the forum of the famous mobile game “Strike of Kings”, the articles carry both image and content information, and they belong to multiple categories when observed from different aspects, e.g., an article belongs to “Wukong Sun” (Game Heroes) as well as “golden cudgel” (Game Equipment) according to its images, while it can be categorized as “game strategy”, “producer name”, and so on according to its contents. The major challenge in addressing such problems is how to jointly model multiple types of heterogeneities in a mutually beneficial way. To solve this problem, multi-modal multi-label learning approaches utilize multiple modal information and require the modal-based classifiers to generate similar predictions, e.g., Huang et al. proposed a multi-label conditional restricted Boltzmann machine, which uses multiple modalities to obtain shared representations under supervision [1]; Yang et al. learned a novel graph-based model that captures both label and feature heterogeneities [2]. However, a real-world object may contain a variable number of inconsistent multi-modal instances, e.g., an article usually contains multiple images and content paragraphs, in which each image or content paragraph can be regarded as an instance, yet the relationships between the images and contents are not marked, as shown in Figure 1.

Therefore, several Multi-modal Multi-instance Multi-label methods have been proposed. Nguyen et al. proposed M3LDA with a visual-label part, a textual-label part, and a label topic part, in which the topic decided by visual information and the topic decided by textual information should be consistent [3]; Nguyen et al. developed a multi-modal MIML framework based on a hierarchical Bayesian network [4]. Nevertheless, existing M3 models have two drawbacks: previous approaches rarely consider the correlations among labels, and existing M3 methods are all supervised, which forgoes the main advantage of multi-modal learning, i.e., the use of unlabeled data.

Thus, considering label correlation, Yang and He studied a hierarchical multi-latent space, which leverages task relatedness, modal consistency and label correlation simultaneously to improve learning performance [5]; Huang and Zhou proposed the ML-LOC approach, which allows label correlation to be exploited locally [6]; Frogner et al. developed a loss function with a ground metric for multi-label learning based on the Wasserstein distance [7]. Previous works mainly assume that some prior knowledge, such as a label similarity matrix or the ground metric, already exists [7, 8]. In reality, semantic information among labels is indirect or complicated, so the confidence of the given label similarity matrix or ground metric is weak. On the other hand, considering the labeling cost, there are many unlabeled instances. An important advantage of multi-modal methods is that they can use unlabeled data, e.g., co-training [9] style methods utilize the complementary principle to label unlabeled data for each other, and co-regularization [10] style methods exploit unlabeled multi-modal data with the consistency principle. Meanwhile, it is notable that previously proposed M3 methods can hardly adopt unlabeled instances. Therefore, another issue is how to bypass this limitation of M3 style methods by using unlabeled multi-modal instances.

Figure 1: An illustration of M3 (Multi-Modal Multi-instance Multi-label) data in an article of WKG Game-Hub. Each article comes with a context bag and an image bag, each bag contains a variable number of instances (context paragraphs/images), and each article has multiple label representations. It is notable that different modalities are heterogeneous, i.e., there are no congruent relationships between the context paragraphs and the images.

In this work, aiming to learn the label prediction and explore label correlation with semi-supervised M3 data simultaneously, we propose a novel general Multi-modal Multi-instance Multi-label Deep Network, which models an independent deep network for each modality and imposes modal consistency on the bag-level predictions. To better consider label correlation, M3DN adopts the Optimal Transport (OT) [11] distance to measure the quality of a prediction. This provides a more meaningful measure for multi-label tasks by capturing the geometric information of the underlying label space. Since a reliable ground metric can rarely be computed from the raw data directly, we cast label correlation exploration as a latent ground metric learning problem. Moreover, to exploit unlabeled data, we propose the semi-supervised M3DN (M3DNS). M3DNS utilizes instance-level auto-encoders to build the single-modal networks, and enforces bag-level consistency among the predictions of different modalities on unlabeled data with a modified OT formulation. Consequently, M3DNS automatically learns the predictors of the different modalities and the latent shared ground metric.

The main contributions of this paper are summarized in the following points:

  • We propose a novel Multi-modal Multi-instance Multi-label Deep Network (M3DN), which models an independent deep network for each modality and imposes modal consistency on the bag-level predictions;

  • We cast label correlation exploration as a latent ground metric learning problem shared between the different modalities, rather than using a fixed ground metric built from raw prior knowledge;

  • We utilize extrinsic unlabeled data through instance-level auto-encoders and through bag-level consistency among the predictions of different modalities on unlabeled data with the modified OT metric;

  • We achieve superior performance on real-world applications, and comprehensive evaluations show that the proposed methods obtain consistently stable improvements.

Section 2 summarizes related work, and our approaches are presented in Section 3. Section 4 reports our experiments. Finally, Section 5 concludes the paper.

2 Related Work

Multi-modal multi-instance multi-label learning has attracted much attention recently. In this paper, our method concentrates on deep multi-label classification for semi-supervised inconsistent multi-modal multi-instance data, and considers label correlation using the optimal transport technique. Therefore, our work is related to M3 learning and to optimal transport.

Multi-modal learning deals with data from multiple modalities, i.e., multiple feature sets. The goals are to improve performance and to reduce the sample complexity. Multi-modal multi-label learning has been well studied, e.g., Fang and Zhang proposed a multi-modal multi-label learning method based on the large margin framework [12], and Yang et al. modeled both the modal consistency and the label correlation in a graph-based framework [13]. The basic assumption behind these methods is that the multi-modal data are consistent. However, in real applications, multi-modal data are often heterogeneous at the instance level, e.g., articles have variable numbers of inconsistent images and text paragraphs, and videos have variable lengths of inconsistent audio and image frames; articles and videos are only consistent at the bag level, rather than the instance level. Thus, multi-modal multi-instance multi-label learning has been proposed recently. Nguyen et al. developed a multi-modal MIML framework based on a hierarchical Bayesian network [4]; Feng and Zhou exploited deep neural networks to generate instance representations for MIML, which can be extended to the multi-modal scenario [14]. Nevertheless, previous approaches rarely consider the confidence of the label correlation. More importantly, the current M3 approaches are supervised, which clearly forfeits the advantage of multi-modal learning for processing unlabeled data.

Considering label correlation, several multi-label learning methods have been proposed [15, 16, 17]. Recently, Optimal Transport (OT) [11] has been developed to measure the difference between two distributions based on a given ground metric, and it has been widely used in computer vision and image processing, e.g., Qian et al. proposed a novel method that exploits knowledge in both the data manifold and the feature correlation [18]; Courty et al. proposed a regularized unsupervised optimal transportation model to perform the alignment of representations [19]. However, previous works mainly assumed that prior knowledge for the cost matrix already exists, ignoring the deficiency of information or domain knowledge. Thus, Cuturi and Avis [20] and Zhao and Zhou [21] suggested formulating the cost metric learning problem with side information. On the other hand, existing M3 methods are almost all supervised, while multi-modal methods aim to exploit the complementary [9] or consistency [10] principle using unlabeled instances. Thereby, how to take unlabeled data into consideration becomes a challenge.

3 Proposed Method

3.1 Notation

In the multi-instance extension of the multi-modal multi-label framework, we are given $N$ bags of instances. Let $\mathcal{Y}=\{{\bf y}_{1},{\bf y}_{2},\cdots,{\bf y}_{N_{l}}\}$ denote the label set, where ${\bf y}_{i}\in\mathbb{R}^{L}$ is the label vector of the $i$-th bag, $y_{i,j}=1$ denotes the positive class, and $y_{i,j}=0$ otherwise. On the other hand, suppose we are given $K$ modalities; without loss of generality, we consider two modalities in this paper, i.e., images and contents. Let $\mathcal{D}=\{([{\bf X}_{1}^{1},{\bf X}_{1}^{2}],{\bf y}_{1}),([{\bf X}_{2}^{1},{\bf X}_{2}^{2}],{\bf y}_{2}),\cdots,([{\bf X}_{N_{l}}^{1},{\bf X}_{N_{l}}^{2}],{\bf y}_{N_{l}}),([{\bf X}_{N_{l}+1}^{1},{\bf X}_{N_{l}+1}^{2}]),\cdots,([{\bf X}_{N_{l}+N_{u}}^{1},{\bf X}_{N_{l}+N_{u}}^{2}])\}$ represent the training dataset, where $N_{l}/N_{u}$ denotes the number of labeled/unlabeled bags. ${\bf X}_{i}^{1}=\{{\bf x}_{i,1}^{1},{\bf x}_{i,2}^{1},\cdots,{\bf x}_{i,m_{i}}^{1}\}$ denotes the bag representation of the $m_{i}$ instances of ${\bf X}_{i}^{1}$; similarly, ${\bf X}_{i}^{2}=\{{\bf x}_{i,1}^{2},{\bf x}_{i,2}^{2},\cdots,{\bf x}_{i,n_{i}}^{2}\}$ is the bag representation of the $n_{i}$ instances of ${\bf X}_{i}^{2}$. It is notable that bags of different modalities may contain different numbers of instances.

The goal is to learn a model that annotates new bags based on their inputs ${\bf X}^{1},{\bf X}^{2}$, e.g., annotating a new complex article given its images and contents.

Figure 2: The flowchart of M3DN. A raw article can be divided into two homogeneous modal bags with variable numbers of heterogeneous instances, i.e., an image bag with 4 images and a content bag with 5 text paragraphs. The instances of different modalities are processed by different deep networks and finally represented as ${\bf x}_{l_{p}}^{1}$ or ${\bf x}_{l_{p}}^{2}$; the output features are fully connected with the labels, yielding the bag-concept layer for each modality. Eventually, we acquire the final prediction by mean-max pooling over the bag-concept layers of the different modalities.

3.2 Optimal Transport

Traditionally, several measures such as the Kullback-Leibler divergence, the Hellinger distance and the total variation have been utilized to measure the similarity between two distributions. However, these measures are of little use when the probability space has a geometric structure. On the other hand, optimal transport [11], also known as the Wasserstein distance or earth mover's distance [22], defines a reasonable distance between two probability distributions over a metric space. Intuitively, the Wasserstein distance is the minimum cost of transporting the pile of one distribution into the pile of the other, and it allows the problem of learning the ground metric to be formulated as minimizing the difference of two polyhedral convex functions over a convex set of distance matrices. Therefore, the Wasserstein distance is more powerful in such situations because it takes the pairwise cost between labels into account.

Definition 1

(Transport Polytope) For two probability vectors $r$ and $c$ in the simplex $\sum_{L}$, $U(r,c)$ is the transport polytope of $r$ and $c$, namely the polyhedral set of $L\times L$ matrices,

$$U(r,c)=\{P\in\mathbb{R}_{+}^{L\times L}\mid P{\bf 1}_{L}=r,\ P^{\top}{\bf 1}_{L}=c\}$$
Definition 2

(Optimal Transport) Given an $L\times L$ cost matrix $M$, the total cost of mapping from $r$ to $c$ using a transport matrix (or coupling probability) $P$ can be quantified as $\langle P,M\rangle$. The optimal transport (OT) problem is defined as

$$d_{M}(r,c)=\min_{P\in U(r,c)}\langle P,M\rangle$$

When $M$ belongs to the cone of metric matrices $\mathbb{M}$, the value of $d_{M}(r,c)$ is a distance [11] between $r$ and $c$, parameterized by $M$. In that case, assuming implicitly that $M$ is fixed and only $r$ and $c$ vary, we refer to it as the optimal transport distance between $r$ and $c$. It is notable that $d_{M}(r,c)$ is the cost of the optimal plan for transporting the predicted mass distribution $r$ to match the target distribution $c$. The penalty increases when more mass is transported over longer distances, according to the ground metric $M$.

Theorem 1

$d_{M}$ defined in Def. 2 is a distance on $\sum_{L}$ whenever $M$ is a metric matrix [11].
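For concreteness, the following is a minimal sketch (not from the original implementation) that computes $d_{M}(r,c)$ by solving the linear program of Def. 2 over the transport polytope $U(r,c)$ with SciPy; the function name and the toy ground metric are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def ot_distance(r, c, M):
    """Exact OT distance d_M(r, c) = min_{P in U(r, c)} <P, M> as a linear program."""
    L = len(r)
    cost = M.reshape(-1)                      # objective <P, M> with P flattened row-major
    A_eq = np.zeros((2 * L, L * L))
    for i in range(L):
        A_eq[i, i * L:(i + 1) * L] = 1.0      # row marginal:    sum_j P[i, j] = r[i]
        A_eq[L + i, i::L] = 1.0               # column marginal: sum_i P[i, j] = c[j]
    b_eq = np.concatenate([r, c])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(L, L)       # optimal value and transport plan P

# toy example with three labels and an assumed |i - j| ground metric
r = np.array([0.5, 0.3, 0.2])
c = np.array([0.2, 0.3, 0.5])
M = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(float)
d, P = ot_distance(r, c, M)
```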

3.3 Multi-Modal Multi-instance Multi-label Deep Network (M3DN)

Multi-modal Multi-instance Multi-label (M3) learning provides a framework for handling complex objects, and we propose a novel M3-based parallel deep network, M3DN. Based on M3DN, we can bypass the limitation of requiring an initial label correlation metric using Optimal Transport (OT) theory, and further take advantage of unlabeled data by considering modal consistency. In this section, we present the M3DN framework, which models a deep network for each modality and imposes modal consistency.

Figure 3: The schematic of the bag-concept layer. The bag-concept layer is obtained from the output feature representations of a bag of instances, in which each column represents the predictions of one instance. Eventually, the final label prediction is calculated by row-wise max pooling.

A raw article contains a variable number of heterogeneous multi-modal instances; since no corresponding relationships exist between individual contents and images, it is difficult to apply the consistency principle of previous multi-modal methods at the instance level. Thus, we turn to the consistency among the bags of different modalities, rather than among instances. Specifically, a raw article can be divided into two modal bags of heterogeneous instances, i.e., an image bag with 4 images and a content bag with 5 text paragraphs as shown in Fig. 2, and only the homogeneous bags share the same multiple labels. Each instance ${\bf x}^{1}$ (${\bf x}^{2}$) in a modal bag is processed by several layers and finally represented as ${\bf x}_{l_{p1}}$ (${\bf x}_{l_{p2}}$).

Without loss of generality, we use a convolutional neural network for images and fully connected networks for text. The output features are then fully connected with the bag-concept layer. All parameters, including the deep network weights and the fully connected weights, can be organized as $\Theta_{1}=\{\theta_{l_{1}},\theta_{l_{2}},\cdots,\theta_{l_{p1-1}},W_{1}\}$ ($\Theta_{2}=\{\theta_{l_{1}},\theta_{l_{2}},\cdots,\theta_{l_{p2-1}},W_{2}\}$). Concretely, once the label predictions of the instances in a bag ${\bf X}_{i}^{v}$ are obtained, we propose a fully connected 2D layer (bag-concept layer) of size $m_{i}(n_{i})\times L$ as shown in Fig. 3, in which each column represents the predictions of one instance in the image/content bag. Formally, for a given bag of instances ${\bf X}_{i}^{v}$, the $(k,j)$-th node in the 2D bag-concept layer represents the prediction score between the instance ${\bf x}_{i,j}^{v}$ and the $k$-th label. Therefore, the $j$-th column has the following form of activation:

$$\hat{{\bf y}}_{j}^{v}=g(W_{v}{\bf x}_{i,j}^{v}+b_{v})\quad (1)$$

Here, $g(\cdot)$ can be any convex activation function, and we use the softmax function. In the bag-concept layer, we utilize row-wise max pooling: $f_{v}(i)=\max(\hat{{\bf y}}_{i,\cdot})$. The final prediction is $f=\frac{f_{1}+f_{2}}{2}$.
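As an illustration of Eq. 1 and the pooling step, a minimal PyTorch sketch of the bag-concept layer is given below; the class name, feature dimensions and the toy bags are assumptions for exposition, not the authors' code.

```python
import torch
import torch.nn as nn

class BagConceptLayer(nn.Module):
    """Per-instance label scores (Eq. 1) followed by max pooling over the instances
    of a bag, i.e., for each label we keep the highest instance-level score."""
    def __init__(self, feat_dim, num_labels):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_labels)        # W_v, b_v

    def forward(self, bag):                              # bag: (num_instances, feat_dim)
        scores = torch.softmax(self.fc(bag), dim=1)      # bag-concept layer, shape (m_i, L)
        return scores.max(dim=0).values                  # pooled bag-level prediction, shape (L,)

# assumed toy usage: an image bag with 4 instances and a content bag with 5 instances
f1, f2 = BagConceptLayer(512, 20), BagConceptLayer(300, 20)
image_bag, content_bag = torch.randn(4, 512), torch.randn(5, 300)
f = 0.5 * (f1(image_bag) + f2(content_bag))              # final prediction f = (f_1 + f_2) / 2
```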

3.4 Explore Label Correlation

However, a fully connected mapping to the label output rarely considers the relationships among labels. Recently, Optimal Transport (OT) theory [11] has been used in multi-label learning, as it captures the geometric information of the underlying label space. According to Def. 2 and Def. 1, the loss function implied in the parallel network structure can, without loss of generality, be formulated as:

$$\begin{split}&\min_{P_{v}\in U(f({\bf X}_{i}^{v}),{\bf y}_{i})}\sum_{v=1}^{2}\sum_{i=1}^{N}\langle P_{v},M\rangle\\ \text{s.t.}\quad&U(f({\bf X}_{i}^{v}),{\bf y}_{i})=\{P_{v}\in\mathbb{R}_{+}^{L\times L}\mid P_{v}{\bf 1}_{L}=f({\bf X}_{i}^{v}),\ P_{v}^{\top}{\bf 1}_{L}={\bf y}_{i}\}\end{split}\quad (2)$$

where $M$ is the shared latent cost matrix. This formulation, however, requires prior knowledge to construct the cost matrix $M$. In reality, indirect or incomplete information among labels leads to a weak cost matrix $M$ and poor classification performance.

Therefore, we define the process of learning the cost metric as an optimization problem. Optimizing the cost metric directly is difficult and involves $O(L^{2})$ constraints. Thus, [20, 21] proposed to formulate the cost metric learning problem with side information, i.e., a label similarity matrix $S$ as in [21], and [20] proved that the cost metric matrix $M$, which induces the optimal transport distance $d_{M}$ between pairs of labels, agrees with the side information. More precisely, this criterion favors a matrix $M$ for which the distance $d_{M}(r,c)$ is small for pairs of similar histograms $r$ and $c$ (the corresponding $S(r,c)$ is large) and large for pairs of dissimilar histograms (the corresponding $S(r,c)$ is small). Consequently, optimizing $M$ can be turned into optimizing $S$. Finally, the goal of M3DN becomes learning the label predictors and exploring label correlation simultaneously.

In detail, we first introduce the connection between a nonlinear transformation and a pseudo-metric:

Definition 3

With a nonlinear transformation $\emptyset(\cdot)$, the Euclidean distance after the transformation can be denoted as:

$$D_{\emptyset}(r,c)=\|\emptyset(r)-\emptyset(c)\|_{2}.$$

Moreover, [23] proved that $D_{\emptyset}$ satisfies all properties of a well-defined pseudo-metric in the original input space.

Theorem 2

For a pseudo-metric $M$ defined as in Def. 3 and histograms $r,c\in\sum_{L}$, the function $(r,c)\rightarrow{\bf 1}_{r\neq c}\,d_{M}(r,c)$ satisfies all four distance axioms, i.e., non-negativity, symmetry, definiteness and sub-additivity (the triangle inequality), as in [20].

Thus, learning $M$ can be turned into learning the kernel $S$ defined by the nonlinear transformation $\emptyset(\cdot)$:

$$S_{ij}=S({\bf y}_{i},{\bf y}_{j})=\emptyset({\bf y}_{i})^{\top}\emptyset({\bf y}_{j})\quad (3)$$

where ${\bf y}_{i}$ represents the label vector of the $i$-th instance. Besides, it is notable that the cost matrix $M$ is computed as $M_{ij}=D_{\emptyset}^{2}({\bf y}_{i},{\bf y}_{j})$, while the kernel $S$ is defined as in Eq. 3. Thus, the relation between $M$ and $S$ can be derived as:

$$M_{ij}=S_{ii}+S_{jj}-2S_{ij}.\quad (4)$$

The nonlinear mapping preserves the pseudo-metric properties of Def. 3, so when learning the kernel matrix $S$ we only need a projection onto the positive semi-definite matrix cone, avoiding the complicated and costly projection onto the metric space. Therefore, we propose to conduct label prediction and label correlation exploration simultaneously based on the substituted optimal transport; combining Eq. 4 and Eq. 2 yields:

$$\begin{split}&\min_{S,\,P_{v}\in U(f({\bf X}_{i}^{v}),{\bf y}_{i})}\sum_{v=1}^{2}\sum_{i=1}^{N}\langle P_{v},M\rangle+\lambda_{1}r(S,S_{0})\\ \text{s.t.}\quad&U(f({\bf X}_{i}^{v}),{\bf y}_{i})=\{P_{v}\in\mathbb{R}_{+}^{L\times L}\mid P_{v}{\bf 1}_{L}=f({\bf X}_{i}^{v}),\ P_{v}^{\top}{\bf 1}_{L}={\bf y}_{i}\}\\ &S\in\mathcal{S}_{+},\quad M_{ij}=S_{ii}+S_{jj}-2S_{ij}\end{split}\quad (5)$$

where $\lambda_{1}$ is a trade-off parameter and $\mathcal{S}_{+}$ denotes the set of positive semi-definite matrices. We adopt the OT distance as the loss between prediction and ground truth, and incorporate ground metric learning via the kernel-biased regularization in the second term, where $\lambda_{1}r(S,S_{0})$ can be any convex regularization. The regularizer $r:\mathcal{S}_{+}\times\mathcal{S}_{+}\rightarrow\mathcal{R}_{+}$ allows us to exploit prior knowledge on the kernelized similarity matrix, encoded by a reference matrix $S_{0}$. Since typically no strong prior knowledge is available, we use $S_{0}=\mathcal{Y}^{\prime}\times\mathcal{Y}$. Following common practice [24], we utilize the asymmetric Burg divergence, which yields:

$$r(S,S_{0})=\mathrm{tr}(SS_{0}^{-1})-\log\det(SS_{0}^{-1})-p$$

where $p$ is the balance parameter, which we set to 1 in our experiments.
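The two ingredients above, the ground metric induced by the kernel (Eq. 4) and the Burg-divergence regularizer $r(S,S_{0})$, can be written in a few lines; the NumPy sketch below is illustrative only, with hypothetical function names.

```python
import numpy as np

def ground_metric_from_kernel(S):
    """M_ij = S_ii + S_jj - 2 S_ij (Eq. 4): squared distance induced by the kernel S."""
    d = np.diag(S)
    return d[:, None] + d[None, :] - 2.0 * S

def burg_divergence(S, S0, p=1.0):
    """r(S, S0) = tr(S S0^{-1}) - log det(S S0^{-1}) - p, with p the balance parameter."""
    A = S @ np.linalg.inv(S0)
    _, logdet = np.linalg.slogdet(A)
    return np.trace(A) - logdet - p
```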

Figure 4: The flowchart of M3DNS, which considers unlabeled data. Similar to M3DN, a raw article can be divided into two homogeneous modal bags with variable numbers of heterogeneous instances. The instances of different modalities are processed by different deep networks and finally represented as ${\bf x}_{l_{p}}^{1}$ or ${\bf x}_{l_{p}}^{2}$. The output features of labeled data are fully connected with the labels, while decoder networks are added for each modality to process the unlabeled data. On the other hand, the bag representations of all data are obtained from the bag-concept layers of the different modalities. Eventually, we acquire the final predictions of the different modalities and calculate the semi-supervised loss.

3.5 Consider Unsupervised Data

M3DN provides a framework for handling complex multi-modal multi-instance multi-label objects, and it treats label correlation exploration as an optimization problem, as in Eq. 5. Due to the cost of manual labeling, real applications leave a large amount of data unlabeled; in other words, unlabeled data is readily available, while labeled data tends to be of smaller size. The basic intuition of multi-modal learning is to utilize the complementary or consistent information of unlabeled data to obtain better performance. Yet M3DN leaves the unlabeled data out of consideration, which clearly loses an advantage of multi-modal learning. Consequently, how to extend M3DN to the semi-supervised scenario is an urgent problem.

To exploit extrinsic consistency, i.e., the unlabeled information of different modalities, we propose a semi-supervised M3DN (M3DNS) method for learning the modal predictors. Different from previous co-regularization style methods that use an instance-level consistency principle, M3 learning only has bag-level consistency among different modalities. Thus, there are two challenges in using unlabeled data in M3 learning: 1) how to utilize the instance-level unlabeled data of each modality; 2) how to utilize the bag-level consistency of unlabeled data across modalities.

To solve these problems, M3DNS handles instance-level unlabeled data with auto-encoders and bag-level unlabeled data with a modified OT. As shown in Fig. 4, since different modal bags include various numbers of instances and the correspondences among instances of different modalities are unknown, we utilize auto-encoder based networks to reconstruct the input instances of each modality, which builds more robust encoder networks. On the other hand, bag-level correspondences are known; thereby, for the bag-level unlabeled data, we utilize a modified OT consistency term to constrain the different modalities.

Specifically, each modal original network can be replaced by an auto-encoder (AE) network, which minimizes the reconstruction error of all the instances, i.e., an auto-encoder CNN for the image modality and an auto-encoder fully connected network for the content modality. Without loss of generality, the AE can be formulated with the squared loss:

$$AE({\bf x}^{v})=\min_{\Theta_{f_{v}},\Theta_{r_{v}}}\sum_{i=N_{l}+1}^{N_{l}+N_{u}}\|{\bf x}_{i}^{v}-r_{v}(f_{v}({\bf x}_{i}^{v}))\|_{F}^{2}\quad (6)$$

where $\Theta_{f_{v}},\Theta_{r_{v}}$ are the weight parameters of the encoder network $f_{v}$ and the decoder network $r_{v}$ of the $v$-th modality.
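A minimal sketch of the instance-level auto-encoder term of Eq. 6 for one modality is given below; the paper uses a convolutional auto-encoder for images and a fully connected one for contents, while this illustrative version is fully connected and its names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ModalAutoEncoder(nn.Module):
    """Instance-level auto-encoder for one modality: the encoder f_v also feeds the
    bag-concept layer, and the decoder r_v reconstructs the input instances (Eq. 6)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.decoder = nn.Linear(hid_dim, in_dim)

    def reconstruction_loss(self, x):        # x: (num_unlabeled_instances, in_dim)
        return ((x - self.decoder(self.encoder(x))) ** 2).sum()
```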

On the other hand, Eq. 2 only utilizes the supervised information and neglects the bag-level correspondences of the unlabeled data across modalities. Thus, with the unlabeled information, Eq. 2 can be reformulated as:

$$\begin{split}&\min_{P_{v}\in U,\hat{P}\in\hat{U}}\sum_{v=1}^{2}\sum_{i=1}^{N_{l}}\langle P_{v},M\rangle+\sum_{i=1}^{N_{u}}\langle\hat{P},M\rangle\\ \text{s.t.}\quad&U=\{P_{v}\in\mathbb{R}_{+}^{L\times L}\mid P_{v}{\bf 1}_{L}=f({\bf X}_{i}^{v}),\ P_{v}^{\top}{\bf 1}_{L}={\bf y}_{i}\}\\ &\hat{U}=\{\hat{P}\in\mathbb{R}_{+}^{L\times L}\mid\hat{P}{\bf 1}_{L}=f({\bf X}_{i}^{1}),\ \hat{P}^{\top}{\bf 1}_{L}=f({\bf X}_{i}^{2})\}\end{split}\quad (7)$$

where $\hat{P}$ is the pseudo transport matrix (or coupling probability) for unlabeled data. The extra unlabeled modal predictions can be regarded as pseudo labels in $\hat{P}$ for constructing more discriminative predictors. In detail, when learning one modal predictor, the prediction of the other modality acts as the pseudo label, which assists in learning a more discriminative predictor with unlabeled data. Thus, M3DNS makes good use of the bag-level consistency among different modalities and can acquire a more robust ground metric $M$, which implicitly exploits the consistency between different modal bags.

As a result, combining Eq. 7 and Eq. 6 with the unlabeled information, the semi-supervised M3DN method (M3DNS) is given as:

$$\begin{split}\min_{P_{v}\in U,\hat{P}\in\hat{U}}\ &\sum_{v=1}^{2}\Big[\sum_{i=1}^{N_{l}}\langle P_{v},M\rangle+\sum_{i=N_{l}+1}^{N_{l}+N_{u}}AE({\bf x}_{i}^{v})\Big]+\sum_{i=1}^{N_{u}}\langle\hat{P},M\rangle+\lambda_{1}r(S,S_{0})\\ \text{s.t.}\quad&U=\{P_{v}\in\mathbb{R}_{+}^{L\times L}\mid P_{v}{\bf 1}_{L}=f({\bf X}_{i}^{v}),\ P_{v}^{\top}{\bf 1}_{L}={\bf y}_{i}\}\\ &\hat{U}=\{\hat{P}\in\mathbb{R}_{+}^{L\times L}\mid\hat{P}{\bf 1}_{L}=f({\bf X}_{i}^{1}),\ \hat{P}^{\top}{\bf 1}_{L}=f({\bf X}_{i}^{2})\}\\ &S\in\mathcal{S}_{+},\quad M_{ij}=S_{ii}+S_{jj}-2S_{ij}\end{split}\quad (8)$$
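To make the structure of Eq. 8 explicit, the following sketch assembles its four parts from user-supplied components (an OT loss, bag-level predictions, auto-encoder reconstruction losses and the regularizer value); all names are illustrative assumptions rather than the released implementation.

```python
def m3dns_objective(ot_loss, preds1_l, preds2_l, labels,
                    preds1_u, preds2_u, recon1_u, recon2_u, reg_S, lam1=1.0):
    """Eq. 8 sketch: supervised OT terms for both modalities, instance-level AE terms
    on unlabeled data, bag-level OT consistency between the two modal predictions on
    unlabeled bags, and the ground-metric regularizer lam1 * r(S, S0)."""
    supervised = sum(ot_loss(p, y) for p, y in zip(preds1_l, labels)) \
               + sum(ot_loss(p, y) for p, y in zip(preds2_l, labels))
    consistency = sum(ot_loss(p1, p2) for p1, p2 in zip(preds1_u, preds2_u))
    return supervised + recon1_u + recon2_u + consistency + lam1 * reg_S
```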

3.6 Optimization

$\hat{P}$ plays the same role as $P_{v}$ when the prediction of the other modality is regarded as the pseudo label. Thus, we analyze the optimization of Eq. 5; Eq. 8 has a similar solution. In detail, the first term in Eq. 5 involves the product of the predictors $f$ and the cost matrix induced by $S$, which makes the formulation not jointly convex, so it cannot be optimized easily. We provide the optimization process below:

Algorithm 1 The pseudo code of learning the predictors

Input:

  • Sampled batch dataset: $\{[{\bf X}_{i}^{1},{\bf X}_{i}^{2}],{\bf y}_{i}\}_{i=1}^{n}$, kernelized similarity matrix $S^{t}$, current mappings $f_{1},f_{2}$

  • Parameter: $\lambda$

Output:

  • Gradients of the target mappings: $\partial L/\partial f_{1}$, $\partial L/\partial f_{2}$

1:  Calculate $M\leftarrow$ Eq. 4
2:  Initialize $K=\exp(-\lambda M-1)$, $\nabla\leftarrow 0$
3:  for $v=1,2$ do
4:     for $i=1,2,\cdots,n$ do
5:        $u_{i}^{v}\leftarrow 1$
6:        while $u_{i}^{v}$ not converged do
7:           $u_{i}^{v}\leftarrow f_{v}({\bf x}_{i}^{v})\oslash(K({\bf y}_{i}^{v}\oslash K^{\top}u_{i}^{v}))$
8:        end while
9:        $\nabla^{f_{v}}\leftarrow\nabla^{f_{v}}+\frac{\log u_{i}^{v}}{\lambda}-\frac{{\log u_{i}^{v}}^{\top}{\bf 1}}{\lambda L}\cdot{\bf 1}$
10:     end for
11:  end for

Fix $S$, optimize $f_{1},f_{2}$: When updating $f_{1},f_{2}$ with a fixed $S$, the second term of Eq. 5 is irrelevant to $f_{1},f_{2}$, and Eq. 5 reduces to:

$$\begin{split}&\min_{P_{v}\in U(f({\bf X}_{i}^{v}),{\bf y}_{i})}\sum_{v=1}^{2}\sum_{i=1}^{N}\langle P_{v},M\rangle\\ \text{s.t.}\quad&U(f({\bf X}_{i}^{v}),{\bf y}_{i})=\{P_{v}\in\mathbb{R}_{+}^{L\times L}\mid P_{v}{\bf 1}_{L}=f({\bf X}_{i}^{v}),\ P_{v}^{\top}{\bf 1}_{L}={\bf y}_{i}\}\end{split}\quad (9)$$

The empirical risk minimization problem of Eq. 9 can be optimized by stochastic gradient descent. However, this requires evaluating the descent direction of the loss with respect to the predictor $f$. Computing the exact subgradient is quite costly, since it requires solving a linear program with $O(L^{2})$ constraints, whose expense grows rapidly as the label dimension $L$ increases.

Similar to [7], the loss is a linear program, and the subgradient can be computed using Lagrangian duality. Therefore, we adopt a primal-dual approach and compute the gradient by solving the dual LP problem. From [25], we know that the dual optimum $\alpha$ is, in fact, the subgradient of the loss of a training sample $({\bf X}^{v},{\bf y})$ with respect to its first argument $f_{v}$. However, computing the exact loss directly is costly. In [26], a Sinkhorn relaxation is adopted: an entropic regularization smooths the transport objective, resulting in a strictly convex problem that can be solved through the Sinkhorn matrix scaling algorithm, at a speed that is much faster than that of exact transport solvers [26].

For a given training bag of instances $([{\bf X}^{1},{\bf X}^{2}],{\bf y})$, the dual LP of Eq. 9 is:

$$d_{M}(f_{v}({\bf X}^{v}),{\bf y})=\max_{\alpha,\beta\in C_{M}}\alpha^{\top}f_{v}({\bf X}^{v})+\beta^{\top}{\bf y},\quad (10)$$

where $C_{M}=\{\alpha,\beta\in\mathbb{R}^{L}:\alpha_{i}+\beta_{j}\leq M_{i,j}\}$.

Definition 4

(Sinkhorn Distance) Given an $L\times L$ cost matrix $M$ and histograms $r,c\in\sum_{L}$, the Sinkhorn distance is defined as:

$$\begin{split}&d_{M}^{\lambda}(r,c)=\langle P^{\lambda},M\rangle\\ &P^{\lambda}=\arg\min_{P\in U(r,c)}\langle P,M\rangle-\frac{1}{\lambda}H(P)\end{split}\quad (11)$$

where $H(P)=-\sum_{i=1}^{L}\sum_{j=1}^{L}p_{ij}\log p_{ij}$ is the entropy of $P$, and $\lambda>0$ is the entropic regularization coefficient.

TABLE I: Comparison results (mean ± std.) of M3DN/M3DNS with compared methods on the benchmark datasets.
Methods Coverage \downarrow Macro AUC \uparrow
FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE FLICKR25K IAPRTC-12 MS-CoCo NUS-WIDE
M3LDA 12.345±\pm.214 11.620±\pm.042 47.400±\pm.622 6.670±\pm.205 .532±\pm.015 .526±\pm.003 .507±\pm.015 .509±\pm.012
MIMLmix 17.114±\pm1.024 15.720±\pm.543 64.130±\pm1.121 14.167±\pm1.140 .472±\pm.018 .554±\pm.096 .471±\pm.019 .493±\pm.020
CS3G 8.168±\pm.137 7.153±\pm.178 50.138±\pm2.146 8.028±\pm.907 .837±\pm.007 .817±\pm.006 .717±\pm.011 .530±\pm.022
DeepMIML 9.242±\pm.331 8.931±\pm.421 27.358±\pm.654 8.369±\pm.119 .766±\pm.035 .795±\pm.022 .827±\pm0.006 .823±\pm.005
M3MIML 11.760±\pm1.121 9.125±\pm.553 42.420±\pm2.696 5.210±\pm.920 .687±\pm.087 .724±\pm.033 .650±\pm.032 .649±\pm.084
MIMLfast 12.155±\pm.913 12.711±\pm.315 41.048±\pm.831 8.634±\pm.028 .524±\pm.050 .485±\pm.009 .506±\pm.010 .522±\pm.008
SLEEC 9.568±\pm.222 9.494±\pm.105 47.502±\pm.448 7.390±\pm.275 .706±\pm.007 .675±\pm.007 .661±\pm.014 .620±\pm.006
Tram 7.959±\pm.187 8.156±\pm.163 28.417±\pm.945 9.934±\pm.026 .780±\pm.009 .746±\pm.007 .776±\pm.011 .493±\pm.007
ECC 14.818±\pm.086 14.229±\pm.258 47.124±\pm.675 7.941±\pm.194 .532±\pm.013 .484±\pm.009 .630±\pm.023 .634±\pm.009
ML-KNN 10.379±\pm.115 9.523±\pm.072 27.568±\pm.066 4.610±\pm.062 .591±\pm.008 .723±\pm.006 .823±\pm.003 .736±\pm.008
RankSVM 11.439±\pm.196 11.941±\pm.078 37.300±\pm.835 8.292±\pm.054 .512±\pm.019 .499±\pm.009 .521±\pm.033 .501±\pm.001
ML-SVM 11.311±\pm.158 11.755±\pm.270 39.258±\pm.294 7.890±\pm.020 .503±\pm.010 .502±\pm.010 .497±\pm.016 .561±\pm.001
M3DN 7.502±\pm.129 6.936±\pm.065 26.921±\pm.320 4.599±\pm.050 .822 ±\pm.009 .798±\pm.002 .811±\pm.004 .826±\pm.006
M3DNS 3.947±\pm.307 4.214±\pm.202 6.119±\pm.262 2.764±\pm.071 .892±\pm.004 .876±\pm.003 .838±\pm.003 .898±\pm.008
Methods Ranking Loss \downarrow Example AUC \uparrow
FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE FLICKR25K IAPRTC-12 MS-CoCo NUS-WIDE
M3LDA .301±\pm.009 .377±\pm.002 .247±\pm.001 .257±\pm.006 .707±\pm.008 .630±\pm.005 .770±\pm.006 .652±\pm.009
MIMLmix .609±\pm.036 .675±\pm.012 .609±\pm.040 .583±\pm.081 .391±\pm.036 .325±\pm.012 .391±\pm.040 .417±\pm.082
CS3G .118±\pm.005 .155±\pm.005 .202±\pm.009 .170±\pm.032 .881±\pm.005 .835±\pm.005 .798±\pm.009 .642±\pm.032
DeepMIML .149±\pm.012 .166±\pm.017 .089±\pm.002 .164±\pm.007 .791±\pm.044 .834±\pm.017 .911±\pm.002 .835±\pm.007
M3MIML .271±\pm.053 .250±\pm.011 .191±\pm.016 .284±\pm.030 .729±\pm.053 .751±\pm.011 .811±\pm.017 .717±\pm.031
MIMLfast .275±\pm.033 .435±\pm.021 .194±\pm.006 .430±\pm.009 .724±\pm.033 .626±\pm.013 .811±\pm.005 .646±\pm.009
SLEEC .316±\pm.009 .413±\pm.006 .455±\pm.005 .512±\pm.008 .843±\pm.003 .761±\pm.005 .796±\pm.002 .713±\pm.008
Tram .132±\pm.004 .203±\pm.007 .117±\pm.004 .456±\pm.004 .867±\pm.004 .797±\pm.007 .883±\pm.005 .591±\pm.001
ECC .804±\pm.024 .928±\pm.013 .461±\pm.009 .617±\pm.020 .642±\pm.005 .529±\pm.012 .775±\pm.005 .697±\pm.013
ML-KNN .235±\pm.005 .264±\pm.004 .097±\pm.002 .176±\pm.003 .764±\pm.005 .736±\pm.004 .903±\pm.001 .824±\pm.003
RankSVM .236±\pm.006 .344±\pm.001 .199±\pm.098 .323±\pm.008 .763±\pm.006 .656±\pm.001 .801±\pm.098 .677±\pm.001
ML-SVM .232±\pm.005 .337±\pm.009 .179±\pm.004 .314±\pm.002 .768±\pm.005 .662±\pm.009 .822±\pm.004 .686±\pm.002
M3DN .108±\pm.003 .151±\pm.002 .085±\pm.002 .117±\pm.002 .891±\pm.003 .850±\pm.003 .915±\pm.003 .883±\pm.001
M3DNS .108±\pm.001 .142±\pm.002 .112±\pm.003 .119±\pm.003 .899±\pm.004 .858±\pm.005 .898±\pm.008 .881±\pm.006
Methods Average Precision \uparrow Micro AUC \uparrow
FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE FLICKR25K IAPRTC-12 MS-CoCo NUS-WIDE
M3LDA .371±\pm.005 .311±\pm.007 .399±\pm.007 .338±\pm.005 .693±\pm.006 .609±\pm.002 .773±\pm.005 .657±\pm.008
MIMLmix .207±\pm.038 .183±\pm.008 .213±\pm.041 .167±\pm.020 .436±\pm.024 .438±\pm.060 .434±\pm.026 .472±\pm.015
CS3G .749±\pm.008 .622±\pm.006 .542±\pm.012 .597±\pm.031 .867±\pm.005 .827±\pm.006 .738±\pm.007 .557±\pm.021
DeepMIML .621±\pm.027 .619±\pm.025 .633±\pm.005 .583±\pm.008 .835±\pm.009 .802±\pm.017 .914±\pm.002 .852±\pm.003
M3MIML .423±\pm.056 .490±\pm.020 .446±\pm.030 .443±\pm.076 .745±\pm.034 .707±\pm.017 .816±\pm.020 .762±\pm.020
MIMLfast .432±\pm.064 .339±\pm.013 .413±\pm.005 .365±\pm.021 .712±\pm.022 .540±\pm.010 .745±\pm.012 .630±\pm.005
SLEEC .608±\pm.006 .473±\pm.010 .565±\pm.003 .392±\pm.007 .824±\pm.004 .736±\pm.005 .795±\pm.002 .701±\pm.005
Tram .653±\pm.011 .523±\pm.008 .494±\pm.007 .336±\pm.002 .842±\pm.003 .782±\pm.007 .883±\pm.006 .554±\pm.002
ECC .416±\pm.012 .278±\pm.011 .462±\pm.007 .438±\pm.014 .646±\pm.004 .514±\pm.008 .779±\pm.005 .702±\pm.009
ML-KNN .398±\pm.006 .403±\pm.010 .585±\pm.002 .439±\pm.006 .752±\pm.005 .729±\pm.003 .905±\pm.002 .817±\pm.004
RankSVM .467±\pm.005 .364±\pm.004 .427±\pm.066 .401±\pm.001 .748±\pm.005 .649±\pm.004 .791±\pm.093 .680±\pm.003
ML-SVM .466±\pm.006 .367±\pm.006 .441±\pm.007 .443±\pm.007 .753±\pm.004 .656±\pm.009 .825±\pm.004 .724±\pm.001
M3DN .719±\pm.006 .634±\pm.003 .680±\pm.005 .691±\pm.001 .876±\pm.003 .834±\pm.001 .918±\pm.002 .877±\pm.003
M3DNS .698±\pm.002 .637±\pm.007 .691±\pm.004 .634±\pm.003 .858±\pm.003 .863±\pm.004 .877±\pm.006 .878±\pm.005
TABLE II: Comparison results (mean ± std.) of M3DN/M3DNS with compared methods on the WKG Game-Hub dataset. 6 commonly used criteria are evaluated. The best performance for each criterion is bolded. $\uparrow$/$\downarrow$ indicates that larger/smaller values of the criterion are better.
Methods Content Modality
Coverage \downarrow
(×102\times 10^{2})
Macro
AUC \uparrow
Ranking
Loss \downarrow
Example
AUC \uparrow
Average
Precision \uparrow
Micro
AUC \uparrow
M3LDA .466±\pm.020 .470±\pm.015 1.000±\pm1.000 .360±\pm.056 .098±\pm.001 .381±\pm.036
MIMLmix .334±\pm.003 .507±\pm.002 .445±\pm.006 .539±\pm.001 .111±\pm.001 .540±\pm.003
CS3G .362±\pm.002 .593±\pm.001 .340±\pm.003 .659±\pm.003 .371±\pm.002 .614±\pm.007
DeepMIML .341±\pm.010 .533±\pm.018 .415±\pm.027 .186±\pm.025 .600±\pm.030 .634±\pm.014
M3MIML N/A N/A N/A N/A N/A N/A
MIMLfast .363±\pm.040 .496±\pm.050 .414±\pm.056 .585±\pm.056 .162±\pm.033 .567±\pm.040
M3DN .258±\pm.006 .761±\pm.016 .276±\pm.008 .723±\pm.008 .329±\pm.002 .753±\pm.007
M3DNS .246±\pm.002 .763±\pm.001 .255±\pm.002 .744±\pm.002 .332±\pm.001 .763±\pm.001
Methods Image Modality
Coverage \downarrow
(×102\times 10^{2})
Macro
AUC \uparrow
Ranking
Loss \downarrow
Example
AUC \uparrow
Average
Precision \uparrow
Micro
AUC \uparrow
M3LDA .466±\pm.010 .455±\pm.054 1.000±\pm.000 .359±\pm.019 .098±\pm.001 .384±\pm.030
MIMLmix .329±\pm.002 .502±\pm.003 .427±\pm.005 .557±\pm.001 .114±\pm.001 .560±\pm.002
CS3G .395±\pm.004 .545±\pm.001 .405±\pm.003 .595±\pm.003 .304±\pm.003 .563±\pm.006
DeepMIML .383±\pm.006 .512±\pm.002 .515±\pm.009 .484±\pm.009 .121±\pm.001 .488±\pm.018
M3MIML N/A N/A N/A N/A N/A N/A
MIMLfast .402±\pm.070 .512±\pm.061 .433±\pm.059 .566±\pm.059 .170±\pm.037 .547±\pm.058
M3DN .175±\pm.001 .896±\pm.001 .210±\pm.002 .789±\pm.002 .402±\pm.001 .586±\pm.000
M3DNS .164±\pm.001 .910±\pm.003 .196±\pm.001 .803±\pm.001 .407±\pm.000 .869±\pm.000
Methods Overall
Coverage \downarrow
(×102\times 10^{2})
Macro
AUC \uparrow
Ranking
Loss \downarrow
Example
AUC \uparrow
Average
Precision \uparrow
Micro
AUC \uparrow
M3LDA .466±\pm.008 .468±\pm.026 1.000±\pm.000 .359±\pm.030 .098±\pm.001 .383±\pm.017
MIMLmix .358±\pm.003 .504±\pm.002 .488±\pm.007 .496±\pm.001 .101±\pm.001 .519±\pm.003
CS3G .361±\pm.004 .589±\pm.003 .346±\pm.004 .653±\pm.004 .365±\pm.001 .612±\pm.004
DeepMIML .362±\pm.005 .518±\pm.002 .488±\pm.008 .512±\pm.008 .125±\pm.001 .524±\pm.018
M3MIML N/A N/A N/A N/A N/A N/A
MIMLfast .393±\pm.060 .509±\pm.064 .430±\pm.052 .596±\pm.052 .170±\pm.036 .549±\pm.054
SLEEC .603±\pm.013 .518±\pm.004 .756±\pm.007 .493±\pm.005 .150±\pm.006 .583±\pm.006
Tram .712±\pm.005 .429±\pm.008 .109±\pm.010 .545±\pm.003 .164±\pm.008 .464±\pm.006
ECC .622±\pm.017 .630±\pm.002 .632±\pm.009 .530±\pm.017 .198±\pm.002 .592±\pm.011
ML-KNN .675±\pm.020 .712±\pm.006 .175±\pm.003 .802±\pm.015 .265±\pm.004 .814±\pm.001
RankSVM N/A N/A N/A N/A N/A N/A
ML-SVM .742±\pm.023 .561±\pm.002 .223±\pm.009 .782±\pm.008 .234±\pm.003 .793±\pm.002
M3DN .163±\pm.003 .924±\pm.002 .190±\pm.004 .809±\pm.004 .401±\pm.003 .866±\pm.003
M3DNS .149±\pm.002 .933±\pm.001 .180±\pm.009 .828±\pm.003 .409±\pm.001 .880±\pm.001

Based on the Sinkhorn theorem, the optimal transport matrix can be written in the form $P^{\star}=\mathrm{diag}(u)K\mathrm{diag}(v)$, where $K=\exp(-\lambda M-1)$ is the element-wise exponential of $-\lambda M-1$. Besides, $u=\exp(\lambda\alpha)$ and $v=\exp(\lambda\beta)$.

Therefore, we adopt the well-known Sinkhorn-Knopp algorithm, as used in [26, 20], to update the target mapping $f_{v}$ given the ground metric, where $f_{v}$ is defined as in Eq. 1. The detailed procedure is summarized in Algorithm 1; with the help of back-propagation, gradient descent can then be adopted to update the network parameters.
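A NumPy sketch of the Sinkhorn-Knopp scaling used in Algorithm 1 is shown below; it returns the entropic-OT subgradient with respect to the prediction, and the fixed iteration count and strictly positive marginals are simplifying assumptions.

```python
import numpy as np

def sinkhorn_subgradient(pred, y, M, lam=10.0, n_iter=200):
    """Scale K = exp(-lam * M - 1) to the marginals (pred, y) and return the gradient
    of the smoothed OT loss w.r.t. pred (lines 2, 7 and 9 of Algorithm 1)."""
    K = np.exp(-lam * M - 1.0)
    u = np.ones_like(pred)
    for _ in range(n_iter):                       # u <- f_v(x) ./ (K (y ./ K^T u))
        u = pred / (K @ (y / (K.T @ u)))
    L = len(pred)
    return np.log(u) / lam - np.log(u).sum() / (lam * L)   # shifted to sum to zero
```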

Algorithm 2 The pseudo code of M3DN

Input:

  • Dataset: $\mathcal{D}=\{[{\bf X}_{i}^{1},{\bf X}_{i}^{2}],{\bf y}_{i}\}_{i=1}^{N}$

  • Parameters: $\lambda_{1}$, $\lambda$

  • maxIter: $T$, learning rates: $\{\alpha_{t}\}_{t=1}^{T}$

Output:

  • Classifiers: $f_{1},f_{2}$

  • Label similarity matrices: $S,M$

1:  Initialize $S_{0}\leftarrow\mathcal{Y}^{\prime}\times\mathcal{Y}$
2:  while true do
3:     Create batch: randomly pick $n$ examples from $\mathcal{D}$ without replacement
4:     Calculate $S^{t+1}\leftarrow$ Eq. 13, Eq. 14
5:     Calculate $\partial L/\partial f_{1}^{t},\partial L/\partial f_{2}^{t}\leftarrow$ Alg. 1
6:     Back-propagation step: obtain the derivatives $\partial f_{1}^{t}/\partial\Theta_{1}$, $\partial f_{2}^{t}/\partial\Theta_{2}$
7:     Update parameters $\Theta_{1},\Theta_{2}$
8:     $Func_{obj}^{t+1}\leftarrow$ objective value of Eq. 5 with $F^{t+1}$
9:     if $\|Func_{obj}^{t+1}-Func_{obj}^{t}\|\leq\epsilon$ or $t\geq T$ then
10:        Break
11:     end if
12:  end while

Fix $f_{1},f_{2}$, optimize $S$:

When updating $S$ with fixed $f_{1},f_{2}$, the sub-problem can be rewritten as follows:

$$\begin{split}&\min_{S}\sum_{v=1}^{2}\sum_{i=1}^{N}\langle P_{v},M\rangle+\lambda_{1}r(S,S_{0})\\ \text{s.t.}&\quad S\in\mathcal{S}_{+},\quad M_{ij}=S_{ii}+S_{jj}-2S_{ij}.\end{split}\quad (12)$$

This sub-problem has a closed-form solution, which can be formulated as:

$$S=(\bar{P}+S_{0}^{-1}-p)^{-1}\quad (13)$$

where

$$\bar{P}_{ij}=\begin{cases}-2P_{ij},&i\neq j,\\ \sum_{k\neq i}^{L}(P_{ik}+P_{ki}),&i=j\end{cases}$$

Then, we project $S$ back onto the positive semi-definite cone as:

$$S={\bf Proj}(S)=U\max(\sigma,0)U^{\top}\quad (14)$$

where ${\bf Proj}$ is the projection operator, and $U$ and $\sigma$ are the eigenvectors and eigenvalues of $S$. The whole procedure is summarized in Algorithm 2.
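The closed-form update of Eq. 13 followed by the projection of Eq. 14 can be sketched as below; NumPy is assumed and the scalar treatment of the balance parameter $p$ follows Eq. 13 as written.

```python
import numpy as np

def update_similarity(P_bar, S0, p=1.0):
    """One S-update: Eq. 13 closed form, then projection onto the PSD cone (Eq. 14)."""
    S = np.linalg.inv(P_bar + np.linalg.inv(S0) - p)   # Eq. 13
    w, U = np.linalg.eigh(S)                           # eigenvalues / eigenvectors of S
    return (U * np.maximum(w, 0.0)) @ U.T              # U max(sigma, 0) U^T
```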

Eq. 8 can be optimized in the same way as M3DN with the GCD method. Without loss of generality, in the semi-supervised scenario, the prediction of the other modality, $f({\bf X}^{3-v})$, can be regarded as the pseudo label playing the role of ${\bf y}$ in the supervised term when updating $f_{1},f_{2}$. $S$ can be updated in a similar form, where

$$\bar{P}_{ij}=\begin{cases}-2(P_{ij}+\hat{P}_{ij}),&i\neq j,\\ \sum_{k\neq i}^{L}(P_{ik}+P_{ki}+\hat{P}_{ik}+\hat{P}_{ki}),&i=j\end{cases}$$
Figure 5: Illustration of the learned label correlations for different datasets; the values have been scaled to [-1,1]. Red indicates a positive correlation, and blue indicates a negative correlation.

     (a) M3DN


     (b) M3DNS

Figure 6: Objective function value convergence and the corresponding classification performance (Coverage, Ranking Loss, Average Precision, Macro AUC, Example AUC and Micro AUC) vs. the number of iterations for M3DN and M3DNS.

4 Experiments

4.1 Datasets and Configurations

M3DN/M3DNS learns more discriminative bag-level multi-modal feature representations for supervised/semi-supervised multi-label classification, while considering the correlation among different labels. Thus, in this section, we provide empirical investigations and performance comparisons of M3DN on multi-label classification and label correlation. Without loss of generality, we experiment on 4 public real-world datasets, i.e., FLICKR25K [27], IAPR TC-12 [28], MS-COCO [29] and NUS-WIDE [30]. Besides, we experiment on 1 real-world complex article dataset, i.e., WKG Game-Hub. FLICKR25K consists of 25,000 images collected from the Flickr website, and each image is associated with several textual tags. The text of each instance is represented as a 1386-dimensional bag-of-words vector, and each point is manually annotated with 24 labels. We select 23,600 image-text pairs that belong to the 10 most frequent concepts. IAPR TC-12 consists of 20,000 image-text pairs annotated with 255 labels; the text of each point is represented as a 2912-dimensional bag-of-words vector. NUS-WIDE contains 260,648 web images associated with textual tags, where each point is annotated with 81 concept labels. We select 195,834 image-text pairs that belong to the 21 most frequent concepts; the text of each point is represented as a 1000-dimensional bag-of-words vector. MS-COCO contains 82,783 training and 40,504 validation image-text pairs belonging to 91 categories. We select 38,000 image-text pairs that belong to the 20 most frequent concepts; the text of each point is represented as a 2912-dimensional bag-of-words vector. WKG Game-Hub consists of 13,750 articles collected from the Game-Hub of “Strike of Kings” with 1744 concept labels. We select 11,000 image-text pairs that belong to the 54 most frequent concepts. Each article contains several images and content paragraphs, and the text of each point is represented as a 300-dimensional word2vec vector.

We run each compared method 30 times on all datasets, each time randomly selecting 70% of the data for training and using the rest for testing. Among the training examples, we randomly choose 30% as labeled data and the remaining 70% as unlabeled data, as in [31]. For the 4 benchmark datasets, each image is divided into 10 regions using [32] to form the image bag, while the corresponding text tags are separated into several independent tags to form the text bag. For the WKG Game-Hub dataset, each article is represented as an image bag and a content bag. The deep network for the image encoder is implemented in the same way as ResNet-18 [33]. We run the experiments on an NVIDIA K80 GPU server, and our model can be trained at around 290 images per second on a single K80 GPU. In the training phase, the parameter $\lambda_{1}$ is selected by 5-fold cross validation from $\{10^{-5},10^{-4},\cdots,10^{4},10^{5}\}$ with further splitting on only the training data, i.e., there is no overlap between the test set and the validation set used for parameter selection. Empirically, when the variation of the objective value of Eq. 13 between iterations is less than $10^{-6}$, we treat M3DN or M3DNS as converged.
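For reference, the data split protocol described above can be sketched as follows; the function name and seed are illustrative assumptions.

```python
import numpy as np

def split_indices(n_bags, seed=0, train_frac=0.7, labeled_frac=0.3):
    """70% of the bags for training and 30% for testing; within the training part,
    30% are kept as labeled data and the remaining 70% are treated as unlabeled."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_bags)
    n_train = int(train_frac * n_bags)
    train, test = idx[:n_train], idx[n_train:]
    n_labeled = int(labeled_frac * n_train)
    return train[:n_labeled], train[n_labeled:], test   # labeled, unlabeled, test
```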

Figure 7: Sample predictions on test complex articles from the WKG Game-Hub. Left: the image bag; middle: the label predictions; right: the context bag.

4.2 Compared methods

In our experiments, we first compare our methods with multi-modal multi-instance multi-label methods, i.e., M3LDA [3] and MIMLmix [4]. Besides, since M3DN can be degenerated into different settings, we also compare with a multi-modal multi-label method, i.e., CS3G [34]; multi-instance multi-label methods, i.e., DeepMIML [14], M3MIML [35] and MIMLfast [36]; and multi-label methods, i.e., SLEEC [37], Tram [38], ECC [39], ML-KNN [40], RankSVM [41] and ML-SVM [42]. Specifically, for the multi-modal multi-label methods, we calculate the average of all instance representations as the bag-level feature representation. For the multi-instance multi-label methods, all modalities of a dataset are concatenated together as a single modal input. For the multi-label learners, we first calculate bag-level feature representations for each modality independently, and then concatenate all modalities together as a single modal input. As for the semi-supervised scenario, considering that existing M3 methods are supervised, we compare our methods with a semi-supervised multi-modal multi-label method, i.e., CS3G [34], and semi-supervised multi-label methods, i.e., Tram [38], COINS [17] and iMLU [43].

TABLE III: Semi-supervised comparison results (mean ± std.) of M3DNS with compared methods on the 4 benchmark datasets. 6 commonly used criteria are evaluated. The best performance for each criterion is bolded. $\uparrow$/$\downarrow$ indicates that larger/smaller values of the criterion are better.
Methods Coverage \downarrow Macro AUC \uparrow
FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE FLICKR25K IAPRTC-12 MS-CoCo NUS-WIDE
CS3G 10.346±\pm.227 7.545±\pm.056 6.968±\pm.060 9.819±\pm.931 .844±\pm.006 .798±\pm.002 .699±\pm.006 .662±\pm.077
Tram 6.857±\pm.645 5.793±\pm.359 55.059±\pm1.888 9.359±\pm.223 .827±\pm.001 .805±\pm.001 .891±\pm.001 .890±\pm.045
COINS 22.940±\pm5.082 20.598±\pm4.513 25.839±\pm10.629 20.126±\pm4.072 .891±\pm.004 .863±\pm.006 .814±\pm.014 .873±\pm.017
iMLU 23.411±\pm1.160 23.401±\pm8.939 26.462±\pm5.548 21.030±\pm4.844 .880±\pm.009 .835±\pm.003 .812±\pm.004 .835±\pm.048
M3DNS 3.947±\pm.307 4.214±\pm.202 6.119±\pm.262 2.764±\pm.071 .892±\pm.004 .876±\pm.003 .838±\pm.003 .898±\pm.008
Methods Ranking Loss \downarrow Example AUC \uparrow
FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE FLICKR25K IAPRTC-12 MS-CoCo NUS-WIDE
CS3G .109±\pm.003 .120±\pm.001 .168±\pm.001 .196±\pm.070 .890±\pm.003 .879±\pm.001 .831±\pm.001 .803±\pm.070
Tram .108±\pm.002 .119±\pm.001 .183±\pm.001 .183±\pm.076 .893±\pm.002 .880±\pm.001 .816±\pm.001 .816±\pm.076
COINS .150±\pm.009 .171±\pm.002 .305±\pm.008 .297±\pm.028 .849±\pm.009 .828±\pm.002 .694±\pm.008 .702±\pm.028
iMLU .167±\pm.007 .242±\pm.014 .344±\pm.013 .346±\pm.015 .832±\pm.007 .757±\pm.014 .655±\pm.013 .653±\pm.015
M3DNS .108±\pm.001 .142±\pm.002 .112±\pm.003 .119±\pm.003 .899±\pm.004 .858±\pm.005 .898±\pm.008 .881±\pm.006
Methods Average Precision \uparrow Micro AUC \uparrow
FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE FLICKR25K IAPRTC-12 MS-CoCo NUS-WIDE
CS3G .671±\pm.003 .678±\pm.001 .661±\pm.003 .586±\pm.083 .860±\pm.007 .820±\pm.002 .769±\pm.003 .724±\pm.084
Tram .670±\pm.006 .507±\pm.004 .348±\pm.003 .318±\pm.091 .910±\pm.001 .859±\pm.001 .874±\pm.001 .868±\pm.057
COINS .570±\pm.007 .419±\pm.007 .258±\pm.033 .216±\pm.016 .884±\pm.007 .852±\pm.003 .788±\pm.018 .856±\pm.025
iMLU .538±\pm.015 .325±\pm.016 .220±\pm.043 .187±\pm.015 .860±\pm.015 .793±\pm.007 .760±\pm.013 .798±\pm.078
M3DNS .698±\pm.002 .637±\pm.007 .691±\pm.004 .634±\pm.003 .858±\pm.003 .863±\pm.004 .877±\pm.006 .878±\pm.005
TABLE IV: Semi-supervised comparison results (mean ± std.) of M3DNS with compared methods on the WKG Game-Hub dataset. 6 commonly used criteria are evaluated. The best performance for each criterion is bolded. $\uparrow$/$\downarrow$ indicates that larger/smaller values of the criterion are better.
Methods Coverage \downarrow (×103\times 10^{3}) Macro AUC \uparrow Ranking Loss \downarrow Example AUC \uparrow Average Precision \uparrow Micro AUC \uparrow
CS3G .326±\pm.002 .683±\pm.021 .187±\pm.014 .812±\pm.014 .404±\pm.057 .728±\pm.026
Tram 1.731±\pm.083 .854±\pm.031 .190±\pm.024 .809±\pm.024 .245±\pm.046 .852±\pm.024
COINS .186±\pm.021 .782±\pm.087 .252±\pm.029 .747±\pm.029 .195±\pm.037 .783±\pm.072
iMLU .225±\pm.027 .786±\pm.070 .288±\pm.033 .711±\pm.030 .169±\pm.026 .763±\pm.010
M3DNS .149±\pm.002 .933±\pm.001 .180±\pm.009 .828±\pm.003 .409±\pm.001 .880±\pm.001
TABLE V: Ablation study results (mean ± std.) of M3DNS on the 4 benchmark datasets. 6 commonly used criteria are evaluated. The best performance for each criterion is bolded. $\uparrow$/$\downarrow$ indicates that larger/smaller values of the criterion are better.
Methods Coverage \downarrow Macro AUC \uparrow
FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE FLICKR25K IAPRTC-12 MS-CoCo NUS-WIDE
M3DNS-F 8.678±\pm.002 6.875±\pm.010 9.280±\pm.003 11.042±\pm.009 .896±\pm.000 .868±\pm.000 .829±\pm.002 .858±\pm.001
M3DNS-M 8.889±\pm.010 6.964±\pm.003 9.764±\pm.001 11.043±\pm.005 .885±\pm.001 .862±\pm.000 .757±\pm.001 .843±\pm.000
M3DNS-MP 4.039±\pm.021 5.047±\pm.038 8.708±\pm.028 3.230±\pm.003 .874±\pm.000 .860±\pm.000 .779±\pm.001 .837±\pm.001
M3DNS 3.947±\pm.307 4.214±\pm.202 6.119±\pm.262 2.764±\pm.071 .892±\pm.004 .876±\pm.003 .838±\pm.003 .898±\pm.008
Methods Ranking Loss \downarrow Example AUC \uparrow
FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE FLICKR25K IAPRTC-12 MS-CoCo NUS-WIDE
M3DNS-F .074±\pm.000 .146±\pm.000 .134±\pm.001 .184±\pm.000 .825±\pm.000 .804±\pm.000 .866±\pm.001 .816±\pm.000
M3DNS-M .109±\pm.001 .149±\pm.000 .150±\pm.000 .132±\pm.000 .783±\pm.001 .696±\pm.000 .686±\pm.000 .540±\pm.001
M3DNS-MP .106±\pm.000 .145±\pm.001 .150±\pm.001 .190±\pm.001 .818±\pm.000 .790±\pm.001 .848±\pm.000 .810±\pm.001
M3DNS .108±\pm.001 .142±\pm.002 .112±\pm.003 .119±\pm.003 .899±\pm.004 .858±\pm.005 .898±\pm.008 .881±\pm.006
Methods Average Precision \uparrow Micro AUC \uparrow
FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE FLICKR25K IAPRTC-12 MS-CoCo NUS-WIDE
M3DNS-F .693±\pm.000 .592±\pm.000 .693±\pm.000 .624±\pm.000 .917±\pm.000 .863±\pm.002 .868±\pm.003 .877±\pm.000
M3DNS-M .614±\pm.002 .588±\pm.000 .639±\pm.001 .610±\pm.000 .819±\pm.001 .790±\pm.000 .850±\pm.003 .814±\pm.001
M3DNS-MP .681±\pm.000 .582±\pm.001 .684±\pm.001 .616±\pm.001 .809±\pm.000 .791±\pm.000 .846±\pm.001 .807±\pm.002
M3DNS .698±\pm.002 .637±\pm.007 .691±\pm.004 .634±\pm.003 .858±\pm.003 .863±\pm.004 .877±\pm.006 .878±\pm.005
TABLE VI: Ablation study results (mean ± std.) of M3DNS on the WKG Game-Hub dataset. 6 commonly used criteria are evaluated. The best performance for each criterion is bolded. $\uparrow$/$\downarrow$ indicates that larger/smaller values of the criterion are better.
Methods Coverage \downarrow (×103\times 10^{3}) Macro AUC \uparrow Ranking Loss \downarrow Example AUC \uparrow Average Precision \uparrow Micro AUC \uparrow
M3DNS-F .279±\pm.003 .821±\pm.000 .183±\pm.001 .822±\pm.000 .345±\pm.000 .872±\pm.000
M3DNS-M .287±\pm.041 .840±\pm.000 .182±\pm.001 .823±\pm.000 .379±\pm.001 .870±\pm.002
M3DNS-MP .286±\pm.008 .818±\pm.000 .190±\pm.001 .817±\pm.001 .333±\pm.000 .869±\pm.002
M3DNS .149±\pm.002 .933±\pm.001 .180±\pm.009 .828±\pm.003 .409±\pm.001 .880±\pm.001
TABLE VII: Missing-modality comparison results (mean ±\pm std.) of M3DNS on 4 benchmark datasets. 6 commonly used criteria are evaluated. The best performance for each criterion is bolded. \uparrow/\downarrow indicates that larger/smaller values of the criterion are better.
Methods Coverage \downarrow Macro AUC \uparrow
FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE
0%0\% 3.947±\pm.307 4.214±\pm.202 6.119±\pm.262 2.764±\pm.071 .892±\pm.004 .876±\pm.003 .838±\pm.003 .898±\pm.008
10%10\% 4.012±\pm.013 5.017±\pm.015 6.443±\pm.002 2.815±\pm.018 .891±\pm.000 .858±\pm.001 .822±\pm.000 .865±\pm.001
30%30\% 4.033±\pm.009 5.604±\pm.013 6.324±\pm.007 2.834±\pm.010 .888±\pm.001 .870±\pm.001 .817±\pm.001 .866±\pm.000
50%50\% 4.080±\pm.003 5.862±\pm.000 6.496±\pm.004 3.381±\pm.002 .887±\pm.000 .862±\pm.004 .812±\pm.000 .834±\pm.001
70%70\% 4.180±\pm.021 5.840±\pm.002 6.378±\pm.005 3.213±\pm.001 .880±\pm.000 .861±\pm.000 .806±\pm.001 .846±\pm.000
90%90\% 4.485±\pm.004 5.897±\pm.001 6.816±\pm.017 3.615±\pm.004 .869±\pm.000 .856±\pm.000 .781±\pm.000 .820±\pm.001
Methods Ranking Loss \downarrow Example AUC \uparrow
FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE
0%0\% .108±\pm.001 .142±\pm.002 .112±\pm.003 .119±\pm.003 .899±\pm.004 .858±\pm.005 .898±\pm.008 .881±\pm.006
10%10\% .178±\pm.000 .159±\pm.000 .140±\pm.000 .178±\pm.000 .892±\pm.000 .840±\pm.000 .859±\pm.000 .871±\pm.000
30%30\% .180±\pm.000 .150±\pm.001 .138±\pm.000 .178±\pm.000 .879±\pm.000 .849±\pm.000 .861±\pm.001 .871±\pm.000
50%50\% .181±\pm.000 .157±\pm.000 .143±\pm.000 .192±\pm.000 .878±\pm.001 .842±\pm.000 .856±\pm.000 .857±\pm.000
70%70\% .185±\pm.001 .155±\pm.000 .139±\pm.000 .187±\pm.001 .874±\pm.001 .844±\pm.000 .854±\pm.000 .862±\pm.004
90%90\% .190±\pm.002 .159±\pm.001 .156±\pm.000 .199±\pm.000 .869±\pm.000 .839±\pm.001 .843±\pm.000 .850±\pm.000
Methods Average Precision \uparrow Micro AUC \uparrow
FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE FLICKR25K IAPR TC-12 MS-CoCo NUS-WIDE
0%0\% .698±\pm.002 .637±\pm.007 .691±\pm.004 .634±\pm.003 .858±\pm.003 .863±\pm.004 .877±\pm.006 .878±\pm.005
10%10\% .689±\pm.000 .631±\pm.000 .684±\pm.000 .631±\pm.000 .817±\pm.000 .845±\pm.000 .860±\pm.000 .870±\pm.000
30%30\% .678±\pm.000 .635±\pm.000 .686±\pm.002 .631±\pm.000 .812±\pm.000 .855±\pm.002 .862±\pm.001 .869±\pm.000
50%50\% .678±\pm.000 .628±\pm.000 .679±\pm.001 .598±\pm.000 .815±\pm.000 .849±\pm.000 .857±\pm.000 .853±\pm.000
70%70\% .666±\pm.001 .629±\pm.000 .680±\pm.000 .593±\pm.000 .808±\pm.001 .848±\pm.000 .862±\pm.000 .858±\pm.000
90%90\% .659±\pm.000 .610±\pm.000 .663±\pm.001 .590±\pm.000 .802±\pm.000 .846±\pm.000 .842±\pm.000 .846±\pm.000
TABLE VIII: Missing-modality comparison results (mean ±\pm std.) of M3DNS on the WKG Game-Hub dataset. 6 commonly used criteria are evaluated. The best performance for each criterion is bolded. \uparrow/\downarrow indicates that larger/smaller values of the criterion are better.
Methods Coverage \downarrow (×103\times 10^{3}) Macro AUC \uparrow Ranking Loss \downarrow Example AUC \uparrow Average Precision \uparrow Micro AUC \uparrow
0%0\% .149±\pm.002 .933±\pm.001 .180±\pm.009 .828±\pm.003 .409±\pm.001 .880±\pm.001
10%10\% .264±\pm.007 .844±\pm.000 .183±\pm.000 .776±\pm.000 .379±\pm.000 .877±\pm.000
30%30\% .273±\pm.003 .830±\pm.000 .191±\pm.000 .768±\pm.001 .363±\pm.000 .868±\pm.000
50%50\% .276±\pm.013 .825±\pm.000 .193±\pm.000 .766±\pm.000 .350±\pm.000 .866±\pm.000
70%70\% .284±\pm.002 .812±\pm.000 .201±\pm.000 .758±\pm.000 .336±\pm.000 .859±\pm.000
90%90\% .299±\pm.008 .802±\pm.000 .207±\pm.000 .752±\pm.000 .329±\pm.001 .848±\pm.000

4.3 Benchmark Comparisons

M3DN is compared with the other methods on 4 benchmark datasets to demonstrate its abilities. Results of the compared methods and M3DN/M3DNS on 6 commonly used criteria are listed in Tab. I. The best performance for each criterion is bolded, and \uparrow/\downarrow indicates that larger/smaller values of the criterion are better. The results show that our M3DN/M3DNS approaches achieve the best or second-best performance on most datasets under different performance measures. Therefore, M3DN/M3DNS are highly competitive multi-modal multi-label learning methods.
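For reference, these criteria can be computed directly from the bag-level label scores with standard multi-label metrics. The following is a minimal sketch, assuming scikit-learn's implementations; the function and variable names (evaluate_multilabel, y_true, y_score) are illustrative, and the exact normalization used in the paper (e.g., for Coverage) may differ slightly from coverage_error.

```python
import numpy as np
from sklearn.metrics import (coverage_error, label_ranking_loss,
                             label_ranking_average_precision_score,
                             roc_auc_score)

def evaluate_multilabel(y_true, y_score):
    """Compute the six criteria (sketch).

    y_true  : (n_samples, n_labels) binary ground-truth matrix.
    y_score : (n_samples, n_labels) predicted label scores.
    """
    return {
        "Coverage":          coverage_error(y_true, y_score),
        "Ranking Loss":      label_ranking_loss(y_true, y_score),
        "Average Precision": label_ranking_average_precision_score(y_true, y_score),
        "Macro AUC":         roc_auc_score(y_true, y_score, average="macro"),
        "Example AUC":       roc_auc_score(y_true, y_score, average="samples"),
        "Micro AUC":         roc_auc_score(y_true, y_score, average="micro"),
    }

# toy usage with two examples and three labels
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.8], [0.1, 0.7, 0.3]])
print(evaluate_multilabel(y_true, y_score))
```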

4.4 Complex Article Classification

In this subsection, the M3DN approach is tested on the real-world complex article classification problem, i.e., the WKG Game-Hub dataset. The collection contains 13,570 articles, each with image and text modalities for classification. Specifically, each article contains a variable number of images and text paragraphs; thus, each article can be divided into an image bag and a text bag. Comparison results (independent modalities and overall) against the compared methods are listed in Tab. II, where the notation "N/A" means the method cannot give a result within 60 hours. We use the same 6 measurement criteria as in the previous subsection, i.e., Coverage, Ranking Loss, Average Precision, Macro AUC, Example AUC and Micro AUC. It is notable that the multi-label methods concatenate all of the modal features and therefore have no independent modal classification performance. The results show that, on both the independent modalities and the overall prediction, our M3DN and M3DNS approaches obtain the best results over all criteria. These statistics validate the effectiveness of our method for solving the complex article classification problem.
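To make the bag construction concrete, each article can be stored as two bags of instance features, one per modality. Below is a minimal sketch of such a data structure; the field names and feature dimensionalities (e.g., ResNet-style image features and paragraph embeddings) are illustrative assumptions, not the exact pipeline of the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class M3Article:
    """A multi-modal multi-instance multi-label example (sketch)."""
    image_bag: List[np.ndarray] = field(default_factory=list)   # one feature vector per image
    text_bag:  List[np.ndarray] = field(default_factory=list)   # one feature vector per paragraph
    labels:    Optional[np.ndarray] = None                       # binary label vector, None if unlabeled

# hypothetical usage: an article with 3 images, 4 paragraphs, and 2 of 10 labels
article = M3Article(
    image_bag=[np.random.randn(2048) for _ in range(3)],  # e.g., CNN image features
    text_bag=[np.random.randn(300) for _ in range(4)],    # e.g., averaged word embeddings
    labels=np.eye(10)[[2, 7]].sum(axis=0),
)
print(len(article.image_bag), len(article.text_bag), article.labels.shape)
```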

4.5 Label Correlations Exploration

Since M3DN can learn label correlations explicitly, in this subsection we examine the effectiveness of M3DN in label correlation exploration. Due to the page limitation, the exploration is conducted on the real-world WKG Game-Hub dataset. We randomly sampled 27 labels, with the learned ground metric shown in Figure 5, and scaled the original values in the cost matrix into [-1, 1]. Red indicates a positive correlation, and blue indicates a negative correlation. We can see that the learned pairwise costs accord with intuition. For example, the cost between (Overwatch, Tencent) indicates a very small correlation, which is reasonable because the game Overwatch has no relation to Tencent, whereas the cost between (Zhuge Liang, Wizard) indicates a very strong correlation, since Zhuge Liang belongs to the wizard role in the game.
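As a hedged illustration, this kind of visualization can be produced by linearly rescaling the learned ground-metric (cost) matrix into [-1, 1] and rendering it with a diverging colormap. The rescaling rule and function name below are assumptions for illustration only, not necessarily the exact transformation used for Figure 5.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ground_metric(C, label_names):
    """Rescale a cost matrix into [-1, 1] and show it as a heatmap (sketch)."""
    C = np.asarray(C, dtype=float)
    # assumed linear rescaling: smallest cost -> +1 (strong correlation),
    # largest cost -> -1 (weak/negative correlation)
    scaled = 1.0 - 2.0 * (C - C.min()) / (C.max() - C.min() + 1e-12)
    plt.imshow(scaled, cmap="bwr", vmin=-1, vmax=1)  # red = positive, blue = negative
    plt.colorbar()
    plt.xticks(range(len(label_names)), label_names, rotation=90)
    plt.yticks(range(len(label_names)), label_names)
    plt.tight_layout()
    plt.show()

# toy usage with a random symmetric cost matrix over 5 hypothetical labels
rng = np.random.default_rng(0)
C = rng.random((5, 5))
C = (C + C.T) / 2
plot_ground_metric(C, ["L%d" % i for i in range(5)])
```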

4.6 Empirical Investigation on Convergence

To empirically investigate the convergence of the M3DN iterations, we record the objective function value, i.e., the value of Eq. 5, together with the different classification performance criteria of M3DN/M3DNS in each epoch. Due to page limits, the results on the WKG Game-Hub dataset are plotted in Fig. 6. The figure clearly shows that the objective function value decreases as the iterations increase, and all of the classification performance measures become stable after several iterations. Moreover, these additional results indicate that M3DN/M3DNS converge quickly, e.g., M3DN converges after 10 epochs.

4.7 Empirical Illustrative Examples

Figure 7 shows 6 illustrative examples of the classification results on the WKG Game-Hub dataset. Qualitatively, the illustrated predictions clearly reveal the modal-instance-label relations on the test set. E.g., the first example shows an article separated into three images and four content paragraphs; we can predict the Zhuge Liang and battlefront labels from both the images and the contents, and acquire the master and cooperation labels from the contents.

5 Conclusion

This paper focuses on the problem of complex object classification with semi-supervised M3 information, and extends our preliminary research [44]. Complex objects, e.g., articles and videos, can always be represented by multi-modal multi-instance information with multiple labels. However, we usually only have bag-level consistency among different modalities. Therefore, Multi-modal Multi-instance Multi-label (M3) learning provides a framework for handling such tasks, yet previous M3 methods rarely consider label correlation and unlabeled data. In this paper, we propose a novel Multi-modal Multi-instance Multi-label Deep Network (M3DN) framework and exploit label correlation based on Optimal Transport (OT) theory. Moreover, considering unlabeled information, M3DNS utilizes instance-level and bag-level unlabeled information to achieve better performance. Experiments on real-world benchmark datasets and the complex article dataset WKG Game-Hub validate the effectiveness of the proposed methods. Meanwhile, how to extend the framework to more modalities is an interesting direction for future work.

Appendix A Semi-Supervised Classification

M3DNS takes unlabeled instances into consideration, i.e., it uses an auto-encoder for each single-modal network and enforces consistency among different modalities for the joint predictions. In this section, we therefore provide empirical investigations and performance comparisons of M3DNS with several state-of-the-art semi-supervised methods. The data configuration and comparison methods are introduced in Sections 4.1 and 4.2. The results are recorded in Table III and Table IV. They indicate that the M3DNS approach achieves the best or second-best performance on most datasets under different performance measures; thus, M3DNS makes better use of unlabeled data.
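As a hedged illustration of how unlabeled data can enter the objective, the sketch below combines a per-modality auto-encoder reconstruction loss on unlabeled instances with a consistency penalty between the two modal bag-level predictions. The network sizes, loss weights, and the use of a simple MSE consistency term are illustrative assumptions rather than the exact formulation of M3DNS (which uses a modified bag-level optimal transport term).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalBranch(nn.Module):
    """One modality: encoder + decoder (auto-encoder) + label scorer (sketch)."""
    def __init__(self, in_dim, hid_dim, n_labels):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.dec = nn.Linear(hid_dim, in_dim)        # reconstruction head
        self.cls = nn.Linear(hid_dim, n_labels)      # instance-level label scores

    def forward(self, bag):                          # bag: (n_instances, in_dim)
        h = self.enc(bag)
        recon = self.dec(h)
        scores = torch.sigmoid(self.cls(h))
        bag_pred, _ = scores.max(dim=0)              # bag-level prediction via max pooling
        return recon, bag_pred

def unlabeled_loss(img_branch, txt_branch, img_bag, txt_bag, alpha=1.0, beta=1.0):
    """Auto-encoder + cross-modal bag-level consistency for one unlabeled article (sketch)."""
    img_recon, img_pred = img_branch(img_bag)
    txt_recon, txt_pred = txt_branch(txt_bag)
    recon = F.mse_loss(img_recon, img_bag) + F.mse_loss(txt_recon, txt_bag)
    consistency = F.mse_loss(img_pred, txt_pred)     # the two modalities should agree at bag level
    return alpha * recon + beta * consistency

# toy usage: one unlabeled article with 3 image and 4 text instances, 10 labels
img_branch, txt_branch = ModalBranch(2048, 256, 10), ModalBranch(300, 256, 10)
loss = unlabeled_loss(img_branch, txt_branch,
                      torch.randn(3, 2048), torch.randn(4, 300))
loss.backward()
```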

Appendix B Ablation Study

In order to explore the impact of different operators in the network structure, we conduct additional experiments. In detail, 1) to verify different pooling methods for obtaining the bag-level prediction, we compare max pooling with mean pooling, where M3DNS-M denotes the variant with mean pooling; 2) based on the better bag-level pooling method, we compare averaging the modal predictions with taking their maximum to evaluate different ensemble methods for the final prediction, where M3DNS-MP denotes the variant with the max operator; 3) based on the better pooling method and prediction operator, we fix the ground metric at its initial value to explore the advantage of learning the ground metric, denoted as M3DNS-F. The results are recorded in Table V and Table VI; note that M3DNS uses max pooling and the mean prediction operator. The results reveal that max pooling is always better than mean pooling for the bag-level prediction. This is because there are often only a few positive instances in a bag that determine the bag's prediction, whereas mean pooling introduces considerable noise; this phenomenon is also consistent with the standard multi-instance learning assumption. Furthermore, the results reveal that the mean prediction operator is always better than the max operator, which accords with ensemble learning. Interestingly, although M3DNS is better than M3DNS-F on most datasets, it is worse on one dataset, i.e., FLICKR25K. This shows that learning the ground metric is not always beneficial: noisy data may affect the learning of the ground metric. Thus, how to modify the learning process or design a suitable initialization method is an interesting direction for future work.
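For clarity, the two design choices compared in this ablation can be expressed in a few lines. The sketch below shows bag-level max vs. mean pooling over instance scores, and mean vs. max fusion of the two modal bag-level predictions; the array shapes and function names are illustrative.

```python
import numpy as np

def bag_prediction(instance_scores, pooling="max"):
    """Aggregate instance-level scores (n_instances, n_labels) into a bag-level score."""
    if pooling == "max":          # M3DNS: a few positive instances determine the bag
        return instance_scores.max(axis=0)
    return instance_scores.mean(axis=0)               # M3DNS-M variant

def fuse_modalities(img_bag_pred, txt_bag_pred, operator="mean"):
    """Combine the two modal bag-level predictions into the final prediction."""
    stacked = np.stack([img_bag_pred, txt_bag_pred])
    if operator == "mean":        # M3DNS: ensemble-style averaging
        return stacked.mean(axis=0)
    return stacked.max(axis=0)                         # M3DNS-MP variant

# toy usage: 3 image instances, 4 text instances, 5 labels
img_scores = np.random.rand(3, 5)
txt_scores = np.random.rand(4, 5)
final = fuse_modalities(bag_prediction(img_scores, "max"),
                        bag_prediction(txt_scores, "max"), "mean")
print(final.shape)  # (5,)
```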

Appendix C Comparison with Missing Modality

Specifically, in order to explore the impact of the modal-missing scenario, we conduct additional experiments. Following [45], in each split we randomly select 10% to 90% of the examples, at intervals of 20%, as homogeneous examples with complete modalities; the remaining examples are incomplete instances. The results are recorded in Table VII and Table VIII. Compared with the results in Tables I, II, V and VI, M3DNS still achieves competitive results with missing modalities, and its performance improves faster than that of the compared methods as the incomplete ratio decreases.
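As an illustration of how such a split could be generated, the sketch below randomly marks a fraction of examples as modality-complete and keeps only one randomly chosen modality for the rest. The function name and sampling scheme are assumptions for illustration, not the exact protocol script used in the experiments.

```python
import numpy as np

def make_missing_modality_split(n_examples, complete_ratio, seed=0):
    """For each example, decide which modalities are observed (sketch).

    complete_ratio : fraction of examples that keep both the image and text bags;
    the remaining examples randomly keep only one of the two modalities.
    """
    rng = np.random.default_rng(seed)
    complete = rng.random(n_examples) < complete_ratio
    keep_image = rng.random(n_examples) < 0.5
    observed = []
    for is_complete, ki in zip(complete, keep_image):
        if is_complete:
            observed.append(("image", "text"))
        else:
            observed.append(("image",) if ki else ("text",))
    return observed

# toy usage: 10 examples, 30% of them with complete modalities
print(make_missing_modality_split(10, 0.3))
```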

Acknowledgment

This research was supported by the National Key R&D Program of China (2018YFB1004300), NSFC (61773198, 61632004, 61751306), the NSFC-NRF Joint Research Project under Grant 61861146001, the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX18-0045).

References

  • Huang et al. [2015] Y. Huang, W. Wang, and L. Wang, “Unconstrained multimodal multi-label learning,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1923–1935, 2015.
  • Yang et al. [2016] P. Yang, H. Yang, H. Fu, D. Zhou, J. Ye, T. Lappas, and J. He, “Jointly modeling label and feature heterogeneity in medical informatics,” TKDD, vol. 10, no. 4, pp. 39:1–39:25, 2016.
  • Nguyen et al. [2013] C. Nguyen, D. Zhan, and Z. Zhou, “Multi-modal image annotation with multi-instance multi-label LDA,” in IJCAI, Beijing, China, 2013, pp. 1558–1564.
  • Nguyen et al. [2014] C. Nguyen, X. Wang, J. Liu, and Z. Zhou, “Labeling complicated objects: Multi-view multi-instance multi-label learning,” in AAAI, Quebec, Canada, 2014, pp. 2013–2019.
  • Yang and He [2015] P. Yang and J. He, “Model multiple heterogeneity via hierarchical multi-latent space learning,” in SIGKDD, NSW, Australia, 2015, pp. 1375–1384.
  • Huang and Zhou [2012] S. Huang and Z. Zhou, “Multi-label learning by exploiting label correlations locally,” in AAAI, Ontario, Canada, 2012.
  • Frogner et al. [2015] C. Frogner, C. Zhang, H. Mobahi, M. Araya-Polo, and T. A. Poggio, “Learning with a Wasserstein loss,” in NIPS, Quebec, Canada, 2015, pp. 2053–2061.
  • Rolet et al. [2016] A. Rolet, M. Cuturi, and G. Peyre, “Fast dictionary learning with a smoothed Wasserstein loss,” in AISTATS, Cadiz, Spain, 2016, pp. 630–638.
  • Blum and Mitchell [1998] A. Blum and T. M. Mitchell, “Combining labeled and unlabeled data with co-training,” in COLT, Madison, Wisconsin, 1998, pp. 92–100.
  • Brefeld et al. [2006] U. Brefeld, T. Gartner, T. Scheffer, and S. Wrobel, “Efficient co-regularised least squares regression,” in ICML, Pittsburgh, Pennsylvania, 2006, pp. 137–144.
  • Villani [2008] C. Villani, Optimal transport: old and new.   Springer Science & Business Media, 2008, vol. 338.
  • Fang and Zhang [2012] Z. Fang and Z. M. Zhang, “Simultaneously combining multi-view multi-label learning with maximum margin classification,” in ICDM, Brussels, Belgium, 2012, pp. 864–869.
  • Yang et al. [2014] P. Yang, J. He, H. Yang, and H. Fu, “Learning from label and feature heterogeneity,” in ICDM, Shenzhen, China, 2014, pp. 1079–1084.
  • Feng and Zhou [2017] J. Feng and Z. Zhou, “Deep MIML network,” in AAAI, San Francisco, California, 2017, pp. 1884–1890.
  • Bi and Kwok [2014] W. Bi and J. T. Kwok, “Multilabel classification with label correlations and missing labels,” in AAAI, Quebec, Canada, 2014, pp. 1680–1686.
  • Zhang and Zhou [2014] M. Zhang and Z. Zhou, “A review on multi-label learning algorithms,” TKDE, vol. 26, no. 8, pp. 1819–1837, 2014.
  • Zhan and Zhang [2017] W. Zhan and M. Zhang, “Inductive semi-supervised multi-label learning with co-training,” in SIGKDD, NS, Canada, 2017, pp. 1305–1314.
  • Qian et al. [2016] W. Qian, B. Hong, D. Cai, X. He, and X. Li, “Non-negative matrix factorization with Sinkhorn distance,” in IJCAI, New York, NY, 2016, pp. 1960–1966.
  • Courty et al. [2017] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy, “Optimal transport for domain adaptation,” TPAMI, vol. 39, no. 9, pp. 1853–1865, 2017.
  • Cuturi and Avis [2014] M. Cuturi and D. Avis, “Ground metric learning,” JMLR, vol. 15, no. 1, pp. 533–564, 2014.
  • Zhao and Zhou [2018] P. Zhao and Z.-H. Zhou, “Label distribution learning by optimal transport,” in AAAI, New Orleans, Louisiana, 2018, pp. 4506–4513.
  • Yossi et al. [1997] R. Yossi, L. Guibas, and C. Tomasi, “The earth mover’s distance, multi-dimensional scaling, and color-based image retrieval,” in ARPA, 1997.
  • Kedem et al. [2012] D. Kedem, S. Tyree, K. Q. Weinberger, F. Sha, and G. R. G. Lanckriet, “Non-linear metric learning,” in NIPS, Lake Tahoe, Nevada, 2012, pp. 2582–2590.
  • Hoffman et al. [2014] J. Hoffman, E. Rodner, J. Donahue, B. Kulis, and K. Saenko, “Asymmetric and category invariant feature transformations for domain adaptation,” IJCV, vol. 109, no. 1-2, pp. 28–41, 2014.
  • Bertsimas and Tsitsiklis [1997] D. Bertsimas and J. N. Tsitsiklis, Introduction to linear optimization.   Athena Scientific Belmont, MA, 1997, vol. 6.
  • Cuturi [2013] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” in NIPS, Lake Tahoe, Nevada, 2013, pp. 2292–2300.
  • Huiskes and Lew [2008] M. J. Huiskes and M. S. Lew, “The MIR flickr retrieval evaluation,” in SIGMM, British Columbia, Canada, 2008, pp. 39–43.
  • Escalante et al. [2010] H. J. Escalante, C. A. Hernandez, J. A. Gonzalez, A. Lopez-Lopez, M. Montes-y-Gomez, E. F. Morales, L. E. Sucar, L. V. Pineda, and M. Grubinger, “The segmented and annotated IAPR TC-12 benchmark,” CVIU, vol. 114, no. 4, pp. 419–428, 2010.
  • Lin et al. [2014] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in ECCV, Zurich, Switzerland, 2014, pp. 740–755.
  • Chua et al. [2009] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: a real-world web image database from National University of Singapore,” in CIVR, Santorini Island, Greece, 2009.
  • Zhang et al. [2018] M. Zhang, Y. Li, X. Liu, and X. Geng, “Binary relevance for multi-label learning: an overview,” FCS, vol. 12, no. 2, pp. 191–202, 2018.
  • Girshick [2015] R. B. Girshick, “Fast R-CNN,” in ICCV, Santiago, Chile, 2015, pp. 1440–1448.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, Las Vegas, NV, 2016, pp. 770–778.
  • Ye et al. [2016] H. Ye, D. Zhan, X. Li, Z. Huang, and Y. Jiang, “College student scholarships and subsidies granting: A multi-modal multi-label approach,” in ICDM, Barcelona, Spain, 2016, pp. 559–568.
  • Zhang and Zhou [2008] M. Zhang and Z. Zhou, “M3MIML: A maximum margin method for multi-instance multi-label learning,” in ICDM, Pisa, Italy, 2008, pp. 688–697.
  • Huang et al. [2014] S. Huang, W. Gao, and Z. Zhou, “Fast multi-instance multi-label learning,” in AAAI, Quebec, Canada, 2014, pp. 1868–1874.
  • Bhatia et al. [2015] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain, “Sparse local embeddings for extreme multi-label classification,” in NIPS, Quebec, Canada, 2015, pp. 730–738.
  • Kong et al. [2013] X. Kong, M. K. Ng, and Z. Zhou, “Transductive multilabel learning via label set propagation,” TKDE, vol. 25, no. 3, pp. 704–719, 2013.
  • Read et al. [2011] J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for multi-label classification,” ML, vol. 85, no. 3, pp. 333–359, 2011.
  • Zhang and Zhou [2007] M. Zhang and Z. Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” PR, vol. 40, no. 7, pp. 2038–2048, 2007.
  • Joachims [2002] T. Joachims, “Optimizing search engines using clickthrough data,” in SIGKDD, Alberta, Canada, 2002, pp. 133–142.
  • Boutell et al. [2004] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label scene classification,” PR, vol. 37, no. 9, pp. 1757–1771, 2004.
  • Wu and Zhang [2013] L. Wu and M. Zhang, “Multi-label classification with unlabeled data: An inductive approach,” in ACML, Canberra, Australia, 2013, pp. 197–212.
  • Yang et al. [2018] Y. Yang, Y. Wu, D. Zhan, Z. Liu, and Y. Jiang, “Complex object classification: A multi-modal multi-instance multi-label deep network with optimal transport,” in SIGKDD, London, UK, 2018, pp. 2594–2603.
  • Li et al. [2014] S. Li, Y. Jiang, and Z. Zhou, “Partial multi-view clustering,” in AAAI, Quebec, Canada, 2014, pp. 1968–1974.
Yang Yang is working towards the PhD degree with the National Key Lab for Novel Software Technology, Department of Computer Science & Technology, Nanjing University, China. His research interests lie primarily in machine learning and data mining, including heterogeneous learning, model reuse, and incremental mining.
Zhao-Yang Fu is working towards the M.Sc. degree with the National Key Lab for Novel Software Technology, Department of Computer Science & Technology, Nanjing University, China. His research interests lie primarily in machine learning and data mining, including multi-modal learning.
De-Chuan Zhan received the Ph.D. degree in computer science from Nanjing University, China, in 2010. In the same year, he became a faculty member in the Department of Computer Science and Technology at Nanjing University, China, where he is currently an Associate Professor. His research interests are mainly in machine learning, data mining and mobile intelligence. He has published over 20 papers in leading international journals/conferences. He serves as an editorial board member of IDA and IJAPR, and serves as SPC/PC in leading conferences such as IJCAI, AAAI, ICML, NIPS, etc.
Zhi-Bin Liu received the Ph.D. and M.S. degrees in control science and engineering from Tsinghua University, Beijing, China, in 2010, and the B.S. degree in automatic control engineering from Central South University, Changsha, China, in 2004. His research interests are in big data mining, machine learning, AI, NLP, computer vision, information fusion, etc.
Yuan Jiang received the PhD degree in computer science from Nanjing University, China, in 2004. In the same year, she became a faculty member in the Department of Computer Science & Technology at Nanjing University, China, and is currently a Professor. She was selected in the Program for New Century Excellent Talents in University, Ministry of Education, in 2009. Her research interests are mainly in artificial intelligence, machine learning, and data mining. She has published over 50 papers in leading international/national journals and conferences.