
StyleEDL: Style-Guided High-order Attention Network for Image Emotion Distribution Learning

Peiguang Jing (pgjing@tju.edu.cn), Xianyi Liu (goog@tju.edu.cn), Ji Wang (wangji@tju.edu.cn), School of Electrical and Information Engineering, Tianjin University, Tianjin, China; Yinwei Wei (weiyinwei@hotmail.com), Faculty of Information Technology, Monash University, Victoria, Australia; Liqiang Nie (nieliqiang@gmail.com), School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; Yuting Su (ytsu@tju.edu.cn), School of Electrical and Information Engineering, Tianjin University, Tianjin, China
(2023)
Abstract.

Emotion distribution learning has gained increasing attention with the tendency to express emotions through images. As for emotion ambiguity arising from humans’ subjectivity, substantial previous methods generally focused on learning appropriate representations from the holistic or significant part of images. However, they rarely consider establishing connections with the stylistic information although it can lead to a better understanding of images. In this paper, we propose a style-guided high-order attention network for image emotion distribution learning termed StyleEDL, which interactively learns stylistic-aware representations of images by exploring the hierarchical stylistic information of visual contents. Specifically, we consider exploring the intra- and inter-layer correlations among GRAM-based stylistic representations, and meanwhile exploit an adversary-constrained high-order attention mechanism to capture potential interactions between subtle visual parts. In addition, we introduce a stylistic graph convolutional network to dynamically generate the content-dependent emotion representations to benefit the final emotion distribution learning. Extensive experiments conducted on several benchmark datasets demonstrate the effectiveness of our proposed StyleEDL compared to state-of-the-art methods. The implementation is released at: https://github.com/liuxianyi/StyleEDL.

emotion distribution learning, style-guided, high-order, stylistic GCN
journalyear: 2023; doi: 10.1145/3581783.3612040; copyright: acmlicensed; conference: Proceedings of the 31st ACM International Conference on Multimedia, October 29-November 3, 2023, Ottawa, ON, Canada; booktitle: Proceedings of the 31st ACM International Conference on Multimedia (MM '23), October 29-November 3, 2023, Ottawa, ON, Canada; price: 15.00; isbn: 979-8-4007-0108-5/23/10; submissionid: 1528; ccs: Networks, Network design principles; ccs: Networks, Network algorithms

1. Introduction

Image emotion analysis (Zhao et al., 2021) has gained significant research attention owing to its capacity to convey people's emotions and views. Currently, image emotion analysis has been applied in various scenarios, such as multimedia retrieval (Lu et al., 2019; Zhu et al., 2020, 2023; Qu et al., 2021; Nie et al., 2022), social network analysis (Serrat, 2017; Freeman, 2004; Wasserman and Faust, 1994; Jing et al., 2023), and advertising recommendation (Yang et al., 2013; Holbrook and O'Shaughnessy, 1984; Mitchell, 1986).

In prior studies, image emotion analysis tended to be formulated as a single-label classification task (Rao et al., 2020; Yang et al., 2018b, c; Zhang et al., 2019; Zhu et al., 2017), where each image is assigned a dominant label. However, one image may contain a mixture of multiple emotions with varying intensities, and different individuals may have different emotional responses toward the same image (i.e., ambiguity). To address this problem, the label distribution learning (LDL) paradigm (Yang et al., 2017a; Fan et al., 2018; Wang and Geng, 2021a; Gao et al., 2017) has been adopted to narrow the gap between visual features and affective states. Typically, (Yang et al., 2017a) learned a smoother label vector to represent the emotions of images, replacing the previous dominant-emotion classification. (Fan et al., 2018) attempted to boost prediction performance by taking into consideration the regions that convey emotions most strongly. However, these methods fail to explicitly consider the correlations between emotions. For example, an image of a reunion of old friends is more likely to evoke feelings of both excitement and happiness than sadness. Fortunately, emotion correlations (Yang et al., 2021; He and Jin, 2019; Xiong et al., 2019; Xu and Wang, 2021) have been shown to further improve emotion distribution performance when used as prior knowledge.

Figure 1. Different styles of images can elicit different emotional responses. With the same content but different styles, the left image appears more melancholic while the right one conveys more contentment.

However, existing methods for emotion distribution learning usually suffer from two challenges that severely hinder performance improvements. First, due to the subjectivity of human cognition, directly using visual representations extracted from convolutional neural networks (CNNs) may be insufficient to characterize the emotions contained in images, especially under emotion ambiguity. As an example, Figure 1 shows that the right image conveys more contentment, while the left one may evoke melancholic feelings among observers, despite depicting the same content. The different styles of these two pictures elicit different emotions, but existing methods rarely address this point from the perspective of stylistic representation learning. Second, the correlation among emotions is generally modeled by a static graph structure (He and Jin, 2019). Unfortunately, the adjacency of a static graph is usually manually defined according to the given dataset, and such a relation generally models coarse dependencies, with limited versatility to mine fine-grained latent relationships between emotions. As a result, these methods learn only coarse emotional correlations during training, which ultimately leads to unsatisfactory prediction performance.

To address the above issues, we propose a novel method termed style-guided high-order attention network for image emotion distribution learning (StyleEDL). The core idea behind StyleEDL is to leverage stylistic information to compensate for the deficiency of visual representations and thereby resolve emotion ambiguity. To explore stylistic-aware information, we first use GRAM-based intra- and inter-layer correlations as emotional style representations. We then fuse them with the salient content attention results generated by an adversary-constrained high-order attention module to obtain stylistic-aware representations. To predict the emotions of images more accurately, we model the intrinsic relationships among stylistic-aware representations using a stylistic graph convolutional network. Taking the coarse correlations learned by a static graph as prior information, the network adopts a dynamic graph structure to adaptively derive emotional-aware representations from the stylistic-aware representations. Our main contributions are as follows:

  • We propose a novel emotion distribution learning method termed StyleEDL, which explores stylistic information as complementary information to refine the representations of images. To the best of our knowledge, this is the first work to use a style-induced paradigm for image emotion distribution learning (IEDL).

  • We devise a stylistic-aware representation learning network that extracts attentive visual content representations and hierarchical stylistic representations. In addition, we develop a stylistic GCN to capture the intrinsic correlations among stylistic-aware representations.

  • We conduct experiments on three public datasets, and the results show the superiority of the proposed method compared to state-of-the-art methods.

The remainder of the paper is structured as follows: Section 2 reviews related research, Section 3 presents the proposed StyleEDL, Section 4 provides empirical evaluation and analysis, and Section 5 concludes the paper.

2. Related Work

2.1. Image Emotion Distribution Learning

Existing research on LDL can be borrowed to describe the emotions evoked by an image. In particular, (Yang et al., 2017b) proposed two methods based on conditional probability neural networks, namely the conditional probability neural network with binary code (BCPNN) and the augmented conditional probability neural network (ACPNN), to address sentiment ambiguity with multiple emotions. (Gao et al., 2017) proposed the deep label distribution learning (DLDL) method, which, for the first time, effectively utilizes label ambiguity by minimizing the Kullback-Leibler divergence. (Yang et al., 2017a) developed a multi-task deep framework that jointly optimizes classification and distribution prediction. Later, polarity and relevance among emotions were also taken into account to explicitly model emotional correlation, making distribution learning more effective. (He and Jin, 2019) utilized graph neural networks as emotion predictors to capture the correlation among emotions. (Xiong et al., 2019) designed a combined loss based on the earth mover's distance (EMD) and the Kullback-Leibler divergence using structured labels in sentiment polarity. (Yang et al., 2021) designed a novel progressive circular (PC) loss based on an emotion circle to boost the learning process in an emotion-specific way. To explore emotional style representations in complex images, our method introduces stylistic-aware representation learning and emotional-aware enhanced representation learning, producing accurate emotion distributions on real-world datasets.

2.2. Image Style Recognition

Many recent works have indicated that the style of an image has a significant impact on the meaning it conveys. For example, (Lu et al., 2015) proposed a multi-patch aggregation network for extracting fine-grained features from images and showed that this approach achieves good performance in image style classification, aesthetic classification, and quality estimation tasks. (Matsuo and Yanai, 2016) used the GRAM matrix of feature maps to generate style vectors, which they applied to style image retrieval. (Lecoutre et al., 2017) demonstrated the effectiveness of deep residual networks in image style recognition. (Yang et al., 2018a) proposed a multi-factor distribution soft label and performed image style classification in a multi-task framework. (Chu and Wu, 2018) systematically explored the use of correlations between feature maps to characterize image style. (Laubrock and Dubray, 2019) confirmed that mid-level features corresponding to textures, shadows, etc., are particularly well-suited for illustration style classification. (Ghosal et al., 2019) proposed using geometry-sensitive style features based on image saliency for photographic image classification. However, those works have only focused on directly extracting feature maps from their models, which may not fully capture fine representations. In this paper, we propose a novel style-induced method that leverages attentive visual content representations and hierarchical stylistic representations to guide emotion distribution learning. This approach allows for more comprehensive emotional representations than previous methods.

Figure 2. Detailed structure of our StyleEDL, which consists of three core networks: (1) ResNet-50 is the backbone of our method, with the last fully connected layer discarded; we retain the top convolutional layer and four groups of convolutional layers, namely 'Conv1', 'Layer1', 'Layer2', 'Layer3', and 'Layer4'. (2) The stylistic-aware representation learning network generates stylistic-aware representations based on the emotional content representations $\mathcal{F}_{content}$ and the emotional style representations $\mathcal{F}_{style}$. (3) The emotional-aware enhanced representation learning network uses a stylistic GCN to obtain emotional-aware enhanced representations.

3. The Proposed Method

The emotion distribution learning task can be defined as follows: given a labeled sample pair $\{x,\hat{\mathbf{y}}\}$, we aim to learn a function:

(1) \mathcal{H}: x \rightarrow \mathbf{y}

where $x$ represents an input image, $\hat{\mathbf{y}}=\{\hat{y}_{n}\}_{n=1}^{N}$ (with $\sum_{n=1}^{N}\hat{y}_{n}=1$ and $\hat{y}_{n}\in[0,1]$) denotes the degrees to which the $N$ emotions are expressed in this image, and $\mathbf{y}$ represents the learned emotional distribution. Our goal is to optimize the function $\mathcal{H}$ with the help of the supervised information $\hat{\mathbf{y}}$ so that it fits the true emotional distribution of the image.

3.1. Framework Overview

The representations of images can evoke different emotions depending on the aspects being considered, and different aspects can contribute differently to triggering emotions. One way to construct different representations is to directly use shallow features extracted by a CNN as emotional concepts. However, these concepts may not fully capture the emotional content of an image. Another way to represent emotions from different aspects is to exploit a CNN with multiple convolutional layers. Even though a convolutional layer with small kernels may struggle to perceive everything in an image, a deeper architecture increases the model's receptive field. As a result, the early layers of the CNN tend to capture low-level features such as color and texture, while the later layers capture more complex and high-level features. Therefore, we exploit these characteristics of the CNN to construct a module for stylistic-aware representation learning. In detail, we first use the GRAM matrices of low-level features as stylistic information. We then combine them with the visual content information from high-level features, enhanced by a high-order attention module, to obtain stylistic-aware representations. Moreover, recent studies (Mittal et al., 2021) have shown that the graph convolutional network (GCN) can improve the performance of emotion distribution learning due to its ability to capture emotion dependencies. However, a traditional GCN only captures coarse emotion dependencies. Meanwhile, the stylistic-aware representations contain relatively comprehensive emotional information from visual and stylistic cues, but only parts of them contribute to improving performance. Therefore, we propose a stylistic GCN module for emotional-aware enhanced representation learning, which captures the emotion relations of stylistic-aware representations in an adaptive way. Specifically, the module initializes coarse emotion dependencies from a static GCN and uses them to capture emotional-aware dependencies of the stylistic-aware representations for each image. By integrating the stylistic-aware representations and the emotional-aware dependencies, our proposed method can better capture the emotions present in images.

3.2. Stylistic-aware Representation Learning

In this module, we first learn emotional style representations by exploring intra-layer and inter-layer correlations among feature maps. Second, a high-order attention mechanism with an adversary constraint is introduced to guide the learning of emotional content representations. Finally, we fuse the emotional style and content representations and further explore the latent stylistic-aware representations of images.

3.2.1. GRAM-based Intra- and Inter-layer Correlation

Inspired by the work of (Matsuo and Yanai, 2016), we use the GRAM matrix of each layer's feature maps as the intra-layer emotional style representation. Because features related to the emotional style of an image, such as texture and color, are typically captured in low-level feature maps, we first extract feature maps from the outputs of different layers of ResNet-50 to calculate the stylistic representation within each layer, as follows:

(2) \mathcal{X}_{k}=[\mathbf{X}^{1}_{k},\mathbf{X}^{2}_{k},\ldots,\mathbf{X}^{c_{k}}_{k}]\in\mathbb{R}^{c_{k}\times w_{k}\times h_{k}},\quad k=0,1,2

where $\mathcal{X}_{k}$ represents the feature maps of the $k$-th layer extracted from an image $x$. For simplicity, in all the following layers, $c_{*}$, $w_{*}$ and $h_{*}$ represent the number of channels, the width and the height of the feature maps or representations, respectively. We then flatten each feature map into a vector $\mathbf{x}^{i}_{k}\in\mathbb{R}^{m_{k}}$, $i=1,\ldots,c_{k}$, $m_{k}=w_{k}\times h_{k}$, and concatenate them into a matrix $\mathbf{B}_{k}$:

(3) \mathbf{B}_{k}=[\mathbf{x}^{1}_{k},\mathbf{x}^{2}_{k},\ldots,\mathbf{x}^{c_{k}}_{k}]\in\mathbb{R}^{c_{k}\times m_{k}}

Following the above transformation, we can obtain the GRAM matrix of each layer as the corresponding intra-layer correlation.

(4) \mathbf{G}_{k}=\mathbf{B}_{k}\mathbf{B}_{k}^{\mathrm{T}}\in\mathbb{R}^{c_{k}\times c_{k}},\quad k=0,1,2

In $\mathbf{G}_{k}$, each element $G^{ij}_{k}=\sum_{a=1}^{m_{k}}\mathbf{B}^{ia}_{k}\mathbf{B}^{ja}_{k}$ is the inner product between the flattened feature maps $i$ and $j$ in layer $k$.
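As a concrete illustration, the following is a minimal PyTorch-style sketch of the intra-layer computation in Eqs. (2)-(4); the function name and the example shape are ours and purely illustrative:

```python
import torch

def gram_matrix(feature_maps: torch.Tensor) -> torch.Tensor:
    """Intra-layer GRAM matrix of Eq. (4) for one image.

    feature_maps: X_k of shape (c_k, w_k, h_k).
    Returns G_k of shape (c_k, c_k).
    """
    c_k = feature_maps.shape[0]
    # Flatten each channel into a vector of length m_k = w_k * h_k (Eq. (3)).
    B_k = feature_maps.reshape(c_k, -1)   # (c_k, m_k)
    # Pairwise inner products between channels (Eq. (4)).
    return B_k @ B_k.T                    # (c_k, c_k)

# Example with an illustrative 'Layer1'-sized map of ResNet-50 for a 448x448 input.
X_1 = torch.randn(256, 112, 112)
G_1 = gram_matrix(X_1)                    # (256, 256)
```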

To capture the correlations between different layers of the network, we first use the GRAM matrix $\mathbf{G}_{k}$ defined in Eq. (4) as the emotional style representation of each layer. However, the shapes of the GRAM matrices obtained from different layers vary, so we upsample them to the same shape and then stack them along the channel dimension, denoted as

(5) \widetilde{\mathcal{G}}=\mathrm{Stack}(\widetilde{\mathbf{G}}_{0},\widetilde{\mathbf{G}}_{1},\widetilde{\mathbf{G}}_{2})
(6) \widetilde{\mathbf{G}}_{k}=\mathrm{Upsample}(\mathbf{G}_{k}),\quad k=0,1,2

where $\widetilde{\mathcal{G}}$ is the stacked GRAM matrix produced by the operator $\mathrm{Stack}(\cdot)$.

To measure the inter-layer correlations in the image, we design an inter-layer correlation module that consists of two convolutional layers, each followed by layer normalization (LN) and a ReLU activation. Unlike the instance normalization (IN) commonly used in image style transfer networks, the proposed module uses LN to normalize the input feature maps along the three dimensions of channel, width, and height. This strengthens the correlation between channels, and the module can be expressed as follows:

(7) \mathcal{F}_{style}=f_{cc}(\widetilde{\mathcal{G}})\in\mathbb{R}^{c_{s}\times w_{s}\times h_{s}}

where $\mathcal{F}_{style}$ denotes the emotional style representations and $f_{cc}$ denotes the inter-layer correlation module.
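A minimal sketch of the upsample-stack-and-convolve pipeline of Eqs. (5)-(7) is given below; the channel widths, kernel sizes, and the use of single-group GroupNorm to emulate LN over channel, width, and height are implementation assumptions, not the released configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterLayerCorrelation(nn.Module):
    """Sketch of the inter-layer correlation module f_cc (Eqs. (5)-(7)):
    two conv layers, each followed by layer normalization and ReLU."""

    def __init__(self, in_ch: int = 3, mid_ch: int = 32, out_ch: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        # GroupNorm with one group normalizes over channel, width and height,
        # mimicking the LN described in the paper (an implementation choice).
        self.ln1 = nn.GroupNorm(1, mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1)
        self.ln2 = nn.GroupNorm(1, out_ch)

    def forward(self, grams: list) -> torch.Tensor:
        # Upsample each GRAM matrix to a common size (here: the largest) and
        # stack them along the channel dimension (Eqs. (5)-(6)).
        size = max(g.shape[-1] for g in grams)
        stacked = torch.stack(
            [F.interpolate(g[None, None], size=(size, size), mode='bilinear')[0, 0]
             for g in grams], dim=0)                      # (3, size, size)
        x = stacked.unsqueeze(0)                          # add batch dimension
        x = F.relu(self.ln1(self.conv1(x)))
        x = F.relu(self.ln2(self.conv2(x)))
        return x                                          # F_style: (1, out_ch, size, size)

# G_0, G_1, G_2 would be the per-layer GRAM matrices from the previous step.
```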

3.2.2. Visual Attention Module

The visual content information of an image can be derived from the objects and scene depicted in it, yet complex emotions are not easily captured from such information alone. Inspired by (Chen et al., 2019), which addresses person re-identification, we propose a visual attention module that introduces high-order attention (HOA) and a feature pyramid mechanism. It captures the complex relations and subtle differences among visual parts to enhance emotion distribution learning. Specifically, given the feature maps $\mathcal{X}_{2}$ obtained from the 'Layer2' layer, HOA is utilized to model the high-order attentive results $\mathcal{X}_{att}^{r}$ among visual parts as follows:

(8) \mathcal{X}_{att}^{r}=f_{hoa}^{r}(\mathcal{X}_{2}),\quad r=1,\ldots,R
(9) f_{hoa}^{r}(\mathcal{X})=\sum_{s=1}^{r}\mathrm{Conv}_{1\times 1}(\mathcal{Z}_{1}^{r}\odot\cdots\odot\mathcal{Z}_{s}^{r}\odot\cdots\odot\mathcal{Z}_{r}^{r})
(10) \mathcal{Z}_{s}^{r}=\mathrm{Conv}_{1\times 1}(\mathcal{X}),\quad s=1,\cdots,r

where $f_{hoa}^{r}$ denotes the $r$-th order HOA, $R$ is the number of orders, and $\odot$ denotes the element-wise product. Specifically, Eq. (10) mines simple and coarse information from $s$ emotional perspectives using separate $1\times 1$ convolution layers $\mathrm{Conv}_{1\times 1}(\cdot)$. The subsequent element-wise products capture the complex, high-order interactions among visual parts, as well as the subtle differences among emotion-aware regions caused by the objects present in the image. As shown in Figure 2, to generate the emotional content features under the guidance of high-order relationships, we use the 'Layer3' layer, denoted $f_{3}(\cdot)$, and the 'Layer4' layer, denoted $f_{4}(\cdot)$, of ResNet-50 to encode the high-order attentive results into multi-scale emotional content features $\mathcal{X}_{3}$ and $\mathcal{X}_{4}$:

(11) \mathcal{X}_{3}=[f_{3}(\mathcal{X}_{att}^{1}),\ldots,f_{3}(\mathcal{X}_{att}^{R})]\in\mathbb{R}^{R\times c_{3}\times w_{3}\times h_{3}}
(12) \mathcal{X}_{4}=[f_{4}(\mathcal{X}_{3}^{1}),\ldots,f_{4}(\mathcal{X}_{3}^{R})]\in\mathbb{R}^{R\times c_{4}\times w_{4}\times h_{4}}
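Below is a hedged sketch of one $r$-th order HOA branch under one plausible reading of Eqs. (8)-(10), in which the $s$-th summand multiplies the first $s$ of the $r$ projected maps before the output $1\times 1$ convolution; the exact factorization in the released code may differ:

```python
import torch
import torch.nn as nn

class HighOrderAttention(nn.Module):
    """One r-th order HOA branch f_hoa^r (sketch of Eqs. (9)-(10))."""

    def __init__(self, channels: int, order: int):
        super().__init__()
        self.order = order
        # Z_s^r of Eq. (10): one 1x1 conv per factor of the product.
        self.factors = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(order)])
        # Output 1x1 convs of Eq. (9), one per summand s.
        self.proj = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(order)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = [f(x) for f in self.factors]        # Z_1^r ... Z_r^r
        out = torch.zeros_like(x)
        for s in range(self.order):
            prod = z[0]
            for t in range(1, s + 1):           # element-wise product of the first s maps
                prod = prod * z[t]
            out = out + self.proj[s](prod)
        return out                               # X_att^r

# hoa = HighOrderAttention(channels=512, order=2)  # R = 2 performs best (Sec. 4.4.1)
# x_att = hoa(torch.randn(1, 512, 56, 56))         # 'Layer2'-sized feature maps
```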

To effectively extract and describe visual content information, we construct a feature pyramid network (FPN) to improve the network's multi-scale perception ability. This is done by upsampling the feature maps $\mathcal{X}_{4}$ and convolving them with a $1\times 1$ convolution layer to match the number of channels in $\mathcal{X}_{3}$. The resulting feature maps are then added to $\mathcal{X}_{3}$ to obtain the emotional content representations $\mathcal{F}_{content}$:

(13) \mathcal{F}_{content}=\mathrm{Conv}_{1\times 1}(\mathrm{Upsample}(\mathcal{X}_{4}))+\mathcal{X}_{3}
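The fusion of Eq. (13) can be sketched as follows (the interpolation mode and the externally supplied $1\times 1$ lateral convolution are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fpn_fuse(x4: torch.Tensor, x3: torch.Tensor, lateral: nn.Conv2d) -> torch.Tensor:
    """Sketch of Eq. (13): upsample X_4 to X_3's spatial size, match channels with
    a 1x1 conv, and add the result to X_3 to obtain F_content."""
    up = F.interpolate(x4, size=x3.shape[-2:], mode='nearest')
    return lateral(up) + x3

# lateral = nn.Conv2d(2048, 1024, kernel_size=1)  # c_4 -> c_3 for ResNet-50 (illustrative)
```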

The HOA can explicitly capture diverse and complementary high-order information, which encourages the richness of the learned features. However, simply employing the HOA module causes partial/biased learning behavior, which hinders the performance of our method; the variant labeled "noAN" in Table 4 demonstrates this effect. Following (Chen et al., 2019), we introduce an adversary constraint to suppress the problem of order collapse for the multi-scale emotional content features $\mathcal{X}_{3}$ and $\mathcal{X}_{4}$, respectively:

(14) \max\limits_{HOA|^{R=r}_{R=1}}\min\limits_{f_{adv}}\mathcal{L}_{adv}^{k}=\max\limits_{HOA|^{R=r}_{R=1}}\min\limits_{f_{adv}}\sum_{s,s^{\prime}=1,s\neq s^{\prime}}^{r}\parallel f_{adv}(\mathbf{x}_{k}^{s})-f_{adv}(\mathbf{x}_{k}^{s^{\prime}})\parallel^{2}_{2}
(15) \mathcal{X}_{k}=[\mathcal{X}_{k}^{1},\dots,\mathcal{X}_{k}^{r},\dots,\mathcal{X}_{k}^{R}],\quad k=3,4
(16) \mathbf{x}_{k}^{r}=\mathrm{Flatten}(\mathcal{X}_{k}^{r})

where $f_{adv}$ is the adversary network (AN), which contains two fully-connected layers, $HOA|^{R=r}_{R=1}$ means there are $r$ HOA modules (from first order to $r$-th order), $\mathbf{x}_{k}^{s}\in\mathbb{R}^{D_{s}}$ ($D_{s}=c_{k}\times w_{k}\times h_{k}$) is the multi-scale emotional content vector flattened from $\mathcal{X}_{k}^{r}$ with $r=s$, and $\mathrm{Flatten}(\cdot)$ is the flattening operator. According to Eq. (14), we obtain the overall adversarial loss $\mathcal{L}_{adv}=\sum\nolimits_{k}\mathcal{L}_{adv}^{k}$.
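A sketch of the pairwise adversary term of Eq. (14) for one scale $k$ is shown below; the hidden width of $f_{adv}$ and the use of gradient reversal or alternating updates to realize the min-max game are implementation assumptions:

```python
import torch
import torch.nn as nn

class Adversary(nn.Module):
    """The adversary network f_adv of Eq. (14): two fully-connected layers."""
    def __init__(self, in_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def adversary_loss(order_feats: list, f_adv: nn.Module) -> torch.Tensor:
    """Pairwise term of Eq. (14) for one scale k; order_feats holds the flattened
    vectors x_k^1 ... x_k^R of Eq. (16)."""
    loss = torch.zeros(())
    for s in range(len(order_feats)):
        for t in range(len(order_feats)):
            if s != t:
                diff = f_adv(order_feats[s]) - f_adv(order_feats[t])
                loss = loss + diff.pow(2).sum()
    return loss

# f_adv descends on this loss while the HOA branches ascend on it, e.g. via a
# gradient-reversal layer or alternating updates (an implementation choice).
```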

3.2.3. Stylistic-aware Representation

Based on the emotional style and content representations learned above, we use a fusion operator to obtain the stylistic-aware distribution $\mathbf{y}_{style}$. The fusion operator is a $1\times 1$ convolution layer with concatenation, which produces an output with the same number of channels as the number of emotion categories.

Specifically, we first use the concatenation operator to combine the stylistic-content representation pairs $\{\mathcal{F}_{style},\mathcal{F}_{content}\}$ and $\{\mathcal{F}_{style},\mathcal{X}_{4}\}$ to obtain the intermediate representations $\mathcal{F}_{sc}\in\mathbb{R}^{R\times c_{sc}\times w_{sc}\times h_{sc}}$ and $\mathcal{F}_{s4}\in\mathbb{R}^{R\times c_{s4}\times w_{s4}\times h_{s4}}$, respectively, where we set $C=c_{sc}=c_{s4}$. We then concatenate these intermediate representations to obtain the stylistic-aware representations $\mathcal{F}_{e}\in\mathbb{R}^{R\times C\times D_{e}}$, where $D_{e}=h_{sc}\times w_{sc}+h_{s4}\times w_{s4}$. Finally, we apply global average pooling $\mathrm{mean}(\cdot)$ and global max pooling $\max(\cdot)$ to $\mathcal{F}_{e}$ to generate the stylistic-aware distribution results $\mathbf{y}_{style}$:

(17) [\mathbf{y}_{e}^{1},\cdots,\mathbf{y}_{e}^{R}]=\mathrm{Softmax}(\mathrm{mean}(\mathcal{F}_{e})+\lambda\ast\max(\mathcal{F}_{e}))\in\mathbb{R}^{R\times C}
(18) \mathbf{y}_{style}=\mathrm{mean}(\mathbf{y}_{e}^{1},\cdots,\mathbf{y}_{e}^{R})

where $\lambda$ is the coefficient that controls the trade-off between the two pooling methods, and $\mathrm{Softmax}(\cdot)$ is the activation function that normalizes the element values in $\mathbf{y}_{style}$ to $[0,1]$.
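The pooling-and-softmax step of Eqs. (17)-(18) reduces, per image, to the following sketch (the value of $\lambda$ shown is illustrative):

```python
import torch

def stylistic_distribution(F_e: torch.Tensor, lam: float = 0.8) -> torch.Tensor:
    """Sketch of Eqs. (17)-(18). F_e has shape (R, C, D_e); pooling is over the last
    dimension and the softmax is over the C emotion categories."""
    pooled = F_e.mean(dim=-1) + lam * F_e.max(dim=-1).values   # (R, C)
    y_e = torch.softmax(pooled, dim=-1)                        # per-order distributions
    return y_e.mean(dim=0)                                     # y_style, shape (C,)
```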

3.3. Emotional-aware Enhanced Representation Learning

Different from other LDL tasks, emotions and their unique characteristics have intrinsic relationships, as demonstrated in psychological theories (Yang et al., 2021). Previous work (Yang et al., 2021; He and Jin, 2019; Xiong et al., 2019) has shown that exploiting the correlations between emotion labels can improve the prediction of the emotion distribution of images.

Inspired by (Ye et al., 2020), we introduce a stylistic GCN, which consists of a static GCN denoted $f_{SGCN}$ and a dynamic GCN denoted $f_{DGCN}$, to obtain the initialization representations $\mathbf{F}_{sgcn}$ and the emotional-aware enhanced representations $\mathbf{F}_{dgcn}$ as follows:

(19) \mathbf{F}_{sgcn}=f_{SGCN}(\mathbf{A}_{s},\widetilde{\mathbf{F}}_{e},\mathbf{W}_{s})

where $\mathbf{A}_{s}$ is the graph adjacency matrix constructed from the co-occurrence relationships between labels, $\widetilde{\mathbf{F}}_{e}\in\mathbb{R}^{C\times D}$ is obtained by concatenating $\mathcal{F}_{e}$ along the order dimension, with $D=D_{e}^{1}+D_{e}^{2}+\ldots+D_{e}^{r}+\ldots+D_{e}^{R}$ and $D_{e}^{r}$ denoting the dimensionality of the $r$-th order stylistic-aware representations, and $\mathbf{W}_{s}$ are learnable parameters. However, the static GCN is not flexible and cannot eliminate irrelevant information in the stylistic-aware representations to capture fine emotional dependencies. Therefore, we exploit the adaptability of a dynamic graph network to better capture emotional-aware enhanced representations:

(20) \mathbf{F}_{dgcn}=f_{DGCN}(\mathbf{A}_{d},\mathbf{F}_{sgcn},\mathbf{W}_{d})
(21) \mathbf{A}_{d}=\delta(\mathbf{W}_{A}\widetilde{\mathbf{F}}_{dgcn})

where $\mathbf{A}_{d}$ enables the graph structure to be dynamically adjusted for each image, $\mathbf{W}_{A}$ and $\mathbf{W}_{d}$ are learnable parameters, $\widetilde{\mathbf{F}}_{dgcn}$ is obtained by concatenating $\mathbf{F}_{sgcn}$ with its global representations, and $\delta(\cdot)$ is the sigmoid activation function.

In the same way as Eqs. (17) and (18), we obtain the emotional-aware distribution results $\mathbf{y}_{emotion}\in\mathbb{R}^{C}$ of the module.
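For intuition, a compact sketch of the static-then-dynamic propagation of Eqs. (19)-(21) is given below; the layer dimensionalities, the LeakyReLU activation, and the 1D convolution used to realize $\mathbf{W}_{A}$ follow the ADD-GCN style of (Ye et al., 2020) and are assumptions rather than the exact released architecture:

```python
import torch
import torch.nn as nn

class StylisticGCN(nn.Module):
    """Sketch of the static + dynamic GCN of Eqs. (19)-(21). A_s is the label
    co-occurrence adjacency (C x C)."""

    def __init__(self, A_s: torch.Tensor, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.register_buffer('A_s', A_s)                    # static adjacency (C, C)
        self.W_s = nn.Linear(in_dim, hid_dim, bias=False)    # static GCN weights
        self.W_d = nn.Linear(hid_dim, out_dim, bias=False)   # dynamic GCN weights
        self.W_A = nn.Conv1d(2 * hid_dim, A_s.shape[0], 1)   # produces A_d (Eq. (21))
        self.act = nn.LeakyReLU(0.2)

    def forward(self, F_e_tilde: torch.Tensor) -> torch.Tensor:
        # Static GCN (Eq. (19)): coarse, dataset-level emotion dependencies.
        F_sgcn = self.act(self.A_s @ self.W_s(F_e_tilde))            # (C, hid)
        # Dynamic adjacency (Eq. (21)): concatenate node features with a global
        # representation, then predict a per-image adjacency with a sigmoid.
        global_rep = F_sgcn.mean(dim=0, keepdim=True).expand_as(F_sgcn)
        F_cat = torch.cat([F_sgcn, global_rep], dim=-1)              # (C, 2*hid)
        A_d = torch.sigmoid(self.W_A(F_cat.T.unsqueeze(0))).squeeze(0).T  # (C, C)
        # Dynamic GCN (Eq. (20)): image-adaptive emotion dependencies.
        return self.act(A_d @ self.W_d(F_sgcn))                      # F_dgcn: (C, out)
```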

3.4. Final Distribution and Optimization

Given the two predicted distributions $\mathbf{y}_{emotion}$ and $\mathbf{y}_{style}$, we combine them with a weighted sum to obtain the final emotional distribution $\mathbf{y}$ as follows:

(22) \mathbf{y}=\mu\ast\mathbf{y}_{emotion}+(1-\mu)\ast\mathbf{y}_{style}

where $\mu$ is the coefficient that controls the trade-off between the two predicted results.

As in previous work, the proposed method employs the KL loss (Gao et al., 2017) for distribution learning. Our objective function consists of the adversarial loss $\mathcal{L}_{adv}$ and the prediction loss $\mathcal{L}_{pred}$. For the prediction loss, we adopt intermediate supervision instead of directly optimizing only the final predicted results:

(23) \mathcal{L}_{pred}=\mathrm{mean}(\frac{1}{R}\sum_{r=1}^{R}\mathrm{KLloss}(\mathbf{y}_{e}^{r},\hat{\mathbf{y}})+\mathrm{KLloss}(\mathbf{y}_{emotion},\hat{\mathbf{y}}))

Meanwhile, in order to balance the difference in the numerical scale of the two losses, we adopt an adaptive balance method:

(24) \mathcal{L}=\mathcal{L}_{pred}+\mathcal{L}_{adv}/\parallel\mathcal{L}_{adv}/\mathcal{L}_{pred}\parallel

where $\parallel\mathcal{L}_{adv}/\mathcal{L}_{pred}\parallel$ denotes the truncated-gradient operator, which computes the adaptive balance coefficient of the adversarial loss.
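A sketch of the overall objective is given below; the outer mean in Eq. (23) is read here as averaging the two prediction terms (it could equally be a batch average), and the truncated gradient is realized with detach():

```python
import torch

def kl_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between the ground-truth distribution and a predicted one."""
    return (target * torch.log((target + eps) / (pred + eps))).sum(dim=-1).mean()

def total_loss(y_e_per_order, y_emotion, y_hat, L_adv):
    """Sketch of Eqs. (23)-(24)."""
    L_pred_terms = torch.stack([kl_loss(y_r, y_hat) for y_r in y_e_per_order])
    L_pred = (L_pred_terms.mean() + kl_loss(y_emotion, y_hat)) / 2   # Eq. (23)
    scale = (L_adv / L_pred).detach()                                # truncated gradient
    return L_pred + L_adv / scale                                    # Eq. (24)
```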

4. EXPERIMENT

4.1. Experimental Setup

4.1.1. Emotion Datasets

Flickr-LDL (Yang et al., 2017b) is a collection of images annotated with emotional label distributions over eight emotions (i.e., anger, amusement, awe, contentment, disgust, excitement, fear and sadness). It was created by selecting, per emotion category, a subset of the Flickr dataset (Borth et al., 2013) using 1,200 adjective-noun pairs and having 11 viewers annotate each image with one of the eight common emotions. The final dataset contains 10,700 images, with roughly equal numbers of images per emotion class. Twitter-LDL (Yang et al., 2017b) was created by searching Twitter with a variety of emotional keywords; the retrieved images were then manually screened and annotated by 8 viewers. The final dataset contains 10,045 images, with the annotations indicating the distribution of emotions present in each image. Emotion6 (Peng et al., 2015) contains 1,980 images obtained from Flickr using six emotion keywords (anger, disgust, joy, fear, sadness and surprise), with 330 images per category; the annotated distributions cover seven categories (the six emotions plus neutral), and each image was annotated by 15 viewers.

Table 1. Comparison with the state-of-the-art methods on Twitter-LDL dataset.
Measures PT-Bayes PT-SVM AA-kNN AA-BP SA-BFGS SA-CPNN SSDL LDL-LDM DIEDL Ours
KL \downarrow 1.31(8) 1.65(9) 3.89(10) 1.19(6) 1.19(6) 0.85(5) 0.51(3) 0.53(4) 0.47(2) 0.42(1)
Chebyshev \downarrow 0.53(9) 0.63(10) 0.28(5) 0.37(7) 0.37(7) 0.36(6) 0.25(3) 0.27(4) 0.24(2) 0.22(1)
Clark \downarrow 0.85(5) 0.91(9) 0.58(1) 0.89(7) 0.89(7) 0.85(5) 0.84(2) 2.35(10) 0.84(2) 0.84(2)
Canberra \downarrow 0.77(3) 0.88(9) 0.41(1) 0.84(7) 0.84(7) 0.78(6) 0.76(2) 6.05(10) 0.77(3) 0.77(3)
Cosine \uparrow 0.53(9) 0.25(10) 0.82(5) 0.71(8) 0.82(5) 0.75(7) 0.86(3) 0.85(4) 0.87(2) 0.89(1)
Intersection \uparrow 0.40(9) 0.21(10) 0.66(5) 0.59(6) 0.57(7) 0.56(8) 0.69(2) 0.67(3) 0.67(4) 0.73(1)
Average Rank \downarrow 7.17(9) 9.50(10) 4.50(4) 6.83(7) 6.50(6) 6.17(8) 2.50(2) 5.83(5) 2.50(2) 1.50(1)
Table 2. Comparison with the state-of-the-art methods on Emotion6 dataset.
Measures PT-Bayes PT-SVM AA-kNN AA-BP SA-BFGS SA-CPNN SSDL LDL-LDM DIEDL Ours
KL \downarrow 2.32(10) 1.07(8) 0.85(7) 0.63(6) 1.16(9) 0.56(5) 0.40(2) 0.44(4) 0.40(2) 0.36(1)
Chebyshev \downarrow 0.35(8) 0.39(10) 0.29(5) 0.30(6) 0.38(9) 0.30(6) 0.24(2) 0.26(3) 0.26(3) 0.22(1)
Clark \downarrow 0.73(8) 0.69(7) 0.62(2) 0.64(6) 0.74(9) 0.63(5) 0.62(2) 1.65(10) 0.62(2) 0.59(1)
Canberra \downarrow 0.66(8) 0.62(7) 0.51(2) 0.54(5) 0.67(9) 0.54(5) 0.51(2) 3.64(10) 0.52(4) 0.47(1)
Cosine \uparrow 0.69(6) 0.48(10) 0.75(4) 0.68(7) 0.63(9) 0.66(8) 0.79(3) 0.72(5) 0.81(2) 0.84(1)
Intersection \uparrow 0.56(8) 0.42(10) 0.62(5) 0.59(7) 0.52(9) 0.60(6) 0.66(2) 0.65(4) 0.66(2) 0.70(1)
Average Rank \downarrow 8.00(8) 8.67(9) 4.17(4) 4.17(4) 9.00(10) 5.83(6) 2.17(2) 6.00(7) 2.50(3) 1.00(1)
Table 3. Comparison with the state-of-the-art methods on Flickr-LDL dataset.
Measures PT-Bayes PT-SVM AA-kNN AA-BP SA-BFGS SA-CPNN SSDL LDL-LDM DIEDL Ours
KL \downarrow 1.88(9) 1.69(8) 3.28(10) 0.82(5) 1.06(6) 1.06(6) 0.46(3) 0.49(2) 0.46(3) 0.39(1)
Chebyshev \downarrow 0.44(9) 0.55(10) 0.28(5) 0.36(7) 0.37(8) 0.30(6) 0.23(2) 0.25(4) 0.23(2) 0.21(1)
Clark \downarrow 0.89(9) 0.87(8) 0.57(1) 0.82(5) 0.86(7) 0.82(5) 0.78(3) 2.14(10) 0.79(4) 0.76(2)
Canberra \downarrow 0.85(9) 0.83(8) 0.41(1) 0.75(6) 0.82(7) 0.74(5) 0.69(3) 5.26(10) 0.70(4) 0.66(2)
Cosine \uparrow 0.63(9) 0.32(10) 0.79 (5) 0.72(6) 0.70(7) 0.70(7) 0.85(3) 0.84(4) 0.86(2) 0.88(1)
Intersection \uparrow 0.49(9) 0.29(10) 0.64(5) 0.53(8) 0.56(7) 0.60(6) 0.68(3) 0.66(4) 0.70(2) 0.71(1)
Average Rank \downarrow 9.00(9) 9.00(9) 4.50(4) 6.17(7) 7.00(8) 5.83(6) 2.83(2) 5.66(5) 2.83(2) 1.33(1)

4.1.2. Evaluation Metrics

To evaluate the effectiveness of our proposed StyleEDL, six metrics are adopted: Kullback-Leibler (KL) divergence, Chebyshev distance, Clark distance, Canberra metric, Cosine coefficient and Intersection similarity. Additionally, the Average Rank is adopted to indicate the overall performance of each model.
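For reference, these metrics follow their standard definitions in the LDL literature (Geng, 2016); a per-sample sketch is:

```python
import numpy as np

def distribution_metrics(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> dict:
    """Standard LDL metrics for one sample; p is the ground-truth distribution
    and q the predicted one (both sum to 1)."""
    return {
        'KL':           np.sum(p * np.log((p + eps) / (q + eps))),
        'Chebyshev':    np.max(np.abs(p - q)),
        'Clark':        np.sqrt(np.sum((p - q) ** 2 / ((p + q) ** 2 + eps))),
        'Canberra':     np.sum(np.abs(p - q) / (p + q + eps)),
        'Cosine':       np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q) + eps),
        'Intersection': np.sum(np.minimum(p, q)),
    }
```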

4.1.3. Parameter and Evaluation Settings

We use a ResNet-50 model pre-trained on the ImageNet dataset as our backbone network and remove its last fully connected layer. We consider the outputs of the top convolutional layer and four groups of convolutional layers ('Conv1', 'Layer1', 'Layer2', 'Layer3', and 'Layer4') of the ResNet-50 model. All training images are resized to $448\times 448$ pixels and undergo random scaling and horizontal flipping for data augmentation. Our method is implemented in the PyTorch deep learning framework and trained on an NVIDIA GTX 1080Ti GPU. We use mini-batch stochastic gradient descent (SGD) with momentum and weight decay to optimize the model. The mini-batch size is set to 8, and the learning rate is set to 0.01 for the first 10 epochs and then decreased 10-fold every 20 epochs until the total number of training epochs reaches 90.
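The optimization schedule described above can be sketched as follows; the momentum and weight-decay values are not specified in the text and are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(2048, 8)   # placeholder for the StyleEDL network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)   # momentum/decay assumed
# lr = 0.01 for the first 10 epochs, then divided by 10 every 20 epochs (90 epochs total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[10, 30, 50, 70], gamma=0.1)

for epoch in range(90):
    # ... one training epoch over mini-batches of size 8 ...
    scheduler.step()
```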

4.2. Experimental Results

To evaluate the effectiveness of our proposed StyleEDL, we compare it with several existing state-of-the-art methods, which are grouped into four categories: problem transformation (PT-Bayes and PT-SVM (Geng, 2016)), algorithm adaptation (AA-kNN and AA-BP (Geng, 2016)), specialized algorithms (SA-BFGS (Geng, 2016) and SA-CPNN (Geng et al., 2013)), and CNN-based methods (SSDL (Xiong et al., 2019), LDL-LDM (Wang and Geng, 2021b) and DIEDL (Wu et al., 2023)). Tables 1, 2 and 3 show the performance of these methods on three widely used datasets. The best results are highlighted in boldface. The down arrow \downarrow next to a measure means a lower score is better, and the up arrow \uparrow means a higher score is better. From the tables, we make the following observations: 1) AA-kNN achieves highly competitive results on the Clark distance and Canberra metric, affirming its strength in addressing intersecting samples in visual emotion distributions. 2) CNN-based methods perform better than the other three types of algorithms, which suggests that CNNs have a stronger ability to capture emotion-related content information from visual parts. 3) Our method consistently outperforms the other methods by a clear margin, indicating that more accurate results can be expected by considering stylistic representations in emotion distribution learning tasks.

4.3. Ablation Study

To further investigate the influence of different components of the proposed method, several variants are configured and ablation experiments are conducted on the Twitter-LDL dataset. The variants are: (a) B, which adopts only the backbone network. (b) B+G, which adds the GRAM-based intra- and inter-layer correlation to (a). (c) B+V, which adds the visual attention module to (a). (d) B+E, which adopts only the backbone and the emotional-aware enhanced representation learning. (e) B+G+V, which adds both the GRAM-based intra- and inter-layer correlation and the visual attention module to (a). (f) B+G+V+E*, which replaces our emotional-aware enhanced representation learning with a static GCN. (g) B+G*+V+E, which only considers the correlation between GRAM matrices (inter-layer), not the correlation within layers. (h) noAN, which discards the adversarial constraint loss from our proposed method.

Table 4. Ablation analysis on Twitter-LDL. 'B', 'G', 'G*', 'V', 'E' and 'E*' correspond to the backbone, the GRAM-based intra- and inter-layer correlation, the GRAM-based inter-layer correlation only, the visual attention module, the emotional-aware enhanced representation learning, and a static GCN, respectively.
Method KL\downarrow Chebyshev\downarrow Cosine\uparrow Intersection\uparrow
B 0.457 0.228 0.881 0.718
B+G 0.448 0.228 0.881 0.719
B+V 0.441 0.226 0.882 0.717
B+E 0.482 0.229 0.876 0.718
B+G+V 0.433 0.223 0.884 0.719
B+G+V+E* 0.465 0.228 0.879 0.715
B+G*+V+E 0.434 0.224 0.884 0.719
noAN 0.446 0.223 0.882 0.723
Ours 0.420 0.218 0.889 0.726

The results are shown in Table 4, which reports four of the metrics mentioned above. From the results, the following observations can be made: 1) Without style-induced information, B+V performs worse than B+G+V, indicating that the emotional style representations are beneficial for learning stylistic-aware representations. 2) B+G and B+V both have positive effects, which demonstrates that not only does the learning of visual content information improve the emotional distribution results, but the style information also plays an important role. 3) B+E yields inferior outcomes compared to B, suggesting that features extracted from ResNet-50 and fed directly into our stylistic GCN do not yield favorable results. 4) Our proposed StyleEDL consistently surpasses B+G+V, B+G+V+E* and B+G*+V+E, which means our method gains from the use of the intra- and inter-layer correlation and the stylistic GCN. Moreover, the outcomes further indicate that the flexible dynamic GCN can eliminate irrelevant information in the stylistic-aware representations. Similar observations can be made on the other two datasets.

Figure 3. Effect of $\mu$ on the Twitter-LDL (left) and Emotion6 (right) datasets.
Figure 4. Sensitivity analysis of $\lambda$ on Twitter-LDL (left) and Emotion6 (right).

4.4. Parameter Sensitivity Analysis

In our work, there are three essential parameters: the order $R$ of the HOA module and the balance coefficients $\lambda$ and $\mu$ in stylistic-aware representation learning and emotional-aware enhanced representation learning, respectively. We conducted comprehensive experiments on two datasets, Twitter-LDL and Emotion6. Specifically, the KL divergence and Intersection coefficient metrics were used for Twitter-LDL, while the KL divergence and Cosine coefficient metrics were used for Emotion6.

Table 5. Sensitivity analysis of $R$ on Twitter-LDL.
Order R=1 R=2 R=3 R=4
KL\downarrow 0.445 0.420 0.421 0.427
Chebyshev\downarrow 0.225 0.218 0.221 0.219
Cosine\uparrow 0.882 0.889 0.888 0.886
Intersection\uparrow 0.721 0.726 0.720 0.724
Table 6. Sensitivity analysis of $R$ on Emotion6.
Order R=1 R=2 R=3 R=4
KL\downarrow 0.377 0.361 0.385 0.393
Chebyshev\downarrow 0.227 0.222 0.231 0.235
Cosine\uparrow 0.829 0.839 0.827 0.822
Intersection\uparrow 0.694 0.698 0.689 0.687

4.4.1. Order of HOA module

Tables 5 and 6 show that our proposed method performs best when $R=2$ on both datasets. When $R=1$, our method lacks the ability to further encode the feature maps and fails to employ the attention mechanism to refine the visual content representations, while a large value of $R$ makes the model more susceptible to noise.

Figure 5. Visualization of the predicted emotion distributions (Predicted) and the ground truth (GT). "Amu", "Con", "Awe", "Exc", "Fea", "Sad", "Dis" and "Ang" represent "amusement", "contentment", "awe", "excitement", "fear", "sadness", "disgust" and "anger" in Twitter-LDL, respectively.

4.4.2. Balance coefficient

We investigated the influence of the balance coefficients $\lambda$ and $\mu$ by varying their values from 0.0 to 1.0. As shown in Figure 4, larger values of $\lambda$ generally result in better performance than smaller values; a proper value of $\lambda$ can enhance the stylistic-aware representations of images and improve the overall performance of the model. The coefficient $\mu$ balances the importance of the stylistic-aware distribution results and the emotional-aware distribution results. As shown in Figure 3, the performance of our method improves steadily as $\mu$ increases from 0.0 to 0.6 and peaks at 0.6. Intuitively, the performance can be enhanced by introducing emotional-aware enhanced representation learning. Moreover, all KL divergence values on Emotion6 are much lower than those on Twitter-LDL, which may be because Emotion6 is much smaller than Twitter-LDL.

4.5. Computational Complexity

Table 7 reports the actual inference time of our method alongside several recent state-of-the-art methods. As the table shows, our approach achieves superior performance at the cost of the higher computational complexity introduced by the HOA module. In future work, we will explore lightweight high-order solutions.

Table 7. Model complexity for inference with several state-of-the-art methods.
SSDL LDL-LDM DIEDL Ours
Time (ms) 5.853 1.27 7.659 16.272

Figure 5 presents a qualitative comparison of the predicted distributions on the Twitter-LDL dataset. The visualization covers two aspects: (1) different scenarios, such as humans and animals; (2) the impact of style on emotions. From the illustration, we can see that our method achieves decent prediction results. In particular, our model can identify well the changes in emotion induced by stylistic information. Taking the first and second images in Figure 5 as an example, the second one evokes a more melancholic state, and content perception alone cannot account for this difference, thereby substantiating the efficacy of incorporating style information.

5. CONCLUSION

In this paper, we propose a novel image emotion distribution learning method termed StyleEDL, which learns emotional distributions in a style-induced manner. In StyleEDL, we derive stylistic-aware representations of images based on the hierarchical stylistic information of visual parts. In addition, emotional-aware enhanced representations are obtained by the stylistic GCN and further exploited to explore the correlations between emotions. Comprehensive experiments on three well-known datasets demonstrate the superiority of our StyleEDL.

References

  • Borth et al. (2013) Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. 2013. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the 21st ACM international conference on Multimedia. 223–232.
  • Chen et al. (2019) Binghui Chen, Weihong Deng, and Jiani Hu. 2019. Mixed high-order attention network for person re-identification. In Proceedings of the IEEE/CVF international conference on computer vision. 371–381.
  • Chu and Wu (2018) Wei-Ta Chu and Yi-Ling Wu. 2018. Image style classification based on learnt deep correlation features. IEEE Transactions on Multimedia 20, 9 (2018), 2491–2502.
  • Fan et al. (2018) Yangyu Fan, Hansen Yang, Zuhe Li, and Shu Liu. 2018. Predicting image emotion distribution by emotional region. In 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). IEEE, 1–9.
  • Freeman (2004) Linton Freeman. 2004. The development of social network analysis. A Study in the Sociology of Science 1, 687 (2004), 159–167.
  • Gao et al. (2017) Bin-Bin Gao, Chao Xing, Chen-Wei Xie, Jianxin Wu, and Xin Geng. 2017. Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing 26, 6 (2017), 2825–2838.
  • Geng (2016) Xin Geng. 2016. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering 28, 7 (2016), 1734–1748.
  • Geng et al. (2013) Xin Geng, Chao Yin, and Zhi-Hua Zhou. 2013. Facial age estimation by learning from label distributions. IEEE transactions on pattern analysis and machine intelligence 35, 10 (2013), 2401–2412.
  • Ghosal et al. (2019) Koustav Ghosal, Mukta Prasad, and Aljosa Smolic. 2019. A geometry-sensitive approach for photographic style classification. arXiv preprint arXiv:1909.01040 (2019).
  • He and Jin (2019) Tao He and Xiaoming Jin. 2019. Image emotion distribution learning with graph convolutional networks. In Proceedings of the 2019 on International Conference on Multimedia Retrieval. 382–390.
  • Holbrook and O’Shaughnessy (1984) Morris B. Holbrook and John O’Shaughnessy. 1984. The role of emotion in advertising. Psychology & Marketing 1, 2 (1984), 45–64.
  • Jing et al. (2023) Peiguang Jing, Kai Cui, Weili Guan, Liqiang Nie, and Yuting Su. 2023. Category-aware Multimodal Attention Network for Fashion Compatibility Modeling. IEEE Transactions on Multimedia (2023). https://doi.org/10.1109/TMM.2023.3246796
  • Laubrock and Dubray (2019) Jochen Laubrock and David Dubray. 2019. CNN-based Classification of Illustrator Style in Graphic Novels: Which Features Contribute Most?. In International Conference on Multimedia Modeling. Springer, 684–695.
  • Lecoutre et al. (2017) Adrian Lecoutre, Benjamin Negrevergne, and Florian Yger. 2017. Recognizing art style automatically in painting with deep learning. In Asian conference on machine learning. PMLR, 327–342.
  • Lu et al. (2015) Xin Lu, Zhe Lin, Xiaohui Shen, Radomir Mech, and James Z Wang. 2015. Deep multi-patch aggregation network for image style, aesthetics, and quality estimation. In Proceedings of the IEEE international conference on computer vision. 990–998.
  • Lu et al. (2019) Xu Lu, Lei Zhu, Zhiyong Cheng, Jingjing Li, Xiushan Nie, and Huaxiang Zhang. 2019. Flexible Online Multi-Modal Hashing for Large-Scale Multimedia Retrieval. In Proceedings of the 27th ACM International Conference on Multimedia (MM). 1129–1137.
  • Matsuo and Yanai (2016) Shin Matsuo and Keiji Yanai. 2016. CNN-based style vector for style image retrieval. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. 309–312.
  • Mitchell (1986) Andrew A Mitchell. 1986. The effect of verbal and visual components of advertisements on brand attitudes and attitude toward the advertisement. Journal of consumer research 13, 1 (1986), 12–24.
  • Mittal et al. (2021) Anshul Mittal, Noveen Sachdeva, Sheshansh Agrawal, Sumeet Agarwal, Purushottam Kar, and Manik Varma. 2021. ECLARE: Extreme classification with label graph correlations. In Proceedings of the Web Conference 2021. 3721–3732.
  • Nie et al. (2022) Liqiang Nie, Leigang Qu, Dai Meng, Min Zhang, Qi Tian, and Alberto Del Bimbo. 2022. Search-oriented micro-video captioning. In Proceedings of the 30th ACM International Conference on Multimedia. 3234–3243.
  • Peng et al. (2015) Kuan-Chuan Peng, Tsuhan Chen, Amir Sadovnik, and Andrew C Gallagher. 2015. A mixed bag of emotions: Model, predict, and transfer emotion distributions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 860–868.
  • Qu et al. (2021) Leigang Qu, Meng Liu, Jianlong Wu, Zan Gao, and Liqiang Nie. 2021. Dynamic modality interaction modeling for image-text retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1104–1113.
  • Rao et al. (2020) Tianrong Rao, Xiaoxu Li, and Min Xu. 2020. Learning multi-level deep representations for image emotion classification. Neural processing letters 51 (2020), 2043–2061.
  • Serrat (2017) Olivier Serrat. 2017. Social network analysis. In Knowledge solutions. Springer, 39–43.
  • Wang and Geng (2021a) Jing Wang and Xin Geng. 2021a. Label distribution learning by exploiting label distribution manifold. IEEE transactions on neural networks and learning systems (2021).
  • Wang and Geng (2021b) Jing Wang and Xin Geng. 2021b. Label Distribution Learning by Exploiting Label Distribution Manifold. IEEE transactions on neural networks and learning systems PP (08 2021).
  • Wasserman and Faust (1994) Stanley Wasserman and Katherine Faust. 1994. Social network analysis: Methods and applications. (1994).
  • Wu et al. (2023) Huiyan Wu, Yonggang Huang, and Guoshun Nan. 2023. Doubled Coupling for Image Emotion Distribution Learning. Know.-Based Syst. 260, C (jan 2023), 11 pages.
  • Xiong et al. (2019) Haitao Xiong, Hongfu Liu, Bineng Zhong, and Yun Fu. 2019. Structured and sparse annotations for image emotion distribution learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 363–370.
  • Xu and Wang (2021) Zhiwei Xu and Shangfei Wang. 2021. Emotional attention detection and correlation exploration for image emotion distribution learning. IEEE Transactions on Affective Computing (2021).
  • Yang et al. (2013) Byunghwa Yang, Youngchan Kim, and Changjo Yoo. 2013. The integrated mobile advertising model: The effects of technology-and emotion-based evaluations. Journal of Business Research 66, 9 (2013), 1345–1352.
  • Yang et al. (2018a) Jufeng Yang, Liyi Chen, Le Zhang, Xiaoxiao Sun, Dongyu She, Shao-Ping Lu, and Ming-Ming Cheng. 2018a. Historical context-based style classification of painting images via label distribution learning. In Proceedings of the 26th ACM international conference on Multimedia. 1154–1162.
  • Yang et al. (2021) Jingyuan Yang, Jie Li, Leida Li, Xiumei Wang, and Xinbo Gao. 2021. A circular-structured representation for visual emotion distribution learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4237–4246.
  • Yang et al. (2018b) Jufeng Yang, Dongyu She, Yu-Kun Lai, Paul L Rosin, and Ming-Hsuan Yang. 2018b. Weakly supervised coupled networks for visual sentiment analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7584–7592.
  • Yang et al. (2017a) Jufeng Yang, Dongyu She, and Ming Sun. 2017a. Joint Image Emotion Classification and Distribution Learning via Deep Convolutional Neural Network.. In IJCAI. 3266–3272.
  • Yang et al. (2018c) Jufeng Yang, Dongyu She, Ming Sun, Ming-Ming Cheng, Paul L Rosin, and Liang Wang. 2018c. Visual sentiment prediction based on automatic discovery of affective regions. IEEE Transactions on Multimedia 20, 9 (2018), 2513–2525.
  • Yang et al. (2017b) Jufeng Yang, Ming Sun, and Xiaoxiao Sun. 2017b. Learning visual sentiment distributions via augmented conditional probability neural network. In Thirty-first AAAI conference on artificial intelligence.
  • Ye et al. (2020) Jin Ye, Junjun He, Xiaojiang Peng, Wenhao Wu, and Yu Qiao. 2020. Attention-driven dynamic graph convolutional network for multi-label image recognition. In European Conference on Computer Vision. Springer, 649–665.
  • Zhang et al. (2019) Wei Zhang, Xuanyu He, and Weizhi Lu. 2019. Exploring discriminative representations for image emotion recognition with CNNs. IEEE Transactions on Multimedia 22, 2 (2019), 515–523.
  • Zhao et al. (2021) Sicheng Zhao, Xingxu Yao, Jufeng Yang, Guoli Jia, Guiguang Ding, Tat-Seng Chua, Bjoern W Schuller, and Kurt Keutzer. 2021. Affective image content analysis: Two decades review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
  • Zhu et al. (2020) Lei Zhu, Xu Lu, Zhiyong Cheng, Jingjing Li, and Huaxiang Zhang. 2020. Deep Collaborative Multi-View Hashing for Large-Scale Image Search. IEEE Trans. Image Process. 29 (2020), 4643–4655.
  • Zhu et al. (2023) Lei Zhu, Chaoqun Zheng, Weili Guan, Jingjing Li, Yang Yang, and Heng Tao Shen. 2023. Multi-modal Hashing for Efficient Multimedia Retrieval: A Survey. IEEE Transactions on Knowledge and Data Engineering (2023), 1–20. https://doi.org/10.1109/TKDE.2023.3282921
  • Zhu et al. (2017) Xinge Zhu, Liang Li, Weigang Zhang, Tianrong Rao, Min Xu, Qingming Huang, and Dong Xu. 2017. Dependency Exploitation: A Unified CNN-RNN Approach for Visual Emotion Recognition.. In IJCAI. 3595–3601.