Global Image Sentiment Transfer
Abstract
Transferring the sentiment of an image is an unexplored research topic in computer vision. This work proposes a novel framework, consisting of a reference image retrieval step and a global sentiment transfer step, to transfer the sentiment of images according to a given sentiment tag. The proposed image retrieval algorithm is based on the SSIM index. The reference images retrieved by the proposed algorithm are more content-related than those retrieved by an algorithm based on the perceptual loss, and therefore lead to better image sentiment transfer results. In addition, we propose a global sentiment transfer step that employs an optimization algorithm to iteratively transfer the sentiment of images based on feature maps produced by the Densenet121 architecture. The proposed sentiment transfer algorithm can transfer the sentiment of an image while keeping the content structure of the input image intact. Qualitative and quantitative experiments demonstrate that the proposed sentiment transfer framework outperforms existing artistic and photorealistic style transfer algorithms in producing reliable sentiment transfer results with rich, fine, and exact details.
I Introduction
Transferring the sentiment of an image is still an unexplored research topic. Compared with existing well-known tasks such as two-domain image-to-image translation [1, 2, 3, 4] (e.g., winter ↔ summer, cat ↔ dog) and image style transfer (e.g., artistic style transfer, photorealistic style transfer), image sentiment transfer focuses on modifying an image at a higher level to change its overall feeling to people. For example, without modifying the content, a family portrait can be transferred into a more positive picture. The transferred one may give people a feeling of warmth and thus be more valuable to keep. Since we live in an age of pressure, we argue that this research topic is significant due to its strong potential to enrich people's lives.
Intuitively, image sentiment is an abstract concept. Compared with two-domain image-to-image translation, which commonly has a definite pattern for accomplishing the transfer between two domains (e.g., cat ↔ dog, horse ↔ zebra), there are numerous ways to transfer an image to a specific sentiment. To make the image transfer controllable, a reference image should be fed into the model as guidance. Considering its similarity to the image style transfer task, we could leverage existing image style transfer models to perform reference-guided image sentiment transfer. However, it is nontrivial to implement this design because of the poor compatibility between the input image and the reference image for image sentiment transfer. Moreover, directly using existing artistic and photorealistic style transfer models generally fails to create visually pleasing results in terms of detail preservation and artifact/distortion elimination. Compared with image style transfer, where an artistic/photorealistic style can be added indiscriminately to any input image, sentiment transfer between two content-unrelated images is risky. In the example of Fig. 1, the sentiment transfer result lacks photorealism because the reference image does not bring any content-related reference information to the input image.
Considering this, we propose a high-performance image sentiment transfer framework that starts with image retrieval. Given an input image and a sentiment tag provided by the user, instead of randomly sampling a reference image associated with the given sentiment tag, we retrieve the most suitable reference image based on the structural information of the input image. Leveraging the structural similarity (SSIM) index, the framework effectively establishes a content relation between the input and the reference image. In Section IV, we demonstrate that this image retrieval step is crucial for improving the performance of image sentiment transfer.
To transfer the sentiment of the retrieved reference image to the input image, we design a novel global image sentiment transfer algorithm. Inspired by the image style transfer algorithm of Gatys et al. [5], we use an optimization algorithm on deep features produced by a neural network pre-trained on the ImageNet [6] dataset to iteratively transfer the sentiment of the reference image to the input image. Different from existing style transfer algorithms, our method adopts the Densenet121 architecture as the feature extractor instead of the widely-used VGG19 architecture. We empirically find that the Densenet121 architecture outperforms VGG19 in terms of fine detail preservation and artifact/distortion elimination, and is therefore more suitable for sentiment transfer, where the produced image should be photorealistic.
Our main contributions are summarized as follows:
• We are the first to explore the task of image sentiment transfer. We present an effective two-step framework for the task by image retrieval and reference-guided image sentiment transfer.
• We introduce an effective reference image retrieval algorithm based on the SSIM index, which can achieve better results compared with other methods in finding content-related reference images.
• We propose a global sentiment transfer algorithm based on the Densenet121, which can transfer the sentiment/style of an image while preserving fine details of the image.

II Related Work
Visual sentiment understanding has been explored for many years. Most existing works focus on visual sentiment classification tasks. To perform accurate classification of images with different sentiments, low-level features such as color [7, 8, 9], texture [9], and shape [10] were studied in early years. Later on, mid-level composition [9], Sentribute [11], principles-of-art features [12], and high-level adjective-noun pairs (ANP) [13] were also considered. Most recently, due to the rapid development of convolutional neural networks (CNNs) for extracting visual features, many approaches turn to CNN-based sentiment recognition. Some of them handle noisy data during the training process [14, 15, 16], while others explore visual sentiment at the region level [17, 18, 19, 20, 21]. However, compared with image sentiment classification, other sentiment-related fields such as image sentiment generation/translation have not been well studied yet.
The tasks most related to ours are image-to-image translation and image style transfer. Image-to-image translation aims at learning a mapping between two different domains. Early approaches need paired data to train the model and are essentially restricted to learning a deterministic one-to-one mapping [22, 23, 24], which prevents the generation of diverse outputs. CycleGAN [2] first proposes a cycle consistency loss that enables the model to be trained on unpaired data. Subsequent approaches such as MUNIT [1] and DRIT [3] further propose disentangled representations that enable diverse outputs. On the other hand, our task is related to image style transfer. A great number of approaches have been proposed for artistic style transfer [25, 26, 27, 28, 29] and photorealistic style transfer [30, 31, 32, 33, 34]. Different from the above approaches, we focus on image sentiment transfer, which requires a strong content relation between the input and the reference image. Therefore, we search for the reference image based on the sentiment tag provided by the user instead of directly asking the user to provide the reference. The proposed global sentiment transfer algorithm is based on the work of Gatys et al. [27]. However, our algorithm uses the Densenet121 [35] network architecture instead of VGG19 [36] as the feature extractor, since we empirically find that Densenet121 achieves more faithful preservation of input details than VGG19.
Searching for a reference image given a sentiment tag is related to the image retrieval task. Image retrieval aims at finding an image that is close to a given image. Most recent works measure the perceptual loss of images by comparing image features extracted from pre-trained convolutional neural networks [37]; images with more similar style effects generally have a lower perceptual loss. Different from these perceptual-loss-based works, we find that the SSIM index [38] is more suitable for retrieving reference images for sentiment transfer, since it mainly captures the similarity of images in terms of content structure rather than stylization effects.


III Method
To transfer the global sentiment of an image, we propose a method that consists of a reference image retrieval step and a global sentiment transfer step. Fig. 2 shows the framework of the proposed algorithm. Given an input image, we first retrieve a reference image according to the target sentiment tag. Then a global sentiment transfer algorithm is employed to transfer the sentiment of the reference image to the input image. We describe the details of these two steps in the remainder of this section.
III-A Reference Image Retrieval
Retrieving a reference image is the initial step of sentiment transfer. Given an input image, the proposed retrieval method aims at finding a reference image according to a sentiment tag given by the user. To facilitate the following global sentiment transfer step, the retrieved image should have similar content to the input image but convey the target sentiment.
To achieve this, we propose an image retrieval algorithm based on the Visual Semantic Odometry (VSO) [39] dataset. Each image in the VSO dataset is attached with a noun-adjective pair that describes its semantic content and its sentiment, respectively. To retrieve a reference image according to a given sentiment tag, we first select a subset of the VSO dataset in which every image shares the content tag (noun) of the input image but carries the sentiment tag (adjective) of the given target. For example, in Fig. 2, the input image has a noun-adjective pair of “Clear River” while images in the corresponding subset are labeled “Muddy River”.
To find the reference image in the selected target subset that has the most similar semantic structure to the input image, inspired by [32, 34], we use the Structural Similarity Index (SSIM) [40] between edge responses [41] of images to measure the semantic similarity between each image in the target subset and the input image. The SSIM index was originally used in image/video quality assessment. We empirically find that SSIM is more suitable than the widely-used perceptual loss for measuring the semantic similarity between two images. For every image in the target subset, we compute the SSIM index between the candidate and the input image and then pick the image with the highest SSIM index as the reference image for the input.
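To make the retrieval step concrete, the following sketch (not the authors' released code) shows one way it could be implemented with off-the-shelf tools. The HED edge detector [41] is replaced here by a Canny edge detector purely to keep the example self-contained, and the image size and file-list handling are illustrative assumptions; in the actual framework, `candidate_paths` would be the VSO subset whose noun matches the input image and whose adjective matches the target sentiment tag.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def edge_response(gray):
    # Stand-in for the HED edge detector [41]; Canny keeps the sketch self-contained.
    return cv2.Canny(gray, 100, 200)

def retrieve_reference(input_path, candidate_paths, size=(256, 256)):
    """Return the candidate whose edge map is most similar (SSIM) to the input's."""
    inp = cv2.resize(cv2.imread(input_path, cv2.IMREAD_GRAYSCALE), size)
    inp_edges = edge_response(inp)
    best_path, best_score = None, -1.0
    for path in candidate_paths:  # subset sharing the input noun and the target adjective
        cand = cv2.resize(cv2.imread(path, cv2.IMREAD_GRAYSCALE), size)
        score = ssim(inp_edges, edge_response(cand), data_range=255)
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score
```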

III-B Global Sentiment Transfer
Given the input and a selected reference image that is structurally most similar to the input, we propose a novel algorithm to transfer the sentiment of the input image according to the selected reference image. Our algorithm is based on an optimization method that iteratively transfers the sentiment by minimizing two objectives on deep features: the first objective keeps the details of the input intact, while the other constrains the sentiment of the produced image to be similar to that of the reference image.
A high-quality sentiment transfer result should have a similar sentiment to the reference image while keeping the content details of the input image intact. The key challenge in sentiment transfer is to measure the sentiment similarity between two images. Inspired by Gatys et al. [42, 5], we adopt the Gram loss on deep features of the input and reference images produced by neural networks to measure the sentiment similarity. Such a Gram-based loss term was originally used to measure style similarity. Since sentiment can be regarded as an abstraction of style, we borrow the Gram loss term to perform sentiment transfer. Moreover, we compute the norm of the difference between features of the transferred and input images as the content-consistency loss.

The sentiment/style transfer results created by the above loss terms rely heavily on the deep features used to compute the objectives. Existing style transfer algorithms [42, 43, 44, 45, 31, 46, 47] all use features produced by the VGG19 network pre-trained on the ImageNet dataset [6]. However, the optimization algorithm based on the feature maps of VGG19 inevitably changes details of the content image. Take the style transfer results of Gatys et al. [43] shown in Fig. 5 for example: the style transfer algorithm based on VGG19 features changes details of the sea, sky, and plants in the input image.
As illustrated in Fig. 4, the proposed algorithm adopts the Densenet121 network pre-trained on the ImageNet dataset [6] as the feature extractor. We empirically find that Densenet121 can achieve sentiment transfer while avoiding damage to the details of the input content. Based on Densenet121, we first extract deep features of the input and the reference image from the ReLU layers behind each pooling operator. Here we use $F^{s}$ and $F^{t}$ to denote the feature maps of the input and reference image respectively, where $s$ denotes source and $t$ denotes target. The use of Densenet121 has two main advantages. First, Densenet121 keeps the content information intact while transferring sentiment from the reference to the input image. Second, Densenet121 has only about half the parameters of VGG19 (Densenet121: 6.952M vs. VGG19: 12.945M), and is therefore more time-efficient in creating sentiment transfer results.
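As a rough illustration of this feature extraction, the sketch below hooks intermediate activations of a pre-trained torchvision Densenet121. The specific layer names are our reading of “the ReLU layers behind each pooling operator” and may differ from the authors' exact choice; the input is assumed to be an ImageNet-normalized tensor of shape (1, 3, H, W).

```python
import torch
import torchvision.models as models

class DenseFeatures(torch.nn.Module):
    """Collect intermediate Densenet121 activations via forward hooks."""

    def __init__(self, layer_names=("relu0", "denseblock1", "denseblock2",
                                    "denseblock3", "denseblock4")):
        super().__init__()
        # Newer torchvision versions prefer the `weights=` argument instead.
        self.backbone = models.densenet121(pretrained=True).features.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.layer_names = layer_names
        self._feats = {}
        for name, module in self.backbone.named_children():
            if name in layer_names:
                module.register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(_module, _inputs, output):
            self._feats[name] = output
        return hook

    def forward(self, x):
        self._feats = {}
        self.backbone(x)  # activations are captured by the hooks
        return [self._feats[n] for n in self.layer_names]
```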
The overall loss function we use is

$$\mathcal{L}_{total} = \alpha\,\mathcal{L}_{content} + \beta\,\mathcal{L}_{sentiment}, \tag{1}$$

$$\mathcal{L}_{content} = \sum_{l} \big\| F^{o}_{l} - F^{s}_{l} \big\|^{2}, \tag{2}$$

$$\mathcal{L}_{sentiment} = \sum_{l} \big\| G(F^{o}_{l}) - G(F^{t}_{l}) \big\|^{2}_{F}, \tag{3}$$

where $F^{o}_{l}$ denotes the feature map of the transferred image at the $l$-th selected layer of Densenet121, $F^{s}_{l}$ and $F^{t}_{l}$ are the corresponding feature maps of the input (source) and reference (target) images, and $G(\cdot)$ denotes the Gram matrix of a feature map. All feature maps have the shape $C \times H \times W$, where $C$ denotes the channel number while $H$ and $W$ represent the height and width, respectively. Please note that the transferred image is the variable in the optimization process, i.e., we iteratively alter an image to make its structure look the same as the input image while giving it the sentiment of the reference image.
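The loss terms in Equations (1)-(3) can be written compactly in PyTorch. The sketch below follows the common Gatys-style implementation; the normalization constants and the use of mean-squared distances are assumptions, as the text does not spell them out. Here `feats_out`, `feats_src`, and `feats_ref` correspond to the lists of feature maps $F^{o}_{l}$, $F^{s}_{l}$, and $F^{t}_{l}$.

```python
import torch

def gram_matrix(feat):
    # feat: (B, C, H, W) -> normalized Gram matrix of shape (B, C, C)
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def content_loss(feats_out, feats_src):
    # Eq. (2): feature-space distance between transferred and input image
    return sum(torch.mean((fo - fs) ** 2) for fo, fs in zip(feats_out, feats_src))

def sentiment_loss(feats_out, feats_ref):
    # Eq. (3): Gram-matrix distance between transferred and reference image
    return sum(torch.mean((gram_matrix(fo) - gram_matrix(fr)) ** 2)
               for fo, fr in zip(feats_out, feats_ref))

def total_loss(feats_out, feats_src, feats_ref, alpha, beta):
    # Eq. (1): weighted combination of the two objectives
    return (alpha * content_loss(feats_out, feats_src)
            + beta * sentiment_loss(feats_out, feats_ref))
```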
IV Experiment
In this section, we first discuss experimental settings. Then we compare the proposed image retrieval and global sentiment transfer algorithm against other image retrieval and image style transfer algorithms respectively. Finally, we demonstrate the effectiveness of the proposed algorithm by both visual and quantitative evaluation. All the source code and the trained model will be made available to the public.
TABLE I: Noun-adjective pairs of the selected subsets used in our experiments.

| Positive Sentiments | Negative Sentiments |
| --- | --- |
| Warm home | Dark room |
| Clear river | Muddy water |
| Clear water | Muddy river |
| Clear mountain | Misty mountains |
| Scenic mountain | Rough hill |
| Clear lake | Misty lake |
| Lovely city | Harsh landscape |
| Bright city | Poor city |
| Great city | |
IV-A Experimental Settings
Global Sentiment Dataset. To demonstrate the effectiveness of the proposed global sentiment transfer framework, we collect images with global sentiments from the VSO dataset. For example, an image described as “beautiful bird” should not be selected since “bird” is only a regional object. Table I shows the noun-adjective pairs of the selected subsets used in our experiments. We employ nine subsets with positive sentiments and eight subsets with negative sentiments.
Global Sentiment Transfer Settings. To transfer the sentiment of the selected reference image to the input image, we propose an optimization-based iterative method. As stated above, we use a Densenet121 pre-trained on the ImageNet dataset as the feature extractor. We use the Adam algorithm [48] to iteratively optimize the transferred image given the input image and the retrieved reference image, and set the weights $\alpha$ and $\beta$ in Equation (1) to balance the content and sentiment loss terms. To obtain the sentiment transfer result for each input-reference pair, we run the optimization for 500 iterations.
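A minimal sketch of this optimization loop is given below, reusing the `DenseFeatures` extractor and `total_loss` from the sketches in Section III. The learning rate and the loss weights `alpha` and `beta` are placeholder values, since the paper's exact settings are not reproduced here; only the Adam optimizer and the 500 iterations come from the text.

```python
import torch

def transfer_sentiment(input_img, reference_img, alpha=1.0, beta=1e3,
                       iterations=500, lr=0.01, device="cuda"):
    """Iteratively optimize the transferred image (alpha/beta/lr are placeholders)."""
    extractor = DenseFeatures().to(device)
    src, ref = input_img.to(device), reference_img.to(device)
    with torch.no_grad():
        feats_src = extractor(src)
        feats_ref = extractor(ref)
    # The transferred image is the optimization variable, initialized from the input.
    out = src.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([out], lr=lr)
    for _ in range(iterations):
        optimizer.zero_grad()
        loss = total_loss(extractor(out), feats_src, feats_ref, alpha, beta)
        loss.backward()
        optimizer.step()
    return out.detach()
```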
IV-B Reference Image Retrieval
Fig. 6 shows the image retrieval results based on the perceptual loss and on the SSIM index. Since the perceptual loss mainly measures the similarity between two images in terms of style effects, the image retrieved based on the perceptual loss generally has a similar sentiment/style but distinctly different content. On the contrary, the image retrieved by the SSIM-based algorithm contains similar content but a different sentiment compared with the input image. Take Fig. 6 for example: the image retrieved by the perceptual loss (b) has the same blue tone as the input image (a), yet the content structure of (a) and (b) is completely different, whereas the input (a) and the image retrieved by the SSIM index (c) have a similar content structure. Generally, the global sentiment transfer algorithm generates a better result when the reference image has a more similar content structure to the input image. Therefore, Fig. 6 demonstrates that the proposed SSIM-based image retrieval method performs better than the algorithm based on the perceptual loss.

IV-C Visual Comparison

TABLE II: Mean SSIM scores between the input images and the transfer results on the validation set (higher is better).

| Method | Gatys [5] | WCT [49] | AdaIN [26] | StyleNAS [34] | Ours |
| --- | --- | --- | --- | --- | --- |
| SSIM | 0.7019 | 0.2443 | 0.5301 | 0.6653 | 0.8719 |
Since this work presents the first global sentiment transfer algorithm for arbitrary input images, we compare the results produced by our algorithm with state-of-the-art artistic [50, 26, 49] and photorealistic [34] style transfer algorithms to demonstrate the effectiveness of the proposed global sentiment transfer method. Other photorealistic style transfer algorithms such as [31, 30, 32] are not compared since these methods need a segmentation map or a post-processing step to assist the style transfer; such pre- or post-processing steps are not needed by ours or by the compared methods. To make a fair comparison, each compared style transfer algorithm adopts the image retrieved by our SSIM-based image retrieval algorithm as the reference image. Fig. 7 shows the sentiment transfer results of our method and the style transfer results of the state-of-the-art universal style transfer algorithms. The results of the artistic style transfer algorithms (StyleSwap [50], WCT [49], AdaIN [26]) usually have distorted content details, which is necessary for creating artistic effects but is not desirable for faithful sentiment transfer. The photorealistic style transfer algorithm [34] can preserve the content information; however, it may create significant artifacts, as shown in Fig. 7 (f). Fig. 7 (g) shows the result produced by our global sentiment transfer algorithm. Our method successfully achieves sentiment transfer while keeping the content details intact. Moreover, the produced results have significantly fewer artifacts compared with the state-of-the-art photorealistic style transfer algorithm.
IV-D Quantitative Comparison
We quantitatively demonstrate the effectiveness of the proposed algorithm by computing the SSIM index between the input image and the produced result. Here the SSIM score measures the ability of each compared algorithm to preserve fine details of the content. We collect 46 input-reference image pairs to form a validation set and obtain the sentiment/style transfer results of all the compared algorithms on this set. Table II shows the mean SSIM score of each compared algorithm on this validation set. The proposed global sentiment transfer algorithm has a higher mean SSIM score than the other style transfer methods, which demonstrates that our method has a stronger ability to preserve fine details of the input.
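For reference, this evaluation can be reproduced with a few lines of code. Whether the score is computed on grayscale or per-channel images is not stated in the text, so the grayscale variant below is an assumption.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def mean_ssim(input_paths, result_paths, size=(256, 256)):
    """Mean SSIM between each input image and its sentiment transfer result."""
    scores = []
    for inp_path, res_path in zip(input_paths, result_paths):
        inp = cv2.resize(cv2.imread(inp_path, cv2.IMREAD_GRAYSCALE), size)
        res = cv2.resize(cv2.imread(res_path, cv2.IMREAD_GRAYSCALE), size)
        scores.append(ssim(inp, res, data_range=255))
    return sum(scores) / len(scores)
```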
V Conclusion
In this work, we propose a high-performance global image sentiment transfer framework consisting of a reference image retrieval step and a global sentiment transfer step. In the reference image retrieval step, we adopt the SSIM index instead of the widely-used perceptual loss to measure the structural distance between images, which captures content similarity rather than style effects. In the global sentiment transfer step, we use the Densenet121 network pre-trained on the ImageNet dataset as the feature extractor and employ an image style transfer framework to iteratively transfer sentiment based on features produced by the Densenet121 architecture. Our qualitative and quantitative experiments demonstrate that the proposed algorithm outperforms existing style transfer algorithms in terms of sentiment transfer effects and input detail preservation.
References
- [1] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in ECCV, 2018.
- [2] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017.
- [3] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, “Diverse image-to-image translation via disentangled representations,” in ECCV, 2018.
- [4] H. Tang, D. Xu, G. Liu, W. Wang, N. Sebe, and Y. Yan, “Cycle in cycle generative adversarial networks for keypoint-guided image generation,” in ACM MM, 2019.
- [5] L. A. Gatys, A. S. Ecker, and M. Bethge, “A neural algorithm of artistic style,” arXiv preprint arXiv:1508.06576, 2015.
- [6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NeurIPS, 2012.
- [7] X. Alameda-Pineda, E. Ricci, Y. Yan, and N. Sebe, “Recognizing emotions from abstract paintings using non-linear matrix completion,” in CVPR, 2016.
- [8] A. Sartori, D. Culibrk, Y. Yan, and N. Sebe, “Who’s afraid of itten: Using the art theory of color combination to analyze emotions in abstract paintings,” in ACM MM, 2015.
- [9] J. Machajdik and A. Hanbury, “Affective image classification using features inspired by psychology and art theory,” in ACM MM, 2010.
- [10] X. Lu, P. Suryanarayan, R. B. Adams Jr, J. Li, M. G. Newman, and J. Z. Wang, “On shape and the computability of emotions,” in ACM MM, 2012.
- [11] J. Yuan, S. Mcdonough, Q. You, and J. Luo, “Sentribute: image sentiment analysis from a mid-level perspective,” in Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining, 2013.
- [12] S. Zhao, Y. Gao, X. Jiang, H. Yao, T.-S. Chua, and X. Sun, “Exploring principles-of-art features for image emotion recognition,” in ACM MM, 2014.
- [13] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, “Large-scale visual sentiment ontology and detectors using adjective noun pairs,” in ACM MM, 2013.
- [14] J. Yang, D. She, Y.-K. Lai, and M.-H. Yang, “Retrieving and classifying affective images via deep metric learning,” in AAAI, 2018.
- [15] J. Yang, D. She, and M. Sun, “Joint image emotion classification and distribution learning via deep convolutional neural network.” in IJCAI, 2017.
- [16] Q. You, J. Luo, H. Jin, and J. Yang, “Robust image sentiment analysis using progressively trained and domain transferred deep networks,” in AAAI, 2015.
- [17] J. Yang, D. She, Y.-K. Lai, P. L. Rosin, and M.-H. Yang, “Weakly supervised coupled networks for visual sentiment analysis,” in CVPR, 2018.
- [18] K. Song, T. Yao, Q. Ling, and T. Mei, “Boosting image sentiment analysis with visual attention,” Neurocomputing, 2018.
- [19] S. Zhao, Z. Jia, H. Chen, L. Li, G. Ding, and K. Keutzer, “Pdanet: Polarity-consistent deep attention network for fine-grained visual emotion regression,” in ACM MM, 2019.
- [20] T. Rao, X. Li, H. Zhang, and M. Xu, “Multi-level region-based convolutional neural network for image emotion classification,” Neurocomputing, 2019.
- [21] Q. You, H. Jin, and J. Luo, “Visual sentiment analysis by attending on local image regions,” in AAAI, 2017.
- [22] L. Karacan, Z. Akata, A. Erdem, and E. Erdem, “Learning to generate images of outdoor scenes from attributes and semantic layouts,” arXiv preprint arXiv:1612.00215, 2016.
- [23] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays, “Scribbler: Controlling deep image synthesis with sketch and color,” in CVPR, 2017.
- [24] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in CVPR, 2017.
- [25] J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang, “Visual attribute transfer through deep image analogy,” arXiv preprint arXiv:1705.01088, 2017.
- [26] X. Huang and S. J. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in ICCV, 2017.
- [27] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in CVPR, 2016.
- [28] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman, “Controlling perceptual factors in neural style transfer,” in CVPR, 2017.
- [29] D. Kotovenko, A. Sanakoyeu, S. Lang, and B. Ommer, “Content and style disentanglement for artistic style transfer,” in ICCV, 2019.
- [30] Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz, “A closed-form solution to photorealistic image stylization,” in ECCV, 2018.
- [31] F. Luan, S. Paris, E. Shechtman, and K. Bala, “Deep photo style transfer,” in CVPR, 2017.
- [32] J. Yoo, Y. Uh, S. Chun, B. Kang, and J.-W. Ha, “Photorealistic style transfer via wavelet transforms,” in ICCV, 2019.
- [33] S. Bae, S. Paris, and F. Durand, “Two-scale tone management for photographic look,” in ACM Transactions on Graphics, 2006.
- [34] J. An, H. Xiong, J. Huan, and J. Luo, “Ultrafast photorealistic style transfer via neural architecture search,” in AAAI, 2020.
- [35] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017.
- [36] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [37] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018.
- [38] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” TIP, 2004.
- [39] K.-N. Lianos, J. L. Schonberger, M. Pollefeys, and T. Sattler, “Vso: Visual semantic odometry,” in ECCV, 2018.
- [40] M.-J. Chen and A. C. Bovik, “Fast structural similarity index algorithm,” Journal of Real-Time Image Processing, 2011.
- [41] S. Xie and Z. Tu, “Holistically-nested edge detection,” in ICCV, 2015.
- [42] L. Gatys, A. S. Ecker, and M. Bethge, “Texture synthesis using convolutional neural networks,” in NeurIPS, 2015.
- [43] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in CVPR, 2016.
- [44] L. A. Gatys, M. Bethge, A. Hertzmann, and E. Shechtman, “Preserving color in neural artistic style transfer,” arXiv preprint arXiv:1606.05897, 2016.
- [45] E. Risser, P. Wilmot, and C. Barnes, “Stable and controllable neural texture synthesis and style transfer using histogram losses,” arXiv preprint arXiv:1701.08893, 2017.
- [46] S. Li, X. Xu, L. Nie, and T.-S. Chua, “Laplacian-steered neural style transfer,” in ACM MM, 2017.
- [47] Y. Li, N. Wang, J. Liu, and X. Hou, “Demystifying neural style transfer,” arXiv preprint arXiv:1701.01036, 2017.
- [48] D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [49] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal style transfer via feature transforms,” in NeurIPS, 2017.
- [50] T. Q. Chen and M. Schmidt, “Fast patch-based style transfer of arbitrary style,” arXiv preprint arXiv:1612.04337, 2016.