
Multimodal Pretraining and Generation for Recommendation:
A Tutorial

Jieming Zhu (Huawei Noah’s Ark Lab, Shenzhen, China), jiemingzhu@ieee.org; Chuhan Wu (Huawei Noah’s Ark Lab, Beijing, China), wuchuhan1@huawei.com; Rui Zhang (www.ruizhang.info, Shenzhen, China), rayteam@yeah.net; and Zhenhua Dong (Huawei Noah’s Ark Lab, Shenzhen, China), dongzhenhua@huawei.com
(2024)
Abstract.

Personalized recommendation stands as a ubiquitous channel for users to explore information or items aligned with their interests. Nevertheless, prevailing recommendation models predominantly rely on unique IDs and categorical features for user-item matching. While this ID-centric approach has witnessed considerable success, it falls short in comprehensively grasping the essence of raw item contents across diverse modalities, such as text, image, audio, and video. This underutilization of multimodal data poses a limitation to recommender systems, particularly in the realm of multimedia services like news, music, and short-video platforms. The recent surge in pretraining and generation techniques presents both opportunities and challenges in the development of multimodal recommender systems. This tutorial seeks to provide a thorough exploration of the latest advancements and future trajectories in multimodal pretraining and generation techniques within the realm of recommender systems. The tutorial comprises three parts: multimodal pretraining, multimodal generation, and industrial applications and open challenges in the field of recommendation. Our target audience encompasses scholars, practitioners, and other parties interested in this domain. By providing a succinct overview of the field, we aspire to facilitate a swift understanding of multimodal recommendation and foster meaningful discussions on the future development of this evolving landscape.

Keywords: Recommender systems, multimodal pretraining, multimodal generation, multimodal adaptation
journalyear: 2024; copyright: acmlicensed; conference: Companion Proceedings of the ACM Web Conference 2024, May 13–17, 2024, Singapore, Singapore; booktitle: Companion Proceedings of the ACM Web Conference 2024 (WWW ’24 Companion), May 13–17, 2024, Singapore, Singapore; doi: 10.1145/3589335.3641248; isbn: 979-8-4007-0172-6/24/05

1. Introduction

1.1. Topic and Relevance

Nowadays, the emergence of Large Language Models (LLMs) and Multimodal LLMs (MLLMs), such as ChatGPT (GPT-3.5 and GPT-4) (OpenAI, 2023), Llama 2 (Touvron et al., 2023), BLIP-2 (Li et al., 2023a), and MiniGPT-4 (Zhu et al., 2023), is reshaping the landscape of technological capabilities. The immense potential of these pretrained large models, particularly MLLMs, introduces both novel opportunities and challenges for the research community, prompting exploration into innovative applications for recommendation tasks. This tutorial aims to comprehensively review and present existing research and practical insights related to multimodal pretraining and generation for recommendation. The tutorial aligns closely with the core themes of the WWW conference and promises valuable takeaways for attendees from both the multimodal learning and recommender systems communities.

1.2. Target Audience

The tutorial is structured in a lecture-style format. We welcome participation from academic researchers, industrial practitioners, and other stakeholders with a keen interest in the field. Participants are anticipated to possess a foundational knowledge of the relevant fields. The tutorial uniquely explores the synergy between multimodal learning and recommender system domains. For researchers specializing in multimodal learning, the tutorial offers insights into the applications and challenges associated with integrating multimodal models into recommendation systems. On the other hand, researchers within the recommender systems domain can gain valuable knowledge about recent and prospective research directions in multimodal recommender systems, specifically focusing on how to enhance recommendations through multimodal pretraining and generation techniques. Moreover, we share impactful success stories derived from deploying multimodal models in production systems. These real-world cases can provide practitioners with valuable insights into practical multimodal model deployment.

2. Tentative Schedule

The tutorial consists of three talks: the first two cover the research topics of multimodal pretraining and multimodal generation in the context of recommender systems, and the last shares successful applications in practice and presents open challenges from an industrial perspective. The tutorial materials will be made available at https://mmrec.github.io/tutorial/www2024.

  • Opening Remarks (30min), by Dr. Zhenhua Dong.

  • Multimodal Pretraining for Recommendation (45min), by Dr. Jieming Zhu.

  • Coffee Break (15min)

  • Multimodal Generation for Recommendation (45min), by Prof. Rui Zhang.

  • Industrial Applications and Open Challenges in Multimodal Recommendation (45min), by Dr. Chuhan Wu.

2.1. Multimodal Pretraining for Recommendation

Pretrained models have recently emerged as a groundbreaking approach to achieving state-of-the-art results in many machine learning tasks. In this talk, we will introduce multimodal pretraining techniques and their applications in recommender systems.

  • Self-supervised pretraining: We will briefly review the common self-supervised pretraining paradigms, including reconstructive, contrastive, and generative learning tasks (Liu et al., 2023).

  • Multimodal pretraining: Multimodal pretraining has emerged as a rapidly growing trend across fields such as computer vision, natural language processing, and speech recognition, attracting significant interest. We will introduce representative multimodal pretrained models, covering both contrastive and generative approaches, e.g., CLIP (Radford et al., 2021), Flamingo (Alayrac et al., 2022), GPT-4 (OpenAI, 2023), BLIP-2 (Li et al., 2023a), and ImageBind (Girdhar et al., 2023).

  • Pretraining for recommendation: This part focuses on recent research that applies pretraining techniques to recommendation. We will summarize pretrained models for recommendation in four categories: 1) Sequence pretraining, which aims to capture users’ sequential behavior patterns from item representations, including BERT4Rec (Sun et al., 2019), PeterRec (Yuan et al., 2020), UserBERT (Wu et al., 2022), S3-Rec (Zhou et al., 2020), and SL4Rec (Yao et al., 2021). 2) Text-based pretraining, which models semantic item representations from text data. Examples include UNBERT (Zhang et al., 2021), PREC (Liu et al., 2022), MINER (Li et al., 2022), UniSRec (Hou et al., 2022), Recformer (Li et al., 2023b), and P5 (Geng et al., 2022). These models are not only valuable for text-rich news recommendation but can also enable knowledge transfer across items and domains. 3) Audio-based pretraining, which has been studied in the context of music recommendation and retrieval. Such models extract latent music representations to enhance recommendation and retrieval tasks, including MusicBERT (Zhu et al., 2021), MART (Yao et al., 2024), PEMR (Yao et al., 2022), and UAE (Chen et al., 2021). 4) Multimodal pretraining, which aims to achieve multimodal content understanding and cross-modal alignment. A recent trend is to build multimodal foundation models for recommendation, e.g., MMSSL (Wei et al., 2023), PMGT (Liu et al., 2021), MSM4SR (Zhang et al., 2023), MISSRec (Wang et al., 2023), and VIP5 (Geng et al., 2023).

  • Model adaptation for recommendation: Given large pretrained models, it is often necessary to adapt them to a recommendation task with domain-specific data. We will review the common paradigms for model adaptation, including representation-based transfer, fine-tuning, adapter tuning (Lialin et al., 2023), prompt tuning (Gu et al., 2023), and retrieval-augmented adaptation (Long et al., 2023). A minimal code sketch combining cross-modal contrastive alignment with adapter-style adaptation follows this list.
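
To make the above ideas concrete, the following is a minimal, self-contained sketch in PyTorch: frozen pretrained text and image encoders are assumed to provide item features, small trainable adapters project them into a shared space, and a CLIP-style symmetric InfoNCE loss aligns the two views of each item. The module names, dimensions, and bottleneck design are illustrative assumptions, not a specific method from the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAdapter(nn.Module):
    """A small bottleneck adapter that maps a frozen encoder's output into a shared space."""

    def __init__(self, in_dim: int, bottleneck: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def clip_style_loss(text_emb: torch.Tensor, image_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: the matched (text, image) pair of the same item is the positive."""
    logits = text_emb @ image_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage: random tensors stand in for frozen text/image encoder outputs.
batch, text_dim, image_dim, shared_dim = 32, 768, 1024, 256
frozen_text_feat = torch.randn(batch, text_dim)    # e.g., output of a frozen BERT-like encoder
frozen_image_feat = torch.randn(batch, image_dim)  # e.g., output of a frozen ViT-like encoder

text_adapter = ModalityAdapter(text_dim, 64, shared_dim)
image_adapter = ModalityAdapter(image_dim, 64, shared_dim)

loss = clip_style_loss(text_adapter(frozen_text_feat), image_adapter(frozen_image_feat))
loss.backward()  # gradients flow only into the small adapters; the frozen encoders are untouched
print(float(loss))
```

In practice, only the adapter parameters would be updated on recommendation-domain data, which keeps the adaptation cost small relative to full fine-tuning of the pretrained encoders.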

2.2. Multimodal Generation for Recommendation

With the recent advancements in generative models, AI-generated content (AIGC) has gained significant popularity in various applications. In this talk, we will discuss the research directions for applying AIGC techniques in recommendation scenarios.

  • Text generation: With the support of powerful large language models (LLMs), text generation has been applied to many tasks such as news headline generation (Gu et al., 2020; Salemi et al., 2023) and dialogue generation (Yang et al., 2022; Huang et al., 2020). We will discuss the commonly used sequence-to-sequence generation framework and LLM-based generation methods. More recently, news headline generation has been performed in a personalized manner, with approaches such as LaMP (Salemi et al., 2023), GUE (Cai et al., 2023), PENS (Ao et al., 2021), NHNet (Gu et al., 2020), and PNG (Ao et al., 2023).

  • Image generation: Image generation has achieved remarkable success with the prevalence of GANs and diffusion models. We will introduce their applications in poster generation for advertisements and cover image generation for news and e-books. Examples include AutoPoster (Lin et al., 2023), TextPainter (Gao et al., 2023), and PosterLayout (Hsu et al., 2023).

  • Personalized generation: While pretrained generation models enable general-domain text and image generation, there is a trend towards personalized generation. This is important for recommendation scenarios where personalized content or identity information needs to be provided. Pioneering work includes personalized image generation (e.g., DreamBooth (Ruiz et al., 2023), textual inversion (Yang et al., 2023)), personalized text generation (e.g., LaMP (Salemi et al., 2023), APR (Li et al., 2023c), PTG (Li et al., 2023d)), and personalized multimodal generation (e.g., PMG (Shen et al., 2024)). A minimal prompting-based sketch of personalized text generation follows this list.
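
As one lightweight route to personalization, the sketch below conditions a general-purpose LLM on a user's recent clicks through the prompt alone. It is our own illustration rather than the method of any cited paper; the model name ("gpt2" as a placeholder open model), the prompt template, and the helper personalized_headline are all assumptions.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder open model


def personalized_headline(news_body: str, clicked_titles: list[str]) -> str:
    """Builds a user-conditioned prompt and lets the LLM draft a headline."""
    history = "; ".join(clicked_titles[-5:])  # most recent clicks as a lightweight user profile
    prompt = (
        f"User previously clicked: {history}.\n"
        f"Article: {news_body}\n"
        f"Write a headline this user would click:"
    )
    output = generator(prompt, max_new_tokens=20, do_sample=True, top_p=0.9)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return output[0]["generated_text"][len(prompt):].strip()


print(personalized_headline(
    "The city council approved a new bike-lane network covering 40 km of downtown streets.",
    ["Cycling tips for commuters", "Why bike lanes cut traffic deaths"],
))
```

Stronger personalization methods from the literature fine-tune or adapt the generator on user-specific signals, but even this prompt-only variant shows where user history enters the generation process.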

2.3. Industrial Applications and Open Challenges in Multimodal Recommendation

  • Successful applications: In this talk, we will present a list of successful applications in industry, organizing publicly disclosed use cases from Alibaba (Ge et al., 2018), JD.com (Liu et al., 2020; Xiao et al., 2022), Tencent (Chen et al., 2021), Baidu (Wen et al., 2023; Yu et al., 2022), Xiaohongshu (Huang et al., 2021), Pinterest (Baltescu et al., 2022), etc. We will also share our industrial experience in deploying multimodal recommendation models at Huawei (Xun et al., 2021).

  • Open challenges: We will discuss the open challenges in multimodal recommendation from both research and practice perspectives, such as multimodal representation fusion (see the sketch after this list), multi-domain multimodal pretraining, efficient adaptation of MLLMs, personalized adaptation of MLLMs, multimodal AIGC for recommendation, efficiency and responsibility of multimodal recommendation, open benchmarking (Zhu et al., 2022), etc.
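
To ground the first of these challenges, the following is a minimal sketch, under our own assumptions, of fusing ID, text, and image embeddings of an item before scoring it against a user embedding. The gated-fusion design, the dimensions, and the dot-product scorer are illustrative choices, not a deployed architecture.

```python
import torch
import torch.nn as nn


class GatedMultimodalFusion(nn.Module):
    """Learns per-example gates that mix ID, text, and image views of an item."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, id_emb, text_emb, image_emb):
        stacked = torch.stack([id_emb, text_emb, image_emb], dim=1)                    # (B, 3, D)
        weights = torch.softmax(self.gate(torch.cat([id_emb, text_emb, image_emb], dim=-1)), dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)                            # (B, D)


dim, batch = 64, 16
fusion = GatedMultimodalFusion(dim)
user_emb = torch.randn(batch, dim)
item_emb = fusion(torch.randn(batch, dim), torch.randn(batch, dim), torch.randn(batch, dim))
ctr_logit = (user_emb * item_emb).sum(dim=-1)  # simple dot-product scorer
print(ctr_logit.shape)  # torch.Size([16])
```

Open questions include how to weight modalities per item and per user, how to handle missing modalities, and how to keep such fusion efficient at industrial scale.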

3. Related Tutorials

Several related tutorials have been given at previous conferences:

  • Paul Pu Liang, Louis-Philippe Morency. Tutorial on Multimodal Machine Learning: Principles, Challenges, and Open Questions. ICMI 2023 (Liang and Morency, 2023).

  • Trung-Hoang Le, Quoc-Tuan Truong, Aghiles Salah, Hady W. Lauw. Multi-Modal Recommender Systems: Towards Addressing Sparsity, Comparability, and Explainability. WWW 2023 (Le et al., 2023).

  • Quoc-Tuan Truong, Aghiles Salah, Hady Lauw. Multi-Modal Recommender Systems: Hands-On Exploration. RecSys 2021 (Truong et al., 2021).

  • Xiangnan He, Hanwang Zhang, Tat-Seng Chua. Recommendation Technologies for Multimedia Content. ICMR 2018 (He et al., 2018).

  • Yi Yu, Kiyoharu Aizawa, Toshihiko Yamasaki, Roger Zimmermann. Emerging Topics on Personalized and Localized Multimedia Information Systems. MM 2014 (Yu et al., 2014).

  • Jialie Shen, Xian-Sheng Hua, Emre Sargin. Towards Next Generation Multimedia Recommendation Systems. MM 2013 (Shen et al., 2013).

  • Jialie Shen, Meng Wang, Shuicheng Yan, Peng Cui. Multimedia Recommendation: Technology and Techniques. SIGIR 2013 (Shen et al., 2013).

  • Jialie Shen, Meng Wang, Shuicheng Yan, Peng Cui. Multimedia Recommendation. MM 2012 (Shen et al., 2012).

Different from these previous tutorials, our tutorial makes the following novel contributions: 1) Our tutorial builds on recent advances in multimodal pretraining and generation techniques, which differs significantly from the previous tutorials on multimedia recommendation (Shen et al., 2012, 2013; Yu et al., 2014; He et al., 2018). 2) The three most recent tutorials either present a technical review of general multimodal learning tasks (Liang and Morency, 2023) or provide introductory-to-intermediate hands-on projects on multimodal recommendation (Truong et al., 2021; Le et al., 2023). In contrast, we examine recent research and practical progress in applying pretrained multimodal models to recommendation tasks.

4. Biography

Dr. Jieming Zhu is a researcher at Huawei Noah’s Ark Lab. He received the Ph.D. degree from The Chinese University of Hong Kong in 2016. His recent research focuses on developing practical AI models for industrial-scale recommender systems. He currently leads a research project on multimodal pretraining and generation for recommender systems. Please find more information at https://jiemingzhu.github.io.

Dr. Chuhan Wu is a researcher at Huawei Noah’s Ark Lab. He received the Ph.D. degree from Tsinghua University in 2023. He focuses on recommender systems and responsible AI. Please find more information at https://wuch15.github.io.

Prof. Rui Zhang is a visiting Professor at Tsinghua University and was previously a Professor at the University of Melbourne. His research interests include machine learning and big data. Please find more information at https://www.ruizhang.info.

Dr. Zhenhua Dong is a technology expert and project manager at Huawei Noah’s Ark Lab. He received the B.Eng. degree from Tianjin University in 2006 and the Ph.D. degree from Nankai University in 2012. He leads a research team dedicated to advancing the field of recommender systems and causal inference.

References

  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, and et al. 2022. Flamingo: a Visual Language Model for Few-Shot Learning. In NeurIPS.
  • Ao et al. (2023) Xiang Ao, Ling Luo, Xiting Wang, Zhao Yang, Jiun-Hung Chen, Ying Qiao, Qing He, and Xing Xie. 2023. Put Your Voice on Stage: Personalized Headline Generation for News Articles. TKDD 18, 3 (2023).
  • Ao et al. (2021) Xiang Ao, Xiting Wang, Ling Luo, Ying Qiao, Qing He, and Xing Xie. 2021. PENS: A Dataset and Generic Framework for Personalized News Headline Generation. In Proceedings of ACL/IJCNLP. 82–92.
  • Baltescu et al. (2022) Paul Baltescu, Haoyu Chen, Nikil Pancha, Andrew Zhai, Jure Leskovec, and Charles Rosenberg. 2022. ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest. In The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). 2703–2711.
  • Cai et al. (2023) Pengshan Cai, Kaiqiang Song, Sangwoo Cho, Hongwei Wang, Xiaoyang Wang, Hong Yu, Fei Liu, and Dong Yu. 2023. Generating User-Engaging News Headlines. In Proceedings of ACL. 3265–3280.
  • Chen et al. (2021) Ke Chen, Beici Liang, Xiaoshuan Ma, and Minwei Gu. 2021. Learning Audio Embeddings with User Listening Data for Content-Based Music Recommendation. In ICASSP. 3015–3019.
  • Touvron et al. (2023) Hugo Touvron, et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR abs/2307.09288 (2023).
  • Gao et al. (2023) Yifan Gao, Jinpeng Lin, Min Zhou, Chuanbin Liu, Hongtao Xie, Tiezheng Ge, and Yuning Jiang. 2023. TextPainter: Multimodal Text Image Generation with Visual-harmony and Text-comprehension for Poster Design. In Proceedings of the 31st ACM International Conference on Multimedia (MM). 7236–7246.
  • Ge et al. (2018) Tiezheng Ge, Liqin Zhao, Guorui Zhou, Keyu Chen, Shuying Liu, Huiming Yi, Zelin Hu, Bochao Liu, Peng Sun, Haoyu Liu, Pengtao Yi, Sui Huang, Zhiqiang Zhang, Xiaoqiang Zhu, Yu Zhang, and Kun Gai. 2018. Image Matters: Visually Modeling User Behaviors Using Advanced Model Server. In CIKM. 2087–2095.
  • Geng et al. (2022) Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5). In RecSys. 299–315.
  • Geng et al. (2023) Shijie Geng, Juntao Tan, Shuchang Liu, Zuohui Fu, and Yongfeng Zhang. 2023. VIP5: Towards Multimodal Foundation Models for Recommendation. CoRR.
  • Girdhar et al. (2023) Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. ImageBind: One Embedding Space to Bind Them All. In CVPR. 15180–15190.
  • Gu et al. (2023) Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, and Philip H. S. Torr. 2023. A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models. CoRR abs/2307.12980 (2023).
  • Gu et al. (2020) Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, You Wu, Cong Yu, Daniel Finnie, Hongkun Yu, Jiaqi Zhai, and Nicholas Zukoski. 2020. Generating Representative Headlines for News Stories. In The Web Conference 2020 (WWW). 1773–1784.
  • He et al. (2018) Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. 2018. Recommendation Technologies for Multimedia Content. In Proceedings of ICMR. 8.
  • Hou et al. (2022) Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards Universal Sequence Representation Learning for Recommender Systems. In KDD. 585–593.
  • Hsu et al. (2023) HsiaoYuan Hsu, Xiangteng He, Yuxin Peng, Hao Kong, and Qing Zhang. 2023. PosterLayout: A New Benchmark and Approach for Content-Aware Visual-Textual Presentation Layout. In CVPR. 6018–6026.
  • Huang et al. (2020) Xinting Huang, Jianzhong Qi, Yu Sun, and Rui Zhang. 2020. MALA: Cross-Domain Dialogue Generation with Action Learning. In AAAI. 7977–7984.
  • Huang et al. (2021) Yanhua Huang, Weikun Wang, Lei Zhang, and Ruiwen Xu. 2021. Sliding Spectrum Decomposition for Diversified Recommendation. In KDD. 3041–3049.
  • Le et al. (2023) Trung-Hoang Le, Quoc-Tuan Truong, Aghiles Salah, and Hady W. Lauw. 2023. Multi-Modal Recommender Systems: Towards Addressing Sparsity, Comparability, and Explainability. In ACM Web Conference (WWW).
  • Li et al. (2023c) Cheng Li, Mingyang Zhang, Qiaozhu Mei, Weize Kong, and Michael Bendersky. 2023c. Automatic Prompt Rewriting for Personalized Text Generation. CoRR abs/2310.00152 (2023).
  • Li et al. (2023d) Cheng Li, Mingyang Zhang, Qiaozhu Mei, Yaqing Wang, Spurthi Amba Hombaiah, Yi Liang, and Michael Bendersky. 2023d. Teach LLMs to Personalize - An Approach inspired by Writing Education. CoRR abs/2308.07968 (2023).
  • Li et al. (2023a) Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023a. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In ICML, Vol. 202. 19730–19742.
  • Li et al. (2023b) Jiacheng Li, Ming Wang, Jin Li, Jinmiao Fu, Xin Shen, Jingbo Shang, and Julian J. McAuley. 2023b. Text Is All You Need: Learning Language Representations for Sequential Recommendation. In Proceedings of KDD. 1258–1267.
  • Li et al. (2022) Jian Li, Jieming Zhu, Qiwei Bi, Guohao Cai, Lifeng Shang, Zhenhua Dong, Xin Jiang, and Qun Liu. 2022. MINER: Multi-Interest Matching Network for News Recommendation. In Findings of ACL. 343–352.
  • Lialin et al. (2023) Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. CoRR abs/2303.15647.
  • Liang and Morency (2023) Paul Pu Liang and Louis-Philippe Morency. 2023. Tutorial on Multimodal Machine Learning: Principles, Challenges, and Open Questions. In International Conference on Multimodal Interaction (ICMI). 101–104.
  • Lin et al. (2023) Jinpeng Lin, Min Zhou, Ye Ma, Yifan Gao, Chenxi Fei, Yangjian Chen, Zhang Yu, and Tiezheng Ge. 2023. AutoPoster: A Highly Automatic and Content-aware Design System for Advertising Poster Generation. In ACM MM. 1250–1260.
  • Liu et al. (2020) Hu Liu, Jing Lu, Hao Yang, Xiwei Zhao, Sulong Xu, Hao Peng, Zehua Zhang, Wenjie Niu, Xiaokun Zhu, Yongjun Bao, and Weipeng Yan. 2020. Category-Specific CNN for Visual-aware CTR Prediction at JD.com. In The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). 2686–2696.
  • Liu et al. (2022) Qijiong Liu, Jieming Zhu, Quanyu Dai, and Xiaoming Wu. 2022. Boosting Deep CTR Prediction with a Plug-and-Play Pre-trainer for News Recommendation. In Proceedings of COLING. 2823–2833.
  • Liu et al. (2023) Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. 2023. Self-Supervised Learning: Generative or Contrastive. IEEE Trans. Knowl. Data Eng. 35, 1 (2023), 857–876.
  • Liu et al. (2021) Yong Liu, Susen Yang, Chenyi Lei, Guoxin Wang, Haihong Tang, Juyong Zhang, Aixin Sun, and Chunyan Miao. 2021. Pre-training Graph Transformer with Multimodal Side Information for Recommendation. In ACM MM. 2853–2861.
  • Long et al. (2023) Quanyu Long, Wenya Wang, and Sinno Jialin Pan. 2023. Adapt in Contexts: Retrieval-Augmented Domain Adaptation via In-Context Learning. In Proceedings of EMNLP. 6525–6542.
  • OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023).
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of ICML, Vol. 139. 8748–8763.
  • Ruiz et al. (2023) Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In CVPR. 22500–22510.
  • Salemi et al. (2023) Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. 2023. LaMP: When Large Language Models Meet Personalization. CoRR (2023).
  • Shen et al. (2012) Jialie Shen, Meng Wang, Shuicheng Yan, and Peng Cui. 2012. Multimedia Recommendation. In ACM Multimedia Conference (MM). 1535.
  • Shen et al. (2013) Jialie Shen, Meng Wang, Shuicheng Yan, and Peng Cui. 2013. Multimedia recommendation: technology and techniques. In SIGIR. 1131.
  • Shen et al. (2024) Xiaoteng Shen, Rui Zhang, Xiaoyan Zhao, Jieming Zhu, and Xi Xiao. 2024. PMG: Personalized Multimodal Generation with Large Language Models. In WWW.
  • Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. In Proceedings of CIKM. 1441–1450.
  • Truong et al. (2021) Quoc-Tuan Truong, Aghiles Salah, and Hady W. Lauw. 2021. Multi-Modal Recommender Systems: Hands-On Exploration. In RecSys. 834–837.
  • Wang et al. (2023) Jinpeng Wang, Ziyun Zeng, Yunxiao Wang, Yuting Wang, Xingyu Lu, Tianxiang Li, Jun Yuan, Rui Zhang, Hai-Tao Zheng, and Shu-Tao Xia. 2023. MISSRec: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation. In Proceedings of MM. 6548–6557.
  • Wei et al. (2023) Wei Wei, Chao Huang, Lianghao Xia, and Chuxu Zhang. 2023. Multi-Modal Self-Supervised Learning for Recommendation. In WWW. 790–800.
  • Wen et al. (2023) Zhoufutu Wen, Xinyu Zhao, Zhipeng Jin, Yi Yang, Wei Jia, Xiaodong Chen, Shuanglong Li, and Lin Liu. 2023. Enhancing Dynamic Image Advertising with Vision-Language Pre-training. In SIGIR. 3310–3314.
  • Wu et al. (2022) Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. 2022. UserBERT: Pre-training User Model with Contrastive Self-supervision. In SIGIR. 2087–2092.
  • Xiao et al. (2022) Fangxiong Xiao, Lixi Deng, Jingjing Chen, Houye Ji, Xiaorui Yang, Zhuoye Ding, and Bo Long. 2022. From Abstract to Details: A Generative Multimodal Fusion Framework for Recommendation. In ACM MM. 258–267.
  • Xun et al. (2021) Jiahao Xun, Shengyu Zhang, Zhou Zhao, Jieming Zhu, Qi Zhang, Jingjie Li, Xiuqiang He, Xiaofei He, Tat-Seng Chua, and Fei Wu. 2021. Why Do We Click: Visual Impression-aware News Recommendation. In ACM MM. 3881–3890.
  • Yang et al. (2023) Jianan Yang, Haobo Wang, Ruixuan Xiao, Sai Wu, Gang Chen, and Junbo Zhao. 2023. Controllable Textual Inversion for Personalized Text-to-Image Generation. CoRR abs/2304.05265 (2023).
  • Yang et al. (2022) Shiquan Yang, Rui Zhang, Sarah M. Erfani, and Jey Han Lau. 2022. An Interpretable Neuro-Symbolic Reasoning Framework for Task-Oriented Dialogue Generation. In ACL. 4918–4935.
  • Yao et al. (2022) Dong Yao, Zhou Zhao, Shengyu Zhang, Jieming Zhu, Yudong Zhu, Rui Zhang, and Xiuqiang He. 2022. Contrastive Learning with Positive-Negative Frame Mask for Music Representation. In The ACM Web Conference 2022 (WWW). 2906–2915.
  • Yao et al. (2024) Dong Yao, Jieming Zhu, Jiahao Xun, Shengyu Zhang, Zhou Zhao, Liqun Deng, Wenqiao Zhang, Zhenhua Dong, and Xin Jiang. 2024. MART: Learning Hierarchical Music Audio Representations with Part-Whole Transformer. In WWW.
  • Yao et al. (2021) Tiansheng Yao, Xinyang Yi, Derek Zhiyuan Cheng, Felix X. Yu, Ting Chen, Aditya Krishna Menon, Lichan Hong, Ed H. Chi, Steve Tjoa, Jieqi (Jay) Kang, and Evan Ettinger. 2021. Self-supervised Learning for Large-scale Item Recommendations. In CIKM. 4321–4330.
  • Yu et al. (2022) Tan Yu, Zhipeng Jin, Jie Liu, Yi Yang, Hongliang Fei, and Ping Li. 2022. Boost CTR Prediction for New Advertisements via Modeling Visual Content. In IEEE International Conference on Big Data (BigData). 2140–2149.
  • Yu et al. (2014) Yi Yu, Kiyoharu Aizawa, Toshihiko Yamasaki, and Roger Zimmermann. 2014. Emerging Topics on Personalized and Localized Multimedia Information Systems. In ACM MM. 1233–1234.
  • Yuan et al. (2020) Fajie Yuan, Xiangnan He, Alexandros Karatzoglou, and Liguang Zhang. 2020. Parameter-Efficient Transfer from Sequential Behaviors for User Modeling and Recommendation. In Proceedings of SIGIR. 1469–1478.
  • Zhang et al. (2023) Lingzi Zhang, Xin Zhou, and Zhiqi Shen. 2023. Multimodal Pre-training Framework for Sequential Recommendation via Contrastive Learning. CoRR (2023).
  • Zhang et al. (2021) Qi Zhang, Jingjie Li, Qinglin Jia, Chuyuan Wang, Jieming Zhu, Zhaowei Wang, and Xiuqiang He. 2021. UNBERT: User-News Matching BERT for News Recommendation. In Proceedings of IJCAI. 3356–3362.
  • Zhou et al. (2020) Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization. In CIKM. 1893–1902.
  • Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. CoRR abs/2304.10592 (2023).
  • Zhu et al. (2021) Hongyuan Zhu, Ye Niu, Di Fu, and Hao Wang. 2021. MusicBERT: A Self-supervised Learning of Music Representation. In ACM MM. 3955–3963.
  • Zhu et al. (2022) Jieming Zhu, Quanyu Dai, Liangcai Su, Rong Ma, Jinyang Liu, Guohao Cai, Xi Xiao, and Rui Zhang. 2022. BARS: Towards Open Benchmarking for Recommender Systems. In SIGIR. 2912–2923.