
Knowledge Enhanced Multi-intent Transformer Network for Recommendation

Ding Zou (CCIIP Lab, School of Computer Science and Technology, Huazhong University of Science and Technology; Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Wuhan, China; Taotian Group, Hangzhou, China), m202173662@hust.edu.cn
Wei Wei (CCIIP Lab, School of Computer Science and Technology, Huazhong University of Science and Technology; Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Wuhan, China), weiw@hust.edu.cn
Feida Zhu (Singapore Management University, Singapore, Singapore), fdzhu@smu.edu.sg
Chuanyu Xu (Taotian Group, Hangzhou, China), tracy.xcy@taobao.com
Tao Zhang (Taotian Group, Hangzhou, China), guyan.zt@taobao.com
Chengfu Huo (Taotian Group, Hangzhou, China), chengfu.huocf@taobao.com
(2024)
Abstract.

Incorporating Knowledge Graphs (KGs) into recommendation has attracted growing attention in industry, owing to the great potential of KGs to provide abundant supplementary information and interpretability for the underlying models. However, simply integrating a KG into recommendation usually yields negative feedback in industry, mainly because two factors are ignored: i) users' multiple intents, which involve diverse nodes in the KG. For example, in e-commerce scenarios, users may exhibit preferences for specific styles, brands, or colors. ii) knowledge noise, a prevalent issue in Knowledge Enhanced Recommendation (KGR) that is even more severe in industrial scenarios. Irrelevant knowledge properties of items may result in inferior model performance compared to approaches that do not incorporate knowledge at all. To tackle these challenges, we propose a novel approach named Knowledge Enhanced Multi-intent Transformer Network for Recommendation (KGTN), comprising two primary modules: Global Intents Modeling with Graph Transformer, and Knowledge Contrastive Denoising under Intents. Specifically, Global Intents Modeling with Graph Transformer captures learnable user intents by incorporating global signals from user-item-relation-entity interactions with a well-designed graph transformer, while learning intent-aware user/item representations. Knowledge Contrastive Denoising under Intents is dedicated to learning precise and robust representations: it leverages the intent-aware user/item representations to sample relevant knowledge, and then applies a local-global contrastive mechanism to enhance noise-robust representation learning. Extensive experiments on three benchmark datasets show the superior performance of our proposed method over state-of-the-art baselines, and online A/B testing on Alibaba's large-scale industrial recommendation platform further confirms the real-world effectiveness of KGTN. The implementations are available at: https://github.com/CCIIPLab/KGTN.

Knowledge Enhanced Recommendation, Graph Transformer, Graph Neural Networks
journalyear: 2024
copyright: acmlicensed
conference: Companion Proceedings of the ACM Web Conference 2024; May 13–17, 2024; Singapore, Singapore
booktitle: Companion Proceedings of the ACM Web Conference 2024 (WWW '24 Companion), May 13–17, 2024, Singapore, Singapore
isbn: 979-8-4007-0172-6/24/05
doi: 10.1145/3589335.3648296
ccs: Information systems → Recommender systems

1. Introduction

Figure 1. (a) A simple case illustrating multiple user intents with global information; (b) Performance comparison.

Knowledge graphs (KGs) have emerged as a promising means to enhance the accuracy and interpretability of recommender systems in both academic and industrial scenarios. By incorporating entities and relations, KGs provide a rich source of information for user/item representation learning, which not only captures diverse relationships among items (such as a shared brand) but also allows user preferences to be interpreted (such as attributing a user's choice of a clothing item to its fashionable style).

In an effort to effectively integrate item-side KG information into recommendation, considerable research has been devoted to Knowledge Enhanced Recommendation (aka. KGR). Early studies (zhang2016collaborative; huang2018improving; wang2018dkn) directly integrate knowledge graph embeddings into item representations to enhance them. Subsequent studies (hu2018leveraging; shi2018heterogeneous; wang2019explainable) enrich the interactions via meta-paths that capture relevant connectivities between users and items over the KG: they either select prominent paths (sun2018recurrent) or represent the interactions with multi-hop paths from users to items (hu2018leveraging; wang2019explainable). Nevertheless, most of these methods rely heavily on manually designed meta-paths, which are hard to optimize in practice. Later methods therefore embraced Graph Neural Networks (GNNs) (wang2021learning; wang2019kgat) to automatically aggregate high-order information over the KG; they iteratively integrate multi-hop neighbors into representations and have demonstrated promising performance for recommendation. Most recently, Contrastive Learning (CL) has been incorporated into KGR to address noisy knowledge and long-tail problems (yang2022knowledge; zou2022multi; wang2024exploring), by contrasting the user-item (collaborative) and item-entity (knowledge) graphs.

However, current KGR methods often perform poorly in large-scale industrial scenarios because they commonly overlook two crucial factors: 1) Users' multiple intents underlying interaction behavior. For instance, as depicted in Figure 1(a), users may have diverse intentions when shopping on Alibaba's e-commerce platform, such as long-term interest, passing time, or social reasons. 2) Redundant knowledge information. In the context of user intents, some knowledge facts in the KG may be irrelevant noise (chen2022attentive), which can disrupt the learning of user/item representations. As shown in Figure 1(b), incorporating KGs may even result in worse performance than models without KG utilization (see Section 4.2 for details).

Still, it is non-trivial to model user intents in KGR, since user intents may be composed of multiple kinds of heterogeneous information, including items, relations, and entities. Previous multi-intent modeling methods usually define intents as a linear combination of either interacted items (wang2020disentangled) or the entire relation set (wang2021learning), then update the intent representations through local aggregation in the user-intent-item heterogeneous graph. However, such a multi-intent learning paradigm may not fully meet the requirements of KGR, as it neglects global information when defining and learning intents. To illustrate this, consider the example in Figure 1(a): user $u_1$ may purchase item $i_1$ with the intent $c_1$ of long-term interest, focusing on clothing style (e.g., whether it is fashionable), which means intent $c_1$ is associated with KG relation $r_1$ and entity $e_1$; meanwhile, $u_1$ may buy item $i_n$ with the intent $c_k$ of a social reason (e.g., friend $u_2$'s recommendation), which means intent $c_k$ is associated with user $u_2$ and item $i_k$.

In this paper, we focus on modeling the user intents behind interaction behaviors with global collaborative (user-item) and knowledge (item-relation-entity) information, and on exploiting these modeled intents to guide knowledge sampling, thereby facilitating fine-grained and accurate user/item representation learning. We propose a novel model, KGTN, which comprises two essential components addressing the foregoing limitations: i) Global Intents Modeling with Graph Transformer. We predefine $K$ intent representations for users/items, then learn these intents with global information from the collaborative and knowledge graphs. Specifically, KGTN first merges knowledge information into items, then applies a novel graph transformer on the user-item graph to learn global intents and generate intent-aware user/item representations. ii) Knowledge Contrastive Denoising under Intents. KGTN first exploits the intent-aware user/item representations to guide knowledge sampling, effectively pruning irrelevant knowledge; a novel local-global contrastive mechanism is then applied to denoise the user/item representations. Empirically, KGTN outperforms state-of-the-art models on three benchmark datasets in offline testing, and achieves significant improvements in online A/B testing.

Our contributions of this work can be summarized as follows:

  • General Aspects: We emphasize the importance of intent modeling with global information, which plays a crucial role in fine-grained representation learning and knowledge denoising.

  • Novel Methodologies: We propose a novel model KGTN, which models user intents from global signals with a novel graph transformer; and denoises item representations with i) knowledge denoising under intents, and ii) local-global graph contrastive learning.

  • Multifaceted Experiments: We conduct extensive offline experiments on three benchmark datasets and online A/B testing on Alibaba recommendation platform. The results demonstrate the advantages of our KGTN in better representation learning.

2. Problem Formulation

In this section, we begin by formulating the structural data of CF (user-item interactions) and KG (item-relation-entity knowledge) in KGR, then present the problem statement.

Interaction Data. In a typical recommendation scenario, let $\mathcal{U}=\{u_1, u_2, \ldots, u_M\}$ be a set of $M$ users and $\mathcal{V}=\{v_1, v_2, \ldots, v_N\}$ a set of $N$ items. Let $\mathbf{Y} \in \mathbb{R}^{M \times N}$ be the user-item interaction matrix, where $y_{uv}=1$ indicates that user $u$ engaged with item $v$ (e.g., clicked or purchased it), and $y_{uv}=0$ otherwise.

Knowledge Graph. A KG stores abundant real-world facts associated with items, such as item attributes or external commonsense knowledge, in the form of a heterogeneous graph (shi2018heterogeneous). Let $\mathcal{G}=\{(h,r,t) \mid h,t \in \mathcal{E}, r \in \mathcal{R}\}$ be the KG, where $h$, $r$, $t$ denote the head, relation, and tail of a knowledge triple, and $\mathcal{E}$ and $\mathcal{R}$ denote the sets of entities and relations in $\mathcal{G}$. In many recommendation scenarios, an item $v \in \mathcal{V}$ corresponds to one entity $e \in \mathcal{E}$, so we establish a set of item-entity alignments $\mathcal{A}=\{(v,e) \mid v \in \mathcal{V}, e \in \mathcal{E}\}$, where $(v,e)$ indicates that item $v$ is aligned with entity $e$ in the KG. With these alignments, the KG can profile items and offer complementary information to the interaction data.

Problem Statement. Given the user-item interaction matrix 𝐘\mathbf{Y} and the KG 𝒢\mathcal{G}, KGR aims to learn a function that can predict how likely a user would adopt an item.

3. Methodology

Figure 2. Overall framework of the proposed KGTN model. Best viewed in color.

We now present the proposed Knowledge Enhanced Multi-intent Transformer Network for Recommendation (KGTN), which models user intents with global information and exploits those intents to denoise the KG for accurate and robust user/item representation learning. Figure 2 displays the framework of KGTN, which consists of two key components: 1) Global Intents Modeling with Graph Transformer. KGTN first defines a set of $K$ learnable global intents for users and items; it then models these intents and learns intent-aware user/item representations by integrating global signals with a graph transformer on the user-item graph, into whose items the knowledge information has been encoded. 2) Knowledge Contrastive Denoising under Intents. This component exploits the learned intent-aware user/item representations to sample intent-relevant knowledge, then designs a contrastive self-supervised task between local and global aggregation features within the sampled graph to facilitate robust representation learning.

3.1. Global Intents Modeling with Graph Transformer

3.1.1. Intent Initialization with Global Signals

When interacting with items, users often have diverse intents, such as preferences for specific clothing brands and styles, a friend's recommendation, or passing time by randomly clicking (wang2021learning; ren2023disentangled). To capture these diverse intents, we assume $K$ different intents $c_u$ and $c_v$ on the user and item sides, respectively, where the item-side intents can also be understood as the theme or context of the item; for example, a user who intends to purchase a fashionable dress may prefer clothes with a "young" theme. Our predictive objective of user-item preference can be written as:

(1)  \int_{c_u}\!\int_{c_v} P(y, c_u, c_v \mid u, v)\, dc_v\, dc_u = \sum_{k=1}^{K} P(y, c_u^{k}, c_v^{k} \mid u, v).

Specifically, we define $K$ global intent prototypes $\{\mathbf{c}_u^{k} \in \mathbb{R}^{d}\}_{k=1}^{K}$ and $\{\mathbf{c}_v^{k} \in \mathbb{R}^{d}\}_{k=1}^{K}$ for users and items, respectively. With these predefined intent prototypes, we integrate them into user/item representations and update them with the related global signals.

3.1.2. Intent Modeling with Graph Transformer

To accurately model user intents with global information and learn intent-aware user/item representations, we perform intent-aware information propagation with these learnable intents. Specifically, intent-aware user/item embeddings are acquired as an attentive sum of the intent prototypes, and the user/item embeddings of each layer are updated by aggregating global user/item/relation/entity signals.

Formally, we obtain intent-aware user/item representations at the $l$-th embedding layer by aggregating information across the $K$ learnable intent prototypes (including $\mathbf{c}_u$ and $\mathbf{c}_v$), as follows:

(2)  \mathbf{e}_u^{l} = \sum_{k=1}^{K} \mathbf{c}_u^{k}\, P(\mathbf{c}_u^{k} \mid \mathbf{e}_u^{l}),
(3)  P(\mathbf{c}_u^{k} \mid \mathbf{e}_u^{l}) = \frac{\eta(\mathbf{e}_u^{(l-1)\top} \mathbf{c}_u^{k})}{\sum_{k'=1}^{K} \eta(\mathbf{e}_u^{(l-1)\top} \mathbf{c}_u^{k'})},

where $P(\mathbf{c}_u^{k} \mid \mathbf{e}_u^{l})$ denotes the importance score of prototype $\mathbf{c}_u^{k}$ for the $l$-th-layer user embedding, which encodes the global signals; similarly, $P(\mathbf{c}_v^{k} \mid \mathbf{e}_v^{l})$ denotes the importance score of $\mathbf{c}_v^{k}$ for the $l$-th-layer item embedding.
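To make this attentive intent aggregation concrete, the following PyTorch sketch implements Equations (2)-(3), assuming $\eta$ is an exponential kernel so the normalization reduces to a softmax; the module and variable names are illustrative, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntentAggregation(nn.Module):
    """Attentive sum over K learnable intent prototypes (Eqs. 2-3)."""

    def __init__(self, num_intents: int, dim: int):
        super().__init__()
        # K global intent prototypes, one set per node type (user or item).
        self.prototypes = nn.Parameter(torch.empty(num_intents, dim))
        nn.init.xavier_uniform_(self.prototypes)

    def forward(self, e_prev: torch.Tensor) -> torch.Tensor:
        # e_prev: (batch, d) user/item embeddings from the previous layer.
        # Importance score of each prototype per node: (batch, K).
        scores = F.softmax(e_prev @ self.prototypes.t(), dim=-1)
        # Intent-aware embedding as an attentive sum of prototypes: (batch, d).
        return scores @ self.prototypes
```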

To compute the $l$-th-layer user/item embeddings, we adopt a two-step process that encodes the global user/item/relation/entity information of the whole heterogeneous graph. The first step merges the knowledge information (both relations and entities) into item embeddings via a relation-aware graph aggregation, making the item representations more comprehensive and informative. It injects the relational context into the embeddings of neighboring entities and weights them with knowledge rationale scores (note that items are a subset of knowledge entities), as follows:

(4)  \mathbf{e}_i^{(l+1)} = \frac{1}{|\mathcal{N}_i|} \sum_{(r,v)\in\mathcal{N}_i} \beta(i,r,v)\, \mathbf{e}_r \odot \mathbf{e}_v^{(l)},
     \beta(i,r,v) = \operatorname{softmax}\!\big((\mathbf{e}_i \,\|\, \mathbf{e}_r)^{\top} (\mathbf{e}_v \,\|\, \mathbf{e}_r)\big) = \frac{\exp\big((\mathbf{e}_i \,\|\, \mathbf{e}_r)^{\top} (\mathbf{e}_v \,\|\, \mathbf{e}_r)\big)}{\sum_{(v',r)\in\mathcal{N}_i} \exp\big((\mathbf{e}_i \,\|\, \mathbf{e}_r)^{\top} (\mathbf{e}_{v'} \,\|\, \mathbf{e}_r)\big)},

where $\|$ denotes the concatenation operation and $\mathcal{N}_i$ denotes the set of neighboring (relation, entity) pairs of item $i$.
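The relation-aware aggregation of Equation (4) can be sketched as follows; this is a minimal dense version that loops over items for readability (a production version would batch it with scatter operations), and all function and variable names are hypothetical:

```python
import torch

def relation_aware_aggregate(e_ent, e_rel, triples, num_items):
    """Relation-aware KG aggregation of Eq. (4): a minimal sketch.

    e_ent: (num_entities, d) entity embeddings (items occupy the first
    num_items rows); e_rel: (num_relations, d) relation embeddings;
    triples: iterable of (item, relation, entity) index tuples.
    """
    out = e_ent[:num_items].clone()
    by_item = {}
    for i, r, v in triples:
        by_item.setdefault(i, []).append((r, v))
    for i, neigh in by_item.items():
        # Rationale score: dot product of (head || relation) and (tail || relation).
        q = torch.stack([torch.cat([e_ent[i], e_rel[r]]) for r, _ in neigh])
        k = torch.stack([torch.cat([e_ent[v], e_rel[r]]) for r, v in neigh])
        beta = torch.softmax((q * k).sum(-1), dim=0)            # (|N_i|,)
        # Relation-gated messages e_r ⊙ e_v, weighted then averaged (Eq. 4).
        msgs = torch.stack([e_rel[r] * e_ent[v] for r, v in neigh])
        out[i] = (beta.unsqueeze(-1) * msgs).mean(0)
    return out
```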

The second step applies a novel graph transformer over the user-item graph, encoding global user/item/entity information into user/item representations. In this way, the user/item representations of each layer integrate global signals, which are then exploited for intent modeling and representation updating, as follows:

(5)  \mathbf{e}_u^{l+1} = \sum_{i} \Big\Vert_{h=1}^{H} m_{u,i}\, \beta_{u,i}^{h}\, \mathbf{W}_{V}^{h} \mathbf{e}_i^{l}, \qquad m_{u,i} = \begin{cases} 1 & \text{if } (u,i) \in \mathbf{Y} \\ 0 & \text{otherwise} \end{cases}
     \beta_{u,i}^{h} = \frac{\exp \bar{\beta}_{u,i}^{h}}{\sum_{i'} \exp \bar{\beta}_{u,i'}^{h}}, \qquad \bar{\beta}_{u,i}^{h} = \frac{(\mathbf{W}_{Q}^{h} \mathbf{e}_u^{l})^{\top} (\mathbf{W}_{K}^{h} \mathbf{e}_i^{l})}{\sqrt{d/H}},

where $H$ denotes the number of attention heads (indexed by $h$), $m_{u,i}$ is the binary indicator deciding whether to compute attention between user $u$ and item $i$, $\beta_{u,i}^{h}$ denotes the attention weight of the interaction pair $(u,i)$ in the $h$-th head representation space, and $\mathbf{W}_{Q}^{h}, \mathbf{W}_{K}^{h}, \mathbf{W}_{V}^{h} \in \mathbb{R}^{d/H \times d}$ denote the query, key, and value projections of the $h$-th head, respectively.
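A minimal sketch of the masked multi-head attention in Equation (5) is given below, using a dense interaction matrix for clarity; the per-head projections are packed into full-width linear layers (mathematically equivalent), and the module name is illustrative:

```python
import torch
import torch.nn as nn

class UserItemGraphTransformer(nn.Module):
    """Masked multi-head attention over the user-item graph (Eq. 5)."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        assert dim % heads == 0
        self.h, self.dk = heads, dim // heads
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)

    def forward(self, e_u, e_i, y):
        # e_u: (M, d) users, e_i: (N, d) items, y: (M, N) binary interactions.
        M, N = y.shape
        q = self.W_q(e_u).view(M, self.h, self.dk)
        k = self.W_k(e_i).view(N, self.h, self.dk)
        v = self.W_v(e_i).view(N, self.h, self.dk)
        # Scaled dot-product scores per head: (M, h, N).
        att = torch.einsum('mhd,nhd->mhn', q, k) / self.dk ** 0.5
        # m_{u,i}: only items the user interacted with receive attention.
        att = att.masked_fill(y.unsqueeze(1) == 0, float('-inf'))
        att = att.softmax(dim=-1).nan_to_num(0.0)  # users w/o items -> zeros
        out = torch.einsum('mhn,nhd->mhd', att, v)
        return out.reshape(M, -1)                  # concat heads: (M, d)
```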

By integrating global information into user/item representations, we can learn intent-aware representations and update the learnable intents according to Equation (2).

3.2. Knowledge Contrastive Denoising under Intents

Intuitively, noisy or irrelevant connections between entities in the KG can lead to suboptimal representation learning, which runs counter to the original purpose of introducing the KG. To eliminate the noise and distill informative signals that benefit the recommendation task, we propose to highlight the connections consistent with user intents while removing the irrelevant ones.

3.2.1. Knowledge Sampling under Intents

Given the intent-aware user/item representations, we denoise the item-entity graph by removing irrelevant edges and nodes while sampling the important ones. We first exploit the intent-aware representations to calculate the importance scores of knowledge triplets (i.e., item-relation-entity pairs) in the same way as Equation (4), then add Gumbel noise (jang2017categorical) to the learned scores to improve sampling robustness, as follows:

(6)  \beta(i,r,v) = \operatorname{softmax}\!\big((\mathbf{e}_i \,\|\, \mathbf{e}_r)^{\top} (\mathbf{e}_v \,\|\, \mathbf{e}_r)\big),
     \beta(i,r,v) \leftarrow \beta(i,r,v) - \log\big(-\log(\epsilon)\big), \qquad \epsilon \sim \mathrm{Uniform}(0,1),

where $\epsilon$ is a random variable sampled from a uniform distribution. A top-$k$ sampling strategy then generates the new item-entity graph, removing the irrelevant edges and nodes:

(7)  \widehat{\beta}(i,r,v) = \begin{cases} \beta(i,r,v), & \beta(i,r,v) \in \text{top-}k\big(\beta(i,r,v)\big), \\ 0, & \text{otherwise}, \end{cases}

where $\widehat{\beta}(i,r,v)$ identifies the sampled triples of the item-entity graph, which replace the original graph structure in the subsequent user/item representation learning.
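The Gumbel-perturbed top-k selection of Equations (6)-(7) can be sketched as follows, assuming the scores of one item's candidate triples are given as a vector; the function name is hypothetical:

```python
import torch

def gumbel_topk_sample(beta: torch.Tensor, k: int) -> torch.Tensor:
    """Perturb triple scores with Gumbel noise and keep the top-k (Eqs. 6-7).

    beta: (T,) importance scores of the T candidate (item, relation,
    entity) triples of a single item. A minimal sketch.
    """
    eps = torch.rand_like(beta).clamp_min(1e-10)
    noisy = beta - torch.log(-torch.log(eps))        # add Gumbel(0, 1) noise
    keep = torch.topk(noisy, min(k, beta.numel())).indices
    mask = torch.zeros_like(beta)
    mask[keep] = 1.0
    return beta * mask                                # pruned triples zeroed out
```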

3.2.2. Local-Global Knowledge Contrastive Learning

With the sampled item-entity graph, we iteratively update the intent-aware representations on it. Inspired by previous contrastive learning based methods that denoise by aligning item representations from the KG and CF views, we further propose a local-global contrastive mechanism to improve the robustness of representation learning.

Specifically, we perform light information aggregation over the user-item graph and the sampled item-entity graph, taking the intent-aware user/item representations $\mathbf{e}_u, \mathbf{e}_i$ as the inputs $\mathbf{z}_u^{(0)}, \mathbf{z}_i^{(0)}$, to acquire robust and effective intent-aware user/item representations:

(8)  \mathbf{z}_i^{(l+1)} = \frac{1}{|\mathcal{N}_i|} \sum_{(r,v)\in\mathcal{N}_i} \mathbf{e}_r \odot \mathbf{z}_v^{(l)}, \qquad \mathbf{z}_u^{(l+1)} = \frac{1}{|\mathcal{N}_u|} \sum_{i\in\mathcal{N}_u} \mathbf{z}_i^{(l)},

where $\mathbf{z}_u^{(0)}, \mathbf{z}_i^{(0)}$ memorize the global signals; we thereby obtain the user/item representations $\mathbf{z}_u^{(l)}, \mathbf{z}_i^{(l)}$ for layers $l = 1, \ldots, L$.
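One layer of this light aggregation (Equation (8)) can be sketched as below; adjacency is kept in plain Python dictionaries for readability, and all names are illustrative:

```python
import torch

def light_layer(z_u, z_ent, e_rel, user_items, item_triples):
    """One light aggregation layer on the sampled graph (Eq. 8): a sketch.

    z_u: (num_users, d) user embeddings; z_ent: (num_entities, d) entity
    embeddings (items are a subset of entities); e_rel: (num_relations, d)
    relation embeddings. user_items maps a user to its item ids;
    item_triples maps an item to its (relation, entity) id pairs.
    """
    z_ent_new = z_ent.clone()
    for i, neigh in item_triples.items():
        # z_i^{(l+1)} = (1/|N_i|) sum_{(r,v)} e_r ⊙ z_v^{(l)}
        msgs = torch.stack([e_rel[r] * z_ent[v] for r, v in neigh])
        z_ent_new[i] = msgs.mean(0)
    z_u_new = z_u.clone()
    for u, items in user_items.items():
        # z_u^{(l+1)} = (1/|N_u|) sum_{i in N_u} z_i^{(l)}
        z_u_new[u] = z_ent[items].mean(0)
    return z_u_new, z_ent_new
```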

Besides the supervised user/item representation learning, we perform contrastive learning between node embeddings that encode global signals and those that encode local signals, which differs from traditional CL-based methods that contrast the CF and KG parts. We aggregate information on the sampled graph with the initial user/item representations $\mathbf{e}_u^{(0)}, \mathbf{e}_i^{(0)}$ to acquire the local results $\mathbf{z}_{u,local}^{(l)}, \mathbf{z}_{i,local}^{(l)}$, and with the intent-aware user/item representations $\mathbf{e}_u, \mathbf{e}_i$, which contain global signals, to acquire the global results $\mathbf{z}_u^{(l)}, \mathbf{z}_i^{(l)}$. We then perform layer-wise contrastive learning between the local and global results.

The local aggregation layer embeddings $\mathbf{z}_{u,local}^{(l)}, \mathbf{z}_{i,local}^{(l)}$ and the global aggregation layer embeddings $\mathbf{z}_u^{(l)}, \mathbf{z}_i^{(l)}$ are contrasted layer by layer. Each positive pair consists of the embeddings of the same user (item) from the local view and the global view at the same layer, while other nodes form the negative pairs. The contrastive loss of users is:

(9)  \mathcal{L}_c^{u} = \frac{1}{L} \sum_{l=0}^{L} -\log \frac{\exp\big(s(\mathbf{z}_u^{l}, \mathbf{z}_{u,local}^{l})/\tau\big)}{\sum_{k \neq u} \exp\big(s(\mathbf{z}_u^{l}, \mathbf{z}_k^{l})/\tau\big) + \sum_{k \neq u} \exp\big(s(\mathbf{z}_u^{l}, \mathbf{z}_{k,local}^{l})/\tau\big)},

where $s(\cdot)$ denotes cosine similarity and $\tau$ is a temperature parameter. The item-side contrastive loss $\mathcal{L}_c^{i}$ is obtained analogously, and summing the two yields the total local-global contrastive loss $\mathcal{L}_c$.
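A sketch of this loss for a single layer follows; note that, as written in Equation (9), the positive pair itself is excluded from the denominator because both sums run over $k \neq u$. Function and variable names are assumptions:

```python
import torch
import torch.nn.functional as F

def local_global_contrast(z_global, z_local, tau=0.2):
    """Layer-wise local-global InfoNCE loss for one layer (Eq. 9).

    z_global, z_local: (B, d) embeddings of the same B users (or items)
    from the global and local aggregation views; tau is the temperature.
    """
    zg = F.normalize(z_global, dim=-1)
    zl = F.normalize(z_local, dim=-1)
    pos = (zg * zl).sum(-1) / tau                   # s(z_u, z_{u,local})
    gg = zg @ zg.t() / tau                          # global-global pairs
    gl = zg @ zl.t() / tau                          # global-local pairs
    B = zg.size(0)
    off = ~torch.eye(B, dtype=torch.bool)           # keep only k != u
    denom = torch.cat([gg[off].view(B, -1), gl[off].view(B, -1)], dim=1)
    # -log( exp(pos) / sum_neg exp ), averaged over the batch.
    loss = -(pos - torch.logsumexp(denom, dim=1))
    return loss.mean()
```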

3.3. Model Prediction

After learning intent-aware user/item representations with global signals and contrasting local against global information, we have multi-layer intent-aware representations for each user/item. Summing the representations of all layers gives the final user/item representations, whose matching score is predicted via the inner product:

(10)  \mathbf{z}_u = \mathbf{z}_u^{(0)} + \cdots + \mathbf{z}_u^{(L)}, \qquad \mathbf{z}_i = \mathbf{z}_i^{(0)} + \cdots + \mathbf{z}_i^{(L)}, \qquad \hat{y}(u,i) = \mathbf{z}_u^{\top} \mathbf{z}_i.

We adopt the BPR loss (rendle2012bpr) to reconstruct the historical data, encouraging the prediction scores of a user's historical items to be higher than those of unobserved items:

(11)  \mathcal{L}_{\mathrm{BPR}} = \sum_{(u,i,j)\in O} -\ln \sigma\big(\hat{y}_{ui} - \hat{y}_{uj}\big),

where $O = \{(u,i,j) \mid (u,i) \in O^{+}, (u,j) \in O^{-}\}$ is the training set consisting of the observed interactions $O^{+}$ and unobserved counterparts $O^{-}$, and $\sigma$ is the sigmoid function.
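A minimal sketch of this pairwise objective, assuming one sampled negative item per observed interaction; names are illustrative:

```python
import torch
import torch.nn.functional as F

def bpr_loss(z_u, z_i_pos, z_i_neg):
    """BPR pairwise ranking loss (Eq. 11): a minimal sketch.

    z_u, z_i_pos, z_i_neg: (B, d) final embeddings of users, their
    observed items, and sampled unobserved items.
    """
    pos = (z_u * z_i_pos).sum(-1)        # y_hat(u, i): inner product
    neg = (z_u * z_i_neg).sum(-1)        # y_hat(u, j)
    return -F.logsigmoid(pos - neg).mean()
```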

3.4. Multi-task Training

To combine the recommendation task with the self-supervised task, we optimize the whole model with a multi-task training strategy, combining the local-global contrastive loss with the BPR loss and learning the model parameters by minimizing the following objective:

(12)  \mathcal{L}_{\mathrm{KGTN}} = \mathcal{L}_{\mathrm{BPR}} + \alpha \mathcal{L}_c + \lambda \|\Theta\|_2^2,

where $\Theta$ is the set of model parameters, $\alpha$ is a hyperparameter controlling the weight of the local-global contrastive loss, and $\lambda$ controls the $L_2$ regularization term.
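For completeness, a sketch of the combined objective; in practice the $L_2$ term is often delegated to the optimizer's weight_decay, but it is written out explicitly here for fidelity to Equation (12). Names are illustrative:

```python
import torch

def kgtn_objective(l_bpr, l_c, alpha, params, lam):
    """Multi-task objective of Eq. (12): a minimal sketch.

    l_bpr, l_c: scalar BPR and contrastive losses; alpha weights the
    contrastive term; lam weights the L2 penalty over `params`.
    """
    l2 = sum(p.pow(2).sum() for p in params)
    return l_bpr + alpha * l_c + lam * l2
```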

4. Experiment

                         Book-Crossing   MovieLens-1M   Last.FM
User-Item Interaction
  # users                17,860          6,036          1,872
  # items                14,967          2,445          3,846
  # interactions         139,746         753,772        42,346
Knowledge Graph
  # entities             77,903          182,011        9,366
  # relations            25              12             60
  # triplets             151,500         1,241,996      15,518

Table 1. Statistics of the three datasets.
Model       Book-Crossing                          MovieLens-1M                           Last.FM
            AUC               F1                   AUC               F1                   AUC               F1
BPRMF       0.6583 (-13.18%)  0.6117 (-7.59%)      0.8920 (-4.52%)   0.7921 (-7.21%)      0.7563 (-13.41%)  0.7010 (-9.95%)
CKE         0.6759 (-11.42%)  0.6235 (-6.41%)      0.9065 (-3.07%)   0.8024 (-6.18%)      0.7471 (-14.33%)  0.6740 (-12.65%)
RippleNet   0.7211 (-6.90%)   0.6472 (-4.04%)      0.9190 (-1.82%)   0.8422 (-2.20%)      0.7762 (-11.42%)  0.7025 (-9.80%)
PER         0.6048 (-18.53%)  0.5726 (-11.50%)     0.7124 (-22.48%)  0.6670 (-19.72%)     0.6414 (-24.90%)  0.6033 (-19.72%)
KGCN        0.6841 (-10.60%)  0.6313 (-5.63%)      0.9090 (-2.82%)   0.8366 (-2.76%)      0.8027 (-8.77%)   0.7086 (-9.19%)
KGNN-LS     0.6762 (-11.39%)  0.6314 (-5.62%)      0.9140 (-2.32%)   0.8410 (-2.32%)      0.8052 (-8.52%)   0.7224 (-7.81%)
KGAT        0.7314 (-5.87%)   0.6544 (-3.32%)      0.9140 (-2.32%)   0.8440 (-2.02%)      0.8293 (-6.11%)   0.7424 (-5.81%)
CKAN        0.7420 (-4.81%)   0.6671 (-2.05%)      0.9082 (-2.90%)   0.8410 (-2.32%)      0.8418 (-4.86%)   0.7592 (-4.13%)
KGIN        0.7273 (-6.28%)   0.6614 (-2.62%)      0.9190 (-1.82%)   0.8441 (-2.01%)      0.8486 (-4.18%)   0.7602 (-4.03%)
CG-KGR      0.7498 (-4.03%)   0.6689 (-1.87%)      0.9110 (-2.62%)   0.8359 (-2.83%)      0.8336 (-5.68%)   0.7433 (-5.72%)
KGCL        0.7453 (-4.48%)   0.6679 (-1.97%)      0.9184 (-1.88%)   0.8437 (-2.05%)      0.8455 (-4.49%)   0.7596 (-4.00%)
MCCLK       0.7625 (-2.76%)   0.6777 (-0.99%)      0.9252 (-1.20%)   0.8559 (-0.83%)      0.8663 (-2.41%)   0.7753 (-2.43%)
KGTN        0.7901*           0.6876*              0.9372*           0.8642*              0.8904*           0.7996*

Table 2. AUC and F1 results in CTR prediction. Percentages in parentheses give each baseline's gap to KGTN (best results in the last row; MCCLK is the second best throughout). * denotes a statistically significant improvement by an unpaired two-sample t-test with p < 0.001.
Model       Book-Crossing       MovieLens-1M        Last.FM
            R@10      R@20      R@10      R@20      R@10      R@20
BPRMF       0.0334    0.0525    0.0939    0.1512    0.0923    0.1740
CKE         0.0421    0.0562    0.0867    0.1364    0.0780    0.1532
RippleNet   0.0507    0.0622    0.1082    0.1766    0.0942    0.1520
PER         0.0322    0.0481    0.0523    0.1204    0.0540    0.1167
KGCN        0.0496    0.0540    0.0965    0.1720    0.1416    0.1776
KGNN-LS     0.0422    0.0526    0.1286    0.1757    0.1312    0.1933
KGAT        0.0522    0.0670    0.1468    0.2296    0.1640    0.2313
CKAN        0.0462    0.0566    0.1511    0.2400    0.1412    0.2465
KGIN        0.0555    0.0699    0.1511    0.2404    0.1758    0.2487
CG-KGR      0.0612    0.0781    0.1621    0.2495    0.1578    0.2106
KGCL        0.0679    0.0845    0.1633    0.2499    0.1759    0.2471
MCCLK       0.0769    0.0936    0.1642    0.2503    0.1835    0.2598
KGTN        0.1060*   0.1275*   0.1841*   0.2826*   0.2104*   0.3106*

Table 3. Recall@10 and Recall@20 results in top-K recommendation.

To answer the following research questions, we conduct offline experiments on three public datasets and online A/B tests on the Alibaba platform:

  • RQ1: How does KGTN perform compared to existing models?

  • RQ2: How do the main components in KGTN affect its effectiveness?

  • RQ3: How do different hyper-parameter settings affect KGTN?

  • RQ4: How does KGTN perform under noise injection?

  • RQ5: How does KGTN perform in a live system serving billions of users?

4.1. Experiment Settings

4.1.1. Dataset and Metrics

Three benchmark datasets are used to evaluate the effectiveness of KGTN: Last.FM (https://grouplens.org/datasets/hetrec-2011/), Book-Crossing (http://www2.informatik.uni-freiburg.de/~cziegler/BX/), and MovieLens-1M (https://grouplens.org/datasets/movielens/1m/). Their statistics are summarized in Table 1; they vary in size and sparsity, making our experiments more convincing. For data pre-processing, we follow RippleNet (wang2018ripplenet) to transform explicit feedback into implicit feedback, and randomly sample negatives from each user's non-interacted items, equal in number to the positives. For sub-KG construction, we also follow RippleNet (wang2018ripplenet) and use Microsoft Satori (https://searchengineland.com/library/bing/bing-satori) to build the KGs for MovieLens-1M, Book-Crossing, and Last.FM. Each sub knowledge graph follows the triple format and is a subset of the whole KG with a confidence level greater than 0.9.

We evaluate our method in two scenarios: (1) In click-through rate (CTR) prediction, we apply the trained model to predict each interaction in the test set, adopting two widely used metrics (wang2018ripplenet; wang2019knowledge), AUC and F1. (2) In top-$K$ recommendation, we use the trained model to select the $K$ items with the highest predicted click probability for each user in the test set, and report Recall@$K$.
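For reference, a minimal sketch of the Recall@K computation for a single user (AUC and F1 follow their standard definitions and are omitted); names are hypothetical:

```python
def recall_at_k(ranked_items, test_items, k=20):
    """Recall@K for one user: a minimal sketch.

    ranked_items: item ids sorted by predicted score (descending);
    test_items: the user's held-out positive items.
    """
    hits = len(set(ranked_items[:k]) & set(test_items))
    return hits / max(len(test_items), 1)
```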

4.1.2. Baselines

To demonstrate the effectiveness of our proposed KGTN, we compare it with five types of methods: a CF-based method (BPRMF (rendle2012bpr)), embedding-based methods (CKE (zhang2016collaborative), RippleNet (wang2018ripplenet)), a path-based method (PER (yu2014personalized)), GNN-based methods (KGCN (wang2019knowledge), KGNN-LS (wang2019knowledge-aware), KGAT (wang2019kgat), CKAN (wang2020ckan), KGIN (wang2021learning), CG-KGR (chen2022attentive)), and CL-based methods (KGCL (yang2022knowledge), MCCLK (zou2022multi)).

4.1.3. Parameter Settings

We implement KGTN and all baselines in PyTorch and carefully tune the key parameters. For a fair comparison, we fix the embedding size to 64 for all models and initialize the embedding parameters with the Xavier method (glorot2010understanding). We optimize our method with Adam (kingma2014adam) and set the batch size to 2048. A grid search is conducted to find the optimal settings: the learning rate $\eta$ is tuned in $\{0.0001, 0.0003, 0.001, 0.003\}$ and the $L_2$ regularization coefficient $\lambda$ in $\{10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}\}$. Other hyper-parameter settings are provided in Table 1. The best settings for all comparison methods are obtained by either empirical study or following the original papers.

4.2. Performance Comparison (RQ1)

We report the empirical results of all methods in Table 2 and Table 3. The improvements and statistical significance tests are computed between KGTN and the strongest baselines. From this comparison, we have the following observations:

  • Our proposed KGTN achieves the best results. KGTN consistently outperforms all baselines on the three datasets in terms of all measures, achieving significant improvements over the strongest baselines w.r.t. AUC of 2.76%, 1.20%, and 2.41% on the Book, Movie, and Music datasets, respectively. We attribute these improvements to the following aspects: (1) by modeling user intents with global signals, KGTN learns user/item representations in a more fine-grained and comprehensive manner; (2) the knowledge sampling strategy under intents removes less relevant knowledge for more robust representation learning; (3) the local-global contrastive learning improves representation learning in a self-supervised manner by contrasting local and global information.

  • Incorporating a KG does not always benefit recommendation. Comparing CKE with BPRMF, leaving the KG untapped limits the performance of BPRMF, which shows the value of KG information. In contrast, PER performs worse than BPRMF, which indicates that only suitably incorporated knowledge benefits the model. This stresses the importance of knowledge sampling and knowledge denoising.

  • GNNs are powerful graph learners. Most GNN-based methods perform better, suggesting the importance of modeling long-range connectivity in graph representation learning. This inspires us to go beyond the local aggregation paradigm and consider global signals.

  • Contrastive learning is effective. The recently proposed CL-based methods achieve the strongest baseline performance, which shows the effectiveness of incorporating a self-supervised task to improve representation learning. This inspires us to design a proper contrastive mechanism to denoise the knowledge and improve model performance.

4.3. Ablation Studies (RQ2)

Figure 3. Results of the ablation study.

As shown in Figure 3, we examine the contribution of each main component by comparing KGTN with three variants: 1) KGTN$_{w/o\ S}$, which removes the knowledge sampling under intents module; 2) KGTN$_{w/o\ C}$, which removes the local-global contrastive mechanism; and 3) KGTN$_{w/o\ I}$, which removes the multi-intent modeling entirely, so that neither global intent modeling nor knowledge contrastive denoising is present. From the results of the three variants and KGTN in Figure 3, we have the following observations:

  • Removing either knowledge sampling or local-global contrasting degrades model performance, which demonstrates their effectiveness for representation learning.

  • Ablating the multi-intent modeling brings the worst performance, which shows the importance of incorporating global signals and modeling multiple intents.

4.4. Sensitivity Analysis (RQ3)

4.4.1. Impact of graph transformer depth.

        Book              Movie             Music
        AUC      F1       AUC      F1       AUC      F1
L = 1   0.7901   0.6876   0.9372   0.8642   0.8904   0.7996
L = 2   0.7743   0.6783   0.9349   0.8623   0.8834   0.8068
L = 3   0.7603   0.6709   0.9278   0.8481   0.8785   0.7951

Table 4. Impact of graph transformer depth.

To study the influence of the graph transformer depth, we vary $L$ in the range {1, 2, 3} on the Book, Movie, and Music datasets. As shown in Table 4, KGTN performs best when $L=1$, suggesting that a single iteration suffices to integrate the global signals into user/item representations and that the model has low reliance on depth.

4.4.2. Impact of intent number $K$.

Figure 4. Impact of intent number $K$ on the (a) Book and (b) Music datasets.