
Knowledge Enhanced Multi-intent Transformer Network for Recommendation

Ding Zou (CCIIP Lab, School of Computer Science and Technology, Huazhong University of Science and Technology; Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Wuhan, China; Taotian Group, Hangzhou, China), m202173662@hust.edu.cn
Wei Wei (CCIIP Lab, School of Computer Science and Technology, Huazhong University of Science and Technology; Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL), Wuhan, China), weiw@hust.edu.cn
Feida Zhu (Singapore Management University, Singapore, Singapore), fdzhu@smu.edu.sg
Chuanyu Xu (Taotian Group, Hangzhou, China), tracy.xcy@taobao.com
Tao Zhang (Taotian Group, Hangzhou, China), guyan.zt@taobao.com
Chengfu Huo (Taotian Group, Hangzhou, China), chengfu.huocf@taobao.com
(2024)
Abstract.

Incorporating Knowledge Graphs (KGs) into recommendation has attracted growing attention in industry, owing to the great potential of KGs to provide abundant supplementary information and interpretability for the underlying models. However, simply integrating a KG into recommendation usually yields negative feedback in industry, mainly because two factors are ignored: i) users' multiple intents, which involve diverse nodes in the KG. For example, in e-commerce scenarios, users may exhibit preferences for specific styles, brands, or colors. ii) knowledge noise, a prevalent issue in Knowledge Enhanced Recommendation (KGR) that is even more severe in industrial scenarios. Irrelevant knowledge properties of items may result in inferior model performance compared to approaches that do not incorporate knowledge at all. To tackle these challenges, we propose a novel approach named Knowledge Enhanced Multi-intent Transformer Network for Recommendation (KGTN), comprising two primary modules: Global Intents Modeling with Graph Transformer, and Knowledge Contrastive Denoising under Intents. Specifically, Global Intents Modeling with Graph Transformer captures learnable user intents by incorporating global signals from user-item-relation-entity interactions with a well-designed graph transformer, while learning intent-aware user/item representations. Knowledge Contrastive Denoising under Intents is dedicated to learning precise and robust representations: it leverages the intent-aware user/item representations to sample relevant knowledge, and then applies a local-global contrastive mechanism to enhance noise-robust representation learning. Extensive experiments on three benchmark datasets show the superior performance of our proposed method over state-of-the-art baselines, and online A/B testing on Alibaba's large-scale industrial recommendation platform further confirms the real-world effectiveness of KGTN. The implementations are available at: https://github.com/CCIIPLab/KGTN.

Knowledge Enhanced Recommendation, Graph Transformer, Graph Neural Networks
journalyear: 2024
copyright: acmlicensed
conference: Companion Proceedings of the ACM Web Conference 2024; May 13–17, 2024; Singapore, Singapore
booktitle: Companion Proceedings of the ACM Web Conference 2024 (WWW '24 Companion), May 13–17, 2024, Singapore, Singapore
isbn: 979-8-4007-0172-6/24/05
doi: 10.1145/3589335.3648296
ccs: Information systems → Recommender systems

1. Introduction

Figure 1. (a) A simple case illustrating multiple user intents with global information; (b) Performance comparison.

Knowledge graphs (KGs) have emerged as a promising means to enhance the accuracy and interpretability of recommender systems in both academic and industrial scenarios. By incorporating entities and relations, KGs provide a rich source of information for user/item representation learning, which not only captures diverse relationships among items (such as a shared brand) but also allows user preferences to be interpreted (such as attributing a user's choice of a clothing item to its fashionable style).

In an effort to effectively integrate item-side KG information into recommendation, considerable research has been devoted to Knowledge Enhanced Recommendation (aka. KGR). Early studies (zhang2016collaborative; huang2018improving; wang2018dkn) directly integrate knowledge graph embeddings into item representations to enhance them. Subsequent studies (hu2018leveraging; shi2018heterogeneous; wang2019explainable) enrich the interactions via meta-paths that capture relevant connectivities between users and items over the KG: they either select prominent paths (sun2018recurrent) or represent the interactions with multi-hop paths from users to items (hu2018leveraging; wang2019explainable). Nevertheless, most of these methods rely heavily on manually designed meta-paths, which are hard to optimize in practice. Later methods therefore embraced Graph Neural Networks (GNNs) (wang2021learning; wang2019kgat) to automatically aggregate high-order information over the KG; they iteratively integrate multi-hop neighbors into representations and have demonstrated promising performance for recommendation. Most recently, Contrastive Learning (CL) has been incorporated into KGR to address noisy knowledge and long-tail problems (yang2022knowledge; zou2022multi; wang2024exploring), by contrasting the user-item (collaborative) and item-entity (knowledge) graphs.

However, current KGR methods often perform poorly in large-scale industrial scenarios because they commonly overlook two crucial factors: 1) Users' multiple intents underlying interaction behavior. For instance, as depicted in Figure 1(a), users may have diverse intentions when shopping on Alibaba's e-commerce platform, such as long-term interest, passing time, or social reasons. 2) Redundant knowledge information. In the context of user intents, some knowledge facts in the KG may be irrelevant noise (chen2022attentive), which can disrupt the learning of user/item representations. As shown in Figure 1(b), incorporating KGs may even result in worse performance than models without KG utilization (see Section 4.2 for details).

Still, it is non-trivial to model user intents in KGR, since user intents may be composed of multiple kinds of heterogeneous information, including items, relations, and entities. Previous multi-intent modeling methods usually define intents as a linear combination of either interacted items (wang2020disentangled) or the entire relation set (wang2021learning), then update the intent representations through local aggregation in the user-intent-item heterogeneous graph. However, such a multi-intent learning paradigm may not fully meet the requirements of KGR, as it neglects global information when defining and learning intents. To illustrate this, consider the example in Figure 1(a): user $u_1$ may purchase item $i_1$ with the intent $c_1$ of long-term interest, focusing on clothing style (e.g., whether it is fashionable), which means intent $c_1$ is associated with KG relation $r_1$ and entity $e_1$; meanwhile, $u_1$ may buy item $i_n$ with the intent $c_k$ of a social reason (e.g., friend $u_2$'s recommendation), which means intent $c_k$ is associated with user $u_2$ and item $i_k$.

In this paper, we focus on modeling the user intents behind interaction behaviors with global collaborative (user-item) and knowledge (item-relation-entity) information, and on exploiting these modeled intents to guide knowledge sampling, thereby facilitating fine-grained and accurate user/item representation learning. We propose a novel model, KGTN, which comprises two essential components addressing the foregoing limitations: i) Global Intents Modeling with Graph Transformer. We predefine $K$ intent representations for users/items, then learn these intents with global information from the collaborative and knowledge graphs. Specifically, KGTN first merges knowledge information into items, then applies a novel graph transformer on the user-item graph to learn global intents and generate intent-aware user/item representations. ii) Knowledge Contrastive Denoising under Intents. KGTN first exploits the intent-aware user/item representations to guide knowledge sampling, effectively pruning irrelevant knowledge; a novel local-global contrastive mechanism is then applied to denoise the user/item representations. Empirically, KGTN outperforms state-of-the-art models on three benchmark datasets in offline testing, and achieves significant improvements in online A/B testing.

Our contributions of this work can be summarized as follows:

  • General Aspects: We emphasize the importance of intent modeling with global information, which plays a crucial role in fine-grained representation learning and knowledge denoising.

  • Novel Methodologies: We propose a novel model KGTN, which models user intents from global signals with a novel graph transformer; and denoises item representations with i) knowledge denoising under intents, and ii) local-global graph contrastive learning.

  • Multifaceted Experiments: We conduct extensive offline experiments on three benchmark datasets and online A/B testing on Alibaba recommendation platform. The results demonstrate the advantages of our KGTN in better representation learning.

2. Problem Formulation

In this section, we begin by formulating the structural data of CF (user-item interactions) and KG (item-relation-entity knowledge) in KGR, then present the problem statement.

Interaction Data. In a typical recommendation scenario, let $\mathcal{U}=\{u_1, u_2, \ldots, u_M\}$ be a set of $M$ users and $\mathcal{V}=\{v_1, v_2, \ldots, v_N\}$ a set of $N$ items. Let $\mathbf{Y} \in \mathbb{R}^{M \times N}$ be the user-item interaction matrix, where $y_{uv}=1$ indicates that user $u$ engaged with item $v$ (e.g., clicked or purchased it), and $y_{uv}=0$ otherwise.

Knowledge Graph. A KG stores abundant real-world facts associated with items, such as item attributes or external commonsense knowledge, in the form of a heterogeneous graph (shi2018heterogeneous). Let $\mathcal{G}=\{(h,r,t) \mid h,t \in \mathcal{E}, r \in \mathcal{R}\}$ be the KG, where $h$, $r$, $t$ denote the head, relation, and tail of a knowledge triple, and $\mathcal{E}$ and $\mathcal{R}$ denote the sets of entities and relations in $\mathcal{G}$. In many recommendation scenarios, an item $v \in \mathcal{V}$ corresponds to one entity $e \in \mathcal{E}$, so we establish a set of item-entity alignments $\mathcal{A}=\{(v,e) \mid v \in \mathcal{V}, e \in \mathcal{E}\}$, where $(v,e)$ indicates that item $v$ is aligned with entity $e$ in the KG. With these alignments, the KG can profile items and offer complementary information to the interaction data.

Problem Statement. Given the user-item interaction matrix 𝐘\mathbf{Y} and the KG 𝒢\mathcal{G}, KGR aims to learn a function that can predict how likely a user would adopt an item.

3. Methodology

Figure 2. Overall framework of the proposed KGTN model. Best viewed in color.

We now present the proposed Knowledge Enhanced Multi-intent Transformer Network for Recommendation (KGTN), which models user intents with global information and exploits those intents to denoise the KG for accurate and robust user/item representation learning. Figure 2 displays the framework of KGTN, which consists of two key components: 1) Global Intents Modeling with Graph Transformer. KGTN first defines a set of $K$ learnable global intents for users and items; it then models these intents and learns intent-aware user/item representations by integrating global signals with a graph transformer on the user-item graph, into whose items the knowledge information has been encoded. 2) Knowledge Contrastive Denoising under Intents. This component exploits the learned intent-aware user/item representations to sample intent-relevant knowledge, then designs a contrastive self-supervised task between local and global aggregation features within the sampled graph to facilitate robust representation learning.

3.1. Global Intents Modeling with Graph Transformer

3.1.1. Intent Initialization with Global Signals

When interacting with items, users often have diverse intents, such as preferences for specific clothing brands and styles, a friend's recommendation, or passing time by randomly clicking (wang2021learning; ren2023disentangled). To capture these diverse intents, we assume $K$ different intents $c_u$ and $c_v$ on the user and item sides, respectively, where the item-side intents can also be understood as the theme or context of the item; for example, a user who intends to purchase a fashionable dress may prefer clothes with a "young" theme. Our predictive objective of user-item preference can be written as:

(1)  \int_{c_u}\!\int_{c_v} P(y, c_u, c_v \mid u, v)\, dc_v\, dc_u = \sum_{k=1}^{K} P(y, c_u^{k}, c_v^{k} \mid u, v).

Specifically, we define $K$ global intent prototypes $\{\mathbf{c}_u^{k} \in \mathbb{R}^{d}\}_{k=1}^{K}$ and $\{\mathbf{c}_v^{k} \in \mathbb{R}^{d}\}_{k=1}^{K}$ for users and items, respectively. With these predefined intent prototypes, we integrate them into user/item representations and update them with the related global signals.

3.1.2. Intent Modeling with Graph Transformer

To accurately model user intents with global information and learn intent-aware user/item representations, we perform intent-aware information propagation with these learnable intents. Specifically, intent-aware user/item embeddings are acquired as an attentive sum of the intent prototypes, and the user/item embeddings of each layer are updated by aggregating global user/item/relation/entity signals.

Formally, we obtain intent-aware user/item representations at the $l$-th embedding layer by aggregating information across the $K$ learnable intent prototypes (including $\mathbf{c}_u$ and $\mathbf{c}_v$), as follows:

(2)  \mathbf{e}_u^{l} = \sum_{k=1}^{K} \mathbf{c}_u^{k}\, P(\mathbf{c}_u^{k} \mid \mathbf{e}_u^{l}),
(3)  P(\mathbf{c}_u^{k} \mid \mathbf{e}_u^{l}) = \frac{\eta(\mathbf{e}_u^{(l-1)\top} \mathbf{c}_u^{k})}{\sum_{k'=1}^{K} \eta(\mathbf{e}_u^{(l-1)\top} \mathbf{c}_u^{k'})},

where $P(\mathbf{c}_u^{k} \mid \mathbf{e}_u^{l})$ denotes the importance score of prototype $\mathbf{c}_u^{k}$ for the $l$-th-layer user embedding, which encodes the global signals; similarly, $P(\mathbf{c}_v^{k} \mid \mathbf{e}_v^{l})$ denotes the importance score of $\mathbf{c}_v^{k}$ for the $l$-th-layer item embedding.
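To make this attentive intent aggregation concrete, the following PyTorch sketch implements Equations (2)-(3), assuming $\eta$ is an exponential kernel so the normalization reduces to a softmax; the module and variable names are illustrative, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntentAggregation(nn.Module):
    """Attentive sum over K learnable intent prototypes (Eqs. 2-3)."""

    def __init__(self, num_intents: int, dim: int):
        super().__init__()
        # K global intent prototypes, one set per node type (user or item).
        self.prototypes = nn.Parameter(torch.empty(num_intents, dim))
        nn.init.xavier_uniform_(self.prototypes)

    def forward(self, e_prev: torch.Tensor) -> torch.Tensor:
        # e_prev: (batch, d) user/item embeddings from the previous layer.
        # Importance score of each prototype per node: (batch, K).
        scores = F.softmax(e_prev @ self.prototypes.t(), dim=-1)
        # Intent-aware embedding as an attentive sum of prototypes: (batch, d).
        return scores @ self.prototypes
```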

To compute the $l$-th-layer user/item embeddings, we adopt a two-step process that encodes the global user/item/relation/entity information of the whole heterogeneous graph. The first step merges the knowledge information (both relations and entities) into item embeddings via a relation-aware graph aggregation, making the item representations more comprehensive and informative. It injects the relational context into the embeddings of neighboring entities and weights them with knowledge rationale scores (note that items are a subset of knowledge entities), as follows:

(4)  \mathbf{e}_i^{(l+1)} = \frac{1}{|\mathcal{N}_i|} \sum_{(r,v)\in\mathcal{N}_i} \beta(i,r,v)\, \mathbf{e}_r \odot \mathbf{e}_v^{(l)},
     \beta(i,r,v) = \operatorname{softmax}\!\big((\mathbf{e}_i \,\|\, \mathbf{e}_r)^{\top} (\mathbf{e}_v \,\|\, \mathbf{e}_r)\big) = \frac{\exp\big((\mathbf{e}_i \,\|\, \mathbf{e}_r)^{\top} (\mathbf{e}_v \,\|\, \mathbf{e}_r)\big)}{\sum_{(v',r)\in\mathcal{N}_i} \exp\big((\mathbf{e}_i \,\|\, \mathbf{e}_r)^{\top} (\mathbf{e}_{v'} \,\|\, \mathbf{e}_r)\big)},

where $\|$ denotes the concatenation operation and $\mathcal{N}_i$ denotes the set of neighboring (relation, entity) pairs of item $i$.
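The relation-aware aggregation of Equation (4) can be sketched as follows; this is a minimal dense version that loops over items for readability (a production version would batch it with scatter operations), and all function and variable names are hypothetical:

```python
import torch

def relation_aware_aggregate(e_ent, e_rel, triples, num_items):
    """Relation-aware KG aggregation of Eq. (4): a minimal sketch.

    e_ent: (num_entities, d) entity embeddings (items occupy the first
    num_items rows); e_rel: (num_relations, d) relation embeddings;
    triples: iterable of (item, relation, entity) index tuples.
    """
    out = e_ent[:num_items].clone()
    by_item = {}
    for i, r, v in triples:
        by_item.setdefault(i, []).append((r, v))
    for i, neigh in by_item.items():
        # Rationale score: dot product of (head || relation) and (tail || relation).
        q = torch.stack([torch.cat([e_ent[i], e_rel[r]]) for r, _ in neigh])
        k = torch.stack([torch.cat([e_ent[v], e_rel[r]]) for r, v in neigh])
        beta = torch.softmax((q * k).sum(-1), dim=0)            # (|N_i|,)
        # Relation-gated messages e_r ⊙ e_v, weighted then averaged (Eq. 4).
        msgs = torch.stack([e_rel[r] * e_ent[v] for r, v in neigh])
        out[i] = (beta.unsqueeze(-1) * msgs).mean(0)
    return out
```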

The second step applies a novel graph transformer over the user-item graph, encoding global user/item/entity information into user/item representations. In this way, the user/item representations of each layer integrate global signals, which are then exploited for intent modeling and representation updating, as follows:

(5)  \mathbf{e}_u^{l+1} = \sum_{i} \Big\Vert_{h=1}^{H} m_{u,i}\, \beta_{u,i}^{h}\, \mathbf{W}_{V}^{h} \mathbf{e}_i^{l}, \qquad m_{u,i} = \begin{cases} 1 & \text{if } (u,i) \in \mathbf{Y} \\ 0 & \text{otherwise} \end{cases}
     \beta_{u,i}^{h} = \frac{\exp \bar{\beta}_{u,i}^{h}}{\sum_{i'} \exp \bar{\beta}_{u,i'}^{h}}, \qquad \bar{\beta}_{u,i}^{h} = \frac{(\mathbf{W}_{Q}^{h} \mathbf{e}_u^{l})^{\top} (\mathbf{W}_{K}^{h} \mathbf{e}_i^{l})}{\sqrt{d/H}},

where $H$ denotes the number of attention heads (indexed by $h$), $m_{u,i}$ is the binary indicator deciding whether to compute attention between user $u$ and item $i$, $\beta_{u,i}^{h}$ denotes the attention weight of the interaction pair $(u,i)$ in the $h$-th head representation space, and $\mathbf{W}_{Q}^{h}, \mathbf{W}_{K}^{h}, \mathbf{W}_{V}^{h} \in \mathbb{R}^{d/H \times d}$ denote the query, key, and value projections of the $h$-th head, respectively.
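A minimal sketch of the masked multi-head attention in Equation (5) is given below, using a dense interaction matrix for clarity; the per-head projections are packed into full-width linear layers (mathematically equivalent), and the module name is illustrative:

```python
import torch
import torch.nn as nn

class UserItemGraphTransformer(nn.Module):
    """Masked multi-head attention over the user-item graph (Eq. 5)."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        assert dim % heads == 0
        self.h, self.dk = heads, dim // heads
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)

    def forward(self, e_u, e_i, y):
        # e_u: (M, d) users, e_i: (N, d) items, y: (M, N) binary interactions.
        M, N = y.shape
        q = self.W_q(e_u).view(M, self.h, self.dk)
        k = self.W_k(e_i).view(N, self.h, self.dk)
        v = self.W_v(e_i).view(N, self.h, self.dk)
        # Scaled dot-product scores per head: (M, h, N).
        att = torch.einsum('mhd,nhd->mhn', q, k) / self.dk ** 0.5
        # m_{u,i}: only items the user interacted with receive attention.
        att = att.masked_fill(y.unsqueeze(1) == 0, float('-inf'))
        att = att.softmax(dim=-1).nan_to_num(0.0)  # users w/o items -> zeros
        out = torch.einsum('mhn,nhd->mhd', att, v)
        return out.reshape(M, -1)                  # concat heads: (M, d)
```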

By integrating global information into user/item representations, we can learn intent-aware representations and update the learnable intents according to Equation (2).

3.2. Knowledge Contrastive Denoising under Intents

Intuitively, noisy or irrelevant connections between entities in the KG can lead to suboptimal representation learning, which runs counter to the original purpose of introducing the KG. To eliminate the noise and distill informative signals that benefit the recommendation task, we propose to highlight the connections consistent with user intents while removing the irrelevant ones.

3.2.1. Knowledge Sampling under Intents

Given the intent-aware user/item representations, we denoise the item-entity graph by removing irrelevant edges and nodes while sampling the important ones. We first exploit the intent-aware representations to calculate the importance scores of knowledge triplets (i.e., item-relation-entity pairs) in the same way as Equation (4), then add Gumbel noise (jang2017categorical) to the learned scores to improve sampling robustness, as follows:

(6)  \beta(i,r,v) = \operatorname{softmax}\!\big((\mathbf{e}_i \,\|\, \mathbf{e}_r)^{\top} (\mathbf{e}_v \,\|\, \mathbf{e}_r)\big),
     \beta(i,r,v) \leftarrow \beta(i,r,v) - \log\big(-\log(\epsilon)\big), \qquad \epsilon \sim \mathrm{Uniform}(0,1),

where $\epsilon$ is a random variable sampled from a uniform distribution. A top-$k$ sampling strategy then generates the new item-entity graph, removing the irrelevant edges and nodes:

(7)  \widehat{\beta}(i,r,v) = \begin{cases} \beta(i,r,v), & \beta(i,r,v) \in \text{top-}k\big(\beta(i,r,v)\big), \\ 0, & \text{otherwise}, \end{cases}

where $\widehat{\beta}(i,r,v)$ identifies the sampled triples of the item-entity graph, which replace the original graph structure in the subsequent user/item representation learning.
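The Gumbel-perturbed top-k selection of Equations (6)-(7) can be sketched as follows, assuming the scores of one item's candidate triples are given as a vector; the function name is hypothetical:

```python
import torch

def gumbel_topk_sample(beta: torch.Tensor, k: int) -> torch.Tensor:
    """Perturb triple scores with Gumbel noise and keep the top-k (Eqs. 6-7).

    beta: (T,) importance scores of the T candidate (item, relation,
    entity) triples of a single item. A minimal sketch.
    """
    eps = torch.rand_like(beta).clamp_min(1e-10)
    noisy = beta - torch.log(-torch.log(eps))        # add Gumbel(0, 1) noise
    keep = torch.topk(noisy, min(k, beta.numel())).indices
    mask = torch.zeros_like(beta)
    mask[keep] = 1.0
    return beta * mask                                # pruned triples zeroed out
```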

3.2.2. Local-Global Knowledge Contrastive Learning

With the sampled item-entity graph, we iteratively update the intent-aware representations on it. Inspired by previous contrastive learning based methods that denoise by aligning item representations from the KG and CF views, we further propose a local-global contrastive mechanism to improve the robustness of representation learning.

Specifically, we perform light information aggregation over the user-item graph and the sampled item-entity graph, taking the intent-aware user/item representations $\mathbf{e}_u, \mathbf{e}_i$ as the inputs $\mathbf{z}_u^{(0)}, \mathbf{z}_i^{(0)}$, to acquire robust and effective intent-aware user/item representations:

(8)  \mathbf{z}_i^{(l+1)} = \frac{1}{|\mathcal{N}_i|} \sum_{(r,v)\in\mathcal{N}_i} \mathbf{e}_r \odot \mathbf{z}_v^{(l)}, \qquad \mathbf{z}_u^{(l+1)} = \frac{1}{|\mathcal{N}_u|} \sum_{i\in\mathcal{N}_u} \mathbf{z}_i^{(l)},

where $\mathbf{z}_u^{(0)}, \mathbf{z}_i^{(0)}$ memorize the global signals; we thereby obtain the user/item representations $\mathbf{z}_u^{(l)}, \mathbf{z}_i^{(l)}$ for layers $l = 1, \ldots, L$.
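One layer of this light aggregation (Equation (8)) can be sketched as below; adjacency is kept in plain Python dictionaries for readability, and all names are illustrative:

```python
import torch

def light_layer(z_u, z_ent, e_rel, user_items, item_triples):
    """One light aggregation layer on the sampled graph (Eq. 8): a sketch.

    z_u: (num_users, d) user embeddings; z_ent: (num_entities, d) entity
    embeddings (items are a subset of entities); e_rel: (num_relations, d)
    relation embeddings. user_items maps a user to its item ids;
    item_triples maps an item to its (relation, entity) id pairs.
    """
    z_ent_new = z_ent.clone()
    for i, neigh in item_triples.items():
        # z_i^{(l+1)} = (1/|N_i|) sum_{(r,v)} e_r ⊙ z_v^{(l)}
        msgs = torch.stack([e_rel[r] * z_ent[v] for r, v in neigh])
        z_ent_new[i] = msgs.mean(0)
    z_u_new = z_u.clone()
    for u, items in user_items.items():
        # z_u^{(l+1)} = (1/|N_u|) sum_{i in N_u} z_i^{(l)}
        z_u_new[u] = z_ent[items].mean(0)
    return z_u_new, z_ent_new
```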

Besides the supervised user/item representation learning, we perform contrastive learning between node embeddings that encode global signals and those that encode local signals, which differs from traditional CL-based methods that contrast the CF and KG parts. We aggregate information on the sampled graph with the initial user/item representations $\mathbf{e}_u^{(0)}, \mathbf{e}_i^{(0)}$ to acquire the local results $\mathbf{z}_{u,local}^{(l)}, \mathbf{z}_{i,local}^{(l)}$, and with the intent-aware user/item representations $\mathbf{e}_u, \mathbf{e}_i$, which contain global signals, to acquire the global results $\mathbf{z}_u^{(l)}, \mathbf{z}_i^{(l)}$. We then perform layer-wise contrastive learning between the local and global results.

The local aggregation layer embeddings $\mathbf{z}_{u,local}^{(l)}, \mathbf{z}_{i,local}^{(l)}$ and the global aggregation layer embeddings $\mathbf{z}_u^{(l)}, \mathbf{z}_i^{(l)}$ are contrasted layer by layer. Each positive pair consists of the embeddings of the same user (item) from the local view and the global view at the same layer, while other nodes form the negative pairs. The contrastive loss of users is:

(9)  \mathcal{L}_c^{u} = \frac{1}{L} \sum_{l=0}^{L} -\log \frac{\exp\big(s(\mathbf{z}_u^{l}, \mathbf{z}_{u,local}^{l})/\tau\big)}{\sum_{k \neq u} \exp\big(s(\mathbf{z}_u^{l}, \mathbf{z}_k^{l})/\tau\big) + \sum_{k \neq u} \exp\big(s(\mathbf{z}_u^{l}, \mathbf{z}_{k,local}^{l})/\tau\big)},

where $s(\cdot)$ denotes cosine similarity and $\tau$ is a temperature parameter. The item-side contrastive loss $\mathcal{L}_c^{i}$ is obtained analogously, and summing the two yields the total local-global contrastive loss $\mathcal{L}_c$.
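A sketch of this loss for a single layer follows; note that, as written in Equation (9), the positive pair itself is excluded from the denominator because both sums run over $k \neq u$. Function and variable names are assumptions:

```python
import torch
import torch.nn.functional as F

def local_global_contrast(z_global, z_local, tau=0.2):
    """Layer-wise local-global InfoNCE loss for one layer (Eq. 9).

    z_global, z_local: (B, d) embeddings of the same B users (or items)
    from the global and local aggregation views; tau is the temperature.
    """
    zg = F.normalize(z_global, dim=-1)
    zl = F.normalize(z_local, dim=-1)
    pos = (zg * zl).sum(-1) / tau                   # s(z_u, z_{u,local})
    gg = zg @ zg.t() / tau                          # global-global pairs
    gl = zg @ zl.t() / tau                          # global-local pairs
    B = zg.size(0)
    off = ~torch.eye(B, dtype=torch.bool)           # keep only k != u
    denom = torch.cat([gg[off].view(B, -1), gl[off].view(B, -1)], dim=1)
    # -log( exp(pos) / sum_neg exp ), averaged over the batch.
    loss = -(pos - torch.logsumexp(denom, dim=1))
    return loss.mean()
```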

3.3. Model Prediction

After learning intent-aware user/item representations with global signals and contrasting local against global information, we have multi-layer intent-aware representations for each user/item. Summing the representations of all layers gives the final user/item representations, whose matching score is predicted via the inner product:

(10)  \mathbf{z}_u = \mathbf{z}_u^{(0)} + \cdots + \mathbf{z}_u^{(L)}, \qquad \mathbf{z}_i = \mathbf{z}_i^{(0)} + \cdots + \mathbf{z}_i^{(L)}, \qquad \hat{y}(u,i) = \mathbf{z}_u^{\top} \mathbf{z}_i.

We adopt the BPR loss (rendle2012bpr) to reconstruct the historical data, encouraging the prediction scores of a user's historical items to be higher than those of unobserved items:

(11)  \mathcal{L}_{\mathrm{BPR}} = \sum_{(u,i,j)\in O} -\ln \sigma\big(\hat{y}_{ui} - \hat{y}_{uj}\big),

where $O = \{(u,i,j) \mid (u,i) \in O^{+}, (u,j) \in O^{-}\}$ is the training set consisting of the observed interactions $O^{+}$ and unobserved counterparts $O^{-}$, and $\sigma$ is the sigmoid function.
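A minimal sketch of this pairwise objective, assuming one sampled negative item per observed interaction; names are illustrative:

```python
import torch
import torch.nn.functional as F

def bpr_loss(z_u, z_i_pos, z_i_neg):
    """BPR pairwise ranking loss (Eq. 11): a minimal sketch.

    z_u, z_i_pos, z_i_neg: (B, d) final embeddings of users, their
    observed items, and sampled unobserved items.
    """
    pos = (z_u * z_i_pos).sum(-1)        # y_hat(u, i): inner product
    neg = (z_u * z_i_neg).sum(-1)        # y_hat(u, j)
    return -F.logsigmoid(pos - neg).mean()
```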

3.4. Multi-task Training

To combine the recommendation task with the self-supervised task, we optimize the whole model with a multi-task training strategy, combining the local-global contrastive loss with the BPR loss and learning the model parameters by minimizing the following objective:

(12)  \mathcal{L}_{\mathrm{KGTN}} = \mathcal{L}_{\mathrm{BPR}} + \alpha \mathcal{L}_c + \lambda \|\Theta\|_2^2,

where $\Theta$ is the set of model parameters, $\alpha$ is a hyperparameter controlling the weight of the local-global contrastive loss, and $\lambda$ controls the $L_2$ regularization term.
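For completeness, a sketch of the combined objective; in practice the $L_2$ term is often delegated to the optimizer's weight_decay, but it is written out explicitly here for fidelity to Equation (12). Names are illustrative:

```python
import torch

def kgtn_objective(l_bpr, l_c, alpha, params, lam):
    """Multi-task objective of Eq. (12): a minimal sketch.

    l_bpr, l_c: scalar BPR and contrastive losses; alpha weights the
    contrastive term; lam weights the L2 penalty over `params`.
    """
    l2 = sum(p.pow(2).sum() for p in params)
    return l_bpr + alpha * l_c + lam * l2
```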

4. Experiment

                         Book-Crossing   MovieLens-1M   Last.FM
User-Item Interaction
  # users                17,860          6,036          1,872
  # items                14,967          2,445          3,846
  # interactions         139,746         753,772        42,346
Knowledge Graph
  # entities             77,903          182,011        9,366
  # relations            25              12             60
  # triplets             151,500         1,241,996      15,518

Table 1. Statistics of the three datasets.
Model       Book-Crossing                          MovieLens-1M                           Last.FM
            AUC               F1                   AUC               F1                   AUC               F1
BPRMF       0.6583 (-13.18%)  0.6117 (-7.59%)      0.8920 (-4.52%)   0.7921 (-7.21%)      0.7563 (-13.41%)  0.7010 (-9.95%)
CKE         0.6759 (-11.42%)  0.6235 (-6.41%)      0.9065 (-3.07%)   0.8024 (-6.18%)      0.7471 (-14.33%)  0.6740 (-12.65%)
RippleNet   0.7211 (-6.90%)   0.6472 (-4.04%)      0.9190 (-1.82%)   0.8422 (-2.20%)      0.7762 (-11.42%)  0.7025 (-9.80%)
PER         0.6048 (-18.53%)  0.5726 (-11.50%)     0.7124 (-22.48%)  0.6670 (-19.72%)     0.6414 (-24.90%)  0.6033 (-19.72%)
KGCN        0.6841 (-10.60%)  0.6313 (-5.63%)      0.9090 (-2.82%)   0.8366 (-2.76%)      0.8027 (-8.77%)   0.7086 (-9.19%)
KGNN-LS     0.6762 (-11.39%)  0.6314 (-5.62%)      0.9140 (-2.32%)   0.8410 (-2.32%)      0.8052 (-8.52%)   0.7224 (-7.81%)
KGAT        0.7314 (-5.87%)   0.6544 (-3.32%)      0.9140 (-2.32%)   0.8440 (-2.02%)      0.8293 (-6.11%)   0.7424 (-5.81%)
CKAN        0.7420 (-4.81%)   0.6671 (-2.05%)      0.9082 (-2.90%)   0.8410 (-2.32%)      0.8418 (-4.86%)   0.7592 (-4.13%)
KGIN        0.7273 (-6.28%)   0.6614 (-2.62%)      0.9190 (-1.82%)   0.8441 (-2.01%)      0.8486 (-4.18%)   0.7602 (-4.03%)
CG-KGR      0.7498 (-4.03%)   0.6689 (-1.87%)      0.9110 (-2.62%)   0.8359 (-2.83%)      0.8336 (-5.68%)   0.7433 (-5.72%)
KGCL        0.7453 (-4.48%)   0.6679 (-1.97%)      0.9184 (-1.88%)   0.8437 (-2.05%)      0.8455 (-4.49%)   0.7596 (-4.00%)
MCCLK       0.7625 (-2.76%)   0.6777 (-0.99%)      0.9252 (-1.20%)   0.8559 (-0.83%)      0.8663 (-2.41%)   0.7753 (-2.43%)
KGTN        0.7901*           0.6876*              0.9372*           0.8642*              0.8904*           0.7996*

Table 2. AUC and F1 results in CTR prediction. Percentages in parentheses give each baseline's gap to KGTN (best results in the last row; MCCLK is the second best throughout). * denotes a statistically significant improvement by an unpaired two-sample t-test with p < 0.001.
Model       Book-Crossing       MovieLens-1M        Last.FM
            R@10      R@20      R@10      R@20      R@10      R@20
BPRMF       0.0334    0.0525    0.0939    0.1512    0.0923    0.1740
CKE         0.0421    0.0562    0.0867    0.1364    0.0780    0.1532
RippleNet   0.0507    0.0622    0.1082    0.1766    0.0942    0.1520
PER         0.0322    0.0481    0.0523    0.1204    0.0540    0.1167
KGCN        0.0496    0.0540    0.0965    0.1720    0.1416    0.1776
KGNN-LS     0.0422    0.0526    0.1286    0.1757    0.1312    0.1933
KGAT        0.0522    0.0670    0.1468    0.2296    0.1640    0.2313
CKAN        0.0462    0.0566    0.1511    0.2400    0.1412    0.2465
KGIN        0.0555    0.0699    0.1511    0.2404    0.1758    0.2487
CG-KGR      0.0612    0.0781    0.1621    0.2495    0.1578    0.2106
KGCL        0.0679    0.0845    0.1633    0.2499    0.1759    0.2471
MCCLK       0.0769    0.0936    0.1642    0.2503    0.1835    0.2598
KGTN        0.1060*   0.1275*   0.1841*   0.2826*   0.2104*   0.3106*

Table 3. Recall@10 and Recall@20 results in top-K recommendation.

To answer the following research questions, we conduct offline experiments on three public datasets and online A/B tests on the Alibaba platform:

  • RQ1: How does KGTN perform compared to existing models?

  • RQ2: How do the main components in KGTN affect its effectiveness?

  • RQ3: How do different hyper-parameter settings affect KGTN?

  • RQ4: How does KGTN perform under noise injection?

  • RQ5: How does KGTN perform in a live system serving billions of users?

4.1. Experiment Settings

4.1.1. Dataset and Metrics

Three benchmark datasets are used to evaluate the effectiveness of KGTN: Last.FM (https://grouplens.org/datasets/hetrec-2011/), Book-Crossing (http://www2.informatik.uni-freiburg.de/~cziegler/BX/), and MovieLens-1M (https://grouplens.org/datasets/movielens/1m/). Their statistics are summarized in Table 1; they vary in size and sparsity, making our experiments more convincing. For data pre-processing, we follow RippleNet (wang2018ripplenet) to transform explicit feedback into implicit feedback, and randomly sample negatives from each user's non-interacted items, equal in number to the positives. For sub-KG construction, we also follow RippleNet (wang2018ripplenet) and use Microsoft Satori (https://searchengineland.com/library/bing/bing-satori) to build the KGs for MovieLens-1M, Book-Crossing, and Last.FM. Each sub knowledge graph follows the triple format and is a subset of the whole KG with a confidence level greater than 0.9.

We evaluate our method in two scenarios: (1) In click-through rate (CTR) prediction, we apply the trained model to predict each interaction in the test set, adopting two widely used metrics (wang2018ripplenet; wang2019knowledge), AUC and F1. (2) In top-$K$ recommendation, we use the trained model to select the $K$ items with the highest predicted click probability for each user in the test set, and report Recall@$K$.
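For reference, a minimal sketch of the Recall@K computation for a single user (AUC and F1 follow their standard definitions and are omitted); names are hypothetical:

```python
def recall_at_k(ranked_items, test_items, k=20):
    """Recall@K for one user: a minimal sketch.

    ranked_items: item ids sorted by predicted score (descending);
    test_items: the user's held-out positive items.
    """
    hits = len(set(ranked_items[:k]) & set(test_items))
    return hits / max(len(test_items), 1)
```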

4.1.2. Baselines

To demonstrate the effectiveness of our proposed KGTN, we compare it with five types of methods: a CF-based method (BPRMF (rendle2012bpr)), embedding-based methods (CKE (zhang2016collaborative), RippleNet (wang2018ripplenet)), a path-based method (PER (yu2014personalized)), GNN-based methods (KGCN (wang2019knowledge), KGNN-LS (wang2019knowledge-aware), KGAT (wang2019kgat), CKAN (wang2020ckan), KGIN (wang2021learning), CG-KGR (chen2022attentive)), and CL-based methods (KGCL (yang2022knowledge), MCCLK (zou2022multi)).

4.1.3. Parameter Settings

We implement KGTN and all baselines in PyTorch and carefully tune the key parameters. For a fair comparison, we fix the embedding size to 64 for all models and initialize the embedding parameters with the Xavier method (glorot2010understanding). We optimize our method with Adam (kingma2014adam) and set the batch size to 2048. A grid search is conducted to find the optimal settings: the learning rate $\eta$ is tuned in $\{0.0001, 0.0003, 0.001, 0.003\}$ and the $L_2$ regularization coefficient $\lambda$ in $\{10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}\}$. Other hyper-parameter settings are provided in Table 1. The best settings for all comparison methods are obtained by either empirical study or following the original papers.

4.2. Performance Comparison (RQ1)

We report the empirical results of all methods in Table 2 and Table 3. The improvements and statistical significance tests are computed between KGTN and the strongest baselines. From this comparison, we have the following observations:

  • Our proposed KGTN achieves the best results. KGTN consistently outperforms all baselines on the three datasets in terms of all measures, achieving significant improvements over the strongest baselines w.r.t. AUC of 2.76%, 1.20%, and 2.41% on the Book, Movie, and Music datasets, respectively. We attribute these improvements to the following aspects: (1) by modeling user intents with global signals, KGTN learns user/item representations in a more fine-grained and comprehensive manner; (2) the knowledge sampling strategy under intents removes less relevant knowledge for more robust representation learning; (3) the local-global contrastive learning improves representation learning in a self-supervised manner by contrasting local and global information.

  • Incorporating a KG does not always benefit recommendation. Comparing CKE with BPRMF, leaving the KG untapped limits the performance of BPRMF, which shows the value of KG information. In contrast, PER performs worse than BPRMF, which indicates that only suitably incorporated knowledge benefits the model. This stresses the importance of knowledge sampling and knowledge denoising.

  • GNNs are powerful graph learners. Most GNN-based methods perform better, suggesting the importance of modeling long-range connectivity in graph representation learning. This inspires us to go beyond the local aggregation paradigm and consider global signals.

  • Contrastive learning is effective. The recently proposed CL-based methods achieve the strongest baseline performance, which shows the effectiveness of incorporating a self-supervised task to improve representation learning. This inspires us to design a proper contrastive mechanism to denoise the knowledge and improve model performance.

4.3. Ablation Studies (RQ2)

Figure 3. Results of the ablation study.

As shown in Figure 3, we examine the contribution of each main component by comparing KGTN with three variants: 1) KGTN$_{w/o\ S}$, which removes the knowledge sampling under intents module; 2) KGTN$_{w/o\ C}$, which removes the local-global contrastive mechanism; and 3) KGTN$_{w/o\ I}$, which removes the multi-intent modeling entirely, so that neither global intent modeling nor knowledge contrastive denoising is present. From the results of the three variants and KGTN in Figure 3, we have the following observations:

  • Removing either knowledge sampling or local-global contrasting degrades model performance, which demonstrates their effectiveness for representation learning.

  • Ablating the multi-intent modeling brings the worst performance, which shows the importance of incorporating global signals and modeling multiple intents.

4.4. Sensitivity Analysis (RQ3)

4.4.1. Impact of graph transformer depth.

        Book              Movie             Music
        AUC      F1       AUC      F1       AUC      F1
L = 1   0.7901   0.6876   0.9372   0.8642   0.8904   0.7996
L = 2   0.7743   0.6783   0.9349   0.8623   0.8834   0.8068
L = 3   0.7603   0.6709   0.9278   0.8481   0.8785   0.7951

Table 4. Impact of graph transformer depth.

To study the influence of the graph transformer depth, we vary $L$ in the range {1, 2, 3} on the Book, Movie, and Music datasets. As shown in Table 4, KGTN performs best when $L=1$, suggesting that a single iteration suffices to integrate the global signals into user/item representations and that the model has low reliance on depth.

4.4.2. Impact of intent number $K$.

Figure 4. Impact of intent number $K$ on the (a) Book and (b) Music datasets.