
FiBiNet++: Reducing Model Size by Low Rank Feature Interaction Layer for CTR Prediction

Pengtao Zhang, Sina Weibo, Beijing, China, zpt1986@126.com; Zheng Zheng, Brandeis University, Waltham, United States, zhengzheng@brandeis.edu; and Junlin Zhang, Sina Weibo, Beijing, China, junlin6@staff.weibo.com
(2023)
Abstract.

Click-Through Rate (CTR) estimation has become one of the most fundamental tasks in many real-world applications, and various deep models have been proposed. Prior research has shown that FiBiNet is one of the best-performing models and outperforms all other models on the Avazu dataset. However, the large model size of FiBiNet hinders its wider application. In this paper, we propose a novel FiBiNet++ model that redesigns FiBiNet's structure, greatly reducing model size while further improving its performance. One of the primary techniques is our proposed "Low Rank Layer" for feature interaction, which is the crucial driver of the superior model compression ratio. Extensive experiments on three public datasets show that FiBiNet++ effectively reduces the non-embedding parameters of FiBiNet by 12x to 16x. Moreover, FiBiNet++ brings significant performance improvements over state-of-the-art CTR methods, including FiBiNet. The source code is available at https://github.com/recommendation-algorithm/FiBiNet.

Recommender System; Click-Through Rate
journalyear: 2023; copyright: acmlicensed; conference: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, October 21–25, 2023, Birmingham, United Kingdom; booktitle: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23), October 21–25, 2023, Birmingham, United Kingdom; price: 15.00; doi: 10.1145/3583780.3615242; isbn: 979-8-4007-0124-5/23/10; ccs: Information systems, Recommender systems

1. Introduction

Click-through rate (CTR) prediction plays an important role in personalized advertising and recommender systems (McMahan et al., 2013; Graepel et al., 2010; He et al., 2014; Koren et al., 2009; Deng et al., 2021). In recent years, a series of deep CTR models have been proposed to address this problem, such as Wide & Deep Learning (Cheng et al., 2016), DeepFM (Guo et al., 2017), DCN (Wang et al., 2017), xDeepFM (Lian et al., 2018), AutoInt (Song et al., 2019), DCN v2 (Wang et al., 2021) and FiBiNet (Huang et al., 2019).

Figure 1. The Framework of FiBiNet++

Specifically, Wide & Deep Learning (Cheng et al., 2016) jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for recommender systems. DeepFM (Guo et al., 2017) replaces the wide part of the Wide & Deep model with FM and shares the feature embedding between the FM and deep components. Some works explicitly introduce high-order feature interactions through a sub-network. For example, Deep & Cross Network (DCN) (Wang et al., 2017) and DCN v2 (Wang et al., 2021) efficiently capture feature interactions of bounded degrees in an explicit fashion. The eXtreme Deep Factorization Machine (xDeepFM) (Lian et al., 2018) also models low-order and high-order feature interactions explicitly by proposing a novel Compressed Interaction Network (CIN). Similarly, FiBiNet (Huang et al., 2019) dynamically learns the importance of features via the Squeeze-Excitation network (SENET) and feature interactions via a bi-linear function.

Though many models have been proposed, few works fairly compare their performance. FuxiCTR (Zhu et al., 2020) performs open benchmarking for CTR prediction, and its experimental results show that FiBiNet is one of the best-performing models, outperforming all 23 other models on the Avazu dataset.

However, we argue that FiBiNet has too many model parameters, which hinders its wider usage in real-life applications. In real-world systems, both model size and the cost of training and inference are important factors to consider. Therefore, our work aims to redesign the model structure to greatly reduce model size while improving performance.

In this paper, we propose a novel FiBiNet++ model to address these challenges, as shown in Figure 1. First, we restructure the model by removing the bi-linear module on SENet and the linear part of FiBiNet, which directly reduces model parameters. More importantly, we upgrade the bi-linear function into the bi-linear+ module by changing the Hadamard product to an inner product and introducing a "Low Rank Layer" on feature interactions. In Section 3.2, we demonstrate that the proposed "Low Rank Layer" is primarily responsible for the high model compression ratio. Our inspiration stems from LoRA (Hu et al., 2021), which reveals that large language models possess a low "intrinsic dimension" and can learn effectively even after random projection into a smaller subspace. We hypothesize that feature interactions in CTR models similarly exhibit a low intrinsic rank during training and propose incorporating a "Low Rank Feature Interaction Layer" into the bi-linear+ module, which greatly reduces model parameters while preserving model performance. Finally, we introduce feature normalization and the upgraded SENet+ module to further enhance model performance.

We summarize our major contributions as follows: (1) The proposed FiBiNet++ reduces the model size of FiBiNet by 12x to 16x on three datasets. (2) Compared with FiBiNet, FiBiNet++ improves training and inference efficiency by 37.50% to 81.03% on three datasets. (3) FiBiNet++ yields remarkable improvements compared to state-of-the-art models.

2. Preliminaries

A DNN model is used as a sub-component in most current deep ranking systems (Xiao et al., 2017; Guo et al., 2017; Lian et al., 2018; Wang et al., 2017; Huang et al., 2019; Covington et al., 2016; Qu et al., 2016) and contains two components: feature embedding and MLP.

(1) Feature Embedding. We map one-hot sparse features to dense, low-dimensional embedding vectors and obtain the feature embedding $\mathbf{v}_{i}$ for one-hot vector $\mathbf{x}_{i}$ via $\mathbf{v}_{i}=\mathbf{W}_{e}\mathbf{x}_{i}\in\mathbb{R}^{1\times d}$, where $\mathbf{W}_{e}\in\mathbb{R}^{n\times d}$ is the embedding matrix of $n$ features and $d$ is the dimension of the field embedding.

(2) MLP. To learn high-order feature interactions, multiple feed-forward layers are stacked on the concatenation of dense features, represented as $\mathbf{H}_{0}=\mathrm{concat}[\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{f}]$, where $f$ denotes the field number. The feed-forward process of the MLP is then $\mathbf{H}_{l}=\mathrm{ReLU}(\mathbf{W}_{l}\mathbf{H}_{l-1}+\beta_{l})$, where $l$ is the depth and ReLU is the activation function; $\mathbf{W}_{l}$, $\beta_{l}$ and $\mathbf{H}_{l}$ are the weight matrix, bias and output of the $l$-th layer.
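To make the preliminaries concrete, below is a minimal PyTorch sketch of this embedding-plus-MLP backbone. It is an illustration under assumed placeholder sizes (vocabulary size, field count, layer widths), not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DNNBase(nn.Module):
    """Embedding + MLP backbone shared by most deep CTR models."""
    def __init__(self, n_features=1000, n_fields=10, emb_dim=10, hidden=(400, 400, 400)):
        super().__init__()
        # W_e in R^{n x d}: one shared embedding table, indexed per field by feature id.
        self.embedding = nn.Embedding(n_features, emb_dim)
        layers, in_dim = [], n_fields * emb_dim           # H_0 = concat[v_1, ..., v_f]
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]   # H_l = ReLU(W_l H_{l-1} + beta_l)
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))               # final logit for binary CTR
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):                 # x: (batch, n_fields) integer feature ids
        v = self.embedding(x)             # (batch, n_fields, emb_dim)
        h0 = v.flatten(start_dim=1)       # concatenated dense features
        return torch.sigmoid(self.mlp(h0)).squeeze(-1)    # predicted CTR in (0, 1)
```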

For binary classifications, the loss function of CTR prediction is the log loss:

(1) $\mathcal{L}=-\frac{1}{N}\sum^{N}_{i=1}\left(y_{i}\log(\hat{y}_{i})+(1-y_{i})\log(1-\hat{y}_{i})\right)$

where $N$ is the total number of training instances, $y_{i}$ is the ground truth of the $i$-th instance and $\hat{y}_{i}$ is the predicted CTR.
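Continuing the sketch above, Eq. (1) is the standard binary cross-entropy; a brief usage example with made-up tensors:

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([0.9, 0.2, 0.6])   # predicted CTRs \hat{y}_i from the model
y = torch.tensor([1.0, 0.0, 1.0])       # ground-truth clicks y_i
loss = nn.BCELoss()(y_hat, y)           # mean of Eq. (1) over the N instances
```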

3. Our Proposed Model

The architecture of the proposed FiBiNet++ is shown in Figure 1. The original feature embedding is first normalized before being sent to the following components. Then, the bi-linear+ module models feature interactions and the SENet+ module computes bit-wise feature importance. The outputs of the two branches are concatenated as the input of the following MLP layers.

3.1. FiBiNet++

Feature Normalization. We introduce feature normalization into FiBiNet++ to enhance the model's training stability as follows:

(2) $\mathbf{N}(\mathbf{V})=\mathrm{concat}[\mathbf{N}(\mathbf{v}_{1}),\mathbf{N}(\mathbf{v}_{2}),\ldots,\mathbf{N}(\mathbf{v}_{f})]\in\mathbb{R}^{1\times fd}$

where $\mathbf{N}(\cdot)$ is layer normalization (Ba et al., 2016) for numerical features and batch normalization (Ioffe and Szegedy, 2015) for categorical features.
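A minimal sketch of this per-feature normalization is given below. It assumes the field types (numerical vs. categorical) are known in advance and applies LayerNorm or BatchNorm1d to each field's embedding accordingly; treat it as an illustration of Eq. (2), not the released code.

```python
import torch
import torch.nn as nn

class FeatureNormalization(nn.Module):
    """Eq. (2): layer norm for numerical-feature embeddings, batch norm for categorical ones."""
    def __init__(self, emb_dim, is_numerical):
        super().__init__()
        # is_numerical: one boolean per field (assumed to be provided by the data pipeline).
        self.norms = nn.ModuleList(
            [nn.LayerNorm(emb_dim) if numerical else nn.BatchNorm1d(emb_dim)
             for numerical in is_numerical]
        )

    def forward(self, v):                  # v: (batch, n_fields, emb_dim)
        normed = [norm(v[:, i, :]) for i, norm in enumerate(self.norms)]
        return torch.stack(normed, dim=1)  # N(V), same shape as the input
```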

Bi-Linear+ Module. FiBiNet models the interaction between feature $\mathbf{x}_{i}$ and feature $\mathbf{x}_{j}$ by a bi-linear function which introduces an extra learned matrix $\mathbf{W}$ as follows:

(3) $\mathbf{p}_{i,j}=\mathbf{v}_{i}\circ\mathbf{W}\otimes\mathbf{v}_{j}\in\mathbb{R}^{1\times d}$

where $\circ$ and $\otimes$ denote the inner product and element-wise Hadamard product, respectively. In order to effectively reduce model size, we upgrade the bi-linear function into the bi-linear+ module in two ways. First, the Hadamard product is replaced by another inner product: $\mathbf{p}_{i,j}=\mathbf{v}_{i}\circ\mathbf{W}\circ\mathbf{v}_{j}\in\mathbb{R}^{1\times 1}$. We argue that feature interactions are rather sparse and that one bit of information is sufficient as the representation instead of a vector; the parameters of $\mathbf{p}_{i,j}$ thus decrease greatly from a $d$-dimensional vector to 1 bit per feature interaction. Suppose the input instance has $f$ fields; we then have the following vector after bi-linear feature interaction:

(4) $\mathbf{P}=\mathrm{concat}[\mathbf{p}_{1,2},\mathbf{p}_{1,3},\ldots,\mathbf{p}_{f-1,f}]\in\mathbb{R}^{1\times\frac{f\times(f-1)}{2}}$

Inspired by LoRA (Hu et al., 2021), which has demonstrated that large language models possess a low "intrinsic dimension" and learn efficiently despite undergoing random projections to smaller subspaces, we posit that updates to the feature interaction layer during training also exhibit a low "intrinsic rank". Therefore, we propose integrating a thin "Low Rank Layer" into bi-linear+, thereby reducing model parameters significantly while maintaining model performance. Specifically, we introduce the "Low Rank Layer" stacked on the feature interaction vector $\mathbf{P}$ as follows:

(5) $\mathbf{H}^{LRL}=\sigma_{1}\left(\mathbf{W}_{1}\mathbf{P}\right)\in\mathbb{R}^{1\times m}$

where $\mathbf{W}_{1}\in\mathbb{R}^{m\times\frac{f\times(f-1)}{2}}$ is the learning matrix of a thin MLP layer with small size $m$ and $\sigma_{1}(\cdot)$ is an identity function. The "Low Rank Layer" projects feature interactions from a sparse space into a low rank space, greatly reducing storage.
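The PyTorch sketch below puts the two changes together: one inner-product scalar per field pair (Eq. 4), followed by the thin "Low Rank Layer" of Eq. (5). For brevity it uses a single interaction matrix $\mathbf{W}$ shared by all field pairs, whereas FiBiNet also defines per-field and per-pair variants, so this is an illustrative approximation rather than the released implementation.

```python
import torch
import torch.nn as nn

class BilinearPlus(nn.Module):
    """Bi-linear+: scalar inner-product interactions plus a thin low rank projection."""
    def __init__(self, n_fields, emb_dim, low_rank_dim=50):
        super().__init__()
        self.W = nn.Parameter(torch.randn(emb_dim, emb_dim) * 0.01)   # shared bilinear matrix
        n_pairs = n_fields * (n_fields - 1) // 2
        # "Low Rank Layer": thin linear map from f(f-1)/2 pair scores down to m dims (Eq. 5).
        self.low_rank = nn.Linear(n_pairs, low_rank_dim, bias=False)
        self.register_buffer("pairs", torch.triu_indices(n_fields, n_fields, offset=1))

    def forward(self, v):                          # v: normalized embeddings (batch, f, d)
        vi = v[:, self.pairs[0], :]                # left embedding of every field pair
        vj = v[:, self.pairs[1], :]                # right embedding of every field pair
        p = ((vi @ self.W) * vj).sum(dim=-1)       # p_{i,j} = (v_i W) . v_j, one scalar per pair
        return self.low_rank(p)                    # H^{LRL}: (batch, low_rank_dim)
```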

SENet+ Module. The SENet+ module consists of four phases: squeeze, excitation, re-weight and fuse. (1) Squeeze. SENet collects one bit of information per feature embedding by mean pooling as the "summary statistics". We improve the original squeeze step by providing more useful information. Specifically, we first segment each normalized feature embedding $\mathbf{v}_{i}\in\mathbb{R}^{1\times d}$ into $g$ groups ($g$ is a hyper-parameter) as $\mathbf{v}_{i}=\mathrm{concat}[\mathbf{v}_{i,1},\mathbf{v}_{i,2},\ldots,\mathbf{v}_{i,g}]$, where $\mathbf{v}_{i,j}\in\mathbb{R}^{1\times\frac{d}{g}}$ contains the information of the $j$-th group of the $i$-th feature. Let $k=\frac{d}{g}$ denote the size of each group. Then, we select the max value $\mathbf{z}_{i,j}^{max}$ and the average pooling value $\mathbf{z}_{i,j}^{avg}$ of $\mathbf{v}_{i,j}$ as the representation of the group: $\mathbf{z}_{i,j}^{max}=\max_{t}\left\{\mathbf{v}_{i,j}^{t}\right\}_{t=1}^{k}$ and $\mathbf{z}_{i,j}^{avg}=\frac{1}{k}\sum_{t=1}^{k}\mathbf{v}_{i,j}^{t}$. The concatenated representative information of all groups forms the "summary statistic" $\mathbf{Z}_{i}$ of feature embedding $\mathbf{v}_{i}$:

(6) $\mathbf{Z}_{i}=\mathrm{concat}[\mathbf{z}_{i,1}^{max},\mathbf{z}_{i,1}^{avg},\mathbf{z}_{i,2}^{max},\mathbf{z}_{i,2}^{avg},\ldots,\mathbf{z}_{i,g}^{max},\mathbf{z}_{i,g}^{avg}]\in\mathbb{R}^{1\times 2g}$

Finally, we concatenate the summary statistics of all features, $\mathbf{Z}=\mathrm{concat}[\mathbf{Z}_{1},\mathbf{Z}_{2},\ldots,\mathbf{Z}_{f}]\in\mathbb{R}^{1\times 2gf}$, as the input of the SENet+ module.

(2) Excitation. The excitation step in SENet computes each feature's weight according to the statistic vector $\mathbf{Z}$, which is a field-wise attention. We improve this step by changing the field-wise attention into a more fine-grained bit-wise attention. Similarly, we use two fully connected (FC) layers to learn the weights as follows: $\mathbf{A}=\sigma_{3}\left(\mathbf{W}_{3}\sigma_{2}\left(\mathbf{W}_{2}\mathbf{Z}\right)\right)\in\mathbb{R}^{1\times fd}$, where $\mathbf{W}_{2}\in\mathbb{R}^{\frac{2gf}{r}\times 2gf}$ denotes the learning parameters of the first FC layer, which is a thin layer with reduction ratio $r$, and $\mathbf{W}_{3}\in\mathbb{R}^{fd\times\frac{2gf}{r}}$ denotes the learning parameters of the second FC layer, which is a wider layer of size $fd$. Here $\sigma_{2}(\cdot)$ is $\mathrm{ReLU}(\cdot)$ and $\sigma_{3}(\cdot)$ is an identity function without non-linear transformation. In this way, each bit of the input embedding can dynamically learn its corresponding attention score from $\mathbf{A}$.

(3) Re-Weight. The re-weight step performs element-wise multiplication between the original field embedding and the learned attention scores as follows: $\mathbf{V}^{w}=\mathbf{A}\otimes\mathbf{N}(\mathbf{V})\in\mathbb{R}^{1\times fd}$, where $\otimes$ is an element-wise multiplication between two vectors and $\mathbf{N}(\mathbf{V})$ denotes the original embedding after normalization.

(4) Fuse. An extra "fuse" step is introduced to better combine the information contained in the original feature embedding and the weighted embedding. Specifically, we first use a skip connection to merge the two embeddings: $\mathbf{v}_{i}^{s}=\mathbf{v}_{i}^{o}\oplus\mathbf{v}_{i}^{w}$, where $\mathbf{v}_{i}^{o}$ denotes the $i$-th normalized feature embedding, $\mathbf{v}_{i}^{w}$ denotes the embedding after the re-weight step, and $\oplus$ is an element-wise addition. Then, another feature normalization is applied on $\mathbf{v}_{i}^{s}$ for a better representation: $\mathbf{v}_{i}^{u}=\mathbf{LN}(\mathbf{v}_{i}^{s})$. Note that we adopt layer normalization here regardless of whether the feature is numerical or categorical. Finally, we concatenate all the fused embeddings as the output of the SENet+ module:

(7) $\mathbf{V}^{SENet+}=\mathrm{concat}[\mathbf{v}_{1}^{u},\mathbf{v}_{2}^{u},\ldots,\mathbf{v}_{f}^{u}]\in\mathbb{R}^{1\times fd}$
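The four phases can be condensed into the following PyTorch sketch, again a hedged illustration (the defaults g=2 and r=3 follow the paper's settings; the exact layer shapes in the released code may differ).

```python
import torch
import torch.nn as nn

class SENetPlus(nn.Module):
    """SENet+: squeeze (group max/avg pooling), excitation (bit-wise attention),
    re-weight and fuse, following Eqs. (6)-(7)."""
    def __init__(self, n_fields, emb_dim, groups=2, reduction=3):
        super().__init__()
        assert emb_dim % groups == 0
        self.groups = groups
        z_dim = 2 * groups * n_fields                     # size of the summary statistic Z
        self.excite = nn.Sequential(
            nn.Linear(z_dim, max(1, z_dim // reduction)), # thin FC layer (W_2), then ReLU
            nn.ReLU(),
            nn.Linear(max(1, z_dim // reduction), n_fields * emb_dim),  # wide FC layer (W_3)
        )
        self.fuse_norm = nn.LayerNorm(emb_dim)            # layer norm used in the fuse step

    def forward(self, v):                                 # v: normalized embeddings (B, f, d)
        b, f, d = v.shape
        g = v.view(b, f, self.groups, d // self.groups)   # split each embedding into g groups
        stats = torch.stack([g.max(dim=-1).values, g.mean(dim=-1)], dim=-1)
        z = stats.flatten(start_dim=1)                    # Z: (B, 2gf), max/avg per group
        a = self.excite(z)                                # bit-wise attention A: (B, f*d)
        v_w = a.view(b, f, d) * v                         # re-weight: A (element-wise) N(V)
        fused = self.fuse_norm(v + v_w)                   # fuse: skip connection + LayerNorm
        return fused.flatten(start_dim=1)                 # V^{SENet+}: (B, f*d), Eq. (7)
```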

Concatenation Layer. We concatenate the outputs of the two branches to form the input of the following MLP layers:

(8) $\mathbf{H}_{0}=\mathrm{concat}[\mathbf{H}^{LRL},\mathbf{V}^{SENet+}]$

3.2. Discussion

In this section, we discuss the model size difference between FiBiNet and FiBiNet++. Note that only non-embedding parameters are considered, as they better reflect model complexity.

The majority of FiBiNet's parameters come from two components: the connection between the first MLP layer and the outputs of the two bi-linear modules, and the linear part. Suppose we denote $h=400$ as the size of the first MLP layer, $f=50$ as the field number, $d=10$ as the feature embedding size, and $t=1$ million as the feature number. The number of parameters in these two parts is then nearly 10.8 million:

(9) $\mathbf{T}^{FiBiNet}=\underbrace{f\times(f-1)\times d\times h}_{\text{MLP and bi-linear}}+\underbrace{t}_{\text{linear}}=10.8\ \text{million}$

For FiBiNet++, the majority of model parameters come from three components: the connection between the first MLP layer and the embedding produced by the SENet+ module (1st part), the connection between the first MLP layer and the "Low Rank Layer" (2nd part), and the parameters between the "Low Rank Layer" and the bi-linear feature interaction results (3rd part). Let $m=50$ denote the size of the "Low Rank Layer". The number of parameters of these components is as follows:

(10) $\mathbf{T}^{FiBiNet++}=\underbrace{f\times d\times h}_{\text{1st part}}+\underbrace{m\times h}_{\text{2nd part}}+\underbrace{\frac{f\times(f-1)}{2}\times m}_{\text{3rd part}}=0.28\ \text{million}$

We can see that the above methods decrease the model size from 10.8 million to 0.28 million parameters, which is nearly 39x model compression. In addition, the larger the field number $f$, the larger the compression ratio we can achieve. The "Low Rank Layer" is clearly the key to the high compression ratio, and it can also be applied to other CTR models with long feature interaction layers, such as ONN (Yang et al., 2020) and FAT-DeepFFM (Zhang et al., 2019), for feature interaction compression.
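The two counts in Eqs. (9) and (10) can be reproduced with a few lines of arithmetic under the stated settings (h=400, f=50, d=10, t=1 million, m=50):

```python
h, f, d, t, m = 400, 50, 10, 1_000_000, 50

fibinet = f * (f - 1) * d * h + t                      # Eq. (9): MLP/bi-linear part + linear part
fibinet_pp = f * d * h + m * h + f * (f - 1) // 2 * m  # Eq. (10): the three parts

print(fibinet, fibinet_pp, round(fibinet / fibinet_pp, 1))
# 10800000 281250 38.4  -> roughly 10.8M vs. 0.28M, i.e. ~39x compression
```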

4. Experimental Results

4.1. Experiment Setup

Datasets. Three datasets are used in our experiments, and we randomly split instances by 8:1:1 for training, validation and testing: (1) Criteo (http://labs.criteo.com/downloads/download-terabyte-click-logs/): a display ad dataset with 26 anonymous categorical fields and 13 continuous feature fields. (2) Avazu (http://www.kaggle.com/c/avazu-ctr-prediction): contains 23 fields that describe a single ad impression. (3) KDD12 (https://www.kaggle.com/c/kddcup2012-track2): has 13 fields, spanning from user id to ad position, for click data.

Table 1. Overall performance (AUC) and non-embedding parameters of different models
Avazu Criteo KDD12
Model AUC(%) Params. AUC(%) Params. AUC(%) Params.
FM 78.17 1.54M 78.97 1.0M 77.65 5.46M
DNN 78.67 0.74M 80.73 0.48M 79.54 0.37M
DeepFM 78.64 2.29M 80.58 1.48M 79.40 5.84M
xDeepFM 78.88 4.06M 80.64 4.90M 79.51 6.91M
DCN 78.68 0.75M 80.73 0.48M 79.58 0.37M
AutoInt+ 78.62 0.77M 80.78 0.48M 79.69 0.38M
DCN v2 78.98 4.05M 80.88 0.65M 79.66 0.39M
FiBiNet 79.12 10.27M 80.73 7.25M 79.52 6.41M
FiBiNet++ 79.15 0.81M 81.10 0.56M 79.98 0.40M
Improv. +0.03 12.7x +0.37 12.9x +0.46 16x
Figure 2. Effect of Hyper-Parameters

Models for Comparisons We compare the performance of the FM(Rendle, 2010), DNN, DeepFM(Guo et al., 2017), DCN (Wang et al., 2017), AutoInt (Song et al., 2019), DCN V2 (Wang et al., 2021), xDeepFM (Lian et al., 2018) and FiBiNet (Huang et al., 2019) models as baselines and AUC is used as the evaluation metric.

Figure 3. Comparison of Model Size

Implementation Details. For optimization, we use Adam with a mini-batch size of 1024 and a learning rate of 0.0001. We fix the field embedding dimension for all models to 10 for the Criteo dataset, 50 for Avazu and 10 for KDD12. For models with a DNN part, the depth of hidden layers is set to 3, the number of neurons per layer is 400, and all activation functions are ReLU. In SENet+, the reduction ratio is set to 3 and the group number to 2 as the default settings. In the bi-linear+ module, we set the size of the low rank layer to 50.

4.2. Results and Analysis

Performance Comparison. Table 1 shows the performance of different SOTA baselines and FiBiNet++. The best results are in bold, and the best baseline results are underlined. We can see that: (1) FiBiNet++ outperforms all compared SOTA methods and achieves the best performance on all three benchmarks. (2) Among the strong baselines with a similar number of parameters, such as AutoInt+, DCN v2 and DeepFM, FiBiNet++ performs best on all three datasets. This demonstrates that FiBiNet++ outperforms other models because of its designed components rather than a larger number of parameters. (3) Compared with FiBiNet, FiBiNet++ achieves better performance on all datasets despite having far fewer parameters, which indicates that our proposed methods to enhance model performance are effective.

Model Size Comparison. We compare the non-embedding model size of different methods in Table 1 and Figure 3. FiBiNet++ provides an order-of-magnitude improvement in model size while improving model quality compared with FiBiNet. Specifically, FiBiNet++ reduces the model size of FiBiNet by 12.7x, 12.9x and 16x in terms of the number of parameters on the three datasets, respectively, which demonstrates that our proposed methods to reduce model parameters are effective. FiBiNet++ now has a model size comparable to the DNN model while outperforming all other models on the three datasets.

Training/Inference Efficiency. Efficiency is an essential concern in industrial applications, so we conduct experiments to compare the training and inference time of FiBiNet++ and FiBiNet. We use the time (in milliseconds) to process one batch of examples during training and inference as the efficiency metric. The average training and inference times of the two models are shown in Table 2. Compared with FiBiNet, the training efficiency of FiBiNet++ increases by 58.76%, 62.30% and 39.39%, while inference efficiency increases by 41.66%, 81.03% and 37.50% on the three datasets, respectively. FiBiNet++ shows a significant advantage in training and inference efficiency, which makes it more practical to apply in real-life systems.

Hyper-Parameters of FiBiNet++. Next, we study the hyper-parameter sensitivity of FiBiNet++. (1) Group Number. Figure 2a shows a slight performance increase as the group number grows, which indicates that a larger group number benefits model performance because more useful information from the feature embedding is fed into the SENet+ module.

(2) Reduction Ratio. We adjust the reduction ratio in the SENet+ module from 1 to 9, and Figure 2b shows the results. Performance is better when the reduction ratio is set to 3 or 9. (3) Size of Low Rank Layer. Figure 2c shows the impact of adjusting the size of the Low Rank Layer in the bi-linear+ module. Performance begins to decrease when the size is set greater than 150, which demonstrates that representing feature interactions in a low rank space indeed works.

Table 2. Training and inference efficiency comparison
Avazu Criteo KDD12
Model Train(ms) Infer(ms) Train(ms) Infer(ms) Train(ms) Infer(ms)
FiBiNet 97 12 191 58 33 8
FiBiNet++ 40 7 72 11 20 5
Improv. +58.76% +41.66% +62.30% +81.03% +39.39% +37.50%

5. Conclusion

In this paper, we propose the FiBiNet++ model to greatly reduce model size while improving model performance. Experimental results show that FiBiNet++ provides an order-of-magnitude improvement in model size while improving model quality.

References

  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems. ACM, 7–10.
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems. ACM, 191–198.
  • Deng et al. (2021) Wei Deng, Junwei Pan, Tian Zhou, Deguang Kong, Aaron Flores, and Guang Lin. 2021. Deeplight: Deep lightweight feature interactions for accelerating ctr predictions in ad serving. In Proceedings of the 14th ACM international conference on Web search and data mining. 922––930.
  • Graepel et al. (2010) Thore Graepel, Joaquin Quinonero Candela, Thomas Borchert, and Ralf Herbrich. 2010. Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft’s bing search engine. Omnipress.
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
  • He et al. (2014) Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. 2014. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. ACM, 1–9.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
  • Huang et al. (2019) Tongwen Huang, Zhiqi Zhang, and Junlin Zhang. 2019. FiBiNET: combining feature importance and bilinear feature interaction for click-through rate prediction. In Proceedings of the 13th ACM Conference on Recommender Systems, RecSys 2019, Copenhagen, Denmark, September 16-20, 2019. ACM, 169–177. https://doi.org/10.1145/3298689.3347043
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
  • Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
  • Lian et al. (2018) Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xdeepfm: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1754–1763.
  • McMahan et al. (2013) H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, and et al. 2013. Ad Click Prediction: A View from the Trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chicago, Illinois, USA) (KDD ’13). Association for Computing Machinery, New York, NY, USA, 1222–1230. https://doi.org/10.1145/2487575.2488200
  • Qu et al. (2016) Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 1149–1154.
  • Rendle (2010) Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000.
  • Song et al. (2019) Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. Autoint: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1161–1170.
  • Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17. ACM, 12.
  • Wang et al. (2021) Ruoxi Wang, Rakesh Shivanna, Derek Zhiyuan Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed H. Chi. 2021. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems. In WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021. 1785–1797.
  • Xiao et al. (2017) Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional factorization machines: Learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617 (2017).
  • Yang et al. (2020) Yi Yang, Baile Xu, Shaofeng Shen, Furao Shen, and Jian Zhao. 2020. Operation-aware Neural Networks for user response prediction. Neural Networks 121 (2020), 161–168. https://doi.org/10.1016/j.neunet.2019.09.020
  • Zhang et al. (2019) Junlin Zhang, Tongwen Huang, and Zhiqi Zhang. 2019. FAT-DeepFFM: Field Attentive Deep Field-aware Factorization Machine. In Industrial Conference on Data Mining. https://api.semanticscholar.org/CorpusID:155099971
  • Zhu et al. (2020) Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, and Xiuqiang He. 2020. FuxiCTR: An Open Benchmark for Click-Through Rate Prediction. ArXiv abs/2009.05794 (2020).