
BCRLSP: An Offline Reinforcement Learning Framework
for Sequential Targeted Promotion

Fanglin Chen (flchen@bus.miami.edu), University of Miami, Miami, FL, USA; Xiao Liu (xliu@stern.nyu.edu), New York University, New York, NY, USA; Bo Tang (tangbo.t@alibaba-inc.com), Alibaba Group, Hangzhou, China; Feiyu Xiong (feiyu.xfy@alibaba-inc.com), Alibaba Group, Hangzhou, China; Serim Hwang (serimh@andrew.cmu.edu), Carnegie Mellon University, Pittsburgh, PA, USA; and Guomian Zhuang (guomian.zgm@alibaba-inc.com), Alibaba Group, Hangzhou, China
(2022)
Abstract.

We utilize an offline reinforcement learning (RL) model for sequential targeted promotion under budget constraints in a real-world business environment. In our application, the mobile app aims to boost customer retention by sending cash bonuses to customers while controlling the cost of these bonuses in each time period. To achieve this multi-task goal, we propose the Budget Constrained Reinforcement Learning for Sequential Promotion (BCRLSP) framework to determine the value of the cash bonus sent to each user. We first learn the target policy and the associated Q-values that maximize the user retention rate using an RL model. A linear programming (LP) model is then added to satisfy the constraints on promotion costs. We solve the LP problem by maximizing the Q-values of actions learned from the RL model subject to the budget constraints. During deployment, we combine the offline RL model with the LP model to generate a robust policy under the budget constraints. Using both online and offline experiments, we demonstrate the efficacy of our approach by showing that BCRLSP achieves a higher long-term customer retention rate and a lower cost than various baselines. Taking advantage of the near real-time cost control method, the proposed framework can easily adapt to data with a noisy behavioral policy and/or meet flexible budget constraints.

online promotion, customer retention, reinforcement learning, linear programming
copyright: acmcopyright; journal year: 2022; conference: The 3rd Workshop on Deep Reinforcement Learning for Information Retrieval at SIGIR’22, July 15, 2022, Madrid, Spain; CCS concepts: Information systems → Online shopping; Theory of computation → Reinforcement learning; Theory of computation → Linear programming

1. Introduction

For a mobile app to be successful, it must have loyal users who remain active on the app over time. One important digital marketing tool for boosting user retention is sequential targeted promotion. On the one hand, targeted promotion has been widely applied by mobile apps, as they have greater access to customers’ information and a better understanding of customer behavior in the era of big data. On the other hand, sequential promotion helps mobile apps build customer loyalty and increase customer lifetime value over time, ultimately boosting profits in the long run. One form of sequential targeted promotion, which combines the two, is to send personalized cash bonuses sequentially to users as long as they keep using the app.

In this paper, we design a policy to allocate such cash bonuses throughout the lifecycle of each customer, aiming to maximize user retention while satisfying the budget constraint during each time period. We use the reinforcement learning (RL) framework to formulate the problem of sequential targeted promotion as a Markov decision process (MDP). Reinforcement learning algorithms enable us to dynamically adjust the cash bonuses over time for each customer in order to maximize long-run user engagement. Sequential targeting is superior to static targeting because a customer’s reaction to a given cash bonus may depend not only on the current cash bonus but also on all previously received cash bonuses. Therefore, the app’s decision on the value of cash bonuses has long-term consequences instead of only immediate outcomes.

In the e-commerce setting, the target policy should be trained and tested in an offline manner from logged data for two major reasons. First, online training relies on real-time interactions with customers, and it is often prohibitively expensive to collect enough data to train effective models. Offline data, in contrast, are naturally recorded because the interactions already take place in the existing environment. Second, during online training, the proposed policy interacts with real users without being evaluated in advance. The firm therefore has little control over the online process, and the learned policy could perform even worse than the behavioral policy, resulting in lower profits and/or higher costs.

There are two major technical challenges in applying offline RL to sequential targeted promotion. One challenge results from the distributional shift of actions due to the discrepancy between the behavior policy and the target policy. This is a fundamental challenge in almost all deployments of offline RL (Levine et al., 2020). The other challenge is that the problem is multi-task: both optimality and feasibility (i.e., costs cannot exceed the budget) need to be assessed before the policy can be deployed to real users.

In this paper, we focus on sequential targeted promotion with constraints in each time period, in order to fulfill flexible budget constraints that mimic real-world industrial settings. To address the distributional shift issue, we use a batch-constrained, model-free RL model, specifically batch-constrained Q-learning (Fujimoto et al., 2019b, a). We choose model-free solutions over model-based solutions because customer behavior in an information-rich environment is too complicated to model or simulate without bias. To address the multi-task challenge, we incorporate budget constraints by adding a linear programming model that maximizes the Q-values learned from the RL model subject to the budget constraints.

The main contributions of our work are summarized as follows:

  • We formalize the problem of allocating sequential targeted cash bonuses among customers and propose a two-stage framework, Budget Constrained Reinforcement Learning for Sequential Promotion (BCRLSP), to solve it. Specifically, we combine an offline RL model with a near real-time LP model to generate a robust policy under the budget constraints.

  • The BCRLSP framework is verified empirically through both offline and online experiments. The results of offline experiments show that our solution consistently outperforms the baseline models in boosting customer retention while controlling promotion costs. Online A/B tests also demonstrate the effectiveness of our method in real-life applications.

  • Since the RL model requires significantly more running time and computational resources than the LP model, we train the RL model using logged offline data but solve the LP model with online data. In this way, the target policy needs to be trained only once but can be adjusted to satisfy any budget constraints flexibly.

2. Related Literature

2.1. Reinforcement Learning in Digital Marketing

In recent years, we have witnessed the rapid development of RL techniques and widespread applications of RL. In digital marketing, RL is expected to revitalize the industry and modernize various operations. For example, prior research has applied RL to solve digital marketing problems related to search (Wei et al., 2017; Xia et al., 2017; Xiao et al., 2019b; Hu et al., 2018), recommendation (Zhao et al., 2018b; Cai et al., 2018; Theocharous et al., 2015; Zhao et al., 2018c), online advertising (Yang and Lu, 2016; Zhao et al., 2018a; Jin et al., 2018; Cai et al., 2017; Wu et al., 2018; Schwartz et al., 2017), and pricing (Misra et al., 2019).

However, most prior works in this stream of literature focus on a single objective. For example, budgets are usually not involved in search or recommendation applications. Our paper differs by solving a multi-task problem: return maximization under constraints. Although some RL applications in online advertising also consider budget constraints, for instance, Cai et al. (2017) and Wu et al. (2018) use RL to solve the budget-constrained real-time bidding problem, they do not allow for offline training to ensure that the costs of the proposed policy are controlled.

The most relevant work to our study is Xiao et al. (2019a), which uses a model-based MDP with a global constraint for sequential targeted promotion. Our work differs from Xiao et al. (2019a) in two major aspects. First, because the state space is high-dimensional and estimating the transition dynamics is very challenging, we choose model-free algorithms over model-based ones. Second, their CMDP setting controls the average budget over the whole trajectory of each user, while the budget constraint in BCRLSP is set as the average cost per time step, which is more flexible and easier for the company to adjust.

2.2. Reinforcement Learning Algorithm with Constraints

In standard reinforcement learning (RL), a learning agent seeks to optimize the long-term return, which is based on a real-valued reward signal (Sutton et al., 1998). In our setting, however, we want to ensure reasonable performance and at the same time respect budget constraints during deployment. RL algorithms with constraints are mostly formulated as a constrained Markov decision process (CMDP), introduced by Altman (1999). In CMDPs, the agent’s goal is to maximize the long-run rewards while satisfying linear constraints on the long-run costs. Altman (1999) gives an LP-based solution under the assumption that the CMDP has known dynamics and finite states.

Building on that, several model-free approaches have been proposed to solve RL with constraints in high-dimensional settings. Achiam et al. (2017) propose constrained policy optimization (CPO), which focuses on safe online exploration and solves CMDPs with continuous actions during the learning process. Le et al. (2019) describe a batch offline algorithm with PAC-style guarantees for CMDPs. Later, Miryoosefi et al. (2019) generalize this algorithm to handle arbitrary convex constraints.

Although CMDP is the mainstream framework for RL with constraints, the long-run cost constraint in CMDPs does not satisfy the requirements of most business applications. CMDP models treat cost as part of the return and maximize the accumulated return. In practice, companies typically set a budget constraint for each period of time, such as a week or a month, and it is difficult to translate such requirements into a constraint on long-run costs. In contrast, our BCRLSP method can meet the dynamic budget requirement set by the company by applying constrained optimization after solving the RL model. Our constraint is set per time period, which is more flexible and easier for the company to adjust. As a result, we have exact control over the budget in near real-time.

3. Context

The context of our problem is the daily check-in cash bonus strategy used by Taobao Special Offer Edition, a mobile shopping app launched by Alibaba in 2019 that focuses on ultra-competitive product pricing. The mechanism of this strategy is shown in the lower part of Figure 1. When a user opens the Taobao Special Offer Edition app, he/she can click a button and enter the daily check-in page. A cash bonus can be claimed after clicking the check-in button, and the corresponding amount can be subtracted from the final payment of a purchase made within 24 hours. Each user can claim at most one cash bonus per day. The app needs to design a policy that optimally allocates the cash bonuses to each consumer on each day they log in.

Two important features of the setting make this a constrained optimization problem. First, the average cash bonus distributed to each consumer on each day is limited. This daily quota is clearly predefined by operational business units for economic reasons. Models without such a constraint may bring substantial economic losses, as a large number of abnormally high-value cash bonuses may be sent out. Such a constraint also makes the A/B test meaningful and comparable: if two policies operate under different budget constraints, the one with a larger budget could more easily create higher user retention. Therefore, a fair comparison of different policies should impose the same constraint.

Second, the set of available cash bonuses is discrete and constrained. One cycle is at most seven days long, and the cycle restarts once the user collects four cash bonuses in the cycle. In each cycle, the available cash bonuses vary by day. In order to encourage users to log in to the app frequently, the fourth cash bonus in a cycle will be larger than the first three. There are two sets of cash bonuses from which the value of cash bonuses is chosen. One is for normal cash bonuses and the other is for super (i.e., large) cash bonuses. Users that are on the first three days of the cycle may only receive a normal cash bonus, whereas users that are on the fourth day of the cycle may only receive a super cash bonus. Note that this rule can be regarded as public information and we may assume that users know the rule well. We show some examples of possible cycles in sequences A, B, and C in the upper part of Figure 1, with the super cash bonuses highlighted.

Figure 1. Cycles and cash bonuses in Taobao Special Offer Edition
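To make this claiming rule concrete, the sketch below (a minimal Python illustration with hypothetical names, not the production logic) maps a user's progress within the current cycle to the set of bonus levels the agent may choose from:

```python
def feasible_bonus_levels(bonuses_claimed_in_cycle: int,
                          normal_bonuses: list, super_bonuses: list) -> list:
    """Return the bonus levels available to a user, per the check-in cycle rule:
    the first three claims in a cycle draw from the normal set, the fourth claim
    draws from the super set, after which the cycle restarts (a cycle also expires
    after seven days; that timer is handled elsewhere and omitted here)."""
    if bonuses_claimed_in_cycle < 3:
        return normal_bonuses
    return super_bonuses
```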

4. Model

In this section, we describe the details of the BCRLSP framework and the baseline models. To generate a robust policy under the budget constraints, our proposed BCRLSP framework consists of two modules: an offline RL model followed by a near real-time LP model.

Because of this combination of two modules, BCRLSP can not only solve the constrained optimization problem efficiently but also adapt to the flexible budget constraints found in business. In practice, companies set a budget constraint for each period, such as a week or a month. Our BCRLSP method can meet the dynamic budget requirement set by the company by applying constrained optimization after solving the RL model. The constraint is set per time period, which is more flexible and easier for the company to adjust. Specifically, we have exact control over the budget in near real-time.

The model framework is shown in Figure 2. As this is a daily sign-in problem, the data are updated daily. Specifically, we first train the RL module on the data in an offline manner. For every user with login records in the platform history, we run the RL model to predict the next (action, Q-value) pair for the user, where the action is the amount of cash bonus to be assigned to the user if he/she logs in the next day. Then, in the online setting, each time a user interacts with the platform, a new record is added to the online data, namely the (action, Q-value) pair of the user predicted by the offline RL model. With the online data, the linear programming module updates the policy under the constraint of the required per-person budget; this loop is sketched in code below.

Figure 2. BCRLSP Model framework
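The loop in Figure 2 can be summarized by the sketch below; `rl_model.fit`, `rl_model.q_values`, and `lp_solver` are hypothetical placeholder interfaces for the offline BCQ module and the near real-time LP module described in the following subsections, not the actual system APIs.

```python
def daily_offline_stage(logged_trajectories, rl_model):
    """Offline stage (run once per day): fit the RL model on logged trajectories and
    precompute, for every user seen in the logs, the Q-value of each candidate bonus."""
    rl_model.fit(logged_trajectories)
    return {user: rl_model.q_values(user) for user in logged_trajectories.users}

def online_stage(active_users, q_table, budget_per_user, lp_solver):
    """Online stage (near real-time): re-solve the budget-constrained LP over the
    users currently checking in, using the precomputed Q-values as the objective,
    and return one bonus level per user."""
    q_matrix = [q_table[user] for user in active_users]
    return lp_solver(q_matrix, budget_per_user)
```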

4.1. MDP formulation

In this research, we aim to learn a policy from offline data such that, when deployed online, it maximizes the cumulative user retention rate on Taobao Special Offer Edition while keeping costs within constraints. We address this problem with a forward-looking sequential targeting method based on a model-free deep reinforcement learning algorithm. A learning agent chooses a value of cash bonus under a policy, and that value is allocated to the user once claimed. We define the act of choosing the value of the cash bonus as an action. We formulate the sequential targeted promotion problem as a Markov decision process (MDP).

Interactions between the agent and the environment (users collecting the cash bonuses) are recorded in logged offline data $\mathcal{D}$, which is a set of trajectories $\tau_i$. Let $T$ be the maximum number of time steps we consider. (We treat one cycle of user behavior as a trajectory and restart a trajectory when a cycle ends, so $T=4$ in our context.) A trajectory is a sequence of user states, agent actions, and rewards, $\tau_i=\{(s_t^i,a_t^i,r_t^i)\}_{t=1}^{T}$ for time step $t$ and customer $i$. (For simplicity, in the rest of the paper, we denote a trajectory as $\tau$ without the subscript $i$.) The state $s_t^i$ is the user characteristics, the action $a_t^i$ is the value of the cash bonus, and the reward $r_t^i$ is a 0-1 value indicating whether or not the user logs in the next day. Note that $\mathcal{D}$ is collected from the behavior policy $\pi_b$ that interacted with the real users in the past, so that $a_t^i\sim\pi_b(\cdot\mid s_t^i)$. Each action $a_t^i$ has a corresponding cost $c_t^i$.
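For concreteness, a logged transition and trajectory can be represented as below; this is our own simplified sketch, and the field names are illustrative rather than the platform's actual data schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    state: List[float]   # user features s_t (demographic and behavioral states)
    action: int          # index of the cash-bonus level a_t chosen by the policy
    reward: float        # r_t: 1.0 if the user logs in the next day, else 0.0
    cost: float          # c_t: the cash-bonus cost actually incurred

# A trajectory tau_i is one check-in cycle of a single user (at most T = 4 steps),
# and the offline dataset D is the set of such trajectories logged under pi_b.
Trajectory = List[Transition]
Dataset = List[Trajectory]
```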

4.2. Batch Constrained Deep Reinforcement Learning

The first part of our model is reinforcement learning, which is solved by the Batch-Constrained Q-learning (BCQ) algorithm in the discrete action space (Fujimoto et al., 2019a). We choose BCQ in order to restrict the action space and force the agent to behave close to the behavior policy; this matters economically because ours is a cost-sensitive problem. We train the default BCQ model from (Fujimoto et al., 2019a) on the offline data to estimate the expected return of all state-action pairs.

User Behavior Model

Given users’ trajectories $\mathcal{D}=\{\tau_{1},\dots,\tau_{n}\}$, we use a neural network $\mathcal{U}_{\omega}$ with parameters $\omega$ to model the unobserved behavioral policy that generated the offline data.

Agent

At each time step, the reinforcement learning agent faces state $s$, takes action $a$, receives reward $r$, and transitions to the next state $s'$. The agent aims to maximize the discounted sum of rewards (i.e., the return) $R_t=\sum_{i=t+1}^{T}\gamma^{i} r(s_i,a_i,s_{i+1})$, where $\gamma\in[0,1)$ is the discount factor. The agent takes the logged feedback data $\mathcal{D}$ as input and outputs the optimal policy $\pi^*(a|s)$ and the corresponding optimal value function $Q^*(s,a)$ that maximize the return $R_t$. In each iteration of the training process, the action with the highest $Q$-value is selected among the candidate actions whose relative probability under the behavioral policy is above a certain threshold. Specifically, the chosen action is characterized as below:

(1) $\pi(a\mid s)=\operatorname*{arg\,max}_{\,a:\ \mathcal{U}_{\omega}(a\mid s)/\max_{\hat{a}}\mathcal{U}_{\omega}(\hat{a}\mid s)>\xi}\ Q_{\theta}(s,a)$

where $\xi$ is the threshold that an action's relative probability has to exceed in order to be considered, and $Q_{\theta}$ is the value function approximated by a neural network with parameters $\theta$. Note that $a$ is selected only among actions satisfying $\mathcal{U}_{\omega}(a\mid s)/\max_{\hat{a}}\mathcal{U}_{\omega}(\hat{a}\mid s)>\xi$, i.e., the probability of $a$ relative to the most probable action $\hat{a}$ under the behavioral policy needs to exceed the threshold $\xi$. If we set the threshold $\xi=0$, the model reduces to standard deep Q-learning; if $\xi=1$, the model becomes a simulator of the actions contained in the batch. Next, we update the value function $Q_{\theta}$ through Q-learning (Watkins, 1989). Specifically, we minimize the following loss function on mini-batches:

$\mathcal{L}(\theta)=l_{\kappa}\Bigl(r+\gamma\max_{\,a':\ \pi_{b}(a'\mid s')/\max_{\hat{a}}\pi_{b}(\hat{a}\mid s')>\xi}Q_{\theta'}(s',a')-Q_{\theta}(s,a)\Bigr)$

where $l_{\kappa}$ denotes the Huber loss (Huber et al., 1964):

(2) $l_{\kappa}(\delta)=\begin{cases}0.5\,\delta^{2} & \text{if } |\delta|\leq\kappa\\ \kappa\,(|\delta|-0.5\kappa) & \text{otherwise}\end{cases}$
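As a minimal sketch of the discrete BCQ update, assuming a single transition and precomputed network outputs (the actual implementation trains neural networks on mini-batches; terminal-state handling and the target-network update are omitted here):

```python
import numpy as np

def bcq_select_action(q_values, behavior_probs, xi=0.3):
    """Eq. (1): keep actions whose relative probability under the estimated behavior
    policy exceeds xi, then pick the highest-Q action among them."""
    mask = behavior_probs / behavior_probs.max() > xi
    return int(np.where(mask, q_values, -np.inf).argmax())

def bcq_td_error(r, q_next_target, behavior_probs_next, q_sa, gamma=1.0, xi=0.3):
    """TD error fed to the Huber loss: the max in the target is taken only over
    next actions that pass the same behavior-policy filter."""
    mask = behavior_probs_next / behavior_probs_next.max() > xi
    target = r + gamma * np.where(mask, q_next_target, -np.inf).max()
    return target - q_sa

def huber_loss(delta, kappa=1.0):
    """Huber loss l_kappa in Eq. (2); kappa=1.0 is an assumed default."""
    return 0.5 * delta ** 2 if abs(delta) <= kappa else kappa * (abs(delta) - 0.5 * kappa)
```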

4.3. Near Real-Time Linear Programming

The second part of our model is a linear programming model that follows the RL module. It is natural to use the LP method for incentive allocation problems in a business setting, as it solves the large-scale constrained optimization problem efficiently. To maximize the return, i.e., the long-term retention of users, under the given budget constraints, we use the Q-values of actions learned from BCQ as the input to the LP model.

The Q-values are a result of the unconstrained RL algorithm. Since the Q-value represents the expected return $R_{t}$ starting from state $s$, following policy $\pi$, and taking action $a$, the LP model essentially maximizes the total return across all customers while satisfying the budget constraint.

For a group of $N$ customers, each customer $i$ is assigned one and only one cash-bonus level $j$ (i.e., action) among $M$ levels. The objective is to maximize the total Q-value across all customers under the constraint on the average cost. The problem is formulated as below:

(3) $\operatorname*{arg\,max}_{x_{ij}}\ \sum_{i=1}^{N}\sum_{j=1}^{M}q_{ij}x_{ij}$
$\text{s.t.}\quad \sum_{i=1}^{N}\sum_{j=1}^{M}c_{j}x_{ij}\leq\sum_{i=1}^{N}\sum_{j=1}^{M}\bar{c}\,x_{ij}$
$\qquad\ \ \sum_{j=1}^{M}x_{ij}=1,\quad\forall i=1,\dots,N$
$\qquad\ \ x_{ij}\in\{0,1\}$

where $x_{ij}$ denotes whether customer $i$ receives action $j$, $q_{ij}$ is the Q-value corresponding to the state of customer $i$ and action $j$, $c_{j}$ is the cost incurred by action $j$, and $\bar{c}$ is the average cost externally set by the company.

It is worth noting that our budget constraint is on the average cost per customer per day, so the total budget is not fixed but rather depends on the number of users who log in on a specific day. This setting fits our goal of increasing the number of active users: if we instead set an upper limit on the total budget, then as the number of users grows, the average cash bonus each user receives would drop, and it would become harder and harder to keep new customers active.
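For illustration, the primal problem in Equation (3) can be written as a standard LP once the binary variables are relaxed to [0, 1]; the SciPy sketch below is our own illustration of that relaxation, not the production solver (which works on the dual, as described next).

```python
import numpy as np
from scipy.optimize import linprog

def solve_primal_lp(q, costs, avg_budget):
    """LP relaxation of the assignment problem in Eq. (3).

    q: (N, M) array of Q-values q_ij from the RL model.
    costs: (M,) array of per-action costs c_j.
    avg_budget: scalar c_bar, the allowed average cost per user.
    Returns an (N, M) matrix of (fractional) assignments.
    """
    N, M = q.shape
    c = -q.reshape(-1)                       # linprog minimizes, so negate the Q-values

    # Budget constraint: sum_ij c_j x_ij <= N * c_bar (since each user gets one action)
    A_ub = np.tile(costs, N).reshape(1, -1)
    b_ub = np.array([N * avg_budget])

    # Each user gets exactly one action: sum_j x_ij = 1 for all i
    A_eq = np.zeros((N, N * M))
    for i in range(N):
        A_eq[i, i * M:(i + 1) * M] = 1.0
    b_eq = np.ones(N)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0.0, 1.0), method="highs")
    return res.x.reshape(N, M)
```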

By introducing a Lagrange multiplier $\lambda$, we have the dual problem:

(4) $\min_{\lambda}\max_{x_{ij}}\ \sum_{i=1}^{N}\sum_{j=1}^{M}q_{ij}x_{ij}-\lambda\Bigl(\sum_{i=1}^{N}\sum_{j=1}^{M}c_{j}x_{ij}-N\bar{c}\Bigr)$
$\text{s.t.}\quad \sum_{j=1}^{M}x_{ij}=1,\quad\forall i=1,\dots,N$
$\qquad\ \ x_{ij}\in\{0,1\}$
$\qquad\ \ \lambda\geq 0$

Note that the formulation above approximates the 0-1 integer program in Equation (3), which is acceptable in exchange for the efficiency of solving the problem. We further transform the dual problem into a problem only with respect to $\lambda$:

(5) $\min_{\lambda}\ \sum_{i=1}^{N}\max_{1\leq j\leq M}\{q_{ij}-\lambda c_{j}\}+\lambda N\bar{c}$
$\text{s.t.}\quad \lambda\geq 0$

After solving for $\lambda$, the $x_{ij}$ in the original problem is given as:

(6) $x_{ij}=1\quad\text{when}\quad j=\operatorname*{arg\,max}_{\,j:\ q_{ij}-\lambda(c_{j}-\bar{c})\geq 0}\ \bigl(q_{ij}-\lambda(c_{j}-\bar{c})\bigr)$

The $\lambda$ parameter reflects how sensitive users are to cash bonuses. If $\lambda$ is large, users' retention rates do not change much as cash bonuses increase; the optimal strategy is then to assign small cash bonuses, which barely affects retention but significantly decreases costs. On the contrary, a small $\lambda$ indicates that users are very sensitive to cash bonuses, and we need to send large cash bonuses to make sure they continue to use the app.

The key to solving the LP problem is to calculate $\lambda$ such that (1) it accurately reflects the distribution of user sensitivity and (2) it is updated quickly. The first goal requires the distribution of user sensitivity to be stable over time. Because $\lambda$ is solved from historical data, we calculate it from the logged data of the past 24 hours up to the current time point, rather than from logged data over a shorter period. The distribution of user sensitivity on the previous day is similar to that on the current day, which enables us to obtain more accurate $\lambda$ values and better control of costs.

As for the second goal, we update the logged data and recalculate the corresponding $\lambda$ every 10 minutes. Theoretically, we could update $\lambda$ more frequently or even in real time, and a shorter time interval would also give more accurate results. However, increasing the frequency increases computational time, and we find that a 10-minute interval achieves the best balance between accuracy and efficiency.
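A minimal sketch of this procedure, minimizing the scalar dual objective in Equation (5) and assigning actions by Equation (6); the search bound `lam_max` and the fallback for users with no action passing the filter are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dual_objective(lam, q, costs, avg_budget):
    """Eq. (5): sum_i max_j (q_ij - lam * c_j) + lam * N * c_bar."""
    n_users = q.shape[0]
    return np.max(q - lam * costs[None, :], axis=1).sum() + lam * n_users * avg_budget

def assign_actions(q, costs, avg_budget, lam_max=100.0):
    """Solve for lambda on recent logged data, then assign one action per user."""
    res = minimize_scalar(dual_objective, bounds=(0.0, lam_max),
                          args=(q, costs, avg_budget), method="bounded")
    lam = res.x
    # Eq. (6): among actions with q_ij - lam * (c_j - c_bar) >= 0, pick the largest score.
    scores = q - lam * (costs[None, :] - avg_budget)
    masked = np.where(scores >= 0.0, scores, -np.inf)
    actions = masked.argmax(axis=1)
    # Fallback (our assumption): if no action passes the filter, send the cheapest bonus.
    actions[~np.isfinite(masked.max(axis=1))] = int(costs.argmin())
    return lam, actions
```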

4.4. Baseline Models

We compare the proposed approach with several alternatives to demonstrate the advantages and disadvantages of different approaches. The baseline models we use are briefly introduced as follows.

Expert Policy

The policy is designed by domain experts, and it allocates actions to customers according to different user types within the constrained action set. The expert policy is used as a default policy by the platform.

Rainbow

Another RL model that we compare ours with is the Rainbow model proposed by Hessel et al. (2018). It combines multiple extensions of DQN (Bellemare et al., 2017; Fortunato et al., 2017; Mnih et al., 2016; Schaul et al., 2015; van Hasselt et al., 2016; Wang et al., 2015) into an integrated RL agent and outperforms models with each component used separately. We replace BCQ with Rainbow and use the Q-values learned from Rainbow as the input $q_{ij}$ to the linear programming model.

Supervised Learning

We use three supervised learning methods, logistic regression, Gradient Boosted Decision Trees (GBDT), and Xgboost (Chen and Guestrin, 2016), as baseline models. These methods share the same framework and contain two components: (1) a reward prediction model that learns user behavior (whether or not the user will log in on the next day given the state and action), and (2) an action selection model that selects the best action according to the reward estimated by the reward prediction model.

Specifically, the reward $r_{t}$ for $(s_{t},a_{t})$ is calculated as $r_{t}(s_{t},a_{t})=f_{r}(s_{t},a_{t})$, where $f_{r}$ is a supervised learning model. As $r_{t}$ denotes whether or not the user logs in on the next day, this supervised learning model solves a classification problem, learning from the data to predict the reward as accurately as possible.

After the reward prediction model $f_{r}$ is trained, we use it to generate a policy:

(7) $\pi(a_{t}\mid s_{t})=\begin{cases}1 & a_{t}=\operatorname*{arg\,max}_{a_{t}\in\mathbf{A}}f_{r}(s_{t},a_{t})\\ 0 & \text{otherwise}\end{cases}$
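To illustrate this two-component baseline, the sketch below fits a GBDT reward model $f_r$ on concatenated (state, action) features and applies the greedy rule in Equation (7); the feature encoding is an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_reward_model(states, actions, retained):
    """Reward prediction model f_r: classify next-day retention from (state, action)."""
    X = np.column_stack([states, actions])
    model = GradientBoostingClassifier(learning_rate=0.1, max_depth=3, n_estimators=500)
    return model.fit(X, retained)

def greedy_policy(model, state, candidate_actions):
    """Eq. (7): choose the action with the highest predicted retention probability."""
    X = np.column_stack([np.tile(state, (len(candidate_actions), 1)), candidate_actions])
    probs = model.predict_proba(X)[:, 1]
    return candidate_actions[int(probs.argmax())]
```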

Using supervised learning to choose actions is actually a rather simplistic method as it does not consider the long-term return or the influence of current action on future state and reward. However, it is broadly practiced in the industry as it is easy to implement and interpret.

We also combine the supervised learning algorithms with LP. Instead of using Q-values in the linear programming model, we use the probability of retention predicted by the classification model $f_{r}$.

5. Experiments

In this section, we conduct empirical evaluations on real-world data in both online and offline settings to demonstrate that our solution can effectively generate a robust policy under budget constraints compared with baseline models. We first describe our data and experiment settings, then compare the overall performance of the proposed method with the baselines in offline experiments, and finally report the performance of the proposed method in the online environment with real traffic.

5.1. Real-World Dataset

We investigate the performance of BCRLSP on a large-scale real-world dataset. We collaborate with Alibaba Inc. and collect data from the Taobao Special Offer Edition App. This real-world data is used to train and evaluate our framework in an offline manner. The dataset comprises 1.4 million records over a continuous one-month period in December 2020. We split the interactions disjointly into training, validation, and test sets according to different time ranges: the first 2 weeks are used for training, and the remaining 2 weeks are used for offline testing.

State

To accurately depict the pattern of next-day retention, we need a set of informative features (i.e., state variables) for a given cash bonus collection record. Given our research agenda, the features should capture contextual and behavioral information that is either stable over time (i.e., static states) or changes with time (i.e., dynamic states). Contextual information includes user demographics, such as gender, age, and location, which portray static states that do not vary over time. Behavioral information includes user behavior variables that mainly capture the dynamic states of users.

User behavior variables can be further categorized into two subgroups, frequency and monetary states. Frequency states describe how frequently users claim and redeem the cash bonuses and whether they stay in the app given the cash bonuses. Monetary states characterize the monetary values of cash bonuses and order payments. All user behavior variables are created over five different historical time windows (1 day, 3 days, 5 days, 7 days, 15 days).
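As an illustration of how such multi-window frequency and monetary states might be assembled from raw logs (the column names are hypothetical, not the platform's actual schema):

```python
import pandas as pd

def behavior_features(logs: pd.DataFrame, now: pd.Timestamp,
                      windows=(1, 3, 5, 7, 15)) -> pd.DataFrame:
    """Aggregate per-user frequency and monetary states over several look-back windows.
    `logs` is assumed to have columns: user_id, ts, claimed, redeemed, bonus_value, payment."""
    features = []
    for w in windows:
        recent = logs[logs["ts"] >= now - pd.Timedelta(days=w)]
        agg = recent.groupby("user_id").agg(
            **{f"claims_{w}d": ("claimed", "sum"),            # frequency states
               f"redeems_{w}d": ("redeemed", "sum"),
               f"bonus_amount_{w}d": ("bonus_value", "sum"),  # monetary states
               f"payment_{w}d": ("payment", "sum")})
        features.append(agg)
    return pd.concat(features, axis=1).fillna(0.0)
```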

Action

We have 12 values of cash bonuses in total, so our action space has a size of 12. As explained in Section 3, these 12 actions are split into two sets: 10 normal cash bonuses, 0.65, 0.67, 0.71, 0.75, 0.79, 0.83, 0.87, 0.94, 1.01, and 1.05 Yuan, and 2 super cash bonuses, 1.72 and 1.82 Yuan. We include all 12 possible actions in our action set for training, but put a constraint on the feasible actions when making predictions.

Reward

As mentioned in Section 4.1, reward is a 0-1 value indicating whether or not the user logs in the next day. The details of the data and feature generation framework can be found in Supplementary Material.

5.2. Experiment Settings

For both the BCQ and Rainbow algorithms, we use a 2-layer fully connected neural network with 128-dimensional hidden states in the user behavioral model. We use a threshold of $\xi=0.3$ to eliminate actions when selecting the optimal action with BCQ. (We tried thresholds of $\xi=0.3, 0.4, 0.5, 0.6$ and the results are similar.) We use an exploration rate of $\epsilon=0.01$ to assign a random action in the exploration subsample. (We tried exploration rates between 0.01 and 0.1, and $\epsilon=0.01$ shows the best results.) The discount factor is set to $\gamma=1$ because the company values user retention over the whole trajectory and does not discount rewards across time periods. (Each trajectory has a finite length with a maximum of 4 steps, so $\gamma$ can be set to 1.) Training the RL models took roughly 5 hours on 6 CPU cores, 30 GB of memory, and 2 Tesla V100 GPUs. The hyper-parameters of the baseline models are set as follows: we use a grid search on the holdout validation dataset to find the best hyper-parameters for each supervised learning model. For GBDT: learning rate = 0.1, max depth = 3, number of estimators = 500. For Xgboost: learning rate = 0.05, max depth = 5, number of estimators = 500.
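For reference, the hyper-parameters reported above can be collected into a single configuration; values not stated in the paper (e.g., batch size, learning rate of the Q-network) are deliberately left out rather than guessed.

```python
CONFIG = {
    "rl": {
        "hidden_layers": [128, 128],  # our reading of "2-layer, 128-dimension hidden states"
        "bcq_threshold_xi": 0.3,      # thresholds 0.3-0.6 gave similar results
        "exploration_eps": 0.01,      # best among rates tried between 0.01 and 0.1
        "discount_gamma": 1.0,        # finite trajectories of length <= 4
    },
    "gbdt": {"learning_rate": 0.1, "max_depth": 3, "n_estimators": 500},
    "xgboost": {"learning_rate": 0.05, "max_depth": 5, "n_estimators": 500},
}
```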

We launch an offline experiment and calculate the evaluation results based on the approach proposed by Fonteneau et al. (2013). Specifically, we take all the states in the test dataset and treat them as users interacting with the agent. Given a state $s_t$ in the dataset, the agent generates an action $\hat{a}_t$ according to its policy. We then find the records in the dataset where the actual action given by the behavior policy matches the action estimated by the agent, i.e., $a_t=\hat{a}_t$. Denote these matching records as $\mathcal{M}=\{\tau_1,\dots,\tau_m\}\subset\mathcal{D}$, where $\tau_i=\{(s_t^i,a_t^i,r_t^i)\}_{t=1}^{T}$.

Evaluation Metrics

The goal of budget-constrained incentive marketing is to maximize the total retention rate under a fixed budget. Therefore, we evaluate the performance of agents on $\mathcal{M}$ with the following two metrics: (a) average retention rate:

(8) $\text{Ret}=\frac{1}{\sum_{j=1}^{m}T_{j}}\sum_{j=1}^{m}\sum_{t=0}^{T_{j}}r_{t}^{j}$

where $T_{j}$ denotes the length of trajectory $\tau_{j}$; (b) average cost:

(9) $\text{AvgCost}=\frac{1}{\sum_{j=1}^{m}T_{j}}\sum_{j=1}^{m}\sum_{t=0}^{T_{j}}c_{t}^{j}$
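A sketch of this evaluation protocol, reusing the hypothetical `Transition` records sketched in Section 4.1; for simplicity it matches logged and predicted actions at the transition level, whereas the metrics above are defined over the matched trajectories $\mathcal{M}$.

```python
import numpy as np

def offline_evaluate(dataset, policy):
    """Keep transitions where the logged action equals the agent's action (a_t == a_hat_t),
    then average rewards (Ret, Eq. 8) and costs (AvgCost, Eq. 9) over the matched records."""
    rewards, costs = [], []
    for trajectory in dataset:
        for tr in trajectory:
            if policy(tr.state) == tr.action:
                rewards.append(tr.reward)
                costs.append(tr.cost)
    return float(np.mean(rewards)), float(np.mean(costs))
```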

5.3. Offline Experiment Results

The results of the offline experiments are shown in Table 1. Our proposed BCRLSP model and all baseline models maintain the same average cost because the LP model is set under a fixed budget of 0.87 Yuan per customer per day. Under the same budget, the BCRLSP model outperforms all baseline models and achieves the highest retention rate. Compared with the second-best model (GBDT), BCRLSP increases the retention rate from 36.15% to 37.22%, a relative improvement of 2.96%. We train the models with the same data and hyper-parameters 10 times, and the retention rates vary within a range of 0.1%.

Algorithm Ret AvgCost (Yuan)
BCRLSP 37.22% 0.87
Rainbow 29.05% 0.87
GBDT 36.15% 0.87
LR 34.43% 0.87
Table 1. Offline experiment results

5.4. Online Experiment Results

We also evaluate the BCRLSP framework for a week in the online environment, i.e., with real-world cash bonuses on the check-in page of the Taobao Special Offer Edition App. The comparison baselines are the expert policy, Rainbow, GBDT, and logistic regression. To assess the effectiveness of the proposed framework, the platform assigns 92% of the online traffic to the expert policy and 2% of the online traffic to each of the BCRLSP, Rainbow, GBDT, and logistic regression models, which is large enough given the overall volume of user interactions. Table 2 lists the performance on the two main online metrics.

Compared with the expert policy, the 0.53% growth in retention rate shows that our proposed framework increases the retention rate of the platform when put into production. Note that during the online experiment, the firm decided to increase the budget to 1.60 Yuan; hence both the average cost and the retention rate are higher in the online experiment than in the offline one.

Algorithm Ret AvgCost (Yuan)
BCRLSP 50.81% 1.60
Rainbow 50.57% 1.60
GBDT 50.69% 1.60
LR 50.51% 1.60
Expert policy 50.54% 1.60
Table 2. Online experiment results

6. Conclusions

In this paper, we propose the BCRLSP framework to solve the sequential targeted promotion problem. We combine the offline deep reinforcement learning algorithm with the linear programming model to sequentially target customers with cash bonuses under the budget constraint. As the promotion cost constraint is vital to the company, the RL model is trained offline using logged data to avoid the risk of over-spending. After getting the Q values of actions corresponding to the states of each customer, we maximize the total value across all customers under the constraint of the average cost using the online linear programming model. Compared with other reinforcement learning and supervised learning algorithms, our approach achieves better performance in both customer retention rate and average cost.

Our work enables companies to better target customers using sequential promotion, which will result in higher customer engagement and loyalty in the long term. Companies will benefit from this research as they can build customer loyalty in a more cost-efficient way. At the same time, this does not mean that customers are put at a disadvantage, as they receive cash bonuses from companies. In other words, our work creates a win-win outcome for companies and customers by reducing the information friction and improving the total surplus in the market.

References

  • Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. 2017. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 22–31.
  • Altman (1999) Eitan Altman. 1999. Constrained Markov decision processes. Vol. 7. CRC Press.
  • Bellemare et al. (2017) Marc G Bellemare, Will Dabney, and Rémi Munos. 2017. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 449–458.
  • Cai et al. (2017) Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. 2017. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 661–670.
  • Cai et al. (2018) Qingpeng Cai, Aris Filos-Ratsikas, Pingzhong Tang, and Yiwei Zhang. 2018. Reinforcement Mechanism Design for e-commerce. In Proceedings of the 2018 World Wide Web Conference. 1339–1348.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785–794.
  • Fonteneau et al. (2013) Raphael Fonteneau, Susan A Murphy, Louis Wehenkel, and Damien Ernst. 2013. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of operations research 208, 1 (2013), 383–416.
  • Fortunato et al. (2017) Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. 2017. Noisy networks for exploration. arXiv preprint arXiv:1706.10295 (2017).
  • Fujimoto et al. (2019a) Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau. 2019a. Benchmarking Batch Deep Reinforcement Learning Algorithms. arXiv preprint arXiv:1910.01708 (2019).
  • Fujimoto et al. (2019b) Scott Fujimoto, David Meger, and Doina Precup. 2019b. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning. 2052–2062.
  • Hessel et al. (2018) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. 2018. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Hu et al. (2018) Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. 2018. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 368–377.
  • Huber et al. (1964) Peter J Huber et al. 1964. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics 35, 1 (1964), 73–101.
  • Jin et al. (2018) Junqi Jin, Chengru Song, Han Li, Kun Gai, Jun Wang, and Weinan Zhang. 2018. Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2193–2201.
  • Le et al. (2019) Hoang M Le, Cameron Voloshin, and Yisong Yue. 2019. Batch policy learning under constraints. arXiv preprint arXiv:1903.08738 (2019).
  • Levine et al. (2020) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 (2020).
  • Miryoosefi et al. (2019) Sobhan Miryoosefi, Kianté Brantley, Hal Daume III, Miro Dudik, and Robert E Schapire. 2019. Reinforcement Learning with Convex Constraints. In Advances in Neural Information Processing Systems. 14070–14079.
  • Misra et al. (2019) Kanishka Misra, Eric M Schwartz, and Jacob Abernethy. 2019. Dynamic online pricing with incomplete information using multiarmed bandit experiments. Marketing Science 38, 2 (2019), 226–252.
  • Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning. 1928–1937.
  • Schaul et al. (2015) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2015. Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015).
  • Schwartz et al. (2017) Eric M Schwartz, Eric T Bradlow, and Peter S Fader. 2017. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science 36, 4 (2017), 500–522.
  • Sutton et al. (1998) Richard S Sutton, Andrew G Barto, et al. 1998. Introduction to reinforcement learning. Vol. 135. MIT press Cambridge.
  • Theocharous et al. (2015) Georgios Theocharous, Philip S Thomas, and Mohammad Ghavamzadeh. 2015. Personalized ad recommendation systems for life-time value optimization with guarantees. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
  • van Hasselt et al. (2016) Hado P van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, and David Silver. 2016. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems. 4287–4295.
  • Wang et al. (2015) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. 2015. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581 (2015).
  • Watkins (1989) Christopher John Cornish Hellaby Watkins. 1989. Learning from delayed rewards. Ph. D. Dissertation. King’s College, Cambridge.
  • Wei et al. (2017) Zeng Wei, Jun Xu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. 2017. Reinforcement learning to rank with Markov decision process. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 945–948.
  • Wu et al. (2018) Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Jian Xu, and Kun Gai. 2018. Budget constrained bidding by model-free reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 1443–1451.
  • Xia et al. (2017) Long Xia, Jun Xu, Yanyan Lan, Jiafeng Guo, Wei Zeng, and Xueqi Cheng. 2017. Adapting Markov decision process for search result diversification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 535–544.
  • Xiao et al. (2019b) Fei Xiao, Zhen Wang, Haikuan Huang, Jun Huang, Xi Chen, Hongbo Deng, Minghui Qiu, and Xiaoli Gong. 2019b. AliISA: Creating an Interactive Search Experience in E-commerce Platforms. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1305–1308.
  • Xiao et al. (2019a) Shuai Xiao, Le Guo, Zaifan Jiang, Lei Lv, Yuanbo Chen, Jun Zhu, and Shuang Yang. 2019a. Model-based Constrained MDP for Budget Allocation in Sequential Incentive Marketing. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 971–980.
  • Yang and Lu (2016) Hongxia Yang and Quan Lu. 2016. Dynamic contextual multi arm bandits in display advertisement. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 1305–1310.
  • Zhao et al. (2018a) Jun Zhao, Guang Qiu, Ziyu Guan, Wei Zhao, and Xiaofei He. 2018a. Deep reinforcement learning for sponsored search real-time bidding. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1021–1030.
  • Zhao et al. (2018b) Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. 2018b. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems. 95–103.
  • Zhao et al. (2018c) Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018c. Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1040–1048.