Opportunistic Advertisement Scheduling in Live Social Media: A Multiple Stopping Time POMDP Approach

Vikram Krishnamurthy
Anup Aprem
Sujay Bhatt
Department of Electrical and Computer Engineering,
Cornell University, Ithaca, NY
vikramk@cornell.edu
Department of Electrical and Computer Engineering,
University of British Columbia, Vancouver, BC, Canada
aaprem@ece.ubc.ca
Department of Electrical and Computer Engineering,
Cornell University, Ithaca, NY
sh2376@cornell.edu

Abstract

Live online social broadcasting services like YouTube Live and Twitch have steadily gained popularity due to improved bandwidth, ease of generating content and the ability to earn revenue on the generated content. In contrast to traditional cable television, revenue in online services is generated solely through advertisements, and depends on the number of clicks generated. Channel owners aim to opportunistically schedule advertisements so as to generate maximum revenue. This paper considers the problem of optimal scheduling of advertisements in live online social media. The problem is formulated as a multiple stopping problem and is addressed in a partially observed Markov decision process (POMDP) framework. Structural results are provided on the optimal advertisement scheduling policy. By exploiting the structure of the optimal policy, best linear thresholds are computed using stochastic approximation. The proposed model and framework are validated on real datasets, and the following observations are made: (i) The policy obtained by the multiple stopping problem can be used to detect changes in ground truth from online search data (ii) Numerical results show a significant improvement in the expected revenue by opportunistically scheduling the advertisements. The revenue can be improved by $20-30\%$ in comparison to currently employed periodic scheduling.

Index Terms:

scheduling, advertisement, live channel, POMDP, multiple stopping, structural result

I Introduction

Popularity of online video streaming has seen a sharp growth due to improved bandwidth for streaming and the ease of sharing User-Generated-Content (UGC) on the internet platforms. One of the primary motivations for users to generate content is that platforms like YouTube, Twitch etc., allow users to generate revenue through advertising and royalties. The revenue of Twitch which deals with live video gaming, play through of video games, and e-sport competitions, is around 3.8 billion for the year 2015, out of which 77% of the revenue was generated from advertisements¹¹1http://www.tubefilter.com/2015/07/10/twitch-global-gaming-content-revenue-3-billion/.

Some of the common ways advertisements (ads) are scheduled on pre-recorded video contents on social media like YouTube are pre-roll, mid-roll and post-roll; where the names indicate the time at which the ads are displayed. In a recent research on viewer engagement, Adobe Research²²2https://gigaom.com/2012/04/16/adobe-ad-research/ concluded that mid-roll video ads constitute the most engaging ad type for pre-recorded video contents, outperforming pre-roll and post-rolls when it comes to completion rate (the probability that the ad will not be skipped). Viewers are more likely to engage with an ad if they are interested in the content of the video that the ad is been inserted into. Most content sharing platforms hosted pre-recorded videos, until recently, owing to higher bandwidth requirements of real-time video streaming. However, this has changed lately with improved bandwidths (e.g. Google Fiber, Comcast Xfinity) and well known content sharing websites like YouTube and Facebook making provisions for live streaming videos to capture major events in real time⁵⁵5For example, as early as 2011, millions watched the Royal wedding live through YouTube³³footnotemark: 3, and, more recently, the Caribbean Premier League is scheduled to be broadcast live on Facebook⁴⁴footnotemark: 4. Online live video streaming, popularly known as “Social TV”, now boasts a number of popular applications like Twitch, YouTube Live, Facebook Live, etc.

When a channel is streaming a live video, the mid-roll ads need to be scheduled manually⁶⁶6http://www.telegraph.co.uk/news/uknews/royal-wedding/8460801/Royal-wedding-Kate-and-William-marriage-live-on-YouTube.html. Twitch allows only periodic ad scheduling [1] and YouTube and other live services currently offers no automated method of scheduling ads for live channels. The ad revenue in live channel depends on the click rate (the probability that the ad will be clicked), which in turn depend on the viewer engagement with the channel content. Hence, ads need to be scheduled when the viewer engagement of the channel is high. The problem of optimal scheduling of ads has been well studied in the context of advertising in television; see [2],[3], [4] and the references therein. However, scheduling ads on live online social media is different from scheduling ads on television in two significant ways [5]: (i) real-time measurement of viewer engagement (ii) revenue is based on ads rather than a negotiated contract. Prior literature on scheduling ads on social media is limited to ad scheduling in real-time for social network games, where the ads are served to either the video game consoles in real time over the Internet [6], or in digital games that are played via major social networks [7]. ⁶⁶footnotetext: http://www.si.com/tech-media/2016/07/06/caribbean-premier-league-matches-facebook-live ⁶⁶footnotetext: YouTube Live: Slate and Ad Insertion https://is.gd/i6c7ku

Problem Formulation: This paper deals with optimal scheduling of ads on live channels in social media, by considering viewer engagement, termed as active scheduling, to maximize the revenue generated from the advertisements. We model the viewer engagement of the channel using a Markov chain [8, 9]. The viewer engagement of the content is not observed directly, however, noisy observation of the viewer engagement is obtained by the current number of viewers of the channel. Hence, the problem of computing the optimal policy of scheduling ads on live channel can then be formulated as an instance of a stochastic control problem called the partially observable Markov decision process (POMDP). To the best of our knowledge, this is the first time in the literature that ad scheduling in live channels on social media is studied in a POMDP framework. The main contribution of this paper is two-fold:

1.)

We provide a POMDP framework for the optimal ad-scheduling problem on live channels and show that it is an instance of the optimal multiple stopping problem.
2.)

We provide structural results of the optimal policy of the multiple stopping problem and using stochastic approximation compute the best approximate policy.

Structural Results: The problem of optimal multiple stopping has been well studied in the literature see [10], [11], [12], [13] and the references therein. The optimal multiple stopping problem generalizes the classical (single) stopping problem, where the objective is to stop once to obtain maximum reward. Nakai [10] considers optimal $L$ -stopping over a finite horizon of length $N$ in a partially observed Markov chain. More recently, [13] considers $L$ -stopping over a random horizon. The state of the finite horizon partially observed Markov chain in [10] above can be summarized by the “belief state”⁷⁷7The belief state is a sufficient statistic for all the past observations and actions [14].. For a stopping time POMDP, the policy can be characterized by stopping region (set of belief state where we stop) and continuance region (set of belief states where we do not stop). Nakai [10] shows that there are $N\times L$ stopping regions corresponding to each time index and stops and these regions form a nested structure. However, in live channels, the time horizon $N$ is very large (in comparison to decision epochs) or initially unknown. Therefore, we extend the results in Nakai [10] to the infinite horizon case. The extension is both important and non-trivial. In the infinite horizon case, the policy is stationary (the stopping regions do not depend on the time index) and hence $L$ stopping regions characterize the policy. We obtain similar structural results as [10] in the infinite horizon case.

Our main structural result (Theorem 1) is that the optimal policy is characterized by a switching curve on the unit simplex of Bayesian posteriors (belief states). To prove this result we use the monotone likelihood ratio stochastic order since it is preserved under conditional expectations. However, determining the optimal policy is non-trivial since the policy can only be characterized on a partially ordered set (more generally a lattice) within the unit simplex. We modify the MLR stochastic order to operate on line segments within the unit simplex of posterior distributions. Such line segments form chains (totally ordered subsets of a partially ordered set) and permit us to prove that the optimal decision policy has a threshold structure. Having established the existence of a threshold curve, Theorem 2 and Theorem 22 gives necessary and sufficient conditions for the best linear hyperplane approximation to this curve. Then a simulation-based stochastic approximation algorithm (Algorithm 1) is presented to compute this best linear hyperplane approximation.

Context: The optimal multiple stopping problem can be contrasted to the recent work on sampling with “causality constraints”. In sampling with causality constraints, not all the observations are observable. [15] considers the case where an agent is limited to a finite number of observations (sampling constraints) and must adaptively decide the observation strategy so as to perform quickest detection on a data stream. The extension to the case where the sampling constraints are replenished randomly is considered in [16]. In the multiple stopping problem, considered in this paper, there is no constraint on the observations and the objective is to stop $L$ times at states that correspond to maximum reward.

The optimal multiple stopping problem, considered in this paper, is similar to sequential hypothesis testing [17, 18], sequential scheduling problem with uncertainty [19] and the optimal search problem considered in the literature. [20] and [21] consider the problem of finding the optimal launch times for a firm under strategic consumers and competition from other firms to maximize profit. However, in this paper we deal with sequential scheduling in a partially observed case. [22],[23] consider an optimal search problem where the searcher receives imperfect information on a (static) target location and decides optimally to search or interdict by solving a classical optimal stopping problem ( $L=1$ ). However, the multiple-stopping problem considered in this paper is equivalent to a search problem where the underlying process is evolving (Markovian) and the searcher needs to optimally stop $L>1$ times to achieve a specific objective.

The paper is organized as follows: Section II provides a model of a live channel and introduces the notations, assumptions and key definitions. Section III provides the main results of the paper. First, similar to [10], we show that the stopping regions of the optimal ad scheduling policy form a nested structure. Second, we show the threshold structure of the optimal ad scheduling policy. In Section IV, we use the nested property of the stopping regions and the threshold property in Section III and stochastic approximation algorithm to compute the best approximate policies using linear thresholds. Such linear threshold policies are computationally inexpensive to implement. In Section V, we validate the model on three different datasets. First, we illustrate the analysis using a synthetic dataset and verify the performance of the optimal ad scheduling policy against conventional scheduling policies. Second, we show that the policy obtain by the multiple stopping problem can be used to detect changes in ground truth using data from online search. Finally, we use real datasets from YouTube Live and Twitch to optimally schedule multiple ads ( $L>1$ ) in a sequential manner so as to maximize the revenue.

II Opportunistic Scheduling on Live Channels: Model and Problem Formulation

II-A Live Channel Model

In this section, we develop a model of the live channel. The three main components of the live channel are i) Viewer engagement: How to model the viewer engagement of a live channel? ii) Dynamics of the channel viewers: How does the number of viewers vary with respect to the engagement? iii) Reward of the channel owner: What is the (monetary) reward obtained by the channel owner through advertising? Below, we develop models to address each of these questions.

1)

Dynamics of viewer engagement: Similar to [24], viewer engagement, in the context of live channels, can be defined as the following process: (i) The viewer decides to watch the live channel. (ii) The viewer is “engaged” with the content of the live channel. (iii) The viewer will watch the live content without switching to other channels. (iv) The viewer is more attentive when watching the live content. The potential benefits of (iii) and (iv) above is that it increases the odds of the viewers being exposed and persuaded by advertisements.

Viewer engagement, as defined above, is an abstract concept which captures viewer attitude, behaviour and attentiveness. Archak et. al. [8, 9] developed a Markov chain model for online behaviour of users and the effects of advertising. The main finding is that the user behaviour in online social media can be approximated by a first-order Markov chain. Following Archak et. al. [8, 9], we model the viewer engagement at time $t$ , denoted by $X_{t}$ , as an $S$ -state Markov chain with state-space $\mathcal{S}\equiv\left\{1,2,\cdots,{S}\right\}$ . The dynamics of the viewer engagement of the channel, modelled as a Markov chain, can be characterized by the transition matrix $P$ and initial probability vector $\pi_{0}$ as follows:

$P(i,j)=\mathbb{P}(X_{t+1}=j|X_{t}=i),\quad\pi_{0}(i)=\mathbb{P}(X_{0}=i)$ (1)

The Markov chain model for the viewer engagement of the channel is validated using simulations in Sec. V.

Dynamics of channel viewers: The number of viewers at time $t$ depend on the viewer engagement of the live channel. As viewers are more engaged with the content, they are less likely to switch the channel. Hence, a higher viewer engagement state has higher number of viewers compared to a lower engagement state. Therefore, we model the dynamics of channel viewers as follows: The number of viewers at time $t$ , denoted by $Y_{t}$ , belongs to the countably infinite set $\mathcal{Y}$ of non-negative integers. Denote, the conditional probability of $j$ viewers ( $Y_{t}=j$ ) in viewer engagement state $i$ ( $X_{t}=i$ ) by $B(i,j)$ . Note that the conditional probability $B(i,j)$ is assumed to be time homogeneous⁸⁸8The conditional probability $B(i,j)$ does not depend on the time index, $t$ . . The number of viewers $Y_{t}$ is modeled as a Poisson random variable with state dependent mean $g_{i},i\in\mathcal{S}$ , based on evidence in [25] and [11], as follows:

B(i,j)=\mathbb{P}\left(Y_{t}=j|X_{t}={i}\right)=\frac{g_{i}^{j}\exp{(-g_{i})}}{j!},~\forall i\in\mathcal{S},j\in\mathcal{Y}.

(2)

The states with higher state dependent mean correspond to states with higher viewer engagement.

The channel owner does not observe the true viewer engagement of the channel, $X_{t}$ . However, at each time instant $t$ , the channel owner receives a noisy observation of the viewer engagement of the channel by the number of viewers, $Y_{t}$ . Hence, the channel owner needs to estimate the viewer engagement using the history of noisy observations and schedule ads.

3)

Reward of channel owner: The channel owner agrees to show $L$ ads during the live session, which are decided prior to the beginning of the session. For example, the ads during the Super Bowl 50 in YouTube Live had to be pre-booked in advance.⁹⁹9http://www.campaignlive.co.uk/article/youtube-launches-real-time-ads-major-live-events-starting-super-bowl-50/1380260 Hence, at each time instant $t$ , the channel owner chooses an action $u_{t}$ as follows: The channel owner can continue with the live session (denoted by Continue) or can pause the live session to insert an ad (denoted by Stop). Hence, $u_{t}\in\mathcal{A}=\{\text{Stop(1)},\text{Continue(2)}\}$ ¹⁰¹⁰10The Stop and Continue actions will be denoted by $1$ and $2$ , respectively in the remainder of the paper. denote the actions available to the channel owner at time $t$ . This problem of scheduling the $L$ ads opportunistically, so as to obtain maximum revenue, corresponds to a multiple stopping problem with $L$ stops.

Choosing to stop at time $t$ (and schedule an ad), with $l$ stops remaining ( $l$ ads remaining), the channel owner will accrue a reward $r_{l}(X_{t},a=1)$ , where $X_{t}$ is the state of the Markov chain at time $t$ . The reward obtained by the channel owner depends on two factors: (i) the number of viewers (ii) the completion rate and click rate of the ad . To capture these, we model the reward as follows:

$r_{l}(X_{t}=i,a=1)=\alpha_{i}g_{i}.$ (3)

In (3), $g_{i}$ captures the average number of viewers in any viewer engagement state. The term $\alpha_{i}\in\left[0,1\right]$ captures the completion and click rate of the ads at any viewer engagement state. Similarly, if the channel owner chooses to continue, he will accrue $r_{l}(X_{t},a=2)$ . When an ad is not shown, the reward obtained by the channel owner is usually zero, hence, $r_{l}(X_{t},a=2)=0$ .

Hence, to maximize revenue, the channel owner needs to opportunistically schedule ads at time slots when the viewer engagement is high, corresponding to a higher number of viewers and higher click rate.

II-B Ad Scheduling : Problem formulation & Stochastic Dynamic Programming

II-B1 Problem Formulation

The ad scheduling policy (or the control policy) prescribes a decision rule that determines the action taken by the channel owner. Let the initial probability vector, $\pi_{0}$ and the history of past observations at time $t$ for the channel owner be denoted as $Z_{t}=\left\{\pi_{0},Y_{1},\cdots,Y_{t}\right\}$ . The control policy, at time $t$ , maps the history $Z_{t}$ to action. Hence, the policy of the channel owner $\mu$ belongs to the set of admissible policies $\mathcal{U}=\left\{\mu:\mu~\text{maps}~Z_{t}\rightarrow\mathcal{A}\right\}.$ Below, we reformulate the sequential multiple stopping problem of scheduling ads in terms of belief state.

Let $\Pi=\left\{\pi\in\mathbb{R}^{S}:\textbf{1}_{S}^{\prime}\pi=1,\pi(i)\geq 0\right\}$ denote the belief space of all $S$ -dimensional probability vectors. The belief-state $\pi_{t}$ is a sufficient statistic of $Z_{t}$ [14], and evolves as $\pi_{t+1}=T(\pi_{t},Y_{t+1})$ , where

\displaystyle\begin{aligned} &T(\pi_{t},Y_{t+1})=\cfrac{B_{Y_{t+1}}P^{{}^{\prime}}\pi_{t}}{\sigma(\pi_{t},Y_{t+1})},\quad\sigma(\pi_{t},Y_{t+1})=\textbf{1}_{S}^{\prime}B_{Y_{t+1}}P^{{}^{\prime}}\pi_{t},\\ &B_{Y_{t+1}}=\text{diag}\left(B(1,Y_{t+1}),\cdots,B(S,Y_{t+1})\right).\end{aligned}

(4)

Here $\textbf{1}_{S}$ represents the $S$ -dimensional vectors of ones.

The aim is to compute the optimal stationary ad scheduling policy $u_{t}=\mu(\pi_{t},l),$ as a function of the belief, $\pi_{t}$ , and the number of stops (or the number of ads) remaining, $l$ , to maximize the infinite horizon criterion defined in (6). Let $\tau_{l}$ denote the stopping time when there are $l$ stops remaining:

\tau_{l}=\inf\left\{t:t>\tau_{l+1},u_{t}=1\right\},\text{with }\tau_{L+1}=0.

(5)

The infinite horizon discounted criterion with stationary policy $\mu$ , and initial belief $\pi_{0}$ is as below:

	$\displaystyle J_{{\mu}}(\pi_{0})$	$\displaystyle=\mathbb{E}_{\mu}\left\{\sum_{t=0}^{\tau_{L}-1}\rho^{t}r_{L}(X_{t},2)+\rho^{\tau_{L}}r_{L}(X_{\tau_{L}},1)+\sum_{t={\tau_{L}+1}}^{\tau_{L-1}-1}\rho^{t}r_{L-1}(X_{t},2)+\dots+\rho^{\tau_{1}}r_{1}(X_{\tau_{1}},1)\>\Big{\|}\>\pi_{0}\right\},$		(6)
		$\displaystyle=\mathbb{E}_{\mu}\left\{\sum_{t=0}^{\tau_{L}-1}\rho^{t}r_{2,L}^{\prime}\pi+\rho^{\tau_{L}}r_{1,L}^{\prime}\pi+\sum_{t={\tau_{L}+1}}^{\tau_{L-1}-1}\rho^{t}r_{2,L-1}^{\prime}+\dots+\rho^{\tau_{1}}r_{1,1}^{\prime}\pi\>\Big{\|}\>\pi_{0}\right\},$		(7)

where $r_{u,l}=\left[r_{l}(1,u),\dots,r_{l}(S,u)\right]^{\prime}$ . In (6) and (7), $\rho\in\left(0,1\right]$ denotes the discount factor. Below, we study the special case, where $r_{1,1}=r_{1,2}=\dots=r_{1,L}=r$ and $r_{2,1}=r_{2,2}=\dots=r_{2,L}=0$ , i.e. the reward for stopping and scheduling an ad is independent of the number of stops remaining and the channel owner accrues no reward for continuing. The extension to the general case is straightforward.

II-B2 Stochastic Dynamic Programming

The computation of the optimal ad scheduling policy $\mu^{*}$ , to maximize the infinite horizon discounted criterion in (6) and (7), is equivalent to solving Bellman’s dynamic programming equation [14]:

\mu^{*}(\pi,l)=\underset{u\in\mathcal{A}}{\operatornamewithlimits{argmax}}~Q(\pi,l,u),\quad V(\pi,l)=\underset{u\in\mathcal{A}}{\max}~Q(\pi,l,u),

(8)

where,

Q(\pi,l,1)=r^{\prime}\pi+\rho\sum_{Y\in\mathcal{Y}}V\left(T(\pi,Y),l-1\right)\sigma(\pi,Y),\quad Q(\pi,l,2)=\rho\sum_{Y\in\mathcal{Y}}V\left(T(\pi,Y),l\right)\sigma(\pi,Y)

(9)

The above dynamic programming formulation is a POMDP. Since the state-space $\Pi$ , is a continuum, Bellman’s equation (8) does not translate into practical solution methodologies as $V(\pi,l)$ needs to be evaluated at each $\pi\in\Pi$ . This in turn renders the calculation of the optimal policy $\mu^{*}(\pi,l)$ computationally intractable. In Sec. III we derive structural results of the optimal ad scheduling policy. The advantage of deriving structural results is that the optimal policy can be computed efficiently. Sec. IV provides stochastic approximation algorithms to compute approximations of the optimal policy using the structural results derived in Sec. III.

III Optimal ad scheduling: Structural results

In this section, we derive structural results for the optimal ad scheduling policy. We first introduce the value iteration algorithm in Sec. III-A, a successive approximation method to solve the dynamic programming recursion in (8). The value iteration algorithm is a valuable tool for deriving the structural results. In Section III-B, we use the value iteration algorithm in Sec. III-A to prove structural results of the optimal ad scheduling policy. Using the structural results in Sec. III-B, we provide a simulation based algorithm to compute the policy in Sec. IV.

III-A Value Iteration Algorithm

The value iteration algorithm is a successive approximation approach for solving Bellman’s equation (8).

The procedure is as follows: For iterations $k=0,1,\dots$ , the value function $V_{k}(\pi,l)$ and the policy $\mu_{k}(\pi,l)$ is obtained as follows

V_{k+1}(\pi,l)=\underset{u\in\{1,2\}}{\text{max}}Q_{k+1}(\pi,l,u),

(10)

\mu_{k+1}(\pi,l)=\underset{u\in\{1,2\}}{\operatornamewithlimits{argmax}}Q_{k+1}(\pi,l,u),

(11)

where

Q_{k+1}(\pi,l,1)=r^{\prime}\pi+\rho\sum_{y}V_{k}(T(\pi,y),l-1)\sigma(\pi,y),

(12)

and

Q_{k+1}(\pi,l,2)=\rho\sum_{y}V_{k}(T(\pi,y),l)\sigma(\pi,y),

(13)

with $V_{0}(\pi,l)$ initialized arbitrarily. It can be easily shown that the above procedure converges [14].

In order to prove the structural results of the stationary ad scheduling policy, defined in Sec. II, we define the stopping and the continuance regions of the policy as below. Let $W_{k}(\pi,l)$ be defined as

W_{k}(\pi,l)\triangleq V_{k}(\pi,l)-V_{k}(\pi,l-1).

(14)

The stopping and continuance region (at each iteration $k$ ) is defined as follows:

\displaystyle\begin{aligned} {S_{k+1}^{l}}=\{\pi|r^{\prime}\pi\geq\rho\sum_{y}W_{k}(T(\pi,y),l)\sigma(\pi,y)\}\\ {C_{k+1}^{l}}=\{\pi|r^{\prime}\pi<\rho\sum_{y}W_{k}(T(\pi,y),l)\sigma(\pi,y)\}\end{aligned}

(15)

Since the value iteration converges, the optimal stationary policy $\mu^{*}(\pi,l)$ is defined as

\mu^{*}(\pi,l)=\lim_{k\to\infty}\mu_{k}(\pi,l).

(16)

Correspondingly, the stationary stopping and continuance sets are defined by

S^{l}=\lim_{k\to\infty}S_{k}^{l},\quad C^{l}=\lim_{k\to\infty}C_{k}^{l}.

(17)

The value function, $V_{k}(\pi,l)$ in (10), can be rewritten, using (15), as follows:

V_{k}(\pi,l)={\left(r^{\prime}\pi+\rho\sum_{y}V_{k-1}(T(\pi,y),l-1)\sigma(\pi,y)\right)\mathcal{I}_{S_{k}^{l}}}+{\left(\rho\sum_{y}V_{k-1}(T(\pi,y),l)\sigma(\pi,y)\right)\mathcal{I}_{C_{k}^{l}}}

(18)

where $\mathcal{I}_{C_{k}^{l}}$ and $\mathcal{I}_{S_{k}^{l}}$ are indicator functions on the continuance and stopping regions respectively, for each iteration $k$ .

Assume $S_{k}^{l-1}\subset S_{k}^{l}$ (see Proposition 3) and substituting (18) in the definition of $W_{k}(\pi,l)$ in (14),

$\displaystyle W_{k}(\pi,l)$	$\displaystyle=\left(\rho\sum_{y}W_{k-1}(T(\pi,y),l)\sigma(\pi,y)\right)\mathcal{I}_{C_{k}^{l}}(\pi)$
	$\displaystyle+{r^{\prime}\pi\mathcal{I}_{C_{k}^{l-1}\cap S_{k}^{l}}(\pi)}$	(19)
	$\displaystyle+\left(\rho\sum_{y}W_{k-1}(T(\pi,y),l-1)\sigma(\pi,y)\right)\mathcal{I}_{S_{k}^{l-1}}(\pi)$

The value iteration algorithm (10) to (13) does not translate into a practical solution methodology since the belief state belongs to an uncountable set. Hence, there is a strong motivation to characterize the structure of the optimal policy. In the following section, we use the definition of $W_{k}(\pi,l)$ in (19) to prove structural results on the stopping region defined in (15).

III-B Structural results for the optimal ad scheduling policy

In this section, we provide structural results on the optimal policy of the multiple stopping problem corresponding to maximizing the ad revenue.

III-B1 Assumptions

The main result below, namely, Theorem 1, requires the following assumptions on the reward vector, $r$ , and the transition matrix, $P$ and the observation distribution, $B$ . The first assumption on the reward says that $e_{1}$ is the state with the highest reward and the reward monotonically decreases. This captures the channel owners’ preference of scheduling ads at the highest viewer engagement state. A sufficent condition for the reward to monotonically decrease is that both the mean number of viewers $g_{i}$ , and the completion and click rate $\alpha_{i}$ be non-increasing. The second and the third assumptions (A2) and (A3) relate to the underlying stochastic model and is related to MLR ordering of the updated belief vector in (4) (see Theorem 4 and Theorem 5 in Appendix A). The assumption (A2) models the following facts: i) The user behaviour in online social media can be approximated by a first order Markov chain. ii) The viewer engagement changes at a smaller time scale compared to sampling or decision epochs. The assumption (A3) is due to the fact that the viewer engagement states can be ordered corresponding to decreasing mean number of viewers, $g_{i}$ . The last assumption (A4) is a technical condition required for our proof.

•

(A1) The vector, $r$ , has decreasing elements. Hence, $r^{\prime}\pi$ is increasing in $\pi$ .
•

(A2) $B$ is TP2¹¹¹¹11Refer to Appendix A for the definition of TP2 ordering. . A necessary and sufficient condition for $B$ to be TP2 is that the state dependent mean $g_{i}$ in (2) is monotonically decreasing.
•

(A3) $P$ is TP211.
•

(A4) The vector, $(I-\rho P^{\prime})r$ , has decreasing elements.

III-B2 Main Result

The following (Theorem 1) is the main result of the paper and the proof is provided in Appendix B. Theorem 1.A states that the optimal policy is a monotone policy. The optimal policy $\mu^{*}(\pi,l)$ decreases monotonically with the belief state $\pi$ . However, for a monotone policy to be well defined, we need to first define the ordering between two belief states. In this paper, we use the Monotone Likelihood Ratio (MLR) ordering¹²¹²12MLR ordering is defined in Def. 2 in Appendix A-B, and the less restrictive MLR ordering on lines $\mathcal{L}(e_{1},\bar{\pi})$ and $\mathcal{L}(e_{S},\bar{\pi})$ ¹³¹³13MLR ordering over lines is defined in Appendix A-C over the belief states [26, 27]. Fig. 15 illustrates the definition of $\mathcal{L}(e_{1},\bar{\pi})$ .

Refer to caption — Figure 1: $\mathcal{L}(e_{1},\bar{\pi})$ corresponds to a line joining $e_{1}$ and any $\bar{\pi}\in\mathcal{H}$ ¹⁵¹⁵15 $\mathcal{H}$ is defined in Appendix A-Con the simplex $\Pi$ . The advantage of MLR ordering on lines is that the belief states on the line $\mathcal{L}(e_{1},\bar{\pi})$ and $\mathcal{L}(e_{X},\bar{\pi})$ can be fully ordered. Hence, we can define monotonicity of the policy over the lines. This is not possible for MLR ordering, since it is a partial order.

Theorem 1

Assume (A1), (A2), (A3) and (A4), then,

A

There exists an optimal policy $\mu^{*}(\pi,l)$ that is monotonically decreasing on lines $\mathcal{L}(e_{1},\bar{\pi})$ , and $\mathcal{L}(e_{X},\bar{\pi})$ for each $l$ .
B

There exists an optimal switching curve $\Gamma_{l}$ , for each $l$ , that partitions the belief space $\Pi(X)$ into two individually connected regions $S^{l}$ and $C^{l}$ ,such that the optimal policy is

$\mu^{*}(\pi,l)=\begin{cases}1&\text{if }\pi\in S^{l}\\ 2&\text{if }\pi\in C^{l}\end{cases}$ (20)
C

$S^{l-1}\subset S^{l}$ .

Theorem 1 A implies that the optimal action is monotonically decreasing on the line $\mathcal{L}(e_{1},\bar{\pi})$ , as shown in Fig. 15. Hence, on each line $\mathcal{L}(e_{1},\bar{\pi})$ there exists a threshold above (in MLR sense) which it is optimal to Stop and below which it is optimal to Continue. Theorem 1 B says, for each $l$ , the stopping and continuance sets are connected. Hence, there exists a threshold curve, $\Gamma_{l}$ , as shown in Fig. 15, obtained by joining the thresholds, from Theorem 1 A, on each of the line $\mathcal{L}(e_{1},\bar{\pi})$ . Theorem 15 C proves the sub-setting structure of the stopping and continuance sets, as shown in Fig. 15.

In order to prove the theorem, we introduce the following propositions, proofs of which are provided in the Appendix B. The value function of the classical (single) stopping POMDP increases with $\pi$ (in MLR sense) [27, 28, 29]. Proposition 1 states that the above result holds even in the multiple stopping problem. In the classical stopping POMDPs in [27], the stopping and continuance sets are characterized using the value function. However, in the multiple stopping problem, considered in this paper, $W$ plays the role of value function in characterizing the stopping and continuance region. Proposition 2 proves the corresponding result in the multiple stopping problem. Proposition 3 proves the nested stopping regions at each iteration $k$ of the value iteration. Since the value iteration converges, Proposition 3 implies Theorem 15 C.

Proposition 1

$V_{k}(\pi,l)$ is increasing in $\pi$ .

Proposition 2

$W_{k}(\pi,l)$ is decreasing in $l$ .

Proposition 3

$S_{k+1}^{l}\supset S_{k+1}^{l-1}$

To summarize, we showed the following properties of the optimal policy:

(i)

the optimal policy is monotone on the lines $\mathcal{L}(e_{1},\pi)$ .
(ii)

existence of a unique threshold curve for the stopping region, $\Gamma^{l}$ .
(iii)

the stopping regions have a sub-setting property, i.e. $S^{l-1}\subset S^{l}$ .

IV Stochastic Approximation Algorithm for computing best approximate ad scheduling policy

In this section, we synthesize policies satisfying the properties of the optimal ad scheduling policy derived in Sec. III. The policies can be characterized by $L$ threshold curve, corresponding to each of the stopping regions. In this section, we parametrize the threshold curve, $\Gamma^{l}$ , as $\hat{\Gamma}^{l}_{\theta}$ . Here, $\theta$ denotes the parameter of the threshold curve. To capture the essence of Theorem 1, we require that the parametrized optimal policy, $\mu_{\theta}(l,\pi)$ , be decreasing on lines $\mathcal{L}(e_{1},\bar{\pi})$ and $\mathcal{L}(e_{S},\bar{\pi})$ and the stopping sets are connected and satisfy the sub-setting property, i.e. $S^{l-1}\subset S^{l}$ . Below, we will restrict our attention to obtain the best linear threshold policy, i.e. policy of the form given in (21). We characterize the parameters of the threshold policy in Sec. IV-A and provide an algorithm to compute the parameters using stochastic approximation in Sec. IV-B.

IV-A Structure of best linear MLR threshold policy for ad scheduling

Consider a threshold hyperplane, on the simplex $\Pi$ , of the form (21) where $\theta_{l}\in{\rm I\hskip-1.9919ptR}^{S-1}$ denotes the coefficient vector. The linear threshold scheduling policy, denoted by $\mu_{\theta}(l,\pi)$ is defined as

\mu_{\theta}(l,\pi)=\begin{cases}1&\text{if }\begin{bmatrix}0&1&\theta_{l}\end{bmatrix}\begin{bmatrix}\pi\\ -1\end{bmatrix}\leq 0\\ 2&\text{else },\end{cases}

(21)

Here, $\theta=\left(\theta_{1},\theta_{2},\dots,\theta_{L}\right)\in{\rm I\hskip-1.9919ptR}^{L\times(S-1)}$ is the concatenation of the $\theta_{l}$ vectors.

To capture the essence of Theorem 1 A, we require that the policy be decreasing on lines, i.e. for $\pi_{1}\geq_{\mathcal{L}_{1}}\pi_{2},\;\mu_{\theta}(\pi_{1},l)\leq\mu_{\theta}(\pi_{2},l)$ . Theorem 2 gives necessary and sufficient conditions on the coefficient vector $\theta_{l}$ such that the above condition holds.

Theorem 2

A necessary and sufficient condition for the linear threshold policy $\mu_{\theta}(l,\pi)$ to be

1.

MLR decreasing on line $\mathcal{L}(e_{1},\bar{\pi})$ , iff $\theta_{l}(S-1)\geq 0$ and $\theta_{l}(i)\geq 0,\;i\leq S-2$ .
2.

MLR decreasing on line $\mathcal{L}(e_{S},\bar{\pi})$ , iff $\theta_{l}(S-1)\geq 0$ , $\theta_{l}(S-2)\geq 1$ and $\theta_{l}(i)\leq\theta_{l}(S-2),\;i<S-2$ .

The proof of Theorem 2 is similar to the proof of Theorem 12.4.1 in [27] and hence is omitted in this paper.

The stopping sets are connected since we parametrize the threshold curve using a linear hyperplane. Finally, the linear threshold approximation curve needs to satisfy the sub-setting property in Theorem 15 C. Theorem 22 provides sufficient conditions such that the parametrized linear threshold curve satisfy the sub-setting property and the proof is provided in the Appendix B-E.

Theorem 3

To satisfy the sub-setting structure in Theorem 1 C, the parameters of the linear threshold curve have to satisfy the following condition

	$\displaystyle\theta_{l-1}(S-1)$	$\displaystyle=\theta_{l}(S-1)$		(22)
	$\displaystyle\theta_{l-1}(i)$	$\displaystyle\geq\theta_{l}(i)\quad i<S-1$		(22)

Therefore, under the conditions of Theorem 2 and Theorem 22 the linear threshold policy in (21) satisfy all the conditions in Theorem 1 and hence qualify as the best linear threshold policy.

The parameter $\theta$ can be re-parametrized as follows:

\theta^{\phi}_{l}(i)=\begin{cases}\phi_{1}^{2}(S-1)&i=S-1\\ 1+\phi_{1}^{2}(S-2)\prod_{\ell=2}^{l}\sin^{2}(\phi_{\ell}(S-2))&i=S-2\\ \left(1+\phi_{1}^{2}(S-2)\prod_{\ell=2}^{l}\sin^{2}(\phi_{\ell}(S-2))\right)\sin^{2}(\phi_{l}(i))&i<S-2\end{cases}

(23)

It can be easily checked the parametrization in (23) satisfies the conditions in Theorem 2 and Theorem 22.

Theorem 2 and Theorem 22 characterize the parameters of the linear threshold policy. In Sec. IV-B we provide an algorithm to compute the best linear threshold policy satisfying Theorem 2 and Theorem 22.

IV-B Simulation-based stochastic approximation algorithm for estimating the best linear MLR threshold policy for ad scheduling

In this section, we provide an algorithm to compute the best linear thresholds satisfying the conditions in Theorem 2 and Theorem 22. Recall, the optimal policy minimizes the average discounted reward in eqs. 6 and 7. The optimal linear thresholds can be obtained by a policy gradient algorithm [14]. Algorithm 1 is a policy gradient algorithm to compute the best linear threshold policy. In this algorithm, we approximate $J_{\mu}$ in eqs. 6 and 7 by the finite time approximation

J_{N}(\theta)=\mathbb{E}\left\{\sum_{l=1}^{L}\rho^{\tau_{l}}r(X_{\tau_{l}},1)\>\Big{|}\>\tau_{l}\leq N;\forall l\right\}.

(24)

Here, we have made explicit the dependence of the parameter vector, $\theta$ , on the discounted reward and with an abuse of notation, have suppressed the dependence on the policy $\mu$ . It can be shown that $J_{N}(\theta)$ is an asymptotically biased estimate of $J_{\mu}$ and can be obtained by simulation for large $N$ .

Algorithm 1 Threshold-Based Policy Gradient Algorithm for Optimal Multiple Stopping

0: Assume the parameters of the optimal multiple stopping problem satisfy the assumptions in Theorem 1.

1: Initialize: Choose initial parameters

\hat{\phi}_{0}

and initial linear threshold policy

\mu_{\hat{\theta}_{0}}

using (21).

2: for each iterations

n=0,1,2,\dots

: do

3: Evaluate cost

J_{N}(\theta^{\hat{\phi}_{n}})

using (24) and gradient estimate

\nabla_{\phi}J_{N}(\theta^{\hat{\phi}_{n}})

with policy

\mu_{\theta^{\hat{\phi}_{n}}}

using (25).

4: Update the parameter vector

{{\hat{\phi}_{n}}}

{{\hat{\phi}_{n+1}}}

using (27).

5: end for

The policy gradient algorithm in Algorithm 1 requires the gradient $\nabla_{\phi}J_{N}(\cdot)$ at each iteration. Computation of the gradient is quite difficult due to the non-linear dependence of the parameter $\phi$ on the cost function. Hence, in this paper, we estimate the gradient through a stochastic approximation algorithm.

There are several stochastic approximation algorithms available in the literature: infinitesimal perturbation analysis[30], weak derivatives [31] and the SPSA algorithm [32]. In this paper, we use the SPSA algorithm and because of the constraints in Theorem 22 we use a variant of SPSA that can handle linear inequality constraints [33].

Following [32] and [33], the gradient estimate is obtained by picking a random direction $\omega_{n}$ , at each iteration $n$ . The estimate of the gradient is then given by

	$\displaystyle\hat{\nabla}_{\phi}J_{N}(\theta^{\hat{\phi}_{n}})$	$\displaystyle=\frac{J_{N}(\theta^{\hat{\phi}_{n}+c_{n}\omega_{n}})-J_{N}(\theta^{\hat{\phi}_{n}-c_{n}\omega_{n}})}{2c_{n}}\omega_{n},$	(25)
where,
	$\displaystyle\omega_{n}(i)$	$\displaystyle=\begin{cases}-1&\text{with probability }0.5\\ +1&\text{with probability }0.5.\end{cases}$	(26)

The equations for the parameter update are as follows [33]:

\phi_{n+1}=\phi_{n}-a_{n}\hat{\nabla}_{\phi}J_{N}(\theta^{\hat{\phi}_{n}}).

(27)

The parameters $a_{n}$ and $c_{n}$ and $r_{n}$ are chosen as in [33] as follows:

\displaystyle\begin{aligned} a_{n}&=\upvarepsilon(n+1+\varsigma)^{-\kappa}&0.5<\kappa\leq 1,\quad\text{and }\upvarepsilon,\varsigma>0\\ c_{n}&=\upmu(n+1)^{-\Upsilon}&\upmu>0\end{aligned}

(28)

The stochastic approximation in Algorithm 1 converges to a local minimum. Hence, it is necessary to try several initial conditions and pick the best threshold.

V Numerical Results: Synthetic and Real Dataset

In this section, we present numerical results on synthetic and real datasets. First, we illustrate our analysis of Theorem 1, using synthetic data. Second, we demonstrate how the optimal multiple stopping framework, used for ad scheduling in live media, can be used to detect changes in ground truth using real data from online search. Online search is linked to advertising in television and online social media [34]. Third, we show, through simulations, that the scheduling policy obtained from Algorithm 1 outperforms conventional technique of scheduling ads in live social media. In this paper, we compare the scheduling policy obtained from Algorithm 1 with two schemes: “Periodic” and “Random”. The periodic scheme models the most common method of advertisement scheduling in pre-recorded videos in platforms like YouTube. In the context of live channels, Twitch uses a periodic scheduling where an ad is inserted periodically into the live channel [1]. In contrast, in the random scheduling scheme, the ad is inserted randomly into the live channel. The random scheduling scheme is used as a benchmark to compare the revenue obtained through the periodic scheme and the policy obtained through Algorithm 1.

V-A Synthetic Data

In this section, we do a simulation study based on synthetic data for the optimal multiple stopping problem. The objective is to illustrate the analysis in Theorem 1. In addition, we obtain the policy by solving the dynamic programming equations (8) and show through simulations that it outperforms conventional method of scheduling ads periodically.

In this “toy” example, the viewer engagement of the live channel can be categorized into $3$ states: “Popular”, “Interesting” and “Boring”, denoted by $1$ , $2$ and $3$ , respectively. The transition between the various states follow a Markov chain with transition matrix given in (29). The viewer engagement state is observed through a state dependent Poisson process with mean given in (30). The reward structure for scheduling the ads is as in (31).

P=\begin{bmatrix}0.2&0.1&0.7\\ 0.1&0.1&0.8\\ 0&0.1&0.9\end{bmatrix}

(29)

	$\displaystyle g$	$\displaystyle=\begin{bmatrix}12&7&2\end{bmatrix}$		(30)
	$\displaystyle r$	$\displaystyle=\begin{bmatrix}9&3&1\end{bmatrix}$		(31)

Fig. 2 shows a sample trajectory of the observation and the state sequence. For $L=5$ , i.e. for a total of $5$ ads to be scheduled, we obtained the policy by solving the dynamic programming equations in (8). The resulting policy, in terms of stopping and continuance set as defined in (14), is shown in Fig. 3. We also show the corresponding points where the policy was chosen in Fig. 2. Only the stopping set for $l=5$ and $l=1$ , i.e. $S^{5}$ and $S^{1}$ respectively are shown in Figure 3. As can be seen from Figure 3, $S^{1}\subset S^{5}$ , verifying Theorem 1.C. The stopping regions $S^{1}$ and $S^{5}$ are connected and the optimal policy is threshold on any line $\mathcal{L}(e_{1},\pi)$ , verifying Theorem 1.B and Theorem 1.A respectively.

We now compare the policy obtained by solving the dynamic programming equation in (8) with the conventional technique of scheduling ads periodically within the live session. We also include a random scheduling policy, where the ads are scheduled randomly within the live session, for benchmarking purposes. Fig. 4 shows the comparison between the various schemes. The results in Fig. 4 was obtained by $10^{4}$ independent Monte Carlo (M.C.) simulations. It can be seen from Fig. 4 that the policy obtained from (8) significantly outperforms (close to $4$ times) periodic scheduling. This is not surprising since the policy obtained by solving the dynamic programming in (8) opportunistically schedules ads when the viewer engagement of the channel is high.

V-B Change Detection (Real Dataset)

In this section, we illustrate Theorem 1 for change detection problems. Change detection problems are special case of multiple stopping problems, when $L=1$ . We apply Theorem 1 for detecting changes in ground truth using an online search dataset. Online search is linked to advertising in television and online social media [34]. In addition, detecting changes in ground truth using online search data have been used for detection of outbreak of illness, political election, or major sporting events [35]. Hence, detection of changes in ground truth is important for optimizing advertising strategy. The dataset that we use is the Tech Buzz dataset from Yahoo!. We first describe the Tech Buzz dataset in Sec. V-B1 and then show through simulations that the policy obtained by solving the dynamic programming equation in (8) can be used for detecting changes in ground truth using data from online search.

V-B1 Dataset

The dataset that we use in our study is the Yahoo! Buzz Game Transactions from the Webscope datasets¹⁶¹⁶16Yahoo! Webscope dataset: A2 - Yahoo! Buzz Game Transactions with Buzz Scores, version 1.0 http://research.yahoo.com/Academic_Relations available from Yahoo! Labs. In 2005, Yahoo! along with O’Reilly Media started a fantasy market where the trending technologies at that point where pitted against each other. For example, in the browser market there were “Internet Explorer”, “Firefox”, “Opera”, “Mozilla”, “Camino”, “Konqueror”, and “Safari”. The players in the game have access to the “buzz”, which is the online search index, measured by the number of people searching on the Yahoo! search engine for the technology. The objective of the game is to use the buzz and trade stocks accordingly.

V-B2 Change Detection

We consider a subset of the data containing the WIMAX buzz scores and the number of stocks traded (volume of the stocks). The unknown valuation of the WIMAX technology is modelled using a $2-$ state Markov chain (“1” for high valuation and “2” for low valuation). The valuation of the stock is not observed directly, but through noisy observations on the volume of the stocks traded. Fig. 5 shows the volume of the stocks traded and the buzz during the month of April. The volume of stocks traded depend on the unknown valuation and, for ease of analysis, is quantized into $3$ states (“High”, “Medium” and “Low”), denoted by $1$ , $2$ and $3$ respectively. Given the time series of the (quantized) volume of stocks traded, we obtain the hidden Markov model constituting of the transition matrix of the WIMAX valuation and the observation probability of the volume of the stocks traded given the WIMAX valuation using an EM algorithm [27]. The parameters of the Markov chain obtained using an EM algorithm are as below:

P=\begin{bmatrix}1&0\\ 0.1462&0.8538\end{bmatrix}

(32)

B=\begin{bmatrix}0.1489&0.4467&0.4044\\ 0.3727&0.5325&0.0947\end{bmatrix}

(33)

r=\begin{bmatrix}10&1\end{bmatrix}

(34)

As can be seen from Fig. 5 the WIMAX buzz and the volume of stocks traded is initially low. Hence, the objective is to find the time point at which the WIMAX value switches to the high value. The reward structure in (34) reflects the fact that choosing to Stop at the high value state, an agent obtains more money by trading the high value WIMAX stocks. The policy obtained by solving the dynamic programming in (8) shows a high valuation at April 18. The change point corresponds to Intel’s announcement of WIMAX chip¹⁷¹⁷17http://www.dailywireless.org/2005/04/17/intel-shipping-wimax-silicon/. The high valuation of WIMAX stock can also be noticed from the “spike” in the buzz around April 18 in Fig. 5.

Figure 5: The buzz scores and the trade volume for the WIMAX stock. The policy obtained through solving the dynamic programming in (8) shows a high valuation during April 18. This corresponds to Intel announcement of the WIMAX chip. The increase in valuation can also be seen by a corresponding spike in the WIMAX buzz scores.

V-C Ad Scheduling on Live Channels (YouTube and Twitch Datasets)

In this section, we illustrate the policy from Algorithm 1 on real data from YouTube and Twitch. In comparison to Sec. V-A and Sec. V-B, real data from YouTube and Twitch has a wide range of viewer engagement states and hence requires more states in the Markov chain model. As the number of states increases, solving the dynamic programming equation in (8) becomes impractical. Hence, we resort to best linear threshold policy through Algorithm 1. We first describe the dataset in Sec. V-C1 and then show that the policy obtained from Algorithm 1 outperforms conventional periodic scheduling.

V-C1 Dataset

In this paper, we use the dataset in [36]¹⁸¹⁸18The dataset is available from http://dash.ipv6.enstb.fr/dataset/live-sessions/. The dataset consists of live session on the two popular live broadcasting platforms: YouTube Live and Twitch, between January and April 2014. The dataset contains samples of the live sessions sampled at a $5$ -minute interval on each of the platforms. Each sample contains the identification of the channel, the number of viewers and some additional meta-data of the channel. The main finding in [36] is that the viewer engagement is more heterogeneous than in other user-generated content platforms such as YouTube. The heterogeneity of viewer engagement in live channel can be used to opportunistically schedule advertisements.

V-C2 Entertainment (YouTube Live)

In this section we use real data from YouTube Live channel and show that the policy obtained from Algorithm 1 outperforms conventional periodic scheduling.

We selected data from the “entertainment” category from the YouTube Live dataset. The channel contains data from January 01, 2014 to Jan 31, 2014, i.e. for a period of one month. Fig. 6(a) shows the distribution of the viewers of the channel during Jan 01, 2014.

The parameters of the channel were obtained by the EM-Algorithm A.2.3 in [37]. The EM-Algorithm was run for Markov Chain with $2-12$ states. Using the AIC and BIC criteria, we selected that the channel be modelled by a $5$ state Markov chain with the transition matrix in (35) and observations following state dependent Poisson distribution with mean given by (36). As can be seen from (35) the transition matrix is a first-order Markov chain validating our initial assumptions. Moreover, the transition matrix is diagonally dominant entries ensuring the TP2 assumption. This diagonally domainant entries in the transition matrix in (35) models the fact the viewer engagement of the channel changes at a slower time scale compared to the decision epochs (or sampling epochs). The reward depend on both the $g$ in (36) and the completion and click rate $\alpha_{i}$ . Since the click rate $\alpha_{i}$ is not available in the dataset, we assume $\alpha_{i}=\alpha$ . Due to the ordinality of the reward, $\alpha$ is assumed to be equal to $1$ . The model is validated using the QQ-plot of pseudo-residuals defined in Sec. 6.1 in [37] and is show in Fig. 6(b). As can be seen from Fig. 6(b), the QQ-plot closely follows the straight line which indicates that the model is a good fit for the data.

P=\begin{bmatrix}0.94&0.06&0.00&0.00&0.00\\ 0.02&0.94&0.04&0.00&0.00\\ 0.00&0.02&0.96&0.02&0.00\\ 0.00&0.00&0.06&0.91&0.03\\ 0.00&0.00&0.00&0.01&0.99\end{bmatrix}

(35)

g=\begin{bmatrix}184&139&102&66&37\end{bmatrix}

(36)

Fig. 8(a) shows the comparison between the various schemes. It can be seen that the policy obtained through Algorithm 1 outperforms conventional periodic scheduling by $30\%$ .

V-C3 Gaming (Twitch)

The Twitch dataset contains channels with “gaming” content. Fig. 7(a) shows the distribution of the viewers of the channel during Jan 01, 2014. Similar, to the Sec. V-C2 above, we use the EM-Algorithm in [37] to estimate the parameters of the Markov model. The parameters of the Markov model consisting of the transition matrix and the state dependent mean of the Poisson distribution is as given in (37) and (38), respectively. The model is validated using QQ-plot of pseudo-residuals and is shown in Fig. 7(b).

P=\begin{bmatrix}0.97&0.03&0.00&0.00&0.00\\ 0.01&0.96&0.03&0.00&0.00\\ 0.00&0.02&0.95&0.03&0.00\\ 0.00&0.00&0.02&0.96&0.01\\ 0.00&0.00&0.00&0.02&0.98\end{bmatrix}

(37)

g=\begin{bmatrix}55.24&42.40&34.65&28.30&20.6\end{bmatrix}

(38)

Fig. 8(b) shows the comparison between the various schemes. Similar to the result in the YouTube Live session, the policy obtained through Algorithm 1 outperforms conventional periodic scheduling by close to $20\%$ .

Comparing Fig. 8(a) and Fig. 8(b) we find that the performance of Algorithm 1 is lower in the Twitch channel compared to YouTube Live. This is due to the fact that Twitch is a subscription based service and the viewers are more “loyal” and hence their is less variation in the number of viewers.

VI Conclusion

In this paper, we considered the problem of optimal scheduling of ads on live online social broadcasting channels. First, we cast the problem as an optimal multiple stopping problem in the POMDP framework. Second, we characterized the structural results of the optimal ad scheduling policy. Using the structural results of the optimal ad scheduling policy we computed best approximate policies using stochastic approximation. Finally, we validated the results on real datasets. First, we illustrated the analysis using synthetic data. In addition, using synthetic data, we showed that the optimal ad scheduling policy outperforms conventional scheduling techniques. Second, we show through simulations, that the policy obtained from the multiple stopping problem framework, used for ad scheduling, can be used to detect changes in the ground truth using data from online search. Detecting changes in ground truth is useful for optimizing ad strategy. Finally, we show through simulations, that the best approximate ad scheduling policies obtained through stochastic approximation outperforms conventional periodic scheduling by $20-30\%$ .

Extension of the current work could involve developing upper and lower myopic bounds to the optimal policy as in [38], optimizing the ad length and constraints on ad placement. These issues promise to offer interesting avenues for future work.

Appendix A Preliminaries and Definitions

A-A First-order stochastic dominance

Definition 1

Let $\pi_{1}\in\Pi$ and $\pi_{2}\in\Pi$ be two belief state vectors. Then, $\pi_{1}$ is greater than $\pi_{2}$ with respect to first-order stochastic dominance–denoted as $\pi_{1}\geq_{s}\pi_{2}$ , if

\sum_{i=j}^{S}\pi_{1}(i)\leq\sum_{i=j}^{S}\pi_{2}(i)\quad\forall j\in\left\{1,2,\cdots,S\right\}

(39)

A-B MLR ordering

Definition 2

Let $\pi_{1}\in\Pi$ and $\pi_{2}\in\Pi$ be two belief state vectors. Then, $\pi_{1}$ is greater than $\pi_{2}$ with respect to Monotone Likelihood Ratio (MLR) ordering–denoted as $\pi_{1}\geq_{r}\pi_{2}$ , if

\pi_{1}(j)\pi_{2}(i)\leq\pi_{2}(j)\pi_{1}(i),\quad i<j,\;i,j\in\left\{1,\dots,S\right\}

(40)

MLR ordering over $\Pi$ is a strong condition. In order to show threshold structure, we define the following weaker notion of MLR ordering over two types of lines.

A-C MLR ordering over lines

First, we define $\mathcal{H}$ as the $S-2$ dimensional linear hyperplane which connects the vertices $e_{2},\dots,e_{S}$ as follows:

\mathcal{H}=\left\{\bar{\pi}:\bar{\pi}\in\mathcal{H}\text{ and }\bar{\pi}(1)=0\right\}.

(41)

Figure 15 illustrates the definition (41) for an optimal multiple stopping problem with $S=3$ . Next, we construct two types of lines as follows:

•

$\mathcal{L}\left(e_{1},\bar{\pi}\right)$ : For any $\bar{\pi}\in\mathcal{H}$ , construct the line $\mathcal{L}(e_{1},\bar{\pi})$ that connects $\bar{\pi}$ to $e_{1}$ as below:

\mathcal{L}\left(e_{1},\bar{\pi}\right)={\left\{\pi\in\Pi:\pi=\left(1-\gamma\right)\bar{\pi}+\gamma e_{1},0\leq\gamma\leq 1\right\}},{\bar{\pi}\in\mathcal{H}}

(42)

•

$\mathcal{L}\left(e_{S},\bar{\pi}\right)$ : For any $\bar{\pi}\in\mathcal{H}$ , construct the line $\mathcal{L}(e_{S},\bar{\pi})$ that connects $\bar{\pi}$ to $e_{S}$ as below:

\mathcal{L}\left(e_{S},\bar{\pi}\right)=\left\{\pi\in\Pi:\pi=\left(1-\gamma\right)\bar{\pi}+\gamma e_{S},0\leq\gamma\leq 1\right\},{\bar{\pi}\in\mathcal{H}}

(43)

With an abuse of notation, we denote $\mathcal{L}(e_{1},\bar{\pi})$ by $\mathcal{L}(e_{1})$ and $\mathcal{L}(e_{S},\bar{\pi})$ by $\mathcal{L}(e_{S})$ . Figure 15 illustrates the definition of $\mathcal{L}(e_{1})$ .

Definition 3 (MLR ordering on lines)

$\pi_{1}$ is greater than $\pi_{2}$ with respect to MLR ordering on the lines $\mathcal{L}(e_{1})$ , denoted as $\pi_{1}\geq_{\mathcal{L}_{1}}\pi_{2}$ , if $\pi_{1},\pi_{2}\in\mathcal{L}(e_{1},\bar{\pi})$ , for some $\bar{\pi}\in\mathcal{H}$ and $\pi_{1}\geq_{r}\pi_{2}$ .

The MLR order is a partial order, however, the MLR ordering on lines is a complete order. The MLR on lines requires less stringent conditions and can be used for devising threshold policies over lines.

A-D TP2 ordering

Definition 4 (TP2 ordering)

A transition probability matrix, $A$ is Totally Positive of order 2 (TP2), if all the second order minors are non-negative i.e. the determinants

\begin{vmatrix}a_{i_{1}j_{1}}&a_{i_{1}j_{2}}\\ a_{i_{2}j_{1}}&a_{i_{2}j_{2}}\end{vmatrix}\geq 0,\forall i_{2}\geq i_{1},j_{2}\geq j_{1}

(44)

An important consequence of the TP2 ordering in Definition 44 is the following theorems, which states that the filter $T(\pi,y)$ preserves MLR dominance.

Theorem 4 (Theorem 10.3.1 in [27])

If the transition matrix, $P$ , and the observation matrix, $B$ , satisfies the condition in 11 and • ‣ III-B1, then

•

For $\pi_{1}\geq_{r}\pi_{2}$ , the filter satisfies $T(\pi_{1},y)\geq_{r}T(\pi_{2},y)$ .
•

For $\pi_{1}\geq_{r}\pi_{2}$ , $\sigma(\pi_{1},y)\geq_{s}\sigma(\pi_{2},y)$

Theorem 5 ([39])

$\pi_{1}\geq_{s}\pi_{2}$ if and only if for any increasing function $\phi(\cdot)$ , $\mathbb{E}_{\pi_{1}}\left\{\phi(x)\right\}\geq\mathbb{E}_{\pi_{2}}\left\{\phi(x)\right\}$

Appendix B Proof of Prop. 1, Prop. 2, Prop. 3, Theorem 1

To prove Prop. 1, Prop. 2 and Prop. 3, we assume that the proposition hold for all values less than $k$ .

B-A Proof of Prop. 1

Recall from (10),

V_{k}(\pi,l)=\underset{u\in\{1,2\}}{\text{max}}Q_{k}(\pi,l,u),

To prove Prop. 1, we show $Q_{k}(\pi,l,u)$ is increasing in $\pi$ for $u=\left\{1,2\right\}$ .

Recall from (12),

\scalebox{0.94}[1.0]{$Q_{k}(\pi,l,1)=r^{\prime}\pi+\rho\sum_{y}V_{k-1}(T(\pi,y),l-1)\sigma(\pi,y)$},

Using Theorem 4, Theorem 5 and the induction hypothesis, the term $\sum_{y}V_{k-1}(T(\pi,y),l-1)\sigma(\pi,y)$ is increasing in $\pi$ . From Assumption • ‣ III-B1, $r^{\prime}\pi$ is increasing in $\pi$ . The proof for $Q_{k}(\pi,l,2)$ increasing in $\pi$ is similar and is omitted. Hence, $V_{k}(\pi,l)$ is increasing in $\pi$ .

B-B Proof of Prop. 2

The proof follows by induction. Recall from (19), we have

$W_{k}(\pi,l-1)=\sum_{y}{W_{k-1}(T(\pi,y),l-1)\sigma(\pi,y)\mathcal{I}_{C_{k}^{l-1}}(\pi)}+{r^{\prime}\pi\mathcal{I}_{C_{k}^{l-2}\cap S_{k}^{l-1}}(\pi)}+{\sum_{y}W_{k-1}(T(\pi,y),l-2)\sigma(\pi,y)\mathcal{I}_{S_{k}^{l-2}}(\pi)}$

(45)

Hence, we compare $W_{k}(\pi,l)$ and $W_{k}(\pi,l-1)$ in the following $4$ regions:

a.)

$S_{k}^{l-2}:$

W_{k}(\pi,l)-W_{k}(\pi,l-1)=\sum_{y}(W_{k-1}(T(\pi,y),l-1)-W_{k-1}(T(\pi,y),l-2))\sigma(\pi,y),

which is non-negative by the induction assumption.

b.)

$C_{k}^{l-2}\cap S_{k}^{l-1}:$

$W_{k}(\pi,l)-W_{k}(\pi,l-1)=\sum_{y}W_{k-1}(T(\pi,y),l-1)\sigma(\pi,y)-r^{\prime}\pi,$

which is non-negative since $\pi\in S_{k}^{l-1}$ .
c.)

$C_{k}^{l-1}\cap S_{k}^{l}:$

$W_{k}(\pi,l)-W_{k}(\pi,l-1)=r^{\prime}\pi-\sum_{y}W_{k-1}(T(\pi,y),l-1)\sigma(\pi,y),$

which is non-negative since $\pi\in C_{k}^{l-1}$ .

d.)

$C_{k}^{l}:$

W_{k}(\pi,l)-W_{k}(\pi,l-1)=\sum_{y}(W_{k-1}(T(\pi,y),l)-W_{k-1}(T(\pi,y),l-1))\sigma(\pi,y),

which is non-negative by the induction assumption.

B-C Proof of Prop. 3

If $\pi\in S_{k}^{l-1}$ , then $r^{\prime}\pi\geq\sum_{y}W_{k-1}(T(\pi,y),l-1)\sigma(\pi,y)$ . By Prop. 2, $r^{\prime}\pi\geq\sum_{y}W_{k-1}(T(\pi,y),l)\sigma(\pi,y)$ . Hence $\pi\in S_{k}^{l}$ .

B-D Proof of Theorem 1

Existence of optimal policy: In order to show the existence of a threshold policy of $\mathcal{L}_{1}$ , we need to show that $Q_{k+1}(\pi,l,2)-Q_{k+1}(\pi,l,1)$ is supermodular in $\pi\in\mathcal{L}(e_{1},\bar{\pi})$ . Since,

Q_{k+1}(\pi,l,2)-Q_{k+1}(\pi,l,1)=\rho\sum_{y}W_{k}(T(\pi,y),l)\sigma(\pi,y)-r^{\prime}\pi.

We need to show that $\rho\sum_{y}W_{k}(T(\pi,y),l)\sigma(\pi,y)-r^{\prime}\pi$ is decreasing in $\pi$ .

$\displaystyle\rho\sum_{y}W_{k}(T(\pi,y),l)\sigma(\pi,y)-r^{\prime}\pi$	$\displaystyle=\sum_{y}\left(\rho W_{k}(T(\pi,y),l)-r^{\prime}\pi\right)\sigma(\pi,y)$
	$\displaystyle=\sum_{y}\left(\left(\rho W_{k}(T(\pi,y),l)-\rho r^{\prime}T(\pi,y)\right)-\left(r^{\prime}\pi-\rho r^{\prime}T(\pi,y)\right)\right)\sigma(\pi,y)$
	$\displaystyle=\rho\sum_{y}\left(W_{k}(T(\pi,y),l)-r^{\prime}T(\pi,y)\right)\sigma(\pi,y)-r^{\prime}(I-\rho P^{\prime})\pi$	(46)

The term $r^{\prime}(I-\rho P^{\prime})\pi$ in (46) is decreasing in $\pi$ due to our assumption. Hence, to show that $\rho\sum_{y}W_{k}(T(\pi,y),l)\sigma(\pi,y)-r^{\prime}\pi$ is decreasing in $\pi$ it is sufficent to show that $W_{k}(\pi,l)-r^{\prime}\pi$ is decreasing in $\pi$ . Define,

\bar{W}_{k}(\pi,l)\triangleq W_{k}(\pi,l)-r^{\prime}\pi

(47)

Now, $\bar{W}_{k}(\pi,l)=$


	$\displaystyle=\scalebox{0.9}[1.0]{$\left(\sum_{y}\left(\rho\bar{W}_{k-1}(T(\pi,y),l)\sigma(\pi,y)\right)-r^{\prime}(I-\rho P)^{\prime}\pi\right)\mathcal{I}_{C_{k}^{l}}(\pi)+\left(\sum_{y}\left(\rho\bar{W}_{k-1}(T(\pi,y),l-1)\sigma(\pi,y)\right)-r^{\prime}(I-\rho P)^{\prime}\pi\right)\mathcal{I}_{S_{k}^{l}}(\pi)$}$		(48)

We prove using induction that $\bar{W}_{k}(\pi,l)$ is decreasing in $\pi$ , using the recursive relation over $k$ in (48).

For $k=0$ ,

\bar{W}_{0}(\pi,l)=W_{0}(\pi,l)-r^{\prime}\pi=V_{0}(\pi,l)-V_{0}(\pi,l-1)-r^{\prime}\pi

(49)

The initial conditions of the value iteration algorithm can be chosen such that $\bar{W}_{0}(\pi,l)$ in (49) is decreasing in $\pi$ . A suitable choice of the initial conditions is given below:

V_{0}(\pi,l)=r^{\prime}\left(\sum_{j=0}^{l-1}\rho^{j}P^{j}\right)^{\prime}\pi.

(50)

The intuition behind the initial conditions in (50) is that the value function, $V_{0}(\pi,l)$ gives the expected total reward if we stop $l$ times successively starting at belief $\pi$ . Hence, it is clear that $\bar{W}_{k}(\pi,l)$ is decreasing in $\pi$ , if $\bar{W}_{k-1}(\pi,l)$ is decreasing in $\pi$ , finishing the induction step.

Characterization of the switching curve $\Gamma_{l}$ : For each $\bar{\pi}\in\mathcal{H}$ construct the line segment $\mathcal{L}(e_{1},\bar{\pi})$ . The line segment can be described as $(1-\varepsilon)\bar{\pi}+\varepsilon e_{1}$ . On the line segment $\mathcal{L}(e_{1},\bar{\pi})$ all the belief states are MLR orderable. Since $\mu^{*}(\pi,l)$ is monotone decreasing in $\pi$ , for each $l$ , we pick the largest $\varepsilon$ such that $\mu^{*}(\pi,l)=1$ . The belief state, $\pi^{\varepsilon^{*},\bar{\pi}}$ is the threshold belief state, where $\varepsilon^{*}=\text{inf }\{\varepsilon\in[0,1]:\mu^{*}({\pi^{\varepsilon,\bar{\pi}}})=1\}$ . Denote by $\Gamma(\bar{\pi})=\pi^{\varepsilon^{*},\bar{\pi}}$ . The above construction implies that there is a unique threshold $\Gamma(\bar{\pi})$ on $\mathcal{L}(e_{1},\bar{\pi})$ . The entire simplex can be covered by considering all pairs of lines $\mathcal{L}(e_{1},\bar{\pi})$ , for $\bar{\pi}\in\mathcal{H}$ , i.e. $\Pi(X)=\cup_{\bar{\pi}\in\mathcal{H}}\mathcal{L}(e_{1},\bar{\pi})$ . Combining, all points yield a unique threshold curve in $\Pi(X)$ given by $\Gamma=\cup_{\bar{\pi}\in\mathcal{H}}\Gamma(\bar{\pi})$ .

Connectedness of $S^{l}$ : Since $e_{1}\in S^{l}$ for all $l$ , call $S_{a}^{l}$ , the subset of $S^{l}$ that contains $e_{1}$ . Suppose $S_{b}^{l}$ is the subset that was disconnected from $S_{a}^{l}$ . Since every point on $\Pi(X)$ lies on the line segment $\mathcal{L}(e_{1},\bar{\pi})$ , for some $\bar{\pi}$ , there exists a line segment starting from $e_{1}\in S^{l}_{a}$ that would leave the region $S_{a}^{l}$ , pass through the region where action $2$ is optimal and then intersect region $S_{b}^{l}$ , where action $1$ is optimal. But, this violates the requirement that the policy $\mu^{*}(\pi,l)$ is monotone on $\mathcal{L}(e_{1},\bar{\pi})$ . Hence, $S_{a}^{l}$ and $S_{b}^{l}$ are connected.

Connectedness of $C^{l}$ : Assume $e_{X}\in C^{l}$ , otherwise $C^{l}=\phi$ and there is nothing to prove. Call the region that contains $e_{X}$ as $C^{l}_{a}$ . Suppose $C_{b}^{l}\subset C^{l}$ is disconnected from $C_{a}^{l}$ . Since every point in $\Pi(X)$ can be covered by a line segment $\mathcal{L}(e_{X},\bar{\pi})$ , for some $\bar{\pi}$ . Then, there exists a line starting from $e_{X}\in C^{l}_{a}$ would leave region $C_{a}^{l}$ , pass through the region where action $1$ is optimal and then intersect the region $C_{b}^{l}$ (where action $2$ is optimal). But this violates the monotone property of $\mu^{*}(\pi,l)$ .

Sub-setting structure: The proof is straightforward from Prop. 3.

B-E Proof of Theorem 22

For $l_{1}>l_{2}$ , due to the sub-setting structure in Theorem 1 $S^{l_{2}}\subset S^{l_{1}}$ . This implies the following

$\displaystyle\mu_{\theta}(l_{2},\pi)$	$\displaystyle\geq\mu_{\theta}(l_{1},\pi)$
$\displaystyle\begin{bmatrix}0&1&\theta_{l_{2}}\end{bmatrix}\begin{bmatrix}\pi\\ -1\end{bmatrix}$	$\displaystyle\geq\begin{bmatrix}0&1&\theta_{l_{1}}\end{bmatrix}\begin{bmatrix}\pi\\ -1\end{bmatrix}$
$\displaystyle\begin{bmatrix}0&0&\theta_{l_{2}}-\theta_{l_{1}}\end{bmatrix}\begin{bmatrix}\pi\\ -1\end{bmatrix}$	$\displaystyle\geq 0$	(51)

It is straightforward to check that the conditions in (22) in Theorem 22 satisfy the conditions in (51).

References

[1] T. Smith, M. Obrist, and P. Wright, “Live-streaming changes the (video) game,” in Proc. of the 11th European Conference on Interactive TV and Video. ACM, 2013, pp. 131–138.
[2] S. Bollapragada, M. R. Bussieck, and S. Mallik, “Scheduling commercial videotapes in broadcast television,” Oper. Res., vol. 52, no. 5, pp. 679–689, Oct. 2004.
[3] D. G. Popescu and P. Crama, “Ad revenue optimization in live broadcasting,” Management Science, vol. 62, no. 4, pp. 1145–1164, 2015.
[4] S. Seshadri, S. Subramanian, and S. Souyris, “Scheduling spots on television,” 2015.
[5] H. Kang and M. P. McAllister, “Selling you and your clicks: examining the audience commodification of google,” Journal for a Global Sustainable Information Society, vol. 9, no. 2, pp. 141–153, 2011.
[6] R. Terlutter and M. L. Capella, “The gamification of advertising: analysis and research directions of in-game advertising, advergames, and advertising in social network games,” Journal of Advertising, vol. 42, no. 2-3, pp. 95–112, 2013.
[7] J. Turner, A. Scheller-Wolf, and S. Tayur, “Scheduling of dynamic in-game advertising,” Operations Research, vol. 59, no. 1, pp. 1–16, 2011.
[8] N. Archak, V. Mirrokni, and S. Muthukrishnan, “Budget optimization for online campaigns with positive carryover effects,” in Proc. of the 8th International Conference on Internet and Network Economics. Springer-Verlag, 2012, pp. 86–99.
[9] N. Archak, V. S. Mirrokni, and S. Muthukrishnan, “Mining advertiser-specific user behavior using adfactors,” in Proceedings of the 19th International Conference on World Wide Web, ser. WWW ’10. ACM, 2010, pp. 31–40.
[10] T. Nakai, “The problem of optimal stopping in a partially observable markov chain,” Journal of optimization Theory and Applications, vol. 45, no. 3, pp. 425–442, 1985.
[11] W. Stadje, “An optimal k-stopping problem for the poisson process,” in Mathematical Statistics and Probability Theory. Springer, 1987, pp. 231–244.
[12] M. Nikolaev, “On optimal multiple stopping of markov sequences,” Theory of Probability & Its Applications, vol. 43, no. 2, pp. 298–306, 1999.
[13] A. Krasnosielska-Kobos, “Multiple-stopping problems with random horizon,” Optimization, vol. 64, no. 7, pp. 1625–1645, 2015.
[14] D. P. Bertsekas, D. P. Bertsekas, D. P. Bertsekas, and D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientific Belmont, MA, 1995, vol. 1, no. 2.
[15] E. Bayraktar and R. Kravitz, “Quickest detection with discretely controlled observations,” Sequential Analysis, vol. 34, no. 1, pp. 77–133, 2015.
[16] J. Geng, E. Bayraktar, and L. Lai, “Bayesian quickest change-point detection with sampling right constraints,” IEEE Transactions on Information Theory, vol. 60, no. 10, pp. 6474–6490, 2014.
[17] T. L. Lai, “On optimal stopping problems in sequential hypothesis testing,” Statistica Sinica, vol. 7, no. 1, pp. 33–51, 1997.
[18] ——, Sequential analysis. Wiley Online Library, 2001.
[19] S. H. J. Alexander G. Nikolaev, “Stochastic sequential decision-making with a random number of jobs,” Operations Research, vol. 58, no. 4, pp. 1023–1027, 2010.
[20] S. Savin and C. Terwiesch, “Optimal product launch times in a duopoly: Balancing life-cycle revenues with product cost,” Operations Research, vol. 53, no. 1, pp. 26–47, 2005.
[21] I. Lobel, J. Patel, G. Vulcano, and J. Zhang, “Optimizing product launches in the presence of strategic consumers,” Management Science, vol. 62, no. 6, pp. 1778–1799, 2015.
[22] K. E. Wilson, R. Szechtman, and M. P. Atkinson, “A sequential perspective on searching for static targets,” European Journal of Operational Research, vol. 215, no. 1, pp. 218 – 226, 2011.
[23] M. Atkinson, M. Kress, and R.-J. Lange, “When is information sufficient for action? search with unreliable yet informative intelligence,” Operations Research, vol. 64, no. 2, pp. 315–328, 2016.
[24] I. D. Askwith, “Television 2.0: Reconceptualizing tv as an engagement medium,” Ph.D. dissertation, Massachusetts Institute of Technology, 2007.
[25] H. Yu, D. Zheng, B. Y. Zhao, and W. Zheng, “Understanding user behavior in large-scale video-on-demand systems,” SIGOPS Oper. Syst. Rev., vol. 40, no. 4, pp. 333–344, Apr. 2006.
[26] V. Krishnamurthy and D. V. Djonin, “Structured threshold policies for dynamic sensor scheduling–a partially observed markov decision process approach,” IEEE Transactions on Signal Processing, vol. 55, no. 10, pp. 4938–4957, Oct 2007.
[27] V. Krishnamurthy, Partially Observed Markov Decision Processes. Cambridge University Press, 2016.
[28] ——, “How to schedule measurements of a noisy Markov chain in decision making?” IEEE Transactions on Information Theory, vol. 59, no. 7, pp. 4440–4461, July 2013.
[29] ——, “Bayesian sequential detection with phase-distributed change time and nonlinear penalty – A POMDP Lattice programming approach,” IEEE Transactions on Information Theory, vol. 57, no. 10, pp. 7096–7124, Oct 2011.
[30] J. J. M. Eric Johnson, “Infinitesimal perturbation analysis: A tool for simulation,” The Journal of the Operational Research Society, vol. 40, no. 3, pp. 243–254, 1989.
[31] G. C. Pflug, Optimization of stochastic models: the interface between simulation and optimization. Springer Science & Business Media, 2012, vol. 373.
[32] J. C. Spall, Introduction to stochastic search and optimization: estimation, simulation, and control. John Wiley & Sons, 2005, vol. 65.
[33] I.-J. Wang and J. C. Spall, “Stochastic optimisation with inequality constraints using simultaneous perturbations and penalty functions,” International Journal of Control, vol. 81, no. 8, pp. 1232–1238, 2008.
[34] M. Joo, K. C. Wilbur, B. Cowgill, and Y. Zhu, “Television advertising and online search,” Management Science, vol. 60, no. 1, pp. 56–73, 2013.
[35] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant, “Detecting influenza epidemics using search engine query data,” Nature, vol. 457, no. 7232, pp. 1012–1014, 2009.
[36] K. Pires and G. Simon, “Youtube live and twitch: A tour of user-generated live streaming systems,” in Proceedings of the 6th ACM Multimedia Systems Conference, ser. MMSys ’15. ACM, 2015, pp. 225–230.
[37] W. Zucchini and I. L. MacDonald, Hidden Markov models for time series: an introduction using R. CRC press, 2009.
[38] V. Krishnamurthy and U. Pareek, “Myopic bounds for optimal policy of POMDPs: An extension of Lovejoy’s structural results,” Operations Research, vol. 62, no. 2, pp. 428–434, 2015.
[39] P. C. Kiessler, “Comparison methods for stochastic models and risks,” Journal of the American Statistical Association, vol. 100, no. 470, pp. 704–704, 2005.