An Online Learning Approach to Dynamic Pricing and Capacity Sizing in Service Systems

1 Introduction

1.1 Problem Statement and Methodology

1.1.1 Pricing and capacity sizing in queue

We study a service queueing model where the service provider manages congestion and revenue by dynamically adjusting the price and service capacity. Specifically, we consider a $GI/GI/1$ queue, in which the demand for service is $\lambda(p)$ per unit of time when each customer is charged a service fee $p$; the cost of providing service capacity $\mu$ is $c(\mu)$; and a holding cost $h_{0}$ is incurred per job per unit of time. By choosing the appropriate service fee $p$ and capacity $\mu$, the service provider aims to maximize the net profit, which is the revenue from service fees minus the staffing cost and the congestion penalty, i.e.,

\max_{\mu,p}\ \mathcal{P}(\mu,p)\equiv p\lambda(p)-c(\mu)-h_{0}\mathbb{E}[Q_{\infty}(\mu,p)],

(1)

where $Q_{\infty}(\mu,p)$ is the steady-state queue length under service rate $\mu$ and price $p$ .
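To make objective (1) concrete, the following minimal sketch evaluates the profit in the Markovian $M/M/1$ special case, where the steady-state mean number in system is $\rho/(1-\rho)$ with $\rho=\lambda(p)/\mu$. The linear demand curve $\lambda(p)=10-2p$, the linear staffing cost $c(\mu)=\mu$, and the value $h_{0}=1$ are illustrative assumptions and do not come from the paper.

```python
def demand_rate(p):
    """Hypothetical linear demand curve lambda(p) = 10 - 2p (illustration only)."""
    return 10.0 - 2.0 * p


def staffing_cost(mu):
    """Hypothetical linear staffing cost c(mu) = mu (illustration only)."""
    return mu


def mm1_profit(mu, p, h0=1.0):
    """Objective (1) for an M/M/1 queue: p*lambda(p) - c(mu) - h0*E[Q]."""
    lam = demand_rate(p)
    if not 0.0 <= lam < mu:
        raise ValueError("need 0 <= lambda(p) < mu for a stable queue")
    rho = lam / mu
    mean_queue = rho / (1.0 - rho)  # M/M/1 mean number in system
    return p * lam - staffing_cost(mu) - h0 * mean_queue


# lambda(3) = 4, rho = 0.5, E[Q] = 1, so the profit is 12 - 8 - 1.
print(mm1_profit(mu=8.0, p=3.0))  # prints 3.0
```

For general $GI/GI/1$ queues no such closed form is available, which is what motivates the data-driven approach.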

Problems in this framework have a long history, see for example Kumar and Randhawa (2010), Lee and Ward (2014), Lee and Ward (2019), Maglaras and Zeevi (2003), Nair et al. (2016), Kim and Randhawa (2018) and the references therein. Due to the complex nature of the queueing dynamics, exact analysis is challenging and often unavailable (computation of the optimal dynamic pricing and staffing rules is not straightforward even for the Markovian $M/M/1$ queue (Ata and Shneorson, 2006)). Therefore, researchers resort to heavy-traffic analysis to approximately obtain performance evaluation and optimization results. Commonly adopted heavy-traffic regimes require sending the arrival rate and service capacity (service rate or number of servers) to $\infty$ . Although heavy-traffic analysis provides satisfactory results for large-scale queueing systems, approximation formulas based on heavy-traffic limits often become inaccurate as the system scale decreases.

1.1.2 An online learning method

In this paper we propose an online learning framework designed for solving Problem (1). According to our online learning algorithm, the $GI/GI/1$ queue will be operated in successive cycles, where in each cycle the service provider’s decisions on the service fee $p$ and service capacity $\mu$, deemed the best so far, are obtained using the system’s data collected in previous operational cycles. Data hereby include (i) the number of customers who join for service, (ii) customers’ waiting times, and (iii) the server’s busy time, all of which are easy to collect. Newly generated data, which represent the response from the (random and complex) environment to the present operational decisions, will be used to obtain improved pricing and staffing policies in the next cycle. In this way the service provider can dynamically interact with the environment so that the operational decisions can evolve and eventually approach the optimal solution.

At the beginning of each cycle $k$, the service provider’s decisions $(p_{k},\mu_{k})$ will be computed and enforced throughout the cycle. At the heart of our procedure for computing $(p_{k},\mu_{k})$ is a sufficiently accurate estimator $H_{k-1}$ for the gradient of the objective function of (1), built from past experience. Specifically, our online algorithm will update $(p_{k},\mu_{k})$ according to

\displaystyle(\mu_{k},p_{k})\leftarrow(\mu_{k-1},p_{k-1})+\eta_{k-1}H_{k-1},

where $\eta_{k}$ is the updating step size for cycle $k$ . We call this algorithm Gradient-based Online Learning in Queue (GOLiQ).

Besides showing that, under our online learning scheme, the decisions $(\mu_{k},p_{k})$ in cycle $k$ converge to the optimal solution $(\mu^{*},p^{*})$ as $k$ increases, we quantify the effectiveness of GOLiQ by computing the regret, that is, the cumulative loss of profit due to the suboptimality of $(\mu_{k},p_{k})$: the maximum profit under the (unknown) optimal strategy minus the expected profit earned under the online algorithm over time. When GOLiQ’s hyperparameters are chosen optimally, we show that our regret bound is logarithmic so that the service provider, with any initial pricing and staffing policy $(\mu_{0},p_{0})$, will quickly learn the optimal solutions without losing much profit in the learning process.

1.2 Advantages, Challenges and Contributions

In what follows, we first discuss the general advantages of the online learning approach by contrasting it with heavy-traffic methods; we then explain the key challenges we face in developing online learning algorithms for queueing systems.

1.2.1 Online learning vs. heavy-traffic method.

First, heavy-traffic solutions are derived from approximating models which arise as the system scale approaches infinity, so the fidelity of the solutions is sensitive to the system scale. Unlike heavy-traffic methods, online learning approaches do not require any asymptotic scaling, so they can treat service systems at any scale (small or large). Second, heavy-traffic approaches usually require the knowledge of certain distributional information a priori (e.g., moments and distribution functions of service times), which serve as critical input parameters for the heavy-traffic models. On the other hand, online learning methods require information of this kind to a lesser extent. Although certain distributional information can help fine-tune parameters of online algorithms, it is less crucial to algorithm design and implementation. So in this sense, the dependence on the distributional information is weaker than that of heavy-traffic analysis. Last, online learning is advantageous when the underlying problem focuses on performance optimization in the long run. Heavy-traffic analysis gives approximate solutions that are static, and in a longer time frame, the performance discrepancy (relative to the true optimal reward) should grow linearly as time increases. But online learning is a dynamic evolution, and its data-driven nature enables it to constantly produce improved solutions which will eventually reach optimality. In addition, heavy-traffic solutions require the establishment of heavy-traffic limit theorems and careful analysis of the dynamics of the limit processes (e.g., fluid and diffusion). Both steps can be quite sophisticated in general. See Remarks 11 and 12 for more detailed discussions; also see Section 6.3 for numerical evidence.

1.2.2 Challenges of online learning in queueing systems.

Online learning in queues is by no means an easy extension of online learning in other domains; its theoretical development has to account for the unique features of queueing systems. A crucial step is to develop effective ways to control the nonstationary error that arises at the beginning of every cycle due to the policy update. Towards this, we develop a new regret analysis framework for the transient queueing performance that not only helps establish desired regret bounds for the specific online $GI/GI/1$ algorithm, but may also be used to develop online learning methods for other queueing models (see Section 4). Another challenge we have to address here is to devise a convenient gradient estimator for the online learning algorithm (essentially, an estimator for the gradient of $\mathbb{E}[Q_{\infty}(\mu,p)]$). The estimator should have a negligible bias to warrant a quick convergence of the algorithm, and at the same time, its computation (using previous data) should be sufficiently straightforward to ensure the ease of implementation (the detailed gradient estimator of GOLiQ for the $GI/GI/1$ system is given in Section 5).

1.2.3 Main Contributions

We summarize our contributions below.

•

To the best of our knowledge, the present work is the first to develop an online learning framework for joint pricing and staffing in a queueing system with logarithmic regret bound in the total number of customers served (Theorem 3). Due to the complex nature of queueing systems, previous research often resorts to asymptotic heavy-traffic analysis to approximately solve for desired operational decisions. The ingenuity of our online learning method lies in the ability to obtain the optimal solutions without needing the system scale (e.g., arrival rate and service rate) to grow large. The other appeal of our method is its robustness, especially in its weaker dependence on the distributions of service and arrival times.
•

A critical step in the regret analysis is the treatment of the transient system dynamics, because when improved operational decisions are obtained and implemented at the beginning of a new period, the queueing performance will shift away from the previously established steady-state level. Towards this, we develop a new way to treat and bound the transient queueing performance in the regret analysis of our online learning algorithm (Theorem 1). Bounding the transient error also guarantees convergence of the SGD iteration (Theorem 2). Compared with previous literature (e.g., the regret bound is $O(T^{2/3})$ in Huh et al. (2009)), our analysis of the regret due to nonstationarity gives a much tighter logarithmic bound. In addition, the regret analysis in the present paper may be extended to other queueing systems which share similar properties with $GI/GI/1$.
•

Supplementing the theoretical results of our regret bound, we evaluate the practical effectiveness of our method by conducting comprehensive numerical experiments. Our simulations draw the following two main conclusions. First, our method is robust in several dimensions: (i) GOLiQ exhibits convincing performance for $GI/GI/1$ queues having representative arrival and service distributions; (ii) GOLiQ remains effective even when certain theoretical assumptions are relaxed. Furthermore, in order to clearly highlight the advantages of our online learning approach relative to the previous results of heavy-traffic limits, we provide a careful performance comparison of these two methods. We show that GOLiQ is more effective in any one of the following three cases: the system scale is not too large, staffing cost is high, or service times are more variable.

1.3 Organization of the paper

In Section 2, we review the related literature. In Section 3, we introduce the model assumptions and provide an outline of our online learning algorithm. In Section 4, we conduct the regret analysis for GOLiQ by separately treating the regret of nonstationarity - the part of regret arising from the transient system dynamics, and the regret of suboptimality - the part originating from the errors due to suboptimal pricing and staffing decisions. In Section 5, we give the detailed description of GOLiQ and establish a logarithmic regret bound by appropriately selecting our algorithm parameters. In Section 6 we conduct numerical experiments to confirm the effectiveness and robustness of GOLiQ. We conclude in Section 7. In the e-companion, we give all technical proofs and provide additional numerical examples.

2 Related Literature

The present paper is related to the following three streams of literature.

Pricing and capacity sizing in queues.

There is a rich literature on pricing and capacity sizing in service systems under different settings. Maglaras and Zeevi (2003) studies a pricing and capacity sizing problem in a processor sharing queue motivated by internet applications; Kumar and Randhawa (2010) considers a single-server system with nonlinear delay costs; Nair et al. (2016) studies $M/M/1$ and $M/M/k$ systems with network effect among customers; Kim and Randhawa (2018) considers a dynamic pricing problem in a single-server system. The specific problem (1) we consider here is most closely related to Lee and Ward (2014), i.e., joint pricing and capacity sizing for the $GI/GI/1$ queue. Later, the authors extend their results to the $GI/GI/1+G$ model with customer abandonment in Lee and Ward (2019). As there is usually no closed-form solution for the optimal strategy or equilibrium, asymptotic analysis is adopted under large-market assumptions. In detail, their analysis is rooted in a deterministic static planning problem which requires both the service capacity and the demand rate to scale to infinity. Most of the papers conclude that the heavy-traffic regime is economically optimal. (There are some exceptions where the heavy-traffic regime is not optimal; for example, Kumar and Randhawa (2010) shows that the agent is forced to decrease its utilization if the delay cost is concave.) Our algorithm is motivated by the pricing and capacity sizing problem for service systems; however, as explained previously, our methodology is very different from the asymptotic analysis used in these papers.

Reinforcement learning for queueing systems.

Our paper is also related to a small but growing literature on reinforcement learning (RL) for queueing systems. Dai and Gluzman (2021) studies an actor-critic algorithm for queueing networks. Liu et al. (2019) and Shah et al. (2020) develop RL techniques to treat the unboundedness of the state space of queueing systems. Jia et al. (2021) studies a price-based revenue management problem in an $M/M/c$ queue with a discrete price space; their methodology draws from the multi-armed bandit framework (with each price treated as an “arm”). Krishnasamy et al. (2021) develops bandit methods for scheduling problems in a multi-server queue with unknown service rates. Our work differs from the above-mentioned literature in two dimensions. First, we are the first to develop an online learning method for joint pricing and capacity sizing in queues. In addition, our method applies to settings of continuous decision variables. Compared with the more general RL literature, our algorithm design and regret analysis take advantage of the specific queueing system structure so as to establish tight regret bounds and more accurate control of the convergence rate. In some sense, the algorithm developed in the present paper may be viewed as a version of the policy gradient method, a special class of RL methods (Sutton and Barto, 2018); see Remark 2 for detailed discussions.

Stochastic gradient descent algorithms.

In general, our algorithm falls into the broad class of stochastic gradient descent (SGD) methods. There are some early papers on SGD algorithms for steady-state performance of queues (see Fu (1990), Chong and Ramadge (1993), L’Ecuyer et al. (1994), L’Ecuyer and Glynn (1994) and the references therein). In particular, these papers have established convergence results of SGD algorithms for capacity sizing problems with a variety of gradient estimating designs. In this paper, we consider a more general setting in which the price is also optimized jointly with the service capacity. Besides, in order to establish theoretical bounds for the regret, we conduct a careful analysis of the convergence rate of the algorithm and provide explicit guidance for the optimal choice of algorithm parameters, which is not discussed in this early literature. Our algorithm design and analysis are also related to the online learning methods in recent inventory management literature (Burnetas and Smith, 2000; Huh et al., 2009; Huh and Rusmevichientong, 2013; Zhang et al., 2020; Yuan et al., 2021). Among these papers, our work is perhaps most closely related to Huh et al. (2009), where the authors develop an SGD-based learning method for an inventory model with a bounded replenishment lead time. Still, due to the unique nature of queueing models, we develop a new regret analysis framework, as discussed in Section 1.2.3.

3 Problem Setting and Algorithm Outline

In Section 3.1 we describe the queueing model and technical assumptions. In Section 3.2, we provide a general outline of GOLiQ. Finally, in Section 3.3 we conduct preliminary analysis of the queueing performance under GOLiQ.

3.1 Model and Assumptions

We study a $GI/GI/1$ queueing system having customer arrivals according to a renewal process with generally distributed interarrival times (the first $GI$ ), independent and identically distributed (i.i.d.) service times following a general distribution (the second $GI$ ), and a single server that provides service under the first-in-first-out (FIFO) discipline. Each customer upon joining the queue is charged by the service provider a fee $p>0$ . The demand arrival rate (per time unit) depends on the service fee $p$ and is denoted as $\lambda(p)$ . To maintain a service rate $\mu$ , the service provider continuously incurs a staffing cost at a rate $c(\mu)$ per time unit.

For $\mu\in[\underline{\mu},\bar{\mu}]$ and $p\in[\underline{p},\bar{p}]$ , the service provider’s goal is to determine the optimal service fee $p^{*}$ and service capacity $\mu^{*}$ with the objective of maximizing the steady-state expected profit (1), or equivalently minimizing the objective function $f(\mu,p)$ as follows

\displaystyle\min_{(\mu,p)\in\mathcal{B}}f(\mu,p)\equiv h_{0}\mathbb{E}[Q_{\infty}(\mu,p)]+c(\mu)-p\lambda(p),\qquad\mathcal{B}\equiv[\underline{\mu},\bar{\mu}]\times[\underline{p},\bar{p}].

(2)

We shall impose the following assumptions on the above service system throughout the paper.

Assumption 1.

$($ Demand rate, staffing cost, and uniform stability $)$

$(a)$

The arrival rate $\lambda(p)$ is continuously differentiable and non-increasing in $p$ .
$(b)$

The staffing cost $c(\mu)$ is continuously differentiable and non-decreasing in $\mu$ .
$(c)$

The lower bounds $\underline{p}$ and $\underline{\mu}$ satisfy that $\lambda(\underline{p})<\underline{\mu}$ so that the system is uniformly stable for all feasible choices of the pair $(\mu,p)$ .

Part $(c)$ of Assumption 1 is commonly used in the literature of SGD methods for queueing models to ensure that the steady-state mean waiting time $\mathbb{E}[W_{\infty}(\mu,p)]$ is differentiable with respect to model parameters (see Chong and Ramadge (1993), Fu (1990), L’Ecuyer et al. (1994), L’Ecuyer and Glynn (1994), also see Theorem 3.2 of Glasserman (1992)). In our numerical experiments (see Section 11.1), we show that our online algorithm remains effective when this assumption is relaxed.

We do not require full knowledge of service and inter-arrival time distributions. But in order to develop explicit bounds for the part of the regret due to the nonstationarity of the queueing processes, we require both distributions to be light-tailed. Specifically, since the actual service and interarrival times are subject to our pricing and staffing decisions, we model the interarrival and service times by two scaled random sequences $\{U_{n}/\lambda(p)\}$ and $\{V_{n}/\mu\}$ , where $U_{1},U_{2},\ldots$ and $V_{1},V_{2},\ldots$ are two independent i.i.d. sequences of random variables having unit means, i.e., $\mathbb{E}[U_{n}]=\mathbb{E}[V_{n}]=1$ . We make the following assumptions on $U_{n}$ and $V_{n}$ .

Assumption 2.

$($ Light-tailed service and interarrival times $)$
There exists a sufficiently small constant $\eta>0$ such that the moment-generating functions

\mathbb{E}[\exp(\eta V_{n})]<\infty\quad\text{and}\quad\mathbb{E}[\exp(\eta U_{n})]<\infty.

In addition, there exist constants $0<\theta<\eta/2\bar{\mu}$ , $0<a<(\underline{\mu}-\lambda(\underline{p}))/(\underline{\mu}+\lambda(\underline{p}))$ and $\gamma>0$ such that

\phi_{U}(-\theta)<-(1-a)\theta-\gamma\quad\text{and}\quad\phi_{V}(\theta)<(1+a)\theta-\gamma,

(3)

where $\phi_{V}(\theta)\equiv\log\mathbb{E}[\exp(\theta V_{n})]$ and $\phi_{U}(\theta)\equiv\log\mathbb{E}[\exp(\theta U_{n})]$ are the cumulant generating functions of $V$ and $U$, respectively.

Note that $\phi^{\prime}_{U}(0)=\phi^{\prime}_{V}(0)=1$ as $\mathbb{E}[U]=\mathbb{E}[V]=1$ . Suppose $\phi_{U}$ and $\phi_{V}$ are smooth around 0, then we have $\phi_{U}(-\theta)=-\theta+o(\theta)$ and $\phi_{V}(\theta)=\theta+o(\theta)$ by Taylor’s expansion. This implies that, for any $a>0$ , we can make $\theta$ small enough, such that $\phi_{U}(-\theta)<-(1-a)\theta$ and $\phi_{V}(\theta)<(1+a)\theta$ . To obtain the bound in (3), we can simply take $\gamma={\frac{1}{2}}\min(-(1-a)\theta-\phi_{U}(-\theta),(1+a)\theta-\phi_{V}(\theta))>0$ . Hence, a sufficient condition that warrants (3) is to require that $\phi_{U}$ and $\phi_{V}$ be smooth around 0, which is true for many distributions of $U$ and $V$ considered in common queueing models. Assumption 2 will be used in our proofs to build an explicit bound for the regret of nonstationarity.
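As a quick numerical illustration of the argument above, the sketch below checks condition (3) when both $U$ and $V$ are unit-mean exponential random variables, whose cumulant generating function is $\phi(\theta)=-\log(1-\theta)$ for $\theta<1$. The exponential choice and the trial values of $\theta$ and $a$ are assumptions made for illustration; the side constraints $\theta<\eta/2\bar{\mu}$ and the upper bound on $a$ in Assumption 2 are not checked here.

```python
import math


def cgf_exponential(theta):
    """CGF of a unit-mean exponential variable: -log(1 - theta), valid for theta < 1."""
    if theta >= 1.0:
        raise ValueError("the CGF of Exp(1) is finite only for theta < 1")
    return -math.log(1.0 - theta)


def gamma_for_condition_3(theta, a):
    """Return gamma > 0 satisfying (3) for Exp(1) inputs, or None if (3) fails.

    Follows the construction in the text: gamma is half the smaller of the
    two slacks -(1-a)*theta - phi_U(-theta) and (1+a)*theta - phi_V(theta).
    """
    slack_u = -(1.0 - a) * theta - cgf_exponential(-theta)
    slack_v = (1.0 + a) * theta - cgf_exponential(theta)
    slack = min(slack_u, slack_v)
    return 0.5 * slack if slack > 0.0 else None


print(gamma_for_condition_3(theta=0.1, a=0.2))  # a small positive gamma
print(gamma_for_condition_3(theta=0.5, a=0.2))  # None: theta is too large for a = 0.2
```

The second call shows why $\theta$ must be taken small relative to $a$: the $o(\theta)$ term in the Taylor expansion eventually exceeds the slack $a\theta$.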

Finally, in order to warrant the convergence of our online learning algorithm, we require a convex structure for the problem in (2), which is common in the SGD literature; see Broadie et al. (2011), Kushner and Yin (2003) and the references therein.

Let $x^{*}\equiv(\mu^{*},p^{*})$ and $x\equiv(\mu,p)$ . Let $\nabla f(x)$ denote the gradient of a function $f(x)$ and $\|\cdot\|$ denote the Euclidean norm.

Assumption 3.

$($ Convexity and smoothness $)$
There exist finite positive constants $K_{0}\leq 1$ and $K_{1}>K_{0}$ such that for all $x\in\mathcal{B}$,

$(a)$

$(x-x^{*})^{T}\nabla f(x)\geq K_{0}\|x-x^{*}\|^{2}$ ;
$(b)$

$\|\nabla f(x)\|\leq K_{1}\|x-x^{*}\|$ .
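As a simple sanity check on Assumption 3 (a hypothetical example, not taken from the paper), consider a quadratic objective with a symmetric positive-definite matrix $A$ whose eigenvalues lie in $[\lambda_{\min},\lambda_{\max}]$. Both conditions then follow directly from the eigenvalue bounds:

```latex
f(x)=\tfrac{1}{2}(x-x^{*})^{T}A(x-x^{*})
\quad\Longrightarrow\quad
\nabla f(x)=A(x-x^{*}),
\\
(x-x^{*})^{T}\nabla f(x)=(x-x^{*})^{T}A(x-x^{*})\geq\lambda_{\min}\|x-x^{*}\|^{2},
\qquad
\|\nabla f(x)\|=\|A(x-x^{*})\|\leq\lambda_{\max}\|x-x^{*}\|.
```

So in this example part $(a)$ holds with $K_{0}=\min(\lambda_{\min},1)$ and part $(b)$ holds with any $K_{1}\geq\lambda_{\max}$ satisfying $K_{1}>K_{0}$.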

Remark 1.

Our simulation experiments show that our algorithm works effectively for some representative $GI/GI/1$ queues with conditions in Assumption 3 relaxed; see Section 6 and Section 11 in the Supplement Material. In addition, we later provide some sufficient conditions for Assumption 3 in the special case of $M/GI/1$ queues in Section 12.

3.2 Outline of GOLiQ

In general, an SGD algorithm for a minimization problem $\min_{x}f(x)$ over a compact set $\mathcal{B}$ relies on updating the decision variable via the recursion

\displaystyle x_{k+1}=\Pi_{\mathcal{B}}(x_{k}-\eta_{k}H_{k}),\qquad k\geq 1,

where $\eta_{k}$ is the step size, $H_{k}$ is a random estimator for $\nabla f(x_{k})$, $x_{k}$ is the decision variable at step $k$, and the projection operator $\Pi_{\mathcal{B}}$ restricts the updated decision to $\mathcal{B}$. For problem (2), we let $x_{k}\equiv(\mu_{k},p_{k})$ represent the service capacity and price at step $k$. We define

B_{k}\equiv\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]^{1/2}\quad\text{and}\quad\mathcal{V}_{k}\equiv\mathbb{E}[\|H_{k}\|^{2}],

(4)

where $\mathcal{F}_{k}$ is the $\sigma$ -algebra including all events in the first $k-1$ iterations. Intuitively, $B_{k}$ measures the bias of the gradient estimator $H_{k}$ and $\mathcal{V}_{k}$ measures its variability. As we shall see later, $B_{k}$ and $\mathcal{V}_{k}$ play important roles in designing the algorithm and establishing desired regret bounds.

The standard SGD algorithm iterates in discrete step $k$ . In our setting, however, the queueing system and objective function $f(\mu,p)$ are defined in continuous time (in particular, $Q_{\infty}(\mu,p)$ is the steady-state queue length observed in continuous time). To facilitate the regret analysis, we first transform the objective function into an expression of customer waiting times that are observed in discrete time. By Little’s law, we can rewrite the objective function $f(\mu,p)$ as, for all $(\mu,p)\in\mathcal{B}$ ,

\displaystyle f(\mu,p)=h_{0}\lambda(p)\left(\mathbb{E}[W_{\infty}(\mu,p)]+\frac{1}{\mu}\right)+c(\mu)-p\lambda(p),

(5)

where $W_{\infty}(\mu,p)$ is the steady-state waiting time under $(\mu,p)$ . In each cycle $k$ , our algorithm adopts the average of $D_{k}$ observed customer waiting times to estimate $\mathbb{E}[W_{\infty}(\mu,p)]$ , where $D_{k}$ denotes the number of customers that enter service in cycle $k$ (we refer to $D_{k}$ as the cycle length or sample size of cycle $k$ ). But any finite $D_{k}$ will introduce a bias to our gradient estimate $H_{k}$ . To mitigate the bias due to the transient performance of the queueing process, we shall let the cycle length $D_{k}$ be increasing in $k$ (in this way the transient bias will vanish eventually). We give the outline of the algorithm below.

Outline of GOLiQ:

0.

Input: $\{D_{k}\}$ and $\{\eta_{k}\}$ for $k=1,2,\ldots,L$, initial policy $x_{1}=(\mu_{1},p_{1})$.

For $k=1,2,\ldots,L$,
1.

In the $k^{\rm th}$ cycle, operate the $GI/GI/1$ queue under policy $x_{k}=(\mu_{k},p_{k})$ until $D_{k}$ customers enter service.
2.

Collect and use the data (e.g., customer delays) to build an estimator $H_{k}$ for $\nabla f(\mu_{k},p_{k})$ .
3.

Update $x_{k+1}=\Pi_{\mathcal{B}}(x_{k}-\eta_{k}H_{k})$ .
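The update in Step 3 can be sketched as a projected SGD loop. This is a simplified illustration under stated assumptions, not the paper's implementation: Steps 1 and 2 (operating the queue and building $H_{k}$ from data) are replaced by a noisy evaluation of the exact gradient of a hypothetical $M/M/1$ instance with $\lambda(p)=10-2p$, $c(\mu)=\mu$ and $h_{0}=1$; the box bounds, the step sizes $\eta_{k}=\eta_{0}/k$, and the Gaussian noise level are likewise illustrative choices.

```python
import random

# Hypothetical feasible box B; note lambda(P_LO) = 4 < MU_LO, so the queue is stable on B.
MU_LO, MU_HI = 4.5, 10.0
P_LO, P_HI = 3.0, 5.0


def f(mu, p, h0=1.0):
    """Objective (2) for the illustrative M/M/1 instance."""
    lam = 10.0 - 2.0 * p
    return h0 * lam / (mu - lam) + mu - p * lam


def grad_f(mu, p, h0=1.0):
    """Exact gradient of f with respect to (mu, p) for the illustrative instance."""
    lam, dlam = 10.0 - 2.0 * p, -2.0
    gap = mu - lam
    d_mu = -h0 * lam / gap**2 + 1.0
    d_p = h0 * dlam * mu / gap**2 - lam - p * dlam
    return d_mu, d_p


def project(mu, p):
    """Projection onto the box B (coordinate-wise clipping)."""
    return min(max(mu, MU_LO), MU_HI), min(max(p, P_LO), P_HI)


def projected_sgd(x1, num_cycles, eta0=1.0, noise_sd=0.1, seed=0):
    """Iterate x_{k+1} = Pi_B(x_k - eta_k * H_k) with a noisy stand-in for H_k."""
    rng = random.Random(seed)
    mu, p = x1
    for k in range(1, num_cycles + 1):
        g_mu, g_p = grad_f(mu, p)
        # Stand-in estimator: exact gradient plus Gaussian noise (an assumption).
        h_mu = g_mu + rng.gauss(0.0, noise_sd)
        h_p = g_p + rng.gauss(0.0, noise_sd)
        eta = eta0 / k
        mu, p = project(mu - eta * h_mu, p - eta * h_p)
    return mu, p


x_final = projected_sgd((8.0, 4.5), num_cycles=200)
print(x_final, f(*x_final))
```

In the actual algorithm the estimator $H_{k}$ is built from waiting-time and busy-time data, so its bias and variance depend on the cycle length $D_{k}$, which this stand-in does not model.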

Remark 2 (Exploration vs. exploitation).

The online nature of this algorithm makes it possible to obtain improved decisions by learning from past experience, which is in the spirit of the essential ideas of reinforcement learning where an agent (hereby the service provider) aims to trade off between exploration (Step 1) and exploitation (Steps 2 and 3). The effectiveness of the algorithm lies in properly choosing the algorithm parameters and devising an efficient gradient estimator $H_{k}$. For example, if $D_{k}$ is too small, we are unable to generate sufficient data (we do not have much to exploit in order to devise a better policy); if $D_{k}$ is too large, we incur a higher profit loss due to suboptimality of the policy in use (we do not explore enough for seeking potentially better policies). In particular, GOLiQ may be viewed as a special case of the policy gradient (PG) algorithm (the general idea of PG is to estimate the policy parameters using the gradient of the value function learned via continuous interaction with the system; see for example Sutton and Barto (2018)). To put this into perspective, the policy in the present paper is specified by a pair of parameters $(\mu,p)$, and in each iteration, we update the policy parameters using an estimated policy gradient $H_{k}$ learned from data of the queueing model. In the subsequent sections, we give detailed regret analysis that can be used to establish optimal algorithm parameters (Section 4) and develop an efficient gradient estimator (Section 5).

3.3 System Dynamics under GOLiQ

We explain explicitly the dynamics of the queueing system under GOLiQ, with the system starting empty. We first define notations for relevant performance functions. For $k\geq 1$, let $T_{k}$ be the length of cycle $k$ in the units of time, and let $D_{k}$ be the total number of customers who enter service in cycle $k$. For $n=1,2,...,D_{k}$, let $W_{n}^{k}$ be the waiting time of the $n^{\rm th}$ customer that enters service in cycle $k$. We define $W_{0}^{k}\equiv W_{D_{k-1}}^{k-1}$. We use the two i.i.d. random sequences $V_{n}^{k}$ and $U_{n}^{k}$ to construct the service and inter-arrival times in cycle $k$, $n=1,2,...,D_{k}$. In particular, $V_{n}^{k}$ corresponds to the service time of customer $n-1$, and $U_{n}^{k}$ corresponds to the inter-arrival time between customers $n-1$ and $n$ in cycle $k$. Let $\lambda_{k}\equiv\lambda(p_{k})$. Last, we use $Q_{k}$ to denote the number of existing customers (those who arrive in previous cycles) at the beginning of cycle $k$, with $Q_{1}=0$. We will have $Q_{k}\geq 1$ for $k\geq 2$, as we shall explain soon, according to our updating procedure. The detailed dynamics of the queueing system in cycle $k$ are summarized as follows:

•
Updating the control policy. In cycle $k$ , we adopt the pricing and staffing policy $(p_{k},\mu_{k})$ . The service time of customer $n-1$ in cycle $k$ is $S_{n}^{k}=V_{n}^{k}/\mu_{k}$ for $n=1,...,D_{k}$ . Cycle $k$ ends as soon as a total number of $D_{k}$ (of which the value is to be determined later) customers have entered service. So, customer $D_{k}$ will receive service in cycle $k+1$ (with service time $S_{1}^{k+1}$ ) and the queue leftover consists of at least one customer, i.e., $Q_{k+1}\geq 1$ for a new cycle $k+1$ , which begins under a new policy $(p_{k+1},\mu_{k+1})$ as follows:
- –
  
  Service rate. The service rate is updated to $\mu_{k+1}$ immediately as the new cycle begins, so that all existing customers will undergo service times with rate $\mu_{k+1}$ .
- –
  
  Service fee. The price remains $p_{k}$ at the beginning of cycle $k+1$ and evolves to $p_{k+1}$ immediately after the first new customer arrives in the new cycle; we charge this customer with $p_{k}$ (because its interarrival time is modulated by $p_{k}$ ) and all subsequent customers in cycle $k+1$ with $p_{k+1}$ .
•

Leftovers from previous cycles. For $k\geq 2$, at the beginning of cycle $k$, there are $Q_{k}-1$ customers waiting in queue indexed by $n$ from 1 to $Q_{k}-1$. The customer who just enters service is indexed by 0. We update the price from $p_{k-1}$ to $p_{k}$ right after the first new customer (indexed by $Q_{k}$) arrives in a new cycle. As a consequence, the prices charged to customers $1,2,...,Q_{k}$ are not yet updated to $p_{k}$. Denote by $p_{n}^{k}$ and $\lambda_{n}^{k}\equiv\lambda(p_{n}^{k})$ the price and arrival rate for customer $n$ in cycle $k$, respectively, for $1\leq n\leq Q_{k}$. The corresponding interarrival time is $\tau_{n}^{k}=U_{n}^{k}/\lambda_{n}^{k}$. In case $Q_{k-1}>D_{k-1}$, some of the leftover customers are from earlier cycles. So here $p_{n}^{k}\in\{p_{1},p_{2},...,p_{k-1}\}$. In addition, in case $Q_{k}>D_{k}$, part of $Q_{k}$ will continue to remain in cycle $k+1$ and we will have, for example, $p^{k+1}_{1}=p^{k}_{D_{k}+1}$.
•

New arrivals. We denote interarrival times for new customers in cycle $k$ by $\tau_{n}^{k}=U_{n}^{k}/\lambda_{k}$ for $n=Q_{k}+1,...,D_{k}$ if $D_{k}\geq Q_{k}+1$ . (As will soon become clear, the case $D_{k}\leq Q_{k}$ is a rare event with a negligible probability under appropriate algorithm settings, see Remark 3.)

•

Customer delay. Customers’ waiting times in cycle $k$ are characterized by the recursions

W_{n}^{k}=\begin{cases}\left(W_{n-1}^{k}+\frac{V^{k}_{n}}{\mu_{k}}-\frac{U^{k}_{n}}{\lambda^{k}_{n}}\right)^{+}&\text{ for }1\leq n\leq Q_{k}\wedge D_{k};\\ \left(W_{n-1}^{k}+\frac{V^{k}_{n}}{\mu_{k}}-\frac{U^{k}_{n}}{\lambda_{k}}\right)^{+}&\text{ for }(Q_{k}+1)\wedge{(D_{k}+1)}\leq n\leq D_{k},\end{cases}\quad W_{0}^{k}=W_{D_{k-1}}^{k-1},

(6)

where $x^{+}\equiv\max\{x,0\}$ .

•

Server’s busy time. The age of the server’s busy time observed by customer $n$ upon arrival, which is the length of time the server has been busy since the last idleness, is given by the recursions

X_{n}^{k}=\begin{cases}\left(X_{n-1}^{k}+\frac{U^{k}_{n}}{\lambda^{k}_{n}}\right){\bf 1}_{\{W_{n}^{k}>0\}}&\text{ for }1\leq n\leq Q_{k}\wedge D_{k};\\ \left(X_{n-1}^{k}+\frac{U^{k}_{n}}{\lambda_{k}}\right){\bf 1}_{\{W_{n}^{k}>0\}}&\text{ for }(Q_{k}+1)\wedge{(D_{k}+1)}\leq n\leq D_{k},\end{cases}\quad X_{0}^{k}=X_{D_{k-1}}^{k-1},

(7)

where the indicator ${\bf 1}_{A}$ is 1 if $A$ occurs and is 0 otherwise.

We provide explanations for (6) and (7). First, recursion (6) simply follows from Lindley’s equation. Next, recursion (7) follows from the fact that, for customer $n$ , if the queue is empty upon its arrival, the observed busy time is simply 0 by definition; otherwise, the server must have been busy since the arrival of the previous customer and therefore, the observed busy time by customer $n$ should extend that of customer $n-1$ by an additional inter-arrival time. As we shall see later, both the delay and busy time observed by customers will be important ingredients (i.e., data) for building the gradient estimator of the online learning algorithm.
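Recursions (6) and (7) translate directly into a short loop. The sketch below is a simplified illustration: it assumes a single arrival rate for every customer in the cycle, so it ignores the distinction between leftover customers (rate $\lambda_{n}^{k}$) and new arrivals (rate $\lambda_{k}$); the function name and argument layout are hypothetical.

```python
def simulate_cycle(v, u, mu, lam, w0=0.0, x0=0.0):
    """Run recursions (6) and (7) for one cycle with a single arrival rate.

    v, u : unit-mean service and interarrival samples V_n and U_n.
    mu   : service rate; lam : arrival rate (same for all customers here).
    w0, x0 : waiting time and busy-time age carried over from the previous cycle.
    Returns the lists of waiting times W_n and observed busy-time ages X_n.
    """
    waits, ages = [], []
    w_prev, x_prev = w0, x0
    for v_n, u_n in zip(v, u):
        interarrival = u_n / lam
        # Lindley recursion (6): previous wait plus service minus interarrival, floored at 0.
        w_n = max(w_prev + v_n / mu - interarrival, 0.0)
        # Recursion (7): the age grows by one interarrival time, or resets if the server idled.
        x_n = x_prev + interarrival if w_n > 0.0 else 0.0
        waits.append(w_n)
        ages.append(x_n)
        w_prev, x_prev = w_n, x_n
    return waits, ages


waits, ages = simulate_cycle(v=[1.0, 1.0, 1.0], u=[0.5, 3.0, 0.5], mu=1.0, lam=1.0)
print(waits)  # [0.5, 0.0, 0.5]
print(ages)   # [0.5, 0.0, 0.5]
```

In the example, the long second interarrival time empties the queue, so both the waiting time and the busy-time age reset to zero for the second customer.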

Remark 3 (Clearance of the leftover $Q_{k}$ ).

As explained above, $Q_{k}$ is random and unbounded, while in our algorithm design, the cycle length $D_{k}$ is deterministic. So it is indeed possible that the remaining queue content is not fully cleared in cycle $k$ (i.e., $D_{k}<Q_{k}$). We will see later in the regret analysis that our choice of $D_{k}$ leads to a small probability of uncleared leftovers and thus the impact of the rare event $\{D_{k}<Q_{k}\}$ is negligible.

In Figure 1, we further illustrate how the service price and service rate are updated by showing the ordering of all relevant events as a new cycle begins. We emphasize that (i) the service rate $\mu_{k-1}$ is updated to $\mu_{k}$ immediately when a new cycle $k$ begins, which is triggered as soon as the last one of $D_{k-1}$ customers enters service; and (ii) the service price $p_{k-1}$ is updated to $p_{k}$ only after the first external arrival occurs in the new cycle $k$ (we honor our previous prices for all customers who arrive in the previous cycle).

Figure 1: On the timing of the update of $p_{k}$ and $\mu_{k}$ under GOLiQ.

We end this section by providing a uniform boundedness result for all relevant queueing functions. This result will be used in the next sections to establish the desired regret bounds. The proof follows from a stochastic ordering approach and is given in Section 8.1.

Lemma 1.

$($ Uniform boundedness of relevant queueing functions $)$
Under Assumptions 1 and 2, there exists a finite positive constant $M>0$ such that for any sequences $(\mu_{k},p_{k})\in\mathcal{B}$ and $D_{k}\geq 1$ , we have, for all $k\geq 1$ , $1\leq n\leq D_{k}$ and $1\leq m\leq 4$ , and $\eta>0$ as defined in Assumption 2,

\mathbb{E}[(W_{n}^{k})^{m}],\quad\mathbb{E}[(X_{n}^{k})^{m}],\quad\mathbb{E}[(Q_{k})^{m}],\quad\mathbb{E}[\exp(\eta W_{n}^{k})]\quad\text{and}\quad\mathbb{E}[\exp(\eta Q_{k})]

are all bounded by $M$ .

4 Regret Analysis

The online learning approach described in Section 3.2 is a data-driven method: it should continue to generate improved solutions that eventually converge to the true optimal solution as the server’s experience accumulates (by serving more and more customers). The performance of GOLiQ is measured by the so-called regret, which can be interpreted as the cost paid, over time or over the number of samples, for the algorithm to learn the optimal policy. In this section, we give a formal definition of the regret and conduct the regret analysis for our online learning algorithm.

The expected net cost of the queueing system incurred in cycle $k$ is

\displaystyle\rho_{k}=\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}(h_{0}(W_{n}^{k}+S_{n}^{k})-p^{k}_{n})+\sum_{n=Q_{k}+1}^{D_{k}}(h_{0}(W_{n}^{k}+S_{n}^{k})-p_{k})+c(\mu_{k})T_{k}\right],

(8)

where the summation $\sum_{n=Q_{k}+1}^{D_{k}}\cdot$ is $0$ in case $D_{k}<Q_{k}+1$ . The total regret accumulated in the first $L$ cycles is

\displaystyle R(L)\equiv\sum_{k=1}^{L}R_{k},\quad\text{where}\quad R_{k}\equiv\rho_{k}-f(\mu^{*},p^{*})\mathbb{E}[T_{k}]

(9)

is the regret in cycle $k$ (the expected system cost in cycle $k$ minus the optimal cost).

Remark 4.

Following Huh et al. (2009) and Jia et al. (2021), our regret defined in (9) is computed by accumulating the difference between the steady-state maximum profit under $(\mu^{*},p^{*})$ and the expected profit earned under GOLiQ. However, one may find such a definition to be somewhat too demanding; it appears more reasonable to benchmark against the nonstationary dynamics under $(\mu^{*},p^{*})$ , rather than the steady-state performance. Nevertheless, our numerical studies confirm that the difference between the two aforementioned regret definitions is negligible. See Section 11.5.

Separation of regret.

To treat the total regret defined in (9), we separate it into two parts: regret of nonstationarity which quantifies the error due to the system’s transient performance, and regret of suboptimality which accounts for the suboptimality error due to the present policy. In detail, we write

\displaystyle R_{k}=\underbrace{(\rho_{k}-\mathbb{E}[f(\mu_{k},p_{k})T_{k}])}_{\equiv R_{1,k}}+\underbrace{\mathbb{E}[T_{k}(f(\mu_{k},p_{k})-f(\mu^{*},p^{*}))]}_{\equiv R_{2,k}},

(10)

so that

\displaystyle R(L)=\sum_{k=1}^{L}R_{1,k}+\sum_{k=1}^{L}R_{2,k}\equiv R_{1}(L)+R_{2}(L).

(11)

Intuitively, $R_{1,k}$ measures the performance error due to transient queueing dynamics (regret of nonstationarity), while $R_{2,k}$ accounts for the suboptimality error of control parameters $(\mu_{k},p_{k})$ (regret of suboptimality).

In what follows, we will analyze the two terms $R_{1}(L)$ and $R_{2}(L)$ separately. To treat $R_{1}(L)$ , we develop in Section 4.1 a new framework to analyze the transient queueing behavior using the coupling technique (Theorem 1). The development of the theoretical bound for $R_{2}(L)$ is given in Section 4.2 (Theorem 2). Results in these sections provide convenient conditions that facilitate the convergence and regret bound analysis of our GOLiQ algorithm for GI/GI/1 queues (which is to be given in Section 5). The roadmap of the theoretical analysis is depicted in Figure 2.

4.1 Regret of Nonstationarity

In this part, we analyze the transient queueing dynamics, based on which we develop a theoretical upper bound for $R_{1}(L)$ . As we shall see later in Section 5, this analysis is also essential to bounding the bias $B_{k}$ and variance $\mathcal{V}_{k}$ of the gradient estimators for GOLiQ.

A crude $O(L)$ bound. Roughly speaking, since the parameters $\mu,p$ and functions $\lambda(\cdot)$ , $c(\cdot)$ are all bounded, the regret $R_{1}(L)$ is in the same order as the transient bias of the waiting time process, i.e.,

\displaystyle R_{1}(L)\approx\sum_{k=1}^{L}O\left(\sum_{n=1}^{D_{k}}\left(\mathbb{E}[W_{n}(\mu_{k},p_{k})]-\mathbb{E}[W_{\infty}(\mu_{k},p_{k})]\right)\right).

Here we use $W_{\infty}(\mu,p)$ to denote the steady-state waiting time of the $GI/GI/1$ queue with parameter $(\mu,p)\in\mathcal{B}$ . Under the uniform stability condition (Assumption 1), it is not difficult to show that there exist positive constants $\gamma>0$ and $K>0$ , independent of $k$ and $(\mu_{k},p_{k})$ such that

\left|\mathbb{E}[W_{n}^{k}]-\mathbb{E}[W_{\infty}(\mu_{k},p_{k})]\right|\leq e^{-\gamma n}K.

Then, as a direct consequence, we have

\sum_{n=1}^{D_{k}}\left(\mathbb{E}[W_{n}(\mu_{k},p_{k})]-\mathbb{E}[W_{\infty}(\mu_{k},p_{k})]\right)\leq\frac{K}{1-e^{-\gamma}}\leavevmode\nobreak\ \Rightarrow\leavevmode\nobreak\ R_{1}(L)=O(L).

An analogue of the above $O(L)$ bound is given by Huh et al. (2009) (Lemma 11) in an inventory model.
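For completeness, the geometric-series step behind the crude bound can be written out as follows (a routine calculation that uses only the exponential bound stated above):

\sum_{n=1}^{D_{k}}\left|\mathbb{E}[W_{n}^{k}]-\mathbb{E}[W_{\infty}(\mu_{k},p_{k})]\right|\leq K\sum_{n=1}^{\infty}e^{-\gamma n}=\frac{Ke^{-\gamma}}{1-e^{-\gamma}}\leq\frac{K}{1-e^{-\gamma}}.

Each cycle therefore contributes at most a constant that does not depend on $k$ or $D_{k}$ , and summing over $L$ cycles gives a bound that is linear in $L$ .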

An improved $o(L)$ bound. In the rest of this subsection, we will conduct a more delicate analysis on the transient performance of the queueing system, and our analysis will render a (tighter) sub-linear bound $R_{1}(L)=o(L)$ (of which the exact order depends on the concrete algorithm, as we shall see later).

Theorem 1.

$($ Regret of nonstationarity $)$ Suppose that Assumptions 1 and 2 hold. In addition, assume that the following conditions are satisfied for some constant $K_{2}>0$ and $0<\alpha\leq 1$ :

$(a)$

$\lceil 6\log(k)/\min(\gamma,\eta)\rceil\leq D_{k}\leq K_{2}k^{2-\alpha}$ ;
$(b)$

$\mathbb{E}[\|x_{k}-x_{k+1}\|^{2}]\leq K_{2}k^{-2\alpha}$ ,

where the constants $\eta$ and $\gamma$ are defined in Assumption 2. Then, there exists a positive constant $K>0$ such that

R_{1,k}\leq K\cdot k^{-\alpha}\log(k),\quad k\geq 2\leavevmode\nobreak\ \quad\text{and}\quad R_{1}(L)\leq K\sum_{k=1}^{L}k^{-\alpha}\log(k),\quad L\geq 2.

(12)

Remark 5.

As will become clear later in Section 5, we obtain a bound $R_{1}(L)=O(\log(L)^{{2}})$ for Algorithm 1 by validating Condition (b) in Theorem 1 with $\alpha=1$ , which is much tighter than the crude $O(L)$ bound. This $O(\log(L)^{{2}})$ bound for $R_{1}(L)$ is critical to achieving an overall logarithmic regret bound in the total number of served customers. An explicit expression of constant $K$ is given in (8.5).
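To see numerically why $\alpha=1$ in (12) gives a squared-logarithm rate, one can compare the partial sum $\sum_{k=1}^{L}k^{-1}\log(k)$ with $\frac{1}{2}\log(L)^{2}$ , the value suggested by the integral $\int_{1}^{L}\log(x)/x\,dx$ . The short sketch below does this; the function name `nonstationarity_sum` is hypothetical and the constant $K$ is omitted.

```python
import math


def nonstationarity_sum(L, alpha):
    """Return sum_{k=1}^{L} k^{-alpha} * log(k), the order of the bound in (12)."""
    return sum(math.log(k) / k ** alpha for k in range(1, L + 1))


for L in (10, 100, 1000, 10000):
    s = nonstationarity_sum(L, alpha=1.0)
    # Second column: partial sum; third column: the integral approximation.
    print(L, round(s, 3), round(0.5 * math.log(L) ** 2, 3))
```

The two columns stay within a bounded distance of each other as $L$ grows, which is consistent with $R_{1}(L)=O(\log(L)^{2})$ when $\alpha=1$ .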

4.1.1 Roadmap of the proof of Theorem 1

Our point of departure in proving Theorem 1 is to decompose $R_{1,k}$ into three terms. We shall split each cycle into a warm-up period consisting of the first $\tilde{d}_{k}=\lceil 5\log(k)/\min(\gamma,\eta)\rceil<D_{k}$ customers and a near-stationary period consisting of all remaining customers, where $\gamma,\eta>0$ are as defined in Assumption 2. The three parts are: transient error in the near-stationary period ( $I_{1}$ ), transient error in the warm-up period ( $I_{2}$ ) and the remaining error ( $I_{3}$ ). The detailed separation is given below.

	$\displaystyle R_{1,k}$	$\displaystyle=\rho_{k}-\mathbb{E}[f(\mu_{k},p_{k})T_{k}]$
		$\displaystyle=\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}(h_{0}(W_{n}^{k}+S_{n}^{k})-p_{n}^{k})+\sum_{n=Q_{k}+1}^{D_{k}}(h_{0}(W_{n}^{k}+S_{n}^{k})-p_{k})+c(\mu_{k})T_{k}-f(\mu_{k},p_{k})T_{k}\right]$
		$\displaystyle=h_{0}\underbrace{\mathbb{E}\left[\sum_{n=\tilde{d}_{k}+1}^{D_{k}}\left(W_{n}^{k}-w(\mu_{k},p_{k})\right)\right]}_{\equiv I_{1}}+h_{0}\underbrace{\mathbb{E}\left[\sum_{n=1}^{\tilde{d}_{k}}\left(W_{n}^{k}-w(\mu_{k},p_{k})\right)\right]}_{\equiv I_{2}}$
		$\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +\underbrace{\mathbb{E}\left[(D_{k}-\lambda_{k}T_{k})(h_{0}w(\mu_{k},p_{k})+\frac{h_{0}}{\mu_{k}}-p_{k})\right]+\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}(p_{k}-p_{n}^{k})\right]}_{\equiv I_{3}}.$

The term $w(\mu,p)\equiv\mathbb{E}[W_{\infty}(\mu,p)]$ is a function of $(\mu,p)$ and is equal to the steady-state expected waiting time under parameter $(\mu,p)\in\mathcal{B}$ . To prove $R_{1,k}=O(k^{-\alpha}\log(k))$ , it suffices to show that $I_{i}=O(k^{-\alpha}\log(k))$ for $i=1,2,3$ . Below we explain the main ideas of our treatment of $I_{1}$ , $I_{2}$ and $I_{3}$ :

•

$I_{1}$ : We will first show that, after serving $d_{k}\equiv\lceil 4\log(k)/\min(\gamma,\eta)\rceil<\tilde{d}_{k}$ customers, with a sufficiently high probability, all $Q_{k}$ existing customers have left the system and $\{W_{n}^{k}:n=d_{k},...,D_{k}\}$ follows the dynamics of a GI/GI/1 queue with arrival rate $\lambda_{k}$ and service rate $\mu_{k}$ . Then, we show that $W_{n}^{k}$ , for $n\geq d_{k}$ , will converge exponentially fast to the steady state (Lemma 2). Hence $W_{n}^{k}$ is close to $W_{\infty}(\mu_{k},p_{k})$ for $n\geq\tilde{d}_{k}$ , warranting a small transient error $I_{1}$ .
•

$I_{2}$ : Note that the $\tilde{d}_{k}$ customers in the warm-up period include those leftovers from previous periods, and their arrival rates $\lambda_{n}^{k}$ are different from $\lambda_{k}$ . To control the impact of such difference between $\lambda_{n}^{k}$ and $\lambda_{k}$ , we first establish almost sure Lipschitz continuity of waiting times (for queues having customer-heterogeneous arrival rates) with respect to the arrival rate sequence and the initial state (Lemma 3). As a consequence, we can prove that $|\mathbb{E}[W_{n}^{k}-w(\mu_{k-1},p_{k-1})]|=O(k^{-\alpha})$ taking advantage of the fact that the initial state $W_{0}^{k}=W_{D_{k-1}}^{k-1}$ is close to the steady-state $W_{\infty}(\mu_{k-1},p_{k-1})$ . Then, we show that the steady-state distribution is smooth in the parameter $(\mu,p)$ (Lemma 4), i.e., $\mathbb{E}[|w(\mu_{k-1},p_{k-1})-w(\mu_{k},p_{k})|]=O(\mathbb{E}|\mu_{k}-\mu_{k-1}|+\mathbb{E}|p_{k}-p_{k-1}|)=O(k^{-\alpha})$ , which completes the analysis for $I_{2}$ .
•

$I_{3}$ : The term $I_{3}$ will be under control because $W_{D_{k}}^{k}$ is close to the steady-state (Lemma 2) and $Q_{k}$ is uniformly bounded (Lemma 1).

See also Figure 3 for a graphical illustration.

Following the above roadmap, we next give detailed analysis for $I_{i},i=1,2,3$ by establishing three lemmas (Lemmas 2–4). We believe that these results are not only essential to the transient analysis in the present paper, but may also be of independent interest for theoretic studies of other queueing models.

Bounding $I_{1}$ . We first establish the rate at which waiting times converge to their steady state distributions. For two given sequences $V_{n}$ and $U_{n}$ , we say two $GI/GI/1$ queues with the same parameter $(\mu,p)\in\mathcal{B}$ are synchronously coupled if their waiting times $W^{1}_{n}$ and $W^{2}_{n}$ satisfy

W^{i}_{n}=\left(W^{i}_{n-1}+\frac{V_{n}}{\mu}-\frac{U_{n}}{\lambda(p)}\right)^{+},\text{ for }i=1,2,\text{ and }n\geq 1,

i.e., the two systems share the same sequences of service and interarrival times. The proof of Lemma 2 is given in Section 8.

Lemma 2.

$($ Exponential loss of memory of initial state $)$ Suppose two $GI/GI/1$ queues with parameter $(\mu,p)\in\mathcal{B}$ are synchronously coupled with initial waiting times $W^{1}_{0}$ and $W^{2}_{0}$ , respectively. Then, for the two positive constants $\gamma$ and $\theta$ defined in Assumption 2 and any $m\geq 1$ , we have, conditional on $W^{1}_{0}$ and $W^{2}_{0}$ ,

\mathbb{E}\left[|W^{1}_{n}-W^{2}_{n}|^{m}\leavevmode\nobreak\ |\leavevmode\nobreak\ W^{1}_{0},W^{2}_{0}\right]\leq e^{-\gamma n}(2+e^{\mu\theta W^{1}_{0}}+e^{\mu\theta W^{2}_{0}})|W_{0}^{1}-W^{2}_{0}|^{m}.
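A small simulation can make the synchronous coupling concrete. The sketch below is illustrative only: it assumes exponential unit-rate variates for $U_{n}$ and $V_{n}$ (the lemma itself is stated for general distributions), and the function name `coupled_waiting_times` is hypothetical. Because both recursions receive the same increment $V_{n}/\mu-U_{n}/\lambda(p)$ and the map $x\mapsto(x+c)^{+}$ is monotone and 1-Lipschitz, the gap $|W^{1}_{n}-W^{2}_{n}|$ can never increase along a sample path.

```python
import random


def coupled_waiting_times(w1, w2, mu, lam, n, seed=0):
    """Run two waiting-time recursions driven by the same (U_n, V_n)."""
    rng = random.Random(seed)
    gaps = [abs(w1 - w2)]
    for _ in range(n):
        u = rng.expovariate(1.0)  # shared inter-arrival variate (exponential: an assumption)
        v = rng.expovariate(1.0)  # shared service variate (exponential: an assumption)
        step = v / mu - u / lam
        w1 = max(w1 + step, 0.0)
        w2 = max(w2 + step, 0.0)
        gaps.append(abs(w1 - w2))
    return gaps


gaps = coupled_waiting_times(w1=0.0, w2=5.0, mu=2.0, lam=1.0, n=200)
```

Once both queues empty at the same customer the two paths coincide from then on, which is the mechanism behind the exponential loss of memory quantified in Lemma 2.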

In order to bound $I_{1}$ , at the beginning of each cycle $k$ , given $(\mu_{k},p_{k})$ , we couple $W_{0}^{k}$ with $\bar{W}_{0}^{k}$ that is independently drawn from the steady-state waiting time distribution $W_{\infty}(\mu_{k},p_{k})$ . The sequence $\bar{W}_{n}^{k}$ is defined as

\bar{W}_{n}^{k}=\left(\bar{W}_{n-1}^{k}+\frac{V_{n}^{k}}{\mu_{k}}-\frac{U_{n}^{k}}{\lambda_{k}}\right)^{+},\text{ for all }1\leq n\leq D_{k}.

Then, by definition, conditional on $(\mu_{k},p_{k})$ , $\mathbb{E}[\bar{W}_{n}^{k}]=w(\mu_{k},p_{k})$ for all $1\leq n\leq D_{k}$ , and therefore,

\left|\mathbb{E}[W_{n}^{k}-w(\mu_{k},p_{k})]\right|\leq\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|].

As we will show in the proof of Corollary 1, $\{W_{n}^{k}:n=d_{k}+1,...,D_{k}\}$ is coupled with $\bar{W}_{n}^{k}$ except on a set of negligible probability, with $d_{k}\equiv\lceil 4\log(k)/\min(\gamma,\eta)\rceil<\tilde{d}_{k}$ . As a result, we can use Lemma 2 to construct a bound on $\mathbb{E}[|W^{k}_{n}-\bar{W}^{k}_{n}|]$ for $n=\tilde{d}_{k}+1,...,D_{k}$ .

Corollary 1.

Under the conditions of Theorem 1, there exists a constant $A\geq 1$ independent of $k$ and $(\mu_{k},p_{k})$ , such that for all $k\geq 1$ and $n\geq d_{k}\equiv\lceil 4\log(k+1)/\min(\gamma,\eta)\rceil$ ,

\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|]\leq e^{-\gamma(n-d_{k})}A+2Mk^{-2}.

(13)

As a direct consequence, we have $I_{1}=O(k^{-\alpha})$ .

Bounding $I_{2}$ . We first show that the waiting times $W_{n}$ of a queueing model having customer-heterogeneous arrival rates are Lipschitz continuous with respect to the rates $(\mu_{n},\lambda_{n})$ and the initial state almost surely.

Lemma 3.

(Lipschitz continuity) Consider two waiting time sequences $W_{n}$ and $\tilde{W}_{n}$ for $n\geq 1$ with initial values $W_{0}$ and $\tilde{W}_{0}$ respectively. Let $(\mu_{n},\lambda_{n})$ and $(\tilde{\mu}_{n},\tilde{\lambda}_{n})\in\mathcal{B}$ be the corresponding sequences of service and arrival rates, respectively, i.e.,

W_{n}=\left(W_{n-1}+\frac{V_{n}}{\mu_{n}}-\frac{U_{n}}{\lambda_{n}}\right)^{+}\quad\text{and}\quad\tilde{W}_{n}=\left(\tilde{W}_{n-1}+\frac{V_{n}}{\tilde{\mu}_{n}}-\frac{U_{n}}{\tilde{\lambda}_{n}}\right)^{+},\quad\text{ for }n\geq 1.

Suppose there exist two constants $c_{\mu},c_{\lambda}>0$ such that

|\mu_{n}-\tilde{\mu}_{n}|\leq c_{\mu}\quad\text{and}\quad|\lambda_{n}-\tilde{\lambda}_{n}|\leq c_{\lambda},\quad\text{ for all }n\geq 1.

Then we have, for all $n\geq 1$ ,

|W_{n}-\tilde{W}_{n}|\leq|W_{0}-\tilde{W}_{0}|+\left(\frac{c_{\mu}}{\underline{\mu}}+\frac{c_{\lambda}}{\underline{\lambda}}\right)\max(X_{n},\tilde{X}_{n})+\frac{c_{\mu}}{\underline{\mu}}\max(W_{n},\tilde{W}_{n}),

where $X_{n}$ and $\tilde{X}_{n}$ are the corresponding observed busy periods. In particular, $X_{n}$ and $\tilde{X}_{n}$ satisfy the recursion (7) defined in Section 3.3 with any given initial values of $X_{0}\geq 0$ and $\tilde{X}_{0}\geq 0$ .

As discussed above, controlling $I_{2}$ also involves bounding the difference between the mean steady-state waiting times in two consecutive cycles. Hence, we next establish a uniform high-order smoothness result for the steady-state waiting times with respect to the model parameter $(\mu,p)$ .

Lemma 4.

$($ Smoothness in $\mu$ and $p)$ Suppose $(\mu_{i},p_{i})\in\mathcal{B}$ for $i=1,2$ . Let $W_{\infty}(\mu_{i},p_{i})$ be the steady-state waiting time of the GI/GI/1 queue under parameter $(\mu_{i},p_{i})$ , respectively. Then, the steady-state waiting times $(W_{\infty}(\mu_{1},p_{1}),W_{\infty}(\mu_{2},p_{2}))$ can be coupled such that, there exists a constant $B>0$ independent of $(\mu_{i},p_{i})$ satisfying that, for all $1\leq m\leq 4$ ,

\mathbb{E}[|W_{\infty}(\mu_{1},p_{1})-W_{\infty}(\mu_{2},p_{2})|^{m}]\leq B\left(|\mu_{1}-\mu_{2}|^{m}+|p_{1}-p_{2}|^{m}\right),

where a closed-form expression of constant $B$ is given in (31).

We adopt a “coupling from the past” (CFTP) approach in the proof of Lemma 4 (see Section 8). Roughly speaking, CFTP is a synchronous coupling starting from the infinite past. In the proof of Lemma 4, we shall explicitly explain how to construct the CFTP.

Now we are ready to analyze $I_{2}$ . Essentially, we shall compare $\mathbb{E}[W_{n}^{k}]$ in the warm-up period with $w(\mu_{k-1},p_{k-1})=\mathbb{E}[W_{\infty}(\mu_{k-1},p_{k-1})]$ . For each cycle $k$ , recall that we have already coupled $W_{n}^{k-1}$ with a stationary sequence $\bar{W}_{n}^{k-1}$ in cycle $k-1$ . We then extend the sequence $\bar{W}_{n}^{k-1}$ to cycle $k$ in the sense that

\bar{W}_{D_{k-1}+n}^{k-1}=\left(\bar{W}_{D_{k-1}+n-1}^{k-1}+\frac{V_{n}^{k}}{\mu_{k-1}}-\frac{U_{n}^{k}}{\lambda_{k-1}}\right)^{+},\text{ for }n=1,2,...,D_{k}.

Then, conditional on $(\mu_{k-1},p_{k-1})$ , $\mathbb{E}[\bar{W}_{D_{k-1}+n}^{k-1}]=w(\mu_{k-1},p_{k-1})$ . So we have

	$\displaystyle\left\|\mathbb{E}[W_{n}^{k}-w(\mu_{k},p_{k})]\right\|$	$\displaystyle\leq\left\|\mathbb{E}[W_{n}^{k}-w(\mu_{k-1},p_{k-1})]\right\|+\mathbb{E}\left[\|w(\mu_{k-1},p_{k-1})-w(\mu_{k},p_{k})\|\right]$
		$\displaystyle\leq\mathbb{E}\left[\|W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}\|\right]+\mathbb{E}\left[\|w(\mu_{k-1},p_{k-1})-w(\mu_{k},p_{k})\|\right].$

Bounding the first term by Lemma 3 and the second term by Lemma 4 yields the following bound on $I_{2}$ .

Corollary 2.

Under the conditions of Theorem 1, for all $k\geq 2$ and $1\leq n\leq D_{k}$ , we have

\mathbb{E}[|W_{n}^{k}-w(\mu_{k},p_{k})|]=O(k^{-\alpha}).\quad

(14)

As a direct consequence, $|I_{2}|=O(k^{-\alpha}\log(k))$ .

Bounding $I_{3}$ . We complete our analysis on the regret of nonstationarity by showing that $I_{3}=O(k^{-\alpha})$ . The proof of Corollary 3 below follows from Lemma 1 and Lemma 2, using an argument similar to that in the proof of Corollary 2.

Corollary 3.

Under the conditions of Theorem 1, $|I_{3}|=O(k^{-\alpha})$ .

Finishing the Proof of Theorem 1. Theorem 1 follows immediately from Corollaries 1 to 3. A complete proof of Theorem 1, including the proofs of Corollaries 1 to 3, is given in Section 8.5 of the e-companion. In particular, we provide an explicit expression of the constant $K$ in terms of the model parameters in (8.5).

Remark 6.

We expect that Theorem 1 may apply to other queueing models (its scope is beyond the $GI/GI/1$ queue), as long as one can verify three conditions for the designated model: (i) uniform boundedness for the rate of convergence to the steady state, i.e., Lemma 2, (ii) path-wise Lipschitz continuity, i.e., Lemma 3, and (iii) smoothness of the stationary distributions in the control variables, i.e., Lemma 4.

4.2 Regret of Suboptimality

To bound the regret of suboptimality $R_{2}(L)$ , we need to control the rate at which $x_{k}$ converges to $x^{*}$ . This depends largely on the effectiveness of the estimator $H_{k}$ for $\nabla f(x_{k})$ . In our algorithm, such effectiveness is measured by the bias $B_{k}$ and variance $\mathcal{V}_{k}$ . The following result shows that, if $B_{k}$ and $\mathcal{V}_{k}$ can be appropriately bounded, then, $x_{k}$ will converge to $x^{*}$ rapidly and hence $R_{2}(L)$ can be properly bounded.

Theorem 2.

$($ Regret of suboptimality $)$ Suppose Assumption 3 holds. If there exists a constant $K_{3}\geq 1$ such that the following conditions hold for all $k$ ,

$(a)$

$\left(1+\frac{1}{k}\right)^{\beta}\leq 1+\frac{K_{0}}{2}\eta_{k}$ ,
$(b)$

$B_{k}\leq\frac{K_{0}}{8}k^{-\beta}$ ,
$(c)$

$\eta_{k}\mathcal{V}_{k}\leq K_{3}k^{-\beta}$ ,

where $0<\beta\leq 1$ is a constant, and $\eta_{k}\to 0$ is the step size, then, there exists a constant $C\geq 8K_{3}/K_{0}$ with an explicit expression given in (34), such that for all $k\geq 1$ ,

\mathbb{E}[\|x_{k}-x^{*}\|^{2}]\leq Ck^{-\beta},

(15)

and as a consequence,

R_{2}(L)\leq CK_{1}\sum_{k=1}^{L}\left(\frac{D_{k}}{\lambda(\bar{p})}+M\right)k^{-\beta}=O\left(\sum_{k=1}^{L}D_{k}k^{-\beta}\right).

(16)

Remark 7 (Selecting the “optimal” $D_{k}$ ).

The above expression (16) indicates a trade-off in the selection of the parameter $D_{k}$ . On the one hand, increasing the sample size $D_{k}$ reduces the bias $B_{k}$ for the gradient estimator, and hence leads to a smaller value of $k^{-\beta}$ . On the other hand, a larger $D_{k}$ makes the system operate under a sub-optimal decision for a longer time. To this end, one may choose an optimal order (in $k$ ) for $D_{k}$ by minimizing the order of the regret as in (16).

Our proof of Theorem 2 follows an inductive approach as used in Broadie et al. (2011). Let $b_{k}\equiv\mathbb{E}[\|x_{k}-x^{*}\|^{2}]$ . According to the SGD iteration $x_{k+1}=\Pi_{\mathcal{B}}(x_{k}-\eta_{k}H_{k})$ , we have

\mathbb{E}[\|x_{k+1}-x^{*}\|^{2}|x_{k}]\leq\mathbb{E}[\|x_{k}-\eta_{k}H_{k}-x^{*}\|^{2}|x_{k}]=\|x_{k}-x^{*}\|^{2}-2\eta_{k}\mathbb{E}[H_{k}|x_{k}](x_{k}-x^{*})+\eta_{k}^{2}\mathbb{E}[\|H_{k}\|^{2}|x_{k}].

Then, by Assumption 3 and the definition of $B_{k},\mathcal{V}_{k}$ by (4), we derive the following recursive inequality for $b_{k}$ :

\displaystyle b_{k+1}\leq(1-K_{0}\eta_{k}+\eta_{k}B_{k})b_{k}+\eta_{k}B_{k}+\eta^{2}_{k}\mathcal{V}_{k},\quad k\geq 1,

and we prove (15) by induction. The full proof is given in Section 8.7 of the e-companion.
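The behavior of the recursive inequality can be illustrated numerically. The sketch below iterates it with equality, taking $\beta=1$ , $\eta_{k}=c_{\eta}/k$ , and Conditions (b) and (c) of Theorem 2 at their upper limits; the numerical values of $K_{0}$ , $c_{\eta}$ , $K_{3}$ and $b_{1}$ are hypothetical and chosen only for illustration. It is a sanity check of the $k^{-1}$ rate, not a substitute for the induction argument.

```python
def iterate_error_bound(K0, c_eta, K3, b1, n_iter):
    """Iterate b_{k+1} = (1 - K0*eta_k + eta_k*B_k) b_k + eta_k*B_k + eta_k^2 * V_k
    with eta_k = c_eta/k, B_k = K0/(8k) and eta_k * V_k = K3/k."""
    b = [b1]
    for k in range(1, n_iter):
        eta = c_eta / k          # step size
        B = K0 / (8 * k)         # bias at its upper limit in Condition (b)
        eta_V = K3 / k           # eta_k * V_k at its upper limit in Condition (c)
        b.append((1 - K0 * eta + eta * B) * b[-1] + eta * B + eta * eta_V)
    return b                     # b[k-1] holds b_k


b = iterate_error_bound(K0=1.0, c_eta=2.0, K3=1.0, b1=1.0, n_iter=2000)
scaled = [(i + 1) * bk for i, bk in enumerate(b)]  # the sequence k * b_k
```

With these illustrative constants the scaled sequence $k\,b_{k}$ remains bounded, in line with the bound $b_{k}\leq Ck^{-\beta}$ in (15) for $\beta=1$ .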

In Section 5, we apply Theorem 2 to treat our online learning algorithm (Algorithm 1) by verifying that Conditions (a)–(c) are satisfied. Because in Theorem 2, Conditions (a)–(c) are stated explicitly in terms of the step size $\eta_{k}$ , bias $B_{k}$ and variance $\mathcal{V}_{k}$ of the gradient estimator, these conditions may serve as useful building blocks for the design and analysis of online learning algorithms in other queueing models as well.

5 GOLiQ for the $GI/GI/1$ Queue

In this section, we provide a concrete GOLiQ algorithm that solves the optimal pricing and capacity sizing problem (1) for a $GI/GI/1$ queueing system. We show that the gradient $\nabla f(\mu,p)$ can be estimated “directly” from past experience (i.e., data of delay and busy times generated under the present policy). Applying the regret analysis developed in Section 4, we provide a theoretic upper bound for the overall regret in Theorem 3.

5.1 A Gradient Estimator

Following the algorithm framework outlined in Section 3.2, we now develop a detailed gradient estimator $H_{k}$ . Regarding the objective function in (5), it suffices to construct estimators for the partial derivatives

\displaystyle\frac{\partial}{\partial\mu}\mathbb{E}[W_{\infty}(p,\mu)]\qquad\text{and}\qquad\frac{\partial}{\partial p}\mathbb{E}[W_{\infty}(p,\mu)].

(17)

Following the infinitesimal perturbation analysis (IPA) approach (see, for example, Glasserman (1992)), we next show that the partial derivatives in (17) can be expressed in terms of the steady-state distributions $W_{\infty}(p,\mu)$ and $X_{\infty}(p,\mu)$ of the waiting time process $W_{n}$ and observed busy period process $X_{n}$ , of which the dynamics are characterized by (6)–(7).

Lemma 5.

Suppose Assumptions 1 and 2 hold. Then, for any $(\mu,p)\in\mathcal{B}$ , $\mathbb{E}[W_{\infty}(\mu,p)]$ is differentiable in $\mu$ and $p$ . Moreover,

	$\displaystyle\frac{\partial}{\partial p}f(\mu,p)$	$\displaystyle=-\lambda(p)-p\lambda^{\prime}(p)+h_{0}\lambda^{\prime}(p)\left(\mathbb{E}[W_{\infty}(\mu,p)]+\mathbb{E}[X_{\infty}(\mu,p)]+\frac{1}{\mu}\right)$		(18)
	$\displaystyle\frac{\partial}{\partial\mu}f(\mu,p)$	$\displaystyle=c^{\prime}(\mu)-h_{0}\frac{\lambda(p)}{\mu}\left(\mathbb{E}[W_{\infty}(\mu,p)]+\mathbb{E}[X_{\infty}(\mu,p)]+\frac{1}{\mu}\right)$		(18)

Proof of Lemma 5.

To prove Equation (18), it suffices to work with the partial derivatives of the steady-state expectation $\mathbb{E}[W_{\infty}(\mu,p)]$ . We follow the IPA analysis in Glasserman (1992) and Chen (2014).

Given $(\mu,p)$ , we define $r(p)=1/\lambda(p)$ and rewrite the recursion (6) as

W_{n}(\mu,p)=\left(W_{n-1}(\mu,p)+\frac{V_{n}}{\mu}-r(p)U_{n}\right)^{+}.

Define the derivative process $Z_{n}\equiv\frac{\partial}{\partial r}W_{n}(\mu,p)$ , then by chain rule, we have

Z_{n}=\frac{\partial}{\partial r}W_{n}(\mu,p)=\frac{\partial}{\partial r}\left(W_{n-1}(\mu,p)+\frac{V_{n}}{\mu}-rU_{n}\right)^{+}=\begin{cases}\frac{\partial}{\partial r}W_{n-1}-U_{n}=Z_{n-1}-U_{n}&\text{ if }W_{n}>0;\\ 0&\text{ if }W_{n}=0.\end{cases}

and obtain a recursion $Z_{n}=(Z_{n-1}-U_{n}){\bf 1}_{\{W_{n}>0\}}$ . Let $\tilde{Z}_{n}\equiv-Z_{n}/\lambda(p)$ . Then, it is straightforward to see that $\tilde{Z}_{n}$ satisfies the recursion given in (7) as the observed busy period $X_{n}$ , i.e.,

\tilde{Z}_{n}=\left(\tilde{Z}_{n-1}+\frac{U_{n}}{\lambda(p)}\right){\bf 1}(W_{n}>0).

Under the assumption that the queueing system is stable, the limit $\tilde{Z}_{\infty}$ should be equal in distribution to $X_{\infty}$ . Therefore, we formally derive

\frac{\partial}{\partial r}\mathbb{E}[W_{\infty}(\mu,p)]=\mathbb{E}[Z_{\infty}]=-\lambda(p)\mathbb{E}[\tilde{Z}_{\infty}]=-\lambda(p)\mathbb{E}[X_{\infty}(\mu,p)].

(19)

The above heuristics can be made rigorous by verifying exchanges of limits using the results in Glasserman (1992), and we refer the readers to Section 8.9 for detailed explanations. Using (19), we can derive the partial derivative of the steady-state waiting time with respect to price $p$ as below:

\displaystyle\frac{\partial}{\partial p}\mathbb{E}[W_{\infty}(\mu,p)]=\frac{\partial}{\partial r}\mathbb{E}[W_{\infty}(\mu,p)]\frac{\partial r(p)}{\partial p}=-\lambda(p)\mathbb{E}[X_{\infty}(\mu,p)]\cdot\left(-\frac{\lambda^{\prime}(p)}{\lambda(p)^{2}}\right)=\mathbb{E}[X_{\infty}(\mu,p)]\frac{\lambda^{\prime}(p)}{\lambda(p)}.

Now we turn to $\frac{\partial}{\partial\mu}\mathbb{E}[W_{\infty}(\mu,p)]$ . Let $\hat{Z}_{n}\equiv\mu W_{n}(\mu,p)$ ; it is easy to check that $\hat{Z}_{n}=\left(\hat{Z}_{n-1}+V_{n}-\mu U_{n}/\lambda(p)\right)^{+}$ . Then, following steps similar to those for (19), we have

\frac{\partial}{\partial\mu}\mathbb{E}[\hat{Z}_{\infty}(\mu,p)]=-\mathbb{E}[X_{\infty}(\mu,p)].

Therefore,

\displaystyle-\mathbb{E}[X_{\infty}(\mu,p)]=\frac{\partial}{\partial\mu}\mathbb{E}[\hat{Z}_{\infty}(\mu,p)]=\frac{\partial}{\partial\mu}\mathbb{E}[\mu W_{\infty}(\mu,p)]=\mu\frac{\partial}{\partial\mu}\mathbb{E}[W_{\infty}(\mu,p)]+\mathbb{E}[W_{\infty}(\mu,p)],

and hence, $\partial\mathbb{E}[W_{\infty}(\mu,p)]/\partial\mu=-(\mathbb{E}[X_{\infty}(\mu,p)]+\mathbb{E}[W_{\infty}(\mu,p)])/\mu.$

Finally, plugging the expressions of the two partial derivatives into $\nabla f$ yields (18). ∎
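As a sanity check, the two derivative identities obtained in the proof can be compared in the special case of an $M/M/1$ queue, an assumption made only for this illustration. There the steady-state mean waiting time in queue has the well-known closed form $\mathbb{E}[W_{\infty}]=\lambda/(\mu(\mu-\lambda))$ with $\lambda=\lambda(p)$ , and direct differentiation gives

\frac{\partial}{\partial\lambda}\mathbb{E}[W_{\infty}]=\frac{1}{(\mu-\lambda)^{2}},\qquad\frac{\partial}{\partial\mu}\mathbb{E}[W_{\infty}]=-\frac{\lambda(2\mu-\lambda)}{\mu^{2}(\mu-\lambda)^{2}}.

The price identity, rewritten through the chain rule as $\partial\mathbb{E}[W_{\infty}]/\partial\lambda=\mathbb{E}[X_{\infty}]/\lambda$ , and the capacity identity $\partial\mathbb{E}[W_{\infty}]/\partial\mu=-(\mathbb{E}[X_{\infty}]+\mathbb{E}[W_{\infty}])/\mu$ then both return the same value

\mathbb{E}[X_{\infty}]=\frac{\lambda}{(\mu-\lambda)^{2}},

so the two formulas are mutually consistent in this special case.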

5.2 GOLiQ: A $G/G/1$ Version

Utilizing results in Lemma 5, we are ready to design a $G/G/1$ version of the GOLiQ algorithm, where we estimate the terms $\mathbb{E}[W_{\infty}(\mu,p)]$ and $\mathbb{E}[X_{\infty}(\mu,p)]$ in the partial derivatives (18) using the finite-sample averages of $W^{k}_{n}$ and $X_{n}^{k}$ observed in each cycle $k$ . The formal description of the algorithm is given in Algorithm 1.

Input: number of cycles $L$ ; parameters $0<\xi<1$ , $D_{k}$ , $\eta_{k}$ for $k=1,2,...,L$ ; initial value $x_{1}=(\mu_{1},p_{1})$ ;

for $k=1,2,...,L$ do

    operate the system under $x_{k}=(\mu_{k},p_{k})$ until $D_{k}$ customers enter service;

    observe $(W^{k}_{n},X^{k}_{n})$ for $n=1,2,...,D_{k}$ ;

    randomly draw $Z\in\{1,2\}$ ;

    if $Z=1$ then

        $h\leftarrow-\lambda(p_{k})-p_{k}\lambda^{\prime}(p_{k})+h_{0}\lambda^{\prime}(p_{k})\left[\frac{1}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(X^{k}_{n}+W^{k}_{n}\right)+\frac{1}{\mu_{k}}\right]$ ;

        $H_{k}\leftarrow(2h,0)$ ;

    else

        $h\leftarrow c^{\prime}(\mu_{k})-h_{0}\frac{\lambda(p_{k})}{\mu_{k}}\left[\frac{1}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(X^{k}_{n}+W^{k}_{n}\right)+\frac{1}{\mu_{k}}\right]$ ;

        $H_{k}\leftarrow(0,2h)$ ;

    end if

    update: $x_{k+1}=\Pi_{\mathcal{B}}(x_{k}-\eta_{k}H_{k})$ ;

end for

Algorithm 1 GOLiQ for $GI/GI/1$ Queues
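The loop structure of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not part of the algorithm statement: the queue is $M/M/1$ ; every customer served in cycle $k$ is treated as arriving at rate $\lambda(p_{k})$ (the leftover customers priced in earlier cycles are ignored); the demand function, the cost derivative, the box $\mathcal{B}$ and all numerical values are hypothetical; and $D_{k}$ is rounded up to an integer. The sketch applies the price derivative to the price coordinate and the capacity derivative to the capacity coordinate of $x_{k}=(\mu_{k},p_{k})$ .

```python
import math
import random


def goliq_mm1(L, lam, dlam, dcost, h0, mu_box, p_box, x1,
              xi=0.5, c_eta=1.0, a_D=20, b_D=20, seed=0):
    """Simplified GOLiQ loop for an M/M/1 queue (illustrative assumptions only)."""
    rng = random.Random(seed)
    mu, p = x1
    w, x = 0.0, 0.0                  # waiting time and busy time carry over cycles
    path = [(mu, p)]
    for k in range(1, L + 1):
        D = math.ceil(a_D + b_D * math.log(k))   # cycle length D_k
        eta = c_eta / k                          # step size eta_k
        total, count = 0.0, 0
        for n in range(1, D + 1):
            u = rng.expovariate(1.0)             # unit-rate inter-arrival variate
            v = rng.expovariate(1.0)             # unit-rate service variate
            w = max(w + v / mu - u / lam(p), 0.0)
            x = x + u / lam(p) if w > 0 else 0.0
            if n > xi * D:                       # keep only post-warm-up samples
                total += w + x
                count += 1
        avg = total / count + 1.0 / mu           # estimate of E[W] + E[X] + 1/mu
        if rng.random() < 0.5:                   # Z = 1: price derivative from (18)
            h = -lam(p) - p * dlam(p) + h0 * dlam(p) * avg
            p = min(max(p - eta * 2 * h, p_box[0]), p_box[1])
        else:                                    # Z = 2: capacity derivative from (18)
            h = dcost(mu) - h0 * lam(p) / mu * avg
            mu = min(max(mu - eta * 2 * h, mu_box[0]), mu_box[1])
        path.append((mu, p))
    return path


# Hypothetical primitives: linear demand and quadratic capacity cost.
path = goliq_mm1(
    L=200,
    lam=lambda p: 10.0 - p, dlam=lambda p: -1.0,
    dcost=lambda mu: 0.2 * mu, h0=1.0,
    mu_box=(8.0, 12.0), p_box=(3.0, 8.0), x1=(10.0, 5.0),
)
```

The box is chosen so that $\lambda(p)<\mu$ for every feasible pair, which keeps the simulated queue stable; clipping each coordinate to its interval plays the role of the projection $\Pi_{\mathcal{B}}$ .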

Remark 8 (On the queueing leftover).

We elaborate more on our treatment of $Q_{k}$ , the existing queue content at the beginning of cycle $k$ . First, the content of $Q_{k}$ includes customer arrivals in cycle $k-1$ and possibly even earlier cycles. Second, it is also possible to have $Q_{k}>D_{k}$ . Nevertheless, these cases do not affect the implementation of Algorithm 1 (note that Algorithm 1 gives a gradient estimator using $\lceil(1-\xi)D_{k}\rceil$ samples without specifying any of the above events). Of course, the event $\{Q_{k}>D_{k}\}$ does play a role in our theoretic regret analysis, but it is a rare event with a negligible probability (in fact, we show that the probability will be suppressed to $O(k^{-3})$ ; see also Remark 3).

Selecting the “optimal” hyperparameters.

The effectiveness of Algorithm 1 largely hinges upon carefully selecting the three hyperparameters: (i) the warm-up time $\xi\in(0,1)$ , (ii) the learning step size $\eta_{k}>0$ , and (iii) the exploration sample size $D_{k}>0$ . While $\xi$ has no bearing on the theoretical order of the regret, the other two parameters $D_{k}$ and $\eta_{k}$ play critical roles in our regret analysis. We next give the forms of the two parameters. First, the step size $\eta_{k}$ satisfies

\displaystyle\eta_{k}=c_{\eta}/k,\qquad\text{with}\quad c_{\eta}\geq 2/K_{0},

(20)

where $K_{0}$ is the convexity bound specified in Assumption 3. Next, the sample size $D_{k}$ satisfies

\displaystyle D_{k}=a_{D}+b_{D}\log(k),\quad\text{with}\quad a_{D}\geq\frac{C_{D}}{\min(\gamma,\eta)\xi}\quad\text{and}\quad b_{D}\geq\frac{8}{\min(\gamma,\eta)\xi},

(21)

for any warm-up parameter $\xi\in(0,1)$ , where $\gamma$ and $\eta$ are the constants specified in Assumption 2, and the explicit formula of $C_{D}$ is given in (36).
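The schedules (20) and (21) are straightforward to tabulate. The sketch below does so for hypothetical values of $c_{\eta}$ , $a_{D}$ and $b_{D}$ (the theoretical lower bounds on these constants depend on $K_{0}$ , $\gamma$ , $\eta$ , $\xi$ and $C_{D}$ , which are not evaluated here), and it rounds $D_{k}$ up to an integer number of customers, which is an assumption of this illustration.

```python
import math


def schedules(L, c_eta, a_D, b_D):
    """Return step sizes eta_k, cycle lengths D_k and the cumulative count M_L."""
    eta = [c_eta / k for k in range(1, L + 1)]
    # Rounding up to an integer number of customers is an assumption of this sketch.
    D = [math.ceil(a_D + b_D * math.log(k)) for k in range(1, L + 1)]
    return eta, D, sum(D)


eta, D, M_L = schedules(L=1000, c_eta=2.0, a_D=10.0, b_D=8.0)
```

Since $D_{k}$ grows only logarithmically, the cumulative number of customers $\sum_{k=1}^{L}D_{k}$ is of order $L\log(L)$ , so its logarithm is of the same order as $\log(L)$ .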

The above-mentioned forms of $\eta_{k}$ and $D_{k}$ are obtained from our detailed regret analysis where we show that the structure of (20) and (21) “minimizes” the order of the overall regret (in the sense of maximizing $\alpha$ and $\beta$ as in Theorems 1 and 2). Although the theoretical bounds of parameters $a_{D}$ , $b_{D}$ and $c_{\eta}$ are imposed to facilitate our regret analysis, our numerical experiments show that GOLiQ remains effective even when the theoretical bounds are relaxed, confirming the robustness of GOLiQ to these hyperparameters; see Section 9 for details. Next, we show that Algorithm 1 has a regret bound of $O(\log(M_{L})^{2})$ with $M_{L}\equiv\sum_{k=1}^{L}D_{k}$ being the cumulative number of customers served by cycle $L$ . We do so by verifying that our choices of $D_{k}$ and $\eta_{k}$ (along with the corresponding $B_{k}$ and $\mathcal{V}_{k}$ ) satisfy the conditions in Theorem 1 and Theorem 2.

Theorem 3.

$($ Regret Bound for Algorithm 1 $)$
Suppose Assumptions 1 to 3 hold, and $\eta_{k}$ and $D_{k}$ are selected according to (20) and (21). Then

$(i)$

There exists a positive constant $K_{3}>0$ such that

$\displaystyle B_{k}\leq\frac{K_{0}}{8k}\quad\text{and}\quad\eta_{k}\mathcal{V}_{k}\leq\frac{K_{3}}{k}.$
$(ii)$

There exists a positive constant $K_{2}>0$ such that

$\mathbb{E}[\|x_{k}-x_{k+1}\|^{2}]\leq K_{2}k^{-2}.$ (22)
$(iii)$

As a consequence of $(i)$ and $(ii)$ , the regret for Algorithm 1

$\displaystyle R(L)\leq K_{\text{alg}}\log(M_{L})^{2}=O(\log(M_{L})^{2}).$ (23)

Remark 9 (On the logarithmic regret bound (23)).

Below we provide some additional discussions on the regret bound (23):

$(i)$

On the constant $K_{\text{alg}}$ . The explicit expression for the constant $K_{\text{alg}}$ , although complicated, is given by (38). It involves error bound corresponding to the transient behavior of the queueing system, the bias and variance of the gradient estimator, moment bounds on the queue length and other model parameters. One can verify that $K_{\text{alg}}$ is decreasing in the convergence rate coefficient $\gamma$ and increasing in the moment bounds of the queue length $M$ .
$(ii)$

On the first logarithmic term. Consider an SGD algorithm in which an unbiased gradient estimator $H_{k}$ with a bounded variance can be evaluated using a single data point (i.e., $B_{k}=0$ , $\mathcal{V}_{k}=O(1)$ ). It has been proved that the scaled error $k^{1/2}(x_{k}-x^{*})$ converges in distribution to a non-zero random variable (Theorem 2.1 in Chapter 10 of Kushner and Yin (2003)). Hence, the convergence rate for $\|x_{k}-x^{*}\|^{2}$ that any SGD-based algorithm can achieve is at best $O(k^{-1})$ (yielding a cumulative regret of order $O(\log(k))$ ), which is exactly the rate of convergence established by our online algorithm (taking $\beta=1$ in Theorem 3). In this sense, GOLiQ is already achieving an “optimal” convergence rate. We point out that, due to the nonstationary error of the queueing system, our gradient estimator is obtained using an increasing number of data points in order to guarantee a reasonably small bias.
$(iii)$

On the second logarithmic term. In order to control the regret of nonstationarity, the queueing system needs to be operated in each cycle for a duration of order $O(\log(k))$ . Because the queueing performance converges to its steady state exponentially fast, this inevitably introduces an extra logarithmic term in our regret bound (which explains the “square” in $\log(M_{L})^{2}$ ). The question that remains open is whether this $O(\log(M_{L})^{2})$ bound is optimal. We conjecture that the answer is yes but admit that a rigorous treatment of a lower regret bound can be quite challenging. For example, establishing a lower regret bound requires a lower bound on the convergence rate of a $GI/GI/1$ queue, which by itself is an open question. We leave this question to future research.
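The origin of the squared logarithm can be illustrated with a back-of-the-envelope calculation. The sketch below is not part of our analysis: it assumes, purely for illustration, that each customer served in cycle $k$ contributes a loss of $C/k$ (mirroring the $O(k^{-1})$ rate with $\beta=1$ ) and that the cycle length is $D_{k}=10+10\log(k)$ as in Section 6. The constant `C` and the function name `toy_regret` are hypothetical.

```python
import math

def toy_regret(num_cycles, C=1.0):
    """Accumulate a stylized regret sum_k D_k * C / k and the customer count M_L.

    Assumptions (illustrative only): the per-customer loss in cycle k is C / k,
    and the cycle length is D_k = 10 + 10 * log(k).
    """
    regret, customers = 0.0, 0.0
    for k in range(1, num_cycles + 1):
        d_k = 10.0 + 10.0 * math.log(k)   # cycle length of order log(k)
        regret += d_k * C / k             # loss of order log(k) / k in cycle k
        customers += d_k                  # M_L = D_1 + ... + D_L
    return regret, customers

# Ratio of the stylized regret to log(M_L)^2 for growing horizons L.
ratios = []
for L in (10**3, 10**4, 10**5):
    regret, m_L = toy_regret(L)
    ratios.append(regret / math.log(m_L) ** 2)
    print(L, round(m_L), round(regret, 2), round(ratios[-1], 3))
```

Under these stylized assumptions the printed ratio stays bounded as $L$ grows, which is consistent with a regret of order $\log(M_{L})^{2}$ : one logarithm comes from summing $1/k$ , the other from the cycle length.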

Remark 10 (Controlling the length of cycle $k$ ).

We use $D_{k}$ (the number of customers served in cycle $k$ ), instead of the clock time $T_{k}$ , to control and measure the regret bound. The benefit of using $D_{k}$ (rather than $T_{k}$ ) as the cycle length is that it facilitates the technical analysis, because $D_{k}$ is directly related to the number of samples used to compute our gradient estimator. In fact, using $D_{k}$ instead of $T_{k}$ has no bearing on the order of the regret bound. To see this, note that the arrival rate is assumed to fall into a compact set $[\lambda(\bar{p}),\lambda(\underline{p})]$ . Therefore, since $T_{L}$ is the total units of clock time elapsed after cycle $L$ , we have $M_{L}/\lambda(\underline{p})\leq\mathbb{E}[T_{L}]\leq M_{L}/\lambda(\bar{p})$ for all $L$ .

6 Numerical Experiments

To confirm the practical effectiveness of our online learning method, we conduct numerical experiments to visualize the algorithm convergence, benchmark the outcomes with known exact optimal solutions, estimate the true regret and compare it to the theoretical upper bounds. Our base example is an $M/M/1$ queue, having Poisson arrivals with rate $\lambda(p)$ , and exponential service times with rate $\mu$ . In our optimization, we consider a commonly used logistic demand function (Besbes and Zeevi, 2015)

\displaystyle\lambda(p)=n\lambda_{0}(p),\qquad\lambda_{0}(p)=\frac{\exp(a-p)}{1+\exp(a-p)},

(24)

where $n$ is the system scale (also referred to as the market size). We also consider the following convex cost function for the service rate

\displaystyle c(\mu)=c_{0}\mu^{2}.

(25)

See the top left panel of Figure 4 for $\lambda(p)$ in (24). In particular, the optimal pricing and staffing problem in (1) now becomes

\max_{\mu,p}\ \left\{p\lambda(p)-c_{0}\mu^{2}-h_{0}\frac{\lambda(p)/\mu}{1-\lambda(p)/\mu}\right\}.

(26)

In light of the closed-form steady-state formulas of the $M/M/1$ queue, we can obtain the exact values of the optimal solutions $(\mu^{*},p^{*})$ and the corresponding objective value $f(\mu^{*},p^{*})$ , with which we are able to benchmark the solutions from our online optimization algorithm.

We first consider two one-dimensional online optimization problems in Section 6.1. We next treat the two-dimensional pricing and staffing problem in Section 6.2. In Section 6.3, we compare our results to previously established asymptotic heavy-traffic solutions in Lee and Ward (2014). Additional numerical experiments are provided in the e-companion: In Section 9 we investigate the robustness of GOLiQ to the hyperparameters. In Section 10 we benchmark the performance of GOLiQ to other online learning methods. Section 11 includes more experiments regarding the relaxation of uniform stability and GOLiQ’s performance in queues having other inter-arrival and service time distributions.

6.1 One-Dimensional Online Optimizations

Algorithm 1 covers special cases where there is only one decision variable. For example, if the service capacity $\mu$ (service fee $p$ ) is an exogenous parameter and the only decision is the service fee $p$ (service capacity $\mu$ ), then one can simply fix $Z=1$ ( $Z=2$ ) throughout the learning process. The theoretical regret bound (as in Theorem 3) for these one-dimensional cases remains unchanged.

6.1.1 Online optimal pricing with a fixed service capacity

Motivated by revenue management problems in revenue-generating service systems, our first example focuses on the one-dimensional optimization of price $p$ with service rate $\mu=\mu_{0}$ held fixed. In this case we can simply omit the term $c_{0}\mu^{2}$ in (26). Fixing the other model parameters as $a=4.1$ , $n=10$ , $h_{0}=1$ and $\mu_{0}=10$ , we first obtain the exact optimal price $p^{*}=3.531$ (top right panel of Figure 4). According to Algorithm 1 and Theorem 3, we set the step size $\eta_{k}=1/k$ and cycle length $D_{k}=10+10\log(k)$ .
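The benchmark price can be reproduced numerically from (24) and (26). The following is a minimal sketch, assuming a plain grid refinement over the (hypothetical) search interval $[0,10]$ ; the helper names `arrival_rate`, `pricing_objective` and `grid_argmax` are ours and are not part of Algorithm 1.

```python
import math

def arrival_rate(p, a=4.1, n=10.0):
    """Logistic demand (24): lambda(p) = n * exp(a - p) / (1 + exp(a - p))."""
    return n / (1.0 + math.exp(p - a))

def pricing_objective(p, mu=10.0, h0=1.0):
    """Objective (26) without the staffing cost: p * lambda - h0 * rho / (1 - rho)."""
    lam = arrival_rate(p)
    rho = lam / mu                      # rho < 1 here because lambda(p) < n = mu
    return p * lam - h0 * rho / (1.0 - rho)

def grid_argmax(f, lo, hi, steps=2000, rounds=4):
    """Maximize f on [lo, hi] by repeatedly refining a uniform grid around the best point."""
    best = lo
    for _ in range(rounds):
        width = (hi - lo) / steps
        best = max((lo + i * width for i in range(steps + 1)), key=f)
        lo, hi = max(lo, best - width), min(hi, best + width)
    return best

p_star = grid_argmax(pricing_objective, 0.0, 10.0)
print(f"p* ~ {p_star:.3f}, objective ~ {pricing_objective(p_star):.3f}")
```

A grid search is used here only because the objective is not concave in $p$ everywhere; the maximizer it returns should land close to the benchmark value $p^{*}=3.531$ .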

In Figure 4, we give the sample paths of the gradient $H_{k}$ and price $p_{k}$ as functions of the number of cycles $k$ , and the mean regret (estimated by averaging 500 independent sample paths) as a function of the cumulative number of service completions $M_{L}$ . We observe that although the objective function $f(\mu,p)$ is not convex in $p$ , the pricing decision $p_{k}$ quickly converges to the optimal value $p^{*}$ , and the regret grows as a logarithmic function of $M_{L}$ . In particular, a simple linear regression for the pair $\left(\sqrt{R(M_{L})},\log(M_{L})\right)$ (bottom right panel) verifies our regret bound given in Theorem 3.

6.1.2 Online optimal staffing problem with an exogenous arrival rate

Motivated by conventional service systems where customers are served based on goodwill (e.g., hospitals), we next solve an online optimal staffing problem, with the objective of minimizing the combination of the steady-state queue length (or equivalently the delay) and the staffing cost, with the arrival rate (or equivalently, the price $p$ ) held fixed. Namely, we omit the term $p\lambda(p)$ in (26). Fixing $\lambda=\lambda_{0}=6.385$ , $h_{0}=1$ , and $c_{0}=0.1$ , we obtain the exact optimal service capacity $\mu^{*}=8.342$ (top right panel of Figure 5). Also by Algorithm 1 and Theorem 3, we set the step size $\eta_{k}=0.4k^{-1}$ and cycle length $D_{k}=10+10\log(k)$ with initial service rate $\mu_{0}=10$ .
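For this staffing problem the cost $c_{0}\mu^{2}+h_{0}\lambda/(\mu-\lambda)$ is convex in $\mu$ on $(\lambda,\infty)$ , so the benchmark can be recovered from the first-order condition $2c_{0}\mu=h_{0}\lambda/(\mu-\lambda)^{2}$ . The sketch below solves it by bisection; the function names and the search bracket are illustrative assumptions.

```python
def staffing_foc(mu, lam=6.385, h0=1.0, c0=0.1):
    """Derivative in mu of the cost c0 * mu^2 + h0 * lam / (mu - lam), for mu > lam."""
    return 2.0 * c0 * mu - h0 * lam / (mu - lam) ** 2

def bisect_root(g, lo, hi, tol=1e-10):
    """Bisection for an increasing function g with g(lo) < 0 < g(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam = 6.385
# Bracket (lam, lam + 100): the derivative is negative near lam and positive far from it.
mu_star = bisect_root(staffing_foc, lam + 1e-6, lam + 100.0)
print(f"mu* ~ {mu_star:.3f}")
```

The root agrees with the benchmark $\mu^{*}=8.342$ up to the rounding of $\lambda_{0}$ .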

In Figure 5, we again give sample paths of the gradient $H_{k}$ and service capacity $\mu_{k}$ , and estimation of the regret. As the number of cycles $k$ increases, our stage- $k$ staffing decision $\mu_{k}$ quickly converges to $\mu^{*}$ (bottom right panel) and the regret also grows as a logarithmic function of $M_{L}$ (bottom left panel).

6.2 Joint Pricing and Staffing Problem

We next consider a joint staffing and pricing problem having the objective function in (26), with the logistic demand function in (24) and parameters $a=4.1$ , $n=10$ , $h_{0}=1$ and $c_{0}=0.1$ . The optimal price $p^{*}=4.02$ and service rate $\mu^{*}=7.10$ are given as benchmarks (top right panel in Figure 6).
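The two-dimensional benchmark can be checked in the same spirit. The sketch below assumes a nested search: for each price it finds the best service rate by bisection on the derivative in $\mu$ (the objective in (26) is concave in $\mu$ for $\mu>\lambda(p)$ ), and then refines a grid over $p$ . The helper names and the search interval $[0,10]$ for the price are hypothetical.

```python
import math

A, N, H0, C0 = 4.1, 10.0, 1.0, 0.1   # parameters stated for the joint problem

def arrival_rate(p):
    """Logistic demand (24)."""
    return N / (1.0 + math.exp(p - A))

def profit(mu, p):
    """Objective (26) for the M/M/1 queue, valid for mu > lambda(p)."""
    lam = arrival_rate(p)
    return p * lam - C0 * mu ** 2 - H0 * lam / (mu - lam)

def best_mu(p, tol=1e-10):
    """Best service rate for a fixed price, by bisection on d(profit)/d(mu)."""
    lam = arrival_rate(p)
    lo, hi = lam + 1e-9, lam + 100.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # d(profit)/d(mu) = -2 * C0 * mu + H0 * lam / (mu - lam)^2 is decreasing in mu
        if -2.0 * C0 * mid + H0 * lam / (mid - lam) ** 2 > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def profile(p):
    """Profit at price p after optimizing over mu."""
    return profit(best_mu(p), p)

def grid_argmax(f, lo, hi, steps=400, rounds=4):
    """Maximize f on [lo, hi] by repeatedly refining a uniform grid around the best point."""
    best = lo
    for _ in range(rounds):
        width = (hi - lo) / steps
        best = max((lo + i * width for i in range(steps + 1)), key=f)
        lo, hi = max(lo, best - width), min(hi, best + width)
    return best

p_star = grid_argmax(profile, 0.0, 10.0)
mu_star = best_mu(p_star)
print(f"p* ~ {p_star:.2f}, mu* ~ {mu_star:.2f}")
```

Optimizing $\mu$ first reduces the problem to a one-dimensional search, which sidesteps the lack of joint concavity; the output should be close to the benchmarks $p^{*}=4.02$ and $\mu^{*}=7.10$ .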

In Figure 6, we show that $\mu_{k}$ and $p_{k}$ converge quickly to their corresponding optimal target levels $\mu^{*}$ and $p^{*}$ (although the objective $f(\mu,p)$ is not always convex when $\mu>\lambda(p)$ ). Similar to the one-dimensional cases, the regret grows as a logarithmic function of $M_{L}$ (bottom left panel).

6.3 Comparison to Heavy-Traffic Methods

In this subsection, we provide numerical analysis to contrast the performance of GOLiQ to that of the heavy-traffic approach in Lee and Ward (2014). In Lee and Ward (2014), the objective is to find the optimal decisions $p^{*}$ and $\mu^{*}$ for the $GI/GI/1$ optimization problem (1) with a linear staffing cost $c(\mu)=c\mu.$ Because this problem is not amenable to analytic treatments (due to the complex $GI/GI/1$ queueing dynamics), the authors resort to the heavy-traffic approximation by constructing a sequence of $GI/GI/1$ queues indexed by a scaling factor $n$ , where the $n^{\rm th}$ model has an arrival rate $\lambda^{n}(p)\equiv n\lambda_{0}(p)$ which grows to infinity as $n$ increases. The authors propose an asymptotically optimal solution

(\tilde{p}^{(n)},\tilde{\mu}^{(n)})=\left(\hat{p}^{*},n\hat{\mu}^{*}+\sigma\sqrt{\frac{h_{0}n}{2c}}\right)

(27)

where $\sigma=\sqrt{\text{Var}(U_{i})+\text{Var}(V_{i})}$ , and $U_{i}$ and $V_{i}$ are defined in Assumption 2, and $(\hat{p}^{*},\hat{\mu}^{*})$ solves a deterministic static planning problem:

\min_{p,\mu}f_{0}(p,\mu)=-p\lambda_{0}(p)+c\mu.

(28)

We remark that the solution in Lee and Ward (2014) requires the precise knowledge of the second moments of service and arrival times (e.g., the term $\sigma$ in (27)), but such information is not needed in GOLiQ.
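To make (27) concrete, the sketch below evaluates the LW prescription for the logistic demand (24). It rests on two assumptions that are not spelled out in (28) as printed: the static planning problem is solved with the capacity constraint binding, i.e. $\hat{\mu}^{*}=\lambda_{0}(\hat{p}^{*})$ , and $U_{i}$ , $V_{i}$ are unit-mean exponentials so that $\sigma=\sqrt{2}$ . The function names and the parameter values $n=10$ , $c=1$ are hypothetical.

```python
import math

def base_demand(p, a=4.1):
    """Base logistic demand lambda_0(p) from (24)."""
    return 1.0 / (1.0 + math.exp(p - a))

def static_price(c, a=4.1, hi=50.0, tol=1e-10):
    """Maximize (p - c) * lambda_0(p), i.e. (28) under the assumption mu = lambda_0(p).

    For the logistic demand the first-order condition reads
    (p - c) * (1 - lambda_0(p)) = 1, and its left side is increasing in p for p > c.
    """
    lo = c
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (mid - c) * (1.0 - base_demand(mid, a)) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lw_policy(n, c, h0=1.0, sigma=math.sqrt(2.0), a=4.1):
    """Evaluate (27): (p_hat, n * mu_hat + sigma * sqrt(h0 * n / (2 * c)))."""
    p_hat = static_price(c, a)
    mu_hat = base_demand(p_hat, a)      # assumed binding constraint mu = lambda_0(p)
    return p_hat, n * mu_hat + sigma * math.sqrt(h0 * n / (2.0 * c))

p_tilde, mu_tilde = lw_policy(n=10, c=1.0)
print(f"LW price ~ {p_tilde:.3f}, LW service rate ~ {mu_tilde:.3f}")
```

The second term of the service rate is the square-root safety capacity; it is the only place where $\sigma$ , and hence second-moment information, enters the LW policy.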

Experiment settings.

We consider an $M/GI/1$ model with a phase-type service-time distribution, and a logit demand $\lambda(p)=n\lambda_{0}(p)$ in (24) where the base demand rate $\lambda_{0}(p)$ has $a=4.1$ and the market size $n$ plays the role of the scaling factor. We fix the delay cost $h_{0}=1$ throughout this experiment. To quantify the regret, we obtain the exact optimal policy using the Pollaczek-Khinchine formula for the queue-length function

\mathbb{E}[Q_{\infty}(p,\mu)]=\rho+\frac{\rho^{2}}{1-\rho}\frac{1+c_{s}^{2}}{2},

(29)

where $c_{s}^{2}\equiv\text{Var}(V_{i})/\mathbb{E}[V_{i}]^{2}$ is the squared coefficient of variation (SCV) for the service time. We next describe the detailed settings for comparing GOLiQ to the heavy-traffic solution in Lee and Ward (2014), dubbed LW. In order to benchmark the regret of GOLiQ against that of LW, we continue to consider a dynamic environment where the number of cycles $k$ increases. In the $k^{\rm th}$ cycle,

•

the LW policy remains fixed at $(\tilde{p}^{(n)},\tilde{\mu}^{(n)})$ as in (27) (it does not evolve with $k$ );
•

our online learning policy is dynamically updated according to GOLiQ (Algorithm 1).
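As a side check on the benchmark computation, the Pollaczek-Khinchine formula (29) can be evaluated directly. The sketch below is illustrative only; the rates $\lambda=6$ and $\mu=10$ are hypothetical values, and the three SCVs match the cases considered in the experiments.

```python
def pk_mean_queue_length(lam, mu, scv_service):
    """Mean steady-state number in system for an M/GI/1 queue, formula (29)."""
    rho = lam / mu
    if not 0.0 <= rho < 1.0:
        raise ValueError("stability requires lam < mu")
    return rho + rho ** 2 / (1.0 - rho) * (1.0 + scv_service) / 2.0

lam, mu = 6.0, 10.0
queue_lengths = {scv: pk_mean_queue_length(lam, mu, scv) for scv in (0.1, 1.0, 10.0)}
for scv, q in queue_lengths.items():
    print(f"SCV = {scv}: E[Q] = {q:.3f}")
```

With $c_{s}^{2}=1$ the formula reduces to the $M/M/1$ expression $\rho/(1-\rho)$ used in (26), and the mean queue length increases with the service-time SCV.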

Because the LW policy is an approximation, it will yield a linear regret as $k$ increases. But LW’s linear regret should not be too steep when $n$ is large enough. In contrast, although GOLiQ is guaranteed to generate a sublinear regret, it is expected to have a larger regret increment at the earlier “exploration” stage, because it is learning without the supervision of the fluid or diffusion limits (as in the LW approach). Nevertheless, we expect that GOLiQ will eventually outperform the LW method (exhibiting a lower regret level) when $k$ is sufficiently large. We next numerically study how soon GOLiQ surpasses LW and the impact of the following three parameters:

(i)

staffing cost $c$ ;
(ii)

service-time SCV $c_{s}^{2}$ ;
(iii)

market size $n$ (i.e., system scale).

We intentionally set the initial decision $(\mu_{0},p_{0})$ of GOLiQ far from the optimal solution $(\mu^{*},p^{*})$ in the experiment.

Experiment results.

In Figure 7, we report results of regret for both GOLiQ and LW. For the three factors $c$ , $c_{s}^{2}$ and $n$ , we change one at a time (with the other two held fixed). In Panels (a)-(c), we vary the staffing cost $c$ from 0.5 to 2. In Panels (d)-(f), we vary the service-time SCV $c_{s}^{2}$ from 0.1 to 10. Here the cases $c_{s}^{2}=0.1$ , 1, and 10 are achieved by considering Erlang, exponential, and hyperexponential service-time distributions, respectively. In Panels (g)-(i), we vary the system scale $n$ from 1 to 25. In all of the cases, we use the hyperparameters $\eta_{k}=5k^{-1}$ and $D_{k}=10+10\log(k)$ . Monte-Carlo estimates of the regret curves are obtained by averaging 100 independent runs.

We can see from Figure 7 that, in all cases, GOLiQ will eventually establish a lower regret level than the LW policy. Varying these three factors clearly has a significant impact on how soon GOLiQ outperforms LW.

Our findings are summarized below:

•

Staffing cost $c$ : Figure 7 shows that GOLiQ tends to outperform LW when $c$ is relatively large. We provide our explanations below. First, a larger staffing cost $c$ will induce a smaller $\mu^{*}$ , which leads to a longer waiting queue. On the other hand, note that the LW solution is primarily based on solving the deterministic static problem (28); and unlike the stochastic revenue optimization problem (1), the objective function of (28) overlooks the queue-length holding cost. This explains why GOLiQ gains its advantage over LW as $c$ increases. See Panels (a)-(c) of Figure 7.
•

Service SCV $c_{s}^{2}$ : When the service-time SCV is smaller, the LW method tends to work better, because the basic idea of LW stems from solutions of a fluid model (where the service times are assumed deterministic). On the other hand, when $c_{s}^{2}$ is larger, the system becomes more variable so that our learning-based algorithm begins to excel (because GOLiQ takes into account real-time information dynamically). See Panels (d)-(f) of Figure 7.
•

Market size $n$ : When $n$ is small, LW loses its advantages because it arises from the large-scale limit of the $GI/GI/1$ queue, which requires $n$ to be sufficiently large. In contrast, the performance of GOLiQ is robust to the system scale. See Panels (g)-(i) of Figure 7.
•

Performance in the long run: GOLiQ is a more effective approach in the long run, because the LW solution remains static and its error grows linearly as time increases.

Remark 11 (Different philosophies: online learning vs. heavy traffic).

We emphasize that online learning and heavy-traffic analysis are two methodologies developed based on distinct philosophies. First, when the system size is large, heavy-traffic models are able to produce high-fidelity solutions, but they require more prior knowledge of the system as inputs. On the other hand, online learning requires less prior understanding of the system, because the data-driven nature allows it to dynamically evolve and improve (whereas heavy-traffic solutions are static). Second, the notions of asymptotic optimality are different. As an approximate method, heavy-traffic analysis is said to be asymptotically optimal in the sense that as the system size grows large, its solution will become close to the true optimal solution. On the other hand, the solution of the online learning method will converge to the true optimal solution as the server’s experience accumulates (by serving more and more customers).

7 Conclusion

In this paper we develop an online learning framework designed for dynamic pricing and staffing in queueing systems. The ingenuity of this approach lies in its online nature, which allows the service provider to continuously obtain improved pricing and staffing policies by interacting with the environment. The environment here is interpreted as everything beyond the service provider’s knowledge, which is the composition of the random external demand process and the complex internal queueing dynamics. The proposed algorithm organizes the time horizon into successive operational cycles, and prescribes an efficient way to update the service provider’s policy in each cycle using data collected in previous cycles. Data include the number of customer arrivals, waiting times, and the server’s busy times.

A key appeal of the online learning approach is its insensitivity to the scale of the queueing system, as opposed to the heavy-traffic analysis, which requires the system to be large in scale (with the arrival and service rate both approaching infinity). Effectiveness of our online learning algorithm is substantiated by (i) theoretical results including the algorithm convergence and regret analysis, and (ii) engineering confirmation via simulation experiments of a variety of representative $GI/GI/1$ queues. Theoretical analysis of the regret bound in the present paper may shed light on the design of efficient online learning algorithms (e.g., bounding gradient estimation error and controlling proper learning rate) for more general queueing systems.

There are several avenues for future research. One natural extension would be to develop new regret analyses that do not require the uniform stability condition. Another interesting and promising direction is to develop an online learning method without assuming the knowledge of the arrival rate function $\lambda(p)$ , where the learner (here, the service provider), during the interactions with the environment, will have to resolve the tension between obtaining an accurate estimation of the demand function and optimizing returns over time. A third direction is to extend the methodology to more general model settings (e.g., queues having customer abandonment and multiple servers), which will make the framework more practical for service systems such as call centers and healthcare. In this regard, results in the present paper may serve as useful foundations; in particular, Theorems 1 and 2 will help construct desired regret bounds as long as their associated conditions can be verified. Doing so usually requires two main steps in a new queueing model: (i) proving a new ergodicity (or rate of convergence to stationarity) result that can be used to bound the regret of nonstationarity; (ii) designing a new gradient estimator which is easily computed from data (here a good gradient estimator should have small bias and variance subject to conditions in Theorem 2).

8 Proofs

8.1 Proof of Lemma 1

Let $Q_{n}^{k}$ be the queue length when customer $n-1$ in cycle $k$ leaves the system. Then $Q_{k}=Q_{D_{k-1}}^{k-1}+1$ . The proof follows a stochastic ordering argument for $GI/GI/1$ models. Let $\hat{W}_{n}^{k}$ , $\hat{X}_{n}^{k}$ and $\hat{Q}_{n}^{k}$ be the waiting times, observed busy periods, and queue length process in a $GI/GI/1$ queue with stationary control parameter $\mu_{k}\equiv\underline{\mu}$ and $p_{k}\equiv\underline{p}$ , and with steady-state initial state, i.e., $\hat{W}_{0}^{1}\stackrel{{\scriptstyle d}}{{=}}W_{\infty}(\underline{\mu},\underline{p})$ , $\hat{X}_{0}^{1}\stackrel{{\scriptstyle d}}{{=}}X_{\infty}(\underline{\mu},\underline{p})$ and $\hat{Q}_{0}^{1}\stackrel{{\scriptstyle d}}{{=}}Q_{\infty}(\underline{\mu},\underline{p})$ . Let’s call this system the dominating system. Then, for all $k$ ,

\frac{U_{n}^{k}}{\lambda_{n}^{k}}\geq\frac{U_{n}^{k}}{\lambda(\underline{p})},\text{ for }n=1,2,...,Q_{k},\text{ and }\frac{U_{n}^{k}}{\lambda_{k}}\geq\frac{U_{n}^{k}}{\lambda(\underline{p})},\text{ for }n=Q_{k}+1,...,D_{k},

i.e., the arrival process in the dominating queue is the upper envelope process (UEP) for all possible arrival processes corresponding to any control sequence $(\mu_{k},p_{k})$ . Similarly, the service process in the dominating queue is the lower envelope process (LEP) for all possible service processes corresponding to any control sequence. As a consequence, since $W_{0}^{1}=0$ and $Q_{0}^{1}=0$ ,

W_{n}^{k}\leq_{st}\hat{W}_{n}^{k},\leavevmode\nobreak\ X_{n}^{k}\leq_{st}\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\cdot\hat{X}_{n}^{k},\leavevmode\nobreak\ Q_{n}^{k}\leq_{st}\hat{Q}_{n}^{k}.

Under Assumption 2, the moment generating function of the random variable $V_{n}/\underline{\mu}-U_{n}/\lambda(\underline{p})$ exists around the origin. Following Blanchet and Chen (2015), under Assumption 1, this condition can further imply that there exists a constant $\bar{\eta}>0$ such that $\mathbb{E}[\exp\left(\bar{\eta}(V_{n}/\underline{\mu}-U_{n}/\lambda(\underline{p}))\right)]=1.$ (See the Remark on p.3222 in Blanchet and Chen (2015)) Then, following Theorem 1 of Abate et al. (1995), there exists a constant $\alpha>0$ such that $\mathbb{P}(\hat{W}_{n}^{k}>x)\leq\alpha\exp(-\bar{\eta}x),\text{ for all }x>0$ . As a consequence, $\mathbb{E}[\exp(\eta\hat{W}_{n}^{k})]$ is finite for $\eta<\bar{\eta}$ , and so are $\mathbb{E}[(\hat{W}_{n}^{k})^{m}]$ for all $m\geq 1$ . Given that the moments of waiting times are finite, we can conclude that $\mathbb{E}[(\hat{Q}_{n}^{k})^{m}]$ and $\mathbb{E}[\exp(\eta\hat{Q}_{n}^{k})]$ are finite for all $m\geq 1$ , applying Theorem 10.4.3 in Asmussen (2003). Finally, the moments of the observed busy period $\mathbb{E}[(\hat{X}_{n}^{k})^{m}]$ are finite following Proposition 4.2 in Nakayama et al. (2004). Therefore, we choose

M=\max_{1\leq m\leq 4}\left\{\mathbb{E}[(\hat{W}_{n}^{k})^{m}],\ \frac{\lambda(\underline{p})^{m}}{\lambda(\bar{p})^{m}}\mathbb{E}[(\hat{X}_{n}^{k})^{m}],\ \mathbb{E}[(\hat{Q}_{n}^{k}+1)^{m}],\ \mathbb{E}[\exp(\eta\hat{W}_{n}^{k})],\ \mathbb{E}[\exp(\eta(\hat{Q}_{n}^{k}+1))]\right\},

and this closes our proof. $\Box$

8.2 Proof of Lemma 2

For $i\in\{1,2\}$ , define stopping times $\Gamma_{i}=\min\{n:W^{i}_{n}=0\}$ . For a fixed pair of inter-arrival and service time sequences, the consequent waiting time sequence $W_{k}$ in a single-server queue is monotone in its initial state $W_{0}$ . Without loss of generality, assume $W^{1}_{0}\geq W^{2}_{0}$ . Then, $W^{1}_{n}\geq W^{2}_{n}$ for all $n\geq 1$ and therefore, $W^{1}_{\Gamma_{1}}=W^{2}_{\Gamma_{1}}=0$ . As the two queues are coupled with the same arrival and service time sequences, we will have $W^{1}_{n}=W^{2}_{n}$ for all $n\geq\Gamma_{1}$ . Therefore, we can conclude $W^{1}_{n}=W^{2}_{n}$ for all $n\geq\max(\Gamma_{1},\Gamma_{2})$ . For $n\leq\max(\Gamma_{1},\Gamma_{2})$ , we have $|W^{1}_{n}-W^{2}_{n}|\leq|W^{1}_{0}-W^{2}_{0}|$ following Kella and Ramasubramanian (2012).
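The coupling argument above can be checked on a simulated sample path. The sketch below is illustrative only: it assumes exponential service and inter-arrival times (the lemma itself covers general $GI/GI/1$ inputs), and the rates, initial states, and the function name `coupled_waiting_times` are hypothetical.

```python
import random

def coupled_waiting_times(w0_a, w0_b, lam, mu, n, seed=0):
    """Run two Lindley recursions driven by the same service and inter-arrival times."""
    rng = random.Random(seed)
    wa, wb = w0_a, w0_b
    path = [(wa, wb)]
    for _ in range(n):
        # Common increment S_n - tau_n shared by both queues (the coupling).
        increment = rng.expovariate(mu) - rng.expovariate(lam)
        wa = max(wa + increment, 0.0)
        wb = max(wb + increment, 0.0)
        path.append((wa, wb))
    return path

path = coupled_waiting_times(w0_a=2.0, w0_b=0.0, lam=6.0, mu=10.0, n=5000)
gaps = [abs(wa - wb) for wa, wb in path]
# First index at which the two waiting-time sequences coincide.
coupling_time = next(i for i, g in enumerate(gaps) if g == 0.0)
print("coupling time:", coupling_time, "largest gap:", max(gaps))
```

On such a path the gap never exceeds the initial gap $|W^{1}_{0}-W^{2}_{0}|$ , and once the queue with the larger initial state empties, the two sequences stay identical.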

For simplicity of notation, we write $\lambda=\lambda(p)$ . For $i\in\{1,2\}$ , define a random walk $R^{i}_{n+1}=R^{i}_{n}+S_{n}-\tau_{n}$ with $R^{i}_{0}=W^{i}_{0}$ . (Recall that $S_{n}$ and $\tau_{n}$ are the sequences of service and inter-arrival times.) By Lindley recursion, $\Gamma_{i}=\min\{n:R^{i}_{n}\leq 0\}$ . Then, for any $n\geq 1$ ,

	$\displaystyle\mathbb{P}(\Gamma_{i}\leq n)$	$\displaystyle\geq\mathbb{P}\left(\sum_{k=1}^{n}(S_{k}-\tau_{k})<-W_{0}^{i}\right)$
		$\displaystyle\geq\mathbb{P}\left(\lambda\sum_{k=1}^{n}\tau_{k}\geq n(1-a),\mu\sum_{k=1}^{n}S_{k}\leq n(1+a)-\mu W^{i}_{0}\right),$

where the second inequality holds as $(1-a)/\lambda>(1+a)/\mu$ given that $0<a<(\underline{\mu}-\lambda(\underline{p}))/(\underline{\mu}+\lambda(\underline{p}))$ and that $\lambda/\mu\leq\lambda(\underline{p})/\underline{\mu}$ . Recall that $\tau_{k}=U_{k}/\lambda$ and $S_{k}=V_{k}/\mu$ . Therefore,

\displaystyle\mathbb{P}(\Gamma_{i}>n)\leq\mathbb{P}\left(\sum_{k=1}^{n}U_{k}<n(1-a)\right)+\mathbb{P}\left(\sum_{k=1}^{n}V_{k}>n(1+a)-\mu W^{i}_{0}\right).

Following the exponential form of Chebyshev’s inequality (i.e., Markov’s inequality applied to $\exp(\theta\sum_{k=1}^{n}V_{k})$ ), we have

	$\displaystyle\mathbb{P}\left(\sum_{k=1}^{n}V_{k}>n(1+a)-\mu W^{i}_{0}\right)$	$\displaystyle\leq\frac{\mathbb{E}[\exp(\theta\sum_{k=1}^{n}V_{k})]}{\exp(n\theta(1+a)-\mu\theta W^{i}_{0})}=\exp(n(\phi_{V}(\theta)-(1+a)\theta))\exp(\mu\theta W^{i}_{0})$
		$\displaystyle\leq\exp(-n\gamma)\exp(\mu\theta W^{i}_{0}),$

where the last inequality follows from Assumption 2. On the other hand, let $Q$ be an exponentially tilted probability measure with respect to $U$ , such that the likelihood ratio $\frac{dQ}{dP}(U)=\exp(-\theta U-\phi_{U}(-\theta))$ . Then,

		$\displaystyle\mathbb{P}\left(\sum_{k=1}^{n}U_{k}<n(1-a)\right)=\mathbb{E}^{Q}\left[\exp\left(\theta\sum_{k=1}^{n}U_{k}+n\phi_{U}(-\theta)\right){\bf 1}_{\left\{\sum_{k=1}^{n}U_{k}<n(1-a)\right\}}\right]$
	$\displaystyle\leq$	$\displaystyle\exp(n(1-a)\theta+n\phi_{U}(-\theta))=\exp(n((1-a)\theta+\phi_{U}(-\theta)))\leq\exp(-n\gamma).$

In summary, we have $\mathbb{P}(\Gamma_{i}>n)\leq\exp(-n\gamma)\left(1+\exp(\mu\theta W^{i}_{0})\right)$ , $i=1,2$ . So, we can conclude

	$\displaystyle\mathbb{E}[\|W^{1}_{n}-W^{2}_{n}\|^{m}]$	$\displaystyle\leq\mathbb{P}(\max(\Gamma_{1},\Gamma_{2})>n)\|W^{1}_{0}-W^{2}_{0}\|^{m}$
		$\displaystyle\leq e^{-\gamma n}\left(2+e^{\mu\theta W^{1}_{0}}+e^{\mu\theta W^{2}_{0}}\right)\|W^{1}_{0}-W^{2}_{0}\|^{m}.\qquad\Box$

8.3 Proof of Lemma 3

Define two auxiliary random walks:

Y_{n}=W_{0}+\sum_{i=1}^{n}\left(\frac{V_{i}}{\mu_{i}}-\frac{U_{i}}{\lambda_{i}}\right),\tilde{Y}_{n}=\tilde{W}_{0}+\sum_{i=1}^{n}\left(\frac{V_{i}}{\tilde{\mu}_{i}}-\frac{U_{i}}{\tilde{\lambda}_{i}}\right).

Then, for any $n\geq 1$ , we can express $W_{n}$ and $\tilde{W}_{n}$ as

W_{n}=Y_{n}-\left(\min_{1\leq m\leq n}Y_{m}\wedge 0\right),\quad\tilde{W}_{n}=\tilde{Y}_{n}-\left(\min_{1\leq m\leq n}\tilde{Y}_{m}\wedge 0\right).
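The random-walk representation of the waiting times can be verified against the Lindley recursion on a simulated path. The sketch below assumes unit-mean exponential $V_{i}$ and $U_{i}$ and a hypothetical, slowly varying sequence of rates $(\mu_{i},\lambda_{i})$ ; the function names are ours.

```python
import random

def lindley(w0, increments):
    """Waiting times from the recursion W_n = max(W_{n-1} + X_n, 0)."""
    w, out = w0, []
    for x in increments:
        w = max(w + x, 0.0)
        out.append(w)
    return out

def reflected_walk(w0, increments):
    """Waiting times from W_n = Y_n - min(min_{1<=m<=n} Y_m, 0) with Y_0 = W_0."""
    y, running_min, out = w0, 0.0, []
    for x in increments:
        y += x                               # Y_n = Y_{n-1} + X_n
        running_min = min(running_min, y)    # min over m <= n, capped at 0
        out.append(y - running_min)
    return out

rng = random.Random(1)
increments = []
for i in range(1, 2001):
    mu_i, lam_i = 10.0 + 1.0 / i, 6.0 - 1.0 / i   # hypothetical control sequence
    # X_i = V_i / mu_i - U_i / lambda_i
    increments.append(rng.expovariate(1.0) / mu_i - rng.expovariate(1.0) / lam_i)

max_gap = max(abs(a - b) for a, b in
              zip(lindley(1.5, increments), reflected_walk(1.5, increments)))
print("largest discrepancy:", max_gap)
```

The two constructions agree up to floating-point rounding, which is the identity used to compare $W_{n}$ and $\tilde{W}_{n}$ through the walks $Y_{n}$ and $\tilde{Y}_{n}$ .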

Let $\tau=\operatorname*{arg\,min}_{1\leq m\leq n}Y_{m}$ and $\tilde{\tau}=\operatorname*{arg\,min}_{1\leq m\leq n}\tilde{Y}_{m}$ . Note that following the above notation, for each $n$ , $W_{n}$ is the waiting time of customer $n$ and as a consequence, $\frac{U_{n}}{\lambda_{n}}$ should be understood as the inter-arrival time between customers $n-1$ and $n$ , and $\frac{V_{n}}{\mu_{n}}$ as the service time of customer $n-1$ .

Case 1: If $Y_{\tau}\leq 0$ and $\tilde{Y}_{\tilde{\tau}}\leq 0$ , i.e., both $W_{t}$ and $\tilde{W}_{t}$ hit zero before $n$ , we have

\displaystyle Y_{n}-Y_{\tilde{\tau}}-(\tilde{Y}_{n}-\tilde{Y}_{\tilde{\tau}})\leq W_{n}-\tilde{W}_{n}=Y_{n}-Y_{\tau}-(\tilde{Y}_{n}-\tilde{Y}_{\tilde{\tau}})\leq Y_{n}-Y_{\tau}-(\tilde{Y}_{n}-\tilde{Y}_{\tau}).

So, in this case

|W_{n}-\tilde{W}_{n}|\leq\sum_{i=\tau\wedge\tilde{\tau}+1}^{n}\left|\frac{1}{\mu_{i}}-\frac{1}{\tilde{\mu}_{i}}\right|V_{i}+\sum_{i=\tau\wedge\tilde{\tau}+1}^{n}\left|\frac{1}{\lambda_{i}}-\frac{1}{\tilde{\lambda}_{i}}\right|U_{i}.

Recall that $X_{n}$ (and $\tilde{X}_{n}$ ) is the age of the server’s busy time observed by customer $n$ upon arrival. By definition, $W_{\tau}=0$ and therefore,

X_{n}=\sum_{i=\tau+1}^{n}\frac{U_{i}}{\lambda_{i}},\quad X_{n}+W_{n}=\sum_{i=\tau+1}^{n}\frac{V_{i}}{\mu_{i}}.

The second equation holds as the server has just served $n-\tau$ customers (indexed from $\tau$ to $n-1$ ) in the current busy cycle when customer $n$ enters service. Then,

\sum_{i=\tau+1}^{n}\left|\frac{1}{\mu_{i}}-\frac{1}{\tilde{\mu}_{i}}\right|V_{i}+\sum_{i=\tau+1}^{n}\left|\frac{1}{\lambda_{i}}-\frac{1}{\tilde{\lambda}_{i}}\right|U_{i}\leq\frac{c_{\mu}}{\underline{\mu}}(X_{n}+W_{n})+\frac{c_{\lambda}}{\underline{\lambda}}X_{n}.

Following a similar argument, we have

\sum_{i=\tilde{\tau}+1}^{n}\left|\frac{1}{\mu_{i}}-\frac{1}{\tilde{\mu}_{i}}\right|V_{i}+\sum_{i=\tilde{\tau}+1}^{n}\left|\frac{1}{\lambda_{i}}-\frac{1}{\tilde{\lambda}_{i}}\right|U_{i}\leq\frac{c_{\mu}}{\underline{\mu}}(\tilde{X}_{n}+\tilde{W}_{n})+\frac{c_{\lambda}}{\underline{\lambda}}\tilde{X}_{n}.

Therefore, in this case, we have

|W_{n}-\tilde{W}_{n}|\leq\left(\frac{c_{\mu}}{\underline{\mu}}+\frac{c_{\lambda}}{\underline{\lambda}}\right)\max(X_{n},\tilde{X}_{n})+\frac{c_{\mu}}{\underline{\mu}}\max(W_{n},\tilde{W}_{n}).

Case 2: If $Y_{\tau}>0$ or $\tilde{Y}_{\tilde{\tau}}>0$ , we can inductively derive that

|W_{n}-\tilde{W}_{n}|\leq|W_{0}-\tilde{W}_{0}|+\sum_{i=1}^{n}\left|\frac{1}{\mu_{i}}-\frac{1}{\tilde{\mu}_{i}}\right|V_{i}+\sum_{i=1}^{n}\left|\frac{1}{\lambda_{i}}-\frac{1}{\tilde{\lambda}_{i}}\right|U_{i}.

In detail, it suffices to show that, for all $1\leq m\leq n$ ,

|W_{m}-\tilde{W}_{m}|\leq|W_{m-1}-\tilde{W}_{m-1}|+\left|\frac{1}{\mu_{m}}-\frac{1}{\tilde{\mu}_{m}}\right|V_{m}+\left|\frac{1}{\lambda_{m}}-\frac{1}{\tilde{\lambda}_{m}}\right|U_{m}.

(30)

Without loss of generality, we assume $Y_{\tau}>0$ . By definition, $Y_{\tau}=\min_{1\leq l\leq n}Y_{l}$ and hence $W_{l}=Y_{l}>0$ for all $1\leq l\leq n$ . Then,

|W_{m}-\tilde{W}_{m}|=\left|W_{m-1}-\frac{U_{m}}{\lambda_{m}}+\frac{V_{m}}{\mu_{m}}-\left(\tilde{W}_{m-1}-\frac{U_{m}}{\tilde{\lambda}_{m}}+\frac{V_{m}}{\tilde{\mu}_{m}}\right)^{+}\right|.

If $\tilde{W}_{m}>0$ , we have

	$\displaystyle\|W_{m}-\tilde{W}_{m}\|$	$\displaystyle=\left\|W_{m-1}-\frac{U_{m}}{\lambda_{m}}+\frac{V_{m}}{\mu_{m}}-\left(\tilde{W}_{m-1}-\frac{U_{m}}{\tilde{\lambda}_{m}}+\frac{V_{m}}{\tilde{\mu}_{m}}\right)\right\|$
		$\displaystyle\leq\|W_{m-1}-\tilde{W}_{m-1}\|+\left\|\frac{1}{\mu_{m}}-\frac{1}{\tilde{\mu}_{m}}\right\|V_{m}+\left\|\frac{1}{\lambda_{m}}-\frac{1}{\tilde{\lambda}_{m}}\right\|U_{m}.$

On the other hand, if $\tilde{W}_{m}=0$ , we have $\tilde{W}_{m-1}-\frac{U_{m}}{\tilde{\lambda}_{m}}+\frac{V_{m}}{\tilde{\mu}_{m}}\leq 0.$ So,

	$\displaystyle\|W_{m}-\tilde{W}_{m}\|$	$\displaystyle=W_{m}-0\leq W_{m}-\left(\tilde{W}_{m-1}-\frac{U_{m}}{\tilde{\lambda}_{m}}+\frac{V_{m}}{\tilde{\mu}_{m}}\right)$
		$\displaystyle=\left\|W_{m-1}-\frac{U_{m}}{\lambda_{m}}+\frac{V_{m}}{\mu_{m}}-\left(\tilde{W}_{m-1}-\frac{U_{m}}{\tilde{\lambda}_{m}}+\frac{V_{m}}{\tilde{\mu}_{m}}\right)\right\|$
		$\displaystyle\leq\|W_{m-1}-\tilde{W}_{m-1}\|+\left\|\frac{1}{\mu_{m}}-\frac{1}{\tilde{\mu}_{m}}\right\|V_{m}+\left\|\frac{1}{\lambda_{m}}-\frac{1}{\tilde{\lambda}_{m}}\right\|U_{m}.$

This closes the proof of (30).

As a result of (30), if $Y_{\tau}>0$ , we can conclude the system (associated with $(\mu_{n},\lambda_{n})$ ) was kept busy from time 0 until customer $n$ enters service. As a consequence, as $X_{0}\geq 0$ , we have

X_{n}\geq\sum_{i=1}^{n}\frac{U_{i}}{\lambda_{i}},\quad X_{n}+W_{n}\geq\sum_{i=1}^{n}\frac{V_{i}}{\mu_{i}}.

Therefore,

\sum_{i=1}^{n}\left|\frac{1}{\mu_{i}}-\frac{1}{\tilde{\mu}_{i}}\right|V_{i}+\sum_{i=1}^{n}\left|\frac{1}{\lambda_{i}}-\frac{1}{\tilde{\lambda}_{i}}\right|U_{i}\leq\frac{c_{\mu}}{\underline{\mu}}(\max(X_{n},\tilde{X}_{n})+\max({W}_{n},\tilde{W}_{n}))+\frac{c_{\lambda}}{\underline{\lambda}}\max(X_{n},\tilde{X}_{n}),

and hence we can also conclude

|W_{n}-\tilde{W}_{n}|\leq|W_{0}-\tilde{W}_{0}|+\left(\frac{c_{\mu}}{\underline{\mu}}+\frac{c_{\lambda}}{\underline{\lambda}}\right)\max(X_{n},\tilde{X}_{n})+\frac{c_{\mu}}{\underline{\mu}}\max(W_{n},\tilde{W}_{n}).

$\Box{}$

8.4 Proof of Lemma 4

By the inequality that $(a+b)^{m}\leq 2^{m-1}(a^{m}+b^{m})$ for $m\geq 1$ , we have

	$\displaystyle\mathbb{E}[\|W_{\infty}(\mu_{1},p_{1})-W_{\infty}(\mu_{2},p_{2})\|^{m}]$
	$\displaystyle\leq 2^{m-1}\left(\mathbb{E}[\|W_{\infty}(\mu_{1},p_{1})-W_{\infty}(\mu_{2},p_{1})\|^{m}+\|W_{\infty}(\mu_{2},p_{1})-W_{\infty}(\mu_{2},p_{2})\|^{m}]\right).$

It suffices to prove that there exist two constants $B_{1},B_{2}>0$ such that for $1\leq m\leq 4$ ,

	$\displaystyle\mathbb{E}[\|W_{\infty}(\mu_{1},p_{1})-W_{\infty}(\mu_{2},p_{1})\|^{m}]$	$\displaystyle\leq B_{1}\|\mu_{1}-\mu_{2}\|^{m},$
	$\displaystyle\mathbb{E}[\|W_{\infty}(\mu_{2},p_{1})-W_{\infty}(\mu_{2},p_{2})\|^{m}]$	$\displaystyle\leq B_{2}\|p_{1}-p_{2}\|^{m}.$

Without loss of generality, assume $\mu_{1}<\mu_{2}$ . We now construct two stationary sequences $\{(W_{n}^{\mu_{i}}:n\leq 0),i=1,2\}$ that are coupled “from the past”. Let $V_{j}$ and $U_{j}$ be two i.i.d sequences corresponding to the service and inter-arrival times. For each $i$ , we define a random walk:

Y^{\mu_{i}}_{0}=0,\leavevmode\nobreak\ Y^{\mu_{i}}_{n}=\sum_{j=1}^{n}\left(\frac{V_{j}}{\mu_{i}}-\frac{U_{j}}{\lambda(p_{1})}\right),\leavevmode\nobreak\ \forall n\geq 1.

It is clear that $Y^{\mu_{i}}_{n}$ is a random walk with negative drift for $i=1,2$ . Define

W_{-n}^{\mu_{i}}=\max_{j\geq n}Y^{\mu_{i}}_{j}-Y^{\mu_{i}}_{n},n\geq 0.

It is known in the literature (see, for example, Blanchet and Chen (2015)) that $W_{-n}^{\mu_{i}}$ is a stationary waiting time process of a $GI/GI/1$ queue, starting from $-\infty$ , with parameter $(\mu_{i},p_{1})$ . In particular, the dynamics of $W_{-n}^{\mu_{i}}$ satisfy

W_{-n+1}^{\mu_{i}}=\left(W_{-n}^{\mu_{i}}+\frac{V_{n}}{\mu_{i}}-\frac{U_{n}}{\lambda(p_{1})}\right)^{+},\text{ for }n\geq 1,

with $V_{n}/\mu_{i}$ being the service time of customer $-n$ and $U_{n}/\lambda(p_{1})$ being the inter-arrival time between customer $-n$ and $-n+1$ . For a fixed sequence of $(V_{n},U_{n})$ , we have

W_{0}^{\mu_{1}}=\max_{j\geq 0}Y_{j}^{\mu_{1}},\quad\text{and }W_{0}^{\mu_{2}}=\max_{j\geq 0}Y_{j}^{\mu_{2}}.

As $Y^{\mu_{1}}_{j}\geq Y^{\mu_{2}}_{j}$ , we have $W_{0}^{\mu_{1}}\geq W_{0}^{\mu_{2}}$ . Moreover, letting $\tau=\arg\max_{j\geq 0}Y^{\mu_{1}}_{j}$ , we have

W_{0}^{\mu_{1}}-W_{0}^{\mu_{2}}=\max_{j\geq 0}Y_{j}^{\mu_{1}}-\max_{j\geq 0}Y_{j}^{\mu_{2}}=Y_{\tau}^{\mu_{1}}-\max_{j\geq 0}Y_{j}^{\mu_{2}}\leq Y_{\tau}^{\mu_{1}}-Y_{\tau}^{\mu_{2}}.

As a consequence, we have

|W_{0}^{\mu_{1}}-W_{0}^{\mu_{2}}|\leq\sum_{n=1}^{\tau}\left(\frac{V_{n}}{\mu_{1}}-\frac{V_{n}}{\mu_{2}}\right)\leq\frac{\mu_{2}-\mu_{1}}{\mu_{1}}\sum_{n=1}^{\tau}\frac{V_{n}}{\mu_{1}},\quad\text{ with }\tau=\inf\{n:W_{-n}^{\mu_{1}}=0\}.

Note that $V_{n}/\mu_{1}$ is the service time of customer $-n$ in the system with parameter $(p_{1},\mu_{1})$ . By the definition of $\tau$ , customer $-\tau$ enters service immediately upon arrival and the server remains busy until the arrival of customer $0$ . Therefore, the sum of service times on the right-hand side equals the time between the arrival of customer $-\tau$ and the departure of customer $-1$ , which equals the observed busy period at the arrival of customer $0$ plus its waiting time, i.e.,

|W_{0}^{\mu_{1}}-W_{0}^{\mu_{2}}|\leq\frac{\mu_{2}-\mu_{1}}{\mu_{1}}\sum_{n=1}^{\tau}\frac{V_{n}}{\mu_{1}}=\frac{\mu_{2}-\mu_{1}}{\mu_{1}}(X_{0}^{\mu_{1}}+W_{0}^{\mu_{1}}).

Therefore, for each $m\geq 1$ ,

\mathbb{E}[|W_{0}^{\mu_{1}}-W_{0}^{\mu_{2}}|^{m}]\leq\frac{(\mu_{2}-\mu_{1})^{m}}{\mu_{1}^{m}}\mathbb{E}[(X_{0}^{\mu_{1}}+W_{0}^{\mu_{1}})^{m}]\leq\frac{(\mu_{2}-\mu_{1})^{m}}{\underline{\mu}^{m}}\mathbb{E}[(X_{0}^{\mu_{1}}+W_{0}^{\mu_{1}})^{m}].

Following Lemma 1, $\mathbb{E}[(X_{0}^{\mu_{1}}+W_{0}^{\mu_{1}})^{m}]\leq 2^{m}M$ . Let $B_{1}=\max_{1\leq m\leq 4}2^{m}M/\underline{\mu}^{m}$ and we conclude, for $1\leq m\leq 4$ ,

\mathbb{E}[|W_{0}^{\mu_{1}}-W_{0}^{\mu_{2}}|^{m}]\leq B_{1}|\mu_{1}-\mu_{2}|^{m}.
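The pathwise coupling inequality behind this bound can be illustrated numerically. The following sketch couples two systems with $\mu_{1}<\mu_{2}$ through the same $(V_{n},U_{n})$ and compares $W_{0}^{\mu_{1}}-W_{0}^{\mu_{2}}$ with $\frac{\mu_{2}-\mu_{1}}{\mu_{1}}\sum_{n=1}^{\tau}V_{n}/\mu_{1}$. It is a minimal sketch under assumptions: the horizon is truncated at `n` steps, and the exponential distributions and the helper name `coupled_gap` are hypothetical.

```python
import random

def coupled_gap(mu1, mu2, lam, n, seed=0):
    """Couple two queues (mu1 < mu2) through the same V_n, U_n.

    Returns (W0_mu1, W0_mu2, bound), where bound is
    (mu2 - mu1)/mu1 * sum_{j<=tau} V_j/mu1 and tau = argmax_j Y_j^{mu1},
    all computed on a horizon truncated at n steps.
    """
    rng = random.Random(seed)
    v = [rng.expovariate(1.0) for _ in range(n)]   # hypothetical Exp(1) service requirements
    u = [rng.expovariate(1.0) for _ in range(n)]   # hypothetical Exp(1) inter-arrival seeds
    y1 = y2 = 0.0
    max1 = max2 = 0.0
    work = 0.0            # running sum of V_j / mu1
    work_at_tau = 0.0     # the same sum, frozen at the argmax of Y^{mu1}
    for vj, uj in zip(v, u):
        y1 += vj / mu1 - uj / lam
        y2 += vj / mu2 - uj / lam
        work += vj / mu1
        if y1 > max1:
            max1, work_at_tau = y1, work
        max2 = max(max2, y2)
    bound = (mu2 - mu1) / mu1 * work_at_tau
    return max1, max2, bound

w1, w2, bound = coupled_gap(mu1=1.2, mu2=1.5, lam=1.0, n=5000)
```

Because the inequality holds path by path, the check does not depend on the seed; the moment bound then follows by taking expectations as in the proof.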

The bound for $\mathbb{E}[|W_{\infty}(\mu_{2},p_{1})-W_{\infty}(\mu_{2},p_{2})|^{m}]$ follows from a similar argument, so we only sketch the proof. Without loss of generality, we assume $p_{1}<p_{2}$ and consider two stationary waiting time processes $\{(W_{n}^{p_{i}}:n\leq 0),\lambda_{i}=\lambda(p_{i}),i=1,2\}$ that are coupled from the past with the same sequence $(V_{n},U_{n})$ , in the same way as above. Then, we have $|W_{0}^{p_{1}}-W_{0}^{p_{2}}|\leq(\lambda_{1}-\lambda_{2})X_{0}^{p_{1}}/\lambda_{2}$ , and therefore,

\mathbb{E}[|W_{0}^{p_{1}}-W_{0}^{p_{2}}|^{m}]\leq B_{2}|p_{1}-p_{2}|^{m},\text{ with }B_{2}=\max_{1\leq m\leq 4,\underline{p}\leq p\leq\bar{p}}(M|\lambda^{\prime}(p)|^{m}/\lambda(\bar{p})^{m}).

As a consequence, we can take

B=8\cdot\max_{1\leq m\leq 4}(2^{m}M/\underline{\mu}^{m})\vee\max_{1\leq m\leq 4,\underline{p}\leq p\leq\bar{p}}(M|\lambda^{\prime}(p)|^{m}/\lambda(\bar{p})^{m}).

(31)

8.5 Full Proof of Theorem 1

We first give the proofs of Corollaries 1–3.

Proof of Corollary 1.

For any $n\geq d_{k}$ ,

\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|]=\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|1(Q_{k}<d_{k})]+\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|1(Q_{k}\geq d_{k})].

Given that $Q_{k}<d_{k}$ , by definition, $W_{n}^{k}$ is synchronously coupled with $\bar{W}_{n}^{k}$ for $n\geq d_{k}+1$ . Note that given $Q_{k}<d_{k}$ , $U_{n}^{k}$ and $V_{n}^{k}$ are independent of $Q_{k}$ for $n\geq d_{k}+1$ . As a consequence, by Lemma 2, the conditional expectation

\displaystyle\mathbb{E}\left[|W_{n}^{k}-\bar{W}_{n}^{k}|\leavevmode\nobreak\ \Big{|}\leavevmode\nobreak\ Q_{k}<d_{k},W_{d_{k}}^{k},\bar{W}_{d_{k}}^{k}\right]\leq e^{-\gamma(n-d_{k})}(2+e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}})\left|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}\right|.

Therefore,

	$\displaystyle\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|1(Q_{k}<d_{k})]$	$\displaystyle\leq e^{-\gamma(n-d_{k})}\mathbb{E}\left[(2+e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}})\left|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}\right|1(Q_{k}<d_{k})\right]$
		$\displaystyle\leq e^{-\gamma(n-d_{k})}\mathbb{E}\left[(2+e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}})\left|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}\right|\right]$
		$\displaystyle\leq e^{-\gamma(n-d_{k})}\left(2+\mathbb{E}\left[\left(e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}}\right)^{2}\right]^{1/2}\right)\mathbb{E}\left[\left|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}\right|^{2}\right]^{1/2}.$

By Lemma 1 and Assumption 2, we have

	$\displaystyle\mathbb{E}\left[\left(e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}}\right)^{2}\right]$	$\displaystyle\leq 2\left(\mathbb{E}[e^{2\bar{\mu}\theta W_{d_{k}}^{k}}]+\mathbb{E}[e^{2\bar{\mu}\theta\bar{W}_{d_{k}}^{k}}]\right)\leq 4M,$
	$\displaystyle\mathbb{E}\left[|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}|^{2}\right]$	$\displaystyle\leq 2\left(\mathbb{E}[(W_{d_{k}}^{k})^{2}]+\mathbb{E}[(\bar{W}_{d_{k}}^{k})^{2}]\right)\leq 4M.$

As a consequence, we have

\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|1(Q_{k}<d_{k})]\leq e^{-\gamma(n-d_{k})}A,\text{with }A=4\sqrt{M}+4M.

On the other hand,

\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|1(Q_{k}\geq d_{k})]\leq\mathbb{E}\left[|W_{n}^{k}-\bar{W}_{n}^{k}|^{2}\right]^{1/2}\mathbb{P}(Q_{k}\geq d_{k})^{1/2}.

Again, by Lemma 1, $\mathbb{E}\left[|W_{n}^{k}-\bar{W}_{n}^{k}|^{2}\right]\leq 4M.$ As $d_{k}=\lceil 4\log(k)/\min(\gamma,\eta)\rceil$ ,

\mathbb{P}(Q_{k}\geq d_{k})\leq e^{-\eta d_{k}}\mathbb{E}\left[e^{\eta Q_{k}}\right]\leq k^{-4}M.

In summary, we have, for $n\geq d_{k}+1$ ,

\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|]\leq e^{-\gamma(n-d_{k})}A+2Mk^{-2}.
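The role of the warm-up length $d_{k}=\lceil 4\log(k)/\min(\gamma,\eta)\rceil$ in the tail estimate can be seen with a short computation: it guarantees $e^{-\eta d_{k}}\leq k^{-4}$ , which is the factor that turns the Chernoff bound into $k^{-4}M$ . The values of $\gamma$ and $\eta$ below are hypothetical and serve only as an illustration.

```python
import math

def warmup_length(k, gamma, eta):
    """d_k = ceil(4 log(k) / min(gamma, eta)), as in the proof of Corollary 1."""
    return math.ceil(4.0 * math.log(k) / min(gamma, eta))

def tail_factor(k, gamma, eta):
    """e^{-eta d_k}, the factor multiplying E[e^{eta Q_k}] in the Chernoff bound."""
    return math.exp(-eta * warmup_length(k, gamma, eta))

# Hypothetical rates, chosen only for illustration.
gamma, eta = 0.3, 0.2
checks = [(k, warmup_length(k, gamma, eta), tail_factor(k, gamma, eta), k ** -4.0)
          for k in (2, 10, 100, 1000)]
for k, d_k, factor, cap in checks:
    print(f"k={k:5d}  d_k={d_k:4d}  exp(-eta d_k)={factor:.3e}  k^-4={cap:.3e}")
```

Since $d_{k}$ grows only logarithmically in $k$ , the warm-up costs $O(\log k)$ samples per cycle while making the contribution of the event $\{Q_{k}\geq d_{k}\}$ polynomially small.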

As a direct consequence,

	$\displaystyle|I_{1}|$	$\displaystyle=\left|\mathbb{E}\left[\sum_{n=\tilde{d}_{k}+1}^{D_{k}}W_{n}^{k}-w(\mu_{k},p_{k})\right]\right|\leq\sum_{n=\tilde{d}_{k}+1}^{D_{k}}\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|]$
		$\displaystyle\leq\sum_{n=\tilde{d}_{k}+1}^{\infty}e^{-\gamma(n-d_{k})}A+2Mk^{-2}D_{k}\leq\frac{A}{1-e^{-\gamma}}k^{-1}+2MK_{2}k^{-\alpha}=O(k^{-\alpha}).$

∎

Proof of Corollary 2.

Recall that by (6), for each cycle $k$ ,

W_{n}^{k}=\begin{cases}\left(W_{n-1}^{k}+\frac{V^{k}_{n}}{\mu_{k}}-\frac{U^{k}_{n}}{\lambda^{k}_{n}}\right)^{+}&\text{ for }1\leq n\leq Q_{k}{\wedge D_{k}};\\ \left(W_{n-1}^{k}+\frac{V^{k}_{n}}{\mu_{k}}-\frac{U^{k}_{n}}{\lambda_{k}}\right)^{+}&\text{ for }(Q_{k}+1){\wedge(D_{k}+1)}\leq n\leq D_{k}.\end{cases},\quad W_{0}^{k}=W_{D_{k-1}}^{k-1}.

Define

\tilde{W}_{n}^{k}=\begin{cases}\left(\tilde{W}_{n-1}^{k}+\frac{V^{k}_{n}}{\mu_{k}}-\frac{U^{k}_{n}}{\lambda_{k-1}}\right)^{+}&\text{ for }1\leq n\leq Q_{k}{\wedge D_{k}};\\ \left(\tilde{W}_{n-1}^{k}+\frac{V^{k}_{n}}{\mu_{k}}-\frac{U^{k}_{n}}{\lambda_{k}}\right)^{+}&\text{ for }(Q_{k}+1){\wedge(D_{k}+1)}\leq n\leq D_{k}.\end{cases},\quad\tilde{W}_{0}^{k}=W_{D_{k-1}}^{k-1}.

Then, in the case $Q_{k-1}<D_{k-1}$ , we have $W_{n}^{k}=\tilde{W}_{n}^{k}$ for all $1\leq n\leq D_{k}$ . As a consequence, we have

|W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|\leq|\tilde{W}_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|+|W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|\cdot 1(Q_{k-1}\geq D_{k-1}).

For the second term, by Lemma 1, we have, for $k\geq 2$ ,

	$\displaystyle\mathbb{E}[|W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|\cdot 1(Q_{k-1}\geq D_{k-1})]$	$\displaystyle\leq\mathbb{E}[(W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1})^{2}]^{1/2}\mathbb{P}(Q_{k-1}\geq D_{k-1})^{1/2}$
		$\displaystyle\leq(2\mathbb{E}[(W_{n}^{k})^{2}]+2\mathbb{E}[(\bar{W}_{D_{k-1}+n}^{k-1})^{2}])^{1/2}\left(\exp(-\eta D_{k-1})\mathbb{E}[\exp(\eta Q_{k-1})]\right)^{1/2}$
		$\displaystyle\leq 2M(k-1)^{-3}\leq 16Mk^{-3}.$

For the first term, by definition, $\bar{W}_{D_{k-1}+n}^{k-1}$ is a waiting time sequence with service and arrival rates $(\mu_{k-1},\lambda(p_{k-1}))$ and $\tilde{W}_{n}^{k}$ is a sequence with rates $(\mu_{k},\lambda(p_{k}))$ or $(\mu_{k},\lambda(p_{k-1}))$ . As a consequence, by applying Lemma 3, we have

	$\displaystyle|\tilde{W}_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|\leq$	$\displaystyle|\tilde{W}_{0}^{k}-\bar{W}_{D_{k-1}}^{k-1}|+\left(\frac{|\mu_{k}-\mu_{k-1}|}{\underline{\mu}}+\frac{|\lambda(p_{k})-\lambda(p_{k-1})|}{\lambda(\bar{p})}\right)\max(\tilde{X}_{n}^{k},\bar{X}_{D_{k-1}+n}^{k-1})$
		$\displaystyle+\frac{|\mu_{k}-\mu_{k-1}|}{\underline{\mu}}\max(\tilde{W}_{n}^{k},\bar{W}_{D_{k-1}+n}^{k-1}).$

By Lemma 1, we have that $\max(\tilde{X}_{n}^{k},\bar{X}_{D_{k-1}+n}^{k-1})\leq\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\hat{X}_{n}^{k}$ and $\max(\tilde{W}_{n}^{k},\bar{W}_{D_{k-1}+n}^{k-1})\leq\hat{W}_{n}^{k}$ , where $\hat{X}_{n}^{k}$ and $\hat{W}_{n}^{k}$ are the observed busy period and waiting time in a stationary $GI/GI/1$ queue with rate $(\underline{\mu},\underline{p})$ as defined in Lemma 1. On the other hand, under Condition (b) of Theorem 1,

	$\displaystyle\mathbb{E}[|\mu_{k}-\mu_{k-1}|^{2}]$	$\displaystyle\leq\mathbb{E}[\|x_{k}-x_{k+1}\|^{2}]\leq K_{2}k^{-2\alpha},$
	$\displaystyle\mathbb{E}[|\lambda_{k}-\lambda_{k-1}|^{2}]$	$\displaystyle\leq K_{2}\left(\max_{p}\lambda^{\prime}(p)\right)^{2}k^{-2\alpha}\equiv K_{6}k^{-2\alpha}.$

Therefore,

	$\displaystyle\mathbb{E}\left[|\mu_{k}-\mu_{k-1}|\max(\tilde{X}_{n}^{k},\bar{X}_{D_{k-1}+n}^{k-1})\right]\leq\mathbb{E}[(\mu_{k}-\mu_{k-1})^{2}]^{1/2}\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\mathbb{E}[(\hat{X}_{n}^{k})^{2}]^{1/2}\leq\sqrt{K_{2}}\sqrt{M}k^{-\alpha};$
	$\displaystyle\mathbb{E}[|\lambda(p_{k})-\lambda(p_{k-1})|\max(\tilde{X}_{n}^{k},\bar{X}_{D_{k-1}+n}^{k-1})]\leq\mathbb{E}[(\lambda_{k}-\lambda_{k-1})^{2}]^{1/2}\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\mathbb{E}[(\hat{X}_{n}^{k})^{2}]^{1/2}\leq\sqrt{K_{6}}\sqrt{M}k^{-\alpha};$
	$\displaystyle\mathbb{E}\left[|\mu_{k}-\mu_{k-1}|\max(\tilde{W}_{n}^{k},\bar{W}_{D_{k-1}+n}^{k-1})\right]\leq\mathbb{E}[(\mu_{k}-\mu_{k-1})^{2}]^{1/2}\mathbb{E}[(\hat{W}_{n}^{k})^{2}]^{1/2}\leq\sqrt{K_{2}}\sqrt{M}k^{-\alpha}.$

Finally, by Corollary 1, we have

{\mathbb{E}}[|\tilde{W}_{0}^{k}-\bar{W}_{D_{k-1}}^{k-1}|]=\mathbb{E}[|\bar{W}_{D_{k-1}}^{k-1}-W_{0}^{k}|]=\mathbb{E}[|\bar{W}_{D_{k-1}}^{k-1}-W_{D_{k-1}}^{k-1}|]\leq(A+2M)(k-1)^{-2}\leq(4A+8M)k^{-2}.

In summary, we can conclude

	$\displaystyle|\mathbb{E}[W_{n}^{k}-w(\mu_{k},p_{k})]|$	$\displaystyle\leq\mathbb{E}[|w(\mu_{k-1},p_{k-1})-w(\mu_{k},p_{k})|]+\mathbb{E}[|W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|]$
		$\displaystyle\leq B\mathbb{E}[|\mu_{k}-\mu_{k-1}|+|\lambda(p_{k})-\lambda(p_{k-1})|]+\left(\frac{2\sqrt{K_{2}}}{\underline{\mu}}+\frac{\sqrt{K_{6}}}{\lambda(\bar{p})}\right)\sqrt{M}k^{-\alpha}+O(k^{-2})$
		$\displaystyle\leq B(\sqrt{K_{2}}+\sqrt{K_{6}})k^{-\alpha}+\left(\frac{2\sqrt{K_{2}}}{\underline{\mu}}+\frac{\sqrt{K_{6}}}{\lambda(\bar{p})}\right)\sqrt{M}k^{-\alpha}+O(k^{-2})$
		$\displaystyle=O(k^{-\alpha}),$

where the second inequality follows from Lemma 4. As a direct consequence, $|I_{2}|=O(k^{-\alpha}\log(k))$ as $\tilde{d}_{k}=O(\log(k))$ .

∎

Proof of Corollary 3.

Note that by Lemma 1,

\left|h_{0}\mathbb{E}[W_{\infty}(\mu_{k},p_{k})]+\frac{h_{0}}{\mu_{k}}-p_{k}\right|\leq h_{0}M+h_{0}\underline{\mu}^{-1}+\bar{p}=O(1).

So it suffices to show that

\mathbb{E}[\left|D_{k}-\lambda(p_{k})T_{k}\right|]=O(k^{-\alpha}),\quad\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}|p_{k}-p_{n}^{k}|\right]=O(k^{-\alpha}).

Given $\mu_{k}$ and $p_{k}$ , $T_{k}$ is the time for the $D_{k}$ -th customer to enter service. Let $F_{n}^{k}$ be the inter-service time between the $(n-1)$ -th and the $n$ -th customers in cycle $k$ . Then, $T_{k}=\sum_{n=1}^{D_{k}}F_{n}^{k}$ and for each $n$ ,

F_{n}^{k}=\begin{cases}\frac{U^{k}_{n}}{\lambda^{k}_{n}}+W_{n}^{k}-W_{n-1}^{k}&\text{ for }1\leq n\leq Q_{k}\\ \frac{U^{k}_{n}}{\lambda_{k}}+W_{n}^{k}-W_{n-1}^{k}&Q_{k}+1\leq n\leq D_{k}.\end{cases}

Therefore,

	$\displaystyle T_{k}$	$\displaystyle=\sum_{n=1}^{D_{k}}F_{n}^{k}=\sum_{n=1}^{Q_{k}\wedge D_{k}}\frac{U^{k}_{n}}{\lambda_{n}^{k}}+\frac{1}{\lambda_{k}}\sum_{n=Q_{k}+1}^{D_{k}}U^{k}_{n}+W_{D_{k}}^{k}-W_{0}^{k}$
		$\displaystyle=\frac{1}{\lambda_{k}}\sum_{n=1}^{D_{k}}U^{k}_{n}+W_{D_{k}}^{k}-W_{0}^{k}+\sum_{n=1}^{Q_{k}\wedge D_{k}}U^{k}_{n}\left(\frac{1}{\lambda_{n}^{k}}-\frac{1}{\lambda_{k}}\right).$

As a consequence,

|\mathbb{E}\left[D_{k}-\lambda_{k}T_{k}\right]|\leq\lambda_{k}|\mathbb{E}[W_{D_{k}}^{k}]-\mathbb{E}[W_{0}^{k}]|+\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}U^{k}_{n}\Big{|}\frac{\lambda_{k}}{\lambda_{n}^{k}}-1\Big{|}\right].

Following Corollary 1 and Lemma 4, for $k\geq 2$ , the first term

	$\displaystyle|\mathbb{E}[W_{D_{k}}^{k}]-\mathbb{E}[W_{0}^{k}]|$	$\displaystyle\leq\mathbb{E}|W_{D_{k}}^{k}-\bar{W}_{D_{k}}^{k}|+\mathbb{E}|W_{D_{k-1}}^{k-1}-\bar{W}_{D_{k-1}}^{k-1}|+|\mathbb{E}[\bar{W}_{D_{k}}^{k}]-\mathbb{E}[\bar{W}_{D_{k-1}}^{k-1}]|$
		$\displaystyle\leq(A+2M)\left(k^{-2}+(k-1)^{-2}\right)+B\sqrt{K_{2}}k^{-\alpha}=O(k^{-\alpha}).$

As for the second term, by definition, customers 1 to $Q_{k}-1$ arrive at the system while customer $0$ is still waiting, and therefore,

0\leq\sum_{i=1}^{(Q_{k}-1)\wedge D_{k}}\frac{U_{i}^{k}}{\bar{\lambda}}\leq\sum_{i=1}^{(Q_{k}-1)\wedge D_{k}}\frac{U_{i}^{k}}{\lambda_{i}^{k}}\leq W_{0}^{k}\quad\Rightarrow\quad\mathbb{E}\left[\left(\sum_{i=1}^{Q_{k}\wedge D_{k}}U_{i}^{k}\right)^{2}\right]\leq\mathbb{E}\left[(\bar{\lambda}W_{0}^{k}+U^{k}_{Q_{k}})^{2}\right]\leq 4\bar{\lambda}^{2}M.

Here, $\mathbb{E}[(U_{Q_{k}}^{k})^{2}]$ is bounded since we assume that $U$ is light-tailed (Assumption 2). For simplicity of notation, we assume that $\mathbb{E}\left[\Big{(}\frac{U^{2}_{i}}{\bar{\lambda}}\Big{)}^{2}\right]<M$ for the same $M$ as in Lemma 1. Then,

	$\displaystyle\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}U^{k}_{n}\Big{|}\frac{\lambda_{k}}{\lambda_{n}^{k}}-1\Big{|}\right]$	$\displaystyle\leq\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}U^{k}_{n}\Big{|}\frac{\lambda_{k}}{\lambda_{k-1}}-1\Big{|}\right]+\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}U^{k}_{n}\cdot\frac{\bar{\lambda}}{\underline{\lambda}}\cdot 1(Q_{k-1}\geq D_{k-1})\right]$
		$\displaystyle\leq 2\bar{\lambda}\sqrt{M}\mathbb{E}\left[\Big{|}\frac{\lambda_{k}}{\lambda_{k-1}}-1\Big{|}^{2}\right]^{1/2}+\frac{2\bar{\lambda}^{2}}{\underline{\lambda}}\sqrt{M}\mathbb{P}(Q_{k-1}\geq D_{k-1})^{1/2}$
		$\displaystyle\leq\frac{2\bar{\lambda}\sqrt{M}K_{6}^{1/2}}{\underline{\lambda}}k^{-\alpha}+\frac{16\bar{\lambda}^{2}}{\underline{\lambda}}Mk^{-3}=O(k^{-\alpha}).$

Finally,

	$\displaystyle\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}|p_{k}-p_{n}^{k}|\right]$	$\displaystyle\leq\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}|p_{k}-p_{n}^{k}|\cdot 1(Q_{k-1}<D_{k-1})\right]$
		$\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ +\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}|p_{k}-p_{n}^{k}|\cdot 1(Q_{k-1}\geq D_{k-1})\right]$
		$\displaystyle\leq\mathbb{E}\left[Q_{k-1}^{2}\right]^{1/2}\mathbb{E}\left[|p_{k}-p_{k-1}|^{2}\right]^{1/2}+\mathbb{E}\left[\bar{p}^{2}Q_{k}^{2}\right]^{1/2}\mathbb{P}(Q_{k-1}\geq D_{k-1})^{1/2}$
		$\displaystyle\leq\sqrt{MK_{2}}k^{-\alpha}+8\bar{p}Mk^{-3}=O(k^{-\alpha}).$

Therefore, $I_{3}=O(k^{-\alpha})$ . ∎

Finishing the proof of Theorem 1.

First, by Corollary 1, we have

\displaystyle|I_{1}|\leq\frac{A}{1-e^{-\gamma}}k^{-1}+2MK_{2}k^{-\alpha}=O(k^{-\alpha}).

By Corollary 2,

\displaystyle|I_{2}|\leq\frac{5}{\min(\gamma,\eta)}\left(B(\sqrt{K_{2}}+\sqrt{K_{6}})+\left(\frac{2\sqrt{K_{2}}}{\underline{\mu}}+\frac{\sqrt{K_{6}}}{\underline{\lambda}}\right)\sqrt{M}+(4A+8M)\right)k^{-\alpha}\log(k)=O(k^{-\alpha}\log(k)).

Following the proof of Corollary 3, we have

I_{3}\leq(h_{0}M+h_{0}\underline{\mu}^{-1}+\bar{p})\left(\frac{2\bar{\lambda}\sqrt{M}K_{6}^{1/2}}{\underline{\lambda}}+\frac{16\bar{\lambda}^{2}}{\underline{\lambda}}M\right)k^{-\alpha}+(\sqrt{MK_{2}}+8\bar{p}M)k^{-\alpha}=O(k^{-\alpha}).

Therefore, we can conclude that $\forall k\geq 2$ , $R_{1,k}\leq K^{\prime}\cdot k^{-\alpha}\log(k)$ with

	$\displaystyle K^{\prime}=$	$\displaystyle\frac{Ah_{0}}{1-e^{-\gamma}}+2h_{0}MK_{2}+\frac{5h_{0}}{\min(\gamma,\eta)}\left(B(\sqrt{K_{2}}+\sqrt{K_{6}})+\left(\frac{2\sqrt{K_{2}}}{\underline{\mu}}+\frac{\sqrt{K_{6}}}{\underline{\lambda}}\right)\sqrt{M}+(4A+8M)\right)$
		$\displaystyle+(h_{0}M+h_{0}\underline{\mu}^{-1}+\bar{p})\left(\frac{2\bar{\lambda}\sqrt{M}K_{6}^{1/2}}{\underline{\lambda}}+\frac{16\bar{\lambda}^{2}}{\underline{\lambda}}M\right)+\sqrt{MK_{2}}+8\bar{p}M.$		(32)

Let $M_{0}>0$ be an upper bound on the regret in the first cycle. The constant $M_{0}$ is finite since the decision region $\mathcal{B}$ is bounded and, by condition (a), $D_{1}\leq K_{2}$ is also bounded. Finally, we conclude that

R_{1}(L)\leq M_{0}+K^{\prime}\sum_{k=2}^{L}k^{-\alpha}\log(k)\leq K\sum_{k=1}^{L}k^{-\alpha}\log(k),

with $K=K^{\prime}+\frac{2M_{0}}{\log(2)}$ . ∎

8.6 Convergence Rate of Observed Busy Period

As an analogue of Lemma 2, we prove a uniform convergence rate for the observed busy period $X_{n}$ , which will be used to bound $B_{k}$ and $\mathcal{V}_{k}$ of the gradient estimator (18) that involves terms of $X_{n}^{k}$ .

Lemma 6.

Let $X^{1}_{n}$ and $X^{2}_{n}$ be the observed busy period of the two queueing systems coupled as in Lemma 2, with $X^{1}_{0},X^{2}_{0}\leq_{st}\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\hat{X}_{0}$ and $W^{1}_{0},W^{2}_{0}\leq_{st}\hat{W}_{0}$ .

1.

$|X^{1}_{n}-X^{2}_{n}|\leq{\bf 1}_{\{\max(\Gamma_{1},\Gamma_{2})>n\}}\left(\sum_{k=1}^{n}\tau_{k}+X^{1}_{0}+X^{2}_{0}\right)$ .
2.

There exists a constant $K_{4}>0$ such that $\mathbb{E}[|X^{1}_{n}-X^{2}_{n}|^{m}]\leq K_{4}e^{-0.5\gamma n}n^{2}$ for all $n\geq 1$ and $m=1,2$ .

Proof of Lemma 6.

1. Following the argument in Lemma 2, if $W^{1}_{0}\geq W^{2}_{0}$ , we will have $W^{1}_{\Gamma_{1}}=W^{2}_{\Gamma_{1}}=0$ and hence $X^{1}_{\Gamma_{1}}=X^{2}_{\Gamma_{1}}=0$ . Since the two systems share the same sequence of arrivals and service times, $X^{1}_{n}=X^{2}_{n}$ for all $n\geq\Gamma_{1}$ . Therefore,

|X^{1}_{n}-X^{2}_{n}|\leq{\bf 1}_{\{\max(\Gamma_{1},\Gamma_{2})>n\}}|X^{1}_{n}-X^{2}_{n}|\leq{\bf 1}_{\{\max(\Gamma_{1},\Gamma_{2})>n\}}\left(\sum_{k=1}^{n}\tau_{k}+X^{1}_{0}+X^{2}_{0}\right).

The last inequality follows from $0\leq X^{i}_{n}\leq X_{0}^{i}+\sum_{k=1}^{n}\tau_{k}$ for $i=1,2$ .

2. Following 1 and part 2 of Lemma 2, for $m=1,2$ ,

	$\displaystyle\mathbb{E}[|X^{1}_{n}-X^{2}_{n}|^{m}]$	$\displaystyle\leq\mathbb{E}\left[{\bf 1}_{\{\max(\Gamma_{1},\Gamma_{2})>n\}}\left(\sum_{k=1}^{n}\tau_{k}+X^{1}_{0}+X^{2}_{0}\right)^{m}\right]$
		$\displaystyle\leq\mathbb{P}(\max(\Gamma_{1},\Gamma_{2})>n)^{1/2}\mathbb{E}\left[\left(\sum_{k=1}^{n}\tau_{k}+X^{1}_{0}+X^{2}_{0}\right)^{2m}\right]^{1/2}$

where

\mathbb{P}(\max(\Gamma_{1},\Gamma_{2})>n)\leq e^{-n\gamma}\mathbb{E}[2+e^{\mu\theta W_{0}^{1}}+e^{\mu\theta W_{0}^{2}}]\leq e^{-n\gamma}(2+2M),

and

\mathbb{E}\left[\left(\sum_{k=1}^{n}\tau_{k}+X^{1}_{0}+X^{2}_{0}\right)^{2m}\right]\leq 3^{2m-1}\left(n^{2m}\mathbb{E}\left[\frac{U_{1}^{2m}}{\lambda(p)^{2m}}\right]+\mathbb{E}[(X^{1}_{0})^{2m}]+\mathbb{E}[(X^{2}_{0})^{2m}]\right).

Therefore,

\mathbb{E}[|X^{1}_{n}-X^{2}_{n}|^{m}]\leq K_{4}e^{-0.5n\gamma}n^{2},

with $K_{4}=3^{m}\left(\max_{1\leq m\leq 2}\mathbb{E}[U_{1}^{2m}]/\lambda(\bar{p})^{2m}+2\frac{\lambda(\underline{p})^{2m}}{\lambda(\bar{p})^{2m}}M\right)^{1/2}(2+2M)^{1/2}$ . ∎

8.7 Proof of Theorem 2

The proof follows an induction-based approach similar to Broadie et al. (2011). For simplicity of notation, we write $\Delta_{k}=k^{-\beta}$ . Let $\mathcal{F}_{k}$ be the filtration up to cycle $k$ , i.e., including all events in the first $k-1$ cycles. Since $x_{k+1}=\pi_{\mathcal{B}}(x_{k}-\eta_{k}H_{k})$ and the projection onto $\mathcal{B}$ does not increase the distance to $x^{*}\in\mathcal{B}$ ,

		$\displaystyle\mathbb{E}\left[\|x_{k+1}-x^{*}\|^{2}\right]\leq\mathbb{E}\left[\|x_{k}-x^{*}-\eta_{k}H_{k}\|^{2}\right]$
	$\displaystyle=$	$\displaystyle\mathbb{E}\left[\|x_{k}-x^{*}\|^{2}-2\eta_{k}H_{k}\cdot(x_{k}-x^{*})+\eta_{k}^{2}H_{k}^{2}\right]$
	$\displaystyle=$	$\displaystyle\mathbb{E}\left[\|x_{k}-x^{*}\|^{2}-2\eta_{k}\nabla f(x_{k})\cdot(x_{k}-x^{*})\right]-\mathbb{E}[2\eta_{k}(H_{k}-\nabla f(x_{k}))\cdot(x_{k}-x^{*})]+\mathbb{E}[\eta_{k}^{2}H_{k}^{2}]$
	$\displaystyle\leq$	$\displaystyle(1-2\eta_{k}K_{0})\mathbb{E}\left[\|x_{k}-x^{*}\|^{2}\right]+\mathbb{E}[2\eta_{k}(H_{k}-\nabla f(x_{k}))\cdot(x^{*}-x_{k})]+\eta_{k}^{2}\mathbb{E}[H_{k}^{2}].$

Note that

		$\displaystyle\mathbb{E}[2\eta_{k}(H_{k}-\nabla f(x_{k}))\cdot(x^{*}-x_{k})]=\mathbb{E}[\mathbb{E}[2\eta_{k}(H_{k}-\nabla f(x_{k}))\cdot(x^{*}-x_{k})|\mathcal{F}_{k}]]$
	$\displaystyle=$	$\displaystyle 2\eta_{k}\mathbb{E}[\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\cdot(x^{*}-x_{k})]\leq 2\eta_{k}\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]^{1/2}\mathbb{E}[\|x^{*}-x_{k}\|^{2}]^{1/2}$
	$\displaystyle\leq$	$\displaystyle\eta_{k}\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]^{1/2}(1+\mathbb{E}[\|x_{k}-x^{*}\|^{2}]).$

The second-to-last inequality follows from $ab+cd\leq\sqrt{a^{2}+c^{2}}\sqrt{b^{2}+d^{2}}$ and Hölder's inequality; the last inequality follows from $2a\leq 1+a^{2}$ .

Let $b_{k}=\mathbb{E}[\|x_{k}-x^{*}\|^{2}]$ and recall that $B_{k}=\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]^{1/2},\quad\mathcal{V}_{k}=\mathbb{E}[H_{k}^{2}]$ . Then, we obtain the recursion

b_{k+1}\leq(1-2K_{0}\eta_{k}+\eta_{k}B_{k})b_{k}+\eta_{k}B_{k}+\eta_{k}^{2}\mathcal{V}_{k}.

By Conditions (b) and (c), we have

b_{k+1}\leq(1-2K_{0}\eta_{k}+\eta_{k}B_{k})b_{k}+\eta_{k}B_{k}+\eta_{k}^{2}\mathcal{V}_{k}\leq\left(1-2K_{0}\eta_{k}+\frac{K_{0}}{8}\eta_{k}\Delta_{k}\right)b_{k}+\frac{K_{0}}{8}\eta_{k}\Delta_{k}+K_{3}\eta_{k}\Delta_{k}.

Because step size $\eta_{k}\rightarrow 0$ , for $k$ large enough, $\eta_{k}K_{0}\leq 1/2$ . Let $k_{0}=\max\{k:2\eta_{k}K_{0}>1\}$ . Then, for $k>k_{0}$ , $1-2K_{0}\eta_{k}+\frac{K_{0}}{8}\eta_{k}\Delta_{k}>0$ . By Condition (a), $\Delta_{k}/\Delta_{k+1}=(1+\frac{1}{k})^{\beta}\leq 1+\frac{1}{k}\leq 1+\frac{K_{0}}{2}\eta_{k}$ , and by the induction assumption $b_{k}\leq C\Delta_{k}$ , for $k>k_{0}$ , we have

	$\displaystyle b_{k+1}$	$\displaystyle\leq\left(1-2K_{0}\eta_{k}+\frac{K_{0}}{8}\eta_{k}\Delta_{k}\right)\left(1+\frac{K_{0}\eta_{k}}{2}\right)C\Delta_{k+1}+\frac{K_{0}}{8}\eta_{k}\Delta_{k}+K_{3}\eta_{k}\Delta_{k}$
		$\displaystyle\leq C\Delta_{k+1}-\eta_{k}\Delta_{k}\left(\frac{3K_{0}C}{2}-\frac{K_{0}C}{8}\Delta_{k}-\frac{K_{0}^{2}C}{16}\eta_{k}\Delta_{k}-\frac{K_{0}}{8}-K_{3}\right)$

Then, we have $b_{k+1}\leq C\Delta_{k+1}$ as long as

\frac{3K_{0}C}{2}-\frac{K_{0}C}{8}\Delta_{k}-\frac{K_{0}^{2}C}{16}\eta_{k}\Delta_{k}-\frac{K_{0}}{8}-K_{3}\geq 0.

(33)

To check (33), note that, $\Delta_{k},K_{0}\leq 1$ and $C\geq 8K_{3}/K_{0}$ . Besides, $\eta_{k}K_{0}\leq 1/2<1$ for $k>k_{0}$ . Then, for $k\geq k_{0}$ ,

\frac{3K_{0}C}{2}-\frac{K_{0}C}{8}\Delta_{k}-\frac{K_{0}^{2}C}{16}\eta_{k}\Delta_{k}-\frac{K_{0}}{8}-K_{3}\geq\frac{3K_{0}C}{2}-\frac{K_{0}C}{8}-\frac{K_{0}C}{16}-\frac{K_{0}C}{8}-\frac{K_{0}C}{8}=\frac{17K_{0}C}{16}>0.

Let

C=\max\left(k_{0}^{\beta}(|\bar{\mu}-\underline{\mu}|^{2}+|\bar{p}-\underline{p}|^{2}),8K_{3}/K_{0}\right).

(34)

Then we have $\|x_{k}-x^{*}\|^{2}\leq C\Delta_{k}$ for all $1\leq k\leq k_{0}$ , and we can conclude by induction, for all $k\geq k_{0}$ ,

\mathbb{E}[\|x_{k}-x^{*}\|^{2}]\leq Ck^{-\beta}.
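The induction step can be illustrated by iterating the recursion for $b_{k}$ as an equality, starting from the largest value the induction hypothesis allows. The sketch below assumes $\beta=1$ , $\eta_{k}=c_{\eta}/k$ and hypothetical constants $K_{0}$ , $K_{3}$ , $c_{\eta}$ ; the term of $C$ involving the diameter of $\mathcal{B}$ is left out because the iteration starts at $k_{0}+1$ . It is a numerical illustration, not a substitute for the proof.

```python
def iterate_recursion(K0, K3, c_eta, C, k_start, k_end):
    """Iterate b_{k+1} = (1 - 2 K0 eta_k + K0/8 eta_k Delta_k) b_k
                         + (K0/8 + K3) eta_k Delta_k
    with eta_k = c_eta / k and Delta_k = 1 / k (beta = 1), starting from the
    worst case b_{k_start} = C * Delta_{k_start} allowed by the induction
    hypothesis. Returns a list of (k, b_k, C * Delta_k).
    """
    b = C / k_start
    rows = [(k_start, b, C / k_start)]
    for k in range(k_start, k_end):
        eta, delta = c_eta / k, 1.0 / k
        factor = 1.0 - 2.0 * K0 * eta + K0 / 8.0 * eta * delta
        b = factor * b + (K0 / 8.0 + K3) * eta * delta
        rows.append((k + 1, b, C / (k + 1)))
    return rows

# Hypothetical constants: K0 <= 1, c_eta >= 2 / K0, and C >= 8 K3 / K0.
K0, K3, c_eta = 0.5, 1.0, 4.0
# k0 = max{k : 2 eta_k K0 > 1}, found by direct search.
k0 = max(k for k in range(1, 1000) if 2.0 * (c_eta / k) * K0 > 1.0)
C = max(8.0 * K3 / K0, 1.0)
rows = iterate_recursion(K0, K3, c_eta, C, k_start=k0 + 1, k_end=5000)
```

For these constants every iterate stays below $C\Delta_{k}$ , in line with the conclusion $b_{k}\leq Ck^{-\beta}$ .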

By Assumption 3, there exists $\theta_{0}\in[0,1]$ such that

|f(x_{k})-f(x^{*})|=|\nabla f(\theta_{0}(x_{k}-x^{*})+x^{*})^{T}(x_{k}-x^{*})|\leq K_{1}\|x_{k}-x^{*}\|^{2}.

As a consequence,

R_{2}(L)\leq\sum_{k=1}^{L}\mathbb{E}[T_{k}]K_{1}Ck^{-\beta}.

Note that $T_{k}$ equals the arrival time of customer $D_{k}$ plus its waiting time. Therefore,

\mathbb{E}[T_{k}]\leq\mathbb{E}\left[\frac{D_{k}}{\lambda_{k}}\right]+\mathbb{E}[W_{D_{k}}^{k}]\leq\frac{D_{k}}{\lambda(\bar{p})}+M=O(D_{k}),

and we can conclude

R_{2}(L)=O\left(\sum_{k=1}^{L}D_{k}k^{-\beta}\right).

$\Box$

8.8 Proof of Theorem 3

(i) For each $k$ , note that $x_{k}\in\mathcal{F}_{k}$ . Define

	$\displaystyle h_{k}^{1}$	$\displaystyle=-\lambda(p_{k})-p_{k}\lambda^{\prime}(p_{k})+h_{0}\lambda^{\prime}(p_{k})\left[\frac{1}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(\mathbb{E}[X_{n}^{k}|\mathcal{F}_{k}]+\mathbb{E}[W_{n}^{k}|\mathcal{F}_{k}]\right)+\frac{1}{\mu}\right],$
	$\displaystyle h_{k}^{2}$	$\displaystyle=c^{\prime}(\mu_{k})-h_{0}\frac{\lambda(p_{k})}{\mu_{k}}\left[\frac{1}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(\mathbb{E}[X_{n}^{k}|\mathcal{F}_{k}]+\mathbb{E}[W_{n}^{k}|\mathcal{F}_{k}]\right)+\frac{1}{\mu}\right].$

Then,

\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}=\left|h_{k}^{1}-\frac{\partial}{\partial p}f(\mu_{k},p_{k})\right|^{2}+\left|h_{k}^{2}-\frac{\partial}{\partial\mu}f(\mu_{k},p_{k})\right|^{2}.

Following (18),

	$\displaystyle\left|h_{k}^{1}-\frac{\partial}{\partial p}f(\mu_{k},p_{k})\right|^{2}$	$\displaystyle\leq\frac{h_{0}^{2}\lambda^{\prime}(p_{k})^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[X_{n}^{k}-x_{k}|\mathcal{F}_{k}]|+|\mathbb{E}[W_{n}^{k}-w_{k}|\mathcal{F}_{k}]|\right)^{2},$
	$\displaystyle\left|h_{k}^{2}-\frac{\partial}{\partial\mu}f(\mu_{k},p_{k})\right|^{2}$	$\displaystyle\leq\frac{h_{0}^{2}\lambda(p_{k})^{2}}{\mu_{k}^{2}\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[X_{n}^{k}-x_{k}|\mathcal{F}_{k}]|+|\mathbb{E}[W_{n}^{k}-w_{k}|\mathcal{F}_{k}]|\right)^{2},$

where $w_{k}=\mathbb{E}[W_{\infty}(\mu_{k},p_{k})]$ and $x_{k}=\mathbb{E}[X_{\infty}(\mu_{k},p_{k})]$ . Note that $\lambda(p)$ , $\lambda^{\prime}(p)$ and $\mu$ are bounded. Let $C_{0}=\max_{(\mu,p)\in\mathcal{B}}\{h_{0}|\lambda^{\prime}(p)|,h_{0}\lambda(p)/\mu\}$ ; then

	$\displaystyle\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}$	$\displaystyle\leq\frac{2C_{0}^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[X_{n}^{k}-x_{k}|\mathcal{F}_{k}]|+|\mathbb{E}[W_{n}^{k}-w_{k}|\mathcal{F}_{k}]|\right)^{2}$
		$\displaystyle\leq\frac{4C_{0}^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[X_{n}^{k}-x_{k}|\mathcal{F}_{k}]|^{2}+|\mathbb{E}[W_{n}^{k}-w_{k}|\mathcal{F}_{k}]|^{2}\right)$
		$\displaystyle=\frac{4C_{0}^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[X_{n}^{k}-\bar{X}_{k}^{n}|\mathcal{F}_{k}]|^{2}+|\mathbb{E}[W_{n}^{k}-\bar{W}_{k}^{n}|\mathcal{F}_{k}]|^{2}\right),$

where the last equality follows from $\mathbb{E}[\bar{W}_{k}^{n}|\mathcal{F}_{k}]=w_{k}$ and $\mathbb{E}[\bar{X}_{k}^{n}|\mathcal{F}_{k}]=x_{k}$ , with $\bar{W}_{k}^{n}$ and $\bar{X}_{k}^{n}$ being stationary versions of the waiting times and observed busy periods that are synchronously coupled with $W_{k}^{n}$ and $X_{k}^{n}$ , respectively. Therefore, the bias satisfies

	$\displaystyle B_{k}^{2}$	$\displaystyle=\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]$
		$\displaystyle\leq\mathbb{E}\left[\frac{4C_{0}^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[X_{n}^{k}-\bar{X}_{k}^{n}|\mathcal{F}_{k}]|^{2}+|\mathbb{E}[W_{n}^{k}-\bar{W}_{k}^{n}|\mathcal{F}_{k}]|^{2}\right)\right]$
		$\displaystyle\leq\frac{4C_{0}^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(\mathbb{E}[(X_{n}^{k}-\bar{X}_{k}^{n})^{2}]+\mathbb{E}[(W_{n}^{k}-\bar{W}_{k}^{n})^{2}]\right).$

Following a similar argument as in the proof of Corollary 1, we have, for $n\geq\lceil 0.5\xi D_{k}\rceil$ ,

	$\displaystyle\mathbb{E}[(W_{n}^{k}-\bar{W}_{k}^{n})^{2}]$	$\displaystyle\leq\mathbb{E}[(W_{n}^{k}-\bar{W}_{k}^{n})^{2}\cdot 1(Q_{k}<0.5\xi D_{k})]+\mathbb{E}[(W_{n}^{k}-\bar{W}_{k}^{n})^{2}\cdot 1(Q_{k}\geq 0.5\xi D_{k})]$
		$\displaystyle\leq A\exp(-\gamma\cdot(n-0.5\xi D_{k}))+2M\exp(-\eta\cdot 0.25\xi D_{k})$
		$\displaystyle\leq(A+2M)\exp(-\min(\gamma,\eta)\cdot 0.25\xi D_{k}).$

For the observed busy period $X_{n}^{k}$ , following a similar analysis and Lemma 6, we have

		$\displaystyle\mathbb{E}[(X_{n}^{k}-\bar{X}_{n}^{k})^{2}]$
	$\displaystyle\leq\leavevmode\nobreak$	$\displaystyle K_{4}e^{-0.5\gamma\xi D_{k}}D_{k}^{2}+(2\mathbb{E}[(X_{n}^{k})^{4}]+2\mathbb{E}[(\bar{X}_{k}^{n})^{4}])^{1/2}\mathbb{P}(Q_{k}\geq 0.5\xi D_{k})^{1/2}$
	$\displaystyle\leq\leavevmode\nobreak$	$\displaystyle\exp(-\min(\gamma,\eta)\cdot 0.25\xi D_{k})(2M+K_{4}D_{k}^{2})\leq\exp(-\min(\gamma,\eta)\cdot 0.125\xi D_{k})(2M+K_{4}K_{5}),$

where

K_{5}=\max_{D>0}\exp(-\min(\gamma,\eta)\cdot 0.125\xi D)D^{2}=\left(\frac{16}{\min(\gamma,\eta)\cdot\xi}\right)^{2}e^{-2}.

(35)
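The closed form of $K_{5}$ follows from maximizing $e^{-aD}D^{2}$ at $D^{*}=2/a$ with $a=0.125\min(\gamma,\eta)\xi$ . The sketch below compares that closed form with a brute-force maximum over a grid; the values chosen for $\min(\gamma,\eta)$ and $\xi$ , and the function names, are hypothetical.

```python
import math

def k5_closed_form(rate, xi):
    """(16 / (rate * xi))^2 * e^{-2}, with rate standing for min(gamma, eta)."""
    return (16.0 / (rate * xi)) ** 2 * math.exp(-2.0)

def k5_grid_max(rate, xi, d_max=2000.0, step=0.01):
    """Brute-force maximum of exp(-0.125 * rate * xi * D) * D^2 over a grid."""
    best = 0.0
    n = int(d_max / step)
    for i in range(1, n + 1):
        d = i * step
        best = max(best, math.exp(-0.125 * rate * xi * d) * d * d)
    return best

rate, xi = 0.5, 0.5          # hypothetical values of min(gamma, eta) and xi
closed = k5_closed_form(rate, xi)
grid = k5_grid_max(rate, xi)
print(f"closed form: {closed:.6f}   grid maximum: {grid:.6f}")
```

The grid maximum never exceeds the closed form and approaches it as the grid is refined, consistent with (35).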

If we choose

D_{k}=a_{D}+b_{D}\log(k),\text{ for }a_{D}\geq\frac{C_{D}}{\min(\gamma,\eta)\xi}\text{ and }b_{D}\geq\frac{8}{\min(\gamma,\eta)\xi},

with

C_{D}=\max\left(8\log((16A+32M)C_{0}/K_{0}),16\log((32M+16K_{4}K_{5})C_{0}/K_{0})\right),

(36)

then

\mathbb{E}[(W_{n}^{k}-\bar{W}_{n}^{k})^{2}]\leq\frac{K_{0}^{2}}{256C_{0}^{2}k^{2}},\quad\mathbb{E}[(X_{n}^{k}-\bar{X}_{n}^{k})^{2}]\leq\frac{K_{0}^{2}}{256C_{0}^{2}k^{2}}.

As a consequence,

\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]\leq\frac{4C_{0}^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(\mathbb{E}[(X_{n}^{k}-\bar{X}_{k}^{n})^{2}]+\mathbb{E}[(W_{n}^{k}-\bar{W}_{k}^{n})^{2}]\right)\leq\frac{K_{0}^{2}}{64k^{2}}.

Therefore, we can conclude that

B_{k}=\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]^{1/2}\leq\frac{K_{0}}{8k}.
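How the logarithmic cycle length $D_{k}=a_{D}+b_{D}\log(k)$ drives the waiting-time term below $K_{0}^{2}/(256C_{0}^{2}k^{2})$ can be checked with a few lines of arithmetic. The sketch below covers only the waiting-time bound $(A+2M)\exp(-\min(\gamma,\eta)\cdot 0.25\xi D_{k})$ and uses only the first term of (36) for $C_{D}$ . All numerical constants are hypothetical, and the check additionally assumes $A+2M\geq 1$ .

```python
import math

def cycle_length(k, c_d, rate, xi):
    """D_k = a_D + b_D log(k) with the smallest admissible a_D and b_D."""
    a_d = c_d / (rate * xi)
    b_d = 8.0 / (rate * xi)
    return a_d + b_d * math.log(k)

def waiting_time_bound(k, A, M, c_d, rate, xi):
    """(A + 2M) exp(-rate * 0.25 * xi * D_k), bounding E[(W_n^k - Wbar_n^k)^2]."""
    d_k = cycle_length(k, c_d, rate, xi)
    return (A + 2.0 * M) * math.exp(-rate * 0.25 * xi * d_k)

# Hypothetical constants; A follows A = 4 sqrt(M) + 4 M from Corollary 1,
# and rate stands for min(gamma, eta).
M, K0, C0, rate, xi = 2.0, 0.5, 1.0, 0.5, 0.5
A = 4.0 * math.sqrt(M) + 4.0 * M
c_d = 8.0 * math.log((16.0 * A + 32.0 * M) * C0 / K0)   # first term of (36) only
targets = [(k,
            waiting_time_bound(k, A, M, c_d, rate, xi),
            K0 ** 2 / (256.0 * C0 ** 2 * k ** 2))
           for k in (1, 2, 10, 100, 1000)]
```

With $b_{D}=8/(\min(\gamma,\eta)\xi)$ the exponential contributes exactly a factor $k^{-2}$ , and $a_{D}$ absorbs the constants.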

On the other hand, as $\lambda(p)$ , $\lambda^{\prime}(p)$ and $\mu$ are bounded, $C_{1}\triangleq\max_{\mu,p\in\mathcal{B}}\{|\lambda(p)+p\lambda^{\prime}(p)|,|c^{\prime}(\mu)|\}<\infty$ . Recall that $C_{0}=\max_{(\mu,p)\in\mathcal{B}}\{h_{0}|\lambda^{\prime}(p)|,h_{0}\lambda(p)/\mu\}$ . Then,

\mathbb{E}[\|H_{k}\|^{2}]\leq 8(C_{1}+C_{0}/\underline{\mu})^{2}+8C_{0}^{2}\mathbb{E}\left[\frac{1}{\lceil(1-\xi)D_{k}\rceil^{2}}\left(\sum_{n>\xi D_{k}}^{D_{k}}\left(X_{n}^{k}+W_{n}^{k}\right)\right)^{2}\right].

By Lemma 1, we have

\displaystyle\mathbb{E}\left[\frac{1}{\lceil(1-\xi)D_{k}\rceil^{2}}\left(\sum_{n>\xi D_{k}}^{D_{k}}\left(X_{n}^{k}+W_{n}^{k}\right)\right)^{2}\right]\leq\leavevmode\nobreak\ \mathbb{E}\left[\frac{1}{\lceil(1-\xi)D_{k}\rceil^{2}}\left(\sum_{n>\xi D_{k}}^{D_{k}}\left(\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\hat{X}_{n}^{k}+\hat{W}_{n}^{k}\right)\right)^{2}\right],

where $\hat{W}_{n}^{k}$ and $\hat{X}_{n}^{k}$ are defined as in Lemma 1. Since, by definition, $\hat{W}_{n}^{k}$ and $\hat{X}^{k}_{n}$ are stationary, we have

		$\displaystyle\mathbb{E}\left[\frac{1}{\lceil(1-\xi)D_{k}\rceil^{2}}\left(\sum_{n>\xi D_{k}}^{D_{k}}\left(\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\hat{X}_{n}^{k}+\hat{W}_{n}^{k}\right)\right)^{2}\right]$
	$\displaystyle\leq$	$\displaystyle\frac{2}{\lceil(1-\xi)D_{k}\rceil^{2}}\mathbb{E}\left[\left(\sum_{n>\xi D_{k}}^{D_{k}}\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\hat{X}_{n}^{k}\right)^{2}\right]+\frac{2}{\lceil(1-\xi)D_{k}\rceil^{2}}\mathbb{E}\left[\left(\sum_{n>\xi D_{k}}^{D_{k}}\hat{W}_{n}^{k}\right)^{2}\right]$
	$\displaystyle\leq$	$\displaystyle 2(1-\xi)^{-2}\mathbb{E}\left[\left(\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\hat{X}_{0}^{k}\right)^{2}\right]+2(1-\xi)^{-2}\mathbb{E}[(\hat{W}_{0}^{k})^{2}]\leq 4(1-\xi)^{-2}M.$

Therefore, $\mathcal{V}_{k}$ is uniformly bounded. Given that $\eta_{k}=c_{\eta}k^{-1}$ , we have $\eta_{k}\mathcal{V}_{k}\leq\frac{K_{3}}{k}$ with

K_{3}=(8(C_{1}+C_{0}/\underline{\mu})^{2}+32C_{0}^{2}(1-\xi)^{-2}M)c_{\eta}.

(37)

(ii) According to the update rule, we immediately obtain

\mathbb{E}[\|x_{k}-x_{k+1}\|^{2}]\leq\eta^{2}_{k}\mathbb{E}[\|H_{k}\|^{2}]\leq 2k^{-2}K_{3}/K_{0}\equiv K_{2}k^{-2},\text{ with }K_{2}=2K_{3}/K_{0}.

(iii) We have just proved that the conditions of Theorem 1 are satisfied with $\alpha=1$ . Therefore, $R_{1}(L)\leq K\sum_{k=1}^{L}k^{-1}\log(k)\leq K\log(L)^{2}$ with the expression of $K$ given in Section 8.5. Moreover, the conditions of Theorem 2 are satisfied with $\beta=1$ and $D_{k}=O(\log(k))$ . In particular, $\Delta_{k}/\Delta_{k+1}=1+\frac{1}{k}\leq 1+\frac{K_{0}}{2}\eta_{k}$ given that $\eta_{k}=c_{\eta}k^{-1}$ with $c_{\eta}\geq 2/K_{0}$ . Therefore,

R_{2}(L)\leq CK_{1}\sum_{k=1}^{L}\left(\frac{D_{k}}{\lambda(\bar{p})}+M\right)k^{-1}=O(\log(L)^{2}).
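Both regret terms rest on the elementary estimate $\sum_{k=1}^{L}k^{-1}\log(k)\leq\log(L)^{2}$ . The sketch below evaluates the sum directly for several horizons and also reports the ratio of the $R_{2}$ -type sum $\sum_{k}(D_{k}/\lambda(\bar{p})+M)k^{-1}$ to $\log(L)^{2}$ for $D_{k}=a_{D}+b_{D}\log(k)$ . The constants passed to `regret_two_sum` are hypothetical and chosen only for illustration.

```python
import math

def harmonic_log_sum(L):
    """sum_{k=1}^{L} log(k) / k."""
    return sum(math.log(k) / k for k in range(1, L + 1))

def regret_two_sum(L, a_d, b_d, lam_min, M):
    """sum_{k=1}^{L} (D_k / lam_min + M) / k with D_k = a_d + b_d log(k)."""
    return sum(((a_d + b_d * math.log(k)) / lam_min + M) / k
               for k in range(1, L + 1))

horizons = (1, 2, 3, 10, 100, 1000, 10000)
pairs = [(L, harmonic_log_sum(L), math.log(L) ** 2) for L in horizons]

# Hypothetical constants for the R_2 sum; the ratio to log(L)^2 stays bounded.
ratios = [regret_two_sum(L, 10.0, 5.0, 1.0, 2.0) / math.log(L) ** 2
          for L in (10, 100, 1000, 10000)]
```

The bounded ratio reflects that the $\log(k)$ growth of $D_{k}$ contributes the $\log(L)^{2}$ order, while the constant part of $D_{k}$ contributes only $O(\log L)$ .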

As a consequence, the total regret

R(L)=R_{1}(L)+R_{2}(L)\leq K_{alg}\log(L)^{2}\leq K_{alg}\log(M_{L})^{2},\text{with }M_{L}=\sum_{k=1}^{L}D_{k}.

The last inequality uses $\log(L)^{2}\leq\log(M_{L})^{2}$ . Since $M_{L}=O(L\log(L))$ , the relaxation from $L$ to $M_{L}$ does not change the order of the regret bound. In addition, we can find a closed-form expression for $K_{alg}$ as

K_{alg}=K+CK_{1}\cdot\left(\frac{C_{D}+8}{\lambda(\bar{p})\min(\gamma,\eta)\xi}+M\right),

(38)

where $K$ is defined by (8.5), $C$ by (34) and $C_{D}$ by (36). $\Box$
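As a quick numerical sanity check of the relaxation from $L$ to $M_{L}$ used above, the following minimal sketch tabulates $M_{L}=\sum_{k=1}^{L}D_{k}$ and the series $\sum_{k=1}^{L}k^{-1}\log(k)$ for a few values of $L$. It assumes, purely for illustration, the cycle length $D_{k}=10+10\log(k)$ used in the numerical experiments, rounded up to an integer; the function names are hypothetical.

```python
import math

def cycle_length(k, a_d=10.0, b_d=10.0):
    """Cycle length D_k = a_D + b_D * log(k), rounded up to an integer (assumed rounding)."""
    return math.ceil(a_d + b_d * math.log(k))

def cumulative_customers(num_cycles, a_d=10.0, b_d=10.0):
    """M_L = sum_{k=1}^{L} D_k, the number of customers served by cycle L."""
    return sum(cycle_length(k, a_d, b_d) for k in range(1, num_cycles + 1))

def log_sum(num_cycles):
    """sum_{k=1}^{L} log(k) / k, the series that drives the bound on R_1(L)."""
    return sum(math.log(k) / k for k in range(1, num_cycles + 1))

for L in (10, 100, 1000):
    M_L = cumulative_customers(L)
    # Columns: L, M_L, the series, log(L)^2, log(M_L)^2
    print(L, M_L, round(log_sum(L), 3),
          round(math.log(L) ** 2, 3), round(math.log(M_L) ** 2, 3))
```

Since $D_{k}\geq 1$ implies $M_{L}\geq L$, the last two printed columns are ordered as in the displayed bound on $R(L)$.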

8.9 Details in the Proof of Lemma 5

We first give a rigorous proof of (19) in the derivation of the partial derivative $\frac{\partial}{\partial p}\mathbb{E}[W_{\infty}(p,\mu)]$ . To better explain the proof, we adopt the notation of Glasserman (1992). We take the derivative with respect to the parameter $\theta=r=1/\lambda(p)$ . With a slight abuse of notation, we redefine $W_{n}(\theta)=W_{n}(\mu,p)$ and $\tilde{U}_{n}(\theta)=\frac{V_{n}}{\mu}-\theta U_{n}$ so that $\tilde{U}_{n}^{\prime}(\theta)=-U_{n}$ . The Lindley recursion then becomes

W_{n+1}(\theta)=\phi(W_{n}(\theta),\tilde{U}_{n}(\theta)),\quad\text{with }\phi(w,u)=(w+u)^{+}.

Note that the function $\phi$ is increasing and convex in $w$ and $u$ . In addition, the derivative process is denoted by $V_{n}(\theta)=Z_{n}$ . Define $\psi_{w}(w,u)=\psi_{u}(w,u)=1(w+u>0),$ such that

V_{n+1}(\theta)=\psi_{w}(W_{n}(\theta),\tilde{U}_{n}(\theta))V_{n}(\theta)+\psi_{u}(W_{n}(\theta),\tilde{U}_{n}(\theta))\tilde{U}_{n}^{\prime}(\theta).

The stationary versions of the waiting time and derivative process are denoted as $\tilde{W}_{0}(\theta)$ and $\tilde{V}_{0}(\theta)$ . Then we can check Conditions (B1) to (B3) on page 377 of Glasserman (1992):

(B1)

For each $\theta\in[1/\lambda(\underline{p}),1/\lambda(\bar{p})]$ , the sequence

\{(\tilde{U}_{n}(\theta),\tilde{U}_{n}^{\prime}(\theta)),-\infty<n<\infty\leavevmode\nobreak\ \}=\Big{\{}\underbrace{\left(\frac{V_{n}}{\mu}-rU_{n},-U_{n}\right),-\infty<n<\infty}_{\text{in our notation}}\Big{\}}

is stationary and ergodic, as we can extend the i.i.d. sequences $V_{n}$ and $U_{n}$ to $-\infty<n\leq 0$ .

(B2)

For each $\theta\in[1/\lambda(\underline{p}),1/\lambda(\bar{p})]$ , the Lindley recursion has a stationary solution $\tilde{W}_{0}(\theta)$ , which is guaranteed by Assumption 1. Besides, following Lemma 2, for any initial state $W_{0}(\theta)$ , the transient process $W_{n}(\theta)$ will converge to the stationary version in finite time almost surely.

(B3)

For all $\theta\in[1/\lambda(\underline{p}),1/\lambda(\bar{p})]$ ,

\mathbb{P}(\psi_{w}(\tilde{W}_{0}(\theta),\tilde{U}_{0}(\theta))=0)=\mathbb{P}\underbrace{\left(\left(W_{\infty}(\mu,p)+\frac{V_{0}}{\mu}-rU_{0}\right)^{+}=0\right)}_{\text{(in our notation)}}=\mathbb{P}(W_{\infty}(\mu,p)=0)>0.

According to the discussion on p.379 of Glasserman (1992), Condition (B3) holds for $GI/GI/1$ queues under the usual stability condition that $\mu>\lambda(p)$ . Below, we give a detailed verification of this condition under our model setting.
Recall that $\tilde{U}_{0}(r)=\frac{V_{0}}{\mu}-rU_{0}$ and by Assumption 1, $\mathbb{E}[\tilde{U}_{0}(r)]<0,\leavevmode\nobreak\ \forall r\in[1/\lambda(\underline{p}),1/\lambda(\bar{p})]$ . So there exists a constant $b>0$ , such that $\mathbb{P}(\tilde{U}_{0}(r)<-b)>0$ for all $r\in[1/\lambda(\underline{p}),1/\lambda(\bar{p})].$ Let $S$ denote the support of $W_{\infty}(\mu,p)$ and let $A=\inf S\geq 0$ . We first show by contradiction that $A=0$ . Since $A$ is the infimum of the support,

\displaystyle\mathbb{P}\left(W_{\infty}(\mu,p)\in[A,A+\varepsilon)\right)>0,\text{ for any }\varepsilon>0.

Besides, if $A>0$ ,

	$\displaystyle\mathbb{P}(W_{\infty}(\mu,p)\geq A)$	$\displaystyle=\mathbb{P}\left(\left(W_{\infty}(\mu,p)+\tilde{U}_{0}(r)\right)^{+}\geq A\right)$
		$\displaystyle=\mathbb{P}\left(W_{\infty}(\mu,p)+\tilde{U}_{0}(r)\geq A\right)=1,$

On the other hand, we have

\displaystyle\mathbb{P}\left(W_{\infty}(\mu,p)+\tilde{U}_{0}(r)<A\right)\geq\mathbb{P}\left(W_{\infty}(\mu,p)\in\Big{[}A,A+\frac{b}{2}\Big{)},\leavevmode\nobreak\ \tilde{U}_{0}(r)<-b\right)>0,

where the last inequality follows from the fact that $W_{\infty}(\mu,p)$ and $\tilde{U}_{0}(r)$ are independent in the $GI/GI/1$ queue. This is a contradiction, so we can conclude that $A=0$ . Next, we show that $\mathbb{P}(W_{\infty}(\mu,p)=0)>0$ . Following a similar derivation, we can conclude

\displaystyle\mathbb{P}(W_{\infty}(\mu,p)=0)

\displaystyle=\mathbb{P}\left(\left(W_{\infty}(\mu,p)+\tilde{U}_{0}(r)\right)^{+}=0\right)\geq\mathbb{P}\left(W_{\infty}(\mu,p)\in\Big{[}0,\frac{b}{2}\Big{)},\leavevmode\nobreak\ \tilde{U}_{0}(r)<-b\right)>0.

In addition, we have $\mathbb{E}[\tilde{W}_{0}(\theta)]\leq M$ and $\mathbb{E}[\tilde{V}_{0}(\theta)]=\mathbb{E}[\tilde{Z}_{\infty}]=\mathbb{E}[X_{\infty}(\mu,p)]\leq M$ following Lemma 1. As a consequence, we can prove (19) using the following Corollary 5.3 in Glasserman (1992):

Lemma 7 (Corollary 5.3 in Glasserman (1992)).

Suppose that $\phi$ is increasing and (jointly) convex, and that $W_{0}$ and $U_{0}$ are almost surely convex. Suppose (B1)-(B3) hold, $\mathbb{E}[\tilde{W}_{0}(\theta)],\mathbb{E}[\tilde{V}_{0}(\theta)]<\infty$ for all $\theta$ in its range. Then, $\mathbb{E}[\tilde{V}_{0}(\theta)]=\mathbb{E}[\tilde{W}_{0}(\theta)]^{\prime}$ and

\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}V_{i}(\theta)=\mathbb{E}[\tilde{W}(\theta)]^{\prime},a.s.

almost everywhere in the range of $\theta$ .

The derivation of $\frac{\partial}{\partial\mu}\mathbb{E}[W_{\infty}(p,\mu)]$ follows a similar argument with $\tilde{U}(\theta)\equiv V_{n}-\theta U_{n}/\lambda(p)$ .
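To make the recursion for $W_{n}(\theta)$ and its derivative process concrete, the following minimal sketch simulates both for an $M/M/1$ queue. It is an illustration under stated assumptions rather than the implementation used in the paper: the seeds $U_{n}$ and $V_{n}$ are taken to be unit-mean exponentials, the function name `simulate_ipa` is hypothetical, and the derivative process is named `z` in the code to avoid a clash with the service seed `v`.

```python
import random

def simulate_ipa(mu, lam, num_customers, seed=0):
    """Run the Lindley recursion and its derivative process for an M/M/1 queue.

    Returns the time averages of W_n and Z_n = dW_n/dtheta (theta = 1/lam),
    together with the fraction of customers who find W_n = 0.
    """
    rng = random.Random(seed)
    theta = 1.0 / lam
    w, z = 0.0, 0.0                 # start from an empty system
    sum_w = sum_z = 0.0
    zero_waits = 0
    for _ in range(num_customers):
        u = rng.expovariate(1.0)    # U_n: unit-mean interarrival seed
        v = rng.expovariate(1.0)    # V_n: unit-mean service seed
        step = v / mu - theta * u   # tilde U_n(theta); its derivative is -u
        if w + step > 0.0:
            # Indicator 1(w + u > 0) equals 1: both processes accumulate.
            w, z = w + step, z - u
        else:
            # Indicator equals 0: the waiting time and its derivative reset.
            w, z = 0.0, 0.0
        sum_w += w
        sum_z += z
        zero_waits += (w == 0.0)
    n = float(num_customers)
    return sum_w / n, sum_z / n, zero_waits / n

mean_w, mean_z, frac_zero = simulate_ipa(mu=2.0, lam=1.0, num_customers=200_000)
print(mean_w, mean_z, frac_zero)
```

For the $M/M/1$ queue with $\mu=2$ and $\lambda=1$, the closed forms are $\mathbb{E}[W_{\infty}]=\lambda/(\mu(\mu-\lambda))=0.5$, $\frac{d}{d\theta}\mathbb{E}[W_{\infty}]=-1/(\mu\theta-1)^{2}=-1$ and $\mathbb{P}(W_{\infty}=0)=1-\rho=0.5$, so the three printed averages should be close to these values up to simulation noise. The positive fraction of zero waits is what Condition (B3) asks for.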

9 Relaxing Theoretical Bounds of Hyperparameters

In this section, we conduct numerical experiments to investigate the robustness of GOLiQ’s performance to the two main hyperparameters: (i) cycle length $D_{k}$ , and (ii) step size $\eta_{k}$ . We follow two steps:

• First, we calculate the theoretical bounds of these hyperparameters according to (20) and (21).

• Next, we test the algorithm’s performance while varying these hyperparameters; we especially consider values that violate their corresponding theoretical bounds.

9.1 Theoretical bounds for $\eta_{k}$ and $D_{k}$

We follow Section 6.2 by considering the $M/M/1$ example having the objective function in (26) and demand function in (24), with $a=4.1$ , size $n=10$ and $c_{0}=0.1$ . In order to obtain the theoretical bounds for the hyperparameters, we set the region $\mathcal{B}=[6.7,10]\times[3.7,5]$ so that $f(\mu,p)$ is strongly convex on $\mathcal{B}$ .

Theoretical bound for $\eta_{k}$ .

According to the conditions in Assumption 3, we note that the Hessian matrix of the objective $f(\mu,p)$ has a smallest eigenvalue $0.1231$ in the specific region $\mathcal{B}$ , which implies that $K_{0}=0.1231$ (and the strong convexity of the objective function on $\mathcal{B}$ ). Hence, following from (20), the theoretical lower bound for $\eta_{k}$ is $c_{\eta}\geq\tilde{c}_{\eta}=2/K_{0}=16.24$ .

Theoretical bound for $D_{k}$ .

To calculate the lower bounds of $a_{D}$ and $b_{D}$ specified in (21), we first estimate $C$ and $(\gamma,\eta)$ . We set $\xi=1$ . First, according to the expression (36) and $K_{0}=0.1231$ , we see that $C\geq 8$ . Next, following (3), we select $\min(\gamma,\eta)=0.011$ which gives the smallest theoretical lower bound.

Hence, (21) requires that $a_{D}\geq\tilde{a}_{D}=8/0.011=727$ and $b_{D}\geq\tilde{b}_{D}=8/0.011=727$ , which leads to a bound for the cycle length $D_{k}\geq 727+727\log(k)$ .
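The arithmetic behind these lower bounds can be reproduced with the short sketch below. It only restates the numbers quoted in the text ( $K_{0}=0.1231$ , the constant $8$ and $\min(\gamma,\eta)=0.011$ ); the variable and function names are illustrative assumptions, and the exact form of (20) and (21) is not re-derived here.

```python
import math

K0 = 0.1231            # smallest Hessian eigenvalue on the region B (from the text)
NUMERATOR = 8.0        # constant 8 appearing in the quoted bound for a_D and b_D
MIN_GAMMA_ETA = 0.011  # chosen value of min(gamma, eta)

c_eta_bound = 2.0 / K0                 # lower bound for c_eta, about 16.2
a_d_bound = NUMERATOR / MIN_GAMMA_ETA  # lower bound for a_D, about 727
b_d_bound = NUMERATOR / MIN_GAMMA_ETA  # lower bound for b_D, about 727

def cycle_length_bound(k):
    """Theoretical lower bound on the cycle length D_k for cycle k >= 1."""
    return a_d_bound + b_d_bound * math.log(k)

print(round(c_eta_bound, 2), round(a_d_bound, 1), round(cycle_length_bound(10), 1))
```

Even for the first cycle the bound already asks for roughly 727 customers, which is why the experiments in the next subsection probe much smaller values.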

9.2 Robustness to the Theoretical Bounds

Recall that the theoretical bounds in (20) and (21) require that $a_{D}\geq\tilde{a}_{D}$ , $b_{D}\geq\tilde{b}_{D}$ and $c_{\eta}\geq\tilde{c}_{\eta}$ . We hereby test the criticality of these lower bounds $\tilde{a}_{D}$ , $\tilde{b}_{D}$ and $\tilde{c}_{\eta}$ by implementing GOLiQ with $(a_{D},b_{D},c_{\eta})<(\tilde{a}_{D},\tilde{b}_{D},\tilde{c}_{\eta})$ . Specifically, in our first experiment, we consider $c_{\eta}=\{2,1,0.5,0.1\}\tilde{c}_{\eta}$ for the step-size $\eta_{k}$ , with $D_{k}=10+10\log(k)$ (see left-hand panels of Figure 8); in our second experiment, we consider $(a_{D},b_{D})=\{0.028,0.021,0.014,0.007,0.0014\}(\tilde{a}_{D},\tilde{b}_{D})$ for the sample-size $D_{k}$ , with $\eta_{k}=3/k$ (see right-hand panels of Figure 8). In both experiments, we plot the average regret curves estimated by 500 independent runs.
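To see what these scaling factors mean in absolute terms, the sketch below converts them into the constants actually passed to the algorithm, assuming $\tilde{c}_{\eta}=2/0.1231$ and $\tilde{a}_{D}=\tilde{b}_{D}=8/0.011$ as computed in Section 9.1. The list names are illustrative.

```python
# Theoretical lower bounds from Section 9.1 (recomputed here so the block is self-contained).
c_eta_tilde = 2.0 / 0.1231
a_d_tilde = 8.0 / 0.011

step_scales = [2, 1, 0.5, 0.1]                      # multiples of c_eta_tilde
cycle_scales = [0.028, 0.021, 0.014, 0.007, 0.0014]  # multiples of (a_D_tilde, b_D_tilde)

step_constants = [s * c_eta_tilde for s in step_scales]
cycle_constants = [s * a_d_tilde for s in cycle_scales]

for scale, value in zip(step_scales, step_constants):
    print(f"c_eta = {scale} * bound = {value:.2f}")
for scale, value in zip(cycle_scales, cycle_constants):
    print(f"a_D = b_D = {scale} * bound = {value:.2f}")
```

In particular, the factor $0.014$ corresponds to $a_{D}=b_{D}\approx 10$ , that is, to a cycle length of roughly $10+10\log(k)$ .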

Figure 8 reveals that GOLiQ continues to perform effectively even when the hyperparameters are chosen to be much smaller than their corresponding theoretical lower bounds. For $\eta_{k}$ , our algorithm generates a logarithmic regret even when $c_{\eta}=0.1\tilde{c}_{\eta}$ . (However, we find that GOLiQ fails to converge and yields a linear regret if we keep reducing $c_{\eta}$ , e.g., to $0.01\tilde{c}_{\eta}$ .) For $D_{k}$ , all regret curves exhibit a logarithmic order (even when $(a_{D},b_{D})=0.0014(\tilde{a}_{D},\tilde{b}_{D})$ ). In summary, our numerical experiments show that the theoretical bounds for our hyperparameters do not seem to be too restrictive. In addition, the experiment in Section 6.3 serves as another piece of evidence supporting the robustness of GOLiQ. In Section 6.3, we apply GOLiQ with the same hyperparameters $\eta_{k}=5k^{-1}$ and $D_{k}=10+10\log(k)$ for different settings with various $c$ , $c_{s}^{2}$ and $n$ (see Figure 7), and GOLiQ exhibits stable performance with similar logarithmic regrets. Of course, we acknowledge that the specific selection of these hyperparameters in a practical setting will require further tuning in order to make the most efficient use of GOLiQ.

Remark 12 (Requirement of information: online learning vs. heavy traffic).

We provide our view on how online learning relies on the system information, and we treat heavy-traffic methods as a benchmark. First, online learning in general requires less prior information of the distributions than heavy-traffic methods do. For example, to solve the problem in the present study, the diffusion limit in Lee and Ward (2014) requires the knowledge of the exact values of the second moments of arrival and service times. On the other hand, even though the efficiency of GOLiQ is subject to constraints in terms of certain model parameters, the bounds of these constraints may be relaxed without needing to sacrifice much of the algorithm’s performance. Second, the required information (e.g., moments) serves as crucial input parameters for the heavy-traffic models, whereas the design and implementation of online learning algorithms do not immediately require the aforementioned information (even though it is still relevant to the tuning of hyperparameters). All that we require is that the constants in (20) and (21) are not too small. So as long as we follow the structure specified in (20)–(21), it will not be too difficult to find reasonably sound hyperparameters (e.g., by a trial-and-error search) even without precise information of parameters $\eta$ and $\gamma$ as in Assumption 2. However, trial-and-error will be ineffective for heavy-traffic methods because precise information is needed (e.g., $\sigma^{2}$ ). In this sense, online learning depends on the system information to a lesser extent.

10 Comparison With Online Learning Algorithm in Huh et al. (2009)

In order to highlight the novelty in the regret analysis of this work, we wish to provide a comparison to other existing online learning algorithms developed for queueing systems. Unfortunately, there exists no previous algorithm that aims to solve a similar (not to mention the same) problem as in this paper.

Huh et al. (2009) develop an online learning algorithm with the objective of finding the optimal base-stock policy for an inventory system with a non-zero replenishment lead time. At a glance, Huh et al. (2009) does not seem to be relevant to the present paper at all. Indeed, results in Huh et al. (2009) are by no means directly comparable to GOLiQ, because the two articles consider two different systems. Nevertheless, the fundamental idea in the regret analysis by Huh et al. (2009) may be used as a basis to devise a queueing-version algorithm. To understand why this is possible, we first discuss the similarities of our method and that in Huh et al. (2009); we next explain their major distinctions.

Similarities.

First, Huh et al. (2009) analyzes the transient regret bound of an inventory system operated under a stationary base-stock policy, of which the main framework is analogous to that in the present work. Second, the heart of the online learning algorithm in Huh et al. (2009) is an SGD method. Last, the regret in Huh et al. (2009) is also defined using the steady-state performance as the benchmark.

Distinctions.

Nevertheless, we stress that GOLiQ is not a quick extension of the regret analysis in Huh et al. (2009). First, a queueing model, by its very nature, has completely different dynamics, problem structure, and research questions from inventory systems. For example, the state space of the queueing model here is unbounded, while the inventory system in Huh et al. (2009) is bounded. (The unbounded state space has made the analysis on transient error and gradient variance more challenging.) Next, our analysis of the “regret of nonstationarity” is a novelty; when establishing our regret bound, we examine more delicately the transient error at the beginning of the cycles, so as to render a smaller regret bound $O(\log(L)^{2})$ than the linear bound $O(L)$ as in Huh et al. (2009). Also see Section 4.1 for additional discussions. In detail, different from the analysis in Huh et al. (2009), we further separate the transient error in each cycle into two parts, i.e., the ‘warm-up’ part and ‘near-stationary’ part, and deal with them using different coupling techniques: coupling from the previous cycle and coupling from infinite past for the ‘warm-up’ part and synchronous coupling in the same cycle for the ‘near-stationary’ part. Last, our novel theoretical analysis yields different “optimal” structure for the hyperparameters $\eta_{k}=O(k^{-1})$ and $D_{k}=O(\log{(k)})$ .

According to their regret analysis, Huh et al. (2009) propose to choose the hyperparameters $\eta_{k}=O(k^{-1/2})$ and $D_{k}=O(\sqrt{k})$ which yield a regret bound in the order $O(M_{L}^{2/3})$ . However, we point out that the objective function in Huh et al. (2009) is convex while GOLiQ in the present paper is designed assuming a strongly convex objective function (Assumption 3). Therefore, to make a fair comparison between GOLiQ and the online learning algorithm proposed in Huh et al. (2009), we need to redo the regret analysis in Huh et al. (2009) under the strong convexity. This change, as we will show below, will yield a different set of hyperparameters.

Suppose we select $D_{k}=O(k^{\alpha})$ and $\eta_{k}=O(k^{-\beta})$ . Then, following Lemma 11 of Huh et al. (2009), $R_{1}(L)=O(L)$ (compared to $R_{1}(L)=o(L)$ in our analysis). Given that the objective function is strongly convex, Theorem 2 yields that $R_{2}(L)=O\left(\sum_{k=1}^{L}k^{\alpha}\cdot k^{-\beta}\right)=O\left(L^{\alpha-\beta+1}\right)$ for $\beta\in(0,1]$ . As a result, the overall regret is

R(L)=R_{1}(L)+R_{2}(L)=O(L^{(\alpha-\beta+1)\vee 1}),\quad\text{and}\quad R(M_{L})=O\left(M_{L}^{\frac{(\alpha-\beta+1)\vee 1}{\alpha+1}}\right),

with $M_{L}=O(L^{\alpha+1})$ . Consequently, the order of $R(M_{L})$ is minimized by setting

\displaystyle\eta_{k}=O(k^{-1}),\quad\text{and}\quad D_{k}=O(k),

(39)

which yields an improved regret bound $O(M_{L}^{\frac{1}{2}})$ (as opposed to the previous regret $O(M_{L}^{\frac{2}{3}})$ under regular convexity).
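The choice in (39) can be checked by a brute-force search over the exponent $\frac{(\alpha-\beta+1)\vee 1}{\alpha+1}$ . The sketch below is only a grid-based sanity check on an assumed grid ( $\alpha\in\{0,1/4,\dots,3\}$ and $\beta\in\{1/4,\dots,1\}$ ), not a proof; exact rational arithmetic is used so that ties are not obscured by rounding.

```python
from fractions import Fraction

def regret_exponent(alpha, beta):
    """Exponent of M_L in R(M_L) = O(M_L^{((alpha - beta + 1) v 1) / (alpha + 1)})."""
    return max(alpha - beta + 1, Fraction(1)) / (alpha + 1)

alphas = [Fraction(i, 4) for i in range(0, 13)]  # D_k = O(k^alpha), alpha in [0, 3]
betas = [Fraction(i, 4) for i in range(1, 5)]    # eta_k = O(k^-beta), beta in (0, 1]

# Tuples are compared by exponent first, so min() returns the best exponent
# together with the (alpha, beta) pair that attains it.
best = min((regret_exponent(a, b), a, b) for a in alphas for b in betas)
print(best)
```

On this grid the minimum exponent is $1/2$ , attained at $\alpha=\beta=1$ , in agreement with (39).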

We refer to Algorithm 1 with $\eta_{k}$ and $D_{k}$ selected according to (39) as GOLiQ-H. To compare GOLiQ with GOLiQ-H, we follow the setting in Section 6.2 by considering an $M/M/1$ queue, with $c(\mu)=0.1\mu^{2}$ and $\lambda(p)=10\lambda_{0}(p).$

In Figure 9, we plot the average regret curves (estimated by averaging 500 independent paths) for both GOLiQ and GOLiQ-H. The hyperparameters are $\eta_{k}=2k^{-1}$ and $D_{k}=10+10\log(k)$ for GOLiQ, and $\eta_{k}=\{2,4\}k^{-1}$ and $D_{k}=10+k$ for GOLiQ-H. Unsurprisingly, Figure 9 confirms that GOLiQ is more effective than GOLiQ-H. This is consistent with our theoretical analysis because GOLiQ yields a logarithmic regret while GOLiQ-H has a regret bound of $O(M_{L}^{\frac{1}{2}})$ .

11 Additional Numerical Examples

In this section we conduct additional numerical experiments to confirm the practical effectiveness of our algorithm. In what follows, we first test the case where the uniform stability condition is relaxed; we next report the algorithm performance for $GI/GI/1$ queueing models with phase-type and lognormal distributions.

11.1 Violation of Uniform Stability

We extend the $M/M/1$ example considered in Section 6.2 with the uniform stability condition relaxed. Specifically, we begin with an initial setting of $(p_{0},\mu_{0})$ such that $\rho_{0}\equiv\lambda(p_{0})/\mu_{0}=2.55$ , which violates the stability condition. As shown in Figure 10, the pricing and staffing policies $(p_{k},\mu_{k})$ still converge to $(p^{*},\mu^{*})$ . Consistently, the resulting traffic intensity $\rho_{k}\equiv\lambda(p_{k})/\mu_{k}$ is quickly controlled to fall below 1; that is, the workload is kept in check despite the unstable performance in the initial cycle.

11.2 $M/G/1$ with Phase-Type Service

To test the performance of our online learning algorithm for queues with non-exponential service times, we consider phase-type distributions: hyperexponential with $n$ phases ( $H_{n}$ ) and Erlang with $n$ phases ( $E_{n}$ ). In Figure 11 we report the convergent sequence $(p_{k},\mu_{k})$ with $H_{2}$ service with service-time SCV $c_{s}^{2}=8$ (top panel), $M$ service with $c_{s}^{2}=1$ (middle panel), and $E_{8}$ service with $c_{s}^{2}=1/8$ (bottom panel). Other parameters include the step size $\eta_{k}=4/k$ , cycle length $D_{k}=20+10\log(k)$ and initial condition $p_{0}=4$ and $\mu_{0}=12$ ( $\lambda_{0}=5.249$ ).

Figure 11 confirms that our algorithm remains effective. In addition, the convergence is faster as the SCV $c_{s}^{2}$ decreases. This is intuitive because a less variable service-time distribution yields a smaller $\mathcal{V}_{k}$ for the gradient estimator.

11.3 Lognormal Service and Arrival

Next, we consider an $LN/LN/1$ queue with service and interarrival times following lognormal (LN) distributions. Our consideration here follows from the recent empirical confirmations of LN distributed service times in real service systems.

We let $c_{s}^{2}=c_{a}^{2}=2$ with $c_{a}^{2}$ being the SCV of the $LN$ -distributed interarrival times. The other parameters remain the same as in Section 11.2. Because the exact optimal solutions $(p^{*},\mu^{*})$ are unavailable for this model, we are unable to provide an estimate of the regret as done in Figure 6, nor can we confirm the convex structure of the problem. Nevertheless, Figure 12 shows that our online algorithm continues to work well, despite the fact that LN is no longer a light-tail distribution (Assumption 2 does not hold in this case).

11.4 Extended Comparison of GOLiQ and Heavy-traffic Methods

Supplementing our investigations in Section 6.3, we provide additional numerical results. Recall that the heavy-traffic results in Lee and Ward (2014) are obtained by constructing a sequence of $GI/GI/1$ models indexed by $n$ , where the $n^{\rm th}$ model has scaled arrival rate $\lambda_{n}(p)=n\lambda(p)$ and service rate $\mu_{n}=n\mu$ , so that both $\lambda_{n}$ and $\mu_{n}$ grow to $\infty$ as $n$ increases. Lee and Ward (2014) develop asymptotic staffing and pricing solutions for the $GI/GI/1$ queue; they show that, as the scaling factor $n\rightarrow\infty$ , the optimal price $p^{*}_{n}\rightarrow p_{\infty}$ and service capacity $\mu^{*}_{n}/n\rightarrow\mu_{\infty}$ , with $\rho_{\infty}\equiv\lambda(p_{\infty})/\mu_{\infty}=1$ .

We repeat our experiment in Section 6.2 with the scaling parameter $n\in\{10,50,100,500,1000,2000\}$ for the arrival rate function (24). But we now focus on the optimal traffic intensity as $n$ varies. In Figure 13 we plot the optimal price and service rate as $n$ increases. In each experiment, we compute the optimal $p_{n}$ and $\mu_{n}$ using their average value in cycles 300–500 of Algorithm 1. Consistent with Lee and Ward (2014), Figure 13 shows that $p_{n}$ , $\mu_{n}/n$ and $\rho_{n}$ approach $p_{\infty}$ , $\mu_{\infty}$ and $\rho_{\infty}=1$ . On the other hand, when the scale $n$ is not very large, the heavy-traffic solutions can become inaccurate. For instance, when $n=100$ the optimal traffic $\rho_{100}$ is around 0.8, which is not close to 1.

11.5 Alternative Definition of Regret

In this subsection, we attempt to rationalize our regret definition in (9). We consider a potential alternative to (9) which benchmarks the system revenue under GOLiQ with the nonstationary revenue under $(\mu^{*},p^{*})$ . Because the nonstationary queue length is intractable, we conduct additional numerical experiments to estimate the expected nonstationary regret via Monte-Carlo simulations.

Specifically, we simulate the regret in (10) under $(p^{*},\mu^{*})$ with the queueing system starting empty (of which the dynamics is nonstationary). We use the $M/M/1$ model in Section 6 having a logit demand function (24) with $n=10$ and a quadratic staffing cost in (25) with $c=0.1$ .

In Figure 14 we graph both versions of the regret under GOLiQ in the same experimental setting, with hyperparameters $\eta_{k}=3k^{-1}$ and $D_{k}=10+10\log(k)$ . Figure 14 confirms that these two versions of regret are nearly indistinguishable. This is due to the geometric ergodicity of the $G/G/1$ queue.

12 Additional discussion on Assumption 3

In this section, we provide some sufficient conditions for strong convexity in the $M/GI/1$ case.

Lemma 8.

For $M/GI/1$ queues, if $c(\mu)$ is convex, $\lambda(\underline{p})/\underline{\mu}<1$ , and in addition,

\frac{\lambda^{\prime}(p)^{2}}{2\lambda(p)}<\lambda^{\prime\prime}(p)<\frac{-2\lambda^{\prime}(p)}{p},\quad\text{and}\quad\frac{\lambda(\bar{p})}{\bar{\mu}}>1-1/\sqrt{2}\approx 0.29,

(40)

then $f(\mu,p)$ is strongly convex in $\mathcal{B}$ .

Proof of Lemma 8.

Recall that $f(\mu,p)=-p\lambda(p)+h_{0}\mathbb{E}[Q_{\infty}(\mu,p)]+c(\mu)$ , and $(-p\lambda(p))^{\prime\prime}=-p\lambda^{\prime\prime}-2\lambda^{\prime}$ . Under condition (40), we have $-p\lambda^{\prime\prime}-2\lambda^{\prime}>0$ . Therefore, both $-p\lambda(p)$ and $c(\mu)$ are convex, and it suffices to show that $\mathbb{E}[Q_{\infty}(\mu,p)]$ is strongly convex in $\mu$ and $p$ . For $M/GI/1$ queues, the Pollaczek-Khinchine formula yields

q(\mu,p)\equiv\mathbb{E}[Q_{\infty}(\mu,p)]=C\frac{\lambda(p)}{\mu-\lambda(p)}+(1-C)\frac{\lambda(p)}{\mu},

with $C\equiv\frac{1+c_{s}^{2}}{2}$ . For any given pair of $(\mu,p)$ , let $H_{q}$ be the Hessian matrix of $q(\mu,p)$ . We next verify that $H_{q}$ is positive definite. By direct calculation, we have

	$\displaystyle\partial_{p}^{2}q$	$\displaystyle=C\frac{\mu}{(\mu-\lambda)^{3}}\left(2(\lambda^{\prime})^{2}+(\mu-\lambda)\lambda^{\prime\prime}\right)+(1-C)\frac{\lambda^{\prime\prime}}{\mu},$
	$\displaystyle\partial_{\mu}^{2}q$	$\displaystyle=C\frac{2\lambda}{(\mu-\lambda)^{3}}+(1-C)\frac{2\lambda}{\mu^{3}},\quad\partial_{p}\partial_{\mu}q=-C\frac{\lambda+\mu}{(\mu-\lambda)^{3}}\lambda^{\prime}-(1-C)\frac{\lambda^{\prime}}{\mu^{2}},$

with $\lambda^{\prime},\lambda^{\prime\prime}$ being the first and second order derivatives of $\lambda(p)$ . As a result, the determinant of the Hessian matrix $H_{q}$ is

	$\displaystyle|H_{q}|=$	$\displaystyle\frac{C^{2}}{(\mu-\lambda)^{5}}\left(2\mu\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)+\frac{(1-C)^{2}}{\mu^{4}}\left(2\lambda\lambda^{\prime\prime}-(\lambda^{\prime})^{2}\right)$
		$\displaystyle\qquad+\frac{2C(1-C)}{\mu^{2}(\mu-\lambda)^{3}}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right).$

To show that $H_{q}$ is positive definite, it suffices to show that $\partial_{\mu}^{2}q$ , $\partial_{p}^{2}q$ and $|H_{q}|$ are all positive. First, it is clear that

\partial_{\mu}^{2}q=2\lambda C\left(\frac{1}{(\mu-\lambda)^{3}}-\frac{1}{\mu^{3}}\right)+\frac{2\lambda}{\mu^{3}}>0.

Next, we compute

	$\displaystyle\partial_{p}^{2}q$	$\displaystyle=C\frac{\mu}{(\mu-\lambda)^{3}}\left(2(\lambda^{\prime})^{2}+(\mu-\lambda)\lambda^{\prime\prime}\right)+(1-C)\frac{\lambda^{\prime\prime}}{\mu}$
		$\displaystyle=\frac{2C\mu}{(\mu-\lambda)^{3}}(\lambda^{\prime})^{2}+\frac{C\mu}{(\mu-\lambda)^{2}}\lambda^{\prime\prime}+(1-C)\frac{\lambda^{\prime\prime}}{\mu}$
		$\displaystyle\stackrel{{\scriptstyle(a)}}{{>}}\frac{C\mu}{(\mu-\lambda)^{2}}\lambda^{\prime\prime}+(1-C)\frac{\lambda^{\prime\prime}}{\mu}$
		$\displaystyle=C\lambda^{\prime\prime}\Big{(}\underbrace{\frac{\mu}{\mu-\lambda}}_{>1}\cdot\frac{1}{\mu-\lambda}-\frac{1}{\mu}\Big{)}+\frac{\lambda^{\prime\prime}}{\mu}\stackrel{{\scriptstyle(b)}}{{>}}C\lambda^{\prime\prime}\left(\frac{1}{\mu-\lambda}-\frac{1}{\mu}\right)+\frac{\lambda^{\prime\prime}}{\mu}>0.$

Here, inequality (a) follows from the fact that $\frac{2C\mu}{(\mu-\lambda)^{3}}(\lambda^{\prime})^{2}>0$ . Inequality (b) holds due to the facts that $\frac{\mu}{\mu-\lambda}>1$ and that $\lambda^{\prime\prime}>\frac{(\lambda^{\prime})^{2}}{2\lambda}\geq 0$ . The last inequality holds because $\frac{1}{\mu-\lambda}>\frac{1}{\mu}$ . As a result, we have $\partial^{2}_{p}q,\partial^{2}_{\mu}q>0$ . Next, we verify that $|H_{q}|>0$ . Because $2\lambda\lambda^{\prime\prime}-(\lambda^{\prime})^{2}>0$ , we have

2\mu\lambda\lambda^{\prime\prime}>(\mu-\lambda)(\lambda^{\prime})^{2},\quad\text{and}\quad(2\mu-\lambda)\lambda\lambda^{\prime\prime}>(\mu-\lambda)(\lambda^{\prime})^{2}.

Therefore,

	$\displaystyle|H_{q}|=$	$\displaystyle\frac{C^{2}}{(\mu-\lambda)^{5}}\left(2\mu\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)+\frac{(1-C)^{2}}{\mu^{4}}\left(2\lambda\lambda^{\prime\prime}-(\lambda^{\prime})^{2}\right)$
		$\displaystyle\qquad+\frac{2C(1-C)}{\mu^{2}(\mu-\lambda)^{3}}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)$
	$\displaystyle\stackrel{{\scriptstyle(c)}}{{>}}$	$\displaystyle\frac{C^{2}}{(\mu-\lambda)^{5}}\left(2\mu\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)+\frac{2C(1-C)}{\mu^{2}(\mu-\lambda)^{3}}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)$
	$\displaystyle>$	$\displaystyle\frac{C^{2}}{(\mu-\lambda)^{5}}\Big{(}\underbrace{2\mu\lambda\lambda^{\prime\prime}}_{>(2\mu-\lambda)\lambda\lambda^{\prime\prime}}-(\mu-\lambda)(\lambda^{\prime})^{2}\Big{)}-\frac{2C^{2}}{\mu^{2}(\mu-\lambda)^{3}}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)$
	$\displaystyle>$	$\displaystyle\frac{C^{2}}{(\mu-\lambda)^{5}}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)-\frac{2C^{2}}{\mu^{2}(\mu-\lambda)^{3}}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)$
	$\displaystyle=$	$\displaystyle\frac{C^{2}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)}{(\mu-\lambda)^{3}}\left(\frac{1}{(\mu-\lambda)^{2}}-\frac{2}{\mu^{2}}\right)>0.$

Inequality (c) holds because $\frac{(1-C)^{2}}{\mu^{4}}(2\lambda\lambda^{\prime\prime}-(\lambda^{\prime})^{2})$ is positive by (40). The last inequality follows from $\frac{\lambda(p)}{\mu}>1-1/\sqrt{2}$ , which is equivalent to $\mu^{2}>2(\mu-\lambda)^{2}$ . This completes the proof. ∎
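The second-order derivatives used in the proof can be checked numerically. The sketch below evaluates the closed-form entries of $H_{q}$ and compares them with central finite differences of the Pollaczek-Khinchine expression for $q(\mu,p)$ . The demand function $\lambda(p)=10e^{-p}$ is a hypothetical choice made only for this illustration (it is not the demand function used in the paper); at $p=1$ and $\mu=5$ it satisfies condition (40), since $\lambda/2<\lambda<2\lambda$ and $\lambda/\mu\approx 0.74$ .

```python
import math

def demand(p):
    """Hypothetical demand lambda(p) = 10 * exp(-p) with its first two derivatives."""
    lam = 10.0 * math.exp(-p)
    return lam, -lam, lam

def mean_queue_length(mu, p, scv):
    """Pollaczek-Khinchine mean queue length q(mu, p) with C = (1 + c_s^2) / 2."""
    lam, _, _ = demand(p)
    c = (1.0 + scv) / 2.0
    return c * lam / (mu - lam) + (1.0 - c) * lam / mu

def hessian_closed_form(mu, p, scv):
    """Entries (q_pp, q_mm, q_pm) of the Hessian, as written in the proof."""
    lam, d1, d2 = demand(p)
    c = (1.0 + scv) / 2.0
    gap = mu - lam
    q_pp = c * mu / gap**3 * (2.0 * d1**2 + gap * d2) + (1.0 - c) * d2 / mu
    q_mm = c * 2.0 * lam / gap**3 + (1.0 - c) * 2.0 * lam / mu**3
    q_pm = -c * (lam + mu) / gap**3 * d1 - (1.0 - c) * d1 / mu**2
    return q_pp, q_mm, q_pm

def hessian_finite_difference(mu, p, scv, h=1e-4):
    """Central finite-difference approximation of the same three entries."""
    q = lambda m, x: mean_queue_length(m, x, scv)
    q0 = q(mu, p)
    q_pp = (q(mu, p + h) - 2.0 * q0 + q(mu, p - h)) / h**2
    q_mm = (q(mu + h, p) - 2.0 * q0 + q(mu - h, p)) / h**2
    q_pm = (q(mu + h, p + h) - q(mu + h, p - h)
            - q(mu - h, p + h) + q(mu - h, p - h)) / (4.0 * h * h)
    return q_pp, q_mm, q_pm

def is_positive_definite(q_pp, q_mm, q_pm):
    """A symmetric 2x2 matrix is positive definite iff its diagonal and determinant are positive."""
    return q_pp > 0 and q_mm > 0 and q_pp * q_mm - q_pm**2 > 0

for scv in (1.0, 2.0):
    exact = hessian_closed_form(5.0, 1.0, scv)
    print(scv, [round(x, 3) for x in exact], is_positive_definite(*exact))
```

At the chosen point the Hessian is positive definite for both service-time SCVs tested, as the lemma predicts; points that violate (40) can be tried by changing `mu` or `p`.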

	Notation	Description
Model parameters and functions	$\mathcal{B}=[\underline{p},\bar{p}]\times[\underline{\mu},\bar{\mu}]$	Feasible action space
	$c(\mu)$	Staffing cost function
	$f(\mu,p)$	Objective (loss) function
	$h_{0}$	Customer holding cost
	$\lambda(p)$	Demand function
	$\mu$	Service rate/capacity
	$n$	Market size/ System scale in Section 6
	$p$	Service fee
	$Q_{\infty}(\mu,p)$	Stationary queue length under $(p,\mu)$
	$S_{n}^{k}$	Service time of the $(n-1)^{\rm th}$ customer in cycle $k$
	$\tau_{n}$	Interarrival time between $(n-1)^{\rm th}$ and $n^{\rm th}$ customers
	$\theta,\gamma,\eta$	Parameters of light-tail assumptions (Assumption 2)
	$U_{n},V_{n}$	Unscaled random “seeds” of interarrival and service times
	$x^{*}=(p^{*},\mu^{*})$	Optimal service fee and service rate
Algorithmic hyperparameters	$D_{k}$	Sample size (number of customers served) in cycle $k$
	$\eta_{k}$	Step size or learning rate in cycle $k$
	$H_{k}$	Gradient estimator in cycle $k$
	$M_{L}$	Cumulative number of customers served by cycle $L$
	$Q_{k}$	Queue content leftover from cycle $k-1$
	$W_{n}^{k}$	Delay of the $n^{\rm th}$ customer in cycle $k$
	$\xi$	Warm up rate
	$X_{n}^{k}$ ( $X_{n}$ )	Server’s busy time observed by customer $n$ in cycle $k$
Constants and bounds in regret analysis	$a_{D},b_{D}$	Constants for $D_{k}$ in equation (21)
	$A=4\sqrt{M}+4M$	Constant in Corollary 1
	$B$	Constant of stationary waiting times in Lemma 4
	$B_{k},\leavevmode\nobreak\ \mathcal{V}_{k}$	Upper bounds for bias and Variance of $H_{k}$
	$c_{\eta}$	Constant for $\eta_{k}$ in equation (20)
	$c_{\mu},c_{\lambda}$	Constants in Lemma 3
	$C=\max\{\|x_{0}-x^{*}\|^{2},8K_{3}/K_{0}\}$	Constant in Theorem 2
	$C_{0}=\max_{x\in\mathcal{B}}\{h_{0}\lambda^{\prime}(p),h_{0}\lambda(p)/\mu\}$	Constant in the proof of Theorem 3
	$C_{1}=\max_{x\in\mathcal{B}}\{\|\lambda(p)+p\lambda^{\prime}(p)\|,\|c^{\prime}(\mu)\|\}$	Constant in the proof of Theorem 3
	$C_{D}$	Constant for the selection of $D_{k}$ (Theorem 3)
	$d_{k}=\lceil 4\log(k)/\min(\theta,\gamma)\rceil$	Constant of warm-up time (Theorem 1)
	$\tilde{d}_{k}=\lceil 5\log(k)/\min(\theta,\gamma)\rceil$	Constant of warm-up time (Theorem 1)
	$\Gamma_{i},\leavevmode\nobreak\ i=1,2$	Stopping time of random walks (proof of Lemma 2)
	$I_{1},I_{2},I_{3}$	Three terms of the regret of nonstationarity
	$K_{alg}$	Bound of the cumulative regret (Theorem 3)
	$K^{\prime}$	Constant for regret of nonstationarity (proof of Theorem 1)
	$K=K^{\prime}+2M_{0}/\log(2)$	Constant in the proof of Theorem 1
	$K_{0},K_{1}$	Constants for convexity and smoothness (Assumption 1)
	$K_{2}=2K_{3}/K_{0}$	Constant for $D_{k}$ (Theorem 1)
	$K_{3}$	Constants for variance in Theorem 3 (37)
	$K_{4}$	Constants for convergence rate of busy time in Lemma EC.1
	$K_{5}=32e^{4}/(\min(\theta,\gamma))$	Bound in the proof of Theorem 3 (35)
	$K_{6}=K_{2}\max_{p}\lambda^{\prime}(p)$	Bound in the proof of Corollary 2
	$\bar{\lambda},\underline{\lambda}$	Upper and lower bounds for $\lambda(p)$
	$M$	Uniform bound for queueing functions in Lemma 1
	$M_{0}$	Upper bound of the regret in the first cycle
	$R_{k}$	Total regret during cycle $k$
	$R_{1,k},R_{2,k}$	Regret of nonstationarity/suboptimality in cycle $k$
	$R(L),R_{1}(L),R_{2}(L)$	Cumulative/nonstationary/suboptimal regret by cycle $L$
	$T_{k}$	Length of cycle $k$

Table 1: Glossary of notation.

	$\displaystyle\|W_{m}-\tilde{W}_{m}\|$	$\displaystyle=W_{m}-0\leq W_{m}-\left(\tilde{W}_{m-1}-\frac{U_{m}}{\tilde{\lambda}_{m}}+\frac{V_{m}}{\tilde{\mu}_{m}}\right)$
		$\displaystyle=\left\|W_{m-1}-\frac{U_{m}}{\lambda_{m}}+\frac{V_{m}}{\mu_{m}}-\left(\tilde{W}_{m-1}-\frac{U_{m}}{\tilde{\lambda}_{m}}+\frac{V_{m}}{\tilde{\mu}_{m}}\right)\right\|$
		$\displaystyle\leq\|W_{m-1}-\tilde{W}_{m-1}\|+\left\|\frac{1}{\mu_{m}}-\frac{1}{\tilde{\mu}_{m}}\right\|V_{m}+\left\|\frac{1}{\lambda_{m}}-\frac{1}{\tilde{\lambda}_{m}}\right\|U_{m}.$

	$\displaystyle\mathbb{E}[\|W_{\infty}(\mu_{1},p_{1})-W_{\infty}(\mu_{2},p_{1})\|^{m}]$	$\displaystyle\leq B_{1}\|\mu_{1}-\mu_{2}\|^{m},$
	$\displaystyle\mathbb{E}[\|W_{\infty}(\mu_{2},p_{1})-W_{\infty}(\mu_{2},p_{2})\|^{m}]$	$\displaystyle\leq B_{2}\|p_{1}-p_{2}\|^{m}.$

	$\displaystyle\mathbb{E}[\|W_{n}^{k}-\bar{W}_{n}^{k}\|1(Q_{k}<d_{k})]$	$\displaystyle\leq e^{-\gamma(n-d_{k})}\mathbb{E}\left[(2+e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}})\left\|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}\right\|1(Q_{k}<d_{k})\right]$
		$\displaystyle\leq e^{-\gamma(n-d_{k})}\mathbb{E}\left[(2+e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}})\left\|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}\right\|\right]$
		$\displaystyle\leq e^{-\gamma(n-d_{k})}\left(2+\mathbb{E}\left[\left(e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}}\right)^{2}\right]^{1/2}\right)\mathbb{E}\left[\left\|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}\right\|^{2}\right]^{1/2}.$

	$\displaystyle\mathbb{E}[\|\mu_{k}-\mu_{k-1}\|^{2}]$	$\displaystyle\leq\mathbb{E}[\|x_{k}-x_{k+1}\|^{2}]\leq K_{2}k^{-2\alpha}$
	$\displaystyle\mathbb{E}[\|\lambda_{k}-\lambda_{k-1}\|^{2}]$	$\displaystyle\leq K_{2}\left(\max_{p}\lambda^{\prime}(p)\right)^{2}k^{-2\alpha}\equiv K_{6}k^{-2\alpha}.$

	$\displaystyle\|\mathbb{E}[W_{n}^{k}-w(\mu_{k},p_{k})]\|$	$\displaystyle\leq\mathbb{E}[\|w(\mu_{k-1},p_{k-1})-w(\mu_{k},p_{k})\|]+\mathbb{E}[\|W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}\|]$
		$\displaystyle\leq B\mathbb{E}[\|\mu_{k}-\mu_{k-1}\|+\|\lambda(p_{k})-\lambda(p_{k-1})\|]+\left(\frac{2\sqrt{K_{2}}}{\underline{\mu}}+\frac{\sqrt{K_{6}}}{\lambda(\bar{p})}\right)\sqrt{M}k^{-\alpha}+O(k^{-2})$
		$\displaystyle\leq B(\sqrt{K_{2}}+\sqrt{K_{6}})k^{-\alpha}+\left(\frac{2\sqrt{K_{2}}}{\underline{\mu}}+\frac{\sqrt{K_{6}}}{\lambda(\bar{p})}\right)\sqrt{M}k^{-\alpha}+O(k^{-2})$
		$\displaystyle=O(k^{-\alpha}),$
