This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

An Online Learning Approach to Dynamic Pricing and Capacity Sizing in Service Systems

Xinyun Chen
School of Data Science, Chinese University of Hong Kong (Shenzhen),
Yunan Liu
Dept. of Industrial and Systems Engineering, North Carolina State University,
Guiyu Hong
School of Data Science, Chinese University of Hong Kong (Shenzhen)
Abstract

We study a dynamic pricing and capacity sizing problem in a GI/GI/1GI/GI/1 queue, where the service provider’s objective is to obtain the optimal service fee pp and service capacity μ\mu so as to maximize the cumulative expected profit (the service revenue minus the staffing cost and delay penalty). Due to the complex nature of the queueing dynamics, such a problem has no analytic solution so that previous research often resorts to heavy-traffic analysis where both the arrival rate and service rate are sent to infinity. In this work we propose an online learning framework designed for solving this problem which does not require the system’s scale to increase. Our framework is dubbed Gradient-based Online Learning in Queue (GOLiQ). GOLiQ organizes the time horizon into successive operational cycles and prescribes an efficient procedure to obtain improved pricing and staffing policies in each cycle using data collected in previous cycles. Data here include the number of customer arrivals, waiting times, and the server’s busy times. The ingenuity of this approach lies in its online nature, which allows the service provider do better by interacting with the environment. Effectiveness of GOLiQ is substantiated by (i) theoretical results including the algorithm convergence and regret analysis (with a logarithmic regret bound), and (ii) engineering confirmation via simulation experiments of a variety of representative GI/GI/1GI/GI/1 queues.


Keywords: online learning in queues; service systems; capacity planning; staffing; pricing in service systems

1 Introduction

1.1 Problem Statement and Methodology

1.1.1 Pricing and capacity sizing in queue

We study a service queueing model where the service provider manages congestion and revenue by dynamically adjusting the price and service capacity. Specifically, we consider a GI/GI/1GI/GI/1 queue, in which the demand for service is λ(p)\lambda(p) per unit of time when each customer is charged by a service fee pp; the cost for providing service capacity μ\mu is c(μ)c(\mu); and a holding cost h0h_{0} incurs per job per unit of time. By choosing the appropriate service fee pp and capacity μ\mu, the service provider aims to maximize the net profit, which is the service fee minus the staffing cost and penalty of congestion, i.e.,

maxμ,p𝒫(μ,p)pλ(p)c(μ)h0𝔼[Q(μ,p)],\max_{\mu,p}\leavevmode\nobreak\ \mathcal{P}(\mu,p)\equiv p\lambda(p)-c(\mu)-h_{0}\mathbb{E}[Q_{\infty}(\mu,p)], (1)

where Q(μ,p)Q_{\infty}(\mu,p) is the steady-state queue length under service rate μ\mu and price pp.

Problems in this framework have a long history, see for example Kumar and Randhawa (2010), Lee and Ward (2014), Lee and Ward (2019), Maglaras and Zeevi (2003), Nair et al. (2016), Kim and Randhawa (2018) and the references therein. Due to the complex nature of the queueing dynamics, exact analysis is challenging and often unavailable (computation of the optimal dynamic pricing and staffing rules is not straightforward even for the Markovian M/M/1M/M/1 queue (Ata and Shneorson, 2006)). Therefore, researchers resort to heavy-traffic analysis to approximately obtain performance evaluation and optimization results. Commonly adopted heavy-traffic regimes require sending the arrival rate and service capacity (service rate or number of servers) to \infty. Although heavy-traffic analysis provides satisfactory results for large-scale queueing systems, approximation formulas based on heavy-traffic limits often become inaccurate as the system scale decreases.

1.1.2 An online learning method

In this paper we propose an online learning framework designed for solving Problem (1). According to our online learning algorithm, the GI/GI/1GI/GI/1 queue will be operated in successive cycles, where in each cycle the service provider’s decisions on the service fee pp and service capacity μ\mu, deemed the best by far, are obtained using the system’s data collected in previous operational cycles. Data hereby include (i) the number of customers who join for service, (ii) customers’ waiting times, and (iii) the server’s busy time, all of which are easy to collect. Newly generated data, which represent the response from the (random and complex) environment to the present operational decisions, will be used to obtain improved pricing and staffing policies in the next cycle. In this way the service provider can dynamically interact with the environment so that the operational decisions can evolve and eventually approach the optimal solution.

At the beginning of each cycle kk, the service provider’s decisions (pk,μk)(p_{k},\mu_{k}) will be computed and enforced throughout the cycle. At the heart of our procedure for computing (pk,μk)(p_{k},\mu_{k}) is to obtain a sufficiently accurate estimator Hk1H_{{k-1}} for the gradient of the objective function of (1), using past experience. Specifically, our online algorithm will update (pk,μk)(p_{k},\mu_{k}) according to

(μk,pk)(μk1,pk1)+ηk1Hk1,\displaystyle(\mu_{k},p_{k})\leftarrow(\mu_{k-1},p_{k-1})+\eta_{k-1}H_{k-1},

where ηk\eta_{k} is the updating step size for cycle kk. We call this algorithm Gradient-based Online Learning in Queue (GOLiQ).

Besides showing that, under our online learning scheme, the decisions in cycle kk, (μk,pk)(\mu_{k},p_{k}) will converge to the optimal solutions (μ,p)(\mu^{*},p^{*}) as kk increases, we quantify the effectiveness of GOLiQ by computing the regret - the cumulative loss of profit due to the suboptimality of (μk,pk)(\mu_{k},p_{k}), namely, the maximum profit under the (unknown) optimal strategy minus the expected profit earned under the online algorithm over time. When GOLiQ’s hyperparameters are chosen optimally, we show that our regret bound is logarithmic so that the service provider, with any initial pricing and staffing policy (μ0,p0)(\mu_{0},p_{0}), will quickly learn the optimal solutions without losing much profit in the learning process.

1.2 Advantages, Challenges and Contributions

In what follows, we first discusses the general advantages of the online learning approach by contrasting with heavy-traffic methods; we next explain the key challenges we face in the development of online learning algorithms for queueing systems.

1.2.1 Online learning vs. heavy-traffic method.

First, heavy-traffic solutions are derived from approximating models which arise as the system scale approaches infinity, so the fidelity of the solutions is sensitive to the system scale. Unlike heavy-traffic methods, online learning approaches do not require any asymptotic scaling, so they can treat service systems at any scale (small or large). Second, heavy-traffic approaches usually require the knowledge of certain distributional information apriori (e.g., moments and distribution functions of service times), which serve as critical input parameters for the heavy-traffic models. On the other hand, online learning methods require information of this kind to a lesser extent. Although certain distribution information can help fine-tune parameters of online algorithms, it is less crucial to algorithm design and implementation. So in this sense, the dependence on the distributional information is weaker than that of heavy-traffic analysis. Last, online learning is advantageous when the underlining problem focuses on performance optimization in the long run. Heavy-traffic analysis gives approximate solutions that are static, and in a longer time frame, the performance discrepancy (relative to the true optimal reward) should grow linearly as time increases. But online learning is a dynamic evolution, and its data-driven nature enables it to constantly produce improved solutions which will eventually reach optimality. In addition, heavy-traffic solutions require the establishment of heavy-traffic limit theorems and careful analysis of the dynamics of the limit processes (e.g., fluid and diffusion). Both steps can be quite sophisticated in general. See Remarks 11 and 12 for more detailed discussions; also see Section 6.3 for numerical evidence.

1.2.2 Challenges of online learning in queueing systems.

Online learning in queues is by no means an easy extension of online learning in other domains; its theoretical development has to account for the unique features in queueing systems. A crucial step is to develop effective ways to control the nonstationary error that arises at the beginning of every cycle due to the policy update. Towards this, we develop a new regret analysis framework for the transient queueing performance that not only helps establish desired regret bounds for the specific online GI/GI/1GI/GI/1 algorithm, but may also be used to develop online learning method for other queueing models (see Section 4). Another challenge we have to address here is to devise a convenient gradient estimator for the online learning algorithm (essentially, an estimator for the gradient of 𝔼[Q(μ,p)]\mathbb{E}[Q_{\infty}(\mu,p)]). The estimator should have a negligible bias to warrant a quick convergence of the algorithm, and at the same time, its computation (using previous data) should be sufficiently straightforward to ensure the ease of implementation (The detailed gradient estimator of GOLiQ for the GI/GI/1GI/GI/1 system is given in Section 5).

1.2.3 Main Contributions

We summarize our contributions below.

  • To the best of our knowledge, the present work is the first to develop an online learning framework for joint pricing and staffing in a queueing system with logarithmic regret bound in the total number of customers served (Theorem 3). Due to the complex nature of queueing systems, previous research often resorts to asymptotic heavy-traffic analysis to approximately solve for desired operational decisions. The ingenuity of our online learning method lies in the ability to obtain the optimal solutions without needing the system scale (e.g., arrival rate and service rate) to grow large. The other appeal of our method is its robustness, especially in its weaker dependence on the distributions of service and arrival times.

  • A critical step in the regret analysis is the treatment of the transient system dynamics, because when improved operational decisions are obtained and implemented at the beginning of a new period, the queueing performance will shift away from previously established steady-state level. Towards this, we develop a new way to treat and bound the transient queueing performance in the regret analysis of our online learning algorithm (Theorem 1). Bounding the transient error also guarantees convergence of the SGD iteration (Theorem 2). Comparing to previous literature (e.g., the regret bound is O(T2/3)O(T^{2/3}) in Huh et al. (2009)), our analysis of the regret due to nonstationarity gives a much tighter logarithmic bound. In addition, the regret analysis in the present paper may be extended to other queueing systems which share similar properties to GI/GI/1GI/GI/1.

  • Supplementing the theoretical results of our regret bound, we evaluate the practical effectiveness of our method by conducting comprehensive numerical experiments. Our simulations draw the following two main conclusions. First, our method is robust in several dimensions: (i) GOLiQ exhibits convincing performance for GI/GI/1GI/GI/1 queues having representative arrival and service distributions; (ii) GOLiQ remains effective even when certain theoretical assumptions are relaxed. Furthermore, in order to clearly highlight the advantages of our online learning approach relative to the previous results of heavy-traffic limits, we provide a careful performance comparison of these two methods. We show that GOLiQ is more effective in any one of the following three cases: the system scale is not too large, staffing cost is high, or service times are more variable.

1.3 Organization of the paper

In Section 2, we review the related literature. In Section 3, we introduce the model assumptions and provide an outline of our online learning algorithm. In Section 4, we conduct the regret analysis for GOLiQ by separately treating the regret of nonstationarity - the part of regret arising from the transient system dynamics, and the regret of suboptimality - the part originating from the errors due to suboptimal pricing and staffing decisions. In Section 5, we give the detailed description of GOLiQ and establish a logarithmic regret bound by appropriately selecting our algorithm parameters. In Section 6 we conduct numerical experiments to confirm the effectiveness and robustness of GOLiQ. We conclude in Section 7. In the e-companion, we give all technical proofs and provide additional numerical examples.

2 Related Literature

The present paper is related to the following three streams of literature.

Pricing and capacity sizing in queues.

There is a rich literature on pricing and capacity sizing in service systems under different settings. Maglaras and Zeevi (2003) studies pricing and capacity sizing problem in a processor sharing queue motivated by internet applications; Kumar and Randhawa (2010) considers a single-server system with nonlinear delay costs; Nair et al. (2016) studies M/M/1M/M/1 and M/M/kM/M/k systems with network effect among customers; Kim and Randhawa (2018) considers a dynamic pricing problem in a single-server system. The specific problem (1) we consider here is most closely related to Lee and Ward (2014), i.e., joint pricing and capacity sizing for the GI/GI/1GI/GI/1 queue. Later, the authors extend their results to the GI/GI/1+GGI/GI/1+G model with customer abandonment in Lee and Ward (2019). As there is usually no closed-form solution for the optimal strategy or equilibrium, asymptotic analysis is adopted under large-market assumptions. In detail, their analysis is rooted in a deterministic static planning problem which requires both the service capacity and the demand rate to scale to infinity. Most of the papers conclude that heavy-traffic regime is economically optimal. (There are some exceptions where heavy-traffic regime is not optimal, for example, Kumar and Randhawa (2010) shows that agent is forced to decrease its utilization if the delay cost is concave.) Our algorithm is motivated by the pricing and capacity sizing problem for service systems, however, as explained previously, our methodology is very different from the asymptotic analysis used in these papers.

Reinforcement learning for queueing systems.

Our paper is also related to a small but growing literature on reinforcement learning (RL) for queueing systems. Dai and Gluzman (2021) studies an actor-critic algorithm for queueing networks. Liu et al. (2019) and Shah et al. (2020) develop RL techniques to treat the unboundedness of the state space of queueing systems. Jia et al. (2021) studies a price-based revenue management problem in an M/M/cM/M/c queue with a discrete price space; their methodology draws from the multi-armed bandit framework (with each price treated as an “arm”). Krishnasamy et al. (2021) develops bandit methods for scheduling problem in a multi-server queue with unknown service rates. Our work draws distinction from the above-mentioned literature in two dimensions. First, we are the first to develop an online learning method for joint pricing and capacity sizing in queue. In addition, our method applies to settings of continuous decision variables. Comparing to the more general RL literature, our algorithm design and regret analysis take advantage of the specific queueing system structure so as to establish tight regret bounds and more accurate control of the convergence rate. In some sense, the algorithm developed in the present paper may be viewed as a version of the policy gradient method, a special class of RL methods (Sutton and Barto, 2018), see Remark 2 for detailed discussions.

Stochastic gradient decent algorithms.

In general, our algorithm falls into the broad class of stochastic gradient descent (SGD) methods. There are some early papers on SGD algorithms for steady-state performance of queues (see Fu (1990), Chong and Ramadge (1993), L’Ecuyer et al. (1994), L’Ecuyer and Glynn (1994) and the references therein). In particular, these papers have established convergence results of SGD algorithms for capacity sizing problems with a variety of gradient estimating designs. In this paper, we consider a more general setting in which the price is also optimized jointly with the service capacity. Besides, in order to establish theoretical bounds for the regret, we conduct a careful analysis on the convergence rate of the algorithm and provide an explicit guidance for the optimal choice of algorithm parameters, which is not discussed in this early literature. Our algorithm design and analysis are also related to the online learning methods in recent inventory management literature (Burnetas and Smith, 2000; Huh et al., 2009; Huh and Rusmevichientong, 2013; Zhang et al., 2020; Yuan et al., 2021). Among these papers, our work is perhaps most closely related to Huh et al. (2009) where the authors develop an SGD based learning method for an inventory model with a bounded replenishment lead time. Still, due to the unique natures of queueing models, we develop a new regret analysis framework as we shall explain with details in Section 1.2.3.

3 Problem Setting and Algorithm Outline

In Section 3.1 we describe the queueing model and technical assumptions. In Section 3.2, we provide a general outline of GOLiQ. Finally, in Section 3.3 we conduct preliminary analysis of the queueing performance under GOLiQ.

3.1 Model and Assumptions

We study a GI/GI/1GI/GI/1 queueing system having customer arrivals according to a renewal process with generally distributed interarrival times (the first GIGI), independent and identically distributed (i.i.d.) service times following a general distribution (the second GIGI), and a single server that provides service under the first-in-first-out (FIFO) discipline. Each customer upon joining the queue is charged by the service provider a fee p>0p>0. The demand arrival rate (per time unit) depends on the service fee pp and is denoted as λ(p)\lambda(p). To maintain a service rate μ\mu, the service provider continuously incurs a staffing cost at a rate c(μ)c(\mu) per time unit.

For μ[μ¯,μ¯]\mu\in[\underline{\mu},\bar{\mu}] and p[p¯,p¯]p\in[\underline{p},\bar{p}], the service provider’s goal is to determine the optimal service fee pp^{*} and service capacity μ\mu^{*} with the objective of maximizing the steady-state expected profit (1), or equivalently minimizing the objective function f(μ,p)f(\mu,p) as follows

min(μ,p)f(μ,p)h0𝔼[Q(μ,p)]+c(μ)pλ(p),[μ¯,μ¯]×[p¯,p¯].\displaystyle\min_{(\mu,p)\in\mathcal{B}}f(\mu,p)\equiv h_{0}\mathbb{E}[Q_{\infty}(\mu,p)]+c(\mu)-p\lambda(p),\qquad\mathcal{B}\equiv[\underline{\mu},\bar{\mu}]\times[\underline{p},\bar{p}]. (2)

We shall impose the following assumptions on the above service system throughout the paper.

Assumption 1.

((Demand rate, staffing cost, and uniform stability))

  1. (a)(a)

    The arrival rate λ(p)\lambda(p) is continuously differentiable and non-increasing in pp.

  2. (b)(b)

    The staffing cost c(μ)c(\mu) is continuously differentiable and non-decreasing in μ\mu.

  3. (c)(c)

    The lower bounds p¯\underline{p} and μ¯\underline{\mu} satisfy that λ(p¯)<μ¯\lambda(\underline{p})<\underline{\mu} so that the system is uniformly stable for all feasible choices of the pair (μ,p)(\mu,p).

Part (c)(c) of Assumption 1 is commonly used in the literature of SGD methods for queueing models to ensure that the steady-state mean waiting time 𝔼[W(μ,p)]\mathbb{E}[W_{\infty}(\mu,p)] is differentiable with respect to model parameters (see Chong and Ramadge (1993), Fu (1990), L’Ecuyer et al. (1994), L’Ecuyer and Glynn (1994), also see Theorem 3.2 of Glasserman (1992)). In the our numerical experiments (see Section 11.1), we show that our online algorithm remains effective when this assumption is relaxed.

We do not require full knowledge of service and inter-arrival time distributions. But in order to develop explicit bounds for the part of the regret due to the nonstationarity of the queueing processes, we require both distributions to be light-tailed. Specifically, since the actual service and interarrival times are subject to our pricing and staffing decisions, we model the interarrival and service times by two scaled random sequences {Un/λ(p)}\{U_{n}/\lambda(p)\} and {Vn/μ}\{V_{n}/\mu\}, where U1,U2,U_{1},U_{2},\ldots and V1,V2,V_{1},V_{2},\ldots are two independent i.i.d. sequences of random variables having unit means, i.e., 𝔼[Un]=𝔼[Vn]=1\mathbb{E}[U_{n}]=\mathbb{E}[V_{n}]=1. We make the following assumptions on UnU_{n} and VnV_{n}.

Assumption 2.

((Light-tailed service and interarrival times))
There exists a sufficiently small constant η>0\eta>0 such that the moment-generating functions

𝔼[exp(ηVn)]<and𝔼[exp(ηUn)]<.\mathbb{E}[\exp(\eta V_{n})]<\infty\quad\text{and}\quad\mathbb{E}[\exp(\eta U_{n})]<\infty.

In addition, there exist constants 0<θ<η/2μ¯0<\theta<\eta/2\bar{\mu}, 0<a<(μ¯λ(p¯))/(μ¯+λ(p¯))0<a<(\underline{\mu}-\lambda(\underline{p}))/(\underline{\mu}+\lambda(\underline{p})) and γ>0\gamma>0 such that

ϕU(θ)<(1a)θγandϕV(θ)<(1+a)θγ,\phi_{U}(-\theta)<-(1-a)\theta-\gamma\quad\text{and}\quad\phi_{V}(\theta)<(1+a)\theta-\gamma, (3)

where ϕV(θ)log𝔼[exp(θVn)]\phi_{V}(\theta)\equiv\log\mathbb{E}[\exp(\theta V_{n})] and ϕU(θ)log𝔼[exp(θUn)]\phi_{U}(\theta)\equiv\log\mathbb{E}[\exp(\theta U_{n})] are the cummulant generating functions of UU and VV.

Note that ϕU(0)=ϕV(0)=1\phi^{\prime}_{U}(0)=\phi^{\prime}_{V}(0)=1 as 𝔼[U]=𝔼[V]=1\mathbb{E}[U]=\mathbb{E}[V]=1. Suppose ϕU\phi_{U} and ϕV\phi_{V} are smooth around 0, then we have ϕU(θ)=θ+o(θ)\phi_{U}(-\theta)=-\theta+o(\theta) and ϕV(θ)=θ+o(θ)\phi_{V}(\theta)=\theta+o(\theta) by Taylor’s expansion. This implies that, for any a>0a>0, we can make θ\theta small enough, such that ϕU(θ)<(1a)θ\phi_{U}(-\theta)<-(1-a)\theta and ϕV(θ)<(1+a)θ\phi_{V}(\theta)<(1+a)\theta. To obtain the bound in (3), we can simply take γ=12min((1a)θϕU(θ),(1+a)θϕV(θ))>0\gamma={\frac{1}{2}}\min(-(1-a)\theta-\phi_{U}(-\theta),(1+a)\theta-\phi_{V}(\theta))>0. Hence, a sufficient condition that warrants (3) is to require that ϕU\phi_{U} and ϕV\phi_{V} be smooth around 0, which is true for many distributions of UU and VV considered in common queueing models. Assumption 2 will be used in our proofs to build an explicit bound for the regret of nonstationarity.

Finally, in order to warrant the convergence of our online learning algorithm, we require a convex structure for the problem in (2), which is common in the SGD literature; see Broadie et al. (2011), Kushner and Yin (2003) and the references therein.

Let x(μ,p)x^{*}\equiv(\mu^{*},p^{*}) and x(μ,p)x\equiv(\mu,p). Let f(x)\nabla f(x) denote the gradient of a function f(x)f(x) and \|\cdot\| denote the Euclidean norm.

Assumption 3.

((Convexity and smoothness))
There exist finite positive constants K01K_{0}\leq 1 and K1>K0K_{1}>K_{0} such that for all x,x\in\mathcal{B}{,}

  1. (a)(a)

    (xx)Tf(x)K0xx2(x-x^{*})^{T}\nabla f(x)\geq K_{0}\|x-x^{*}\|^{2};

  2. (b)(b)

    f(x)K1xx\|\nabla f(x)\|\leq K_{1}\|x-x^{*}\|.

Remark 1.

Our simulation experiments show that our algorithm works effectively for some representative GI/GI/1GI/GI/1 queues with conditions in Assumption 3 relaxed; see Section 6 and Section 11 in the Supplement Material. In addition, we later provide some sufficient conditions for Assumption 3 in the special case of M/GI/1M/GI/1 queues in Section 12.

3.2 Outline of GOLiQ

In general, an SGD algorithm for a minimization problem minxf(x)\min_{x}f(x) over a compact set \mathcal{B} relies on updating the decision variable via the recursion

xk+1=Π(xkηkHk),k1.\displaystyle x_{k+1}=\Pi_{\mathcal{B}}(x_{k}-\eta_{k}H_{k}),\qquad k\geq 1.

where ηk\eta_{k} is the step size, HkH_{k} is a random estimator for f(xk)\nabla f(x_{k}), xkx_{k} is the decision variable by step kk, and the projection operator Π\Pi_{\mathcal{B}} restricts the updated decision in \mathcal{B}. For problem (2), we let xk(μk,pk)x_{k}\equiv(\mu_{k},p_{k}) represent the service capacity and price at step kk, We define

Bk𝔼[𝔼[Hkf(xk)|k]2]1/2and𝒱k𝔼[Hk2],B_{k}\equiv\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]^{1/2}\quad\text{and}\quad\mathcal{V}_{k}\equiv\mathbb{E}[\|H_{k}\|^{2}], (4)

where k\mathcal{F}_{k} is the σ\sigma-algebra including all events in the first k1k-1 iterations. Intuitively, BkB_{k} measures the bias of the gradient estimator HkH_{k} and 𝒱k\mathcal{V}_{k} measures its variability. As we shall see later, BkB_{k} and 𝒱k\mathcal{V}_{k} play important roles in designing the algorithm and establishing desired regret bounds.

The standard SGD algorithm iterates in discrete step kk. In our setting, however, the queueing system and objective function f(μ,p)f(\mu,p) are defined in continuous time (in particular, Q(μ,p)Q_{\infty}(\mu,p) is the steady-state queue length observed in continuous time). To facilitate the regret analysis, we first transform the objective function into an expression of customer waiting times that are observed in discrete time. By Little’s law, we can rewrite the objective function f(μ,p)f(\mu,p) as, for all (μ,p)(\mu,p)\in\mathcal{B},

f(μ,p)=h0λ(p)(𝔼[W(μ,p)]+1μ)+c(μ)pλ(p),\displaystyle f(\mu,p)=h_{0}\lambda(p)\left(\mathbb{E}[W_{\infty}(\mu,p)]+\frac{1}{\mu}\right)+c(\mu)-p\lambda(p), (5)

where W(μ,p)W_{\infty}(\mu,p) is the steady-state waiting time under (μ,p)(\mu,p). In each cycle kk, our algorithm adopts the average of DkD_{k} observed customer waiting times to estimate 𝔼[W(μ,p)]\mathbb{E}[W_{\infty}(\mu,p)], where DkD_{k} denotes the number of customers that enter service in cycle kk (we refer to DkD_{k} as the cycle length or sample size of cycle kk). But any finite DkD_{k} will introduce a bias to our gradient estimate HkH_{k}. To mitigate the bias due to the transient performance of the queueing process, we shall let the cycle length DkD_{k} be increasing in kk (in this way the transient bias will vanish eventually). We give the outline of the algorithm below.

Outline of GOLiQ:

  1. 0.

    Input: {Dk}\{D_{k}\} and {ηk}\{\eta_{k}\} for k=1,2,..,Lk=1,2,..,L, initial policy x1=(μ1,p1)x_{1}=(\mu_{1},p_{1}).

    For k=1,2,,Lk=1,2,...,L,

  2. 1.

    In the kthk^{\rm th} cycle, operate the GI/GI/1GI/GI/1 queue under policy xk=(μk,pk)x_{k}=(\mu_{k},p_{k}) until DkD_{k} customers enter service.

  3. 2.

    Collect and use the data (e.g., customer delays) to build an estimator HkH_{k} for f(μk,pk)\nabla f(\mu_{k},p_{k}).

  4. 3.

    Update xk+1=Π(xkηkHk)x_{k+1}=\Pi_{\mathcal{B}}(x_{k}-\eta_{k}H_{k}).

Remark 2 (Exploration vs. exploitation).

The online nature of this algorithm makes it possible to obtain improved decisions by learning from past experience, which is in the spirit of the essential ideas of reinforcement learning where an agent (hereby the service provider) aims to tradeoff between exploration (Step 1) and exploitation (Steps 2 and 3). Effectiveness of the algorithms lies in properly choosing the algorithm parameters and devising an efficient gradient estimator HkH_{k}. For example, if DkD_{k} is too small, we are unable to generate sufficient data (we do not have much to exploit in order for devising a better policy); if DkD_{k} is too large, we incur a higher profit loss due to suboptimality of the policy in use (we do not explore enough for seeking potentially better policies). In particular, GOLiQ may be viewed as a special case of the policy gradient (PG) algorithm (the general idea of PG is to estimate the policy parameters using the gradient of the value function learned via continuous interaction with the system, see for example Sutton and Barto (2018)). To put this into perspective, the policy in the present paper is specified by a pair of parameters (μ,p)(\mu,p), and in each iteration, we update the policy parameters using an estimated policy gradient HkH_{k} learned from data of the queueing model. In the subsequent sections, we give detailed regret analysis that can be used to establish optimal algorithm parameters (Section 4) and develop an efficient gradient estimator (Section 5).

3.3 System Dynamics under GOLiQ

We explain explicitly the dynamic of the queueing system under GOLiQ, with the system starting empty. We first define notations for relevant performance functions. For k1k\geq 1, let TkT_{k} be the length of cycle kk in the units of time, and let DkD_{k} be the total number of customers who enter service in cycle kk. For n=1,2,,Dkn=1,2,...,D_{k}, let WnkW_{n}^{k} be the waiting time of the nthn^{\rm th} customer that enters service in cycle kk. We define W0kWDk1k1W_{0}^{k}\equiv W_{D_{k-1}}^{k-1}. We use the two i.i.d. random sequences VnkV_{n}^{k} and UnkU_{n}^{k} to construct the service and inter-arrival times in cycle kk, n=1,2,,Dkn=1,2,...,D_{k}. In particular, VnkV_{n}^{k} corresponds to the service time of the customer n1n-1, and UnkU_{n}^{k} corresponds to the inter-arrival time between customers n1n-1 and nn in cycle kk. Let λkλ(pk)\lambda_{k}\equiv\lambda(p_{k}). Last, we use QkQ_{k} to denote the number of existing customers (those who arrive in previous cycles) at the beginning of cycle in kk, with Q1=0Q_{1}=0. We will have Qk1Q_{k}\geq 1 for k2k\geq 2, as we shall explain soon, according to our updating procedure. The detailed dynamics of the queueing system in cycle kk is summarized as follows:

  • Updating the control policy. In cycle kk, we adopt the pricing and staffing policy (pk,μk)(p_{k},\mu_{k}). The service time of customer n1n-1 in cycle kk is Snk=Vnk/μkS_{n}^{k}=V_{n}^{k}/\mu_{k} for n=1,,Dkn=1,...,D_{k}. Cycle kk ends as soon as a total number of DkD_{k} (of which the value is to be determined later) customers have entered service. So, customer DkD_{k} will receive service in cycle k+1k+1 (with service time S1k+1S_{1}^{k+1}) and the queue leftover consists of at least one customer, i.e., Qk+11Q_{k+1}\geq 1 for a new cycle k+1k+1, which begins under a new policy (pk+1,μk+1)(p_{k+1},\mu_{k+1}) as follows:

    • Service rate. The service rate is updated to μk+1\mu_{k+1} immediately as the new cycle begins, so that all existing customers will undergo service times with rate μk+1\mu_{k+1}.

    • Service fee. The price remains pkp_{k} at the beginning of cycle k+1k+1 and evolves to pk+1p_{k+1} immediately after the first new customer arrives in the new cycle; we charge this customer with pkp_{k} (because its interarrival time is modulated by pkp_{k}) and all subsequent customers in cycle k+1k+1 with pk+1p_{k+1}.

  • Leftovers from previous cycles. For k2k\geq 2, at the beginning of cycle kk, there are Qk1Q_{k}-1 customers waiting in queue indexed by nn from 1 to Qk1Q_{k}-1. The customer who just enters service is indexed by 0. We update the price from pk1p_{k-1} to pkp_{k} right after the first new customer (indexed by QkQ_{k}) arrives in a new cycle. As a consequence, the prices charged to customers 1,2,,Qk1,2,...,Q_{k} are not yet updated to pkp_{k}. Denote by pnkp_{n}^{k} and λnkλ(pnk)\lambda_{n}^{k}\equiv\lambda(p_{n}^{k}) as the price and arrival rate for customer nn in cycle kk, respectively, for 1nQk1\leq n\leq Q_{k}. The corresponding interarrival time is τnk=Unk/λnk\tau_{n}^{k}=U_{n}^{k}/\lambda_{n}^{k}. In case Qk1>Dk1Q_{k-1}>D_{k-1}, some queueing leftover are customers from earlier cycles. So here pnk{p1,p2,,pk1}p_{n}^{k}\in\{p_{1},p_{2},...,p_{k-1}\}. In addition, in case Qk>DkQ_{k}>D_{k}, part of QkQ_{k} will continue to remain in cycle k+1k+1 and we will have, for example, p1k+1=pDk+1kp^{k+1}_{1}=p^{k}_{D_{k}+1}.

  • New arrivals. We denote interarrival times for new customers in cycle kk by τnk=Unk/λk\tau_{n}^{k}=U_{n}^{k}/\lambda_{k} for n=Qk+1,,Dkn=Q_{k}+1,...,D_{k} if DkQk+1D_{k}\geq Q_{k}+1. (As will soon become clear, the case DkQkD_{k}\leq Q_{k} is a rare event with a negligible probability under appropriate algorithm settings, see Remark 3.)

  • Customer delay. Customers’ waiting times in cycle kk are characterized by the recursions

    Wnk={(Wn1k+VnkμkUnkλnk)+ for 1nQkDk;(Wn1k+VnkμkUnkλk)+ for (Qk+1)(Dk+1)nDk.,W0k=WDk1k1,W_{n}^{k}=\begin{cases}\left(W_{n-1}^{k}+\frac{V^{k}_{n}}{\mu_{k}}-\frac{U^{k}_{n}}{\lambda^{k}_{n}}\right)^{+}&\text{ for }1\leq n\leq Q_{k}\wedge D_{k};\\ \left(W_{n-1}^{k}+\frac{V^{k}_{n}}{\mu_{k}}-\frac{U^{k}_{n}}{\lambda_{k}}\right)^{+}&\text{ for }(Q_{k}+1)\wedge{(D_{k}+1)}\leq n\leq D_{k}.\end{cases},\quad W_{0}^{k}=W_{D_{k-1}}^{k-1}, (6)

    where x+max{x,0}x^{+}\equiv\max\{x,0\}.

  • Server’s busy time. The age of the server’s busy time observed by customer nn upon arrival, which is the length of time the server has been busy since the last idleness, is given by the recursions

    Xnk={(Xn1k+Unkλnk)𝟏{Wnk>0} for 1nQkDk;(Xn1k+Unkλk)𝟏{Wnk>0} for (Qk+1)(Dk+1)nDk.,X0k=XDk1k1,X_{n}^{k}=\begin{cases}\left(X_{n-1}^{k}+\frac{U^{k}_{n}}{\lambda^{k}_{n}}\right){\bf 1}_{\{W_{n}^{k}>0\}}&\text{ for }1\leq n\leq Q_{k}\wedge D_{k};\\ \left(X_{n-1}^{k}+\frac{U^{k}_{n}}{\lambda_{k}}\right){\bf 1}_{\{W_{n}^{k}>0\}}&\text{ for }(Q_{k}+1)\wedge{(D_{k}+1)}\leq n\leq D_{k}.\end{cases},\quad X_{0}^{k}=X_{D_{k-1}}^{k-1}, (7)

    where the indicator 𝟏A{\bf 1}_{A} is 1 if AA occurs and is 0 otherwise.

We provide explanations for (6) and (7). First, recursion (6) simply follows from Lindley’s equation. Next, recursion (7) follows from the fact that, for customer nn, if the queue is empty upon its arrival, the observed busy time is simply 0 by definition; otherwise, the server must have been busy since the arrival of the previous customer and therefore, the observed busy time by customer nn should extend that of customer n1n-1 by an additional inter-arrival time. As we shall see later, both the delay and busy time observed by customers will be important ingredients (i.e., data) for building the gradient estimator of the online learning algorithm.

Remark 3 (Clearance of the leftover QkQ_{k}).

As explained above, QkQ_{k} is random and unbounded, while in our algorithm design, the cycle length DkD_{k} is deterministic. So it is indeed possible the remaining queue content may not be all cleared in cycle kk (i.e., Dk<QkD_{k}<Q_{k}). We will see later in the regret analysis that our choice of DkD_{k} leads to a small probability of uncleared leftovers and thus the impact of the rare event {Dk<Qk}\{D_{k}<Q_{k}\} is negligible.

In Figure 1, we further illustrate how the service price and service rate are updated by showing the ordering of all relative events as a new cycle begins. We emphasize that (i) the service rate μk1\mu_{k-1} is updated to μk\mu_{k} immediately when a new cycle kk begins, which is triggered as soon as the last one of Dk1D_{k-1} customers enters service; and (ii) the service price pk1p_{k-1} is updated to pkp_{k} only after the first external arrival occurs in the new cycle kk (we honor our previous prices for all customers who arrive in the previous cycle).

Refer to caption
Figure 1: On the timing of the update of pkp_{k} and μk\mu_{k} under GOLiQ.

We end this section by providing a uniform boundedness result for all relevant queueing functions. This result below will be used in the next sections to establish desired regret bounds. The proof follows from a stochastic ordering approach and is given in Section 8.1.

Lemma 1.

((Uniform boundedness of relevant queueing functions))
Under Assumptions 1 and 2, there exists a finite positive constant M>0M>0 such that for any sequences (μk,pk)(\mu_{k},p_{k})\in\mathcal{B} and Dk1D_{k}\geq 1, we have, for all k1k\geq 1, 1nDk1\leq n\leq D_{k} and 1m41\leq m\leq 4, and η>0\eta>0 as defined in Assumption 2,

𝔼[(Wnk)m],𝔼[(Xnk)m],𝔼[(Qk)m],𝔼[exp(ηWnk)]and𝔼[exp(ηQk)]\mathbb{E}[(W_{n}^{k})^{m}],\quad\mathbb{E}[(X_{n}^{k})^{m}],\quad\mathbb{E}[(Q_{k})^{m}],\quad\mathbb{E}[\exp(\eta W_{n}^{k})]\quad\text{and}\quad\mathbb{E}[\exp(\eta Q_{k})]

are all bounded by MM.

4 Regret Analysis

The online learning approach described in Section 3.2 is a data-driven method, it should contunue to generate improved solutions that will eventually converge to the true optimal solution as the server’s experience accumulates (by serving more and more customers). The performance of GOLiQ is measured by the so-called regret, which can be interpreted as the cost to pay, over the time or the number of samples, for the algorithm to learn the optimal policy. In this section, we give a formal definition of the regret and conduct the regret analysis for our online learning algorithm.

The expected net cost of the queueing system incurred in cycle kk is

ρk=𝔼[n=1QkDk(h0(Wnk+Snk)pnk)+n=Qk+1Dk(h0(Wnk+Snk)pk)+c(μk)Tk],\displaystyle\rho_{k}=\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}(h_{0}(W_{n}^{k}+S_{n}^{k})-p^{k}_{n})+\sum_{n=Q_{k}+1}^{D_{k}}(h_{0}(W_{n}^{k}+S_{n}^{k})-p_{k})+c(\mu_{k})T_{k}\right], (8)

where the summation n=Qk+1Dk\sum_{n=Q_{k}+1}^{D_{k}}\cdot is 0 in case Dk<Qk+1D_{k}<Q_{k}+1. The total regret accumulated in the first LL cycles is

R(L)k=1LRk,whereRkρkf(μ,p)𝔼[Tk]\displaystyle R(L)\equiv\sum_{k=1}^{L}R_{k},\quad\text{where}\quad R_{k}\equiv\rho_{k}-f(\mu^{*},p^{*})\mathbb{E}[T_{k}] (9)

is regret in cycle kk (the expected system cost in cycle kk minus the optimal cost).

Remark 4.

Following Huh et al. (2009) and Jia et al. (2021), our regret defined in (9) is computed by accumulating the difference between the steady-state maximum profit under (μ,p)(\mu^{*},p^{*}) and the expected profit earned under GOLiQ. However, one may find such a definition to be somewhat too demanding; it appears to be more reasonable if we were to benchmark with the nonstationary dynamics under (μ,p)(\mu^{*},p^{*}), rather than the steady-state performance. Nevertheless, our numerical studies confirm that the nuance of the two aforementioned regret definitions is negligible. See Section 11.5.

Separation of regret.

To treat the total regret defined in (9), we separate it into two parts: regret of nonstationarity which quantifies the error due to the system’s transient performance, and regret of suboptimality which accounts for the suboptimality error due to the present policy. In detail, we write

Rk=(ρk𝔼[f(μk,pk)Tk])R1,k+𝔼[Tk(f(μk,pk)f(μ,p))]R2,k,\displaystyle R_{k}=\underbrace{(\rho_{k}-\mathbb{E}[f(\mu_{k},p_{k})T_{k}])}_{\equiv R_{1,k}}+\underbrace{\mathbb{E}[T_{k}(f(\mu_{k},p_{k})-f(\mu^{*},p^{*}))]}_{\equiv R_{2,k}}, (10)

so that

R(L)=k=1LR1,k+k=1LR2,kR1(L)+R2(L).\displaystyle R(L)=\sum_{k=1}^{L}R_{1,k}+\sum_{k=1}^{L}R_{2,k}\equiv R_{1}(L)+R_{2}(L). (11)

Intuitively, R1,kR_{1,k} measures the performance error due to transient queueing dynamics (regret of nonstationarity), while R2,kR_{2,k} accounts for the suboptimality error of control parameters (μk,pk)(\mu_{k},p_{k}) (regret of suboptimality).

In what follows, we will analyze the two terms R1(L)R_{1}(L) and R2(L)R_{2}(L) separately. To treat R1(L)R_{1}(L), we develop in Section 4.1 a new framework to analyze the transient queueing behavior using the coupling technique (Theorem 1). The development of the theoretical bound for R2(L)R_{2}(L) is given in Section 4.2 (Theorem 2). Results in these sections provide convenient conditions that facilitate the convergence and regret bound analysis of our GOLiQ algorithm for GI/GI/1 queues (which is to be given in Section 5). The roadmap of the theoretical analysis is depicted in Figure 2.

Refer to caption
Figure 2: Roadmap of regret analysis and algorithm design

4.1 Regret of Nonstationarity

In this part, we analyze the transient queueing dynamics, base on which we develop a theoretical upper bound for R1(L)R_{1}(L). As we shall see later in Section 5, this analysis is also essential to bounding the bias BkB_{k} and variance 𝒱k\mathcal{V}_{k} of the gradient estimators for GOLiQ.

A crude O(L)O(L) bound. Roughly speaking, since the parameters μ,p\mu,p and functions λ()\lambda(\cdot), c()c(\cdot) are all bounded, the regret R1(L)R_{1}(L) is in the same order as the transient bias of the waiting time process, i.e.,

R1(L)k=1LO(n=1Dk(𝔼[Wn(μk,pk)]𝔼[W(μk,pk)])).\displaystyle R_{1}(L)\approx\sum_{k=1}^{L}O\left(\sum_{n=1}^{D_{k}}\left(\mathbb{E}[W_{n}(\mu_{k},p_{k})]-\mathbb{E}[W_{\infty}(\mu_{k},p_{k})]\right)\right).

Here we use W(μ,p)W_{\infty}(\mu,p) to denote the steady-state waiting time of the GI/GI/1GI/GI/1 queue with parameter (μ,p)(\mu,p)\in\mathcal{B}. Under the uniform stability condition (Assumption 1), it is not difficult to show that there exist positive constants γ>0\gamma>0 and K>0K>0, independent of kk and (μk,pk)(\mu_{k},p_{k}) such that

|𝔼[Wnk]𝔼[W(μk,pk)]|eγnK.\left|\mathbb{E}[W_{n}^{k}]-\mathbb{E}[W_{\infty}(\mu_{k},p_{k})]\right|\leq e^{-\gamma n}K.

Then, as a direct consequence, we have

n=1Dk(𝔼[Wn(μk,pk)]𝔼[W(μk,pk)])K1eγR1(L)=O(L).\sum_{n=1}^{D_{k}}\left(\mathbb{E}[W_{n}(\mu_{k},p_{k})]-\mathbb{E}[W_{\infty}(\mu_{k},p_{k})]\right)\leq\frac{K}{1-e^{-\gamma}}\leavevmode\nobreak\ \Rightarrow\leavevmode\nobreak\ R_{1}(L)=O(L).

An analogue of the above O(L)O(L) bound is given by Huh et al. (2009) (Lemma 11) in an inventory model.

An improved o(L)o(L) bound. In the rest of this subsection, we will conduct a more delicate analysis on the transient performance of the queueing system, and our analysis will render a (tighter) sub-linear bound R1(L)=o(L)R_{1}(L)=o(L) (of which the exact order depends on the concrete algorithm, as we shall see later).

Theorem 1.

((Regret of nonstationarity)) Suppose that Assumptions 1 and 2 hold. In addition, assume that the following conditions are satisfied for some constant K2>0K_{2}>0 and 0<α10<\alpha\leq 1:

  1. (a)(a)

    6log(k)/min(γ,η)DkK2k2α\lceil 6\log(k)/\min(\gamma,\eta)\rceil\leq D_{k}\leq K_{2}k^{2-\alpha};

  2. (b)(b)

    𝔼[xkxk+12]K2k2α\mathbb{E}[\|x_{k}-x_{k+1}\|^{2}]\leq K_{2}k^{-2\alpha},

where the constants η\eta and γ\gamma are defined in Assumption 2. Then, there exists a positive constant K>0K>0 such that

R1,kKkαlog(k),k2andR1(L)Kk=1Lkαlog(k),L2.R_{1,k}\leq K\cdot k^{-\alpha}\log(k),\quad k\geq 2\leavevmode\nobreak\ \quad\text{and}\quad R_{1}(L)\leq K\sum_{k=1}^{L}k^{-\alpha}\log(k),\quad L\geq 2. (12)
Remark 5.

As will become clear later in Section 5, we obtain a bound R1(L)=O(log(L)2)R_{1}(L)=O(\log(L)^{{2}}) for Algorithm 1 by validating Condition (b) in Theorem 1 with α=1\alpha=1, which is much tighter than the crude O(L)O(L) bound. This O(log(L)2)O(\log(L)^{{2}}) bound for R1(L)R_{1}(L) is critical to achieving an overall logarithmic regret bound in the total number of served customers. An explicit expression of constant KK is given in (8.5).

4.1.1 Roadmap of the proof of Theorem 1

Our point of departure in proving Theorem 1 is to decompose R1,kR_{1,k} into three terms. We shall split each cycle into a warm-up period consisting the first d~k=5log(k)/min(γ,η)<Dk\tilde{d}_{k}=\lceil 5\log(k)/\min(\gamma,\eta)\rceil<D_{k} customers and the near-stationary period consisting of all remaining customers, where γ,η>0\gamma,\eta>0 are as defined in Assumption 2. The three parts are: transient error in the near-stationary period (I1I_{1}), transient error in the warm-up period (I2I_{2}) and the remaining error (I3I_{3}). The detailed separation is given below

R1,k\displaystyle R_{1,k} =ρk𝔼[f(μk,pk)Tk]\displaystyle=\rho_{k}-\mathbb{E}[f(\mu_{k},p_{k})T_{k}]
=𝔼[n=1QkDk(h0(Wnk+Snk)pnk)+n=Qk+1Dk(h0(Wnk+Snk)pk)+c(μk)Tkf(μk,pk)Tk]\displaystyle=\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}(h_{0}(W_{n}^{k}+S_{n}^{k})-p_{n}^{k})+\sum_{n=Q_{k}+1}^{D_{k}}(h_{0}(W_{n}^{k}+S_{n}^{k})-p_{k})+c(\mu_{k})T_{k}-f(\mu_{k},p_{k})T_{k}\right]
=h0𝔼[n=d~k+1Dk(Wnkw(μk,pk))]I1+h0𝔼[n=1d~k(Wnkw(μk,pk))]I2\displaystyle=h_{0}\underbrace{\mathbb{E}\left[\sum_{n=\tilde{d}_{k}+1}^{D_{k}}\left(W_{n}^{k}-w(\mu_{k},p_{k})\right)\right]}_{\equiv I_{1}}+h_{0}\underbrace{\mathbb{E}\left[\sum_{n=1}^{\tilde{d}_{k}}\left(W_{n}^{k}-w(\mu_{k},p_{k})\right)\right]}_{\equiv I_{2}}
+𝔼[(DkλkTk)(h0w(μk,pk)+h0μkpk)]+𝔼[n=1QkDk(pkpnk)]I3.\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +\underbrace{\mathbb{E}\left[(D_{k}-\lambda_{k}T_{k})(h_{0}w(\mu_{k},p_{k})+\frac{h_{0}}{\mu_{k}}-p_{k})\right]+\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}(p_{k}-p_{n}^{k})\right]}_{\equiv I_{3}}.

The term w(μ,p)𝔼[W(μ,p)]w(\mu,p)\equiv\mathbb{E}[W_{\infty}(\mu,p)] is a function in (μ,p)(\mu,p) and equals to the steady-state expected waiting time under parameter (μ,p)(\mu,p)\in\mathcal{B}. To prove R1,k=O(kαlog(k))R_{1,k}=O(k^{-\alpha}\log(k)), it suffices to show that Ii=O(kαlog(k))I_{i}=O(k^{-\alpha}\log(k)) for i=1,2,3i=1,2,3. Below we explain the main ideas of our treatment to I1I_{1}, I2I_{2} and I3I_{3}:

  • I1I_{1}: We will first show that, after serving dk4log(k)/min(γ,η)<d~kd_{k}\equiv\lceil 4\log(k)/\min(\gamma,\eta)\rceil<\tilde{d}_{k} customers, with a sufficiently high probability, all QkQ_{k} existing customers have left the system and {Wnk:n=dk,,Dk}\{W_{n}^{k}:n=d_{k},...,D_{k}\} follows the dynamic of a GI/GI/1 queue with arrival rate λk\lambda_{k} and service rate μk\mu_{k}. Then, we show that WnkW_{n}^{k}, for ndkn\geq d_{k}, will converge exponentially fast to the steady state (Lemma 2). Hence WnkW_{n}^{k} is close to W(μk,pk)W_{\infty}(\mu_{k},p_{k}) for nd~kn\geq\tilde{d}_{k}, warranting a small transient error I1I_{1}.

  • I2I_{2}: Note that the d~k\tilde{d}_{k} customers in the warm-up period include those leftovers from previous periods, and their arrival rates λnk\lambda_{n}^{k} are different from λk\lambda_{k}. To control the impact of such difference between λnk\lambda_{n}^{k} and λk\lambda_{k}, we first establish almost sure Lipschitz continuity of waiting times (for queues having customer-heterogeneous arrival rates) with respect to the arrival rate sequence and the initial state (Lemma 3). As a consequence, we can prove that |𝔼[Wnkw(μk1,pk1)]|=O(kα)|\mathbb{E}[W_{n}^{k}-w(\mu_{k-1},p_{k-1})]|=O(k^{-\alpha}) taking advantage of the fact that the initial state W0k=WDk1k1W_{0}^{k}=W_{D_{k-1}}^{k-1} is close to the steady-state W(μk1,pk1)W_{\infty}(\mu_{k-1},p_{k-1}). Then, we show that the steady-state distribution is smooth in the parameter (μ,p)(\mu,p) (Lemma 4), i.e., 𝔼[|w(μk1,pk1)w(μk,pk)|]=O(𝔼|μkμk1|+𝔼|pkpk1|)=O(kα)\mathbb{E}[|w(\mu_{k-1},p_{k-1})-w(\mu_{k},p_{k})|]=O(\mathbb{E}|\mu_{k}-\mu_{k-1}|+\mathbb{E}|p_{k}-p_{k-1}|)=O(k^{-\alpha}), which completes the analysis for I2I_{2}.

  • I3I_{3}: The term I3I_{3} will be under control because WDkkW_{D_{k}}^{k} is close to the steady-state (Lemma 2) and QkQ_{k} is uniformly bounded (Lemma 1).

Also see in Figure 3 for a graphical illustration.

Refer to caption
Figure 3: Roadmap of the analysis of the regret of nonstationarity.

Following the above roadmap, we next give detailed analysis for Ii,i=1,2,3I_{i},i=1,2,3 by establishing three lemmas (Lemmas 24). We believe that these results are not only essential to the transient analysis in the present paper, but may also be of independent interest for theoretic studies of other queueing models.

Bounding I1I_{1}. We first establish the rate at which waiting times converge to their steady state distributions. For two given sequences VnV_{n} and UnU_{n}, we say two GI/GI/1GI/GI/1 queues with the same parameter (μ,p)(\mu,p)\in\mathcal{B} are synchronously coupled if their waiting times Wn1W^{1}_{n} and Wn2W^{2}_{n} satisfy

Wni=(Wn1i+VnμUnλ(p))+, for i=1,2, and n1,W^{i}_{n}=\left(W^{i}_{n-1}+\frac{V_{n}}{\mu}-\frac{U_{n}}{\lambda(p)}\right)^{+},\text{ for }i=1,2,\text{ and }n\geq 1,

i.e., the two systems share the same sequences of service and interarrival times. The proof of Lemma 2 is given in Section 8.

Lemma 2.

((Exponential loss of memory of initial state)) Suppose two GI/GI/1GI/GI/1 queues with parameter (μ,p)(\mu,p)\in\mathcal{B} are synchronously coupled with initial waiting times W01W^{1}_{0} and W02W^{2}_{0}, respectively. Then, for the two positive constants γ\gamma and θ\theta defined in Assumption 2 and any m1m\geq 1, we have, conditional on W01W^{1}_{0} and W02W^{2}_{0},

𝔼[|Wn1Wn2|m|W01,W02]eγn(2+eμθW01+eμθW02)|W01W02|m.\mathbb{E}\left[|W^{1}_{n}-W^{2}_{n}|^{m}\leavevmode\nobreak\ |\leavevmode\nobreak\ W^{1}_{0},W^{2}_{0}\right]\leq e^{-\gamma n}(2+e^{\mu\theta W^{1}_{0}}+e^{\mu\theta W^{2}_{0}})|W_{0}^{1}-W^{2}_{0}|^{m}.

In order to bound I1I_{1}, at the beginning of each cycle kk, given (μk,pk)(\mu_{k},p_{k}), we couple W0kW_{0}^{k} with W¯0k\bar{W}_{0}^{k} that is independently drawn from the steady-state waiting time distribution W(μk,pk)W_{\infty}(\mu_{k},p_{k}). The sequence W¯nk\bar{W}_{n}^{k} is defined as

W¯nk=(W¯n1k+VnkμkUnkλk)+, for all 1nDk.\bar{W}_{n}^{k}=\left(\bar{W}_{n-1}^{k}+\frac{V_{n}^{k}}{\mu_{k}}-\frac{U_{n}^{k}}{\lambda_{k}}\right)^{+},\text{ for all }1\leq n\leq D_{k}.

Then, by definition, conditional on (μk,pk)(\mu_{k},p_{k}), 𝔼[W¯nk]=w(μk,pk)\mathbb{E}[\bar{W}_{n}^{k}]=w(\mu_{k},p_{k}) for all 1nDk1\leq n\leq D_{k}, and therefore,

|𝔼[Wnkw(μk,pk)]|𝔼[|WnkW¯nk|].\left|\mathbb{E}[W_{n}^{k}-w(\mu_{k},p_{k})]\right|\leq\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|].

As we will show in the proof of Corollary 1, {Wnk:n=dk+1,,Dk}\{W_{n}^{k}:n=d_{k}+1,...,D_{k}\} is coupled with W¯nk\bar{W}_{n}^{k} except on a set of negligible set, with dk4log(k)/min(γ,η)<d~kd_{k}\equiv\lceil 4\log(k)/\min(\gamma,\eta)\rceil<\tilde{d}_{k}. As a result, we can use Lemma 2 to construct a bound on 𝔼[|WnkW¯nk|]\mathbb{E}[|W^{k}_{n}-\bar{W}^{k}_{n}|] for n=d~k+1,,Dkn=\tilde{d}_{k}+1,...,D_{k}.

Corollary 1.

Under the conditions of Theorem 1, there exists a constant A1A\geq 1 independent of kk and (μk,pk)(\mu_{k},p_{k}), such that for all k1k\geq 1 and ndk4log(k+1)/min(γ,η)n\geq d_{k}\equiv\lceil 4\log(k+1)/\min(\gamma,\eta)\rceil,

𝔼[|WnkW¯nk|]eγ(ndk)A+2Mk2.\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|]\leq e^{-\gamma(n-d_{k})}A+2Mk^{-2}. (13)

As a direct consequence, we have I1=O(kα)I_{1}=O(k^{-\alpha}).

Bounding I2I_{2}. We first show that the waiting times WnW_{n} of a queueing model having customer-heterogeneous arrival rates are Lipschitz continuous with respect to the rates (μn,λn)(\mu_{n},\lambda_{n}) and the initial state almost surely.

Lemma 3.

(Lipschitz continuity) Consider two waiting time sequences WnW_{n} and W~n\tilde{W}_{n} for n1n\geq 1 with initial values W0W_{0} and W~0\tilde{W}_{0} respectively. Let (μn,λn)(\mu_{n},\lambda_{n}) and (μ~n,λ~n)(\tilde{\mu}_{n},\tilde{\lambda}_{n})\in\mathcal{B} be the corresponding sequences of service and arrival rates, respectively, i.e.,

Wn=(Wn1+VnμnUnλn)+andW~n=(W~n1+Vnμ~nUnλ~n)+, for n1.W_{n}=\left(W_{n-1}+\frac{V_{n}}{\mu_{n}}-\frac{U_{n}}{\lambda_{n}}\right)^{+}\quad\text{and}\quad\tilde{W}_{n}=\left(\tilde{W}_{n-1}+\frac{V_{n}}{\tilde{\mu}_{n}}-\frac{U_{n}}{\tilde{\lambda}_{n}}\right)^{+},\quad\text{ for }n\geq 1.

Suppose there exist two constants cμ,cλ>0c_{\mu},c_{\lambda}>0 such that

|μnμ~n|cμand|λnλ~n|cλ, for all n1.|\mu_{n}-\tilde{\mu}_{n}|\leq c_{\mu}\quad\text{and}\quad|\lambda_{n}-\tilde{\lambda}_{n}|\leq c_{\lambda},\quad\text{ for all }n\geq 1.

Then we have, for all n1n\geq 1,

|WnW~n||W0W~0|+(cμμ¯+cλλ¯)max(Xn,X~n)+cμμ¯max(Wn,W~n),|W_{n}-\tilde{W}_{n}|\leq|W_{0}-\tilde{W}_{0}|+\left(\frac{c_{\mu}}{\underline{\mu}}+\frac{c_{\lambda}}{\underline{\lambda}}\right)\max(X_{n},\tilde{X}_{n})+\frac{c_{\mu}}{\underline{\mu}}\max(W_{n},\tilde{W}_{n}),

where XnX_{n} and X~n\tilde{X}_{n} are the corresponding observed busy periods. In particular, XnX_{n} and X~n\tilde{X}_{n} satisfy the recursion (7) defined in Section 3.3 with any given initial values of X00X_{0}\geq 0 and X~00\tilde{X}_{0}\geq 0.

As discussed above, controlling I2I_{2} also involves bounding the difference between the mean steady-state waiting times in two consecutive cycles. Hence, we next establish a uniform high-order smoothness result for the steady-state waiting times with respect to the model parameter (μ,p)(\mu,p).

Lemma 4.

((Smoothness in μ\mu and p)p) Suppose (μi,pi)(\mu_{i},p_{i})\in\mathcal{B} for i=1,2i=1,2. Let W(μi,pi)W_{\infty}(\mu_{i},p_{i}) be the steady-state waiting time of the GI/GI/1 queue under parameter (μi,pi)(\mu_{i},p_{i}), respectively. Then, the steady-state waiting times (W(μ1,p1),W(μ2,p2))(W_{\infty}(\mu_{1},p_{1}),W_{\infty}(\mu_{2},p_{2})) can be coupled such that, there exists a constant B>0B>0 independent of (μi,pi)(\mu_{i},p_{i}) satisfying that, for all 1m41\leq m\leq 4,

𝔼[|W(μ1,p1)W(μ2,p2)|m]B(|μ1μ2|m+|p1p2|m),\mathbb{E}[|W_{\infty}(\mu_{1},p_{1})-W_{\infty}(\mu_{2},p_{2})|^{m}]\leq B\left(|\mu_{1}-\mu_{2}|^{m}+|p_{1}-p_{2}|^{m}\right),

where a closed-form expression of constant BB is given in (31).

We adopt a “coupling from the past” (CFTP) approach in the proof of Lemma 4 (see Section 8). Roughly speaking, CFTP is a synchronous coupling starting from infinite past. In the proof of Lemma 4, we shall explicitly explain how to construct the CFTP.

Now we are ready to analyze I2I_{2}. Essentially, we shall compare 𝔼[Wnk]\mathbb{E}[W_{n}^{k}] in the warm-up period with w(μk1,pk1)=𝔼[W(μk1,pk1)]w(\mu_{k-1},p_{k-1})=\mathbb{E}[W_{\infty}(\mu_{k-1},p_{k-1})]. For each cycle kk, recall that we have already coupled Wnk1W_{n}^{k-1} with a stationary sequence W¯nk1\bar{W}_{n}^{k-1} in cycle k1k-1, we then extend the sequence W¯nk1\bar{W}_{n}^{k-1} to cycle kk in the sense that

W¯Dk1+nk1=(W¯Dk1+n1k1+Vnkμk1Unkλk1)+, for n=1,2,,Dk.\bar{W}_{D_{k-1}+n}^{k-1}=\left(\bar{W}_{D_{k-1}+n-1}^{k-1}+\frac{V_{n}^{k}}{\mu_{k-1}}-\frac{U_{n}^{k}}{\lambda_{k-1}}\right)^{+},\text{ for }n=1,2,...,D_{k}.

Then, conditional on (μk1,pk1)(\mu_{k-1},p_{k-1}), 𝔼[W¯Dk1+nk1]=w(μk1,pk1)\mathbb{E}[\bar{W}_{D_{k-1}+n}^{k-1}]=w(\mu_{k-1},p_{k-1}). So we have

|𝔼[Wnkw(μk,pk)]|\displaystyle\left|\mathbb{E}[W_{n}^{k}-w(\mu_{k},p_{k})]\right| |𝔼[Wnkw(μk1,pk1)]|+𝔼[|w(μk1,pk1)w(μk,pk)|]\displaystyle\leq\left|\mathbb{E}[W_{n}^{k}-w(\mu_{k-1},p_{k-1})]\right|+\mathbb{E}\left[|w(\mu_{k-1},p_{k-1})-w(\mu_{k},p_{k})|\right]
𝔼[|WnkW¯Dk1+nk1|]+𝔼[|w(μk1,pk1)w(μk,pk)|].\displaystyle\leq\mathbb{E}\left[|W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|\right]+\mathbb{E}\left[|w(\mu_{k-1},p_{k-1})-w(\mu_{k},p_{k})|\right].

Bounding the first term by Lemma 3 and the second term by Lemma 4 yields the following bound on I2I_{2}.

Corollary 2.

Under the conditions of Theorem 1, for all k2k\geq 2 and 1nDk1\leq n\leq D_{k}, we have

𝔼[|Wnkw(μk,pk)|]=O(kα).\mathbb{E}[|W_{n}^{k}-w(\mu_{k},p_{k})|]=O(k^{-\alpha}).\quad (14)

As a direct consequence, |I2|=O(kαlog(k))|I_{2}|=O(k^{-\alpha}\log(k)).

Bounding I3I_{3}. We complete our analysis on the regret of nonstationarity by showing that I3=O(kα)I_{3}=O(k^{-\alpha}). The proof of Corollary 3 below basically follows from Lemma 1 and Lemma 2 with some similar argument as used in the proof of Corollary 2.

Corollary 3.

Under the conditions of Theorem 1, |I3|=O(kα)|I_{3}|=O(k^{-\alpha}).

Finishing the Proof of Theorem 1. Then, Theorem 1 follows immediately from Corollaries 1 to 3. A complete proof of Theorem 1, including the proofs of Corollaries 1 to 3, is given in Section 8.5 of e-companion. In particular, we provide an explicit expression of the constant KK in terms of the model parameters in (8.5).

Remark 6.

We advocate that Theorem 1 may apply to other queueing models (its scope is beyond the GI/GI/1GI/GI/1 queue), as long as one can verify three conditions for the designated model: (i) uniform boundedness for the rate of convergence to the steady state, i.e., Lemma 2, (ii) path-wise Liptschize continuity, i.e., Lemma 3, and (iii) smoothness of the stationary distributions in the control variables, i.e., Lemma 4.

4.2 Regret of Suboptimality

To bound the regret of suboptimality R2(L)R_{2}(L), we need to control the rate at which xkx_{k} converges to xx^{*}. This depends largely on the effectiveness of the estimator HkH_{k} for f(xk)\nabla f(x_{k}). In our algorithm, such effectiveness is measured by the bias BkB_{k} and variance 𝒱k\mathcal{V}_{k}. The following result shows that, if BkB_{k} and 𝒱k\mathcal{V}_{k} can be appropriately bounded, then, xkx_{k} will converge to xx^{*} rapidly and hence R2(L)R_{2}(L) can be properly bounded.

Theorem 2.

((Regret of suboptimality)) Suppose Assumptions 3 holds. If there exists a constant K31K_{3}\geq 1 such that the following conditions hold for all kk,

  1. (a)(a)

    (1+1k)β1+K02ηk\left(1+\frac{1}{k}\right)^{\beta}\leq 1+\frac{K_{0}}{2}\eta_{k},

  2. (b)(b)

    BkK08kβB_{k}\leq\frac{K_{0}}{8}k^{-\beta},

  3. (c)(c)

    ηk𝒱kK3kβ\eta_{k}\mathcal{V}_{k}\leq K_{3}k^{-\beta},

where 0<β10<\beta\leq 1 is a constant, and ηk0\eta_{k}\to 0 is the step size, then, there exists a constant C8K3/K0C\geq 8K_{3}/K_{0} with an explicit expression given in (34), such that for all k1k\geq 1,

𝔼[xkx2]Ckβ,\mathbb{E}[\|x_{k}-x^{*}\|^{2}]\leq Ck^{-\beta}, (15)

and as a consequence,

R2(L)CK1k=1L(Dkλ(p¯)+M)kβ=O(k=1LDkkβ).R_{2}(L)\leq CK_{1}\sum_{k=1}^{L}\left(\frac{D_{k}}{\lambda(\bar{p})}+M\right)k^{-\beta}=O\left(\sum_{k=1}^{L}D_{k}k^{-\beta}\right). (16)
Remark 7 (Selecting the “optimal” DkD_{k}).

The above expression (16) indicates a trade-off in the selection of the parameter DkD_{k}. On the one hand, increasing the sample size DkD_{k} reduces the bias BkB_{k} for the gradient estimator, and hence leads to a smaller value of kβk^{-\beta}. On the other hand, a larger DkD_{k} makes the system operate under a sub-optimal decision for a longer time. To this end, one may choose an optimal order (in kk) for DkD_{k} by minimizing the order of the regret as in (16).

Our proof of Theorem 2 follows an inductive approach as used in Broadie et al. (2011). Let bk𝔼[xkx2]b_{k}\equiv\mathbb{E}[\|x_{k}-x^{*}\|^{2}]. According to the SGD iteration xk+1=Π(xkηkHk)x_{k+1}=\Pi_{\mathcal{B}}(x_{k}-\eta_{k}H_{k}), we have

𝔼[xk+1x2|xk]𝔼[xkηkHkx2|xk]=xkx22ηk𝔼[Hk|xk](xkx)+ηk2𝔼[Hk2|xk].\mathbb{E}[\|x_{k+1}-x^{*}\|^{2}|x_{k}]\leq\mathbb{E}[\|x_{k}-\eta_{k}H_{k}-x^{*}\|^{2}|x_{k}]=\|x_{k}-x^{*}\|^{2}-2\eta_{k}\mathbb{E}[H_{k}|x_{k}](x_{k}-x^{*})+\eta_{k}^{2}\mathbb{E}[\|H_{k}\|^{2}|x_{k}].

Then, by Assumption 3 and the definition of Bk,𝒱kB_{k},\mathcal{V}_{k} by (4), we derive the following recursive inequality for bkb_{k}:

bk+1(1K0ηk+ηkBk)bk+ηkBk+ηk2𝒱k,k1,\displaystyle b_{k+1}\leq(1-K_{0}\eta_{k}+\eta_{k}B_{k})b_{k}+\eta_{k}B_{k}+\eta^{2}_{k}\mathcal{V}_{k},\quad k\geq 1,

and we prove (15) by induction. The full proof is given in Section 8.7 of the e-companion.

In Section 5, we apply Theorem 2 to treat our online learning algorithm (Algorithm 1) by verifying that Conditions (a)–(c) are satisfied. Because in Theorem 2, Conditions (a)–(c) are stated explicitly in terms of the step size ηk\eta_{k}, bias BkB_{k} and variance 𝒱k\mathcal{V}_{k} of the gradient estimator, these conditions may serve as useful building blocks for the design and analysis of online learning algorithms in other queueing models as well.

5 GOLiQ for the GI/GI/1GI/GI/1 Queue

In this section, we provide a concrete GOLiQ algorithm that solves the optimal pricing and capacity sizing problem (1) for a GI/GI/1GI/GI/1 queueing system. We show that the gradient f(μ,p)\nabla f(\mu,p) can be estimated “directly” from past experience (i.e., data of delay and busy times generated under the present policy). Applying the regret analysis developed in Section 4, we provide a theoretic upper bound for the overall regret in Theorem 3.

5.1 A Gradient Estimator

Following the algorithm framework outlined in Section 3.2, we now develop a detailed gradient estimator HkH_{k}. Regarding the objective function in (5), it suffices to construct estimators for the partial derivatives

μ𝔼[W(p,μ)]andp𝔼[W(p,μ)].\displaystyle\frac{\partial}{\partial\mu}\mathbb{E}[W_{\infty}(p,\mu)]\qquad\text{and}\qquad\frac{\partial}{\partial p}\mathbb{E}[W_{\infty}(p,\mu)]. (17)

Following the infinitesimal perturbation analysis (IPA) approach (see, for example, Glasserman (1992)), we next show that the partial derivatives in (17) can be expressed in terms of the steady-state distributions W(p,μ)W_{\infty}(p,\mu) and X(p,μ)X_{\infty}(p,\mu) of the waiting time process WnW_{n} and observed busy period process XnX_{n}, of which the dynamics are characterized by (6)–(7).

Lemma 5.

Suppose Assumptions 1 and 2 holds. Then, for any (μ,p)(\mu,p)\in\mathcal{B}, 𝔼[W(μ,p)]\mathbb{E}[W_{\infty}(\mu,p)] are differentiable in μ\mu and pp. Besides,

pf(μ,p)\displaystyle\frac{\partial}{\partial p}f(\mu,p) =λ(p)pλ(p)+h0λ(p)(𝔼[W(μ,p)]+𝔼[X(μ,p)]+1μ)\displaystyle=-\lambda(p)-p\lambda^{\prime}(p)+h_{0}\lambda^{\prime}(p)\left(\mathbb{E}[W_{\infty}(\mu,p)]+\mathbb{E}[X_{\infty}(\mu,p)]+\frac{1}{\mu}\right) (18)
μf(μ,p)\displaystyle\frac{\partial}{\partial\mu}f(\mu,p) =c(μ)h0λ(p)μ(𝔼[W(μ,p)]+𝔼[X(μ,p)]+1μ)\displaystyle=c^{\prime}(\mu)-h_{0}\frac{\lambda(p)}{\mu}\left(\mathbb{E}[W_{\infty}(\mu,p)]+\mathbb{E}[X_{\infty}(\mu,p)]+\frac{1}{\mu}\right)
Proof of Lemma 5.

To prove Equation (18), it suffices to work with the partial derivatives of the steady-state expectation 𝔼[W(μ,p)]\mathbb{E}[W_{\infty}(\mu,p)]. We follow the IPA analysis in Glasserman (1992) and Chen (2014).

Given (μ,p)(\mu,p), we define r(p)=1/λ(p)r(p)=1/\lambda(p) and rewrite the recursion (6) as

Wn(μ,p)=(Wn1(μ,p)+Vnμr(p)Un)+.W_{n}(\mu,p)=\left(W_{n-1}(\mu,p)+\frac{V_{n}}{\mu}-r(p)U_{n}\right)^{+}.

Define the derivative process ZnrWn(μ,p)Z_{n}\equiv\frac{\partial}{\partial r}W_{n}(\mu,p), then by chain rule, we have

Zn=rWn(μ,p)=r(Wn1(μ,p)+VnμrUn)+={rWn1Un=Zn1Un if Wn>0;0 if Wn=0.Z_{n}=\frac{\partial}{\partial r}W_{n}(\mu,p)=\frac{\partial}{\partial r}\left(W_{n-1}(\mu,p)+\frac{V_{n}}{\mu}-rU_{n}\right)^{+}=\begin{cases}\frac{\partial}{\partial r}W_{n-1}-U_{n}=Z_{n-1}-U_{n}&\text{ if }W_{n}>0;\\ 0&\text{ if }W_{n}=0.\end{cases}

and obtain a recursion Zn=(Zn1Un)𝟏{Wn>0}Z_{n}=(Z_{n-1}-U_{n}){\bf 1}_{\{W_{n}>0\}}. Let Z~nZn/λ(p)\tilde{Z}_{n}\equiv-Z_{n}/\lambda(p). Then, it is straightforward to see that Z~n\tilde{Z}_{n} satisfies the recursion given in (7) as the observed busy period XnX_{n}, i.e.,

Z~n=(Z~n1+Unλ(p))𝟏(Wn>0).\tilde{Z}_{n}=\left(\tilde{Z}_{n-1}+\frac{U_{n}}{\lambda(p)}\right){\bf 1}(W_{n}>0).

Under the assumption that the queueing system is stable, the limit Z~\tilde{Z}_{\infty} should be equal in distribution to XX_{\infty}. Therefore, we formally derive

r𝔼[W(μ,p)]=𝔼[Z]=λ(p)𝔼[Z~]=λ(p)𝔼[X(μ,p)].\frac{\partial}{\partial r}\mathbb{E}[W_{\infty}(\mu,p)]=\mathbb{E}[Z_{\infty}]=-\lambda(p)\mathbb{E}[\tilde{Z}_{\infty}]=-\lambda(p)\mathbb{E}[X_{\infty}(\mu,p)]. (19)

The above heuristics can be made rigorous by verifying exchanges of limits using the results in Glasserman (1992), and we refer the readers to Section 8.9 for detailed explanations. Using (19), we can derive the partial derivative of the steady-state waiting time with respect to price pp as below:

p𝔼[W(μ,p)]=r𝔼[W(μ,p)]r(p)p=λ(p)𝔼[X(μ,p)]λ(p)λ(p)2=𝔼[X(μ,p)]λ(p)λ(p).\displaystyle\frac{\partial}{\partial p}\mathbb{E}[W_{\infty}(\mu,p)]=\frac{\partial}{\partial r}\mathbb{E}[W_{\infty}(\mu,p)]\frac{\partial r(p)}{\partial p}=-\lambda(p)\mathbb{E}[X_{\infty}(\mu,p)]\cdot-\frac{\lambda^{\prime}(p)}{\lambda(p)^{2}}=\mathbb{E}[X_{\infty}(\mu,p)]\frac{\lambda^{\prime}(p)}{\lambda(p)}.

Now we turn to μ𝔼[W(μ,p)]\frac{\partial}{\partial\mu}\mathbb{E}[W_{\infty}(\mu,p)]. Let Z^nμWn(μ,p)\hat{Z}_{n}\equiv\mu W_{n}(\mu,p), it is easy to check that Z^n=(Z^n1+VnμUn/λ(p))+\hat{Z}_{n}=\left(\hat{Z}_{n-1}+V_{n}-\mu U_{n}/\lambda(p)\right)^{+}. Then, following steps similar to those for (19), we have

μ𝔼[Z^(μ,p)]=𝔼[X(μ,p)].\frac{\partial}{\partial\mu}\mathbb{E}[\hat{Z}_{\infty}(\mu,p)]=-\mathbb{E}[X_{\infty}(\mu,p)].

Therefore,

𝔼[X(μ,p)]=μ𝔼[Z^(μ,p)]=μ𝔼[μW(μ,p)]=μμ𝔼[W(μ,p)]+𝔼[W(μ,p)],\displaystyle-\mathbb{E}[X_{\infty}(\mu,p)]=\frac{\partial}{\partial\mu}\mathbb{E}[\hat{Z}_{\infty}(\mu,p)]=\frac{\partial}{\partial\mu}\mathbb{E}[\mu W_{\infty}(\mu,p)]=\mu\frac{\partial}{\partial\mu}\mathbb{E}[W_{\infty}(\mu,p)]+\mathbb{E}[W_{\infty}(\mu,p)],

and hence, 𝔼[W(μ,p)]/μ=(𝔼[X(μ,p)]+𝔼[W(μ,p)])/μ.\partial\mathbb{E}[W_{\infty}(\mu,p)]/\partial\mu=-(\mathbb{E}[X_{\infty}(\mu,p)]+\mathbb{E}[W_{\infty}(\mu,p)])/\mu.

Finally, plugging the expressions of the two partial derivatives into f\nabla f yields (18). ∎

5.2 GOLiQ: A G/G/1G/G/1 Version

Utilizing results in Lemma 5, we are ready to design a G/G/1G/G/1 version of the GOLiQ algorithm, where we estimate the terms 𝔼[W(μ,p)]\mathbb{E}[W_{\infty}(\mu,p)] and 𝔼[X(μ,p)])\mathbb{E}[X_{\infty}(\mu,p)]) in the partial derivatives (18) using the finite-sample averages of WnkW^{k}_{n} and XnkX_{n}^{k} observed in each cycle kk. The formal description of the algorithm is given in Algorithm 1.

Input: number of cycles LL;
parameters 0<ξ<10<\xi<1, DkD_{k}, ηk\eta_{k} for k=1,2,,Lk=1,2,...,L;
initial value x1=(μ1,p1)x_{1}=(\mu_{1},p_{1});
for n=1,2,,Dkn=1,2,...,D_{k} do
       operate the system under xk=(μk,pk)x_{k}=(\mu_{k},p_{k}) until DkD_{k} customers enter service;
       observe (Wnk,Xnk)(W^{k}_{n},X^{k}_{n}) for n=1,2,,Dkn=1,2,...,D_{k};
       randomly draw Z{1,2}Z\in\{1,2\};
       if Z=1Z=1 then
             hλ(pk)pkλ(pk)+h0λ(pk)[1Dk(1ξ)n>ξDkDk(Xnk+Wnk)+1μk]h\leftarrow-\lambda(p_{k})-p_{k}\lambda^{\prime}(p_{k})+h_{0}\lambda^{\prime}(p_{k})\left[\frac{1}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(X^{k}_{n}+W^{k}_{n}\right)+\frac{1}{\mu_{k}}\right];
             Hk(2h,0)H_{k}\leftarrow(2h,0);
            
      else
             hc(μk)h0λ(p)μk[1Dk(1ξ)n>ξDkDk(Xnk+Wnk)+1μk]h\leftarrow c^{\prime}(\mu_{k})-h_{0}\frac{\lambda(p)}{\mu_{k}}\left[\frac{1}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(X^{k}_{n}+W^{k}_{n}\right)+\frac{1}{\mu_{k}}\right];
             Hk(0,2h)H_{k}\leftarrow(0,2h);
            
       end if
      update: xk+1=Π(xkηkHk)x_{k+1}=\Pi_{\mathcal{B}}(x_{k}-\eta_{k}H_{k});
      
end for
Algorithm 1 GOLiQ for GI/GI/1GI/GI/1 Queues
Remark 8 (On the queueing leftover).

We elaborate more on our treatment of QkQ_{k}, the existing queue content at the beginning of cycle kk. First, the content of QkQ_{k} includes customer arrivals in cycle k1k-1 and possibly even earlier cycles. Second, it is also possible to have Qk>DkQ_{k}>D_{k}. Nevertheless, these above cases do not affect the implementation of Algorithm 1 (note that Algorithm 1 gives a gradient estimator using (1ξ)Dk\lceil(1-\xi)D_{k}\rceil samples without specifying any of the above events). Of course, the event {Qk>Dk}\{Q_{k}>D_{k}\} does play a role in our theoretic regret analysis, but it is a rare event with a negligible probability (in fact, we show that the probability will be suppressed to O(k3)O(k^{-3}), also see Remark 3.

Selecting the “optimal” hyperparameters.

The effectiveness of Algorithm 1 largely hinges upon carefully selecting the three hyperparameters: (i) the warm-up time ξ(0,1)\xi\in(0,1), (ii) the learning step size ηk>0\eta_{k}>0, and (iii) the exploration sample size Dk>0D_{k}>0. Except for ξ\xi which has no bearing on the theoretical order of the regret, both the other two parameters DkD_{k} and ηk\eta_{k} will play critical roles in our regret analysis. We next give the forms of the two parameters. First, The step size ηk\eta_{k} satisfies

ηk=cη/k,withcη2/K0,\displaystyle\eta_{k}=c_{\eta}/k,\qquad\text{with}\quad c_{\eta}\geq 2/K_{0}, (20)

where K0K_{0} is the convexity bound specified in Assumption 3. Next, the sample size DkD_{k} satisfies

Dk=aD+bDlog(k),withaDCDmin(γ,η)ξandbD8min(γ,η)ξ,\displaystyle D_{k}=a_{D}+b_{D}\log(k),\quad\text{with}\quad a_{D}\geq\frac{C_{D}}{\min(\gamma,\eta)\xi}\quad\text{and}\quad b_{D}\geq\frac{8}{\min(\gamma,\eta)\xi}, (21)

for any warm-up parameter ξ(0,1)\xi\in(0,1), where γ\gamma and η\eta are the constants specified in Assumption 2, and the explicit formula of CDC_{D} is given in (36).

The above-mentioned forms of ηk\eta_{k} and DkD_{k} are obtained from our detailed regret analysis where we show that the structure of (20) and (21) “minimizes” the order of the overall regret (in the sense of maximizing α\alpha and β\beta as in Theorems 1 and 2). Although the theoretical bounds of parameters aDa_{D}, bDb_{D} and cηc_{\eta} are imposed to facilitate our regret analysis, our numerical experiments show that GOLiQ remains effective even when the theoretical bounds are relaxed, confirming the robustness of GOLiQ to these hyperparameters; see Section 9 for details. Next, we show that Algorithm 1 has a regret bound of O((log(ML)2)O((\log(M_{L})^{2}) with MLk=1LDkM_{L}\equiv\sum_{k=1}^{L}D_{k} being the cumulative number of customers served by cycle LL. We do so by verifying that our choices of DkD_{k} and ηk\eta_{k} (along with the corresponding BkB_{k} and 𝒱k\mathcal{V}_{k}), will satisfy the conditions in Theorem 1 and Theorem 2.

Theorem 3.

((Regret Bound for Algorithm 1))
Suppose Assumptions 1 to 3 hold, and ηk\eta_{k} and DkD_{k} are selected according to (20) and (21). Then

  1. (i)(i)

    There exists a positive constant K3>0K_{3}>0 such that

    BkK08kandηk𝒱kK3k.\displaystyle B_{k}\leq\frac{K_{0}}{8k}\quad\text{and}\quad\eta_{k}\mathcal{V}_{k}\leq\frac{K_{3}}{k}.
  2. (ii)(ii)

    There exists a positive constant K2>0K_{2}>0 such that

    𝔼[xkxk+12]K2k2.\mathbb{E}[\|x_{k}-x_{k+1}\|^{2}]\leq K_{2}k^{-2}. (22)
  3. (iii)(iii)

    As a consequence of (i)(i) and (ii)(ii), the regret for Algorithm 1

    R(L)Kalglog(ML)2=O(log(ML)2).\displaystyle R(L)\leq K_{\text{alg}}\log(M_{L})^{2}=O(\log(M_{L})^{2}). (23)
Remark 9 (On the logarithmic regret bound (23)).

Below we provide some additional discussions on the regret bound (23):

  1. (i)(i)

    On the constant KalgK_{\text{alg}}. The explicit expression for the constant KalgK_{\text{alg}}, although complicated, is given by (38). It involves error bound corresponding to the transient behavior of the queueing system, the bias and variance of the gradient estimator, moment bounds on the queue length and other model parameters. One can verify that KalgK_{\text{alg}} is decreasing in the convergence rate coefficient γ\gamma and increasing in the moment bounds of the queue length MM.

  2. (ii)(ii)

    On the first logarithmic term. Consider an SGD algorithm in that an unbiased gradient estimator HkH_{k} with a bounded variance can be evaluated using a single data point (i.e., Bk=0B_{k}=0, 𝒱k=O(1)\mathcal{V}_{k}=O(1)), it has been proved the scaled error k1/2(xkx)k^{-1/2}(x_{k}-x^{*}) converges in distribution to a non-zero random variable (Theorem 2.1 in Chapter 10 of Kushner and Yin (2003)). Hence, the convergence rate for xkx2\|x_{k}-x^{*}\|^{2} that any SGD-based algorithm can achieve is at best O(k1)O(k^{-1}) (yielding a cumulative regret of order O(log(k))O(\log(k))), which is exactly the rate of convergence established by our online algorithm (taking β=1\beta=1 in Theorem 3). In this sense, GOLiQ is already achieving an “optimal” convergence rate. We point out that, due to the nonstationary error of the queueing system, our gradient estimator is obtained using an increasing number of data points in order to guarantee a reasonably small bias.

  3. (iii)(iii)

    On the second logarithmic term. In order to control the regret of nonstationarity, the queueing system need to be operated in each cycle for a duration of order O(log(k))O(\log(k)). Because the queueing performance converges to its steady state exponentially fast, this inevitably introduces an extra logarithmic term in our regret bound (which explains the “square” in log(ML)2\log(M_{L})^{2}). The question that remains open is whether this O(log(ML)2)O(\log(M_{L})^{2}) bound is optimal. We conjecture that the answer is yes but admit that a rigorous treatment of a lower regret bound can be quite challenging. For example, establishing a lower regret bound requires a lower bound on the convergence rate of a GI/GI/1GI/GI/1 queue, which by itself is an open question. We leave this question to future research.

Remark 10 (Controlling the length of cycle kk).

We use DkD_{k} (the number of customers served in cycle kk), instead of the clock time TkT_{k}, to control and measure the regret bound. The benefit of using DkD_{k} (rather TkT_{k}) as the cycle length is that it facilitates the technical analysis, because DkD_{k} is directly related to the number of samples used to estimate our gradient estimator. In fact, using DkD_{k} instead of TkT_{k} has no bearing on the order of the regret bound. To see this, note that the arrival rate is assumed to fall into a compact set [λ(p¯),λ(p¯)][\lambda(\bar{p}),\lambda(\underline{p})]. Therefore, since TLT_{L} is the total units of clock time elapsed after cycle LL, we have ML/λ(p¯)𝔼[TL]ML/λ(p¯)M_{L}/\lambda(\underline{p})\leq\mathbb{E}[T_{L}]\leq M_{L}/\lambda(\bar{p}) for all LL.

6 Numerical Experiments

To confirm the practical effectiveness of our online learning method, we conduct numerical experiments to visualize the algorithm convergence, benchmark the outcomes with known exact optimal solutions, estimate the true regret and compare it to the theoretical upper bounds. Our base example is an M/M/1M/M/1 queue, having Poisson arrivals with rate λ(p)\lambda(p), and exponential service times with rate μ\mu. In our optimization, we consider a commonly used logistic demand function (Besbes and Zeevi, 2015)

λ(p)=nλ0(p),λ0(p)=exp(ap)1+exp(ap),\displaystyle\lambda(p)=n\lambda_{0}(p),\qquad\lambda_{0}(p)=\frac{\exp(a-p)}{1+\exp(a-p)}, (24)

where nn is the system scale (also referred to as the market size). We also consider the following convex cost function for the service rate

c(μ)=c0μ2.\displaystyle c(\mu)=c_{0}\mu^{2}. (25)

See the top left panel of Figure 4 for λ(p)\lambda(p) in (24). In particular, the optimal pricing and staffing problem in (1) now becomes

maxμ,p{pλ(p)c0μ2h0λ(p)/μ1λ(p)/μ}.\max_{\mu,p}\ \left\{p\lambda(p)-c_{0}\mu^{2}-h_{0}\frac{\lambda(p)/\mu}{1-\lambda(p)/\mu}\right\}. (26)

In light of the closed-form steady-state formulas of the M/M/1M/M/1 queue, we can obtain the exact values of the optimal solutions (μ,p)(\mu^{*},p^{*}) and the corresponding objective value f(μ,p)f(\mu^{*},p^{*}), with which we are able to benchmark the solutions from our online optimization algorithm.

We first consider two one-dimensional online optimization problems in Section 6.1. We next treat the two-dimensional pricing and staffing problem in Section 6.2. In Section 6.3, we compare our results to previously established asymptotic heavy-traffic solutions in Lee and Ward (2014). Additional numerical experiments are provided in the e-companion: In Section 9 we investigate the robustness of GOLiQ to the hyperparameters. In Section 10 we benchmark the performance of GOLiQ to other online learning methods. Section 11 includes more experiments regarding the relaxation of uniform stability and GOLiQ’s performance in queues having other inter-arrival and service time distributions.

6.1 One-Dimensional Online Optimizations

Algorithm 1 covers special cases where there is only one decision variable. For example, if the service capacity μ\mu (service fee pp) is an exogenous parameter and the only decision is the service fee pp (service capacity μ\mu), then one can simply fix Z=1Z=1 (Z=2Z=2) throughout the learning process. The theoretical regret bound (as in Theorem 3) for these one-dimensional cases remains unchanged.

6.1.1 Online optimal pricing with a fixed service capacity

Motivated by revenue management problems in revenue generating service system, our first example focuses on the one-dimensional optimization of price pp with service rate μ=μ0\mu=\mu_{0} held fixed. In this case we can simply omit the term c0μ2c_{0}\mu^{2} in (26). Fixing the other model parameters as a=4.1,n=10a=4.1,n=10, h0=1h_{0}=1 and μ0=10\mu_{0}=10, we first obtain the exact optimal price p=3.531p^{*}=3.531 (top right panel of Figure 4). According to Algorithm 1 and Theorem 3, we set the step size ηk=1/k\eta_{k}=1/k and cycle length Dk=10+10log(k)D_{k}=10+10\log(k).

Refer to caption
Figure 4: Online optimal pricing for an M/M/1M/M/1 queue with fixed service rate, with μ0=10,a=4.1,p0=6.5,p=3.531\mu_{0}=10,a=4.1,p_{0}=6.5,p^{*}=3.531, ηk=1/k\eta_{k}=1/k and Dk=10+10log(k)D_{k}=10+10\log(k): (i) demand function (top left); (ii) revenue function (top right); (iii) sample path of the gradient (middle left); (iv) sample path of the price (middle right); (v) estimated regret (bottom left); (vi) square root of regret versus logarithmic of served customers, with c=0.24c=0.24, d=19.04d=19.04 (bottom right).

In Figure 4, we give the sample paths of the gradient HkH_{k} and price pkp_{k} as functions of the number of cycles kk, and the mean regret (estimated by averaging 500 independent sample paths) as a function of the cumulative number of service completions MLM_{L}. We observe that although the objective function f(μ,p)f(\mu,p) is not convex in pp, the pricing decision pkp_{k} quickly converges to the optimal value pp^{*}, and the regret grows as a logarithmic function of MLM_{L}. In particular, a simple linear regression for the pair (R(ML),log(ML))\left(\sqrt{R(M_{L})},\log(M_{L})\right) (bottom right panel) verifies our regret bound given in Theorem 3.

6.1.2 Online optimal staffing problem with an exogenous arrival rate

Motivated by conventional service systems where customers are served based on good wills (e.g., hospitals), we next solve an online optimal staffing problem, with the objective of minimizing the combination of the steady-state queue length (or equivalently the delay) and the staffing cost, with the arrival rate (or equivalently, the price pp) held fixed. Namely, we omit the term pλ(p)p\lambda(p) in (26). Fixing λ=λ0=6.385\lambda=\lambda_{0}=6.385, h0=1h_{0}=1, and c0=0.1c_{0}=0.1, we obtain the exact optimal service capacity μ=8.342\mu^{*}=8.342 (top right panel of Figure 5). Also by Algorithm 1 and Theorem 3, we set the step size ηk=0.4k1\eta_{k}=0.4k^{-1} and cycle length Dk=10+10log(k)D_{k}=10+10\log(k) with initial service rate μ0=10\mu_{0}=10.

Refer to caption
Figure 5: Online optimal staffing for an M/M/1 queue with fixed price with λ0=6.385,M=10,ηk=0.4k1\lambda_{0}=6.385,M=10,\eta_{k}=0.4k^{-1} and Dk=10+10log(k)D_{k}=10+10\log(k): (i) staffing cost (top left); (ii) cost function (top right); (iii) sample path of gradient (middle left); (iv) sample path of service capacity (middle right); (v) estimated regret (bottom left); (vi) square root of regret versus logarithmic of served customers, with c=2.76c=2.76, d=8.68d=-8.68 (bottom right).

In Figure 5, we again give sample paths of the gradient HkH_{k} and service capacity μk\mu_{k}, and estimation of the regret. As the number of cycles kk increases, our stage-kk staffing decision μk\mu_{k} quickly converges to μ\mu^{*} (bottom right panel) and the regret also grows as a logarithmic function of MLM_{L} (bottom left panel).

6.2 Joint Pricing and Staffing Problem

We next consider a joint staffing and pricing problem having the objective function in (26), with the logistic demand function in (24) and parameters a=4.1a=4.1, n=10n=10, h0=1h_{0}=1 and c0=0.1c_{0}=0.1. The optimal price p=4.02p^{*}=4.02 and service rate μ=7.10\mu^{*}=7.10 are given as benchmarks (top right panel in Figure 6).

Refer to caption
Figure 6: Joint pricing and staffing for an M/M/1M/M/1 queue with p0=7.5,μ0=12,ηk=1/kp_{0}=7.5,\mu_{0}=12,\eta_{k}=1/k and Dk=10+10log(k)D_{k}=10+10\log(k): (i) demand function (top left); (ii) revenue function (top right); (iii) sample path of gradient (middle left); (iv) sample path of decision parameters (middle right); (v) estimated regret (bottom left); (vi) square root of regret versus logarithmic of served customers, with c=0.186c=0.186, d=5.17d=5.17 (bottom right).

In Figure 6, we show that μk\mu_{k} and pkp_{k} converge quickly to their corresponding optimal target levels μ\mu^{*} and pp^{*} (although the objective f(μ,p)f(\mu,p) is not always convex when μ>λ(p)\mu>\lambda(p)). And similar to the one-dimensional cases, the regret grows as a logarithmic function of MLM_{L} (bottom left panel).

6.3 Comparison to Heavy-Traffic Methods

In this subsection, we provide numerical analysis to contrast the performance of GOLiQ to that of the heavy-traffic approach in Lee and Ward (2014). In Lee and Ward (2014), the objective is to find the optimal decisions pp^{*} and μ\mu^{*} for the GI/GI/1GI/GI/1 optimization problem (1) with a linear staffing cost c(μ)=cμ.c(\mu)=c\mu. Because this problem is not amenable to analytic treatments (due to the complex GI/GI/1GI/GI/1 queueing dynamics), the authors resort to the heavy-traffic approximation by constructing a sequence of GI/GI/1GI/GI/1 queues indexed by a scaling factor nn, where the nthn^{\rm th} model has an arrival rate λn(p)nλ0(p)\lambda^{n}(p)\equiv n\lambda_{0}(p) which grows to infinity as nn increases. The authors propose an asymptotically optimal solution

(p~(n),μ~(n))=(p^,nμ^+σh0n2c)(\tilde{p}^{(n)},\tilde{\mu}^{(n)})=\left(\hat{p}^{*},n\hat{\mu}^{*}+\sigma\sqrt{\frac{h_{0}n}{2c}}\right) (27)

where σ=Var(Ui)+Var(Vi)\sigma=\sqrt{\text{Var}(U_{i})+\text{Var}(V_{i})}, and UiU_{i} and ViV_{i} are defined in Assumption 2, and (p^,μ^)(\hat{p}^{*},\hat{\mu}^{*}) solves a deterministic static planning problem:

minp,μf0(p,μ)=pλ0(p)+cμ.\min_{p,\mu}f_{0}(p,\mu)=-p\lambda_{0}(p)+c\mu. (28)

We remark that the solution in Lee and Ward (2014) requires the precise knowledge of the second moments of service and arrival times (e.g., the term σ\sigma in (27)), but such information is not needed in GOLiQ.

Experiment settings.

We consider an M/GI/1M/GI/1 model with a phase-type service-time distribution, and a logit demand λ(p)=nλ0(p)\lambda(p)=n\lambda_{0}(p) in (24) where the base demand rate λ0(p)\lambda_{0}(p) has a=4.1a=4.1 and the market size nn plays the role of the scaling factor. We fix the delay cost h0=1h_{0}=1 throughout this experiment. To quantify the regret, we obtain the exact optimal policy using the Pollaczek-Khinchine formula for the queue-length function

𝔼[Q(p,μ)]=ρ+ρ21ρ1+cs22,\mathbb{E}[Q_{\infty}(p,\mu)]=\rho+\frac{\rho^{2}}{1-\rho}\frac{1+c_{s}^{2}}{2}, (29)

where cs2Var(Ui)/𝔼[Ui]2c_{s}^{2}\equiv Var(U_{i})/\mathbb{E}[U_{i}]^{2} is the squared coefficient of variation (SCV) for the service time. We next describe the detailed settings for comparing GOLiQ to heavy-traffic solution in Lee and Ward (2014), dubbed LW. In order to benchmark the regret of our GOLiQ to that of LW, we continue to consider a dynamic environment where the number of cycles kk increases. In the kthk^{\rm th} cycle,

  • the LW policy remains fixed at (p~(n),μ~(n))(\tilde{p}^{(n)},\tilde{\mu}^{(n)}) as in (27) (it does not evolve with kk);

  • our online learning policy is dynamically updated according to GOLiQ (Algorithm 1).

Because the LW policy is an approximation, it will yield a linear regret as kk increases. But LW’s linear regret should not be too steep when nn is large enough. In contrast, although GOLiQ is guaranteed to generate a sublinear regret, it is expected to have a larger reget increment at the earlier “exploration” stage, because it is learning without the supervision of the fluid or diffusion limits (as in the LW approach). Nevertheless, we expect that GOLiQ will eventually outperform the LW method (exhibiting a lower regret level) when kk is sufficiently large. We next numerically study how soon GOLiQ surpasses LW and the impact of the following three parameters:

  1. (i)

    staffing cost cc;

  2. (ii)

    service-time SCV cs2c_{s}^{2};

  3. (iii)

    market size nn (i.e., system scale).

We intentionally set the initial decision (μ0,p0)(\mu_{0},p_{0}) of GOLiQ far from the optimal solution (μ,p)(\mu^{*},p^{*}) in the experiment.

Experiment results.

In Figure 7, we report results of regret for both GOLiQ and LW. For the three factors cc, cs2c_{s}^{2} and nn, we change one at a time (with the other two held fixed). In Panels (a)-(c), we vary the staffing cost cc from 0.5 to 2. In Panels (d)-(f), we vary the service-time SCV cs2c_{s}^{2} from 0.1 to 10. Here the cases cs2=0.1c_{s}^{2}=0.1, 1, and 10 are achieved by considering Erlang, exponential, and hyperexponential service-time distributions. In Panels (g)-(i), we vary the system scale nn from 1 to 25. In all of the cases, we use hyper-parameter ηk=5k1\eta_{k}=5k^{-1} and Dk=10+10log(k)D_{k}=10+10\log(k). Monte-Carlo estimates of the regret curves are obtained by averaging 100 independent runs.

We can see from Figure 7 that, in all cases, GOLiQ will eventually establish a lower regret level than the LW policy. Varying these three factors clearly has an significant impact on how soon GOLiQ outperforms LW.

Refer to caption
Figure 7: Regret comparison to heavy-traffic approximation, with varying (i) staffing cost cc (panels (a)–(c)), (ii) service variability cs2c_{s}^{2} (panels (d)–(f)), and (iii) market size nn (panels (g)–(i)). Hyperparameters are ηk=5k1\eta_{k}=5k^{-1}and Dk=10+10log(k)D_{k}=10+10\log(k) for all instances. All regret are estimated by averaging 500 independent simulation runs.

Our findings are summarized below:

  • Staffing cost cc: Figure 7 shows that GOLiQ intends to outperform LW when cc is relatively large. We provide our explanations below. First, a larger staffing cost cc will induce a smaller μ\mu^{*}, which leads to a longer waiting queue. On the other hand, note that the LW solution is primarily based on solving the deterministic static problem (28); and unlike the stochastic revenue optimization problem (1), the objective function of (28) overlooks the queue-length holding cost. This explains why GOLiQ gains its advantage over LW as cc increases. See Panels (a)-(c) of Figure 7.

  • Service SCV cs2c_{s}^{2}: When the service-time SCV is smaller, the LW method intends to work better, because the basic idea of LW stems from solutions of a fluid model (where the service times are assumed deterministic). On the other hand, when cs2c_{s}^{2} is larger, the system becomes more variable so that our learning-based algorithm begins to excel (because GOLiQ takes into account real-time information dynamically). See Panels (d)-(f) of Figure 7.

  • Market size nn: When nn is small, LW loses its advantages because it arises from the large-scale limit of the GI/GI/1GI/GI/1 queue which requires nn to be sufficiently large. While the performance of our GOLiQ is robust to the system scale. See Panels (g)-(i).

  • Performance in the long run: GOLiQ is a more effective approach in the long run, because the LW solution remains static and its error grows linearly as time increases.

Remark 11 (Different philosophies: online learning vs. heavy traffic).

We emphasize that online learning and heavy-traffic analysis are two methodologies developed based on distinct philosophies. First, when the system size is large, heavy-traffic models are able to produce high-fidelity solutions, but they require more prior knowledge of the system as inputs. On the other hand, online learning requires less prior understanding of the system, because the data-driven nature allows it to dynamically evolve and improve (whereas heavy-traffic solutions are static). Second, the notions of asymptotic optimality are different. As an approximate method, heavy-traffic analysis is said to be asymptotically optimal in the sense that as the system size grows large, its solution will become close to the true optimal solution. On the other hand, the solution of the online learning method will converge to the true optimal solution as the server’s experience accumulates (by serving more and more customers).

7 Conclusion

In this paper we develop an online learning framework designed for dynamic pricing and staffing in queueing systems. The ingenuity of this approach lies in its online nature, which allows the service provider to continuously obtain improved pricing and staffing policies by interacting with the environment. The environment here is interpreted as everything beyond the service provider’s knowledge, which is the composition of the random external demand process and the complex internal queueing dynamics. The proposed algorithm organizes the time horizon into successive operational cycles, and prescribes an efficient way to update the service provider’s policy in each cycle using data collected in previous cycles. Data include the number of customer arrivals, waiting times, and the server’s busy times.

A key appeal of the online learning approach is its insensitivity to the scale of the queueing system, as opposed to the heavy-traffic analysis, which requires the system to be in large scale (with the arrival and service rate both approaching infinity). Effectiveness of our online learning algorithm is substantiated by (i) theoretical results including the algorithm convergence and regret analysis, and (ii) engineering confirmation via simulation experiments of a variety of representative GI/GI/1GI/GI/1 queues. Theoretical analysis of the regret bound in the present paper may shed lights on the design of efficient online learning algorithms (e.g., bounding gradient estimation error and controlling proper learning rate) for more general queueing systems.

There are several venues for future research. One natural extension would be to develop new regret analyses that do not require the uniform stability condition. Another interesting and promising direction is to develop an online learning method without assuming the knowledge of the arrival rate function λ(p)\lambda(p), where the learner (hereby the service provider), during the interactions with the environment, will have to resolve the tension between obtaining an accurate estimation of the demand function and optimizing returns over time. A third dimension is to extend the methodology to more general model settings (e.g., queues having customer abandonment and multiple servers), which will make the framework more practical for service systems such as call centers and healthcare. In this regard, results in the present paper may serve as useful foundations; in particular, Theorems 1 and 2 will help construct desired regret bounds as long as their associated conditions can be verified. Doing so usually requires two main steps in a new queueing model: (i) proving a new ergodicity (or rate of convergence to stationarity) result that can be used to bound the regret of nonstationarity; (ii) designing a new gradient estimator which is easily computed from data (here a good gradient estimator should have small bias and variance subject to conditions in Theorem 2).

References

  • Abate et al. (1995) Abate, J., G. L. Choudhury, and W. Whitt (1995). Exponential approximations for tail probabilities in queues, i: waiting times. Operations Research 43(5), 885–901.
  • Asmussen (2003) Asmussen, S. (2003). Applied Probability and Queues. Springer.
  • Ata and Shneorson (2006) Ata, B. and S. Shneorson (2006). Dynamic control of an M/M/1 service system with adjustable arrival and service rates. Management Science 52, 1778–1791.
  • Besbes and Zeevi (2015) Besbes, O. and A. Zeevi (2015). On the (surprising) sufficiency of linear models for dynamic pricing with demand learning. Management Science 61(4), 1211–1224.
  • Blanchet and Chen (2015) Blanchet, J. and X. Chen (2015). Steady-state simulation of reflected Brownian motion and related stochastic networks. Annals of Applied Probability 25, 3209–3250.
  • Broadie et al. (2011) Broadie, M., D. Cicek, and A. Zeevi (2011). General bounds and finite-time improvement for the kiefer-wolfowitz stochastic approximation algorithm. Operations Research 59(5), 1211–1224.
  • Burnetas and Smith (2000) Burnetas, A. N. and C. E. Smith (2000). Adaptive ordering and pricing for perishable products. Operations Research 43(3), 436–443.
  • Chen (2014) Chen, X. (2014). Exact gradient simulation for stochastic fluid networks in steady state. Winter Simulation Conference (WSC) 2015(586-594).
  • Chong and Ramadge (1993) Chong, E. K. P. and P. J. Ramadge (1993). Optimization of queues using an infnitesimal perturbation analysis-based stochastic algorithm with general update times. SIAM Journal on Control and Optimization 31, 698–732.
  • Dai and Gluzman (2021) Dai, J. G. and M. Gluzman (2021). Queueing network controls via deep reinforcement learning. Stochastic Systems.
  • Fu (1990) Fu, M. C. (1990). Convergence of a stochastic approximation algorithm for the GI/G/1 queue using infinitesimal perturbation analysis. Journal of Optimization Theory and Applications 65, 149–160.
  • Glasserman (1992) Glasserman, P. (1992). Stationary waiting time derivatives. Queueing Systems 12, 369–390.
  • Huh et al. (2009) Huh, W. T., G. Janakiraman, J. A. Muckstadt, and P. Rusmevichientong (2009). An adaptive algorithm for finding the optimal base-stock policy in lost sales inventory system with censored demand. Mathematics of Operations Research 34(2), 397–416.
  • Huh and Rusmevichientong (2013) Huh, W. T. and P. Rusmevichientong (2013). Online sequential optimization with biased gradients: Theory and applications to censored demand. INFORMS Journal on Computing 26(150-159).
  • Jia et al. (2021) Jia, H., C. Shi, and S. Shen (2021). Online learning and pricing for service systems with reusable resources. Working paper.
  • Kella and Ramasubramanian (2012) Kella, O. and S. Ramasubramanian (2012). Asymptotic irrelevance of initial conditions for Skorokhod refection mapping on the nonnegative orthant. Mathematics of Operations Research 37, 301–312.
  • Kim and Randhawa (2018) Kim, J. and R. S. Randhawa (2018). the value of dynamic pricing in large queueing systems. Operations Research 66(2), 409–425.
  • Krishnasamy et al. (2021) Krishnasamy, S., R. Sen, R. Johari, and S. Shakkottai (2021). Learning unknown service rates in queues: A multiarmed bandit approach. Operations Research 69(1), 315–330.
  • Kumar and Randhawa (2010) Kumar, S. and R. S. Randhawa (2010). Exploiting market size in service systems. Manufacturing Service Oper. Management 12(3), 511–526.
  • Kushner and Yin (2003) Kushner, H. J. and G. G. Yin (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer.
  • L’Ecuyer et al. (1994) L’Ecuyer, P., N. Giroux, and P. W. Glynn (1994). Stochastic optimization by simulation: numerical experiments with the M/M/1 queue in steady-state. Management Science 40(10), 1245–1261.
  • L’Ecuyer and Glynn (1994) L’Ecuyer, P. and P. W. Glynn (1994). Stochastic optimization by simulation: convergence proofs for the GI/GI/1 queue in steady state. Management Science 40(11), 1562–1578.
  • Lee and Ward (2014) Lee, C. and A. R. Ward (2014). Optimal pricing and capacity sizing for the GI/GI/1 queue. Operations Research Letters 42, 527–531.
  • Lee and Ward (2019) Lee, C. and A. R. Ward (2019). Pricing and capacity sizing of a service facility: Customer abandonment effects. Production and Operations Management 28(8), 2031–2043.
  • Liu et al. (2019) Liu, B., Q. Xie, and E. H. Modiano (2019). Reinforcement learning for optimal control of queueing systems. In Proceedings of Allerton Conference.
  • Maglaras and Zeevi (2003) Maglaras, C. and A. Zeevi (2003). Pricing and capacity sizing for systems with shared resources: approximate solutions and scaling relations. Management Science 49(8), 1018–1038.
  • Nair et al. (2016) Nair, J., A. Wierman, and B. Zwart (2016). Provisioning of large-scale systems: the interplay between network effects and strategic behavior in the user base. Management Science 62(6), 1830–1841.
  • Nakayama et al. (2004) Nakayama, M. K., P. Shahabuddin, and K. Sigman (2004). On finite exponential moments for branching processes and busy periods for queues. Journal of Applied Probability 41(A), 273–280.
  • Shah et al. (2020) Shah, D., Q. Xie, and Z. Xu (2020). Stable reinforcement learning with unbounded state space. Working paper.
  • Sutton and Barto (2018) Sutton, R. S. and A. G. Barto (2018). Reinforcement Learning: An Introduction (2nd ed.). The MIT Press.
  • Yuan et al. (2021) Yuan, H., Q. Luo, and C. Shi (2021). Marrying stochastic gradient descent with bandits: Learning algorithms for inventory systems with fixed costs. Management Science.
  • Zhang et al. (2020) Zhang, H., X. Chao, and C. Shi (2020). Closing the gap: A learning algorithm for lost-sales inventory systems with lead times. Management Science 66(5), 1962–1980.

SUPPLEMENTARY MATERIAL

This Supplementary Material provides supplementary materials to the main paper. In Section 8, we give all the technical proofs omitted from the main paper. In Section 9, we test the robustness of GOLiQ with respect to key algorithmic hyperparameters. In Section 10, we compare GOLiQ to the online learning method in Huh et al. (2009). In Section 11, we report additional numerical studies. To facilitate readability, we formally summarize all notations in Table 1 including all model parameters and functions, algorithmic hyperparameters, and constants in the regret analysis.

8 Proofs

8.1 Proof of Lemma 1

Let QnkQ_{n}^{k} be the queue length when customer n1n-1 in cycle kk leaves the system. Then Qk=QDk1k1+1Q_{k}=Q_{D_{k-1}}^{k-1}+1. The proof follows a stochastic ordering argument for GI/GI/1GI/GI/1 models. Let W^nk\hat{W}_{n}^{k}, X^nk\hat{X}_{n}^{k} and Q^nk\hat{Q}_{n}^{k} be the waiting times, observed busy periods, and queue length process in a GI/GI/1GI/GI/1 queue with stationary control parameter μkμ¯\mu_{k}\equiv\underline{\mu} and pkp¯p_{k}\equiv\underline{p}, and with steady-state initial state, i.e., W^01=dW(μ¯,p¯)\hat{W}_{0}^{1}\stackrel{{\scriptstyle d}}{{=}}W_{\infty}(\underline{\mu},\underline{p}), X^01=dX(μ¯,p¯)\hat{X}_{0}^{1}\stackrel{{\scriptstyle d}}{{=}}X_{\infty}(\underline{\mu},\underline{p}) and Q^01=dQ(μ¯,p¯)\hat{Q}_{0}^{1}\stackrel{{\scriptstyle d}}{{=}}Q_{\infty}(\underline{\mu},\underline{p}). Let’s call this system the dominating system. Then, for all kk,

UnkλnkUnkλ(p¯), for n=1,2,,Qk, and UnkλkUnkλ(p¯), for n=Qk+1,2,,Dk,\frac{U_{n}^{k}}{\lambda_{n}^{k}}\geq\frac{U_{n}^{k}}{\lambda(\underline{p})},\text{ for }n=1,2,...,Q_{k},\text{ and }\frac{U_{n}^{k}}{\lambda_{k}}\geq\frac{U_{n}^{k}}{\lambda(\underline{p})},\text{ for }n=Q_{k}+1,2,...,D_{k},

i.e., the arrival process in the dominating queue is the upper envelope process (UEP) for all possible arrival processes corresponding to any control sequence (μk,pk)(\mu_{k},p_{k}). Similarly, the service process in the dominating queue is the lower envelope process (LEP) for all possible service processes corresponding to any control sequence. As a consequence, since W01=0W_{0}^{1}=0 and Q01=0Q_{0}^{1}=0,

WnkstW^nk,Xnkstλ(p¯)λ(p¯)X^nk,QnkstQ^nk.W_{n}^{k}\leq_{st}\hat{W}_{n}^{k},\leavevmode\nobreak\ X_{n}^{k}\leq_{st}\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\cdot\hat{X}_{n}^{k},\leavevmode\nobreak\ Q_{n}^{k}\leq_{st}\hat{Q}_{n}^{k}.

Under Assumption 2, the moment generating function of the random variable Vn/μ¯Un/λ(p¯)V_{n}/\underline{\mu}-U_{n}/\lambda(\underline{p}) exists around the origin. Following Blanchet and Chen (2015), under Assumption 1, this condition can further imply that there exists a constant η¯>0\bar{\eta}>0 such that 𝔼[exp(η¯(Vn/μ¯Un/λ(p¯)))]=1.\mathbb{E}[\exp\left(\bar{\eta}(V_{n}/\underline{\mu}-U_{n}/\lambda(\underline{p}))\right)]=1. (See the Remark on p.3222 in Blanchet and Chen (2015)) Then, following Theorem 1 of Abate et al. (1995), there exists a constant α>0\alpha>0 such that (W^nk>x)αexp(η¯x), for all x>0\mathbb{P}(\hat{W}_{n}^{k}>x)\leq\alpha\exp(-\bar{\eta}x),\text{ for all }x>0. As a consequence, 𝔼[exp(ηW^nk)]\mathbb{E}[\exp(\eta\hat{W}_{n}^{k})] is finite for η<η¯\eta<\bar{\eta}, and so are 𝔼[(W^nk)m]\mathbb{E}[(\hat{W}_{n}^{k})^{m}] for all m1m\geq 1. Given that the moments of waiting times are finite, we can conclude that 𝔼[(Q^nk)m]\mathbb{E}[(\hat{Q}_{n}^{k})^{m}] and 𝔼[exp(ηQ^nk)]\mathbb{E}[\exp(\eta\hat{Q}_{n}^{k})] are finite for all m1m\geq 1, applying Theorem 10.4.3 in Asmussen (2003). Finally, the moments of the observed busy period 𝔼[(X^nk)m]\mathbb{E}[(\hat{X}_{n}^{k})^{m}] are finite following Proposition 4.2 in Nakayama et al. (2004). Therefore, we choose

M=max1m4{𝔼[(W^nk)m],λ(p¯)mλ(p¯)m𝔼[(X^nk)m],𝔼[(Q^nk+1)m],𝔼[exp(ηW^nk)],𝔼[exp(η(Qnk+1))]},M=\max_{1\leq m\leq 4}\left\{\mathbb{E}[(\hat{W}_{n}^{k})^{m}],\ \frac{\lambda(\underline{p})^{m}}{\lambda(\bar{p})^{m}}\mathbb{E}[(\hat{X}_{n}^{k})^{m}],\ \mathbb{E}[(\hat{Q}_{n}^{k}+1)^{m}],\ \mathbb{E}[\exp(\eta\hat{W}_{n}^{k})],\ \mathbb{E}[\exp(\eta(Q_{n}^{k}+1))]\right\},

and this closes our proof. \Box

8.2 Proof of Lemma 2

For i{1,2}i\in\{1,2\}, define stopping times Γi=min{n:Wni=0}\Gamma_{i}=\min\{n:W^{i}_{n}=0\}. For a fixed pair of inter-arrival and service time sequences, the consequent waiting time sequence WkW_{k} in a single-server queue is monotone in its initial state W0W_{0}. Without loss of generality, assume W01W02W^{1}_{0}\geq W^{2}_{0}. Then, Wn1Wn2W^{1}_{n}\geq W^{2}_{n} for all n1n\geq 1 and therefore, WΓ11=WΓ12=0W^{1}_{\Gamma_{1}}=W^{2}_{\Gamma_{1}}=0. As the two queues are coupled with the same arrival and service time sequences, we will have Wn1=Wn2W^{1}_{n}=W^{2}_{n} for all nΓ1n\geq\Gamma_{1}. Therefore, we can conclude Wn1=Wn2W^{1}_{n}=W^{2}_{n} for all nmax(Γ1,Γ2)n\geq\max(\Gamma_{1},\Gamma_{2}). For nmax(Γ1,Γ2)n\leq\max(\Gamma_{1},\Gamma_{2}), we have |Wn1Wn2||W01W02||W^{1}_{n}-W^{2}_{n}|\leq|W^{1}_{0}-W^{2}_{0}| following Kella and Ramasubramanian (2012).

For simplicity of notation, we write λ=λ(p)\lambda=\lambda(p). For i{1,2}i\in\{1,2\}, define a random walk Rn+1i=Rni+SnτnR^{i}_{n+1}=R^{i}_{n}+S_{n}-\tau_{n} with R0i=W0iR^{i}_{0}=W^{i}_{0}. (Recall that SnS_{n} and τn\tau_{n} are the sequences of service and inter-arrival times.) By Lindley recursion, Γi=min{n:Rni0}\Gamma_{i}=\min\{n:R^{i}_{n}\leq 0\}. Then, for any n1n\geq 1,

(Γin)\displaystyle\mathbb{P}(\Gamma_{i}\leq n) (k=1n(Skτk)<W0i)\displaystyle\geq\mathbb{P}\left(\sum_{k=1}^{n}(S_{k}-\tau_{k})<-W_{0}^{i}\right)
(λk=1nτkn(1a),μk=1nSkn(1+a)μW0i),\displaystyle\geq\mathbb{P}\left(\lambda\sum_{k=1}^{n}\tau_{k}\geq n(1-a),\mu\sum_{k=1}^{n}S_{k}\leq n(1+a)-\mu W^{i}_{0}\right),

where the second inequality holds as (1a)/λ>(1+a)/μ(1-a)/\lambda>(1+a)/\mu given that 0<a<(μ¯λ(p¯))/(μ¯+λ(p¯))0<a<(\underline{\mu}-\lambda(\underline{p}))/(\underline{\mu}+\lambda(\underline{p})) and that λ/μλ(p¯)/μ¯\lambda/\mu\leq\lambda(\underline{p})/\underline{\mu}. Recall that τk=Uk/λ\tau_{k}=U_{k}/\lambda and Sk=Vk/μS_{k}=V_{k}/\mu. Therefore,

(Γi>n)(k=1nUk<n(1a))+(k=1nVk>n(1+a)μW0i).\displaystyle\mathbb{P}(\Gamma_{i}>n)\leq\mathbb{P}\left(\sum_{k=1}^{n}U_{k}<n(1-a)\right)+\mathbb{P}\left(\sum_{k=1}^{n}V_{k}>n(1+a)-\mu W^{i}_{0}\right).

Following Chebyshev’s Inequality, we have

(k=1nVk>n(1+a)μW0i)\displaystyle\mathbb{P}\left(\sum_{k=1}^{n}V_{k}>n(1+a)-\mu W^{i}_{0}\right) 𝔼[exp(θk=1nVk)]exp(nθ(1+a)μθW0i)=exp(n(ϕV(θ)(1+a)θ))exp(μθW0i)\displaystyle\leq\frac{\mathbb{E}[\exp(\theta\sum_{k=1}^{n}V_{k})]}{\exp(n\theta(1+a)-\mu\theta W^{i}_{0})}=\exp(n(\phi_{V}(\theta)-(1+a)\theta))\exp(\mu\theta W^{i}_{0})
exp(nγ)exp(μθW0i),\displaystyle\leq\exp(-n\gamma)\exp(\mu\theta W^{i}_{0}),

where the last inequality follows from Assumption 2. On the other hand, let QQ be an exponentially tilted probability measure with respect to UU, such that the likelihood ratio dQdP(U)=exp(θUϕU(θ))\frac{dQ}{dP}(U)=\exp(-\theta U-\phi_{U}(-\theta)). Then,

(k=1nUk<n(1a))=𝔼Q[exp(θk=1nUk+nϕU(θ))𝟏{k=1nUk<n(1a)}]\displaystyle\mathbb{P}\left(\sum_{k=1}^{n}U_{k}<n(1-a)\right)=\mathbb{E}^{Q}\left[\exp\left(\theta\sum_{k=1}^{n}U_{k}+n\phi_{U}(-\theta)\right){\bf 1}_{\left\{\sum_{k=1}^{n}U_{k}<n(1-a)\right\}}\right]
\displaystyle\leq exp(n(1a)θ+nϕU(θ))=exp(n((1a)θ+ϕU(θ)))exp(nγ).\displaystyle\exp(n(1-a)\theta+n\phi_{U}(-\theta))=\exp(n((1-a)\theta+\phi_{U}(-\theta)))\leq\exp(-n\gamma).

In summary, we have (Γi>n)exp(nγ)(1+exp(μθW0i))\mathbb{P}(\Gamma_{i}>n)\leq\exp(-n\gamma)\left(1+\exp(\mu\theta W^{i}_{0})\right), i=1,2i=1,2. So, we can conclude

𝔼[|Wn1Wn2|m]\displaystyle\mathbb{E}[|W^{1}_{n}-W^{2}_{n}|^{m}] (max(Γ1,Γ2)>n)|W01W02|m\displaystyle\leq\mathbb{P}(\max(\Gamma_{1},\Gamma_{2})>n)|W^{1}_{0}-W^{2}_{0}|^{m}
eγn(2+eμθW01+eμθW02)|W01W02|m.\displaystyle\leq e^{-\gamma n}\left(2+e^{\mu\theta W^{1}_{0}}+e^{\mu\theta W^{2}_{0}}\right)|W^{1}_{0}-W^{2}_{0}|^{m}.\qquad\Box

8.3 Proof of Lemma 3

Define two auxiliary random walks:

Yn=W0+i=1n(ViμiUiλi),Y~n=W~0+i=1n(Viμ~iUiλ~i).Y_{n}=W_{0}+\sum_{i=1}^{n}\left(\frac{V_{i}}{\mu_{i}}-\frac{U_{i}}{\lambda_{i}}\right),\tilde{Y}_{n}=\tilde{W}_{0}+\sum_{i=1}^{n}\left(\frac{V_{i}}{\tilde{\mu}_{i}}-\frac{U_{i}}{\tilde{\lambda}_{i}}\right).

Then, for any n1n\geq 1, we could express WnW_{n} and W~n\tilde{W}_{n} as

Wn=Ynmin1mnYm0,W~n=Y~nmin1mnY~m0.W_{n}=Y_{n}-\min_{1\leq m\leq n}Y_{m}\wedge 0,\quad\tilde{W}_{n}=\tilde{Y}_{n}-\min_{1\leq m\leq n}\tilde{Y}_{m}\wedge 0.

Let τ=argmin1mnYm\tau=\operatorname*{arg\,min}_{1\leq m\leq n}Y_{m} and τ~=argmin1mnY~m\tilde{\tau}=\operatorname*{arg\,min}_{1\leq m\leq n}\tilde{Y}_{m}. Note that following the above notation, for each nn, WnW_{n} is the waiting time of customer nn and as a consequence, Unλn\frac{U_{n}}{\lambda_{n}} should be understood as the inter-arrival time between customers n1n-1 and nn, and Vnμn\frac{V_{n}}{\mu_{n}} as the service time of customer n1n-1.

Case 1: If Yτ0Y_{\tau}\leq 0 and Y~τ~0\tilde{Y}_{\tilde{\tau}}\leq 0, i.e., both WtW_{t} and W~t\tilde{W}_{t} hit zero before nn, we have

YnYτ~(Y~nY~τ~)WnW~n=YnYτ(Y~nY~τ~)YnYτ(Y~nY~τ).\displaystyle Y_{n}-Y_{\tilde{\tau}}-(\tilde{Y}_{n}-\tilde{Y}_{\tilde{\tau}})\leq W_{n}-\tilde{W}_{n}=Y_{n}-Y_{\tau}-(\tilde{Y}_{n}-\tilde{Y}_{\tilde{\tau}})\leq Y_{n}-Y_{\tau}-(\tilde{Y}_{n}-\tilde{Y}_{\tau}).

So, in this case

|WnW~n|i=ττ~+1n|1μi1μ~i|Vi+i=ττ~+1n|1λi1λ~i|Ui.|W_{n}-\tilde{W}_{n}|\leq\sum_{i=\tau\wedge\tilde{\tau}+1}^{n}\left|\frac{1}{\mu_{i}}-\frac{1}{\tilde{\mu}_{i}}\right|V_{i}+\sum_{i=\tau\wedge\tilde{\tau}+1}^{n}\left|\frac{1}{\lambda_{i}}-\frac{1}{\tilde{\lambda}_{i}}\right|U_{i}.

Recall that XnX_{n} (and X~n\tilde{X}_{n}) is the age of the server’s busy time observed by customer nn upon arrival. By definition, Wτ=0W_{\tau}=0 and therefore,

Xn=i=τ+1nUiλi,Xn+Wn=i=τ+1nViμi.X_{n}=\sum_{i=\tau+1}^{n}\frac{U_{i}}{\lambda_{i}},\quad X_{n}+W_{n}=\sum_{i=\tau+1}^{n}\frac{V_{i}}{\mu_{i}}.

The second equation holds as the server has just served nτn-\tau customers (indexed from τ\tau to n1n-1) in the current busy cycle when customer nn enters service. Then,

i=τ+1n|1μi1μ~i|Vi+i=τ+1n|1λi1λ~i|Uicμμ¯(Xn+Wn)+cλλ¯Xn.\sum_{i=\tau+1}^{n}\left|\frac{1}{\mu_{i}}-\frac{1}{\tilde{\mu}_{i}}\right|V_{i}+\sum_{i=\tau+1}^{n}\left|\frac{1}{\lambda_{i}}-\frac{1}{\tilde{\lambda}_{i}}\right|U_{i}\leq\frac{c_{\mu}}{\underline{\mu}}(X_{n}+W_{n})+\frac{c_{\lambda}}{\underline{\lambda}}X_{n}.

Following a similar argument, we have

i=τ~+1n|1μi1μ~i|Vi+i=τ~+1n|1λi1λ~i|Uicμμ¯(X~n+W~n)+cλλ¯X~n.\sum_{i=\tilde{\tau}+1}^{n}\left|\frac{1}{\mu_{i}}-\frac{1}{\tilde{\mu}_{i}}\right|V_{i}+\sum_{i=\tilde{\tau}+1}^{n}\left|\frac{1}{\lambda_{i}}-\frac{1}{\tilde{\lambda}_{i}}\right|U_{i}\leq\frac{c_{\mu}}{\underline{\mu}}(\tilde{X}_{n}+\tilde{W}_{n})+\frac{c_{\lambda}}{\underline{\lambda}}\tilde{X}_{n}.

Therefore, in this case, we have

|WnW~n|(cμμ¯+cλλ¯)max(Xn,X~n)+cμμ¯max(Wn,W~n).|W_{n}-\tilde{W}_{n}|\leq\left(\frac{c_{\mu}}{\underline{\mu}}+\frac{c_{\lambda}}{\underline{\lambda}}\right)\max(X_{n},\tilde{X}_{n})+\frac{c_{\mu}}{\underline{\mu}}\max(W_{n},\tilde{W}_{n}).

Case 2: If Yτ>0Y_{\tau}>0 or Y~τ~>0\tilde{Y}_{\tilde{\tau}}>0, we can inductively derive that

|WnW~n||W0W~0|+i=1n|1μi1μ~i|Vi+i=1n|1λi1λ~i|Ui.|W_{n}-\tilde{W}_{n}|\leq|W_{0}-\tilde{W}_{0}|+\sum_{i=1}^{n}\left|\frac{1}{\mu_{i}}-\frac{1}{\tilde{\mu}_{i}}\right|V_{i}+\sum_{i=1}^{n}\left|\frac{1}{\lambda_{i}}-\frac{1}{\tilde{\lambda}_{i}}\right|U_{i}.

In detail, it suffices to show that, for all 1mn1\leq m\leq n,

|WmW~m||Wm1W~m1|+|1μm1μ~m|Vm+|1λm1λ~m|Um.|W_{m}-\tilde{W}_{m}|\leq|W_{m-1}-\tilde{W}_{m-1}|+\left|\frac{1}{\mu_{m}}-\frac{1}{\tilde{\mu}_{m}}\right|V_{m}+\left|\frac{1}{\lambda_{m}}-\frac{1}{\tilde{\lambda}_{m}}\right|U_{m}. (30)

Without loss of generality, we assume Yτ>0Y_{\tau}>0. By definition, Yτ=min1lnYlY_{\tau}=\min_{1\leq l\leq n}Y_{l} and hence Wl=Yl>0W_{l}=Y_{l}>0 for all 1ln1\leq l\leq n. Then,

|WmW~m|=|Wm1Umλm+Vmμm(W~m1Umλ~m+Vmμ~m)+|.|W_{m}-\tilde{W}_{m}|=\left|W_{m-1}-\frac{U_{m}}{\lambda_{m}}+\frac{V_{m}}{\mu_{m}}-\left(\tilde{W}_{m-1}-\frac{U_{m}}{\tilde{\lambda}_{m}}+\frac{V_{m}}{\tilde{\mu}_{m}}\right)^{+}\right|.

If W~m>0\tilde{W}_{m}>0, we have

|WmW~m|\displaystyle|W_{m}-\tilde{W}_{m}| =|Wm1Umλm+Vmμm(W~m1Umλ~m+Vmμ~m)|\displaystyle=\left|W_{m-1}-\frac{U_{m}}{\lambda_{m}}+\frac{V_{m}}{\mu_{m}}-\left(\tilde{W}_{m-1}-\frac{U_{m}}{\tilde{\lambda}_{m}}+\frac{V_{m}}{\tilde{\mu}_{m}}\right)\right|
|Wm1W~m1|+|1μm1μ~m|Vm+|1λm1λ~m|Um.\displaystyle\leq|W_{m-1}-\tilde{W}_{m-1}|+\left|\frac{1}{\mu_{m}}-\frac{1}{\tilde{\mu}_{m}}\right|V_{m}+\left|\frac{1}{\lambda_{m}}-\frac{1}{\tilde{\lambda}_{m}}\right|U_{m}.

On the other hand, if W~m=0\tilde{W}_{m}=0, we have W~m1Umλ~m+Vmμ~m0.\tilde{W}_{m-1}-\frac{U_{m}}{\tilde{\lambda}_{m}}+\frac{V_{m}}{\tilde{\mu}_{m}}\leq 0. So,

|WmW~m|\displaystyle|W_{m}-\tilde{W}_{m}| =Wm0Wm(W~m1Umλ~m+Vmμ~m)\displaystyle=W_{m}-0\leq W_{m}-\left(\tilde{W}_{m-1}-\frac{U_{m}}{\tilde{\lambda}_{m}}+\frac{V_{m}}{\tilde{\mu}_{m}}\right)
=|Wm1Umλm+Vmμm(W~m1Umλ~m+Vmμ~m)|\displaystyle=\left|W_{m-1}-\frac{U_{m}}{\lambda_{m}}+\frac{V_{m}}{\mu_{m}}-\left(\tilde{W}_{m-1}-\frac{U_{m}}{\tilde{\lambda}_{m}}+\frac{V_{m}}{\tilde{\mu}_{m}}\right)\right|
|Wm1W~m1|+|1μm1μ~m|Vm+|1λm1λ~m|Um.\displaystyle\leq|W_{m-1}-\tilde{W}_{m-1}|+\left|\frac{1}{\mu_{m}}-\frac{1}{\tilde{\mu}_{m}}\right|V_{m}+\left|\frac{1}{\lambda_{m}}-\frac{1}{\tilde{\lambda}_{m}}\right|U_{m}.

This closes the proof of (30).

As a result of (30), if Yτ>0Y_{\tau}>0, we can conclude the system (associated with (μn,λn)(\mu_{n},\lambda_{n})) was kept busy from time 0 until customer nn enters service. As a consequence, as X00X_{0}\geq 0, we have

Xni=1nUiλi,Xn+Wni=1nViμi.X_{n}\geq\sum_{i=1}^{n}\frac{U_{i}}{\lambda_{i}},\quad X_{n}+W_{n}\geq\sum_{i=1}^{n}\frac{V_{i}}{\mu_{i}}.

Therefore,

i=1n|1μi1μ~i|Vi+i=1n|1λi1λ~i|Uicμμ¯(max(Xn,X~n)+max(Wn,W~n))+cλλ¯max(Xn,X~n),\sum_{i=1}^{n}\left|\frac{1}{\mu_{i}}-\frac{1}{\tilde{\mu}_{i}}\right|V_{i}+\sum_{i=1}^{n}\left|\frac{1}{\lambda_{i}}-\frac{1}{\tilde{\lambda}_{i}}\right|U_{i}\leq\frac{c_{\mu}}{\underline{\mu}}(\max(X_{n},\tilde{X}_{n})+\max({W}_{n},\tilde{W}_{n}))+\frac{c_{\lambda}}{\underline{\lambda}}\max(X_{n},\tilde{X}_{n}),

and hence we can also conclude

|WnW~n||W0W~0|+(cμμ¯+cλλ¯)max(Xn,X~n)+cμμ¯max(Wn,W~n).|W_{n}-\tilde{W}_{n}|\leq|W_{0}-\tilde{W}_{0}|+\left(\frac{c_{\mu}}{\underline{\mu}}+\frac{c_{\lambda}}{\underline{\lambda}}\right)\max(X_{n},\tilde{X}_{n})+\frac{c_{\mu}}{\underline{\mu}}\max(W_{n},\tilde{W}_{n}).

\Box{}

8.4 Proof of Lemma 4

By the inequality that (a+b)m2m1(am+bm)(a+b)^{m}\leq 2^{m-1}(a^{m}+b^{m}) for m1m\geq 1, we have

𝔼[|W(μ1,p1)W(μ2,p2)|m]\displaystyle\mathbb{E}[|W_{\infty}(\mu_{1},p_{1})-W_{\infty}(\mu_{2},p_{2})|^{m}]
2m1(𝔼[|W(μ1,p1)W(μ2,p1)|m+|W(μ2,p1)W(μ2,p2)|m]).\displaystyle\leq 2^{m-1}\left(\mathbb{E}[|W_{\infty}(\mu_{1},p_{1})-W_{\infty}(\mu_{2},p_{1})|^{m}+|W_{\infty}(\mu_{2},p_{1})-W_{\infty}(\mu_{2},p_{2})|^{m}]\right).

It suffices to prove that there exist two constant B1,B2>0B_{1},B_{2}>0 such that for 1m41\leq m\leq 4,

𝔼[|W(μ1,p1)W(μ2,p1)|m]\displaystyle\mathbb{E}[|W_{\infty}(\mu_{1},p_{1})-W_{\infty}(\mu_{2},p_{1})|^{m}] B1|μ1μ2|m,\displaystyle\leq B_{1}|\mu_{1}-\mu_{2}|^{m},
𝔼[|W(μ2,p1)W(μ2,p2)|m]\displaystyle\mathbb{E}[|W_{\infty}(\mu_{2},p_{1})-W_{\infty}(\mu_{2},p_{2})|^{m}] B2|p1p2|m.\displaystyle\leq B_{2}|p_{1}-p_{2}|^{m}.

Without loss of generality, assume μ1<μ2\mu_{1}<\mu_{2}. We now construct two stationary sequences {(Wnμi:n0),i=1,2}\{(W_{n}^{\mu_{i}}:n\leq 0),i=1,2\} that are coupled “from the past”. Let VjV_{j} and UjU_{j} be two i.i.d sequences corresponding to the service and inter-arrival times. For each ii, we define a random walk:

Y0μi=0,Ynμi=j=1n(VjμiUjλ(p1)),n1.Y^{\mu_{i}}_{0}=0,\leavevmode\nobreak\ Y^{\mu_{i}}_{n}=\sum_{j=1}^{n}\left(\frac{V_{j}}{\mu_{i}}-\frac{U_{j}}{\lambda(p_{1})}\right),\leavevmode\nobreak\ \forall n\geq 1.

It is clear that YnμiY^{\mu_{i}}_{n} is a random walk with negative drift for i=1,2i=1,2. Define

Wnμi=maxjnYjμiYnμi,n0.W_{-n}^{\mu_{i}}=\max_{j\geq n}Y^{\mu_{i}}_{j}-Y^{\mu_{i}}_{n},n\geq 0.

It is known in literature (see, for example, Blanchet and Chen (2015)) that WnμiW_{-n}^{\mu_{i}} is a stationary waiting time process of a GI/GI/1GI/GI/1 queue, starting from -\infty, with parameter (μi,p1)(\mu_{i},p_{1}). In particular, the dynamics of WnμiW_{-n}^{\mu_{i}} satisfies that

Wn+1μi=(Wnμi+VnμiUnλ(p1))+, for n1,W_{-n+1}^{\mu_{i}}=\left(W_{-n}^{\mu_{i}}+\frac{V_{n}}{\mu_{i}}-\frac{U_{n}}{\lambda(p_{1})}\right)^{+},\text{ for }n\geq 1,

with Vn/μiV_{n}/\mu_{i} being the service time of customer n-n and Un/λ(p1)U_{n}/\lambda(p_{1}) being the inter-arrival time between customer n-n and n+1-n+1. For a fixed sequence of (Vn,Un)(V_{n},U_{n}), we have

W0μ1=maxj0Yjμ1,and W0μ2=maxj0Yjμ2.W_{0}^{\mu_{1}}=\max_{j\geq 0}Y_{j}^{\mu_{1}},\quad\text{and }W_{0}^{\mu_{2}}=\max_{j\geq 0}Y_{j}^{\mu_{2}}.

As Yjμ1Yjμ2Y^{\mu_{1}}_{j}\geq Y^{\mu_{2}}_{j}, we have W0μ1W0μ2W_{0}^{\mu_{1}}\geq W_{0}^{\mu_{2}}. Besides, let τ=argmaxj0Yjμ1\tau=\arg\max_{j\geq 0}Y^{\mu_{1}}_{j}, we have

W0μ1W0μ2=maxj0Yjμ1maxj0Yjμ2=Yτμ1maxj0Yjμ2Yτμ1Yτμ2.W_{0}^{\mu_{1}}-W_{0}^{\mu_{2}}=\max_{j\geq 0}Y_{j}^{\mu_{1}}-\max_{j\geq 0}Y_{j}^{\mu_{2}}=Y_{\tau}^{\mu_{1}}-\max_{j\geq 0}Y_{j}^{\mu_{2}}\leq Y_{\tau}^{\mu_{1}}-Y_{\tau}^{\mu_{2}}.

As a consequence, we have

|W0μ1W0μ2|n=1τ(Vnμ1Vnμ2)μ2μ1μ1n=1τVnμ1, with τ=inf{n:Wn=0}.|W_{0}^{\mu_{1}}-W_{0}^{\mu_{2}}|\leq\sum_{n=1}^{\tau}\left(\frac{V_{n}}{\mu_{1}}-\frac{V_{n}}{\mu_{2}}\right)\leq\frac{\mu_{2}-\mu_{1}}{\mu_{1}}\sum_{n=1}^{\tau}\frac{V_{n}}{\mu_{1}},\quad\text{ with }\tau=\inf\{n:W_{-n}=0\}.

Note that Vn/μ1V_{n}/\mu_{1} is the service time of customer n-n in the system with parameter (p1,μ1)(p_{1},\mu_{1}). By the definition of τ\tau, customer τ-\tau enters service immediately upon the arrival and the queue remains busy by arrival of customer 0. Therefore, the summation of service times on the right hand side equals to the time between the arrival of customer τ-\tau and the departure of customer 1-1, which equals to the observed busy period at the arrival of customer 0 plus its waiting time, i.e.,

|W0μ1W0μ2|μ2μ1μ1n=1τVnμ1=μ2μ1μ1(X0μ1+W0μ1).|W_{0}^{\mu_{1}}-W_{0}^{\mu_{2}}|\leq\frac{\mu_{2}-\mu_{1}}{\mu_{1}}\sum_{n=1}^{\tau}\frac{V_{n}}{\mu_{1}}=\frac{\mu_{2}-\mu_{1}}{\mu_{1}}(X_{0}^{\mu_{1}}+W_{0}^{\mu_{1}}).

Therefore, for each nn,

𝔼[|W0μ1W0μ2|m](μ2μ1)mμ1m𝔼[(X0μ1+W0μ1)m](μ2μ1)mμ¯m𝔼[(X0μ1+W0μ1)m].\mathbb{E}[|W_{0}^{\mu_{1}}-W_{0}^{\mu_{2}}|^{m}]\leq\frac{(\mu_{2}-\mu_{1})^{m}}{\mu_{1}^{m}}\mathbb{E}[(X_{0}^{\mu_{1}}+W_{0}^{\mu_{1}})^{m}]\leq\frac{(\mu_{2}-\mu_{1})^{m}}{\underline{\mu}^{m}}\mathbb{E}[(X_{0}^{\mu_{1}}+W_{0}^{\mu_{1}})^{m}].

Following Lemma 1, 𝔼[(X0μ1+W0μ1)m]2mM\mathbb{E}[(X_{0}^{\mu_{1}}+W_{0}^{\mu_{1}})^{m}]\leq 2^{m}M. Let B1=max1m42mM/μ¯mB_{1}=\max_{1\leq m\leq 4}2^{m}M/\underline{\mu}^{m} and we conclude, for 1m41\leq m\leq 4,

𝔼[|W0μ1W0μ2|m]B1|μ1μ2|m.\mathbb{E}[|W_{0}^{\mu_{1}}-W_{0}^{\mu_{2}}|^{m}]\leq B_{1}|\mu_{1}-\mu_{2}|^{m}.

The bound for 𝔼[|W(μ2,p1)W(μ2,p2)|m]\mathbb{E}[|W_{\infty}(\mu_{2},p_{1})-W_{\infty}(\mu_{2},p_{2})|^{m}] follows a similar argument and therefore we only provide a sketch of the proof. Without loss of generality, we assume p1<p2p_{1}<p_{2} and consider two stationary waiting time process {(Wnpi:n0),λi=λ(pi),i=1,2}\{(W_{n}^{p_{i}}:n\leq 0),\lambda_{i}=\lambda(p_{i}),i=1,2\} that are coupled from past with the same sequence (Vn,Un)(V_{n},U_{n}) in a similar way as we introduced previously. Then, we have |W0p1W0p2|(λ1λ2)X0p1/λ2|W_{0}^{p_{1}}-W_{0}^{p_{2}}|\leq(\lambda_{1}-\lambda_{2})X_{0}^{p_{1}}/\lambda_{2}, and therefore,

𝔼[|W0p1W0p2|m]B2|p1p2|m, with B2=max1m4,p¯pp¯(M|λ(p)|m/λ(p¯)m).\mathbb{E}[|W_{0}^{p_{1}}-W_{0}^{p_{2}}|^{m}]\leq B_{2}|p_{1}-p_{2}|^{m},\text{ with }B_{2}=\max_{1\leq m\leq 4,\underline{p}\leq p\leq\bar{p}}(M|\lambda^{\prime}(p)|^{m}/\lambda(\bar{p})^{m}).

As a consequence, we can take

B=8max1m4(2mM/μ¯m)max1m4,p¯pp¯(M|λ(p)|m/λ(p¯)m).B=8\cdot\max_{1\leq m\leq 4}(2^{m}M/\underline{\mu}^{m})\vee\max_{1\leq m\leq 4,\underline{p}\leq p\leq\bar{p}}(M|\lambda^{\prime}(p)|^{m}/\lambda(\bar{p})^{m}). (31)

8.5 Full Proof of Theorem 1

We first give the proofs of Corollaries 13.

Proof of Corollary 1.

For any ndkn\geq d_{k},

𝔼[|WnkW¯nk|]=𝔼[|WnkW¯nk|1(Qk<dk)]+𝔼[|WnkW¯nk|1(Qkdk)].\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|]=\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|1(Q_{k}<d_{k})]+\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|1(Q_{k}\geq d_{k})].

Given that Qk<dkQ_{k}<d_{k}, by definition, WnkW_{n}^{k} is synchronously coupled with W¯nk\bar{W}_{n}^{k} for ndk+1n\geq d_{k}+1. Note that given Qk<dkQ_{k}<d_{k}, UnkU_{n}^{k} and VnkV_{n}^{k} are independent of QkQ_{k} for ndk+1n\geq d_{k}+1. As a consequence, by Lemma 2, the conditional expectation

𝔼[|WnkW¯nk||Qk<dk,Wdkk,W¯dkk]eγ(ndk)(2+eμ¯θWdkk+eμ¯θW¯dkk)|WdkkW¯dkk|.\displaystyle\mathbb{E}\left[|W_{n}^{k}-\bar{W}_{n}^{k}|\leavevmode\nobreak\ \Big{|}\leavevmode\nobreak\ Q_{k}<d_{k},W_{d_{k}}^{k},\bar{W}_{d_{k}}^{k}\right]\leq e^{-\gamma(n-d_{k})}(2+e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}})\left|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}\right|.

Therefore,

𝔼[|WnkW¯nk|1(Qk<dk)]\displaystyle\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|1(Q_{k}<d_{k})] eγ(ndk)𝔼[(2+eμ¯θWdkk+eμ¯θW¯dkk)|WdkkW¯dkk|1(Qk<dk)]\displaystyle\leq e^{-\gamma(n-d_{k})}\mathbb{E}\left[(2+e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}})\left|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}\right|1(Q_{k}<d_{k})\right]
eγ(ndk)𝔼[(2+eμ¯θWdkk+eμ¯θW¯dkk)|WdkkW¯dkk|]\displaystyle\leq e^{-\gamma(n-d_{k})}\mathbb{E}\left[(2+e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}})\left|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}\right|\right]
eγ(ndk)(2+𝔼[(eμ¯θWdkk+eμ¯θW¯dkk)2]1/2)𝔼[|WdkkW¯dkk|2]1/2.\displaystyle\leq e^{-\gamma(n-d_{k})}\left(2+\mathbb{E}\left[\left(e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}}\right)^{2}\right]^{1/2}\right)\mathbb{E}\left[\left|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}\right|^{2}\right]^{1/2}.

By Lemma 1 and Assumption 2, we have

𝔼[(eμ¯θWdkk+eμ¯θW¯dkk)2]\displaystyle\mathbb{E}\left[\left(e^{\bar{\mu}\theta W_{d_{k}}^{k}}+e^{\bar{\mu}\theta\bar{W}_{d_{k}}^{k}}\right)^{2}\right] 2(𝔼[e2μ¯θWdkk]+𝔼[e2μ¯θW¯dkk])4M,\displaystyle\leq 2\left(\mathbb{E}[e^{2\bar{\mu}\theta W_{d_{k}}^{k}}]+\mathbb{E}[e^{2\bar{\mu}\theta\bar{W}_{d_{k}}^{k}}]\right)\leq 4M,
𝔼[|WdkkW¯dkk|2]\displaystyle\mathbb{E}\left[|W_{d_{k}}^{k}-\bar{W}_{d_{k}}^{k}|^{2}\right] 2(𝔼[(Wdkk)2]+𝔼[(W¯dkk)2])4M.\displaystyle\leq 2\left(\mathbb{E}[(W_{d_{k}}^{k})^{2}]+\mathbb{E}[(\bar{W}_{d_{k}}^{k})^{2}]\right)\leq 4M.

As a consequence, we have

𝔼[|WnkW¯nk|1(Qk<dk)]eγ(ndk)A,with A=4M+4M.\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|1(Q_{k}<d_{k})]\leq e^{-\gamma(n-d_{k})}A,\text{with }A=4\sqrt{M}+4M.

On the other hand,

𝔼[|WnkW¯nk|1(Qkdk)]𝔼[|WnkW¯nk|2]1/2(Qkdk)1/2.\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|1(Q_{k}\geq d_{k})]\leq\mathbb{E}\left[|W_{n}^{k}-\bar{W}_{n}^{k}|^{2}\right]^{1/2}\mathbb{P}(Q_{k}\geq d_{k})^{1/2}.

Again, by Lemma 1, 𝔼[|WnkW¯nk|2]4M.\mathbb{E}\left[|W_{n}^{k}-\bar{W}_{n}^{k}|^{2}\right]\leq 4M. As dk=4log(k)/min(γ,η)d_{k}=\lceil 4\log(k)/\min(\gamma,\eta)\rceil,

(Qkdk)eηdk𝔼[eηQk]k4M.\mathbb{P}(Q_{k}\geq d_{k})\leq e^{-\eta d_{k}}\mathbb{E}\left[e^{\eta Q_{k}}\right]\leq k^{-4}M.

In summary, we have, for ndk+1n\geq d_{k}+1,

𝔼[|WnkW¯nk|]eγ(ndk)A+2Mk2.\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|]\leq e^{-\gamma(n-d_{k})}A+2Mk^{-2}.

As a direct consequence,

|I1|\displaystyle|I_{1}| =|𝔼[n=d~k+1DkWnkw(μk,pk)]|n=d~k+1Dk𝔼[|WnkW¯nk|]\displaystyle=\left|\mathbb{E}\left[\sum_{n=\tilde{d}_{k}+1}^{D_{k}}W_{n}^{k}-w(\mu_{k},p_{k})\right]\right|\leq\sum_{n=\tilde{d}_{k}+1}^{D_{k}}\mathbb{E}[|W_{n}^{k}-\bar{W}_{n}^{k}|]
n=d~k+1eγ(ndk)A+2Mk2DkA1eγk1+2MK2kα=O(kα).\displaystyle\leq\sum_{n=\tilde{d}_{k}+1}^{\infty}e^{-\gamma(n-d_{k})}A+2Mk^{-2}D_{k}\leq\frac{A}{1-e^{-\gamma}}k^{-1}+2MK_{2}k^{-\alpha}=O(k^{-\alpha}).

Proof of Corollary 2.

Recall that by (6), for each cycle kk,

Wnk={(Wn1k+VnkμkUnkλnk)+ for 1nQkDk;(Wn1k+VnkμkUnkλk)+ for (Qk+1)(Dk+1)nDk.,W0k=WDk1k1.W_{n}^{k}=\begin{cases}\left(W_{n-1}^{k}+\frac{V^{k}_{n}}{\mu_{k}}-\frac{U^{k}_{n}}{\lambda^{k}_{n}}\right)^{+}&\text{ for }1\leq n\leq Q_{k}{\wedge D_{k}};\\ \left(W_{n-1}^{k}+\frac{V^{k}_{n}}{\mu_{k}}-\frac{U^{k}_{n}}{\lambda_{k}}\right)^{+}&\text{ for }(Q_{k}+1){\wedge(D_{k}+1)}\leq n\leq D_{k}.\end{cases},\quad W_{0}^{k}=W_{D_{k-1}}^{k-1}.

Define

W~nk={(W~n1k+VnkμkUnkλk1)+ for 1nQkDk;(W~n1k+VnkμkUnkλk)+ for (Qk+1)(Dk+1)nDk.,W~0k=WDk1k1.\tilde{W}_{n}^{k}=\begin{cases}\left(\tilde{W}_{n-1}^{k}+\frac{V^{k}_{n}}{\mu_{k}}-\frac{U^{k}_{n}}{\lambda_{k-1}}\right)^{+}&\text{ for }1\leq n\leq Q_{k}{\wedge D_{k}};\\ \left(\tilde{W}_{n-1}^{k}+\frac{V^{k}_{n}}{\mu_{k}}-\frac{U^{k}_{n}}{\lambda_{k}}\right)^{+}&\text{ for }(Q_{k}+1){\wedge(D_{k}+1)}\leq n\leq D_{k}.\end{cases},\quad\tilde{W}_{0}^{k}=W_{D_{k-1}}^{k-1}.

Then, in the case Qk1<Dk1Q_{k-1}<D_{k-1}, we have Wnk=W~nkW_{n}^{k}=\tilde{W}_{n}^{k} for all 1nDk1\leq n\leq D_{k}. As a consequence, we have

|WnkW¯Dk1+nk1||W~nkW¯Dk1+nk1|+|WnkW¯Dk1+nk1|1(Qk1Dk1).|W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|\leq|\tilde{W}_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|+|W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|\cdot 1(Q_{k-1}\geq D_{k-1}).

For the second term, by Lemma 1, we have, for k2k\geq 2,

𝔼[|WnkW¯Dk1+nk1|1(Qk1Dk1)]\displaystyle\mathbb{E}[|W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|\cdot 1(Q_{k-1}\geq D_{k-1})] 𝔼[(WnkW¯Dk1+nk1)2]1/2(Qk1Dk1)1/2\displaystyle\leq\mathbb{E}[(W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1})^{2}]^{1/2}\mathbb{P}(Q_{k-1}\geq D_{k-1})^{1/2}
(2𝔼[(Wnk)2]+2𝔼[(W¯Dk1+nk1)2])1/2(exp(ηDk1)𝔼[exp(ηQk1])1/2\displaystyle\leq(2\mathbb{E}[(W_{n}^{k})^{2}]+2\mathbb{E}[(\bar{W}_{D_{k-1}+n}^{k-1})^{2}])^{1/2}\left(\exp(-\eta D_{k-1})\mathbb{E}[\exp(\eta Q_{k-1}]\right)^{1/2}
2M(k1)316Mk3\displaystyle\leq 2M(k-1)^{-3}\leq 16Mk^{-3}

For the first term, by definition, W¯Dk1+nk1\bar{W}_{D_{k-1}+n}^{k-1} is a waiting time sequence with service and arrival rates (μk1,λ(pk1))(\mu_{k-1},\lambda(p_{k-1})) and W~nk\tilde{W}_{n}^{k} is a sequence with rates (μk,λ(pk))(\mu_{k},\lambda(p_{k})) or (μk,λ(pk1))(\mu_{k},\lambda(p_{k-1})). As a consequence, by applying Lemma 3, we have

|W~nkW¯Dk1+nk1|\displaystyle|\tilde{W}_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|\leq |W~0kW¯Dk1k1|+(|μkμk1|μ¯+|λ(pk)λ(pk1)|λ(p¯))max(X~nk,X¯Dk1+nk1)\displaystyle|\tilde{W}_{0}^{k}-\bar{W}_{D_{k-1}}^{k-1}|+\left(\frac{|\mu_{k}-\mu_{k-1}|}{\underline{\mu}}+\frac{|\lambda(p_{k})-\lambda(p_{k-1})|}{\lambda(\bar{p})}\right)\max(\tilde{X}_{n}^{k},\bar{X}_{D_{k-1}+n}^{k-1})
+|μkμk1|μ¯max(W~nk,W¯Dk1+nk1).\displaystyle+\frac{|\mu_{k}-\mu_{k-1}|}{\underline{\mu}}\max(\tilde{W}_{n}^{k},\bar{W}_{D_{k-1}+n}^{k-1}).

By Lemma 1, we have that max(X~nk,X¯Dk1+nk1)λ(p¯)λ(p¯)X^nk\max(\tilde{X}_{n}^{k},\bar{X}_{D_{k-1}+n}^{k-1})\leq\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\hat{X}_{n}^{k} and max(W~nk,W¯Dk1+nk1)W^nk\max(\tilde{W}_{n}^{k},\bar{W}_{D_{k-1}+n}^{k-1})\leq\hat{W}_{n}^{k}, where X^nk\hat{X}_{n}^{k} and W^nk\hat{W}_{n}^{k} are the observed busy period and waiting time in a stationary GI/GI/1GI/GI/1 queue with rate (μ¯,p¯)(\underline{\mu},\underline{p}) as defined in Lemma 1. On the other hand, under Condition (b) of Theorem 1,

𝔼[|μkμk1|2]\displaystyle\mathbb{E}[|\mu_{k}-\mu_{k-1}|^{2}] 𝔼[xkxk+12]K2k2α\displaystyle\leq\mathbb{E}[\|x_{k}-x_{k+1}\|^{2}]\leq K_{2}k^{-2\alpha}
𝔼[|λkλk1|2]\displaystyle\mathbb{E}[|\lambda_{k}-\lambda_{k-1}|^{2}] K2(maxpλ(p))2k2αK6k2α.\displaystyle\leq K_{2}\left(\max_{p}\lambda^{\prime}(p)\right)^{2}k^{-2\alpha}\equiv K_{6}k^{-2\alpha}.

Therefore,

𝔼[|μkμk1|max(X~nk,X¯Dk1+nk1)]𝔼[(μkμk1)2]1/2λ(p¯)λ(p¯)𝔼[(X^nk)2]1/2K2Mkα;\displaystyle\mathbb{E}\left[|\mu_{k}-\mu_{k-1}|\max(\tilde{X}_{n}^{k},\bar{X}_{D_{k-1}+n}^{k-1})\right]\leq\mathbb{E}[(\mu_{k}-\mu_{k-1})^{2}]^{1/2}\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\mathbb{E}[(\hat{X}_{n}^{k})^{2}]^{1/2}\leq\sqrt{K_{2}}\sqrt{M}k^{-\alpha};
𝔼[|λ(pk)λ(pk1)|max(X~nk,X¯Dk1+nk1)]𝔼[(λkλk1)2]1/2λ(p¯)λ(p¯)𝔼[(X^nk)2]1/2K6Mkα;\displaystyle\mathbb{E}[|\lambda(p_{k})-\lambda(p_{k-1})|\max(\tilde{X}_{n}^{k},\bar{X}_{D_{k-1}+n}^{k-1})]\leq\mathbb{E}[(\lambda_{k}-\lambda_{k-1})^{2}]^{1/2}\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\mathbb{E}[(\hat{X}_{n}^{k})^{2}]^{1/2}\leq\sqrt{K_{6}}\sqrt{M}k^{-\alpha};
𝔼[|μkμk1|max(W~nk,W¯Dk1+nk1)]𝔼[(μkμk1)2]1/2𝔼[(W^nk)2]1/2K2Mkα.\displaystyle\mathbb{E}\left[|\mu_{k}-\mu_{k-1}|\max(\tilde{W}_{n}^{k},\bar{W}_{D_{k-1}+n}^{k-1})\right]\leq\mathbb{E}[(\mu_{k}-\mu_{k-1})^{2}]^{1/2}\mathbb{E}[(\hat{W}_{n}^{k})^{2}]^{1/2}\leq\sqrt{K_{2}}\sqrt{M}k^{-\alpha}.

Finally, by Corollary 1, we have

𝔼[|W~0kW¯Dk1k1|]=𝔼[|W¯Dk1k1W0k|]=𝔼[|W¯Dk1k1WDk1k1|](A+2M)(k1)2(4A+8M)k2.{\mathbb{E}}[|\tilde{W}_{0}^{k}-\bar{W}_{D_{k-1}}^{k-1}|]=\mathbb{E}[|\bar{W}_{D_{k-1}}^{k-1}-W_{0}^{k}|]=\mathbb{E}[|\bar{W}_{D_{k-1}}^{k-1}-W_{D_{k-1}}^{k-1}|]\leq(A+2M)(k-1)^{-2}\leq(4A+8M)k^{-2}.

In summary, we can conclude

|𝔼[Wnkw(μk,pk)]|\displaystyle|\mathbb{E}[W_{n}^{k}-w(\mu_{k},p_{k})]| 𝔼[|w(μk1,pk1)w(μk,pk)|]+𝔼[|WnkW¯Dk1+nk1|]\displaystyle\leq\mathbb{E}[|w(\mu_{k-1},p_{k-1})-w(\mu_{k},p_{k})|]+\mathbb{E}[|W_{n}^{k}-\bar{W}_{D_{k-1}+n}^{k-1}|]
B𝔼[|μkμk1|+|λ(pk)λ(pk1)|]+(2K2μ¯+K6λ(p¯))Mkα+O(k2)\displaystyle\leq B\mathbb{E}[|\mu_{k}-\mu_{k-1}|+|\lambda(p_{k})-\lambda(p_{k-1})|]+\left(\frac{2\sqrt{K_{2}}}{\underline{\mu}}+\frac{\sqrt{K_{6}}}{\lambda(\bar{p})}\right)\sqrt{M}k^{-\alpha}+O(k^{-2})
B(K2+K6)kα+(2K2μ¯+K6λ(p¯))Mkα+O(k2)\displaystyle\leq B(\sqrt{K_{2}}+\sqrt{K_{6}})k^{-\alpha}+\left(\frac{2\sqrt{K_{2}}}{\underline{\mu}}+\frac{\sqrt{K_{6}}}{\lambda(\bar{p})}\right)\sqrt{M}k^{-\alpha}+O(k^{-2})
=O(kα),\displaystyle=O(k^{-\alpha}),

where the second inequality follows from Lemma 4. As a direct consequence, |I2|=O(kαlog(k))|I_{2}|=O(k^{-\alpha}\log(k)) as d~k=O(log(k))\tilde{d}_{k}=O(\log(k)).

Proof of Corollary 3.

Note that by Lemma 1,

|h0𝔼[W(μk,pk)]+h0μkpk|h0M+h0μ¯1+p¯=O(1).\left|h_{0}\mathbb{E}[W_{\infty}(\mu_{k},p_{k})]+\frac{h_{0}}{\mu_{k}}-p_{k}\right|\leq h_{0}M+h_{0}\underline{\mu}^{-1}+\bar{p}=O(1).

So it suffices to show that

𝔼[|Dkλ(pk)Tk|]=O(kα),𝔼[n=1QkDk|pkpnk|]=O(kα).\mathbb{E}[\left|D_{k}-\lambda(p_{k})T_{k}\right|]=O(k^{-\alpha}),\quad\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}|p_{k}-p_{n}^{k}|\right]=O(k^{-\alpha}).

Given μk\mu_{k} and pkp_{k}, TkT_{k} is the time for the DkD_{k}-th customer to enter service. Let FnkF_{n}^{k} be the inter-service time between the (n1)(n-1)-th and the nn-th customers in cycle kk. Then, Tk=n=1DkFnkT_{k}=\sum_{n=1}^{D_{k}}F_{n}^{k} and for each nn,

Fnk={Unkλnk+WnkWn1k for 1nQkUnkλk+WnkWn1kQk+1nDk.F_{n}^{k}=\begin{cases}\frac{U^{k}_{n}}{\lambda^{k}_{n}}+W_{n}^{k}-W_{n-1}^{k}&\text{ for }1\leq n\leq Q_{k}\\ \frac{U^{k}_{n}}{\lambda_{k}}+W_{n}^{k}-W_{n-1}^{k}&Q_{k}+1\leq n\leq D_{k}.\end{cases}

Therefore,

Tk\displaystyle T_{k} =k=1DkFnk=n=1QkDkUnkλnk+1λkn=Qk+1DkUnk+WDkkW0k\displaystyle=\sum_{k=1}^{D_{k}}F_{n}^{k}=\sum_{n=1}^{Q_{k}\wedge D_{k}}\frac{U^{k}_{n}}{\lambda_{n}^{k}}+\frac{1}{\lambda_{k}}\sum_{n=Q_{k}+1}^{D_{k}}U^{k}_{n}+W_{D_{k}}^{k}-W_{0}^{k}
=1λkn=1DkUnk+WDkkW0k+n=1QkDkUnk(1λnk1λk).\displaystyle=\frac{1}{\lambda_{k}}\sum_{n=1}^{D_{k}}U^{k}_{n}+W_{D_{k}}^{k}-W_{0}^{k}+\sum_{n=1}^{Q_{k}\wedge D_{k}}U^{k}_{n}\left(\frac{1}{\lambda_{n}^{k}}-\frac{1}{\lambda_{k}}\right).

As a consequence,

|𝔼[(DkλkTk)]|\displaystyle|\mathbb{E}\left[(D_{k}-\lambda_{k}T_{k})\right]| λk|𝔼[WDkk]𝔼[W0k]|+𝔼[k=1QkDkUnk|λkλnk1|].\displaystyle\leq\lambda_{k}|\mathbb{E}[W_{D_{k}}^{k}]-\mathbb{E}[W_{0}^{k}]|+\mathbb{E}\left[\sum_{k=1}^{Q_{k}\wedge D_{k}}U^{k}_{n}\Big{|}\frac{\lambda_{k}}{\lambda_{n}^{k}}-1\Big{|}\right].

Following Corollary 1 and Lemma 4, for k2k\geq 2, the first term

|𝔼[WDkk]𝔼[W0k]|\displaystyle|\mathbb{E}[W_{D_{k}}^{k}]-\mathbb{E}[W_{0}^{k}]| 𝔼|WDkkW¯Dkk|+𝔼|WDk1k1W¯Dk1k1|+|𝔼[W¯Dkk]𝔼[W¯Dk1k1]|\displaystyle\leq\mathbb{E}|W_{D_{k}}^{k}-\bar{W}_{D_{k}}^{k}|+\mathbb{E}|W_{D_{k-1}}^{k-1}-\bar{W}_{D_{k-1}}^{k-1}|+|\mathbb{E}[\bar{W}_{D_{k}}^{k}]-\mathbb{E}[\bar{W}_{D_{k-1}}^{k-1}]|
=(A+2M)(k2+(k1)2)+BK2kα=O(kα).\displaystyle=(A+2M)\left(k^{-2}+(k-1)^{-2}\right)+B\sqrt{K_{2}}k^{-\alpha}=O(k^{-\alpha}).

As to the second term, by definition, the customers 1 to Qk1Q_{k}-1 arrive to the system while customer 0 is waiting in the system, and therefore,

0i=1(Qk1)DkUikλ¯i=1(Qk1)DkUikλikW0k𝔼[(i=1QkDkUik)2]𝔼[(λ¯W0k+UQkk)2]4λ¯2M.0\leq\sum_{i=1}^{(Q_{k}-1)\wedge D_{k}}\frac{U_{i}^{k}}{\bar{\lambda}}\leq\sum_{i=1}^{(Q_{k}-1)\wedge D_{k}}\frac{U_{i}^{k}}{\lambda_{i}^{k}}\leq W_{0}^{k}\quad\Rightarrow\quad\mathbb{E}\left[\left(\sum_{i=1}^{Q_{k}\wedge D_{k}}U_{i}^{k}\right)^{2}\right]\leq\mathbb{E}\left[(\bar{\lambda}W_{0}^{k}+U^{k}_{Q_{k}})^{2}\right]\leq 4\bar{\lambda}^{2}M.

Here, 𝔼[(UQkk)2]\mathbb{E}[(U_{Q_{k}}^{k})^{2}] is bounded since we assume that UU is light-tailed (Assumption 2). For the simplicity of notation, we just assume that 𝔼[(Ui2λ¯)2]<M\mathbb{E}\left[\Big{(}\frac{U^{2}_{i}}{\bar{\lambda}}\Big{)}^{2}\right]<M for the same MM in Lemma 1. Then,

𝔼[k=1QkDkUnk|λkλnk1|]\displaystyle\mathbb{E}\left[\sum_{k=1}^{Q_{k}\wedge D_{k}}U^{k}_{n}\Big{|}\frac{\lambda_{k}}{\lambda_{n}^{k}}-1\Big{|}\right] 𝔼[k=1QkDkUnk|λkλk11|]+𝔼[k=1QkDkUnkλ¯λ¯1(Qk1Dk1)]\displaystyle\leq\mathbb{E}\left[\sum_{k=1}^{Q_{k}\wedge D_{k}}U^{k}_{n}\Big{|}\frac{\lambda_{k}}{\lambda_{k-1}}-1\Big{|}\right]+\mathbb{E}\left[\sum_{k=1}^{Q_{k}\wedge D_{k}}U^{k}_{n}\cdot\frac{\bar{\lambda}}{\underline{\lambda}}\cdot 1(Q_{k-1}\geq D_{k-1})\right]
2λ¯M𝔼[|λkλk11|2]1/2+2λ¯2λ¯M(Qk1Dk1)1/2\displaystyle\leq 2\bar{\lambda}\sqrt{M}\mathbb{E}\left[\Big{|}\frac{\lambda_{k}}{\lambda_{k-1}}-1\Big{|}^{2}\right]^{1/2}+\frac{2\bar{\lambda}^{2}}{\underline{\lambda}}\sqrt{M}\mathbb{P}(Q_{k-1}\geq D_{k-1})^{1/2}
2λ¯MK61/2λ¯kα+16λ¯2λ¯Mk3=O(kα).\displaystyle\leq\frac{2\bar{\lambda}\sqrt{M}K_{6}^{1/2}}{\underline{\lambda}}k^{-\alpha}+\frac{16\bar{\lambda}^{2}}{\underline{\lambda}}Mk^{-3}=O(k^{-\alpha}).

Finally,

𝔼[n=1QkDk|pkpnk|]\displaystyle\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}|p_{k}-p_{n}^{k}|\right] 𝔼[n=1QkDk|pkpnk|1(Qk1<Dk1)]\displaystyle\leq\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}|p_{k}-p_{n}^{k}|\cdot 1(Q_{k-1}<D_{k-1})\right]
+𝔼[n=1QkDk|pkpnk|1(Qk1Dk1)]\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ +\mathbb{E}\left[\sum_{n=1}^{Q_{k}\wedge D_{k}}|p_{k}-p_{n}^{k}|\cdot 1(Q_{k-1}\geq D_{k-1})\right]
𝔼[Qk12]1/2𝔼[|pkpk1|2]1/2+𝔼[p¯2Qk2]1/2(Qk1Dk1)1/2\displaystyle\leq\mathbb{E}\left[Q_{k-1}^{2}\right]^{1/2}\mathbb{E}\left[|p_{k}-p_{k-1}|^{2}\right]^{1/2}+\mathbb{E}\left[\bar{p}^{2}Q_{k}^{2}\right]^{1/2}\mathbb{P}(Q_{k-1}\geq D_{k-1})^{1/2}
MK2kα+8p¯Mk3=O(kα)\displaystyle\leq\sqrt{MK_{2}}k^{-\alpha}+8\bar{p}Mk^{-3}=O(k^{-\alpha})

Therefore, I3=O(kα)I_{3}=O(k^{-\alpha}). ∎

Finishing the proof of Theorem 1.

First, by Corollary 1, we have

|I1|A1eγk1+2MK2kα=O(kα).\displaystyle|I_{1}|\leq\frac{A}{1-e^{-\gamma}}k^{-1}+2MK_{2}k^{-\alpha}=O(k^{-\alpha}).

By Corollary 2,

|I2|5min(γ,η)(B(K2+K6)+(2K2μ¯+K6λ¯)M+(4A+8M))kαlog(k)=O(kαlog(k)).\displaystyle|I_{2}|\leq\frac{5}{\min(\gamma,\eta)}\left(B(\sqrt{K_{2}}+\sqrt{K_{6}})+\left(\frac{2\sqrt{K_{2}}}{\underline{\mu}}+\frac{\sqrt{K_{6}}}{\underline{\lambda}}\right)\sqrt{M}+(4A+8M)\right)k^{-\alpha}\log(k)=O(k^{-\alpha}\log(k)).

Following the proof of Corollary 3, we have

I3\displaystyle I_{3} (h0M+h0μ¯1+p¯)(2λ¯MK61/2λ¯+16λ¯2λ¯M)kα+(MK2+8p¯M)kα=O(kα).\displaystyle\leq(h_{0}M+h_{0}\underline{\mu}^{-1}+\bar{p})\left(\frac{2\bar{\lambda}\sqrt{M}K_{6}^{1/2}}{\underline{\lambda}}+\frac{16\bar{\lambda}^{2}}{\underline{\lambda}}M\right)k^{-\alpha}+(\sqrt{MK_{2}}+8\bar{p}M)k^{-\alpha}=O(k^{-\alpha}).

Therefore, we can conclude that k2\forall k\geq 2, R1,kKkαlog(k)R_{1,k}\leq K^{\prime}\cdot k^{-\alpha}\log(k) with

K=\displaystyle K^{\prime}=\leavevmode\nobreak\ Ah01eγ+2h0MK2+5h0min(γ,η)(B(K2+K6)+(2K2μ¯+K6λ¯)M+(4A+8M))\displaystyle\frac{Ah_{0}}{1-e^{-\gamma}}+2h_{0}MK_{2}+\frac{5h_{0}}{\min(\gamma,\eta)}\left(B(\sqrt{K_{2}}+\sqrt{K_{6}})+\left(\frac{2\sqrt{K_{2}}}{\underline{\mu}}+\frac{\sqrt{K_{6}}}{\underline{\lambda}}\right)\sqrt{M}+(4A+8M)\right)
+(h0M+h0μ¯1+p¯)(2λ¯MK61/2λ¯+16λ¯2λ¯M)+MK2+8p¯M.\displaystyle+(h_{0}M+h_{0}\underline{\mu}^{-1}+\bar{p})\left(\frac{2\bar{\lambda}\sqrt{M}K_{6}^{1/2}}{\underline{\lambda}}+\frac{16\bar{\lambda}^{2}}{\underline{\lambda}}M\right)+\sqrt{MK_{2}}+8\bar{p}M. (32)

Let M0>0M_{0}>0 be the upper bound of the regret in the first cycle. Here the constant M0<M_{0}<\infty since the decision region \mathcal{B} is bounded and by condition (a), D1K2D_{1}\leq K_{2} is also bounded. Finally, we conclude that

R1(L)\displaystyle R_{1}(L) M0+Kk=2Lkαlog(k)Kk=1Lkαlog(k).\displaystyle\leq M_{0}+K^{\prime}\sum_{k=2}^{L}k^{-\alpha}\log(k)\leq K\sum_{k=1}^{L}k^{-\alpha}\log(k).

with K=K+2M0log(2)K=K^{\prime}+\frac{2M_{0}}{\log(2)}. ∎

8.6 Convergence Rate of Observed Busy Period

As an analogue of Lemma 2, we prove a uniform convergence rate for the observed busy period XnX_{n}, which will be used to bound BkB_{k} and 𝒱k\mathcal{V}_{k} of the gradient estimator (18) that involves terms of XnkX_{n}^{k}.

Lemma 6.

Let Xn1X^{1}_{n} and Xn2X^{2}_{n} be the observed busy period of the two queueing systems coupled as in Lemma 2, with X01,X02stλ(p¯)λ(p¯)X^0X^{1}_{0},X^{2}_{0}\leq_{st}\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\hat{X}_{0} and W01,W02stW^0W^{1}_{0},W^{2}_{0}\leq_{st}\hat{W}_{0}.

  1. 1.

    |Xn1Xn2|𝟏{max(Γ1,Γ2)>n}(k=1nτk+X01+X02)|X^{1}_{n}-X^{2}_{n}|\leq{\bf 1}_{\{\max(\Gamma_{1},\Gamma_{2})>n\}}\left(\sum_{k=1}^{n}\tau_{k}+X^{1}_{0}+X^{2}_{0}\right).

  2. 2.

    There exists a constant K4>0K_{4}>0 such that |𝔼[Xn1Xn2]|mK4e0.5γnn2|\mathbb{E}[X^{1}_{n}-X^{2}_{n}]|^{m}\leq K_{4}e^{-0.5\gamma n}n^{2} for all n1n\geq 1 and m2m\leq 2.

Proof of Lemma 6.

1. Following the argument in Lemma 2, if W01W02W^{1}_{0}\geq W^{2}_{0}, we will have WΓ11=WΓ12=0W^{1}_{\Gamma_{1}}=W^{2}_{\Gamma_{1}}=0 and hence XΓ11=XΓ12=0X^{1}_{\Gamma_{1}}=X^{2}_{\Gamma_{1}}=0. Since the two systems share the same sequence of arrivals and service times, Xn1=Xn2X^{1}_{n}=X^{2}_{n} for all nΓ1n\geq\Gamma_{1}. Therefore,

|Xn1Xn2|𝟏{max(Γ1,Γ2)>n}|Xn1Xn2|𝟏{max(Γ1,Γ2)>n}(k=1nτk+X01+X02).|X^{1}_{n}-X^{2}_{n}|\leq{\bf 1}_{\{\max(\Gamma_{1},\Gamma_{2})>n\}}|X^{1}_{n}-X^{2}_{n}|\leq{\bf 1}_{\{\max(\Gamma_{1},\Gamma_{2})>n\}}\left(\sum_{k=1}^{n}\tau_{k}+X^{1}_{0}+X^{2}_{0}\right).

The last inequality follows from 0XniX0i+k=1nτk0\leq X^{i}_{n}\leq X_{0}^{i}+\sum_{k=1}^{n}\tau_{k} for i=1,2i=1,2.

2. Following 1 and part 2 of Lemma 2, for m=1,2m=1,2,

𝔼[|Xn1Xn2|m]\displaystyle\mathbb{E}[|X^{1}_{n}-X^{2}_{n}|^{m}] 𝔼[𝟏{max(Γ1,Γ2)>n}(k=1nτk+X01+X02)m]\displaystyle\leq\mathbb{E}\left[{\bf 1}_{\{\max(\Gamma_{1},\Gamma_{2})>n\}}\left(\sum_{k=1}^{n}\tau_{k}+X^{1}_{0}+X^{2}_{0}\right)^{m}\right]
(max(Γ1,Γ2)>n)1/2𝔼[(k=1nτk+X01+X02)2m]1/2\displaystyle\leq\mathbb{P}(\max(\Gamma_{1},\Gamma_{2})>n)^{1/2}\mathbb{E}\left[\left(\sum_{k=1}^{n}\tau_{k}+X^{1}_{0}+X^{2}_{0}\right)^{2m}\right]^{1/2}

where

(max(Γ1,Γ2)>n)enγ𝔼[2+eμθW01+eμθW02]enγ(2+2M),\mathbb{P}(\max(\Gamma_{1},\Gamma_{2})>n)\leq e^{-n\gamma}\mathbb{E}[2+e^{\mu\theta W_{0}^{1}}+e^{\mu\theta W_{0}^{2}}]\leq e^{-n\gamma}(2+2M),

and

𝔼[(k=1nτk+X01+X02)2m]32m1(n2m𝔼[U12mλ(p)2m]+𝔼[(X01)2m]+𝔼[(X02)2m]).\mathbb{E}\left[\left(\sum_{k=1}^{n}\tau_{k}+X^{1}_{0}+X^{2}_{0}\right)^{2m}\right]\leq 3^{2m-1}\left(n^{2m}\mathbb{E}\left[\frac{U_{1}^{2m}}{\lambda(p)^{2m}}\right]+\mathbb{E}[(X^{1}_{0})^{2m}]+\mathbb{E}[(X^{2}_{0})^{2m}]\right).

Therefore,

𝔼[|Xn1Xn2|m]K4e0.5nγn2,\mathbb{E}[|X^{1}_{n}-X^{2}_{n}|^{m}]\leq K_{4}e^{-0.5n\gamma}n^{2},

with K4=3m(max1m2𝔼[U12m]/λ(p¯)2m+2λ(p¯)2mλ(p¯)2mM)1/2(2+2M)1/2K_{4}=3^{m}\left(\max_{1\leq m\leq 2}\mathbb{E}[U_{1}^{2m}]/\lambda(\bar{p})^{2m}+2\frac{\lambda(\underline{p})^{2m}}{\lambda(\bar{p})^{2m}}M\right)^{1/2}(2+2M)^{1/2}. ∎

8.7 Proof of Theorem 2

The proof follows an induction-based approach similar to Broadie et al. (2011). For simplicity of notation, we write Δk=kβ\Delta_{k}=k^{-\beta}. Let k\mathcal{F}_{k} be the filtration up to cycle kk, i.e. including all events in the first k1k-1 cycles. Since xk+1=π(xkηkHk)x_{k+1}=\pi_{\mathcal{B}}(x_{k}-\eta_{k}H_{k}),

𝔼[xk+1x2]=𝔼[xkxηkHk2]\displaystyle\mathbb{E}\left[\|x_{k+1}-x^{*}\|^{2}]=\mathbb{E}[\|x_{k}-x^{*}-\eta_{k}H_{k}\|^{2}\right]
=\displaystyle=\leavevmode\nobreak\ 𝔼[xkx22ηkHk(xkx)+ηk2Hk2]\displaystyle\mathbb{E}\left[\|x_{k}-x^{*}\|^{2}-2\eta_{k}H_{k}\cdot(x_{k}-x^{*})+\eta_{k}^{2}H_{k}^{2}\right]
=\displaystyle=\leavevmode\nobreak\ 𝔼[xkx22ηkf(xk)(xkx)]𝔼[2ηk(Hkf(xk))(xkx)]+𝔼[ηk2Hk2]\displaystyle\mathbb{E}\left[\|x_{k}-x^{*}\|^{2}-2\eta_{k}\nabla f(x_{k})\cdot(x_{k}-x^{*})\right]-\mathbb{E}[2\eta_{k}(H_{k}-f(x_{k}))\cdot(x_{k}-x^{*})]+\mathbb{E}[\eta_{k}^{2}H_{k}^{2}]
=\displaystyle=\leavevmode\nobreak\ (12ηkK0)𝔼[xkx2]+𝔼[2ηk(Hkf(xk))(xxk)]+ηk2𝔼[Hk2].\displaystyle(1-2\eta_{k}K_{0})\mathbb{E}\left[\|x_{k}-x^{*}\|^{2}\right]+\mathbb{E}[2\eta_{k}(H_{k}-f(x_{k}))\cdot(x^{*}-x_{k})]+\eta_{k}^{2}\mathbb{E}[H_{k}^{2}].

Note that

𝔼[2ηk(Hkf(xk))(xxk)]=𝔼[𝔼[2ηk(Hkf(xk))(xxk)|k]]\displaystyle\mathbb{E}[2\eta_{k}(H_{k}-\nabla f(x_{k}))\cdot(x^{*}-x_{k})]=\leavevmode\nobreak\ \mathbb{E}[\mathbb{E}[2\eta_{k}(H_{k}-\nabla f(x_{k}))\cdot(x^{*}-x_{k})|\mathcal{F}_{k}]]
=\displaystyle=\leavevmode\nobreak\ 2ηk𝔼[𝔼[Hkf(xk)|k](xxk)] 2ηk𝔼[𝔼[Hkf(xk)|k]2]1/2𝔼[xxk2]1/2\displaystyle 2\eta_{k}\mathbb{E}[\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\cdot(x^{*}-x_{k})]\leq\leavevmode\nobreak\ 2\eta_{k}\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]^{1/2}\mathbb{E}[\|x^{*}-x_{k}\|^{2}]^{1/2}
\displaystyle\leq\leavevmode\nobreak\ ηk𝔼[𝔼[Hkf(xk)|k]2]1/2(1+𝔼[xkx2]).\displaystyle\eta_{k}\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]^{1/2}(1+\mathbb{E}[\|x_{k}-x^{*}\|^{2}]).

The second last inequality follows from ab+cda2+c2b2+d2ab+cd\leq\sqrt{a^{2}+c^{2}}\sqrt{b^{2}+d^{2}} and the Holder Inequality, the last inequality follows from 2a1+a22a\leq 1+a^{2}.

Let bk=𝔼[xkx2]b_{k}=\mathbb{E}[\|x_{k}-x^{*}\|^{2}] and recall that Bk=𝔼[𝔼[Hkf(xk)|k]2]1/2,𝒱k=𝔼[Hk2]B_{k}=\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]^{1/2},\quad\mathcal{V}_{k}=\mathbb{E}[H_{k}^{2}]. Then, we obtain the recursion

bk+1(12K0ηk+ηkBk)bk+ηkBk+ηk2𝒱k.b_{k+1}\leq(1-2K_{0}\eta_{k}+\eta_{k}B_{k})b_{k}+\eta_{k}B_{k}+\eta_{k}^{2}\mathcal{V}_{k}.

By Condition (b) and (c), we have

bk+1(12K0ηk+ηkBk)bk+ηkBk+ηk2𝒱k(12K0ηk+K08ηkΔk)bk+K08ηkΔk+K3ηkΔk.b_{k+1}\leq(1-2K_{0}\eta_{k}+\eta_{k}B_{k})b_{k}+\eta_{k}B_{k}+\eta_{k}^{2}\mathcal{V}_{k}\leq\left(1-2K_{0}\eta_{k}+\frac{K_{0}}{8}\eta_{k}\Delta_{k}\right)b_{k}+\frac{K_{0}}{8}\eta_{k}\Delta_{k}+K_{3}\eta_{k}\Delta_{k}.

Because step size ηk0\eta_{k}\rightarrow 0, for kk large enough, ηkK01/2\eta_{k}K_{0}\leq 1/2. Let k0=max{k:2ηkK0>1}k_{0}=\max\{k:2\eta_{k}K_{0}>1\}. Then, for k>k0k>k_{0}, 12K0ηk+K08ηkΔk>01-2K_{0}\eta_{k}+\frac{K_{0}}{8}\eta_{k}\Delta_{k}>0. By Condition (a), Δk/Δk+1=(1+1k)β1+1k1+K02ηk\Delta_{k}/\Delta_{k+1}=(1+\frac{1}{k})^{\beta}\leq 1+\frac{1}{k}\leq 1+\frac{K_{0}}{2}\eta_{k}, and by the induction assumption bkCΔkb_{k}\leq C\Delta_{k}, for k>k0k>k_{0}, we have

bk+1\displaystyle b_{k+1} (12K0ηk+K08ηkΔk)(1+K0ηk2)CΔk+1+K08ηkΔk+K3ηkΔk\displaystyle\leq\left(1-2K_{0}\eta_{k}+\frac{K_{0}}{8}\eta_{k}\Delta_{k}\right)\left(1+\frac{K_{0}\eta_{k}}{2}\right)C\Delta_{k+1}+\frac{K_{0}}{8}\eta_{k}\Delta_{k}+K_{3}\eta_{k}\Delta_{k}
CΔk+1ηkΔk(3K0C2K0C8ΔkK02C16ηkΔkK08K3)\displaystyle\leq C\Delta_{k+1}-\eta_{k}\Delta_{k}\left(\frac{3K_{0}C}{2}-\frac{K_{0}C}{8}\Delta_{k}-\frac{K_{0}^{2}C}{16}\eta_{k}\Delta_{k}-\frac{K_{0}}{8}-K_{3}\right)

Then, we have bk+1CΔk+1b_{k+1}\leq C\Delta_{k+1} as long as

3K0C2K0C8ΔkK02C16ηkΔkK08K30.\frac{3K_{0}C}{2}-\frac{K_{0}C}{8}\Delta_{k}-\frac{K_{0}^{2}C}{16}\eta_{k}\Delta_{k}-\frac{K_{0}}{8}-K_{3}\geq 0. (33)

To check (33), note that, Δk,K01\Delta_{k},K_{0}\leq 1 and C8K3/K0C\geq 8K_{3}/K_{0}. Besides, ηkK01/2<1\eta_{k}K_{0}\leq 1/2<1 for k>k0k>k_{0}. Then, for kk0k\geq k_{0},

3K0C2K0C8ΔkK02C16ηkΔkK08K33K0C2K0C8K0C16K0C8K0C8=17K0C16>0.\frac{3K_{0}C}{2}-\frac{K_{0}C}{8}\Delta_{k}-\frac{K_{0}^{2}C}{16}\eta_{k}\Delta_{k}-\frac{K_{0}}{8}-K_{3}\geq\frac{3K_{0}C}{2}-\frac{K_{0}C}{8}-\frac{K_{0}C}{16}-\frac{K_{0}C}{8}-\frac{K_{0}C}{8}=\frac{17K_{0}C}{16}>0.

Let

C=max(k0β(|μ¯μ¯|2+|p¯p¯|2),8K3/K0).C=\max\left(k_{0}^{\beta}(|\bar{\mu}-\underline{\mu}|^{2}+|\bar{p}-\underline{p}|^{2}),8K_{3}/K_{0}\right). (34)

Then we have xkx2CΔk\|x_{k}-x^{*}\|^{2}\leq C\Delta_{k} for all 1kk01\leq k\leq k_{0}, and we can conclude by induction, for all kk0k\geq k_{0},

𝔼[xkx2]Ckβ.\mathbb{E}[\|x_{k}-x^{*}\|^{2}]\leq Ck^{-\beta}.

By Assumption 3, there exists θ0[0,1]\theta_{0}\in[0,1] such that

|f(xk)f(x)|=|f(θ0(xkx)+x)T(xkx)|K1xkx2.|f(x_{k})-f(x^{*})|=|\nabla f(\theta_{0}(x^{k}-x^{*})+x^{*})^{T}(x_{k}-x^{*})|\leq K_{1}\|x_{k}-x^{*}\|^{2}.

As a consequence,

R2(L)k=1L𝔼[Tk]K1Ckβ.R_{2}(L)\leq\sum_{k=1}^{L}\mathbb{E}[T_{k}]K_{1}Ck^{-\beta}.

Note that TkT_{k} equals to the arrival time of customer DkD_{k} plus its waiting time. Therefore,

𝔼[Tk]𝔼[Dkλk]+𝔼[WDkk]Dkλ(p¯)+M=O(Dk),\mathbb{E}[T_{k}]\leq\mathbb{E}\left[\frac{D_{k}}{\lambda_{k}}\right]+\mathbb{E}[W_{D_{k}}^{k}]\leq\frac{D_{k}}{\lambda(\bar{p})}+M=O(D_{k}),

and we can conclude

R2(L)=O(k=1LDkkβ).R_{2}(L)=O\left(\sum_{k=1}^{L}D_{k}k^{-\beta}\right).

\Box

8.8 Proof of Theorem 3

(i) For each kk, note that xkkx_{k}\in\mathcal{F}_{k}, let’s denote by

hk1\displaystyle h_{k}^{1} =λ(pk)pkλ(pk)+h0λ(pk)[1Dk(1ξ)n>ξDkDk(𝔼[Xnk|k]+𝔼[Wnk|k])+1μ],\displaystyle=-\lambda(p_{k})-p_{k}\lambda^{\prime}(p_{k})+h_{0}\lambda^{\prime}(p_{k})\left[\frac{1}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(\mathbb{E}[X_{n}^{k}|\mathcal{F}_{k}]+\mathbb{E}[W_{n}^{k}|\mathcal{F}_{k}]\right)+\frac{1}{\mu}\right],
hk2\displaystyle h_{k}^{2} =c(μk)h0λ(pk)μk[1Dk(1ξ)n>ξDkDk(𝔼[Xnk|k]+𝔼[Wnk|k])+1μ].\displaystyle=c^{\prime}(\mu_{k})-h_{0}\frac{\lambda(p_{k})}{\mu_{k}}\left[\frac{1}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(\mathbb{E}[X_{n}^{k}|\mathcal{F}_{k}]+\mathbb{E}[W_{n}^{k}|\mathcal{F}_{k}]\right)+\frac{1}{\mu}\right].

Then,

𝔼[Hkf(xk)|k]2=|hk1pf(μk,pk)|2+|hk2μf(μk,pk)|2.\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}=\left|h_{k}^{1}-\frac{\partial}{\partial p}f(\mu_{k},p_{k})\right|^{2}+\left|h_{k}^{2}-\frac{\partial}{\partial\mu}f(\mu_{k},p_{k})\right|^{2}.

Following (18),

|hk1pf(μk,pk)|2\displaystyle|h_{k}^{1}-\frac{\partial}{\partial p}f(\mu_{k},p_{k})|^{2} h02λ(pk)2Dk(1ξ)n>ξDkDk(|𝔼[Xnkxk|k]|+|𝔼[Wnkwk|k]|)2,\displaystyle\leq\frac{h_{0}^{2}\lambda^{\prime}(p_{k})^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[X_{n}^{k}-x_{k}|\mathcal{F}_{k}]|+|\mathbb{E}[W_{n}^{k}-w_{k}|\mathcal{F}_{k}]|\right)^{2},
|hk2μf(μk,pk)|2\displaystyle|h_{k}^{2}-\frac{\partial}{\partial\mu}f(\mu_{k},p_{k})|^{2} h02λ(pk)2μk2Dk(1ξ)n>ξDkDk(|𝔼[Xnkxk|k]|+|𝔼[Wnkwk|k]|)2,\displaystyle\leq\frac{h_{0}^{2}\lambda(p_{k})^{2}}{\mu_{k}^{2}\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[X_{n}^{k}-x_{k}|\mathcal{F}_{k}]|+|\mathbb{E}[W_{n}^{k}-w_{k}|\mathcal{F}_{k}]|\right)^{2},

where wk=𝔼[W(μk,pk)]w_{k}=\mathbb{E}[W_{\infty}(\mu_{k},p_{k})] and xk=𝔼[X(μk,pk)]x_{k}=\mathbb{E}[X_{\infty}(\mu_{k},p_{k})]. Note that λ(p)\lambda(p), λ(p)\lambda^{\prime}(p) and μ\mu are bounded. Let C0=max(μ,p){h0λ(pk),h0λ(p)/μ}C_{0}=\max_{(\mu,p)\in\mathcal{B}}\{h_{0}\lambda^{\prime}(p_{k}),h_{0}\lambda(p)/\mu\}, then

𝔼[Hkf(xk)|k]2\displaystyle\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2} 2C02Dk(1ξ)n>ξDkDk(|𝔼[Xnkxk|k]|+|𝔼[Wnkwk|k]|)2\displaystyle\leq\frac{2C_{0}^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[X_{n}^{k}-x_{k}|\mathcal{F}_{k}]|+|\mathbb{E}[W_{n}^{k}-w_{k}|\mathcal{F}_{k}]|\right)^{2}
4C02Dk(1ξ)n>ξDkDk(|𝔼[Xnkxk|k]|2+|𝔼[Wnkwk|k]|2)\displaystyle\leq\frac{4C_{0}^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[X_{n}^{k}-x_{k}|\mathcal{F}_{k}]|^{2}+|\mathbb{E}[W_{n}^{k}-w_{k}|\mathcal{F}_{k}]|^{2}\right)
=4C02Dk(1ξ)n>ξDkDk(|𝔼[XnkX¯kn|k]|2+|𝔼[WnkW¯kn|k]|2)\displaystyle=\frac{4C_{0}^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[X_{n}^{k}-\bar{X}_{k}^{n}|\mathcal{F}_{k}]|^{2}+|\mathbb{E}[W_{n}^{k}-\bar{W}_{k}^{n}|\mathcal{F}_{k}]|^{2}\right)

where the last equality follows from 𝔼[W¯kn|k]=wk\mathbb{E}[\bar{W}_{k}^{n}|\mathcal{F}_{k}]=w_{k} and 𝔼[X¯kn|k]=xk\mathbb{E}[\bar{X}_{k}^{n}|\mathcal{F}_{k}]=x_{k} and W¯kn\bar{W}_{k}^{n} and X¯kn\bar{X}_{k}^{n} are stationary versions of the waiting times and observed busy periods that are synchronously coupled with WknW_{k}^{n} and XknX_{k}^{n} respectively. Therefore, the bias

Bk2\displaystyle B_{k}^{2} =𝔼[𝔼[Hkf(xk)|k]2]\displaystyle=\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]
𝔼[4C02Dk(1ξ)n>ξDkDk(|𝔼[XnkX¯kn|k]|2+|𝔼[XnkX¯kn|k]|2)]\displaystyle\leq\mathbb{E}\left[\frac{4C_{0}^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[X_{n}^{k}-\bar{X}_{k}^{n}|\mathcal{F}_{k}]|^{2}+|\mathbb{E}[X_{n}^{k}-\bar{X}_{k}^{n}|\mathcal{F}_{k}]|^{2}\right)\right]
4C02Dk(1ξ)n>ξDkDk(|𝔼[(XnkX¯kn)2]+𝔼[(WnkW¯kn)2])\displaystyle\leq\frac{4C_{0}^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[(X_{n}^{k}-\bar{X}_{k}^{n})^{2}]+\mathbb{E}[(W_{n}^{k}-\bar{W}_{k}^{n})^{2}]\right)

Following a similar argument as in the proof of Corollary 1, we have, for n0.5ξDkn\geq\lceil 0.5\xi D_{k}\rceil,

𝔼[(WnkW¯kn)2]\displaystyle\mathbb{E}[(W_{n}^{k}-\bar{W}_{k}^{n})^{2}] 𝔼[(WnkW¯kn)21(Qk<0.5ξDk)]+𝔼[(WnkW¯kn)21(Qk0.5ξDk)]\displaystyle\leq\mathbb{E}[(W_{n}^{k}-\bar{W}_{k}^{n})^{2}\cdot 1(Q_{k}<0.5\xi D_{k})]+\mathbb{E}[(W_{n}^{k}-\bar{W}_{k}^{n})^{2}\cdot 1(Q_{k}\geq 0.5\xi D_{k})]
Aexp(γ(n0.5ξDk))+2Mexp(η0.25ξDk))\displaystyle\leq A\exp(-\gamma\cdot(n-0.5\xi D_{k}))+2M\exp(-\eta\cdot 0.25\xi D_{k}))
(A+2M)exp(min(γ,η)0.25ξDk).\displaystyle\leq(A+2M)\exp(-\min(\gamma,\eta)\cdot 0.25\xi D_{k}).

For the observed busy period XnkX_{n}^{k}, following a similar analysis and Lemma 6, we have

𝔼[(XnkX¯nk)2]\displaystyle\mathbb{E}[(X_{n}^{k}-\bar{X}_{n}^{k})^{2}]
\displaystyle\leq\leavevmode\nobreak K4e0.5γξDkDk2+(2𝔼[(Xnk)4]+2𝔼[(X¯kn)4])1/2(Qk0.5ξDk)1/2\displaystyle K_{4}e^{-0.5\gamma\xi D_{k}}D_{k}^{2}+(2\mathbb{E}[(X_{n}^{k})^{4}]+2\mathbb{E}[(\bar{X}_{k}^{n})^{4}])^{1/2}\mathbb{P}(Q_{k}\geq 0.5\xi D_{k})^{1/2}
\displaystyle\leq\leavevmode\nobreak exp(min(γ,η)0.25ξDk)(2M+K4Dk2)exp(min(γ,η)0.125ξDk)(2M+K4K5),\displaystyle\exp(-\min(\gamma,\eta)\cdot 0.25\xi D_{k})(2M+K_{4}D_{k}^{2})\leq\exp(-\min(\gamma,\eta)\cdot 0.125\xi D_{k})(2M+K_{4}K_{5}),

where

K5=maxD>0exp(min(γ,η)0.125ξD)D2=(16min(γ,η)ξ)2e2.K_{5}=\max_{D>0}\exp(-\min(\gamma,\eta)\cdot 0.125\xi D)D^{2}=\left(\frac{16}{\min(\gamma,\eta)\cdot\xi}\right)^{2}e^{-2}. (35)

If we choose

Dk=aD+bDlog(k), for aDCDmin(γ,η)ξ and bD8min(γ,η)ξ,D_{k}=a_{D}+b_{D}\log(k),\text{ for }a_{D}\geq\frac{C_{D}}{\min(\gamma,\eta)\xi}\text{ and }b_{D}\geq\frac{8}{\min(\gamma,\eta)\xi},

with

CD=max(8(log((16A+32M)C0/K0),16log((32M+16K4K5)C0/K0)),C_{D}=\max(8(\log((16A+32M)C_{0}/K_{0}),16\log((32M+16K_{4}K_{5})C_{0}/K_{0})), (36)

then

𝔼[(WnkW¯nk)2]K02256C02k2,𝔼[(XnkX¯nk)2]K02256C0k2.\mathbb{E}[(W_{n}^{k}-\bar{W}_{n}^{k})^{2}]\leq\frac{K_{0}^{2}}{256C_{0}^{2}k^{2}},\quad\mathbb{E}[(X_{n}^{k}-\bar{X}_{n}^{k})^{2}]\leq\frac{K_{0}^{2}}{256C_{0}k^{2}}.

As a consequence,

𝔼[𝔼[Hkf(xk)|k]2]4C02Dk(1ξ)n>ξDkDk(|𝔼[(XnkX¯kn)2]+𝔼[(WnkW¯kn)2])K0264k2.\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]\leq\frac{4C_{0}^{2}}{\lceil D_{k}(1-\xi)\rceil}\sum_{n>\xi D_{k}}^{D_{k}}\left(|\mathbb{E}[(X_{n}^{k}-\bar{X}_{k}^{n})^{2}]+\mathbb{E}[(W_{n}^{k}-\bar{W}_{k}^{n})^{2}]\right)\leq\frac{K_{0}^{2}}{64k^{2}}.

Therefore, we can conclude that

Bk=𝔼[𝔼[Hkf(xk)|k]2]1/2K08k.B_{k}=\mathbb{E}[\|\mathbb{E}[H_{k}-\nabla f(x_{k})|\mathcal{F}_{k}]\|^{2}]^{1/2}\leq\frac{K_{0}}{8k}.

On the other hand, as λ(p)\lambda(p), λ(p)\lambda^{\prime}(p) and μ\mu are bounded, C1maxμ,p{|λ(p)+pλ(p)|,|c(μ)|}<C_{1}\triangleq\max_{\mu,p\in\mathcal{B}}\{|\lambda(p)+p\lambda^{\prime}(p)|,|c^{\prime}(\mu)|\}<\infty. Recall that C0=max(μ,p){h0λ(pk),h0λ(p)/μ}C_{0}=\max_{(\mu,p)\in\mathcal{B}}\{h_{0}\lambda^{\prime}(p_{k}),h_{0}\lambda(p)/\mu\}. Then,

𝔼[Hk2]8(C1+C0/μ¯)2+8C02𝔼[1(1ξ)Dk2(n>ξDkDk(Xnk+Wnk))2].\mathbb{E}[\|H_{k}\|^{2}]\leq 8(C_{1}+C_{0}/\underline{\mu})^{2}+8C_{0}^{2}\mathbb{E}\left[\frac{1}{\lceil(1-\xi)D_{k}\rceil^{2}}\left(\sum_{n>\xi D_{k}}^{D_{k}}\left(X_{n}^{k}+W_{n}^{k}\right)\right)^{2}\right].

By Lemma 1, we have

𝔼[1(1ξ)Dk2(n>ξDkDk(Xnk+Wnk))2]𝔼[1(1ξ)Dk2(n>ξDkDk(λ(p¯)λ(p¯)X^nk+W^nk))2],\displaystyle\mathbb{E}\left[\frac{1}{\lceil(1-\xi)D_{k}\rceil^{2}}\left(\sum_{n>\xi D_{k}}^{D_{k}}\left(X_{n}^{k}+W_{n}^{k}\right)\right)^{2}\right]\leq\leavevmode\nobreak\ \mathbb{E}\left[\frac{1}{\lceil(1-\xi)D_{k}\rceil^{2}}\left(\sum_{n>\xi D_{k}}^{D_{k}}\left(\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\hat{X}_{n}^{k}+\hat{W}_{n}^{k}\right)\right)^{2}\right],

where W^nk\hat{W}_{n}^{k} and X^nk\hat{X}_{n}^{k} are defined as in Lemma 1. Note that by definition, W^nk\hat{W}_{n}^{k} and X^nk\hat{X}^{k}_{n} are stationary, we have

𝔼[1(1ξ)Dk2(n>ξDkDk(λ(p¯)λ(p¯)X^nk+W^nk))2]\displaystyle\mathbb{E}\left[\frac{1}{\lceil(1-\xi)D_{k}\rceil^{2}}\left(\sum_{n>\xi D_{k}}^{D_{k}}\left(\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\hat{X}_{n}^{k}+\hat{W}_{n}^{k}\right)\right)^{2}\right]
\displaystyle\leq 2(1ξ)Dk2𝔼[(n>ξDkDkλ(p¯)λ(p¯)X^nk)2]+2(1ξ)Dk2𝔼[(n>ξDkDkW^nk)2]\displaystyle\frac{2}{\lceil(1-\xi)D_{k}\rceil^{2}}\mathbb{E}\left[\left(\sum_{n>\xi D_{k}}^{D_{k}}\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\hat{X}_{n}^{k}\right)^{2}\right]+\frac{2}{\lceil(1-\xi)D_{k}\rceil^{2}}\mathbb{E}\left[\left(\sum_{n>\xi D_{k}}^{D_{k}}\hat{W}_{n}^{k}\right)^{2}\right]
\displaystyle\leq 2(1ξ)2𝔼[(λ(p¯)λ(p¯)X^0k)2]+2(1ξ)2𝔼[(W^0k)2]4(1ξ)2M.\displaystyle 2(1-\xi)^{-2}\mathbb{E}\left[\left(\frac{\lambda(\underline{p})}{\lambda(\bar{p})}\hat{X}_{0}^{k}\right)^{2}\right]+2(1-\xi)^{-2}\mathbb{E}[(\hat{W}_{0}^{k})^{2}]\leq 4(1-\xi)^{-2}M.

Therefore, 𝒱k\mathcal{V}_{k} is uniformly bounded. Given that ηk=cηk1\eta_{k}=c_{\eta}k^{-1}, we have ηk𝒱kK3k\eta_{k}\mathcal{V}_{k}\leq\frac{K_{3}}{k} with

K3=(8(C1+C0/μ¯)2+32C02(1ξ)2M)cη.K_{3}=(8(C_{1}+C_{0}/\underline{\mu})^{2}+32C_{0}^{2}(1-\xi)^{-2}M)c_{\eta}. (37)

(ii) According to the update rule, we immediately got

𝔼[xkxk+12]ηk2𝔼[Hk2]2k2K3/K0K2k2, with K2=2K3/K0.\mathbb{E}[\|x_{k}-x_{k+1}\|^{2}]\leq\eta^{2}_{k}\mathbb{E}[\|H_{k}\|^{2}]\leq 2k^{-2}K_{3}/K_{0}\equiv K_{2}k^{-2},\text{ with }K_{2}=2K_{3}/K_{0}.

(iii) We have just proved that the conditions of Theorem 1 are satisfied with α=1\alpha=1. Therefore, R1(L)Kk=1Lk1log(k)Klog(L)2R_{1}(L)\leq K\sum_{k=1}^{L}k^{-1}\log(k)\leq K\log(L)^{2} with the expression of KK given in (8.5). Besides, conditions of Theorem 2 are satisfied with β=1\beta=1 and Dk=O(log(k))D_{k}=O(\log(k)). In particular, Δk/Δk+1=1+1k1+K02ηk\Delta_{k}/\Delta_{k+1}=1+\frac{1}{k}\leq 1+\frac{K_{0}}{2\eta_{k}} given that ηk=cηk1\eta_{k}=c_{\eta}k^{-1} with cη2/K0c_{\eta}\geq 2/K_{0}. Therefore,

R2(L)CK1k=1L(Dkλ(p¯)+M)k1=O(log(L)2).R_{2}(L)\leq CK_{1}\sum_{k=1}^{L}\left(\frac{D_{k}}{\lambda(\bar{p})}+M\right)k^{-1}=O(\log(L)^{2}).

As a consequence, the total regret

R(L)=R1(L)+R2(L)Kalglog(L)2Kalglog(ML)2,with ML=k=1LDk.R(L)=R_{1}(L)+R_{2}(L)\leq K_{alg}\log(L)^{2}\leq K_{alg}\log(M_{L})^{2},\text{with }M_{L}=\sum_{k=1}^{L}D_{k}.

The last inequality uses log(L)2log(ML)2\log(L)^{2}\leq\log(M_{L})^{2}. Since ML=O(Llog(L))M_{L}=O(L\log(L)), the relaxation from LL to MLM_{L} will not change the order of the regret bound. In addition, we can find a closed-expression for KalgK_{alg} as

Kalg=K+CK1(CD+8λ(p¯)min(γ,η)ξ+M),K_{alg}=K+CK_{1}\cdot\left(\frac{C_{D}+8}{\lambda(\bar{p})\min(\gamma,\eta)\xi}+M\right), (38)

where KK is defined by (8.5), CC by (34) and CDC_{D} by (36). \Box

8.9 Details in the Proof of Lemma 5

We first give a rigorous proof of (19) in derivation of the partial derivation p𝔼[W(p,μ)]\frac{\partial}{\partial p}\mathbb{E}[W_{\infty}(p,\mu)]. To better explain the proof, we adopt the notions in Glasserman (1992). We will take derivative with respective to the parameter θ=r=1/λ(p)\theta=r=1/\lambda(p). With a slight abuse of notation, we redefine Wn(θ)=Wn(μ,p)W_{n}(\theta)=W_{n}(\mu,p) and U~n(θ)=VnμθUn\tilde{U}_{n}(\theta)=\frac{V_{n}}{\mu}-\theta U_{n} so that U~n(θ)=Un\tilde{U}_{n}^{\prime}(\theta)=-U_{n}. And then, the Lindley recursion becomes

Wn+1(θ)=ϕ(Wn(θ),U~n(θ)),with ϕ(w,u)=(w+u)+.W_{n+1}(\theta)=\phi(W_{n}(\theta),\tilde{U}_{n}(\theta)),\quad\text{with }\phi(w,u)=(w+u)^{+}.

Note that the function ϕ\phi is increasing and convex in ww and uu. In addition, the derivative process is denote as Vn(θ)=ZnV_{n}(\theta)=Z_{n}. Define ψw(w,u)=ψu(w,u)=1(w+u>0),\psi_{w}(w,u)=\psi_{u}(w,u)=1(w+u>0), such that

Vn+1(θ)=ψw(Wn(θ),U~n(θ))Vn(θ)+ψu(Wn(θ),U~n(θ))U~n(θ).V_{n+1}(\theta)=\psi_{w}(W_{n}(\theta),\tilde{U}_{n}(\theta))V_{n}(\theta)+\psi_{u}(W_{n}(\theta),\tilde{U}_{n}(\theta))\tilde{U}_{n}^{\prime}(\theta).

The stationary versions of the waiting time and derivative process are denoted as W~0(θ)\tilde{W}_{0}(\theta) and V~0(θ)\tilde{V}_{0}(\theta). Then we can check Conditions (B1) to (B3) on page 377 of Glasserman (1992):

  1. (B1)

    For each θ[1/λ(p¯),1/λ(p¯)]\theta\in[1/\lambda(\underline{p}),1/\lambda(\bar{p})], the sequence

    {(U~n(θ),U~n(θ)),<n<}={(VnμrUn,Un),<n<in our notation}\{(\tilde{U}_{n}(\theta),\tilde{U}_{n}^{\prime}(\theta)),-\infty<n<\infty\leavevmode\nobreak\ \}=\Big{\{}\underbrace{\left(\frac{V_{n}}{\mu}-rU_{n},-U_{n}\right),-\infty<n<\infty}_{\text{in our notation}}\Big{\}}

    is stationary and ergodic, as we can extend the i.i.d. sequences VnV_{n} and UnU_{n} to <n0-\infty<n\leq 0.

  2. (B2)

    For each θ[1/λ(p¯),1/λ(p¯)]\theta\in[1/\lambda(\underline{p}),1/\lambda(\bar{p})], the Lindley recursion has a stationary solution W~0(θ)\tilde{W}_{0}(\theta), which is guaranteed by Assumption 1. Besides, following Lemma 2, for any initial state W0(θ)W_{0}(\theta), the transient process Wn(θ)W_{n}(\theta) will converge to the stationary version in finite time almost surely.

  3. (B3)

    For all θ[1/λ(p¯),1/λ(p¯)]\theta\in[1/\lambda(\underline{p}),1/\lambda(\bar{p})],

    (ψw(W~0(θ),U~0(θ))=0)=((W(μ,p)+V0μrU0)+=0)(in our notation)=(W(μ,p)=0)>0.\mathbb{P}(\psi_{w}(\tilde{W}_{0}(\theta),\tilde{U}_{0}(\theta))=0)=\mathbb{P}\underbrace{\left(\left(W_{\infty}(\mu,p)+\frac{V_{0}}{\mu}-rU_{0}\right)^{+}=0\right)}_{\text{(in our notation)}}=\mathbb{P}(W_{\infty}(\mu,p)=0)>0.

    According to the discussion on p.379 of Glasserman (1992), Condition (B3) holds for GI/GI/1GI/GI/1 queues under the usual stability condition that μ>λ(p)\mu>\lambda(p). Below, we give a detailed verification of this condition under our model setting.
    Recall that U~0(r)=V0μrU0\tilde{U}_{0}(r)=\frac{V_{0}}{\mu}-rU_{0} and by Assumption 1, 𝔼[U~0(r)]<0,r[1/λ(p¯),1/λ(p¯)]\mathbb{E}[\tilde{U}_{0}(r)]<0,\leavevmode\nobreak\ \forall r\in[1/\lambda(\underline{p}),1/\lambda(\bar{p})]. So there exists a constant b>0b>0, such that (U~0(r)<b)>0\mathbb{P}(\tilde{U}_{0}(r)<-b)>0 for all r[1/λ(p¯),1/λ(p¯)].r\in[1/\lambda(\underline{p}),1/\lambda(\bar{p})]. Let SS denote the support of W(μ,p)W_{\infty}(\mu,p) and let A=infS0A=\inf S\geq 0. We first show by contradiction that A=0A=0. Since AA is the infimum of the support,

    (W(μ,p)[A,A+ε))>0, for any ε>0.\displaystyle\mathbb{P}\left(W_{\infty}(\mu,p)\in[A,A+\varepsilon)\right)>0,\text{ for any }\varepsilon>0.

    Besides, if A>0A>0,

    (W(μ,p)A)\displaystyle\mathbb{P}(W_{\infty}(\mu,p)\geq A) =((W(μ,p)+U~0(r))+A)\displaystyle=\mathbb{P}\left(\left(W_{\infty}(\mu,p)+\tilde{U}_{0}(r)\right)^{+}\geq A\right)
    =(W(μ,p)+U~0(r)A)=1,\displaystyle=\mathbb{P}\left(W_{\infty}(\mu,p)+\tilde{U}_{0}(r)\geq A\right)=1,

    On the other hand, we have

    (W(μ,p)+U~0(r)<A)(W(μ,p)[A,A+b2),U~0(r)<b)>0,\displaystyle\mathbb{P}\left(W_{\infty}(\mu,p)+\tilde{U}_{0}(r)<A\right)\geq\mathbb{P}\left(W_{\infty}(\mu,p)\in\Big{[}A,A+\frac{b}{2}\Big{)},\leavevmode\nobreak\ \tilde{U}_{0}(r)<-b\right)>0,

    where the last inequality follows from the fact that W(μ,p)W_{\infty}(\mu,p) and U~0(r)\tilde{U}_{0}(r) are independent in the GI/GI/1GI/GI/1 queue. This is a contradiction, so we can conclude that A=0A=0. Next, we show that (W(μ,p)=0)>0\mathbb{P}(W_{\infty}(\mu,p)=0)>0. Following a similar derivation, we can conclude

    (W(μ,p)=0)\displaystyle\mathbb{P}(W_{\infty}(\mu,p)=0) =((W(μ,p)+U~0(r))+=0)(W(μ,p)[0,b2),U~0(r)<b)>0.\displaystyle=\mathbb{P}\left(\left(W_{\infty}(\mu,p)+\tilde{U}_{0}(r)\right)^{+}=0\right)\geq\mathbb{P}\left(W_{\infty}(\mu,p)\in\Big{[}0,\frac{b}{2}\Big{)},\leavevmode\nobreak\ \tilde{U}_{0}(r)<-b\right)>0.

In addition, we have 𝔼[W~0(θ)]M\mathbb{E}[\tilde{W}_{0}(\theta)]\leq M and 𝔼[V~0(θ)]=𝔼[Z~]=𝔼[X(μ,p)]M\mathbb{E}[\tilde{V}_{0}(\theta)]=\mathbb{E}[\tilde{Z}_{\infty}]=\mathbb{E}[X_{\infty}(\mu,p)]\leq M following Lemma 1. As a consequence, we can prove (19) using the following Corollary 5.3 in Glasserman (1992):

Lemma 7 (Corollary 5.3 in Glasserman (1992)).

Suppose that ϕ\phi is increasing and (jointly) convex, and that W0W_{0} and U0U_{0} are almost surely convex. Suppose (B1)-(B3) hold, 𝔼[W~0(θ)],𝔼[V~0(θ)]<\mathbb{E}[\tilde{W}_{0}(\theta)],\mathbb{E}[\tilde{V}_{0}(\theta)]<\infty for all θ\theta in its range. Then, 𝔼[V~0(θ)]=𝔼[W~0(θ)]\mathbb{E}[\tilde{V}_{0}(\theta)]=\mathbb{E}[\tilde{W}_{0}(\theta)]^{\prime} and

limn1ni=1nVi(θ)=𝔼[W~(θ)],a.s.\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}V_{i}(\theta)=\mathbb{E}[\tilde{W}(\theta)]^{\prime},a.s.

almost everywhere in the range of θ\theta.

The derivation of μ𝔼[W(p,μ)]\frac{\partial}{\partial\mu}\mathbb{E}[W_{\infty}(p,\mu)] follows a similar argument with U~(θ)VnθUn/λ(p)\tilde{U}(\theta)\equiv V_{n}-\theta U_{n}/\lambda(p).

9 Relaxing Theoretical Bounds of Hyperparameters

In this section, we conduct numerical experiments to investigate the robustness of GOLiQ’s performance to the two main hyperparameters: (i) cycle length DkD_{k}, and (ii) step size ηk\eta_{k}. We follow two steps:

  • First, we calculate the theoretical bounds of these hyperparameters according to (20) and (21).

  • Next, we test the algorithm’s performance while varying these hyperparameters; we especially consider values that violate their corresponding theoretical bounds.

9.1 Theoretical bounds for ηk\eta_{k} and DkD_{k}

We follow Section 6.2 by considering the M/M/1M/M/1 example having the objective function in (26) and demand function in (24), with a=4.1a=4.1, size n=10n=10 and c0=0.1c_{0}=0.1. In order to obtain the theoretic bounds for hyper-parameters, we set the region =[6.7,10]×[3.7,5]\mathcal{B}=[6.7,10]\times[3.7,5] so that f(μ,p)f(\mu,p) is strongly convex on \mathcal{B}.

Theoretical bound for ηk\eta_{k}.

According to the conditions in Assumption 3, we note that the Hessian matrix of the objective f(μ,p)f(\mu,p) has a smallest eigenvalue 0.12310.1231 in the specific region \mathcal{B}, which implies that K0=0.1231K_{0}=0.1231 (and the strong convexity of the objective function on \mathcal{B}). Hence, following from (20), the theoretical lower bound for ηk\eta_{k} is cηc~η=2/K0=16.24c_{\eta}\geq\tilde{c}_{\eta}=2/K_{0}=16.24.

Theoretical bound for DkD_{k}.

To calculate the lower bounds of aDa_{D} and bDb_{D} specified in (21), we first estimate CC and (γ,η)(\gamma,\eta). We set ξ=1\xi=1. First, according to the expression (36) and K0=0.1231K_{0}=0.1231, we see that C8C\geq 8. Next, following (3), we select min(γ,η)=0.011\min(\gamma,\eta)=0.011 which gives the smallest theoretical lower bound.

Hence, (21) requires that aDa~D=8/0.011=727a_{D}\geq\tilde{a}_{D}=8/0.011=727 and bDb~D=8/0.011=727b_{D}\geq\tilde{b}_{D}=8/0.011=727, which leads to a bound for the cycle length Dk727+727log(k)D_{k}\geq 727+727\log(k).

Refer to caption
Figure 8: Simulated regret curves with relaxed bounds on (i) step size ηk\eta_{k} robustness (top left panel), (ii) cycle length DkD_{k} (top right panel), and their logarithmic scales in sample size MLM_{L} (two bottom panels). All regret curves are estimated by averaging 500 independent simulation runs.

9.2 Robustness to the Theoretical Bounds

Recall that the theoretical bounds in (20) and (21) require that aDa~Da_{D}\geq\tilde{a}_{D}, bDb~Db_{D}\geq\tilde{b}_{D} and cηc~ηc_{\eta}\geq\tilde{c}_{\eta}. We hereby test the criticality of these lower bounds a~D\tilde{a}_{D}, b~D\tilde{b}_{D} and c~η\tilde{c}_{\eta} by implementing GOLiQ with (aD,bD,cη)<(a~D,b~D,c~η)(a_{D},b_{D},c_{\eta})<(\tilde{a}_{D},\tilde{b}_{D},\tilde{c}_{\eta}). Specifically, in our first experiments, we consider cη={2,1,0.5,0.1}c~ηc_{\eta}=\{2,1,0.5,0.1\}\tilde{c}_{\eta} for the step-size ηk\eta_{k}, with Dk=10+10log(k)D_{k}=10+10\log(k) (see left-hand panels of Figure 8); in our second experiment, we consider (aD,bD)={0.028,0.021,0.014,0.007,0.0014}(a~D,b~D)(a_{D},b_{D})=\{0.028,0.021,0.014,0.007,0.0014\}(\tilde{a}_{D},\tilde{b}_{D}) for the sample-size DkD_{k}, with ηk=3/k\eta_{k}=3/k (see right-hand panels of Figure 8). In both experiments, we plot the average regret curves estimated by 500 independent runs.

Figure 8 reveals that GOLiQ continues to perform effectively even when the hyperparameters are chosen to be much smaller than their corresponding theoretical lower bounds. For ηk\eta_{k}, our algorithm generates a logarithm regret even when cη=0.1c~ηc_{\eta}=0.1\tilde{c}_{\eta}. (However, we discover that GOLiQ will fail to converge and yield a linear regret if we keep reducing cηc_{\eta} (e.g., to 0.01c~η0.01\tilde{c}_{\eta}). For DkD_{k}, all regret curves exhibit a logarithmic order (even when (aD,bD)=0.0014(a~D,b~D)(a_{D},b_{D})=0.0014(\tilde{a}_{D},\tilde{b}_{D})). In summary, our numerical experiments show that the theoretical bounds for our hyperparameters do not seem to be too restrictive. In addition, the experiment in Section 6.3 serves as another piece of evidence supporting the robustness of GOLiQ. In Section 6.3, we apply GOLiQ with the same hyperparameters ηk=5k1\eta_{k}=5k^{-1} and Dk=10+10log(k)D_{k}=10+10\log(k) for different settings with various cc, cs2c_{s}^{2} and nn (see Figure 7), and GOLiQ exhibits stable performance with similar logarithm regrets. Of course, we acknowledge that the specific selection of these hyperparameters in a practical setting will require further tuning in order to make the most efficient use of GOLiQ.

Remark 12 (Requirement of information: online learning vs. heavy traffic).

We provide our view on how online learning relies on the system information, and we treat heavy-traffic methods as a benchmark. First, online learning in general requires less prior information of the distributions than heavy-traffic methods do. For example, to solve the problem in the present study, the diffusion limit in Lee and Ward (2014) requires the knowledge of the exact values of the second moments of arrival and service times. On the other hand, even though the efficiency of GOLiQ is subject to constraints in terms of certain model parameters, the bounds of these constraints may be relaxed without needing to sacrifice much of the algorithm’s performance. Second, the required information (e.g., moments) serves as crucial input parameters for the heavy-traffic models, whereas the design and implementation of online learning algorithms do not immediately require the aforementioned information (even though it is still relevant to the tuning of hyperparameters). All that we require is that the constants in (20) and (21) are not too small. So as long as we follow the structure specified in (20)–(21), it will not be too difficult to find reasonably sound hyperparameters (e.g., by a trial-and-error search) even without precise information of parameters η\eta and γ\gamma as in Assumption 2. However, trial-and-error will be ineffective for heavy-traffic methods because precise information is needed (e.g., σ2\sigma^{2}). In this sense, online learning depends on the system information to a lesser extent.

10 Comparison With Online Learning Algorithm in Huh et al. (2009)

In order to highlight the novelty in the regret analysis of this work, we wish to provide a comparison to other existing online learning algorithms developed for queueing systems. Unfortunately, there exists no previous algorithm that aims to solve a similar (not to mention the same) problem as in this paper.

Huh et al. (2009) develops an online learning algorithm with the objective of finding the optimal base-stock policy for an inventory system with a non-zero replenishment lead time. At a glance, Huh et al. (2009) does not seem to be relevant to the present paper at all. Indeed, results in Huh et al. (2009) are by no means directly comparable to GOLiQ, because the two articles consider two different systems. Nevertheless, the fundamental idea in the regret analysis by Huh et al. (2009) may be used as a basis to devise a queueing-version algorithm.To understand why this is possible, we first discuss the similarities of our method and that in Huh et al. (2009); and we next explain their major distinctions.

Similarities.

First, Huh et al. (2009) analyzes the transient regret bound of an inventory system operated under a stationary base-stock policy, of which the main framework is analogous to that in the present work. Second, the heart of the online learning algorithm in Huh et al. (2009) is an SGD method. Last, the regret in Huh et al. (2009) is also defined using the steady-state performance as the benchmark.

Distinctions.

Nevertheless, we stress that GOLiQ is not a quick extension of the regret analysis in Huh et al. (2009). First, a queueing model, by its very nature, has completely different dynamics, problem structure, and research questions from inventory systems. For example, the state space of the queueing model here is unbounded, while the inventory system in Huh et al. (2009) is bounded. (The unbounded state space has made the analysis on transient error and gradient variance more challenging.) Next, our analysis of the “regret of nonstationarity” is a novelty; when establishing our regret bound, we examine more delicately the transient error at the beginning of the cycles, so as to render a smaller regret bound O(log(L)2)O(\log(L)^{2}) than the linear bound O(L)O(L) as in Huh et al. (2009). Also see Section 4.1 for additional discussions. In detail, different from the analysis in Huh et al. (2009), we further separate the transient error in each cycle into two parts, i.e., the ‘warm-up’ part and ‘near-stationary’ part, and deal with them using different coupling techniques: coupling from the previous cycle and coupling from infinite past for the ‘warm-up’ part and synchronous coupling in the same cycle for the ‘near-stationary’ part. Last, our novel theoretical analysis yields different “optimal” structure for the hyperparameters ηk=O(k1)\eta_{k}=O(k^{-1}) and Dk=O(log(k))D_{k}=O(\log{(k)}).

According to their regret analysis, Huh et al. (2009) propose to choose the hyperparameters ηk=O(k1/2)\eta_{k}=O(k^{-1/2}) and Dk=O(k)D_{k}=O(\sqrt{k}) which yield a regret bound in the order O(ML2/3)O(M_{L}^{2/3}). However, we point out that the objective function in Huh et al. (2009) is convex while GOLiQ in the present paper is designed assuming a strongly convex objective function (Assumption 3). Therefore, to make a fair comparison between GOLiQ and the online learning algorithm proposed in Huh et al. (2009), we need to redo the regret analysis in Huh et al. (2009) under the strong convexity. This change, as we will show below, will yield a different set of hyperparameters.

Suppose we select Dk=O(kα)D_{k}=O(k^{\alpha}) and ηk=O(kβ)\eta_{k}=O(k^{-\beta}). Then, following Lemma 11 of Huh et al. (2009), R1(L)=O(L)R_{1}(L)=O(L) (compared to R1(L)=o(L)R_{1}(L)=o(L) in our analysis). Given that the objective function is strongly convex, Theorem 2 yields that R2(L)=O(k=1Lkαkβ)=O(Lαβ+1)R_{2}(L)=O\left(\sum_{k=1}^{L}k^{\alpha}\cdot k^{-\beta}\right)=O\left(L^{\alpha-\beta+1}\right) for β(0,1]\beta\in(0,1]. As a result, the overall regret is

R(L)=R1(L)+R2(L)=O(L(αβ+1)1),andR(ML)=O(ML(αβ+1)1α+1),R(L)=R_{1}(L)+R_{2}(L)=O(L^{(\alpha-\beta+1)\vee 1}),\quad\text{and}\quad R(M_{L})=O\left(M_{L}^{\frac{(\alpha-\beta+1)\vee 1}{\alpha+1}}\right),

with ML=O(Lα+1)M_{L}=O(L^{\alpha+1}). Consequently, the order of R(ML)R(M_{L}) is minimized by setting

ηk=O(k1),andDk=O(k),\displaystyle\eta_{k}=O(k^{-1}),\quad\text{and}\quad D_{k}=O(k), (39)

which yields an improved regret bound O(ML12)O(M_{L}^{\frac{1}{2}}) (as opposed to the previous regret O(ML23)O(M_{L}^{\frac{2}{3}}) under regular convexity).

We refer to Algorithm 1 with ηk\eta_{k} and DkD_{k} selected according to (39) as GOLiQ-H. To compare GOLiQ with GOLiQ-H, we follow the setting in Section 6.2 by considering an M/M/1M/M/1 queue, with c(μ)=0.1μ2c(\mu)=0.1\mu^{2} and λ(p)=10λ0(p).\lambda(p)=10\lambda_{0}(p).

In Figure 9, we plot the average regret curves (estimated by averaging 500 independent paths) for both GOLiQ and GOLiQ-H. The hyperparameters are ηk=2k1\eta_{k}=2k^{-1} and Dk=10+10log(k)D_{k}=10+10\log(k) for GOLiQ, and ηk={2,4}k1\eta_{k}=\{2,4\}k^{-1} and Dk=10+kD_{k}=10+k for GOLiQ-H. Unsurprisingly, Figure 9 confirms that GOLiQ is more effective than GOLiQ-H. This is consistent with our theoretical analysis because GOLiQ yields a logarithmic regret while GOLi-H has a regret bound of O(ML12)O(M_{L}^{\frac{1}{2}}).

Refer to caption
Figure 9: Comparing GOLiQ and GOLiQ-H: (i) regret curves (top panel), (ii) trajectory of price pkp_{k} (middle panel), and (iii) trajectory of service rate μk\mu_{k} (bottom panel). Hyperparameters are ηk={2,4}k1\eta_{k}=\{2,4\}k^{-1} and Dk=10+kD_{k}=10+k for GOLiQ-H, and are ηk=2k1\eta_{k}=2k^{-1} and Dk=10+10log(k)D_{k}=10+10\log(k) for GOLiQ. All regret curves are estimated by averaging 500 independent simulation runs.

11 Additional Numerical Examples

In this section we conduct additional numerical experiments to confirm the practical effectiveness of our algorithm. In what follows, we first test the case where the uniform stability condition is relaxed; we next report the algorithm performance for GI/GI/1GI/GI/1 queueing models with phase-type and lognormal distributions.

11.1 Violation of Uniform Stability

Refer to caption
Figure 10: Joint pricing and staffing for the M/M/1M/M/1 model in Section 6.2 without uniform stability.

We extend the M/M/1M/M/1 example considered in Section 6.2 with the uniform stability condition relaxed. Specifically, we begin with an initial setting of (p0,μ0)(p_{0},\mu_{0}) such that ρ0λ(p0)/μ0=2.55\rho_{0}\equiv\lambda(p_{0})/\mu_{0}=2.55, which violates the stability condition. As shown in Figure 10, the pricing and staffing policies (pk,μk)(p_{k},\mu_{k}) remain convergent to (p,μ)(p^{*},\mu^{*}). Consistently, the resulting traffic intensity ρkλ(pk)/μk\rho_{k}\equiv\lambda(p_{k})/\mu_{k} is quickly controlled to fall below 1; that is, the workload is kept in check despite of the unstable performance in the initial cycle.

11.2 M/G/1M/G/1 with Phase-Type Service

To test the performance of our online learning algorithm for queues with non-exponential service times, we consider phase-type distributions: hyperexponential with nn phases (HnH_{n}) and Erlang with nn phases (EnE_{n}). In Figure 11 we report the convergent sequence (pk,μk)(p_{k},\mu_{k}) with H2H_{2} service with service-time SCV cs2=8c_{s}^{2}=8 (top panel), MM service with cs2=1c_{s}^{2}=1 (middle panel), and E8E_{8} service with cs2=1/8c_{s}^{2}=1/8 (bottom panel). Other parameters include the step size ηk=4/k\eta_{k}=4/k, cycle length Dk=20+10log(k)D_{k}=20+10\log(k) and initial condition p0=4p_{0}=4 and μ0=12\mu_{0}=12 (λ0=5.249\lambda_{0}=5.249).

Refer to caption
Figure 11: Joint pricing and staffing for an M/G/1M/G/1 queue having (a) H2H_{2} service with cs2=8c_{s}^{2}=8 (top panel), (b) MM service (middle panel), and (c) Erlang service with cs2=1/8c_{s}^{2}=1/8 (bottom panel). Other parameters are step length ηk=4/k\eta_{k}=4/k, cycle length Dk=20+10log(k)D_{k}=20+10\log(k), initial condition p0=4p_{0}=4, μ0=12\mu_{0}=12. The optimal pricing and staffing solutions are: (i) (p,μ)=(3.44,16.86)(p^{*},\mu^{*})=(3.44,16.86); (ii) (p,μ)=3.40,12.48)(p^{*},\mu^{*})=3.40,12.48); (iii) (p,μ)=3.38,11.34)(p^{*},\mu^{*})=3.38,11.34).

Figure 11 confirms that our algorithm remains effective. In addition, the convergence is faster as the CSV cs2c_{s}^{2} decreases. This is intuitive because a less variable service-time distribution yields a smaller 𝒱k\mathcal{V}_{k} for the gradient estimator.

11.3 Lognormal Service and Arrival

Next, we consider an LN/LN/1LN/LN/1 queue with service and interarrival times following lognormal (LN) distributions. Our consideration here follows from the recent empirical confirmations of LN distributed service times in real service systems.

We let cs2=ca2=2c_{s}^{2}=c_{a}^{2}=2 with ca2c_{a}^{2} being the SCV of the LNLN-distributed interarrival times. The other parameters remain the same as in Section 11.2. Because the exact optimal solutions (p,μ)(p^{*},\mu^{*}) are unavailable for this model, we are unable to provide an estimate of the regret as done in Figure 6, nor can we confirm the convex structure of the problem. Nevertheless, Figure 12 shows that our online algorithm continues to work well, despite the fact that LN is no longer a light-tail distribution (Assumption 2 does not hold in this case).

Refer to caption
Figure 12: Joint online pricing and staffing for an LN/LN/1LN/LN/1 having lognormal service and interarrival times with CSVs cs2=ca2=2c_{s}^{2}=c_{a}^{2}=2. Other parameters are ηk=4/k\eta_{k}=4/k, Dk=20+10log(k)D_{k}=20+10\log(k), p0=4,μ0=14p_{0}=4,\mu_{0}=14.

11.4 Extended Comparison of GOLiQ and Heavy-traffic Methods

Supplementing our investigations in Section 6.3, we provide additional numerical results. Recall that the heavy-traffic results in Lee and Ward (2014) are obtained by constructing a sequence of GI/GI/1GI/GI/1 models indexed by nn, where the nthn^{\rm th} model has scaled arrival rate λn(p)=nλ(p)\lambda_{n}(p)=n\lambda(p) and service rate μn=nμ\mu_{n}=n\mu, so that both λn\lambda_{n} and μn\mu_{n} grow to \infty as nn increases. Lee and Ward (2014) develop asymptotic staffing and pricing solutions for the GI/GI/1GI/GI/1 queue; they show that, as the scaling factor nn\rightarrow\infty, the optimal price pnpp^{*}_{n}\rightarrow p_{\infty} and service capacity μn/nμ\mu^{*}_{n}/n\rightarrow\mu_{\infty}, with ρλ(p)/μ=1\rho_{\infty}\equiv\lambda(p_{\infty})/\mu_{\infty}=1.

We repeat our experiment in Section 6.2 with the scaling parameter n{10,50,100,500,1000,2000}n\in\{10,50,100,500,1000,2000\} for the arrival rate function (24). But we now focus on the optimal traffic intensity as nn varies. In Figure 13 we plot the optimal price and service rate as nn increases. In each experiment, we compute the optimal pnp_{n} and μn\mu_{n} using their average value in cycles 300–500 of Algorithm 1. Consistent with Lee and Ward (2014), Figure 13 shows that pnp_{n}, μn/n\mu_{n}/n and ρn\rho_{n} approach pp_{\infty}, μ\mu_{\infty} and ρ=1\rho_{\infty}=1. On the other hand, when the scale nn is not very large, the heavy-traffic solutions can become inaccurate. For instance, when n=100n=100 the optimal traffic ρ100\rho_{100} is around 0.8, which is not close to 1.

Refer to caption
Figure 13: Comparison of (i) online optimization solutions (pn,μn,ρn)(p_{n},\mu_{n},\rho_{n}); (ii) exact solutions (pn,μn,ρn)(p_{n}^{*},\mu_{n}^{*},\rho_{n}^{*}); and (iii) heavy-traffic solutions (p,μ,ρ)=(3.282,6.932,1)(p_{\infty},\mu_{\infty},\rho_{\infty})=(3.282,6.932,1) in Lee and Ward (2014), as the system scale n=M0n=M_{0} increases, with parameters Dk=n(10+10log(k))D_{k}=n(10+10\log(k)) and ηk=3k1\eta_{k}=3k^{-1}.

11.5 Alternative Definition of Regret

In this subsection, we attempt to rationalize our regret definition in (9). We consider a potential alternative to (9) which benchmarks the system revenue under GOLiQ with the nonstationary revenue under (μ,p)(\mu^{*},p^{*}). Because the nonstationary queue length is intractable, we conduct additional numerical experiments to estimate the expected nonstationary regret via Monte-Carlo simulations.

Specifically, we simulate the regret in (10) under (p,μ)(p^{*},\mu^{*}) with the queueing system starting empty (of which the dynamics is nonstationary). We use the M/M/1M/M/1 model in Section 6 having a logit demand function (24) with n=10n=10 and a quadratic staffing cost in (25) with c=0.1c=0.1.

In Figure 14 we graph both versions of the regret under GOLiQ under the same experimental setting, with hyperparameters ηk=3k1\eta_{k}=3k^{-1} and Dk=10+10log(k)D_{k}=10+10\log(k). Figure 14 confirms that the these two versions of regret appear to be nearly indistinguishable. This is due to the geometric ergodicity of G/G/1G/G/1 queue.

Refer to caption
Figure 14: Comparing two versions of regret: (i) “stationary regret” that benchmarks with the steady-state performance under (μ,p)(\mu^{*},p^{*}) (defined in (9)), and (ii) “nonstationary regret” that benchmarks with the nonstationary performance under (μ,p)(\mu^{*},p^{*}). Hyperparameters are ηk=3k1\eta_{k}=3k^{-1}, Dk=10+10log(k)D_{k}=10+10\log(k). Both regret curves are estimated by 1,000 independent runs.

12 Additional discussion on Assumption 3

In this section, we provide some sufficient conditions for strong convexity in the M/GI/1M/GI/1 case.

Lemma 8.

For M/GI/1M/GI/1 queues, if c(μ)c(\mu) is convex, λ(p¯)/μ¯<1\lambda(\underline{p})/\underline{\mu}<1, and in addition,

λ(p)22λ(p)<λ′′(p)<2λ(p)p,andλ(p¯)μ¯>11/20.29,\frac{\lambda^{\prime}(p)^{2}}{2\lambda(p)}<\lambda^{\prime\prime}(p)<\frac{-2\lambda^{\prime}(p)}{p},\quad\text{and}\quad\frac{\lambda(\bar{p})}{\bar{\mu}}>1-1/\sqrt{2}\approx 0.29, (40)

then f(μ,p)f(\mu,p) is strongly convex in \mathcal{B}.

Proof of Lemma 8.

Recall that f(μ,p)=pλ(p)+h0𝔼[Q(μ,p)]+c(μ)f(\mu,p)=-p\lambda(p)+h_{0}\mathbb{E}[Q_{\infty}(\mu,p)]+c(\mu), and (pλ(p))′′=pλ′′2λ(-p\lambda(p))^{\prime\prime}=-p\lambda^{\prime\prime}-2\lambda^{\prime}. Under condition (40), we have pλ′′2λ>0-p\lambda^{\prime\prime}-2\lambda^{\prime}>0. Therefore, both pλ(p)-p\lambda(p) and c(μ)c(\mu) are convex, and it suffices to show that 𝔼[Q(μ,p)]\mathbb{E}[Q_{\infty}(\mu,p)] is strongly convex in μ\mu and pp. For M/GI/1M/GI/1 queues, Pollaczek-Khinchine formula yields that

q(μ,p)𝔼[Q(μ,p)]=Cλ(p)μλ(p)+(1C)λ(p)μ,q(\mu,p)\equiv\mathbb{E}[Q_{\infty}(\mu,p)]=C\frac{\lambda(p)}{\mu-\lambda(p)}+(1-C)\frac{\lambda(p)}{\mu},

with C1+cs22C\equiv\frac{1+c_{s}^{2}}{2}. For any given pair of (μ,p)(\mu,p), let HqH_{q} be the Hessian matrix of q(μ,p)q(\mu,p). We next verify that HqH_{q} is positively definite. By direct calculation, we have

p2q\displaystyle\partial_{p}^{2}q =Cμ(μλ)3(2(λ)2+(μλ)λ′′)+(1C)λ′′μ,\displaystyle=C\frac{\mu}{(\mu-\lambda)^{3}}\left(2(\lambda^{\prime})^{2}+(\mu-\lambda)\lambda^{\prime\prime}\right)+(1-C)\frac{\lambda^{\prime\prime}}{\mu},
μ2q\displaystyle\partial_{\mu}^{2}q =C2λ(μλ)3+(1C)2λμ3,pμq=Cλ+μ(μλ)3λ(1C)λμ2,\displaystyle=C\frac{2\lambda}{(\mu-\lambda)^{3}}+(1-C)\frac{2\lambda}{\mu^{3}},\quad\partial_{p}\partial_{\mu}q=-C\frac{\lambda+\mu}{(\mu-\lambda)^{3}}\lambda^{\prime}-(1-C)\frac{\lambda^{\prime}}{\mu^{2}},

with λ,λ′′\lambda^{\prime},\lambda^{\prime\prime} being the first and second order derivatives of λ(p)\lambda(p). As a result, the determinant of Hessian matrix of HqH_{q} is

|Hq|=\displaystyle|H_{q}|= C2(μλ)5(2μλλ′′(μλ)(λ)2)+(1C)2μ4(2λλ′′(λ)2)\displaystyle\frac{C^{2}}{(\mu-\lambda)^{5}}\left(2\mu\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)+\frac{(1-C)^{2}}{\mu^{4}}\left(2\lambda\lambda^{\prime\prime}-(\lambda^{\prime})^{2}\right)
+2C(1C)μ2(μλ)3((2μλ)λλ′′(μλ)(λ)2).\displaystyle\qquad+\frac{2C(1-C)}{\mu^{2}(\mu-\lambda)^{3}}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right).

To show that HqH_{q} is positively definite, it suffices to show that μ2q\partial_{\mu}^{2}q, p2q\partial_{p}^{2}q and |Hq||H_{q}| are all positive. First, it is clear that

μ2q=2λC(1(μλ)31μ3)+2λμ3>0.\partial_{\mu}^{2}q=2\lambda C\left(\frac{1}{(\mu-\lambda)^{3}}-\frac{1}{\mu^{3}}\right)+\frac{2\lambda}{\mu^{3}}>0.

Next, we compute

p2q\displaystyle\partial_{p}^{2}q =Cμ(μλ)3(2(λ)2+(μλ)λ′′)+(1C)λ′′μ\displaystyle=C\frac{\mu}{(\mu-\lambda)^{3}}\left(2(\lambda^{\prime})^{2}+(\mu-\lambda)\lambda^{\prime\prime}\right)+(1-C)\frac{\lambda^{\prime\prime}}{\mu}
=2Cμ(μλ)3(λ)2+Cμ(μλ)2λ′′+(1C)λ′′μ\displaystyle=\frac{2C\mu}{(\mu-\lambda)^{3}}(\lambda^{\prime})^{2}+\frac{C\mu}{(\mu-\lambda)^{2}}\lambda^{\prime\prime}+(1-C)\frac{\lambda^{\prime\prime}}{\mu}
>(a)Cμ(μλ)2λ′′+(1C)λ′′μ\displaystyle\stackrel{{\scriptstyle(a)}}{{>}}\frac{C\mu}{(\mu-\lambda)^{2}}\lambda^{\prime\prime}+(1-C)\frac{\lambda^{\prime\prime}}{\mu}
=Cλ′′(μμλ>11μλ1μ)+λ′′μ>(b)Cλ′′(1μλ1μ)+λ′′μ>0.\displaystyle=C\lambda^{\prime\prime}\Big{(}\underbrace{\frac{\mu}{\mu-\lambda}}_{>1}\cdot\frac{1}{\mu-\lambda}-\frac{1}{\mu}\Big{)}+\frac{\lambda^{\prime\prime}}{\mu}\stackrel{{\scriptstyle(b)}}{{>}}C\lambda^{\prime\prime}\left(\frac{1}{\mu-\lambda}-\frac{1}{\mu}\right)+\frac{\lambda^{\prime\prime}}{\mu}>0.

Here, inequality (a) follows from that 2Cμ(μλ)3(λ)2>0\frac{2C\mu}{(\mu-\lambda)^{3}}(\lambda^{\prime})^{2}>0. Inequality (b) holds due to the facts that μμλ>1\frac{\mu}{\mu-\lambda}>1 and that λ′′>(λ)22λ0\lambda^{\prime\prime}>\frac{(\lambda^{\prime})^{2}}{2\lambda}\geq 0. The last inequality holds because 1μλ>1μ\frac{1}{\mu-\lambda}>\frac{1}{\mu}. As a result, we have p2q,μ2q>0\partial^{2}_{p}q,\partial^{2}_{\mu}q>0. Next, we verify that |Hq|>0|H_{q}|>0. Because 2λλ′′(λ)2>02\lambda\lambda^{\prime\prime}-(\lambda^{\prime})^{2}>0, we have

2μλλ′′>(μλ)(λ)2,and(2μλ)λλ′′>(μλ)(λ)2.2\mu\lambda\lambda^{\prime\prime}>(\mu-\lambda)(\lambda^{\prime})^{2},\quad\text{and}\quad(2\mu-\lambda)\lambda\lambda^{\prime\prime}>(\mu-\lambda)(\lambda^{\prime})^{2}.

Therefore,

|Hq|=\displaystyle|H_{q}|=\leavevmode\nobreak\ C2(μλ)5(2μλλ′′(μλ)(λ)2)+(1C)2μ4(2λλ′′(λ)2)\displaystyle\frac{C^{2}}{(\mu-\lambda)^{5}}\left(2\mu\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)+\frac{(1-C)^{2}}{\mu^{4}}\left(2\lambda\lambda^{\prime\prime}-(\lambda^{\prime})^{2}\right)
+2C(1C)μ2(μλ)3((2μλ)λλ′′(μλ)(λ)2)\displaystyle\qquad+\frac{2C(1-C)}{\mu^{2}(\mu-\lambda)^{3}}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)
>(c)\displaystyle\stackrel{{\scriptstyle(c)}}{{>}} C2(μλ)5(2μλλ′′(μλ)(λ)2)+2C(1C)μ2(μλ)3((2μλ)λλ′′(μλ)(λ)2)\displaystyle\frac{C^{2}}{(\mu-\lambda)^{5}}\left(2\mu\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)+\frac{2C(1-C)}{\mu^{2}(\mu-\lambda)^{3}}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)
>\displaystyle>\leavevmode\nobreak\ C2(μλ)5(2μλλ′′>(2μλ)λλ′′(μλ)(λ)2)2C2μ2(μλ)3((2μλ)λλ′′(μλ)(λ)2)\displaystyle\frac{C^{2}}{(\mu-\lambda)^{5}}\Big{(}\underbrace{2\mu\lambda\lambda^{\prime\prime}}_{>(2\mu-\lambda)\lambda\lambda^{\prime\prime}}-(\mu-\lambda)(\lambda^{\prime})^{2}\Big{)}-\frac{2C^{2}}{\mu^{2}(\mu-\lambda)^{3}}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)
>\displaystyle>\leavevmode\nobreak\ C2(μλ)5((2μλ)λλ′′(μλ)(λ)2)2C2μ2(μλ)3((2μλ)λλ′′(μλ)(λ)2)\displaystyle\frac{C^{2}}{(\mu-\lambda)^{5}}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)-\frac{2C^{2}}{\mu^{2}(\mu-\lambda)^{3}}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)
=\displaystyle=\leavevmode\nobreak\ C2((2μλ)λλ′′(μλ)(λ)2)(μλ)3(1(μλ)22μ2)>0.\displaystyle\frac{C^{2}\left((2\mu-\lambda)\lambda\lambda^{\prime\prime}-(\mu-\lambda)(\lambda^{\prime})^{2}\right)}{(\mu-\lambda)^{3}}\left(\frac{1}{(\mu-\lambda)^{2}}-\frac{2}{\mu^{2}}\right)>0.

Inequality (c) holds because (1C)2μ4(2λλ′′(λ)2)\frac{(1-C)^{2}}{\mu^{4}}(2\lambda\lambda^{\prime\prime}-(\lambda^{\prime})^{2}) is positive by (40). The last inequality follows from λ(p)μ>11/2\frac{\lambda(p)}{\mu}>1-1/\sqrt{2}. This closes our proof. ∎

Notation Description
Model parameters and functions =[p¯,p¯]×[μ¯,μ¯]\mathcal{B}=[\underline{p},\bar{p}]\times[\underline{\mu},\bar{\mu}] Feasible action space
c(μ)c(\mu) Staffing cost function
f(μ,p)f(\mu,p) Objective (loss) function
h0h_{0} Customer holding cost
λ(p)\lambda(p) Demand function
μ\mu Service rate/capacity
nn Market size/ System scale in Section 6
pp Service fee
Q(μ,p)Q_{\infty}(\mu,p) Stationary queue length under (p,μ)(p,\mu)
SnkS_{n}^{k} Service time of the (n1)th(n-1)^{\rm th} customer in cycle kk
τn\tau_{n} Interarrival time between (n1)th(n-1)^{\rm th} and nthn^{\rm th} customers
θ,γ,η\theta,\gamma,\eta Parameters of light-tail assumptions (Assumption 2)
Un,VnU_{n},V_{n} Unscaled random “seeds” of interarrival and service times
x=(p,μ)x^{*}=(p^{*},\mu^{*}) Optimal decision fee and service rate
Algorithmic hyperparameters DkD_{k} Sample size (number of customers served) in cycle kk
ηk\eta_{k} Step size or learning rate in cycle kk
HkH_{k} Gradient estimator in cycle kk
MLM_{L} Cumulative number of customers served by cycle LL
QkQ_{k} Queue content leftover from cycle k1k-1
WnkW_{n}^{k} Delay of the nthn^{\rm th} customer in cycle kk
ξ\xi Warm up rate
XnkX_{n}^{k} (XnX_{n}) Server’s busy time observed by customer nn in cycle kk
Constants and bounds in regret analysis aD,bDa_{D},b_{D} Constants for DkD_{k} in equation (21)
A=4M+4MA=4\sqrt{M}+4M Constant in Corollary 1
BB Constant of stationary waiting times in Lemma 4
Bk,𝒱kB_{k},\leavevmode\nobreak\ \mathcal{V}_{k} Upper bounds for bias and Variance of HkH_{k}
cηc_{\eta} Constant for ηk\eta_{k} in equation (20)
cμ,cλc_{\mu},c_{\lambda} Constants in Lemma 3
C=max{x0x2,8K3/K0}C=\max\{\|x_{0}-x^{*}\|^{2},8K_{3}/K_{0}\} Constant in Theorem 2
C0=maxx{h0λ(p),h0λ(p)/μ}C_{0}=\max_{x\in\mathcal{B}}\{h_{0}\lambda^{\prime}(p),h_{0}\lambda(p)/\mu\} Constant in the proof of Theorem 3
C1=maxx{|λ(p)+pλ(p)|,|c(μ)|}C_{1}=\max_{x\in\mathcal{B}}\{|\lambda(p)+p\lambda^{\prime}(p)|,|c^{\prime}(\mu)|\} Constant in the proof of Theorem 3
CDC_{D} Constant for the selection of DkD_{k} (Theorem 3)
dk=4log(k)/min(θ,γ)d_{k}=\lceil 4\log(k)/\min(\theta,\gamma)\rceil Constant of warm-up time (Theorem 1)
d~k=5log(k)/min(θ,γ)\tilde{d}_{k}=\lceil 5\log(k)/\min(\theta,\gamma)\rceil Constant of warm-up time (Theorem 1)
Γi,i=1,2\Gamma_{i},\leavevmode\nobreak\ i=1,2 Stopping time of random walks (proof of Lemma 2)
I1,I2,I3I_{1},I_{2},I_{3} Three terms of the regret of nonstationary
KalgK_{alg} Bound of the cumulative regret (Theorem 3)
KK^{\prime} Constant for regret of nonstationary (proof of Theorem 1)
K=K+2M0/log(2)K=K^{\prime}+2M_{0}/\log(2) Constant in the proof of Theorem 1
K0,K1K_{0},K_{1} Constants for convexity and smoothness (Assumption 1)
K2=2K3/K0K_{2}=2K_{3}/K_{0} Constant for DkD_{k} (Theorem 1)
K3K_{3} Constants for variance in Theorem 3 (37)
K4K_{4} Constants for convergence rate of busy time in Lemma EC.1
K5=32e4/(min(θ,γ))K_{5}=32e^{4}/(\min(\theta,\gamma)) Bound in the proof of Theorem 3 (35)
K6=K2maxpλ(p)K_{6}=K_{2}\max_{p}\lambda^{\prime}(p) Bound in the proof of Corollary 2
λ¯,λ¯\bar{\lambda},\underline{\lambda} Upper and lower bounds for λ(p)\lambda(p)
MM Uniform bound for queueing functions in Lemma 1
M0M_{0} Upper bound of the regret in the first cycle
RkR_{k} Total regret during cycle kk
R1,k,R2,kR_{1,k},R_{2,k} Regret of nonstationary/suboptimality in cycle kk
R(L),R1(L),R2(L)R(L),R_{1}(L),R_{2}(L) Cumulative/nonstationary/suboptimal regret by cycle LL
TkT_{k} Length of cycle kk
Table 1: Glossary of notation.