SMDP-Based Dynamic Batching for Improving Responsiveness and Energy Efficiency of Batch Services

Yaodan Xu, Sheng Zhou, and Zhisheng Niu This work was supported in part by the National Natural Science Foundation of China under Grants 62341108 and in part by the Fundamental Research Funds for the Central Universities under Grant 2242022k60006.Y. Xu, S. Zhou, and Z. Niu are with Department of Electronic Engineering, the Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing 100084, China (e-mail: xyd21@mails.tsinghua.edu.cn; sheng.zhou@tsinghua.edu.cn; niuzhs@tsinghua.edu.cn).Part of this work has been published in IEEE ICC 2023[1].

Abstract

For servers incorporating parallel computing resources, batching is a pivotal technique for providing efficient and economical services at scale. Parallel computing resources exhibit heightened computational and energy efficiency when operating with larger batch sizes. However, in the realm of online services, the adoption of a larger batch size may lead to longer response times. This paper aims to provide a dynamic batching scheme that delicately balances latency and efficiency. The system is modeled as a batch service queue with size-dependent service times. Then, the design of dynamic batching is formulated as a semi-Markov decision process (SMDP) problem, with the objective of minimizing the weighted sum of average response time and average power consumption. A method is proposed to derive an approximate optimal SMDP solution, representing the chosen dynamic batching policy. By introducing an abstract cost to reflect the impact of “tail” states, the space complexity and the time complexity of the procedure can decrease by 63.5% and 98%, respectively. Numerical results showcase the superiority of SMDP-based batching policies across various parameter setups. Additionally, the proposed scheme exhibits noteworthy flexibility in balancing power consumption and latency.

Index Terms:

Dynamic batching, SMDP, latency, power consumption, GPUs.

I Introduction

To meet the escalating demands for powerful computing capabilities, processors have undergone significant advancements in recent decades. The processors of today, including multi-core processors, graphics processing units (GPUs) and tensor processing units (TPUs), are equipped to better support parallel computing. This enhancement is crucial for efficiently managing large-scale data and executing complex tasks. For instance, GPUs have played a prominent role in accelerating the training and inference of neural networks due to their advantage in parallel computing[2]. These computing resources are now widely deployed across various levels—locally, on edge servers, and on cloud servers, providing computing services that facilitate ubiquitous access to intelligence at any time and from anywhere.

Refer to caption — Figure 1: Batching of the same type of inference requests from potentially different users on a GPU-based ML-as-a-Service (MLaaS) platform.

For servers equipped with processors capable of parallel operations, an important factor that affects both performance and cost of computing services is batch processing, or batching[3, 4]. Specifically, batching is usually employed for homogeneous tasks that share common operations, allowing the grouping of data into a unified batch for simultaneous processing on the server. The number of standard units of data or tasks gathered in a batch is referred to as the batch size. It is noteworthy that batching helps to better utilize computing and energy resources due to parallelism. This batching effect has been examined across different hardware platforms for various tasks, including basic matrix computations[5] and diverse machine learning (ML) inference models[6, 7, 8, 9, 10, 11].

However, opting for a larger batch size is not always the preferred approach, especially for online service provisioning where requests arrive in a random pattern, and expect responsive feedback. This gives rise to two main challenges in determining the batch size: (1) Larger batch sizes enhance energy efficiency and throughput by improving resource utilization and reducing per-sample I/O overhead. However, this benefit may come at the cost of decreased responsiveness, as batching can increase request response time due to potential waiting time needed to form a batch and extended processing time for handling multiple requests simultaneously[12, 4, 13]. This creates a tradeoff between efficiency and responsiveness[12]. (2) The use of statically configured batching proves inadequate in realistic scenarios[14], exhibiting poor responsiveness under low load conditions and limited throughput under high load[15]. To address these issues, a dynamic batching scheme is essential, allowing for judicious batch size adjustments to well balance the performance and cost.

In this paper, we study the dynamic batching scheme on batch processing-capable servers, aiming to strike a delicate balance between responsiveness and energy efficiency. This issue has gained increasing significance with the emergence of ML-as-a-Service (MLaaS) platforms like Google Cloud Prediction[16], where trained ML models are published on the platforms to provide inference (prediction) services for massive end users. As illustrated in Fig. 1, batch processing the inference requests becomes a natural strategy, for efficiency and economical concerns.

We commence our exploration with intra-processor parallelism, considering a scenario with a single server equipped with a single parallel computing processor. We leverage the theoretical framework of sequential decision-making to address the sequential batch size decisions. In our context, where we assume Poisson request arrivals and an arbitrary service time distribution, the problem is formulated as a semi-Markov decision process (SMDP). The objective is a weighted sum of the long-term average request response time and average power consumption. Notably, to the best of our knowledge, the optimal control problem for batch service queues with size-dependent service times remains unexplored in the literature. Moreover, the inherent complexities of the formulated SMDP problem—characterized by an infinite state space, an average (non-discounted) objective, and unbounded costs—pose challenges for efficient resolution using traditional methods. To address these challenges, we propose a procedure to solve the SMDP problem and derive an approximate optimal policy, which manifests as the selected dynamic batching scheme. Our main contributions are summarized as follows:

1.

To the best of our knowledge, our work is the first to rigorously formulate and optimally solve the dynamic batching problem for online computing services. The batching decision is formulated as an infinite-state SMDP, with the objective of minimizing the weighted sum of average response time and average power consumption.
2.

A new method is proposed, composed of finite state approximation, model “discretization” and relative value iteration, to obtain an approximate optimal policy. The demanding problem of state space explosion is tackled by a novel abstract cost, which reflects the impact of costs in “tail” states.
3.

We also conclude the theoretical results regarding the optimal policy structure in special cases. The SMDP solutions obtained through the proposed general method are visualized under different parameter settings. On one hand, the computed policies align with the theoretical results in special cases, affirming the effectiveness of the proposed method and the correctness of the theoretical results. On the other hand, certain instances reveal that the theoretical results might not extend to more general scenarios, underscoring the necessity of the general solving approach.
4.

Extensive numerical results demonstrate that the SMDP-based policies achieve the lowest average cost compared to benchmarks. The latency-energy tradeoff curves show that when having the same average response time, the SMDP-based policies never consume more energy than any other benchmark policy, and vice versa. Moreover, the proposed scheme can adapt to different traffic intensities and flexibly balance the response time and power consumption.

The rest of this paper is organized as follows. We begin by presenting related works in Section II. The system model is introduced in Section III, followed by the SMDP formulation in Section IV. A procedure for solving the SMDP problem is proposed in Section V. Theoretical analyses regarding the optimal policy in special cases of the problem are detailed in Section VI. Numerical results are showcased in Section VII. Finally, Section VIII provides the concluding remarks for the paper.

II Related Works

Dynamic Batching. Some works have explored dynamic batching for systems such as data centers[18] and Spark Streaming[19]. Recently, there has been a notable increase in attention towards dynamic batching, particularly in the context of ML applications. While works like DBS[20] and Zeus[21] explore batching for ML training, they fall outside the scope of our focus. Our exclusive attention is on online services. For example, in applications such as smart healthcare monitoring, data generated by devices can be efficiently processed through cloud, edge, or hybrid computing solutions to ensure timely responses[22, 23, 24]. Plenty of research has been conducted on dynamic batching for ML inference serving on GPU-based platforms[4, 3, 15, 14, 12, 8, 25, 26, 27]. There are also studies addressing the batching issue on a multi-core central processing unit (CPU)[10]. Nevertheless, progress in theoretical analysis remains very limited. For example, SERF[25] models inference serving as an M/D/c queue. However, it does not explicitly account for batching in the model. Another work, BATCH[12], characterizes request arrivals as a Poisson process or a Markov-modulated Poisson process with two phases (MMPP(2))[28], allowing for the estimation of the number of requests that arrive before a timeout. However, this analysis overlooks the cases where requests arrive during processing times. In contrast, the author of [29] presents a closed-form queueing analysis of dynamic batching under the greedy batching policy. Nonetheless, this policy is suboptimal as it does not leverage the potential benefits of larger batches. Additionally, the analysis in [29] assumes an infinite batching capacity, which is generally impractical.

Parallel Batch Processing. Parallel batch processing, or parallel batching, is a classical issue in operational research that finds applications in numerous fields such as manufacturing, transportation, and healthcare[30, 31, 32]. However, the batch processing problem studied in this paper exhibits two distinctive deviations from classical scenarios in the existing literature. Firstly, unlike the ideal parallelism with batch-size independent service times[33, 34], the batch processing time can increase with the batch size[35, 36]. Secondly, the energy efficiency can directly benefit from an increased batch size, rather than remaining unchanged. For instance, as illustrated in Fig. 2LABEL:sub@fig:latency and Fig. 2LABEL:sub@fig:energy, which are based on the statistics from NIVIDIA[7], the processing time and energy consumption for batch ML inference services appear to be affine functions of the batch size. Consequently, the average processing time and energy consumption per batch decrease as the batch size increases, leading to improvements in both computational and energy efficiency, as shown in Fig. 2. For GoogLeNet inference on a TESLA V100, using a batch size of 128 can achieve a speedup of 9.8 times compared to inference without batching (batch size of 1). Additionally, the energy efficiency is improved by 4.4 times.

Queueing analyses for batch servers with size-dependent service times have been conducted in the literature, but they only focus on certain structured policies that are suboptimal[37, 38, 39, 40]. Following the line of optimal control, existing research[33, 41, 32] primarily addresses problems where batch processing times are independent of the batch size. In [42], the author models the load-balancing problem in multiple batch service queues with size-dependent processing times as a Markov decision process and identifies it as an open problem. In fact, optimal batching in one such batch service queue, as the simplest sub-problem of [42], still remains unsolved.

SMDP. Continuous-time Markov decision processes (CTMDPs) and SMDPs are the common formulations for sequential decision-making in continuous-time systems[43]. Given that the processing times of computation tasks often exhibit limited randomness[12], deviating from the characteristics of an exponential distribution, SMDPs appear as more fitting choices for the studied problem. In this work, we need to address an infinite-horizon and infinite-state SMDP problem with unbounded costs, where the objective is expressed in a long-term average form. Analytical results regarding optimal policies for this SMDP are available only in specific cases[33]. An alternative approach is to utilize iteration-based numerical methods, which require finite state approximation due to the intractable infinite state space. In prior research, the authors in [44] demonstrated the convergence of several finite state approximation algorithms for average cost SMDP models, but only with bounded costs. In [45], proofs were provided for the convergence of finite state approximation algorithms for models with unbounded costs, but with a discounted objective. Nevertheless, none of the existing finite state approximation algorithms has been proven effective for the considered SMDP problem.

In summary, the lack of theoretical guarantees in existing dynamic batching schemes necessitates a rigorous formulation. The extracted system model distinguishes from classical batch service queues in its size-dependent service times, a feature that remains unexplored in the literature. The design of dynamic batching can be formulated into an SMDP problem with an infinite state space. Regrettably, none of the existing finite state approximation algorithms has been proven effective in addressing the studied SMDP, highlighting the need for the development of novel methods.

In our prior work[1], we proposed an SMDP-based dynamic batching scheme for ML inference serving on GPU-based platforms. This paper extends the research to encompass general online computing service scenarios. The batch service time and energy consumption take more general forms with respect to the batch size, unlike the deterministic and linear functions considered in [1]. Moreover, additional insights are gained through both theoretical induction and extensive numerical results.

III System Model

We consider a single server with a single parallel computing processor in the continuous-time setting. Batch processing is implemented on the same type of computing requests, and it cannot be interrupted once the processing is started. The system is modeled as a single service queue, where computing requests (tasks) are assumed to arrive according to a Poisson process with an arrival rate of $\lambda$ . The requests awaiting processing are stored in a buffer, which is assumed to have an infinite capacity. This assumption of an infinite buffer is based on the fact that the storage capacity of a computing server, which functions as the buffer in the queuing model, is significantly larger than the memory size. The total number of requests in the buffer as well as being processed at time $t\geq 0$ is denoted by $s(t)\in\{0,1,2,\dots\}$ .

Let $b\in\mathcal{B}\triangleq\{x\,|\,x\in\mathbb{N}_{+},B_{\min}\leq x\leq B_{\max}\}$ denote the batch size, where $B_{\min},B_{\max}\in\mathbb{N}_{+}$ and $B_{\min}\leq B_{\max}$ . $B_{\max}$ (or $B_{\min}$ ) is the maximum (or minimum) batch size allowed by the system. The cumulative distribution function (CDF) of the processing (service) time for a batch of size $b$ is $G_{b}(x)\;(x\geq 0)$ with mean $1/\mu^{[b]}$ . Assume that for every $b\in\mathcal{B}$ , $G_{b}(x)$ has a finite second moment. Moreover, the mean processing time should be finite and larger than zero, i.e., $G_{b}(0)<1$ and $1/\mu^{[b]}<\infty$ . Let $l\;:\;\mathcal{B}\rightarrow\mathbb{R}$ represent the function of the mean batch processing time with respect to the batch size, where for $\forall b\in\mathcal{B}$ , $0<l(b)<\infty$ and ${l(b)}=\frac{1}{\mu^{[b]}}$ . The time required for processing a larger batch should be no less than that for a smaller batch. Therefore, we assume that $l(b)$ is monotonically non-decreasing in terms of $b$ . Let $\tilde{g}_{b}(s)$ denote the Laplace transform of the service time distribution $G_{b}(x)$ , defined by

\tilde{g}_{b}(s)=\int_{0}^{\infty}e^{-sx}\mathrm{d}G_{b}(x),

(1)

with $-\tilde{g}^{\prime}_{b}(0)=1/\mu^{[b]}$ .

When operating with a batch size of $b$ , define the batch processing computational efficiency, or batch service rate, as the average number of requests processed per unit of time, denoted as $\theta(b)\triangleq\frac{b}{l(b)}$ . Since parallelism can enhance computational efficiency, it is assumed that $\theta(b)$ is monotonically non-decreasing with the batch size $b$ . Noting that $l(b)$ is also a non-decreasing function with respect to $b$ , it follows that the mean batch processing time $l(b)$ should exhibit linear or sublinear growth as $b$ increases, if $l(b)$ is dependent on $b$ .

Remark 1.

(i) When $l(b)$ exhibits sublinear growth as $b$ increases, the computational efficiency $\theta(b)$ monotonically increases with $b$ . (ii) When $l(b)$ exhibits linear growth as $b$ increases, there are two cases: If $l(b)$ is proportional to $b$ , i.e., $l(b)=\alpha b$ with $\alpha>0$ , the computational efficiency $\theta(b)$ is constant and independent of $b$ ; If $l(b)$ is affine to $b$ , i.e., $l(b)=\alpha b+l_{0}$ with $\alpha>0$ and $l_{0}>0$ , the computational efficiency $\theta(b)$ monotonically increases with $b$ .

Thus, the maximum service rate is $\theta(B_{\max})=\frac{B_{\max}}{l(B_{\max})}={B_{\max}}\mu^{[B_{\max}]}$ . Let $\rho=\lambda/(B_{\max}\mu^{[B_{\max}]})$ denote the ratio of the arrival rate over the maximum service rate. Assume that $\lambda l(B_{\max})<B_{\max}$ , or equivalently $\lambda<B_{\max}\mu^{[B_{\max}]}$ , then $\rho$ satisfies $\rho\in(0,1)$ , which is a necessary condition for the system stability.

The energy consumption of processing a batch of $b$ requests is denoted by ${\zeta}(b)$ , $\zeta\;:\;\mathcal{B}\rightarrow\mathbb{R}$ . Let $\eta(b)=\frac{b}{{\zeta}(b)}$ denote the batch processing energy efficiency, which is defined as the average number of requests served with one unit of energy consumption. Given the potential for parallelism to improve energy efficiency, we assume that $\eta(b)$ is monotonically non-decreasing with the batch size $b$ . Similarly, $\zeta(b)$ , the function of the batch processing energy consumption, should exhibit linear or sublinear growth as $b$ increases, if $\zeta(b)$ is not independent of $b$ .

Given the server configurations and the specific computing task, the parameters $B_{\max}$ and $B_{\min}$ can be profiled and determined. Additionally, the exact forms of $G_{b}(\cdot),l(b)$ , and ${\zeta}(b)$ are established by fitting the latency and energy consumption statistics obtained from prior profiling[12, 29].

We consider two main factors in the objective: One is the request response time (or latency), which includes both waiting and processing time, as the performance metric. The other is the power consumption of the server, as the running cost metric. Our objective is to minimize the weighted sum of average request response time, denoted by $\overline{W}$ , and average power consumption, denoted by $\overline{P}$ :

\min\;w_{1}\overline{W}+w_{2}\overline{P},

(2)

where $w_{1}>0$ and $w_{2}\geq 0$ are the weights.

The serving process consists of sequential service rounds. Define $t_{i}\;(t_{i}\geq 0,\;i=1,2,\ldots)$ as the start time of the $i$ th service round and $b(t_{i})\in\mathcal{B}$ as the batch size in the $i$ th service round. Let $N(t)\in\mathbb{N}$ denote the total number of service rounds until time $t\geq 0$ . The objective can then be expressed as

\min\;\limsup\limits_{T\rightarrow\infty}\frac{1}{T}\Bigg{\{}w_{1}\frac{1}{\lambda}\int_{0}^{T}s(t)\mathrm{d}t+w_{2}\sum_{i=1}^{N(T)}{\zeta}{(b(t_{i}))}\Bigg{\}},

(3)

Note that here the average request response time $\overline{W}$ is equivalently transformed to the average queue length $\overline{L}$ through Little’s Law $\overline{L}=\lambda\overline{W}$ [46], where $\overline{L}=\limsup\limits_{T\rightarrow\infty}\frac{1}{T}\int_{0}^{T}s(t)\mathrm{d}t$ .

Remark 2.

The energy cost considered in this model can also be replaced with other types of costs, such as the monetary cost[12].

IV SMDP Formulation

The decision-making process for batching in the continuous-time system, as introduced in Section III, naturally lends itself to the formulation as an SMDP[43]. In SMDP, we are only concerned with the states at decision epochs, upon which the decisions are made. Let $\mathcal{T}=[0,\infty)$ be the timeline of the SMDP model. The decision epochs are set as the moments when either the server completes a batch of service, or a request arrives while the server is idle. The $m$ th ( $m=0,1,2,\ldots$ ) decision epoch is denoted as $t_{m}\in\mathcal{T}$ . Let the state be the number of requests in the system. The state at the $m$ th decision epoch is denoted by $s_{m}$ , taking values from the state space $\mathcal{S}\triangleq\{0,1,2,\ldots\}$ .

At each epoch $m$ , the server takes an action $a_{m}$ from the action space $\mathcal{A}\triangleq\{0\}\cup\mathcal{B}$ . The action $a_{m}$ is the size of batch to be processed. Note that $a_{m}=0$ means that no requests are served in the $m$ th epoch. Let $\mathcal{A}_{s}\subseteq\mathcal{A}$ be the set of feasible actions for a given state $s$ . The number of requests to be batched should be no more than the available requests, which means $\mathcal{A}_{s}\triangleq\{0,1,2,\ldots,s\}\cap\mathcal{A}$ , and thus $\mathcal{A}_{s}=\{0,1,2,\ldots,s\}\cap\mathcal{B}$ .

The state transition is associated with the current state and action. Let $m(j|s,a)$ denote the probability that the semi-Markov decision process occupies state $j$ at the next decision epoch when action $a$ is chosen at the current state $s$ . Let $p_{k}^{[b]}$ denote the probability that $k$ requests arrive during the period of processing a batch of $b\in\mathcal{B}$ requests. With the assumption that the arrival of requests follows a Poisson process, we have

p_{k}^{[b]}=\int_{0}^{\infty}\frac{e^{-\lambda t}(\lambda t)^{k}}{k!}\mathrm{d}G_{b}(t),\quad k=0,1,2,\ldots

(4)

A useful method to generate the probabilities $p_{k}^{[b]}$ is by using $\tilde{g}_{b}(s)$ , which is the Laplace transform of the service time distribution. Denote the probability generating function (PGF) that corresponds to $p_{k}^{[b]}$ as $A^{[b]}(z)=\sum_{k=0}^{\infty}p_{k}^{[b]}z^{k}$ , which can be simplified to

$\displaystyle A^{[b]}(z)$	$\displaystyle=\sum_{k=0}^{\infty}p_{k}^{[b]}z^{k}=\sum_{k=0}^{\infty}z^{k}\int_{0}^{\infty}\frac{{e}^{-\lambda t}(\lambda t)^{k}}{k!}\mathrm{d}G_{b}(t)$	(5)
	$\displaystyle=\int_{0}^{\infty}{e}^{-\lambda t}\left(\sum_{k=0}^{\infty}\frac{(\lambda tz)^{k}}{k!}\right)\mathrm{d}G_{b}(t)$
	$\displaystyle=\int_{0}^{\infty}{e}^{-\lambda t(1-z)}\mathrm{d}G_{b}(t)=\tilde{g}_{b}(\lambda(1-z)).$

Then the required probabilities $p_{k}^{[b]}$ can be computed by

p_{k}^{[b]}=\left.\frac{1}{k!}\frac{\mathrm{d}^{k}\tilde{g}_{b}(\lambda(1-z))}{\mathrm{d}z^{k}}\right|_{z=0},\quad k=0,1,\ldots

(6)

The transition probability for $\forall j,s\in\mathcal{S},\forall a\in\mathcal{A}_{s}$ is expressed as

m(j|s,a)=\begin{cases}p_{j-s+a}^{[a]}&\text{ $j\geq s-a,$ $a\in\mathcal{A}_{s},$ $a\neq 0$ }\\ 1&\text{ $j=s+1,$ $a=0$ }\\ 0&\text{ otherwise }\end{cases}.

(7)

Let a random variable $\gamma_{m}$ denote the sojourn time between the $m$ th and the $(m+1)$ th epoch. The random variables $\gamma_{m},m\in\mathbb{N}$ , are conditionally independent given the state $s_{m}$ and action $a_{m}$ . Let $\Gamma_{s,a}(x)\;(x\geq 0)$ denote the CDF of $\gamma_{m}$ when action $a\in\mathcal{A}_{s}$ is chosen at the state $s$ , given by

\Gamma_{s,a}(x)=\begin{cases}G_{a}(x)&\quad a\in\mathcal{A}_{s},a\neq 0\\ 1-e^{-\lambda x}&\quad a=0\end{cases},\;\forall s\in\mathcal{S}.

(8)

Define $y(s,a)=\mathbb{E}[\gamma|s,a]$ as the expected sojourn time until the next decision epoch, given by

y(s,a)=\begin{cases}1/{\mu}^{[a]}&\quad a\in\mathcal{A}_{s},a\neq 0\\ 1/\lambda&\quad a=0\end{cases},\;\forall s\in\mathcal{S}.

(9)

Costs are incurred for serving the requests as well as holding them. The cost of serving a batch of $n\in\mathbb{N}$ requests is denoted by $u(n)$ , and the cost of holding $n$ requests in the system per unit time is denoted by $v(n)$ . Let $c(s,a)$ denote the expected cost until the next decision epoch when action $a\in\mathcal{A}_{s}$ is taken in state $s$ . We have $c(s,0)=u(0)+v(s)y(s,0)$ , and for $a\neq 0$ ,

c(s,a)=u(a)+\int_{0}^{\infty}\int_{0}^{x}\sum_{k=0}^{\infty}v(s+k)\frac{e^{-\lambda t}(\lambda t)^{k}}{k!}\mathrm{d}t\mathrm{d}\Gamma_{s,a}(x).

(10)

The cost functions corresponding to the objective in (3) are $u(n)=w_{2}{\zeta}(n)\;(n>0)$ , $u(0)=0$ and $v(n)=\frac{w_{1}}{\lambda}n$ . This leads to a detailed description of $c(s,a)$ , which is

$\displaystyle c(s,0)$	$\displaystyle=w_{1}\frac{s}{{\lambda}^{2}},$	(11)
$\displaystyle c(s,a)$	$\displaystyle=w_{2}{\zeta}(a)+\int_{0}^{\infty}\int_{0}^{x}\frac{w_{1}}{\lambda}(s+\lambda t)\mathrm{d}t\mathrm{d}\Gamma_{s,a}(x)$
	$\displaystyle=w_{2}{\zeta}(a)+\int_{0}^{\infty}{w_{1}}(\frac{s}{\lambda}x+\frac{1}{2}{x}^{2})\mathrm{d}\Gamma_{s,a}(x)$
	$\displaystyle=w_{2}{\zeta}(a)+{w_{1}}(\frac{s}{\lambda}\mathbb{E}[\gamma\|s,a]+\frac{1}{2}\mathbb{E}[\gamma^{2}\|s,a])$
	$\displaystyle=w_{2}{\zeta}(a)+{w_{1}}(\frac{s}{\lambda}\mathbb{E}[G_{a}]+\frac{1}{2}\mathbb{E}[G_{a}^{2}])$
	$\displaystyle=w_{2}{\zeta}(a)+{w_{1}}(\frac{s}{\lambda{\mu}^{[a]}}+\frac{1}{2}\mathbb{E}[G_{a}^{2}]),\;a\in\mathcal{A}_{s},a\neq 0,$

where $G_{a}$ denotes a generic random variable that follows the CDF $G_{a}(x)$ , and $\mathbb{E}[G_{a}^{2}]=\tilde{g}^{\prime\prime}_{b}(0)$ .

We can also generalize $c(s,a)$ in the form

c(s,a)=u(a)+d(s,a)y(s,a),

(12)

where $d(s,a)$ , referred to as the cost rate, represents the holding cost averaged over the sojourn time:

d(s,a)=\begin{cases}w_{1}\frac{s}{{\lambda}}&\quad a=0\\ w_{1}(\frac{s}{{\lambda}}+\frac{1}{2}\mathbb{E}[G_{a}^{2}]{\mu}^{[a]})&\quad a\in\mathcal{A}_{s},a\neq 0\end{cases},\forall s\in\mathcal{S}.

(13)

The formulated SMDP can be fully described by the set of objects $\mathcal{P}\triangleq\{\mathcal{T},\mathcal{S},\mathcal{A}_{s},m(\cdot|s,a),\Gamma_{s,a}(\cdot),c(s,a)\}$ , and we will use the symbol $\mathcal{P}$ to represent this SMDP model in the subsequent text.

Remark 3.

This formulation assumes Poisson arrivals but does not restrict the distribution of the processing time.

To extend the SMDP method to more general arrival processes, ﬁctitious decision epochs and additional state variables are required to maintain the semi-Markov property. For example, with MMPP arrivals, phase shifts must be included as decision epochs. For non-memoryless renewal arrivals, such as deterministic processes, the arrival points should be incorporated as decision epochs, and the remaining time of the arrival interval must be included in the state. Because of the decision epochs occurring during processing times, the remaining processing time must also be part of the state for both cases. Moreover, an extra state variable is needed to distinguish the exact event of the decision epoch.

Handling continuous state variables and addressing the curse of dimensionality caused by the enlarged state space are significant challenges. Therefore, utilizing SMDP with both general arrival intervals and processing times is quite difficult.

Let $\mathcal{B}(\mathcal{A})$ denote the set of probability distributions on Borel subsets of $\mathcal{A}$ . A Markovian decision rule $\bm{d}^{(m)}:\mathcal{S}\rightarrow\mathcal{B}(\mathcal{A})$ specifies the probabilities of taking each action at epoch $m$ . It is called Markovian since it relies solely on the current state $s_{m}$ for its decision-making. The decision rule is deterministic if it selects an action with probability 1. A policy $\bm{\pi}=\{\bm{d}^{(0)},\bm{d}^{(1)},\bm{d}^{(2)},\ldots\}$ is a sequence of decision rules. Furthermore, $\bm{\pi}$ is called stationary if $\bm{d}^{(m)}=\bm{d},\forall m\in\mathbb{N}$ .

Our goal is to find a policy that minimizes the long term average expected cost $g^{\bm{\pi}}(s_{0})$ , given $s_{0}\in\mathcal{S}$ at $t=0$ , which is

\min\limits_{\bm{\pi}}g^{\bm{\pi}}(s_{0})=\limsup\limits_{M\rightarrow\infty}\frac{{\mathbb{E}}^{\bm{\pi}}_{s_{0}}\Big{[}\sum_{m=0}^{M}c(s_{m},a_{m})\Big{]}}{{\mathbb{E}}^{\bm{\pi}}_{s_{0}}\Big{[}\sum_{m=0}^{M}\gamma_{m}\Big{]}}.

(14)

The objective (14) and the SMDP model $\mathcal{P}$ together constitute the SMDP problem. The objective focuses on the long-term average, and for every history-dependent policy, there exists an equivalent Markovian policy with the same objective value (referred to Theorem 8.1.2 in [43]). This implies that only Markovian policies need to be considered. Moreover, in this paper, we restrict our consideration to stationary deterministic policies. A stationary deterministic policy $\pi:\mathcal{S}\rightarrow\mathcal{A}$ is a function that maps the state space $\mathcal{S}$ to the action space $\mathcal{A}$ . This type of policy is concise and clear, helping to reduce the solution space. For instance, the static batching policy and the greedy batching policy are both stationary deterministic policies.

Definition 1.

A static batching policy with a parameter $b\in\mathcal{B}$ is denoted as $\pi^{b}_{\mathrm{static}}:\mathcal{S}\rightarrow\mathcal{A}$ , and is defined as follows:

\pi^{b}_{\mathrm{static}}(s)=\begin{cases}0&\quad s<b\\ b&\quad s\geq b\end{cases}.

(15)

Under such a policy $\pi^{b}_{\mathrm{static}}$ , the served batches have a constant batch size of $b$ .

The greedy batching policy is a representative dynamic batching policy, defined as follows:

Definition 2.

Define a greedy batching policy $\pi_{\mathrm{greedy}}:\mathcal{S}\rightarrow\mathcal{A}$ as

\pi_{\mathrm{greedy}}(s)=\max\{\min\{s,B_{\max}\},B_{\min}\}.

(16)

Under such a policy $\pi_{\mathrm{greedy}}$ , the system greedily serves batches with the current maximum allowable sizes.

The considered model $\mathcal{P}$ is an infinite state SMDP with non-negative, unbounded costs and finite action sets. The existence of an optimal stationary deterministic policy for such a model requires further discussions.

Proposition 1.

An average expected optimal stationary deterministic policy exists for the SMDP model $\mathcal{P}$ .

Proof:

See Appendix A. ∎

The equations corresponding to the optimal stationary deterministic policies are provided as follows.

Proposition 2.

Let $h:\mathcal{S}\rightarrow\mathbb{R}$ denote a value function, and let $g\in\mathbb{R}$ represent a scalar. Given the SMDP model $\mathcal{P}$ , the constant and functions $(g,h)$ such that

	$\displaystyle h(s)=\min_{a\in\mathcal{A}_{s}}\bigg{\{}c(s,a)-gy(s,a)+\sum_{j\in S}m(j\|s,a)h(j)\bigg{\}},$			(17)
	$\displaystyle\forall s\in\mathcal{S},$			(17)

are exactly the optimal average expected cost per unit time, and the corresponding relative value functions. The function $h$ is referred to as the relative value function since $h(s)$ is exactly the relative expected total cost when starting with state $s$ .

Consequently, the optimal stationary deterministic policy ${\pi}^{*}$ for the SMDP problem is given by

	$\displaystyle\pi^{*}(s)\in\underset{a\in\mathcal{A}_{s}}{\arg\min}\bigg{\{}c(s,a)-gy(s,a)+\sum_{j\in S}m(j\|s,a)h(j)\bigg{\}},$
	$\displaystyle\forall s\in\mathcal{S}.$

Equations (17) are referred to as the optimality equations for the SMDP problem with $\mathcal{P}$ and the average objective.

Proof:

See Appendix B. ∎

Through Proposition 1, the existence of an optimal stationary deterministic policy for this problem is ensured. We try to acquire such an optimal policy by solving the optimality equations (17) in the next section.

V Solving the Infinite State SMDP Problem

Iteration-based algorithms, such as value iteration and policy iteration, are widely used to solve the optimality equations for most common discrete-time, finite-state and discounted Markov decision process (MDP) problems. However, the standard procedure is not readily applied to our problem, which is a continuous-time SMDP with an infinite state space and a long-term average objective.

We solve this problem in three steps. In Section V-A, we approximate the infinite state space by a finite state space through “tail” state aggregation. In Section V-B, we transform the finite state SMDP to an equivalent discrete time MDP. Finally in Section V-C, we use the relative value iteration (RVI) algorithm to solve the average-cost MDP problem.

V-A Finite State Approximation

The SMDP problem has infinite states in $\mathcal{S}=\{0,1,2,\ldots\}$ , and is impractical to be solved by numerical methods. Hence, we truncate the infinite state space to a finite state space $\mathcal{\hat{S}}=\{0,1,\ldots,s_{\max},S_{\mathrm{o}}\}$ , which replaces the states larger than $s_{\max}$ by an “overflow” state $S_{\mathrm{o}}$ . In other words, the “overflow” state $S_{\mathrm{o}}$ is an aggregation of the “tail” states $\mathcal{S_{\mathrm{tail}}}=\{s_{\max}+1,s_{\max}+2,\ldots\}$ . The dimension of the finite state space is $|\mathcal{\hat{S}}|=s_{\max}+2$ , where $s_{\max}$ needs to be no less than $B_{\max}$ . The rationale of the state space truncation is that the tail probability, defined as the probability of being in the “tail” states $\mathcal{S_{\mathrm{tail}}}$ , decreases with $s_{\max}$ . When $s_{\max}$ is large enough, the “tail” states are negligible.

In the truncated model, the action space $\mathcal{A}$ , the sojourn time distribution $\Gamma_{s,a}(\cdot)$ , and the expected sojourn time $y(s,a)$ are the same as before, while the feasible action space at state $S_{\mathrm{o}}$ is $\mathcal{A}_{S_{\mathrm{o}}}\equiv\mathcal{A}$ since $s_{\max}\geq B_{\max}$ . Original transitions to the “tail” states $\mathcal{S_{\mathrm{tail}}}$ are aggregated to $S_{\mathrm{o}}$ , and we assume the number of requests at $S_{\mathrm{o}}$ is $s_{\max}$ . The adapted transition probability $\hat{m}(j|s,a)$ for $\forall j,s\in\mathcal{\hat{S}},\forall a\in\mathcal{A}_{s}$ is

		$\displaystyle\hat{m}(j\|S_{\mathrm{o}},a)=$		(18)
		$\displaystyle\begin{cases}p_{j-s_{\max}+a}^{[a]}&j\geq s_{\max}-a,j\neq S_{\mathrm{o}},a\neq 0\\ 1-\sum\limits_{i=0}^{a}p_{i}^{[a]}&j=S_{\mathrm{o}},a\neq 0\\ 1&j=S_{\mathrm{o}},a=0\\ 0&\text{ otherwise }\end{cases},$
		$\displaystyle\hat{m}(j\|s,a)\;(s\neq S_{\mathrm{o}})=$
		$\displaystyle\begin{cases}p_{j-s+a}^{[a]}&j\geq s-a,j\neq S_{\mathrm{o}},a\neq 0\\ 1-\sum\limits_{i=0}^{s_{\max}-s+a}p_{i}^{[a]}&j=S_{\mathrm{o}},a\neq 0\\ 1&j=s+1,s<s_{\max},a=0\\ 1&j=S_{\mathrm{o}},s=s_{\max},a=0\\ 0&\text{ otherwise }\end{cases}.$

The unbounded holding cost induced by the infinite states in the primal problem is also erased by the truncation. Therefore, we introduce an abstract cost $c_{\mathrm{o}}y(s,a)\;(c_{\mathrm{o}}\geq 0)$ to the “overflow” state $S_{\mathrm{o}}$ , working as an estimation of the difference between the expected holding cost at “tail” states and the holding cost at $s_{\max}$ . The adapted cost $\hat{c}(s,a)$ is

\begin{aligned} \hat{c}(s,a)=\begin{cases}c(s_{\max},a)+c_{\mathrm{o}}y(s,a)\quad s=S_{\mathrm{o}}\\ c(s,a)\quad\quad\quad\quad\;s\neq S_{\mathrm{o}},s\in\mathcal{\hat{S}}\end{cases},\end{aligned}\forall a\in\mathcal{A}_{s}.

(19)

Since $\rho\in[0,1)$ , the optimal policy must stabilize the system. The abstract cost can also be interpreted as an overflow punishment, which pushes the optimal policy away from causing overflow. Note that the abstract cost $c_{\mathrm{o}}y(s,a)$ in (19) is rarely mentioned in the literature, without which the problem can be solved as well, but leading to a larger satisfactory $s_{\max}$ and higher computational complexity in iteration algorithms (which will be discussed in Section VII-D).

Let $\mathcal{\hat{P}}\triangleq\{\mathcal{T},\mathcal{\hat{S}},\mathcal{A}_{s},\hat{m}(\cdot|s,a),\Gamma_{s,a}(\cdot),\hat{c}(s,a)\}$ denote the finite state SMDP model, and the optimality equations for the finite-state average-cost SMDP problem are

\displaystyle\hat{h}(s)=\min_{a\in\mathcal{A}_{s}}\bigg{\{}\hat{c}(s,a)-\hat{g}y(s,a)+\sum_{j\in\mathcal{\hat{S}}}\hat{m}(j|s,a)\hat{h}(j)\bigg{\}},

(20)

for $\forall s\in\mathcal{\hat{S}}$ . Denote $\hat{g}_{*}$ as the optimal average expected cost.

Given a stationary deterministic policy as a function $\hat{\pi}:\mathcal{\hat{S}}\rightarrow\mathcal{A}$ , the corresponding state transition matrix is $P_{\hat{\pi}}\in\mathbb{R}^{{|\mathcal{\hat{S}}|}\times{|\mathcal{\hat{S}}|}}$ . Suppose that the Markov chain with $P_{\hat{\pi}}$ has a unique stationary distribution $\bm{\mu}=(\mu_{0},\mu_{1},\ldots,\mu_{S_{\mathrm{o}}})$ , where $\mu_{s}$ is the stationary probability at state $s\in\mathcal{\hat{S}}$ . Then the average expected cost per unit time is

\hat{g}^{\hat{\pi}}=\frac{\sum_{s\in\mathcal{\hat{S}}}\mu_{s}\cdot\hat{c}(s,\hat{\pi}(s))}{\sum_{s\in\mathcal{\hat{S}}}\mu_{s}\cdot y(s,\hat{\pi}(s))}.

(21)

We establish a criterion for assessing the approximation based on the difference in average cost under stabilizing policies: Let $\Delta^{\hat{\pi}}$ represent the average expected cost contributed by $S_{\mathrm{o}}$ per unit time under policy $\hat{\pi}$ :

\Delta^{\hat{\pi}}=\frac{\mu_{S_{\mathrm{o}}}\cdot\hat{c}(S_{\mathrm{o}},\hat{\pi}(S_{\mathrm{o}}))}{\sum_{s\in\mathcal{\hat{S}}}\mu_{s}\cdot y(s,\hat{\pi}(s))}.

(22)

It is important to note that for a policy that stabilizes the system, the average cost contributed by the “tail” states should asymptotically decrease to zero as $s_{\max}$ increases. Therefore, given a predefined constant $\delta>0$ , if $\Delta^{\hat{\pi}}<\delta$ , we consider the approximation acceptable with tolerance $\delta$ . If $\Delta^{\hat{\pi}}\geq\delta$ , we conclude that the approximation is not acceptable with tolerance $\delta$ , and a larger $s_{\max}$ should be selected.

V-B Associated Discrete-Time MDP

The finite state continuous-time SMDP is associated with a discrete-time MDP through a “discretization” transformation (see Section 11.4 of [43]). The time slots are denoted by $\tilde{\mathcal{T}}=\{0,1,2,\ldots\}$ . The state space $\mathcal{\hat{S}}$ , the action space $\mathcal{A}$ and the feasible action space $\mathcal{A}_{s}$ for any $s\in\mathcal{\hat{S}}$ keep unchanged in the transformed model. The transformed cost $\tilde{c}(s,a)$ and the transformed transition probability $\tilde{m}(j|s,a)$ for $\forall j,s\in\mathcal{\hat{S}},\forall a\in\mathcal{A}_{s}$ are

	$\displaystyle\tilde{c}(s,a)\triangleq$	$\displaystyle\;\hat{c}(s,a)/y(s,a),$		(23)
	$\displaystyle\tilde{m}(j\|s,a)\triangleq$	$\displaystyle\begin{cases}\eta\hat{m}(j\|s,a)/y(s,a)&j\neq s\\ 1+\eta[\hat{m}(s\|s,a)-1]/y(s,a)&j=s\end{cases},$		(23)

where $\eta$ satisfies

0<\eta<y(s,a)/(1-\hat{m}(s|s,a)),

(24)

for all $a\in A_{s}$ and $s\in\mathcal{\hat{S}}$ for which $\hat{m}(s|s,a)<1$ .

By (9) and (18), $\eta$ should satisfy

0<\eta<\min\Bigg{\{}\frac{1}{\lambda},\min_{a\in\mathcal{A},a\neq 0}\bigg{\{}\frac{1}{{\mu}^{[a]}(1-p_{a}^{[a]})},\frac{1}{{\mu}^{[a]}\sum\limits_{i=0}^{a}p_{i}^{[a]}}\bigg{\}}\Bigg{\}}.

(25)

And from experiments we find that the larger the $\eta$ is, the faster the value-based iteration algorithm converges.

The discrete-time MDP model can be denoted by $\tilde{\mathcal{P}}\triangleq\{\tilde{\mathcal{T}},\mathcal{\hat{S}},\mathcal{A}_{s},\tilde{m}(\cdot|s,a),\tilde{c}(s,a)\}$ . The transformation (23) serves to standardize costs to a unit time basis, and then adjust the transition structure to align the long-run average cost of the discrete model $\tilde{\mathcal{P}}$ with that of the SMDP model $\mathcal{\hat{P}}$ (refer to Section 11.5.1 in [43] for additional insights into this conversion).

For the average cost MDP problem with $\tilde{\mathcal{P}}$ , the optimality equations are

\tilde{h}(s)=\min_{a\in{\mathcal{A}}_{s}}\bigg{\{}\tilde{c}(s,a)-\tilde{g}+\sum_{j\in\mathcal{\hat{S}}}\tilde{m}(j|s,a)\tilde{h}(j)\bigg{\}},\forall s\in\mathcal{\hat{S}}.

(26)

According to Proposition 11.4.5 in [43], if $(\tilde{g},\tilde{h})$ satisfies the discrete-time optimality equations in (26), then $(\tilde{g},\eta\tilde{h})$ satisfies (20). Let $\tilde{g}_{*}$ represent the optimal average expected cost in the MDP problem. Then, $\tilde{g}_{*}=\hat{g}_{*}$ is equivalent to the optimal average expected cost per unit time in the continuous-time SMDP problem. Therefore, the optimal stationary policy $\tilde{{\pi}}^{*}$ for the MDP problem, given by

\tilde{{\pi}}^{*}(s)\in\underset{a\in\mathcal{A}_{s}}{\arg\min}\bigg{\{}\tilde{c}(s,a)-\tilde{g}+\sum_{j\in\mathcal{\hat{S}}}\tilde{m}(j|s,a)\tilde{h}(j)\bigg{\}},\forall s\in\mathcal{\hat{S}},

is also optimal for the finite state SMDP problem (in section V-A). The existence of a solution to (26) is established through Theorem 8.4.3 in [43].

V-C Relative Value Iteration

We utilize the value-based iteration algorithm to solve the optimality equations (26) of the discrete-time MDP problem. Specifically, for average-cost MDP problems, the standard value iteration is numerically unstable, so we use the relative value iteration algorithm instead[43].

Let $\mathcal{V}\triangleq\mathbb{R}^{|\mathcal{\hat{S}}|}$ denote the space of value functions. For any value function $h\in\mathcal{V}$ , the exact Bellman operator is $\mathcal{L}\;:\;\mathcal{V}\rightarrow\mathcal{V}$ , defined as

(\mathcal{L}h)(s)\triangleq\min\limits_{a\in A_{s}}\bigg{\{}\tilde{c}(s,a)+\sum\limits_{j\in\mathcal{\hat{S}}}\tilde{m}(j|s,a)h(j)\bigg{\}}.

(27)

The span of a value function $h\in\mathcal{V}$ is defined as

\text{span}(h)\triangleq\max\limits_{s\in\mathcal{\hat{S}}}h(s)-\min\limits_{s\in\mathcal{\hat{S}}}h(s).

(28)

Let $H_{i}$ and $J_{i}$ be value functions that iterate with $i$ , and we describe the relative value iteration algorithm in Algorithm 1.

Algorithm 1 Relative Value Iteration (RVI)

0: A small positive number

\epsilon>0

0: A stationary deterministic policy

\tilde{\pi}_{\epsilon}:\mathcal{S}\rightarrow\mathcal{A}

step 1: Set

i=0

, and

H_{i}(s)=J_{i}(s)=0

for all

s\in\mathcal{\hat{S}}

. Choose an arbitrary state

s^{*}\in\mathcal{\hat{S}}

step 2: For all

s\in\mathcal{\hat{S}}

, compute

J_{i+1}(s)=(\mathcal{L}H_{i})(s)

, expressed as

J_{i+1}(s)=\min\limits_{a\in A_{s}}\bigg{\{}\tilde{c}(s,a)+\sum\limits_{j\in\mathcal{\hat{S}}}\tilde{m}(j|s,a)H_{i}(j)\bigg{\}}.

(29)

step 3: For all

s\in\mathcal{\hat{S}}

, compute

H_{i+1}(s)=J_{i+1}(s)-J_{i+1}(s^{*}).

step 4: If

\text{span}(H_{i+1}-H_{i})<\epsilon

, then for all

s\in\mathcal{\hat{S}}

compute

\tilde{\pi}_{\epsilon}(s)\in\underset{a\in\mathcal{A}_{s}}{\arg\min}\bigg{\{}\tilde{c}(s,a)+\sum\limits_{j\in\mathcal{\hat{S}}}\tilde{m}(j|s,a)H_{i}(j)\bigg{\}}.

Otherwise increment

i

by 1, and return to step 2.

Note that in each iteration, $H_{i}$ is the renormalized form of $J_{i}$ by subtracting a common $J_{i}(s^{*})$ from each $J_{i}(s)$ . This helps prevent the divergence of value functions in ordinary value iteration, but it does not affect the minimizing actions or the value of $\text{span}(H_{i+1}-H_{i})$ . The termination of iteration is triggered when $\text{span}(H_{i+1}-H_{i})$ becomes smaller than a predefined constant $\epsilon>0$ . According to Proposition 6.6.1 in [43], the Bellman operator $\mathcal{L}$ is a contraction operator over the span of (28). Therefore, the iteration algorithm is guaranteed to terminate. Moreover, it can be proven that within Algorithm 1, the value function $H_{i}$ asymptotically converges to the optimal value function as $i\rightarrow\infty$ . The resulting policy $\tilde{\pi}_{\epsilon}$ is an $\epsilon$ -optimal policy. In other words, the average expected cost associated with policy $\tilde{\pi}_{\epsilon}$ , denoted as $\tilde{g}_{\tilde{\pi}_{\epsilon}}$ , satisfies $\tilde{g}_{\tilde{\pi}_{\epsilon}}-\tilde{g}_{*}<\epsilon$ . Detailed proof for this can be found in Section 8.5.5 of [43].

The computational complexity of Algorithm 1 is discussed as follows. Suppose the total number of iterations is $n$ . It is important to note that $s_{\max}\geq B_{\max}$ , and in most cases, $s_{\max}$ is significantly larger than $B_{\max}$ . Also, please note that for $s\geq B_{\max}$ , the feasible action space can be represented as $\mathcal{A}_{s}\equiv\mathcal{A}\equiv\{0\}\cup\mathcal{B}$ . Now, we break down the computational complexity: The number of multiplications per iteration is approximately $\sum\limits_{s\in\mathcal{\hat{S}}}|\mathcal{A}_{s}||\mathcal{\hat{S}}|\approx B_{\max}s_{\max}^{2}$ . The number of additions per iteration is approximately $\sum\limits_{s\in\mathcal{\hat{S}}}|\mathcal{A}_{s}||\mathcal{\hat{S}}|+|\mathcal{\hat{S}}|\approx B_{\max}s_{\max}^{2}$ . As for space complexity, it is primarily determined by $\tilde{c}(s,a)$ and $\tilde{m}(j|s,a)$ . The storage required for $\tilde{c}(s,a)$ contributes to a space complexity of approximately $\sum\limits_{s\in\mathcal{\hat{S}}}|\mathcal{A}_{s}|\approx B_{\max}s_{\max}$ . Referring to (18), the storage needed for $\tilde{m}(j|s,a)$ simplifies to the storage of $p_{i}^{[a]}$ , resulting in a space complexity of approximately $\sum\limits_{a\in\mathcal{B}}(s_{\max}+a+1)\approx B_{\max}s_{\max}$ . Consequently, the overall space complexity is $\mathcal{O}(B_{\max}s_{\max})$ , and the time complexity is $\mathcal{O}(nB_{\max}s^{2}_{\max})$ .

It should be noted that the state space of the computed policy $\tilde{\pi}_{\epsilon}$ is $\mathcal{\hat{S}}$ , but the ultimate goal is to derive a policy that maps from the infinite state space $\mathcal{S}$ to the action space $\mathcal{A}$ . Therefore, given the policy $\tilde{\pi}_{\epsilon}:\mathcal{\hat{S}}\rightarrow\mathcal{A}$ , we can define its corresponding policy in the original infinite state space as ${\pi}_{\epsilon}\;:\;\mathcal{S}\rightarrow\mathcal{A}$ , using the following equation:

\pi_{\epsilon}(s)\triangleq\begin{cases}\tilde{\pi}_{\epsilon}(s)&\quad s\leq s_{\max}\\ \tilde{\pi}_{\epsilon}(s_{\max})&\quad s>s_{\max}\end{cases}.

(30)

Here, the actions for “tail” states $\mathcal{S_{\mathrm{tail}}}=\{s_{\max}+1,s_{\max}+2,\ldots\}$ are assigned the same action as for the state $s_{\max}$ .

In summary, the RVI algorithm guarantees the derivation of a stationary deterministic policy $\tilde{\pi}_{\epsilon}$ , which is $\epsilon$ -optimal for the discrete-time MDP problem discussed in Section V-B. Additionally, thanks to the benefits of the “discretization” transformation, $\tilde{\pi}_{\epsilon}$ maintains its $\epsilon$ -optimality for the finite state SMDP problem introduced in Section V-A. When considering the performance of ${\pi}_{\epsilon}$ in the original infinite state SMDP problem (as presented in Section IV), it is closely tied to the impact of finite state approximation. On one hand, with a larger $s_{\max},$ the approximation to the original infinite state SMDP becomes more accurate, enhancing the performance of ${\pi}_{\epsilon}$ in the original problem. On the other hand, a larger $s_{\max}$ increases the computational complexity of RVI. Therefore, as defined in Section V-A, we can set a tolerance value $\delta$ for approximation. Choosing an $s_{\max}$ as small as possible is preferred in terms of complexity, as long as the resulting $\tilde{\pi}_{\epsilon}$ satisfies $\Delta^{\tilde{\pi}_{\epsilon}}<\delta$ .

VI Optimal Policy in Special Cases

In the previous section, we introduced a general approach to address the formulated SMDP problem. However, in existing literature, specific properties of optimal policies have been discussed for certain special cases. Research studies[33, 47] have shown that in scenarios with size-independent service times, optimal policies exhibit a threshold-based structure known as a Q-policy or control limit policy, given certain assumptions.

The concept of the control limit policy is explained as follows:

Definition 3.

Define a stationary deterministic policy $\pi^{Q}:\mathcal{S}\rightarrow\mathcal{A}$ as

\pi^{Q}(s)=\begin{cases}0&\quad s<Q\\ \min(s,B_{\max})&\quad s\geq Q\end{cases},

(31)

with a parameter $Q\in\mathbb{N}_{+}$ . Under such a policy $\pi^{Q}$ , a batch service of $\min(s,B_{\max})$ requests will be initiated if and only if the number of awaiting requests at a review point, $s$ , exceeds the threshold $Q$ .

This policy is called a Q-policy[48], or a control limit policy[49] with the threshold $Q$ defined as the control limit.

The necessary assumptions for the specific cases discussed in this section are listed as follows:

Assumption 1.

Service times are independent and identically distributed (i.i.d.), irrespective of the batch size. In other words, $l(b)\equiv l$ with $l>0$ independent of $b$ , and ${\mu}^{[b]}\equiv\mu\equiv\frac{1}{l}$ is independent of $b$ as well.

Assumption 2.

The minimum batch size is $1$ , i.e., $B_{\min}=1$ .

Assumption 3.

The energy consumed in a batch service is a linear function of $b$ , i.e., $\zeta(b)=\beta b+\zeta_{0}$ , with $\beta>0$ and $\zeta_{0}\geq 0$ .

The conclusion regarding the structure of the optimal policy for the specific scenario is as follows.

Proposition 3.

If Assumptions 1-3 hold, then there exists a positive integer $Q$ , $1\leq Q\leq B_{\max}$ , such that the associated control limit policy ${\pi}^{Q}$ is an average expected optimal policy for the SMDP model $\mathcal{P}$ .

Proof:

See Appendix C. ∎

Assumption 4.

Service times follow an exponential distribution, i.e., $G_{b}(x)=1-e^{-{\mu}^{[b]}x}$ .

In a more special case with exponential service time, the optimal $Q$ value can be computed in the following way:

Proposition 4 (Refer to Section 6 in [33]).

Assume that Assumptions 1-4 hold. Combining Assumption 1 and Assumption 4, we have $G_{b}(x)\equiv G(x)\equiv 1-e^{-\mu x}$ , with mean $1/{\mu}\equiv l$ . Let $\psi=\lambda/(\lambda+\mu)$ , and let $\xi(0<\xi<1)$ denote the unique solution of

(1-\psi)\xi^{B_{\max}+1}-\xi+\psi=0.

Let $\chi=\lambda/\mu$ , $r=\xi/(1-\xi)$ , and

D_{q}=q\{\frac{1}{2}(q+1)+\chi-r\}-r^{2}\xi^{q}+r(r-\chi)-\frac{w_{2}\zeta_{0}\lambda^{2}}{w_{1}}.

(32)

Then, the optimal $Q$ is the smallest positive integer $q\leq B_{\max}$ for which $D_{q}\geq 0$ . If there is no positive $q\leq B_{\max}$ such that $D_{q}\geq 0$ , the optimal $Q$ is $B_{\max}$ .

Collary 1.

Assuming that Assumptions 1-4 are satisfied, if either $w_{2}=0$ or $\zeta_{0}=0$ , the optimal value of $Q$ is only influenced by $\chi=\lambda/\mu$ and $B_{\max}$ .

Proof:

See Appendix D. ∎

It is important to note that the general case under Assumptions 1-3 is intractable, which means that the optimal threshold $Q$ cannot be obtained through explicit computation. In such cases, a linear search approach can be employed to assess the policy performance of various potential $Q$ values, thereby allowing us to identify the optimal threshold $Q$ .

VII Numerical Results

In numerical experiments, we take the GoogLeNet inference on a TESLA P4 as the basic scenario[7]. As depicted in Fig. 2, the latency and energy functions, which were fitted from the empirical data[7], are $l(b)=0.3051b+1.0524$ ms and $\zeta(b)=19.899b+19.603$ mJ, respectively. Since the processing time for such image recognition tasks is almost deterministic[12], the Laplace transform of the service time is $\tilde{g}_{b}(x)=e^{-l(b)x}$ . The minimum batch size $B_{\min}$ for the ML inference task is $1$ , and the maximum batch size $B_{\max}$ is set to $32$ by default. This basic scenario with deterministic processing times and linear latency and energy functions is acknowledged as a representative case for ML inference serving[12, 29, 8].

We conduct experiments under varying values of $\rho$ and $w_{2}$ . It is important to note that $\rho\in(0,1)$ represents the “normalized” traffic intensity, calculated as the ratio of the absolute traffic intensity (arrival rate) $\lambda$ to the maximum service rate $B_{\max}\mu^{[B_{\max}]}$ . This ratio serves as a measure of the system load. The weight $w_{2}$ reflects the importance of power consumption in the overall objective, with $w_{1}$ fixed at $1$ .

VII-A SMDP Solution

In this subsection, we visualize the SMDP solutions under various parameter sets, as illustrated in Fig. 3. We construct three scenarios with processing times independent of the batch size, named Cases 1-3, based on the basic scenario. The maximum batch size $B_{\max}$ is set as 8, for convenience of visualization. The depicted solutions are the converged results (which remain consistent with increased $s_{\max}$ ) obtained using the procedure in Section V. Each horizontal block in Fig. 3 corresponds to one scenario, and solutions are obtained under different $\rho$ and $w_{2}$ . The charts of policies are placed from left to right for increasing $w_{2}$ . Each row in the chart is a stationary deterministic policy under a certain $\rho$ , where each element denotes the action taken at the state corresponding to the column.

From Fig.3, it can be observed that for Cases 1-3, where Assumptions 1-3 hold true, all SMDP solutions exhibit a control limit structure, with the control limits highlighted by pink boxes. This finding concurs with the conclusions in Proposition 3. Under control limit policies, the system does not serve until the state exceeds a threshold, known as a “control limit”. Once the state exceeds this control limit, the system serves a maximum available batch of requests. It can be seen that the control limit increases with $w_{2}$ . When $w_{2}$ is as large as $100$ , the control limits under different traffic intensities are all $B_{\max}$ . This is reasonable because the importance of power consumption grows with $w_{2}$ , and the energy is better saved with a larger batch size.

For Case 2 and Case 3 that satisfy Assumptions 1-4, the optimal control limits can be explicitly calculated using Proposition 4. We observe that the control limits $Q$ of the obtained SMDP solutions are in alignment with the directly calculated results, which also validates the effectiveness of the proposed general solving procedure. It is further observed that when $[w_{1},w_{2}]=[1,0]$ , the SMDP solutions in Case 3 are exactly the same as those in Case 2. This can be explained by Collary 1: when $w_{2}=0$ , the control limits are solely influenced by $B_{\max}$ and $\chi=\rho B_{\max}$ , which means that they are only influenced by values $B_{\max}$ and $\rho$ . Moreover, when $w_{2}\neq 0$ , it can be seen that the control limits in Case 3 are equal to or larger than those in Case 2. Note that the batch service rate in Case 2 is $\theta(b)=\frac{b}{2.4252}$ , while in Case 3 it is $\theta(b)=\frac{b}{1.7465}$ . As a result, Case 3 offers a greater marginal benefit from increasing the batch size compared to Case 2, leading to its control limits no less than those of Case 2.

Furthermore, upon examining the solutions in a broader set of cases (see Appendix E), we have observed that in more general situations with characteristics such as size-dependent batch service time, a minimum batch size greater than 1, or a nonlinear energy consumption function, the control limit structure may not be applicable or maintained.

VII-B Performance Comparisons

In this subsection, we compare the performance of the obtained SMDP solutions with other benchmark batching policies. The benchmark policies encompass the greedy batching policy, as well as static batching policies with batch sizes of $b=8,16,32$ . Under the greedy batching policy, the server processes the largest feasible batch of current requests. In static batching policies, the server consistently processes batches in a fixed batch size and waits for new incoming requests if there are insufficient requests. The static batching policy with $b=32$ represents a special case known as the maximum batching policy in this context, where $B_{\max}=32$ .

In what follows, we first showcase that the SMDP-based policies always yield the lowest average cost compared to other benchmark policies. Then, we present the two-dimensional figures illustrating latency and energy measurements, highlighting the superior performance of SMDP solutions from a Pareto perspective.

Furthermore, we demonstrate that the SMDP-based policy can enhance the satisfaction of delay requirements, as it produces a lighter-tailed distribution.

VII-B1 Overall Objective

We first compare the overall objective, namely the average cost per unit time, under three levels of traffic intensities, $\rho=0.1,0.3$ and $0.7$ , as shown in Fig. 4. The weight for the latency term, $w_{1}$ , is fixed at $1$ , while the weight for the power consumption term, $w_{2}$ , ranges from $0$ to $15$ . The objectives are computed using Eq.(21). It is observed that the SMDP solutions always achieve the lowest (best) average cost per unit time among all policies under various parameter settings.

When $w_{2}$ is close to zero, the objective primarily focuses on latency. In such cases, we observe that the cost of the greedy batching policy is close to that of the SMDP-based policy. Meanwhile, the costs associated with static batching policies are higher and increase with the batch size $b=8,16,32$ , with the maximum batching policy incurring the highest (worst) cost. This suggests that the latency introduced by serving with larger batches is comparable or even greater than the latency saved by increased batch service rate. When $w_{2}$ reaches a large value, the objective is primarily influenced by power consumption. In such cases, it is observed that the maximum batching policy yields nearly the lowest cost, approaching that of the SMDP solution. (Unfortunately, for $\rho=0.1$ , a value of $w_{2}=15$ is not large enough to observe this phenomenon.) This observation is consistent with the results in Fig. 3. Meanwhile, in such cases, the greedy batching policy works poorly and incurs a much higher cost than other policies, due to its limited parallelism. In most common scenarios, the weighted average cost is not dominated by a single term, and the other two static batching policies ( $b=8,16$ ) can achieve a proper balance between latency and energy, thus approaching the SMDP solution under certain parameter scales.

The comparison of the overall objective has two main limitations: (1) The overall objective lacks a unified metric and is not sufficiently informative, as it combines power consumption and latency through a weighted sum. (2) The value of the overall objective can become infinitely large as $w_{2}$ increases, resulting in a wide range that is difficult to visualize, especially under high load conditions. Therefore, we will analyze the objective factors separately in what follows.

VII-B2 Objective Pairs

We now focus on the two factors in the multi-objective optimization (as formulated in Eq.(2)): (a) Average request response time, or long-term average latency, denoted as $\overline{W}$ ; (b) Average power consumption, denoted as $\overline{P}$ , calculated by dividing the long-term energy consumption by the time period. The unit of average power consumption is measured in Watt (W). Additionally, we introduce average energy efficiency, representing the average number of requests processed per Joule of energy (calculated by ${\lambda}/{\overline{P}}$ ), as an alternative measure of power consumption.

By fixing $w_{1}=1$ and varying $w_{2}$ , different SMDP solutions can be obtained through the proposed scheme. Accordingly, a set of $(\overline{W},\overline{P})$ pairs is acquired. The latency-power consumption tradeoff curves for these $(\overline{W},\overline{P})$ pairs under $\rho=0.3$ and $\rho=0.7$ are illustrated in Fig. 5LABEL:sub@fig:data_3_tradeoff. It can be seen that as the weight for power consumption, $w_{2}$ , increases, the average power consumption decreases while the average request response time increases, thus forming the latency-energy tradeoff. After acquiring this curve, an appropriate weight can be selected according to the requirements. For example, if the average request response time is required to be less than 5 ms when $\rho=0.3$ , the maximum weight whose corresponding SMDP solution meets this requirement should be selected, which is $w_{2}=1.3$ in this case. By selecting the weight and batching policy in this manner, the least power consumption is achieved while satisfying the latency requirement. The method for choosing an appropriate $w_{2}$ under a power constraint is similar.

Fig. 5LABEL:sub@fig:case6_tradeoff illustrates the $(\overline{W},\overline{P})$ pairs of different policies under various load conditions. Furthermore, Fig. 5LABEL:sub@fig:case6_tradeoff_ee depicts the interplay between latency and energy efficiency. By varying the values of $(w_{1},w_{2})$ , the solutions obtained from the weighted-objective SMDP demonstrate a flexible balance between latency and energy, forming the tradeoff curves. In contrast, the objective pairs of benchmark policies are represented as separate points that remain unchanged with the weights. Moreover, it can be seen from Fig. 5LABEL:sub@fig:case6_tradeoff (or Fig. 5LABEL:sub@fig:case6_tradeoff_ee) that the SMDP objective pairs are positioned to the lower (or upper) left of those of the benchmark policies, indicating that the SMDP solutions consistently outperform other benchmark policies in a Pareto-optimal sense.

The latency-power consumption pairs associated with the maximum batching policy precisely correspond to the right endpoints of the SMDP’s tradeoff curves. This observation aligns with our findings in Fig. 3 and Fig. 4 under conditions where $w_{2}$ is large. This is reasonable, given that the maximum batching policy exhibits the highest energy efficiency among all policies. There is no alternative policy that can achieve equal or lower power consumption with a smaller latency. The latency-power consumption pairs associated with the greedy batching policy are situated near the left endpoints of SMDP’s tradeoff curves. It is essential to highlight that, although not evident, the greedy batching policy is at a slight disadvantage compared to the SMDP solution. For instance, when $\rho=0.7$ , magnifying the plot around the left corner of the SMDP’s tradeoff reveals that the objective pair $(\overline{W},{\overline{P}})$ of the greedy batching policy (the blue triangle) is in the upper right relative to the objective pair of an SMDP solution (an orange dot).

Several points of static batching policies with $b=8$ and $b=16$ are close to the SMDP’s tradeoff curve, indicating that these policies can effectively approximate SMDP solutions under specific parameters. This observation aligns with the findings in Fig. 4. However, in certain cases, the superiority of the SMDP-based policy over static batching becomes evident. For example, when $\rho=0.7$ , the latency-power consumption (or energy efficiency) pair of static batching with $b=8$ is positioned above (or below) the SMDP curve, implying higher power consumption (or lower energy efficiency) compared to an SMDP-based policy with equal latency. Furthermore, the static batching policy with $b=8$ fails to stabilize the system when $\rho\geq 0.8$ . Similarly, when $\rho=0.9$ , static batching with $b=16$ results in significantly longer latency compared to the SMDP solutions.

VII-B3 Latency Distribution and Percentile Analysis

In real applications, the average request response time studied in our formulated framework may not be the primary concern. Instead, the service level objective (SLO) usually specifies that the request response time at a certain percentile must meet some latency bound. Therefore, we conduct simulations and subsequently analyze the distribution and percentiles of latency.

Fig. 6LABEL:sub@cdf_case6_rho7 demonstrates the empirical CDFs of request response time for different batching policies under a load condition of $\rho=0.7$ , with each CDF based on $1.66\times 10^{6}$ latency data points. A CDF positioned further to the left represents a better policy, as it achieves lower latency for a higher proportion of requests. Therefore, the latency performance of the benchmark policies, from best to worst, is ranked as follows: greedy batching, static batching with $b=8$ , static batching with $b=16$ , and maximum batching. The CDFs of the selected SMDP solutions with $w_{2}=1.6$ and $w_{2}=2.2$ intersect the CDF of static batching with $b=8$ with an upward crossing at the intersection points. This indicates that the CDFs of these SMDP solutions are lighter-tailed, and at percentiles beyond the intersection points, the request response times are shorter than that of static batching with $b=8$ .

TABLE I: Comparison of average power consumption and request response times at different percentiles under

\rho=0.7

Policy	Static Batching ( $b=8$ )	SMDP Solution ( $w_{2}=1.6$ )	SMDP Solution ( $w_{2}=2.2$ )
$\overline{P}$ [W]	$46.27$	$\bm{44.96}$	$\bm{44.41}$
$\overline{W}$ [ms]	$6.85$	$6.90$	$7.81$
$W$ : 50th percentile [ms]	$6.51$	$6.83$	$7.72$
$W$ : 90th percentile [ms]	$9.85$	$\bm{9.23}$	$10.45$
$W$ : 95th percentile [ms]	$11.34$	$\bm{9.96}$	$\bm{11.24}$

Table I presents more comprehensive data including average power consumption, average request response time, and request response times at the 50th, 90th, and 95th percentiles for static batching with $b=8$ , as well as SMDP solutions with $w_{2}=1.6$ and $w_{2}=2.2$ . While both SMDP solutions achieve lower power consumption, they result in longer average response times compared to static batching with $b=8$ . However, the SMDP solution with $w_{2}=1.6$ improves the request response times at the 90th and 95th percentiles compared to static batching with $b=8$ . Similarly, the SMDP solution with $w_{2}=2.2$ provides a shorter request response time at the 95th percentile compared to static batching with $b=8$ .

Suppose the SLO specifies that the 95th percentile of request response time must be less than 10 ms. Then, we can either obtain and plot the data pairs of $($ latency at the 95th percentile $,\overline{P})$ under various weights, as illustrated in Fig. 6LABEL:sub@L95_case6_rho7, or plot the data pairs of $($ satisfaction percentage for the 10 ms constraint $,\overline{P})$ , as shown in Fig. 6LABEL:sub@Lless10_case6_rho7. Both exhibit tradeoff trends similar to those in Fig. 5LABEL:sub@fig:data_3_tradeoff. Therefore, the maximum (or minimum) weight $w_{2}$ should be selected such that the corresponding SMDP solution results in a 95th percentile request response time of less than 10 ms (or a satisfaction percentage for the 10 ms constraint greater than 95%). This ensures that the SLO constraint is met while minimizing the power consumption.

VII-C Performance Comparisons in Other Settings

All experiments in Section VII-B are conducted in the default configuration. Therefore, in this subsection, we study some other typical cases and demonstrate the comparison results in these settings.

VII-C1 Stronger Batching Effect in Batch Service Rate

In this part, we modify the mean batch service time function of the default scenario to a constant value of $l(b)=6.0859$ ms, representing ideal parallelism. Scenarios involving running inference models on powerful processors can approximate this ideal batching, such as the inference of InceptionV2 with float16 on a Titan V with batch sizes up to fifty[36]. As shown in Fig. 7LABEL:sub@case1_setting, whereas the default batch service rate increases sub-linearly with the batch size $b$ , the constant service time leads to a linear increase in the batch service rate with $b$ . Consequently, at $b=32$ , the batch service rate achieves a speedup of 32 times compared to service without batching ( $b=1$ ), whereas it is only 4 times in the default setting.

The latency-power consumption pairs for different policies under various load conditions are illustrated in Fig. 7LABEL:sub@fig:case1_tradeoff, and the latency-energy efficiency pairs are shown in Fig. 7LABEL:sub@fig:case1_tradeoff_ee. It is observed that the SMDP-based policies consistently outperform other benchmarks. Additionally, a few notable observations for this special case include: (1) The latency of the greedy batching policy shows minor growth with increasing load, compared to the significant growth in the default setting (see Fig. 5LABEL:sub@fig:case6_tradeoff). The reason is that, in the default setting, the experienced service time grows with the average batch size as $\rho$ increases, and the service rate does not increase as quickly as with constant service time, worsening the situation. (2) The latency of the maximum batching policy is much lower than that in the default case and decreases rapidly with increasing $\rho$ . This is because the batch service rate of maximum batching is significantly higher than that in the default case, and an increased traffic load reduces the waiting time for forming a batch. (3) When $\rho$ is 0.6 or higher, the latency of the maximum batching policy is very close to that of the work-conserving policy. This follows the previous observation, as a sufficiently high load can effectively activate the power of the maximum batching policy. This insight suggests that under high load conditions, if the latency of maximum batching is acceptable, there is no need to compute or select an SMDP policy.

VII-C2 Stronger Batching Effect in Energy Efficiency

In this part, we modify the energy consumption function of the default scenario to $\zeta(b)=105\log(b)+60$ mJ, which is a logarithmic function of $b$ . As illustrated in Fig. 8LABEL:sub@case7o_setting, the default energy consumption function $\zeta(b)$ increases linearly with the batch size $b$ , resulting in energy efficiency that grows sub-linearly with $b$ . In contrast, with the logarithmic energy function, the energy efficiency increases super-linearly with the batch size $b$ . Notably, the energy efficiency continues to improve significantly after $b$ exceeds 8, while in the default setting, it remains relatively stable.

The latency-power consumption pairs for different policies under various load conditions are illustrated in Fig. 8LABEL:sub@fig:case7o_tradeoff, and the latency-energy efficiency pairs are demonstrated in Fig. 8LABEL:sub@fig:case7o_tradeoff_ee. It is observed that the SMDP-based policies consistently perform as well as or better than other benchmarks. Notable observations for this case include: (1) The power consumption of the work-conserving policy decreases as $\rho$ increases from 0.6 to 0.8, unlike in the default setting where power consumption rises with higher load. This is due to the significant increase in energy efficiency with larger batch sizes. (2) The latency-power consumption tradeoff curve is much steeper compared to the default setting. This is because the wider power consumption range and consistent latency range result in a larger absolute value of the tradeoff derivative. Furthermore, compared to the default setting, the energy efficiency increases more rapidly at relatively large latency values. Therefore, a small increase in latency can lead to substantial power savings in this scenario.

VII-C3 Impact of the Distribution of Service Time

In the default setting, the service time follows a deterministic distribution with a coefficient of variation (CoV) of 0. However, in scenarios such as inference running with interference from other tasks, the service time is typically stochastic, and a larger CoV represents more complex and severe interference. Therefore, in this part, we conduct experiments with three additional types of service time distributions while keeping the same $l(b)$ . (a) Erlang distribution with a Laplace transform given by $\tilde{g}_{b}(x)={(\frac{1}{1+0.5l(b)x})}^{2}$ , which has a CoV of 0.5. (b) Exponential distribution with a Laplace transform given by $\tilde{g}_{b}(x)=\frac{1}{1+l(b)x}$ , which has a CoV of 1. (c) Hyperexponential distribution with a Laplace transform given by $\tilde{g}_{b}(x)=\frac{2}{3}\times\frac{1}{1+0.5l(b)x}+\frac{1}{3}\times\frac{1}{1+2l(b)x}$ , which has a CoV of 2.

The CDFs of these service time distributions for $b=8$ are illustrated in Fig. 9LABEL:sub@cdf_covs. It can be seen that the tail of the service time distribution becomes heavier as the CoV increases. Furthermore, as shown in the latency-power consumption curves in Fig. 9LABEL:sub@results_covs, the average latency for a given power consumption increases with the CoV. This effect is more pronounced under high load conditions (e.g., $\rho=0.7$ ) compared to low load conditions (e.g., $\rho=0.3$ ). This observation aligns with our expectations, as the average latency increases with the CoV due to its corresponding increase in the second moment, as shown in Eq.(11). Additionally, the average power consumption under the greedy batching policy decreases with increasing CoV, reflecting the increase in the average batch size.

VII-D Efficiency of the Solving Procedure

TABLE II: Evaluation of approximations acceptable with tolerance

\delta=0.001

under different

c_{\mathrm{o}}

$c_{\mathrm{o}}$	$10000$	$1000$	$100$	$10$	$0$
$\min s_{\max}$	$89$	$78$	$\bm{70}$	$161$	$192$
Iterations	$1847$	$1635$	$\bm{1483}$	$10000$	$10000$
Space Complexity	$2848$	$2496$	$\bm{2240}$	$5152$	$6144$
Time Complexity	$4.68\times 10^{8}$	$3.18\times 10^{8}$	$\bm{2.33\times 10^{8}}$	$8.29\times 10^{9}$	$1.18\times 10^{10}$
$\Delta^{\pi}$	$9.36\times 10^{-4}$	$9.78\times 10^{-4}$	$8.36\times 10^{-4}$	$6.14\times 10^{-12}$	$1.30\times 10^{-14}$
$\hat{g}^{{\pi}}$	$66.1384$	$66.1383$	$66.1377$	$66.1374$	$66.1374$

We want to evaluate the accuracy and complexity under different abstract costs in the finite state approximation. As mentioned in Section V-A, there are two parameters in the approximation: $s_{\max}$ and $c_{\mathrm{o}}$ , which determine the dimension of the state space and the abstract cost, respectively. Given a specific approximate model with certain $s_{\max}$ and $c_{\mathrm{o}}$ , a policy ${\pi}_{(c_{\mathrm{o}},s_{\max})}$ is calculated following the procedure outlined in Sections V-B and V-C, and it serves as an approximation to the optimal policy. The corresponding average cost per unit time $\hat{g}^{{\pi}_{(c_{\mathrm{o}},s_{\max})}}$ , evaluated in the state space $\mathcal{\hat{S}}$ , can be obtained by Eq.(21). The accuracy of the approximation is assessed using $\Delta^{{\pi}_{(c_{\mathrm{o}},s_{\max})}}$ , as detailed in Eq.(22), representing the average cost contributed by $S_{\mathrm{o}}$ per unit time. For simplicity, we use $\hat{g}^{{\pi}}$ and $\Delta^{\pi}$ to denote these metrics in the following text. We conduct experiments in the basic scenario, with $\rho=0.9$ and $[w_{1},w_{2}]=[1,1]$ . The stopping parameter in RVI is set to $\epsilon=0.01$ . Additionally, we impose a maximum iteration value in the RVI process as $\rm{iter}_{\max}=10000$ .

In Fig. 10, we illustrate the evolution of $\hat{g}^{{\pi}}$ and $\Delta^{\pi}$ with $s_{\max}$ ranging from 32 to 250. The $\hat{g}^{{\pi}}$ with $c_{\mathrm{o}}=10000,1000,100$ decreases and converges around $s_{\max}=35,50,70$ , while $\hat{g}^{{\pi}}$ with $c_{\mathrm{o}}=10,0$ increases and converges around $s_{\max}=170,200$ . We can infer that the abstract cost with $c_{\mathrm{o}}=10000,1000,100$ (or $10,0$ ) overestimates (or underestimates) the impact of “tail” states, leading to the $\hat{g}^{{\pi}}$ mostly larger (or smaller) than the convergence value. From the orange curves, we observe that $\Delta^{\pi}$ decreases with $s_{\max}$ , and almost converges when $s_{\max}$ exceeds 200. A sharp drop is observed for $\Delta^{\pi}$ under $c_{\mathrm{o}}=10,0$ . This is due to the underestimated impact of “tail” states with $c_{\mathrm{o}}=10,0$ , resulting in a lower estimated value for $a=0$ (“wait”) in the RHS of Eq.(29). Consequently, the computed policy is to always wait, until $s_{\max}$ is large enough for the cost of waiting to be comparable to the cost of serving. Although the converged values of $\Delta^{\pi}$ are no more than $10^{-13}$ , we only need an approximation acceptable with tolerance $\delta$ , and we choose $\delta=0.001$ . In Table II, we list the minimum values of $s_{\max}$ that satisfy the approximation requirement. The $\Delta^{\pi}$ and $\hat{g}^{{\pi}}$ , the number of RVI iterations, as well as the space and time complexity corresponding to the minimum $s_{\max}$ are also recorded. It can be seen that all $\Delta^{\pi}$ are less than $0.001$ , and the differences among $\hat{g}^{{\pi}}$ are also no greater than $0.001$ . The least required $s_{\max}$ is $70$ , observed in the approximation with $c_{\mathrm{o}}=100$ . Compared to the ordinary finite state approximation with $c_{\mathrm{o}}=0$ , the minimum $s_{\max}$ decreases from $192$ to $70$ . Consequently, the space complexity reduces by $63.5\%$ , and the time complexity decreases by $98\%$ . Furthermore, approximations with $c_{\mathrm{o}}$ larger (or smaller) than $100$ exhibit an increasing trend in complexity due to the growing overestimation (or underestimation).

In addition to the state aggregation method used for finite state approximation in our work, the literature also explores approximate iteration algorithms to implement finite state approximation within iteration algorithms[44, 45]. A comparison between our proposed scheme and two representative approximate iteration algorithms (detailed in Appendix F) shows that our approximation procedure outperforms these algorithms in both convergence speed and result accuracy, especially when the abstract cost $c_{\mathrm{o}}$ is included.

VIII Conclusion

In this paper, we have studied the dynamic batching problem for online serving, where the batch service time is dependent on the batch size. The problem is formulated as an SMDP with the objective of minimizing the weighted sum of average response time and average power consumption. The inherent complexities of this SMDP problem, characterized by an infinite state space, an average (non-discounted) objective, and unbounded costs, make it challenging to efficiently solve using traditional methods. To overcome these challenges, we have introduced a solution procedure consisting of finite state approximation, “discretization” transformation, and relative value iteration. The computational complexity is largely reduced owning to the introduction of an abstract cost. Then, we have conducted comprehensive numerical experiments across various parameter settings. The overall average cost and the tradeoff between average latency and average power consumption are depicted under different parameter setups. Comparisons with benchmark batching policies further showcase the superiority of the SMDP solutions.

Compared to many existing dynamic batching schemes, our proposed solution is theoretically derived, rather than relying on repeated trials. As a result, our scheme can be computed offline, alleviating the system from the burden of additional complex modules. Despite focusing on average objectives, statistics related to the SLO requirements, such as the satisfaction percentage for a certain latency constraint, can be obtained through offline simulations. When running in real time, it then becomes easy to find the most suitable weight and its corresponding batching policy that minimizes power consumption while satisfying the SLO requirement. For bursty or non-stationary traffic arrival processes, which are common in real systems, they can be approximated as temporal compositions of Poisson process periods. Specifically, in the case of MMPPs, they are exact temporal compositions of Poisson process periods. By detecting phases and applying the proposed method to each period, such traffic can be effectively managed. We also plan to explore dynamic batching schemes for multiple processors, incorporating both inter- and intra-processor parallelism in future work.

Appendix A Proof of Proposition 1

To prove the proposition, we first introduce Lemma 1, which extends Theorem 3 from [50] to our specific context.

Lemma 1 (Extended from Theorem 3 in [50]).

Let $d(s)=\underset{a\in\mathcal{A}_{s}}{\min}\{d(s,a)\}$ . Assume $\underset{s\rightarrow\infty}{\lim}d(s)=\infty$ and for every $a\in\mathcal{B}$ , assume $G_{a}(0)<1$ and $\frac{1}{{\mu}^{[a]}}<\infty$ . Assume there exist a service parameter $a$ and nonnegative integer $n$ such that $c(s,a)\leq W(s)$ , a nonnegative polynomial of degree $n$ , $G_{a}(t)$ has finite $(n+1)$ st moment and satisfies $\lambda\frac{1}{{\mu}^{[a]}}<B_{\max}$ . Then there exists an expected average cost optimal stationary deterministic policy.

Then, we validate these assumptions within our framework.

In $\mathcal{P}$ we have $\underset{s\rightarrow\infty}{\lim}d(s,0)=\underset{s\rightarrow\infty}{\lim}w_{1}\frac{s}{{\lambda}}=\infty$ , and for $a\neq 0$ , $\underset{s\rightarrow\infty}{\lim}d(s,a)=\underset{s\rightarrow\infty}{\lim}w_{1}(\frac{s}{{\lambda}}+\frac{1}{2}\mathbb{E}[G_{a}^{2}]{\mu}^{[a]})=\infty$ . Thus, $\underset{s\rightarrow\infty}{\lim}d(s)=\underset{s\rightarrow\infty}{\lim}\underset{a\in\mathcal{A}_{s}}{\min}\{d(s,a)\}=\infty$ . For every $a\in\mathcal{B}$ , $G_{a}(0)<1$ and $\frac{1}{{\mu}^{[a]}}<\infty$ hold, according to Section III.

When $a=B_{\max}$ , there is $c(s,B_{\max})=w_{2}{\zeta}(B_{\max})+{w_{1}}(\frac{s}{\lambda{\mu}^{[B_{\max}]}}+\frac{1}{2}\mathbb{E}[G^{2}_{B_{\max}}])\leq W(s)\triangleq c(s,B_{\max})$ , where $W(s)$ is a nonnegative polynomial of degree one. Furthermore, $G_{B_{\max}}(t)$ is assumed to have finite second moment, and $\lambda\frac{1}{{\mu}^{[B_{\max}]}}<B_{\max}$ also holds (see Section III). That is to say, there exist a service parameter $a=B_{\max}$ and nonnegative integer $n=1$ that satisfy the assumptions in Lemma 1.

Therefore, by Lemma 1, there exisits an average expected optimal stationary deterministic policy for the SMDP model $\mathcal{P}$ .

Appendix B Proof of Proposition 2

The state process $\{s_{m}^{{\pi}}\}_{m=0}^{\infty}$ induced by a stationary deterministic policy ${\pi}$ , defined as $s_{m+1}^{{\pi}}=f(s_{m}^{{\pi}},{\pi}(s_{m}^{{\pi}}))$ with $f(\cdot)$ representing a stochastic evolution, is a Markov chain. To get the equations corresponding to the optimal stationary deterministic policies, we need to first discuss the chain structure of the transition matrices of Markov chains generated by stationary policies[43].

Definition 4.

The SMDP model is unichain if the Markov chain corresponding to every deterministic stationary policy has a single recurrent class and a possibly empty set of transient states.

Lemma 2.

The SMDP model $\mathcal{P}$ is unichain.

Proof:

Note that under any policy, each state $s$ can reach its neighboring state $s+1$ in one step, since $m(s+1|s,a)>0,\forall s\in\mathcal{S},\forall a\in\mathcal{A}_{s}$ . This is valid because $m(s+1|s,0)=1$ and $m(s+1|s,a)=p_{a+1}^{[a]}>0$ , for $a\neq 0$ . Since state $s+1$ is accessible from state $s$ , we know that if state $s$ is a recurrent state, $s+1$ should also be recurrent. Therefore, the Markov chain of some stationary policy never contains more than one closed irreducible recurrent class, which concludes the proof. ∎

Then we are able to obtain the optimality equations. Since $\mathcal{S}$ is countable and the SMDP model is unichain (by Lemma 2), the optimality equations for such average cost SMDP problems are obtained according to Theorem 11.4.4 in [43].

Appendix C Proof of Proposition 3

The holding cost function is $v(n)=\frac{w_{1}}{\lambda}n$ . Then, for each value of $s$ , it can be observed that $v(s+1)-v(s)=\frac{w_{1}}{\lambda}>0$ . Additionally, as stated in Section III, the assumption holds that $\lambda l(B_{\max})<B_{\max}$ . Combined with Assumption 1, we have $\lambda l<B_{\max}$ . Consequently, the conditions of Theorem 5.3 in [33] are satisfied. Thus, applying Theorem 5.3 from [33] completes the proof.

Appendix D Proof of Collary 1

In Proposition 4, when $w_{2}=0$ or $\zeta_{0}=0$ , we observe that $D_{q}$ is solely affected by $q$ , $\chi$ , and $\xi$ . Here, $\xi$ is influenced by $\psi$ and $B_{\max}$ . It is also worth noting that $\psi=\lambda/(\lambda+\mu)=\chi/(\chi+1)$ is determined by $\chi$ . As a result, $D_{q}$ is determined by $q$ , $\chi$ and $B_{\max}$ , when $w_{2}=0$ or $\zeta_{0}=0$ . Therefore, the optimal $Q$ , which is the smallest positive integer $q\leq B_{\max}$ such that $D_{q}\geq 0$ , is only influenced by $\chi=\lambda/\mu$ and $B_{\max}$ . It remains unaffected by the absolute values of $\lambda$ and $\mu$ , when $\chi$ is given.

Appendix E SMDP Solution Visualization for Broader Cases

We construct four scenarios named Cases 4-7, based on the basic scenario. The maximum batch size $B_{\max}$ is set as 8. The converged SMDP solutions under Cases 4-7 are demonstrated in Fig. 11. In Cases 4-6, which violate Assumptions 2, 3, and 1, respectively, it is observed that not all SMDP solutions adhere to the control limit structure. The elements that disrupt the control limit structure are highlighted in magenta. In Case 4, where the minimum batch size $B_{\min}$ is set to 5, the SMDP solutions do not merely adjust the control limits $Q$ of Case 1 to $\max(Q,5)$ . For example, when $\rho=0.1$ and $w_{2}=0,0.5$ and $1$ , there are states that exceed $B_{\max}$ with corresponding actions smaller than $B_{\max}$ , which contradicts the definition of control limit policies. In Case 5, some solutions feature more than one threshold dividing the actions between “wait” ( $a(s)=0$ ) and “serve” ( $a(s)>0$ ). In Case 6, there are actions involving serving a batch smaller than the maximum available size. This can be attributed to the possibility of forming a larger batch when the newly initiated service is completed, with some requests remaining in the buffer. In the more general scenario, Case 7, there are even more instances where the solutions deviate from the control limit structure.

Appendix F Comparison with Approximate Iteration Algorithms

TABLE III: Evaluation of approximations under different iteration algorithms.

RVI			RVI			AVI			API
( $c_{\mathrm{o}}=0,s_{\max}=160$ )			( $c_{\mathrm{o}}=100,s_{\max}=160$ )			(Scheme I in [44])			(Scheme IV in [44])
CPU time [s]	$\hat{g}^{{\pi}}$	$\Delta^{\pi}$	CPU time [s]	$\hat{g}^{{\pi}}$	$\Delta^{\pi}$	CPU time [s]	$\hat{g}^{{\pi}}$	$\hat{g}^{{\pi}_{\text{trunc}}}$	CPU time [s]	$\hat{g}^{{\pi}}$	$\hat{g}^{{\pi}_{\text{trunc}}}$
$0.73$	$108.14$	$108.14$	$0.49$	$114.46$	$1.51\times 10^{-14}$	$0.73$	$197.33$	$197.33$	$0.79$	$909.73$	$208.14$
$1.81$	$108.14$	$108.14$	$0.61$	$44.47$	$8.70\times 10^{-15}$	$9.12$	$383.20$	$208.14$	$3.46$	$909.73$	$208.14$
$1.93$	$38.86$	$3.63\times 10^{-15}$	$0.73$	$38.86$	$1.42\times 10^{-15}$	$9.36$	$386.58$	$42.53$	$3.94$	$909.73$	$42.53$
$3.02$	$38.86$	$3.63\times 10^{-15}$	$2.06$	$38.86$	$1.48\times 10^{-15}$	$12.220$	$417.00$	$42.53$	$5.25$	$909.73$	$42.53$

We compare the proposed finite state approximation scheme with two classical approximate iteration algorithms. Scheme I in [44] (also Scheme II in [45]) is a typical approximate value iteration algorithm (abbreviated as AVI in this paper). Scheme IV in [44] is an approximate policy iteration algorithm (abbreviated as API in this paper) that incorporates the aforementioned AVI algorithm in its inner loop. AVI and API algorithms are applied to solve the infinite state discrete-time MDP directly associated with the original SMDP.

The experiments are conducted in the basic scenario, with $\rho=0.5$ and $[w_{1},w_{2}]=[1,1]$ . In our proposed schemes, we set $s_{\max}$ to $160$ and $c_{\mathrm{o}}$ to $0$ and $100$ , respectively. For all value iterations, we initialize the value functions to zero. The API algorithm has an initial policy set as $a(s)=0$ for all $s$ . The number of inner iterations is set to $20*i$ in the $i$ th ( $i=1,2,...$ ) outer iteration loop. The algorithms are implemented in MATLAB_R2021b and executed on a MacBook Air (M2, 2022). The Apple M2 chip is equipped with an 8-core CPU comprising four performance cores and four efficiency cores.

In Table III, we demonstrate the change in the evaluated average cost $\hat{g}^{\pi}$ with the exact CPU execution time (averaged over 11 runs) under different schemes. We also record the evolution of $\Delta^{\pi}$ under the proposed scheme (referred to as RVI in Table III). In both AVI and API, the state space consistently expands with each iteration, while the exact number of iterations used for computing the policy on a state $s$ decreases with the increasing value of $s$ . Consequently, the latter part of the computed policy (the policy computed for relatively large states) does not converge very effectively. Therefore, we introduce $\pi_{\text{trunc}}$ to truncate and only maintain the policy on the state space $\mathcal{S}_{\text{trunc}}=\{0,1,2,\ldots,160,161\}$ , whose cardinality is the same as the state space $\hat{\mathcal{S}}$ of the proposed scheme. We use $\hat{g}^{\pi_{\text{trunc}}}$ to denote the average cost of $\pi_{\text{trunc}}$ evaluated on $\mathcal{S}_{\text{trunc}}$ .

Table III shows that the proposed methods with $c_{\mathrm{o}}=0$ and $c_{\mathrm{o}}=100$ start converging at approximately $1.93$ seconds and $0.73$ seconds, respectively. The converged $\Delta^{\pi}$ values are on the order of $10^{-15}$ , ensuring high approximation accuracy. In contrast, AVI and API do not achieve convergence for the entire computed policy within the demonstrated CPU times, but their truncated policies converge around $9.36$ seconds and $3.94$ seconds, respectively. Both RVI schemes demonstrate faster convergence than AVI and API, with RVI ( $c_{\mathrm{o}}=100$ ) exhibiting a notable advantage over other schemes. Furthermore, the converged average cost $\hat{g}^{\pi}$ in RVI is $38.86$ , which is lower than the converged average cost $\hat{g}^{\pi_{\text{trunc}}}=42.53$ in AVI and API algorithms. This suggests that the converged policies obtained through the proposed schemes provide better approximations to optimal policies than those obtained from AVI and API algorithms. Moreover, approximate iteration algorithms may encompass a very large state space as the number of iterations increases, which raises challenges in complexity and numerical stability.

References

[1] Y. Xu, J. Sun, S. Zhou, and Z. Niu, “SMDP-Based Dynamic Batching for Efficient Inference on GPU-Based Platforms,” in Proc. IEEE Int. Conf. Commun. (ICC), Rome, Italy, May 2023.
[2] K.-S. Oh and K. Jung, “GPU Implementation of Neural Networks,” Pattern Recognition, vol. 37, no. 6, pp. 1311–1314, Jun. 2004.
[3] C. Zhang, M. Yu, W. Wang, and F. Yan, “MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving,” in Proc. USENIX Annu. Tech. Conf. (ATC), Renton, WA, USA, Jul. 2019.
[4] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, “Clipper: A Low-Latency Online Prediction Serving System,” in Proc. USENIX Symp. Netw. Syst. Des. Implement. (NSDI), Boston, MA, USA, Mar. 2017.
[5] H. Shin, D. Kim, E. Park, S. Park, Y. Park, and S. Yoo, “McDRAM: Low Latency and Energy-Efficient Matrix Computations in DRAM,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 37, no. 11, pp. 2613–2622, Nov. 2018.
[6] Y. E. Wang, G.-Y. Wei, and D. Brooks, “Benchmarking TPU, GPU, and CPU Platforms for Deep Learning,” [Online]. Available: https://arxiv.org/abs/1907.10701, 2019.
[7] NVIDIA, “NVIDIA AI inference platform technical overview,” [Online]. Available: https://www.nvidia.com/en-us/data-center/resources/inference-technical-overview/, 2018, (accessed 23-Nov-2019).
[8] C. Yao, W. Liu, W. Tang, and S. Hu, “EAIS: Energy-Aware Adaptive Scheduling for CNN Inference on High-Performance GPUs,” Future Gener. Comp. Syst., vol. 130, pp. 253–268, May 2022.
[9] Y. Wang, Q. Wang, and X. Chu, “Energy-Efficient Online Scheduling of Transformer Inference Services on GPU Servers,” IEEE Trans. Green Commun. Netw., vol. 6, no. 3, pp. 1649–1659, Sep. 2022.
[10] A. Bhardwaj, A. Phanishayee, D. Narayanan, M. Tarta, and R. Stutsman, “Packrat: Automatic Reconfiguration for Latency Minimization in CPU-Based DNN Serving,” [Online]. Available: http://arxiv.org/abs/2311.18174, 2023.
[11] S. M. Nabavinejad, S. Reda, and M. Ebrahimi, “Coordinated Batching and DVFS for DNN Inference on GPU Accelerators,” IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 10, pp. 2496–2508, Oct. 2022.
[12] A. Ali, R. Pinciroli, F. Yan, and E. Smirni, “BATCH: Machine Learning Inference Serving on Serverless Platforms with Adaptive Batching,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal. (SC), Atlanta, GA, USA, Nov. 2020.
[13] Q. Zhang, Y. Liu, T. Liu, and D. Qian, “CoFB: Latency-Constrained Co-Scheduling of Flows and Batches for Deep Learning Inference Service on the CPU–GPU System,” J. Supercomput., vol. 79, pp. 14 172–14 199, Apr. 2023.
[14] Y. Choi, Y. Kim, and M. Rhu, “Lazy Batching: An SLA-Aware Batching System for Cloud Machine Learning Inference,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Seoul, Korea (South), Feb. 2021.
[15] W. Cui, Q. Chen, H. Zhao, M. Wei, X. Tang, and M. Guo, “E2Bird: Enhanced Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services,” IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 6, pp. 1307–1321, Jun. 2021.
[16] GoogleCloud, “Google cloud prediction API documentation,” [Online]. Available: https://cloud.google.com/prediction/docs/, 2017, (accessed 19-Oct-2022).
[17] C. Szegedy et al., “Going Deeper with Convolutions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015.
[18] Y. Wang and X. Wang, “Virtual Batching: Request Batching for Server Energy Conservation in Virtualized Data Centers,” IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 8, pp. 1695–1705, Aug. 2012.
[19] D. Cheng, X. Zhou, Y. Wang, and C. Jiang, “Adaptive Scheduling Parallel Jobs with Dynamic Batching in Spark Streaming,” IEEE Trans. Parallel Distrib. Syst., vol. 29, no. 12, pp. 2672–2685, Dec. 2018.
[20] Q. Ye, Y. Zhou, M. Shi, Y. Sun, and J. Lv, “DBS: Dynamic Batch Size for Distributed Deep Neural Network Training,” [Online]. Available: https://arxiv.org/abs/2007.11831, 2020.
[21] J. You, J.-W. Chung, and M. Chowdhury, “Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training,” in Proc. USENIX Symp. Netw. Syst. Des. Implement. (NSDI), Boston, MA, USA, Apr. 2023.
[22] R. Yadav, W. Zhang, O. Kaiwartya, H. Song, and S. Yu, “Energy-Latency Tradeoff for Dynamic Computation Offloading in Vehicular Fog Computing,” IEEE Trans. Veh. Technol., vol. 69, no. 12, pp. 14 198–14 211, Dec. 2020.
[23] C. Ling, W. Zhang, H. He, R. Yadav, J. Wang, and D. Wang, “QoS and Fairness Oriented Dynamic Computation Offloading in the Internet of Vehicles based on Estimate Time of Arrival,” IEEE Trans. Veh. Technol., vol. 73, no. 7, pp. 10 554–10 571, Jul. 2024.
[24] R. Yadav et al., “Smart Healthcare: RL-Based Task Offloading Scheme for Edge-Enable Sensor Networks,” IEEE Sens. J., vol. 21, no. 22, pp. 24 910–24 918, Nov. 2021.
[25] F. Yan, O. Ruwase, Y. He, and E. Smirni, “SERF: Efficient Scheduling for Fast Deep Neural Network Serving via Judicious Parallelism,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal. (SC), Salt Lake City, UT, USA, Nov. 2016.
[26] Z. Li et al., “AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving,” in Proc. USENIX Symp. Oper. Syst. Des. Implement. (OSDI), Boston, MA, USA, Jul. 2023.
[27] H. Qin, S. Zawad, Y. Zhou, L. Yang, D. Zhao, and F. Yan, “Swift Machine Learning Model Serving Scheduling: A Region Based Reinforcement Learning Approach,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal. (SC), Denver, CO, USA, Nov. 2019.
[28] W. Fischer and K. Meier-Hellstern, “The Markov-Modulated Poisson Process (MMPP) Cookbook,” Performance Evaluation, vol. 18, no. 2, pp. 149–171, Sep. 1993.
[29] Y. Inoue, “Queueing Analysis of GPU-Based Inference Servers with Dynamic Batching: A Closed-Form Characterization,” Performance Evaluation, vol. 147, p. 102183, May 2021.
[30] D. Bonvin, “Control and Optimization of Batch Processes,” IEEE Control Syst. Mag., vol. 6, no. 26, pp. 34–45, Dec. 2006.
[31] S. Sasikala and K. Indhira, “Bulk Service Queueing Models-A Survey,” Int. J. Pure Appl. Math, vol. 106, no. 6, pp. 43–56, Apr. 2016.
[32] J. W. Fowler and L. Mönch, “A Survey of Scheduling with Parallel Batch (p-Batch) Processing,” Eur. J. Oper. Res., vol. 298, no. 1, pp. 1–24, Apr. 2022.
[33] R. K. Deb and R. F. Serfozo, “Optimal Control of Batch Service Queues,” Adv. Appl. Probability, vol. 5, no. 2, pp. 340–361, Aug. 1973.
[34] Y. Zeng and C. H. Xia, “Optimal bulking threshold of batch service queues,” J. Appl. Probability, vol. 54, no. 2, pp. 409–423, Jun. 2017.
[35] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, “Multiuser Co-Inference With Batch Processing Capable Edge Server,” IEEE Trans. Wireless Commun., vol. 22, no. 1, pp. 286–300, Jul. 2022.
[36] J. Hanhirova, T. Kämäräinen, S. Seppälä, M. Siekkinen, V. Hirvisalo, and A. Ylä-Jääski, “Latency and Throughput Characterization of Convolutional Neural Networks for Mobile Computer Vision,” in Proc. ACM Multimedia Syst. Conf. (MMSys), Amsterdam, Netherlands, Jun. 2018.
[37] M. F. Neuts, “A General Class of Bulk Queues with Poisson Input,” Ann. Math. Statist., vol. 38, no. 3, pp. 759–770, Jun. 1967.
[38] A. Maity and U. C. Gupta, “Analysis and Optimal Control of a Queue with Infinite Buffer Under Batch-Size Dependent Versatile Bulk-Service Rule,” OPSEARCH, vol. 52, no. 3, pp. 472–489, Sep. 2015.
[39] S. Pradhan, “On the Distribution of an Infinite-Buffer Queueing System with Versatile Bulk-Service Rule Under Batch-Size-Dependent Service Policy: $M/G_{n}^{(a,y)}/1$ ,” Int. J. Math. Oper. Res., vol. 16, no. 3, p. 407, Apr. 2020.
[40] G. K. Gupta and A. Banerjee, “Analysis of Infinite Buffer General Bulk Service Queue with State Dependent Balking,” Int. J. Oper. Res., vol. 40, no. 2, pp. 137–161, Mar. 2021.
[41] K. P. Papadaki and W. B. Powell, “Exploiting Structure in Adaptive Dynamic Programming Algorithms for a Stochastic Batch Service Problem,” Eur. J. Oper. Res., vol. 142, no. 1, pp. 108–127, Oct. 2002.
[42] Y. Inoue, “A Load-Balancing Problem for Distributed Bulk-Service Queues with Size-Dependent Batch Processing Times,” Queueing Systems, vol. 100, no. 3-4, pp. 449–451, Apr. 2022.
[43] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.
[44] L. C. Thomas and D. Stengos, “Finite State Approximation Algorithms for Average Cost Denumerable State Markov Decision Processes,” Oper. Res. Spectr., vol. 7, no. 1, pp. 27–37, Mar. 1985.
[45] D. White, “Finite State Approximations for Denumerable State Infinite Horizon Discounted Markov Decision Processes with Unbounded Rewards,” J. Math. Anal. Appl., vol. 86, no. 1, pp. 292–306, Mar. 1982.
[46] J. D. C. Little, “A Proof for the Queuing Formula: L = $\lambda$ W,” Operations Research, vol. 9, no. 3, pp. 383–387, Jun. 1961.
[47] S. Aalto, “Optimal Control of Batch Service Queues with Finite Service Capacity and Linear Holding Costs,” Math. Method Oper. Res., vol. 51, no. 2, pp. 263–285, Apr. 2000.
[48] H. J. Weiss, “The Computation of Optimal Control Limits for a Queue with Batch Services,” Management Science, vol. 25, no. 4, pp. 320–328, Apr. 1979.
[49] E. Ignall and P. Kolesar, “Optimal Dispatching of an Infinite-Capacity Shuttle: Control at a Single Terminal,” Operations Research, vol. 22, no. 5, pp. 1008–1024, Oct. 1974.
[50] L. I. Sennott, “Average Cost Semi-Markov Decision Processes and the Control of Queueing Systems,” Probability Eng. Inf. Sci., vol. 3, no. 2, pp. 247–272, Apr. 1989.