
Concave Utility Reinforcement Learning with Zero-Constraint Violations

Mridul Agarwal agarw180@purdue.edu
Purdue University
Qinbo Bai bai113@purdue.edu
Purdue University
Vaneet Aggarwal vaneet@purdue.edu
Purdue University
Abstract

We consider the problem of tabular infinite-horizon concave utility reinforcement learning (CURL) with convex constraints. For this setting, we propose a model-based learning algorithm that also achieves zero constraint violations. Assuming that the concave objective and the convex constraints have a solution interior to the set of feasible occupation measures, we solve a tighter optimization problem to ensure that the constraints are never violated despite imprecise model knowledge and model stochasticity. We use a Bellman error-based analysis for tabular infinite-horizon setups, which allows analyzing stochastic policies. Combining the Bellman error-based analysis and the tighter optimization problem, for $T$ interactions with the environment, we obtain a high-probability regret guarantee for the objective which grows as $\tilde{O}(1/\sqrt{T})$, excluding other factors. The proposed method can be applied to optimistic algorithms to obtain high-probability regret bounds, and also to posterior sampling algorithms to obtain a looser Bayesian regret bound but with a significant improvement in computational complexity.

1 Introduction

In many applications where a learning agent uses reinforcement learning to find optimal policies, the agent optimizes a concave function of the expected rewards, or the agent must satisfy certain constraints while maximizing an objective (Altman & Schwartz, 1991; Roijers et al., 2013). For example, in network scheduling, a controller can maximize fairness among the users using a concave function of the average reward of each user (Chen et al., 2021). Consider a scheduler which allocates a resource to users, where each user obtains some reward based on their current state. The goal of the scheduler is to maximize fairness among the users. However, there are certain preferred users for which some service level agreements (SLAs) must be met. For this setup, the scheduler aims to find a policy which maximizes fairness while ensuring that the SLA constraints of the preferred users are met. Note that, here, the objective is a non-linear concave utility in the presence of constraints on service level agreements. Setups with constraints also arise in autonomous vehicles, where the goal is to reach the destination quickly while ensuring the safety of the surroundings (Le et al., 2019; Tessler et al., 2018). Further, an agent may aim to efficiently explore the environment by maximizing the entropy, which is a concave function of the distribution induced over the state and action space (Hazan et al., 2019).

Owing to the variety of use cases, there has recently been significant effort to develop RL algorithms for setups with constraints, concave utilities, or both. For the episodic setup, works range from model-based algorithms (Brantley et al., 2020; Yu et al., 2021) to primal-dual model-free algorithms (Ding et al., 2021). Recently, there has also been a thrust towards developing algorithms which achieve zero constraint violations during the learning phase as well (Wei et al., 2022a; Liu et al., 2021; Bai et al., 2022b). However, for the episodic setup, the majority of the current works consider the weaker regret definition specified by Efroni et al. (2020) and only achieve zero expected constraint violations. Further, these algorithms require the knowledge of a safe policy following which the agent does not violate constraints, or the knowledge of the Slater's gap $\delta$, which determines how far a safe policy is from the constraint boundary.

The regret definition which considers the average over time makes sense for an infinite-horizon setup, as the long-term average is naturally defined (Puterman, 2014). For a tabular infinite-horizon setup, Singh et al. (2020) proposed an optimistic epoch-based algorithm. More recently, Chen et al. (2022) proposed an optimistic online mirror descent based algorithm. In this work, we consider the problem of maximizing a concave utility of the expected rewards while also ensuring that a set of convex constraints on the expected rewards is satisfied. Moreover, we aim to develop algorithms that ensure the constraints are not violated during the training phase either. We work with tabular MDPs with infinite horizon. For such a setup, our algorithm updates policies as it learns the system model. Further, our approach bounds the accumulated observed constraint violations, as opposed to the expected constraint violations.

For the unconstrained infinite-horizon setup, the regret analysis has been widely studied (Fruit et al., 2018; Jaksch et al., 2010). However, we note that dealing with constraints and a non-linear objective requires additional attention because of stochastic policies. Further, unlike the episodic setup, the state distribution at the start of an epoch is not fixed, and hence the policy switching cost has to be accounted for explicitly. Prior works on the infinite horizon also faced this issue and provide some tools to overcome this limitation. Singh et al. (2020) build confidence intervals on the transition probability of every next state given the current state-action pair and obtain a regret bound of $O(T^{2/3})$. Chen et al. (2022) obtain a regret bound of $O(T_{M}\sqrt{T})$ with $O(T_{M}^{2}S^{3}A)$ constraint violations for ergodic MDPs with mixing time $T_{M}$, following an analysis which works with confidence intervals on both transition probability vectors and value functions.

To overcome the limitations of the previous analyses and to obtain a tighter result, we propose an optimism-based UC-CURL algorithm which proceeds in epochs $e$. At each epoch, we solve for a policy which considers constraints tighter by $\epsilon_{e}$ than the true bounds for the optimistic MDP within the confidence intervals of the transition probabilities. Further, as the knowledge of the model improves with increased interactions with the environment, we reduce this tightness. This $\epsilon_{e}$-sequence is critical to our algorithm: if the sequence decays too quickly, the constraint violations cannot be bounded by zero, and if it decays too slowly, the objective regret may not decay fast enough. Further, using the $\epsilon_{e}$-sequence, we do not require the knowledge of the total time $T$ for which the algorithm runs.

We bound our regret by bounding the gap between the optimal policy in the feasible region and the optimal policy for the optimization problem with $\epsilon_{e}$-tight constraints. We bound this gap with a multiplicative factor of $O(1/\delta)$, where $\delta$ is the Slater's parameter. Based on our analysis using the Slater's parameter $\delta$, we consider a case where a lower bound $T_{l}$ on the time horizon $T$ is known. This knowledge of $T_{l}$ allows us to relax our assumption on $\delta$.

Further, for the regret analysis of the proposed UC-CURL algorithm, we use the Bellman error for the infinite-horizon setup to bound the difference between the performance of the optimistic policy on the optimistic MDP and on the true MDP. Compared to the analysis of Jaksch et al. (2010), this allows us to work with stochastic policies. We bound our regret as $\tilde{O}(\frac{1}{\delta}LdT_{M}S\sqrt{A/T}+CT_{M}S^{2}A/T(1-\rho))$ and the constraint violations as 0, where $S$ and $A$ are the number of states and actions respectively, $L$ is the Lipschitz constant of the objective and constraint functions, $d$ is the number of costs the agent is trying to optimize, and $T_{M}$ is the mixing time of the MDP. The Bellman error based analysis, along with the Slater's slackness assumption, also allows us to develop posterior sampling based methods for constrained RL (see Appendix G) by showing feasibility of the optimization problem for the sampled MDPs.

To summarize our contributions, we improve prior results on the infinite-horizon concave utility reinforcement learning setup on multiple fronts. First, we consider a concave objective function and convex constraint functions. Second, even with this non-linear setup, we reduce the regret order to $O(T_{M}S\sqrt{A/T})$ and bound the constraint violations by 0. Third, our algorithm does not require the knowledge of the time horizon $T$, a safe policy, or the Slater's gap $\delta$. Finally, we provide an analysis for a posterior sampling algorithm which improves both empirical performance and computational complexity.

2 Related Works

Constrained RL: Altman (1999) built the formulation of constrained MDPs (CMDPs) to study constrained reinforcement learning and provided algorithms for obtaining policies with known transition models. Zheng & Ratliff (2020) considered an episodic CMDP and used an optimism-based algorithm to bound the constraint violations as $\tilde{O}(1/T^{0.25})$ with high probability. Kalagarla et al. (2021) also considered the episodic setup and obtained a PAC-style bound for an optimism-based algorithm. Ding et al. (2021) considered episodic CMDPs with episode length $H$ and $d$-dimensional linear function approximation to bound the constraint violations as $\tilde{O}(d\sqrt{H^{5}/T})$ by mixing the optimal policy with an exploration policy. Efroni et al. (2020) proposed linear-programming and primal-dual policy optimization algorithms to bound the regret as $O(S\sqrt{H^{3}/T})$. Wei et al. (2022a); Liu et al. (2021) considered the problem of ensuring zero constraint violations using model-free algorithms for tabular MDPs with linear rewards and constraints. However, for infinite-horizon setups, the analysis for finite-horizon algorithms does not directly hold. This is because finite-horizon setups can update the policy after every episode, but such a policy switch modifies the induced Markov chain, which takes time to converge to a stationary distribution.

Xu et al. (2021) considered an infinite-horizon discounted setup with constraints and obtained global convergence using policy gradient algorithms. Bai et al. (2022b) proposed a conservative stochastic model-free primal-dual algorithm for the infinite-horizon discounted setup. Ding et al. (2020); Bai et al. (2023) also considered an infinite-horizon discounted setup with parametrization. They used a natural policy gradient to update the primal variable and sub-gradient descent to update the dual variable. In addition to the above results on discounted MDPs, long-term average rewards have also been considered. Singh et al. (2020) considered the setup of infinite-horizon ergodic CMDPs with long-term average constraints using an optimism-based algorithm. Gattami et al. (2021) analyzed the asymptotic performance of Lagrangian-based algorithms for infinite-horizon long-term average constraints; however, they only show convergence guarantees without explicit convergence rates. Chen et al. (2022) provided an optimistic online mirror descent algorithm for ergodic MDPs which obtains a regret bound of $O(T_{M}S\sqrt{SAT})$, and Wei et al. (2022b) provided a model-free SARSA algorithm which obtains a regret bound of $O(\sqrt{SA}T^{5/6})$ for constrained MDPs. Agarwal et al. (2022b) proposed a posterior sampling based algorithm for the infinite-horizon setup with a regret of $\tilde{O}(T_{M}S\sqrt{AT})$ and constraint violations of $\tilde{O}(T_{M}S\sqrt{AT})$.

Algorithm(s) | Setup | Regret | Constraint Violation | Non-Linear
conRL (Brantley et al., 2020) | FH | $\tilde{O}(LH^{5/2}S\sqrt{A/K})$ | $O(H^{5/2}S\sqrt{A/K})$ | Yes
MOMA (Yu et al., 2021) | FH | $\tilde{O}(LH^{3/2}\sqrt{SA/K})$ | $\tilde{O}(H^{3/2}\sqrt{SA/K})$ | Yes
TripleQ (Wei et al., 2022a) | FH | $\tilde{O}(\frac{1}{\delta}H^{4}\sqrt{SA}K^{-1/5})$ | 0 | No
OptPess-LP (Liu et al., 2021) | FH | $\tilde{O}(\frac{H^{3}}{\delta}\sqrt{S^{3}A/K})$ | 0 | No
OptPess-PrimalDual (Liu et al., 2021) | FH | $\tilde{O}(\frac{H^{3}}{\delta}\sqrt{S^{3}A/K})$ | $\tilde{O}(H^{4}S^{2}A/\delta)$ | No
UCRL-CMDP (Singh et al., 2020) | IH | $\tilde{O}(\sqrt{SA}T^{-1/3})$ | $\tilde{O}(\sqrt{SA}/T^{1/3})$ | No
Chen et al. (2022) | IH | $\tilde{O}(\frac{1}{\delta}T_{M}S\sqrt{SA/T})$ | $\tilde{O}(\frac{1}{\delta^{2}}T_{M}^{2}S^{3}A)$ | No
Wei et al. (2022b) | IH | $\tilde{O}(\frac{1}{\delta}\sqrt{SA}T^{-1/6})$ | 0 | No
Agarwal et al. (2022b) | IH | $\tilde{O}(T_{M}S\sqrt{A/T})$ | $\tilde{O}(T_{M}S\sqrt{A/T})$ | No
UC-CURL (This work) | IH | $\tilde{O}(\frac{1}{\delta}LT_{M}S\sqrt{A/T})$ | 0 | Yes
Table 1: Overview of works on constrained reinforcement learning setups. For finite-horizon (FH) setups, $H$ is the episode length and $K$ is the number of episodes for which the algorithm runs. For infinite-horizon (IH) setups, $T_{M}$ denotes the mixing time of the MDP and $T$ is the time for which the algorithm runs. $L$ is the Lipschitz constant. We note that all the IH setups assume ergodic MDPs, whereas the FH setups do not require the ergodicity assumption as the system resets at the end of every episode.

Concave Utility RL: Another major research direction related to this work is concave utility RL (Hazan et al., 2019). Along this direction, Cheung (2019) considered a concave function of the expected per-step vector reward and developed an algorithm using the Frank-Wolfe gradient of the concave function for tabular infinite-horizon MDPs. Agarwal & Aggarwal (2022); Agarwal et al. (2022a) also considered the same setup using a posterior sampling based algorithm. Recently, Brantley et al. (2020) combined concave utility reinforcement learning and constrained reinforcement learning for an episodic setup. Yu et al. (2021) also considered the episodic setup with concave utility RL. However, both Brantley et al. (2020) and Yu et al. (2021) consider the weaker regret definition of Efroni et al. (2020), and Cheung (2019); Yu et al. (2021) do not target the convergence of the policy. Further, these works do not target zero constraint violations. Recently, policy gradient based algorithms have also been studied for the discounted infinite-horizon setup (Bai et al., 2022a).

Another parallel line of work in RL which deals with concave utilities is variational policy gradient (Zhang et al., 2021; 2020). However, those works consider discounted MDPs, whereas we consider an undiscounted setup.

Compared to prior works, we consider constrained reinforcement learning with convex constraints and a concave objective function. Using the infinite-horizon setup, we consider the tightest possible regret definition. Further, we achieve zero constraint violations with an objective regret that is tight in $T$, by using an optimization problem with decaying tightness. A comparative survey of key prior works and our work is presented in Table 1.

3 Problem Formulation

We consider an ergodic tabular infinite-horizon constrained Markov Decision Process $\mathcal{M}=(\mathcal{S},\mathcal{A},r,f,c_{1},\cdots,c_{d},g,P,\phi)$. $\mathcal{S}$ is a finite set of $S$ states, and $\mathcal{A}$ is a finite set of $A$ actions. $P:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ denotes the transition probability distribution such that on taking action $a\in\mathcal{A}$ in state $s\in\mathcal{S}$, the system moves to state $s'\in\mathcal{S}$ with probability $P(s'|s,a)$. $r:\mathcal{S}\times\mathcal{A}\to[0,1]$ and $c_{i}:\mathcal{S}\times\mathcal{A}\to[0,1],\ i\in\{1,\cdots,d\}$, denote the average reward obtained and the average costs incurred in state-action pair $(s,a)\in\mathcal{S}\times\mathcal{A}$, and $\phi$ is the distribution over the initial state.

The agent interacts with $\mathcal{M}$ in time-steps $t\in\{1,2,\cdots\}$ for a total of $T$ time-steps. We note that $T$ is possibly unknown and $s_{1}\sim\phi$. At each time $t$, the agent observes state $s_{t}$ and plays action $a_{t}$. The agent selects an action on observing the state $s$ using a policy $\pi:\mathcal{S}\to\Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ is the probability simplex on the action space. On following a policy $\pi$, the long-term average reward of the agent is denoted as:

$$\lambda_{\pi}^{P} = \lim_{\tau\to\infty}\mathbb{E}_{\pi,P}\left[\sum\nolimits_{t=1}^{\tau}r(s_{t},a_{t})/\tau\right] \qquad (1)$$

where $\mathbb{E}_{\pi,P}[\cdot]$ denotes the expectation over the state and action trajectory generated from following $\pi$ on transitions $P$. The long-term average reward can also be represented as:

$$\lambda_{\pi}^{P}=\sum\nolimits_{s,a}\rho_{\pi}^{P}(s,a)r(s,a) =\lim_{\gamma\to 1}(1-\gamma)V_{\gamma}^{\pi,P}(s)\ \ \forall s\in\mathcal{S}$$

where $V_{\gamma}^{\pi,P}(s)$ is the discounted cumulative reward on following policy $\pi$, and $\rho_{\pi}^{P}\in\Delta(\mathcal{S}\times\mathcal{A})$ is the steady-state occupancy measure generated from following policy $\pi$ on the MDP with transitions $P$ (Puterman, 2014). Similarly, we also define the long-term average costs as follows:

$$\zeta_{\pi}^{P}(i) =\lim_{\tau\to\infty}\mathbb{E}_{\pi,P}\left[\sum\nolimits_{t=1}^{\tau}c_{i}(s_{t},a_{t})/\tau\right]=\lim_{\gamma\to 1}(1-\gamma)V_{\gamma}^{\pi,P}(s;i)=\sum\nolimits_{s,a}\rho_{\pi}^{P}(s,a)c_{i}(s,a)\ \ \forall s\in\mathcal{S} \qquad (2)$$
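To make Equations (1)-(2) concrete, the following is a small numerical sketch (our own helper, not from the paper) that evaluates, for a fixed policy on a known ergodic kernel, the stationary occupancy measure and the long-run average reward and costs it induces.

import numpy as np

def long_run_averages(P, pi, r, costs):
    # P: (S, A, S) kernel, pi: (S, A) policy with rows summing to 1, r: (S, A), costs: list of (S, A).
    S, A, _ = P.shape
    P_pi = np.einsum('sa,sap->sp', pi, P)                   # state chain under pi
    evals, evecs = np.linalg.eig(P_pi.T)                    # stationary distribution: left
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])    # eigenvector for eigenvalue 1
    mu = mu / mu.sum()
    rho = mu[:, None] * pi                                  # occupancy measure rho_pi^P(s, a)
    lam = np.sum(rho * r)                                   # lambda_pi^P, Eq. (1)
    zetas = [np.sum(rho * ci) for ci in costs]              # zeta_pi^P(i), Eq. (2)
    return rho, lam, zetas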

The agent interacts with the CMDP $\mathcal{M}$ for $T$ time-steps in an online manner and aims to maximize a function $f:[0,1]\to\mathbb{R}$ of the average per-step reward. Further, the agent attempts to ensure that a function $g:[0,1]^{d}\to\mathbb{R}$ of the average per-step costs is at most 0. In hindsight, the agent wants to play a policy $\pi^{*}$ which satisfies:

$$\max_{\pi}f\left(\lambda_{\pi}^{P}\right)\ \ \text{s.t.}\ \ g\left(\zeta_{\pi}^{P}(1),\cdots,\zeta_{\pi}^{P}(d)\right)\leq 0 \qquad (3)$$

Let $P^{t}_{\pi,s}=\prod_{t'=1}^{t}P_{\pi}$ denote the $t$-step transition probability on following policy $\pi$ in MDP $\mathcal{M}$ starting from some state $s$, where $P_{\pi}(\cdot|s)=\sum_{a}\pi(a|s)P(\cdot|s,a)$. Let $T_{s\to s'}^{\pi}$ denote the time taken by the Markov chain induced by the policy $\pi$ to hit state $s'$ starting from state $s$. Further, let $T_{M}:=\max_{\pi}\mathbb{E}[T^{\pi}_{s\to s'}]$ be the mixing time of the MDP $\mathcal{M}$. We now introduce our assumptions on the MDP $\mathcal{M}$.

Assumption 3.1.

For the MDP $\mathcal{M}$, we have $\|P^{t}_{\pi,s}-P_{\pi}\|\leq C\rho^{t}$, with $P_{\pi}$ being the long-term steady-state distribution induced by policy $\pi$, where $C>0$ and $\rho<1$ are problem-specific constants. Additionally, the mixing time of the MDP $\mathcal{M}$ is finite, i.e., $T_{M}<\infty$. In other words, the MDP $\mathcal{M}$ is ergodic.

Assumption 3.2.

The rewards $r(s,a)$, the costs $c_{i}(s,a)\ \forall\ i$, and the functions $f$ and $g$ are known to the agent.

Assumption 3.3.

The scalarization function $f$ is jointly concave and the constraint function $g$ is jointly convex. Hence, for any arbitrary distributions $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$, the following holds:

$$f\left(\mathbb{E}_{x\sim\mathcal{D}_{1}}\left[x\right]\right) \geq \mathbb{E}_{x\sim\mathcal{D}_{1}}\left[f\left(x\right)\right] \qquad (4)$$
$$g\left(\mathbb{E}_{\mathbf{x}\sim\mathcal{D}_{2}}\left[\mathbf{x}\right]\right) \leq \mathbb{E}_{\mathbf{x}\sim\mathcal{D}_{2}}\left[g\left(\mathbf{x}\right)\right];\ \mathbf{x}\in\mathbb{R}^{d} \qquad (5)$$
Assumption 3.4.

The functions $f$ and $g$ are assumed to be $L$-Lipschitz, i.e.,

$$\left|f\left(x\right)-f\left(y\right)\right| \leq L|x-y|;\ x,y\in\mathbb{R} \qquad (6)$$
$$\left|g\left(\mathbf{x}\right)-g\left(\mathbf{y}\right)\right| \leq L\left\lVert\mathbf{x}-\mathbf{y}\right\rVert_{1};\ \mathbf{x},\mathbf{y}\in\mathbb{R}^{d} \qquad (7)$$
Remark 3.5.

We consider a standard setup of concave and Lipschitz functions as considered by Cheung (2019); Brantley et al. (2020); Yu et al. (2021). Note that the analysis in this paper directly works for $f:\mathbb{R}^{K}\to\mathbb{R}$, where the function takes as input $K$ average per-step rewards for $K$ objectives.

Remark 3.6.

For functions that are not Lipschitz continuous, such as entropy, we can obtain maximum-entropy exploration if we choose the function $f=-\sum_{k}\lambda_{k}\log(\lambda_{k}+\eta)$ with $r_{k}(s,a)=\bm{1}_{\{s_{k},a_{k}\}}$ for a particular state-action pair $(s_{k},a_{k})$, choosing $K=S\times A$ to cover all state-action pairs and using a regularizer $\eta$ (Hazan et al., 2019).
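As a small illustration (our own helper, not from the paper), the resulting surrogate objective can be computed as follows: with the indicator rewards above, the vector of $K=S\times A$ average per-step rewards is exactly the state-action occupancy, and $f$ is its regularized entropy.

import numpy as np

def entropy_surrogate(lmbda, eta=1e-3):
    # lmbda: length-(S*A) vector of average per-step indicator rewards (the occupancy measure);
    # eta: the regularizer of Remark 3.6.
    return -np.sum(lmbda * np.log(lmbda + eta))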

Assumption 3.7.

There exists a policy $\pi$ and a constant $\delta>LdST_{M}\sqrt{(A\log T)/T}+(CSA\log T)/(T(1-\rho))$ such that

$$g\left(\zeta_{\pi}^{P}(1),\cdots,\zeta_{\pi}^{P}(d)\right)\leq-\delta \qquad (8)$$

This assumption is again a standard assumption in the constrained RL literature (Efroni et al., 2020; Ding et al., 2021; 2020; Wei et al., 2022a). $\delta$ is referred to as Slater's constant. Ding et al. (2021) assume that the Slater's constant $\delta$ is known. Wei et al. (2022a) assume that the number of iterations of the algorithm is at least $\tilde{\Omega}(SAH/\delta)^{5}$ for episode length $H$. On the contrary, we simply assume the existence of $\delta$ and a lower bound on the value of $\delta$, which is relaxed as the agent acquires more time to interact with the environment.

Any online algorithm starting with no prior knowledge will need to obtain estimates of the transition probabilities $P$ and observe the reward $r$ and costs $c_{k}$, $\forall\ k\in\{1,\cdots,d\}$, for each state-action pair. Initially, when the algorithm does not have a good estimate of the model, it accumulates regret and also violates constraints as it does not know the optimal policy. We define the reward regret $R(T)$ as the difference between the value of $f$ at the expected reward of the optimal policy $\pi^{*}$ and its value at the average of the rewards obtained over $T$ steps, i.e.,

$$R(T) = f\left(\lambda_{\pi^{*}}^{P}\right)-f\left(\sum\nolimits_{t=1}^{T}r(s_{t},a_{t})/T\right)$$

Additionally, we define the constraint regret $C(T)$ as the positive part of the constraint function evaluated at the averages of the incurred costs, i.e.,

$$C(T) = \left(g\left(\sum\nolimits_{t=1}^{T}c_{1}(s_{t},a_{t})/T,\cdots,\sum\nolimits_{t=1}^{T}c_{d}(s_{t},a_{t})/T\right)\right)_{+},\ \text{where } (x)_{+}=\max(0,x)$$
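As a concrete illustration of these two definitions, here is a minimal sketch (our own helper, not part of the paper) that computes the empirical objective regret and constraint regret from an observed trajectory; $f$, $g$, and the optimal value $\lambda_{\pi^{*}}^{P}$ are assumed to be supplied by the caller.

import numpy as np

def empirical_regret_and_violation(f, g, lam_star, rewards, costs):
    # rewards: length-T array of r(s_t, a_t); costs: (T, d) array with columns c_1, ..., c_d.
    R_T = f(lam_star) - f(np.mean(rewards))           # objective regret R(T)
    C_T = max(0.0, g(np.mean(costs, axis=0)))         # constraint regret C(T) = (g(...))_+
    return R_T, C_T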

In the following section, we present a model-based algorithm that learns the policy $\pi^{*}$, and we bound the reward regret and the constraint regret accumulated by the algorithm.

4 Algorithm

We now present our algorithm UC-CURL and the key ideas used in designing it. Note that if the agent knows the true transition kernel $P$, it can solve the following optimization problem for the optimal feasible policy:

$$\max_{\rho(s,a)}f\big{(}\sum\nolimits_{s,a}r(s,a)\rho(s,a)\big{)} \qquad (9)$$

with the following set of constraints,

$$\sum\nolimits_{s,a}\rho(s,a)=1,\ \ \rho(s,a)\geq 0 \qquad (10)$$
$$\sum\nolimits_{a\in\mathcal{A}}\rho(s',a)=\sum\nolimits_{s,a}P(s'|s,a)\rho(s,a) \qquad (11)$$
$$g\big{(}\sum\nolimits_{s,a}c_{1}(s,a)\rho(s,a),\cdots,\sum\nolimits_{s,a}c_{d}(s,a)\rho(s,a)\big{)}\leq 0 \qquad (12)$$

for all $s'\in\mathcal{S}$, $\forall s\in\mathcal{S}$, and $\forall a\in\mathcal{A}$. Equation (11) denotes the constraint on the transition structure of the underlying Markov process. Equation (10) ensures that the solution is a valid probability distribution. Finally, Equation (12) gives the constraints of the constrained MDP setup which the policy must satisfy. Using the solution $\rho$, we can obtain the optimal policy as:

$$\pi^{*}(a|s)=\frac{\rho(s,a)}{\sum_{b\in\mathcal{A}}\rho(s,b)}\ \ \forall\ s,a \qquad (13)$$
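For concreteness, the following is a minimal sketch of the full-information program in Equations (9)-(13) using cvxpy; the particular concave objective (a log utility) and the convex constraint $g(x)=x-\tau$ (a linear cost budget) are illustrative assumptions, not the paper's choices.

import cvxpy as cp
import numpy as np

def solve_occupancy_program(P, r, c, tau, eta=1e-6):
    # P: (S, A, S) true kernel, r and c: (S, A) reward and cost tables, tau: cost budget.
    S, A, _ = P.shape
    rho = cp.Variable((S, A), nonneg=True)                 # occupancy measure, Eq. (10)
    constraints = [cp.sum(rho) == 1]                       # Eq. (10)
    for sp in range(S):                                    # flow conservation, Eq. (11)
        constraints.append(cp.sum(rho[sp, :]) == cp.sum(cp.multiply(P[:, :, sp], rho)))
    avg_reward = cp.sum(cp.multiply(r, rho))
    avg_cost = cp.sum(cp.multiply(c, rho))
    constraints.append(avg_cost <= tau)                    # Eq. (12) with g(x) = x - tau
    cp.Problem(cp.Maximize(cp.log(avg_reward + eta)), constraints).solve()   # Eq. (9)
    rho_star = np.maximum(rho.value, 0)
    pi_star = rho_star / np.maximum(rho_star.sum(axis=1, keepdims=True), 1e-12)  # Eq. (13)
    return pi_star, rho_star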

However, the agent does not know $P$ and hence cannot solve this optimization problem directly, so it starts learning the transitions with an arbitrary policy. We first note that if the agent does not have complete knowledge of the transitions $P$ of the true MDP $\mathcal{M}$, it should be conservative in its policy to leave room for constraint violations. Based on this idea, we formulate the $\epsilon$-tight optimization problem by modifying the constraint in Equation (12) as:

$$g\big{(}\sum\nolimits_{s,a}c_{1}(s,a)\rho_{\epsilon}(s,a),\cdots,\sum\nolimits_{s,a}c_{d}(s,a)\rho_{\epsilon}(s,a)\big{)}\leq-\epsilon \qquad (14)$$

Let $\rho_{\epsilon}$ be the solution of the $\epsilon$-tight optimization problem; then the optimal conservative policy becomes:

$$\pi_{\epsilon}^{*}(a|s)=\frac{\rho_{\epsilon}(s,a)}{\sum\nolimits_{b\in\mathcal{A}}\rho_{\epsilon}(s,b)}\ \ \forall\ s,a \qquad (15)$$

We are now ready to design our algorithm UC-CURL, which is based on the optimism principle (Jaksch et al., 2010). The UC-CURL algorithm is presented in Algorithm 1. The algorithm proceeds in epochs $e$ and maintains three key variables, $\nu_{e}(s,a)$, $N_{e}(s,a)$, and $\hat{P}(s,a,s')$, for all $s,a$. $\nu_{e}(s,a)$ stores the number of times the state-action pair $(s,a)$ is visited in epoch $e$. $N_{e}(s,a)$ stores the number of times $(s,a)$ is visited until the start of epoch $e$. $\hat{P}(s,a,s')$ stores the number of times the system transitions to state $s'$ after taking action $a$ in state $s$. Another key parameter of the algorithm is $\epsilon_{e}=K\sqrt{(\log t_{e})/t_{e}}$, where $t_{e}$ is the start time of epoch $e$ and $K$ is a configurable constant. Using these variables, the agent solves for the optimal $\epsilon_{e}$-conservative policy for the optimistic MDP by replacing the constraints in Equation (11) by:

$$\sum\nolimits_{a\in\mathcal{A}}\rho(s',a)\leq\sum\nolimits_{s,a}\tilde{P}_{e}(s'|s,a)\rho(s,a) \qquad (16)$$
$$\tilde{P}_{e}(s'|s,a)>0,\ \ \sum\nolimits_{s'}\tilde{P}_{e}(s'|s,a)=1 \qquad (17)$$
$$\Big\|\tilde{P}_{e}(\cdot|s,a)-\frac{\hat{P}(s,a,\cdot)}{1\vee N_{e}(s,a)}\Big\|_{1}\leq\sqrt{\frac{14S\log(2At)}{1\vee N_{e}(s,a)}} \qquad (18)$$

for all $s'\in\mathcal{S}$, $\forall s\in\mathcal{S}$, and $\forall a\in\mathcal{A}$, where $x\vee y=\max(x,y)$. Equation (18) ensures that the agent searches for the optimistic policy within the confidence intervals of the transition probability estimates.

Combining the right hand side of (16) with (10) gives

$$\sum\nolimits_{s'}\sum\nolimits_{s,a}\tilde{P}_{e}(s'|s,a)\rho(s,a)=1=\sum\nolimits_{s',a}\rho(s',a)$$

Thus, jointly with (16), we see that equality in (16) must hold: $\sum_{a}\rho(s',a)$ for some $s'$ can never exceed its bound to compensate for another $s'$, and hence, for all $s'$, $\sum_{a}\rho(s',a)$ lies on the boundary. In other words, the above constraints give $\sum\nolimits_{a\in\mathcal{A}}\rho(s',a)=\sum\nolimits_{s,a}\tilde{P}_{e}(s'|s,a)\rho(s,a)$. Further, we note that the region defined by the constraints is convex. This is because the set $\{x,y,z:xy\geq z\}$ is convex when $x,y,z\geq 0$. We note that even though the optimization problem may look non-convex because the constraints contain products of two variables, Equations (9), (14), and (16)-(18) form a convex optimization problem. We expand on this in Appendix B. We note that Rosenberg & Mansour (2019) provide another approach to obtain a convex optimization problem for the optimistic MDP.
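Below is a sketch (not the authors' implementation) of the epoch-$e$ optimistic $\epsilon_{e}$-tight program, written in the extended occupancy-measure variable $z(s,a,s')=\tilde{P}_{e}(s'|s,a)\rho(s,a)$, i.e., the Rosenberg & Mansour (2019) style reformulation mentioned above, under which Equations (14) and (16)-(18) become linear or convex in $z$. The concrete $f$ (a log utility), $g(x)=x-\tau$, and all names are illustrative assumptions.

import cvxpy as cp
import numpy as np

def solve_eps_tight_optimistic(P_hat, N, r, c, t, eps, tau, eta=1e-6):
    # P_hat: (S, A, S) empirical kernel, N: (S, A) visit counts at the epoch start,
    # t: current time, eps: epsilon_e, tau: cost budget so that g(x) = x - tau.
    S, A, _ = P_hat.shape
    beta = np.sqrt(14 * S * np.log(2 * A * t) / np.maximum(N, 1))      # radius in Eq. (18)
    z = cp.Variable((S * A, S), nonneg=True)                           # z[s*A + a, s']
    rho = cp.sum(z, axis=1)                                            # rho(s, a)
    constraints = [cp.sum(z) == 1]
    for sp in range(S):                                                # flow conservation, Eq. (16)
        constraints.append(cp.sum(z[sp * A:(sp + 1) * A, :]) == cp.sum(z[:, sp]))
    for s in range(S):                                                 # L1 confidence set, Eq. (18),
        for a in range(A):                                             # multiplied through by rho(s,a)
            i = s * A + a
            constraints.append(cp.norm1(z[i, :] - rho[i] * P_hat[s, a, :]) <= beta[s, a] * rho[i])
    avg_reward = r.reshape(-1) @ rho
    avg_cost = c.reshape(-1) @ rho
    constraints.append(avg_cost <= tau - eps)                          # eps-tight constraint, Eq. (14)
    cp.Problem(cp.Maximize(cp.log(avg_reward + eta)), constraints).solve()
    rho_val = np.maximum(z.value.sum(axis=1).reshape(S, A), 0)
    pi_e = rho_val / np.maximum(rho_val.sum(axis=1, keepdims=True), 1e-12)  # Eq. (19)
    return pi_e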

Algorithm 1 UC-CURL

Parameters: $K$
Input: $S$, $A$, $r$, $d$, $c_{i}\ \forall\ i\in[d]$

1:  Let $t=1$, $e=1$, $\epsilon_{e}=K\sqrt{\frac{\ln t}{t}}$
2:  for $(s,a)\in\mathcal{S}\times\mathcal{A}$ do
3:     $\nu_{e}(s,a)=0$, $N_{e}(s,a)=0$, $\widehat{P}(s,a,s^{\prime})=0\ \forall\ s^{\prime}\in\mathcal{S}$
4:  end for
5:  Solve for policy $\pi_{e}$ using Eq. (19)
6:  for $t\in\{1,2,\cdots\}$ do
7:     Observe $s_{t}$, and play $a_{t}\sim\pi_{e}(\cdot|s_{t})$
8:     Observe $s_{t+1}$, $r(s_{t},a_{t})$, and $c_{i}(s_{t},a_{t})\ \forall\ i\in[d]$
9:     $\nu_{e}(s_{t},a_{t})=\nu_{e}(s_{t},a_{t})+1$
10:     $\widehat{P}(s_{t},a_{t},s_{t+1})=\widehat{P}(s_{t},a_{t},s_{t+1})+1$
11:     if $\nu_{e}(s,a)=\max\{1,N_{e}(s,a)\}$ for any $s,a$ then
12:        for $(s,a)\in\mathcal{S}\times\mathcal{A}$ do
13:           $N_{e+1}(s,a)=N_{e}(s,a)+\nu_{e}(s,a)$, $\nu_{e+1}(s,a)=0$
14:        end for
15:        $e=e+1$
16:        $\epsilon_{e}=K\sqrt{\frac{\ln t}{t}}$
17:        Solve for policy $\pi_{e}$ using Eq. (19)
18:     end if
19:  end for

Let $\rho_{e}$ be the solution of the $\epsilon_{e}$-tight optimization problem for the optimistic MDP. Then, we obtain the optimal conservative policy for epoch $e$ as:

$$\pi_{e}(a|s)=\frac{\rho_{e}(s,a)}{\sum_{b\in\mathcal{A}}\rho_{e}(s,b)}\ \ \forall\ s,a \qquad (19)$$

The agent plays the optimistic conservative policy $\pi_{e}$ for epoch $e$. Note that the conservatism parameter $\epsilon_{e}$ decays with time. As the agent interacts with the environment, the system model improves and the agent does not need to be as conservative as before. This allows us to bound both the constraint violations and the objective regret. Further, if during the initial iterations of the algorithm a conservative solution is not feasible, we can ignore the constraints completely. We will show that the conservative behavior is required when $t=\Theta(T)$ to compensate for the violations in the initial period of the algorithm (Appendix E.2).

For the UC-CURL algorithm described in Algorithm 1, we choose $\{\epsilon_{e}\}=\{K\sqrt{(\log t_{e})/t_{e}}\}$. However, if the agent has access to a lower bound $T_{l}$ (Assumption 3.7) on the time horizon $T$, the algorithm can instead set $\epsilon_{e}=K\sqrt{(\ln(t_{e}\vee T_{l}))/(t_{e}\vee T_{l})}\leq\delta$ in each epoch $e$. Note that if $T_{l}=0$, $\epsilon_{e}$ reduces to the value specified in Algorithm 1, and if $T_{l}=T$, $\epsilon_{e}$ becomes constant for all epochs $e$.
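A small sketch of this schedule (our own helper; the floor of 2 in the clipping is only to keep the logarithm positive at the very first step):

import math

def eps_schedule(t_e, K, T_l=0):
    # With no known lower bound (T_l = 0) this is K * sqrt(log(t_e) / t_e); with a known lower
    # bound T_l it is clipped at T_l, and with T_l = T it stays constant over all epochs.
    t = max(t_e, T_l, 2)
    return K * math.sqrt(math.log(t) / t)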

5 Regret Analysis

After describing the UC-CURL algorithm, we now perform the regret and constraint violation analysis. We note that the standard analysis for infinite-horizon tabular MDPs of UCRL2 (Jaksch et al., 2010) cannot be directly applied, as the policy $\pi_{e}$ is possibly stochastic in every epoch. Another peculiar aspect of the analysis of infinite-horizon MDPs is that the regret grows linearly with the number of epochs (or policy switches). This is because a new policy induces a new Markov chain, and this chain takes time to converge to its stationary distribution. The analysis still bounds the regret by $\tilde{O}(T_{M}S\sqrt{A/T})$ as the number of epochs is bounded by $O(SA\log T)$.

Before diving into the details, we first define a few important variables which are key to our analysis. The first variable is the standard $Q$-value function. We define $Q_{\gamma}^{\pi,P}(s,a)$ as the long-term expected reward on taking action $a$ in state $s$ and then following policy $\pi$ on the MDP with transitions $P$. Mathematically, we have

$$Q_{\gamma}^{\pi,P}(s,a)=r(s,a)+\gamma\sum\nolimits_{s'\in\mathcal{S}}P(s'|s,a)V_{\gamma}^{\pi,P}(s');\qquad V_{\gamma}^{\pi,P}(s)=\mathbb{E}_{a\sim\pi}\left[Q_{\gamma}^{\pi,P}(s,a)\right]$$

We also define the Bellman error $B^{\pi,\tilde{P}}(s,a)$ for infinite-horizon MDPs as the difference in cumulative expected rewards obtained from deviating from the system model with transitions $\tilde{P}$ for one step by taking action $a$ in state $s$ and then following policy $\pi$. We have:

$$B^{\pi,\tilde{P}}(s,a) =\lim_{\gamma\to 1}\Big{(}Q_{\gamma}^{\pi,\tilde{P}}(s,a)-r(s,a)-\gamma\sum\nolimits_{s'\in\mathcal{S}}P(s'|s,a)V_{\gamma}^{\pi,\tilde{P}}(s')\Big{)} \qquad (20)$$
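Numerically, one can check that in the limit $\gamma\to 1$ the Bellman error reduces to $B^{\pi,\tilde{P}}(s,a)=\sum_{s'}\big(\tilde{P}(s'|s,a)-P(s'|s,a)\big)\tilde{h}(s')$, where $\tilde{h}$ is the bias of $\pi$ on $\tilde{P}$ (the quantity appearing in Lemma 5.3 below). The following sketch (our own code, assuming an ergodic chain; not the paper's implementation) computes it for tabular inputs.

import numpy as np

def bellman_error(P_true, P_tilde, pi, r):
    S, A, _ = P_true.shape
    P_pi = np.einsum('sa,sap->sp', pi, P_tilde)             # state chain of pi under P_tilde
    r_pi = np.sum(pi * r, axis=1)
    evals, evecs = np.linalg.eig(P_pi.T)
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); mu /= mu.sum()
    lam = mu @ r_pi                                         # gain of pi on P_tilde
    # bias h_tilde solves (I - P_pi) h = r_pi - lam * 1 with mu^T h = 0
    h = np.linalg.solve(np.eye(S) - P_pi + np.outer(np.ones(S), mu), r_pi - lam)
    return np.einsum('sap,p->sa', P_tilde - P_true, h)      # B^{pi, P_tilde}(s, a)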

After defining the key variables, we can now bound the objective regret $R(T)$. Intuitively, the algorithm incurs regret on three accounts. The first source is following the conservative policy, which we require to limit the constraint violations. The second source of regret is solving for the policy which is optimal for the optimistic MDP. The third source of regret is the stochastic behavior of the system. We also note that the constraints are violated because of the imperfect MDP knowledge and the stochastic behavior. However, the conservative behavior actually allows the agent to violate the constraints within some limits, which we discuss in the later part of this section.

We start by stating our first lemma, which bounds the regret due to solving for a conservative policy. We define the $\epsilon_{e}$-tight optimization problem as the optimization problem for the true MDP with transitions $P$ and $\epsilon=\epsilon_{e}$. We bound the gap between the value of the function $f$ at the long-term expected reward of the policy for the $\epsilon_{e}$-tight optimization problem and that for the true optimization problem (Equations (9)-(12)) in the following lemma.

Lemma 5.1.

Let $\lambda_{\pi^{*}}^{P}$ be the long-term average reward of the optimal feasible policy $\pi^{*}$ for the true MDP $\mathcal{M}$, and let $\lambda_{\pi_{e}}^{P}$ be the long-term average reward of the optimal policy $\pi_{e}$ for the $\epsilon_{e}$-tight optimization problem on the true MDP $\mathcal{M}$. Then, for $\epsilon_{e}\leq\delta$, we have

$$f\left(\lambda_{\pi^{*}}^{P}\right)-f\left(\lambda_{\pi_{e}}^{P}\right)\leq 2L\epsilon_{e}/\delta \qquad (21)$$
Proof Sketch.

We construct a policy whose steady-state distribution is a weighted average of two steady-state distributions: the first is that of the optimal policy for the true optimization problem, and the second is that of the policy which satisfies Assumption 3.7. We show that this constructed policy satisfies the $\epsilon_{e}$-tight constraints. Further, using Lipschitz continuity, we convert the difference between function values into the difference between the long-term average rewards to obtain the required result. The detailed proof is provided in Appendix C. ∎

Lemma 5.1 and our construction of the $\epsilon_{e}$ sequence allow us to limit the growth of the regret due to the conservative policy to $\tilde{O}(LdT_{M}S\sqrt{A/T})$.

To bound the regret from the second source, we use a Bellman error-based analysis. In our next lemma, we show that the difference between the performance of a policy on two different MDPs is given by the long-term averaged Bellman error. Formally, we have:

Lemma 5.2.

The difference between the long-term average reward of the optimistic policy $\pi_{e}$ on the optimistic MDP, $\lambda_{\pi_{e}}^{\tilde{P}_{e}}$, and the long-term average reward of the optimistic policy $\pi_{e}$ on the true MDP, $\lambda_{\pi_{e}}^{P}$, is the long-term average Bellman error:

$$\lambda_{\pi_{e}}^{\tilde{P}_{e}}-\lambda_{\pi_{e}}^{P}=\sum\nolimits_{s,a}\rho_{\pi_{e}}^{P}(s,a)B^{\pi_{e},\tilde{P}_{e}}(s,a) \qquad (22)$$
Proof Sketch.

We start by writing $Q_{\gamma}^{\pi_{e},\tilde{P}_{e}}$ in terms of the Bellman error. Subtracting $V_{\gamma}^{\pi_{e},P}$ from $V_{\gamma}^{\pi_{e},\tilde{P}_{e}}$ and using the facts that $\lambda_{\pi_{e}}^{P}=\lim_{\gamma\to 1}(1-\gamma)V_{\gamma}^{\pi_{e},P}$ and $\lambda_{\pi_{e}}^{\tilde{P}_{e}}=\lim_{\gamma\to 1}(1-\gamma)V_{\gamma}^{\pi_{e},\tilde{P}_{e}}$, we obtain the required result. A complete proof is provided in Appendix D.3. ∎

After relating the long-term average rewards of policy $\pi_{e}$ on the two MDPs, we aim to bound the sum of Bellman errors over an epoch. For this, we first bound the Bellman error for a particular state-action pair $(s,a)$ in the following lemma.

Lemma 5.3.

With probability at least $1-1/t_{e}^{6}$, the Bellman error $B^{\pi_{e},\tilde{P}_{e}}(s,a)$ for state-action pair $(s,a)$ in epoch $e$ is upper bounded as

$$B^{\pi_{e},\tilde{P}_{e}}(s,a)\leq\sqrt{\frac{14S\log(2AT)}{1\vee N_{e}(s,a)}}\|\tilde{h}\|_{\infty} \qquad (23)$$

where $N_{e}(s,a)$ is the number of visits to $(s,a)$ until epoch $e$ and $\tilde{h}$ is the bias of the MDP with transition probabilities $\tilde{P}_{e}$.

Proof Sketch.

We start by noting that the Bellman error essentially captures the impact of the difference in the value obtained because of the difference in the transition probabilities to the immediate next state. We bound the difference in transition probabilities between the optimistic MDP and the true MDP using the result of Weissman et al. (2003). This approach gives the required result. A complete proof is provided in Appendix D.3. ∎

We use Lemma 5.2 and Lemma 5.3 to bound the regret due to imperfect knowledge of the system model. We bound the expected Bellman error in epoch $e$ starting from state $s_{t_{e}}$ and action $a_{t_{e}}$ by constructing a martingale sequence with filtration $\mathcal{F}_{t}=\{s_{1},a_{1},\cdots,s_{t-1},a_{t-1}\}$ and using Azuma's inequality (Bercu et al., 2015). Using Azuma's inequality, we can also bound the deviations due to the stochasticity of the Markov decision process. The result is stated in the following lemma, with proof in Appendix D.

Lemma 5.4.

With probability at least $1-T^{-5/4}$, the regret incurred from imperfect model knowledge and process stochasticity is bounded by

$$O(T_{M}S\sqrt{A(\log AT)/T}+(CT_{M}S^{2}A\log T)/(1-\rho)) \qquad (24)$$

The regret analysis framework also prepares us to bound the constraint violations. We again start by identifying the sources of constraint violations. The agent violates the constraints because (1) it plays with imperfect knowledge of the MDP, and (2) the stochasticity of the MDP results in deviations from the average costs. We note that the conservative policy $\pi_{e}$ for every epoch does not itself violate the constraints, but instead allows the agent to manage the constraint violations caused by the imperfect model knowledge and the system dynamics.

We note that the Lipschitz continuity of the constraint function $g$ allows us to convert a function of the $d$ averaged costs into a sum of the $d$ averaged costs. Further, we note that we can treat the costs similarly to rewards (Brantley et al., 2020). This property allows us to bound the costs incurred in a way similar to how we bound the gap from the optimal reward, by $LdT_{M}S\sqrt{A(\log AT)/T}$. We now want the slackness provided by the conservative policy to absorb $LdT_{M}S\sqrt{A(\log AT)/T}$ constraint violations. This is ensured by our chosen $\epsilon_{e}$ sequence. We formally state this result in the following lemma, proven in parts in Appendix D and Appendix E.

Lemma 5.5.

The cumulative sum of the ϵe\epsilon_{e} sequence is upper and lower bounded as

$$\sum\nolimits_{e=1}^{E}(t_{e+1}-t_{e})\epsilon_{e}=\Theta\left(K\sqrt{T\log T}\right) \qquad (25)$$

After detailing the bounds on the possible sources of regret and constraint violations, we can formally state the result in the form of the following theorem.

Theorem 5.6.

For all $T$ and $K=\Theta(LdT_{M}S\sqrt{A}+CSA/(1-\rho))$, the regret $R(T)$ of the UC-CURL algorithm is bounded as

$$R(T)=O\left(\frac{1}{\delta}LdT_{M}S\sqrt{A\frac{\log AT}{T}}+\frac{CT_{M}S^{2}A\log T}{(1-\rho)T}\right) \qquad (26)$$

and the constraint violations are bounded as $C(T)=0$, with probability at least $1-\frac{1}{T^{5/4}}$.

5.1 Posterior Sampling Algorithm

We can also modify the analysis to obtain a Bayesian regret bound for a posterior sampling version of the UC-CURL algorithm using Lemma 1 of Osband et al. (2013). In the posterior sampling algorithm, instead of finding the optimistic MDP, we sample the transition probabilities $\tilde{P}_{e}$ from an updated posterior. This sampling reduces the complexity of the optimization problem by eliminating Eq. (17) and Eq. (18). The complete algorithm is described in Appendix G. We note that the optimization problem for the UC-CURL algorithm is feasible because the true MDP lies in the confidence interval. However, for the sampled MDP, obtaining feasibility requires a stronger Slater's condition.

5.2 Further Modifications

The proposed algorithm and the analysis can be easily extended to $M$ convex constraints $g_{1},\cdots,g_{M}$ by applying union bounds. Further, our analysis uses Proposition 1 of Jaksch et al. (2010) to bound the number of epochs by $O(SA\log_{2}T)$. However, we can improve the empirical performance of the UC-CURL algorithm by modifying the epoch trigger condition (Line 11 of Algorithm 1), as sketched below. Triggering a new epoch whenever $\nu_{e}(s,a)$ becomes $\max\{1,\nu_{e-1}(s,a)+1\}$ for any state-action pair results in linearly increasing epoch lengths, with the total number of epochs bounded by $O(SA+\sqrt{SAT})$. This modification results in better empirical performance (see Section 6 for simulations) at the cost of a higher theoretical regret bound and the computational complexity of obtaining a new policy at every epoch.
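For illustration, the two trigger conditions can be written as follows (a sketch with our own variable names: nu_e and N_e are the per-epoch and cumulative visit-count arrays of Algorithm 1, and nu_prev stores the previous epoch's visit counts for the linearly growing variant).

import numpy as np

def new_epoch_doubling(nu_e, N_e):
    return np.any(nu_e >= np.maximum(1, N_e))          # trigger of Line 11 in Algorithm 1

def new_epoch_linear(nu_e, nu_prev):
    return np.any(nu_e >= np.maximum(1, nu_prev + 1))  # linearly increasing epoch lengths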

6 Simulation Results

To validate the performance of the UC-CURL algorithm and the PS-CURL algorithm, we run simulations on the flow and service control problem in a single-server queue, which was introduced in (Altman & Schwartz, 1991). Along with validating the performance of the proposed algorithms, we also compare them against the algorithms proposed by Singh et al. (2020) and Chen et al. (2022) for model-based constrained reinforcement learning for infinite-horizon MDPs. Compared to these algorithms, we note that our algorithms are also designed to handle concave objectives of expected rewards with convex constraints on costs and 0 constraint violations.

In the queue environment, a discrete-time single-server queue with a buffer of finite size $L$ is considered. The number of customers waiting in the queue is the state of the problem, and thus $|S|=L+1$. Two kinds of actions, service and flow, are considered; together they control the number of customers. The action space for service is a finite subset $A$ of $[a_{min},a_{max}]$, where $0<a_{min}\leq a_{max}<1$. Given a specific service action $a$, the service of a customer is successfully completed with probability $a$. If the service is successful, the length of the queue is reduced by 1. Similarly, the action space for flow is a finite subset $B$ of $[b_{min},b_{max}]$. In contrast to the service action, a flow action $b$ increases the queue length by 1 with probability $b$. Also, we assume that no customer arrives when the queue is full. The overall action space is the Cartesian product of $A$ and $B$. From the service and flow probabilities, the transition probabilities can be computed; they are given in Table 2.

Table 2: Transition probabilities of the queue system
Current state | $P(x_{t+1}=x_{t}-1)$ | $P(x_{t+1}=x_{t})$ | $P(x_{t+1}=x_{t}+1)$
$1\leq x_{t}\leq L-1$ | $a(1-b)$ | $ab+(1-a)(1-b)$ | $(1-a)b$
$x_{t}=L$ | $a$ | $1-a$ | $0$
$x_{t}=0$ | $0$ | $1-b(1-a)$ | $b(1-a)$
Figure 1: Performance of the proposed UC-CURL and PS-CURL algorithms on a flow and service control problem for a single queue with doubling epoch lengths and linearly increasing epoch lengths, compared against Chen et al. (2022) and Singh et al. (2020). Panels: (a) reward growth w.r.t. time, (b) regret w.r.t. time, (c) service constraints w.r.t. time, (d) flow constraints w.r.t. time.

Define the reward function as $r(s,a,b)$ and the costs for the service and flow constraints as $c^{1}(s,a,b)$ and $c^{2}(s,a,b)$, respectively. Define the stationary policies for service and flow as $\pi_{a}$ and $\pi_{b}$, respectively. Then, the problem can be written as

$$\begin{split}\max_{\pi_{a},\pi_{b}}&\quad\lim\limits_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}r(s_{t},\pi_{a}(s_{t}),\pi_{b}(s_{t}))\\ \text{s.t.}&\quad\lim\limits_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}c^{1}(s_{t},\pi_{a}(s_{t}),\pi_{b}(s_{t}))\geq 0\\ &\quad\lim\limits_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}c^{2}(s_{t},\pi_{a}(s_{t}),\pi_{b}(s_{t}))\geq 0\end{split} \qquad (27)$$

Following the discussion in (Altman & Schwartz, 1991), we define the reward function as $r(s,a,b)=5-s$, which is a decreasing function that depends only on the state. It is reasonable to give a higher reward when the number of customers waiting in the queue is small. For the constraint functions, we define $c^{1}(s,a,b)=-10a+6$ and $c^{2}(s,a,b)=-8(1-b)^{2}+2$, which depend only on the service and flow actions, respectively. A higher constraint value is given if the service probability is low or the flow probability is high, respectively.
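For reference, a compact sketch of this simulation environment (our own helper; the function name and array layout are ours), built from Table 2 and the reward/cost definitions above:

import numpy as np

def queue_mdp(L=5, serv=(0.2, 0.4, 0.6, 0.8), flow=(0.4, 0.5, 0.6, 0.7)):
    S = L + 1
    actions = [(a, b) for a in serv for b in flow]       # Cartesian product of A and B
    P = np.zeros((S, len(actions), S))
    r = np.zeros((S, len(actions)))
    c1 = np.zeros((S, len(actions)))
    c2 = np.zeros((S, len(actions)))
    for s in range(S):
        for k, (a, b) in enumerate(actions):
            if s == 0:                                   # empty queue (Table 2, x_t = 0)
                P[s, k, 0], P[s, k, 1] = 1 - b * (1 - a), b * (1 - a)
            elif s == L:                                 # full queue, no arrivals (x_t = L)
                P[s, k, L - 1], P[s, k, L] = a, 1 - a
            else:                                        # 1 <= x_t <= L - 1
                P[s, k, s - 1] = a * (1 - b)
                P[s, k, s] = a * b + (1 - a) * (1 - b)
                P[s, k, s + 1] = (1 - a) * b
            r[s, k] = 5 - s                              # r(s, a, b) = 5 - s
            c1[s, k] = -10 * a + 6                       # service constraint cost c^1
            c2[s, k] = -8 * (1 - b) ** 2 + 2             # flow constraint cost c^2
    return P, r, c1, c2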

In the simulation, the length of the buffer is set to $L=5$. The service action space is set to $[0.2,0.4,0.6,0.8]$ and the flow action space is set to $[0.4,0.5,0.6,0.7]$. We use a horizon of $T=5\times 10^{5}$ and run $50$ independent simulations of all algorithms. The experiments were run on a $36$-core Intel i9 CPU @ $3.00$ GHz with $64$ GB of RAM. The results are shown in Figure 1. The average values of the cumulative reward and the constraint functions are shown as solid lines. Further, we plot the standard deviation around the mean value as a shaded region to show the random error. In order to compare this result to the optimal, we assume that full information of the transition dynamics is known and then use linear programming to solve the problem. The optimal average reward for the constrained optimization is calculated to be 4.48, with both the flow and service constraint values equal to 0. The optimal average reward for the unconstrained optimization is 4.8, with a service constraint value of $-2$ and a flow constraint value of $-0.88$.

We now discuss the performance of all the algorithms, starting with our algorithms UC-CURL and PS-CURL. In Figure 1, we observe that the proposed UC-CURL algorithm (Algorithm 1) does not perform well initially. We observe that this is because the confidence interval radius $\sqrt{S\log(At)/N(s,a)}$ for any $(s,a)$ is not tight enough in the initial period. After the algorithm collects sufficient samples to construct tight confidence intervals around the transition probabilities, it starts converging towards the optimal policy. We also note that the linear-epoch modification of the algorithm works better than the doubling-epoch algorithm presented in Algorithm 1. This is because the linear-epoch variant updates the policy quickly, whereas the doubling-epoch algorithm works with the same policy for too long and thus loses the advantage of the collected samples. For our implementation, we choose the value of the parameter $K$ in Algorithm 1 as $K=1$, with which we observe that the constraint values converge towards zero.

We now analyse the performance of the PS-CURL algorithm. For our implementation of the PS-CURL algorithm, we sample the transition probabilities using a Dirichlet distribution. Note that the true transition probabilities were not sampled from a Dirichlet distribution, and hence this experiment also shows robustness against misspecified priors. We observe that the algorithm quickly brings the reward close to the optimal reward. The performance of the PS-CURL algorithm is significantly better than that of the UC-CURL algorithm. We suspect this is because the UC-CURL algorithm wastes a large number of steps finding an optimistic policy within a large confidence interval. This observation aligns with the TSDE algorithm (Ouyang et al., 2017), where it is shown that a Thompson sampling algorithm with $O(\sqrt{SAT})$ epochs performs empirically better than the optimism-based UCRL2 algorithm (Jaksch et al., 2010) with $O(SA\log T)$ epochs. Osband et al. (2013) also made a similar observation, where their PSRL algorithm worked better than the UCRL2 algorithm. Again, we set the value of the parameter $K$ to 1, and with $K=1$ the algorithm does not violate the constraints. We also observe that the standard deviations of the rewards and constraints are higher for the PS-CURL algorithm than for the UC-CURL algorithm, as the PS-CURL algorithm has an additional stochastic component arising from sampling the transition probabilities.

After analysing the algorithms presented in this paper, we now analyse the performance of the algorithm by Chen et al. (2022). They provide an optimistic online mirror descent algorithm which also works with a conservative parameter to tightly bound the constraint violations. Their algorithm also obtains an $O(\sqrt{T})$ regret bound. However, their algorithm is designed for a linear reward/constraint setup with a single constraint, and empirically the algorithm is difficult to tune as it requires additional knowledge of $T_{M}$, $\rho$, $\delta$, and $T$ to fine-tune its parameters. We set the learning rate $\theta$ for online mirror descent to $5\times 10^{-2}$ with an episode length of $5\times 10^{3}$. Further, we scale the rewards and costs to ensure that they lie between 0 and 1. We analyze the behavior of the optimistic online mirror descent algorithm in Figure 1(b). We observe that the algorithm has three phases. The first phase is the first episode, where the algorithm uses a uniform policy; this is the initial flat area over the first $5000$ steps. In the second phase, the algorithm updates the policy for the first time and starts converging to the optimal policy at a rate which matches that of the PS-CURL algorithm. However, after a few policy updates, we observe that the algorithm exhibits oscillatory behavior, which is because the dual variable updates require online constraint violations.

Finally, we analyze the algorithm by Singh et al. (2020). They also provide an algorithm which proceeds in epochs and solves an optimization problem at every epoch. The algorithm considers a fixed epoch length of $T^{1/3}$. Further, the algorithm considers a confidence interval on each estimate of $P(s'|s,a)$ for every $s,a,s'$ triplet. The algorithm does not perform well, even though it updates the policy most frequently, because it creates confidence intervals on the individual transition probabilities $P(s'|s,a)$ instead of on the probability vector $P(\cdot|s,a)$.

From the experimental observations, we note that the proposed UC-CURL algorithm is suitable in cases where parameter tuning is not possible and the system requires tighter bounds on the deviation of the algorithm's performance. The PS-CURL algorithm can be used in cases where the variance in the algorithm's performance can be tolerated or computational complexity is a constraint. Further, for both algorithms, it is beneficial to use linearly increasing epoch lengths. Additionally, the algorithm by Chen et al. (2022) is suitable for cases where solving an optimization problem is not feasible, for example on an embedded system, as it updates the policy using an exponential function which can be easily computed. However, that algorithm is only applicable to settings with linear rewards/constraints and a single constraint.

7 Conclusion

We considered the problem of Markov decision processes with a concave objective and convex constraints. For this problem, we proposed the UC-CURL algorithm, which works on the principle of optimism. To bound the constraint violations, we solve for a conservative policy using an optimistic model for an $\epsilon$-tight optimization problem. Using an analysis based on the Bellman error for infinite-horizon MDPs, we show that the UC-CURL algorithm achieves 0 constraint violations with a regret bound of $\tilde{O}(LdT_{M}S\sqrt{A/T}+(CSA\log T)/(T(1-\rho)))$. Further, to reduce the computational complexity of finding the optimistic MDP, we also propose a posterior sampling algorithm which finds the optimal policy for a sampled MDP. We provide a Bayesian regret bound of $\tilde{O}(LdT_{M}S\sqrt{A/T}+(CT_{M}S^{2}A\log T)/(T(1-\rho)))$ for the posterior sampling algorithm by considering a stronger Slater's condition to ensure feasibility of the constrained optimization for the sampled MDPs as well. As part of potential future work, we consider dynamically configuring $K$ to be an interesting and important direction to reduce the requirement of problem parameters.

References

  • Agarwal & Aggarwal (2022) Mridul Agarwal and Vaneet Aggarwal. Reinforcement learning for joint optimization of multiple rewards. Accepted to Journal of Machine Learning Research, 2022.
  • Agarwal et al. (2022a) Mridul Agarwal, Vaneet Aggarwal, and Tian Lan. Multi-objective reinforcement learning with non-linear scalarization. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pp.  9–17, 2022a.
  • Agarwal et al. (2022b) Mridul Agarwal, Qinbo Bai, and Vaneet Aggarwal. Regret guarantees for model-based reinforcement learning with long-term average constraints. In The 38th Conference on Uncertainty in Artificial Intelligence, 2022b.
  • Altman & Schwartz (1991) E. Altman and A. Schwartz. Adaptive control of constrained markov chains. IEEE Transactions on Automatic Control, 36(4):454–462, 1991. doi: 10.1109/9.75103.
  • Altman (1999) Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
  • Bai et al. (2022a) Qinbo Bai, Mridul Agarwal, and Vaneet Aggarwal. Joint optimization of concave scalarized multi-objective reinforcement learning with policy gradient based algorithm. Journal of Artificial Intelligence Research, 74:1565–1597, 2022a.
  • Bai et al. (2022b) Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel, and Vaneet Aggarwal. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  3682–3689, 2022b.
  • Bai et al. (2023) Qinbo Bai, Amrit Singh Bedi, and Vaneet Aggarwal. Achieving zero constraint violation for constrained reinforcement learning via conservative natural policy gradient primal-dual algorithm. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
  • Bercu et al. (2015) Bernard Bercu, Bernard Delyon, and Emmanuel Rio. Concentration inequalities for sums and martingales. Springer, 2015.
  • Brantley et al. (2020) Kianté Brantley, Miro Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. Advances in Neural Information Processing Systems, 33:16315–16326, 2020.
  • Bubeck et al. (2015) Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
  • Chen et al. (2021) Jingdi Chen, Yimeng Wang, and Tian Lan. Bringing fairness to actor-critic reinforcement learning for network utility optimization. In IEEE INFOCOM 2021-IEEE Conference on Computer Communications, pp.  1–10. IEEE, 2021.
  • Chen et al. (2022) Liyu Chen, Rahul Jain, and Haipeng Luo. Learning infinite-horizon average-reward Markov decision process with constraints. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  3246–3270. PMLR, 17–23 Jul 2022.
  • Cheung (2019) Wang Chi Cheung. Regret minimization for reinforcement learning with vectorial feedback and complex objectives. Advances in Neural Information Processing Systems, 32:726–736, 2019.
  • Cui et al. (2019) Wei Cui, Kaiming Shen, and Wei Yu. Spatial deep learning for wireless scheduling. IEEE Journal on Selected Areas in Communications, 37(6):1248–1261, 2019.
  • Ding et al. (2020) Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33, 2020.
  • Ding et al. (2021) Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In International Conference on Artificial Intelligence and Statistics, pp.  3304–3312. PMLR, 2021.
  • Efroni et al. (2020) Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps. arXiv preprint arXiv:2003.02189, 2020.
  • Fruit et al. (2018) Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pp. 1578–1586. PMLR, 2018.
  • Gattami et al. (2021) Ather Gattami, Qinbo Bai, and Vaneet Aggarwal. Reinforcement learning for constrained markov decision processes. In International Conference on Artificial Intelligence and Statistics, pp.  2656–2664. PMLR, 2021.
  • Ghasemipour et al. (2020) Seyed Kamyar Seyed Ghasemipour, Richard Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pp.  1259–1277. PMLR, 2020.
  • Hazan et al. (2019) Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pp. 2681–2691. PMLR, 2019.
  • Jaksch et al. (2010) Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Jin et al. (2017) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pp. 1724–1732. PMLR, 2017.
  • Kalagarla et al. (2021) Krishna C Kalagarla, Rahul Jain, and Pierluigi Nuzzo. A sample-efficient algorithm for episodic finite-horizon mdp with constraints. 35(9):8030–8037, 2021.
  • Kwan et al. (2009) Raymond Kwan, Cyril Leung, and Jie Zhang. Proportional fair multiuser scheduling in lte. IEEE Signal Processing Letters, 16(6):461–464, 2009.
  • Lan et al. (2010) Tian Lan, David Kao, Mung Chiang, and Ashutosh Sabharwal. An axiomatic theory of fairness in network resource allocation. IEEE, 2010.
  • Langford & Kakade (2002) J Langford and S Kakade. Approximately optimal approximate reinforcement learning. In Proceedings of ICML, 2002.
  • Le et al. (2019) Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712. PMLR, 2019.
  • Liu et al. (2021) Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps. Advances in Neural Information Processing Systems, 34:17183–17193, 2021.
  • Osband et al. (2013) Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013.
  • Ouyang et al. (2017) Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown markov decision processes: A thompson sampling approach. Advances in neural information processing systems, 30, 2017.
  • Puterman (2014) Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Roijers et al. (2013) Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.
  • Rosenberg & Mansour (2019) Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial markov decision processes. In International Conference on Machine Learning, pp. 5478–5486. PMLR, 2019.
  • Singh et al. (2020) Rahul Singh, Abhishek Gupta, and Ness B Shroff. Learning in markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.
  • Tessler et al. (2018) Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward constrained policy optimization. In International Conference on Learning Representations, 2018.
  • Wei et al. (2022a) Honghao Wei, Xin Liu, and Lei Ying. Triple-q: A model-free algorithm for constrained reinforcement learning with sublinear regret and zero constraint violation. In International Conference on Artificial Intelligence and Statistics, pp.  3274–3307. PMLR, 2022a.
  • Wei et al. (2022b) Honghao Wei, Xin Liu, and Lei Ying. A provably-efficient model-free algorithm for infinite-horizon average-reward constrained markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022b.
  • Weissman et al. (2003) Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
  • Wierman (2011) Adam Wierman. Fairness and scheduling in single server queues. Surveys in Operations Research and Management Science, 16(1):39–48, 2011.
  • Xu et al. (2021) Tengyu Xu, Yingbin Liang, and Guanghui Lan. Crpo: A new approach for safe reinforcement learning with convergence guarantee. In International Conference on Machine Learning, pp. 11480–11491. PMLR, 2021.
  • Yu et al. (2021) Tiancheng Yu, Yi Tian, Jingzhao Zhang, and Suvrit Sra. Provably efficient algorithms for multi-objective competitive rl. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.  12167–12176. PMLR, 18–24 Jul 2021.
  • Zhang et al. (2020) Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and Mengdi Wang. Variational policy gradient method for reinforcement learning with general utilities. Advances in Neural Information Processing Systems, 33:4572–4583, 2020.
  • Zhang et al. (2021) Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvari, and Mengdi Wang. On the convergence and sample efficiency of variance-reduced policy gradient method. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Re_VXFOyyO.
  • Zheng & Ratliff (2020) Liyuan Zheng and Lillian Ratliff. Constrained upper confidence reinforcement learning. In Learning for Dynamics and Control, pp.  620–629. PMLR, 2020.

Appendix A Assumptions and their justification

We first introduce our initial assumptions on the MDP \mathcal{M}. We assume the MDP \mathcal{M} is ergodic. Ergodicity is a commonly used assumption in the constrained RL literature Singh et al. (2020); Chen et al. (2022). Further, ergodicity is required to obtain stationary Markovian policies which can be transferred from the training setup to the test environment. Let Pπ,stP^{t}_{\pi,s} denote the tt-step transition probability on following policy π\pi in MDP \mathcal{M} starting from some state ss. Also, let TssπT_{s\to s^{\prime}}^{\pi} denote the time taken by the Markov chain induced by the policy π\pi to hit state ss^{\prime} starting from state ss. Building on these variables, Pπ,stP^{t}_{\pi,s} and TssπT_{s\to s^{\prime}}^{\pi}, we make our first assumption as follows:

Assumption A.1.

The MDP \mathcal{M} is ergodic, or

Pπ,stPπCρt\displaystyle\|P^{t}_{\pi,s}-P_{\pi}\|\leq C\rho^{t} (28)

where PπP_{\pi} is the long-term steady state distribution induced by policy π\pi, and C>0C>0 and ρ<1\rho<1 are problem specific constants. Also, we have

TM:=maxπ𝔼[Tssπ]<\displaystyle T_{M}:=\max_{\pi}\mathbb{E}[T^{\pi}_{s\to s^{\prime}}]<\infty (29)

where TMT_{M} is the finite mixing time of the MDP \mathcal{M}.

We note that in most problems, the rewards are engineered according to the task at hand and are therefore known to the designer. However, the system dynamics are stochastic and typically not known. Based on this, we make the following assumption on the rewards, costs, and the functions ff and gg.

Assumption A.2.

The rewards r(s,a)r(s,a), the costs ci(s,a);ic_{i}(s,a);\forall\ i and the functions ff and gg are known to the agent.

Our next assumption is on the functions ff and gg. Many practically implemented fairness objectives are concave (Kwan et al., 2009); alternatively, the agent may want to explore all possible state-action pairs by maximizing the entropy of the long-term state-action distribution (Hazan et al., 2019), or may want to minimize the divergence with respect to a certain expert policy (Ghasemipour et al., 2020). Formally, we have

Assumption A.3.

The scalarization function ff is jointly concave and the constraints gg are jointly convex. Hence for any arbitrary distributions 𝒟1\mathcal{D}_{1} and 𝒟2\mathcal{D}_{2}, the following holds.

f(𝔼x𝒟1[x])\displaystyle f\left(\mathbb{E}_{x\sim\mathcal{D}_{1}}\left[x\right]\right) 𝔼x𝒟1[f(x)]\displaystyle\geq\mathbb{E}_{x\sim\mathcal{D}_{1}}\left[f\left(x\right)\right] (30)
g(𝔼𝐱𝒟2[𝐱])\displaystyle g\left(\mathbb{E}_{\mathbf{x}\sim\mathcal{D}_{2}}\left[\mathbf{x}\right]\right) 𝔼𝐱𝒟2[g(𝐱)];𝐱d\displaystyle\leq\mathbb{E}_{\mathbf{x}\sim\mathcal{D}_{2}}\left[g\left(\mathbf{x}\right)\right];\ \mathbf{x}\in\mathbb{R}^{d} (31)

We impose an additional assumption on the functions ff and gg. We assume that the functions are continuous, and Lipschitz continuous in particular. Lipschitz continuity is a common assumption in the optimization literature (Bubeck et al., 2015; Jin et al., 2017; Zhang et al., 2020). Additionally, this assumption typically holds in practice, and can be enforced by adding some regularization. We have,

Assumption A.4.

The functions ff and gg are assumed to be LL-Lipschitz, i.e.,

|f(x)f(y)|\displaystyle\left|f\left(x\right)-f\left(y\right)\right| L|xy|;x,y\displaystyle\leq L|x-y|;\ x,y\in\mathbb{R} (32)
|g(𝐱)g(𝐲)|\displaystyle\left|g\left(\mathbf{x}\right)-g\left(\mathbf{y}\right)\right| L𝐱𝐲1;𝐱,𝐲d\displaystyle\leq L\left\lVert\mathbf{x}-\mathbf{y}\right\rVert_{1};\ \mathbf{x},\mathbf{y}\in\mathbb{R}^{d} (33)

We consider a standard setup of concave and Lipschitz functions as considered by (Cheung, 2019; Brantley et al., 2020; Yu et al., 2021). Note that the analysis in this paper directly works for f:Kf:\mathbb{R}^{K}\to\mathbb{R}, where the function takes as input multiple average per-step rewards. We can obtain maximum-entropy exploration if we choose the function f=kλklog(λk+η)f=-\sum_{k}\lambda_{k}\log(\lambda_{k}+\eta) with rk(s,a)=𝟏{sk,ak}r_{k}(s,a)=\bm{1}_{\{s_{k},a_{k}\}} for a particular state-action pair sk,aks_{k},a_{k}, choosing K=S×AK=S\times A to cover all state-action pairs, with a regularizer η\eta. A small sketch of this maximum-entropy instantiation is given below.
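
As an illustration of this construction, the following Python sketch builds the indicator rewards rk(s,a)r_{k}(s,a) and evaluates the regularized entropy objective on two candidate state-action occupancy measures. The toy dimensions, the regularizer value, the helper f_entropy, and the hand-picked occupancy vectors are illustrative assumptions, not part of the algorithm.

```python
# A minimal sketch of the maximum-entropy instantiation described above, assuming
# toy dimensions S, A, a regularizer eta, and hand-picked occupancy measures.
import numpy as np

S, A, eta = 4, 2, 1e-3
K = S * A                                     # one reward signal per state-action pair

def f_entropy(lam, eta=eta):
    """f(lambda) = -sum_k lambda_k log(lambda_k + eta); concave in lambda."""
    lam = np.asarray(lam)
    return -(lam * np.log(lam + eta)).sum()

# r_k(s, a) = 1{(s, a) = (s_k, a_k)}: the long-term average of r_k equals the
# occupancy measure of the pair (s_k, a_k).
r = np.eye(K).reshape(K, S, A)

d_uniform = np.full((S, A), 1.0 / K)          # uniform state-action occupancy
d_peaked = np.zeros((S, A))
d_peaked[0, 0] = 1.0                          # all mass on a single pair

lam_uniform = np.einsum('sa,ksa->k', d_uniform, r)   # lambda_k = occupancy of pair k
lam_peaked = np.einsum('sa,ksa->k', d_peaked, r)
print(f_entropy(lam_uniform) > f_entropy(lam_peaked))  # True: uniform occupancy has higher entropy
```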

Next, we assume the following Slater’s condition to hold.

Assumption A.5.

There exists a policy π\pi and a constant δ>LdSTMA/T\delta>LdST_{M}\sqrt{A/T} such that

g(ζπP(1),,ζπP(d))δ\displaystyle g\left(\zeta_{\pi}^{P}(1),\cdots,\zeta_{\pi}^{P}(d)\right)\leq-\delta (34)

Further, if there is a (possibly unknown) lower bound on the time horizon, Tlexp(1)T_{l}\geq\exp{(1)}, then we only require δ>LdSTMA(logTl)/Tl\delta>LdST_{M}\sqrt{A(\log T_{l})/T_{l}}. This assumption is again standard in the constrained RL literature (Efroni et al., 2020; Ding et al., 2021; 2020; Wei et al., 2022a), and δ\delta is referred to as the Slater’s constant. (Ding et al., 2021) assumes that the Slater’s constant δ\delta is known. (Wei et al., 2022a) assumes that the number of iterations of the algorithm is at least Ω~(SAH/δ)5\tilde{\Omega}(SAH/\delta)^{5} for episode length HH. On the contrary, we simply assume the existence of δ\delta and a lower bound on its value, which can be relaxed as the agent acquires more time to interact with the environment.

Appendix B Efficiently solving the Conservative Optimistic Optimization problem

We now provide the details on efficiently solving the optimistic optimization problem described with constraints in Equations (16)-(18). Similar to the method proposed in Rosenberg & Mansour (2019), we define a new variable p(s,a,s)p(s,a,s^{\prime}) which denotes the probability of being in state ss, taking action aa, and then moving to state ss^{\prime}. Now, the transition probability to the next state ss^{\prime} given the current state ss and action aa is given as:

P(s|s,a)=p(s,a,s)sp(s,a,s)\displaystyle P(s^{\prime}|s,a)=\frac{p(s,a,s^{\prime})}{\sum_{s^{\prime}}p(s,a,s^{\prime})} (35)

Further, the occupancy measure of state-action pair s,as,a is given as

ρ(s,a)=sp(s,a,s)\displaystyle\rho(s,a)=\sum_{s^{\prime}}p(s,a,s^{\prime}) (36)

Based on these two observations, at the beginning of epoch ee, we define the optimization problem as follows:

maxp(s,a,s)f(s,a((sp(s,a,s))r(s,a)))\displaystyle\max_{p(s,a,s^{\prime})}f\left(\sum_{s,a}\left(\left(\sum_{s^{\prime}}p(s,a,s^{\prime})\right)r(s,a)\right)\right) (37)

subject to following constraints

s,a,sp(s,a,s)=1,p(s,a,s)0\displaystyle\sum_{s,a,s^{\prime}}p(s,a,s^{\prime})=1,p(s,a,s^{\prime})\geq 0 (38)
s,ap(s,a,s)=s,ap(s,a,s)\displaystyle\sum_{s^{\prime},a}p(s,a,s^{\prime})=\sum_{s^{\prime},a}p(s^{\prime},a,s) (39)
g(s,a((sp(s,a,s))c1(s,a)),,s,a((sp(s,a,s))cd(s,a)))−ϵe\displaystyle g\left(\sum_{s,a}\left(\left(\sum_{s^{\prime}}p(s,a,s^{\prime})\right)c_{1}(s,a)\right),\cdots,\sum_{s,a}\left(\left(\sum_{s^{\prime}}p(s,a,s^{\prime})\right)c_{d}(s,a)\right)\right)\leq-\epsilon_{e} (40)
p(s,a,s)P^(s,a,s)1Ne(s,a)sp(s,a,s)α(s,a,s)\displaystyle p(s,a,s^{\prime})-\frac{\hat{P}(s,a,s^{\prime})}{1\vee N_{e}(s,a)}\sum_{s^{\prime}}p(s,a,s^{\prime})\leq\alpha(s,a,s^{\prime}) (41)
P^(s,a,s)1Ne(s,a)sp(s,a,s)p(s,a,s)α(s,a,s)\displaystyle\frac{\hat{P}(s,a,s^{\prime})}{1\vee N_{e}(s,a)}\sum_{s^{\prime}}p(s,a,s^{\prime})-p(s,a,s^{\prime})\leq\alpha(s,a,s^{\prime}) (42)
sα(s,a,s)14Slog(2At)1Ne(s,a)sp(s,a,s)\displaystyle\sum_{s^{\prime}}\alpha(s,a,s^{\prime})\leq\sqrt{\frac{14S\log(2At)}{1\vee N_{e}(s,a)}}\sum_{s^{\prime}}p(s,a,s^{\prime}) (43)

for all s𝒮,a𝒜s\in\mathcal{S},a\in\mathcal{A}, and s𝒮s^{\prime}\in\mathcal{S}. Also, α(s,a,s)\alpha(s,a,s^{\prime}) is an auxiliary variable introduced to reduce the complexity of the 1\ell_{1}-norm constraints and to present the optimization problem as a disciplined convex program which can be coded easily in CVXPY. Equations (41), (42), and (43) jointly describe the 1\ell_{1} confidence interval on the probability estimates. A sketch of this program is given below.
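
The following CVXPY sketch instantiates the program above for a single epoch. The helper solve_epoch, the logarithmic objective standing in for ff, and the single linear constraint g(x)=xCg(x)=x-C (through the hypothetical argument C_limit) are illustrative assumptions made only to keep the sketch a disciplined convex program; any concave ff and convex gg expressible in CVXPY can be substituted.

```python
# A minimal sketch of the epoch-e optimistic, epsilon_e-tight program of Eqs. (37)-(43),
# under illustrative choices of f (log of the average reward) and a linear g(x) = x - C.
import numpy as np
import cvxpy as cp

def solve_epoch(P_hat, N_e, r, c, C_limit, eps_e, t):
    """P_hat: (S, A, S) empirical transitions, N_e: (S, A) visit counts, r, c: (S, A)."""
    S, A, _ = P_hat.shape
    p = cp.Variable((S * A, S), nonneg=True)        # p[(s, a), s'], Eq. (38)
    alpha = cp.Variable((S * A, S), nonneg=True)    # auxiliary variable of Eqs. (41)-(43)
    rho = cp.sum(p, axis=1)                         # occupancy measure rho(s, a), Eq. (36)

    beta = np.sqrt(14 * S * np.log(2 * A * t) / np.maximum(1, N_e)).reshape(S * A)
    P_flat = P_hat.reshape(S * A, S)
    rho_mat = cp.reshape(rho, (S * A, 1)) @ np.ones((1, S))   # rho repeated over s'

    cons = [cp.sum(p) == 1]                                   # Eq. (38)
    for sp in range(S):                                       # Eq. (39): inflow = outflow
        cons.append(cp.sum(p[:, sp]) == cp.sum(rho[sp * A:(sp + 1) * A]))
    # Eq. (40) with the illustrative linear g(x) = x - C_limit.
    cons.append(rho @ c.reshape(S * A) - C_limit <= -eps_e)
    cons += [p - cp.multiply(P_flat, rho_mat) <= alpha,       # Eq. (41)
             cp.multiply(P_flat, rho_mat) - p <= alpha,       # Eq. (42)
             cp.sum(alpha, axis=1) <= cp.multiply(beta, rho)] # Eq. (43)

    # Eq. (37) with the illustrative concave f = log of the scalar average reward.
    cp.Problem(cp.Maximize(cp.log(rho @ r.reshape(S * A))), cons).solve()
    rho_val = rho.value.reshape(S, A)
    return rho_val / np.maximum(rho_val.sum(axis=1, keepdims=True), 1e-12)   # pi_e(a|s)
```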

Appendix C Proof of Lemma 5.1

Proof.

Note that ρπP\rho^{P}_{\pi^{*}} denotes the stationary distribution of the optimal solution which satisfies

g(s,aρπP(s,a)c1(s,a),,s,aρπP(s,a)cd(s,a))C\displaystyle g\left(\sum_{s,a}\rho^{P}_{\pi^{*}}(s,a)c_{1}(s,a),\cdots,\sum_{s,a}\rho^{P}_{\pi^{*}}(s,a)c_{d}(s,a)\right)\leq C (44)

Further, from Assumption 3.7, we have a feasible policy π\pi for which

g(s,aρπP(s,a)c1(s,a),,s,aρπP(s,a)cd(s,a))Cδ\displaystyle g\left(\sum_{s,a}\rho^{P}_{\pi}(s,a)c_{1}(s,a),\cdots,\sum_{s,a}\rho^{P}_{\pi}(s,a)c_{d}(s,a)\right)\leq C-\delta (45)

We now construct a stationary distribution ρP\rho^{P} and obtain the corresponding policy πe\pi_{e}^{\prime} as:

ρP(s,a)=(1ϵeδ)ρπP(s,a)+ϵeδρπP(s,a)\displaystyle\rho^{P}(s,a)=\left(1-\frac{\epsilon_{e}}{\delta}\right)\rho^{P}_{\pi^{*}}(s,a)+\frac{\epsilon_{e}}{\delta}\rho^{P}_{\pi}(s,a) (46)
πe(a|s)=ρP(s,a)/(bρP(s,b))\displaystyle\pi_{e}^{\prime}(a|s)=\rho^{P}(s,a)/\left(\sum_{b}\rho^{P}(s,b)\right) (47)

For this new policy and convex constraint gg, we observe that

g(s,aρP(s,a)c1(s,a),,s,aρP(s,a)cd(s,a))\displaystyle g\left(\sum_{s,a}\rho^{P}(s,a)c_{1}(s,a),\cdots,\sum_{s,a}\rho^{P}(s,a)c_{d}(s,a)\right) (48)
=g(s,a((1ϵeδ)ρπP+ϵeδρπP)(s,a)c1(s,a),,s,a((1ϵeδ)ρπP+ϵeδρπP)(s,a)cd(s,a))\displaystyle=g\left(\sum_{s,a}\left(\left(1-\frac{\epsilon_{e}}{\delta}\right)\rho^{P}_{\pi^{*}}+\frac{\epsilon_{e}}{\delta}\rho^{P}_{\pi}\right)(s,a)c_{1}(s,a),\cdots,\sum_{s,a}\left(\left(1-\frac{\epsilon_{e}}{\delta}\right)\rho^{P}_{\pi^{*}}+\frac{\epsilon_{e}}{\delta}\rho^{P}_{\pi}\right)(s,a)c_{d}(s,a)\right) (49)
(1ϵeδ)g(s,aρπP(s,a)c1(s,a),,s,aρπP(s,a)cd(s,a))\displaystyle\leq\left(1-\frac{\epsilon_{e}}{\delta}\right)g\left(\sum_{s,a}\rho^{P}_{\pi^{*}}(s,a)c_{1}(s,a),\cdots,\sum_{s,a}\rho^{P}_{\pi^{*}}(s,a)c_{d}(s,a)\right)
+ϵeδg(s,aρπP(s,a)c1(s,a),,s,aρπP(s,a)cd(s,a))\displaystyle~{}~{}~{}+\frac{\epsilon_{e}}{\delta}g\left(\sum_{s,a}\rho^{P}_{\pi}(s,a)c_{1}(s,a),\cdots,\sum_{s,a}\rho^{P}_{\pi}(s,a)c_{d}(s,a)\right) (50)
(1ϵeδ)C+ϵeδ(Cδ)\displaystyle\leq\left(1-\frac{\epsilon_{e}}{\delta}\right)C+\frac{\epsilon_{e}}{\delta}\left(C-\delta\right) (51)
=CδCϵe\displaystyle=C-\delta\leq C-\epsilon_{e} (52)

where Equation (50) follows from the convexity of the constraints. Equation (51) follows from Equation (44) and Equation (45).

Note that the policy πe\pi_{e}^{\prime} corresponding to stationary distribution constructed in Equation (46) satisfies the ϵe\epsilon_{e}-tight constraints. Further, we find πe\pi_{e}^{*} as the optimal solution for the ϵe\epsilon_{e}-tight optimization problem. Hence, we have

f(s,aρπP(s,a)r(s,a))f(s,aρπeP(s,a)r(s,a))\displaystyle f\left(\sum_{s,a}\rho^{P}_{\pi^{*}}(s,a)r(s,a)\right)-f\left(\sum_{s,a}\rho^{P}_{\pi_{e}^{*}}(s,a)r(s,a)\right) (53)
\displaystyle\leq f(s,aρπP(s,a)r(s,a))f(s,aρP(s,a)r(s,a))\displaystyle f\left(\sum_{s,a}\rho^{P}_{\pi^{*}}(s,a)r(s,a)\right)-f\left(\sum_{s,a}\rho^{P}(s,a)r(s,a)\right)
\displaystyle\leq L|s,a(ρπP(s,a)ρP(s,a))r(s,a)|\displaystyle L\Big{|}\sum_{s,a}\left(\rho^{P}_{\pi^{*}}(s,a)-\rho^{P}(s,a)\right)r(s,a)\Big{|} (54)
\displaystyle\leq L|s,a(ρπP(s,a)(1ϵeδ)ρπP(s,a)ϵeδρπP(s,a))cd(s,a)|\displaystyle L\Big{|}\sum_{s,a}\left(\rho^{P}_{\pi^{*}}(s,a)-\left(1-\frac{\epsilon_{e}}{\delta}\right)\rho^{P}_{\pi^{*}}(s,a)-\frac{\epsilon_{e}}{\delta}\rho^{P}_{\pi}(s,a)\right)c_{d}(s,a)\Big{|} (55)
\displaystyle\leq Lϵeδ|s,a(ρπP(s,a)ρπP(s,a))r(s,a)|\displaystyle L\frac{\epsilon_{e}}{\delta}\Big{|}\sum_{s,a}\left(\rho^{P}_{\pi^{*}}(s,a)-\rho^{P}_{\pi}(s,a)\right)r(s,a)\Big{|} (56)
\displaystyle\leq Lϵeδ|s,aρπPr(s,a)|+Lϵeδ|s,aρπP(s,a)r(s,a)|\displaystyle L\frac{\epsilon_{e}}{\delta}\Big{|}\sum_{s,a}\rho^{P}_{\pi^{*}}r(s,a)\Big{|}+L\frac{\epsilon_{e}}{\delta}\Big{|}\sum_{s,a}\rho^{P}_{\pi}(s,a)r(s,a)\Big{|} (57)
\displaystyle\leq 2Lϵeδ\displaystyle 2L\frac{\epsilon_{e}}{\delta} (58)

where Equation (54) follows from the Lipschitz assumption on the joint objective ff, Equation (57) follows from the triangle inequality, and Equation (58) follows from the fact that r(s,a)1r(s,a)\leq 1 for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. ∎
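
The mixture construction of Equations (46)-(47) and the convexity step of Equation (50) can be illustrated numerically. In the sketch below, the two toy occupancy vectors, the constants ϵe\epsilon_{e}, δ\delta, CC, and the linear choice of gg are illustrative assumptions only; the vectors are not derived from an actual MDP, and the check only demonstrates the mixing, the policy extraction, and the Jensen-type inequality.

```python
# A small numeric sketch of Eqs. (46)-(47) and the convexity step of Eq. (50),
# with hand-picked toy quantities (assumptions, not derived from a real MDP).
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2
rho_star = rng.dirichlet(np.ones(S * A)).reshape(S, A)   # stands in for rho_{pi^*}^P
rho_safe = rng.dirichlet(np.ones(S * A)).reshape(S, A)   # stands in for the Slater policy's rho_{pi}^P
eps_e, delta, C = 0.02, 0.1, 0.5                         # requires eps_e <= delta

rho_mix = (1 - eps_e / delta) * rho_star + (eps_e / delta) * rho_safe    # Eq. (46)
pi_e_prime = rho_mix / rho_mix.sum(axis=1, keepdims=True)                # Eq. (47)

c1 = rng.uniform(size=(S, A))                 # one cost signal
g = lambda x: x - C                           # an illustrative convex (linear) constraint
lhs = g((rho_mix * c1).sum())
rhs = ((1 - eps_e / delta) * g((rho_star * c1).sum())
       + (eps_e / delta) * g((rho_safe * c1).sum()))
print(np.isclose(rho_mix.sum(), 1.0), lhs <= rhs + 1e-12)   # mixture is normalized; Eq. (50) holds
```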

Appendix D Objective Regret Bound

In this section, we begin by breaking down the regret into multiple components and then analyze each component individually.

D.1 Regret breakdown

We first break down our regret into multiple parts which will help us bound the regret.

R(T)\displaystyle R(T) =f(λP)f(1Tt=1Trt(st,at))\displaystyle=f(\lambda_{*}^{P})-f\left(\frac{1}{T}\sum_{t=1}^{T}r_{t}(s_{t},a_{t})\right) (59)
=f(λP)1Te=1ETef(λπeP)+1Te=1ETef(λπeP)f(1Tt=1Trt(st,at))\displaystyle=f(\lambda_{*}^{P})-\frac{1}{T}\sum_{e=1}^{E}T_{e}f(\lambda_{\pi_{e}^{*}}^{P})+\frac{1}{T}\sum_{e=1}^{E}T_{e}f(\lambda_{\pi_{e}^{*}}^{P})-f\left(\frac{1}{T}\sum_{t=1}^{T}r_{t}(s_{t},a_{t})\right) (60)
=1Te=1ETe(f(λP)f(λπeP))+1Te=1ETef(λπeP)f(1Tt=1Trt(st,at))\displaystyle=\frac{1}{T}\sum_{e=1}^{E}T_{e}\left(f(\lambda_{*}^{P})-f(\lambda_{\pi_{e}^{*}}^{P})\right)+\frac{1}{T}\sum_{e=1}^{E}T_{e}f(\lambda_{\pi_{e}^{*}}^{P})-f\left(\frac{1}{T}\sum_{t=1}^{T}r_{t}(s_{t},a_{t})\right) (61)
1Te=1ETe(f(λP)f(λπeP))+1Te=1ETef(λπeP~e)f(1Tt=1Trt(st,at))\displaystyle\leq\frac{1}{T}\sum_{e=1}^{E}T_{e}\left(f(\lambda_{*}^{P})-f(\lambda_{\pi_{e}^{*}}^{P})\right)+\frac{1}{T}\sum_{e=1}^{E}T_{e}f(\lambda_{\pi_{e}}^{\tilde{P}_{e}})-f\left(\frac{1}{T}\sum_{t=1}^{T}r_{t}(s_{t},a_{t})\right) (62)
1Te=1ETe(f(λP)f(λπeP))+f(1Te=1ETeλπeP~e)f(1Tt=1Trt(st,at))\displaystyle\leq\frac{1}{T}\sum_{e=1}^{E}T_{e}\left(f(\lambda_{*}^{P})-f(\lambda_{\pi_{e}^{*}}^{P})\right)+f\left(\frac{1}{T}\sum_{e=1}^{E}T_{e}\lambda_{\pi_{e}}^{\tilde{P}_{e}}\right)-f\left(\frac{1}{T}\sum_{t=1}^{T}r_{t}(s_{t},a_{t})\right) (63)
1Te=1ETe(f(λP)f(λπeP))+L|1Te=1ETeλπeP~e1Tt=1Trt(st,at)|\displaystyle\leq\frac{1}{T}\sum_{e=1}^{E}T_{e}\left(f(\lambda_{*}^{P})-f(\lambda_{\pi_{e}^{*}}^{P})\right)+L\Big{|}\frac{1}{T}\sum_{e=1}^{E}T_{e}\lambda_{\pi_{e}}^{\tilde{P}_{e}}-\frac{1}{T}\sum_{t=1}^{T}r_{t}(s_{t},a_{t})\Big{|} (64)
=1Te=1ETe(f(λP)f(λπeP))+L|1Te=1Et=tete+11(λπeP~eλπeP+λπePrt(st,at))|\displaystyle=\frac{1}{T}\sum_{e=1}^{E}T_{e}\left(f(\lambda_{*}^{P})-f(\lambda_{\pi_{e}^{*}}^{P})\right)+L\Big{|}\frac{1}{T}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{\tilde{P}_{e}}-\lambda_{\pi_{e}}^{P}+\lambda_{\pi_{e}}^{P}-r_{t}(s_{t},a_{t})\right)\Big{|} (65)
1Te=1ETe(f(λP)f(λπeP))+L|1Te=1Et=tete+11(λπeP~eλπeP)|\displaystyle\leq\frac{1}{T}\sum_{e=1}^{E}T_{e}\left(f(\lambda_{*}^{P})-f(\lambda_{\pi_{e}^{*}}^{P})\right)+L\Big{|}\frac{1}{T}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{\tilde{P}_{e}}-\lambda_{\pi_{e}}^{P}\right)\Big{|}
+L|1Te=1Et=tete+11(λπePrt(st,at))|\displaystyle~{}~{}+L\Big{|}\frac{1}{T}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{P}-r_{t}(s_{t},a_{t})\right)\Big{|} (66)
=R1(T)+R2(T)+R3(T)\displaystyle=R_{1}(T)+R_{2}(T)+R_{3}(T) (67)

where Equation (62) comes from the fact that the policy πe\pi_{e} is obtained for the optimistic CMDP and hence provides a higher value of the function ff. Equation (63) comes from the concavity of the function ff, and Equation (64) comes from the Lipschitz continuity of the function ff. The three terms in Equation (67) are now defined as:

R1(T)\displaystyle R_{1}(T) =1Te=1ETe(f(λP)f(λπeP))\displaystyle=\frac{1}{T}\sum_{e=1}^{E}T_{e}\left(f(\lambda_{*}^{P})-f(\lambda_{\pi_{e}^{*}}^{P})\right) (68)

R1(T)R_{1}(T) denotes the regret incurred from not playing the optimal policy π\pi^{*} for the true optimization problem in Equation (9), but instead the optimal policy πe\pi_{e}^{*} for the ϵe\epsilon_{e}-tight optimization problem, in epoch ee.

R2(T)\displaystyle R_{2}(T) =LT|e=1Et=tete+11(λπeP~eλπeP)|\displaystyle=\frac{L}{T}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{\tilde{P}_{e}}-\lambda_{\pi_{e}}^{P}\right)\Big{|} (69)

R2(T)R_{2}(T) denotes the gap between the expected rewards from playing the optimal policy πe\pi_{e} for the ϵe\epsilon_{e}-tight optimization problem on the optimistic MDP instead of the true MDP. For this term, we further consider another modification. We have P~e\tilde{P}_{e} as the optimistic MDP and πe\pi_{e} as the optimistic policy obtained as the solution of the optimization problem solved at the beginning of every epoch. Now consider an MDP P~e\tilde{P}_{e}^{*} in the confidence set which maximizes the long-term expected reward for policy πe\pi_{e}, i.e., λπeP~eλπePe\lambda^{\tilde{P}_{e}^{*}}_{\pi_{e}}\geq\lambda^{P_{e}}_{\pi_{e}} for all PeP_{e} in the confidence set at epoch ee. Hence, we have

R2(T)\displaystyle R_{2}(T) =LT|e=1Et=tete+11(λπeP~eλπeP)|\displaystyle=\frac{L}{T}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{\tilde{P}_{e}}-\lambda_{\pi_{e}}^{P}\right)\Big{|} (70)
LT|e=1Et=tete+11(λπeP~eλπeP)|\displaystyle\leq\frac{L}{T}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{\tilde{P}_{e}^{*}}-\lambda_{\pi_{e}}^{P}\right)\Big{|} (71)

We relabel P~e\tilde{P}_{e}^{*} as P~e\tilde{P}_{e} in the remaining analysis to reduce notation clutter.

R3(T)\displaystyle R_{3}(T) =LT|e=1Et=tete+11(λπePrt(st,at))|\displaystyle=\frac{L}{T}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{P}-r_{t}(s_{t},a_{t})\right)\Big{|} (72)

R3(T)R_{3}(T) denotes the gap between the rewards obtained from playing the policy πe\pi_{e} for the ϵe\epsilon_{e}-tight optimization problem on the true MDP and the expected per-step reward of playing the policy πe\pi_{e} on the true MDP.

D.2 Bounding R1(T)R_{1}(T)

Bounding R1(T)R_{1}(T) uses Lemma 5.1. We have the following set of equations:

R1(T)\displaystyle R_{1}(T) =1Te=1Et=tete+11(f(λP)f(λπeP))\displaystyle=\frac{1}{T}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(f(\lambda_{*}^{P})-f(\lambda_{\pi_{e}^{*}}^{P})\right) (73)
1Te=1Et=tete+112Lϵeδ\displaystyle\leq\frac{1}{T}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\frac{2L\epsilon_{e}}{\delta} (74)
=2LTδe=1Et=tete+11Klogtt\displaystyle=\frac{2L}{T\delta}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}K\sqrt{\frac{\log t}{t}} (75)
=2KLTδt=1Tlogtt\displaystyle=\frac{2KL}{T\delta}\sum_{t=1}^{T}\sqrt{\frac{\log t}{t}} (76)
2KLTδt=1TlogTt\displaystyle\leq\frac{2KL}{T\delta}\sum_{t=1}^{T}\sqrt{\frac{\log T}{t}} (77)
=2KLlogTTδ(1+t=2T1t)\displaystyle=\frac{2KL\sqrt{\log T}}{T\delta}(1+\sum_{t=2}^{T}\sqrt{\frac{1}{t}}) (78)
2KLlogTTδ(1+t=1T1t𝑑t)\displaystyle\leq\frac{2KL\sqrt{\log T}}{T\delta}(1+\int_{t=1}^{T}\sqrt{\frac{1}{t}}dt) (79)
2KLlogTTδ(2T)\displaystyle\leq\frac{2KL\sqrt{\log T}}{T\delta}(2\sqrt{T}) (80)

where Equation (77) follows from the fact that logtlogT\log t\leq\log T for all tTt\leq T.

D.3 Bounding R2(T)R_{2}(T)

We relate the difference between the long-term average reward of running the optimistic policy πe\pi_{e} on the optimistic MDP, λπeP~e\lambda_{\pi_{e}}^{\tilde{P}_{e}}, and the long-term average reward of running πe\pi_{e} on the true MDP, λπeP\lambda_{\pi_{e}}^{P}, to the Bellman error. Formally, we have the following lemma:

Lemma D.1.

The difference between the long-term average reward of running the optimistic policy πe\pi_{e} on the optimistic MDP, λπeP~e\lambda_{\pi_{e}}^{\tilde{P}_{e}}, and the long-term average reward of running πe\pi_{e} on the true MDP, λπeP\lambda_{\pi_{e}}^{P}, equals the long-term average Bellman error:

λπeP~eλπeP=s,aρπePBπe,P~e(s,a)\displaystyle\lambda_{\pi_{e}}^{\tilde{P}_{e}}-\lambda_{\pi_{e}}^{P}=\sum_{s,a}\rho_{\pi_{e}}^{P}B^{\pi_{e},\tilde{P}_{e}}(s,a) (81)
Proof.

Note that for all s𝒮s\in\mathcal{S}, we have:

Vγπe,P~e(s)\displaystyle V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s) =𝔼aπe[Qγπe,P~e(s,a)]\displaystyle=\mathbb{E}_{a\sim\pi_{e}}\left[Q_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s,a)\right] (82)
=𝔼aπe[Bπe,P~e(s,a)+r(s,a)+γs𝒮P(s|s,a)Vγπe,P~e(s)]\displaystyle=\mathbb{E}_{a\sim\pi_{e}}\left[B^{\pi_{e},\tilde{P}_{e}}(s,a)+r(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s^{\prime})\right] (83)

where Equation (83) follows from the definition of the Bellman error for state action pair (s,a)(s,a).

Similarly, for the true MDP, we have,

Vγπe,P(s)\displaystyle V_{\gamma}^{\pi_{e},P}(s) =𝔼aπe[Qγπe,P(s,a)]\displaystyle=\mathbb{E}_{a\sim\pi_{e}}\left[Q_{\gamma}^{\pi_{e},P}(s,a)\right] (84)
=𝔼aπe[r(s,a)+γs𝒮P(s|s,a)Vγπe,P(s)]\displaystyle=\mathbb{E}_{a\sim\pi_{e}}\left[r(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)V_{\gamma}^{\pi_{e},P}(s^{\prime})\right] (85)

Subtracting Equation (85) from Equation (83), we get:

Vγπe,P~e(s)Vγπe,P(s)\displaystyle V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s)-V_{\gamma}^{\pi_{e},P}(s) =𝔼aπe[Bπe,P~e(s,a)+γs𝒮P(s|s,a)(Vγπe,P~eVγπe,P)(s)]\displaystyle=\mathbb{E}_{a\sim\pi_{e}}\left[B^{\pi_{e},\tilde{P}_{e}}(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)\left(V_{\gamma}^{\pi_{e},\tilde{P}_{e}}-V_{\gamma}^{\pi_{e},P}\right)(s^{\prime})\right] (86)
=𝔼aπe[Bπe,P~e(s,a)]+γs𝒮Pπe(s|s)(Vγπe,P~eVγπe,P)(s)\displaystyle=\mathbb{E}_{a\sim\pi_{e}}\left[B^{\pi_{e},\tilde{P}_{e}}(s,a)\right]+\gamma\sum_{s^{\prime}\in\mathcal{S}}P_{\pi_{e}}(s^{\prime}|s)\left(V_{\gamma}^{\pi_{e},\tilde{P}_{e}}-V_{\gamma}^{\pi_{e},P}\right)(s^{\prime}) (87)

Using the vector format for the value functions, we have,

V¯γπe,P~eV¯γπe,P\displaystyle\bar{V}_{\gamma}^{\pi_{e},\tilde{P}_{e}}-\bar{V}_{\gamma}^{\pi_{e},P} =(IγPπe)1B¯πeπe,P~e\displaystyle=\left(I-\gamma P_{\pi_{e}}\right)^{-1}\overline{B}_{\pi_{e}}^{\pi_{e},\tilde{P}_{e}} (88)

Now, converting the value function to average per-step reward we have,

λπeP~e𝟏SλπeP𝟏S\displaystyle\lambda_{\pi_{e}}^{\tilde{P}_{e}}\bm{1}_{S}-\lambda_{\pi_{e}}^{P}\bm{1}_{S} =limγ1(1γ)(V¯γπe,P~eV¯γπe,P)\displaystyle=\lim_{\gamma\to 1}(1-\gamma)\left(\bar{V}_{\gamma}^{\pi_{e},\tilde{P}_{e}}-\bar{V}_{\gamma}^{\pi_{e},P}\right) (89)
=limγ1(1γ)(IγPπe)1B¯πeπe,P~e\displaystyle=\lim_{\gamma\to 1}(1-\gamma)\left(I-\gamma P_{\pi_{e}}\right)^{-1}\overline{B}_{\pi_{e}}^{\pi_{e},\tilde{P}_{e}} (90)
=(s,aρπePBπe,P~e(s,a))𝟏S\displaystyle=\left(\sum_{s,a}\rho_{\pi_{e}}^{P}B^{\pi_{e},\tilde{P}_{e}}(s,a)\right)\bm{1}_{S} (91)

where the last equation follows from the definition of occupancy measures by Puterman (2014). ∎

Remark D.2.

Note that the Bellman error is not to be confused with the advantage function and the policy improvement lemma of Langford & Kakade (2002). The policy improvement lemma relates the performance of two policies on the same MDP, whereas in Lemma D.1 we relate the performance of one policy on two different MDPs.
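
Lemma D.1 can also be checked numerically on a small MDP. In the sketch below, the randomly generated transition kernels, the uniform policy, and the least-squares helpers induced and gain_bias are illustrative assumptions; the check only verifies the identity of Equation (81).

```python
# A numerical sanity check of Lemma D.1 on a small randomly generated MDP
# (the random kernels, uniform policy, and solver choices are assumptions).
import numpy as np

rng = np.random.default_rng(1)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))       # true transitions P(s'|s, a)
P_til = rng.dirichlet(np.ones(S), size=(S, A))   # stands in for the optimistic kernel
r = rng.uniform(size=(S, A))
pi = np.full((S, A), 1.0 / A)                    # a fixed stochastic policy

def induced(P, pi, r):
    """Policy-induced transition matrix P_pi(s'|s) and reward vector r_pi(s)."""
    return np.einsum('sa,sap->sp', pi, P), (pi * r).sum(axis=1)

def gain_bias(P_pi, r_pi):
    """Stationary distribution, gain (average reward), and bias of an ergodic chain."""
    n = P_pi.shape[0]
    d = np.linalg.lstsq(np.vstack([P_pi.T - np.eye(n), np.ones(n)]),
                        np.append(np.zeros(n), 1.0), rcond=None)[0]
    lam = d @ r_pi
    h = np.linalg.lstsq(np.vstack([np.eye(n) - P_pi, d]),
                        np.append(r_pi - lam, 0.0), rcond=None)[0]
    return d, lam, h

d_P, lam_P, _ = gain_bias(*induced(P, pi, r))
_, lam_til, h_til = gain_bias(*induced(P_til, pi, r))

B = np.einsum('sap,p->sa', P_til - P, h_til)     # Bellman error, Eq. (100)
lhs = lam_til - lam_P
rhs = np.einsum('s,sa,sa->', d_P, pi, B)         # sum_{s,a} rho_{pi_e}^{P}(s,a) B(s,a)
print(np.isclose(lhs, rhs))                      # Lemma D.1: both sides agree
```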

We now want to bound the Bellman errors to bound the gap between the average per-step reward λπeP~e\lambda_{\pi_{e}}^{\tilde{P}_{e}}, and λπeP\lambda_{\pi_{e}}^{P}. From the definition of Bellman error and the confidence intervals on the estimated transition probabilities, we obtain the following lemma:

Lemma D.3.

With probability at least 11/te61-1/t_{e}^{6}, the Bellman error Bπe,P~e(s,a)B^{\pi_{e},\tilde{P}_{e}}(s,a) for state-action pair s,as,a in epoch ee is upper bounded as

Bπe,P~e(s,a)min{2,14Slog(2AT)1Ne(s,a)}h~()\displaystyle B^{\pi_{e},\tilde{P}_{e}}(s,a)\leq\min\left\{2,\sqrt{\frac{14S\log(2AT)}{1\vee N_{e}(s,a)}}\right\}\|\tilde{h}(\cdot)\|_{\infty} (92)
Proof.

Starting with the definition of Bellman error in Equation (20), we get

Bπe,P~e(s,a)\displaystyle B^{\pi_{e},\tilde{P}_{e}}(s,a) =limγ1(Qγπe,P~e(s,a)(r(s,a)+γs𝒮P(s|s,a)Vγπe,P~e))\displaystyle=\lim_{\gamma\to 1}\left(Q_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s,a)-\left(r(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)V_{\gamma}^{\pi_{e},\tilde{P}_{e}}\right)\right) (93)
=limγ1((r(s,a)+γs𝒮P~e(s|s,a)Vγπe,P~e(s))\displaystyle=\lim_{\gamma\to 1}\Bigg{(}\left(r(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}\tilde{P}_{e}(s^{\prime}|s,a)V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s^{\prime})\right)
(r(s,a)+γs𝒮P(s|s,a)Vγπe,P~e(s)))\displaystyle~{}~{}-\left(r(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s^{\prime})\right)\Bigg{)} (94)
=limγ1γs𝒮(P~e(s|s,a)P(s|s,a))Vγπe,P~e(s)\displaystyle=\lim_{\gamma\to 1}\gamma\sum_{s^{\prime}\in\mathcal{S}}\left(\tilde{P}_{e}(s^{\prime}|s,a)-P(s^{\prime}|s,a)\right)V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s^{\prime}) (95)
=limγ1γ(s𝒮(P~e(s|s,a)P(s|s,a))Vγπe,P~e(s)+Vγπe,P~e(s)Vγπe,P~e(s))\displaystyle=\lim_{\gamma\to 1}\gamma\left(\sum_{s^{\prime}\in\mathcal{S}}\left(\tilde{P}_{e}(s^{\prime}|s,a)-P(s^{\prime}|s,a)\right)V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s^{\prime})+V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s)-V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s)\right) (96)
=limγ1γ(s𝒮(P~e(s|s,a)P(s|s,a))Vγπe,P~e(s)\displaystyle=\lim_{\gamma\to 1}\gamma\Bigg{(}\sum_{s^{\prime}\in\mathcal{S}}\left(\tilde{P}_{e}(s^{\prime}|s,a)-P(s^{\prime}|s,a)\right)V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s^{\prime})
s𝒮P~e(s|s,a)Vγπe,P~e(s)+s𝒮P(s|s,a)Vγπe,P~e(s))\displaystyle~{}~{}-\sum_{s^{\prime}\in\mathcal{S}}\tilde{P}_{e}(s^{\prime}|s,a)V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s)+\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s)\Bigg{)} (97)
=limγ1γ(s𝒮(P~e(s|s,a)P(s|s,a))(Vγπe,P~e(s)Vγπe,P~e(s)))\displaystyle=\lim_{\gamma\to 1}\gamma\left(\sum_{s^{\prime}\in\mathcal{S}}\left(\tilde{P}_{e}(s^{\prime}|s,a)-P(s^{\prime}|s,a)\right)\left(V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s^{\prime})-V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s)\right)\right) (98)
=(s𝒮(P~e(s|s,a)P(s|s,a))limγ1γ(Vγπe,P~e(s)Vγπe,P~e(s)))\displaystyle=\left(\sum_{s^{\prime}\in\mathcal{S}}\left(\tilde{P}_{e}(s^{\prime}|s,a)-P(s^{\prime}|s,a)\right)\lim_{\gamma\to 1}\gamma\left(V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s^{\prime})-V_{\gamma}^{\pi_{e},\tilde{P}_{e}}(s)\right)\right) (99)
=(s𝒮(P~e(s|s,a)P(s|s,a))h~(s))\displaystyle=\left(\sum_{s^{\prime}\in\mathcal{S}}\left(\tilde{P}_{e}(s^{\prime}|s,a)-P(s^{\prime}|s,a)\right)\tilde{h}(s^{\prime})\right) (100)
(P~e(|s,a)P(|s,a))1h~()\displaystyle\leq\Big{\|}\left(\tilde{P}_{e}(\cdot|s,a)-P(\cdot|s,a)\right)\Big{\|}_{1}\|\tilde{h}(\cdot)\|_{\infty} (101)
14Slog(2At)1Ne(s,a)h~()\displaystyle\leq\sqrt{\frac{14S\log(2At)}{1\vee N_{e}(s,a)}}\|\tilde{h}(\cdot)\|_{\infty} (102)
14Slog(2AT)1Ne(s,a)h~()\displaystyle\leq\sqrt{\frac{14S\log(2AT)}{1\vee N_{e}(s,a)}}\|\tilde{h}(\cdot)\|_{\infty} (103)

where Equation (95) comes from the assumption that the rewards are known to the agent. Equation (99) follows from the fact that the difference between the value function at two states is bounded. Equation (100) comes from the definition of the bias term Puterman (2014). Equation (101) follows from Hölder’s inequality. In Equation (102), h~()\|\tilde{h}(\cdot)\|_{\infty} is the bias span of the MDP with transition probabilities P~e\tilde{P}_{e} for policy πe\pi_{e}, and the 1\ell_{1} norm of the difference of the transition probability vectors is bounded using Lemma F.1 for the start time tet_{e} of epoch ee. ∎

Additionally, note that the 1\ell_{1} norm in Equation (101) is bounded by 22. Thus the Bellman error is loosely upper bounded by 2h~()2\|\tilde{h}(\cdot)\|_{\infty} for all state-action pairs.

Note that we have converted the difference of average rewards into the average Bellman error, and we have bounded the Bellman error of each state-action pair. We now want to bound the average Bellman error of an epoch using the realizations of the Bellman error at the state-action pairs visited in that epoch. For this, we present the following lemma.

Lemma D.4.

With probability at least 11/T61-1/T^{6}, the cumulative expected Bellman error is bounded as:

e=1E(te+1te)𝔼πe,P[Bπe,P~e(s,a)]e=1Et=tete+11Bπe,P~e(st,at)+4TM7Tlog(2T)+2CTMSE1ρ\displaystyle\sum_{e=1}^{E}(t_{e+1}-t_{e})\mathbb{E}_{\pi_{e},P}\left[B^{\pi_{e},\tilde{P}_{e}}(s,a)\right]\leq\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})+4T_{M}\sqrt{7T\log(2T)}+\frac{2CT_{M}SE}{1-\rho} (104)
Proof.

Let t={s1,a1,,st,at}\mathcal{F}_{t}=\{s_{1},a_{1},\cdots,s_{t},a_{t}\} be the filtration generated by running the algorithm for tt time-steps. Note that, conditioned on the filtration te1\mathcal{F}_{t_{e}-1}, the two expectations 𝔼s,aπe,P[]\mathbb{E}_{s,a\sim\pi_{e},P}[\cdot] and 𝔼s,aπe,P[|te1]\mathbb{E}_{s,a\sim\pi_{e},P}[\cdot|\mathcal{F}_{t_{e}-1}] are not equal, as the former is an expectation under the long-term state distribution while the latter is an expectation under the tt-step state distribution conditioned on the initial state ste1s_{t_{e}-1}. We now use Assumption 3.1 to obtain the following set of inequalities.

𝔼(s,a)πe,P[Bπe,P~e(s,a)]\displaystyle\mathbb{E}_{(s,a)\sim\pi_{e},P}[B^{\pi_{e},\tilde{P}_{e}}(s,a)] =𝔼(s,a)πe,P[Bπe,P~e(s,a)]±𝔼(st,at)πe,P[Bπe,P~e(st,at)|te1]\displaystyle=\mathbb{E}_{(s,a)\sim\pi_{e},P}[B^{\pi_{e},\tilde{P}_{e}}(s,a)]\pm\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}] (105)
=𝔼(st,at)πe,P[Bπe,P~e(st,at)|te1]\displaystyle=\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}]
+(𝔼(s,a)πe,P[Bπe,P~e(s,a)]𝔼(st,at)πe,P[Bπe,P~e(st,at)|te1])\displaystyle~{}~{}+\left(\mathbb{E}_{(s,a)\sim\pi_{e},P}[B^{\pi_{e},\tilde{P}_{e}}(s,a)]-\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}]\right) (106)
𝔼(st,at)πe,P[Bπe,P~e(st,at)|te1]\displaystyle\leq\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}]
+2h~()s,a|πe(a|s)dπe(s)πe(a|s)Pπ,ste1tte+1(s)|\displaystyle~{}~{}+2\|\tilde{h}(\cdot)\|_{\infty}\sum_{s,a}\big{|}\pi_{e}(a|s)d_{\pi_{e}}(s)-\pi_{e}(a|s)P_{\pi,s_{t_{e}-1}}^{t-t_{e}+1}(s)\big{|} (107)
𝔼(st,at)πe,P[Bπe,P~e(st,at)|te1]\displaystyle\leq\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}]
+2h~()s,aπ(a|s)|dπe(s)Pπ,ste1tte+1(s)|\displaystyle~{}~{}+2\|\tilde{h}(\cdot)\|_{\infty}\sum_{s,a}\pi(a|s)\big{|}d_{\pi_{e}}(s)-P_{\pi,s_{t_{e}-1}}^{t-t_{e}+1}(s)\big{|} (108)
𝔼(st,at)πe,P[Bπe,P~e(st,at)|te1]\displaystyle\leq\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}]
+2h~()s,aπ(a|s)dπePπ,ste1tte+1TV\displaystyle~{}~{}+2\|\tilde{h}(\cdot)\|_{\infty}\sum_{s,a}\pi(a|s)\|d_{\pi_{e}}-P_{\pi,s_{t_{e}-1}}^{t-t_{e}+1}\|_{TV} (109)
𝔼(st,at)πe,P[Bπe,P~e(st,at)|te1]\displaystyle\leq\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}]
+2h~()s,aπ(a|s)Cρtte\displaystyle~{}~{}+2\|\tilde{h}(\cdot)\|_{\infty}\sum_{s,a}\pi(a|s)C\rho^{t-t_{e}} (110)
=𝔼(st,at)πe,P[Bπe,P~e(st,at)|te1]+2CSh~()ρtte\displaystyle=\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}]+2CS\|\tilde{h}(\cdot)\|_{\infty}\rho^{t-t_{e}} (111)

where Equation (107) comes from Assumption 3.1 for running policy πe\pi_{e} starting from state ste1s_{t_{e}-1} for tte+1t-t_{e}+1 steps and from Lemma 5.3. Equation (111) follows from bounding the total-variation distance over all states and from the fact that aπe(a|s)=1\sum_{a}\pi_{e}(a|s)=1.

Using this, and the fact that 𝔼πe,P[Bπe,P~e(st,at)|te1]Bπe,P~e(st,at)\mathbb{E}_{\pi_{e},P}\left[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}\right]-B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t}) forms a Martingale difference sequence with respect to the filtration t1\mathcal{F}_{t-1}, bounded as |𝔼πe,P[Bπe,P~e(st,at)|t1]Bπe,P~e(st,at)|4h~()|\mathbb{E}_{\pi_{e},P}\left[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t-1}\right]-B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\leq 4\|\tilde{h}(\cdot)\|_{\infty}, we can use the Azuma-Hoeffding inequality (Lemma F.2) to bound the summation as

e=1E(te+1te)𝔼πe,P[Bπe,P~e(s,a)]\displaystyle\sum_{e=1}^{E}(t_{e+1}-t_{e})\mathbb{E}_{\pi_{e},P}\left[B^{\pi_{e},\tilde{P}_{e}}(s,a)\right] (112)
≤\displaystyle\leq e=1E((te+1te)𝔼πe,P[Bπe,P~e(st,at)|te1]+t=tete+112CSh~()ρtte)\displaystyle\sum_{e=1}^{E}\Big{(}(t_{e+1}-t_{e})\mathbb{E}_{\pi_{e},P}\left[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}\right]+\sum_{t=t_{e}}^{t_{e+1}-1}2CS\|\tilde{h}(\cdot)\|_{\infty}\rho^{t-t_{e}}\Big{)}
e=1E(t=tete+11𝔼πe,P[Bπe,P~e(st,at)|te1]+2CSh~()1ρ)\displaystyle\leq\sum_{e=1}^{E}\left(\sum_{t=t_{e}}^{t_{e+1}-1}\mathbb{E}_{\pi_{e},P}\left[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}\right]+\frac{2CS\|\tilde{h}(\cdot)\|_{\infty}}{1-\rho}\right) (113)
e=1Et=tete+11Bπe,P~e(st,at)+4h~()7Tlog2T+2CESh~()1ρ\displaystyle\leq\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})+4\|\tilde{h}(\cdot)\|_{\infty}\sqrt{7T\log 2T}+\frac{2CES\|\tilde{h}(\cdot)\|_{\infty}}{1-\rho} (114)

where Eq. (114) comes from the Azuma-Hoeffding inequality with probability at least 1T61-T^{-6}. ∎

D.4 Bounding the term h~()\|\tilde{h}(\cdot)\|_{\infty}

Note that we have λπeP~eλπeP\lambda_{\pi_{e}}^{\tilde{P}_{e}}\geq\lambda_{\pi_{e}}^{P^{\prime}} for all PP^{\prime} in the confidence set. We use this fact in the following lemma to bound the span of the bias h~\tilde{h} of the optimistic MDP.

Lemma D.5.

For an MDP with rewards r(s,a)r(s,a) and transition probabilities P~e\tilde{P}_{e}, under policy πe\pi_{e}, the difference of the bias of any two states ss and ss^{\prime} is bounded as h~(s)h~(s)TMs,s𝒮\tilde{h}(s)-\tilde{h}(s^{\prime})\leq T_{M}~{}\forall~{}s,s^{\prime}\in\mathcal{S}.

Proof.

Note that λπeP~eλπeP\lambda_{\pi_{e}}^{\tilde{P}_{e}}\geq\lambda_{\pi_{e}}^{P^{\prime}} for all PP^{\prime} in the confidence set. Now, consider the following Bellman equation

h~(s)\displaystyle\tilde{h}(s) =rπe(s)λπeP~e+(Pπe,e(|s))Th~=Th~(s)\displaystyle=r_{\pi_{e}}(s)-\lambda_{\pi_{e}}^{\tilde{P}_{e}}+(P_{\pi_{e},e}(\cdot|s))^{T}\tilde{h}~{}=T\tilde{h}(s)

where rπe(s)=aπe(a|s)r(s,a)r_{\pi_{e}}(s)=\sum_{a}\pi_{e}(a|s)r(s,a) and Pπe,e(s|s)=aπe(a|s)P~e(s|s,a)P_{\pi_{e},e}(s^{\prime}|s)=\sum_{a}\pi_{e}(a|s)\tilde{P}_{e}(s^{\prime}|s,a).

Consider two states s,s𝒮s,s^{\prime}\in\mathcal{S}. Also, let τ=min{t1:st=s,s1=s}\tau=\min\{t\geq 1:s_{t}=s^{\prime},s_{1}=s\} be the hitting time of state ss^{\prime} starting from state ss. With Pπe(|s)P_{\pi_{e}}(\cdot|s) =aπe(a|s)P(|s,a)=\sum_{a}\pi_{e}(a|s)P(\cdot|s,a), we also define another operator,

T¯h(s)\displaystyle\bar{T}h(s) =(mins,ar(s,a)λπeP~e+(Pπe(|s))Th)𝟏(ss)+h~(s)𝟏(s=s).\displaystyle=(\min_{s,a}r(s,a)-\lambda_{\pi_{e}}^{\tilde{P}_{e}}+(P_{\pi_{e}}(\cdot|s))^{T}h)\mathbf{1}(s\neq s^{\prime})+\tilde{h}(s^{\prime})\mathbf{1}(s=s^{\prime}).

Note that T¯h~(s)Th~(s)=h~(s)\bar{T}\tilde{h}(s)\leq T\tilde{h}(s)=\tilde{h}(s) for all ss since P~e\tilde{P}_{e} maximizes the reward rr over all the transition probabilities in the confidence set of Eq. (21) including the true transition probability PP. Further, for any two vectors u,vSu,v\in\mathbb{R}^{S} with u(s)v(s)su(s)\geq v(s)\forall s, we have T¯uT¯v\bar{T}u\geq\bar{T}v. Hence, we have T¯nh~(s)h~(s)\bar{T}^{n}\tilde{h}(s)\leq\tilde{h}(s) for all ss. Hence, we have

h~(s)\displaystyle\tilde{h}(s) T¯nh~(s)=𝔼[(λπeP~emins,ar(s,a))(nτ)+h~(snτ)]\displaystyle\geq\bar{T}^{n}\tilde{h}(s)=\mathbb{E}\big{[}-(\lambda_{\pi_{e}}^{\tilde{P}_{e}}-\min_{s,a}r(s,a))(n\wedge\tau)+\tilde{h}(s_{n\wedge\tau})\big{]}

Taking limit as nn\to\infty, we have h~(s)h~(s)TM\tilde{h}(s)\geq\tilde{h}(s^{\prime})-T_{M}, thus completing the proof. ∎

We are now ready to bound R2(T)R_{2}(T) using Lemma D.1, Lemma D.3, Lemma D.4, and Lemma D.5. We have the following set of equations:

R2(T)\displaystyle R_{2}(T) =\displaystyle= LT|e=1Et=tete+11(λπeP~eλπeP)|\displaystyle\frac{L}{T}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{\tilde{P}_{e}}-\lambda_{\pi_{e}}^{P}\right)\Big{|} (115)
=\displaystyle= LT|e=1Et=tete+11s,aρπePBπe,P~e(s,a)|\displaystyle\frac{L}{T}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\sum_{s,a}\rho_{\pi_{e}}^{P}B^{\pi_{e},\tilde{P}_{e}}(s,a)\Big{|} (116)
\displaystyle\leq LT|e=1Et=tete+11Bπe,P~e(st,at)+4TM7Tlog(2T)+2CTMSE1ρ|\displaystyle\frac{L}{T}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})+4T_{M}\sqrt{7T\log(2T)}+\frac{2CT_{M}SE}{1-\rho}\Big{|} (117)
LT|e=1Et=tete+11TM14Slog(2AT)1Ne(s,a)+4TM7Tlog(2T)+2CTMSE1ρ|\displaystyle\leq\frac{L}{T}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}T_{M}\sqrt{\frac{14S\log(2AT)}{1\vee N_{e}(s,a)}}+4T_{M}\sqrt{7T\log(2T)}+\frac{2CT_{M}SE}{1-\rho}\Big{|} (118)
LT|e=1Es,aνe(s,a)TM14Slog(2AT)1Ne(s,a)+4TM7Tlog(2T)+2CTMSE1ρ|\displaystyle\leq\frac{L}{T}\Big{|}\sum_{e=1}^{E}\sum_{s,a}\nu_{e}(s,a)T_{M}\sqrt{\frac{14S\log(2AT)}{1\vee N_{e}(s,a)}}+4T_{M}\sqrt{7T\log(2T)}+\frac{2CT_{M}SE}{1-\rho}\Big{|} (119)
LT|s,aTM14Slog(2AT)e=1Eνe(s,a)1Ne(s,a)+4TM7Tlog(2T)+2CTMSE1ρ|\displaystyle\leq\frac{L}{T}\Big{|}\sum_{s,a}T_{M}\sqrt{14S\log(2AT)}\sum_{e=1}^{E}\frac{\nu_{e}(s,a)}{\sqrt{1\vee N_{e}(s,a)}}+4T_{M}\sqrt{7T\log(2T)}+\frac{2CT_{M}SE}{1-\rho}\Big{|} (120)
LT|s,aTM(2+1)14Slog(2AT)N(s,a)+4TM7Tlog(2T)+2CTMSE1ρ|\displaystyle\leq\frac{L}{T}\Big{|}\sum_{s,a}T_{M}(\sqrt{2}+1)\sqrt{14S\log(2AT)}\sqrt{N(s,a)}+4T_{M}\sqrt{7T\log(2T)}+\frac{2CT_{M}SE}{1-\rho}\Big{|} (121)
LT|TM(2+1)14Slog(2AT)(s,a1)(s,aN(s,a))\displaystyle\leq\frac{L}{T}\Big{|}T_{M}(\sqrt{2}+1)\sqrt{14S\log(2AT)}\sqrt{\left(\sum_{s,a}1\right)\left(\sum_{s,a}N(s,a)\right)}
+4TM7Tlog(2T)+2CTMSE1ρ|\displaystyle~{}~{}~{}+4T_{M}\sqrt{7T\log(2T)}+\frac{2CT_{M}SE}{1-\rho}\Big{|} (122)
LT|TM(2+1)14Slog(2AT)SAT+4TM7Tlog(2T)+2CTMSE1ρ|\displaystyle\leq\frac{L}{T}\Big{|}T_{M}(\sqrt{2}+1)\sqrt{14S\log(2AT)}\sqrt{SAT}+4T_{M}\sqrt{7T\log(2T)}+\frac{2CT_{M}SE}{1-\rho}\Big{|} (123)

where Equation (116) follows from Lemma D.1, Equation (117) follows from Lemma D.4, and Equation (118) follows from Lemma D.3 together with the bias-span bound of Lemma D.5. Equation (121) follows from Jaksch et al. (2010), and Equation (122) follows from the Cauchy-Schwarz inequality.

D.5 Bounding R3(T)R_{3}(T)

Bounding R3(T)R_{3}(T) follows along lines similar to Lemma D.4. At each epoch, the agent visits states according to the occupancy measure ρπeP\rho_{\pi_{e}}^{P} and obtains the rewards. We bound the deviation of the observed visitations from the expected visitations to each state-action pair in each epoch.

Lemma D.6.

With probability at least 11/T61-1/T^{6}, the difference between the observed rewards and the expected rewards is bounded as:

|e=1Et=tete+11𝔼πe,P[r(s,a)]e=1Et=tete+11r(st,at)|27Tlog(2T)\displaystyle\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\mathbb{E}_{\pi_{e},P}\left[r(s,a)\right]-\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}r(s_{t},a_{t})\Big{|}\leq 2\sqrt{7T\log(2T)} (124)
Proof.

We note that 𝔼πe,P[r(s,a)|t1]r(st,at)\mathbb{E}_{\pi_{e},P}\left[r(s,a)|\mathcal{F}_{t-1}\right]-r(s_{t},a_{t}) is a Martingale difference sequence bounded by 22 because the rewards are bounded by 11. Hence, following the proof of Lemma D.4 we get the required result. ∎

D.6 Bounding the number of epochs EE

The number of epochs EE of the UC-CURL algorithm is bounded by 1+2SA+SAlog(T/SA)1+2SA+SA\log(T/SA) from Proposition 18 of Jaksch et al. (2010). We now bound the number of epochs for the modification of the algorithm described in Section 5.2, where a new epoch is triggered whenever νe(s,a)\nu_{e}(s,a) becomes max{1,ν¯e1(s,a)+1}\max\{1,\bar{\nu}_{e-1}(s,a)+1\}, with ν¯e1(s,a)\bar{\nu}_{e-1}(s,a) being the number of visitations to s,as,a at the previous epoch triggered by that pair. In the following lemma, we show that the number of epochs is bounded by O(1+2SAT)O(1+\sqrt{2SAT}) with this epoch-trigger schedule; a small simulation of this trigger rule is sketched after the proof.

Lemma D.7.

If the UC-CURL algorithm triggers a new epoch whenever νe(s,a)max{1,ν¯e1(s,a)+1}\nu_{e}(s,a)\geq\max\{1,\bar{\nu}_{e-1}(s,a)+1\} for any state-action pair s,as,a, the total number of epochs is bounded by O(1+2SAT)O(1+\sqrt{2SAT}), where ν¯e(s,a)=νe(s,a)𝟏{νe(s,a)=ν¯e1(s,a)+1}+ν¯e1(s,a)𝟏{νe(s,a)ν¯e1(s,a)+1}\bar{\nu}_{e}(s,a)=\nu_{e}(s,a)\bm{1}\{\nu_{e}(s,a)=\bar{\nu}_{e-1}(s,a)+1\}+\bar{\nu}_{e-1}(s,a)\bm{1}\{\nu_{e}(s,a)\neq\bar{\nu}_{e-1}(s,a)+1\} and ν0(s,a)=ν¯0(s,a)=1\nu_{0}(s,a)=\bar{\nu}_{0}(s,a)=1 for all s,as,a.

Proof.

Let N(s,a)N(s,a) be the number of visitations to the state-action pair s,as,a and K(s,a)K(s,a) be the total number of epochs triggered when the trigger condition is met for the state-action pair s,as,a. Hence, we have

N(s,a)\displaystyle N(s,a) =e=1Eνe(s,a)\displaystyle=\sum_{e=1}^{E}\nu_{e}(s,a) (125)
e:νe(s,a)=ν¯e1+1νe(s,a)\displaystyle\geq\sum_{e:\nu_{e}(s,a)=\bar{\nu}_{e-1}+1}\nu_{e}(s,a) (126)
K(s,a)(K(s,a)+1)2K2(s,a)2,\displaystyle\geq\frac{K(s,a)(K(s,a)+1)}{2}\geq\frac{K^{2}(s,a)}{2}, (127)

where considering only the epochs triggered by s,as,a gives Equation (126). Equation (127) is obtained from the fact that at every such epoch we have ν¯e(s,a)=νe(s,a)=ν¯e1(s,a)+1\bar{\nu}_{e}(s,a)=\nu_{e}(s,a)=\bar{\nu}_{e-1}(s,a)+1, so the visitation counts at the K(s,a)K(s,a) triggering epochs are at least 1,2,,K(s,a)1,2,\cdots,K(s,a) and hence sum to at least K(s,a)(K(s,a)+1)/2K(s,a)(K(s,a)+1)/2.

Now, we have the following,

T\displaystyle T =s,aN(s,a)\displaystyle=\sum_{s,a}N(s,a) (128)
s,aK2(s,a)2\displaystyle\geq\sum_{s,a}\frac{K^{2}(s,a)}{2} (129)
=SA2SAs,aK2(s,a)\displaystyle=\frac{SA}{2SA}\sum_{s,a}K^{2}(s,a) (130)
SA2(1SAs,aK(s,a))2\displaystyle\geq\frac{SA}{2}\left(\frac{1}{SA}\sum_{s,a}K(s,a)\right)^{2} (131)

where Equation (131) is obtained from the convexity of x2x^{2}. Hence, we have,

s,aK(s,a)SA2TSA=2SAT\displaystyle\sum_{s,a}K(s,a)\leq SA\sqrt{\frac{2T}{SA}}=\sqrt{2SAT} (132)

Further, the first epoch is triggered when the algorithm starts. Hence, we have E=1+s,aK(s,a)1+2SATE=1+\sum_{s,a}K(s,a)\leq 1+\sqrt{2SAT}. ∎
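
The following simulation sketch illustrates the epoch-trigger schedule analyzed in Lemma D.7. The helper count_epochs, the uniformly random visitation sequence, and the problem sizes are illustrative assumptions; the simulated epoch count stays below the 1+2SAT1+\sqrt{2SAT} bound.

```python
# A simulation sketch of the epoch-trigger rule of Lemma D.7 under an assumed
# uniformly random state-action visitation sequence (for illustration only).
import numpy as np

def count_epochs(S, A, T, seed=0):
    rng = np.random.default_rng(seed)
    bar_nu = np.ones((S, A))      # bar_nu_0(s, a) = 1
    nu = np.zeros((S, A))         # visits within the current epoch
    epochs = 1                    # the first epoch starts with the algorithm
    for _ in range(T):
        s, a = rng.integers(S), rng.integers(A)
        nu[s, a] += 1
        if nu[s, a] >= max(1, bar_nu[s, a] + 1):   # trigger condition
            bar_nu[s, a] = nu[s, a]                # record the triggering count
            nu[:] = 0                              # new epoch: reset per-epoch counts
            epochs += 1
    return epochs

S, A, T = 5, 3, 100_000
print(count_epochs(S, A, T), 1 + np.sqrt(2 * S * A * T))   # empirical count vs. the bound
```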

Appendix E Bounding Constraint Violations

To bound the constraint violations C(T)C(T), we break it into multiple components. We can then bound these components individually.

E.1 Constraint breakdown

We first break down our constraint violations into multiple parts which will help us bound the constraint violations.

C(T)\displaystyle C(T) =(g(1Tt=1Tc1(st,at),,1Tt=1Tcd(st,at)))+\displaystyle=\left(g\left(\frac{1}{T}\sum_{t=1}^{T}c_{1}(s_{t},a_{t}),\cdots,\frac{1}{T}\sum_{t=1}^{T}c_{d}(s_{t},a_{t})\right)\right)_{+} (133)
=(g(1Tt=1Tc1(st,at),,1Tt=1Tcd(st,at))+1Te=1ETeg(ζπeP~e(1),,ζπeP~e(d))\displaystyle=\Bigg{(}g\left(\frac{1}{T}\sum_{t=1}^{T}c_{1}(s_{t},a_{t}),\cdots,\frac{1}{T}\sum_{t=1}^{T}c_{d}(s_{t},a_{t})\right)+\frac{1}{T}\sum_{e=1}^{E}T_{e}g\left(\zeta_{\pi_{e}}^{\tilde{P}_{e}}(1),\cdots,\zeta_{\pi_{e}}^{\tilde{P}_{e}}(d)\right)
1Te=1ETeg(ζπeP~e(1),,ζπeP~e(d)))+\displaystyle~{}~{}-\frac{1}{T}\sum_{e=1}^{E}T_{e}g\left(\zeta_{\pi_{e}}^{\tilde{P}_{e}}(1),\cdots,\zeta_{\pi_{e}}^{\tilde{P}_{e}}(d)\right)\Bigg{)}_{+} (134)
(g(1Tt=1Tc1(st,at),,1Tt=1Tcd(st,at))1Te=1ETeg(ζπeP~e(1),,ζπeP~e(d))1Te=1ETeϵe)+\displaystyle\leq\left(g\left(\frac{1}{T}\sum_{t=1}^{T}c_{1}(s_{t},a_{t}),\cdots,\frac{1}{T}\sum_{t=1}^{T}c_{d}(s_{t},a_{t})\right)-\frac{1}{T}\sum_{e=1}^{E}T_{e}g\left(\zeta_{\pi_{e}}^{\tilde{P}_{e}}(1),\cdots,\zeta_{\pi_{e}}^{\tilde{P}_{e}}(d)\right)-\frac{1}{T}\sum_{e=1}^{E}T_{e}\epsilon_{e}\right)_{+} (135)
(g(1Tt=1Tc1(st,at),,1Tt=1Tcd(st,at))g(1Te=1ETeζπeP~e(1),,1Te=1ETeζπeP~e(d))1Te=1ETeϵe)+\displaystyle\leq\left(g\left(\frac{1}{T}\sum_{t=1}^{T}c_{1}(s_{t},a_{t}),\cdots,\frac{1}{T}\sum_{t=1}^{T}c_{d}(s_{t},a_{t})\right)-g\left(\frac{1}{T}\sum_{e=1}^{E}T_{e}\zeta_{\pi_{e}}^{\tilde{P}_{e}}(1),\cdots,\frac{1}{T}\sum_{e=1}^{E}T_{e}\zeta_{\pi_{e}}^{\tilde{P}_{e}}(d)\right)-\frac{1}{T}\sum_{e=1}^{E}T_{e}\epsilon_{e}\right)_{+} (136)
(Li=1d|1Te=1Et=tete+11(ci(st,at)ζπeP~e(i))|1Te=1ETeϵe)+\displaystyle\leq\left(L\sum_{i=1}^{d}\Big{|}\frac{1}{T}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(c_{i}(s_{t},a_{t})-\zeta_{\pi_{e}}^{\tilde{P}_{e}}(i)\right)\Big{|}-\frac{1}{T}\sum_{e=1}^{E}T_{e}\epsilon_{e}\right)_{+} (137)
(Li=1d|1Te=1Et=tete+11(ci(st,at)ζπeP(i)+ζπeP(i)ζπeP~e(i))|1Te=1ETeϵe)+\displaystyle\leq\left(L\sum_{i=1}^{d}\Big{|}\frac{1}{T}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(c_{i}(s_{t},a_{t})-\zeta_{\pi_{e}}^{P}(i)+\zeta_{\pi_{e}}^{P}(i)-\zeta_{\pi_{e}}^{\tilde{P}_{e}}(i)\right)\Big{|}-\frac{1}{T}\sum_{e=1}^{E}T_{e}\epsilon_{e}\right)_{+} (138)
(LTi=1d|e=1Et=tete+11(ci(st,at)ζπeP(i))|+LTi=1d|e=1Et=tete+11(ζπeP(i)ζπeP~e(i))|1Te=1ETeϵe)+\displaystyle\leq\left(\frac{L}{T}\sum_{i=1}^{d}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(c_{i}(s_{t},a_{t})-\zeta_{\pi_{e}}^{P}(i)\right)\Big{|}+\frac{L}{T}\sum_{i=1}^{d}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\zeta_{\pi_{e}}^{P}(i)-\zeta_{\pi_{e}}^{\tilde{P}_{e}}(i)\right)\Big{|}-\frac{1}{T}\sum_{e=1}^{E}T_{e}\epsilon_{e}\right)_{+} (139)
(C3(T)+C2(T)C1(T))+\displaystyle\leq\left(C_{3}(T)+C_{2}(T)-C_{1}(T)\right)_{+} (140)

where Equation (135) comes from the fact that the policy πe\pi_{e} is the solution of the ϵe\epsilon_{e}-conservative optimization problem. Equation (136) comes from the convexity of the constraint g()g(\cdot). Equation (137) follows from the Lipschitz assumption. The three terms in Equation (140) are now defined as:

C1(T)\displaystyle C_{1}(T) =1Te=1ETeϵe\displaystyle=\frac{1}{T}\sum_{e=1}^{E}T_{e}\epsilon_{e} (141)

C1(T)C_{1}(T) denotes the slack accumulated from playing the policy for the ϵe\epsilon_{e}-tight optimization problem on the optimistic MDP.

C2(T)\displaystyle C_{2}(T) =LTi=1d|e=1Et=tete+11(ζπeP(i)ζπeP~e(i))|\displaystyle=\frac{L}{T}\sum_{i=1}^{d}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\zeta_{\pi_{e}}^{P}(i)-\zeta_{\pi_{e}}^{\tilde{P}_{e}}(i)\right)\Big{|} (142)

C2(T)C_{2}(T) denotes the difference between the long-term average costs incurred by playing the policy πe\pi_{e} on the true MDP with transitions PP and on the optimistic MDP with transitions P~e\tilde{P}_{e}. This term is bounded similarly to R2(T)R_{2}(T).

C3(T)\displaystyle C_{3}(T) =LTi=1d|e=1Et=tete+11(ci(st,at)ζπeP(i))|\displaystyle=\frac{L}{T}\sum_{i=1}^{d}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(c_{i}(s_{t},a_{t})-\zeta_{\pi_{e}}^{P}(i)\right)\Big{|} (143)

C3(T)C_{3}(T) denotes the difference between the long-term average costs of playing the policy πe\pi_{e} on the true MDP with transitions PP and the realized costs. This term is bounded similarly to R3(T)R_{3}(T).

E.2 Bounding C1(T)C_{1}(T)

Note that C1(T)C_{1}(T) is the slack which absorbs the constraint violations arising from the imperfect knowledge of the true MDP and from the deviations of the incurred costs from the expected costs. We now lower bound C1(T)C_{1}(T) to show that this slack is sufficient. With this idea, we have the following set of equations.

C1(T)\displaystyle C_{1}(T) =1Te=1Et=tete+11ϵe\displaystyle=\frac{1}{T}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\epsilon_{e} (144)
=1Te=1Et=tete+11Klogtete\displaystyle=\frac{1}{T}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}K\sqrt{\frac{\log t_{e}}{t_{e}}} (145)
K1Te=EEt=tete+11log(T/4)te\displaystyle\geq K\frac{1}{T}\sum_{e=E^{\prime}}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\sqrt{\frac{\log(T/4)}{t_{e}}} (146)
K1Te=EEt=tete+11log(T/4)T\displaystyle\geq K\frac{1}{T}\sum_{e=E^{\prime}}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\sqrt{\frac{\log(T/4)}{T}} (147)
=K1T(TtE)log(T/4)T\displaystyle=K\frac{1}{T}\left(T-t_{E^{\prime}}\right)\sqrt{\frac{\log(T/4)}{T}} (148)
K12log(T/4)T\displaystyle\geq K\frac{1}{2}\sqrt{\frac{\log(T/4)}{T}} (149)
K14logTT\displaystyle\geq K\frac{1}{4}\sqrt{\frac{\log T}{T}} (150)

where EE^{\prime} is some epoch for which T/4tE<T/2T/4\leq t_{E^{\prime}}<{T/2}.

E.3 Bounding C2(T)C_{2}(T), and C3(T)C_{3}(T)

We note that the cost terms C2(T)C_{2}(T) and C3(T)C_{3}(T) follow the same bounds as R2(T)R_{2}(T) and R3(T)R_{3}(T), respectively. Thus, replacing rr with cic_{i}, we obtain constraint violations due to imperfect system knowledge and system stochasticity of O~(LdTMSA/T)\tilde{O}(LdT_{M}S\sqrt{A/T}).

Summing the three terms and choosing K=Θ(LdTMSA)K=\Theta(LdT_{M}S\sqrt{A}) ensures that the slack C1(T)C_{1}(T) dominates C2(T)+C3(T)C_{2}(T)+C_{3}(T), which gives the required bound on the constraint violations.

Appendix F Concentration bound results

We want to bound the deviation of the estimated transition probabilities of the Markov decision process \mathcal{M} from the true transition probabilities. For that, we use the 1\ell_{1} deviation bounds from (Weissman et al., 2003). Consider the following event,

t={P^(|s,a)P(|s,a)114Slog(2AT)max{1,n(s,a)}(s,a)𝒮×𝒜}\displaystyle\mathcal{E}_{t}=\left\{\|\hat{P}(\cdot|s,a)-P(\cdot|s,a)\|_{1}\leq\sqrt{\frac{14S\log(2AT)}{\max\{1,n(s,a)\}}}\forall(s,a)\in\mathcal{S}\times\mathcal{A}\right\} (151)

where n(s,a)=t=1t𝟏{st=s,at=a}n(s,a)=\sum_{t^{\prime}=1}^{t}\bm{1}_{\{s_{t^{\prime}}=s,a_{t^{\prime}}=a\}} is the number of visits to s,as,a up to time tt. Then, we have the following result:

Lemma F.1.

The probability that the event t\mathcal{E}_{t} fails to occur is upper bounded by 120t6\frac{1}{20t^{6}}.

Proof.

From the result of (Weissman et al., 2003), the 1\ell_{1} distance of a probability distribution over SS events with nn samples is bounded as:

(P(|s,a)P^(|s,a)1ϵ)(2S2)exp(nϵ22)(2S)exp(nϵ22)\displaystyle\mathbb{P}\left(\|P(\cdot|s,a)-\hat{P}(\cdot|s,a)\|_{1}\geq\epsilon\right)\leq(2^{S}-2)\exp{\left(-\frac{n\epsilon^{2}}{2}\right)}\leq(2^{S})\exp{\left(-\frac{n\epsilon^{2}}{2}\right)} (152)

Thus, setting ϵ=2n(s,a)log(2S20SAt7)14Sn(s,a)log(2At)14Sn(s,a)log(2AT)\epsilon=\sqrt{\frac{2}{n(s,a)}\log(2^{S}20SAt^{7})}\leq\sqrt{\frac{14S}{n(s,a)}\log(2At)}\leq\sqrt{\frac{14S}{n(s,a)}\log(2AT)}, we get,

(P(|s,a)P^(|s,a)114Sn(s,a)log(2At))\displaystyle\mathbb{P}\left(\|P(\cdot|s,a)-\hat{P}(\cdot|s,a)\|_{1}\geq\sqrt{\frac{14S}{n(s,a)}\log(2At)}\right) (2S)exp(n(s,a)22n(s,a)log(2S20SAt7))\displaystyle\leq(2^{S})\exp{\left(-\frac{n(s,a)}{2}\frac{2}{n(s,a)}\log(2^{S}20SAt^{7})\right)} (153)
=2S12S20SAt7\displaystyle=2^{S}\frac{1}{2^{S}20SAt^{7}} (154)
=120ASt7\displaystyle=\frac{1}{20ASt^{7}} (155)

We sum over all the possible values of n(s,a)n(s,a) up to time-step tt to bound the probability that the event t\mathcal{E}_{t} does not occur for a given state-action pair as:

n(s,a)=1t120SAt7120SAt6\displaystyle\sum_{n(s,a)=1}^{t}\frac{1}{20SAt^{7}}\leq\frac{1}{20SAt^{6}} (156)

Finally, summing over all the s,as,a, we get,

(P(|s,a)P^(|s,a)114Sn(s,a)log(2At) for some s,a)120t6\displaystyle\mathbb{P}\left(\|P(\cdot|s,a)-\hat{P}(\cdot|s,a)\|_{1}\geq\sqrt{\frac{14S}{n(s,a)}\log(2At)}~{}\text{ for some }s,a\right)\leq\frac{1}{20t^{6}} (157)
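
As a quick empirical illustration of Lemma F.1, the following Monte Carlo sketch estimates how often the 1\ell_{1} deviation of an empirical distribution exceeds the confidence radius for a single state-action pair. The true distribution, the visit count nn, and the trial budget are illustrative assumptions; since the radius is conservative, essentially no violations are observed.

```python
# A Monte Carlo sketch of the L1 deviation event of Eq. (151) for one (s, a) pair,
# with assumed problem sizes, visit count n, and a randomly drawn true distribution.
import numpy as np

rng = np.random.default_rng(2)
S, A, T, n, trials = 6, 4, 10_000, 1_000, 2_000
p_true = rng.dirichlet(np.ones(S))
radius = np.sqrt(14 * S * np.log(2 * A * T) / max(1, n))   # confidence radius of Eq. (151)

violations = 0
for _ in range(trials):
    p_hat = rng.multinomial(n, p_true) / n                 # empirical estimate from n visits
    violations += np.abs(p_hat - p_true).sum() > radius
print(radius, violations / trials)                         # empirical failure rate is ~0
```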

The second lemma is Azuma-Hoeffding’s inequality, which we use to bound Martingale difference sequences.

Lemma F.2 (Azuma-Hoeffding’s Inequality).

Let X_{1},\cdots,X_{n} be a martingale difference sequence such that |X_{i}|\leq c for all i\in\{1,2,\cdots,n\}. Then,

(|i=1nXi|ϵ)2exp(ϵ22nc2)\displaystyle\mathbb{P}\left(|\sum_{i=1}^{n}X_{i}|\geq\epsilon\right)\leq 2\exp{\left(-\frac{\epsilon^{2}}{2nc^{2}}\right)} (158)
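For instance, equating the right-hand side of Equation (158) to a target failure probability \delta and solving for \epsilon gives the high-probability form of the bound (this rearrangement is a standard step; the specific constants in our proofs follow from the particular choices of \delta made there):

\displaystyle\epsilon=c\sqrt{2n\log(2/\delta)}\quad\Longrightarrow\quad\mathbb{P}\left(\Big{|}\sum\nolimits_{i=1}^{n}X_{i}\Big{|}\geq c\sqrt{2n\log(2/\delta)}\right)\leq\delta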

Appendix G Posterior Sampling Algorithm

Note that in the UC-CURL algorithm, the agent solves for an optimistic policy. This convex optimization problem may be computationally intensive, with O(S^{2}A) additional variables and O(SA) additional constraints. We now present the posterior sampling version of the UC-CURL algorithm, which reduces this computational complexity by sampling the transition probabilities from the updated posterior. The posterior sampling algorithm is based on Lemma 1 of Osband et al. (2013), which we state formally here.

Lemma G.1.

[Posterior Sampling] If h is the distribution of \mathcal{M}, then, for any \sigma(\mathcal{F}_{t_{e}})-measurable function g,

𝔼[g()|Fte]=𝔼[g(e)|Fte]\displaystyle\mathbb{E}\left[g(\mathcal{M})|{F}_{t_{e}}\right]=\mathbb{E}\left[g(\mathcal{M}_{e})|{F}_{t_{e}}\right] (159)

where \mathcal{M}_{e} is the MDP sampled at the beginning of epoch e at time-step t_{e}.

We now present our posterior sampling based PS-CURL algorithm, described in Algorithm 2. Similar to the UC-CURL algorithm, the PS-CURL algorithm proceeds in epochs. At each epoch e, the agent samples \tilde{P}_{e}\sim h(\cdot|\mathcal{F}_{t_{e}}) and solves the following optimization problem for the optimal feasible policy.

\displaystyle\max_{\rho_{e}(s,a)}f\big{(}\sum\nolimits_{s,a}r(s,a)\rho_{e}(s,a)\big{)} (160)

with the following set of constraints,

s,aρe(s,a)=1,ρe(s,a)0\displaystyle\sum\nolimits_{s,a}\rho_{e}(s,a)=1,\ \ \rho_{e}(s,a)\geq 0 (161)
a𝒜ρe(s,a)=s,aP~e(s|s,a)ρe(s,a)\displaystyle\sum\nolimits_{a\in\mathcal{A}}\rho_{e}(s^{\prime},a)=\sum\nolimits_{s,a}\tilde{P}_{e}(s^{\prime}|s,a)\rho_{e}(s,a) (162)
g(s,ac1(s,a)ρe(s,a),,s,acd(s,a)ρe(s,a))ϵe\displaystyle g\big{(}\sum\nolimits_{s,a}c_{1}(s,a)\rho_{e}(s,a),\cdots,\sum\nolimits_{s,a}c_{d}(s,a)\rho_{e}(s,a)\big{)}\leq-\epsilon_{e} (163)

for all s^{\prime}\in\mathcal{S}, s\in\mathcal{S}, and a\in\mathcal{A}. Using the solution \rho_{e} of this \epsilon_{e}-tight optimization problem for the sampled MDP, we obtain the conservative policy for epoch e as:

πe(a|s)=ρe(s,a)b𝒜ρe(s,b)s,a\displaystyle\pi_{e}(a|s)=\frac{\rho_{e}(s,a)}{\sum_{b\in\mathcal{A}}\rho_{e}(s,b)}\forall\ s,a (164)
Algorithm 2 PS-CURL

Parameters: K
Input: S, A, r, d, c_{i}\ \forall\ i\in[d]

1:  Let t=1, e=1, \epsilon_{e}=K\sqrt{\frac{\ln t}{t}}
2:  \nu_{e}(s,a)=0, N_{e}(s,a)=0~\forall~s,a
3:  Solve for policy \pi_{e} using Eq. (164)
4:  for t\in\{1,2,\cdots\} do
5:     Observe s_{t}, and play a_{t}\sim\pi_{e}(\cdot|s_{t})
6:     Observe s_{t+1}, r(s_{t},a_{t}) and c_{i}(s_{t},a_{t})\ \forall\ i\in[d]
7:     \nu_{e}(s_{t},a_{t})=\nu_{e}(s_{t},a_{t})+1
8:     if \nu_{e}(s,a)=\max\{1,N_{e}(s,a)\} for any s,a then
9:        for (s,a)\in\mathcal{S}\times\mathcal{A} do
10:           N_{e+1}(s,a)=N_{e}(s,a)+\nu_{e}(s,a)
11:        end for
12:        e=e+1, \nu_{e}(s,a)=0~\forall~s,a
13:        \epsilon_{e}=K\sqrt{\frac{\ln t}{t}}
14:        \tilde{P}_{e}\sim h(\cdot|\mathcal{H}_{t})
15:        Solve for policy \pi_{e} using Eq. (164)
16:     end if
17:  end for
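To make the per-epoch computation concrete, the following is a minimal sketch of lines 14-15 of Algorithm 2: it solves the \epsilon_{e}-tight problem of Equations (160)-(163) for a sampled kernel and extracts \pi_{e} via Equation (164). The function name, the use of cvxpy, the choice f(x)=\log x, and the componentwise encoding of g are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def ps_curl_policy(P_tilde, r, costs, eps_e):
    """Sketch of one PS-CURL epoch update.

    P_tilde[s, a, s']: sampled transition kernel, r[s, a]: rewards,
    costs: list of cost matrices c_i[s, a], eps_e: tightening parameter.
    """
    S, A, _ = P_tilde.shape
    rho = cp.Variable((S, A), nonneg=True)        # occupation measure, Eq. (161)
    constraints = [cp.sum(rho) == 1]
    # Flow-balance constraints of Eq. (162) for every next state s'.
    constraints += [cp.sum(rho[s_next, :]) ==
                    cp.sum(cp.multiply(P_tilde[:, :, s_next], rho))
                    for s_next in range(S)]
    # Illustrative concave f and convex g; the paper only assumes f concave, g convex.
    avg_reward = cp.sum(cp.multiply(r, rho))
    objective = cp.Maximize(cp.log(avg_reward))   # f(x) = log(x), Eq. (160)
    # g(x_1, ..., x_d) = max_i x_i <= -eps_e, encoded componentwise, Eq. (163).
    constraints += [cp.sum(cp.multiply(c, rho)) <= -eps_e for c in costs]
    cp.Problem(objective, constraints).solve()
    rho_val = np.maximum(rho.value, 0.0)
    row_sums = np.maximum(rho_val.sum(axis=1, keepdims=True), 1e-12)
    return rho_val / row_sums                     # policy pi_e(a|s), Eq. (164)
```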

For the UC-CURL algorithm, the true MDP lies in the confidence interval with high probability, and hence the existence of a solution to the optimization problem was guaranteed. However, the same is not true for the MDP with sampled transition probabilities. We want a policy \pi_{e} to exist such that Equation (163) holds. We obtain a condition for the existence of such a policy in the following lemma. To obtain the lemma, we first state a tighter Slater assumption as:

Assumption G.2.

There exists a policy \pi, and constants \delta>LdST_{M}\sqrt{(A\log T)/T}+(CSA\log T)/(T(1-\rho)) and \Gamma>Ld\left(2ST_{M}\sqrt{14A\log AT/T^{1/3}}+CST_{M}/((1-\rho)T^{1/3})\right) such that

g(ζπP,1,,ζπP,K2)δΓ\displaystyle g\left(\zeta_{\pi}^{P,1},\cdots,\zeta_{\pi}^{P,K_{2}}\right)\leq-\delta-\Gamma (165)
Lemma G.3.

If there exists a policy \pi such that

g(ζπP(1),,ζπP(d))δΓ,\displaystyle g\left(\zeta_{\pi}^{P}(1),\cdots,\zeta_{\pi}^{P}(d)\right)\leq-\delta-\Gamma, (166)

and there exist epochs e and e+1 with start time-steps t_{e} and t_{e+1}, respectively, satisfying t_{e+1}-t_{e}\geq T^{1/3}, then for \|\tilde{P}_{e}(\cdot|s,a)-P(\cdot|s,a)\|_{1}\leq\sqrt{\frac{14S\log(2At)}{N_{e}(s,a)}}, the policy \pi satisfies,

g(ζπP~e(1),,ζπP~e(d))δ.\displaystyle g\left(\zeta_{\pi}^{\tilde{P}_{e}}(1),\cdots,\zeta_{\pi}^{\tilde{P}_{e}}(d)\right)\leq-\delta. (167)
Proof.

We start with the Lipschitz assumption (Assumption 3.4) to obtain,

|g(ζπP~e(1),,ζπP~e(d))\displaystyle|g\left(\zeta_{\pi}^{\tilde{P}_{e}}(1),\cdots,\zeta_{\pi}^{\tilde{P}_{e}}(d)\right) g(ζπP(1),,ζπP(d))|Ldmaxi|ζπP~e(i)ζπP(i)|\displaystyle-g\left(\zeta_{\pi}^{P}(1),\cdots,\zeta_{\pi}^{P}(d)\right)|\leq Ld\max_{i}|\zeta_{\pi}^{\tilde{P}_{e}}(i)-\zeta_{\pi}^{P}(i)| (168)
g(ζπP~e(1),,ζπP~e(d))\displaystyle\implies g\left(\zeta_{\pi}^{\tilde{P}_{e}}(1),\cdots,\zeta_{\pi}^{\tilde{P}_{e}}(d)\right) Ldmaxi|ζπP~e(i)ζπP(i)|+g(ζπP(1),,ζπP(d))\displaystyle\leq Ld\max_{i}|\zeta_{\pi}^{\tilde{P}_{e}}(i)-\zeta_{\pi}^{P}(i)|+g\left(\zeta_{\pi}^{P}(1),\cdots,\zeta_{\pi}^{P}(d)\right) (169)

where Equation (169) is obtained by dropping the absolute value in the previous equation. We now bound the term |\zeta_{\pi}^{\tilde{P}_{e}}(i)-\zeta_{\pi}^{P}(i)| using the Bellman error. We have,

ζπP~e(i)ζπP(i)=s,aρπPBiπe,P~e(s,a)=𝔼[Bπe,P~e(s,a)]\displaystyle\zeta_{\pi}^{\tilde{P}_{e}}(i)-\zeta_{\pi}^{P}(i)=\sum_{s,a}\rho_{\pi}^{P}B^{\pi_{e},\tilde{P}_{e}}_{i}(s,a)=\mathbb{E}\left[B^{\pi_{e},\tilde{P}_{e}}(s,a)\right] (170)

where B^{\pi_{e},\tilde{P}_{e}}_{i}(s,a) is the Bellman error for cost i. We bound the expectation using Azuma-Hoeffding's inequality as follows:

𝔼[Bπe,P~e(s,a)]\displaystyle\mathbb{E}\left[B^{\pi_{e},\tilde{P}_{e}}(s,a)\right] =𝔼[Bπe,P~e(st,at)|te1]+Cρtte\displaystyle=\mathbb{E}\left[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}\right]+C\rho^{t-t_{e}} (171)
=1te+1tetete+11(𝔼[Bπe,P~e(st,at)|te1]+Cρtte)\displaystyle=\frac{1}{t_{e+1}-t_{e}}\sum_{t_{e}}^{t_{e+1}-1}\left(\mathbb{E}\left[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}\right]+C\rho^{t-t_{e}}\right) (172)
1te+1tetete+11(𝔼[Bπe,P~e(st,at)|te1])+CSh~()(1ρ)(te+1te)\displaystyle\leq\frac{1}{t_{e+1}-t_{e}}\sum_{t_{e}}^{t_{e+1}-1}\left(\mathbb{E}\left[B^{\pi_{e},\tilde{P}_{e}}(s_{t},a_{t})|\mathcal{F}_{t_{e}-1}\right]\right)+\frac{CS\|\tilde{h}(\cdot)\|_{\infty}}{(1-\rho)(t_{e+1}-t_{e})} (173)
1te+1te(h~14SlogATs,aνe(s,a)Ne(s,a)+4h~()7(te+1te)log(te+1te))\displaystyle\leq\frac{1}{t_{e+1}-t_{e}}\left(\|\tilde{h}\|_{\infty}\sqrt{14S\log AT}\sum_{s,a}\frac{\nu_{e}(s,a)}{\sqrt{N_{e}(s,a)}}+4\|\tilde{h}(\cdot)\|_{\infty}\sqrt{7(t_{e+1}-t_{e})\log(t_{e+1}-t_{e})}\right)
+CSh~()(1ρ)(te+1te)\displaystyle~{}~{}+\frac{CS\|\tilde{h}(\cdot)\|_{\infty}}{(1-\rho)(t_{e+1}-t_{e})} (174)
1te+1te(h~14SlogATs,aνe(s,a)+4h~()7(te+1te)log(te+1te))\displaystyle\leq\frac{1}{t_{e+1}-t_{e}}\left(\|\tilde{h}\|_{\infty}\sqrt{14S\log AT}\sum_{s,a}\sqrt{\nu_{e}(s,a)}+4\|\tilde{h}(\cdot)\|_{\infty}\sqrt{7(t_{e+1}-t_{e})\log(t_{e+1}-t_{e})}\right)
+CSh~()(1ρ)(te+1te)\displaystyle~{}~{}+\frac{CS\|\tilde{h}(\cdot)\|_{\infty}}{(1-\rho)(t_{e+1}-t_{e})} (175)
1te+1te(h~S14AlogATs,aνe(s,a)+4h~()7(te+1te)log(te+1te))\displaystyle\leq\frac{1}{t_{e+1}-t_{e}}\left(\|\tilde{h}\|_{\infty}S\sqrt{14A\log AT}\sqrt{\sum_{s,a}\nu_{e}(s,a)}+4\|\tilde{h}(\cdot)\|_{\infty}\sqrt{7(t_{e+1}-t_{e})\log(t_{e+1}-t_{e})}\right)
+CSh~()(1ρ)(te+1te)\displaystyle~{}~{}+\frac{CS\|\tilde{h}(\cdot)\|_{\infty}}{(1-\rho)(t_{e+1}-t_{e})} (176)
1te+1te(h~S14AlogAT(te+1te)+4h~()7(te+1te)log(te+1te))\displaystyle\leq\frac{1}{t_{e+1}-t_{e}}\left(\|\tilde{h}\|_{\infty}S\sqrt{14A\log AT}\sqrt{(t_{e+1}-t_{e})}+4\|\tilde{h}(\cdot)\|_{\infty}\sqrt{7(t_{e+1}-t_{e})\log(t_{e+1}-t_{e})}\right)
+CSh~()(1ρ)(te+1te)\displaystyle~{}~{}+\frac{CS\|\tilde{h}(\cdot)\|_{\infty}}{(1-\rho)(t_{e+1}-t_{e})} (177)
(h~S14AlogAT(te+1te)+4h~()7log(te+1te)(te+1te))+CSh~()(1ρ)(te+1te)\displaystyle\leq\left(\|\tilde{h}\|_{\infty}S\sqrt{\frac{14A\log AT}{(t_{e+1}-t_{e})}}+4\|\tilde{h}(\cdot)\|_{\infty}\sqrt{\frac{7\log(t_{e+1}-t_{e})}{(t_{e+1}-t_{e})}}\right)+\frac{CS\|\tilde{h}(\cdot)\|_{\infty}}{(1-\rho)(t_{e+1}-t_{e})} (178)

where Equation (172) is obtained by averaging both sides over t=t_{e} to t=t_{e+1}-1. Equation (173) is obtained by summing the geometric series with ratio \rho. Equation (174) comes from Lemma D.4. Equation (175) comes from the fact that N_{e}(s,a)\geq\nu_{e}(s,a) for all s,a, and then replacing N_{e}(s,a) with this lower bound. Equation (176) follows from the Cauchy-Schwarz inequality. Equation (177) follows from the fact that the epoch length t_{e+1}-t_{e} equals the total number of visits to all state-action pairs in the epoch.

Combining Equation (178) with Equation (169), and bounding the \|\tilde{h}(\cdot)\|_{\infty} term by T_{M}, we obtain the required result as follows:

g(ζπP~e(1),,ζπP~e(d))\displaystyle g\left(\zeta_{\pi}^{\tilde{P}_{e}}(1),\cdots,\zeta_{\pi}^{\tilde{P}_{e}}(d)\right) Ldmaxi|ζπP~e(i)ζπP(i)|+g(ζπP(1),,ζπP(d))\displaystyle\leq Ld\max_{i}|\zeta_{\pi}^{\tilde{P}_{e}}(i)-\zeta_{\pi}^{P}(i)|+g\left(\zeta_{\pi}^{P}(1),\cdots,\zeta_{\pi}^{P}(d)\right) (179)
Ld((TMS14AlogAT(te+1te)+4TM7log(te+1te)(te+1te))\displaystyle\leq Ld\Big{(}\left(T_{M}S\sqrt{\frac{14A\log AT}{(t_{e+1}-t_{e})}}+4T_{M}\sqrt{\frac{7\log(t_{e+1}-t_{e})}{(t_{e+1}-t_{e})}}\right)
+CTMS(1ρ)(te+1te))+g(ζπP(1),,ζπP(d))\displaystyle~{}~{}+\frac{CT_{M}S}{(1-\rho)(t_{e+1}-t_{e})}\Big{)}+g\left(\zeta_{\pi}^{P}(1),\cdots,\zeta_{\pi}^{P}(d)\right) (180)
Ld((TMS14AlogAT(te+1te)+4TM7log(te+1te)(te+1te))\displaystyle\leq Ld\Big{(}\left(T_{M}S\sqrt{\frac{14A\log AT}{(t_{e+1}-t_{e})}}+4T_{M}\sqrt{\frac{7\log(t_{e+1}-t_{e})}{(t_{e+1}-t_{e})}}\right)
+CTMS(1ρ)(te+1te))δΓ\displaystyle~{}~{}~{}+\frac{CT_{M}S}{(1-\rho)(t_{e+1}-t_{e})}\Big{)}-\delta-\Gamma (181)
δ,\displaystyle\leq-\delta, (182)

where Equation (182) follows from the definition of \Gamma in Assumption G.2 and t_{e+1}-t_{e}\geq T^{1/3}. ∎

From Lemma G.3, we observe that a tighter Slater condition on the true MDP only guarantees a weaker Slater condition for the sampled MDP. However, we make this assumption to ensure the feasibility of the optimization problem in Equation (160).

The Bayesian regret of the PS-CURL algorithm is defined as follows:

𝔼[R(T)]\displaystyle\mathbb{E}[R(T)] =𝔼[f(λπP)f(t=1Tr(st,at)/T)]\displaystyle=\mathbb{E}\left[f\left(\lambda_{\pi^{*}}^{P}\right)-f\left(\sum\nolimits_{t=1}^{T}r(s_{t},a_{t})/T\right)\right]

Similarly, we define the Bayesian constraint violations, C(T), as the expected gap between the incurred constraint value and the constraint bound, or

\displaystyle\mathbb{E}[C(T)]=\mathbb{E}\left[\left(g\left(\sum\nolimits_{t=1}^{T}c_{1}(s_{t},a_{t})/T,\cdots,\sum\nolimits_{t=1}^{T}c_{d}(s_{t},a_{t})/T\right)\right)_{+}\right]

where (x)_{+}=\max(0,x).

Now, we can use Lemma G.1 to obtain \mathbb{E}[f(\lambda_{\pi^{*}}^{P})|\mathcal{F}_{t_{e}}]=\mathbb{E}[f(\lambda_{\pi_{e}}^{\tilde{P}_{e}})|\mathcal{F}_{t_{e}}] and \mathbb{E}[\zeta_{\pi^{*}}^{P}(i)|\mathcal{F}_{t_{e}}]=\mathbb{E}[\zeta_{\pi_{e}}^{\tilde{P}_{e}}(i)|\mathcal{F}_{t_{e}}]~\forall~i, and follow an analysis similar to that of Theorem 5.6 to obtain the required regret bounds.

G.1 Bound on constraints

We now bound the constraint violations and prove that, by using a conservative policy, we can reduce the constraint violations to 0. We have:

C(T)\displaystyle C(T) =(g(1Tt=1Tc1(st,at),,1Tt=1Tcd(st,at)))+\displaystyle=\left(g\left(\frac{1}{T}\sum_{t=1}^{T}c_{1}(s_{t},a_{t}),\cdots,\frac{1}{T}\sum_{t=1}^{T}c_{d}(s_{t},a_{t})\right)\right)_{+} (183)
=(g(1Tt=1Tc1(st,at),,1Tt=1Tcd(st,at))1Te=1ETeg(ζπeP~e,1,,ζπeP~e,K2)\displaystyle=\Bigg{(}g\left(\frac{1}{T}\sum_{t=1}^{T}c_{1}(s_{t},a_{t}),\cdots,\frac{1}{T}\sum_{t=1}^{T}c_{d}(s_{t},a_{t})\right)-\frac{1}{T}\sum_{e=1}^{E}T_{e}g\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},1},\cdots,\zeta_{\pi_{e}}^{\tilde{P}_{e},K_{2}}\right)
+1Te=1ETeg(ζπeP~e,1,,ζπeP~e,K2))+\displaystyle~{}~{}+\frac{1}{T}\sum_{e=1}^{E}T_{e}g\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},1},\cdots,\zeta_{\pi_{e}}^{\tilde{P}_{e},K_{2}}\right)\Bigg{)}_{+} (184)
(g(1Tt=1Tc1(st,at),,1Tt=1Tcd(st,at))1Te=1ETeg(ζπeP~e,1,,ζπeP~e,K2)+C1)+\displaystyle\leq\left(g\left(\frac{1}{T}\sum_{t=1}^{T}c_{1}(s_{t},a_{t}),\cdots,\frac{1}{T}\sum_{t=1}^{T}c_{d}(s_{t},a_{t})\right)-\frac{1}{T}\sum_{e=1}^{E}T_{e}g\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},1},\cdots,\zeta_{\pi_{e}}^{\tilde{P}_{e},K_{2}}\right)+C_{1}\right)_{+} (185)
(g(1Tt=1Tc1(st,at),,1Tt=1Tcd(st,at))g(1Te=1ETeζπeP~e,1,,1Te=1ETeζπeP~e,K2)+C1)+\displaystyle\leq\left(g\left(\frac{1}{T}\sum_{t=1}^{T}c_{1}(s_{t},a_{t}),\cdots,\frac{1}{T}\sum_{t=1}^{T}c_{d}(s_{t},a_{t})\right)-g\left(\frac{1}{T}\sum_{e=1}^{E}T_{e}\zeta_{\pi_{e}}^{\tilde{P}_{e},1},\cdots,\frac{1}{T}\sum_{e=1}^{E}T_{e}\zeta_{\pi_{e}}^{\tilde{P}_{e},K_{2}}\right)+C_{1}\right)_{+} (186)
(Lk=1K2|1Te=1Et=tete+11(ck(st,at)ζπeP~e,k)|+C1)+\displaystyle\leq\left(L\sum_{k=1}^{K_{2}}\Big{|}\frac{1}{T}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(c^{k}(s_{t},a_{t})-\zeta_{\pi_{e}}^{\tilde{P}_{e},k}\right)\Big{|}+C_{1}\right)_{+} (187)
(Lk=1K2|1Te=1Et=tete+11(ck(st,at)ζπeP,k+ζπeP,kζπeP~e,k)|+C1)+\displaystyle\leq\left(L\sum_{k=1}^{K_{2}}\Big{|}\frac{1}{T}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(c^{k}(s_{t},a_{t})-\zeta_{\pi_{e}}^{P,k}+\zeta_{\pi_{e}}^{P,k}-\zeta_{\pi_{e}}^{\tilde{P}_{e},k}\right)\Big{|}+C_{1}\right)_{+} (188)
(LTk=1K2|e=1Et=tete+11(ck(st,at)ζπeP,k)|+LTk=1K2|e=1Et=tete+11(ζπeP,kζπeP~e,k)|+C1)+\displaystyle\leq\left(\frac{L}{T}\sum_{k=1}^{K_{2}}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(c^{k}(s_{t},a_{t})-\zeta_{\pi_{e}}^{P,k}\right)\Big{|}+\frac{L}{T}\sum_{k=1}^{K_{2}}\Big{|}\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\zeta_{\pi_{e}}^{P,k}-\zeta_{\pi_{e}}^{\tilde{P}_{e},k}\right)\Big{|}+C_{1}\right)_{+} (189)
(C3(T)+C2(T)+C1(T))+\displaystyle\leq\left(C_{3}(T)+C_{2}(T)+C_{1}(T)\right)_{+} (190)

where Equation (186) follows from the convexity of the function g, and Equation (187) follows from the Lipschitz continuity of g.

We bound C_{2}(T)+C_{3}(T) similarly to the analysis of R(T) by

𝒪~(TMSAT+CTMS2A(1ρ)T)\displaystyle\tilde{\mathcal{O}}\left(T_{M}S\sqrt{\frac{A}{T}}+\frac{CT_{M}S^{2}A}{(1-\rho)T}\right) (191)

We focus our attention on bounding C_{1}(T). For this, note that in Assumption 3.4 we assumed that the constraint function g is Lipschitz continuous with bounded gradients at all points. This implies that, over a bounded input domain, the function g is bounded. We denote this upper bound by g_{\infty}. We now obtain the bound on C_{1}(T) as:

C1(T)\displaystyle C_{1}(T) =1Te=1ETe(g(ζπeP~e,1,,ζπeP~e,K2))\displaystyle=\frac{1}{T}\sum_{e=1}^{E}T_{e}\left(g\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},1},\cdots,\zeta_{\pi_{e}}^{\tilde{P}_{e},K_{2}}\right)\right) (192)
=1Te=1ETe(g(ζπeP~e,1,,ζπeP~e,K2))𝟏{TeT1/3}\displaystyle=\frac{1}{T}\sum_{e=1}^{E}T_{e}\left(g\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},1},\cdots,\zeta_{\pi_{e}}^{\tilde{P}_{e},K_{2}}\right)\right)\bm{1}\{T_{e}\geq T^{1/3}\}
+1Te=1ETe(g(ζπeP~e,1,,ζπeP~e,K2))𝟏{Te<T1/3}\displaystyle~{}~{}~{}+\frac{1}{T}\sum_{e=1}^{E}T_{e}\left(g\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},1},\cdots,\zeta_{\pi_{e}}^{\tilde{P}_{e},K_{2}}\right)\right)\bm{1}\{T_{e}<T^{1/3}\} (193)
1Te=1ETe(g(ζπeP~e,1,,ζπeP~e,K2))𝟏{TeT1/3}+1Te=1ET1/3g\displaystyle\leq\frac{1}{T}\sum_{e=1}^{E}T_{e}\left(g\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},1},\cdots,\zeta_{\pi_{e}}^{\tilde{P}_{e},K_{2}}\right)\right)\bm{1}\{T_{e}\geq T^{1/3}\}+\frac{1}{T}\sum_{e=1}^{E}T^{1/3}g_{\infty} (194)
1Te=1ETeϵe𝟏{TeT1/3}+1TET1/3g\displaystyle\leq-\frac{1}{T}\sum_{e=1}^{E}T_{e}\epsilon_{e}\bm{1}\{T_{e}\geq T^{1/3}\}+\frac{1}{T}ET^{1/3}g_{\infty} (195)
\displaystyle=-\frac{1}{T}\sum_{e=1}^{E}T_{e}\epsilon_{e}\left(1-\bm{1}\{T_{e}<T^{1/3}\}\right)+\frac{1}{T}ET^{1/3}g_{\infty} (196)
\displaystyle=-\frac{1}{T}\sum_{e=1}^{E}T_{e}\epsilon_{e}+\frac{1}{T}\sum_{e=1}^{E}T_{e}\epsilon_{e}\bm{1}\{T_{e}<T^{1/3}\}+\frac{1}{T}ET^{1/3}g_{\infty} (197)
\displaystyle\leq-\frac{K}{4}\sqrt{\frac{\log T}{T}}+\frac{1}{T}\sum_{e=1}^{E}T^{1/3}\kappa+\frac{1}{T}ET^{1/3}g_{\infty} (198)
\displaystyle=-\frac{K}{4}\sqrt{\frac{\log T}{T}}+\frac{ET^{1/3}\kappa}{T}+\frac{ET^{1/3}g_{\infty}}{T} (199)

where Equation (194) follows from the bound g(\mathbf{x})\leq g_{\infty} and from T_{e}<T^{1/3} for the epochs in the second summation. Equation (195) follows from the conservative policy satisfying the \epsilon_{e}-tight constraint in Equation (163) for the epochs with T_{e}\geq T^{1/3}. Equation (198) follows from the lower bound on \frac{1}{T}\sum_{e}T_{e}\epsilon_{e} in Equation (150) and from bounding T_{e}\epsilon_{e} by T^{1/3}\kappa for the epochs shorter than T^{1/3}.

Thus, choosing K large enough that the \frac{K}{4}\sqrt{\frac{\log T}{T}} term dominates the remaining O(ET^{1/3}/T) terms along with C_{2}(T) and C_{3}(T), we can bound the constraint violations by 0.

Appendix H Further Discussions

H.1 Regarding Ergodicity in Assumption 3.1

Regarding the assumption on ergodicity in Assumption 3.1, we make two observations:

For MDPs with constraints, we note that for the optimal policy to be stationary, the MDP has to be ergodic; for finite-diameter MDPs, the optimal policy can be non-stationary. Consider an MDP with three states, left, middle, and right, and two actions, left and right. The left action keeps the state unchanged in the left state, takes the agent to the left state from the middle state, and to the middle state from the right state. Similarly, the right action takes the agent to the middle state from the left state and to the right state from the middle state, and keeps the agent in the right state. Since different policies have different recurrent classes (for a policy which takes only the left action, the recurrent class contains only the left state; for a policy which takes only the right action, the recurrent class contains only the right state), the MDP is non-ergodic. Further, the agent obtains a reward of +1 and a cost of 0 on taking the left action in the left state, and a reward of 0 and a cost of +1 on taking the right action in the right state. The uniformly random stationary policy provides an average (reward, cost) vector of (1/6,1/6), as the agent visits all three states with equal probability and takes either action with equal probability in each state. In contrast, by following a non-stationary policy, the agent can optimize both the reward and the cost and obtain an average (reward, cost) vector of (1/2,1/2); for this, the agent must stay in the left state as often as in the right state, making only minimal transitions via the middle state. Thus, the optimal policy can be non-stationary for non-ergodic MDPs. This example is provided in detail by Cheung (2019).
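As a numerical sanity check of this example, the following is a minimal sketch (the array layout and function calls are our own, not from the paper) that computes the stationary distribution of the uniformly random policy and its average reward and cost, recovering the (1/6,1/6) value above.

```python
import numpy as np

# States 0=left, 1=middle, 2=right; actions 0=left, 1=right.
# P[a, s, s'] is the deterministic transition kernel described in the text.
P = np.zeros((2, 3, 3))
P[0, 0, 0] = 1.0  # left action in left state: stay in left
P[0, 1, 0] = 1.0  # left action in middle state: move to left
P[0, 2, 1] = 1.0  # left action in right state: move to middle
P[1, 0, 1] = 1.0  # right action in left state: move to middle
P[1, 1, 2] = 1.0  # right action in middle state: move to right
P[1, 2, 2] = 1.0  # right action in right state: stay in right

reward = np.zeros((3, 2)); reward[0, 0] = 1.0  # +1 for (left state, left action)
cost = np.zeros((3, 2)); cost[2, 1] = 1.0      # +1 for (right state, right action)

pi = np.full((3, 2), 0.5)                      # uniformly random stationary policy
P_pi = np.einsum('sa,ast->st', pi, P)          # induced Markov chain

# Stationary distribution: left eigenvector of P_pi for eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()

avg_reward = np.sum(mu[:, None] * pi * reward)
avg_cost = np.sum(mu[:, None] * pi * cost)
print(mu, avg_reward, avg_cost)  # approx. [1/3, 1/3, 1/3], 1/6, 1/6
```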

The second observation is for finite-diameter MDPs, for which Chen et al. (2022) provided an algorithm that requires the knowledge of the time horizon T and the span of the costs sp_{c}. We note that these two quantities might not be known to the agent in advance. Further, the knowledge of the time horizon is required to divide the time horizon into epochs of duration O(T^{1/3}) to obtain a regret bound of O(T^{2/3}); this particular epoch length is required to bound the bias-span of the MDP considered in each epoch. Finally, we note that a finite mixing time is also assumed in other works on constrained infinite-horizon MDPs, e.g., Singh et al. (2020).

We note that even if we use other exploration strategies, we will still require the Bellman error analysis to handle stochastic policies. In this work, we balance exploration and exploitation by dividing the time horizon into epochs and updating the policy in each epoch using the MDP model built from the exploration performed in previous epochs. The regret analysis still needs to account for the impact of stochastic policies, and thus the analysis approach of this paper is needed for any exploration strategy.

We also note that since the MDP is ergodic, exploration can be done with any policy, and the agent does not need an optimistic MDP to explore. However, the agent wants to minimize the regret of the online algorithm, and hence it plays the optimal policy based on the MDP estimated/learned up to time t. To do so, the agent finds the optimistic policy, i.e., the policy which provides the highest possible reward within the confidence interval. Note that if the agent played an arbitrary policy, it would not obtain the same regret bound: an MDP worse than the true MDP can also exist in the confidence interval, and the policy optimized for it would not give the same performance. In the following, we provide a simplified problem setup and algorithm to demonstrate that such a policy may incur a large regret even under the ergodicity assumption.

Consider a simplified problem setup where f(\lambda_{\pi}^{P})=\lambda_{\pi}^{P} with no constraints; note that this is the classical RL setup. Also consider an algorithm where the agent uses the estimated MDP without considering the confidence intervals. After every epoch, the agent solves for the optimal policy using the following optimization problem.

maxρ(s,a)s,ar(s,a)ρ(s,a)\displaystyle\max_{\rho(s,a)}\sum\nolimits_{s,a}r(s,a)\rho(s,a) (200)

with the following set of constraints,

s,aρ(s,a)=1,ρ(s,a)0\displaystyle\sum\nolimits_{s,a}\rho(s,a)=1,\ \ \rho(s,a)\geq 0 (201)
a𝒜ρ(s,a)=s,aP^e(s|s,a)ρ(s,a)\displaystyle\sum\nolimits_{a\in\mathcal{A}}\rho(s^{\prime},a)=\sum\nolimits_{s,a}\hat{P}_{e}(s^{\prime}|s,a)\rho(s,a) (202)

where \hat{P}_{e}(\cdot|s,a) is the estimated transition probability to the next state given the state-action pair (s,a) after epoch e. Let \pi_{e} be the policy obtained from the solution of the optimization problem in Equations (200)-(202) for epoch e.
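The certainty-equivalence step in Equations (200)-(202) is a linear program over occupation measures. The following is a minimal sketch (the function name, the use of scipy, and the dense constraint construction are illustrative assumptions) of this step:

```python
import numpy as np
from scipy.optimize import linprog

def greedy_policy(P_hat, r):
    """P_hat[s, a, s']: estimated kernel; r[s, a]: rewards.

    Solves max_rho sum_{s,a} r(s,a) rho(s,a) subject to Eqs. (201)-(202).
    """
    S, A, _ = P_hat.shape
    c = -r.reshape(S * A)                       # linprog minimizes, so negate
    A_eq = np.zeros((S + 1, S * A))
    for s_next in range(S):                     # flow balance, Eq. (202)
        for s in range(S):
            for a in range(A):
                A_eq[s_next, s * A + a] = P_hat[s, a, s_next] - (1.0 if s == s_next else 0.0)
    A_eq[S, :] = 1.0                            # normalization, Eq. (201)
    b_eq = np.zeros(S + 1); b_eq[S] = 1.0
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    rho = res.x.reshape(S, A)
    row_sums = np.maximum(rho.sum(axis=1, keepdims=True), 1e-12)
    return rho / row_sums                       # greedy policy pi_e(a|s)
```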

Now, the regret R(T), up to time horizon T, is defined as

R(T)\displaystyle R(T) =TλπPt=1Tr(st,at)\displaystyle=T\lambda_{\pi^{*}}^{P}-\sum_{t=1}^{T}r(s_{t},a_{t}) (203)
=et=tete+11λπPet=tete+11r(st,at)\displaystyle=\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\lambda_{\pi^{*}}^{P}-\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}r(s_{t},a_{t}) (204)
=et=tete+11λπP±et=tete+11λπePet=tete+11r(st,at)\displaystyle=\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\lambda_{\pi^{*}}^{P}\pm\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\lambda_{\pi_{e}}^{P}-\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}r(s_{t},a_{t}) (205)
=(et=tete+11λπPet=tete+11λπeP)+(et=tete+11λπePet=tete+11r(st,at))\displaystyle=\left(\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\lambda_{\pi^{*}}^{P}-\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\lambda_{\pi_{e}}^{P}\right)+\left(\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\lambda_{\pi_{e}}^{P}-\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}r(s_{t},a_{t})\right) (206)
=R1(T)+R2(T)\displaystyle=R_{1}(T)+R_{2}(T) (207)

R_{2}(T) is analyzed similarly to the regret analysis of the proposed UC-CURL algorithm. For R_{1}(T), we obtain the following analysis:

R1(T)\displaystyle R_{1}(T) =et=tete+11(λπPλπeP)\displaystyle=\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi^{*}}^{P}-\lambda_{\pi_{e}}^{P}\right) (208)
=et=tete+11((λπPλπeP^e)+(λπeP^eλπeP))\displaystyle=\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\left(\lambda_{\pi^{*}}^{P}-\lambda_{\pi_{e}}^{\hat{P}_{e}}\right)+\left(\lambda_{\pi_{e}}^{\hat{P}_{e}}-\lambda_{\pi_{e}}^{P}\right)\right) (209)

Now, the second term, \lambda_{\pi_{e}}^{\hat{P}_{e}}-\lambda_{\pi_{e}}^{P}, in Equation (209) can again be analyzed using the Bellman error based analysis. We are primarily interested in the first term. Note that since the agent does not play the optimistic policy, the first term of R_{1}(T) cannot be upper bounded using the optimal policy of the optimistic MDP in the confidence interval. For an optimistic algorithm, the average reward \lambda_{\tilde{\pi}_{e}}^{\tilde{P}_{e}} of the optimistic policy satisfies \lambda_{\tilde{\pi}_{e}}^{\tilde{P}_{e}}\geq\lambda_{\pi^{*}}^{P}. For a posterior sampling algorithm, the optimal policy for the sampled MDP satisfies \mathbb{E}[\lambda_{\tilde{\pi}_{e}}^{\tilde{P}_{e}}]=\mathbb{E}[\lambda_{\pi^{*}}^{P}]. These two properties of the optimistic and posterior sampling algorithms contribute the key steps in the analysis and design of such RL algorithms.

However, no such relationship can be established between \lambda_{\pi^{*}}^{P} and \lambda_{\pi_{e}}^{\hat{P}_{e}}, and hence the first term of Equation (209) is not trivially upper bounded by 0. Further, the optimal policy \pi_{e} for the estimated MDP \hat{P}_{e} can obtain a reward lower than that of the optimal policy \pi^{*} on the true MDP P, i.e., \lambda_{\pi_{e}}^{\hat{P}_{e}}<\lambda_{\pi^{*}}^{P}, resulting in a trivial O(T) regret bound.

H.2 Regarding Optimality

We note that the work of Singh et al. (2020) provided a lower bound of \sqrt{DSAT}, where D is the diameter of the MDP, A and S are the number of actions and states respectively, and T is the time horizon for which the algorithm runs. Based on this lower bound, the regret results presented in this work are optimal in A and T. However, obtaining a tighter dependence on S using tighter concentration inequalities for stochastic policies remains an open problem. Further, we note that reducing the dependence on T_{M} to the diameter D while keeping the regret order in T as \tilde{O}(\sqrt{T}) is also an open problem.

Appendix I Experiments with Fairness Utility and Constraints

We also evaluate the proposed algorithm on a non-linear setup. We consider a scheduler allocating resources to two clients, client_{1} and client_{2}. At each time step, the scheduler allocates a resource to one of the clients. Hence, \{client_{1},client_{2}\} are the 2 actions available to the scheduler. The client, on resource allocation, consumes the resource and obtains a reward. The reward depends on the state of the client. Each client can be in 4 possible states, and hence there are 16 possible system states. At every step, a client stays in the same state with probability 0.625 and transitions to each of the remaining 3 states with probability 0.125.

The agent aims to maximize the proportional fairness between the two clients Lan et al. (2010). Proportional fairness is used to quantitatively evaluate fairness in various network scheduling systems, such as wireless scheduling Cui et al. (2019) and queuing Wierman (2011). We calculate the proportional fairness as:

i={1,2}log(s,aρ(s,a)ri(s,a))\displaystyle\sum_{i=\{1,2\}}\log\left(\sum_{s,a}\rho(s,a)r_{i}(s,a)\right) (210)

where i denotes the client index, r_{i}(s,a) is the reward received by client i when the system is in state s and takes action a, and \rho(s,a) is the steady-state state-action distribution. The rewards r_{i} with respect to the client state are presented in Table 3.

Table 3: Rewards r_{i} of the clients with respect to the client state
Client  Client State 1  Client State 2  Client State 3  Client State 4
client_{1}  0.75  0.375  0.5  0.375
client_{2}  0.25  0.5  0.75  1.0

Further, the first client is a high-priority client and requires a minimum service level agreement (SLA) guarantee. Every time client_{1} is denied the resource, the scheduler incurs a penalty of -1. Let C denote the SLA guarantee for client_{1}. Then the cost constraint can be written as:

\displaystyle-\sum\nolimits_{s,a}\rho(s,a)\bm{1}_{\{a=client_{2}\}}\geq C (211)

where C is set to -0.3.
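For reference, the following is a minimal sketch (the variable names and data layout are our own assumptions) of how the objective in Equation (210) and the SLA constraint in Equation (211) are evaluated for a candidate occupation measure.

```python
import numpy as np

def proportional_fairness(rho, r1, r2):
    """rho[s, a]: occupation measure; r1, r2: per-client reward matrices r_i[s, a]."""
    return np.log(np.sum(rho * r1)) + np.log(np.sum(rho * r2))  # Eq. (210)

def sla_violation(rho, a_client2=1, C=-0.3):
    """Returns the amount by which the SLA constraint of Eq. (211) is violated."""
    served_client2 = np.sum(rho[:, a_client2])   # probability that client_2 is scheduled
    return max(0.0, C + served_client2)          # violation of -served_client2 >= C
```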

We evaluate the PS-CURL and UC-CURL algorithms on the scheduling system with K=1. We evaluate both algorithms with linear epochs and with doubling epochs. We run 10 independent iterations of each algorithm for T=500000 time steps. The mean values of the simulation results for the constrained setup are presented in Figures 2(a) and 2(b). The scheduler takes about 100,000 steps to converge to the optimal fairness value (Figure 2(a)) for the PS-CURL algorithm, whereas the optimistic algorithm does not converge within 500,000 steps. This is in line with the results of Section 6, where the posterior sampling algorithm converges the fastest. Further, note that the optimistic algorithm is conservative with respect to the constraints, but it has not yet optimized the fairness while satisfying the constraints.

We also present the system behavior in the absence of constraints in Figure 2(c) and Figure 2(d). For the unconstrained setup, we only evaluate the linear-epoch version of the PS-CURL algorithm, which converges to the optimal fairness value at around t=50,000, faster than in the constrained setup. We also observe that the optimal fairness between the clients is higher when the scheduler is not required to guarantee any service level agreements.

Figure 2: Performance of the proposed UC-CURL and PS-CURL algorithms on the scheduling example. (a) System fairness w.r.t. time; (b) SLA penalty w.r.t. time; (c) System fairness w.r.t. time for the unconstrained setup; (d) SLA penalty w.r.t. time for the unconstrained setup.

Again, from the experimental evaluations, we observe that the proposed PS-CURL and UC-CURL algorithms can be used for systems with non-linear utilities and/or non-linear constraints.