
Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes

Yi Tian* (yitian@mit.edu)
Jian Qian* (jianqian@mit.edu)
Suvrit Sra (suvrit@mit.edu)
Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
*These authors contributed equally to this work.
Abstract

We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs), which are MDPs with conditionally independent transition components. Assuming the factorization is known, we propose two model-based algorithms. The first one achieves minimax optimal regret guarantees for a rich class of factored structures, while the second one enjoys better computational complexity with a slightly worse regret. A key new ingredient of our algorithms is the design of a bonus term to guide exploration. We complement our algorithms by presenting several structure-dependent lower bounds on regret for FMDPs that reveal the difficulty hiding in the intricacy of the structures.

1 Introduction

In reinforcement learning (RL) an agent interacts with an unknown environment seeking to maximize its cumulative reward. The dynamics of the environment and the agent’s interaction with it are typically modeled as a Markov decision process (MDP). We consider the specific setting of episodic MDPs with a fixed interaction horizon. Here, at each step the agent observes the current state of the environment, takes an action, and receives a reward. Given the agent’s action, the environment then transits to the next state. The interaction horizon is the number of steps in an episode, while both the transitions and rewards may be unknown to the agent.

The agent’s performance is quantified using regret: the gap between the cumulative rewards the agent receives and those obtainable by following an optimal policy. An optimal RL algorithm then becomes one that incurs the minimum regret. For episodic MDPs, the minimax regret bound is known to be 𝒪~(HSAT)\tilde{\mathcal{O}}(\sqrt{HSAT}) [1] (the notation 𝒪~()\tilde{\mathcal{O}}(\cdot) hides log factors), where HH, SS, AA, TT denote the horizon, the size of the state space, the size of the action space, and the total number of steps, respectively.

Background and our focus. In problems with large state and action spaces even the 𝒪~(SA)\tilde{\mathcal{O}}(\sqrt{SA}) dependence is impractical. We focus on problems where one can circumvent this dependence. Specifically, we focus on problems with conditional independence structure that can be modeled via factored MDPs (FMDPs) [2, 3]. The state for an FMDP is represented as a tuple, where each component is determined by a small portion of the state-action space, termed its “scope”. For example, for a home robot, whether the table is clean has nothing to do with whether the robot vacuums the floor. FMDPs also arise naturally from cooperative RL [14, 31], and furthermore, they are useful to model subtask dependencies in hierarchical RL [34].

A key benefit of FMDPs is their more compact representation, which requires exponentially less space to store state transitions than the corresponding nonfactored MDPs. Early works [23, 15, 35, 37] show that FMDPs can also reduce the sample complexity exponentially, albeit with some higher-order terms. Osband and Van Roy [27] and Xu and Tewari [41] develop near-optimal regret bounds for FMDPs in the episodic and nonepisodic settings, respectively. For an FMDP with m conditionally independent components, both works bound the regret by \tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{D^{2}S_{i}X[{I_{i}}]T}) (the dependence on reward factorization is omitted here and below unless otherwise stated; the omission also corresponds to the case of known rewards), where D is the diameter of the MDP [19] and X[{I_{i}}] is the size of the scope of the ith state component.

Main difficulty. For (nonfactored) MDPs, the above regret bounds from [27, 41] reduce to 𝒪~(D2S2AT)\tilde{\mathcal{O}}(\sqrt{D^{2}S^{2}AT}), or essentially 𝒪~(H2S2AT)\tilde{\mathcal{O}}(\sqrt{H^{2}S^{2}AT}) in the episodic setting. Either bound has a gap to the corresponding lower bound, leading one to wonder whether we can adapt the techniques from MDPs to obtain tight bounds for FMDPs. It turns out to be nontrivial to answer this question.

Indeed, the key technique that yields a minimax regret bound for MDPs is a careful design of the bonus term that ensures optimism, by applying scalar concentration instead of L_{1}-norm concentration (Lemma 36) on a vector [1]. To adapt the same idea to FMDPs, perhaps the most natural choice is to estimate the transition of each component separately, and then to sum up the component-wise bonuses to guide overall exploration. With the bonuses derived from L_{1}-norm concentration, this method ensures optimism, leading to the regret bound in [27, 41]. But with the bonuses derived from scalar concentration (which was key to [1]), it is hard, if not impossible, to show that this naive sum of component-wise bonuses ensures optimism (see (5.1)); this is the major obstacle to removing the suboptimal \sqrt{S_{i}} factor from the regret bound.

On the other hand, there is no uniform nondegenerate lower bound that holds for any structure. As a result, the dependence on structure has to be specified, making the lower bound statements subtle due to the intricacy of the structure.

Our contributions. Our main contribution is to present two provably efficient algorithms for episodic FMDPs, including one with tight regret guarantees for a rich class of structures. We overcome the above noted difficulty by introducing an additional bonus term via an inverse telescoping technique. Our first algorithm, F-UCBVI, uses Hoeffding-style bonuses and achieves \tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{H^{2}X[{I_{i}}]T}) regret (Theorem 1). Our second algorithm, F-EULER, uses Bernstein-style bonuses and achieves a problem-dependent regret (Theorem 2) that reduces to \tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{HX[{I_{i}}]T}) in the worst case (Corollary 3), and thus closes the gap to the \Omega(\sqrt{HSAT}) lower bound (see, e.g., [20]) when reduced to the nonfactored case. Despite an extra \sqrt{H} dependence, F-UCBVI has a much simpler form of bonus and enjoys a lower computational complexity.

Furthermore, we identify some reasonably general factored structures for which we prove lower bounds (Theorems 4, 5, 6 and 7). Our construction builds upon the lower bounds for multi-armed bandits (MABs) [25], in a similar manner to [7, 20]. As a side remark, the JAO MDP, used to establish the \Omega(\sqrt{DSAT}) lower bound for nonepisodic MDPs [19], only establishes an \Omega(\sqrt{HSAT/\log T}) lower bound through a direct extension (Appendix D). This result misses a log factor resulting from the episodic resetting, and it is not clear whether it can be tightened with the same construction.

1.1 Related work

Regret bound for episodic MDPs. For episodic MDPs, built upon the principle of optimism in the face of uncertainty, recent works establish the 𝒪~(HSAT)\tilde{\mathcal{O}}(\sqrt{HSAT}) worst-case regret bound [1, 22, 42, 33, 12], matching the Ω(HSAT)\Omega(\sqrt{HSAT}) lower bound (see, e.g., [20]) up to log factors. For model-free algorithms, the state-of-the-art worst-case regret is 𝒪~(H2SAT)\tilde{\mathcal{O}}(\sqrt{H^{2}SAT}) [43]. Sharper problem-dependent regret bounds [42, 33] also exist in this setting. Posterior sampling for RL (PSRL) [29] offers an alternative to optimism-based methods and can perform well empirically [28]. The best existing Bayesian regret bound of PSRL for episodic MDPs is 𝒪~(H2SAT)\tilde{\mathcal{O}}(\sqrt{H^{2}SAT}) [28]. Among these works, our work is mostly motivated by [1, 42]. Specifically, for nonfactored MDPs, F-UCBVI and F-EULER reduce to the UCBVI-CH [1] and EULER [42] algorithms, respectively, after which we name our algorithms.

Planning in FMDPs. For nonepisodic FMDPs, various works exploit the factored structure to develop efficient (approximate) dynamic programming methods [4, 24, 13, 32, 16]. See [9] for an overview. For episodic FMDPs, however, efficient approximate planning is yet to be understood. Planning is a subroutine of model-based RL algorithms. The focus of this work is the sample efficiency, and we adopt a value iteration procedure that has the same computational complexity as that for a nonfactored MDP.

Reinforcement learning in FMDPs. To efficiently find the optimal policy assuming known factorization yet unknown state transitions or rewards, early works aim to develop algorithms with PAC guarantees [23, 15, 35, 37]. More recent works place a focus on the regret analysis and establish 𝒪~(i=1mD2SiX[Ii]T)\tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{D^{2}S_{i}X[{I_{i}}]T}) regret in either the episodic or nonepisodic setting [27, 41], also translating to 𝒪~(i=1mH2SiX[Ii]T)\tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{H^{2}S_{i}X[{I_{i}}]T}) in terms of horizon HH for episodic problems. Our work considers minimax optimal regret guarantees in the episodic setting. In this setting, we provide the first 𝒪~(i=1mHX[Ii]T)\tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{HX[{I_{i}}]T}) regret bound and the first formal treatment of the lower bounds, by which we show that our algorithm is minimax optimal for a rich class of factored structures.

Structure learning in FMDPs. Assuming unknown structure except for known state factorization, various structure learning algorithms exist in the literature, e.g., SPITI [10], SLF-Rmax [36], Met-Rmax [11] and LSE-Rmax [6]. See [5, Chapter 7] for an overview of these methods. More recent works include [18, 17]. To the best of our knowledge, there is no regret analysis in this setting, which is an interesting future direction.

2 Preliminaries and problem formulation

2.1 Learning in episodic MDPs

An episodic MDP is described by the quintuple (𝒮,𝒜,P,r,H)(\mathcal{S},\mathcal{A},{P},r,H), where 𝒮\mathcal{S} is the state space, 𝒜\mathcal{A} is the action space, P:𝒮×𝒜×𝒮[0,1]{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1] is the transition function such that P(|s,a){P}(\cdot|s,a) is in the probability simplex on 𝒮\mathcal{S}, denoted by Δ(𝒮)\Delta(\mathcal{S}), r:𝒮×𝒜[0,1]r:\mathcal{S}\times\mathcal{A}\mapsto[0,1] is the reward function, and HH is the horizon or the length of an episode. The focus in this paper is the stationary case, where both the transition function P{P} and the reward function rr remain unchanged across steps or episodes.

An RL agent interacts episodically with the environment starting from arbitrary initial states. The role of the agent is abstracted into a sequence of policies. For any natural number nn, we use [n][n] to denote the set {1,2,,n}\{1,2,\cdots,n\}. Then for finite-horizon MDPs, the policy π:𝒮×[H]𝒜\pi:\mathcal{S}\times[H]\mapsto\mathcal{A} maps a state and a time step to an action. Let πk\pi_{k} denote the policy in the kkth episode. An RL algorithm, therefore, specifies the updates of the sequence {πk}\{\pi_{k}\}. Our algorithms use deterministic policies, while the lower bounds hold for the more general class of stochastic policies.

The Q-value function QhπQ_{h}^{\pi} and state-value function VhπV_{h}^{\pi} measure the goodness of a state-action pair and of a state following the policy π\pi starting from the hhth step in an episode, respectively. For state s𝒮s\in\mathcal{S} and action a𝒜a\in\mathcal{A}, these functions are defined by

Qhπ(s,a)=𝔼[t=hHr(st,at)|sh=s,ah=a],Vhπ(s)=Qhπ(s,π(s,h)).\displaystyle Q_{h}^{\pi}(s,a)=\mathbb{E}\Bigl{[}\sum\nolimits_{t=h}^{H}r(s_{t},a_{t})\,\bigl{\lvert}\,s_{h}=s,a_{h}=a\Bigr{]},\quad V_{h}^{\pi}(s)=Q_{h}^{\pi}(s,\pi(s,h)).

There exists an optimal policy π\pi^{*} such that Vhπ(s)Vhπ(s)V_{h}^{\pi^{*}}(s)\geq V_{h}^{\pi}(s) for all s𝒮,h[H]s\in\mathcal{S},h\in[H] [30]. Let VhV_{h}^{*} denote VhπV_{h}^{\pi^{*}}, and let sk,1s_{k,1} be the initial state at the kkth episode; we then define the regret over KK episodes (T=KHT=KH steps) as

Regret(K):=k=1K(V1(sk,1)V1πk(sk,1)).\displaystyle{\textnormal{Regret}}(K):=\sum\nolimits_{k=1}^{K}\left(V_{1}^{*}(s_{k,1})-V_{1}^{\pi_{k}}(s_{k,1})\right).

2.2 Factored MDPs

Episodic FMDPs inherit the above definitions, but we also need additional notation that we adopt from [27] to describe the factored structure. Let the state-action space 𝒳:=𝒮×𝒜\mathcal{X}:=\mathcal{S}\times\mathcal{A}, so that x=(s,a)𝒳x=(s,a)\in\mathcal{X} denotes a state-action pair. Let m,n,lm,n,l be natural numbers. The state space is factorized as 𝒮=i=1m𝒮i𝒮1××𝒮m\mathcal{S}=\bigotimes_{i=1}^{m}\mathcal{S}_{i}\equiv\mathcal{S}_{1}\times\cdots\times\mathcal{S}_{m}, using which we write a state s𝒮s\in\mathcal{S} as the tuple (s[1],,s[m])(s[1],\cdots,s[m]). Similarly, the state-action space is factorized as 𝒳=i=1n𝒳i𝒳1××𝒳n\mathcal{X}=\bigotimes_{i=1}^{n}\mathcal{X}_{i}\equiv\mathcal{X}_{1}\times\cdots\times\mathcal{X}_{n}.

Definition 1 (Scope operation).

For any natural number n, any index set I\subset[n] and any factored set \mathcal{X}=\bigotimes_{i=1}^{n}\mathcal{X}_{i}, define the scope operation \mathcal{X}[I]:=\bigotimes_{i\in I}\mathcal{X}_{i}, where the indices in the Cartesian product are in ascending order by default. Correspondingly, define the scope variable x[I]\in\mathcal{X}[I] as the tuple of x_{i} for i\in I.

The transition factored structure refers to the conditional independence of the transitions of the mm state components. Mathematically, P(s|s,a)=P(s|x)=i=1mPi(s[i]|x){P}(s^{\prime}|s,a)={P}(s^{\prime}|x)=\prod_{i=1}^{m}{P}_{i}(s^{\prime}[i]|x), where Pi(|x)Δ(𝒮i){P}_{i}(\cdot|x)\in\Delta(\mathcal{S}_{i}). Moreover, the transition of the iith state component is determined only by its scope, a subset of the state-action components. Let Ii[n]{I_{i}}\subset[n] be the scope index set for i[m]i\in[m]. Then with a slight abuse of notation, Pi(s[i]|x)Pi(s[i]|x[Ii]){P}_{i}(s^{\prime}[i]|x)\equiv{P}_{i}(s^{\prime}[i]|x[{I_{i}}]) where “\equiv” denotes identity, and the factored transition is given by P(s|x)=i=1mPi(s[i]|x[Ii])P(s^{\prime}|x)=\prod_{i=1}^{m}{P}_{i}(s^{\prime}[i]|x[{I_{i}}]).
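To make the factored transition concrete, the following minimal Python sketch evaluates P(s'|x)=\prod_{i}{P}_{i}(s'[i]|x[{I_{i}}]) on a toy instance; the scopes, component sizes, and helper names below are illustrative placeholders, not quantities from the paper.

import numpy as np

# Hypothetical toy instance: m = 2 state components, n = 3 state-action components.
scopes = [(0, 2), (1, 2)]      # I_1 = {1, 3}, I_2 = {2, 3}, written 0-indexed
sizes_x = [2, 2, 2]            # |X_1|, |X_2|, |X_3|
sizes_s = [3, 3]               # S_1, S_2

rng = np.random.default_rng(0)
P_comp = []                    # P_comp[i]: one distribution over S_i per scope value
for I_i, S_i in zip(scopes, sizes_s):
    n_scope = int(np.prod([sizes_x[j] for j in I_i]))
    table = rng.random((n_scope, S_i))
    P_comp.append(table / table.sum(axis=1, keepdims=True))

def scope_index(x, I_i):
    """Flatten the scope variable x[I_i] into a row index of the component table."""
    idx = 0
    for j in I_i:
        idx = idx * sizes_x[j] + x[j]
    return idx

def factored_transition_prob(x, s_next):
    """P(s' | x) = prod_i P_i(s'[i] | x[I_i])."""
    prob = 1.0
    for i, (I_i, table) in enumerate(zip(scopes, P_comp)):
        prob *= table[scope_index(x, I_i), s_next[i]]
    return prob

print(factored_transition_prob(x=(0, 1, 1), s_next=(2, 0)))   # a single probability in (0, 1)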

The reward function r is in general stochastic and unknown, and it can also be factored. Let {J_{i}}\subset[n] be the reward scope index set for i\in[l]. Then the factored reward is the sum of l independent components r_{i}(x)\equiv r_{i}(x[{J_{i}}]), given by r(x)=\sum_{i=1}^{l}r_{i}(x[{J_{i}}]). Let R:=\mathbb{E}[r] and R_{i}:=\mathbb{E}[r_{i}] be their expectations. Then R=\sum_{i=1}^{l}R_{i}:\mathcal{X}\to[0,1] is deterministic. With the above notation, an FMDP model is described in general by

=({𝒳i}i=1n,{𝒮i,Ii,Pi}i=1m,{Ji,ri}i=1l,H).\displaystyle{\mathcal{M}}=\bigl{(}\left\{\mathcal{X}_{i}\right\}_{i=1}^{n},\left\{\mathcal{S}_{i},{I_{i}},{P}_{i}\right\}_{i=1}^{m},\left\{{J_{i}},r_{i}\right\}_{i=1}^{l},H\bigr{)}. (2.1)

In this paper, we assume unknown transition with known factorization and consider two cases about the knowledge of reward: (1) the reward rr is unknown with known factorization, and the agent receives the factored rewards {ri}i=1l\{r_{i}\}_{i=1}^{l} [27, 41]; (2) the reward rr is known and not necessarily factored, in which case the expected reward RR is also known [1].

Notation. Throughout the paper, we use S,A,X,S_{i},X[I] to denote the finite cardinalities of the corresponding \mathcal{S},\mathcal{A},\mathcal{X},\mathcal{S}_{i},\mathcal{X}[I], respectively. Then S=\prod_{i=1}^{m}S_{i} and X[I]\leq X=SA for any I\subset[n]. Since the abstraction of a state component s[i] is meaningful only if S_{i}\geq 2, the number of state components m is at most \log_{2}S, which is treated as a negligible log factor whenever necessary. For x=(s,a)\in\mathcal{X}, we use Q(s,a) and Q(x) interchangeably to denote a function Q defined on \mathcal{X}:=\mathcal{S}\times\mathcal{A}. A function V:\mathcal{S}\to\mathbb{R} is also seen as a real vector indexed by \mathcal{S}, denoted by V\in\mathbb{R}^{\mathcal{S}}. For vectors in \mathbb{R}^{\mathcal{S}}, let |\cdot|,(\cdot)^{2} denote entrywise absolute values and squares, respectively.

Define {P}(x):={P}(\cdot|x)\in\Delta(\mathcal{S}). Moreover, for {P}_{i}(x)\equiv{P}_{i}(x[{I_{i}}]):={P}_{i}(\cdot|x[{I_{i}}])\in\Delta(\mathcal{S}_{i}) and natural numbers a\leq b, define {P}_{a:b}(x):=\prod_{i=a}^{b}{P}_{i}(x)\in\Delta(\bigotimes_{i=a}^{b}\mathcal{S}_{i}). In particular, \prod_{i=1}^{m}{P}_{i}(x)={P}(x)\in\Delta(\mathcal{S}). For {P}\in\Delta(\mathcal{S}) and V\in\mathbb{R}^{\mathcal{S}}, define their inner product \left\langle{P},V\right\rangle:=\sum_{s\in\mathcal{S}}{P}(s)V(s)=\mathbb{E}_{{P}}[V(s)]\equiv\mathbb{E}_{{P}}[V]. The inner product of two elements in \Delta(\mathcal{S}_{i}) and \mathbb{R}^{\mathcal{S}_{i}}, respectively, takes the sum over \mathcal{S}_{i}. For {P}_{i}\in\Delta(\mathcal{S}_{i}) and V\in\mathbb{R}^{\mathcal{S}}, their inner product takes the sum over \mathcal{S}. In this case, note that \left\langle{P}_{i},V\right\rangle\neq\mathbb{E}_{{P}_{i}}[V], which is given by \mathbb{E}_{{P}_{i}}[V]\equiv\mathbb{E}_{{P}_{i}}[V(s)]=\sum_{s[i]\in\mathcal{S}_{i}}{P}_{i}(s[i])V(s)\in\mathbb{R}^{\mathcal{S}[-i]}, where \mathcal{S}[-i]:=\bigotimes_{j=1,j\neq i}^{m}\mathcal{S}_{j}.

3 Factored UCBVI and regret bounds

We are now ready to present the two algorithms for solving FMDPs that we propose. In this section, we introduce the general procedures (Algorithms 1 and 2) and discuss an instantiation called factored UCBVI (F-UCBVI) using Hoeffding-style optimism. In the next section, we use a Bernstein-style instantiation called factored EULER (F-EULER) that obtains better regret guarantees.

Algorithm 1 F-OVI (Factored Optimistic Value Iteration)
1:Factored structure {𝒳i}i=1n\left\{\mathcal{X}_{i}\right\}_{i=1}^{n}, {𝒮i,Ii}i=1m\left\{\mathcal{S}_{i},{I_{i}}\right\}_{i=1}^{m}, {Ji}i=1l\left\{{J_{i}}\right\}_{i=1}^{l}, and horizon HH
2:Initialize counters Ni(x[Ii])=Ni(x[Ii],s[i])=0N_{i}(x[{I_{i}}])=N_{i}^{\prime}(x[{I_{i}}],s[i])=0, (i[m],x[Ii]𝒳[Ii],s[i]𝒮i)\forall(i\in[m],x[{I_{i}}]\in\mathcal{X}[{I_{i}}],s[i]\in\mathcal{S}_{i})
3:Initialize counters Mi(x[Ji])=Mi(x[Ji])=0M_{i}(x[{J_{i}}])=M_{i}^{\prime}(x[{J_{i}}])=0, (i[l],x[Ji]𝒳[Ji])\forall(i\in[l],x[{J_{i}}]\in\mathcal{X}[{J_{i}}]) and history r=\mathcal{H}_{r}=\emptyset
4:for episode k=1,2,,Kk=1,2,\cdots,K do
5:     Estimate transitions P^i(s[i]|x[Ii])=Ni(x[Ii],s[i])max{1,Ni(x[Ii])}{\hat{P}}_{i}(s[i]|x[{I_{i}}])=\tfrac{N_{i}^{\prime}(x[{I_{i}}],s[i])}{\max\{1,N_{i}(x[{I_{i}}])\}}, (i[m],x[Ii]𝒳[Ii],s[i]𝒮i)\forall(i\in[m],x[{I_{i}}]\in\mathcal{X}[{I_{i}}],s[i]\in\mathcal{S}_{i})
6:     Estimate expected rewards R^i(x[Ji])=Mi(x[Ji])max{1,Mi(x[Ji])}{\hat{R}}_{i}(x[{J_{i}}])=\frac{M_{i}^{\prime}(x[{J_{i}}])}{\max\{1,M_{i}(x[{J_{i}}])\}}, (i[l],x[Ji]𝒳[Ji])\forall(i\in[l],x[{J_{i}}]\in\mathcal{X}[{J_{i}}])
7:     Estimate expected overall reward R^(x)=i=1lR^i(x[Ji]){\hat{R}}(x)=\sum_{i=1}^{l}{\hat{R}}_{i}(x[{J_{i}}]), x𝒳\forall x\in\mathcal{X}
8:     Obtain policy π=𝚅𝙸_𝙾𝚙𝚝𝚒𝚖𝚒𝚜𝚖({P^i}i=1m,R^,{Ni}i=1m,{Mi}i=1l,r)\pi=\mathtt{VI\_Optimism}(\{{\hat{P}}_{i}\}_{i=1}^{m},{\hat{R}},\{N_{i}\}_{i=1}^{m},\{M_{i}\}_{i=1}^{l},\mathcal{H}_{r})
9:     Observe initial state s1s_{1} from environment
10:     for step h=1,2,,Hh=1,2,\cdots,H do
11:         Take action ah=π(sh,h)a_{h}=\pi(s_{h},h) and let xh=(sh,ah)x_{h}=(s_{h},a_{h})
12:         Receive reward {ri,h}i=1l\{r_{i,h}\}_{i=1}^{l} and observe next state sh+1s_{h+1}
13:         Update transition counters Ni(xh[Ii])+=1N_{i}(x_{h}[{I_{i}}])\mathrel{+}=1 and Ni(xh[Ii],sh+1[i])+=1N_{i}^{\prime}(x_{h}[{I_{i}}],s_{h+1}[i])\mathrel{+}=1, i[m]\forall i\in[m]
14:         Update reward counters Mi(xh[Ji])+=1M_{i}(x_{h}[{J_{i}}])\mathrel{+}=1 and Mi(xh[Ji])+=ri,hM_{i}^{\prime}(x_{h}[{J_{i}}])\mathrel{+}=r_{i,h}, i[l]\forall i\in[l]
15:         Update reward history r=r{(xh,{ri,h}i=1l)}\mathcal{H}_{r}=\mathcal{H}_{r}\cup\{(x_{h},\{r_{i,h}\}_{i=1}^{l})\}
16:     end for
17:end for

The general optimistic model-based RL framework for FMDPs is described in F-OVI (Algorithm 1), which maintains an empirical estimate of the transition function, typically the maximum likelihood (ML) estimate. For FMDPs, although a direct ML estimate of the overall transition is still possible, the product of the ML estimates of component-wise transition functions is preferable to exploit the factored structure. The same argument applies to the empirical expected reward function.
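For illustration, a minimal Python sketch of the counter updates and maximum likelihood estimates in lines 5-7 and 13-14 of F-OVI follows; the dictionaries, toy scope tuples, and function names are our own stand-ins for the counter tables, not part of the algorithm's specification.

from collections import defaultdict

# Hypothetical toy factorization: m = 2 state components, l = 1 reward component,
# with scopes over the n = 3 state-action components (0-indexed).
state_scopes = [(0, 2), (1, 2)]      # I_1, I_2
reward_scopes = [(2,)]               # J_1
m, l = len(state_scopes), len(reward_scopes)

N = [defaultdict(int) for _ in range(m)]      # N_i(x[I_i])
Np = [defaultdict(int) for _ in range(m)]     # N_i'(x[I_i], s'[i])
M = [defaultdict(int) for _ in range(l)]      # M_i(x[J_i])
Mp = [defaultdict(float) for _ in range(l)]   # M_i'(x[J_i]): cumulative observed reward

def update_counters(x, s_next, rewards):
    """One environment step: x = (s, a), next state s_next, factored rewards {r_i}."""
    for i, I_i in enumerate(state_scopes):
        key = tuple(x[j] for j in I_i)
        N[i][key] += 1
        Np[i][key + (s_next[i],)] += 1
    for i, J_i in enumerate(reward_scopes):
        key = tuple(x[j] for j in J_i)
        M[i][key] += 1
        Mp[i][key] += rewards[i]

def p_hat(i, key, s_i):
    """Empirical P_i(s'[i] = s_i | x[I_i] = key); zero counts replaced by 1."""
    return Np[i][key + (s_i,)] / max(1, N[i][key])

def r_hat(i, key):
    """Empirical expected reward component R_i(x[J_i] = key)."""
    return Mp[i][key] / max(1, M[i][key])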

As a subroutine of F-OVI (Algorithm 1), VI_Optimism (Algorithm 2) ensures an upper confidence bound (UCB) on the optimal value function, achieved by introducing bonuses derived from concentration inequalities. The L_{1}-norm concentration (Lemma 36) on the transition estimation leads to the UCRL-Factored algorithm, achieving the \tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{D^{2}S_{i}X[{I_{i}}]T}) regret [27]. The scalar concentration on the transition estimation leads to the two new algorithms proposed in this work.

In the case of known rewards, there is no need to estimate the reward {\hat{R}} or to collect the reward history \mathcal{H}_{r} in F-OVI (Algorithm 1), nor is there any need to introduce the reward bonus {\beta} in VI_Optimism (Algorithm 2). In both procedures, the empirical {\hat{R}} is replaced by the true R.

Hoeffding-style bonus. Let L=log(16mlSXT/δ)L=\log(16mlSXT/\delta) be a log factor. With a slight abuse of notation, let Ni(x)Ni(x[Ii])N_{i}(x)\equiv N_{i}(x[{I_{i}}]). The transition bonus of F-UCBVI is given by

b(x):=i=1mHL2Ni(x)+i=1mj=i+1m2HLSiSjNi(x)Nj(x),\displaystyle{b}(x):=\sum\nolimits_{i=1}^{m}H\sqrt{\tfrac{L}{2N_{i}(x)}}+\sum\nolimits_{i=1}^{m}\sum\nolimits_{j=i+1}^{m}2HL\sqrt{\tfrac{S_{i}S_{j}}{N_{i}(x)N_{j}(x)}}, (3.1)

which consists of both a component-wise bonus term and a cross-component bonus term. In the case of unknown rewards, the reward bonus of F-UCBVI is given by

β(x):=i=1lL2Mi(x).\displaystyle{\beta}(x):=\sum\nolimits_{i=1}^{l}\sqrt{\tfrac{L}{2M_{i}(x)}}. (3.2)

For the transition and reward bonuses here and below, zero-valued Ni(x)N_{i}(x) and Mi(x)M_{i}(x) in the denominators are replaced by 11 to prevent zero division, as in F-OVI (Algorithm 1). For F-UCBVI, the reward history r\mathcal{H}_{r} and the lower confidence bound (LCB) V¯h{\underline{V}}_{h} are redundant and the related updates are removable. F-UCBVI has the following regret guarantees.
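A minimal numpy sketch of the bonuses (3.1) and (3.2) for a single state-action pair follows; the arrays of counts and component sizes are assumed inputs (our own interface, not the paper's), and zero counts are replaced by 1 as described above.

import numpy as np

def hoeffding_transition_bonus(H, L, N_x, S):
    """b(x) in (3.1): N_x[i] = N_i(x), S[i] = S_i."""
    N_x = np.maximum(np.asarray(N_x, dtype=float), 1.0)
    component_term = float(np.sum(H * np.sqrt(L / (2.0 * N_x))))
    cross_term = 0.0
    m = len(N_x)
    for i in range(m):
        for j in range(i + 1, m):
            cross_term += 2.0 * H * L * np.sqrt(S[i] * S[j] / (N_x[i] * N_x[j]))
    return component_term + cross_term

def hoeffding_reward_bonus(L, M_x):
    """beta(x) in (3.2): M_x[i] = M_i(x)."""
    M_x = np.maximum(np.asarray(M_x, dtype=float), 1.0)
    return float(np.sum(np.sqrt(L / (2.0 * M_x))))

# Example with hypothetical counts for m = 3 state components and l = 2 reward components.
print(hoeffding_transition_bonus(H=5, L=3.0, N_x=[10, 0, 4], S=[3, 2, 4]))
print(hoeffding_reward_bonus(L=3.0, M_x=[7, 2]))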

Algorithm 2 VI_Optimism (Value Iteration with Optimism)
1:Empirical transitions {P^i}i=1m\{{\hat{P}}_{i}\}_{i=1}^{m}, empirical expected reward R^{\hat{R}}, transition counters {Ni}i=1m\{N_{i}\}_{i=1}^{m}, reward counters {Mi}i=1l\{M_{i}\}_{i=1}^{l}, reward history r\mathcal{H}_{r}
2:Initialize UCB and LCB V¯H+1(s)=V¯H+1(s)=0,s𝒮{\overline{V}}_{H+1}(s)={\underline{V}}_{H+1}(s)=0,\forall s\in\mathcal{S}; let P^(x)=i=1mP^i(x),x𝒳{\hat{P}}(x)=\prod_{i=1}^{m}{\hat{P}}_{i}(x),\forall x\in\mathcal{X}
3:for step h=H,H1,,1h=H,H-1,\cdots,1 do
4:     for state s𝒮s\in\mathcal{S} do
5:         for action a𝒜a\in\mathcal{A}, with x=(s,a)x=(s,a) do
6:              Q¯h(x)=min{Hh+1,R^(x)+P^(x),V¯h+1+b(x)+β(x)}{\overline{Q}}_{h}(x)=\min\left\{H-h+1,{\hat{R}}(x)+\left\langle{\hat{P}}(x),{\overline{V}}_{h+1}\right\rangle+{b}(x)+{\beta}(x)\right\}
7:         end for
8:         Let π(s,h)=argmaxa𝒜Q¯h(s,a)\pi(s,h)=\operatorname*{argmax}_{a\in\mathcal{A}}{\overline{Q}}_{h}(s,a) and x=(s,π(s,h))x=(s,\pi(s,h))
9:         Update UCB V¯h(s)=Q¯h(x){\overline{V}}_{h}(s)={\overline{Q}}_{h}(x)
10:         Update LCB V¯h(s)=max{0,R^(x)+P^(x),V¯h+1b(x)β(x)}{\underline{V}}_{h}(s)=\max\left\{0,{\hat{R}}(x)+\left\langle{\hat{P}}(x),{\underline{V}}_{h+1}\right\rangle-{b}(x)-{\beta}(x)\right\}
11:     end for
12:end for
13:return policy π\pi
Theorem 1 (Worst-case regret bounds of F-UCBVI).

For any FMDP specified by (2.1), in the case of unknown rewards, with probability 1δ1-\delta, the regret of F-UCBVI in KK episodes is upper bounded by

𝒪~(i=1mH2X[Ii]T+i=1lX[Ji]T).\displaystyle\tilde{\mathcal{O}}\Bigl{(}\sum\nolimits_{i=1}^{m}\sqrt{H^{2}X[{I_{i}}]T}+\sum\nolimits_{i=1}^{l}\sqrt{X[{J_{i}}]T}\Bigr{)}. (3.3)

In the case of known rewards, the term i=1lX[Ji]T\sum\nolimits_{i=1}^{l}\sqrt{X[{J_{i}}]T} is dropped.

In Theorem 1 (and Theorem 2, Corollary 3 below), the T\sqrt{T} terms dominate the upper bounds under the assumption that Tpoly(m,l,maxiSi,maxiX[Ii],H)T\geq\mathrm{poly}(m,l,\max_{i}S_{i},\max_{i}X[{I_{i}}],H), which can be exponentially smaller than poly(S,A,H)\mathrm{poly}(S,A,H), required in general by nonfactored algorithms [1, 22, 42, 33, 12, 21, 43]. This weaker requirement on TT is also true for [27] and can be viewed as another benefit from the factored structure. The lower-order terms in Section 5 are also under this assumption.

In (3.3), the regret bound consists of a transition-induced term and a reward-induced term, which correspond to the sum over time of the component-wise transition bonus term in (3.1) and the reward bonus (3.2), respectively. The cross-component transition bonus term in (3.1) diminishes fast and its sum over time turns out to be negligible. Nevertheless, this bonus term plays a significant role in guiding exploration at the early stage of the algorithm, when it is large due to the dependence on SiS_{i}. Note that the cross-component form is not absolutely necessary, e.g., we can use the AM–GM inequality to decouple it as

i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x)(m1)HLi=1mSiNi,k(x).\displaystyle\sum\nolimits_{i=1}^{m}\sum\nolimits_{j=i+1}^{m}2HL\sqrt{\tfrac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}\leq(m-1)HL\sum\nolimits_{i=1}^{m}\tfrac{S_{i}}{N_{i,k}(x)}.

In this work, we use the cross-component form, as it is tighter and implies an intertwined behavior among the state components.

The component-wise transition bonus term in (3.1) is exactly the bonus in the UCBVI-CH algorithm [1]. It may also be feasible to derive a factored version of UCBVI-BF [1], but we switch to an upper-lower bound exploration strategy like EULER [42] for the Bernstein-style optimism, due in part to its problem-dependent regret.

4 Factored EULER and regret bounds

To describe the Bernstein-style bonus, we first add some convenient notation. We omit the dependence of Pi(x){P}_{i}(x) and P(x){P}(x) on xx; for PiΔ(𝒮i),P=i=1mPiΔ(𝒮){P}_{i}\in\Delta(\mathcal{S}_{i}),{P}=\prod_{i=1}^{m}{P}_{i}\in\Delta(\mathcal{S}) and V𝒮V\in\mathbb{R}^{\mathcal{S}}, define

gi(P,V):=2LVarPi𝔼Pi[V],\displaystyle g_{i}({P},V):=2\sqrt{L}\sqrt{\mathrm{Var}_{{P}_{i}}\mathbb{E}_{{P}_{-i}}[V]}, (4.1)

where \mathbb{E}_{{P}_{-i}}[V]:=\mathbb{E}_{{P}_{1:i-1}{P}_{i+1:m}}[V]\in\mathbb{R}^{\mathcal{S}_{i}} takes the expectation over the other m-1 components. Following the notation of [42], let \|X\|_{2,{P}}=\sqrt{\mathbb{E}_{{P}}[X^{2}]} be the L_{2}-norm on the equivalence classes of almost surely equal random variables.
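To make the quantity \mathrm{Var}_{{P}_{i}}\mathbb{E}_{{P}_{-i}}[V] in (4.1) concrete, the following numpy sketch (a toy illustration with our own helper names, assuming V is stored flat in row-major component order) contracts out all components except the i-th and then takes the variance under {P}_{i}.

import numpy as np

def expectation_minus_i(P_list, V, i):
    """E_{P_{-i}}[V] as a vector on S_i; P_list[j] is P_j in Delta(S_j)."""
    sizes = [len(p) for p in P_list]
    T = np.asarray(V, dtype=float).reshape(sizes)   # V viewed as an m-way tensor
    # Contract axes from last to first so that earlier axis indices stay valid.
    for j in reversed(range(len(P_list))):
        if j != i:
            T = np.tensordot(T, np.asarray(P_list[j], dtype=float), axes=([j], [0]))
    return T                                        # shape (S_i,)

def g_i(P_list, V, i, L):
    """g_i(P, V) = 2 sqrt(L) * sqrt( Var_{P_i} E_{P_{-i}}[V] )."""
    e = expectation_minus_i(P_list, V, i)
    p_i = np.asarray(P_list[i], dtype=float)
    mean = p_i @ e
    var = p_i @ (e - mean) ** 2
    return 2.0 * np.sqrt(L) * np.sqrt(var)

# Toy example: m = 2 components with S_1 = 2, S_2 = 3.
P_list = [np.array([0.5, 0.5]), np.array([0.2, 0.3, 0.5])]
V = np.arange(6.0)                                  # a value vector on S_1 x S_2
print(g_i(P_list, V, i=0, L=1.0))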

Bernstein-style bonus. The transition bonus of F-EULER is defined by

b(x)\displaystyle{b}(x) :=i=1mgi(P^(x),V¯h+1)Ni(x)+i=1m2LV¯h+1V¯h+12,P^(x)Ni(x)\displaystyle:=\sum\nolimits_{i=1}^{m}\tfrac{g_{i}({\hat{P}}(x),{\overline{V}}_{h+1})}{\sqrt{N_{i}(x)}}+\sum\nolimits_{i=1}^{m}\tfrac{\sqrt{2L}\left\|{\overline{V}}_{h+1}-{\underline{V}}_{h+1}\right\|_{2,{\hat{P}}(x)}}{\sqrt{N_{i}(x)}} (4.2)
+i=1mj=i+1m11HLSiSjNi(x)Nj(x)+i=1m5HLNi(x).\displaystyle\quad+\sum\nolimits_{i=1}^{m}\sum\nolimits_{j=i+1}^{m}11HL\sqrt{\tfrac{S_{i}S_{j}}{N_{i}(x)N_{j}(x)}}+\sum\nolimits_{i=1}^{m}\tfrac{5HL}{N_{i}(x)}.

In the case of unknown rewards, the reward bonus of F-EULER is defined by

β(x):=i=1l4𝕊[r^i(x)]LMi(x)+i=1l14L3Mi(x),\displaystyle{\beta}(x):=\sum\nolimits_{i=1}^{l}\sqrt{\tfrac{4\mathbb{S}[{\hat{r}}_{i}(x)]L}{M_{i}(x)}}+\sum\nolimits_{i=1}^{l}\tfrac{14L}{3M_{i}(x)}, (4.3)

where \mathbb{S}[{\hat{r}}_{i}(x)] is the sample variance of the reward component, computed from the history \mathcal{H}_{r}. Online algorithms like Welford's method [40] can efficiently compute {\hat{R}}_{i} and \mathbb{S}[{\hat{r}}_{i}(x)] jointly. F-EULER has the following problem-dependent regret guarantees.

Theorem 2 (Problem-dependent regret bounds of F-EULER).

For any FMDP specified by (2.1), let

𝒬i:=maxx𝒳,hH{VarPi(x)𝔼Pi(x)[Vh+1]},i:=maxx𝒳{Var[ri(x)]}.\displaystyle\mathcal{Q}_{i}:=\max_{x\in\mathcal{X},h\in H}\left\{\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h+1}^{*}}]\right\},\quad\mathcal{R}_{i}:=\max_{x\in\mathcal{X}}\left\{\mathrm{Var}[r_{i}(x)]\right\}.

Let 𝒢\mathcal{G} be the upper bound of h=1HR(xh)\sum_{h=1}^{H}R(x_{h}) for any initial state and any policypolicy. In the case of known rewards, with probability 1δ1-\delta, the regret of F-EULER in KK episodes is upper bounded by

min{𝒪~(i=1m𝒬iX[Ii]T),𝒪~(i=1m𝒢2X[Ii]K)}\displaystyle\min\bigl{\{}\tilde{\mathcal{O}}\bigl{(}\sum\nolimits_{i=1}^{m}\sqrt{\mathcal{Q}_{i}X[{I_{i}}]T}\bigr{)},\tilde{\mathcal{O}}\bigl{(}\sum\nolimits_{i=1}^{m}\sqrt{\mathcal{G}^{2}X[{I_{i}}]K}\bigr{)}\bigr{\}} (4.4)
+min{𝒪~(i=1liX[Ji]T),𝒪~(i=1l𝒢2X[Ji]K)}.\displaystyle\quad+\min\bigl{\{}\tilde{\mathcal{O}}\bigl{(}\sum\nolimits_{i=1}^{l}\sqrt{\mathcal{R}_{i}X[{J_{i}}]T}\bigr{)},\tilde{\mathcal{O}}\bigl{(}\sum\nolimits_{i=1}^{l}\sqrt{\mathcal{G}^{2}X[{J_{i}}]K}\bigr{)}\bigr{\}}. (4.5)

In the case of known rewards, the term (4.5) is dropped.

F-EULER does not assume knowledge of the actual \mathcal{Q}_{i},\mathcal{R}_{i} and \mathcal{G}, and its regret bounds (Theorem 2) enjoy the same problem-dependent advantages as discussed in the nonfactored case [42]. Since r(\cdot)\in[0,1], we have \mathcal{G}\leq H and \mathcal{R}_{i}\leq 1 for all i\in[l]. Hence, Theorem 2 implies the following worst-case regret bound.

Corollary 3 (Worst-case regret bounds of F-EULER).

For any FMDP specified by (2.1), in the case of unknown rewards, with probability 1δ1-\delta, the regret of F-EULER in KK episodes is upper bounded by

𝒪~(i=1mHX[Ii]T+i=1lX[Ji]T).\displaystyle\tilde{\mathcal{O}}\left(\sum\nolimits_{i=1}^{m}\sqrt{HX[{I_{i}}]T}+\sum\nolimits_{i=1}^{l}\sqrt{X[{J_{i}}]T}\right). (4.6)

In the case of known rewards, the term i=1lX[Ji]T\sum\nolimits_{i=1}^{l}\sqrt{X[{J_{i}}]T} is dropped.

In (4.6), the regret bound again consists of a transition-induced term and a reward-induced term, resulting from the sum over time of the first term in (4.2) and the first term in (4.3), respectively. For nonfactored MDPs, the reward-induced term \tilde{\mathcal{O}}(\sqrt{SAT}) is absorbed into the transition-induced term \tilde{\mathcal{O}}(\sqrt{HSAT}), while for FMDPs, these two terms depend on their respective factorization structures and do not absorb each other in general. Corollary 3 indicates that F-EULER improves the transition-induced term of F-UCBVI by a \sqrt{H} factor. On the other hand, F-EULER has a higher computational complexity than F-UCBVI. Even neglecting the computation of the reward bonus (for which efficient online methods exist; see the sketch below), in each run of the VI_Optimism subroutine F-EULER has \mathcal{O}(mS^{2}AH) computational complexity, m times that of F-UCBVI.
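The efficient online computation of {\hat{R}}_{i} and the sample variance \mathbb{S}[{\hat{r}}_{i}(x)] is Welford's method [40], mentioned in the previous paragraph of Section 4; a minimal sketch of that standard update (our own class and variable names), with one tracker kept per reward component and scope value, is:

class WelfordStats:
    """Online mean and sample variance of the observed rewards r_i(x[J_i])."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0     # running estimate of R_i(x[J_i])
        self.m2 = 0.0       # running sum of squared deviations from the mean

    def update(self, r):
        self.count += 1
        delta = r - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (r - self.mean)

    def sample_variance(self):
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0

# Example usage: call update once per observed reward for the given scope value.
stats = WelfordStats()
for r in (0.2, 0.5, 0.4):
    stats.update(r)
print(stats.mean, stats.sample_variance())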

5 Proof sketch

We present here the proof ideas of F-UCBVI in the case of known rewards, which sheds light on our main new techniques. The full proof is deferred to Appendix A. We add a subscript k to {\hat{P}}_{i},{\hat{P}},N_{i},M_{i},{\overline{V}}_{h},{b} in Algorithms 1 and 2 to denote the corresponding quantities in the kth episode. For inductive purposes, let V_{H+1}^{\pi}(s)=0 for all \pi\in\Pi,s\in\mathcal{S}, so that V_{h}^{\pi}(s)=R(x)+\left\langle{P}(x),V_{h+1}^{\pi}\right\rangle with x=(s,\pi(s,h)) for all \pi\in\Pi,s\in\mathcal{S},h\in[H], where \Pi is the set of all deterministic policies.

The mechanism of optimism. The purpose of the bonus is to ensure optimism, which means obtaining an entrywise UCB V¯k,h{\overline{V}}_{k,h} on Vh{V_{h}^{*}} for all k[K],h[H]k\in[K],h\in[H].

For h=H+1h=H+1, this is certainly the case, since they are both entrywise 0. For h[H]h\in[H], s𝒮\forall s\in\mathcal{S}, with xk,h=(s,πk(s,h))x_{k,h}=(s,\pi_{k}(s,h)) and xh=(s,π(s,h))x_{h}^{*}=(s,\pi^{*}(s,h)),

\displaystyle{\overline{V}}_{k,h}(s)-{V_{h}^{*}}(s)\geq\underbrace{\bigl{\langle}{\hat{P}}_{k}(x_{h}^{*}),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\bigr{\rangle}}_{\text{$\geq 0$ by inductive assumption}}+\underbrace{\bigl{\langle}{\hat{P}}_{k}(x_{h}^{*})-{P}(x_{h}^{*}),{V_{h+1}^{*}}\bigr{\rangle}}_{\text{transition estimation error}}+{b}_{k}(x_{h}^{*}).

To form a valid backward induction argument, we expect the bonus to be a UCB of the absolute value of the transition estimation error. For nonfactored MDPs, with Nk(x)N_{k}(x) being the number of visits to xx before the kkth episode, a direct application of Hoeffding’s inequality (Lemma 37) yields that for any x𝒳x\in\mathcal{X},

|P^k(x)P(x),Vh+1|HL2Nk(x),\displaystyle\bigl{|}\bigl{\langle}{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\bigr{\rangle}\bigr{|}\leq H\sqrt{\tfrac{L}{2N_{k}(x)}},

which determines the transition bonus. For FMDPs, a natural extension is to expect

|P^k(x)P(x),Vh+1|?i=1m|P^i,k(x)Pi(x),Vh+1|i=1mHL2Ni,k(x),\displaystyle\bigl{|}\bigl{\langle}{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\bigr{\rangle}\bigr{|}\stackrel{{\scriptstyle?}}{{\leq}}\sum\nolimits_{i=1}^{m}\bigl{|}\bigl{\langle}{{\hat{P}}_{i,k}(x)-{P}_{i}(x)},{{V_{h+1}^{*}}}\bigr{\rangle}\bigr{|}\leq\sum\nolimits_{i=1}^{m}H\sqrt{\tfrac{L}{2N_{i,k}(x)}}, (5.1)

which can then be used as the transition bonus. Although it holds that

|P^k(x)P(x),Vh+1|i=1m|P^i,k(x)Pi(x)|,Vh+1,\displaystyle\bigl{|}\bigl{\langle}{{\hat{P}}_{k}(x)-{P}(x)},{{V_{h+1}^{*}}}\bigr{\rangle}\bigr{|}\leq\sum\nolimits_{i=1}^{m}\bigl{\langle}{\bigl{|}{\hat{P}}_{i,k}(x)-{P}_{i}(x)\bigr{|}},{{V_{h+1}^{*}}}\bigr{\rangle},

the inequality in question in (5.1) fails to hold (e.g., using Si=m=2S_{i}=m=2 one obtains a quick contradiction). To address this difficulty, we adopt an inverse telescoping technique. Omitting the dependence of the transitions P^k(x),P(x),P^i(x),Pi(x){\hat{P}}_{k}(x),{P}(x),{\hat{P}}_{i}(x),{P}_{i}(x) on xx, we rewrite P^kP,Vh+1\langle{{\hat{P}}_{k}-{P}},{{V_{h+1}^{*}}}\rangle as

i=1mP^i,ki=1mPi,Vh+1=i=1m(P^i,kPi)P1:i1P^i+1:m,k,Vh+1\displaystyle\bigl{\langle}\prod\nolimits_{i=1}^{m}{\hat{P}}_{i,k}-\prod\nolimits_{i=1}^{m}{P}_{i},{V_{h+1}^{*}}\bigr{\rangle}=\bigl{\langle}\sum\nolimits_{i=1}^{m}({\hat{P}}_{i,k}-{P}_{i}){P}_{1:i-1}{\hat{P}}_{i+1:m,k},{V_{h+1}^{*}}\bigr{\rangle}
=\displaystyle= i=1mP^i,kPi,𝔼P1:i1𝔼Pi+1:m[Vh+1]\displaystyle\sum\nolimits_{i=1}^{m}\bigl{\langle}{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}}\mathbb{E}_{{P}_{i+1:m}}[{V_{h+1}^{*}}]\bigr{\rangle} (5.2)
+i=1mP^i,kPi,𝔼P1:i1(𝔼P^i+1:m,k𝔼Pi+1:m)[Vh+1],\displaystyle\quad+\sum\nolimits_{i=1}^{m}\bigl{\langle}{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}}-\mathbb{E}_{{P}_{i+1:m}})[{V_{h+1}^{*}}]\bigr{\rangle}, (5.3)

where the first equality is referred to as the “inverse telescoping” technique, and the second equality adds and then subtracts the same term. The term (5.2) can be bounded by scalar concentration, leading to the component-wise bonus term in (3.1), while (5.3) is bounded by L_{1}-norm concentration (Lemma 36) together with another use of the inverse telescoping technique, leading to the cross-component bonus term in (3.1).
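For concreteness, in the two-component case (m=2) the decomposition above specializes to

\displaystyle{\hat{P}}_{1,k}{\hat{P}}_{2,k}-{P}_{1}{P}_{2}=({\hat{P}}_{1,k}-{P}_{1}){\hat{P}}_{2,k}+{P}_{1}({\hat{P}}_{2,k}-{P}_{2}),

so that

\displaystyle\bigl{\langle}{\hat{P}}_{k}-{P},{V_{h+1}^{*}}\bigr{\rangle}=\bigl{\langle}{\hat{P}}_{1,k}-{P}_{1},\mathbb{E}_{{P}_{2}}[{V_{h+1}^{*}}]\bigr{\rangle}+\bigl{\langle}{\hat{P}}_{2,k}-{P}_{2},\mathbb{E}_{{P}_{1}}[{V_{h+1}^{*}}]\bigr{\rangle}+\bigl{\langle}{\hat{P}}_{1,k}-{P}_{1},(\mathbb{E}_{{\hat{P}}_{2,k}}-\mathbb{E}_{{P}_{2}})[{V_{h+1}^{*}}]\bigr{\rangle},

where the first two terms are the instances of (5.2) and the last term is the single surviving instance of (5.3) (the i=2 instance vanishes since the product {\hat{P}}_{3:2,k} over an empty index range is trivial).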

Regret bound. In the sequel, we use “,\lesssim,\approx” to represent “,=\leq,=” hiding constants and lower-order terms. Let wk,h(x)w_{k,h}(x) denote the visit probability of x𝒳x\in\mathcal{X} at step h[H]h\in[H] following policy πk\pi_{k} during the interaction. Then the optimism ensures that the regret in KK episodes is upper bounded by

k=1KV¯k,1(sk,1)V1πk(sk,1)\displaystyle\sum\nolimits_{k=1}^{K}{\overline{V}}_{k,1}(s_{k,1})-V_{1}^{\pi_{k}}(s_{k,1})
\displaystyle\approx k=1Kh=1HxLkwk,h(x){bk(x)+P^k(x)P(x),Vh+1transition estimation error+P^k(x)P(x),V¯k,h+1Vh+1correction, sum is lower-order (Lemma 22)},\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\bigl{\{}{b}_{k}(x)+\underbrace{\bigl{\langle}{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\bigr{\rangle}}_{\text{transition estimation error}}+\underbrace{\bigl{\langle}{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\bigr{\rangle}}_{\text{correction, sum is lower-order (Lemma~{}\ref{lem:cul-corr-h})}}\bigr{\}},

where the “good” set L_{k} consists of the state-action pairs with sufficiently many visits before the kth episode, and the sum outside L_{k} is a lower-order term (Lemma 13). The cumulative transition estimation error is upper bounded by

k=1Kh=1HxLkwk,h(x){i=1mHL2Ni,k(x)+i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x)sum is lower-order (Lemma 16)}i=1mHX[Ii]TL.\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\Bigl{\{}\sum_{i=1}^{m}H\sqrt{\tfrac{L}{2N_{i,k}(x)}}+\underbrace{\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\tfrac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}}_{\text{sum is lower-order (Lemma~{}\ref{lem:sum-mixed-vr})}}\Bigr{\}}\lesssim\sum_{i=1}^{m}H\sqrt{X[{I_{i}}]TL}.
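The last inequality follows from the standard pigeonhole argument; hiding constants, lower-order terms, and the conversion from visit probabilities to realized counts (controlled by the failure event \mathcal{F}_{4} in Appendix A), and writing N_{i,K+1}(z) for the total count of scope value z\in\mathcal{X}[{I_{i}}] after K episodes, we have for each component i

\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\frac{1}{\sqrt{N_{i,k}(x_{k,h})}}\lesssim\sum_{z\in\mathcal{X}[{I_{i}}]}\sum_{j=1}^{N_{i,K+1}(z)}\frac{1}{\sqrt{j}}\leq 2\sum_{z\in\mathcal{X}[{I_{i}}]}\sqrt{N_{i,K+1}(z)}\leq 2\sqrt{X[{I_{i}}]\sum\nolimits_{z}N_{i,K+1}(z)}=2\sqrt{X[{I_{i}}]T},

where the last inequality is Cauchy-Schwarz and \sum_{z}N_{i,K+1}(z)=T since every step increments exactly one counter per component; multiplying by H\sqrt{L/2} recovers the H\sqrt{X[{I_{i}}]TL} term above.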

The upper bound of the cumulative transition bonus takes exactly the same form. Thus, we obtain

Regret(K)i=1mHX[Ii]TL=𝒪~(i=1mH2X[Ii]T).\displaystyle{\textnormal{Regret}}(K)\lesssim\sum\nolimits_{i=1}^{m}H\sqrt{X[{I_{i}}]TL}=\tilde{\mathcal{O}}(\sum\nolimits_{i=1}^{m}\sqrt{H^{2}X[{I_{i}}]T}).

Extension to other conditions. The treatment of unknown rewards is standard [42]. The same mechanism of optimism applies to F-EULER, except that the scalar concentration uses Bernstein’s inequality (Lemma 38). Please see Appendix B for a complete proof.

6 Lower bounds

In this section, we present some information-theoretic lower bounds for different factored structures of FMDPs, which illustrate the structure-dependent nature of the lower bounds; a full characterization is still an open problem.

Recall the formal statement of the lower bound for MDPs: for any given natural numbers S,A,H, there is an MDP with S states, A actions and horizon H, with unknown transition {P} and possibly known rewards R, such that the expected regret after K episodes is \Omega(\sqrt{HSAT}). Ideally, the lower bound for FMDPs should be stated in the same way, specified for any natural numbers m,n,l,H and any factored structure \{\mathcal{S}_{i}\}_{i=1}^{m},\{\mathcal{X}_{i}\}_{i=1}^{n},\{{I_{i}}\}_{i=1}^{m},\{{J_{i}}\}_{i=1}^{l}. A simple \Omega(\sum_{i=1}^{m}\sqrt{HX[{I_{i}}]T}+\sum_{i=1}^{l}\sqrt{X[{J_{i}}]T}) lower bound is tempting, yet far from the truth. To avoid excessive subtlety, our lower bound discussion is restricted to the following normal factored structure, in which the state-action factorization extends the state factorization.

Definition 2 (Normal factored structure).

For natural numbers m<nm<n, let the state space 𝒮=i=1m𝒮i\mathcal{S}=\bigotimes_{i=1}^{m}\mathcal{S}_{i} and the action space 𝒜=i=1nm𝒜i\mathcal{A}=\bigotimes_{i=1}^{n-m}\mathcal{A}_{i}. The factored structure of an FMDP is normal if and only if the state-action space 𝒳=i=1n𝒳i\mathcal{X}=\bigotimes_{i=1}^{n}\mathcal{X}_{i} where 𝒳i=𝒮i\mathcal{X}_{i}=\mathcal{S}_{i} for i[m]i\in[m] and 𝒳m+i=𝒜i\mathcal{X}_{m+i}=\mathcal{A}_{i} for i[nm]i\in[n-m].

We now state the lower bounds for two degenerate structures.

Theorem 4 (Degenerate case 1).

For any algorithm, under the assumption of unknown rewards, for the normal factored structure that satisfies Ii[m]{I_{i}}\subset[m] for all i[m]i\in[m] and any Ji{J_{i}} for i[l]i\in[l], there is an FMDP with the specified structure such that for some initial states, the expected regret in KK episodes is at least Ω(maxiX[Ji]T)\Omega(\max_{i}\sqrt{X[{J_{i}}^{\prime}]T}) where Ji=Ji{m+1,,n}{J_{i}}^{\prime}={J_{i}}\cap\{m+1,\cdots,n\}.

Theorem 5 (Degenerate case 2).

For any algorithm, under the assumption of unknown rewards, for the normal factored structure that satisfies Ji{m+1,,n}{J_{i}}\subset\{m+1,\cdots,n\} for all i[l]i\in[l], there is an FMDP with the specified structure such that for some initial states, the expected regret in KK episodes is at least Ω(maxiX[Ji]T)\Omega(\max_{i}\sqrt{X[{J_{i}}]T}).

Theorem 4 considers the degenerate case where the transitions do not depend on the action space at all. Furthermore, the regret is always 0 if the rewards do not depend on the action space either (Ji={J_{i}}^{\prime}=\emptyset for all i[l]i\in[l]). Theorem 5 considers the degenerate case where the rewards have no dependence on the state space, and X[Ii]X[{I_{i}}] vanishes in the lower bound for this MAB problem. The following theorem identifies some nondegenerate constraints on the factored structures.

Theorem 6 (Lower bound, a nondegenerate case).

For any algorithm, under the assumption of known rewards, for the normal factored structure that satisfies Si3S_{i}\geq 3, iIii\in{I_{i}}, j{m+1,,n}\exists j\in\{m+1,\cdots,n\} such that jIij\in{I_{i}} for all i[m]i\in[m] and H2H\geq 2, there is an FMDP with the specified structure such that for some initial states, the expected regret in KK episodes is at least Ω(maxiHX[Ii]T)\Omega(\max_{i}\sqrt{HX[{I_{i}}]T}).

Theorem 6 indicates that the regret achieved by F-EULER is minimax optimal up to log factors for a rich class of structures, since \tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{HX[{I_{i}}]T})=\tilde{\mathcal{O}}(\max_{i}\sqrt{HX[{I_{i}}]T}). The main assumption here is that the transition of each state component depends on itself and at least one action component. Considering that it is satisfied in the nonfactored case, this assumption is not too restrictive. A major obstacle to showing the lower bound for a general factored structure is that some structures lack the self-loop property, i.e., the property that the transition of a state component depends on itself, so that the component stays in the same state with a certain probability; this property is key to the lower bound constructions for nonfactored MDPs [19, 7, 20]. However, by generalizing the self-loop property to the loop property, we can show a more general version of the nondegenerate lower bound.

For an FMDP with the normal factored structure, for i,j[m]i,j\in[m], if iIji\in{I_{j}}, then we say the jjth state component depends on the iith state component, or the iith state component influences the jjth state component, denoted by 𝒮i𝒮j\mathcal{S}_{i}\to\mathcal{S}_{j}. The iith state component has the loop property, if and only if the iith state component influences itself either directly (𝒮i𝒮i\mathcal{S}_{i}\to\mathcal{S}_{i}, self-loop) or through some intermediate state components (𝒮i𝒮j𝒮i\mathcal{S}_{i}\to\cdots\to\mathcal{S}_{j}\to\cdots\to\mathcal{S}_{i} for some j’s[m]j\text{'s}\in[m]). The loop is referred to as an influence loop. Let \mathcal{I} be the set of the indices of the state components that have the loop property and some action dependence, i.e.,

:={i[m]:i has the loop property, and there exists j{m+1,,n} such that jIi}.\displaystyle\mathcal{I}:=\left\{i\in[m]:\text{$i$ has the loop property, and there exists }j\in\{m+1,\cdots,n\}\text{ such that }j\in{I_{i}}\right\}.

Then we have the following theorem that subsumes Theorem 6.

Theorem 7.

For any algorithm, under the assumption of known rewards, for the normal factored structure that satisfies Si3,i[m]S_{i}\geq 3,\forall i\in[m] and H2H\geq 2, there is an FMDP with the specified structure such that for some initial states, the expected regret in KK episodes is at least Ω(maxiHX[Ii]T)\Omega(\max_{i\in\mathcal{I}}\sqrt{HX[{I_{i}}]T}).

Theorem 7 shows that for factored structures satisfying \max_{i\in\mathcal{I}}X[{I_{i}}]=\max_{i\in[m]}X[{I_{i}}], F-EULER is minimax optimal up to log factors. We defer to Appendix C the proofs of Theorems 4, 5, 6 and 7, which rely on the lower bounds for MABs [25] and the construction of MAB-like FMDPs (FMDPs whose cumulative rewards depend only on the action in the first step), similar to the MAB-like MDPs in [7, 20].
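As an illustration of the loop property, the index set \mathcal{I} can be computed directly from the scope index sets by searching for directed cycles in the influence graph; the following minimal Python sketch (our own helper, assuming the normal factored structure with 0-indexed components, where indices at least m denote action components) is one way to do so.

def influence_set(state_scopes):
    """Return I = {i : component i has the loop property and its scope I_i contains
    at least one action component}; state_scopes[i] is I_i over the n state-action
    components (0-indexed), with indices >= m denoting action components."""
    m = len(state_scopes)
    # Directed edge j -> i iff state component j appears in the scope of component i.
    children = {j: [i for i in range(m) if j in state_scopes[i]] for j in range(m)}

    def on_loop(i):
        # Depth-first search: can component i influence itself, possibly via intermediates?
        stack, seen = list(children[i]), set()
        while stack:
            j = stack.pop()
            if j == i:
                return True
            if j not in seen:
                seen.add(j)
                stack.extend(children[j])
        return False

    return {i for i in range(m)
            if on_loop(i) and any(j >= m for j in state_scopes[i])}

# Example: m = 2 state components and one shared action component with index 2;
# S_1 and S_2 influence each other, so both lie on an influence loop.
print(influence_set([{1, 2}, {0, 2}]))   # -> {0, 1}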

7 Conclusion

In this paper, we study reinforcement learning within the tabular episodic setting of FMDPs, that is, MDPs which enjoy a factored structure in their dynamics. Such a factorization is typically derived via conditional independence structures in the transition, and by using it effectively one can greatly reduce the complexity of learning. However, a straightforward adaptation of the usual minimax optimal regret analysis for MDPs turns out not to be possible. We uncover the difficulties posed by such an adaptation, and subsequently develop two new algorithms (called F-UCBVI and F-EULER, both motivated by their known nonfactored analogues) that carefully exploit the factored structure to obtain minimax optimal regret bounds for a rich class of factored structures. The key algorithmic technique that we develop is a careful design of a “cross-component” bonus term to ensure optimism and guide exploration.

We present the lower bounds for FMDPs under certain structure constraints. The problem of characterizing lower bounds for FMDPs with arbitrary structures turns out to be more subtle, and remains an open problem. In addition, our methods are model-based; developing model-free algorithms for FMDPs is worth exploring. Lastly, an important practical direction is to develop structure-agnostic algorithms that can still ensure regret guarantees.

Acknowledgments

YT acknowledges partial support from an MIT Presidential Fellowship and a graduate research assistantship from the NSF BIGDATA grant (number 1741341). SS acknowledges partial support from NSF-BIGDATA (1741341) and NSF-TRIPODS+X (1839258).

References

  • Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
  • Boutilier et al. [1995] Craig Boutilier, Richard Dearden, and Moises Goldszmidt. Exploiting structure in policy construction. In IJCAI, volume 14, pages 1104–1113, 1995.
  • Boutilier et al. [1999] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.
  • Boutilier et al. [2000] Craig Boutilier, Richard Dearden, and Moisés Goldszmidt. Stochastic dynamic programming with factored representations. Artificial intelligence, 121(1-2):49–107, 2000.
  • Chakraborty [2014] Doran Chakraborty. Sample efficient multiagent learning in the presence of markovian agents. Springer, 2014.
  • Chakraborty and Stone [2011] Doran Chakraborty and Peter Stone. Structure learning in ergodic factored mdps without knowledge of the transition function’s in-degree. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 737–744. Citeseer, 2011.
  • Dann and Brunskill [2015] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
  • Dann et al. [2018] Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056, 2018.
  • Degris and Sigaud [2013] Thomas Degris and Olivier Sigaud. Factored markov decision processes. In Olivier Sigaud and Olivier Buffet, editors, Markov Decision Processes in Artificial Intelligence, chapter 4, pages 99–126. Wiley Online Library, 2013.
  • Degris et al. [2006] Thomas Degris, Olivier Sigaud, and Pierre-Henri Wuillemin. Learning the structure of factored markov decision processes in reinforcement learning problems. In Proceedings of the 23rd international conference on Machine learning, pages 257–264, 2006.
  • Diuk et al. [2009] Carlos Diuk, Lihong Li, and Bethany R Leffler. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 249–256, 2009.
  • Efroni et al. [2019] Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies. In Advances in Neural Information Processing Systems, pages 12203–12213, 2019.
  • Guestrin et al. [2001] Carlos Guestrin, Daphne Koller, and Ronald Parr. Max-norm projections for factored mdps. In IJCAI, volume 1, pages 673–682, 2001.
  • Guestrin et al. [2002a] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored mdps. In Advances in neural information processing systems, pages 1523–1530, 2002a.
  • Guestrin et al. [2002b] Carlos Guestrin, Relu Patrascu, and Dale Schuurmans. Algorithm-directed exploration for model-based reinforcement learning in factored mdps. In ICML, pages 235–242. Citeseer, 2002b.
  • Guestrin et al. [2003] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored mdps. Journal of Artificial Intelligence Research, 19:399–468, 2003.
  • Guo and Brunskill [2018] Zhaohan Daniel Guo and Emma Brunskill. Sample efficient learning with feature selection for factored mdps. In Proceedings of the 14th European Workshop on Reinforcement Learning. EWRL, 2018.
  • Hallak et al. [2015] Assaf Hallak, François Schnitzler, Timothy Mann, and Shie Mannor. Off-policy model-based learning under unknown factored dynamics. In International Conference on Machine Learning, pages 711–719, 2015.
  • Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Jin [2020] Chi Jin. Lecture 12: Lower bound for mdp. Lecture notes in ELE524: Foundations of Reinforcement Learning, Princeton University, 2020. Spring Semester.
  • Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
  • Kakade et al. [2018] Sham Kakade, Mengdi Wang, and Lin F Yang. Variance reduction methods for sublinear reinforcement learning. arXiv preprint arXiv:1802.09184, 2018.
  • Kearns and Koller [1999] Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored mdps. In IJCAI, volume 16, pages 740–747, 1999.
  • Koller and Parr [2000] Daphne Koller and Ron Parr. Policy iteration for factored mdps. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 326–334, 2000.
  • Lattimore and Szepesvári [2018] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, page 28, 2018.
  • Maurer and Pontil [2009] Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
  • Osband and Van Roy [2014] Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored mdps. In Advances in Neural Information Processing Systems, pages 604–612, 2014.
  • Osband and Van Roy [2017] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2701–2710. JMLR. org, 2017.
  • Osband et al. [2013] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
  • Puterman [2014] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Radoszycki et al. [2015] Julia Radoszycki, Nathalie Peyrard, and Régis Sabbadin. Solving f3 mdps: Collaborative multiagent markov decision processes with factored transitions, rewards and stochastic policies. In International Conference on Principles and Practice of Multi-Agent Systems, pages 3–19. Springer, 2015.
  • Schuurmans and Patrascu [2002] Dale Schuurmans and Relu Patrascu. Direct value-approximation for factored mdps. In Advances in Neural Information Processing Systems, pages 1579–1586, 2002.
  • Simchowitz and Jamieson [2019] Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. In Advances in Neural Information Processing Systems, pages 1151–1160, 2019.
  • Sohn et al. [2020] Sungryull Sohn, Hyunjae Woo, Jongwook Choi, and Honglak Lee. Meta reinforcement learning with autonomous inference of subtask dependencies. arXiv preprint arXiv:2001.00248, 2020.
  • Strehl [2007] Alexander L Strehl. Model-based reinforcement learning in factored-state mdps. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 103–110. IEEE, 2007.
  • Strehl et al. [2007] Alexander L Strehl, Carlos Diuk, and Michael L Littman. Efficient structure learning in factored-state mdps. In AAAI, volume 7, pages 645–650, 2007.
  • Szita and Lőrincz [2009] István Szita and András Lőrincz. Optimistic initialization and greediness lead to polynomial time learning in factored mdps. In Proceedings of the 26th annual international conference on machine learning, pages 1001–1008, 2009.
  • Wainwright [2019] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • Weissman et al. [2003] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
  • Welford [1962] BP Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420, 1962.
  • Xu and Tewari [2020] Ziping Xu and Ambuj Tewari. Near-optimal reinforcement learning in factored mdps: Oracle-efficient algorithms for the non-episodic setting. arXiv preprint arXiv:2002.02302, 2020.
  • Zanette and Brunskill [2019] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210, 2019.
  • Zhang et al. [2020] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. arXiv preprint arXiv:2004.10019, 2020.

Appendix A Regret analysis of F-UCBVI

A.1 Failure event

Before defining the failure event outside of which our regret guarantees hold, we introduce some additional notation. As a shorthand, for any natural number n, any factored set \mathcal{X}=\bigotimes_{i=1}^{n}\mathcal{X}_{i} and any given index set I\subset[n], let \mathcal{X}[-I]:=\bigotimes_{i=1,i\notin I}^{n}\mathcal{X}_{i}, and let x[-I]\in\mathcal{X}[-I] be the tuple of x[j] for j\in[n], j\notin I. For a singleton \{i\}, let \mathcal{S}_{-i}\equiv\mathcal{S}[-\{i\}]:=(\bigotimes_{j=1}^{i-1}\mathcal{S}_{j})\times(\bigotimes_{j=i+1}^{m}\mathcal{S}_{j}), and let s[-i]\in\mathcal{S}_{-i} be the tuple of s[j] for j\in[m], j\neq i. For a vector V\in\mathbb{R}^{\mathcal{S}},

V(s[i])\displaystyle V(s[-i]) :=V((s[1],,s[i1],,s[i+1],,s[m]))𝒮i,\displaystyle:=V((s[1],\cdots,s[i-1],\cdot,s[i+1],\cdots,s[m]))\in\mathbb{R}^{\mathcal{S}_{i}},
V(s[i])\displaystyle V(s[i]) :=V((,,,s[i],,,))𝒮i.\displaystyle:=V((\cdot,\cdots,\cdot,s[i],\cdot,\cdots,\cdot))\in\mathbb{R}^{\mathcal{S}_{-i}}.

Recall that wk,h(x)w_{k,h}(x) is the visit probability to xx at step hh of episode kk. Overloading the notation, we make the following definitions.

Definition 3 (Visit probabilities).

Define

wi,k,h(x)\displaystyle w_{i,k,h}(x) :=wi,k,h(x[Ii])=x[Ii]𝒳[Ii]wk,h(x),\displaystyle:=w_{i,k,h}(x[{I_{i}}])=\sum\nolimits_{x[-{I_{i}}]\in\mathcal{X}[-{I_{i}}]}w_{k,h}(x),
vi,k,h(x)\displaystyle v_{i,k,h}(x) :=vi,k,h(x[Ji])=x[Ji]𝒳[Ji]wk,h(x).\displaystyle:=v_{i,k,h}(x[{J_{i}}])=\sum\nolimits_{x[-{J_{i}}]\in\mathcal{X}[-{J_{i}}]}w_{k,h}(x).

Then let wk(x):=h=1Hwk,h(x)w_{k}(x):=\sum_{h=1}^{H}w_{k,h}(x), wi,k(x):=h=1Hwi,k,h(x)w_{i,k}(x):=\sum_{h=1}^{H}w_{i,k,h}(x) and vi,k(x):=h=1Hvi,k,h(x)v_{i,k}(x):=\sum_{h=1}^{H}v_{i,k,h}(x).

Recall that L=log(16mlSXT/δ)L=\log(16mlSXT/\delta). Then we define the failure event below.

Definition 4 (Failure event).

Define the events

1\displaystyle\mathcal{F}_{1} :={(i[m],k[K],h[H],x𝒳),|P^i,k(x)Pi(x),𝔼Pi(x)[Vh+1]|>HL2Ni,k(x)},\displaystyle:=\Biggl{\{}\exists(i\in[m],k\in[K],h\in[H],x\in\mathcal{X}),\quad\left|\left\langle{\hat{P}}_{i,k}(x)-{P}_{i}(x),\mathbb{E}_{{P}_{-i}(x)}[{V_{h+1}^{*}}]\right\rangle\right|>H\sqrt{\frac{L}{2N_{i,k}(x)}}\Biggr{\}},
2\displaystyle\mathcal{F}_{2} :={(i[m],k[K],x𝒳),P^i,k(x)Pi(x)1>2SiLNi,k(x)},\displaystyle:=\left\{\exists(i\in[m],k\in[K],x\in\mathcal{X}),\quad\left\|{\hat{P}}_{i,k}(x)-{P}_{i}(x)\right\|_{1}>\sqrt{\frac{2S_{i}L}{N_{i,k}(x)}}\right\},
3\displaystyle\mathcal{F}_{3} :={(i[l],k[K],x𝒳),|R^i,k(x)Ri(x)|>L2Mi,k(x)},\displaystyle:=\left\{\exists(i\in[l],k\in[K],x\in\mathcal{X}),\quad\left|{\hat{R}}_{i,k}(x)-R_{i}(x)\right|>\sqrt{\frac{L}{2M_{i,k}(x)}}\right\},
4\displaystyle\mathcal{F}_{4} :={(i[m],k[K],x𝒳),Ni,k(x)<12κ<kwi,κ(x)HL},\displaystyle:=\left\{\exists(i\in[m],k\in[K],x\in\mathcal{X}),\quad N_{i,k}(x)<\frac{1}{2}\sum_{\kappa<k}w_{i,\kappa}(x)-HL\right\},
5\displaystyle\mathcal{F}_{5} :={(i[l],k[K],x𝒳),Mi,k(x)<12κ<kvi,κ(x)HL},\displaystyle:=\left\{\exists(i\in[l],k\in[K],x\in\mathcal{X}),\quad M_{i,k}(x)<\frac{1}{2}\sum_{\kappa<k}v_{i,\kappa}(x)-HL\right\},
6\displaystyle\mathcal{F}_{6} :={(i[m],k[K],x𝒳,s𝒮),|P^i,k(s[i]|x)Pi(s[i]|x)|>2L3Ni,k(x)+2Pi(s[i]|x)LNi,k(x)}.\displaystyle:=\Biggl{\{}\exists(i\in[m],k\in[K],x\in\mathcal{X},s^{\prime}\in\mathcal{S}),\quad\left|{\hat{P}}_{i,k}(s^{\prime}[i]|x)-{P}_{i}(s^{\prime}[i]|x)\right|>\frac{2L}{3N_{i,k}(x)}+\sqrt{\frac{2{P}_{i}(s^{\prime}[i]|x)L}{N_{i,k}(x)}}\Biggr{\}}.

Then the failure event for F-UCBVI is defined by :=i=16i\mathcal{F}:=\bigcup_{i=1}^{6}\mathcal{F}_{i}.

The following lemma shows that the failure event \mathcal{F} happens with low probability.

Lemma 8 (Failure probability).

For any FMDP specified by (2.1), during the running of F-UCBVI for KK episodes, the failure event \mathcal{F} happens with probability at most δ\delta.

Proof.

By Hoeffding’s inequality (Lemma 37) and the union bound, 1\mathcal{F}_{1} happens with probability at most δ/8\delta/8. The same argument applies to 3\mathcal{F}_{3}. By L1L_{1}-norm concentration (Lemma 36) and the union bound, 2\mathcal{F}_{2} happens with probability at most δ/8\delta/8. By the same argument regarding failure event N\mathcal{F}_{N} in [8, Lemma 6, Section B.1], 4\mathcal{F}_{4} and 5\mathcal{F}_{5} happen with probability at most δ/16\delta/16 respectively. By the same Bernstein’s inequality argument in [1, Lemma 1, Section B.4], 6\mathcal{F}_{6} happens with probability at most δ/8\delta/8. Finally, applying the union bound on i\mathcal{F}_{i} for i[6]i\in[6] yields that the failure event \mathcal{F} happens with probability at most 5δ/8δ5\delta/8\leq\delta. ∎

The derivations in the rest of this section, and hence the regret bound, hold outside the failure event \mathcal{F}, i.e., with probability at least 1δ1-\delta. From the above derivation, note that we could actually use the smaller

L0=log(10mlmax{maxi(SiX[Ii]),maxiX[Ji]}T/δ)\displaystyle L_{0}=\log(10ml\max\{\max_{i}(S_{i}X[{I_{i}}]),\max_{i}X[{J_{i}}]\}T/\delta)

to replace LL for F-UCBVI. We use LL for simplicity.

A.2 Upper confidence bound

The transition estimation error refers to the term incurred by the difference between the estimated transition and the true one. To apply scalar concentration, we use the standard technique of bounding the inner product between this difference and the optimal value function. Specifically, we have the following lemma.

Lemma 9 (Transition estimation error, Hoeffding-style).

Outside the failure event \mathcal{F}, for any episode k[K]k\in[K], step h[H]h\in[H] and state-action pair x𝒳x\in\mathcal{X}, the transition estimation error satisfies that

|P^k(x)P(x),Vh+1|i=1mHL2Ni,k(x)+i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x).\displaystyle\left|\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle\right|\leq\sum_{i=1}^{m}H\sqrt{\frac{L}{2N_{i,k}(x)}}+\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}. (A.1)
Proof.

Omitting the dependence of P^k(x),P(x),P^i,k(x),Pi(x){\hat{P}}_{k}(x),{P}(x),{\hat{P}}_{i,k}(x),{P}_{i}(x) on xx,

P^kP,Vh+1\displaystyle\left\langle{\hat{P}}_{k}-{P},{V_{h+1}^{*}}\right\rangle =i=1mP^i,ki=1mPi,Vh+1\displaystyle=\left\langle\prod_{i=1}^{m}{\hat{P}}_{i,k}-\prod_{i=1}^{m}{P}_{i},{V_{h+1}^{*}}\right\rangle
=i=1m(P^i,kPi)P1:i1P^i+1:m,k,Vh+1\displaystyle=\left\langle\sum_{i=1}^{m}({\hat{P}}_{i,k}-{P}_{i}){P}_{1:i-1}{\hat{P}}_{i+1:m,k},{V_{h+1}^{*}}\right\rangle
=i=1mP^i,kPi,𝔼P1:i1P^i+1:m,k[Vh+1]\displaystyle=\sum_{i=1}^{m}\left\langle{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}{\hat{P}}_{i+1:m,k}}[{V_{h+1}^{*}}]\right\rangle
=i=1mP^i,kPi,𝔼P1:i1𝔼P^i+1:m,k[Vh+1]\displaystyle=\sum_{i=1}^{m}\left\langle{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}}\mathbb{E}_{{\hat{P}}_{i+1:m,k}}[{V_{h+1}^{*}}]\right\rangle
=i=1mP^i,kPi,𝔼P1:i1𝔼Pi+1:m[Vh+1]\displaystyle=\sum_{i=1}^{m}\left\langle{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}}\mathbb{E}_{{P}_{i+1:m}}[{V_{h+1}^{*}}]\right\rangle (A.2)
+i=1mP^i,kPi,𝔼P1:i1(𝔼P^i+1:m,k𝔼Pi+1:m)[Vh+1],\displaystyle\quad+\sum_{i=1}^{m}\left\langle{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}}-\mathbb{E}_{{P}_{i+1:m}})[{V_{h+1}^{*}}]\right\rangle, (A.3)

where the second equality uses an inverse telescoping technique (adding and subtracting a sequence of terms), which is essential to our analysis. Outside the failure event \mathcal{F} (specifically, 1\mathcal{F}_{1}), each summand of (A.2) is upper bounded by

|P^i,k(x)Pi(x),𝔼P1:i1(x)𝔼Pi+1:m(x)[Vh+1]|HL2Ni,k(x).\displaystyle\left|\left\langle{\hat{P}}_{i,k}(x)-{P}_{i}(x),\mathbb{E}_{{P}_{1:i-1}(x)}\mathbb{E}_{{P}_{i+1:m}(x)}[{V_{h+1}^{*}}]\right\rangle\right|\leq H\sqrt{\frac{L}{2N_{i,k}(x)}}. (A.4)

By Lemma 10, outside the failure event \mathcal{F}, each summand of (A.3) is upper bounded by

|P^i,k(x)Pi(x),𝔼P1:i1(x)(𝔼P^i+1:m,k(x)𝔼Pi+1:m(x))[Vh+1]|j=i+1m2HLSiSjNi,k(x)Nj,k(x).\displaystyle\left|\left\langle{\hat{P}}_{i,k}(x)-{P}_{i}(x),\mathbb{E}_{{P}_{1:i-1}(x)}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}(x)}-\mathbb{E}_{{P}_{i+1:m}(x)})[{V_{h+1}^{*}}]\right\rangle\right|\leq\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}. (A.5)

Combining (A.4) and (A.5) yields the transition estimation error bound (A.1). ∎
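The inverse telescoping identity used above states that the difference of two product distributions equals the sum, over components, of terms in which the iith factor is replaced by its difference while the factors before (after) it are taken from the true (estimated) transition. This can be checked numerically on small factored distributions; a minimal sketch, with helper names and toy sizes of our own choosing:

    import numpy as np
    from functools import reduce

    def product_dist(factors):
        # outer product of per-component distributions, flattened over the product space
        return reduce(np.multiply.outer, factors).ravel()

    rng = np.random.default_rng(0)
    sizes = [2, 3, 4]                                   # toy component sizes S_1, S_2, S_3
    P = [rng.dirichlet(np.ones(n)) for n in sizes]      # "true" factors P_i
    P_hat = [rng.dirichlet(np.ones(n)) for n in sizes]  # "estimated" factors

    lhs = product_dist(P_hat) - product_dist(P)
    rhs = np.zeros_like(lhs)
    for i in range(len(sizes)):
        # i-th factor replaced by the difference; true factors before, estimated factors after
        factors = P[:i] + [P_hat[i] - P[i]] + P_hat[i + 1:]
        rhs += product_dist(factors)

    assert np.allclose(lhs, rhs)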

The following Lemma 10 introduces the cross-component term, which also appears in the transition bonus later; it results from applying the inverse telescoping technique once and the L1L_{1}-norm concentration (Lemma 36) twice.

Lemma 10 (Hölder’s argument).

Outside the failure event \mathcal{F}, for any index i[m]i\in[m], episode k[K]k\in[K], step h[H]h\in[H] and state-action pair x𝒳x\in\mathcal{X},

|P^i,k(x)Pi(x),𝔼P1:i1(x)(𝔼P^i+1:m,k(x)𝔼Pi+1:m(x))[Vh+1]|j=i+1m2HLSiSjNi,k(x)Nj,k(x).\displaystyle\left|\left\langle{\hat{P}}_{i,k}(x)-{P}_{i}(x),\mathbb{E}_{{P}_{1:i-1}(x)}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}(x)}-\mathbb{E}_{{P}_{i+1:m}(x)})[V_{h+1}^{*}]\right\rangle\right|\leq\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}.
Proof.

By Hölder’s inequality,

|P^i,k(x)Pi(x),𝔼P1:i1(x)(𝔼P^i+1:m,k(x)𝔼Pi+1:m(x))[Vh+1]|\displaystyle\left|\left\langle{\hat{P}}_{i,k}(x)-{P}_{i}(x),\mathbb{E}_{{P}_{1:i-1}(x)}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}(x)}-\mathbb{E}_{{P}_{i+1:m}(x)})[V_{h+1}^{*}]\right\rangle\right|
\displaystyle\leq P^i,k(x)Pi(x)1𝔼P1:i1(x)(𝔼P^i+1:m,k(x)𝔼Pi+1:m(x))[Vh+1]\displaystyle\left\|{\hat{P}}_{i,k}(x)-{P}_{i}(x)\right\|_{1}\cdot\left\|\mathbb{E}_{{P}_{1:i-1}(x)}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}(x)}-\mathbb{E}_{{P}_{i+1:m}(x)})[V_{h+1}^{*}]\right\|_{\infty}
\displaystyle\leq 2SiLNi,k(x)𝔼P1:i1(x)(𝔼P^i+1:m,k(x)𝔼Pi+1:m,h(x))[Vh+1],\displaystyle\sqrt{\frac{2S_{i}L}{N_{i,k}(x)}}\cdot\left\|\mathbb{E}_{{P}_{1:i-1}(x)}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}(x)}-\mathbb{E}_{{P}_{i+1:m,h}(x)})[V_{h+1}^{*}]\right\|_{\infty}, (A.6)

where the second inequality holds outside the failure event \mathcal{F} (specifically, 2\mathcal{F}_{2}). We proceed to bound the LL_{\infty}-norm term by applying the inverse telescoping technique. Omitting the dependence of P^k(x),P(x),P^i,k(x),Pi(x){\hat{P}}_{k}(x),{P}(x),{\hat{P}}_{i,k}(x),{P}_{i}(x) on xx, for any i[m]i\in[m],

𝔼P1:i1(𝔼P^i+1:m,k𝔼Pi+1:m)[Vh+1]\displaystyle\mathbb{E}_{{P}_{1:i-1}}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}}-\mathbb{E}_{{P}_{i+1:m}})[V_{h+1}^{*}] =j=i+1mP^j,kj=i+1mPj,𝔼P1:i1[Vh+1]\displaystyle=\biggl{\langle}\prod_{j=i+1}^{m}{\hat{P}}_{j,k}-\prod_{j=i+1}^{m}{P}_{j},\mathbb{E}_{{P}_{1:i-1}}[V_{h+1}^{*}]\biggr{\rangle}
=j=i+1m(P^j,kPj)Pi+1:j1P^j+1:m,k,𝔼P1:i1[Vh+1]\displaystyle=\biggl{\langle}\sum_{j=i+1}^{m}({\hat{P}}_{j,k}-{P}_{j}){P}_{i+1:j-1}{\hat{P}}_{j+1:m,k},\mathbb{E}_{{P}_{1:i-1}}[V_{h+1}^{*}]\biggr{\rangle}
=j=i+1mP^j,kPj,𝔼P1:i1𝔼Pi+1:j1𝔼P^j+1:m,k[Vh+1]𝒮i.\displaystyle=\sum_{j=i+1}^{m}\left\langle{\hat{P}}_{j,k}-{P}_{j},\mathbb{E}_{{P}_{1:i-1}}\mathbb{E}_{{P}_{i+1:j-1}}\mathbb{E}_{{\hat{P}}_{j+1:m,k}}[V_{h+1}^{*}]\right\rangle\in\mathbb{R}^{\mathcal{S}_{i}}.

Therefore, for any i[m]i\in[m] and s[i]𝒮is^{\prime}[i]\in\mathcal{S}_{i},

|𝔼P1:i1(𝔼P^i+1:m,k𝔼Pi+1:m)[Vh+1](s[i])|\displaystyle\left|\mathbb{E}_{{P}_{1:i-1}}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}}-\mathbb{E}_{{P}_{i+1:m}})[V_{h+1}^{*}](s^{\prime}[i])\right| =|𝔼P1:i1(𝔼P^i+1:m,k𝔼Pi+1:m)[Vh+1(s[i])]|\displaystyle=\left|\mathbb{E}_{{P}_{1:i-1}}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}}-\mathbb{E}_{{P}_{i+1:m}})[V_{h+1}^{*}(s^{\prime}[i])]\right|
j=i+1m|P^j,kPj,𝔼P1:i1𝔼Pi+1:j1𝔼P^j+1:m,k[Vh+1(s[i])]|\displaystyle\leq\sum_{j=i+1}^{m}\left|\left\langle{\hat{P}}_{j,k}-{P}_{j},\mathbb{E}_{{P}_{1:i-1}}\mathbb{E}_{{P}_{i+1:j-1}}\mathbb{E}_{{\hat{P}}_{j+1:m,k}}[V_{h+1}^{*}(s^{\prime}[i])]\right\rangle\right|
j=i+1mP^j,kPj1𝔼Pi+1:j1𝔼P^j+1:m,k[Vh+1(s[i])]\displaystyle\leq\sum_{j=i+1}^{m}\left\|{\hat{P}}_{j,k}-{P}_{j}\right\|_{1}\cdot\left\|\mathbb{E}_{{P}_{i+1:j-1}}\mathbb{E}_{{\hat{P}}_{j+1:m,k}}[V_{h+1}^{*}(s^{\prime}[i])]\right\|_{\infty}
j=i+1m2SjLNj,kH,\displaystyle\leq\sum_{j=i+1}^{m}\sqrt{\frac{2S_{j}L}{N_{j,k}}}\cdot H,

where the last inequality holds outside the failure event \mathcal{F} (specifically, 2\mathcal{F}_{2}). Substituting the above into (A.6) yields

|P^i,k(x)Pi(x),𝔼P1:i1(x)(𝔼P^i+1:m,k(x)𝔼Pi+1:m(x))[Vh+1]|\displaystyle\left|\left\langle{\hat{P}}_{i,k}(x)-{P}_{i}(x),\mathbb{E}_{{P}_{1:i-1}(x)}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}(x)}-\mathbb{E}_{{P}_{i+1:m}(x)})[V_{h+1}^{*}]\right\rangle\right| j=i+1m2SiLNi,k(x)H2SjLNj,k(x)\displaystyle\leq\sum_{j=i+1}^{m}\sqrt{\frac{2S_{i}L}{N_{i,k}(x)}}\cdot H\sqrt{\frac{2S_{j}L}{N_{j,k}(x)}}
=j=i+1m2HLSiSjNi,k(x)Nj,k(x).\displaystyle=\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}.

Recall that our Hoeffding-style transition bonus (3.1) is exactly the transition estimation error bound in (A.1). Add a subscript kk to R^,R^i{\hat{R}},{\hat{R}}_{i} to denote the corresponding quantities in the kkth episode. Recall that our choice of the reward bonus upper bounds the reward estimation error outside the failure event \mathcal{F} (specifically, 3\mathcal{F}_{3}), i.e.,

|R^k(x)R(x)|i=1l|R^i,k(x)Ri(x)|i=1lL2Mi,k(x):=βk(x).\displaystyle\left|{\hat{R}}_{k}(x)-R(x)\right|\leq\sum_{i=1}^{l}\left|{\hat{R}}_{i,k}(x)-R_{i}(x)\right|\leq\sum_{i=1}^{l}\sqrt{\frac{L}{2M_{i,k}(x)}}:={\beta}_{k}(x).
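For concreteness, the transition bonus (3.1), equal to the right-hand side of (A.1), and the reward bonus βk(x){\beta}_{k}(x) above depend only on the per-scope visit counts and a few scalar quantities. A minimal sketch of how they could be computed for one state-action pair (the function name and interface are ours, not the paper's pseudocode):

    import numpy as np

    def hoeffding_bonuses(H, L, S, N, M):
        # H: horizon; L: log factor; S[i]: size S_i of the i-th state component's domain;
        # N[i]: transition visit count N_{i,k}(x[I_i]); M[i]: reward visit count M_{i,k}(x[J_i]).
        N = np.maximum(np.asarray(N, dtype=float), 1.0)   # zero counts replaced by one
        M = np.maximum(np.asarray(M, dtype=float), 1.0)
        S = np.asarray(S, dtype=float)
        # transition bonus: per-component Hoeffding terms plus cross-component terms
        b = float(np.sum(H * np.sqrt(L / (2.0 * N))))
        for i in range(len(N)):
            for j in range(i + 1, len(N)):
                b += 2.0 * H * L * np.sqrt(S[i] * S[j] / (N[i] * N[j]))
        # reward bonus beta_k(x)
        beta = float(np.sum(np.sqrt(L / (2.0 * M))))
        return b, beta

    # toy usage: m = 3 transition components and l = 2 reward components
    b, beta = hoeffding_bonuses(H=5, L=3.0, S=[2, 3, 4], N=[10, 7, 25], M=[12, 9])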

The following lemma shows that these choices ensure optimism: specifically, V¯k,h{\overline{V}}_{k,h} is an entrywise UCB of Vh{V_{h}^{*}} for all k[K],h[H]k\in[K],h\in[H].

Lemma 11 (Upper confidence bound).

Outside the failure event \mathcal{F}, for the choices of bonuses in (3.1) and (3.2), Vh(s)V¯k,h(s){V_{h}^{*}}(s)\leq{\overline{V}}_{k,h}(s) for any episode k[K]k\in[K], step h[H]h\in[H] and state s𝒮s\in\mathcal{S}.

Proof.

For h=H+1h=H+1, VH+1(s)=V¯k,H+1(s)=0{V_{H+1}^{*}}(s)={\overline{V}}_{k,H+1}(s)=0 for all k[K]k\in[K] and s𝒮s\in\mathcal{S}. We proceed by backward induction. For all k[K]k\in[K], for a given h[H]h\in[H], for all s𝒮s\in\mathcal{S}, with xk,h=(s,πk(s,h))x_{k,h}=(s,\pi_{k}(s,h)) and xh=(s,π(s,h))x_{h}^{*}=(s,\pi^{*}(s,h)),

V¯k,h(s)Vh(s)\displaystyle{\overline{V}}_{k,h}(s)-{V_{h}^{*}}(s)
=\displaystyle= R^(xk,h)+βk(xk,h)+P^k(xk,h),V¯k,h+1+bk(xk,h)R(xh)P(xh),Vh+1\displaystyle{\hat{R}}(x_{k,h})+{\beta}_{k}(x_{k,h})+\left\langle{\hat{P}}_{k}(x_{k,h}),{\overline{V}}_{k,h+1}\right\rangle+{b}_{k}(x_{k,h})-R(x_{h}^{*})-\left\langle{P}(x_{h}^{*}),{V_{h+1}^{*}}\right\rangle
\displaystyle\geq R^(xh)+βk(xh)+P^k(xh),V¯k,h+1+bk(xh)R(xh)P(xh),Vh+1\displaystyle{\hat{R}}(x_{h}^{*})+{\beta}_{k}(x^{*}_{h})+\left\langle{\hat{P}}_{k}(x_{h}^{*}),{\overline{V}}_{k,h+1}\right\rangle+{b}_{k}(x_{h}^{*})-R(x_{h}^{*})-\left\langle{P}(x_{h}^{*}),{V_{h+1}^{*}}\right\rangle
\displaystyle\geq P^k(xh),V¯k,h+1Vh+1+R^(xh)R(xh)+βk(xh)+P^k(xh)P(xh),Vh+1+bk(xh),\displaystyle\left\langle{\hat{P}}_{k}(x_{h}^{*}),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle+{\hat{R}}(x_{h}^{*})-R(x_{h}^{*})+{\beta}_{k}(x^{*}_{h})+\left\langle{\hat{P}}_{k}(x_{h}^{*})-{P}(x_{h}^{*}),{V_{h+1}^{*}}\right\rangle+{b}_{k}(x_{h}^{*}),

where the first equality corresponds to the nontrivial case V¯k,h(s)<Hh+1{\overline{V}}_{k,h}(s)<H-h+1; in the trivial case, V¯k,h(s)Hh+1Vh(s){\overline{V}}_{k,h}(s)\geq H-h+1\geq{V_{h}^{*}}(s) and the claim holds directly. Since

P^k(xh),V¯k,h+1Vh+10\displaystyle\left\langle{\hat{P}}_{k}(x_{h}^{*}),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle\geq 0

by the inductive assumption, and since outside the failure event \mathcal{F} the reward bonus βk{\beta}_{k} dominates the reward estimation error and the transition bonus bk{b}_{k} dominates the transition estimation error (A.1), we have V¯k,h(s)Vh(s)0{\overline{V}}_{k,h}(s)-{V_{h}^{*}}(s)\geq 0. Therefore, Vh(s)V¯k,h(s){V_{h}^{*}}(s)\leq{\overline{V}}_{k,h}(s) for all k[K],h[H],s𝒮k\in[K],h\in[H],s\in\mathcal{S}. ∎

We refer to the difference between the optimistic value function and the optimal value function as the confidence radius, and bound it in the following lemma. After introducing the “good” sets, we also bound the sum over time of the squared confidence radius, which is used to show that the cumulative correction term is lower-order (polylogarithmic in TT, Lemma 22).

Lemma 12 (Confidence radius, Hoeffding-style).

Let F0:=5mHmaxiSiLF_{0}:=5mH\max_{i}S_{i}L be a lower-order term. Let sk,t𝒮s_{k,t}\in\mathcal{S} denote the state at step tt of episode kk and xk,t=(sk,t,πk(sk,t,t))x_{k,t}=(s_{k,t},\pi_{k}(s_{k,t},t)). Outside the failure event \mathcal{F}, for any episode k[K]k\in[K], step h[H]h\in[H] and state s𝒮s\in\mathcal{S}, the confidence radius of F-UCBVI satisfies that

V¯k,h(s)Vh(s)min{t=hH𝔼πk[i=1mF0Ni,k(xk,t)+i=1l2LMi,k(xk,t)|sk,h=s],H}.\displaystyle{\overline{V}}_{k,h}(s)-{V_{h}^{*}}(s)\leq\min\left\{\sum_{t=h}^{H}\mathbb{E}_{\pi_{k}}\left[\sum_{i=1}^{m}\frac{F_{0}}{\sqrt{N_{i,k}(x_{k,t})}}+\sum_{i=1}^{l}\sqrt{\frac{2L}{M_{i,k}(x_{k,t})}}\middle|s_{k,h}=s\right],H\right\}.
Proof.

By definition, for any k[K],h[H],s𝒮k\in[K],h\in[H],s\in\mathcal{S},

V¯k,h(s)Vh(s)\displaystyle{\overline{V}}_{k,h}(s)-{V_{h}^{*}}(s) R^(xk,h)+βk(xk,h)+P^k(xk,h),V¯k,h+1+bk(xk,h)R(xh)P(xh),Vh+1\displaystyle\leq{\hat{R}}(x_{k,h})+{\beta}_{k}(x_{k,h})+\left\langle{\hat{P}}_{k}(x_{k,h}),{\overline{V}}_{k,h+1}\right\rangle+{b}_{k}(x_{k,h})-R(x_{h}^{*})-\left\langle{P}(x_{h}^{*}),{V_{h+1}^{*}}\right\rangle
R^(xk,h)+βk(xk,h)+P^k(xk,h),V¯k,h+1+bk(xk,h)R(xk,h)P(xk,h),Vh+1\displaystyle\leq{\hat{R}}(x_{k,h})+{\beta}_{k}(x_{k,h})+\left\langle{\hat{P}}_{k}(x_{k,h}),{\overline{V}}_{k,h+1}\right\rangle+{b}_{k}(x_{k,h})-R(x_{k,h})-\left\langle{P}(x_{k,h}),{V_{h+1}^{*}}\right\rangle
2βk(xk,h)+P^k(xk,h)P(xk,h),V¯k,h+1+P(xk,h),V¯k,h+1Vh+1+bk(xk,h)\displaystyle\leq 2{\beta}_{k}(x_{k,h})+\left\langle{\hat{P}}_{k}(x_{k,h})-{P}(x_{k,h}),{\overline{V}}_{k,h+1}\right\rangle+\left\langle{P}(x_{k,h}),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle+{b}_{k}(x_{k,h})
P(xk,h),V¯k,h+1Vh+1+i=1mH2SiLNi,k(x)+bk(xk,h)+2βk(xk,h).\displaystyle\leq\left\langle{P}(x_{k,h}),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle+\sum_{i=1}^{m}H\sqrt{\frac{2S_{i}L}{N_{i,k}(x)}}+{b}_{k}(x_{k,h})+2{\beta}_{k}(x_{k,h}). (A.7)

For bk(xk,h)b_{k}(x_{k,h}), we have

bk(xk,h)\displaystyle b_{k}(x_{k,h}) =i=1mHL2Ni,k(xk,h)+i=1mj=i+1m2HLSiSjNi,k(xk,h)Nj,k(xk,h)\displaystyle=\sum_{i=1}^{m}H\sqrt{\frac{L}{2N_{i,k}(x_{k,h})}}+\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x_{k,h})N_{j,k}(x_{k,h})}}
i=1mHL2Ni,k(xk,h)+i=1m2mmaxiSiHL1Ni,k(xk,h)\displaystyle\leq\sum_{i=1}^{m}H\sqrt{\frac{L}{2N_{i,k}(x_{k,h})}}+\sum_{i=1}^{m}2m\max_{i}S_{i}HL\frac{1}{\sqrt{N_{i,k}(x_{k,h})}}
3mmaxiSiHLi=1m1Ni,k(xk,h).\displaystyle\leq 3m\max_{i}S_{i}HL\sum_{i=1}^{m}\frac{1}{\sqrt{N_{i,k}(x_{k,h})}}.

Substituting the above into (A.7) yields

V¯k,h(s)Vh(s)\displaystyle{\overline{V}}_{k,h}(s)-{V_{h}^{*}}(s) P(xk,h),V¯k,h+1Vh+1+5mHmaxiSiLi=1m1Ni,k(x)+i=1l2LMi,k(xk,h)\displaystyle\leq\left\langle{P}(x_{k,h}),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle+5mH\max_{i}S_{i}L\sum_{i=1}^{m}\frac{1}{\sqrt{N_{i,k}(x)}}+\sum_{i=1}^{l}\sqrt{\frac{2L}{M_{i,k}(x_{k,h})}}
=P(xk,h),V¯k,h+1Vh+1+i=1mF0Ni,k(x)+i=1l2LMi,k(xk,h).\displaystyle=\left\langle{P}(x_{k,h}),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle+\sum_{i=1}^{m}\frac{F_{0}}{\sqrt{N_{i,k}(x)}}+\sum_{i=1}^{l}\sqrt{\frac{2L}{M_{i,k}(x_{k,h})}}.

By backward induction over the subscript hh, we have

V¯k,h(s)Vh(s)min{t=hH𝔼πk[i=1mF0Ni,k(xk,t)+i=1l2LMi,k(xk,t)|sk,h=s],H}.\displaystyle{\overline{V}}_{k,h}(s)-{V_{h}^{*}}(s)\leq\min\left\{\sum_{t=h}^{H}\mathbb{E}_{\pi_{k}}\left[\sum_{i=1}^{m}\frac{F_{0}}{\sqrt{N_{i,k}(x_{k,t})}}+\sum_{i=1}^{l}\sqrt{\frac{2L}{M_{i,k}(x_{k,t})}}\middle|s_{k,h}=s\right],H\right\}.

A.3 Good sets

The following “good sets” [42] capture a notion of sufficient visits under which the estimates are meaningful; their introduction is central to the sum-over-time analysis.

Definition 5 (Good sets).

Define the good sets of state-action components for transition and reward estimations as

Li,k\displaystyle L_{i,k} :={x[Ii]𝒳[Ii]:14κ<kwi,κ(x[Ii])HL+H},\displaystyle:=\left\{x[{I_{i}}]\in\mathcal{X}[{I_{i}}]:\frac{1}{4}\sum_{\kappa<k}w_{i,\kappa}(x[{I_{i}}])\geq HL+H\right\},
Λi,k\displaystyle\Lambda_{i,k} :={x[Ji]𝒳[Ji]:14κ<kvi,κ(x[Ji])HL+H}.\displaystyle:=\left\{x[{J_{i}}]\in\mathcal{X}[{J_{i}}]:\frac{1}{4}\sum_{\kappa<k}v_{i,\kappa}(x[{J_{i}}])\geq HL+H\right\}.

Then the corresponding good sets of state-action pairs are defined by

Lk\displaystyle L_{k} :={x𝒳:x[Ii]Li,k for all i[m]},\displaystyle:=\left\{x\in\mathcal{X}:x[{I_{i}}]\in L_{i,k}{\text{~{}for all~{}}}i\in[m]\right\},
Λk\displaystyle\Lambda_{k} :={x𝒳:x[Ji]Λi,k for all i[l]}.\displaystyle:=\left\{x\in\mathcal{X}:x[{J_{i}}]\in\Lambda_{i,k}{\text{~{}for all~{}}}i\in[l]\right\}.
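Operationally, membership of a state-action pair in the good set LkL_{k} only requires checking, for every scope, that the accumulated visit probability of the corresponding scope value is large enough. A minimal sketch under this reading (the function name and toy numbers are ours):

    def in_good_set(cum_w_scopes, H, L):
        # cum_w_scopes[i] is sum_{kappa < k} w_{i,kappa}(x[I_i]) for the scope value of x;
        # x is in L_k iff every scope value satisfies (1/4) * cum_w >= H*L + H.
        threshold = H * L + H
        return all(0.25 * c >= threshold for c in cum_w_scopes)

    # toy usage with m = 3 components, H = 5, L = 3 (threshold 20)
    print(in_good_set([90.0, 100.0, 120.0], H=5, L=3))   # True: 22.5, 25, 30 all >= 20
    print(in_good_set([60.0, 100.0, 120.0], H=5, L=3))   # False: 15 < 20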

We shall restrict our attention to the state-action pairs in the good sets (i.e., those with sufficient visits). To this end, the following lemma shows that the sums of visit probabilities of the state-action pairs outside the good sets are lower-order terms.

Lemma 13 (Sum out of good sets).

The sums of the visit probabilities of the state-action pairs out of the good sets and over time satisfy that

k=1Kh=1HxLkwk,h(x)8i=1mX[Ii]HL.\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\notin L_{k}}w_{k,h}(x)\leq 8\sum_{i=1}^{m}X[{I_{i}}]HL.
k=1Kh=1HxΛkvk,h(x)8i=1lX[Ji]HL.\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\notin\Lambda_{k}}v_{k,h}(x)\leq 8\sum_{i=1}^{l}X[{J_{i}}]HL.
Proof.

If x[Ii]Li,kx[{I_{i}}]\notin L_{i,k}, then by definition,

14κkwi,κ(x[Ii])<HL+H+H=H(L+2).\displaystyle\frac{1}{4}\sum_{\kappa\leq k}w_{i,\kappa}(x[{I_{i}}])<HL+H+H=H(L+2).

Therefore,

k=1Kh=1HxLkwk,h(x)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\notin L_{k}}w_{k,h}(x) k=1Kh=1Hx𝒳wk,h(x)𝕀(xLk)\displaystyle\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\mathbb{I}(x\notin L_{k})
k=1Kh=1Hx𝒳wk,h(x)i=1m𝕀(x[Ii]Li,k)\displaystyle\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\sum_{i=1}^{m}\mathbb{I}(x[{I_{i}}]\notin L_{i,k})
i=1mx[Ii]𝒳[Ii]k=1Kh=1Hwi,k,h(x[Ii])𝕀(x[Ii]Li,k)\displaystyle\leq\sum_{i=1}^{m}\sum_{x[{I_{i}}]\in\mathcal{X}[{I_{i}}]}\sum_{k=1}^{K}\sum_{h=1}^{H}w_{i,k,h}(x[{I_{i}}])\mathbb{I}(x[{I_{i}}]\notin L_{i,k})
=i=1mx[Ii]𝒳[Ii]k=1Kwi,k(x[Ii])𝕀(x[Ii]Li,k)\displaystyle=\sum_{i=1}^{m}\sum_{x[{I_{i}}]\in\mathcal{X}[{I_{i}}]}\sum_{k=1}^{K}w_{i,k}(x[{I_{i}}])\mathbb{I}(x[{I_{i}}]\notin L_{i,k})
4i=1mX[Ii]H(L+2),\displaystyle\leq 4\sum_{i=1}^{m}X[{I_{i}}]H(L+2),

where in the third inequality we write x𝒳\sum_{x\in\mathcal{X}} as x[Ii]𝒳[Ii]x[Ii]𝒳[Ii]\sum_{x[{I_{i}}]\in\mathcal{X}[{I_{i}}]}\sum_{x[-{I_{i}}]\in\mathcal{X}[-{I_{i}}]} and use the definition of wi,k,hw_{i,k,h} (Definition 3). Since L=log(16mlSXT/δ)2L=\log(16mlSXT/\delta)\geq 2, we have

k=1Kh=1HxLkwk,h(x)8i=1mX[Ii]HL.\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\notin L_{k}}w_{k,h}(x)\leq 8\sum_{i=1}^{m}X[{I_{i}}]HL.

The same argument applies to vk,h(x)v_{k,h}(x). ∎

The following lemma bridges the visit probabilities wi,kw_{i,k} and vi,kv_{i,k} to the actual numbers of visits Ni,kN_{i,k} and Mi,kM_{i,k} for the state-action pairs in the good sets.

Lemma 14 (Visit number and visit probability).

Outside the failure event \mathcal{F}, the numbers of visits Ni,kN_{i,k} and Mi,kM_{i,k} to the state-action pairs in the good sets satisfy that

Ni,k(x)\displaystyle N_{i,k}(x) 14κkwi,κ(x) for all i[m] and xLk,\displaystyle\geq\frac{1}{4}\sum_{\kappa\leq k}w_{i,\kappa}(x)\quad{\text{~{}for all~{}}}i\in[m]{\text{~{}and~{}}}x\in L_{k},
Mi,k(x)\displaystyle M_{i,k}(x) 14κkvi,κ(x) for all i[l] and xΛk.\geq\frac{1}{4}\sum_{\kappa\leq k}v_{i,\kappa}(x)\quad{\text{~{}for all~{}}}i\in[l]{\text{~{}and~{}}}x\in\Lambda_{k}.
Proof.

Outside the failure event \mathcal{F} (specifically, 4\mathcal{F}_{4}), for all i[m]i\in[m],

Ni,k(x)\displaystyle N_{i,k}(x) 12κ<kwi,κ(x)HL\displaystyle\geq\frac{1}{2}\sum_{\kappa<k}w_{i,\kappa}(x)-HL
=14κ<kwi,κ(x)+14κ<kwi,κ(x)HL\displaystyle=\frac{1}{4}\sum_{\kappa<k}w_{i,\kappa}(x)+\frac{1}{4}\sum_{\kappa<k}w_{i,\kappa}(x)-HL
14κ<kwi,κ(x)+H\displaystyle\geq\frac{1}{4}\sum_{\kappa<k}w_{i,\kappa}(x)+H
14κkwi,κ(x),\displaystyle\geq\frac{1}{4}\sum_{\kappa\leq k}w_{i,\kappa}(x),

where the second inequality results from the definition of good sets (Definition 5). Outside the failure event \mathcal{F} (specifically, 5\mathcal{F}_{5}), the same argument applies to Mi,k(x)M_{i,k}(x) for all i[l]i\in[l]. ∎

By Lemma 14 and the definition of the good sets (Definition 5), for all k[K]k\in[K], Ni,k(x)HL+H2N_{i,k}(x)\geq HL+H\geq 2 for all xLkx\in L_{k}, and Mi,k(x)HL+H2M_{i,k}(x)\geq HL+H\geq 2 for all xΛkx\in\Lambda_{k}. Therefore, restricting the regret analysis to the good sets automatically precludes the case of zero denominators in Algorithms 1 and 2, where we replace zero counts by one for algorithmic completeness.

We refer to the ratio of the visit probability wk,hw_{k,h} to the visit number Ni,kN_{i,k} or Mi,kM_{i,k} as the visit ratio. The accumulation of the visit ratios then turns out to be a lower-order term, as shown in the following lemma.

Lemma 15 (Sum of visit ratio in good sets).

Outside the failure event \mathcal{F}, the sums of the visit ratios of the state-action pairs within the good sets and over time satisfy that

k=1Kh=1HxLkwk,h(x)Ni,k(x)=k=1KxLkwk(x)Ni,k(x[Ii])4X[Ii]L for all i[m],\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}\frac{w_{k,h}(x)}{N_{i,k}(x)}=\sum_{k=1}^{K}\sum_{x\in L_{k}}\frac{w_{k}(x)}{N_{i,k}(x[{I_{i}}])}\leq 4X[{I_{i}}]L\quad{\text{~{}for all~{}}}i\in[m],
k=1Kh=1HxΛkwk,h(x)Mi,k(x)=k=1KxΛkwk(x)Mi,k(x[Ji])4X[Ji]L for all i[l].\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}\frac{w_{k,h}(x)}{M_{i,k}(x)}=\sum_{k=1}^{K}\sum_{x\in\Lambda_{k}}\frac{w_{k}(x)}{M_{i,k}(x[{J_{i}}])}\leq 4X[{J_{i}}]L\quad{\text{~{}for all~{}}}i\in[l].
Proof.

Outside the failure event \mathcal{F}, for any i[m]i\in[m], by Lemma 14,

k=1KxLkwk(x)Ni,k(x[Ii])\displaystyle\sum_{k=1}^{K}\sum_{x\in L_{k}}\frac{w_{k}(x)}{N_{i,k}(x[{I_{i}}])} k=1Kx𝒳wk(x)Ni,k(x[Ii])𝕀(x[Ii]Li,k)\displaystyle\leq\sum_{k=1}^{K}\sum_{x\in\mathcal{X}}\frac{w_{k}(x)}{N_{i,k}(x[{I_{i}}])}\mathbb{I}(x[{I_{i}}]\in L_{i,k})
k=1Kx[Ii]𝒳[Ii]wi,k(x[Ii])Ni,k(x[Ii])𝕀(x[Ii]Li,k)\displaystyle\leq\sum_{k=1}^{K}\sum_{x[{I_{i}}]\in\mathcal{X}[{I_{i}}]}\frac{w_{i,k}(x[{I_{i}}])}{N_{i,k}(x[{I_{i}}])}\mathbb{I}(x[{I_{i}}]\in L_{i,k})
4k=1Kx[Ii]𝒳[Ii]wi,k(x[Ii])κkwi,κ(x[Ii])𝕀(x[Ii]Li,k)\displaystyle\leq 4\sum_{k=1}^{K}\sum_{x[{I_{i}}]\in\mathcal{X}[{I_{i}}]}\frac{w_{i,k}(x[{I_{i}}])}{\sum_{\kappa\leq k}w_{i,\kappa}(x[{I_{i}}])}\mathbb{I}(x[{I_{i}}]\in L_{i,k})
4X[Ii]L,\displaystyle\leq 4X[{I_{i}}]L,

where the last inequality is shown by the proof of Lemma 13 in [42]. The same argument applies to the visit ratio wk,h(x)/Mi,k(x)w_{k,h}(x)/M_{i,k}(x) for any i[l]i\in[l]. ∎

As will be shown below, the cross-component transition bonus term brings the mixed visit ratio wk,h(x)/Ni,k(x)Nj,k(x)w_{k,h}(x)/\sqrt{N_{i,k}(x)N_{j,k}(x)} into the analysis. By the Cauchy-Schwarz inequality, we immediately obtain the following control on the accumulation of the mixed visit ratios.

Lemma 16 (Sum of mixed visit ratio in good sets).

Outside the failure event \mathcal{F}, the sum of the mixed visit ratios of the state-action pairs within the good set LkL_{k} and over time satisfies that

k=1Kh=1HxLkwk,h(x)Ni,k(x)Nj,k(x)4X[Ii]X[Ij]L for all i,j[m].\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}\frac{w_{k,h}(x)}{\sqrt{N_{i,k}(x)N_{j,k}(x)}}\leq 4\sqrt{X[{I_{i}}]X[{I_{j}}]}L\quad{\text{~{}for all~{}}}i,j\in[m].
Proof.

Outside the failure event \mathcal{F}, by the Cauchy-Schwarz inequality and Lemma 15, for any i,j[m]i,j\in[m],

k=1Kh=1HxLkwk,h(x)Ni,k(x)Nj,k(x)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}\frac{w_{k,h}(x)}{\sqrt{N_{i,k}(x)N_{j,k}(x)}} =k=1KxLkwk(x)Ni,k(x[Ii])Nj,k(x[Ij])\displaystyle=\sum_{k=1}^{K}\sum_{x\in L_{k}}\frac{w_{k}(x)}{\sqrt{N_{i,k}(x[{I_{i}}])N_{j,k}(x[{I_{j}}])}}
k=1KxLkwk(x)Ni,k(x[Ii])k=1KxLkwk(x)Nj,k(x[Ij])\displaystyle\leq\sqrt{\sum_{k=1}^{K}\sum_{x\in L_{k}}\frac{w_{k}(x)}{N_{i,k}(x[{I_{i}}])}}\cdot\sqrt{\sum_{k=1}^{K}\sum_{x\in L_{k}}\frac{w_{k}(x)}{N_{j,k}(x[{I_{j}}])}}
4X[Ii]X[Ij]L.\displaystyle\leq 4\sqrt{X[{I_{i}}]X[{I_{j}}]}L.

The following lemma bounds the sum over time of the expected squared confidence radius; its proof gives a first application of the lemmas obtained above via the notion of good sets. For clarity, by “sum over time” we henceforth mean the wk,h(x)w_{k,h}(x)-weighted sum over k[K],h[H]k\in[K],h\in[H] and xLkx\in L_{k} or xΛkx\in\Lambda_{k}.

Lemma 17 (Cumulative confidence radius, Hoeffding-style).

Define the lower-order term

G0:=208m4H4(maxiSi)2maxiX[Ii]L3+24l2H3maxiX[Ji]L2.\displaystyle G_{0}:=208m^{4}H^{4}(\max_{i}S_{i})^{2}\max_{i}X[{I_{i}}]L^{3}+24l^{2}H^{3}\max_{i}X[{J_{i}}]L^{2}.

Then outside the failure event \mathcal{F}, for all i[m]i\in[m], the sum over time of the following expected squared confidence radius of F-UCBVI satisfies that

k=1Kh=1Hx𝒳wk,h(x)(𝔼Pi(𝔼Pi[V¯k,h+1Vh+1])2)k=1Kh=1Hx𝒳wk,h(x)(𝔼P[(V¯k,h+1Vh+1)2])G0.\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\left(\mathbb{E}_{{P}_{i}}\left(\mathbb{E}_{{P}_{-i}}[{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}]\right)^{2}\right)\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\left(\mathbb{E}_{{P}}[\left({\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right)^{2}]\right)\leq G_{0}.
Proof.

Let sk,h𝒮s_{k,h}\in\mathcal{S} denote the state at step hh of episode kk. Since (𝔼[X])2𝔼[X2](\mathbb{E}[X])^{2}\leq\mathbb{E}[X^{2}] for any random variable XX, we have that for all i[m]i\in[m],

k=1Kh=1Hx𝒳wk,h(x)(𝔼Pi(𝔼Pi[V¯k,h+1Vh+1])2)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\left(\mathbb{E}_{{P}_{i}}\left(\mathbb{E}_{{P}_{-i}}[{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}]\right)^{2}\right)
\displaystyle\leq k=1Kh=1Hx𝒳wk,h(x)(𝔼Pi𝔼Pi[(V¯k,h+1Vh+1)2])\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\left(\mathbb{E}_{{P}_{i}}\mathbb{E}_{{P}_{-i}}[\left({\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right)^{2}]\right)
=\displaystyle= k=1Kh=1Hx𝒳wk,h(x)(s𝒮P(s|x)(V¯k,h+1(s)Vh+1(s))2)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\left(\sum_{s^{\prime}\in\mathcal{S}}{P}(s^{\prime}|x)\left({\overline{V}}_{k,h+1}(s^{\prime})-{V_{h+1}^{*}}(s^{\prime})\right)^{2}\right)
=\displaystyle= k=1Kh=1H𝔼πk[(V¯k,h+1(sk,h+1)Vh+1(sk,h+1))2|sk,1]\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\mathbb{E}_{\pi_{k}}\left[\left({\overline{V}}_{k,h+1}(s_{k,h+1})-{V_{h+1}^{*}}(s_{k,h+1})\right)^{2}\middle|s_{k,1}\right]
\displaystyle\leq k=1Kh=1H𝔼πk[(V¯k,h(sk,h)Vh(sk,h))2|sk,1].\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\mathbb{E}_{\pi_{k}}\left[\left({\overline{V}}_{k,h}(s_{k,h})-{V_{h}^{*}}(s_{k,h})\right)^{2}\middle|s_{k,1}\right]. (A.8)

By the confidence radius lemma (Lemma 12), for any k[K],h[H]k\in[K],h\in[H],

𝔼πk[(V¯k,h(sk,h)Vh(sk,h))2|sk,1]\displaystyle\mathbb{E}_{\pi_{k}}\left[\left({\overline{V}}_{k,h}(s_{k,h})-{V_{h}^{*}}(s_{k,h})\right)^{2}\middle|s_{k,1}\right]
\displaystyle\leq 𝔼πk[(t=hH𝔼πk[i=1mF0Ni,k(xk,t)+i=1l2LMi,k(xk,t)|sk,h])2|sk,1]\displaystyle\mathbb{E}_{\pi_{k}}\left[\left(\sum_{t=h}^{H}\mathbb{E}_{\pi_{k}}\left[\sum_{i=1}^{m}\frac{F_{0}}{\sqrt{N_{i,k}(x_{k,t})}}+\sum_{i=1}^{l}\sqrt{\frac{2L}{M_{i,k}(x_{k,t})}}\middle|s_{k,h}\right]\right)^{2}\middle|s_{k,1}\right]
\displaystyle\leq 2mHF02i=1mt=hH𝔼πk[1Ni,k(xk,t)|sk,1]+4lHLi=1lt=hH𝔼πk[1Mi,k(xk,t)|sk,1]\displaystyle 2mHF_{0}^{2}\sum_{i=1}^{m}\sum_{t=h}^{H}\mathbb{E}_{\pi_{k}}\left[\frac{1}{N_{i,k}(x_{k,t})}\middle|s_{k,1}\right]+4lHL\sum_{i=1}^{l}\sum_{t=h}^{H}\mathbb{E}_{\pi_{k}}\left[\frac{1}{M_{i,k}(x_{k,t})}\middle|s_{k,1}\right]
\displaystyle\leq 2mH2F02i=1m𝔼πk[1Ni,k(xk,h)|sk,1]+4lH2Li=1l𝔼πk[1Mi,k(xk,t)|sk,1],\displaystyle 2mH^{2}F_{0}^{2}\sum_{i=1}^{m}\mathbb{E}_{\pi_{k}}\left[\frac{1}{N_{i,k}(x_{k,h})}\middle|s_{k,1}\right]+4lH^{2}L\sum_{i=1}^{l}\mathbb{E}_{\pi_{k}}\left[\frac{1}{M_{i,k}(x_{k,t})}\middle|s_{k,1}\right],

where in the second inequality we use the inequality (i=1nai)2ni=1nai2\left(\sum_{i=1}^{n}a_{i}\right)^{2}\leq n\sum_{i=1}^{n}a_{i}^{2} multiple times (with n=2,m,l,Hn=2,m,l,H). The confidence radius lemma (Lemma 12) also guarantees that

𝔼πk[(V¯k,h(sk,h)Vh(sk,h))2|sk,1]H2.\displaystyle\mathbb{E}_{\pi_{k}}\left[\left({\overline{V}}_{k,h}(s_{k,h})-{V_{h}^{*}}(s_{k,h})\right)^{2}\middle|s_{k,1}\right]\leq H^{2}.

Therefore, substituting the above two bounds into (A.8) and by Lemmas 13 and 15, we have

k=1Kh=1H𝔼πk[(V¯k,h(sk,h)Vh(sk,h))2|sk,1]\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\mathbb{E}_{\pi_{k}}\left[\left({\overline{V}}_{k,h}(s_{k,h})-{V_{h}^{*}}(s_{k,h})\right)^{2}\middle|s_{k,1}\right]
\displaystyle\leq 2mH2F02i=1mk=1Kh=1HxLkwk,h(x)Ni,k(x)+4lH2Li=1lk=1Kh=1HxΛkwk,h(x)Mi,k(x)\displaystyle 2mH^{2}F_{0}^{2}\sum_{i=1}^{m}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}\frac{w_{k,h}(x)}{N_{i,k}(x)}+4lH^{2}L\sum_{i=1}^{l}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}\frac{w_{k,h}(x)}{M_{i,k}(x)}
+k=1Kh=1HxLkwk,h(x)H2+k=1Kh=1HxΛkwk,h(x)H2\displaystyle\quad+\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\notin L_{k}}w_{k,h}(x)H^{2}+\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\notin\Lambda_{k}}w_{k,h}(x)H^{2}
\displaystyle\leq 2mH2F02i=1m4X[Ii]L+4lH2Li=1l4X[Ji]L+8H3i=1mX[Ii]L+8H3i=1lX[Ji]L\displaystyle 2mH^{2}F_{0}^{2}\sum_{i=1}^{m}4X[{I_{i}}]L+4lH^{2}L\sum_{i=1}^{l}4X[{J_{i}}]L+8H^{3}\sum_{i=1}^{m}X[{I_{i}}]L+8H^{3}\sum_{i=1}^{l}X[{J_{i}}]L
\displaystyle\leq 8mH2F02i=1mX[Ii]L+16lH2i=1lX[Ji]L2+8H3i=1mX[Ii]L+8H3i=1lX[Ji]L\displaystyle 8mH^{2}F_{0}^{2}\sum_{i=1}^{m}X[{I_{i}}]L+16lH^{2}\sum_{i=1}^{l}X[{J_{i}}]L^{2}+8H^{3}\sum_{i=1}^{m}X[{I_{i}}]L+8H^{3}\sum_{i=1}^{l}X[{J_{i}}]L
\displaystyle\leq 208m4H4(maxiSi)2maxiX[Ii]L3+24l2H3maxiX[Ji]L2,\displaystyle 208m^{4}H^{4}(\max_{i}S_{i})^{2}\max_{i}X[{I_{i}}]L^{3}+24l^{2}H^{3}\max_{i}X[{J_{i}}]L^{2},

where in the last inequality we use the definition of F0F_{0} (Lemma 12). ∎

A.4 Regret decomposition

We decompose the regret in the following standard way [42], and then bound the sum over time of the individual terms in the next few subsections. Here we assume the general transition bonus b{b} to be a function of step hh, as in F-EULER.

Lemma 18 (Regret decomposition).

Let Lk,ΛkL_{k},\Lambda_{k} be the good sets defined in Definition 5. Then for any given FMDP specified in (2.1), outside the failure event \mathcal{F}, the regret of F-UCBVI in KK episodes satisfies that

Regret(K)\displaystyle{\textnormal{Regret}}(K) k=1Kh=1Hx𝒳wk,h(x)min{(P^k(x)P(x),V¯k,h+1+bk,h(x)+R^k(x)R(x)+βk(x)),H}\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\min\Bigl{\{}\Bigl{(}\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}\right\rangle+{b}_{k,h}(x)+{\hat{R}}_{k}(x)-R(x)+{\beta}_{k}(x)\Bigr{)},H\Bigr{\}}
k=1Kh=1HxLkwk,h(x)(P^k(x)P(x),Vh+1transition estimation error+bk,h(x)+P^k(x)P(x),V¯k,h+1Vh+1correction term)\displaystyle\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\Bigl{(}\underbrace{\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle}_{\text{transition estimation error}}+{b}_{k,h}(x)+\underbrace{\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle}_{\text{correction term}}\Bigr{)}
+k=1Kh=1HxΛkwk,h(x)2βk(x)+8H2i=1mX[Ii]L+8H2i=1lX[Ji]L,\displaystyle\quad+\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)2{\beta}_{k}(x)+8H^{2}\sum_{i=1}^{m}X[{I_{i}}]L+8H^{2}\sum_{i=1}^{l}X[{J_{i}}]L,

where the bk,h(x){b}_{k,h}(x) term is referred to as “transition optimism” and the 2βk(x)2{\beta}_{k}(x) term is referred to as “reward estimation error and optimism”.

Proof.

Add a subscript kk to Q¯h{\overline{Q}}_{h} in the VI_Optimism procedure (Algorithm 2) to denote the corresponding optimistic Q-value function in episode kk. Since V¯k,h{\overline{V}}_{k,h} is an entrywise UCB of Vh{V_{h}^{*}}, we upper bound the regret by

Regret(K)\displaystyle{\textnormal{Regret}}(K) k=1KV¯k,1(sk,1)V1πk(sk,1)\displaystyle\leq\sum_{k=1}^{K}{\overline{V}}_{k,1}(s_{k,1})-V_{1}^{\pi_{k}}(s_{k,1}) (A.9)
=k=1Kx𝒳wk,1(x)(Qk,1(x)Q1πk(x))\displaystyle=\sum_{k=1}^{K}\sum_{x\in\mathcal{X}}w_{k,1}(x)(Q_{k,1}(x)-Q_{1}^{\pi_{k}}(x))
=k=1Kx𝒳wk,1(x)(min{R^k(x)+βk(x)+P^k(x),V¯k,2+bk,1(x),H}\displaystyle=\sum_{k=1}^{K}\sum_{x\in\mathcal{X}}w_{k,1}(x)\Bigl{(}\min\left\{{\hat{R}}_{k}(x)+{\beta}_{k}(x)+\left\langle{\hat{P}}_{k}(x),{\overline{V}}_{k,2}\right\rangle+{b}_{k,1}(x),H\right\}
R(x)P(x),V2πk)\displaystyle\quad-R(x)-\left\langle{P}(x),V_{2}^{\pi_{k}}\right\rangle\Bigr{)} (A.10)
k=1Kx𝒳wk,1(x)(min{(R^k(x)R(x)+βk(x)+P^k(x)P(x),V¯k,2\displaystyle\leq\sum_{k=1}^{K}\sum_{x\in\mathcal{X}}w_{k,1}(x)\biggl{(}\min\Bigl{\{}\Bigl{(}{\hat{R}}_{k}(x)-R(x)+{\beta}_{k}(x)+\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,2}\right\rangle
+bk,1(x)),H}+P(x),V¯k,2V2πk)\displaystyle\quad+{b}_{k,1}(x)\Bigr{)},H\Bigr{\}}+\left\langle{P}(x),{\overline{V}}_{k,2}-V_{2}^{\pi_{k}}\right\rangle\biggr{)}
=k=1K(x𝒳wk,1(x)min{(R^k(x)R(x)+βk(x)+P^k(x)P(x),V¯k,2\displaystyle=\sum_{k=1}^{K}\biggl{(}\sum_{x\in\mathcal{X}}w_{k,1}(x)\min\Bigl{\{}\Bigl{(}{\hat{R}}_{k}(x)-R(x)+{\beta}_{k}(x)+\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,2}\right\rangle
+bk,1(x)),H}+x𝒳wk,1(x)sP(s|x)(V¯k,2(s)V2πk(s))).\displaystyle\quad+{b}_{k,1}(x)\Bigr{)},H\Bigr{\}}+\sum_{x\in\mathcal{X}}w_{k,1}(x)\sum_{s^{\prime}}{P}(s^{\prime}|x)\left({\overline{V}}_{k,2}(s^{\prime})-V_{2}^{\pi_{k}}(s^{\prime})\right)\biggr{)}. (A.11)

Let x=(s,a)x^{\prime}=(s^{\prime},a^{\prime}). By definition, the visit probability wk,h(x)w_{k,h}(x) has the property that

wk,h+1(x)=x𝒳wk,h(x)P(s|x)(πk(s,h+1)=a),\displaystyle w_{k,h+1}(x^{\prime})=\sum_{x\in\mathcal{X}}w_{k,h}(x){P}(s^{\prime}|x)\mathbb{P}(\pi_{k}(s^{\prime},h+1)=a^{\prime}),

where ()\mathbb{P}(\cdot) denotes an appropriate probability measure. Hence,

x𝒳wk,1(x)sP(s|x)(V¯k,2(s)V2πk(s))\displaystyle\sum_{x\in\mathcal{X}}w_{k,1}(x)\sum_{s^{\prime}}{P}(s^{\prime}|x)\left({\overline{V}}_{k,2}(s^{\prime})-V_{2}^{\pi_{k}}(s^{\prime})\right)
=\displaystyle= x𝒳wk,1(x)x𝒳P(s|x)(πk(s,2)=a)(Qk,2(x)Q2πk(x))\sum_{x\in\mathcal{X}}w_{k,1}(x)\sum_{x^{\prime}\in\mathcal{X}}P(s^{\prime}|x)\mathbb{P}(\pi_{k}(s^{\prime},2)=a^{\prime})\left(Q_{k,2}(x^{\prime})-Q_{2}^{\pi_{k}}(x^{\prime})\right)
=\displaystyle= x𝒳wk,2(x)(Qk,2(x)Q2πk(x)).\displaystyle\sum_{x^{\prime}\in\mathcal{X}}w_{k,2}(x^{\prime})\left(Q_{k,2}(x^{\prime})-Q_{2}^{\pi_{k}}(x^{\prime})\right).

Substituting the above into (A.11) yields

Regret(K)\displaystyle{\textnormal{Regret}}(K) k=1Kx𝒳wk,1(x)(Qk,1(x)Q1πk(x))\displaystyle\leq\sum_{k=1}^{K}\sum_{x\in\mathcal{X}}w_{k,1}(x)(Q_{k,1}(x)-Q_{1}^{\pi_{k}}(x))
k=1K(x𝒳wk,1(x)min{(R^k(x)R(x)+βk(x)+P^k(x)P(x),V¯k,2\displaystyle\leq\sum_{k=1}^{K}\biggl{(}\sum_{x\in\mathcal{X}}w_{k,1}(x)\min\Bigl{\{}\Bigl{(}{\hat{R}}_{k}(x)-R(x)+{\beta}_{k}(x)+\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,2}\right\rangle
+bk,1(x)),H}+x𝒳wk,2(x)(Qk,2(x)Q2πk(x))).\displaystyle\quad+{b}_{k,1}(x)\Bigr{)},H\Bigr{\}}+\sum_{x\in\mathcal{X}}w_{k,2}(x)\left(Q_{k,2}(x)-Q_{2}^{\pi_{k}}(x)\right)\biggr{)}.

Inductively, we have

Regret(K)k=1Kh=1Hx𝒳wk,h(x)min{(R^k(x)R(x)+βk(x)+P^k(x)P(x),V¯k,h+1+bk,h(x)),H}.\displaystyle{\textnormal{Regret}}(K)\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\min\Bigl{\{}\Bigl{(}{\hat{R}}_{k}(x)-R(x)+{\beta}_{k}(x)+\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}\right\rangle+{b}_{k,h}(x)\Bigr{)},H\Bigr{\}}.

Outside the failure event \mathcal{F} (specifically, 3\mathcal{F}_{3} for F-UCBVI),

Regret(K)\displaystyle{\textnormal{Regret}}(K) k=1Kh=1Hx𝒳wk,h(x)min{(P^k(x)P(x),V¯k,h+1+bk,h(x)+2βk(x)),H}\displaystyle\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\min\left\{\left(\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}\right\rangle+{b}_{k,h}(x)+2{\beta}_{k}(x)\right),H\right\}
=k=1Kh=1Hx𝒳wk,h(x)min{(P^k(x)P(x),Vh+1+bk,h(x)\displaystyle=\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\min\Bigl{\{}\Bigl{(}\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle+{b}_{k,h}(x)
+P^k(x)P(x),V¯k,h+1Vh+1+2βk(x)),H},\displaystyle\quad+\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle+2{\beta}_{k}(x)\Bigr{)},H\Bigr{\}},
k=1Kh=1HxLkwk,h(x)(P^k(x)P(x),Vh+1+bk,h(x)+P^k(x)P(x),V¯k,h+1Vh+1)\displaystyle\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\Bigl{(}\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle+{b}_{k,h}(x)+\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle\Bigr{)}
+k=1Kh=1HxΛkwk,h(x)2βk(x)+k=1Kh=1HxLkwk,h(x)H+k=1Kh=1HxΛkwk,h(x)H\displaystyle\quad+\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)2{\beta}_{k}(x)+\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\notin L_{k}}w_{k,h}(x)H+\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\notin\Lambda_{k}}w_{k,h}(x)H
k=1Kh=1HxLkwk,h(x)(P^k(x)P(x),Vh+1+bk,h(x)+P^k(x)P(x),V¯k,h+1Vh+1)\displaystyle\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\Bigl{(}\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle+{b}_{k,h}(x)+\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle\Bigr{)}
+k=1Kh=1HxΛkwk,h(x)2βk(x)+8H2i=1mX[Ii]L+8H2i=1lX[Ji]L.\displaystyle+\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)2{\beta}_{k}(x)+8H^{2}\sum_{i=1}^{m}X[{I_{i}}]L+8H^{2}\sum_{i=1}^{l}X[{J_{i}}]L.

A.5 Bounds on the individual terms in regret

To prove the following bounds on the individual terms in regret (Lemma 18), we heavily use the lemmas derived from the notion of the good sets (Section A.3).

Lemma 19 (Cumulative transition estimation error, Hoeffding-style).

For F-UCBVI, outside the failure event \mathcal{F}, the sum over time of the transition estimation error satisfies that

k=1Kh=1HxLkwk,h(x)P^k(x)P(x),Vh+1i=1mH2X[Ii]TL+4m2HmaxiSimaxiXiL2.\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\bigl{\langle}{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\bigr{\rangle}\leq\sum_{i=1}^{m}H\sqrt{2X[{I_{i}}]TL}+4m^{2}H\max_{i}S_{i}\max_{i}X_{i}L^{2}.
Proof.

Outside the failure event \mathcal{F}, by Lemma 9,

k=1Kh=1HxLkwk,h(x)P^k(x)P(x),Vh+1\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle
\displaystyle\leq k=1Kh=1HxLkwk,h(x)(i=1mHL2Ni,k(x)+i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x)),\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left(\sum_{i=1}^{m}H\sqrt{\frac{L}{2N_{i,k}(x)}}+\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}\right), (A.12)

For the first term in (A.12), by the Cauchy-Schwarz inequality and Lemma 15,

k=1Kh=1HxLkwk,h(x)(i=1mHL2Ni,k(x))\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left(\sum_{i=1}^{m}H\sqrt{\frac{L}{2N_{i,k}(x)}}\right) HL2i=1mk=1Kh=1HxLkwk,h(x)k=1Kh=1HxLkwk,h(x)Ni,k(x)\displaystyle\leq H\sqrt{\frac{L}{2}}\sum_{i=1}^{m}\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)}\cdot\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}\frac{w_{k,h}(x)}{N_{i,k}(x)}}
i=1mH2X[Ii]TL.\displaystyle\leq\sum_{i=1}^{m}H\sqrt{2X[{I_{i}}]TL}.

For the second term in (A.12), by Lemma 16,

k=1Kh=1HxLkwk,h(x)(i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x))\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left(\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}\right) m2HLmaxiSi4X[Ii]X[Ij]L\displaystyle\leq m^{2}HL\max_{i}S_{i}\cdot 4\sqrt{X[{I_{i}}]X[{I_{j}}]}L
4m2HmaxiSimaxiXiL2.\displaystyle\leq 4m^{2}H\max_{i}S_{i}\max_{i}X_{i}L^{2}.

Substituting the above two bounds into (A.12) completes the proof. ∎

Lemma 20 (Cumulative transition optimism, Hoeffding-style).

For F-UCBVI, outside the failure event \mathcal{F}, the sum over time of the transition optimism satisfies that

k=1Kh=1HxLkwk,h(x)bk(x)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x){b}_{k}(x) i=1mH2X[Ii]TL+4m2HmaxiSimaxiXiL2.\displaystyle\leq\sum_{i=1}^{m}H\sqrt{2X[{I_{i}}]TL}+4m^{2}H\max_{i}S_{i}\max_{i}X_{i}L^{2}.
Proof.

The proof is exactly the same as that of Lemma 19 by noting

bk(x)=i=1mHL2Ni,k(x)+i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x).\displaystyle{b}_{k}(x)=\sum_{i=1}^{m}H\sqrt{\frac{L}{2N_{i,k}(x)}}+\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}.

Lemma 21 (Cumulative reward estimation error and optimism, Hoeffding-style).

For F-UCBVI, outside the failure event \mathcal{F}, the sum over time of the reward estimation error and optimism satisfies that

k=1Kh=1HxΛkwk,h(x)2βk(x)i=1l22X[Ji]TL.\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)2{\beta}_{k}(x)\leq\sum_{i=1}^{l}2\sqrt{2X[{J_{i}}]TL}.
Proof.

Outside the failure event \mathcal{F}, by the Cauchy-Schwarz inequality,

k=1Kh=1HxΛkwk,h(x)βk(x)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x){\beta}_{k}(x) =k=1Kh=1HxΛkwk,h(x)i=1lL2Mi,k(x)\displaystyle=\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)\sum_{i=1}^{l}\sqrt{\frac{L}{2M_{i,k}(x)}}
L2i=1lk=1Kh=1HxΛkwk,h(x)Mi,k(x)k=1Kh=1HxΛkwk,h(x)\displaystyle\leq\sqrt{\frac{L}{2}}\sum_{i=1}^{l}\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}\frac{w_{k,h}(x)}{M_{i,k}(x)}}\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)}
i=1l2X[Ji]TL,\displaystyle\leq\sum_{i=1}^{l}\sqrt{2X[{J_{i}}]TL},

where the last inequality is due to Lemma 15. ∎

Lemma 22 (Cumulative correction term, Hoeffding-style).

For F-UCBVI, outside the failure event \mathcal{F}, the sum over time of the correction term satisfies that

k=1Kh=1HxLkwk,h(x)P^k(x)P(x),V¯k,h+1Vh+1\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle
\displaystyle\leq 45m3H2(maxiSi)1.5maxiX[Ii]L2.5+14mlH1.5(maxiSi)0.5(maxiX[Ii])0.5(maxiX[Ji])0.5L2.\displaystyle 45m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}+14mlH^{1.5}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.5}(\max_{i}X[{J_{i}}])^{0.5}L^{2}.
Proof.

Since V¯k,h+1{\overline{V}}_{k,h+1} is a random vector, we cannot apply scalar concentration as in bounding the transition estimation error (Lemma 9). However, some techniques there are useful here, including the inverse telescoping technique and Hölder’s argument (Lemma 10).

Omitting the dependence of P^k(x),P(x),P^i,k(x),Pi(x){\hat{P}}_{k}(x),{P}(x),{\hat{P}}_{i,k}(x),{P}_{i}(x) on xx, for any fixed ii and s[i]𝒮is^{\prime}[i]\in\mathcal{S}_{i}, by the inverse telescoping technique,

P^kP,V¯k,h+1Vh+1\displaystyle\left\langle{\hat{P}}_{k}-{P},{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle =i=1mP^i,ki=1mPi,V¯k,h+1Vh+1\displaystyle=\left\langle\prod_{i=1}^{m}{\hat{P}}_{i,k}-\prod_{i=1}^{m}{P}_{i},{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle
=i=1m(P^i,kPi)P1:i1P^i+1:m,k,V¯k,h+1Vh+1\displaystyle=\left\langle\sum_{i=1}^{m}({\hat{P}}_{i,k}-{P}_{i}){P}_{1:i-1}{\hat{P}}_{i+1:m,k},{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle
=i=1mP^i,kPi,𝔼P1:i1𝔼P^i+1:m,k[V¯k,h+1Vh+1]\displaystyle=\sum_{i=1}^{m}\left\langle{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}}\mathbb{E}_{{\hat{P}}_{i+1:m,k}}[{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}]\right\rangle
=i=1mP^i,kPi,𝔼P1:i1𝔼Pi+1:m[V¯k,h+1Vh+1]\displaystyle=\sum_{i=1}^{m}\left\langle{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}}\mathbb{E}_{{P}_{i+1:m}}[{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}]\right\rangle (A.13)
+i=1mP^i,kPi,𝔼P1:i1(𝔼P^i+1:m,k𝔼Pi+1:m)[V¯k,h+1Vh+1].\displaystyle\quad+\sum_{i=1}^{m}\left\langle{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}}-\mathbb{E}_{{P}_{i+1:m}})[{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}]\right\rangle. (A.14)

For (A.13), outside the failure event \mathcal{F} (specifically, 6\mathcal{F}_{6}),

P^i,kPi,𝔼P1:i1𝔼Pi+1:m[V¯k,h+1Vh+1]\displaystyle\left\langle{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}}\mathbb{E}_{{P}_{i+1:m}}[{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}]\right\rangle
\displaystyle\leq s[i]Si(2L3Ni,k(x)+2Pi(s[i]|x)LNi,k(x))𝔼P1:i1𝔼Pi+1:m[V¯k,h+1Vh+1]\displaystyle\sum_{s^{\prime}[i]\in S_{i}}\left(\frac{2L}{3N_{i,k}(x)}+\sqrt{\frac{2{P}_{i}(s^{\prime}[i]|x)L}{N_{i,k}(x)}}\right)\mathbb{E}_{{P}_{1:i-1}}\mathbb{E}_{{P}_{i+1:m}}[{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}]
\displaystyle\leq 2HSiL3Ni,k(x)+s[i]Si2Pi(s[i]|x)LNi,k(x)𝔼Pi[V¯k,h+1Vh+1],\displaystyle\frac{2HS_{i}L}{3N_{i,k}(x)}+\sum_{s^{\prime}[i]\in S_{i}}\sqrt{\frac{2{P}_{i}(s^{\prime}[i]|x)L}{N_{i,k}(x)}}\mathbb{E}_{{P}_{-i}}[{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}],

the sums over time of which are bounded, respectively, by

k=1Kh=1HxLkwk,h(x)2HSiL3Ni,k(x)83HSiX[Ii]L23HSiX[Ii]L2,\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\frac{2HS_{i}L}{3N_{i,k}(x)}\leq\frac{8}{3}HS_{i}X[{I_{i}}]L^{2}\leq 3HS_{i}X[{I_{i}}]L^{2},

and

k=1Kh=1HxLkwk,h(x)s[i]Si2Pi(s[i]|x)LNi,k(x)𝔼Pi[V¯k,h+1Vh+1]\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\sum_{s^{\prime}[i]\in S_{i}}\sqrt{\frac{2{P}_{i}(s^{\prime}[i]|x)L}{N_{i,k}(x)}}\mathbb{E}_{{P}_{-i}}[{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}]
\displaystyle\leq 2SiLk=1Kh=1HxLkwk,h(x)𝔼Pi(𝔼Pi[V¯k,h+1Vh+1])2Ni,k(x)\displaystyle\sqrt{2S_{i}L}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\sqrt{\frac{\mathbb{E}_{{P}_{i}}(\mathbb{E}_{{P}_{-i}}[{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}])^{2}}{N_{i,k}(x)}}
\displaystyle\leq 2SiLk=1Kh=1HxLkwk,h(x)Ni,k(x)k=1Kh=1HxLkwk,h(x)𝔼Pi(𝔼Pi[V¯k,h+1Vh+1])2\displaystyle\sqrt{2S_{i}L}\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}\frac{w_{k,h}(x)}{N_{i,k}(x)}}\cdot\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\mathbb{E}_{{P}_{i}}(\mathbb{E}_{{P}_{-i}}[{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}])^{2}}
\displaystyle\leq 2SiL2X[Ii]LG0\displaystyle\sqrt{2S_{i}L}\cdot 2\sqrt{X[{I_{i}}]L}\cdot\sqrt{G_{0}}
\displaystyle\leq 41m2H2(maxiSi)1.5maxiX[Ii]L2.5+14lH1.5(maxiSi)0.5(maxiX[Ii])0.5(maxiX[Ji])0.5L2,\displaystyle 41m^{2}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}+14lH^{1.5}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.5}(\max_{i}X[{J_{i}}])^{0.5}L^{2},

where the first and second inequalities are due to the Cauchy-Schwarz inequality, and the third inequality is due to Lemma 15 and Lemma 17. With the same Hölder’s argument as in Lemma 10, (A.14) is upper bounded by

i=1mP^i,kPi,𝔼P1:i1(𝔼P^i+1:m,k𝔼Pi+1:m)[V¯k,h+1Vh+1]i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x),\displaystyle\sum_{i=1}^{m}\left\langle{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}}-\mathbb{E}_{{P}_{i+1:m}})[{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}]\right\rangle\leq\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}},

the sum over time of which is upper bounded by 4m2HmaxiSimaxiXiL24m^{2}H\max_{i}S_{i}\max_{i}X_{i}L^{2} due to Lemma 16. ∎

A.6 Regret bounds (proof of Theorem 1)

Proof.

Outside the failure event \mathcal{F}, combining Lemmas 18, 19, 20, 21, and 22, we obtain

Regret(K)\displaystyle{\textnormal{Regret}}(K) 2(i=1mH2X[Ii]TL+4m2HmaxiSimaxiXiL2)\displaystyle\leq 2\left(\sum_{i=1}^{m}H\sqrt{2X[{I_{i}}]TL}+4m^{2}H\max_{i}S_{i}\max_{i}X_{i}L^{2}\right)
+45m3H2(maxiSi)1.5maxiX[Ii]L2.5\displaystyle\quad+45m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}
+14mlH1.5(maxiSi)0.5(maxiX[Ii])0.5(maxiX[Ji])0.5L2+4i=1lX[Ji]TL\displaystyle\quad+14mlH^{1.5}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.5}(\max_{i}X[{J_{i}}])^{0.5}L^{2}+4\sum_{i=1}^{l}\sqrt{X[{J_{i}}]TL}
+8i=1mX[Ii]H2L+8i=1lX[Ji]H2L\displaystyle\quad+8\sum_{i=1}^{m}X[{I_{i}}]H^{2}L+8\sum_{i=1}^{l}X[{J_{i}}]H^{2}L
3i=1mHX[Ii]TL+4i=1lX[Ji]TL+53m3H2(maxiSi)1.5maxiX[Ii]L2.5\displaystyle\leq 3\sum_{i=1}^{m}H\sqrt{X[{I_{i}}]TL}+4\sum_{i=1}^{l}\sqrt{X[{J_{i}}]TL}+53m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}
+22mlH2(maxiSi)0.5(maxiX[Ii])0.5maxiX[Ji]L2\displaystyle\quad+22mlH^{2}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.5}\max_{i}X[{J_{i}}]L^{2}
=𝒪~(i=1mH2X[Ii]T+i=1lX[Ji]T),\displaystyle=\tilde{\mathcal{O}}\left(\sum_{i=1}^{m}\sqrt{H^{2}X[{I_{i}}]T}+\sum_{i=1}^{l}\sqrt{X[{J_{i}}]T}\right),

where in the last equality we assume that Tpoly(m,l,maxiSi,maxiX[Ii],H)T\geq\mathrm{poly}(m,l,\max_{i}S_{i},\max_{i}X[{I_{i}}],H).

To accommodate the case of known rewards, it suffices to remove the parts related to reward estimation and reward bonuses in both the algorithm and the analysis, which yields the regret bound 𝒪~(i=1mH2X[Ii]T)\tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{H^{2}X[{I_{i}}]T}). ∎

Appendix B Regret analysis of F-EULER

The basic structure of the regret analysis in this section is the same as that of F-UCBVI (Section A), e.g., the notion of the good sets, including the definitions and lemmas (specifically, Definition 5 and Lemmas 131415 and 16), carries through here. The key difference is a more refined analysis using Bernstein-style concentrations (Lemmas 38 and 39).

B.1 Failure event

Recall that L=log(16mlSXT/δ)L=\log(16mlSXT/\delta). We define the failure event for F-EULER as follows.

Definition 6 (Failure events).

Define the events

1\displaystyle\mathcal{B}_{1} :={(i[m],k[K],h[H],x𝒳),\displaystyle:=\bigg{\{}\exists(i\in[m],k\in[K],h\in[H],x\in\mathcal{X}),
|P^i,k(x)Pi(x),𝔼Pi(x)[Vh+1]|>2VarPi(x)𝔼Pi(x)[Vh+1]LNi,k(x)+2HL3Ni,k(x)},\displaystyle\qquad\qquad\left|\left\langle{\hat{P}}_{i,k}(x)-{P}_{i}(x),\mathbb{E}_{{P}_{-i}(x)}[{V_{h+1}^{*}}]\right\rangle\right|>\sqrt{\frac{2\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h+1}^{*}}]L}{N_{i,k}(x)}}+\frac{2HL}{3N_{i,k}(x)}\bigg{\}},
2\displaystyle\mathcal{B}_{2} :=2,\displaystyle:=\mathcal{F}_{2},
3\displaystyle\mathcal{B}_{3} :={(i[l],k[K],x𝒳),|R^i,k(x)Ri(x)|>2𝕊[r^i(x)]LMi,k(x)+14L3Mi,k(x)},\displaystyle:=\left\{\exists(i\in[l],k\in[K],x\in\mathcal{X}),\quad\left|{\hat{R}}_{i,k}(x)-R_{i}(x)\right|>\sqrt{\frac{2\mathbb{S}[{\hat{r}}_{i}(x)]L}{M_{i,k}(x)}}+\frac{14L}{3M_{i,k}(x)}\right\},
4\displaystyle\mathcal{B}_{4} :={(i[m],k[K],h[H],x𝒳,s[i]Si),\displaystyle:=\Biggl{\{}\exists(i\in[m],k\in[K],h\in[H],x\in\mathcal{X},s[-i]\in S_{-i}),
|P^i,k(x)Pi(x),Vh+1(s[i])|>HL2Ni,k(x)},\displaystyle\qquad\qquad\left|\left\langle{\hat{P}}_{i,k}(x)-{P}_{i}(x),{V_{h+1}^{*}}(s[-i])\right\rangle\right|>H\sqrt{\frac{L}{2N_{i,k}(x)}}\Biggr{\}},
5\displaystyle\mathcal{B}_{5} :={(i[m],k[K],h[H],x𝒳),\displaystyle:=\biggl{\{}\exists(i\in[m],k\in[K],h\in[H],x\in\mathcal{X}),
|VarP^i,k(x)𝔼Pi(x)[Vh]VarPi(x)𝔼Pi(x)[Vh]|>3HLNi,k(x)},\displaystyle\qquad\qquad\left|\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}-\sqrt{\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}\right|>3H\sqrt{\frac{L}{N_{i,k}(x)}}\biggr{\}},
6\displaystyle\mathcal{B}_{6} :={(i[l],k[K],h[H],x𝒳),|𝕊[r^i(x)]Var(ri(x))|>4LMi,k(x)},\displaystyle:=\left\{\exists(i\in[l],k\in[K],h\in[H],x\in\mathcal{X}),\quad\left|\sqrt{\mathbb{S}[{\hat{r}}_{i}(x)]}-\sqrt{\mathrm{Var}(r_{i}(x))}\right|>\sqrt{\frac{4L}{M_{i,k}(x)}}\right\},
7\displaystyle\mathcal{B}_{7} :=4,8:=5,9:=6,\displaystyle:=\mathcal{F}_{4},\quad\mathcal{B}_{8}:=\mathcal{F}_{5},\quad\mathcal{B}_{9}:=\mathcal{F}_{6},

where in 5\mathcal{B}_{5} we assume Ni,k2N_{i,k}\geq 2 (true for xx in the good set LkL_{k}), and in 3\mathcal{B}_{3} and 6\mathcal{B}_{6} we assume Mi,k2M_{i,k}\geq 2 (true for xx in the good set Λk\Lambda_{k}). Then the failure event for F-EULER is defined by :=i=19i\mathcal{B}:=\bigcup_{i=1}^{9}\mathcal{B}_{i}.
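The empirical reward variances 𝕊[r^i(x)]\mathbb{S}[{\hat{r}}_{i}(x)] appearing in 3\mathcal{B}_{3} and 6\mathcal{B}_{6} can be maintained online as rewards are observed, e.g., with Welford's update [Welford, 1962]. A minimal sketch (the class name and interface are ours):

    class RunningVariance:
        # Welford's online update for the mean and (unbiased) sample variance.
        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0          # running sum of squared deviations from the mean

        def update(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        def sample_variance(self):
            # defined for n >= 2, matching the assumption M_{i,k} >= 2 above
            return self.m2 / (self.n - 1) if self.n >= 2 else 0.0

    # toy usage: one (i, x[J_i]) pair accumulating observed rewards
    rv = RunningVariance()
    for r in [0.2, 0.5, 0.1, 0.4]:
        rv.update(r)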

The following lemma shows that the failure event \mathcal{B} happens with low probability.

Lemma 23 (Failure probability).

For any FMDP specified by (2.1), during the running of F-EULER for KK episodes, the failure event \mathcal{B} happens with probability at most δ\delta.

Proof.

By Bernstein’s inequality (Lemma 38) and the union bound, 1\mathcal{B}_{1} happens with probability at most δ/8\delta/8. By the empirical Bernstein inequality (Lemma 39) [26] and the union bound, 3\mathcal{B}_{3} happens with probability at most δ/8\delta/8, where 1/(Mi,k(x)1)1/(M_{i,k}(x)-1) is replaced by 2/Mi,k(x)2/M_{i,k}(x) for Mi,k(x)2M_{i,k}(x)\geq 2. By Hoeffding’s inequality (Lemma 37) and the union bound, 4\mathcal{B}_{4} happens with probability at most δ/8\delta/8. By Theorem 10 in [26] and the union bound, 6\mathcal{B}_{6} happens with probability at most δ/8\delta/8, where 1/(Mi,k(x)1)1/(M_{i,k}(x)-1) is replaced by 2/Mi,k(x)2/M_{i,k}(x) for Mi,k(x)2M_{i,k}(x)\geq 2. By the proof of Lemma 8, 2\mathcal{B}_{2} and 9\mathcal{B}_{9} each happen with probability at most δ/8\delta/8; 7\mathcal{B}_{7} and 8\mathcal{B}_{8} each happen with probability at most δ/16\delta/16. The argument for 5\mathcal{B}_{5} is more involved and is given below.

For k[K],h[H],x𝒳k\in[K],h\in[H],x\in\mathcal{X}, let 𝕊P^i,k(x)𝔼Pi(x)[Vh]\mathbb{S}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}] denote the sample variance of 𝔼Pi(x)[Vh]\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}], whose relationship with the empirical variance is given by

𝕊P^i,k(x)𝔼Pi(x)[Vh]=Ni,k(x)Ni,k(x)1VarP^i,k(x)𝔼Pi(x)[Vh].\displaystyle\mathbb{S}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]=\frac{N_{i,k}(x)}{N_{i,k}(x)-1}\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}].

Hence, for Ni,k(x)2N_{i,k}(x)\geq 2,

𝕊P^i,k(x)𝔼Pi(x)[Vh]VarP^i,k(x)𝔼Pi(x)[Vh]=1Ni,k(x)1VarP^i,k(x)𝔼Pi(x)[Vh]2H2Ni,k(x).\displaystyle\mathbb{S}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]-\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]=\frac{1}{N_{i,k}(x)-1}\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]\leq\frac{2H^{2}}{N_{i,k}(x)}.

By Theorem 10 in [26] and the union bound, with probability at least 1δ/81-\delta/8, for all i[m],k[K],h[H],x𝒳i\in[m],k\in[K],h\in[H],x\in\mathcal{X} and Ni,k(x)2N_{i,k}(x)\geq 2,

|𝕊P^i,k(x)𝔼Pi(x)[Vh]VarPi(x)𝔼Pi(x)[Vh]|H2LNi,k(x)1H4LNi,k(x),\displaystyle\left|\sqrt{\mathbb{S}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}-\sqrt{\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}\right|\leq H\sqrt{\frac{2L}{N_{i,k}(x)-1}}\leq H\sqrt{\frac{4L}{N_{i,k}(x)}},

which yields that

|VarP^i,k(x)𝔼Pi(x)[Vh]VarPi(x)𝔼Pi(x)[Vh]|\displaystyle\left|\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}-\sqrt{\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}\right|
\displaystyle\leq |VarP^i,k(x)𝔼Pi(x)[Vh]𝕊P^i,k(x)𝔼Pi(x)[Vh]+𝕊P^i,k(x)𝔼Pi(x)[Vh]VarPi(x)𝔼Pi(x)[Vh]|\displaystyle\Bigl{|}\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}-\sqrt{\mathbb{S}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}+\sqrt{\mathbb{S}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}-\sqrt{\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}\Bigr{|}
\displaystyle\leq 2H2Ni,k(x)+H4LNi,k(x)3HLNi,k(x),\displaystyle\sqrt{\frac{2H^{2}}{N_{i,k}(x)}}+H\sqrt{\frac{4L}{N_{i,k}(x)}}\leq 3H\sqrt{\frac{L}{N_{i,k}(x)}},

where in the second inequality we use |ab||ab||\sqrt{a}-\sqrt{b}|\leq\sqrt{|a-b|} for all a,b0a,b\geq 0 and in the third inequality we use L=log(16mlSXT/δ)2L=\log(16mlSXT/\delta)\geq 2. Therefore, 5\mathcal{B}_{5} happens with probability at most δ/8\delta/8.

Finally, applying the union bound on i\mathcal{B}_{i} for i[9]i\in[9] yields that the failure event \mathcal{B} happens with probability at most δ\delta. ∎

The derivations in the rest of this section, and hence the regret bound, hold outside the failure event \mathcal{B}, i.e., with probability at least 1δ1-\delta. From the above derivation, note that we could actually use the smaller

L1=log(16mlmax{SmaxiX[Ii],maxiX[Ji]}T/δ)\displaystyle L_{1}=\log(16ml\max\{S\max_{i}X[{I_{i}}],\max_{i}X[{J_{i}}]\}T/\delta)

to replace LL for F-EULER. We use LL for simplicity.

B.2 Upper confidence bound

The goal of this section is to show that V¯k,h{\overline{V}}_{k,h} is an entrywise UCB of Vh{V_{h}^{*}} for all k[K],h[H]k\in[K],h\in[H]. Analogously to the analysis of F-UCBVI, this optimism is achieved by setting the transition (reward, respectively) bonus as the UCB of the transition (reward, respectively) estimation error. While the reward estimation error is straightforward to bound (via the failure event 3\mathcal{B}_{3}), bounding the transition estimation error requires some properties of the relevant functions.

Recall that in (4.1), for i[m],PiΔ(𝒮i),P=i=1mPiΔ(𝒮)i\in[m],{P}_{i}\in\Delta(\mathcal{S}_{i}),{P}=\prod_{i=1}^{m}{P}_{i}\in\Delta(\mathcal{S}) and V𝒮V\in\mathbb{R}^{\mathcal{S}}, we define gi(P,V):=2LVarPi𝔼Pi[V]g_{i}({P},V):=2\sqrt{L}\sqrt{\mathrm{Var}_{{P}_{i}}\mathbb{E}_{{P}_{-i}}[V]}. Now for i[m],k[K],PΔ(𝒮)i\in[m],k\in[K],{P}\in\Delta(\mathcal{S}), V𝒮V\in\mathbb{R}^{\mathcal{S}} and x𝒳x\in\mathcal{X}, we define

ϕi,k(P,V,x):=gi(P,V)Ni,k(x)+2HL3Ni,k(x).\displaystyle\phi_{i,k}({P},V,x):=\frac{g_{i}({P},V)}{\sqrt{N_{i,k}(x)}}+\frac{2HL}{3N_{i,k}(x)}.

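For concreteness, the following minimal sketch computes g_{i} and \phi_{i,k} for a toy two-component factorization (m=2); the component sizes, horizon, count N_{i,k}(x), and value tensor are illustrative assumptions rather than quantities from the algorithm.

import numpy as np

rng = np.random.default_rng(1)
S1, S2, H, L = 4, 3, 5, 2.0             # illustrative component sizes, horizon, and log term
P1 = rng.dirichlet(np.ones(S1))         # P_1(x) at a fixed state-action pair x
P2 = rng.dirichlet(np.ones(S2))         # P_2(x)
V = rng.uniform(0, H, size=(S1, S2))    # a value function on S = S_1 x S_2

def g(i, P1, P2, V, L):
    # g_i(P, V) = 2 sqrt(L) sqrt( Var_{P_i} E_{P_{-i}}[V] )
    cond_mean = V @ P2 if i == 1 else P1 @ V        # E_{P_{-i}}[V] as a vector over S_i
    Pi = P1 if i == 1 else P2
    var = max(float(Pi @ cond_mean**2 - (Pi @ cond_mean)**2), 0.0)
    return 2.0 * np.sqrt(L) * np.sqrt(var)

def phi(i, P1, P2, V, L, N_i, H):
    # phi_{i,k}(P, V, x) = g_i(P, V) / sqrt(N_{i,k}(x)) + 2 H L / (3 N_{i,k}(x))
    return g(i, P1, P2, V, L) / np.sqrt(N_i) + 2.0 * H * L / (3.0 * N_i)

print(phi(1, P1, P2, V, L, N_i=25, H=H))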
The following lemma establishes some “Lipschitzness” properties of gig_{i}, which are then used to prove a similar property of ϕi,k\phi_{i,k}.

Lemma 24 (Properties of gig_{i}).

Outside the failure event \mathcal{B}, for any index i[m]i\in[m], episode k[K]k\in[K], step h[H]h\in[H], state-action pair x𝒳x\in\mathcal{X} and vectors V1,V2SV_{1},V_{2}\in\mathbb{R}^{S},

|gi(P(x),V1)gi(P(x),V2)|\displaystyle\left|g_{i}({P}(x),V_{1})-g_{i}({P}(x),V_{2})\right| 2LV1V22,P(x),\displaystyle\leq\sqrt{2L}\|V_{1}-V_{2}\|_{2,{P}(x)}, (B.1)
|gi(P^k(x),Vh)gi(P(x),Vh)|\displaystyle\left|g_{i}({\hat{P}}_{k}(x),{V_{h}^{*}})-g_{i}({P}(x),{V_{h}^{*}})\right| 32HLj=1m1Nj,k(x).\displaystyle\leq 3\sqrt{2}HL\sum_{j=1}^{m}\frac{1}{\sqrt{N_{j,k}(x)}}. (B.2)
Proof.

For any finite set 𝒮\mathcal{S}, PΔ(𝒮)P\in\Delta(\mathcal{S}) and vectors V1,V2𝒮V_{1},V_{2}\in\mathbb{R}^{\mathcal{S}}, we have that VarP[V1]VarP[V2]+VarP[V1V2]\sqrt{\mathrm{Var}_{P}[V_{1}]}\leq\sqrt{\mathrm{Var}_{P}[V_{2}]}+\sqrt{\mathrm{Var}_{P}[V_{1}-V_{2}]} (see [42, Section D.3] for a proof). Then for the first property (B.1),

|gi(P(x),V1)gi(P(x),V2)|\displaystyle\left|g_{i}({P}(x),V_{1})-g_{i}({P}(x),V_{2})\right| 2LVarPi(x)𝔼P1:i1(x)𝔼Pi+1:m(x)[V1V2]\displaystyle\leq\sqrt{2L}\sqrt{\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{1:i-1}(x)}\mathbb{E}_{{P}_{i+1:m}(x)}[V_{1}-V_{2}]}
2L𝔼Pi(x)(𝔼P1:i1(x)𝔼Pi+1:m(x)[V1V2])2\displaystyle\leq\sqrt{2L}\sqrt{\mathbb{E}_{{P}_{i}(x)}\left(\mathbb{E}_{{P}_{1:i-1}(x)}\mathbb{E}_{{P}_{i+1:m}(x)}[V_{1}-V_{2}]\right)^{2}}
2L𝔼P[(V1V2)2]:=2LV1V22,P(x).\displaystyle\leq\sqrt{2L}\sqrt{\mathbb{E}_{P}[(V_{1}-V_{2})^{2}]}:=\sqrt{2L}\|V_{1}-V_{2}\|_{2,{P}(x)}.

For the second property (B.2),

|gi(P^k(x),Vh)gi(P(x),Vh)|\displaystyle\left|g_{i}({\hat{P}}_{k}(x),{V_{h}^{*}})-g_{i}({P}(x),{V_{h}^{*}})\right| =2L|VarP^i,k(x)𝔼P^i,k(x)[Vh]VarPi(x)𝔼Pi(x)[Vh]|\displaystyle=\sqrt{2L}\left|\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{\hat{P}}_{-i,k}(x)}[{V_{h}^{*}}]}-\sqrt{\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}\right|
=2L|VarP^i,k(x)𝔼P^i,k(x)[Vh]VarP^i,k(x)𝔼Pi(x)[Vh]\displaystyle=\sqrt{2L}\biggl{|}\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{\hat{P}}_{-i,k}(x)}[{V_{h}^{*}}]}-\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}
+VarP^i,k(x)𝔼Pi(x)[Vh]VarPi(x)𝔼Pi(x)[Vh]|\displaystyle\quad+\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}-\sqrt{\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}\biggr{|}
2L|VarP^i,k(x)𝔼P^i,k(x)[Vh]VarP^i,k(x)𝔼Pi(x)[Vh]|\displaystyle\leq\sqrt{2L}\left|\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{\hat{P}}_{-i,k}(x)}[{V_{h}^{*}}]}-\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}\right| (B.3)
+2L|VarP^i,k(x)𝔼Pi(x)[Vh]VarPi(x)𝔼Pi(x)[Vh]|.\displaystyle\quad+\sqrt{2L}\left|\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}-\sqrt{\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}\right|. (B.4)

To bound (B.3), omitting the dependence of P^i,k(x),Pi(x){\hat{P}}_{i,k}(x),{P}_{i}(x) on xx, outside the failure event \mathcal{B} (specifically, 4\mathcal{B}_{4}), by an inverse telescoping argument,

|VarP^i,k𝔼P^i,k[Vh]VarP^i,k𝔼Pi[Vh]|\displaystyle\left|\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}}\mathbb{E}_{{\hat{P}}_{-i,k}}[{V_{h}^{*}}]}-\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}}\mathbb{E}_{{P}_{-i}}[{V_{h}^{*}}]}\right|
\displaystyle\leq j=1,jim|VarP^i,k𝔼P1:j1\i,k𝔼P^j:m\i,k[Vh]VarP^i,k𝔼P1:j\i,k𝔼P^j+1:m\i,k[Vh]|\displaystyle\sum_{j=1,j\neq i}^{m}\left|\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}}\mathbb{E}_{{P}_{1:j-1\backslash i,k}}\mathbb{E}_{{\hat{P}}_{j:m\backslash i,k}}[{V_{h}^{*}}]}-\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}}\mathbb{E}_{{P}_{1:j\backslash i,k}}\mathbb{E}_{{\hat{P}}_{j+1:m\backslash i,k}}[{V_{h}^{*}}]}\right|
\displaystyle\leq j=1,jimVarP^i,k𝔼P1:j1\i𝔼P^j+1:m\i,k(𝔼P^j,k𝔼Pj)[Vh]\displaystyle\sum_{j=1,j\neq i}^{m}\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}}\mathbb{E}_{{P}_{1:j-1\backslash i}}\mathbb{E}_{{\hat{P}}_{j+1:m\backslash i,k}}(\mathbb{E}_{{\hat{P}}_{j,k}}-\mathbb{E}_{{P}_{j}})[{V_{h}^{*}}]}
\displaystyle\leq j=1,jim𝔼P^i,k(𝔼P1:j1\i𝔼P^j+1:m\i,k(𝔼P^j,k𝔼Pj)[Vh])2j=1,jimHL2Nj,k(x),\displaystyle\sum_{j=1,j\neq i}^{m}\sqrt{\mathbb{E}_{{\hat{P}}_{i,k}}\left(\mathbb{E}_{{P}_{1:j-1\backslash i}}\mathbb{E}_{{\hat{P}}_{j+1:m\backslash i,k}}(\mathbb{E}_{{\hat{P}}_{j,k}}-\mathbb{E}_{{P}_{j}})[{V_{h}^{*}}]\right)^{2}}\leq\sum_{j=1,j\neq i}^{m}H\sqrt{\frac{L}{2N_{j,k}(x)}},

where “\i\cdot\backslash i” denotes excluding ii. For (B.4), outside the failure event \mathcal{B} (specifically, 5\mathcal{B}_{5}),

|VarP^i,k(x)𝔼Pi(x)[Vh]VarPi(x)𝔼Pi(x)[Vh]|3HLNi,k(x),\displaystyle\left|\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}-\sqrt{\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{-i}(x)}[{V_{h}^{*}}]}\right|\leq 3H\sqrt{\frac{L}{N_{i,k}(x)}},

Combining the above bounds on (B.3) and (B.4) yields

|gi(P^k(x),Vh)gi(P(x),Vh)|32HLj=1m1Nj,k(x).\displaystyle\left|g_{i}({\hat{P}}_{k}(x),{V_{h}^{*}})-g_{i}({P}(x),{V_{h}^{*}})\right|\leq 3\sqrt{2}HL\sum_{j=1}^{m}\frac{1}{\sqrt{N_{j,k}(x)}}.

Then ϕi,k\phi_{i,k} satisfies the following “Lipschitzness” property.

Lemma 25 (Property of ϕi,k\phi_{i,k}).

Outside the failure event \mathcal{B}, for any index i[m]i\in[m], episode k[K]k\in[K], step h[H]h\in[H], state-action pair x𝒳x\in\mathcal{X} and vector VSV\in\mathbb{R}^{S},

|ϕi,k(P^k(x),V,x)ϕi,k(P(x),Vh+1,x)|2LVVh+12,P^k(x)Ni,k(x)+32HLNi,k(x)j=1m1Nj,k(x).\displaystyle\left|\phi_{i,k}({\hat{P}}_{k}(x),V,x)-\phi_{i,k}({P}(x),{V_{h+1}^{*}},x)\right|\leq\frac{\sqrt{2L}\left\|V-{V_{h+1}^{*}}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}+\frac{3\sqrt{2}HL}{\sqrt{N_{i,k}(x)}}\sum_{j=1}^{m}\frac{1}{\sqrt{N_{j,k}(x)}}.
Proof.

By Lemma 24, outside the failure event \mathcal{B},

|ϕi,k(P^k(x),V,x)ϕi,k(P(x),Vh+1,x)|\displaystyle\left|\phi_{i,k}({\hat{P}}_{k}(x),V,x)-\phi_{i,k}({P}(x),{V_{h+1}^{*}},x)\right|
\displaystyle\leq 1Ni,k(x)(|gi(P^k(x),V)gi(P^k(x),Vh+1)|+|gi(P^k(x),Vh+1)gi(P(x),Vh+1)|)\displaystyle\frac{1}{\sqrt{N_{i,k}(x)}}\left(\left|g_{i}({\hat{P}}_{k}(x),V)-g_{i}({\hat{P}}_{k}(x),{V_{h+1}^{*}})\right|+\left|g_{i}({\hat{P}}_{k}(x),{V_{h+1}^{*}})-g_{i}({P}(x),{V_{h+1}^{*}})\right|\right)
\displaystyle\leq 2LVVh+12,P^k(x)Ni,k(x)+32HLNi,k(x)j=1m1Nj,k(x).\displaystyle\frac{\sqrt{2L}\left\|V-{V_{h+1}^{*}}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}+\frac{3\sqrt{2}HL}{\sqrt{N_{i,k}(x)}}\sum_{j=1}^{m}\frac{1}{\sqrt{N_{j,k}(x)}}.

Now we are ready to present the bounds on transition estimation error in the following lemma.

Lemma 26 (Transition estimation error, Bernstein-style).

Outside the failure event \mathcal{B}, for any episode k[K]k\in[K], step h[H]h\in[H] and state-action pair x𝒳x\in\mathcal{X},

|P^k(x)P(x),Vh+1|i=1mϕi,k(P(x),Vh+1,x)+i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x).\displaystyle\left|\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle\right|\leq\sum_{i=1}^{m}\phi_{i,k}({P}(x),{V_{h+1}^{*}},x)+\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}.

And for a given k[K]k\in[K] and a given h[H]h\in[H], if V¯k,h+1Vh+1V¯k,h+1{\underline{V}}_{k,h+1}\leq{V_{h+1}^{*}}\leq{\overline{V}}_{k,h+1} entrywise, then the above inequality yields

|P^k(x)P(x),Vh+1|\displaystyle\left|\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle\right| i=1mϕi,k(P^k(x),V¯k,h+1,x)+i=1m2LV¯k,h+1V¯k,h+12,P^k(x)Ni,k(x)\displaystyle\leq\sum_{i=1}^{m}\phi_{i,k}({\hat{P}}_{k}(x),{\overline{V}}_{k,h+1},x)+\sum_{i=1}^{m}\frac{\sqrt{2L}\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}
+i=1m32HLNi,k(x)j=1m1Nj,k(x)+i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x).\displaystyle+\sum_{i=1}^{m}\frac{3\sqrt{2}HL}{\sqrt{N_{i,k}(x)}}\sum_{j=1}^{m}\frac{1}{\sqrt{N_{j,k}(x)}}+\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}.
Proof.

Recall that in Lemma 9, for all k[K],h[H],x𝒳k\in[K],h\in[H],x\in\mathcal{X}, omitting the dependence of P^k(x),P(x),P^i,k(x),Pi(x){\hat{P}}_{k}(x),{P}(x),\allowbreak{\hat{P}}_{i,k}(x),{P}_{i}(x) on xx, we decompose the transition estimation error as

P^kP,Vh+1\displaystyle\left\langle{\hat{P}}_{k}-{P},{V_{h+1}^{*}}\right\rangle =i=1mP^i,kPi,𝔼P1:i1𝔼Pi+1:m[Vh+1]\displaystyle=\sum_{i=1}^{m}\left\langle{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}}\mathbb{E}_{{P}_{i+1:m}}[{V_{h+1}^{*}}]\right\rangle (B.5)
+i=1mP^i,kPi,𝔼P1:i1(𝔼P^i+1:m,k𝔼Pi+1:m)[Vh+1].\displaystyle\quad+\sum_{i=1}^{m}\left\langle{\hat{P}}_{i,k}-{P}_{i},\mathbb{E}_{{P}_{1:i-1}}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}}-\mathbb{E}_{{P}_{i+1:m}})[{V_{h+1}^{*}}]\right\rangle. (B.6)

Outside the failure event \mathcal{B} (specifically, 1\mathcal{B}_{1}), (B.5) is bounded by

|P^i,k(x)Pi(x),𝔼P1:i1(x)𝔼Pi+1:m(x)[Vh+1]|\displaystyle\left|\left\langle{\hat{P}}_{i,k}(x)-{P}_{i}(x),\mathbb{E}_{{P}_{1:i-1}(x)}\mathbb{E}_{{P}_{i+1:m}(x)}[{V_{h+1}^{*}}]\right\rangle\right| 2VarPi(x)𝔼P1:i1(x)𝔼Pi+1:m(x)[Vh+1]LNi,k(x)+2HL3Ni,k(x)\displaystyle\leq\sqrt{\frac{2\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{1:i-1}(x)}\mathbb{E}_{{P}_{i+1:m}(x)}[{V_{h+1}^{*}}]L}{N_{i,k}(x)}}+\frac{2HL}{3N_{i,k}(x)}
ϕi,k(P(x),Vh+1,x).\displaystyle\leq\phi_{i,k}({P}(x),{V_{h+1}^{*}},x).

If V¯k,h+1Vh+1V¯k,h+1{\underline{V}}_{k,h+1}\leq{V_{h+1}^{*}}\leq{\overline{V}}_{k,h+1} entrywise, then by Lemma 25, (B.5) is further bounded by

ϕi,k(P(x),Vh+1,x)\displaystyle\phi_{i,k}({P}(x),{V_{h+1}^{*}},x) ϕi,k(P^k(x),V¯h+1,x)+2LV¯k,h+1Vh+12,P^k(x)Ni,k(x)+32HLNi,k(x)j=1m1Nj,k(x)\displaystyle\leq\phi_{i,k}({\hat{P}}_{k}(x),{\overline{V}}_{h+1},x)+\frac{\sqrt{2L}\left\|{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}+\frac{3\sqrt{2}HL}{\sqrt{N_{i,k}(x)}}\sum_{j=1}^{m}\frac{1}{\sqrt{N_{j,k}(x)}}
ϕi,k(P^k(x),V¯h+1,x)+2LV¯k,h+1V¯k,h+12,P^k(x)Ni,k(x)+32HLNi,k(x)j=1m1Nj,k(x),\displaystyle\leq\phi_{i,k}({\hat{P}}_{k}(x),{\overline{V}}_{h+1},x)+\frac{\sqrt{2L}\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}+\frac{3\sqrt{2}HL}{\sqrt{N_{i,k}(x)}}\sum_{j=1}^{m}\frac{1}{\sqrt{N_{j,k}(x)}},

For (B.6), the Hölder argument (Lemma 10) yields

|P^i,k(x)Pi(x),𝔼P1:i1(x)(𝔼P^i+1:m,k(x)𝔼Pi+1:m(x))[Vh+1]|j=i+1m2HLSiSjNi,k(x)Nj,k(x).\displaystyle\left|\left\langle{\hat{P}}_{i,k}(x)-{P}_{i}(x),\mathbb{E}_{{P}_{1:i-1}(x)}(\mathbb{E}_{{\hat{P}}_{i+1:m,k}(x)}-\mathbb{E}_{{P}_{i+1:m}(x)})[{V_{h+1}^{*}}]\right\rangle\right|\leq\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}.

Combining the above bounds on (B.5) and (B.6) completes the proof. ∎

The first Bernstein-style bound on the transition estimation error (Lemma 26) depends on the unknown Vh+1{V_{h+1}^{*}}, which is why we derive the second bound in terms of the UCB V¯k,h+1{\overline{V}}_{k,h+1} and LCB V¯k,h+1{\underline{V}}_{k,h+1}. We now simplify the second bound into the form of the transition bonus. Expanding ϕi,k\phi_{i,k} by definition,

i=1mϕi,k(P^k(x),V¯k,h+1,x)+i=1m2LV¯k,h+1V¯k,h+12,P^k(x)Ni,k(x)\displaystyle\sum_{i=1}^{m}\phi_{i,k}({\hat{P}}_{k}(x),{\overline{V}}_{k,h+1},x)+\sum_{i=1}^{m}\frac{\sqrt{2L}\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}
+i=1m32HLNi,k(x)j=1m1Nj,k(x)+i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x)\displaystyle\quad+\sum_{i=1}^{m}\frac{3\sqrt{2}HL}{\sqrt{N_{i,k}(x)}}\sum_{j=1}^{m}\frac{1}{\sqrt{N_{j,k}(x)}}+\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}
=\displaystyle= i=1mgi(P^k(x),V¯k,h+1)Ni,k(x)+i=1m2HL3Ni,k(x)+i=1m2LV¯k,h+1V¯k,h+12,P^k(x)Ni,k(x)\displaystyle\sum_{i=1}^{m}\frac{g_{i}({\hat{P}}_{k}(x),{\overline{V}}_{k,h+1})}{\sqrt{N_{i,k}(x)}}+\sum_{i=1}^{m}\frac{2HL}{3N_{i,k}(x)}+\sum_{i=1}^{m}\frac{\sqrt{2L}\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}
+i=1m32HLNi,k(x)j=1m1Nj,k(x)+i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x)\displaystyle\quad+\sum_{i=1}^{m}\frac{3\sqrt{2}HL}{\sqrt{N_{i,k}(x)}}\sum_{j=1}^{m}\frac{1}{\sqrt{N_{j,k}(x)}}+\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}
\displaystyle\leq i=1mgi(P^k(x),V¯k,h+1)Ni,k(x)+i=1m2LV¯k,h+1V¯k,h+12,P^k(x)Ni,k(x)\displaystyle\sum_{i=1}^{m}\frac{g_{i}({\hat{P}}_{k}(x),{\overline{V}}_{k,h+1})}{\sqrt{N_{i,k}(x)}}+\sum_{i=1}^{m}\frac{\sqrt{2L}\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}
+i=1mj=i+1m11HLSiSjNi,k(x)Nj,k(x)+i=1m5HLNi,k(x),\displaystyle\quad+\sum_{i=1}^{m}\sum_{j=i+1}^{m}11HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}+\sum_{i=1}^{m}\frac{5HL}{N_{i,k}(x)},

which is precisely the transition bonus bk,h(x){b}_{k,h}(x) in (4.2). Therefore, for any k[K],h[H]k\in[K],h\in[H],

|P^k(x)P(x),Vh+1|\displaystyle\left|\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle\right|\leq bk,h(x),\displaystyle b_{k,h}(x),

if V¯k,h+1Vh+1V¯k,h+1{\underline{V}}_{k,h+1}\leq{V_{h+1}^{*}}\leq{\overline{V}}_{k,h+1} holds entrywise, which we soon prove in Lemma 27. Note that our choice of the reward bonus upper bounds the reward estimation error outside the failure event \mathcal{B} (specifically, 3\mathcal{B}_{3}), i.e.,

|R^k(x)R(x)|i=1l|R^i,k(x)Ri(x)|i=1l2𝕊[r^i(x)]LMi,k(x)+i=1l14L3Mi,k(x):=βk(x).\displaystyle\left|{\hat{R}}_{k}(x)-R(x)\right|\leq\sum_{i=1}^{l}\left|{\hat{R}}_{i,k}(x)-R_{i}(x)\right|\leq\sum_{i=1}^{l}\sqrt{\frac{2\mathbb{S}[{\hat{r}}_{i}(x)]L}{M_{i,k}(x)}}+\sum_{i=1}^{l}\frac{14L}{3M_{i,k}(x)}:={\beta}_{k}(x).

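The reward bonus βk(x){\beta}_{k}(x) above is a per-factor empirical-Bernstein-style quantity; the sketch below evaluates it for a single state-action pair from illustrative reward samples (the sample arrays are stand-ins for the data behind \hat{r}_{i}(x) and M_{i,k}(x), chosen arbitrarily for illustration).

import numpy as np

rng = np.random.default_rng(2)

def reward_bonus(reward_samples_per_factor, L):
    # beta_k(x) as displayed above, for one state-action pair x;
    # reward_samples_per_factor[i] plays the role of the observations behind
    # \hat r_i(x), so M_{i,k}(x) = len(reward_samples_per_factor[i]).
    beta = 0.0
    for r in reward_samples_per_factor:
        M = len(r)
        sample_var = r.var(ddof=1)                  # S[\hat r_i(x)]
        beta += np.sqrt(2.0 * sample_var * L / M) + 14.0 * L / (3.0 * M)
    return beta

print(reward_bonus([rng.uniform(0, 1, size=30), rng.uniform(0, 1, size=12)], L=2.0))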
With xk,h=(s,πk(s,h))x_{k,h}=(s,\pi_{k}(s,h)), recall the optimistic and pessimistic value iterations are defined as

V¯k,h(s)\displaystyle{\overline{V}}_{k,h}(s) =min{Hh+1,R^(xk,h)+P^k(xk,h),V¯k,h+1+bk,h(xk,h)+βk(xk,h)},\displaystyle=\min\left\{H-h+1,{\hat{R}}(x_{k,h})+\left\langle{\hat{P}}_{k}(x_{k,h}),{\overline{V}}_{k,h+1}\right\rangle+{b}_{k,h}(x_{k,h})+{\beta}_{k}(x_{k,h})\right\},
V¯k,h(s)\displaystyle{\underline{V}}_{k,h}(s) =max{0,R^k(xk,h)+P^k(xk,h),V¯k,h+1bk,h(xk,h)βk(xk,h)}.\displaystyle=\max\left\{0,{\hat{R}}_{k}(x_{k,h})+\left\langle{\hat{P}}_{k}(x_{k,h}),{\underline{V}}_{k,h+1}\right\rangle-{b}_{k,h}(x_{k,h})-{\beta}_{k}(x_{k,h})\right\}.

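A minimal sketch of one backward step of the two recursions above, with the clipping at H-h+1 and at 0 made explicit; the estimated reward, transition estimate, bonuses, and next-step bounds are illustrative placeholders, not outputs of F-EULER.

import numpy as np

def optimistic_pessimistic_step(h, H, R_hat, P_hat, V_up_next, V_low_next, b, beta):
    # One backward step of the two recursions above at a fixed (s, pi_k(s, h)):
    # clip the optimistic value at H - h + 1 and the pessimistic value at 0.
    v_up = min(H - h + 1, R_hat + P_hat @ V_up_next + b + beta)
    v_low = max(0.0, R_hat + P_hat @ V_low_next - b - beta)
    return v_up, v_low

rng = np.random.default_rng(3)
H, h, S = 5, 3, 6                                   # illustrative horizon, step, state-space size
P_hat = rng.dirichlet(np.ones(S))                   # estimated next-state distribution at (s, a)
V_up_next = rng.uniform(0, H - h, size=S)           # UCB at step h + 1
V_low_next = np.maximum(V_up_next - 0.5, 0.0)       # LCB at step h + 1
print(optimistic_pessimistic_step(h, H, R_hat=0.4, P_hat=P_hat,
                                  V_up_next=V_up_next, V_low_next=V_low_next,
                                  b=0.2, beta=0.1))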
The following lemma indicates that the Bernstein-style bonuses and the above value iterations ensure optimism and pessimism. Specifically, V¯k,h{\overline{V}}_{k,h} and V¯k,h{\underline{V}}_{k,h} are entrywise upper and lower confidence bounds of Vh{V_{h}^{*}} for all k[K],h[H]k\in[K],h\in[H].

Lemma 27 (Upper-lower confidence bounds).

Outside the failure event \mathcal{B}, for the choices of bonuses in (4.2) and (4.3), for any episode k[K]k\in[K], step h[H]h\in[H] and state s𝒮s\in\mathcal{S},

V¯k,h(s)Vh(s)V¯k,h(s).\displaystyle{\underline{V}}_{k,h}(s)\leq{V_{h}^{*}}(s)\leq{\overline{V}}_{k,h}(s). (B.7)
Proof.

For h=H+1h=H+1, V¯k,H+1(s)=VH+1(s)=V¯k,H+1(s)=0{\underline{V}}_{k,H+1}(s)={V_{H+1}^{*}}(s)={\overline{V}}_{k,H+1}(s)=0 for all s𝒮,k[K]s\in\mathcal{S},k\in[K]. We proceed by backward induction. For h[H]h\in[H], assume (B.7) holds for h+1h+1. The transition bonus then satisfies |P^k(x)P(x),Vh+1|bk,h(x)|\bigl{\langle}{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\bigr{\rangle}|\leq b_{k,h}(x) for all x𝒳x\in\mathcal{X}. For all s𝒮s\in\mathcal{S}, with xk,h=(s,πk(s,h))x_{k,h}=(s,\pi_{k}(s,h)) and xh=(s,π(s,h))x_{h}^{*}=(s,\pi^{*}(s,h)), V¯k,h{\overline{V}}_{k,h} satisfies that

V¯k,h(s)Vh(s)\displaystyle{\overline{V}}_{k,h}(s)-{V_{h}^{*}}(s) =R^(xk,h)+βk(xk,h)+P^k(xk,h),V¯k,h+1+bk(xk,h)R(xh)P(xh),Vh+1\displaystyle={\hat{R}}(x_{k,h})+{\beta}_{k}(x_{k,h})+\left\langle{\hat{P}}_{k}(x_{k,h}),{\overline{V}}_{k,h+1}\right\rangle+{b}_{k}(x_{k,h})-R(x_{h}^{*})-\left\langle{P}(x_{h}^{*}),{V_{h+1}^{*}}\right\rangle
R^(xh)+βk(xh)+P^k(xh),V¯k,h+1+bk(xh)R(xh)P(xh),Vh+1\displaystyle\geq{\hat{R}}(x_{h}^{*})+{\beta}_{k}(x^{*}_{h})+\left\langle{\hat{P}}_{k}(x_{h}^{*}),{\overline{V}}_{k,h+1}\right\rangle+{b}_{k}(x_{h}^{*})-R(x_{h}^{*})-\left\langle{P}(x_{h}^{*}),{V_{h+1}^{*}}\right\rangle
P^k(xh),V¯k,h+1Vh+1+P^k(xh)P(xh),Vh+1+bk(xh)0,\displaystyle\geq\left\langle{\hat{P}}_{k}(x_{h}^{*}),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle+\left\langle{\hat{P}}_{k}(x_{h}^{*})-{P}(x_{h}^{*}),{V_{h+1}^{*}}\right\rangle+{b}_{k}(x_{h}^{*})\geq 0,

and V¯k,h{\underline{V}}_{k,h} satisfies that

Vh(s)V¯k,h(s)\displaystyle{V_{h}^{*}}(s)-{\underline{V}}_{k,h}(s) R(xh)+P(xh),Vh+1R^k(xk,h)βk(xk,h)P^k(xk,h),V¯k,h+1+bk,h(xk,h)\displaystyle\geq R(x_{h}^{*})+\left\langle{P}(x_{h}^{*}),{V_{h+1}^{*}}\right\rangle-{\hat{R}}_{k}(x_{k,h})-{\beta}_{k}(x_{k,h})-\left\langle{\hat{P}}_{k}(x_{k,h}),{\underline{V}}_{k,h+1}\right\rangle+{b}_{k,h}(x_{k,h})
R(xk,h)+P(xk,h),Vh+1R^k(xk,h)βk(xk,h)P^k(xk,h),V¯k,h+1+bk,h(xk,h)\displaystyle\geq R(x_{k,h})+\left\langle{P}(x_{k,h}),{V_{h+1}^{*}}\right\rangle-{\hat{R}}_{k}(x_{k,h})-{\beta}_{k}(x_{k,h})-\left\langle{\hat{P}}_{k}(x_{k,h}),{\underline{V}}_{k,h+1}\right\rangle+{b}_{k,h}(x_{k,h})
P^k(xk,h),Vh+1V¯k,h+1+P(xk,h)P^k(xk,h),Vh+1+bk,h(xk,h)0.\displaystyle\geq\left\langle{\hat{P}}_{k}(x_{k,h}),{V_{h+1}^{*}}-{\underline{V}}_{k,h+1}\right\rangle+\left\langle{P}(x_{k,h})-{\hat{P}}_{k}(x_{k,h}),{V_{h+1}^{*}}\right\rangle+{b}_{k,h}(x_{k,h})\geq 0.

Inductively, V¯k,hVhV¯k,h{\underline{V}}_{k,h}\leq{V_{h}^{*}}\leq{\overline{V}}_{k,h} holds entrywise for all k[K]k\in[K] and h[H]h\in[H]. ∎

For F-EULER, we refer to the difference between the optimistic value function and the pessimistic value function as the confidence radius, which we bound in the following lemma. Here and below, we use “,\lesssim,\approx” to denote “,=\leq,=” up to constant factors. Unlike the main text, we make the lower-order terms explicit in the appendices.

Lemma 28 (Confidence radius, Bernstein-style).

Let F1:=mHmaxiSiLF_{1}:=mH\max_{i}S_{i}L be a lower-order term. Let sk,t𝒮s_{k,t}\in\mathcal{S} denote the state at step tt of episode kk and xk,t=(sk,t,πk(sk,t,t))x_{k,t}=(s_{k,t},\pi_{k}(s_{k,t},t)). Outside the failure event \mathcal{B}, for any episode k[K]k\in[K], step h[H]h\in[H] and state s𝒮s\in\mathcal{S}, the confidence radius of F-EULER satisfies that

V¯k,h(s)V¯k,h(s)min{t=hH𝔼πk[i=1mF1Ni,k(xk,t)+i=1lLMi,k(xk,t)|sk,h=s],H}.\displaystyle{\overline{V}}_{k,h}(s)-{\underline{V}}_{k,h}(s)\lesssim\min\left\{\sum_{t=h}^{H}\mathbb{E}_{\pi_{k}}\left[\sum_{i=1}^{m}\frac{F_{1}}{\sqrt{N_{i,k}(x_{k,t})}}+\sum_{i=1}^{l}\frac{L}{\sqrt{M_{i,k}(x_{k,t})}}\middle|s_{k,h}=s\right],H\right\}.
Proof.

By definition, for any k[K],h[H],s𝒮k\in[K],h\in[H],s\in\mathcal{S},

V¯k,h(s)V¯k,h(s)\displaystyle{\overline{V}}_{k,h}(s)-{\underline{V}}_{k,h}(s)
\displaystyle\leq R^k(xk,h)+P^k(xk,h),V¯k,h+1+bk,h(xk,h)+βk(xk,h)\displaystyle{\hat{R}}_{k}(x_{k,h})+\left\langle{\hat{P}}_{k}(x_{k,h}),{\overline{V}}_{k,h+1}\right\rangle+{b}_{k,h}(x_{k,h})+{\beta}_{k}(x_{k,h})
R^k(xk,h)P^k(xk,h),V¯k,h+1+bk,h(xk,h)+βk(xk,h)\displaystyle\quad-{\hat{R}}_{k}(x_{k,h})-\left\langle{\hat{P}}_{k}(x_{k,h}),{\underline{V}}_{k,h+1}\right\rangle+{b}_{k,h}(x_{k,h})+{\beta}_{k}(x_{k,h})
=\displaystyle= P^k(xk,h),V¯k,h+1V¯k,h+1+2bk,h(xk,h)+2βk(xk,h)\displaystyle\left\langle{\hat{P}}_{k}(x_{k,h}),{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\rangle+2{b}_{k,h}(x_{k,h})+2{\beta}_{k}(x_{k,h})
=\displaystyle= P(xk,h),V¯k,h+1V¯k,h+1+P(xk,h)P^k(xk,h),V¯k,h+1V¯k,h+1+2bk,h(xk,h)+2βk(xk,h)\displaystyle\left\langle{P}(x_{k,h}),{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\rangle+\left\langle{P}(x_{k,h})-{\hat{P}}_{k}(x_{k,h}),{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\rangle+2{b}_{k,h}(x_{k,h})+2{\beta}_{k}(x_{k,h})
\displaystyle\leq P(xk,h),V¯k,h+1V¯k,h+1+i=1mH2SiLNi,k(xk,h)+2bk,h(xk,h)+2βk(xk,h),\displaystyle\left\langle{P}(x_{k,h}),{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\rangle+\sum_{i=1}^{m}H\sqrt{\frac{2S_{i}L}{N_{i,k}(x_{k,h})}}+2{b}_{k,h}(x_{k,h})+2{\beta}_{k}(x_{k,h}), (B.8)

where the last inequality results from Holder’s inequality and holds outside the failure event \mathcal{B} (specifically, 2\mathcal{B}_{2}). We apply the following loose bounds on the transition bonus and the reward bonus that for any k[K],h[H],x𝒳k\in[K],h\in[H],x\in\mathcal{X},

bk,h(x)\displaystyle{b}_{k,h}(x) =i=1mgi(P^k(x),V¯k,h+1)Ni,k(x)+i=1m2LV¯k,h+1V¯k,h+12,P^k(x)Ni,k(x)\displaystyle=\sum_{i=1}^{m}\frac{g_{i}({\hat{P}}_{k}(x),{\overline{V}}_{k,h+1})}{\sqrt{N_{i,k}(x)}}+\sum_{i=1}^{m}\frac{\sqrt{2L}\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}
+i=1mj=i+1m11HLSiSjNi,k(x)Nj,k(x)+i=1m5HLNi,k(x)\displaystyle\quad+\sum_{i=1}^{m}\sum_{j=i+1}^{m}11HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}+\sum_{i=1}^{m}\frac{5HL}{N_{i,k}(x)}
i=1m2LVarP^i,k𝔼P^i,k[V¯k,h+1]Ni,k(x)+i=1m2LHNi,k(x)\displaystyle\leq\sum_{i=1}^{m}\frac{2\sqrt{L}\sqrt{\mathrm{Var}_{{\hat{P}}_{i,k}}\mathbb{E}_{{\hat{P}}_{-i,k}}[{\overline{V}}_{k,h+1}]}}{\sqrt{N_{i,k}(x)}}+\sum_{i=1}^{m}\frac{\sqrt{2L}H}{\sqrt{N_{i,k}(x)}}
+i=1mj=i+1m11HLmaxiSi1Ni,k(x)+i=1m5HLNi,k(x)\displaystyle\quad+\sum_{i=1}^{m}\sum_{j=i+1}^{m}11HL\max_{i}S_{i}\frac{1}{\sqrt{N_{i,k}(x)}}+\sum_{i=1}^{m}\frac{5HL}{N_{i,k}(x)}
mHmaxiSiLi=1m1Ni,k(x),\displaystyle\lesssim mH\max_{i}S_{i}L\sum_{i=1}^{m}\frac{1}{\sqrt{N_{i,k}(x)}},

and

βk(x)=i=1l2𝕊[r^i(x)]LMi,k(x)+i=1l14L3Mi,k(x)Li=1l1Mi,k(x).\displaystyle{\beta}_{k}(x)=\sum_{i=1}^{l}\sqrt{\frac{2\mathbb{S}[{\hat{r}}_{i}(x)]L}{M_{i,k}(x)}}+\sum_{i=1}^{l}\frac{14L}{3M_{i,k}(x)}\lesssim L\sum_{i=1}^{l}\frac{1}{\sqrt{M_{i,k}(x)}}.

Substituting the above bounds on bk,h(x){b}_{k,h}(x) and βk(x){\beta}_{k}(x) at x=xk,hx=x_{k,h} into (B.8) yields

V¯k,h(s)V¯k,h(s)\displaystyle{\overline{V}}_{k,h}(s)-{\underline{V}}_{k,h}(s)\lesssim P(xk,h),V¯k,h+1V¯k,h+1+i=1mF1Ni,k(xk,h)+i=1lLMi,k(xk,h).\displaystyle\left\langle{P}(x_{k,h}),{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\rangle+\sum_{i=1}^{m}\frac{F_{1}}{\sqrt{N_{i,k}(x_{k,h})}}+\sum_{i=1}^{l}\frac{L}{\sqrt{M_{i,k}(x_{k,h})}}.

Inductively, we have

V¯k,h(s)V¯k,h(s)min{t=hH𝔼πk[i=1mF1Ni,k(xk,t)+i=1lLMi,k(xk,t)|sk,h=s],H}.\displaystyle{\overline{V}}_{k,h}(s)-{\underline{V}}_{k,h}(s)\lesssim\min\left\{\sum_{t=h}^{H}\mathbb{E}_{\pi_{k}}\left[\sum_{i=1}^{m}\frac{F_{1}}{\sqrt{N_{i,k}(x_{k,t})}}+\sum_{i=1}^{l}\frac{L}{\sqrt{M_{i,k}(x_{k,t})}}\middle|s_{k,h}=s\right],H\right\}.

As noted above, the notion of the good sets in the analysis of F-UCBVI carries over here, which again plays an important role in showing sum-over-time bounds. Analogous to Lemma 17, the following lemma bounds the sum over time of the squared confidence radius, which is later used to bound the cumulative correction term (Lemma 34).

Lemma 29 (Cumulative confidence radius, Bernstein-style).

Define the lower-order term

G1:=m4H4(maxiSi)2maxiX[Ii]L3+l2H3maxiX[Ji]L3.\displaystyle G_{1}:=m^{4}H^{4}(\max_{i}S_{i})^{2}\max_{i}X[{I_{i}}]L^{3}+l^{2}H^{3}\max_{i}X[{J_{i}}]L^{3}.

Then outside the failure event \mathcal{B}, for all i[m]i\in[m], the sum over time of the following expected squared confidence radius of F-EULER satisfies that

k=1Kh=1Hx𝒳wk,h(x)(𝔼Pi(𝔼Pi[V¯k,h+1V¯k,h+1])2)k=1Kh=1Hx𝒳wk,h(x)(𝔼P[(V¯k,h+1V¯k,h+1)2])G1.\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\left(\mathbb{E}_{{P}_{i}}\left(\mathbb{E}_{{P}_{-i}}[{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}]\right)^{2}\right)\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\left(\mathbb{E}_{{P}}[({\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1})^{2}]\right)\lesssim G_{1}.
Proof.

Replacing the failure event \mathcal{F} and confidence radius bound (Lemma 12) of F-UCBVI by the failure event \mathcal{B} and confidence radius bound (Lemma 28) of F-EULER, the proof of Lemma 17 carries over here. Note that there is a minor difference in the order of LL between the second terms of G1G_{1} (F-EULER) and G0G_{0} (F-UCBVI), resulting from the L\sqrt{L} difference in the corresponding confidence radius bounds. ∎

Like the analysis of EULER [42], we show two problem-dependent regret bounds of F-EULER. The following lemma bridges one bound to the other.

Lemma 30 (Bound bridge).

Outside the failure event \mathcal{B}, for F-EULER, we have

k=1Kh=1HxLkwk,h(x)gi(P(x),Vh+1)gi(P(x),Vh+1πk)Ni,k(x)22LHX[Ii]LRegret(K).\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\frac{g_{i}({P}(x),{V_{h+1}^{*}})-g_{i}({P}(x),V_{h+1}^{\pi_{k}})}{\sqrt{N_{i,k}(x)}}\leq 2\sqrt{2L}H\sqrt{X[{I_{i}}]L}\sqrt{{\textnormal{Regret}}(K)}.
Proof.

Outside the failure event \mathcal{B}, by the properties of gig_{i} (Lemma 24),

k=1Kh=1HxLkwk,h(x)gi(P(x),Vh+1)gi(P(x),Vh+1πk)Ni,k(x)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\frac{g_{i}({P}(x),{V_{h+1}^{*}})-g_{i}({P}(x),V_{h+1}^{\pi_{k}})}{\sqrt{N_{i,k}(x)}}
\displaystyle\leq 2Lk=1Kh=1HxLkwk,h(x)Vh+1Vh+1πk2,P(x)Ni,k(x)\displaystyle\sqrt{2L}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\frac{\|{V_{h+1}^{*}}-V_{h+1}^{\pi_{k}}\|_{2,{P}(x)}}{\sqrt{N_{i,k}(x)}}
\displaystyle\leq 2Lk=1Kh=1HxLkwk,h(x)Ni,k(x)k=1Kh=1HxLkwk,h(x)P(x),(Vh+1Vh+1πk)2\displaystyle\sqrt{2L}\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}\frac{w_{k,h}(x)}{N_{i,k}(x)}}\cdot\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left\langle{P}(x),({V_{h+1}^{*}}-V_{h+1}^{\pi_{k}})^{2}\right\rangle}
\displaystyle\leq 22LX[Ii]LH2Regret(K)=22LHX[Ii]LRegret(K),\displaystyle 2\sqrt{2L}\sqrt{X[{I_{i}}]L}\cdot\sqrt{H^{2}{\textnormal{Regret}}(K)}=2\sqrt{2L}H\sqrt{X[{I_{i}}]L}\sqrt{{\textnormal{Regret}}(K)},

where in the second inequality we use the Cauchy-Schwarz inequality, and the third inequality is due to Lemma 15 in this work and Lemma 16 in [42]. ∎

B.3 Bounds on the individual terms in regret

Replacing the failure event \mathcal{F} by \mathcal{B}, the regret decomposition in Lemma 18 carries over here. Hence, outside the failure event \mathcal{B},

Regret(K)\displaystyle{\textnormal{Regret}}(K) k=1Kh=1HxLkwk,h(x)(P^k(x)P(x),Vh+1transition estimation error+bk,h(x)+P^k(x)P(x),V¯k,h+1Vh+1correction term)\displaystyle\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\Bigl{(}\underbrace{\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle}_{\text{transition estimation error}}+{b}_{k,h}(x)+\underbrace{\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle}_{\text{correction term}}\Bigr{)}
+k=1Kh=1HxΛkwk,h(x)2βk(x)+8H2i=1mX[Ii]L+8H2i=1lX[Ji]L,\displaystyle\quad+\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)2{\beta}_{k}(x)+8H^{2}\sum_{i=1}^{m}X[{I_{i}}]L+8H^{2}\sum_{i=1}^{l}X[{J_{i}}]L,

where the bk,h(x){b}_{k,h}(x) term is referred to as the “transition optimism” and the 2βk(x)2{\beta}_{k}(x) term is referred to as the “reward estimation error and optimism”. In this subsection, we bound each of the above terms for F-EULER.

Lemma 31 (Cumulative transition estimation error, Bernstein-style).

Define

i\displaystyle\mathbb{C}_{i}^{*} :=1Tk=1Kh=1Hx𝒳wk,h(x)gi2(P(x),Vh+1)=1Tk=1Kh=1H𝔼πk[gi2(P,Vh+1)|sk,1],\displaystyle:=\frac{1}{T}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)g_{i}^{2}({P}(x),{V_{h+1}^{*}})=\frac{1}{T}\sum_{k=1}^{K}\sum_{h=1}^{H}\mathbb{E}_{\pi_{k}}[g_{i}^{2}(P,{V_{h+1}^{*}})|s_{k,1}],
iπ\displaystyle\mathbb{C}_{i}^{\pi} :=1Tk=1Kh=1Hx𝒳wk,h(x)gi2(P(x),Vh+1πk)=1Tk=1Kh=1H𝔼πk[gi2(P,Vh+1πk)|sk,1],\displaystyle:=\frac{1}{T}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)g_{i}^{2}({P}(x),V_{h+1}^{\pi_{k}})=\frac{1}{T}\sum_{k=1}^{K}\sum_{h=1}^{H}\mathbb{E}_{\pi_{k}}[g_{i}^{2}(P,V_{h+1}^{\pi_{k}})|s_{k,1}],

where sk,1s_{k,1} denotes the initial state in the kkth episode. Then for F-EULER, outside the failure event \mathcal{B}, the sum over time of the transition estimation error satisfies that

k=1Kh=1HxLkwk,h(x)P^k(x)P(x),Vh+1i=1miX[Ii]T+m2HmaxiSimaxiX[Ii]L2,\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle\lesssim\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{*}X[{I_{i}}]T}+m^{2}H\max_{i}S_{i}\max_{i}X[{I_{i}}]L^{2},

and that

k=1Kh=1HxLkwk,h(x)P^k(x)P(x),Vh+1\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle
\displaystyle\lesssim i=1miπX[Ii]TL+i=1mHX[Ii]LRegret(K)+m2HmaxiSimaxiX[Ii]L2.\displaystyle\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{\pi}X[{I_{i}}]TL}+\sum_{i=1}^{m}H\sqrt{X[{I_{i}}]L}\sqrt{{\textnormal{Regret}}(K)}+m^{2}H\max_{i}S_{i}\max_{i}X[{I_{i}}]L^{2}.
Proof.

Outside the failure event \mathcal{B}, by Lemma 26,

k=1Kh=1HxLkwk,h(x)P^k(x)P(x),Vh+1\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left\langle{\hat{P}}_{k}(x)-{P}(x),{V_{h+1}^{*}}\right\rangle
\displaystyle\leq k=1Kh=1HxLkwk,h(x)(i=1mϕi,k(P(x),Vh+1,x)+i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x))\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left(\sum_{i=1}^{m}\phi_{i,k}({P}(x),{V_{h+1}^{*}},x)+\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}\right)
=\displaystyle= k=1Kh=1HxLkwk,h(x)(i=1mgi(P(x),Vh+1)Ni,k(x))\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left(\sum_{i=1}^{m}\frac{g_{i}({P}(x),{V_{h+1}^{*}})}{\sqrt{N_{i,k}(x)}}\right)
+k=1Kh=1HxLkwk,h(x)(i=1m2HL3Ni,k(x)+i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x)).\displaystyle\quad+\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left(\sum_{i=1}^{m}\frac{2HL}{3N_{i,k}(x)}+\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}\right). (B.9)

By the Cauchy-Schwarz inequality and Lemma 15, the first term above (the gig_{i} term) is bounded by

k=1Kh=1HxLkwk,h(x)(i=1mgi(P(x),Vh+1)Ni,k(x))\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left(\sum_{i=1}^{m}\frac{g_{i}({P}(x),{V_{h+1}^{*}})}{\sqrt{N_{i,k}(x)}}\right)
\displaystyle\leq i=1mk=1Kh=1HxLkwk,h(x)gi2(P(x),Vh+1)k=1Kh=1HxLkwk,h(x)Ni,k(x)\displaystyle\sum_{i=1}^{m}\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)g_{i}^{2}({P}(x),{V_{h+1}^{*}})}\cdot\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}\frac{w_{k,h}(x)}{N_{i,k}(x)}}
\displaystyle\leq i=1m2iX[Ii]TL.\displaystyle\sum_{i=1}^{m}2\sqrt{\mathbb{C}_{i}^{*}X[{I_{i}}]TL}.

By Lemma 15 and Lemma 16, the terms in (B.9) are bounded by

k=1Kh=1HxLkwk,h(x)(i=1m2HL3Ni,k(x)+i=1mj=i+1m2HLSiSjNi,k(x)Nj,k(x))\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left(\sum_{i=1}^{m}\frac{2HL}{3N_{i,k}(x)}+\sum_{i=1}^{m}\sum_{j=i+1}^{m}2HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}\right)
\displaystyle\leq 23HLi=1m4X[Ii]L+2HLmaxiSii=1mj=i+1m4X[Ii]X[Ij]L\displaystyle\frac{2}{3}HL\sum_{i=1}^{m}4X[{I_{i}}]L+2HL\max_{i}S_{i}\sum_{i=1}^{m}\sum_{j=i+1}^{m}4\sqrt{X[{I_{i}}]X[{I_{j}}]}L
\displaystyle\lesssim m2HmaxiSimaxiX[Ii]L2.\displaystyle m^{2}H\max_{i}S_{i}\max_{i}X[{I_{i}}]L^{2}.

Combining the above bounds on the gig_{i} term and on (B.9) yields the first bound in this lemma. To show the second bound, we instead bound the gig_{i} term by

k=1Kh=1HxLkwk,h(x)(i=1mgi(P(x),Vh+1)Ni,k(x))\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left(\sum_{i=1}^{m}\frac{g_{i}({P}(x),{V_{h+1}^{*}})}{\sqrt{N_{i,k}(x)}}\right)
=\displaystyle= i=1mk=1Kh=1HxLkwk,h(x)(gi(P(x),Vh+1πk)Ni,k(x)+gi(P(x),Vh+1)gi(P(x),Vh+1πk)Ni,k(x))\displaystyle\sum_{i=1}^{m}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left(\frac{g_{i}({P}(x),V_{h+1}^{\pi_{k}})}{\sqrt{N_{i,k}(x)}}+\frac{g_{i}({P}(x),{V_{h+1}^{*}})-g_{i}({P}(x),V_{h+1}^{\pi_{k}})}{\sqrt{N_{i,k}(x)}}\right)
\displaystyle\leq i=1m2iπX[Ii]TL+i=1m22LHX[Ii]LRegret(K),\displaystyle\sum_{i=1}^{m}2\sqrt{\mathbb{C}_{i}^{\pi}X[{I_{i}}]TL}+\sum_{i=1}^{m}2\sqrt{2L}H\sqrt{X[{I_{i}}]L}\sqrt{{\textnormal{Regret}}(K)},

where the inequality follows from the same Cauchy-Schwarz argument as above for the first term and from the bound bridge (Lemma 30) for the second term. ∎

Lemma 32 (Cumulative transition optimism, Bernstein-style).

For F-EULER, outside the failure event \mathcal{B}, the sum over time of the transition optimism satisfies that

k=1Kh=1HxLkwk,h(x)bk,h(x)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x){b}_{k,h}(x) i=1miX[Ii]T+m3H2maxiSimaxiX[Ii]L2\displaystyle\lesssim\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{*}X[{I_{i}}]T}+m^{3}H^{2}\max_{i}S_{i}\max_{i}X[{I_{i}}]L^{2}
+m1.5lH1.5(maxiSi)0.25(maxiX[Ii])0.75(maxiX[Ji])0.5L2,\displaystyle\quad+m^{1.5}lH^{1.5}(\max_{i}S_{i})^{0.25}(\max_{i}X[{I_{i}}])^{0.75}(\max_{i}X[{J_{i}}])^{0.5}L^{2},

and that

k=1Kh=1HxLkwk,h(x)bk,h(x)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x){b}_{k,h}(x) i=1miπX[Ii]TL+i=1mHX[Ii]LRegret(K)\displaystyle\lesssim\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{\pi}X[{I_{i}}]TL}+\sum_{i=1}^{m}H\sqrt{X[{I_{i}}]L}\sqrt{{\textnormal{Regret}}(K)}
+m3H2maxiSimaxiX[Ii]L2\displaystyle\quad+m^{3}H^{2}\max_{i}S_{i}\max_{i}X[{I_{i}}]L^{2}
+m1.5lH1.5(maxiSi)0.25(maxiX[Ii])0.75(maxiX[Ji])0.5L2.\displaystyle\quad+m^{1.5}lH^{1.5}(\max_{i}S_{i})^{0.25}(\max_{i}X[{I_{i}}])^{0.75}(\max_{i}X[{J_{i}}])^{0.5}L^{2}.
Proof.

Outside the failure event \mathcal{B}, for all k[K],h[H]k\in[K],h\in[H] and x𝒳x\in\mathcal{X}, by definition,

bk,h(x)\displaystyle{b}_{k,h}(x) =i=1mgi(P^k(x),V¯k,h+1)Ni,k(x)+i=1m2LV¯k,h+1V¯k,h+12,P^k(x)Ni,k(x)\displaystyle=\sum_{i=1}^{m}\frac{g_{i}({\hat{P}}_{k}(x),{\overline{V}}_{k,h+1})}{\sqrt{N_{i,k}(x)}}+\sum_{i=1}^{m}\frac{\sqrt{2L}\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}
+i=1mj=i+1m11HLSiSjNi,k(x)Nj,k(x)+i=1m5HLNi,k(x)\displaystyle\quad+\sum_{i=1}^{m}\sum_{j=i+1}^{m}11HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}+\sum_{i=1}^{m}\frac{5HL}{N_{i,k}(x)}
=i=1mϕi,k(P^k(x),V¯h+1,x)+i=1m2LV¯k,h+1V¯k,h+12,P^k(x)Ni,k(x)\displaystyle=\sum_{i=1}^{m}\phi_{i,k}({\hat{P}}_{k}(x),{\overline{V}}_{h+1},x)+\sum_{i=1}^{m}\frac{\sqrt{2L}\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}
+i=1mj=i+1m11HLSiSjNi,k(x)Nj,k(x)+i=1m13HL3Ni,k(x)\displaystyle\quad+\sum_{i=1}^{m}\sum_{j=i+1}^{m}11HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}+\sum_{i=1}^{m}\frac{13HL}{3N_{i,k}(x)}
i=1mϕi,k(P(x),Vh+1,x)+i=1m22LV¯k,h+1V¯k,h+12,P^k(x)Ni,k(x)\displaystyle\leq\sum_{i=1}^{m}\phi_{i,k}({P}(x),{V_{h+1}^{*}},x)+\sum_{i=1}^{m}\frac{2\sqrt{2L}\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}
+i=1m32HLNi,k(x)j=1m1Nj,k(x)+i=1mj=i+1m11HLSiSjNi,k(x)Nj,k(x)+i=1m13HL3Ni,k(x)\displaystyle\quad+\sum_{i=1}^{m}\frac{3\sqrt{2}HL}{\sqrt{N_{i,k}(x)}}\sum_{j=1}^{m}\frac{1}{\sqrt{N_{j,k}(x)}}+\sum_{i=1}^{m}\sum_{j=i+1}^{m}11HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}+\sum_{i=1}^{m}\frac{13HL}{3N_{i,k}(x)}
i=1mϕi,k(P(x),Vh+1,x)+i=1m22LV¯k,h+1V¯k,h+12,P^k(x)Ni,k(x)\displaystyle\leq\sum_{i=1}^{m}\phi_{i,k}({P}(x),{V_{h+1}^{*}},x)+\sum_{i=1}^{m}\frac{2\sqrt{2L}\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}
+i=1mj=i+1m20HLSiSjNi,k(x)Nj,k(x)+i=1m9HLNi,k(x)\displaystyle\quad+\sum_{i=1}^{m}\sum_{j=i+1}^{m}20HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}+\sum_{i=1}^{m}\frac{9HL}{N_{i,k}(x)}

where the first inequality is due to Lemma 25. Substituting the definition of ϕi,k\phi_{i,k} into the above bound on bk,h(x){b}_{k,h}(x), we have

k=1Kh=1HxLkwk,h(x)bk,h(x)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x){b}_{k,h}(x)
\displaystyle\lesssim k=1Kh=1HxLkwk,h(x)(i=1mLV¯k,h+1V¯k,h+12,P^k(x)Ni,k(x)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\Biggl{(}\sum_{i=1}^{m}\frac{\sqrt{L}\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}} (B.10)
+i=1mgi(P(x),Vh+1)Ni,k(x)+i=1mHLNi,k(x)+i=1mj=i+1mHLSiSjNi,k(x)Nj,k(x)).\displaystyle\quad+\sum_{i=1}^{m}\frac{g_{i}({P}(x),{V_{h+1}^{*}})}{\sqrt{N_{i,k}(x)}}+\sum_{i=1}^{m}\frac{HL}{N_{i,k}(x)}+\sum_{i=1}^{m}\sum_{j=i+1}^{m}HL\sqrt{\frac{S_{i}S_{j}}{N_{i,k}(x)N_{j,k}(x)}}\Biggr{)}. (B.11)

By the Cauchy-Schwarz inequality and the inequality that a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b} for all a,b0a,b\geq 0, we bound the term in (B.10) by

k=1Kh=1HxLkwk,h(x)V¯k,h+1V¯k,h+12,P^k(x)Ni,k(x)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\frac{\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|_{2,{\hat{P}}_{k}(x)}}{\sqrt{N_{i,k}(x)}}
\displaystyle\leq k=1Kh=1HxLkwk,h(x)Ni,k(x)k=1Kh=1HxLkwk,h(x)V¯k,h+1V¯k,h+122,P^k(x)\displaystyle\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}\frac{w_{k,h}(x)}{N_{i,k}(x)}}\cdot\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|^{2}_{2,{\hat{P}}_{k}(x)}}
\displaystyle\leq 2X[Ii]L\bBigg@3.5(k=1Kh=1HxLkwk,h(x)V¯k,h+1V¯k,h+122,P(x)\displaystyle 2\sqrt{X[{I_{i}}]L}\cdot\bBigg@{3.5}(\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|^{2}_{2,{P}(x)}} (B.12)
+k=1Kh=1HxLkwk,h(x)P^k(x)P(x),(V¯k,h+1V¯k,h+1)2\bBigg@3.5),\displaystyle\quad+\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left\langle{\hat{P}}_{k}(x)-{P}(x),({\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1})^{2}\right\rangle}\bBigg@{3.5}), (B.13)

where the term in (B.12) is bounded, due to Lemma 29, by

k=1Kh=1HxLkwk,h(x)V¯k,h+1V¯k,h+122,P(x)G1,\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left\|{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\|^{2}_{2,{P}(x)}\lesssim G_{1},

and the term in (B.13) is bounded, due to Lemma 34, by

k=1Kh=1HxLkwk,h(x)P^k(x)P(x),(V¯k,h+1V¯k,h+1)2\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left\langle{\hat{P}}_{k}(x)-{P}(x),({\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1})^{2}\right\rangle
\displaystyle\leq Hk=1Kh=1HxLkwk,h(x)|P^k(x)P(x),V¯k,h+1V¯k,h+1|\displaystyle H\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left|\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}-{\underline{V}}_{k,h+1}\right\rangle\right|
\displaystyle\lesssim m3H3(maxiSi)1.5maxiX[Ii]L2.5+mlH2.5(maxiSi)0.5(maxiX[Ii])0.5(maxiX[Ji])0.5L2.5.\displaystyle m^{3}H^{3}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}+mlH^{2.5}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.5}(\max_{i}X[{J_{i}}])^{0.5}L^{2.5}.

By the proof of Lemma 31, the same bounds as those on the cumulative transition estimation error apply to (B.11). Combining the above bounds on (B.10) and (B.11), we obtain the first bound that

k=1Kh=1HxLkwk,h(x)bk,h(x)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x){b}_{k,h}(x)
\displaystyle\lesssim i=1miX[Ii]T+m2HmaxiSimaxiX[Ii]L2+mmaxiX[Ii]LG1\displaystyle\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{*}X[{I_{i}}]T}+m^{2}H\max_{i}S_{i}\max_{i}X[{I_{i}}]L^{2}+m\sqrt{\max_{i}X[{I_{i}}]L}\sqrt{G_{1}}
+mmaxiX[Ii]Lm3H3(maxiSi)1.5maxiX[Ii]L2.5\displaystyle\quad+m\sqrt{\max_{i}X[{I_{i}}]L}\sqrt{m^{3}H^{3}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}}
+mmaxiX[Ii]LmlH2.5(maxiSi)0.5(maxiX[Ii])0.5(maxiX[Ji])0.5L2.5\displaystyle\quad+m\sqrt{\max_{i}X[{I_{i}}]L}\sqrt{mlH^{2.5}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.5}(\max_{i}X[{J_{i}}])^{0.5}L^{2.5}}
\displaystyle\lesssim i=1miX[Ii]T+m2HmaxiSimaxiX[Ii]L2\displaystyle\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{*}X[{I_{i}}]T}+m^{2}H\max_{i}S_{i}\max_{i}X[{I_{i}}]L^{2}
+m(maxiX[Ii])0.5L0.5m2H2maxiSi(maxiX[Ii])0.5L1.5\displaystyle\quad+m(\max_{i}X[{I_{i}}])^{0.5}L^{0.5}m^{2}H^{2}\max_{i}S_{i}(\max_{i}X[{I_{i}}])^{0.5}L^{1.5}
+m(maxiX[Ii])0.5L0.5lH1.5(maxiX[Ji])0.5L1.5\displaystyle\quad+m(\max_{i}X[{I_{i}}])^{0.5}L^{0.5}lH^{1.5}(\max_{i}X[{J_{i}}])^{0.5}L^{1.5}
+m(maxiX[Ii])0.5L0.5m1.5H1.5(maxiSi)0.75(maxiX[Ii])0.5L1.25\displaystyle\quad+m(\max_{i}X[{I_{i}}])^{0.5}L^{0.5}m^{1.5}H^{1.5}(\max_{i}S_{i})^{0.75}(\max_{i}X[{I_{i}}])^{0.5}L^{1.25}
+m(maxiX[Ii])0.5L0.5m0.5l0.5H1.25(maxiSi)0.25(maxiX[Ii])0.25(maxiX[Ji])0.25L1.25\displaystyle\quad+m(\max_{i}X[{I_{i}}])^{0.5}L^{0.5}m^{0.5}l^{0.5}H^{1.25}(\max_{i}S_{i})^{0.25}(\max_{i}X[{I_{i}}])^{0.25}(\max_{i}X[{J_{i}}])^{0.25}L^{1.25}
\displaystyle\lesssim i=1miX[Ii]T+m3H2maxiSimaxiX[Ii]L2\displaystyle\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{*}X[{I_{i}}]T}+m^{3}H^{2}\max_{i}S_{i}\max_{i}X[{I_{i}}]L^{2}
+m1.5lH1.5(maxiSi)0.25(maxiX[Ii])0.75(maxiX[Ji])0.5L2,\displaystyle\quad+m^{1.5}lH^{1.5}(\max_{i}S_{i})^{0.25}(\max_{i}X[{I_{i}}])^{0.75}(\max_{i}X[{J_{i}}])^{0.5}L^{2},

and the second bound that

k=1Kh=1HxLkwk,h(x)bk,h(x)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x){b}_{k,h}(x)
\displaystyle\lesssim i=1miπX[Ii]T+i=1mHX[Ii]LRegret(K)\displaystyle\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{\pi}X[{I_{i}}]T}+\sum_{i=1}^{m}H\sqrt{X[{I_{i}}]L}\sqrt{{\textnormal{Regret}}(K)}
+m3H2maxiSimaxiX[Ii]L2+m1.5lH1.5(maxiSi)0.25(maxiX[Ii])0.75(maxiX[Ji])0.5L2.\displaystyle\quad+m^{3}H^{2}\max_{i}S_{i}\max_{i}X[{I_{i}}]L^{2}+m^{1.5}lH^{1.5}(\max_{i}S_{i})^{0.25}(\max_{i}X[{I_{i}}])^{0.75}(\max_{i}X[{J_{i}}])^{0.5}L^{2}.

Lemma 33 (Cumulative reward estimation error and optimism, Bernstein-style).

For F-EULER, outside the failure event \mathcal{B}, the sum over time of the reward estimation error and optimism satisfies that

k=1Kh=1HxΛkwk,h(x)2βk(xk,h)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)2{\beta}_{k}(x_{k,h}) i=1liX[Ji]TL+i=1lX[Ji]L2,\displaystyle\lesssim\sum_{i=1}^{l}\sqrt{\mathcal{R}_{i}X[{J_{i}}]T}L+\sum_{i=1}^{l}X[{J_{i}}]L^{2},

and that

k=1Kh=1HxΛkwk,h(x)2βk(xk,h)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)2{\beta}_{k}(x_{k,h}) i=1l𝒢2X[Ji]KL+i=1lX[Ji]L2.\displaystyle\lesssim\sum_{i=1}^{l}\sqrt{\mathcal{G}^{2}X[{J_{i}}]K}L+\sum_{i=1}^{l}X[{J_{i}}]L^{2}.
Proof.

The treatment of the cumulative βk(x){\beta}_{k}(x) term is essentially the same as in the proof of Lemma 8 in [42]. Recall that i:=maxx𝒳{Var[ri(x)]}\mathcal{R}_{i}:=\max_{x\in\mathcal{X}}\left\{\mathrm{Var}[r_{i}(x)]\right\}. Outside the failure event \mathcal{B} (specifically, 6\mathcal{B}_{6}),

k=1Kh=1HxΛkwk,h(x)βk(xk,h)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x){\beta}_{k}(x_{k,h})
=\displaystyle= k=1Kh=1HxΛkwk,h(x)i=1l(2𝕊[r^i(x)]LMi,k(x)+14L3Mi,k(x))\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)\sum_{i=1}^{l}\left(\sqrt{\frac{2\mathbb{S}[{\hat{r}}_{i}(x)]L}{M_{i,k}(x)}}+\frac{14L}{3M_{i,k}(x)}\right)
\displaystyle\leq k=1Kh=1HxΛkwk,h(x)i=1l(2Var(ri(x))LMi,k(x)+4LMi,k(x)+14L3Mi,k(x))\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)\sum_{i=1}^{l}\left(\sqrt{\frac{2\mathrm{Var}(r_{i}(x))L}{M_{i,k}(x)}}+\frac{\sqrt{4L}}{M_{i,k}(x)}+\frac{14L}{3M_{i,k}(x)}\right)
\displaystyle\lesssim Li=1lk=1Kh=1HxΛkwk,h(x)Mi,k(x)k=1Kh=1HxΛkwk,h(x)Var(ri(x))+X[Ji]L2\displaystyle\sqrt{L}\sum_{i=1}^{l}\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}\frac{w_{k,h}(x)}{M_{i,k}(x)}}\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)\mathrm{Var}(r_{i}(x))}+X[{J_{i}}]L^{2} (B.14)
\displaystyle\lesssim i=1liX[Ji]TL+X[Ji]L2,\displaystyle\sum_{i=1}^{l}\sqrt{\mathcal{R}_{i}X[{J_{i}}]T}L+X[{J_{i}}]L^{2},

where the second inequality is due to the Cauchy-Schwarz inequality, and the last inequality is due to Lemma 15.

Let sk,h𝒮s_{k,h}\in\mathcal{S} denote the state at step hh of episode kk and xk,h=(sk,h,πk(sk,h,h))x_{k,h}=(s_{k,h},\pi_{k}(s_{k,h},h)). Recall that h=1Hri(xk,h)𝒢\sum_{h=1}^{H}r_{i}(x_{k,h})\leq\mathcal{G} for any sequence {xk,h}h=1H\{x_{k,h}\}_{h=1}^{H}. To obtain the second bound, consider

Var[h=1Hri(xk,h)|{sk,h}h=1H]𝔼[(h=1Hri(xk,h))2|{sk,h}h=1H]𝒢2.\displaystyle\mathrm{Var}\left[\sum_{h=1}^{H}r_{i}(x_{k,h})\middle|\{s_{k,h}\}_{h=1}^{H}\right]\leq\mathbb{E}\left[\left(\sum_{h=1}^{H}r_{i}(x_{k,h})\right)^{2}\middle|\{s_{k,h}\}_{h=1}^{H}\right]\leq\mathcal{G}^{2}.

Taking the expectation over the trajectory {sk,h}h=1H\{s_{k,h}\}_{h=1}^{H} yields

𝔼{sk,h}h=1H[Var[h=1Hri(xk,h)|{sk,h}h=1H]]𝒢2.\displaystyle\mathbb{E}_{\{s_{k,h}\}_{h=1}^{H}}\left[\mathrm{Var}\left[\sum_{h=1}^{H}r_{i}(x_{k,h})\middle|\{s_{k,h}\}_{h=1}^{H}\right]\right]\leq\mathcal{G}^{2}.

Since ri()r_{i}(\cdot) has independent randomness across different steps, rewriting the above in the weight notation and summing over k[K]k\in[K], we have

k=1Kh=1HxΛkwk,h(x)Var(ri(x))K𝒢2.\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)\mathrm{Var}(r_{i}(x))\leq K\mathcal{G}^{2}.

Substituting the above into (B.14) yields

k=1Kh=1HxΛkwk,h(x)2βk(xk,h)i=1l𝒢2X[Ji]KL+i=1lX[Ji]L2.\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\Lambda_{k}}w_{k,h}(x)2{\beta}_{k}(x_{k,h})\lesssim\sum_{i=1}^{l}\sqrt{\mathcal{G}^{2}X[{J_{i}}]K}L+\sum_{i=1}^{l}X[{J_{i}}]L^{2}.

Lemma 34 (Cumulative correction term, Bernstein-style).

For F-EULER, outside the failure event \mathcal{B}, the sum over time of the correction term satisfies that

k=1Kh=1HxLkwk,h(x)P^k(x)P(x),V¯k,h+1Vh+1\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle
\displaystyle\leq k=1Kh=1HxLkwk,h(x)|P^k(x)P(x),V¯k,h+1Vh+1|\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in L_{k}}w_{k,h}(x)\left|\left\langle{\hat{P}}_{k}(x)-{P}(x),{\overline{V}}_{k,h+1}-{V_{h+1}^{*}}\right\rangle\right|
\displaystyle\lesssim m3H2(maxiSi)1.5maxiX[Ii]L2.5+mlH1.5(maxiSi)0.5(maxiX[Ii])0.5(maxiX[Ji])0.5L2.5.\displaystyle m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}+mlH^{1.5}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.5}(\max_{i}X[{J_{i}}])^{0.5}L^{2.5}.
Proof.

Replacing the failure event \mathcal{F} by \mathcal{B} and the lower-order term G0G_{0} by G1G_{1}, the proof of the bound on the cumulative correction term for F-UCBVI (Lemma 22) carries over. Note that the second term in the bound here has an extra L\sqrt{L} factor due to the difference between G0G_{0} and G1G_{1}. ∎

B.4 Regret bounds

The following lemma is useful in deriving the problem-dependent regret bounds of F-EULER.

Lemma 35 (Component variance bound).

For any finite set 𝒮=i=1m𝒮i\mathcal{S}=\bigotimes_{i=1}^{m}\mathcal{S}_{i}, any probability mass function P=i=1mPiΔ(𝒮){P}=\prod_{i=1}^{m}{P}_{i}\in\Delta(\mathcal{S}) where PiΔ(𝒮i){P}_{i}\in\Delta(\mathcal{S}_{i}) for i[m]i\in[m], and any vector V𝒮V\in\mathbb{R}^{\mathcal{S}},

VarPi𝔼Pi[V]VarP[V].\displaystyle\mathrm{Var}_{{P}_{i}}\mathbb{E}_{{P}_{-i}}[V]\leq\mathrm{Var}_{{P}}[V].
Proof.

By the definition of variance,

VarP[V]VarPi𝔼Pi[V]\displaystyle\mathrm{Var}_{{P}}[V]-\mathrm{Var}_{{P}_{i}}\mathbb{E}_{{P}_{-i}}[V] =𝔼P[V2](𝔼P[V])2𝔼Pi[(𝔼Pi[V])2]+(𝔼P[V])2\displaystyle=\mathbb{E}_{{P}}[V^{2}]-\left(\mathbb{E}_{{P}}[V]\right)^{2}-\mathbb{E}_{{P}_{i}}\left[\left(\mathbb{E}_{{P}_{-i}}[V]\right)^{2}\right]+\left(\mathbb{E}_{{P}}[V]\right)^{2}
=𝔼Pi[VarPi[V]]\displaystyle=\mathbb{E}_{{P}_{i}}\left[\mathrm{Var}_{{P}_{-i}}[V]\right]
0.\displaystyle\geq 0.

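Lemma 35 is an instance of the law of total variance. The following numerical sketch checks the inequality on a randomly generated two-component example; the component sizes, distributions, and value vector are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(4)
S1, S2 = 5, 4
P1 = rng.dirichlet(np.ones(S1))
P2 = rng.dirichlet(np.ones(S2))
V = rng.normal(size=(S1, S2))

P = np.outer(P1, P2)                                 # the product distribution on S_1 x S_2
var_full = np.sum(P * V**2) - np.sum(P * V)**2       # Var_P[V]

cond_mean = V @ P2                                   # E_{P_2}[V] as a function of s_1
var_comp = P1 @ cond_mean**2 - (P1 @ cond_mean)**2   # Var_{P_1} E_{P_2}[V]

assert var_comp <= var_full + 1e-12                  # the component variance bound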
B.4.1 Proof of Theorem 2

Proof.

To prove Theorem 2, we need to establish all four combinations of the bounds therein.

Outside the failure event \mathcal{B}, combining Lemmas 18, 31, 32, 33, and 34, for F-EULER, we obtain

Regret(K)\displaystyle{\textnormal{Regret}}(K) i=1miX[Ii]T+m3H2maxiSimaxiX[Ii]L2\displaystyle\lesssim\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{*}X[{I_{i}}]T}+m^{3}H^{2}\max_{i}S_{i}\max_{i}X[{I_{i}}]L^{2}
+m1.5lH1.5(maxiSi)0.25(maxiX[Ii])0.75(maxiX[Ji])0.5L2,\displaystyle\quad+m^{1.5}lH^{1.5}(\max_{i}S_{i})^{0.25}(\max_{i}X[{I_{i}}])^{0.75}(\max_{i}X[{J_{i}}])^{0.5}L^{2},
+i=1liX[Ji]TL+i=1lX[Ji]L2\displaystyle\quad+\sum_{i=1}^{l}\sqrt{\mathcal{R}_{i}X[{J_{i}}]T}L+\sum_{i=1}^{l}X[{J_{i}}]L^{2}
+m3H2(maxiSi)1.5maxiX[Ii]L2.5\displaystyle\quad+m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}
+mlH1.5(maxiSi)0.5(maxiX[Ii])0.5(maxiX[Ji])0.5L2.5\displaystyle\quad+mlH^{1.5}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.5}(\max_{i}X[{J_{i}}])^{0.5}L^{2.5}
+H2i=1mX[Ii]L+H2i=1lX[Ji]L\displaystyle\quad+H^{2}\sum_{i=1}^{m}X[{I_{i}}]L+H^{2}\sum_{i=1}^{l}X[{J_{i}}]L
i=1miX[Ii]T+i=1liX[Ji]TL+m3H2(maxiSi)1.5maxiX[Ii]L2.5\displaystyle\lesssim\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{*}X[{I_{i}}]T}+\sum_{i=1}^{l}\sqrt{\mathcal{R}_{i}X[{J_{i}}]T}L+m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}
+m1.5lH2(maxiSi)0.5(maxiX[Ii])0.75maxiX[Ji]L2.5,\displaystyle\quad+m^{1.5}lH^{2}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.75}\max_{i}X[{J_{i}}]L^{2.5},

and similarly,

Regret(K)\displaystyle{\textnormal{Regret}}(K) i=1miX[Ii]T+i=1l𝒢2X[Ji]KL+m3H2(maxiSi)1.5maxiX[Ii]L2.5\displaystyle\lesssim\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{*}X[{I_{i}}]T}+\sum_{i=1}^{l}\sqrt{\mathcal{G}^{2}X[{J_{i}}]K}L+m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}
+m1.5lH2(maxiSi)0.5(maxiX[Ii])0.75maxiX[Ji]L2.5.\displaystyle\quad+m^{1.5}lH^{2}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.75}\max_{i}X[{J_{i}}]L^{2.5}.

By definition,

i=1Tk=1Kh=1H𝔼πk[gi2(P,Vh+1)|sk,1](2L)2𝒬i=4𝒬iL.\displaystyle\mathbb{C}_{i}^{*}=\frac{1}{T}\sum_{k=1}^{K}\sum_{h=1}^{H}\mathbb{E}_{\pi_{k}}[g_{i}^{2}(P,{V_{h+1}^{*}})|s_{k,1}]\leq(2\sqrt{L})^{2}\mathcal{Q}_{i}=4\mathcal{Q}_{i}L.

Substituting the above bound on i\mathbb{C}_{i}^{*} into the above two regret bounds, we obtain two combinations of the bounds in Theorem 2. Specifically, assuming Tpoly(m,l,maxiSi,maxiX[Ii],H)T\geq\mathrm{poly}(m,l,\max_{i}S_{i},\max_{i}X[{I_{i}}],H), we have

Regret(K)\displaystyle{\textnormal{Regret}}(K) =𝒪~(i=1m𝒬iX[Ii]T+i=1liX[Ji]T),\displaystyle=\tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{\mathcal{Q}_{i}X[{I_{i}}]T}+\sum_{i=1}^{l}\sqrt{\mathcal{R}_{i}X[{J_{i}}]T}),
Regret(K)\displaystyle{\textnormal{Regret}}(K) =𝒪~(i=1m𝒬iX[Ii]T+i=1l𝒢2X[Ji]K).\displaystyle=\tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{\mathcal{Q}_{i}X[{I_{i}}]T}+\sum_{i=1}^{l}\sqrt{\mathcal{G}^{2}X[{J_{i}}]K}).

For the other two combinations of the regret bounds of F-EULER, outside the failure event \mathcal{B}, we again combine Lemmas 18, 31, 32, 33, and 34, now with the alternative bounds in Lemmas 31 and 32, and obtain

Regret(K)\displaystyle{\textnormal{Regret}}(K) i=1miπX[Ii]T+i=1mHX[Ii]LRegret(K)+i=1liX[Ji]TL\displaystyle\lesssim\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{\pi}X[{I_{i}}]T}+\sum_{i=1}^{m}H\sqrt{X[{I_{i}}]L}\sqrt{{\textnormal{Regret}}(K)}+\sum_{i=1}^{l}\sqrt{\mathcal{R}_{i}X[{J_{i}}]T}L (B.15)
+m3H2(maxiSi)1.5maxiX[Ii]L2.5\displaystyle\quad+m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}
+m1.5lH2(maxiSi)0.5(maxiX[Ii])0.75maxiX[Ji]L2.5,\displaystyle\quad+m^{1.5}lH^{2}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.75}\max_{i}X[{J_{i}}]L^{2.5},

and

Regret(K)\displaystyle{\textnormal{Regret}}(K) i=1miπX[Ii]T+i=1mHX[Ii]LRegret(K)+i=1l𝒢2X[Ji]KL\displaystyle\lesssim\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{\pi}X[{I_{i}}]T}+\sum_{i=1}^{m}H\sqrt{X[{I_{i}}]L}\sqrt{{\textnormal{Regret}}(K)}+\sum_{i=1}^{l}\sqrt{\mathcal{G}^{2}X[{J_{i}}]K}L (B.16)
+m3H2(maxiSi)1.5maxiX[Ii]L2.5\displaystyle\quad+m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}
+m1.5lH2(maxiSi)0.5(maxiX[Ii])0.75maxiX[Ji]L2.5.\displaystyle\quad+m^{1.5}lH^{2}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.75}\max_{i}X[{J_{i}}]L^{2.5}.

For an inequality of the form yAy+By\lesssim A\sqrt{y}+B, solving the quadratic in y\sqrt{y} gives

yA+A2+4B2A+B,\displaystyle\sqrt{y}\lesssim\frac{A+\sqrt{A^{2}+4B}}{2}\lesssim A+\sqrt{B},

which yields yA2+B+2ABA2+By\lesssim A^{2}+B+2A\sqrt{B}\lesssim A^{2}+B, where the last inequality is due to the AM-GM inequality.
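As a quick numerical check of this self-bounding step (with arbitrary nonnegative constants A and B chosen for illustration):

import numpy as np

A, B = 3.0, 10.0                                  # arbitrary nonnegative constants
sqrt_y = (A + np.sqrt(A**2 + 4 * B)) / 2          # largest root of y = A*sqrt(y) + B
y = sqrt_y**2
assert sqrt_y <= A + np.sqrt(B)                   # sqrt(y) <~ A + sqrt(B)
assert y <= 2 * (A**2 + B)                        # hence y <~ A^2 + B

Applying the above solution to (B.15) and (B.16) yields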

Regret(K)\displaystyle{\textnormal{Regret}}(K) i=1miπX[Ii]T+i=1liX[Ji]TL+m3H2(maxiSi)1.5maxiX[Ii]L2.5\displaystyle\lesssim\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{\pi}X[{I_{i}}]T}+\sum_{i=1}^{l}\sqrt{\mathcal{R}_{i}X[{J_{i}}]T}L+m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5} (B.17)
+m1.5lH2(maxiSi)0.5(maxiX[Ii])0.75maxiX[Ji]L2.5.\displaystyle\quad+m^{1.5}lH^{2}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.75}\max_{i}X[{J_{i}}]L^{2.5}. (B.18)

and

Regret(K)\displaystyle{\textnormal{Regret}}(K) i=1miπX[Ii]T+i=1l𝒢2X[Ji]KL+m3H2(maxiSi)1.5maxiX[Ii]L2.5\displaystyle\lesssim\sum_{i=1}^{m}\sqrt{\mathbb{C}_{i}^{\pi}X[{I_{i}}]T}+\sum_{i=1}^{l}\sqrt{\mathcal{G}^{2}X[{J_{i}}]K}L+m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5} (B.19)
+m1.5lH2(maxiSi)0.5(maxiX[Ii])0.75maxiX[Ji]L2.5.\displaystyle\quad+m^{1.5}lH^{2}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.75}\max_{i}X[{J_{i}}]L^{2.5}. (B.20)

With xk,h=(sk,h,πk(sk,h,h))x_{k,h}=(s_{k,h},\pi_{k}(s_{k,h},h)) where sk,h𝒮s_{k,h}\in\mathcal{S} denotes the state in step hh of episode kk, by definition,

iπ\displaystyle\mathbb{C}_{i}^{\pi} =1Tk=1Kh=1Hx𝒳wk,h(x)gi2(P(x),Vh+1πk)\displaystyle=\frac{1}{T}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)g_{i}^{2}({P}(x),V_{h+1}^{\pi_{k}})
=4LTk=1Kh=1Hx𝒳wk,h(x)VarPi(x)𝔼Pi(x)[Vh+1πk]\displaystyle=\frac{4L}{T}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\mathrm{Var}_{{P}_{i}(x)}\mathbb{E}_{{P}_{-i}(x)}[V_{h+1}^{\pi_{k}}]
4LTk=1Kh=1Hx𝒳wk,h(x)VarP(x)[Vh+1πk]\displaystyle\leq\frac{4L}{T}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{x\in\mathcal{X}}w_{k,h}(x)\mathrm{Var}_{{P}(x)}[V_{h+1}^{\pi_{k}}]
=4LTk=1Kh=1H𝔼πk[VarP(xk,h)[Vh+1πk|sk,h]|sk,1]\displaystyle=\frac{4L}{T}\sum_{k=1}^{K}\sum_{h=1}^{H}\mathbb{E}_{\pi_{k}}\left[\mathrm{Var}_{{P}(x_{k,h})}[V_{h+1}^{\pi_{k}}|s_{k,h}]\middle|s_{k,1}\right]
4LTk=1K𝔼πk[(h=1HR(xk,h)V1πk(sk,1))2|sk,1]\displaystyle\leq\frac{4L}{T}\sum_{k=1}^{K}\mathbb{E}_{\pi_{k}}\left[\left(\sum_{h=1}^{H}R(x_{k,h})-V_{1}^{\pi_{k}}(s_{k,1})\right)^{2}\middle|s_{k,1}\right]
4K𝒢2LT=4𝒢2LH,\displaystyle\leq\frac{4K\mathcal{G}^{2}L}{T}=\frac{4\mathcal{G}^{2}L}{H},

where the first inequality is due to the component variance bound (Lemma 35), the second inequality is due to a law of total variance argument (see Lemma 15 in [42] for a proof), and the last inequality holds since both h=1HR(xk,h)\sum_{h=1}^{H}R(x_{k,h}) and V1πk(sk,1)V_{1}^{\pi_{k}}(s_{k,1}) lie in [0,𝒢][0,\mathcal{G}]. Substituting the above bound on iπ\mathbb{C}_{i}^{\pi} into (B.17) and (B.19), we obtain the other two combinations of the bounds in Theorem 2. Specifically, assuming Tpoly(m,l,maxiSi,maxiX[Ii],H)T\geq\mathrm{poly}(m,l,\max_{i}S_{i},\max_{i}X[{I_{i}}],H), we have

Regret(K)\displaystyle{\textnormal{Regret}}(K) =𝒪~(i=1m𝒢2X[Ii]K+i=1liX[Ji]T),\displaystyle=\tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{\mathcal{G}^{2}X[{I_{i}}]K}+\sum_{i=1}^{l}\sqrt{\mathcal{R}_{i}X[{J_{i}}]T}),
Regret(K)\displaystyle{\textnormal{Regret}}(K) =𝒪~(i=1m𝒢2X[Ii]K+i=1l𝒢2X[Ji]K).\displaystyle=\tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{\mathcal{G}^{2}X[{I_{i}}]K}+\sum_{i=1}^{l}\sqrt{\mathcal{G}^{2}X[{J_{i}}]K}).

To accommodate the case of known rewards, it suffices to remove the parts related to reward estimation and reward bonuses in both the algorithm and the analysis, which yields the regret bound

min{𝒪~(i=1m𝒬iX[Ii]T),𝒪~(i=1m𝒢X[Ii]K)}.\displaystyle\min\left\{\tilde{\mathcal{O}}\left(\sum_{i=1}^{m}\sqrt{\mathcal{Q}_{i}X[{I_{i}}]T}\right),\tilde{\mathcal{O}}\left(\sum_{i=1}^{m}\sqrt{\mathcal{G}X[{I_{i}}]K}\right)\right\}.

B.4.2 Proof of Corollary 3

Proof.

Since r(x)[0,1]r(x)\in[0,1] and R(x)[0,1]R(x)\in[0,1] for all x𝒳x\in\mathcal{X}, we have 𝒢H\mathcal{G}\leq H and i=maxx𝒳[Var(ri(x))]1\mathcal{R}_{i}=\max_{x\in\mathcal{X}}[\mathrm{Var}(r_{i}(x))]\leq 1 for all i[l]i\in[l]. Then by Theorem 2, for F-EULER,

Regret(K)\displaystyle{\textnormal{Regret}}(K) i=1m𝒢2X[Ii]KL+i=1liX[Ji]TL+m3H2(maxiSi)1.5maxiX[Ii]L2.5\displaystyle\lesssim\sum_{i=1}^{m}\sqrt{\mathcal{G}^{2}X[{I_{i}}]KL}+\sum_{i=1}^{l}\sqrt{\mathcal{R}_{i}X[{J_{i}}]T}L+m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}
+m1.5lH2(maxiSi)0.5(maxiX[Ii])0.75maxiX[Ji]L2.5\displaystyle\quad+m^{1.5}lH^{2}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.75}\max_{i}X[{J_{i}}]L^{2.5}
i=1mHX[Ii]TL+i=1lX[Ji]TL+m3H2(maxiSi)1.5maxiX[Ii]L2.5\displaystyle\lesssim\sum_{i=1}^{m}\sqrt{HX[{I_{i}}]TL}+\sum_{i=1}^{l}\sqrt{X[{J_{i}}]T}L+m^{3}H^{2}(\max_{i}S_{i})^{1.5}\max_{i}X[{I_{i}}]L^{2.5}
+m1.5lH2(maxiSi)0.5(maxiX[Ii])0.75maxiX[Ji]L2.5.\displaystyle\quad+m^{1.5}lH^{2}(\max_{i}S_{i})^{0.5}(\max_{i}X[{I_{i}}])^{0.75}\max_{i}X[{J_{i}}]L^{2.5}.

Assuming Tpoly(m,l,maxiSi,maxiX[Ii],H)T\geq\mathrm{poly}(m,l,\max_{i}S_{i},\max_{i}X[{I_{i}}],H), we further have

Regret(K)=𝒪~(i=1mHX[Ii]T+i=1lX[Ji]T).\displaystyle{\textnormal{Regret}}(K)=\tilde{\mathcal{O}}\left(\sum_{i=1}^{m}\sqrt{HX[{I_{i}}]T}+\sum_{i=1}^{l}\sqrt{X[{J_{i}}]T}\right).

In the case of known rewards, again, by removing the parts related to reward estimation and reward bonuses in both the algorithm and the analysis, we obtain the regret bound 𝒪~(i=1mHX[Ii]T)\tilde{\mathcal{O}}(\sum_{i=1}^{m}\sqrt{HX[{I_{i}}]T}). ∎

Appendix C Lower bounds for FMDPs

C.1 Proof for degenerate case 1 (Theorem 4)

Proof.

Without loss of generality, assume argmaxiX[Ji]=1\operatorname*{argmax}_{i}X[{J_{i}}^{\prime}]=1. Let the reward function depend only on 𝒳[J1]\mathcal{X}[{J_{1}}^{\prime}]. Then, for arbitrary transitions, learning in this FMDP can be converted to an MAB problem with X[J1]X[{J_{1}}^{\prime}] arms, whose regret in TT steps has the lower bound Ω(X[J1]T)\Omega(\sqrt{X[{J_{1}}^{\prime}]T}) [25]. ∎

C.2 Proof for degenerate case 2 (Theorem 5)

Proof.

The construction is essentially the same as that in the proof of Theorem 4. Without loss of generality, assume argmaxiX[Ji]=1\operatorname*{argmax}_{i}X[{J_{i}}]=1. Let the reward function depend only on 𝒳[J1]\mathcal{X}[{J_{1}}]. Then, for arbitrary transitions, learning in this FMDP can be converted to an MAB problem with X[J1]X[{J_{1}}] arms, whose regret in TT steps has the lower bound Ω(X[J1]T)\Omega(\sqrt{X[{J_{1}}]T}) [25]. ∎

C.3 Proof for the nondegenerate case (Theorem 6)

Proof.

The proof of this lower bound relies on the Ω(HSAT)\Omega(\sqrt{HSAT}) regret lower bound for nonfactored MDPs. The basic idea for proving the Ω(HSAT)\Omega(\sqrt{HSAT}) regret bound in [20] is to construct an MDP with 33 states (an initial state and two states with rewards 0 and 11, respectively) and AA actions, where the state remains unchanged for the following H1H-1 steps after one step of transition. This MDP is equivalent to an MAB with AA arms, which has the lower bound Ω((H1)AK)\Omega((H-1)\sqrt{AK}). Making S/3S/3 copies of this MAB-like MDP and restarting at each copy uniformly at random yields the expected regret lower bound Ω(HSAT)\Omega(\sqrt{HSAT}).

Here, without loss of generality, assume argmaxiX[Ii]=1\operatorname*{argmax}_{i}X[{I_{i}}]=1. We then consider only the transition of 𝒮1\mathcal{S}_{1} and neglect the rest. Let the reward function depend only on 𝒮1\mathcal{S}_{1}. Let the transition of 𝒮1\mathcal{S}_{1} depend only on 𝒮1\mathcal{S}_{1} and 𝒳[I1]\mathcal{X}[{I_{1}}^{\prime}], where I1=I1{m+1,,n}{I_{1}}^{\prime}={I_{1}}\cap\{m+1,\cdots,n\} contains only the action component indices. Learning in this component can then be converted to learning in a nonfactored MDP with S1S_{1} states and X[I1]X[{I_{1}}^{\prime}] actions, which has an Ω(HS1X[I1]T)\Omega(\sqrt{HS_{1}X[{I_{1}}^{\prime}]T}) regret lower bound.

Now consider the other state components in 𝒳[I1]\mathcal{X}[{I_{1}}], which we denote by 𝒮1=𝒳[I1([m]\{1})]\mathcal{S}^{\prime}_{1}=\mathcal{X}[{I_{1}}\cap([m]\backslash\{1\})], where “\\backslash” denotes set subtraction. Then X[I1]=X[I1]S1S1X[{I_{1}}]=X[{I_{1}}^{\prime}]\cdot S^{\prime}_{1}\cdot S_{1}. Make S1S^{\prime}_{1} copies of the above FMDP, and restart at each copy uniformly at random, so that each copy is expected to run T/S1T/S^{\prime}_{1} steps (K/S1K/S^{\prime}_{1} episodes). The regret lower bound is then given by Ω(S1HS1X[I1]T/S1)=Ω(HX[I1]T)\Omega(S^{\prime}_{1}\sqrt{HS_{1}X[{I_{1}}^{\prime}]T/S^{\prime}_{1}})=\Omega(\sqrt{HX[{I_{1}}]T}). ∎
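For completeness, the algebra behind the last equality uses X[I1]=X[I1]S1S1X[{I_{1}}]=X[{I_{1}}^{\prime}]\cdot S^{\prime}_{1}\cdot S_{1}:

\displaystyle S^{\prime}_{1}\sqrt{HS_{1}X[{I_{1}}^{\prime}]T/S^{\prime}_{1}}=\sqrt{HS_{1}X[{I_{1}}^{\prime}]S^{\prime}_{1}T}=\sqrt{HX[{I_{1}}]T}.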

C.4 Proof for the more general nondegenerate case (Theorem 7)

Proof.

We refer to the number of intermediate state components in an influence loop as its length. For a state component with the loop property, we define its minimum influence loop as the one with the minimum length. Without loss of generality, assume argmaxiX[Ii]=1\operatorname*{argmax}_{i\in\mathcal{I}}X[{I_{i}}]=1 and that the minimum influence loop of the first state component is

𝒮1𝒮2𝒮u1𝒮u𝒮1.\displaystyle\mathcal{S}_{1}\to\mathcal{S}_{2}\to\dots\to\mathcal{S}_{u-1}\to\mathcal{S}_{u}\to\mathcal{S}_{1}. (C.1)

Since the above influence loop is minimum, we have iI1i\not\in{I_{1}} for all i{2,,u1}i\in\{2,\cdots,u-1\}. Let I1,s:=I1([m]\{u}){I_{1,s}}:={I_{1}}\cap([m]\backslash\{u\}) and I1,a:=I1{m+1,,n}{I_{1,a}}:={I_{1}}\cap\{m+1,\cdots,n\} be the state part (excluding uu) and the action part of the scope index set of 𝒮1\mathcal{S}_{1}, respectively. Then X[I1]=SuX[I1,s]X[I1,a]X[{I_{1}}]=S_{u}X[{I_{1,s}}]X[{I_{1,a}}]. Note that in the proof below, we can actually relax the assumption that Si3S_{i}\geq 3 for all ii to the assumption that Su3S_{u}\geq 3 and Si2S_{i}\geq 2 for all i[u1]i\in[u-1]. In the case where 𝒮1\mathcal{S}_{1} has the self-loop property, the proof below carries over by letting u=1u=1.

For the space 𝒮i\mathcal{S}_{i} in the loop (C.1), we define two special values s+i,si𝒮is_{+}^{i},s_{-}^{i}\in\mathcal{S}_{i} as positive and negative state component values, respectively. Let the reward function be the indicator function of whether at least one of the state components in the loop takes its positive value, i.e., for x=(s,a)x=(s,a),

R(x):=maxi[u]{𝕀(s[i]=s+i)}.\displaystyle R(x):=\max_{i\in[u]}\{\mathbb{I}(s[i]=s_{+}^{i})\}.

Construct the transition of s[1]s[1] with dependence only on 𝒮u\mathcal{S}_{u} and 𝒳[I1,a]\mathcal{X}[{I_{1,a}}]. If s[u]=s+us[u]=s_{+}^{u} (or sus_{-}^{u}, respectively), then s[1]s[1] follows s[u]s[u] in the next step and transits to s+1s_{+}^{1} (or s1s_{-}^{1}, respectively) deterministically. Otherwise, s[1]s[1] also transits to s+1s_{+}^{1} or s1s_{-}^{1}, but with probabilities specified by the action components in 𝒳[I1,a]\mathcal{X}[{I_{1,a}}]. Construct the transition of the intermediate state components s[i],i{2,,u}s[i],i\in\{2,\cdots,u\} with dependence only on s[i1]s[i-1]. If s[i1]=s+i1s[i-1]=s_{+}^{i-1}, then in the next step s[i]s[i] transits to s+is_{+}^{i}; otherwise s[i]s[i] transits to sis_{-}^{i}. The transitions of other state components are arbitrary and irrelevant to the regret. Therefore, after the first step, the positive and negative state component values shift their places in a cyclic way within the influence loop.

For initialization, we choose s[u]s[u] to be arbitrary in 𝒮u\{su+,su}\mathcal{S}_{u}\backslash\{s^{u}_{+},s^{u}_{-}\}, s[i]s[i] to be arbitrary in 𝒮i\{si+}\mathcal{S}_{i}\backslash\{s^{i}_{+}\} for i[u1]i\in[u-1], and s[I1,s]s[{I_{1,s}}] to be arbitrary in 𝒳[I1,s]\mathcal{X}[{I_{1,s}}], where we apply the scope operation (Definition 1) to ss. The initializations of other state components are arbitrary and irrelevant to the regret. In this way, after the first step of transition, there can be zero or one positive state component values in the influence loop, depending on the action components in 𝒳[I1,a]\mathcal{X}[{I_{1,a}}]. Moreover, the number of positive state components in the loop remains the same for H1H-1 steps until the end of the episode.

Therefore, for any s[u]𝒮u\{s+u,su}s[u]\in\mathcal{S}_{u}\backslash\{s_{+}^{u},s_{-}^{u}\} and any s[I1,s]𝒳[I1,s]s[{I_{1,s}}]\in\mathcal{X}[{I_{1,s}}], learning in the above FMDP can be converted to an MAB problem with X[I1,a]X[{I_{1,a}}] arms, where the per-episode reward is (H1)(H-1) if s+1s_{+}^{1} is reached at the second time step and 0 otherwise. The regret of such an MAB has the lower bound Ω((H1)X[I1,a]K)=Ω(HX[I1,a]T)\Omega((H-1)\sqrt{X[{I_{1,a}}]K})=\Omega(\sqrt{HX[{I_{1,a}}]T}) [25]. Splitting 𝒮u\{s+u,su}\mathcal{S}_{u}\backslash\{s_{+}^{u},s_{-}^{u}\} and 𝒳[I1,s]\mathcal{X}[{I_{1,s}}] to make (Su2)X[I1,s](S_{u}-2)X[{I_{1,s}}] copies of this MAB-like FMDP and restarting at each copy uniformly at random, we obtain the lower bound on regret

Ω((Su2)X[I1,s]HX[I1,a]T(Su2)X[I1,s])=Ω(HX[I1]T).\displaystyle\Omega\left((S_{u}-2)X[{I_{1,s}}]\sqrt{HX[{I_{1,a}}]\left\lfloor\frac{T}{(S_{u}-2)X[{I_{1,s}}]}\right\rfloor}\right)=\Omega(\sqrt{HX[{I_{1}}]T}).

Note that making Su/3S_{u}/3 copies is unnecessary since different copies can share s+us_{+}^{u} and sus_{-}^{u}. ∎
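To make the cyclic dynamics concrete, consider a hypothetical small instance with u=2u=2, i.e., a two-component loop 𝒮1𝒮2𝒮1\mathcal{S}_{1}\to\mathcal{S}_{2}\to\mathcal{S}_{1}. The initialization picks s[2]𝒮2\{s+2,s2}s[2]\in\mathcal{S}_{2}\backslash\{s_{+}^{2},s_{-}^{2}\} and s[1]s+1s[1]\neq s_{+}^{1}. After the first transition, s[1]s[1] equals s+1s_{+}^{1} or s1s_{-}^{1} with probabilities determined by the action components in 𝒳[I1,a]\mathcal{X}[{I_{1,a}}], while s[2]s[2] becomes s2s_{-}^{2}; from then on the two components simply exchange their values every step, e.g.,

\displaystyle(s_{+}^{1},s_{-}^{2})\to(s_{-}^{1},s_{+}^{2})\to(s_{+}^{1},s_{-}^{2})\to\cdots,

so the episode collects reward H1H-1 exactly when the first transition lands on s+1s_{+}^{1}, which is the MAB structure used in the proof above.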

Appendix D Lower bound for MDPs via the JAO MDP construction

As pointed out in [1], “they (the 𝒪~(HSAT)\tilde{\mathcal{O}}(\sqrt{HSAT}) upper bounds) help to establish the information-theoretic lower bound of reinforcement learning at Ω(HSAT)\Omega(\sqrt{HSAT}). … Moving from this big picture insight to an analytically rigorous bound is non-trivial.” A rigorous Ω(HSAT)\Omega(\sqrt{HSAT}) lower bound proof is possible [20] by constructing MAB-like MDPs [7]. Meanwhile, the MDP literature suggests that the JAO MDP [19], which establishes the minimax lower bound for nonepisodic MDPs, also establishes the minimax lower bound for episodic MDPs. We examine this question in detail and find that a direct episodic extension of the JAO MDP actually establishes the lower bound at Ω(HSAT/logT)\Omega(\sqrt{HSAT/\log T}), missing a log factor. It is not clear whether this result can be further improved with the same construction. The rest of this section presents our derivation, which we recommend reading alongside [19, Section 6].

Recall that the JAO MDP is an MDP with two states s0,s1s_{0},s_{1} and A=(A1)/2A^{\prime}=\lfloor(A-1)/2\rfloor actions. The transition probability from s1s_{1} to s0s_{0} is δ\delta for all actions, and the transition probability from s0s_{0} to s1s_{1} is δ\delta for all actions except that it is δ+ϵ\delta+\epsilon for one special action aa^{*}. The reward is 11 for each step at s1s_{1} and 0 otherwise. By making S=S/2S^{\prime}=\lfloor S/2\rfloor copies of the JAO MDP one extends the construction to SS or S1S-1 states, which then reduces to a two-state JAO MDP with SAS^{\prime}A^{\prime} actions. In the episodic setting, we simply start the MDP at s0s_{0} for each episode. The symbols δ,ϵ,A\delta,\epsilon,A^{\prime} have the same meanings as they do in [19], and we use SS^{\prime} to replace kk in [19]. Let 𝔼a,𝔼unif\mathbb{E}_{a},\mathbb{E}_{{\textnormal{unif}}} denote the expectation under a[SA]a\in[S^{\prime}A^{\prime}] being the better action aa^{*} and there being no better action respectively, and a,unif\mathbb{P}_{a},\mathbb{P}_{{\textnormal{unif}}} denote the corresponding probability measures. Let 𝔼\mathbb{E}_{*} denote the expectation under a uniformly random choice of the better action. Hence, 𝔼[f]=1SAa=1SA𝔼a[f]\mathbb{E}_{*}[f]=\frac{1}{S^{\prime}A^{\prime}}\sum_{a=1}^{S^{\prime}A^{\prime}}\mathbb{E}_{a}[f]. Let N1,N0,N0N_{1},N_{0},N_{0}^{*} denote the number of visits to state s1s_{1}, to state s0s_{0} and to state-action pair (s0,a)(s_{0},a^{*}) respectively. We use another subscript kk to denote the corresponding quantity in the kkth episode.
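To make the construction concrete, the following minimal simulation sketch runs episodes of the reduced two-state JAO MDP started at s0s_{0}; the values of δ,ϵ,H\delta,\epsilon,H, the number of actions, and the identity of the better action below are hypothetical illustration choices, not the ones used in the derivation.

```python
import random

def run_episode(better_action, delta, eps, H, policy, rng):
    """One H-step episode of the two-state JAO MDP started at s0; returns N_1 (= total reward)."""
    state = 0  # 0 stands for s0, 1 stands for s1
    visits_to_s1 = 0
    for h in range(H):
        action = policy(state, h)
        if state == 1:
            visits_to_s1 += 1             # reward 1 for each step spent at s1
            if rng.random() < delta:      # s1 -> s0 with probability delta, for all actions
                state = 0
        else:
            p_up = delta + eps if action == better_action else delta
            if rng.random() < p_up:       # s0 -> s1, slightly more likely under the better action
                state = 1
    return visits_to_s1

rng = random.Random(0)
delta, eps, H, num_actions = 0.05, 0.01, 100, 8   # hypothetical illustration values
uniform_policy = lambda s, h: rng.randrange(num_actions)
episodes = [run_episode(3, delta, eps, H, uniform_policy, rng) for _ in range(2000)]
print(sum(episodes) / len(episodes))  # close to H/2 - (1-(1-2*delta)**H)/(4*delta); cf. (D.2) below
```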

Step 1

Let H=1/δH^{\prime}=1/\delta.

𝔼a[N1,k]𝔼a[N0,k]+ϵH𝔼a[N0,k]12δ+(12δ)H2δ.\displaystyle\mathbb{E}_{a}[N_{1,k}]\leq\mathbb{E}_{a}[N_{0,k}]+\epsilon H^{\prime}\mathbb{E}_{a}[N_{0,k}^{*}]-\frac{1}{2\delta}+\frac{(1-2\delta)^{H}}{2\delta}.
Proof.

Here we adopt a more refined analysis than that in [19]: we keep track of the probability of the state at the last step of the episode, because a constant deviation in each episode would accumulate to a relaxation on the order of 𝒪(K){\mathcal{O}}(K) in total.

𝔼a[N1,k]\displaystyle\mathbb{E}_{a}[N_{1,k}] =h=2Ha(sk,h=s1|sk,h1=s0)a(sk,h1=s0)\displaystyle=\sum_{h=2}^{H}\mathbb{P}_{a}(s_{k,h}=s_{1}|s_{k,h-1}=s_{0})\cdot\mathbb{P}_{a}(s_{k,h-1}=s_{0})
+h=2Ha(sk,h=s1|sk,h1=s1)a(sk,h1=s1)\displaystyle\quad+\sum_{h=2}^{H}\mathbb{P}_{a}(s_{k,h}=s_{1}|s_{k,h-1}=s_{1})\cdot\mathbb{P}_{a}(s_{k,h-1}=s_{1})
=δh=2Ha(sk,h1=s0,ak,h1a)+(δ+ϵ)h=2Ha(sk,h1=s0,ak,h1=a)\displaystyle=\delta\sum_{h=2}^{H}\mathbb{P}_{a}(s_{k,h-1}=s_{0},a_{k,h-1}\neq a)+(\delta+\epsilon)\sum_{h=2}^{H}\mathbb{P}_{a}(s_{k,h-1}=s_{0},a_{k,h-1}=a)
+(1δ)h=2Ha(sk,h1=s1)\displaystyle\quad+(1-\delta)\sum_{h=2}^{H}\mathbb{P}_{a}(s_{k,h-1}=s_{1})
=δ𝔼a[N0,kN0,k]+(δ+ϵ)𝔼a[N0,k]+(1δ)𝔼a[N1,k]δa(sk,H=s0)(1δ)a(sk,H=s1).\displaystyle=\delta\mathbb{E}_{a}[N_{0,k}-N_{0,k}^{*}]+(\delta+\epsilon)\mathbb{E}_{a}[N_{0,k}^{*}]+(1-\delta)\mathbb{E}_{a}[N_{1,k}]-\delta\mathbb{P}_{a}(s_{k,H}=s_{0})-(1-\delta)\mathbb{P}_{a}(s_{k,H}=s_{1}).

Note that, since the better action can only increase the probability of being at s1s_{1},

a(sk,H=s1)\displaystyle\mathbb{P}_{a}(s_{k,H}=s_{1}) unif(sk,H=s1)=1212(12δ)H1.\displaystyle\geq\mathbb{P}_{{\textnormal{unif}}}(s_{k,H}=s_{1})=\frac{1}{2}-\frac{1}{2}(1-2\delta)^{H-1}.

Then for δ<12\delta<\frac{1}{2},

δa(sk,H=s0)+(1δ)a(sk,H=s1)\displaystyle\delta\mathbb{P}_{a}(s_{k,H}=s_{0})+(1-\delta)\mathbb{P}_{a}(s_{k,H}=s_{1}) =δ(1a(sk,H=s1))+(1δ)a(sk,H=s1)\displaystyle=\delta(1-\mathbb{P}_{a}(s_{k,H}=s_{1}))+(1-\delta)\mathbb{P}_{a}(s_{k,H}=s_{1})
δ(12+12(12δ)H1)+(1δ)(1212(12δ)H1)\displaystyle\geq\delta\left(\frac{1}{2}+\frac{1}{2}(1-2\delta)^{H-1}\right)+(1-\delta)\left(\frac{1}{2}-\frac{1}{2}(1-2\delta)^{H-1}\right)
=12(12δ)H2.\displaystyle=\frac{1}{2}-\frac{(1-2\delta)^{H}}{2}.

Hence,

𝔼a[N1,k]\displaystyle\mathbb{E}_{a}[N_{1,k}] δ𝔼a[N0,kN0,k]+(δ+ϵ)𝔼a[N0,k]+(1δ)𝔼a[N1,k]12+(12δ)H2.\displaystyle\leq\delta\mathbb{E}_{a}[N_{0,k}-N_{0,k}^{*}]+(\delta+\epsilon)\mathbb{E}_{a}[N_{0,k}^{*}]+(1-\delta)\mathbb{E}_{a}[N_{1,k}]-\frac{1}{2}+\frac{(1-2\delta)^{H}}{2}.

By rearranging the terms,

𝔼a[N1,k]𝔼a[N0,k]+ϵH𝔼a[N0,k]12δ+(12δ)H2δ,\displaystyle\mathbb{E}_{a}[N_{1,k}]\leq\mathbb{E}_{a}[N_{0,k}]+\epsilon H^{\prime}\mathbb{E}_{a}[N_{0,k}^{*}]-\frac{1}{2\delta}+\frac{(1-2\delta)^{H}}{2\delta},

where we use H=1δH^{\prime}=\frac{1}{\delta}. ∎

Step 2

Let RR denote the cumulative reward collected by a given algorithm over KK episodes. Then, assuming 𝔼a[N0]𝔼unif[N0]\mathbb{E}_{a}[N_{0}]\leq\mathbb{E}_{{\textnormal{unif}}}[N_{0}], by Step 1 and the identity N0+N1=KHN_{0}+N_{1}=KH we have

𝔼a[R]=𝔼a[N1]=k=1K𝔼a[N1,k]KHk=1K𝔼unif[N1,k]+ϵHk=1K𝔼a[N0,k]K2δ+K(12δ)H2δ.\displaystyle\mathbb{E}_{a}[R]=\mathbb{E}_{a}[N_{1}]=\sum_{k=1}^{K}\mathbb{E}_{a}[N_{1,k}]\leq KH-\sum_{k=1}^{K}\mathbb{E}_{{\textnormal{unif}}}[N_{1,k}]+\epsilon H^{\prime}\sum_{k=1}^{K}\mathbb{E}_{a}[N_{0,k}^{*}]-\frac{K}{2\delta}+K\frac{(1-2\delta)^{H}}{2\delta}. (D.1)
Step 3

Independently of the above two steps,

𝔼unif[N1,k]=t=1Hunif(st=s1)=t=1H(12+(12δ)t1(012))=H21(12δ)H4δHH2.\displaystyle\mathbb{E}_{{\textnormal{unif}}}[N_{1,k}]=\sum_{t=1}^{H}\mathbb{P}_{{\textnormal{unif}}}(s_{t}=s_{1})=\sum_{t=1}^{H}\left(\frac{1}{2}+(1-2\delta)^{t-1}\left(0-\frac{1}{2}\right)\right)=\frac{H}{2}-\frac{1-(1-2\delta)^{H}}{4\delta}\geq\frac{H-H^{\prime}}{2}. (D.2)

Therefore,

𝔼unif[N0,k]H+H2.\displaystyle\mathbb{E}_{{\textnormal{unif}}}[N_{0,k}]\leq\frac{H+H^{\prime}}{2}. (D.3)
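The per-step probability used in (D.2) follows from the one-step recursion under unif\mathbb{P}_{{\textnormal{unif}}}: writing pt:=unif(st=s1)p_{t}:=\mathbb{P}_{{\textnormal{unif}}}(s_{t}=s_{1}),

\displaystyle p_{t+1}=(1-\delta)p_{t}+\delta(1-p_{t})=\delta+(1-2\delta)p_{t},\qquad p_{1}=0,

whose solution is pt=12+(12δ)t1(012)p_{t}=\frac{1}{2}+(1-2\delta)^{t-1}(0-\frac{1}{2}); the bound (D.3) then follows from N0,k=HN1,kN_{0,k}=H-N_{1,k}.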
Step 4

Substituting (D.2) in Step 3 into (D.1) in Step 2 yields

𝔼a[R]KH2+ϵHk=1K𝔼a[N0,k]K4δ+K(12δ)H4δ.\displaystyle\mathbb{E}_{a}[R]\leq\frac{KH}{2}+\epsilon H^{\prime}\sum_{k=1}^{K}\mathbb{E}_{a}[N_{0,k}^{*}]-\frac{K}{4\delta}+K\frac{(1-2\delta)^{H}}{4\delta}.

Therefore,

𝔼[R]=1SAa=1SA𝔼a[R]KH2+ϵHSAa=1SAk=1K𝔼a[N0,k]K4δ+K(12δ)H4δ.\displaystyle\mathbb{E}_{*}[R]=\frac{1}{S^{\prime}A^{\prime}}\sum\limits_{a=1}^{S^{\prime}A^{\prime}}\mathbb{E}_{a}[R]\leq\frac{KH}{2}+\frac{\epsilon H^{\prime}}{S^{\prime}A^{\prime}}\sum\limits_{a=1}^{S^{\prime}A^{\prime}}\sum_{k=1}^{K}\mathbb{E}_{a}[N_{0,k}^{*}]-\frac{K}{4\delta}+K\frac{(1-2\delta)^{H}}{4\delta}.
Step 5
𝔼[R]KH2+ϵKHSA(H+H2+ϵHH2SA(H+H)K)K4δ+K(12δ)H4δ.\displaystyle\mathbb{E}_{*}[R]\leq\frac{KH}{2}+\frac{\epsilon KH^{\prime}}{S^{\prime}A^{\prime}}\left(\frac{H+H^{\prime}}{2}+\frac{\epsilon H\sqrt{H^{\prime}}}{2}\sqrt{S^{\prime}A^{\prime}(H^{\prime}+H)K}\right)-\frac{K}{4\delta}+K\frac{(1-2\delta)^{H}}{4\delta}.
Proof.

To be more explicit, let N0,k,aN_{0,k,a} denote the number of times action aa is chosen in state s0s_{0} in the kkth episode. Then, by Lemma 13 in [19], we have

k=1Ka=1SA𝔼a[N0,k]=a=1SA𝔼a[k=1KN0,k,a]a=1SA𝔼unif[k=1KN0,k,a]+a=1SAϵKHH2𝔼unif[k=1KN0,k,a].\displaystyle\sum_{k=1}^{K}\sum_{a=1}^{S^{\prime}A^{\prime}}\mathbb{E}_{a}[N_{0,k}^{*}]=\sum_{a=1}^{S^{\prime}A^{\prime}}\mathbb{E}_{a}\left[\sum_{k=1}^{K}N_{0,k,a}\right]\leq\sum_{a=1}^{S^{\prime}A^{\prime}}\mathbb{E}_{{\textnormal{unif}}}\left[\sum\limits_{k=1}^{K}N_{0,k,a}\right]+\sum_{a=1}^{S^{\prime}A^{\prime}}\epsilon KH\sqrt{H^{\prime}}\sqrt{2\mathbb{E}_{{\textnormal{unif}}}\left[\sum\limits_{k=1}^{K}N_{0,k,a}\right]}.

Since a=1SA𝔼unif[N0,k,a]H+H2\sum_{a=1}^{S^{\prime}A^{\prime}}\mathbb{E}_{{\textnormal{unif}}}[N_{0,k,a}]\leq\frac{H+H^{\prime}}{2} by (D.3) in Step 3,

k=1Ka=1SA𝔼a[N0,k]KH+H2+ϵKHH2SAK(H+H).\displaystyle\sum_{k=1}^{K}\sum_{a=1}^{S^{\prime}A^{\prime}}\mathbb{E}_{a}[N_{0,k}^{*}]\leq K\frac{H+H^{\prime}}{2}+\frac{\epsilon KH\sqrt{H^{\prime}}}{2}\sqrt{S^{\prime}A^{\prime}K(H+H^{\prime})}.

Substituting the above into the bound on 𝔼[R]\mathbb{E}_{*}[R] in Step 4 concludes the proof. ∎

Step 6

Now we determine the optimal value V1(s0)V_{1}^{*}(s_{0}) against which the cumulative reward is compared. Let VH+1=[0,0]V_{H+1}^{*}=[0,0]^{\top}. Then the optimal value function is given by the backward iteration Vh=B+AVh+1V_{h}^{*}=B+AV_{h+1}^{*}, where

A=[1δϵδ+ϵδ1δ],B=[01].\displaystyle A=\left[\begin{array}[]{cc}1-\delta-\epsilon&\delta+\epsilon\\ \delta&1-\delta\end{array}\right],\quad B=\left[\begin{array}[]{c}0\\ 1\end{array}\right].

Iteratively, by matrix diagonalization,

V1=B+AV2=B+A(B+AV3)==(h=0H1Ah)B=U(h=0H1Λh)U1B,\displaystyle V_{1}^{*}=B+AV_{2}^{*}=B+A(B+AV_{3}^{*})=\cdots=(\sum_{h=0}^{H-1}A^{h})B=U(\sum_{h=0}^{H-1}\Lambda^{h})U^{-1}B,

where U=[1,δ+ϵδ;1,1]U=[1,-\frac{\delta+\epsilon}{\delta};1,1] and Λ=diag(1,12δϵ)\Lambda={\textnormal{diag}}(1,1-2\delta-\epsilon) satisfy A=UΛU1A=U\Lambda U^{-1}. Hence,

V1(s0)=δ+ϵ2δ+ϵHδ+ϵ(2δ+ϵ)2(1(12δϵ)H).\displaystyle V_{1}^{*}(s_{0})=\frac{\delta+\epsilon}{2\delta+\epsilon}H-\frac{\delta+\epsilon}{(2\delta+\epsilon)^{2}}\left(1-(1-2\delta-\epsilon)^{H}\right).

Note that here we compute the exact optimal value because, in each episode, the episodic resetting causes a constant difference from the stationary optimal value of the infinite-horizon setting, which accumulates to the order of 𝒪(K){\mathcal{O}}(K) in total, similarly to Step 1.
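As a quick numerical sanity check of the closed form (with hypothetical values of δ,ϵ,H\delta,\epsilon,H chosen purely for illustration), one can compare backward induction against the formula above:

```python
delta, eps, H = 0.05, 0.01, 100  # hypothetical illustration values

def v1_s0_by_backward_induction(delta, eps, H):
    v0, v1 = 0.0, 0.0  # V_{H+1}^* = [0, 0]^T
    for _ in range(H):
        # one step of V_h = B + A V_{h+1} with A = [[1-d-e, d+e], [d, 1-d]] and B = [0, 1]^T
        v0, v1 = ((1 - delta - eps) * v0 + (delta + eps) * v1,
                  1 + delta * v0 + (1 - delta) * v1)
    return v0  # V_1^*(s0)

closed_form = ((delta + eps) / (2 * delta + eps) * H
               - (delta + eps) / (2 * delta + eps) ** 2 * (1 - (1 - 2 * delta - eps) ** H))
print(v1_s0_by_backward_induction(delta, eps, H), closed_form)  # the two values coincide
```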

Step 7

Since

δ+ϵ(2δ+ϵ)2=δ+ϵ4δ(δ+ϵ)+ϵ214δ,\displaystyle\frac{\delta+\epsilon}{(2\delta+\epsilon)^{2}}=\frac{\delta+\epsilon}{4\delta(\delta+\epsilon)+\epsilon^{2}}\leq\frac{1}{4\delta},

we have

Regret(K)\displaystyle{\textnormal{Regret}}(K) =KV1(s0)𝔼[R]\displaystyle=KV_{1}^{*}(s_{0})-\mathbb{E}_{*}[R]
δ+ϵ2δ+ϵKHKH2ϵKHSA(H+H2+ϵHH2SAK(H+H))standard as in [19]\displaystyle\geq\underbrace{\frac{\delta+\epsilon}{2\delta+\epsilon}KH-\frac{KH}{2}-\frac{\epsilon KH^{\prime}}{S^{\prime}A^{\prime}}\left(\frac{H+H^{\prime}}{2}+\frac{\epsilon H\sqrt{H^{\prime}}}{2}\sqrt{S^{\prime}A^{\prime}K(H^{\prime}+H)}\right)}_{\text{standard as in~{}\cite[citep]{[\@@bibref{Number}{jaksch2010near}{}{}]}}}
K((δ+ϵ)(1(12δϵ)H)(2δ+ϵ)214δ+(12δ)H4δ)new challenge in the episodic setting\displaystyle\quad\underbrace{-K\left(\frac{(\delta+\epsilon)\left(1-(1-2\delta-\epsilon)^{H}\right)}{(2\delta+\epsilon)^{2}}-\frac{1}{4\delta}+\frac{(1-2\delta)^{H}}{4\delta}\right)}_{\text{new challenge in the episodic setting}}
ϵ4δ+2ϵTϵTH2SA(1+HH)ϵ2TH2SAHSAKH(1+HH)Θ(HSAT)\displaystyle\geq\underbrace{\frac{\epsilon}{4\delta+2\epsilon}T-\frac{\epsilon TH^{\prime}}{2S^{\prime}A^{\prime}}\left(1+\frac{H^{\prime}}{H}\right)-\frac{\epsilon^{2}TH^{\prime}}{2S^{\prime}A^{\prime}}\sqrt{H^{\prime}S^{\prime}A^{\prime}KH}\left(\sqrt{1+\frac{H^{\prime}}{H}}\right)}_{\Theta(\sqrt{H^{\prime}S^{\prime}A^{\prime}T})}
K4δ((12δ)H(12δϵ)H)Θ(KH(12H)H),\displaystyle\quad-\underbrace{\frac{K}{4\delta}\left((1-2\delta)^{H}-(1-2\delta-\epsilon)^{H}\right)}_{\Theta(KH^{\prime}(1-\frac{2}{H^{\prime}})^{H})},

where Θ(HSAT)\Theta(\sqrt{H^{\prime}S^{\prime}A^{\prime}T}) in the last line is obtained by taking ϵ=Θ(SAHT)=Θ(δSAT)\epsilon=\Theta(\sqrt{\frac{S^{\prime}A^{\prime}}{H^{\prime}T}})=\Theta(\sqrt{\frac{\delta S^{\prime}A^{\prime}}{T}}). The logic of taking such an ϵ\epsilon is to let the first term be on the order of the third term, i.e.,

ϵ4δ+2ϵT=Θ(ϵ2TH2SAHSAKH).\displaystyle\frac{\epsilon}{4\delta+2\epsilon}T=\Theta\left(\frac{\epsilon^{2}TH^{\prime}}{2S^{\prime}A^{\prime}}\sqrt{H^{\prime}S^{\prime}A^{\prime}KH}\right).

The new challenge in the episodic setting results from the resetting at the beginning of each episode and the different definition of regret. By taking H=Hlog(KH)=HlogTH=H^{\prime}\log(KH)=H^{\prime}\log T, we have

KH(12H)HKH1elogKH=HH=1logT=o(HSAT).\displaystyle KH^{\prime}(1-\frac{2}{H^{\prime}})^{H}\leq KH^{\prime}\frac{1}{e^{\log KH}}=\frac{H^{\prime}}{H}=\frac{1}{\log T}=o(\sqrt{H^{\prime}S^{\prime}A^{\prime}T}).

Hence, we obtain the Ω(HSAT/logT)=Ω~(HSAT)\Omega(\sqrt{HSAT/\log T})=\tilde{\Omega}(\sqrt{HSAT}) lower bound. Note that taking H=H/logTH^{\prime}=H/\log T is reasonable, because none of the parameters is exponentially large in the others in the regime we consider.
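For reference, plugging S=S/2S^{\prime}=\lfloor S/2\rfloor, A=(A1)/2A^{\prime}=\lfloor(A-1)/2\rfloor and H=H/logTH^{\prime}=H/\log T into the dominant term gives

\displaystyle\sqrt{H^{\prime}S^{\prime}A^{\prime}T}=\Theta\left(\sqrt{\frac{H}{\log T}\cdot\frac{S}{2}\cdot\frac{A-1}{2}\cdot T}\right)=\Theta\left(\sqrt{\frac{HSAT}{\log T}}\right),

which is the claimed lower bound up to constant factors.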

Appendix E Concentration inequalities

In this section, we provide a summary of some important concentration inequalities that are frequently invoked in this work. In what follows, ()\mathbb{P}(\cdot) denotes an appropriate probability measure.

Lemma 36 (Concentration on L1L_{1}-norm of probability distributions).

Let PP be a probability mass function on a finite set 𝒴\mathcal{Y} with cardinality YY. Let 𝐲=[y1,,yn]\bm{\mathsfit{y}}=[\mathsfit{y}_{1},\cdots,\mathsfit{y}_{n}] be nn i.i.d. samples from PP. Let P^𝐲\hat{P}_{\bm{\mathsfit{y}}} be the empirical distribution based on the observed samples. Then for all ϵ>0\epsilon>0

(PP^𝒚1ϵ)2Yexp{nϵ22}.\displaystyle\mathbb{P}(\|P-\hat{P}_{\bm{\mathsfit{y}}}\|_{1}\geq\epsilon)\leq 2^{Y}\exp\left\{-\frac{n\epsilon^{2}}{2}\right\}.

Alternatively, with probability at least 1δ1-\delta,

PP^𝒚12Ylog2+2log1δn2Ynlog2δ\displaystyle\|P-\hat{P}_{\bm{\mathsfit{y}}}\|_{1}\leq\sqrt{\frac{2Y\log 2+2\log\frac{1}{\delta}}{n}}\leq\sqrt{\frac{2Y}{n}\log\frac{2}{\delta}}
Proof.

This lemma is a relaxation of Theorem 2.1 in [39]. ∎

Lemma 37 (Hoeffding’s inequality).

Let x1,,xn\mathsfit{x}_{1},\cdots,\mathsfit{x}_{n} be nn independent random variables such that xi[a,b]\mathsfit{x}_{i}\in[a,b] almost surely. Then, for all ϵ0\epsilon\geq 0,

(|i=1n(xi𝔼[xi])|ϵ)2exp{2ϵ2n(ba)2}.\displaystyle\mathbb{P}\left(\left|\sum_{i=1}^{n}(\mathsfit{x}_{i}-\mathbb{E}[\mathsfit{x}_{i}])\right|\geq\epsilon\right)\leq 2\exp\left\{-\frac{2\epsilon^{2}}{n(b-a)^{2}}\right\}.

Alternatively, with probability at least 1δ1-\delta,

|1ni=1n(xi𝔼[xi])|(ba)22nlog2δ.\displaystyle\left|\frac{1}{n}\sum_{i=1}^{n}(\mathsfit{x}_{i}-\mathbb{E}[\mathsfit{x}_{i}])\right|\leq\sqrt{\frac{(b-a)^{2}}{2n}\log\frac{2}{\delta}}.
Proof.

See, e.g., Proposition 2.5 in [38] for the one-sided Hoeffding’s inequality; applying it to xi-\mathsfit{x}_{i} yields the other side. ∎

Lemma 38 (One-sided Bernstein’s inequality).

Let x1,,xn\mathsfit{x}_{1},\cdots,\mathsfit{x}_{n} be nn independent random variables such that xib\mathsfit{x}_{i}\leq b almost surely. Then, for all ϵ0\epsilon\geq 0,

(i=1n(xi𝔼[xi])nϵ)exp{nϵ22(V+bϵ3)},\displaystyle\mathbb{P}\left(\sum_{i=1}^{n}(\mathsfit{x}_{i}-\mathbb{E}[\mathsfit{x}_{i}])\geq n\epsilon\right)\leq\exp\left\{-\frac{n\epsilon^{2}}{2(V+\frac{b\epsilon}{3})}\right\},

where V=1ni=1n𝔼[xi2]V=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\mathsfit{x}_{i}^{2}]. Alternatively, with probability at least 1δ1-\delta,

1ni=1n(xi𝔼[xi])blog1δ3n+(blog1δ3n)2+2Vlog1δn2blog1δ3n+2Vlog1δn.\displaystyle\frac{1}{n}\sum_{i=1}^{n}(\mathsfit{x}_{i}-\mathbb{E}[\mathsfit{x}_{i}])\leq\frac{b\log\frac{1}{\delta}}{3n}+\sqrt{(\frac{b\log\frac{1}{\delta}}{3n})^{2}+\frac{2V\log\frac{1}{\delta}}{n}}\leq\frac{2b\log\frac{1}{\delta}}{3n}+\sqrt{\frac{2V\log\frac{1}{\delta}}{n}}. (E.1)
Proof.

See e.g., Proposition 2.14 in [38]. ∎

For random variables xi[0,b]\mathsfit{x}_{i}\in[0,b], applying the one-sided Bernstein’s inequality to xi𝔼[xi]\mathsfit{x}_{i}-\mathbb{E}[\mathsfit{x}_{i}], we can replace VV in (E.1) by the average variance σ2=1ni=1nVar(xi)\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\mathrm{Var}(\mathsfit{x}_{i}). Moreover, applying the one-sided Bernstein’s inequality to xi+𝔼[xi]-\mathsfit{x}_{i}+\mathbb{E}[\mathsfit{x}_{i}] yields the other side of the bound in terms of σ2\sigma^{2}. By the union bound, we have that with probability at least 1δ1-\delta,

|1ni=1n(xi𝔼[xi])|2blog1δ3n+2σ2log1δn,\displaystyle\left|\frac{1}{n}\sum_{i=1}^{n}(\mathsfit{x}_{i}-\mathbb{E}[\mathsfit{x}_{i}])\right|\leq\frac{2b\log\frac{1}{\delta}}{3n}+\sqrt{\frac{2\sigma^{2}\log\frac{1}{\delta}}{n}},

which is what we actually use in this work.

Lemma 39 (Empirical Bernstein’s inequality).

Let x1,,xn\mathsfit{x}_{1},\cdots,\mathsfit{x}_{n} be nn independent random variables such that xi1\mathsfit{x}_{i}\leq 1 almost surely, and let S=1n(n1)i=1nj=1n(xixj)22S=\frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{(\mathsfit{x}_{i}-\mathsfit{x}_{j})^{2}}{2} be the sample variance. Then, with probability at least 1δ1-\delta,

1ni=1n(𝔼[xi]xi)2Slog2δn+7log2δ3(n1).\displaystyle\frac{1}{n}\sum_{i=1}^{n}(\mathbb{E}[\mathsfit{x}_{i}]-\mathsfit{x}_{i})\leq\sqrt{\frac{2S\log\frac{2}{\delta}}{n}}+\frac{7\log\frac{2}{\delta}}{3(n-1)}.
Proof.

See Theorem 11 in [26]. ∎
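As a hedged numerical illustration of Lemma 39, the following sketch checks the empirical coverage of the bound for i.i.d. Bernoulli samples; the distribution, sample size, and confidence level are hypothetical choices made only for illustration.

```python
import math
import random

def empirical_bernstein_holds(n, p, delta, rng):
    """Check whether the Lemma 39 bound on (1/n) * sum(E[x_i] - x_i) holds for one Bernoulli sample."""
    xs = [1.0 if rng.random() < p else 0.0 for _ in range(n)]
    mean = sum(xs) / n
    # The pairwise sample variance in Lemma 39 equals the usual unbiased sample variance.
    sample_var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    bound = math.sqrt(2 * sample_var * math.log(2 / delta) / n) + 7 * math.log(2 / delta) / (3 * (n - 1))
    return p - mean <= bound

rng = random.Random(0)
trials = [empirical_bernstein_holds(100, 0.3, 0.05, rng) for _ in range(2000)]
print(sum(trials) / len(trials))  # empirical coverage; should be at least 1 - delta = 0.95
```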