
Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Tianhao Wu    Yunchang Yang    Han Zhong    Liwei Wang    Simon S. Du    Jiantao Jiao
Abstract

Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. However, the theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art theoretical result for policy-based methods in Shani et al. (2020) is only $\widetilde{O}(\sqrt{S^{2}AH^{4}K})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes, leaving a $\sqrt{SH}$ gap compared with the information theoretic lower bound $\widetilde{\Omega}(\sqrt{SAH^{3}K})$ (Jin et al., 2018). To bridge this gap, we propose a novel algorithm, Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT), which features the property "Stable at Any Time". We prove that our algorithm achieves $\widetilde{O}(\sqrt{SAH^{3}K}+\sqrt{AH^{4}K})$ regret. When $S>H$, our algorithm is minimax optimal up to logarithmic factors. To the best of our knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.

Machine Learning, ICML, Reinforcement Learning, Policy Optimization

1 Introduction

Reinforcement Learning (RL) has achieved phenomenal successes in solving complex sequential decision-making problems (Silver et al., 2016, 2017; Levine et al., 2016; Gu et al., 2017). Most of these empirical successes are driven by policy-based (policy optimization) methods, such as policy gradient (Sutton et al., 1999), natural policy gradient (NPG) (Kakade, 2001), trust region policy optimization (TRPO) (Schulman et al., 2015), and proximal policy optimization (PPO) (Schulman et al., 2017). For example, Haarnoja et al. (2018) proposed a state-of-the-art policy-based reinforcement learning algorithm, soft actor-critic (SAC), which outperformed value-based methods in a variety of real-world robotics tasks including manipulation and locomotion. In fact, Kalashnikov et al. (2018) observed that compared with value-based methods such as Q-learning, policy-based methods work better with dense rewards, whereas for sparse-reward robotics tasks, value-based methods perform better.

Motivated by this, a line of recent work (Fazel et al., 2018; Bhandari & Russo, 2019; Liu et al., 2019; Wang et al., 2019; Agarwal et al., 2021) provides global convergence guarantees for these popular policy-based methods. However, to achieve this goal, they make several assumptions. Agarwal et al. (2021) assume access to either the exact population policy gradient or an estimate of it up to a certain precision, uniformly over states compared with the state distribution induced by $\pi^{*}$, bypassing the hardness of exploration. They showed that even with this stringent assumption, the convergence rate depends on the distribution mismatch coefficient $D_{\infty}=\max_{s}\big(\frac{d_{s_{0}}^{\pi^{*}}(s)}{\mu(s)}\big)$, where $\mu$ is the starting state distribution of the algorithm and $d_{s_{0}}^{\pi^{*}}(s)$ is the stationary state distribution of the optimal policy $\pi^{*}$ starting from $s_{0}$. This dependency is problematic since $D_{\infty}$ is small only when the initial distribution covers the optimal stationary distribution well, which may not happen in practice.

In contrast, online value-based RL algorithms such as that of Azar et al. (2017) achieve fast convergence rates (or regret) independent of the distribution mismatch coefficient, or equivalently, without the coverage assumption. Though value-based methods have achieved the information theoretically optimal regret in tabular (Azar et al., 2017) and linear MDP settings (Zanette et al., 2020), it remains unclear whether policy-based methods can achieve information theoretically optimal regret in the same settings. To address this issue, Cai et al. (2020) proposed the idea of optimism in policy optimization, which looks similar to value-based optimism but is different in nature, since it encourages optimism for $Q^{\pi}$ instead of $Q^{*}$ (Section 4). With this new idea, Cai et al. (2020); Shani et al. (2020) managed to establish regret guarantees without additional assumptions, though the regret is suboptimal.

In this work, we focus on the same setting as in Shani et al. (2020): episodic tabular MDPs with unknown transitions, stochastic rewards/losses, and bandit feedback. In this setting, the state-of-the-art result for policy-based methods is $\widetilde{O}(\sqrt{S^{2}AH^{4}K})$ (Shani et al., 2020). Here $S$ and $A$ are the cardinalities of the state and action spaces, respectively, $H$ is the episode horizon, and $K$ is the number of episodes. Compared with the information theoretic limit (Jaksch et al., 2010; Azar et al., 2017; Jin et al., 2018; Domingues et al., 2021), there is still a gap of $\sqrt{SH}$.

1.1 Main Contributions

In this paper we present a novel provably efficient policy optimization algorithm, Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT). We establish a high probability regret upper bound $\widetilde{O}(\sqrt{SAH^{3}K}+\sqrt{AH^{4}K})$ for our algorithm. Importantly, if $S>H$, the main term in this bound matches the information theoretic limit $\Omega(\sqrt{SAH^{3}K})$ (Jin et al., 2018), up to a lower order term $\widetilde{O}(\operatorname{poly}(S,A,H)K^{1/4})$. We introduce our algorithmic and analytical innovations as follows:

Algorithmic innovations:

  • We introduce a novel reference $V$ estimator in our algorithm. It is conceptually simple and easy to implement, as it just updates the reference value to be the mean of the empirical $V$ values when some conditions are triggered (cf. Algorithm 1, Lines 18-20).

  • We carefully incorporate the reference $V$ estimator into our bonus term by adding a weighted absolute difference between the estimated $V$ values and the reference $V$ values, which controls the instability of the estimation process. Readers may refer to Section 4 for more details.

  • Another highlight is that we modify the policy improvement phase of our algorithm to meet a novel property, which we call Stable at Any Time (SAT). (For a detailed definition see Equation (10) in Section 4.) More specifically, instead of using the KL-divergence regularization term proposed in Shani et al. (2020), we use an $\ell_{2}$ regularization term. This is crucial to ensure SAT.

Analytical innovations:

  • We prove that our algorithm satisfies the SAT property. The analysis proceeds in two steps: first, we establish a 1-st step regret bound $\widetilde{O}(\sqrt{S^{2}AH^{4}K})$; second, we use a new technique, "Forward Induction", to prove the same bound for the $h$-th step regret for all $h$. Here the $h$-th step regret is defined in (4). Readers may refer to Section 4 for more details of the "Forward Induction" technique.

  • We show that the combination of the SAT property and the simple reference $V$ estimator yields a precise approximation of $V^{*}$. We use this property to derive an $\widetilde{O}(\sqrt{SAH^{3}K})$ upper bound for the sum of bonus terms, which leads to a $\sqrt{S}$ reduction in the regret.

1.2 Related Work

Our work contributes to the theoretical investigations of policy-based methods in RL (Cai et al., 2020; Shani et al., 2020; Lancewicki et al., 2020; Fei et al., 2020; He et al., 2021; Zhong et al., 2021; Luo et al., 2021; Zanette et al., 2021). The most related policy-based method is that of Shani et al. (2020), who also study episodic tabular MDPs with unknown transitions, stochastic losses, and bandit feedback, and whose regret bound leaves a $\sqrt{SH}$ gap to the lower bound. It is important to understand whether it is possible to eliminate this gap and thus achieve minimax optimality, or alternatively to show that this gap is inevitable for policy-based methods.

We also provide interesting practical insights, for example, the use of a reference estimator and $\ell_{2}$ regularization with a decreasing learning rate to stabilize the estimates of $V$ and $Q$. The use of a reference estimator can be traced to Zhang et al. (2020b), who use the reference estimator to maximize data utilization and hence reduce the estimation variance. However, our usage of the reference estimator is different from theirs. They can reduce a $\sqrt{H}$ factor because their bottleneck term is estimated using only a $1/H$ fraction of the data, while the reference makes use of all the data, hence fully utilizing the available samples. For policy-based RL, there is no such data-utilization problem. In fact, the bottleneck of policy-based methods is the instability of the $Q$ estimation; therefore we use the reference estimator to stabilize the estimation process and reduce a $\sqrt{S}$ factor in the regret. Readers may turn to Section 4 for a detailed explanation of "instability".

Our work is also closely related to another line of work on value-based methods. In particular, Azar et al. (2017); Zanette & Brunskill (2019); Zhang et al. (2020a, b); Menard et al. (2021) have shown that value-based methods can achieve an $\widetilde{O}(\sqrt{SAH^{3}K})$ regret upper bound, which matches the information theoretic limit. Different from these works, we are the first to prove a (nearly) optimal regret bound for policy-based methods.

2 Preliminaries

A finite horizon stochastic Markov Decision Process (MDP) with time-variant transitions $\mathcal{M}$ is defined by a tuple $(\mathcal{S},\mathcal{A},H,P=\{P_{h}\}_{h=1}^{H},c=\{c_{h}\}_{h=1}^{H})$, where $\mathcal{S}$ and $\mathcal{A}$ are finite state and action spaces with cardinality $S$ and $A$, respectively, and $H\in\mathbb{N}$ is the horizon of the MDP. At time step $h$ and state $s$, the agent performs an action $a$, transitions to the next state $s^{\prime}$ with probability $P_{h}(s^{\prime}\mid s,a)$, and suffers a random cost $C_{h}(s,a)\in[0,1]$ drawn i.i.d. from a distribution with expectation $c_{h}(s,a)$.

A stochastic policy $\pi:\mathcal{S}\times[H]\rightarrow\Delta_{A}$ is a mapping from states and time-step indices to a distribution over actions, i.e., $\Delta_{A}=\{\pi\in\mathbb{R}^{A}:\sum_{a}\pi(a)=1,\pi(a)\geq 0\}$. The performance of a policy $\pi$ when starting from state $s$ at time $h$ is measured by its value function, which is defined as

V_{h}^{\pi}(s)=\mathbb{E}\bigg[\sum_{h^{\prime}=h}^{H}c_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})\mid s_{h}=s,\pi,P\bigg].   (1)

The expectation is taken with respect to the randomness of the transition, the cost function, and the policy. The $Q$-function of a policy, given the state-action pair $(s,a)$ at time-step $h$, is defined by

Q_{h}^{\pi}(s,a)=\mathbb{E}\bigg[\sum_{h^{\prime}=h}^{H}c_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})\mid s_{h}=s,a_{h}=a,\pi,P\bigg].   (2)

By the above definitions, for any fixed policy $\pi$, we obtain the Bellman equation

Q_{h}^{\pi}(s,a)=c_{h}(s,a)+P_{h}(\cdot\mid s,a)V_{h+1}^{\pi}(\cdot),   (3)
V_{h}^{\pi}(s)=\langle Q_{h}^{\pi}(s,\cdot),\pi_{h}(\cdot\mid s)\rangle.
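
To make the recursion concrete, the following minimal Python sketch (our own illustration; the array layout and function name are not from the paper) evaluates a fixed policy by iterating (3) backward from $h=H$:

import numpy as np

def evaluate_policy(P, c, pi):
    # P:  (H, S, A, S) array, P[h, s, a, s'] = P_h(s' | s, a)
    # c:  (H, S, A) array,    c[h, s, a]     = expected cost c_h(s, a)
    # pi: (H, S, A) array,    pi[h, s, :]    = pi_h(. | s), a distribution over actions
    H, S, A = c.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))          # V[H] = 0 terminates the backward recursion
    for h in reversed(range(H)):
        # Bellman equation (3): Q_h(s, a) = c_h(s, a) + sum_{s'} P_h(s' | s, a) V_{h+1}(s')
        Q[h] = c[h] + P[h] @ V[h + 1]
        # V_h(s) = <Q_h(s, .), pi_h(. | s)>
        V[h] = np.sum(pi[h] * Q[h], axis=1)
    return Q, V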

An optimal policy $\pi^{*}$ minimizes the value for all states $s$ and time-steps $h$ simultaneously, and its corresponding optimal value is denoted by $V_{h}^{*}(s)=\min_{\pi}V_{h}^{\pi}(s)$ for all $h\in[H]$. We consider an agent that repeatedly interacts with an MDP in a sequence of $K$ episodes such that the starting state at the $k$-th episode, $s_{1}^{k}$, is initialized to a fixed state $s_{1}$ (our subsequent analysis can be extended to the setting where the initial state is sampled from a fixed distribution).

In this paper we define the notion of $h$-th step regret, $\operatorname{Regret}_{h}$, as follows:

\operatorname{Regret}_{h}(K)=\sum_{k=1}^{K}\big(V_{h}^{\pi^{k}}(s_{h}^{k})-V_{h}^{*}(s_{h}^{k})\big).   (4)

When $h=1$, this matches the traditional definition of regret, which measures the performance of the agent starting from $s_{1}$. In this case we also write $\operatorname{Regret}(K)$ for simplicity.

Notations and Definitions

We denote the number of times that the agent has visited state $s$, state-action pair $(s,a)$, and state-action-transition triple $(s,a,s^{\prime})$ at the $h$-th step by $n_{h}^{k}(s)$, $n_{h}^{k}(s,a)$, and $n_{h}^{k}(s,a,s^{\prime})$, respectively. We denote by $\overline{X}_{k}$ the empirical average of a random variable $X$. All quantities are based on experience gathered until the end of the $k$-th episode. We denote by $\Delta_{\mathcal{A}}$ the probability simplex over the action space, i.e., $\Delta_{\mathcal{A}}=\{(p_{1},\dots,p_{|\mathcal{A}|})\mid p_{i}\geq 0,\sum_{i}p_{i}=1\}$.

We use $\widetilde{O}(X)$ to refer to a quantity that equals $X$ up to a poly-log factor of a quantity at most polynomial in $S,A,K,H$ and $\delta^{-1}$. Similarly, $\lesssim$ represents $\leq$ up to numerical constants or poly-log factors. We define $X\vee Y:=\max\{X,Y\}$.

3 RPO-SAT: Reference-based Policy Optimization with Stable at Any Time guarantee

Algorithm 1 Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT)
1:  initialize $Q_{h}(s,a)\leftarrow 0$, $V_{h}(s)\leftarrow 0$ and $V_{h}^{\mathrm{ref}}(s)\leftarrow 0$ for all $h,s,a$; set $C_{0}=\sqrt{S^{3}AH^{3}}$
2:  for episode $k=1,\dots,K$ do
3:     Rollout a trajectory by acting $\pi^{k}$
4:     Update counters and empirical model $n^{k}=\{n_{h}^{k}\}_{h\in[H]}$, $\overline{c}^{k}=\{\overline{c}_{h}^{k}\}_{h\in[H]}$, $\overline{P}^{k}=\{\overline{P}_{h}^{k}\}_{h\in[H]}$
5:     for step $h=H,\dots,1$ do
6:        for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ do
7:           Calculate $u_{h}^{k}$ as in (5)
8:           $b^{k}_{h}(s,a)=\min\{u_{h}^{k}(s,a),\sqrt{\frac{2\ln\frac{2SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}+H\sqrt{\frac{4S\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}\}$
9:           $Q^{k}_{h}(s,a)=\max\{\overline{c}_{h}^{k}(s,a)+\overline{P}^{k}_{h}(\cdot\mid s,a)V^{k}_{h+1}(\cdot)-b^{k}_{h}(s,a),0\}$
10:           for all $s\in\mathcal{S}$ do
11:              $V^{k}_{h}(s)=\langle Q^{k}_{h}(s,\cdot),\pi^{k}_{h}(\cdot\mid s)\rangle$
12:           end for
13:        end for
14:     end for
15:     for all $(h,s)\in[H]\times\mathcal{S}$ do
16:        $\pi^{k+1}_{h}(\cdot\mid s)=\operatorname{argmin}_{\pi_{h}}\eta_{k}\langle Q^{k}_{h}(\cdot,s),\pi_{h}-\pi^{k}_{h}\rangle+\|\pi_{h}-\pi_{h}^{k}\|_{2}^{2}$
17:     end for
18:     for all $(h,s)\in[H]\times\mathcal{S}$ such that $n_{h}^{k}(s)\geq C_{0}\sqrt{k}$ do
19:        $V^{\mathrm{ref}}_{h}(s)=\frac{1}{n^{k}_{h}(s)}\sum_{i=1}^{k}V_{h}^{i}(s^{i}_{h})\mathbf{1}[s^{i}_{h}=s]$
20:     end for
21:  end for

In this section, we present our algorithm RPO-SAT (Reference-based Policy Optimization with Stable at Any Time guarantee). The pseudocode is given in Algorithm 1.

We start by reviewing the optimistic policy optimization algorithms OPPO and POMD proposed by Cai et al. (2020); Shani et al. (2020). Each update of OPPO and POMD involves a policy improvement phase and a policy evaluation phase. In the policy evaluation phase, OPPO and POMD explicitly incorporate a UCB bonus function into the estimated $Q$-function to promote exploration. Then, in the policy improvement phase, OPPO and POMD improve the policy by Online Mirror Descent (OMD) with KL-regularization, where the estimated $Q$-function serves as the gradient. Compared with the existing optimistic policy optimization algorithms in Shani et al. (2020); Cai et al. (2020), our algorithm has three novelties.

First, in the policy evaluation phase, we introduce the reference $V$ estimator (Lines 18-20). Specifically, for any $s\in\mathcal{S}$, if the number of visits to $s$ satisfies the condition in Line 18, we update the reference value to be the empirical mean estimator of $V$ (Line 19). Zhang et al. (2020b) also adopt a reference value estimator. Nevertheless, their update conditions and methods are different from ours, and they reduce a $\sqrt{H}$ factor while we reduce a $\sqrt{S}$ factor.
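
As an illustration of this update rule (our own sketch, not the authors' implementation; the dictionaries v_sum and n_visits are hypothetical bookkeeping structures that accumulate $\sum_{i\leq k}V_{h}^{i}(s_{h}^{i})\mathbf{1}[s_{h}^{i}=s]$ and the visit counts $n_{h}^{k}(s)$), Lines 18-20 amount to:

import math

def update_reference(v_ref, v_sum, n_visits, k, S, A, H):
    # Refresh V^ref(h, s) once state s has been visited at least C_0 * sqrt(k) times at step h.
    C0 = math.sqrt(S**3 * A * H**3)   # threshold constant C_0 from Algorithm 1
    for (h, s), n in n_visits.items():
        if n >= C0 * math.sqrt(k):
            # empirical mean of the past value estimates V_h^i observed at state s
            v_ref[(h, s)] = v_sum[(h, s)] / n
    return v_ref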

Second, we make some modifications in the policy improvement phase. Specifically, the policy optimization step in the $h$-th step of the $k$-th episode is

\pi^{k+1}_{h}(\cdot\mid s)=\operatorname{argmin}_{\pi_{h}}\ \eta_{k}\langle Q^{k}_{h}(\cdot,s),\pi_{h}-\pi^{k}_{h}\rangle+D(\pi_{h},\pi_{h}^{k}),

where $\eta_{k}$ is the stepsize in the $k$-th episode, $D$ is some distance measure, and $Q_{h}^{k}$ is the estimated $Q$-function in the $h$-th step of the $k$-th episode. Different from Shani et al. (2020); Cai et al. (2020), who choose $D(\pi_{h},\pi_{h}^{k})=\mathrm{KL}(\pi_{h},\pi_{h}^{k})$, we choose $D(\pi_{h},\pi_{h}^{k})=\|\pi_{h}-\pi_{h}^{k}\|_{2}^{2}$. In this case the solution of the OMD step is

\pi^{k+1}_{h}(\cdot\mid s)=\Pi_{\Delta_{\mathcal{A}}}\left(\pi^{k}_{h}(\cdot\mid s)-\eta_{k}Q^{k}_{h}(\cdot,s)\right),

where $\Pi_{\Delta_{\mathcal{A}}}$ is the Euclidean projection onto $\Delta_{\mathcal{A}}$, i.e., $\Pi_{\Delta_{\mathcal{A}}}(\boldsymbol{x})=\operatorname{argmin}_{\boldsymbol{y}\in\Delta_{\mathcal{A}}}\|\boldsymbol{x}-\boldsymbol{y}\|_{2}$. Unlike previous works, we also adopt a decreasing learning rate schedule instead of a fixed learning rate. These modifications are necessary for our analysis since they ensure the SAT property, making it possible to learn a good reference $V$ value.
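
For concreteness, a minimal Python sketch of this update is given below; the sorting-based routine for the Euclidean projection onto the simplex is a standard choice on our part, since the paper does not prescribe how the projection is computed:

import numpy as np

def project_to_simplex(x):
    # Euclidean projection of x onto the probability simplex {y : y >= 0, sum(y) = 1}.
    u = np.sort(x)[::-1]                      # sort entries in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, len(x) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(x + tau, 0.0)

def omd_policy_step(pi_hs, q_hs, eta_k):
    # One l2-regularized OMD step: pi^{k+1}_h(.|s) = Proj_simplex(pi^k_h(.|s) - eta_k * Q^k_h(.,s)).
    return project_to_simplex(pi_hs - eta_k * q_hs)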

Finally, we design a novel bonus term which carefully incorporates the reference $V$ estimator mentioned before by adding a weighted absolute difference, Term (iii), which is also referred to as the instability term. Specifically, we define

u_{h}^{k}(s,a)=\text{Term}(i)+\text{Term}(ii)+\text{Term}(iii)+\text{Term}(iv),   (5)

where

\text{Term}(i)=\sqrt{\frac{2\ln\frac{2SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}},

\text{Term}(ii)=\sqrt{\frac{6\mathbb{V}_{Y\sim\overline{P}_{h}^{k}(\cdot\mid s,a)}V_{h+1}^{\mathrm{ref}}(Y)\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)}}+\sqrt{\frac{4H\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{S\cdot n_{h}^{k}(s,a)}}+\frac{8\sqrt{SH^{2}}\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)},

\text{Term}(iii)=\sum_{y\in\mathcal{S}}\left(\sqrt{\frac{2\overline{P}_{h}^{k}(y\mid s,a)(1-\overline{P}_{h}^{k}(y\mid s,a))\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)-1}}+\frac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}\right)\big|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)\big|,

\text{Term}(iv)=\sqrt{\frac{4H\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}+\frac{7SH\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}+\frac{2S^{3/2}A^{1/4}H^{7/4}K^{1/4}\sqrt{\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}}{n_{h}^{k}(s,a)},

where Term (i) is the estimation error for the reward functions and Terms (ii)-(iv) are estimation errors for the transition kernels. We will provide more explanation for this seemingly complicated term in Section 4. Furthermore, we set the bonus as

b^{k}_{h}(s,a)=\min\left\{u_{h}^{k}(s,a),\ \sqrt{\frac{2\ln\frac{2SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}+H\sqrt{\frac{4S\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}\right\}.   (6)

With this carefully chosen bonus function, we can achieve optimism as in previous work on optimistic policy optimization (Shani et al., 2020; Cai et al., 2020). Notably, our bonus function is smaller than that in Shani et al. (2020), which leads to our tighter regret bound.
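
As a rough illustration only (the function and variable names here are ours, and we sketch only the cap in (6) and the instability Term (iii) of (5)):

import numpy as np

def capped_bonus(u, n_sa, S, A, H, T, delta_p):
    # b_h^k(s,a) = min{ u_h^k(s,a), Hoeffding-style bonus used in Shani et al. (2020) }, cf. (6).
    cap = (np.sqrt(2 * np.log(2 * S * A * H * T / delta_p) / n_sa)
           + H * np.sqrt(4 * S * np.log(3 * S * A * H * T / delta_p) / n_sa))
    return min(u, cap)

def instability_term(P_bar_sa, n_sa, V_next, V_ref_next, S, A, H, K, delta_p):
    # Term (iii) of (5): Bernstein-type weights times |V^k_{h+1}(y) - V^ref_{h+1}(y)|, summed over next states y.
    # P_bar_sa, V_next, V_ref_next are vectors indexed by the next state y.
    log_t = np.log(2 * S * A * H * K / delta_p)
    weight = (np.sqrt(2 * P_bar_sa * (1 - P_bar_sa) * log_t / max(n_sa - 1, 1))
              + 7 * log_t / (3 * n_sa))
    return float(np.sum(weight * np.abs(V_next - V_ref_next)))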

Now we state our main theoretical results for RPO-SAT.

Theorem 3.1 (Regret bound).

Suppose in Algorithm 1 we choose $\eta_{t}=O(\sqrt{1/(H^{2}At)})$ and $C_{0}=\sqrt{S^{3}AH^{3}}$. Then for sufficiently large $K$, we have

\operatorname{Regret}(K)\leq\widetilde{O}\big(\sqrt{SAH^{3}K}+\sqrt{AH^{4}K}+S^{5/2}A^{5/4}H^{3/2}K^{1/4}\big).

We provide a proof sketch in Section 5. The full proof is in Appendix B. Note that previous literature shows that the regret lower bound is $\Omega(\sqrt{SAH^{3}K})$ (Jin et al., 2018). Hence our result matches the information theoretic limit up to logarithmic factors when $S>H$.

4 Technique Overview

In this section, we illustrate the main steps of achieving near optimal regret bound and introduce our key techniques.

Achieving Optimism via Reference. Similar to previous works (Cai et al., 2020; Shani et al., 2020), to achieve optimism a crucial step is to design a proper bonus term to upper bound $(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)V^{k}_{h+1}(\cdot)$. For example, Jaksch et al. (2010); Shani et al. (2020) bound this term in a separate way:

\left|(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)V^{k}_{h+1}(\cdot)\right|\leq\|(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)\|_{1}\cdot\|V^{k}_{h+1}(\cdot)\|_{\infty}\leq\widetilde{O}\Big(\sqrt{\frac{SH^{2}}{n_{h}^{k}(s,a)}}\Big),   (7)

which leads to an additional $\sqrt{S}$ factor because it does not make use of the optimism of $Q^{k}$. This issue is addressed by the value-based algorithm UCBVI (Azar et al., 2017), which divides the term into two parts:

(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)V^{k}_{h+1}(\cdot)=(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)V^{*}_{h+1}(\cdot)+(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)(V^{k}_{h+1}-V_{h+1}^{*})(\cdot).   (8)

The first term is bounded by a straightforward application of the Chernoff-Hoeffding inequality, which removes the $\sqrt{S}$ factor since $V^{*}$ is deterministic. Thanks to the fact that $V_{h}^{k}\leq V_{h}^{*}$ for any $(k,h)\in[K]\times[H]$, they can bound the second term successfully (see Appendix C for more details). However, this approach is not applicable to policy-based methods, which improve the policy in a conservative way instead of choosing the greedy policy (when the stepsize $\eta\rightarrow\infty$, the OMD update becomes the greedy policy, and for any $\eta<\infty$ this update is "conservatively" greedy). This key property of policy-based methods makes it only possible to ensure $V^{k}_{h}\leq V^{\pi^{k}}_{h}$; the optimism $V^{k}_{h}\leq V^{*}_{h}$ does not hold in general. To tackle this challenge, we notice that as long as $V^{k}_{h}$ converges to $V_{h}^{*}$ sufficiently fast (at least on average), $(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)(V^{k}_{h+1}-V_{h+1}^{*})(\cdot)$ can be bounded by a term related to the rate of convergence. This leads to the important notion called Stable at Any Time (SAT). Precisely, we introduce the reference $V$ estimator $V^{\mathrm{ref}}=\{V^{\mathrm{ref}}_{h}\}_{h\in[H]}$ and decompose $(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)V^{k}_{h+1}(\cdot)$ as

(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)V^{k}_{h+1}(\cdot)=\underbrace{(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)V^{*}_{h+1}(\cdot)}_{(a)}+\underbrace{(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)(V^{k}_{h+1}-V_{h+1}^{\mathrm{ref}})(\cdot)}_{(b)}+\underbrace{(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)(V^{\mathrm{ref}}_{h+1}-V_{h+1}^{*})(\cdot)}_{(c)}.   (9)

By standard concentration inequalities, we can bound the first term by a quantity depending on the variance of $V_{h+1}^{*}$. Once we have SAT, an easy implication is that $|V_{h}^{\mathrm{ref}}(s)-V_{h}^{*}(s)|\leq O(\sqrt{H/S})$ for any $s$ with $n^{k}_{h}(s)$ large enough (Lemma 5.3). We can therefore replace the (unknown) variance of $V_{h+1}^{*}$ by the (known) variance of $V_{h+1}^{\mathrm{ref}}$, bound the second and third terms in a way analogous to (7), and remove the factor $\sqrt{S}$ as desired. Specifically, we upper bound Terms $(a),(b),(c)$ in (9) by Terms $(ii),(iii),(iv)$ in (5), respectively.

Stable at Any Time. The high-level idea is that in order to control $(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)(V^{k}_{h+1}-V_{h+1}^{*})(\cdot)$, the term $V^{k}_{h+1}-V_{h+1}^{*}$ must satisfy certain properties. For example, Zhang et al. (2020b) show that in the scenario of greedy policies, this term converges to zero as the number of visits goes to infinity. However, due to the nature of the conservative policy update scheme, we cannot guarantee convergence unless we make the coverage assumption as in Agarwal et al. (2021). Fortunately, it is possible to obtain an average convergence guarantee, which we call SAT. Specifically, we say that an algorithm satisfies the SAT property if for all $K^{\prime}$ and $h$,

\sum_{k=1}^{K^{\prime}}|V_{h}^{k}(s^{k}_{h})-V_{h}^{*}(s^{k}_{h})|\leq O(\sqrt{K^{\prime}}).   (10)

The meaning of Stable at Any Time can be interpreted as follows: the above inequality implies that the estimate $V^{k}_{h}$ varies around the fixed value $V_{h}^{*}$, hence "stable"; and since we require the inequality to hold for all $h$ and $K^{\prime}$, hence "at any time". For this reason, we also name $\sum_{y\in\mathcal{S}}\sqrt{\frac{\overline{P}^{k}_{h}(y\mid s,a)}{n^{k}_{h}(s,a)}}|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)|$ in the bonus term (5) the instability term, since it is an upper bound of $(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)(V^{k}_{h+1}-V_{h+1}^{\mathrm{ref}})(\cdot)$, which measures the instability of $V^{k}$ with respect to the fixed reference $V^{\mathrm{ref}}$.
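
SAT involves $V^{*}$ and hence cannot be checked online, but in a simulated MDP where $V^{*}$ is computable one can plot the cumulative deviation and compare it against a $c\sqrt{K^{\prime}}$ curve; a minimal sketch (our own diagnostic, not part of the algorithm):

import numpy as np

def sat_deviation_curves(v_hat, v_star):
    # v_hat[k, h]  = V_h^k(s_h^k)  (the algorithm's estimate along its own trajectory)
    # v_star[k, h] = V_h^*(s_h^k)  (optimal value at the same states, known only in simulation)
    # Returns D with D[K'-1, h] = sum_{k <= K'} |V_h^k(s_h^k) - V_h^*(s_h^k)|;
    # SAT (10) holds if every column grows no faster than a constant times sqrt(K').
    return np.cumsum(np.abs(v_hat - v_star), axis=0)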

Another nice implication is that we can obtain a precise estimate of $V^{*}$ if SAT is satisfied. Specifically, combining the update rule of the reference $V$ estimator with the fact that $n^{k}_{h}(s)\geq C_{0}\sqrt{k}$, we have:

|V^{\mathrm{ref}}_{h}(s)-V^{*}_{h}(s)|\leq\frac{1}{n^{k}_{h}(s)}\sum_{i=1}^{k}|V_{h}^{i}(s^{i}_{h})-V^{*}_{h}(s^{i}_{h})|\mathbf{1}[s^{i}_{h}=s]\leq O(1/C_{0}).

Coarse Regret Bound. We show that (10) is implied by

\operatorname{Regret}_{1}(K^{\prime})\leq O(\sqrt{K^{\prime}})   (11)

for any $K^{\prime}\in[K]$. See Lemma 5.3 for details. We point out that establishing Equation (10) from the coarse regret bound is highly non-trivial. There are two challenges:

  1. Previous policy-based methods (Cai et al., 2020; Shani et al., 2020) choose a constant mirror descent stepsize depending on $K$, which prevents us from obtaining a regret bound sublinear in $K^{\prime}$ for all $K^{\prime}\in[K]$.

  2. For steps $2\leq h\leq H$, the state $s_{h}^{k}$ is not fixed, which makes the OMD term difficult to bound.

To this end, we use the following two novel techniques:

  1. Replace the unbounded KL-divergence regularization term by the bounded $\ell_{2}$ regularization term, which allows us to choose a varying stepsize that depends on the current episode index instead of $K$.

  2. Forward Induction, that is, deriving the regret upper bound for the 1-st step and obtaining the regret upper bounds for the $h$-th steps ($2\leq h\leq H$) by induction. See Section 5 for more details.

Putting these together, we can remove the additional factor $\sqrt{S}$ as desired. By additionally adopting the Bernstein variance reduction technique (Azar et al., 2017), we further improve the main term by a $\sqrt{H}$ factor.

5 Proof Sketch of Theorem 3.1

In this section, we provide a sketch of the proof of Theorem 3.1. One key ingredient in regret analysis in RL is the optimism condition, namely

-2b^{k}_{h}(s,a)\leq Q^{k}_{h}(s,a)-c_{h}(s,a)-P_{h}(\cdot\mid s,a)V^{k}_{h+1}(\cdot)\leq 0.   (12)

This condition guarantees that $Q^{k}_{h}$ is an optimistic estimate of $Q_{h}^{\pi^{k}}$ (or $Q_{h}^{*}$), and that the regret can be bounded by the sum of bonus terms (Jaksch et al., 2010; Azar et al., 2017; Cai et al., 2020; Shani et al., 2020).

In our case, it is not straightforward to verify whether this condition holds, since the bonus term $b_{h}^{k}$ depends on the value of $V^{\mathrm{ref}}$, which we do not know in advance. Dealing with this problem needs additional effort. Let us put it aside for a while and temporarily assume that (12) holds. In Section 5.1, we first demonstrate the intuition of the proof under this assumption. Then in Section 5.2, we show how to remove this assumption with additional techniques.

5.1 A warm-up: the Optimism Assumption

In this section, we demonstrate the intuition of the proof under the assumption that the optimism condition (12) holds. We first recall the notion of $h$-th step regret up to episode $K^{\prime}$:

\operatorname{Regret}_{h}(K^{\prime})=\sum^{K^{\prime}}_{k=1}\big(V_{h}^{\pi^{k}}(s^{k}_{h})-V_{h}^{*}(s^{k}_{h})\big).

This notion coincides with the usual regret when $h=1$; in other words, $\operatorname{Regret}_{1}(K^{\prime})=\operatorname{Regret}(K^{\prime})$. We first state the following lemma, which bounds the estimation error using the bonus function:

Lemma 5.1 (Bounding estimation error).

Suppose that optimism holds, namely for all $k\leq K$, $h\leq H$, $s\in\mathcal{S}$, $a\in\mathcal{A}$,

-2b^{k}_{h}(s,a)\leq Q^{k}_{h}(s,a)-c_{h}(s,a)-P_{h}(\cdot\mid s,a)V^{k}_{h+1}\leq 0.

Then for all $K^{\prime}\leq K$, $h^{\prime}\leq H$, it holds with high probability that

\sum_{k=1}^{K^{\prime}}\big(V^{\pi^{k}}_{h^{\prime}}(s^{k}_{h^{\prime}})-V^{k}_{h^{\prime}}(s^{k}_{h^{\prime}})\big)\leq\widetilde{O}\Big(\sum_{k=1}^{K^{\prime}}\sum_{h=h^{\prime}}^{H}b^{k}_{h}(s^{k}_{h},a^{k}_{h})\Big)+\widetilde{O}(\sqrt{H^{3}K^{\prime}}).
Proof.

See §B.1 for a detailed proof. ∎

The above lemma is useful since it tells us that the sum of the estimation errors $V^{\pi^{k}}_{h}(s^{k}_{h})-V^{k}_{h}(s^{k}_{h})$ can be roughly viewed as the sum of bonus functions. By the choice of our bonus term, we have $b^{k}_{h}(s,a)\lesssim\sqrt{\frac{SH^{2}}{n_{h}^{k}(s,a)}}$. Following standard techniques from Shani et al. (2020), we get an easy corollary: $\sum_{k=1}^{K^{\prime}}\big(V^{\pi^{k}}_{h}(s^{k}_{h})-V^{k}_{h}(s^{k}_{h})\big)\leq\widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}})$.

Next, we show that our algorithm satisfies a coarse regret bound:

Lemma 5.2 (Coarse regret bound).

With the same assumptions and notation as in Lemma 5.1, we have for all $K^{\prime}\leq K$:

\operatorname{Regret}(K^{\prime})=\sum_{k=1}^{K^{\prime}}\big(V^{\pi^{k}}_{1}(s_{1}^{k})-V^{*}_{1}(s_{1}^{k})\big)\leq\widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}}).   (13)
Proof.

We first decompose the regret term in the following way:

\sum_{k=1}^{K^{\prime}}\big(V_{1}^{\pi^{k}}(s_{1})-V_{1}^{*}(s_{1})\big)=\underbrace{\sum_{k=1}^{K^{\prime}}\big(V_{1}^{\pi^{k}}(s_{1})-V_{1}^{k}(s_{1})\big)}_{(i)}+\underbrace{\sum_{k=1}^{K^{\prime}}\big(V_{1}^{k}(s_{1})-V_{1}^{*}(s_{1})\big)}_{(ii)}.   (14)

Term $(i)$ can be bounded by $\widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}})$ using the corollary of Lemma 5.1. We take a closer look at Term $(ii)$; by the standard regret decomposition lemma (Lemma A.1):

\text{Term}\ (ii)=\underbrace{\sum_{k,h}\mathbb{E}\left[\left\langle Q_{h}^{k}(s_{h},\cdot),\pi_{h}^{k}(\cdot\mid s_{h})-\pi_{h}^{*}(\cdot\mid s_{h})\right\rangle\mid s_{1},\pi^{*},P\right]}_{(iii)}+\underbrace{\sum_{k,h}\mathbb{E}\left[Q_{h}^{k}(s_{h},a_{h})-c_{h}(s_{h},a_{h})-P_{h}(\cdot\mid s_{h},a_{h})V_{h+1}^{k}\mid s_{1},\pi^{*},P\right]}_{(iv)}.

Term $(iii)$ is also called the OMD term. Using Lemma B.6, we have

\text{Term}\ (iii)\leq\widetilde{O}(\sqrt{AH^{4}K^{\prime}}).

We note that the reason why we change the Bregman term and the learning rate schedule is to ensure a bound of the form $\text{Term}\ (iii)\leq\widetilde{O}(\sqrt{K^{\prime}})$. Shani et al. (2020) use the KL-divergence as the Bregman term and a fixed learning rate that is a function of $K$, hence their bound is $\text{Term}\ (iii)\leq\widetilde{O}(\sqrt{\log(A)H^{4}K})$. Although our choice of the $\ell_{2}$ penalty term and decreasing learning rate leads to a larger dependence on $A$, it has a better dependence on $K^{\prime}$, which is crucial to ensure that we can learn a good reference function.

Combining this with the fact that optimism holds, we have $\text{Term}\ (iv)\leq 0$, and therefore the lemma is proved. ∎

Now we have a $\operatorname{Regret}(K^{\prime})=\widetilde{O}(\sqrt{K^{\prime}})$ bound for all $K^{\prime}\leq K$. We show that this immediately implies the following key lemma, which we call the Average Convergence Lemma. It guarantees that the reference value is a good approximation of $V^{*}$:

Lemma 5.3 (Average convergence of $V^{k}$).

With the same assumptions and notation as in Lemma 5.1, we have for all $K^{\prime}\leq K$, $h\leq H$,

\operatorname{Regret}_{h}(K^{\prime})\leq\widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}}).   (15)

As a consequence, SAT holds: for all $K^{\prime}\leq K$, $h\leq H$,

\sum_{k=1}^{K^{\prime}}|V^{k}_{h}(s^{k}_{h})-V^{*}_{h}(s^{k}_{h})|\leq\widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}}).
Proof.

For the proof of Equation (15), we use a novel technique called Forward Induction, starting from the case $h=1$: $\operatorname{Regret}_{1}(K^{\prime})=\operatorname{Regret}(K^{\prime})\leq\widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}})$, which is true according to Lemma 5.2. Using induction we can prove Equation (15) for all $h$; we leave the details to Appendix B.3. For the second statement, we note that

|V^{k}_{h}(s^{k}_{h})-V^{*}_{h}(s^{k}_{h})|\leq|V^{k}_{h}(s^{k}_{h})-V^{\pi^{k}}_{h}(s^{k}_{h})|+|V^{\pi^{k}}_{h}(s^{k}_{h})-V^{*}_{h}(s^{k}_{h})|=\big(V^{\pi^{k}}_{h}(s^{k}_{h})-V^{k}_{h}(s^{k}_{h})\big)+\big(V^{\pi^{k}}_{h}(s^{k}_{h})-V^{*}_{h}(s^{k}_{h})\big).

These two terms can be handled by Lemma 5.1 and Equation (15), respectively, which finishes the proof of Lemma 5.3. ∎

As mentioned before, Lemma 5.3 is significant in the sense that it implies the reference value $V^{\mathrm{ref}}_{h}(s)=\frac{1}{n^{k}_{h}(s)}\sum_{i=1}^{k}V^{i}_{h}(s^{i}_{h})\mathbf{1}[s^{i}_{h}=s]$ is close to the optimal value whenever $n^{k}_{h}(s)$ is large enough, in other words, whenever $n^{k}_{h}(s)\geq C_{0}\sqrt{k}=\sqrt{S^{3}AH^{3}k}$:

|V^{\mathrm{ref}}_{h}(s)-V^{*}_{h}(s)|\leq\frac{1}{n^{k}_{h}(s)}\sum_{i=1}^{k}|V^{i}_{h}(s^{i}_{h})-V_{h}^{*}(s_{h}^{i})|\mathbf{1}[s^{i}_{h}=s]\leq\frac{1}{n^{k}_{h}(s)}\sum_{i=1}^{k}|V^{i}_{h}(s^{i}_{h})-V_{h}^{*}(s_{h}^{i})|\leq\widetilde{O}\Big(\sqrt{\frac{H}{S}}\Big).

This explains why we choose $C_{0}=\sqrt{S^{3}AH^{3}}$ in the algorithm; the $\widetilde{O}(\sqrt{H/S})$ bound plays a key role in reducing the $\sqrt{S}$ factor in the regret.
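
Concretely, the last inequality follows by plugging the bound of Lemma 5.3 and the visit threshold into the average:

\frac{1}{n^{k}_{h}(s)}\sum_{i=1}^{k}|V^{i}_{h}(s^{i}_{h})-V_{h}^{*}(s_{h}^{i})|\leq\frac{\widetilde{O}(\sqrt{S^{2}AH^{4}k})}{C_{0}\sqrt{k}}=\frac{\widetilde{O}(\sqrt{S^{2}AH^{4}k})}{\sqrt{S^{3}AH^{3}k}}=\widetilde{O}\Big(\sqrt{\frac{H}{S}}\Big).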

Finally, we are ready to prove Theorem 3.1. Combining Lemma 5.1 and Equation (14),

\sum_{k=1}^{K}\big(V_{1}^{\pi^{k}}(s_{1}^{k})-V_{1}^{*}(s_{1}^{k})\big)\leq\widetilde{O}\Big(\sum_{k=1}^{K}\sum_{h=1}^{H}b^{k}_{h}(s^{k}_{h},a^{k}_{h})\Big)+\widetilde{O}(\sqrt{AH^{4}K}).   (16)

The only thing left is to prove that

\sum_{k=1}^{K}\sum_{h=1}^{H}b^{k}_{h}(s^{k}_{h},a^{k}_{h})\leq\widetilde{O}(\sqrt{SAH^{3}K}+S^{\frac{5}{2}}A^{\frac{5}{4}}H^{\frac{11}{4}}K^{\frac{1}{4}}),

which holds by Lemma B.7. This finishes the proof.

5.2 How to prove optimism

In this section, we show how to remove the optimism assumption from Lemma 5.1, Lemma 5.2 and Lemma 5.3. In fact, we first prove the following lemma, which shows that both optimism and the coarse regret bound hold for all episodes.

Lemma 5.4 (Coarse analysis of regret).

For any $K^{\prime}\in[K]$, we have the following regret bound:

\operatorname{Regret}(K^{\prime})=\sum_{k=1}^{K^{\prime}}\big(V^{\pi^{k}}_{1}(s_{1}^{k})-V^{*}_{1}(s_{1}^{k})\big)\leq\widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}}),   (17)

and optimism holds, namely

-2b_{h}^{k}(s,a)\leq Q_{h}^{k}(s,a)-c_{h}(s,a)-P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)\leq 0.   (18)

The full proof is in Appendix B.2. Here we explain the core idea. To prove Lemma 5.4 we use induction. For the first few episodes of the algorithm, the bonus term is dominated by $\sqrt{\frac{2\ln\frac{2SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}+H\sqrt{\frac{4S\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}$, which is the same as in Shani et al. (2020). Therefore optimism and the coarse regret bound automatically hold.

In the induction step, assume that optimism and the coarse regret bound hold for all $k\leq K^{\prime}$ and all $h$. Then by optimism, the 1-st step regret is bounded by $\widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}})$, and by Lemma 5.3 we have $\operatorname{Regret}_{h}(K^{\prime})\leq\widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}})$ for all $h$. This means that if a new reference function is computed in the $(K^{\prime}+1)$-th episode, then this new reference function must be $\sqrt{H/S}$-accurate, i.e., $|V^{\mathrm{ref}}(s)-V^{*}(s)|\leq O(\sqrt{H/S})$. Hence, the carefully designed bonus guarantees that optimism still holds for $k=K^{\prime}+1$, which finishes the proof. The induction process is shown in Figure 1.

Finally, thanks to optimism, we can bound the regret by $\sum_{k=1}^{K}\sum_{h=1}^{H}b^{k}_{h}(s^{k}_{h},a^{k}_{h})$. The rest follows from the same arguments as in the previous section.

[Figure 1 depicts the induction chain: optimism for all $(k,h,s,a)\in[K^{\prime}]\times[H]\times\mathcal{S}\times\mathcal{A}$, i.e., $Q_{h}^{k}(s,a)-c_{h}(s,a)-P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)\leq 0$, implies a 1-st step regret bound $\sum_{k=1}^{K^{\prime}}V_{1}^{\pi^{k}}(s_{1}^{k})-V_{1}^{*}(s_{1}^{k})\leq O(\sqrt{K^{\prime}})$; Forward Induction ($h\leftarrow h+1$) extends it to the $h$-th step regret $\sum_{k=1}^{K^{\prime}}V_{h}^{\pi^{k}}(s_{h}^{k})-V_{h}^{*}(s_{h}^{k})\leq O(\sqrt{K^{\prime}})$; consequently any $V^{\mathrm{ref}}(s)$ updated before episode $K^{\prime}+1$ satisfies $|V^{\mathrm{ref}}(s)-V^{*}(s)|\leq O(\sqrt{H/S})$; this in turn yields optimism for all $(k,h,s,a)\in[K^{\prime}+1]\times[H]\times\mathcal{S}\times\mathcal{A}$.]
Figure 1: Proving optimism and hh-th step regret by induction

6 Conclusion and Future Work

In this paper, we proposed the first optimistic policy optimization algorithm for tabular, episodic RL that achieves a regret guarantee of $\widetilde{O}(\sqrt{SAH^{3}K}+\sqrt{AH^{4}K}+\operatorname{poly}(S,A,H)K^{1/4})$. This algorithm improves upon previous results (Shani et al., 2020) and matches the information theoretic limit $\Omega(\sqrt{SAH^{3}K})$ when $S>H$. Our results also raise a number of promising directions for future work. Theoretically, can we design better policy-based methods that eliminate the constraint $S>H$? Practically, can we leverage the insights of RPO-SAT to improve practical policy-based RL algorithms? Specifically, can we design different regularization terms to stabilize the $V/Q$ estimation process and make such algorithms more sample efficient? We look forward to answering these questions in the future.

Acknowledgement

Liwei Wang was supported by the Exploratory Research Project of Zhejiang Lab (No. 2022RC0AN02), Project 2020BD006 supported by the PKU-Baidu Fund, and the major key project of PCL (PCL2021A12).

References

  • Agarwal et al. (2021) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76, 2021.
  • Azar et al. (2017) Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272. PMLR, 2017.
  • Beck & Teboulle (2003) Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
  • Bhandari & Russo (2019) Bhandari, J. and Russo, D. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
  • Cai et al. (2020) Cai, Q., Yang, Z., Jin, C., and Wang, Z. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pp. 1283–1294. PMLR, 2020.
  • Domingues et al. (2021) Domingues, O. D., Ménard, P., Kaufmann, E., and Valko, M. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory, pp.  578–598. PMLR, 2021.
  • Fazel et al. (2018) Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pp. 1467–1476. PMLR, 2018.
  • Fei et al. (2020) Fei, Y., Yang, Z., Wang, Z., and Xie, Q. Dynamic regret of policy optimization in non-stationary environments. arXiv preprint arXiv:2007.00148, 2020.
  • Gu et al. (2017) Gu, S., Holly, E., Lillicrap, T., and Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pp.  3389–3396. IEEE, 2017.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
  • He et al. (2021) He, J., Zhou, D., and Gu, Q. Nearly optimal regret for learning adversarial mdps with linear function approximation. arXiv preprint arXiv:2102.08940, 2021.
  • Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010.
  • Jin et al. (2018) Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is q-learning provably efficient? arXiv preprint arXiv:1807.03765, 2018.
  • Kakade (2001) Kakade, S. M. A natural policy gradient. Advances in neural information processing systems, 14, 2001.
  • Kalashnikov et al. (2018) Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
  • Lancewicki et al. (2020) Lancewicki, T., Rosenberg, A., and Mansour, Y. Learning adversarial markov decision processes with delayed feedback. arXiv preprint arXiv:2012.14843, 2020.
  • Levine et al. (2016) Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • Liu et al. (2019) Liu, B., Cai, Q., Yang, Z., and Wang, Z. Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems, pp. 10565–10576, 2019.
  • Luo et al. (2021) Luo, H., Wei, C.-Y., and Lee, C.-W. Policy optimization in adversarial mdps: Improved exploration via dilated bonuses. arXiv preprint arXiv:2107.08346, 2021.
  • Maurer & Pontil (2009) Maurer, A. and Pontil, M. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
  • Menard et al. (2021) Menard, P., Domingues, O. D., Shang, X., and Valko, M. Ucb momentum q-learning: Correcting the bias without forgetting. arXiv preprint arXiv:2103.01312, 2021.
  • Orabona (2019) Orabona, F. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
  • Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. PMLR, 2015.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Shani et al. (2020) Shani, L., Efroni, Y., Rosenberg, A., and Mannor, S. Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning, pp. 8604–8613. PMLR, 2020.
  • Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
  • Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
  • Sutton et al. (1999) Sutton, R. S., McAllester, D. A., Singh, S. P., Mansour, Y., et al. Policy gradient methods for reinforcement learning with function approximation. In NIPs, volume 99, pp.  1057–1063. Citeseer, 1999.
  • Wang et al. (2019) Wang, L., Cai, Q., Yang, Z., and Wang, Z. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.
  • Weissman et al. (2003) Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., and Weinberger, M. J. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
  • Zanette & Brunskill (2019) Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp. 7304–7312. PMLR, 2019.
  • Zanette et al. (2020) Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pp. 10978–10989. PMLR, 2020.
  • Zanette et al. (2021) Zanette, A., Cheng, C.-A., and Agarwal, A. Cautiously optimistic policy optimization and exploration with linear function approximation. arXiv preprint arXiv:2103.12923, 2021.
  • Zhang et al. (2020a) Zhang, Z., Ji, X., and Du, S. S. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. arXiv preprint arXiv:2009.13503, 2020a.
  • Zhang et al. (2020b) Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learningvia reference-advantage decomposition. Advances in Neural Information Processing Systems, 33, 2020b.
  • Zhong et al. (2021) Zhong, H., Yang, Z., Wang, Z., and Szepesvári, C. Optimistic policy optimization is provably efficient in non-stationary mdps. arXiv preprint arXiv:2110.08984, 2021.

Appendix A Regret Decomposition and Failure Events

A.1 Regret Decomposition

For any $(k,h)\in[K]\times[H]$, we define

\zeta_{k,h}^{1}=[V_{h}^{k}(s_{h}^{k})-V_{h}^{\pi^{k}}(s_{h}^{k})]-[Q_{h}^{k}(s_{h}^{k},a_{h}^{k})-Q_{h}^{\pi^{k}}(s_{h}^{k},a_{h}^{k})],   (19)
\zeta_{k,h}^{2}=[P_{h}(\cdot\mid s_{h}^{k},a_{h}^{k})V_{h+1}^{k}(\cdot)-P_{h}(\cdot\mid s_{h}^{k},a_{h}^{k})V_{h+1}^{\pi^{k}}(\cdot)]-[V_{h+1}^{k}(s_{h+1}^{k})-V_{h+1}^{\pi^{k}}(s_{h+1}^{k})].

By definition, $\zeta_{k,h}^{1}$ and $\zeta_{k,h}^{2}$ capture the randomness of executing the stochastic policy $\pi_{h}^{k}(\cdot\mid s_{h}^{k})$ and the randomness of observing the next state from the stochastic transition kernel $P_{h}(\cdot\mid s_{h}^{k},a_{h}^{k})$, respectively. With these notations, we have the following standard regret decomposition lemma (Cai et al., 2020; Shani et al., 2020).

Lemma A.1 (Regret Decomposition).

For any $(K^{\prime},h^{\prime})\in[K]\times[H]$, it holds that

\operatorname{Regret}_{h^{\prime}}(K^{\prime})=\underbrace{\sum_{k=1}^{K^{\prime}}\big(V_{h^{\prime}}^{\pi^{k}}(s_{h^{\prime}}^{k})-V_{h^{\prime}}^{k}(s_{h^{\prime}}^{k})\big)}_{(i)}+\underbrace{\sum_{k=1}^{K^{\prime}}\big(V_{h^{\prime}}^{k}(s_{h^{\prime}}^{k})-V_{h^{\prime}}^{*}(s_{h^{\prime}}^{k})\big)}_{(ii)}
=\underbrace{\sum_{k=1}^{K^{\prime}}\sum_{h=h^{\prime}}^{H}\left[c_{h}(s_{h}^{k},a_{h}^{k})+P_{h}(\cdot\mid s_{h}^{k},a_{h}^{k})V_{h+1}^{k}(\cdot)-Q_{h}^{k}(s_{h}^{k},a_{h}^{k})\right]}_{(i.1)}+\underbrace{\sum_{k=1}^{K^{\prime}}\sum_{h=h^{\prime}}^{H}(\zeta_{k,h}^{1}+\zeta_{k,h}^{2})}_{(i.2)}
+\underbrace{\sum_{k=1}^{K^{\prime}}\sum_{h=h^{\prime}}^{H}\mathbb{E}\Big[\langle Q_{h}^{k}(s_{h},\cdot),\pi_{h}^{k}(\cdot\mid s_{h})-\pi_{h}(\cdot\mid s_{h})\rangle\mid s_{h^{\prime}}=s_{h^{\prime}}^{k},\pi,P\Big]}_{(ii.1)}
+\underbrace{\sum_{k=1}^{K^{\prime}}\sum_{h=h^{\prime}}^{H}\mathbb{E}\left[Q_{h}^{k}(s_{h},a_{h})-c_{h}(s_{h},a_{h})-P_{h}(\cdot\mid s_{h},a_{h})V_{h+1}^{k}(\cdot)\mid s_{h^{\prime}}=s_{h^{\prime}}^{k},\pi,P\right]}_{(ii.2)}.
Proof.

See Cai et al. (2020) for a detailed proof. ∎

A.2 Failure Events

Definition A.2.

We define the following failure events:

F_{k}^{0}=\left\{\exists h:\Big|\sum_{k^{\prime}=1}^{k}\sum_{h^{\prime}=h}^{H}(\zeta_{k^{\prime},h^{\prime}}^{1}+\zeta_{k^{\prime},h^{\prime}}^{2})\Big|\geq\sqrt{16H^{3}K\ln\tfrac{2H}{\delta^{\prime}}}\right\},
F_{k}^{1}=\left\{\exists s,a,h:\big|c_{h}(s,a)-\overline{c}_{h}^{k}(s,a)\big|\geq\sqrt{\tfrac{2\ln\frac{2SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}\right\},
F_{k}^{2}=\left\{\exists s,a,h:\big\|P_{h}(\cdot\mid s,a)-\overline{P}_{h}^{k}(\cdot\mid s,a)\big\|_{1}\geq\sqrt{\tfrac{4S\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}\right\},
F_{k}^{3}=\left\{\exists s,a,h:\big|\big(P_{h}(\cdot\mid s,a)-\overline{P}_{h}^{k}(\cdot\mid s,a)\big)V_{h}^{*}(\cdot)\big|\geq H\sqrt{\tfrac{4\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}+\tfrac{2H\ln\frac{2SAHT}{\delta^{\prime}}}{3n_{h}^{k}(s,a)}\right\},
F_{k}^{4}=\left\{\exists s^{\prime},s,a,h:\big|\overline{P}_{h}^{k}(s^{\prime}\mid s,a)-P_{h}(s^{\prime}\mid s,a)\big|\geq\sqrt{\tfrac{2\overline{P}_{h}^{k}(s^{\prime}\mid s,a)(1-\overline{P}_{h}^{k}(s^{\prime}\mid s,a))\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)-1}}+\tfrac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}\right\},
F=\bigcup_{k=1}^{K}\left(F^{0}_{k}\cup F_{k}^{1}\cup F_{k}^{2}\cup F_{k}^{3}\cup F_{k}^{4}\right),

where $\delta^{\prime}=\frac{\delta}{5}$.

Lemma A.3.

It holds that

\Pr(F)\leq\delta.
Proof.

These standard concentration inequalities also appear in Azar et al. (2017); Cai et al. (2020); Shani et al. (2020). For completeness, we present the proof sketch here.

Note that the martingale differences $\zeta_{k,h}^{1}$ and $\zeta_{k,h}^{2}$ defined in (19) satisfy $|\zeta_{k,h}^{1}+\zeta_{k,h}^{2}|\leq 4H$. Letting $F^{0}=\bigcup_{k=1}^{K}F_{k}^{0}$, the Azuma-Hoeffding inequality gives $\Pr(F^{0})\leq\delta^{\prime}$.

Let $F^{1}=\bigcup_{k=1}^{K}F_{k}^{1}$. By Hoeffding's inequality, we have

\Pr\left\{\big|c_{h}(s,a)-\overline{c}_{h}^{k}(s,a)\big|\geq\sqrt{\frac{2\ln\frac{1}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}\right\}\leq\delta^{\prime}.   (20)

Using a union bound over all $s,a$ and all possible values of $n_{h}^{k}(s,a)$ and $k$, we have $\Pr\{F^{1}\}\leq\delta^{\prime}$.

Let $F^{2}=\bigcup_{k=1}^{K}F_{k}^{2}$. Then $\Pr\{F^{2}\}\leq\delta^{\prime}$, which follows from Weissman et al. (2003) after applying a union bound over all $s,a$ and all possible values of $n_{h}^{k}(s,a)$ and $k$.

Let $F^{3}=\bigcup_{k=1}^{K}F_{k}^{3}$. According to Azar et al. (2017), we have that with probability at least $1-\delta^{\prime}$,

\left|\left[\left(P_{h}-\overline{P}_{h}^{k}\right)V_{h}^{*}\right](s,a)\right|\leq\sqrt{\frac{2H^{2}\ln\left(\frac{2HK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)}}+\frac{2H\ln\left(\frac{2HK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}.   (21)

Taking a union bound over $s,a,k$, we have $\Pr\{F^{3}\}\leq\delta^{\prime}$.

Let F4=k=1KFk4F^{4}=\bigcup_{k=1}^{K}F_{k}^{4}. The Empirical Bernstein inequality (Theorem 4 in (Maurer & Pontil, 2009)) combined with a union bound argument on s,a,ss,a,s^{\prime},nhk(s,a)n_{h}^{k}(s,a) also implies the following bound holds with probability at least 1δ1-\delta^{\prime}:

|P¯hk(ss,a)Ph(ss,a)|2P¯(ss,a)(1P¯(ss,a))ln(2SAHKδ)nhk(s,a)1+7ln(2SAHKδ)3nhk(s,a)\left|\overline{P}_{h}^{k}(s^{\prime}\mid s,a)-P_{h}(s^{\prime}\mid s,a)\right|\leq\sqrt{\frac{2\overline{P}(s^{\prime}\mid s,a)(1-\overline{P}(s^{\prime}\mid s,a))\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)-1}}+\frac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)} (22)

Therefore, \operatorname{Pr}\{F^{4}\}\leq\delta^{\prime}.

Finally, taking a union bound over the five events F^{0},F^{1},F^{2},F^{3},F^{4} and setting \delta^{\prime}=\frac{\delta}{5}, we obtain \operatorname{Pr}\left\{F\right\}\leq\delta. ∎

In the rest of the appendix we condition on the event F^{c}, i.e., we assume that the failure event F does not happen, which holds with probability at least 1-\delta.

Appendix B Missing Proofs for Section 5

B.1 Proof of Lemma 5.1

Proof.

By Lemma A.1, we have

k=1K(Vhπk(shk)Vhk(shk))\displaystyle\sum_{k=1}^{K^{\prime}}\big{(}V^{\pi_{k}}_{h}(s^{k}_{h})-V^{k}_{h}(s^{k}_{h})\big{)}
+\displaystyle+ k=1Kh=hH𝔼[Qhk(sh,),πhk(sh)πh(sh)sh=shk,π,P]+k=1Kh=hH(ζk,h1+ζk,h2).\displaystyle\sum_{k=1}^{K^{\prime}}\sum_{h=h^{\prime}}^{H}\mathbb{E}\Big{[}\langle Q_{h}^{k}(s_{h},\cdot),\pi_{h}^{k}(\cdot\mid s_{h})-\pi_{h}(\cdot\mid s_{h})\rangle\mid s_{h^{\prime}}=s_{h^{\prime}}^{k},\pi,P\Big{]}+\sum_{k=1}^{K^{\prime}}\sum_{h=h^{\prime}}^{H}(\zeta_{k,h}^{1}+\zeta_{k,h}^{2}).

Here \zeta_{k,h}^{1} and \zeta_{k,h}^{2} are martingales defined in (19) satisfying |\zeta_{k,h}^{1}+\zeta_{k,h}^{2}|\leq 4H. By the Azuma-Hoeffding inequality, together with the optimism property -2b^{k}_{h}(s,a)\leq Q^{k}_{h}(s,a)-c_{h}(s,a)-P_{h}(\cdot|s,a)V^{k}_{h+1}\leq 0, we finish the proof of Lemma 5.1. ∎
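For concreteness, the martingale term above can be made explicit by a standard Azuma-Hoeffding calculation (constants not optimized): since each summand satisfies |\zeta_{k,h}^{1}+\zeta_{k,h}^{2}|\leq 4H and there are at most HK^{\prime} summands, with probability at least 1-\delta^{\prime},

\sum_{k=1}^{K^{\prime}}\sum_{h=h^{\prime}}^{H}(\zeta_{k,h}^{1}+\zeta_{k,h}^{2})\leq 4H\sqrt{2HK^{\prime}\ln\tfrac{1}{\delta^{\prime}}}=O\Big(\sqrt{H^{3}K^{\prime}\ln\tfrac{1}{\delta^{\prime}}}\Big),

which is the source of the O(\sqrt{H^{3}K^{\prime}}) term in the bound of Lemma 5.1.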

B.2 Full Proof of Theorem 3.1

As discussed in Section 5, we only need to prove the coarse regret bound in Lemma 5.4, but this time without the optimism assumption. We restate the lemma for ease of reading.

Lemma B.1 (Coarse analysis of dynamic regret, restatement of Lemma 5.4).

Conditioned on FcF^{c}, for any K[K]K^{\prime}\in[K], we have the following regret bound:

Regret(K)=k=1K(V1πk(s1k)V1(s1k))O~(S2AH4K).\operatorname{Regret}(K^{\prime})=\sum_{k=1}^{K^{\prime}}\big{(}V^{\pi^{k}}_{1}(s_{1}^{k})-V^{*}_{1}(s_{1}^{k})\big{)}\leq\widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}}). (23)

and the optimism holds, namely

2bhk(s,a)Qhk(s,a)ch(s,a)Ph(s,a)Vh+1k()0.\displaystyle-2b_{h}^{k}(s,a)\leq Q_{h}^{k}(s,a)-c_{h}(s,a)-P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)\leq 0. (24)
Remark B.2.

In our algorithm, we change the Bregman penalty term from the KL-divergence d_{KL}(\pi_{h}\|\pi^{k}_{h}) to \|\pi_{h}-\pi^{k}_{h}\|_{2}^{2}, and we let the learning rate take the form \eta_{t}=O(\frac{1}{\sqrt{t}}) instead of a constant depending on K. We note that although this causes more regret in the OMD term, it is crucial for obtaining a \text{Regret}(K^{\prime})=O(\sqrt{K^{\prime}}) bound instead of \text{Regret}(K^{\prime})=O(K^{\prime}).

Proof.

First, by Lemma A.1, we decompose the regret in the following way.

k=1K(V1πk(s1)V1(s1))\displaystyle\sum_{k=1}^{K^{\prime}}\big{(}V_{1}^{\pi^{k}}\left(s_{1}\right)-V_{1}^{*}\left(s_{1}\right)\big{)}
\displaystyle=\underbrace{\sum_{k=1}^{K^{\prime}}\sum_{h=1}^{H}\left[c_{h}(s_{h}^{k},a_{h}^{k})+P_{h}(\cdot\mid s_{h}^{k},a_{h}^{k})V_{h+1}^{k}(\cdot)-Q_{h}^{k}(s_{h}^{k},a_{h}^{k})\right]}_{{(i)}}+\underbrace{\sum_{k=1}^{K^{\prime}}\sum_{h=1}^{H}(\zeta_{k,h}^{1}+\zeta_{k,h}^{2})}_{{(ii)}}
+k=1Kh=1H𝔼[Qhk(sh,),πhk(sh)πh(sh)s1=s1k,π,P](iii)\displaystyle+\underbrace{\sum_{k=1}^{K^{\prime}}\sum_{h=1}^{H}\mathbb{E}\Big{[}\langle Q_{h}^{k}(s_{h},\cdot),\pi_{h}^{k}(\cdot\mid s_{h})-\pi_{h}(\cdot\mid s_{h})\rangle\mid s_{1}=s_{1}^{k},\pi,P\Big{]}}_{(iii)}
+k=1Kh=1H𝔼[Qhk(sh,ah)ch(sh,ah)Ph(sh,ah)Vh+1k()s1=s1k,π,P](iv).\displaystyle+\underbrace{\sum_{k=1}^{K^{\prime}}\sum_{h=1}^{H}\mathbb{E}\left[Q_{h}^{k}\left(s_{h},a_{h}\right)-c_{h}\left(s_{h},a_{h}\right)-P_{h}\left(\cdot\mid s_{h},a_{h}\right)V_{h+1}^{k}(\cdot)\mid s_{1}=s_{1}^{k},\pi,P\right]}_{(iv)}.

We first prove (23) and (24) for the early episodes, namely those in which n_{h}^{k}(s)<C_{0}\sqrt{k} for all s and h (which is the case at the beginning of the algorithm). Note that in this case, conditioned on F^{c}, we have \frac{SH\sqrt{C_{0}}{K^{\prime}}^{1/4}}{n^{k}_{h}(s,a)}\geq\sqrt{\frac{H^{2}S}{n^{k}_{h}(s,a)}}, which implies that the bonus takes the form b_{h}^{k}(s,a)=\sqrt{\frac{2\ln\frac{2SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}+H\sqrt{\frac{4S\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}. In this case, by Lemma B.3, we have -2b_{h}^{k}(s,a)\leq Q_{h}^{k}(s,a)-c_{h}(s,a)-P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)\leq 0, which further implies that

Term (i)\displaystyle\text{Term }(i)\leq O(k=1Kh=1Hbhk(shk,ahk))\displaystyle O\big{(}\sum_{k=1}^{K^{\prime}}\sum_{h=1}^{H}b^{k}_{h}(s^{k}_{h},a^{k}_{h})\big{)}
\displaystyle\leq O(k=1Kh=1H2ln2SAHTδnhk(s,a)+H4Sln3SAHTδnhk(s,a))\displaystyle O\big{(}\sum_{k=1}^{K^{\prime}}\sum_{h=1}^{H}\sqrt{\frac{2\ln\frac{2SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}+H\sqrt{\frac{4S\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}\big{)}
\displaystyle\leq O~(S2AH4K),\displaystyle\widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}}),

and

Term (iv)0.\text{Term }(iv)\leq 0.
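The last inequality in the bound on Term (i) is the standard pigeonhole argument; one way to carry it out (constants not optimized) is

\sum_{k=1}^{K^{\prime}}\sum_{h=1}^{H}\frac{1}{\sqrt{n_{h}^{k}(s_{h}^{k},a_{h}^{k})}}\leq\sum_{h=1}^{H}\sum_{(s,a)}\sum_{n=1}^{n_{h}^{K^{\prime}}(s,a)}\frac{1}{\sqrt{n}}\leq\sum_{h=1}^{H}\sum_{(s,a)}2\sqrt{n_{h}^{K^{\prime}}(s,a)}\leq 2H\sqrt{SAK^{\prime}},

where the last step uses the Cauchy-Schwarz inequality together with \sum_{(s,a)}n_{h}^{K^{\prime}}(s,a)\leq K^{\prime}. Multiplying by the H\sqrt{4S\ln\frac{3SAHT}{\delta^{\prime}}} factor in b_{h}^{k} then gives the stated \widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}}) bound.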

Also, by Lemma B.3, we know that (24) holds in this period. Applying Lemma B.6 to Term (iii)(iii), we have

Term (iii)O~(AH4K)\text{Term }(iii)\leq\widetilde{O}(\sqrt{AH^{4}{K^{\prime}}})

Therefore, under event FcF^{c}, we have

k=1K(V1πk(s1)V1(s1))O~(S2AH4K)\sum_{k=1}^{K^{\prime}}\big{(}V_{1}^{\pi^{k}}\left(s_{1}\right)-V_{1}^{*}\left(s_{1}\right)\big{)}\leq\widetilde{O}(\sqrt{S^{2}AH^{4}{K^{\prime}}})

Next, we prove (23) and (24) for the remaining episodes. We proceed by induction on the episode index: we will show that (23) and (24) hold for every episode. We have already shown that this claim holds for the early episodes considered above.

Assume that the bound holds after K^{\prime}-1 episodes, i.e., \sum_{k=1}^{K^{\prime}-1}(V_{1}^{\pi^{k}}\left(s_{1}\right)-V_{1}^{*}\left(s_{1}\right))\leq\widetilde{O}(\sqrt{S^{2}AH^{4}(K^{\prime}-1)}); we want to prove that \sum_{k=1}^{K^{\prime}}(V_{1}^{\pi^{k}}\left(s_{1}\right)-V_{1}^{*}\left(s_{1}\right))\leq\widetilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}}).

If there exists (h,s) such that n_{h}^{K^{\prime}}(s)\geq C_{0}\sqrt{K^{\prime}} (i.e., we construct V_{h}^{\mathrm{ref}}(s) in this episode), then since (23) and (24) hold for all previous episodes, using Lemma 5.3 we have

|Vhref(s)Vh(s)|\displaystyle\left|V_{h}^{\mathrm{ref}}(s)-V_{h}^{*}(s)\right| =1nhK(s)|i=1K[Vhi(shi)Vh(s)]1[shi=s]|\displaystyle=\frac{1}{n^{K^{\prime}}_{h}(s)}\left|\sum_{i=1}^{K^{\prime}}\left[V^{i}_{h}(s^{i}_{h})-V_{h}^{*}(s)\right]\textbf{1}[s^{i}_{h}=s]\right|
1nhK(s)i=1K|Vhi(shi)Vh(s)|1[shi=s]\displaystyle\leq\frac{1}{n^{K^{\prime}}_{h}(s)}\sum_{i=1}^{K^{\prime}}\left|V^{i}_{h}(s^{i}_{h})-V_{h}^{*}(s)\right|\textbf{1}[s^{i}_{h}=s]
1C0Ki=1K|Vhi(s)Vh(s)|\displaystyle\leq\frac{1}{C_{0}\sqrt{K^{\prime}}}\sum_{i=1}^{K^{\prime}}\left|V^{i}_{h}(s)-V_{h}^{*}(s)\right|
1C0KS2AH4K=S2AH4C0\displaystyle\leq\frac{1}{C_{0}\sqrt{K^{\prime}}}\sqrt{S^{2}AH^{4}K^{\prime}}=\frac{\sqrt{S^{2}AH^{4}}}{C_{0}}
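Note that with the choice C_{0}=\sqrt{S^{3}AH^{3}} used in Lemma B.4, this accuracy level works out to

\frac{\sqrt{S^{2}AH^{4}}}{C_{0}}=\sqrt{\frac{S^{2}AH^{4}}{S^{3}AH^{3}}}=\sqrt{\frac{H}{S}},

which is exactly the accuracy of the reference function used in the proof of Lemma B.4.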

Therefore, by Lemma B.4 we know that Q_{h}^{k} is an optimistic estimate of Q_{h}^{*}, i.e., (24) holds. Together with Lemmas A.1 and B.6, we have

k=1K(V1πk(s1)V1(s1))\displaystyle\sum_{k=1}^{K^{\prime}}\big{(}V_{1}^{\pi^{k}}\left(s_{1}\right)-V_{1}^{*}\left(s_{1}\right)\big{)}
\displaystyle\leq O(k=1Kh=1Hbhk(shk,ahk))+O(AH4K)\displaystyle O\big{(}\sum_{k=1}^{K^{\prime}}\sum_{h=1}^{H}b^{k}_{h}(s^{k}_{h},a^{k}_{h})\big{)}+O(\sqrt{AH^{4}{K^{\prime}}})
\displaystyle\leq O(k=1Kh=1H2ln2SAHTδnhk(s,a)+H4Sln3SAHTδnhk(s,a))+O(AH4K)\displaystyle O\big{(}\sum_{k=1}^{K^{\prime}}\sum_{h=1}^{H}\sqrt{\frac{2\ln\frac{2SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}+H\sqrt{\frac{4S\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}\big{)}+O(\sqrt{AH^{4}{K^{\prime}}})
\displaystyle\leq O~(S2AH4K),\displaystyle\widetilde{O}(\sqrt{S^{2}AH^{4}{K^{\prime}}}),

which concludes our proof. ∎

B.3 Proof of Lemma 5.3

Proof.

We have

k=1K|Vhk(shk)Vh(shk)|\displaystyle\sum_{k=1}^{K^{\prime}}|V_{h}^{k}(s^{k}_{h})-V^{*}_{h}(s^{k}_{h})|
\displaystyle\leq k=1K|Vhk(shk)Vhπk(shk)|+k=1K|Vhπk(shk)Vh(shk)|\displaystyle\sum_{k=1}^{K^{\prime}}|V_{h}^{k}(s^{k}_{h})-V^{\pi^{k}}_{h}(s^{k}_{h})|+\sum_{k=1}^{K^{\prime}}|V_{h}^{\pi^{k}}(s^{k}_{h})-V^{*}_{h}(s^{k}_{h})|
=\displaystyle= k=1K(Vhπk(shk)Vhk(shk))+k=1K(Vhπk(shk)Vh(shk))\displaystyle\sum_{k=1}^{K^{\prime}}(V^{\pi^{k}}_{h}(s^{k}_{h})-V_{h}^{k}(s^{k}_{h}))+\sum_{k=1}^{K^{\prime}}(V_{h}^{\pi^{k}}(s^{k}_{h})-V^{*}_{h}(s^{k}_{h}))

where in the last step we use optimism (V_{h}^{k}\leq V_{h}^{*}) and the definition of V^{*} (which gives V_{h}^{*}\leq V_{h}^{\pi^{k}}). By Lemma 5.1, the first term \sum_{k=1}^{K^{\prime}}(V^{\pi^{k}}_{h}(s^{k}_{h})-V_{h}^{k}(s^{k}_{h})) can be bounded as

\displaystyle\sum_{k=1}^{K^{\prime}}\bigl{(}V_{h}^{\pi_{k}}\left(s_{h}^{k}\right)-V_{h}^{k}\left(s_{h}^{k}\right)\bigr{)} (25)
\displaystyle\leq O(k=1Kh=hHbhk(shk,ahk))+O(H3K)\displaystyle O\Big{(}\sum_{k=1}^{K^{\prime}}\sum_{h^{\prime}=h}^{H}b_{h^{\prime}}^{k}(s_{h^{\prime}}^{k},a_{h^{\prime}}^{k})\Big{)}+O(\sqrt{H^{3}K^{\prime}})
\displaystyle\leq O(k=1Kh=hHH2Snhk(shk,ahk))+O(H3K)\displaystyle O\Big{(}\sum_{k=1}^{K^{\prime}}\sum_{h^{\prime}=h}^{H}\sqrt{\frac{H^{2}S}{n_{h^{\prime}}^{k}(s_{h^{\prime}}^{k},a_{h^{\prime}}^{k})}}\Big{)}+O(\sqrt{H^{3}K^{\prime}})
\displaystyle\leq O~(S2AH4K).\displaystyle\tilde{O}(\sqrt{S^{2}AH^{4}K^{\prime}}).

For the second term k=1K(Vhπk(shk)Vh(shk))\sum^{K^{\prime}}_{k=1}(V_{h}^{\pi^{k}}(s^{k}_{h})-V^{*}_{h}(s^{k}_{h})), we use a forward induction trick:

First, the claim \sum^{K^{\prime}}_{k=1}(V_{h}^{\pi^{k}}(s^{k}_{h})-V^{*}_{h}(s^{k}_{h}))\leq O(\sqrt{S^{2}AH^{4}K^{\prime}}) holds for h=1, since in that case it is exactly the coarse regret bound (23). By induction, if the claim holds up to step h, then we have

k=1K(Vhπk(shk)Vh(shk))\displaystyle\sum_{k=1}^{K^{\prime}}\big{(}V^{\pi^{k}}_{h}(s^{k}_{h})-V^{*}_{h}(s^{k}_{h})\big{)}
=\displaystyle= k=1K(Qhπk(shk,),πhk(|shk)Qh(shk,),πh(|shk))\displaystyle\sum_{k=1}^{K^{\prime}}\big{(}\langle Q^{\pi^{k}}_{h}(s^{k}_{h},\cdot),\pi^{k}_{h}(\cdot|s^{k}_{h})\rangle-\langle Q^{*}_{h}(s^{k}_{h},\cdot),\pi^{*}_{h}(\cdot|s^{k}_{h})\rangle\big{)}
\displaystyle\geq k=1K(Qhπk(shk,),πhk(|shk)Qh(shk,),πhk(|shk))\displaystyle\sum_{k=1}^{K^{\prime}}\big{(}\langle Q^{\pi^{k}}_{h}(s^{k}_{h},\cdot),\pi^{k}_{h}(\cdot|s^{k}_{h})\rangle-\langle Q^{*}_{h}(s^{k}_{h},\cdot),\pi^{k}_{h}(\cdot|s^{k}_{h})\rangle\big{)}
=\displaystyle= k=1KPh(shk,)(Vh+1πkVh+1),πhk(|shk)\displaystyle\sum_{k=1}^{K^{\prime}}\langle P_{h}(s_{h}^{k},\cdot)(V^{\pi^{k}}_{h+1}-V^{*}_{h+1}),\pi^{k}_{h}(\cdot|s^{k}_{h})\rangle
=\displaystyle= k=1K(Vh+1πk(sh+1k)Vh+1(sh+1k))+Ph(|shk,πhk(shk))(Vh+1πkVh+1)()(Vh+1πkVh+1)(sh+1k)(a),\displaystyle\sum_{k=1}^{K^{\prime}}\big{(}V^{\pi^{k}}_{h+1}(s^{k}_{h+1})-V^{*}_{h+1}(s^{k}_{h+1})\big{)}+\underbrace{P_{h}(\cdot|s_{h}^{k},\pi_{h}^{k}(s_{h}^{k}))(V^{\pi^{k}}_{h+1}-V^{*}_{h+1})(\cdot)-(V^{\pi^{k}}_{h+1}-V^{*}_{h+1})(s^{k}_{h+1})}_{(a)},

where Term (a) is a martingale difference term bounded in absolute value by 2H. Using the Azuma-Hoeffding inequality, the sum of Term (a) over k is at most O(\sqrt{H^{2}K^{\prime}}) with high probability. Therefore by induction, for all h\in[H] we have

k=1KVh+1πk(sh+1k)Vh+1(sh+1k)k=1KV1πk(s1k)V1(s1k)+O(hH2K)O(S2AH4K),\sum_{k=1}^{K^{\prime}}V^{\pi^{k}}_{h+1}(s^{k}_{h+1})-V^{*}_{h+1}(s^{k}_{h+1})\leq\sum_{k=1}^{K^{\prime}}V^{\pi^{k}}_{1}(s^{k}_{1})-V^{*}_{1}(s^{k}_{1})+O(h\sqrt{H^{2}K^{\prime}})\leq O(\sqrt{S^{2}AH^{4}K^{\prime}}),

which concludes the proof of Lemma 5.3. ∎

B.4 Useful Lemmas

B.4.1 Optimism

Lemma B.3 (Optimism at the beginning).

Conditioned on FcF^{c}, when bhk(s,a)=2ln2SAHTδnhk(s,a)+H4Sln3SAHTδnhk(s,a)b_{h}^{k}(s,a)=\sqrt{\frac{2\ln\frac{2SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}+H\sqrt{\frac{4S\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}, we have that

2bhk(s,a)Qhk(s,a)ch(s,a)Ph(s,a)Vh+1k()0.-2b_{h}^{k}(s,a)\leq Q_{h}^{k}(s,a)-c_{h}(s,a)-P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)\leq 0. (26)
Proof.

Recall that Q_{h}^{k} takes the form

Qhk(s,a)=max{c¯hk(s,a)+P¯hk(s,a)Vh+1k()bhk(s,a),0}.\displaystyle Q_{h}^{k}(s,a)=\max\{\overline{c}_{h}^{k}(s,a)+\overline{P}_{h}^{k}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)-b_{h}^{k}(s,a),0\}. (27)

We have

ch(s,a)+Ph(s,a)Vh+1k()Qhk(s,a)\displaystyle c_{h}(s,a)+P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)-Q_{h}^{k}(s,a)
\displaystyle\leq ch(s,a)c¯hk(s,a)+Ph(s,a)Vh+1k()P¯hk(s,a)Vh+1k()+bhk(s,a).\displaystyle c_{h}(s,a)-\overline{c}_{h}^{k}(s,a)+P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)-\overline{P}_{h}^{k}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)+b_{h}^{k}(s,a). (28)

Under the event FcF^{c}, we have

ch(s,a)c¯hk(s,a)2ln2SAHTδnhk(s,a),\displaystyle c_{h}(s,a)-\overline{c}_{h}^{k}(s,a)\leq\sqrt{\frac{2\ln\frac{2SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}, (29)

and

|Ph(s,a)Vh+1k()P¯hk(s,a)Vh+1k()|\displaystyle\left|P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)-\overline{P}_{h}^{k}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)\right|
\displaystyle\leq P¯hk(sh,ah)Ph(sh,ah)1Vh+1k()\displaystyle\left\|\overline{P}_{h}^{k}\left(\cdot\mid s_{h},a_{h}\right)-P_{h}\left(\cdot\mid s_{h},a_{h}\right)\right\|_{1}\left\|V_{h+1}^{k}(\cdot)\right\|_{\infty}
\displaystyle\leq HP¯hk(sh,ah)Ph(sh,ah)1\displaystyle H\cdot\left\|\overline{P}_{h}^{k}\left(\cdot\mid s_{h},a_{h}\right)-P_{h}\left(\cdot\mid s_{h},a_{h}\right)\right\|_{1}
\displaystyle\leq H4Sln3SAHTδnhk(s,a).\displaystyle H\sqrt{\frac{4S\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}. (30)

Here the first inequality follows from Hölder's inequality, the second inequality uses the fact that \|V_{h}^{k}(\cdot)\|_{\infty}\leq H for any (k,h)\in[K]\times[H], and the last inequality holds conditioned on the event F^{c}. Plugging (29) and (30) into (28), together with the definition of b_{h}^{k}, we obtain

ch(s,a)+Ph(s,a)Vh+1k()Qhk(s,a)2bhk.\displaystyle c_{h}(s,a)+P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)-Q_{h}^{k}(s,a)\leq 2b_{h}^{k}. (31)

Similarly, by the definition of QhkQ_{h}^{k} in (27), we have

Qhk(s,a)ch(s,a)Ph(s,a)Vh+1k()\displaystyle Q_{h}^{k}(s,a)-c_{h}(s,a)-P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)
\displaystyle\leq max{(c¯hk(s,a)ch(s,a))+(P¯hk(s,a)Vh+1k()Ph(s,a)Vh+1k())bhk(s,a),0}\displaystyle\max\Big{\{}\big{(}\overline{c}_{h}^{k}(s,a)-c_{h}(s,a)\big{)}+\big{(}\overline{P}_{h}^{k}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)-{P}_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)\big{)}-b_{h}^{k}(s,a),0\Big{\}}
\displaystyle\leq max{bhkbhk,0}=0,\displaystyle\max\{b_{h}^{k}-b_{h}^{k},0\}=0, (32)

where the last inequality follows from (29) and (30). Combining (31) and (32), we finish the proof. ∎

Lemma B.4.

Conditioned on the event FcF^{c}, when the reference function VrefV^{\mathrm{ref}} satisfies |Vref(s)V(s)|S2AH4C0|V^{\mathrm{ref}}(s)-V^{*}(s)|\leq\frac{\sqrt{S^{2}AH^{4}}}{C_{0}} where C0=S3AH3C_{0}=\sqrt{S^{3}AH^{3}}, we have

2bhk(s,a)Qhk(s,a)ch(s,a)Ph(s,a)Vh+1k()0.\displaystyle-2b_{h}^{k}(s,a)\leq Q_{h}^{k}(s,a)-c_{h}(s,a)-P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)\leq 0.
Proof.

When FF does not happen, we have

|ch(s,a)c¯hk(s,a)|2ln2SAHTδnhk(s,a)\displaystyle\left|c_{h}(s,a)-\overline{c}_{h}^{k}(s,a)\right|\leq\sqrt{\frac{2\ln\frac{2SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}

and

|Ph(s,a)Vh+1k()P¯hk(s,a)Vh+1k()|H4Sln3SAHTδnhk(s,a).\displaystyle\left|P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)-\overline{P}_{h}^{k}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)\right|\leq H\sqrt{\frac{4S\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}.

Meanwhile, we have

|Ph(s,a)Vh+1k()P¯hk(s,a)Vh+1k()|\displaystyle\left|P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)-\overline{P}_{h}^{k}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)\right|
\displaystyle\leq |(P¯hk(s,a)Ph(s,a))Vh+1()|+|(P¯hk(s,a)Ph(s,a))(Vh+1kVh+1ref)()|\displaystyle\left|\left(\overline{P}_{h}^{k}\left(\cdot\mid s,a\right)-P_{h}\left(\cdot\mid s,a\right)\right)V_{h+1}^{*}(\cdot)\right|+\left|\left(\overline{P}_{h}^{k}\left(\cdot\mid s,a\right)-P_{h}\left(\cdot\mid s,a\right)\right)(V_{h+1}^{k}-V_{h+1}^{\mathrm{ref}})(\cdot)\right|
+|(P¯hk(s,a)Ph(s,a))(Vh+1refVh+1)()|\displaystyle+\left|\left(\overline{P}_{h}^{k}\left(\cdot\mid s,a\right)-P_{h}\left(\cdot\mid s,a\right)\right)(V_{h+1}^{\mathrm{ref}}-V_{h+1}^{*})(\cdot)\right| (33)

For the first term, we have

|(P¯hk(s,a)Ph(s,a))Vh+1|\displaystyle\left|\left(\overline{P}_{h}^{k}\left(\cdot\mid s,a\right)-P_{h}\left(\cdot\mid s,a\right)\right)V_{h+1}^{*}\right|
\displaystyle\leq 2𝕍YPh(s,a)Vh+1(Y)ln(2SAHKδ)nhk(s,a)+7ln(2SAHKδ)3nhk(s,a)\displaystyle\sqrt{\frac{2\mathbb{V}_{Y\sim{P}_{h}(\cdot\mid s,a)}V_{h+1}^{*}(Y)\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)}}+\frac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}
\displaystyle\leq 4𝕍YPh(s,a)Vh+1ref(Y)ln(2SAHKδ)nhk(s,a)+4Hln(2SAHKδ)Snhk(s,a)+7ln(2SAHKδ)3nhk(s,a),\displaystyle\sqrt{\frac{4\mathbb{V}_{Y\sim{P}_{h}(\cdot\mid s,a)}V_{h+1}^{\mathrm{ref}}(Y)\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)}}+\sqrt{\frac{4H\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{S\cdot n_{h}^{k}(s,a)}}+\frac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}, (34)

where the last inequality follows from the fact that \sqrt{2\mathbb{V}(X)}\leq 2\sqrt{\mathbb{V}(Y)+\mathbb{V}(Y-X)}\leq 2\sqrt{\mathbb{V}(Y)}+2\sqrt{\mathbb{V}(Y-X)} (applied with X=V_{h+1}^{*} and Y=V_{h+1}^{\mathrm{ref}}) and |V_{h+1}^{*}-V_{h+1}^{\mathrm{ref}}|\leq\sqrt{H/S}. Under the event F^{c}, we have

|P¯hk(ys,a)Ph(ys,a)|2P¯hk(ys,a)(1P¯hk(ys,a))ln(2SAHKδ)nhk(s,a)1+7ln(2SAHKδ)3nhk(s,a),\displaystyle\left|\overline{P}_{h}^{k}(y\mid s,a)-P_{h}(y\mid s,a)\right|\leq\sqrt{\frac{2\overline{P}_{h}^{k}(y\mid s,a)(1-\overline{P}_{h}^{k}(y\mid s,a))\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)-1}}+\frac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)},

By AM-GM inequality, we have

12P¯hk(ys,a)+ln(2SAHKδ)nhk(s,a)2P¯hk(ys,a)ln(2SAHKδ)nhk(s,a)\displaystyle\frac{1}{2}\overline{P}_{h}^{k}(y\mid s,a)+\frac{\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)}\geq\sqrt{\frac{2\overline{P}_{h}^{k}(y\mid s,a)\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)}}
2P¯hk(ys,a)(1P¯hk(ys,a))ln(2SAHKδ)nhk(s,a)1\displaystyle\geq\sqrt{\frac{2\overline{P}_{h}^{k}(y\mid s,a)\left(1-\overline{P}_{h}^{k}(y\mid s,a)\right)\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)-1}}

which further implies that

Ph(ys,a)32P¯hk(ys,a)10ln(2SAHKδ)3nhk(s,a).\displaystyle P_{h}(y\mid s,a)-\frac{3}{2}\cdot\overline{P}_{h}^{k}(y\mid s,a)\leq\frac{10\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}.
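In more detail, this bound follows since, under F^{c}, the two preceding displays combine as

P_{h}(y\mid s,a)\leq\overline{P}_{h}^{k}(y\mid s,a)+\sqrt{\frac{2\overline{P}_{h}^{k}(y\mid s,a)(1-\overline{P}_{h}^{k}(y\mid s,a))\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)-1}}+\frac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}\leq\frac{3}{2}\overline{P}_{h}^{k}(y\mid s,a)+\frac{\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)}+\frac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}=\frac{3}{2}\overline{P}_{h}^{k}(y\mid s,a)+\frac{10\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}.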

Then, we have

𝕍YPh(s,a)Vh+1ref(Y)\displaystyle\mathbb{V}_{Y\sim{P}_{h}(\cdot\mid s,a)}V_{h+1}^{\mathrm{ref}}(Y) =y𝒮Ph(ys,a)(Vh+1ref(y)Phk(s,a)Vh+1ref())2\displaystyle=\sum_{y\in\cal{S}}{P}_{h}(y\mid s,a)\left(V_{h+1}^{\mathrm{ref}}(y)-P_{h}^{k}(\cdot\mid s,a)V_{h+1}^{\mathrm{ref}}(\cdot)\right)^{2}
y𝒮Ph(ys,a)(Vh+1ref(y)P¯hk(s,a)Vh+1ref())2\displaystyle\leq\sum_{y\in\cal{S}}{P}_{h}(y\mid s,a)\left(V_{h+1}^{\mathrm{ref}}(y)-\overline{P}_{h}^{k}(\cdot\mid s,a)V_{h+1}^{\mathrm{ref}}(\cdot)\right)^{2}
y𝒮(32P¯hk(ys,a)+10ln(2SAHKδ)3nhk(s,a))(Vh+1ref(y)P¯hk(s,a)Vh+1ref())2\displaystyle\leq\sum_{y\in\cal{S}}\left(\frac{3}{2}\overline{P}_{h}^{k}(y\mid s,a)+\frac{10\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}\right)\cdot\left(V_{h+1}^{\mathrm{ref}}(y)-\overline{P}_{h}^{k}(\cdot\mid s,a)V_{h+1}^{\mathrm{ref}}(\cdot)\right)^{2}
32𝕍YP¯hk(s,a)Vh+1ref(Y)+10SH2ln(2SAHKδ)3nhk(s,a).\displaystyle\leq\frac{3}{2}\mathbb{V}_{Y\sim\overline{P}_{h}^{k}(\cdot\mid s,a)}V_{h+1}^{\mathrm{ref}}(Y)+\frac{10SH^{2}\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}. (35)

Plugging (35) into (34), we obtain

|(P¯hk(s,a)Ph(s,a))Vh+1|\displaystyle\left|\left(\overline{P}_{h}^{k}\left(\cdot\mid s,a\right)-P_{h}\left(\cdot\mid s,a\right)\right)V_{h+1}^{*}\right|
\displaystyle\leq 6𝕍YP¯hk(s,a)Vh+1ref(Y)ln(2SAHKδ)nhk(s,a)+4Hln(2SAHKδ)Snhk(s,a)+8SH2ln(2SAHKδ)3nhk(s,a).\displaystyle\sqrt{\frac{6\mathbb{V}_{Y\sim\overline{P}_{h}^{k}(\cdot\mid s,a)}V_{h+1}^{\mathrm{ref}}(Y)\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)}}+\sqrt{\frac{4H\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{S\cdot n_{h}^{k}(s,a)}}+\frac{8\sqrt{SH^{2}}\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}. (36)

Meanwhile, it holds that

|(P¯hk(s,a)Ph(s,a))(Vh+1kVh+1ref)|\displaystyle\left|\left(\overline{P}_{h}^{k}\left(\cdot\mid s,a\right)-P_{h}\left(\cdot\mid s,a\right)\right)(V_{h+1}^{k}-V_{h+1}^{\mathrm{ref}})\right|
y𝒮|P¯hk(ys,a)Ph(ys,a)||Vh+1k(y)Vh+1ref(y)|\displaystyle\leq\sum_{y\in\mathcal{S}}\left|\overline{P}_{h}^{k}\left(y\mid s,a\right)-P_{h}\left(y\mid s,a\right)\right|\cdot|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)|
y𝒮(2P¯h(ys,a)(1P¯h(ys,a))ln(2SAHKδ)nhk(s,a)1+7ln(2SAHKδ)3nhk(s,a))|Vh+1k(y)Vh+1ref(y)|,\displaystyle\leq\sum_{y\in\mathcal{S}}\left(\sqrt{\frac{2\overline{P}_{h}(y\mid s,a)(1-\overline{P}_{h}(y\mid s,a))\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)-1}}+\frac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}\right)|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)|, (37)

where the last inequality follows from the definition of the event F^{c}. Plugging (36) and (37) into (33), we have

|Ph(s,a)Vh+1k()P¯hk(s,a)Vh+1k()|\displaystyle\left|P_{h}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)-\overline{P}_{h}^{k}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)\right|
\displaystyle\leq 6𝕍YP¯hk(s,a)Vh+1ref(Y)ln(2SAHKδ)nhk(s,a)+4Hln(2SAHKδ)Snhk(s,a)+8SH2ln(2SAHKδ)3nhk(s,a)\displaystyle\sqrt{\frac{6\mathbb{V}_{Y\sim\overline{P}_{h}^{k}(\cdot\mid s,a)}V_{h+1}^{\mathrm{ref}}(Y)\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)}}+\sqrt{\frac{4H\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{S\cdot n_{h}^{k}(s,a)}}+\frac{8\sqrt{SH^{2}}\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}
+y𝒮(2P¯h(ys,a)(1P¯h(ys,a))ln(2SAHKδ)nhk(s,a)1+7ln(2SAHKδ)3nhk(s,a))|Vh+1k(y)Vh+1ref(y)|\displaystyle+\sum_{y\in\mathcal{S}}\left(\sqrt{\frac{2\overline{P}_{h}(y\mid s,a)(1-\overline{P}_{h}(y\mid s,a))\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)-1}}+\frac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}\right)|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)|
\displaystyle+\underbrace{\sum_{y\in\mathcal{S}}\left|\overline{P}_{h}^{k}\left(y\mid s,a\right)-P_{h}\left(y\mid s,a\right)\right||V^{\mathrm{ref}}_{h+1}(y)-V^{*}_{h+1}(y)|}_{(b)}

For Term (b), we divide \mathcal{S} into two sets: \mathcal{S}_{0}=\{y\in\mathcal{S}:n_{h}^{k}(y)\geq C_{0}\sqrt{k}\} and \mathcal{S}_{0}^{c}. Since |V_{h+1}^{\mathrm{ref}}(y)-V_{h+1}^{*}(y)|\leq\frac{\sqrt{S^{2}AH^{4}}}{C_{0}} for every y\in\mathcal{S}_{0} (the reference value has been constructed for such states), we have

\displaystyle\sum_{y\in\mathcal{S}_{0}}\left|\overline{P}_{h}^{k}\left(y\mid s,a\right)-P_{h}\left(y\mid s,a\right)\right||V^{\mathrm{ref}}_{h+1}(y)-V^{*}_{h+1}(y)|
\displaystyle\leq\sqrt{\frac{4S\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}\cdot\frac{\sqrt{S^{2}AH^{4}}}{C_{0}}

For y𝒮0cy\in\mathcal{S}_{0}^{c}, we have

\displaystyle\sum_{y\in\mathcal{S}_{0}^{c}}\left|\overline{P}_{h}^{k}\left(y\mid s,a\right)-P_{h}\left(y\mid s,a\right)\right||V^{\mathrm{ref}}_{h+1}(y)-V^{*}_{h+1}(y)|
\displaystyle\leq\sum_{y\in\mathcal{S}_{0}^{c}}\left(\sqrt{\frac{2\overline{P}_{h}^{k}(y\mid s,a)(1-\overline{P}_{h}^{k}(y\mid s,a))\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)-1}}+\frac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}\right)|V^{\mathrm{ref}}_{h+1}(y)-V^{*}_{h+1}(y)|
\displaystyle\leq\sum_{y\in\mathcal{S}_{0}^{c}}\sqrt{\frac{4\overline{P}_{h}^{k}(y\mid s,a)\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)}}\,|V^{\mathrm{ref}}_{h+1}(y)-V^{*}_{h+1}(y)|+\sum_{y\in\mathcal{S}_{0}^{c}}\frac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}|V^{\mathrm{ref}}_{h+1}(y)-V^{*}_{h+1}(y)|
\displaystyle\leq\sum_{y\in\mathcal{S}_{0}^{c}}\sqrt{\frac{4n^{k}_{h}(s,a,y)\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n^{k}_{h}(s,a)^{2}}}\,|V^{\mathrm{ref}}_{h+1}(y)-V^{*}_{h+1}(y)|+\frac{7SH\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}
\displaystyle\leq y𝒮0c4nhk(y)ln(2SAHKδ)nhk(s,a)2H+7SHln(2SAHKδ)3nhk(s,a)\displaystyle\sum_{y\in\mathcal{S}_{0}^{c}}\sqrt{\frac{4n^{k}_{h}(y)\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n^{k}_{h}(s,a)^{2}}}H+\frac{7SH\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}
\displaystyle\leq y𝒮0c4C0Kln(2SAHKδ)nhk(s,a)2H+7SHln(2SAHKδ)3nhk(s,a)\displaystyle\sum_{y\in\mathcal{S}_{0}^{c}}\sqrt{\frac{4C_{0}\sqrt{K}\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n^{k}_{h}(s,a)^{2}}}H+\frac{7SH\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}
\displaystyle\leq 2SC0K1/4Hln(2SAHKδ)nhk(s,a)+7SHln(2SAHKδ)3nhk(s,a)\displaystyle\frac{2S\sqrt{C_{0}}K^{1/4}H\sqrt{\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}}{n_{h}^{k}(s,a)}+\frac{7SH\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}

Therefore, we have

|(P¯hk(sh,ah)Ph(sh,ah))Vh+1k|\displaystyle\left|\left(\overline{P}_{h}^{k}\left(\cdot\mid s_{h},a_{h}\right)-P_{h}\left(\cdot\mid s_{h},a_{h}\right)\right)V_{h+1}^{k}\right|
\displaystyle\leq 6𝕍YP¯hk(s,a)Vh+1ref(Y)ln(2SAHKδ)nhk(s,a)+4Hln(2SAHKδ)Snhk(s,a)+8SH2ln(2SAHKδ)3nhk(s,a)\displaystyle\sqrt{\frac{6\mathbb{V}_{Y\sim\overline{P}_{h}^{k}(\cdot\mid s,a)}V_{h+1}^{\mathrm{ref}}(Y)\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)}}+\sqrt{\frac{4H\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{S\cdot n_{h}^{k}(s,a)}}+\frac{8\sqrt{SH^{2}}\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}
+y𝒮(2P¯hk(ss,a)(1P¯hk(ss,a))ln(2SAHKδ)nhk(s,a)1+7ln(2SAHKδ)3nhk(s,a))|Vh+1k(y)Vh+1ref(y)|\displaystyle+\sum_{y\in\mathcal{S}}\left(\sqrt{\frac{2\overline{P}_{h}^{k}(s^{\prime}\mid s,a)(1-\overline{P}_{h}^{k}(s^{\prime}\mid s,a))\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)-1}}+\frac{7\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}\right)|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)|
+4Sln3SAHTδnhk(s,a)S2AH4C0+2SC0K1/4Hln(2SAHKδ)nhk(s,a)+7SHln(2SAHKδ)3nhk(s,a)\displaystyle+\sqrt{\frac{4S\ln\frac{3SAHT}{\delta^{\prime}}}{n_{h}^{k}(s,a)}}\frac{\sqrt{S^{2}AH^{4}}}{C_{0}}+\frac{2S\sqrt{C_{0}}K^{1/4}H\sqrt{\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}}{n_{h}^{k}(s,a)}+\frac{7SH\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}

By definition, Q_{h}^{k}(s,a)=\max\{\overline{c}_{h}^{k}(s,a)+\overline{P}_{h}^{k}(\cdot\mid s,a)V_{h+1}^{k}(\cdot)-b_{h}^{k}(s,a),0\}. Then, following the same argument as in the proof of Lemma B.3, we can show that (31) and (32) still hold, which finishes the proof. ∎

B.4.2 Mirror Descent

The mirror descent (MD) algorithm (Beck & Teboulle, 2003) is a proximal convex optimization method that minimizes a linear approximation of the objective together with a proximity term, defined in terms of a Bregman divergence between the old and new solution estimates. In our analysis we choose the Bregman divergence to be the squared l_{2} norm. If \left\{f_{k}\right\}_{k=1}^{K} is a sequence of convex functions f_{k}:\mathbb{R}^{d}\rightarrow\mathbb{R} and C is a constraint set, the k-th iterate of \mathrm{MD} is the following:

xk+1argminxC{ηkgk(xk),xxk+xxk22},x_{k+1}\in\underset{x\in C}{\arg\min}\left\{\eta_{k}\left\langle g_{k}\left(x_{k}\right),x-x_{k}\right\rangle+\left\|x-x_{k}\right\|_{2}^{2}\right\}, (38)

where \eta_{k} is the stepsize. With a suitably chosen stepsize, the MD algorithm ensures \operatorname{Regret}\left(K^{\prime}\right)=\sum_{k=1}^{K^{\prime}}f_{k}\left(x_{k}\right)-\min_{x\in C}\sum_{k=1}^{K^{\prime}}f_{k}(x)\leq O(\sqrt{K^{\prime}}) for all K^{\prime}\in[K].
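For intuition, with the squared l_{2} proximity term the update (38) admits a simple closed form: completing the square shows that the objective in (38) equals \|x-(x_{k}-\frac{\eta_{k}}{2}g_{k}(x_{k}))\|_{2}^{2} up to an additive constant, so

x_{k+1}=\Pi_{C}\Big(x_{k}-\frac{\eta_{k}}{2}g_{k}(x_{k})\Big),

where \Pi_{C} denotes the Euclidean projection onto C. In our policy optimization setting C is the probability simplex over actions, so each policy update is a projected gradient step.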

The following lemma (Theorem 6.8 in Orabona (2019)) is a fundamental inequality for the analysis of the OMD regret, and it will be used in our analysis.

Lemma B.5 (OMD Regret, Theorem 6.8 in (Orabona, 2019)).

Assume \eta_{k+1}\leq\eta_{k},k=1,\ldots,K. Then, using OMD with the squared l_{2} norm as the Bregman divergence, learning rates \{\eta_{k}\}, and uniform initialization x_{1}=[1/d,\ldots,1/d], the following regret bound holds

t=1T𝒈t,𝒙t𝒖max1tTBψ(𝒖;𝒙t)ηT+12t=1Tηt𝒈t22.\sum_{t=1}^{T}\left\langle\boldsymbol{g}_{t},\boldsymbol{x}_{t}-\boldsymbol{u}\right\rangle\leq\max_{1\leq t\leq T}\frac{B_{\psi}\left(\boldsymbol{u};\boldsymbol{x}_{t}\right)}{\eta_{T}}+\frac{1}{2}\sum_{t=1}^{T}\eta_{t}\left\|\boldsymbol{g}_{t}\right\|_{2}^{2}. (39)

In our analysis, by adapting the above lemma to our notation, we get the following lemma.

Lemma B.6 (OMD in Policy Optimization).

Assume \eta_{k+1}\leq\eta_{k},k=1,\ldots,K. Then, using OMD with the squared l_{2} norm as the Bregman divergence, learning rates \{\eta_{k}\}, and uniform initialization \pi_{h}^{1}(\cdot\mid s)=[1/A,\ldots,1/A], the following regret bound holds

k=1KQhk(s,),πhk(s)πh(s)2ηK+12k=1Kηka(Qhk(s,a))2.\sum_{k=1}^{K}\left\langle Q_{h}^{k}(s,\cdot),\pi_{h}^{k}(\cdot\mid s)-\pi_{h}(\cdot\mid s)\right\rangle\leq\frac{2}{\eta_{K}}+\frac{1}{2}\sum_{k=1}^{K}\eta_{k}\sum_{a}\left(Q_{h}^{k}(s,a)\right)^{2}. (40)

Moreover, if we choose ηk=1/AH2k\eta_{k}=1/\sqrt{AH^{2}k}, we have

k=1Kh=1HQhk(s,),πhk(s)πh(s)3AH4K.\sum_{k=1}^{K}\sum_{h=1}^{H}\left\langle Q_{h}^{k}(s,\cdot),\pi_{h}^{k}(\cdot\mid s)-\pi_{h}(\cdot\mid s)\right\rangle\leq 3\sqrt{AH^{4}K}. (41)
Proof.

Fix h\in[H] and s. In (39), replace \boldsymbol{g}_{t} with Q_{h}^{k}(s,\cdot), \boldsymbol{x}_{t} with \pi_{h}^{k}(\cdot\mid s), and \boldsymbol{u} with \pi_{h}(\cdot\mid s). Since \|\pi_{h}(\cdot\mid s)-\pi_{h}^{k}(\cdot\mid s)\|_{2}^{2}\leq 2 and Q_{h}^{k}(s,a)\leq H, we have

k=1KQhk(s,),πhk(s)πh(s)\displaystyle\sum_{k=1}^{K}\left\langle Q_{h}^{k}(s,\cdot),\pi_{h}^{k}(\cdot\mid s)-\pi_{h}(\cdot\mid s)\right\rangle 2ηK+12k=1Kηka(Qhk(s,a))2\displaystyle\leq\frac{2}{\eta_{K}}+\frac{1}{2}\sum_{k=1}^{K}\eta_{k}\sum_{a}\left(Q_{h}^{k}(s,a)\right)^{2}
2ηK+12k=1KηkAH2.\displaystyle\leq\frac{2}{\eta_{K}}+\frac{1}{2}\sum_{k=1}^{K}\eta_{k}AH^{2}.

Letting \eta_{k}=1/\sqrt{AH^{2}k} and using \sum_{k=1}^{K}\frac{1}{\sqrt{k}}\leq 2\sqrt{K}, we have \sum_{k=1}^{K}\eta_{k}\leq\sqrt{\frac{4K}{AH^{2}}}, which further implies that

k=1KQhk(s,),πhk(s)πh(s)3AH2K\displaystyle\sum_{k=1}^{K}\left\langle Q_{h}^{k}(s,\cdot),\pi_{h}^{k}(\cdot\mid s)-\pi_{h}(\cdot\mid s)\right\rangle\leq 3\sqrt{AH^{2}K}

for any h[H]h\in[H]. Taking summation over h[H]h\in[H] concludes our proof. ∎
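For completeness, the constant in (41) can be traced from the two bounds above by a short computation:

\frac{2}{\eta_{K}}+\frac{1}{2}\sum_{k=1}^{K}\eta_{k}AH^{2}\leq 2\sqrt{AH^{2}K}+\frac{AH^{2}}{2}\cdot\frac{2\sqrt{K}}{\sqrt{AH^{2}}}=2\sqrt{AH^{2}K}+\sqrt{AH^{2}K}=3\sqrt{AH^{2}K},

and summing this per-step bound over the H steps gives the 3\sqrt{AH^{4}K} bound in (41).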

B.5 Sum of Bonus

Lemma B.7.

It holds that

k=1Kh=1Hbhk(shk,ahk)O~(SAH3K+S5/2A5/4H11/4K1/4).\sum_{k=1}^{K}\sum_{h=1}^{H}b^{k}_{h}(s^{k}_{h},a^{k}_{h})\leq\widetilde{O}(\sqrt{SAH^{3}K}+S^{5/2}A^{5/4}H^{11/4}K^{1/4}). (42)
Proof.

For simplicity we write n^{k}_{h}=n^{k}_{h}(s^{k}_{h},a^{k}_{h}), \overline{P}^{k}_{h}(\cdot)=\overline{P}_{h}^{k}(\cdot\mid s^{k}_{h},a^{k}_{h}), and P_{h}^{k}(\cdot)=P_{h}(\cdot\mid s^{k}_{h},a^{k}_{h}). By the same arguments as in (34), (35) and (36), we have

6𝕍YP¯hk(s,a)Vh+1ref(Y)ln(2SAHKδ)nhk(s,a)+4Hln(2SAHKδ)Snhk(s,a)+8SH2ln(2SAHKδ)3nhk(s,a)\displaystyle\sqrt{\frac{6\mathbb{V}_{Y\sim\overline{P}_{h}^{k}(\cdot\mid s,a)}V_{h+1}^{\mathrm{ref}}(Y)\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{n_{h}^{k}(s,a)}}+\sqrt{\frac{4H\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{S\cdot n_{h}^{k}(s,a)}}+\frac{8\sqrt{SH^{2}}\ln\left(\frac{2SAHK}{\delta^{\prime}}\right)}{3n_{h}^{k}(s,a)}
\displaystyle\leq O~(𝕍YPhk(s,a)Vh+1(Y)nhk(s,a)+HSnhk(s,a)+SH23nhk(s,a)).\displaystyle\widetilde{O}\left(\sqrt{\frac{\mathbb{V}_{Y\sim{P}_{h}^{k}(\cdot\mid s,a)}V_{h+1}^{*}(Y)}{n_{h}^{k}(s,a)}}+\sqrt{\frac{H}{S\cdot n_{h}^{k}(s,a)}}+\frac{\sqrt{SH^{2}}}{3n_{h}^{k}(s,a)}\right).

Then, by the definition of bhkb^{k}_{h},

k=1Kh=1Hbhk(shk,ahk)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}b^{k}_{h}(s^{k}_{h},a^{k}_{h})
\displaystyle\leq k=1Kh=1HO~(𝕍YPhk(s,a)Vh+1(Y)nhk(s,a)+y𝒮P¯hk(y)nhk|Vh+1k(y)Vh+1ref(y)|+S3/2A1/4H7/4K1/4nhk)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\widetilde{O}\left(\sqrt{\frac{\mathbb{V}_{Y\sim{P}_{h}^{k}(\cdot\mid s,a)}V_{h+1}^{*}(Y)}{n_{h}^{k}(s,a)}}+\sum_{y\in\mathcal{S}}\sqrt{\frac{\overline{P}_{h}^{k}(y)}{n^{k}_{h}}}|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)|+\frac{S^{3/2}A^{1/4}H^{7/4}K^{1/4}}{n^{k}_{h}}\right)
\displaystyle\leq O~(SAH3K)+k=1Kh=1Hy𝒮O~(P¯hk(y)nhk|Vh+1k(y)Vh+1ref(y)|)+O~(S5/2A5/4H11/4K1/4)\displaystyle\widetilde{O}(\sqrt{SAH^{3}K})+\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{y\in\mathcal{S}}\widetilde{O}\left(\sqrt{\frac{\overline{P}_{h}^{k}(y)}{n^{k}_{h}}}|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)|\right)+\widetilde{O}(S^{5/2}A^{5/4}H^{11/4}K^{1/4})
\displaystyle\leq O~(SAH3K)+k=1Kh=1Hy𝒮O~(Phk(y)+O(1/nhk)nhk|Vh+1k(y)Vh+1ref(y)|)+O~(S5/2A5/4H11/4K1/4)\displaystyle\widetilde{O}(\sqrt{SAH^{3}K})+\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{y\in\mathcal{S}}\widetilde{O}\left(\sqrt{\frac{P_{h}^{k}(y)+O(\sqrt{1/n^{k}_{h}})}{n^{k}_{h}}}|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)|\right)+\widetilde{O}(S^{5/2}A^{5/4}H^{11/4}K^{1/4})
\displaystyle\leq O~(SAH3K)+k=1Kh=1Hy𝒮O~(Phk(y)nhk|Vh+1k(y)Vh+1ref(y)|)(v)+O~(S5/2A5/4H11/4K1/4)\displaystyle\widetilde{O}(\sqrt{SAH^{3}K})+\underbrace{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{y\in\mathcal{S}}\widetilde{O}\left(\sqrt{\frac{P_{h}^{k}(y)}{n^{k}_{h}}}|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)|\right)}_{(v)}+\widetilde{O}(S^{5/2}A^{5/4}H^{11/4}K^{1/4}) (43)

The second inequality follows from Lemma 19 in Zhang et al. (2020b) and standard techniques (e.g. (Shani et al., 2020) or (Zhang et al., 2020b)). Using Lemma B.8, we can further bound Term(v)\text{Term}\ (v) as:

Term(v)O~(SAH3K)+O~(S5/2A5/4H3/2K1/4)\text{Term}\ (v)\leq\widetilde{O}(\sqrt{SAH^{3}K})+\widetilde{O}(S^{5/2}A^{5/4}H^{3/2}K^{1/4})

Therefore, combining (16) and (43), we complete the proof of Theorem 3.1. ∎

Lemma B.8.
k=1Kh=1Hy𝒮(Phk(y)nhk|Vh+1k(y)Vh+1ref(y)|)O~(SAH3K)+O~(S5/2A5/4H3/2K1/4)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{y\in\mathcal{S}}\left(\sqrt{\frac{P_{h}^{k}(y)}{n^{k}_{h}}}|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)|\right)\leq\widetilde{O}(\sqrt{SAH^{3}K})+\widetilde{O}(S^{5/2}A^{5/4}H^{3/2}K^{1/4})
Proof.

We have

k=1Kh=1Hy𝒮(Phk(y)nhk|Vh+1k(y)Vh+1ref(y)|)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{y\in\mathcal{S}}\left(\sqrt{\frac{P_{h}^{k}(y)}{n^{k}_{h}}}|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)|\right)
\displaystyle\leq k=1Kh=1H(y𝒮Phk(y)1Phk(y)nhk|Vh+1k(y)Vh+1ref(y)|1Phk(sh+1k)nhk|Vh+1k(sh+1k)Vh+1ref(sh+1k)|)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\big{(}\sum_{y\in\mathcal{S}}P_{h}^{k}(y)\sqrt{\frac{1}{P_{h}^{k}(y)n^{k}_{h}}}|V^{k}_{h+1}(y)-V^{\mathrm{ref}}_{h+1}(y)|-\sqrt{\frac{1}{P_{h}^{k}(s^{k}_{h+1})n^{k}_{h}}}|V^{k}_{h+1}(s^{k}_{h+1})-V^{\mathrm{ref}}_{h+1}(s^{k}_{h+1})|\big{)}
+k=1Kh=1H1Phk(sh+1k)nhk|Vh+1k(sh+1k)Vh+1ref(sh+1k)|\displaystyle+\sum_{k=1}^{K}\sum_{h=1}^{H}\sqrt{\frac{1}{P_{h}^{k}(s^{k}_{h+1})n^{k}_{h}}}|V^{k}_{h+1}(s^{k}_{h+1})-V^{\mathrm{ref}}_{h+1}(s^{k}_{h+1})|
\displaystyle\leq O~(H3K)+k=1Kh=1H1Phk(sh+1k)nhk|Vh+1k(sh+1k)Vh+1ref(sh+1k)|\displaystyle\widetilde{O}(\sqrt{H^{3}K})+\sum_{k=1}^{K}\sum_{h=1}^{H}\sqrt{\frac{1}{P_{h}^{k}(s^{k}_{h+1})n^{k}_{h}}}|V^{k}_{h+1}(s^{k}_{h+1})-V^{\mathrm{ref}}_{h+1}(s^{k}_{h+1})|
=\displaystyle= k=1Kh=1H(1Phk(sh+1k)nhk|Vh+1k(sh+1k)Vh+1(sh+1k)|the sumO~(H4S2AK)+1Phk(sh+1k)nhk|Vh+1(sh+1k)Vh+1ref(sh+1k)|almostO~(1/S))\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\big{(}\sqrt{\frac{1}{P_{h}^{k}(s^{k}_{h+1})n^{k}_{h}}}\underbrace{|V^{k}_{h+1}(s^{k}_{h+1})-V^{*}_{h+1}(s^{k}_{h+1})|}_{\text{the sum}\ \leq\ \widetilde{O}(\sqrt{H^{4}S^{2}AK})}+\sqrt{\frac{1}{P_{h}^{k}(s^{k}_{h+1})n^{k}_{h}}}\underbrace{|V^{*}_{h+1}(s^{k}_{h+1})-V^{\mathrm{ref}}_{h+1}(s^{k}_{h+1})|}_{\text{almost}\ \leq\ \widetilde{O}(1/\sqrt{S})}\big{)}
+O~(H3K)\displaystyle+\widetilde{O}(\sqrt{H^{3}K})
\displaystyle\leq O~(S2AHH2S2AK)+O~(S2AH2K/S)+O~(H3K)\displaystyle\widetilde{O}(S^{2}AH\sqrt{\sqrt{H^{2}S^{2}AK}})+\widetilde{O}(\sqrt{S^{2}AH^{2}K}/\sqrt{S})+\widetilde{O}(\sqrt{H^{3}K})
=\displaystyle= O~(SAH3K)+O~(S5/2A5/4H3/2K1/4)\displaystyle\widetilde{O}(\sqrt{SAH^{3}K})+\widetilde{O}(S^{5/2}A^{5/4}H^{3/2}K^{1/4}) (44)

The second line is a sum of martingale differences, each bounded by H; applying the Azuma-Hoeffding inequality yields the second inequality. The third inequality makes use of Lemma 11 in Zhang et al. (2020b).

The third inequality in (44).

This step follows a technique used in Azar et al. (2017); we explain the main idea here.

First, for each (k,h) we define the set of typical next states as

[y]k,h= def {y:y𝒮,nhk(shk,ahk)P(yshk,ahk)2H2SL}[y]_{k,h}\stackrel{{\scriptstyle\text{ def }}}{{=}}\left\{y:y\in\mathcal{S},n_{h}^{k}(s_{h}^{k},a_{h}^{k})P(y\mid s_{h}^{k},a_{h}^{k})\geq 2H^{2}SL\right\}

i.e., these next states are expected to be visited frequently enough from (s_{h}^{k},a_{h}^{k}). Define \widetilde{\Delta}_{h+1}^{k}(y)=\left|V_{h+1}^{k}(y)-V_{h+1}^{\mathrm{ref}}(y)\right|. We have

y𝒮Phk(y)nhkΔ~h+1k(y)\displaystyle\sum_{y\in\mathcal{S}}\sqrt{\frac{P_{h}^{k}(y)}{n_{h}^{k}}}\widetilde{\Delta}_{h+1}^{k}(y)
=\displaystyle= y[y]k,hPhk(y)nhkΔ~h+1k(y)+y[y]k,hPhk(y)nhkΔ~h+1k(y).\displaystyle\sum_{y\in[y]_{k,h}}\sqrt{\frac{P_{h}^{k}(y)}{n_{h}^{k}}}\widetilde{\Delta}_{h+1}^{k}(y)+\sum_{y\notin[y]_{k,h}}\sqrt{\frac{P_{h}^{k}(y)}{n_{h}^{k}}}\widetilde{\Delta}_{h+1}^{k}(y).

The second term can be bounded by

y[y]k,hPhk(y)nhknhk2Δ~h+1k(y)SH4LH2Snhk\sum_{y\notin[y]_{k,h}}\sqrt{\frac{P_{h}^{k}(y)n_{h}^{k}}{{n_{h}^{k}}^{2}}}\widetilde{\Delta}_{h+1}^{k}(y)\leq\frac{SH\sqrt{4LH^{2}S}}{n_{h}^{k}}

So we only have to deal with the first term, in which P_{h}^{k}(y)n_{h}^{k} is large, and the corresponding martingale term can then be controlled.

The last inequality in (44).

As above, we only need to consider the case where P_{h}^{k}(s_{h+1}^{k})n_{h}^{k}\geq 2H^{2}SL (here n^{k}_{h} is shorthand for n^{k}_{h}(s^{k}_{h},a^{k}_{h})). In this case the first term is bounded by \widetilde{O}(\sqrt{SAH^{3}K}). For the second term, we use the multiplicative Chernoff bound \operatorname{Pr}(X\leq(1-\delta)\mu)\leq e^{-\frac{\delta^{2}\mu}{2}}, where 0\leq\delta\leq 1, X=\sum^{n}_{i=1}X_{i} for independent Bernoulli random variables X_{i}, and \mathbb{E}[X]=\mu. Set X=n^{k}_{h}(s^{k}_{h},a^{k}_{h},s^{k}_{h+1}) and \mu=P_{h}^{k}(s_{h+1}^{k})n_{h}^{k}. Taking a union bound over all h,k and all states in \mathcal{S} with P_{h}^{k}(s_{h+1}^{k})n_{h}^{k}\geq 2H^{2}SL, we have that with high probability P_{h}^{k}(s^{k}_{h+1})n_{h}^{k}\geq\frac{1}{2}n_{h}^{k}(s_{h}^{k},a_{h}^{k},s^{k}_{h+1}). With this we can now apply Lemma 11 in Zhang et al. (2020b). ∎

Appendix C Explanation of How UCBVI Uses Optimism

In Azar et al. (2017), they need to bound the term (P¯hkPh)(s,a)(Vh+1kVh+1)()(\overline{P}_{h}^{k}-P_{h})(\cdot\mid s,a)(V^{k}_{h+1}-V_{h+1}^{*})(\cdot) using optimism, as mentioned in Section 4.

Define Δhk= def VhVhπk\Delta_{h}^{k}\stackrel{{\scriptstyle\text{ def }}}{{=}}V_{h}^{*}-V_{h}^{\pi^{k}}, Δ~hk= def VhkVhπk\widetilde{\Delta}_{h}^{k}\stackrel{{\scriptstyle\text{ def }}}{{=}}V_{h}^{k}-V_{h}^{\pi^{k}}, and δ~hk= def Δ~hk(shk)\widetilde{\delta}_{h}^{k}\stackrel{{\scriptstyle\text{ def }}}{{=}}\widetilde{\Delta}_{h}^{k}\left(s_{h}^{k}\right). We denote by \square a numerical constant which can vary from line to line. We also use LL to represent the logarithmic term L=ln(HSAT/δ)L=\ln(\square HSAT/\delta).

Using Bernstein’s inequality, this term is bounded by

yPπk(yshk)LPπk(yshk)nhkΔh+1k(y)+SHLnhk.\sum_{y}P^{\pi^{k}}\left(y\mid s_{h}^{k}\right)\sqrt{\frac{\square L}{P^{\pi^{k}}\left(y\mid s_{h}^{k}\right)n_{h}^{k}}}\Delta_{h+1}^{k}(y)+\frac{\square SHL}{n_{h}^{k}}.

where n_{h}^{k}\stackrel{{\scriptstyle\text{ def }}}{{=}}n_{k}\left(s_{h}^{k},\pi^{k}\left(s_{h}^{k}\right)\right). Now, considering only the y such that P^{\pi^{k}}\left(y\mid s_{h}^{k}\right)n_{h}^{k}\geq\square H^{2}L, and since 0\leq\Delta_{h+1}^{k}\leq\widetilde{\Delta}_{h+1}^{k} by optimism, the term \left(\widehat{P}_{k}^{\pi_{k}}-P^{\pi_{k}}\right)\Delta_{k,h+1}\left(s_{h}^{k}\right) is bounded by

ϵ¯hk+LPπk(sh+1kshk)nhkδ~k,h+1+SHLnhkϵ¯hk+1Hδ~h+1k+SHLnhk.\overline{\epsilon}_{h}^{k}+\sqrt{\frac{\square L}{P^{\pi^{k}}\left(s_{h+1}^{k}\mid s_{h}^{k}\right)n_{h}^{k}}}\widetilde{\delta}_{k,h+1}+\frac{\square SHL}{n_{h}^{k}}\leq\overline{\epsilon}_{h}^{k}+\frac{1}{H}\widetilde{\delta}_{h+1}^{k}+\frac{\square SHL}{n_{h}^{k}}.

where ϵ¯hk= def Lnhk(yPπk(yshk)Δ~h+1k(y)Pπk(yshk)δ~h+1kPπk(sh+1kshk)).\overline{\epsilon}_{h}^{k}\stackrel{{\scriptstyle\text{ def }}}{{=}}\sqrt{\frac{\square L}{n_{h}^{k}}}\left(\sum_{y}P^{\pi^{k}}\left(y\mid s_{h}^{k}\right)\frac{\widetilde{\Delta}_{h+1}^{k}(y)}{\sqrt{P^{\pi^{k}}\left(y\mid s_{h}^{k}\right)}}-\frac{\widetilde{\delta}_{h+1}^{k}}{\sqrt{P^{\pi_{k}}\left(s_{h+1}^{k}\mid s_{h}^{k}\right)}}\right). The sum over the neglected yy such that Pπk(yshk)nhk<H2LP^{\pi^{k}}\left(y\mid s_{h}^{k}\right)n_{h}^{k}<\square H^{2}L contributes to an additional term

yPπk(yshk)nhkLnhk2Δh+1k(y)SH2Lnhk.\sum_{y}\sqrt{\frac{\square P^{\pi^{k}}\left(y\mid s_{h}^{k}\right)n_{h}^{k}L}{{n_{h}^{k}}^{2}}}\Delta_{h+1}^{k}(y)\leq\frac{\square SH^{2}L}{n_{h}^{k}}.

Then they prove that the sum of \overline{\epsilon}_{h}^{k} is of order \widetilde{O}(\sqrt{T}) and the sum of \frac{1}{n_{h}^{k}} is a lower-order term.