
Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP

Liyu Chen    Rahul Jain    Haipeng Luo
Abstract

We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first algorithm is computationally efficient and achieves a regret bound $\tilde{\mathcal{O}}(\sqrt{d^{3}B_{\star}^{2}T_{\star}K})$, where $d$ is the dimension of the feature space, $B_{\star}$ and $T_{\star}$ are upper bounds of the expected costs and hitting time of the optimal policy respectively, and $K$ is the number of episodes. The same algorithm with a slight modification also achieves logarithmic regret of order $\mathcal{O}\big(\frac{d^{3}B_{\star}^{4}}{c_{\min}^{2}\mathrm{gap}_{\min}}\ln^{5}\frac{dB_{\star}K}{c_{\min}}\big)$, where $\mathrm{gap}_{\min}$ is the minimum sub-optimality gap and $c_{\min}$ is the minimum cost over all state-action pairs. Our result is obtained by developing a simpler and improved analysis for the finite-horizon approximation of (Cohen et al., 2021) with a smaller approximation error, which might be of independent interest. On the other hand, using variance-aware confidence sets in a global optimization problem, our second algorithm is computationally inefficient but achieves the first "horizon-free" regret bound $\tilde{\mathcal{O}}(d^{3.5}B_{\star}\sqrt{K})$ with no polynomial dependency on $T_{\star}$ or $1/c_{\min}$, almost matching the $\Omega(dB_{\star}\sqrt{K})$ lower bound from (Min et al., 2021).

Machine Learning, ICML

1 Introduction

We study the stochastic shortest path (SSP) model, where a learner attempts to reach a goal state while minimizing her costs in a stochastic environment. SSP is a suitable model for many real-world applications, such as games, car navigation, robotic manipulation, etc. Online reinforcement learning in SSP has received great attention recently. In this setting, learning proceeds in $K$ episodes over a Markov Decision Process (MDP). In each episode, starting from a fixed initial state, the learner sequentially takes an action, incurs a cost, and transits to the next state until reaching the goal state. The performance of the learner is measured by her regret, the difference between her total costs and that of the optimal policy. SSP is a strict generalization of the heavily-studied finite-horizon reinforcement learning problem, where the learner is guaranteed to reach the goal state after a fixed number of steps.

Modern reinforcement learning applications often need to handle a massive state space, in which case function approximation is necessary. There has been huge progress in the study of linear function approximation, for both the finite-horizon setting (Ayoub et al., 2020; Jin et al., 2020b; Yang & Wang, 2020; Zanette et al., 2020a, b; Zhou et al., 2021a) and the infinite-horizon setting (Wei et al., 2021b; Zhou et al., 2021a, b). Recently, Vial et al. (2021) took the first step in considering linear function approximation for SSP. They studied SSP defined over a linear MDP, and proposed a computationally inefficient algorithm with regret $\tilde{\mathcal{O}}(\sqrt{d^{3}B_{\star}^{3}K/c_{\min}})$, as well as another efficient algorithm with regret $\tilde{\mathcal{O}}(K^{5/6})$ (omitting other dependencies). Here, $d$ is the dimension of the feature space, $B_{\star}$ is an upper bound on the expected costs of the optimal policy, and $c_{\min}$ is the minimum cost across all state-action pairs. Later, Min et al. (2021) studied a related but different SSP problem defined over a linear mixture MDP and achieved an $\tilde{\mathcal{O}}(dB_{\star}^{1.5}\sqrt{K/c_{\min}})$ regret bound. Despite leveraging advances from both the finite-horizon and infinite-horizon settings, the results above are still far from optimal in terms of regret guarantee or computational efficiency, demonstrating the unique challenges of SSP problems.

In this work, we further extend our understanding of SSP with linear function approximation (more specifically, with linear MDPs). Our contributions are as follows:

  • In Section 3, we first propose a new analysis for the finite-horizon approximation of SSP introduced in (Cohen et al., 2021), which is much simpler and achieves a smaller approximation error. Our analysis is also model agnostic, meaning that it does not make use of the modeling assumption and can be applied to both the tabular setting and function approximation settings. Combining this new analysis with a simple finite-horizon algorithm similar to that of (Jin et al., 2020b), we achieve a regret bound of $\tilde{\mathcal{O}}(\sqrt{d^{3}B_{\star}^{2}T_{\star}K})$, with $T_{\star}\leq B_{\star}/c_{\min}$ being an upper bound on the hitting time of the optimal policy, which strictly improves over that of (Vial et al., 2021). Notably, unlike their algorithm, ours is computationally efficient without any extra assumption.

  • In Section 3.3, we further show that the same algorithm above with a slight modification achieves a logarithmic instance-dependent expected regret bound of order $\mathcal{O}\big(\frac{d^{3}B_{\star}^{4}}{c_{\min}^{2}\mathrm{gap}_{\min}}\ln^{5}\frac{dB_{\star}K}{c_{\min}}\big)$, where $\mathrm{gap}_{\min}$ is the minimum sub-optimality gap. As far as we know, this is the first logarithmic regret bound for SSP (with or without function approximation). We also establish a lower bound of order $\Omega(\frac{dB_{\star}^{2}}{\mathrm{gap}_{\min}})$, which further advances our understanding of this problem even though it does not exactly match our upper bound.

  • To remove the undesirable $T_{\star}$ dependency in our instance-independent bound, in Section 4, we further develop a computationally inefficient algorithm that makes use of certain variance-aware confidence sets in a global optimization problem and achieves $\tilde{\mathcal{O}}(d^{3.5}B_{\star}\sqrt{K})$ regret. Importantly, this bound is horizon-free in the sense that it has no polynomial dependency on $T_{\star}$ or $\frac{1}{c_{\min}}$ even in the lower order terms. Moreover, it almost matches the best known lower bound $\Omega(dB_{\star}\sqrt{K})$ from (Min et al., 2021).

Techniques

Our results are built upon several technical innovations. First, as mentioned, we develop an improved analysis for the finite-horizon approximation of (Cohen et al., 2021), which might be of independent interest. The key idea is to directly bound the total approximation error with respect to the regret bound of the finite-horizon algorithm, instead of analyzing the estimation precision for each state-action pair as done in (Cohen et al., 2021).

Second, to obtain the logarithmic bound in Section 3, we note that it is not enough to simply combine the aforementioned finite-horizon approximation and the existing logarithmic regret results for the finite-horizon setting such as (He et al., 2021), since the sub-optimality gap obtained in this way is in terms of the finite-horizon counterpart instead of the original SSP and could be substantially smaller. We resolve this issue via a longer horizon in the approximation and a careful two-stage analysis.

Finally, our horizon-free result in Section 4 is obtained by a novel combination of several ideas, including the global optimization algorithm of (Zanette et al., 2020b; Wei et al., 2021b), the variance-aware confidence sets of (Zhang et al., 2021) (for a related but different setting with linear mixture MDPs), an improved analysis of the variance-aware confidence sets (Kim et al., 2021), and finally a new clipping trick and new update conditions that we propose. Our analysis does not require the recursion-based technique of (Zhang et al., 2020a) (for the tabular case), nor estimating higher order moments of value functions as in (Zhang et al., 2021) (for linear mixture MDPs), which might also be of independent interest.

Related work

Regret minimization of SSP under stochastic costs has been well studied in the tabular setting (that is, no function approximation) (Tarbouriech et al., 2020; Cohen et al., 2020, 2021; Tarbouriech et al., 2021; Chen et al., 2021a; Jafarnia-Jahromi et al., 2021). There are also several works (Rosenberg & Mansour, 2020; Chen et al., 2021b; Chen & Luo, 2021) considering the more challenging setting with adversarial costs (which is beyond the scope of this work).

Beyond linear function approximation, researchers have also started considering theoretical guarantees for general function approximation in the finite-horizon setting (Wang et al., 2020; Ishfaq et al., 2021; Kong et al., 2021). The study of SSP, which again is a strict generalization of finite-horizon problems and might be a better model for many applications, falls behind in this regard, motivating us to explore this direction with the goal of providing a more complete picture, at least for linear function approximation.

The use of variance information is crucial in obtaining optimal regret bounds in MDPs. This dates back to the work of (Lattimore & Hutter, 2012) for the discounted setting, which has been significantly extended to the finite-horizon setting (Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019; Zhang et al., 2020a, b). Constructing variance-aware confidence sets for linear bandits and linear mixture MDPs has also gained recent attention (Zhou et al., 2021a; Zhang et al., 2021; Kim et al., 2021). We are among the first to do so for linear MDPs (a concurrent work (Wei et al., 2021a) also does so but for a completely different purpose of improving robustness against corruption).

Logarithmic gap-dependent bounds have been shown in different settings; see for example (Jaksch et al., 2010; Simchowitz & Jamieson, 2019; Jin et al., 2021; He et al., 2021), but to our knowledge, we are the first to show similar bounds for SSP.

2 Preliminary

An SSP instance is defined by an MDP ${\mathcal{M}}=({\mathcal{S}},{\mathcal{A}},s_{\text{init}},g,c,P)$. Here, ${\mathcal{S}}$ is the state space, ${\mathcal{A}}$ is the (finite) action space (with $A=|{\mathcal{A}}|$), $s_{\text{init}}\in{\mathcal{S}}$ is the initial state, $g\notin{\mathcal{S}}$ is the goal state, $c:{\mathcal{S}}\times{\mathcal{A}}\rightarrow[c_{\min},1]$ is the cost function with some global lower bound $c_{\min}\geq 0$, and $P=\{P_{s,a}\}_{(s,a)\in{\mathcal{S}}\times{\mathcal{A}}}$ with $P_{s,a}\in\Delta_{{\mathcal{S}}_{+}}$ is the transition function, where ${\mathcal{S}}_{+}$ is a shorthand for ${\mathcal{S}}\cup\{g\}$ and $\Delta_{{\mathcal{S}}_{+}}$ is the simplex over ${\mathcal{S}}_{+}$.

The learning protocol is as follows: the learner interacts with the environment for $K\geq 2$ episodes. In each episode, the learner starts in the initial state $s_{\text{init}}$, sequentially takes an action, incurs a cost, and transits to the next state. An episode ends when the learner reaches the goal state $g$. We denote by $(s_{t},a_{t},s^{\prime}_{t})$ the $t$-th state-action-state triplet observed among all episodes, so that $s^{\prime}_{t}\sim P_{s_{t},a_{t}}$ for each $t$, and $s^{\prime}_{t}=s_{t+1}$ unless $s^{\prime}_{t}=g$ (in which case $s_{t+1}=s_{\text{init}}$). Also denote by $T$ the total number of steps in the $K$ episodes.

Learning objective

The learner’s goal is to learn a policy that reaches the goal state with minimum costs. Formally, a (stationary and deterministic) policy $\pi:{\mathcal{S}}\rightarrow{\mathcal{A}}$ is a mapping that assigns an action $\pi(s)$ to each state $s\in{\mathcal{S}}$. We say $\pi$ is proper if following $\pi$ (that is, taking action $\pi(s)$ whenever in state $s$) reaches the goal state with probability $1$. Given a proper policy $\pi$, we define its value function and action-value function as follows:

$$\begin{aligned}
V^{\pi}(s) &= \mathbb{E}\left[\left.\sum_{i=1}^{I}c(s_{i},\pi(s_{i}))\,\right|\,P,\,s_{1}=s\right],\\
Q^{\pi}(s,a) &= c(s,a)+\mathbb{E}_{s^{\prime}\sim P_{s,a}}\left[V^{\pi}(s^{\prime})\right],
\end{aligned}$$

where the expectation in $V^{\pi}$ is with respect to the randomness of the next states $s_{i+1}\sim P_{s_{i},\pi(s_{i})}$ and the number of steps $I$ before reaching the goal $g$. Let $\Pi$ be the set of proper policies. We make the basic assumption that $\Pi$ is non-empty. Under this assumption, there exists an optimal proper policy $\pi^{\star}$ such that $V^{\pi^{\star}}(s)=\min_{a}Q^{\pi^{\star}}(s,a)$ and $V^{\pi^{\star}}(s)=\min_{\pi\in\Pi}V^{\pi}(s)$ for all $s$ (Bertsekas & Yu, 2013). We use $V^{\star}$ and $Q^{\star}$ as shorthands for $V^{\pi^{\star}}$ and $Q^{\pi^{\star}}$. The formal goal of the learner is then to minimize her regret against $\pi^{\star}$, that is, the difference between her total costs and that of the optimal proper policy, defined as

$$R_{K} = \sum_{t=1}^{T}c(s_{t},a_{t})-K\cdot V^{\star}(s_{\text{init}}).$$

We also define $R_{K}=\infty$ if $T=\infty$.
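As a concrete illustration of this definition, the regret bookkeeping can be sketched in a few lines of Python (the function name and inputs are hypothetical; per-step costs are assumed to be logged by the learner):

```python
import math

def ssp_regret(step_costs, K, V_star_init, all_episodes_finished=True):
    """Regret R_K: total cost over all T steps minus K * V*(s_init).

    step_costs: per-step costs c(s_t, a_t) logged over all K episodes.
    If the learner never reaches the goal in some episode (T = infinity),
    the regret is defined to be infinite.
    """
    if not all_episodes_finished:
        return math.inf
    return sum(step_costs) - K * V_star_init
```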

Linear SSP

In the so-called tabular setting, the state space is assumed to be small, and algorithms with computational complexity and regret bound depending on $S=|{\mathcal{S}}|$ are acceptable. To handle a potentially massive state space, however, we consider the same linear function approximation setting of (Vial et al., 2021), where the MDP enjoys a linear structure in both the transition and cost functions (known as linear or low-rank MDP).

Assumption 1 (Linear SSP).

For some $d\geq 2$, there exist known feature maps $\{\phi(s,a)\}_{(s,a)\in{\mathcal{S}}\times{\mathcal{A}}}\subseteq\mathbb{R}^{d}$ and unknown parameters $\theta^{\star}\in\mathbb{R}^{d}$ and $\{\mu(s^{\prime})\}_{s^{\prime}\in{\mathcal{S}}_{+}}\subseteq\mathbb{R}^{d}$, such that for any $(s,a)\in{\mathcal{S}}\times{\mathcal{A}}$ and $s^{\prime}\in{\mathcal{S}}_{+}$, we have:

$$c(s,a)=\phi(s,a)^{\top}\theta^{\star},\qquad P_{s,a}(s^{\prime})=\phi(s,a)^{\top}\mu(s^{\prime}).$$

Moreover, we assume $\left\|\phi(s,a)\right\|_{2}\leq 1$ for all $(s,a)\in{\mathcal{S}}\times{\mathcal{A}}$, $\left\|\theta^{\star}\right\|_{2}\leq\sqrt{d}$, and $\left\|\int h(s^{\prime})\,d\mu(s^{\prime})\right\|_{2}\leq\sqrt{d}\left\|h\right\|_{\infty}$ for any $h\in\mathbb{R}^{{\mathcal{S}}_{+}}$.

We refer the reader to (Vial et al., 2021) and references therein for justification of this widely-used structural assumption (especially the last few norm constraints). Under Assumption 1, by definition we have $Q^{\star}(s,a)=\phi(s,a)^{\top}w^{\star}$, where $w^{\star}=\theta^{\star}+\int V^{\star}(s^{\prime})\,dm\mu(s^{\prime})\in\mathbb{R}^{d}$; that is, $Q^{\star}$ is also linear in the features.
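For completeness, this identity is just the Bellman optimality equation combined with the linear structure of Assumption 1:

```latex
\begin{aligned}
Q^{\star}(s,a)
&= c(s,a) + \mathbb{E}_{s^{\prime}\sim P_{s,a}}\left[V^{\star}(s^{\prime})\right]
 = \phi(s,a)^{\top}\theta^{\star} + \int V^{\star}(s^{\prime})\,\phi(s,a)^{\top}\,d\mu(s^{\prime}) \\
&= \phi(s,a)^{\top}\underbrace{\left(\theta^{\star} + \int V^{\star}(s^{\prime})\,d\mu(s^{\prime})\right)}_{=\,w^{\star}}.
\end{aligned}
```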

Key parameters and notations

Two extra parameters that play a key role in our analysis are $B_{\star}=\max_{s}V^{\star}(s)$, the maximum expected cost of the optimal policy starting from any state, and $T_{\star}=\max_{s}T^{\pi^{\star}}(s)$, the maximum hitting time of the optimal policy starting from any state, where $T^{\pi}(s)$ is the expected number of steps before reaching the goal when following policy $\pi$ starting from state $s$. By definition, we have $T_{\star}\leq B_{\star}/c_{\min}$.
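The last inequality follows because every step incurs cost at least $c_{\min}$, so for every state $s$:

```latex
B_{\star} \;\geq\; V^{\pi^{\star}}(s)
 \;=\; \mathbb{E}\left[\sum_{i=1}^{I}c\left(s_{i},\pi^{\star}(s_{i})\right)\right]
 \;\geq\; c_{\min}\,\mathbb{E}[I]
 \;=\; c_{\min}\,T^{\pi^{\star}}(s).
```

Taking the maximum over $s$ gives $T_{\star}\leq B_{\star}/c_{\min}$ (assuming $c_{\min}>0$).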

For simplicity, we assume that $B_{\star}$, $T_{\star}$, and $c_{\min}$ are known to the learner for most discussions, and defer to the appendix what we can achieve when some of these parameters are unknown. We also assume $B_{\star}>1$ and $c_{\min}>0$ by default (and will discuss the case $c_{\min}=0$ for specific algorithms if modifications are needed).

For $n\in\mathbb{N}_{+}$, we define $[n]=\{1,\ldots,n\}$. For any $l\leq r$, we define $[x]_{[l,r]}=\min\{\max\{x,l\},r\}$ as the projection of $x$ onto the interval $[l,r]$. The notation $\tilde{\mathcal{O}}\left(\cdot\right)$ hides all logarithmic terms, including $\ln K$ and $\ln\frac{1}{\delta}$ for some confidence level $\delta\in(0,1)$.

3 An Efficient Algorithm for Linear SSP

In this section, we introduce a computationally efficient algorithm for linear SSP. In Section 3.1, we first develop an improved analysis for the finite-horizon approximation of (Cohen et al., 2021). Then in Section 3.2, we combine this approximation with a simple finite-horizon algorithm, which together achieves $\tilde{\mathcal{O}}(\sqrt{d^{3}B_{\star}^{2}T_{\star}K})$ regret. Finally, in Section 3.3, we further obtain a logarithmic regret bound via a slightly modified algorithm and a careful two-stage analysis.

3.1 Finite-Horizon Approximation of SSP

Finite-horizon approximation has been frequently used in solving SSP problems (Chen et al., 2021b; Chen & Luo, 2021; Cohen et al., 2021; Chen et al., 2021a). In particular, Cohen et al. (2021) proposed a black-box reduction from SSP to a finite-horizon MDP, which achieves a minimax optimal regret bound in the tabular case when combined with a certain finite-horizon algorithm. We will make use of the same algorithmic reduction in our proposed algorithm, but with an improved analysis.

Specifically, for an SSP instance ${\mathcal{M}}=({\mathcal{S}},{\mathcal{A}},s_{\text{init}},g,c,P)$, define its finite-horizon MDP counterpart as $\widetilde{{\mathcal{M}}}=({\mathcal{S}}_{+},{\mathcal{A}},\widetilde{c},c_{f},\widetilde{P},H)$, where $\widetilde{c}(s,a)=c(s,a)\mathbb{I}\{s\neq g\}$ is the extended cost function, $c_{f}(s)=2B_{\star}\mathbb{I}\{s\neq g\}$ is the terminal cost function (more details to follow), $\widetilde{P}=\{P_{s,a}\}_{(s,a)\in{\mathcal{S}}\times{\mathcal{A}}}\cup\{P_{g,a}\}_{a\in{\mathcal{A}}}$ with $P_{g,a}(s^{\prime})=\mathbb{I}\{s^{\prime}=g\}$ is the extended transition function, and $H$ is a horizon parameter. Assume access to a corresponding finite-horizon algorithm $\mathfrak{A}$ which learns through a certain number of "intervals" following the protocol below. At the beginning of an interval $m$, the learner $\mathfrak{A}$ is first reset to an arbitrary state $s_{1}^{m}$. Then, in each step $h=1,\ldots,H$ within this interval, $\mathfrak{A}$ decides an action $a_{h}^{m}$, transits to $s_{h+1}^{m}\sim\widetilde{P}_{s_{h}^{m},a_{h}^{m}}$, and suffers cost $\widetilde{c}(s_{h}^{m},a_{h}^{m})$. At the end of the interval, the learner suffers an additional terminal cost $c_{f}(s_{H+1}^{m})$, and then moves on to the next interval.

With such black-box access to $\mathfrak{A}$, the reduction of (Cohen et al., 2021) is depicted in Algorithm 1. The algorithm partitions the time steps into intervals of length $H\geq 4T_{\star}\ln(4K)$ (such that $\pi^{\star}$ reaches $g$ within $H$ steps with high probability). In each step, the algorithm follows $\mathfrak{A}$ in a natural way and feeds the observations to $\mathfrak{A}$ (Lines 5, 7, and 9). If the goal state is not reached within an interval, $\mathfrak{A}$ naturally enters the next interval with the initial state being the current state (Line 10). Otherwise, if the goal state is reached within some interval, we keep feeding $g$ and zero cost to $\mathfrak{A}$ until it finishes the current interval (Lines 8 and 9), and after that, the next interval corresponds to the beginning of the next episode of the original SSP problem (Line 1).

Algorithm 1 Finite-Horizon Approximation of SSP from (Cohen et al., 2021)

Input: Algorithm $\mathfrak{A}$ for the finite-horizon MDP $\widetilde{{\mathcal{M}}}$ with horizon $H\geq 4T_{\star}\ln(4K)$.

Initialize: interval counter $m\leftarrow 1$.

for $k=1,\ldots,K$ do
1:      Set $s^{m}_{1}\leftarrow s_{\text{init}}$.
2:      while $s^{m}_{1}\neq g$ do
3:            Feed initial state $s^{m}_{1}$ to $\mathfrak{A}$.
4:            for $h=1,\ldots,H$ do
5:                  Receive action $a^{m}_{h}$ from $\mathfrak{A}$.
6:                  if $s^{m}_{h}\neq g$ then
7:                        Play action $a^{m}_{h}$, observe cost $c^{m}_{h}=c(s^{m}_{h},a^{m}_{h})$ and next state $s^{m}_{h+1}$.
8:                  else set $c^{m}_{h}=0$ and $s^{m}_{h+1}=g$.
9:                  Feed $c^{m}_{h}$ and $s^{m}_{h+1}$ to $\mathfrak{A}$.
10:           Set $s^{m+1}_{1}=s^{m}_{H+1}$ and $m\leftarrow m+1$.
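The reduction loop can be sketched in Python as follows. This is a minimal sketch under assumed interfaces: `env` (hypothetical) exposes `reset()` and `step(a)`, and `learner` stands in for $\mathfrak{A}$ with `begin_interval`, `act`, and `observe` methods; none of these names come from the paper.

```python
def finite_horizon_reduction(env, learner, K, H, goal="g"):
    """Sketch of Algorithm 1: run K SSP episodes by chaining H-step
    intervals of a finite-horizon learner.

    Assumes every episode eventually reaches `goal` (proper behavior);
    returns the total cost suffered and the number of intervals M.
    """
    total_cost, m = 0.0, 0
    for _ in range(K):
        s = env.reset()                      # start of an SSP episode
        while s != goal:                     # an episode may span intervals
            m += 1
            learner.begin_interval(s)        # feed initial state s_1^m
            for h in range(1, H + 1):
                a = learner.act(h, s)
                if s != goal:
                    c, s = env.step(a)       # real transition, real cost
                    total_cost += c
                else:
                    c, s = 0.0, goal         # pad with zero cost at g
                learner.observe(h, c, s)
            # next interval starts from s_{H+1}^m (or a new episode if s == goal)
    return total_cost, m
```

Note that the terminal cost $c_f$ is part of the learner's regret accounting rather than a cost actually suffered in the environment, so it does not appear in the loop above.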

Analysis

Cohen et al. (2021) showed that in this reduction, the regret $R_{K}$ of the SSP problem is very close to the regret of $\mathfrak{A}$ in the finite-horizon MDP $\widetilde{{\mathcal{M}}}$. Specifically, define $\widetilde{R}_{M^{\prime}}=\sum_{m=1}^{M^{\prime}}\big(\sum_{h=1}^{H}c^{m}_{h}+c_{f}(s^{m}_{H+1})-V^{\star}_{1}(s^{m}_{1})\big)$ as the regret of $\mathfrak{A}$ over the first $M^{\prime}$ intervals of $\widetilde{{\mathcal{M}}}$ (note the inclusion of the terminal costs), where $V^{\star}_{1}$ is the optimal value function of the first layer of $\widetilde{{\mathcal{M}}}$ (see Appendix B.1 for the formal definition). Denote by $M$ the final (random) number of intervals created during the $K$ episodes. Then Cohen et al. (2021) showed the following (a proof is included in Section 3.1 for completeness).

Lemma 1.

Algorithm 1 ensures $R_{K}\leq\widetilde{R}_{M}+B_{\star}$.

This lemma suggests that it remains to bound the number of intervals $M$. The analysis of Cohen et al. (2021) does so by marking state-action pairs as "known" or "unknown" based on how many times they have been visited, and showing that in each interval, the learner either reaches an "unknown" state-action pair or with high probability reaches the goal state. This analysis requires $\mathfrak{A}$ to be "admissible" (defined through a set of conditions) and also heavily makes use of the tabular setting to keep track of the status of each state-action pair, making it hard to generalize directly to function approximation settings. Furthermore, it also introduces $T_{\star}$ dependency in the lower order term of $M$, since the total cost of an interval in which an "unknown" state-action pair is visited is trivially bounded by $H=\Omega(T_{\star})$.

Instead, we propose the following simple and improved analysis. The idea is to separate intervals into "good" ones, within which the learner reaches the goal state, and "bad" ones, within which the learner does not. Then, our key observation is that the regret in each bad interval is at least $B_{\star}$ — this is because the learner's cost is at least $2B_{\star}$ in such intervals by the choice of the terminal cost $c_{f}$, and the optimal policy's expected cost is at most $B_{\star}$. Therefore, if $\mathfrak{A}$ is a no-regret algorithm, the number of bad intervals has to be small. More formally, based on this idea we can bound $M$ directly in terms of the regret guarantee of $\mathfrak{A}$ without requiring any extra properties from $\mathfrak{A}$, as shown in the following theorem.

Theorem 1.

Suppose that $\mathfrak{A}$ enjoys the following regret guarantee with certain probability: $\widetilde{R}_{m}=\tilde{\mathcal{O}}\left(\gamma_{0}+\gamma_{1}\sqrt{m}\right)$ for some problem-dependent coefficients $\gamma_{0}$ and $\gamma_{1}$ (that are independent of $m$) and any number of intervals $m\leq M$. Then, with the same probability, the number of intervals created by Algorithm 1 satisfies $M=\tilde{\mathcal{O}}\left(K+\frac{\gamma_{1}^{2}}{B_{\star}^{2}}+\frac{\gamma_{0}}{B_{\star}}\right)$.

Proof.

For any finite $M_{\dagger}\leq M$, we will show $M_{\dagger}=\tilde{\mathcal{O}}\left(K+\frac{\gamma_{1}^{2}}{B_{\star}^{2}}+\frac{\gamma_{0}}{B_{\star}}\right)$, which then implies that $M$ has to be finite and is upper bounded by the same quantity. To do so, we define the set of good intervals ${\mathcal{C}}_{g}=\{m\in[M_{\dagger}]:s^{m}_{H+1}=g\}$ in which the learner reaches the goal state, and also the total cost of the learner in interval $m$ of $\widetilde{{\mathcal{M}}}$: $C^{m}=\sum_{h=1}^{H}c^{m}_{h}+c_{f}(s^{m}_{H+1})$. By definition and the guarantee of $\mathfrak{A}$, we have

$$\widetilde{R}_{M_{\dagger}} = \sum_{m\in{\mathcal{C}}_{g}}\left(C^{m}-V^{\star}_{1}(s^{m}_{1})\right)+\sum_{m\notin{\mathcal{C}}_{g}}\left(C^{m}-V^{\star}_{1}(s^{m}_{1})\right) \leq \tilde{\mathcal{O}}\left(\gamma_{0}+\gamma_{1}\sqrt{M_{\dagger}}\right). \tag{1}$$

Next, we derive lower bounds on $\sum_{m\in{\mathcal{C}}_{g}}\left(C^{m}-V^{\star}_{1}(s^{m}_{1})\right)$ and $\sum_{m\notin{\mathcal{C}}_{g}}\left(C^{m}-V^{\star}_{1}(s^{m}_{1})\right)$ respectively. First note that by Lemma 17 and $H\geq 4T_{\star}\ln(4K)$, $\pi^{\star}$ reaches the goal within $H$ steps with probability at least $1-\frac{1}{2K}$. Therefore, executing $\pi^{\star}$ in an episode of $\widetilde{{\mathcal{M}}}$ leads to at most $B_{\star}+\frac{2B_{\star}}{2K}\leq\frac{3}{2}B_{\star}$ cost in expectation, which implies $V^{\star}_{1}(s)\leq\frac{3}{2}B_{\star}$ for any $s$. Since $|{\mathcal{C}}_{g}|\leq K$, we thus have

$$\sum_{m\in{\mathcal{C}}_{g}}\left(C^{m}-V^{\star}_{1}(s^{m}_{1})\right)\geq-\frac{3}{2}B_{\star}K.$$

On the other hand, for $m\notin{\mathcal{C}}_{g}$, we have $C^{m}\geq 2B_{\star}$ due to the terminal cost $c_{f}(s^{m}_{H+1})=2B_{\star}$, so $C^{m}-V^{\star}_{1}(s^{m}_{1})\geq 2B_{\star}-\frac{3}{2}B_{\star}=\frac{B_{\star}}{2}$, and thus

$$\sum_{m\notin{\mathcal{C}}_{g}}\left(C^{m}-V^{\star}_{1}(s^{m}_{1})\right)\geq\frac{B_{\star}}{2}(M_{\dagger}-|{\mathcal{C}}_{g}|)\geq\frac{B_{\star}}{2}(M_{\dagger}-K).$$

Combining the two lower bounds above with Eq. (1), we arrive at $\frac{B_{\star}}{2}M_{\dagger}\leq\tilde{\mathcal{O}}\left(\gamma_{0}+\gamma_{1}\sqrt{M_{\dagger}}\right)+2B_{\star}K$. By Lemma 28, this implies $M_{\dagger}=\tilde{\mathcal{O}}\left(K+\frac{\gamma_{1}^{2}}{B_{\star}^{2}}+\frac{\gamma_{0}}{B_{\star}}\right)$, finishing the proof. ∎

Now plugging the bound on $M$ from Theorem 1 into Lemma 1, we immediately obtain the following corollary on a general regret bound for the finite-horizon approximation.

Corollary 2.

Under the same conditions as Theorem 1, Algorithm 1 ensures $R_{K}=\tilde{\mathcal{O}}\left(\gamma_{1}\sqrt{K}+\frac{\gamma_{1}^{2}}{B_{\star}}+\gamma_{0}+B_{\star}\right)$ (with the same probability stated in Theorem 1).

Proof.

Combining Lemma 1 and Theorem 1, we have $R_{K}\leq\widetilde{R}_{M}+B_{\star}\leq\tilde{\mathcal{O}}\left(\gamma_{1}\sqrt{M}+\gamma_{0}+B_{\star}\right)\leq\tilde{\mathcal{O}}\left(\gamma_{1}\sqrt{K}+\frac{\gamma_{1}^{2}}{B_{\star}}+\gamma_{1}\sqrt{\frac{\gamma_{0}}{B_{\star}}}+\gamma_{0}+B_{\star}\right)$. Further realizing $\gamma_{1}\sqrt{\frac{\gamma_{0}}{B_{\star}}}\leq\frac{1}{2}\left(\frac{\gamma_{1}^{2}}{B_{\star}}+\gamma_{0}\right)$ by the AM-GM inequality proves the statement. ∎

Note that the final regret bound depends entirely on the regret guarantee of the finite-horizon algorithm $\mathfrak{A}$. In particular, in the tabular case, if we apply a variant of EB-SSP (Tarbouriech et al., 2021) that achieves $\widetilde{R}_{m}=\tilde{\mathcal{O}}(B_{\star}\sqrt{SAm}+B_{\star}S^{2}A)$ (note the lack of polynomial dependency on $H$; this variant is equivalent to applying EB-SSP on a homogeneous finite-horizon MDP), then Corollary 2 ensures that $R_{K}=\tilde{\mathcal{O}}(B_{\star}\sqrt{SAK}+B_{\star}S^{2}A)$, improving the results of (Cohen et al., 2021) and matching the best existing bounds of (Tarbouriech et al., 2021; Chen et al., 2021a); see Appendix B.5 for more details. This is not achievable by the analysis of (Cohen et al., 2021) due to the $T_{\star}$ dependency in the lower order term mentioned earlier.

More importantly, our analysis is model agnostic: it only makes use of the regret guarantee of the finite-horizon algorithm, and does not leverage any modeling assumption on the SSP instance. This enables us to directly apply our result to settings with function approximation. In Appendix B.6, we provide an example for SSP with a linear mixture MDP, which gives a regret bound of $\tilde{\mathcal{O}}(B_{\star}\sqrt{dT_{\star}K}+B_{\star}d\sqrt{K})$ by combining Corollary 2 and the near-optimal finite-horizon algorithm of (Zhou et al., 2021a).

3.2 Applying an Efficient Finite-Horizon Algorithm for Linear MDPs

Similarly, if there were a horizon-free algorithm for finite-horizon linear MDPs, we could directly combine it with Algorithm 1 and obtain a $T_{\star}$-independent regret bound. However, to our knowledge, this is still open due to some unique challenges for linear MDPs.

Nevertheless, even combining Algorithm 1 with a horizon-dependent linear MDP algorithm already leads to a significant improvement over the state of the art for linear SSP. Specifically, the finite-horizon algorithm $\mathfrak{A}$ we apply is a variant of LSVI-UCB (Jin et al., 2020b), which performs Least-Squares Value Iteration with an optimistic modification. The pseudocode is shown in Algorithm 2. Utilizing the fact that action-value functions are linear in the features for a linear MDP, in each interval $m$, we estimate the parameters $\{w^{m}_{h}\}_{h=1}^{H}$ of these linear functions by solving a set of least-squares linear regression problems using all observed data (Line 1), and we encourage exploration by subtracting a bonus term $\beta_{m}\left\|\phi(s,a)\right\|_{\Lambda_{m}^{-1}}$ in the definition of $\widehat{Q}^{m}_{h}(s,a)$ (Line 2). Then, we simply act greedily with respect to the truncated action-value estimates $\{Q^{m}_{h}\}_{h}$ (Line 3). Clearly, this is an efficient algorithm with polynomial (in $d$, $H$, $m$, and $A$) time complexity for each interval $m$.

We refer the reader to (Jin et al., 2020b) for more explanation of the algorithm, and point out three key modifications we make compared to their version. First, Jin et al. (2020b) maintain a separate covariance matrix $\Lambda^{m}_{h}$ for each layer $h$ using data only from layer $h$, while we maintain a single covariance matrix $\Lambda_{m}$ using data across all layers (Line 4). This is possible (and results in a better regret bound) since the transition function is the same in each layer of $\widetilde{{\mathcal{M}}}$. Another modification is to define $V^{m}_{H+1}(s)$ as $c_{f}(s)$, simply for the purpose of incorporating the terminal cost. Finally, we project the action-value estimates onto $[0,B]$ for some parameter $B$, similar to (Vial et al., 2021) (Line 2). In the main text we simply set $B=3B_{\star}$, and the upper truncation at $B$ has no effect in this case. However, this projection will become important when learning without the knowledge of $B_{\star}$ (see Appendix B.4).

Algorithm 2 Finite-Horizon Linear-MDP Algorithm

Parameters: $\lambda=1$, $\beta_{m}=50dB\sqrt{\ln(16BmHd/\delta)}$, where $\delta$ is the failure probability and $B\geq 1$.

Initialize: $\Lambda_{1}=\lambda I$.

for m=1,,Mm=1,\ldots,M do

       Define VH+1m(s)=cf(s)V^{m}_{H+1}(s)=c_{f}(s). for h=H,,1h=H,\ldots,1 do
            21 Compute
whm=Λm1m=1m1h=1Hϕhm(chm+Vh+1m(sh+1m)),w^{m}_{h}=\Lambda_{m}^{-1}\sum_{m^{\prime}=1}^{m-1}\sum_{h^{\prime}=1}^{H}\phi^{m^{\prime}}_{h^{\prime}}(c^{m^{\prime}}_{h^{\prime}}+V^{m}_{h+1}(s^{m^{\prime}}_{h^{\prime}+1})),
where ϕhm=ϕ(shm,ahm)\phi^{m}_{h}=\phi(s^{m}_{h},a^{m}_{h}). Define ϕ(g,a)=0\phi(g,a)=0 and
Q^hm(s,a)\displaystyle\widehat{Q}^{m}_{h}(s,a) =ϕ(s,a)whmβmϕ(s,a)Λm1\displaystyle=\phi(s,a)^{\top}w^{m}_{h}-\beta_{m}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}
Qhm(s,a)\displaystyle Q^{m}_{h}(s,a) =[Q^hm(s,a)][0,B]\displaystyle=[\widehat{Q}^{m}_{h}(s,a)]_{[0,B]}
Vhm(s)\displaystyle V^{m}_{h}(s) =minaQhm(s,a)\displaystyle=\min_{a}Q^{m}_{h}(s,a)
      for h=1,,Hh=1,\ldots,H do
             Play ahm=argminaQhm(shm,a)a^{m}_{h}=\operatorname*{argmin}_{a}Q^{m}_{h}(s^{m}_{h},a), suffer chmc^{m}_{h}, and transit to sh+1ms^{m}_{h+1}.
      Compute Λm+1=Λm+h=1Hϕhmϕhm\Lambda_{m+1}=\Lambda_{m}+\sum_{h=1}^{H}\phi^{m}_{h}{\phi^{m}_{h}}^{\top}.
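To make the update concrete, the following is a minimal numpy sketch of the least-squares step and optimistic bonus in Algorithm 2; the function and variable names are ours, and the backward loop over layers and the data bookkeeping are omitted. It is a sketch under these assumptions, not the paper's implementation.

```python
import numpy as np

def lsvi_ucb_step(Phi, costs, V_next, Lambda, beta, B):
    """One backward step of optimistic least-squares value iteration.

    Phi:    (n, d) features of all past (s, a) pairs (pooled across layers,
            since Algorithm 2 shares a single covariance matrix).
    costs:  (n,) observed costs.
    V_next: (n,) value estimates V^m_{h+1} at the observed next states.
    Lambda: (d, d) regularized covariance matrix lambda*I + Phi^T Phi.
    """
    Lambda_inv = np.linalg.inv(Lambda)
    # w^m_h = Lambda^{-1} * sum_i phi_i * (c_i + V^m_{h+1}(s'_i))
    w = Lambda_inv @ (Phi.T @ (costs + V_next))

    def Q(phi_sa):
        # optimistic estimate: subtract the elliptical bonus, project to [0, B]
        bonus = beta * np.sqrt(phi_sa @ Lambda_inv @ phi_sa)
        return float(np.clip(phi_sa @ w - bonus, 0.0, B))

    return w, Q
```

In the full algorithm this step is repeated backward over h = H, ..., 1, with V_next recomputed at each layer as the minimum of the truncated Q estimates.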

We show the following regret guarantee of Algorithm 2 following the analysis of (Vial et al., 2021) (see Appendix B.3).

Lemma 2.

With probability at least 14δ1-4\delta, Algorithm 2 with B=3BB=3B_{\star} ensures R~m=𝒪~(d3B2Hm+d2BH)\widetilde{R}_{m}=\tilde{\mathcal{O}}(\sqrt{d^{3}B_{\star}^{2}Hm}+d^{2}B_{\star}H) for any mMm\leq M.

Applying Corollary 2 we then immediately obtain the following new result for linear SSP.

Theorem 3.

Applying Algorithm 1 with H=4Tln(4K)H=4T_{\star}\ln(4K) and 𝔄\mathfrak{A} being Algorithm 2 with B=3BB=3B_{\star} to the linear SSP problem ensures RK=𝒪~(d3B2TK+d3BT)R_{K}=\tilde{\mathcal{O}}(\sqrt{d^{3}B_{\star}^{2}T_{\star}K}+d^{3}B_{\star}T_{\star}) with probability at least 14δ1-4\delta.

There is some gap between our result above and the existing lower bound Ω(dBK)\Omega(dB_{\star}\sqrt{K}) for this problem (Min et al., 2021). In particular, the dependency on TT_{\star} inherited from the HH dependency in Lemma 2 is most likely unnecessary. Nevertheless, this already strictly improves over the best existing bound 𝒪~(d3B3K/cmin)\tilde{\mathcal{O}}(\sqrt{d^{3}B_{\star}^{3}K/c_{\min}}) from (Vial et al., 2021) since TB/cminT_{\star}\leq B_{\star}/c_{\min}. Moreover, our algorithm is computationally efficient, while the algorithms of Vial et al. (2021) are either inefficient or achieve a much worse regret bound such as 𝒪~(K5/6)\tilde{\mathcal{O}}(K^{5/6}) (unless some strong assumptions are made). This improvement comes from the fact that our algorithm uses non-stationary policies (due to the finite-horizon approximation), which avoids the challenging problem of solving the fixed point of some empirical Bellman equation. This also demonstrates the power of finite-horizon approximation in solving SSP problems. On the other hand, obtaining the same regret guarantee by learning stationary policies only is an interesting future direction.

Learning without knowing BB_{\star} or TT_{\star}

Note that the result of Theorem 3 requires the knowledge of BB_{\star} and TT_{\star}. Without knowing these parameters, we can still efficiently obtain a regret bound of order 𝒪~(d3B3K/cmin+d3B2/cmin)\tilde{\mathcal{O}}(\sqrt{d^{3}B_{\star}^{3}K/c_{\min}}+d^{3}B_{\star}^{2}/c_{\min}), matching the bound of (Vial et al., 2021) achieved by their inefficient algorithm. See Appendix B.4 for details.

3.3 Logarithmic Regret

Many optimistic algorithms attain a more favorable regret bound of the form ClnKC\ln K, where CC is an instance-dependent constant usually inversely proportional to some gap measure; see e.g. (Jaksch et al., 2010) for the infinite-horizon setting and (Simchowitz & Jamieson, 2019) for the finite-horizon setting. In this section, we show that a slight modification of our algorithm also leads to an expected regret bound that is polylogarithmic in KK and inversely proportional to gapmin=mins,a:gap(s,a)>0gap(s,a)\text{\rm gap}_{\min}=\min_{s,a:\text{\rm gap}(s,a)>0}\text{\rm gap}(s,a) with gap(s,a)=Q(s,a)V(s)\text{\rm gap}(s,a)=Q^{\star}(s,a)-V^{\star}(s).

Footnote 2: Note that for our definition of regret, a polylogarithmic bound is only possible in expectation, because even if the learner always executes π\pi^{\star}, the deviation of her total costs from KV(sinit)KV^{\star}(s_{\text{init}}) is already of order K\sqrt{K}.
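For concreteness, these gap quantities can be computed directly from a tabular Q⋆; the sketch below (with toy values, names ours) illustrates the definitions gap(s,a) = Q⋆(s,a) − V⋆(s) and gap_min as the smallest strictly positive gap.

```python
import numpy as np

def min_gap(Q_star):
    """Q_star: (S, A) array of optimal action values (costs, so min over a).

    Returns gap_min = min over (s, a) with gap(s, a) > 0 of
    gap(s, a) = Q*(s, a) - V*(s), or 0.0 if every gap is zero.
    """
    V_star = Q_star.min(axis=1)            # V*(s) = min_a Q*(s, a)
    gaps = Q_star - V_star[:, None]        # gap(s, a) >= 0 entrywise
    positive = gaps[gaps > 0]
    return float(positive.min()) if positive.size else 0.0
```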

The high-level idea is as follows. The first observation is that similarly to a recent work by He et al. (2021), we can show that our Algorithm 2 obtains a gap-dependent logarithmic regret bound 𝒪~(lnmgapmin)\tilde{\mathcal{O}}(\frac{\ln m}{\text{\rm gap}_{\min}^{\prime}}) for the finite-horizon problem. The caveat is that gapmin\text{\rm gap}_{\min}^{\prime} here is naturally defined using the optimal value and action-value functions VhV^{\star}_{h} and QhQ^{\star}_{h} for the finite-horizon MDP (which is different for each layer hh); more specifically, gapmin=mins,a,h:gaph(s,a)>0gaph(s,a)\text{\rm gap}_{\min}^{\prime}=\min_{s,a,h:\text{\rm gap}_{h}(s,a)>0}\text{\rm gap}_{h}(s,a) where gaph(s,a)=Qh(s,a)Vh(s)\text{\rm gap}_{h}(s,a)=Q^{\star}_{h}(s,a)-V^{\star}_{h}(s). The difference between gapmin\text{\rm gap}_{\min} and gapmin\text{\rm gap}_{\min}^{\prime} can in fact be significant; see Appendix B.7 for an example where gapmin\text{\rm gap}_{\min}^{\prime} is arbitrarily smaller than gapmin\text{\rm gap}_{\min}.

To get around this issue, we set HH to be a larger value of order 𝒪~(Bcmin)\tilde{\mathcal{O}}(\frac{B_{\star}}{c_{\min}}) and perform the following two-stage analysis. For the first H/2H/2 layers, we are able to show Qh(s,a)Q(s,a)Q^{\star}_{h}(s,a)\approx Q^{\star}(s,a) and thus gaph(s,a)gap(s,a)\text{\rm gap}_{h}(s,a)\approx\text{\rm gap}(s,a), leading to an 𝒪~(lnmgapmin)\tilde{\mathcal{O}}(\frac{\ln m}{\text{\rm gap}_{\min}}) bound on the regret suffered for these layers. Then, for the last H/2H/2 layers, we further consider two cases: if the learner’s policy for the first H/2H/2 layers is nearly optimal, then the probability of not reaching the goal within the first H/2H/2 layers is very low by the choice of HH, and thus the costs suffered in the last H/2H/2 layers are negligible; otherwise, we simply bound the costs using the number of times the learner takes a non-near-optimal action in the first H/2H/2 layers, which is again shown to be of order 𝒪~(lnmgapmin)\tilde{\mathcal{O}}(\frac{\ln m}{\text{\rm gap}_{\min}}).

One final detail is to carefully control the regret under some failure event that happens with a small probability (recall that we are aiming for an expected regret bound; see Footnote 2). This is necessary since in SSP the learner’s cost under such events could be unbounded in the worst case. To resolve this issue, we make a slight modification to Algorithm 1 and occasionally restart 𝔄\mathfrak{A} whenever the number of total intervals reaches some multiple of a threshold; see Algorithm 7 in the appendix. This finally leads to our main result summarized in the following theorem (whose proof is deferred to Appendix B.8).

Theorem 4.

There exist bb^{\prime} and δ\delta such that applying Algorithm 7 with horizon H=bBcminln(dBKcmin)H=\frac{b^{\prime}B_{\star}}{c_{\min}}\ln(\frac{dB_{\star}K}{c_{\min}}) and 𝔄\mathfrak{A} being Algorithm 2 (with B=3BB=3B_{\star} and failure probability δ\delta) ensures 𝔼[RK]=𝒪(d3B4cmin2gapminln5dBKcmin)\mathbb{E}[R_{K}]=\mathcal{O}\left(\frac{d^{3}B_{\star}^{4}}{c_{\min}^{2}\text{\rm gap}_{\min}}\ln^{5}\frac{dB_{\star}K}{c_{\min}}\right).

As far as we know, this is the first polylogarithmic bound for any SSP problem. Our result also indicates that the instance-dependent quantities of SSP can be well preserved after using some finite-horizon approximation.

Lower bounds

To better understand instance-dependent regret bounds for this problem, we further show the following lower bound.

Theorem 5.

For any algorithm 𝔄\mathfrak{A}, there exists a linear SSP instance with d2d\geq 2 and B1B_{\star}\geq 1 such that 𝔼𝔄[RK]=Ω(dB2/gapmin)\mathbb{E}_{\mathfrak{A}}[R_{K}]=\Omega(dB_{\star}^{2}/\text{\rm gap}_{\min}).

This lower bound exhibits a relatively large gap from our upper bound. One important question is whether the 1cmin\frac{1}{c_{\min}} dependency in the upper bound is really necessary, which we leave as a future direction.

4 An Inefficient Horizon-Free Algorithm

Recall that the dominating term of the regret bound shown in Theorem 3 depends on TT_{\star}, which is most likely unnecessary. Due to the lack of a horizon-free algorithm for finite-horizon linear MDPs (which, as discussed, would have addressed this issue), in this section we propose a different approach leading to a computationally inefficient algorithm with a regret bound that is horizon-free (that is, no polynomial dependency on TT_{\star}) but has a worse dependency on dd.

As shown in previous work for the tabular setting (Cohen et al., 2020, 2021; Tarbouriech et al., 2021; Chen et al., 2021a), achieving a horizon-free regret bound requires constructing variance-aware confidence sets on the transition functions. While this is straightforward in the tabular case, it is much more challenging with linear function approximation. Zhou et al. (2021a); Zhang et al. (2021) construct variance-aware confidence sets for linear mixture MDPs, but we are not aware of similar results for linear MDPs, which pose extra challenges. Our algorithm VA-GOPO, shown in Algorithm 3, is the first one to successfully make use of these ideas.

Algorithm 3 Variance-Aware Global OPtimization with Optimism (VA-GOPO)

Initialize: t=t=1t=t^{\prime}=1, k=1k=1, s1=sinits_{1}=s_{\text{init}}, B1=1B_{1}=1.

Define: s0=gs_{0}^{\prime}=g and Vt=Vwt,BtV_{t}=V_{w_{t},B_{t}}.

while kKk\leq K do

      if st1=gs^{\prime}_{t-1}=g or Eq. (4) holds or Vt(st)=2BtV_{t^{\prime}}(s_{t})=2B_{t} then
             while True do
                  Compute wt=argminwΩt(w,Bt)Vw,Bt(st)w_{t}=\operatorname*{argmin}_{w\in\Omega_{t}(w,B_{t})}V_{w,B_{t}}(s_{t}) (see Eq. (2) and Eq. (3) for definitions). if Vt(st)>BtV_{t}(s_{t})>B_{t} then Bt2BtB_{t}\leftarrow 2B_{t}; else break.
                  
            Record the most recent update time ttt^{\prime}\leftarrow t.
      else  (wt,Bt)=(wt1,Bt1).(w_{t},B_{t})=(w_{t-1},B_{t-1}).
       Take action at=argminaϕ(st,a)wta_{t}=\operatorname*{argmin}_{a}\phi(s_{t},a)^{\top}w_{t}, suffer cost ct=c(st,at)c_{t}=c(s_{t},a_{t}), and transit to sts^{\prime}_{t}. if st=gs^{\prime}_{t}=g then st+1=sinits_{t+1}=s_{\text{init}}, kk+1k\leftarrow k+1; else st+1=sts_{t+1}=s^{\prime}_{t}.
       Increment time step tt+1t\leftarrow t+1.

VA-GOPO follows a similar framework of the Eleanor algorithm of (Zanette et al., 2020b) (for the finite-horizon setting) and the FOPO algorithm of (Wei et al., 2021b) (for the infinite-horizon setting) — they all maintain an estimate wtw_{t} of the true weight vector ww^{\star} (recall Q(s,a)=ϕ(s,a)wQ^{\star}(s,a)=\phi(s,a)^{\top}w^{\star}), found by optimistically minimizing the value of the current state sts_{t} (roughly minaϕ(st,a)wt\min_{a}\phi(s_{t},a)^{\top}w_{t}) over a confidence set of wtw_{t}, and then simply act according to argminaϕ(st,a)wt\operatorname*{argmin}_{a}\phi(s_{t},a)^{\top}w_{t}. The main differences are the construction of the confidence set and the conditions under which wtw_{t} is updated, which we explain in detail below.

Confidence Set

For a parameter B>0B>0 and a weight vector wdw\in\mathbb{R}^{d}, inspired by (Zhang et al., 2021) we define a variance-aware confidence set for time step tt as

Ωt(w,B)=j𝒥BΩtj(w,B),\Omega_{t}(w,B)=\bigcap_{j\in{\mathcal{J}}_{B}}\Omega^{j}_{t}(w,B), (2)

where 𝒥B={log2ϵ,,log2(6dB)}{\mathcal{J}}_{B}=\{\lceil\log_{2}\epsilon\rceil,\ldots,\lceil\log_{2}(6\sqrt{d}B)\rceil\} with ϵ=cmin150d3K\epsilon=\frac{c_{\min}}{150d^{3}K}, and Ωtj(w,B)=𝔹(3dB)\Omega^{j}_{t}(w,B)=\mathbb{B}(3\sqrt{d}B)\cap

{w:ν𝒢ϵ/t(6dB),|i<tclipj(ϕiν)ϵVw,Bi(w)|\displaystyle\left\{w^{\prime}:\forall\nu\in\mathcal{G}_{\epsilon/t}(6\sqrt{d}B),\left|\sum_{i<t}\textsf{clip}_{j}(\phi_{i}^{\top}\nu)\epsilon_{V_{w,B}}^{i}(w^{\prime})\right|\right.
i<tclipj2(ϕiν)ηVw,Bi(w)ιB,t+B2jιB,t},\displaystyle\leq\left.\sqrt{\sum_{i<t}\textsf{clip}_{j}^{2}(\phi_{i}^{\top}\nu)\eta_{V_{w,B}}^{i}(w^{\prime})\iota_{B,t}}+B2^{j}\iota_{B,t}\right\}, (3)

with 𝔹(r)={xd:x2r}\mathbb{B}(r)=\{x\in\mathbb{R}^{d}:\left\|{x}\right\|_{2}\leq r\} being the dd-dimensional L2L_{2}-ball of radius rr, 𝒢ξ(r)={ξn,n}d𝔹(r)\mathcal{G}_{\xi}(r)=\{\xi n,n\in\mathbb{Z}\}^{d}\cap\mathbb{B}(r) being the ξ\xi-net of 𝔹(r)\mathbb{B}(r), clipj(x)=[x][2j,2j]\textsf{clip}_{j}(x)=[x]_{[-2^{j},2^{j}]} (recall [x][l,r]=min{max{x,l},r}[x]_{[l,r]}=\min\{\max\{x,l\},r\}), ϕi\phi_{i} being a shorthand of ϕ(si,ai)\phi(s_{i},a_{i}), Vw,B(s)=mina[ϕ(s,a)w][0,2B]V_{w,B}(s)=\min_{a}[\phi(s,a)^{\top}w]_{[0,2B]} (and Vw,B(g)=0V_{w,B}(g)=0), ϵVi(w)=ϕiwciV(si)\epsilon^{i}_{V}(w^{\prime})=\phi_{i}^{\top}w^{\prime}-c_{i}-V(s^{\prime}_{i}), ηVi(w)=ϵVi(w)2\eta^{i}_{V}(w^{\prime})=\epsilon^{i}_{V}(w^{\prime})^{2}, and finally ιB,t=211dln48dBtϵδ\iota_{B,t}=2^{11}d\ln\frac{48dBt}{\epsilon\delta} for some failure probability δ\delta. The key difference between our confidence set and that of (Zhang et al., 2021) is in the definition of ϵVi(w)\epsilon^{i}_{V}(w^{\prime}) and ηVi(w)\eta^{i}_{V}(w^{\prime}) due to the different structures between linear MDPs and linear mixture MDPs. In particular, we note that the value function VV (more formally Vw,BV_{w,B}) in our definitions is itself defined with respect to another weight vector ww.
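To unpack the definition above, here is a schematic check of the Eq. (3)-style inequality for a single scale j and direction ν; the function names are ours, and the enumeration over the ξ-net and the intersection over all j are omitted. This is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def clip_j(x, j):
    """clip_j(x) = [x]_{[-2^j, 2^j]}."""
    return np.clip(x, -2.0**j, 2.0**j)

def in_confidence_set_j(w_prime, j, nu, Phi, costs, V_next, B, iota):
    """Check the Eq. (3) inequality for one (j, nu) pair.

    eps_i(w') = phi_i^T w' - c_i - V(s'_i) is the Bellman residual of w'
    on past transition i, and eta_i(w') = eps_i(w')^2 its square.
    """
    eps = Phi @ w_prime - costs - V_next
    z = clip_j(Phi @ nu, j)
    lhs = abs(np.sum(z * eps))
    rhs = np.sqrt(np.sum(z**2 * eps**2) * iota) + B * 2.0**j * iota
    return bool(lhs <= rhs)
```

A weight vector with zero Bellman residuals trivially passes the check, while a grossly mis-specified one is rejected once enough data has accumulated.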

With this confidence set, when VA-GOPO decides to update wtw_{t}, it searches over all ww such that wΩt(w,Bt)w\in\Omega_{t}(w,B_{t}) and finds the one that minimizes the value at the current state Vw,Bt(st)V_{w,B_{t}}(s_{t}) (Line 3). Here, BtB_{t} is a running estimate of BB_{\star}. VA-GOPO maintains the inequality Vt(st)BtV_{t}(s_{t})\leq B_{t} during the update by doubling the value of BtB_{t} and repeating Line 3 whenever this is violated (Line 3). Note that the constraint wΩt(w,Bt)w\in\Omega_{t}(w,B_{t}) is in a sense self-referential — we consider ww within a confidence set defined in terms of ww itself, which is an important distinction compared to (Zhang et al., 2021) and is critical for linear MDPs.

To provide some intuition on our confidence set, denote Vwt,BtV_{w_{t},B_{t}} by VtV_{t} and Ωt(wt,Bt)\Omega_{t}(w_{t},B_{t}) by Ωt\Omega_{t}. Note that if we ignore the dependency between VtV_{t} and {ϕi}i<t\{\phi_{i}\}_{i<t} (an issue that will eventually be addressed by some covering arguments), then {ϵVti(w)}i<t\{\epsilon^{i}_{V_{t}}(w^{\prime})\}_{i<t} forms a martingale sequence when w=w~tθ+Vt(s)𝑑μ(s)w^{\prime}=\widetilde{w}_{t}\triangleq\theta^{\star}+\int V_{t}(s^{\prime})d\mu(s^{\prime}), and thus the inequality in Eq. (3) holds with high probability by some Bernstein-style concentration inequality (Lemma 36). Formally, this allows us to show the following.

Lemma 3.

With probability at least 1δ1-\delta, w~tΩt\widetilde{w}_{t}\in\Omega_{t}, t1\forall t\geq 1.

Since wtw_{t} is also in Ωt\Omega_{t}, the difference between ϕ(s,a)wt\phi(s,a)^{\top}w_{t} and c(s,a)+𝔼sPs,a[Vt(s)]=ϕ(s,a)w~tc(s,a)+\mathbb{E}_{s^{\prime}\sim P_{s,a}}[V_{t}(s^{\prime})]=\phi(s,a)^{\top}\widetilde{w}_{t} is controlled by the size of the confidence set Ωt\Omega_{t}, which shrinks over time, ensuring that wtw_{t} gets closer and closer to ww^{\star}. In addition, we also show that VtV_{t} is optimistic at state sts_{t} whenever an update is performed and that BtB_{t} never overestimates BB_{\star} significantly.

Lemma 4.

With probability at least 1δ1-\delta, we have Vt(st)V(st)V_{t}(s_{t})\leq V^{\star}(s_{t}) if an update (Line 3) is performed at time step tt, and Bt2BB_{t}\leq 2B_{\star} for all tt.

Update Conditions

VA-GOPO updates wtw_{t} whenever one of the three conditions in Line 3 is triggered. The first condition st1=gs^{\prime}_{t-1}=g simply indicates that the current time step is the start of a new episode. The second condition is

j𝒥Bt,ν𝒢ϵ/t(6dBt):Φtj(ν)>8d2Φtj(ν),\exists j\in{\mathcal{J}}_{B_{t}},\nu\in\mathcal{G}_{\epsilon/t}(6\sqrt{d}B_{t}):\Phi_{t}^{j}(\nu)>8d^{2}\Phi_{t^{\prime}}^{j}(\nu), (4)

where tt^{\prime} is the most recent update time step (Line 3) and Φtj(ν)=i<tfj(ϕiν)+2jν22\Phi^{j}_{t}(\nu)=\sum_{i<t}f_{j}(\phi_{i}^{\top}\nu)+2^{j}\left\|{\nu}\right\|_{2}^{2} with fj(x)=clipj(x)xf_{j}(x)=\textsf{clip}_{j}(x)x. This lazy update condition makes sure that the algorithm does not update wtw_{t} too often (see Lemma 27) while still enjoying a small enough estimation error. The last condition Vt(st)=2BtV_{t^{\prime}}(s_{t})=2B_{t} (which we call the overestimate condition) tests whether the current state has an overestimated value (note that 2Bt2B_{t} is the maximum value of VtV_{t^{\prime}} due to the truncation in its definition). This condition helps remove a factor of d1/4d^{1/4} in the regret bound without resorting to the more complicated ideas used in previous works; see Appendix C.5 for more explanation.
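The lazy update test above can be sketched directly from the definitions of Φ and f_j; the code below (names ours) evaluates Φ_t^j(ν) and the trigger in Eq. (4) for one (j, ν) pair, omitting the enumeration over the ξ-net. It is an illustrative sketch under these assumptions.

```python
import numpy as np

def Phi_j(features, nu, j):
    """Phi_t^j(nu) = sum_{i<t} f_j(phi_i^T nu) + 2^j ||nu||_2^2,
    with f_j(x) = clip_j(x) * x.

    features: (t-1, d) array of past feature vectors phi_i.
    """
    x = features @ nu
    f = np.clip(x, -2.0**j, 2.0**j) * x
    return float(f.sum() + 2.0**j * (nu @ nu))

def should_update(features_t, features_tprime, nu, j, d):
    """Trigger condition of Eq. (4): Phi_t^j(nu) > 8 d^2 Phi_{t'}^j(nu),
    comparing the data collected by now against the last update time t'."""
    return Phi_j(features_t, nu, j) > 8 * d**2 * Phi_j(features_tprime, nu, j)
```

Intuitively, an update fires only once the accumulated "information" in some direction ν has grown by a multiplicative factor since the last update, which keeps the number of updates logarithmic in the data size.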

Regret Guarantee

We prove the following regret guarantee for VA-GOPO, and provide a proof sketch in Appendix C.1 followed by the full proof in the rest of Appendix C.

Theorem 6.

With probability at least 16δ1-6\delta, Algorithm 3 ensures RK=𝒪~(d3.5BK+d7B2)R_{K}=\tilde{\mathcal{O}}(d^{3.5}B_{\star}\sqrt{K}+d^{7}B_{\star}^{2}).

Ignoring the lower order term, our bound is (potentially) suboptimal only in terms of the dd-dependency compared to the lower bound Ω(dBK)\Omega(dB_{\star}\sqrt{K}) from (Min et al., 2021). We note again that this is the first horizon-free regret bound for linear SSP: it does not have any polynomial dependency on TT_{\star} or 1cmin\frac{1}{c_{\min}} even in the lower order terms. Furthermore, VA-GOPO also does not require the knowledge of BB_{\star} or TT_{\star}. For simplicity, we have assumed cmin>0c_{\min}>0. However, even when cmin=0c_{\min}=0, we can obtain essentially the same bound by running the same algorithm on a modified cost function; see Appendix A for details.

5 Conclusion

In this work, we make significant progress towards a better understanding of linear function approximation in the challenging SSP model. Two algorithms are proposed: the first one is efficient and achieves a regret bound strictly better than that of (Vial et al., 2021), while the second one is inefficient but achieves a horizon-free regret bound. In developing these results, we also propose several new techniques that might be of independent interest, especially the new analysis for the finite-horizon approximation of (Cohen et al., 2021).

A natural future direction is to close the gap between existing upper bounds and lower bounds in this problem, especially with an efficient algorithm. Another interesting direction is to study SSP with adversarially changing costs under linear function approximation.

References

  • Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24:2312–2320, 2011.
  • Ayoub et al. (2020) Ayoub, A., Jia, Z., Szepesvari, C., Wang, M., and Yang, L. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pp. 463–474. PMLR, 2020.
  • Azar et al. (2017) Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272. PMLR, 2017.
  • Bertsekas & Yu (2013) Bertsekas, D. P. and Yu, H. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013.
  • Chen & Luo (2021) Chen, L. and Luo, H. Finding the stochastic shortest path with low regret: The adversarial cost and unknown transition case. In International Conference on Machine Learning, 2021.
  • Chen et al. (2021a) Chen, L., Jafarnia-Jahromi, M., Jain, R., and Luo, H. Implicit finite-horizon approximation and efficient optimal algorithms for stochastic shortest path. Advances in Neural Information Processing Systems, 2021a.
  • Chen et al. (2021b) Chen, L., Luo, H., and Wei, C.-Y. Minimax regret for stochastic shortest path with adversarial costs and known transition. In Conference on Learning Theory, pp.  1180–1215. PMLR, 2021b.
  • Cohen et al. (2020) Cohen, A., Kaplan, H., Mansour, Y., and Rosenberg, A. Near-optimal regret bounds for stochastic shortest path. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp.  8210–8219. PMLR, 2020.
  • Cohen et al. (2021) Cohen, A., Efroni, Y., Mansour, Y., and Rosenberg, A. Minimax regret for stochastic shortest path. Advances in Neural Information Processing Systems, 2021.
  • He et al. (2021) He, J., Zhou, D., and Gu, Q. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning, pp. 4171–4180. PMLR, 2021.
  • Ishfaq et al. (2021) Ishfaq, H., Cui, Q., Nguyen, V., Ayoub, A., Yang, Z., Wang, Z., Precup, D., and Yang, L. F. Randomized exploration for reinforcement learning with general value function approximation. International Conference on Machine Learning, 2021.
  • Jafarnia-Jahromi et al. (2021) Jafarnia-Jahromi, M., Chen, L., Jain, R., and Luo, H. Online learning for stochastic shortest path model via posterior sampling. arXiv preprint arXiv:2106.05335, 2021.
  • Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010.
  • Jin et al. (2018) Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863–4873, 2018.
  • Jin et al. (2020a) Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. Learning adversarial Markov decision processes with bandit feedback and unknown transition. In Proceedings of the 37th International Conference on Machine Learning, pp.  4860–4869, 2020a.
  • Jin et al. (2020b) Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp.  2137–2143. PMLR, 2020b.
  • Jin et al. (2021) Jin, T., Huang, L., and Luo, H. The best of both worlds: stochastic and adversarial episodic mdps with unknown transition. Advances in Neural Information Processing Systems, 2021.
  • Kim et al. (2021) Kim, Y., Yang, I., and Jun, K.-S. Improved regret analysis for variance-adaptive linear bandits and horizon-free linear mixture mdps. arXiv preprint arXiv:2111.03289, 2021.
  • Kong et al. (2021) Kong, D., Salakhutdinov, R., Wang, R., and Yang, L. F. Online sub-sampling for reinforcement learning with general function approximation. arXiv preprint arXiv:2106.07203, 2021.
  • Lattimore & Hutter (2012) Lattimore, T. and Hutter, M. Pac bounds for discounted mdps. In International Conference on Algorithmic Learning Theory, pp.  320–334. Springer, 2012.
  • Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
  • Min et al. (2021) Min, Y., He, J., Wang, T., and Gu, Q. Learning stochastic shortest path with linear function approximation. arXiv preprint arXiv:2110.12727, 2021.
  • Rosenberg & Mansour (2020) Rosenberg, A. and Mansour, Y. Stochastic shortest path with adversarially changing costs. arXiv preprint arXiv:2006.11561, 2020.
  • Shani et al. (2020) Shani, L., Efroni, Y., Rosenberg, A., and Mannor, S. Optimistic policy optimization with bandit feedback. In Proceedings of the 37th International Conference on Machine Learning, pp.  8604–8613, 2020.
  • Simchowitz & Jamieson (2019) Simchowitz, M. and Jamieson, K. G. Non-asymptotic gap-dependent regret bounds for tabular mdps. Advances in Neural Information Processing Systems, 32:1153–1162, 2019.
  • Tarbouriech et al. (2020) Tarbouriech, J., Garcelon, E., Valko, M., Pirotta, M., and Lazaric, A. No-regret exploration in goal-oriented reinforcement learning. In International Conference on Machine Learning, pp. 9428–9437. PMLR, 2020.
  • Tarbouriech et al. (2021) Tarbouriech, J., Zhou, R., Du, S. S., Pirotta, M., Valko, M., and Lazaric, A. Stochastic shortest path: Minimax, parameter-free and towards horizon-free regret. Advances in Neural Information Processing Systems, 2021.
  • Vial et al. (2021) Vial, D., Parulekar, A., Shakkottai, S., and Srikant, R. Regret bounds for stochastic shortest path problems with linear function approximation. arXiv preprint arXiv:2105.01593, 2021.
  • Wang et al. (2020) Wang, R., Salakhutdinov, R., and Yang, L. F. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 2020.
  • Wei et al. (2021a) Wei, C.-Y., Dann, C., and Zimmert, J. A model selection approach for corruption robust reinforcement learning. arXiv preprint arXiv:2110.03580, 2021a.
  • Wei et al. (2021b) Wei, C.-Y., Jahromi, M. J., Luo, H., and Jain, R. Learning infinite-horizon average-reward mdps with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pp.  3007–3015. PMLR, 2021b.
  • Yang & Wang (2020) Yang, L. and Wang, M. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pp. 10746–10756. PMLR, 2020.
  • Zanette & Brunskill (2019) Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In Proceedings of the 36th International Conference on Machine Learning, pp.  7304–7312, 2019.
  • Zanette et al. (2020a) Zanette, A., Brandfonbrener, D., Brunskill, E., Pirotta, M., and Lazaric, A. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pp.  1954–1964. PMLR, 2020a.
  • Zanette et al. (2020b) Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pp. 10978–10989. PMLR, 2020b.
  • Zhang et al. (2020a) Zhang, Z., Ji, X., and Du, S. S. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. In Conference On Learning Theory, 2020a.
  • Zhang et al. (2020b) Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. In Advances in Neural Information Processing Systems, volume 33, pp.  15198–15207. Curran Associates, Inc., 2020b.
  • Zhang et al. (2021) Zhang, Z., Yang, J., Ji, X., and Du, S. S. Variance-aware confidence set: Variance-dependent bound for linear bandits and horizon-free bound for linear mixture mdp. Advances in Neural Information Processing Systems, 2021.
  • Zhou et al. (2021a) Zhou, D., Gu, Q., and Szepesvari, C. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pp.  4532–4576. PMLR, 2021a.
  • Zhou et al. (2021b) Zhou, D., He, J., and Gu, Q. Provably efficient reinforcement learning for discounted mdps with feature mapping. In International Conference on Machine Learning, pp. 12793–12802. PMLR, 2021b.

Appendix A Preliminary

Extra Notations in Appendix

For a function X:𝒮+X:{\mathcal{S}}_{+}\rightarrow\mathbb{R} and a distribution PΔ𝒮+P\in\Delta_{{\mathcal{S}}_{+}}, we define PX=𝔼SP[X(S)]PX=\mathbb{E}_{S\sim P}[X(S)] and 𝕍(P,X)=VarSP[X(S)]\mathbb{V}(P,X)=\textsc{Var}_{S\sim P}[X(S)].

Cost Perturbation for cmin=0c_{\min}=0

We follow the recipe in (Vial et al., 2021, Appendix A.3) to deal with zero costs: the main idea is to run the SSP algorithm with perturbed cost cϵ(s,a)=c(s,a)+ϵc_{\epsilon}(s,a)=c(s,a)+\epsilon for some ϵ>0\epsilon>0, which is equivalent to solving a different SSP instance ϵ=(𝒮,𝒜,sinit,g,cϵ,P){\mathcal{M}}_{\epsilon}=({\mathcal{S}},{\mathcal{A}},s_{\text{init}},g,c_{\epsilon},P). Let θϵ=θ+ϵsμ(s)\theta^{\star}_{\epsilon}=\theta^{\star}+\epsilon\sum_{s^{\prime}}\mu(s^{\prime}). Then, cϵ(s,a)=ϕ(s,a)θϵc_{\epsilon}(s,a)=\phi(s,a)^{\top}\theta^{\star}_{\epsilon}. Therefore, ϵ{\mathcal{M}}_{\epsilon} is also a linear SSP with cmin=ϵc_{\min}=\epsilon (up to a small constant, since cϵ(s,a)c_{\epsilon}(s,a) can be as large as 1+ϵ1+\epsilon). Denote by VϵV^{\star}_{\epsilon} the optimal value function in ϵ{\mathcal{M}}_{\epsilon}, and define RK=tcϵ(st,at)KVϵ(sinit)R^{\prime}_{K}=\sum_{t}c_{\epsilon}(s_{t},a_{t})-KV^{\star}_{\epsilon}(s_{\text{init}}) as the regret in ϵ{\mathcal{M}}_{\epsilon}. We have Vϵ(s)Vπ(s)+ϵTB+ϵTV^{\star}_{\epsilon}(s)\leq V^{\pi^{\star}}(s)+\epsilon T_{\star}\leq B_{\star}+\epsilon T_{\star}, and

RK=t=1Tc(st,at)KV(sinit)t=1Tcϵ(st,at)KVϵ(sinit)+K(Vϵ(sinit)V(sinit))RK+ϵTK.\displaystyle R_{K}=\sum_{t=1}^{T}c(s_{t},a_{t})-KV^{\star}(s_{\text{init}})\leq\sum_{t=1}^{T}c_{\epsilon}(s_{t},a_{t})-KV^{\star}_{\epsilon}(s_{\text{init}})+K(V^{\star}_{\epsilon}(s_{\text{init}})-V^{\star}(s_{\text{init}}))\leq R^{\prime}_{K}+\epsilon T_{\star}K.

Therefore, by running an SSP algorithm on the perturbed cost cϵc_{\epsilon}, we recover its regret guarantee with cminϵc_{\min}\leftarrow\epsilon, BB+ϵTB_{\star}\leftarrow B_{\star}+\epsilon T_{\star}, and an additional bias of ϵTK\epsilon T_{\star}K in the regret.
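The reduction above is a one-line change to the cost function plus a bookkeeping of how the guarantees transfer; the sketch below (function names ours) makes both explicit.

```python
def perturbed_cost(c, eps):
    """c_eps(s, a) = c(s, a) + eps: turns a c_min = 0 instance into one
    with minimum cost eps (costs may now be as large as 1 + eps)."""
    return lambda s, a: c(s, a) + eps

def perturbed_instance_params(B_star, T_star, K, eps):
    """Parameters under which the original guarantee is recovered on M_eps:
    c_min <- eps, B_star <- B_star + eps * T_star (since V*_eps <= B* + eps T*),
    plus an additional regret bias of eps * T_star * K."""
    return {
        "c_min": eps,
        "B_star": B_star + eps * T_star,
        "extra_regret": eps * T_star * K,
    }
```

Choosing ϵ small enough (e.g. polynomial in 1/K) makes the extra bias term lower order in the final regret bound.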

Appendix B Omitted Details for Section 3

Notations

For ~\widetilde{{\mathcal{M}}}, denote by Vhπ(s)V^{\pi}_{h}(s) the expected cost of executing policy π\pi starting from state ss in layer hh, and by πm\pi^{m} the policy executed in interval mm (for example, πm(s,h)=argminaQhm(s,a)\pi^{m}(s,h)=\operatorname*{argmin}_{a}Q^{m}_{h}(s,a) in Algorithm 2). For notational convenience, define Phm=P~shm,ahmP^{m}_{h}=\widetilde{P}_{s^{m}_{h},a^{m}_{h}}, and wh=θ+Vh+1(s)𝑑μ(s)w^{\star}_{h}=\theta^{\star}+\int V^{\star}_{h+1}(s^{\prime})d\mu(s^{\prime}) for h[H]h\in[H] such that Qh(s,a)=ϕ(s,a)whQ^{\star}_{h}(s,a)=\phi(s,a)^{\top}w^{\star}_{h}. Define indicator 𝕀s(s)=𝕀{s=s}\mathbb{I}_{s}(s^{\prime})=\mathbb{I}\{s=s^{\prime}\}, and auxiliary feature ϕ(g,a)=𝟎d\phi(g,a)=\mathbf{0}\in\mathbb{R}^{d} for all a𝒜a\in{\mathcal{A}}, such that c~(s,a)=ϕ(s,a)θ\widetilde{c}(s,a)=\phi(s,a)^{\top}\theta^{\star} and P~s,aV=ϕ(s,a)V(s)𝑑μ(s)\widetilde{P}_{s,a}V=\phi(s,a)^{\top}\int V(s^{\prime})d\mu(s^{\prime}) for any s𝒮+,a𝒜s\in{\mathcal{S}}_{+},a\in{\mathcal{A}} and V:𝒮+V:{\mathcal{S}}_{+}\rightarrow\mathbb{R} with V(g)=0V(g)=0. Finally, for Algorithm 2, define stopping time M¯=infm{mM,h[H]:Q^hm(shm,ahm)>Qhm(shm,ahm)}\overline{M}=\inf_{m}\{m\leq M,\exists h\in[H]:\widehat{Q}^{m}_{h}(s^{m}_{h},a^{m}_{h})>Q^{m}_{h}(s^{m}_{h},a^{m}_{h})\}, which is the number of intervals until finishing KK episodes or until the upper-bound truncation on the QQ estimate is triggered.

B.1 Formal Definition of QhQ^{\star}_{h} and VhV^{\star}_{h}

It is not hard to see that we can define QhQ^{\star}_{h} and VhV^{\star}_{h} recursively without resorting to the definition of ~\widetilde{{\mathcal{M}}}:

Qh(s,a)=c~(s,a)+P~s,aVh+1,Vh(s)=minaQh(s,a),\displaystyle Q^{\star}_{h}(s,a)=\widetilde{c}(s,a)+\widetilde{P}_{s,a}V^{\star}_{h+1},\qquad V^{\star}_{h}(s)=\min_{a}Q^{\star}_{h}(s,a),

with QH+1(s,a)=cf(s)Q^{\star}_{H+1}(s,a)=c_{f}(s) for all (s,a)(s,a).
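The recursion above is a standard backward induction; for illustration, here it is in tabular form (names ours), with the absorbing goal state folded into the transition matrix and the terminal cost c_f as the base case. This is a toy sketch, not the paper's linear-MDP implementation.

```python
import numpy as np

def finite_horizon_values(c, P, c_f, H):
    """Backward induction for the layered MDP:
    Q*_h(s, a) = c(s, a) + sum_{s'} P[s, a, s'] V*_{h+1}(s'),
    V*_h(s)    = min_a Q*_h(s, a),  with V*_{H+1} = c_f.

    c: (S, A) costs, P: (S, A, S) transitions, c_f: (S,) terminal costs.
    Returns the list [Q*_1, ..., Q*_H] and V*_1.
    """
    V = c_f.copy()
    Qs = []
    for _ in range(H):
        Q = c + P @ V          # (S, A): expectation of V over next states
        V = Q.min(axis=1)
        Qs.append(Q)
    Qs.reverse()               # Qs[h-1] = Q*_h
    return Qs, V
```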

B.2 Proof of Lemma 1

Proof.

Denote by k{\mathcal{I}}_{k} the set of intervals in episode kk, and by mkm_{k} the first interval in episode kk. We bound the regret in episode kk as follows: by Lemma 17 and H4Tln(4K)H\geq 4T_{\star}\ln(4K), the probability that following π\pi^{\star} takes more than HH steps to reach gg in ~\widetilde{{\mathcal{M}}} is at most 12K\frac{1}{2K}. Therefore,

V1π(s)Vπ(s)+2BP(sH+1g|π,P,s1=s)Vπ(s)+BK.V^{\pi^{\star}}_{1}(s)\leq V^{\pi^{\star}}(s)+2B_{\star}P(s_{H+1}\neq g|\pi^{\star},P,s_{1}=s)\leq V^{\pi^{\star}}(s)+\frac{B_{\star}}{K}.

Thus,

mkh=1HchmVπ(s1mk)\displaystyle\sum_{m\in{\mathcal{I}}_{k}}\sum_{h=1}^{H}c^{m}_{h}-V^{\pi^{\star}}(s^{m_{k}}_{1}) mkh=1HchmV1π(s1mk)+BK\displaystyle\leq\sum_{m\in{\mathcal{I}}_{k}}\sum_{h=1}^{H}c^{m}_{h}-V_{1}^{\pi^{\star}}(s^{m_{k}}_{1})+\frac{B_{\star}}{K}
=mk(h=1HchmV1π(s1m))+mkV1π(s1m)V1π(s1mk)+BK\displaystyle=\sum_{m\in{\mathcal{I}}_{k}}\left(\sum_{h=1}^{H}c^{m}_{h}-V_{1}^{\pi^{\star}}(s^{m}_{1})\right)+\sum_{m\in{\mathcal{I}}_{k}}V_{1}^{\pi^{\star}}(s^{m}_{1})-V_{1}^{\pi^{\star}}(s^{m_{k}}_{1})+\frac{B_{\star}}{K}
mk(h=1Hchm+cf(sH+1m)V1(s1m))+BK.\displaystyle\leq\sum_{m\in{\mathcal{I}}_{k}}\left(\sum_{h=1}^{H}c^{m}_{h}+c_{f}(s^{m}_{H+1})-V^{\star}_{1}(s^{m}_{1})\right)+\frac{B_{\star}}{K}. (V1(s)V1π(s)V^{\star}_{1}(s)\leq V^{\pi^{\star}}_{1}(s) and mkV1π(s1m)V1π(s1mk)2B(|k|1)=mkcf(sH+1m)\sum_{m\in{\mathcal{I}}_{k}}V_{1}^{\pi^{\star}}(s^{m}_{1})-V_{1}^{\pi^{\star}}(s^{m_{k}}_{1})\leq 2B_{\star}(|{\mathcal{I}}_{k}|-1)=\sum_{m\in{\mathcal{I}}_{k}}c_{f}(s^{m}_{H+1}))

Summing the terms above over k[K]k\in[K] and applying the definitions of RKR_{K} and R~M\widetilde{R}_{M}, we obtain the desired result. ∎

B.3 Proof of Lemma 2

We first bound the error of one-step value iteration with respect to Q^hm\widehat{Q}^{m}_{h} and Vh+1mV^{m}_{h+1}, which is essential to our analysis.

Lemma 5.

For any Bmax{1,maxscf(s)}B\geq\max\{1,\max_{s}c_{f}(s)\}, with probability at least 1δ1-\delta, we have 0c~(s,a)+P~s,aVh+1mQ^hm(s,a)2βmϕ(s,a)Λm10\leq\widetilde{c}(s,a)+\widetilde{P}_{s,a}V^{m}_{h+1}-\widehat{Q}^{m}_{h}(s,a)\leq 2\beta_{m}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}} and Vhm(s)Vh(s)V^{m}_{h}(s)\leq V^{\star}_{h}(s) for any m+m\in\mathbb{N}_{+}, h[H]h\in[H].

Proof.

Define w~hm=θ+Vh+1m(s)𝑑μ(s)\widetilde{w}^{m}_{h}=\theta^{\star}+\int V^{m}_{h+1}(s^{\prime})d\mu(s^{\prime}), so that ϕ(s,a)w~hm=c~(s,a)+P~s,aVh+1m\phi(s,a)^{\top}\widetilde{w}^{m}_{h}=\widetilde{c}(s,a)+\widetilde{P}_{s,a}V^{m}_{h+1}. Then,

w~hmwhm=Λm1(Λmw~hmm=1m1h=1Hϕhm(chm+Vh+1m(sh+1m)))\displaystyle\widetilde{w}^{m}_{h}-w^{m}_{h}=\Lambda_{m}^{-1}\left(\Lambda_{m}\widetilde{w}^{m}_{h}-\sum_{m^{\prime}=1}^{m-1}\sum_{h^{\prime}=1}^{H}\phi^{m^{\prime}}_{h^{\prime}}(c^{m^{\prime}}_{h^{\prime}}+V^{m}_{h+1}(s^{m^{\prime}}_{h^{\prime}+1}))\right)
=λΛm1w~hm+Λm1m=1m1h=1Hϕhm(PhmVh+1mVh+1m(sh+1m))ϵhm.\displaystyle=\lambda\Lambda_{m}^{-1}\widetilde{w}^{m}_{h}+\underbrace{\Lambda_{m}^{-1}\sum_{m^{\prime}=1}^{m-1}\sum_{h^{\prime}=1}^{H}\phi^{m^{\prime}}_{h^{\prime}}(P^{m^{\prime}}_{h^{\prime}}V^{m}_{h+1}-V^{m}_{h+1}(s^{m^{\prime}}_{h^{\prime}+1}))}_{\epsilon^{m}_{h}}.

By Vh+1m(s)BV^{m}_{h+1}(s)\leq B and Lemma 31, we have with probability at least 1δ1-\delta, for any mm, h[H]h\in[H]:

ϵhmΛm2Bd2ln(mH+λλ)+ln𝒩εδ+8mHελ,\left\|{\epsilon^{m}_{h}}\right\|_{\Lambda_{m}}\leq 2B\sqrt{\frac{d}{2}\ln\left(\frac{mH+\lambda}{\lambda}\right)+\ln\frac{{\mathcal{N}}_{\varepsilon}}{\delta}}+\frac{\sqrt{8}mH\varepsilon}{\sqrt{\lambda}}, (5)

where 𝒩ε{\mathcal{N}}_{\varepsilon} is the ε\varepsilon-covering number of the function class of Vh+1mV^{m}_{h+1} with ε=1mH\varepsilon=\frac{1}{mH}. Note that Vh+1m(s)V^{m}_{h+1}(s) is either cf(s)c_{f}(s) or

Vh+1m(s)=[minaϕ(s,a)wβmϕ(s,a)Γϕ(s,a)][0,B],\displaystyle V^{m}_{h+1}(s)=\left[\min_{a}\phi(s,a)^{\top}w-\beta_{m}\sqrt{\phi(s,a)^{\top}\Gamma\phi(s,a)}\right]_{[0,B]},

for some PSD matrix Γ\Gamma such that 1λ+mHλmin(Γ)λmax(Γ)1λ\frac{1}{\lambda+mH}\leq\lambda_{\min}(\Gamma)\leq\lambda_{\max}(\Gamma)\leq\frac{1}{\lambda} by the definition of Λm1\Lambda_{m}^{-1}, and for some wdw\in\mathbb{R}^{d} such that w2λmax(Γ)×mH×sups,aϕ(s,a)2×(B+1)mHλ(B+1)\left\|{w}\right\|_{2}\leq\lambda_{\max}(\Gamma)\times mH\times\sup_{s,a}\left\|{\phi(s,a)}\right\|_{2}\times(B+1)\leq\frac{mH}{\lambda}(B+1) by the definition of whmw^{m}_{h}. We denote by 𝒱{\mathcal{V}} the function class of Vh+1mV^{m}_{h+1}. Now we apply Lemma 32 to 𝒱{\mathcal{V}} with α=(w,Γ)\alpha=(w,\Gamma), n=d2+dn=d^{2}+d, D=mHd(B+1)/λmax{mHλ(B+1),d/λ2}D=mH\sqrt{d}(B+1)/\lambda\geq\max\{\frac{mH}{\lambda}(B+1),\sqrt{d/\lambda^{2}}\} (note that |Γi,j|ΓF=i=1dλi2(Γ)d/λ2|\Gamma_{i,j}|\leq\left\|{\Gamma}\right\|_{F}=\sqrt{\sum_{i=1}^{d}\lambda_{i}^{2}(\Gamma)}\leq\sqrt{d/\lambda^{2}}), and L=βmλ+mHL=\beta_{m}\sqrt{\lambda+mH}, which is given by |[x][0,B][y][0,B]||xy|\left|[x]_{[0,B]}-[y]_{[0,B]}\right|\leq|x-y| (Vial et al., 2021, Claim 2) and the following calculation: for any Δw=ϵei\Delta w=\epsilon e_{i} for some ϵ0\epsilon\neq 0,

1|ϵ||(w+Δw)ϕ(s,a)wϕ(s,a)|=|eiϕ(s,a)|ϕ(s,a)1,\displaystyle\frac{1}{|\epsilon|}\left|(w+\Delta w)^{\top}\phi(s,a)-w^{\top}\phi(s,a)\right|=\left|e_{i}^{\top}\phi(s,a)\right|\leq\left\|{\phi(s,a)}\right\|\leq 1,

and for any ΔΓ=ϵeiej\Delta\Gamma=\epsilon e_{i}e_{j}^{\top},

1|ϵ||βmϕ(s,a)(Γ+ΔΓ)ϕ(s,a)βmϕ(s,a)Γϕ(s,a)|\displaystyle\frac{1}{|\epsilon|}\left|\beta_{m}\sqrt{\phi(s,a)^{\top}(\Gamma+\Delta\Gamma)\phi(s,a)}-\beta_{m}\sqrt{\phi(s,a)^{\top}\Gamma\phi(s,a)}\right|
βm|ϕ(s,a)eiejϕ(s,a)|ϕ(s,a)Γϕ(s,a)\displaystyle\leq\beta_{m}\frac{\left|\phi(s,a)^{\top}e_{i}e_{j}^{\top}\phi(s,a)\right|}{\sqrt{\phi(s,a)^{\top}\Gamma\phi(s,a)}} (u+vu|v|u\sqrt{u+v}-\sqrt{u}\leq\frac{|v|}{\sqrt{u}})
βm|ϕ(s,a)(12eiei+12ejej)ϕ(s,a)|ϕ(s,a)Γϕ(s,a)\displaystyle\leq\beta_{m}\frac{\left|\phi(s,a)^{\top}(\frac{1}{2}e_{i}e_{i}^{\top}+\frac{1}{2}e_{j}e_{j}^{\top})\phi(s,a)\right|}{\sqrt{\phi(s,a)^{\top}\Gamma\phi(s,a)}} (|ab|12(a2+b2)|ab|\leq\frac{1}{2}(a^{2}+b^{2}))
βmϕ(s,a)ϕ(s,a)ϕ(s,a)Γϕ(s,a)βmλmin(Γ)βmλ+mH.\displaystyle\leq\beta_{m}\frac{\phi(s,a)^{\top}\phi(s,a)}{\sqrt{\phi(s,a)^{\top}\Gamma\phi(s,a)}}\leq\frac{\beta_{m}}{\sqrt{\lambda_{\min}(\Gamma)}}\leq\beta_{m}\sqrt{\lambda+mH}.

Lemma 32 then implies ln𝒩ε(d2+d)ln32d2.5Bm2H2βmλε\ln{\mathcal{N}}_{\varepsilon}\leq(d^{2}+d)\ln\frac{32d^{2.5}Bm^{2}H^{2}\beta_{m}}{\lambda\varepsilon}. Plugging this back, we get

ϵhmΛmβm2.\left\|{\epsilon^{m}_{h}}\right\|_{\Lambda_{m}}\leq\frac{\beta_{m}}{2}. (6)

Moreover, w~hmΛm1w~hm2/λd/λ(1+B)\left\|{\widetilde{w}^{m}_{h}}\right\|_{\Lambda_{m}^{-1}}\leq\left\|{\widetilde{w}^{m}_{h}}\right\|_{2}/\sqrt{\lambda}\leq\sqrt{d/\lambda}(1+B). Thus,

whmw~hmΛm\displaystyle\left\|{w^{m}_{h}-\widetilde{w}^{m}_{h}}\right\|_{\Lambda_{m}} λw~hmΛm1+ϵhmΛmβm.\displaystyle\leq\lambda\left\|{\widetilde{w}^{m}_{h}}\right\|_{\Lambda_{m}^{-1}}+\left\|{\epsilon^{m}_{h}}\right\|_{\Lambda_{m}}\leq\beta_{m}.

Therefore, c~(s,a)+P~s,aVh+1mQ^hm(s,a)=ϕ(s,a)(w~hmwhm)+βmϕ(s,a)Λm1[0,2βmϕ(s,a)Λm1]\widetilde{c}(s,a)+\widetilde{P}_{s,a}V^{m}_{h+1}-\widehat{Q}^{m}_{h}(s,a)=\phi(s,a)^{\top}(\widetilde{w}^{m}_{h}-w^{m}_{h})+\beta_{m}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}\in[0,2\beta_{m}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}] by ϕ(s,a)(w~hmwhm)[ϕ(s,a)Λm1whmw~hmΛm,ϕ(s,a)Λm1whmw~hmΛm]\phi(s,a)^{\top}(\widetilde{w}^{m}_{h}-w^{m}_{h})\in[-\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}\left\|{w^{m}_{h}-\widetilde{w}^{m}_{h}}\right\|_{\Lambda_{m}},\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}\left\|{w^{m}_{h}-\widetilde{w}^{m}_{h}}\right\|_{\Lambda_{m}}], and the first statement is proved. For any m+m\in\mathbb{N}_{+}, we prove the second statement by induction on h=H+1,,1h=H+1,\ldots,1. The base case h=H+1h=H+1 is clearly true by Vh+1m(s)=Vh+1(s)=cf(s)V^{m}_{h+1}(s)=V^{\star}_{h+1}(s)=c_{f}(s). For hHh\leq H, we have by the induction step:

Q^hm(s,a)c~(s,a)+P~s,aVh+1mc~(s,a)+P~s,aVh+1Qh(s,a).\displaystyle\widehat{Q}^{m}_{h}(s,a)\leq\widetilde{c}(s,a)+\widetilde{P}_{s,a}V^{m}_{h+1}\leq\widetilde{c}(s,a)+\widetilde{P}_{s,a}V^{\star}_{h+1}\leq Q^{\star}_{h}(s,a).

Thus, Vhm(s)minamax{0,Q^hm(s,a)}minaQh(s,a)=Vh(s)V^{m}_{h}(s)\leq\min_{a}\max\{0,\widehat{Q}^{m}_{h}(s,a)\}\leq\min_{a}Q^{\star}_{h}(s,a)=V^{\star}_{h}(s). ∎

Next, we prove a general regret bound, from which Lemma 2 is a direct corollary.

Lemma 6.

Assume cf(s)Hc_{f}(s)\leq H. Then with probability at least 12δ1-2\delta, Algorithm 2 ensures for any MM¯M^{\prime}\leq\overline{M}

R~M=𝒪~(d3B2HM+d2BH).\widetilde{R}_{M^{\prime}}=\tilde{\mathcal{O}}\left(\sqrt{d^{3}B^{2}HM^{\prime}}+d^{2}BH\right).
Proof.

Define cH+1m=cf(sH+1m)c^{m}_{H+1}=c_{f}(s^{m}_{H+1}). Note that for m<M¯m<\overline{M}, we have Vhm(shm)=max{0,Q^hm(shm,ahm)}V^{m}_{h}(s^{m}_{h})=\max\{0,\widehat{Q}^{m}_{h}(s^{m}_{h},a^{m}_{h})\}, and with probability at least 1δ1-\delta,

h=1H+1chmV1(s1m)\displaystyle\sum_{h=1}^{H+1}c^{m}_{h}-V^{\star}_{1}(s^{m}_{1}) h=1H+1chmV1m(s1m)h=1H+1chmQ^1m(s1m,a1m)h=2H+1chmP1mV2m+2βmϕ(s1m,a1m)Λm1\displaystyle\leq\sum_{h=1}^{H+1}c^{m}_{h}-V^{m}_{1}(s^{m}_{1})\leq\sum_{h=1}^{H+1}c^{m}_{h}-\widehat{Q}^{m}_{1}(s^{m}_{1},a^{m}_{1})\leq\sum_{h=2}^{H+1}c^{m}_{h}-P^{m}_{1}V^{m}_{2}+2\beta_{m}\left\|{\phi(s^{m}_{1},a^{m}_{1})}\right\|_{\Lambda_{m}^{-1}} (Lemma 5)
=h=2H+1chmV2m(s2m)+(𝕀s2mP1m)V2m+2βmϕ(s1m,a1m)Λm1\displaystyle=\sum_{h=2}^{H+1}c^{m}_{h}-V^{m}_{2}(s^{m}_{2})+(\mathbb{I}_{s^{m}_{2}}-P^{m}_{1})V^{m}_{2}+2\beta_{m}\left\|{\phi(s^{m}_{1},a^{m}_{1})}\right\|_{\Lambda_{m}^{-1}}
h=1H((𝕀sh+1mPhm)Vh+1m+2βmϕ(shm,ahm)Λm1).\leq\cdots\leq\sum_{h=1}^{H}\left((\mathbb{I}_{s^{m}_{h+1}}-P^{m}_{h})V^{m}_{h+1}+2\beta_{m}\left\|{\phi(s^{m}_{h},a^{m}_{h})}\right\|_{\Lambda_{m}^{-1}}\right). (cH+1m=VH+1m(sH+1m)c^{m}_{H+1}=V^{m}_{H+1}(s^{m}_{H+1}))

Therefore, by Lemma 18 and Lemma 38, with probability at least 1δ1-\delta:

R~M\displaystyle\widetilde{R}_{M^{\prime}} R~M1+Hm=1M1h=1H((𝕀sh+1mPhm)Vh+1m+2βmϕ(shm,ahm)Λm1)+H\leq\widetilde{R}_{M^{\prime}-1}+H\leq\sum_{m=1}^{M^{\prime}-1}\sum_{h=1}^{H}\left((\mathbb{I}_{s^{m}_{h+1}}-P^{m}_{h})V^{m}_{h+1}+2\beta_{m}\left\|{\phi(s^{m}_{h},a^{m}_{h})}\right\|_{\Lambda_{m}^{-1}}\right)+H
=𝒪~(d3B2HM+d2BH).\displaystyle=\tilde{\mathcal{O}}\left(\sqrt{d^{3}B^{2}HM^{\prime}}+d^{2}BH\right).

We are now ready to prove Lemma 2.

Proof of Lemma 2.

Note that when B=3BB=3B_{\star}, Vhm(s)Vh(s)3B=BV^{m}_{h}(s)\leq V^{\star}_{h}(s)\leq 3B_{\star}=B by Lemma 5. Thus, M¯=M\overline{M}=M, and the statement directly follows from Lemma 6 with M=M¯M^{\prime}=\overline{M}. ∎

B.4 Learning without Knowing BB_{\star} or TT_{\star}

In this section, we develop a parameter-free algorithm that achieves 𝒪~(d3B3K/cmin+d3B2/cmin)\tilde{\mathcal{O}}(\sqrt{d^{3}B_{\star}^{3}K/c_{\min}}+d^{3}B_{\star}^{2}/c_{\min}) regret without knowing BB_{\star} or TT_{\star}. This matches the best bound of (Vial et al., 2021) with the same parameter knowledge, while being computationally efficient under the most general assumption. Here we apply the finite-horizon approximation with zero terminal costs and develop a new analysis of this approximation.

Finite-Horizon Approximation of SSP with Zero Terminal Costs

Algorithm 4 Adaptive Finite-Horizon Approximation of SSP

Input: upper bound estimate BB and function U(B)U(B) from Lemma 8.

Initialize: 𝔄\mathfrak{A} an instance of finite-horizon algorithm with horizon 10Bcminln(8BK)\lceil\frac{10B}{c_{\min}}\ln(8BK)\rceil.

Initialize: m=1m=1, m=0m^{\prime}=0, k=1k=1, s=sinits=s_{\text{init}}.

while kKk\leq K do

       Execute 𝔄\mathfrak{A} for HH steps starting from state ss and receive sH+1ms^{m}_{H+1}. if sH+1m=gs^{m}_{H+1}=g then  kk+1k\leftarrow k+1, ssinits\leftarrow s_{\text{init}}; else mm+1m^{\prime}\leftarrow m^{\prime}+1, ssH+1ms\leftarrow s^{m}_{H+1}.
      1 if m>U(B)m^{\prime}>U(B) or 𝔄\mathfrak{A} detects B<BB<B_{\star} then
             B2BB\leftarrow 2B. Initialize 𝔄\mathfrak{A} as an instance of finite-horizon algorithm with horizon 10Bcminln(8BK)\lceil\frac{10B}{c_{\min}}\ln(8BK)\rceil. m0m^{\prime}\leftarrow 0.
      mm+1m\leftarrow m+1.

To avoid knowledge of BB_{\star} or TT_{\star}, we apply finite-horizon approximation with zero terminal costs and horizon of order 𝒪~(Bcmin)\tilde{\mathcal{O}}(\frac{B}{c_{\min}}) for some estimate BB of BB_{\star}, that is, running Algorithm 1 with cf(s)=0c_{f}(s)=0 and H=𝒪~(Bcmin)H=\tilde{\mathcal{O}}(\frac{B}{c_{\min}}). We show that in this case there is an alternative way to bound the regret RKR_{K} by R~M\widetilde{R}_{M}, and there is a tighter bound on the total number of intervals MM when BBB\geq B_{\star}.

Lemma 7.

Algorithm 1 with cf(s)=0c_{f}(s)=0 ensures RKR~M+Bm=1M𝕀{sH+1mg}R_{K}\leq\widetilde{R}_{M}+B_{\star}\sum_{m=1}^{M}\mathbb{I}\{s^{m}_{H+1}\neq g\}.

Proof.

Denote by k{\mathcal{I}}_{k} the set of intervals in episode kk. We have:

RK\displaystyle R_{K} =k=1K(mkh=1HchmV(sinit))=k=1K(mk(h=1HchmV1(s1m))+mkV1(s1m)V(sinit))\displaystyle=\sum_{k=1}^{K}\left(\sum_{m\in{\mathcal{I}}_{k}}\sum_{h=1}^{H}c^{m}_{h}-V^{\star}(s_{\text{init}})\right)=\sum_{k=1}^{K}\left(\sum_{m\in{\mathcal{I}}_{k}}\left(\sum_{h=1}^{H}c^{m}_{h}-V^{\star}_{1}(s^{m}_{1})\right)+\sum_{m\in{\mathcal{I}}_{k}}V^{\star}_{1}(s^{m}_{1})-V^{\star}(s_{\text{init}})\right)
R~M+Bm=1M𝕀{sH+1mg}.\displaystyle\leq\widetilde{R}_{M}+B_{\star}\sum_{m=1}^{M}\mathbb{I}\{s^{m}_{H+1}\neq g\}. (V1(s)V(s)BV^{\star}_{1}(s)\leq V^{\star}(s)\leq B_{\star} by cf(s)=0c_{f}(s)=0)

Lemma 8.

Suppose when BBB\geq B_{\star}, 𝔄\mathfrak{A} with horizon H=10Bcminln(8BK)H=\lceil\frac{10B}{c_{\min}}\ln(8BK)\rceil ensures R~M=𝒪~(γ0(B)+γ1(B)M)\widetilde{R}_{M^{\prime}}=\tilde{\mathcal{O}}(\gamma_{0}(B)+\gamma_{1}(B)\sqrt{M^{\prime}}) for any MMM^{\prime}\leq M with probability at least 1δ1-\delta, where γ0\gamma_{0}, γ1\gamma_{1} are functions of BB and are independent of MM^{\prime}. Then Algorithm 1 with cf(s)=0c_{f}(s)=0 ensures with probability at least 14δ1-4\delta,

m=1M𝕀{sH+1mg}=𝒪~(γ0(B)/B+γ1(B)2/B2+γ1(B)K/B+H)U(B).\sum_{m=1}^{M}\mathbb{I}\{s^{m}_{H+1}\neq g\}=\tilde{\mathcal{O}}\left(\gamma_{0}(B)/B+\gamma_{1}(B)^{2}/B^{2}+\gamma_{1}(B)\sqrt{K}/B+H\right)\triangleq U(B).
Proof.

First note that by Lemma 39 and V1πm(s)V1(s)V^{\pi^{m}}_{1}(s)\geq V^{\star}_{1}(s), with probability at least 1δ1-\delta: m=1MV1πm(s1m)V(s1m)2R~M+𝒪~(H)\sum_{m=1}^{M^{\prime}}V^{\pi^{m}}_{1}(s^{m}_{1})-V^{\star}(s^{m}_{1})\leq 2\widetilde{R}_{M^{\prime}}+\tilde{\mathcal{O}}(H). For any finite MMM^{\prime}\leq M, we will show m=1M𝕀{sH+1mg}=𝒪~(γ0(B)/B+γ1(B)2/B2+γ1(B)K/B)\sum_{m=1}^{M^{\prime}}\mathbb{I}\{s^{m}_{H+1}\neq g\}=\tilde{\mathcal{O}}(\gamma_{0}(B)/B+\gamma_{1}(B)^{2}/B^{2}+\gamma_{1}(B)\sqrt{K}/B), which then implies that m=1M𝕀{sH+1mg}\sum_{m=1}^{M}\mathbb{I}\{s^{m}_{H+1}\neq g\} has to be finite and is upper bounded by the same quantity. Define V~1π(s)=𝔼[h=1H/2c(sh,ah)|π,P,s1=s]\widetilde{V}^{\pi}_{1}(s)=\mathbb{E}[\sum_{h=1}^{H/2}c(s_{h},a_{h})|\pi,P,s_{1}=s] as the expected cost for the first H/2H/2 layers and V~1\widetilde{V}^{\star}_{1} as the optimal value function for the first H/2H/2 layers. By (Chen et al., 2021a, Lemma 1) and BBB\geq B_{\star}, we have V(s)V~1(s)[0,14K]V^{\star}(s)-\widetilde{V}^{\star}_{1}(s)\in[0,\frac{1}{4K}] and V(s)V1(s)[0,14K]V^{\star}(s)-V^{\star}_{1}(s)\in[0,\frac{1}{4K}]. Moreover, when sH+1mgs^{m}_{H+1}\neq g, we have h>H/2chm2B\sum_{h>H/2}c^{m}_{h}\geq 2B. Denote by Pm()P_{m}(\cdot) the probability of an event conditioned on the history before interval mm. Then with probability at least 1δ1-\delta,

2Bm=1MPm(sH+1mg)+m=1MV~1πm(s1m)V~1(s1m)M2K+m=1MV1πm(s1m)V1(s1m)\displaystyle 2B\sum_{m=1}^{M^{\prime}}P_{m}(s^{m}_{H+1}\neq g)+\sum_{m=1}^{M^{\prime}}\widetilde{V}^{\pi^{m}}_{1}(s^{m}_{1})-\widetilde{V}^{\star}_{1}(s^{m}_{1})\leq\frac{M^{\prime}}{2K}+\sum_{m=1}^{M^{\prime}}V^{\pi^{m}}_{1}(s^{m}_{1})-V^{\star}_{1}(s^{m}_{1})
12Km=1M𝕀{sH+1mg}+𝒪~(γ0(B)+γ1(B)M+H)\displaystyle\leq\frac{1}{2K}\sum_{m=1}^{M^{\prime}}\mathbb{I}\{s^{m}_{H+1}\neq g\}+\tilde{\mathcal{O}}\left(\gamma_{0}(B)+\gamma_{1}(B)\sqrt{M^{\prime}}+H\right) (MK+m=1M𝕀{sH+1mg}M^{\prime}\leq K+\sum_{m=1}^{M^{\prime}}\mathbb{I}\{s^{m}_{H+1}\neq g\} and guarantee of 𝔄\mathfrak{A})
1Km=1MPm(sH+1mg)+𝒪~(γ0(B)+γ1(B)M+H).\displaystyle\leq\frac{1}{K}\sum_{m=1}^{M^{\prime}}P_{m}(s^{m}_{H+1}\neq g)+\tilde{\mathcal{O}}\left(\gamma_{0}(B)+\gamma_{1}(B)\sqrt{M^{\prime}}+H\right). (Lemma 39)

Then by V~1πm(s1m)V~1(s1m)\widetilde{V}^{\pi^{m}}_{1}(s^{m}_{1})\geq\widetilde{V}^{\star}_{1}(s^{m}_{1}) and reorganizing terms, we get m=1MPm(sH+1mg)=𝒪~(γ0(B)/B+γ1(B)M/B+H)\sum_{m=1}^{M^{\prime}}P_{m}(s^{m}_{H+1}\neq g)=\tilde{\mathcal{O}}(\gamma_{0}(B)/B+\gamma_{1}(B)\sqrt{M^{\prime}}/B+H). Again by Lemma 39, we have with probability at least 1δ1-\delta:

m=1M𝕀{sH+1mg}=𝒪~(m=1MPm(sH+1mg))=𝒪~(γ0(B)/B+γ1(B)M/B+H).\sum_{m=1}^{M^{\prime}}\mathbb{I}\{s^{m}_{H+1}\neq g\}=\tilde{\mathcal{O}}\left(\sum_{m=1}^{M^{\prime}}P_{m}(s^{m}_{H+1}\neq g)\right)=\tilde{\mathcal{O}}\left(\gamma_{0}(B)/B+\gamma_{1}(B)\sqrt{M^{\prime}}/B+H\right).

By MK+m=1M𝕀{sH+1mg}M^{\prime}\leq K+\sum_{m=1}^{M^{\prime}}\mathbb{I}\{s^{m}_{H+1}\neq g\} and solving a quadratic inequality w.r.t m=1M𝕀{sH+1mg}\sqrt{\sum_{m=1}^{M^{\prime}}\mathbb{I}\{s^{m}_{H+1}\neq g\}}, we get m=1M𝕀{sH+1mg}=𝒪~(γ0(B)/B+γ1(B)2/B2+γ1(B)K/B+H)\sum_{m=1}^{M^{\prime}}\mathbb{I}\{s^{m}_{H+1}\neq g\}=\tilde{\mathcal{O}}(\gamma_{0}(B)/B+\gamma_{1}(B)^{2}/B^{2}+\gamma_{1}(B)\sqrt{K}/B+H). Thus, we also get the same bound for m=1M𝕀{sH+1mg}\sum_{m=1}^{M}\mathbb{I}\{s^{m}_{H+1}\neq g\}. ∎
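The quadratic-inequality step at the end of this proof can be spelled out explicitly. With the shorthands X=\sum_{m=1}^{M^{\prime}}\mathbb{I}\{s^{m}_{H+1}\neq g\}, a=\tilde{\mathcal{O}}(\gamma_{0}(B)/B+H), and c=\tilde{\mathcal{O}}(\gamma_{1}(B)/B) (introduced only for this sketch), the bounds M^{\prime}\leq K+X and \sqrt{K+X}\leq\sqrt{K}+\sqrt{X} give

```latex
X \le a + c\sqrt{M'} \le a + c\sqrt{K} + c\sqrt{X}
\;\Longrightarrow\;
\Big(\sqrt{X}-\frac{c}{2}\Big)^{2} \le a + c\sqrt{K} + \frac{c^{2}}{4}
\;\Longrightarrow\;
X = \mathcal{O}\big(a + c^{2} + c\sqrt{K}\big),
```

which is exactly the claimed \tilde{\mathcal{O}}(\gamma_{0}(B)/B+\gamma_{1}(B)^{2}/B^{2}+\gamma_{1}(B)\sqrt{K}/B+H) bound.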

Remark 1.

Note that the result of Lemma 8 is similar to (Tarbouriech et al., 2020, Lemma 7), which also shows that the number of “bad” intervals is of order 𝒪~(K)\tilde{\mathcal{O}}(\sqrt{K}). However, their result is derived by explicitly analyzing the transition confidence sets, while we only make use of the regret guarantee of the finite-horizon algorithm. Thus, our approach is again model-agnostic and directly applicable to linear function approximation while their result is not.

Note that Lemma 7 and Lemma 8 together implies a 𝒪~(K)\tilde{\mathcal{O}}(\sqrt{K}) regret bound when BBB\geq B_{\star}. Moreover, since the total number of “bad” intervals is of order 𝒪~(K)\tilde{\mathcal{O}}(\sqrt{K}), we can properly bound the cost of running finite-horizon algorithm with wrong estimates on BB_{\star}. We now present an adaptive version of finite-horizon approximation of SSP (Algorithm 4) which does not require the knowledge of BB_{\star} or TT_{\star}. The main idea is to perform finite-horizon approximation with zero costs, and maintain an estimate BB of BB_{\star}. The learner runs a finite-horizon algorithm with horizon of order 𝒪~(Bcmin)\tilde{\mathcal{O}}(\frac{B}{c_{\min}}). Whenever 𝔄\mathfrak{A} detects BBB\leq B_{\star}, or the number of “bad” intervals is more than expected (Line 4), it doubles the estimate BB and start a new instance of finite-horizon algorithm with the updated estimate. The guarantee of Algorithm 4 is summarized in the following theorem.
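The control flow of Algorithm 4 can be sketched as follows (a minimal sketch: `make_learner`, `run_interval`, and `U` are hypothetical stand-ins for the finite-horizon algorithm 𝔄 and the bad-interval budget of Lemma 8, not part of the paper):

```python
import math

def horizon(B, c_min, K):
    # H = ceil(10 B / c_min * ln(8 B K)), the horizon used for estimate B.
    return math.ceil(10 * B / c_min * math.log(8 * B * K))

def run_adaptive(B, c_min, K, make_learner, U):
    """Doubling scheme of Algorithm 4. `make_learner(H)` returns a fresh
    finite-horizon learner; `learner.run_interval()` plays H steps and
    reports (reached_goal, anomaly_detected); `U(B)` is the bad-interval
    budget.  All three are hypothetical interfaces for this sketch."""
    learner, bad, k = make_learner(horizon(B, c_min, K)), 0, 1
    while k <= K:
        reached_goal, anomaly = learner.run_interval()
        if reached_goal:
            k += 1        # episode finished, restart from s_init
        else:
            bad += 1      # a "bad" interval that did not reach g
        if bad > U(B) or anomaly:
            B *= 2        # estimate too small: double B, restart the learner
            learner = make_learner(horizon(B, c_min, K))
            bad = 0
    return B
```

By Lemma 8, once B ≥ B⋆ the budget U(B) is never exceeded, so the doubling stops after at most O(log₂ B⋆) restarts.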

Theorem 7.

Suppose 𝔄\mathfrak{A} takes an estimate BB as input, and when B<BB<B_{\star}, it has some probability of detecting the anomaly (the event B<BB<B_{\star}) and halts. Define stopping time M¯=min{M,infm{anomaly detected in episode m}}\overline{M}^{\prime}=\min\{M,\inf_{m}\{\text{anomaly detected in episode $m$}\}\}, and suppose for any B1B\geq 1, 𝔄\mathfrak{A} with horizon H=10Bcminln(8BK)H=\lceil\frac{10B}{c_{\min}}\ln(8BK)\rceil ensures R~M=𝒪~(γ0(B)+γ1(B)M)\widetilde{R}_{M^{\prime}}=\tilde{\mathcal{O}}(\gamma_{0}(B)+\gamma_{1}(B)\sqrt{M^{\prime}}) for any MM¯M^{\prime}\leq\overline{M}^{\prime}, where γ0(B)/B,γ1(B)/B\gamma_{0}(B)/B,\gamma_{1}(B)/B are non-decreasing w.r.t BB. Then, Algorithm 4 ensures RK=𝒪~(γ0(B)+γ1(B)K+γ1(B)2/B+BH)R_{K}=\tilde{\mathcal{O}}(\gamma_{0}(B_{\star})+\gamma_{1}(B_{\star})\sqrt{K}+\gamma_{1}(B_{\star})^{2}/B_{\star}+B_{\star}H) with probability at least 14δ1-4\delta.

Proof.

We divide the learning process into epochs indexed by ϕ\phi based on the update of BB, so that B1=BB_{1}=B (the input value) and Bϕ+1=2BϕB_{\phi+1}=2B_{\phi}. Let ϕ=minϕ{BϕB}\phi^{\star}=\min_{\phi}\{B_{\phi}\geq B_{\star}\}. Define the regret in epoch ϕ\phi as R¯ϕ=Cϕk𝒦ϕV(s1ϕ,k)\bar{R}_{\phi}=C_{\phi}-\sum_{k\in{\mathcal{K}}_{\phi}}V^{\star}(s_{1}^{\phi,k}), where CϕC_{\phi} is the total costs suffered in epoch ϕ\phi, 𝒦ϕ{\mathcal{K}}_{\phi} is the set of episodes overlapped with epoch ϕ\phi, and s1ϕ,ks^{\phi,k}_{1} is the initial state in episode kk and epoch ϕ\phi (note that an episode can overlap with multiple epochs). Clearly, ϕ|𝒦ϕ|K+ϕK+𝒪(log2B)\sum_{\phi}|{\mathcal{K}}_{\phi}|\leq K+\phi^{\star}\leq K+\mathcal{O}(\log_{2}B_{\star}). Note that 𝔄\mathfrak{A} satisfies the assumptions in Lemma 8, since no anomaly will be detected when BBB\geq B_{\star}. Thus in epoch ϕ\phi^{\star}, no new epoch will be started by Lemma 8. Moreover, by Lemma 7 and Bϕ2BB_{\phi^{\star}}\leq 2B_{\star}, the regret is bounded by:

R¯ϕ=𝒪~(γ0(B)+γ1(B)K+U(B)+BU(B))=𝒪~(γ0(B)+γ1(B)K+γ1(B)2/B+BH).\displaystyle\bar{R}_{\phi^{\star}}=\tilde{\mathcal{O}}\left(\gamma_{0}(B_{\star})+\gamma_{1}(B_{\star})\sqrt{K+U(B_{\star})}+B_{\star}U(B_{\star})\right)=\tilde{\mathcal{O}}\left(\gamma_{0}(B_{\star})+\gamma_{1}(B_{\star})\sqrt{K}+\gamma_{1}(B_{\star})^{2}/B_{\star}+B_{\star}H\right).

For ϕ<ϕ\phi<\phi^{\star}, by the conditions for starting a new epoch, the number of intervals that do not reach the goal is upper bounded by U(Bϕ)U(B_{\phi}) and the total number of intervals in epoch ϕ\phi is upper bounded by K+U(Bϕ)K+U(B_{\phi}). Thus by Lemma 7 and the guarantee of 𝔄\mathfrak{A},

R¯ϕ=𝒪~(γ0(Bϕ)+γ1(Bϕ)K+U(Bϕ)+BU(Bϕ))=𝒪~(γ0(B)+γ1(B)K+γ1(B)2/B+BH),\displaystyle\bar{R}_{\phi}=\tilde{\mathcal{O}}\left(\gamma_{0}(B_{\phi})+\gamma_{1}(B_{\phi})\sqrt{K+U(B_{\phi})}+B_{\star}U(B_{\phi})\right)=\tilde{\mathcal{O}}\left(\gamma_{0}(B_{\star})+\gamma_{1}(B_{\star})\sqrt{K}+\gamma_{1}(B_{\star})^{2}/B_{\star}+B_{\star}H\right),

where the last equality is by the fact that γ0(B),γ1(B)\gamma_{0}(B),\gamma_{1}(B) and U(B)U(B) are non-decreasing w.r.t BB. Thus,

RK\displaystyle R_{K} =ϕCϕk=1KV(sinit)=ϕR¯ϕ+ϕk𝒦ϕV(s1ϕ,k)k=1KV(sinit)\displaystyle=\sum_{\phi}C_{\phi}-\sum_{k=1}^{K}V^{\star}(s_{\text{init}})=\sum_{\phi}\bar{R}_{\phi}+\sum_{\phi}\sum_{k\in{\mathcal{K}}_{\phi}}V^{\star}(s_{1}^{\phi,k})-\sum_{k=1}^{K}V^{\star}(s_{\text{init}})
=𝒪~(γ0(B)+γ1(B)K+γ1(B)2/B+BH).\displaystyle=\tilde{\mathcal{O}}\left(\gamma_{0}(B_{\star})+\gamma_{1}(B_{\star})\sqrt{K}+\gamma_{1}(B_{\star})^{2}/B_{\star}+B_{\star}H\right).

Theorem 8.

Applying Algorithm 4 with Algorithm 2 as 𝔄\mathfrak{A} to the linear SSP problem ensures RK=𝒪~(d3B3K/cmin+d3B2/cmin)R_{K}=\tilde{\mathcal{O}}(\sqrt{d^{3}B_{\star}^{3}K/c_{\min}}+d^{3}B_{\star}^{2}/c_{\min}) with probability at least 14δ1-4\delta.

Proof.

Note that M¯=M¯\overline{M}^{\prime}=\overline{M} for Algorithm 2, and Lemma 6 ensures that Algorithm 2 satisfies assumptions of Theorem 7 with γ0(B)=d2BH\gamma_{0}(B)=d^{2}BH and γ1(B)=d3B2H\gamma_{1}(B)=\sqrt{d^{3}B^{2}H}, where H=10Bcminln(8BK)H=\lceil\frac{10B}{c_{\min}}\ln(8BK)\rceil. Then by Theorem 7, we have: RK=𝒪~(d3B3K/cmin+d3B2/cmin)R_{K}=\tilde{\mathcal{O}}(\sqrt{d^{3}B_{\star}^{3}K/c_{\min}}+d^{3}B_{\star}^{2}/c_{\min}). ∎

Remark 2.

Comparing the bound of Theorem 8 with that of Theorem 3, we see that Bcmin\frac{B_{\star}}{c_{\min}} takes the place of TT_{\star}, making it a worse bound since TBcminT_{\star}\leq\frac{B_{\star}}{c_{\min}}. Previous works on SSP (Cohen et al., 2021; Tarbouriech et al., 2021; Chen et al., 2021a) suggest that algorithms whose bounds depend on Bcmin\frac{B_{\star}}{c_{\min}} are easier to make parameter-free than those whose bounds depend on TT_{\star}. Our findings in this section are consistent with those previous works.

B.5 Horizon-Free Regret in the Tabular Setting with Finite-Horizon Approximation

Here we present a finite-horizon algorithm (Algorithm 5) that achieves R~m=𝒪~(BSAm+BS2A)\widetilde{R}_{m}=\tilde{\mathcal{O}}(B_{\star}\sqrt{SAm}+B_{\star}S^{2}A) and thus gives RK=𝒪~(BSAK+BS2A)R_{K}=\tilde{\mathcal{O}}(B_{\star}\sqrt{SAK}+B_{\star}S^{2}A) when combined with Corollary 2. For simplicity, we assume that the cost function is known. We can think of Algorithm 5 as a variant of EB-SSP applied to a finite-horizon MDP with state space 𝒮×[H]{\mathcal{S}}\times[H] whose transition is shared across layers. Note that due to the loop-free structure of the MDP, value iteration converges in one sweep. Thus, skewing the empirical transition as in (Tarbouriech et al., 2021) is unnecessary. Then by the analysis of EB-SSP and the fact that transition data is shared across layers, we obtain the same regret guarantee R~m=𝒪~(BSAm+BS2A)\widetilde{R}_{m}=\tilde{\mathcal{O}}(B_{\star}\sqrt{SAm}+B_{\star}S^{2}A) (it is not hard to see that the algorithm achieves an anytime regret bound, since its parameter updates are independent of KK).
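The bonus used by Algorithm 5 below is the usual empirical-Bernstein form; for a single (s,a)(s,a) pair it might be computed as follows (a sketch with ι\iota passed in precomputed; the function name and array layout are ours):

```python
import numpy as np

def bernstein_bonus(P_bar, V_next, n, B, iota):
    """Exploration bonus b_h(s,a) of Algorithm 5 for one (s,a):
    max{ 7 sqrt( Var_{P_bar}(V_{h+1}) * iota / n ), 49 B iota / n },
    where Var_{P_bar}(V_{h+1}) is the variance of V_{h+1} under the
    empirical transition P_bar, and the visit count n is clipped at 1."""
    n = max(1, n)
    var = float(P_bar @ (V_next ** 2) - (P_bar @ V_next) ** 2)
    return max(7 * (var * iota / n) ** 0.5, 49 * B * iota / n)
```

The first term dominates once (s,a)(s,a) has been visited often; the second term handles the low-count regime.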

Algorithm 5 MVP+

Input: an estimate BB such that BBB\geq B_{\star}.

Initialize: n(s,a),n(s,a,s),Qh(s,a),Vh(s),VH+1(s)=0n(s,a),n(s,a,s^{\prime}),Q_{h}(s,a),V_{h}(s),V_{H+1}(s)=0 for (s,a)𝒮+×𝒜(s,a)\in{\mathcal{S}}_{+}\times{\mathcal{A}}, s𝒮+s^{\prime}\in{\mathcal{S}}_{+}, h[H]h\in[H].

for k=1,,Kk=1,\ldots,K do

       for h=1,,Hh=1,\ldots,H do
             Take action at=argminaQh(st,a)a_{t}=\operatorname*{argmin}_{a}Q_{h}(s_{t},a), incur cost c~(st,at)\widetilde{c}(s_{t},a_{t}) and transit to stP~st,ats^{\prime}_{t}\sim\widetilde{P}_{s_{t},a_{t}}. n(st,at)n(st,at)+1n(s_{t},a_{t})\leftarrow n(s_{t},a_{t})+1, n(st,at,st)n(st,at,st)+1n(s_{t},a_{t},s^{\prime}_{t})\leftarrow n(s_{t},a_{t},s^{\prime}_{t})+1. if n(s,a)=2jn(s,a)=2^{j} for some jj\in\mathbb{N} then
                   for h=H,,1h=H,\ldots,1 do
                         for (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times{\mathcal{A}} do
ιs,a20ln2SAn(s,a)δ\iota_{s,a}\leftarrow 20\ln\frac{2SAn(s,a)}{\delta}, P¯s,a(s)n(s,a,s)max{1,n(s,a)}\bar{P}_{s,a}(s^{\prime})\leftarrow\frac{n(s,a,s^{\prime})}{\max\{1,n(s,a)\}} for all s𝒮+s^{\prime}\in{\mathcal{S}}_{+}. bh(s,a)max{7𝕍(P¯s,a,Vh+1)ιs,amax{1,n(s,a)},49Bιs,amax{1,n(s,a)}}b_{h}(s,a)\leftarrow\max\left\{7\sqrt{\frac{\mathbb{V}(\bar{P}_{s,a},V_{h+1})\iota_{s,a}}{\max\{1,n(s,a)\}}},\frac{49B\iota_{s,a}}{\max\{1,n(s,a)\}}\right\}. Qh(s,a)max{0,c~(s,a)+P¯s,aVh+1bh(s,a)}Q_{h}(s,a)\leftarrow\max\{0,\widetilde{c}(s,a)+\bar{P}_{s,a}V_{h+1}-b_{h}(s,a)\}. Vh(s)minaQh(s,a)V_{h}(s)\leftarrow\min_{a}Q_{h}(s,a).
                        
                  
            
      

B.6 Application to Linear Mixture MDP

Algorithm 6 UCRL-VTR-SSP

Initialize: λ=1\lambda=1, Σ^1,Σ~1=λI\widehat{\Sigma}_{1},\widetilde{\Sigma}_{1}=\lambda I, b^1,b~1,θ^1,θ~1=𝟎\widehat{b}_{1},\widetilde{b}_{1},\widehat{\theta}_{1},\widetilde{\theta}_{1}=\mathbf{0}.

Define: β^m=8dln(1+dmH/λ)ln(4m2H2/δ)+4dln(4m2H2/δ)+λd\widehat{\beta}_{m}=8\sqrt{d\ln(1+dmH/\lambda)\ln(4m^{2}H^{2}/\delta)}+4\sqrt{d}\ln(4m^{2}H^{2}/\delta)+\sqrt{\lambda d}.

Define: β~m=72B2dln(1+81dmHB4/λ)ln(4m2H2/δ)+36B2ln(4m2H2/δ)+λd\widetilde{\beta}_{m}=72B_{\star}^{2}\sqrt{d\ln(1+81dmHB_{\star}^{4}/\lambda)\ln(4m^{2}H^{2}/\delta)}+36B_{\star}^{2}\ln(4m^{2}H^{2}/\delta)+\sqrt{\lambda d}.

Define: βˇm=8dln(1+dmH/λ)ln(4m2H2/δ)+4dln(4m2H2/δ)+λd\check{\beta}_{m}=8d\sqrt{\ln(1+dmH/\lambda)\ln(4m^{2}H^{2}/\delta)}+4\sqrt{d}\ln(4m^{2}H^{2}/\delta)+\sqrt{\lambda d}.

for m=1,,Mm=1,\ldots,M do

       for h=H,,1h=H,\ldots,1 do
             Qhm(,)=c~(,)+θ^m,ϕVh+1m(,)β^mϕVh+1m(,)Σ^m1Q^{m}_{h}(\cdot,\cdot)=\widetilde{c}(\cdot,\cdot)+\left\langle\widehat{\theta}_{m},\phi_{V^{m}_{h+1}}(\cdot,\cdot)\right\rangle-\widehat{\beta}_{m}\left\|{\phi_{V^{m}_{h+1}}(\cdot,\cdot)}\right\|_{\widehat{\Sigma}_{m}^{-1}}, where VH+1m(s)=2B𝕀{sg}V^{m}_{H+1}(s)=2B_{\star}\mathbb{I}\{s\neq g\}. Vhm()=mina[Qhm(,a)][0,3B]V^{m}_{h}(\cdot)=\min_{a}[Q^{m}_{h}(\cdot,a)]_{[0,3B_{\star}]}.
      for h=1,,Hh=1,\ldots,H do
             Take action ahm=argminaQhm(shm,a)a^{m}_{h}=\operatorname*{argmin}_{a}Q^{m}_{h}(s^{m}_{h},a), suffer cost chmc^{m}_{h}, and transit to next state sh+1ms^{m}_{h+1}. νhm=[ϕ(Vh+1m)2(shm,ahm),θ~m][0,9B2][ϕVh+1m(shm,ahm),θ^m][0,3B]2\nu^{m}_{h}=\left[\left\langle\phi_{(V^{m}_{h+1})^{2}}(s^{m}_{h},a^{m}_{h}),\widetilde{\theta}_{m}\right\rangle\right]_{[0,9B_{\star}^{2}]}-\left[\left\langle\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h}),\widehat{\theta}_{m}\right\rangle\right]_{[0,3B_{\star}]}^{2}. Ehm=min{9B2,β~mϕ(Vh+1m)2(shm,ahm)Σ~m1}+min{9B2,6BβˇmϕVh+1m(shm,ahm)Σ^m1}E^{m}_{h}=\min\left\{9B_{\star}^{2},\widetilde{\beta}_{m}\left\|{\phi_{(V^{m}_{h+1})^{2}}(s^{m}_{h},a^{m}_{h})}\right\|_{\widetilde{\Sigma}^{-1}_{m}}\right\}+\min\left\{9B_{\star}^{2},6B_{\star}\check{\beta}_{m}\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})}\right\|_{\widehat{\Sigma}_{m}^{-1}}\right\}. σ¯hm=max{9B2/d,νhm+Ehm}\bar{\sigma}^{m}_{h}=\sqrt{\max\{9B_{\star}^{2}/d,\nu^{m}_{h}+E^{m}_{h}\}}.
      Σ^m+1=Σ^m+h=1H(σ¯hm)2ϕVh+1m(shm,ahm)ϕVh+1m(shm,ahm)\widehat{\Sigma}_{m+1}=\widehat{\Sigma}_{m}+\sum_{h=1}^{H}(\bar{\sigma}^{m}_{h})^{-2}\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})^{\top}. Σ~m+1=Σ~m+h=1Hϕ(Vh+1m)2(shm,ahm)ϕ(Vh+1m)2(shm,ahm)\widetilde{\Sigma}_{m+1}=\widetilde{\Sigma}_{m}+\sum_{h=1}^{H}\phi_{(V^{m}_{h+1})^{2}}(s^{m}_{h},a^{m}_{h})\phi_{(V^{m}_{h+1})^{2}}(s^{m}_{h},a^{m}_{h})^{\top}. b^m+1=b^m+h=1H(σ¯hm)2Vh+1m(sh+1m)ϕVh+1m(shm,ahm)\widehat{b}_{m+1}=\widehat{b}_{m}+\sum_{h=1}^{H}(\bar{\sigma}^{m}_{h})^{-2}V^{m}_{h+1}(s^{m}_{h+1})\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h}). b~m+1=b~m+h=1HVh+1m(sh+1m)2ϕ(Vh+1m)2(shm,ahm)\widetilde{b}_{m+1}=\widetilde{b}_{m}+\sum_{h=1}^{H}V^{m}_{h+1}(s^{m}_{h+1})^{2}\phi_{(V^{m}_{h+1})^{2}}(s^{m}_{h},a^{m}_{h}). θ^m+1=(Σ^m+1)1b^m+1\widehat{\theta}_{m+1}=(\widehat{\Sigma}_{m+1})^{-1}\widehat{b}_{m+1}, θ~m+1Σ~m+11b~m+1\widetilde{\theta}_{m+1}\leftarrow\widetilde{\Sigma}_{m+1}^{-1}\widetilde{b}_{m+1}.

In this section, we provide a direct application of our finite-horizon approximation to the linear mixture MDP setting. We first introduce the problem setting of linear mixture SSP following (Min et al., 2021).

Assumption 2 (Linear Mixture SSP).

The state and action spaces are finite: |𝒮×𝒜|<|{\mathcal{S}}\times{\mathcal{A}}|<\infty. For some d2d\geq 2, there exist a known cost function c:𝒮×𝒜[0,1]c:{\mathcal{S}}\times{\mathcal{A}}\rightarrow[0,1], a known feature map ϕ:𝒮×𝒜×𝒮+d\phi:{\mathcal{S}}\times{\mathcal{A}}\times{\mathcal{S}}_{+}\rightarrow\mathbb{R}^{d}, and an unknown vector θd\theta^{\star}\in\mathbb{R}^{d} with θ2d\left\|{\theta^{\star}}\right\|_{2}\leq\sqrt{d}, such that:

  • for any (s,a,s)𝒮×𝒜×𝒮+(s,a,s^{\prime})\in{\mathcal{S}}\times{\mathcal{A}}\times{\mathcal{S}}_{+}, we have Ps,a(s)=ϕ(s|s,a),θP_{s,a}(s^{\prime})=\left\langle\phi(s^{\prime}|s,a),\theta^{\star}\right\rangle;

  • for any bounded function F:𝒮[0,1]F:{\mathcal{S}}\rightarrow[0,1], we have ϕF(s,a)2d\left\|{\phi_{F}(s,a)}\right\|_{2}\leq\sqrt{d}, where ϕF(s,a)=sϕ(s|s,a)F(s)d\phi_{F}(s,a)=\sum_{s^{\prime}}\phi(s^{\prime}|s,a)F(s^{\prime})\in\mathbb{R}^{d}.

We also assume BB_{\star} is known and cmin>0c_{\min}>0. Define c~(s,a)=c(s,a)𝕀{sg}\widetilde{c}(s,a)=c(s,a)\mathbb{I}\{s\neq g\}, P~={Ps,a}(s,a)𝒮×𝒜{Pg,a}a𝒜\widetilde{P}=\{P_{s,a}\}_{(s,a)\in{\mathcal{S}}\times{\mathcal{A}}}\cup\{P_{g,a}\}_{a\in{\mathcal{A}}} with Pg,a(s)=𝕀{s=g}P_{g,a}(s^{\prime})=\mathbb{I}\{s^{\prime}=g\} as before, and ϕ(s|g,a)=𝕀{s=g}s′′ϕ(s′′|sinit,a)\phi(s^{\prime}|g,a)=\mathbb{I}\{s^{\prime}=g\}\sum_{s^{\prime\prime}}\phi(s^{\prime\prime}|s_{\text{init}},a). Note that by the definitions above, P~s,aF=ϕF(s,a),θ\widetilde{P}_{s,a}F=\left\langle\phi_{F}(s,a),\theta^{\star}\right\rangle. Also define total costs CM=m=1Mh=1HchmC_{M^{\prime}}=\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}c^{m}_{h} for any M+M^{\prime}\in\mathbb{N}_{+}. With our approximation scheme, it suffices to provide a finite-horizon algorithm. We start by stating the regret guarantee of the proposed finite-horizon algorithm (Algorithm 6).
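In the finite-state case, the induced feature ϕF\phi_{F} from Assumption 2 can be computed directly (a sketch with our own array layout; not code from the paper):

```python
import numpy as np

def phi_F(phi, F):
    """phi_F(s,a) = sum_{s'} phi(s'|s,a) F(s'), so that
    P~_{s,a} F = <phi_F(s,a), theta*> in a linear mixture MDP.

    phi: array of shape (S, A, S_next, d); F: array of shape (S_next,).
    Returns an array of shape (S, A, d)."""
    return np.einsum('sand,n->sad', phi, F)
```

Algorithm 6 evaluates this map at F = V^m_{h+1} and F = (V^m_{h+1})^2 to form its regression features.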

Theorem 9.

Algorithm 6 ensures R~M=𝒪~(BdMH+BdM+Bd2H+Bd2.5)\widetilde{R}_{M^{\prime}}=\tilde{\mathcal{O}}(B_{\star}\sqrt{dM^{\prime}H}+B_{\star}d\sqrt{M^{\prime}}+B_{\star}d^{2}H+B_{\star}d^{2.5}) for any M+M^{\prime}\in\mathbb{N}_{+} with probability at least 15δ1-5\delta.

Combining Algorithm 6 with our finite-horizon approximation, we get the following regret guarantee on linear mixture SSP.

Theorem 10.

Applying Algorithm 1 with H=4Tln(4K)H=\lceil 4T_{\star}\ln(4K)\rceil and Algorithm 6 as 𝔄\mathfrak{A} to the linear mixture SSP problem ensures RK=𝒪~(BdTK+BdK+Bd2T+Bd2.5)R_{K}=\tilde{\mathcal{O}}(B_{\star}\sqrt{dT_{\star}K}+B_{\star}d\sqrt{K}+B_{\star}d^{2}T_{\star}+B_{\star}d^{2.5}) with probability at least 15δ1-5\delta.

Proof.

This directly follows from Theorem 9 and Corollary 2 with γ0=Bd2H\gamma_{0}=B_{\star}d^{2}H and γ1=BdH+Bd\gamma_{1}=B_{\star}\sqrt{dH}+B_{\star}d. ∎

Note that our bound strictly improves over that of (Min et al., 2021), and it is minimax optimal when dTd\geq T_{\star}. We now introduce the proposed finite-horizon algorithm, which is a variant of (Zhou et al., 2021a, Algorithm 2). The high-level idea is to construct Bernstein-style confidence sets on the transition function and then compute value function estimates through empirical value iteration with bonuses. We summarize these ideas in Algorithm 6. Before proving Theorem 9, we need the following key lemma regarding the confidence sets on the transition function.
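The parameter estimate θ^m\widehat{\theta}_{m} in Algorithm 6 is a variance-weighted ridge regression; written in batch matrix form, the update might be sketched as (our own function name and shapes, matching the running updates of Σ^m\widehat{\Sigma}_{m} and b^m\widehat{b}_{m}):

```python
import numpy as np

def weighted_ridge(feats, targets, sigmas, lam=1.0):
    """theta_hat = Sigma^{-1} b with
    Sigma = lam I + sum_t x_t x_t^T / sigma_t^2,
    b     = sum_t x_t y_t / sigma_t^2.

    feats: (T, d) features phi_{V}(s,a); targets: (T,) realized V(s');
    sigmas: (T,) variance upper bounds sigma_bar."""
    d = feats.shape[1]
    w = 1.0 / sigmas ** 2                              # per-sample weights
    Sigma = lam * np.eye(d) + (feats * w[:, None]).T @ feats
    b = feats.T @ (w * targets)
    return np.linalg.solve(Sigma, b)
```

Down-weighting samples by the variance upper bound σ¯hm\bar{\sigma}^{m}_{h} is what yields the tighter, Bernstein-style confidence set β^m\widehat{\beta}_{m}.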

Lemma 9.

With probability at least 13δ1-3\delta, we have for all m+m\in\mathbb{N}_{+}, θθ^mΣ^mβ^m\left\|{\theta^{\star}-\widehat{\theta}_{m}}\right\|_{\widehat{\Sigma}_{m}}\leq\widehat{\beta}_{m} and |νhm𝕍(Phm,Vh+1m)|Ehm\left|\nu^{m}_{h}-\mathbb{V}(P^{m}_{h},V^{m}_{h+1})\right|\leq E^{m}_{h}.

Proof.

For the first statement, we first prove that θθ^mΣ^mβˇm\left\|{\theta^{\star}-\widehat{\theta}_{m}}\right\|_{\widehat{\Sigma}_{m}}\leq\check{\beta}_{m} and θθ~mΣ~mβ~m\left\|{\theta^{\star}-\widetilde{\theta}_{m}}\right\|_{\widetilde{\Sigma}_{m}}\leq\widetilde{\beta}_{m} for m+m\in\mathbb{N}_{+}. We adopt the indexing by tt in Section 2: for a given time step t=(m1)H+ht=(m-1)H+h that corresponds to (m,h)(m,h), that is, the hh-th step in the mm-th interval, define σ¯t=σ¯hm\bar{\sigma}_{t}=\bar{\sigma}^{m}_{h}, Vt=Vh+1mV_{t}=V^{m}_{h+1}, νt=νhm\nu_{t}=\nu^{m}_{h}, and Et=EhmE_{t}=E^{m}_{h}. We apply Lemma 33 with t=σ(s1:t,a1:t){\mathcal{F}}_{t}=\sigma(s_{1:t},a_{1:t}), xt=σ¯t1ϕVt(st,at)x_{t}=\bar{\sigma}_{t}^{-1}\phi_{V_{t}}(s_{t},a_{t}), yt=σ¯t1Vt(st)y_{t}=\bar{\sigma}_{t}^{-1}V_{t}(s^{\prime}_{t}), μ=θ\mu^{\star}=\theta^{\star}, ηt=σ¯t1(Vt(st)ϕVt(st,at),θ)\eta_{t}=\bar{\sigma}_{t}^{-1}\left(V_{t}(s^{\prime}_{t})-\left\langle\phi_{V_{t}}(s_{t},a_{t}),\theta^{\star}\right\rangle\right). Then, we have Zt=Σ^t,μt=θ^tZ_{t}=\widehat{\Sigma}_{t},\mu_{t}=\widehat{\theta}_{t}, where Σ^t=λI+i=1tσ¯i2ϕVi(si,ai)ϕVi(si,ai)\widehat{\Sigma}_{t}=\lambda I+\sum_{i=1}^{t}\bar{\sigma}_{i}^{-2}\phi_{V_{i}}(s_{i},a_{i})\phi_{V_{i}}(s_{i},a_{i})^{\top}, θ^t=Σ^t1b^t\widehat{\theta}_{t}=\widehat{\Sigma}_{t}^{-1}\widehat{b}_{t}, and b^t=i=1tσ¯i2ϕVi(si,ai)Vi(si)\widehat{b}_{t}=\sum_{i=1}^{t}\bar{\sigma}_{i}^{-2}\phi_{V_{i}}(s_{i},a_{i})V_{i}(s^{\prime}_{i}). Moreover,

|ηt|R=d,𝔼[ηt2|𝒢t]σ2=d,xt2L=d,μ2=θ2d.\displaystyle|\eta_{t}|\leq R=\sqrt{d},\;\mathbb{E}[\eta_{t}^{2}|\mathcal{G}_{t}]\leq\sigma^{2}=d,\;\left\|{x_{t}}\right\|_{2}\leq L=d,\;\left\|{\mu^{\star}}\right\|_{2}=\left\|{\theta^{\star}}\right\|_{2}\leq\sqrt{d}.

Therefore, with probability at least 1δ1-\delta, for any t=(m1)Ht=(m-1)H for some m+m\in\mathbb{N}_{+}, which corresponds to (m1,H)(m-1,H):

θ^mθΣ^m8dln(1+d2t/(dλ))ln(4t2/δ)+4dln(4t2/δ)+λdβˇm.\displaystyle\left\|{\widehat{\theta}_{m}-\theta^{\star}}\right\|_{\widehat{\Sigma}_{m}}\leq 8d\sqrt{\ln(1+d^{2}t/(d\lambda))\ln(4t^{2}/\delta)}+4\sqrt{d}\ln(4t^{2}/\delta)+\sqrt{\lambda d}\leq\check{\beta}_{m}.

Next, we apply Lemma 33 with t=σ(s1:t,a1:t){\mathcal{F}}_{t}=\sigma(s_{1:t},a_{1:t}), xt=ϕVt2(st,at)x_{t}=\phi_{V^{2}_{t}}(s_{t},a_{t}), yt=Vt2(st)y_{t}=V_{t}^{2}(s^{\prime}_{t}), μ=θ\mu^{\star}=\theta^{\star}, ηt=Vt2(st)ϕVt2(st,at),θ\eta_{t}=V_{t}^{2}(s^{\prime}_{t})-\left\langle\phi_{V_{t}^{2}}(s_{t},a_{t}),\theta^{\star}\right\rangle. Then, we have Zt=Σ~t,μt=θ~tZ_{t}=\widetilde{\Sigma}_{t},\mu_{t}=\widetilde{\theta}_{t}, where Σ~t=λI+i=1tϕVi2(si,ai)ϕVi2(si,ai)\widetilde{\Sigma}_{t}=\lambda I+\sum_{i=1}^{t}\phi_{V_{i}^{2}}(s_{i},a_{i})\phi_{V_{i}^{2}}(s_{i},a_{i})^{\top}, θ~t=Σ~t1b~t\widetilde{\theta}_{t}=\widetilde{\Sigma}_{t}^{-1}\widetilde{b}_{t}, and b~t=i=1tϕVi2(si,ai)Vi2(si)\widetilde{b}_{t}=\sum_{i=1}^{t}\phi_{V^{2}_{i}}(s_{i},a_{i})V^{2}_{i}(s^{\prime}_{i}). Moreover,

|ηt|R=9B2,𝔼[ηt2|𝒢t]σ2=81B4,xt2L=9B2d,μ2=θ2d.\displaystyle|\eta_{t}|\leq R=9B_{\star}^{2},\;\mathbb{E}[\eta_{t}^{2}|\mathcal{G}_{t}]\leq\sigma^{2}=81B_{\star}^{4},\;\left\|{x_{t}}\right\|_{2}\leq L=9B_{\star}^{2}\sqrt{d},\;\left\|{\mu^{\star}}\right\|_{2}=\left\|{\theta^{\star}}\right\|_{2}\leq\sqrt{d}.

Therefore, with probability at least 1δ1-\delta, for any t=(m1)Ht=(m-1)H for some m+m\in\mathbb{N}_{+}, which corresponds to (m1,H)(m-1,H):

θ~mθΣ~m72B2dln(1+81tB4d/(dλ))ln(4t2/δ)+36B2ln(4t2/δ)+λdβ~m.\displaystyle\left\|{\widetilde{\theta}_{m}-\theta^{\star}}\right\|_{\widetilde{\Sigma}_{m}}\leq 72B_{\star}^{2}\sqrt{d\ln(1+81tB_{\star}^{4}d/(d\lambda))\ln(4t^{2}/\delta)}+36B_{\star}^{2}\ln(4t^{2}/\delta)+\sqrt{\lambda d}\leq\widetilde{\beta}_{m}.

Conditioned on the event 𝒞={θθ^mΣ^mβˇm,θθ~mΣ~mβ~m,m+}{\mathcal{C}}=\left\{\left\|{\theta^{\star}-\widehat{\theta}_{m}}\right\|_{\widehat{\Sigma}_{m}}\leq\check{\beta}_{m},\left\|{\theta^{\star}-\widetilde{\theta}_{m}}\right\|_{\widetilde{\Sigma}_{m}}\leq\widetilde{\beta}_{m},\forall m\in\mathbb{N}_{+}\right\}, we have for tt corresponding to (m,h)(m,h):

|νt𝕍(Pt,Vt)|\displaystyle\left|\nu_{t}-\mathbb{V}(P_{t},V_{t})\right|
|[ϕVt2(st,at),θ~m][0,9B2]ϕVt2(st,at),θ|+|[ϕVt(st,at),θ^m][0,3B]2ϕVt(st,at),θ2|\displaystyle\leq\left|\left[\left\langle\phi_{V_{t}^{2}}(s_{t},a_{t}),\widetilde{\theta}_{m}\right\rangle\right]_{[0,9B_{\star}^{2}]}-\left\langle\phi_{V^{2}_{t}}(s_{t},a_{t}),\theta^{\star}\right\rangle\right|+\left|\left[\left\langle\phi_{V_{t}}(s_{t},a_{t}),\widehat{\theta}_{m}\right\rangle\right]_{[0,3B_{\star}]}^{2}-\left\langle\phi_{V_{t}}(s_{t},a_{t}),\theta^{\star}\right\rangle^{2}\right|
min{9B2,|ϕVt2(st,at),θ~mθ|}+min{9B2,6B|ϕVt(st,at),θ^mθ|}\displaystyle\leq\min\left\{9B_{\star}^{2},\left|\left\langle\phi_{V_{t}^{2}}(s_{t},a_{t}),\widetilde{\theta}_{m}-\theta^{\star}\right\rangle\right|\right\}+\min\left\{9B_{\star}^{2},6B_{\star}\left|\left\langle\phi_{V_{t}}(s_{t},a_{t}),\widehat{\theta}_{m}-\theta^{\star}\right\rangle\right|\right\}
min{9B2,ϕVt2(st,at)Σ~m1θ~mθΣ~m}+min{9B2,6BϕVt(st,at)Σ^m1θ^mθΣ^m}\displaystyle\leq\min\left\{9B_{\star}^{2},\left\|{\phi_{V_{t}^{2}}(s_{t},a_{t})}\right\|_{\widetilde{\Sigma}_{m}^{-1}}\left\|{\widetilde{\theta}_{m}-\theta^{\star}}\right\|_{\widetilde{\Sigma}_{m}}\right\}+\min\left\{9B_{\star}^{2},6B_{\star}\left\|{\phi_{V_{t}}(s_{t},a_{t})}\right\|_{\widehat{\Sigma}_{m}^{-1}}\left\|{\widehat{\theta}_{m}-\theta^{\star}}\right\|_{\widehat{\Sigma}_{m}}\right\}
min{9B2,β~mϕVt2(st,at)Σ~m1}+min{9B2,6BβˇmϕVt(st,at)Σ^m1}=Et.\displaystyle\leq\min\left\{9B_{\star}^{2},\widetilde{\beta}_{m}\left\|{\phi_{V_{t}^{2}}(s_{t},a_{t})}\right\|_{\widetilde{\Sigma}_{m}^{-1}}\right\}+\min\left\{9B_{\star}^{2},6B_{\star}\check{\beta}_{m}\left\|{\phi_{V_{t}}(s_{t},a_{t})}\right\|_{\widehat{\Sigma}_{m}^{-1}}\right\}=E_{t}.

Thus the second statement is proved. Now we show that θθ^mΣ^mβ^m\left\|{\theta^{\star}-\widehat{\theta}_{m}}\right\|_{\widehat{\Sigma}_{m}}\leq\widehat{\beta}_{m}. We condition on the event 𝒞{\mathcal{C}} and apply Lemma 33 with t=σ(s1:t,a1:t){\mathcal{F}}_{t}=\sigma(s_{1:t},a_{1:t}), xt=σ¯t1ϕVt(st,at)x_{t}=\bar{\sigma}_{t}^{-1}\phi_{V_{t}}(s_{t},a_{t}), yt=σ¯t1Vt(st)y_{t}=\bar{\sigma}_{t}^{-1}V_{t}(s^{\prime}_{t}), μ=θ\mu^{\star}=\theta^{\star}, ηt=σ¯t1(Vt(st)ϕVt(st,at),θ)\eta_{t}=\bar{\sigma}_{t}^{-1}\left(V_{t}(s^{\prime}_{t})-\left\langle\phi_{V_{t}}(s_{t},a_{t}),\theta^{\star}\right\rangle\right). Then, we have Zt=Σ^t,μt=θ^tZ_{t}=\widehat{\Sigma}_{t},\mu_{t}=\widehat{\theta}_{t}. Moreover, |ηt|R=d|\eta_{t}|\leq R=\sqrt{d}, xt2L=d\left\|{x_{t}}\right\|_{2}\leq L=d, and for tt corresponding to (m,h)(m,h),

𝔼[ηt2|𝒢t]=σ¯t2𝕍(Pt,Vt)σ¯t2(νt+Et)1.\displaystyle\mathbb{E}[\eta_{t}^{2}|\mathcal{G}_{t}]=\bar{\sigma}_{t}^{-2}\mathbb{V}(P_{t},V_{t})\leq\bar{\sigma}_{t}^{-2}(\nu_{t}+E_{t})\leq 1.

Therefore, with probability at least 1δ1-\delta, for any t=(m1)Ht=(m-1)H for some m+m\in\mathbb{N}_{+}, which corresponds to (m1,H)(m-1,H):

θ^mθΣ^m8dln(1+dt/λ)ln(4t2/δ)+4dln(4t2/δ)+λdβ^m.\displaystyle\left\|{\widehat{\theta}_{m}-\theta^{\star}}\right\|_{\widehat{\Sigma}_{m}}\leq 8\sqrt{d\ln(1+dt/\lambda)\ln(4t^{2}/\delta)}+4\sqrt{d}\ln(4t^{2}/\delta)+\sqrt{\lambda d}\leq\widehat{\beta}_{m}.

This completes the proof. ∎
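The estimator θ̂_t in this proof is a variance-weighted ridge regression; the following minimal sketch (our own notation and a synthetic noiseless example, not the paper's code) shows the estimator with each sample down-weighted by its standard-deviation proxy σ̄_t:

```python
import numpy as np

def weighted_ridge(X, y, sigma_bar, lam=1.0):
    """theta_hat = Sigma^{-1} b with Sigma = lam*I + sum_t x_t x_t^T / sigma_t^2
    and b = sum_t x_t y_t / sigma_t^2 (variance-weighted ridge regression)."""
    w = 1.0 / sigma_bar ** 2
    Sigma = lam * np.eye(X.shape[1]) + (X * w[:, None]).T @ X
    b = X.T @ (w * y)
    return np.linalg.solve(Sigma, b), Sigma

# Noiseless sanity check: with uniform weights and tiny regularization,
# the estimator recovers the true parameter.
rng = np.random.default_rng(0)
theta_star = np.array([0.5, -0.2, 0.1])
X = rng.normal(size=(200, 3))
theta_hat, _ = weighted_ridge(X, X @ theta_star, np.ones(200), lam=1e-8)
```

In the proof, σ̄_t is itself an optimistic upper bound on the conditional standard deviation of V_t(s′_t), which is exactly what makes the normalized noise η_t satisfy 𝔼[η_t²|𝒢_t] ≤ 1.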

We are now ready to prove Theorem 9.

Proof of Theorem 9.

We condition on the event of Lemma 9, Lemma 10 and Lemma 11, which happens with probability at least 14δ1-4\delta. We decompose the regret as follows: with probability at least 1δ1-\delta,

R~M\displaystyle\widetilde{R}_{M^{\prime}} =m=1M(h=1Hchm+cf(sH+1m)V1(s1m))m=1M(h=1Hchm+cf(sH+1m)V1m(s1m))\displaystyle=\sum_{m=1}^{M^{\prime}}\left(\sum_{h=1}^{H}c^{m}_{h}+c_{f}(s^{m}_{H+1})-V^{\star}_{1}(s^{m}_{1})\right)\leq\sum_{m=1}^{M^{\prime}}\left(\sum_{h=1}^{H}c^{m}_{h}+c_{f}(s^{m}_{H+1})-V^{m}_{1}(s^{m}_{1})\right) (Lemma 10)
=m=1Mh=1H(chm+Vh+1m(sh+1m)Vhm(shm))\displaystyle=\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\left(c^{m}_{h}+V^{m}_{h+1}(s^{m}_{h+1})-V^{m}_{h}(s^{m}_{h})\right) (cf=VH+1mc_{f}=V^{m}_{H+1})
m=1Mh=1H(Vh+1m(sh+1m)PhmVh+1m+θθ^m,ϕVh+1m(shm,ahm)+β^mϕVh+1m(shm,ahm)Σ^m1)\displaystyle\leq\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\left(V^{m}_{h+1}(s^{m}_{h+1})-P^{m}_{h}V^{m}_{h+1}+\left\langle\theta^{\star}-\widehat{\theta}_{m},\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})\right\rangle+\widehat{\beta}_{m}\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})}\right\|_{\widehat{\Sigma}_{m}^{-1}}\right) (Vhm(shm)Qhm(shm,ahm)=chm+θ^m,ϕVh+1m(shm,ahm)β^mϕVh+1m(shm,ahm)Σ^m1V^{m}_{h}(s^{m}_{h})\geq Q^{m}_{h}(s^{m}_{h},a^{m}_{h})=c^{m}_{h}+\left\langle\widehat{\theta}_{m},\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})\right\rangle-\widehat{\beta}_{m}\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})}\right\|_{\widehat{\Sigma}_{m}^{-1}})
𝒪~(m=1Mh=1H𝕍(Phm,Vh+1m)+B+m=1Mh=1Hβ^mϕVh+1m(shm,ahm)Σ^m1).\displaystyle\leq\tilde{\mathcal{O}}\left(\sqrt{\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\mathbb{V}(P^{m}_{h},V^{m}_{h+1})}+B_{\star}+\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\widehat{\beta}_{m}\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})}\right\|_{\widehat{\Sigma}_{m}^{-1}}\right). (Lemma 38, Cauchy-Schwarz inequality, and Lemma 9)

The first term is of order 𝒪~(B2M+BCM)\tilde{\mathcal{O}}(\sqrt{B_{\star}^{2}M^{\prime}+B_{\star}C_{M^{\prime}}}) by Lemma 11. For the third term, define ={(m,h)[M]×[H]:ϕVh+1m(shm,ahm)/σ¯hmΣ^m11}{\mathcal{I}}=\left\{(m,h)\in[M^{\prime}]\times[H]:\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})/\bar{\sigma}^{m}_{h}}\right\|_{\widehat{\Sigma}_{m}^{-1}}\geq 1\right\} and ^={m[M]:det(Σ^m+1)>2det(Σ^m)}\widehat{{\mathcal{I}}}=\{m\in[M^{\prime}]:\det(\widehat{\Sigma}_{m+1})>2\det(\widehat{\Sigma}_{m})\}. Then,

m=1Mh=1Hβ^mϕVh+1m(shm,ahm)Σ^m1=m=1Mh=1Hβ^mσ¯hmϕVh+1m(shm,ahm)/σ¯hmΣ^m1\displaystyle\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\widehat{\beta}_{m}\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})}\right\|_{\widehat{\Sigma}_{m}^{-1}}=\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\widehat{\beta}_{m}\bar{\sigma}^{m}_{h}\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})/\bar{\sigma}^{m}_{h}}\right\|_{\widehat{\Sigma}_{m}^{-1}}
𝒪~((m,h)Bd)+m=1Mh=1Hβ^mσ¯hmmin{1,ϕVh+1m(shm,ahm)/σ¯hmΣ^m1}\displaystyle\leq\tilde{\mathcal{O}}\left(\sum_{(m,h)\in{\mathcal{I}}}B_{\star}d\right)+\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\widehat{\beta}_{m}\bar{\sigma}^{m}_{h}\min\left\{1,\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})/\bar{\sigma}^{m}_{h}}\right\|_{\widehat{\Sigma}_{m}^{-1}}\right\} (β^m=𝒪~(d)\widehat{\beta}_{m}=\tilde{\mathcal{O}}(\sqrt{d}) and Vh+1m=𝒪(B)V^{m}_{h+1}=\mathcal{O}(B_{\star}))
=(i)𝒪~(Bd2H+m^BdH+m=1Mh=1Hβ^mσ¯hmmin{1,ϕVh+1m(shm,ahm)/σ¯hmΣ^m+11})\displaystyle\overset{\text{(i)}}{=}\tilde{\mathcal{O}}\left(B_{\star}d^{2}H+\sum_{m\in\widehat{{\mathcal{I}}}}B_{\star}dH+\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\widehat{\beta}_{m}\bar{\sigma}^{m}_{h}\min\left\{1,\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})/\bar{\sigma}^{m}_{h}}\right\|_{\widehat{\Sigma}_{m+1}^{-1}}\right\}\right)
=𝒪~(Bd2H+β^Mm=1Mh=1H(σ¯hm)2m=1Mh=1Hmin{1,ϕVh+1m(shm,ahm)/σ¯hmΣ^m+112})\displaystyle=\tilde{\mathcal{O}}\left(B_{\star}d^{2}H+\widehat{\beta}_{M^{\prime}}\sqrt{\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(\bar{\sigma}^{m}_{h})^{2}}\sqrt{\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\min\left\{1,\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})/\bar{\sigma}^{m}_{h}}\right\|_{\widehat{\Sigma}_{m+1}^{-1}}^{2}\right\}}\right) (|^|=𝒪~(d)|\widehat{{\mathcal{I}}}|=\tilde{\mathcal{O}}(d) and Cauchy-Schwarz inequality)
=𝒪~(Bd2H+dm=1Mh=1H(σ¯hm)2),\displaystyle=\tilde{\mathcal{O}}\left(B_{\star}d^{2}H+d\sqrt{\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(\bar{\sigma}^{m}_{h})^{2}}\right), (β^M=𝒪~(d)\widehat{\beta}_{M^{\prime}}=\tilde{\mathcal{O}}(\sqrt{d}) and Lemma 29)

where in (i) we apply β^mσ¯hm=𝒪~(Bd)\widehat{\beta}_{m}\bar{\sigma}^{m}_{h}=\tilde{\mathcal{O}}(B_{\star}d), Lemma 30, and:

||\displaystyle|{\mathcal{I}}| =m=1Mh=1H𝕀{ϕVh+1m(shm,ahm)/σ¯hmΣ^m121}m=1Mh=1Hmin{1,ϕVh+1m(shm,ahm)/σ¯hmΣ^m12}\displaystyle=\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\mathbb{I}\left\{\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})/\bar{\sigma}^{m}_{h}}\right\|_{\widehat{\Sigma}_{m}^{-1}}^{2}\geq 1\right\}\leq\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\min\left\{1,\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})/\bar{\sigma}^{m}_{h}}\right\|_{\widehat{\Sigma}_{m}^{-1}}^{2}\right\}
|^|H+2m=1Mh=1Hmin{1,ϕVh+1m(shm,ahm)/σ¯hmΣ^m+112}\displaystyle\leq|\widehat{{\mathcal{I}}}|H+\sqrt{2}\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\min\left\{1,\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})/\bar{\sigma}^{m}_{h}}\right\|_{\widehat{\Sigma}_{m+1}^{-1}}^{2}\right\} (Lemma 30)
=𝒪~(dH).\displaystyle=\tilde{\mathcal{O}}\left(dH\right). (|^|=𝒪~(d)|\widehat{{\mathcal{I}}}|=\tilde{\mathcal{O}}(d) and Lemma 29)

It remains to bound m=1Mh=1H(σ¯hm)2\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(\bar{\sigma}^{m}_{h})^{2}. Note that

m=1Mh=1H(σ¯hm)29B2MHd+m=1Mh=1H(νhm+Ehm)\displaystyle\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(\bar{\sigma}^{m}_{h})^{2}\leq\frac{9B_{\star}^{2}M^{\prime}H}{d}+\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(\nu^{m}_{h}+E^{m}_{h})
9B2MHd+m=1Mh=1H(𝕍(Phm,Vh+1m)+2Ehm)\displaystyle\leq\frac{9B_{\star}^{2}M^{\prime}H}{d}+\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(\mathbb{V}(P^{m}_{h},V^{m}_{h+1})+2E^{m}_{h}) (Lemma 9)
9B2MHd+𝒪~(B2M+BCM+B2dMH+Bd3/2m=1Mh=1H(σ¯hm)2+B2d2H).\displaystyle\leq\frac{9B_{\star}^{2}M^{\prime}H}{d}+\tilde{\mathcal{O}}\left(B_{\star}^{2}M^{\prime}+B_{\star}C_{M^{\prime}}+B_{\star}^{2}d\sqrt{M^{\prime}H}+B_{\star}d^{3/2}\sqrt{\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(\bar{\sigma}^{m}_{h})^{2}}+B_{\star}^{2}d^{2}H\right). (Lemma 11 and Lemma 12)

By Lemma 28 and MHM/d+dH\sqrt{M^{\prime}H}\leq M^{\prime}/d+dH, we get m=1Mh=1H(σ¯hm)2=𝒪~(B2MHd+B2M+BCM+B2d3+B2d2H)\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(\bar{\sigma}^{m}_{h})^{2}=\tilde{\mathcal{O}}(\frac{B_{\star}^{2}M^{\prime}H}{d}+B_{\star}^{2}M^{\prime}+B_{\star}C_{M^{\prime}}+B_{\star}^{2}d^{3}+B_{\star}^{2}d^{2}H). Putting everything together, we get:
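For completeness, the self-bounding step can be unpacked as follows; we state the elementary inequality we believe Lemma 28 encapsulates (its exact form may carry extra logarithmic factors). If XA+BXX\leq A+B\sqrt{X} with A,B0A,B\geq 0, then solving the quadratic in X\sqrt{X} gives

```latex
\sqrt{X}\leq\frac{B+\sqrt{B^{2}+4A}}{2}
\quad\Longrightarrow\quad
X\leq\left(\frac{B+\sqrt{B^{2}+4A}}{2}\right)^{2}\leq B^{2}+2A,
```

where the last step uses (u+v)22u2+2v2(u+v)^{2}\leq 2u^{2}+2v^{2}. Applying this with X=m,h(σ¯hm)2X=\sum_{m,h}(\bar{\sigma}^{m}_{h})^{2} and B=𝒪~(Bd3/2)B=\tilde{\mathcal{O}}(B_{\star}d^{3/2}) produces the 𝒪~(B2d3)\tilde{\mathcal{O}}(B_{\star}^{2}d^{3}) term in the displayed bound.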

R~M\displaystyle\widetilde{R}_{M^{\prime}} =𝒪~(B2M+BCM+Bd2H+dB2MHd+B2M+BCM+B2d3+B2d2H)\displaystyle=\tilde{\mathcal{O}}\left(\sqrt{B_{\star}^{2}M^{\prime}+B_{\star}C_{M^{\prime}}}+B_{\star}d^{2}H+d\sqrt{\frac{B_{\star}^{2}M^{\prime}H}{d}+B_{\star}^{2}M^{\prime}+B_{\star}C_{M^{\prime}}+B_{\star}^{2}d^{3}+B_{\star}^{2}d^{2}H}\right)
=𝒪~(BdMH+BdM+dBCM+Bd2H+Bd2.5).\displaystyle=\tilde{\mathcal{O}}\left(B_{\star}\sqrt{dM^{\prime}H}+B_{\star}d\sqrt{M^{\prime}}+d\sqrt{B_{\star}C_{M^{\prime}}}+B_{\star}d^{2}H+B_{\star}d^{2.5}\right).

Now by R~M=CMMV1(s1m)\widetilde{R}_{M^{\prime}}=C_{M^{\prime}}-M^{\prime}V^{\star}_{1}(s^{m}_{1}) and Lemma 28, we get: CM=𝒪~(BM)C_{M^{\prime}}=\tilde{\mathcal{O}}(B_{\star}M^{\prime}). Plugging this back, we get R~M=𝒪~(BdMH+BdM+Bd2H+Bd2.5)\widetilde{R}_{M^{\prime}}=\tilde{\mathcal{O}}(B_{\star}\sqrt{dM^{\prime}H}+B_{\star}d\sqrt{M^{\prime}}+B_{\star}d^{2}H+B_{\star}d^{2.5}). ∎

Lemma 10.

Conditioned on the event of Lemma 9, Qhm(s,a)c~(s,a)+P~s,aVh+1mQ^{m}_{h}(s,a)\leq\widetilde{c}(s,a)+\widetilde{P}_{s,a}V^{m}_{h+1} and Vhm(s)Vh(s)3BV^{m}_{h}(s)\leq V^{\star}_{h}(s)\leq 3B_{\star}.

Proof.

Note that by Lemma 9:

θ^m,ϕVh+1m(s,a)β^mϕVh+1m(s,a)Σ^m1=P~s,aVh+1m+θ^mθ,ϕVh+1m(s,a)β^mϕVh+1m(s,a)Σ^m1\displaystyle\left\langle\widehat{\theta}_{m},\phi_{V^{m}_{h+1}}(s,a)\right\rangle-\widehat{\beta}_{m}\left\|{\phi_{V^{m}_{h+1}}(s,a)}\right\|_{\widehat{\Sigma}_{m}^{-1}}=\widetilde{P}_{s,a}V^{m}_{h+1}+\left\langle\widehat{\theta}_{m}-\theta^{\star},\phi_{V^{m}_{h+1}}(s,a)\right\rangle-\widehat{\beta}_{m}\left\|{\phi_{V^{m}_{h+1}}(s,a)}\right\|_{\widehat{\Sigma}_{m}^{-1}}
P~s,aVh+1m+θ^mθΣ^mϕVh+1m(s,a)Σ^m1β^mϕVh+1m(s,a)Σ^m1P~s,aVh+1m.\displaystyle\leq\widetilde{P}_{s,a}V^{m}_{h+1}+\left\|{\widehat{\theta}_{m}-\theta^{\star}}\right\|_{\widehat{\Sigma}_{m}}\left\|{\phi_{V^{m}_{h+1}}(s,a)}\right\|_{\widehat{\Sigma}_{m}^{-1}}-\widehat{\beta}_{m}\left\|{\phi_{V^{m}_{h+1}}(s,a)}\right\|_{\widehat{\Sigma}_{m}^{-1}}\leq\widetilde{P}_{s,a}V^{m}_{h+1}.

The first statement then follows from the definition of QhmQ^{m}_{h}. For any m+m\in\mathbb{N}_{+}, we prove the second statement by induction on h=H+1,,1h=H+1,\ldots,1. The base case h=H+1h=H+1 is clearly true by the definition of VH+1mV^{m}_{H+1}. For hHh\leq H, note that Qhm(s,a)c~(s,a)+P~s,aVh+1mc(s,a)+P~s,aVQ(s,a)Q^{m}_{h}(s,a)\leq\widetilde{c}(s,a)+\widetilde{P}_{s,a}V^{m}_{h+1}\leq c(s,a)+\widetilde{P}_{s,a}V^{\star}\leq Q^{\star}(s,a) by the induction step and the first statement. Thus, Vhm(s)max{0,minaQhm(s,a)}V(s)V^{m}_{h}(s)\leq\max\{0,\min_{a}Q^{m}_{h}(s,a)\}\leq V^{\star}(s). ∎

Lemma 11.

Conditioned on the event of Lemma 10, with probability at least 1δ1-\delta, m=1Mh=1H𝕍(Phm,Vh+1m)=𝒪~(B2M+BCM)\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\mathbb{V}(P^{m}_{h},V^{m}_{h+1})=\tilde{\mathcal{O}}\left(B_{\star}^{2}M^{\prime}+B_{\star}C_{M^{\prime}}\right) for any M+M^{\prime}\in\mathbb{N}_{+}.

Proof.

Conditioned on the event of Lemma 10, we have with probability at least 1δ1-\delta:

m=1Mh=1H𝕍(Phm,Vh+1m)=m=1Mh=1HPhm(Vh+1m)2(PhmVh+1m)2\displaystyle\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\mathbb{V}(P^{m}_{h},V^{m}_{h+1})=\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}P^{m}_{h}(V^{m}_{h+1})^{2}-(P^{m}_{h}V^{m}_{h+1})^{2}
=m=1Mh=1H(Phm(Vh+1m)2Vh+1m(sh+1m)2)+m=1Mh=1H(Vh+1m(sh+1m)2Vhm(shm)2)+m=1Mh=1H(Vhm(shm)2(PhmVh+1m)2)\displaystyle=\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\left(P^{m}_{h}(V^{m}_{h+1})^{2}-V^{m}_{h+1}(s^{m}_{h+1})^{2}\right)+\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\left(V^{m}_{h+1}(s^{m}_{h+1})^{2}-V^{m}_{h}(s^{m}_{h})^{2}\right)+\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\left(V^{m}_{h}(s^{m}_{h})^{2}-(P^{m}_{h}V^{m}_{h+1})^{2}\right)
=(i)𝒪~(m=1Mh=1H𝕍(Phm,(Vh+1m)2)+B2M+BCM)=(ii)𝒪~(Bm=1Mh=1H𝕍(Phm,Vh+1m)+B2M+BCM).\displaystyle\overset{\text{(i)}}{=}\tilde{\mathcal{O}}\left(\sqrt{\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\mathbb{V}(P^{m}_{h},(V^{m}_{h+1})^{2})}+B_{\star}^{2}M^{\prime}+B_{\star}C_{M^{\prime}}\right)\overset{\text{(ii)}}{=}\tilde{\mathcal{O}}\left(B_{\star}\sqrt{\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\mathbb{V}(P^{m}_{h},V^{m}_{h+1})}+B_{\star}^{2}M^{\prime}+B_{\star}C_{M^{\prime}}\right).

Here, (ii) is by Lemma 34, and (i) is by Lemma 38, VH+1m(s)2BV^{m}_{H+1}(s)\leq 2B_{\star} and:

m=1Mh=1HVhm(shm)2(PhmVh+1m)2\displaystyle\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}V^{m}_{h}(s^{m}_{h})^{2}-(P^{m}_{h}V^{m}_{h+1})^{2} =m=1Mh=1H(Vhm(shm)+PhmVh+1m)(Vhm(shm)PhmVh+1m)\displaystyle=\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(V^{m}_{h}(s^{m}_{h})+P^{m}_{h}V^{m}_{h+1})(V^{m}_{h}(s^{m}_{h})-P^{m}_{h}V^{m}_{h+1})
=m=1Mh=1H(Vhm(shm)+PhmVh+1m)(max{0,Qhm(shm,ahm)}PhmVh+1m)6BCM.\displaystyle=\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(V^{m}_{h}(s^{m}_{h})+P^{m}_{h}V^{m}_{h+1})(\max\{0,Q^{m}_{h}(s^{m}_{h},a^{m}_{h})\}-P^{m}_{h}V^{m}_{h+1})\leq 6B_{\star}C_{M^{\prime}}. (0Vhm(s)3B0\leq V^{m}_{h}(s)\leq 3B_{\star} and Qhm(shm,ahm)chm+PhmVh+1mQ^{m}_{h}(s^{m}_{h},a^{m}_{h})\leq c^{m}_{h}+P^{m}_{h}V^{m}_{h+1} by Lemma 10)

By Lemma 28, we get m=1Mh=1H𝕍(Phm,Vh+1m)=𝒪~(B2M+BCM)\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\mathbb{V}(P^{m}_{h},V^{m}_{h+1})=\tilde{\mathcal{O}}\left(B_{\star}^{2}M^{\prime}+B_{\star}C_{M^{\prime}}\right). ∎
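Step (ii) above controls the variance of (Vh+1m)2(V^{m}_{h+1})^{2} by that of Vh+1mV^{m}_{h+1}; we believe Lemma 34 is (a version of) the following elementary Lipschitz bound. Since 0Vh+1m3B0\leq V^{m}_{h+1}\leq 3B_{\star} and vv2v\mapsto v^{2} is 6B6B_{\star}-Lipschitz on [0,3B][0,3B_{\star}], writing the variance over an i.i.d. pair s,s′′Phms^{\prime},s^{\prime\prime}\sim P^{m}_{h} gives

```latex
\mathbb{V}(P^{m}_{h},(V^{m}_{h+1})^{2})
=\frac{1}{2}\mathbb{E}\left[\left(V^{m}_{h+1}(s^{\prime})^{2}-V^{m}_{h+1}(s^{\prime\prime})^{2}\right)^{2}\right]
\leq\frac{(6B_{\star})^{2}}{2}\mathbb{E}\left[\left(V^{m}_{h+1}(s^{\prime})-V^{m}_{h+1}(s^{\prime\prime})\right)^{2}\right]
=36B_{\star}^{2}\,\mathbb{V}(P^{m}_{h},V^{m}_{h+1}),
```

so that m,h𝕍(Phm,(Vh+1m)2)=𝒪(Bm,h𝕍(Phm,Vh+1m))\sqrt{\sum_{m,h}\mathbb{V}(P^{m}_{h},(V^{m}_{h+1})^{2})}=\mathcal{O}(B_{\star}\sqrt{\sum_{m,h}\mathbb{V}(P^{m}_{h},V^{m}_{h+1})}).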

Lemma 12.

m=1Mh=1HEhm=𝒪~(B2dMH+Bd3/2m=1Mh=1H(σ¯hm)2+B2d2H)\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}E^{m}_{h}=\tilde{\mathcal{O}}\left(B_{\star}^{2}d\sqrt{M^{\prime}H}+B_{\star}d^{3/2}\sqrt{\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(\bar{\sigma}^{m}_{h})^{2}}+B_{\star}^{2}d^{2}H\right) for any M+M^{\prime}\in\mathbb{N}_{+}.

Proof.

Note that:

m=1Mh=1HEhm=m=1Mh=1Hmin{9B2,β~mϕ(Vh+1m)2(shm,ahm)Σ~m1}+min{9B2,6BβˇmϕVh+1m(shm,ahm)Σ^m1}\displaystyle\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}E^{m}_{h}=\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\min\left\{9B_{\star}^{2},\widetilde{\beta}_{m}\left\|{\phi_{(V^{m}_{h+1})^{2}}(s^{m}_{h},a^{m}_{h})}\right\|_{\widetilde{\Sigma}_{m}^{-1}}\right\}+\min\left\{9B_{\star}^{2},6B_{\star}\check{\beta}_{m}\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})}\right\|_{\widehat{\Sigma}_{m}^{-1}}\right\}
m=1Mh=1Hβ~mmin{1,ϕ(Vh+1m)2(shm,ahm)Σ~m1}+6Bm=1Mh=1Hβˇmσ¯hmmin{1,ϕVh+1m(shm,ahm)/σ¯hmΣ^m1}.\displaystyle\leq\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\widetilde{\beta}_{m}\min\left\{1,\left\|{\phi_{(V^{m}_{h+1})^{2}}(s^{m}_{h},a^{m}_{h})}\right\|_{\widetilde{\Sigma}_{m}^{-1}}\right\}+6B_{\star}\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\check{\beta}_{m}\bar{\sigma}^{m}_{h}\min\left\{1,\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})/\bar{\sigma}^{m}_{h}}\right\|_{\widehat{\Sigma}_{m}^{-1}}\right\}. (β~m9B2\widetilde{\beta}_{m}\geq 9B_{\star}^{2} and βˇmσ¯hm3B\check{\beta}_{m}\bar{\sigma}^{m}_{h}\geq 3B_{\star})

For the first sum, define ~={m[M]:det(Σ~m+1)>2det(Σ~m)}\widetilde{{\mathcal{I}}}=\{m\in[M^{\prime}]:\det(\widetilde{\Sigma}_{m+1})>2\det(\widetilde{\Sigma}_{m})\}. Then by Lemma 30,

m=1Mh=1Hβ~mmin{1,ϕ(Vh+1m)2(shm,ahm)Σ~m1}\displaystyle\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\widetilde{\beta}_{m}\min\left\{1,\left\|{\phi_{(V^{m}_{h+1})^{2}}(s^{m}_{h},a^{m}_{h})}\right\|_{\widetilde{\Sigma}_{m}^{-1}}\right\}
𝒪~(m~B2dH)+2m~h=1Hβ~mmin{1,ϕ(Vh+1m)2(shm,ahm)Σ~m+11}\displaystyle\leq\tilde{\mathcal{O}}\left(\sum_{m\in\widetilde{{\mathcal{I}}}}B_{\star}^{2}\sqrt{d}H\right)+\sqrt{2}\sum_{m\notin\widetilde{{\mathcal{I}}}}\sum_{h=1}^{H}\widetilde{\beta}_{m}\min\left\{1,\left\|{\phi_{(V^{m}_{h+1})^{2}}(s^{m}_{h},a^{m}_{h})}\right\|_{\widetilde{\Sigma}_{m+1}^{-1}}\right\} (β~m=𝒪~(B2d)\widetilde{\beta}_{m}=\tilde{\mathcal{O}}(B_{\star}^{2}\sqrt{d}))
=𝒪~(B2d3/2H+β~MMHm=1Mh=1Hmin{1,ϕ(Vh+1m)2(shm,ahm)Σ~m+112}).\displaystyle=\tilde{\mathcal{O}}\left(B_{\star}^{2}d^{3/2}H+\widetilde{\beta}_{M^{\prime}}\sqrt{M^{\prime}H\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\min\left\{1,\left\|{\phi_{(V^{m}_{h+1})^{2}}(s^{m}_{h},a^{m}_{h})}\right\|_{\widetilde{\Sigma}_{m+1}^{-1}}^{2}\right\}}\right). (|~|=𝒪~(d)|\widetilde{{\mathcal{I}}}|=\tilde{\mathcal{O}}(d) and Cauchy-Schwarz inequality)
=𝒪~(B2d3/2H+B2dMH).\displaystyle=\tilde{\mathcal{O}}\left(B_{\star}^{2}d^{3/2}H+B_{\star}^{2}d\sqrt{M^{\prime}H}\right). (Lemma 29)

For the second sum, similarly define ^={m[M]:det(Σ^m+1)>2det(Σ^m)}\widehat{{\mathcal{I}}}=\{m\in[M^{\prime}]:\det(\widehat{\Sigma}_{m+1})>2\det(\widehat{\Sigma}_{m})\}. Then,

6Bm=1Mh=1Hβˇmσ¯hmmin{1,ϕVh+1m(shm,ahm)/σ¯hmΣ^m1}\displaystyle 6B_{\star}\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\check{\beta}_{m}\bar{\sigma}^{m}_{h}\min\left\{1,\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})/\bar{\sigma}^{m}_{h}}\right\|_{\widehat{\Sigma}_{m}^{-1}}\right\}
𝒪~(m^B2dH)+62BβˇMm^h=1Hσ¯hmmin{1,ϕVh+1m(shm,ahm)/σ¯hmΣ^m+11}\leq\tilde{\mathcal{O}}\left(\sum_{m\in\widehat{{\mathcal{I}}}}B_{\star}^{2}dH\right)+6\sqrt{2}B_{\star}\check{\beta}_{M^{\prime}}\sum_{m\notin\widehat{{\mathcal{I}}}}\sum_{h=1}^{H}\bar{\sigma}^{m}_{h}\min\left\{1,\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})/\bar{\sigma}^{m}_{h}}\right\|_{\widehat{\Sigma}_{m+1}^{-1}}\right\} (βˇmσ¯hm=𝒪~(Bd)\check{\beta}_{m}\bar{\sigma}^{m}_{h}=\tilde{\mathcal{O}}(B_{\star}d))
=𝒪~(B2d2H+Bdm=1Mh=1H(σ¯hm)2m=1Mh=1Hmin{1,ϕVh+1m(shm,ahm)/σ¯hmΣ^m+112}).\displaystyle=\tilde{\mathcal{O}}\left(B_{\star}^{2}d^{2}H+B_{\star}d\sqrt{\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(\bar{\sigma}^{m}_{h})^{2}}\sqrt{\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\min\left\{1,\left\|{\phi_{V^{m}_{h+1}}(s^{m}_{h},a^{m}_{h})/\bar{\sigma}^{m}_{h}}\right\|_{\widehat{\Sigma}_{m+1}^{-1}}^{2}\right\}}\right). (|^|=𝒪~(d)|\widehat{{\mathcal{I}}}|=\tilde{\mathcal{O}}(d) and Cauchy-Schwarz inequality)
=𝒪~(B2d2H+Bd3/2m=1Mh=1H(σ¯hm)2).\displaystyle=\tilde{\mathcal{O}}\left(B_{\star}^{2}d^{2}H+B_{\star}d^{3/2}\sqrt{\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}(\bar{\sigma}^{m}_{h})^{2}}\right). (Lemma 29)

Combining the two bounds completes the proof. ∎

B.7 An instance of SSP with gapmingapmin\text{\rm gap}_{\min}^{\prime}\ll\text{\rm gap}_{\min}

Consider an SSP with four states {s0,s1,s2,s3}\{s_{0},s_{1},s_{2},s_{3}\} (in addition to the goal state gg) and two actions {a1,a2}\{a_{1},a_{2}\}. At s0s_{0}, we have c(s0,a)=0c(s_{0},a)=0 and P(s1|s0,a)=pP(s_{1}|s_{0},a)=p, P(s0|s0,a)=1pP(s_{0}|s_{0},a)=1-p for a{a1,a2}a\in\{a_{1},a_{2}\} and some p>0p>0. At s1s_{1}, we have c(s1,a1)=0c(s_{1},a_{1})=0, c(s1,a2)=ϵc(s_{1},a_{2})=\epsilon, and P(s2|s1,a1)=1P(s_{2}|s_{1},a_{1})=1, P(s3|s1,a2)=1P(s_{3}|s_{1},a_{2})=1. At s2s_{2}, we have c(s2,a1)=c(s2,a2)=1c(s_{2},a_{1})=c(s_{2},a_{2})=1 and P(g|s2,a)=qP(g|s_{2},a)=q, P(s1|s2,a)=1qP(s_{1}|s_{2},a)=1-q for any aa and some q(0,1)q\in(0,1). At s3s_{3}, we have c(s3,a)=0c(s_{3},a)=0 and P(g|s3,a)=1P(g|s_{3},a)=1 for a{a1,a2}a\in\{a_{1},a_{2}\}. The role of s0s_{0} here is to create the possibility that the learner will visit state s1s_{1} at any time step. Then under our finite-horizon approximation, we have

gapminmin(s,a):gapH(s,a)>0gapH(s,a)=c(s1,a2)c(s1,a1)=ϵ.\text{\rm gap}_{\min}^{\prime}\leq\min_{(s,a):\text{\rm gap}_{H}(s,a)>0}\text{\rm gap}_{H}(s,a)=c(s_{1},a_{2})-c(s_{1},a_{1})=\epsilon.

On the other hand, when 1q>ϵ\frac{1}{q}>\epsilon, gapmin=Q(s1,a1)V(s1)=1qϵ\text{\rm gap}_{\min}=Q^{\star}(s_{1},a_{1})-V^{\star}(s_{1})=\frac{1}{q}-\epsilon, and 1q\frac{1}{q} can be arbitrarily large.
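To make the separation concrete, the values of this instance can be computed by value iteration; the following illustrative Python sketch (state indexing is ours) verifies that V(s1)=ε, while a policy that commits to a1 at s1 pays 1/q in expectation:

```python
import numpy as np

def ssp_values(p=0.5, q=0.1, eps=0.01, tol=1e-12, iters=10000):
    """Value iteration on the instance above (illustrative sketch).

    State indexing (ours): 0 = s0, 1 = s1, 2 = s2, 3 = s3, 4 = g (goal).
    """
    V = np.zeros(5)
    for _ in range(iters):
        V_new = np.array([
            p * V[1] + (1 - p) * V[0],      # s0: cost 0, moves to s1 w.p. p
            min(V[2], eps + V[3]),          # s1: a1 -> s2 (cost 0), a2 -> s3 (cost eps)
            1 + q * V[4] + (1 - q) * V[1],  # s2: cost 1, reaches g w.p. q
            0 + V[4],                       # s3: cost 0, then the goal
            0.0,                            # g: absorbing, cost 0
        ])
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # Expected cost of committing to a1 at s1: C = 1 + (1 - q) * C  =>  C = 1/q.
    return V, 1.0 / q

V_star, C_a1 = ssp_values(p=0.5, q=0.1, eps=0.01)
```

Shrinking qq widens the gap of action a1a_{1} at s1s_{1} without affecting ϵ\epsilon, matching the discussion above.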

B.8 Omitted Details in Section 3.3

We first prove a lemma bounding Qh(s,a)Qhm(s,a)Q^{\star}_{h}(s,a)-Q^{m}_{h}(s,a) and another lemma on the regret decomposition w.r.t. the gap functions gaph(s,a)\text{\rm gap}_{h}(s,a) in ~\widetilde{{\mathcal{M}}}.

Lemma 13.

Suppose B=3BB=3B_{\star}. With probability at least 1δ1-\delta, for all m+,h[H]m\in\mathbb{N}_{+},h\in[H], and (s,a)𝒮+×𝒜(s,a)\in{\mathcal{S}}_{+}\times{\mathcal{A}}, Algorithm 2 ensures:

0Qh(s,a)Q^hm(s,a)P~s,a(Vh+1Vh+1m)+2βmϕ(s,a)Λm1.\displaystyle 0\leq Q^{\star}_{h}(s,a)-\widehat{Q}^{m}_{h}(s,a)\leq\widetilde{P}_{s,a}(V^{\star}_{h+1}-V^{m}_{h+1})+2\beta_{m}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}.
Proof.

Note that:

whwhm=Λm1(λI+m=1m1h=1Hϕhmϕhm)whΛm1m=1m1h=1Hϕhm(chm+Vh+1m(sh+1m))\displaystyle w^{\star}_{h}-w^{m}_{h}=\Lambda_{m}^{-1}\left(\lambda I+\sum_{m^{\prime}=1}^{m-1}\sum_{h^{\prime}=1}^{H}\phi^{m^{\prime}}_{h^{\prime}}{\phi^{m^{\prime}}_{h^{\prime}}}^{\top}\right)w^{\star}_{h}-\Lambda_{m}^{-1}\sum_{m^{\prime}=1}^{m-1}\sum_{h^{\prime}=1}^{H}\phi^{m^{\prime}}_{h^{\prime}}(c^{m^{\prime}}_{h^{\prime}}+V^{m}_{h+1}(s^{m^{\prime}}_{h^{\prime}+1}))
=λΛm1wh+Λm1m=1m1h=1Hϕhm[chm+PhmVh+1]Λm1m=1m1h=1Hϕhm(chm+Vh+1m(sh+1m))\displaystyle=\lambda\Lambda_{m}^{-1}w^{\star}_{h}+\Lambda_{m}^{-1}\sum_{m^{\prime}=1}^{m-1}\sum_{h^{\prime}=1}^{H}\phi^{m^{\prime}}_{h^{\prime}}[c^{m^{\prime}}_{h^{\prime}}+P^{m^{\prime}}_{h^{\prime}}V^{\star}_{h+1}]-\Lambda_{m}^{-1}\sum_{m^{\prime}=1}^{m-1}\sum_{h^{\prime}=1}^{H}\phi^{m^{\prime}}_{h^{\prime}}(c^{m^{\prime}}_{h^{\prime}}+V^{m}_{h+1}(s^{m^{\prime}}_{h^{\prime}+1}))
=λΛm1wh+Λm1m=1m1h=1HϕhmPhm[Vh+1Vh+1m]+ϵhm\displaystyle=\lambda\Lambda_{m}^{-1}w^{\star}_{h}+\Lambda_{m}^{-1}\sum_{m^{\prime}=1}^{m-1}\sum_{h^{\prime}=1}^{H}\phi^{m^{\prime}}_{h^{\prime}}P^{m^{\prime}}_{h^{\prime}}[V^{\star}_{h+1}-V^{m}_{h+1}]+\epsilon^{m}_{h} (Define ϵhm=Λm1m=1m1h=1Hϕhm[PhmVh+1mVh+1m(sh+1m)]\epsilon^{m}_{h}=\Lambda_{m}^{-1}\sum_{m^{\prime}=1}^{m-1}\sum_{h^{\prime}=1}^{H}\phi^{m^{\prime}}_{h^{\prime}}[P^{m^{\prime}}_{h^{\prime}}V^{m}_{h+1}-V^{m}_{h+1}(s^{m^{\prime}}_{h^{\prime}+1})])
=λΛm1wh+Λm1m=1m1h=1Hϕhmϕhm(Vh+1(s)Vh+1m(s))𝑑μ(s)+ϵhm\displaystyle=\lambda\Lambda_{m}^{-1}w^{\star}_{h}+\Lambda_{m}^{-1}\sum_{m^{\prime}=1}^{m-1}\sum_{h^{\prime}=1}^{H}\phi^{m^{\prime}}_{h^{\prime}}{\phi^{m^{\prime}}_{h^{\prime}}}^{\top}\int(V^{\star}_{h+1}(s^{\prime})-V^{m}_{h+1}(s^{\prime}))d\mu(s^{\prime})+\epsilon^{m}_{h}
=λΛm1wh+(Vh+1(s)Vh+1m(s))𝑑μ(s)λΛm1(Vh+1(s)Vh+1m(s))𝑑μ(s)+ϵhm.\displaystyle=\lambda\Lambda_{m}^{-1}w^{\star}_{h}+\int(V^{\star}_{h+1}(s^{\prime})-V^{m}_{h+1}(s^{\prime}))d\mu(s^{\prime})-\lambda\Lambda_{m}^{-1}\int(V^{\star}_{h+1}(s^{\prime})-V^{m}_{h+1}(s^{\prime}))d\mu(s^{\prime})+\epsilon^{m}_{h}.

Therefore,

Qh(s,a)Q^hm(s,a)\displaystyle Q^{\star}_{h}(s,a)-\widehat{Q}^{m}_{h}(s,a) =ϕ(s,a)(whwhm)+βmϕ(s,a)Λm1\displaystyle=\phi(s,a)^{\top}(w^{\star}_{h}-w^{m}_{h})+\beta_{m}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}
λϕ(s,a)Λm1whξ1+Ps,a(Vh+1Vh+1m)λϕ(s,a)Λm1(Vh+1(s)Vh+1m(s))𝑑μ(s)ξ2\displaystyle\leq\underbrace{\lambda\phi(s,a)^{\top}\Lambda_{m}^{-1}w^{\star}_{h}}_{\xi_{1}}+P_{s,a}(V^{\star}_{h+1}-V^{m}_{h+1})\underbrace{-\lambda\phi(s,a)^{\top}\Lambda_{m}^{-1}\int(V^{\star}_{h+1}(s^{\prime})-V^{m}_{h+1}(s^{\prime}))d\mu(s^{\prime})}_{\xi_{2}}
+ϕ(s,a)ϵhmξ3+βmϕ(s,a)Λm1.\displaystyle\qquad+\underbrace{\phi(s,a)^{\top}\epsilon^{m}_{h}}_{\xi_{3}}+\beta_{m}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}.

For ξ1\xi_{1}, note that wh2=θ+Vh+1(s)𝑑μ(s)2(1+3B)d\left\|{w^{\star}_{h}}\right\|_{2}=\left\|{\theta^{\star}+\int V^{\star}_{h+1}(s^{\prime})d\mu(s^{\prime})}\right\|_{2}\leq(1+3B_{\star})\sqrt{d} by Vh+1(s)V(s)+2B3BV^{\star}_{h+1}(s)\leq V^{\star}(s)+2B_{\star}\leq 3B_{\star} for any s𝒮s\in{\mathcal{S}}, h[H]h\in[H]. Therefore, by the Cauchy-Schwarz inequality,

|ξ1|ϕ(s,a)Λm1λwhΛm1ϕ(s,a)Λm1λwh2βm4ϕ(s,a)Λm1,\displaystyle|\xi_{1}|\leq\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}\left\|{\lambda w^{\star}_{h}}\right\|_{\Lambda_{m}^{-1}}\leq\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}\sqrt{\lambda}\left\|{w^{\star}_{h}}\right\|_{2}\leq\frac{\beta_{m}}{4}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}},

where the second inequality is by λmax(Λm1)1λ\lambda_{\max}(\Lambda_{m}^{-1})\leq\frac{1}{\lambda}. Similarly, for ξ2\xi_{2},

|ξ2|\displaystyle|\xi_{2}| ϕ(s,a)Λm1λ(Vh+1(s)Vh+1m(s))𝑑μ(s)Λm1\displaystyle\leq\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}\left\|{\lambda\int(V^{\star}_{h+1}(s^{\prime})-V^{m}_{h+1}(s^{\prime}))d\mu(s^{\prime})}\right\|_{\Lambda_{m}^{-1}} (Cauchy-Schwarz inequality)
λϕ(s,a)Λm1(Vh+1(s)Vh+1m(s))𝑑μ(s)2\displaystyle\leq\sqrt{\lambda}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}\left\|{\int(V^{\star}_{h+1}(s^{\prime})-V^{m}_{h+1}(s^{\prime}))d\mu(s^{\prime})}\right\|_{2} (λmax(Λm1)1λ\lambda_{\max}(\Lambda_{m}^{-1})\leq\frac{1}{\lambda})
3Bλdϕ(s,a)Λm1βm4ϕ(s,a)Λm1.\displaystyle\leq 3B_{\star}\sqrt{\lambda d}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}\leq\frac{\beta_{m}}{4}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}. (Vh+1(s)Vh+1m(s)[0,3B]V^{\star}_{h+1}(s)-V^{m}_{h+1}(s)\in[0,3B_{\star}] for any s𝒮s\in{\mathcal{S}})

For ξ3\xi_{3}, by Eq. (6), ϵhmΛmβm2\left\|{\epsilon^{m}_{h}}\right\|_{\Lambda_{m}}\leq\frac{\beta_{m}}{2} with probability at least 1δ1-\delta. Thus, |ξ3|ϕ(s,a)Λm1ϵhmΛmβm2ϕ(s,a)Λm1|\xi_{3}|\leq\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}\left\|{\epsilon^{m}_{h}}\right\|_{\Lambda_{m}}\leq\frac{\beta_{m}}{2}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}.

To conclude, we have for all m,h,(s,a)m,h,(s,a):

0Qh(s,a)Q^hm(s,a)P~s,a(Vh+1Vh+1m)+2βmϕ(s,a)Λm1.\displaystyle 0\leq Q^{\star}_{h}(s,a)-\widehat{Q}^{m}_{h}(s,a)\leq\widetilde{P}_{s,a}(V^{\star}_{h+1}-V^{m}_{h+1})+2\beta_{m}\left\|{\phi(s,a)}\right\|_{\Lambda_{m}^{-1}}.

This completes the proof. ∎
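The weights whmw^{m}_{h} manipulated in this proof come from a ridge regression onto the targets chm+Vh+1m(sh+1m)c^{m^{\prime}}_{h^{\prime}}+V^{m}_{h+1}(s^{m^{\prime}}_{h^{\prime}+1}); below is a self-contained sketch of that update and of the optimistic estimate ϕ(s,a)wβϕ(s,a)Λ1\phi(s,a)^{\top}w-\beta\left\|{\phi(s,a)}\right\|_{\Lambda^{-1}}, on hypothetical synthetic data with names of our own choosing:

```python
import numpy as np

def lsvi_step(Phi, cost, V_next_vals, lam=1.0, beta=0.0):
    """Ridge solve w = Lambda^{-1} sum_t phi_t (c_t + V_{h+1}(s'_t)),
    with Lambda = lam*I + sum_t phi_t phi_t^T, plus the optimistic
    estimate Qhat(phi) = phi^T w - beta * ||phi||_{Lambda^{-1}}."""
    d = Phi.shape[1]
    Lam = lam * np.eye(d) + Phi.T @ Phi
    w = np.linalg.solve(Lam, Phi.T @ (cost + V_next_vals))
    def Q_hat(phi_sa):
        return float(phi_sa @ w - beta * np.sqrt(phi_sa @ np.linalg.solve(Lam, phi_sa)))
    return w, Q_hat

# Noiseless sanity check: with beta = 0 and tiny regularization, the
# estimate recovers the true Bellman backup phi^T w_star.
rng = np.random.default_rng(1)
w_star = np.array([0.3, 0.7])
Phi = rng.normal(size=(100, 2))
targets = Phi @ w_star                  # plays the role of c + V_{h+1}(s')
w, Q_hat = lsvi_step(Phi, targets, np.zeros(100), lam=1e-8, beta=0.0)
```

The error terms ξ1\xi_{1} and ξ2\xi_{2} in the proof quantify exactly the bias introduced by the regularizer λI\lambda I in this solve, while ξ3\xi_{3} captures the sampling noise in the targets.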

Lemma 14.

With probability at least 1δ1-\delta, m=1MV1πm(s1m)V1(s1m)2m=1Mh=1Hgaph(shm,ahm)+𝒪(BHln(M/δ))\sum_{m=1}^{M^{\prime}}V^{\pi^{m}}_{1}(s^{m}_{1})-V^{\star}_{1}(s^{m}_{1})\leq 2\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\text{\rm gap}_{h}(s^{m}_{h},a^{m}_{h})+\mathcal{O}(B_{\star}H\ln(M^{\prime}/\delta)) for any given M+M^{\prime}\in\mathbb{N}_{+}.

Proof.

By the extended value difference lemma (Shani et al., 2020, Lemma 1):

V1πm(s1m)V1(s1m)\displaystyle V^{\pi^{m}}_{1}(s^{m}_{1})-V^{\star}_{1}(s^{m}_{1}) =𝔼[h=1Ha(πm(a|shm)π~(a|shm))Qh(shm,a)|πm]\displaystyle=\mathbb{E}\left[\left.\sum_{h=1}^{H}\sum_{a}(\pi^{m}(a|s^{m}_{h})-{\widetilde{\pi}^{\star}}(a|s^{m}_{h}))Q^{\star}_{h}(s^{m}_{h},a)\right|\pi^{m}\right]
=𝔼[h=1HQh(shm,ahm)Vh(shm)|πm]=𝔼[h=1Hgaph(shm,ahm)|πm],\displaystyle=\mathbb{E}\left[\left.\sum_{h=1}^{H}Q^{\star}_{h}(s^{m}_{h},a^{m}_{h})-V^{\star}_{h}(s^{m}_{h})\right|\pi^{m}\right]=\mathbb{E}\left[\left.\sum_{h=1}^{H}\text{\rm gap}_{h}(s^{m}_{h},a^{m}_{h})\right|\pi^{m}\right],

where π~{\widetilde{\pi}^{\star}} is the optimal policy of ~\widetilde{{\mathcal{M}}}. Therefore, by Lemma 39 and gaph(s,a)=𝒪(B)\text{\rm gap}_{h}(s,a)=\mathcal{O}(B_{\star}), with probability at least 1δ1-\delta,

m=1MV1πm(s1m)V1(s1m)2m=1Mh=1Hgaph(shm,ahm)+𝒪(BHlnMδ).\sum_{m=1}^{M^{\prime}}V^{\pi^{m}}_{1}(s^{m}_{1})-V^{\star}_{1}(s^{m}_{1})\leq 2\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\text{\rm gap}_{h}(s^{m}_{h},a^{m}_{h})+\mathcal{O}\left(B_{\star}H\ln\frac{M^{\prime}}{\delta}\right).

This completes the proof. ∎

The next lemma provides an upper bound on the sum of gap functions satisfying some constraints. We denote by hm{\mathcal{F}}^{m}_{h} the interaction history up to (shm,ahm)(s^{m}_{h},a^{m}_{h}) in ~\widetilde{{\mathcal{M}}}.

Lemma 15.

Suppose B=3BB=3B_{\star}, {zhm}m=1M\{z^{m}_{h}\}_{m=1}^{M^{\prime}} are indicator functions such that zhmhmz^{m}_{h}\in{\mathcal{F}}^{m}_{h} for some M+,h[H]M^{\prime}\in\mathbb{N}_{+},h\in[H], and define Mz=m=1MzhmM_{z}=\sum_{m=1}^{M^{\prime}}z^{m}_{h}. Then with probability at least 1δ1-\delta, Algorithm 2 ensures

m=1Mzhmh=hHgaphm=𝒪(d3B2HMzlndBMHδ+d2BHln1.5dBMHδ).\sum_{m=1}^{M^{\prime}}z^{m}_{h}\sum_{h^{\prime}=h}^{H}\text{\rm gap}^{m}_{h^{\prime}}=\mathcal{O}\left(\sqrt{d^{3}B_{\star}^{2}HM_{z}}\ln\frac{dB_{\star}M^{\prime}H}{\delta}+d^{2}B_{\star}H\ln^{1.5}\frac{dB_{\star}M^{\prime}H}{\delta}\right).
Proof.

Denote by mim_{i} the ii-th interval among [M][M^{\prime}] such that zhmi=1z^{m_{i}}_{h}=1. Then,

i=1Mzh=hHQh(shmi,ahmi)Vh(shmi)+i=1Mzh=hHVh(shmi)Vhmi(shmi)\displaystyle\sum_{i=1}^{M_{z}}\sum_{h^{\prime}=h}^{H}Q^{\star}_{h^{\prime}}(s^{m_{i}}_{h^{\prime}},a^{m_{i}}_{h^{\prime}})-V^{\star}_{h^{\prime}}(s^{m_{i}}_{h^{\prime}})+\sum_{i=1}^{M_{z}}\sum_{h^{\prime}=h}^{H}V^{\star}_{h^{\prime}}(s^{m_{i}}_{h^{\prime}})-V^{m_{i}}_{h^{\prime}}(s^{m_{i}}_{h^{\prime}})
=i=1Mzh=hHQh(shmi,ahmi)Qhmi(shmi,ahmi)\displaystyle=\sum_{i=1}^{M_{z}}\sum_{h^{\prime}=h}^{H}Q^{\star}_{h^{\prime}}(s^{m_{i}}_{h^{\prime}},a^{m_{i}}_{h^{\prime}})-Q^{m_{i}}_{h^{\prime}}(s^{m_{i}}_{h^{\prime}},a^{m_{i}}_{h^{\prime}}) (Qhmi(shmi,ahmi)=Vhmi(shmi)Q^{m_{i}}_{h^{\prime}}(s^{m_{i}}_{h^{\prime}},a^{m_{i}}_{h^{\prime}})=V^{m_{i}}_{h^{\prime}}(s^{m_{i}}_{h^{\prime}}) by Lemma 5 and B=3BB=3B_{\star})
i=1Mzh=hHPhmi(Vh+1Vh+1mi)+2i=1Mzh=hHβmiϕhmiΛmi1\displaystyle\leq\sum_{i=1}^{M_{z}}\sum_{h^{\prime}=h}^{H}P^{m_{i}}_{h^{\prime}}(V^{\star}_{h^{\prime}+1}-V^{m_{i}}_{h^{\prime}+1})+2\sum_{i=1}^{M_{z}}\sum_{h^{\prime}=h}^{H}\beta_{m_{i}}\left\|{\phi^{m_{i}}_{h^{\prime}}}\right\|_{\Lambda^{-1}_{m_{i}}} (Lemma 13)
\displaystyle=\sum_{i=1}^{M_z}\sum_{h^{\prime}=h}^{H}(V^{\star}_{h^{\prime}+1}(s^{m_{i}}_{h^{\prime}+1})-V^{m_{i}}_{h^{\prime}+1}(s^{m_{i}}_{h^{\prime}+1}))+\sum_{i=1}^{M_z}\sum_{h^{\prime}=h}^{H}\left(\epsilon^{m_{i}}_{h^{\prime}}+2\beta_{m_{i}}\left\|{\phi^{m_{i}}_{h^{\prime}}}\right\|_{\Lambda^{-1}_{m_{i}}}\right),

where ϵhmi=Phmi(Vh+1Vh+1mi)(Vh+1(sh+1mi)Vh+1mi(sh+1mi))\epsilon^{m_{i}}_{h}=P^{m_{i}}_{h}(V^{\star}_{h+1}-V^{m_{i}}_{h+1})-(V^{\star}_{h+1}(s^{m_{i}}_{h+1})-V^{m_{i}}_{h+1}(s^{m_{i}}_{h+1})). Reorganizing terms, and by VH+1=VH+1m=cfV^{\star}_{H+1}=V^{m}_{H+1}=c_{f}, Vh+1m(s)Vh+1(s)V^{m}_{h+1}(s)\leq V^{\star}_{h+1}(s) (Lemma 5), we get:

\displaystyle\sum_{i=1}^{M_z}\sum_{h^{\prime}=h}^{H}\text{\rm gap}^{m_{i}}_{h^{\prime}} =\sum_{i=1}^{M_z}\sum_{h^{\prime}=h}^{H}Q^{\star}_{h^{\prime}}(s^{m_{i}}_{h^{\prime}},a^{m_{i}}_{h^{\prime}})-V^{\star}_{h^{\prime}}(s^{m_{i}}_{h^{\prime}})\leq\sum_{i=1}^{M_z}\sum_{h^{\prime}=h}^{H}\left(\epsilon^{m_{i}}_{h^{\prime}}+2\beta_{m_{i}}\left\|{\phi^{m_{i}}_{h^{\prime}}}\right\|_{\Lambda^{-1}_{m_{i}}}\right)
\displaystyle=\sum_{m=1}^{M^{\prime}}\sum_{h^{\prime}=h}^{H}z^{m}_{h}\epsilon^{m}_{h^{\prime}}+2\sum_{i=1}^{M_z}\sum_{h^{\prime}=h}^{H}\beta_{m_{i}}\left\|{\phi^{m_{i}}_{h^{\prime}}}\right\|_{\Lambda^{-1}_{m_{i}}}.

For the first term, by zhmϵhmh+1mz^{m}_{h}\epsilon^{m}_{h^{\prime}}\in{\mathcal{F}}^{m}_{h^{\prime}+1} for hhh^{\prime}\geq h and Lemma 38, with probability at least 1δ1-\delta,

m=1Mh=hHzhmϵhm\displaystyle\sum_{m=1}^{M^{\prime}}\sum_{h^{\prime}=h}^{H}z^{m}_{h}\epsilon^{m}_{h^{\prime}} 3m=1Mh=hH𝔼[(zhmϵhm)2|hm]+𝒪(BlnBMHδ)\displaystyle\leq 3\sqrt{\sum_{m=1}^{M^{\prime}}\sum_{h^{\prime}=h}^{H}\mathbb{E}[(z^{m}_{h}\epsilon^{m}_{h^{\prime}})^{2}|{\mathcal{F}}^{m}_{h^{\prime}}]}+\mathcal{O}\left(B_{\star}\ln\frac{B_{\star}M^{\prime}H}{\delta}\right)
=𝒪(BHMzlnBMHδ+BlnBMHδ).\displaystyle=\mathcal{O}\left(B_{\star}\sqrt{HM_{z}\ln\frac{B_{\star}M^{\prime}H}{\delta}}+B_{\star}\ln\frac{B_{\star}M^{\prime}H}{\delta}\right). (zhmhmz^{m}_{h}\in{\mathcal{F}}^{m}_{h^{\prime}} and |ϵhm|=𝒪(B)|\epsilon^{m}_{h^{\prime}}|=\mathcal{O}(B_{\star}))

For the second term, by Lemma 18,

\sum_{i=1}^{M_z}\sum_{h^{\prime}=h}^{H}\beta_{m_{i}}\left\|{\phi^{m_{i}}_{h^{\prime}}}\right\|_{\Lambda^{-1}_{m_{i}}}=\mathcal{O}\left(\sqrt{d^{3}B_{\star}^{2}HM_{z}}\ln\frac{dB_{\star}M^{\prime}H}{\delta}+d^{2}B_{\star}H\ln^{1.5}\frac{dB_{\star}M^{\prime}H}{\delta}\right).

Plugging these back completes the proof. ∎

We are now ready to prove a bound on mV1πm(s1m)V1(s1m)\sum_{m}V^{\pi^{m}}_{1}(s^{m}_{1})-V^{\star}_{1}(s^{m}_{1}), which is the key to proving Theorem 4.

Lemma 16.

For any M^{\prime}\geq 3, Algorithm 2 with B=3B_{\star} and any horizon H satisfying H\geq\lceil\frac{35B_{\star}}{c_{\min}}\ln(8B_{\star}M^{\prime}H)\rceil ensures, with probability at least 1-3\delta-\nicefrac{{1}}{{4B_{\star}M^{\prime}H}}, that \sum_{m=1}^{M^{\prime}}V^{\pi^{m}}_{1}(s^{m}_{1})-V^{\star}_{1}(s^{m}_{1})=\mathcal{O}\left(\frac{d^{3}B_{\star}^{4}}{c_{\min}^{2}\text{\rm gap}_{\min}}\ln^{5}(dB_{\star}M^{\prime}H/\delta)\right).

Proof.

First note that Vhπ(s)3BV^{\pi^{\star}}_{h}(s)\leq 3B_{\star} for any s𝒮,h[H]s\in{\mathcal{S}},h\in[H]. Thus, the expected hitting time of π\pi^{\star} in ~\widetilde{{\mathcal{M}}} is at most 3Bcmin\frac{3B_{\star}}{c_{\min}} starting from any state and layer. Without loss of generality, we assume that HH is an even integer. Note that ~\widetilde{{\mathcal{M}}} can be treated as an SSP instance where the learner teleports to the goal state at the (H+1)(H+1)-th step. Thus by Lemma 17 and H35Bcminln(8BMH)H\geq\frac{35B_{\star}}{c_{\min}}\ln(8B_{\star}M^{\prime}H), when hH2+1h\leq\frac{H}{2}+1, P(sH+1g|sh=s,π)14BMHP(s_{H+1}\neq g|s_{h}=s,\pi^{\star})\leq\frac{1}{4B_{\star}M^{\prime}H} for any state ss, and for any hH2h\leq\frac{H}{2}:

Qh(s,a)Q(s,a)Qhπ(s,a)Q(s,a)=Ps,a(Vh+1πV)2BmaxsP(sH+1g|π,sh+1=s)12MH.Q^{\star}_{h}(s,a)-Q^{\star}(s,a)\leq Q^{\pi^{\star}}_{h}(s,a)-Q^{\star}(s,a)=P_{s,a}(V^{\pi^{\star}}_{h+1}-V^{\star})\leq 2B_{\star}\max_{s}P(s_{H+1}\neq g|\pi^{\star},s_{h+1}=s)\leq\frac{1}{2M^{\prime}H}.

It also implies \left|\text{\rm gap}_{h}(s,a)-\text{\rm gap}(s,a)\right|\leq\frac{1}{M^{\prime}H} for h\leq\frac{H}{2}, since:

|gaph(s,a)gap(s,a)||Qh(s,a)Q(s,a)|+|Vh(s)V(s)|12MH+maxa|Qh(s,a)Q(s,a)|1MH.\displaystyle\left|\text{\rm gap}_{h}(s,a)-\text{\rm gap}(s,a)\right|\leq\left|Q^{\star}_{h}(s,a)-Q^{\star}(s,a)\right|+\left|V^{\star}_{h}(s)-V^{\star}(s)\right|\leq\frac{1}{2M^{\prime}H}+\max_{a}\left|Q^{\star}_{h}(s,a)-Q^{\star}(s,a)\right|\leq\frac{1}{M^{\prime}H}.

Define gaphm=gaph(shm,ahm)\text{\rm gap}^{m}_{h}=\text{\rm gap}_{h}(s^{m}_{h},a^{m}_{h}) and a threshold η=3MH\eta=\frac{3}{M^{\prime}H}. By Lemma 14, it suffices to bound m=1Mh=1Hgaphm\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\text{\rm gap}^{m}_{h}. Note that

m=1Mh=1Hgaphm\displaystyle\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\text{\rm gap}^{m}_{h} m=1Mh=1Hgaphm𝕀{gaphm>η}+𝒪(m=1Mh=1HBMH)\displaystyle\leq\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\text{\rm gap}^{m}_{h}\mathbb{I}\left\{\text{\rm gap}^{m}_{h}>\eta\right\}+\mathcal{O}\left(\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\frac{B_{\star}}{M^{\prime}H}\right)
m=1MhH/2gaphm𝕀{gaphm>η}+m=1Mh>H/2gaphm+𝒪(B).\displaystyle\leq\sum_{m=1}^{M^{\prime}}\sum_{h\leq H/2}\text{\rm gap}^{m}_{h}\mathbb{I}\left\{\text{\rm gap}^{m}_{h}>\eta\right\}+\sum_{m=1}^{M^{\prime}}\sum_{h>H/2}\text{\rm gap}^{m}_{h}+\mathcal{O}\left(B_{\star}\right).

For the first term, define N=log2(3B+1η)=𝒪(ln(BMH))N=\lceil\log_{2}(\frac{3B_{\star}+1}{\eta})\rceil=\mathcal{O}(\ln(B_{\star}M^{\prime}H)), and

n=min{n[N]:(s,a),hH2 such that gaph(s,a)(η2n1,η2n]}.n^{\star}=\min\left\{n\in[N]:\exists(s^{\prime},a^{\prime}),h^{\prime}\leq\frac{H}{2}\text{ such that }\text{\rm gap}_{h^{\prime}}(s^{\prime},a^{\prime})\in(\eta 2^{n-1},\eta 2^{n}]\right\}.

Then by the definition of nn^{\star} and |gap(s,a)gaph(s,a)|1MH|\text{\rm gap}(s,a)-\text{\rm gap}_{h}(s,a)|\leq\frac{1}{M^{\prime}H} for hH2h\leq\frac{H}{2}, there exist (s,a),hH2(s^{\prime},a^{\prime}),h^{\prime}\leq\frac{H}{2} such that

gapmingap(s,a)gaph(s,a)+1MHη2n+1MH43η2n.\text{\rm gap}_{\min}\leq\text{\rm gap}(s^{\prime},a^{\prime})\leq\text{\rm gap}_{h^{\prime}}(s^{\prime},a^{\prime})+\frac{1}{M^{\prime}H}\leq\eta 2^{n^{\star}}+\frac{1}{M^{\prime}H}\leq\frac{4}{3}\cdot\eta 2^{n^{\star}}. (7)

Moreover, for each nn\in\mathbb{N} and hH2h\leq\frac{H}{2}, define zhm=𝕀{gaphm>η2n}z^{m}_{h}=\mathbb{I}\{\text{\rm gap}^{m}_{h}>\eta 2^{n}\}. Then by Lemma 15, with probability at least 1δ2(n+1)21-\frac{\delta}{2(n+1)^{2}},

η2nMzm=1Mzhmgaphm=𝒪(d3B2HMzlndBMH(n+1)δ+d2BHln1.5dBMH(n+1)δ),\eta 2^{n}M_{z}\leq\sum_{m=1}^{M^{\prime}}z^{m}_{h}\text{\rm gap}^{m}_{h}=\mathcal{O}\left(\sqrt{d^{3}B_{\star}^{2}HM_{z}}\ln\frac{dB_{\star}M^{\prime}H(n+1)}{\delta}+d^{2}B_{\star}H\ln^{1.5}\frac{dB_{\star}M^{\prime}H(n+1)}{\delta}\right),

where M_{z}=\sum_{m=1}^{M^{\prime}}z^{m}_{h}. Solving a quadratic inequality w.r.t. \sqrt{M_z} gives:

m=1M𝕀{gaphm>η2n}=𝒪(d3B2Hη24nln2dBMH(n+1)δ+d2BHη2nln1.5dBMH(n+1)δ).\sum_{m=1}^{M^{\prime}}\mathbb{I}\{\text{\rm gap}^{m}_{h}>\eta 2^{n}\}=\mathcal{O}\left(\frac{d^{3}B_{\star}^{2}H}{\eta^{2}4^{n}}\ln^{2}\frac{dB_{\star}M^{\prime}H(n+1)}{\delta}+\frac{d^{2}B_{\star}H}{\eta 2^{n}}\ln^{1.5}\frac{dB_{\star}M^{\prime}H(n+1)}{\delta}\right). (8)
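To spell out the quadratic step: writing c=\eta 2^{n}, a=\mathcal{O}\left(\sqrt{d^{3}B_{\star}^{2}H}\ln\frac{dB_{\star}M^{\prime}H(n+1)}{\delta}\right), b=\mathcal{O}\left(d^{2}B_{\star}H\ln^{1.5}\frac{dB_{\star}M^{\prime}H(n+1)}{\delta}\right), and x=\sqrt{M_z}, the inequality above reads cx^{2}\leq ax+b, so that

x\leq\frac{a+\sqrt{a^{2}+4cb}}{2c}\leq\frac{a}{c}+\sqrt{\frac{b}{c}},\qquad M_z=x^{2}\leq\frac{2a^{2}}{c^{2}}+\frac{2b}{c},

which is exactly Eq. (8).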

By a union bound, Eq. (8) holds for all nn\in\mathbb{N} simultaneously with probability at least 1δ1-\delta. Therefore, the first term is bounded as follows:

m=1MhH/2gaphm𝕀{gaphm>η}\displaystyle\sum_{m=1}^{M^{\prime}}\sum_{h\leq H/2}\text{\rm gap}^{m}_{h}\mathbb{I}\left\{\text{\rm gap}^{m}_{h}>\eta\right\}
=m=1MhH/2n=nNgaphm𝕀{gaphm(η2n1,η2n]}m=1MhH/2n=nNη2n𝕀{gaphm>η2n1}\displaystyle=\sum_{m=1}^{M^{\prime}}\sum_{h\leq H/2}\sum_{n=n^{\star}}^{N}\text{\rm gap}^{m}_{h}\mathbb{I}\{\text{\rm gap}^{m}_{h}\in(\eta 2^{n-1},\eta 2^{n}]\}\leq\sum_{m=1}^{M^{\prime}}\sum_{h\leq H/2}\sum_{n=n^{\star}}^{N}\eta 2^{n}\mathbb{I}\{\text{\rm gap}^{m}_{h}>\eta 2^{n-1}\}
=𝒪(hH/2n=nN(d3B2Hη2nln2dBMH(n+1)δ+d2BHln1.5dBMH(n+1)δ))\displaystyle=\mathcal{O}\left(\sum_{h\leq H/2}\sum_{n=n^{\star}}^{N}\left(\frac{d^{3}B_{\star}^{2}H}{\eta 2^{n}}\ln^{2}\frac{dB_{\star}M^{\prime}H(n+1)}{\delta}+d^{2}B_{\star}H\ln^{1.5}\frac{dB_{\star}M^{\prime}H(n+1)}{\delta}\right)\right) (Eq. (8))
=𝒪(d3B2H2η2nln3dBMHδ+d2BH2ln2.5dBMHδ)\displaystyle=\mathcal{O}\left(\frac{d^{3}B_{\star}^{2}H^{2}}{\eta 2^{n^{\star}}}\ln^{3}\frac{dB_{\star}M^{\prime}H}{\delta}+d^{2}B_{\star}H^{2}\ln^{2.5}\frac{dB_{\star}M^{\prime}H}{\delta}\right) (N=𝒪(ln(BMH))N=\mathcal{O}(\ln(B_{\star}M^{\prime}H)))
=𝒪(d3B2H2gapminln3dBMHδ+d2BH2ln2.5dBMHδ).\displaystyle=\mathcal{O}\left(\frac{d^{3}B_{\star}^{2}H^{2}}{\text{\rm gap}_{\min}}\ln^{3}\frac{dB_{\star}M^{\prime}H}{\delta}+d^{2}B_{\star}H^{2}\ln^{2.5}\frac{dB_{\star}M^{\prime}H}{\delta}\right). (Eq. (7))

For the second term, note that:

m=1Mh>H/2gaphmm=1Mh>H/2gaphm𝕀{hH2:gaphm>η}ξ1+m=1Mh>H/2gaphm𝕀{hH2:gaphmη}ξ2.\displaystyle\sum_{m=1}^{M^{\prime}}\sum_{h>H/2}\text{\rm gap}^{m}_{h}\leq\underbrace{\sum_{m=1}^{M^{\prime}}\sum_{h>H/2}\text{\rm gap}^{m}_{h}\mathbb{I}\left\{\exists h\leq\frac{H}{2}:\text{\rm gap}^{m}_{h}>\eta\right\}}_{\xi_{1}}+\underbrace{\sum_{m=1}^{M^{\prime}}\sum_{h>H/2}\text{\rm gap}^{m}_{h}\mathbb{I}\left\{\forall h\leq\frac{H}{2}:\text{\rm gap}^{m}_{h}\leq\eta\right\}}_{\xi_{2}}.

For ξ1\xi_{1}, define zH2+1m=𝕀{hH2:gaphm>η}z^{m}_{\frac{H}{2}+1}=\mathbb{I}\left\{\exists h\leq\frac{H}{2}:\text{\rm gap}^{m}_{h}>\eta\right\} and Mz=m=1MzH2+1mM_{z}=\sum_{m=1}^{M^{\prime}}z^{m}_{\frac{H}{2}+1}. Then by Lemma 15, with probability at least 1δ1-\delta,

ξ1=m=1MzH2+1mh>H/2gaphm=𝒪(d3B2HMzlndBMHδ+d2BHln1.5dBMHδ).\displaystyle\xi_{1}=\sum_{m=1}^{M^{\prime}}z^{m}_{\frac{H}{2}+1}\sum_{h>H/2}\text{\rm gap}^{m}_{h}=\mathcal{O}\left(\sqrt{d^{3}B_{\star}^{2}HM_{z}}\ln\frac{dB_{\star}M^{\prime}H}{\delta}+d^{2}B_{\star}H\ln^{1.5}\frac{dB_{\star}M^{\prime}H}{\delta}\right).

It suffices to bound MzM_{z}. Note that by the definition of nn^{\star}, we have mins,a,hH/2,gaph(s,a)>ηgaph(s,a)(η2n1,η2n]\min_{s,a,h\leq H/2,\text{\rm gap}_{h}(s,a)>\eta}\text{\rm gap}_{h}(s,a)\in(\eta 2^{n^{\star}-1},\eta 2^{n^{\star}}]. Thus, by Eq. (8),

Mz\displaystyle M_{z} =m=1M𝕀{hH2:gaphm>η}m=1MhH/2𝕀{gaphm>η}m=1MhH/2𝕀{gaphm>η2n1}\displaystyle=\sum_{m=1}^{M^{\prime}}\mathbb{I}\left\{\exists h\leq\frac{H}{2}:\text{\rm gap}^{m}_{h}>\eta\right\}\leq\sum_{m=1}^{M^{\prime}}\sum_{h\leq H/2}\mathbb{I}\left\{\text{\rm gap}^{m}_{h}>\eta\right\}\leq\sum_{m=1}^{M^{\prime}}\sum_{h\leq H/2}\mathbb{I}\left\{\text{\rm gap}^{m}_{h}>\eta 2^{n^{\star}-1}\right\}
=𝒪(d3B2H2η24n1ln2dBMHnδ+d2BH2η2n1ln1.5dBMHnδ).\displaystyle=\mathcal{O}\left(\frac{d^{3}B_{\star}^{2}H^{2}}{\eta^{2}4^{n^{\star}-1}}\ln^{2}\frac{dB_{\star}M^{\prime}Hn^{\star}}{\delta}+\frac{d^{2}B_{\star}H^{2}}{\eta 2^{n^{\star}-1}}\ln^{1.5}\frac{dB_{\star}M^{\prime}Hn^{\star}}{\delta}\right).

Plugging this back and by Eq. (7), we get:

ξ1\displaystyle\xi_{1} =𝒪(d3B2H1.5gapminln2dBMHδ+d2BHln1.5dBMHδ).\displaystyle=\mathcal{O}\left(\frac{d^{3}B_{\star}^{2}H^{1.5}}{\text{\rm gap}_{\min}}\ln^{2}\frac{dB_{\star}M^{\prime}H}{\delta}+d^{2}B_{\star}H\ln^{1.5}\frac{dB_{\star}M^{\prime}H}{\delta}\right).

For ξ2\xi_{2}, denote by π~m{\widetilde{\pi}}^{m} the near-optimal policy “closest” to πm\pi^{m}, such that:

π~m(s,h)={πm(s,h),hH/2 and gaph(s,πm(s,h))η,π(s,h),hH/2 and gaph(s,πm(s,h))>η,π(s,h),h>H/2.\displaystyle{\widetilde{\pi}}^{m}(s,h)=\begin{cases}\pi^{m}(s,h),&h\leq H/2\text{ and }\text{\rm gap}_{h}(s,\pi^{m}(s,h))\leq\eta,\\ \pi^{\star}(s,h),&h\leq H/2\text{ and }\text{\rm gap}_{h}(s,\pi^{m}(s,h))>\eta,\\ \pi^{\star}(s,h),&h>H/2.\end{cases}

Note that gaph(s,π~m(s,h))η\text{\rm gap}_{h}(s,{\widetilde{\pi}}^{m}(s,h))\leq\eta for all s,hs,h. By the extended value difference lemma (Shani et al., 2020, Lemma 1), Vhπ~m(s)Vh(s)=𝔼[h=hHgaph(sh,ah)|sh=s,π~m]3MBV^{{\widetilde{\pi}}^{m}}_{h}(s)-V^{\star}_{h}(s)=\mathbb{E}[\sum_{h^{\prime}=h}^{H}\text{\rm gap}_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})|s_{h}=s,{\widetilde{\pi}}^{m}]\leq\frac{3}{M^{\prime}}\leq B_{\star} for all s,hs,h by M3M^{\prime}\geq 3. Therefore, Vhπ~m(s)4BV^{{\widetilde{\pi}}^{m}}_{h}(s)\leq 4B_{\star} for all s,hs,h. Denote by m{\mathcal{F}}_{m} the interaction history before interval mm. Then, πm,π~mm\pi^{m},{\widetilde{\pi}}^{m}\in{\mathcal{F}}_{m}, and

P(h>H/2gaphm𝕀{hH/2:gaphmη}=0|πm,m)\displaystyle P\left(\left.\sum_{h>H/2}\text{\rm gap}^{m}_{h}\mathbb{I}\left\{\forall h\leq H/2:\text{\rm gap}^{m}_{h}\leq\eta\right\}=0\right|\pi^{m},{\mathcal{F}}_{m}\right)
P(hH/2,gaphm>η or hH/2,gaphmη,sH/2+1=g|πm,m)\displaystyle\geq P\left(\left.\exists h\leq H/2,\text{\rm gap}^{m}_{h}>\eta\text{ or }\forall h\leq H/2,\text{\rm gap}^{m}_{h}\leq\eta,s_{H/2+1}=g\right|\pi^{m},{\mathcal{F}}_{m}\right)
=P(hH/2,π~m(shm,h)πm(shm,h) or hH/2,π~m(shm,h)=πm(shm,h),sH/2+1=g|πm,m)\displaystyle=P\left(\left.\exists h\leq H/2,{\widetilde{\pi}}^{m}(s^{m}_{h},h)\neq\pi^{m}(s^{m}_{h},h)\text{ or }\forall h\leq H/2,{\widetilde{\pi}}^{m}(s^{m}_{h},h)=\pi^{m}(s^{m}_{h},h),s_{H/2+1}=g\right|\pi^{m},{\mathcal{F}}_{m}\right)
=P(hH/2,π~m(shm,h)πm(shm,h) or hH/2,π~m(shm,h)=πm(shm,h),sH/2+1=g|π~m,m)\displaystyle=P\left(\left.\exists h\leq H/2,{\widetilde{\pi}}^{m}(s^{m}_{h},h)\neq\pi^{m}(s^{m}_{h},h)\text{ or }\forall h\leq H/2,{\widetilde{\pi}}^{m}(s^{m}_{h},h)=\pi^{m}(s^{m}_{h},h),s_{H/2+1}=g\right|{\widetilde{\pi}}^{m},{\mathcal{F}}_{m}\right)
P(sH/2+1=g|π~m,m)114BMH,\displaystyle\geq P\left(\left.s_{H/2+1}=g\right|{\widetilde{\pi}}^{m},{\mathcal{F}}_{m}\right)\geq 1-\frac{1}{4B_{\star}M^{\prime}H},

where in the last inequality we apply Lemma 17, the fact that Vhπ~m(s)4BV^{{\widetilde{\pi}}^{m}}_{h}(s)\leq 4B_{\star} for all s,hs,h, and H35Bcminln(8BMH)H\geq\frac{35B_{\star}}{c_{\min}}\ln(8B_{\star}M^{\prime}H). Now by Lemma 14 and H=35Bcminln(8BMH)H=\lceil\frac{35B_{\star}}{c_{\min}}\ln(8B_{\star}M^{\prime}H)\rceil, we have:

m=1MV1πm(s1m)V1(s1m)\displaystyle\sum_{m=1}^{M^{\prime}}V^{\pi^{m}}_{1}(s^{m}_{1})-V^{\star}_{1}(s^{m}_{1}) 2m=1Mh=1Hgaph(shm,ahm)+𝒪(BHln(M/δ))\displaystyle\leq 2\sum_{m=1}^{M^{\prime}}\sum_{h=1}^{H}\text{\rm gap}_{h}(s^{m}_{h},a^{m}_{h})+\mathcal{O}(B_{\star}H\ln(M^{\prime}/\delta))
=𝒪(d3B2H2gapminln3dBMHδ+d2BH2ln2.5(dBMH/δ))\displaystyle=\mathcal{O}\left(\frac{d^{3}B_{\star}^{2}H^{2}}{\text{\rm gap}_{\min}}\ln^{3}\frac{dB_{\star}M^{\prime}H}{\delta}+d^{2}B_{\star}H^{2}\ln^{2.5}(dB_{\star}M^{\prime}H/\delta)\right)
=𝒪(d3B4cmin2gapminln5(dBMH/δ)).\displaystyle=\mathcal{O}\left(\frac{d^{3}B_{\star}^{4}}{c_{\min}^{2}\text{\rm gap}_{\min}}\ln^{5}(dB_{\star}M^{\prime}H/\delta)\right).

We are now ready to prove Theorem 4.

Proof of Theorem 4.

First note that for a given H\geq 4T_{\star}\ln(4K), by Lemma 2 and Theorem 1, we have M=\tilde{\mathcal{O}}\left(K+d^{3}H\right) with probability at least 1-4\delta for any \delta\in(0,1) when running Algorithm 1 with Algorithm 2 and horizon H. That is, there exist b>0 and a constant p\geq 1 such that M\leq b(K+d^{3}H)\ln^{p}(dB_{\star}HK/\delta). Now let M^{\prime}=b(K+d^{3}H)\ln^{p}(dB_{\star}HK/\delta). To obtain the regret bound in Lemma 16, it suffices to have H\geq\frac{35B_{\star}}{c_{\min}}\ln(8B_{\star}M^{\prime}H). Plugging in the definition of M^{\prime} and using x>\ln x for x>0, it suffices to take H=\frac{b^{\prime}B_{\star}}{c_{\min}}\ln(\frac{dB_{\star}K}{\delta c_{\min}}) for some constant b^{\prime}>0. To conclude, we have M\leq M^{\prime} with probability at least 1-4\delta when running Algorithm 1 with Algorithm 2 and horizon H=\frac{b^{\prime}B_{\star}}{c_{\min}}\ln(\frac{dB_{\star}K}{\delta c_{\min}}). Moreover, with probability at least 1-3\delta-1/4B_{\star}M^{\prime}H, we have \sum_{m=1}^{\min\{M,M^{\prime}\}}V^{\pi^{m}}_{1}(s^{m}_{1})-V^{\star}_{1}(s^{m}_{1})=\mathcal{O}(\frac{d^{3}B_{\star}^{4}}{c_{\min}^{2}\text{\rm gap}_{\min}}\ln^{5}(dB_{\star}M^{\prime}H/\delta)).

To obtain an expected regret bound, we further need to bound the cost under the low-probability “bad” event. We make the following modification to Algorithm 1: whenever the counter m=nM^{\prime} for some n\in\mathbb{N}_{+}, we restart Algorithm 2. These ideas are summarized in Algorithm 7. Now consider running Algorithm 7 with Algorithm 2, horizon H=\frac{b^{\prime}B_{\star}}{c_{\min}}\ln(\frac{dB_{\star}K}{\delta c_{\min}}), failure probability \delta=\frac{1}{4M^{\prime}H}, and restart threshold M^{\prime}.
By the choice of MM^{\prime}, we have P(M>M)4δP(M>M^{\prime})\leq 4\delta. By a recursive argument, we have P(M>nM)(4δ)nP(M>n\cdot M^{\prime})\leq(4\delta)^{n} for n+n\in\mathbb{N}_{+}. We have by Lemma 1 and Lemma 16:

𝔼[RK]\displaystyle\mathbb{E}[R_{K}] 𝔼[R~M]+B𝔼[R~min{M,M}]+𝔼[max{0,MM}(H+2B)]+B\displaystyle\leq\mathbb{E}[\widetilde{R}_{M}]+B_{\star}\leq\mathbb{E}[\widetilde{R}_{\min\{M,M^{\prime}\}}]+\mathbb{E}[\max\{0,M-M^{\prime}\}(H+2B_{\star})]+B_{\star}
=𝒪(d3B4cmin2gapminln5(dBMH))=𝒪(d3B4cmin2gapminln5dBKcmin),\displaystyle=\mathcal{O}\left(\frac{d^{3}B_{\star}^{4}}{c_{\min}^{2}\text{\rm gap}_{\min}}\ln^{5}(dB_{\star}M^{\prime}H)\right)=\mathcal{O}\left(\frac{d^{3}B_{\star}^{4}}{c_{\min}^{2}\text{\rm gap}_{\min}}\ln^{5}\frac{dB_{\star}K}{c_{\min}}\right),

where we apply

𝔼[max{0,MM}(H+2B)]n=1P(M(nM,(n+1)M])nM(H+2B)\displaystyle\mathbb{E}[\max\{0,M-M^{\prime}\}(H+2B_{\star})]\leq\sum_{n=1}^{\infty}P(M\in(nM^{\prime},(n+1)M^{\prime}])\cdot nM^{\prime}(H+2B_{\star})
n=1nP(M>nM)M(H+2B)n=1n(4δ)nM(H+2B)16δM(H+2B)14δ=𝒪(1).\displaystyle\leq\sum_{n=1}^{\infty}n\cdot P(M>nM^{\prime})M^{\prime}(H+2B_{\star})\leq\sum_{n=1}^{\infty}n(4\delta)^{n}M^{\prime}(H+2B_{\star})\leq\frac{16\delta M^{\prime}(H+2B_{\star})}{1-4\delta}=\mathcal{O}\left(1\right).
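The last step uses the identity \sum_{n\geq 1}nx^{n}=\frac{x}{(1-x)^{2}} for x\in(0,1): with x=4\delta,

\sum_{n=1}^{\infty}n(4\delta)^{n}=\frac{4\delta}{(1-4\delta)^{2}}\leq\frac{16\delta}{1-4\delta},

where the inequality holds since \frac{1}{1-4\delta}\leq 4 whenever \delta\leq\frac{3}{16}, which is satisfied by \delta=\frac{1}{4M^{\prime}H}.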

This completes the proof. ∎

Algorithm 7 Finite-Horizon Approximation of SSP from (Cohen et al., 2021)

Input: Algorithm 𝔄\mathfrak{A} for finite-horizon MDP ~\widetilde{{\mathcal{M}}} with horizon H4Tln(4K)H\geq 4T_{\star}\ln(4K) and restart threshold MM^{\prime}.

Initialize: interval counter m1m\leftarrow 1.

for k=1,\ldots,K do

      Set s^{m}_{1}\leftarrow s_{\text{init}}.
      while s^{m}_{1}\neq g do
            Feed initial state s^{m}_{1} to \mathfrak{A}.
            for h=1,\ldots,H do
                  Receive action a^{m}_{h} from \mathfrak{A}.
                  if s^{m}_{h}\neq g then
                        Play action a^{m}_{h}, observe cost c^{m}_{h}=c(s^{m}_{h},a^{m}_{h}) and next state s^{m}_{h+1}.
                  else  Set c^{m}_{h}=0 and s^{m}_{h+1}=g.
                  Feed c^{m}_{h} and s^{m}_{h+1} to \mathfrak{A}.
            Set s^{m+1}_{1}=s^{m}_{H+1} and m\leftarrow m+1. if m=nM^{\prime} for some n\in\mathbb{N}_{+} then  Reinitialize \mathfrak{A}.
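As a concrete illustration, the control flow of Algorithm 7 can be sketched in a few lines of Python. The toy environment and the random learner standing in for 𝔄 below are purely illustrative assumptions, not part of the paper:

```python
import random

class ToyEnv:
    """Toy SSP: from state 's', every action reaches the goal 'g' w.p. 1/2 at unit cost."""
    def reset(self):
        return 's'
    def step(self, state, action):
        next_state = 'g' if random.random() < 0.5 else 's'
        return 1.0, next_state  # (cost, next state)

class RandomLearner:
    """Stand-in for the finite-horizon algorithm A; it ignores all feedback."""
    def start_interval(self, s1):
        pass
    def act(self, h, s):
        return 0
    def observe(self, h, c, s_next):
        pass

def finite_horizon_approximation(env, learner, K, H, M_restart):
    random.seed(0)
    m = 1  # interval counter
    for _ in range(K):
        s1 = env.reset()
        while s1 != 'g':                # run H-step intervals until the goal is reached
            learner.start_interval(s1)
            s = s1
            for h in range(1, H + 1):
                a = learner.act(h, s)
                if s != 'g':
                    c, s = env.step(s, a)
                else:
                    c, s = 0.0, 'g'    # pad with zero cost after reaching the goal
                learner.observe(h, c, s)
            s1 = s                      # next interval starts from s_{H+1}
            m += 1
            if m % M_restart == 0:      # restart threshold M'
                learner = RandomLearner()  # reinitialize A
    return m

intervals = finite_horizon_approximation(ToyEnv(), RandomLearner(), K=5, H=8, M_restart=100)
```

Each of the K episodes consumes at least one interval, so the final counter is at least K+1; the restart branch fires only once the counter hits a multiple of the threshold M'.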
            
      

B.9 Extra Lemmas for Section 3

Lemma 17.

(Rosenberg & Mansour, 2020, Lemma 6) Let π\pi be a policy with expected hitting time at most τ\tau starting from any state. Then for any δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, π\pi takes no more than 4τln2δ4\tau\ln\frac{2}{\delta} steps to reach the goal state.

Lemma 18.

For an arbitrary set of intervals [M]{\mathcal{I}}\subseteq[M^{\prime}] for some M+M^{\prime}\in\mathbb{N}_{+}, we have:

mh=1HβmϕhmΛm1=𝒪(d3B2H||lndBMHδ+d2BHln1.5dBMHδ).\sum_{m\in{\mathcal{I}}}\sum_{h=1}^{H}\beta_{m}\left\|{\phi^{m}_{h}}\right\|_{\Lambda_{m}^{-1}}=\mathcal{O}\left(\sqrt{d^{3}B^{2}H|{\mathcal{I}}|}\ln\frac{dBM^{\prime}H}{\delta}+d^{2}BH\ln^{1.5}\frac{dBM^{\prime}H}{\delta}\right).
Proof.

We bound the sum by considering two cases:

mh=1HβmϕhmΛm1\displaystyle\sum_{m\in{\mathcal{I}}}\sum_{h=1}^{H}\beta_{m}\left\|{\phi^{m}_{h}}\right\|_{\Lambda_{m}^{-1}} βMm:det(Λm+1)2det(Λm)h=1HϕhmΛm1+βMm:det(Λm+1)>2det(Λm)h=1HϕhmΛm1\displaystyle\leq\beta_{M^{\prime}}\sum_{m\in{\mathcal{I}}:\det(\Lambda_{m+1})\leq 2\det(\Lambda_{m})}\sum_{h=1}^{H}\left\|{\phi^{m}_{h}}\right\|_{\Lambda_{m}^{-1}}+\beta_{M^{\prime}}\sum_{m\in{\mathcal{I}}:\det(\Lambda_{m+1})>2\det(\Lambda_{m})}\sum_{h=1}^{H}\left\|{\phi^{m}_{h}}\right\|_{\Lambda_{m}^{-1}}
2βMmh=1HϕhmΛm+11+𝒪(βMdln(MH/λ)H)\displaystyle\leq\sqrt{2}\beta_{M^{\prime}}\sum_{m\in{\mathcal{I}}}\sum_{h=1}^{H}\left\|{\phi^{m}_{h}}\right\|_{\Lambda^{-1}_{m+1}}+\mathcal{O}\left(\beta_{M^{\prime}}d\ln(M^{\prime}H/\lambda)H\right) (2ΛmΛm+12\Lambda_{m}\succcurlyeq\Lambda_{m+1} by Lemma 30, and det(ΛM)/det(Λ0)((λ+MH)/λ)d\det(\Lambda_{M^{\prime}})/\det(\Lambda_{0})\leq((\lambda+M^{\prime}H)/\lambda)^{d})
=𝒪(βMH||mh=1HϕhmΛm+112+βMdHln(MH))\displaystyle=\mathcal{O}\left(\beta_{M^{\prime}}\sqrt{H|{\mathcal{I}}|\sum_{m\in{\mathcal{I}}}\sum_{h=1}^{H}\left\|{\phi^{m}_{h}}\right\|_{\Lambda_{m+1}^{-1}}^{2}}+\beta_{M^{\prime}}dH\ln(M^{\prime}H)\right) (Cauchy-Schwarz inequality)
=𝒪(d3B2H||lndBMHδ+d2BHln1.5dBMHδ).\displaystyle=\mathcal{O}\left(\sqrt{d^{3}B^{2}H|{\mathcal{I}}|}\ln\frac{dBM^{\prime}H}{\delta}+d^{2}BH\ln^{1.5}\frac{dBM^{\prime}H}{\delta}\right). ((Jin et al., 2020b, Lemma D.2), λ=1\lambda=1, and definition of βM\beta_{M^{\prime}})
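The step hidden in the last two lines is the elliptical potential argument of (Jin et al., 2020b, Lemma D.2). A minimal numeric sketch of that bound (the dimension, horizon, and random unit features below are arbitrary choices for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 5, 2000, 1.0
Lam = lam * np.eye(d)          # Lambda_1 = lambda * I
total = 0.0
for _ in range(T):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)  # ||phi||_2 <= 1, as in Assumption 1
    total += min(1.0, phi @ np.linalg.solve(Lam, phi))  # ||phi||^2 in Lambda_t^{-1} norm
    Lam += np.outer(phi, phi)   # rank-one update Lambda_{t+1} = Lambda_t + phi phi^T
# elliptical potential bound: sum_t min(1, ||phi_t||^2_{Lambda_t^{-1}}) <= 2 d ln(1 + T/(d lambda))
bound = 2 * d * np.log(1 + T / (d * lam))
assert total <= bound
```

The sum grows only logarithmically in T, which is what turns the per-step bonus terms into the \sqrt{|{\mathcal{I}}|}-type bound above.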

B.10 Proof of Theorem 5

Proof.

Define \delta=\frac{1}{3}, \Delta=\frac{\sqrt{\delta/K}}{8\sqrt{2}}, and assume K\geq\frac{d^{2}}{2\delta}. Consider a family of SSP instances parameterized by \rho\in\{-\Delta,\Delta\}^{d} with action set {\mathcal{A}}=\{-1,1\}^{d}. The instance parameterized by \rho consists of two states \{s_{0},s_{1}\} in addition to the goal state g. The transition probabilities are as follows:

P(s1|s0,a)=1δρ,a,P(g|s0,a)=δ+ρ,a,\displaystyle P(s_{1}|s_{0},a)=1-\delta-\left\langle\rho,a\right\rangle,\quad P(g|s_{0},a)=\delta+\left\langle\rho,a\right\rangle,
P(s1|s1,a)=11/B,P(g|s1,a)=1/B,\displaystyle P(s_{1}|s_{1},a)=1-1/B_{\star},\quad P(g|s_{1},a)=1/B_{\star},

and the cost function is c(s,a)=𝕀{s=s1}c(s,a)=\mathbb{I}\{s=s_{1}\}. The SSP instance above can be represented as a linear SSP of dimension d+2d+2 as follows: define α=11+Δd\alpha=\sqrt{\frac{1}{1+\Delta d}}, β=Δ1+Δd\beta=\sqrt{\frac{\Delta}{1+\Delta d}},

ϕ(s,a)={[α,βa,0],s=s0[0,0,1],s=s1μ(s)={[(1δ)/α,ρ/β,11/B],s=s1[δ/α,ρ/β,1/B],s=g\displaystyle\phi(s,a)=\begin{cases}[\alpha,\beta a^{\top},0]^{\top},&s=s_{0}\\ [0,0,1]^{\top},&s=s_{1}\end{cases}\qquad\mu(s^{\prime})=\begin{cases}[(1-\delta)/\alpha,-\rho^{\top}/\beta,1-1/B_{\star}]^{\top},&s^{\prime}=s_{1}\\ [\delta/\alpha,\rho^{\top}/\beta,1/B_{\star}]^{\top},&s^{\prime}=g\end{cases}

and θ=[0,0,1]\theta^{\star}=[0,0,1]. Note that it satisfies c(s,a)=ϕ(s,a)θc(s,a)=\phi(s,a)^{\top}\theta^{\star}, P(s|s,a)=ϕ(s,a)μ(s)P(s^{\prime}|s,a)=\phi(s,a)^{\top}\mu(s^{\prime}), ϕ(s,a)21\left\|{\phi(s,a)}\right\|_{2}\leq 1, and θ21d+2\left\|{\theta^{\star}}\right\|_{2}\leq 1\leq\sqrt{d+2}. Moreover, for any function h:𝒮+h:{\mathcal{S}}_{+}\rightarrow\mathbb{R}, we have:

sh(s)μ(s)=[h(s1)(1δ)1+Δd+h(g)δ1+Δd(h(g)h(s1))ρ(1+Δd)/Δh(s1)(11/B)+h(g)/B].\displaystyle\sum_{s^{\prime}}h(s^{\prime})\mu(s^{\prime})=\begin{bmatrix}h(s_{1})(1-\delta)\sqrt{1+\Delta d}+h(g)\delta\sqrt{1+\Delta d}\\ (h(g)-h(s_{1}))\rho\sqrt{(1+\Delta d)/\Delta}\\ h(s_{1})(1-1/B_{\star})+h(g)/B_{\star}\end{bmatrix}.
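The identities above are easy to verify numerically. The following sketch instantiates the construction for hypothetical values of d, B_{\star}, and K (chosen only for illustration) and checks the linear-MDP identities:

```python
import numpy as np

d, B_star, K = 4, 5.0, 10000       # illustrative values, not from the paper
delta = 1 / 3
Delta = np.sqrt(delta / K) / (8 * np.sqrt(2))
rng = np.random.default_rng(0)
rho = rng.choice([-Delta, Delta], size=d)

alpha = np.sqrt(1 / (1 + Delta * d))
beta = np.sqrt(Delta / (1 + Delta * d))

def phi(s, a):
    # feature map of dimension d + 2
    if s == 's0':
        return np.concatenate(([alpha], beta * a, [0.0]))
    return np.concatenate(([0.0], np.zeros(d), [1.0]))

mu = {
    's1': np.concatenate(([(1 - delta) / alpha], -rho / beta, [1 - 1 / B_star])),
    'g':  np.concatenate(([delta / alpha], rho / beta, [1 / B_star])),
}
theta_star = np.concatenate(([0.0], np.zeros(d), [1.0]))

a = rng.choice([-1.0, 1.0], size=d)
# P(s'|s,a) = phi(s,a)^T mu(s') and c(s,a) = phi(s,a)^T theta*
assert np.isclose(phi('s0', a) @ mu['s1'], 1 - delta - rho @ a)
assert np.isclose(phi('s0', a) @ mu['g'], delta + rho @ a)
assert np.isclose(phi('s1', a) @ mu['g'], 1 / B_star)
assert np.isclose(phi('s0', a) @ theta_star, 0.0)   # c(s0, a) = 0
assert np.isclose(phi('s1', a) @ theta_star, 1.0)   # c(s1, a) = 1
```

Note also that \|\phi(s_{0},a)\|_{2}^{2}=\alpha^{2}+d\beta^{2}=1, matching the normalization in Assumption 1.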

Note that when Kd22δK\geq\frac{d^{2}}{2\delta}, Δdδ8=124\Delta d\leq\frac{\delta}{8}=\frac{1}{24}, and

(h(s1)(1δ)1+Δd+h(g)δ1+Δd)2\displaystyle(h(s_{1})(1-\delta)\sqrt{1+\Delta d}+h(g)\delta\sqrt{1+\Delta d})^{2} h2(1+Δd)2524h2,\displaystyle\leq\left\|{h}\right\|_{\infty}^{2}(1+\Delta d)\leq\frac{25}{24}\left\|{h}\right\|_{\infty}^{2},
(h(g)h(s1))ρ(1+Δd)/Δ22\displaystyle\left\|{(h(g)-h(s_{1}))\rho\sqrt{(1+\Delta d)/\Delta}}\right\|_{2}^{2} 4h2Δd(1+Δd)2524h2,\displaystyle\leq 4\left\|{h}\right\|_{\infty}^{2}\Delta d(1+\Delta d)\leq\frac{25}{24}\left\|{h}\right\|_{\infty}^{2},
(h(s1)(11/B)+h(g)/B)2\displaystyle(h(s_{1})(1-1/B_{\star})+h(g)/B_{\star})^{2} h2.\displaystyle\leq\left\|{h}\right\|_{\infty}^{2}.

Thus, we have sh(s)μ(s)2hd+2\left\|{\sum_{s^{\prime}}h(s^{\prime})\mu(s^{\prime})}\right\|_{2}\leq\left\|{h}\right\|_{\infty}\sqrt{d+2} by d2d\geq 2, and the SSP instance satisfies Assumption 1. The regret is bounded as follows: let aka_{k} denote the first action taken by the learner in episode kk. Then for any ρ{Δ,Δ}d\rho\in\{-\Delta,\Delta\}^{d}, the expected cost of taking action aa as the first action is Cρ(a)=B(1δρ,a)C_{\rho}(a)=B_{\star}(1-\delta-\left\langle\rho,a\right\rangle).

𝔼ρ[RK]\displaystyle\mathbb{E}_{\rho}[R_{K}] =k=1K𝔼ρ[Cρ(ak)minaCρ(a)]=Bk=1K𝔼ρ[maxaρ,aρ,ak]\displaystyle=\sum_{k=1}^{K}\mathbb{E}_{\rho}\left[C_{\rho}(a_{k})-\min_{a}C_{\rho}(a)\right]=B_{\star}\sum_{k=1}^{K}\mathbb{E}_{\rho}\left[\max_{a}\left\langle\rho,a\right\rangle-\left\langle\rho,a_{k}\right\rangle\right]
=2BΔk=1K𝔼ρ[j=1d𝕀{sgn(ρj)sgn(ak,j)}]=2BΔj=1d𝔼ρ[Nj(ρ)],\displaystyle=2B_{\star}\Delta\sum_{k=1}^{K}\mathbb{E}_{\rho}\left[\sum_{j=1}^{d}\mathbb{I}\{\mbox{\text{s}gn}(\rho_{j})\neq\mbox{\text{s}gn}(a_{k,j})\}\right]=2B_{\star}\Delta\sum_{j=1}^{d}\mathbb{E}_{\rho}[N_{j}(\rho)],

where we define Nj(ρ)=k=1K𝕀{sgn(ρj)sgn(ak,j)}N_{j}(\rho)=\sum_{k=1}^{K}\mathbb{I}\{\mbox{\text{s}gn}(\rho_{j})\neq\mbox{\text{s}gn}(a_{k,j})\}, and 𝔼ρ\mathbb{E}_{\rho} is the expectation w.r.t the SSP instance parameterized by ρ\rho. Let ρj\rho^{j} denote the vector that differs from ρ\rho at its jj-th coordinate only. Then, we have Nj(ρj)+Nj(ρ)=KN_{j}(\rho^{j})+N_{j}(\rho)=K, and for a fixed jj,

\displaystyle 2\sum_{\rho}\mathbb{E}_{\rho}\left[R_{K}\right] =\sum_{\rho}\left(\mathbb{E}_{\rho}\left[R_{K}\right]+\mathbb{E}_{\rho^{j}}\left[R_{K}\right]\right)=2B_{\star}\Delta\sum_{\rho}\sum_{j=1}^{d}\left(K+\mathbb{E}_{\rho}[N_{j}(\rho)]-\mathbb{E}_{\rho^{j}}[N_{j}(\rho)]\right)
2BΔρj=1d(KK2KL(Pρ,Pρj)),\displaystyle\geq 2B_{\star}\Delta\sum_{\rho}\sum_{j=1}^{d}\left(K-K\sqrt{2\text{\rm KL}(P_{\rho},P_{\rho^{j}})}\right),

where PρP_{\rho} is the joint probability of KK trajectories induced by the interactions between the learner and the SSP parameterized by ρ\rho, and in the last inequality we apply Pinsker’s inequality to obtain:

|𝔼ρ[Nj(ρ)]𝔼ρj[Nj(ρ)]|KPρPρj1K2KL(Pρ,Pρj).\displaystyle\left|\mathbb{E}_{\rho}[N_{j}(\rho)]-\mathbb{E}_{\rho^{j}}[N_{j}(\rho)]\right|\leq K\left\|{P_{\rho}-P_{\rho^{j}}}\right\|_{1}\leq K\sqrt{2\text{\rm KL}(P_{\rho},P_{\rho^{j}})}.

By the divergence decomposition lemma (see e.g. (Lattimore & Szepesvári, 2020, Lemma 15.1)), we further have

KL(Pρ,Pρj)\displaystyle\text{\rm KL}(P_{\rho},P_{\rho^{j}}) =k=1K𝔼ρ[KL(Bernoulli(δ+ak,ρ),Bernoulli(δ+ak,ρj))]\displaystyle=\sum_{k=1}^{K}\mathbb{E}_{\rho}\left[\text{\rm KL}\left(\textrm{Bernoulli}(\delta+\left\langle a_{k},\rho\right\rangle),\textrm{Bernoulli}(\delta+\left\langle a_{k},\rho^{j}\right\rangle)\right)\right]
\displaystyle\leq\sum_{k=1}^{K}\mathbb{E}_{\rho}\left[\frac{2\left\langle a_{k},\rho-\rho^{j}\right\rangle^{2}}{\delta+\left\langle a_{k},\rho\right\rangle}\right]\leq\frac{16K\Delta^{2}}{\delta}, (d\Delta\leq\delta/2)

where in the second last inequality we apply KL(Bernoulli(a),Bernoulli(b))2(ab)2/a\text{\rm KL}(\textrm{Bernoulli}(a),\textrm{Bernoulli}(b))\leq 2(a-b)^{2}/a when a1/2,a+b1a\leq 1/2,a+b\leq 1, which is true when δ1/3\delta\leq 1/3, dΔδ/2d\Delta\leq\delta/2. Substituting these back, we get:

2ρ𝔼ρ[RK]2BΔρj=1d(KK32KΔ2/δ)=Ω(ρBdδK).\displaystyle 2\sum_{\rho}\mathbb{E}_{\rho}[R_{K}]\geq 2B_{\star}\Delta\sum_{\rho}\sum_{j=1}^{d}\left(K-K\sqrt{32K\Delta^{2}/\delta}\right)=\Omega\left(\sum_{\rho}B_{\star}d\sqrt{\delta K}\right). (9)
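The last equality in Eq. (9) follows by substituting \Delta=\frac{\sqrt{\delta/K}}{8\sqrt{2}}:

\frac{32K\Delta^{2}}{\delta}=\frac{32K}{\delta}\cdot\frac{\delta}{128K}=\frac{1}{4},\qquad\text{so}\qquad K-K\sqrt{32K\Delta^{2}/\delta}=\frac{K}{2},

and hence 2B_{\star}\Delta\sum_{\rho}\sum_{j=1}^{d}\frac{K}{2}=\sum_{\rho}B_{\star}\Delta dK=\Omega\left(\sum_{\rho}B_{\star}d\sqrt{\delta K}\right).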

Now note that gap(s1,a)=0\text{\rm gap}(s_{1},a)=0 for all aa. Define a=argmaxaρ,aa^{\star}=\operatorname*{argmax}_{a}\left\langle\rho,a\right\rangle. Then for any aaa\neq a^{\star},

Q(s0,a)V(s0)=(1δρ,a)B(1δρ,a)B=Bρ,aa2BΔ.\displaystyle Q^{\star}(s_{0},a)-V^{\star}(s_{0})=(1-\delta-\left\langle\rho,a\right\rangle)B_{\star}-(1-\delta-\left\langle\rho,a^{\star}\right\rangle)B_{\star}=B_{\star}\left\langle\rho,a^{\star}-a\right\rangle\geq 2B_{\star}\Delta.

Thus, gapmin=2BΔ\text{\rm gap}_{\min}=2B_{\star}\Delta. By K=δ82Δ\sqrt{K}=\frac{\sqrt{\delta}}{8\sqrt{2}\Delta} and Eq. (9), we get:

ρ𝔼ρ[RK]=Ω(ρBdδK)=Ω(ρdBδΔ)=Ω(ρdB2gapmin).\displaystyle\sum_{\rho}\mathbb{E}_{\rho}[R_{K}]=\Omega\left(\sum_{\rho}B_{\star}d\sqrt{\delta K}\right)=\Omega\left(\sum_{\rho}\frac{dB_{\star}\delta}{\Delta}\right)=\Omega\left(\sum_{\rho}\frac{dB_{\star}^{2}}{\text{\rm gap}_{\min}}\right).

Selecting ρ\rho^{\star} which maximizes 𝔼ρ[RK]\mathbb{E}_{\rho}[R_{K}], we get: 𝔼ρ[RK]=Ω(dB2gapmin)\mathbb{E}_{\rho^{\star}}[R_{K}]=\Omega\left(\frac{dB_{\star}^{2}}{\text{\rm gap}_{\min}}\right). ∎

Appendix C Omitted Details for Section 4

Notations

Define Qt(s,a)=ϕ(s,a)wtQ_{t}(s,a)=\phi(s,a)^{\top}w_{t} such that at=argminaQt(st,a)a_{t}=\operatorname*{argmin}_{a}Q_{t}(s_{t},a), and operator UB:ddU_{B}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} such that UBw=θ+Vw,B(s)𝑑μ(s)U_{B}w=\theta^{\star}+\int V_{w,B}(s^{\prime})d\mu(s^{\prime}). Define ιt=ιBt,t,𝒥t=𝒥Bt,Pt=Pst,at,Ct=i=1tc(si,ai)\iota_{t}=\iota_{B_{t},t},{\mathcal{J}}_{t}={\mathcal{J}}_{B_{t}},P_{t}=P_{s_{t},a_{t}},C_{t}=\sum_{i=1}^{t}c(s_{i},a_{i}), and 𝒥=𝒥2B{\mathcal{J}}={\mathcal{J}}_{2B_{\star}}. By Lemma 4, 𝒥t𝒥{\mathcal{J}}_{t}\subseteq{\mathcal{J}} for any t[T]t\in[T].

For notational convenience, we divide the whole learning process into epochs indexed by ll, and a new epoch begins whenever wtw_{t} is recomputed. Denote by tl+1t_{l}+1 the first time step in epoch ll, and for a quantity, function or set ftf_{t} indexed by time step tt, we define fl=ftl+1f_{l}=f_{t_{l}+1}. Denote by ltl_{t} the epoch time step tt belongs to, and we often ignore the subscript tt when there is no confusion. Clearly, Vt=VlV_{t}=V_{l}, and similarly for wl,w~l,ιl,Ωlw_{l},\widetilde{w}_{l},\iota_{l},\Omega_{l} (ignoring the dependency on tt for ll). With this notation setup, we define LL^{\prime} as the number of epochs that starts by the overestimate condition, that is, L=|{l>1:Vl1(stl)=2Bl1}|L^{\prime}=|\{l>1:V_{l-1}(s^{\prime}_{t_{l}})=2B_{l-1}\}|. Also define νt=argmaxν=w~lw,wΩl|ϕtν|\nu_{t}=\operatorname*{argmax}_{\nu=\widetilde{w}_{l}-w,w\in\Omega_{l}}|\phi_{t}^{\top}\nu| and a special covariance matrix Wj,t(ν)=2jI+i<tmin{1,2j/|ϕiν|}ϕiϕiW_{j,t}(\nu)=2^{j}I+\sum_{i<t}\min\left\{1,2^{j}/|\phi_{i}^{\top}\nu|\right\}\phi_{i}\phi_{i}^{\top}. Note that Φtj(ν)=νWj,l(ν)2\Phi^{j}_{t}(\nu)=\left\|{\nu}\right\|^{2}_{W_{j,l}(\nu)}.

Assumption

For simplicity, we assume that {ϕ(s,a)}(s,a)𝒮×𝒜\{\phi(s,a)\}_{(s,a)\in{\mathcal{S}}\times{\mathcal{A}}} spans d\mathbb{R}^{d}. It implies that if ϕ(s,a)v=ϕ(s,a)w\phi(s,a)^{\top}v=\phi(s,a)^{\top}w for all (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times{\mathcal{A}}, then v=wv=w.

Truncating the Interaction for Technical Reasons

An important question in SSP is whether the algorithm halts in a finite number of steps. To overcome some technical issues, we first assume that the algorithm halts after TT^{\prime} steps for an arbitrary T+T^{\prime}\in\mathbb{N}_{+}, even if the goal state is not reached. Specifically, we redefine the notation TT to be the minimum between the number of steps taken by the learner in KK episodes and TT^{\prime}, that is, T=TT=T^{\prime} if the learner does not finish KK episodes in TT^{\prime} steps. We also redefine RKR_{K} under the new definition of TT, and the true regret now becomes limTRK\lim_{T^{\prime}\rightarrow\infty}R_{K}. The implication under truncation is that sTs^{\prime}_{T} may not be gg, and TTT\leq T^{\prime}. In Appendix C.4, we prove a regret bound on RKR_{K} independent of TT^{\prime}. Thus, the proven regret bound is also an upper bound of the true regret, as it is a valid upper bound of limTRK\lim_{T^{\prime}\rightarrow\infty}R_{K}.

C.1 Proof Sketch of Theorem 6

We focus on deriving the dominating term and ignore the lower order terms. By some straightforward calculation, we decompose the regret as follows:

RK\displaystyle R_{K} t=1T[Vl(st)PtVl]Deviation+t=1T|ϕt(w~lwl)|Estimation-Err+l=1L(Vl(stl+1)Vl(stl+1))KV(sinit)Switching-Cost.\displaystyle\leq\underbrace{\sum_{t=1}^{T}[V_{l}(s^{\prime}_{t})-P_{t}V_{l}]}_{\textsc{Deviation}}+\underbrace{\sum_{t=1}^{T}\left|\phi_{t}^{\top}(\widetilde{w}_{l}-w_{l})\right|}_{\textsc{Estimation-Err}}+\underbrace{\sum_{l=1}^{L}\left(V_{l}(s_{t_{l}+1})-V_{l}(s^{\prime}_{t_{l+1}})\right)-K\cdot V^{\star}(s_{\text{init}})}_{\textsc{Switching-Cost}}.

We bound each of these terms as follows.

Bounding Deviation

This term is a sum of a martingale difference sequence and is of order 𝒪~(t=1T𝕍(Pt,Vl))\tilde{\mathcal{O}}(\sqrt{\sum_{t=1}^{T}\mathbb{V}(P_{t},V_{l})}). We show that t=1T𝕍(Pt,Vl)BCT+BEstimation-Err\sum_{t=1}^{T}\mathbb{V}(P_{t},V_{l})\lesssim B_{\star}C_{T}+B_{\star}\cdot\textsc{Estimation-Err} (see Lemma 21).
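Concretely, the first step here is a standard Freedman-type martingale concentration bound; the following is a sketch of the form invoked (the precise statement used in the full proof is Lemma 38). With probability at least 1δ1-\delta,

```latex
\sum_{t=1}^{T}\bigl[V_{l}(s'_{t})-P_{t}V_{l}\bigr]
=\mathcal{O}\!\left(\sqrt{\sum_{t=1}^{T}\mathbb{V}(P_{t},V_{l})\ln\frac{1}{\delta}}
+B_{\star}\ln\frac{1}{\delta}\right).
```

Combined with the variance bound of Lemma 21, this gives Deviation of order 𝒪~(BCT+BEstimation-Err)\tilde{\mathcal{O}}(\sqrt{B_{\star}C_{T}+B_{\star}\cdot\textsc{Estimation-Err}}) up to additive lower-order terms.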

Bounding Estimation-Err

Here the variance-aware confidence set Ωt\Omega_{t} comes into play. By wlΩlw_{l}\in\Omega_{l}, we have |ϕt(w~lwl)||ϕtνt|\left|\phi_{t}^{\top}(\widetilde{w}_{l}-w_{l})\right|\leq\left|\phi_{t}^{\top}\nu_{t}\right|. Thus, it suffices to bound t=1T|ϕtνt|\sum_{t=1}^{T}\left|\phi_{t}^{\top}\nu_{t}\right|. As in (Kim et al., 2021), the main idea is to bound the matrix norm of νt\nu_{t} w.r.t some special matrix by a variance-aware term, and then apply the elliptical potential lemma on {ϕt}t\{\phi_{t}\}_{t}. For any epoch ll, j𝒥lj\in{\mathcal{J}}_{l} and ν=w~lẘ\nu=\widetilde{w}_{l}-\mathring{w} with ẘΩl\mathring{w}\in\Omega_{l}, we have the following key inequality (see Lemma 24):

νWj,l(ν)22jitl𝕍(Pi,Vl)ιl.\left\|{\nu}\right\|^{2}_{W_{j,l}(\nu)}\lesssim 2^{j}\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}. (10)

One important step is thus to bound itl𝕍(Pi,Vl)\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l}). Note that this term has a form similar to t=1T𝕍(Pt,Vl)\sum_{t=1}^{T}\mathbb{V}(P_{t},V_{l}), and by a similar analysis (see Lemma 23):

ttl𝕍(Pt,Vl)BCtl+Bttl|ϕtνt|.\sum_{t\leq t_{l}}\mathbb{V}(P_{t},V_{l})\lesssim B_{\star}C_{t_{l}}+B_{\star}\sum_{t\leq t_{l}}\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|. (11)

where νt=argmaxν=w~lw,wΩl|ϕtν|\nu^{\prime}_{t}=\operatorname*{argmax}_{\nu=\widetilde{w}_{l}-w,w\in\Omega_{l}}\left|\phi_{t}^{\top}\nu\right| (note that here ll is fixed and independent of tt). Define jt𝒥lj_{t}\in{\mathcal{J}}_{l} such that |ϕtνt|(2jt1,2jt]\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|\in(2^{j_{t}-1},2^{j_{t}}]. By Eq. (10):

|ϕtνt|ϕtWjt,l1(νt)νtWjt,l(νt)ϕtWjt,l1(νt)|ϕtνt|itl𝕍(Pi,Vl)ιl\displaystyle\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|\lesssim\left\|{\phi_{t}}\right\|_{W^{-1}_{j_{t},l}(\nu^{\prime}_{t})}\left\|{\nu^{\prime}_{t}}\right\|_{W_{j_{t},l}(\nu^{\prime}_{t})}\lesssim\left\|{\phi_{t}}\right\|_{W^{-1}_{j_{t},l}(\nu^{\prime}_{t})}\sqrt{\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}} (12)

Solving for |ϕtνt|\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right| and using ttlϕtWjt,l1(νt)2=𝒪~(d)\sum_{t\leq t_{l}}\left\|{\phi_{t}}\right\|^{2}_{W^{-1}_{j_{t},l}(\nu^{\prime}_{t})}=\tilde{\mathcal{O}}\left(d\right) (an analogue of the elliptical potential lemma), we get

ttl|ϕtνt|=𝒪~(ditl𝕍(Pi,Vl)ιl).\sum_{t\leq t_{l}}\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|=\tilde{\mathcal{O}}\left(d\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}\right).

Plugging this back to Eq. (11) and solving a quadratic inequality, we get: itl𝕍(Pi,Vl)BCtl\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\lesssim B_{\star}C_{t_{l}} (Lemma 23). Now by an analysis similar to Eq. (12) (Lemma 22):

t=1T|ϕtνt|\displaystyle\sum_{t=1}^{T}\left|\phi_{t}^{\top}\nu_{t}\right| d2t=1TϕtWjt,l1(νt)2itl𝕍(Pi,Vl)ιld3.5BCT,\displaystyle\lesssim d^{2}\sum_{t=1}^{T}\left\|{\phi_{t}}\right\|^{2}_{W^{-1}_{j_{t},l}(\nu_{t})}\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}\lesssim d^{3.5}\sqrt{B_{\star}C_{T}},

where jt𝒥tj_{t}\in{\mathcal{J}}_{t} such that |ϕtνt|(2jt1,2jt]\left|\phi_{t}^{\top}\nu_{t}\right|\in(2^{j_{t}-1},2^{j_{t}}]. The extra d2d^{2} factor is from the inequality Φtj(ν)8d2Φlj(ν)\Phi^{j}_{t}(\nu)\leq 8d^{2}\Phi^{j}_{l}(\nu).
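The quadratic-inequality steps above (both for bounding itl𝕍(Pi,Vl)\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l}) and later for CTC_{T}) rest on an elementary fact, which we record as a sketch with constants and logarithmic factors suppressed:

```latex
X \le a + b\sqrt{X}
\;\Longrightarrow\;
X \le 2a + 4b^{2}
\qquad (a, b \ge 0),
```

because if X2aX\geq 2a then bXXaX/2b\sqrt{X}\geq X-a\geq X/2, so X2b\sqrt{X}\leq 2b. For Lemma 23, take X=itl𝕍(Pi,Vl)X=\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l}), with aa of order BCtlB_{\star}C_{t_{l}} and bb of order dBιld B_{\star}\sqrt{\iota_{l}}; the resulting b2b^{2} term is absorbed into the lower-order terms.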

Bounding Switching-Cost

By considering each condition for starting a new epoch, we show that Switching-Cost=𝒪~(dBL)\textsc{Switching-Cost}=\tilde{\mathcal{O}}(dB_{\star}-L^{\prime}), where LL^{\prime} is the number of epochs started by triggering the overestimate condition; see Appendix C.4. We provide more intuition on including the overestimate condition in Appendix C.5. In short, it removes a factor of d1/4d^{1/4} from the dominating term without incorporating impractical decision sets as in previous works.

Putting Everything Together

Combining the bounds above, we get RK=CTKV(sinit)d3.5BCTR_{K}=C_{T}-KV^{\star}(s_{\text{init}})\lesssim d^{3.5}\sqrt{B_{\star}C_{T}}. Solving a quadratic inequality w.r.t CT\sqrt{C_{T}}, we have CTBKC_{T}\lesssim B_{\star}K. Plugging this back, we obtain RKd3.5BKR_{K}\lesssim d^{3.5}B_{\star}\sqrt{K}.
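The final solving step can be spelled out as follows (again a sketch with lower-order terms suppressed; the precise bookkeeping is in Appendix C.4). Using V(sinit)BV^{\star}(s_{\text{init}})\leq B_{\star}, the inequality RK=CTKV(sinit)d3.5BCTR_{K}=C_{T}-KV^{\star}(s_{\text{init}})\lesssim d^{3.5}\sqrt{B_{\star}C_{T}} rearranges to

```latex
C_T \;\le\; K\,V^{\star}(s_{\text{init}}) + \tilde{\mathcal{O}}\!\left(d^{3.5}\sqrt{B_{\star}C_T}\right)
\;\Longrightarrow\;
C_T \;=\; \tilde{\mathcal{O}}\!\left(B_{\star}K + d^{7}B_{\star}\right),
```

by the elementary fact that Xa+bXX\leq a+b\sqrt{X} implies X2a+4b2X\leq 2a+4b^{2}. Substituting back yields RKd3.5B(BK+d7B)=𝒪~(d3.5BK+d7B)R_{K}\lesssim d^{3.5}\sqrt{B_{\star}(B_{\star}K+d^{7}B_{\star})}=\tilde{\mathcal{O}}(d^{3.5}B_{\star}\sqrt{K}+d^{7}B_{\star}), whose dominating term is the claimed d3.5BKd^{3.5}B_{\star}\sqrt{K}.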

Below we provide detailed proofs of lemmas and the main theorem.

C.2 Proof of Lemma 3

We will prove a more general statement, from which Lemma 3 is a direct corollary.

Lemma 19.

With probability at least 1δ1-\delta, for any t+t\in\mathbb{N}_{+}, B{2i}iB\in\{2^{i}\}_{i\in\mathbb{N}}, and w𝔹(3dB)w\in\mathbb{B}(3\sqrt{d}B), we have UBwΩt(w,B)U_{B}w\in\Omega_{t}(w,B).

Proof.

For each t+t\in\mathbb{N}_{+}, B{2i}iB\in\{2^{i}\}_{i\in\mathbb{N}}, w𝒢ϵ/t(3dB)w\in\mathcal{G}_{\epsilon/t}(3\sqrt{d}B), j𝒥Bj\in{\mathcal{J}}_{B}, ν𝒢ϵ/t(6dB)\nu\in\mathcal{G}_{\epsilon/t}(6\sqrt{d}B), by Lemma 36, we have the following with probability at least 16δlog2t1-6\delta^{\prime}\log_{2}t, where δ=δ/(24t2log22(2B)log2(t)|𝒥B|(12dBt/ϵ)2d)\delta^{\prime}=\delta/(24t^{2}\log_{2}^{2}(2B)\log_{2}(t)|{\mathcal{J}}_{B}|(12\sqrt{d}Bt/\epsilon)^{2d}):

|i<tclipj(ϕiν)ϵVw,Bi(UBw)|=|i<tclipj(ϕiν)(PiVw,BVw,B(si))|\displaystyle\left|\sum_{i<t}\textsf{clip}_{j}(\phi_{i}^{\top}\nu)\epsilon^{i}_{V_{w,B}}(U_{B}w)\right|=\left|\sum_{i<t}\textsf{clip}_{j}(\phi_{i}^{\top}\nu)(P_{i}V_{w,B}-V_{w,B}(s^{\prime}_{i}))\right|
8i<tclipj2(ϕiν)ηVw,Bi(UBw)ln1δ+32B2jln1δi<tclipj2(ϕiν)ηVw,Bi(UBw)ιB,t3+B22jιB,t.\displaystyle\leq 8\sqrt{\sum_{i<t}\textsf{clip}_{j}^{2}(\phi_{i}^{\top}\nu)\eta^{i}_{V_{w,B}}(U_{B}w)\ln\frac{1}{\delta^{\prime}}}+32B2^{j}\ln\frac{1}{\delta^{\prime}}\leq\sqrt{\sum_{i<t}\textsf{clip}_{j}^{2}(\phi_{i}^{\top}\nu)\eta^{i}_{V_{w,B}}(U_{B}w)\frac{\iota_{B,t}}{3}}+\frac{B}{2}2^{j}\iota_{B,t}. (13)

Taking a union bound, Eq. (13) holds for any t,B{2i}i,w𝒢ϵ/t(3dB),j𝒥B,ν𝒢ϵ/t(6dB)t,B\in\{2^{i}\}_{i\in\mathbb{N}},w\in\mathcal{G}_{\epsilon/t}(3\sqrt{d}B),j\in{\mathcal{J}}_{B},\nu\in\mathcal{G}_{\epsilon/t}(6\sqrt{d}B) with probability at least 1δ1-\delta.

Now for any t+t\in\mathbb{N}_{+}, B{2i}iB\in\{2^{i}\}_{i\in\mathbb{N}}, w𝔹(3dB)w\in\mathbb{B}(3\sqrt{d}B), there exists w𝒢ϵ/t(3dB)w^{\prime}\in\mathcal{G}_{\epsilon/t}(3\sqrt{d}B) such that wwϵt\left\|{w-w^{\prime}}\right\|_{\infty}\leq\frac{\epsilon}{t}. Also define V=Vw,BV=V_{w,B}, V=Vw,BV^{\prime}=V_{w^{\prime},B}, w~=UBw\widetilde{w}=U_{B}w, and w~=UBw\widetilde{w}^{\prime}=U_{B}w^{\prime}. Note that

VV\displaystyle\left\|{V-V^{\prime}}\right\|_{\infty} maxs,a|ϕ(s,a)(ww)|dwwdϵt,\displaystyle\leq\max_{s,a}\left|\phi(s,a)^{\top}(w-w^{\prime})\right|\leq\sqrt{d}\left\|{w-w^{\prime}}\right\|_{\infty}\leq\frac{\sqrt{d}\epsilon}{t}, (14)
w~w~2\displaystyle\left\|{\widetilde{w}-\widetilde{w}^{\prime}}\right\|_{2} =(V(s)V(s))𝑑μ(s)2dVVdϵt.\displaystyle=\left\|{\int(V(s^{\prime})-V^{\prime}(s^{\prime}))d\mu(s^{\prime})}\right\|_{2}\leq\sqrt{d}\left\|{V-V^{\prime}}\right\|_{\infty}\leq\frac{d\epsilon}{t}. (15)

Thus, we have for any j𝒥B,ν𝒢ϵ/t(6dB)j\in{\mathcal{J}}_{B},\nu\in\mathcal{G}_{\epsilon/t}(6\sqrt{d}B):

|i<tclipj(ϕiν)ϵVi(w~)|=|i<tclipj(ϕiν)(ϕiw~ciV(si))|\displaystyle\left|\sum_{i<t}\textsf{clip}_{j}(\phi_{i}^{\top}\nu)\epsilon^{i}_{V}(\widetilde{w})\right|=\left|\sum_{i<t}\textsf{clip}_{j}(\phi_{i}^{\top}\nu)(\phi_{i}^{\top}\widetilde{w}-c_{i}-V(s^{\prime}_{i}))\right|
|i<tclipj(ϕiν)(ϕiw~ciV(si))|+|i<tclipj(ϕiν)ϕi(w~w~)|+|i<tclipj(ϕiν)(V(si)V(si))|\displaystyle\leq\left|\sum_{i<t}\textsf{clip}_{j}(\phi_{i}^{\top}\nu)(\phi_{i}^{\top}\widetilde{w}^{\prime}-c_{i}-V^{\prime}(s^{\prime}_{i}))\right|+\left|\sum_{i<t}\textsf{clip}_{j}(\phi_{i}^{\top}\nu)\phi_{i}^{\top}(\widetilde{w}-\widetilde{w}^{\prime})\right|+\left|\sum_{i<t}\textsf{clip}_{j}(\phi_{i}^{\top}\nu)(V(s^{\prime}_{i})-V^{\prime}(s^{\prime}_{i}))\right|
i<tclipj2(ϕiν)(ϕiw~ciV(si))2ιB,t3+B22jιB,t+2j+1dϵ.\displaystyle\leq\sqrt{\sum_{i<t}\textsf{clip}_{j}^{2}(\phi_{i}^{\top}\nu)(\phi_{i}^{\top}\widetilde{w}^{\prime}-c_{i}-V^{\prime}(s^{\prime}_{i}))^{2}\frac{\iota_{B,t}}{3}}+\frac{B}{2}2^{j}\iota_{B,t}+2^{j+1}d\epsilon. (Eq. (13), Eq. (14), and Eq. (15))
i<tclipj2(ϕiν)ηVi(w~)ιB,t+i<tclipj2(ϕiν)(ϕi(w~w~))2ιB,t+i<tclipj2(ϕiν)(V(si)V(si))2ιB,t\displaystyle\leq\sqrt{\sum_{i<t}\textsf{clip}_{j}^{2}(\phi_{i}^{\top}\nu)\eta^{i}_{V}(\widetilde{w})\iota_{B,t}}+\sqrt{\sum_{i<t}\textsf{clip}_{j}^{2}(\phi_{i}^{\top}\nu)(\phi_{i}^{\top}(\widetilde{w}^{\prime}-\widetilde{w}))^{2}\iota_{B,t}}+\sqrt{\sum_{i<t}\textsf{clip}_{j}^{2}(\phi_{i}^{\top}\nu)(V^{\prime}(s^{\prime}_{i})-V(s^{\prime}_{i}))^{2}\iota_{B,t}}
+B22jιB,t+2j+1dϵ\displaystyle\qquad+\frac{B}{2}2^{j}\iota_{B,t}+2^{j+1}d\epsilon ((a+b+c)23(a2+b2+c2)(a+b+c)^{2}\leq 3(a^{2}+b^{2}+c^{2}) and a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b})
i<tclipj2(ϕiν)ηVi(w~)ιB,t+B22jιB,t+42jdϵιB,ti<tclipj2(ϕiν)ηVi(w~)ιB,t+B2jιB,t.\displaystyle\leq\sqrt{\sum_{i<t}\textsf{clip}_{j}^{2}(\phi_{i}^{\top}\nu)\eta^{i}_{V}(\widetilde{w})\iota_{B,t}}+\frac{B}{2}2^{j}\iota_{B,t}+4\cdot 2^{j}d\epsilon\iota_{B,t}\leq\sqrt{\sum_{i<t}\textsf{clip}_{j}^{2}(\phi_{i}^{\top}\nu)\eta^{i}_{V}(\widetilde{w})\iota_{B,t}}+B2^{j}\iota_{B,t}. (Eq. (14), Eq. (15), and 8dϵ18d\epsilon\leq 1)

Moreover, w~𝔹(3dB)\widetilde{w}\in\mathbb{B}(3\sqrt{d}B) by Vw,B2B\left\|{V_{w,B}}\right\|_{\infty}\leq 2B. Thus, UBwΩt(w,B)U_{B}w\in\Omega_{t}(w,B) for any t+t\in\mathbb{N}_{+}, B{2i}iB\in\{2^{i}\}_{i\in\mathbb{N}}, and w𝔹(3dB)w\in\mathbb{B}(3\sqrt{d}B), and the statement is proved. ∎

Proof of Lemma 3.

This directly follows from Lemma 19 by wt𝔹(3dBt)w_{t}\in\mathbb{B}(3\sqrt{d}B_{t}), Vt=Vwt,BtV_{t}=V_{w_{t},B_{t}}, and w~t=θ+Vt(s)𝑑μ(s)=UBtwt\widetilde{w}_{t}=\theta^{\star}+\int V_{t}(s^{\prime})d\mu(s^{\prime})=U_{B_{t}}w_{t}. ∎

C.3 Proof of Lemma 4

Lemma (restatement of Lemma 4).

With probability at least 1δ1-\delta, Vl(stl+1)V(stl+1)V_{l}(s_{t_{l}+1})\leq V^{\star}(s_{t_{l}+1}) for any epoch ll and Bt2BB_{t}\leq 2B_{\star}.

Proof.

For the first statement, note that for any epoch ll, by Lemma 20, there exists wl𝔹(3dBl)w^{\infty}_{l}\in\mathbb{B}(3\sqrt{d}B_{l}) such that wl=UBlwlw^{\infty}_{l}=U_{B_{l}}w^{\infty}_{l} and Vwl,Bl(s)V(s)V_{w^{\infty}_{l},B_{l}}(s)\leq V^{\star}(s). Therefore, wlΩl(wl,Bl)w^{\infty}_{l}\in\Omega_{l}(w^{\infty}_{l},B_{l}), and Vl(stl+1)=Vwl,Bl(stl+1)Vwl,Bl(stl+1)V(stl+1)V_{l}(s_{t_{l}+1})=V_{w_{l},B_{l}}(s_{t_{l}+1})\leq V_{w^{\infty}_{l},B_{l}}(s_{t_{l}+1})\leq V^{\star}(s_{t_{l}+1}) by the definition of wlw_{l}. The second statement is a direct corollary of the first statement and how BtB_{t} is updated. ∎

Lemma 20.

For any B>0B>0, there exists w𝔹(3dB)w\in\mathbb{B}(3\sqrt{d}B) such that w=UBww=U_{B}w, and Vw,B(s)V(s)V_{w,B}(s)\leq V^{\star}(s).

Proof.

Define w1=𝟎dw^{1}=\mathbf{0}\in\mathbb{R}^{d}, and wn+1=UBwnw^{n+1}=U_{B}w^{n}. We prove by induction that ϕ(s,a)(wn+1wn)0\phi(s,a)^{\top}(w^{n+1}-w^{n})\geq 0 and ϕ(s,a)wnQ(s,a)\phi(s,a)^{\top}w^{n}\leq Q^{\star}(s,a). The base case n=1n=1 is clearly true. Now for n>1n>1, assume that we have ϕ(s,a)(wnwn1)0\phi(s,a)^{\top}(w^{n}-w^{n-1})\geq 0 and ϕ(s,a)wn1Q(s,a)\phi(s,a)^{\top}w^{n-1}\leq Q^{\star}(s,a). Then, ϕ(s,a)(wn+1wn)=Ps,a(Vwn,BVwn1,B)0\phi(s,a)^{\top}(w^{n+1}-w^{n})=P_{s,a}(V_{w^{n},B}-V_{w^{n-1},B})\geq 0 and ϕ(s,a)wn=c(s,a)+Ps,aVwn1,Bc(s,a)+Ps,aVQ(s,a)\phi(s,a)^{\top}w^{n}=c(s,a)+P_{s,a}V_{w^{n-1},B}\leq c(s,a)+P_{s,a}V^{\star}\leq Q^{\star}(s,a). Therefore, the sequence {ϕ(s,a)wn}n=1\{\phi(s,a)^{\top}w^{n}\}_{n=1}^{\infty} is non-decreasing and bounded, and thus converges. Since {ϕ(s,a)}(s,a)𝒮×𝒜\{\phi(s,a)\}_{(s,a)\in{\mathcal{S}}\times{\mathcal{A}}} spans d\mathbb{R}^{d}, the limit w=limnwnw^{\infty}=\lim_{n\rightarrow\infty}w^{n} exists and w=UBww^{\infty}=U_{B}w^{\infty}. Moreover, w𝔹(3dB)w^{\infty}\in\mathbb{B}(3\sqrt{d}B) by Vw,B2B\left\|{V_{w^{\infty},B}}\right\|_{\infty}\leq 2B and Vw,B(s)V(s)V_{w^{\infty},B}(s)\leq V^{\star}(s) since ϕ(s,a)w=limnϕ(s,a)wnQ(s,a)\phi(s,a)^{\top}w^{\infty}=\lim_{n\rightarrow\infty}\phi(s,a)^{\top}w^{n}\leq Q^{\star}(s,a). This completes the proof. ∎
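The fixed-point construction of Lemma 20 can be illustrated numerically. Below is a minimal sketch on a hypothetical two-state SSP with one action per state and one-hot features ϕ(s,a)\phi(s,a) (so the linear MDP is exactly tabular); the particular costs `c`, transition matrix `P`, bound `B`, and iteration count are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical 2-state SSP with one-hot features, so the linear MDP is tabular:
# (U_B w)[s] = c[s] + sum_{s'} P[s, s'] * V_{w,B}(s'),  V_{w,B}(s) = clip(w[s], 0, 2B).
c = np.array([1.0, 0.5])        # costs c(s, a)
P = np.array([[0.3, 0.2],       # P[s, s'] over non-goal states;
              [0.0, 0.4]])      # remaining probability mass goes to the goal g
B = 8.0

def V(w):
    # value induced by w, truncated into [0, 2B] as in the definition of V_{w,B}
    return np.clip(w, 0.0, 2.0 * B)

def U(w):
    # the operator U_B specialized to one-hot features
    return c + P @ V(w)

w = np.zeros(2)                 # w^1 = 0
for _ in range(200):
    w_next = U(w)
    # monotonicity from the induction: phi(s,a)^T (w^{n+1} - w^n) >= 0
    assert np.all(w_next >= w - 1e-12)
    w = w_next

# the limit w^infty is (numerically) a fixed point of U_B
print(w, np.max(np.abs(U(w) - w)))
```

Since the truncation is never active in this small instance, the limit solves w=c+Pww=c+Pw, consistent with the monotone-convergence argument in the proof.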

C.4 Proof of Theorem 6

Proof.

We decompose the regret as follows:

RK\displaystyle R_{K} =t=1TctKV(sinit)=l=1L(t=tl+1tl+1ctVl(stl+1))+l=1LVl(stl+1)KV(sinit).\displaystyle=\sum_{t=1}^{T}c_{t}-K\cdot V^{\star}(s_{\text{init}})=\sum_{l=1}^{L}\left(\sum_{t=t_{l}+1}^{t_{l+1}}c_{t}-V_{l}(s_{t_{l}+1})\right)+\sum_{l=1}^{L}V_{l}(s_{t_{l}+1})-K\cdot V^{\star}(s_{\text{init}}).

For the first term, for a fixed epoch ll, define χτ=t=τtl+1ctVl(sτ)\chi_{\tau}=\sum_{t=\tau}^{t_{l+1}}c_{t}-V_{l}(s_{\tau}) for τ{tl+1,,tl+1}\tau\in\{t_{l}+1,\ldots,t_{l+1}\} and χtl+1+1=Vl(stl+1)\chi_{t_{l+1}+1}=-V_{l}(s^{\prime}_{t_{l+1}}). Note that within epoch ll, we have Vl(sτ)=[Ql(sτ,aτ)][0,)Ql(sτ,aτ)=ϕτwlV_{l}(s_{\tau})=[Q_{l}(s_{\tau},a_{\tau})]_{[0,\infty)}\geq Q_{l}(s_{\tau},a_{\tau})=\phi_{\tau}^{\top}w_{l}. Thus, for τ{tl+1,,tl+1}\tau\in\{t_{l}+1,\ldots,t_{l+1}\},

χτ\displaystyle\chi_{\tau} =t=τtl+1ctVl(sτ)t=τ+1tl+1ct+cτϕτwl\displaystyle=\sum_{t=\tau}^{t_{l+1}}c_{t}-V_{l}(s_{\tau})\leq\sum_{t=\tau+1}^{t_{l+1}}c_{t}+c_{\tau}-\phi_{\tau}^{\top}w_{l}
=t=τ+1tl+1ctVl(sτ)+(Vl(sτ)PτVl)+ϕτ(w~lwl)\displaystyle=\sum_{t=\tau+1}^{t_{l+1}}c_{t}-V_{l}(s^{\prime}_{\tau})+(V_{l}(s^{\prime}_{\tau})-P_{\tau}V_{l})+\phi_{\tau}^{\top}(\widetilde{w}_{l}-w_{l}) (cτ+PτVl=ϕτw~lc_{\tau}+P_{\tau}V_{l}=\phi_{\tau}^{\top}\widetilde{w}_{l})
=χτ+1+(Vl(sτ)PτVl)+ϕτ(w~lwl)\displaystyle=\chi_{\tau+1}+(V_{l}(s^{\prime}_{\tau})-P_{\tau}V_{l})+\phi_{\tau}^{\top}(\widetilde{w}_{l}-w_{l})
Vl(stl+1)+t=τtl+1(Vl(st)PtVl)+t=τtl+1ϕt(w~lwl).\displaystyle\leq\cdots\leq-V_{l}(s^{\prime}_{t_{l+1}})+\sum_{t=\tau}^{t_{l+1}}(V_{l}(s^{\prime}_{t})-P_{t}V_{l})+\sum_{t=\tau}^{t_{l+1}}\phi_{t}^{\top}(\widetilde{w}_{l}-w_{l}).

Therefore, we have:

RK\displaystyle R_{K} =l=1Lχtl+1+l=1LVl(stl+1)KV(sinit)\displaystyle=\sum_{l=1}^{L}\chi_{t_{l}+1}+\sum_{l=1}^{L}V_{l}(s_{t_{l}+1})-K\cdot V^{\star}(s_{\text{init}})
l=1Lt=tl+1tl+1[(Vl(st)PtVl)+ϕt(w~lwl)]+l=1L(Vl(stl+1)Vl(stl+1))KV(sinit).\displaystyle\leq\sum_{l=1}^{L}\sum_{t=t_{l}+1}^{t_{l+1}}\left[(V_{l}(s^{\prime}_{t})-P_{t}V_{l})+\phi_{t}^{\top}(\widetilde{w}_{l}-w_{l})\right]+\sum_{l=1}^{L}\left(V_{l}(s_{t_{l}+1})-V_{l}(s^{\prime}_{t_{l+1}})\right)-K\cdot V^{\star}(s_{\text{init}}).

We first bound the switching costs, that is, the last two terms above. We consider three cases based on how an epoch starts: define 1={l:stl=g}{\mathcal{L}}_{1}=\{l:s^{\prime}_{t_{l}}=g\}, 2={l>1:j𝒥l,ν𝒢ϵ/(tl+1)(6dBl1),Φtl+1j(ν)>8d2Φtl1+1j(ν)}{\mathcal{L}}_{2}=\{l>1:\exists j\in{\mathcal{J}}_{l},\nu\in\mathcal{G}_{\epsilon/(t_{l}+1)}(6\sqrt{d}B_{l-1}),\Phi^{j}_{t_{l}+1}(\nu)>8d^{2}\Phi^{j}_{t_{l-1}+1}(\nu)\}, and 3={l>1:Vl1(stl)=2Bl1}{\mathcal{L}}_{3}=\{l>1:V_{l-1}(s^{\prime}_{t_{l}})=2B_{l-1}\}. Then,

l=1L(Vl(stl+1)Vl(stl+1))KV(sinit)\displaystyle\sum_{l=1}^{L}\left(V_{l}(s_{t_{l}+1})-V_{l}(s^{\prime}_{t_{l+1}})\right)-K\cdot V^{\star}(s_{\text{init}})
=l1Vl(stl+1)KV(sinit)ξ1+l2Vl(stl+1)ξ2+l3Vl(stl+1)l=1LVl(stl+1)ξ3.\displaystyle=\underbrace{\sum_{l\in{\mathcal{L}}_{1}}V_{l}(s_{t_{l}+1})-K\cdot V^{\star}(s_{\text{init}})}_{\xi_{1}}+\underbrace{\sum_{l\in{\mathcal{L}}_{2}}V_{l}(s_{t_{l}+1})}_{\xi_{2}}+\underbrace{\sum_{l\in{\mathcal{L}}_{3}}V_{l}(s_{t_{l}+1})-\sum_{l=1}^{L}V_{l}(s^{\prime}_{t_{l+1}})}_{\xi_{3}}.

Note that ξ10\xi_{1}\leq 0 since for l1l\in{\mathcal{L}}_{1}, Vl(stl+1)=Vl(sinit)V(sinit)V_{l}(s_{t_{l}+1})=V_{l}(s_{\text{init}})\leq V^{\star}(s_{\text{init}}) by Lemma 4. For ξ2\xi_{2}, note that |2|=𝒪~(d)|{\mathcal{L}}_{2}|=\tilde{\mathcal{O}}(d) by Lemma 27. Thus, ξ2=𝒪~(dB)\xi_{2}=\tilde{\mathcal{O}}(dB_{\star}) by Vl4B\left\|{V_{l}}\right\|_{\infty}\leq 4B_{\star} (Lemma 4). For ξ3\xi_{3}, note that for each l3l\in{\mathcal{L}}_{3}, Vl(stl+1)Vl1(stl)Bl2Bl12B𝕀{BlBl1}1V_{l}(s_{t_{l}+1})-V_{l-1}(s^{\prime}_{t_{l}})\leq B_{l}-2B_{l-1}\leq 2B_{\star}\mathbb{I}\{B_{l}\neq B_{l-1}\}-1 by Vl(stl+1)Bl2BV_{l}(s_{t_{l}+1})\leq B_{l}\leq 2B_{\star} and Bl1B_{l}\geq 1. Thus, ξ3𝒪~(B)L\xi_{3}\leq\tilde{\mathcal{O}}(B_{\star})-L^{\prime}, by |3|=L|{\mathcal{L}}_{3}|=L^{\prime} and l=1L𝕀{BlBl1}=𝒪(log2B)\sum_{l=1}^{L}\mathbb{I}\{B_{l}\neq B_{l-1}\}=\mathcal{O}(\log_{2}B_{\star}). Therefore, with probability at least 15δ1-5\delta,

RK\displaystyle R_{K} l=1Lt=tl+1tl+1[(Vl(st)PtVl)+ϕt(w~lwl)]+𝒪~(dBL)\displaystyle\leq\sum_{l=1}^{L}\sum_{t=t_{l}+1}^{t_{l+1}}\left[(V_{l}(s^{\prime}_{t})-P_{t}V_{l})+\phi_{t}^{\top}(\widetilde{w}_{l}-w_{l})\right]+\tilde{\mathcal{O}}\left(dB_{\star}-L^{\prime}\right)
=𝒪~(t=1T𝕍(Pt,Vl)+t=1T|ϕtνt|+dBL)\displaystyle=\tilde{\mathcal{O}}\left(\sqrt{\sum_{t=1}^{T}\mathbb{V}(P_{t},V_{l})}+\sum_{t=1}^{T}\left|\phi_{t}^{\top}\nu_{t}\right|+dB_{\star}-L^{\prime}\right) (Lemma 38, wlΩlw_{l}\in\Omega_{l}, and definition of νt\nu_{t})
=𝒪~(B2L+BCT+Bt=1T|ϕtνt|+t=1T|ϕtνt|+dBL)\displaystyle=\tilde{\mathcal{O}}\left(\sqrt{B_{\star}^{2}L^{\prime}+B_{\star}C_{T}+B_{\star}\sum_{t=1}^{T}\left|\phi_{t}^{\top}\nu_{t}\right|}+\sum_{t=1}^{T}\left|\phi_{t}^{\top}\nu_{t}\right|+dB_{\star}-L^{\prime}\right) (Lemma 21)
=𝒪~(BCT+t=1T|ϕtνt|+dB+B2)\displaystyle=\tilde{\mathcal{O}}\left(\sqrt{B_{\star}C_{T}}+\sum_{t=1}^{T}\left|\phi_{t}^{\top}\nu_{t}\right|+dB_{\star}+B_{\star}^{2}\right) (x+yx+y\sqrt{x+y}\leq\sqrt{x}+\sqrt{y} and aba+b2\sqrt{ab}\leq\frac{a+b}{2})
𝒪~(d3.5BCT+d3.5BϵT+d5B2)+65d2.5ϵT\displaystyle\leq\tilde{\mathcal{O}}\left(d^{3.5}\sqrt{B_{\star}C_{T}}+d^{3.5}\sqrt{B_{\star}\epsilon T}+d^{5}B_{\star}^{2}\right)+65d^{2.5}\epsilon T (Lemma 22)
𝒪~(d3.5BCT+d5B2)+CT2K.\displaystyle\leq\tilde{\mathcal{O}}\left(d^{3.5}\sqrt{B_{\star}C_{T}}+d^{5}B_{\star}^{2}\right)+\frac{C_{T}}{2K}. (definition of ϵ\epsilon and cminTCTc_{\min}T\leq C_{T})

By RK=CTKV(sinit)R_{K}=C_{T}-K\cdot V^{\star}(s_{\text{init}}) and Lemma 28 with x=CTx=C_{T} (we also bound TT by CT/cminC_{T}/c_{\min} in logarithmic terms), we get CT=𝒪~(BK+d7B+d5B2)C_{T}=\tilde{\mathcal{O}}(B_{\star}K+d^{7}B_{\star}+d^{5}B_{\star}^{2}). Plugging this back, we obtain

RK=𝒪~(d3.5BK+d7B2).R_{K}=\tilde{\mathcal{O}}\left(d^{3.5}B_{\star}\sqrt{K}+d^{7}B_{\star}^{2}\right).

This completes the proof. ∎

C.5 Intuition for Overestimate Condition

Now we provide more reasoning for including the overestimate condition. Similar to (Zanette et al., 2020b; Wei et al., 2021b), we incorporate global optimism at the starting state of each epoch by solving an optimization problem. This is different from many previous works (Jin et al., 2020b; Vial et al., 2021) that add bonus terms to ensure local optimism over all states. The advantage of global optimism is that it avoids using a larger function class for Qt,VtQ_{t},V_{t} due to the bonus terms, which reduces the order of dd in the regret bound. However, this improvement also requires that Vt\left\|{V_{t}}\right\|_{\infty} be of order BB_{\star}. In (Zanette et al., 2020b), this constraint is enforced directly, which is impractical under a large state space since we may need to iterate over all state-action pairs to check it.

Here we take a new approach: we first enforce a bound on Vt\left\|{V_{t}}\right\|_{\infty} by direct truncation. However, the upper bound truncation on VtV_{t} may break the analysis. To resolve this, we start a new epoch whenever VtV_{t} is overestimated by a large amount. By the objective of the optimization problem, Vt(st)V_{t}(s_{t}) will not be overestimated in the new epoch. Hence, the upper bound truncation will not be triggered. Moreover, the overestimate of VtV_{t} cancels out the switching cost in this case, as in the previous discussion.

The disadvantage of the overestimate condition is that we may update the policy at every time step in the worst case. If we remove this condition, then Vt=𝒪~(dB)\left\|{V_{t}}\right\|_{\infty}=\tilde{\mathcal{O}}(\sqrt{d}B_{\star}) by the norm constraint on wtw_{t}, which brings back an extra d\sqrt{d} factor. However, in this case we only recompute the policy 𝒪(K+dlnT)\mathcal{O}(K+d\ln T) times.

C.6 Extra Lemmas for Section 4

Lemma 21.

With probability at least 1δ1-\delta, t=1T𝕍(Pt,Vl)=𝒪~(dB2+B2L+BCT+Bt=1T|ϕtνt|)\sum_{t=1}^{T}\mathbb{V}(P_{t},V_{l})=\tilde{\mathcal{O}}\left(dB_{\star}^{2}+B_{\star}^{2}L^{\prime}+B_{\star}C_{T}+B_{\star}\sum_{t=1}^{T}\left|\phi_{t}^{\top}\nu_{t}\right|\right).

Proof.

Note that when Vl(st)=0V_{l}(s_{t})=0, Vl(st)PtVl0V_{l}(s_{t})-P_{t}V_{l}\leq 0. Otherwise, Ql(st,a)>0Q_{l}(s_{t},a)>0 for any aa and Vl(st)Ql(st,at)V_{l}(s_{t})\leq Q_{l}(s_{t},a_{t}). Thus, Vl(st)2(PtVl)2=(Vl(st)+PtVl)(Vl(st)PtVl)(Vl(st)+PtVl)|Ql(st,at)PtVl|V_{l}(s_{t})^{2}-(P_{t}V_{l})^{2}=(V_{l}(s_{t})+P_{t}V_{l})(V_{l}(s_{t})-P_{t}V_{l})\leq(V_{l}(s_{t})+P_{t}V_{l})\left|Q_{l}(s_{t},a_{t})-P_{t}V_{l}\right|. Then with probability at least 1δ1-\delta,

t=1T𝕍(Pt,Vl)\displaystyle\sum_{t=1}^{T}\mathbb{V}(P_{t},V_{l}) =t=1T(PtVl2Vl2(st))+t=1T(Vl2(st)Vl2(st))+t=1T(Vl2(st)(PtVl)2)\displaystyle=\sum_{t=1}^{T}\left(P_{t}V_{l}^{2}-V_{l}^{2}(s^{\prime}_{t})\right)+\sum_{t=1}^{T}\left(V_{l}^{2}(s^{\prime}_{t})-V_{l}^{2}(s_{t})\right)+\sum_{t=1}^{T}\left(V_{l}^{2}(s_{t})-(P_{t}V_{l})^{2}\right)
=(i)𝒪~(t=1T𝕍(Pt,Vl2)+dB2+B2L+t=1T(Vl(st)+PtVl)|ct+ϕt(wlw~l)|)\displaystyle\overset{\text{(i)}}{=}\tilde{\mathcal{O}}\left(\sqrt{\sum_{t=1}^{T}\mathbb{V}(P_{t},V_{l}^{2})}+dB_{\star}^{2}+B_{\star}^{2}L^{\prime}+\sum_{t=1}^{T}(V_{l}(s_{t})+P_{t}V_{l})\left|c_{t}+\phi_{t}^{\top}(w_{l}-\widetilde{w}_{l})\right|\right)
=(ii)𝒪~(Bt=1T𝕍(Pt,Vl)+dB2+B2L+BCT+Bt=1T|ϕtνt|),\displaystyle\overset{\text{(ii)}}{=}\tilde{\mathcal{O}}\left(B_{\star}\sqrt{\sum_{t=1}^{T}\mathbb{V}(P_{t},V_{l})}+dB_{\star}^{2}+B_{\star}^{2}L^{\prime}+B_{\star}C_{T}+B_{\star}\sum_{t=1}^{T}\left|\phi_{t}^{\top}\nu_{t}\right|\right),

where in (i) we apply Lemma 38, Vl(st)2(PtVl)2(Vl(st)+PtVl)|Ql(st,at)PtVl|V_{l}(s_{t})^{2}-(P_{t}V_{l})^{2}\leq(V_{l}(s_{t})+P_{t}V_{l})\left|Q_{l}(s_{t},a_{t})-P_{t}V_{l}\right|, Ql(st,at)=ϕtwlQ_{l}(s_{t},a_{t})=\phi_{t}^{\top}w_{l}, ct+PtVl=ϕtw~lc_{t}+P_{t}V_{l}=\phi_{t}^{\top}\widetilde{w}_{l}, and we bound the term t=1TVl2(st)Vl2(st)=l=1LVl2(stl+1)Vl2(stl+1)\sum_{t=1}^{T}V_{l}^{2}(s^{\prime}_{t})-V_{l}^{2}(s_{t})=\sum_{l=1}^{L}V_{l}^{2}(s^{\prime}_{t_{l+1}})-V_{l}^{2}(s_{t_{l}+1}) as follows: we consider four cases based on how epoch ll ends:

  1. stl+1=gs^{\prime}_{t_{l+1}}=g, then Vl2(stl+1)Vl2(stl+1)0V_{l}^{2}(s^{\prime}_{t_{l+1}})-V_{l}^{2}(s_{t_{l}+1})\leq 0.

  2. Vl(stl+1)=2BlV_{l}(s^{\prime}_{t_{l+1}})=2B_{l}; this happens LL^{\prime} times and the sum of these terms is of order 𝒪~(B2L)\tilde{\mathcal{O}}(B_{\star}^{2}L^{\prime}).

  3. Triggered by Eq. (4). By Lemma 27, this happens at most 𝒪~(d)\tilde{\mathcal{O}}(d) times and the sum of these terms is of order 𝒪~(dB2)\tilde{\mathcal{O}}(dB_{\star}^{2}).

  4. l=Ll=L is the last epoch. This happens only once and the term is bounded by 𝒪(B2)\mathcal{O}(B_{\star}^{2}).

In (ii), we apply Lemma 34, wlΩlw_{l}\in\Omega_{l}, the definition of νt\nu_{t}, and Vl=𝒪(B)\left\|{V_{l}}\right\|_{\infty}=\mathcal{O}(B_{\star}) by Lemma 4. Solving a quadratic inequality w.r.t t=1T𝕍(Pt,Vl)\sqrt{\sum_{t=1}^{T}\mathbb{V}(P_{t},V_{l})}, we have:

t=1T𝕍(Pt,Vl)=𝒪~(dB2+B2L+BCT+Bt=1T|ϕtνt|).\displaystyle\sum_{t=1}^{T}\mathbb{V}(P_{t},V_{l})=\tilde{\mathcal{O}}\left(dB_{\star}^{2}+B_{\star}^{2}L^{\prime}+B_{\star}C_{T}+B_{\star}\sum_{t=1}^{T}\left|\phi_{t}^{\top}\nu_{t}\right|\right).

This completes the proof. ∎

Lemma 22.

With probability at least 14δ1-4\delta, t=1T|ϕtνt|𝒪~(d3.5BCT+d3.5BϵT+d5B2)+65d2.5ϵT\sum_{t=1}^{T}\left|\phi_{t}^{\top}\nu_{t}\right|\leq\tilde{\mathcal{O}}\left(d^{3.5}\sqrt{B_{\star}C_{T}}+d^{3.5}\sqrt{B_{\star}\epsilon T}+d^{5}B_{\star}^{2}\right)+65d^{2.5}\epsilon T.

Proof.

Define ut=argmaxttT|ϕtνt|u_{t}=\operatorname*{argmax}_{t\leq t^{\prime}\leq T}|\phi_{t}^{\top}\nu_{t^{\prime}}|, Vj,t=2jI+i𝒯[t1]:|ϕiνui|2jϕiϕiV_{j,t}=2^{j}I+\sum_{i\in{\mathcal{T}}\cap[t-1]:|\phi_{i}^{\top}\nu_{u_{i}}|\leq 2^{j}}\phi_{i}\phi_{i}^{\top}, and jtj_{t} such that |ϕtνut|(2jt1,2jt]\left|\phi_{t}^{\top}\nu_{u_{t}}\right|\in(2^{j_{t}-1},2^{j_{t}}]. Also define 𝒯={t[T]:j𝒥t,|ϕtνut|(2j1,2j]}{\mathcal{T}}=\{t\in[T]:\exists j\in{\mathcal{J}}_{t},|\phi_{t}^{\top}\nu_{u_{t}}|\in(2^{j-1},2^{j}]\}. Note that when t𝒯t\notin{\mathcal{T}}, |ϕtνt|ϵ|\phi_{t}^{\top}\nu_{t}|\leq\epsilon. Then, for any t𝒯t\in{\mathcal{T}}:

|ϕtνut|ϕtWjt,ut1(νut)νutWjt,ut(νut)\displaystyle\left|\phi_{t}^{\top}\nu_{u_{t}}\right|\leq\left\|{\phi_{t}}\right\|_{W^{-1}_{j_{t},u_{t}}(\nu_{u_{t}})}\left\|{\nu_{u_{t}}}\right\|_{W_{j_{t},u_{t}}(\nu_{u_{t}})}
(i)22dϕtWjt,ut1(νut)νutWjt,lut(νut)+𝒪~(2jt(d3Bϵut))+2jt+5d2.5ϵ\displaystyle\overset{\text{(i)}}{\leq}2\sqrt{2}d\left\|{\phi_{t}}\right\|_{W^{-1}_{j_{t},u_{t}}(\nu_{u_{t}})}\left\|{\nu_{u_{t}}}\right\|_{W_{j_{t},l_{u_{t}}}(\nu_{u_{t}})}+\tilde{\mathcal{O}}\left(\sqrt{2^{j_{t}}\left(\frac{d^{3}B_{\star}\epsilon}{u_{t}}\right)}\right)+\sqrt{2^{j_{t}+5}d^{2.5}\epsilon}
(ii)22dϕtVjt,t12jt(itlut𝕍(Pi,Vlut)ιlut+dBιlut+dB2)+𝒪~(2jt(d3Bϵut))+2jt+5d2.5ϵ,\displaystyle\overset{\text{(ii)}}{\leq}2\sqrt{2}d\left\|{\phi_{t}}\right\|_{V^{-1}_{j_{t},t}}\sqrt{2^{j_{t}}\left(\sqrt{\sum_{i\leq t_{l_{u_{t}}}}\mathbb{V}(P_{i},V_{l_{u_{t}}})\iota_{l_{u_{t}}}}+\sqrt{d}B_{\star}\iota_{l_{u_{t}}}+dB_{\star}^{2}\right)}+\tilde{\mathcal{O}}\left(\sqrt{2^{j_{t}}\left(\frac{d^{3}B_{\star}\epsilon}{u_{t}}\right)}\right)+\sqrt{2^{j_{t}+5}d^{2.5}\epsilon},
=𝒪~(dϕtVjt,t12jt(d4B3+dBCT+dBϵT+dBιT+dB2)+2jt(d3Bϵut))+2jt+5d2.5ϵ\displaystyle=\tilde{\mathcal{O}}\left(d\left\|{\phi_{t}}\right\|_{V^{-1}_{j_{t},t}}\sqrt{2^{j_{t}}\left(\sqrt{d^{4}B_{\star}^{3}+dB_{\star}C_{T}+dB_{\star}\epsilon T}+\sqrt{d}B_{\star}\iota_{T}+dB_{\star}^{2}\right)}+\sqrt{2^{j_{t}}\left(\frac{d^{3}B_{\star}\epsilon}{u_{t}}\right)}\right)+\sqrt{2^{j_{t}+5}d^{2.5}\epsilon} (Lemma 23)

where in (i) we define ν¯ut𝒢ϵ/ut(6dBlut)\bar{\nu}_{u_{t}}\in\mathcal{G}_{\epsilon/u_{t}}(6\sqrt{d}B_{l_{u_{t}}}) such that νutν¯utϵut\left\|{\nu_{u_{t}}-\bar{\nu}_{u_{t}}}\right\|_{\infty}\leq\frac{\epsilon}{u_{t}} and apply

νutWjt,ut(νut)2=Φutjt(νut)=2jtνut22+i<utfjt(ϕiνut)\displaystyle\left\|{\nu_{u_{t}}}\right\|^{2}_{W_{j_{t},u_{t}}(\nu_{u_{t}})}=\Phi^{j_{t}}_{u_{t}}(\nu_{u_{t}})=2^{j_{t}}\left\|{\nu_{u_{t}}}\right\|_{2}^{2}+\sum_{i<u_{t}}f_{j_{t}}(\phi_{i}^{\top}\nu_{u_{t}})
2jtν¯ut22+i<utfjt(ϕiν¯ut)+2jt(νut22ν¯ut22)+2jt+1i<ut|ϕi(νutν¯ut)|\displaystyle\leq 2^{j_{t}}\left\|{\bar{\nu}_{u_{t}}}\right\|_{2}^{2}+\sum_{i<u_{t}}f_{j_{t}}(\phi_{i}^{\top}\bar{\nu}_{u_{t}})+2^{j_{t}}\left(\left\|{\nu_{u_{t}}}\right\|_{2}^{2}-\left\|{\bar{\nu}_{u_{t}}}\right\|_{2}^{2}\right)+2^{j_{t}+1}\sum_{i<u_{t}}\left|\phi_{i}^{\top}(\nu_{u_{t}}-\bar{\nu}_{u_{t}})\right| (fjtf_{j_{t}} is (22jt)(2\cdot 2^{j_{t}})-Lipschitz)
8d2(2jtν¯ut22+itlutfjt(ϕiν¯ut))+122jtdBlutϵut+2jt+1dϵ\displaystyle\leq 8d^{2}\left(2^{j_{t}}\left\|{\bar{\nu}_{u_{t}}}\right\|_{2}^{2}+\sum_{i\leq t_{l_{u_{t}}}}f_{j_{t}}(\phi_{i}^{\top}\bar{\nu}_{u_{t}})\right)+\frac{12\cdot 2^{j_{t}}dB_{l_{u_{t}}}\epsilon}{u_{t}}+2^{j_{t}+1}\sqrt{d}\epsilon (νut,ν¯ut𝔹(6dBlut)\nu_{u_{t}},\bar{\nu}_{u_{t}}\in\mathbb{B}(6\sqrt{d}B_{l_{u_{t}}}))
8d2(2jtνut22+itlutfjt(ϕiνut))+𝒪~(2jt(d3Bϵut))+2jt+5d2.5ϵ,\displaystyle\leq 8d^{2}\left(2^{j_{t}}\left\|{\nu_{u_{t}}}\right\|_{2}^{2}+\sum_{i\leq t_{l_{u_{t}}}}f_{j_{t}}(\phi_{i}^{\top}\nu_{u_{t}})\right)+\tilde{\mathcal{O}}\left(2^{j_{t}}\left(\frac{d^{3}B_{\star}\epsilon}{u_{t}}\right)\right)+2^{j_{t}+5}d^{2.5}\epsilon, (νut,ν¯ut𝔹(6dBlut)\nu_{u_{t}},\bar{\nu}_{u_{t}}\in\mathbb{B}(6\sqrt{d}B_{l_{u_{t}}}))

and in (ii) we apply Lemma 24 and:

Wjt,ut(νut)\displaystyle W_{j_{t},u_{t}}(\nu_{u_{t}}) Wjt,t(νut)=2jtI+i<tmin{1,2jt/|ϕiνut|}ϕiϕi(i)2jtI+i𝒯[t1]:|ϕiνui|2jtϕiϕi=Vjt,t.\succcurlyeq W_{j_{t},t}(\nu_{u_{t}})=2^{j_{t}}I+\sum_{i<t}\min\{1,2^{j_{t}}/|\phi_{i}^{\top}\nu_{u_{t}}|\}\phi_{i}\phi_{i}^{\top}\overset{\text{(i)}}{\succcurlyeq}2^{j_{t}}I+\sum_{i\in{\mathcal{T}}\cap[t-1]:|\phi_{i}^{\top}\nu_{u_{i}}|\leq 2^{j_{t}}}\phi_{i}\phi_{i}^{\top}=V_{j_{t},t}.

Here, (i) holds because |ϕiνut||ϕiνui|\left|\phi_{i}^{\top}\nu_{u_{t}}\right|\leq\left|\phi_{i}^{\top}\nu_{u_{i}}\right| by the definition of uiu_{i}. Reorganizing terms by |ϕtνut|(2jt1,2jt]\left|\phi_{t}^{\top}\nu_{u_{t}}\right|\in(2^{j_{t}-1},2^{j_{t}}], we have for t𝒯t\in{\mathcal{T}}:

|ϕtνt||ϕtνut|=𝒪~(d2ϕtVjt,t12(dBCT+dBϵT+d2B2)+d3Bϵut)+64d2.5ϵ.\displaystyle\left|\phi_{t}^{\top}\nu_{t}\right|\leq\left|\phi_{t}^{\top}\nu_{u_{t}}\right|=\tilde{\mathcal{O}}\left(d^{2}\left\|{\phi_{t}}\right\|_{V^{-1}_{j_{t},t}}^{2}\left(\sqrt{dB_{\star}C_{T}+dB_{\star}\epsilon T}+d^{2}B_{\star}^{2}\right)+\frac{d^{3}B_{\star}\epsilon}{u_{t}}\right)+64d^{2.5}\epsilon.

Finally, note that:

t𝒯|ϕtνt|=𝒪~(t𝒯d2ϕtVjt,t12(dBCT+dBϵT+d2B2)+t=1Td3Bϵt)+64d2.5ϵT\displaystyle\sum_{t\in{\mathcal{T}}}\left|\phi_{t}^{\top}\nu_{t}\right|=\tilde{\mathcal{O}}\left(\sum_{t\in{\mathcal{T}}}d^{2}\left\|{\phi_{t}}\right\|_{V^{-1}_{j_{t},t}}^{2}\left(\sqrt{dB_{\star}C_{T}+dB_{\star}\epsilon T}+d^{2}B_{\star}^{2}\right)+\sum_{t=1}^{T}\frac{d^{3}B_{\star}\epsilon}{t}\right)+64d^{2.5}\epsilon T
=𝒪~(dBt𝒯𝕀{ϕtVjt,t121}+d2t𝒯min{1,ϕtVjt,t12}(dBCT+dBϵT+d2B2)+d3Bϵ)+64d2.5ϵT.\displaystyle=\tilde{\mathcal{O}}\left(\sqrt{d}B_{\star}\sum_{t\in{\mathcal{T}}}\mathbb{I}\left\{\left\|{\phi_{t}}\right\|_{V^{-1}_{j_{t},t}}^{2}\geq 1\right\}+d^{2}\sum_{t\in{\mathcal{T}}}\min\left\{1,\left\|{\phi_{t}}\right\|_{V^{-1}_{j_{t},t}}^{2}\right\}\left(\sqrt{dB_{\star}C_{T}+dB_{\star}\epsilon T}+d^{2}B_{\star}^{2}\right)+d^{3}B_{\star}\epsilon\right)+64d^{2.5}\epsilon T.

The first term is bounded by

dBt𝒯𝕀{ϕtVjt,t121}\displaystyle\sqrt{d}B_{\star}\sum_{t\in{\mathcal{T}}}\mathbb{I}\left\{\left\|{\phi_{t}}\right\|_{V^{-1}_{j_{t},t}}^{2}\geq 1\right\} dBt𝒯min{1,ϕtVjt,t12}\displaystyle\leq\sqrt{d}B_{\star}\sum_{t\in{\mathcal{T}}}\min\left\{1,\left\|{\phi_{t}}\right\|_{V^{-1}_{j_{t},t}}^{2}\right\}
=dBj𝒥t𝒯𝕀{jt=j}min{1,ϕtVj,t12}=𝒪~(d1.5B),\displaystyle=\sqrt{d}B_{\star}\sum_{j\in{\mathcal{J}}}\sum_{t\in{\mathcal{T}}}\mathbb{I}\{j_{t}=j\}\min\left\{1,\left\|{\phi_{t}}\right\|_{V^{-1}_{j,t}}^{2}\right\}=\tilde{\mathcal{O}}\left(d^{1.5}B_{\star}\right), (Lemma 29)

For the second term:

d2t𝒯min{1,ϕtVjt,t12}(dBCT+dBϵT+d2B2)\displaystyle d^{2}\sum_{t\in{\mathcal{T}}}\min\left\{1,\left\|{\phi_{t}}\right\|_{V^{-1}_{j_{t},t}}^{2}\right\}\left(\sqrt{dB_{\star}C_{T}+dB_{\star}\epsilon T}+d^{2}B_{\star}^{2}\right)
=𝒪~(d2j𝒥t𝒯𝕀{jt=j}min{1,ϕtVj,t12}(dBCT+dBϵT+d2B2))\displaystyle=\tilde{\mathcal{O}}\left(d^{2}\sum_{j\in{\mathcal{J}}}\sum_{t\in{\mathcal{T}}}\mathbb{I}\{j_{t}=j\}\min\left\{1,\left\|{\phi_{t}}\right\|^{2}_{V^{-1}_{j,t}}\right\}\left(\sqrt{dB_{\star}C_{T}+dB_{\star}\epsilon T}+d^{2}B_{\star}^{2}\right)\right)
=(i)𝒪~(j𝒥d3(dBCT+dBϵT+d2B2))=𝒪~(d3(dBCT+dBϵT+d2B2)),\displaystyle\overset{\text{(i)}}{=}\tilde{\mathcal{O}}\left(\sum_{j\in{\mathcal{J}}}d^{3}\left(\sqrt{dB_{\star}C_{T}+dB_{\star}\epsilon T}+d^{2}B_{\star}^{2}\right)\right)=\tilde{\mathcal{O}}\left(d^{3}\left(\sqrt{dB_{\star}C_{T}+dB_{\star}\epsilon T}+d^{2}B_{\star}^{2}\right)\right),

where in (i) we apply Lemma 29. Putting everything together, we get:

t=1T|ϕtνt|t𝒯|ϕtνt|+ϵT𝒪~(d3.5BCT+d3.5BϵT+d5B2)+65d2.5ϵT.\displaystyle\sum_{t=1}^{T}\left|\phi_{t}^{\top}\nu_{t}\right|\leq\sum_{t\in{\mathcal{T}}}\left|\phi_{t}^{\top}\nu_{t}\right|+\epsilon T\leq\tilde{\mathcal{O}}\left(d^{3.5}\sqrt{B_{\star}C_{T}}+d^{3.5}\sqrt{B_{\star}\epsilon T}+d^{5}B_{\star}^{2}\right)+65d^{2.5}\epsilon T.

This completes the proof. ∎

Lemma 23.

With probability at least 13δ1-3\delta, itl𝕍(Pi,Vl)=𝒪~(d3B3+BCtl+Bϵtl)\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})=\tilde{\mathcal{O}}\left(d^{3}B_{\star}^{3}+B_{\star}C_{t_{l}}+B_{\star}\epsilon t_{l}\right).

Proof.

Note that when Vl(si)=0V_{l}(s_{i})=0, Vl(si)PiVl0V_{l}(s_{i})-P_{i}V_{l}\leq 0. Otherwise, Ql(si,a)>0Q_{l}(s_{i},a)>0 for any aa and Vl(si)Ql(si,ai)V_{l}(s_{i})\leq Q_{l}(s_{i},a_{i}). Therefore, Vl2(si)(PiVl)2=(Vl(si)+PiVl)(Vl(si)PiVl)(Vl(si)+PiVl)|Ql(si,ai)PiVl|V_{l}^{2}(s_{i})-(P_{i}V_{l})^{2}=(V_{l}(s_{i})+P_{i}V_{l})(V_{l}(s_{i})-P_{i}V_{l})\leq(V_{l}(s_{i})+P_{i}V_{l})\left|Q_{l}(s_{i},a_{i})-P_{i}V_{l}\right|. Then with probability at least 1δ1-\delta,

itl𝕍(Pi,Vl)\displaystyle\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l}) =itlPi(Vl)2(PiVl)2\displaystyle=\sum_{i\leq t_{l}}P_{i}(V_{l})^{2}-(P_{i}V_{l})^{2}
=itl(Pi(Vl)2Vl2(si))+itl(Vl2(si)Vl2(si))+itl(Vl2(si)(PiVl)2)\displaystyle=\sum_{i\leq t_{l}}\left(P_{i}(V_{l})^{2}-V_{l}^{2}(s^{\prime}_{i})\right)+\sum_{i\leq t_{l}}\left(V_{l}^{2}(s^{\prime}_{i})-V_{l}^{2}(s_{i})\right)+\sum_{i\leq t_{l}}\left(V_{l}^{2}(s_{i})-(P_{i}V_{l})^{2}\right)
=(i)𝒪~(ditl𝕍(Pi,Vl2)+dB2+B2+itl(Vl(si)+PiVl)|ci+ϕi(wlw~l)|)\displaystyle\overset{\text{(i)}}{=}\tilde{\mathcal{O}}\left(\sqrt{d\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l}^{2})}+dB_{\star}^{2}+B_{\star}^{2}+\sum_{i\leq t_{l}}(V_{l}(s_{i})+P_{i}V_{l})\left|c_{i}+\phi_{i}^{\top}(w_{l}-\widetilde{w}_{l})\right|\right)
=(ii)𝒪~(dBitl𝕍(Pi,Vl)+dB2+BCtl+Bitl|ϕi(wlw~l)|).\displaystyle\overset{\text{(ii)}}{=}\tilde{\mathcal{O}}\left(\sqrt{d}B_{\star}\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})}+dB_{\star}^{2}+B_{\star}C_{t_{l}}+B_{\star}\sum_{i\leq t_{l}}\left|\phi_{i}^{\top}(w_{l}-\widetilde{w}_{l})\right|\right).

In (i) we apply Lemma 25, itl(Vl2(si)Vl2(si))itl(Vl2(si+1)Vl2(si))=𝒪~(B2)\sum_{i\leq t_{l}}(V_{l}^{2}(s^{\prime}_{i})-V_{l}^{2}(s_{i}))\leq\sum_{i\leq t_{l}}(V_{l}^{2}(s_{i+1})-V_{l}^{2}(s_{i}))=\tilde{\mathcal{O}}(B_{\star}^{2}), Vl2(si)(PiVl)2(Vl(si)+PiVl)|Ql(si,ai)PiVl|V_{l}^{2}(s_{i})-(P_{i}V_{l})^{2}\leq(V_{l}(s_{i})+P_{i}V_{l})\left|Q_{l}(s_{i},a_{i})-P_{i}V_{l}\right|, Ql(si,ai)=ϕiwlQ_{l}(s_{i},a_{i})=\phi_{i}^{\top}w_{l}, and ci+PiVl=ϕiw~lc_{i}+P_{i}V_{l}=\phi_{i}^{\top}\widetilde{w}_{l}. In (ii) we apply Lemma 34. For ttlt\leq t_{l}, define νt=argmaxν=w~lw,wΩl|ϕtν|\nu^{\prime}_{t}=\operatorname*{argmax}_{\nu=\widetilde{w}_{l}-w,w\in\Omega_{l}}\left|\phi_{t}^{\top}\nu\right|. Then by wlΩlw_{l}\in\Omega_{l} and the definition of νt\nu^{\prime}_{t}, we have |ϕt(wlw~l)||ϕtνt|\left|\phi_{t}^{\top}(w_{l}-\widetilde{w}_{l})\right|\leq\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|. Now it suffices to bound ttl|ϕtνt|\sum_{t\leq t_{l}}\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|. Define 𝒯={ttl:j𝒥t,|ϕtνt|(2j1,2j]}{\mathcal{T}}=\{t\leq t_{l}:\exists j\in{\mathcal{J}}_{t},\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|\in(2^{j-1},2^{j}]\} and for t𝒯t\in{\mathcal{T}}, define jt𝒥tj_{t}\in{\mathcal{J}}_{t} such that |ϕtνt|(2jt1,2jt]\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|\in(2^{j_{t}-1},2^{j_{t}}]. Note that when t𝒯t\notin{\mathcal{T}}, |ϕtνt|ϵ|\phi_{t}^{\top}\nu^{\prime}_{t}|\leq\epsilon. Also define Vj,t=2jI+i𝒯[t1]:|ϕiνi|2jϕiϕiV_{j,t}=2^{j}I+\sum_{i\in{\mathcal{T}}\cap[t-1]:|\phi_{i}^{\top}\nu^{\prime}_{i}|\leq 2^{j}}\phi_{i}\phi_{i}^{\top}. Then, for any t𝒯t\in{\mathcal{T}}, with probability at least 12δ1-2\delta:

|ϕtνt|\displaystyle\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right| ϕtWjt,l1(νt)νtWjt,l(νt)ϕtVjt,l12jt(itl𝕍(Pi,Vl)ιl+dBιl+dB2),\displaystyle\leq\left\|{\phi_{t}}\right\|_{W^{-1}_{j_{t},l}(\nu^{\prime}_{t})}\left\|{\nu^{\prime}_{t}}\right\|_{W_{j_{t},l}(\nu^{\prime}_{t})}\leq\left\|{\phi_{t}}\right\|_{V^{-1}_{j_{t},l}}\sqrt{2^{j_{t}}\left(\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}+\sqrt{d}B_{\star}\iota_{l}+dB_{\star}^{2}\right)},

where in the last inequality we apply Lemma 24 and:

Wjt,l(νt)\displaystyle W_{j_{t},l}(\nu^{\prime}_{t}) =2jtI+itlmin{1,2jt/|ϕiνt|}ϕiϕi(i)2jtI+i𝒯:|ϕiνi|2jtϕiϕi=Vjt,l.\displaystyle=2^{j_{t}}I+\sum_{i\leq t_{l}}\min\{1,2^{j_{t}}/|\phi_{i}^{\top}\nu^{\prime}_{t}|\}\phi_{i}\phi_{i}^{\top}\overset{\text{(i)}}{\succcurlyeq}2^{j_{t}}I+\sum_{i\in{\mathcal{T}}:|\phi_{i}^{\top}\nu^{\prime}_{i}|\leq 2^{j_{t}}}\phi_{i}\phi_{i}^{\top}=V_{j_{t},l}.

Here, (i) holds because |ϕiνt||ϕiνi|\left|\phi_{i}^{\top}\nu^{\prime}_{t}\right|\leq\left|\phi_{i}^{\top}\nu^{\prime}_{i}\right|, which follows from the definition of νt\nu^{\prime}_{t}. Reorganizing terms according to |ϕtνt|(2jt1,2jt]\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|\in(2^{j_{t}-1},2^{j_{t}}], we have:

t𝒯|ϕtνt|=𝒪~(t𝒯ϕtVjt,l12(itl𝕍(Pi,Vl)ιl+dBιl+dB2))\displaystyle\sum_{t\in{\mathcal{T}}}\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|=\tilde{\mathcal{O}}\left(\sum_{t\in{\mathcal{T}}}\left\|{\phi_{t}}\right\|^{2}_{V^{-1}_{j_{t},l}}\left(\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}+\sqrt{d}B_{\star}\iota_{l}+dB_{\star}^{2}\right)\right)
=𝒪~(j𝒥t𝒯𝕀{jt=j}ϕtVj,l12(itl𝕍(Pi,Vl)ιl+dBιl+dB2))\displaystyle=\tilde{\mathcal{O}}\left(\sum_{j\in{\mathcal{J}}}\sum_{t\in{\mathcal{T}}}\mathbb{I}\{j_{t}=j\}\left\|{\phi_{t}}\right\|^{2}_{V^{-1}_{j,l}}\left(\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}+\sqrt{d}B_{\star}\iota_{l}+dB_{\star}^{2}\right)\right)
=(i)𝒪~(j𝒥d(itl𝕍(Pi,Vl)ιl+dBιl+dB2))=𝒪~(d(itl𝕍(Pi,Vl)ιl+dBιl+dB2)),\displaystyle\overset{\text{(i)}}{=}\tilde{\mathcal{O}}\left(\sum_{j\in{\mathcal{J}}}d\left(\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}+\sqrt{d}B_{\star}\iota_{l}+dB_{\star}^{2}\right)\right)=\tilde{\mathcal{O}}\left(d\left(\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}+\sqrt{d}B_{\star}\iota_{l}+dB_{\star}^{2}\right)\right),

where in (i) we apply

t𝒯𝕀{jt=j}ϕtVj,l12=tr(Vj,l1t𝒯𝕀{jt=j}ϕtϕt)tr(Vj,l1Vj,l)=d.\displaystyle\sum_{t\in{\mathcal{T}}}\mathbb{I}\{j_{t}=j\}\left\|{\phi_{t}}\right\|^{2}_{V^{-1}_{j,l}}=\text{tr}\left(V^{-1}_{j,l}\sum_{t\in{\mathcal{T}}}\mathbb{I}\{j_{t}=j\}\phi_{t}\phi_{t}^{\top}\right)\leq\text{tr}\left(V^{-1}_{j,l}V_{j,l}\right)=d.

Putting everything together and by ttl|ϕtνt|t𝒯|ϕtνt|+ϵtl\sum_{t\leq t_{l}}\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|\leq\sum_{t\in{\mathcal{T}}}\left|\phi_{t}^{\top}\nu^{\prime}_{t}\right|+\epsilon t_{l}, we have:

itl𝕍(Pi,Vl)\displaystyle\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l}) =𝒪~(dBitl𝕍(Pi,Vl)+dB2+BCtl+B(d2.5B2+d1.5itl𝕍(Pi,Vl)+ϵtl))\displaystyle=\tilde{\mathcal{O}}\left(\sqrt{d}B_{\star}\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})}+dB_{\star}^{2}+B_{\star}C_{t_{l}}+B_{\star}\left(d^{2.5}B_{\star}^{2}+d^{1.5}\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})}+\epsilon t_{l}\right)\right)
=𝒪~(d1.5Bitl𝕍(Pi,Vl)+d2.5B3+BCtl+Bϵtl).\displaystyle=\tilde{\mathcal{O}}\left(d^{1.5}B_{\star}\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})}+d^{2.5}B_{\star}^{3}+B_{\star}C_{t_{l}}+B_{\star}\epsilon t_{l}\right).

Solving a quadratic inequality w.r.t. itl𝕍(Pi,Vl)\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})}, we have itl𝕍(Pi,Vl)=𝒪~(d3B3+BCtl+Bϵtl)\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})=\tilde{\mathcal{O}}\left(d^{3}B_{\star}^{3}+B_{\star}C_{t_{l}}+B_{\star}\epsilon t_{l}\right). ∎
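The "solving a quadratic inequality" step used above (and again in the proof of Lemma 24) is the elementary fact that x ≤ a√x + b with a, b ≥ 0 forces x ≤ a² + 2b, since √x ≤ (a + √(a² + 4b))/2 and (a + √(a² + 4b))² ≤ 4a² + 8b. A minimal numerical sanity check of this fact (illustrative only, not part of the analysis):

```python
import math
import random

def quad_bound(a, b):
    """Largest x satisfying x <= a*sqrt(x) + b, via the quadratic formula:
    with u = sqrt(x), u^2 - a*u - b <= 0 gives u <= (a + sqrt(a^2 + 4b)) / 2."""
    r = (a + math.sqrt(a * a + 4 * b)) / 2
    return r * r

def check(a, b):
    x_max = quad_bound(a, b)
    # x_max satisfies the inequality (up to floating-point error) ...
    assert x_max <= a * math.sqrt(x_max) + b + 1e-6
    # ... and is dominated by a^2 + 2b, the bound used in the text,
    # since (a + sqrt(a^2 + 4b))^2 <= 2a^2 + 2(a^2 + 4b) = 4a^2 + 8b.
    assert x_max <= a * a + 2 * b + 1e-6
    return x_max

random.seed(0)
for _ in range(1000):
    check(random.uniform(0, 10), random.uniform(0, 10))
```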

Lemma 24.

With probability at least 12δ1-2\delta, for any epoch ll, j𝒥lj\in{\mathcal{J}}_{l}, and ν=w~lẘ\nu=\widetilde{w}_{l}-\mathring{w} with ẘΩl\mathring{w}\in\Omega_{l},

νWj,l(ν)2=𝒪(2j(itl𝕍(Pi,Vl)ιl+dBιl+dB2)).\left\|{\nu}\right\|^{2}_{W_{j,l}(\nu)}=\mathcal{O}\left(2^{j}\left(\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}+\sqrt{d}B_{\star}\iota_{l}+dB_{\star}^{2}\right)\right).
Proof.

Define ϵli(w)=ϵVli(w)=ϕiwciVl(si)\epsilon_{l}^{i}(w)=\epsilon_{V_{l}}^{i}(w)=\phi_{i}^{\top}w-c_{i}-V_{l}(s^{\prime}_{i}) and ηli(w)=ηVli(w)\eta_{l}^{i}(w)=\eta_{V_{l}}^{i}(w). Note that with probability at least 12δ1-2\delta:

νWj,l(ν)2jI2=itlclipj(ϕiν)ϕiν=itlclipj(ϕiν)(ϵli(w~l)ϵli(ẘ))\displaystyle\left\|{\nu}\right\|^{2}_{W_{j,l}(\nu)-2^{j}I}=\sum_{i\leq t_{l}}\textsf{clip}_{j}(\phi_{i}^{\top}\nu)\phi_{i}^{\top}\nu=\sum_{i\leq t_{l}}\textsf{clip}_{j}(\phi_{i}^{\top}\nu)(\epsilon^{i}_{l}(\widetilde{w}_{l})-\epsilon^{i}_{l}(\mathring{w}))
itlclipj2(ϕiν)ηli(w~l)ιl+itlclipj2(ϕiν)ηli(ẘ)ιl+2Bl2jιl\displaystyle\leq\sqrt{\sum_{i\leq t_{l}}\textsf{clip}_{j}^{2}(\phi_{i}^{\top}\nu)\eta^{i}_{l}(\widetilde{w}_{l})\iota_{l}}+\sqrt{\sum_{i\leq t_{l}}\textsf{clip}^{2}_{j}(\phi_{i}^{\top}\nu)\eta^{i}_{l}(\mathring{w})\iota_{l}}+2B_{l}2^{j}\iota_{l} (Lemma 3 and ẘΩl\mathring{w}\in\Omega_{l})
3itlclipj2(ϕiν)ηli(w~l)ιl+2itlclipj2(ϕiν)(ϕiν)2ιl+2Bl2jιl\displaystyle\leq 3\sqrt{\sum_{i\leq t_{l}}\textsf{clip}_{j}^{2}(\phi_{i}^{\top}\nu)\eta^{i}_{l}(\widetilde{w}_{l})\iota_{l}}+\sqrt{2\sum_{i\leq t_{l}}\textsf{clip}_{j}^{2}(\phi_{i}^{\top}\nu)(\phi_{i}^{\top}\nu)^{2}\iota_{l}}+2B_{l}2^{j}\iota_{l} (ϕiν=ϵli(w~l)ϵli(ẘ)\phi_{i}^{\top}\nu=\epsilon^{i}_{l}(\widetilde{w}_{l})-\epsilon^{i}_{l}(\mathring{w}) and (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2})
=𝒪~(2jitl𝕍(Pi,Vl)ιl+2jdBitlclipj(ϕiν)(ϕiν)ιl+B2jιl)\displaystyle=\tilde{\mathcal{O}}\left(2^{j}\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}+\sqrt{2^{j}\sqrt{d}B_{\star}\sum_{i\leq t_{l}}\textsf{clip}_{j}(\phi_{i}^{\top}\nu)(\phi_{i}^{\top}\nu)\iota_{l}}+B_{\star}2^{j}\iota_{l}\right) (Lemma 25, clipj()2j\textsf{clip}_{j}(\cdot)\leq 2^{j}, Bl2BB_{l}\leq 2B_{\star}, and |ϕiν|12dB\left|\phi_{i}^{\top}\nu\right|\leq 12\sqrt{d}B_{\star})
=𝒪~(2jitl𝕍(Pi,Vl)ιl+2jdBνWj,l(ν)2jI2ιl+B2jιl).\displaystyle=\tilde{\mathcal{O}}\left(2^{j}\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}+\sqrt{2^{j}\sqrt{d}B_{\star}\left\|{\nu}\right\|_{W_{j,l}(\nu)-2^{j}I}^{2}\iota_{l}}+B_{\star}2^{j}\iota_{l}\right).

Solving a quadratic inequality, we get νWj,l(ν)2=𝒪(2jitl𝕍(Pi,Vl)ιl+dB2jιl+2jdB2)\left\|{\nu}\right\|^{2}_{W_{j,l}(\nu)}=\mathcal{O}\left(2^{j}\sqrt{\sum_{i\leq t_{l}}\mathbb{V}(P_{i},V_{l})\iota_{l}}+\sqrt{d}B_{\star}2^{j}\iota_{l}+2^{j}dB_{\star}^{2}\right). ∎

Lemma 25.

With probability at least 1δ1-\delta, for any epoch ll, i=1tl(PiVlVl(si))2=𝒪~(i=1tl𝕍(Pi,Vl)+dB2)\sum_{i=1}^{t_{l}}(P_{i}V_{l}-V_{l}(s^{\prime}_{i}))^{2}=\tilde{\mathcal{O}}(\sum_{i=1}^{t_{l}}\mathbb{V}(P_{i},V_{l})+dB_{\star}^{2}) and i=1tlPiVl2Vl2(si)=𝒪~(di=1tl𝕍(Pi,Vl2)+dB2)\sum_{i=1}^{t_{l}}P_{i}V_{l}^{2}-V_{l}^{2}(s^{\prime}_{i})=\tilde{\mathcal{O}}\left(\sqrt{d\sum_{i=1}^{t_{l}}\mathbb{V}(P_{i},V_{l}^{2})}+dB_{\star}^{2}\right).

Proof.

For any t+t\in\mathbb{N}_{+}, B{2i}i=1log2BB\in\{2^{i}\}^{\lceil\log_{2}B_{\star}\rceil}_{i=1}, and w𝒢ϵ/t(3dB)w\in\mathcal{G}_{\epsilon/t}(3\sqrt{d}B), define Xi=(ϕiUBwciVw,B(si))2=(PiVw,BVw,B(si))2X_{i}=(\phi_{i}^{\top}U_{B}w-c_{i}-V_{w,B}(s^{\prime}_{i}))^{2}=(P_{i}V_{w,B}-V_{w,B}(s^{\prime}_{i}))^{2} and 𝔼i\mathbb{E}_{i} as the conditional expectation conditioned on the interaction history (s1,a1,,si,ai)(s_{1},a_{1},\ldots,s_{i},a_{i}). Note that 𝔼i[Xi]=𝕍(Pi,Vw,B)\mathbb{E}_{i}[X_{i}]=\mathbb{V}(P_{i},V_{w,B}) and |Xi|4B2|X_{i}|\leq 4B^{2}. Then by Lemma 37 with λ=14B2\lambda=\frac{1}{4B^{2}}, with probability at least 1δ1-\delta^{\prime} with δ=δ/(8(tlog2(2B))2(6dBt/ϵ)d)\delta^{\prime}=\delta/(8(t\log_{2}(2B))^{2}(6\sqrt{d}Bt/\epsilon)^{d}), we have:

i=1t(Xi𝕍(Pi,Vw,B))λi=1t𝔼i[Xi2]+ln(1/δ)λi=1t𝕍(Pi,Vw,B)+𝒪~(dB2).\displaystyle\sum_{i=1}^{t}\left(X_{i}-\mathbb{V}(P_{i},V_{w,B})\right)\leq\lambda\sum_{i=1}^{t}\mathbb{E}_{i}[X_{i}^{2}]+\frac{\ln(1/\delta^{\prime})}{\lambda}\leq\sum_{i=1}^{t}\mathbb{V}(P_{i},V_{w,B})+\tilde{\mathcal{O}}\left(dB_{\star}^{2}\right).

Reorganizing terms and by a union bound, we have with probability at least 1δ/21-\delta/2, for any t+t\in\mathbb{N}_{+}, B{2i}i=1log2BB\in\{2^{i}\}_{i=1}^{\lceil\log_{2}B_{\star}\rceil}, and w𝒢ϵ/t(3dB)w\in\mathcal{G}_{\epsilon/t}(3\sqrt{d}B):

i=1t(PiVw,BVw,B(si))2=i=1tXi2i=1t𝕍(Pi,Vw,B)+𝒪~(dB2).\sum_{i=1}^{t}\left(P_{i}V_{w,B}-V_{w,B}(s^{\prime}_{i})\right)^{2}=\sum_{i=1}^{t}X_{i}\leq 2\sum_{i=1}^{t}\mathbb{V}(P_{i},V_{w,B})+\tilde{\mathcal{O}}\left(dB_{\star}^{2}\right). (16)

Moreover, for any t+t\in\mathbb{N}_{+}, B{2i}i=1log2BB\in\{2^{i}\}_{i=1}^{\lceil\log_{2}B_{\star}\rceil}, and w𝒢ϵ/t(3dB)w\in\mathcal{G}_{\epsilon/t}(3\sqrt{d}B), by Lemma 38, with probability at least 1δ1-\delta^{\prime}:

i=1tPiVw,B2Vw,B2(si)=𝒪~(i=1t𝕍(Pi,Vw,B2)ln1δ+B2ln1δ)=𝒪~(di=1t𝕍(Pi,Vw,B2)+dB2).\sum_{i=1}^{t}P_{i}V_{w,B}^{2}-V_{w,B}^{2}(s^{\prime}_{i})=\tilde{\mathcal{O}}\left(\sqrt{\sum_{i=1}^{t}\mathbb{V}(P_{i},V_{w,B}^{2})\ln\frac{1}{\delta^{\prime}}}+B_{\star}^{2}\ln\frac{1}{\delta^{\prime}}\right)=\tilde{\mathcal{O}}\left(\sqrt{d\sum_{i=1}^{t}\mathbb{V}(P_{i},V_{w,B}^{2})}+dB_{\star}^{2}\right). (17)

Then again by a union bound, the equation above holds with probability at least 1δ/21-\delta/2 for any t+t\in\mathbb{N}_{+}, B{2i}i=1log2BB\in\{2^{i}\}_{i=1}^{\lceil\log_{2}B_{\star}\rceil}, and w𝒢ϵ/t(3dB)w\in\mathcal{G}_{\epsilon/t}(3\sqrt{d}B).

Now for any epoch ll, pick wl𝒢ϵ/tl(3dBl)w^{\prime}_{l}\in\mathcal{G}_{\epsilon/t_{l}}(3\sqrt{d}B_{l}) such that wlwlϵ/tl\left\|{w^{\prime}_{l}-w_{l}}\right\|_{\infty}\leq\epsilon/t_{l}. Also define Vl=Vwl,BlV^{\prime}_{l}=V_{w^{\prime}_{l},B_{l}} and w~l=UBlwl\widetilde{w}^{\prime}_{l}=U_{B_{l}}w^{\prime}_{l}. Then similar to Eq. (14) and Eq. (15), we have

VlVldϵ/tl,w~lw~l2dϵ/tl.\left\|{V_{l}-V^{\prime}_{l}}\right\|_{\infty}\leq\sqrt{d}\epsilon/t_{l},\quad\left\|{\widetilde{w}_{l}-\widetilde{w}^{\prime}_{l}}\right\|_{2}\leq d\epsilon/t_{l}. (18)

For the first statement:

i=1tl(PiVlVl(si))2\displaystyle\sum_{i=1}^{t_{l}}(P_{i}V_{l}-V_{l}(s^{\prime}_{i}))^{2} =i=1tl(ϕiw~lciVl(si))2\displaystyle=\sum_{i=1}^{t_{l}}\left(\phi_{i}^{\top}\widetilde{w}_{l}-c_{i}-V_{l}(s^{\prime}_{i})\right)^{2}
3i=1tl((ϕiw~lciVl(si))2+(Vl(si)Vl(si))2+(ϕi(w~lw~l))2)\displaystyle\leq 3\sum_{i=1}^{t_{l}}\left((\phi_{i}^{\top}\widetilde{w}^{\prime}_{l}-c_{i}-V^{\prime}_{l}(s^{\prime}_{i}))^{2}+(V_{l}(s^{\prime}_{i})-V^{\prime}_{l}(s^{\prime}_{i}))^{2}+(\phi_{i}^{\top}(\widetilde{w}_{l}-\widetilde{w}^{\prime}_{l}))^{2}\right)
𝒪~(i=1tl𝕍(Pi,Vl)+dB2)+6d2ϵ2tl\displaystyle\leq\tilde{\mathcal{O}}\left(\sum_{i=1}^{t_{l}}\mathbb{V}(P_{i},V^{\prime}_{l})+dB_{\star}^{2}\right)+\frac{6d^{2}\epsilon^{2}}{t_{l}} (Eq. (16) and Eq. (18))
=𝒪~(i=1tl𝕍(Pi,Vl)+i=1tl𝕍(Pi,VlVl)+dB2)=𝒪~(i=1tl𝕍(Pi,Vl)+dB2).\displaystyle=\tilde{\mathcal{O}}\left(\sum_{i=1}^{t_{l}}\mathbb{V}(P_{i},V_{l})+\sum_{i=1}^{t_{l}}\mathbb{V}(P_{i},V^{\prime}_{l}-V_{l})+dB_{\star}^{2}\right)=\tilde{\mathcal{O}}\left(\sum_{i=1}^{t_{l}}\mathbb{V}(P_{i},V_{l})+dB_{\star}^{2}\right). (Var[X+Y]2Var[X]+2Var[Y]\textsc{Var}[X+Y]\leq 2\textsc{Var}[X]+2\textsc{Var}[Y], 𝕍(Pi,VlVl)VlVl2\mathbb{V}(P_{i},V^{\prime}_{l}-V_{l})\leq\left\|{V^{\prime}_{l}-V_{l}}\right\|_{\infty}^{2}, Eq. (18), and dϵ1d\epsilon\leq 1)

For the second statement,

i=1tlPi(Vl)2Vl2(si)\displaystyle\sum_{i=1}^{t_{l}}P_{i}(V_{l})^{2}-V_{l}^{2}(s^{\prime}_{i}) =i=1tl(Pi(Vl)2Vl(si)2)+i=1tl(Pi(Vl)2Pi(Vl)2)+i=1tl(Vl2(si)Vl2(si))\displaystyle=\sum_{i=1}^{t_{l}}(P_{i}(V^{\prime}_{l})^{2}-V^{\prime}_{l}(s^{\prime}_{i})^{2})+\sum_{i=1}^{t_{l}}(P_{i}(V_{l})^{2}-P_{i}(V^{\prime}_{l})^{2})+\sum_{i=1}^{t_{l}}({V^{\prime}_{l}}^{2}(s^{\prime}_{i})-V_{l}^{2}(s^{\prime}_{i}))
𝒪~(di=1tl𝕍(Pi,Vl2)+dB2)+4Bi=1tlVlVl\displaystyle\leq\tilde{\mathcal{O}}\left(\sqrt{d\sum_{i=1}^{t_{l}}\mathbb{V}(P_{i},{V^{\prime}_{l}}^{2})}+dB_{\star}^{2}\right)+4B_{\star}\sum_{i=1}^{t_{l}}\left\|{V_{l}-V^{\prime}_{l}}\right\|_{\infty} (Eq. (17) and max{Vl,Vl}4B\max\{\left\|{V_{l}}\right\|_{\infty},\left\|{V^{\prime}_{l}}\right\|_{\infty}\}\leq 4B_{\star})
𝒪~(di=1tl𝕍(Pi,Vl2)+dB2+di=1tl𝕍(Pi,Vl2Vl2)+dBϵ)\displaystyle\leq\tilde{\mathcal{O}}\left(\sqrt{d\sum_{i=1}^{t_{l}}\mathbb{V}(P_{i},V_{l}^{2})}+dB_{\star}^{2}+\sqrt{d\sum_{i=1}^{t_{l}}\mathbb{V}(P_{i},V_{l}^{2}-{V^{\prime}_{l}}^{2})}+\sqrt{d}B_{\star}\epsilon\right) (Var[X+Y]2Var[X]+2Var[Y]\textsc{Var}[X+Y]\leq 2\textsc{Var}[X]+2\textsc{Var}[Y],x+yx+y\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}, and Eq. (18))
=𝒪~(di=1tl𝕍(Pi,Vl2)+dB2).\displaystyle=\tilde{\mathcal{O}}\left(\sqrt{d\sum_{i=1}^{t_{l}}\mathbb{V}(P_{i},V_{l}^{2})}+dB_{\star}^{2}\right). (Eq. (18) and ϵ1\epsilon\leq 1)

Thus, the second statement is proved. ∎

For the next lemma, we define the following auxiliary function:

gj(x)={x2,|x|2j,2j+1x4j,x>2j2j+1x4j,x<2j\displaystyle g_{j}(x)=\begin{cases}x^{2},&|x|\leq 2^{j},\\ 2^{j+1}x-4^{j},&x>2^{j}\\ -2^{j+1}x-4^{j},&x<-2^{j}\end{cases}

Note that gj(x)g_{j}(x) is convex and fj(x)gj(x)2fj(x)f_{j}(x)\leq g_{j}(x)\leq 2f_{j}(x).
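The sandwich claim can be spot-checked numerically. Here we take f_j(x) = min{x², 2ʲ|x|}, the clipped quadratic induced by clip_j in the proof of Lemma 24; since f_j is defined earlier in the paper, this identification is an assumption of the snippet (illustrative only):

```python
# Numeric spot-check of f_j(x) <= g_j(x) <= 2 f_j(x), assuming
# f_j(x) = min{x^2, 2^j |x|} (the clipped quadratic induced by clip_j).

def f(j, x):
    return min(x * x, (2 ** j) * abs(x))

def g(j, x):
    # the auxiliary function g_j defined in the text
    ell = 2 ** j
    if abs(x) <= ell:
        return x * x
    return 2 * ell * abs(x) - ell * ell  # branches x > 2^j and x < -2^j combined

def sandwich_holds(j, x):
    return f(j, x) <= g(j, x) <= 2 * f(j, x)

# check on a grid of scales j and points x
assert all(sandwich_holds(j, x / 100.0)
           for j in range(-3, 5)
           for x in range(-2000, 2001))
```

For |x| > 2ʲ the gap is g − f = 2ʲ(|x| − 2ʲ) ≥ 0 and 2f − g = 4ʲ ≥ 0, which is what the grid check confirms.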

Lemma 26.

For λ(0,1]\lambda\in(0,1], gj(λx)λ2gj(x)g_{j}(\lambda x)\geq\lambda^{2}g_{j}(x).

Proof.

Let =2j\ell=2^{j}. When |λx||\lambda x|\leq\ell, we have: gj(λx)=λ2x2λ2gj(x)g_{j}(\lambda x)=\lambda^{2}x^{2}\geq\lambda^{2}g_{j}(x). When λx>\lambda x>\ell (arguments are similar for λx<\lambda x<-\ell), we have x>x>\ell, and

gj(λx)λ2gj(x)\displaystyle g_{j}(\lambda x)-\lambda^{2}g_{j}(x) =2λx2λ2(2x2)=2λx(1λ)2(1λ2)\displaystyle=2\ell\lambda x-\ell^{2}-\lambda^{2}(2\ell x-\ell^{2})=2\ell\lambda x(1-\lambda)-\ell^{2}(1-\lambda^{2})
=(1λ)(2λx(1+λ))0.\displaystyle=(1-\lambda)\ell(2\lambda x-(1+\lambda)\ell)\geq 0.

This completes the proof. ∎
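As a quick numerical spot-check of Lemma 26 (illustrative, not part of the proof), the snippet below verifies g_j(λx) ≥ λ²g_j(x) on a grid of scales j, points x, and λ ∈ (0, 1]:

```python
def g(j, x):
    # the auxiliary function g_j from the text
    ell = 2 ** j
    if abs(x) <= ell:
        return x * x
    return 2 * ell * abs(x) - ell * ell

def lemma26_holds(j, x, lam, tol=1e-12):
    # Lemma 26: for lam in (0, 1], g_j(lam * x) >= lam^2 * g_j(x)
    return g(j, lam * x) >= lam * lam * g(j, x) - tol

assert all(lemma26_holds(j, x / 10.0, lam / 20.0)
           for j in range(-2, 4)
           for x in range(-300, 301)
           for lam in range(1, 21))
```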

Lemma 27.

Fix 2jϵ>02^{j}\geq\epsilon>0. Let x1,,xt𝔹(1)x_{1},\ldots,x_{t}\in\mathbb{B}(1). If there exists 0=τ0<τ1<<τz=t0=\tau_{0}<\tau_{1}<\cdots<\tau_{z}=t such that for each 1ζz1\leq\zeta\leq z, there exists νζ𝔹(B)𝔹(ϵ)\nu_{\zeta}\in\mathbb{B}(B)\setminus\mathbb{B}(\epsilon) for some B>ϵB>\epsilon such that

i=1τζfj(xiνζ)+2jνζ22>8d2(i=1τζ1fj(xiνζ)+2jνζ22)\sum_{i=1}^{\tau_{\zeta}}f_{j}(x_{i}^{\top}\nu_{\zeta})+2^{j}\left\|{\nu_{\zeta}}\right\|_{2}^{2}>8d^{2}\left(\sum_{i=1}^{\tau_{\zeta-1}}f_{j}(x_{i}^{\top}\nu_{\zeta})+2^{j}\left\|{\nu_{\zeta}}\right\|_{2}^{2}\right) (19)

Then, z=𝒪~(d)z=\tilde{\mathcal{O}}(d).

Proof.

Note that when Eq. (19) holds:

i=1τζgj(xiνζ)+2jνζ22\displaystyle\sum_{i=1}^{\tau_{\zeta}}g_{j}(x_{i}^{\top}\nu_{\zeta})+2^{j}\left\|{\nu_{\zeta}}\right\|_{2}^{2} i=1τζfj(xiνζ)+2jνζ22>8d2(i=1τζ1fj(xiνζ)+2jνζ22)\displaystyle\geq\sum_{i=1}^{\tau_{\zeta}}f_{j}(x_{i}^{\top}\nu_{\zeta})+2^{j}\left\|{\nu_{\zeta}}\right\|_{2}^{2}>8d^{2}\left(\sum_{i=1}^{\tau_{\zeta-1}}f_{j}(x_{i}^{\top}\nu_{\zeta})+2^{j}\left\|{\nu_{\zeta}}\right\|_{2}^{2}\right)
4d2(i=1τζ1gj(xiνζ)+2jνζ22).\displaystyle\geq 4d^{2}\left(\sum_{i=1}^{\tau_{\zeta-1}}g_{j}(x_{i}^{\top}\nu_{\zeta})+2^{j}\left\|{\nu_{\zeta}}\right\|_{2}^{2}\right). (20)

Thus, it suffices to bound the number of times Eq. (20) holds. Define Et(ν)=i=1tgj(xiν)+2jν22E_{t}(\nu)=\sum_{i=1}^{t}g_{j}(x_{i}^{\top}\nu)+2^{j}\left\|{\nu}\right\|_{2}^{2}. Clearly EtE_{t} is convex since gjg_{j} is convex, and Et(ν)[2jϵ2,2jB2+2t2jB]E_{t}(\nu)\in[2^{j}\epsilon^{2},2^{j}B^{2}+2t2^{j}B] for ν𝔹(B)𝔹(ϵ)\nu\in\mathbb{B}(B)\setminus\mathbb{B}(\epsilon). Define:

Λ={i:log2(2jϵ2)ilog2(2jB2+2t2jB)}.\Lambda=\{i\in\mathbb{Z}:\lceil\log_{2}(2^{j}\epsilon^{2})\rceil\leq i\leq\lceil\log_{2}(2^{j}B^{2}+2t2^{j}B)\rceil\}.

For each ζ\zeta, there exists iζΛi_{\zeta}\in\Lambda such that Eτζ1(νζ)(2iζ1,2iζ]E_{\tau_{\zeta-1}}(\nu_{\zeta})\in(2^{i_{\zeta}-1},2^{i_{\zeta}}]. Define Dt,i={ν𝔹(B):Et(ν)2i}D_{t^{\prime},i}=\{\nu\in\mathbb{B}(B):E_{t^{\prime}}(\nu)\leq 2^{i}\}. Note that νζDτζ1,iζ\nu_{\zeta}\in D_{\tau_{\zeta-1},i_{\zeta}}, and Dt,iD_{t^{\prime},i} is a symmetric convex set since EtE_{t} is a convex function and Et(ν)=Et(ν)E_{t}(\nu)=E_{t}(-\nu). By Lemma 26, we have Eτζ(νζ/d)1d2Eτζ(νζ)>4Eτζ1(νζ)>2iζE_{\tau_{\zeta}}(\nu_{\zeta}/d)\geq\frac{1}{d^{2}}E_{\tau_{\zeta}}(\nu_{\zeta})>4E_{\tau_{\zeta-1}}(\nu_{\zeta})>2^{i_{\zeta}}. Therefore, νζ/dDτζ,iζ\nu_{\zeta}/d\notin D_{\tau_{\zeta},i_{\zeta}}, which means that in the direction of νζ\nu_{\zeta}, the intercept of Dτζ,iζD_{\tau_{\zeta},i_{\zeta}} is at most 1/d1/d times that of Dτζ1,iζD_{\tau_{\zeta-1},i_{\zeta}}. By Lemma 35, we have: Vol(Dτζ,iζ)67Vol(Dτζ1,iζ)\text{Vol}(D_{\tau_{\zeta},i_{\zeta}})\leq\frac{6}{7}\text{Vol}(D_{\tau_{\zeta-1},i_{\zeta}}). Note that when ν22j\left\|{\nu}\right\|_{2}\leq 2^{j}, we have Et(ν)(t+2j)ν22E_{t}(\nu)\leq(t+2^{j})\left\|{\nu}\right\|_{2}^{2}. Therefore, when ν2ϵ=2j/(t+2j)ϵ\left\|{\nu}\right\|_{2}\leq\epsilon^{\prime}=\sqrt{2^{j}/(t+2^{j})}\epsilon, we have Et(ν)2jϵ2E_{t}(\nu)\leq 2^{j}\epsilon^{2}. Hence, Vol(Dt,i)Vol(𝔹(ϵ))\text{Vol}(D_{t,i})\geq\text{Vol}(\mathbb{B}(\epsilon^{\prime})) for iΛi\in\Lambda. Since Dt,iD_{t,i} is decreasing in tt, we have

z=𝒪(|Λ|log7/6(Vol(𝔹(B))/Vol(𝔹(ϵ))))=𝒪~(d).z=\mathcal{O}(|\Lambda|\log_{7/6}(\text{Vol}(\mathbb{B}(B))/\text{Vol}(\mathbb{B}(\epsilon^{\prime}))))=\tilde{\mathcal{O}}(d).

This completes the proof. ∎

Appendix D Auxiliary Lemmas

Lemma 28.

If x(ax+b)lnp(cx)x\leq(a\sqrt{x}+b)\ln^{p}(cx) for some a,b,c>0a,b,c>0 and absolute constant p1p\geq 1, then x=𝒪~(a2+b)x=\tilde{\mathcal{O}}(a^{2}+b).

Proof.

First note that x2blnp(cx)x\leq 2b\ln^{p}(cx) implies x2b(2p)pcxx\leq 2b(2p)^{p}\sqrt{cx} by lnxx\ln x\leq x for x>0x>0, which gives x4(2p)2pb2cx\leq 4(2p)^{2p}b^{2}c. Plugging this back, we get x2blnp(4(2p)2pb2c2)x\leq 2b\ln^{p}(4(2p)^{2p}b^{2}c^{2}). Therefore, x>2blnp(4(2p)2pb2c2)x>2b\ln^{p}(4(2p)^{2p}b^{2}c^{2}) implies x>2blnp(cx)x>2b\ln^{p}(cx). Next, note that x2axlnp(cx)x\leq 2a\sqrt{x}\ln^{p}(cx) implies x2ac1/4(4p)px3/4x\leq 2ac^{1/4}(4p)^{p}x^{3/4} by lnxx\ln x\leq x for x>0x>0, which gives x16(4p)4pa4cx\leq 16(4p)^{4p}a^{4}c. Plugging this back, we get x2axlnp(16(4p)4pa4c2)x\leq 2a\sqrt{x}\ln^{p}(16(4p)^{4p}a^{4}c^{2}), which gives x4a2ln2p(16(4p)4pa4c2)x\leq 4a^{2}\ln^{2p}(16(4p)^{4p}a^{4}c^{2}). Therefore, x>2axlnp(16(4p)4pa4c2)x>2a\sqrt{x}\ln^{p}(16(4p)^{4p}a^{4}c^{2}) implies x>2axlnp(cx)x>2a\sqrt{x}\ln^{p}(cx). Thus, x>4a2ln2p(16(4p)4pa4c2)+2blnp(4(2p)2pb2c2)x>4a^{2}\ln^{2p}(16(4p)^{4p}a^{4}c^{2})+2b\ln^{p}(4(2p)^{2p}b^{2}c^{2}) implies x2>axlnp(cx)\frac{x}{2}>a\sqrt{x}\ln^{p}(cx) and x2>blnp(cx)\frac{x}{2}>b\ln^{p}(cx), which implies x>(ax+b)lnp(cx)x>(a\sqrt{x}+b)\ln^{p}(cx). Taking the contrapositive, the statement is proved. ∎
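To make the constants concrete, the self-bounding conclusion can be checked numerically: the snippet below scans for the largest x satisfying x ≤ (a√x + b)lnᵖ(cx) and confirms it never exceeds the explicit threshold 4a²ln²ᵖ(16(4p)⁴ᵖa⁴c²) + 2b lnᵖ(4(2p)²ᵖb²c²) extracted in the proof. The grid range and step count are arbitrary choices for illustration; this is a sketch, not part of the argument.

```python
import math

def satisfies(x, a, b, c, p):
    # does x <= (a*sqrt(x) + b) * ln^p(c*x) hold?
    return x <= (a * math.sqrt(x) + b) * math.log(c * x) ** p

def explicit_bound(a, b, c, p):
    # the explicit threshold extracted in the proof of Lemma 28
    t1 = 4 * a * a * math.log(16 * (4 * p) ** (4 * p) * a ** 4 * c ** 2) ** (2 * p)
    t2 = 2 * b * math.log(4 * (2 * p) ** (2 * p) * b * b * c * c) ** p
    return t1 + t2

def largest_solution(a, b, c, p, x_hi=1e6, steps=100000):
    # crude grid scan for sup{x : x <= (a*sqrt(x) + b) * ln^p(c*x)}
    best = 0.0
    for i in range(1, steps + 1):
        x = x_hi * i / steps
        if satisfies(x, a, b, c, p):
            best = x
    return best

for (a, b, c, p) in [(2.0, 3.0, 5.0, 1), (1.0, 10.0, 2.0, 2), (5.0, 1.0, 1.5, 1)]:
    assert largest_solution(a, b, c, p) <= explicit_bound(a, b, c, p)
```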

Lemma 29.

(Abbasi-Yadkori et al., 2011, Lemma 11) Let {Xi}i=1\{X_{i}\}_{i=1}^{\infty} be a sequence in d\mathbb{R}^{d}, VV a d×dd\times d positive definite matrix, and define Vn=V+i=1nXiXiV_{n}=V+\sum_{i=1}^{n}X_{i}X_{i}^{\top}. Then, i=1nmin{1,XiVi112}2lndet(Vn)det(V)\sum_{i=1}^{n}\min\{1,\left\|{X_{i}}\right\|_{V_{i-1}^{-1}}^{2}\}\leq 2\ln\frac{\det(V_{n})}{\det(V)} for any n1n\geq 1.
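Lemma 29 is the standard elliptical potential argument and holds deterministically. As a quick numerical illustration (not part of the analysis), the following snippet verifies it for random features in dimension d = 2 with V = I, using explicit 2×2 linear algebra:

```python
import math
import random

# explicit 2x2 helpers
def mat_add(A, B): return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
def outer(x): return [[x[i] * x[j] for j in range(2)] for i in range(2)]
def det(A): return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def quad_inv(A, x):
    # x^T A^{-1} x for a 2x2 positive definite A (adjugate formula)
    inv = [[A[1][1], -A[0][1]], [-A[1][0], A[0][0]]]
    return sum(x[i] * inv[i][j] * x[j] for i in range(2) for j in range(2)) / det(A)

random.seed(1)
V = [[1.0, 0.0], [0.0, 1.0]]  # V = I, so det(V) = 1
lhs = 0.0
for _ in range(500):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    lhs += min(1.0, quad_inv(V, x))  # min{1, ||X_i||^2_{V_{i-1}^{-1}}}
    V = mat_add(V, outer(x))         # V_i = V_{i-1} + X_i X_i^T
rhs = 2 * math.log(det(V) / 1.0)     # 2 ln(det(V_n) / det(V))
assert lhs <= rhs
```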

Lemma 30.

(Abbasi-Yadkori et al., 2011, Lemma 12) Let AA, BB be positive semi-definite matrices such that ABA\succcurlyeq B. Then, we have supx0xAxxBxdet(A)det(B)\sup_{x\neq 0}\frac{x^{\top}Ax}{x^{\top}Bx}\leq\frac{\det(A)}{\det(B)}.
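A numerical spot-check of Lemma 30 (illustrative only): for 2×2 symmetric positive definite A ⪰ B, the supremum equals the largest generalized eigenvalue, i.e. the largest root μ of det(A − μB) = 0, which the snippet computes in closed form and compares against the determinant ratio:

```python
import math
import random

def det2(A): return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def sup_ratio(A, B):
    # largest root mu of det(A - mu*B) = 0, which equals
    # sup_x (x^T A x)/(x^T B x) for 2x2 symmetric positive definite A, B
    a2 = det2(B)
    a1 = -(A[0][0] * B[1][1] + A[1][1] * B[0][0]
           - A[0][1] * B[1][0] - A[1][0] * B[0][1])
    a0 = det2(A)
    disc = math.sqrt(max(a1 * a1 - 4 * a2 * a0, 0.0))
    return (-a1 + disc) / (2 * a2)

random.seed(2)
for _ in range(200):
    # B = I + y y^T is positive definite, A = B + z z^T, so A >= B
    y = [random.uniform(-1, 1) for _ in range(2)]
    z = [random.uniform(-1, 1) for _ in range(2)]
    B = [[1.0 + y[0] * y[0], y[0] * y[1]],
         [y[0] * y[1], 1.0 + y[1] * y[1]]]
    A = [[B[0][0] + z[0] * z[0], B[0][1] + z[0] * z[1]],
         [B[1][0] + z[0] * z[1], B[1][1] + z[1] * z[1]]]
    assert sup_ratio(A, B) <= det2(A) / det2(B) + 1e-9
```

The bound is intuitive here: the product of the two generalized eigenvalues is det(A)/det(B), and the smaller one is at least 1 since A ⪰ B, so the larger one is at most det(A)/det(B).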

Lemma 31.

(Wei et al., 2021b, Lemma 11) Let {xt}t=1\{x_{t}\}_{t=1}^{\infty} be a martingale sequence on state space 𝒳{\mathcal{X}} w.r.t a filtration {t}t=0\{{\mathcal{F}}_{t}\}_{t=0}^{\infty}, {ϕt}t=1\{\phi_{t}\}_{t=1}^{\infty} be a sequence of random vectors in d\mathbb{R}^{d} so that ϕtt1\phi_{t}\in{\mathcal{F}}_{t-1} and ϕt1\left\|{\phi_{t}}\right\|\leq 1, Λt=λI+s=1t1ϕsϕs\Lambda_{t}=\lambda I+\sum_{s=1}^{t-1}\phi_{s}\phi_{s}^{\top}, and 𝒱𝒳{\mathcal{V}}\subseteq\mathbb{R}^{{\mathcal{X}}} be a set of functions defined on 𝒳{\mathcal{X}} with 𝒩ε{\mathcal{N}}_{\varepsilon} as its ε\varepsilon-covering number w.r.t the distance dist(v,v)=supx|v(x)v(x)|\text{dist}(v,v^{\prime})=\sup_{x}\left|v(x)-v^{\prime}(x)\right| for some ε>0\varepsilon>0. Then for any δ>0\delta>0, we have with probability at least 1δ1-\delta, for all t>0t>0 and v𝒱v\in{\mathcal{V}} so that supx|v(x)|B\sup_{x}\left|v(x)\right|\leq B:

s=1t1ϕs(v(xs)𝔼[v(xs)|s1])Λt124B2[d2ln(t+λλ)+ln𝒩εδ]+8t2ε2λ.\displaystyle\left\|{\sum_{s=1}^{t-1}\phi_{s}\left(v(x_{s})-\mathbb{E}[v(x_{s})|{\mathcal{F}}_{s-1}]\right)}\right\|^{2}_{\Lambda_{t}^{-1}}\leq 4B^{2}\left[\frac{d}{2}\ln\left(\frac{t+\lambda}{\lambda}\right)+\ln\frac{{\mathcal{N}}_{\varepsilon}}{\delta}\right]+\frac{8t^{2}\varepsilon^{2}}{\lambda}.
Lemma 32.

(Wei et al., 2021b, Lemma 12) Let 𝒱{\mathcal{V}} be a class of mappings from 𝒳{\mathcal{X}} to \mathbb{R} parameterized by α[D,D]n\alpha\in[-D,D]^{n}. Suppose that for any v𝒱v\in{\mathcal{V}} (parameterized by α\alpha) and v𝒱v^{\prime}\in{\mathcal{V}} (parameterized by α\alpha^{\prime}), the following holds:

supx𝒳|v(x)v(x)|Lαα1.\displaystyle\sup_{x\in{\mathcal{X}}}\left|v(x)-v^{\prime}(x)\right|\leq L\left\|{\alpha-\alpha^{\prime}}\right\|_{1}.

Then, ln𝒩εnln(2DLnε)\ln{\mathcal{N}}_{\varepsilon}\leq n\ln\left(\frac{2DLn}{\varepsilon}\right), where 𝒩ε{\mathcal{N}}_{\varepsilon} is the ε\varepsilon-covering number of 𝒱{\mathcal{V}} with respect to the distance dist(v,v)=supx𝒳|v(x)v(x)|\text{dist}(v,v^{\prime})=\sup_{x\in{\mathcal{X}}}\left|v(x)-v^{\prime}(x)\right|.

Lemma 33.

(Zhou et al., 2021a, Theorem 4.1) Let {t}t=1\{{\mathcal{F}}_{t}\}_{t=1}^{\infty} be a filtration, {xt,ηt}t1\{x_{t},\eta_{t}\}_{t\geq 1} a stochastic process so that xtdx_{t}\in\mathbb{R}^{d}, ηt\eta_{t}\in\mathbb{R}, and xtt,ηtt+1x_{t}\in{\mathcal{F}}_{t},\eta_{t}\in{\mathcal{F}}_{t+1}. Moreover, define yt=μ,xt+ηty_{t}=\left\langle\mu^{\star},x_{t}\right\rangle+\eta_{t} and we have:

|ηt|R,𝔼[ηt|t]=0,𝔼[ηt2|t]σ2,xt2L.\displaystyle|\eta_{t}|\leq R,\;\mathbb{E}[\eta_{t}|{\mathcal{F}}_{t}]=0,\;\mathbb{E}[\eta_{t}^{2}|{\mathcal{F}}_{t}]\leq\sigma^{2},\;\left\|{x_{t}}\right\|_{2}\leq L.

Then with probability at least 1δ1-\delta, we have for any t1t\geq 1:

i=1txiηiZt1βt,μtμZtβt+λμ2,\displaystyle\left\|{\sum_{i=1}^{t}x_{i}\eta_{i}}\right\|_{Z_{t}^{-1}}\leq\beta_{t},\;\left\|{\mu_{t}-\mu^{\star}}\right\|_{Z_{t}}\leq\beta_{t}+\sqrt{\lambda}\left\|{\mu^{\star}}\right\|_{2},

where μt=Zt1bt,Zt=λI+i=1txixi,bt=i=1tyixi\mu_{t}=Z_{t}^{-1}b_{t},Z_{t}=\lambda I+\sum_{i=1}^{t}x_{i}x_{i}^{\top},b_{t}=\sum_{i=1}^{t}y_{i}x_{i}, and

βt=8σdln(1+tL2/(dλ))ln(4t2/δ)+4Rln(4t2/δ).\beta_{t}=8\sigma\sqrt{d\ln(1+tL^{2}/(d\lambda))\ln(4t^{2}/\delta)}+4R\ln(4t^{2}/\delta).
Lemma 34.

(Chen et al., 2021a, Lemma 30) For any two random variables X,YX,Y, we have:

Var[XY]2Var[X]Y2+2(𝔼[X])2Var[Y].\displaystyle\textsc{Var}[XY]\leq 2\textsc{Var}[X]\left\|{Y}\right\|_{\infty}^{2}+2(\mathbb{E}[X])^{2}\textsc{Var}[Y].

Consequently, XCVar[X2]4C2Var[X]\left\|{X}\right\|_{\infty}\leq C\implies\textsc{Var}[X^{2}]\leq 4C^{2}\textsc{Var}[X].
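Lemma 34's inequality holds for arbitrarily dependent X and Y. As an illustration (not part of the analysis), the snippet below computes all quantities exactly on random finite joint distributions supported on a 3×3 grid of values and checks the bound:

```python
import itertools
import random

def var(values, probs):
    # exact variance and mean of a finite distribution
    m = sum(v * p for v, p in zip(values, probs))
    return sum((v - m) ** 2 * p for v, p in zip(values, probs)), m

random.seed(3)
for _ in range(500):
    # random joint distribution of (X, Y) over a 3x3 grid of support points
    xs = [random.uniform(-2, 2) for _ in range(3)]
    ys = [random.uniform(-2, 2) for _ in range(3)]
    w = [random.uniform(0.01, 1.0) for _ in range(9)]
    tot = sum(w)
    p = [wi / tot for wi in w]
    pairs = list(itertools.product(xs, ys))
    var_xy, _ = var([x * y for x, y in pairs], p)
    var_x, mean_x = var([x for x, _ in pairs], p)
    var_y, _ = var([y for _, y in pairs], p)
    y_inf = max(abs(y) for y in ys)  # ||Y||_inf
    # Var[XY] <= 2 Var[X] ||Y||_inf^2 + 2 (E[X])^2 Var[Y]
    assert var_xy <= 2 * var_x * y_inf ** 2 + 2 * mean_x ** 2 * var_y + 1e-9
```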

Lemma 35.

(Zhang et al., 2021, Lemma 16) Let DD be a bounded symmetric convex subset of d\mathbb{R}^{d} with d2d\geq 2. Suppose uDu\in\partial D, that is, uu is on the boundary of DD, and DD^{\prime} is another bounded symmetric convex set such that DDD\subseteq D^{\prime} and duDd\cdot u\in\partial D^{\prime}. Then Vol(D)76Vol(D)\text{Vol}(D^{\prime})\geq\frac{7}{6}\text{Vol}(D), where Vol(S)\text{Vol}(S) is the volume of the set SS.

Lemma 36.

(Zhang et al., 2021, Theorem 4) Let {Xi}i=1n\{X_{i}\}_{i=1}^{n} be a martingale difference sequence and |Xi|b|X_{i}|\leq b almost surely. Then for δ<e1\delta<e^{-1}, we have with probability at least 16δlog2n1-6\delta\log_{2}n,

|i=1nXi|8i=1nXi2ln1δ+16bln1δ.\displaystyle\left|\sum_{i=1}^{n}X_{i}\right|\leq 8\sqrt{\sum_{i=1}^{n}X_{i}^{2}\ln\frac{1}{\delta}}+16b\ln\frac{1}{\delta}.
Lemma 37.

(Jin et al., 2020a, Lemma 9) Let {Xi}i=1n\{X_{i}\}_{i=1}^{n} be a martingale difference sequence adapted to the filtration {i}i=0n\{{\mathcal{F}}_{i}\}_{i=0}^{n}, and XiBX_{i}\leq B almost surely for some B>0B>0. Then, for any λ[0,1/B]\lambda\in[0,1/B], with probability at least 1δ1-\delta:

i=1nXiλi=1n𝔼[Xi2|i1]+ln(1/δ)λ.\displaystyle\sum_{i=1}^{n}X_{i}\leq\lambda\sum_{i=1}^{n}\mathbb{E}[X_{i}^{2}|{\mathcal{F}}_{i-1}]+\frac{\ln(1/\delta)}{\lambda}.
Lemma 38.

Let {Xi}i=1\{X_{i}\}_{i=1}^{\infty} be a martingale difference sequence adapted to the filtration {i}i=0\{{\mathcal{F}}_{i}\}_{i=0}^{\infty} and |Xi|B|X_{i}|\leq B for some B>0B>0. Then with probability at least 1δ1-\delta, for all n1n\geq 1 simultaneously,

|i=1nXi|3i=1n𝔼[Xi2|i1]ln4B2n3δ+2Bln4B2n3δ.\displaystyle\left|\sum_{i=1}^{n}X_{i}\right|\leq 3\sqrt{\sum_{i=1}^{n}\mathbb{E}[X_{i}^{2}|{\mathcal{F}}_{i-1}]\ln\frac{4B^{2}n^{3}}{\delta}}+2B\ln\frac{4B^{2}n^{3}}{\delta}.
Proof.

For each n1n\geq 1, applying Lemma 37 to {Xi}i=1n\{X_{i}\}_{i=1}^{n} and {Xi}i=1n\{-X_{i}\}_{i=1}^{n} with each λΛ={1B2i}i=0log2n\lambda\in\Lambda=\{\frac{1}{B2^{i}}\}_{i=0}^{\lceil\log_{2}n\rceil}, we have with probability at least 1δ2n21-\frac{\delta}{2n^{2}}, for any λΛ\lambda\in\Lambda,

|i=1nXi|λi=1n𝔼[Xi2|i1]+ln4Bn3δλ.\left|\sum_{i=1}^{n}X_{i}\right|\leq\lambda\sum_{i=1}^{n}\mathbb{E}[X_{i}^{2}|{\mathcal{F}}_{i-1}]+\frac{\ln\frac{4Bn^{3}}{\delta}}{\lambda}. (21)

Note that there exists λΛ\lambda^{\star}\in\Lambda such that λ/min{1/B,ln(4Bn3/δ)i=1n𝔼[Xi2|i1]}(12,1]\lambda^{\star}/\min\left\{1/B,\sqrt{\frac{\ln(4Bn^{3}/\delta)}{\sum_{i=1}^{n}\mathbb{E}[X_{i}^{2}|{\mathcal{F}}_{i-1}]}}\right\}\in(\frac{1}{2},1]. Plugging λ\lambda^{\star} into Eq. (21), we get |i=1nXi|3i=1n𝔼[Xi2|i1]ln4Bn3δ+2Bln4Bn3δ\left|\sum_{i=1}^{n}X_{i}\right|\leq 3\sqrt{\sum_{i=1}^{n}\mathbb{E}[X_{i}^{2}|{\mathcal{F}}_{i-1}]\ln\frac{4Bn^{3}}{\delta}}+2B\ln\frac{4Bn^{3}}{\delta}. By a union bound over nn, the statement is proved. ∎
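The existence of such a λ* in the geometric grid can be spot-checked numerically. The snippet below samples surrogate values for Σᵢ𝔼[Xᵢ²|Fᵢ₋₁] and for the log term (these stand-ins are assumptions for illustration) and verifies that a grid point always lands in (target/2, target]:

```python
import math
import random

def grid_point(target, B, n):
    # the geometric grid Lambda = {1/(B*2^i)}, i = 0..ceil(log2 n), from the proof;
    # returns a grid point lambda with lambda/target in (1/2, 1], if one exists
    lams = [1.0 / (B * 2 ** i) for i in range(int(math.ceil(math.log2(n))) + 1)]
    hits = [lam for lam in lams if target / 2 < lam <= target]
    return hits[0] if hits else None

random.seed(4)
for _ in range(1000):
    B = random.uniform(0.5, 10.0)
    n = random.randint(2, 10 ** 6)
    S = random.uniform(1e-6, n * B * B)  # stands in for sum_i E[X_i^2 | F_{i-1}] <= n B^2
    L = math.log(4 * B * n ** 3 / 0.01)  # stands in for the log term with delta = 0.01
    target = min(1.0 / B, math.sqrt(L / S))
    assert grid_point(target, B, n) is not None
```

The check goes through because the target always lies between the smallest grid point 1/(B·2^⌈log₂n⌉) ≤ 1/(Bn) and the largest grid point 1/B, and consecutive grid points differ by a factor of 2.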

Lemma 39.

(Cohen et al., 2020, Lemma D.4) and (Cohen et al., 2021, Lemma E.2) Let {Xi}i=1\{X_{i}\}_{i=1}^{\infty} be a sequence of random variables adapted to the filtration {i}i=0\{{\mathcal{F}}_{i}\}_{i=0}^{\infty} and Xi[0,B]X_{i}\in[0,B] almost surely. Then with probability at least 1δ1-\delta, for all n1n\geq 1 simultaneously:

i=1n𝔼[Xi|i1]\displaystyle\sum_{i=1}^{n}\mathbb{E}[X_{i}|{\mathcal{F}}_{i-1}] 2i=1nXi+4Bln4nδ,\displaystyle\leq 2\sum_{i=1}^{n}X_{i}+4B\ln\frac{4n}{\delta},
i=1nXi\displaystyle\sum_{i=1}^{n}X_{i} 2i=1n𝔼[Xi|i1]+8Bln4nδ.\displaystyle\leq 2\sum_{i=1}^{n}\mathbb{E}[X_{i}|{\mathcal{F}}_{i-1}]+8B\ln\frac{4n}{\delta}.
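As an empirical spot-check of Lemma 39 (illustrative; a simulation is of course not a proof), the snippet below draws i.i.d. Bernoulli sequences, for which the conditional means are known by construction, and verifies that the two inequalities fail in well under a δ fraction of trials:

```python
import math
import random

def lemma39_holds(xs, means, B, delta):
    # checks both inequalities of Lemma 39 for one realized sequence,
    # where means[i] stands for E[X_i | F_{i-1}] (known here by construction)
    n = len(xs)
    slack = math.log(4 * n / delta)
    e_sum, x_sum = sum(means), sum(xs)
    return (e_sum <= 2 * x_sum + 4 * B * slack
            and x_sum <= 2 * e_sum + 8 * B * slack)

random.seed(5)
B, delta, failures = 1.0, 0.1, 0
for _ in range(2000):
    p = random.uniform(0.0, 1.0)
    xs = [B if random.random() < p else 0.0 for _ in range(200)]
    if not lemma39_holds(xs, [B * p] * 200, B, delta):
        failures += 1
# the bounds should fail in well under a delta fraction of trials
assert failures <= delta * 2000
```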