Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP
Abstract
We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first algorithm is computationally efficient and achieves a regret bound of order $\sqrt{K}$ with polynomial dependence on $d$, $B_\star$, and $T_\star$, where $d$ is the dimension of the feature space, $B_\star$ and $T_\star$ are upper bounds on the expected costs and hitting time of the optimal policy respectively, and $K$ is the number of episodes. The same algorithm with a slight modification also achieves logarithmic (in $K$) regret, with the leading term inversely proportional to $\mathrm{gap}_{\min}$ and depending polynomially on $d$, $B_\star$, and $1/c_{\min}$, where $\mathrm{gap}_{\min}$ is the minimum sub-optimality gap and $c_{\min}$ is the minimum cost over all state-action pairs. Our result is obtained by developing a simpler and improved analysis for the finite-horizon approximation of (Cohen et al., 2021) with a smaller approximation error, which might be of independent interest. On the other hand, using variance-aware confidence sets in a global optimization problem, our second algorithm is computationally inefficient but achieves the first “horizon-free” regret bound, of order $\sqrt{K}$ with no polynomial dependency on $T_\star$ or $1/c_{\min}$, almost matching the lower bound from (Min et al., 2021).
1 Introduction
We study the stochastic shortest path (SSP) model, where a learner attempts to reach a goal state while minimizing her costs in a stochastic environment. SSP is a suitable model for many real-world applications, such as games, car navigation, robotic manipulation, etc. Online reinforcement learning in SSP has received great attention recently. In this setting, learning proceeds in episodes over a Markov Decision Process (MDP). In each episode, starting from a fixed initial state, the learner sequentially takes an action, incurs a cost, and transits to the next state until reaching the goal state. The performance of the learner is measured by her regret, the difference between her total costs and that of the optimal policy. SSP is a strict generalization of the heavily-studied finite-horizon reinforcement learning problem, where the learner is guaranteed to reach the goal state after a fixed number of steps.
Modern reinforcement learning applications often need to handle a massive state space, in which case function approximation is necessary. There has been significant progress in the study of linear function approximation, for both the finite-horizon setting (Ayoub et al., 2020; Jin et al., 2020b; Yang & Wang, 2020; Zanette et al., 2020a, b; Zhou et al., 2021a) and the infinite-horizon setting (Wei et al., 2021b; Zhou et al., 2021a, b). Recently, Vial et al. (2021) took the first step in considering linear function approximation for SSP. They studied SSP defined over a linear MDP and proposed a computationally inefficient algorithm with a $\sqrt{K}$-type regret bound, as well as an efficient algorithm with a substantially worse rate in $K$ (omitting other dependencies). Their bounds depend on $d$, the dimension of the feature space, $B_\star$, an upper bound on the expected costs of the optimal policy, and $c_{\min}$, the minimum cost across all state-action pairs. Later, Min et al. (2021) studied a related but different SSP problem defined over a linear mixture MDP and achieved a $\sqrt{K}$-type regret bound. Despite leveraging advances from both the finite-horizon and infinite-horizon settings, the results above are still far from optimal in terms of regret guarantees or computational efficiency, demonstrating the unique challenges of SSP problems.
In this work, we further extend our understanding of SSP with linear function approximation (more specifically, with linear MDPs). Our contributions are as follows:
• In Section 3, we first propose a new analysis for the finite-horizon approximation of SSP introduced in (Cohen et al., 2021), which is much simpler and achieves a smaller approximation error. Our analysis is also model agnostic, meaning that it does not make use of the modeling assumption and can be applied to both the tabular setting and function approximation settings. Combining this new analysis with a simple finite-horizon algorithm similar to that of (Jin et al., 2020b), we achieve a $\sqrt{K}$-type regret bound with polynomial dependence on $d$, $B_\star$, and $T_\star$, with $T_\star$ being an upper bound on the expected hitting time of the optimal policy, which strictly improves over that of (Vial et al., 2021). Notably, unlike their algorithm, ours is computationally efficient without any extra assumption.
• In Section 3.3, we further show that the same algorithm above with a slight modification achieves a logarithmic instance-dependent expected regret bound inversely proportional to a sub-optimality gap $\mathrm{gap}_{\min}$. As far as we know, this is the first logarithmic regret bound for SSP (with or without function approximation). We also establish an instance-dependent lower bound, which further advances our understanding of this problem even though it does not exactly match our upper bound.
• To remove the undesirable dependency on $T_\star$ in our instance-independent bound, in Section 4, we further develop a computationally inefficient algorithm that makes use of certain variance-aware confidence sets in a global optimization problem and achieves a $\sqrt{K}$-type regret bound. Importantly, this bound is horizon-free in the sense that it has no polynomial dependency on $T_\star$ or $1/c_{\min}$, even in the lower-order terms. Moreover, it almost matches the best known lower bound from (Min et al., 2021).
Techniques
Our results are built upon several technical innovations. First, as mentioned, we develop an improved analysis for the finite-horizon approximation of (Cohen et al., 2021), which might be of independent interest. The key idea is to directly bound the total approximation error with respect to the regret bound of the finite-horizon algorithm, instead of analyzing the estimation precision for each state-action pair as done in (Cohen et al., 2021).
Second, to obtain the logarithmic bound in Section 3, we note that it is not enough to simply combine the aforementioned finite-horizon approximation and the existing logarithmic regret results for the finite-horizon setting such as (He et al., 2021), since the sub-optimality gap obtained in this way is in terms of the finite-horizon counterpart instead of the original SSP and could be substantially smaller. We resolve this issue via a longer horizon in the approximation and a careful two-stage analysis.
Finally, our horizon-free result in Section 4 is obtained by a novel combination of several ideas, including the global optimization algorithm of (Zanette et al., 2020b; Wei et al., 2021b), the variance-aware confidence sets of (Zhang et al., 2021) (for a related but different setting with linear mixture MDPs), an improved analysis of the variance-aware confidence sets (Kim et al., 2021), and finally a new clipping trick and new update conditions that we propose. Our analysis does not require the recursion-based technique of (Zhang et al., 2020a) (for the tabular case), nor estimating higher order moments of value functions as in (Zhang et al., 2021) (for linear mixture MDPs), which might also be of independent interest.
Related work
Regret minimization of SSP under stochastic costs has been well studied in the tabular setting (that is, no function approximation) (Tarbouriech et al., 2020; Cohen et al., 2020, 2021; Tarbouriech et al., 2021; Chen et al., 2021a; Jafarnia-Jahromi et al., 2021). There are also several works (Rosenberg & Mansour, 2020; Chen et al., 2021b; Chen & Luo, 2021) considering the more challenging setting with adversarial costs (which is beyond the scope of this work).
Beyond linear function approximation, researchers in the finite-horizon setting have also started considering theoretical guarantees for general function approximation (Wang et al., 2020; Ishfaq et al., 2021; Kong et al., 2021). The study of SSP, which again is a strict generalization of the finite-horizon problem and might be a better model for many applications, falls behind in this regard, motivating us to explore this direction with the goal of providing a more complete picture, at least for linear function approximation.
The use of variance information is crucial in obtaining optimal regret bounds in MDPs. This dates back to the work of (Lattimore & Hutter, 2012) for the discounted setting, which has been significantly extended to the finite-horizon setting (Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019; Zhang et al., 2020a, b). Constructing variance-aware confidence sets for linear bandits and linear mixture MDPs has also gained recent attention (Zhou et al., 2021a; Zhang et al., 2021; Kim et al., 2021). We are among the first to do so for linear MDPs (a concurrent work (Wei et al., 2021a) also does so but for a completely different purpose of improving robustness against corruption).
2 Preliminary
An SSP instance is defined by an MDP $M = (\mathcal{S}, \mathcal{A}, s_{\mathrm{init}}, g, c, P)$. Here, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the (finite) action space (with $A = |\mathcal{A}|$), $s_{\mathrm{init}} \in \mathcal{S}$ is the initial state, $g \notin \mathcal{S}$ is the goal state, $c : \mathcal{S} \times \mathcal{A} \to [c_{\min}, 1]$ is the cost function with some global lower bound $c_{\min} \ge 0$, and $P = \{P_{s,a}\}_{(s,a) \in \mathcal{S} \times \mathcal{A}}$ with $P_{s,a} \in \Delta_{\mathcal{S} \cup \{g\}}$ is the transition function, where $P_{s,a}$ is a shorthand for $P(\cdot \mid s, a)$ and $\Delta_{\mathcal{X}}$ is the simplex over a set $\mathcal{X}$.
The learning protocol is as follows: the learner interacts with the environment for $K$ episodes. In each episode, the learner starts in the initial state $s_{\mathrm{init}}$, sequentially takes an action, incurs a cost, and transits to the next state. An episode ends when the learner reaches the goal state $g$. We denote by $(s_t, a_t, s'_t)$ the $t$-th state-action-state triplet observed among all episodes, so that $s'_t \sim P_{s_t, a_t}$ for each $t$, and $s_{t+1} = s'_t$ unless $s'_t = g$ (in which case $s_{t+1} = s_{\mathrm{init}}$, the initial state of the next episode). Also denote by $T$ the total number of steps in the $K$ episodes.
Learning objective
The learner’s goal is to learn a policy that reaches the goal state with minimum costs. Formally, a (stationary and deterministic) policy $\pi$ is a mapping that assigns an action $\pi(s) \in \mathcal{A}$ to each state $s \in \mathcal{S}$. We say $\pi$ is proper if following $\pi$ (that is, taking action $\pi(s)$ whenever in state $s$) reaches the goal state with probability $1$. Given a proper policy $\pi$, we define its value function and action-value function as
$$V^{\pi}(s) = \mathbb{E}\bigg[\sum_{i=1}^{I} c\big(s_i, \pi(s_i)\big) \,\bigg|\, s_1 = s\bigg], \qquad Q^{\pi}(s, a) = c(s, a) + \mathbb{E}_{s' \sim P_{s,a}}\big[V^{\pi}(s')\big],$$
where the expectation in $V^{\pi}$ is with respect to the randomness of the next states and the number of steps $I$ before reaching the goal $g$ (with $V^{\pi}(g) = 0$). Let $\Pi_{\mathrm{proper}}$ be the set of proper policies. We make the basic assumption that $\Pi_{\mathrm{proper}}$ is non-empty. Under this assumption, there exists an optimal proper policy $\pi^{\star}$, such that $V^{\pi^{\star}}(s) = \min_{\pi \in \Pi_{\mathrm{proper}}} V^{\pi}(s)$ and $Q^{\pi^{\star}}(s, a) = \min_{\pi \in \Pi_{\mathrm{proper}}} Q^{\pi}(s, a)$ for all $(s, a)$ (Bertsekas & Yu, 2013). We use $V^{\star}$ and $Q^{\star}$ as shorthands for $V^{\pi^{\star}}$ and $Q^{\pi^{\star}}$. The formal goal of the learner is then to minimize her regret against $\pi^{\star}$, that is, the difference between her total costs and that of the optimal proper policy, defined as
$$R_K = \sum_{k=1}^{K} \sum_{i=1}^{I_k} c\big(s_i^k, a_i^k\big) - K \cdot V^{\star}(s_{\mathrm{init}}),$$
where $I_k$ is the number of steps in episode $k$ and $(s_i^k, a_i^k)$ is the $i$-th state-action pair of episode $k$. We also define $R_K = \infty$ if some episode never reaches the goal (that is, if $I_k = \infty$ for some $k$).
Linear SSP
In the so-called tabular setting, the state space is assumed to be small, and algorithms with computational complexity and regret bounds depending on $|\mathcal{S}|$ and $A$ are acceptable. To handle a potentially massive state space, however, we consider the same linear function approximation setting as (Vial et al., 2021), where the MDP enjoys a linear structure in both the transition and cost functions (known as a linear or low-rank MDP).
Assumption 1 (Linear SSP).
For some $d > 0$, there exist a known feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$, an unknown cost parameter $\theta_c \in \mathbb{R}^d$, and $d$ unknown (signed) measures $\mu = (\mu_1, \ldots, \mu_d)$ over $\mathcal{S} \cup \{g\}$, such that for any $(s, a) \in \mathcal{S} \times \mathcal{A}$ and $s' \in \mathcal{S} \cup \{g\}$, we have
$$c(s, a) = \phi(s, a)^{\top} \theta_c, \qquad P(s' \mid s, a) = \phi(s, a)^{\top} \mu(s').$$
Moreover, we assume $\|\phi(s, a)\|_2 \le 1$ for all $(s, a)$, $\|\theta_c\|_2 \le \sqrt{d}$, and $\|\mu(\mathcal{S}')\|_2 \le \sqrt{d}$ for any $\mathcal{S}' \subseteq \mathcal{S} \cup \{g\}$.
We refer the reader to (Vial et al., 2021) and references therein for justification of this widely-used structural assumption (especially of the last few norm constraints). Under Assumption 1, by definition we have $Q^{\star}(s, a) = \phi(s, a)^{\top} \theta^{\star}$, where $\theta^{\star} = \theta_c + \int_{s'} V^{\star}(s') \,\mathrm{d}\mu(s')$, that is, $Q^{\star}$ is also linear in the features.
Key parameters and notations
Two extra parameters that play a key role in our analysis are $B_\star$, the maximum expected cost of the optimal policy starting from any state, and $T_\star$, the maximum expected hitting time of the optimal policy starting from any state; that is, $B_\star = \max_s V^{\star}(s)$ and $T_\star = \max_s T^{\pi^{\star}}(s)$, where $T^{\pi}(s)$ is the expected number of steps before reaching the goal if one follows policy $\pi$ starting from state $s$. By definition, we have $c_{\min} T_\star \le B_\star \le T_\star$ (since all costs lie in $[c_{\min}, 1]$).
For simplicity, we assume that $B_\star$, $T_\star$, and $c_{\min}$ are known to the learner for most discussions, and defer to the appendix what we can achieve when some of these parameters are unknown. We also assume $c_{\min} > 0$ by default (and will discuss the $c_{\min} = 0$ case for specific algorithms if modifications are needed).
For a positive integer $n$, we define $[n] = \{1, \ldots, n\}$. For any real number $x$, we define its projection onto an interval $[a, b]$ as $\min\{\max\{x, a\}, b\}$. The $\widetilde{O}(\cdot)$ notation hides all logarithmic terms, including $\ln K$ and $\ln(1/\delta)$ for some confidence level $\delta \in (0, 1)$.
3 An Efficient Algorithm for Linear SSP
In this section, we introduce a computationally efficient algorithm for linear SSP. In Section 3.1, we first develop an improved analysis for the finite-horizon approximation of (Cohen et al., 2021). Then in Section 3.2, we combine this approximation with a simple finite-horizon algorithm, which together achieves a $\sqrt{K}$-type regret bound. Finally, in Section 3.3, we further obtain a logarithmic regret bound via a slightly modified algorithm and a careful two-stage analysis.
3.1 Finite-Horizon Approximation of SSP
Finite-horizon approximation has been frequently used in solving SSP problems (Chen et al., 2021b; Chen & Luo, 2021; Cohen et al., 2021; Chen et al., 2021a). In particular, Cohen et al. (2021) proposed a black-box reduction from SSP to a finite-horizon MDP, which achieves a minimax optimal regret bound in the tabular case when combined with a certain finite-horizon algorithm. We will make use of the same algorithmic reduction in our proposed algorithm, but with an improved analysis.
Specifically, for an SSP instance $M$, define its finite-horizon MDP counterpart $\widetilde{M}$, which shares the state and action spaces of $M$ and is equipped with an extended cost function, a terminal cost function (more details to follow), an extended transition function, and a horizon parameter $H$. Assume access to a corresponding finite-horizon algorithm that learns $\widetilde{M}$ through a certain number of “intervals” following the protocol below. At the beginning of an interval, the learner is first reset to an arbitrary state. Then, in each of the $H$ steps within this interval, it decides an action, transits to the next state, and suffers the corresponding cost. At the end of the interval, the learner suffers an additional terminal cost and then moves on to the next interval.
With such black-box access, the reduction of (Cohen et al., 2021) is depicted in Algorithm 1. The algorithm partitions the time steps into intervals of length $H$ (with $H$ chosen so that $\pi^{\star}$ reaches the goal within $H$ steps with high probability). In each step, the algorithm follows the finite-horizon algorithm in a natural way and feeds the observations back to it. If the goal state is not reached within an interval, the algorithm naturally enters the next interval with the initial state being the current state. Otherwise, if the goal state is reached within some interval, we keep feeding the goal state and zero cost to the finite-horizon algorithm until it finishes the current interval, and after that, the next interval corresponds to the beginning of the next episode of the original SSP problem.
Input: Algorithm for finite-horizon MDP with horizon .
Initialize: interval counter .
for do
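To make the reduction concrete, the following is a minimal Python sketch of the interval loop described above; it is not the exact pseudocode of Algorithm 1, and the `env` and `learner` interfaces (`reset`, `step`, `start_interval`, `act`, `observe`, `end_interval`) are hypothetical placeholders, with the terminal cost assumed to be charged inside the learner.

```python
def finite_horizon_reduction(env, learner, H, K, goal):
    """Sketch of the SSP-to-finite-horizon reduction (interfaces are hypothetical)."""
    s = env.reset()                    # initial state of the first episode
    episode = 1
    while episode <= K:
        learner.start_interval(s)      # the finite-horizon learner is reset to the current state
        done = False                   # whether the goal has been reached in this interval
        for h in range(H):
            a = learner.act(s, h)
            if not done:
                s_next, cost, done = env.step(a)
            else:
                s_next, cost = goal, 0.0          # keep feeding the goal state and zero cost
            learner.observe(h, s, a, cost, s_next)
            s = s_next
        learner.end_interval(s)        # the learner suffers the terminal cost at the last state
        if done:                       # goal reached: the next interval starts a new episode
            episode += 1
            if episode <= K:
                s = env.reset()
        # otherwise, the next interval simply continues from the current state s
```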
Analysis
Cohen et al. (2021) showed that under this reduction, the regret of the SSP problem is very close to the regret of the finite-horizon algorithm in the finite-horizon MDP $\widetilde{M}$. Specifically, define the regret of the finite-horizon algorithm over a given number of initial intervals of $\widetilde{M}$ as its total costs in these intervals (note the inclusion of the terminal costs) minus the corresponding sum of optimal first-layer values of $\widetilde{M}$ (see Appendix B.1 for the formal definition). Denote by $N$ the final (random) number of intervals created during the $K$ episodes. Then Cohen et al. (2021) showed the following (a proof is included in Appendix B.2 for completeness).
Lemma 1.
Algorithm 1 ensures .
This lemma suggests that it remains to bound the number of intervals. The analysis of Cohen et al. (2021) does so by marking state-action pairs as “known” or “unknown” based on how many times they have been visited, and by showing that in each interval, the learner either reaches an “unknown” state-action pair or with high probability reaches the goal state. This analysis requires the finite-horizon algorithm to be “admissible” (defined through a set of conditions) and also heavily relies on the tabular setting to keep track of the status of each state-action pair, making it hard to generalize directly to function approximation settings. Furthermore, it also introduces an extra horizon dependency in the lower-order term of the regret, since the total cost of an interval in which an “unknown” state-action pair is visited is trivially bounded only by the interval length.
Instead, we propose the following simple and improved analysis. The idea is to separate intervals into “good” ones, within which the learner reaches the goal state, and “bad” ones, within which the learner does not. Then, our key observation is that the regret in each bad interval is at least of order $B_\star$: the learner’s cost in such an interval is at least the terminal cost charged for not reaching the goal (a constant multiple of $B_\star$), while the optimal policy’s expected cost is at most roughly $B_\star$. Therefore, if the finite-horizon algorithm is a no-regret algorithm, the number of bad intervals has to be small. More formally, based on this idea we can bound the number of intervals directly in terms of the regret guarantee of the finite-horizon algorithm without requiring any extra properties from it, as shown in the following theorem.
Theorem 1.
Suppose that enjoys the following regret guarantee with certain probability: for some problem-dependent coefficients and (that are independent of ) and any number of intervals . Then, with the same probability, the number of intervals created by Algorithm 1 satisfies .
Proof.
For any finite , we will show , which then implies that has to be finite and is upper bounded by the same quantity. To do so, we define the set of good intervals where the learner reaches the goal state, and also the total costs of the learner in interval of : . By definition and the guarantee of , we have
(1)
Next, we derive lower bounds on and respectively. First note that by Lemma 17 and , we have that reaches the goal within steps with probability at least . Therefore, executing in an episode of leads to at most costs in expectation, which implies for any . By , we thus have
On the other hand, for , we have due to the terminal cost , and thus
Combining the two lower bounds above with Eq. (1), we arrive at . By Lemma 28, this implies , finishing the proof. ∎
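In schematic form, write $N$ for the total number of intervals and $N_{\mathrm{bad}}$ for the number of bad intervals, and suppose the finite-horizon algorithm guarantees regret at most $\beta_1 \sqrt{N} + \beta_2$. Assuming (as a placeholder for the exact constants in the proof above) that each bad interval contributes regret at least $B_\star$, the counting argument reads
$$B_\star \cdot N_{\mathrm{bad}} \;\le\; \beta_1 \sqrt{N} + \beta_2, \qquad N \;\le\; K + N_{\mathrm{bad}},$$
and solving the resulting quadratic inequality in $\sqrt{N}$ gives $N = O\big(K + \beta_1^2 / B_\star^2 + \beta_2 / B_\star\big)$.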
Now, plugging the bound on the number of intervals from Theorem 1 into Lemma 1, we immediately obtain the following corollary, a general regret bound for the finite-horizon approximation.
Corollary 2.
Under the same condition of Theorem 1, Algorithm 1 ensures (with the same probability stated in Theorem 1).
Proof.
Note that the final regret bound completely depends on the regret guarantee of the finite-horizon algorithm. In particular, in the tabular case, if we apply a variant of EB-SSP (Tarbouriech et al., 2021) that achieves a horizon-free finite-horizon regret bound (note the lack of polynomial dependency on the horizon),[1] then Corollary 2 ensures a bound improving the results of (Cohen et al., 2021) and matching the best existing bounds of (Tarbouriech et al., 2021; Chen et al., 2021a); see Appendix B.5 for more details. This is not achievable by the analysis of (Cohen et al., 2021) due to the dependency in the lower-order term mentioned earlier. [1] This variant is equivalent to applying EB-SSP on a homogeneous finite-horizon MDP.
More importantly, our analysis is model agnostic: it only makes use of the regret guarantee of the finite-horizon algorithm, and does not leverage any modeling assumption on the SSP instance. This enables us to directly apply our result to settings with function approximation. In Appendix B.6, we provide an example for SSP with a linear mixture MDP, which gives a new regret bound by combining Corollary 2 and the near-optimal finite-horizon algorithm of (Zhou et al., 2021a).
3.2 Applying an Efficient Finite-Horizon Algorithm for Linear MDPs
Similarly, if there were a horizon-free algorithm for finite-horizon linear MDPs, we could directly combine it with Algorithm 1 and obtain a $T_\star$-independent regret bound. However, to our knowledge, this is still open due to some unique challenges for linear MDPs.
Nevertheless, even combining Algorithm 1 with a horizon-dependent linear MDP algorithm already leads to a significant improvement over the state of the art for linear SSP. Specifically, the finite-horizon algorithm we apply is a variant of LSVI-UCB (Jin et al., 2020b), which performs Least-Squares Value Iteration with an optimistic modification. The pseudocode is shown in Algorithm 2. Utilizing the fact that action-value functions are linear in the features for a linear MDP, in each interval we estimate the parameters of these linear functions by solving a set of least-squares linear regression problems using all observed data, and we encourage exploration by subtracting a bonus term in the definition of the action-value estimates. Then, we simply act greedily with respect to the truncated action-value estimates. Clearly, this is an efficient algorithm with polynomial (in $d$, $A$, and the number of intervals) time complexity for each interval.
We refer the reader to (Jin et al., 2020b) for more explanation of the algorithm, and point out three key modifications we make compared to their version. First, Jin et al. (2020b) maintain a separate covariance matrix for each layer $h$ using data only from layer $h$, while we only maintain a single covariance matrix using data across all layers. This is possible (and results in a better regret bound) since the transition function is the same in each layer of $\widetilde{M}$. Another modification is to define the value estimate of layer $H + 1$ simply as the terminal cost, for the purpose of incorporating the terminal cost. Finally, we project the action-value estimates onto a bounded interval controlled by some parameter, similarly to (Vial et al., 2021). In the main text this parameter is simply set large enough that the upper-bound truncation has no effect. However, this projection will become important when learning without the knowledge of $B_\star$ or $T_\star$ (see Appendix B.4).
Parameters: , where is the failure probability and .
Initialize: .
for do
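The following simplified Python sketch illustrates the core computation of an LSVI-UCB-style update with a single covariance matrix shared across layers; the function signature, the handling of the terminal cost, and the bonus coefficient `beta` are illustrative assumptions rather than the exact specification of Algorithm 2.

```python
import numpy as np

def lsvi_ucb_weights(phi, costs, next_feats, terminal_costs, H, lam, beta, B):
    """Backward least-squares value iteration with a subtracted (optimistic) bonus.

    phi:            (n, d) features phi(s_i, a_i) of all observed transitions
    costs:          (n,)   observed costs
    next_feats:     (n, A, d) features of every action at each observed next state
    terminal_costs: (n,)   terminal cost at each observed next state
    H, lam, beta, B: horizon, ridge parameter, bonus coefficient, truncation level
    Returns the weight vectors (w_1, ..., w_H) of the estimated Q-functions.
    """
    n, d = phi.shape
    Lambda = lam * np.eye(d) + phi.T @ phi           # one covariance matrix for all layers
    Lambda_inv = np.linalg.inv(Lambda)
    V_next = terminal_costs.copy()                   # layer H+1 value = terminal cost
    weights = [None] * H
    for h in reversed(range(H)):
        w = Lambda_inv @ (phi.T @ (costs + V_next))  # ridge regression on cost + next value
        weights[h] = w
        q = next_feats @ w                           # (n, A) plug-in Q-values at the next states
        bonus = beta * np.sqrt(np.einsum('nad,de,nae->na',
                                         next_feats, Lambda_inv, next_feats))
        V_next = np.clip((q - bonus).min(axis=1), 0.0, B)   # optimistic, truncated value
    return weights
```

Acting greedily in an interval then amounts to choosing, at layer $h$ and state $s$, the action minimizing the analogously truncated optimistic estimate built from $w_h$.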
We show the following regret guarantee of Algorithm 2 following the analysis of (Vial et al., 2021) (see Appendix B.3).
Lemma 2.
With probability at least , Algorithm 2 with ensures for any .
Applying Corollary 2 we then immediately obtain the following new result for linear SSP.
Theorem 3.
Applying Algorithm 1 with and being Algorithm 2 with to the linear SSP problem ensures with probability at least .
There is some gap between our result above and the existing lower bound for this problem (Min et al., 2021). In particular, the dependency on $T_\star$ inherited from the horizon dependency in Lemma 2 is most likely unnecessary. Nevertheless, this already strictly improves over the best existing bound from (Vial et al., 2021) since $T_\star \le B_\star / c_{\min}$. Moreover, our algorithm is computationally efficient, while the algorithms of Vial et al. (2021) are either inefficient or achieve a much worse regret bound (unless some strong assumptions are made). This improvement comes from the fact that our algorithm uses non-stationary policies (due to the finite-horizon approximation), which avoids the challenging problem of solving the fixed point of some empirical Bellman equation. This also demonstrates the power of the finite-horizon approximation in solving SSP problems. On the other hand, obtaining the same regret guarantee by learning only stationary policies is an interesting future direction.
Learning without knowing $B_\star$ or $T_\star$
Note that the result of Theorem 3 requires the knowledge of $B_\star$ and $T_\star$. Without knowing these parameters, we can still efficiently obtain a regret bound matching the bound of (Vial et al., 2021) achieved by their inefficient algorithm. See Appendix B.4 for details.
3.3 Logarithmic Regret
Many optimistic algorithms attain a more favorable regret bound of the form $O(C \ln K)$, where $C$ is an instance-dependent constant usually inversely proportional to some gap measure; see e.g. (Jaksch et al., 2010) for the infinite-horizon setting and (Simchowitz & Jamieson, 2019) for the finite-horizon setting. In this section, we show that a slight modification of our algorithm also leads to an expected regret bound that is polylogarithmic in $K$ and inversely proportional to the minimum sub-optimality gap $\mathrm{gap}_{\min}$.[2] [2] Note that for our definition of regret, a polylogarithmic bound is only possible in expectation, because even if the learner always executes $\pi^{\star}$, the deviation of her total costs from $K \cdot V^{\star}(s_{\mathrm{init}})$ is already of order $\sqrt{K}$.
The high-level idea is as follows. The first observation is that, similarly to a recent work by He et al. (2021), we can show that our Algorithm 2 obtains a gap-dependent logarithmic regret bound for the finite-horizon problem. The caveat is that the gap here is naturally defined using the optimal value and action-value functions of the finite-horizon MDP $\widetilde{M}$ (which are different for each layer $h$), instead of those of the original SSP. The difference between the two gaps can in fact be significant; see Appendix B.7 for an example where the finite-horizon gap is arbitrarily smaller than the SSP gap.
To get around this issue, we set $H$ to a larger value (of order $T_\star$ up to logarithmic factors) and perform the following two-stage analysis. For the first $H/2$ layers, we are able to relate the finite-horizon gaps to the SSP gaps, leading to a logarithmic bound on the regret suffered in these layers. Then, for the last $H/2$ layers, we further consider two cases: if the learner’s policy for the first $H/2$ layers is nearly optimal, then the probability of not reaching the goal within the first $H/2$ layers is very low by the choice of $H$, and thus the costs suffered in the last $H/2$ layers are negligible; otherwise, we simply bound the costs using the number of times the learner takes a non-near-optimal action in the first $H/2$ layers, which is again shown to be logarithmic in $K$.
One final detail is to carefully control the regret under the failure event that happens with a small probability (recall that we are aiming for an expected regret bound; see Footnote 2). This is necessary since in SSP the learner’s cost under such events could be unbounded in the worst case. To resolve this issue, we make a slight modification to Algorithm 1 and occasionally restart the finite-horizon algorithm whenever the total number of intervals reaches some multiple of a threshold; see Algorithm 7 in the appendix. This finally leads to our main result, summarized in the following theorem (whose proof is deferred to Appendix B.8).
Theorem 4.
There exist and such that applying Algorithm 7 with horizon and being Algorithm 2 (with and failure probability ) ensures .
As far as we know, this is the first polylogarithmic bound for any SSP problem. Our result also indicates that the instance-dependent quantities of SSP can be well preserved after using some finite-horizon approximation.
Lower bounds
To better understand instance-dependent regret bounds for this problem, we further show the following lower bound.
Theorem 5.
For any algorithm , there exists a linear SSP instance with and such that .
This lower bound exhibits a relatively large gap from our upper bound. One important question is whether the dependency in the upper bound is really necessary, which we leave as a future direction.
4 An Inefficient Horizon-Free Algorithm
Recall that the dominating term of the regret bound shown in Theorem 3 depends polynomially on $T_\star$, which is most likely unnecessary. Due to the lack of a horizon-free algorithm for finite-horizon linear MDPs (which, as discussed, would have addressed this issue), in this section we propose a different approach, leading to a computationally inefficient algorithm with a regret bound that is horizon-free (that is, with no polynomial dependency on $T_\star$ or $1/c_{\min}$) but has a worse dependency on $d$.
As observed in previous work for the tabular setting (Cohen et al., 2020, 2021; Tarbouriech et al., 2021; Chen et al., 2021a), achieving a horizon-free regret bound requires constructing variance-aware confidence sets for the transition function. While this is straightforward in the tabular case, it is much more challenging with linear function approximation. Zhou et al. (2021a); Zhang et al. (2021) construct variance-aware confidence sets for linear mixture MDPs, but we are not aware of similar results for linear MDPs, which impose extra challenges. Our algorithm VA-GOPO, shown in Algorithm 3, is the first one to successfully make use of these ideas.
VA-GOPO follows a framework similar to that of the Eleanor algorithm of (Zanette et al., 2020b) (for the finite-horizon setting) and the FOPO algorithm of (Wei et al., 2021b) (for the infinite-horizon setting): they all maintain an estimate of the true weight vector $\theta^{\star}$ (recall that $Q^{\star}(s, a) = \phi(s, a)^{\top} \theta^{\star}$), found by optimistically minimizing the value of the current state over a confidence set of candidate weight vectors, and then simply act greedily according to this estimate. The main differences are the construction of the confidence set and the conditions under which the estimate is updated, which we explain in detail below.
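As a minimal illustration of the last step (acting according to the current weight estimate), once the constrained minimization has produced a weight vector, action selection is simple; in the sketch below, `phi` is a hypothetical callable returning the $d$-dimensional feature of a state-action pair, and folding the immediate cost into the weight vector (as the linearity of $Q^{\star}$ under Assumption 1 allows) is an assumption of this sketch.

```python
import numpy as np

def greedy_action(s, actions, phi, theta_hat, B):
    """Greedy action w.r.t. the truncated linear Q-estimate phi(s, a)^T theta_hat
    (a schematic of the 'act according to the estimate' step; the expensive part,
    the constrained minimization producing theta_hat, is omitted here)."""
    q_values = [np.clip(np.dot(phi(s, a), theta_hat), 0.0, B) for a in actions]
    return actions[int(np.argmin(q_values))]
```

The computational burden of VA-GOPO thus lies entirely in the global optimization that produces the weight estimate, which is why the algorithm is not computationally efficient.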
Confidence Set
For a parameter and a weight vector , inspired by (Zhang et al., 2021) we define a variance-aware confidence set for time step as
(2)
where with , and
(3)
with being the -dimensional -ball of radius , being the -net of , (recall ), being a shorthand of , (and ), , , and finally for some failure probability . The key difference between our confidence set and that of (Zhang et al., 2021) is in the definition of and due to the different structures between linear MDPs and linear mixture MDPs. In particular, we note that the value function (more formally ) in our definitions is itself defined with respect to another weight vector .
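While the display above involves several paper-specific quantities, the general shape of a variance-aware confidence set of this kind (in the spirit of Zhang et al. (2021) and Kim et al. (2021); the symbols below are generic placeholders rather than the exact Eq. (2) and Eq. (3)) is a variance-weighted ridge-regression estimator together with an ellipsoidal constraint:
$$\hat{\theta}_t = \Lambda_t^{-1} \sum_{i < t} \frac{\phi_i\, y_i}{\bar{\sigma}_i^2}, \qquad \Lambda_t = \lambda I + \sum_{i < t} \frac{\phi_i \phi_i^{\top}}{\bar{\sigma}_i^2}, \qquad \Theta_t = \Big\{ \theta : \big\| \theta - \hat{\theta}_t \big\|_{\Lambda_t} \le \beta_t \Big\},$$
where $y_i$ is the regression target (here, a value-function evaluation at the observed next state), $\bar{\sigma}_i^2$ is an upper estimate of its conditional variance, and the radius $\beta_t$ comes from a Bernstein-style concentration bound.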
With this confidence set, when VA-GOPO decides to update , it searches over all such that and finds the one that minimizes the value at the current state (Line 3). Here, is a running estimate of . VA-GOPO maintains the inequality during the update by doubling the value of and repeating Line 3 whenever this is violated (Line 3). Note that the constraint is in a sense self-referential — we consider within a confidence set defined in terms of itself, which is an important distinction compared to (Zhang et al., 2021) and is critical for linear MDPs.
To provide some intuition on our confidence set, denote by and by . Note that if we ignore the dependency between and (an issue that will eventually be addressed by some covering arguments), then forms a martingale sequence when , and thus the inequality in Eq. (3) holds with high probability by some Bernstein-style concentration inequality (Lemma 36). Formally, this allows us to show the following.
Lemma 3.
With probability at least , , .
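For reference, a representative form of the Bernstein-style martingale inequality invoked above (a standard consequence of Freedman's inequality, stated here with loose constants; the precise version used as Lemma 36 may differ) is: if $X_1, \ldots, X_n$ is a martingale difference sequence with $|X_i| \le b$ almost surely and $\sum_{i=1}^{n} \mathbb{E}[X_i^2 \mid \mathcal{F}_{i-1}] \le V$, then with probability at least $1 - \delta$,
$$\sum_{i=1}^{n} X_i \;\le\; 2\sqrt{V \ln(1/\delta)} + 2 b \ln(1/\delta).$$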
Since is also in , the difference between and is controlled by the size of the confidence set , which is overall shrinking and thus making sure that is getting closer and closer to . In addition, we also show that is optimistic at state whenever an update is performed and that never overestimates significantly.
Lemma 4.
With probability at least , we have if an update (Line 3) is performed at time step , and for all .
Update Conditions
VA-GOPO updates whenever one of the three conditions in Line 3 is triggered. The first condition simply indicates that the current time step is the start of a new episode. The second condition is
(4)
where the quantities above are compared against their values at the most recent update time step. This lazy update condition makes sure that the algorithm does not update too often (see Lemma 27) while still enjoying a small enough estimation error. The last condition (we call it the overestimate condition) tests whether the current state has an overestimated value (the threshold being the maximum possible value of the estimate due to the truncation in its definition). This condition helps remove an extra factor in the regret bound without resorting to more complicated ideas from previous works; see Appendix C.5 for more explanation.
Regret Guarantee
We prove the following regret guarantee for VA-GOPO, and provide a proof sketch in Appendix C.1 followed by the full proof in the rest of Appendix C.
Theorem 6.
With probability at least , Algorithm 3 ensures .
Ignoring the lower-order term, our bound is (potentially) suboptimal only in terms of the $d$-dependency compared to the lower bound from (Min et al., 2021). We note again that this is the first horizon-free regret bound for linear SSP: it does not have any polynomial dependency on $T_\star$ or $1/c_{\min}$, even in the lower-order terms. Furthermore, VA-GOPO also does not require the knowledge of $B_\star$ or $T_\star$. For simplicity, we have assumed $c_{\min} > 0$. However, even when $c_{\min} = 0$, we can obtain essentially the same bound by running the same algorithm on a modified cost function; see Appendix A for details.
5 Conclusion
In this work, we make significant progress towards a better understanding of linear function approximation in the challenging SSP model. Two algorithms are proposed: the first one is efficient and achieves a regret bound strictly better than that of (Vial et al., 2021), while the second one is inefficient but achieves a horizon-free regret bound. In developing these results, we also propose several new techniques that might be of independent interest, especially the new analysis for the finite-horizon approximation of (Cohen et al., 2021).
A natural future direction is to close the gap between existing upper bounds and lower bounds in this problem, especially with an efficient algorithm. Another interesting direction is to study SSP with adversarially changing costs under linear function approximation.
References
- Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24:2312–2320, 2011.
- Ayoub et al. (2020) Ayoub, A., Jia, Z., Szepesvari, C., Wang, M., and Yang, L. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pp. 463–474. PMLR, 2020.
- Azar et al. (2017) Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272. PMLR, 2017.
- Bertsekas & Yu (2013) Bertsekas, D. P. and Yu, H. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013.
- Chen & Luo (2021) Chen, L. and Luo, H. Finding the stochastic shortest path with low regret: The adversarial cost and unknown transition case. In International Conference on Machine Learning, 2021.
- Chen et al. (2021a) Chen, L., Jafarnia-Jahromi, M., Jain, R., and Luo, H. Implicit finite-horizon approximation and efficient optimal algorithms for stochastic shortest path. Advances in Neural Information Processing Systems, 2021a.
- Chen et al. (2021b) Chen, L., Luo, H., and Wei, C.-Y. Minimax regret for stochastic shortest path with adversarial costs and known transition. In Conference on Learning Theory, pp. 1180–1215. PMLR, 2021b.
- Cohen et al. (2020) Cohen, A., Kaplan, H., Mansour, Y., and Rosenberg, A. Near-optimal regret bounds for stochastic shortest path. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 8210–8219. PMLR, 2020.
- Cohen et al. (2021) Cohen, A., Efroni, Y., Mansour, Y., and Rosenberg, A. Minimax regret for stochastic shortest path. Advances in Neural Information Processing Systems, 2021.
- He et al. (2021) He, J., Zhou, D., and Gu, Q. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning, pp. 4171–4180. PMLR, 2021.
- Ishfaq et al. (2021) Ishfaq, H., Cui, Q., Nguyen, V., Ayoub, A., Yang, Z., Wang, Z., Precup, D., and Yang, L. F. Randomized exploration for reinforcement learning with general value function approximation. International Conference on Machine Learning, 2021.
- Jafarnia-Jahromi et al. (2021) Jafarnia-Jahromi, M., Chen, L., Jain, R., and Luo, H. Online learning for stochastic shortest path model via posterior sampling. arXiv preprint arXiv:2106.05335, 2021.
- Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010.
- Jin et al. (2018) Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863–4873, 2018.
- Jin et al. (2020a) Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. Learning adversarial Markov decision processes with bandit feedback and unknown transition. In Proceedings of the 37th International Conference on Machine Learning, pp. 4860–4869, 2020a.
- Jin et al. (2020b) Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137–2143. PMLR, 2020b.
- Jin et al. (2021) Jin, T., Huang, L., and Luo, H. The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition. Advances in Neural Information Processing Systems, 2021.
- Kim et al. (2021) Kim, Y., Yang, I., and Jun, K.-S. Improved regret analysis for variance-adaptive linear bandits and horizon-free linear mixture mdps. arXiv preprint arXiv:2111.03289, 2021.
- Kong et al. (2021) Kong, D., Salakhutdinov, R., Wang, R., and Yang, L. F. Online sub-sampling for reinforcement learning with general function approximation. arXiv preprint arXiv:2106.07203, 2021.
- Lattimore & Hutter (2012) Lattimore, T. and Hutter, M. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pp. 320–334. Springer, 2012.
- Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
- Min et al. (2021) Min, Y., He, J., Wang, T., and Gu, Q. Learning stochastic shortest path with linear function approximation. arXiv preprint arXiv:2110.12727, 2021.
- Rosenberg & Mansour (2020) Rosenberg, A. and Mansour, Y. Stochastic shortest path with adversarially changing costs. arXiv preprint arXiv:2006.11561, 2020.
- Shani et al. (2020) Shani, L., Efroni, Y., Rosenberg, A., and Mannor, S. Optimistic policy optimization with bandit feedback. In Proceedings of the 37th International Conference on Machine Learning, pp. 8604–8613, 2020.
- Simchowitz & Jamieson (2019) Simchowitz, M. and Jamieson, K. G. Non-asymptotic gap-dependent regret bounds for tabular MDPs. Advances in Neural Information Processing Systems, 32:1153–1162, 2019.
- Tarbouriech et al. (2020) Tarbouriech, J., Garcelon, E., Valko, M., Pirotta, M., and Lazaric, A. No-regret exploration in goal-oriented reinforcement learning. In International Conference on Machine Learning, pp. 9428–9437. PMLR, 2020.
- Tarbouriech et al. (2021) Tarbouriech, J., Zhou, R., Du, S. S., Pirotta, M., Valko, M., and Lazaric, A. Stochastic shortest path: Minimax, parameter-free and towards horizon-free regret. Advances in Neural Information Processing Systems, 2021.
- Vial et al. (2021) Vial, D., Parulekar, A., Shakkottai, S., and Srikant, R. Regret bounds for stochastic shortest path problems with linear function approximation. arXiv preprint arXiv:2105.01593, 2021.
- Wang et al. (2020) Wang, R., Salakhutdinov, R., and Yang, L. F. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 2020.
- Wei et al. (2021a) Wei, C.-Y., Dann, C., and Zimmert, J. A model selection approach for corruption robust reinforcement learning. arXiv preprint arXiv:2110.03580, 2021a.
- Wei et al. (2021b) Wei, C.-Y., Jahromi, M. J., Luo, H., and Jain, R. Learning infinite-horizon average-reward mdps with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pp. 3007–3015. PMLR, 2021b.
- Yang & Wang (2020) Yang, L. and Wang, M. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pp. 10746–10756. PMLR, 2020.
- Zanette & Brunskill (2019) Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In Proceedings of the 36th International Conference on Machine Learning, pp. 7304–7312, 2019.
- Zanette et al. (2020a) Zanette, A., Brandfonbrener, D., Brunskill, E., Pirotta, M., and Lazaric, A. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pp. 1954–1964. PMLR, 2020a.
- Zanette et al. (2020b) Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, pp. 10978–10989. PMLR, 2020b.
- Zhang et al. (2020a) Zhang, Z., Ji, X., and Du, S. S. Is reinforcement learning more difficult than bandits? A near-optimal algorithm escaping the curse of horizon. In Conference On Learning Theory, 2020a.
- Zhang et al. (2020b) Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. In Advances in Neural Information Processing Systems, volume 33, pp. 15198–15207. Curran Associates, Inc., 2020b.
- Zhang et al. (2021) Zhang, Z., Yang, J., Ji, X., and Du, S. S. Variance-aware confidence set: Variance-dependent bound for linear bandits and horizon-free bound for linear mixture MDP. Advances in Neural Information Processing Systems, 2021.
- Zhou et al. (2021a) Zhou, D., Gu, Q., and Szepesvari, C. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pp. 4532–4576. PMLR, 2021a.
- Zhou et al. (2021b) Zhou, D., He, J., and Gu, Q. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pp. 12793–12802. PMLR, 2021b.
Appendix A Preliminary
Extra Notations in Appendix
For a function and a distribution , we define and .
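A standard convention consistent with how these operators are used later in the appendix (the exact symbols are an assumption here) is
$$P f := \mathbb{E}_{s' \sim P}\big[ f(s') \big], \qquad \mathbb{V}(P, f) := P f^2 - (P f)^2,$$
that is, the expectation and the variance of $f(s')$ when $s'$ is drawn from $P$.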
Cost Perturbation for
We follow the recipe in (Vial et al., 2021, Appendix A.3) to deal with zero costs: the main idea is to run the SSP algorithm with a perturbed cost for some perturbation level, which is equivalent to solving a different SSP instance . Let . Then, . Therefore, is also a linear SSP with (up to a small constant, since can be as large as ). Denote by the optimal value function in , and define as the regret in . We have , and
Therefore, by running an SSP algorithm on the perturbed cost , we recover its regret guarantee with , , and an additional bias in the regret.
Appendix B Omitted Details for Section 3
Notations
For , denote by the expected cost of executing policy starting from state in layer , and by the policy executed in interval (for example, in Algorithm 2). For notational convenience, define , and for such that . Define indicator , and auxiliary feature for all , such that and for any and with . Finally, for Algorithm 2, define stopping time , which is the number of intervals until finishing the $K$ episodes or until the upper-bound truncation on the value estimate is triggered.
B.1 Formal Definition of and
It is not hard to see that we can define and recursively without resorting to the definition of :
with for all .
B.2 Proof of Lemma 1
Proof.
Denote by the set of intervals in episode , and by the first interval in episode . We bound the regret in episode as follows: by Lemma 17 and , we have the probability that following takes more than steps to reach in is at most . Therefore,
Thus,
( and ) |
Summing terms above over and by the definition of , we obtain the desired result. ∎
B.3 Proof of Lemma 2
We first bound the error of one-step value iteration w.r.t and , which is essential to our analysis.
Lemma 5.
For any , with probability at least , we have and for any , .
Proof.
Define , so that . Then,
By and Lemma 31, we have with probability at least , for any , :
(5)
where is the -cover of the function class of with . Note that is either or
for some PSD matrix such that by the definition of , and for some such that by the definition of . We denote by the function class of . Now we apply Lemma 32 to with , , (note that ), and , which is given by (Vial et al., 2021, Claim 2) and the following calculation: for any for some ,
and for any ,
() | |||
() | |||
Lemma 32 then implies . Plugging this back, we get
(6)
Moreover, . Thus,
Therefore, by , and the first statement is proved. For any , we prove the second statement by induction on . The base case is clearly true by . For , we have by the induction step:
Thus, . ∎
Next, we prove a general regret bound, from which Lemma 2 is a direct corollary.
Lemma 6.
Assume . Then with probability at least , Algorithm 2 ensures for any
Proof.
We are now ready to prove Lemma 2.
B.4 Learning without Knowing or
In this section, we develop a parameter-free algorithm that, without knowing $B_\star$ or $T_\star$, achieves a regret bound matching the best bound of (Vial et al., 2021) while requiring no more parameter knowledge than theirs and being computationally efficient under the most general assumption. Here we apply the finite-horizon approximation with zero terminal costs, and develop a new analysis of this approximation.
Finite-Horizon Approximation of SSP with Zero Terminal Costs
Input: upper bound estimate and function from Lemma 8.
Initialize: an instance of finite-horizon algorithm with horizon .
Initialize: , , , .
while do
To avoid knowledge of or , we apply finite-horizon approximation with zero terminal costs and horizon of order for some estimate of , that is, running Algorithm 1 with and . We show that in this case there is an alternative way to bound the regret by , and there is a tighter bound on the total number of intervals when .
Lemma 7.
Algorithm 1 with ensures .
Proof.
Denote by the set of intervals in episode . We have:
( by ) |
∎
Lemma 8.
Suppose when , with horizon ensures for any with probability at least , where , are functions of and are independent of . Then Algorithm 1 with ensures with probability at least ,
Proof.
First note that by Lemma 39 and , with probability at least : . For any finite , we will show , which then implies that has to be finite and is upper bounded by the same quantity. Define as the expected cost for the first layers and as the optimal value function for the first layers. By (Chen et al., 2021a, Lemma 1) and , we have and . Moreover, when , we have . Denote by the conditional probability of certain event conditioning on the history before interval . Then with probability at least ,
( and guarantee of ) | |||
(Lemma 39) |
Then by and reorganizing terms, we get . Again by Lemma 39, we have with probability at least :
By and solving a quadratic inequality w.r.t , we get . Thus, we also get the same bound for . ∎
Remark 1.
Note that the result of Lemma 8 is similar to (Tarbouriech et al., 2020, Lemma 7), which also shows that the number of “bad” intervals is of order . However, their result is derived by explicitly analyzing the transition confidence sets, while we only make use of the regret guarantee of the finite-horizon algorithm. Thus, our approach is again model-agnostic and directly applicable to linear function approximation while their result is not.
Note that Lemma 7 and Lemma 8 together imply a regret bound when the estimate of $T_\star$ is large enough. Moreover, since the total number of “bad” intervals is suitably bounded, we can properly bound the cost of running the finite-horizon algorithm with wrong estimates. We now present an adaptive version of the finite-horizon approximation of SSP (Algorithm 4) which does not require the knowledge of $B_\star$ or $T_\star$. The main idea is to perform the finite-horizon approximation with zero terminal costs and maintain an estimate of $T_\star$. The learner runs a finite-horizon algorithm with horizon of order the current estimate. Whenever the finite-horizon algorithm detects an anomaly, or the number of “bad” intervals is larger than expected, the learner doubles the estimate and starts a new instance of the finite-horizon algorithm with the updated estimate (see the sketch below). The guarantee of Algorithm 4 is summarized in Theorem 7 below.
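A minimal Python sketch of this doubling scheme is given below; the `make_learner` and `run_intervals` interfaces, the constant in the horizon, and the `bad_budget` threshold are hypothetical placeholders for the corresponding conditions of Algorithm 4.

```python
def adaptive_reduction(make_learner, run_intervals, K, bad_budget, T_init=1.0):
    """Sketch of the parameter-free scheme: run the finite-horizon learner with a
    horizon proportional to the current estimate of T_star, and double the estimate
    whenever an anomaly is detected or too many intervals fail to reach the goal."""
    T_est = T_init
    episodes_done = 0
    while episodes_done < K:
        learner = make_learner(horizon=int(4 * T_est))   # horizon of order T_est
        outcome = run_intervals(learner,
                                episodes_left=K - episodes_done,
                                bad_interval_budget=bad_budget(T_est))
        episodes_done += outcome.episodes_finished
        if outcome.anomaly_detected or outcome.budget_exceeded:
            T_est *= 2                                   # estimate too small: double it
```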
Theorem 7.
Suppose takes an estimate as input, and when , it has some probability of detecting the anomaly (the event ) and halts. Define stopping time , and suppose for any , with horizon ensures for any , where are non-decreasing w.r.t . Then, Algorithm 4 ensures with probability at least .
Proof.
We divide the learning process into epochs indexed by based on the update of , so that (the input value) and . Let . Define the regret in epoch as , where is the total costs suffered in epoch , is the set of episodes overlapped with epoch , and is the initial state in episode and epoch (note that an episode can overlap with multiple epochs). Clearly, . Note that satisfies the assumptions in Lemma 8, since no anomaly will be detected when . Thus in epoch , no new epoch will be started by Lemma 8. Moreover, by Lemma 7 and , the regret is bounded by:
For , by the conditions of starting a new epoch, the number of intervals that does not reach the goal is upper bounded by and the total number of intervals in epoch is upper bounded by . Thus by Lemma 7 and the guarantee of ,
where the last equality is by the fact that and are non-decreasing w.r.t . Thus,
∎
Theorem 8.
Applying Algorithm 4 with Algorithm 2 as to the linear SSP problem ensures with probability at least .
Proof.
Note that for Algorithm 2, and Lemma 6 ensures that Algorithm 2 satisfies assumptions of Theorem 7 with and , where . Then by Theorem 7, we have: . ∎
Remark 2.
Comparing the bound achieved by Theorem 8 with that of Theorem 3, we see that is in place of , making it a worse bound since . Previous works on SSP (Cohen et al., 2021; Tarbouriech et al., 2021; Chen et al., 2021a) suggest that algorithms that obtain a bound with dependency on are easier to make parameter-free compared to those with dependency on . Our findings in this section are consistent with those in previous works.
B.5 Horizon-Free Regret in the Tabular Setting with Finite-Horizon Approximation
Here we present a finite-horizon algorithm (Algorithm 5) that achieves a horizon-free regret bound and thus gives a horizon-free SSP bound when combined with Corollary 2. For simplicity, we assume that the cost function is known. We can think of Algorithm 5 as a variant of EB-SSP, applied to a finite-horizon MDP whose transition is shared across layers. Note that due to the loop-free structure of the MDP, value iteration converges in one sweep. Thus, skewing the empirical transition as in (Tarbouriech et al., 2021) is unnecessary. Then, by the analysis of EB-SSP and the fact that transition data is shared across layers, we obtain the same regret guarantee (it is not hard to see that the algorithm achieves an anytime regret bound since its parameter updates are independent of the total number of intervals).
Input: an estimate such that .
Initialize: for , , .
for do
B.6 Application to Linear Mixture MDP
Initialize: , , .
Define: .
Define: .
Define: .
for do
In this section, we provide a direct application of our finite-horizon approximation to the linear mixture MDP setting. We first introduce the problem setting of linear mixture SSP following (Min et al., 2021).
Assumption 2 (Linear Mixture SSP).
The number of states and actions are finite: . For some , there exist a known cost function , a known feature map , and an unknown vector with , such that:
-
•
for any , we have ;
-
•
for any bounded function , we have , where .
We also assume is known and . Define , with as before, and . Note that by the definitions above, . Also define the total costs accordingly for any interval. With our approximation scheme, it suffices to provide a finite-horizon algorithm. We start by stating the regret guarantee of the proposed finite-horizon algorithm (Algorithm 6).
Theorem 9.
Algorithm 6 ensures for any with probability at least .
Combining Algorithm 6 with our finite-horizon approximation, we get the following regret guarantee on linear mixture SSP.
Theorem 10.
Applying Algorithm 1 with and Algorithm 6 as to the linear mixture SSP problem ensures with probability at least .
Proof.
This directly follows from Theorem 9 and Corollary 2 with and . ∎
Note that our bound strictly improves over that of (Min et al., 2021), and it is minimax optimal when . Now we introduce the proposed finite-horizon algorithm, which is a variant of (Zhou et al., 2021a, Algorithm 2). The high-level idea is to construct Bernstein-style confidence sets for the transition function and then compute value function estimates through empirical value iteration with bonuses. We summarize the ideas in Algorithm 6. Before proving Theorem 9, we need the following key lemma regarding the confidence sets for the transition function.
Lemma 9.
With probability at least , we have for all , and .
Proof.
For the first statement, we first prove that and for . We adopt the indexing by in Section 2: for a given time step that corresponds to , that is, the -th step in the -th interval, define , , , and . We apply Lemma 33 with , , , , . Then, we have , where , , and . Moreover,
Therefore, with probability at least , for any for some , which corresponds to :
Next, we apply Lemma 33 with , , , , . Then, we have , where , , and . Moreover,
Therefore, with probability at least , for any for some , which corresponds to :
Conditioned on the event , we have for corresponding to :
Thus the second statement is proved. Now we show that . We condition on event , and apply Lemma 33 with , , , , . Then, we have . Moreover, , , and for corresponding to ,
Therefore, with probability at least , for any for some , which corresponds to :
This completes the proof. ∎
We are now ready to prove Theorem 9.
Proof of Theorem 9.
We condition on the event of Lemma 9, Lemma 10 and Lemma 11, which happens with probability at least . We decompose the regret as follows: with probability at least ,
(Lemma 10) | ||||
() | ||||
() | ||||
(Lemma 38, Cauchy-Schwarz inequality, and Lemma 9) |
The first term is of order by Lemma 11. For the third term, define and . Then,
( and ) | |||
( and Cauchy-Schwarz inequality) | |||
( and Lemma 29) |
where in (i) we apply , Lemma 30, and:
(Lemma 30) | ||||
( and Lemma 29) |
It remains to bound . Note that
(Lemma 9) | |||
(Lemma 11 and Lemma 12) |
By Lemma 28 and , we get . Putting everything together, we get:
Now by and Lemma 28, we get: . Plugging this back, we get . ∎
Lemma 10.
Conditioned on the event of Lemma 9, and .
Proof.
Note that by Lemma 9:
The first statement then follows from the definition of . For any , we prove the second statement by induction on . The base case is clearly true by the definition of . For , note that by the induction step and the first statement. Thus, . ∎
Lemma 11.
Conditioned on the event of Lemma 10, with probability at least , for any .
Proof.
Lemma 12.
for any .
B.7 An instance of SSP with
Consider an SSP with four states and two actions . At , we have and , for and some . At , we have , , and , . At , we have and , for any and some . At , we have and for . The role of here is to create the possibility that the learner will visit state at any time step. Then under our finite-horizon approximation, we have
On the other hand, when , , and can be arbitrarily large.
B.8 Omitted Details in Section 3.3
We first prove a lemma bounding and another lemma on regret decomposition w.r.t the gap functions in .
Lemma 13.
Suppose . With probability at least , for all , and , Algorithm 2 ensures:
Proof.
Note that:
(Define ) | |||
Therefore,
For , note that by for any , . Therefore, by the Cauchy-Schwarz inequality,
where the second inequality is by . Similarly, for ,
(Cauchy-Schwarz inequality) | ||||
() | ||||
( for any ) |
For , by Eq. (6), with probability at least . Thus, .
To conclude, we have for all :
This completes the proof. ∎
Lemma 14.
With probability at least , for any given .
Proof.
The next lemma provides an upper bound on the sum of gap functions satisfying some constraints. We denote by the interaction history up to in .
Lemma 15.
Suppose , are indicator functions such that for some , and define . Then with probability at least , Algorithm 2 ensures
Proof.
We are now ready to prove a bound on , which is the key to proving Theorem 4.
Lemma 16.
For any , Algorithm 2 with and for some horizon ensures with probability at least , .
Proof.
First note that for any . Thus, the expected hitting time of in is at most starting from any state and layer. Without loss of generality, we assume that is an even integer. Note that can be treated as an SSP instance where the learner teleports to the goal state at the -th step. Thus by Lemma 17 and , when , for any state , and for any :
It also implies for , since:
Define and a threshold . By Lemma 14, it suffices to bound . Note that
For the first term, define , and
Then by the definition of and for , there exist such that
(7)
Moreover, for each and , define . Then by Lemma 15, with probability at least ,
where . Solving a quadratic inequality w.r.t gives:
(8)
By a union bound, Eq. (8) holds for all simultaneously with probability at least . Therefore, the first term is bounded as follows:
(Eq. (8)) | |||
() | |||
(Eq. (7)) |
For the second term, note that:
For , define and . Then by Lemma 15, with probability at least ,
It suffices to bound . Note that by the definition of , we have . Thus, by Eq. (8),
Plugging this back and by Eq. (7), we get:
For , denote by the near-optimal policy “closest” to , such that:
Note that for all . By the extended value difference lemma (Shani et al., 2020, Lemma 1), for all by . Therefore, for all . Denote by the interaction history before interval . Then, , and
where in the last inequality we apply Lemma 17, the fact that for all , and . Now by Lemma 14 and , we have:
∎
We are now ready to prove Theorem 4.
Proof of Theorem 4.
First note that for a given , by Lemma 2 and Theorem 1, we have: with probability at least for some when running Algorithm 1 with Algorithm 2 and horizon . That is, there exist and constant such that . Now let . To obtain the regret bound in Lemma 16, it suffices to have . Plugging in the definition of and by for , it suffices to have for some constant . To conclude, we have with probability at least when running Algorithm 1 with Algorithm 2 and horizon . Moreover, with probability at least , we have . To obtain an expected regret bound, we further need to bound the cost under the low probability “bad” event. We make the following modification to Algorithm 1: whenever the counter for some , we restart Algorithm 2. Ideas above are summarized in Algorithm 7. Now consider running Algorithm 7 with Algorithm 2, horizon , failure probability , and restart threshold . By the choice of , we have . By a recursive argument, we have for . We have by Lemma 1 and Lemma 16:
where we apply
This completes the proof. ∎
Input: Algorithm for finite-horizon MDP with horizon and restart threshold .
Initialize: interval counter .
for do
B.9 Extra Lemmas for Section 3
Lemma 17.
(Rosenberg & Mansour, 2020, Lemma 6) Let be a policy with expected hitting time at most starting from any state. Then for any , with probability at least , takes no more than steps to reach the goal state.
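For intuition, the following is the standard amplification argument behind hitting-time bounds of this type (a sketch only; the constants in the cited lemma may differ). Write T for the bound on the expected hitting time:

```latex
% Markov's inequality: from any state, the probability of not reaching the goal
% within 2T steps is at most 1/2. Splitting the trajectory into m consecutive
% blocks of 2T steps and applying the Markov property at the start of each block,
\[
  \Pr\big[\text{goal not reached within } 2mT \text{ steps}\big] \;\le\; 2^{-m},
\]
% so m = \lceil \log_2(1/\delta) \rceil blocks, i.e. O(T \log(1/\delta)) steps,
% suffice to reach the goal with probability at least 1 - \delta.
```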
Lemma 18.
For an arbitrary set of intervals for some , we have:
B.10 Proof of Theorem 5
Proof.
Define , and assume . Consider a family of SSP instances parameterized by with action set . The SSP instance parameterized by consists of two states . The transition probabilities are as follows:
and the cost function is . The SSP instance above can be represented as a linear SSP of dimension as follows: define , ,
and . Note that it satisfies , , , and . Moreover, for any function , we have:
Note that when , , and
Thus, we have by , and the SSP instance satisfies Assumption 1. The regret is bounded as follows: let denote the first action taken by the learner in episode . Then for any , the expected cost of taking action as the first action is .
where we define , and is the expectation w.r.t the SSP instance parameterized by . Let denote the vector that differs from at its -th coordinate only. Then, we have , and for a fixed ,
where is the joint probability of trajectories induced by the interactions between the learner and the SSP parameterized by , and in the last inequality we apply Pinsker’s inequality to obtain:
By the divergence decomposition lemma (see e.g. (Lattimore & Szepesvári, 2020, Lemma 15.1)), we further have
()
where in the second-to-last inequality we apply when , which is true when , . Substituting these back, we get:
(9) |
Now note that for all . Define . Then for any ,
Thus, . By and Eq. (9), we get:
Selecting which maximizes , we get: . ∎
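For reference, the two standard tools invoked in the argument above are, in their usual forms (a sketch with generic distributions P and Q; see Lattimore & Szepesvári (2020) for the precise statements):

```latex
% Pinsker's inequality: for any two probability distributions P and Q,
\[
  \|P - Q\|_1 \;\le\; \sqrt{2\,\mathrm{KL}(P \,\|\, Q)}.
\]
% Divergence (chain-rule) decomposition: for the joint laws of a trajectory
% X_1, \dots, X_n under P and Q,
\[
  \mathrm{KL}(P \,\|\, Q)
  \;=\; \sum_{t=1}^{n} \mathbb{E}_{P}\Big[\mathrm{KL}\big(P(\cdot \mid X_{<t}) \,\big\|\, Q(\cdot \mid X_{<t})\big)\Big],
\]
% which, for the two-state SSP instances above, reduces to a sum of per-step KL
% divergences between the corresponding transition distributions.
```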
Appendix C Omitted Details for Section 4
Notations
Define such that , and operator such that . Define , and . By Lemma 4, for any .
For notational convenience, we divide the whole learning process into epochs indexed by , and a new epoch begins whenever is recomputed. Denote by the first time step in epoch , and for a quantity, function, or set indexed by time step , we define . Denote by the epoch that time step belongs to; we often omit the subscript when there is no confusion. Clearly, , and similarly for (ignoring the dependency on for ). With this notation in place, we define as the number of epochs started by the overestimate condition, that is, . Also define and a special covariance matrix . Note that .
Assumption
For simplicity, we assume that spans . It implies that if for all , then .
Truncating the Interaction for Technical Reasons
An important question in SSP is whether the algorithm halts in a finite number of steps. To overcome some technical issues, we first assume that the algorithm halts after steps for an arbitrary , even if the goal state is not reached. Specifically, we redefine the notation to be the minimum between the number of steps taken by the learner in episodes and , that is, if the learner does not finish episodes in steps. We also redefine under the new definition of , and the true regret now becomes . One implication of this truncation is that may not be , and . In Appendix C.4, we prove a regret bound on independent of . Thus, the proven regret bound is also an upper bound of the true regret, as it is a valid upper bound of .
C.1 Proof Sketch of Theorem 6
We focus on deriving the dominating term and ignore lower-order terms. By a straightforward calculation, we decompose the regret as follows:
We bound each of these terms as follows.
Bounding Deviation
This term is a sum of a martingale difference sequence and is of order . We show that (see Lemma 21).
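As a point of reference, the most basic form of such a deviation bound is Azuma–Hoeffding, stated generically below (the actual proof of Lemma 21 relies on the concentration inequalities of Lemmas 34 and 38):

```latex
% Azuma-Hoeffding: if X_1, \dots, X_T is a martingale difference sequence with
% |X_t| \le c almost surely, then with probability at least 1 - \delta,
\[
  \Big|\sum_{t=1}^{T} X_t\Big| \;\le\; c\sqrt{2T\ln(2/\delta)}.
\]
% Variance-aware (Freedman-type) versions, as in Lemmas 36-38, replace the
% worst-case factor c\sqrt{T} by roughly the square root of the realized variance.
```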
Bounding Estimation-Err
Here the variance-aware confidence set comes into play. By , we have . Thus, it suffices to bound . As in (Kim et al., 2021), the main idea is to bound the matrix norm of w.r.t some special matrix by a variance-aware term, and then apply the elliptical potential lemma on . For any epoch , and with , we have the following key inequality (see Lemma 24):
(10) |
One important step is thus to bound . Note that this term has a similar form to , and by a similar analysis (see Lemma 23):
(11) |
where (note that here is fixed and independent of ). Define such that . By Eq. (10):
(12) |
Solving for and by (similar to the elliptical potential lemma), we get
Plugging this back to Eq. (11) and solving a quadratic inequality, we get: (Lemma 23). Now by an analysis similar to Eq. (12) (Lemma 22):
where such that . The extra factor is from the inequality .
Bounding Switching-Cost
By considering each condition for starting a new epoch, we show that , where is the number of epochs started by triggering the overestimate condition; see Appendix C.4. We provide more intuition on including the overestimate condition in Appendix C.5. In short, it removes a factor of in the dominating term without incorporating impractical decision sets as in previous works.
Putting Everything Together
Combining the bounds above, we get . Solving a quadratic inequality w.r.t , we have . Plugging this back, we obtain .
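The “solving a quadratic inequality” step used here (and repeatedly in these appendices) is the following elementary fact, written with generic placeholders x, a, b ≥ 0:

```latex
% If x \le a\sqrt{x} + b, then viewing this as a quadratic inequality in \sqrt{x},
\[
  \sqrt{x} \;\le\; \frac{a + \sqrt{a^2 + 4b}}{2}
  \quad\Longrightarrow\quad
  x \;\le\; \frac{\big(a + \sqrt{a^2 + 4b}\big)^2}{4} \;\le\; a^2 + 2b,
\]
% using (u + v)^2 \le 2u^2 + 2v^2 in the last step. In particular, x = O(a^2 + b),
% which is how the self-bounding regret inequality above is resolved.
```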
Below we provide detailed proofs of lemmas and the main theorem.
C.2 Proof of Lemma 3
We will prove a more general statement, of which Lemma 3 is a direct corollary.
Lemma 19.
With probability at least , for any , , and , we have .
C.3 Proof of Lemma 4
Lemma (restatement of Lemma 4).
With probability at least , for any epoch and .
Proof.
For the first statement, note that for any epoch , by Lemma 20, there exists such that and . Therefore, , and by the definition of . The second statement is a direct corollary of the first statement and how is updated. ∎
Lemma 20.
For any , there exists such that , and .
Proof.
Define , and . We prove by induction that and . The base case is clearly true. Now for , assume that we have and . Then, and . Therefore, the sequence is non-decreasing and bounded, and thus converges. Since spans , the limit exists and . Moreover, by and since . This completes the proof. ∎
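To make the monotone-convergence argument above concrete, here is a minimal numerical illustration (a toy example, not the paper's operator; the instance, costs, and tolerances are all hypothetical): starting from the all-zero function, repeated application of a monotone Bellman-type operator with nonnegative costs yields a non-decreasing, bounded sequence that converges to its fixed point.

```python
import numpy as np

# Toy SSP: two non-goal states plus an absorbing, zero-cost goal state, one action
# per state. P[s, s'] is the transition probability among non-goal states; the
# remaining mass goes to the goal.
P = np.array([[0.2, 0.3],
              [0.1, 0.4]])
c = np.array([1.0, 0.5])    # immediate costs, bounded in (0, 1]

V = np.zeros(2)             # V_0 = 0
for _ in range(200):
    V_next = c + P @ V      # Bellman operator for this single-action SSP
    assert np.all(V_next >= V - 1e-12)   # monotonicity: the sequence is non-decreasing
    if np.max(np.abs(V_next - V)) < 1e-10:
        break
    V = V_next

print("fixed point (expected cost-to-go):", V)
```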
C.4 Proof of Theorem 6
Proof.
We decompose the regret as follows:
For the first term, for a fixed epoch , define for and . Note that within epoch , we have . Thus, for ,
()
Therefore, we have:
We first bound the switching costs, that is, the last two terms above. We consider three cases based on how an epoch starts: define , , and . Then,
Note that since for , by Lemma 4. For , note that by Lemma 27. Thus, by (Lemma 4). For , note that for each , by and . Thus, , by and . Therefore, with probability at least ,
(Lemma 38, , and definition of )
(Lemma 21)
( and )
(Lemma 22)
(definition of and )
By and Lemma 28 with (we also bound by in logarithmic terms), we get . Plugging this back, we obtain
This completes the proof. ∎
C.5 Intuition for Overestimate Condition
Now we provide more reasoning on why the overestimate condition is included. Similar to (Zanette et al., 2020b; Wei et al., 2021b), we incorporate global optimism at the starting state of each epoch by solving an optimization problem. This differs from many previous works (Jin et al., 2020b; Vial et al., 2021) that add bonus terms to ensure local optimism over all states. The advantage of global optimism is that it avoids using a larger function class of for the bonus terms, which reduces the order of in the regret bound. However, this improvement also requires that is of order . In (Zanette et al., 2020b), this constraint is enforced directly, which is not practical under a large state space, as we may need to iterate over all state-action pairs to check it.
Here we take a new approach: we first enforce a bound on by direct truncation. However, the upper-bound truncation on may break the analysis. To resolve this, we start a new epoch whenever is overestimated by a large amount. By the objective of the optimization problem, will not be overestimated in the new epoch. Hence, the upper-bound truncation will not be triggered. Moreover, the overestimate of cancels out the switching cost in this case, as in the previous discussion.
The disadvantage of the overestimate condition is that we may update the policy at every time step in the worst case. If we remove this condition, then by the norm constraint on , which brings back an extra factor. However, we only recompute the policy for times in this case.
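The following is a minimal sketch, under many simplifying assumptions, of the restart mechanism described above; all names (`stream`, `solve_global_optimization`, the threshold, the cap) are hypothetical placeholders rather than the paper's algorithm. It only illustrates the control flow: truncate the value estimate at an upper bound, and start a new epoch (re-solve the global optimization) whenever the cached estimate overestimates a freshly computed target by too much.

```python
import numpy as np

def run_epoching(stream, solve_global_optimization, threshold=1.0, cap=10.0):
    """Illustrative only. `stream` yields (phi, v_target) pairs, where `phi` is a
    feature vector and `v_target` a freshly computed reference value; both, like
    `solve_global_optimization`, are hypothetical placeholders."""
    w = solve_global_optimization()              # parameters for the first epoch
    overestimate_restarts = 0
    for phi, v_target in stream:
        v_est = min(float(np.dot(phi, w)), cap)  # truncation enforces the upper bound
        if v_est - v_target > threshold:         # the "overestimate" condition
            w = solve_global_optimization()      # start a new epoch: the re-optimized
            overestimate_restarts += 1           # estimate is no longer overestimated
    return overestimate_restarts
```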
C.6 Extra Lemmas for Section 4
Lemma 21.
With probability at least , .
Proof.
Note that when , . Otherwise, for any and . Thus, . Then with probability at least ,
where in (i) we apply Lemma 38, , , , and we bound the term as follows: we consider four cases based on how epoch ends:
1. , then .
2. ; this happens times and the sum of these terms is of order .
3.
4. is the last epoch. This happens only once and the term is bounded by .
In (ii), we apply Lemma 34, , definition of , and by Lemma 4. Solving a quadratic inequality w.r.t , we have:
This completes the proof. ∎
Lemma 22.
With probability at least , .
Proof.
Define , , and such that . Also define . Note that when , . Then, for any :
(Lemma 23)
where in (i) we define such that and apply
( is -Lipschitz)
()
()
and in (ii) we apply Lemma 24 and:
Here, (i) is by by the definition of . Reorganizing terms by , we have for :
Finally, note that:
The first term is bounded by
(Lemma 29)
For the second term:
where in (i) we apply Lemma 29. Putting everything together, we get:
This completes the proof. ∎
Lemma 23.
With probability at least , .
Proof.
Note that when , . Otherwise, for any and . Therefore, . Then with probability at least ,
In (i) we apply Lemma 25, , , , and . In (ii) we apply Lemma 34. For , define . Then by and the definition of , we have . Now it suffices to bound . Define and for , define such that . Note that when , . Also define . Then, for any , with probability at least :
where in the last inequality we apply Lemma 24 and:
Here, (i) is by by the definition of . Reorganizing terms by , we have:
where in (i) we apply
Putting everything together and by , we have:
Solving a quadratic inequality w.r.t , we have . ∎
Lemma 24.
With probability at least , for any epoch , , and with ,
Proof.
Lemma 25.
With probability at least , for any epoch , and .
Proof.
For any , , and , define and as the conditional expectation given the interaction history . Note that and . Then by Lemma 37 with , with probability at least , where , we have:
Reorganizing terms and by a union bound, we have with probability at least , for any , , and :
(16) |
Moreover, for any , , and , by Lemma 38, with probability at least :
(17) |
Then again by a union bound, the equation above holds with probability at least for any , , and .
For the next lemma, we define the following auxiliary function:
Note that is convex and .
Lemma 26.
For , .
Proof.
Let . When , we have: . When (arguments are similar for ), we have , and
∎
Lemma 27.
Fix . Let . If there exists such that for each , there exists for some such that
(19) |
Then, .
Proof.
Note that when Eq. (19) holds:
(20) |
Thus, it suffices to bound the number of times Eq. (20) holds. Define . Clearly is convex since is convex, and for . Define:
For each , there exists such that . Define . Note that , and is a symmetric convex set since is a convex function and . By Lemma 26, we have . Therefore, , which means that in the direction of , the intercept of is at most times that of . By Lemma 35, we have: . Note that when , we have . Therefore, when , we have . Hence, for . Since is decreasing in , we have
This completes the proof. ∎
Appendix D Auxiliary Lemmas
Lemma 28.
If for some and absolute constant , then .
Proof.
First note that implies by for , which gives . Plugging this back, we get . Therefore, implies . Next, note that implies by for , which gives . Plugging this back, we get , which gives . Therefore, implies . Thus, implies and , which implies . Taking the contrapositive, the statement is proved. ∎
Lemma 29.
(Abbasi-Yadkori et al., 2011, Lemma 11) Let be a sequence in , a positive definite matrix, and define . Then, for any .
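Since the elliptical potential lemma is used throughout Appendix C, here is a quick numerical sanity check of its standard form (Abbasi-Yadkori et al., 2011, Lemma 11): with Lambda_0 = lambda * I and Lambda_t = Lambda_{t-1} + x_t x_t^T, the sum over t of min(1, x_t^T Lambda_{t-1}^{-1} x_t) is at most 2 log(det Lambda_T / det Lambda_0). The dimensions and data below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam, L = 5, 2000, 1.0, 1.0

Lam = lam * np.eye(d)           # Lambda_0 = lam * I
potential = 0.0
for _ in range(T):
    x = rng.normal(size=d)
    x *= L / max(np.linalg.norm(x), 1e-12)        # enforce ||x_t|| <= L
    potential += min(1.0, x @ np.linalg.solve(Lam, x))  # ||x_t||^2 in Lambda_{t-1}^{-1} norm
    Lam += np.outer(x, x)                          # Lambda_t = Lambda_{t-1} + x_t x_t^T

lhs = potential
rhs = 2 * np.linalg.slogdet(Lam)[1] - 2 * d * np.log(lam)  # 2 log(det Lambda_T / det Lambda_0)
print(f"sum of potentials = {lhs:.2f} <= 2 log(det ratio) = {rhs:.2f}")
assert lhs <= rhs + 1e-8
```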
Lemma 30.
(Abbasi-Yadkori et al., 2011, Lemma 12) Let , be positive semi-definite matrices such that . Then, we have .
Lemma 31.
(Wei et al., 2021b, Lemma 11) Let be a martingale sequence on state space w.r.t a filtration , be a sequence of random vectors in so that and , , and be a set of functions defined on with as its -covering number w.r.t the distance for some . Then for any , we have with probability at least , for all and so that :
Lemma 32.
(Wei et al., 2021b, Lemma 12) Let be a class of mappings from to parameterized by . Suppose that for any (parameterized by ) and (parameterized by ), the following holds:
Then, , where is the -covering number of with respect to the distance .
Lemma 33.
(Zhou et al., 2021a, Theorem 4.1) Let be a filtration, a stochastic process so that and . Moreover, define and we have:
Then with probability at least , we have for any :
where , and
Lemma 34.
Lemma 35.
(Zhang et al., 2021, Lemma 16) Let be a bounded symmetric convex subset of with . Suppose , that is, is on the boundary of , and is another bounded symmetric convex set such that and . Then , where is the volume of the set .
Lemma 36.
(Zhang et al., 2021, Theorem 4) Let be a martingale difference sequence and almost surely. Then for , we have with probability at least ,
Lemma 37.
(Jin et al., 2020a, Lemma 9) Let be a martingale difference sequence adapted to the filtration , and almost surely for some . Then, for any , with probability at least :
Lemma 38.
Let be a martingale difference sequence adapted to the filtration and for some . Then with probability at least , for all simultaneously,