Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization
Abstract
Learning Markov decision processes (MDPs) in an adversarial environment has been a challenging problem. The problem becomes even more challenging with function approximation, since the underlying structure of the loss function and transition kernel are especially hard to estimate in a varying environment. In fact, the state-of-the-art results for linear adversarial MDPs achieve a regret that is highly suboptimal in the number of episodes $K$, admitting a large room for improvement. In this paper, we investigate the problem with a new view, which reduces linear MDPs to linear optimization by subtly setting the feature maps of the bandit arms of the linear optimization problem. This new technique, under an exploratory assumption, yields an improved bound for linear adversarial MDPs without access to a transition simulator. The new view could be of independent interest for solving other MDP problems that possess a linear structure.
1 Introduction
Reinforcement learning (RL) describes the interaction between a learning agent and an unknown environment, where the agent aims to maximize the cumulative reward through trial and error (Sutton and Barto, 2018). It has achieved great success in many real applications, such as games (Mnih et al., 2013; Silver et al., 2016), robotics (Kober et al., 2013; Lillicrap et al., 2015), autonomous driving (Kiran et al., 2021), and recommendation systems (Afsar et al., 2022; Lin et al., 2021). The interaction in RL is commonly modeled by Markov decision processes (MDPs). Most works study the stochastic setting, where the reward is sampled from a fixed distribution (Azar et al., 2017; Jin et al., 2018; Simchowitz and Jamieson, 2019; Yang et al., 2021). RL in real applications is in general more challenging than the stochastic setting, as the environment could be non-stationary and the reward function could adapt to the agent's policy. For example, a scheduling algorithm will be deployed among self-interested parties, and recommendation algorithms will face strategic users.
To design robust algorithms that work in non-stationary environments, a line of works focuses on the adversarial setting, where the reward function can be arbitrarily chosen by an adversary (Yu et al., 2009; Rosenberg and Mansour, 2019; Jin et al., 2020a; Chen et al., 2021; Luo et al., 2021a). Many works on adversarial MDPs optimize the policy by learning the value function with a tabular representation. In this case, both the computational complexity and the regret bounds depend on the sizes of the state and action spaces. In real applications, however, the state and action spaces could be exponentially large or even infinite, as in the game of Go and in robotics. The resulting computational cost and performance are then inadequate.
To cope with the curse of dimensionality, function approximation methods are widely deployed to approximate the value functions with learnable structures. Great empirical success has proved their efficacy in a wide range of areas. Despite this, the theoretical understanding of MDPs with general function approximation is still limited. As an essential step towards understanding function approximation, linear MDPs have been an important setting and have received significant attention from the community. A linear MDP presumes that the transition and reward functions follow a linear structure with respect to a known feature map (Jin et al., 2020b; He et al., 2021; Hu et al., 2022). The stochastic setting of linear MDPs has been well studied and near-optimal results are available (Jin et al., 2020b; Hu et al., 2022). The adversarial setting is much more challenging since the underlying linear parameters of the loss function and transition kernel are especially hard to estimate in a varying environment.
The research on linear adversarial MDPs remains open. Early work proposes algorithms for the case where the transition function is known (Neu and Olkhovskaya, 2021). Several recent works explore the problem without a known transition function and derive policy optimization algorithms with the state-of-the-art regret (Luo et al., 2021a; Dai et al., 2023; Sherman et al., 2023). While the optimal regret in tabular adversarial MDPs is of order $\widetilde{O}(\sqrt{K})$ in the number of episodes $K$ (Jin et al., 2020a), the regret upper bounds available for linear adversarial MDPs admit a large room for improvement.
In this paper, we investigate linear adversarial MDPs with unknown transitions. We propose a new view of the problem and design an algorithm based on this view. The idea is to reduce the MDP setting to a linear optimization problem by subtly setting the feature maps of the bandit arms of the linear optimization problem. In this way, we operate on a set of policies and optimize the probability distribution over which policy to execute. By carefully balancing the suboptimality in policy execution, the suboptimality in policy construction, and the suboptimality in feature visitation estimation, we deduce new analyses of the problem. Improved regret bounds are obtained both when we have and when we do not have a simulator. In particular, we obtain the first regret bound of its order for linear adversarial MDPs without a simulator.
Let $d$ be the feature dimension and $H$ be the length of each episode. Details of our contributions are as follows.
- With an exploratory assumption (Assumption 1), we obtain an improved regret upper bound for linear adversarial MDPs. As compared in Table 1, this is the first regret bound of this order when a simulator of the transition is not provided. We also note that our exploratory assumption, which only ensures that the MDP is learnable, is much weaker than those in previous works (Neu and Olkhovskaya, 2021; Luo et al., 2021a). Under this weaker assumption, our result achieves a significant improvement over the regret in Luo et al. (2021a) and also removes the dependence on the minimum eigenvalue of the exploratory policy's covariance, which can be small.
- In a simpler setting where the agent has access to a simulator, our regret can be further improved. This result also removes the eigenvalue dependence present in previous works (Neu and Olkhovskaya, 2021; Luo et al., 2021a). Compared with Luo et al. (2021a), our required simulator is also weaker: we only need access to a sampled trajectory for any given policy, while Luo et al. (2021a) require the next state for any given state-action pair.
- Technically, we provide a new tool for linear MDP problems by exploiting the linear features of the MDP and transforming it into a linear optimization problem. This tool could be of independent interest and might be useful in other problems that possess a linear structure.
| | Transition | Simulator¹ | Exploratory² | Regret³ |
|---|---|---|---|---|
| Neu and Olkhovskaya (2021) | Known | yes | yes | |
| Luo et al. (2021a, b) | Unknown | yes | yes | |
| Luo et al. (2021a, b) | Unknown | yes | no | |
| Luo et al. (2021a, b) | Unknown | no | yes | |
| Luo et al. (2021a, b) | Unknown | no | no | |
| Dai et al. (2023) | Unknown | yes | no | |
| Dai et al. (2023) | Unknown | no | no | |
| Sherman et al. (2023) | Unknown | yes | no | |
| Sherman et al. (2023) | Unknown | no | no | |
| Ours | Unknown | yes | yes | |
| Ours | Unknown | no | yes | |
1. Our required simulator is defined in Assumption 2. Notice that Dai et al. (2023) and Luo et al. (2021a, b) adopt a stronger simulator that returns the next state when given any state-action pair, while Sherman et al. (2023), Neu and Olkhovskaya (2021), and this paper only need the simulator to return a trajectory when given a policy.
2. Our exploratory assumption is introduced in Assumption 1. It is worth noting that our assumption on exploration is also much weaker than that of Neu and Olkhovskaya (2021); Luo et al. (2021a, b). Our exploratory assumption only ensures the learnability of the MDP, while the other works require as input a policy that can explore the full linear space in all steps. Our assumption is implied by theirs.
3. The term in the regret represents the minimum eigenvalue induced by a "good" exploratory policy, i.e., the smallest eigenvalue of that policy's per-step feature covariance over all steps (see Assumption 1).
2 Related Work
Linear MDPs.
The linear function approximation problem has a long history of study (Bradtke and Barto, 1996; Melo and Ribeiro, 2007; Sutton and Barto, 2018; Yang and Wang, 2019). More recently, Yang and Wang (2020) propose theoretical guarantees on sample efficiency in the linear MDP setting; however, they assume that the transition function can be parameterized by a small matrix. For the general case, Jin et al. (2020b) develop LSVI-UCB, the first algorithm that is efficient in both sample and computational complexity. They show that the algorithm achieves a regret that scales polynomially with the feature dimension $d$ and the episode length $H$ rather than with the sizes of the state and action spaces. This result is improved to the optimal order by Hu et al. (2022) with a tighter concentration analysis. A very recent work (He et al., 2022a) points out a technical error in Hu et al. (2022) and shows a nearly minimax result that matches the lower bound in Zhou et al. (2021). All these works are based on UCB-type algorithms. Apart from UCB, TS-type algorithms have also been proposed for this setting (Zanette et al., 2020). The above results mainly focus on minimax optimality. In the stochastic setting, deriving an instance-dependent regret bound is also attractive as it adapts to MDPs of different hardness. This type of regret has been widely studied in the tabular MDP setting (Simchowitz and Jamieson, 2019; Yang et al., 2021). He et al. (2021) are the first to provide this type of regret bound for linear MDPs. Using a different proof framework, they show that the LSVI-UCB algorithm achieves a logarithmic regret that depends on the minimum value gap in the episodic MDP.
Adversarial losses in MDPs.
When the losses at state-action pairs do not follow a fixed distribution, the problem becomes an adversarial MDP. This problem was first studied in the tabular setting. The occupancy-measure-based method is one of the most popular approaches to dealing with a potential adversary. Within this approach, Zimin and Neu (2013) first study the known-transition setting and derive regret guarantees for both full-information and bandit feedback. For the more challenging unknown-transition setting, Rosenberg and Mansour (2019) also start from full-information feedback and derive a sublinear regret. Bandit feedback is studied by Jin et al. (2020a), who obtain a regret bound of order $\widetilde{O}(\sqrt{K})$. The other line of works (Neu et al., 2010; Shani et al., 2020; Chen et al., 2022; Luo et al., 2021a) is based on policy optimization methods. In the unknown-transition and bandit-feedback setting, the state-of-the-art result in this line is also of order $\widetilde{O}(\sqrt{K})$, achieved by Luo et al. (2021a, b).
More specifically, a few works focus on the linear adversarial MDP problem. Neu and Olkhovskaya (2021) first study the known-transition setting and provide a regret guarantee under the assumption that an exploratory policy can explore the full linear space. For the general unknown-transition case, Luo et al. (2021a, b) discuss four cases depending on whether a simulator is available and whether the exploratory assumption is satisfied. With the same exploratory assumption as Neu and Olkhovskaya (2021), they show one regret bound when a simulator is available and a weaker one otherwise. Two very recent works (Dai et al., 2023; Sherman et al., 2023) further generalize the setting by removing the exploratory assumption. These two works independently provide regret guarantees for this setting when no simulator is available.
Linear mixture MDPs are another popular linear function approximation model, where the transition is a mixture of linear functions. For adversarial losses, Cai et al. (2020); He et al. (2022b) study the unknown-transition but full-information feedback setting, in which the learning agent observes the loss of all actions in each state. Zhao et al. (2023) consider general bandit feedback in this setting and show that the same order of regret is achievable in this harder environment. Their model does not assume any structure on the loss function, which introduces a dependence on the numbers of states and actions, $S$ and $A$, in the regret.
3 Preliminaries
In this work, we study episodic adversarial Markov decision processes (MDPs). Such an MDP is specified by a state space, an action space, the horizon of each episode, the transition kernels, and the per-episode loss functions: the transition kernel at each step gives the probability of moving to the next state after taking an action at the current state, and the loss function may change from episode to episode.
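For concreteness, the episodic adversarial MDP can be written in the standard tuple notation below; the symbols ($\mathcal{S}$, $\mathcal{A}$, $H$, $P_h$, $\ell_{k,h}$) are our shorthand for the quantities just described, and the normalization of the losses to $[0,1]$ is the usual convention rather than a detail taken from the original statement.

```latex
% Episodic adversarial MDP in standard notation (symbols are our shorthand).
\[
  M = \bigl(\mathcal{S},\ \mathcal{A},\ H,\ \{P_h\}_{h=1}^{H},\ \{\ell_{k,h}\}_{k\in[K],\,h\in[H]}\bigr),
\]
where $P_h(s' \mid s, a)$ is the probability of transitioning from $s$ to $s'$ by taking
action $a$ at step $h$, and $\ell_{k,h}(s,a) \in [0,1]$ is the (adversarially chosen) loss of
taking action $a$ at state $s$ in step $h$ of episode $k$.
```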
We denote by $\pi_k = \{\pi_{k,h}\}_{h=1}^{H}$ the learner's policy at episode $k$, where each $\pi_{k,h}$ maps a state to a distribution over the action space, and $\pi_{k,h}(a \mid s)$ represents the probability of selecting action $a$ at state $s$ when following policy $\pi_k$ at step $h$.
The learner interacts with the MDP for $K$ episodes. At each episode $k$, the environment (adversary) first chooses the loss function, possibly based on the history before episode $k$. The learner simultaneously decides its policy $\pi_k$. At each step $h$, the learner observes the current state, takes an action based on $\pi_{k,h}$, and observes the incurred loss. The environment then transitions to the next state at the end of the step according to the transition kernel.
The performance of a policy $\pi$ in episode $k$ can be evaluated by its value function, the expected cumulative loss
$$V_k^{\pi} = \mathbb{E}\Big[\sum_{h=1}^{H} \ell_{k,h}(s_h, a_h)\Big],$$
where the expectation is taken over the randomness of the transitions and of the policy $\pi$. Denote by $\pi^*$ the optimal policy, which suffers the least expected loss over the $K$ episodes. The objective of the learner is to minimize the cumulative regret
$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \big(V_k^{\pi_k} - V_k^{\pi^*}\big), \qquad (1)$$
which is defined as the cumulative difference between the value of the executed policies and that of the optimal policy $\pi^*$.
A linear adversarial MDP is an MDP where both the transition kernel and the loss functions depend linearly on a feature mapping. We give a formal definition as follows.
Definition 1 (Linear MDP with adversarial losses).
The MDP is a linear MDP if there is a known feature mapping and unknown vector-valued measures such that the transition probability at each state-action pair satisfies
Further, for any episode and step , there exists an unknown loss vector such that
for all state-action pairs. Without loss of generality, we assume the feature mapping, the measures, and the loss vectors are bounded in norm.
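For reference, the standard linear MDP conditions (Jin et al., 2020b), which we take Definition 1 to follow, are sketched below; the symbols $\phi$, $\mu_h$, $\theta_{k,h}$ and the particular norm bounds are that convention's and are assumed here rather than quoted from the original.

```latex
% Standard linear MDP conditions (Jin et al., 2020b); notation and norm bounds assumed.
\[
  P_h(s' \mid s,a) = \langle \phi(s,a),\, \mu_h(s') \rangle,
  \qquad
  \ell_{k,h}(s,a) = \langle \phi(s,a),\, \theta_{k,h} \rangle ,
\]
\[
  \|\phi(s,a)\|_2 \le 1 \ \ \forall (s,a),
  \qquad
  \|\mu_h(\mathcal{S})\|_2 \le \sqrt{d},
  \qquad
  \|\theta_{k,h}\|_2 \le \sqrt{d} .
\]
```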
Given a policy $\pi$, its feature visitation vector at step $h$ is the expected feature mapping the policy encounters at step $h$: $\phi_h^{\pi} = \mathbb{E}_{\pi}[\phi(s_h, a_h)]$. With this definition, the expected loss that the policy receives at step $h$ of episode $k$ can be written as
$$\langle \phi_h^{\pi}, \theta_{k,h} \rangle, \qquad (2)$$
and the value of policy $\pi$ can be expressed as
$$V_k^{\pi} = \sum_{h=1}^{H} \langle \phi_h^{\pi}, \theta_{k,h} \rangle. \qquad (3)$$
For simplicity, we also write $\phi_h^{\pi}(s)$ for the expected feature visitation at state $s$ and step $h$ under $\pi$.
To ensure that the linear MDP is learnable, we make the following exploratory assumption, which is analogous to the assumptions made in previous works on the function approximation setting (Neu and Olkhovskaya, 2021; Luo et al., 2021a, b; Hao et al., 2021; Agarwal et al., 2021). For any policy $\pi$, define $\Sigma_h^{\pi} = \mathbb{E}_{\pi}[\phi(s_h, a_h)\phi(s_h, a_h)^{\top}]$ as the expected covariance of $\phi$ at step $h$, and let $\lambda_{\min}(\cdot)$ denote the smallest eigenvalue of a matrix. We assume that there exists a policy that generates full-rank covariance matrices.
Assumption 1 (Exploratory assumption).
There exists a policy $\pi_0$ such that $\lambda_{\min}(\Sigma_h^{\pi_0}) > 0$ for all steps $h$.
When reduced to the tabular setting, where $\phi(s,a)$ is a basis vector indexed by the state-action pair, this assumption requires that the visitation probability of every state-action pair under the trajectory induced by $\pi_0$ is positive. It simply means that there exists a policy with positive visitation probability for all state-action pairs, which is standard (Li et al., 2020). In the linear setting, it guarantees that every direction of the feature space can be visited by some policy.
We point out that this assumption is weaker than the exploratory assumptions used in previous works (Neu and Olkhovskaya, 2021; Luo et al., 2021a, b), which assume that such an exploratory policy $\pi_0$, with its minimum covariance eigenvalue bounded below at every step, is given as input to the algorithm. Since finding such an exploratory policy is extremely difficult, our assumption, which only requires the MDP itself to satisfy this condition, is preferable.
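As a concrete illustration of Assumption 1, the sketch below estimates the per-step feature covariance of a policy from rollouts and checks its minimum eigenvalue; the sampling interface (`rollout`) and all names are hypothetical and only mirror the definition of $\Sigma_h^{\pi}$ above.

```python
import numpy as np

def estimate_min_eigenvalue(rollout, policy, d, H, n_rollouts=10_000, seed=0):
    """Monte-Carlo estimate of min_h lambda_min(Sigma_h^pi).

    `rollout(policy, rng)` is a hypothetical sampler returning a list of
    feature vectors [phi(s_1, a_1), ..., phi(s_H, a_H)] for one episode.
    """
    rng = np.random.default_rng(seed)
    covs = [np.zeros((d, d)) for _ in range(H)]
    for _ in range(n_rollouts):
        features = rollout(policy, rng)           # one trajectory of features
        for h, phi in enumerate(features):
            covs[h] += np.outer(phi, phi)         # accumulate phi phi^T per step
    lambdas = [np.linalg.eigvalsh(c / n_rollouts)[0] for c in covs]
    return min(lambdas)                           # > 0  <=>  covariances are full rank

# Assumption 1, informally: there exists a policy pi_0 with
# estimate_min_eigenvalue(rollout, pi_0, d, H) > 0.
```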
4 Algorithm
In this section, we introduce the proposed algorithm (Algorithm 1). The algorithm takes a finite policy class and the feature visitation estimators as input and selects a policy in each episode. The constructions of the policy class and of the feature visitation estimators will be introduced in Section 4.1 and Section 4.2, respectively.
Recall that the loss value is an inner product between the feature and the loss vector. Exploiting this structure, we use ridge linear regression to estimate the unknown loss vector. Specifically, in each episode, after executing the selected policy, the observed loss values are used to estimate the loss vector, from which the value of every policy in the class can be estimated (line 9). We then adopt an optimistic estimate of each policy's value (line 10). Based on these optimistic values, the exploitation probability of each policy follows an EXP3-type update rule (line 11). To better explore each dimension of the linear space, the final selection probability is a weighted combination of the exploitation probability and an exploration probability, where the weight is an input parameter (line 7). Here the exploration probability is obtained by solving a G-optimal design problem over the estimated policy features, which minimizes the worst-case uncertainty over all policies.
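Written over the estimated policy features (which play the role of arms), the G-optimal design step takes the standard form below; for brevity we write a single feature per policy, whereas the algorithm works with per-step features, and the notation is our own.

```latex
% Standard G-optimal design over the estimated policy features (notation assumed).
\[
  g \;\in\; \arg\min_{g \in \Delta(\Pi)}\ \max_{\pi \in \Pi}\
    \bigl\| \hat{\phi}^{\pi} \bigr\|_{G(g)^{-1}}^{2},
  \qquad
  G(g) \;=\; \sum_{\pi' \in \Pi} g(\pi')\, \hat{\phi}^{\pi'} \bigl(\hat{\phi}^{\pi'}\bigr)^{\top}.
\]
```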
If the input is the true feature visitation , we can ensure that the regret of the algorithm, compared with the optimal policy in , is upper bounded. Now it suffices to bound the additional regret caused by the sub-optimality of the best policy in , and the bias of the feature visitation estimators, which will be discussed in the following sections.
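To make the main loop concrete before moving on, here is a minimal sketch of one episode of Algorithm 1 as we read it: an importance-weighted ridge estimate of the loss in the policy-feature space, an EXP3-style exponential-weights update, and mixing with the G-optimal exploration distribution. All names, the form of the optimism bonus, and the parameter choices are placeholders, not the paper's exact rules.

```python
import numpy as np

def episode_update(hat_phi, weights, g_explore, gamma, eta, beta, rng, play_and_observe):
    """One episode of an EXP3-style update over a finite policy set.

    hat_phi:   array (N, H, d) of estimated per-step feature visitations.
    weights:   array (N,) of unnormalized exponential weights.
    g_explore: array (N,) exploration distribution from the G-optimal design.
    play_and_observe(i) -> array (H,) of observed losses when executing policy i
    (a hypothetical environment interface).
    """
    N, H, d = hat_phi.shape
    q = weights / weights.sum()
    p = (1 - gamma) * q + gamma * g_explore              # mixed selection distribution
    i = rng.choice(N, p=p)                               # execute one policy
    losses = play_and_observe(i)

    V_hat = np.zeros(N)
    for h in range(H):
        Sigma = np.einsum("n,nd,ne->de", p, hat_phi[:, h], hat_phi[:, h])
        Sigma_reg = Sigma + 1e-8 * np.eye(d)
        theta_hat = np.linalg.solve(Sigma_reg, hat_phi[i, h] * losses[h])  # loss-vector estimate
        V_hat += hat_phi[:, h] @ theta_hat                # estimated per-policy values
        # optimistic bonus (placeholder form): favor policies with uncertain estimates
        V_hat -= beta * np.sqrt(np.einsum("nd,de,ne->n", hat_phi[:, h],
                                          np.linalg.inv(Sigma_reg), hat_phi[:, h]))
    new_weights = weights * np.exp(-eta * V_hat)          # EXP3-type exponential update
    return new_weights, i
```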
4.1 Policy Construction
In this subsection, we introduce how to construct a finite policy set such that the true optimal policy can be approximated by an element of the set. The policy construction mainly borrows from Appendix A.3 of Wagenmaker and Jamieson (2022), but with a refined analysis for the adversarial setting.
We consider the linear softmax policy class. Specifically, given a parameter vector, the induced policy selects each action at a state with probability proportional to the exponential of a linear function of the state-action feature.
The advantage of this policy class is that it satisfies a Lipschitz property: the difference between the values of two induced policies can be upper bounded by the difference between their parameters. Based on this observation, by constructing a covering of the parameter ball, we can ensure that the parameter of the optimal policy is approximated by some parameter in the covering.
Further, by the Lipschitz property, the policy induced by this covering parameter has a value close to that of the optimal policy.
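A minimal sketch of this policy class follows, assuming the common parameterization $\pi_w(a \mid s) \propto \exp(w^{\top}\phi(s,a))$ and a naive grid covering of the parameter ball; the exact parameterization, per-step parameters, and covering radius used in the paper may differ.

```python
import itertools
import numpy as np

def softmax_policy(w, phi_s):
    """Action distribution of the linear softmax policy at one state.

    w:     parameter vector of shape (d,).
    phi_s: array (A, d) of features phi(s, a) for every action a.
    """
    logits = phi_s @ w
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def parameter_cover(d, radius=1.0, resolution=0.5):
    """Naive grid covering of the d-dimensional ball of the given radius.

    The covering used in the analysis is finer (its size is exponential in d);
    this grid only illustrates the construction.
    """
    axis = np.arange(-radius, radius + 1e-9, resolution)
    return [np.array(w) for w in itertools.product(axis, repeat=d)
            if np.linalg.norm(w) <= radius]

# Each element of parameter_cover(d) induces one policy via softmax_policy,
# giving a finite policy class whose best member is close to the optimal policy.
```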
The informal result is shown in the following lemma.
Lemma 1.
There exists a finite policy class with log cardinality , such that the regret compared with the optimal policy in is close to the global regret, i.e.,
The detailed analysis can be found in Appendix C.
4.2 Feature Visitation Estimation
In this subsection, we discuss how to deal with unavailable feature visitations of policies. Our approach is to estimate the feature visitation of each policy and use these estimated features as input to Algorithm 1. The feature estimation process is described in Algorithm 2, which we call the feature visitation estimation oracle.
For any policy , we can first decompose its feature visitation at step as
where is the transition operator and can be directly computed based on policy . Thus, to estimate for each step , we need to estimate the transition operator .
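Under Definition 1, this decomposition can be spelled out as follows (our notation; $M_h^{\pi}$ plays the role of the transition operator mentioned above and depends on the unknown measure $\mu_h$).

```latex
% Feature visitation recursion in a linear MDP (our notation).
\[
  \phi_{h+1}^{\pi}
  = \mathbb{E}_{\pi}\bigl[\phi_{h+1}^{\pi}(s_{h+1})\bigr]
  = \Bigl(\int_{\mathcal{S}} \phi_{h+1}^{\pi}(s')\,\mu_h(\mathrm{d}s')^{\top}\Bigr)\,\phi_h^{\pi}
  =: M_h^{\pi}\,\phi_h^{\pi},
  \qquad
  \phi_{h+1}^{\pi}(s) = \sum_{a} \pi_{h+1}(a \mid s)\,\phi(s,a),
\]
so the map from $\phi_h^{\pi}$ to $\phi_{h+1}^{\pi}$ is linear; the per-state term
$\phi_{h+1}^{\pi}(s)$ is directly computable from the policy, while the operator induced
by $\mu_h$ must be estimated.
```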
We use the least-squares method to estimate the transition operator. Suppose we have currently collected a number of trajectories; then the estimator is the solution of a regularized least-squares problem over these trajectories, and it admits a closed-form solution.
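A hedged sketch of this step: each collected transition provides a pair (the feature at step $h$, the policy feature of the observed next state), and the operator is fit by regularized least squares. The regularizer, the data-splitting, and the exact regression targets of Algorithm 2 are assumptions here, not quotations.

```python
import numpy as np

def estimate_transition_operator(X, Y, reg=1.0):
    """Ridge least-squares estimate of the step-h transition operator.

    X: array (n, d) with rows phi(s_h, a_h) from the collected trajectories.
    Y: array (n, d) with rows phi^pi_{h+1}(s_{h+1}), the policy feature of the
       observed next state (computable from the policy and the next state).
    Returns M_hat of shape (d, d) such that Y ~= X @ M_hat.T .
    """
    d = X.shape[1]
    gram = X.T @ X + reg * np.eye(d)             # regularized design matrix
    M_hat = np.linalg.solve(gram, X.T @ Y).T     # closed-form ridge solution
    return M_hat

def estimate_feature_visitations(M_hats, phi_1):
    """Roll the estimated operators forward: phi_{h+1} = M_h phi_h."""
    phis = [phi_1]
    for M in M_hats:
        phis.append(M @ phis[-1])
    return phis
```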
To guarantee the accuracy of the estimated feature visitations, we provide a guarantee on the accuracy of the estimated transition operator. The intuition is to collect enough data in every direction of the feature space. For the trajectory-collection design, we adopt the reward-free technique of Wagenmaker and Jamieson (2022) and adapt it into a standalone feature visitation estimation oracle. Algorithm 2 satisfies the following sample complexity and accuracy guarantees.
Lemma 2.
Algorithm 2 runs for at most episodes and returns a feature visitation estimation that satisfies
for any policy and step , with probability at least .
Since the regret of each episode is bounded, the total regret incurred in this process of estimating feature visitations is controlled. The detailed analysis and results are in Appendix B.
5 Analysis
This section provides the regret guarantees for the proposed algorithm as well as the proof sketch.
We consider Algorithm 1 with the policy set constructed in Section 4.1 and the policy features estimated in Section 4.2 as input. Suppose we run Algorithm 1 for a number of rounds; the regret compared with any fixed policy over these rounds can be bounded as follows.
Lemma 3.
For any policy , with probability at least ,
where is the tolerance of the estimated feature bias in Lemma 2.
Recall that when the policy set is constructed as in Section 4.1, the difference between the global regret defined in Equation (1) and the regret compared with the best policy in the constructed set is only a constant. So the global regret can also be bounded as in Lemma 3 above.
Similar to previous works on linear adversarial MDPs (Luo et al., 2021a, b; Sherman et al., 2023; Dai et al., 2023) that discuss whether a transition simulator is available, we define the simulator that may be available in the following assumption. Note that this simulator is weaker than those of Luo et al. (2021a, b); Dai et al. (2023), whose simulators can generate a next state given any state-action pair.
Assumption 2 (Simulator).
The learning agent has access to a simulator such that when given a policy , it returns a trajectory based on the MDP and policy .
If the learning agent has access to a simulator as described in Assumption 2, then the feature estimation process in Section 4.2 can be regarded as regret-free and the final regret is exactly that of Lemma 3. Otherwise, there is an additional regret term. Balancing the parameter choices yields the following regret upper bound.
Theorem 1.
The proof of the main results is deferred to Appendix A.
Discussions
Since a main contribution of our work is to improve the results of Neu and Olkhovskaya (2021); Luo et al. (2021a) under an exploratory assumption, we present more insights into the differences between our approach and the approaches in these two works.
As shown in Table 1, our result explicitly improves upon the result of Luo et al. (2021a) in both the dependence on the number of episodes and the dependence on the minimum eigenvalue in the exploratory assumption. In real applications, for each direction of the linear space, it is reasonable that there is a policy visiting that direction, so mixing these policies ensures the exploratory assumption. However, there is no guarantee on the magnitude of this eigenvalue, and when it is very small, removing the dependence on it is significant.
Technically, our new view of linear MDPs could be general enough to be useful in other linear settings. In Section 4.1, we only introduce a simplified version of the policy construction to convey the intuition. We could vary this construction by further placing a finite covering over the action space to deal with infinite action spaces; more details can be found in Appendix C. Meanwhile, Neu and Olkhovskaya (2021) require both state and action spaces to be finite, and Luo et al. (2021a, b); Dai et al. (2023); Sherman et al. (2023) can only deal with finite action spaces.
When a transition simulator is available, the number of calls to the simulator (query complexity) is also an important metric of the algorithm's efficiency. According to Lemma 2 and the analysis in Appendix A, we only need a bounded number of simulator calls to achieve the regret of Theorem 1, by choosing the tolerance in Lemma 3 appropriately. This is preferable to Luo et al. (2021a), whose number of simulator calls also depends on the size of the action space.
Compared with Luo et al. (2021a), we improve their results with better regret bounds, a weaker exploratory assumption, a weaker simulator, and fewer queries (if a simulator is used).
6 Conclusion
In this paper, we investigate linear adversarial MDPs with bandit feedback. We propose a new view of linear MDPs in which optimizing policies can be regarded as a linear optimization problem. Based on this insight, we propose an algorithm that constructs a set of policies and maintains a probability distribution over which policy to execute. With an exploratory assumption, our algorithm yields the first regret bound of its order without access to a simulator. Compared to the results of Luo et al. (2021a), our algorithm enjoys a weaker assumption, a better regret bound, and a weaker simulator with fewer queries if one is used.
Our view contributes a new approach to linear MDPs, which could be of independent interest. We demonstrated how our algorithm generalizes to infinite action spaces under this view. Future implications of this technique could include other adversarial settings, such as losses corrupted up to a budget, and robust linear MDPs where the transition kernel changes over episodes.
References
- Afsar et al. (2022) M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. ACM Computing Surveys, 55(7):1–38, 2022.
- Agarwal et al. (2021) Naman Agarwal, Syomantak Chaudhuri, Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Online target Q-learning with reverse experience replay: Efficiently finding the optimal policy for linear MDPs. arXiv preprint arXiv:2110.08440, 2021.
- Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
- Bradtke and Barto (1996) Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
- Cai et al. (2020) Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283–1294. PMLR, 2020.
- Chen et al. (2021) Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Minimax regret for stochastic shortest path with adversarial costs and known transition. In Conference on Learning Theory, pages 1180–1215. PMLR, 2021.
- Chen et al. (2022) Liyu Chen, Haipeng Luo, and Aviv Rosenberg. Policy optimization for stochastic shortest path. In Conference on Learning Theory, pages 982–1046. PMLR, 2022.
- Dai et al. (2023) Yan Dai, Haipeng Luo, Chen-Yu Wei, and Julian Zimmert. Refined regret for adversarial MDPs with linear function approximation. arXiv preprint arXiv:2301.12942, 2023.
- Hao et al. (2021) Botao Hao, Tor Lattimore, Csaba Szepesvari, and Mengdi Wang. Online sparse reinforcement learning. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, pages 316–324. PMLR, 2021.
- He et al. (2021) Jiafan He, Dongruo Zhou, and Quanquan Gu. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning, pages 4171–4180. PMLR, 2021.
- He et al. (2022a) Jiafan He, Heyang Zhao, Dongruo Zhou, and Quanquan Gu. Nearly minimax optimal reinforcement learning for linear Markov decision processes. arXiv preprint arXiv:2212.06132, 2022.
- He et al. (2022b) Jiafan He, Dongruo Zhou, and Quanquan Gu. Near-optimal policy optimization algorithms for learning adversarial linear mixture MDPs. In International Conference on Artificial Intelligence and Statistics, pages 4259–4280. PMLR, 2022.
- Hu et al. (2022) Pihe Hu, Yu Chen, and Longbo Huang. Nearly minimax optimal reinforcement learning with linear function approximation. In International Conference on Machine Learning, pages 8971–9019. PMLR, 2022.
- Jin et al. (2018) Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018.
- Jin et al. (2020a) Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial Markov decision processes with bandit feedback and unknown transition. In International Conference on Machine Learning, pages 4860–4869. PMLR, 2020.
- Jin et al. (2020b) Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- Kiran et al. (2021) B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(6):4909–4926, 2021.
- Kober et al. (2013) Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
- Li et al. (2020) Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Advances in Neural Information Processing Systems, 33:7031–7043, 2020.
- Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Lin et al. (2021) Yuanguo Lin, Yong Liu, Fan Lin, Lixin Zou, Pengcheng Wu, Wenhua Zeng, Huanhuan Chen, and Chunyan Miao. A survey on reinforcement learning for recommender systems. arXiv preprint arXiv:2109.10665, 2021.
- Luo et al. (2021a) Haipeng Luo, Chen-Yu Wei, and Chung-Wei Lee. Policy optimization in adversarial MDPs: Improved exploration via dilated bonuses. Advances in Neural Information Processing Systems, 34:22931–22942, 2021.
- Luo et al. (2021b) Haipeng Luo, Chen-Yu Wei, and Chung-Wei Lee. Policy optimization in adversarial MDPs: Improved exploration via dilated bonuses. arXiv preprint arXiv:2107.08346, 2021.
- Melo and Ribeiro (2007) Francisco S Melo and M Isabel Ribeiro. Q-learning with linear function approximation. In Learning Theory: 20th Annual Conference on Learning Theory, COLT 2007, San Diego, CA, USA, June 13-15, 2007. Proceedings 20, pages 308–322. Springer, 2007.
- Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Neu and Olkhovskaya (2021) Gergely Neu and Julia Olkhovskaya. Online learning in MDPs with linear function approximation and bandit feedback. Advances in Neural Information Processing Systems, 34:10407–10417, 2021.
- Neu et al. (2010) Gergely Neu, András György, Csaba Szepesvári, et al. The online loop-free stochastic shortest-path problem. In COLT, volume 2010, pages 231–243. Citeseer, 2010.
- Rosenberg and Mansour (2019) Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial Markov decision processes. In International Conference on Machine Learning, pages 5478–5486. PMLR, 2019.
- Shani et al. (2020) Lior Shani, Yonathan Efroni, Aviv Rosenberg, and Shie Mannor. Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning, pages 8604–8613. PMLR, 2020.
- Sherman et al. (2023) Uri Sherman, Tomer Koren, and Yishay Mansour. Improved regret for efficient online reinforcement learning with linear function approximation. arXiv preprint arXiv:2301.13087, 2023.
- Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Simchowitz and Jamieson (2019) Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. Advances in Neural Information Processing Systems, 32, 2019.
- Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- Wagenmaker and Jamieson (2022) Andrew Wagenmaker and Kevin Jamieson. Instance-dependent near-optimal policy identification in linear MDPs via online experiment design. arXiv preprint arXiv:2207.02575, 2022.
- Yang and Wang (2019) Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.
- Yang and Wang (2020) Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
- Yang et al. (2021) Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021.
- Yu et al. (2009) Jia Yuan Yu, Shie Mannor, and Nahum Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
- Zanette et al. (2020) Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, and Alessandro Lazaric. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pages 1954–1964. PMLR, 2020.
- Zhao et al. (2023) Canzhe Zhao, Ruofeng Yang, Baoxiang Wang, and Shuai Li. Learning adversarial linear mixture Markov decision processes with bandit feedback and unknown transition. In International Conference on Learning Representations, 2023.
- Zhou et al. (2021) Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory, pages 4532–4576. PMLR, 2021.
- Zimin and Neu (2013) Alexander Zimin and Gergely Neu. Online learning in episodic Markovian decision processes by relative entropy policy search. Advances in Neural Information Processing Systems, 26, 2013.
Appendix A Analysis of Algorithm 1
In this section, we present the regret analysis for Algorithm 1 and prove the final regret bounds for our main algorithm. We first state the necessary concentration bounds as lemmas and then analyze the regret, proving Lemma 3 and Theorem 1. In the following analysis, we condition on the success of the event in Lemma 17, which holds with high probability. The following inequality is used throughout the analysis and is restated here first.
Lemma 4.
Let be a filtration and let be random variables such that is measurable, , almost surely, and for some fixed and . Then, for any , we have with probability at least ,
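For completeness, the standard form of Freedman's inequality that we take Lemma 4 to be is written below; the constants and exact formulation may differ from the original statement.

```latex
% Freedman's inequality, standard form (assumed to match Lemma 4 up to constants).
% X_1, ..., X_T is a martingale difference sequence with respect to {F_t}.
\[
  |X_t| \le b \ \text{a.s.}, \qquad
  \mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = 0, \qquad
  \sum_{t=1}^{T} \mathbb{E}[X_t^2 \mid \mathcal{F}_{t-1}] \le V .
\]
Then for any $\eta \in (0, 1/b]$ and $\delta \in (0,1)$, with probability at least $1-\delta$,
\[
  \sum_{t=1}^{T} X_t \;\le\; (\mathrm{e}-2)\,\eta\, V \;+\; \frac{\ln(1/\delta)}{\eta} .
\]
```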
To start with, we give the concentration of the feature visitation estimators returned from Algorithm 2, which will be fundamental in the following analysis. Notice that can be computed directly from the initial state distribution and action distribution.
Lemma 5.
The following Lemma 6 and Lemma 7 bound the magnitudes of the loss and value estimators in line 9, using the properties of the G-optimal design computed in line 4.
Lemma 6.
and , for all and .
Proof.
According to the properties of G-optimal design, we have:
and . Thus we have . So:
∎
Lemma 7.
With our choice of , when , we have for all optimistic loss function estimator , .
Proof.
To make sure , we notice that:
(4)
By Lemma 6, we have
When , we have . Thus, our choice satisfies this constraint. ∎
Throughout the following analysis, assuming we have run for some number of episodes , we let the filtration on this, with the filtration up to and including episode . Define . The next lemma will bound the bias of the loss vector estimator, thus we can bound the bias of the value function estimator.
Lemma 8.
Denote as the expected value of the loss vector estimator on . Then we have for :
As a result, we also have โ .
Proof.
Lemma 9.
Denote , then we have with probability at least ,
Proof.
First, we bound the bias of the estimated loss of each policy after episode in step :
The first term is a martingale difference sequence by the definition in Lemma 8. To bound its magnitude, notice that it is bounded according to Lemmas 5 and 6. Its variance is also bounded as:
where the last inequality is due to the fact and . Using Freedman's inequality, we obtain, with probability at least :
where the last two inequalities are due to the Cauchy-Schwarz inequality and the AM-GM inequality.
∎
Lemma 10.
We bound the gap between the actual regret and the expected estimated regret. With probability at least ,
Proof.
Denote , we have that :
Notice that . We bound its conditional variance as follows:
(5)
(6)
(7)
(8)
where inequality (5) is due to the Cauchy-Schwarz inequality and (7) is due to Jensen's inequality. Moreover, . Applying Bernstein's inequality, we obtain, with probability at least ,
Since , we have:
Combining the two terms, we prove this lemma. ∎
Lemma 11.
With probability at least , we have:
Proof.
(9)
Since , we bound the first term as follows.
Its conditional expectation is , and also . Thus, applying the Hoeffding bound, we have with probability at least ,
Plugging this into (9), we finish the proof. ∎
Proof of Lemma 3.
Now we are ready to analyze the regret. Using the classical potential-function analysis for this type of algorithm, we have:
(10)
(11)
where inequality (10) follows from the bound guaranteed by Lemma 7. Using Lemma 9, we can bound the second term as:
(12)
Plugging Lemma 10, Lemma 11, and Equation (12) into Equation (11), and noting that we condition on the aforementioned event, we obtain:
(13)
Combining terms, we have:
(14)
Plugging into Equation (14), we have:
(15)
On the other hand, we have:
(16)
Combining (15) and (16), we have:
(17)
Choosing and combining terms, we obtain for any policy , with probability at least :
(18)
∎
We then present the proof of Theorem 1 based on Equation (18). Notice that we condition on the number of episodes being large enough so that the optimal parameters set below satisfy the requirements of the algorithm, while the case of a small number of episodes is trivial.
- In the case where we have access to a simulator, the total regret is incurred only while we execute the policies in the constructed set. Setting the parameters appropriately and using the properties of the policy set from Lemma 19, the total regret is bounded as:
  Also, according to Corollary 1, the total number of episodes run on the simulator is bounded accordingly.
- When we do not have access to a simulator, we have to take into account the regret incurred while estimating the feature visitation of each policy. According to Corollary 1, this additional regret is bounded. By our construction of the policy set in Lemma 19, the total regret is bounded as in (19). Setting the parameters accordingly, the total regret is of order:
Appendix B Construct the Policy Visitation Estimators
In this section, we present the analysis of Algorithm 2. We then prove Lemma 17 and Corollary 1 as our main results, which provide the concentration of the estimators and bound the sample complexity. These results are then used to prove the final regret bounds in Appendix A.
First, we state the performance guarantee of the data-collection oracle, which comes directly from Theorem 9 in Wagenmaker and Jamieson [2022].
Denote:
for some fixed regularizer. We consider its smooth approximation:
We also define , where is the set of all distributions over all valid Markovian policies; it is, then, the set of all covariance matrices realizable by distributions over policies at step . Then we have
Theorem 2.
Consider running Algorithm 6 in Wagenmaker and Jamieson [2022] with some and functions
for and
where is the matrix returned by running Algorithm 7 in Wagenmaker and Jamieson [2022] with , , and some . Then with probability , this procedure will collect at most
episodes, where
and will produce covariates such that
and
Next, we present the concentration analysis of our estimators and bound the total number of episodes run. Throughout this section, assuming we have run for some number of episodes $K$, we work with the corresponding filtration, with the sub-filtration up to and including each episode; we also consider the filtration over all episodes and over the steps within each episode. Define
and
We have from Lemma A.7 in Wagenmaker and Jamieson [2022]:
We also denote .
The following Lemma 12 comes directly from Lemmas B.1, B.2, and B.3 in Wagenmaker and Jamieson [2022] and provides the basic concentration properties of the estimators constructed in line 6 of Algorithm 2.
Lemma 12.
Assume that we have collected some data where, for each , is independent of . Denote and . Fix and let
Fix . Then with probability at least :
Thus, with probability at least ,
Lemma 13.
Let denote the event on which, for all , the feature visitation estimates returned by line 6 satisfy:
Then
.
Proof.
Lemma 14.
Proof.
Lemma 15.
On the event , for all ,
Proof.
On , we can bound:
so that:
∎
Lemma 16.
Define and . Then , and on , for all and , we have:
Proof.
Lemma 17 (Full version of Lemma 2).
With probability at least , Algorithm 2 will run at most
episodes, and will output policy visitation estimators with bias bounded as:
Proof.
Corollary 1.
Appendix C Construct the Policy Set
In this section, we provide the proof for the constructed policy set. The construction technique follows directly from Appendix A.3 in Wagenmaker and Jamieson [2022], and we prove that such a construction also works in MDPs with adversarial rewards. Our main result is stated in Lemma 19.
Lemma 18.
In the adversarial MDP setting, where the loss function changes in each episode, the best stationary policy of the MDP over the considered episodes is the optimal policy of the MDP with a fixed loss function equal to the average loss. Denote this average MDP, which has the same transition kernel and the average loss, by .
That is:
where is the value function associated with the new MDP .
Proof.
Let be the trajectory generated by following policy through the MDP. Denote the occupancy measure as the probability of visiting state-action pair under trajectory , and .
For any stationary policy , we have:
Since the two MDPs share the same transition kernel, the occupancy measure generated by the same policy stays unchanged. So we have:
So satisfies:
∎
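The occupancy-measure argument can be summarized in one line (our notation: $q_h^{\pi}$ for the occupancy measure, which depends only on the transitions and $\pi$, and $\bar{\ell}_h$ for the average loss).

```latex
% Value under the average MDP equals the average value across episodes.
\[
  \frac{1}{K}\sum_{k=1}^{K} V_k^{\pi}
  = \frac{1}{K}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{s,a} q_h^{\pi}(s,a)\,\ell_{k,h}(s,a)
  = \sum_{h=1}^{H}\sum_{s,a} q_h^{\pi}(s,a)\,\bar{\ell}_h(s,a)
  = V^{\pi}(\bar{M}),
  \qquad
  \bar{\ell}_h = \frac{1}{K}\sum_{k=1}^{K} \ell_{k,h}.
\]
```

Hence the stationary policy minimizing the cumulative loss over the $K$ episodes coincides with the optimal policy of the average MDP $\bar{M}$.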
Lemma 19.
Choose an arbitrary constant tolerance; then we can construct a policy set for any linear adversarial MDP such that there exists a policy in the set whose regret, compared with the global optimal policy, is bounded accordingly:
So that:
and the size of is bounded as:
where $d$ is the dimension of the feature map.
Proof.
According to Lemma A.14 in Wagenmaker and Jamieson [2022], for any linear MDP with a fixed reward function, we can construct a policy set such that there exists a policy in the set that approximates the best policy with bounded bias. The size of the policy set is bounded as:
(22) |
Notice that this construction is based entirely on the set of state-action features and requires no information about the loss or reward function. In the adversarial case, we choose the fixed-reward MDP to be the average MDP defined in Lemma 18, and we obtain the regret bound over all episodes:
(23) |
The proof is finished by taking in Equation (22).
∎