
Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization

Fang Kong (Shanghai Jiao Tong University), Xiangcheng Zhang (Tsinghua University), Baoxiang Wang (The Chinese University of Hong Kong, Shenzhen), Shuai Li (Shanghai Jiao Tong University). Fang Kong and Xiangcheng Zhang contribute equally to this work. Shuai Li is the corresponding author.
Abstract

Learning Markov decision processes (MDPs) in an adversarial environment has been a challenging problem. The problem becomes even more challenging with function approximation, since the underlying structure of the loss function and transition kernel are especially hard to estimate in a varying environment. In fact, the state-of-the-art results for linear adversarial MDPs achieve a regret of $\tilde{\mathcal{O}}(K^{6/7})$ ($K$ denotes the number of episodes), which admits a large room for improvement. In this paper, we investigate the problem with a new view, which reduces linear MDPs to linear optimization by subtly setting the feature maps of the bandit arms of the linear optimization problem. This new technique, under an exploratory assumption, yields an improved bound of $\tilde{\mathcal{O}}(K^{4/5})$ for linear adversarial MDPs without access to a transition simulator. The new view could be of independent interest for solving other MDP problems that possess a linear structure.

1 Introduction

Reinforcement learning (RL) describes the interaction between a learning agent and an unknown environment, where the agent aims to maximize the cumulative reward through trial and error Sutton and Barto (2018). It has achieved great success in many real applications, such as games (Mnih et al., 2013; Silver et al., 2016), robotics (Kober et al., 2013; Lillicrap et al., 2015), autonomous driving (Kiran et al., 2021) and recommendation systems (Afsar et al., 2022; Lin et al., 2021). The interaction in RL is commonly portrayed by Markov decision processes (MDP). Most of the works study the stochastic setting, where the reward is sampled from a fixed distribution (Azar et al., 2017; Jin et al., 2018; Simchowitz and Jamieson, 2019; Yang et al., 2021). RL in real applications is in general more challenging than the stochastic setting, as the environment could be non-stationary and the reward function could be adaptive towards the agentโ€™s policy. For example, a scheduling algorithm will be deployed to self-interested parties, and recommendation algorithms will face strategic users.

To design robust algorithms that work under non-stationary environments, a line of works focuses on the adversarial setting, where the reward function could be arbitrarily chosen by an adversary (Yu et al., 2009; Rosenberg and Mansour, 2019; Jin et al., 2020a; Chen et al., 2021; Luo et al., 2021a). Many works in adversarial MDPs optimize the policy by learning the value function with a tabular representation. In this case, both their computation complexity and their regret bounds depend on the sizes of the state and action spaces. In real applications, however, the state and action spaces could be exponentially large or even infinite, such as in the game of Go and in robotics. The resulting computational cost and performance are then inadequate.

To cope with the curse of dimensionality, function approximation methods are widely deployed to approximate the value functions with learnable structures. Great empirical success has proved its efficacy in a wide range of areas. Despite this, theoretical understandings of MDP with general function approximation are yet to be available. As an essential step towards understanding function approximation, linear MDP has been an important setting and has received significant attention from the community. It presumes that the transition and reward functions in MDP follow a linear structure with respect to a known feature (Jin et al., 2020b; He et al., 2021; Hu et al., 2022). The stochastic setting in linear MDP has been well studied and near-optimal results are available (Jin et al., 2020b; Hu et al., 2022). The adversarial setting in linear MDP is much more challenging since the underlying linear parameters of the loss function and transition kernel are especially hard to estimate in a varying environment.

The research on linear adversarial MDPs remains open. Early work proposed algorithms for the case where the transition function is known (Neu and Olkhovskaya, 2021). Several recent works explore the problem without a known transition function and derive policy optimization algorithms with the state-of-the-art regret of $\tilde{\mathcal{O}}(K^{6/7})$ (Luo et al., 2021a; Dai et al., 2023; Sherman et al., 2023). While the optimal regret in tabular MDPs is of order $\tilde{\mathcal{O}}(K^{1/2})$ (Jin et al., 2020a), the regret upper bounds available for linear adversarial MDPs seem to admit a large room for improvement.

In this paper, we investigate linear adversarial MDPs with unknown transitions. We propose a new view of the problem and design an algorithm based on this view. The idea is to reduce the MDP setting to a linear optimization problem by subtly setting the feature maps of the bandit arms of the linear optimization problem. In this way, we operate on a set of policies and optimize the probability distribution over which policy to execute. By carefully balancing the suboptimality in policy execution, the suboptimality in policy construction, and the suboptimality in feature visitation estimation, we deduce new analyses of the problem. Improved regret bounds are obtained both with and without access to a simulator. In particular, we obtain the first $\tilde{\mathcal{O}}(K^{4/5})$ regret bound for linear adversarial MDPs without a simulator.

Let $d$ be the feature dimension and $H$ be the length of each episode. Details of our contributions are as follows.

  • With an exploratory assumption (Assumption 1), we obtain an $\tilde{\mathcal{O}}(d^{7/5}H^{12/5}K^{4/5})$ regret upper bound for linear adversarial MDPs. As compared in Table 1, this is the first regret bound that achieves the $\tilde{\mathcal{O}}(K^{4/5})$ order when a simulator of the transition is not provided. We also note that our exploratory assumption, which only ensures that the MDP is learnable, is much weaker than those of previous works (Neu and Olkhovskaya, 2021; Luo et al., 2021a). Under this weaker exploratory assumption, our result achieves a significant improvement over the $\tilde{\mathcal{O}}(K^{6/7})$ regret in Luo et al. (2021a) and also removes the dependence on $\lambda$, the minimum eigenvalue of the exploratory policy's covariance, which can be small.

  • In a simpler setting where the agent has access to a simulator, our regret can be further improved to $\tilde{\mathcal{O}}(\sqrt{d^{2}H^{5}K})$. This result also removes the dependence on $\lambda$ in previous works (Neu and Olkhovskaya, 2021; Luo et al., 2021a). Compared with Luo et al. (2021a), our required simulator is also weaker: we only need access to a trajectory when given any policy $\pi$, while Luo et al. (2021a) require the next state $s^{\prime}$ when given any state-action pair $(s,a)$.

  • Technically, we provide a new tool for linear MDP problems by exploiting the linear features of the MDP and transforming it into a linear optimization problem. This tool could be of independent interest and might be useful in other problems that possess a linear structure.

Table 1: Comparisons of our results with most related works for linear adversarial MDPs.

| Work | Transition | Simulator1 | Exploratory Assumption2 | Regret3 |
| Neu and Olkhovskaya (2021) | Known | yes | yes | $\tilde{\mathcal{O}}(\sqrt{K/\lambda})$ |
| Luo et al. (2021a, b) | Unknown | yes | yes | $\tilde{\mathcal{O}}(\sqrt{K/\lambda})$ |
| | | yes | no | $\tilde{\mathcal{O}}(K^{2/3})$ |
| | | no | yes | $\tilde{\mathcal{O}}\big((K/\lambda^{2/3})^{6/7}\big)$ |
| | | no | no | $\tilde{\mathcal{O}}(K^{14/15})$ |
| Dai et al. (2023) | Unknown | yes | no | $\tilde{\mathcal{O}}(\sqrt{K})$ |
| | | no | no | $\tilde{\mathcal{O}}(K^{8/9})$ |
| Sherman et al. (2023) | Unknown | yes | no | $\tilde{\mathcal{O}}(K^{2/3})$ |
| | | no | no | $\tilde{\mathcal{O}}(K^{6/7})$ |
| Ours | Unknown | yes | yes | $\tilde{\mathcal{O}}(\sqrt{K})$ |
| | | no | yes | $\tilde{\mathcal{O}}(K^{4/5})$ |
  1. Our required simulator is defined in Assumption 2. Notice that Dai et al. (2023) and Luo et al. (2021a, b) adopt a stronger simulator that returns the next state $s^{\prime}$ when given any state-action pair $(s,a)$, while Sherman et al. (2023), Neu and Olkhovskaya (2021), and this paper only need the simulator to return a trajectory when given a policy.

  2. Our exploratory assumption is introduced in Assumption 1. It is worth noting that our assumption on exploration is much weaker than those of Neu and Olkhovskaya (2021) and Luo et al. (2021a, b). Our exploratory assumption only ensures the learnability of the MDP, while the other works require as input a policy that can explore the full linear space in all steps. Our assumption is implied by theirs.

  3. The $\lambda$ term in the regret represents the minimum eigenvalue induced by a "good" exploratory policy $\pi_{0}$, which satisfies $\lambda_{\min}(\mathbf{\Lambda}_{\pi_{0},h})\geq\lambda$ for all $h\in[H]$, where $\mathbf{\Lambda}_{\pi_{0},h}$ is the covariance of $\pi_{0}$ at step $h$ (see Assumption 1).

2 Related Work

Linear MDPs.

Linear function approximation has a long history of study (Bradtke and Barto, 1996; Melo and Ribeiro, 2007; Sutton and Barto, 2018; Yang and Wang, 2019). More recently, Yang and Wang (2020) provide theoretical guarantees on sample efficiency in the linear MDP setting; however, they assume that the transition function can be parameterized by a small matrix. For the general case, Jin et al. (2020b) develop LSVI-UCB, the first algorithm that is efficient in both sample and computation complexity. They show that the algorithm achieves $O(\sqrt{d^{3}H^{3}K})$ regret, where $d$ is the feature dimension and $H$ is the length of each episode. This result is improved to the optimal order $O(dH\sqrt{K})$ by Hu et al. (2022) with a tighter concentration analysis. A very recent work (He et al., 2022a) points out a technical error in Hu et al. (2022) and shows a nearly minimax result that matches the lower bound $O(d\sqrt{H^{3}K})$ in Zhou et al. (2021). All these works are based on UCB-type algorithms. Apart from UCB, a TS-type algorithm has also been proposed for this setting (Zanette et al., 2020). The above results mainly focus on minimax optimality. In the stochastic setting, deriving an instance-dependent regret bound is also attractive, as it adapts to the hardness of the specific MDP instance. This type of regret has been widely studied in the tabular MDP setting (Simchowitz and Jamieson, 2019; Yang et al., 2021). He et al. (2021) are the first to provide this type of regret bound in linear MDPs. Using a different proof framework, they show that the LSVI-UCB algorithm achieves $O(d^{3}H^{5}\log K/\Delta)$ regret, where $\Delta$ is the minimum value gap in the episodic MDP.

Adversarial losses in MDPs.

When the losses at state-action pairs do not follow a fixed distribution, the problem becomes an adversarial MDP. This problem was first studied in the tabular setting. The occupancy-measure-based method is one of the most popular approaches to dealing with a potential adversary. Within this line, Zimin and Neu (2013) first study the known transition setting and derive regret guarantees of $\tilde{\mathcal{O}}(H\sqrt{K})$ and $\tilde{\mathcal{O}}(\sqrt{HSAK})$ for full-information and bandit feedback, respectively. For the more challenging unknown transition setting, Rosenberg and Mansour (2019) also start from full-information feedback and derive an $\tilde{\mathcal{O}}(HS\sqrt{AK})$ regret. Bandit feedback was recently studied by Jin et al. (2020a), where the regret bound is $\tilde{\mathcal{O}}(HS\sqrt{AK})$. The other line of works (Neu et al., 2010; Shani et al., 2020; Chen et al., 2022; Luo et al., 2021a) is based on policy optimization methods. In the unknown transition and bandit feedback setting, the state-of-the-art result in this line is also of $\tilde{\mathcal{O}}(\sqrt{K})$ order, achieved by Luo et al. (2021a, b).

Specifically, a few works focus on the linear adversarial MDP problem. Neu and Olkhovskaya (2021) first study the known transition setting and provide an $O(\sqrt{K})$ regret under the assumption that an exploratory policy can explore the full linear space. For the general unknown transition case, Luo et al. (2021a, b) discuss four cases according to whether a simulator is available and whether the exploratory assumption is satisfied. With the same exploratory assumption as Neu and Olkhovskaya (2021), they show a regret bound of $O(\sqrt{K})$ with a simulator and $O(K^{6/7})$ otherwise. Two very recent works (Dai et al., 2023; Sherman et al., 2023) further generalize the setting by removing the exploratory assumption. These two works independently provide $O(K^{8/9})$ and $O(K^{6/7})$ regret, respectively, for this setting when no simulator is available.

Linear mixture MDPs are another popular linear function approximation model, where the transition is a mixture of linear functions. Considering adversarial losses, Cai et al. (2020); He et al. (2022b) study the unknown transition but full-information feedback setting, in which case the learning agent can observe the loss of all actions in each state. Zhao et al. (2023) consider general bandit feedback in this setting and show that the regret in this harder environment is also $O(\sqrt{K})$. Their modeling does not assume structure on the loss function, which introduces a dependence on $S$ and $A$ in the regret, where $S$ and $A$ are the numbers of states and actions, respectively.

3 Preliminaries

In this work, we study the episodic adversarial Markov decision process (MDP) denoted by $\mathcal{M}(\mathcal{S},\mathcal{A},H,\{P_{h}\}_{h=1}^{H},\{\ell_{k}\}_{k=1}^{K})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $H$ is the horizon of each episode, $P_{h}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to[0,1]$ is the transition kernel of step $h$ with $P_{h}(s^{\prime}\mid s,a)$ representing the transition probability from $s$ to $s^{\prime}$ by taking action $a$ at step $h$, and $\ell_{k}$ is the loss function at episode $k$. We denote by $\pi_{k}:=\{\pi_{k,h}\}_{h=1}^{H}$ the learner's policy at each episode $k$, where $\pi_{k,h}$ is a mapping from each state to a distribution over the action space. Let $\pi_{k,h}(a\mid s)$ represent the probability of selecting action $a$ at state $s$ by following policy $\pi_{k}$ at step $h$.

The learner interacts with the MDP $\mathcal{M}$ for $K$ episodes. At each episode $k=1,2,\ldots,K$, the environment (adversary) first chooses the loss function $\ell_{k}:=\{\ell_{k,h}\}_{h=1}^{H}$, which may be based on the history information before episode $k$. The learner simultaneously decides its policy $\pi_{k}$. At each step $h=1,2,\ldots,H$, the learner observes the current state $s_{k,h}$, takes action $a_{k,h}$ based on $\pi_{k,h}$, and observes the loss $\ell_{k,h}(s_{k,h},a_{k,h})$. The environment transitions to the next state $s_{k,h+1}$ at the end of the step based on the transition kernel $P_{h}(\cdot\mid s_{k,h},a_{k,h})$.

The performance of a policy $\pi$ over episode $k$ can be evaluated by its value function, which is the expected cumulative loss,

V^{\pi}_{k}=\mathbb{E}\left[\sum_{h=1}^{H}\ell_{k,h}(s_{k,h},a_{k,h})\right]\,,

where the expectation is taken over the randomness in the transitions and the policy $\pi$. Denote by $\pi^{*}\in\operatorname*{argmin}_{\pi}\sum_{k=1}^{K}V^{\pi}_{k}$ the optimal policy that suffers the least expected loss over the $K$ episodes. The objective of the learner is to minimize the cumulative regret,

\mathrm{Reg}(K)=\sum_{k=1}^{K}\left(V_{k}^{\pi_{k}}-V_{k}^{\pi^{*}}\right)\,, \qquad (1)

which is defined as the cumulative difference between the value of the executed policies and that of the optimal policy $\pi^{*}$.

A linear adversarial MDP is an MDP where both the transition kernel and the loss function depend linearly on a feature mapping. We give a formal definition as follows.

Definition 1 (Linear MDP with adversarial losses).

The MDP $\mathcal{M}(\mathcal{S},\mathcal{A},H,\{P_{h}\}_{h=1}^{H},\{\ell_{k}\}_{k=1}^{K})$ is a linear MDP if there is a known feature mapping $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^{d}$ and unknown $\mathbb{R}^{d}$-valued measures $\mu_{h}$ over $\mathcal{S}$ such that the transition probability at each state-action pair $(s,a)$ satisfies

P_{h}\left(\cdot\mid s,a\right)=\langle\phi(s,a),\mu_{h}\rangle\,.

Further, for any episode $k$ and step $h$, there exists an unknown loss vector $\theta_{k,h}\in\mathbb{R}^{d}$ such that

\ell_{k,h}(s,a)=\langle\phi(s,a),\theta_{k,h}\rangle

for all state-action pairs $(s,a)$. Without loss of generality, we assume $\left\|\phi(s,a)\right\|_{2}\leq 1$ for all $(s,a)$, and $\left\|\mu_{h}(\mathcal{S})\right\|_{2}=\left\|\int_{s\in\mathcal{S}}\mathrm{d}\mu_{h}(s)\right\|_{2}\leq\sqrt{d}$, $\left\|\theta_{k,h}\right\|_{2}\leq\sqrt{d}$ for any $k,h$.
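As a quick illustration of Definition 1, the tabular case is recovered by one-hot features: with $\phi(s,a)$ the indicator of the pair $(s,a)$, any tabular transition kernel is linear in $\phi$. The toy snippet below checks this identity numerically; the sizes and the random kernel are placeholders introduced only for illustration, not part of the paper.

```python
# Toy check: a tabular MDP is a linear MDP with one-hot features (assumed illustrative sizes).
import numpy as np

S, A = 3, 2
d = S * A
rng = np.random.default_rng(0)
P_h = rng.dirichlet(np.ones(S), size=(S, A))       # placeholder tabular kernel P_h(s' | s, a)

def phi(s, a):
    e = np.zeros(d)
    e[s * A + a] = 1.0                             # one-hot feature for the pair (s, a)
    return e

mu_h = P_h.reshape(d, S)                           # rows of mu_h indexed by the pair (s, a)
s, a = 1, 0
assert np.allclose(phi(s, a) @ mu_h, P_h[s, a])    # P_h(. | s, a) = <phi(s, a), mu_h(.)>
```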

Given a policy $\pi$, its feature visitation vector at step $h$ is the expected feature mapping this policy encounters at step $h$: $\phi_{\pi,h}=\mathbb{E}_{\pi}\left[\phi(s_{h},a_{h})\right]$. With this definition, the expected loss that the policy $\pi$ receives at step $h$ of episode $k$ can be written as

\ell_{k,h}^{\pi}:=\mathbb{E}_{\pi}\left[\ell_{k,h}\left(s_{k,h},a_{k,h}\right)\right]=\langle\phi_{\pi,h},\theta_{k,h}\rangle\,, \qquad (2)

and the value of policy $\pi$ can be expressed as

V_{k}^{\pi}=\sum_{h=1}^{H}\ell_{k,h}^{\pi}=\sum_{h=1}^{H}\langle\phi_{\pi,h},\theta_{k,h}\rangle\,. \qquad (3)

For simplicity, we also define $\phi_{\pi,h}(s)=\mathbb{E}_{a\sim\pi_{h}(\cdot\mid s)}\left[\phi(s,a)\right]$ to represent the expected feature visitation of state $s$ at step $h$ by following $\pi$.

To ensure that the linear MDP is learnable, we make the following exploratory assumption, which is analogous to assumptions made in previous works that study the function approximation setting (Neu and Olkhovskaya, 2021; Luo et al., 2021a, b; Hao et al., 2021; Agarwal et al., 2021). For any policy $\pi$, define $\mathbf{\Lambda}_{\pi,h}:=\mathbb{E}_{\pi}\left[\phi(s_{h},a_{h})\phi(s_{h},a_{h})^{\top}\right]$ as the expected covariance of $\pi$ at step $h$. Let $\lambda_{\min,h}^{*}=\sup_{\pi}\lambda_{\min}(\mathbf{\Lambda}_{\pi,h})$, where $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue of a matrix, and $\lambda_{\min}^{*}=\min_{h}\lambda_{\min,h}^{*}$. To ensure that the linear MDP is learnable, we assume that there exists a policy that generates full-rank covariance matrices.

Assumption 1 (Exploratory assumption).

$\lambda_{\min}^{*}>0$.

When the assumption is reduced to the tabular setting, where $\phi(s,a)$ is a basis vector in $\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$, this assumption becomes $\mu_{\min}:=\max_{\pi}\min_{s,a}\mu^{\pi}(s,a)>0$, where $\mu^{\pi}(s,a)$ is the probability of visiting the state-action pair $(s,a)$ under the trajectory induced by $\pi$. It simply means that there exists a policy with positive visitation probability for all state-action pairs, which is standard (Li et al., 2020). In the linear setting, it guarantees that every direction in $\mathbb{R}^{d}$ can be visited by some policy.

We point out that this assumption is weaker than the exploratory assumptions used in previous works (Neu and Olkhovskaya, 2021; Luo et al., 2021a, b), in which an exploratory policy $\pi_{0}$ satisfying $\lambda_{\min}(\mathbf{\Lambda}_{\pi_{0},h})\geq\lambda_{0}>0$ for all $h\in[H]$ is given as input to the algorithm. Since finding such an exploratory policy is extremely difficult, our assumption, which only requires the transition of the MDP itself to satisfy this constraint, is preferable.
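To make the quantity in Assumption 1 concrete, the following minimal sketch estimates the covariance $\mathbf{\Lambda}_{\pi,h}$ of a fixed policy at a fixed step from sampled features and reports its smallest eigenvalue; the randomly generated features below are a placeholder for real rollout data, so the snippet only illustrates the computation, not the assumption itself.

```python
# Minimal sketch: empirical covariance Lambda_{pi,h} and its smallest eigenvalue.
import numpy as np

def min_eig_covariance(features):
    """features: (num_rollouts, d) array of sampled phi(s_h, a_h) at a fixed step h."""
    Lambda_hat = features.T @ features / len(features)   # empirical covariance Lambda_{pi,h}
    return np.linalg.eigvalsh(Lambda_hat).min()          # smallest eigenvalue

rng = np.random.default_rng(0)
d, n = 4, 1000
features = rng.normal(size=(n, d)) / np.sqrt(d)          # placeholder for rollout features
print(min_eig_covariance(features))                      # strictly positive: this step is explorable
```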

4 Algorithm

In this section, we introduce the proposed algorithm (Algorithm 1). The algorithm takes a finite policy class $\Pi$ and the feature visitation estimators $\{\hat{\phi}_{\pi,h}:\pi\in\Pi,\,h\in[H]\}$ as input and selects a policy $\pi_{k}\in\Pi$ in each episode $k\in[K]$. The acquisition of $\Pi$ and $\{\hat{\phi}_{\pi,h}:\pi\in\Pi,\,h\in[H]\}$ will be introduced in Section 4.1 and Section 4.2, respectively.

Recall that the loss value $\ell_{k,h}(s,a)$ is an inner product between the feature $\phi(s,a)$ and the loss vector $\theta_{k,h}$. Exploiting this structure, we use ridge regression to estimate the unknown loss vector. To be specific, in each episode, after executing policy $\pi_{k}$, the observed loss value is used to compute the loss vector estimate $\hat{\theta}_{k,h}$, from which the value estimate $\hat{V}_{k}^{\pi}$ can be formed for each policy $\pi$ (line 9). We then adopt an optimistic strategy, computing an optimistic estimate of each policy $\pi$'s value (line 10). Based on the optimistic value, the exploitation probability $w(\pi)$ of a policy $\pi$ follows an EXP3-type update rule (line 11). To better explore each dimension of the linear space, the final selection probability is defined as a weighted combination of the exploitation probability and an exploration probability $g(\pi)$, where the weight $\gamma$ is an input parameter (line 7). Here the exploration probability $g_{h}:=\{g_{h}(\pi)\}_{\pi\in\Pi}$ is derived by solving the G-optimal design problem to minimize the uncertainty over all policies, i.e.,

g_{h}\in\operatorname*{argmin}_{p\in\Delta(\Pi)}\max_{\pi\in\Pi}\left\|\phi_{\pi,h}\right\|^{2}_{\mathbf{V}_{h}(p)^{-1}}

where $\mathbf{V}_{h}(p)=\sum_{\pi\in\Pi}p(\pi)\phi_{\pi,h}\phi_{\pi,h}^{\top}$.
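One standard way to compute (an approximation of) such a G-optimal design is the Frank-Wolfe iteration for the equivalent D-optimal design problem (Kiefer-Wolfowitz theorem). The sketch below follows this classical recipe over the estimated feature visitations; it is an illustrative choice of solver, with an added regularizer for numerical stability, and not necessarily the routine used in the paper.

```python
# Minimal sketch: approximate G-optimal design over the rows of Phi via Frank-Wolfe.
import numpy as np

def g_optimal_design(Phi, n_iters=1000, reg=1e-8):
    """Phi: (n_policies, d) rows phi_hat_{pi,h}. Returns a design p over policies."""
    n, d = Phi.shape
    p = np.full(n, 1.0 / n)                               # start from the uniform design
    for _ in range(n_iters):
        V = (p[:, None] * Phi).T @ Phi + reg * np.eye(d)  # V_h(p) = sum_pi p(pi) phi phi^T
        Vinv = np.linalg.inv(V)
        scores = np.einsum('ni,ij,nj->n', Phi, Vinv, Phi)  # ||phi_pi||^2_{V(p)^{-1}}
        j = int(np.argmax(scores))
        if scores[j] <= d + 1e-9:                         # Kiefer-Wolfowitz: max score d means optimal
            break
        step = (scores[j] / d - 1.0) / (scores[j] - 1.0)   # exact line-search step for D-design
        p = (1 - step) * p
        p[j] += step
    return p
```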

If the input $\hat{\phi}_{\pi,h}$ equals the true feature visitation $\phi_{\pi,h}$, we can ensure that the regret of the algorithm, compared with the optimal policy in $\Pi$, is upper bounded. It then suffices to bound the additional regret caused by the sub-optimality of the best policy in $\Pi$ and by the bias of the feature visitation estimators, which will be discussed in the following sections.

Algorithm 1 GEOMETRICHEDGE for Linear Adversarial MDP Policies (GLAP)
1: Input: policy class $\Pi$ with feature visitation estimators $\{\hat{\phi}_{\pi,h}\}_{h=1}^{H}$ for any $\pi\in\Pi$; confidence $\delta$; exploration parameter $\gamma\leq 1/2$
2: Initialize: $\forall\pi\in\Pi,\ w_{1}(\pi)=1$, $W_{1}=|\Pi|$; $\eta=\frac{\gamma}{dH^{2}}$
3: for $h=1,2,\cdots,H$ do
4:   Compute the G-optimal design $g_{h}(\pi)$ on the set of feature visitations $\{\hat{\phi}_{\pi,h}:\pi\in\Pi\}$. Denote $g(\pi)=\frac{1}{H}\sum_{h=1}^{H}g_{h}(\pi)$
5: end for
6: for each episode $k=1,2,\ldots,K$ do
7:   Compute the probabilities for all policies $\pi\in\Pi$:
     $p_{k}(\pi)=(1-\gamma)\frac{w_{k}(\pi)}{W_{k}}+\gamma g(\pi)$
8:   Select policy $\pi_{k}\sim p_{k}$ and observe losses $\ell_{k,h}(s_{k,h},a_{k,h})$ for all $h\in[H]$
9:   Calculate the loss vector and value function estimators:
     $\hat{\theta}_{k,h}=\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi_{k},h}\ell_{k,h}(s_{k,h},a_{k,h})\,,\quad\hat{\ell}_{k,h}^{\pi}=\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}\,,\quad\hat{V}_{k}^{\pi}=\sum_{h=1}^{H}\hat{\ell}_{k,h}^{\pi}\,,$
     where $\hat{\Sigma}_{k,h}=\sum_{\pi}p_{k}(\pi)\hat{\phi}_{\pi,h}\hat{\phi}_{\pi,h}^{\top}$
10:  Compute the optimistic estimate of the value of each policy:
     $\tilde{V}_{k}^{\pi}=\sum_{h=1}^{H}\left(\hat{\ell}_{k,h}^{\pi}-2\hat{\phi}_{\pi,h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi,h}\sqrt{\frac{H\log\left(\frac{1}{\delta}\right)}{dK}}\right)$
11:  Update the selection probability using the loss estimators:
     $\forall\pi\in\Pi,\ w_{k+1}(\pi)=w_{k}(\pi)\exp\left(-\eta\tilde{V}_{k}^{\pi}\right)\,,\quad W_{k+1}=\sum_{\pi\in\Pi}w_{k+1}(\pi)$
12: end for
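For concreteness, the following minimal sketch implements one episode of the update in lines 7-11 of Algorithm 1, assuming the feature visitation estimates and the exploration design (summing to one) are given as arrays. The uniform random losses stand in for the environment feedback, and the array shapes and names are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch: one episode of the GLAP update (Algorithm 1, lines 7-11).
import numpy as np

def glap_episode(phi_hat, g, w, gamma, eta, delta, K, rng):
    """phi_hat: (n_policies, H, d) estimated visitations; g, w: (n_policies,). Returns new weights."""
    n, H, d = phi_hat.shape
    p = (1 - gamma) * w / w.sum() + gamma * g                  # line 7: mix exploitation/exploration
    k_idx = rng.choice(n, p=p)                                 # line 8: sample and execute pi_k
    losses = rng.uniform(0, 1, size=H)                         # placeholder for observed ell_{k,h}
    V_tilde = np.zeros(n)
    for h in range(H):
        Sigma = (p[:, None, None] * np.einsum('ni,nj->nij', phi_hat[:, h], phi_hat[:, h])).sum(0)
        Sigma_inv = np.linalg.pinv(Sigma)
        theta_hat = Sigma_inv @ phi_hat[k_idx, h] * losses[h]  # line 9: loss-vector estimate
        ell_hat = phi_hat[:, h] @ theta_hat                    # estimated per-policy step losses
        bonus = np.einsum('ni,ij,nj->n', phi_hat[:, h], Sigma_inv, phi_hat[:, h])
        V_tilde += ell_hat - 2 * bonus * np.sqrt(H * np.log(1 / delta) / (d * K))  # line 10
    return w * np.exp(-eta * V_tilde)                          # line 11: exponential weights
```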

4.1 Policy Construction

In this subsection, we introduce how to construct a finite policy set $\Pi$ such that the real optimal policy $\pi^{*}$ can be approximated by elements in $\Pi$. The policy construction method mainly borrows from Appendix A.3 in Wagenmaker and Jamieson (2022), but with refined analysis for the adversarial setting.

We consider the linear softmax policy class. Specifically, given a parameter $w^{\pi}\in\mathbb{R}^{d\times H}$, the induced policy $\pi$ selects action $a\in\mathcal{A}$ at state $s$ with probability

\pi_{h}\left(a|s\right)=\frac{\exp\left(\eta\langle\phi\left(s,a\right),w_{h}^{\pi}\rangle\right)}{\sum_{a^{\prime}}\exp\left(\eta\langle\phi\left(s,a^{\prime}\right),w_{h}^{\pi}\rangle\right)}\,.

The advantage of this form of policy class is that it satisfies a Lipschitz property, i.e., the difference between the values of the induced policies can be upper bounded by the difference between the parameters $w$. Based on this observation, by constructing a parameter covering $\mathcal{W}$ over a $d\times H$-dimensional unit ball, we can ensure that the parameter of the optimal policy $w^{*}$ can be approximated by a parameter $w\in\mathcal{W}$, i.e.,

\sum_{h}\left\|w_{h}-w_{h}^{*}\right\|_{2}\leq\epsilon\,.

Further, based on the Lipschitz property, the policy induced by $w$ has a value close to that of the optimal policy.
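A minimal sketch of sampling from this linear softmax policy class is given below, assuming the per-action features $\phi(s,a)$ at the current state are available as an array; the dimensions and the random parameter in the usage lines are placeholders.

```python
# Minimal sketch: sampling an action from the linear softmax policy pi_h(a | s).
import numpy as np

def softmax_policy_sample(phi_sa, w_h, eta, rng):
    """phi_sa: (num_actions, d) features phi(s, a) at the current state; w_h: (d,) parameter."""
    logits = eta * (phi_sa @ w_h)                  # eta * <phi(s, a), w_h> for every action
    logits -= logits.max()                         # numerical stability before exponentiation
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over actions
    return rng.choice(len(probs), p=probs), probs

rng = np.random.default_rng(0)
phi_sa = rng.normal(size=(5, 8))                   # placeholder features: 5 actions, d = 8
a, probs = softmax_policy_sample(phi_sa, w_h=rng.normal(size=8), eta=1.0, rng=rng)
```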

The informal result is shown in the following lemma.

Lemma 1.

There exists a finite policy class $\Pi$ with log-cardinality $\log|\Pi|=\mathcal{O}\left(dH^{2}\log K\right)$, such that the regret compared with the optimal policy in $\Pi$ is close to the global regret, i.e.,

\mathrm{Reg}(K):=\sum_{k=1}^{K}\left(V_{k}^{\pi_{k}}-V_{k}^{\pi^{*}}\right)\leq\sum_{k=1}^{K}V_{k}^{\pi_{k}}-\min_{\pi\in\Pi}\sum_{k=1}^{K}V_{k}^{\pi}+1:=\mathrm{Reg}(K;\Pi)+1\,.

The detailed analysis can be found in Appendix C.

4.2 Feature Visitation Estimation

In this subsection, we discuss how to deal with the unavailability of the true feature visitations of policies. Our approach is to estimate the feature visitations $\{\phi_{\pi,h}\}_{h=1}^{H}$ of each policy $\pi$ and use these estimates as input to Algorithm 1. The feature estimation process is described in Algorithm 2, which we call the feature visitation estimation oracle.

Algorithm 2 Feature visitation estimation oracle
1: Input: policy set $\Pi$, tolerance $\epsilon\leq 1/2$, confidence $\delta$
2: Initialize: $\beta=16H^{2}\log\frac{4H^{2}d|\Pi|}{\delta}$, $\hat{\phi}_{\pi,1}=\mathbb{E}_{a_{1}\sim\pi_{1}(\cdot|s_{1})}\left[\phi(s_{1},a_{1})\right]$
3: for $h=1,2,\dots,H-1$ do
4:   Run the procedure in Theorem 2 with parameters $\epsilon_{\exp}\leftarrow\frac{\epsilon^{2}}{d^{3}\beta}$, $\delta\leftarrow\frac{\delta}{2H}$, $\underline{\lambda}=\log\frac{4H^{2}d|\Pi|}{\delta}$, $\Phi\leftarrow\Phi_{h}:=\{\hat{\phi}_{\pi,h}:\pi\in\Pi\}$, $\gamma_{\Phi}=\frac{1}{2\sqrt{d}}$; denote the returned data as $\{(s_{h,\tau},a_{h,\tau},s_{h+1,\tau})\}_{\tau=1}^{K_{h}}$, with $K_{h}$ the total number of episodes run, and the covariates
     $\mathbf{\Lambda}_{h}\leftarrow\sum_{\tau=1}^{K_{h}}\phi\left(s_{h,\tau},a_{h,\tau}\right)\phi\left(s_{h,\tau},a_{h,\tau}\right)^{\top}+1/d\cdot I$
5:   for $\pi\in\Pi$ do
6:     $\hat{\phi}_{\pi,h+1}\leftarrow\left(\sum_{\tau=1}^{K_{h}}\phi_{\pi,h+1}\left(s_{h+1,\tau}\right)\phi_{h,\tau}^{\top}\mathbf{\Lambda}_{h}^{-1}\right)\hat{\phi}_{\pi,h}$
7:   end for
8: end for
9: return $\Phi:=\{\hat{\phi}_{\pi,h}:\pi\in\Pi,\,h=1,2,\cdots,H\}$

For any policy $\pi$, we can first decompose its feature visitation at step $h$ as

\phi_{\pi,h}=\mathbb{E}_{\pi}\left[\phi(s_{h},a_{h})\right]=\int\phi_{\pi,h}(s)\,d\mu_{h-1}(s)\,\mathbb{E}_{\pi}\left[\phi(s_{h-1},a_{h-1})\right]:=\mathcal{T}_{\pi,h}\phi_{\pi,h-1}=\mathcal{T}_{\pi,h}\mathcal{T}_{\pi,h-1}\cdots\mathcal{T}_{\pi,2}\phi_{\pi,1}\,,

where $\mathcal{T}_{\pi,h}=\int\phi_{\pi,h}(s)\,d\mu_{h-1}(s)$ is the transition operator and $\phi_{\pi,1}:=\mathbb{E}_{\pi}[\phi(s_{1},a_{1})]$ can be directly computed from policy $\pi$. Thus, to estimate $\phi_{\pi,h}$ for each step $h$, we need to estimate the transition operator $\mathcal{T}_{\pi,h}$.

We use the least squares method to estimate the transition operator. Suppose we have currently collected $K$ trajectories; then the estimate is given by

\hat{\mathcal{T}}\in\operatorname*{argmin}_{\mathcal{T}}\sum_{\tau=1}^{K}\Big\|\mathcal{T}\phi(s_{h-1,\tau},a_{h-1,\tau})-\sum_{a}\pi(a\mid s_{h,\tau})\phi(s_{h,\tau},a)\Big\|^{2}+\left\|\mathcal{T}\right\|^{2}\,,

and the closed-form solution is

\Lambda_{h-1}=\sum_{\tau=1}^{K}\phi(s_{h-1,\tau},a_{h-1,\tau})\phi(s_{h-1,\tau},a_{h-1,\tau})^{\top}+\lambda I\,,
\hat{\mathcal{T}}_{\pi,h}=\Lambda_{h-1}^{-1}\sum_{\tau=1}^{K}\phi(s_{h-1,\tau},a_{h-1,\tau})\phi_{\pi,h}(s_{h,\tau})\,,

where $\phi_{\pi,h}(s_{h,\tau})=\sum_{a}\pi(a\mid s_{h,\tau})\phi(s_{h,\tau},a)$.
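For illustration, the sketch below computes a ridge-regression operator estimate in the form used in line 6 of Algorithm 2 and composes the per-step operators to propagate a feature visitation estimate. The data arrays are assumed to have been collected beforehand, and all names are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch: ridge-regression transition operator and feature visitation propagation.
import numpy as np

def estimate_transition_operator(phi_prev, phi_next, lam=1.0):
    """phi_prev[t] = phi(s_{h-1,t}, a_{h-1,t}); phi_next[t] = phi_{pi,h}(s_{h,t}); both (K, d)."""
    d = phi_prev.shape[1]
    Lambda = phi_prev.T @ phi_prev + lam * np.eye(d)      # regularized covariance Lambda_{h-1}
    # Operator mapping previous-step visitation to next-step visitation,
    # T_hat = (sum_t phi_next[t] phi_prev[t]^T) Lambda^{-1}, as in Algorithm 2 (line 6).
    return (phi_next.T @ phi_prev) @ np.linalg.inv(Lambda)

def propagate_feature_visitation(phi_pi_1, operators):
    """Compose per-step operator estimates: phi_hat_{pi,h} = T_h ... T_2 phi_{pi,1}."""
    phi_hat = np.asarray(phi_pi_1, dtype=float).copy()
    for T_hat in operators:                               # operators[i] estimates step i + 2
        phi_hat = T_hat @ phi_hat
    return phi_hat
```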

In order to guarantee the accuracy of the estimated feature visitations, we provide an accuracy guarantee for the estimated transition operator. The idea is to collect enough data along each dimension of the feature space. For the trajectory collection design, we adopt the reward-free technique in Wagenmaker and Jamieson (2022) and restate it as a standalone feature visitation oracle. Algorithm 2 satisfies the following sample complexity and accuracy guarantees.

Lemma 2.

Algorithm 2 runs for at most $\tilde{\mathcal{O}}\left(\frac{d^{4}H^{3}}{\epsilon^{2}}\right)$ episodes and returns feature visitation estimates that satisfy

\left\|\hat{\phi}_{\pi,h}-\phi_{\pi,h}\right\|_{2}\leq\epsilon/\sqrt{d}\,,

for any policy $\pi\in\Pi$ and step $h\in[H]$, with probability at least $1-\delta$.

Since the regret of each episode is at most $H$, the total regret incurred in this process of estimating feature visitations is of order $\tilde{\mathcal{O}}\left(d^{4}H^{4}/\epsilon^{2}\right)$. The detailed analysis and results are in Appendix B.

5 Analysis

This section provides the regret guarantees for the proposed algorithm as well as the proof sketch.

We consider Algorithm 1 with the policy set constructed in Section 4.1 and the feature visitations of policies estimated in Section 4.2 as input. Suppose we run Algorithm 1 for $K$ rounds; the regret compared with any fixed policy $\pi\in\Pi$ over these $K$ rounds can be bounded as below.

Lemma 3.

For any policy $\pi\in\Pi$, with probability at least $1-\delta$,

\sum_{k=1}^{K}\left(V_{k}^{\pi_{k}}-V_{k}^{\pi}\right)=\mathcal{O}\left(\underbrace{H\sqrt{dKH\log\frac{|\Pi|}{\delta}}+\frac{dH^{2}}{\gamma}\log\left(\frac{|\Pi|}{\delta}\right)+\gamma KH}_{\text{standard regret bound of EXP3}}+\underbrace{\frac{dH^{2}}{\gamma}\epsilon K}_{\text{feature estimation bias}}\right)\,,

where $\epsilon$ is the tolerance of the estimated feature bias in Lemma 2.

Recall that when the policy set is constructed as in Section 4.1, the difference between the global regret defined in Equation (1) and the regret compared with the best policy in $\Pi$ is just a constant. So the global regret can also be bounded as in Lemma 3.

Similar to previous works on linear adversarial MDPs (Luo et al., 2021a, b; Sherman et al., 2023; Dai et al., 2023) that discuss whether a transition simulator is available, we define the simulator that may help in the following assumption. Note that this simulator is weaker than those of Luo et al. (2021a, b); Dai et al. (2023), because their simulator can generate a next state given any state-action pair.

Assumption 2 (Simulator).

The learning agent has access to a simulator such that, when given a policy $\pi$, it returns a trajectory based on the MDP and policy $\pi$.

If the learning agent has access to a simulator as described in Assumption 2, then the feature estimation process in Section 4.2 can be regarded as regret-free and the final regret is as shown in Lemma 3. Otherwise, there is an additional $\tilde{\mathcal{O}}(d^{4}H^{4}/\epsilon^{2})$ regret term. Balancing the choices of $\gamma$ and $\epsilon$ yields the following regret upper bound.

Theorem 1.

With the constructed policy set in Section 4.1 and the feature estimation process in Section 4.2, the cumulative regret satisfies

\mathrm{Reg}(K)=\tilde{\mathcal{O}}\left(d^{7/5}H^{12/5}K^{4/5}\right)

with probability at least $1-\delta$. Additionally, if a simulator as in Assumption 2 is accessible, the regret can be improved to

\mathrm{Reg}(K)=\tilde{\mathcal{O}}\left(\sqrt{d^{2}H^{5}K}\right)

with probability at least $1-\delta$.

The proof of the main results is deferred to Appendix A.
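As a rough sanity check of the $K^{4/5}$ rate, suppressing $d$, $H$, logarithmic factors, and constants, the regret of Lemma 3 plus the cost of feature estimation balances as

\mathrm{Reg}(K)\lesssim\gamma K+\frac{\epsilon K}{\gamma}+\frac{1}{\epsilon^{2}}\,;\qquad\epsilon=\gamma^{2}\ \Rightarrow\ \mathrm{Reg}(K)\lesssim\gamma K+\gamma^{-4}\,;\qquad\gamma=K^{-1/5}\ \Rightarrow\ \mathrm{Reg}(K)\lesssim K^{4/5}\,.

With a simulator, the $1/\epsilon^{2}$ term does not enter the regret, $\epsilon$ can be taken as small as $O(1/K)$, and the remaining terms of Lemma 3 give the $\tilde{\mathcal{O}}(\sqrt{K})$ rate.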

Discussions

Since a main contribution of our work is to improve the results of Neu and Olkhovskaya (2021); Luo et al. (2021a) under an exploratory assumption, we present more insights into the difference between our approach and the approaches in these two works.

As shown in Table 1, our $\tilde{\mathcal{O}}(K^{4/5})$ result explicitly improves the result in Luo et al. (2021a) in both the dependence on $K$ and the dependence on $\lambda$. Recall that $\lambda$ is the minimum eigenvalue in the exploratory assumption. In real applications, for each direction in the linear space, it is reasonable to expect that some policy visits that direction; therefore, by mixing these policies one can ensure the exploratory assumption. However, there is no guarantee on the value of $\lambda$, and when $\lambda$ is very small, the removal of the $1/\lambda$ dependence is significant.

Technically, our new view of linear MDPs could be general enough to be useful in other linear settings. In Section 4.1, we only present a simpler version of the policy construction to convey the intuition. We could vary this construction procedure by further placing a finite action covering over the action space to deal with the infinite action space setting. More details can be found in Appendix C. Meanwhile, Neu and Olkhovskaya (2021) require both state and action spaces to be finite, and Luo et al. (2021a, b); Dai et al. (2023); Sherman et al. (2023) can only deal with finite action spaces.

When a transition simulator is available, the number of calls to the simulator (query complexity) is also an important metric for the algorithm's efficiency. According to Lemma 2 and the analysis in Appendix A, we only need to call the simulator $O(K^{2})$ times to achieve an $O(\sqrt{K})$ regret (Theorem 1) by choosing $\epsilon=O(1/K)$ in Lemma 3. This is much more favorable than Luo et al. (2021a), which needs to call the simulator $O(KAH)^{O(H)}$ times, where $A$ is the action space size.

Compared with Luo et al. (2021a), we improve on their results with a better regret bound, a weaker exploratory assumption, a weaker simulator, and fewer queries (if a simulator is used).

6 Conclusion

In this paper, we investigate linear adversarial MDPs with bandit feedback. We propose a new view of linear MDPs, in which optimizing policies in a linear MDP can be regarded as a linear optimization problem. Based on this insight, we propose an algorithm that constructs a set of policies and maintains a probability distribution over which policy to execute. With an exploratory assumption, our algorithm yields the first $\tilde{\mathcal{O}}(K^{4/5})$ regret without access to a simulator. Compared to the results in Luo et al. (2021a), our algorithm enjoys a weaker assumption, a better regret bound with respect to both $K$ and $\lambda$, and a weaker simulator with fewer queries if it uses one.

Our view contributes a new approach to linear MDPs, which could be of independent interest. We also demonstrated how our algorithm generalizes to infinite action spaces under this view. Future applications of this technique could include other adversarial settings, such as when the loss function is corrupted up to a budget, and robust linear MDPs where the transition kernel can change over episodes.

References

  • Afsar et al. (2022) Mย Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. ACM Computing Surveys, 55(7):1โ€“38, 2022.
  • Agarwal et al. (2021) Naman Agarwal, Syomantak Chaudhuri, Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Online target Q-learning with reverse experience replay: Efficiently finding the optimal policy for linear mdps. arXiv preprint arXiv:2110.08440, 2021.
  • Azar et al. (2017) Mohammadย Gheshlaghi Azar, Ian Osband, and Rรฉmi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263โ€“272. PMLR, 2017.
  • Bradtke and Barto (1996) Stevenย J Bradtke and Andrewย G Barto. Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1-3):33โ€“57, 1996.
  • Cai et al. (2020) Qiย Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283โ€“1294. PMLR, 2020.
  • Chen et al. (2021) Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Minimax regret for stochastic shortest path with adversarial costs and known transition. In Conference on Learning Theory, pages 1180โ€“1215. PMLR, 2021.
  • Chen et al. (2022) Liyu Chen, Haipeng Luo, and Aviv Rosenberg. Policy optimization for stochastic shortest path. In Conference on Learning Theory, pages 982โ€“1046. PMLR, 2022.
  • Dai et al. (2023) Yan Dai, Haipeng Luo, Chen-Yu Wei, and Julian Zimmert. Refined regret for adversarial mdps with linear function approximation. arXiv preprint arXiv:2301.12942, 2023.
  • Hao et al. (2021) Botao Hao, Tor Lattimore, Csaba Szepesvari, and Mengdi Wang. Online sparse reinforcement learning. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, pages 316โ€“324. PMLR, 2021.
  • He et al. (2021) Jiafan He, Dongruo Zhou, and Quanquan Gu. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning, pages 4171โ€“4180. PMLR, 2021.
  • He et al. (2022a) Jiafan He, Heyang Zhao, Dongruo Zhou, and Quanquan Gu. Nearly minimax optimal reinforcement learning for linear markov decision processes. arXiv preprint arXiv:2212.06132, 2022.
  • He et al. (2022b) Jiafan He, Dongruo Zhou, and Quanquan Gu. Near-optimal policy optimization algorithms for learning adversarial linear mixture mdps. In International Conference on Artificial Intelligence and Statistics, pages 4259โ€“4280. PMLR, 2022.
  • Hu et al. (2022) Pihe Hu, Yuย Chen, and Longbo Huang. Nearly minimax optimal reinforcement learning with linear function approximation. In International Conference on Machine Learning, pages 8971โ€“9019. PMLR, 2022.
  • Jin et al. (2018) Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michaelย I Jordan. Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018.
  • Jin et al. (2020a) Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial markov decision processes with bandit feedback and unknown transition. In International Conference on Machine Learning, pages 4860โ€“4869. PMLR, 2020.
  • Jin et al. (2020b) Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michaelย I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137โ€“2143. PMLR, 2020.
  • Kiran et al. (2021) Bย Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmadย A Alย Sallab, Senthil Yogamani, and Patrick Pรฉrez. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(6):4909โ€“4926, 2021.
  • Kober et al. (2013) Jens Kober, Jย Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238โ€“1274, 2013.
  • Li et al. (2020) Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Advances in Neural Information Processing Systems, 33:7031โ€“7043, 2020.
  • Lillicrap et al. (2015) Timothyย P Lillicrap, Jonathanย J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Lin et al. (2021) Yuanguo Lin, Yong Liu, Fan Lin, Lixin Zou, Pengcheng Wu, Wenhua Zeng, Huanhuan Chen, and Chunyan Miao. A survey on reinforcement learning for recommender systems. arXiv preprint arXiv:2109.10665, 2021.
  • Luo et al. (2021a) Haipeng Luo, Chen-Yu Wei, and Chung-Wei Lee. Policy optimization in adversarial mdps: Improved exploration via dilated bonuses. Advances in Neural Information Processing Systems, 34:22931โ€“22942, 2021.
  • Luo et al. (2021b) Haipeng Luo, Chen-Yu Wei, and Chung-Wei Lee. Policy optimization in adversarial mdps: Improved exploration via dilated bonuses. arXiv preprint arXiv:2107.08346, 2021.
  • Melo and Ribeiro (2007) Franciscoย S Melo and Mย Isabel Ribeiro. Q-learning with linear function approximation. In Learning Theory: 20th Annual Conference on Learning Theory, COLT 2007, San Diego, CA, USA; June 13-15, 2007. Proceedings 20, pages 308โ€“322. Springer, 2007.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Neu and Olkhovskaya (2021) Gergely Neu and Julia Olkhovskaya. Online learning in mdps with linear function approximation and bandit feedback. Advances in Neural Information Processing Systems, 34:10407โ€“10417, 2021.
  • Neu et al. (2010) Gergely Neu, Andrรกs Gyรถrgy, Csaba Szepesvรกri, etย al. The online loop-free stochastic shortest-path problem. In COLT, volume 2010, pages 231โ€“243. Citeseer, 2010.
  • Rosenberg and Mansour (2019) Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial markov decision processes. In International Conference on Machine Learning, pages 5478โ€“5486. PMLR, 2019.
  • Shani et al. (2020) Lior Shani, Yonathan Efroni, Aviv Rosenberg, and Shie Mannor. Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning, pages 8604โ€“8613. PMLR, 2020.
  • Sherman et al. (2023) Uri Sherman, Tomer Koren, and Yishay Mansour. Improved regret for efficient online reinforcement learning with linear function approximation. arXiv preprint arXiv:2301.13087, 2023.
  • Silver et al. (2016) David Silver, Aja Huang, Chrisย J Maddison, Arthur Guez, Laurent Sifre, George Van Denย Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, etย al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484โ€“489, 2016.
  • Simchowitz and Jamieson (2019) Max Simchowitz and Kevinย G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. Advances in Neural Information Processing Systems, 32, 2019.
  • Sutton and Barto (2018) Richardย S Sutton and Andrewย G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Wagenmaker and Jamieson (2022) Andrew Wagenmaker and Kevin Jamieson. Instance-dependent near-optimal policy identification in linear mdps via online experiment design. arXiv preprint arXiv:2207.02575, 2022.
  • Yang and Wang (2019) Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995โ€“7004. PMLR, 2019.
  • Yang and Wang (2020) Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746โ€“10756. PMLR, 2020.
  • Yang et al. (2021) Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576โ€“1584. PMLR, 2021.
  • Yu et al. (2009) Jiaย Yuan Yu, Shie Mannor, and Nahum Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737โ€“757, 2009.
  • Zanette et al. (2020) Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, and Alessandro Lazaric. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pages 1954โ€“1964. PMLR, 2020.
  • Zhao et al. (2023) Canzhe Zhao, Ruofeng Yang, Baoxiang Wang, and Shuai Li. Learning adversarial linear mixture markov decision processes with bandit feedback and unknown transition. In International Conference on Learning Representations, 2023.
  • Zhou et al. (2021) Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pages 4532โ€“4576. PMLR, 2021.
  • Zimin and Neu (2013) Alexander Zimin and Gergely Neu. Online learning in episodic markovian decision processes by relative entropy policy search. Advances in Neural Information Processing Systems, 26, 2013.

Appendix A Analysis of Algorithm 1

In this section we present the regret analysis for Algorithm 1 and prove the final regret bounds for our main algorithm. We first state the necessary concentration bounds as lemmas and then analyze the regret, proving Lemma 3 and Theorem 1. In the following analysis, we condition on the success of the event in Theorem 17, whose probability is at least $1-\delta$. The following inequality will be used in the analysis and is stated first.

Lemma 4.

Let $\mathcal{F}_{0}\subset\mathcal{F}_{1}\subset\cdots\subset\mathcal{F}_{T}$ be a filtration and let $X_{1},X_{2},\cdots,X_{T}$ be random variables such that $X_{t}$ is $\mathcal{F}_{t}$-measurable, $\mathbb{E}\left[X_{t}|\mathcal{F}_{t-1}\right]=0$, $\left|X_{t}\right|\leq b$ almost surely, and $\sum_{t=1}^{T}\mathbb{E}\left[X_{t}^{2}|\mathcal{F}_{t-1}\right]\leq V$ for some fixed $V>0$ and $b>0$. Then, for any $\delta\in\left(0,1\right)$, we have with probability at least $1-\delta$,

\sum_{t=1}^{T}X_{t}\leq 2\sqrt{V\log\left(1/\delta\right)}+b\log\left(1/\delta\right)\,.

To start with, we give the concentration of the feature visitation estimators returned from Algorithm 2, which will be fundamental in the following analysis. Notice that $\hat{\phi}_{\pi,1}$ can be computed directly from the initial state distribution and the action distribution.

Lemma 5.

Conditioning on the success of the event in Theorem 17, we have for all episodes $k$ and steps $h$ that the feature visitation estimators $\{\hat{\phi}_{\pi,h}:\,h=1,2,\cdots,H,\ \pi\in\Pi\}$ returned by Algorithm 2 satisfy

\left|\langle\theta_{k,h},\hat{\phi}_{\pi,h}-\phi_{\pi,h}\rangle\right|\leq\epsilon\,.

Proof.

Since $\left\|\theta_{k,h}\right\|_{2}\leq\sqrt{d}$ for any $k$ and $h$, as in Definition 1, according to Theorem 17 we have

\left|\langle\theta_{k,h},\hat{\phi}_{\pi,h}-\phi_{\pi,h}\rangle\right|\leq\left\|\hat{\phi}_{\pi,h}-\phi_{\pi,h}\right\|_{2}\cdot\left\|\theta_{k,h}\right\|_{2}\leq\frac{\epsilon}{\sqrt{d}}\cdot\sqrt{d}\leq\epsilon\,.

∎

The following Lemma 6 and Lemma 7 bound the magnitudes of the loss and value estimators in line 9, using the properties of the G-optimal design computed in line 4.

Lemma 6.

$\left\|\hat{\phi}_{\pi,h}\right\|_{\hat{\Sigma}_{k,h}^{-1}}^{2}\leq\frac{dH}{\gamma}$ and $\left|\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}\right|\leq\frac{dH}{\gamma}$, for all $h$ and $\pi\in\Pi$.

Proof.

According to the properties of the G-optimal design, we have

\left\|\hat{\phi}_{\pi,h}\right\|_{\left(\sum_{\pi}g_{h}(\pi)\hat{\phi}_{\pi,h}\hat{\phi}_{\pi,h}^{\top}\right)^{-1}}^{2}\leq d\,,

and $\hat{\Sigma}_{k,h}\succeq\frac{\gamma}{H}\sum_{\pi}g_{h}(\pi)\hat{\phi}_{\pi,h}\hat{\phi}_{\pi,h}^{\top}$. Thus we have $\left\|\hat{\phi}_{\pi,h}\right\|_{\hat{\Sigma}_{k,h}^{-1}}^{2}\leq\frac{dH}{\gamma}$. So

\left|\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}\right|=\left|\hat{\phi}_{\pi,h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi_{k},h}\ell_{k,h}\left(s_{k,h},a_{k,h}\right)\right|\leq\left\|\hat{\phi}_{\pi,h}\right\|_{\hat{\Sigma}_{k,h}^{-1}}\left\|\hat{\phi}_{\pi_{k},h}\right\|_{\hat{\Sigma}_{k,h}^{-1}}\leq\frac{dH}{\gamma}\,,\quad\forall\pi\in\Pi\,.

∎

Lemma 7.

With our choice of $\eta=\frac{\gamma}{dH^{2}}$, when $K\geq L_{0}=4dH\log\left(\frac{|\Pi|}{\delta}\right)$, we have $\left|\eta\tilde{V}_{k}^{\pi}\right|\leq 1$ for every optimistic value estimator $\tilde{V}_{k}^{\pi}$.

Proof.

To verify $\left|\eta\tilde{V}_{k}^{\pi}\right|\leq 1$, notice that

\left|\tilde{V}_{k}^{\pi}\right|\leq\sum_{h=1}^{H}\left|\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}\right|+\sum_{h=1}^{H}2\hat{\phi}_{\pi,h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi,h}\sqrt{\frac{H\log(1/\delta)}{dK}}\,. \qquad (4)

By Lemma 6, we have

\left|\tilde{V}_{k}^{\pi}\right|\leq\frac{dH^{2}}{\gamma}\left(1+2\sqrt{\frac{H\log(1/\delta)}{dK}}\right)\,.

When $K\geq L_{0}=4dH\log\left(\frac{|\Pi|}{\delta}\right)$, we have $\left|\tilde{V}_{k}^{\pi}\right|\leq\frac{2dH^{2}}{\gamma}$. Thus, our choice of $\eta=\frac{\gamma}{dH^{2}}$ satisfies this constraint. ∎

Throughout the following analysis, assuming we have run for some number of episodes $K$, we let $\left(\mathcal{F}_{k}\right)_{k=1}^{K}$ be the corresponding filtration, with $\mathcal{F}_{k}$ containing all information up to and including episode $k$. Define $\mathbb{E}_{k}\left[\cdot\right]=\mathbb{E}\left[\cdot|\mathcal{F}_{k-1}\right]$. The next lemma bounds the bias of the loss vector estimator, which in turn allows us to bound the bias of the value function estimator.

Lemma 8.

Denote by $\theta_{k,h}^{exp}=\mathbb{E}_{k}\left[\hat{\theta}_{k,h}\right]=\mathbb{E}_{k}\left[\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi_{k},h}\ell_{k,h}\left(s_{k,h},a_{k,h}\right)\right]$ the expected value of the loss vector estimator conditioned on $\mathcal{F}_{k-1}$. Then for all $\pi\in\Pi$:

\left|\langle\hat{\phi}_{\pi,h},\theta_{k,h}^{\exp}\rangle-\langle\phi_{\pi,h},\theta_{k,h}\rangle\right|\leq\left(\frac{dH}{\gamma}+1\right)\epsilon\leq\frac{2dH}{\gamma}\epsilon\,.

As a result, we also have $\left|\langle\hat{\phi}_{\pi,h},\theta_{k,h}^{\exp}\rangle\right|\leq\frac{2dH}{\gamma}\epsilon+1$.

Proof.

Using the tower rule of expectation, we have:

ฮธk,heโ€‹xโ€‹p=ฮฃ^k,hโˆ’1โ€‹๐”ผkโ€‹[ฯ•^ฯ€k,hโ€‹ฯ•ฯ€k,hโŠคโ€‹ฮธk,h]=ฮฃ^k,hโˆ’1โ€‹โˆ‘ฯ€โ€ฒpkโ€‹(ฯ€โ€ฒ)โ€‹ฯ•^ฯ€โ€ฒ,hโ€‹ฯ•ฯ€โ€ฒ,hโŠคโ€‹ฮธk,h.\theta_{k,h}^{exp}=\hat{\Sigma}_{k,h}^{-1}\mathbb{E}_{k}\left[\hat{\phi}_{\pi_{k},h}\phi_{\pi_{k},h}^{\top}\theta_{k,h}\right]=\hat{\Sigma}_{k,h}^{-1}\sum_{\pi^{\prime}}p_{k}\left(\pi^{\prime}\right)\hat{\phi}_{\pi^{\prime},h}\phi_{\pi^{\prime},h}^{\top}\theta_{k,h}\leavevmode\nobreak\ .

Thus,

|โŸจฯ•^ฯ€,h,ฮธk,heโ€‹xโ€‹pโŸฉโˆ’โŸจฯ•^ฯ€,h,ฮธk,hโŸฉ|\displaystyle\left|\langle\hat{\phi}_{\pi,h},\theta_{k,h}^{exp}\rangle-\langle\hat{\phi}_{\pi,h},\theta_{k,h}\rangle\right| =|ฯ•^ฯ€,hโŠคโ€‹ฮฃ^k,hโˆ’1โ€‹โˆ‘ฯ€โ€ฒpkโ€‹(ฯ€โ€ฒ)โ€‹ฯ•^ฯ€โ€ฒ,hโ€‹(ฯ•ฯ€โ€ฒ,hโŠคโˆ’ฯ•^ฯ€โ€ฒ,hโŠค)โ€‹ฮธk,h|\displaystyle=\left|\hat{\phi}_{\pi,h}^{\top}\hat{\Sigma}_{k,h}^{-1}\sum_{\pi^{\prime}}p_{k}\left(\pi^{\prime}\right)\hat{\phi}_{\pi^{\prime},h}\left(\phi_{\pi^{\prime},h}^{\top}-\hat{\phi}_{\pi^{\prime},h}^{\top}\right)\theta_{k,h}\right|
โ‰คโˆ‘ฯ€โ€ฒpkโ€‹(ฯ€โ€ฒ)โ€‹|ฯ•^ฯ€,hโŠคโ€‹ฮฃ^k,hโˆ’1โ€‹ฯ•^ฯ€โ€ฒ,hโ€‹(ฯ•ฯ€โ€ฒ,hโŠคโˆ’ฯ•^ฯ€โ€ฒ,hโŠค)โ€‹ฮธk,h|\displaystyle\leq\sum_{\pi^{\prime}}p_{k}\left(\pi^{\prime}\right)\left|\hat{\phi}_{\pi,h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi^{\prime},h}\left(\phi_{\pi^{\prime},h}^{\top}-\hat{\phi}_{\pi^{\prime},h}^{\top}\right)\theta_{k,h}\right|
โ‰คโˆ‘ฯ€โ€ฒpkโ€‹(ฯ€โ€ฒ)โ€‹dโ€‹Hฮณโ€‹ฯต=dโ€‹Hฮณโ€‹ฯต,\displaystyle\leq\sum_{\pi^{\prime}}p_{k}\left(\pi^{\prime}\right)\frac{dH}{\gamma}\epsilon=\frac{dH}{\gamma}\epsilon\leavevmode\nobreak\ ,

where the last inequality is due to Lemma 5 and Lemma 6.
Moreover, we also have \left|\langle\hat{\phi}_{\pi,h}-\phi_{\pi,h},\theta_{k,h}\rangle\right|\leq\epsilon. Combining the two terms, we prove the lemma:

|โŸจฯ•^ฯ€,h,ฮธk,hexpโŸฉโˆ’โŸจฯ•ฯ€,h,ฮธk,hโŸฉ|โ‰ค|โŸจฯ•^ฯ€,h,ฮธk,heโ€‹xโ€‹pโŸฉโˆ’โŸจฯ•^ฯ€,h,ฮธk,hโŸฉ|+|โŸจฯ•^ฯ€,hโˆ’ฯ•ฯ€,h,ฮธk,hโŸฉ|โ‰ค(dโ€‹Hฮณ+1)โ€‹ฯต.\left|\langle\hat{\phi}_{\pi,h},\theta_{k,h}^{\exp}\rangle-\langle\phi_{\pi,h},\theta_{k,h}\rangle\right|\leq\left|\langle\hat{\phi}_{\pi,h},\theta_{k,h}^{exp}\rangle-\langle\hat{\phi}_{\pi,h},\theta_{k,h}\rangle\right|+\left|\langle\hat{\phi}_{\pi,h}-\phi_{\pi,h},\theta_{k,h}\rangle\right|\leq\left(\frac{dH}{\gamma}+1\right)\epsilon\leavevmode\nobreak\ .

โˆŽ
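To illustrate the tower-rule computation above, the following is a minimal Monte Carlo sketch (not part of the algorithm): it forms the one-point estimator \hat{\theta}=\hat{\Sigma}^{-1}\hat{\phi}_{\pi_{k}}\ell on a toy feature set and checks numerically that its conditional mean matches \hat{\Sigma}^{-1}\sum_{\pi}p(\pi)\hat{\phi}_{\pi}\phi_{\pi}^{\top}\theta, with a per-policy bias of order \epsilon (up to a conditioning-dependent constant). The dimensions, the sampling distribution, the perturbation level, and the noise model are illustrative assumptions; in particular, the exploration-mixture structure of p_{k} is omitted.

# A minimal Monte Carlo sketch of the tower-rule computation behind Lemma 8.
# The dimensions, feature set, perturbation level eps, and noise model are
# illustrative assumptions, not quantities taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
d, n_pi, eps = 4, 6, 1e-2
phi = rng.normal(size=(n_pi, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)          # true features phi_{pi,h}
phi_hat = phi + eps * rng.uniform(-1, 1, size=(n_pi, d))   # estimated features, error <= eps
p = np.full(n_pi, 1.0 / n_pi)                              # sampling distribution p_k(pi)
theta = rng.normal(size=d)
theta /= 2 * np.linalg.norm(theta)                         # loss vector with |<phi, theta>| <= 1

Sigma_hat = (phi_hat.T * p) @ phi_hat                      # hat{Sigma}_{k,h} = sum_pi p(pi) phi_hat phi_hat^T
Sigma_inv = np.linalg.inv(Sigma_hat)

# Monte Carlo estimate of E_k[hat{theta}] = Sigma_inv * E_k[ phi_hat_{pi_k} * loss ]
T = 500_000
idx = rng.choice(n_pi, size=T, p=p)
loss = phi[idx] @ theta + 0.05 * rng.normal(size=T)        # observed losses, mean <phi_pi, theta>
theta_mc = Sigma_inv @ (phi_hat[idx] * loss[:, None]).mean(axis=0)

theta_exp = Sigma_inv @ (phi_hat.T * p) @ phi @ theta      # closed form from the tower rule
print("Monte Carlo vs closed form:", np.max(np.abs(theta_mc - theta_exp)))
print("bias |<phi_hat, theta_exp> - <phi, theta>| (order eps):",
      np.max(np.abs(phi_hat @ theta_exp - phi @ theta)))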

Lemma 9.

Denote DEVK,ฯ€=|โˆ‘k=1KV^kฯ€โˆ’Vkฯ€|\mathrm{DEV}_{K,\pi}=\left|\sum_{k=1}^{K}\hat{V}_{k}^{\pi}-V_{k}^{\pi}\right|, then we have with probability at least 1โˆ’ฮด1-\delta,

DEVK,ฯ€โ‰ค\displaystyle\mathrm{DEV}_{K,\pi}\leq 2โ€‹โˆ‘k=1Kโˆ‘h=1Hโ€–ฯ•ฯ€,hโ€–ฮฃk,hโˆ’12โ€‹Hโ€‹logโก(1ฮด)dโ€‹K+12โ€‹dโ€‹Kโ€‹Hโ€‹logโก(1ฮด)\displaystyle 2\sum_{k=1}^{K}\sum_{h=1}^{H}\left\|\phi_{\pi,h}\right\|_{\Sigma_{k,h}^{-1}}^{2}\sqrt{\frac{H\log\left(\frac{1}{\delta}\right)}{dK}}+\frac{1}{2}\sqrt{dKH\log\left(\frac{1}{\delta}\right)}
+2โ€‹(dโ€‹H2ฮณ)โ€‹logโก(1ฮด)+2โ€‹dโ€‹H2ฮณโ€‹ฯตโ€‹K.\displaystyle+2\left(\frac{dH^{2}}{\gamma}\right)\log\left(\frac{1}{\delta}\right)+\frac{2dH^{2}}{\gamma}\epsilon K\,.
Proof.

First, we bound the bias of the estimated loss of each policy ฯ€\pi after episode kk in step hh:

โ„“^k,hฯ€โˆ’โ„“k,hฯ€=\displaystyle\hat{\ell}_{k,h}^{\pi}-\ell_{k,h}^{\pi}= ฯ•^ฯ€,hโŠคโ€‹ฮธ^k,hโˆ’ฯ•ฯ€,hโŠคโ€‹ฮธk,h\displaystyle\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}-\phi_{\pi,h}^{\top}\theta_{k,h}
=\displaystyle= (ฯ•^ฯ€,hโŠคโ€‹ฮธ^k,hโˆ’ฯ•^ฯ€,hโŠคโ€‹ฮธk,heโ€‹xโ€‹p)+(ฯ•^ฯ€,hโŠคโ€‹ฮธk,heโ€‹xโ€‹pโˆ’ฯ•ฯ€,hโ€‹ฮธk,h)\displaystyle\left(\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}-\hat{\phi}_{\pi,h}^{\top}{\theta}_{k,h}^{exp}\right)+\left(\hat{\phi}_{\pi,h}^{\top}{\theta}_{k,h}^{exp}-\phi_{\pi,h}\theta_{k,h}\right)
โ‰ค\displaystyle\leq (ฯ•^ฯ€,hโŠคโ€‹ฮธ^k,hโˆ’ฯ•^ฯ€,hโŠคโ€‹ฮธk,heโ€‹xโ€‹p)+2โ€‹dโ€‹Hฮณโ€‹ฯต.\displaystyle\left(\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}-\hat{\phi}_{\pi,h}^{\top}{\theta}_{k,h}^{exp}\right)+\frac{2dH}{\gamma}\epsilon\,.

The first term is a martingale difference sequence, since \mathbb{E}_{k}\left[\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}-\hat{\phi}_{\pi,h}^{\top}{\theta}_{k,h}^{exp}\right]=0 by the definition in Lemma 8. To bound its magnitude, notice that \left|\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}-\hat{\phi}_{\pi,h}^{\top}{\theta}_{k,h}^{exp}\right|\leq\left|\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}\right|+\left|{\phi}_{\pi,h}^{\top}\theta_{k,h}\right|+\left|\hat{\phi}_{\pi,h}^{\top}{\theta}_{k,h}^{exp}-{\phi}_{\pi,h}^{\top}\theta_{k,h}\right|\leq\frac{dH}{\gamma}+1+\frac{2dH}{\gamma}\epsilon\leq\frac{2dH}{\gamma} by Lemmas 5 and 6 together with Lemma 8. Its variance is also bounded as:

๐”ผkโ€‹[(ฯ•^ฯ€,hโŠคโ€‹ฮธ^k,hโˆ’ฯ•^ฯ€,hโŠคโ€‹ฮธk,heโ€‹xโ€‹p)2]โ‰ค\displaystyle\mathbb{E}_{k}\left[\left(\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}-\hat{\phi}_{\pi,h}^{\top}{\theta}_{k,h}^{exp}\right)^{2}\right]\leq ๐”ผkโ€‹[(ฯ•^ฯ€,hโŠคโ€‹ฮธ^k,h)2]\displaystyle\mathbb{E}_{k}\left[\left(\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}\right)^{2}\right]
=\displaystyle= ๐”ผkโ€‹[ฯ•^ฯ€,hโŠคโ€‹ฮธ^k,hโ€‹ฮธ^k,hโŠคโ€‹ฯ•^ฯ€,h]\displaystyle\mathbb{E}_{k}\left[\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}\hat{\theta}_{k,h}^{\top}\hat{\phi}_{\pi,h}\right]
=\displaystyle= ๐”ผkโ€‹[โ„“k,hโ€‹(sk,h,ak,h)2โ€‹ฯ•^ฯ€,hโŠคโ€‹ฮฃ^k,hโˆ’1โ€‹ฯ•^ฯ€k,hโ€‹ฯ•^ฯ€k,hโŠคโ€‹ฮฃ^k,hโˆ’1โ€‹ฯ•^ฯ€,h]\displaystyle\mathbb{E}_{k}\left[\ell_{k,h}\left(s_{k,h},a_{k,h}\right)^{2}\hat{\phi}_{\pi,h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi_{k},h}\hat{\phi}_{\pi_{k},h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi,h}\right]
โ‰ค\displaystyle\leq โ€–ฯ•^ฯ€,hโ€–ฮฃ^k,hโˆ’12,\displaystyle\left\|\hat{\phi}_{\pi,h}\right\|_{\hat{\Sigma}_{k,h}^{-1}}^{2}\,,

where the last inequality is due to the facts \mathbb{E}_{k}\left[\hat{\phi}_{\pi_{k},h}\hat{\phi}_{\pi_{k},h}^{\top}\right]=\hat{\Sigma}_{k,h} and \left|\ell_{k,h}\left(s_{k,h},a_{k,h}\right)\right|\leq 1. Applying Freedman's inequality, we obtain that with probability at least 1-\delta:

DEVK,ฯ€โ‰ค\displaystyle\mathrm{DEV}_{K,\pi}\leq โˆ‘h=1H|โˆ‘k=1Kโ„“^k,hฯ€โˆ’โ„“k,hฯ€|\displaystyle\sum_{h=1}^{H}\left|\sum_{k=1}^{K}\hat{\ell}_{k,h}^{\pi}-\ell_{k,h}^{\pi}\right|
โ‰ค\displaystyle\leq โˆ‘h=1H[2โ€‹โˆ‘k=1Kโ€–ฯ•^ฯ€,hโ€–ฮฃ^k,hโˆ’12โ€‹logโก(1ฮด)+2โ€‹dโ€‹Hฮณโ€‹logโก(1ฮด)+2โ€‹dโ€‹Hฮณโ€‹ฯตโ€‹K]\displaystyle\sum_{h=1}^{H}\left[2\sqrt{\sum_{k=1}^{K}\left\|\hat{\phi}_{\pi,h}\right\|_{\hat{\Sigma}_{k,h}^{-1}}^{2}\log\left(\frac{1}{\delta}\right)}+\frac{2dH}{\gamma}\log\left(\frac{1}{\delta}\right)+\frac{2dH}{\gamma}\epsilon K\right]
โ‰ค\displaystyle\leq 2โ€‹Hโ€‹โˆ‘k=1Kโˆ‘h=1Hโ€–ฯ•^ฯ€,hโ€–ฮฃ^k,hโˆ’12โ€‹logโก(1ฮด)+2โ€‹dโ€‹H2ฮณโ€‹logโก(1ฮด)+2โ€‹dโ€‹H2ฮณโ€‹ฯตโ€‹K\displaystyle 2\sqrt{H\sum_{k=1}^{K}\sum_{h=1}^{H}\left\|\hat{\phi}_{\pi,h}\right\|_{\hat{\Sigma}_{k,h}^{-1}}^{2}\log\left(\frac{1}{\delta}\right)}+\frac{2dH^{2}}{\gamma}\log\left(\frac{1}{\delta}\right)+\frac{2dH^{2}}{\gamma}\epsilon K
โ‰ค\displaystyle\leq 2โ€‹โˆ‘k=1Kโˆ‘h=1Hโ€–ฯ•^ฯ€,hโ€–ฮฃ^k,hโˆ’12โ€‹Hโ€‹logโก(1ฮด)dโ€‹K+12โ€‹dโ€‹Kโ€‹Hโ€‹logโก(1ฮด)+2โ€‹dโ€‹H2ฮณโ€‹logโก(1ฮด)+2โ€‹dโ€‹H2ฮณโ€‹ฯตโ€‹K,\displaystyle 2\sum_{k=1}^{K}\sum_{h=1}^{H}\left\|\hat{\phi}_{\pi,h}\right\|_{\hat{\Sigma}_{k,h}^{-1}}^{2}\sqrt{\frac{H\log\left(\frac{1}{\delta}\right)}{dK}}+\frac{1}{2}\sqrt{dKH\log\left(\frac{1}{\delta}\right)}+2\frac{dH^{2}}{\gamma}\log\left(\frac{1}{\delta}\right)+\frac{2dH^{2}}{\gamma}\epsilon K\,,

where the second-to-last and last inequalities follow from the Cauchy–Schwarz inequality and the AM–GM inequality, respectively.

โˆŽ

Lemma 10.

We bound the gap between the actual regret and the expected estimated regret. With probability at least 1โˆ’ฮด1-\delta,

โˆ‘k=1KVkฯ€kโˆ’โˆ‘k=1Kโˆ‘ฯ€pkโ€‹(ฯ€)โ€‹V~kฯ€โ‰คHโ€‹(d+2โ€‹dโ€‹Hฮณโ€‹ฯต+1)โ€‹2โ€‹Kโ€‹logโก1ฮด+8โ€‹dโ€‹H23โ€‹ฮณโ€‹logโก1ฮด+2โ€‹Hโ€‹dโ€‹Kโ€‹Hโ€‹logโก1ฮด+2โ€‹dโ€‹H2ฮณโ€‹ฯตโ€‹K.\sum_{k=1}^{K}V_{k}^{\pi_{k}}-\sum_{k=1}^{K}\sum_{\pi}p_{k}\left(\pi\right)\tilde{V}_{k}^{\pi}\leq H\left(\sqrt{d}+\frac{2dH}{\gamma}\epsilon+1\right)\sqrt{2K\log\frac{1}{\delta}}+\frac{8dH^{2}}{3\gamma}\log\frac{1}{\delta}+2H\sqrt{dKH\log\frac{1}{\delta}}+\frac{2dH^{2}}{\gamma}\epsilon K\,.
Proof.

Denote \bar{\phi}_{k,h}=\sum_{\pi}p_{k}\left(\pi\right)\hat{\phi}_{\pi,h}. We have:

โ„“k,hฯ€kโˆ’โˆ‘ฯ€pkโ€‹(ฯ€)โ€‹โ„“^kฯ€=\displaystyle\ell_{k,h}^{\pi_{k}}-\sum_{\pi}p_{k}\left(\pi\right)\hat{\ell}_{k}^{\pi}= ฯ•ฯ€k,hโŠคโ€‹ฮธk,hโˆ’ฯ•ยฏk,hโŠคโ€‹ฮธ^k,h\displaystyle\phi_{\pi_{k},h}^{\top}\theta_{k,h}-\bar{\phi}_{k,h}^{\top}\hat{\theta}_{k,h}
=\displaystyle= (ฯ•ฯ€k,hโŠคโ€‹ฮธk,hโˆ’ฯ•^ฯ€k,hโŠคโ€‹ฮธk,heโ€‹xโ€‹p)+(ฯ•^ฯ€k,hโŠคโ€‹ฮธk,heโ€‹xโ€‹pโˆ’ฯ•ยฏk,hโŠคโ€‹ฮธ^k,h)\displaystyle\left(\phi_{\pi_{k},h}^{\top}\theta_{k,h}-\hat{\phi}_{\pi_{k},h}^{\top}\theta_{k,h}^{exp}\right)+\left(\hat{\phi}_{\pi_{k},h}^{\top}\theta_{k,h}^{exp}-\bar{\phi}_{k,h}^{\top}\hat{\theta}_{k,h}\right)
โ‰ค\displaystyle\leq 2โ€‹dโ€‹Hฮณโ€‹ฯต+(ฯ•^ฯ€k,hโŠคโ€‹ฮธk,heโ€‹xโ€‹pโˆ’ฯ•ยฏk,hโŠคโ€‹ฮธ^k,h).\displaystyle\frac{2dH}{\gamma}\epsilon+\left(\hat{\phi}_{\pi_{k},h}^{\top}\theta_{k,h}^{exp}-\bar{\phi}_{k,h}^{\top}\hat{\theta}_{k,h}\right)\,.

Notice that ๐”ผkโ€‹[ฯ•^ฯ€k,hโŠคโ€‹ฮธk,heโ€‹xโ€‹p]=๐”ผkโ€‹[ฯ•ยฏk,hโŠคโ€‹ฮธ^k,h]\mathbb{E}_{k}\left[\hat{\phi}_{\pi_{k},h}^{\top}\theta_{k,h}^{exp}\right]=\mathbb{E}_{k}\left[\bar{\phi}_{k,h}^{\top}\hat{\theta}_{k,h}\right]. We bound its conditional variance as follows:

\displaystyle\sqrt{\mathbb{E}_{k}\left[\left(\hat{\phi}_{\pi_{k},h}^{\top}\theta_{k,h}^{exp}-\bar{\phi}_{k,h}^{\top}\hat{\theta}_{k,h}\right)^{2}\right]}\leq \sqrt{\mathbb{E}_{k}\left[\left(\hat{\phi}_{\pi_{k},h}^{\top}\theta_{k,h}^{exp}\right)^{2}\right]}+\sqrt{\mathbb{E}_{k}\left[\left(\bar{\phi}_{k,h}^{\top}\hat{\theta}_{k,h}\right)^{2}\right]} (5)
โ‰ค\displaystyle\leq 2โ€‹dโ€‹Hฮณโ€‹ฯต+1+ฯ•ยฏk,hโŠคโ€‹ฮฃ^k,hโˆ’1โ€‹ฯ•ยฏk,h\displaystyle\frac{2dH}{\gamma}\epsilon+1+\sqrt{\bar{\phi}_{k,h}^{\top}\hat{\Sigma}_{k,h}^{-1}\bar{\phi}_{k,h}} (6)
โ‰ค\displaystyle\leq 2โ€‹dโ€‹Hฮณโ€‹ฯต+1+โˆ‘ฯ€pkโ€‹(ฯ€)โ€‹ฯ•^ฯ€,hโŠคโ€‹ฮฃ^k,hโˆ’1โ€‹ฯ•^ฯ€,h\displaystyle\frac{2dH}{\gamma}\epsilon+1+\sqrt{\sum_{\pi}p_{k}\left(\pi\right)\hat{\phi}_{\pi,h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi,h}} (7)
=\displaystyle= 2โ€‹dโ€‹Hฮณโ€‹ฯต+1+d,\displaystyle\frac{2dH}{\gamma}\epsilon+1+\sqrt{d}\,, (8)

where inequality (5) is due to the Cauchy–Schwarz inequality and (7) is due to Jensen's inequality. Moreover, \left|\hat{\phi}_{\pi_{k},h}^{\top}\theta_{k,h}^{exp}-\bar{\phi}_{k,h}^{\top}\hat{\theta}_{k,h}\right|\leq\frac{2dH}{\gamma}. Applying Bernstein's inequality, we obtain with probability at least 1-\delta,

\displaystyle\sum_{k=1}^{K}V_{k}^{\pi_{k}}-\sum_{k=1}^{K}\sum_{\pi}p_{k}\left(\pi\right)\hat{V}_{k}^{\pi}\leq \sum_{h=1}^{H}\sum_{k=1}^{K}\left(\ell_{k,h}^{\pi_{k}}-\sum_{\pi}p_{k}\left(\pi\right)\hat{\ell}_{k,h}^{\pi}\right)
โ‰ค\displaystyle\leq Hโ€‹(d+2โ€‹dโ€‹Hฮณโ€‹ฯต+1)โ€‹2โ€‹Kโ€‹logโก1ฮด+83โ€‹dโ€‹H2ฮณโ€‹logโก1ฮด+2โ€‹dโ€‹H2ฮณโ€‹ฯตโ€‹K.\displaystyle H\left(\sqrt{d}+\frac{2dH}{\gamma}\epsilon+1\right)\sqrt{2K\log\frac{1}{\delta}}+\frac{8}{3}\frac{dH^{2}}{\gamma}\log\frac{1}{\delta}+\frac{2dH^{2}}{\gamma}\epsilon K\,.

Since \hat{V}_{k}^{\pi}-\tilde{V}_{k}^{\pi}=\sum_{h=1}^{H}2\hat{\phi}_{\pi,h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi,h}\sqrt{\frac{H\log 1/\delta}{dK}}, we have:

\displaystyle\sum_{k=1}^{K}\sum_{\pi}p_{k}\left(\pi\right)\left(\hat{V}_{k}^{\pi}-\tilde{V}_{k}^{\pi}\right)= \sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{\pi}2p_{k}\left(\pi\right)\hat{\phi}_{\pi,h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi,h}\sqrt{\frac{H\log 1/\delta}{dK}}
=\displaystyle= 2โ€‹Hโ€‹dโ€‹Kโ€‹Hโ€‹logโก1/ฮด.\displaystyle 2H\sqrt{dKH\log 1/\delta}\,.

Combining the two terms, we prove this lemma. โˆŽ

Lemma 11.

With probability at least 1โˆ’ฮด1-\delta, we have:

โˆ‘k=1Kโˆ‘ฯ€pkโ€‹(ฯ€)โ€‹(V~kฯ€)2โ‰ค2โ€‹dโ€‹Kโ€‹H2+2โ€‹dโ€‹H3ฮณโ€‹2โ€‹Kโ€‹logโก(1ฮด)+8โ€‹dโ€‹H3โ€‹logโก(1ฮด)ฮณ.\sum_{k=1}^{K}\sum_{\pi}p_{k}(\pi)\left(\tilde{V}_{k}^{\pi}\right)^{2}\leq 2dKH^{2}+2\frac{dH^{3}}{\gamma}\sqrt{2K\log\left(\frac{1}{\delta}\right)}+\frac{8dH^{3}\log\left(\frac{1}{\delta}\right)}{\gamma}\,.
Proof.
โˆ‘ฯ€pkโ€‹(ฯ€)โ€‹(V~kฯ€)2โ‰ค\displaystyle\sum_{\pi}p_{k}\left(\pi\right)\left(\tilde{V}_{k}^{\pi}\right)^{2}\leq โˆ‘ฯ€pkโ€‹(ฯ€)โ€‹[2โ€‹(V^kฯ€)2+2โ€‹(โˆ‘h=1H2โ€‹ฯ•ฯ€,hโŠคโ€‹ฮฃk,hโˆ’1โ€‹ฯ•ฯ€,hโ€‹Hโ€‹logโก(1ฮด)dโ€‹K)2]\displaystyle\sum_{\pi}p_{k}\left(\pi\right)\left[2\left(\hat{V}_{k}^{\pi}\right)^{2}+2\left(\sum_{h=1}^{H}2\phi_{\pi,h}^{\top}\Sigma_{k,h}^{-1}\phi_{\pi,h}\sqrt{H\frac{\log\left(\frac{1}{\delta}\right)}{dK}}\right)^{2}\right]
โ‰ค\displaystyle\leq โˆ‘ฯ€pkโ€‹(ฯ€)โ€‹(2โ€‹(V^kฯ€)2+2โ€‹Hโ€‹โˆ‘h=1H4โ€‹โ€–ฯ•ฯ€,hโ€–ฮฃk,hโˆ’12โ€‹ฯ•ฯ€,hโŠคโ€‹ฮฃk,hโˆ’1โ€‹ฯ•ฯ€,hโ€‹Hโ€‹logโก(1ฮด)dโ€‹K)\displaystyle\sum_{\pi}p_{k}\left(\pi\right)\left(2\left(\hat{V}_{k}^{\pi}\right)^{2}+2H\sum_{h=1}^{H}4\left\|\phi_{\pi,h}\right\|_{\Sigma_{k,h}^{-1}}^{2}\phi_{\pi,h}^{\top}\Sigma_{k,h}^{-1}\phi_{\pi,h}\frac{H\log\left(\frac{1}{\delta}\right)}{dK}\right)
โ‰ค\displaystyle\leq โˆ‘ฯ€pkโ€‹(ฯ€)โ€‹(2โ€‹(V^kฯ€)2+8โ€‹dโ€‹Hฮณโ€‹โˆ‘h=1Hฯ•ฯ€,hโŠคโ€‹ฮฃk,hโˆ’1โ€‹ฯ•ฯ€,hโ€‹Hโ€‹logโก(1ฮด)dโ€‹K)\displaystyle\sum_{\pi}p_{k}\left(\pi\right)\left(2\left(\hat{V}_{k}^{\pi}\right)^{2}+8\frac{dH}{\gamma}\sum_{h=1}^{H}\phi_{\pi,h}^{\top}\Sigma_{k,h}^{-1}\phi_{\pi,h}\frac{H\log\left(\frac{1}{\delta}\right)}{dK}\right)
=\displaystyle= 2โ€‹โˆ‘ฯ€pkโ€‹(ฯ€)โ€‹(V^kฯ€)2+8โ€‹dโ€‹H3โ€‹logโก1/ฮดฮณโ€‹K.\displaystyle 2\sum_{\pi}p_{k}\left(\pi\right)\left(\hat{V}_{k}^{\pi}\right)^{2}+\frac{8dH^{3}\log 1/\delta}{\gamma K}\,. (9)

Since (V^kฯ€)2โ‰คHโ€‹โˆ‘h=1H(โ„“^k,hฯ€)2\left(\hat{V}_{k}^{\pi}\right)^{2}\leq H\sum_{h=1}^{H}\left(\hat{\ell}_{k,h}^{\pi}\right)^{2}, we bound the first term as follows.

โˆ‘ฯ€pkโ€‹(ฯ€)โ€‹(โ„“^k,hฯ€)2โ‰คโˆ‘ฯ€pkโ€‹(ฯ€)โ€‹ฮธ^k,hโŠคโ€‹ฯ•^ฯ€,hโ€‹ฯ•^ฯ€,hโŠคโ€‹ฮธ^k,hโ‰คฯ•^ฯ€k,hโŠคโ€‹ฮฃ^k,hโˆ’1โ€‹ฯ•^ฯ€k,h.\sum_{\pi}p_{k}\left(\pi\right)\left(\hat{\ell}_{k,h}^{\pi}\right)^{2}\leq\sum_{\pi}p_{k}\left(\pi\right)\hat{\theta}_{k,h}^{\top}\hat{\phi}_{\pi,h}\hat{\phi}_{\pi,h}^{\top}\hat{\theta}_{k,h}\leq\hat{\phi}_{\pi_{k},h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi_{k},h}\,.

Its conditional expectation is ๐”ผkโ€‹[ฯ•^ฯ€k,hโŠคโ€‹ฮฃ^k,hโˆ’1โ€‹ฯ•^ฯ€k,h]=โˆ‘ฯ€pkโ€‹(ฯ€)โ€‹ฯ•^ฯ€,hโŠคโ€‹ฮฃ^k,hโˆ’1โ€‹ฯ•^ฯ€,h=d\mathbb{E}_{k}\left[\hat{\phi}_{\pi_{k},h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi_{k},h}\right]=\sum_{\pi}p_{k}\left(\pi\right)\hat{\phi}_{\pi,h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi,h}=d, and also |ฯ•^ฯ€k,hโŠคโ€‹ฮฃ^k,hโˆ’1โ€‹ฯ•^ฯ€k,h|โ‰คdโ€‹Hฮณ\left|\hat{\phi}_{\pi_{k},h}^{\top}\hat{\Sigma}_{k,h}^{-1}\hat{\phi}_{\pi_{k},h}\right|\leq\frac{dH}{\gamma}. Thus, applying the Hoeffding bound, we have with probability at least 1โˆ’ฮด1-\delta,

\displaystyle\sum_{k=1}^{K}\sum_{\pi}p_{k}\left(\pi\right)\left(\hat{V}_{k}^{\pi}\right)^{2}\leq H\sum_{h=1}^{H}\left(dK+\frac{dH}{\gamma}\sqrt{2K\log 1/\delta}\right)=dKH^{2}+\frac{dH^{3}}{\gamma}\sqrt{2K\log 1/\delta}\,.

Plugging this into Equation (9), we finish our proof. ∎

Proof of Lemma 3.

Now we are ready to analyze the regret. Using the classical potential-function analysis for exponential-weights-type algorithms, we have:

logโก(WK+1W1)=\displaystyle\log\left(\frac{W_{K+1}}{W_{1}}\right)= โˆ‘k=1Klogโก(Wk+1Wk)\displaystyle\sum_{k=1}^{K}\log\left(\frac{W_{k+1}}{W_{k}}\right)
=\displaystyle= โˆ‘k=1Klogโก(โˆ‘ฯ€wkโ€‹(ฯ€)Wkโ€‹expโก(โˆ’ฮทโ€‹V~kฯ€))\displaystyle\sum_{k=1}^{K}\log\left(\sum_{\pi}\frac{w_{k}(\pi)}{W_{k}}\exp\left(-\eta\tilde{V}_{k}^{\pi}\right)\right)
โ‰ค\displaystyle\leq โˆ‘k=1Klogโก(โˆ‘ฯ€pkโ€‹(ฯ€)โˆ’ฮณโ€‹gฯ€1โˆ’ฮณโ€‹(1โˆ’ฮทโ€‹V~kฯ€+ฮท2โ€‹(V~kฯ€)2))\displaystyle\sum_{k=1}^{K}\log\left(\sum_{\pi}\frac{p_{k}(\pi)-\gamma g_{\pi}}{1-\gamma}\left(1-\eta\tilde{V}_{k}^{\pi}+\eta^{2}\left(\tilde{V}_{k}^{\pi}\right)^{2}\right)\right) (10)
โ‰ค\displaystyle\leq โˆ‘k=1Kโˆ‘ฯ€pkโ€‹(ฯ€)โˆ’ฮณโ€‹gฯ€1โˆ’ฮณโ€‹(โˆ’ฮทโ€‹V~kฯ€+ฮท2โ€‹(V~kฯ€)2)\displaystyle\sum_{k=1}^{K}\sum_{\pi}\frac{p_{k}(\pi)-\gamma g_{\pi}}{1-\gamma}\left(-\eta\tilde{V}_{k}^{\pi}+\eta^{2}\left(\tilde{V}_{k}^{\pi}\right)^{2}\right)
โ‰ค\displaystyle\leq ฮท1โˆ’ฮณโ€‹[โˆ‘k=1Kโˆ‘ฯ€โˆ’pkโ€‹(ฯ€)โ€‹V~kฯ€+ฮณโ€‹โˆ‘k=1Kโˆ‘ฯ€gโ€‹(ฯ€)โ€‹V~kฯ€+ฮทโ€‹โˆ‘k=1Kโˆ‘ฯ€pkโ€‹(ฯ€)โ€‹(V~kฯ€)2],\displaystyle\frac{\eta}{1-\gamma}\left[\sum_{k=1}^{K}\sum_{\pi}-p_{k}(\pi)\tilde{V}_{k}^{\pi}+\gamma\sum_{k=1}^{K}\sum_{\pi}g(\pi)\tilde{V}_{k}^{\pi}+\eta\sum_{k=1}^{K}\sum_{\pi}p_{k}(\pi)\left(\tilde{V}_{k}^{\pi}\right)^{2}\right]\,, (11)

where inequality (10) follows from \left|\eta\tilde{V}_{k}^{\pi}\right|\leq 1 (guaranteed by Lemma 7) together with e^{-x}\leq 1-x+x^{2} for |x|\leq 1. Using Lemma 9, we can bound the second term as:

โˆ‘k=1KV~kฯ€โ‰คโˆ‘k=1KVkฯ€+DEVk,ฯ€โˆ’โˆ‘k=1Kโˆ‘h=1H2โ€‹ฯ•ฯ€,hโŠคโ€‹ฮฃk,hโˆ’1โ€‹ฯ•ฯ€,hโ€‹Hโ€‹logโก(1ฮด)dโ€‹Kโ‰คโˆ‘k=1KVkฯ€+12โ€‹dโ€‹Kโ€‹Hโ€‹logโก(1ฮด)+2โ€‹(dโ€‹H2ฮณ)โ€‹logโก(1ฮด)+2โ€‹dโ€‹H2ฮณโ€‹ฯตโ€‹Kโ‰คKโ€‹H+12โ€‹dโ€‹Kโ€‹Hโ€‹logโก(1ฮด)+2โ€‹(dโ€‹H2ฮณ)โ€‹logโก(1ฮด)+2โ€‹dโ€‹H2ฮณโ€‹ฯตโ€‹K.\begin{split}\sum_{k=1}^{K}\tilde{V}_{k}^{\pi}\leq&\sum_{k=1}^{K}V_{k}^{\pi}+\mathrm{DEV}_{k,\pi}-\sum_{k=1}^{K}\sum_{h=1}^{H}2\phi_{\pi,h}^{\top}\Sigma_{k,h}^{-1}\phi_{\pi,h}\sqrt{H\frac{\log\left(\frac{1}{\delta}\right)}{dK}}\\ \leq&\sum_{k=1}^{K}V_{k}^{\pi}+\frac{1}{2}\sqrt{dKH\log\left(\frac{1}{\delta}\right)}+2\left(\frac{dH^{2}}{\gamma}\right)\log\left(\frac{1}{\delta}\right)+\frac{2dH^{2}}{\gamma}\epsilon K\\ \leq&KH+\frac{1}{2}\sqrt{dKH\log\left(\frac{1}{\delta}\right)}+2\left(\frac{dH^{2}}{\gamma}\right)\log\left(\frac{1}{\delta}\right)+\frac{2dH^{2}}{\gamma}\epsilon K\,.\end{split} (12)

Plugging Lemma 10, Lemma 11, and Equation (12) into Equation (11), and recalling that we condition on \gamma\leq 1/2, we obtain:

logโก(WK+1W1)โ‰คโˆ’ฮทโ€‹โˆ‘k=1KVkฯ€k+2โ€‹ฮท2โ€‹[2โ€‹dโ€‹Kโ€‹H2+2โ€‹dโ€‹H3ฮณโ€‹2โ€‹Kโ€‹logโก(1ฮด)+8โ€‹dโ€‹H3โ€‹logโก(1ฮด)ฮณ]+2โ€‹ฮทโ€‹ฮณโ€‹Kโ€‹H+ฮทโ€‹[12โ€‹dโ€‹Kโ€‹Hโ€‹logโก(1ฮด)+2โ€‹(dโ€‹H2ฮณ)โ€‹logโก(1ฮด)+2โ€‹dโ€‹H2ฮณโ€‹ฯตโ€‹K]+2โ€‹ฮทโ€‹[Hโ€‹(d+2โ€‹dโ€‹Hฮณโ€‹ฯต+1)โ€‹2โ€‹Kโ€‹logโก1ฮด+83โ€‹dโ€‹H2ฮณโ€‹logโก1ฮด+2โ€‹Hโ€‹dโ€‹Kโ€‹Hโ€‹logโก1ฮด+2โ€‹dโ€‹H2ฮณโ€‹ฯตโ€‹K].\begin{split}&\log\left(\frac{W_{K+1}}{W_{1}}\right)\leq-\eta\sum_{k=1}^{K}V_{k}^{\pi_{k}}+2\eta^{2}\left[2dKH^{2}+2\frac{dH^{3}}{\gamma}\sqrt{2K\log\left(\frac{1}{\delta}\right)}+\frac{8dH^{3}\log\left(\frac{1}{\delta}\right)}{\gamma}\right]\\ &+2\eta\gamma KH+\eta\left[\frac{1}{2}\sqrt{dKH\log\left(\frac{1}{\delta}\right)}+2\left(\frac{dH^{2}}{\gamma}\right)\log\left(\frac{1}{\delta}\right)+\frac{2dH^{2}}{\gamma}\epsilon K\right]\\ &+2\eta\left[H\left(\sqrt{d}+\frac{2dH}{\gamma}\epsilon+1\right)\sqrt{2K\log\frac{1}{\delta}}+\frac{8}{3}\frac{dH^{2}}{\gamma}\log\frac{1}{\delta}+2H\sqrt{dKH\log\frac{1}{\delta}}+\frac{2dH^{2}}{\gamma}\epsilon K\right]\,.\end{split} (13)

Combining terms, we have:

logโก(WK+1W1)ฮทโ‰คโˆ’โˆ‘k=1KVkฯ€k+๐’ชโ€‹(Hโ€‹dโ€‹Kโ€‹Hโ€‹logโก1ฮด)+๐’ชโ€‹(dโ€‹H2ฮณโ€‹logโก(1ฮด)+ฮณโ€‹Kโ€‹H)+๐’ชโ€‹(ฮทโ€‹dโ€‹Kโ€‹H2+ฮทโ€‹dโ€‹H3ฮณโ€‹2โ€‹Kโ€‹logโก(1ฮด)+8โ€‹ฮทโ€‹dโ€‹H3โ€‹logโก(1ฮด)ฮณ)+๐’ชโ€‹(dโ€‹H2ฮณโ€‹ฯตโ€‹K).\begin{split}\frac{\log\left(\frac{W_{K+1}}{W_{1}}\right)}{\eta}\leq&-\sum_{k=1}^{K}V_{k}^{\pi_{k}}+\mathcal{O}\left(H\sqrt{dKH\log\frac{1}{\delta}}\right)+\mathcal{O}\left(\frac{dH^{2}}{\gamma}\log\left(\frac{1}{\delta}\right)+\gamma KH\right)\\ &+\mathcal{O}\left(\eta dKH^{2}+\frac{\eta dH^{3}}{\gamma}\sqrt{2K\log\left(\frac{1}{\delta}\right)}+\frac{8\eta dH^{3}\log\left(\frac{1}{\delta}\right)}{\gamma}\right)+\mathcal{O}\left(\frac{dH^{2}}{\gamma}\epsilon K\right)\,.\end{split} (14)

Plugging ฮท=ฮณdโ€‹H2\eta=\frac{\gamma}{dH^{2}} into Equation (14), we have:

logโก(WK+1W1)ฮทโ‰คโˆ’โˆ‘k=1KVkฯ€k+๐’ชโ€‹(Hโ€‹dโ€‹Kโ€‹Hโ€‹logโก1ฮด)+๐’ชโ€‹(dโ€‹H2ฮณโ€‹logโก(1ฮด)+ฮณโ€‹Kโ€‹H)+๐’ชโ€‹(dโ€‹H2ฮณโ€‹ฯตโ€‹K).\begin{split}\frac{\log\left(\frac{W_{K+1}}{W_{1}}\right)}{\eta}\leq&-\sum_{k=1}^{K}V_{k}^{\pi_{k}}+\mathcal{O}\left(H\sqrt{dKH\log\frac{1}{\delta}}\right)+\mathcal{O}\left(\frac{dH^{2}}{\gamma}\log\left(\frac{1}{\delta}\right)+\gamma KH\right)+\mathcal{O}\left(\frac{dH^{2}}{\gamma}\epsilon K\right)\,.\end{split} (15)

On the other hand, we have:

logโก(WK+1W1)โ‰ฅฮทโ€‹(โˆ‘k=1Kโˆ’V~kฯ€)โˆ’logโก(|ฮ |)โ‰ฅฮทโ€‹(โˆ‘k=1Kโˆ’Vkฯ€โˆ’DEVK,ฯ€+โˆ‘k=1Kโˆ‘h=1H2โ€‹ฯ•ฯ€,hโŠคโ€‹ฮฃk,hโˆ’1โ€‹ฯ•ฯ€,hโ€‹Hโ€‹logโก(1ฮด)dโ€‹K)โˆ’logโก(|ฮ |)โ‰ฅฮทโ€‹(โˆ‘k=1Kโˆ’Vkฯ€โˆ’12โ€‹dโ€‹Kโ€‹Hโ€‹logโก(1ฮด)โˆ’2โ€‹(dโ€‹H2ฮณ)โ€‹logโก(1ฮด)โˆ’2โ€‹dโ€‹H2ฮณโ€‹ฯตโ€‹K)โˆ’logโก(|ฮ |).\begin{split}\log\left(\frac{W_{K+1}}{W_{1}}\right)\geq&\eta\left(\sum_{k=1}^{K}-\tilde{V}_{k}^{\pi}\right)-\log\left(|\Pi|\right)\\ \geq&\eta\left(\sum_{k=1}^{K}-V_{k}^{\pi}-\mathrm{DEV}_{K,\pi}+\sum_{k=1}^{K}\sum_{h=1}^{H}2\phi_{\pi,h}^{\top}\Sigma_{k,h}^{-1}\phi_{\pi,h}\sqrt{H\frac{\log\left(\frac{1}{\delta}\right)}{dK}}\right)-\log\left(|\Pi|\right)\\ \geq&\eta\left(\sum_{k=1}^{K}-V_{k}^{\pi}-\frac{1}{2}\sqrt{dKH\log\left(\frac{1}{\delta}\right)}-2\left(\frac{dH^{2}}{\gamma}\right)\log\left(\frac{1}{\delta}\right)-\frac{2dH^{2}}{\gamma}\epsilon K\right)-\log\left(|\Pi|\right)\,.\end{split} (16)

Combining (15) and (16), we have:

โˆ‘k=1KVkฯ€kโˆ’Vkฯ€โ‰ค๐’ชโ€‹(Hโ€‹dโ€‹Kโ€‹Hโ€‹logโก1ฮด+dโ€‹H2ฮณโ€‹logโก(1ฮด)+ฮณโ€‹Kโ€‹H)+๐’ชโ€‹(dโ€‹H2ฮณโ€‹ฯตโ€‹K)+logโก(|ฮ |)ฮท.\begin{split}\sum_{k=1}^{K}V_{k}^{\pi_{k}}-V_{k}^{\pi}\leq&\mathcal{O}\left(H\sqrt{dKH\log\frac{1}{\delta}}+\frac{dH^{2}}{\gamma}\log\left(\frac{1}{\delta}\right)+\gamma KH\right)+\mathcal{O}\left(\frac{dH^{2}}{\gamma}\epsilon K\right)+\frac{\log\left(|\Pi|\right)}{\eta}\,.\end{split} (17)

Choosing ฮท=ฮณdโ€‹H2\eta=\frac{\gamma}{dH^{2}} and combining terms, we obtain for any policy ฯ€โˆˆฮ \pi\in\Pi, with probability at least 1โˆ’ฮด1-\delta:

โˆ‘k=1KVkฯ€kโˆ’Vkฯ€=๐’ชโ€‹(Hโ€‹dโ€‹Kโ€‹Hโ€‹logโก|ฮ |ฮด+dโ€‹H2ฮณโ€‹logโก(|ฮ |ฮด)+ฮณโ€‹Kโ€‹H)+๐’ชโ€‹(dโ€‹H2ฮณโ€‹ฯตโ€‹K).\sum_{k=1}^{K}V_{k}^{\pi_{k}}-V_{k}^{\pi}=\mathcal{O}\left(H\sqrt{dKH\log\frac{|\Pi|}{\delta}}+\frac{dH^{2}}{\gamma}\log\left(\frac{|\Pi|}{\delta}\right)+\gamma KH\right)+\mathcal{O}\left(\frac{dH^{2}}{\gamma}\epsilon K\right)\,. (18)

โˆŽ
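For concreteness, the following is a minimal sketch of the exponential-weights update with exploration mixing that the potential-function argument above analyzes: p_{k}(\pi)=(1-\gamma)w_{k}(\pi)/W_{k}+\gamma g_{\pi} and w_{k+1}(\pi)=w_{k}(\pi)\exp(-\eta\tilde{V}_{k}^{\pi}). The policy set size, the exploration distribution g, and the value estimates fed into the update are placeholders; this is not the full algorithm.

# A minimal sketch of the exponential-weights update with exploration mixing
# analyzed by the potential-function argument above.  The policy set size, the
# exploration distribution g, and the value estimates are placeholders.
import numpy as np

def exp_weights_step(w, V_tilde, g, eta, gamma):
    """One episode: p_k = (1 - gamma) * w_k / W_k + gamma * g, then
    w_{k+1}(pi) = w_k(pi) * exp(-eta * V_tilde_k(pi))."""
    p = (1.0 - gamma) * w / w.sum() + gamma * g
    w_next = w * np.exp(-eta * V_tilde)
    return p, w_next

rng = np.random.default_rng(1)
n_pi, eta, gamma = 5, 0.1, 0.05
w = np.ones(n_pi)                              # w_1(pi) = 1 for all pi
g = np.full(n_pi, 1.0 / n_pi)                  # placeholder exploration distribution g
for k in range(100):
    V_tilde = rng.uniform(0.0, 1.0, size=n_pi) # stand-in for the optimistic value estimates
    p, w = exp_weights_step(w, V_tilde, g, eta, gamma)
print("final sampling distribution p_K:", np.round(p, 3))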

We then present the proof of Theorem 1 based on Equation (18). Notice that we condition on K being large enough so that the optimal parameters \gamma and \epsilon set below are smaller than \frac{1}{2}, satisfying the requirements of the algorithm; the case of small K is trivial.

  •

    In the case where we have access to a simulator, all of the regret is incurred while executing the policies in \Pi. Setting the parameters as \epsilon\leftarrow dH^{2}\log\frac{K}{\delta}/K and \gamma\leftarrow\sqrt{dH\log\left(\frac{|\Pi|}{\delta}\right)/K}, and using the properties of \Pi in Lemma 19, the total regret is bounded as:

    Regโ€‹(K)โ‰คRegโ€‹(K;ฮ )+1=maxฯ€โˆˆฮ โก(โˆ‘k=1KVkฯ€kโˆ’Vkฯ€)+1=๐’ชโ€‹(d2โ€‹H5โ€‹Kโ€‹logโกKฮด).\mathrm{Reg}(K)\leq\mathrm{Reg}\left(K;\Pi\right)+1=\max_{\pi\in\Pi}\left(\sum_{k=1}^{K}V_{k}^{\pi_{k}}-V_{k}^{\pi}\right)+1=\mathcal{O}\left(\sqrt{d^{2}H^{5}K\log\frac{K}{\delta}}\right)\,.

    Also, according to Corollary 1, the total number of episodes run on the simulator is of order \tilde{\mathcal{O}}\left(d^{3}HK^{2}\right).

  •

    When we do not have access to a simulator, we also have to account for the regret incurred while estimating the feature visitations of each policy. According to Corollary 1, this additional regret is of order \mathcal{O}\left(\frac{d^{4}H^{4}}{\epsilon^{2}}\log\frac{H^{2}d|\Pi|}{\delta}+C_{1}\right). By our construction of the policy set \Pi in Lemma 19, the total regret is bounded as:

    Regโ€‹(K)=๐’ชโ€‹(d5โ€‹H6ฯต2โ€‹logโกKฮด+d2โ€‹H4ฮณโ€‹logโกKฮด+ฮณโ€‹Kโ€‹H+dโ€‹H2ฮณโ€‹ฯตโ€‹K+d2โ€‹H5โ€‹Kโ€‹logโกKฮด+C1),\mathrm{Reg}(K)=\mathcal{O}\left(\frac{d^{5}H^{6}}{\epsilon^{2}}\log\frac{K}{\delta}+\frac{d^{2}H^{4}}{\gamma}\log\frac{K}{\delta}+\gamma KH+\frac{dH^{2}}{\gamma}\epsilon K+\sqrt{d^{2}H^{5}K\log\frac{K}{\delta}}+C_{1}\right), (19)

    with C_{1}=\mathrm{poly}\left(d,H,\log 1/\delta,\frac{1}{\lambda_{min}^{*}},\log|\Pi|,\log 1/\epsilon\right). Setting the parameters as \epsilon\leftarrow K^{-2/5}d^{9/5}H^{9/5}\log^{2/5}\frac{K}{\delta} and \gamma\leftarrow K^{-1/5}d^{7/5}H^{7/5}\log^{1/5}\frac{K}{\delta} (the sketch after this list double-checks the exponent bookkeeping), the total regret is of order:

    Regโ€‹(K)=๐’ชโ€‹(d7/5โ€‹H12/5โ€‹K4/5โ€‹log1/5โกKฮด).\mathrm{Reg}(K)={\mathcal{O}}\left(d^{7/5}H^{12/5}K^{4/5}\log^{1/5}\frac{K}{\delta}\right)\,.
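As a sanity check on the parameter tuning above, the following sketch plugs the stated choices of \epsilon and \gamma into the terms of Equation (19) and confirms symbolically that the dominant terms all scale as d^{7/5}H^{12/5}K^{4/5}, up to logarithmic factors collected into a single symbol L=\log\frac{K}{\delta}; constants and the C_{1} contribution are ignored.

# A quick symbolic check (sympy) that the parameter choices above balance the
# terms of Equation (19) at the d^{7/5} H^{12/5} K^{4/5} rate.  All logarithmic
# factors are lumped into the single symbol L = log(K/delta); constants dropped.
import sympy as sp

d, H, K, L = sp.symbols('d H K L', positive=True)
eps = K**sp.Rational(-2, 5) * d**sp.Rational(9, 5) * H**sp.Rational(9, 5) * L**sp.Rational(2, 5)
gamma = K**sp.Rational(-1, 5) * d**sp.Rational(7, 5) * H**sp.Rational(7, 5) * L**sp.Rational(1, 5)

terms = {
    "d^5 H^6 L / eps^2":    d**5 * H**6 * L / eps**2,
    "gamma K H":            gamma * K * H,
    "d H^2 eps K / gamma":  d * H**2 * eps * K / gamma,
    "d^2 H^4 L / gamma":    d**2 * H**4 * L / gamma,       # lower order
    "sqrt(d^2 H^5 K L)":    sp.sqrt(d**2 * H**5 * K * L),  # lower order
}
for name, term in terms.items():
    print(f"{name:22s} = {sp.powsimp(sp.simplify(term), force=True)}")
# The first three terms all simplify to d**(7/5) * H**(12/5) * K**(4/5) * L**(1/5).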

Appendix B Construct the Policy Visitation Estimators

In this section, we present the analysis of Algorithm 2. We then prove Lemma 17 and Corollary 1 as our main results, which provide the concentration of the estimators \hat{\phi}_{\pi,h} and bound the sample complexity. These results are then used to prove the final regret bounds in Appendix A.

First, we present the performance guarantee of the data collection oracle, which comes directly from Theorem 9 in Wagenmaker and Jamieson [2022]. Denote:

๐—๐˜optโ€‹(๐šฒ)=maxฯ•โˆˆฮฆโกโ€–ฯ•โ€–๐€โ€‹(๐šฒ)โˆ’12for๐€โ€‹(๐šฒ)=๐šฒ+๐šฒ0,\mathbf{XY}_{\mathrm{opt}}\left(\mathbf{\Lambda}\right)=\max_{\phi\in\Phi}\left\|\phi\right\|_{\mathbf{A}\left(\mathbf{\Lambda}\right)^{-1}}^{2}\leavevmode\nobreak\ \leavevmode\nobreak\ \mathrm{for}\leavevmode\nobreak\ \leavevmode\nobreak\ \mathbf{A}\left(\mathbf{\Lambda}\right)=\mathbf{\Lambda}+\mathbf{\Lambda}_{0}\,,

for ๐šฒ0\mathbf{\Lambda}_{0} be some fixed regularizer. We consider itโ€™s smooth approximation:

๐—๐˜~optโ€‹(๐šฒ)=1ฮทโ€‹logโก(โˆ‘ฯ•โˆˆฮฆeฮทโ€‹โ€–ฯ•โ€–๐€โ€‹(๐šฒ)โˆ’12).\widetilde{\mathbf{XY}}_{\mathrm{opt}}\left(\mathbf{\Lambda}\right)=\frac{1}{\eta}\log\left(\sum_{\phi\in\Phi}e^{\eta\left\|\phi\right\|_{\mathbf{A}\left(\mathbf{\Lambda}\right)^{-1}}^{2}}\right)\,.

We also define \mathbf{\Omega}_{h}:=\left\{\mathbb{E}_{\pi\sim\omega}\left[\mathbf{\Lambda_{\pi,h}}\right]:\omega\in\Delta_{\pi}\right\}, where \Delta_{\pi} is the set of all distributions over valid Markovian policies. \mathbf{\Omega}_{h} is then the set of all covariance matrices realizable by distributions over policies at step h. Then we have

Theorem 2.

Consider running Algorithm 6 in Wagenmaker and Jamieson [2022] with some \epsilon>0 and functions

fiโ€‹(๐šฒ)โ†๐—๐˜~oโ€‹pโ€‹tโ€‹(๐šฒ)f_{i}\left(\mathbf{\Lambda}\right)\leftarrow\widetilde{\mathbf{XY}}_{opt}\left(\mathbf{\Lambda}\right)

for ๐šฒ0โ†(TiKi)โˆ’1ฮฃi=:๐šฒi\mathbf{\Lambda}_{0}\leftarrow\left(T_{i}K_{i}\right)^{-1}\Sigma_{i}=:\mathbf{\Lambda}_{i} and

ฮทi=2ฮณฮฆโ‹…(1+โ€–๐šฒiโ€–op)โ‹…logโก|ฮฆ|\displaystyle\eta_{i}=\frac{2}{\gamma_{\Phi}}\cdot\left(1+\left\|\mathbf{\Lambda}_{i}\right\|_{\mathrm{op}}\right)\cdot\log|\Phi|
Li=โ€–๐šฒiโˆ’1โ€–op2,ฮฒi=2โ€‹โ€–๐šฒiโˆ’1โ€–op3โ€‹(1+ฮทiโ€‹โ€–๐šฒiโˆ’1โ€–op),Mi=โ€–๐šฒiโˆ’1โ€–op2\displaystyle L_{i}=\left\|\mathbf{\Lambda}_{i}^{-1}\right\|_{\mathrm{op}}^{2}\,,\leavevmode\nobreak\ \leavevmode\nobreak\ \beta_{i}=2\left\|\mathbf{\Lambda}_{i}^{-1}\right\|_{\mathrm{op}}^{3}\left(1+\eta_{i}\left\|\mathbf{\Lambda}_{i}^{-1}\right\|_{\mathrm{op}}\right)\,,\leavevmode\nobreak\ \leavevmode\nobreak\ M_{i}=\left\|\mathbf{\Lambda}_{i}^{-1}\right\|_{\mathrm{op}}^{2}

where ฮฃi\Sigma_{i} is the matrix returned by running Algorithm 7 in Wagenmaker and Jamieson [2022] with Nโ†Tiโ€‹KiN\leftarrow T_{i}K_{i}, ฮดโ†ฮด/(2โ€‹i2)\delta\leftarrow\delta/\left(2i^{2}\right), and some ฮปยฏโ‰ฅ0\underline{\lambda}\geq 0. Then with probability 1โˆ’2โ€‹ฮด1-2\delta, this procedure will collect at most

20โ‹…inf๐šฒโˆˆฮฉmaxฯ•โˆˆฮฆโกโ€–ฯ•โ€–๐€โ€‹(๐šฒ)โˆ’12ฯตexp+polyโ€‹(d,H,logโก1/ฮด,1ฮปminโˆ—,1ฮณฮฆ,ฮปยฏ,logโก|ฮฆ|,logโก1ฯตexp)20\cdot\frac{\inf_{\mathbf{\Lambda}\in\Omega}\max_{\phi\in\Phi}\left\|\phi\right\|^{2}_{\mathbf{A}\left(\mathbf{\Lambda}\right)^{-1}}}{\epsilon_{\exp}}+\mathrm{poly}\left(d,\,H,\,\log 1/\delta,\,\frac{1}{\lambda_{\min}^{*}},\,\frac{1}{\gamma_{\Phi}},\,\underline{\lambda},\,\log|\Phi|,\,\log\frac{1}{\epsilon_{\exp}}\right)

episodes, where

๐€โ€‹(๐šฒ)=๐šฒ+minโก{(ฮปminโˆ—)2d,ฮปminโˆ—d3โ€‹H3โ€‹log7/2โก1/ฮด}โ‹…polyโ€‹logโก(1ฮปminโˆ—,d,H,ฮปยฏ,logโก1ฮด)โ‹…I,\mathbf{A}\left(\mathbf{\Lambda}\right)=\mathbf{\Lambda}+\min\left\{\frac{\left(\lambda_{\min}^{*}\right)^{2}}{d},\,\frac{\lambda_{\min}^{*}}{d^{3}H^{3}\log^{7/2}1/\delta}\right\}\cdot\mathrm{poly}\log\left(\frac{1}{\lambda_{\min}^{*}}\,,\leavevmode\nobreak\ d\,,\leavevmode\nobreak\ H\,,\leavevmode\nobreak\ \underline{\lambda}\,,\leavevmode\nobreak\ \log\frac{1}{\delta}\right)\cdot I\,,

and will produce covariates ฮฃ^+ฮฃi\widehat{\Sigma}+\Sigma_{i} such that

maxฯ•โˆˆฮฆโกโ€–ฯ•โ€–(ฮฃ^+ฮฃi)โˆ’12โ‰คฯตexp\max_{\phi\in\Phi}\left\|\phi\right\|^{2}_{\left(\widehat{\Sigma}+\Sigma_{i}\right)^{-1}}\leq\epsilon_{\exp}

and

ฮปminโ€‹(ฮฃ^+ฮฃi)โ‰ฅmaxโก{dโ€‹logโก1/ฮด,ฮป}.\lambda_{\min}\left(\widehat{\Sigma}+\Sigma_{i}\right)\geq\max\left\{d\log 1/\delta\,,\leavevmode\nobreak\ \lambda\right\}\,.

Next, we present the concentration analysis of our estimators and bound the total number of episodes run. Throughout this section, assuming the algorithm has run for some number of episodes K, we let \left(\mathcal{F}_{\tau}\right)_{\tau=1}^{K} denote the corresponding filtration, with \mathcal{F}_{\tau} the filtration up to and including episode \tau. We also let \mathcal{F}_{\tau,h} denote the filtration on all episodes \tau^{\prime}<\tau, together with steps h^{\prime}=1,\cdots,h of episode \tau. Define

ฯ•ฯ€,h=๐”ผฯ€โ€‹[ฯ•โ€‹(sh,ah)],ฯ•ฯ€,hโ€‹(s)=โˆ‘aโˆˆ๐’œฯ•โ€‹(s,a)โ€‹ฯ€hโ€‹(a|s)\phi_{\pi,h}=\mathbb{E}_{\pi}\left[\phi\left(s_{h},a_{h}\right)\right],\leavevmode\nobreak\ \leavevmode\nobreak\ \phi_{\pi,h}\left(s\right)=\sum_{a\in\mathcal{A}}\phi\left(s,a\right)\pi_{h}\left(a|s\right)

and

๐’ฏ:=โˆซฯ•ฯ€,hโ€‹(s)โ€‹๐‘‘ฮผhโˆ’1โ€‹(s)โŠค.\mathcal{T}:=\int\phi_{\pi,h}\left(s\right)\,d\mu_{h-1}\left(s\right)^{\top}\,.

We have from lemma A.7 in Wagenmaker and Jamieson [2022]: ฯ•ฯ€,h=๐’ฏฯ€,hโ€‹ฯ•ฯ€,hโˆ’1=โ‹ฏ=๐’ฏฯ€,hโ€‹โ‹ฏโ€‹๐’ฏฯ€,1โ€‹ฯ•ฯ€,0\phi_{\pi,h}=\mathcal{T}_{\pi,h}\phi_{\pi,h-1}=\cdots=\mathcal{T}_{\pi,h}\cdots\mathcal{T}_{\pi,1}\phi_{\pi,0}โ€‰. We also denote ฮณฮฆ:=maxฯ•โˆˆฮฆโกโ€–ฯ•โ€–2\gamma_{\Phi}:=\max_{\phi\in\Phi}\left\|\phi\right\|_{2}.

The following Lemma 12 comes straight from lemma B.1, lemma B.2 and lemma B.3 in Wagenmaker and Jamieson [2022] and provides us with the basic concentration properties of the estimators constructed in line 6 of Algorithm 2.

Lemma 12.

Assume that we have collected some data {(shโˆ’1,ฯ„,ahโˆ’1,ฯ„,sh,ฯ„)}ฯ„=1K\left\{\left(s_{h-1,\tau},a_{h-1,\tau},s_{h,\tau}\right)\right\}_{\tau=1}^{K} where, for each ฯ„โ€ฒ\tau^{\prime}, sh,ฯ„โ€ฒ|โ„ฑhโˆ’1,ฯ„โ€ฒs_{h,\tau^{\prime}}|\mathcal{F}_{h-1,\tau^{\prime}} is independent of {(shโˆ’1,ฯ„,ahโˆ’1,ฯ„,sh,ฯ„)}ฯ„โ‰ ฯ„โ€ฒ\left\{\left(s_{h-1,\tau},a_{h-1,\tau},s_{h,\tau}\right)\right\}_{\tau\neq\tau^{\prime}}. Denote ฯ•hโˆ’1,ฯ„=ฯ•โ€‹(shโˆ’1,ฯ„,ahโˆ’1,ฯ„)\phi_{h-1,\tau}=\phi\left(s_{h-1,\tau},a_{h-1,\tau}\right) and ๐šฒhโˆ’1=โˆ‘ฯ„=1Kฯ•hโˆ’1,ฯ„โ€‹ฯ•hโˆ’1,ฯ„โŠค+ฮปโ€‹I\mathbf{\Lambda}_{h-1}=\sum_{\tau=1}^{K}\phi_{h-1,\tau}\phi_{h-1,\tau}^{\top}+\lambda I. Fix ฯ€\pi and let

๐’ฏ^ฯ€,h=(โˆ‘ฯ„=1Kฯ•ฯ€,hโ€‹(sh,ฯ„)โ€‹ฯ•hโˆ’1,ฯ„โŠค)โ€‹๐šฒhโˆ’1โˆ’1ฯ•^ฯ€,h=๐’ฏ^ฯ€,hโ€‹๐’ฏ^ฯ€,hโˆ’1โ€‹โ‹ฏโ€‹๐’ฏ^ฯ€,2โ€‹๐’ฏ^ฯ€,1โ€‹ฯ•ฯ€,0.\begin{split}\hat{\mathcal{T}}_{\pi,h}=&\left(\sum_{\tau=1}^{K}\phi_{\pi,h}(s_{h,\tau})\phi_{h-1,\tau}^{\top}\right)\mathbf{\Lambda}_{h-1}^{-1}\\ \hat{\phi}_{\pi,h}=&\hat{\mathcal{T}}_{\pi,h}\hat{\mathcal{T}}_{\pi,h-1}\cdots\hat{\mathcal{T}}_{\pi,2}\hat{\mathcal{T}}_{\pi,1}\phi_{\pi,0}\,.\end{split}

Fix uโˆˆ๐’ฎdโˆ’1\textbf{{u}}\in\mathcal{S}^{d-1}. Then with probability at least 1โˆ’ฮด1-\delta:

|โŸจu,ฯ•ฯ€,hโˆ’ฯ•^ฯ€,hโŸฉ|โ‰คโˆ‘i=1hโˆ’1(2โ€‹logโก2โ€‹Hฮด+logโก2โ€‹Hฮดฮปminโ€‹(๐šฒi)+dโ€‹ฮป)โ‹…โ€–ฯ•^ฯ€,iโ€–๐šฒiโˆ’1.\left|\langle\textbf{{u}},\phi_{\pi,h}-\hat{\phi}_{\pi,h}\rangle\right|\leq\sum_{i=1}^{h-1}\left(2\sqrt{\log\frac{2H}{\delta}}+\frac{\log\frac{2H}{\delta}}{\sqrt{\lambda_{\min}\left(\mathbf{\Lambda}_{i}\right)}}+\sqrt{d\lambda}\right)\cdot\left\|\hat{\phi}_{\pi,i}\right\|_{\mathbf{\Lambda}_{i}^{-1}}\,.

Thus, with probability at least 1โˆ’ฮด1-\delta,

โ€–ฯ•^ฯ€,hโˆ’ฯ•ฯ€,hโ€–2โ‰คdโ€‹โˆ‘hโ€ฒ=1hโˆ’1(2โ€‹logโก2โ€‹Hโ€‹dฮด+logโก2โ€‹Hโ€‹dฮดฮปminโ€‹(๐šฒhโ€ฒ)+dโ€‹ฮป)โ‹…โ€–ฯ•^ฯ€,hโ€ฒโ€–๐šฒhโ€ฒโˆ’1\left\|\hat{\phi}_{\pi,h}-\phi_{\pi,h}\right\|_{2}\leq d\sum_{h^{\prime}=1}^{h-1}\left(2\sqrt{\log\frac{2Hd}{\delta}}+\frac{\log\frac{2Hd}{\delta}}{\sqrt{\lambda_{\min}\left(\mathbf{\Lambda}_{h^{\prime}}\right)}}+\sqrt{d\lambda}\right)\cdot\left\|\hat{\phi}_{\pi,h^{\prime}}\right\|_{\mathbf{\Lambda}_{h^{\prime}}^{-1}}
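The estimator above is a chain of ridge-regression estimates of the one-step operators \mathcal{T}_{\pi,h}, applied to \phi_{\pi,0}. The following toy sketch instantiates it in a small tabular MDP viewed as a linear MDP with one-hot features \phi(s,a)=e_{(s,a)}, and compares \hat{\phi}_{\pi,h} against the exact feature visitations computed from occupancy measures. The MDP, the behavior policy used to collect data, the target policy, and all sizes are illustrative placeholders; only the construction of \hat{\mathcal{T}}_{\pi,h} and the chaining follow the lemma.

# A toy sketch of the feature-visitation estimator of Lemma 12, in a small tabular
# MDP viewed as a linear MDP with one-hot features phi(s,a) = e_{(s,a)}.  The MDP,
# the behavior policy, and the target policy are arbitrary placeholders; only
#   T_hat_{pi,h} = (sum_t phi_{pi,h}(s_{h,t}) phi_{h-1,t}^T) Lambda_{h-1}^{-1},
#   phi_hat_{pi,h} = T_hat_{pi,h} ... T_hat_{pi,1} phi_{pi,0}
# follow the construction in the lemma.
import numpy as np

rng = np.random.default_rng(3)
S, A, H, N, lam = 3, 2, 4, 5000, 1.0
d = S * A
feat = np.eye(d)                                     # phi(s,a) = e_{(s,a)}
idx = lambda s, a: s * A + a

P = rng.dirichlet(np.ones(S), size=(H, S, A))        # P_h(s'|s,a), h = 0..H-1
pi = rng.dirichlet(np.ones(A), size=(H + 1, S))      # target policy pi_h(a|s), h = 0..H
beh = np.full((H + 1, S, A), 1.0 / A)                # uniform behavior policy
s0 = 0

def phi_pi(h, s):                                    # phi_{pi,h}(s) = sum_a phi(s,a) pi_h(a|s)
    return sum(pi[h, s, a] * feat[idx(s, a)] for a in range(A))

# exact feature visitations phi_{pi,h} = E_pi[phi(s_h, a_h)] via occupancy measures
mu = np.zeros(S); mu[s0] = 1.0                       # state distribution at step 0
phi_true = []
for h in range(1, H + 1):
    mu_next = np.zeros(S)
    for s in range(S):
        for a in range(A):
            mu_next += mu[s] * pi[h - 1, s, a] * P[h - 1, s, a]
    mu = mu_next                                     # state distribution at step h under pi
    phi_true.append(sum(mu[s] * phi_pi(h, s) for s in range(S)))

# data (s_{h-1}, a_{h-1}, s_h) collected with the behavior policy
data = [[] for _ in range(H)]
for _ in range(N):
    s = s0
    for h in range(1, H + 1):
        a = rng.choice(A, p=beh[h - 1, s])
        s_next = rng.choice(S, p=P[h - 1, s, a])
        data[h - 1].append((s, a, s_next))
        s = s_next

# chained one-step (ridge-regression) estimators
phi_hat = phi_pi(0, s0)                              # phi_{pi,0}
for h in range(1, H + 1):
    Lam = lam * np.eye(d)
    M = np.zeros((d, d))
    for (s, a, s_next) in data[h - 1]:
        f = feat[idx(s, a)]
        Lam += np.outer(f, f)
        M += np.outer(phi_pi(h, s_next), f)
    phi_hat = (M @ np.linalg.inv(Lam)) @ phi_hat     # apply T_hat_{pi,h} to the chain
    print(f"h={h}: ||phi_hat - phi_true||_2 = {np.linalg.norm(phi_hat - phi_true[h - 1]):.4f}")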
Lemma 13.

Let ฮตesth\varepsilon_{\mathrm{est}}^{h} denote the event on which, for all ฯ€โˆˆฮ \pi\in\Pi, the feature visitation estimates returned by line 6 satisfy:

โ€–ฯ•^ฯ€,h+1โˆ’ฯ•ฯ€,h+1โ€–2โ‰คdโ€‹โˆ‘hโ€ฒ=1hโˆ’1(3โ€‹logโก4โ€‹H2โ€‹dโ€‹|ฮ |ฮด+logโก4โ€‹H2โ€‹dโ€‹|ฮ |ฮดฮปminโ€‹(๐šฒhโ€ฒ))โ‹…โ€–ฯ•^ฯ€,hโ€ฒโ€–๐šฒhโ€ฒโˆ’1\left\|\hat{\phi}_{\pi,h+1}-\phi_{\pi,h+1}\right\|_{2}\leq d\sum_{h^{\prime}=1}^{h-1}\left(3\sqrt{\log\frac{4H^{2}d|\Pi|}{\delta}}+\frac{\log\frac{4H^{2}d|\Pi|}{\delta}}{\sqrt{\lambda_{\min}\left(\mathbf{\Lambda}_{h^{\prime}}\right)}}\right)\cdot\left\|\hat{\phi}_{\pi,h^{\prime}}\right\|_{\mathbf{\Lambda}_{h^{\prime}}^{-1}}

Then โ„™โ€‹[(ฮตesth)c]โ‰คฮด2โ€‹H\mathbb{P}\left[\left(\varepsilon_{\mathrm{est}}^{h}\right)^{c}\right]\leq\frac{\delta}{2H} .

Proof.

Following an analysis similar to Lemma B.5 in Wagenmaker and Jamieson [2022], the data collected in Theorem 2 also satisfy the independence requirement of Lemma 12. The result follows by setting \lambda=1/d in Lemma 12. ∎

Lemma 14.

Let ฮตexph\varepsilon_{\exp}^{h} denote the event on which: The total number of episodes run in line 4 is at most

Cโ€‹d3โ€‹inf๐šฒโˆˆฮฉhmaxฯ•โˆˆฮฆhโกโ€–ฯ•โ€–๐€โ€‹(๐šฒ)โˆ’12ฯต2/ฮฒ+polyโ€‹(d,H,logโก1/ฮด,1ฮปmโ€‹iโ€‹nโˆ—,logโก|ฮ |,logโก1/ฯต)C\frac{d^{3}\inf_{\mathbf{\Lambda}\in\Omega_{h}}\max_{\phi\in\Phi_{h}}\left\|\phi\right\|^{2}_{\mathbf{A}\left(\mathbf{\Lambda}\right)^{-1}}}{\epsilon^{2}/\beta}+\mathrm{poly}\left(d,H,\log 1/\delta,\frac{1}{\lambda_{min}^{*}},\log|\Pi|,\log 1/\epsilon\right)

episodes. The covariates returned by line 4, ๐šฒh\mathbf{\Lambda}_{h}, satisfy:

maxฯ•โˆˆฮฆhโกโ€–ฯ•โ€–๐šฒhโˆ’12โ‰คฯต2d3โ€‹ฮฒ,ฮปminโ€‹(๐šฒh)โ‰ฅlogโก4โ€‹H2โ€‹dโ€‹|ฮ |ฮด.\max_{\phi\in\Phi_{h}}\left\|\phi\right\|_{\mathbf{\Lambda}_{h}^{-1}}^{2}\leq\frac{\epsilon^{2}}{d^{3}\beta}\,,\,\leavevmode\nobreak\ \leavevmode\nobreak\ \lambda_{\min}\left(\mathbf{\Lambda}_{h}\right)\geq\log\frac{4H^{2}d|\Pi|}{\delta}\,. (20)

Then โ„™โ€‹[(ฮตexph)cโˆฉฮตesthโˆ’1โˆฉ(โˆฉi=1hโˆ’1ฮตexpi)]โ‰คฮด2โ€‹H\mathbb{P}\left[\left(\varepsilon_{\exp}^{h}\right)^{c}\cap\varepsilon_{\mathrm{est}}^{h-1}\cap\left(\cap_{i=1}^{h-1}\varepsilon_{\exp}^{i}\right)\right]\leq\frac{\delta}{2H}.

Proof.

By Lemma 15, on the event ฮตesthโˆ’1โˆฉ(โˆฉi=1hโˆ’1ฮตexpi)\varepsilon_{\mathrm{est}}^{h-1}\cap\left(\cap_{i=1}^{h-1}\varepsilon_{\exp}^{i}\right) we can bound โ€–ฯ•^ฯ€,h+1โˆ’ฯ•ฯ€,h+1โ€–2โ‰คฯต/d\left\|\hat{\phi}_{\pi,h+1}-\phi_{\pi,h+1}\right\|_{2}\leq\epsilon/\sqrt{d}. Remember that we condition on KK being large enough so that we have ฯตโ‰ค1/2\epsilon\leq 1/2. Also, we can lower bound โ€–ฯ•ฯ€,hโ€–2โ‰ฅ1/d\left\|\phi_{\pi,h}\right\|_{2}\geq 1/\sqrt{d} from lemma A.6 in Wagenmaker and Jamieson [2022]. Thus,

โ€–ฯ•^ฯ€,hโ€–2โ‰ฅโ€–ฯ•ฯ€,hโ€–2โˆ’โ€–ฯ•^ฯ€,h+1โˆ’ฯ•ฯ€,h+1โ€–2โ‰ฅ1/dโˆ’ฯต/dโ‰ฅ1/(2โ€‹d).\left\|\hat{\phi}_{\pi,h}\right\|_{2}\geq\left\|\phi_{\pi,h}\right\|_{2}-\left\|\hat{\phi}_{\pi,h+1}-\phi_{\pi,h+1}\right\|_{2}\geq 1/\sqrt{d}-\epsilon/\sqrt{d}\geq 1/\left(2\sqrt{d}\right)\,.

So the choice of ฮณฮฆ=12โ€‹d\gamma_{\Phi}=\frac{1}{2\sqrt{d}} is valid. The result then follows by applying Theorem 2 with our chosen parameters. โˆŽ

Lemma 15.

On the event ฮตesthโˆฉ(โˆฉi=1hฮตexpi)\varepsilon_{\mathrm{est}}^{h}\cap\left(\cap_{i=1}^{h}\varepsilon_{\exp}^{i}\right), for all ฯ€โˆˆฮ \pi\in\Pi,

โ€–ฯ•^ฯ€,h+1โˆ’ฯ•ฯ€,h+1โ€–2โ‰คฯต/d.\left\|\hat{\phi}_{\pi,h+1}-\phi_{\pi,h+1}\right\|_{2}\leq\epsilon/\sqrt{d}\,.
Proof.

On ฮตeโ€‹xโ€‹pi\varepsilon_{exp}^{i}, we can bound:

ฮปminโ€‹(๐šฒi)โ‰ฅlogโก4โ€‹H2โ€‹dโ€‹|ฮ |ฮด,โ€–ฯ•^ฯ€,iโ€–๐šฒiโˆ’1โ‰คฯตdโ€‹dโ€‹ฮฒ=ฯต4โ€‹Hโ€‹dโ€‹dโ€‹logโก4โ€‹H2โ€‹dโ€‹|ฮ |ฮด.\begin{split}\lambda_{\min}\left(\mathbf{\Lambda}_{i}\right)&\geq\log\frac{4H^{2}d|\Pi|}{\delta}\,,\\ \left\|\hat{\phi}_{\pi,i}\right\|_{\mathbf{\Lambda}_{i}^{-1}}&\leq\frac{\epsilon}{d\sqrt{d\beta}}=\frac{\epsilon}{4Hd\sqrt{d\log\frac{4H^{2}d|\Pi|}{\delta}}}\,.\end{split}

so that:

โ€–ฯ•^ฯ€,h+1โˆ’ฯ•ฯ€,h+1โ€–2โ‰คdโ€‹โˆ‘hโ€ฒ=1hโˆ’1(3โ€‹logโก4โ€‹H2โ€‹dโ€‹|ฮ |ฮด+logโก4โ€‹H2โ€‹dโ€‹|ฮ |ฮดฮปminโ€‹(๐šฒhโ€ฒ))โ‹…โ€–ฯ•^ฯ€,hโ€ฒโ€–๐šฒhโ€ฒโˆ’1โ‰คdโ€‹โˆ‘i=1h4โ€‹logโก4โ€‹H2โ€‹dโ€‹|ฮ |ฮดโ‹…ฯต4โ€‹Hโ€‹dโ€‹dโ€‹logโก4โ€‹H2โ€‹dโ€‹|ฮ |ฮดโ‰คdโ€‹Hโ‹…ฯตdโ€‹Hโ€‹d=ฯต/d.\begin{split}\left\|\hat{\phi}_{\pi,h+1}-\phi_{\pi,h+1}\right\|_{2}&\leq d\sum_{h^{\prime}=1}^{h-1}\left(3\sqrt{\log\frac{4H^{2}d|\Pi|}{\delta}}+\frac{\log\frac{4H^{2}d|\Pi|}{\delta}}{\sqrt{\lambda_{\min}\left(\mathbf{\Lambda}_{h^{\prime}}\right)}}\right)\cdot\left\|\hat{\phi}_{\pi,h^{\prime}}\right\|_{\mathbf{\Lambda}_{h^{\prime}}^{-1}}\\ &\leq d\sum_{i=1}^{h}4\sqrt{\log\frac{4H^{2}d|\Pi|}{\delta}}\cdot\frac{\epsilon}{4Hd\sqrt{d\log\frac{4H^{2}d|\Pi|}{\delta}}}\\ &\leq dH\cdot\frac{\epsilon}{dH\sqrt{d}}=\epsilon/\sqrt{d}\,.\end{split}

โˆŽ

Lemma 16.

Define ฮตexp=โˆฉhฮตexph\varepsilon_{\exp}=\cap_{h}\varepsilon_{\exp}^{h} and ฮตest=โˆฉhฮตesth\varepsilon_{\mathrm{est}}=\cap_{h}\varepsilon_{\mathrm{est}}^{h}. Then โ„™โ€‹[ฮตeโ€‹sโ€‹tโˆฉฮตeโ€‹xโ€‹p]โ‰ฅ1โˆ’ฮด\mathbb{P}\left[\varepsilon_{est}\cap\varepsilon_{exp}\right]\geq 1-\delta, and on ฮตeโ€‹sโ€‹tโˆฉฮตeโ€‹xโ€‹p\varepsilon_{est}\cap\varepsilon_{exp} , for all h=1,2,โ‹ฏ,Hโˆ’1h=1,2,\cdots,H-1 and ฯ€โˆˆฮ \pi\in\Pi, we have:

โ€–ฯ•^ฯ€,h+1โˆ’ฯ•ฯ€,h+1โ€–2โ‰คฯต/d.\left\|\hat{\phi}_{\pi,h+1}-\phi_{\pi,h+1}\right\|_{2}\leq\epsilon/\sqrt{d}\,.
Proof.

Obviously,

ฮตexpcโˆชฮตestc=\displaystyle\varepsilon_{\exp}^{c}\cup\varepsilon_{\mathrm{est}}^{c}= โ‹ƒh=1Hโˆ’1((ฮตexph)cโˆช(ฮตesth)c)\displaystyle\bigcup_{h=1}^{H-1}\left(\left(\varepsilon_{\exp}^{h}\right)^{c}\cup\left(\varepsilon_{\mathrm{est}}^{h}\right)^{c}\right)
=\displaystyle= โ‹ƒh=1Hโˆ’1(ฮตexph)c\((ฮตesthโˆ’1)cโˆช(โˆชi=1hโˆ’1(ฮตexpi)c))โˆชโ‹ƒh=1H(ฮตesth)c\displaystyle\bigcup_{h=1}^{H-1}\left(\varepsilon_{\exp}^{h}\right)^{c}\backslash\left(\left(\varepsilon_{\mathrm{est}}^{h-1}\right)^{c}\cup\left(\cup_{i=1}^{h-1}\left(\varepsilon_{\exp}^{i}\right)^{c}\right)\right)\cup\bigcup_{h=1}^{H}\left(\varepsilon_{\mathrm{est}}^{h}\right)^{c}
=\displaystyle= โ‹ƒh=1Hโˆ’1(ฮตexph)cโˆฉ(ฮตesthโˆ’1โˆฉ(โˆฉi=1hโˆ’1ฮตexpi))โˆชโ‹ƒh=1H(ฮตesth)c.\displaystyle\bigcup_{h=1}^{H-1}\left(\varepsilon_{\exp}^{h}\right)^{c}\cap\left({\varepsilon_{\mathrm{est}}^{h-1}}\cap\left(\cap_{i=1}^{h-1}{\varepsilon_{\exp}^{i}}\right)\right)\cup\bigcup_{h=1}^{H}\left(\varepsilon_{\mathrm{est}}^{h}\right)^{c}\,.

Using Lemma 13 and Lemma 14, we can bound

โ„™โ€‹[ฮตestcโˆชฮตexpc]โ‰ค\displaystyle\mathbb{P}\left[\varepsilon_{\mathrm{est}}^{c}\cup\varepsilon_{\exp}^{c}\right]\leq โˆ‘h=1Hโˆ’1(โ„™โ€‹[(ฮตexph)cโˆฉฮตesthโˆ’1โˆฉ(โˆฉi=1hโˆ’1ฮตexpi)]+โ„™โ€‹[(ฮตesth)c])\displaystyle\sum_{h=1}^{H-1}\left(\mathbb{P}\left[\left(\varepsilon_{\exp}^{h}\right)^{c}\cap\varepsilon_{\mathrm{est}}^{h-1}\cap\left(\cap_{i=1}^{h-1}\varepsilon_{\exp}^{i}\right)\right]+\mathbb{P}\left[\left(\varepsilon_{\mathrm{est}}^{h}\right)^{c}\right]\right)
โ‰ค\displaystyle\leq โˆ‘h=1Hโˆ’12โ‹…ฮด2โ€‹H\displaystyle\sum_{h=1}^{H-1}2\cdot\frac{\delta}{2H}
โ‰ค\displaystyle\leq ฮด.\displaystyle\delta\,.

The claimed bound on \left\|\hat{\phi}_{\pi,h+1}-\phi_{\pi,h+1}\right\|_{2} then follows from Lemma 15. ∎

Lemma 17 (Full version of Lemma 2).

With probability at least 1โˆ’ฮด1-\delta, Algorithm 2 will run at most

Cโ€‹H2โ€‹d3โ€‹โˆ‘h=1Hโˆ’1inf๐šฒโˆˆฮฉhmaxฯ€โˆˆฮ โกโ€–ฯ•ฯ€,hโ€–๐šฒโˆ’12ฯต2โ€‹logโกH2โ€‹dโ€‹|ฮ |ฮด+polyโ€‹(d,H,logโก1/ฮด,1ฮปmโ€‹iโ€‹nโˆ—,logโก|ฮ |,logโก1/ฯต)CH^{2}d^{3}\sum_{h=1}^{H-1}\frac{\inf_{\mathbf{\Lambda}\in\Omega_{h}}\max_{\pi\in\Pi}\left\|{\phi}_{\pi,h}\right\|^{2}_{\mathbf{\Lambda}^{-1}}}{\epsilon^{2}}\log\frac{H^{2}d|\Pi|}{\delta}+\mathrm{poly}\left(d,H,\log 1/\delta,\frac{1}{\lambda_{min}^{*}},\log|\Pi|,\log 1/\epsilon\right)

episodes, and will output policy visitation estimators ฮฆ={ฯ•^ฯ€,h:h=1,2,โ‹ฏ,H,ฯ€โˆˆฮ }\Phi=\left\{\hat{\phi}_{\pi,h}:\,h=1,2,\cdots,H,\,\pi\in\Pi\right\} with bias bounded as:

โ€–ฯ•^ฯ€,hโˆ’ฯ•ฯ€,hโ€–2โ‰คฯต/d.\left\|\hat{\phi}_{\pi,h}-\phi_{\pi,h}\right\|_{2}\leq\epsilon/\sqrt{d}\,.
Proof.

According to Lemma 16, we can condition on the event \varepsilon_{est}\cap\varepsilon_{exp}, which gives the desired accuracy. According to Lemma 14, the total number of episodes is bounded as:

โˆ‘h=1Hโˆ’1Cโ€‹d3โ€‹inf๐šฒโˆˆฮฉhmaxฯ•โˆˆฮฆhโกโ€–ฯ•โ€–๐€โ€‹(๐šฒ)โˆ’12ฯต2/ฮฒ+polyโ€‹(d,H,logโก1/ฮด,1ฮปmโ€‹iโ€‹nโˆ—,logโก|ฮ |,logโก1/ฯต)\displaystyle\sum_{h=1}^{H-1}C\frac{d^{3}\inf_{\mathbf{\Lambda}\in\Omega_{h}}\max_{\phi\in\Phi_{h}}\left\|\phi\right\|^{2}_{\mathbf{A}\left(\mathbf{\Lambda}\right)^{-1}}}{\epsilon^{2}/\beta}+\mathrm{poly}\left(d,H,\log 1/\delta,\frac{1}{\lambda_{min}^{*}},\log|\Pi|,\log 1/\epsilon\right)
โ‰ค\displaystyle\leq โˆ‘h=1Hโˆ’1Cโ€‹d3โ€‹inf๐šฒโˆˆฮฉhmaxฯ•โˆˆฮฆhโกโ€–ฯ•โ€–๐€โ€‹(๐šฒ)โˆ’12ฯต2โ€‹H2โ€‹logโกH2โ€‹dโ€‹|ฮ |ฮด+polyโ€‹(d,H,logโก1/ฮด,1ฮปmโ€‹iโ€‹nโˆ—,logโก|ฮ |,logโก1/ฯต).\displaystyle\sum_{h=1}^{H-1}C\frac{d^{3}\inf_{\mathbf{\Lambda}\in\Omega_{h}}\max_{\phi\in\Phi_{h}}\left\|\phi\right\|^{2}_{\mathbf{A}\left(\mathbf{\Lambda}\right)^{-1}}}{\epsilon^{2}}H^{2}\log\frac{H^{2}d|\Pi|}{\delta}+\mathrm{poly}\left(d,H,\log 1/\delta,\frac{1}{\lambda_{min}^{*}},\log|\Pi|,\log 1/\epsilon\right)\,.

Conditioning on ฮตeโ€‹sโ€‹tโˆฉฮตeโ€‹xโ€‹p\varepsilon_{est}\cap\varepsilon_{exp}, we have for all ฯ€โˆˆฮ \pi\in\Pi, โ€–ฯ•^ฯ€,hโˆ’ฯ•ฯ€,hโ€–2โ‰คฯต/d\left\|\hat{\phi}_{\pi,h}-\phi_{\pi,h}\right\|_{2}\leq\epsilon/\sqrt{d}, thus we can upper bound:

inf๐šฒโˆˆฮฉhmaxฯ•โˆˆฮฆhโกโ€–ฯ•โ€–๐€โ€‹(๐šฒ)โˆ’12=\displaystyle\inf_{\mathbf{\Lambda}\in\Omega_{h}}\max_{\phi\in\Phi_{h}}\left\|\phi\right\|^{2}_{\mathbf{A}\left(\mathbf{\Lambda}\right)^{-1}}= inf๐šฒโˆˆฮฉhmaxฯ€โˆˆฮ โกโ€–ฯ•^ฯ€,hโ€–๐€โ€‹(๐šฒ)โˆ’12\displaystyle\inf_{\mathbf{\Lambda}\in\Omega_{h}}\max_{\pi\in\Pi}\left\|\hat{\phi}_{\pi,h}\right\|^{2}_{\mathbf{A}\left(\mathbf{\Lambda}\right)^{-1}}
โ‰ค\displaystyle\leq inf๐šฒโˆˆฮฉhmaxฯ€โˆˆฮ โก(2โ€‹โ€–ฯ•ฯ€,hโ€–๐€โ€‹(๐šฒ)โˆ’12+2โ€‹โ€–ฯ•^ฯ€,hโˆ’ฯ•ฯ€,hโ€–๐€โ€‹(๐šฒ)โˆ’12)\displaystyle\inf_{\mathbf{\Lambda}\in\Omega_{h}}\max_{\pi\in\Pi}\left(2\left\|{\phi}_{\pi,h}\right\|^{2}_{\mathbf{A}\left(\mathbf{\Lambda}\right)^{-1}}+2\left\|\hat{\phi}_{\pi,h}-\phi_{\pi,h}\right\|^{2}_{\mathbf{A}\left(\mathbf{\Lambda}\right)^{-1}}\right)
โ‰ค\displaystyle\leq inf๐šฒโˆˆฮฉhmaxฯ€โˆˆฮ โก(2โ€‹โ€–ฯ•ฯ€,hโ€–๐€โ€‹(๐šฒ)โˆ’12+2โ€‹ฯต2dโ€‹ฮปmโ€‹iโ€‹nโ€‹(๐€โ€‹(๐šฒ)))\displaystyle\inf_{\mathbf{\Lambda}\in\Omega_{h}}\max_{\pi\in\Pi}\left(2\left\|{\phi}_{\pi,h}\right\|^{2}_{\mathbf{A}\left(\mathbf{\Lambda}\right)^{-1}}+\frac{2\epsilon^{2}}{d\lambda_{min}\left(\mathbf{A}\left(\mathbf{\Lambda}\right)\right)}\right)
โ‰ค\displaystyle\leq inf๐šฒโˆˆฮฉhmaxฯ€โˆˆฮ โก2โ€‹โ€–ฯ•ฯ€,hโ€–๐šฒโˆ’12+2โ€‹ฯต2dโ€‹ฮปmโ€‹iโ€‹nโˆ—\displaystyle\inf_{\mathbf{\Lambda}\in\Omega_{h}}\max_{\pi\in\Pi}2\left\|{\phi}_{\pi,h}\right\|^{2}_{\mathbf{\Lambda}^{-1}}+\frac{2\epsilon^{2}}{d\lambda_{min}^{*}}

Thus, the total number of episodes is bounded as:

Cโ€‹H2โ€‹d3โ€‹โˆ‘h=1Hโˆ’1inf๐šฒโˆˆฮฉhmaxฯ€โˆˆฮ โกโ€–ฯ•ฯ€,hโ€–๐šฒโˆ’12ฯต2โ€‹logโกH2โ€‹dโ€‹|ฮ |ฮด+polyโ€‹(d,H,logโก1/ฮด,1ฮปmโ€‹iโ€‹nโˆ—,logโก|ฮ |,logโก1/ฯต).CH^{2}d^{3}\sum_{h=1}^{H-1}\frac{\inf_{\mathbf{\Lambda}\in\Omega_{h}}\max_{\pi\in\Pi}\left\|{\phi}_{\pi,h}\right\|^{2}_{\mathbf{\Lambda}^{-1}}}{\epsilon^{2}}\log\frac{H^{2}d|\Pi|}{\delta}+\mathrm{poly}\left(d,H,\log 1/\delta,\frac{1}{\lambda_{min}^{*}},\log|\Pi|,\log 1/\epsilon\right)\,. (21)

โˆŽ

From lemma B.10 in Wagenmaker and Jamieson [2022], we can bound:

inf๐šฒโˆˆฮฉhmaxฯ€โˆˆฮ โกโ€–ฯ•ฯ€,hโ€–๐šฒโˆ’12โ‰คd.\inf_{\mathbf{\Lambda}\in\Omega_{h}}\max_{\pi\in\Pi}\left\|{\phi}_{\pi,h}\right\|^{2}_{\mathbf{\Lambda}^{-1}}\leq d\,.

Thus we have:

Corollary 1.

The sample complexity in Algorithm 2 is bounded by:

๐’ชโ€‹(d4โ€‹H3ฯต2โ€‹logโกH2โ€‹dโ€‹|ฮ |ฮด+C1),\mathcal{O}\left(\frac{d^{4}H^{3}}{\epsilon^{2}}\log\frac{H^{2}d|\Pi|}{\delta}+C_{1}\right)\,,

where C1=polyโ€‹(d,H,logโก1/ฮด,1ฮปmโ€‹iโ€‹nโˆ—,logโก|ฮ |,logโก1/ฯต)C_{1}=\mathrm{poly}\left(d,H,\log 1/\delta,\frac{1}{\lambda_{min}^{*}},\log|\Pi|,\log 1/\epsilon\right).
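The bound \inf_{\mathbf{\Lambda}\in\Omega_{h}}\max_{\pi\in\Pi}\left\|{\phi}_{\pi,h}\right\|^{2}_{\mathbf{\Lambda}^{-1}}\leq d used above is a Kiefer–Wolfowitz-type optimal-design fact. The sketch below illustrates it on a random feature set with a generic multiplicative-update solver for the optimal design; it is not the data-collection procedure of Algorithm 2, and it optimizes over arbitrary distributions on the feature set rather than over policy-induced covariances.

# A small numeric illustration of the design bound quoted from lemma B.10:
# with an (approximately) optimal design w over the feature set, the largest
# value of ||phi||^2_{Lambda(w)^{-1}} approaches d.  The features are random
# placeholders and the multiplicative update is a generic D-optimal-design
# solver, not part of Algorithm 2.
import numpy as np

rng = np.random.default_rng(4)
d, n = 6, 40
Phi = rng.normal(size=(n, d))

w = np.full(n, 1.0 / n)                              # design: a distribution over the features
for _ in range(2000):
    A = (Phi.T * w) @ Phi                            # Lambda(w) = sum_i w_i phi_i phi_i^T
    g = np.einsum('ij,jk,ik->i', Phi, np.linalg.inv(A), Phi)   # ||phi_i||^2_{Lambda(w)^{-1}}
    w *= g / d                                       # multiplicative update; sum_i w_i g_i = d keeps w normalized

print(f"max_i ||phi_i||^2 under the final design = {g.max():.4f}  (Kiefer-Wolfowitz value: d = {d})")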

Appendix C Construct the Policy Set

In this section, we provide the analysis of the policy set \Pi that we construct. The construction follows directly from Appendix A.3 in Wagenmaker and Jamieson [2022], and we prove that it also works in MDPs with adversarial rewards. Our main result is stated in Lemma 19.

Lemma 18.

In the adversarial MDP setting, where the loss function changes in each episode, the best policy of the MDP \mathcal{M}(\mathcal{S},\mathcal{A},H,\left\{P_{h}\right\}_{h=1}^{H},\left\{\ell_{k}\right\}_{k=1}^{K}) over episodes 1 to K, among the set of all stationary policies, is the optimal policy of the MDP whose fixed loss function is the average loss. Denote this average MDP as \mathring{\mathcal{M}}(\mathcal{S},\mathcal{A},H,\left\{P_{h}\right\}_{h=1}^{H},\mathring{\ell}), with the same transition kernel and the average loss \mathring{\ell}=\frac{1}{K}\sum_{k=1}^{K}\ell_{k}.
That is:

ifฯ€โˆ—=argminฯ€โ€‹โˆ‘k=1KVkฯ€,\displaystyle\mathrm{if}\leavevmode\nobreak\ \leavevmode\nobreak\ \pi^{*}=\operatorname*{argmin}_{\pi}\sum_{k=1}^{K}V_{k}^{\pi}\,,
thenฯ€โˆ—=argminฯ€VฬŠฯ€.\displaystyle\mathrm{then}\leavevmode\nobreak\ \leavevmode\nobreak\ \pi^{*}=\operatorname*{argmin}_{\pi}\mathring{V}^{\pi}\,.

where \mathring{V} is the value function associated with the average MDP \mathring{\mathcal{M}}.

Proof.

Let ฯ„ฯ€=((s1,a1),(s2,a2),โ‹ฏโ€‹(sH,aH))\tau_{\pi}=\left((s_{1},a_{1}),(s_{2},a_{2}),\cdots(s_{H},a_{H})\right) be the trajectory generated by following policy ฯ€\pi through the MDP. Denote the occupancy measure ฮผhฯ€โ€‹(sh,ah)\mu_{h}^{\pi}(s_{h},a_{h}) as the probability of visiting state-action pair (sh,ah)\left(s_{h},a_{h}\right) under trajectory ฯ„ฯ€\tau_{\pi}, and ฮผฯ€=(ฮผ1ฯ€,ฮผ2ฯ€,โ‹ฏโ€‹ฮผHฯ€)\mu^{\pi}=\left(\mu_{1}^{\pi},\mu_{2}^{\pi},\cdots\mu_{H}^{\pi}\right).
For any stationary policy ฯ€\pi, we have:

Vฯ€=โˆ‘h=1Hโˆ‘(sh,ah)โˆˆ๐’ฎhร—๐’œhฮผhฯ€โ€‹(sh,ah)โ€‹โ„“kโ€‹(sh,ah).V^{\pi}=\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{S}_{h}\times\mathcal{A}_{h}}\mu_{h}^{\pi}(s_{h},a_{h})\ell_{k}(s_{h},a_{h})\,.

Since the two MDPs share the same transition kernel, the occupancy measures induced by any fixed policy are identical in both. So we have:

โˆ‘k=1KVkฯ€=โˆ‘k=1Kโˆ‘h=1Hโˆ‘(sh,ah)โˆˆ๐’ฎhร—๐’œhฮผhฯ€โ€‹(sh,ah)โ€‹โ„“kโ€‹(sh,ah)=โˆ‘h=1Hโˆ‘(sh,ah)โˆˆ๐’ฎhร—๐’œhฮผhฯ€โ€‹(sh,ah)โ€‹(โˆ‘k=1Klk,hโ€‹(sh,ah))=โˆ‘h=1Hโˆ‘(sh,ah)โˆˆ๐’ฎhร—๐’œhฮผhฯ€โ€‹(sh,ah)โ‹…Kโ€‹lฬŠโ€‹(sh,ah)=Kโ€‹VฬŠฯ€\begin{split}\sum_{k=1}^{K}V_{k}^{\pi}&=\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{S}_{h}\times\mathcal{A}_{h}}\mu_{h}^{\pi}(s_{h},a_{h})\ell_{k}(s_{h},a_{h})\\ &=\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{S}_{h}\times\mathcal{A}_{h}}\mu_{h}^{\pi}\left(s_{h},a_{h}\right)\left(\sum_{k=1}^{K}l_{k,h}(s_{h},a_{h})\right)\\ &=\sum_{h=1}^{H}\sum_{(s_{h},a_{h})\in\mathcal{S}_{h}\times\mathcal{A}_{h}}\mu_{h}^{\pi}(s_{h},a_{h})\cdot K\mathring{l}(s_{h},a_{h})\\ &=K\mathring{V}^{\pi}\end{split}

So ฯ€โˆ—\pi^{*} satisfies:

ฯ€โˆ—=argminฯ€โ€‹โˆ‘k=1KVkฯ€=argminฯ€VฬŠฯ€.\displaystyle\pi^{*}=\operatorname*{argmin}_{\pi}\sum_{k=1}^{K}V_{k}^{\pi}=\operatorname*{argmin}_{\pi}\mathring{V}^{\pi}\,.

โˆŽ
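The identity \sum_{k=1}^{K}V_{k}^{\pi}=K\mathring{V}^{\pi} used above is just a swap of summations through the occupancy measures, which can be checked numerically on a tiny tabular example. The MDP, the policy, and the loss sequence below are arbitrary placeholders.

# A toy numeric check of Lemma 18: with K different loss functions, the total
# value sum_k V_k^pi equals K times the value of pi under the averaged loss.
# The tabular MDP, the policy pi, and the losses are arbitrary placeholders.
import numpy as np

rng = np.random.default_rng(5)
S, A, H, K = 3, 2, 4, 10
P = rng.dirichlet(np.ones(S), size=(H, S, A))       # P_h(s'|s,a)
pi = rng.dirichlet(np.ones(A), size=(H, S))         # policy pi_h(a|s)
losses = rng.uniform(0.0, 1.0, size=(K, H, S, A))   # adversarial losses ell_{k,h}(s,a)
s0 = 0

def value(loss):
    """V^pi for a single loss function, computed via the occupancy measures mu_h^pi."""
    mu = np.zeros(S); mu[s0] = 1.0
    v = 0.0
    for h in range(H):
        occ = mu[:, None] * pi[h]                   # mu_h^pi(s, a)
        v += np.sum(occ * loss[h])
        mu = np.einsum('sa,sap->p', occ, P[h])      # state distribution at step h+1
    return v

lhs = sum(value(losses[k]) for k in range(K))
rhs = K * value(losses.mean(axis=0))
print(f"sum_k V_k^pi = {lhs:.6f},   K * V_ring^pi = {rhs:.6f}")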

Lemma 19.

Choose c to be an arbitrary constant; then we can construct a policy set \Pi for any linear adversarial MDP \mathcal{M}(\mathcal{S},\mathcal{A},H,\left\{P_{h}\right\}_{h=1}^{H},\left\{\ell_{k}\right\}_{k=1}^{K}) such that there exists a policy \pi\in\Pi whose regret, compared with the globally optimal policy, is bounded by 1:

\sum_{k=1}^{K}V_{k}^{\pi}-V_{k}^{\pi^{*}}\leq 1\,.

So that:

\mathrm{Reg}(K)=\sum_{k=1}^{K}V_{k}^{\pi_{k}}-V_{k}^{\pi^{*}}=\mathrm{Reg}\left(K;\Pi\right)+\min_{\pi\in\Pi}\sum_{k=1}^{K}\left(V_{k}^{\pi}-V_{k}^{\pi^{*}}\right)\leq\mathrm{Reg}\left(K;\Pi\right)+1\,.

and the size of ฮ \Pi is bounded as:

|ฮ |โ‰ค(1+32โ€‹K2โ€‹H4โ€‹d5/2โ€‹logโก(1+16โ€‹Hโ€‹dโ€‹K))dโ€‹H2,|\Pi|\leq\left(1+32K^{2}H^{4}d^{5/2}\log\left(1+16HdK\right)\right)^{dH^{2}}\,,

where dd is the dimension of the feature map.

Proof.

According to Lemma A.14 in Wagenmaker and Jamieson [2022], for any linear MDP \mathring{\mathcal{M}}(\mathcal{S},\mathcal{A},H,\left\{P_{h}\right\}_{h=1}^{H},\mathring{\ell}) with a fixed loss function \mathring{\ell}, we can construct a policy set \Pi such that there exists a policy \pi\in\Pi which approximates the best policy of \mathring{\mathcal{M}} with bias \left|\mathring{V}^{\pi}-\mathring{V}^{*}\right|\leq\epsilon^{\prime}. The size of the policy set is bounded as:

|ฮ |โ‰ค(1+32โ€‹H4โ€‹d5/2โ€‹logโก(1+16โ€‹Hโ€‹d/ฯตโ€ฒ)(ฯตโ€ฒ)2)dโ€‹H2.\displaystyle|\Pi|\leq\left(1+\frac{32H^{4}d^{5/2}\log\left(1+16Hd/\epsilon^{\prime}\right)}{\left(\epsilon^{\prime}\right)^{2}}\right)^{dH^{2}}\,. (22)

Notice that this construction is based entirely on the set of state-action features \phi\left(s,a\right) and requires no information about the loss or reward function. In the adversarial case, we choose \mathring{\mathcal{M}} to be the average MDP defined in Lemma 18, and obtain the regret bound of \pi over all K episodes:

\sum_{k=1}^{K}V_{k}^{\pi}-V_{k}^{\pi^{*}}=K\left(\mathring{V}^{\pi}-\mathring{V}^{\pi^{*}}\right)=K\left(\mathring{V}^{\pi}-\mathring{V}^{*}\right)\leq K\epsilon^{\prime}\,. (23)

The proof is finished by taking \epsilon^{\prime}=1/K in Equations (22) and (23).

โˆŽ