
Efficient Local Planning with Linear Function Approximation

Dong Yin (DeepMind), Botao Hao (DeepMind), Yasin Abbasi-Yadkori (DeepMind), Nevena Lazić (DeepMind), Csaba Szepesvári (DeepMind and University of Alberta)
Emails: {dongyin, bhao, yadkori, nevena, szepi}@google.com
Abstract

We study query and computationally efficient planning algorithms for discounted Markov decision processes (MDPs) with linear function approximation and a simulator. We assume that the agent has local access to the simulator, meaning that the simulator can be queried only at states that have been encountered in previous simulation steps. This is a more practical setting than the so-called random-access (or generative) setting, where the agent has a complete description of the state space and features and is allowed to query the simulator at any state of its choice. We propose two new algorithms for this setting, which we call confident Monte Carlo least-squares policy iteration (Confident MC-LSPI) and confident Monte Carlo Politex (Confident MC-Politex), respectively. The main novelty in our algorithms is that they gradually build a set of state-action pairs (a “core set”) with which they can control the extrapolation errors. Under the assumption that the action-value functions of all policies are linearly realizable with the given features, we show that our algorithms have query and computational costs that are polynomial in the dimension of the features, the effective planning horizon, and the targeted sub-optimality, while both costs remain independent of the size of the state space. Our result strengthens previous works by broadening their scope, either by weakening the assumption made on the power of the function approximator, or by weakening the requirement on the simulator and removing the need for being given an appropriate core set of states. An interesting technical contribution of our work is the introduction of a novel proof technique that makes use of a virtual policy iteration algorithm. We use this method to leverage existing results on \ell_{\infty}-bounded approximate policy iteration to show that our algorithms can learn a near-optimal policy for the given initial state with only local access to the simulator. We believe that this technique can be extended to broader settings beyond this work.

1 Introduction

Efficient planning lies at the heart of modern reinforcement learning (RL). In simulation-based RL, the agent has access to a simulator, which it can query with a state-action pair to obtain the reward of that pair and a sample of the next state. When planning with large state spaces in the presence of features, the agent can also compute the feature vector associated with a state or a state-action pair. Planning efficiency is measured in two ways: the query cost, i.e., the number of calls to the simulator, and the computational cost, i.e., the total number of logical and arithmetic operations that the agent uses. In Markov decision processes (MDPs) with a large state space, we call a planning algorithm query-efficient (computationally efficient) if its query (respectively, computational) cost is independent of the size of the state space and polynomial in the other parameters of the problem, such as the dimension of the feature space, the effective planning horizon, the number of actions, and the targeted sub-optimality.

Prior works on planning in MDPs often assume that the agent has access to a generative model which allows the agent to query the simulator with any arbitrary state-action pair (Kakade, 2003; Sidford et al., 2018; Yang and Wang, 2019; Lattimore et al., 2020). In what follows, we will call this the random access model. The random access model is often difficult to support. To illustrate this, consider a problem where the goal is to move the joints of a robot arm so that it moves objects around. The simulation state in this scenario is then completely described by the position, orientation and associated velocities of the various rigid objects involved. To access a state, a planner can then try to choose some values for each of the variables involved. Unfortunately, given only box constraints on the variable values (as is typically the case), a generic planner will often choose value combinations that are invalid based on physics, for example with objects penetrating each other in space. This problem is not specific to robotic applications but also arises in MDPs corresponding to combinatorial search, just to mention a second example.

To address this challenge, we replace the random access model with a local access model, in which the only states at which the agent can query the simulator are the initial states provided to the agent and states returned in response to previously issued queries. This access model can be implemented with any simulator that supports resetting its internal state to a previously stored state. This type of checkpointing is widely supported, and if a simulator does not support it, there are general techniques that can be applied to achieve this functionality. As such, this access model significantly expands the scope of planners.

Definition 1.1 (Local access to the simulator).

We say the agent has local access to the simulator if the agent is allowed to query the simulator with a state that the agent has previously seen paired with an arbitrary action.

Our work relies on linear function approximation. Very recently, Weisz et al. (2021b) showed that the linear realizability assumption on the optimal state-action value function (Q^{*}-realizability) alone is not sufficient to develop a query-efficient planner. In this paper, we assume linear realizability of the Q-functions of all policies (Q_{\pi}-realizability). We discuss several drawbacks of previous works (Lattimore et al., 2020; Du et al., 2020) under the same realizability assumption. First, these works require knowledge of the features of all state-action pairs; otherwise, the agent has to spend \mathcal{O}(|{\mathcal{S}}||\mathcal{A}|) queries to extract the features of all possible state-action pairs, where |{\mathcal{S}}| and |\mathcal{A}| are the sizes of the state space and action space, respectively. Second, these algorithms require the computation of either an approximation of the global optimal design (Lattimore et al., 2020) or a barycentric spanner (Du et al., 2020) of the matrix of all features. Although there exist algorithms to approximate the optimal design (Todd, 2016) or the barycentric spanner (Awerbuch and Kleinberg, 2008), their computational complexities are polynomial in the total number of possible feature vectors, i.e., |{\mathcal{S}}||\mathcal{A}|, which is impractical for large MDPs.

We summarize our contributions as follows:

  • With local access to the simulator, we propose two policy optimization algorithms: confident Monte Carlo least-squares policy iteration (Confident MC-LSPI) and its regularized (see, e.g., Even-Dar et al. (2009); Abbasi-Yadkori et al. (2019)) version, confident Monte Carlo Politex (Confident MC-Politex). Both of our algorithms maintain a core set of state-action pairs and run Monte Carlo rollouts from these pairs using the simulator. The algorithms use the rollout results to estimate the Q-function values and then apply policy improvement. During each rollout, whenever the algorithm observes a state-action pair that it is less confident about (i.e., with large uncertainty), it adds this pair to the core set and restarts. Compared to several prior works that use an additive bonus (Jin et al., 2020; Cai et al., 2020), our algorithm design demonstrates that in the local access setting, core-set-based exploration is an effective approach.

  • Under the Q_{\pi}-realizability assumption, we prove that both Confident MC-LSPI and Confident MC-Politex can learn a \kappa-optimal policy with query cost \operatorname{poly}(d,\frac{1}{1-\gamma},\frac{1}{\kappa},\log(\frac{1}{\delta}),\log(b)) and computational cost \operatorname{poly}(d,\frac{1}{1-\gamma},\frac{1}{\kappa},|\mathcal{A}|,\log(\frac{1}{\delta}),\log(b)), where d is the dimension of the features of state-action pairs, \gamma is the discount factor, \delta is the error probability, and b is the bound on the \ell_{2} norm of the linear coefficients of the Q-functions. In the presence of a model misspecification error \epsilon, we show that Confident MC-LSPI achieves a final sub-optimality of \widetilde{\mathcal{O}}(\frac{\epsilon\sqrt{d}}{(1-\gamma)^{2}}), whereas Confident MC-Politex can improve the sub-optimality to \widetilde{\mathcal{O}}(\frac{\epsilon\sqrt{d}}{1-\gamma}) with a higher query cost.

  • We develop a novel proof technique that makes use of a virtual policy iteration algorithm. We use this method to leverage existing results on approximate policy iteration which assume that, in each iteration, the approximation of the Q-function has a bounded \ell_{\infty} error (Munos, 2003; Farahmand et al., 2010) (see Section 5 for details).

2 Related work

Simulators or generative models have been considered in early studies of reinforcement learning (Kearns and Singh, 1999; Kakade, 2003). Recently, it has been shown empirically that in the local access setting, core-set-based exploration has strong performance in hard-exploration problems (Ecoffet et al., 2019). In this section, we mostly focus on related theoretical works. We distinguish among random access, local access, and online access.

  • Random access means that the agent is given a list of all possible state-action pairs and can query any of them to get the reward and a sample of the next state.

  • Local access means that the agent can access previously encountered states, which can be implemented with checkpointing. The local access model that we consider in this paper is a more practical version of planning with a simulator.

  • Online access means that the simulation state can only be reset to the initial state (or distribution) or moved to a next random state given an action. The online access setting is more restrictive compared to local access, since the agent can only follow the MDP dynamics during the learning process.

We also distinguish between offline and online planning. In the offline planning problem, the agent only has access to the simulator during the training phase, and once the training is finished, the agent outputs a policy and executes the policy in the environment without access to a simulator. This is the setting that we consider in this paper. On the other hand, in the online planning problem, the agent can use the simulator during both the training and inference phases, meaning that the agent can use the simulator to choose the action when executing the policy. Usually, online RL algorithms with sublinear regret can be converted to an offline planning algorithm under the online access model with standard online-to-batch conversion (Cesa-Bianchi et al., 2004). While most of the prior works that we discuss in this section are for the offline planning problem, the TensorPlan algorithm (Weisz et al., 2021a) considers online planning.

In terms of notation, some works consider finite-horizon MDPs, in which case we use H to denote the episode length (similar to the effective planning horizon (1-\gamma)^{-1} in infinite-horizon discounted MDPs). Our discussion mainly focuses on results with linear function approximation. We summarize some of the recent advances on efficient planning in large MDPs in Table 1.

Table 1: Recent advances on RL algorithms with linear function approximation under different assumptions. Positive results mean that the query cost depends only polynomially on the relevant parameters, while negative results refer to an exponential lower bound on the query complexity. CE stands for computational efficiency, and “no” for CE means that no computationally efficient algorithm is provided.
\dagger: The algorithms in these works are not query or computationally efficient unless the agent is provided with an approximate optimal design (Lattimore et al., 2020), a barycentric spanner (Du et al., 2020), or “core states” (Shariff and Szepesvári, 2020) for free.
\ddagger: Weisz et al. (2021a) consider the online planning problem, whereas the other works in this table consider (or can be converted to) the offline planning problem.
Positive Results | Assumption | CE | Access Model
Yang and Wang (2019) | linear MDP | yes | random access
Lattimore et al. (2020); Du et al. (2020) | Q_{\pi}-realizability | no \dagger | random access
Shariff and Szepesvári (2020) | V^{*}-realizability | no \dagger | random access
This work | Q_{\pi}-realizability | yes | local access
Weisz et al. (2021a) \ddagger | V^{*}-realizability, \mathcal{O}(1) actions | no | local access
Li et al. (2021) | Q^{*}-realizability, constant gap | yes | local access
Jiang et al. (2017) | low Bellman rank | no | online access
Zanette et al. (2020) | low inherent Bellman error | no | online access
Du et al. (2021) | bilinear class | no | online access
Lazic et al. (2021); Wei et al. (2021) | Q_{\pi}-realizability, feature excitation | yes | online access
Jin et al. (2020); Agarwal et al. (2020a) | linear MDP | yes | online access
Zhou et al. (2020); Cai et al. (2020) | linear mixture MDP | ? | online access
Negative Results | Assumption | CE | Access Model
Du et al. (2020) | Q_{\pi}-realizability, \epsilon=\Omega(\sqrt{H/d}) | N/A | random access
Weisz et al. (2021b) | Q^{*}-realizability, \exp(d) actions | N/A | random access
Wang et al. (2021) | Q^{*}-realizability, constant gap | N/A | online access

Random access

Theoretical guarantees for the random access model have been obtained in the tabular setting (Sidford et al., 2018; Agarwal et al., 2020b; Li et al., 2020; Azar et al., 2013). As for linear function approximation, different assumptions have been made for theoretical analysis. Under the linear MDP assumption, Yang and Wang (2019) derived an optimal \mathcal{O}(d\kappa^{-2}(1-\gamma)^{-3}) query complexity bound with a variance-reduced Q-learning type algorithm. Under Q_{\pi}-realizability of all deterministic policies (a strictly weaker assumption than linear MDP (Zanette et al., 2020)), Du et al. (2020) showed a negative result for settings with model misspecification error \epsilon=\Omega(\sqrt{H/d}) (see also Van Roy and Dong (2019); Lattimore et al. (2020)). When \epsilon=o((1-\gamma)^{2}/\sqrt{d}), assuming access to the full feature matrix, Lattimore et al. (2020) proposed algorithms with polynomial query cost, and Du et al. (2020) proposed a similar algorithm for the exact Q_{\pi}-realizability setting. Since these works need to find a globally optimal design or barycentric spanner, their computational costs depend polynomially on the size of the state space. Under the V^{*}-realizability assumption (i.e., the optimal value function is linear in some feature map), Shariff and Szepesvári (2020) proposed a planning algorithm that assumes the availability of a set of core states, but obtaining such core states can still be computationally inefficient. Zanette et al. (2019) proposed an algorithm that uses a similar concept, named anchor points, but only provided a greedy heuristic to generate these points. A notable negative result established by Weisz et al. (2021b) shows that with only Q^{*}-realizability, any agent requires \min(\exp(\Omega(d)),\exp(\Omega(H))) queries to learn an optimal policy.

Local access

Many prior studies have used simulators in tree-search style algorithms (Kearns et al., 2002; Munos, 2014). Under this setting, for the online planning problem, Weisz et al. (2021a) recently established an \mathcal{O}((dH/\kappa)^{|\mathcal{A}|}) query cost bound for learning a \kappa-optimal policy with the TensorPlan algorithm under V^{*}-realizability. Whenever the action set is small, TensorPlan is query efficient, but its computational efficiency is left as an open problem. Under Q^{*}-realizability and a constant sub-optimality gap, for the offline planning problem, Li et al. (2021) proposed an algorithm with \operatorname{poly}(d,H,\kappa^{-1},\Delta_{\text{gap}}^{-1}) query and computational costs.

Online access

As mentioned, many online RL algorithms can be converted to a policy optimization algorithm under the online access model using online-to-batch conversion. There is a large body of literature on online RL with linear function approximation and here we discuss a non-exhaustive list of prior works. Under the Q^{*}-realizability assumption, assuming that the probability transition of the MDP is deterministic, Wen and Van Roy (2013) proposed a sample and computationally efficient algorithm via the eluder dimension (Russo and Van Roy, 2013). Assuming the MDP has low Bellman rank, Jiang et al. (2017) proposed an algorithm that is sample efficient but computationally inefficient, and similar issues arise in Zanette et al. (2020) under the low inherent Bellman error assumption. Du et al. (2021) proposed a more general MDP class named bilinear class and provided a sample efficient algorithm, but the computational efficiency is unclear.

Under Q_{\pi}-realizability, several algorithms, such as Politex (Abbasi-Yadkori et al., 2019; Lazic et al., 2021), AAPI (Hao et al., 2021), and MDP-EXP2 (Wei et al., 2021), achieve sublinear regret in the infinite-horizon average-reward setting and are also computationally efficient. However, the corresponding analysis avoids the exploration issue by imposing a feature excitation assumption, which may not be satisfied in many problems. Under the linear MDP assumption, Jin et al. (2020) established an \mathcal{O}(\sqrt{d^{3}H^{3}T}) regret bound for an optimistic least-squares value iteration algorithm. Agarwal et al. (2020a) derived a \operatorname{poly}(d,H,\kappa^{-1}) sample cost bound for the policy cover-policy gradient algorithm, which can also be applied in the state aggregation setting; the algorithm and sample cost were subsequently improved in Zanette et al. (2021). Under the linear mixture MDP assumption (Yang and Wang, 2020; Zhou et al., 2020), Cai et al. (2020) proved an \mathcal{O}(\sqrt{d^{3}H^{3}T}) regret bound for an optimistic least-squares policy iteration (LSPI) type algorithm. A notable negative result for the online RL setting by Wang et al. (2021) shows that an exponentially large number of samples is needed if we only assume Q^{*}-realizability and a constant sub-optimality gap. Other related works include Ayoub et al. (2020); Jin et al. (2021); Du et al. (2019); Wang et al. (2019), and references therein.

3 Preliminaries

We use \Delta_{{\mathcal{S}}} to denote the set of probability distributions defined on the set {\mathcal{S}}. Consider an infinite-horizon discounted MDP specified by a tuple ({\mathcal{S}},\mathcal{A},r,P,\rho,\gamma), where {\mathcal{S}} is the state space, \mathcal{A} is the finite action space, r:{\mathcal{S}}\times\mathcal{A}\rightarrow[0,1] is the reward function, P:{\mathcal{S}}\times\mathcal{A}\rightarrow\Delta_{\mathcal{S}} is the probability transition kernel, \rho\in{\mathcal{S}} is the initial state, and \gamma\in(0,1) is the discount factor. For simplicity, in the main sections of this paper, we assume that the initial state \rho is deterministic and known to the agent. Our algorithms can also be extended to the setting where the initial state is random and the agent is allowed to sample from the initial state distribution; we discuss this extension in Appendix E. Throughout this paper, we write [N]:=\{1,2,\ldots,N\} for any positive integer N and use \log(\cdot) to denote the natural logarithm.

A policy \pi:{\mathcal{S}}\rightarrow\Delta_{\mathcal{A}} is a mapping from a state to a distribution over actions. We only consider stationary policies, i.e., they do not change according to the time step. The value function V_{\pi}(s) of a policy is the expected return when we start running the policy \pi from state s, i.e.,

V_{\pi}(s)=\mathbb{E}_{a_{t}\sim\pi(\cdot|s_{t}),\,s_{t+1}\sim P(\cdot|s_{t},a_{t})}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\mid s_{0}=s\right],

and the state-action value function Q_{\pi}(s,a), also known as the Q-function, is the expected return following policy \pi conditioned on s_{0}=s, a_{0}=a, i.e.,

Q_{\pi}(s,a)=\mathbb{E}_{s_{t+1}\sim P(\cdot|s_{t},a_{t}),\,a_{t+1}\sim\pi(\cdot|s_{t+1})}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\mid s_{0}=s,a_{0}=a\right].
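Since rewards lie in [0,1], these quantities can be estimated by truncated Monte Carlo rollouts, which is what the ConfidentRollout subroutine of Section 4 does. A short calculation (ours, not from the paper) bounds the truncation bias of such an estimate of Q_{\pi}(s,a) based on m rollouts of length n:

\widehat{Q}_{\pi}(s,a)=\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{n}\gamma^{t}r(s_{i,t},a_{i,t}), \qquad \big|\mathbb{E}[\widehat{Q}_{\pi}(s,a)]-Q_{\pi}(s,a)\big|=\Big|\mathbb{E}\Big[\sum_{t=n+1}^{\infty}\gamma^{t}r(s_{t},a_{t})\Big]\Big|\le\frac{\gamma^{n+1}}{1-\gamma}.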

We assume that the agent interacts with a simulator using the local access protocol defined in Definition 1.1, i.e., for any state s that the agent has visited and any action a\in\mathcal{A}, the agent can query the simulator to obtain a sample s^{\prime}\sim P(\cdot|s,a) and the reward r(s,a).

Our general goal is to find a policy that maximizes the expected return starting from the initial state \rho, i.e., \max_{\pi}V_{\pi}(\rho). We let \pi^{*} be the optimal policy, V^{*}(\cdot):=V_{\pi^{*}}(\cdot), and Q^{*}(\cdot,\cdot):=Q_{\pi^{*}}(\cdot,\cdot). We also aim to learn a good policy efficiently, i.e., the query and computational costs should not depend on the size of the state space {\mathcal{S}}, which can be large in many problems.

Linear function approximation

Let \phi:{\mathcal{S}}\times\mathcal{A}\to\mathbb{R}^{d} be a feature map which assigns to each state-action pair a d-dimensional feature vector. For any (s,a)\in{\mathcal{S}}\times\mathcal{A}, the agent can obtain \phi(s,a) with a computational cost of \operatorname{poly}(d). Here, we emphasize that the computation of the feature vectors does not lead to a query cost. Without loss of generality, we impose the following bounded features assumption.

Assumption 3.1 (Bounded features).

We assume that \|\phi(s,a)\|_{2}\leq 1 for all (s,a)\in{\mathcal{S}}\times\mathcal{A}.

We consider the following two different assumptions on the linear realizability of the Q-functions:

Assumption 3.2 (Q_{\pi}-realizability).

There exists b>0 such that for every policy \pi, there exists a weight vector w_{\pi}\in\mathbb{R}^{d}, \|w_{\pi}\|_{2}\leq b, that ensures Q_{\pi}(s,a)=w_{\pi}^{\top}\phi(s,a) for all (s,a)\in{\mathcal{S}}\times\mathcal{A}.

Assumption 3.3 (Approximate Q_{\pi}-realizability).

There exist b>0 and a model misspecification error \epsilon>0 such that for every policy \pi, there exists a weight vector w_{\pi}\in\mathbb{R}^{d}, \|w_{\pi}\|_{2}\leq b, that ensures |Q_{\pi}(s,a)-w_{\pi}^{\top}\phi(s,a)|\leq\epsilon for all (s,a)\in{\mathcal{S}}\times\mathcal{A}.

4 Algorithm

We first introduce some basic concepts used in our algorithms.

Core set

We use a concept called the core set. A core set \mathcal{C} is a set of tuples z=(s,a,\phi(s,a),q)\in{\mathcal{S}}\times\mathcal{A}\times\mathbb{R}^{d}\times(\mathbb{R}\cup\{\mathsf{none}\}). The first three elements in the tuple denote a state, an action, and the feature vector corresponding to the state-action pair, respectively. The last element q\in\mathbb{R} in the tuple denotes an estimate of Q_{\pi}(s,a) for a policy \pi. During the algorithm, we may not always have such an estimate, in which case we write q=\mathsf{none}. For a tuple z, we use z_{s}, z_{a}, z_{\phi}, and z_{q} to denote the s, a, \phi, and q coordinates of z, respectively. We note that in prior works, the core set usually consists of the state-action pairs and their features (Lattimore et al., 2020; Du et al., 2020; Shariff and Szepesvári, 2020); in this paper, for the convenience of notation, we also include the target values (Q-function estimates) in the core set elements. We denote by \Phi_{\mathcal{C}}\in\mathbb{R}^{|\mathcal{C}|\times d} the feature matrix of all the elements in \mathcal{C}, i.e., each row of \Phi_{\mathcal{C}} is the feature vector of an element in \mathcal{C}. Similarly, we define q_{\mathcal{C}}\in\mathbb{R}^{|\mathcal{C}|} as the vector of the Q_{\pi} estimates of all the tuples in \mathcal{C}.

Good set

It is also useful to introduce a notion of good set.

Definition 4.1.

Given \lambda,\tau>0 and a feature matrix \Phi_{\mathcal{C}}\in\mathbb{R}^{|\mathcal{C}|\times d}, the good set \mathcal{H}\subset\mathbb{R}^{d} is defined as

\mathcal{H}:=\{\phi\in\mathbb{R}^{d}:\phi^{\top}(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\phi\leq\tau\}.

Intuitively, the good set is a set of vectors that are well covered by the rows of \Phi_{\mathcal{C}}; in other words, these vectors are not closely aligned with the eigenvectors associated with the small eigenvalues of the covariance matrix of all the features in the core set.
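To make the membership test concrete, the following Python sketch (our own illustration; the class name CoreSet and its methods are not from the paper) maintains the regularized covariance matrix \Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I and checks whether a feature vector lies in the good set.

import numpy as np

class CoreSet:
    """Core set of (state, action, feature, q) tuples with a good-set test."""

    def __init__(self, d, lam, tau):
        self.lam = lam
        self.tau = tau
        self.tuples = []                 # list of (s, a, phi, q), q may be None
        self.cov = lam * np.eye(d)       # Phi_C^T Phi_C + lambda * I

    def uncertainty(self, phi):
        # phi^T (Phi_C^T Phi_C + lambda I)^{-1} phi
        return float(phi @ np.linalg.solve(self.cov, phi))

    def in_good_set(self, phi):
        return self.uncertainty(phi) <= self.tau

    def add(self, s, a, phi, q=None):
        self.tuples.append((s, a, phi, q))
        self.cov += np.outer(phi, phi)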

As an overview, our algorithm Confident MC-LSPI works as follows. First, we initialize the core set using the initial state \rho paired with all actions. Then, the algorithm runs least-squares policy iteration (Munos, 2003) to optimize the policy: in each iteration, we estimate the Q-function value of every state-action pair in \mathcal{C} using Monte Carlo rollouts with the simulator, learn a linear function that approximates the Q-function of the rollout policy, and choose the next policy to be greedy with respect to this linear function. Our second algorithm, Confident MC-Politex, works similarly, with the only difference being that instead of the greedy policy iteration update rule, we use the mirror descent update rule with KL regularization between adjacent rollout policies (Even-Dar et al., 2009; Abbasi-Yadkori et al., 2019). Moreover, in both algorithms, whenever we observe a state-action pair whose feature is not in the good set during a Monte Carlo rollout, we add the pair to the core set and restart the policy iteration process. We name the rollout subroutine ConfidentRollout. We discuss the details in the following.

4.1 Subroutine: ConfidentRollout

We first introduce the ConfidentRollout subroutine, whose purpose is to estimate Q_{\pi}(s_{0},a_{0}) for a given state-action pair (s_{0},a_{0}) using Monte Carlo rollouts. During a rollout, for each state s that we encounter and all actions a\in\mathcal{A}, the subroutine checks whether the feature vector \phi(s,a) is in the good set. If not, we know that we have discovered a new feature direction, i.e., a direction which is not well aligned with the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the core features. In this case the subroutine terminates and returns the tuple (s,a,\phi(s,a),\mathsf{none}) along with the \mathsf{uncertain} status. If the subroutine does not discover a new direction, it returns an estimate q of the desired value Q_{\pi}(s_{0},a_{0}) and the \mathsf{done} status. This subroutine is formally presented in Algorithm 1.

Algorithm 1 ConfidentRollout
1:  Input: number of rollouts m, length of rollout n, rollout policy \pi, discount \gamma, initial state s_{0}, initial action a_{0}, feature matrix \Phi_{\mathcal{C}}, regularization coefficient \lambda, threshold \tau.
2:  for i=1,\ldots,m do
3:     s_{i,0}\leftarrow s_{0}, a_{i,0}\leftarrow a_{0}, query the simulator, obtain reward r_{i,0}\leftarrow r(s_{i,0},a_{i,0}) and next state s_{i,1}.
4:     for t=1,\ldots,n do
5:        for a\in\mathcal{A} do
6:           Compute feature \phi(s_{i,t},a).
7:           if \phi(s_{i,t},a)^{\top}(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\phi(s_{i,t},a)>\tau then
8:              \mathsf{status}\leftarrow\mathsf{uncertain}, \mathsf{result}\leftarrow(s_{i,t},a,\phi(s_{i,t},a),\mathsf{none})
9:              return \mathsf{status},\mathsf{result}
10:           end if
11:        end for
12:        Sample a_{i,t}\sim\pi(\cdot|s_{i,t}).
13:        Query the simulator with (s_{i,t},a_{i,t}), obtain reward r_{i,t}\leftarrow r(s_{i,t},a_{i,t}) and next state s_{i,t+1}.
14:     end for
15:  end for
16:  \mathsf{status}\leftarrow\mathsf{done}, \mathsf{result}\leftarrow\frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{n}\gamma^{t}r_{i,t}
17:  return \mathsf{status},\mathsf{result}
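A Python sketch of this subroutine is given below (our own illustration, not the paper's code). It assumes the CoreSet class from the earlier sketch, a features(s, a) oracle, a policy(s) function that samples an action, and a simulator object with a step(s, a) method returning a reward and next state; none of these interfaces is specified by the paper.

def confident_rollout(core, simulator, features, actions, policy,
                      m, n, gamma, s0, a0):
    """Estimate Q_pi(s0, a0) with m truncated rollouts of length n.

    Returns ("uncertain", (s, a, phi, None)) if a feature outside the good set
    is found, otherwise ("done", q_estimate).
    """
    total = 0.0
    for _ in range(m):
        r, s_next = simulator.step(s0, a0)     # initial query at (s0, a0)
        ret = r
        for t in range(1, n + 1):
            for a_check in actions:            # check all actions at the new state
                phi = features(s_next, a_check)
                if not core.in_good_set(phi):
                    return "uncertain", (s_next, a_check, phi, None)
            a_next = policy(s_next)            # sample a_{i,t} ~ pi(.|s_{i,t})
            r, s_after = simulator.step(s_next, a_next)
            ret += gamma ** t * r
            s_next = s_after
        total += ret
    return "done", total / m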

4.2 Policy iteration

With the subroutine in place, we are now ready to present our main algorithms. Both of our algorithms maintain a core set \mathcal{C}. We first initialize the core set using the initial state \rho and all actions a\in\mathcal{A}. More specifically, we check all the feature vectors \phi(\rho,a), a\in\mathcal{A}; if a feature vector is not in the good set of the current core set, we add the tuple (\rho,a,\phi(\rho,a),\mathsf{none}) to the core set. Then we start the policy iteration process. Both algorithms start with an arbitrary initial policy \pi_{0} and run K iterations. Let \pi_{k-1} be the rollout policy in the k-th iteration. We estimate the state-action values of the state-action pairs in \mathcal{C} under the current policy \pi_{k-1}, i.e., Q_{\pi_{k-1}}(z_{s},z_{a}) for z\in\mathcal{C}, using ConfidentRollout. In this Q-function estimation procedure, we may encounter two scenarios:

  • (a)

    If the rollout subroutine always returns the \mathsf{done} status with an estimate of the state-action value, then once we finish the estimation for all the state-action pairs in \mathcal{C}, we estimate the Q-function of \pi_{k-1} using least squares with input features \Phi_{\mathcal{C}}, targets q_{\mathcal{C}}, and regularization coefficient \lambda. Let w_{k} be the solution to the least squares problem, i.e.,

    w_{k}=(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\Phi_{\mathcal{C}}^{\top}q_{\mathcal{C}}. (4.1)

    Then, for Confident MC-LSPI, we choose the rollout policy of the next iteration, i.e., \pi_{k}, to be the greedy policy with respect to the linear function w_{k}^{\top}\phi(s,a):

    \pi_{k}(a|s)=\operatorname{\mathds{1}}\big(a=\arg\max_{a^{\prime}\in\mathcal{A}}w_{k}^{\top}\phi(s,a^{\prime})\big). (4.2)

    For Confident MC-Politex, we construct a truncated Q-function Q_{k-1}:{\mathcal{S}}\times\mathcal{A}\mapsto[0,(1-\gamma)^{-1}] using a linear function with clipping:

    Q_{k-1}(s,a):=\Pi_{[0,(1-\gamma)^{-1}]}(w_{k}^{\top}\phi(s,a)), (4.3)

    where \Pi_{[a,b]}(x):=\min\{\max\{x,a\},b\}. The rollout policy of the next iteration is then

    \pi_{k}(a|s)\propto\exp\big(\alpha\sum_{j=1}^{k-1}Q_{j}(s,a)\big), (4.4)

    where \alpha>0 is an algorithm parameter. A Python sketch of this softmax update is given below, following this list.

  • (b)

    It could also happen that the ConfidentRollout subroutine returns the \mathsf{uncertain} status. In this case, we add the state-action pair with the new feature direction found by the subroutine to the core set and restart the policy iteration process with the latest core set.

As a final note, for Confident MC-LSPI we output the rollout policy of the last iteration, \pi_{K-1}, whereas for Confident MC-Politex we output a mixture policy \overline{\pi}_{K}, i.e., a policy chosen uniformly at random from \{\pi_{k}\}_{k=0}^{K-1}. The reason that Confident MC-Politex outputs a mixture policy is that Politex (Szepesvári, 2021) relies on the regret analysis of expert learning (Cesa-Bianchi and Lugosi, 2006), and to obtain a single output policy we need the standard online-to-batch conversion argument (Cesa-Bianchi et al., 2004). Our algorithms are formally presented in Algorithm 2. In the next section, we present theoretical guarantees for our algorithms.
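To make the Politex update concrete, the following Python sketch implements the policy of Eqs. (4.3) and (4.4) (our own illustration, not the paper's code; ws denotes the list of weight vectors defining the truncated Q-functions accumulated so far, and features(s, a) is an assumed feature oracle).

import numpy as np

def politex_policy(ws, features, actions, alpha, gamma):
    """Softmax of clipped cumulative Q-estimates, as in Eqs. (4.3)-(4.4)."""
    cap = 1.0 / (1.0 - gamma)

    def policy(s):
        scores = []
        for a in actions:
            phi = features(s, a)
            # sum of truncated Q_j(s, a) over the previous iterations
            total = sum(float(np.clip(w @ phi, 0.0, cap)) for w in ws)
            scores.append(alpha * total)
        scores = np.array(scores)
        probs = np.exp(scores - scores.max())   # numerically stable softmax
        probs /= probs.sum()
        idx = np.random.choice(len(actions), p=probs)
        return actions[idx]

    return policy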

Algorithm 2 Confident MC-LSPI / Politex
1:  Input: initial state \rho, initial policy \pi_{0}, number of iterations K, regularization coefficient \lambda, threshold \tau, discount \gamma, number of rollouts m, length of rollout n, Politex parameter \alpha.
2:  \mathcal{C}\leftarrow\emptyset  // Initialize core set.
3:  for a\in\mathcal{A} do
4:     if \mathcal{C}=\emptyset or \phi(\rho,a)^{\top}(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\phi(\rho,a)>\tau then
5:        \mathcal{C}\leftarrow\mathcal{C}\cup\{(\rho,a,\phi(\rho,a),\mathsf{none})\}
6:     end if
7:  end for
8:  z_{q}\leftarrow\mathsf{none},~\forall z\in\mathcal{C}  // Policy iteration starts. (*)
9:  for k=1,\ldots,K do
10:     for z\in\mathcal{C} do
11:        \mathsf{status},\mathsf{result}\leftarrow\textsc{ConfidentRollout}(m,n,\pi_{k-1},\gamma,z_{s},z_{a},\Phi_{\mathcal{C}},\lambda,\tau)
12:        if \mathsf{status}=\mathsf{done}, then z_{q}\leftarrow\mathsf{result}; else \mathcal{C}\leftarrow\mathcal{C}\cup\{\mathsf{result}\} and goto line (*)
13:     end for
14:     w_{k}\leftarrow(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\Phi_{\mathcal{C}}^{\top}q_{\mathcal{C}};  Q_{k-1}(s,a)\leftarrow\Pi_{[0,(1-\gamma)^{-1}]}(w_{k}^{\top}\phi(s,a)) (Politex only)
15:     \pi_{k}(a|s)\leftarrow\begin{cases}\operatorname{\mathds{1}}\big(a=\arg\max_{a^{\prime}\in\mathcal{A}}w_{k}^{\top}\phi(s,a^{\prime})\big),&\textsc{LSPI}\\ \exp\big(\alpha\sum_{j=1}^{k-1}Q_{j}(s,a)\big)/\sum_{a^{\prime}\in\mathcal{A}}\exp\big(\alpha\sum_{j=1}^{k-1}Q_{j}(s,a^{\prime})\big)&\textsc{Politex}\end{cases}
16:  end for
17:  return w_{K-1} for LSPI, or \overline{\pi}_{K}\sim\text{Unif}\{\pi_{k}\}_{k=0}^{K-1} for Politex.
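The following Python sketch summarizes the outer loop of Algorithm 2 for the LSPI variant only, including the restart when ConfidentRollout reports an uncertain direction. It is a hedged illustration of ours that reuses the CoreSet and confident_rollout sketches given earlier; simulator and features are assumed interfaces rather than part of the paper.

import numpy as np

def confident_mc_lspi(simulator, features, actions, rho, d,
                      K, m, n, gamma, lam, tau):
    """Outer loop of Algorithm 2 (LSPI variant), restarting on 'uncertain'."""
    core = CoreSet(d, lam, tau)
    for a in actions:                              # initialize core set at rho
        phi = features(rho, a)
        if not core.in_good_set(phi):
            core.add(rho, a, phi)

    while True:                                    # each pass is one "loop"
        w = np.zeros(d)                            # w = 0 gives an arbitrary pi_0
        restarted = False
        for _ in range(K):
            policy = lambda s, w=w: max(actions, key=lambda a: float(w @ features(s, a)))
            targets = []
            for (s, a, phi, _) in list(core.tuples):
                status, result = confident_rollout(core, simulator, features, actions,
                                                   policy, m, n, gamma, s, a)
                if status == "uncertain":          # new direction: grow core set, restart
                    core.add(*result)
                    restarted = True
                    break
                targets.append(result)
            if restarted:
                break
            Phi = np.array([t[2] for t in core.tuples])
            q = np.array(targets)
            w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ q)   # Eq. (4.1)
        if not restarted:
            return w                               # parameterizes the greedy policy pi_{K-1}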

5 Theoretical guarantees

In this section, we present theoretical guarantees for our algorithms. First, we have the following main result for Confident MC-LSPI.

Theorem 5.1 (Main result for Confident MC-LSPI).

If Assumption 3.2 holds, then for an arbitrarily small \kappa>0, by choosing \tau=1, \lambda=\frac{\kappa^{2}(1-\gamma)^{4}}{1024b^{2}}, n=\frac{3}{1-\gamma}\log(\frac{4(1+\log(1+\lambda^{-1})d)}{\kappa(1-\gamma)}), K=2+\frac{2}{1-\gamma}\log(\frac{3}{\kappa(1-\gamma)}), and m=4096\frac{d(1+\log(1+\lambda^{-1}))}{\kappa^{2}(1-\gamma)^{6}}\log(\frac{8Kd(1+\log(1+\lambda^{-1}))}{\delta}), we have with probability at least 1-\delta, the policy \pi_{K-1} that Confident MC-LSPI outputs satisfies

V^{*}(\rho)-V_{\pi_{K-1}}(\rho)\leq\kappa.

Moreover, the query and computational costs for the algorithm are \operatorname{poly}(d,\frac{1}{1-\gamma},\frac{1}{\kappa},\log(\frac{1}{\delta}),\log(b)) and \operatorname{poly}(d,\frac{1}{1-\gamma},\frac{1}{\kappa},|\mathcal{A}|,\log(\frac{1}{\delta}),\log(b)), respectively.

Alternatively, if Assumption 3.3 holds, then by choosing \tau=1, \lambda=\frac{\epsilon^{2}d}{b^{2}}, n=\frac{1}{1-\gamma}\log(\frac{1}{\epsilon(1-\gamma)}), K=2+\frac{1}{1-\gamma}\log\big(\frac{1}{\epsilon\sqrt{d}}\big), and m=\frac{1}{\epsilon^{2}(1-\gamma)^{2}}\log(\frac{8Kd(1+\log(1+\lambda^{-1}))}{\delta}), we have with probability at least 1-\delta, the policy \pi_{K-1} that Confident MC-LSPI outputs satisfies

V^{*}(\rho)-V_{\pi_{K-1}}(\rho)\leq\frac{74\epsilon\sqrt{d}}{(1-\gamma)^{2}}(1+\log(1+b^{2}\epsilon^{-2}d^{-1})).

Moreover, the query and computational costs for the algorithm are \operatorname{poly}(d,\frac{1}{1-\gamma},\frac{1}{\epsilon},\log(\frac{1}{\delta}),\log(b)) and \operatorname{poly}(d,\frac{1}{1-\gamma},\frac{1}{\epsilon},|\mathcal{A}|,\log(\frac{1}{\delta}),\log(b)), respectively.

We prove Theorem 5.1 in Appendix B. For Confident MC-Politex, since we output a mixture policy, we prove guarantees for the expected value of the mixture policy, i.e., V_{\overline{\pi}_{K}}:=\frac{1}{K}\sum_{k=0}^{K-1}V_{\pi_{k}}. We have the following result.

Theorem 5.2 (Main result for Confident MC-Politex).

If Assumption 3.2 holds, then for an arbitrarily small \kappa>0, by choosing \tau=1, \alpha=(1-\gamma)\sqrt{\frac{2\log(|\mathcal{A}|)}{K}}, \lambda=\frac{\kappa^{2}(1-\gamma)^{2}}{256b^{2}}, K=\frac{32\log(|\mathcal{A}|)}{\kappa^{2}(1-\gamma)^{4}}, n=\frac{1}{1-\gamma}\log(\frac{32\sqrt{d}(1+\log(1+\lambda^{-1}))}{(1-\gamma)^{2}\kappa}), and m=1024\frac{d(1+\log(1+\lambda^{-1}))}{\kappa^{2}(1-\gamma)^{4}}\log(\frac{8Kd(1+\log(1+\lambda^{-1}))}{\delta}), we have with probability at least 1-\delta, the mixture policy \overline{\pi}_{K} that Confident MC-Politex outputs satisfies

V^{*}(\rho)-V_{\overline{\pi}_{K}}(\rho)\leq\kappa.

Moreover, the query and computational costs for the algorithm are \operatorname{poly}(d,\frac{1}{1-\gamma},\frac{1}{\kappa},\log(\frac{1}{\delta}),\log(b)) and \operatorname{poly}(d,\frac{1}{1-\gamma},\frac{1}{\kappa},|\mathcal{A}|,\log(\frac{1}{\delta}),\log(b)), respectively.

Alternatively, if Assumption 3.3 holds, then by choosing \tau=1, \alpha=(1-\gamma)\sqrt{\frac{2\log(|\mathcal{A}|)}{K}}, \lambda=\frac{\epsilon^{2}d}{b^{2}}, K=\frac{2\log(|\mathcal{A}|)}{\epsilon^{2}d(1-\gamma)^{2}}, n=\frac{1}{1-\gamma}\log(\frac{1}{\epsilon(1-\gamma)}), and m=\frac{1}{\epsilon^{2}(1-\gamma)^{2}}\log(\frac{8Kd(1+\log(1+\lambda^{-1}))}{\delta}), we have with probability at least 1-\delta, the mixture policy \overline{\pi}_{K} that Confident MC-Politex outputs satisfies

V^{*}(\rho)-V_{\overline{\pi}_{K}}(\rho)\leq\frac{42\epsilon\sqrt{d}}{1-\gamma}(1+\log(1+b^{2}\epsilon^{-2}d^{-1})).

Moreover, the query and computational costs for the algorithm are \operatorname{poly}(d,\frac{1}{1-\gamma},\frac{1}{\epsilon},\log(\frac{1}{\delta}),\log(b)) and \operatorname{poly}(d,\frac{1}{1-\gamma},\frac{1}{\epsilon},|\mathcal{A}|,\log(\frac{1}{\delta}),\log(b)), respectively.

We prove Theorem 5.2 in Appendix D. Here, we first discuss the query and computational costs of both algorithms and then provide a sketch of our proof.

Query and computational costs

In our analysis, we say that we start a new loop whenever we start (or restart) the policy iteration process, i.e., go to line (*) in Algorithm 2. By definition, when we start a new loop, the size of the core set \mathcal{C} increases by 1. First, in Lemma 5.1 below, we show that the size of the core set never exceeds C_{\max}=\widetilde{\mathcal{O}}(d). Therefore, the total number of loops is at most C_{\max}. In each loop, we run K policy iterations; in each iteration, we run Algorithm 1 from at most C_{\max} points of the core set; and each time we run Algorithm 1, we query the simulator at most \mathcal{O}(mn) times. Thus, for both algorithms, the total number of queries is at most C_{\max}^{2}Kmn. Therefore, using the parameter choices in Theorems 5.1 and 5.2 and omitting logarithmic factors, we obtain the query costs of Confident MC-LSPI and Politex shown in Table 2. As we can see, when \epsilon=0, or when \epsilon\neq 0 but \epsilon=o(1/\sqrt{d}) (the regime we care about in this paper), the query cost of Confident MC-LSPI is lower than that of Politex. As for the computational cost, since our policy improvement steps only involve matrix multiplication and matrix inversion, the computational cost is also polynomial in the aforementioned factors. One thing to notice is that during the rollout process, in each step, the agent needs to compute the features of a state paired with all actions, so the computational cost depends linearly on |\mathcal{A}|; on the contrary, the query cost does not depend on |\mathcal{A}|, since in each step the agent only needs to query the simulator with the action sampled according to the policy.
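As a sanity check (our own bookkeeping, hiding logarithmic factors), plugging the parameter choices of Theorem 5.1 under Assumption 3.2 into the bound C_{\max}^{2}Kmn recovers the LSPI entry of Table 2:

C_{\max}^{2}Kmn = \widetilde{\mathcal{O}}(d^{2})\cdot\widetilde{\mathcal{O}}\Big(\frac{1}{1-\gamma}\Big)\cdot\widetilde{\mathcal{O}}\Big(\frac{d}{\kappa^{2}(1-\gamma)^{6}}\Big)\cdot\widetilde{\mathcal{O}}\Big(\frac{1}{1-\gamma}\Big) = \widetilde{\mathcal{O}}\Big(\frac{d^{3}}{\kappa^{2}(1-\gamma)^{8}}\Big).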

Sub-optimality

We also note that when Assumption 3.3 holds, i.e., \epsilon\neq 0, the sub-optimality of the output policy is \widetilde{\mathcal{O}}(\frac{\epsilon\sqrt{d}}{(1-\gamma)^{2}}) for LSPI and \widetilde{\mathcal{O}}(\frac{\epsilon\sqrt{d}}{1-\gamma}) for Politex. Therefore, in the presence of a model misspecification error, Confident MC-Politex achieves a better final sub-optimality than Confident MC-LSPI, although its query cost is higher.

Table 2: Comparison of Confident MC-LSPI and Politex
 | Query cost (\epsilon=0) | Query cost (\epsilon\neq 0) | Sub-optimality (\epsilon\neq 0)
LSPI | \widetilde{\mathcal{O}}\big(\frac{d^{3}}{\kappa^{2}(1-\gamma)^{8}}\big) | \widetilde{\mathcal{O}}\big(\frac{d^{2}}{\epsilon^{2}(1-\gamma)^{4}}\big) | \widetilde{\mathcal{O}}\big(\frac{\epsilon\sqrt{d}}{(1-\gamma)^{2}}\big)
Politex | \widetilde{\mathcal{O}}\big(\frac{d^{3}}{\kappa^{4}(1-\gamma)^{9}}\big) | \widetilde{\mathcal{O}}\big(\frac{d}{\epsilon^{4}(1-\gamma)^{5}}\big) | \widetilde{\mathcal{O}}\big(\frac{\epsilon\sqrt{d}}{1-\gamma}\big)

Proof sketch

We now discuss our proof strategy, focusing on LSPI for simplicity.

Step 1: Bound the size of the core set

The first step is to show that our algorithm terminates. This is equivalent to showing that the size of the core set \mathcal{C} cannot exceed a certain finite quantity, since whenever we receive the \mathsf{uncertain} status from ConfidentRollout, we increase the size of the core set by 1, go back to line (*) in Algorithm 2, and start a new loop. The following lemma shows that the size of the core set is always bounded, and thus the algorithm always terminates.

Lemma 5.1.

Under Assumption 3.1, the size of the core set \mathcal{C} will not exceed

C_{\max}:=\frac{e}{e-1}\frac{1+\tau}{\tau}d\left(\log\big(1+\frac{1}{\tau}\big)+\log\big(1+\frac{1}{\lambda}\big)\right).

This result first appears in Russo and Van Roy (2013) as the eluder dimension of the linear function class. We present the proof of this lemma in Appendix A for completeness.

Step 2: Virtual policy iteration

The next step is to analyze the gap between the value of the optimal policy and that of the policy \pi_{K-1} parameterized by the vector w_{K-1} that the algorithm outputs in the final loop, i.e., V^{*}(\rho)-V_{\pi_{K-1}}(\rho). For ease of exposition, here we only consider the case of a deterministic probability transition kernel P. Our full proof in Appendix B considers general stochastic dynamics.

To analyze our algorithm, we note that for approximate policy iteration (API) algorithms, if in every iteration (say the k-th iteration) we have an approximate Q-function that is close to the true Q-function of the rollout policy (say \pi_{k-1}) in \ell_{\infty} norm, i.e., \|Q_{k-1}-Q_{\pi_{k-1}}\|_{\infty}\leq\eta, then existing results (Munos, 2003; Farahmand et al., 2010) ensure that we can learn a good policy if in every iteration we choose the new policy to be greedy with respect to the approximate Q-function. However, since we only have local access to the simulator, we cannot obtain such an \ell_{\infty} guarantee. In fact, as we show in the proof, we can only ensure that when \phi(s,a) is in the good set \mathcal{H}, our linear function approximation is accurate, i.e., |Q_{k-1}(s,a)-Q_{\pi_{k-1}}(s,a)|\leq\eta, where Q_{k-1}(s,a)=w_{k}^{\top}\phi(s,a). To overcome the lack of an \ell_{\infty} guarantee, we introduce the notion of a virtual policy iteration algorithm. The virtual algorithm starts with the same initial policy \widetilde{\pi}_{0}=\pi_{0}. In the k-th iteration of the virtual algorithm, we assume that we have access to the true Q-function of the rollout policy \widetilde{\pi}_{k-1} whenever \phi(s,a)\notin\mathcal{H}, and construct

\widetilde{Q}_{k-1}(s,a)=\begin{cases}\widetilde{w}_{k}^{\top}\phi(s,a)&~\text{if}~\phi(s,a)\in\mathcal{H}\\ Q_{\widetilde{\pi}_{k-1}}(s,a)&~\text{otherwise},\end{cases}

where \widetilde{w}_{k} is the linear coefficient vector that the virtual algorithm learns in the same way as in Eq. (4.1). Then \widetilde{\pi}_{k} is chosen to be greedy with respect to \widetilde{Q}_{k-1}(s,a). In this way, we can ensure that \widetilde{Q}_{k-1}(s,a) is close to the true Q-function Q_{\widetilde{\pi}_{k-1}}(s,a) in \ell_{\infty} norm, and thus the output policy of the virtual algorithm, say \widetilde{\pi}_{K-1}, is good in the sense that V^{*}(\rho)-V_{\widetilde{\pi}_{K-1}}(\rho) is small.

To connect the output policy of the virtual algorithm to that of our actual algorithm, we note that, by definition, in the final loop of our algorithm, in any iteration, for any state s that the agent visits in ConfidentRollout and any action a\in\mathcal{A}, we have \phi(s,a)\in\mathcal{H}, since the subroutine never returns the \mathsf{uncertain} status. Further, because the initial state, the probability transition kernel, and the policies are all deterministic, the rollout trajectories of the virtual algorithm and our actual algorithm are identical in the final loop (the virtual algorithm never gets a chance to use the true Q-function Q_{\widetilde{\pi}_{k-1}}). With rollout length n, when we start from state \rho, the output policy of the virtual algorithm \widetilde{\pi}_{K-1} and that of our actual algorithm \pi_{K-1} take exactly the same actions for n steps, and thus |V_{\pi_{K-1}}(\rho)-V_{\widetilde{\pi}_{K-1}}(\rho)|\leq\frac{\gamma^{n+1}}{1-\gamma}, which implies that V^{*}(\rho)-V_{\pi_{K-1}}(\rho) is small. To extend this argument to the setting with stochastic transitions, we need a coupling argument, which we elaborate on in the Appendix.
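To spell out the last step (a sketch of our own), we use only that rewards lie in [0,1] and that the two trajectories from \rho agree up to step n:

|V_{\pi_{K-1}}(\rho)-V_{\widetilde{\pi}_{K-1}}(\rho)| \le \sum_{t=n+1}^{\infty}\gamma^{t} = \frac{\gamma^{n+1}}{1-\gamma}, \qquad\text{and therefore}\qquad V^{*}(\rho)-V_{\pi_{K-1}}(\rho) \le \big(V^{*}(\rho)-V_{\widetilde{\pi}_{K-1}}(\rho)\big)+\frac{\gamma^{n+1}}{1-\gamma}.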

6 Conclusion

We propose the Confident MC-LSPI and Confident MC-Politex algorithms for local planning with linear function approximation. Under the assumption that the Q-functions of all policies are linear in some features of the state-action pairs, we show that our algorithms are query and computationally efficient. We introduce a novel analysis technique based on a virtual policy iteration algorithm, which can be used to leverage existing guarantees on approximate policy iteration with \ell_{\infty}-bounded evaluation error. We use this technique to show that our algorithms can learn a near-optimal policy for the given initial state with only local access to the simulator. Future directions include extending our analysis technique to broader settings.

Acknowledgement

The authors would like to thank Gellért Weisz for helpful comments.

References

  • Abbasi-Yadkori et al. [2019] Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–3702. PMLR, 2019.
  • Agarwal et al. [2020a] Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. PC-PG: Policy cover directed exploration for provable policy gradient learning. arXiv preprint arXiv:2007.08459, 2020a.
  • Agarwal et al. [2020b] Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83. PMLR, 2020b.
  • Awerbuch and Kleinberg [2008] Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
  • Ayoub et al. [2020] Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.
  • Azar et al. [2013] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349, 2013.
  • Cai et al. [2020] Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283–1294. PMLR, 2020.
  • Cesa-Bianchi and Lugosi [2006] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • Cesa-Bianchi et al. [2004] Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
  • Du et al. [2019] Simon S Du, Yuping Luo, Ruosong Wang, and Hanrui Zhang. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. arXiv preprint arXiv:1906.06321, 2019.
  • Du et al. [2020] Simon S Du, Sham M Kakade, Ruosong Wang, and Lin Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020.
  • Du et al. [2021] Simon S Du, Sham M Kakade, Jason D Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. arXiv preprint arXiv:2103.10897, 2021.
  • Ecoffet et al. [2019] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
  • Even-Dar et al. [2009] Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
  • Farahmand et al. [2010] Amir Massoud Farahmand, Rémi Munos, and Csaba Szepesvári. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, 2010.
  • Hao et al. [2021] Botao Hao, Nevena Lazic, Yasin Abbasi-Yadkori, Pooria Joulani, and Csaba Szepesvári. Adaptive approximate policy iteration. In International Conference on Artificial Intelligence and Statistics, pages 523–531. PMLR, 2021.
  • Harville [1998] David A Harville. Matrix algebra from a statistician’s perspective, 1998.
  • Jiang et al. [2017] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
  • Jin et al. [2020] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  • Jin et al. [2021] Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. arXiv preprint arXiv:2102.00815, 2021.
  • Kakade [2003] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003.
  • Kearns and Singh [1999] Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems, pages 996–1002, 1999.
  • Kearns et al. [2002] Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine learning, 49(2):193–208, 2002.
  • Lattimore et al. [2020] Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pages 5662–5670. PMLR, 2020.
  • Lazic et al. [2021] Nevena Lazic, Dong Yin, Yasin Abbasi-Yadkori, and Csaba Szepesvari. Improved regret bound and experience replay in regularized policy iteration. arXiv preprint arXiv:2102.12611, 2021.
  • Li et al. [2020] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 33, 2020.
  • Li et al. [2021] Gen Li, Yuxin Chen, Yuejie Chi, Yuantao Gu, and Yuting Wei. Sample-efficient reinforcement learning is feasible for linearly realizable MDPs with limited revisiting. arXiv preprint arXiv:2105.08024, 2021.
  • Munos [2003] Rémi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, volume 3, pages 560–567, 2003.
  • Munos [2014] Rémi Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 2014.
  • Russo and Van Roy [2013] Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264. Citeseer, 2013.
  • Shariff and Szepesvári [2020] Roshan Shariff and Csaba Szepesvári. Efficient planning in large MDPs with weak linear function approximation. arXiv preprint arXiv:2007.06184, 2020.
  • Sidford et al. [2018] Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5192–5202, 2018.
  • Singh and Yee [1994] Satinder P Singh and Richard C Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
  • Szepesvári [2021] Csaba Szepesvári. RL Theory lecture notes: POLITEX. https://rltheory.github.io/lecture-notes/planning-in-mdps/lec14/, 2021.
  • Todd [2016] Michael J Todd. Minimum-volume ellipsoids: Theory and algorithms. SIAM, 2016.
  • Van Roy and Dong [2019] Benjamin Van Roy and Shi Dong. Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.
  • Wang et al. [2019] Yining Wang, Ruosong Wang, Simon S Du, and Akshay Krishnamurthy. Optimism in reinforcement learning with generalized linear function approximation. arXiv preprint arXiv:1912.04136, 2019.
  • Wang et al. [2021] Yuanhao Wang, Ruosong Wang, and Sham M Kakade. An exponential lower bound for linearly-realizable MDPs with constant suboptimality gap. arXiv preprint arXiv:2103.12690, 2021.
  • Wei et al. [2021] Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, and Rahul Jain. Learning infinite-horizon average-reward MDPs with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pages 3007–3015. PMLR, 2021.
  • Weisz et al. [2021a] Gellert Weisz, Philip Amortila, Barnabás Janzer, Yasin Abbasi-Yadkori, Nan Jiang, and Csaba Szepesvári. On query-efficient planning in MDPs under linear realizability of the optimal state-value function. arXiv preprint arXiv:2102.02049, 2021a.
  • Weisz et al. [2021b] Gellért Weisz, Philip Amortila, and Csaba Szepesvári. Exponential lower bounds for planning in MDPs with linearly-realizable optimal action-value functions. In Algorithmic Learning Theory, pages 1237–1264. PMLR, 2021b.
  • Wen and Van Roy [2013] Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. Advances in Neural Information Processing Systems, 26, 2013.
  • Yang and Wang [2019] Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.
  • Yang and Wang [2020] Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. International Conference on Machine Learning, 2020.
  • Zanette et al. [2019] Andrea Zanette, Alessandro Lazaric, Mykel J Kochenderfer, and Emma Brunskill. Limiting extrapolation in linear approximate value iteration. Advances in Neural Information Processing Systems, 32:5615–5624, 2019.
  • Zanette et al. [2020] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.
  • Zanette et al. [2021] Andrea Zanette, Ching-An Cheng, and Alekh Agarwal. Cautiously optimistic policy optimization and exploration with linear function approximation. In Conference on Learning Theory (COLT), 2021.
  • Zhou et al. [2020] Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. arXiv preprint arXiv:2006.13165, 2020.

Appendix

Appendix A Proof of Lemma 5.1

This proof essentially follows the proof of the upper bound for the eluder dimension of a linear function class in Russo and Van Roy [2013]. We present the proof here for completeness.

We restate the core set construction process in the following way with slightly different notation. We begin with Φ0=0\Phi_{0}=0. In the tt-th step, we have a core set with feature matrix Φt1(t1)×d\Phi_{t-1}\in\mathbb{R}^{(t-1)\times d}. Suppose that we can find ϕtd\phi_{t}\in\mathbb{R}^{d}, ϕt21\|\phi_{t}\|_{2}\leq 1, such that

ϕt(Φt1Φt1+λI)1ϕt>τ,\displaystyle\phi_{t}^{\top}(\Phi_{t-1}^{\top}\Phi_{t-1}+\lambda I)^{-1}\phi_{t}>\tau, (A.1)

then we let Φt:=[Φt1ϕt]t×d\Phi_{t}:=[\Phi_{t-1}^{\top}~{}~{}\phi_{t}]^{\top}\in\mathbb{R}^{t\times d}, i.e., we add a row at the bottom of Φt1\Phi_{t-1}. If we cannot find such ϕt\phi_{t}, we terminate this process. We define Σt:=ΦtΦt+λI\Sigma_{t}:=\Phi_{t}^{\top}\Phi_{t}+\lambda I. It is easy to see that Σ0=λI\Sigma_{0}=\lambda I and Σt=Σt1+ϕtϕt\Sigma_{t}=\Sigma_{t-1}+\phi_{t}\phi_{t}^{\top}.
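For concreteness, this construction can be simulated numerically. The following Python sketch (our own illustration; the helper build_core_set and the random features are hypothetical and not part of the paper's algorithms) adds a candidate feature exactly when the uncertainty condition (A.1) holds:

import numpy as np

def build_core_set(candidates, tau=1.0, lam=1.0):
    """Sketch of the construction above: a candidate phi is added whenever
    phi^T (Phi^T Phi + lam*I)^{-1} phi > tau, where Phi stacks the features added so far."""
    d = candidates[0].shape[0]
    sigma = lam * np.eye(d)                    # Sigma_t = Phi_t^T Phi_t + lam * I
    core = []
    for phi in candidates:
        if phi @ np.linalg.solve(sigma, phi) > tau:
            core.append(phi)
            sigma += np.outer(phi, phi)        # rank-one update after appending a row to Phi
    return core

# With unit-norm features, the number of additions never exceeds the bound C_max derived below.
rng = np.random.default_rng(0)
feats = [v / np.linalg.norm(v) for v in rng.normal(size=(2000, 8))]
print(len(build_core_set(feats, tau=1.0, lam=1.0)))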

According to the matrix determinant lemma [Harville, 1998], we have

det(Σt)\displaystyle\det(\Sigma_{t}) =(1+ϕtΣt11ϕt)det(Σt1)>(1+τ)det(Σt1)\displaystyle=(1+\phi_{t}^{\top}\Sigma_{t-1}^{-1}\phi_{t})\det(\Sigma_{t-1})>(1+\tau)\det(\Sigma_{t-1})
>>(1+τ)tdet(Σ0)=(1+τ)tλd,\displaystyle>\cdots>(1+\tau)^{t}\det(\Sigma_{0})=(1+\tau)^{t}\lambda^{d}, (A.2)

where the inequality is due to (A.1). Since det(Σt)\det(\Sigma_{t}) is the product of all the eigenvalues of Σt\Sigma_{t}, according to the AM-GM inequality, we have

det(Σt)(tr(Σt)d)d=(tr(i=1tϕiϕi)+tr(λI)d)d(td+λ)d,\displaystyle\det(\Sigma_{t})\leq\left(\frac{\mathop{\mathrm{tr}}(\Sigma_{t})}{d}\right)^{d}=\left(\frac{\mathop{\mathrm{tr}}(\sum_{i=1}^{t}\phi_{i}\phi_{i}^{\top})+\mathop{\mathrm{tr}}(\lambda I)}{d}\right)^{d}\leq(\frac{t}{d}+\lambda)^{d}, (A.3)

where in the second inequality we use the fact that ϕi21\|\phi_{i}\|_{2}\leq 1. Combining (A.2) and (A.3), we know that tt must satisfy

(1+τ)tλd<(td+λ)d,(1+\tau)^{t}\lambda^{d}<(\frac{t}{d}+\lambda)^{d},

which is equivalent to

(1+τ)td<tλd+1.\displaystyle(1+\tau)^{\frac{t}{d}}<\frac{t}{\lambda d}+1. (A.4)

We note that if t\leq d, the bound on the size of the core set in Lemma 5.1 holds automatically. Thus, we only consider the case t>d here. In this case, condition (A.4) implies

tdlog(1+τ)<log(1+tλd)<log(td(1+1λ))\displaystyle\frac{t}{d}\log(1+\tau)<\log(1+\frac{t}{\lambda d})<\log(\frac{t}{d}(1+\frac{1}{\lambda})) =log(td)+log(1+1λ)\displaystyle=\log(\frac{t}{d})+\log(1+\frac{1}{\lambda})
=log(tτd(1+τ))+log(1+ττ)+log(1+1λ).\displaystyle=\log\left(\frac{t\tau}{d(1+\tau)}\right)+\log(\frac{1+\tau}{\tau})+\log(1+\frac{1}{\lambda}). (A.5)

Using the fact that for any x>0x>0, log(1+x)>x1+x\log(1+x)>\frac{x}{1+x}, and that for any x>0x>0, log(x)xe\log(x)\leq\frac{x}{e}, we obtain

tτd(1+τ)<tτed(1+τ)+log(1+ττ)+log(1+1λ),\displaystyle\frac{t\tau}{d(1+\tau)}<\frac{t\tau}{ed(1+\tau)}+\log(\frac{1+\tau}{\tau})+\log(1+\frac{1}{\lambda}), (A.6)

which implies

t<ee11+ττd(log(1+1τ)+log(1+1λ)).t<\frac{e}{e-1}\frac{1+\tau}{\tau}d\left(\log(1+\frac{1}{\tau})+\log(1+\frac{1}{\lambda})\right).
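As a quick numerical sanity check (an illustrative snippet of our own, with arbitrary values of d, \tau, \lambda), one can verify that condition (A.4) indeed fails for every t>d at or above this bound, so the construction cannot produce a core set of that size:

import numpy as np

def core_set_size_bound(d, tau, lam):
    """The bound just derived: e/(e-1) * (1+tau)/tau * d * (log(1+1/tau) + log(1+1/lam))."""
    return np.e / (np.e - 1) * (1 + tau) / tau * d * (np.log(1 + 1 / tau) + np.log(1 + 1 / lam))

d, tau, lam = 8, 1.0, 0.1
t_star = int(np.ceil(core_set_size_bound(d, tau, lam)))
# For every t at or above the bound (and above d), condition (A.4) fails:
# (1 + tau)^(t/d) >= t/(lam*d) + 1.
assert all((1 + tau) ** (t / d) >= t / (lam * d) + 1 for t in range(t_star, 10 * t_star))
print(t_star)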

Appendix B Proof of Theorem 5.1

In this proof, we say that we start a new loop whenever we start (or restart) the policy iteration process, i.e., go to line (*) in Algorithm 2. In each loop, we have at most K iterations of policy iteration steps. By definition, we also know that whenever we start a new loop, the size of the core set \mathcal{C} has increased by 1 compared with the previous loop. We first introduce the notion of a virtual policy iteration algorithm. This virtual algorithm is designed to leverage the existing results on approximate policy iteration with \ell_{\infty}-bounded error in the approximate Q-functions [Munos, 2003, Farahmand et al., 2010]. We first present the details of the virtual algorithm, and then provide performance guarantees for the main algorithm.

B.1 Virtual approximate policy iteration with coupling

The virtual policy iteration algorithm is a virtual algorithm that we use only for the purpose of the proof. It is a version of approximate policy iteration (API) with a simulator. An important point is that the simulators of the virtual algorithm and the main algorithm need to be coupled, which we explain in this section.

The virtual algorithm is defined as follows. Unlike the main algorithm, the virtual algorithm runs exactly C_{\max} loops, where C_{\max} is the upper bound on the size of the core set defined in Lemma 5.1. In the virtual algorithm, we let the initial policy be the same as in the main algorithm, i.e., \widetilde{\pi}_{0}=\pi_{0}. Also unlike the main algorithm, the virtual algorithm runs exactly K iterations of policy iteration. In the k-th iteration (k\geq 1), the virtual algorithm uses a simulator to run rollouts with \widetilde{\pi}_{k-1}, which takes the form of Eq. (B.3), from each element of the core set \mathcal{C} (we discuss how the virtual algorithm constructs the core set later; \widetilde{Q}_{k-1} will be defined once we present the details of the virtual algorithm).

We now describe the rollout process of the virtual algorithm. We still use a subroutine similar to ConfidentRollout. The simulator of the virtual algorithm can still generate samples of the next state given a state-action pair according to the probability transition kernel P. The major difference from the main algorithm is that during the rollout process, when we find a state-action pair whose feature is outside of the good set \mathcal{H} (defined in Definition 4.1), i.e., (s,a) such that \phi(s,a)^{\top}(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\phi(s,a)>\tau, we do not terminate the subroutine; instead, we record this state-action pair along with its feature (we call it a recorded element) and keep running the rollout process using \widetilde{\pi}_{k-1}. Two situations can occur at the end of each loop: 1) we did not record any element, in which case we use the same core set \mathcal{C} in the next loop; or 2) we have at least one recorded element, in which case we add the first recorded element to the core set and discard the others. In other words, in each loop of the virtual algorithm, we find the first state-action pair (if any) whose feature is outside of the good set and add this pair to the core set. Another difference from the main algorithm is that in the virtual algorithm, we do not end the rollout subroutine when we identify an uncertain state-action pair, and as a result, the rollout subroutine in the virtual algorithm always returns an estimate of the Q-function.

We now proceed to present the virtual policy iteration process. In the kk-th iteration, the virtual algorithm runs mm trajectories of nn-step rollout using the policy π~k1\widetilde{\pi}_{k-1} from each element z𝒞z\in\mathcal{C}, obtains the empirical average of the discounted return zqz_{q} in the same way as in Algorithm 1. Then we concatenate them, obtain the vector q~𝒞\widetilde{q}_{\mathcal{C}}, and compute

w~k=(Φ𝒞Φ𝒞+λI)1Φ𝒞q~𝒞.\displaystyle\widetilde{w}_{k}=(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\Phi_{\mathcal{C}}^{\top}\widetilde{q}_{\mathcal{C}}. (B.1)

We use the notion of good set \mathcal{H} defined in Definition 4.1, and define the virtual Q-function as follows:

Q~k1(s,a):={w~kϕ(s,a),ϕ(s,a),Qπ~k1(s,a),ϕ(s,a),\displaystyle\widetilde{Q}_{k-1}(s,a):=\begin{cases}\widetilde{w}_{k}^{\top}\phi(s,a),&\phi(s,a)\in\mathcal{H},\\ Q_{\widetilde{\pi}_{k-1}}(s,a),&\phi(s,a)\notin\mathcal{H},\end{cases} (B.2)

by assuming access to the true Q-function Q_{\widetilde{\pi}_{k-1}}(s,a) when \phi(s,a)\notin\mathcal{H}. The next policy \widetilde{\pi}_{k} is defined as the greedy policy with respect to \widetilde{Q}_{k-1}(s,a), i.e.,

\widetilde{\pi}_{k}(a|s)=\operatorname{\mathds{1}}\left(a=\arg\max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}_{k-1}(s,a^{\prime})\right). (B.3)

Recall that for the main algorithm, once we learn the parameter vector wkw_{k}, the next policy πk\pi_{k} is greedy with respect to the linear function wkϕ(s,a)w_{k}^{\top}\phi(s,a), i.e.,

\pi_{k}(a|s)=\operatorname{\mathds{1}}\left(a=\arg\max_{a^{\prime}\in\mathcal{A}}w_{k}^{\top}\phi(s,a^{\prime})\right).

For comparison, the key difference is that when we observe a feature vector ϕ(s,a)\phi(s,a) that is not in the good set \mathcal{H}, our actual algorithm terminates the rollout and returns the state-action pair with the new direction, whereas the virtual algorithm uses the true Q-function of the state-action pair.
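To make the two updates concrete, a minimal sketch of the shared regression and greedy steps is given below (the helper names fit_weights and greedy_action are ours; the virtual algorithm would additionally replace the linear estimate by the true Q-function outside \mathcal{H}, which of course cannot be implemented with a real simulator):

import numpy as np

def fit_weights(Phi_C, q_C, lam):
    """Least-squares step of Eq. (B.1): w = (Phi_C^T Phi_C + lam*I)^{-1} Phi_C^T q_C."""
    d = Phi_C.shape[1]
    return np.linalg.solve(Phi_C.T @ Phi_C + lam * np.eye(d), Phi_C.T @ q_C)

def greedy_action(w, action_features):
    """Main-algorithm policy: greedy with respect to the linear estimate w^T phi(s, a)."""
    return int(np.argmax(action_features @ w))   # action_features has one row per action

# Toy usage with synthetic core-set data (only the shapes matter here).
rng = np.random.default_rng(0)
Phi_C, q_C = rng.normal(size=(20, 5)), rng.uniform(size=20)
w = fit_weights(Phi_C, q_C, lam=0.1)
print(greedy_action(w, rng.normal(size=(3, 5))))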

Coupling

The major remaining issue now is how the main algorithm is connected to the virtual algorithm. We describe this connection with a coupling argument. In a particular loop, for any positive integer NN, when the virtual algorithm makes its NN-th query in the kk-th iteration to the virtual simulator with a state-action pair, say (svirtual,avirtual)(s_{\text{virtual}},a_{\text{virtual}}), if the main algorithm has not returned due to encountering an uncertain state-action pair, we assume that at the same time the main algorithm also makes its NN-th query to the simulator, with a state-action pair, say (smain,amain)(s_{\text{main}},a_{\text{main}}). We let the two simulators be coupled: When they are queried with the same pair, i.e., (smain,amain)=(svirtual,avirtual)(s_{\text{main}},a_{\text{main}})=(s_{\text{virtual}},a_{\text{virtual}}), the next states that they return are also the same. In other words, the simulator for the main algorithm samples smainP(|smain,amain)s_{\text{main}}^{\prime}\sim P(\cdot|s_{\text{main}},a_{\text{main}}), and the virtual algorithm samples svirtualP(|svirtual,avirtual)s_{\text{virtual}}^{\prime}\sim P(\cdot|s_{\text{virtual}},a_{\text{virtual}}), and smains_{\text{main}}^{\prime} and svirtuals_{\text{virtual}}^{\prime} satisfy the joint distribution such that (smain=svirtual)=1\mathbb{P}\left(s_{\text{main}}^{\prime}=s_{\text{virtual}}^{\prime}\right)=1. In the cases where (smain,amain)(svirtual,avirtual)(s_{\text{main}},a_{\text{main}})\neq(s_{\text{virtual}},a_{\text{virtual}}) or the main algorithm has already returned due to the discovery of a new feature direction, the virtual algorithm samples from PP independently from the main algorithm. Note that this setup guarantees that both the virtual algorithm and the main algorithm have valid simulators which can sample from the same probability transition kernel PP.

There are a few direct consequences of this coupling design. First, since the virtual and main algorithms start with the same initial core set elements (constructed using the initial state), in any loop, when starting from the same core set element z, both algorithms have exactly the same rollout trajectories until the main algorithm identifies an uncertain state-action pair and returns. This is due to the coupling of the simulators and the fact that within the good set \mathcal{H}, the policies of the main algorithm and the virtual algorithm take the same action. We discuss this point further in Lemma B.5. Second, the core set elements that the virtual and main algorithms use are exactly the same in every loop. This is because when the main algorithm identifies an uncertain state-action pair, it adds the pair to the core set and starts a new loop, while the virtual algorithm adds only the first recorded element to the core set. Since the simulators are coupled, the first uncertain state-action pair that they encounter is the same, meaning that both algorithms always add the same element to the core set, until the main algorithm finishes its final loop. We note that the core set elements of our algorithm are stored as an ordered list, so the virtual and main algorithms always run rollouts with the same ordering of the core set elements. Another observation is that while the virtual algorithm runs a deterministic number of loops C_{\max}, the total number of loops that the main algorithm runs is a random variable whose value cannot exceed C_{\max}.
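The coupling can be pictured with the following minimal sketch (entirely illustrative; the class CoupledSimulators and its interface are our own assumptions, not part of Algorithm 1 or 2): both simulators draw from the same kernel P, and they share a random draw exactly when they are queried with the same state-action pair at the same query index.

import numpy as np

class CoupledSimulators:
    """Minimal sketch of the coupling between the main and virtual simulators."""

    def __init__(self, sample_next_state, seed=0):
        self.P = sample_next_state                      # callable: (s, a, rng) -> next state
        self.shared_rng = np.random.default_rng(seed)   # consumed only by coupled draws
        self.main_rng = np.random.default_rng(seed + 1)
        self.virtual_rng = np.random.default_rng(seed + 2)

    def step(self, main_query, virtual_query):
        # main_query is None once the main algorithm has already returned in this loop.
        if main_query is not None and main_query == virtual_query:
            s_next = self.P(*main_query, self.shared_rng)
            return s_next, s_next                       # coupled: identical next states
        s_main = self.P(*main_query, self.main_rng) if main_query is not None else None
        s_virtual = self.P(*virtual_query, self.virtual_rng)
        return s_main, s_virtual                        # independent draws otherwise

Because the main algorithm's policy agrees with the virtual policy inside the good set, the two query sequences coincide until an uncertain pair is found, so the coupled branch is the one exercised throughout the final loop.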

The next steps of the proof are the following:

  • We show that in each loop, with high probability, the virtual algorithm proceeds as an approximate policy iteration algorithm with a bounded \ell_{\infty} error in the approximate Q-function. Thus the virtual algorithm produces a good policy at the end of each loop. Then, since by Lemma 5.1, we have at most

    Cmax:=ee11+ττd(log(1+1τ)+log(1+1λ))\displaystyle C_{\max}:=\frac{e}{e-1}\frac{1+\tau}{\tau}d\left(\log(1+\frac{1}{\tau})+\log(1+\frac{1}{\lambda})\right) (B.4)

    loops, with a union bound, we know that with high probability, the virtual algorithm produces a good policy in all the loops.

  • We show that due to the coupling argument, the output parameter vectors of the main and the virtual algorithms, i.e., w_{K-1} and \widetilde{w}_{K-1} in the final loop, are the same. This leads to the conclusion that, for the given initial state \rho, the values of the output policies of the main algorithm and the virtual algorithm are close, and thus the main algorithm also outputs a good policy.

We prove these two points in Sections B.2 and B.3, respectively.

B.2 Analysis of the virtual algorithm

Throughout this section, we will consider a fixed loop of the virtual algorithm, say the \ell-th loop. We assume that at the beginning of this loop, the virtual algorithm has a core set 𝒞\mathcal{C}_{\ell}. Notice that 𝒞\mathcal{C}_{\ell} is a random variable that only depends on the randomness of the first 1\ell-1 loops. In this section, we will first condition on the randomness of all the first 1\ell-1 loops and only consider the randomness of the \ell-th loop. Thus we will first treat 𝒞\mathcal{C}_{\ell} as a deterministic quantity. For simplicity, we write 𝒞:=𝒞\mathcal{C}:=\mathcal{C}_{\ell}.

Consider the kk-th iteration of a particular loop of the virtual algorithm with core set 𝒞\mathcal{C}. We would like to bound Q~k1Qπ~k1\|\widetilde{Q}_{k-1}-Q_{\widetilde{\pi}_{k-1}}\|_{\infty}. First, we have the following lemma for the accuracy of the Q-function for any element in the core set. To simplify notation, in this lemma, we omit the subscript and use π\pi to denote a policy that we run rollout with in an arbitrary iteration of the virtual algorithm.

Lemma B.1.

Let π\pi be a policy that we run rollout with in an iteration of the virtual algorithm. Then, for any element z𝒞z\in\mathcal{C} and any θ>0\theta>0, we have with probability at least 12exp(2θ2(1γ)2m)1-2\exp(-2\theta^{2}(1-\gamma)^{2}m),

|zqQπ(zs,za)|γn+11γ+θ.\displaystyle|z_{q}-Q_{\pi}(z_{s},z_{a})|\leq\frac{\gamma^{n+1}}{1-\gamma}+\theta. (B.5)
Proof.

By the definition of Qπ(zs,za)Q_{\pi}(z_{s},z_{a}):

Qπ(zs,za)=𝔼st+1P(|st,at),at+1π(|st+1)[t=0γtr(st,at)s0=zs,a0=za],Q_{\pi}(z_{s},z_{a})=\mathbb{E}_{s_{t+1}\sim P(\cdot|s_{t},a_{t}),a_{t+1}\sim\pi(\cdot|s_{t+1})}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\mid s_{0}=z_{s},a_{0}=z_{a}\right],

and define the nn-step truncated Q-function:

Qπn(zs,za)=𝔼st+1P(|st,at),at+1π(|st+1)[t=0nγtr(st,at)s0=zs,a0=za].Q_{\pi}^{n}(z_{s},z_{a})=\mathbb{E}_{s_{t+1}\sim P(\cdot|s_{t},a_{t}),a_{t+1}\sim\pi(\cdot|s_{t+1})}\left[\sum_{t=0}^{n}\gamma^{t}r(s_{t},a_{t})\mid s_{0}=z_{s},a_{0}=z_{a}\right].

Then we have |Qπn(s,a)Qπ(s,a)|γn+11γ|Q_{\pi}^{n}(s,a)-Q_{\pi}(s,a)|\leq\frac{\gamma^{n+1}}{1-\gamma}. Moreover, the Q-function estimate zqz_{q} is an average of mm independent and unbiased estimates of Qπn(s,a)Q_{\pi}^{n}(s,a), which are all bounded in [0,1/(1γ)][0,1/(1-\gamma)]. By Hoeffding’s inequality we have with probability at least 12exp(2θ2(1γ)2m)1-2\exp(-2\theta^{2}(1-\gamma)^{2}m), |zqQπn(s,a)|θ|z_{q}-Q_{\pi}^{n}(s,a)|\leq\theta, which completes the proof. ∎
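A sketch of the estimator analyzed in this lemma is given below (the helper names next_state, reward, and policy are assumptions for illustration):

import numpy as np

def mc_q_estimate(next_state, reward, policy, s0, a0, gamma, n, m, rng):
    """Average of m independent n-step truncated discounted returns from (s0, a0),
    following `policy` after the first action. With rewards in [0, 1] each return lies
    in [0, 1/(1 - gamma)], so Hoeffding's inequality gives
    |z_q - Q_pi^n(s0, a0)| <= theta with probability >= 1 - 2 exp(-2 theta^2 (1-gamma)^2 m)."""
    total = 0.0
    for _ in range(m):
        s, a, ret, disc = s0, a0, 0.0, 1.0
        for _ in range(n + 1):              # t = 0, 1, ..., n
            ret += disc * reward(s, a)
            s = next_state(s, a, rng)
            a = policy(s, rng)
            disc *= gamma
        total += ret
    return total / m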

By a union bound over the |𝒞||\mathcal{C}| elements in the core set, we know that

(z𝒞,|zqQπ~k1(zs,za)|γn+11γ+θ)12Cmaxexp(2θ2(1γ)2m).\displaystyle\mathbb{P}\left(\forall~{}z\in\mathcal{C},|z_{q}-Q_{\widetilde{\pi}_{k-1}}(z_{s},z_{a})|\leq\frac{\gamma^{n+1}}{1-\gamma}+\theta\right)\geq 1-2C_{\max}\exp(-2\theta^{2}(1-\gamma)^{2}m). (B.6)

The following lemma provides a bound on |Q~k1(s,a)Qπ~k1(s,a)||\widetilde{Q}_{k-1}(s,a)-Q_{\widetilde{\pi}_{k-1}}(s,a)|, (s,a)\forall~{}(s,a) such that ϕ(s,a)\phi(s,a)\in\mathcal{H}.

Lemma B.2.

Suppose that Assumption 3.3 holds. Then, with probability at least

12Cmaxexp(2θ2(1γ)2m),1-2C_{\max}\exp(-2\theta^{2}(1-\gamma)^{2}m),

for any (s,a)(s,a) pair such that ϕ(s,a)\phi(s,a)\in\mathcal{H}, we have

|Q~k1(s,a)Qπ~k1(s,a)|bλτ+(ϵ+γn+11γ+θ)τCmax+ϵ:=η.\displaystyle|\widetilde{Q}_{k-1}(s,a)-Q_{\widetilde{\pi}_{k-1}}(s,a)|\leq b\sqrt{\lambda\tau}+\big{(}\epsilon+\frac{\gamma^{n+1}}{1-\gamma}+\theta\big{)}\sqrt{\tau C_{\max}}+\epsilon:=\eta. (B.7)

We prove this lemma in Appendix C. Since \widetilde{Q}_{k-1}(s,a)=Q_{\widetilde{\pi}_{k-1}}(s,a) whenever \phi(s,a)\notin\mathcal{H}, we know that \|\widetilde{Q}_{k-1}-Q_{\widetilde{\pi}_{k-1}}\|_{\infty}\leq\eta. With another union bound over the K iterations, we know that with probability at least

12KCmaxexp(2θ2(1γ)2m),1-2KC_{\max}\exp(-2\theta^{2}(1-\gamma)^{2}m),

the virtual algorithm is an approximate policy iteration algorithm with an \ell_{\infty} bound of \eta on the approximation error of the Q-functions. We use the following result for API, which is a direct consequence of the results in Munos [2003], Farahmand et al. [2010], and is also stated in Lattimore et al. [2020].

Lemma B.3.

Suppose that we run KK approximate policy iterations and generate a sequence of policies π0,π1,,πK\pi_{0},\pi_{1},\ldots,\pi_{K}. Suppose that for every k=1,2,,Kk=1,2,\ldots,K, in the kk-th iteration, we obtain a function Q~k1\widetilde{Q}_{k-1} such that, Q~k1Qπk1η\|\widetilde{Q}_{k-1}-Q_{\pi_{k-1}}\|_{\infty}\leq\eta, and choose πk\pi_{k} to be greedy with respect to Q~k1\widetilde{Q}_{k-1}. Then

QQπK2η1γ+γK1γ.\|Q^{*}-Q_{\pi_{K}}\|_{\infty}\leq\frac{2\eta}{1-\gamma}+\frac{\gamma^{K}}{1-\gamma}.

According to Lemma B.3,

QQπ~K22η1γ+γK21γ.\|Q^{*}-Q_{\widetilde{\pi}_{K-2}}\|_{\infty}\leq\frac{2\eta}{1-\gamma}+\frac{\gamma^{K-2}}{1-\gamma}. (B.8)

Then, since Qπ~K2Q~K2η\|Q_{\widetilde{\pi}_{K-2}}-\widetilde{Q}_{K-2}\|_{\infty}\leq\eta, we know that

QQ~K23η1γ+γK21γ.\displaystyle\|Q^{*}-\widetilde{Q}_{K-2}\|_{\infty}\leq\frac{3\eta}{1-\gamma}+\frac{\gamma^{K-2}}{1-\gamma}. (B.9)

The following lemma translates the gap in Q-functions to the gap in value.

Lemma B.4.

[Singh and Yee, 1994] Let π\pi be greedy with respect to a function QQ. Then for any state ss,

V(s)Vπ(s)21γQQ.V^{*}(s)-V_{\pi}(s)\leq\frac{2}{1-\gamma}\|Q^{*}-Q\|_{\infty}.

Since π~K1\widetilde{\pi}_{K-1} is greedy with respect to Q~K2\widetilde{Q}_{K-2}, we know that

V(ρ)Vπ~K1(ρ)6η(1γ)2+2γK2(1γ)2.\displaystyle V^{*}(\rho)-V_{\widetilde{\pi}_{K-1}}(\rho)\leq\frac{6\eta}{(1-\gamma)^{2}}+\frac{2\gamma^{K-2}}{(1-\gamma)^{2}}. (B.10)

We notice that this result is obtained by conditioning on all the previous \ell-1 loops and only considering the randomness of the \ell-th loop. More specifically, given any core set \mathcal{C}_{\ell} at the beginning of the \ell-th loop, we have

(V(ρ)Vπ~K1(ρ)6η(1γ)2+2γK2(1γ)2𝒞)12KCmaxexp(2θ2(1γ)2m).\mathbb{P}\left(V^{*}(\rho)-V_{\widetilde{\pi}_{K-1}}(\rho)\leq\frac{6\eta}{(1-\gamma)^{2}}+\frac{2\gamma^{K-2}}{(1-\gamma)^{2}}\mid\mathcal{C}_{\ell}\right)\geq 1-2KC_{\max}\exp(-2\theta^{2}(1-\gamma)^{2}m).

By the law of total probability, we have

(V(ρ)Vπ~K1(ρ)6η(1γ)2+2γK2(1γ)2)\displaystyle\mathbb{P}\left(V^{*}(\rho)-V_{\widetilde{\pi}_{K-1}}(\rho)\leq\frac{6\eta}{(1-\gamma)^{2}}+\frac{2\gamma^{K-2}}{(1-\gamma)^{2}}\right)
=\displaystyle= 𝒞(V(ρ)Vπ~K1(ρ)6η(1γ)2+2γK2(1γ)2𝒞)(𝒞)\displaystyle\sum_{\mathcal{C}_{\ell}}\mathbb{P}\left(V^{*}(\rho)-V_{\widetilde{\pi}_{K-1}}(\rho)\leq\frac{6\eta}{(1-\gamma)^{2}}+\frac{2\gamma^{K-2}}{(1-\gamma)^{2}}\mid\mathcal{C}_{\ell}\right)\mathbb{P}\left(\mathcal{C}_{\ell}\right)
\displaystyle\geq 12KCmaxexp(2θ2(1γ)2m)𝒞(𝒞)\displaystyle 1-2KC_{\max}\exp(-2\theta^{2}(1-\gamma)^{2}m)\sum_{\mathcal{C}_{\ell}}\mathbb{P}\left(\mathcal{C}_{\ell}\right)
=\displaystyle= 12KCmaxexp(2θ2(1γ)2m).\displaystyle 1-2KC_{\max}\exp(-2\theta^{2}(1-\gamma)^{2}m).

With another union bound over the CmaxC_{\max} loops of the virtual algorithm, we know that with probability at least

12KCmax2exp(2θ2(1γ)2m),\displaystyle 1-2KC^{2}_{\max}\exp(-2\theta^{2}(1-\gamma)^{2}m), (B.11)

Eq. (B.10) holds for all the loops. We call this event 1\mathcal{E}_{1} in the following.

B.3 Analysis of the main algorithm

We now move to the analysis of the main algorithm. Throughout this section, when we mention the final loop, we mean the final loop of the main algorithm, which may not be the final loop of the virtual algorithm. We have the following result.

Lemma B.5.

In the final loop of the main algorithm, all the rollout trajectories in the virtual algorithm are exactly the same as those in the main algorithm, and therefore wk=w~kw_{k}=\widetilde{w}_{k} for all 1kK1\leq k\leq K.

Proof.

We notice that since we only consider the final loop of the main algorithm, in any iteration, for any state s that appears in the rollout trajectories of the main algorithm and any action a\in\mathcal{A}, we have \phi(s,a)\in\mathcal{H}; otherwise the main algorithm would have identified an uncertain state-action pair, added it to the core set, and started a new loop. In the first iteration, since \pi_{0}=\widetilde{\pi}_{0} and the simulators are coupled, all the rollout trajectories are the same between the main algorithm and the virtual algorithm; as a result, all the Q-function estimates are the same, and thus w_{1}=\widetilde{w}_{1}. If w_{k}=\widetilde{w}_{k}, then by the definition in (B.2), the policies \pi_{k} and \widetilde{\pi}_{k} take the same action at any state s for which \phi(s,a)\in\mathcal{H} for all a\in\mathcal{A}. Again using the fact that the simulators are coupled, the rollout trajectories generated by \pi_{k} and \widetilde{\pi}_{k} are the same between the main algorithm and the virtual algorithm, and thus w_{k+1}=\widetilde{w}_{k+1}. ∎

Since \|\phi(s,a)\|_{2}\leq 1 for all s,a, we can verify that if we set \tau\geq 1, then after adding a state-action pair (s,a) to the core set, its feature vector \phi(s,a) stays in the good set \mathcal{H}. Recall that in the core set initialization stage of Algorithm 2, if for an action a\in\mathcal{A} the feature \phi(\rho,a) is not in \mathcal{H}, we add (\rho,a) to \mathcal{C}. Thus, after the core set initialization stage, we have \phi(\rho,a)\in\mathcal{H} for all a, and hence \pi_{K-1}(\rho)=\widetilde{\pi}_{K-1}(\rho):=a_{\rho}. Moreover, according to Lemma B.2, we also know that when \mathcal{E}_{1} happens,

|Vπ~K1(ρ)w~Kϕ(ρ,aρ)|=|Qπ~K1(ρ,aρ)w~Kϕ(ρ,aρ)|η.\displaystyle|V_{\widetilde{\pi}_{K-1}}(\rho)-\widetilde{w}_{K}^{\top}\phi(\rho,a_{\rho})|=|Q_{\widetilde{\pi}_{K-1}}(\rho,a_{\rho})-\widetilde{w}_{K}^{\top}\phi(\rho,a_{\rho})|\leq\eta. (B.12)

In the following, we bound the difference between the values of the output policy of the main algorithm \pi_{K-1} and the output policy of the virtual algorithm \widetilde{\pi}_{K-1} in the final loop of the main algorithm, i.e., |V_{\pi_{K-1}}(\rho)-V_{\widetilde{\pi}_{K-1}}(\rho)|. To do this, we use another auxiliary virtual policy iteration algorithm, which we call virtual-2 in the following. Virtual-2 is similar to the virtual policy iteration algorithm in Appendix B.1. The simulator of virtual-2 is coupled with that of the virtual algorithm, and virtual-2 uses the same initial policy \widehat{\pi}_{0}:=\pi_{0} as the main algorithm. Virtual-2 also uses Monte Carlo rollouts with the simulator to obtain the estimated Q-function values \widehat{q}_{\mathcal{C}}, and the linear regression coefficients are computed in the same way as (B.1), i.e., \widehat{w}_{k}=(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\Phi_{\mathcal{C}}^{\top}\widehat{q}_{\mathcal{C}}. The virtual-2 algorithm also conducts the uncertainty check in the rollout subroutine. Similar to the virtual algorithm, when it identifies an uncertain state-action pair, it records the pair and keeps running the rollout process. At the end of each loop, the virtual-2 algorithm adds the first recorded element to the core set and discards the other recorded elements. The only difference is that in virtual-2, we choose the virtual Q-function to be \widehat{Q}_{k-1}(s,a):=\widehat{w}_{k}^{\top}\phi(s,a) for all (s,a)\in{\mathcal{S}}\times\mathcal{A}. Using the same arguments as in Appendix B.2, we know that with probability at least 1-2KC_{\max}^{2}\exp(-2\theta^{2}(1-\gamma)^{2}m), for all the loops and all the policy iteration steps in every loop, we have |\widehat{Q}_{k-1}(s,a)-Q_{\widehat{\pi}_{k-1}}(s,a)|\leq\eta for all (s,a) such that \phi(s,a)\in\mathcal{H}. We call this event \mathcal{E}_{2}. Since the simulator of virtual-2 is also coupled with that of the main algorithm, by the same argument as in Lemma B.5, we know that in the last iteration of the final loop of the main algorithm, we have \widehat{\pi}_{K-1}=\pi_{K-1} and \widehat{w}_{K}=w_{K}. We also know that when event \mathcal{E}_{2} happens, in the last iteration of all the loops of virtual-2,

|Vπ^K1(ρ)w^Kϕ(ρ,aρ)|η.\displaystyle|V_{\widehat{\pi}_{K-1}}(\rho)-\widehat{w}_{K}^{\top}\phi(\rho,a_{\rho})|\leq\eta. (B.13)

Therefore, when both events 1\mathcal{E}_{1} and 2\mathcal{E}_{2} happen, combining (B.12) and (B.13), and using the fact that w~K=wK=w^K\widetilde{w}_{K}=w_{K}=\widehat{w}_{K}, we know that

|VπK1(ρ)Vπ~K1(ρ)|=|Vπ^K1(ρ)Vπ~K1(ρ)||Vπ^K1(ρ)w^Kϕ(ρ,aρ)|+|w^Kϕ(ρ,aρ)w~Kϕ(ρ,aρ)|+|w~Kϕ(ρ,aρ)Vπ~K1(ρ)|η+0+η=2η.\begin{split}&|V_{\pi_{K-1}}(\rho)-V_{\widetilde{\pi}_{K-1}}(\rho)|=|V_{\widehat{\pi}_{K-1}}(\rho)-V_{\widetilde{\pi}_{K-1}}(\rho)|\\ \leq&|V_{\widehat{\pi}_{K-1}}(\rho)-\widehat{w}_{K}^{\top}\phi(\rho,a_{\rho})|+|\widehat{w}_{K}^{\top}\phi(\rho,a_{\rho})-\widetilde{w}_{K}^{\top}\phi(\rho,a_{\rho})|+|\widetilde{w}_{K}^{\top}\phi(\rho,a_{\rho})-V_{\widetilde{\pi}_{K-1}}(\rho)|\\ \leq&\eta+0+\eta=2\eta.\end{split}

Combining this fact with (B.10) and using a union bound, we know that with probability at least

14KCmax2exp(2θ2(1γ)2m),\displaystyle 1-4KC_{\max}^{2}\exp(-2\theta^{2}(1-\gamma)^{2}m), (B.14)

with CmaxC_{\max} defined as in (B.4), we have

V(ρ)VπK1(ρ)8η(1γ)2+2γK2(1γ)2.\displaystyle V^{*}(\rho)-V_{{\pi}_{K-1}}(\rho)\leq\frac{8\eta}{(1-\gamma)^{2}}+\frac{2\gamma^{K-2}}{(1-\gamma)^{2}}. (B.15)

Finally, we choose the appropriate parameters. Note that we would like to ensure that the success probability in Eq. (B.14) is at least 1-\delta and, at the same time, make the sub-optimality (the right-hand side of Eq. (B.15)) as small as possible. Suppose that Assumption 3.2 holds, i.e., \epsilon=0 in (B.7). It can be verified that by choosing \tau=1, \lambda=\frac{\kappa^{2}(1-\gamma)^{4}}{1024b^{2}}, n=\frac{3}{1-\gamma}\log(\frac{4(1+\log(1+\lambda^{-1})d)}{\kappa(1-\gamma)}), \theta=\frac{\kappa(1-\gamma)^{2}}{64\sqrt{d(1+\log(1+\lambda^{-1}))}}, K=2+\frac{2}{1-\gamma}\log(\frac{3}{\kappa(1-\gamma)}), and m=4096\frac{d(1+\log(1+\lambda^{-1}))}{\kappa^{2}(1-\gamma)^{6}}\log(\frac{8Kd(1+\log(1+\lambda^{-1}))}{\delta}), we can ensure that the error probability is at most \delta and V^{*}(\rho)-V_{\pi_{K-1}}(\rho)\leq\kappa. Suppose instead that Assumption 3.3 holds. It can be verified that by choosing \tau=1, \lambda=\frac{\epsilon^{2}d}{b^{2}}, n=\frac{1}{1-\gamma}\log(\frac{1}{\epsilon(1-\gamma)}), \theta=\epsilon, K=2+\frac{1}{1-\gamma}\log(\frac{1}{\epsilon\sqrt{d}}), and m=\frac{1}{\epsilon^{2}(1-\gamma)^{2}}\log(\frac{8Kd(1+\log(1+\lambda^{-1}))}{\delta}), we can ensure that with probability at least 1-\delta,

V(ρ)VπK1(ρ)74ϵd(1γ)2(1+log(1+λ1)).V^{*}(\rho)-V_{{\pi}_{K-1}}(\rho)\leq\frac{74\epsilon\sqrt{d}}{(1-\gamma)^{2}}(1+\log(1+\lambda^{-1})).
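For reference, the parameter choices under Assumption 3.2 can be transcribed directly into code (a sketch of our own; the grouping of the logarithmic factors follows the formulas stated above, and the integer rounding is our choice):

import numpy as np

def lspi_parameters(kappa, gamma, b, d, delta):
    """Parameter choices stated above for Confident MC-LSPI under Assumption 3.2 (epsilon = 0)."""
    tau = 1.0
    lam = kappa ** 2 * (1 - gamma) ** 4 / (1024 * b ** 2)
    log_lam = np.log(1 + 1 / lam)
    n = int(np.ceil(3 / (1 - gamma) * np.log(4 * (1 + log_lam * d) / (kappa * (1 - gamma)))))
    theta = kappa * (1 - gamma) ** 2 / (64 * np.sqrt(d * (1 + log_lam)))
    K = int(np.ceil(2 + 2 / (1 - gamma) * np.log(3 / (kappa * (1 - gamma)))))
    m = int(np.ceil(4096 * d * (1 + log_lam) / (kappa ** 2 * (1 - gamma) ** 6)
                    * np.log(8 * K * d * (1 + log_lam) / delta)))
    return dict(tau=tau, lam=lam, n=n, theta=theta, K=K, m=m)

print(lspi_parameters(kappa=0.1, gamma=0.9, b=1.0, d=10, delta=0.05))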

Appendix C Proof of Lemma B.2

To simplify notation, we write π:=π~k1\pi:=\widetilde{\pi}_{k-1}, Q~(,):=Q~k1(,)\widetilde{Q}(\cdot,\cdot):=\widetilde{Q}_{k-1}(\cdot,\cdot), and w~=w~k\widetilde{w}=\widetilde{w}_{k} in this proof. According to Eq. (B.6), with probability at least 12Cmaxexp(2θ2(1γ)2m)1-2C_{\max}\exp(-2\theta^{2}(1-\gamma)^{2}m),

|zqQπ(zs,za)|γn+11γ+θ|z_{q}-Q_{\pi}(z_{s},z_{a})|\leq\frac{\gamma^{n+1}}{1-\gamma}+\theta

holds for all z𝒞z\in\mathcal{C}. We condition on this event in the following derivation. Suppose that Assumption 3.3 holds. We know that there exists wπdw_{\pi}\in\mathbb{R}^{d} with wπ2b\|w_{\pi}\|_{2}\leq b such that for any s,as,a,

|Qπ(s,a)wπϕ(s,a)|ϵ.|Q_{\pi}(s,a)-w_{\pi}^{\top}\phi(s,a)|\leq\epsilon.

Let ξ:=q~𝒞Φ𝒞wπ\xi:=\widetilde{q}_{\mathcal{C}}-\Phi_{\mathcal{C}}w_{\pi}. Then we have

ξϵ+γn+11γ+θ.\displaystyle\|\xi\|_{\infty}\leq\epsilon+\frac{\gamma^{n+1}}{1-\gamma}+\theta. (C.1)

Suppose that for a state-action pair s,as,a, the feature vector ϕ:=ϕ(s,a)\phi:=\phi(s,a)\in\mathcal{H}, with \mathcal{H} defined in Definition 4.1. Then we have

|Q~(s,a)Qπ(s,a)|\displaystyle|\widetilde{Q}(s,a)-Q_{\pi}(s,a)| |ϕw~ϕwπ|+ϵ\displaystyle\leq|\phi^{\top}\widetilde{w}-\phi^{\top}w_{\pi}|+\epsilon
=|ϕ(Φ𝒞Φ𝒞+λI)1Φ𝒞(Φ𝒞wπ+ξ)ϕwπ|+ϵ\displaystyle=|\phi^{\top}(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\Phi_{\mathcal{C}}^{\top}(\Phi_{\mathcal{C}}w_{\pi}+\xi)-\phi^{\top}w_{\pi}|+\epsilon
|ϕ(I(Φ𝒞Φ𝒞+λI)1Φ𝒞Φ𝒞)wπ|E1+|ϕ(Φ𝒞Φ𝒞+λI)1Φ𝒞ξ|E2+ϵ.\displaystyle\leq\underbrace{|\phi^{\top}\big{(}I-(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}\big{)}w_{\pi}|}_{E_{1}}+\underbrace{|\phi^{\top}(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\Phi_{\mathcal{C}}^{\top}\xi|}_{E_{2}}+\epsilon. (C.2)

We then bound E1E_{1} and E2E_{2} in (C.2). Similar to Appendix A, let Φ𝒞Φ𝒞+λI:=VΛV\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I:=V\Lambda V^{\top} be the eigendecomposition of Φ𝒞Φ𝒞+λI\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I with Λ=diag(λ1,,λd)\Lambda={\rm diag}(\lambda_{1},\ldots,\lambda_{d}) and VV being an orthonormal matrix. Notice that for all ii, λiλ\lambda_{i}\geq\lambda. Let α=Vϕ\alpha=V^{\top}\phi. Then for E1E_{1}, we have

E1\displaystyle E_{1} =|ϕV(IΛ1(ΛλI))Vwπ|=λ|ϕVΛ1Vwπ|\displaystyle=|\phi^{\top}V\big{(}I-\Lambda^{-1}(\Lambda-\lambda I)\big{)}V^{\top}w_{\pi}|=\lambda|\phi^{\top}V\Lambda^{-1}V^{\top}w_{\pi}|
λbαΛ12=λbi=1dαi2λi2\displaystyle\leq\lambda b\|\alpha^{\top}\Lambda^{-1}\|_{2}=\lambda b\sqrt{\sum_{i=1}^{d}\frac{\alpha_{i}^{2}}{\lambda_{i}^{2}}}
bλi=1dαi2λi,\displaystyle\leq b\sqrt{\lambda}\sqrt{\sum_{i=1}^{d}\frac{\alpha_{i}^{2}}{\lambda_{i}}}, (C.3)

where for the first inequality we use the Cauchy-Schwarz inequality and the assumption that \|w_{\pi}\|_{2}\leq b, and for the second inequality we use the fact that \lambda_{i}\geq\lambda. On the other hand, since \phi\in\mathcal{H}, we know that \alpha^{\top}\Lambda^{-1}\alpha\leq\tau, i.e., \sum_{i=1}^{d}\alpha_{i}^{2}\lambda_{i}^{-1}\leq\tau. Combining this fact with (C.3), we obtain

E1bλτ.\displaystyle E_{1}\leq b\sqrt{\lambda\tau}. (C.4)

We now bound E2E_{2}. According to Hölder’s inequality, we have

E2\displaystyle E_{2} ϕ(Φ𝒞Φ𝒞+λI)1Φ𝒞1ξ\displaystyle\leq\|\phi^{\top}(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\Phi_{\mathcal{C}}^{\top}\|_{1}\|\xi\|_{\infty}
ϕ(Φ𝒞Φ𝒞+λI)1Φ𝒞2ξ|𝒞|\displaystyle\leq\|\phi^{\top}(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\Phi_{\mathcal{C}}^{\top}\|_{2}\|\xi\|_{\infty}\sqrt{|\mathcal{C}|}
=(ϕ(Φ𝒞Φ𝒞+λI)1Φ𝒞Φ𝒞(Φ𝒞Φ𝒞+λI)1ϕ)1/2ξ|𝒞|\displaystyle=\big{(}\phi^{\top}(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\phi\big{)}^{1/2}\|\xi\|_{\infty}\sqrt{|\mathcal{C}|}
=(αΛ1(ΛλI)Λ1α)1/2ξ|𝒞|\displaystyle=\big{(}\alpha^{\top}\Lambda^{-1}(\Lambda-\lambda I)\Lambda^{-1}\alpha\big{)}^{1/2}\|\xi\|_{\infty}\sqrt{|\mathcal{C}|}
=i=1dαi2λiλλi2ξ|𝒞|\displaystyle=\sqrt{\sum_{i=1}^{d}\alpha_{i}^{2}\frac{\lambda_{i}-\lambda}{\lambda_{i}^{2}}}\|\xi\|_{\infty}\sqrt{|\mathcal{C}|}
(ϵ+γn+11γ+θ)τCmax,\displaystyle\leq(\epsilon+\frac{\gamma^{n+1}}{1-\gamma}+\theta)\sqrt{\tau C_{\max}}, (C.5)

where in the last inequality we use the facts that i=1dαi2λi1τ\sum_{i=1}^{d}\alpha_{i}^{2}\lambda_{i}^{-1}\leq\tau, Eq. (C.1), and Lemma 5.1. We can then complete the proof by combining (C.4) and (C.5).
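The decomposition into E_1 and E_2 can also be checked numerically. The following self-contained sketch (synthetic features and constants of our own choosing, realizable case \epsilon=0) verifies that the extrapolation error at a feature inside the good set stays below b\sqrt{\lambda\tau}+\|\xi\|_{\infty}\sqrt{\tau|\mathcal{C}|}:

import numpy as np

rng = np.random.default_rng(0)
d, n_core, lam, tau, b = 6, 40, 0.1, 1.0, 1.0

# Synthetic core-set features with norm <= 1 and an exactly realizable Q-function.
Phi = rng.normal(size=(n_core, d))
Phi /= np.maximum(np.linalg.norm(Phi, axis=1, keepdims=True), 1.0)
w_pi = rng.normal(size=d)
w_pi *= b / np.linalg.norm(w_pi)
noise = 0.05                                   # stands in for gamma^{n+1}/(1-gamma) + theta
xi = rng.uniform(-noise, noise, size=n_core)   # rollout noise with ||xi||_inf <= noise
Sigma = Phi.T @ Phi + lam * np.eye(d)
w_hat = np.linalg.solve(Sigma, Phi.T @ (Phi @ w_pi + xi))   # the ridge estimate (B.1)

# For a feature inside the good set H, E1 + E2 <= b*sqrt(lam*tau) + ||xi||_inf*sqrt(tau*|C|).
phi = rng.normal(size=d)
phi /= np.linalg.norm(phi)
if phi @ np.linalg.solve(Sigma, phi) <= tau:
    err = abs(phi @ (w_hat - w_pi))
    bound = b * np.sqrt(lam * tau) + noise * np.sqrt(tau * n_core)
    print(err, "<=", bound, err <= bound)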

Appendix D Proof of Theorem 5.2

First, we state a general result in Szepesvári [2021] on Politex. Notice that in this result, we consider an arbitrary sequence of approximate Q-functions QkQ_{k}, k=0,,K1k=0,\ldots,K-1, which do not have to take the form of (4.3).

Lemma D.1 (Szepesvári [2021]).

Given an initial policy π0\pi_{0} and a sequence of functions Qk:𝒮×𝒜[0,(1γ)1]Q_{k}:{\mathcal{S}}\times\mathcal{A}\mapsto[0,(1-\gamma)^{-1}], k=0,,K1k=0,\ldots,K-1, construct a sequence of policies π1,,πK1\pi_{1},\ldots,\pi_{K-1} according to (4.4) with α=(1γ)2log(|𝒜|)K\alpha=(1-\gamma)\sqrt{\frac{2\log(|\mathcal{A}|)}{K}}, then, for any s𝒮s\in{\mathcal{S}}, the mixture policy π¯K\overline{\pi}_{K} satisfies

V(s)Vπ¯K(s)1(1γ)22log(|𝒜|)K+2max0kK1QkQπk1γ.V^{*}(s)-V_{\overline{\pi}_{K}}(s)\leq\frac{1}{(1-\gamma)^{2}}\sqrt{\frac{2\log(|\mathcal{A}|)}{K}}+\frac{2\max_{0\leq k\leq K-1}\|Q_{k}-Q_{\pi_{k}}\|_{\infty}}{1-\gamma}.

We then consider a virtual Politex algorithm. Similar to the virtual policy iteration algorithm of Appendix B.1, in the virtual Politex algorithm we begin with \widetilde{\pi}_{0}:=\pi_{0}. In the k-th iteration, we run Monte Carlo rollouts with policy \widetilde{\pi}_{k-1} and obtain the estimates \widetilde{q}_{\mathcal{C}} of the Q-function values. We then compute the weight vector

w~k=(Φ𝒞Φ𝒞+λI)1Φ𝒞q~𝒞,\widetilde{w}_{k}=(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\Phi_{\mathcal{C}}^{\top}\widetilde{q}_{\mathcal{C}},

and according to Lemma B.2, for any θ>0\theta>0, with probability at least 12Cmaxexp(2θ2(1γ)2m)1-2C_{\max}\exp(-2\theta^{2}(1-\gamma)^{2}m), for all (s,a)(s,a) such that ϕ(s,a)\phi(s,a)\in\mathcal{H},

|w~kϕ(s,a)Qπ~k1(s,a)|bλτ+(ϵ+γn+11γ+θ)τCmax+ϵ:=η.\displaystyle|\widetilde{w}_{k}^{\top}\phi(s,a)-Q_{\widetilde{\pi}_{k-1}}(s,a)|\leq b\sqrt{\lambda\tau}+\big{(}\epsilon+\frac{\gamma^{n+1}}{1-\gamma}+\theta\big{)}\sqrt{\tau C_{\max}}+\epsilon:=\eta. (D.1)

Then we define the virtual Q-function as

Q~k1(s,a):={Π[0,(1γ)1](w~kϕ(s,a)),ϕ(s,a),Qπ~k1(s,a),ϕ(s,a),\widetilde{Q}_{k-1}(s,a):=\begin{cases}\Pi_{[0,(1-\gamma)^{-1}]}(\widetilde{w}_{k}^{\top}\phi(s,a)),&\phi(s,a)\in\mathcal{H},\\ Q_{\widetilde{\pi}_{k-1}}(s,a),&\phi(s,a)\notin\mathcal{H},\end{cases}

assuming we have access to the true Q-function Qπ~k1(s,a)Q_{\widetilde{\pi}_{k-1}}(s,a) when ϕ(s,a)\phi(s,a)\notin\mathcal{H}. We let the policy of the (k+1)(k+1)-th iteration π~k\widetilde{\pi}_{k} be

\displaystyle\widetilde{\pi}_{k}(a|s)\propto\exp\left(\alpha\sum_{j=0}^{k-1}\widetilde{Q}_{j}(s,a)\right). (D.2)

Since we always have Qπ~k1(s,a)[0,(1γ)1]Q_{\widetilde{\pi}_{k-1}}(s,a)\in[0,(1-\gamma)^{-1}], the clipping at 0 and (1γ)1(1-\gamma)^{-1} can only improve the accuracy of the estimation of the Q-function. Therefore, we know that with probability at least 12Cmaxexp(2θ2(1γ)2m)1-2C_{\max}\exp(-2\theta^{2}(1-\gamma)^{2}m), we have Q~k1Qπ~k1η\|\widetilde{Q}_{k-1}-Q_{\widetilde{\pi}_{k-1}}\|_{\infty}\leq\eta. Then, by taking a union bound over the KK iterations and using the result in Lemma D.1, we know that with probability at least 12KCmaxexp(2θ2(1γ)2m)1-2KC_{\max}\exp(-2\theta^{2}(1-\gamma)^{2}m), for any s𝒮s\in{\mathcal{S}}, the virtual Politex algorithm satisfies

V(s)Vπ~¯K(s)1(1γ)22log(|𝒜|)K+2η1γ,\displaystyle V^{*}(s)-V_{\overline{\widetilde{\pi}}_{K}}(s)\leq\frac{1}{(1-\gamma)^{2}}\sqrt{\frac{2\log(|\mathcal{A}|)}{K}}+\frac{2\eta}{1-\gamma}, (D.3)

where π~¯K\overline{\widetilde{\pi}}_{K} is the mixture policy of π~0,,π~K1\widetilde{\pi}_{0},\ldots,\widetilde{\pi}_{K-1}. Using another union bound over the CmaxC_{\max} loops, we know that with probability at least 12KCmax2exp(2θ2(1γ)2m)1-2KC_{\max}^{2}\exp(-2\theta^{2}(1-\gamma)^{2}m), (D.3) holds for all the loops. We call this event 1\mathcal{E}_{1} in the following.
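For concreteness, the exponential-weights update of the form (D.2)/(D.5) can be sketched as follows (a minimal illustration with our own toy numbers; the clipping to [0,(1-\gamma)^{-1}] mirrors the definition of the approximate Q-functions above):

import numpy as np

def politex_policy(q_sum, alpha):
    """Softmax policy: pi_k(.|s) proportional to exp(alpha * sum of past Q-estimates),
    computed with a max-shift for numerical stability."""
    logits = alpha * q_sum - np.max(alpha * q_sum)
    p = np.exp(logits)
    return p / p.sum()

# Toy usage for a single state: accumulate clipped Q-estimates over iterations.
gamma, num_actions, K = 0.9, 4, 50
alpha = (1 - gamma) * np.sqrt(2 * np.log(num_actions) / K)
rng = np.random.default_rng(0)
q_sum = np.zeros(num_actions)
for _ in range(3):                                         # a few Politex iterations
    q_hat = np.clip(rng.normal(5.0, 1.0, size=num_actions), 0.0, 1.0 / (1 - gamma))
    print(politex_policy(q_sum, alpha))                    # policy used in this iteration
    q_sum += q_hat                                         # accumulate for the next iteration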

We then consider the virtual-2 Politex algorithm. Similar to LSPI, the virtual-2 algorithm begins with π^0:=π0\widehat{\pi}_{0}:=\pi_{0}. In the kk-th iteration, we run Monte Carlo rollout with policy π^k1\widehat{\pi}_{k-1}, and obtain the estimates of the Q-function values q^𝒞\widehat{q}_{\mathcal{C}}. We then compute the weight vector

w^k=(Φ𝒞Φ𝒞+λI)1Φ𝒞q^𝒞,\widehat{w}_{k}=(\Phi_{\mathcal{C}}^{\top}\Phi_{\mathcal{C}}+\lambda I)^{-1}\Phi_{\mathcal{C}}^{\top}\widehat{q}_{\mathcal{C}},

and according to Lemma B.2, for any θ>0\theta>0, with probability at least 12Cmaxexp(2θ2(1γ)2m)1-2C_{\max}\exp(-2\theta^{2}(1-\gamma)^{2}m), for all (s,a)(s,a) such that ϕ(s,a)\phi(s,a)\in\mathcal{H},

|w^kϕ(s,a)Qπ^k1(s,a)|η,\displaystyle|\widehat{w}_{k}^{\top}\phi(s,a)-Q_{\widehat{\pi}_{k-1}}(s,a)|\leq\eta, (D.4)

where η\eta is defined as in (D.1). We also note that in the rollout process of the virtual-2 algorithm, we do not conduct the uncertainty check, i.e., we do not check whether the features are in the good set \mathcal{H}. By union bound, we know that with probability at least 12KCmax2exp(2θ2(1γ)2m)1-2KC_{\max}^{2}\exp(-2\theta^{2}(1-\gamma)^{2}m), (D.4) holds for all the KK iterations of all the CmaxC_{\max} loops. We call this event 2\mathcal{E}_{2} in the following. In the virtual-2 algorithm, we define the approximate Q-function in the same way as the main algorithm, i.e., we define

Q^k1(s,a):=Π[0,(1γ)1](w^kϕ(s,a)),\widehat{Q}_{k-1}(s,a):=\Pi_{[0,(1-\gamma)^{-1}]}(\widehat{w}_{k}^{\top}\phi(s,a)),

and we let the policy of the (k+1)(k+1)-th iteration be

\displaystyle\widehat{\pi}_{k}(a|s)\propto\exp\left(\alpha\sum_{j=0}^{k-1}\widehat{Q}_{j}(s,a)\right). (D.5)

We still let the simulators of all the algorithms be coupled in the same way as described in Appendix B.1. In addition, we also let the agent in the main algorithm be coupled with the virtual and virtual-2 algorithms. Take the main algorithm and the virtual algorithm as an example. Recall that in the k-th iteration of a particular loop, the main algorithm and the virtual algorithm use rollout policies \pi_{k-1} and \widetilde{\pi}_{k-1}, respectively. In the ConfidentRollout subroutine, the agent needs to sample actions according to the policies given a state. Suppose that at the N-th time the agent needs to take an action, the main algorithm is at state s_{\text{main}} and the virtual algorithm is at state s_{\text{virtual}}. If the two states are the same, i.e., s_{\text{main}}=s_{\text{virtual}}, and the two action distributions given this state are also the same, i.e., \pi_{k-1}(\cdot|s_{\text{main}})=\widetilde{\pi}_{k-1}(\cdot|s_{\text{virtual}}), then the actions that the agent samples in the main algorithm and the virtual algorithm are also the same. This means that the main algorithm samples a_{\text{main}}\sim\pi_{k-1}(\cdot|s_{\text{main}}), the virtual algorithm samples a_{\text{virtual}}\sim\widetilde{\pi}_{k-1}(\cdot|s_{\text{virtual}}), and with probability 1, a_{\text{main}}=a_{\text{virtual}}. Otherwise, when s_{\text{main}}\neq s_{\text{virtual}} or \pi_{k-1}(\cdot|s_{\text{main}})\neq\widetilde{\pi}_{k-1}(\cdot|s_{\text{virtual}}), the main algorithm and the virtual algorithm sample new actions independently. The main algorithm and the virtual-2 algorithm are coupled in the same way. We note that, using the same argument as in Lemma B.5, for the final loop of the main algorithm, all the rollout trajectories of the main, virtual, and virtual-2 algorithms are the same, which implies that w_{k}=\widetilde{w}_{k}=\widehat{w}_{k} for all 1\leq k\leq K. This also implies that in the final loop of the main algorithm, all the policies in the K iterations are the same between the main and the virtual-2 algorithms, i.e., \pi_{k}=\widehat{\pi}_{k} for 0\leq k\leq K-1. Moreover, for any state s such that \phi(s,a)\in\mathcal{H} for all a\in\mathcal{A}, we have \pi_{k}(\cdot|s)=\widetilde{\pi}_{k}(\cdot|s)=\widehat{\pi}_{k}(\cdot|s). Since the initial state \rho satisfies the condition that \phi(\rho,a)\in\mathcal{H} for all a\in\mathcal{A}, we have \pi_{k}(\cdot|\rho)=\widetilde{\pi}_{k}(\cdot|\rho)=\widehat{\pi}_{k}(\cdot|\rho).

Let π^¯K\overline{\widehat{\pi}}_{K} be the policy that is uniformly chosen from π^0,,π^K1\widehat{\pi}_{0},\ldots,\widehat{\pi}_{K-1} in the virtual-2 algorithm in the final loop of the main algorithm, and π¯K\overline{\pi}_{K} be the policy that is uniformly chosen from π0,,πK1\pi_{0},\ldots,\pi_{K-1} in the final loop of the main algorithm. Then we have

|Vπ^¯K(ρ)Vπ¯K(ρ)|=|1Kk=0K1(Vπ^k(ρ)Vπk(ρ))|=0,\displaystyle|V_{\overline{\widehat{\pi}}_{K}}(\rho)-V_{\overline{\pi}_{K}}(\rho)|=\left|\frac{1}{K}\sum_{k=0}^{K-1}(V_{\widehat{\pi}_{k}}(\rho)-V_{\pi_{k}}(\rho))\right|=0, (D.6)

and when events 1\mathcal{E}_{1} and 2\mathcal{E}_{2} happen,

|Vπ^¯K(ρ)Vπ~¯K(ρ)|\displaystyle|V_{\overline{\widehat{\pi}}_{K}}(\rho)-V_{\overline{\widetilde{\pi}}_{K}}(\rho)|
\displaystyle\leq 1Kk=0K1|Vπ^k(ρ)Vπ~k(ρ)|\displaystyle\frac{1}{K}\sum_{k=0}^{K-1}|V_{\widehat{\pi}_{k}}(\rho)-V_{\widetilde{\pi}_{k}}(\rho)|
=\displaystyle= 1Kk=0K1|a𝒜(π^k(a|ρ)Qπ^k(ρ,a)π~k(a|ρ)Qπ~k(ρ,a))|\displaystyle\frac{1}{K}\sum_{k=0}^{K-1}\left|\sum_{a\in\mathcal{A}}(\widehat{\pi}_{k}(a|\rho)Q_{\widehat{\pi}_{k}}(\rho,a)-\widetilde{\pi}_{k}(a|\rho)Q_{\widetilde{\pi}_{k}}(\rho,a))\right|
\displaystyle\leq 1Kk=0K1a𝒜πk(a|ρ)|Qπ^k(ρ,a)Qπ~k(ρ,a))|\displaystyle\frac{1}{K}\sum_{k=0}^{K-1}\sum_{a\in\mathcal{A}}\pi_{k}(a|\rho)\left|Q_{\widehat{\pi}_{k}}(\rho,a)-Q_{\widetilde{\pi}_{k}}(\rho,a))\right|
\displaystyle\leq 1Kk=0K1a𝒜πk(a|ρ)|Qπ^k(ρ,a)w^kϕ(ρ,a)+w^kϕ(ρ,a)w~kϕ(ρ,a)+w~kϕ(ρ,a)Qπ~k(ρ,a))|\displaystyle\frac{1}{K}\sum_{k=0}^{K-1}\sum_{a\in\mathcal{A}}\pi_{k}(a|\rho)\left|Q_{\widehat{\pi}_{k}}(\rho,a)-\widehat{w}_{k}^{\top}\phi(\rho,a)+\widehat{w}_{k}^{\top}\phi(\rho,a)-\widetilde{w}_{k}^{\top}\phi(\rho,a)+\widetilde{w}_{k}^{\top}\phi(\rho,a)-Q_{\widetilde{\pi}_{k}}(\rho,a))\right|
\displaystyle\leq 1Kk=0K1a𝒜πk(a|ρ)(|Qπ^k(ρ,a)w^kϕ(ρ,a)|+|w^kϕ(ρ,a)w~kϕ(ρ,a)|+|w~kϕ(ρ,a)Qπ~k(ρ,a))|)\displaystyle\frac{1}{K}\sum_{k=0}^{K-1}\sum_{a\in\mathcal{A}}\pi_{k}(a|\rho)\left(|Q_{\widehat{\pi}_{k}}(\rho,a)-\widehat{w}_{k}^{\top}\phi(\rho,a)|+|\widehat{w}_{k}^{\top}\phi(\rho,a)-\widetilde{w}_{k}^{\top}\phi(\rho,a)|+|\widetilde{w}_{k}^{\top}\phi(\rho,a)-Q_{\widetilde{\pi}_{k}}(\rho,a))|\right)
\displaystyle\leq 1Kk=0K1a𝒜πk(a|ρ)(η+0+η)=2η.\displaystyle\frac{1}{K}\sum_{k=0}^{K-1}\sum_{a\in\mathcal{A}}\pi_{k}(a|\rho)(\eta+0+\eta)=2\eta. (D.7)

By combining (D.3), (D.6), and (D.7), and using a union bound, we obtain that with probability at least 14KCmax2exp(2θ2(1γ)2m)1-4KC_{\max}^{2}\exp(-2\theta^{2}(1-\gamma)^{2}m),

V(ρ)Vπ¯K(ρ)1(1γ)22log(|𝒜|)K+4η1γ.\displaystyle V^{*}(\rho)-V_{\overline{{\pi}}_{K}}(\rho)\leq\frac{1}{(1-\gamma)^{2}}\sqrt{\frac{2\log(|\mathcal{A}|)}{K}}+\frac{4\eta}{1-\gamma}. (D.8)

Now we choose appropriate parameters to obtain the final result. When Assumption 3.2 holds, i.e., \epsilon=0, one can verify that when we choose \tau=1, \lambda=\frac{(1-\gamma)^{2}\kappa^{2}}{256b^{2}}, K=\frac{32\log(|\mathcal{A}|)}{\kappa^{2}(1-\gamma)^{4}}, n=\frac{1}{1-\gamma}\log(\frac{32\sqrt{d}(1+\log(1+\lambda^{-1}))}{(1-\gamma)^{2}\kappa}), \theta=\frac{(1-\gamma)\kappa}{32\sqrt{d(1+\log(1+\lambda^{-1}))}}, and m=\frac{1024d(1+\log(1+\lambda^{-1}))}{(1-\gamma)^{4}\kappa^{2}}\log(\frac{8Kd(1+\log(1+\lambda^{-1}))}{\delta}), we can ensure that with probability at least 1-\delta, V^{*}(\rho)-V_{\overline{\pi}_{K}}(\rho)\leq\kappa. When Assumption 3.3 holds, one can verify that when we choose \tau=1, \lambda=\frac{\epsilon^{2}d}{b^{2}}, K=\frac{2\log(|\mathcal{A}|)}{\epsilon^{2}d(1-\gamma)^{2}}, \theta=\epsilon, n=\frac{1}{1-\gamma}\log(\frac{1}{\epsilon(1-\gamma)}), and m=\frac{1}{\epsilon^{2}(1-\gamma)^{2}}\log(\frac{8Kd(1+\log(1+\lambda^{-1}))}{\delta}), we can ensure that with probability at least 1-\delta,

V(ρ)Vπ¯K(ρ)42ϵd1γ(1+log(1+λ1)).V^{*}(\rho)-V_{\overline{\pi}_{K}}(\rho)\leq\frac{42\epsilon\sqrt{d}}{1-\gamma}(1+\log(1+\lambda^{-1})).

Appendix E Random initial state

We have shown that with a deterministic initial state ρ\rho, our algorithm can learn a good policy. In fact, if the initial state is random, and the agent is allowed to sample from a distribution of the initial state, denoted by ρ\rho in this section, then we can use a simple reduction to show that our algorithm can still learn a good policy. In this case, the optimality gap is defined as the difference between the expected value of the optimal policy and the learned policy, where the expectation is taken over the initial state distribution, i.e., we hope to guarantee that 𝔼sρ[V(s)Vπ(s)]\mathbb{E}_{s\sim\rho}[V^{*}(s)-V_{\pi}(s)] is small.

The reduction argument works as follows. First, we add an auxiliary state s_{\text{init}} to the state space {\mathcal{S}} and assume that the algorithm starts from s_{\text{init}}. From s_{\text{init}} and any action a\in\mathcal{A}, we let the distribution of the next state be \rho\in\Delta_{{\mathcal{S}}}, i.e., P(\cdot|s_{\text{init}},a)=\rho. We also let r(s_{\text{init}},a)=0. Then, for any policy \pi, we have \mathbb{E}_{s\sim\rho}[V_{\pi}(s)]=\frac{1}{\gamma}V_{\pi}(s_{\text{init}}). As for the features, for any (s,a)\in{\mathcal{S}}\times\mathcal{A}, we append a 0 as the last coordinate of the feature vector, i.e., we use \phi^{+}(s,a)=[\phi(s,a)^{\top}~0]^{\top}\in\mathbb{R}^{d+1}. For any a\in\mathcal{A}, we let \phi^{+}(s_{\text{init}},a)=[0\cdots 0~1]^{\top}\in\mathbb{R}^{d+1}. Note that this does not affect linear realizability, except for a change in the upper bound on the \ell_{2} norm of the linear coefficients. Suppose that Assumption 3.2 holds and that in the original MDP we have Q_{\pi}(s,a)=w_{\pi}^{\top}\phi(s,a) with w_{\pi}\in\mathbb{R}^{d}. Let us define w_{\pi}^{+}=[w_{\pi}^{\top}~V_{\pi}(s_{\text{init}})]^{\top}\in\mathbb{R}^{d+1}. Then, for any s\neq s_{\text{init}}, we still have Q_{\pi}(s,a)=(w_{\pi}^{+})^{\top}\phi^{+}(s,a) since the last coordinate of \phi^{+}(s,a) is zero. For s_{\text{init}}, we have Q_{\pi}(s_{\text{init}},a)=V_{\pi}(s_{\text{init}})=(w_{\pi}^{+})^{\top}\phi^{+}(s_{\text{init}},a). The only difference is that we now have \|w^{+}_{\pi}\|_{2}\leq\sqrt{b^{2}+(\frac{\gamma}{1-\gamma})^{2}}, since we always have 0\leq V_{\pi}(s_{\text{init}})\leq\frac{\gamma}{1-\gamma}.
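A short sketch of this feature augmentation (helper names are ours) shows how the extended weight vector realizes both cases:

import numpy as np

def augment(phi):
    """phi^+(s, a) = [phi(s, a), 0] for ordinary state-action pairs."""
    return np.concatenate([phi, [0.0]])

def init_feature(d):
    """phi^+(s_init, a) = [0, ..., 0, 1] for the auxiliary initial state."""
    return np.concatenate([np.zeros(d), [1.0]])

# With Q_pi(s, a) = w_pi^T phi(s, a), the extended weights [w_pi, V_pi(s_init)] realize both cases.
rng = np.random.default_rng(0)
d, w_pi, v_init = 5, rng.normal(size=5), 2.0
w_plus = np.concatenate([w_pi, [v_init]])
phi = rng.normal(size=d)
print(np.isclose(w_plus @ augment(phi), w_pi @ phi))      # ordinary pair: last coordinate is zero
print(np.isclose(w_plus @ init_feature(d), v_init))       # s_init: Q equals V_pi(s_init)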

Then the problem reduces to the deterministic initial state case with initial state sinits_{\text{init}}. In the first step of the algorithm, we let 𝒞={(sinit,a,ϕ+(sinit,a),𝗇𝗈𝗇𝖾)}\mathcal{C}=\{(s_{\text{init}},a,\phi^{+}(s_{\text{init}},a),\mathsf{none})\}. During the algorithm, to run rollout from any core set element zz with zs𝒮z_{s}\in{\mathcal{S}}, we can use the current version of Algorithm 1. To run rollout from (sinit,a)(s_{\text{init}},a), we simply sample from ρ\rho as the first “next state” and then use the simulator to keep running the following trajectory of the rollout process.