Efficient Local Planning with Linear Function Approximation
Abstract
We study query- and computationally efficient planning algorithms for discounted Markov decision processes (MDPs) with linear function approximation and a simulator. We assume that the agent has local access to the simulator, meaning that the simulator can be queried only at states that have been encountered in previous simulation steps. This is a more practical setting than the so-called random-access (or generative) setting, where the agent has a complete description of the state space and features and is allowed to query the simulator at any state of its choice. We propose two new algorithms for this setting, which we call confident Monte Carlo least-squares policy iteration (Confident MC-LSPI) and confident Monte Carlo Politex (Confident MC-Politex), respectively. The main novelty in our algorithms is that they gradually build a set of state-action pairs ("core set") with which they can control the extrapolation errors. Under the assumption that the action-value functions of all policies are linearly realizable in the given features, we show that our algorithms have polynomial query and computational costs in the dimension of the features, the effective planning horizon, and the targeted sub-optimality, while the costs remain independent of the size of the state space. Our result strengthens previous works by broadening their scope, either by weakening the assumption made on the power of the function approximator, or by weakening the requirement on the simulator and removing the need for being given an appropriate core set of states. An interesting technical contribution of our work is the introduction of a novel proof technique that makes use of a virtual policy iteration algorithm. We use this method to leverage existing results on approximate policy iteration with $\ell_\infty$-bounded evaluation error, and show that our algorithms can learn the optimal policy for the given initial state even with only local access to the simulator. We believe that this technique can be extended to broader settings beyond this work.
1 Introduction
Efficient planning lies at the heart of modern reinforcement learning (RL). In simulation-based RL, the agent has access to a simulator, which it can query with a state-action pair to obtain the reward of that pair and a sample of the next state. When planning with large state spaces in the presence of features, the agent can also compute the feature vector associated with a state or a state-action pair. Planning efficiency is measured in two ways: by the query cost, the number of calls to the simulator, and by the computational cost, the total number of logical and arithmetic operations that the agent uses. In Markov decision processes (MDPs) with a large state space, we call a planning algorithm query-efficient (respectively, computationally efficient) if its query (respectively, computational) cost is independent of the size of the state space and polynomial in the other parameters of the problem, such as the dimension of the feature space, the effective planning horizon, the number of actions, and the targeted sub-optimality.
Prior works on planning in MDPs often assume that the agent has access to a generative model which allows the agent to query the simulator with any arbitrary state-action pair (Kakade, 2003; Sidford et al., 2018; Yang and Wang, 2019; Lattimore et al., 2020). In what follows, we will call this the random access model. The random access model is often difficult to support. To illustrate this, consider a problem where the goal is to move the joints of a robot arm so that it moves objects around. The simulation state in this scenario is then completely described by the position, orientation and associated velocities of the various rigid objects involved. To access a state, a planner can then try to choose some values for each of the variables involved. Unfortunately, given only box constraints on the variable values (as is typically the case), a generic planner will often choose value combinations that are invalid based on physics, for example with objects penetrating each other in space. This problem is not specific to robotic applications but also arises in MDPs corresponding to combinatorial search, just to mention a second example.
To address this challenge, we replace the random access model with a local access model, where the only states at which the agent can query the simulator are the initial states provided to the agent, or states returned in response to previously issued queries. This access model can be implemented with any simulator that supports resetting its internal state to a previously stored such state. This type of checkpointing is widely supported, and if a simulator does not support it, there are general techniques that can be applied to achieve this functionality. As such, this access model significantly expands the scope of planners.
Definition 1.1 (Local access to the simulator).
We say the agent has local access to the simulator if the agent is allowed to query the simulator with any state that the agent has previously seen, paired with an arbitrary action.
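As an illustration only (not part of the paper), the local access protocol can be enforced by a thin wrapper around a simulator that supports checkpointing; all names below are assumptions.

```python
# A minimal sketch of the local access protocol of Definition 1.1.
# step_fn(state, action) is assumed to reset the simulator's internal state to
# `state` (a stored checkpoint) and advance one step, returning (next_state,
# reward); states are assumed hashable.
class LocalAccessSimulator:
    def __init__(self, step_fn, initial_state):
        self.step_fn = step_fn
        self.seen = {initial_state}   # states the agent is allowed to query

    def query(self, state, action):
        if state not in self.seen:
            raise ValueError("local access violated: state never encountered")
        next_state, reward = self.step_fn(state, action)
        self.seen.add(next_state)     # returned states become queryable later
        return next_state, reward
```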
Our work relies on linear function approximation. Very recently, Weisz et al. (2021b) showed that the linear realizability assumption on the optimal state-action value function ($Q^*$-realizability) alone is not sufficient for query-efficient planning. In this paper, we assume linear realizability of the state-action value functions of all policies ($Q^\pi$-realizability). We discuss several drawbacks of previous works (Lattimore et al., 2020; Du et al., 2020) under the same realizability assumption. First, these works require knowledge of the features of all state-action pairs; otherwise, the agent has to spend a query cost on the order of $|\mathcal{S}||\mathcal{A}|$ to extract the features of all possible state-action pairs, where $|\mathcal{S}|$ and $|\mathcal{A}|$ are the sizes of the state space and the action space, respectively. Second, these algorithms require the computation of either an approximate global optimal design (Lattimore et al., 2020) or a barycentric spanner (Du et al., 2020) of the matrix of all features. Although there exist algorithms for approximating the optimal design (Todd, 2016) or computing a barycentric spanner (Awerbuch and Kleinberg, 2008), their computational complexities are polynomial in the total number of all possible feature vectors, i.e., $|\mathcal{S}||\mathcal{A}|$, which is impractical for large MDPs.
We summarize our contributions as follows:
• With local access to the simulator, we propose two policy optimization algorithms: confident Monte Carlo least-squares policy iteration (Confident MC-LSPI) and its regularized (see, e.g., Even-Dar et al. (2009); Abbasi-Yadkori et al. (2019)) version, confident Monte Carlo Politex (Confident MC-Politex). Both of our algorithms maintain a core set of state-action pairs and run Monte Carlo rollouts from these pairs using the simulator. The algorithms then use the rollout results to estimate the Q-function values and apply policy improvement. During each rollout procedure, whenever the algorithm observes a state-action pair that it is less confident about (i.e., one with large uncertainty), the algorithm adds this pair to the core set and restarts. Compared to several prior works that use an additive bonus (Jin et al., 2020; Cai et al., 2020), our algorithm design demonstrates that in the local access setting, core-set-based exploration is an effective approach.
• Under the $Q^\pi$-realizability assumption, we prove that both Confident MC-LSPI and Confident MC-Politex can learn an $\epsilon$-optimal policy with query and computational costs that are polynomial in $d$, $\frac{1}{1-\gamma}$, $\frac{1}{\epsilon}$, $b$, and $\log\frac{1}{\delta}$, where $d$ is the dimension of the features of state-action pairs, $\gamma$ is the discount factor, $\delta$ is the error probability, and $b$ is the bound on the norm of the linear coefficients of the Q-functions. In the presence of a model misspecification error $\omega > 0$, we show that Confident MC-LSPI achieves a final sub-optimality with an additional term proportional to $\omega$, whereas Confident MC-Politex can improve this misspecification term at the price of a higher query cost.
• We develop a novel proof technique that makes use of a virtual policy iteration algorithm. We use this method to leverage existing results on approximate policy iteration which assume that in each iteration, the approximation of the Q-function has a bounded $\ell_\infty$ error (Munos, 2003; Farahmand et al., 2010) (see Section 5 for details).
2 Related work
Simulators or generative models have been considered in early studies of reinforcement learning (Kearns and Singh, 1999; Kakade, 2003). Recently, it has been shown empirically that in the local access setting, core-set-based exploration has strong performance in hard-exploration problems (Ecoffet et al., 2019). In this section, we mostly focus on related theoretical works. We distinguish among random access, local access, and online access.
• Random access means that the agent is given a list of all possible state-action pairs and can query any of them to obtain the reward and a sample of the next state.
• Local access means that the agent can query the simulator at previously encountered states, which can be implemented with checkpointing. The local access model that we consider in this paper is a more practical version of planning with a simulator.
• Online access means that the simulation state can only be reset to the initial state (or initial distribution) or advanced to a random next state given an action. The online access setting is more restrictive than local access, since the agent can only follow the MDP dynamics during the learning process.
We also distinguish between offline and online planning. In the offline planning problem, the agent only has access to the simulator during the training phase, and once the training is finished, the agent outputs a policy and executes the policy in the environment without access to a simulator. This is the setting that we consider in this paper. On the other hand, in the online planning problem, the agent can use the simulator during both the training and inference phases, meaning that the agent can use the simulator to choose the action when executing the policy. Usually, online RL algorithms with sublinear regret can be converted to an offline planning algorithm under the online access model with standard online-to-batch conversion (Cesa-Bianchi et al., 2004). While most of the prior works that we discuss in this section are for the offline planning problem, the TensorPlan algorithm (Weisz et al., 2021a) considers online planning.
In terms of notation, some works consider finite-horizon MDPs, in which case we use $H$ to denote the episode length (which plays a role similar to the effective planning horizon $\frac{1}{1-\gamma}$ in infinite-horizon discounted MDPs). Our discussion mainly focuses on results with linear function approximation. We summarize some of the recent advances on efficient planning in large MDPs in Table 1.
Note: The algorithms in these works are not query or computationally efficient unless the agent is provided with an approximate optimal design (Lattimore et al., 2020), a barycentric spanner (Du et al., 2020), or "core states" (Shariff and Szepesvári, 2020) for free.
Note: Weisz et al. (2021a) consider the online planning problem, whereas the other works in this table consider (or can be converted to) the offline planning problem.
Positive Results | Assumption | CE (computationally efficient) | Access Model |
Yang and Wang (2019) | linear MDP | yes | random access |
Lattimore et al. (2020); Du et al. (2020) | $Q^\pi$-realizability | no | random access |
Shariff and Szepesvári (2020) | $V^*$-realizability | no | random access |
This work | $Q^\pi$-realizability | yes | local access |
Weisz et al. (2021a) | $V^*$-realizability, $O(1)$ actions | no | local access |
Li et al. (2021) | $Q^*$-realizability, constant gap | yes | local access |
Jiang et al. (2017) | low Bellman rank | no | online access |
Zanette et al. (2020) | low inherent Bellman error | no | online access |
Du et al. (2021) | bilinear class | no | online access |
Lazic et al. (2021); Wei et al. (2021) | $Q^\pi$-realizability, feature excitation | yes | online access |
Jin et al. (2020); Agarwal et al. (2020a) | linear MDP | yes | online access |
Zhou et al. (2020); Cai et al. (2020) | linear mixture MDP | ? | online access |
Negative Results | Assumption | CE (computationally efficient) | Access Model |
Du et al. (2020) | $Q^\pi$-realizability, misspecification | N/A | random access |
Weisz et al. (2021b) | $Q^*$-realizability, exponentially many actions | N/A | random access |
Wang et al. (2021) | $Q^*$-realizability, constant gap | N/A | online access |
Random access
Theoretical guarantees for the random access model have been obtained in the tabular setting (Sidford et al., 2018; Agarwal et al., 2020b; Li et al., 2020; Azar et al., 2013). As for linear function approximation, different assumptions have been made for theoretical analysis. Under the linear MDP assumption, Yang and Wang (2019) derived an optimal query complexity bound with a variance-reduced Q-learning type algorithm. Under $Q^\pi$-realizability of all deterministic policies (a strictly weaker assumption than the linear MDP assumption (Zanette et al., 2020)), Du et al. (2020) showed a negative result for settings with model misspecification error (see also Van Roy and Dong (2019); Lattimore et al. (2020)). When the misspecification error is small, assuming access to the full feature matrix, Lattimore et al. (2020) proposed algorithms with polynomial query costs, and Du et al. (2020) proposed a similar algorithm for the exact realizability setting. Since these works need to find a globally optimal design or a barycentric spanner, their computational costs depend polynomially on the size of the state space. Under the $V^*$-realizability assumption (i.e., the optimal value function is linear in some feature map), Shariff and Szepesvári (2020) proposed a planning algorithm assuming the availability of a set of core states, but obtaining such core states can still be computationally inefficient. Zanette et al. (2019) proposed an algorithm that uses a similar concept, named anchor points, but only provided a greedy heuristic to generate these points. A notable negative result was established by Weisz et al. (2021b), who show that under $Q^*$-realizability alone, any agent requires exponentially many queries to learn an optimal policy.
Local access
Many prior studies have used simulators in tree-search style algorithms (Kearns et al., 2002; Munos, 2014). Under this setting, for the online planning problem, Weisz et al. (2021a) recently established a query cost bound for learning an $\epsilon$-optimal policy with the TensorPlan algorithm, assuming $V^*$-realizability; the bound is polynomial in the relevant parameters whenever the number of actions is bounded by a constant. Thus, whenever the action set is small, TensorPlan is query efficient, but its computational efficiency is left as an open problem. Under $Q^*$-realizability and a constant sub-optimality gap, for the offline planning problem, Li et al. (2021) proposed an algorithm with polynomial query and computational costs.
Online access
As mentioned, many online RL algorithms can be converted to policy optimization algorithms under the online access model using online-to-batch conversion. There is a large body of literature on online RL with linear function approximation, and here we discuss a non-exhaustive list of prior works. Under the $Q^*$-realizability assumption, and assuming that the probability transitions of the MDP are deterministic, Wen and Van Roy (2013) proposed a sample and computationally efficient algorithm via the eluder dimension (Russo and Van Roy, 2013). Assuming the MDP has low Bellman rank, Jiang et al. (2017) proposed an algorithm that is sample efficient but computationally inefficient; similar issues arise in Zanette et al. (2020) under the low inherent Bellman error assumption. Du et al. (2021) proposed a more general MDP class, named the bilinear class, and provided a sample efficient algorithm, but its computational efficiency is unclear.
Under $Q^\pi$-realizability, several algorithms, such as Politex (Abbasi-Yadkori et al., 2019; Lazic et al., 2021), AAPI (Hao et al., 2021), and MDP-EXP2 (Wei et al., 2021), achieve sublinear regret in the infinite-horizon average-reward setting and are also computationally efficient. However, the corresponding analyses avoid the exploration issue by imposing a feature excitation assumption, which may not be satisfied in many problems. Under the linear MDP assumption, Jin et al. (2020) established a $\sqrt{T}$-type regret bound for an optimistic least-squares value iteration algorithm. Agarwal et al. (2020a) derived a polynomial sample cost bound for the policy cover policy gradient algorithm, which can also be applied in the state aggregation setting; the algorithm and sample cost were subsequently improved by Zanette et al. (2021). Under the linear mixture MDP assumption (Yang and Wang, 2020; Zhou et al., 2020), Cai et al. (2020) proved a $\sqrt{T}$-type regret bound for an optimistic least-squares policy iteration (LSPI) type algorithm. A notable negative result for the online RL setting by Wang et al. (2021) shows that an exponentially large number of samples is needed if we only assume $Q^*$-realizability and a constant sub-optimality gap. Other related works include Ayoub et al. (2020); Jin et al. (2021); Du et al. (2019); Wang et al. (2019), and references therein.
3 Preliminaries
We use $\Delta(\mathcal{X})$ to denote the set of probability distributions over a set $\mathcal{X}$. Consider an infinite-horizon discounted MDP specified by a tuple $M = (\mathcal{S}, \mathcal{A}, r, P, \rho, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the finite action space, $r$ is the reward function, $P$ is the probability transition kernel, $\rho \in \mathcal{S}$ is the initial state, and $\gamma \in (0,1)$ is the discount factor. For simplicity, in the main sections of this paper, we assume that the initial state is deterministic and known to the agent. Our algorithms can also be extended to the setting where the initial state is random and the agent is allowed to sample from the initial state distribution; we discuss this extension in Appendix E. Throughout this paper, we write $[N] := \{1, \ldots, N\}$ for any positive integer $N$ and use $\log$ to denote the natural logarithm.
A policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ is a mapping from a state to a distribution over actions. We only consider stationary policies, i.e., policies that do not change with the time step. The value function of a policy $\pi$ is the expected discounted return when we run $\pi$ starting from state $s$, i.e.,
$V^{\pi}(s) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s,\ a_t \sim \pi(\cdot \mid s_t),\ s_{t+1} \sim P(\cdot \mid s_t, a_t)\Big],$
and the state-action value function $Q^{\pi}$, also known as the Q-function, is the expected discounted return following $\pi$ conditioned on the initial state-action pair, i.e.,
$Q^{\pi}(s, a) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a,\ a_t \sim \pi(\cdot \mid s_t)\ \text{for } t \ge 1,\ s_{t+1} \sim P(\cdot \mid s_t, a_t)\Big].$
We assume that the agent interacts with a simulator using the local access protocol defined in Definition 1.1, i.e., for any state $s$ that the agent has visited and any action $a \in \mathcal{A}$, the agent can query the simulator with $(s,a)$ and obtain a sample $s' \sim P(\cdot \mid s, a)$ and the reward $r(s,a)$.
Our general goal is to find a policy that maximizes the expected return starting from the initial state $\rho$, i.e., $\max_{\pi} V^{\pi}(\rho)$. We let $\pi^*$ be the optimal policy, and write $V^* := V^{\pi^*}$ and $Q^* := Q^{\pi^*}$. We also aim to learn a good policy efficiently, i.e., the query and computational costs should not depend on the size of the state space $|\mathcal{S}|$, which can be very large in many problems.
Linear function approximation
Let $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{d}$ be a feature map which assigns to each state-action pair a $d$-dimensional feature vector. For any state-action pair $(s, a)$ that the agent has encountered, the agent can compute the feature vector $\phi(s, a)$. Here, we emphasize that the computation of the feature vectors does not incur any query cost. Without loss of generality, we impose the following bounded features assumption.
Assumption 3.1 (Bounded features).
We assume that $\|\phi(s, a)\|_2 \le 1$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.
We consider the following two different assumptions on the linear realizability of the Q-functions:
Assumption 3.2 ($Q^{\pi}$-realizability).
There exists $b > 0$ such that for every policy $\pi$, there exists a weight vector $w_{\pi} \in \mathbb{R}^{d}$, $\|w_{\pi}\|_2 \le b$, that ensures $Q^{\pi}(s, a) = \phi(s, a)^{\top} w_{\pi}$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.
Assumption 3.3 (Approximate $Q^{\pi}$-realizability).
There exist $b > 0$ and a model misspecification error $\omega \ge 0$ such that for every policy $\pi$, there exists a weight vector $w_{\pi} \in \mathbb{R}^{d}$, $\|w_{\pi}\|_2 \le b$, that ensures $|Q^{\pi}(s, a) - \phi(s, a)^{\top} w_{\pi}| \le \omega$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.
4 Algorithm
We first introduce some basic concepts used in our algorithms.
Core set
We use a concept called the core set. A core set $\mathcal{C}$ is a set of tuples of the form $(s, a, \phi(s,a), q)$. The first three elements of the tuple denote a state, an action, and the feature vector of the state-action pair, respectively. The last element $q$ denotes an estimate of $Q^\pi(s,a)$ for a policy $\pi$. During the algorithm, we may not always have such an estimate, in which case we write $q = \text{none}$. For a tuple $c \in \mathcal{C}$, we use $c_s$, $c_a$, $c_\phi$, and $c_q$ to denote its state, action, feature, and Q-estimate coordinates, respectively. We note that in prior works, the core set usually consists of the state-action pairs and their features (Lattimore et al., 2020; Du et al., 2020; Shariff and Szepesvári, 2020); in this paper, for convenience of notation, we also include the target values (Q-function estimates) in the core set elements. We denote by $\Phi_{\mathcal{C}} \in \mathbb{R}^{|\mathcal{C}| \times d}$ the feature matrix of all the elements of $\mathcal{C}$, i.e., each row of $\Phi_{\mathcal{C}}$ is the feature vector of an element of $\mathcal{C}$. Similarly, we define $q_{\mathcal{C}} \in \mathbb{R}^{|\mathcal{C}|}$ as the vector of the Q-estimates of all the tuples in $\mathcal{C}$.
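A minimal sketch of this bookkeeping in Python (the class and method names are our own; only the tuple structure follows the text):

```python
# Core set bookkeeping: an ordered list of (s, a, phi, q) tuples, with q=None
# playing the role of the "none" placeholder for a missing Q-estimate.
import numpy as np

class CoreSet:
    def __init__(self):
        self.elements = []                    # ordered list of (s, a, phi, q)

    def add(self, s, a, phi, q=None):
        self.elements.append((s, a, phi, q))

    def feature_matrix(self):
        # Each row is the feature vector of one core set element (Phi_C).
        return np.stack([phi for (_, _, phi, _) in self.elements])

    def targets(self):
        # Vector of Q-estimates q_C for the current rollout policy.
        return np.array([q for (_, _, _, q) in self.elements], dtype=float)
```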
Good set
It is also useful to introduce a notion of good set.
Definition 4.1.
Given $\lambda > 0$, $\tau > 0$, and a feature matrix $\Phi_{\mathcal{C}} \in \mathbb{R}^{|\mathcal{C}| \times d}$, the good set $\mathcal{H}_{\mathcal{C}} \subseteq \mathbb{R}^{d}$ is defined as
$\mathcal{H}_{\mathcal{C}} := \big\{ \phi \in \mathbb{R}^{d} : \phi^{\top} \big( \Phi_{\mathcal{C}}^{\top} \Phi_{\mathcal{C}} + \lambda I \big)^{-1} \phi \le \tau \big\}.$
Intuitively, the good set is the set of feature vectors that are well covered by the rows of $\Phi_{\mathcal{C}}$; in other words, these vectors are not closely aligned with the eigenvectors associated with the small eigenvalues of the regularized covariance matrix of the features in the core set.
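A sketch of the corresponding membership test, with the regularization coefficient and threshold passed as algorithm parameters (an illustration, not the paper's code):

```python
# Uncertainty check implied by Definition 4.1: phi is in the good set if
# phi^T (Phi_C^T Phi_C + lam * I)^{-1} phi <= tau.
import numpy as np

def in_good_set(phi, Phi_C, lam, tau):
    d = phi.shape[0]
    cov = Phi_C.T @ Phi_C + lam * np.eye(d)   # regularized covariance of core features
    uncertainty = phi @ np.linalg.solve(cov, phi)
    return uncertainty <= tau
```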
As an overview, our algorithm Confident MC-LSPI works as follows. First, we initialize the core set using the initial state paired with all actions. Then, the algorithm runs least-squares policy iteration (Munos, 2003) to optimize the policy: in each iteration, we estimate the Q-function value of every state-action pair in $\mathcal{C}$ under the current rollout policy using Monte Carlo rollouts with the simulator, learn a linear function to approximate the Q-function of the rollout policy, and choose the next policy to be greedy with respect to this linear function. Our second algorithm, Confident MC-Politex, works similarly, with the only difference being that instead of the greedy policy iteration update rule, we use a mirror descent update rule with KL regularization between adjacent rollout policies (Even-Dar et al., 2009; Abbasi-Yadkori et al., 2019). Moreover, in both algorithms, whenever we observe a state-action pair whose feature is not in the good set during a Monte Carlo rollout, we add the pair to the core set and restart the policy iteration process. We name the rollout subroutine ConfidentRollout. We discuss the details in the following.
4.1 Subroutine: ConfidentRollout
We first introduce the ConfidentRollout subroutine, whose purpose is to estimate $Q^\pi(s,a)$ for a given state-action pair $(s,a)$ and rollout policy $\pi$ using Monte Carlo rollouts. During a rollout, for each state that we encounter and every action $a' \in \mathcal{A}$, the subroutine checks whether the corresponding feature vector is in the good set. If not, we know that we have discovered a new feature direction, i.e., a direction that is not well aligned with the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the core features. In this case the subroutine terminates and returns the newly found state-action pair and its feature, along with a status indicating that a new direction was discovered. If the subroutine does not discover a new direction, it returns an estimate of the desired value along with a status indicating success. This subroutine is formally presented in Algorithm 1.
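The following Python sketch illustrates the subroutine just described; it assumes the `CoreSet` and `in_good_set` helpers sketched earlier, a simulator exposing `query(s, a)`, `actions`, and `features(s, a)` (assumptions, not the paper's interface), and uses illustrative status strings rather than the paper's exact pseudocode.

```python
# Sketch of ConfidentRollout: Monte Carlo estimation of Q^pi(s0, a0) with an
# uncertainty check at every visited state.
import numpy as np

def confident_rollout(sim, policy, s0, a0, core, lam, tau, num_rollouts, rollout_len, gamma):
    Phi_C = core.feature_matrix()
    returns = []
    for _ in range(num_rollouts):
        s, a, ret, discount = s0, a0, 0.0, 1.0
        for _ in range(rollout_len):
            s_next, reward = sim.query(s, a)
            ret += discount * reward
            discount *= gamma
            # Every action's feature at the new state must lie in the good set;
            # otherwise report the new direction and stop.
            for b in sim.actions:
                phi = sim.features(s_next, b)
                if not in_good_set(phi, Phi_C, lam, tau):
                    return "uncertain", None, (s_next, b, phi)
            s, a = s_next, policy(s_next)
        returns.append(ret)
    return "success", float(np.mean(returns)), None
```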
4.2 Policy iteration
With the subroutine in hand, we are now ready to present our main algorithms. Both of our algorithms maintain a core set $\mathcal{C}$. We first initialize the core set using the initial state $\rho$ and all actions $a \in \mathcal{A}$. More specifically, we check all the feature vectors $\phi(\rho, a)$, $a \in \mathcal{A}$: if a feature vector is not in the good set of the current core set, we add the tuple $(\rho, a, \phi(\rho, a), \text{none})$ to the core set. Then we start the policy iteration process. Both algorithms start with an arbitrary initial policy $\pi_0$ and run $T$ iterations. Let $\pi_t$ be the rollout policy in the $t$-th iteration. We estimate the state-action values of the pairs in $\mathcal{C}$ under the current policy $\pi_t$, i.e., $Q^{\pi_t}(c_s, c_a)$ for $c \in \mathcal{C}$, using ConfidentRollout. In this Q-function estimation procedure, we may encounter two scenarios:
(a) If the rollout subroutine always returns the success status together with an estimate of the state-action value, then once we finish the estimation for all the state-action pairs in $\mathcal{C}$, we estimate the Q-function of $\pi_t$ using regularized least squares with input features $\Phi_{\mathcal{C}}$, targets $q_{\mathcal{C}}$, and regularization coefficient $\lambda$. Let $w_t$ be the solution to the least squares problem, i.e.,
(4.1) $w_t = \big(\Phi_{\mathcal{C}}^{\top} \Phi_{\mathcal{C}} + \lambda I\big)^{-1} \Phi_{\mathcal{C}}^{\top} q_{\mathcal{C}}.$ Then, for Confident MC-LSPI, we choose the rollout policy of the next iteration, $\pi_{t+1}$, as the greedy policy with respect to the linear function $\phi(\cdot,\cdot)^{\top} w_t$:
(4.2) $\pi_{t+1}(s) = \arg\max_{a \in \mathcal{A}} \phi(s, a)^{\top} w_t.$ For Confident MC-Politex, we construct a truncated Q-function by clipping the linear function:
(4.3) $Q_t(s, a) = \Pi_{[0, (1-\gamma)^{-1}]}\big(\phi(s, a)^{\top} w_t\big),$ where $\Pi_{[u,v]}(x) := \min\{\max\{x, u\}, v\}$ denotes clipping to the interval $[u, v]$. The rollout policy of the next iteration is then
(4.4) $\pi_{t+1}(a \mid s) \propto \pi_t(a \mid s)\, \exp\big(\eta\, Q_t(s, a)\big),$ where $\eta > 0$ is an algorithm parameter.
(b) It could also happen that the ConfidentRollout subroutine returns the status indicating that a new feature direction was found. In this case, we add the state-action pair with the new feature direction found by the subroutine to the core set and restart the policy iteration process with the updated core set.
As a final note, for Confident MC-LSPI, we output the rollout policy obtained after the last iteration, whereas for Confident MC-Politex, we output a mixture policy $\bar\pi$, which is a policy chosen uniformly at random from the rollout policies of the final loop. The reason that Confident MC-Politex needs to output a mixture policy is that Politex (Szepesvári, 2021) relies on the regret analysis of expert learning (Cesa-Bianchi and Lugosi, 2006), and to obtain a single output policy, we need to use the standard online-to-batch conversion argument (Cesa-Bianchi et al., 2004). Our algorithms are formally presented in Algorithm 2. In the next section, we present theoretical guarantees for our algorithms.
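A condensed sketch of the overall Confident MC-LSPI loop described in this section is given below; it reuses the helpers sketched earlier, and the parameter names and simulator accessors are assumptions rather than the paper's notation. The Politex variant would replace the greedy update with the mirror-descent update of Eq. (4.4) applied to the clipped linear Q-function and return a uniform mixture of the iterates.

```python
# Sketch of Confident MC-LSPI (Algorithm 2): restart policy iteration whenever
# a new feature direction is found, otherwise fit ridge regression on the
# core-set targets and act greedily with respect to the linear Q-estimate.
import numpy as np

def confident_mc_lspi(sim, rho, lam, tau, T, m, n, gamma):
    core = CoreSet()
    for a in sim.actions:                            # core set initialization at rho
        phi = sim.features(rho, a)
        if not core.elements or not in_good_set(phi, core.feature_matrix(), lam, tau):
            core.add(rho, a, phi)
    while True:                                      # each pass is one "loop" of the analysis
        w = np.zeros_like(sim.features(rho, sim.actions[0]))
        def policy(s, w=w):                          # arbitrary initial policy (greedy w.r.t. zeros)
            return max(sim.actions, key=lambda b: float(sim.features(s, b) @ w))
        restarted = False
        for _ in range(T):                           # policy iteration steps
            targets = []
            for (s, a, phi, _) in list(core.elements):
                status, q_hat, new_elem = confident_rollout(
                    sim, policy, s, a, core, lam, tau, m, n, gamma)
                if status != "success":              # new direction: grow core set, restart loop
                    core.add(*new_elem)
                    restarted = True
                    break
                targets.append(q_hat)
            if restarted:
                break
            Phi_C = core.feature_matrix()
            w = np.linalg.solve(Phi_C.T @ Phi_C + lam * np.eye(Phi_C.shape[1]),
                                Phi_C.T @ np.array(targets))       # Eq. (4.1)
            def policy(s, w=w):                      # greedy policy, Eq. (4.2)
                return max(sim.actions, key=lambda b: float(sim.features(s, b) @ w))
        if not restarted:
            return policy                            # policy of the last iteration
```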
5 Theoretical guarantees
In this section, we present theoretical guarantees for our algorithms. First, we have the following main result for Confident MC-LSPI.
Theorem 5.1 (Main result for Confident MC-LSPI).
If Assumption 3.2 holds, then for an arbitrarily small target accuracy $\epsilon > 0$, by choosing the number of policy iteration steps $T$, the number of rollouts $m$, the rollout length $n$, the regularization coefficient $\lambda$, and the threshold $\tau$ appropriately, as specified in the proof in Appendix B, we have that with probability at least $1-\delta$, the policy $\pi$ that Confident MC-LSPI outputs satisfies $V^{\pi}(\rho) \ge V^{*}(\rho) - \epsilon$.
Moreover, the query and computational costs of the algorithm are polynomial in $d$, $b$, $\frac{1}{1-\gamma}$, $\frac{1}{\epsilon}$, and $\log\frac{1}{\delta}$, and are independent of the size of the state space; the computational cost additionally scales linearly with the number of actions $|\mathcal{A}|$.
Alternatively, if Assumption 3.3 holds with misspecification error $\omega > 0$, then with an analogous choice of parameters (given in Appendix B), we have that with probability at least $1-\delta$, the policy $\pi$ that Confident MC-LSPI outputs satisfies $V^{\pi}(\rho) \ge V^{*}(\rho) - \epsilon - c\,\omega$, where the amplification factor $c$ is polynomial in $d$ and $\frac{1}{1-\gamma}$.
Moreover, the query and computational costs of the algorithm remain polynomial in the same parameters.
We prove Theorem 5.1 in Appendix B. For Confident MC-Politex, since we output a mixture policy, we prove guarantees for the expected value of the mixture policy $\bar\pi$, i.e., the average of the values $V^{\pi_t}(\rho)$ of the rollout policies in the final loop. We have the following result.
Theorem 5.2 (Main result for Confident MC-Politex).
If Assumption 3.2 holds, then for an arbitrarily small target accuracy $\epsilon > 0$, by choosing the number of iterations $T$, the number of rollouts $m$, the rollout length $n$, the regularization coefficient $\lambda$, the threshold $\tau$, and the parameter $\eta$ appropriately, as specified in the proof in Appendix D, we have that with probability at least $1-\delta$, the mixture policy $\bar\pi$ that Confident MC-Politex outputs satisfies $V^{\bar\pi}(\rho) \ge V^{*}(\rho) - \epsilon$.
Moreover, the query and computational costs of the algorithm are again polynomial in $d$, $b$, $\frac{1}{1-\gamma}$, $\frac{1}{\epsilon}$, and $\log\frac{1}{\delta}$ (and independent of the size of the state space), although higher than those of Confident MC-LSPI (cf. Table 2).
Alternatively, if Assumption 3.3 holds with misspecification error $\omega > 0$, then with an analogous choice of parameters (given in Appendix D), we have that with probability at least $1-\delta$, the mixture policy $\bar\pi$ that Confident MC-Politex outputs satisfies $V^{\bar\pi}(\rho) \ge V^{*}(\rho) - \epsilon - c'\,\omega$, where the amplification factor $c'$ is smaller than the corresponding factor $c$ in Theorem 5.1.
Moreover, the query and computational costs of the algorithm remain polynomial in the same parameters.
We prove Theorem 5.2 in Appendix D. Here, we first discuss the query and computational costs of both algorithms and then provide a sketch of our proof.
Query and computational costs
In our analysis, we say that we start a new loop whenever we start (or restart) the policy iteration process, i.e., whenever we return to the beginning of the policy iteration loop in Algorithm 2. By definition, when we start a new loop, the size of the core set has increased by one. First, in Lemma 5.1 below, we show that the size of the core set never exceeds a finite bound $C_{\max}$; therefore, the total number of loops is at most $C_{\max}$. In each loop, we run $T$ policy iteration steps; in each iteration, we run Algorithm 1 from at most $C_{\max}$ points of the core set; and each time we run Algorithm 1, we query the simulator at most $mn$ times, where $m$ is the number of rollouts and $n$ is the rollout length. Thus, for both algorithms, the total number of queries is at most $C_{\max}^{2}\, T\, m\, n$. Therefore, using the parameter choices in Theorems 5.1 and 5.2 and omitting logarithmic factors, we obtain the query costs of Confident MC-LSPI and Confident MC-Politex given in Table 2. As the table shows, in the regimes we care about in this paper, the query cost of Confident MC-LSPI is lower than that of Confident MC-Politex. As for the computational cost, since our policy improvement steps only involve matrix multiplication and matrix inversion, the computational cost is also polynomial in the aforementioned factors. One thing to notice is that during the rollout process, in each step, the agent needs to compute the features of a state paired with all actions, and thus the computational cost depends linearly on $|\mathcal{A}|$; on the contrary, the query cost does not depend on $|\mathcal{A}|$, since in each step the agent only needs to query the simulator with the action sampled according to the policy.
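The accounting above can be summarized in a small helper (an illustration of the counting argument only; $C_{\max}$, $T$, $m$, and $n$ are the quantities discussed in this section):

```python
# Upper bound on the total query cost: at most C_max loops, T policy iteration
# steps per loop, at most C_max rollout starts per step, and m * n simulator
# queries per rollout call.
def query_cost_upper_bound(C_max, T, m, n):
    queries_per_rollout_call = m * n            # m rollouts of length n
    rollout_calls_per_loop = T * C_max          # T iterations, one call per core element
    return C_max * rollout_calls_per_loop * queries_per_rollout_call
```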
Sub-optimality
We also note that when Assumption 3.3 holds with $\omega > 0$, the final sub-optimality of the output policy contains a term proportional to the misspecification error $\omega$ for both algorithms, and this term is smaller for Confident MC-Politex than for Confident MC-LSPI. Therefore, in the presence of a model misspecification error, Confident MC-Politex can achieve a better final sub-optimality than Confident MC-LSPI, although its query cost is higher.
Table 2: Query costs and final sub-optimality of Confident MC-LSPI and Confident MC-Politex in the realizable ($\omega = 0$) and misspecified ($\omega > 0$) settings.
Proof sketch
We now discuss our proof strategy, focusing on LSPI for simplicity.
Step 1: Bound the size of the core set
The first step is to show that our algorithm terminates. This is equivalent to showing that the size of the core set cannot exceed a certain finite quantity, since whenever ConfidentRollout returns the status indicating a new feature direction, we increase the size of the core set by one, return to the beginning of the policy iteration loop in Algorithm 2, and start a new loop. The following lemma shows that the size of the core set is always bounded, and thus the algorithm always terminates.
Lemma 5.1.
Under Assumption 3.1, the size of the core set will not exceed a finite quantity $C_{\max}$ that is polynomial in $d$ and $\frac{1}{\tau}$ and logarithmic in $\frac{1}{\lambda}$ (see Appendix A).
Step 2: Virtual policy iteration
The next step is to analyze the gap between the value of the optimal policy and that of the policy that the algorithm outputs in the final loop, i.e., $V^{*}(\rho) - V^{\pi}(\rho)$, where $\pi$ denotes the output policy. For ease of exposition, here we only consider the case of a deterministic probability transition kernel $P$; our full proof in Appendix B considers general stochastic dynamics.
To analyze our algorithm, we note that for approximate policy iteration (API) algorithms, if in every iteration (say the $t$-th iteration) we have an approximate Q-function $Q_t$ that is close to the true Q-function $Q^{\pi_t}$ of the rollout policy $\pi_t$ in $\ell_\infty$ norm, then existing results (Munos, 2003; Farahmand et al., 2010) ensure that we can learn a good policy by choosing each new policy to be greedy with respect to the approximate Q-function. However, since we only have local access to the simulator, we cannot obtain such a guarantee. In fact, as we show in the proof, we can only ensure that when $\phi(s,a)$ is in the good set $\mathcal{H}_{\mathcal{C}}$, our linear function approximation $\phi(s,a)^{\top} w_t$ is an accurate estimate of $Q^{\pi_t}(s,a)$. To overcome the lack of a global guarantee, we introduce the notion of a virtual policy iteration algorithm. The virtual algorithm starts with the same initial policy $\pi_0$. In the $t$-th iteration of the virtual algorithm, we assume that we have access to the true Q-function $Q^{\tilde\pi_t}$ of the (virtual) rollout policy $\tilde\pi_t$ at the state-action pairs whose features lie outside the good set, and construct
$\tilde Q_t(s,a) = \begin{cases} \phi(s,a)^{\top} \tilde w_t & \text{if } \phi(s,a) \in \mathcal{H}_{\mathcal{C}}, \\ Q^{\tilde\pi_t}(s,a) & \text{otherwise}, \end{cases}$
where $\tilde w_t$ is the linear coefficient vector learned by the virtual algorithm in the same way as in Eq. (4.1). Then $\tilde\pi_{t+1}$ is chosen to be greedy with respect to $\tilde Q_t$. In this way, we can ensure that $\tilde Q_t$ is close to the true Q-function $Q^{\tilde\pi_t}$ in $\ell_\infty$ norm, and thus the output policy of the virtual algorithm, say $\tilde\pi$, is good in the sense that $V^{*}(\rho) - V^{\tilde\pi}(\rho)$ is small.
To connect the output policy of the virtual algorithm to that of our actual algorithm, we note that, by definition, in the final loop of our algorithm, in any iteration, for any state $s$ that the agent visits in ConfidentRollout and any action $a \in \mathcal{A}$, we have $\phi(s,a) \in \mathcal{H}_{\mathcal{C}}$, since the subroutine never returns the status indicating a new direction. Further, because the initial state, the probability transition kernel, and the policies are all deterministic, the rollout trajectories of the virtual algorithm and our actual algorithm are always the same in the final loop (the virtual algorithm never gets a chance to use the true Q-function $Q^{\tilde\pi_t}$). With rollout length $n$, we know that when started from the initial state $\rho$, the output policies of the virtual algorithm and of our actual algorithm take exactly the same actions for $n$ steps, and thus their values at $\rho$ differ only by a term that is geometrically small in $n$, which implies that $V^{*}(\rho) - V^{\pi}(\rho)$ is small. To extend this argument to the setting with stochastic transitions, we need a coupling argument, which we elaborate in the Appendix.
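Schematically, the argument combines the two steps as follows (constants and exact error terms are given in Appendix B):

```latex
% Schematic decomposition used in the proof sketch: the virtual algorithm's
% sub-optimality plus the (small, by coupling) gap between the virtual and
% actual output policies.
\[
V^{*}(\rho) - V^{\pi}(\rho)
  \;=\; \underbrace{\bigl(V^{*}(\rho) - V^{\tilde\pi}(\rho)\bigr)}_{\text{virtual API error}}
  \;+\; \underbrace{\bigl(V^{\tilde\pi}(\rho) - V^{\pi}(\rho)\bigr)}_{\text{virtual vs.\ actual, small by coupling}}
\]
```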
6 Conclusion
We propose the Confident MC-LSPI and Confident MC-Politex algorithms for local planning with linear function approximation. Under the assumption that the Q-functions of all policies are linear in given features of the state-action pairs, we show that our algorithms are query and computationally efficient. We introduce a novel analysis technique based on a virtual policy iteration algorithm, which can be used to leverage existing guarantees on approximate policy iteration with $\ell_\infty$-bounded evaluation error. We use this technique to show that our algorithms can learn the optimal policy for the given initial state even with only local access to the simulator. Future directions include extending our analysis technique to broader settings.
Acknowledgement
The authors would like to thank Gellért Weisz for helpful comments.
References
- Abbasi-Yadkori et al. [2019] Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–3702. PMLR, 2019.
- Agarwal et al. [2020a] Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. PC-PG: Policy cover directed exploration for provable policy gradient learning. arXiv preprint arXiv:2007.08459, 2020a.
- Agarwal et al. [2020b] Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83. PMLR, 2020b.
- Awerbuch and Kleinberg [2008] Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
- Ayoub et al. [2020] Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.
- Azar et al. [2013] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349, 2013.
- Cai et al. [2020] Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283–1294. PMLR, 2020.
- Cesa-Bianchi and Lugosi [2006] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
- Cesa-Bianchi et al. [2004] Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
- Du et al. [2019] Simon S Du, Yuping Luo, Ruosong Wang, and Hanrui Zhang. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. arXiv preprint arXiv:1906.06321, 2019.
- Du et al. [2020] Simon S Du, Sham M Kakade, Ruosong Wang, and Lin Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020.
- Du et al. [2021] Simon S Du, Sham M Kakade, Jason D Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. arXiv preprint arXiv:2103.10897, 2021.
- Ecoffet et al. [2019] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
- Even-Dar et al. [2009] Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
- Farahmand et al. [2010] Amir Massoud Farahmand, Rémi Munos, and Csaba Szepesvári. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, 2010.
- Hao et al. [2021] Botao Hao, Nevena Lazic, Yasin Abbasi-Yadkori, Pooria Joulani, and Csaba Szepesvári. Adaptive approximate policy iteration. In International Conference on Artificial Intelligence and Statistics, pages 523–531. PMLR, 2021.
- Harville [1998] David A Harville. Matrix algebra from a statistician’s perspective, 1998.
- Jiang et al. [2017] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
- Jin et al. [2020] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- Jin et al. [2021] Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. arXiv preprint arXiv:2102.00815, 2021.
- Kakade [2003] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003.
- Kearns and Singh [1999] Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems, pages 996–1002, 1999.
- Kearns et al. [2002] Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine learning, 49(2):193–208, 2002.
- Lattimore et al. [2020] Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pages 5662–5670. PMLR, 2020.
- Lazic et al. [2021] Nevena Lazic, Dong Yin, Yasin Abbasi-Yadkori, and Csaba Szepesvari. Improved regret bound and experience replay in regularized policy iteration. arXiv preprint arXiv:2102.12611, 2021.
- Li et al. [2020] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 33, 2020.
- Li et al. [2021] Gen Li, Yuxin Chen, Yuejie Chi, Yuantao Gu, and Yuting Wei. Sample-efficient reinforcement learning is feasible for linearly realizable MDPs with limited revisiting. arXiv preprint arXiv:2105.08024, 2021.
- Munos [2003] Rémi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, volume 3, pages 560–567, 2003.
- Munos [2014] Rémi Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 2014.
- Russo and Van Roy [2013] Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264. Citeseer, 2013.
- Shariff and Szepesvári [2020] Roshan Shariff and Csaba Szepesvári. Efficient planning in large MDPs with weak linear function approximation. arXiv preprint arXiv:2007.06184, 2020.
- Sidford et al. [2018] Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5192–5202, 2018.
- Singh and Yee [1994] Satinder P Singh and Richard C Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
- Szepesvári [2021] Csaba Szepesvári. RL Theory lecture notes: POLITEX. https://rltheory.github.io/lecture-notes/planning-in-mdps/lec14/, 2021.
- Todd [2016] Michael J Todd. Minimum-volume ellipsoids: Theory and algorithms. SIAM, 2016.
- Van Roy and Dong [2019] Benjamin Van Roy and Shi Dong. Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.
- Wang et al. [2019] Yining Wang, Ruosong Wang, Simon S Du, and Akshay Krishnamurthy. Optimism in reinforcement learning with generalized linear function approximation. arXiv preprint arXiv:1912.04136, 2019.
- Wang et al. [2021] Yuanhao Wang, Ruosong Wang, and Sham M Kakade. An exponential lower bound for linearly-realizable MDPs with constant suboptimality gap. arXiv preprint arXiv:2103.12690, 2021.
- Wei et al. [2021] Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, and Rahul Jain. Learning infinite-horizon average-reward MDPs with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pages 3007–3015. PMLR, 2021.
- Weisz et al. [2021a] Gellert Weisz, Philip Amortila, Barnabás Janzer, Yasin Abbasi-Yadkori, Nan Jiang, and Csaba Szepesvári. On query-efficient planning in MDPs under linear realizability of the optimal state-value function. arXiv preprint arXiv:2102.02049, 2021a.
- Weisz et al. [2021b] Gellért Weisz, Philip Amortila, and Csaba Szepesvári. Exponential lower bounds for planning in MDPs with linearly-realizable optimal action-value functions. In Algorithmic Learning Theory, pages 1237–1264. PMLR, 2021b.
- Wen and Van Roy [2013] Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. Advances in Neural Information Processing Systems, 26, 2013.
- Yang and Wang [2019] Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.
- Yang and Wang [2020] Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. International Conference on Machine Learning, 2020.
- Zanette et al. [2019] Andrea Zanette, Alessandro Lazaric, Mykel J Kochenderfer, and Emma Brunskill. Limiting extrapolation in linear approximate value iteration. Advances in Neural Information Processing Systems, 32:5615–5624, 2019.
- Zanette et al. [2020] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.
- Zanette et al. [2021] Andrea Zanette, Ching-An Cheng, and Alekh Agarwal. Cautiously optimistic policy optimization and exploration with linear function approximation. In Conference on Learning Theory (COLT), 2021.
- Zhou et al. [2020] Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. arXiv preprint arXiv:2006.13165, 2020.
Appendix
Appendix A Proof of Lemma 5.1
This proof essentially follows the proof of the upper bound for the eluder dimension of a linear function class in Russo and Van Roy [2013]. We present the proof here for completeness.
We restate the core set construction process in the following way, with slightly different notation. We begin with an empty core set. In the $t$-th step, we have a core set with feature matrix $\Phi_t \in \mathbb{R}^{t \times d}$. Suppose that we can find $\phi_{t+1} \in \mathbb{R}^{d}$, $\|\phi_{t+1}\|_2 \le 1$, such that
(A.1) $\phi_{t+1}^{\top} \big(\Phi_t^{\top} \Phi_t + \lambda I\big)^{-1} \phi_{t+1} > \tau$;
then we let $\Phi_{t+1} := [\Phi_t^{\top} \ \ \phi_{t+1}]^{\top}$, i.e., we append $\phi_{t+1}^{\top}$ as a new row at the bottom of $\Phi_t$. If we cannot find such a $\phi_{t+1}$, we terminate the process. We define $\Lambda_t := \Phi_t^{\top} \Phi_t + \lambda I$. It is easy to see that $\Lambda_0 = \lambda I$ and $\Lambda_{t+1} = \Lambda_t + \phi_{t+1} \phi_{t+1}^{\top}$.
According to the matrix determinant lemma [Harville, 1998], we have
(A.2) $\det(\Lambda_{t+1}) = \det(\Lambda_t)\big(1 + \phi_{t+1}^{\top} \Lambda_t^{-1} \phi_{t+1}\big) > (1+\tau)\det(\Lambda_t),$
where the inequality is due to (A.1). Since $\det(\Lambda_t)$ is the product of all the eigenvalues of $\Lambda_t$, according to the AM-GM inequality, we have
(A.3) $\det(\Lambda_t) \le \Big(\frac{\operatorname{tr}(\Lambda_t)}{d}\Big)^{d} \le \Big(\lambda + \frac{t}{d}\Big)^{d},$
where in the second inequality we use the fact that $\operatorname{tr}(\Lambda_t) = \lambda d + \sum_{i=1}^{t} \|\phi_i\|_2^2 \le \lambda d + t$. Combining (A.2) and (A.3), we know that $t$ must satisfy
$(1+\tau)^{t} \lambda^{d} < \Big(\lambda + \frac{t}{d}\Big)^{d},$
which is equivalent to
(A.4) $t \log(1+\tau) < d \log\Big(1 + \frac{t}{\lambda d}\Big).$
We note that if , the result of the size of the core set in Lemma 5.1 automatically holds. Thus, we only consider the situation here . In this case, the condition (A.4) implies
(A.5) |
Using the fact that for any , , and that for any , , we obtain
(A.6) |
which implies the claimed bound $C_{\max}$ on the size of the core set, completing the proof. ∎
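As a sanity check (not part of the proof), the following sketch simulates the growth process just analyzed: it repeatedly adds unit vectors whose regularized uncertainty exceeds $\tau$, drawn greedily from a random pool, and counts how many additions are possible. The pool-based search is only a heuristic stand-in for a worst-case adversary, and all names and parameter values are illustrative.

```python
# Numerical illustration of the core set growth process.
import numpy as np

def simulate_core_growth(d=20, lam=1.0, tau=0.5, pool=20000, seed=0):
    rng = np.random.default_rng(seed)
    Lambda = lam * np.eye(d)                    # regularized covariance Lambda_t
    count = 0
    while True:
        candidates = rng.normal(size=(pool, d))
        candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
        # Quadratic form phi^T Lambda^{-1} phi for every candidate in the pool.
        scores = np.einsum("ij,jk,ik->i", candidates, np.linalg.inv(Lambda), candidates)
        idx = np.argmax(scores)
        if scores[idx] <= tau:                  # no candidate violates the threshold
            return count
        phi = candidates[idx]
        Lambda += np.outer(phi, phi)            # add the violating vector
        count += 1

# For d=20, lam=1.0, tau=0.5 this typically returns a few dozen vectors,
# consistent with a bound that is polynomial in d and 1/tau.
print(simulate_core_growth())
```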
Appendix B Proof of Theorem 5.1
In this proof, we say that we start a new loop whenever we start (or restart) the policy iteration process, i.e., whenever we return to the beginning of the policy iteration loop in Algorithm 2. In each loop, we have at most $T$ policy iteration steps. By definition, we also know that when we start a new loop, the size of the core set has increased by one compared with the previous loop. We first introduce the virtual policy iteration algorithm. This virtual algorithm is designed to leverage the existing results on approximate policy iteration with bounded $\ell_\infty$ error in the approximate Q-functions [Munos, 2003, Farahmand et al., 2010]. We first present the details of the virtual algorithm, and then provide performance guarantees for the main algorithm.
B.1 Virtual approximate policy iteration with coupling
The virtual policy iteration algorithm is a virtual algorithm that we use purely for the purpose of the proof. It is a version of approximate policy iteration (API) with a simulator. An important point is that the simulators of the virtual algorithm and the main algorithm need to be coupled, which we explain in this section.
The virtual algorithm is defined as follows. Unlike the main algorithm, the virtual algorithm runs exactly $C_{\max}$ loops, where $C_{\max}$ is the upper bound on the size of the core set given in Lemma 5.1. In the virtual algorithm, we let the initial policy be the same as in the main algorithm, i.e., $\tilde\pi_0 = \pi_0$. Also unlike the main algorithm, the virtual algorithm runs exactly $T$ iterations of policy iteration in every loop. In the $t$-th iteration, the virtual algorithm runs rollouts with the policy $\tilde\pi_t$ from each element in the core set (we will discuss how the virtual algorithm constructs the core set later) using its simulator, where $\tilde\pi_t$ is of the form of Eq. (B.3) (the quantities appearing there will be defined once we present the details of the virtual algorithm).
We now describe the rollout process of the virtual algorithm. We still use a subroutine similar to ConfidentRollout. The simulator of the virtual algorithm can still generate samples of the next state given a state-action pair according to the probability transition kernel $P$. The major difference from the main algorithm is that during the rollout process, when we find a state-action pair $(s,a)$ whose feature is outside of the good set (defined in Definition 4.1), i.e., such that $\phi(s,a) \notin \mathcal{H}_{\mathcal{C}}$, we do not terminate the subroutine; instead, we record this state-action pair along with its feature (we call it a recorded element) and then keep running the rollout process using $\tilde\pi_t$. Two situations can occur at the end of each loop: 1) we did not record any element, in which case we use the same core set in the next loop; and 2) we have at least one recorded element in this loop, in which case we add the first recorded element to the core set and discard all other recorded elements. In other words, in each loop of the virtual algorithm, we find the first state-action pair (if any) whose feature is outside of the good set and add this pair to the core set. Another difference from the main algorithm is that in the virtual algorithm, we do not end the rollout subroutine when we identify an uncertain state-action pair, and as a result, the rollout subroutine in the virtual algorithm always returns an estimate of the Q-function.
We now proceed to present the virtual policy iteration process. In the $t$-th iteration, the virtual algorithm runs $m$ trajectories of $n$-step rollouts using the policy $\tilde\pi_t$ from each element of the core set and obtains the empirical averages of the discounted returns in the same way as in Algorithm 1. Then we concatenate these estimates to obtain the vector $\tilde q_{\mathcal{C}}$ and compute
(B.1) $\tilde w_t = \big(\Phi_{\mathcal{C}}^{\top} \Phi_{\mathcal{C}} + \lambda I\big)^{-1} \Phi_{\mathcal{C}}^{\top} \tilde q_{\mathcal{C}}.$
We use the notion of the good set defined in Definition 4.1 and define the virtual Q-function as follows:
(B.2) $\tilde Q_t(s,a) = \begin{cases} \phi(s,a)^{\top} \tilde w_t & \text{if } \phi(s,a) \in \mathcal{H}_{\mathcal{C}}, \\ Q^{\tilde\pi_t}(s,a) & \text{otherwise}, \end{cases}$
by assuming access to the true Q-function $Q^{\tilde\pi_t}$. The next policy $\tilde\pi_{t+1}$ is defined as the greedy policy with respect to $\tilde Q_t$, i.e.,
(B.3) $\tilde\pi_{t+1}(s) = \arg\max_{a \in \mathcal{A}} \tilde Q_t(s,a).$
Recall that for the main algorithm, once we learn the parameter vector $w_t$, the next policy is greedy with respect to the linear function $\phi(\cdot,\cdot)^{\top} w_t$, i.e., $\pi_{t+1}(s) = \arg\max_{a \in \mathcal{A}} \phi(s,a)^{\top} w_t$.
For comparison, the key difference is that when it observes a feature vector that is not in the good set $\mathcal{H}_{\mathcal{C}}$, our actual algorithm terminates the rollout and returns the state-action pair with the new direction, whereas the virtual algorithm uses the true Q-function value of that state-action pair.
Coupling
The major remaining issue is how the main algorithm is connected to the virtual algorithm. We describe this connection with a coupling argument. In a particular loop, for any positive integer $k$, when the virtual algorithm makes its $k$-th query in the $t$-th iteration to its simulator with a state-action pair, say $(\tilde s, \tilde a)$, if the main algorithm has not yet returned due to encountering an uncertain state-action pair, we assume that at the same time the main algorithm also makes its $k$-th query to its simulator, with a state-action pair, say $(s, a)$. We let the two simulators be coupled: when they are queried with the same pair, i.e., $(s, a) = (\tilde s, \tilde a)$, the next states that they return are also the same. In other words, the simulator for the main algorithm samples $s' \sim P(\cdot \mid s, a)$, the virtual algorithm samples $\tilde s' \sim P(\cdot \mid \tilde s, \tilde a)$, and $s'$ and $\tilde s'$ have a joint distribution such that $s' = \tilde s'$ almost surely whenever $(s,a) = (\tilde s, \tilde a)$. In the cases where $(s,a) \neq (\tilde s, \tilde a)$, or where the main algorithm has already returned due to the discovery of a new feature direction, the virtual algorithm samples from $P(\cdot \mid \tilde s, \tilde a)$ independently of the main algorithm. Note that this setup guarantees that both the virtual algorithm and the main algorithm have valid simulators which sample from the same probability transition kernel $P$.
There are a few direct consequences of this coupling design. First, since the virtual and main algorithms start with the same initial core set elements (constructed using the initial state), we know that in any loop, when starting from the same core set element, both algorithms have exactly the same rollout trajectories until the main algorithm identifies an uncertain state-action pair and returns. This is due to the coupling of the simulators and the fact that within the good set $\mathcal{H}_{\mathcal{C}}$, the policies of the main algorithm and the virtual algorithm take the same actions. Later, we discuss this point in more detail in Lemma B.5. Second, the core set elements that the virtual and main algorithms use are exactly the same in every loop. This is because when the main algorithm identifies an uncertain state-action pair, it adds the pair to the core set and starts a new loop, and the virtual algorithm also adds only the first recorded element to the core set. Since the simulators are coupled, the first uncertain state-action pair that they encounter is the same, meaning that both algorithms always add the same element to the core set, until the main algorithm finishes its final loop. We note that the core set elements in our algorithms are stored as an ordered list, so the virtual and main algorithms always run rollouts with the same ordering of the core set elements. Another observation is that while the virtual algorithm has a deterministic number of loops $C_{\max}$, the total number of loops that the main algorithm runs is a random variable whose value cannot exceed $C_{\max}$.
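For intuition, one standard way to realize such a coupling is with common random numbers, sketched below; the sampler interface and the hashability of states are assumptions, and the proof's measure-theoretic construction differs in its details.

```python
# Coupling via common random numbers: the next state is drawn with a random
# seed that depends only on the query index and the queried pair, so identical
# queries at the same index return identical next states, while each query is
# still a draw from P(.|s, a).
import random

class CoupledSimulator:
    def __init__(self, transition_sampler, base_seed=0):
        # transition_sampler(s, a, rng) must return a sample from P(.|s, a)
        # using only the pseudo-random generator rng.
        self.transition_sampler = transition_sampler
        self.base_seed = base_seed

    def query(self, query_index, s, a):
        # Both the main and the virtual algorithm call this with their own
        # running query index; the seed ignores which algorithm is calling.
        rng = random.Random(hash((self.base_seed, query_index, s, a)))
        return self.transition_sampler(s, a, rng)
```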
The next steps of the proof are the following:
• We show that in each loop, with high probability, the virtual algorithm proceeds as an approximate policy iteration algorithm with a bounded error in the approximate Q-function, and thus the virtual algorithm produces a good policy at the end of each loop. Then, since by Lemma 5.1 we have at most $C_{\max}$ loops (see (B.4)), a union bound shows that, with high probability, the virtual algorithm produces a good policy in all the loops.
• We show that, due to the coupling argument, the output parameter vectors of the main and the virtual algorithms in the final loop are the same. This leads to the conclusion that, with the same initial state $\rho$, the values of the output policies of the main algorithm and the virtual algorithm are close, and thus the main algorithm also outputs a good policy.
We prove these two points in Sections B.2 and B.3, respectively.
B.2 Analysis of the virtual algorithm
Throughout this section, we consider a fixed loop of the virtual algorithm. We assume that at the beginning of this loop, the virtual algorithm has a core set $\mathcal{C}$. Notice that $\mathcal{C}$ is a random variable that only depends on the randomness of the previous loops. In this section, we first condition on the randomness of all the previous loops and only consider the randomness of the current loop; thus, we initially treat $\mathcal{C}$ as a deterministic quantity and suppress the loop index in the notation.
Consider the $t$-th iteration of a particular loop of the virtual algorithm with core set $\mathcal{C}$. We would like to bound the error of the virtual Q-function $\tilde Q_t$. First, we have the following lemma on the accuracy of the Q-function estimate for any element in the core set. To simplify notation, in this lemma we omit the subscript $t$ and use $\pi$ to denote a policy that we run rollouts with in an arbitrary iteration of the virtual algorithm.
Lemma B.1.
Let be a policy that we run rollout with in an iteration of the virtual algorithm. Then, for any element and any , we have with probability at least ,
(B.5) |
Proof.
By the definition of :
and define the -step truncated Q-function:
Then we have . Moreover, the Q-function estimate is an average of independent and unbiased estimates of , which are all bounded in . By Hoeffding’s inequality we have with probability at least , , which completes the proof. ∎
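For concreteness, the two steps just described yield a bound of the following schematic form, assuming rewards normalized to $[0,1]$ (so that returns lie in $[0, \frac{1}{1-\gamma}]$) and writing $\delta'$ for the per-estimate failure probability; the exact constants in (B.5) may differ.

```latex
% Schematic form of (B.5): an n-step truncation term plus a Hoeffding deviation
% term over m independent rollouts; constants are illustrative.
\[
\bigl|\hat q_c - Q^{\pi}(c_s, c_a)\bigr|
 \;\le\; \underbrace{\frac{\gamma^{n}}{1-\gamma}}_{\text{truncation after } n \text{ steps}}
 \;+\; \underbrace{\frac{1}{1-\gamma}\sqrt{\frac{\log(2/\delta')}{2m}}}_{\text{Hoeffding, } m \text{ rollouts}}
 \qquad\text{with probability at least } 1-\delta'.
\]
```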
By a union bound over the elements in the core set, we know that
(B.6) |
The following lemma provides a bound on , such that .
Lemma B.2.
Suppose that Assumption 3.3 holds. Then, with probability at least
for any pair such that , we have
(B.7) |
We prove this lemma in Appendix C. Since when , , we know that . With another union bound over the iterations, we know that with probability at least
the virtual algorithm is an approximate policy iteration algorithm with a bounded approximation error on the Q-functions. We use the following result for API, which is a direct consequence of the results in Munos [2003] and Farahmand et al. [2010], and is also stated in Lattimore et al. [2020].
Lemma B.3.
Suppose that we run approximate policy iterations and generate a sequence of policies . Suppose that for every , in the -th iteration, we obtain a function such that, , and choose to be greedy with respect to . Then
According to Lemma B.3,
(B.8) |
Then, since , we know that
(B.9) |
The following lemma translates the gap in Q-functions to the gap in value.
Lemma B.4.
[Singh and Yee, 1994] Let be greedy with respect to a function . Then for any state ,
Since is greedy with respect to , we know that
(B.10) |
We notice that this result is obtained by conditioning on all the previous loops and only considering the randomness of the current loop. More specifically, given any core set $\mathcal{C}$ at the beginning of the loop, we have
By law of total probability we have
With another union bound over the loops of the virtual algorithm, we know that with probability at least
(B.11) |
Eq. (B.10) holds for all the loops. We call this event $\mathcal{E}_1$ in the following.
B.3 Analysis of the main algorithm
We now move to the analysis of the main algorithm. Throughout this section, when we mention the final loop, we mean the final loop of the main algorithm, which may not be the final loop of the virtual algorithm. We have the following result.
Lemma B.5.
In the final loop of the main algorithm, all the rollout trajectories of the virtual algorithm are exactly the same as those of the main algorithm, and therefore $\tilde w_t = w_t$ for all $t$.
Proof.
We notice that, since we only consider the final loop, in any iteration, for any state $s$ in any of the rollout trajectories of the main algorithm and all actions $a$, we have $\phi(s,a) \in \mathcal{H}_{\mathcal{C}}$. In the first iteration, since $\tilde\pi_0 = \pi_0$ and the simulators are coupled, we know that all the rollout trajectories are the same between the main algorithm and the virtual algorithm; as a result, all the Q-function estimates are the same, and thus $\tilde w_0 = w_0$. If we have $\tilde w_t = w_t$, then by the definition in (B.2), the policies $\tilde\pi_{t+1}$ and $\pi_{t+1}$ take the same action at any state $s$ for which $\phi(s,a) \in \mathcal{H}_{\mathcal{C}}$ for all $a \in \mathcal{A}$. Again using the fact that the simulators are coupled, the rollout trajectories generated by $\tilde\pi_{t+1}$ and $\pi_{t+1}$ are also the same between the main algorithm and the virtual algorithm, and thus $\tilde w_{t+1} = w_{t+1}$. ∎
Since $\|\phi(s,a)\|_2 \le 1$ for all $(s,a)$, we can verify that, with an appropriate choice of $\lambda$ and $\tau$, after adding a state-action pair to the core set, its feature vector stays in the good set $\mathcal{H}_{\mathcal{C}}$. Recall that in the core set initialization stage of Algorithm 2, if for an action $a$ the feature vector $\phi(\rho, a)$ is not in the good set, we add the tuple $(\rho, a, \phi(\rho, a), \text{none})$ to the core set. Thus, after the core set initialization stage, we have $\phi(\rho, a) \in \mathcal{H}_{\mathcal{C}}$ for all $a \in \mathcal{A}$. Moreover, according to Lemma B.2, we also know that when event $\mathcal{E}_1$ happens,
(B.12) |
In the following, we bound the difference between the value of the output policy of the main algorithm and that of the output policy of the virtual algorithm in the final loop of the main algorithm. To do this, we use another auxiliary virtual policy iteration algorithm, which we call virtual-2 in the following. Virtual-2 is similar to the virtual policy iteration algorithm in Appendix B.1. The simulator of virtual-2 is coupled with that of the virtual algorithm, and virtual-2 also uses the same initial policy as the main algorithm. Virtual-2 also uses Monte Carlo rollouts with the simulator to obtain estimated Q-function values, and the linear regression coefficients are computed in the same way as in (B.1). The virtual-2 algorithm also conducts the uncertainty check in the rollout subroutine. Similar to the virtual algorithm, when it identifies an uncertain state-action pair, it records the pair and keeps running the rollout process. At the end of each loop, the virtual-2 algorithm still adds the first recorded element to the core set and discards the other recorded elements. The only difference lies in the definition of the virtual Q-function used for the policy update. Using the same arguments as in Appendix B.2, we know that, with high probability (by the same union bound as before), for all the loops and all the policy iteration steps in every loop, the approximation error bound of Lemma B.2 holds for all state-action pairs whose features are in the good set. We call this event $\mathcal{E}_2$. Since the simulator of virtual-2 is also coupled with that of the main algorithm, by the same argument as in Lemma B.5, we know that in the last iteration of the final loop of the main algorithm, the weight vectors of virtual-2 and of the main algorithm coincide. We also know that when event $\mathcal{E}_2$ happens, in the last iteration of all the loops of virtual-2,
(B.13)
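As a concrete illustration of the uncertainty check run inside the rollout subroutine of the virtual and virtual-2 algorithms, the following sketch tests membership in the good set induced by the current core set. The names `lam` and `tau` are illustrative stand-ins for the paper's regularization and threshold parameters, and the precise form of the check is our assumption based on the ridge-regression setup in (B.1).

```python
import numpy as np

def in_good_set(phi, core_features, lam, tau):
    """Return True if phi^T (Phi_C^T Phi_C + lam * I)^{-1} phi <= tau, where Phi_C
    stacks the feature vectors currently in the core set."""
    d = phi.shape[0]
    cov = lam * np.eye(d)
    if core_features:                       # list of d-dimensional feature vectors
        Phi = np.stack(core_features)       # shape (|C|, d)
        cov = cov + Phi.T @ Phi
    return float(phi @ np.linalg.solve(cov, phi)) <= tau
```

In the virtual algorithms, a pair that fails this check during a rollout is only recorded, and the first recorded pair is appended to the core set once the loop ends.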
Therefore, when both events and happen, combining (B.12) and (B.13), and using the fact that , we know that
Combining this fact with (B.10) and using a union bound, we know that with probability at least
(B.14)
with defined as in (B.4), we have
(B.15)
Finally, we choose the appropriate parameters. Note that we would like to ensure that the success probability in Eq. (B.14) is at least and, at the same time, that the sub-optimality (the right-hand side of Eq. (B.15)) is as small as possible. Suppose that Assumption 3.2 holds, i.e., in (B.7). It can be verified that by choosing , , , , , , we can ensure that the error probability is at most and . Suppose instead that Assumption 3.3 holds. It can be verified that by choosing , , , , , , we can ensure that with probability at least ,
Appendix C Proof of Lemma B.2
To simplify notation, we write , , and in this proof. According to Eq. (B.6), with probability at least ,
holds for all . We condition on this event in the following derivation. Suppose that Assumption 3.3 holds. We know that there exists with such that for any ,
Let . Then we have
(C.1)
Suppose that for a state-action pair , the feature vector , with defined in Definition 4.1. Then we have
(C.2)
We then bound and in (C.2). As in Appendix A, let be the eigendecomposition of , with and an orthonormal matrix. Notice that for all , . Let . Then for , we have
(C.3)
where the first inequality uses the Cauchy-Schwarz inequality and the assumption that , and the second inequality uses the fact that . On the other hand, since , we have , i.e., . Combining this fact with (C.3), we obtain
(C.4)
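The Cauchy-Schwarz step used above can be sanity-checked numerically. The snippet below is only a sketch with made-up dimensions, noise level, and parameter values (`lam`, `tau`, and `w_star` are illustrative, not quantities fixed by the paper); it verifies that the extrapolation error of a ridge-regression fit at a feature vector on the good-set boundary is bounded by the product of the two weighted norms appearing in the inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam, tau = 5, 40, 1.0, 1.0
Phi = rng.normal(size=(n, d))                 # core-set feature matrix (illustrative)
w_star = rng.normal(size=d)                   # hypothetical true linear coefficients
y = Phi @ w_star + 0.01 * rng.normal(size=n)  # noisy Q-value estimates on the core set

cov = Phi.T @ Phi + lam * np.eye(d)
w_hat = np.linalg.solve(cov, Phi.T @ y)       # ridge-regression coefficients as in (B.1)

phi = rng.normal(size=d)
phi *= np.sqrt(tau / (phi @ np.linalg.solve(cov, phi)))   # put phi on the good-set boundary

pred_err = abs(phi @ (w_hat - w_star))
# Cauchy-Schwarz in the cov-weighted inner product:
# |phi^T (w_hat - w_star)| <= ||phi||_{cov^{-1}} * ||w_hat - w_star||_{cov}
bound = np.sqrt(phi @ np.linalg.solve(cov, phi)) * np.sqrt(
    (w_hat - w_star) @ cov @ (w_hat - w_star)
)
assert pred_err <= bound + 1e-9
```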
Appendix D Proof of Theorem 5.2
First, we state a general result on Politex from Szepesvári [2021]. Notice that in this result, we consider an arbitrary sequence of approximate Q-functions , , which need not take the form of (4.3).
Lemma D.1 (Szepesvári [2021]).
Given an initial policy and a sequence of functions , , construct a sequence of policies according to (4.4) with . Then, for any , the mixture policy satisfies
We then consider a virtual Politex algorithm. As in the vanilla policy iteration algorithm, the virtual Politex algorithm begins with . In the -th iteration, we run a Monte Carlo rollout with policy and obtain the estimates of the Q-function values . We then compute the weight vector
and according to Lemma B.2, for any , with probability at least , for all such that ,
(D.1)
Then we define the virtual Q-function as
assuming we have access to the true Q-function when . We let the policy of the -th iteration be
(D.2)
Since we always have , the clipping at and can only improve the accuracy of the estimation of the Q-function. Therefore, we know that with probability at least , we have . Then, by taking a union bound over the iterations and using the result in Lemma D.1, we know that with probability at least , for any , the virtual Politex algorithm satisfies
(D.3)
where is the mixture policy of . Using another union bound over the loops, we know that with probability at least , (D.3) holds for all the loops. We call this event in the following.
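As a concrete illustration of the policy construction in (4.4)/(D.2), the sketch below computes the exponential-weights (softmax) policy at a fixed state from the clipped Q-estimates of all previous iterations. It assumes that (4.4) has the standard Politex form, in which the next policy is proportional to the exponential of the learning rate times the running sum of Q-estimates; the function name and arguments are illustrative.

```python
import numpy as np

def politex_policy(q_values_history, eta, q_min=None, q_max=None):
    """Exponential-weights policy at one state: q_values_history is a list of
    per-action Q-estimate vectors (one per past iteration), eta is the learning
    rate, and optional clipping mimics the truncation of the virtual Q-functions."""
    q_sum = np.zeros(len(q_values_history[0]), dtype=float)
    for q in q_values_history:
        q = np.asarray(q, dtype=float)
        if q_min is not None or q_max is not None:
            q = np.clip(q, q_min, q_max)
        q_sum += q
    logits = eta * q_sum
    logits -= logits.max()            # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()
```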
We then consider the virtual-2 Politex algorithm. As in LSPI, the virtual-2 algorithm begins with . In the -th iteration, we run a Monte Carlo rollout with policy and obtain the estimates of the Q-function values . We then compute the weight vector
and according to Lemma B.2, for any , with probability at least , for all such that ,
(D.4)
where is defined as in (D.1). We also note that in the rollout process of the virtual-2 algorithm, we do not conduct the uncertainty check, i.e., we do not check whether the features are in the good set . By a union bound, we know that with probability at least , (D.4) holds for all the iterations of all the loops. We call this event in the following. In the virtual-2 algorithm, we define the approximate Q-function in the same way as in the main algorithm, i.e., we define
and we let the policy of the -th iteration be
(D.5)
We still let the simulators of all the algorithms be coupled in the same way as described in Appendix B.1. In addition, we also let the agent in the main algorithm be coupled with the virtual and virtual-2 algorithms. Take the main algorithm and the virtual algorithm as an example. Recall that in the -th iteration of a particular loop, the main algorithm and the virtual algorithm use rollout policies and , respectively. In the ConfidentRollout subroutine, the agent needs to sample actions according to the policies given a state. Suppose that the -th time the agent needs to take an action, the main algorithm is at state and the virtual algorithm is at state . If the two states are the same, i.e., , and the two action distributions given this state are also the same, i.e., , then the actions that the agent samples in the main algorithm and the virtual algorithm are also the same. This means that the main algorithm samples and the virtual algorithm samples , and with probability , . Otherwise, when or , the main algorithm and the virtual algorithm sample new actions independently. The main algorithm and the virtual-2 algorithm are coupled in the same way. We note that, by the same argument as in Lemma B.5, in the final loop of the main algorithm all the rollout trajectories of the main, virtual, and virtual-2 algorithms are the same, which implies that for all . This also implies that in the final loop of the main algorithm, all the policies in the iterations are the same between the main and the virtual-2 algorithms, i.e., , . Moreover, for any state such that for all , we have . Since the initial state satisfies the condition that for all , we have .
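The action coupling described above can be sketched as follows: when the two algorithms are at the same state and prescribe the same action distribution, a single shared uniform draw (used through the inverse CDF) gives both the same action; otherwise they sample independently. The function below is an illustration under these assumptions, not the paper's implementation.

```python
import numpy as np

def coupled_action_sample(rng, state_main, probs_main, state_virtual, probs_virtual):
    """Sample one action for the main algorithm and one for the virtual algorithm,
    reusing a single uniform draw whenever state and action distribution agree."""
    if state_main == state_virtual and np.allclose(probs_main, probs_virtual):
        u = rng.random()
        cdf = np.cumsum(probs_main)
        a = min(int(np.searchsorted(cdf, u)), len(probs_main) - 1)
        return a, a                                     # identical actions
    a_main = int(rng.choice(len(probs_main), p=probs_main))
    a_virtual = int(rng.choice(len(probs_virtual), p=probs_virtual))
    return a_main, a_virtual
```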
Let be the policy that is uniformly chosen from in the virtual-2 algorithm in the final loop of the main algorithm, and be the policy that is uniformly chosen from in the final loop of the main algorithm. Then we have
(D.6)
and when events and happen,
(D.7)
By combining (D.3), (D.6), and (D.7), and using a union bound, we obtain that with probability at least ,
(D.8)
Now we choose appropriate parameters to obtain the final result. When Assumption 3.2 holds, i.e., , one can verify that when we choose , , , , , and , we can ensure that with probability at least , . When Assumption 3.3 holds, one can verify that when we choose , , , , , and , we can ensure that with probability at least ,
Appendix E Random initial state
We have shown that with a deterministic initial state , our algorithm can learn a good policy. In fact, if the initial state is random, and the agent is allowed to sample from a distribution of the initial state, denoted by in this section, then we can use a simple reduction to show that our algorithm can still learn a good policy. In this case, the optimality gap is defined as the difference between the expected value of the optimal policy and the learned policy, where the expectation is taken over the initial state distribution, i.e., we hope to guarantee that is small.
The reduction argument works as follows. First, we add an auxiliary state to the state space and assume that the algorithm starts from . From and any action , we let the distribution of the next state be , i.e., . We also let . Then, for any policy , we have . As for the features, for any , we add an extra as the last dimension of the feature vector, i.e., we use . For any , we let . Note that this does not affect linear realizability except for a change in the upper bound on the norm of the linear coefficients. Suppose that Assumption 3.2 holds. Suppose that in the original MDP, we have with . Let us define . Then, for any , we still have since the last coordinate of is zero. For , we have . The only difference is that we now have since we always have .
Then the problem reduces to the deterministic initial state case with initial state . In the first step of the algorithm, we let . During the algorithm, to run a rollout from any core set element with , we can use the current version of Algorithm 1. To run a rollout from , we simply sample from as the first “next state” and then use the simulator to generate the rest of the rollout trajectory.
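A minimal sketch of this reduction is given below, assuming (for illustration only) that the auxiliary state is named `AUX_STATE`, that the reward collected at the auxiliary state is zero, and that `simulator_step` and `sample_initial_state` are given; none of these names come from the paper.

```python
import numpy as np

AUX_STATE = "s_dagger"   # illustrative name for the auxiliary initial state

def augmented_feature(phi, state, d):
    """Ordinary state-action features get an extra 0 appended; the auxiliary state
    uses the new basis direction, so linear realizability is preserved."""
    if state == AUX_STATE:
        e = np.zeros(d + 1)
        e[-1] = 1.0
        return e
    return np.append(phi, 0.0)

def rollout_step(state, action, simulator_step, sample_initial_state, rng):
    """From the auxiliary state, any action leads to a next state drawn from the
    initial-state distribution (reward 0 assumed for illustration); otherwise the
    original simulator is queried."""
    if state == AUX_STATE:
        return 0.0, sample_initial_state(rng)
    return simulator_step(state, action, rng)
```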