Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization
Abstract
Learning Markov decision processes (MDPs) in an adversarial environment has been a challenging problem. The problem becomes even more challenging with function approximation, since the underlying structure of the loss function and transition kernel are especially hard to estimate in a varying environment. In fact, the state-of-the-art results for linear adversarial MDPs achieve a regret that is highly suboptimal in the number of episodes $K$, admitting a large room for improvement. In this paper, we investigate the problem with a new view, which reduces linear MDPs to linear optimization by subtly setting the feature maps of the bandit arms of the linear optimization problem. This new technique, under an exploratory assumption, yields an improved bound for linear adversarial MDPs without access to a transition simulator. The new view could be of independent interest for solving other MDP problems that possess a linear structure.
1 Introduction
Reinforcement learning (RL) describes the interaction between a learning agent and an unknown environment, where the agent aims to maximize the cumulative reward through trial and error (Sutton and Barto, 2018). It has achieved great success in many real applications, such as games (Mnih et al., 2013; Silver et al., 2016), robotics (Kober et al., 2013; Lillicrap et al., 2015), autonomous driving (Kiran et al., 2021), and recommendation systems (Afsar et al., 2022; Lin et al., 2021). The interaction in RL is commonly modeled by Markov decision processes (MDPs). Most works study the stochastic setting, where the reward is sampled from a fixed distribution (Azar et al., 2017; Jin et al., 2018; Simchowitz and Jamieson, 2019; Yang et al., 2021). RL in real applications is in general more challenging than the stochastic setting, as the environment could be non-stationary and the reward function could adapt to the agent's policy. For example, a scheduling algorithm will be deployed among self-interested parties, and recommendation algorithms will face strategic users.
To design robust algorithms that work in non-stationary environments, a line of works focuses on the adversarial setting, where the reward function can be arbitrarily chosen by an adversary (Yu et al., 2009; Rosenberg and Mansour, 2019; Jin et al., 2020a; Chen et al., 2021; Luo et al., 2021a). Many works on adversarial MDPs optimize the policy by learning the value function with a tabular representation. In this case, both the computational complexity and the regret bounds depend on the sizes of the state and action spaces. In real applications, however, the state and action spaces could be exponentially large or even infinite, as in the game of Go and in robotics. The resulting computational cost and performance are then inadequate.
To cope with the curse of dimensionality, function approximation methods are widely deployed to approximate the value functions with learnable structures. Great empirical success has proved their efficacy in a wide range of areas. Despite this, the theoretical understanding of MDPs with general function approximation is still limited. As an essential step towards understanding function approximation, linear MDPs have been an important setting and have received significant attention from the community. A linear MDP presumes that the transition and reward functions follow a linear structure with respect to a known feature map (Jin et al., 2020b; He et al., 2021; Hu et al., 2022). The stochastic setting of linear MDPs has been well studied and near-optimal results are available (Jin et al., 2020b; Hu et al., 2022). The adversarial setting is much more challenging since the underlying linear parameters of the loss function and transition kernel are especially hard to estimate in a varying environment.
The research on linear adversarial MDPs remains open. Early work proposes algorithms for the case where the transition function is known (Neu and Olkhovskaya, 2021). Several recent works explore the problem without a known transition function and derive policy optimization algorithms with the state-of-the-art regret (Luo et al., 2021a; Dai et al., 2023; Sherman et al., 2023). While the optimal regret in tabular adversarial MDPs is of order $\widetilde{O}(\sqrt{K})$ in the number of episodes $K$ (Jin et al., 2020a), the regret upper bounds available for linear adversarial MDPs admit a large room for improvement.
In this paper, we investigate linear adversarial MDPs with unknown transitions. We propose a new view of the problem and design an algorithm based on this view. The idea is to reduce the MDP setting to a linear optimization problem by subtly setting the feature maps of the bandit arms of the linear optimization problem. In this way, we operate on a set of policies and optimize the probability distribution over which policy to execute. By carefully balancing the suboptimality in policy execution, the suboptimality in policy construction, and the suboptimality in feature visitation estimation, we deduce new analyses of the problem. Improved regret bounds are obtained both when we have and when we do not have a simulator. In particular, we obtain the first regret bound of its order for linear adversarial MDPs without a simulator.
Let $d$ be the feature dimension and $H$ be the length of each episode. Details of our contributions are as follows.
- With an exploratory assumption (Assumption 1), we obtain an improved regret upper bound for linear adversarial MDPs. As compared in Table 1, this is the first regret bound of this order when a simulator of the transition is not provided. We also note that our exploratory assumption, which only ensures that the MDP is learnable, is much weaker than those in previous works (Neu and Olkhovskaya, 2021; Luo et al., 2021a). Under this weaker assumption, our result achieves a significant improvement over the regret in Luo et al. (2021a) and also removes the dependence on the minimum eigenvalue of the exploratory policy's covariance, which can be small.
- In a simpler setting where the agent has access to a simulator, our regret can be further improved. This result also removes the eigenvalue dependence present in previous works (Neu and Olkhovskaya, 2021; Luo et al., 2021a). Compared with Luo et al. (2021a), our required simulator is also weaker: we only need access to a sampled trajectory for any given policy, while Luo et al. (2021a) require the next state for any given state-action pair.
- Technically, we provide a new tool for linear MDP problems by exploiting the linear features of the MDP and transforming it into a linear optimization problem. This tool could be of independent interest and might be useful in other problems that possess a linear structure.
| | Transition | Simulator¹ | Exploratory² | Regret³ |
|---|---|---|---|---|
| Neu and Olkhovskaya (2021) | Known | yes | yes | |
| Luo et al. (2021a, b) | Unknown | yes | yes | |
| Luo et al. (2021a, b) | Unknown | yes | no | |
| Luo et al. (2021a, b) | Unknown | no | yes | |
| Luo et al. (2021a, b) | Unknown | no | no | |
| Dai et al. (2023) | Unknown | yes | no | |
| Dai et al. (2023) | Unknown | no | no | |
| Sherman et al. (2023) | Unknown | yes | no | |
| Sherman et al. (2023) | Unknown | no | no | |
| Ours | Unknown | yes | yes | |
| Ours | Unknown | no | yes | |
1. Our required simulator is defined in Assumption 2. Notice that Dai et al. (2023) and Luo et al. (2021a, b) adopt a stronger simulator that returns the next state when given any state-action pair, while Sherman et al. (2023), Neu and Olkhovskaya (2021), and this paper only need the simulator to return a trajectory when given a policy.
2. Our exploratory assumption is introduced in Assumption 1. It is worth noting that our assumption on exploration is also much weaker than that of Neu and Olkhovskaya (2021); Luo et al. (2021a, b). Our exploratory assumption only ensures the learnability of the MDP, while the other works require as input a policy that can explore the full linear space in all steps. Our assumption is implied by theirs.
3. The term in the regret represents the minimum eigenvalue induced by a "good" exploratory policy, i.e., the smallest eigenvalue of that policy's per-step feature covariance over all steps (see Assumption 1).
2 Related Work
Linear MDPs.
The linear function approximation problem has a long history of study (Bradtke and Barto, 1996; Melo and Ribeiro, 2007; Sutton and Barto, 2018; Yang and Wang, 2019). More recently, Yang and Wang (2020) propose theoretical guarantees on sample efficiency in the linear MDP setting; however, they assume that the transition function can be parameterized by a small matrix. For the general case, Jin et al. (2020b) develop LSVI-UCB, the first algorithm that is efficient in both sample and computational complexity. They show that the algorithm achieves a regret that scales polynomially with the feature dimension $d$ and the episode length $H$ rather than with the sizes of the state and action spaces. This result is improved to the optimal order by Hu et al. (2022) with a tighter concentration analysis. A very recent work (He et al., 2022a) points out a technical error in Hu et al. (2022) and shows a nearly minimax result that matches the lower bound in Zhou et al. (2021). All these works are based on UCB-type algorithms. Apart from UCB, TS-type algorithms have also been proposed for this setting (Zanette et al., 2020). The above results mainly focus on minimax optimality. In the stochastic setting, deriving an instance-dependent regret bound is also attractive as it adapts to MDPs of different hardness. This type of regret has been widely studied in the tabular MDP setting (Simchowitz and Jamieson, 2019; Yang et al., 2021). He et al. (2021) are the first to provide this type of regret bound for linear MDPs. Using a different proof framework, they show that the LSVI-UCB algorithm achieves a logarithmic regret that depends on the minimum value gap in the episodic MDP.
Adversarial losses in MDPs.
When the losses at state-action pairs do not follow a fixed distribution, the problem becomes an adversarial MDP. This problem was first studied in the tabular setting. The occupancy-measure-based method is one of the most popular approaches to dealing with a potential adversary. Within this approach, Zimin and Neu (2013) first study the known-transition setting and derive regret guarantees for both full-information and bandit feedback. For the more challenging unknown-transition setting, Rosenberg and Mansour (2019) also start from full-information feedback and derive a sublinear regret. Bandit feedback is studied by Jin et al. (2020a), who obtain a regret bound of order $\widetilde{O}(\sqrt{K})$. The other line of works (Neu et al., 2010; Shani et al., 2020; Chen et al., 2022; Luo et al., 2021a) is based on policy optimization methods. In the unknown-transition and bandit-feedback setting, the state-of-the-art result in this line is also of order $\widetilde{O}(\sqrt{K})$, achieved by Luo et al. (2021a, b).
More specifically, a few works focus on the linear adversarial MDP problem. Neu and Olkhovskaya (2021) first study the known-transition setting and provide a regret guarantee under the assumption that an exploratory policy can explore the full linear space. For the general unknown-transition case, Luo et al. (2021a, b) discuss four cases depending on whether a simulator is available and whether the exploratory assumption is satisfied. With the same exploratory assumption as Neu and Olkhovskaya (2021), they show one regret bound when a simulator is available and a weaker one otherwise. Two very recent works (Dai et al., 2023; Sherman et al., 2023) further generalize the setting by removing the exploratory assumption. These two works independently provide regret guarantees for this setting when no simulator is available.
Linear mixture MDPs are another popular linear function approximation model, where the transition is a mixture of linear functions. For adversarial losses, Cai et al. (2020); He et al. (2022b) study the unknown-transition but full-information feedback setting, in which the learning agent observes the loss of all actions in each state. Zhao et al. (2023) consider general bandit feedback in this setting and show that the same order of regret is achievable in this harder environment. Their model does not assume any structure on the loss function, which introduces a dependence on the numbers of states and actions, $S$ and $A$, in the regret.
3 Preliminaries
In this work, we study episodic adversarial Markov decision processes (MDPs). Such an MDP is specified by a state space, an action space, the horizon of each episode, the transition kernels, and the per-episode loss functions: the transition kernel at each step gives the probability of moving to the next state after taking an action at the current state, and the loss function may change from episode to episode.
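For concreteness, the episodic adversarial MDP can be written in the standard tuple notation below; the symbols ($\mathcal{S}$, $\mathcal{A}$, $H$, $P_h$, $\ell_{k,h}$) are our shorthand for the quantities just described, and the normalization of the losses to $[0,1]$ is the usual convention rather than a detail taken from the original statement.

```latex
% Episodic adversarial MDP in standard notation (symbols are our shorthand).
\[
  M = \bigl(\mathcal{S},\ \mathcal{A},\ H,\ \{P_h\}_{h=1}^{H},\ \{\ell_{k,h}\}_{k\in[K],\,h\in[H]}\bigr),
\]
where $P_h(s' \mid s, a)$ is the probability of transitioning from $s$ to $s'$ by taking
action $a$ at step $h$, and $\ell_{k,h}(s,a) \in [0,1]$ is the (adversarially chosen) loss of
taking action $a$ at state $s$ in step $h$ of episode $k$.
```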
We denote by $\pi_k = \{\pi_{k,h}\}_{h=1}^{H}$ the learner's policy at episode $k$, where each $\pi_{k,h}$ maps a state to a distribution over the action space, and $\pi_{k,h}(a \mid s)$ represents the probability of selecting action $a$ at state $s$ when following policy $\pi_k$ at step $h$.
The learner interacts with the MDP for $K$ episodes. At each episode $k$, the environment (adversary) first chooses the loss function, possibly based on the history before episode $k$. The learner simultaneously decides its policy $\pi_k$. At each step $h$, the learner observes the current state, takes an action based on $\pi_{k,h}$, and observes the incurred loss. The environment then transitions to the next state at the end of the step according to the transition kernel.
The performance of a policy $\pi$ in episode $k$ can be evaluated by its value function, the expected cumulative loss
$$V_k^{\pi} = \mathbb{E}\Big[\sum_{h=1}^{H} \ell_{k,h}(s_h, a_h)\Big],$$
where the expectation is taken over the randomness of the transitions and of the policy $\pi$. Denote by $\pi^*$ the optimal policy, which suffers the least expected loss over the $K$ episodes. The objective of the learner is to minimize the cumulative regret
$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \big(V_k^{\pi_k} - V_k^{\pi^*}\big), \qquad (1)$$
which is defined as the cumulative difference between the value of the executed policies and that of the optimal policy $\pi^*$.
A linear adversarial MDP is an MDP where both the transition kernel and the loss functions depend linearly on a feature mapping. We give a formal definition as follows.
Definition 1 (Linear MDP with adversarial losses).
The MDP is a linear MDP if there is a known feature mapping and unknown vector-valued measures such that the transition probability at each state-action pair satisfies
Further, for any episode and step , there exists an unknown loss vector such that
for all state-action pairs. Without loss of generality, we assume the feature mapping, the measures, and the loss vectors are bounded in norm.
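For reference, the standard linear MDP conditions (Jin et al., 2020b), which we take Definition 1 to follow, are sketched below; the symbols $\phi$, $\mu_h$, $\theta_{k,h}$ and the particular norm bounds are that convention's and are assumed here rather than quoted from the original.

```latex
% Standard linear MDP conditions (Jin et al., 2020b); notation and norm bounds assumed.
\[
  P_h(s' \mid s,a) = \langle \phi(s,a),\, \mu_h(s') \rangle,
  \qquad
  \ell_{k,h}(s,a) = \langle \phi(s,a),\, \theta_{k,h} \rangle ,
\]
\[
  \|\phi(s,a)\|_2 \le 1 \ \ \forall (s,a),
  \qquad
  \|\mu_h(\mathcal{S})\|_2 \le \sqrt{d},
  \qquad
  \|\theta_{k,h}\|_2 \le \sqrt{d} .
\]
```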
Given a policy $\pi$, its feature visitation vector at step $h$ is the expected feature mapping the policy encounters at step $h$: $\phi_h^{\pi} = \mathbb{E}_{\pi}[\phi(s_h, a_h)]$. With this definition, the expected loss that the policy receives at step $h$ of episode $k$ can be written as
$$\langle \phi_h^{\pi}, \theta_{k,h} \rangle, \qquad (2)$$
and the value of policy $\pi$ can be expressed as
$$V_k^{\pi} = \sum_{h=1}^{H} \langle \phi_h^{\pi}, \theta_{k,h} \rangle. \qquad (3)$$
For simplicity, we also write $\phi_h^{\pi}(s)$ for the expected feature visitation at state $s$ and step $h$ under $\pi$.
To ensure that the linear MDP is learnable, we make the following exploratory assumption, which is analogous to the assumptions made in previous works on the function approximation setting (Neu and Olkhovskaya, 2021; Luo et al., 2021a, b; Hao et al., 2021; Agarwal et al., 2021). For any policy $\pi$, define $\Sigma_h^{\pi} = \mathbb{E}_{\pi}[\phi(s_h, a_h)\phi(s_h, a_h)^{\top}]$ as the expected covariance of $\phi$ at step $h$, and let $\lambda_{\min}(\cdot)$ denote the smallest eigenvalue of a matrix. We assume that there exists a policy that generates full-rank covariance matrices.
Assumption 1 (Exploratory assumption).
There exists a policy $\pi_0$ such that $\lambda_{\min}(\Sigma_h^{\pi_0}) > 0$ for all steps $h$.
When reduced to the tabular setting, where $\phi(s,a)$ is a basis vector indexed by the state-action pair, this assumption requires that the visitation probability of every state-action pair under the trajectory induced by $\pi_0$ is positive. It simply means that there exists a policy with positive visitation probability for all state-action pairs, which is standard (Li et al., 2020). In the linear setting, it guarantees that every direction of the feature space can be visited by some policy.
We point out that this assumption is weaker than the exploratory assumptions used in previous works (Neu and Olkhovskaya, 2021; Luo et al., 2021a, b), which assume that such an exploratory policy $\pi_0$, with its minimum covariance eigenvalue bounded below at every step, is given as input to the algorithm. Since finding such an exploratory policy is extremely difficult, our assumption, which only requires the MDP itself to satisfy this condition, is preferable.
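As a concrete illustration of Assumption 1, the sketch below estimates the per-step feature covariance of a policy from rollouts and checks its minimum eigenvalue; the sampling interface (`rollout`) and all names are hypothetical and only mirror the definition of $\Sigma_h^{\pi}$ above.

```python
import numpy as np

def estimate_min_eigenvalue(rollout, policy, d, H, n_rollouts=10_000, seed=0):
    """Monte-Carlo estimate of min_h lambda_min(Sigma_h^pi).

    `rollout(policy, rng)` is a hypothetical sampler returning a list of
    feature vectors [phi(s_1, a_1), ..., phi(s_H, a_H)] for one episode.
    """
    rng = np.random.default_rng(seed)
    covs = [np.zeros((d, d)) for _ in range(H)]
    for _ in range(n_rollouts):
        features = rollout(policy, rng)           # one trajectory of features
        for h, phi in enumerate(features):
            covs[h] += np.outer(phi, phi)         # accumulate phi phi^T per step
    lambdas = [np.linalg.eigvalsh(c / n_rollouts)[0] for c in covs]
    return min(lambdas)                           # > 0  <=>  covariances are full rank

# Assumption 1, informally: there exists a policy pi_0 with
# estimate_min_eigenvalue(rollout, pi_0, d, H) > 0.
```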
4 Algorithm
In this section, we introduce the proposed algorithm (Algorithm 1). The algorithm takes a finite policy class and the feature visitation estimators as input and selects a policy in each episode. The constructions of the policy class and of the feature visitation estimators will be introduced in Section 4.1 and Section 4.2, respectively.
Recall that the loss value is an inner product between the feature and the loss vector. Exploiting this structure, we use ridge linear regression to estimate the unknown loss vector. Specifically, in each episode, after executing the selected policy, the observed loss values are used to estimate the loss vector, from which the value of every policy in the class can be estimated (line 9). We then adopt an optimistic estimate of each policy's value (line 10). Based on these optimistic values, the exploitation probability of each policy follows an EXP3-type update rule (line 11). To better explore each dimension of the linear space, the final selection probability is a weighted combination of the exploitation probability and an exploration probability, where the weight is an input parameter (line 7). Here the exploration probability is obtained by solving a G-optimal design problem over the estimated policy features, which minimizes the worst-case uncertainty over all policies.
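Written over the estimated policy features (which play the role of arms), the G-optimal design step takes the standard form below; for brevity we write a single feature per policy, whereas the algorithm works with per-step features, and the notation is our own.

```latex
% Standard G-optimal design over the estimated policy features (notation assumed).
\[
  g \;\in\; \arg\min_{g \in \Delta(\Pi)}\ \max_{\pi \in \Pi}\
    \bigl\| \hat{\phi}^{\pi} \bigr\|_{G(g)^{-1}}^{2},
  \qquad
  G(g) \;=\; \sum_{\pi' \in \Pi} g(\pi')\, \hat{\phi}^{\pi'} \bigl(\hat{\phi}^{\pi'}\bigr)^{\top}.
\]
```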
If the input is the true feature visitation , we can ensure that the regret of the algorithm, compared with the optimal policy in , is upper bounded. Now it suffices to bound the additional regret caused by the sub-optimality of the best policy in , and the bias of the feature visitation estimators, which will be discussed in the following sections.
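To make the main loop concrete before moving on, here is a minimal sketch of one episode of Algorithm 1 as we read it: an importance-weighted ridge estimate of the loss in the policy-feature space, an EXP3-style exponential-weights update, and mixing with the G-optimal exploration distribution. All names, the form of the optimism bonus, and the parameter choices are placeholders, not the paper's exact rules.

```python
import numpy as np

def episode_update(hat_phi, weights, g_explore, gamma, eta, beta, rng, play_and_observe):
    """One episode of an EXP3-style update over a finite policy set.

    hat_phi:   array (N, H, d) of estimated per-step feature visitations.
    weights:   array (N,) of unnormalized exponential weights.
    g_explore: array (N,) exploration distribution from the G-optimal design.
    play_and_observe(i) -> array (H,) of observed losses when executing policy i
    (a hypothetical environment interface).
    """
    N, H, d = hat_phi.shape
    q = weights / weights.sum()
    p = (1 - gamma) * q + gamma * g_explore              # mixed selection distribution
    i = rng.choice(N, p=p)                               # execute one policy
    losses = play_and_observe(i)

    V_hat = np.zeros(N)
    for h in range(H):
        Sigma = np.einsum("n,nd,ne->de", p, hat_phi[:, h], hat_phi[:, h])
        Sigma_reg = Sigma + 1e-8 * np.eye(d)
        theta_hat = np.linalg.solve(Sigma_reg, hat_phi[i, h] * losses[h])  # loss-vector estimate
        V_hat += hat_phi[:, h] @ theta_hat                # estimated per-policy values
        # optimistic bonus (placeholder form): favor policies with uncertain estimates
        V_hat -= beta * np.sqrt(np.einsum("nd,de,ne->n", hat_phi[:, h],
                                          np.linalg.inv(Sigma_reg), hat_phi[:, h]))
    new_weights = weights * np.exp(-eta * V_hat)          # EXP3-type exponential update
    return new_weights, i
```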
4.1 Policy Construction
In this subsection, we introduce how to construct a finite policy set such that the true optimal policy can be approximated by an element of the set. The policy construction mainly borrows from Appendix A.3 of Wagenmaker and Jamieson (2022), but with a refined analysis for the adversarial setting.
We consider the linear softmax policy class. Specifically, given a parameter vector, the induced policy selects each action at a state with probability proportional to the exponential of a linear function of the state-action feature.
The advantage of this policy class is that it satisfies a Lipschitz property: the difference between the values of two induced policies can be upper bounded by the difference between their parameters. Based on this observation, by constructing a covering of the parameter ball, we can ensure that the parameter of the optimal policy is approximated by some parameter in the covering.
Further, by the Lipschitz property, the policy induced by this covering parameter has a value close to that of the optimal policy.
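A minimal sketch of this policy class follows, assuming the common parameterization $\pi_w(a \mid s) \propto \exp(w^{\top}\phi(s,a))$ and a naive grid covering of the parameter ball; the exact parameterization, per-step parameters, and covering radius used in the paper may differ.

```python
import itertools
import numpy as np

def softmax_policy(w, phi_s):
    """Action distribution of the linear softmax policy at one state.

    w:     parameter vector of shape (d,).
    phi_s: array (A, d) of features phi(s, a) for every action a.
    """
    logits = phi_s @ w
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def parameter_cover(d, radius=1.0, resolution=0.5):
    """Naive grid covering of the d-dimensional ball of the given radius.

    The covering used in the analysis is finer (its size is exponential in d);
    this grid only illustrates the construction.
    """
    axis = np.arange(-radius, radius + 1e-9, resolution)
    return [np.array(w) for w in itertools.product(axis, repeat=d)
            if np.linalg.norm(w) <= radius]

# Each element of parameter_cover(d) induces one policy via softmax_policy,
# giving a finite policy class whose best member is close to the optimal policy.
```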
The informal result is shown in the following lemma.
Lemma 1.
There exists a finite policy class with log cardinality , such that the regret compared with the optimal policy in is close to the global regret, i.e.,
The detailed analysis can be found in Appendix C.
4.2 Feature Visitation Estimation
In this subsection, we discuss how to deal with unavailable feature visitations of policies. Our approach is to estimate the feature visitation of each policy and use these estimated features as input to Algorithm 1. The feature estimation process is described in Algorithm 2, which we call the feature visitation estimation oracle.
For any policy , we can first decompose its feature visitation at step as
where is the transition operator and can be directly computed based on policy . Thus, to estimate for each step , we need to estimate the transition operator .
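Under Definition 1, this decomposition can be spelled out as follows (our notation; $M_h^{\pi}$ plays the role of the transition operator mentioned above and depends on the unknown measure $\mu_h$).

```latex
% Feature visitation recursion in a linear MDP (our notation).
\[
  \phi_{h+1}^{\pi}
  = \mathbb{E}_{\pi}\bigl[\phi_{h+1}^{\pi}(s_{h+1})\bigr]
  = \Bigl(\int_{\mathcal{S}} \phi_{h+1}^{\pi}(s')\,\mu_h(\mathrm{d}s')^{\top}\Bigr)\,\phi_h^{\pi}
  =: M_h^{\pi}\,\phi_h^{\pi},
  \qquad
  \phi_{h+1}^{\pi}(s) = \sum_{a} \pi_{h+1}(a \mid s)\,\phi(s,a),
\]
so the map from $\phi_h^{\pi}$ to $\phi_{h+1}^{\pi}$ is linear; the per-state term
$\phi_{h+1}^{\pi}(s)$ is directly computable from the policy, while the operator induced
by $\mu_h$ must be estimated.
```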
We use the least-squares method to estimate the transition operator. Suppose we have currently collected a number of trajectories; then the estimator is the solution of a regularized least-squares problem over these trajectories, and it admits a closed-form solution.
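A hedged sketch of this step: each collected transition provides a pair (the feature at step $h$, the policy feature of the observed next state), and the operator is fit by regularized least squares. The regularizer, the data-splitting, and the exact regression targets of Algorithm 2 are assumptions here, not quotations.

```python
import numpy as np

def estimate_transition_operator(X, Y, reg=1.0):
    """Ridge least-squares estimate of the step-h transition operator.

    X: array (n, d) with rows phi(s_h, a_h) from the collected trajectories.
    Y: array (n, d) with rows phi^pi_{h+1}(s_{h+1}), the policy feature of the
       observed next state (computable from the policy and the next state).
    Returns M_hat of shape (d, d) such that Y ~= X @ M_hat.T .
    """
    d = X.shape[1]
    gram = X.T @ X + reg * np.eye(d)             # regularized design matrix
    M_hat = np.linalg.solve(gram, X.T @ Y).T     # closed-form ridge solution
    return M_hat

def estimate_feature_visitations(M_hats, phi_1):
    """Roll the estimated operators forward: phi_{h+1} = M_h phi_h."""
    phis = [phi_1]
    for M in M_hats:
        phis.append(M @ phis[-1])
    return phis
```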
To guarantee the accuracy of the estimated feature visitations, we provide a guarantee on the accuracy of the estimated transition operator. The intuition is to collect enough data in every direction of the feature space. For the trajectory-collection design, we adopt the reward-free technique of Wagenmaker and Jamieson (2022) and adapt it into a standalone feature visitation estimation oracle. Algorithm 2 satisfies the following sample complexity and accuracy guarantees.
Lemma 2.
Algorithm 2 runs for at most episodes and returns a feature visitation estimation that satisfies
for any policy and step , with probability at least .
Since the regret of each episode is bounded, the total regret incurred in this process of estimating feature visitations is controlled. The detailed analysis and results are in Appendix B.
5 Analysis
This section provides the regret guarantees for the proposed algorithm as well as the proof sketch.
We consider Algorithm 1 with the policy set constructed in Section 4.1 and the policy features estimated in Section 4.2 as input. Suppose we run Algorithm 1 for a number of rounds; the regret compared with any fixed policy over these rounds can be bounded as follows.
Lemma 3.
For any policy , with probability at least ,
where is the tolerance of the estimated feature bias in Lemma 2.
Recall that when the policy set is constructed as in Section 4.1, the difference between the global regret defined in Equation (1) and the regret compared with the best policy in the constructed set is only a constant. So the global regret can also be bounded as in Lemma 3 above.
Similar to previous works on linear adversarial MDPs (Luo et al., 2021a, b; Sherman et al., 2023; Dai et al., 2023) that discuss whether a transition simulator is available, we define the simulator that may be available in the following assumption. Note that this simulator is weaker than those of Luo et al. (2021a, b); Dai et al. (2023), whose simulators can generate a next state given any state-action pair.
Assumption 2 (Simulator).
The learning agent has access to a simulator such that when given a policy , it returns a trajectory based on the MDP and policy .
If the learning agent has access to a simulator as described in Assumption 2, then the feature estimation process in Section 4.2 can be regarded as regret-free and the final regret is exactly that of Lemma 3. Otherwise, there is an additional regret term. Balancing the parameter choices yields the following regret upper bound.
Theorem 1.
The proof of the main results is deferred to Appendix A.
Discussions
Since a main contribution of our work is to improve the results of Neu and Olkhovskaya (2021); Luo et al. (2021a) under an exploratory assumption, we present more insights into the differences between our approach and the approaches in these two works.
As shown in Table 1, our result explicitly improves upon the result of Luo et al. (2021a) in both the dependence on the number of episodes and the dependence on the minimum eigenvalue in the exploratory assumption. In real applications, for each direction of the linear space, it is reasonable that there is a policy visiting that direction, so mixing these policies ensures the exploratory assumption. However, there is no guarantee on the magnitude of this eigenvalue, and when it is very small, removing the dependence on it is significant.
Technically, our new view of linear MDPs could be general enough to be useful in other linear settings. In Section 4.1, we only introduce a simplified version of the policy construction to convey the intuition. We could vary this construction by further placing a finite covering over the action space to deal with infinite action spaces; more details can be found in Appendix C. Meanwhile, Neu and Olkhovskaya (2021) require both state and action spaces to be finite, and Luo et al. (2021a, b); Dai et al. (2023); Sherman et al. (2023) can only deal with finite action spaces.
When a transition simulator is available, the number of calls to the simulator (query complexity) is also an important metric of the algorithm's efficiency. According to Lemma 2 and the analysis in Appendix A, we only need a bounded number of simulator calls to achieve the regret of Theorem 1, by choosing the tolerance in Lemma 3 appropriately. This is preferable to Luo et al. (2021a), whose number of simulator calls also depends on the size of the action space.
Compared with Luo et al. (2021a), we improve their results with better regret bounds, a weaker exploratory assumption, a weaker simulator, and fewer queries (if a simulator is used).
6 Conclusion
In this paper, we investigate linear adversarial MDPs with bandit feedback. We propose a new view of linear MDPs in which optimizing policies can be regarded as a linear optimization problem. Based on this insight, we propose an algorithm that constructs a set of policies and maintains a probability distribution over which policy to execute. With an exploratory assumption, our algorithm yields the first regret bound of its order without access to a simulator. Compared to the results of Luo et al. (2021a), our algorithm enjoys a weaker assumption, a better regret bound, and a weaker simulator with fewer queries if one is used.
Our view contributes a new approach to linear MDPs, which could be of independent interest. We demonstrated how our algorithm generalizes to infinite action spaces under this view. Future implications of this technique could include other adversarial settings, such as losses corrupted up to a budget, and robust linear MDPs where the transition kernel changes over episodes.
References
- Afsar et al. (2022) M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. ACM Computing Surveys, 55(7):1–38, 2022.
- Agarwal et al. (2021) Naman Agarwal, Syomantak Chaudhuri, Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Online target Q-learning with reverse experience replay: Efficiently finding the optimal policy for linear MDPs. arXiv preprint arXiv:2110.08440, 2021.
- Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
- Bradtke and Barto (1996) Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
- Cai et al. (2020) Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283–1294. PMLR, 2020.
- Chen et al. (2021) Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Minimax regret for stochastic shortest path with adversarial costs and known transition. In Conference on Learning Theory, pages 1180–1215. PMLR, 2021.
- Chen et al. (2022) Liyu Chen, Haipeng Luo, and Aviv Rosenberg. Policy optimization for stochastic shortest path. In Conference on Learning Theory, pages 982–1046. PMLR, 2022.
- Dai et al. (2023) Yan Dai, Haipeng Luo, Chen-Yu Wei, and Julian Zimmert. Refined regret for adversarial MDPs with linear function approximation. arXiv preprint arXiv:2301.12942, 2023.
- Hao et al. (2021) Botao Hao, Tor Lattimore, Csaba Szepesvari, and Mengdi Wang. Online sparse reinforcement learning. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, pages 316–324. PMLR, 2021.
- He et al. (2021) Jiafan He, Dongruo Zhou, and Quanquan Gu. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning, pages 4171–4180. PMLR, 2021.
- He et al. (2022a) Jiafan He, Heyang Zhao, Dongruo Zhou, and Quanquan Gu. Nearly minimax optimal reinforcement learning for linear Markov decision processes. arXiv preprint arXiv:2212.06132, 2022.
- He et al. (2022b) Jiafan He, Dongruo Zhou, and Quanquan Gu. Near-optimal policy optimization algorithms for learning adversarial linear mixture MDPs. In International Conference on Artificial Intelligence and Statistics, pages 4259–4280. PMLR, 2022.
- Hu et al. (2022) Pihe Hu, Yu Chen, and Longbo Huang. Nearly minimax optimal reinforcement learning with linear function approximation. In International Conference on Machine Learning, pages 8971–9019. PMLR, 2022.
- Jin et al. (2018) Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018.
- Jin et al. (2020a) Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial Markov decision processes with bandit feedback and unknown transition. In International Conference on Machine Learning, pages 4860–4869. PMLR, 2020.
- Jin et al. (2020b) Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- Kiran et al. (2021) B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(6):4909–4926, 2021.
- Kober et al. (2013) Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
- Li et al. (2020) Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Advances in Neural Information Processing Systems, 33:7031–7043, 2020.
- Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Lin et al. (2021) Yuanguo Lin, Yong Liu, Fan Lin, Lixin Zou, Pengcheng Wu, Wenhua Zeng, Huanhuan Chen, and Chunyan Miao. A survey on reinforcement learning for recommender systems. arXiv preprint arXiv:2109.10665, 2021.
- Luo et al. (2021a) Haipeng Luo, Chen-Yu Wei, and Chung-Wei Lee. Policy optimization in adversarial MDPs: Improved exploration via dilated bonuses. Advances in Neural Information Processing Systems, 34:22931–22942, 2021.
- Luo et al. (2021b) Haipeng Luo, Chen-Yu Wei, and Chung-Wei Lee. Policy optimization in adversarial MDPs: Improved exploration via dilated bonuses. arXiv preprint arXiv:2107.08346, 2021.
- Melo and Ribeiro (2007) Francisco S Melo and M Isabel Ribeiro. Q-learning with linear function approximation. In Learning Theory: 20th Annual Conference on Learning Theory, COLT 2007, San Diego, CA, USA, June 13-15, 2007. Proceedings 20, pages 308–322. Springer, 2007.
- Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Neu and Olkhovskaya (2021) Gergely Neu and Julia Olkhovskaya. Online learning in MDPs with linear function approximation and bandit feedback. Advances in Neural Information Processing Systems, 34:10407–10417, 2021.
- Neu et al. (2010) Gergely Neu, András György, Csaba Szepesvári, et al. The online loop-free stochastic shortest-path problem. In COLT, volume 2010, pages 231–243. Citeseer, 2010.
- Rosenberg and Mansour (2019) Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial Markov decision processes. In International Conference on Machine Learning, pages 5478–5486. PMLR, 2019.
- Shani et al. (2020) Lior Shani, Yonathan Efroni, Aviv Rosenberg, and Shie Mannor. Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning, pages 8604–8613. PMLR, 2020.
- Sherman et al. (2023) Uri Sherman, Tomer Koren, and Yishay Mansour. Improved regret for efficient online reinforcement learning with linear function approximation. arXiv preprint arXiv:2301.13087, 2023.
- Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Simchowitz and Jamieson (2019) Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. Advances in Neural Information Processing Systems, 32, 2019.
- Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- Wagenmaker and Jamieson (2022) Andrew Wagenmaker and Kevin Jamieson. Instance-dependent near-optimal policy identification in linear MDPs via online experiment design. arXiv preprint arXiv:2207.02575, 2022.
- Yang and Wang (2019) Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.
- Yang and Wang (2020) Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
- Yang et al. (2021) Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021.
- Yu et al. (2009) Jia Yuan Yu, Shie Mannor, and Nahum Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
- Zanette et al. (2020) Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, and Alessandro Lazaric. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pages 1954–1964. PMLR, 2020.
- Zhao et al. (2023) Canzhe Zhao, Ruofeng Yang, Baoxiang Wang, and Shuai Li. Learning adversarial linear mixture Markov decision processes with bandit feedback and unknown transition. In International Conference on Learning Representations, 2023.
- Zhou et al. (2021) Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory, pages 4532–4576. PMLR, 2021.
- Zimin and Neu (2013) Alexander Zimin and Gergely Neu. Online learning in episodic Markovian decision processes by relative entropy policy search. Advances in Neural Information Processing Systems, 26, 2013.
Appendix A Analysis of Algorithm 1
In this section, we present the regret analysis for Algorithm 1 and prove the final regret bounds for our main algorithm. We first state the necessary concentration bounds as lemmas and then analyze the regret, proving Lemma 3 and Theorem 1. In the following analysis, we condition on the success of the event in Lemma 17, which holds with high probability. The following inequality is used throughout the analysis and is restated here first.
Lemma 4.
Let be a filtration and let be random variables such that is measurable, , almost surely, and for some fixed and . Then, for any , we have with probability at least ,
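For completeness, the standard form of Freedman's inequality that we take Lemma 4 to be is written below; the constants and exact formulation may differ from the original statement.

```latex
% Freedman's inequality, standard form (assumed to match Lemma 4 up to constants).
% X_1, ..., X_T is a martingale difference sequence with respect to {F_t}.
\[
  |X_t| \le b \ \text{a.s.}, \qquad
  \mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = 0, \qquad
  \sum_{t=1}^{T} \mathbb{E}[X_t^2 \mid \mathcal{F}_{t-1}] \le V .
\]
Then for any $\eta \in (0, 1/b]$ and $\delta \in (0,1)$, with probability at least $1-\delta$,
\[
  \sum_{t=1}^{T} X_t \;\le\; (\mathrm{e}-2)\,\eta\, V \;+\; \frac{\ln(1/\delta)}{\eta} .
\]
```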
To start with, we give the concentration of the feature visitation estimators returned from Algorithm 2, which will be fundamental in the following analysis. Notice that can be computed directly from the initial state distribution and action distribution.
Lemma 5.
The following Lemma 6 and Lemma 7 bound the magnitudes of the loss and value estimators in line 9, using the properties of the G-optimal design computed in line 4.
Lemma 6.
and , for all and .
Proof.
According to the properties of G-optimal design, we have:
and . Thus we have . So:
∎
Lemma 7.
With our choice of , when , we have for all optimistic loss function estimator , .
Proof.
To make sure , we notice that:
(4)
By Lemma 6, we have
When , we have . Thus, our choice satisfies this constraint. ∎
Throughout the following analysis, assuming we have run for some number of episodes , we let the filtration on this, with the filtration up to and including episode . Define . The next lemma will bound the bias of the loss vector estimator, thus we can bound the bias of the value function estimator.
Lemma 8.
Denote as the expected value of the loss vector estimator on . Then we have for :
As a result, we also have โ .
Proof.
Lemma 9.
Denote , then we have with probability at least ,
Proof.
First, we bound the bias of the estimated loss of each policy after episode in step :
The first term is a martingale difference sequence by the definition in Lemma 8. To bound its magnitude, notice that it is bounded according to Lemmas 5 and 6. Its variance is also bounded as:
where the last inequality is due to the fact and . Using Freedman's inequality, we obtain, with probability at least :
where the last two inequalities are due to the Cauchy-Schwarz inequality and the AM-GM inequality.
∎
Lemma 10.
We bound the gap between the actual regret and the expected estimated regret. With probability at least ,
Proof.
Denote , we have that :
Notice that . We bound its conditional variance as follows:
(5)
(6)
(7)
(8)
where inequality (5) is due to the Cauchy-Schwarz inequality and (7) is due to Jensen's inequality. Moreover, . Applying Bernstein's inequality, we obtain, with probability at least ,
Since , we have:
Combining the two terms, we prove this lemma. ∎
Lemma 11.
With probability at least , we have:
Proof.
(9)
Since , we bound the first term as follows.
Its conditional expectation is , and also . Thus, applying the Hoeffding bound, we have with probability at least ,
Plugging this into (9), we finish the proof. ∎
Proof of Lemma 3.
Now we are ready to analyze the regret. Using the classical potential-function analysis for this type of algorithm, we have:
(10)
(11)
where inequality (10) follows from the bound guaranteed by Lemma 7. Using Lemma 9, we can bound the second term as:
(12)
Plugging Lemma 10, Lemma 11, and Equation (12) into Equation (11), and noting that we condition on the aforementioned event, we obtain:
(13)
Combining terms, we have:
(14)
Plugging into Equation (14), we have:
(15)
On the other hand, we have:
(16)
Combining (15) and (16), we have:
(17)
Choosing and combining terms, we obtain for any policy , with probability at least :
(18)
∎
We then present the proof of Theorem 1 based on Equation (18). Notice that we condition on the number of episodes being large enough so that the optimal parameters set below satisfy the requirements of the algorithm, while the case of a small number of episodes is trivial.
- In the case where we have access to a simulator, the total regret is incurred only while we execute the policies in the constructed set. Setting the parameters appropriately and using the properties of the policy set from Lemma 19, the total regret is bounded as:
  Also, according to Corollary 1, the total number of episodes run on the simulator is bounded accordingly.
- When we do not have access to a simulator, we have to take into account the regret incurred while estimating the feature visitation of each policy. According to Corollary 1, this additional regret is bounded. By our construction of the policy set in Lemma 19, the total regret is bounded as in (19). Setting the parameters accordingly, the total regret is of order:
Appendix B Construct the Policy Visitation Estimators
In this section, we present the analysis of Algorithm 2. We then prove Lemma 17 and Corollary 1 as our main results, which provide the concentration of the estimators and bound the sample complexity. These results are then used to prove the final regret bounds in Appendix A.
First, we state the performance guarantee of the data-collection oracle, which comes directly from Theorem 9 in Wagenmaker and Jamieson [2022].
Denote:
for some fixed regularizer. We consider its smooth approximation:
We also define , where is the set of all distributions over all valid Markovian policies; it is, then, the set of all covariance matrices realizable by distributions over policies at step . Then we have
Theorem 2.
Consider running Algorithm 6 in Wagenmaker and Jamieson [2022] with some and functions
for and
where is the matrix returned by running Algorithm 7 in Wagenmaker and Jamieson [2022] with , , and some . Then with probability , this procedure will collect at most
episodes, where
and will produce covariates such that
and
Next, we present the concentration analysis of our estimators and bound the total number of episodes run. Throughout this section, assuming we have run for some number of episodes $K$, we work with the corresponding filtration, with the sub-filtration up to and including each episode; we also consider the filtration over all episodes and over the steps within each episode. Define
and
We have from Lemma A.7 in Wagenmaker and Jamieson [2022]:
We also denote .
The following Lemma 12 comes directly from Lemmas B.1, B.2, and B.3 in Wagenmaker and Jamieson [2022] and provides the basic concentration properties of the estimators constructed in line 6 of Algorithm 2.
Lemma 12.
Assume that we have collected some data where, for each , is independent of . Denote and . Fix and let
Fix . Then with probability at least :
Thus, with probability at least ,
Lemma 13.
Let denote the event on which, for all , the feature visitation estimates returned by line 6 satisfy:
Then
.
Proof.
Lemma 14.
Proof.
Lemma 15.
On the event , for all ,
Proof.
On , we can bound:
so that:
∎
Lemma 16.
Define and . Then , and on , for all and , we have:
Proof.
Lemma 17 (Full version of Lemma 2).
With probability at least , Algorithm 2 will run at most
episodes, and will output policy visitation estimators with bias bounded as:
Proof.
Corollary 1.
Appendix C Construct the Policy Set
In this section, we provide the proof for the constructed policy set. The construction technique follows directly from Appendix A.3 in Wagenmaker and Jamieson [2022], and we prove that such a construction also works in MDPs with adversarial rewards. Our main result is stated in Lemma 19.
Lemma 18.
In the adversarial MDP setting, where the loss function changes in each episode, the best stationary policy of the MDP over the considered episodes is the optimal policy of the MDP with a fixed loss function equal to the average loss. Denote this average MDP, which has the same transition kernel and the average loss, by .
That is:
where is the value function associated with the new MDP .
Proof.
Let be the trajectory generated by following policy through the MDP. Denote the occupancy measure as the probability of visiting state-action pair under trajectory , and .
For any stationary policy , we have:
Since the two MDPs share the same transition kernel, the occupancy measure generated by the same policy stays unchanged. So we have:
So satisfies:
∎
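The occupancy-measure argument can be summarized in one line (our notation: $q_h^{\pi}$ for the occupancy measure, which depends only on the transitions and $\pi$, and $\bar{\ell}_h$ for the average loss).

```latex
% Value under the average MDP equals the average value across episodes.
\[
  \frac{1}{K}\sum_{k=1}^{K} V_k^{\pi}
  = \frac{1}{K}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{s,a} q_h^{\pi}(s,a)\,\ell_{k,h}(s,a)
  = \sum_{h=1}^{H}\sum_{s,a} q_h^{\pi}(s,a)\,\bar{\ell}_h(s,a)
  = V^{\pi}(\bar{M}),
  \qquad
  \bar{\ell}_h = \frac{1}{K}\sum_{k=1}^{K} \ell_{k,h}.
\]
```

Hence the stationary policy minimizing the cumulative loss over the $K$ episodes coincides with the optimal policy of the average MDP $\bar{M}$.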
Lemma 19.
Choose an arbitrary constant tolerance; then we can construct a policy set for any linear adversarial MDP such that there exists a policy in the set whose regret, compared with the global optimal policy, is bounded accordingly:
So that:
and the size of is bounded as:
where $d$ is the dimension of the feature map.
Proof.
According to Lemma A.14 in Wagenmaker and Jamieson [2022], for any linear MDP with a fixed reward function, we can construct a policy set such that there exists a policy in the set that approximates the best policy with bounded bias. The size of the policy set is bounded as:
(22) |
Notice that this construction is based entirely on the set of state-action features and requires no information about the loss or reward function. In the adversarial case, we choose the fixed-reward MDP to be the average MDP defined in Lemma 18, and we obtain the regret bound over all episodes:
(23) |
The proof is finished by taking in Equation (22).
∎