Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP
Abstract
We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first algorithm is computationally efficient and achieves a regret bound of order $\sqrt{K}$ with polynomial dependence on $d$, $B_\star$, and $T_\star$, where $d$ is the dimension of the feature space, $B_\star$ and $T_\star$ are upper bounds on the expected costs and hitting time of the optimal policy respectively, and $K$ is the number of episodes. The same algorithm with a slight modification also achieves logarithmic (in $K$) regret, with the leading term inversely proportional to $\mathrm{gap}_{\min}$ and depending polynomially on $d$, $B_\star$, and $1/c_{\min}$, where $\mathrm{gap}_{\min}$ is the minimum sub-optimality gap and $c_{\min}$ is the minimum cost over all state-action pairs. Our result is obtained by developing a simpler and improved analysis for the finite-horizon approximation of (Cohen et al., 2021) with a smaller approximation error, which might be of independent interest. On the other hand, using variance-aware confidence sets in a global optimization problem, our second algorithm is computationally inefficient but achieves the first “horizon-free” regret bound, of order $\sqrt{K}$ with no polynomial dependency on $T_\star$ or $1/c_{\min}$, almost matching the lower bound from (Min et al., 2021).
1 Introduction
We study the stochastic shortest path (SSP) model, where a learner attempts to reach a goal state while minimizing her costs in a stochastic environment. SSP is a suitable model for many real-world applications, such as games, car navigation, robotic manipulation, etc. Online reinforcement learning in SSP has received great attention recently. In this setting, learning proceeds in episodes over a Markov Decision Process (MDP). In each episode, starting from a fixed initial state, the learner sequentially takes an action, incurs a cost, and transits to the next state until reaching the goal state. The performance of the learner is measured by her regret, the difference between her total costs and that of the optimal policy. SSP is a strict generalization of the heavily-studied finite-horizon reinforcement learning problem, where the learner is guaranteed to reach the goal state after a fixed number of steps.
Modern reinforcement learning applications often need to handle a massive state space, in which case function approximation is necessary. There has been significant progress in the study of linear function approximation, for both the finite-horizon setting (Ayoub et al., 2020; Jin et al., 2020b; Yang & Wang, 2020; Zanette et al., 2020a, b; Zhou et al., 2021a) and the infinite-horizon setting (Wei et al., 2021b; Zhou et al., 2021a, b). Recently, Vial et al. (2021) took the first step in considering linear function approximation for SSP. They studied SSP defined over a linear MDP and proposed a computationally inefficient algorithm with a $\sqrt{K}$-type regret bound, as well as an efficient algorithm with a substantially worse rate in $K$ (omitting other dependencies). Their bounds depend on $d$, the dimension of the feature space, $B_\star$, an upper bound on the expected costs of the optimal policy, and $c_{\min}$, the minimum cost across all state-action pairs. Later, Min et al. (2021) studied a related but different SSP problem defined over a linear mixture MDP and achieved a $\sqrt{K}$-type regret bound. Despite leveraging advances from both the finite-horizon and infinite-horizon settings, the results above are still far from optimal in terms of regret guarantees or computational efficiency, demonstrating the unique challenges of SSP problems.
In this work, we further extend our understanding of SSP with linear function approximation (more specifically, with linear MDPs). Our contributions are as follows:
• In Section 3, we first propose a new analysis for the finite-horizon approximation of SSP introduced in (Cohen et al., 2021), which is much simpler and achieves a smaller approximation error. Our analysis is also model agnostic, meaning that it does not make use of the modeling assumption and can be applied to both the tabular setting and function approximation settings. Combining this new analysis with a simple finite-horizon algorithm similar to that of (Jin et al., 2020b), we achieve a $\sqrt{K}$-type regret bound with polynomial dependence on $d$, $B_\star$, and $T_\star$, with $T_\star$ being an upper bound on the expected hitting time of the optimal policy, which strictly improves over that of (Vial et al., 2021). Notably, unlike their algorithm, ours is computationally efficient without any extra assumption.
• In Section 3.3, we further show that the same algorithm above with a slight modification achieves a logarithmic instance-dependent expected regret bound inversely proportional to a sub-optimality gap $\mathrm{gap}_{\min}$. As far as we know, this is the first logarithmic regret bound for SSP (with or without function approximation). We also establish an instance-dependent lower bound, which further advances our understanding of this problem even though it does not exactly match our upper bound.
• To remove the undesirable dependency on $T_\star$ in our instance-independent bound, in Section 4, we further develop a computationally inefficient algorithm that makes use of certain variance-aware confidence sets in a global optimization problem and achieves a $\sqrt{K}$-type regret bound. Importantly, this bound is horizon-free in the sense that it has no polynomial dependency on $T_\star$ or $1/c_{\min}$, even in the lower-order terms. Moreover, it almost matches the best known lower bound from (Min et al., 2021).
Techniques
Our results are built upon several technical innovations. First, as mentioned, we develop an improved analysis for the finite-horizon approximation of (Cohen et al., 2021), which might be of independent interest. The key idea is to directly bound the total approximation error with respect to the regret bound of the finite-horizon algorithm, instead of analyzing the estimation precision for each state-action pair as done in (Cohen et al., 2021).
Second, to obtain the logarithmic bound in Section 3, we note that it is not enough to simply combine the aforementioned finite-horizon approximation and the existing logarithmic regret results for the finite-horizon setting such as (He et al., 2021), since the sub-optimality gap obtained in this way is in terms of the finite-horizon counterpart instead of the original SSP and could be substantially smaller. We resolve this issue via a longer horizon in the approximation and a careful two-stage analysis.
Finally, our horizon-free result in Section 4 is obtained by a novel combination of several ideas, including the global optimization algorithm of (Zanette et al., 2020b; Wei et al., 2021b), the variance-aware confidence sets of (Zhang et al., 2021) (for a related but different setting with linear mixture MDPs), an improved analysis of the variance-aware confidence sets (Kim et al., 2021), and finally a new clipping trick and new update conditions that we propose. Our analysis does not require the recursion-based technique of (Zhang et al., 2020a) (for the tabular case), nor estimating higher order moments of value functions as in (Zhang et al., 2021) (for linear mixture MDPs), which might also be of independent interest.
Related work
Regret minimization of SSP under stochastic costs has been well studied in the tabular setting (that is, no function approximation) (Tarbouriech et al., 2020; Cohen et al., 2020, 2021; Tarbouriech et al., 2021; Chen et al., 2021a; Jafarnia-Jahromi et al., 2021). There are also several works (Rosenberg & Mansour, 2020; Chen et al., 2021b; Chen & Luo, 2021) considering the more challenging setting with adversarial costs (which is beyond the scope of this work).
Beyond linear function approximation, researchers in the finite-horizon setting have also started considering theoretical guarantees for general function approximation (Wang et al., 2020; Ishfaq et al., 2021; Kong et al., 2021). The study of SSP, which again is a strict generalization of the finite-horizon problem and might be a better model for many applications, falls behind in this regard, motivating us to explore this direction with the goal of providing a more complete picture, at least for linear function approximation.
The use of variance information is crucial in obtaining optimal regret bounds in MDPs. This dates back to the work of (Lattimore & Hutter, 2012) for the discounted setting, which has been significantly extended to the finite-horizon setting (Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019; Zhang et al., 2020a, b). Constructing variance-aware confidence sets for linear bandits and linear mixture MDPs has also gained recent attention (Zhou et al., 2021a; Zhang et al., 2021; Kim et al., 2021). We are among the first to do so for linear MDPs (a concurrent work (Wei et al., 2021a) also does so but for a completely different purpose of improving robustness against corruption).
2 Preliminary
An SSP instance is defined by an MDP $M = (\mathcal{S}, \mathcal{A}, s_{\mathrm{init}}, g, c, P)$. Here, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the (finite) action space (with $A = |\mathcal{A}|$), $s_{\mathrm{init}} \in \mathcal{S}$ is the initial state, $g \notin \mathcal{S}$ is the goal state, $c : \mathcal{S} \times \mathcal{A} \to [c_{\min}, 1]$ is the cost function with some global lower bound $c_{\min} \ge 0$, and $P = \{P_{s,a}\}_{(s,a) \in \mathcal{S} \times \mathcal{A}}$ with $P_{s,a} \in \Delta_{\mathcal{S} \cup \{g\}}$ is the transition function, where $P_{s,a}$ is a shorthand for $P(\cdot \mid s, a)$ and $\Delta_{\mathcal{X}}$ is the simplex over a set $\mathcal{X}$.
The learning protocol is as follows: the learner interacts with the environment for $K$ episodes. In each episode, the learner starts in the initial state $s_{\mathrm{init}}$, sequentially takes an action, incurs a cost, and transits to the next state. An episode ends when the learner reaches the goal state $g$. We denote by $(s_t, a_t, s'_t)$ the $t$-th state-action-state triplet observed among all episodes, so that $s'_t \sim P_{s_t, a_t}$ for each $t$, and $s_{t+1} = s'_t$ unless $s'_t = g$ (in which case $s_{t+1} = s_{\mathrm{init}}$, the initial state of the next episode). Also denote by $T$ the total number of steps in the $K$ episodes.
Learning objective
The learner’s goal is to learn a policy that reaches the goal state with minimum costs. Formally, a (stationary and deterministic) policy $\pi$ is a mapping that assigns an action $\pi(s) \in \mathcal{A}$ to each state $s \in \mathcal{S}$. We say $\pi$ is proper if following $\pi$ (that is, taking action $\pi(s)$ whenever in state $s$) reaches the goal state with probability $1$. Given a proper policy $\pi$, we define its value function and action-value function as
$$V^{\pi}(s) = \mathbb{E}\bigg[\sum_{i=1}^{I} c\big(s_i, \pi(s_i)\big) \,\bigg|\, s_1 = s\bigg], \qquad Q^{\pi}(s, a) = c(s, a) + \mathbb{E}_{s' \sim P_{s,a}}\big[V^{\pi}(s')\big],$$
where the expectation in $V^{\pi}$ is with respect to the randomness of the next states and the number of steps $I$ before reaching the goal $g$ (with $V^{\pi}(g) = 0$). Let $\Pi_{\mathrm{proper}}$ be the set of proper policies. We make the basic assumption that $\Pi_{\mathrm{proper}}$ is non-empty. Under this assumption, there exists an optimal proper policy $\pi^{\star}$, such that $V^{\pi^{\star}}(s) = \min_{\pi \in \Pi_{\mathrm{proper}}} V^{\pi}(s)$ and $Q^{\pi^{\star}}(s, a) = \min_{\pi \in \Pi_{\mathrm{proper}}} Q^{\pi}(s, a)$ for all $(s, a)$ (Bertsekas & Yu, 2013). We use $V^{\star}$ and $Q^{\star}$ as shorthands for $V^{\pi^{\star}}$ and $Q^{\pi^{\star}}$. The formal goal of the learner is then to minimize her regret against $\pi^{\star}$, that is, the difference between her total costs and that of the optimal proper policy, defined as
$$R_K = \sum_{k=1}^{K} \sum_{i=1}^{I_k} c\big(s_i^k, a_i^k\big) - K \cdot V^{\star}(s_{\mathrm{init}}),$$
where $I_k$ is the number of steps in episode $k$ and $(s_i^k, a_i^k)$ is the $i$-th state-action pair of episode $k$. We also define $R_K = \infty$ if some episode never reaches the goal (that is, if $I_k = \infty$ for some $k$).
Linear SSP
In the so-called tabular setting, the state space is assumed to be small, and algorithms with computational complexity and regret bounds depending on $|\mathcal{S}|$ and $A$ are acceptable. To handle a potentially massive state space, however, we consider the same linear function approximation setting as (Vial et al., 2021), where the MDP enjoys a linear structure in both the transition and cost functions (known as a linear or low-rank MDP).
Assumption 1 (Linear SSP).
For some $d > 0$, there exist a known feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$, an unknown cost parameter $\theta_c \in \mathbb{R}^d$, and $d$ unknown (signed) measures $\mu = (\mu_1, \ldots, \mu_d)$ over $\mathcal{S} \cup \{g\}$, such that for any $(s, a) \in \mathcal{S} \times \mathcal{A}$ and $s' \in \mathcal{S} \cup \{g\}$, we have
$$c(s, a) = \phi(s, a)^{\top} \theta_c, \qquad P(s' \mid s, a) = \phi(s, a)^{\top} \mu(s').$$
Moreover, we assume $\|\phi(s, a)\|_2 \le 1$ for all $(s, a)$, $\|\theta_c\|_2 \le \sqrt{d}$, and $\|\mu(\mathcal{S}')\|_2 \le \sqrt{d}$ for any $\mathcal{S}' \subseteq \mathcal{S} \cup \{g\}$.
We refer the reader to (Vial et al., 2021) and references therein for justification of this widely-used structural assumption (especially of the last few norm constraints). Under Assumption 1, by definition we have $Q^{\star}(s, a) = \phi(s, a)^{\top} \theta^{\star}$, where $\theta^{\star} = \theta_c + \int_{s'} V^{\star}(s') \,\mathrm{d}\mu(s')$, that is, $Q^{\star}$ is also linear in the features.
Key parameters and notations
Two extra parameters that play a key role in our analysis are $B_\star$, the maximum expected cost of the optimal policy starting from any state, and $T_\star$, the maximum expected hitting time of the optimal policy starting from any state; that is, $B_\star = \max_s V^{\star}(s)$ and $T_\star = \max_s T^{\pi^{\star}}(s)$, where $T^{\pi}(s)$ is the expected number of steps before reaching the goal if one follows policy $\pi$ starting from state $s$. By definition, we have $c_{\min} T_\star \le B_\star \le T_\star$ (since all costs lie in $[c_{\min}, 1]$).
For simplicity, we assume that $B_\star$, $T_\star$, and $c_{\min}$ are known to the learner for most discussions, and defer to the appendix what we can achieve when some of these parameters are unknown. We also assume $c_{\min} > 0$ by default (and will discuss the $c_{\min} = 0$ case for specific algorithms if modifications are needed).
For a positive integer $n$, we define $[n] = \{1, \ldots, n\}$. For any real number $x$, we define its projection onto an interval $[a, b]$ as $\min\{\max\{x, a\}, b\}$. The $\widetilde{O}(\cdot)$ notation hides all logarithmic terms, including $\ln K$ and $\ln(1/\delta)$ for some confidence level $\delta \in (0, 1)$.
3 An Efficient Algorithm for Linear SSP
In this section, we introduce a computationally efficient algorithm for linear SSP. In Section 3.1, we first develop an improved analysis for the finite-horizon approximation of (Cohen et al., 2021). Then in Section 3.2, we combine this approximation with a simple finite-horizon algorithm, which together achieves a $\sqrt{K}$-type regret bound. Finally, in Section 3.3, we further obtain a logarithmic regret bound via a slightly modified algorithm and a careful two-stage analysis.
3.1 Finite-Horizon Approximation of SSP
Finite-horizon approximation has been frequently used in solving SSP problems (Chen et al., 2021b; Chen & Luo, 2021; Cohen et al., 2021; Chen et al., 2021a). In particular, Cohen et al. (2021) proposed a black-box reduction from SSP to a finite-horizon MDP, which achieves a minimax optimal regret bound in the tabular case when combined with a certain finite-horizon algorithm. We will make use of the same algorithmic reduction in our proposed algorithm, but with an improved analysis.
Specifically, for an SSP instance $M$, define its finite-horizon MDP counterpart $\widetilde{M}$, which shares the state and action spaces of $M$ and is equipped with an extended cost function, a terminal cost function (more details to follow), an extended transition function, and a horizon parameter $H$. Assume access to a corresponding finite-horizon algorithm that learns $\widetilde{M}$ through a certain number of “intervals” following the protocol below. At the beginning of an interval, the learner is first reset to an arbitrary state. Then, in each of the $H$ steps within this interval, it decides an action, transits to the next state, and suffers the corresponding cost. At the end of the interval, the learner suffers an additional terminal cost and then moves on to the next interval.
With such black-box access, the reduction of (Cohen et al., 2021) is depicted in Algorithm 1. The algorithm partitions the time steps into intervals of length $H$ (with $H$ chosen so that $\pi^{\star}$ reaches the goal within $H$ steps with high probability). In each step, the algorithm follows the finite-horizon algorithm in a natural way and feeds the observations back to it. If the goal state is not reached within an interval, the algorithm naturally enters the next interval with the initial state being the current state. Otherwise, if the goal state is reached within some interval, we keep feeding the goal state and zero cost to the finite-horizon algorithm until it finishes the current interval, and after that, the next interval corresponds to the beginning of the next episode of the original SSP problem.
Input: Algorithm for finite-horizon MDP with horizon .
Initialize: interval counter .
for do
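To make the reduction concrete, the following is a minimal Python sketch of the interval loop described above; it is not the exact pseudocode of Algorithm 1, and the `env` and `learner` interfaces (`reset`, `step`, `start_interval`, `act`, `observe`, `end_interval`) are hypothetical placeholders, with the terminal cost assumed to be charged inside the learner.

```python
def finite_horizon_reduction(env, learner, H, K, goal):
    """Sketch of the SSP-to-finite-horizon reduction (interfaces are hypothetical)."""
    s = env.reset()                    # initial state of the first episode
    episode = 1
    while episode <= K:
        learner.start_interval(s)      # the finite-horizon learner is reset to the current state
        done = False                   # whether the goal has been reached in this interval
        for h in range(H):
            a = learner.act(s, h)
            if not done:
                s_next, cost, done = env.step(a)
            else:
                s_next, cost = goal, 0.0          # keep feeding the goal state and zero cost
            learner.observe(h, s, a, cost, s_next)
            s = s_next
        learner.end_interval(s)        # the learner suffers the terminal cost at the last state
        if done:                       # goal reached: the next interval starts a new episode
            episode += 1
            if episode <= K:
                s = env.reset()
        # otherwise, the next interval simply continues from the current state s
```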
Analysis
Cohen et al. (2021) showed that under this reduction, the regret of the SSP problem is very close to the regret of the finite-horizon algorithm in the finite-horizon MDP $\widetilde{M}$. Specifically, define the regret of the finite-horizon algorithm over a given number of initial intervals of $\widetilde{M}$ as its total costs in these intervals (note the inclusion of the terminal costs) minus the corresponding sum of optimal first-layer values of $\widetilde{M}$ (see Appendix B.1 for the formal definition). Denote by $N$ the final (random) number of intervals created during the $K$ episodes. Then Cohen et al. (2021) showed the following (a proof is included in Appendix B.2 for completeness).
Lemma 1.
Algorithm 1 ensures .
This lemma suggests that it remains to bound the number of intervals. The analysis of Cohen et al. (2021) does so by marking state-action pairs as “known” or “unknown” based on how many times they have been visited, and by showing that in each interval, the learner either reaches an “unknown” state-action pair or with high probability reaches the goal state. This analysis requires the finite-horizon algorithm to be “admissible” (defined through a set of conditions) and also heavily relies on the tabular setting to keep track of the status of each state-action pair, making it hard to generalize directly to function approximation settings. Furthermore, it also introduces an extra horizon dependency in the lower-order term of the regret, since the total cost of an interval in which an “unknown” state-action pair is visited is trivially bounded only by the interval length.
Instead, we propose the following simple and improved analysis. The idea is to separate intervals into “good” ones, within which the learner reaches the goal state, and “bad” ones, within which the learner does not. Then, our key observation is that the regret in each bad interval is at least of order $B_\star$: the learner’s cost in such an interval is at least the terminal cost charged for not reaching the goal (a constant multiple of $B_\star$), while the optimal policy’s expected cost is at most roughly $B_\star$. Therefore, if the finite-horizon algorithm is a no-regret algorithm, the number of bad intervals has to be small. More formally, based on this idea we can bound the number of intervals directly in terms of the regret guarantee of the finite-horizon algorithm without requiring any extra properties from it, as shown in the following theorem.
Theorem 1.
Suppose that enjoys the following regret guarantee with certain probability: for some problem-dependent coefficients and (that are independent of ) and any number of intervals . Then, with the same probability, the number of intervals created by Algorithm 1 satisfies .
Proof.
For any finite , we will show , which then implies that has to be finite and is upper bounded by the same quantity. To do so, we define the set of good intervals where the learner reaches the goal state, and also the total costs of the learner in interval of : . By definition and the guarantee of , we have
(1)
Next, we derive lower bounds on and respectively. First note that by Lemma 17 and , we have that reaches the goal within steps with probability at least . Therefore, executing in an episode of leads to at most costs in expectation, which implies for any . By , we thus have
On the other hand, for , we have due to the terminal cost , and thus
Combining the two lower bounds above with Eq. (1), we arrive at . By Lemma 28, this implies , finishing the proof. ∎
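In schematic form, write $N$ for the total number of intervals and $N_{\mathrm{bad}}$ for the number of bad intervals, and suppose the finite-horizon algorithm guarantees regret at most $\beta_1 \sqrt{N} + \beta_2$. Assuming (as a placeholder for the exact constants in the proof above) that each bad interval contributes regret at least $B_\star$, the counting argument reads
$$B_\star \cdot N_{\mathrm{bad}} \;\le\; \beta_1 \sqrt{N} + \beta_2, \qquad N \;\le\; K + N_{\mathrm{bad}},$$
and solving the resulting quadratic inequality in $\sqrt{N}$ gives $N = O\big(K + \beta_1^2 / B_\star^2 + \beta_2 / B_\star\big)$.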
Now, plugging the bound on the number of intervals from Theorem 1 into Lemma 1, we immediately obtain the following corollary, a general regret bound for the finite-horizon approximation.
Corollary 2.
Under the same condition of Theorem 1, Algorithm 1 ensures (with the same probability stated in Theorem 1).
Proof.
Note that the final regret bound completely depends on the regret guarantee of the finite-horizon algorithm. In particular, in the tabular case, if we apply a variant of EB-SSP (Tarbouriech et al., 2021) that achieves a horizon-free finite-horizon regret bound (note the lack of polynomial dependency on the horizon),[1] then Corollary 2 ensures a bound improving the results of (Cohen et al., 2021) and matching the best existing bounds of (Tarbouriech et al., 2021; Chen et al., 2021a); see Appendix B.5 for more details. This is not achievable by the analysis of (Cohen et al., 2021) due to the dependency in the lower-order term mentioned earlier. [1] This variant is equivalent to applying EB-SSP on a homogeneous finite-horizon MDP.
More importantly, our analysis is model agnostic: it only makes use of the regret guarantee of the finite-horizon algorithm, and does not leverage any modeling assumption on the SSP instance. This enables us to directly apply our result to settings with function approximation. In Appendix B.6, we provide an example for SSP with a linear mixture MDP, which gives a new regret bound by combining Corollary 2 and the near-optimal finite-horizon algorithm of (Zhou et al., 2021a).
3.2 Applying an Efficient Finite-Horizon Algorithm for Linear MDPs
Similarly, if there were a horizon-free algorithm for finite-horizon linear MDPs, we could directly combine it with Algorithm 1 and obtain a $T_\star$-independent regret bound. However, to our knowledge, this is still open due to some unique challenges for linear MDPs.
Nevertheless, even combining Algorithm 1 with a horizon-dependent linear MDP algorithm already leads to a significant improvement over the state of the art for linear SSP. Specifically, the finite-horizon algorithm we apply is a variant of LSVI-UCB (Jin et al., 2020b), which performs Least-Squares Value Iteration with an optimistic modification. The pseudocode is shown in Algorithm 2. Utilizing the fact that action-value functions are linear in the features for a linear MDP, in each interval we estimate the parameters of these linear functions by solving a set of least-squares linear regression problems using all observed data, and we encourage exploration by subtracting a bonus term in the definition of the action-value estimates. Then, we simply act greedily with respect to the truncated action-value estimates. Clearly, this is an efficient algorithm with polynomial (in $d$, $A$, and the number of intervals) time complexity for each interval.
We refer the reader to (Jin et al., 2020b) for more explanation of the algorithm, and point out three key modifications we make compared to their version. First, Jin et al. (2020b) maintain a separate covariance matrix for each layer $h$ using data only from layer $h$, while we only maintain a single covariance matrix using data across all layers. This is possible (and results in a better regret bound) since the transition function is the same in each layer of $\widetilde{M}$. Another modification is to define the value estimate of layer $H + 1$ simply as the terminal cost, for the purpose of incorporating the terminal cost. Finally, we project the action-value estimates onto a bounded interval controlled by some parameter, similarly to (Vial et al., 2021). In the main text this parameter is simply set large enough that the upper-bound truncation has no effect. However, this projection will become important when learning without the knowledge of $B_\star$ or $T_\star$ (see Appendix B.4).
Parameters: , where is the failure probability and .
Initialize: .
for do
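The following simplified Python sketch illustrates the core computation of an LSVI-UCB-style update with a single covariance matrix shared across layers; the function signature, the handling of the terminal cost, and the bonus coefficient `beta` are illustrative assumptions rather than the exact specification of Algorithm 2.

```python
import numpy as np

def lsvi_ucb_weights(phi, costs, next_feats, terminal_costs, H, lam, beta, B):
    """Backward least-squares value iteration with a subtracted (optimistic) bonus.

    phi:            (n, d) features phi(s_i, a_i) of all observed transitions
    costs:          (n,)   observed costs
    next_feats:     (n, A, d) features of every action at each observed next state
    terminal_costs: (n,)   terminal cost at each observed next state
    H, lam, beta, B: horizon, ridge parameter, bonus coefficient, truncation level
    Returns the weight vectors (w_1, ..., w_H) of the estimated Q-functions.
    """
    n, d = phi.shape
    Lambda = lam * np.eye(d) + phi.T @ phi           # one covariance matrix for all layers
    Lambda_inv = np.linalg.inv(Lambda)
    V_next = terminal_costs.copy()                   # layer H+1 value = terminal cost
    weights = [None] * H
    for h in reversed(range(H)):
        w = Lambda_inv @ (phi.T @ (costs + V_next))  # ridge regression on cost + next value
        weights[h] = w
        q = next_feats @ w                           # (n, A) plug-in Q-values at the next states
        bonus = beta * np.sqrt(np.einsum('nad,de,nae->na',
                                         next_feats, Lambda_inv, next_feats))
        V_next = np.clip((q - bonus).min(axis=1), 0.0, B)   # optimistic, truncated value
    return weights
```

Acting greedily in an interval then amounts to choosing, at layer $h$ and state $s$, the action minimizing the analogously truncated optimistic estimate built from $w_h$.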
We show the following regret guarantee of Algorithm 2 following the analysis of (Vial et al., 2021) (see Appendix B.3).
Lemma 2.
With probability at least , Algorithm 2 with ensures for any .
Applying Corollary 2 we then immediately obtain the following new result for linear SSP.
Theorem 3.
Applying Algorithm 1 with and being Algorithm 2 with to the linear SSP problem ensures with probability at least .
There is some gap between our result above and the existing lower bound for this problem (Min et al., 2021). In particular, the dependency on $T_\star$ inherited from the horizon dependency in Lemma 2 is most likely unnecessary. Nevertheless, this already strictly improves over the best existing bound from (Vial et al., 2021) since $T_\star \le B_\star / c_{\min}$. Moreover, our algorithm is computationally efficient, while the algorithms of Vial et al. (2021) are either inefficient or achieve a much worse regret bound (unless some strong assumptions are made). This improvement comes from the fact that our algorithm uses non-stationary policies (due to the finite-horizon approximation), which avoids the challenging problem of solving the fixed point of some empirical Bellman equation. This also demonstrates the power of the finite-horizon approximation in solving SSP problems. On the other hand, obtaining the same regret guarantee by learning only stationary policies is an interesting future direction.
Learning without knowing $B_\star$ or $T_\star$
Note that the result of Theorem 3 requires the knowledge of $B_\star$ and $T_\star$. Without knowing these parameters, we can still efficiently obtain a regret bound matching the bound of (Vial et al., 2021) achieved by their inefficient algorithm. See Appendix B.4 for details.
3.3 Logarithmic Regret
Many optimistic algorithms attain a more favorable regret bound of the form $O(C \ln K)$, where $C$ is an instance-dependent constant usually inversely proportional to some gap measure; see e.g. (Jaksch et al., 2010) for the infinite-horizon setting and (Simchowitz & Jamieson, 2019) for the finite-horizon setting. In this section, we show that a slight modification of our algorithm also leads to an expected regret bound that is polylogarithmic in $K$ and inversely proportional to the minimum sub-optimality gap $\mathrm{gap}_{\min}$.[2] [2] Note that for our definition of regret, a polylogarithmic bound is only possible in expectation, because even if the learner always executes $\pi^{\star}$, the deviation of her total costs from $K \cdot V^{\star}(s_{\mathrm{init}})$ is already of order $\sqrt{K}$.
The high-level idea is as follows. The first observation is that, similarly to a recent work by He et al. (2021), we can show that our Algorithm 2 obtains a gap-dependent logarithmic regret bound for the finite-horizon problem. The caveat is that the gap here is naturally defined using the optimal value and action-value functions of the finite-horizon MDP $\widetilde{M}$ (which are different for each layer $h$), instead of those of the original SSP. The difference between the two gaps can in fact be significant; see Appendix B.7 for an example where the finite-horizon gap is arbitrarily smaller than the SSP gap.
To get around this issue, we set $H$ to a larger value (of order $T_\star$ up to logarithmic factors) and perform the following two-stage analysis. For the first $H/2$ layers, we are able to relate the finite-horizon gaps to the SSP gaps, leading to a logarithmic bound on the regret suffered in these layers. Then, for the last $H/2$ layers, we further consider two cases: if the learner’s policy for the first $H/2$ layers is nearly optimal, then the probability of not reaching the goal within the first $H/2$ layers is very low by the choice of $H$, and thus the costs suffered in the last $H/2$ layers are negligible; otherwise, we simply bound the costs using the number of times the learner takes a non-near-optimal action in the first $H/2$ layers, which is again shown to be logarithmic in $K$.
One final detail is to carefully control the regret under the failure event that happens with a small probability (recall that we are aiming for an expected regret bound; see Footnote 2). This is necessary since in SSP the learner’s cost under such events could be unbounded in the worst case. To resolve this issue, we make a slight modification to Algorithm 1 and occasionally restart the finite-horizon algorithm whenever the total number of intervals reaches some multiple of a threshold; see Algorithm 7 in the appendix. This finally leads to our main result, summarized in the following theorem (whose proof is deferred to Appendix B.8).
Theorem 4.
There exist and such that applying Algorithm 7 with horizon and being Algorithm 2 (with and failure probability ) ensures .
As far as we know, this is the first polylogarithmic bound for any SSP problem. Our result also indicates that the instance-dependent quantities of SSP can be well preserved after using some finite-horizon approximation.
Lower bounds
To better understand instance-dependent regret bounds for this problem, we further show the following lower bound.
Theorem 5.
For any algorithm , there exists a linear SSP instance with and such that .
This lower bound exhibits a relatively large gap from our upper bound. One important question is whether the dependency in the upper bound is really necessary, which we leave as a future direction.
4 An Inefficient Horizon-Free Algorithm
Recall that the dominating term of the regret bound shown in Theorem 3 depends polynomially on $T_\star$, which is most likely unnecessary. Due to the lack of a horizon-free algorithm for finite-horizon linear MDPs (which, as discussed, would have addressed this issue), in this section we propose a different approach, leading to a computationally inefficient algorithm with a regret bound that is horizon-free (that is, with no polynomial dependency on $T_\star$ or $1/c_{\min}$) but has a worse dependency on $d$.
As observed in previous work for the tabular setting (Cohen et al., 2020, 2021; Tarbouriech et al., 2021; Chen et al., 2021a), achieving a horizon-free regret bound requires constructing variance-aware confidence sets for the transition function. While this is straightforward in the tabular case, it is much more challenging with linear function approximation. Zhou et al. (2021a); Zhang et al. (2021) construct variance-aware confidence sets for linear mixture MDPs, but we are not aware of similar results for linear MDPs, which impose extra challenges. Our algorithm VA-GOPO, shown in Algorithm 3, is the first one to successfully make use of these ideas.
VA-GOPO follows a framework similar to that of the Eleanor algorithm of (Zanette et al., 2020b) (for the finite-horizon setting) and the FOPO algorithm of (Wei et al., 2021b) (for the infinite-horizon setting): they all maintain an estimate of the true weight vector $\theta^{\star}$ (recall that $Q^{\star}(s, a) = \phi(s, a)^{\top} \theta^{\star}$), found by optimistically minimizing the value of the current state over a confidence set of candidate weight vectors, and then simply act greedily according to this estimate. The main differences are the construction of the confidence set and the conditions under which the estimate is updated, which we explain in detail below.
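As a minimal illustration of the last step (acting according to the current weight estimate), once the constrained minimization has produced a weight vector, action selection is simple; in the sketch below, `phi` is a hypothetical callable returning the $d$-dimensional feature of a state-action pair, and folding the immediate cost into the weight vector (as the linearity of $Q^{\star}$ under Assumption 1 allows) is an assumption of this sketch.

```python
import numpy as np

def greedy_action(s, actions, phi, theta_hat, B):
    """Greedy action w.r.t. the truncated linear Q-estimate phi(s, a)^T theta_hat
    (a schematic of the 'act according to the estimate' step; the expensive part,
    the constrained minimization producing theta_hat, is omitted here)."""
    q_values = [np.clip(np.dot(phi(s, a), theta_hat), 0.0, B) for a in actions]
    return actions[int(np.argmin(q_values))]
```

The computational burden of VA-GOPO thus lies entirely in the global optimization that produces the weight estimate, which is why the algorithm is not computationally efficient.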
Confidence Set
For a parameter and a weight vector , inspired by (Zhang et al., 2021) we define a variance-aware confidence set for time step as
(2)
where with , and
(3)
with being the -dimensional -ball of radius , being the -net of , (recall ), being a shorthand of , (and ), , , and finally for some failure probability . The key difference between our confidence set and that of (Zhang et al., 2021) is in the definition of and due to the different structures between linear MDPs and linear mixture MDPs. In particular, we note that the value function (more formally ) in our definitions is itself defined with respect to another weight vector .
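While the display above involves several paper-specific quantities, the general shape of a variance-aware confidence set of this kind (in the spirit of Zhang et al. (2021) and Kim et al. (2021); the symbols below are generic placeholders rather than the exact Eq. (2) and Eq. (3)) is a variance-weighted ridge-regression estimator together with an ellipsoidal constraint:
$$\hat{\theta}_t = \Lambda_t^{-1} \sum_{i < t} \frac{\phi_i\, y_i}{\bar{\sigma}_i^2}, \qquad \Lambda_t = \lambda I + \sum_{i < t} \frac{\phi_i \phi_i^{\top}}{\bar{\sigma}_i^2}, \qquad \Theta_t = \Big\{ \theta : \big\| \theta - \hat{\theta}_t \big\|_{\Lambda_t} \le \beta_t \Big\},$$
where $y_i$ is the regression target (here, a value-function evaluation at the observed next state), $\bar{\sigma}_i^2$ is an upper estimate of its conditional variance, and the radius $\beta_t$ comes from a Bernstein-style concentration bound.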
With this confidence set, when VA-GOPO decides to update , it searches over all such that and finds the one that minimizes the value at the current state (Line 3). Here, is a running estimate of . VA-GOPO maintains the inequality during the update by doubling the value of and repeating Line 3 whenever this is violated (Line 3). Note that the constraint is in a sense self-referential — we consider within a confidence set defined in terms of itself, which is an important distinction compared to (Zhang et al., 2021) and is critical for linear MDPs.
To provide some intuition on our confidence set, denote by and by . Note that if we ignore the dependency between and (an issue that will eventually be addressed by some covering arguments), then forms a martingale sequence when , and thus the inequality in Eq. (3) holds with high probability by some Bernstein-style concentration inequality (Lemma 36). Formally, this allows us to show the following.
Lemma 3.
With probability at least , , .
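For reference, a representative form of the Bernstein-style martingale inequality invoked above (a standard consequence of Freedman's inequality, stated here with loose constants; the precise version used as Lemma 36 may differ) is: if $X_1, \ldots, X_n$ is a martingale difference sequence with $|X_i| \le b$ almost surely and $\sum_{i=1}^{n} \mathbb{E}[X_i^2 \mid \mathcal{F}_{i-1}] \le V$, then with probability at least $1 - \delta$,
$$\sum_{i=1}^{n} X_i \;\le\; 2\sqrt{V \ln(1/\delta)} + 2 b \ln(1/\delta).$$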
Since is also in , the difference between and is controlled by the size of the confidence set , which is overall shrinking and thus making sure that is getting closer and closer to . In addition, we also show that is optimistic at state whenever an update is performed and that never overestimates significantly.
Lemma 4.
With probability at least , we have if an update (Line 3) is performed at time step , and for all .
Update Conditions
VA-GOPO updates whenever one of the three conditions in Line 3 is triggered. The first condition simply indicates that the current time step is the start of a new episode. The second condition is
(4)
where the quantities above are compared against their values at the most recent update time step. This lazy update condition makes sure that the algorithm does not update too often (see Lemma 27) while still enjoying a small enough estimation error. The last condition (we call it the overestimate condition) tests whether the current state has an overestimated value (the threshold being the maximum possible value of the estimate due to the truncation in its definition). This condition helps remove an extra factor in the regret bound without resorting to more complicated ideas from previous works; see Appendix C.5 for more explanation.
Regret Guarantee
We prove the following regret guarantee for VA-GOPO, and provide a proof sketch in Appendix C.1 followed by the full proof in the rest of Appendix C.
Theorem 6.
With probability at least , Algorithm 3 ensures .
Ignoring the lower-order term, our bound is (potentially) suboptimal only in terms of the $d$-dependency compared to the lower bound from (Min et al., 2021). We note again that this is the first horizon-free regret bound for linear SSP: it does not have any polynomial dependency on $T_\star$ or $1/c_{\min}$, even in the lower-order terms. Furthermore, VA-GOPO also does not require the knowledge of $B_\star$ or $T_\star$. For simplicity, we have assumed $c_{\min} > 0$. However, even when $c_{\min} = 0$, we can obtain essentially the same bound by running the same algorithm on a modified cost function; see Appendix A for details.
5 Conclusion
In this work, we make significant progress towards a better understanding of linear function approximation in the challenging SSP model. Two algorithms are proposed: the first one is efficient and achieves a regret bound strictly better than that of (Vial et al., 2021), while the second one is inefficient but achieves a horizon-free regret bound. In developing these results, we also propose several new techniques that might be of independent interest, especially the new analysis for the finite-horizon approximation of (Cohen et al., 2021).
A natural future direction is to close the gap between existing upper bounds and lower bounds in this problem, especially with an efficient algorithm. Another interesting direction is to study SSP with adversarially changing costs under linear function approximation.
References
- Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24:2312–2320, 2011.
- Ayoub et al. (2020) Ayoub, A., Jia, Z., Szepesvari, C., Wang, M., and Yang, L. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pp. 463–474. PMLR, 2020.
- Azar et al. (2017) Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272. PMLR, 2017.
- Bertsekas & Yu (2013) Bertsekas, D. P. and Yu, H. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013.
- Chen & Luo (2021) Chen, L. and Luo, H. Finding the stochastic shortest path with low regret: The adversarial cost and unknown transition case. In International Conference on Machine Learning, 2021.
- Chen et al. (2021a) Chen, L., Jafarnia-Jahromi, M., Jain, R., and Luo, H. Implicit finite-horizon approximation and efficient optimal algorithms for stochastic shortest path. Advances in Neural Information Processing Systems, 2021a.
- Chen et al. (2021b) Chen, L., Luo, H., and Wei, C.-Y. Minimax regret for stochastic shortest path with adversarial costs and known transition. In Conference on Learning Theory, pp. 1180–1215. PMLR, 2021b.
- Cohen et al. (2020) Cohen, A., Kaplan, H., Mansour, Y., and Rosenberg, A. Near-optimal regret bounds for stochastic shortest path. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 8210–8219. PMLR, 2020.
- Cohen et al. (2021) Cohen, A., Efroni, Y., Mansour, Y., and Rosenberg, A. Minimax regret for stochastic shortest path. Advances in Neural Information Processing Systems, 2021.
- He et al. (2021) He, J., Zhou, D., and Gu, Q. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning, pp. 4171–4180. PMLR, 2021.
- Ishfaq et al. (2021) Ishfaq, H., Cui, Q., Nguyen, V., Ayoub, A., Yang, Z., Wang, Z., Precup, D., and Yang, L. F. Randomized exploration for reinforcement learning with general value function approximation. International Conference on Machine Learning, 2021.
- Jafarnia-Jahromi et al. (2021) Jafarnia-Jahromi, M., Chen, L., Jain, R., and Luo, H. Online learning for stochastic shortest path model via posterior sampling. arXiv preprint arXiv:2106.05335, 2021.
- Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010.
- Jin et al. (2018) Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863–4873, 2018.
- Jin et al. (2020a) Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. Learning adversarial Markov decision processes with bandit feedback and unknown transition. In Proceedings of the 37th International Conference on Machine Learning, pp. 4860–4869, 2020a.
- Jin et al. (2020b) Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137–2143. PMLR, 2020b.
- Jin et al. (2021) Jin, T., Huang, L., and Luo, H. The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition. Advances in Neural Information Processing Systems, 2021.
- Kim et al. (2021) Kim, Y., Yang, I., and Jun, K.-S. Improved regret analysis for variance-adaptive linear bandits and horizon-free linear mixture mdps. arXiv preprint arXiv:2111.03289, 2021.
- Kong et al. (2021) Kong, D., Salakhutdinov, R., Wang, R., and Yang, L. F. Online sub-sampling for reinforcement learning with general function approximation. arXiv preprint arXiv:2106.07203, 2021.
- Lattimore & Hutter (2012) Lattimore, T. and Hutter, M. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pp. 320–334. Springer, 2012.
- Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
- Min et al. (2021) Min, Y., He, J., Wang, T., and Gu, Q. Learning stochastic shortest path with linear function approximation. arXiv preprint arXiv:2110.12727, 2021.
- Rosenberg & Mansour (2020) Rosenberg, A. and Mansour, Y. Stochastic shortest path with adversarially changing costs. arXiv preprint arXiv:2006.11561, 2020.
- Shani et al. (2020) Shani, L., Efroni, Y., Rosenberg, A., and Mannor, S. Optimistic policy optimization with bandit feedback. In Proceedings of the 37th International Conference on Machine Learning, pp. 8604–8613, 2020.
- Simchowitz & Jamieson (2019) Simchowitz, M. and Jamieson, K. G. Non-asymptotic gap-dependent regret bounds for tabular MDPs. Advances in Neural Information Processing Systems, 32:1153–1162, 2019.
- Tarbouriech et al. (2020) Tarbouriech, J., Garcelon, E., Valko, M., Pirotta, M., and Lazaric, A. No-regret exploration in goal-oriented reinforcement learning. In International Conference on Machine Learning, pp. 9428–9437. PMLR, 2020.
- Tarbouriech et al. (2021) Tarbouriech, J., Zhou, R., Du, S. S., Pirotta, M., Valko, M., and Lazaric, A. Stochastic shortest path: Minimax, parameter-free and towards horizon-free regret. Advances in Neural Information Processing Systems, 2021.
- Vial et al. (2021) Vial, D., Parulekar, A., Shakkottai, S., and Srikant, R. Regret bounds for stochastic shortest path problems with linear function approximation. arXiv preprint arXiv:2105.01593, 2021.
- Wang et al. (2020) Wang, R., Salakhutdinov, R., and Yang, L. F. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 2020.
- Wei et al. (2021a) Wei, C.-Y., Dann, C., and Zimmert, J. A model selection approach for corruption robust reinforcement learning. arXiv preprint arXiv:2110.03580, 2021a.
- Wei et al. (2021b) Wei, C.-Y., Jahromi, M. J., Luo, H., and Jain, R. Learning infinite-horizon average-reward mdps with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pp. 3007–3015. PMLR, 2021b.
- Yang & Wang (2020) Yang, L. and Wang, M. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pp. 10746–10756. PMLR, 2020.
- Zanette & Brunskill (2019) Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In Proceedings of the 36th International Conference on Machine Learning, pp. 7304–7312, 2019.
- Zanette et al. (2020a) Zanette, A., Brandfonbrener, D., Brunskill, E., Pirotta, M., and Lazaric, A. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pp. 1954–1964. PMLR, 2020a.
- Zanette et al. (2020b) Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, pp. 10978–10989. PMLR, 2020b.
- Zhang et al. (2020a) Zhang, Z., Ji, X., and Du, S. S. Is reinforcement learning more difficult than bandits? A near-optimal algorithm escaping the curse of horizon. In Conference On Learning Theory, 2020a.
- Zhang et al. (2020b) Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. In Advances in Neural Information Processing Systems, volume 33, pp. 15198–15207. Curran Associates, Inc., 2020b.
- Zhang et al. (2021) Zhang, Z., Yang, J., Ji, X., and Du, S. S. Variance-aware confidence set: Variance-dependent bound for linear bandits and horizon-free bound for linear mixture MDP. Advances in Neural Information Processing Systems, 2021.
- Zhou et al. (2021a) Zhou, D., Gu, Q., and Szepesvari, C. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pp. 4532–4576. PMLR, 2021a.
- Zhou et al. (2021b) Zhou, D., He, J., and Gu, Q. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pp. 12793–12802. PMLR, 2021b.
Appendix A Preliminary
Extra Notations in Appendix
For a function and a distribution , we define and .
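A standard convention consistent with how these operators are used later in the appendix (the exact symbols are an assumption here) is
$$P f := \mathbb{E}_{s' \sim P}\big[ f(s') \big], \qquad \mathbb{V}(P, f) := P f^2 - (P f)^2,$$
that is, the expectation and the variance of $f(s')$ when $s'$ is drawn from $P$.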
Cost Perturbation for
We follow the recipe in (Vial et al., 2021, Appendix A.3) to deal with zero costs: the main idea is to run the SSP algorithm with a perturbed cost for some perturbation level, which is equivalent to solving a different SSP instance . Let . Then, . Therefore, is also a linear SSP with (up to a small constant, since can be as large as ). Denote by the optimal value function in , and define as the regret in . We have , and
Therefore, by running an SSP algorithm on the perturbed cost , we recover its regret guarantee with , , and an additional bias in the regret.
Appendix B Omitted Details for Section 3
Notations
For , denote by the expected cost of executing policy starting from state in layer , and by the policy executed in interval (for example, in Algorithm 2). For notational convenience, define , and for such that . Define indicator , and auxiliary feature for all , such that and for any and with . Finally, for Algorithm 2, define stopping time , which is the number of intervals until finishing the $K$ episodes or until the upper-bound truncation on the value estimate is triggered.
B.1 Formal Definition of and
It is not hard to see that we can define and recursively without resorting to the definition of :
with for all .
B.2 Proof of Lemma 1
Proof.
Denote by the set of intervals in episode , and by the first interval in episode . We bound the regret in episode as follows: by Lemma 17 and , we have the probability that following takes more than steps to reach in is at most . Therefore,
Thus,
( and ) |
Summing terms above over and by the definition of , we obtain the desired result. ∎
B.3 Proof of Lemma 2
We first bound the error of one-step value iteration w.r.t and , which is essential to our analysis.
Lemma 5.
For any , with probability at least , we have and for any , .
Proof.
Define , so that . Then,
By and Lemma 31, we have with probability at least , for any , :
(5)
where is the -cover of the function class of with . Note that is either or
for some PSD matrix such that by the definition of , and for some such that by the definition of . We denote by the function class of . Now we apply Lemma 32 to with , , (note that ), and , which is given by (Vial et al., 2021, Claim 2) and the following calculation: for any for some ,
and for any ,
() | |||
() | |||
Lemma 32 then implies . Plugging this back, we get
(6)
Moreover, . Thus,
Therefore, by , and the first statement is proved. For any , we prove the second statement by induction on . The base case is clearly true by . For , we have by the induction step:
Thus, . ∎
Next, we prove a general regret bound, from which Lemma 2 is a direct corollary.
Lemma 6.
Assume . Then with probability at least , Algorithm 2 ensures for any
Proof.
We are now ready to prove Lemma 2.
B.4 Learning without Knowing or
In this section, we develop a parameter-free algorithm that, without knowing $B_\star$ or $T_\star$, achieves a regret bound matching the best bound of (Vial et al., 2021) while requiring no more parameter knowledge than theirs and being computationally efficient under the most general assumption. Here we apply the finite-horizon approximation with zero terminal costs, and develop a new analysis of this approximation.
Finite-Horizon Approximation of SSP with Zero Terminal Costs
Input: upper bound estimate and function from Lemma 8.
Initialize: an instance of finite-horizon algorithm with horizon .
Initialize: , , , .
while do
To avoid knowledge of or , we apply finite-horizon approximation with zero terminal costs and horizon of order for some estimate of , that is, running Algorithm 1 with and . We show that in this case there is an alternative way to bound the regret by , and there is a tighter bound on the total number of intervals when .
Lemma 7.
Algorithm 1 with ensures .
Proof.
Denote by the set of intervals in episode . We have:
( by ) |
∎
Lemma 8.
Suppose when , with horizon ensures for any with probability at least , where , are functions of and are independent of . Then Algorithm 1 with ensures with probability at least ,
Proof.
First note that by Lemma 39 and , with probability at least : . For any finite , we will show , which then implies that has to be finite and is upper bounded by the same quantity. Define as the expected cost for the first layers and as the optimal value function for the first layers. By (Chen et al., 2021a, Lemma 1) and , we have and . Moreover, when , we have . Denote by the conditional probability of certain event conditioning on the history before interval . Then with probability at least ,
( and guarantee of ) | |||
(Lemma 39) |
Then by and reorganizing terms, we get . Again by Lemma 39, we have with probability at least :
By and solving a quadratic inequality w.r.t , we get . Thus, we also get the same bound for . ∎
Remark 1.
Note that the result of Lemma 8 is similar to (Tarbouriech et al., 2020, Lemma 7), which also shows that the number of “bad” intervals is of order . However, their result is derived by explicitly analyzing the transition confidence sets, while we only make use of the regret guarantee of the finite-horizon algorithm. Thus, our approach is again model-agnostic and directly applicable to linear function approximation while their result is not.
Note that Lemma 7 and Lemma 8 together imply a regret bound when the estimate of $T_\star$ is large enough. Moreover, since the total number of “bad” intervals is suitably bounded, we can properly bound the cost of running the finite-horizon algorithm with wrong estimates. We now present an adaptive version of the finite-horizon approximation of SSP (Algorithm 4) which does not require the knowledge of $B_\star$ or $T_\star$. The main idea is to perform the finite-horizon approximation with zero terminal costs and maintain an estimate of $T_\star$. The learner runs a finite-horizon algorithm with horizon of order the current estimate. Whenever the finite-horizon algorithm detects an anomaly, or the number of “bad” intervals is larger than expected, the learner doubles the estimate and starts a new instance of the finite-horizon algorithm with the updated estimate (see the sketch below). The guarantee of Algorithm 4 is summarized in Theorem 7 below.
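A minimal Python sketch of this doubling scheme is given below; the `make_learner` and `run_intervals` interfaces, the constant in the horizon, and the `bad_budget` threshold are hypothetical placeholders for the corresponding conditions of Algorithm 4.

```python
def adaptive_reduction(make_learner, run_intervals, K, bad_budget, T_init=1.0):
    """Sketch of the parameter-free scheme: run the finite-horizon learner with a
    horizon proportional to the current estimate of T_star, and double the estimate
    whenever an anomaly is detected or too many intervals fail to reach the goal."""
    T_est = T_init
    episodes_done = 0
    while episodes_done < K:
        learner = make_learner(horizon=int(4 * T_est))   # horizon of order T_est
        outcome = run_intervals(learner,
                                episodes_left=K - episodes_done,
                                bad_interval_budget=bad_budget(T_est))
        episodes_done += outcome.episodes_finished
        if outcome.anomaly_detected or outcome.budget_exceeded:
            T_est *= 2                                   # estimate too small: double it
```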
Theorem 7.
Suppose takes an estimate as input, and when , it has some probability of detecting the anomaly (the event ) and halts. Define stopping time , and suppose for any , with horizon ensures for any , where are non-decreasing w.r.t . Then, Algorithm 4 ensures with probability at least .
Proof.
We divide the learning process into epochs indexed by based on the update of , so that (the input value) and . Let . Define the regret in epoch as , where is the total costs suffered in epoch , is the set of episodes overlapped with epoch , and is the initial state in episode and epoch (note that an episode can overlap with multiple epochs). Clearly, . Note that satisfies the assumptions in Lemma 8, since no anomaly will be detected when . Thus in epoch , no new epoch will be started by Lemma 8. Moreover, by Lemma 7 and , the regret is bounded by:
For , by the conditions of starting a new epoch, the number of intervals that does not reach the goal is upper bounded by and the total number of intervals in epoch is upper bounded by . Thus by Lemma 7 and the guarantee of ,
where the last equality is by the fact that and are non-decreasing w.r.t . Thus,
∎
Theorem 8.
Applying Algorithm 4 with Algorithm 2 as to the linear SSP problem ensures with probability at least .
Proof.
Note that for Algorithm 2, and Lemma 6 ensures that Algorithm 2 satisfies assumptions of Theorem 7 with and , where . Then by Theorem 7, we have: . ∎
Remark 2.
Comparing the bound achieved by Theorem 8 with that of Theorem 3, we see that is in place of , making it a worse bound since . Previous works on SSP (Cohen et al., 2021; Tarbouriech et al., 2021; Chen et al., 2021a) suggest that algorithms that obtain a bound with dependency on are easier to make parameter-free compared to those with dependency on . Our findings in this section are consistent with those in previous works.
B.5 Horizon-Free Regret in the Tabular Setting with Finite-Horizon Approximation
Here we present a finite-horizon algorithm (Algorithm 5) that achieves a horizon-free regret bound and thus gives a horizon-free SSP bound when combined with Corollary 2. For simplicity, we assume that the cost function is known. We can think of Algorithm 5 as a variant of EB-SSP, applied to a finite-horizon MDP whose transition is shared across layers. Note that due to the loop-free structure of the MDP, value iteration converges in one sweep. Thus, skewing the empirical transition as in (Tarbouriech et al., 2021) is unnecessary. Then, by the analysis of EB-SSP and the fact that transition data is shared across layers, we obtain the same regret guarantee (it is not hard to see that the algorithm achieves an anytime regret bound since its parameter updates are independent of the total number of intervals).
Input: an estimate such that .
Initialize: for , , .
for do
B.6 Application to Linear Mixture MDP
Initialize: , , .
Define: .
Define: .
Define: .
for do
In this section, we provide a direct application of our finite-horizon approximation to the linear mixture MDP setting. We first introduce the problem setting of linear mixture SSP following (Min et al., 2021).
Assumption 2 (Linear Mixture SSP).
The number of states and actions are finite: . For some , there exist a known cost function , a known feature map , and an unknown vector with , such that:
-
•
for any , we have ;
-
•
for any bounded function , we have , where .
We also assume is known and . Define , with as before, and . Note that by the definitions above, . Also define the total costs accordingly for any interval. With our approximation scheme, it suffices to provide a finite-horizon algorithm. We start by stating the regret guarantee of the proposed finite-horizon algorithm (Algorithm 6).
Theorem 9.
Algorithm 6 ensures for any with probability at least .
Combining Algorithm 6 with our finite-horizon approximation, we get the following regret guarantee on linear mixture SSP.
Theorem 10.
Applying Algorithm 1 with and Algorithm 6 as to the linear mixture SSP problem ensures with probability at least .
Proof.
This directly follows from Theorem 9 and Corollary 2 with and . ∎
Note that our bound strictly improves over that of (Min et al., 2021), and it is minimax optimal when . Now we introduce the proposed finite-horizon algorithm, which is a variant of (Zhou et al., 2021a, Algorithm 2). The high-level idea is to construct Bernstein-style confidence sets for the transition function and then compute value function estimates through empirical value iteration with bonuses. We summarize the ideas in Algorithm 6. Before proving Theorem 9, we need the following key lemma regarding the confidence sets for the transition function.
Lemma 9.
With probability at least , we have for all , and .
Proof.
For the first statement, we first prove that and for . We adopt the indexing by in Section 2: for a given time step that corresponds to , that is, the -th step in the -th interval, define , , , and . We apply Lemma 33 with , , , , . Then, we have , where , , and . Moreover,
Therefore, with probability at least , for any for some , which corresponds to :
Next, we apply Lemma 33 with , , , , . Then, we have , where , , and . Moreover,
Therefore, with probability at least , for any for some , which corresponds to :
Conditioned on the event , we have for corresponding to :
Thus the second statement is proved. Now we show that . We condition on event , and apply Lemma 33 with , , , , . Then, we have . Moreover, , , and for corresponding to ,
Therefore, with probability at least , for any for some , which corresponds to :
This completes the proof. ∎
We are now ready to prove Theorem 9.
Proof of Theorem 9.
We condition on the event of Lemma 9, Lemma 10 and Lemma 11, which happens with probability at least . We decompose the regret as follows: with probability at least ,
(Lemma 10) | ||||
() | ||||
() | ||||
(Lemma 38, Cauchy-Schwarz inequality, and Lemma 9) |
The first term is of order by Lemma 11. For the third term, define and . Then,
( and ) | |||
( and Cauchy-Schwarz inequality) | |||
( and Lemma 29) |
where in (i) we apply , Lemma 30, and:
(Lemma 30) | ||||
( and Lemma 29) |
It remains to bound . Note that
(Lemma 9) | |||
(Lemma 11 and Lemma 12) |
By Lemma 28 and , we get . Putting everything together, we get:
Now by and Lemma 28, we get: . Plugging this back, we get . ∎
Lemma 10.
Conditioned on the event of Lemma 9, and .
Proof.
Note that by Lemma 9:
The first statement then follows from the definition of . For any , we prove the second statement by induction on . The base case is clearly true by the definition of . For , note that by the induction step and the first statement. Thus, . ∎
Lemma 11.
Conditioned on the event of Lemma 10, with probability at least , for any .
Proof.
Lemma 12.
for any .
B.7 An instance of SSP with
Consider an SSP with four states and two actions . At , we have and , for and some . At , we have , , and , . At , we have and , for any and some . At , we have and for . The role of here is to create the possibility that the learner will visit state at any time step. Then under our finite-horizon approximation, we have
On the other hand, when , , and can be arbitrarily large.
B.8 Omitted Details in Section 3.3
We first prove a lemma bounding and another lemma on regret decomposition w.r.t the gap functions in .
Lemma 13.
Suppose . With probability at least , for all , and , Algorithm 2 ensures:
Proof.
Note that:
(Define ) | |||
Therefore,
For , note that by for any , . Therefore, by the Cauchy-Schwarz inequality,
where the second inequality is by . Similarly, for ,
(Cauchy-Schwarz inequality) | ||||
() | ||||
( for any ) |
For , by Eq. (6), with probability at least . Thus, .
To conclude, we have for all :
This completes the proof. ∎
Lemma 14.
With probability at least , for any given .
Proof.
The next lemma provides an upper bound on the sum of gap functions satisfying some constraints. We denote by the interaction history up to in .
Lemma 15.
Suppose , are indicator functions such that for some , and define . Then with probability at least , Algorithm 2 ensures
Proof.
We are now ready to prove a bound on , which is the key to proving Theorem 4.
Lemma 16.
For any , Algorithm 2 with and for some horizon ensures with probability at least , .
Proof.
First note that for any . Thus, the expected hitting time of in is at most starting from any state and layer. Without loss of generality, we assume that is an even integer. Note that can be treated as an SSP instance where the learner teleports to the goal state at the -th step. Thus by Lemma 17 and , when , for any state , and for any :
It also implies for , since:
Define and a threshold . By Lemma 14, it suffices to bound . Note that
For the first term, define , and
Then by the definition of and for , there exist such that
(7)
Moreover, for each and , define . Then by Lemma 15, with probability at least ,
where . Solving a quadratic inequality w.r.t gives:
(8)
By a union bound, Eq. (8) holds for all simultaneously with probability at least . Therefore, the first term is bounded as follows:
(Eq. (8)) | |||
() | |||
(Eq. (7)) |
For the second term, note that:
For , define and . Then by Lemma 15, with probability at least ,
It suffices to bound . Note that by the definition of , we have . Thus, by Eq. (8),
Plugging this back and by Eq. (7), we get:
For , denote by the near-optimal policy “closest” to , such that:
Note that for all . By the extended value difference lemma (Shani et al., 2020, Lemma 1), for all by . Therefore, for all . Denote by the interaction history before interval . Then, , and
where in the last inequality we apply Lemma 17, the fact that for all , and . Now by Lemma 14 and , we have:
∎
We are now ready to prove Theorem 4.
Proof of Theorem 4.
First note that for a given , by Lemma 2 and Theorem 1, we have: with probability at least for some when running Algorithm 1 with Algorithm 2 and horizon . That is, there exist and constant such that . Now let . To obtain the regret bound in Lemma 16, it suffices to have . Plugging in the definition of and by for , it suffices to have for some constant . To conclude, we have with probability at least when running Algorithm 1 with Algorithm 2 and horizon . Moreover, with probability at least , we have . To obtain an expected regret bound, we further need to bound the cost under the low probability “bad” event. We make the following modification to Algorithm 1: whenever the counter for some , we restart Algorithm 2. Ideas above are summarized in Algorithm 7. Now consider running Algorithm 7 with Algorithm 2, horizon , failure probability , and restart threshold . By the choice of , we have . By a recursive argument, we have for . We have by Lemma 1 and Lemma 16:
where we apply
This completes the proof. ∎
Input: Algorithm for finite-horizon MDP with horizon and restart threshold .
Initialize: interval counter .
for do
B.9 Extra Lemmas for Section 3
Lemma 17.
(Rosenberg & Mansour, 2020, Lemma 6) Let be a policy with expected hitting time at most starting from any state. Then for any , with probability at least , takes no more than steps to reach the goal state.
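For intuition, the following is the standard amplification argument behind hitting-time bounds of this type (a sketch only; the constants in the cited lemma may differ). Write T for the bound on the expected hitting time:

```latex
% Markov's inequality: from any state, the probability of not reaching the goal
% within 2T steps is at most 1/2. Splitting the trajectory into m consecutive
% blocks of 2T steps and applying the Markov property at the start of each block,
\[
  \Pr\big[\text{goal not reached within } 2mT \text{ steps}\big] \;\le\; 2^{-m},
\]
% so m = \lceil \log_2(1/\delta) \rceil blocks, i.e. O(T \log(1/\delta)) steps,
% suffice to reach the goal with probability at least 1 - \delta.
```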
Lemma 18.
For an arbitrary set of intervals for some , we have:
B.10 Proof of Theorem 5
Proof.
Define , and assume . Consider a family of SSP instances parameterized by with action set . The SSP instance parameterized by consists of two states . The transition probabilities are as follows:
and the cost function is . The SSP instance above can be represented as a linear SSP of dimension as follows: define , ,
and . Note that it satisfies , , , and . Moreover, for any function , we have:
Note that when , , and
Thus, we have by , and the SSP instance satisfies Assumption 1. The regret is bounded as follows: let denote the first action taken by the learner in episode . Then for any , the expected cost of taking action as the first action is .
where we define , and is the expectation w.r.t the SSP instance parameterized by . Let denote the vector that differs from at its -th coordinate only. Then, we have , and for a fixed ,
where is the joint probability of trajectories induced by the interactions between the learner and the SSP parameterized by , and in the last inequality we apply Pinsker’s inequality to obtain:
By the divergence decomposition lemma (see e.g. (Lattimore & Szepesvári, 2020, Lemma 15.1)), we further have
()
where in the second-to-last inequality we apply when , which is true when , . Substituting these back, we get:
(9) |
Now note that for all . Define . Then for any ,
Thus, . By and Eq. (9), we get:
Selecting which maximizes , we get: . ∎
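For reference, the two standard tools invoked in the argument above are, in their usual forms (a sketch with generic distributions P and Q; see Lattimore & Szepesvári (2020) for the precise statements):

```latex
% Pinsker's inequality: for any two probability distributions P and Q,
\[
  \|P - Q\|_1 \;\le\; \sqrt{2\,\mathrm{KL}(P \,\|\, Q)}.
\]
% Divergence (chain-rule) decomposition: for the joint laws of a trajectory
% X_1, \dots, X_n under P and Q,
\[
  \mathrm{KL}(P \,\|\, Q)
  \;=\; \sum_{t=1}^{n} \mathbb{E}_{P}\Big[\mathrm{KL}\big(P(\cdot \mid X_{<t}) \,\big\|\, Q(\cdot \mid X_{<t})\big)\Big],
\]
% which, for the two-state SSP instances above, reduces to a sum of per-step KL
% divergences between the corresponding transition distributions.
```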
Appendix C Omitted Details for Section 4
Notations
Define such that , and operator such that . Define , and . By Lemma 4, for any .
For notational convenience, we divide the whole learning process into epochs indexed by , and a new epoch begins whenever is recomputed. Denote by the first time step in epoch , and for a quantity, function, or set indexed by time step , we define . Denote by the epoch that time step belongs to; we often omit the subscript when there is no confusion. Clearly, , and similarly for (ignoring the dependency on for ). With this notation in place, we define as the number of epochs started by the overestimate condition, that is, . Also define and a special covariance matrix . Note that .
Assumption
For simplicity, we assume that spans . It implies that if for all , then .
Truncating the Interaction for Technical Reasons
An important question in SSP is whether the algorithm halts in a finite number of steps. To overcome some technical issues, we first assume that the algorithm halts after steps for an arbitrary , even if the goal state is not reached. Specifically, we redefine the notation to be the minimum between the number of steps taken by the learner in episodes and , that is, if the learner does not finish episodes in steps. We also redefine under the new definition of , and the true regret now becomes . One implication of this truncation is that may not be , and . In Appendix C.4, we prove a regret bound on independent of . Thus, the proven regret bound is also an upper bound of the true regret, as it is a valid upper bound of .
C.1 Proof Sketch of Theorem 6
We focus on deriving the dominating term and ignore lower-order terms. By a straightforward calculation, we decompose the regret as follows:
We bound each of these terms as follows.
Bounding Deviation
This term is a sum of a martingale difference sequence and is of order . We show that (see Lemma 21).
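As a point of reference, the most basic form of such a deviation bound is Azuma–Hoeffding, stated generically below (the actual proof of Lemma 21 relies on the concentration inequalities of Lemmas 34 and 38):

```latex
% Azuma-Hoeffding: if X_1, \dots, X_T is a martingale difference sequence with
% |X_t| \le c almost surely, then with probability at least 1 - \delta,
\[
  \Big|\sum_{t=1}^{T} X_t\Big| \;\le\; c\sqrt{2T\ln(2/\delta)}.
\]
% Variance-aware (Freedman-type) versions, as in Lemmas 36-38, replace the
% worst-case factor c\sqrt{T} by roughly the square root of the realized variance.
```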
Bounding Estimation-Err
Here the variance-aware confidence set comes into play. By , we have . Thus, it suffices to bound . As in (Kim et al., 2021), the main idea is to bound the matrix norm of w.r.t some special matrix by a variance-aware term, and then apply the elliptical potential lemma on . For any epoch , and with , we have the following key inequality (see Lemma 24):
(10) |
One important step is thus to bound . Note that this term has a similar form to , and by a similar analysis (see Lemma 23):
(11) |
where (note that here is fixed and independent of ). Define such that . By Eq. (10):
(12) |
Solving for and by (similar to the elliptical potential lemma), we get
Plugging this back to Eq. (11) and solving a quadratic inequality, we get: (Lemma 23). Now by an analysis similar to Eq. (12) (Lemma 22):
where such that . The extra factor is from the inequality .
Bounding Switching-Cost
By considering each condition for starting a new epoch, we show that , where is the number of epochs started by triggering the overestimate condition; see Appendix C.4. We provide more intuition on including the overestimate condition in Appendix C.5. In short, it removes a factor of in the dominating term without incorporating impractical decision sets as in previous works.
Putting Everything Together
Combining the bounds above, we get . Solving a quadratic inequality w.r.t , we have . Plugging this back, we obtain .
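The “solving a quadratic inequality” step used here (and repeatedly in these appendices) is the following elementary fact, written with generic placeholders x, a, b ≥ 0:

```latex
% If x \le a\sqrt{x} + b, then viewing this as a quadratic inequality in \sqrt{x},
\[
  \sqrt{x} \;\le\; \frac{a + \sqrt{a^2 + 4b}}{2}
  \quad\Longrightarrow\quad
  x \;\le\; \frac{\big(a + \sqrt{a^2 + 4b}\big)^2}{4} \;\le\; a^2 + 2b,
\]
% using (u + v)^2 \le 2u^2 + 2v^2 in the last step. In particular, x = O(a^2 + b),
% which is how the self-bounding regret inequality above is resolved.
```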
Below we provide detailed proofs of lemmas and the main theorem.
C.2 Proof of Lemma 3
We will prove a more general statement, of which Lemma 3 is a direct corollary.
Lemma 19.
With probability at least , for any , , and , we have .
C.3 Proof of Lemma 4
Lemma (restatement of Lemma 4).
With probability at least , for any epoch and .
Proof.
For the first statement, note that for any epoch , by Lemma 20, there exists such that and . Therefore, , and by the definition of . The second statement is a direct corollary of the first statement and how is updated. ∎
Lemma 20.
For any , there exists such that , and .
Proof.
Define , and . We prove by induction that and . The base case is clearly true. Now for , assume that we have and . Then, and . Therefore, the sequence is non-decreasing and bounded, and thus converges. Since spans , the limit exists and . Moreover, by and since . This completes the proof. ∎
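To make the monotone-convergence argument above concrete, here is a minimal numerical illustration (a toy example, not the paper's operator; the instance, costs, and tolerances are all hypothetical): starting from the all-zero function, repeated application of a monotone Bellman-type operator with nonnegative costs yields a non-decreasing, bounded sequence that converges to its fixed point.

```python
import numpy as np

# Toy SSP: two non-goal states plus an absorbing, zero-cost goal state, one action
# per state. P[s, s'] is the transition probability among non-goal states; the
# remaining mass goes to the goal.
P = np.array([[0.2, 0.3],
              [0.1, 0.4]])
c = np.array([1.0, 0.5])    # immediate costs, bounded in (0, 1]

V = np.zeros(2)             # V_0 = 0
for _ in range(200):
    V_next = c + P @ V      # Bellman operator for this single-action SSP
    assert np.all(V_next >= V - 1e-12)   # monotonicity: the sequence is non-decreasing
    if np.max(np.abs(V_next - V)) < 1e-10:
        break
    V = V_next

print("fixed point (expected cost-to-go):", V)
```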
C.4 Proof of Theorem 6
Proof.
We decompose the regret as follows:
For the first term, for a fixed epoch , define for and . Note that within epoch , we have . Thus, for ,
()
Therefore, we have:
We first bound the switching costs, that is, the last two terms above. We consider three cases based on how an epoch starts: define , , and . Then,
Note that since for , by Lemma 4. For , note that by Lemma 27. Thus, by (Lemma 4). For , note that for each , by and . Thus, , by and . Therefore, with probability at least ,
(Lemma 38, , and definition of )
(Lemma 21)
( and )
(Lemma 22)
(definition of and )
By and Lemma 28 with (we also bound by in logarithmic terms), we get . Plugging this back, we obtain
This completes the proof. ∎
C.5 Intuition for Overestimate Condition
Now we provide more reasoning on why the overestimate condition is included. Similar to (Zanette et al., 2020b; Wei et al., 2021b), we incorporate global optimism at the starting state of each epoch by solving an optimization problem. This differs from many previous works (Jin et al., 2020b; Vial et al., 2021) that add bonus terms to ensure local optimism over all states. The advantage of global optimism is that it avoids using a larger function class of for the bonus terms, which reduces the order of in the regret bound. However, this improvement also requires that is of order . In (Zanette et al., 2020b), this constraint is enforced directly, which is not practical under a large state space, as we may need to iterate over all state-action pairs to check it.
Here we take a new approach: we first enforce a bound on by direct truncation. However, the upper-bound truncation on may break the analysis. To resolve this, we start a new epoch whenever is overestimated by a large amount. By the objective of the optimization problem, will not be overestimated in the new epoch. Hence, the upper-bound truncation will not be triggered. Moreover, the overestimate of cancels out the switching cost in this case, as in the previous discussion.
The disadvantage of the overestimate condition is that we may update the policy at every time step in the worst case. If we remove this condition, then by the norm constraint on , which brings back an extra factor. However, we only recompute the policy for times in this case.
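The following is a minimal sketch, under many simplifying assumptions, of the restart mechanism described above; all names (`stream`, `solve_global_optimization`, the threshold, the cap) are hypothetical placeholders rather than the paper's algorithm. It only illustrates the control flow: truncate the value estimate at an upper bound, and start a new epoch (re-solve the global optimization) whenever the cached estimate overestimates a freshly computed target by too much.

```python
import numpy as np

def run_epoching(stream, solve_global_optimization, threshold=1.0, cap=10.0):
    """Illustrative only. `stream` yields (phi, v_target) pairs, where `phi` is a
    feature vector and `v_target` a freshly computed reference value; both, like
    `solve_global_optimization`, are hypothetical placeholders."""
    w = solve_global_optimization()              # parameters for the first epoch
    overestimate_restarts = 0
    for phi, v_target in stream:
        v_est = min(float(np.dot(phi, w)), cap)  # truncation enforces the upper bound
        if v_est - v_target > threshold:         # the "overestimate" condition
            w = solve_global_optimization()      # start a new epoch: the re-optimized
            overestimate_restarts += 1           # estimate is no longer overestimated
    return overestimate_restarts
```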
C.6 Extra Lemmas for Section 4
Lemma 21.
With probability at least , .
Proof.
Note that when , . Otherwise, for any and . Thus, . Then with probability at least ,
where in (i) we apply Lemma 38, , , , and we bound the term as follows: we consider four cases based on how epoch ends:
1. , then .
2. ; this happens times and the sum of these terms is of order .
3.
4. is the last epoch. This happens only once and the term is bounded by .
In (ii), we apply Lemma 34, , definition of , and by Lemma 4. Solving a quadratic inequality w.r.t , we have:
This completes the proof. ∎
Lemma 22.
With probability at least , .
Proof.
Define , , and such that . Also define . Note that when , . Then, for any :
(Lemma 23)
where in (i) we define such that and apply
( is -Lipschitz)
()
()
and in (ii) we apply Lemma 24 and:
Here, (i) is by by the definition of . Reorganizing terms by , we have for :
Finally, note that:
The first term is bounded by
(Lemma 29)
For the second term:
where in (i) we apply Lemma 29. Putting everything together, we get:
This completes the proof. ∎
Lemma 23.
With probability at least , .
Proof.
Note that when , . Otherwise, for any and . Therefore, . Then with probability at least ,
In (i) we apply Lemma 25, , , , and . In (ii) we apply Lemma 34. For , define . Then by and the definition of , we have . Now it suffices to bound . Define and for , define such that . Note that when , . Also define . Then, for any , with probability at least :
where in the last inequality we apply Lemma 24 and:
Here, (i) is by by the definition of . Reorganizing terms by , we have:
where in (i) we apply
Putting everything together and by , we have:
Solving a quadratic inequality w.r.t , we have . ∎
Lemma 24.
With probability at least , for any epoch , , and with ,
Proof.
Lemma 25.
With probability at least , for any epoch , and .
Proof.
For any , , and , define and as the conditional expectation given the interaction history . Note that and . Then by Lemma 37 with , with probability at least , where , we have:
Reorganizing terms and by a union bound, we have with probability at least , for any , , and :
(16) |
Moreover, for any , , and , by Lemma 38, with probability at least :
(17) |
Then again by a union bound, the equation above holds with probability at least for any , , and .
For the next lemma, we define the following auxiliary function:
Note that is convex and .
Lemma 26.
For , .
Proof.
Let . When , we have: . When (arguments are similar for ), we have , and
∎
Lemma 27.
Fix . Let . If there exists such that for each , there exists for some such that
(19) |
Then, .
Proof.
Note that when Eq. (19) holds:
(20) |
Thus, it suffices to bound the number of times Eq. (20) holds. Define . Clearly is convex since is convex, and for . Define:
For each , there exists such that . Define . Note that , and is a symmetric convex set since is a convex function and . By Lemma 26, we have . Therefore, , which means that in the direction of , the intercept of is at most times that of . By Lemma 35, we have: . Note that when , we have . Therefore, when , we have . Hence, for . Since is decreasing in , we have
This completes the proof. ∎
Appendix D Auxiliary Lemmas
Lemma 28.
If for some and absolute constant , then .
Proof.
First note that implies by for , which gives . Plugging this back, we get . Therefore, implies . Next, note that implies by for , which gives . Plugging this back, we get , which gives . Therefore, implies . Thus, implies and , which implies . Taking the contrapositive, the statement is proved. ∎
Lemma 29.
(Abbasi-Yadkori et al., 2011, Lemma 11) Let be a sequence in , a positive definite matrix, and define . Then, for any .
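Since the elliptical potential lemma is used throughout Appendix C, here is a quick numerical sanity check of its standard form (Abbasi-Yadkori et al., 2011, Lemma 11): with Lambda_0 = lambda * I and Lambda_t = Lambda_{t-1} + x_t x_t^T, the sum over t of min(1, x_t^T Lambda_{t-1}^{-1} x_t) is at most 2 log(det Lambda_T / det Lambda_0). The dimensions and data below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam, L = 5, 2000, 1.0, 1.0

Lam = lam * np.eye(d)           # Lambda_0 = lam * I
potential = 0.0
for _ in range(T):
    x = rng.normal(size=d)
    x *= L / max(np.linalg.norm(x), 1e-12)        # enforce ||x_t|| <= L
    potential += min(1.0, x @ np.linalg.solve(Lam, x))  # ||x_t||^2 in Lambda_{t-1}^{-1} norm
    Lam += np.outer(x, x)                          # Lambda_t = Lambda_{t-1} + x_t x_t^T

lhs = potential
rhs = 2 * np.linalg.slogdet(Lam)[1] - 2 * d * np.log(lam)  # 2 log(det Lambda_T / det Lambda_0)
print(f"sum of potentials = {lhs:.2f} <= 2 log(det ratio) = {rhs:.2f}")
assert lhs <= rhs + 1e-8
```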
Lemma 30.
(Abbasi-Yadkori et al., 2011, Lemma 12) Let , be positive semi-definite matrices such that . Then, we have .
Lemma 31.
(Wei et al., 2021b, Lemma 11) Let be a martingale sequence on state space w.r.t a filtration , be a sequence of random vectors in so that and , , and be a set of functions defined on with as its -covering number w.r.t the distance for some . Then for any , we have with probability at least , for all and so that :
Lemma 32.
(Wei et al., 2021b, Lemma 12) Let be a class of mappings from to parameterized by . Suppose that for any (parameterized by ) and (parameterized by ), the following holds:
Then, , where is the -covering number of with respect to the distance .
Lemma 33.
(Zhou et al., 2021a, Theorem 4.1) Let be a filtration, a stochastic process so that and . Moreover, define and we have:
Then with probability at least , we have for any :
where , and
Lemma 34.
Lemma 35.
(Zhang et al., 2021, Lemma 16) Let be a bounded symmetric convex subset of with . Suppose , that is, is on the boundary of , and is another bounded symmetric convex set such that and . Then , where is the volume of the set .
Lemma 36.
(Zhang et al., 2021, Theorem 4) Let be a martingale difference sequence and almost surely. Then for , we have with probability at least ,
Lemma 37.
(Jin et al., 2020a, Lemma 9) Let be a martingale difference sequence adapted to the filtration , and almost surely for some . Then, for any , with probability at least :
Lemma 38.
Let be a martingale difference sequence adapted to the filtration and for some . Then with probability at least , for all simultaneously,