When and why randomised exploration works (in linear bandits)

Marc Abeille (m.abeille@criteo.com), Criteo AI Lab, Paris, France
David Janz (david.janz93@gmail.com), University of Oxford, UK
Ciara Pike-Burke (c.pike-burke@imperial.ac.uk), Imperial College London, UK
Abstract
We provide an approach for the analysis of randomised exploration algorithms like Thompson sampling that does not rely on forced optimism or posterior inflation. With this, we demonstrate that in the $d$-dimensional linear bandit setting, when the action space is smooth and strongly convex, randomised exploration algorithms enjoy an $n$-step regret bound of order $d\sqrt{n}$, up to logarithmic factors. Notably, this shows for the first time that there exist non-trivial linear bandit settings where Thompson sampling can achieve optimal dimension dependence in the regret.
keywords: linear bandits, randomised exploration, Thompson sampling
1 Introduction
To achieve low regret in sequential decision-making problems, it is necessary to balance exploration (selecting uncertain actions) and exploitation (selecting previously successful actions). One method of balancing this exploration-vs-exploitation trade-off that is particularly well-understood is through optimism: optimistic algorithms maintain a set of statistically plausible models of the environment and select actions that maximize the reward in the best plausible model—note however that this entails solving a bi-level optimization problem in each round. Randomised exploration is an alternative approach where algorithms select a model of the problem randomly from a set of plausible models and act optimally with respect to that randomly sampled model—bypassing the need to solve the bi-level optimization problem associated with optimism. Notable examples of randomised decision-making algorithms include Bayesian algorithms such as posterior sampling [25, also known as Thompson sampling], ensemble sampling [18, 11] and perturbed history exploration [15, 12]. However, while randomisation-based algorithms are often preferred in practice, our theoretical understanding of when and why randomised exploration works in structured sequential decision-making problems is limited.
In this paper, we analyse randomised sequential decision-making algorithms in the classic linear bandit problem—but the techniques that we introduce should carry over to other structured settings. In this setting, previous frequentist analyses [4, 2, 15, 12, e.g.] are not sufficient to explain the practical effectiveness of randomised exploration, nor do they identify a mechanism through which randomised exploration works. Indeed, existing proofs rely on modifying randomised exploration algorithms so that they can be analysed using the optimism framework. These modifications often lead to suboptimal regret. Our analysis does away with such modifications; it holds under the assumption that the action space is smooth and strongly convex (see Section 4.2 for formal definitions), which allows for perturbations in the model parameter space to translate to perturbations in the action space, while also guaranteeing that small changes in the action space only lead to small changes in the incurred regret.
For such smooth, strongly convex action sets, which include the $\ell_p$-balls for $p \in (1, \infty)$ (see Example 1), we establish a regret bound of order $d\sqrt{n}$, up to logarithmic factors, where $d$ is the dimension of the action space and $n$ is the number of rounds. Notably, this shows for the first time that (unmodified) linear Thompson sampling can enjoy regret with the optimal dependence on the dimension in a structured linear bandit setting, thus partially resolving an important open question [24].
2 Related work
Lower bounds for the linear bandit problem depend on the structure of the specific action space [7, 21, 16, for example]. Theorem 2.1 of [21] shows that there exists a problem instance where the action space is the $d$-dimensional unit sphere in which any policy must incur $\Omega(d\sqrt{n})$ regret. Optimistic algorithms have frequentist regret nearly matching this lower bound for linear bandits [5, 7, 1]. Specifically, [1] show that by constructing confidence sets using self-normalised bounds for vector-valued martingales, and taking actions optimistically within these, the resulting regret is of order $d\sqrt{n}$, up to logarithmic factors, with probability at least $1-\delta$. Despite the strong theoretical performance of optimistic algorithms, randomised algorithms, such as Thompson sampling, have been shown to perform better in practice [6, 19]. In the simpler multi-armed bandit setting, randomised algorithms achieve optimal regret [3, 13, 14, 9]. Under Bayesian assumptions, where regret is defined by taking an expectation over the unknown parameter, [23, 22] show that Thompson sampling is near-optimal in many structured and unstructured settings. In particular, for the linear bandit setting, they show a Bayesian regret bound of order $d\sqrt{n}$, up to logarithmic factors [23].
In this paper, our focus is on the frequentist regret of randomised exploration algorithms in linear bandits. While this setting has been studied extensively [4, 2, 15, 12, amongst others], previous approaches rely on modifying the algorithm to force it to be more optimistic. The main line of analysis, by [4, 2, 28], inflates the variance of the posterior over models in round $t$ by a factor of order $\sqrt{d}$ to show that the algorithm is optimistic with constant probability—this leads to regret of order $d^{3/2}\sqrt{n}$, where the increased dependence on $d$ is due to the inflation of the posterior. Further variants of randomised exploration algorithms include modifying the algorithms to only sample parameters with reward greater than the mean [19, 27] and modifying the likelihood used in the Bayesian update of Thompson sampling to force the algorithm to be more optimistic [29, 10]. The analysis of Thompson sampling in other structured settings, such as generalised linear bandits, relies on these same modifications [15, 12].
We remark that the results presented in this paper do not contradict the lower bounds by [8, 29] where examples were provided for which linear Thompson sampling incurs linear regret if the posterior distribution is not inflated. The action spaces constructed in those examples fail to satisfy our assumptions.
3 Problem setting, notation and basic definitions
We study the linear bandit problem, where each bandit instance is parameterised by an unknown $\theta^\star \in \mathcal{B}^d$ ($d$ known) and an action set $\mathcal{A}$, a closed subset of $\mathcal{B}^d$ (the closed unit $\ell_2$-ball in $\mathbb{R}^d$). Then, at each time-step $t = 1, 2, \dots$, an agent selects an action $A_t \in \mathcal{A}$, allowed to depend on observations from previous time-steps, and receives a real-valued reward $Y_t$. We assume that the reward is $\sigma$-subgaussian given $A_t$ and the past ($\sigma$ known), with mean given by $\langle A_t, \theta^\star \rangle$. The goal of the agent is to select actions to minimize the $n$-step regret ($n \ge 1$), defined by
$$R_n = \sum_{t=1}^{n} \langle A^\star - A_t, \theta^\star \rangle,$$
where $A^\star \in \arg\max_{a \in \mathcal{A}} \langle a, \theta^\star \rangle$ is any optimal arm and the horizon $n$ need not be known.
Confidence set construction
The algorithms and analysis in this work are based on the standard regularised least-squares confidence ellipsoids for $\theta^\star$ [1]. To construct these, fix a regularisation parameter $\lambda > 0$ and a confidence parameter $\delta \in (0,1)$. Define the regularised design matrices and least-squares estimates as
$$V_t = \lambda I + \sum_{s < t} A_s A_s^\top \qquad\text{and}\qquad \hat\theta_t = V_t^{-1} \sum_{s < t} A_s Y_s.$$
Also, define the sequence of nondecreasing, nonnegative confidence widths
$$\beta_t = \sigma \sqrt{2 \log(1/\delta) + d \log\!\left(1 + \frac{t}{\lambda d}\right)} + \sqrt{\lambda}.$$
Then, [1] show that, with probability $1-\delta$, $\theta^\star$ lies in every one of the ellipsoids given by
$$\Theta_t = \left\{ \theta \in \mathbb{R}^d : \|\theta - \hat\theta_t\|_{V_t} \le \beta_t \right\},$$
where, for $x \in \mathbb{R}^d$ and a positive-definite matrix $V$, we denote by $\|x\|_V$ the $V$-weighted Euclidean norm of $x$, given by $\|x\|_V = \sqrt{x^\top V x}$.
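To make the construction concrete, the following is a minimal numerical sketch of the regularised least-squares quantities and the ellipsoid membership test. The closed form used for the width follows the standard self-normalised bound of [1] under the simplifying assumption $\|\theta^\star\|_2 \le 1$; the exact constants in the paper may differ.

```python
import numpy as np

def lstsq_confidence(actions, rewards, lam=1.0, delta=0.05, sigma=1.0):
    """Regularised design matrix V_t, estimate theta_hat_t and width beta_t.

    actions: array of shape (t, d) holding A_1, ..., A_t
    rewards: array of shape (t,) holding Y_1, ..., Y_t
    """
    t, d = actions.shape
    V = lam * np.eye(d) + actions.T @ actions             # V_t = lam*I + sum_s A_s A_s^T
    theta_hat = np.linalg.solve(V, actions.T @ rewards)   # V_t^{-1} sum_s A_s Y_s
    # Standard self-normalised width (assuming ||theta*||_2 <= 1); see [1].
    beta = sigma * np.sqrt(2 * np.log(1 / delta) + d * np.log(1 + t / (lam * d))) + np.sqrt(lam)
    return V, theta_hat, beta

def in_ellipsoid(theta, theta_hat, V, beta):
    """Check whether theta lies in {theta : ||theta - theta_hat||_V <= beta}."""
    diff = theta - theta_hat
    return float(np.sqrt(diff @ V @ diff)) <= beta
```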
Optimistic algorithms
Optimistic algorithms select actions by solving the bi-level optimisation problem $A_t \in \arg\max_{a \in \mathcal{A}} \max_{\theta \in \Theta_t} \langle a, \theta \rangle$ in each round $t$. We instead consider randomised algorithms, which randomise over $\Theta_t$. These methods are formally defined in Section 4.1.
Bregman divergence
Our analysis will make use of a generalised Bregman divergence, defined for a convex function $f : \mathbb{R}^d \to \mathbb{R}$ as
$$D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle$$
for all $x$ and almost every $y$, where $\nabla f$ denotes the gradient of $f$. We recall that convex functions are almost everywhere differentiable [20].
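For illustration, a small sketch of the Bregman divergence computed from a gradient oracle; the quadratic example, where the divergence reduces to the squared Euclidean distance, is included purely as a sanity check.

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """Generalised Bregman divergence D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - grad_f(y) @ (x - y)

# Example: f(x) = ||x||_2^2 gives D_f(x, y) = ||x - y||_2^2.
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(bregman(f, grad_f, x, y), np.sum((x - y) ** 2))
```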
Probabilistic formalism
Let $(\mathcal{F}_t)_{t \ge 0}$ be a filtration where $\mathcal{F}_0$ is the trivial $\sigma$-algebra and $\mathcal{F}_t$ is the $\sigma$-algebra generated by $\mathcal{F}_{t-1}$, the observations $(A_t, Y_t)$ and any additional random variables the algorithm uses in selecting $A_{t+1}$. Note that this means that $A_{t+1}$ is $\mathcal{F}_t$-measurable. We will write $\mathbb{P}_t$ for the $\mathcal{F}_{t-1}$-conditional probability measure and $\mathbb{E}_t$ for the corresponding expectation. With this, we formalise the assumption that for all $t$, the reward noise is conditionally $\sigma$-subgaussian as
$$\mathbb{E}_t\!\left[\exp\big(s (Y_t - \langle A_t, \theta^\star\rangle)\big)\right] \le \exp(s^2 \sigma^2 / 2) \quad \text{for all } s \in \mathbb{R}. \qquad (1)$$
Asymptotic notation
We will write $a \lesssim b$ if $a \le C b$ for some universal constant $C > 0$, and use $\gtrsim$ for the converse.
Vectors, norms, balls & spheres
We will write $\|\cdot\|_p$ to denote the $\ell_p$-norm. We recall that for a positive-definite matrix $V$ and a vector $x$ of compatible dimension, $\|x\|_V$ denotes the $V$-weighted norm. We write $\mathcal{B}^d$ for the closed unit Euclidean ball in $\mathbb{R}^d$, and $\mathcal{S}^{d-1}$ for its surface, the $(d-1)$-sphere.
4 A frequentist regret bound for randomised algorithms in linear bandits
In this section, we state our main result, which provides conditions under which randomised exploration algorithms can achieve frequentist regret of order $d\sqrt{n}$, up to logarithmic factors, in the linear bandit setting. We begin by describing the algorithmic framework and the assumptions on the action set under which it holds.
4.1 Randomised algorithms: definition and assumptions
We consider algorithms that at each time-step $t$ sample a parameter of the form
$$\tilde\theta_t = \hat\theta_t + \beta_t V_t^{-1/2} \eta_t,$$
where $(\eta_t)_{t \ge 1}$ is a sequence of independent random variables (perturbations), and select the action
$$A_t \in \arg\max_{a \in \mathcal{A}} \langle a, \tilde\theta_t \rangle.$$
Our result will require the following assumptions to hold for the perturbations $(\eta_t)_{t \ge 1}$.
Assumption 1.
The perturbations $(\eta_t)_{t \ge 1}$ are independent rotationally-invariant random variables for which there exists a constant $c \ge 1$ such that, for every unit vector $u \in \mathcal{S}^{d-1}$ and all $t$, $\mathbb{E}[\langle u, \eta_t \rangle^2] = 1$ and $\mathbb{E}[\langle u, \eta_t \rangle^4] \le c$.
These assumptions hold for many common distributions, such as the standard Gaussian distribution and appropriately scaled uniform distributions on the sphere (each with a small numerical value of the constant).
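To see how the pieces fit together, here is a minimal simulation sketch of such a randomised algorithm on the Euclidean unit ball, for which the arg-max action has the closed form $\tilde\theta_t/\|\tilde\theta_t\|_2$; Gaussian perturbations are one admissible choice under Assumption 1, and all numerical values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam, sigma, delta = 5, 2000, 1.0, 0.1, 0.05
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)

V = lam * np.eye(d)
b = np.zeros(d)          # running sum of A_s * Y_s
regret = 0.0
for t in range(1, n + 1):
    theta_hat = np.linalg.solve(V, b)
    beta = sigma * np.sqrt(2 * np.log(1 / delta) + d * np.log(1 + t / (lam * d))) + np.sqrt(lam)
    eta = rng.normal(size=d)                            # rotationally invariant perturbation
    V_inv_sqrt = np.linalg.cholesky(np.linalg.inv(V))   # a matrix square root of V^{-1}
    theta_tilde = theta_hat + beta * V_inv_sqrt @ eta   # sampled parameter
    a = theta_tilde / np.linalg.norm(theta_tilde)       # arg max over the unit ball
    y = a @ theta_star + sigma * rng.normal()           # subgaussian reward
    V += np.outer(a, a); b += a * y
    regret += 1.0 - a @ theta_star                      # <A*, theta*> = 1 on the unit ball
print(f"cumulative regret after {n} rounds: {regret:.2f}")
```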
4.2 Action set assumptions: smoothness and strong convexity
A core part of our contribution is in identifying the properties of action sets that allow randomised exploration to succeed. Our assumptions will be expressed in terms of the support function of $\mathcal{A}$,
$$\sigma_{\mathcal{A}}(\theta) = \max_{a \in \mathcal{A}} \langle a, \theta \rangle.$$
Crucially, for randomised algorithms where, for each $t$, the $\mathcal{F}_{t-1}$-conditional law of $\tilde\theta_t$ is diffuse (implied by rotational invariance), we have that
$$A_t = \nabla \sigma_{\mathcal{A}}(\tilde\theta_t) \quad \text{almost surely.}$$
Our upcoming assumptions ensure that $\sigma_{\mathcal{A}}$ is a suitably regular function. Note that the above relation means the per-step regret of randomised algorithms is given by the divergence
$$\langle A^\star - A_t, \theta^\star \rangle = D_{\sigma_{\mathcal{A}}}(\theta^\star, \tilde\theta_t);$$
again, almost surely with respect to the $\mathcal{F}_{t-1}$-conditional law of $\tilde\theta_t$.
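As a quick numerical sanity check of this identity in the unit-ball case, where $\sigma_{\mathcal{A}}(\theta) = \|\theta\|_2$ and $\nabla\sigma_{\mathcal{A}}(\theta) = \theta/\|\theta\|_2$, the per-step regret $\langle A^\star - A_t, \theta^\star \rangle$ indeed coincides with the divergence $D_{\sigma_{\mathcal{A}}}(\theta^\star, \tilde\theta_t)$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_A = lambda th: np.linalg.norm(th)          # support function of the unit ball
grad_sigma_A = lambda th: th / np.linalg.norm(th)

theta_star, theta_tilde = rng.normal(size=4), rng.normal(size=4)
A_star, A_t = grad_sigma_A(theta_star), grad_sigma_A(theta_tilde)

per_step_regret = (A_star - A_t) @ theta_star
bregman_div = (sigma_A(theta_star) - sigma_A(theta_tilde)
               - grad_sigma_A(theta_tilde) @ (theta_star - theta_tilde))
assert np.isclose(per_step_regret, bregman_div)
```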
Our assumptions will be based on the following three definitions:
Definition 1 (Absorbing set).
We call a set absorbing if it is a neighbourhood of the origin.
Definition 2 (Strong convexity).
We say a convex function $f$ is $\mu$-strongly convex with respect to a norm $\|\cdot\|$ if $D_f(x, y) \ge \tfrac{\mu}{2}\|x - y\|^2$ for all $x$ and almost every $y$.
Definition 3 (Smoothness).
We say that a convex function $f$ is $L$-smooth with respect to a norm $\|\cdot\|$ if $D_f(x, y) \le \tfrac{L}{2}\|x - y\|^2$ for all $x$ and almost every $y$.
With these definitions in place, the conditions we will ask for on the arm set are captured thus.
Assumption 2.
The action set $\mathcal{A}$ is a closed absorbing subset of $\mathcal{B}^d$, and there exist a norm $\|\cdot\|$ and constants $\mu, L > 0$ such that $\sigma_{\mathcal{A}}^2$ is $\mu$-strongly convex and $L$-smooth with respect to $\|\cdot\|$.
The motivation for asking for strong convexity and smoothness of the square $\sigma_{\mathcal{A}}^2$, rather than directly of $\sigma_{\mathcal{A}}$, is that the Bregman divergence of the square satisfies
$$D_{\sigma_{\mathcal{A}}^2}(\theta, \theta') = 2\sigma_{\mathcal{A}}(\theta')\, D_{\sigma_{\mathcal{A}}}(\theta, \theta') + \big(\sigma_{\mathcal{A}}(\theta) - \sigma_{\mathcal{A}}(\theta')\big)^2, \qquad (2)$$
a quantity that does not degenerate as $\theta' \to 0$, whereas the curvature of $\sigma_{\mathcal{A}}$ itself does. That $\mathcal{A}$ is absorbing ensures that the multiplier $2\sigma_{\mathcal{A}}(\theta')$ in the above is positive, which will come in useful in our proofs—we do not believe this assumption to be essential, but we have thus far been unable to eliminate it.
Remark 1.
Definition 3 generalises the notion of strong convexity used in [21], where it was defined directly as a property of the arm set. Our definition will be vital to getting the right rate for randomised algorithms outside the $\ell_2$-ball case, and specifically to avoiding an extra, potentially large, multiplicative factor in the regret. We note also that their definition is for the strong convexity of the arm set, whereas our definition is for the smoothness of $\sigma_{\mathcal{A}}^2$. There is a duality between (the indicator function of) the set and the corresponding support function, which explains the inversion in the nomenclature.
Remark 2.
If $\mathcal{A}$ is absorbing and balanced (symmetric about the origin), $\sigma_{\mathcal{A}}$ is a norm; if it is just absorbing, the symmetrised map $\theta \mapsto \max\{\sigma_{\mathcal{A}}(\theta), \sigma_{\mathcal{A}}(-\theta)\}$ is a norm. In these cases, it may be productive to try taking $\|\cdot\|$ to be $\sigma_{\mathcal{A}}$ (or its symmetrised version), as in our examples. Of course, the norm and the constants $\mu, L$ do not need to be known to run the algorithm, and the regret implicitly scales with the best constants over all norms $\|\cdot\|$.
Examples of action sets that satisfy Assumption 2 are the $\ell_p$-balls with $p \in (1, \infty)$:
Example 1.
Let $p, q \in (1, \infty)$ be conjugate indices ($1/p + 1/q = 1$), and let $\mathcal{A} = \mathcal{B}_p^d$, the unit $\ell_p$-ball. Then, Assumption 2 holds, with the choice of norm and of the constants $\mu$ and $L$ depending on whether $p \le 2$ or $p \ge 2$.
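The duality behind Example 1—the support function of the unit $\ell_p$-ball is the conjugate $\ell_q$-norm—can be checked numerically; the maximiser construction below is the standard one from Hölder's inequality, and the sketch makes no claim about the constants $\mu$ and $L$ appearing in the example.

```python
import numpy as np

def lp_ball_argmax(theta, p):
    """Maximiser of <a, theta> over the unit l_p ball (1 < p < inf), via Holder duality."""
    q = p / (p - 1.0)                               # conjugate index, 1/p + 1/q = 1
    a = np.sign(theta) * np.abs(theta) ** (q - 1)
    return a / np.linalg.norm(a, ord=p)             # normalise onto the l_p sphere

rng = np.random.default_rng(2)
theta = rng.normal(size=6)
for p in (1.5, 2.0, 3.0):
    q = p / (p - 1.0)
    a = lp_ball_argmax(theta, p)
    # the support function value equals the conjugate norm ||theta||_q
    assert np.isclose(a @ theta, np.linalg.norm(theta, ord=q))
```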
Assumption 2 is unaffected by linear transformations, extending the above examples to ellipsoids:
Example 2.
Let $\mathcal{A}$ be any arm set satisfying Assumption 2 for some norm $\|\cdot\|$ and constants $\mu$ and $L$. Then, for any invertible $M \in \mathbb{R}^{d \times d}$ with $M\mathcal{A} \subseteq \mathcal{B}^d$, the arm set $M\mathcal{A}$ satisfies Assumption 2 for the norm $\|M^\top \cdot\|$, with the same $\mu$ and $L$.
4.3 Main result and discussion
We are now ready to state our main result, which shows that any randomised algorithm satisfying Assumption 1, run on an action set satisfying Assumption 2, incurs at most order $d\sqrt{n}$ regret (up to logarithmic factors) in the linear bandit problem. This matches the lower bound of [21] up to logarithmic factors (which is based on $\mathcal{B}^d$, a set that satisfies our assumptions).
Theorem 4.
Fix a confidence level $\delta \in (0,1)$ and a regularisation parameter $\lambda > 0$. Suppose that a learner uses a randomised algorithm with perturbations satisfying Assumption 1 on a linear bandit instance with an arm-set that satisfies Assumption 2. Then, with probability $1-\delta$, for all $n \ge 1$, the $n$-step regret incurred by the learner is bounded as $R_n \lesssim d\sqrt{n}$, up to logarithmic factors in $n$ and $1/\delta$, and up to constants depending on $\sigma$, $\lambda$, the constant of Assumption 1 and the constants $\mu$ and $L$ of Assumption 2.
The proof of this result is presented in Section 5, with much of the details deferred to the appendices. We now discuss some aspects of our result, its proof and its relation to previous works.
On the regret of Thompson sampling
If the noise in the responses is Gaussian with a known variance $\sigma^2$, and if for all $t$ the perturbations are given by $\eta_t \sim \mathcal{N}(0, I)$, then our randomised exploration algorithm is equivalent to the linear Thompson sampling algorithms of [23, 3, 2]. Thus, for action spaces satisfying Assumption 2, Theorem 4 shows that Thompson sampling can enjoy regret of order $d\sqrt{n}$ up to logarithmic factors, leaving at most a logarithmic gap between this frequentist regret and the corresponding Bayesian regret [see 23, 22].
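The equivalence referred to here is a reparameterisation: drawing $\tilde\theta_t$ directly from $\mathcal{N}(\hat\theta_t, \beta_t^2 V_t^{-1})$ has the same law as applying the perturbation rule with $\eta_t \sim \mathcal{N}(0, I)$. A minimal sketch, with illustrative values, comparing the two samplers empirically:

```python
import numpy as np

rng = np.random.default_rng(3)
d, beta = 3, 2.0
V = np.array([[2.0, 0.3, 0.0], [0.3, 1.5, 0.2], [0.0, 0.2, 1.0]])
theta_hat = rng.normal(size=d)

V_inv = np.linalg.inv(V)
V_inv_sqrt = np.linalg.cholesky(V_inv)

# Perturbation form: theta_hat + beta * V^{-1/2} eta with eta ~ N(0, I) ...
samples_pert = theta_hat + beta * (rng.normal(size=(100_000, d)) @ V_inv_sqrt.T)
# ... has the same law as a draw from N(theta_hat, beta^2 V^{-1}).
samples_gauss = rng.multivariate_normal(theta_hat, beta ** 2 * V_inv, size=100_000)

# Compare empirical covariances (both should be close to beta^2 * V^{-1}).
print(np.round(np.cov(samples_pert.T), 2))
print(np.round(np.cov(samples_gauss.T), 2))
```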
On the lower bound for randomised algorithms
We remark that Theorem 4 holds for any randomised algorithm without any modification; in particular, there is no need to inflate any variance proxies. This is in contrast to lower bounds by [8, 29], which show that there exist problem instances on which linear Thompson sampling suffers linear regret. These instances are specifically designed so that there is a bad ‘trap’ arm: pulling that arm yields regret but no information, so that Thompson sampling gets stuck. Such action sets are the polar opposite of what Assumption 2 asks for: they are neither absorbing, strongly convex, nor smooth.
Limitation of optimism-based proofs
Existing proofs of frequentist regret bounds for randomised algorithms in linear bandits, including those of [4, 2, 15, 12], leverage that, with high probability, the per-step regret satisfies
$$\langle A^\star - A_t, \theta^\star \rangle \lesssim \beta_t \, \|A_t\|_{V_t^{-1}},$$
and then show that the sum of these terms can be suitably controlled when randomised sampling guarantees sufficient optimism—that is, when the algorithm is optimistic with a fixed probability. Unfortunately, as illustrated in [2], guaranteeing optimism with a fixed probability requires inflating the variance of the sampling distributions, and this results in an extra $\sqrt{d}$ factor in the regret bound. Moreover, these proofs implicitly suggest that non-optimistic samples do not help in controlling the upper bound on the per-step regret.
This approach is overly conservative in two ways: first, while a particular sample may provide very little information—measured through the design matrix update $V_t \mapsto V_t + A_t A_t^\top$—the sample may still provide useful information on average, that is, when considering $\mathbb{E}_t[A_t A_t^\top]$. Second, while the information acquired at a time-step might not significantly reduce the per-step regret bound for the step immediately following it, it may prove useful at later steps. Figure 1 illustrates how non-optimistic samples provide useful information that is ignored by the optimistic proof approaches.
Our proof techniques
The key challenge in developing a non-optimistic proof for randomised algorithms in linear bandits is to directly analyse the dynamics of the exploration, that is, of the process $(V_t)_{t \ge 1}$, and to relate this to the upper bound on the per-step regret. Interestingly, such an approach is closer to the analysis of Thompson sampling in the $K$-armed bandit setting, for which it is shown to be optimal [13, 3]. Within the proof of our regret bound, Theorem 4, we address the above points by:
- (i) providing a new bound on the expected per-step regret, by leveraging strong convexity and smoothness;
- (ii) characterising the minimum amount of information acquired during interaction through a lower bound on the design matrices $V_t$, where the sum of conditional expected increments $\sum_{s \le t} \mathbb{E}_s[A_s A_s^\top]$ acts as a proxy for $V_t$;

and connecting (i) and (ii) by studying the properties of the average per-step information $\mathbb{E}_t[A_t A_t^\top]$.
Comparison with forced exploration
[21] propose a phased explore-then-commit algorithm that interleaves rounds of playing linearly independent actions with increasingly long exploitation phases, in which the estimated best action is selected. They prove a regret bound of order $d\sqrt{n}$ for their approach, which notably behaves poorly as $\|\theta^\star\| \to 0$. This behaviour arises because their exploration is isotropic—equal in all directions—and not directed by an estimate of $\theta^\star$. In contrast, randomised exploration algorithms account for structure by (i) taking $A_t = \nabla\sigma_{\mathcal{A}}(\tilde\theta_t)$ (almost surely), which accounts for the geometry of the action set, and (ii) sampling $\tilde\theta_t$ from a distribution concentrated on a scaled version of the confidence set, which accounts for the current estimate of $\theta^\star$. One might interpret randomised algorithms as blending together the exploration and exploitation stages with a more careful balance between the two.
5 Proof of main result
We now prove our main result, Theorem 4. Here and in the appendices, we will write $\sigma$ in place of $\sigma_{\mathcal{A}}$, and we will work throughout on the probability $1-\delta$ event on which $\theta^\star \in \Theta_t$ for all $t \ge 1$.
We start by moving from the realised regret to its conditional expectation. This can be done by noting that $\langle A^\star - A_t, \theta^\star \rangle - \mathbb{E}_t \langle A^\star - A_t, \theta^\star \rangle$, $t \ge 1$, is a bounded martingale difference sequence, and applying a standard concentration inequality (included here as Lemma 8, Appendix A). From this, conclude that with probability $1-\delta$, for all $n \ge 1$, the regret $R_n$ exceeds $\sum_{t=1}^n \mathbb{E}_t \langle A^\star - A_t, \theta^\star \rangle$ by at most an additive term of order $\sqrt{n \log(n/\delta)}$.
We now outline the three main results we use in bounding this conditional expectation, and then show how they come together.
5.1 Regret decomposition & upper bound
Denote by $p_t$ the conditional probability of optimism at time-step $t$. For a threshold $\rho > 0$ (we take $\rho$ to be an explicit constant determined by the constant appearing in Assumption 1), we now decompose the regret, Equation 3, into an optimistic term, collecting the time-steps where $p_t$ is high ($p_t \ge \rho$), and an exploration term, collecting those where it is low; the precise form of the two terms is derived in Appendix B.
The derivation of this bound is based on repeatedly applying properties of Bregman divergences and convex functions. At a high level, we introduce a parameter which is, conditionally on the past, an independent copy of $\tilde\theta_t$; we then condition on the event that this copy is optimistic (the converse for the second term), and integrate the copy out.
Examining the two terms, the optimistic term is of the kind that appears in the standard regret analysis of optimistic algorithms, and is easily handled using a concentration argument (Lemma 9) and the elliptical potential lemma (Lemma 10); this yields a contribution of order $d\sqrt{n}$, up to logarithmic factors, a term featuring in our overall regret bound. The exploration term is a cost associated with randomised exploration: it is the sum of the sizes of the parameter sampling distributions (or confidence sets, as these are the same up to scaling), where size is measured in the geometry induced by the action set.
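The decomposition above is driven by the conditional probability of optimism $p_t$. This quantity typically has no closed form, but it can be estimated by Monte Carlo; the sketch below does this for the unit-ball action set, taking a sample to be optimistic when $\sigma_{\mathcal{A}}(\tilde\theta_t) \ge \sigma_{\mathcal{A}}(\theta^\star)$, that is, when $\|\tilde\theta_t\|_2 \ge \|\theta^\star\|_2$ (all values are illustrative).

```python
import numpy as np

def prob_optimism(theta_hat, V, beta, theta_star, n_samples=10_000, rng=None):
    """Monte Carlo estimate of p_t = P_t(sigma_A(theta_tilde) >= sigma_A(theta_star))
    for the unit-ball action set, where sigma_A is the Euclidean norm."""
    rng = rng or np.random.default_rng()
    d = theta_hat.shape[0]
    V_inv_sqrt = np.linalg.cholesky(np.linalg.inv(V))
    eta = rng.normal(size=(n_samples, d))
    theta_tilde = theta_hat + beta * eta @ V_inv_sqrt.T
    return np.mean(np.linalg.norm(theta_tilde, axis=1) >= np.linalg.norm(theta_star))

rng = np.random.default_rng(4)
d = 4
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)
print(prob_optimism(theta_hat=0.8 * theta_star, V=5.0 * np.eye(d), beta=1.0,
                    theta_star=theta_star, rng=rng))
```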
5.2 Relating confidence widths to the amount of exploration
The challenge is now to show that the design matrices grow sufficiently fast, measured with respect to the geometry induced by the action set, such that the exploration term is small. First, we relate the confidence width in the direction of a given parameter to the expected amount of exploration in that direction at step $t$, with the latter measured in the norm induced by $\mathbb{E}_t[A_t A_t^\top]$, at a cost of a factor involving the strong-convexity constant $\mu$. This is a change of geometry lemma:
Lemma 5.
For all with , for any ,
Remark 3.
When $\mathcal{A} = \mathcal{B}^d$, we have $\sigma_{\mathcal{A}}(\theta) = \|\theta\|_2$ for all $\theta$, and thus no change of geometry is needed. In that case $A_t = \tilde\theta_t/\|\tilde\theta_t\|_2$ almost surely, and the width in the direction of $A_t$ relates directly to the expected exploration in that direction, where, for the sake of exposition, we have allowed ourselves the simplifying assumption that the confidence sets and perturbations are concentrated sufficiently to keep $\tilde\theta_t$ bounded away from zero.
We present the proof of Lemma 5 in Appendix C. Once again, we proceed by introducing a random variable with the same $\mathcal{F}_{t-1}$-conditional law as $\tilde\theta_t$; however, this time, the two are closely coupled, in that they differ only in a single marginal (along which they are independent). We then proceed with a convex Poincaré-type inequality argument along that direction, which relates the confidence width to the matrix $\mathbb{E}_t[A_t A_t^\top]$, the latter being essentially the conditional variance of $A_t$.
5.3 Establishing the growth of the design matrices
The final ingredient is the following relation between the sum of the conditional expected increments in the design matrices and their realisation $V_n$.
Lemma 6.
For any and , with probability at least , for all and all ,
where .
Remark 4.
A standard matrix Chernoff inequality (see [26]; this exact inequality is not stated there, but all the tools needed to derive it are) gives that with probability $1-\delta$, for all $n \ge 1$,
(4)
where $\preceq$ denotes the usual ordering on positive-semidefinite matrices. For the Euclidean ball, Equation 4 serves the same role as Lemma 6, but is tighter. However, in the general setting where $\mathcal{A} \neq \mathcal{B}^d$, it is crucial that we obtain the dependence seen in Lemma 6.
The proof of Lemma 6 is presented in Appendix D. It uses Lemma 9, a one-dimensional version of the inequality given in Equation 4, and applies it to the process $(\langle u, A_t \rangle^2)_{t \ge 1}$ for each $u$ in a time-dependent cover of the sphere $\mathcal{S}^{d-1}$. A union bound over the size of the cover is responsible for the logarithmic term in Lemma 6, and the discretisation error involved in the covering argument yields the additive term.
5.4 Putting everything together
Consider the number of steps up to time $n$ on which the conditional probability of optimism was below the threshold $\rho$. We will shortly show that Lemma 5 and Lemma 6, together with the assumed smoothness, yield the following bound:
Claim 7.
For all and ,
First though, note that using Claim 7 within the regret decomposition of Equation 3 completes the proof. Indeed, using that the expected per-step regret is bounded by a constant (to handle the first step, which is not covered by Claim 7), and the usual integral bound for monotonic integrands, we have
which completes our bound (observe that the first two terms are lower order).
Proof of Claim 7.
We work on the probability event resulting from applying Lemma 6 with . Since , we have , and thus for all ,
(5)
We now proceed to upper- and lower-bound the above expression in turn.
For the upper-bound, note that by -smoothness, .
For the lower bound of the right-hand side of Equation 5, we will use Lemma 5. Let , and note that since , we have that . Now,
(by Lemma 5)
Combining our lower and upper bounds on Equation 5, writing for a numerical constant and letting , we obtain the quadratic
Solving for , we have
whence relabelling concludes the proof. ∎
6 Conclusion
In this paper, we have presented a new analysis of randomised exploration algorithms for the linear bandit setting, which establishes that, given a nice-enough action set, randomised algorithms can obtain the optimal dependence on the dimension of the problem without the need for any algorithmic modifications. Our improved regret bound requires that the action space satisfies a smoothness and strong convexity condition, Assumption 2, which ensures that small perturbations in the parameter space translate directly to at least some perturbation in the action space, while also guaranteeing that these do not lead to large changes in the instantaneous regret.
Our results complement the lower bounds by [8, 29] which show that linear Thompson sampling can suffer linear regret in particular settings where the connection between randomness in the parameter and action spaces is broken. However, these results together still do not give a complete characterisation of when randomised exploration algorithms can and cannot achieve the optimal rate of regret in the linear bandit setting: it remains an important open problem to understand exactly which action spaces permit an optimal dependence on the dimension.
Acknowledgements
This project started in earnest as a result of discussions taking place at the 2023 Workshop on the Theory of Reinforcement Learning in Edmonton. We thank Csaba Szepesvári for organising this workshop, and for feedback on early versions of this work. DJ & MA thank Gergely Neu for putting them in touch with CP-B, who was working contemporaneously on the same problem.
References
- [1] Yasin Abbasi-Yadkori, Dávid Pál and Csaba Szepesvári “Improved algorithms for linear stochastic bandits” In Advances in Neural Information Processing Systems, 2011
- [2] Marc Abeille and Alessandro Lazaric “Linear Thompson sampling revisited” In Electronic Journal of Statistics 11.2, 2017, pp. 5165–5197
- [3] Shipra Agrawal and Navin Goyal “Analysis of Thompson sampling for the multi-armed bandit problem” In Conference on Learning Theory, 2012
- [4] Shipra Agrawal and Navin Goyal “Thompson sampling for contextual bandits with linear payoffs” In International Conference on Machine Learning, 2013
- [5] Peter Auer “Using confidence bounds for exploitation-exploration trade-offs” In Journal of Machine Learning Research 3.Nov, 2002, pp. 397–422
- [6] Olivier Chapelle and Lihong Li “An empirical evaluation of Thompson sampling” In Advances in Neural Information Processing Systems, 2011
- [7] Varsha Dani, Thomas P Hayes and Sham M Kakade “Stochastic linear optimization under bandit feedback” In Conference on Learning Theory, 2008
- [8] Nima Hamidi and Mohsen Bayati “On the frequentist regret of linear Thompson sampling” In arXiv preprint arXiv:2006.06790, 2023
- [9] Junya Honda and Akimichi Takemura “Optimality of Thompson sampling for Gaussian bandits depends on priors” In Artificial Intelligence and Statistics, 2014
- [10] Tom Huix, Matthew Zhang and Alain Durmus “Tight regret and complexity bounds for Thompson sampling via Langevin Monte Carlo” In International Conference on Artificial Intelligence and Statistics, 2023
- [11] David Janz, Alexander E Litvak and Csaba Szepesvári “Ensemble sampling for linear bandits: small ensembles suffice” In Advances in Neural Information Processing Systems, 2024
- [12] David Janz, Shuai Liu, Alex Ayoub and Csaba Szepesvári “Exploration via linearly perturbed loss minimisation” In International Conference on Artificial Intelligence and Statistics, 2024
- [13] Emilie Kaufmann, Nathaniel Korda and Rémi Munos “Thompson sampling: An asymptotically optimal finite-time analysis” In International Conference on Algorithmic Learning Theory, 2012
- [14] Nathaniel Korda, Emilie Kaufmann and Remi Munos “Thompson sampling for 1-dimensional exponential family bandits” In Advances in Neural Information Processing Systems, 2013
- [15] Branislav Kveton et al. “Randomized exploration in generalized linear bandits” In International Conference on Artificial Intelligence and Statistics, 2020
- [16] Tor Lattimore and Csaba Szepesvári “The end of optimism? An asymptotic analysis of finite-armed linear bandits” In Artificial Intelligence and Statistics, 2017, pp. 728–737 PMLR
- [17] Tor Lattimore and Csaba Szepesvári “Bandit algorithms” Cambridge University Press, 2020
- [18] Xiuyuan Lu and Benjamin Van Roy “Ensemble sampling” In Advances in Neural Information Processing Systems, 2017
- [19] Benedict C May, Nathan Korda, Anthony Lee and David S Leslie “Optimistic Bayesian sampling in contextual-bandit problems” In Journal of Machine Learning Research 13, 2012, pp. 2069–2106
- [20] Ralph Tyrrell Rockafellar “Convex Analysis” Princeton University Press, 1970
- [21] Paat Rusmevichientong and John N Tsitsiklis “Linearly parameterized bandits” In Mathematics of Operations Research 35.2 INFORMS, 2010, pp. 395–411
- [22] Daniel Russo and Benjamin Van Roy “An information-theoretic analysis of Thompson sampling” In The Journal of Machine Learning Research 17.1 JMLR, 2016, pp. 2442–2471
- [23] Daniel Russo and Benjamin Van Roy “Learning to optimize via posterior sampling” In Mathematics of Operations Research 39.4 INFORMS, 2014, pp. 1221–1243
- [24] Daniel Russo et al. “A tutorial on Thompson sampling” In Foundations and Trends® in Machine Learning 11.1 Now Publishers, Inc., 2018, pp. 1–96
- [25] William R Thompson “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples” In Biometrika 25.3-4 Oxford University Press, 1933, pp. 285–294
- [26] Joel A Tropp “User-friendly tail bounds for sums of random matrices” In Foundations of Computational Mathematics 12 Springer, 2012, pp. 389–434
- [27] Sharan Vaswani, Abbas Mehrabian, Audrey Durand and Branislav Kveton “Old Dog Learns New Tricks: Randomized UCB for Bandit Problems” In International Conference on Artificial Intelligence and Statistics, 2020
- [28] Ruitu Xu, Yifei Min and Tianhao Wang “Noise-adaptive Thompson sampling for linear contextual bandits” In Advances in Neural Information Processing Systems 36, 2023
- [29] Tong Zhang “Feel-good Thompson sampling for contextual bandits and reinforcement learning” In SIAM Journal on Mathematics of Data Science 4.2 SIAM, 2022, pp. 834–857
Appendix A Some standard results
The following lemma is adapted from Exercise 20.8 in [17].
Lemma 8.
Fix $\delta \in (0,1)$. Let $(X_t)_{t \ge 1}$ be a real-valued martingale difference sequence satisfying $|X_t| \le c$ almost surely for each $t$ and some $c > 0$. Then,
Next is a second concentration inequality that we require.
Lemma 9.
Let $(X_t)_{t \ge 1}$ be a sequence of nonnegative random variables adapted to a filtration $(\mathcal{F}_t)_{t \ge 0}$ with $X_t \le c$ for all $t$ and some $c > 0$. Then, for all $\delta \in (0,1)$,
Proof of Lemma 9.
By rescaling, we need only consider the case where . Let be the random process defined by and
Observe that for all , and that since for any ,
we have that for all $t$,
Therefore, is a non-negative supermartingale. Applying Ville’s inequality yields the result. ∎
The following is an adaptation of Lemma 19.4 in [17].
Lemma 10 (Elliptical potential lemma).
Fix $\lambda > 0$ and a sequence $(a_t)_{t \ge 1}$ in $\mathcal{B}^d$. Then, letting $V_t = \lambda I + \sum_{s \le t} a_s a_s^\top$, we have that for all $n \ge 1$,
$$\sum_{t=1}^{n} \min\!\left\{1,\ \|a_t\|^2_{V_{t-1}^{-1}}\right\} \le 2 \log \frac{\det V_n}{\det(\lambda I)} \le 2d \log\!\left(1 + \frac{n}{\lambda d}\right).$$
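A quick numerical check of the lemma on random unit-norm actions; the right-hand side below is the logarithmic-determinant bound stated above, and the values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, lam = 5, 5000, 1.0
V = lam * np.eye(d)
lhs = 0.0
for _ in range(n):
    a = rng.normal(size=d); a /= np.linalg.norm(a)   # action in the unit ball
    lhs += min(1.0, a @ np.linalg.solve(V, a))       # min(1, ||a||^2_{V_{t-1}^{-1}})
    V += np.outer(a, a)
rhs = 2 * d * np.log(1 + n / (lam * d))              # elliptical potential bound
print(f"lhs = {lhs:.2f} <= rhs = {rhs:.2f}")
assert lhs <= rhs
```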
Appendix B Derivation of the regret decomposition upper bound (Equation 3)
Let $p_t$ be the (conditional) probability of optimism at step $t$. We now derive two bounds separately. When $p_t$ is high, we will use
(6)
When $p_t$ is low, we will prefer the bound
(7)
Combining these two bounds with our regret decomposition establishes Equation 3
We will derive the two bounds, Equations 6 and 7, using similar techniques. Both derivations will make use of the following estimates.
Claim 11.
For any norm $\|\cdot\|$ on $\mathbb{R}^d$,
Proof.
Letting and be independent copies of , we can express and as
Denote by the expectation over and . We have that
where we used that .
Expressing for some (which we can do due to the implicit assumption that and using the same approach we obtain the other part of the bound. ∎
Derivation of Equation 6.
For almost every such that ,
(by convexity and Cauchy–Schwarz)
Now let be a measure on given by
Since the bound above holds for almost all such that , and is a diffuse measure on that set, it also holds on average for . Integrating with respect to and ,
For the first integral,
(8) (by Cauchy–Schwarz and Claim 11)
Finally, since almost surely, .
The second integral follows likewise, with the addition of multiplying the resulting nonnegative bound by to keep things tidy. ∎
For the steps with a low probability of optimism, we will need the following property of Bregman divergences:
Lemma 12 (Law of cosines).
For any convex function $f$, all $x, z$ and almost all $y$,
$$D_f(x, y) = D_f(x, z) + D_f(z, y) + \langle \nabla f(z) - \nabla f(y),\, x - z \rangle.$$
Derivation of Equation 7.
For almost all ,
(by the law of cosines, convexity, and Cauchy–Schwarz)
Also, for almost every satisfying ,
(by an almost-everywhere identity and Cauchy–Schwarz)
Combining the above two bounds, we have that for almost all , if , then
(9)
Now let be a measure on given by
Since Equation 9 holds for almost all with and are non-atomic, it also holds on average for and . Integrating, we see that is upper bounded by
(10)
For the first integral, we can use that for any , , to establish that
where the final equality follows since has the same law as conditioned on . Now, by Assumption 2 and then using the estimate from Claim 11,
Bounding the remaining integrals in Equation 10 can be done by following the same steps as for the integral in Equation 8 of the optimistic bound, just with in place of . ∎
Appendix C Proof of the change of geometry lemma (Lemma 5)
Let be a basis completion orthogonal to (a projection onto the orthogonal complement of the span of ). Let , and let be an independent copy of independent of and define
Also define the indicators and .
Proof of Lemma 5.
The proof is based on lower and upper-bounding .
For the lower bound, note that by strong convexity,
where
(successively using the marginal variance assumption, dropping a term, the marginal variance assumption again, Cauchy–Schwarz, the marginal variance and fourth moment assumptions, and the assumption on the constant)
For the upper bound, we have that
(by convexity, Cauchy–Schwarz, and the marginal variance assumption)
Chaining the lower and upper bounds yields the claimed result. ∎
Appendix D Proof of directional concentration (Lemma 6)
Lemma 13.
For any $\varepsilon \in (0, 1]$, the $\varepsilon$-covering number of $\mathcal{S}^{d-1}$ with respect to the Euclidean norm is upper bounded by $(3/\varepsilon)^d$.
Proof of Lemma 6.
For each , let be a minimal -cover of in , where the value of will be chosen shortly. Let
For every and , we apply Lemma 9 to the sequence , , using the upper bound for all , and confidence level . Taking a union bound over the resulting events, we obtain that with probability , for all and ,
Now for each , let be a map satisfying for all . The proof will be complete once we show that for a suitable choice of , for all , and that for the chosen , we have the bound . We begin with the bound
Letting denote the operator norm,
| (, , ) |
Also,
| (, ) | ||||
| (, ) | ||||
| () |
Now choose . By Lemma 13, for this choice, . Combining the bounds on and , we now indeed have that for all ,