
Marc Abeille (m.abeille@criteo.com), Criteo AI Lab, Paris, France
David Janz (david.janz93@gmail.com), University of Oxford, UK
Ciara Pike-Burke (c.pike-burke@imperial.ac.uk), Imperial College London, UK

When and why randomised exploration works (in linear bandits)

Abstract

We provide an approach for the analysis of randomised exploration algorithms like Thompson sampling that does not rely on forced optimism or posterior inflation. With this, we demonstrate that in the $d$-dimensional linear bandit setting, when the action space is smooth and strongly convex, randomised exploration algorithms enjoy an $n$-step regret bound of the order $O(d\sqrt{n}\log(n))$. Notably, this shows for the first time that there exist non-trivial linear bandit settings where Thompson sampling can achieve optimal dimension dependence in the regret.

keywords:
linear bandits, randomised exploration, Thompson sampling

1 Introduction

To achieve low regret in sequential decision-making problems, it is necessary to balance exploration (selecting uncertain actions) and exploitation (selecting previously successful actions). One method of balancing this exploration-vs-exploitation trade-off that is particularly well-understood is through optimism: optimistic algorithms maintain a set of statistically plausible models of the environment and select actions that maximize the reward in the best plausible model—note however that this entails solving a bi-level optimization problem in each round. Randomised exploration is an alternative approach where algorithms select a model of the problem randomly from a set of plausible models and act optimally with respect to that randomly sampled model—bypassing the need to solve the bi-level optimization problem associated with optimism. Notable examples of randomised decision-making algorithms include Bayesian algorithms such as posterior sampling [25, also known as Thompson sampling], ensemble sampling [18, 11] and perturbed history exploration [15, 12]. However, while randomisation-based algorithms are often preferred in practice, our theoretical understanding of when and why randomised exploration works in structured sequential decision-making problems is limited.

In this paper, we analyse randomised sequential decision-making algorithms in the classic linear bandit problem—but the techniques that we introduce should carry over to other structured settings. In this setting, previous frequentist analyses [4, 2, 15, 12, e.g.] are not sufficient to explain the practical effectiveness of randomised exploration, nor do they identify a mechanism through which randomised exploration works. Indeed, existing proofs rely on modifying randomised exploration algorithms so that they can be analysed using the optimism framework. These modifications often lead to suboptimal regret. Our analysis does away with such modifications; it holds under the assumption that the action space is smooth and strongly convex (see Section 4.2 for formal definitions), which allows for perturbation in the model parameter space to translate to perturbations in the action space, while also guaranteeing that small changes in the action space only lead to small changes in the incurred regret.

For such smooth, strongly convex action sets, which include $\ell_{p}$-balls for $p\in(1,\infty)$, we prove a regret bound of the order $O(d\sqrt{n}\log(n))$, where $d$ is the dimension of the action space and $n$ is the number of rounds. Notably, this shows for the first time that (unmodified) linear Thompson sampling can enjoy regret with the optimal dependence on the dimension in a structured linear bandit setting, thus partially resolving an important open question [24].

2 Related work

Lower bounds for the linear bandit problem depend on the structure of specific action spaces [7, 21, 16, for example]. Theorem 2.1 of [21] shows that there exists a problem instance, where the action space is the $d$-dimensional unit sphere, in which any policy must incur $\Omega(d\sqrt{n})$ regret. Optimistic algorithms have frequentist regret nearly matching the lower bound for linear bandits [5, 7, 1]. Specifically, [1] show that by constructing confidence sets using self-normalized bounds for vector-valued martingales, and taking actions optimistically within these, the resulting regret is $O(d\sqrt{n}\log(n/\delta))$ with probability at least $1-\delta$. Despite the strong theoretical performance of optimistic algorithms, randomised algorithms, such as Thompson sampling, have been shown to perform better in practice [6, 19]. In the simpler multi-armed bandit setting, randomised algorithms achieve optimal regret [3, 13, 14, 9]. Under Bayesian assumptions, where regret is defined by taking an expectation over the unknown parameter, [23, 22] show that Thompson sampling is near-optimal in many structured and unstructured settings. In particular, for the linear bandit setting, they show a Bayesian regret bound of $\widetilde{O}(d\sqrt{n})$ [23].

In this paper, our focus is on the regret of randomised exploration algorithms in linear bandits. While this setting has been studied extensively, previous approaches rely on modifying the algorithm to force it to be more optimistic. The main line of analysis, by [4, 2, 28], inflates the variance of the posterior over models in round $t$ by a factor of $\Theta(\sqrt{d\log(t/\delta)})$ to show that the algorithm is optimistic with constant probability—this leads to $O((d\log(n))^{3/2}\sqrt{n})$ regret, where the increased dependence on $d$ is due to the inflation of the posterior. Further variants of randomised exploration algorithms include modifying the algorithms to only sample parameters with reward greater than the mean [19, 27] and modifying the likelihood used in the Bayesian update of Thompson sampling to force the algorithm to be more optimistic [29, 10]. The analysis of Thompson sampling in other structured settings, such as generalised linear bandits, relies on these same modifications [15, 12].

We remark that the results presented in this paper do not contradict the lower bounds by [8, 29] where examples were provided for which linear Thompson sampling incurs linear regret if the posterior distribution is not inflated. The action spaces constructed in those examples fail to satisfy our assumptions.

3 Problem setting, notation and basic definitions

We study the linear bandit problem, where each bandit instance is parameterised by an unknown $\theta_{\star}\in R\mathbb{B}^{d}_{2}$ ($R>0$ known), and an action set $\mathcal{X}$, a closed subset of $\mathbb{B}^{d}_{2}$ (the closed unit $\ell_{2}$-ball in $\mathbb{R}^{d}$). Then, at each time-step $t=1,2,\dots$ an agent selects an action $X_{t}\in\mathcal{X}$, allowed to depend on observations from previous time-steps, and receives a real-valued reward $Y_{t}$. We assume that the reward $Y_{t}$ is $S$-subgaussian given $X_{t}$ and the past ($S>0$ known), with mean given by $\langle X_{t},\theta_{\star}\rangle$. The goal of the agent is to select actions to minimize the $n$-step regret ($n\geq 1$), defined by

R_{n}=\sum_{t=1}^{n}r_{t}\quad\text{for}\quad r_{t}=\langle x_{\star}-X_{t},\theta_{\star}\rangle\,,

where $x_{\star}\in\operatorname*{arg\,max}_{x\in\mathcal{X}}\langle x,\theta_{\star}\rangle$ is any optimal arm and the horizon $n$ need not be known.

Confidence set construction

The algorithms and analysis in this work are based on the standard regularised least-squares-based confidence ellipsoids for $\theta_{\star}$ [1]. To construct these, fix a regularisation parameter $\lambda>0$ and a confidence parameter $\delta\in(0,1)$. Define the regularised design matrices and least-squares estimates as $V_{0}=\lambda I$, $\hat{\theta}_{0}=0$ and then

V_{t}=X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}+V_{t-1}\quad\text{and}\quad\hat{\theta}_{t}=V_{t}^{-1}\sum_{i=1}^{t}Y_{i}X_{i}\quad\text{for}\quad t\geq 1\,.

Also, define the sequence of nondecreasing, nonnegative confidence widths

\beta_{t}=R\sqrt{\lambda}+S\sqrt{2\log(1/\delta)+\log(\det(V_{t})/\lambda^{d})}\,,\quad t\geq 0\,.

Then, [1] show that, with probability $1-\delta$, $\theta_{\star}\in\cap_{t\geq 1}\Theta_{t-1}$ for the ellipsoids given by

\Theta_{t-1}=\{\theta\in\mathbb{R}^{d}\colon\|\theta-\hat{\theta}_{t-1}\|_{V_{t-1}}\leq\beta_{t-1}\}\,,\quad t\geq 1\,,

where for $a\in\mathbb{R}^{d}$ and a $d\times d$ positive-definite matrix $B$, we denote by $\|a\|_{B}$ the $B$-weighted Euclidean norm of $a$, given by $\sqrt{\langle Ba,a\rangle}$.
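For concreteness, the following is a minimal sketch (ours, not part of the paper) of how the quantities $V_t$, $\hat{\theta}_t$ and $\beta_t$ above might be maintained online; the class and variable names are illustrative assumptions.

```python
import numpy as np

class ConfidenceEllipsoid:
    """Regularised least-squares estimate and confidence width in the style of [1].

    Maintains V_t = lam*I + sum_i X_i X_i^T, theta_hat_t = V_t^{-1} sum_i Y_i X_i,
    and beta_t = R*sqrt(lam) + S*sqrt(2*log(1/delta) + log(det(V_t)/lam^d)).
    """

    def __init__(self, d, lam=1.0, R=1.0, S=1.0, delta=0.05):
        self.d, self.lam, self.R, self.S, self.delta = d, lam, R, S, delta
        self.V = lam * np.eye(d)          # V_0 = lambda * I
        self.b = np.zeros(d)              # running sum of Y_i X_i
        self.theta_hat = np.zeros(d)      # theta_hat_0 = 0

    def update(self, x, y):
        """Incorporate one observation (X_t, Y_t)."""
        self.V += np.outer(x, x)
        self.b += y * x
        self.theta_hat = np.linalg.solve(self.V, self.b)

    def beta(self):
        """Current confidence width beta_t."""
        _, logdet = np.linalg.slogdet(self.V)
        return self.R * np.sqrt(self.lam) + self.S * np.sqrt(
            2 * np.log(1 / self.delta) + logdet - self.d * np.log(self.lam))
```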

Optimistic algorithms

Optimistic algorithms select actions $X_{t}$ by solving the bi-level optimization problem $(X_{t},\theta_{t})\in\operatorname*{arg\,max}_{(x,\theta)\in\mathcal{X}\times\Theta_{t-1}}\langle x,\theta\rangle$ in each round $t\leq n$. We instead consider randomised algorithms which randomise over $\Theta_{t-1}$. These methods are formally defined in Section 4.1.

Bregman divergence

Our analysis will make use of a generalised Bregman divergence, defined for a convex function $f\colon\mathbb{R}^{d}\to\mathbb{R}$ as

D_{f}(x,y)=f(x)-f(y)-\langle\nabla f(y),x-y\rangle\,,

for almost every $y\in\mathbb{R}^{d}$, where $\nabla f$ denotes the gradient of $f$. We recall that convex functions are almost everywhere differentiable [20].

Probabilistic formalism

Let $\mathbb{F}=(\mathbb{F}_{t})_{t\geq 0}$ be a filtration where $\mathbb{F}_{0}$ is the trivial $\sigma$-algebra and $\mathbb{F}_{t}=\sigma(\sigma(X_{t},Y_{t}),\mathbb{A}_{t})$, where $\mathbb{A}_{t}$ is the $\sigma$-algebra generated by any additional random variables the algorithm uses in selecting $X_{t}$. Note that this means that $X_{t}$ is $\mathbb{F}_{t}$-measurable. We will write $\mathbb{P}_{t}$ for the $\mathbb{F}_{t}$-conditional probability measure and $\mathbb{E}_{t}$ for the corresponding expectation. With this, we formalise the assumption that for all $t\geq 1$, $Y_{t}$ is conditionally $S$-subgaussian as

\mathbb{E}_{t-1}\exp\{sY_{t}\}\leq\exp\{s^{2}S^{2}/2\}\quad\text{for all}\quad s\in\mathbb{R}\,,\ t\geq 1\,. (1)

Asymptotic notation

We will write $f(x)\lesssim g(x)$ if $f(x)=O(g(x))$, and use $\gtrsim$ for the converse.

Vectors, norms, balls & spheres

We will write $\|\cdot\|$ to denote the $\ell_{2}$-norm. We recall that for a positive-definite matrix $B$ and a vector $a$ of compatible dimensions, $\|a\|_{B}=\sqrt{\langle Ba,a\rangle}$ denotes the $B$-weighted $\ell_{2}$ norm. We write $\mathbb{B}^{d}_{2}$ for the closed unit Euclidean ball in $\mathbb{R}^{d}$, and $\mathbb{S}^{d-1}_{2}$ for its surface $\partial\mathbb{B}^{d}_{2}$, the $(d-1)$-sphere.

4 A frequentist regret bound for randomised algorithms in linear bandits

In this section, we state our main result, which provides conditions under which randomised exploration algorithms can achieve frequentist regret of $\widetilde{O}(d\sqrt{n})$ in the linear bandit setting. We begin by describing the algorithmic framework and the assumptions on the action set under which it holds.

4.1 Randomised algorithms: definition and assumptions

We consider algorithms that at each time-step $t\geq 1$ sample a parameter of the form

\theta_{t}=\hat{\theta}_{t-1}+V^{-1/2}_{t-1}\eta_{t}\,,

where $(\eta_{t})_{t\geq 1}$ is a sequence of independent random variables (perturbations), and select the action

X_{t}\in\operatorname*{arg\,max}_{x\in\mathcal{X}}\langle x,\theta_{t}\rangle\,.

Our result will require the following assumptions to hold for the perturbations $(\eta_{t})_{t\geq 1}$.

Assumption 1.

The perturbations $(\eta_{t})_{t\geq 1}$ are independent rotationally-invariant random variables for which there exists a constant $K>0$ such that

1\leq\mathbb{E}\langle u,\eta_{t}\rangle^{2}\leq K^{2}\quad\text{and}\quad\mathbb{E}\langle u,\eta_{t}\rangle^{4}\leq K^{4}\quad\text{for all}\quad u\in\mathbb{S}^{d-1}_{2}\,,\ t\geq 1\,.

These assumptions hold for many common distributions, such as the standard Gaussian (with $K^{4}=3$) and the uniform distribution on $\sqrt{d}\mathbb{S}^{d-1}_{2}$ (with $K=1$).
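As an illustration, here is a minimal sketch (ours, not from the paper) of one round of such a randomised algorithm, with either of the two perturbation distributions mentioned above; `argmax_action` is a placeholder for the maximisation over $\mathcal{X}$, and all names are assumptions made for the example.

```python
import numpy as np

def sample_perturbation(d, rng, kind="gaussian"):
    """Draw eta_t satisfying Assumption 1: standard Gaussian (K^4 = 3)
    or uniform on sqrt(d) * S^{d-1} (K = 1)."""
    z = rng.standard_normal(d)
    return z if kind == "gaussian" else np.sqrt(d) * z / np.linalg.norm(z)

def randomised_round(theta_hat, V, argmax_action, rng, kind="gaussian"):
    """One round: theta_t = theta_hat_{t-1} + V_{t-1}^{-1/2} eta_t,
    then X_t maximises <x, theta_t> over the action set."""
    d = theta_hat.shape[0]
    w, U = np.linalg.eigh(V)                    # V = U diag(w) U^T with w > 0
    V_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T   # symmetric inverse square root
    theta_t = theta_hat + V_inv_sqrt @ sample_perturbation(d, rng, kind)
    return argmax_action(theta_t)

# usage sketch for the unit l_2 ball, where argmax_x <x, theta> = theta / ||theta||
rng = np.random.default_rng(0)
d = 3
x_t = randomised_round(np.zeros(d), np.eye(d), lambda th: th / np.linalg.norm(th), rng)
print(x_t)
```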

4.2 Action set assumptions: smoothness and strong convexity

A core part of our contribution is in identifying the properties of action sets that allow randomised exploration to succeed. Our assumptions will be expressed in terms of the support function of $\mathcal{X}$,

J_{\mathcal{X}}(\theta)=\max_{x\in\mathcal{X}}\langle x,\theta\rangle\,.

Crucially, for randomised algorithms where for each $t\geq 1$, the $\mathbb{F}_{t-1}$-conditional law of $\theta_{t}$ is diffuse (implied by rotational invariance), we have that

X_{t}=\nabla J_{\mathcal{X}}(\theta_{t})\quad\text{almost surely for all } t\geq 1\,.

Our upcoming assumptions ensure that $\nabla J_{\mathcal{X}}$ is a suitably regular function. Note that the above relation means the per-step regret of randomised algorithms is given by the divergence

r_{t}=J_{\mathcal{X}}(\theta_{\star})-\langle X_{t},\theta_{\star}\rangle=J_{\mathcal{X}}(\theta_{\star})-J_{\mathcal{X}}(\theta_{t})-\langle\nabla J_{\mathcal{X}}(\theta_{t}),\theta_{\star}-\theta_{t}\rangle=D_{J_{\mathcal{X}}}(\theta_{\star},\theta_{t})\,,

again, almost surely with respect to the $\mathbb{F}_{t-1}$-conditional law of $\theta_{t}$.
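As a quick check of this identity (included here for the reader's convenience; it is not part of the original argument), consider the familiar case $\mathcal{X}=\mathbb{B}^{d}_{2}$, revisited in Remark 3 below, where $J_{\mathcal{X}}(\theta)=\|\theta\|$ and $\nabla J_{\mathcal{X}}(\theta_{t})=\theta_{t}/\|\theta_{t}\|$; substituting into the divergence recovers the per-step regret directly:

D_{J_{\mathcal{X}}}(\theta_{\star},\theta_{t})=\|\theta_{\star}\|-\|\theta_{t}\|-\Big\langle\frac{\theta_{t}}{\|\theta_{t}\|},\theta_{\star}-\theta_{t}\Big\rangle=\|\theta_{\star}\|-\Big\langle\frac{\theta_{t}}{\|\theta_{t}\|},\theta_{\star}\Big\rangle=J_{\mathcal{X}}(\theta_{\star})-\langle X_{t},\theta_{\star}\rangle=r_{t}\,.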

Our assumptions will be based on the following three definitions:

Definition 1 (Absorbing set).

We call a set $\mathcal{X}\subset\mathbb{R}^{d}$ absorbing if it is a neighbourhood of the origin.

Definition 2 (Strong convexity).

We say $J_{\mathcal{X}}^{2}$ is $m$-strongly convex with respect to a norm $\|\cdot\|_{*}$ if

\frac{m}{2}\|\theta-\theta^{\prime}\|_{*}^{2}\leq D_{J_{\mathcal{X}}^{2}}(\theta,\theta^{\prime})\quad\text{for all}\quad\theta,\theta^{\prime}\in\mathbb{R}^{d}\,.
Definition 3 (Smoothness).

We say that $J_{\mathcal{X}}^{2}$ is $M$-smooth with respect to a norm $\|\cdot\|_{*}$ if

D_{J_{\mathcal{X}}^{2}}(\theta,\theta^{\prime})\leq\frac{M}{2}\|\theta-\theta^{\prime}\|_{*}^{2}\quad\text{for all}\quad\theta,\theta^{\prime}\in\mathbb{R}^{d}\,.

With these definitions in place, the conditions we require of the arm set $\mathcal{X}$ are captured by the following assumption.

Assumption 2.

The action set $\mathcal{X}$ is a closed absorbing subset of $\mathbb{B}^{d}_{2}$, and there exist a norm $\|\cdot\|_{*}$ and constants $M,m>0$ such that $J_{\mathcal{X}}^{2}$ is $m$-strongly convex and $M$-smooth.

The motivation for asking for strong convexity and smoothness of the square $J^{2}_{\mathcal{X}}$, rather than directly of $J_{\mathcal{X}}$, is that the quantity

\nabla J^{2}_{\mathcal{X}}(\theta)=2J(\theta)\nabla J(\theta) (2)

does not explode as $\theta\to 0$, whereas $\nabla J_{\mathcal{X}}(\theta)$ does. That $\mathcal{X}$ is absorbing ensures that the multiplier $J(\theta)$ in the above is positive, which will come in useful in our proofs—we do not believe this assumption to be essential, but we have thus far been unable to eliminate it.

Remark 1.

Definition 3 generalises the notion of $M$-strong convexity used in [21], where this was defined by the requirement that

\|\nabla J_{\mathcal{X}}(\theta)-\nabla J_{\mathcal{X}}(\theta^{\prime})\|\leq M\|\theta-\theta^{\prime}\|\quad\text{for all}\quad\theta,\theta^{\prime}\in\mathbb{S}^{d-1}_{2}\,.

Our definition will be vital to getting the right rate for randomised algorithms outside the $\ell_{2}$-ball case, and specifically to avoid incurring an extra factor of $\|\theta_{\star}\|/J(\theta_{\star})$ in the regret, which may be large. We note also that their definition is for the strong convexity of the arm-set, whereas our definition is for the smoothness of $J^{2}_{\mathcal{X}}$. There is a duality between the (indicator function of) the set and the corresponding support function, which explains the inversion in the nomenclature.

Remark 2.

If $\mathcal{X}$ is absorbing and balanced (symmetric about the origin), $J_{\mathcal{X}}$ is a norm; if it is just absorbing, $\tilde{J}(\theta)=J_{\mathcal{X}}(\theta)\vee J_{\mathcal{X}}(-\theta)$ is a norm. In these cases, it may be productive to try taking $\|\cdot\|_{*}=J_{\mathcal{X}}(\cdot)$ (or $\tilde{J}(\cdot)$), as in the examples below. Of course, $\|\cdot\|_{*},m,M$ do not need to be known to run the algorithm, and the regret implicitly scales with the best $M/m$ over all norms $\|\cdot\|_{*}$.

Examples of action sets that satisfy Assumption 2 include the $\ell_{p}$-balls with $p\in(1,\infty)$:

Example 1.

Let $p,q>1$ be conjugate indices ($\frac{1}{p}+\frac{1}{q}=1$), $\mathcal{X}=\mathbb{B}^{d}_{q}$ and $\|\cdot\|_{*}=\|\cdot\|_{p}$. Then, Assumption 2 holds with $m=1$, $M=(p-1)$ for $q\in(1,2)$ and $m=p-1$, $M=1$ for $q\in[2,\infty)$.
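For these sets the action-selection step has a closed form: by Hölder's inequality, $J_{\mathcal{X}}(\theta)=\|\theta\|_{p}$ and the maximum over $\mathbb{B}^{d}_{q}$ is attained at $x_{i}=\operatorname{sign}(\theta_{i})|\theta_{i}|^{p-1}/\|\theta\|_{p}^{p-1}$. The snippet below (our illustration, not part of the paper; all names are ours) implements this and checks it numerically.

```python
import numpy as np

def argmax_lq_ball(theta, q):
    """Maximiser of <x, theta> over the unit l_q ball, q in (1, inf).

    By Holder's inequality the maximum equals ||theta||_p with 1/p + 1/q = 1,
    attained at x_i = sign(theta_i) * (|theta_i| / ||theta||_p)^(p-1).
    """
    p = q / (q - 1.0)
    norm_p = np.sum(np.abs(theta) ** p) ** (1.0 / p)
    if norm_p == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * (np.abs(theta) / norm_p) ** (p - 1.0)

# quick numerical check: <x, theta> equals ||theta||_p and ||x||_q equals 1
rng = np.random.default_rng(0)
theta, q = rng.standard_normal(5), 3.0
x = argmax_lq_ball(theta, q)
p = q / (q - 1.0)
assert np.isclose(x @ theta, np.sum(np.abs(theta) ** p) ** (1 / p))
assert np.isclose(np.sum(np.abs(x) ** q) ** (1 / q), 1.0)
```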

Assumption 2 is unaffected by linear transformations, extending the above examples to ellipsoids:

Example 2.

Let $\mathcal{X}$ be any arm set satisfying Assumption 2 for some $\|\cdot\|_{*}$, $M$ and $m$. Then, for any $A\in\mathbb{R}^{d\times d}$, $A\mathcal{X}:=\{x\in\mathbb{R}^{d}\colon Ax\in\mathcal{X}\}$ satisfies Assumption 2 for the norm $x\mapsto\|Ax\|_{*}$, $M$ and $m$.

4.3 Main result and discussion

We are now ready to state our main result, which shows that any randomised algorithm satisfying Assumption 1 with an action set satisfying Assumption 2 achieves at most $\widetilde{O}(d\sqrt{n})$ regret in the linear bandit problem. This matches the lower bound of [21] up to logarithmic factors (based on $\mathcal{X}=\mathbb{B}^{d}_{2}$, a set that satisfies our assumptions).

Theorem 4.

Fix $\lambda\geq 1$ and $\delta\in(0,1)$. Suppose that a learner uses a randomised algorithm with perturbations satisfying Assumption 1 on a linear bandit instance with an arm-set that satisfies Assumption 2. Then, for any $\theta_{\star}\in\sqrt{d}\mathbb{B}^{d}_{2}$, with probability $1-\delta$, for all $n\geq 1$, the $n$-step regret incurred by the learner is bounded as

R_{n}\lesssim\frac{M}{m}K(\beta_{n}^{2}\vee K^{2}d)\sqrt{n}+K^{4}\beta_{n}\sqrt{n(d\log(1+n/(d\lambda))+\log(1/\delta))}\,.

The proof of this result is presented in Section 5, with many of the details deferred to the appendices. We now discuss some aspects of our result, its proof and its relation to previous works.

On the regret of Thompson sampling

If the noise in the responses $(Y_{t})_{t\geq 1}$ is Gaussian with a known variance $\sigma^{2}$, and if for all $t\geq 1$ the perturbations are given by $\eta_{t}\sim\mathcal{N}_{d}(0,\sigma^{2}I)$, then our randomised exploration algorithm is equivalent to the linear Thompson sampling algorithms of [23, 3, 2]. Thus, for action spaces satisfying Assumption 2, Theorem 4 shows that Thompson sampling can enjoy regret of $O(d\sqrt{n}\log(n))$, leaving at most an $O(\log n)$ gap between this frequentist regret and the corresponding Bayesian regret [23, 22].

On the lower bound for randomised algorithms

We remark that Theorem 4 holds for any randomised algorithm without any modification; in particular, there is no need to inflate any variance proxies. This is in contrast to lower bounds by [8, 29], which show that there exist problem instances on which linear Thompson sampling suffers linear regret. These instances are specifically designed so that there is a bad 'trap' arm, where pulling that arm yields regret but no information, so that Thompson sampling gets stuck. They are the polar opposite of what Assumption 2 asks for: neither absorbing, strongly convex, nor smooth.


Figure 1: Illustration of the update to the confidence sets during non-optimistic exploration, and the impact this has on the per-step worst-case regret, when $\mathcal{X}=\mathbb{B}^{d}_{2}$. In red, we have an initial confidence set $\Theta$; the corresponding worst-case optimal action over $\Theta$ is given by $x=\arg\min_{\theta\in\Theta}\langle\nabla J(\theta),\theta_{\star}\rangle$ and the associated per-step worst-case regret is $\Delta=\|\theta_{\star}\|_{2}-\langle x,\theta_{\star}\rangle$. In blue, we illustrate the average of the respective quantities after randomised structured exploration with $\theta\sim\Theta$, that is, taking $V^{\prime}=V+\mathbb{E}_{\theta\sim\Theta}\big(\nabla J(\theta)\nabla J(\theta)^{\mkern-1.5mu\mathsf{T}}\big)$. While the actions sampled by this strategy are unlikely to be optimistic, this randomised strategy does in fact explore—the confidence set shrinks—and this reduces the per-step regret.

Limitation of optimism-based proofs

Existing proofs of frequentist regret bounds for randomised algorithms in linear bandits, including those of [4, 2, 15, 12], leverage that, with high probability,

r_{t}=J_{\mathcal{X}}(\theta_{\star})-\langle X_{t},\theta_{\star}\rangle=D_{J_{\mathcal{X}}}(\theta_{\star},\theta_{t})\leq\sup_{\theta\in\Theta_{t}}D_{J_{\mathcal{X}}}(\theta_{\star},\theta)\,,

and then show that $\sup_{\theta\in\Theta_{t}}D_{J_{\mathcal{X}}}(\theta_{\star},\theta)$ can be suitably controlled when randomised sampling guarantees sufficient optimism—that is, when the algorithm is optimistic with a fixed probability. Unfortunately, as illustrated in [2], guaranteeing optimism with a fixed probability requires inflating the variance of the sampling distributions, and this results in an extra $\sqrt{d}$ factor in the regret bound. Moreover, these proofs implicitly suggest that non-optimistic samples do not help in controlling the upper bound on the per-step regret, $\sup_{\theta\in\Theta_{t}}D_{J_{\mathcal{X}}}(\theta_{\star},\theta)$.

This approach is overly conservative in two ways: first, while a particular sample may provide very little information—measured through the design matrix update $X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}=V_{t}-V_{t-1}$—the sample may still provide useful information on average, that is, when considering $\mathbb{E}_{t-1}X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}$. Second, while the information acquired at a time $t$ might not significantly reduce the per-step regret bound $\sup_{\theta\in\Theta_{t+1}}D_{J_{\mathcal{X}}}(\theta_{\star},\theta)$ for the step immediately following it, it may prove useful at later steps. Figure 1 illustrates how non-optimistic samples provide useful information that is ignored by the optimistic proof approaches.

Our proof techniques

The key challenge in developing a non-optimistic proof for randomised algorithms in linear bandits is to directly analyse the dynamics of the exploration, that is, of the process $\{\Theta_{t}\}_{t\geq 0}$, and to relate this to the upper bound on the per-step regret process, $\sup_{\theta^{\prime},\theta\in\Theta_{t}}D_{J_{\mathcal{X}}}(\theta^{\prime},\theta)$. Interestingly, such an approach is closer to the analysis of Thompson sampling in the $K$-armed bandit setting, for which it is shown to be optimal [13, 3]. Within the proof of our regret bound, Theorem 4, we address the above points by:

  (i) Providing a new bound on $\sup_{\theta^{\prime},\theta\in\Theta_{t}}D_{J_{\mathcal{X}}}(\theta^{\prime},\theta)$, $t\geq 0$, by leveraging strong convexity and smoothness;

  (ii) Characterising the minimum amount of information acquired during interaction through a lower bound on $V_{t}$, where $V_{t}$ acts as a proxy for $\Theta_{t}$;

and connecting (i) and (ii) by studying the properties of the average per-step information $\mathbb{E}_{t-1}X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}$.

Comparison with forced exploration

[21] proposes a phased explore-then-commit algorithm that interleaves rounds of playing $d$ linearly independent actions with increasingly long exploitation phases, in which the estimated best action is selected. They prove a regret bound of the order $O((\|\theta_{\star}\|+1/\|\theta_{\star}\|)d\sqrt{n})$ for their approach, which notably behaves poorly as $\theta_{\star}\to 0$. This behaviour is because their exploration is isotropic—equal in all directions—and not directed by an estimate of $\theta_{\star}$. In contrast, randomised exploration algorithms account for structure by (i) taking $X_{t}=\nabla J(\theta_{t})$ (almost surely), which accounts for the geometry of the action-set, and (ii) sampling $\theta_{t}$ from a distribution concentrated on a scaled version of $\Theta_{t-1}$, which accounts for the current estimate of $\theta_{\star}$. One might interpret randomised algorithms as blending together the exploration and exploitation stages with a more careful balance between the two.

5 Proof of main result

We now prove our main result, Theorem 4. Here and in the appendices, we will write $J$ in place of $J_{\mathcal{X}}$, and we will work throughout on the probability $1-\delta$ event where $\theta_{\star}\in\cap_{t\geq 1}\Theta_{t-1}$.

We start by moving from $R_{n}$ to $\bar{R}_{n}:=\sum_{t=1}^{n}\mathbb{E}_{t-1}r_{t}$. This can be done by noting that $\xi_{t}=r_{t}-\mathbb{E}_{t-1}r_{t}$, $t\geq 1$, is a martingale difference sequence satisfying $|\xi_{t}|\lesssim\sqrt{d}$ for all $t\geq 1$, and applying a standard concentration inequality (included here as Lemma 8, Appendix A). From this, we conclude that with probability $1-\delta$, for all $n\geq 1$,

R_{n}\lesssim\bar{R}_{n}+\sqrt{dn\log(dn/\delta)}\,.

We now outline the three main results we use in bounding R¯n\bar{R}_{n}, and then show how they come together.

5.1 Regret decomposition & upper bound

Denote by $p_{t-1}$ the conditional probability of optimism $\mathbb{P}_{t-1}\{J(\theta_{t})\geq J(\theta_{\star})\}$ at time-step $t\geq 1$. Letting $\chi_{t-1}=\mathbf{1}[p_{t-1}\leq p]$ for a threshold $p\in(0,1)$, we now decompose the regret into that incurred in time-steps where $p_{t-1}$ is high, and those where it is low (we take $p=1/(16K^{4})$, where $K$ is the constant appearing in Assumption 1):

\bar{R}_{n}=\sum_{t=1}^{n}\chi_{t-1}\mathbb{E}_{t-1}r_{t}+\sum_{t=1}^{n}(1-\chi_{t-1})\mathbb{E}_{t-1}r_{t}
\lesssim\underbrace{\frac{M(\beta_{n}^{2}\vee K^{2}d)}{J(\theta_{\star})}\sum_{t=1}^{n}\chi_{t-1}\sup_{u\in\mathbb{B}^{d}_{2}}\|V_{t-1}^{-1/2}u\|_{*}^{2}}_{=:\bar{R}^{\text{TS}}_{n}}+\underbrace{K^{4}(\beta_{n}\vee K\sqrt{d})\sum_{t=1}^{n}\mathbb{E}_{t-1}\|X_{t}\|_{V_{t-1}^{-1}}}_{=:\bar{R}^{\text{OPT}}_{n}}\,. (3)

The derivation of this bound is presented in Appendix B. It is based on repeatedly applying properties of Bregman divergences and convex functions. At a high level, we introduce $\theta_{t}^{\prime}$, which is, conditionally on $\mathbb{F}_{t-1}$, an independent copy of $\theta_{t}$; we then condition on the event $\{J(\theta^{\prime}_{t})\leq J(\theta_{\star})\}$ (respectively its converse for the second term), and integrate $\theta_{t}^{\prime}$ out.

Examining the two terms, $\bar{R}^{\text{OPT}}_{n}$ is a term that appears in the standard regret analysis of optimistic algorithms, and is easily handled using a concentration argument (Lemma 9) and the elliptical potential lemma (Lemma 10); this yields

\bar{R}^{\text{OPT}}_{n}\lesssim(\beta_{n}\vee K\sqrt{d})K^{4}\sqrt{n(d\log(1+n/(d\lambda))+\log(1/\delta))}\,,

a term featuring in our overall regret bound. The term $\bar{R}^{\text{TS}}_{n}$ is a cost associated with randomised exploration: it is the sum of the sizes of the parameter sampling distributions (or confidence sets, as these are the same up to scaling), where size is measured in the geometry induced by $\|\cdot\|_{*}$.

5.2 Relating confidence widths to the amount of exploration

The challenge is now to show that $V_{t}$ grows sufficiently fast, measured with respect to the geometry induced by $\|\cdot\|_{*}$, such that $\bar{R}^{\text{TS}}_{n}$ is small. First, we relate the width $\|V_{t-1}^{-1/2}u\|_{*}$ to the expected amount of exploration in the direction of $u\in\mathbb{B}^{d}_{2}$ at step $t$, with the latter measured in the $\ell_{2}$ norm, $\|\cdot\|$, at a cost of $1/m$ from $m$-strong convexity. This is a change of geometry lemma:

Lemma 5.

For all $t\geq 1$ with $p_{t-1}\leq 1/(16K^{4})$, and for any $u\in\mathbb{B}^{d}_{2}$,

\frac{1}{J(\theta_{\star})}\|V_{t-1}^{-1/2}u\|_{*}^{2}\precsim\frac{K}{m}\|\mathbb{E}_{t-1}[X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}]^{1/2}V_{t-1}^{-1/2}u\|\,.
Remark 3.

When $\mathcal{X}=\mathbb{B}^{d}_{2}$, we have $m=1$ for $\|\cdot\|_{*}=\|\cdot\|$, and thus no change of geometry is needed. In that case $X_{t}=\theta_{t}/\|\theta_{t}\|$ almost surely, and $J(\theta)=\|\theta\|$ for all $\theta\in\mathbb{R}^{d}$, and so

\mathbb{E}_{t-1}X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}=\mathbb{E}_{t-1}[\theta_{t}\theta_{t}^{\mkern-1.5mu\mathsf{T}}/\|\theta_{t}\|^{2}]\approx\frac{1}{\|\theta_{\star}\|^{2}}\mathbb{E}_{t-1}\theta_{t}\theta_{t}^{\mkern-1.5mu\mathsf{T}}\succeq\frac{1}{\|\theta_{\star}\|^{2}}\mathrm{Var}_{t-1}\,\theta_{t}=\frac{K^{2}}{J^{2}(\theta_{\star})}V_{t-1}^{-1}\,,

where, for the sake of exposition, we allowed ourselves the simplifying assumption that the confidence sets and perturbations are concentrated sufficiently to ensure that $1/\|\theta_{t}\|\approx 1/\|\theta_{\star}\|$.
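As a rough numerical illustration of this heuristic (ours, and subject to the same simplifying assumption as the remark), the following sketch estimates $\mathbb{E}_{t-1}X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}$ by Monte Carlo for the unit-ball action set with Gaussian perturbations and compares it with $\|\theta_{\star}\|^{-2}V_{t-1}^{-1}$ in the positive-semidefinite order; all numerical choices here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
theta_star = np.array([2.0, 0.5, -1.0, 0.3])
A_ = rng.standard_normal((d, d))
V = A_ @ A_.T + 50.0 * np.eye(d)        # an arbitrary, well-conditioned design matrix V_{t-1}

w, U = np.linalg.eigh(V)
V_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T

theta_hat = theta_star                   # pretend the estimate has concentrated on theta_star
thetas = theta_hat + rng.standard_normal((100_000, d)) @ V_inv_sqrt   # samples of theta_t
X = thetas / np.linalg.norm(thetas, axis=1, keepdims=True)            # X_t = theta_t / ||theta_t||
EXX = X.T @ X / len(X)                                                # Monte Carlo E[X_t X_t^T]

target = np.linalg.inv(V) / np.linalg.norm(theta_star) ** 2
# smallest eigenvalue of the difference: expected to be roughly nonnegative,
# up to Monte Carlo noise and the approximation 1/||theta_t|| ~ 1/||theta_star||
print(np.linalg.eigvalsh(EXX - target).min())
```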

We present the proof of Lemma 5 in Appendix C. Once again, we proceed by introducing a random variable $\theta_{t}^{\prime}$ with the same $\mathbb{F}_{t-1}$-conditional law as $\theta_{t}$; however, this time, we couple $\theta_{t}$ and $\theta_{t}^{\prime}$ closely, in that they differ only in the $u$ marginal (along which they are independent). We then proceed with a convex Poincaré inequality-style argument along the $u$ direction, which relates $X_{t}=\nabla J(\theta_{t})$ and the matrix $V_{t-1}^{-1}$, with the latter being essentially the conditional variance of $\theta_{t}$.

5.3 Establishing the growth of the design matrices

The final ingredient is the following relation between the sum $\sum_{t=1}^{n}\mathbb{E}_{t-1}X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}$ of the conditional expected increments in the design matrices and its realisation $\sum_{t=1}^{n}X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}$.

Lemma 6.

For any $r\in(0,1]$ and $\delta\in(0,1)$, with probability at least $1-\delta$, for all $n\geq 1$ and all $u\in r\mathbb{B}^{d}_{2}$,

u^{\mkern-1.5mu\mathsf{T}}\sum_{t=1}^{n}X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}u+J^{2}(u)\omega_{n}+5\geq\frac{1}{2}\sum_{t=1}^{n}u^{\mkern-1.5mu\mathsf{T}}\mathbb{E}_{t-1}[X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}]u\,,

where $\omega_{n}=d\log(20d^{3}n^{2}r/\delta^{2})$.

Remark 4.

A standard matrix Chernoff inequality (see [26]; this exact inequality is not stated there, but all the tools needed to derive it are) gives that with probability $1-\delta$, for all $n\geq 1$,

\sum_{t=1}^{n}X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}+\log(d/\delta)\,I\succeq\frac{1}{2}\sum_{t=1}^{n}\mathbb{E}_{t-1}X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}\,, (4)

where $\succeq$ denotes the usual ordering on positive-semidefinite matrices. For the $\ell_{2}$ ball, Equation 4 serves the same role as Lemma 6, but is tighter. However, in the general setting where $J(u)\neq\|u\|$, it is crucial that we obtain the $J^{2}(u)$ dependence seen in Lemma 6.

The proof of Lemma 6 is presented in Appendix D. It uses Lemma 9, a one-dimensional version of the inequality given in Equation 4, and applies it to the process $(\langle u,X_{t}\rangle^{2}\colon t\geq 1)$ for all $u$ in a time-dependent cover of $r\mathbb{B}^{d}_{2}$. A union bound over the size of the cover is responsible for the $\omega_{n}$ term, and the discretisation error involved in the covering argument yields the additive $5$.

5.4 Putting everything together

Let $N_{n}=\sum_{t=1}^{n}\chi_{t-1}$ be the number of steps up to $n$ on which the conditional probability of optimism was below the threshold $p$. We will shortly show that Lemma 5 and Lemma 6, together with the assumed smoothness, yield the following bound:

Claim 7.

For all $t\geq 2$ and $u\in\mathbb{B}^{d}_{2}$,

\frac{1}{J(\theta_{\star})}\|V_{t-1}^{-1/2}u\|_{*}^{2}\lesssim\frac{M\omega_{t-1}K^{2}J(\theta_{\star})}{m^{2}N_{t-1}}+\frac{K}{m\sqrt{N_{t-1}}}\,.

First though, note that using Claim 7 within the regret decomposition of Equation 3 completes the proof. Indeed, using that the expected per-step regret is bounded by $2\|\theta_{\star}\|\lesssim\sqrt{d}$ (to handle the first step, which is not covered by Claim 7), and the usual integral bound for monotonic integrands, we have

\bar{R}^{\text{TS}}_{n}\lesssim\sqrt{d}+(\beta_{n}^{2}\vee K^{2}d)\int_{1}^{n}\left[\frac{M^{2}\omega_{t}K^{2}J(\theta_{\star})}{m^{2}N_{t}}+\frac{K}{m\sqrt{N_{t}}}\right]dt
\lesssim\sqrt{d}+d\sqrt{d}(\beta_{n}^{2}\vee K^{2}d)K^{2}\frac{M^{2}}{m^{2}}\log(dn/(\delta\lambda))\log n+\frac{M}{m}K(\beta_{n}^{2}\vee K^{2}d)\sqrt{n}\,,

which completes our bound (observe that the first two terms are lower order).

Proof of Claim 7.

We work on the probability $1-\delta$ event resulting from applying Lemma 6 with $r=1/\sqrt{\lambda}$. Since $u\in\mathbb{B}^{d}_{2}$, we have $V_{n}^{-1/2}u\in\mathbb{B}^{d}_{2}/\sqrt{\lambda}$, and thus for all $n\geq 1$,

J^{2}(V^{-1/2}_{n}u)\omega_{n}+6\geq\frac{1}{2}\sum_{t=1}^{n}\|\mathbb{E}_{t-1}[X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}]^{1/2}V^{-1/2}_{n}u\|^{2}\,. (5)

We now upper-bound the left-hand side and lower-bound the right-hand side of the above expression in turn.

For the upper bound, note that by $M$-smoothness, $J^{2}(V^{-1/2}_{n}u)\leq\frac{M}{2}\|V^{-1/2}_{n}u\|_{*}^{2}$.

For the lower bound of the right-hand side of Equation 5, we will use Lemma 5. Let $v_{t-1}=V^{1/2}_{t-1}V^{-1/2}_{n}u$, and note that since $V_{t-1}\preceq V_{n}$, we have that $\|v_{t-1}\|\leq 1$. Now,

\sum_{t=1}^{n}\chi_{t-1}\|\mathbb{E}_{t-1}[X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}]^{1/2}V^{-1/2}_{n}u\|^{2}=\sum_{t=1}^{n}\chi_{t-1}\|v_{t-1}\|^{2}\Big\|\mathbb{E}_{t-1}[X_{t}X_{t}^{\mkern-1.5mu\mathsf{T}}]^{1/2}V_{t-1}^{-1/2}\frac{v_{t-1}}{\|v_{t-1}\|}\Big\|^{2}
\gtrsim\frac{m^{2}}{K^{2}J^{2}(\theta_{\star})}\sum_{t=1}^{n}\chi_{t-1}\frac{\|V_{t-1}^{-1/2}v_{t-1}\|_{*}^{4}}{\|v_{t-1}\|^{2}} (Lemma 5)
\geq\frac{m^{2}}{K^{2}J^{2}(\theta_{\star})}N_{n}\|V^{-1/2}_{n}u\|_{*}^{4}\,. ($\|v_{t-1}\|\leq 1$)

Combining our lower and upper bounds on Equation 5, writing $\alpha_{n}=Cm^{2}N_{n}/K^{2}$ for a numerical constant $C>0$ and letting $y=\frac{1}{J(\theta_{\star})}\|V^{-1/2}_{n}u\|_{*}^{2}$, we obtain the quadratic

-\alpha_{n}y^{2}+M\omega_{n}J(\theta_{\star})y+6\geq 0\,.

Solving for $y$, we have

\|V^{-1/2}_{n}u\|_{*}^{2}\leq\frac{M\omega_{n}J(\theta_{\star})+\sqrt{M^{2}\omega_{n}^{2}J^{2}(\theta_{\star})+24\alpha_{n}}}{2\alpha_{n}}\leq\frac{M\omega_{n}J(\theta_{\star})}{\alpha_{n}}+\sqrt{\frac{6}{\alpha_{n}}}\,,

whence relabelling $n\mapsto t-1$ concludes the proof. ∎

6 Conclusion

In this paper, we have presented a new analysis of randomised exploration algorithms for the linear bandit setting, which establishes that, given a nice-enough action set, randomised algorithms can obtain the optimal dependence on the dimension of the problem without the need for any algorithmic modifications. Our improved regret bound requires that the action space satisfies a smoothness and strong convexity condition, Assumption 2, which ensures that small perturbations in the parameter space translate directly to at least some perturbation in the action space, while also guaranteeing that these do not lead to large changes in the instantaneous regret.

Our results complement the lower bounds by [8, 29] which show that linear Thompson sampling can suffer linear regret in particular settings where the connection between randomness in the parameter and action spaces is broken. However, these results together still do not give a complete characterisation of when randomised exploration algorithms can and cannot achieve the optimal rate of regret in the linear bandit setting: it remains an important open problem to understand exactly which action spaces permit an optimal dependence on the dimension.

Acknowledgements

This project started in earnest as a result of discussions taking place at the 2023 Workshop on the Theory of Reinforcement Learning in Edmonton. We thank Csaba Szepesvári for organising this workshop, and for feedback on early versions of this work. DJ & MA thank Gergely Neu for putting them in touch with CP-B, who was working contemporaneously on the same problem.

References

  • [1] Yasin Abbasi-Yadkori, Dávid Pál and Csaba Szepesvári “Improved algorithms for linear stochastic bandits” In Advances in Neural Information Processing Systems, 2011
  • [2] Marc Abeille and Alessandro Lazaric “Linear Thompson sampling revisited” In Electronic Journal of Statistics 11.2, 2017, pp. 5165–5197
  • [3] Shipra Agrawal and Navin Goyal “Analysis of Thompson sampling for the multi-armed bandit problem” In Conference on Learning Theory, 2012
  • [4] Shipra Agrawal and Navin Goyal “Thompson sampling for contextual bandits with linear payoffs” In International Conference on Machine Learning, 2013
  • [5] Peter Auer “Using confidence bounds for exploitation-exploration trade-offs” In Journal of Machine Learning Research 3.Nov, 2002, pp. 397–422
  • [6] Olivier Chapelle and Lihong Li “An empirical evaluation of Thompson sampling” In Advances in Neural Information Processing Systems, 2011
  • [7] Varsha Dani, Thomas P Hayes and Sham M Kakade “Stochastic linear optimization under bandit feedback” In Conference on Learning Theory, 2008
  • [8] Nima Hamidi and Mohsen Bayati “On the frequentist regret of linear Thompson sampling” In arXiv preprint arXiv:2006.06790, 2023
  • [9] Junya Honda and Akimichi Takemura “Optimality of Thompson sampling for Gaussian bandits depends on priors” In Artificial Intelligence and Statistics, 2014
  • [10] Tom Huix, Matthew Zhang and Alain Durmus “Tight regret and complexity bounds for Thompson sampling via Langevin Monte Carlo” In International Conference on Artificial Intelligence and Statistics, 2023
  • [11] David Janz, Alexander E Litvak and Csaba Szepesvári “Ensemble sampling for linear bandits: small ensembles suffice” In Advances in Neural Information Processing Systems, 2024
  • [12] David Janz, Shuai Liu, Alex Ayoub and Csaba Szepesvári “Exploration via linearly perturbed loss minimisation” In International Conference on Artificial Intelligence and Statistics, 2024
  • [13] Emilie Kaufmann, Nathaniel Korda and Rémi Munos “Thompson sampling: An asymptotically optimal finite-time analysis” In International Conference on Algorithmic Learning Theory, 2012
  • [14] Nathaniel Korda, Emilie Kaufmann and Remi Munos “Thompson sampling for 1-dimensional exponential family bandits” In Advances in Neural Information Processing Systems, 2013
  • [15] Branislav Kveton et al. “Randomized exploration in generalized linear bandits” In International Conference on Artificial Intelligence and Statistics, 2020
  • [16] Tor Lattimore and Csaba Szepesvari “The end of optimism? An asymptotic analysis of finite-armed linear bandits” In Artificial Intelligence and Statistics, 2017, pp. 728–737 PMLR
  • [17] Tor Lattimore and Csaba Szepesvári “Bandit algorithms” Cambridge University Press, 2020
  • [18] Xiuyuan Lu and Benjamin Van Roy “Ensemble sampling” In Advances in Neural Information Processing Systems, 2017
  • [19] Benedict C May, Nathan Korda, Anthony Lee and David S Leslie “Optimistic Bayesian sampling in contextual-bandit problems” In Journal of Machine Learning Research 13, 2012, pp. 2069–2106
  • [20] Ralph Tyrrell Rockafellar “Convex Analysis” Princeton University Press, 1970
  • [21] Paat Rusmevichientong and John N Tsitsiklis “Linearly parameterized bandits” In Mathematics of Operations Research 35.2 INFORMS, 2010, pp. 395–411
  • [22] Daniel Russo and Benjamin Van Roy “An information-theoretic analysis of Thompson sampling” In The Journal of Machine Learning Research 17.1 JMLR, 2016, pp. 2442–2471
  • [23] Daniel Russo and Benjamin Van Roy “Learning to optimize via posterior sampling” In Mathematics of Operations Research 39.4 INFORMS, 2014, pp. 1221–1243
  • [24] Daniel Russo et al. “A tutorial on Thompson sampling” In Foundations and Trends® in Machine Learning 11.1 Now Publishers, Inc., 2018, pp. 1–96
  • [25] William R Thompson “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples” In Biometrika 25.3-4 Oxford University Press, 1933, pp. 285–294
  • [26] Joel A Tropp “User-friendly tail bounds for sums of random matrices” In Foundations of Computational Mathematics 12 Springer, 2012, pp. 389–434
  • [27] Sharan Vaswani, Abbas Mehrabian, Audrey Durand and Branislav Kveton “Old Dog Learns New Tricks: Randomized UCB for Bandit Problems” In International Conference on Artificial Intelligence and Statistics, 2020
  • [28] Ruitu Xu, Yifei Min and Tianhao Wang “Noise-adaptive Thompson sampling for linear contextual bandits” In Advances in Neural Information Processing Systems 36, 2023
  • [29] Tong Zhang “Feel-good Thompson sampling for contextual bandits and reinforcement learning” In SIAM Journal on Mathematics of Data Science 4.2 SIAM, 2022, pp. 834–857

Appendix A Some standard results

The following lemma is adapted from Exercise 20.8 in [17].

Lemma 8.

Fix $0<\delta\leq 1$. Let $(\xi_{t})_{t\in\mathbb{N}^{+}}$ be a real-valued martingale difference sequence satisfying $|\xi_{t}|\leq c$ almost surely for each $t\in\mathbb{N}^{+}$ and some $c>0$. Then,

\mathbb{P}\left(\exists n\colon\Big(\sum_{t=1}^{n}\xi_{t}\Big)^{2}\geq 2(c^{2}n+1)\log\left(\sqrt{c^{2}n+1}/\delta\right)\right)\leq\delta\,.

Next is a second concentration inequality that we require.

Lemma 9.

Let $(\alpha_{t})_{t\geq 1}$ be a sequence of random variables adapted to a filtration $(\mathbb{F}_{t})_{t\geq 0}$ with $0\leq\alpha_{t}\leq R$ for all $t\geq 0$. Then, for all $\delta\in(0,1)$,

\mathbb{P}\left\{\sum_{t=1}^{n}\alpha_{t}\geq(1-1/e)\sum_{t=1}^{n}\mathbb{E}[\alpha_{t}\mid\mathbb{F}_{t-1}]-R\log(1/\delta)\,,\ \forall n\geq 1\right\}\geq 1-\delta\,.
Proof of Lemma 9.

By rescaling, we need only consider the case where $R=1$. Let $(S_{n})_{n\geq 0}$ be the random process defined by $S_{0}=1$ and

S_{n}=\exp\!\bigg((1-1/e)\sum_{t=1}^{n}\mathbb{E}[\alpha_{t}\mid\mathbb{F}_{t-1}]-\sum_{t=1}^{n}\alpha_{t}\bigg)\,.

Observe that $S_{n}\geq 0$ for all $n\geq 0$. Moreover, since $\alpha_{n+1}\in[0,1]$, convexity of $x\mapsto e^{-x}$ gives $e^{-\alpha_{n+1}}\leq 1-(1-1/e)\alpha_{n+1}$; taking conditional expectations and using $1+x\leq e^{x}$, for any $n\geq 0$,

\mathbb{E}[\exp\{-\alpha_{n+1}\}\mid\mathbb{F}_{n}]\leq 1-(1-1/e)\mathbb{E}[\alpha_{n+1}\mid\mathbb{F}_{n}]\leq\exp\{-(1-1/e)\mathbb{E}[\alpha_{n+1}\mid\mathbb{F}_{n}]\}\,,

and we have that for all $n\geq 0$,

\mathbb{E}[S_{n+1}\mid\mathbb{F}_{n}]=S_{n}\,\mathbb{E}[\exp\{(1-1/e)\mathbb{E}[\alpha_{n+1}\mid\mathbb{F}_{n}]-\alpha_{n+1}\}\mid\mathbb{F}_{n}]\leq S_{n}\,.

Therefore, $(S_{n})_{n\geq 0}$ is a non-negative supermartingale. Applying Ville's inequality yields the result. ∎

The following is an adaptation of Lemma 19.4 in [17].

Lemma 10 (Elliptical potential lemma).

Fix $\lambda>0$ and a sequence $a_{1},a_{2},\dots$ in $\mathbb{B}^{d}_{2}$. Then, letting $V_{n}=\sum_{t=1}^{n}a_{t}a_{t}^{\mkern-1.5mu\mathsf{T}}+\lambda I$, we have that for all $n\geq 1$,

\sum_{t=1}^{n}\|a_{t}\|_{V_{t-1}^{-1}}^{2}\leq 2d\log(1+n/(d\lambda))\,.
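As a sanity check (ours, not part of the paper), the following sketch evaluates both sides of Lemma 10 on a random sequence of unit vectors; we take $\lambda=1$ here, and all numerical choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 5, 2000, 1.0
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # a_t in the unit ball

V = lam * np.eye(d)
lhs = 0.0
for a in A:                                      # accumulate sum_t ||a_t||^2_{V_{t-1}^{-1}}
    lhs += a @ np.linalg.solve(V, a)
    V += np.outer(a, a)

rhs = 2 * d * np.log(1 + n / (d * lam))
print(lhs, "<=", rhs)                            # lhs is expected not to exceed rhs
```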

Appendix B Derivation of the regret decomposition upper bound (Equation 3)

Let $p_{t-1}=\mathbb{P}_{t-1}\{J(\theta_{t})\geq J(\theta_{\star})\}$ be the (conditional) probability of optimism at step $t\geq 1$, and let $\tilde{\beta}_{t-1}=\beta_{t-1}\vee K\sqrt{d}$. We now derive two bounds separately. When $p_{t-1}$ is high, we will use

\mathbb{E}_{t-1}D_{J}(\theta_{\star},\theta_{t})\leq\frac{4\tilde{\beta}_{t-1}}{p_{t-1}}\mathbb{E}_{t-1}\!\left[\|X_{t}\|_{V_{t-1}^{-1}}^{2}\right]^{\frac{1}{2}}\,. (6)

When $p_{t-1}$ is low, we will prefer the bound

\mathbb{E}_{t-1}D_{J}(\theta_{\star},\theta_{t})\leq\frac{1}{1-p_{t-1}}\left\{\frac{4\pi\tilde{\beta}_{t-1}^{2}}{J(\theta_{\star})}\sup_{u\in\mathbb{B}^{d}_{2}}\|V_{t-1}^{-1/2}u\|_{*}^{2}+6\tilde{\beta}_{t-1}\mathbb{E}_{t-1}\!\left[\|X_{t}\|_{V_{t-1}^{-1}}^{2}\right]^{\frac{1}{2}}\right\}\,. (7)

Combining these two bounds with our regret decomposition establishes Equation 3.

We will derive the two bounds, Equations 6 and 7, using similar techniques. Let $P_{t-1}(A)=\mathbb{P}_{t-1}\{\theta_{t}\in A\}$. Both derivations will make use of the following estimates.

Claim 11.

For any norm $F$ on $\mathbb{R}^{d}$,

\int F^{2}(\theta_{\star}-a)P_{t-1}(da)\;\vee\int F^{2}(a-b)P_{t-1}^{2}(da\times db)\leq 4(\beta_{t-1}^{2}\vee K^{2}d)\sup_{u\in\mathbb{B}^{d}_{2}}F^{2}(V_{t-1}^{-1/2}u)\,.
Proof.

Letting $\eta_{a}$ and $\eta_{b}$ be independent copies of $\eta_{t}$, we can express $a$ and $b$ as

a=\hat{\theta}_{t-1}+V_{t-1}^{-1/2}\eta_{a}\quad\text{and}\quad b=\hat{\theta}_{t-1}+V_{t-1}^{-1/2}\eta_{b}\,.

Denote by $\mathbb{E}_{\eta_{a},\eta_{b}}$ the expectation over $\eta_{a}$ and $\eta_{b}$. We have that

\int F^{2}(a-b)P_{t-1}^{2}(da\times db)=\mathbb{E}_{\eta_{a},\eta_{b}}F^{2}(V_{t-1}^{-1/2}(\eta_{a}-\eta_{b}))
\leq 4\mathbb{E}_{\eta_{a}}\|\eta_{a}\|^{2}\sup_{u\in\mathbb{B}^{d}_{2}}F^{2}(V_{t-1}^{-1/2}u)
\leq 4K^{2}d\sup_{u\in\mathbb{B}^{d}_{2}}F^{2}(V_{t-1}^{-1/2}u)\,,

where we used that $\mathbb{E}_{\eta_{b}}\|\eta_{b}\|^{2}=\mathbb{E}_{\eta_{a}}\|\eta_{a}\|^{2}=\mathbb{E}_{t-1}\|\eta_{t}\|^{2}\leq K^{2}d$.

Expressing $\theta_{\star}=\hat{\theta}_{t-1}+\beta_{t-1}V_{t-1}^{-1/2}u^{\prime}$ for some $u^{\prime}\in\mathbb{B}^{d}_{2}$ (which we can do due to the implicit assumption that $\theta_{\star}\in\Theta_{t-1}$) and using the same approach, we obtain the other part of the bound. ∎

Derivation of Equation 6.

For almost every $\theta_{t},\theta_{t}^{\prime}\in\mathbb{R}^{d}$ such that $J(\theta_{t}^{\prime})\geq J(\theta_{\star})$,

D_{J}(\theta_{\star},\theta_{t})\leq J(\theta_{t}^{\prime})-J(\theta_{t})-\langle\nabla J(\theta_{t}),\theta_{\star}-\theta_{t}\rangle ($J(\theta^{\prime}_{t})\geq J(\theta_{\star})$)
\leq\langle\nabla J(\theta_{t}^{\prime}),\theta_{t}^{\prime}-\theta_{t}\rangle-\langle\nabla J(\theta_{t}),\theta_{\star}-\theta_{t}\rangle (convexity)
\leq\|\nabla J(\theta_{t}^{\prime})\|_{V_{t-1}^{-1}}\|\theta_{t}^{\prime}-\theta_{t}\|_{V_{t-1}}+\|\nabla J(\theta_{t})\|_{V_{t-1}^{-1}}\|\theta_{\star}-\theta_{t}\|_{V_{t-1}}\,. (Cauchy-Schwarz)

Now let $Q$ be a measure on $\mathbb{R}^{d}$ given by

Q(A)=\begin{cases}\tfrac{1}{p_{t-1}}P_{t-1}(A\cap\{\theta\in\mathbb{R}^{d}\colon J(\theta)\geq J(\theta_{\star})\})\,,&p_{t-1}\neq 0\,,\\ \text{any arbitrary measure}\,,&\text{otherwise.}\end{cases}

Since the bound above holds for almost all $\theta^{\prime}_{t}\in\mathbb{R}^{d}$ such that $J(\theta^{\prime}_{t})\geq J(\theta_{\star})$, and $Q$ is a diffuse measure on that set, it also holds on average for $\theta^{\prime}_{t}\sim Q$. Integrating with respect to $Q$ and $P_{t-1}$,

\mathbb{E}_{t-1}D_{J}(\theta_{\star},\theta_{t})\leq\int\|\nabla J(\theta_{t}^{\prime})\|_{V_{t-1}^{-1}}\|\theta_{t}^{\prime}-\theta_{t}\|_{V_{t-1}}(P_{t-1}\otimes Q)(d\theta_{t}\times d\theta_{t}^{\prime})
+\int\|\nabla J(\theta_{t})\|_{V_{t-1}^{-1}}\|\theta_{\star}-\theta_{t}\|_{V_{t-1}}P_{t-1}(d\theta_{t})\,.

For the first integral,

\int\|\nabla J(\theta_{t}^{\prime})\|_{V_{t-1}^{-1}}\|\theta_{t}^{\prime}-\theta_{t}\|_{V_{t-1}}(P_{t-1}\otimes Q)(d\theta_{t}\times d\theta_{t}^{\prime}) (8)
\leq\frac{1}{p_{t-1}}\int\|\nabla J(\theta_{t}^{\prime})\|_{V_{t-1}^{-1}}\|\theta_{t}^{\prime}-\theta_{t}\|_{V_{t-1}}P_{t-1}^{2}(d\theta_{t}\times d\theta_{t}^{\prime}) (for all $f\geq 0$, $\int f\,dQ\leq\frac{1}{p_{t-1}}\int f\,dP_{t-1}$)
\leq\frac{1}{p_{t-1}}\left[\int\|\nabla J(\theta_{t}^{\prime})\|_{V_{t-1}^{-1}}^{2}P_{t-1}(d\theta_{t}^{\prime})\int\|\theta_{t}^{\prime}-\theta_{t}\|_{V_{t-1}}^{2}P_{t-1}^{2}(d\theta_{t}\times d\theta_{t}^{\prime})\right]^{1/2} (Cauchy-Schwarz)
\leq\frac{2(\beta_{t-1}\vee K\sqrt{d})}{p_{t-1}}\left[\int\|\nabla J(\theta_{t})\|_{V_{t-1}^{-1}}^{2}P_{t-1}(d\theta_{t})\right]^{1/2}\,. (Claim 11)

Finally, since $\nabla J(\theta_{t})=X_{t}$ almost surely, $\mathbb{E}_{t-1}[\|\nabla J(\theta_{t})\|_{V_{t-1}^{-1}}^{2}]^{1/2}=\mathbb{E}_{t-1}[\|X_{t}\|_{V_{t-1}^{-1}}^{2}]^{1/2}$.

The second integral follows likewise, with the addition of multiplying the resulting nonnegative bound by $1/p_{t-1}\geq 1$ to keep things tidy. ∎

For the steps with a low probability of optimism, we will need the following property of Bregman divergences:

Lemma 12 (Law of cosines).

For any convex function $f\colon\mathbb{R}^{d}\to\mathbb{R}$, all $x\in\mathbb{R}^{d}$ and almost all $y,z\in\mathbb{R}^{d}$,

D_{f}(x,y)=D_{f}(x,z)+D_{f}(z,y)-\langle x-z,\nabla f(y)-\nabla f(z)\rangle\,.
Derivation of Equation 7.

For almost all $\theta_{t},\theta_{t}^{\prime}\in\mathbb{R}^{d}$,

D_{J}(\theta_{\star},\theta_{t})=D_{J}(\theta_{\star},\theta_{t}^{\prime})+D_{J}(\theta_{t}^{\prime},\theta_{t})-\langle\theta_{\star}-\theta_{t}^{\prime},\nabla J(\theta_{t})-\nabla J(\theta_{t}^{\prime})\rangle (law of cosines)
\leq D_{J}(\theta_{\star},\theta_{t}^{\prime})+\langle\theta_{\star}-\theta_{t},\nabla J(\theta_{t}^{\prime})-\nabla J(\theta_{t})\rangle (convexity of $J$ in $D_{J}$)
\leq D_{J}(\theta_{\star},\theta_{t}^{\prime})+\|\nabla J(\theta_{t}^{\prime})-\nabla J(\theta_{t})\|_{V_{t-1}^{-1}}\|\theta_{\star}-\theta_{t}\|_{V_{t-1}}\,. (Cauchy-Schwarz)

Also, for almost every $\theta_{t}^{\prime}\in\mathbb{R}^{d}$ satisfying $J(\theta_{t}^{\prime})\leq J(\theta_{\star})$,

D_{J}(\theta_{\star},\theta_{t}^{\prime})=J(\theta_{\star})-J(\theta_{t}^{\prime})-\langle\nabla J(\theta_{t}^{\prime}),\theta_{\star}-\theta_{t}^{\prime}\rangle
=\frac{1}{J(\theta_{\star})}\left[J^{2}(\theta_{\star})-J(\theta_{t}^{\prime})J(\theta_{\star})-\langle 2J(\theta_{t}^{\prime})\nabla J(\theta_{t}^{\prime}),\theta_{\star}-\theta_{t}^{\prime}\rangle\right]+\bigg(2\frac{J(\theta_{t}^{\prime})}{J(\theta_{\star})}-1\bigg)\langle\nabla J(\theta_{t}^{\prime}),\theta_{\star}-\theta_{t}^{\prime}\rangle
\leq\frac{1}{J(\theta_{\star})}\left[J^{2}(\theta_{\star})-J^{2}(\theta_{t}^{\prime})-\langle 2J(\theta_{t}^{\prime})\nabla J(\theta_{t}^{\prime}),\theta_{\star}-\theta_{t}^{\prime}\rangle\right]+|\langle\nabla J(\theta_{t}^{\prime}),\theta_{\star}-\theta_{t}^{\prime}\rangle| ($0<J(\theta_{t}^{\prime})\leq J(\theta_{\star})$)
=\frac{1}{J(\theta_{\star})}D_{\!J^{2}}(\theta_{\star},\theta_{t}^{\prime})+|\langle\nabla J(\theta_{t}^{\prime}),\theta_{\star}-\theta_{t}^{\prime}\rangle| ($2J(\theta_{t}^{\prime})\nabla J(\theta_{t}^{\prime})=\nabla J^{2}(\theta_{t}^{\prime})$ a.e.)
\leq\frac{1}{J(\theta_{\star})}D_{\!J^{2}}(\theta_{\star},\theta_{t}^{\prime})+\|\nabla J(\theta_{t}^{\prime})\|_{V_{t-1}^{-1}}\|\theta_{\star}-\theta_{t}^{\prime}\|_{V_{t-1}}\,. (Cauchy-Schwarz)

Combining the above two bounds, we have that for almost all $\theta_{t},\theta_{t}^{\prime}\in\mathbb{R}^{d}$, if $J(\theta_{t}^{\prime})\leq J(\theta_{\star})$, then

D_{J}(\theta_{\star},\theta_{t})\leq\frac{1}{J(\theta_{\star})}D_{\!J^{2}}(\theta_{\star},\theta_{t}^{\prime})+\|\nabla J(\theta_{t})\|_{V_{t-1}^{-1}}\|\theta_{\star}-\theta_{t}\|_{V_{t-1}}
+\|\nabla J(\theta_{t}^{\prime})\|_{V_{t-1}^{-1}}\left[\|\theta_{\star}-\theta_{t}\|_{V_{t-1}}+\|\theta_{\star}-\theta_{t}^{\prime}\|_{V_{t-1}}\right]\,. (9)

Now let $Q$ be a measure on $\mathbb{R}^{d}$ given by

Q(A)=\begin{cases}\tfrac{1}{1-p_{t-1}}P_{t-1}(A\cap\{\theta\in\mathbb{R}^{d}\colon J(\theta)\leq J(\theta_{\star})\})\,,&p_{t-1}\neq 1\,,\\ \text{any arbitrary measure}\,,&\text{otherwise.}\end{cases}

Since Equation 9 holds for almost all $\theta_{t},\theta_{t}^{\prime}\in\mathbb{R}^{d}$ with $J(\theta_{t}^{\prime})\leq J(\theta_{\star})$ and $Q,P_{t-1}$ are non-atomic, it also holds on average for $\theta_{t}^{\prime}\sim Q$ and $\theta_{t}\sim P_{t-1}$. Integrating, we see that $\mathbb{E}_{t-1}D_{J}(\theta_{\star},\theta_{t})$ is upper bounded by

\frac{1}{J(\theta_{\star})}\int D_{\!J^{2}}(\theta_{\star},\theta_{t}^{\prime})Q(d\theta_{t}^{\prime})+\int\|\nabla J(\theta_{t}^{\prime})\|_{V_{t-1}^{-1}}Q(d\theta_{t}^{\prime})\int\|\theta_{\star}-\theta_{t}\|_{V_{t-1}}P_{t-1}(d\theta_{t}) (10)
+\int\|\nabla J(\theta_{t}^{\prime})\|_{V_{t-1}^{-1}}\|\theta_{\star}-\theta_{t}^{\prime}\|_{V_{t-1}}Q(d\theta_{t}^{\prime})+\int\|\nabla J(\theta_{t})\|_{V_{t-1}^{-1}}\|\theta_{\star}-\theta_{t}\|_{V_{t-1}}P_{t-1}(d\theta_{t})\,.

For the first integral, we can use that for any $f\geq 0$, $\int f\,dQ\leq\frac{1}{1-p_{t-1}}\int f\,dP_{t-1}$, to establish that

\int D_{\!J^{2}}(\theta_{\star},\theta_{t}^{\prime})Q(d\theta_{t}^{\prime})\leq\frac{1}{1-p_{t-1}}\int D_{\!J^{2}}(\theta_{\star},\theta_{t}^{\prime})P_{t-1}(d\theta_{t}^{\prime})=\frac{1}{1-p_{t-1}}\mathbb{E}_{t-1}D_{\!J^{2}}(\theta_{\star},\theta_{t})\,,

where the final equality follows since $\theta_{t}^{\prime}\sim P_{t-1}$ has the same law as $\theta_{t}$ conditioned on $\mathbb{F}_{t-1}$. Now, by Assumption 2, and then using the estimate from Claim 11,

\mathbb{E}_{t-1}D_{\!J^{2}}(\theta_{\star},\theta_{t})\leq\pi\,\mathbb{E}_{t-1}\|\theta_{\star}-\theta_{t}\|_{*}^{2}\leq 4\pi(\beta_{t-1}^{2}\vee K^{2}d)\sup_{u\in\mathbb{B}^{d}_{2}}\|V_{t-1}^{-1/2}u\|_{*}^{2}\,.

Bounding the remaining integrals in Equation 10 can be done by following the same steps as for the integral in Equation 8 of the optimistic bound, just with \frac{1}{1-p_{t-1}} in place of \frac{1}{p_{t-1}}. ∎
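As a quick numerical illustration of the restriction step used for the first integral above, the following minimal Monte Carlo sketch in Python checks that \int f\,dQ\leq\frac{1}{1-p_{t-1}}\int f\,dP_{t-1} for a nonnegative f. The specific choices (a Gaussian stand-in for P_{t-1}, J taken to be the Euclidean norm, and p_{t-1} read as the probability that J(\theta_{t}) exceeds J(\theta_{\star})) are assumptions made purely for this sketch and are not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 3, 200_000

# Illustrative stand-ins (not from the paper): P_{t-1} is a shifted Gaussian,
# J is the Euclidean norm, theta_star is fixed, f is a nonnegative test function.
theta_star = np.ones(d)
theta = rng.standard_normal((n_samples, d)) + 0.5      # draws from P_{t-1}
J = lambda x: np.linalg.norm(x, axis=-1)
f = lambda x: (J(theta_star) - J(x)) ** 2              # any f >= 0 would do

below = J(theta) <= J(theta_star)      # the event that Q is restricted to
p = 1.0 - below.mean()                 # estimated P_{t-1}(J(theta_t) > J(theta_star))

lhs = (f(theta) * below).mean() / (1 - p)   # Monte Carlo estimate of  int f dQ
rhs = f(theta).mean() / (1 - p)             # Monte Carlo estimate of  (1/(1-p)) int f dP_{t-1}
print(f"int f dQ ~ {lhs:.4f}  <=  (1/(1-p)) int f dP ~ {rhs:.4f}")
assert lhs <= rhs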

Appendix C Proof of the change of geometry lemma (Lemma 5)

Let u^{\perp}\colon\mathbb{R}^{d}\to\mathbb{R}^{d} be a basis completion orthogonal to u (the projection onto the orthogonal complement of the span of u). Let \epsilon=\langle\eta_{t},u\rangle/\|u\|, let \tilde{\epsilon} be an independent copy of \epsilon, independent also of \mathbb{F}_{t-1}, and define

\tilde{\theta}_{t}=\hat{\theta}_{t-1}+V_{t-1}^{-1/2}u^{\perp}\eta_{t}+V_{t-1}^{-1/2}u\tilde{\epsilon}\,,\quad\text{observing that}\quad\theta_{t}-\tilde{\theta}_{t}=(\epsilon-\tilde{\epsilon})V_{t-1}^{-1/2}u\,.

Also define the indicators \iota=\mathbf{1}[J(\theta_{t})\leq J(\theta_{\star})] and \tilde{\iota}=\mathbf{1}[J(\tilde{\theta}_{t})\leq J(\theta_{\star})].
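Before the proof, here is a minimal numerical sketch of this coupling in Python. It assumes, purely for illustration, that the algorithm samples \theta_{t}=\hat{\theta}_{t-1}+V_{t-1}^{-1/2}\eta_{t} with a standard Gaussian perturbation \eta_{t} and that u is a unit vector; the only point being checked is that \theta_{t} and \tilde{\theta}_{t} differ only along the direction V_{t-1}^{-1/2}u.

import numpy as np

rng = np.random.default_rng(1)
d = 5

# Illustrative ingredients (not taken from the paper): a positive definite V_{t-1},
# an estimate theta_hat, a unit direction u and a standard Gaussian perturbation eta_t.
A = rng.standard_normal((d, d))
V = A @ A.T + np.eye(d)
w, U = np.linalg.eigh(V)
V_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T        # symmetric V_{t-1}^{-1/2}

theta_hat = rng.standard_normal(d)
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                            # unit direction
eta = rng.standard_normal(d)                      # perturbation eta_t

eps = eta @ u                                     # epsilon = <eta_t, u> / ||u||
eps_tilde = rng.standard_normal()                 # independent copy of epsilon

theta_t = theta_hat + V_inv_sqrt @ eta
eta_perp = eta - u * eps                          # projection of eta_t orthogonal to u
theta_tilde = theta_hat + V_inv_sqrt @ eta_perp + V_inv_sqrt @ u * eps_tilde

# The two samples differ only along V_{t-1}^{-1/2} u, scaled by (eps - eps_tilde):
assert np.allclose(theta_t - theta_tilde, (eps - eps_tilde) * (V_inv_sqrt @ u))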

Proof of Lemma 5.

The proof is based on lower- and upper-bounding \mathbb{E}_{t-1}\iota\tilde{\iota}D_{\!J^{2}}(\tilde{\theta}_{t},\theta_{t}).

For the lower bound, note that by strong convexity,

\mathbb{E}_{t-1}\iota\tilde{\iota}D_{\!J^{2}}(\tilde{\theta}_{t},\theta_{t})\geq\frac{m}{2}\|V_{t-1}^{-1/2}u\|_{*}^{2}\,\mathbb{E}_{t-1}\iota\tilde{\iota}(\tilde{\epsilon}-\epsilon)^{2}\,,

where

\mathbb{E}_{t-1}\iota\tilde{\iota}(\tilde{\epsilon}-\epsilon)^{2} =\mathbb{E}_{t-1}(\tilde{\epsilon}-\epsilon)^{2}-\mathbb{E}_{t-1}\big(((1-\iota)+(1-\tilde{\iota}))\wedge 1\big)(\tilde{\epsilon}-\epsilon)^{2}
\geq 2-\mathbb{E}_{t-1}\big(((1-\iota)+(1-\tilde{\iota}))\wedge 1\big)(\tilde{\epsilon}-\epsilon)^{2} \qquad (marginal variance assumption)
\geq 2-2\mathbb{E}_{t-1}(1-\iota)(\tilde{\epsilon}-\epsilon)^{2} \qquad (drop \wedge)
\geq 2-2\mathbb{E}_{t-1}(1-\iota)(K^{2}+\epsilon^{2}) \qquad (marginal variance assumption)
\geq 2-2\sqrt{p_{t-1}}\sqrt{\mathbb{E}_{t-1}(K^{2}+\epsilon^{2})^{2}} \qquad (Cauchy–Schwarz and \mathbb{E}_{t-1}(1-\iota)=p_{t-1})
\geq 2-4K^{2}\sqrt{p_{t-1}} \qquad (marginal variance and fourth moment assumptions)
\geq 1\,. \qquad (p_{t-1}\leq p=1/(16K^{4}) by assumption)

For the upper bound, we have that

\mathbb{E}_{t-1}\iota\tilde{\iota}D_{\!J^{2}}(\tilde{\theta}_{t},\theta_{t}) =\mathbb{E}_{t-1}\iota\tilde{\iota}\big(J^{2}(\tilde{\theta}_{t})-J^{2}(\theta_{t})-\langle\nabla J^{2}(\theta_{t}),\tilde{\theta}_{t}-\theta_{t}\rangle\big)
=\mathbb{E}_{t-1}\iota\tilde{\iota}\big(J^{2}(\tilde{\theta}_{t})-J^{2}(\theta_{t})-\langle 2J(\theta_{t})\nabla J(\theta_{t}),\tilde{\theta}_{t}-\theta_{t}\rangle\big)
\leq\mathbb{E}_{t-1}\iota\tilde{\iota}|J^{2}(\tilde{\theta}_{t})-J^{2}(\theta_{t})|+2J(\theta_{\star})\mathbb{E}_{t-1}|\langle\nabla J(\theta_{t}),\tilde{\theta}_{t}-\theta_{t}\rangle| \qquad (0<\iota J(\theta_{t})\leq J(\theta_{\star}))
=\mathbb{E}_{t-1}\iota\tilde{\iota}|J(\tilde{\theta}_{t})-J(\theta_{t})|(J(\theta_{t})+J(\tilde{\theta}_{t}))+2J(\theta_{\star})\mathbb{E}_{t-1}|\langle\nabla J(\theta_{t}),\tilde{\theta}_{t}-\theta_{t}\rangle|
\leq 2J(\theta_{\star})\left\{\mathbb{E}_{t-1}|J(\tilde{\theta}_{t})-J(\theta_{t})|+\mathbb{E}_{t-1}|\langle\nabla J(\theta_{t}),\tilde{\theta}_{t}-\theta_{t}\rangle|\right\}
\leq 6J(\theta_{\star})\mathbb{E}_{t-1}|\langle\nabla J(\theta_{t}),\tilde{\theta}_{t}-\theta_{t}\rangle| \qquad (convexity)
=6J(\theta_{\star})\mathbb{E}_{t-1}|\tilde{\epsilon}-\epsilon|\,|\langle\nabla J(\theta_{t}),V_{t-1}^{-1/2}u\rangle|
\leq 6J(\theta_{\star})\,\mathbb{E}_{t-1}[(\tilde{\epsilon}-\epsilon)^{2}]^{1/2}\,\|\mathbb{E}_{t-1}[\nabla J(\theta_{t})\nabla J(\theta_{t})^{\mathsf{T}}]^{1/2}V_{t-1}^{-1/2}u\| \qquad (Cauchy–Schwarz)
\leq 6\sqrt{2}J(\theta_{\star})K\,\|\mathbb{E}_{t-1}[\nabla J(\theta_{t})\nabla J(\theta_{t})^{\mathsf{T}}]^{1/2}V_{t-1}^{-1/2}u\|\,. \qquad (marginal variance assumption)

Chaining the lower and upper bounds, and using that \mathbb{E}_{t-1}\iota\tilde{\iota}(\tilde{\epsilon}-\epsilon)^{2}\geq 1, gives

\frac{m}{2}\|V_{t-1}^{-1/2}u\|_{*}^{2}\leq\mathbb{E}_{t-1}\iota\tilde{\iota}D_{\!J^{2}}(\tilde{\theta}_{t},\theta_{t})\leq 6\sqrt{2}J(\theta_{\star})K\,\|\mathbb{E}_{t-1}[\nabla J(\theta_{t})\nabla J(\theta_{t})^{\mathsf{T}}]^{1/2}V_{t-1}^{-1/2}u\|\,,

which rearranges to the claimed result. ∎

Appendix D Proof of directional concentration (Lemma 6)

Lemma 13.

For any r,\epsilon>0, the \epsilon-covering number of r\mathbb{B}^{d}_{2} is upper bounded by r^{d}(1+\tfrac{2}{\epsilon})^{d}.
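Lemma 13 is the standard volumetric covering bound. As a sanity check, the following minimal Python sketch (illustrative parameters, with r=1 and d=2) greedily builds an \epsilon-separated set of centres from random points in the unit ball, so that every sampled point lies within \epsilon of some centre, and compares its size against the volumetric bound (1+2/\epsilon)^{d}; the construction and parameters are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(2)
d, eps, n_samples = 2, 0.3, 5_000   # illustrative values only; here r = 1

# Sample points uniformly from the unit Euclidean ball.
pts = rng.standard_normal((n_samples, d))
pts *= (rng.uniform(size=n_samples) ** (1 / d))[:, None] / np.linalg.norm(pts, axis=1, keepdims=True)

# Greedily keep a point as a new centre only if it is more than eps away from
# all previous centres: the centres form an eps-separated set, and every
# sampled point ends up within eps of some centre.
centres = []
for x in pts:
    if all(np.linalg.norm(x - c) > eps for c in centres):
        centres.append(x)

volumetric_bound = (1 + 2 / eps) ** d   # any eps-separated subset of the unit
                                        # ball has at most this many points
print(len(centres), "centres; volumetric bound:", volumetric_bound)
assert len(centres) <= volumetric_bound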

Proof of Lemma 6.

For each n\geq 1, let \mathcal{N}_{n} be a minimal \epsilon_{n}-cover of r\mathbb{B}^{d}_{2} in \|\cdot\|, where the value of \epsilon_{n}>0 will be chosen shortly. Let

\Delta_{n}=\sum_{t=1}^{n}X_{t}X_{t}^{\mathsf{T}}-(1-1/e)\sum_{t=1}^{n}\mathbb{E}_{t-1}[X_{t}X_{t}^{\mathsf{T}}]\,.

For every n\geq 1 and u\in\mathcal{N}_{n}, we apply Lemma 9 to the sequence \alpha_{t}=\langle X_{t},u\rangle^{2}, t\geq 1, using the upper bound \alpha_{t}\leq J(u)^{2} for all t\geq 1, and confidence level \delta_{n}=6\delta/(\pi^{2}n^{2}|\mathcal{N}_{n}|). Taking a union bound over the resulting events, we obtain that with probability 1-\delta, for all n\geq 1 and u\in\mathcal{N}_{n},

f_{n}(u):=u^{\mathsf{T}}\Delta_{n}u+J(u)^{2}\log(1/\delta_{n})\geq 0\,.

Now for each n\geq 1, let \pi_{n}\colon r\mathbb{B}^{d}_{2}\to\mathcal{N}_{n} be a map satisfying \|u-\pi_{n}(u)\|\leq\epsilon_{n} for all u\in r\mathbb{B}^{d}_{2}. The proof will be complete once we show that, for a suitable choice of \epsilon_{n}, |f_{n}(u)-f_{n}(\pi_{n}(u))|\leq 5 for all u\in r\mathbb{B}^{d}_{2}, and that for the chosen \epsilon_{n}, we have the bound \log(1/\delta_{n})\leq\omega_{n}. We begin with the bound

|f_{n}(u)-f_{n}(\pi_{n}(u))|\leq\underbrace{|u^{\mathsf{T}}\Delta_{n}u-\pi_{n}(u)^{\mathsf{T}}\Delta_{n}\pi_{n}(u)|}_{=:A_{n}}+\underbrace{|J^{2}(u)-J^{2}(\pi_{n}(u))|}_{=:B_{n}}\log(1/\delta_{n})\,.

Letting \|\cdot\|_{\mathrm{op}} denote the \ell_{2}\to\ell_{2} operator norm,

A_{n} =|(u-\pi_{n}(u))^{\mathsf{T}}\Delta_{n}(u-\pi_{n}(u))-2\pi_{n}(u)^{\mathsf{T}}\Delta_{n}(\pi_{n}(u)-u)|
\leq(\|u-\pi_{n}(u)\|^{2}+2\|\pi_{n}(u)\|\|\pi_{n}(u)-u\|)\|\Delta_{n}\|_{\mathrm{op}}
\leq\epsilon_{n}(\epsilon_{n}+2r)2n<6\epsilon_{n}n\,. \qquad (\|\Delta_{n}\|_{\mathrm{op}}\leq 2n, r\leq 1, \epsilon_{n}<1)
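The bound \|\Delta_{n}\|_{\mathrm{op}}\leq 2n used here follows because each X_{t}X_{t}^{\mathsf{T}} and each (1-1/e)\mathbb{E}_{t-1}[X_{t}X_{t}^{\mathsf{T}}] has operator norm at most one whenever the actions have Euclidean norm at most one. The following minimal Python sketch illustrates this numerically; the i.i.d. uniform-on-the-sphere action stream (for which \mathbb{E}_{t-1}[X_{t}X_{t}^{\mathsf{T}}]=I/d) and the unit-norm normalisation are assumptions made only for this sketch.

import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 500

# Illustrative action stream: i.i.d. uniform on the unit sphere, for which the
# conditional second moment E_{t-1}[X_t X_t^T] is exactly I/d.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

Delta_n = X.T @ X - (1 - 1 / np.e) * (n / d) * np.eye(d)
op_norm = np.linalg.norm(Delta_n, 2)          # spectral norm of Delta_n
print(f"||Delta_n||_op = {op_norm:.2f} <= 2n = {2 * n}")
assert op_norm <= 2 * n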

Also,

B_{n} =|(J(u)-J(\pi_{n}(u)))(J(u)+J(\pi_{n}(u)))|
\leq 2r|J(u)-J(\pi_{n}(u))| \qquad (\forall u\in r\mathbb{B}^{d}_{2}, J(u)\leq\|u\|\leq r)
\leq 2r|J(u-\pi_{n}(u))| \qquad (\forall u,u^{\prime}, |J(u)-J(u^{\prime})|\leq|J(u-u^{\prime})|)
\leq 2r\epsilon_{n}\leq 2\epsilon_{n}\,. \qquad (r\leq 1)

Now choose \epsilon_{n}=1/(4nd^{2}\log(1/\delta)). By Lemma 13, for this choice, \log(1/\delta_{n})\leq\omega_{n}. Combining the bounds on A_{n} and B_{n}, we now indeed have that for all u\in r\mathbb{B}^{d}_{2},

|f_{n}(u)-f_{n}(\pi_{n}(u))|\leq A_{n}+B_{n}\log(1/\delta_{n})\leq\epsilon_{n}(2+6n+\log(1/\delta_{n}))\leq 5\,. ∎
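As a final numeric sanity check of the last display, the following minimal Python sketch evaluates \epsilon_{n}(2+6n+\log(1/\delta_{n})) for a few values of n, d and \delta, bounding |\mathcal{N}_{n}| via Lemma 13 with r\leq 1 (which only makes \delta_{n} smaller, so the check is conservative). The specific values of d and \delta are assumptions for illustration only.

import numpy as np

delta = 0.05                              # illustrative confidence level only
for d in (1, 5):
    for n in (1, 10, 100, 10_000):
        eps_n = 1 / (4 * n * d**2 * np.log(1 / delta))
        # With r <= 1, Lemma 13 gives |N_n| <= (1 + 2/eps_n)^d, so that
        # log(1/delta_n) <= log(pi^2 n^2 / (6 delta)) + d log(1 + 2/eps_n).
        log_inv_delta_n = np.log(np.pi**2 * n**2 / (6 * delta)) + d * np.log(1 + 2 / eps_n)
        value = eps_n * (2 + 6 * n + log_inv_delta_n)
        print(f"d = {d}, n = {n:>6}:  eps_n(2 + 6n + log(1/delta_n)) <= {value:.3f}")
        assert value <= 5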