Indexed Minimum Empirical Divergence-Based Algorithms
for Linear Bandits
Abstract
The Indexed Minimum Empirical Divergence (IMED) algorithm is a highly effective approach that offers a stronger theoretical guarantee of asymptotic optimality than the Kullback–Leibler Upper Confidence Bound (KL-UCB) algorithm for the multi-armed bandit problem. Additionally, it has been observed to empirically outperform UCB-based algorithms and Thompson Sampling. Despite its effectiveness, the generalization of this algorithm to contextual bandits with linear payoffs has remained elusive. In this paper, we present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms. We demonstrate that LinIMED provides an $\widetilde{O}(d\sqrt{T})$ upper regret bound, where $d$ is the dimension of the context and $T$ is the time horizon. Furthermore, extensive empirical studies reveal that LinIMED and its variants outperform widely-used linear bandit algorithms such as LinUCB and Linear Thompson Sampling in some regimes.
1 Introduction
The multi-armed bandit (MAB) problem (Lattimore & Szepesvári (2020)) is a classical topic in decision theory and reinforcement learning. Among the various subfields of bandit problems, the stochastic linear bandit is among the most popular owing to its wide applicability in large-scale, real-world applications such as personalized recommendation systems (Li et al. (2010)), online advertising, and clinical trials. In the stochastic linear bandit model, at each time step $t$, the learner has to choose one arm $A_t$ from the time-varying action set $\mathcal{A}_t$. Each arm $a \in \mathcal{A}_t$ has a corresponding context $x_{a,t} \in \mathbb{R}^d$, which is a $d$-dimensional vector. By pulling the arm $A_t$ at time step $t$, under the linear bandit setting, the learner receives the reward $Y_t$, whose expected value satisfies $\mathbb{E}[Y_t \mid x_{A_t,t}] = \langle \theta^*, x_{A_t,t} \rangle$, where $\theta^* \in \mathbb{R}^d$ is an unknown parameter. The goal of the learner is to maximize the cumulative reward over a time horizon $T$, which is equivalent to minimizing the cumulative regret, defined as $R_T := \mathbb{E}\big[\sum_{t=1}^{T} \max_{a \in \mathcal{A}_t} \langle \theta^*, x_{a,t} \rangle - \langle \theta^*, x_{A_t,t} \rangle\big]$. The learner needs to balance the trade-off between the exploration of different arms (to learn their expected rewards) and the exploitation of the arm with the highest expected reward based on the available data.
1.1 Motivation and Related Work
The $K$-armed bandit setting is a special case of the linear bandit. There exist several good algorithms for this setting, such as UCB1 (Auer et al. (2002)), Thompson Sampling (Agrawal & Goyal (2012)), and the Indexed Minimum Empirical Divergence (IMED) algorithm (Honda & Takemura (2015)). There are three main families of asymptotically optimal multi-armed bandit algorithms based on different principles (Baudry et al. (2023)). However, among these algorithms, only IMED lacks an extension to contextual bandits with linear payoffs. In the varying arm set setting of the linear bandit problem, the LinUCB algorithm of Li et al. (2010) is frequently employed in practice. It has a theoretical guarantee on the regret of the order of $\widetilde{O}(d\sqrt{T})$ when using the confidence width as in OFUL (Abbasi-Yadkori et al. (2011)). Although the SupLinUCB algorithm introduced by Chu et al. (2011) uses phases to decompose the reward dependence of each time step and achieves an $\widetilde{O}(\sqrt{dT})$ regret upper bound (the $\widetilde{O}(\cdot)$ notation omits logarithmic factors in $T$), its empirical performance falls short of both the algorithm in Li et al. (2010) and the Linear Thompson Sampling algorithm (Agrawal & Goyal (2013)), as mentioned in Lattimore & Szepesvári (2020, Chapter 22).
On the other hand, the Optimism in the Face of Uncertainty Linear (OFUL) bandit algorithm of Abbasi-Yadkori et al. (2011) achieves a regret upper bound of $\widetilde{O}(d\sqrt{T})$ through an improved analysis of the confidence bound using a martingale technique. However, it involves a bilinear optimization problem over the action set and the confidence ellipsoid when choosing the arm at each time step. This is computationally expensive, unless the confidence ellipsoid is a convex hull of a finite set.
| Algorithm | Problem-independent regret bound | Regret bound independent of $K$? | Principle that the algorithm is based on |
|---|---|---|---|
| OFUL (Abbasi-Yadkori et al. (2011)) | $\widetilde{O}(d\sqrt{T})$ | ✓ | Optimism |
| LinUCB (Li et al. (2010)) | Hard to analyze | Unknown | Optimism |
| LinTS (Agrawal & Goyal (2013)) | $\widetilde{O}(d^{3/2}\sqrt{T})$ | ✓ | Posterior sampling |
| SupLinUCB (Chu et al. (2011)) | $\widetilde{O}(\sqrt{dT})$ | ✗ | Optimism |
| LinUCB with OFUL’s confidence bound | $\widetilde{O}(d\sqrt{T})$ | ✓ | Optimism |
| Asymptotically Optimal IDS (Kirschner et al. (2021)) | $\widetilde{O}(d\sqrt{T})$ | ✓ | Information directed sampling |
| LinIMED-3 (this paper) | $\widetilde{O}(d\sqrt{T})$ | ✓ | Min. emp. divergence |
| SupLinIMED (this paper) | $\widetilde{O}(\sqrt{dT})$ | ✗ | Min. emp. divergence |

Table 1: Comparison of LinIMED and SupLinIMED with other linear bandit algorithms.
For randomized algorithms designed for the linear bandit problem, Agrawal & Goyal (2013) proposed the LinTS algorithm, which is in the spirit of Thompson Sampling (Thompson (1933)) and uses a confidence ellipsoid similar to that of LinUCB-like algorithms. This algorithm performs efficiently and achieves a regret upper bound of $\widetilde{O}\big(\min\{d^{3/2}\sqrt{T},\, d\sqrt{T\log K}\}\big)$, where $K$ is the number of arms at each time step, i.e., $|\mathcal{A}_t| = K$ for all $t$. Compared to LinUCB with OFUL’s confidence width, it has an extra multiplicative factor of $\min\{\sqrt{d}, \sqrt{\log K}\}$ in the minimax regret upper bound.
Recently, MED-like (minimum empirical divergence) algorithms have come to the fore since these randomized algorithms have the property that the probability of selecting each arm is available in closed form, which benefits downstream tasks such as offline evaluation with the inverse propensity score. Both MED in the sub-Gaussian environment and its deterministic version IMED have demonstrated superior performance over Thompson Sampling (Bian & Jun (2021), Honda & Takemura (2015)). Baudry et al. (2023) also shows that MED has a close relation to Thompson Sampling. In particular, it is argued that MED and TS can be interpreted as two variants of the same exploration strategy. Bian & Jun (2021) also shows that the probability of selecting each arm under MED in the sub-Gaussian case can be viewed as a closed-form approximation of the corresponding probability under Thompson Sampling. We take inspiration from the extension of Thompson Sampling to linear bandits and are thus motivated to extend MED-like algorithms to the linear bandit setting and prove regret bounds that are competitive vis-à-vis the state-of-the-art bounds.
Thus, this paper aims to answer the question of whether it is possible to devise an extension of the IMED algorithm for the linear bandit problem in the varying arm set setting (for both infinite and finite arm sets) with a regret upper bound of $\widetilde{O}(d\sqrt{T})$, which matches that of LinUCB with OFUL’s confidence bound, while being as efficient as LinUCB. The proposed family of algorithms, called LinIMED as well as SupLinIMED, can be viewed as generalizations of the IMED algorithm (Honda & Takemura (2015)) to the linear bandit setting. We prove that LinIMED and its variants achieve a regret upper bound of $\widetilde{O}(d\sqrt{T})$ and that they perform efficiently, no worse than LinUCB. SupLinIMED has a regret bound of $\widetilde{O}(\sqrt{dT})$, but works only for instances with finite arm sets. In our empirical study, we found that the different variants of LinIMED perform better than LinUCB and LinTS for various synthetic and real-world instances under consideration.
Compared to OFUL, LinIMED works more efficiently. Compared to SupLinUCB, our LinIMED algorithm is significantly simpler, and compared to LinUCB with OFUL’s confidence bound, our empirical performance is better. This is because in our algorithm, the exploitation term and the exploration term are decoupled, which leads to finer control when tuning the hyperparameters in the empirical study.
Compared to LinTS, our algorithm’s (specifically LinIMED-3’s) regret bound is superior by a factor of $\sqrt{d}$. Since the fixed arm setting is a special case of the finite varying arm setting, our result is more general than those of other fixed-arm linear bandit algorithms such as Spectral Eliminator (Valko et al. (2014)) and PEGOE (Lattimore & Szepesvári (2020, Chapter 22)). Finally, we observe that since the index used in LinIMED has a similar form to the index used in the Information Directed Sampling (IDS) procedure of Kirschner et al. (2021) (which is known to be asymptotically optimal but more difficult to compute), LinIMED performs significantly better on the “End of Optimism” example of Lattimore & Szepesvari (2017). We summarize the comparisons of LinIMED to other linear bandit algorithms in Table 1. We discuss comparisons to other linear bandit algorithms in Sections 3.2 and 3.3 and Appendix B.
2 Problem Statement
Notations:
For any $d$-dimensional vector $x \in \mathbb{R}^d$ and any positive definite matrix $V \in \mathbb{R}^{d\times d}$, we use $\|x\|_V$ to denote the Mahalanobis norm $\sqrt{x^\top V x}$. We use $a \wedge b$ (resp. $a \vee b$) to represent the minimum (resp. maximum) of two real numbers $a$ and $b$.
The Stochastic Linear Bandit Model:
In the stochastic linear bandit model, the learner chooses an arm $A_t$ at each round $t$ from the arm set $\mathcal{A}_t$, where we allow the cardinality of each arm set $\mathcal{A}_t$ to be potentially infinite. Each arm $a \in \mathcal{A}_t$ at time $t$ has a corresponding context (arm vector) $x_{a,t} \in \mathbb{R}^d$, which is known to the learner. After choosing arm $A_t$, the environment reveals the reward
$$ Y_t = \langle \theta^*, X_t \rangle + \eta_t \qquad\qquad (1) $$
to the learner, where $X_t := x_{A_t,t}$ is the context of the chosen arm $A_t$, $\theta^* \in \mathbb{R}^d$ is an unknown coefficient of the linear model, and $\eta_t$ is an $R$-sub-Gaussian noise conditioned on the past choices and rewards such that for any $\lambda \in \mathbb{R}$, almost surely, $\mathbb{E}\big[\exp(\lambda \eta_t) \,\big|\, A_{1:t}, \eta_{1:t-1}\big] \le \exp\big(\lambda^2 R^2 / 2\big)$.
Denote $a_t^* := \arg\max_{a \in \mathcal{A}_t} \langle \theta^*, x_{a,t} \rangle$ as the arm with the largest expected reward at time $t$. The goal is to minimize the expected cumulative regret over the horizon $T$. The (expected) cumulative regret is defined as $R_T := \mathbb{E}\big[\sum_{t=1}^{T} \langle \theta^*, x_{a_t^*,t} - X_t \rangle\big]$.
Assumption 1.
For each time $t$, we assume that $\|\theta^*\|_2 \le S$ and $\|x_{a,t}\|_2 \le L$ for some fixed $S, L > 0$. We also assume that the mean reward $\langle \theta^*, x_{a,t} \rangle$ is uniformly bounded for each arm $a$ and time $t$.
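To make the interaction protocol above concrete, the following minimal Python sketch simulates the model with Gaussian noise (one particular $R$-sub-Gaussian distribution). The dimension, the arm-generation scheme, and the use of unit-norm contexts are illustrative assumptions for this sketch only; they are not the settings used in our experiments.

```python
import numpy as np

class LinearBanditEnv:
    """Stochastic linear bandit with a time-varying arm set and Gaussian noise (a sketch)."""

    def __init__(self, theta_star, noise_std=1.0, seed=0):
        self.theta = np.asarray(theta_star, dtype=float)  # unknown parameter theta* (known only to the environment)
        self.noise_std = noise_std                         # Gaussian noise is noise_std-sub-Gaussian
        self.rng = np.random.default_rng(seed)

    def arm_set(self, num_arms=10):
        # Illustrative arm-generation scheme: unit-norm contexts drawn afresh every round,
        # so the arm set varies with time and satisfies ||x_{a,t}||_2 <= L with L = 1.
        X = self.rng.normal(size=(num_arms, self.theta.shape[0]))
        return X / np.linalg.norm(X, axis=1, keepdims=True)

    def pull(self, x):
        # Reward Y_t = <theta*, x> + eta_t with eta_t ~ N(0, noise_std^2).
        return float(x @ self.theta) + self.noise_std * self.rng.normal()

    def instant_regret(self, X, chosen_idx):
        # Instantaneous (pseudo-)regret <theta*, x*_t> - <theta*, X_t> used to track R_T.
        means = X @ self.theta
        return float(means.max() - means[chosen_idx])
```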
3 Description of LinIMED Algorithms
In the pseudocode of Algorithm 1, for each time step $t$, in Line 4, we use the improved confidence bound of $\theta^*$ as in Abbasi-Yadkori et al. (2011) to calculate the width of the confidence bound. After that, for each arm $a \in \mathcal{A}_t$, in Lines 6 and 7, the empirical gap between the highest empirical reward and the empirical reward of arm $a$ is estimated as
$$ \hat{\Delta}_{a,t} := \max_{b \in \mathcal{A}_t} \langle \hat{\theta}_t, x_{b,t} \rangle - \langle \hat{\theta}_t, x_{a,t} \rangle \qquad\qquad (2) $$
Then, in Lines 9 to 11, with the use of the confidence width, we can compute the index for the empirically best arm (for LinIMED-1 and LinIMED-2) or for the highest-UCB arm (for LinIMED-3). The different versions of LinIMED encourage different amounts of exploitation. For the other arms, the index is defined and computed as in Line 13.
Then, with all the indices of the arms calculated, in Line 16, we choose the arm $A_t$ with the minimum index (where ties are broken arbitrarily) and the agent receives its reward. Finally, in Line 18, we use ridge regression to estimate the unknown $\theta^*$ as $\hat{\theta}_t$ and update the regularized Gram matrix and the vector of context-weighted rewards. After that, the algorithm iterates to the next time step until the time horizon $T$. From the pseudocode, we observe that the only differences between the three algorithms are the way that the squared gap, which plays the role of the empirical divergence, is estimated and the index assigned to the empirically best arm. The latter point implies that we encourage the empirically best arm to be selected more often in LinIMED-2 and LinIMED-3 compared to LinIMED-1; in other words, we encourage more exploitation in LinIMED-2 and LinIMED-3. Similar to the core spirit of the IMED algorithm of Honda & Takemura (2015), the first term of our index for the LinIMED-1 algorithm, namely the squared empirical gap divided by the variance proxy of the arm, controls the exploitation, while the second term controls the exploration in our algorithm.
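For concreteness, the following Python sketch summarizes the structure of one round of a LinIMED-1-style rule: ridge regression, empirical gaps, an IMED-style index, and a rank-one update. The exact confidence width and the constants in the index are elided above, so the index form used here, $\hat{\Delta}_a^2 / (\beta \|x_a\|^2_{V^{-1}}) + \ln\big(1/(\beta \|x_a\|^2_{V^{-1}})\big)$ with a user-supplied $\beta$, should be read as an assumption illustrating the structure rather than as the exact algorithm.

```python
import numpy as np

def linimed_step(X, V, b, beta):
    """One round of a LinIMED-1-style selection rule (a structural sketch).

    X    : (K, d) array of the contexts of the arms available this round
    V, b : running ridge-regression statistics (V = lambda*I + sum x x^T, b = sum y * x)
    beta : confidence-width parameter (assumed given; its exact value is elided here)
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                                      # ridge estimate of theta*
    means = X @ theta_hat                                      # empirical rewards <theta_hat, x_a>
    gaps = means.max() - means                                 # empirical gaps Delta_hat_a
    var_proxy = beta * np.einsum('ij,jk,ik->i', X, V_inv, X)   # beta * ||x_a||^2_{V^{-1}}

    # IMED-style index: squared gap over the variance proxy plus a log exploration term.
    index = gaps ** 2 / var_proxy + np.log(1.0 / var_proxy)
    # The empirically best arm keeps only the exploration term here; LinIMED-2/3
    # use a different (more exploitative / UCB-anchored) index for this arm.
    best = int(np.argmax(means))
    index[best] = np.log(1.0 / var_proxy[best])

    return int(np.argmin(index))                               # pull the arm with the smallest index

def ridge_update(V, b, x, y):
    """Rank-one update of the ridge-regression statistics after observing reward y for context x."""
    return V + np.outer(x, x), b + y * x
```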
3.1 Description of the SupLinIMED Algorithm
Now we consider the case in which the arm set at each time $t$ is finite but still time-varying. In particular, the arm sets are of constant finite size $K$, i.e., $|\mathcal{A}_t| = K < \infty$. In the pseudocode of Algorithm 2, we apply the SupLinUCB framework (Chu et al., 2011), leveraging Algorithm 3 (in Appendix A) as a subroutine within each phase. This ensures the independence of the choice of the arm from past observations of rewards, thereby yielding a concentration inequality for the estimated reward (see Lemma 1 in Chu et al. (2011)) that places it within close proximity of the unknown expected reward in the finite arm setting. As a result, the regret improves by a factor of $\sqrt{d}$, ignoring logarithmic factors. At each time step $t$ and phase $s$, in Line 5, we utilize the BaseLinUCB algorithm as a subroutine to calculate the sample mean and confidence width, since we also need these terms to calculate the IMED-style indices of each arm. In Lines 6–9 (Case 1), if the width of each arm is less than $1/\sqrt{T}$, we choose the arm with the smallest IMED-style index. In Lines 10–12 (Case 2), the framework is the same as in SupLinUCB (Chu et al. (2011)): if the width of each arm is smaller than $2^{-s}$ but there exist arms with widths larger than $1/\sqrt{T}$, then in Line 11 the “unpromising” arms are eliminated until the width of each remaining arm is small enough to satisfy the condition in Line 6. Otherwise, if there exist arms with widths larger than $2^{-s}$, in Lines 14–15 (Case 3), we choose one such arm and record its context and reward in the next layer.
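The control flow of these three cases can be sketched as follows. The `base_estimates` callable stands in for the BaseLinUCB subroutine of Appendix A, and the thresholds $1/\sqrt{T}$ and $2^{-s}$ together with the elimination margin $2^{1-s}$ follow the standard SupLinUCB schedule; they are assumptions here insofar as the exact constants are elided above.

```python
import numpy as np

def suplinimed_round(arms, base_estimates, imed_index, T):
    """One round of the SupLinUCB-style phase loop with an IMED-style choice in Case 1 (a sketch).

    arms           : list of candidate arm identifiers (shrinks across phases)
    base_estimates : base_estimates(s, arms) -> (means, widths) computed from the phase-s data only
    imed_index     : imed_index(means, widths) -> array of IMED-style indices for `arms`
    Returns (chosen_arm, phase), where phase is None in Case 1 (no phase records the observation)
    and, in Case 3, the phase whose data set should record the new observation.
    """
    s = 1
    while True:
        means, widths = base_estimates(s, arms)
        if np.all(widths <= 1.0 / np.sqrt(T)):
            # Case 1: every arm is accurately estimated -- pick the smallest IMED-style index.
            return arms[int(np.argmin(imed_index(means, widths)))], None
        if np.all(widths <= 2.0 ** (-s)):
            # Case 2: eliminate "unpromising" arms and move to the next phase.
            keep = means >= means.max() - 2.0 ** (1 - s)
            arms = [a for a, k in zip(arms, keep) if k]
            s += 1
            continue
        # Case 3: some arm is still poorly estimated at this phase -- pull it and record
        # its context and reward in the data set of the indicated layer.
        return arms[int(np.argmax(widths))], s
```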
3.2 Relation to the IMED algorithm of Honda & Takemura (2015)
The IMED algorithm is a deterministic algorithm for the $K$-armed bandit problem. At each time step $t$, it chooses the arm with the minimum index, i.e.,
$$ A_t = \arg\min_{a \in [K]} \Big\{ N_a(t)\, D_{\inf}\big(\hat{F}_a(t), \hat{\mu}^*(t)\big) + \ln N_a(t) \Big\} \qquad\qquad (3) $$
where $N_a(t)$ is the total number of pulls of arm $a$ until time $t$ and $D_{\inf}\big(\hat{F}_a(t), \hat{\mu}^*(t)\big)$ is a divergence measure between the empirical distribution $\hat{F}_a(t)$ of arm $a$ and the highest sample mean $\hat{\mu}^*(t)$. More precisely, $D_{\inf}(F, \mu)$ is the minimum KL-divergence $D_{\mathrm{KL}}(F \,\|\, G)$ over all distributions $G$ in the admissible family whose mean is at least $\mu$. As shown in Honda & Takemura (2015), its asymptotic regret bound is even better than that of the KL-UCB algorithm (Garivier & Cappé (2011)), and it can be extended to semi-bounded support models. Also, this algorithm empirically outperforms the Thompson Sampling algorithm, as shown in Honda & Takemura (2015). However, an extension of the IMED algorithm to linear bandits with a minimax regret bound of $\widetilde{O}(d\sqrt{T})$ has not been derived. In our design of the LinIMED algorithms, we replace the optimized KL-divergence measure in IMED in Eqn. (3) with the squared gap between the sample mean of the arm and the arm with the maximum sample mean. This choice simplifies our analysis and does not adversely affect the regret bound. On the other hand, we view the term $1/N_a(t)$ as (being proportional to) the variance of the sample mean of arm $a$ at time $t$; in this spirit, we use $\|x_{a,t}\|^2_{V_{t-1}^{-1}}$ (where $V_{t-1}$ is the regularized Gram matrix) as a proxy for the variance of the sample mean $\langle \hat{\theta}_t, x_{a,t} \rangle$ of arm $a$ at time $t$. We choose the squared gap instead of the KL-divergence approximation for the index since, in the classical linear bandit setting, the noise is sub-Gaussian, and it is known that the KL-divergence between two Gaussian random variables with the same variance has a closed-form expression proportional to the squared difference of their means.
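For concreteness, the following sketch instantiates the IMED index of Eqn. (3) in the (sub-)Gaussian case, where the divergence is replaced by the squared-gap surrogate $(\hat{\mu}^* - \hat{\mu}_a)^2 / (2\sigma^2)$ discussed above; the known noise scale $\sigma$ is an assumption of this sketch.

```python
import numpy as np

def imed_choose(counts, means, sigma=1.0):
    """IMED-style arm selection for the K-armed bandit with a Gaussian surrogate divergence.

    counts : number of pulls N_a(t) of each arm (assumed all >= 1)
    means  : empirical means mu_hat_a(t) of each arm
    """
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    # Squared-gap surrogate for D_inf in the Gaussian case: (mu_hat_star - mu_hat_a)^2 / (2 sigma^2).
    d_inf = (means.max() - means) ** 2 / (2.0 * sigma ** 2)
    index = counts * d_inf + np.log(counts)   # N_a * D_inf + log N_a, as in Eqn. (3)
    return int(np.argmin(index))              # pull the arm with the minimum index
```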
3.3 Relation to Information Directed Sampling (IDS) for Linear Bandits
Information Directed Sampling (IDS), introduced by Russo & Van Roy (2014), serves as a good principle for regret minimization in linear bandits to achieve asymptotic optimality. The intuition behind IDS is to balance the information gain on the best arm against the expected reward at each time step. This goal is realized by optimizing the distribution $\pi$ of selecting each arm $a \in \mathcal{A}$ (where $\mathcal{A}$ is the fixed finite arm set) so as to minimize the information ratio:
$$ \pi_t := \arg\min_{\pi \in \mathcal{D}(\mathcal{A})} \; \frac{\big(\sum_{a \in \mathcal{A}} \pi(a)\, \hat{\Delta}_a(t)\big)^2}{\sum_{a \in \mathcal{A}} \pi(a)\, I_t(a)} \qquad\qquad (4) $$
where $\hat{\Delta}_a(t)$ is the empirical gap and $I_t(a)$ is the so-called information gain (defined later). Kirschner & Krause (2018), Kirschner et al. (2020), and Kirschner et al. (2021) apply the IDS principle to the linear bandit setting. The first two works propose both randomized and deterministic versions of IDS for linear bandits. They showed a near-optimal minimax regret bound of the order of $\widetilde{O}(d\sqrt{T})$. Kirschner et al. (2021) designed an asymptotically optimal linear bandit algorithm while retaining near-optimal minimax regret properties. Comparing these algorithms with our LinIMED algorithms, we observe that the first term of the index of non-greedy actions in our algorithms is similar to the information ratio in IDS with the estimated gap as defined in Algorithm 1. As mentioned in Kirschner & Krause (2018), when the noise is 1-sub-Gaussian, the information gain in the deterministic IDS algorithm is approximately $\log\big(1 + \|x_{a,t}\|^2_{V_{t-1}^{-1}}\big)$, which is similar to our choice $\|x_{a,t}\|^2_{V_{t-1}^{-1}}$. However, our LinIMED algorithms are different from the deterministic IDS algorithm in Kirschner & Krause (2018) since the estimated gap defined in our algorithm is different from that in deterministic IDS. Furthermore, as discussed in Kirschner et al. (2020), when the noise is 1-sub-Gaussian, the action chosen by UCB minimizes the deterministic information ratio. However, this is not the case for our algorithm since we have the second term in LinIMED-1 which balances information and optimism. Compared to IDS in Kirschner et al. (2021), their algorithm is a randomized version of the deterministic IDS algorithm, which is more computationally expensive than our algorithm since our LinIMED algorithms are fully deterministic (the support of the allocation in Kirschner et al. (2021) has size two). Their version of IDS also defines a more complicated version of the information gain to achieve asymptotic optimality. Finally, to the best of our knowledge, all these IDS algorithms are designed for linear bandits under the setting that the arm set is fixed and finite, while in our setting we assume the arm set is finite and can change over time. We discuss comparisons to other related work in Appendix B.
4 Theorem Statements
Theorem 1.
Under Assumption 1, the assumption that for all and , and the assumption that , the regret of the LinIMED-1 algorithm is upper bounded as follows:
Theorem 2.
Under Assumption 1, and the assumption that , the regret of the LinIMED-2 algorithm is upper bounded as follows:
Theorem 3.
Theorem 4.
Under Assumption 1 and the assumption that , the regret of the SupLinIMED algorithm (which is applicable to linear bandit problems with $K$ arms) is upper bounded as follows:
The upper bounds on the regret of LinIMED and its variants are all of the form $\widetilde{O}(d\sqrt{T})$, which, ignoring logarithmic terms, is the same as that of the OFUL algorithm (Abbasi-Yadkori et al. (2011)). Compared to LinTS, they have an advantage of a factor of $\sqrt{d}$. Also, these upper bounds do not depend on the number of arms $K$, which means they can be applied to linear bandit problems with a large arm set (including infinite arm sets). One observes that LinIMED-2 and LinIMED-3 do not require the additional assumption of Theorem 1 (which must hold for all arms and times) to achieve the stated regret bound. It is difficult to prove the regret bound for the LinIMED-1 algorithm without this assumption since, in our proof, we need to use this property at every time step to bound one of the terms. On the other hand, LinIMED-2 and LinIMED-3 encourage more exploitation in terms of the index of the empirically best arm at each time without adversely influencing the regret bound; this will accelerate learning on well-preprocessed datasets. The regret bound of LinIMED-3, in fact, matches that of LinUCB with OFUL’s confidence bound. In the proofs, we extensively use a technique known as the “peeling device” (Lattimore & Szepesvári, 2020, Chapter 9). This analytical technique, commonly used in the theory of bandit algorithms, involves partitioning the range of some random variable into several pieces and then decomposing the expectation (or probability) of interest over these pieces, so that the more refined range of the random variable on each piece can be used to derive the desired bounds.
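For illustration (in our notation here, not the paper's), the peeling device applied to a random variable $X$ with $0 \le X \le \bar a$ decomposes its expectation over geometric slices, so that a tail bound on each slice suffices:

$$ \mathbb{E}[X] \;=\; \sum_{l \ge 0} \mathbb{E}\Big[ X \,\mathbb{1}\Big\{ \tfrac{\bar a}{2^{\,l+1}} < X \le \tfrac{\bar a}{2^{\,l}} \Big\} \Big] \;\le\; \sum_{l \ge 0} \frac{\bar a}{2^{\,l}}\, \mathbb{P}\Big( X > \frac{\bar a}{2^{\,l+1}} \Big). $$

In our proofs, the quantity being peeled is typically $\|X_t\|^2_{V_{t-1}^{-1}}$, and Corollary 1 below supplies the required bound on the number of rounds in which it exceeds each threshold.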
Finally, Theorem 4 says that when the arm set is finite, we can use the framework of SupLinUCB (Chu et al., 2011) with our LinIMED index to achieve a regret bound of the order of $\widetilde{O}(\sqrt{dT})$, which is better than the regret bounds yielded by the family of LinIMED algorithms (ignoring logarithmic terms). The proof is provided in Appendix F.
5 Proof Sketch of Theorem 1
We choose to present the proof sketch of Theorem 1 since it contains the main ingredients for all the theorems in the preceding section. Before presenting the proof, we introduce the following lemma and corollary.
Lemma 1.
This lemma states that the true parameter $\theta^*$ lies, with high probability, in an ellipsoid centered at the ridge-regression estimate $\hat{\theta}_t$; it also specifies the width of the confidence bound.
The second is a corollary of the elliptical potential count lemma in Abbasi-Yadkori et al. (2011):
Corollary 1.
(Corollary of Lattimore & Szepesvári (2020, Exercise 19.3)) Assume that and for , for any constant , the following holds:
(6) |
We remark that this corollary is slightly stronger than the classical elliptical potential lemma since it provides an upper bound on the number of times the quantity $\|X_t\|^2_{V_{t-1}^{-1}}$ exceeds a given threshold. Equipped with this corollary, we can perform the peeling device on this quantity in our proof of the regret bound, which is a novel technique to the best of our knowledge.
Proof.
First, we define $a_t^*$ as the best arm at time step $t$, i.e., $a_t^* := \arg\max_{a \in \mathcal{A}_t} \langle \theta^*, x_{a,t} \rangle$, and we use $x_t^* := x_{a_t^*,t}$ to denote its corresponding context. Let $\Delta_t := \langle \theta^*, x_t^* - X_t \rangle$ denote the instantaneous regret at time $t$. Define the following events:
where the two free parameters appearing in these events are set to the values specified in Eqn. (11) below for this proof sketch.
Then the expected regret can be partitioned according to these events as follows:
(7) |
For , from the event and the fact that (here is where we use that for all and ), we obtain . For convenience, define as the empirically best arm at time step , where ties are broken arbitrarily, then use to denote the corresponding context of the arm . Therefore from the Cauchy–Schwarz inequality, we have . This implies that
(8) |
On the other hand, we claim that can be upper bounded as . This can be seen from the fact that . Since the event holds, we know the first term is upper bounded by , and since the largest eigenvalue of the matrix is upper bounded by and , the second term is upper bounded by . Hence, is upper bounded by . Then one can substitute this bound back into Eqn. (8), and this yields
(9) |
Furthermore, by our design of the algorithm, the index of is not larger than the index of the arm with the largest empirical reward at time . Hence,
(10) |
In the following, we set as well as another free parameter as follows:
(11) |
If , by using Corollary 1 with the choice in Eqn. (11), the upper bound of in this case is . Otherwise, using the event and the bound in Eqn. (9), we deduce that for all sufficiently large, we have . Therefore by using Corollary 1 and the “peeling device” (Lattimore & Szepesvári, 2020, Chapter 9) on such that for where and is chosen as in Eqn. (11). Now consider,
(12) | ||||
(13) | ||||
(14) | ||||
(15) | ||||
(16) | ||||
(17) |
where in Inequality (15) we used Corollary 1. Substituting the choices of and in Eqn. (11) into Eqn. (17) yields the upper bound on of the order . Hence . Other details are fleshed out in Appendix C.2.
For , since and together imply that , then using the choices of and , we have . Substituting this into the event and using the Cauchy–Schwarz inequality, we have
(18) |
Again applying the “peeling device” on and Corollary 1, we can upper bound as follows:
(19) |
Then with the choice of and as stated in Eqn. (11), the upper bound of the is also of order . More details of the calculation leading to Eqn. (19) are in Appendix C.3.
For , this is the case when the best arm at time does not perform sufficiently well so that the empirically largest reward at time is far from the highest expected reward. One observes that minimizing results in a tradeoff with respect to . On the event , we can again apply the “peeling device” on such that where . Then using the fact that , we have
(20) |
On the other hand, using the event and the Cauchy–Schwarz inequality, it holds that
(21) |
If , the regret in this case is bounded by . Otherwise, combining Eqn. (20) and Eqn. (21) implies that
(22) |
6 Empirical Studies
This section aims to justify the utility of the family of LinIMED algorithms we developed and to demonstrate their effectiveness through quantitative evaluations in simulated environments and on real-world datasets such as the MovieLens dataset. We compare our LinIMED algorithms with LinTS and LinUCB under a common parameter choice. The confidence parameter is set separately for the synthetic dataset (with a varying and finite arm set) and for the MovieLens dataset. The confidence widths for each algorithm are multiplied by a scaling factor, and we tune this factor by searching over a grid and reporting the best performance for each algorithm; see Appendix G. Both settings are of the order suggested by our proof sketch in Eqn. (11). We fix the additional parameter in LinIMED-3 throughout. The sub-Gaussian noise level is fixed throughout. We choose LinUCB and LinTS as competing algorithms since they are paradigmatic examples of deterministic and randomized contextual linear bandit algorithms, respectively. We also include IDS in our comparisons for the fixed and finite arm set settings. Finally, we show the performances of the SupLinUCB and SupLinIMED algorithms only in Figs. 1 and 2, since it is well known that they suffer a substantial performance degradation compared to established methodologies like LinUCB or LinTS (as mentioned in Lattimore & Szepesvári (2020, Chapter 22) and also seen in Figs. 1 and 2).
6.1 Experiments on a Synthetic Dataset in the Varying Arm Set Setting
We perform an empirical study in a varying arm set setting. We evaluate the performance with different dimensions $d$ and different numbers of arms $K$. We set the unknown parameter vector and the best context vector to fixed values. There are a number of suboptimal arm vectors, which are all the same (i.e., repeated) and share a common context corrupted by i.i.d. noise. Finally, there is also one “worst” arm vector; an illustrative sketch of this construction is given after this paragraph. First, we fix the dimension $d$. The results for different numbers of arms $K$ are shown in Fig. 1. Note that each experiment is repeated multiple times to obtain the mean and standard deviation of the regret. From Fig. 1, we observe that LinIMED-1 and LinIMED-2 are comparable to LinUCB and LinTS, while LinIMED-3 outperforms LinTS and LinUCB regardless of the number of arms $K$. Second, we fix $K$ and vary the dimension $d$. Each trial is again repeated multiple times, and the regret over time is shown in Fig. 2. Again, we see that LinIMED-1 and LinIMED-2 are comparable to LinUCB and LinTS, while LinIMED-3 clearly performs better than LinUCB and LinTS.
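The following sketch illustrates this construction; the specific values of the unknown parameter, the suboptimal direction, the noise scale, and the “worst” context are illustrative placeholders, since the exact values used in our experiments are elided above.

```python
import numpy as np

def make_synthetic_instance(d, num_suboptimal, noise=0.05, seed=0):
    """Build a synthetic varying-arm instance of the shape described above (illustrative values only)."""
    rng = np.random.default_rng(seed)
    theta_star = np.zeros(d)
    theta_star[0] = 1.0                      # assumed unknown parameter (placeholder)
    best = theta_star.copy()                 # best context, aligned with theta*
    subopt_base = np.zeros(d)
    subopt_base[1] = 1.0                     # a suboptimal direction (placeholder)
    eps = noise * rng.normal(size=d)         # i.i.d. noise added to the shared suboptimal context
    suboptimal = np.tile(subopt_base + eps, (num_suboptimal, 1))  # repeated identical contexts
    worst = -theta_star                      # one "worst" arm (placeholder)
    X = np.vstack([best, suboptimal, worst])
    return theta_star, X
```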
[Figure 1: Cumulative regret on the synthetic dataset for different numbers of arms $K$ (with $d$ fixed).]
[Figure 2: Cumulative regret on the synthetic dataset for different dimensions $d$ (with $K$ fixed).]
The experimental results on synthetic data demonstrate that the performances of LinIMED-1 and LinIMED-2 are largely similar but LinIMED-3 is slightly superior (corroborating our theoretical findings). More importantly, LinIMED-3 outperforms both the LinTS and LinUCB algorithms in a statistically significant manner, regardless of the number of arms or the dimension of the data.
6.2 Experiments on the “End of Optimism” instance
Algorithms based on the optimism principle, such as LinUCB and LinTS, have been shown to be not asymptotically optimal. A paradigmatic example is known as the “End of Optimism” instance (Lattimore & Szepesvari, 2017; Kirschner et al., 2021). In this two-dimensional instance with a fixed true parameter vector, there are three arms represented by their arm vectors, where the parameter $\varepsilon$ entering their definition is small. In this example, it is observed that even pulling a highly suboptimal arm (the second one) provides a lot of information about the best arm (the first one). We perform experiments with the same confidence parameter as in Section 6.1 (the noise level is as in Section 6.1 and the dimension is $d = 2$). We also include the asymptotically optimal IDS algorithm of Kirschner et al. (2021) with the parameter choice suggested therein. Each algorithm is run over several independent trials. The regrets of all competing algorithms are shown in Fig. 3 for different values of $\varepsilon$ and a fixed horizon $T$.
[Figure 3: Cumulative regret on the “End of Optimism” instance for different values of $\varepsilon$.]
From Fig. 3, we observe that the LinIMED algorithms perform much better than LinUCB and LinTS, and LinIMED-3 is comparable to IDS on this “End of Optimism” instance. In particular, LinIMED-3 performs significantly better than LinUCB and LinTS even when $\varepsilon$ is of a moderate value. We surmise that the reason behind the superior performance of our LinIMED algorithms on the “End of Optimism” instance is that the first term of our LinIMED index can be viewed as an approximate and simpler version of the information ratio that motivates the design of the IDS algorithm.
[Figure 4: Click-through rates (CTRs) on the MovieLens dataset.]
6.3 Experiments on the MovieLens Dataset
The MovieLens dataset (Cantador et al. (2011)) is a widely-used benchmark dataset for research in recommendation systems. We specifically choose to use the MovieLens 10M dataset, which contains 10 million ratings (from 0 to 5) and 100,000 tag applications applied to 10,000 movies by 72,000 users. To preprocess the dataset, we choose the best movies for consideration. At each time $t$, one random user visits the website and is recommended one of these best movies. We assume that the user clicks on the recommended movie if and only if the user’s rating of this movie is at least a fixed threshold. We implement the three versions of LinIMED, LinUCB, LinTS, and IDS on this dataset. Each trial is repeated over several runs, and the averages and standard deviations of the click-through rates (CTRs) as functions of time are reported in Fig. 4. One observes that the LinIMED variants significantly outperform LinUCB and LinTS in all settings when the time horizon is sufficiently large. LinIMED-1 and LinIMED-2 perform significantly better than IDS in some of the settings, while LinIMED-3 performs significantly better than IDS in other settings. Furthermore, by virtue of the fact that IDS is randomized, the variance of IDS is higher than that of LinIMED.
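The click-feedback simulation described above can be sketched as follows; the number of retained movies, the rating threshold, and the feature construction are placeholders since the exact values are elided above.

```python
import numpy as np

def movielens_round(ratings, user_features, movie_features, policy, rng, click_threshold=4.0):
    """One round of the click-feedback simulation on MovieLens (illustrative thresholds and features).

    ratings        : dict mapping (user, movie) -> rating in [0, 5]
    user_features  : dict mapping user -> feature vector
    movie_features : dict mapping movie -> feature vector (only the retained movies)
    policy         : policy(contexts) -> index of the recommended movie
    """
    users = list(user_features)
    user = users[int(rng.integers(len(users)))]              # a random user visits the website
    movies = list(movie_features)
    # Context of each candidate movie for this user (a simple concatenation here).
    contexts = np.array([np.concatenate([user_features[user], movie_features[m]]) for m in movies])
    rec = policy(contexts)                                   # recommend one movie
    # The user clicks iff their rating of the recommended movie reaches the threshold.
    click = float(ratings.get((user, movies[rec]), 0.0) >= click_threshold)
    return click
```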
7 Future Work
In the future, a fruitful direction of research is to further modify the LinIMED algorithm to make it also asymptotically optimal; we believe that in this case, the analysis would be more challenging, but the theoretical and empirical performances might be superior to our three LinIMED algorithms. In addition, one can generalize the family of IMED-style algorithms to generalized linear bandits or neural contextual bandits.
Acknowledgements
This work is supported by funding from a Ministry of Education Academic Research Fund (AcRF) Tier 2 grant under grant number A-8000423-00-00 and AcRF Tier 1 grants under grant numbers A-8000189-01-00 and A-8000980-00-00. This research is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-PhD-2023-08-044T-J), and is part of the programme DesCartes which is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme.
References
- Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.
- Agrawal & Goyal (2012) Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39.1–39.26. JMLR Workshop and Conference Proceedings, 2012.
- Agrawal & Goyal (2013) Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135. PMLR, 2013.
- Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
- Baudry et al. (2023) Dorian Baudry, Kazuya Suzuki, and Junya Honda. A general recipe for the analysis of randomized multi-armed bandit algorithms. arXiv preprint arXiv:2303.06058, 2023.
- Bian & Jun (2021) Jie Bian and Kwang-Sung Jun. Maillard sampling: Boltzmann exploration done optimally. arXiv preprint arXiv:2111.03290, 2021.
- Cantador et al. (2011) Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. Second workshop on information heterogeneity and fusion in recommender systems (HetRec2011). In Proceedings of the Fifth ACM Conference on Recommender Systems, pp. 387–388, 2011.
- Chu et al. (2011) Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214. JMLR Workshop and Conference Proceedings, 2011.
- Garivier & Cappé (2011) Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 359–376. JMLR Workshop and Conference Proceedings, 2011.
- Honda & Takemura (2015) Junya Honda and Akimichi Takemura. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. J. Mach. Learn. Res., 16:3721–3756, 2015.
- Kirschner & Krause (2018) Johannes Kirschner and Andreas Krause. Information directed sampling and bandits with heteroscedastic noise. In Conference On Learning Theory, pp. 358–384. PMLR, 2018.
- Kirschner et al. (2020) Johannes Kirschner, Tor Lattimore, and Andreas Krause. Information directed sampling for linear partial monitoring. In Conference on Learning Theory, pp. 2328–2369. PMLR, 2020.
- Kirschner et al. (2021) Johannes Kirschner, Tor Lattimore, Claire Vernade, and Csaba Szepesvári. Asymptotically optimal information-directed sampling. In Conference on Learning Theory, pp. 2777–2821. PMLR, 2021.
- Lattimore & Szepesvari (2017) Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pp. 728–737. PMLR, 2017.
- Lattimore & Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
- Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, 2010.
- Liu et al. (2024) Haolin Liu, Chen-Yu Wei, and Julian Zimmert. Bypassing the simulator: Near-optimal adversarial linear contextual bandits. Advances in Neural Information Processing Systems, 36, 2024.
- Russo & Van Roy (2014) Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. Advances in Neural Information Processing Systems, 27, 2014.
- Saber et al. (2021) Hassan Saber, Pierre Ménard, and Odalric-Ambrym Maillard. Indexed minimum empirical divergence for unimodal bandits. Advances in Neural Information Processing Systems, 34:7346–7356, 2021.
- Thompson (1933) William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933. ISSN 00063444. URL http://www.jstor.org/stable/2332286.
- Valko et al. (2014) Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pp. 46–54. PMLR, 2014.
Supplementary Materials for the TMLR submission
“Linear Indexed Minimum Empirical Divergence Algorithms”
Appendix A BaseLinUCB Algorithm
Here, we present the BaseLinUCB algorithm used as a subroutine in SupLinIMED (Algorithm 2).
Appendix B Comparison to other related work
Saber et al. (2021) adapts the IMED algorithm to unimodal bandits and achieves asymptotic optimality for one-dimensional exponential family distributions. In their algorithm IMED-UB, they narrow down the search region to the neighborhood of the empirically best arm and then implement the IMED algorithm for the $K$-armed bandit as in Honda & Takemura (2015). This design is inspired by the lower bound and only involves the neighboring arms of the best arm. The setting in which the algorithm of Saber et al. (2021) is applied is different from that of our proposed LinIMED algorithms, as we focus on linear bandits, not unimodal bandits.
Liu et al. (2024) propose an algorithm that achieves near-optimal regret for adversarial linear bandits with stochastic action sets in the absence of a simulator or prior knowledge of the distribution. Although their setting is different from ours, they also use a bonus term involving the lifted covariance matrix to encourage exploration. This is similar to our choice of the second term in the LinIMED-1 index.
Appendix C Proof of the regret bound for LinIMED-1 (Complete proof of Theorem 1)
Here and in the following, we drop the explicit dependence of the confidence width on the confidence parameter, which is taken to be the value specified in Eqn. (11).
C.1 Statement of Lemmas for LinIMED-1
We first state the following lemmas, which respectively provide upper bounds on the terms in the regret decomposition:
Lemma 2.
Lemma 3.
Lemma 4.
C.2 Proof of Lemma 2
Proof.
From the event and the fact that (here is where we use that for all and ), we obtain . For convenience, define as the empirically best arm at time step , where ties are broken arbitrarily, then use to denote the corresponding context of the arm . Therefore from the Cauchy–Schwarz inequality, we have . This implies that
(33) |
On the other hand, we claim that can be upper bounded as . This can be seen from the fact that . Since the event holds, we know the first term is upper bounded by , and since the maximum eigenvalue of the matrix is upper bounded by and , the second term is upper bounded by . Hence, is upper bounded by . Then one can substitute this bound back into Eqn. equation 8, and this yields
(34) |
Furthermore, by our design of the algorithm, the index of is not larger than the index of the arm with the largest empirical reward at time . Hence,
(35) |
If , by using Corollary 1 and the “peeling device” (Lattimore & Szepesvári, 2020, Chapter 9) on such that for where ,
(36) | ||||
(37) | ||||
(38) | ||||
(39) | ||||
(40) | ||||
(41) | ||||
(42) | ||||
(43) | ||||
(44) |
Then with the choice of as in Eqn. equation 11,
(45) | ||||
(46) | ||||
(47) |
Otherwise we have , then since . Substituting this into Eqn. equation 10, then using the event and the bound in equation 9, we deduce that for all sufficiently large, we have . Therefore by using Corollary 1 and the “peeling device” (Lattimore & Szepesvári, 2020, Chapter 9) on such that for where is a free parameter that we can choose. Consider,
(48) | |||
(49) | |||
(50) | |||
(51) | |||
(52) | |||
(53) | |||
(54) | |||
(55) | |||
(56) | |||
(57) |
This proves Eqn. equation 26. Then with the choice of the parameters as in Eqn. equation 11,
(58) | ||||
(59) | ||||
(60) |
Hence, we can upper bound as
(61) | ||||
(62) | ||||
(63) |
which concludes the proof. ∎
C.3 Proof of Lemma 3
Proof.
Since and together imply that , then using the choices of and , we have . Substituting this into the event and using the Cauchy–Schwarz inequality, we have
(64) |
Again applying the “peeling device” on and Corollary 1, we can upper bound as follows:
(65) | ||||
(66) | ||||
(67) | ||||
(68) | ||||
(69) | ||||
(70) | ||||
(71) | ||||
(72) |
This proves Eqn. equation 28. Hence with the choice of the parameter as in Eqn. equation 11,
(73) | ||||
(74) |
∎
C.4 Proof of Lemma 4
Proof.
For , this is the case when the best arm at time does not perform sufficiently well so that the empirically largest reward at time is far from the highest expected reward. One observes that minimizing results in a tradeoff with respect to . On the event , we can apply the “peeling device” on such that where . Then using the fact that , we have
(75) |
On the other hand, using the event and the Cauchy–Schwarz inequality, it holds that
(76) |
If , the regret in this case is bounded by (similar to the procedure to get from Eqn. equation 36 to Eqn. equation 47). Otherwise , then combining Eqn. equation 75 and Eqn. equation 76 implies that
(77) |
Notice here with , , it holds that for all ,
(78) |
Using Corollary 1, one can show that:
(79) | |||
(80) | |||
(81) | |||
(82) | |||
(83) | |||
(84) | |||
(85) | |||
(86) | |||
(87) | |||
(88) | |||
(89) | |||
(90) | |||
(91) | |||
(92) | |||
(93) |
Hence
(94) | ||||
(95) | ||||
(96) |
This proves Eqn. equation 30. With the choice of as in Eqn. equation 11,
(97) | ||||
(98) | ||||
(99) |
∎
C.5 Proof of Lemma 5
C.6 Proof of Theorem 1
Appendix D Proof of the regret bound for LinIMED-2 (Proof of Theorem 2)
We choose and as follows:
(108) |
D.1 Statement of Lemmas for LinIMED-2
We first state the following lemmas, which respectively provide upper bounds on the terms in the regret decomposition:
Lemma 6.
Under Assumption 1, and the assumption that , for the free parameter , the term for LinIMED-2 satisfies:
Lemma 7.
Under Assumption 1, and the assumption that , for the free parameter , the term for LinIMED-2 satisfies:
Lemma 8.
Under Assumption 1, and the assumption that , for the free parameter , the term for LinIMED-2 satisfies:
D.2 Proof of Lemma 6
Proof.
We first partition the analysis into the cases and as follows:
(109) | ||||
(110) |
Case 1: If , this means that the index of is . Using the fact that we have:
(111) | ||||
(112) | ||||
(113) |
Therefore
(114) |
If , using the same procedure to get from Eqn. equation 36 to Eqn. equation 47, one has:
(115) | |||
(116) | |||
(117) | |||
(118) |
Else if , this implies that . Then substituting the event into Eqn. equation 114, we obtain
(119) |
With we have , then one has
(120) |
Hence
(121) | |||
(122) |
With the choice of , when , , then performing the “peeling device” on yields
(123) | |||
(124) | |||
(125) | |||
(126) | |||
(127) | |||
(128) | |||
(129) |
Considering the event , we can upper bound the corresponding expectation as follows
(130) |
Then
(131) | |||
(132) | |||
(133) | |||
(134) | |||
(135) |
Hence
(136) | |||
(137) | |||
(138) | |||
(139) | |||
(140) |
Case 2: If , then from the event and the choice we have
(141) |
Furthermore, using the definition of the event , that implies that
(142) |
When , , then similarly, we can bound this term by
Summarizing the two cases,
(143) | ||||
(144) |
∎
D.3 Proof of Lemma 7
D.4 Proof of Lemma 8
Proof.
From the event , which is , the index of the best arm at time can be upper bounded as:
(150) |
Case 1: If , then we have
(151) |
Suppose for , then one has
(152) |
On the other hand, on the event ,
(153) |
If , using the same procedure from Eqn. equation 36 to Eqn. equation 47, one has:
(154) | |||
(155) | |||
(156) | |||
(157) |
Else if , this implies that . Then combining Eqn. equation 152 and Eqn. equation 153 implies that
(158) |
Then using the same procedure to get from Eqn. equation 78 to Eqn. equation 93, we have
(159) | |||
(160) |
Case 2: . If , using the same procedure to get from Eqn. equation 36 to Eqn. equation 47, one has:
(161) | |||
(162) | |||
(163) | |||
(164) |
Else implies that .
If , then using the same procedure to get from Eqn. equation 152 to Eqn. equation 160, we have
(165) | |||
(166) |
If , this means now the index of is , by performing the “peeling device” such that for , we have
(167) |
On the other hand, using the definition of the event ,
(168) |
On the other hand, from , we have . Hence,
(171) | |||
(172) | |||
(173) | |||
(174) | |||
(175) | |||
(176) | |||
(177) |
Summarizing the two cases ( and ), we see that is upper bounded by:
(178) | ||||
(179) | ||||
(180) | ||||
(181) |
∎
D.5 Proof of Lemma 9
Proof.
The proof of this case is straightforward by using Lemma 1 with the choice :
(182) | ||||
(183) | ||||
(184) | ||||
(185) | ||||
(186) | ||||
(187) | ||||
(188) | ||||
(189) | ||||
(190) |
∎
D.6 Proof of Theorem 2
Appendix E Proof of the regret bound for LinIMED-3 (Proof of Theorem 3)
First we define as the best arm in time step such that , and use denote its corresponding context. Define . Let denote the regret in time . Define the following events:
where is a free parameter set to be in this proof sketch.
Then the expected regret can be partitioned by events such that:
(199) |
For the case:
From we know , therefore
(200) |
From and , we have
(201) |
Combining Eqn. equation 200 and Eqn. equation 201,
(202) |
Then
(203) |
If , using the same procedure from Eqn. equation 36 to Eqn. equation 47, one has:
(204) | |||
(205) | |||
(206) | |||
(207) |
Else, this implies that ; plugging this into Eqn. (203) and with the choice of and , we have
(208) |
Since is a constant, then
(209) |
Using the same procedure from Eqn. equation 36 to Eqn. equation 47, one has:
(210) | |||
(211) | |||
(212) |
Hence
(213) |
For the case: Since the event holds,
(214) |
On the other hand, from we have
(215) |
Combining Eqn. equation 214 and Eqn. equation 215,
(216) |
Hence
(217) |
Then with and , we have
(218) |
therefore
(219) |
Using the same procedure from Eqn. equation 36 to Eqn. equation 47, one has:
(220) |
For the case:
using Lemma 1 with the choice :
(221) | ||||
(222) | ||||
(223) | ||||
(224) | ||||
(225) | ||||
(226) | ||||
(227) | ||||
(228) | ||||
(229) |
E.1 Proof of Theorem 3
Appendix F Proof of the regret bound for SupLinIMED (Proof of Theorem 4)
Define as the index of when the arm is chosen at time . For the SupLinIMED, the index of arms except the empirically best arm is defined by , whereas the index of the empirically best arm is defined by where . Define the index of the best arm at time as .
Remark 1.
Here the upper bound we set for the index of the empirically best arm is , which is slightly larger than our previous choice (Line 10 in the LinIMED algorithm), since in the first step of the SupLinIMED algorithm or, more generally, of SupLinUCB-type algorithms, the width of each arm is less than ; as a result, the index of each arm is larger than .
Let the set of time indices at which the chosen arm comes from Step 1 (Lines 6–9 in Algorithm 2) be denoted by . Then the cumulative expected regret of the SupLinIMED algorithm over the time horizon $T$ can be decomposed as follows:
(234) |
Since the index set has not changed in Step 1 (see Line 9 in Algorithm 2), the second term of the regret is the same as in the original SupLinUCB algorithm of Chu et al. (2011). For the first term, we partition it using the following events:
where as in the SupLinUCB (Chu et al., 2011). We choose throughout. Furthermore, is the obtained from Algorithm 3 with as the input, i.e.,
Define as the instantaneous regret at each time step . In addition, choose in the definition of . Then the first term of the expected regret in equation 234 can be partitioned by the events and as follows:
We recall that when , for all .
To bound , we note that since occurs, the actual best arm remains in the candidate set with high probability () by Chu et al. (2011, Lemma 5). As such,
where the last inequality is from the fact that and . Otherwise, if the best arm has in fact been eliminated, the corresponding regret in this case is bounded by:
(235) | ||||
(236) | ||||
(237) | ||||
(238) | ||||
(239) | ||||
(240) |
Case 1: If , this means that the index of is . Using the fact that we have
(241) |
Then using the definition of the event and the fact that , we have
Hence, . Therefore in this case is upper bounded as follows:
Case 2: If , then using the definition of the event , we have
therefore since event occurs,
Hence in this case is bounded as . Combining the above cases,
To bound , we note from the definition of that
then on the event ,
therefore
Hence
To bound , we use the proof of Chu et al. (2011, Lemma 1), which is restated as follows.
Lemma 10.
For any , , ,
where .
Then using the union bound, we have for all , , for all ,
With the choice and the assumption ,
Appendix G Hyperparameter tuning in our empirical study
G.1 Synthetic Dataset
The tables below report the empirical results obtained while tuning the hyperparameter (the scale of the confidence width) with the other parameters fixed.
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 0.55 | 0.6 | 0.2 | 0.25 | 0.3 | 0.15 | 0.2 | 0.25 | 0.2 | 0.25 | 0.3 | 0.15 | 0.2 | 0.25 | |
Regret | 7.780 | 6.695 | 6.856 | 9.769 | 9.201 | 12.068 | 24.086 | 5.482 | 6.108 | 4.999 | 4.998 | 7.329 | 25.588 | 2.075 | 2.760 |
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 0.55 | 0.6 | 0.1 | 0.15 | 0.2 | 0.2 | 0.25 | 0.3 | 0.2 | 0.25 | 0.3 | 0.2 | 0.25 | 0.3 | |
Regret | 7.203 | 6.832 | 7.423 | 54.221 | 7.042 | 7.352 | 6.707 | 6.053 | 8.458 | 6.254 | 4.918 | 7.013 | 4.407 | 2.562 | 3.041 |
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 0.55 | 0.6 | 0.1 | 0.15 | 0.2 | 0.15 | 0.2 | 0.25 | 0.2 | 0.25 | 0.3 | 0.15 | 0.2 | 0.25 | |
Regret | 7.919 | 5.679 | 7.063 | 69.955 | 6.925 | 7.037 | 24.393 | 5.625 | 6.335 | 6.335 | 4.831 | 7.040 | 41.355 | 1.936 | 2.250 |
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.45 | 0.5 | 0.55 | 0.1 | 0.15 | 0.2 | 0.1 | 0.15 | 0.2 | 0.1 | 0.15 | 0.2 | 0.1 | 0.15 | 0.2 | |
Regret | 9.164 | 9.094 | 14.183 | 14.252 | 9.886 | 14.680 | 19.663 | 6.463 | 10.643 | 15.685 | 5.399 | 8.373 | 8.024 | 2.062 | 3.342 |
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.25 | 0.3 | 0.35 | 0.1 | 0.15 | 0.2 | 0.05 | 0.1 | 0.15 | 0.1 | 0.15 | 0.2 | 0.05 | 0.1 | 0.15 | |
Regret | 7.923 | 7.085 | 10.981 | 14.983 | 9.565 | 19.300 | 58.278 | 6.165 | 9.225 | 8.916 | 8.575 | 13.483 | 142.704 | 2.816 | 3.497 |
We run these algorithms on the same dataset with different choices of the hyperparameter, and we choose the best value, i.e., the one with the least corresponding regret.
G.2 MovieLens Dataset
The tables below report the empirical results obtained while tuning the hyperparameter (the scale of the confidence width) with the other parameters fixed.
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | IDS | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.7 | 0.75 | 0.8 | 0.05 | 0.1 | 0.15 | 0.15 | 0.2 | 0.25 | 0.15 | 0.2 | 0.25 | 0.2 | 0.25 | 0.3 | 0.25 | 0.3 | 0.35 | |
CTR | 0.608 | 0.675 | 0.668 | 0.615 | 0.705 | 0.679 | 0.740 | 0.823 | 0.766 | 0.740 | 0.823 | 0.766 | 0.713 | 0.742 | 0.690 | 0.655 | 0.728 | 0.714 |
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | IDS | ||||||||||||
0.75 | 0.8 | 0.85 | 0 | 0.05 | 0.1 | 0.1 | 0.15 | 0.2 | 0.05 | 0.1 | 0.15 | 0.05 | 0.1 | 0.15 | 0.3 | 0.35 | 0.4 | |
CTR | 0.708 | 0.754 | 0.713 | 0.517 | 0.711 | 0.646 | 0.648 | 0.668 | 0.595 | 0.658 | 0.668 | 0.651 | 0.697 | 0.717 | 0.649 | 0.643 | 0.688 | 0.606 |
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | IDS | ||||||||||||
0.85 | 0.9 | 0.95 | 0 | 0.05 | 0.1 | 0.05 | 0.1 | 0.15 | 0.05 | 0.1 | 0.15 | 0.05 | 0.1 | 0.15 | 0.3 | 0.35 | 0.4 | |
CTR | 0.721 | 0.754 | 0.745 | 0.487 | 0.674 | 0.588 | 0.682 | 0.729 | 0.594 | 0.687 | 0.729 | 0.594 | 0.689 | 0.705 | 0.594 | 0.684 | 0.739 | 0.695 |
We run these algorithms on the same dataset with different choices of the hyperparameter, and we choose the best value, i.e., the one with the largest corresponding reward (CTR).