

Indexed Minimum Empirical Divergence-Based Algorithms
for Linear Bandits

Jie Bian jiebian@u.nus.edu
Department of Electrical and Computer Engineering
National University of Singapore
Vincent Y. F. Tan vtan@nus.edu.sg
Department of Mathematics
Department of Electrical and Computer Engineering
National University of Singapore
Abstract

The Indexed Minimum Empirical Divergence (IMED) algorithm is a highly effective approach that offers a stronger theoretical guarantee of asymptotic optimality than the Kullback–Leibler Upper Confidence Bound (KL-UCB) algorithm for the multi-armed bandit problem. It has also been observed to empirically outperform UCB-based algorithms and Thompson Sampling. Despite its effectiveness, a generalization of this algorithm to contextual bandits with linear payoffs has remained elusive. In this paper, we present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms. We demonstrate that LinIMED attains an $\widetilde{O}(d\sqrt{T})$ upper bound on the regret, where $d$ is the dimension of the context and $T$ is the time horizon. Furthermore, extensive empirical studies reveal that LinIMED and its variants outperform widely-used linear bandit algorithms such as LinUCB and Linear Thompson Sampling in some regimes.

1 Introduction

The multi-armed bandit (MAB) problem (Lattimore & Szepesvári (2020)) is a classical topic in decision theory and reinforcement learning. Among the various subfields of bandit problems, the stochastic linear bandit is the most popular area due to its wide applicability in large-scale, real-world applications such as personalized recommendation systems (Li et al. (2010)), online advertising, and clinical trials. In the stochastic linear bandit model, at each time step $t$, the learner has to choose one arm $A_{t}$ from the time-varying action set $\mathcal{A}_{t}$. Each arm $a\in\mathcal{A}_{t}$ has a corresponding context $x_{t,a}\in\mathbb{R}^{d}$, which is a $d$-dimensional vector. By pulling the arm $a\in\mathcal{A}_{t}$ at time step $t$, under the linear bandit setting, the learner receives the reward $Y_{t,a}$, whose expected value satisfies $\mathbb{E}[Y_{t,a}\,|\,x_{t,a}]=\langle\theta^{*},x_{t,a}\rangle$, where $\theta^{*}\in\mathbb{R}^{d}$ is an unknown parameter. The goal of the learner is to maximize the cumulative reward over a time horizon $T$, which is equivalent to minimizing the cumulative regret, defined as $R_{T}:=\mathbb{E}\big[\sum_{t=1}^{T}\max_{a\in\mathcal{A}_{t}}\langle\theta^{*},x_{t,a}\rangle-Y_{t,A_{t}}\big]$. The learner needs to balance the trade-off between the exploration of different arms (to learn their expected rewards) and the exploitation of the arm with the highest expected reward based on the available data.

1.1 Motivation and Related Work

The $K$-armed bandit setting is a special case of the linear bandit. There exist several good algorithms for this setting, such as UCB1 (Auer et al. (2002)), Thompson Sampling (Agrawal & Goyal (2012)), and the Indexed Minimum Empirical Divergence (IMED) algorithm (Honda & Takemura (2015)). There are three main families of asymptotically optimal multi-armed bandit algorithms based on different principles (Baudry et al. (2023)). However, among these algorithms, only IMED lacks an extension to contextual bandits with linear payoffs. In the varying-arm setting of the linear bandit problem, the LinUCB algorithm of Li et al. (2010) is frequently employed in practice. It has a theoretical regret guarantee of order $O(d\sqrt{T}\log(T))$ when using the confidence width as in OFUL (Abbasi-Yadkori et al. (2011)). Although the SupLinUCB algorithm introduced by Chu et al. (2011) uses phases to decompose the reward dependence of each time step and achieves an $\widetilde{O}(\sqrt{dT})$ regret upper bound (the $\widetilde{O}(\cdot)$ notation omits logarithmic factors in $T$), its empirical performance falls short of both the algorithm in Li et al. (2010) and the Linear Thompson Sampling algorithm (Agrawal & Goyal (2013)), as mentioned in Lattimore & Szepesvári (2020, Chapter 22).

On the other hand, the Optimism in the Face of Uncertainty Linear (OFUL) bandit algorithm in Abbasi-Yadkori et al. (2011) achieves a regret upper bound of $\widetilde{O}(d\sqrt{T})$ through an improved analysis of the confidence bound using a martingale technique. However, it involves a bilinear optimization problem over the action set and the confidence ellipsoid when choosing an arm at each time step. This is computationally expensive unless the confidence ellipsoid is a convex hull of a finite set.

| Algorithm | Problem-independent regret bound | Regret bound independent of $K$? | Principle that the algorithm is based on |
| OFUL (Abbasi-Yadkori et al. (2011)) | $O(d\sqrt{T}\log(T))$ | Yes | Optimism |
| LinUCB (Li et al. (2010)) | Hard to analyze | Unknown | Optimism |
| LinTS (Agrawal & Goyal (2013)) | $O(d^{\frac{3}{2}}\sqrt{T})\wedge O(d\sqrt{T\log(K)})$ | Yes | Posterior sampling |
| SupLinUCB (Chu et al. (2011)) | $O(\sqrt{dT\log^{3}(KT)})$ | No | Optimism |
| LinUCB with OFUL's confidence bound | $O(d\sqrt{T}\log(T))$ | Yes | Optimism |
| Asymptotically Optimal IDS (Kirschner et al. (2021)) | $O(d\sqrt{T}\log(T))$ | Yes | Information directed sampling |
| LinIMED-3 (this paper) | $O(d\sqrt{T}\log(T))$ | Yes | Minimum empirical divergence |
| SupLinIMED (this paper) | $O(\sqrt{dT\log^{3}(KT)})$ | No | Minimum empirical divergence |
Table 1: Comparison of algorithms for linear bandits with varying arm sets (the "independent of $K$?" column can be read off from whether the stated bound depends on the number of arms $K$)

Among randomized algorithms designed for the linear bandit problem, Agrawal & Goyal (2013) proposed the LinTS algorithm, which is in the spirit of Thompson Sampling (Thompson (1933)) and uses a confidence ellipsoid similar to that of LinUCB-like algorithms. This algorithm is efficient and achieves a regret upper bound of $O(d^{\frac{3}{2}}\sqrt{T}\wedge d\sqrt{T\log K})$, where $K$ is the number of arms at each time step, i.e., $|\mathcal{A}_{t}|=K$ for all $t$. Compared to LinUCB with OFUL's confidence width, its minimax regret upper bound carries an extra factor of $O(\sqrt{d}\wedge\sqrt{\log K})$.

Recently, MED-like (minimum empirical divergence) algorithms have come to the fore since these randomized algorithms have the property that the probability of selecting each arm is available in closed form, which benefits downstream work such as offline evaluation with the inverse propensity score. Both MED in the sub-Gaussian environment and its deterministic version IMED have demonstrated superior performance over Thompson Sampling (Bian & Jun (2021), Honda & Takemura (2015)). Baudry et al. (2023) also show that MED has a close relation to Thompson Sampling; in particular, it is argued that MED and TS can be interpreted as two variants of the same exploration strategy. Bian & Jun (2021) also show that the probability of selecting each arm under MED in the sub-Gaussian case can be viewed as a closed-form approximation of the corresponding probability under Thompson Sampling. We take inspiration from the extension of Thompson Sampling to linear bandits and are thus motivated to extend MED-like algorithms to the linear bandit setting and prove regret bounds that are competitive vis-à-vis the state-of-the-art bounds.

Thus, this paper aims to answer the question of whether it is possible to devise an extension of the IMED algorithm for the linear bandit problem in the varying arm set setting (for both infinite and finite arm sets) with a regret upper bound of $O(d\sqrt{T}\log T)$, matching LinUCB with OFUL's confidence bound, while being as efficient as LinUCB. The proposed family of algorithms, called LinIMED, as well as SupLinIMED, can be viewed as generalizations of the IMED algorithm (Honda & Takemura (2015)) to the linear bandit setting. We prove that LinIMED and its variants achieve a regret upper bound of $\widetilde{O}(d\sqrt{T})$ and that they are efficient, performing no worse than LinUCB. SupLinIMED enjoys a regret bound of $\widetilde{O}(\sqrt{dT})$, but applies only to instances with finite arm sets. In our empirical study, we found that the different variants of LinIMED perform better than LinUCB and LinTS on various synthetic and real-world instances under consideration.

Compared to OFUL, LinIMED is computationally more efficient. Compared to SupLinUCB, our LinIMED algorithm is significantly simpler, and compared to LinUCB with OFUL's confidence bound, our empirical performance is better. This is because in our algorithm, the exploitation term and the exploration term are decoupled, which allows finer control when tuning the hyperparameters in the empirical study.

Compared to LinTS, our algorithm's (specifically LinIMED-3's) regret bound is superior by a factor of $O(\sqrt{d}\wedge\sqrt{\log K})$. Since the fixed-arm setting is a special case of the finite varying-arm setting, our result is more general than that of other fixed-arm linear bandit algorithms such as Spectral Eliminator (Valko et al. (2014)) and PEGOE (Lattimore & Szepesvári (2020, Chapter 22)). Finally, we observe that since the index used in LinIMED has a similar form to the index used in the Information Directed Sampling (IDS) procedure of Kirschner et al. (2021) (which is known to be asymptotically optimal but more difficult to compute), LinIMED performs significantly better on the "End of Optimism" example in Lattimore & Szepesvari (2017). We summarize the comparisons of LinIMED to other linear bandit algorithms in Table 1, and we discuss comparisons to further linear bandit algorithms in Sections 3.2, 3.3, and Appendix B.

2 Problem Statement

Notations:

For any $d$-dimensional vector $x\in\mathbb{R}^{d}$ and any $d\times d$ positive definite matrix $A$, we use $\lVert x\rVert_{A}$ to denote the Mahalanobis norm $\sqrt{x^{\top}Ax}$. We use $a\wedge b$ (resp. $a\vee b$) to represent the minimum (resp. maximum) of two real numbers $a$ and $b$.

The Stochastic Linear Bandit Model:

In the stochastic linear bandit model, the learner chooses an arm $A_{t}$ at each round $t$ from the arm set $\mathcal{A}_{t}=\{a_{t,1},a_{t,2},\ldots\}\subseteq\mathbb{R}$, where the cardinality of each arm set $\mathcal{A}_{t}$ can potentially be infinite, i.e., $|\mathcal{A}_{t}|=\infty$ for all $t\geq 1$. Each arm $a\in\mathcal{A}_{t}$ at time $t$ has a corresponding context (arm vector) $x_{t,a}\in\mathbb{R}^{d}$, which is known to the learner. After choosing arm $A_{t}$, the environment reveals the reward

$$Y_{t}=\langle\theta^{*},X_{t}\rangle+\eta_{t}\qquad(1)$$

to the learner, where $X_{t}:=x_{t,A_{t}}$ is the context of the chosen arm $A_{t}$, $\theta^{*}\in\mathbb{R}^{d}$ is the unknown coefficient vector of the linear model, and $\eta_{t}$ is $R$-sub-Gaussian noise conditioned on $\{A_{1},A_{2},\ldots,A_{t},Y_{1},Y_{2},\ldots,Y_{t-1}\}$ such that for any $\lambda\in\mathbb{R}$, almost surely,

$$\mathbb{E}\left[\exp(\lambda\eta_{t})\mid A_{1},A_{2},\ldots,A_{t},Y_{1},Y_{2},\ldots,Y_{t-1}\right]\leq\exp\Big(\frac{\lambda^{2}R^{2}}{2}\Big).$$

Denote $a_{t}^{*}:=\operatorname*{arg\,max}_{a\in\mathcal{A}_{t}}\langle\theta^{*},x_{t,a}\rangle$ as the arm with the largest expected reward at time $t$. The goal is to minimize the expected cumulative regret over the horizon $T$. The (expected) cumulative regret is defined as

$$R_{T}=\mathbb{E}\left[\sum_{t=1}^{T}\langle\theta^{*},x_{t,a_{t}^{*}}\rangle-\langle\theta^{*},X_{t}\rangle\right].$$
Assumption 1.

For each time $t$, we assume that $\lVert X_{t}\rVert\leq L$ and $\lVert\theta^{*}\rVert\leq S$ for some fixed $L,S>0$. We also assume that $\Delta_{t,b}:=\max_{a\in\mathcal{A}_{t}}\langle\theta^{*},x_{t,a}\rangle-\langle\theta^{*},x_{t,b}\rangle\leq 1$ for each arm $b\in\mathcal{A}_{t}$ and time $t$.
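The model and assumptions above can be simulated in a few lines. The sketch below is illustrative only (the constants $d$, $K$, $R$, $S$, $L$ and all function names are our own choices, not part of the algorithms that follow); it draws a time-varying arm set, enforces the norm bounds, and generates rewards as in Eqn. (1):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, R, S, L = 5, 10, 0.1, 1.0, 1.0    # illustrative constants

theta_star = rng.normal(size=d)
theta_star *= S / np.linalg.norm(theta_star)    # enforce ||theta*|| <= S

def sample_arm_set():
    """Draw a fresh time-varying arm set: K contexts with ||x_{t,a}|| <= L."""
    X = rng.normal(size=(K, d))
    return L * X / np.linalg.norm(X, axis=1, keepdims=True)

def pull(x):
    """Observe Y_t = <theta*, x> + eta_t; Gaussian noise is R-sub-Gaussian."""
    return float(theta_star @ x + rng.normal(scale=R))

X_t = sample_arm_set()                       # contexts x_{t,a} for a in A_t
a_star = int(np.argmax(X_t @ theta_star))    # best arm a_t^* at this round
y = pull(X_t[a_star])                        # observed reward
```

A learner interacting with this environment accrues per-round regret $\Delta_t = \langle\theta^*, x_{t,a_t^*}\rangle - \langle\theta^*, X_t\rangle$, which is what the algorithms in Section 3 aim to control.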

3 Description of LinIMED Algorithms

Algorithm 1 LinIMED-$x$ for $x\in\{1,2,3\}$
1:  Input: LinIMED mode $x$, dimension $d$, regularization parameter $\lambda$, bound $S$ on $\lVert\theta^{*}\rVert$, sub-Gaussian parameter $R$, concentration parameter $\gamma$ of $\theta^{*}$, bound $L$ on $\lVert x_{t,a}\rVert$ for all $t\geq 1$ and $a\in\mathcal{A}_{t}$, constant $C\geq 1$.
2:  Initialize: $V_{0}=\lambda I_{d\times d}$, $W_{0}=0_{d\times 1}$ (the all-zeros vector in $d$ dimensions), $\hat{\theta}_{0}=V_{0}^{-1}W_{0}$
3:  for $t=1,2,\ldots,T$ do
4:     Receive the arm set $\mathcal{A}_{t}$ and compute $\beta_{t-1}(\gamma)=\big(R\sqrt{d\log\frac{1+(t-1)L^{2}/\lambda}{\gamma}}+\sqrt{\lambda}S\big)^{2}$.
5:     for $a\in\mathcal{A}_{t}$ do
6:        Compute $\hat{\mu}_{t,a}=\langle\hat{\theta}_{t-1},x_{t,a}\rangle$ and $\mathrm{UCB}_{t}(a)=\langle\hat{\theta}_{t-1},x_{t,a}\rangle+\sqrt{\beta_{t-1}(\gamma)}\lVert x_{t,a}\rVert_{V^{-1}_{t-1}}$
7:        Compute $\hat{\Delta}_{t,a}=(\max_{j\in\mathcal{A}_{t}}\hat{\mu}_{t,j}-\hat{\mu}_{t,a})\cdot\mathbbm{1}\{x=1,2\}+(\max_{j\in\mathcal{A}_{t}}\mathrm{UCB}_{t}(j)-\mathrm{UCB}_{t}(a))\cdot\mathbbm{1}\{x=3\}$
8:        if $a=\operatorname*{arg\,max}_{j\in\mathcal{A}_{t}}(\hat{\mu}_{t,j}\cdot\mathbbm{1}\{x=1,2\}+\mathrm{UCB}_{t}(j)\cdot\mathbbm{1}\{x=3\})$ then
9:           $I_{t,a}=-\log\big(\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert_{V_{t-1}^{-1}}^{2}\big)\cdot\mathbbm{1}\{x=1\}$ (LinIMED-1)
10:          $\qquad+\big[\log T\wedge\big(-\log(\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert_{V_{t-1}^{-1}}^{2})\big)\big]\cdot\mathbbm{1}\{x=2\}$ (LinIMED-2)
11:          $\qquad+\big[\log\frac{C}{\max_{a\in\mathcal{A}_{t}}\hat{\Delta}^{2}_{t,a}}\wedge\big(-\log(\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert_{V_{t-1}^{-1}}^{2})\big)\big]\cdot\mathbbm{1}\{x=3\}$ (LinIMED-3)
12:       else
13:          $I_{t,a}=\frac{\hat{\Delta}_{t,a}^{2}}{\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert_{V_{t-1}^{-1}}^{2}}-\log\big(\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert_{V_{t-1}^{-1}}^{2}\big)$
14:       end if
15:    end for
16:    Pull the arm $A_{t}=\operatorname*{arg\,min}_{a\in\mathcal{A}_{t}}I_{t,a}$ (ties are broken arbitrarily) and receive its reward $Y_{t}$.
17:    Update:
18:    $V_{t}=V_{t-1}+X_{t}X_{t}^{\top}$, $W_{t}=W_{t-1}+Y_{t}X_{t}$, and $\hat{\theta}_{t}=V_{t}^{-1}W_{t}$.
19:  end for

In the pseudocode of Algorithm 1, at each time step $t$, in Line 4, we use the improved confidence bound of $\theta^{*}$ as in Abbasi-Yadkori et al. (2011) to calculate the confidence parameter $\beta_{t-1}(\gamma)$. After that, for each arm $a\in\mathcal{A}_{t}$, in Lines 6 and 7, the empirical gap between the highest empirical reward and the empirical reward of arm $a$ is estimated as

$$\hat{\Delta}_{t,a}=\begin{cases}\max_{j\in\mathcal{A}_{t}}\langle\hat{\theta}_{t-1},x_{t,j}\rangle-\langle\hat{\theta}_{t-1},x_{t,a}\rangle&\text{if LinIMED-1,2}\\\max_{j\in\mathcal{A}_{t}}\mathrm{UCB}_{t}(j)-\mathrm{UCB}_{t}(a)&\text{if LinIMED-3}\end{cases}\qquad(2)$$

Then, in Lines 9 to 11, using the confidence parameter $\beta_{t-1}(\gamma)$, we compute the index $I_{t,a}$ for the empirically best arm $a=\operatorname*{arg\,max}_{j\in\mathcal{A}_{t}}\hat{\mu}_{t,j}$ (for LinIMED-1,2) or the highest-UCB arm $a=\operatorname*{arg\,max}_{j\in\mathcal{A}_{t}}\mathrm{UCB}_{t}(j)$ (for LinIMED-3). The different versions of LinIMED encourage different amounts of exploitation. For the other arms, in Line 13, the index is defined and computed as

$$I_{t,a}=\frac{\hat{\Delta}_{t,a}^{2}}{\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert_{V_{t-1}^{-1}}^{2}}.$$

Then, with all the indices of the arms calculated, in Line 16, we choose the arm $A_{t}$ with the minimum index, i.e., $A_{t}=\operatorname*{arg\,min}_{a\in\mathcal{A}_{t}}I_{t,a}$ (ties broken arbitrarily), and the agent receives its reward. Finally, in Line 18, we use ridge regression to estimate the unknown $\theta^{*}$ as $\hat{\theta}_{t}$ and update the matrix $V_{t}$ and the vector $W_{t}$. The algorithm then iterates to the next time step until the time horizon $T$ is reached. From the pseudocode, we observe that the only differences between the three algorithms are the way that the squared gap, which plays the role of the empirical divergence, is estimated, and the index of the empirically best arm. The latter point implies that the empirically best arm is selected more often in LinIMED-2 and LinIMED-3 than in LinIMED-1; in other words, we encourage more exploitation in LinIMED-2 and LinIMED-3. In the spirit of the IMED algorithm of Honda & Takemura (2015), the first term of our index $I_{t,a}$ for the LinIMED-1 algorithm is $\hat{\Delta}_{t,a}^{2}/(\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert_{V_{t-1}^{-1}}^{2})$; this term controls the exploitation, while the second term $-\log(\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert_{V_{t-1}^{-1}}^{2})$ controls the exploration.
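As a concrete illustration, one arm selection of LinIMED-1 can be sketched as follows (a minimal sketch with illustrative values; since $\hat{\Delta}_{t,a}=0$ for the empirically best arm, a single formula reproduces both Line 9 and Line 13 for the case $x=1$):

```python
import numpy as np

def linimed1_choose(X, V, W, beta):
    """One arm selection of LinIMED-1 on an arm-feature matrix X (K x d).

    V, W are the running ridge-regression statistics V_{t-1}, W_{t-1};
    beta plays the role of beta_{t-1}(gamma)."""
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ W                               # ridge estimate of theta*
    mu = X @ theta_hat                                  # empirical means mu_hat
    var = beta * np.einsum('kd,de,ke->k', X, V_inv, X)  # beta * ||x||^2_{V^{-1}}
    gap = mu.max() - mu                                 # empirical gaps Delta_hat
    index = gap**2 / var - np.log(var)                  # IMED-style index I_{t,a}
    return int(np.argmin(index))                        # ties -> lowest arm id

# Toy usage: 3 arms in dimension d = 2 with lambda = 1 (so V = I, W = 0).
V, W = np.eye(2), np.zeros(2)
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
arm = linimed1_choose(X, V, W, beta=1.0)
```

With no data yet ($\hat{\theta}_{0}=0$), all empirical gaps vanish and the index reduces to $-\log(\beta\lVert x\rVert^{2}_{V^{-1}})$, so the arm with the largest exploration bonus $\lVert x\rVert^{2}_{V^{-1}}$ is chosen.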

3.1 Description of the SupLinIMED Algorithm

1:  Input: $T\in\mathbb{N}$, $S'=\lceil\log T\rceil$, $\Psi_{t}^{s}=\emptyset$ for all $s\in[S']$, $t\in[T]$
2:  for $t=1,2,\ldots,T$ do
3:     $s\leftarrow 1$ and $\hat{\mathcal{A}}_{1}\leftarrow[K]$
4:     repeat
5:        Use BaseLinUCB (stated in Algorithm 3 in Appendix A) with $\Psi_{t}^{s}$ to calculate the width $w^{s}_{t,a}$ and sample mean $\hat{Y}^{s}_{t,a}$ for all $a\in\hat{\mathcal{A}}_{s}$.
6:        if $w^{s}_{t,a}\leq\frac{1}{\sqrt{T}}$ for all $a\in\hat{\mathcal{A}}_{s}$ then
7:           Choose $A_{t}=\operatorname*{arg\,min}_{a\in\hat{\mathcal{A}}_{s}}I_{t,a}$, where $I_{t,a}$ is the same index function as in the LinIMED algorithms:
8:           Calculate the index
$$I_{t,a}=\begin{cases}\log(2T)\wedge\big(-\log((w^{s}_{t,a})^{2})\big)&\text{if }a=\operatorname*{arg\,max}_{b\in\hat{\mathcal{A}}_{s}}\hat{Y}^{s}_{t,b}\\\big(\frac{\hat{\Delta}_{t,a}^{s}}{w_{t,a}^{s}}\big)^{2}-\log((w_{t,a}^{s})^{2})&\text{otherwise}\end{cases}\quad\text{where}\quad\hat{\Delta}^{s}_{t,a}:=\max_{b\in\hat{\mathcal{A}}_{s}}\hat{Y}^{s}_{t,b}-\hat{Y}^{s}_{t,a}.$$
9:           Keep the same index sets at all levels: $\Psi^{s'}_{t+1}\leftarrow\Psi^{s'}_{t}$ for all $s'\in[S']$. $\leftarrow$ Case 1
10:        else if $w^{s}_{t,a}\leq 2^{-s}$ for all $a\in\hat{\mathcal{A}}_{s}$ then
11:           $\hat{\mathcal{A}}_{s+1}\leftarrow\{a\in\hat{\mathcal{A}}_{s}\,:\,\hat{Y}^{s}_{t,a}+w^{s}_{t,a}\geq\max_{a'\in\hat{\mathcal{A}}_{s}}(\hat{Y}^{s}_{t,a'}+w^{s}_{t,a'})-2^{1-s}\}$
12:           $s\leftarrow s+1$ $\leftarrow$ Case 2
13:        else
14:           Choose $A_{t}\in\hat{\mathcal{A}}_{s}$ such that $w^{s}_{t,A_{t}}>2^{-s}$
15:           Update the index sets at all levels: $\Psi^{s'}_{t+1}\leftarrow\Psi^{s'}_{t}\cup\{t\}$ if $s=s'$; $\Psi^{s'}_{t+1}\leftarrow\Psi^{s'}_{t}$ if $s\neq s'$ $\leftarrow$ Case 3
16:        end if
17:     until an action $A_{t}$ is found
18:  end for
Algorithm 2 SupLinIMED

Now we consider the case in which the arm set $\mathcal{A}_{t}$ at each time $t$ is finite but still time-varying. In particular, $\mathcal{A}_{t}=\{a_{t,1},a_{t,2},\ldots,a_{t,K}\}\subseteq\mathbb{R}$ are sets of constant size $K$, i.e., $|\mathcal{A}_{t}|=K<\infty$. In the pseudocode of Algorithm 2, we apply the SupLinUCB framework (Chu et al., 2011), leveraging Algorithm 3 (in Appendix A) as a subroutine within each phase. This ensures the independence of the choice of the arm from past observations of rewards, thereby yielding a concentration inequality on the estimated reward (see Lemma 1 in Chu et al. (2011)) that is tighter by a factor of $\sqrt{d}$ in the finite-arm setting. As a result, the regret improves by a factor of $\sqrt{d}$, ignoring logarithmic factors. At each time step $t$ and phase $s$, in Line 5, we utilize the BaseLinUCB algorithm as a subroutine to calculate the sample mean and confidence width, since we also need these terms to calculate the IMED-style indices of each arm. In Lines 6–9 (Case 1), if the width of each arm is less than $\frac{1}{\sqrt{T}}$, we choose the arm with the smallest IMED-style index. In Lines 10–12 (Case 2), the framework is the same as in SupLinUCB (Chu et al. (2011)): if the width of each arm is smaller than $2^{-s}$ but there exist arms with widths larger than $\frac{1}{\sqrt{T}}$, then in Line 11 the "unpromising" arms are eliminated until the width of each remaining arm is small enough to satisfy the condition in Line 6. Otherwise, if there exist arms with widths larger than $2^{-s}$, in Lines 14–15 (Case 3), we choose one such arm and record its context and reward into the set $\Psi^{s}_{t+1}$ at the current phase $s$.
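The three-way case distinction above can be summarized by a small dispatch function. This is an illustrative sketch of the control flow only (`sup_case` and `widths` are our own names; `widths` stands for the BaseLinUCB widths $w^{s}_{t,a}$ of the arms surviving at phase $s$):

```python
import math

def sup_case(widths, s, T):
    """Return which SupLinIMED case (1, 2, or 3) applies at phase s."""
    if max(widths) <= 1 / math.sqrt(T):
        return 1   # Case 1: all widths tiny; pick the arm with minimal IMED index
    if max(widths) <= 2 ** (-s):
        return 2   # Case 2: eliminate unpromising arms, move to phase s + 1
    return 3       # Case 3: pull an arm with width > 2^{-s} to shrink its width
```

Note that the phases only refine: each pass through Case 2 halves the permissible width, so the loop terminates after at most $S'=\lceil\log T\rceil$ phases.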

3.2 Relation to the IMED algorithm of Honda & Takemura (2015)

The IMED algorithm is a deterministic algorithm for the $K$-armed bandit problem. At each time step $t$, it chooses the arm with the minimum index, i.e.,

$$a=\operatorname*{arg\,min}_{i\in[K]}\big\{T_{i}(t)D_{\mathrm{inf}}(\hat{F}_{i}(t),\hat{\mu}^{*}(t))+\log T_{i}(t)\big\},\qquad(3)$$

where $T_{i}(t)=\sum_{s=1}^{t-1}\mathbbm{1}\{A_{s}=i\}$ is the number of pulls of arm $i$ up to time $t$, and $D_{\mathrm{inf}}(\hat{F}_{i}(t),\hat{\mu}^{*}(t))$ is a divergence measure between the empirical reward distribution of arm $i$ and the highest sample mean. More precisely, $D_{\mathrm{inf}}(F,\mu):=\inf_{G\in\mathcal{G}:\mathbb{E}(G)\leq\mu}D(F\|G)$, where $\mathcal{G}$ is the family of distributions supported on $(-\infty,1]$. As shown in Honda & Takemura (2015), its asymptotic bound is even better than that of the KL-UCB algorithm (Garivier & Cappé (2011)), and it can be extended to semi-bounded support models such as $\mathcal{G}$. This algorithm also empirically outperforms the Thompson Sampling algorithm, as shown in Honda & Takemura (2015). However, an extension of the IMED algorithm with a minimax regret bound of $\widetilde{O}(d\sqrt{T})$ has not been derived. In the design of our LinIMED algorithms, we replace the optimized KL-divergence measure in IMED in Eqn. (3) with the squared gap between the sample mean of arm $i$ and the maximum sample mean. This choice simplifies our analysis and does not adversely affect the regret bound. On the other hand, we view the term $1/T_{i}(t)$ as the variance of the sample mean of arm $i$ at time $t$; in this spirit, we use $\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert^{2}_{V_{t-1}^{-1}}$ as the variance of the sample mean (which is $\langle\hat{\theta}_{t-1},x_{t,a}\rangle$) of arm $a$ at time $t$.
We choose $\hat{\Delta}_{t,a}^{2}/(\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert_{V_{t-1}^{-1}}^{2})$ instead of a KL-divergence approximation for the index since, in the classical linear bandit setting, the noise is sub-Gaussian, and it is known that the KL-divergence between two Gaussian random variables with the same variance, $\mathrm{KL}(\mathcal{N}(\mu_{1},\sigma^{2}),\mathcal{N}(\mu_{2},\sigma^{2}))=\frac{(\mu_{1}-\mu_{2})^{2}}{2\sigma^{2}}$, has a closed-form expression similar to $\hat{\Delta}_{t,a}^{2}/(\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert_{V_{t-1}^{-1}}^{2})$, ignoring the constant $\frac{1}{2}$.
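This closed form is easy to verify numerically. The sketch below (with illustrative values of $\mu_1$, $\mu_2$, $\sigma^2$) integrates the KL divergence of two same-variance Gaussians on a fine grid and compares it with $(\mu_{1}-\mu_{2})^{2}/(2\sigma^{2})$:

```python
import numpy as np

def kl_same_variance(mu1, mu2, var):
    """Numerically evaluate KL(N(mu1, var) || N(mu2, var)) on a fine grid."""
    lo = min(mu1, mu2) - 10 * np.sqrt(var)
    hi = max(mu1, mu2) + 10 * np.sqrt(var)
    x = np.linspace(lo, hi, 200001)
    p = np.exp(-(x - mu1) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    log_ratio = ((x - mu2) ** 2 - (x - mu1) ** 2) / (2 * var)   # log p(x)/q(x)
    f = p * log_ratio
    return float(np.sum((f[1:] + f[:-1]) / 2) * (x[1] - x[0]))  # trapezoid rule

mu1, mu2, var = 0.3, -0.2, 0.5                # illustrative values
closed_form = (mu1 - mu2) ** 2 / (2 * var)    # the expression quoted above
```

The numerical integral agrees with the closed form up to discretization error, which supports replacing the optimized divergence with the squared-gap surrogate.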

3.3 Relation to Information Directed Sampling (IDS) for Linear Bandits

Information Directed Sampling (IDS), introduced by Russo & Van Roy (2014), serves as a good principle for regret minimization in linear bandits to achieve asymptotic optimality. The intuition behind IDS is to balance the information gain on the best arm against the expected reward at each time step. This goal is realized by optimizing over the distribution of selected arms $\pi\in\mathcal{D}(\mathcal{A})$ (where $\mathcal{A}$ is the fixed finite arm set) to minimize the information ratio:

$$\pi_{t}^{\mathrm{IDS}}:=\operatorname*{arg\,min}_{\pi\in\mathcal{D}(\mathcal{A})}\frac{\hat{\Delta}_{t}^{2}(\pi)}{g_{t}(\pi)},\qquad(4)$$

where $\hat{\Delta}_{t}$ is the empirical gap and $g_{t}$ is the so-called information gain (defined later). Kirschner & Krause (2018), Kirschner et al. (2020), and Kirschner et al. (2021) apply the IDS principle to the linear bandit setting. The first two works propose both randomized and deterministic versions of IDS for linear bandits and show a near-optimal minimax regret bound of order $\widetilde{O}(d\sqrt{T})$. Kirschner et al. (2021) designed an asymptotically optimal linear bandit algorithm that retains its near-optimal minimax regret properties. Comparing these algorithms with our LinIMED algorithms, we observe that the first term of the index of non-greedy actions in our algorithms is $\hat{\Delta}^{2}_{t,a}/(\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert^{2}_{V^{-1}_{t-1}})$, which is similar to the information ratio in IDS with the estimated gap $\Delta_{t}(a):=\hat{\Delta}_{t,a}$ as defined in Algorithm 1 and the information gain $g_{t}(a):=\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert^{2}_{V^{-1}_{t-1}}$. As mentioned in Kirschner & Krause (2018), when the noise is 1-sub-Gaussian and $\lVert x_{t,a}\rVert^{2}_{V^{-1}_{t-1}}\ll 1$, the information gain in the deterministic IDS algorithm is approximately $\lVert x_{t,a}\rVert^{2}_{V^{-1}_{t-1}}$, which is similar to our choice $\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert^{2}_{V^{-1}_{t-1}}$. However, our LinIMED algorithms differ from the deterministic IDS algorithm in Kirschner & Krause (2018) since our estimated gap $\hat{\Delta}_{t,a}$ differs from theirs. Furthermore, as discussed in Kirschner et al. (2020), when the noise is 1-sub-Gaussian and $\lVert x_{t,a}\rVert^{2}_{V^{-1}_{t-1}}\ll 1$, the action chosen by UCB minimizes the deterministic information ratio.
However, this is not the case for our algorithm, which includes the second term $-\log(\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert^{2}_{V^{-1}_{t-1}})$ in LinIMED-1 to balance information and optimism. Compared to IDS in Kirschner et al. (2021), their algorithm is a randomized version of the deterministic IDS algorithm, which is more computationally expensive than ours since our LinIMED algorithms are fully deterministic (the support of the allocation in Kirschner et al. (2021) is of size two). Their IDS also defines a more complicated version of the information gain to achieve asymptotic optimality. Finally, to the best of our knowledge, all of these IDS algorithms are designed for linear bandits in the setting where the arm set is fixed and finite, while in our setting the arm set is finite and can change over time. We discuss comparisons to other related work in Appendix B.

4 Theorem Statements

Theorem 1.

Under Assumption 1, the assumption that $\langle\theta^{*},x_{t,a}\rangle\geq 0$ for all $t\geq 1$ and $a\in\mathcal{A}_{t}$, and the assumption that $\sqrt{\lambda}S\geq 1$, the regret of the LinIMED-1 algorithm is upper bounded as follows:

$$R_{T}\leq O\big(d\sqrt{T}\log^{\frac{3}{2}}(T)\big).$$

A proof sketch of Theorem 1 is provided in Section 5.

Theorem 2.

Under Assumption 1 and the assumption that $\sqrt{\lambda}S\geq 1$, the regret of the LinIMED-2 algorithm is upper bounded as follows:

$$R_{T}\leq O\big(d\sqrt{T}\log^{\frac{3}{2}}(T)\big).$$
Theorem 3.

Under Assumption 1, the assumption that $\sqrt{\lambda}S\geq 1$, and the assumption that $C$ in Line 11 is a constant, the regret of the LinIMED-3 algorithm is upper bounded as follows:

$$R_{T}\leq O\big(d\sqrt{T}\log(T)\big).$$
Theorem 4.

Under Assumption 1 and the assumption that $L=S=1$, the regret of the SupLinIMED algorithm (which is applicable to linear bandit problems with $K<\infty$ arms) is upper bounded as follows:

$$R_{T}\leq O\big(\sqrt{dT\log^{3}(KT)}\big).$$

The upper bounds on the regret of LinIMED and its variants are all of the form $\widetilde{O}(d\sqrt{T})$, which, ignoring the logarithmic factors, is the same as that of the OFUL algorithm (Abbasi-Yadkori et al. (2011)). Compared to LinTS, they enjoy an advantage of $O(\sqrt{d}\wedge\sqrt{\log K})$. Moreover, these upper bounds do not depend on the number of arms $K$, which means the algorithms can be applied to linear bandit problems with large arm sets (including infinite ones). One observes that LinIMED-2 and LinIMED-3 do not require the additional assumption that $\langle\theta^{*},x_{t,a}\rangle\geq 0$ for all $t\geq 1$ and $a\in\mathcal{A}_{t}$ to achieve the $\widetilde{O}(d\sqrt{T})$ regret upper bound. It is difficult to prove the regret bound for the LinIMED-1 algorithm without this assumption since our proof needs $\langle\theta^{*},X_{t}\rangle\geq 0$ for all $t$ to bound the $F_{1}$ term. On the other hand, LinIMED-2 and LinIMED-3 encourage more exploitation through the index of the empirically best arm at each time step without adversely influencing the regret bound; this accelerates learning on well-preprocessed datasets. The regret bound of LinIMED-3, in fact, matches that of LinUCB with OFUL's confidence bound. In the proofs, we extensively use a technique known as the "peeling device" (Lattimore & Szepesvári, 2020, Chapter 9). This analytical technique, commonly used in the theory of bandit algorithms, partitions the range of some random variable into several pieces; then, using the basic fact that $\mathbb{P}(A\cap(\cup_{i=1}^{\infty}B_{i}))\leq\sum_{i=1}^{\infty}\mathbb{P}(A\cap B_{i})$, we can exploit the more refined range of the random variable on each piece to derive the desired bounds.
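Schematically, the peeling device applied to a random variable $Z$ (for us, $Z=\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}$) with a dyadic partition reads as follows; this is an illustrative instance of the technique, not the exact decomposition used in the appendices:

```latex
\mathbb{P}\big(A\cap\{0<Z\leq 1\}\big)
  =\mathbb{P}\Big(A\cap\bigcup_{k=0}^{\infty}\big\{2^{-k-1}<Z\leq 2^{-k}\big\}\Big)
  \leq\sum_{k=0}^{\infty}\mathbb{P}\big(A\cap\big\{2^{-k-1}<Z\leq 2^{-k}\big\}\big).
```

On the $k$-th piece, the two-sided bound $2^{-k-1}<Z\leq 2^{-k}$ is available, so each summand can be controlled more tightly than the aggregate event.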

Finally, Theorem 4 says that when the arm set is finite, we can use the framework of SupLinUCB (Chu et al., 2011) with our LinIMED index $I_{t,a}$ to achieve a regret bound of order $\widetilde{O}(\sqrt{dT})$, which is a $\sqrt{d}$ improvement over the regret bounds of the family of LinIMED algorithms (ignoring logarithmic terms). The proof is provided in Appendix F.

5 Proof Sketch of Theorem 1

We choose to present the proof sketch of Theorem 1 since it contains the main ingredients for all the theorems in the preceding section. Before presenting the proof, we introduce the following lemma and corollary.

Lemma 1.

(Abbasi-Yadkori et al. (2011, Theorem 2)) Under Assumption 1, for any time step $t\geq 1$ and any $\gamma>0$, we have

$$\mathbb{P}\big(\lVert\hat{\theta}_{t-1}-\theta^{*}\rVert_{V_{t-1}}\leq\sqrt{\beta_{t-1}(\gamma)}\big)\geq 1-\gamma.\qquad(5)$$

This lemma shows that the true parameter θ\theta^{*} lies in an ellipsoid centered at θ^t1\hat{\theta}_{t-1} with high probability; the quantity βt1(γ)\sqrt{\beta_{t-1}(\gamma)} quantifies the width of this confidence set.

The second is a corollary of the elliptical potential count lemma in Abbasi-Yadkori et al. (2011):

Corollary 1.

(Corollary of Lattimore & Szepesvári (2020, Exercise 19.3)) Assume that V0=λIV_{0}=\lambda I and XtL\lVert X_{t}\rVert\leq L for all t[T]t\in[T]. Then for any constant 0<m20<m\leq 2, the following holds:

t=1T𝟙{XtVt112m}6dmlog(1+2L2λm).\displaystyle\sum_{t=1}^{T}\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq m\right\}\leq\frac{6d}{m}\log\Big{(}1+\frac{2L^{2}}{\lambda m}\Big{)}. (6)

We remark that this corollary is slightly stronger than the classical elliptical potential lemma since it provides an upper bound on the number of times that XtVt112\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}} exceeds a given threshold mm. Equipped with this result, we can perform the peeling device on XtVt112\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}} in our proof of the regret bound, which is a novel technique to the best of our knowledge.
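As a sanity check, Corollary 1 can be verified numerically. The following minimal simulation (a sketch with illustrative values d=2d=2, λ=L=m=1\lambda=L=m=1, chosen by us and not taken from the paper) counts the rounds in which XtVt112\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}} exceeds mm and compares the count against the right-hand side of Eqn. (6).

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 2, 1000
lam, L, m = 1.0, 1.0, 1.0    # illustrative values for lambda, L, m

V = lam * np.eye(d)          # V_0 = lambda * I
count = 0
for _ in range(T):
    x = rng.normal(size=d)
    x *= L / np.linalg.norm(x)            # enforce ||X_t|| <= L
    if x @ np.linalg.solve(V, x) >= m:    # is ||X_t||^2_{V_{t-1}^{-1}} >= m ?
        count += 1
    V += np.outer(x, x)                   # V_t = V_{t-1} + X_t X_t^T

bound = (6 * d / m) * np.log(1 + 2 * L**2 / (lam * m))
print(count, bound)  # the count never exceeds the bound of Corollary 1
```

The count grows only in the early rounds, while the bound 6dmlog(1+2L2λm)\frac{6d}{m}\log(1+\frac{2L^{2}}{\lambda m}) is a fixed constant independent of TT, which is precisely the property exploited by the peeling argument below.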

Proof.

First, we define ata_{t}^{*} as the best arm at time step tt, i.e., at=argmaxa𝒜tθ,xt,aa_{t}^{*}=\operatorname*{arg\,max}_{a\in\mathcal{A}_{t}}\langle\theta^{*},x_{t,a}\rangle, and use xt:=xt,atx_{t}^{*}:=x_{t,a_{t}^{*}} to denote its corresponding context. Let Δt:=θ,xtθ,Xt\Delta_{t}:=\langle\theta^{*},x_{t}^{*}\rangle-\langle\theta^{*},X_{t}\rangle denote the regret at time tt. Define the following events:

Bt\displaystyle B_{t} :={θ^t1θVt1βt1(γ)},Ct:={maxb𝒜tθ^t1,xt,b>θ,xtδ},Dt:={Δ^t,Atε}.\displaystyle:=\big{\{}\lVert\hat{\theta}_{t-1}-\theta^{*}\rVert_{V_{t-1}}\leq\sqrt{\beta_{t-1}(\gamma)}\big{\}},\quad C_{t}:=\big{\{}\max_{b\in\mathcal{A}_{t}}\langle\hat{\theta}_{t-1},x_{t,b}\rangle>\langle\theta^{*},x^{*}_{t}\rangle-\delta\big{\}},\quad D_{t}:=\big{\{}\hat{\Delta}_{t,A_{t}}\geq\varepsilon\big{\}}.

where δ\delta and ε\varepsilon are free parameters set to be δ=ΔtlogT\delta=\frac{\Delta_{t}}{\sqrt{\log T}} and ε=(12logT)Δt\varepsilon=(1-\frac{2}{\sqrt{\log T}})\Delta_{t} in this proof sketch.

Then the expected regret RT=𝔼t=1TΔtR_{T}=\mathbb{E}\sum_{t=1}^{T}\Delta_{t} can be partitioned by events Bt,Ct,DtB_{t},C_{t},D_{t} such that:

RT=𝔼t=1TΔt𝟙{Bt,Ct,Dt}=:F1+𝔼t=1TΔt𝟙{Bt,Ct,D¯t}=:F2+𝔼t=1TΔt𝟙{Bt,C¯t}=:F3+𝔼t=1TΔt𝟙{B¯t}=:F4.\displaystyle R_{T}=\underbrace{\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}}_{=:F_{1}}+\underbrace{\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},\overline{D}_{t}\right\}}_{=:F_{2}}+\underbrace{\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{C}_{t}\right\}}_{=:F_{3}}+\underbrace{\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B}_{t}\right\}}_{=:F_{4}}. (7)

For F1F_{1}, from the event CtC_{t} and the fact that θ,xt=Δt+θ,XtΔt\langle\theta^{*},x^{*}_{t}\rangle=\Delta_{t}+\langle\theta^{*},X_{t}\rangle\geq\Delta_{t} (here is where we use that θ,xt,a0\langle\theta^{*},x_{t,a}\rangle\geq 0 for all tt and aa), we obtain maxb𝒜tθ^t1,xt,b>(11logT)Δt\max_{b\in\mathcal{A}_{t}}\langle\hat{\theta}_{t-1},x_{t,b}\rangle>(1-\frac{1}{\sqrt{\log T}})\Delta_{t}. For convenience, define A^t:=argmaxb𝒜tθ^t1,xt,b\hat{A}_{t}:=\operatorname*{arg\,max}_{b\in\mathcal{A}_{t}}\langle\hat{\theta}_{t-1},x_{t,b}\rangle as the empirically best arm at time step tt, where ties are broken arbitrarily, then use X^t\hat{X}_{t} to denote the corresponding context of the arm A^t\hat{A}_{t}. Therefore from the Cauchy–Schwarz inequality, we have θ^t1Vt1X^tVt11θ^t1,X^t>(11logT)Δt\lVert\hat{\theta}_{t-1}\rVert_{V_{t-1}}\lVert\hat{X}_{t}\rVert_{V_{t-1}^{-1}}\geq\langle\hat{\theta}_{t-1},\hat{X}_{t}\rangle>(1-\frac{1}{\sqrt{\log T}})\Delta_{t}. This implies that

X^tVt11(11logT)Δtθ^t1Vt1.\displaystyle\lVert\hat{X}_{t}\rVert_{V_{t-1}^{-1}}\geq\frac{(1-\frac{1}{\sqrt{\log T}})\Delta_{t}}{\lVert\hat{\theta}_{t-1}\rVert_{V_{t-1}}}~{}. (8)

On the other hand, we claim that θ^t1Vt1\lVert\hat{\theta}_{t-1}\rVert_{V_{t-1}} can be upper bounded as O(T)O(\sqrt{T}). This can be seen from the fact that θ^t1Vt1=θ^t1θ+θVt1θ^t1θVt1+θVt1\lVert\hat{\theta}_{t-1}\rVert_{V_{t-1}}=\lVert\hat{\theta}_{t-1}-\theta^{*}+\theta^{*}\rVert_{V_{t-1}}\leq\lVert\hat{\theta}_{t-1}-\theta^{*}\rVert_{V_{t-1}}+\lVert\theta^{*}\rVert_{V_{t-1}}. Since the event BtB_{t} holds, we know the first term is upper bounded by βt1(γ)\sqrt{\beta_{t-1}(\gamma)}, and since the largest eigenvalue of the matrix Vt1V_{t-1} is upper bounded by λ+TL2\lambda+TL^{2} and θS\lVert\theta^{*}\rVert\leq S, the second term is upper bounded by Sλ+TL2S\sqrt{\lambda+TL^{2}}. Hence, θ^t1Vt1\lVert\hat{\theta}_{t-1}\rVert_{V_{t-1}} is upper bounded by O(T)O(\sqrt{T}). Then one can substitute this bound back into Eqn. (8), and this yields

X^tVt11Ω(1T(11logT)Δt).\displaystyle\lVert\hat{X}_{t}\rVert_{V_{t-1}^{-1}}\geq\Omega\Big{(}\frac{1}{\sqrt{T}}\Big{(}1-\frac{1}{\sqrt{\log T}}\Big{)}\Delta_{t}\Big{)}~{}. (9)

Furthermore, by our design of the algorithm, the index of AtA_{t} is not larger than the index of the arm with the largest empirical reward at time tt. Hence,

It,At=Δ^t,At2βt1(γ)XtVt112+log1βt1(γ)XtVt112log1βt1(γ)X^tVt112.\displaystyle I_{t,A_{t}}=\frac{\hat{\Delta}_{t,A_{t}}^{2}}{\beta_{t-1}(\gamma)\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}(\gamma)\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}\leq\log\frac{1}{\beta_{t-1}(\gamma)\lVert\hat{X}_{t}\rVert_{V_{t-1}^{-1}}^{2}}~{}. (10)

In the following, we set γ\gamma as well as another free parameter Γ\Gamma as follows:

Γ=dlog32TTandγ=1t2.\Gamma=\frac{d\log^{\frac{3}{2}}T}{\sqrt{T}}\quad\mbox{and}\quad\gamma=\frac{1}{t^{2}}~{}. (11)

If XtVt112Δt2βt1(γ)\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}(\gamma)}, then by using Corollary 1 with the choice in Eqn. (11), the upper bound of F1F_{1} in this case is O(dTlogT)O\big{(}d\sqrt{T\log T}\big{)}. Otherwise, using the event DtD_{t} and the bound in Eqn. (9), we deduce that for all TT sufficiently large, we have XtVt112Ω(Δt2βt1(γ)log(T/Δt2))\lVert{X}_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\Omega\big{(}\frac{\Delta_{t}^{2}}{\beta_{t-1}(\gamma)\log({T}/{\Delta_{t}^{2}})}\big{)}. We then apply Corollary 1 and the “peeling device” (Lattimore & Szepesvári, 2020, Chapter 9) on Δt\Delta_{t} so that 2l<Δt2l+12^{-l}<\Delta_{t}\leq 2^{-l+1} for l=1,2,,Ql=1,2,\ldots,\lceil Q\rceil, where Q=log2ΓQ=-\log_{2}\Gamma and Γ\Gamma is chosen as in Eqn. (11). Now consider,

F1\displaystyle\!\!\!F_{1} O(1)+𝔼t=1TΔt𝟙{XtVt112Ω(Δt2βt1(γ)log(T/Δt2))}\displaystyle\leq O(1)+\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert{X}_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\Omega\Big{(}\frac{\Delta_{t}^{2}}{\beta_{t-1}(\gamma)\log({T}/{\Delta_{t}^{2}})}\Big{)}\right\} (12)
O(1)+TΓ+𝔼t=1Tl=1QΔt𝟙{XtVt112Ω(Δt2βt1(γ)log(T/Δt2))}𝟙{2l<Δt2l+1}\displaystyle\leq O(1)\!+\!T\Gamma\!+\mathbb{E}\!\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\Delta_{t}\cdot\mathbbm{1}\left\{\!\|{X}_{t}\|^{2}_{V_{t-1}^{-1}}\!\geq\!\Omega\Big{(}\frac{\Delta_{t}^{2}}{\beta_{t-1}(\gamma)\log(T/\Delta_{t}^{2})}\Big{)}\!\right\}\mathbbm{1}\big{\{}2^{-l}\!<\!\Delta_{t}\!\leq\!2^{-l+1}\big{\}}\!\!\! (13)
O(1)+TΓ+𝔼t=1Tl=1Q2l+1𝟙{XtVt112Ω(22lβt1(γ)log(T22l))}\displaystyle\leq O(1)+T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot\mathbbm{1}\bigg{\{}\lVert{X}_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\Omega\Big{(}\frac{2^{-2l}}{\beta_{t-1}(\gamma)\log({T\cdot 2^{2l}})}\Big{)}\bigg{\}} (14)
O(1)+TΓ+𝔼l=1Q2l+1O(22ldβT(γ)log(22lT)log(1+2L222lβT(γ)log(T22l)λ))\displaystyle\leq O(1)\!+\!T\Gamma\!+\!\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}\!2^{-l+1}O\bigg{(}\!2^{2l}d\beta_{T}(\gamma)\log(2^{2l}T)\log\Big{(}1\!+\!\frac{2L^{2}\cdot 2^{2l}\beta_{T}(\gamma)\log(T\cdot 2^{2l})}{\lambda}\Big{)}\!\bigg{)} (15)
O(1)+TΓ+l=1Q2l+1O(dβT(γ)log(TΓ2)log(1+L2βT(γ)log(TΓ2)λΓ2))\displaystyle\leq O(1)+T\Gamma+\sum_{l=1}^{\lceil Q\rceil}2^{l+1}\cdot O\bigg{(}{d\beta_{T}(\gamma)\log(\frac{T}{\Gamma^{2}})}\log\Big{(}1+\frac{L^{2}\beta_{T}(\gamma)\log(\frac{T}{\Gamma^{2}})}{\lambda\Gamma^{2}}\Big{)}\bigg{)} (16)
O(1)+TΓ+O(dβT(γ)log(TΓ2)Γlog(1+L2βT(γ)log(TΓ2)λΓ2)),\displaystyle\leq O(1)+T\Gamma+O\bigg{(}\frac{d\beta_{T}(\gamma)\log(\frac{T}{\Gamma^{2}})}{\Gamma}\log\Big{(}1+\frac{L^{2}\beta_{T}(\gamma)\log(\frac{T}{\Gamma^{2}})}{\lambda\Gamma^{2}}\Big{)}\bigg{)}~{}, (17)

where in Inequality (15) we used Corollary 1. Substituting the choices of Γ\Gamma and γ\gamma in Eqn. (11) into Eqn. (17) yields the upper bound on 𝔼t=1TΔt𝟙{Bt,Ct,Dt}𝟙{XtVt112<Δt2βt1(γ)}\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}\cdot\mathbbm{1}\big{\{}\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}(\gamma)}\big{\}}, which is of the order O(dTlog32T)O(d\sqrt{T}\log^{\frac{3}{2}}T). Hence F1O(dTlog32T)F_{1}\leq O(d\sqrt{T}\log^{\frac{3}{2}}T). Other details are fleshed out in Appendix C.2.

For F2F_{2}, since CtC_{t} and D¯t\overline{D}_{t} together imply that θ,xtδ<ε+θ^t1,Xt\langle\theta^{*},x_{t}^{*}\rangle-\delta<\varepsilon+\langle\hat{\theta}_{t-1},X_{t}\rangle, then using the choices of δ\delta and ε\varepsilon, we have θ^t1θ,Xt>ΔtlogT\langle\hat{\theta}_{t-1}-\theta^{*},X_{t}\rangle>\frac{\Delta_{t}}{\sqrt{\log T}}. Substituting this into the event BtB_{t} and using the Cauchy–Schwarz inequality, we have

XtVt112Δt2βt1(γ)logT.\displaystyle\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}(\gamma)\log T}. (18)

Again applying the “peeling device” on Δt\Delta_{t} and Corollary 1, we can upper bound F2F_{2} as follows:

F2TΓ+O(dβT(γ)logTΓ)log(1+L2βT(γ)logTλΓ2).\displaystyle F_{2}\leq T\Gamma+O\bigg{(}\frac{d\beta_{T}(\gamma)\log T}{\Gamma}\bigg{)}\log\Big{(}1+\frac{L^{2}\beta_{T}(\gamma)\log T}{\lambda\Gamma^{2}}\Big{)}~{}. (19)

Then with the choice of Γ\Gamma and γ\gamma as stated in Eqn. (11), the upper bound on F2F_{2} is also of order O(dTlog32T)O(d\sqrt{T}\log^{\frac{3}{2}}T). More details of the calculation leading to Eqn. (19) are in Appendix C.3.

For F3F_{3}, this is the case when the best arm at time tt does not perform sufficiently well so that the empirically largest reward at time tt is far from the highest expected reward. One observes that minimizing F3F_{3} results in a tradeoff with respect to F1F_{1}. On the event C¯t\overline{C}_{t}, we can again apply the “peeling device” on θ,xtθ^t1,xt\langle\theta^{*},x_{t}^{*}\rangle-\langle\hat{\theta}_{t-1},x_{t}^{*}\rangle such that q+12δθ,xtθ^t1,xt<q+22δ\frac{q+1}{2}\delta\leq\langle\theta^{*},x_{t}^{*}\rangle-\langle\hat{\theta}_{t-1},x_{t}^{*}\rangle<\frac{q+2}{2}\delta where qq\in\mathbb{N}. Then using the fact that It,AtIt,atI_{t,A_{t}}\leq I_{t,a_{t}^{*}}, we have

log1βt1(γ)XtVt112<q2δ24βt1(γ)xtVt112+log1βt1(γ)xtVt112.\displaystyle\log\frac{1}{\beta_{t-1}(\gamma)\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}<\frac{q^{2}\delta^{2}}{4\beta_{t-1}(\gamma)\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}(\gamma)\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}^{2}}~{}. (20)

On the other hand, using the event BtB_{t} and the Cauchy–Schwarz inequality, it holds that

xtVt11(q+1)δ2βt1(γ).\displaystyle\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}\geq\frac{(q+1)\delta}{2\sqrt{\beta_{t-1}(\gamma)}}~{}. (21)

If XtVt112Δt2βt1(γ)\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}(\gamma)}, the regret in this case is bounded by O(dTlogT)O(d\sqrt{T\log T}). Otherwise, combining Eqn. (20) and Eqn. (21) implies that

XtVt112(q+1)2δ24βt1(γ)exp(q2(q+1)2).\displaystyle\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{(q+1)^{2}\delta^{2}}{4\beta_{t-1}(\gamma)}\exp\bigg{(}-\frac{q^{2}}{(q+1)^{2}}\bigg{)}~{}. (22)

Using Corollary 1, we can now conclude that F3F_{3} is upper bounded as

F3TΓ+O(dβT(γ)logTΓ)log(1+L2βT(γ)logTλΓ2).\displaystyle F_{3}\leq T\Gamma+O\bigg{(}\frac{d\beta_{T}(\gamma)\log T}{\Gamma}\bigg{)}\log\Big{(}1+\frac{L^{2}\beta_{T}(\gamma)\log T}{\lambda\Gamma^{2}}\Big{)}~{}. (23)

Substituting Γ\Gamma and γ\gamma in Eqn. (11) into Eqn. (23), we can upper bound F3F_{3} by O(dTlog32T)O(d\sqrt{T}\log^{\frac{3}{2}}T). Complete details are provided in Appendix C.4.

For F4F_{4}, using Lemma 1 with the choice of γ=1/t2\gamma=1/t^{2} and Q=log2ΓQ=-\log_{2}\Gamma, we have

F4\displaystyle F_{4} =𝔼t=1TΔt𝟙{B¯t}TΓ+𝔼t=1Tl=1QΔt𝟙{2l<Δt2l+1}𝟙{B¯t}\displaystyle=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B}_{t}\right\}\leq T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\Delta_{t}\cdot\mathbbm{1}\left\{2^{-l}<\Delta_{t}\leq 2^{-l+1}\right\}\mathbbm{1}\left\{\overline{B}_{t}\right\} (24)
TΓ+t=1Tl=1Q2l+1(B¯t)TΓ+t=1Tl=1Q2l+1γ<TΓ+π23.\displaystyle\leq T\Gamma+\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\mathbb{P}\big{(}\overline{B}_{t}\big{)}\leq T\Gamma+\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\gamma<T\Gamma+\frac{\pi^{2}}{3}~{}. (25)
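For completeness, the numerical constant π23\frac{\pi^{2}}{3} in Eqn. (25) can be traced explicitly: the sum over ll is a geometric series bounded by 22, and with γ=1/t2\gamma=1/t^{2} the sum over tt is bounded by the Basel sum:

```latex
\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil} 2^{-l+1}\,\gamma
\;\le\; \sum_{t=1}^{T}\frac{1}{t^{2}}\sum_{l=1}^{\infty} 2^{-l+1}
\;=\; 2\sum_{t=1}^{T}\frac{1}{t^{2}}
\;\le\; 2\cdot\frac{\pi^{2}}{6}
\;=\; \frac{\pi^{2}}{3}\,.
```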

Thus, F4O(dTlog32T)F_{4}\leq O(d\sqrt{T}\log^{\frac{3}{2}}T). In conclusion, with the choice of Γ\Gamma and γ\gamma in Eqn. (11), we have shown that the expected regret of LinIMED-1, RT=i=14FiR_{T}=\sum_{i=1}^{4}F_{i}, is upper bounded by O(dTlog32T)O(d\sqrt{T}\log^{\frac{3}{2}}T).∎

For LinIMED-2, the proof is similar, but the assumption that θ,xt,a0\langle\theta^{*},x_{t,a}\rangle\geq 0 is not required. For LinIMED-3, by directly using the UCB in Line 6 of Algorithm 1, we improve the regret bound to O(dTlogT)O(d\sqrt{T}\log T), which matches the state-of-the-art bound of LinUCB with OFUL’s confidence bound.

6 Empirical Studies

This section aims to justify the utility of the family of LinIMED algorithms we developed and to demonstrate their effectiveness through quantitative evaluations in simulated environments and on real-world datasets such as the MovieLens dataset. We compare our LinIMED algorithms with LinTS and LinUCB with the choice λ=L2\lambda=L^{2}. We set βt(γ)=(R3dlog(1+t)+2)2\beta_{t}(\gamma)=(R\sqrt{3d\log(1+t)}+\sqrt{2})^{2} (here γ=1(1+t)2\gamma=\frac{1}{(1+t)^{2}} and L=2L=\sqrt{2}) for the synthetic dataset with a varying and finite arm set, and βt(γ)=(Rdlog((1+t)t2)+20)2\beta_{t}(\gamma)=(R\sqrt{d\log((1+t)t^{2})}+\sqrt{20})^{2} (here γ=1t2\gamma=\frac{1}{t^{2}} and L=20L=\sqrt{20}) for the MovieLens dataset. The confidence widths βt(γ)\sqrt{\beta_{t}(\gamma)} for each algorithm are multiplied by a factor α\alpha; we tune α\alpha by searching over the grid {0.05,0.1,0.15,0.2,,0.95,1.0}\left\{0.05,0.1,0.15,0.2,\ldots,0.95,1.0\right\} and report the best performance for each algorithm; see Appendix G. Both γ\gamma’s are of order O(1t2)O(\frac{1}{t^{2}}) as suggested by our proof sketch in Eqn. (11). We set C=30C=30 in LinIMED-3 throughout. The sub-Gaussian noise level is R=0.1R=0.1. We choose LinUCB and LinTS as competing algorithms since they are paradigmatic examples of deterministic and randomized contextual linear bandit algorithms respectively. We also include IDS in our comparisons for the fixed and finite arm set settings. Finally, we show the performances of the SupLinUCB and SupLinIMED algorithms only in Figs. 1 and 2, since it is well known that they suffer a substantial performance degradation compared to established methodologies like LinUCB or LinTS (as mentioned in Lattimore & Szepesvári (2020, Chapter 22) and also seen in Figs. 1 and 2).
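To make the decision rule concrete, the following is a minimal sketch (our illustrative reconstruction, not the authors' experimental code) of how the LinIMED-1 index, read off from Eqn. (10), can be computed and minimized at each round; the ridge estimate θ^t1\hat{\theta}_{t-1}, the empirical gap Δ^t,a\hat{\Delta}_{t,a}, and the argmin selection follow the definitions in Section 5.

```python
import numpy as np

def linimed1_select(arm_contexts, V, b, beta):
    """Pick the arm minimizing the LinIMED-1 index
    I_{t,a} = gap_a^2 / (beta * ||x_a||^2_{V^{-1}}) + log(1 / (beta * ||x_a||^2_{V^{-1}})).
    arm_contexts: (K, d) array; V: regularized Gram matrix; b: sum of y * x."""
    theta_hat = np.linalg.solve(V, b)              # ridge estimate of theta*
    means = arm_contexts @ theta_hat               # empirical mean rewards
    gaps = means.max() - means                     # empirical gaps Delta_hat
    V_inv = np.linalg.inv(V)
    norms2 = np.einsum('ij,jk,ik->i', arm_contexts, V_inv, arm_contexts)
    index = gaps**2 / (beta * norms2) + np.log(1.0 / (beta * norms2))
    return int(np.argmin(index))

# With an accurate estimate, the empirically best arm has gap 0 and the
# smallest index, so it is exploited.
arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
V = 1000.0 * np.eye(2)                 # pretend we have seen much data
b = V @ np.array([1.0, 0.0])           # consistent with theta* = [1, 0]
print(linimed1_select(arms, V, b, beta=1.0))  # -> 0
```

Note that an arm with a large uncertainty xaVt112\lVert x_{a}\rVert^{2}_{V_{t-1}^{-1}} has a small logarithmic term, which lowers its index and encourages exploration of under-sampled directions.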

6.1 Experiments on a Synthetic Dataset in the Varying Arm Set Setting

We perform an empirical study in the varying arm set setting. We evaluate the performance with different dimensions dd and different numbers of arms KK. We set the unknown parameter vector and the best context vector as θ=xt=[1d1,,1d1,0]d\theta^{*}=x_{t}^{*}=[\frac{1}{\sqrt{d-1}},\ldots,\frac{1}{\sqrt{d-1}},0]^{\top}\in\mathbb{R}^{d}. There are K2K-2 suboptimal arm vectors, which are all the same (i.e., repeated) and share the context [(117+zt,i)1d1,,(117+zt,i)1d1,(117+zt,i)]d[(1-\frac{1}{7+z_{t,i}})\frac{1}{\sqrt{d-1}},\ldots,(1-\frac{1}{7+z_{t,i}})\frac{1}{\sqrt{d-1}},(1-\frac{1}{7+z_{t,i}})]^{\top}\in\mathbb{R}^{d} where zt,iUniform[0,0.1]z_{t,i}\sim\textrm{Uniform}[0,0.1] is i.i.d. noise for each t[T]t\in[T] and i[K2]i\in[K-2]. Finally, there is also one “worst” arm vector with context [0,0,,0,1][0,0,\dots,0,1]^{\top}. First, we fix d=2d=2. The results for different numbers of arms, namely K=10,100,500K=10,100,500, are shown in Fig. 1. Note that each experiment is repeated 5050 times to obtain the mean and standard deviation of the regret. From Fig. 1, we observe that LinIMED-1 and LinIMED-2 are comparable to LinUCB and LinTS, while LinIMED-3 outperforms LinTS and LinUCB regardless of the number of arms KK. Second, we set K=10K=10 with dimensions d=2,20,50d=2,20,50. Each trial is again repeated 5050 times and the regret over time is shown in Fig. 2. Again, we see that LinIMED-1 and LinIMED-2 are comparable to LinUCB and LinTS, while LinIMED-3 clearly performs better than LinUCB and LinTS.

Refer to caption
(a) K=10K=10
Refer to caption
(b) K=100K=100
Refer to caption
(c) K=500K=500
Figure 1: Simulation results (expected regrets) on the synthetic dataset with different KK’s
Refer to caption
(a) d=2d=2
Refer to caption
(b) d=20d=20
Refer to caption
(c) d=50d=50
Figure 2: Simulation results (expected regrets) on the synthetic dataset with different dd’s

The experimental results on synthetic data demonstrate that the performances of LinIMED-1 and LinIMED-2 are largely similar but LinIMED-3 is slightly superior (corroborating our theoretical findings). More importantly, LinIMED-3 outperforms both the LinTS and LinUCB algorithms in a statistically significant manner, regardless of the number of arms KK or the dimension dd of the data.

6.2 Experiments on the “End of Optimism” instance

Algorithms based on the optimism principle, such as LinUCB and LinTS, have been shown to be not asymptotically optimal. A paradigmatic example is known as the “End of Optimism” instance (Lattimore & Szepesvari, 2017; Kirschner et al., 2021). In this two-dimensional instance, the true parameter vector is θ=[1;0]\theta^{*}=[1;0] and there are three arms represented by the arm vectors [1;0],[0;1][1;0],[0;1] and [1ε;2ε][1-\varepsilon;2\varepsilon], where ε>0\varepsilon>0 is small. In this example, it is observed that even pulling a highly suboptimal arm (the second one) provides a lot of information about the best arm (the first one). We perform experiments with the same confidence parameter βt(γ)=(R3dlog(1+t)+2)2\beta_{t}(\gamma)=(R\sqrt{3d\log(1+t)}+\sqrt{2})^{2} as in Section 6.1 (where the noise level is R=0.1R=0.1 and the dimension is d=2d=2). We also include the asymptotically optimal IDS algorithm (Kirschner et al. (2021)) with the choice of ηs=βs(δ)1\eta_{s}=\beta_{s}(\delta)^{-1}, as suggested in Kirschner et al. (2021). Each algorithm is run over 1010 independent trials. The regrets of all competing algorithms are shown in Fig. 3 with ε=0.005,0.01,0.02\varepsilon=0.005,0.01,0.02 and a fixed horizon T=106T=10^{6}.
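The instance itself is simple to write down; the following sketch (with an illustrative ε\varepsilon) computes the suboptimality gaps, confirming that the second arm is maximally suboptimal (gap 11) while the third arm's gap is only ε\varepsilon:

```python
import numpy as np

eps = 0.01
theta_star = np.array([1.0, 0.0])
arms = np.array([
    [1.0, 0.0],             # optimal arm
    [0.0, 1.0],             # highly suboptimal but informative arm
    [1.0 - eps, 2.0 * eps]  # nearly optimal arm, hard to distinguish
])
means = arms @ theta_star           # expected rewards: [1.0, 0.0, 1.0 - eps]
gaps = means.max() - means          # suboptimality gaps: [0.0, 1.0, eps]
print(gaps)
```

Pulling the second arm costs regret 11 per pull but measures θ\theta^{*} along the second coordinate, which is the direction needed to separate the first and third arms; optimism-based algorithms avoid it, which is why they underperform here.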

Refer to caption
(a) ε=0.005\varepsilon=0.005
Refer to caption
(b) ε=0.01\varepsilon=0.01
Refer to caption
(c) ε=0.02\varepsilon=0.02
Figure 3: Simulation results (expected regrets) on the “End of Optimism” instance with different ε\varepsilon’s

From Fig. 3, we observe that the LinIMED algorithms perform much better than LinUCB and LinTS, and that LinIMED-3 is comparable to IDS on this “End of Optimism” instance. In particular, LinIMED-3 performs significantly better than LinUCB and LinTS even when ε\varepsilon is of a moderate value such as ε=0.02\varepsilon=0.02. We surmise that the reason behind the superior performance of our LinIMED algorithms on the “End of Optimism” instance is that the first term of our LinIMED index is Δ^t,a2/(βt1(γ)xt,aVt112){\hat{\Delta}_{t,a}^{2}}/{(\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert_{V_{t-1}^{-1}}^{2})}, which can be viewed as an approximate and simpler version of the information ratio that motivates the design of the IDS algorithm.

Refer to caption
(a) K=20K=20
Refer to caption
(b) K=50K=50
Refer to caption
(c) K=100K=100
Figure 4: Simulation results (CTRs) of the MovieLens dataset with different KK’s

6.3 Experiments on the MovieLens Dataset

The MovieLens dataset (Cantador et al. (2011)) is a widely-used benchmark dataset for research in recommendation systems. We specifically choose to use the MovieLens 10M dataset, which contains 10 million ratings (from 0 to 5) and 100,000 tag applications applied to 10,000 movies by 72,000 users. To preprocess the dataset, we choose the best K{20,50,100}K\in\{20,50,100\} movies for consideration. At each time tt, one random user visits the website and is recommended one of the best KK movies. We assume that the user clicks on the recommended movie if and only if the user’s rating of this movie is at least 33. We implement the three versions of LinIMED, LinUCB, LinTS and IDS on this dataset. Each trial is repeated over 100100 runs and the averages and standard deviations of the click-through rates (CTRs) as functions of time are reported in Fig. 4. One observes that the LinIMED variants significantly outperform LinUCB and LinTS for all K{20,50,100}K\in\{20,50,100\} when the time horizon TT is sufficiently large. LinIMED-1 and LinIMED-2 perform significantly better than IDS when K=20,50K=20,50, and LinIMED-3 performs significantly better than IDS when K=50,100K=50,100. Furthermore, by virtue of the fact that IDS is randomized, the variance of IDS is higher than that of LinIMED.

7 Future Work

In the future, a fruitful direction of research is to further modify the LinIMED algorithm to make it also asymptotically optimal; we believe that in this case, the analysis would be more challenging, but the theoretical and empirical performances might be superior to our three LinIMED algorithms. In addition, one can generalize the family of IMED-style algorithms to generalized linear bandits or neural contextual bandits.

Acknowledgements

This work is supported by funding from a Ministry of Education Academic Research Fund (AcRF) Tier 2 grant under grant number A-8000423-00-00 and AcRF Tier 1 grants under grant numbers A-8000189-01-00 and A-8000980-00-00. This research is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-PhD-2023-08-044T-J), and is part of the programme DesCartes which is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme.

References

  • Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.
  • Agrawal & Goyal (2012) Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp.  39.1–39.26. JMLR Workshop and Conference Proceedings, 2012.
  • Agrawal & Goyal (2013) Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135. PMLR, 2013.
  • Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
  • Baudry et al. (2023) Dorian Baudry, Kazuya Suzuki, and Junya Honda. A general recipe for the analysis of randomized multi-armed bandit algorithms. arXiv preprint arXiv:2303.06058, 2023.
  • Bian & Jun (2021) Jie Bian and Kwang-Sung Jun. Maillard sampling: Boltzmann exploration done optimally. arXiv preprint arXiv:2111.03290, 2021.
  • Cantador et al. (2011) Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. Second workshop on information heterogeneity and fusion in recommender systems (HetRec2011). In Proceedings of the Fifth ACM Conference on Recommender Systems, pp.  387–388, 2011.
  • Chu et al. (2011) Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp.  208–214. JMLR Workshop and Conference Proceedings, 2011.
  • Garivier & Cappé (2011) Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pp.  359–376. JMLR Workshop and Conference Proceedings, 2011.
  • Honda & Takemura (2015) Junya Honda and Akimichi Takemura. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. J. Mach. Learn. Res., 16:3721–3756, 2015.
  • Kirschner & Krause (2018) Johannes Kirschner and Andreas Krause. Information directed sampling and bandits with heteroscedastic noise. In Conference On Learning Theory, pp.  358–384. PMLR, 2018.
  • Kirschner et al. (2020) Johannes Kirschner, Tor Lattimore, and Andreas Krause. Information directed sampling for linear partial monitoring. In Conference on Learning Theory, pp.  2328–2369. PMLR, 2020.
  • Kirschner et al. (2021) Johannes Kirschner, Tor Lattimore, Claire Vernade, and Csaba Szepesvári. Asymptotically optimal information-directed sampling. In Conference on Learning Theory, pp.  2777–2821. PMLR, 2021.
  • Lattimore & Szepesvari (2017) Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pp.  728–737. PMLR, 2017.
  • Lattimore & Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
  • Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp.  661–670, 2010.
  • Liu et al. (2024) Haolin Liu, Chen-Yu Wei, and Julian Zimmert. Bypassing the simulator: Near-optimal adversarial linear contextual bandits. Advances in Neural Information Processing Systems, 36, 2024.
  • Russo & Van Roy (2014) Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. Advances in Neural Information Processing Systems, 27, 2014.
  • Saber et al. (2021) Hassan Saber, Pierre Ménard, and Odalric-Ambrym Maillard. Indexed minimum empirical divergence for unimodal bandits. Advances in Neural Information Processing Systems, 34:7346–7356, 2021.
  • Thompson (1933) William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933. ISSN 00063444. URL http://www.jstor.org/stable/2332286.
  • Valko et al. (2014) Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pp.  46–54. PMLR, 2014.

Supplementary Materials for the TMLR submission
“Linear Indexed Minimum Empirical Divergence Algorithms”

Appendix A BaseLinUCB Algorithm

Here, we present the BaseLinUCB algorithm, which is used as a subroutine in SupLinIMED (Algorithm 2).

1:  Input: γ=12t2\gamma=\frac{1}{2t^{2}}, α=12ln2TKγ\alpha=\sqrt{\frac{1}{2}\ln\frac{2TK}{\gamma}}, Ψt{1,2,,t1}\Psi_{t}\subseteq\left\{1,2,...,t-1\right\}
2:  Vt=Id+τΨtxτ,Aτxτ,AτTV_{t}=I_{d}+\sum_{\tau\in\Psi_{t}}x_{\tau,A_{\tau}}x^{T}_{\tau,A_{\tau}}
3:  bt=τΨtYτ,Aτxτ,Aτb_{t}=\sum_{\tau\in\Psi_{t}}Y_{\tau,A_{\tau}}x_{\tau,A_{\tau}}
4:  θ^t=Vt1bt\hat{\theta}_{t}=V^{-1}_{t}b_{t}
5:  Observe KK arm features xt,1,xt,2,,xt,Kdx_{t,1},x_{t,2},\ldots,x_{t,K}\in\mathbb{R}^{d}
6:  for a[K]a\in[K] do
7:     wt,a=αxt,aTVt1xt,aw_{t,a}=\alpha\sqrt{x^{T}_{t,a}V^{-1}_{t}x_{t,a}}
8:     Y^t,a=θ^t,xt,a\hat{Y}_{t,a}=\langle\hat{\theta}_{t},x_{t,a}\rangle
9:  end for
Algorithm 3 BaseLinUCB
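For concreteness, the pseudocode above can be transcribed as follows (a sketch assuming column-vector contexts, so that the Gram matrix accumulates outer products xxxx^{\top}; the per-round parameter α\alpha from Line 1 is computed by the caller and passed in directly):

```python
import numpy as np

def base_lin_ucb(history, arm_features, alpha):
    """BaseLinUCB: ridge estimate and confidence widths from the rounds in Psi_t.
    history: list of (x, y) pairs with x in R^d; arm_features: (K, d) array."""
    d = arm_features.shape[1]
    V = np.eye(d)                      # V_t = I_d + sum of x x^T
    b = np.zeros(d)                    # b_t = sum of y * x
    for x, y in history:
        V += np.outer(x, x)
        b += y * x
    theta_hat = np.linalg.solve(V, b)  # theta_hat_t = V_t^{-1} b_t
    V_inv = np.linalg.inv(V)
    widths = alpha * np.sqrt(np.einsum('ij,jk,ik->i',
                                       arm_features, V_inv, arm_features))
    estimates = arm_features @ theta_hat   # Y_hat_{t,a} = <theta_hat_t, x_{t,a}>
    return estimates, widths

# With no history, theta_hat = 0: all estimates are 0 and each width is
# alpha * ||x_a|| (since V = I_d).
feats = np.eye(3, 2)                   # 3 arms in R^2
est, w = base_lin_ucb([], feats, alpha=0.5)
print(est, w)
```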

Appendix B Comparison to other related work

Saber et al. (2021) adapts the IMED algorithm to unimodal bandits, achieving asymptotic optimality for one-dimensional exponential family distributions. In their algorithm, IMED-UB, they narrow down the search region to the neighborhood of the empirically best arm and then implement the IMED algorithm for the KK-armed bandit as in Honda & Takemura (2015). This design is inspired by the lower bound and only involves the arms neighboring the best arm. The setting to which the algorithm in Saber et al. (2021) applies is different from that of our proposed LinIMED algorithms, as we focus on linear bandits, not unimodal bandits.

Liu et al. (2024) proposes an algorithm that achieves O~(T)\widetilde{O}(\sqrt{T}) regret for adversarial linear bandits with stochastic action sets in the absence of a simulator or prior knowledge on the distribution. Although their setting is different from ours, they also use a bonus term αtΣ^t1-\alpha_{t}\hat{\Sigma}_{t}^{-1} in the lifted covariance matrix to encourage exploration. This is similar to our choice of the second term log(1/βt1(γ)xt,aVt112)\log(1/\beta_{t-1}(\gamma)\lVert x_{t,a}\rVert^{2}_{V^{-1}_{t-1}}) in LinIMED-1.

Appendix C Proof of the regret bound for LinIMED-1 (Complete proof of Theorem 1)

Here and in the following, we abbreviate βt(γ)\beta_{t}(\gamma) as βt\beta_{t}, i.e., we drop the dependence of βt\beta_{t} on γ\gamma, which is taken to be 1t2\frac{1}{t^{2}} per Eqn. (11).

C.1 Statement of Lemmas for LinIMED-1

We first state the following lemmas which respectively show the upper bound of F1F_{1} to F4F_{4}:

Lemma 2.

Under Assumption 1, the assumption that θ,xt,a0\langle\theta^{*},x_{t,a}\rangle\geq 0 for all t1t\geq 1 and a𝒜ta\in\mathcal{A}_{t}, and the assumption that λS1\sqrt{\lambda}S\geq 1, for the free parameter 0<Γ<10<\Gamma<1, the term F1F_{1} for LinIMED-1 satisfies:

F1O(1)+TΓ+O(dβTlog(TΓ2)Γlog(1+L2βTlog(TΓ2)λΓ2)).\displaystyle F_{1}\leq O(1)+T\Gamma+O\bigg{(}\frac{d\beta_{T}\log(\frac{T}{\Gamma^{2}})}{\Gamma}\log\Big{(}1+\frac{L^{2}\beta_{T}\log(\frac{T}{\Gamma^{2}})}{\lambda\Gamma^{2}}\Big{)}\bigg{)}~{}. (26)

With the choice of Γ\Gamma as in Eqn. equation 11,

F1O(dTlog32T).\displaystyle F_{1}\leq O\bigg{(}d\sqrt{T}\log^{\frac{3}{2}}T\bigg{)}~{}. (27)
Lemma 3.

Under Assumption 1, and the assumption that λS1\sqrt{\lambda}S\geq 1, for the free parameter 0<Γ<10<\Gamma<1, the term F2F_{2} for LinIMED-1 satisfies:

F22TΓ+O(dβTlogTΓ)log(1+L2βTlogTλΓ2).\displaystyle F_{2}\leq 2T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\bigg{)}\log\bigg{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)}~{}. (28)

With the choice of Γ\Gamma as in Eqn. equation 11,

F2O(dTlog32T).\displaystyle F_{2}\leq O\bigg{(}d\sqrt{T}\log^{\frac{3}{2}}T\bigg{)}~{}. (29)
Lemma 4.

Under Assumption 1, and the assumption that λS1\sqrt{\lambda}S\geq 1, for the free parameter 0<Γ<10<\Gamma<1, the term F3F_{3} for LinIMED-1 satisfies:

F32TΓ+O(dβTlog(T)Γlog(1+L2βTlog(T)λΓ2)).\displaystyle F_{3}\leq 2T\Gamma+O\bigg{(}\frac{d\beta_{T}\log(T)}{\Gamma}\log\Big{(}1+\frac{L^{2}\beta_{T}\log(T)}{\lambda\Gamma^{2}}\Big{)}\bigg{)}~{}. (30)

With the choice of Γ\Gamma as in Eqn. equation 11,

F3O(dTlog32T).\displaystyle F_{3}\leq O\bigg{(}d\sqrt{T}\log^{\frac{3}{2}}T\bigg{)}~{}. (31)
Lemma 5.

Under Assumption 1, for the free parameter 0<Γ<10<\Gamma<1, the term F4F_{4} for LinIMED-1 satisfies:

F4TΓ+O(1).\displaystyle F_{4}\leq T\Gamma+O(1)~{}.

With the choice of Γ\Gamma as in Eqn. equation 11,

F4O(dTlog32T).\displaystyle F_{4}\leq O\bigg{(}d\sqrt{T}\log^{\frac{3}{2}}T\bigg{)}~{}. (32)

C.2 Proof of Lemma 2

Proof.

From the event $C_{t}$ and the fact that $\langle\theta^{*},x^{*}_{t}\rangle=\Delta_{t}+\langle\theta^{*},X_{t}\rangle\geq\Delta_{t}$ (here is where we use that $\langle\theta^{*},x_{t,a}\rangle\geq 0$ for all $t$ and $a$), we obtain $\max_{b\in\mathcal{A}_{t}}\langle\hat{\theta}_{t-1},x_{t,b}\rangle>(1-\frac{1}{\sqrt{\log T}})\Delta_{t}$. For convenience, define $\hat{A}_{t}:=\operatorname*{arg\,max}_{b\in\mathcal{A}_{t}}\langle\hat{\theta}_{t-1},x_{t,b}\rangle$ as the empirically best arm at time step $t$, where ties are broken arbitrarily, and let $\hat{X}_{t}$ denote the context of the arm $\hat{A}_{t}$. Therefore, from the Cauchy–Schwarz inequality, we have $\lVert\hat{\theta}_{t-1}\rVert_{V_{t-1}}\lVert\hat{X}_{t}\rVert_{V_{t-1}^{-1}}\geq\langle\hat{\theta}_{t-1},\hat{X}_{t}\rangle>(1-\frac{1}{\sqrt{\log T}})\Delta_{t}$. This implies that

\displaystyle\lVert\hat{X}_{t}\rVert_{V_{t-1}^{-1}}\geq\frac{(1-\frac{1}{\sqrt{\log T}})\Delta_{t}}{\lVert\hat{\theta}_{t-1}\rVert_{V_{t-1}}}. (33)

On the other hand, we claim that $\lVert\hat{\theta}_{t-1}\rVert_{V_{t-1}}$ can be upper bounded as $O(\sqrt{T})$. This can be seen from the triangle inequality: $\lVert\hat{\theta}_{t-1}\rVert_{V_{t-1}}=\lVert\hat{\theta}_{t-1}-\theta^{*}+\theta^{*}\rVert_{V_{t-1}}\leq\lVert\hat{\theta}_{t-1}-\theta^{*}\rVert_{V_{t-1}}+\lVert\theta^{*}\rVert_{V_{t-1}}$. Since the event $B_{t}$ holds, the first term is upper bounded by $\sqrt{\beta_{t-1}(\gamma)}$, and since the maximum eigenvalue of the matrix $V_{t-1}$ is upper bounded by $\lambda+TL^{2}$ and $\lVert\theta^{*}\rVert\leq S$, the second term is upper bounded by $S\sqrt{\lambda+TL^{2}}$. Hence, $\lVert\hat{\theta}_{t-1}\rVert_{V_{t-1}}$ is upper bounded by $O(\sqrt{T})$. Substituting this bound back into Eqn. (8) yields

\displaystyle\lVert\hat{X}_{t}\rVert_{V_{t-1}^{-1}}\geq\Omega\bigg(\frac{1}{\sqrt{T}}\Big(1-\frac{1}{\sqrt{\log T}}\Big)\Delta_{t}\bigg). (34)

Furthermore, by the design of the algorithm, the index of $A_{t}$ is not larger than the index of the arm with the largest empirical reward at time $t$. Hence,

\displaystyle I_{t,A_{t}}=\frac{\hat{\Delta}_{t,A_{t}}^{2}}{\beta_{t-1}(\gamma)\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}(\gamma)\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}\leq\log\frac{1}{\beta_{t-1}(\gamma)\lVert\hat{X}_{t}\rVert_{V_{t-1}^{-1}}^{2}}. (35)

If $\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}}$, by using Corollary 1 and the “peeling device” (Lattimore & Szepesvári, 2020, Chapter 9) on $\Delta_{t}$ such that $2^{-l}<\Delta_{t}\leq 2^{-l+1}$ for $l=1,2,\ldots,\lceil Q\rceil$ where $Q=-\log_{2}\Gamma$, we have

\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\}\leq\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\} (36)
=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\}\cdot\mathbbm{1}\left\{\Delta_{t}\leq\Gamma\right\}+\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\}\cdot\mathbbm{1}\left\{\Delta_{t}>\Gamma\right\} (37)
\leq T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\}\cdot\mathbbm{1}\left\{2^{-l}<\Delta_{t}\leq 2^{-l+1}\right\} (38)
\leq T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{2^{-2l}}{\beta_{T}}\right\} (39)
\leq T\Gamma+\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot\frac{6d\beta_{T}}{2^{-2l}}\log\Big(1+\frac{2L^{2}\beta_{T}}{\lambda\cdot 2^{-2l}}\Big) (40)
=T\Gamma+\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}2^{l}\cdot 12d\beta_{T}\log\Big(1+\frac{2^{2l+1}L^{2}\beta_{T}}{\lambda}\Big) (41)
<T\Gamma+\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}2^{l}\cdot 12d\beta_{T}\log\Big(1+\frac{2^{2Q+3}L^{2}\beta_{T}}{\lambda}\Big) (42)
=T\Gamma+(2^{\lceil Q\rceil}-1)\cdot 24d\beta_{T}\log\Big(1+\frac{2^{2Q+3}L^{2}\beta_{T}}{\lambda}\Big) (43)
<T\Gamma+\frac{48d\beta_{T}}{\Gamma}\log\Big(1+\frac{8L^{2}\beta_{T}}{\lambda\Gamma^{2}}\Big). (44)
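As a numerical sanity check (illustrative only, not part of the proof), the constants in the last two steps can be verified directly: with $Q=-\log_{2}\Gamma$, one has $2^{\lceil Q\rceil}-1<2/\Gamma$ (which produces the factor $48=24\cdot 2$) and $2^{2Q+3}=8/\Gamma^{2}$ (the argument of the logarithm). A minimal Python sketch:

```python
import math

def peeling_constants_ok(gamma):
    # Q = -log2(Gamma), so 2^Q = 1/Gamma (up to floating-point error).
    Q = -math.log2(gamma)
    # ceil(Q) <= Q + 1, hence 2^ceil(Q) - 1 < 2/Gamma (step (43) -> (44)).
    step_43_to_44 = (2 ** math.ceil(Q) - 1) < 2 / gamma
    # 2^(2Q+3) = 8 * (2^Q)^2 = 8/Gamma^2 (the log argument in (44)).
    step_log_arg = math.isclose(2 ** (2 * Q + 3), 8 / gamma ** 2, rel_tol=1e-9)
    return step_43_to_44 and step_log_arg

assert all(peeling_constants_ok(g) for g in (0.9, 0.5, 0.25, 0.1, 1e-3, 1e-6))
```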

Then, with the choice of $\Gamma$ as in Eqn. (11),

\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\} (45)
<d\sqrt{T}\log^{\frac{3}{2}}T+\frac{48\beta_{T}\sqrt{T}}{\log^{\frac{3}{2}}T}\log\Big(1+\frac{8L^{2}\beta_{T}T}{\lambda d^{2}\log^{3}T}\Big) (46)
\leq O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big). (47)

Otherwise, we have $\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}}$, in which case $\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}}>0$ since $\Delta_{t}\leq 1$. Substituting this into Eqn. (10), then using the event $D_{t}$ and the bound in Eqn. (9), we deduce that for all $T$ sufficiently large, $\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\Omega\big(\frac{\Delta_{t}^{2}}{\beta_{t-1}\log(T/\Delta_{t}^{2})}\big)$. We again use Corollary 1 and the “peeling device” (Lattimore & Szepesvári, 2020, Chapter 9) on $\Delta_{t}$ such that $2^{-l}<\Delta_{t}\leq 2^{-l+1}$ for $l=1,2,\ldots,\lceil Q\rceil$, where $\Gamma:=2^{-Q}$ is a free parameter that we can choose. Consider:

\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\} (48)
\leq\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}\cdot\mathbbm{1}\left\{\Delta_{t}\leq 2^{-\lceil Q\rceil}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\} (49)
\qquad+\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}\cdot\mathbbm{1}\left\{\Delta_{t}>2^{-\lceil Q\rceil}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\} (50)
\leq O(1)+T\Gamma+\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\Omega\Big(\frac{\Delta_{t}^{2}}{\beta_{t-1}\log(T/\Delta_{t}^{2})}\Big)\right\}\cdot\mathbbm{1}\left\{\Delta_{t}>2^{-\lceil Q\rceil}\right\} (51)
\leq O(1)+T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\Omega\Big(\frac{\Delta_{t}^{2}}{\beta_{t-1}\log(T/\Delta_{t}^{2})}\Big)\right\}\cdot\mathbbm{1}\left\{2^{-l}<\Delta_{t}\leq 2^{-l+1}\right\} (52)
\leq O(1)+T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\Omega\Big(\frac{2^{-2l}}{\beta_{t-1}\log(T\cdot 2^{2l})}\Big)\right\} (53)
=O(1)+T\Gamma+\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\sum_{t=1}^{T}\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\Omega\Big(\frac{2^{-2l}}{\beta_{t-1}\log(T\cdot 2^{2l})}\Big)\right\} (54)
\leq O(1)+T\Gamma+\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot O\Big(2^{2l}d\beta_{T}\log(T\cdot 2^{2l})\log\Big(1+\frac{2L^{2}\cdot 2^{2l}\beta_{T}\log(T\cdot 2^{2l})}{\lambda}\Big)\Big) (55)
<O(1)+T\Gamma+\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}2^{l+1}\cdot O\Big(d\beta_{T}\log\big(\tfrac{T}{\Gamma^{2}}\big)\log\Big(1+\frac{L^{2}\beta_{T}\log(\frac{T}{\Gamma^{2}})}{\lambda\Gamma^{2}}\Big)\Big) (56)
\leq O(1)+T\Gamma+O\Big(\frac{d\beta_{T}\log(\frac{T}{\Gamma^{2}})}{\Gamma}\log\Big(1+\frac{L^{2}\beta_{T}\log(\frac{T}{\Gamma^{2}})}{\lambda\Gamma^{2}}\Big)\Big). (57)

This proves Eqn. (26). Then, with the choice of the parameters as in Eqn. (11),

\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\} (58)
<O(1)+d\sqrt{T}\log^{\frac{3}{2}}T+O\Big(d\beta_{T}\log\Big(\frac{T^{2}}{d^{2}\log^{3}T}\Big)\frac{\sqrt{T}}{d\log^{\frac{3}{2}}T}\log\Big(1+\frac{L^{2}\beta_{T}T}{\lambda d^{2}\log^{3}T}\cdot\log\Big(\frac{T^{2}}{d^{2}\log^{3}T}\Big)\Big)\Big) (59)
\leq O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big). (60)

Hence, we can upper bound F1F_{1} as

\displaystyle F_{1}=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\}+\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\} (61)
\leq O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big)+O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big) (62)
\leq O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big), (63)

which concludes the proof. ∎

C.3 Proof of Lemma 3

Proof.

Since $C_{t}$ and $\overline{D}_{t}$ together imply that $\langle\theta^{*},x_{t}^{*}\rangle-\delta<\varepsilon+\langle\hat{\theta}_{t-1},X_{t}\rangle$, using the choices of $\delta$ and $\varepsilon$ we have $\langle\hat{\theta}_{t-1}-\theta^{*},X_{t}\rangle>\frac{\Delta_{t}}{\sqrt{\log T}}$. Substituting this into the event $B_{t}$ and using the Cauchy–Schwarz inequality, we have

\displaystyle\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}(\gamma)\log T}. (64)

Again applying the “peeling device” on $\Delta_{t}$ and Corollary 1, we can upper bound $F_{2}$ as follows:

\displaystyle F_{2}\leq\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}\log T}\right\} (65)
\leq T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}\log T}\right\}\cdot\mathbbm{1}\left\{2^{-l}<\Delta_{t}\leq 2^{-l+1}\right\} (66)
\leq T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{2^{-2l}}{\beta_{T}\log T}\right\} (67)
\leq T\Gamma+\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot 2^{2l}\cdot 6d\beta_{T}(\log T)\log\Big(1+\frac{2^{2l+1}L^{2}\beta_{T}\log T}{\lambda}\Big) (68)
\leq T\Gamma+\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}2^{l}\cdot 12d\beta_{T}(\log T)\log\Big(1+\frac{2^{2\lceil Q\rceil+1}L^{2}\beta_{T}\log T}{\lambda}\Big) (69)
=T\Gamma+(2^{\lceil Q\rceil}-1)\cdot 24d\beta_{T}(\log T)\log\Big(1+\frac{2^{2\lceil Q\rceil+1}L^{2}\beta_{T}\log T}{\lambda}\Big) (70)
<T\Gamma+\frac{48d\beta_{T}\log T}{\Gamma}\log\Big(1+\frac{8L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big) (71)
=T\Gamma+O\Big(\frac{d\beta_{T}\log T}{\Gamma}\log\Big(1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big)\Big). (72)

This proves Eqn. (28). Hence, with the choice of the parameter $\Gamma$ as in Eqn. (11),

\displaystyle F_{2}\leq d\sqrt{T}\log^{\frac{3}{2}}T+O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big) (73)
\leq O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big). (74)

C.4 Proof of Lemma 4

Proof.

The term $F_{3}$ corresponds to the case in which the best arm at time $t$ does not perform sufficiently well, so that the empirically largest reward at time $t$ is far from the highest expected reward. One observes that minimizing $F_{3}$ results in a tradeoff with respect to $F_{1}$. On the event $\overline{C}_{t}$, we can apply the “peeling device” on $\langle\theta^{*},x_{t}^{*}\rangle-\langle\hat{\theta}_{t-1},x_{t}^{*}\rangle$ such that $\frac{q+1}{2}\delta\leq\langle\theta^{*},x_{t}^{*}\rangle-\langle\hat{\theta}_{t-1},x_{t}^{*}\rangle<\frac{q+2}{2}\delta$, where $q\in\mathbb{N}$. Then, using the fact that $I_{t,A_{t}}\leq I_{t,a_{t}^{*}}$, we have

\displaystyle\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}<\frac{q^{2}\delta^{2}}{4\beta_{t-1}\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}^{2}}. (75)

On the other hand, using the event $B_{t}$ and the Cauchy–Schwarz inequality, it holds that

\displaystyle\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}\geq\frac{(q+1)\delta}{2\sqrt{\beta_{t-1}}}. (76)

If $\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}}$, the regret in this case is bounded by $O(d\sqrt{T\log T})$ (similar to the procedure to get from Eqn. (36) to Eqn. (47)). Otherwise, $\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}>\log\frac{1}{\Delta_{t}^{2}}\geq 0$; combining Eqn. (75) and Eqn. (76) then implies that

\displaystyle\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{(q+1)^{2}\delta^{2}}{4\beta_{t-1}}\exp\Big(-\frac{q^{2}}{(q+1)^{2}}\Big). (77)

Notice that with $\sqrt{\lambda}S\geq 1$ and $\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}}\leq\frac{1}{\beta_{t-1}}\leq 1$, it holds for all $q\in\mathbb{N}$ that

\displaystyle\frac{(q+1)^{2}\delta^{2}}{4\beta_{t-1}}\exp\Big(-\frac{q^{2}}{(q+1)^{2}}\Big)<1. (78)

Using Corollary 1, one can show that:

\displaystyle\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{C}_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\} (79)
\leq T\Gamma+\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{C}_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\}\cdot\mathbbm{1}\left\{2^{-l}<\Delta_{t}\leq 2^{-l+1}\right\} (80)
\leq T\Gamma+\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\sum_{q=1}^{\infty}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t}\right\}\cdot\mathbbm{1}\left\{\frac{q+1}{2}\delta\leq\langle\theta^{*},x_{t}^{*}\rangle-\langle\hat{\theta}_{t-1},x_{t}^{*}\rangle<\frac{q+2}{2}\delta\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\}\cdot\mathbbm{1}\left\{2^{-l}<\Delta_{t}\leq 2^{-l+1}\right\} (81)
\leq T\Gamma+\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\sum_{q=1}^{\infty}\Delta_{t}\cdot\mathbbm{1}\left\{1\geq\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{(q+1)^{2}\delta^{2}}{4\beta_{t-1}}\exp\Big(-\frac{q^{2}}{(q+1)^{2}}\Big)\right\}\cdot\mathbbm{1}\left\{2^{-l}<\Delta_{t}\leq 2^{-l+1}\right\} (83)
=T\Gamma+\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\sum_{q=1}^{\infty}\Delta_{t}\cdot\mathbbm{1}\left\{1\geq\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{(q+1)^{2}\Delta_{t}^{2}}{4\beta_{t-1}\log T}\exp\Big(-\frac{q^{2}}{(q+1)^{2}}\Big)\right\}\cdot\mathbbm{1}\left\{2^{-l}<\Delta_{t}\leq 2^{-l+1}\right\} (84)
\leq T\Gamma+\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\sum_{q=1}^{\infty}2^{-l+1}\cdot\mathbbm{1}\left\{1\geq\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}>\frac{(q+1)^{2}\cdot 2^{-2l}}{4\beta_{T}\log T}\exp\Big(-\frac{q^{2}}{(q+1)^{2}}\Big)\right\} (85)
\leq T\Gamma+\sum_{l=1}^{\lceil Q\rceil}\sum_{q=1}^{\infty}2^{-l+1}\cdot 2^{2l}\cdot 24d\beta_{T}(\log T)\cdot\frac{\exp\big(\frac{q^{2}}{(q+1)^{2}}\big)}{(q+1)^{2}}\cdot\log\Big(1+\frac{2^{2l}\cdot 8L^{2}\beta_{T}\log T}{\lambda}\cdot\frac{\exp\big(\frac{q^{2}}{(q+1)^{2}}\big)}{(q+1)^{2}}\Big) (86)
<T\Gamma+\sum_{l=1}^{\lceil Q\rceil}\sum_{q=1}^{\infty}2^{l+1}\cdot 24d\beta_{T}(\log T)\cdot\frac{\exp\big(\frac{q^{2}}{(q+1)^{2}}\big)}{(q+1)^{2}}\cdot\log\Big(1+\frac{2^{2l+1}L^{2}\beta_{T}\log T}{\lambda}\Big) (87)
=T\Gamma+\sum_{l=1}^{\lceil Q\rceil}2^{l+1}\cdot 24d\beta_{T}(\log T)\cdot\log\Big(1+\frac{2^{2l+1}L^{2}\beta_{T}\log T}{\lambda}\Big)\sum_{q=1}^{\infty}\frac{\exp\big(\frac{q^{2}}{(q+1)^{2}}\big)}{(q+1)^{2}} (88)
\leq T\Gamma+\sum_{l=1}^{\lceil Q\rceil}2^{l+1}\cdot 24d\beta_{T}(\log T)\cdot\log\Big(1+\frac{2^{2l+1}L^{2}\beta_{T}\log T}{\lambda}\Big)\cdot(1.1) (89)
\leq T\Gamma+\sum_{l=1}^{\lceil Q\rceil}2^{l+1}\cdot 27d\beta_{T}(\log T)\cdot\log\Big(1+\frac{2^{2l+1}L^{2}\beta_{T}\log T}{\lambda}\Big) (90)
\leq T\Gamma+\sum_{l=1}^{\lceil Q\rceil}2^{l+1}\cdot 27d\beta_{T}(\log T)\cdot\log\Big(1+\frac{2^{2\lceil Q\rceil+1}L^{2}\beta_{T}\log T}{\lambda}\Big) (91)
<T\Gamma+\frac{216d\beta_{T}\log T}{\Gamma}\cdot\log\Big(1+\frac{8L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big) (92)
=T\Gamma+O\Big(\frac{d\beta_{T}\log T}{\Gamma}\log\Big(1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big)\Big). (93)
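As a numerical sanity check of the series constant used above (illustrative only, not part of the proof), the series $\sum_{q\geq 1}\exp(q^{2}/(q+1)^{2})/(q+1)^{2}$ converges, since its terms behave like $e/(q+1)^{2}$, and a partial-sum computation confirms that its value lies just below $1.1$:

```python
import math

# Partial sum of sum_{q>=1} exp(q^2/(q+1)^2) / (q+1)^2.
def partial_sum(n):
    return sum(math.exp(q * q / (q + 1) ** 2) / (q + 1) ** 2
               for q in range(1, n + 1))

n = 10 ** 6
s = partial_sum(n)
# Each term is at most e/(q+1)^2, so the tail past q = n is below e/(n+1).
tail_bound = math.e / (n + 1)
assert 1.0 < s and s + tail_bound < 1.1
```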

Hence

\displaystyle F_{3}=\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{C}_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\}+\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{C}_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}}\right\} (94)
<O\Big(\frac{d\beta_{T}}{\Gamma}\log\Big(1+\frac{L^{2}\beta_{T}}{\lambda\Gamma^{2}}\Big)\Big)+2T\Gamma+O\Big(\frac{d\beta_{T}\log T}{\Gamma}\log\Big(1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big)\Big) (95)
\leq 2T\Gamma+O\Big(\frac{d\beta_{T}\log(T)}{\Gamma}\log\Big(1+\frac{L^{2}\beta_{T}\log(T)}{\lambda\Gamma^{2}}\Big)\Big). (96)

This proves Eqn. (30). With the choice of $\Gamma$ as in Eqn. (11),

\displaystyle F_{3}\leq 2d\sqrt{T}\log^{\frac{3}{2}}T+O\Big(\frac{d\sqrt{T}\beta_{T}\log T}{d\log^{\frac{3}{2}}T}\log\Big(1+\frac{TL^{2}\beta_{T}\log T}{\lambda d^{2}\log^{3}T}\Big)\Big) (97)
<2d\sqrt{T}\log^{\frac{3}{2}}T+O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big) (98)
=O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big). (99)

C.5 Proof of Lemma 5

Proof.

The bound on $F_{4}$ follows directly from Lemma 1 with the given choice of $\gamma$. Indeed, one has

\displaystyle F_{4}=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B}_{t}\right\}\leq T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\Delta_{t}\cdot\mathbbm{1}\left\{2^{-l}<\Delta_{t}\leq 2^{-l+1}\right\}\cdot\mathbbm{1}\left\{\overline{B}_{t}\right\} (100)
\leq T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot\mathbbm{1}\left\{\overline{B}_{t}\right\}\leq T\Gamma+\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot\mathbb{P}\big(\overline{B}_{t}\big)\leq T\Gamma+\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\gamma (101)
=T\Gamma+\sum_{t=1}^{T}\frac{1}{t^{2}}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\leq T\Gamma+\sum_{t=1}^{T}\frac{2-\Gamma}{t^{2}}<T\Gamma+\frac{\pi^{2}}{3}=T\Gamma+O(1). (102)
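The final arithmetic step admits a direct numerical check (illustrative only, not part of the proof): since $2-\Gamma<2$ and $\sum_{t\geq 1}1/t^{2}=\pi^{2}/6$, the sum $\sum_{t=1}^{T}(2-\Gamma)/t^{2}$ stays below $\pi^{2}/3$ for every horizon $T$:

```python
import math

def f4_remainder(T, Gamma):
    # The non-(T * Gamma) part of the bound in (102): sum_t (2 - Gamma)/t^2.
    return sum((2 - Gamma) / t ** 2 for t in range(1, T + 1))

# (2 - Gamma) < 2 and sum_t 1/t^2 < pi^2/6, so the sum is below pi^2/3.
for Gamma in (0.01, 0.5, 0.99):
    for T in (10, 1000, 100000):
        assert f4_remainder(T, Gamma) < math.pi ** 2 / 3
```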

With the choice of $\Gamma$ as in Eqn. (11),

\displaystyle F_{4}<d\sqrt{T}\log^{\frac{3}{2}}T+O(1) (103)
\leq O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big). (104)

C.6 Proof of Theorem 1

Proof.

Combining Lemmas 2, 3, 4 and 5,

\displaystyle R_{T}=F_{1}+F_{2}+F_{3}+F_{4} (105)
\leq O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big)+O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big)+O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big)+O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big) (106)
=O\Big(d\sqrt{T}\log^{\frac{3}{2}}T\Big). (107)

Appendix D Proof of the regret bound for LinIMED-2 (Proof of Theorem 2)

We choose $\gamma$ and $\Gamma$ as follows:

\displaystyle\gamma=\frac{1}{t^{2}}\qquad\qquad\Gamma=\frac{\sqrt{d\beta_{T}}\log T}{\sqrt{T}}. (108)

D.1 Statement of Lemmas for LinIMED-2

We first state the following lemmas, which respectively give the upper bounds on $F_{1}$ to $F_{4}$:

Lemma 6.

Under Assumption 1, and the assumption that $\sqrt{\lambda}S\geq 1$, for the free parameter $0<\Gamma<1$, the term $F_{1}$ for LinIMED-2 satisfies:

\displaystyle F_{1}\leq T\Gamma+O\bigg(\frac{d\beta_{T}\log T}{\Gamma}\bigg)\log\bigg(1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg).

Lemma 7.

Under Assumption 1, and the assumption that $\sqrt{\lambda}S\geq 1$, for the free parameter $0<\Gamma<1$, the term $F_{2}$ for LinIMED-2 satisfies:

\displaystyle F_{2}\leq T\Gamma+O\bigg(\frac{d\beta_{T}\log T}{\Gamma}\bigg)\log\bigg(1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg).

Lemma 8.

Under Assumption 1, and the assumption that $\sqrt{\lambda}S\geq 1$, for the free parameter $0<\Gamma<1$, the term $F_{3}$ for LinIMED-2 satisfies:

\displaystyle F_{3}\leq 5T\Gamma+O\bigg(\frac{d\beta_{T}\log T}{\Gamma}\log\Big(1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big)\bigg)+O\bigg(\sqrt{T\log T}\log\bigg(\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg)\bigg).

Lemma 9.

Under Assumption 1, with the choice of $\gamma=\frac{1}{t^{2}}$ as in Eqn. (108), for the free parameter $0<\Gamma<1$, the term $F_{4}$ for LinIMED-2 satisfies:

\displaystyle F_{4}\leq T\Gamma+O(1).

D.2 Proof of Lemma 6

Proof.

We first partition the analysis into the cases $\hat{A}_{t}\neq A_{t}$ and $\hat{A}_{t}=A_{t}$ as follows:

\displaystyle F_{1}=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\} (109)
=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}\cdot\mathbbm{1}\left\{\hat{A}_{t}\neq A_{t}\right\}+\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}\cdot\mathbbm{1}\left\{\hat{A}_{t}=A_{t}\right\}. (110)

Case 1: If $\hat{A}_{t}\neq A_{t}$, then the index of $A_{t}$ is $I_{t,A_{t}}=\frac{\hat{\Delta}_{t,A_{t}}^{2}}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}$. Using the fact that $I_{t,A_{t}}\leq I_{t,\hat{A}_{t}}$, we have:

\displaystyle I_{t,A_{t}}=\frac{\hat{\Delta}_{t,A_{t}}^{2}}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}} (111)
\leq\log T\wedge\log\frac{1}{\beta_{t-1}\lVert\hat{X}_{t}\rVert_{V_{t-1}^{-1}}^{2}} (112)
\leq\log T. (113)

Therefore

\displaystyle\frac{\hat{\Delta}_{t,A_{t}}^{2}}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}\leq\log T. (114)

If $\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}$, using the same procedure to get from Eqn. (36) to Eqn. (47), one has:

\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t}\right\}\cdot\mathbbm{1}\left\{\hat{A}_{t}\neq A_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (115)
\leq\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (116)
<T\Gamma+\frac{48d\beta_{T}}{\Gamma}\log\Big(1+\frac{8L^{2}\beta_{T}}{\lambda\Gamma^{2}}\Big) (117)
=T\Gamma+O\Big(\frac{d\beta_{T}}{\Gamma}\log\Big(1+\frac{L^{2}\beta_{T}}{\lambda\Gamma^{2}}\Big)\Big). (118)

Otherwise, if $\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}<\frac{\Delta^{2}_{t}}{\beta_{t-1}}$, this implies that $\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}>\log\frac{1}{\Delta_{t}^{2}}\geq 0$. Then, substituting the event $D_{t}:=\{\hat{\Delta}_{t,A_{t}}\geq\varepsilon\}$ into Eqn. (114), we obtain

\displaystyle\frac{\varepsilon^{2}}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}\leq\log T. (119)

With $\sqrt{\lambda}S\geq 1$ we have $\beta_{t-1}\geq 1$, so

\displaystyle\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\varepsilon^{2}}{\beta_{t-1}\log T}. (120)

Hence

\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t},\hat{A}_{t}\neq A_{t},\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}<\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (121)
\leq\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\varepsilon^{2}}{\beta_{t-1}\log T}\right\}. (122)

With the choice of $\varepsilon=(1-\frac{2}{\sqrt{\log T}})\Delta_{t}$, when $T\geq 149>\exp(5)$ we have $\varepsilon>\frac{\Delta_{t}}{10}$; performing the “peeling device” on $\Delta_{t}$ then yields

\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\varepsilon^{2}}{\beta_{t-1}\log T}\right\}\cdot\mathbbm{1}\left\{\Delta_{t}\geq\Gamma\right\} (123)
\leq 149+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\Delta_{t}\cdot\mathbbm{1}\left\{2^{-l}<\Delta_{t}\leq 2^{-l+1},\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\varepsilon^{2}}{\beta_{t-1}\log T}\right\} (124)
\leq O(1)+\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\sum_{t=1}^{T}\mathbbm{1}\left\{\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\varepsilon^{2}}{\beta_{t-1}\log T}\right\} (125)
\leq O(1)+\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\sum_{t=1}^{T}\mathbbm{1}\left\{\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{2^{-2l}}{100\beta_{T}\log T}\right\} (126)
\leq O(1)+\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot 2^{2l}\cdot 600d\beta_{T}(\log T)\log\Big(1+\frac{2^{2l}\cdot 200L^{2}\beta_{T}\log T}{\lambda}\Big) (127)
\leq O(1)+\mathbb{E}\sum_{l=1}^{\lceil Q\rceil}2^{l+1}\cdot 600d\beta_{T}(\log T)\log\Big(1+\frac{2^{2\lceil Q\rceil}\cdot 200L^{2}\beta_{T}\log T}{\lambda}\Big) (128)
<O(1)+\frac{4800d\beta_{T}\log T}{\Gamma}\log\Big(1+\frac{800L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big). (129)
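The constant $\varepsilon>\Delta_{t}/10$ invoked above, which is where the factors $100$, $200$, and $600$ originate, can be checked numerically (illustrative only, not part of the proof):

```python
import math

# For T >= 149 > e^5, log T > 5, so 2/sqrt(log T) < 2/sqrt(5) < 0.9,
# hence epsilon = (1 - 2/sqrt(log T)) * Delta_t exceeds Delta_t / 10.
for T in (149, 1000, 10 ** 9):
    factor = 1 - 2 / math.sqrt(math.log(T))
    assert factor > 1 / 10
```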

Considering the event $\{\Delta_{t}<\Gamma\}$, we can upper bound the corresponding expectation as follows:

\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\varepsilon^{2}}{\beta_{t-1}\log T}\right\}\cdot\mathbbm{1}\left\{\Delta_{t}<\Gamma\right\}\leq\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\Delta_{t}<\Gamma\right\}<T\Gamma. (130)

Then

\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t},\hat{A}_{t}\neq A_{t},\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}<\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (131)
\leq\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\varepsilon^{2}}{\beta_{t-1}\log T}\right\} (132)
=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\varepsilon^{2}}{\beta_{t-1}\log T}\right\}\cdot\mathbbm{1}\left\{\Delta_{t}\geq\Gamma\right\} (133)
\qquad+\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\varepsilon^{2}}{\beta_{t-1}\log T}\right\}\cdot\mathbbm{1}\left\{\Delta_{t}<\Gamma\right\} (134)
\leq O(1)+T\Gamma+\frac{4800d\beta_{T}\log T}{\Gamma}\log\Big(1+\frac{800L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big). (135)

Hence

\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t},\hat{A}_{t}\neq A_{t}\right\} (136)
=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t},\hat{A}_{t}\neq A_{t},\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (137)
\qquad+\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},D_{t},\hat{A}_{t}\neq A_{t},\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}<\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (138)
\leq T\Gamma+O\Big(\frac{d\beta_{T}}{\Gamma}\log\Big(1+\frac{L^{2}\beta_{T}}{\lambda\Gamma^{2}}\Big)\Big)+O(1)+T\Gamma+\frac{4800d\beta_{T}\log T}{\Gamma}\log\Big(1+\frac{800L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big) (139)
\leq T\Gamma+O\Big(\frac{d\beta_{T}\log T}{\Gamma}\log\Big(1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big)\Big). (140)

Case 2: If A^t=At\hat{A}_{t}=A_{t}, then from the event CtC_{t} and the choice δ=ΔtlogT\delta=\frac{\Delta_{t}}{\sqrt{\log T}} we have

θ^t1θ,Xt>(11logT)Δt.\displaystyle\langle\hat{\theta}_{t-1}-\theta^{*},X_{t}\rangle>\Big{(}1-\frac{1}{\sqrt{\log T}}\Big{)}\Delta_{t}. (141)

Furthermore, the definition of the event BtB_{t} implies that

XtVt112>(11logT)2Δt2βt1.\displaystyle\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}>\frac{(1-\frac{1}{\sqrt{\log T}})^{2}\Delta_{t}^{2}}{\beta_{t-1}}. (142)

When T>8>exp(2)T>8>\exp(2), we have (11logT)2>116(1-\frac{1}{\sqrt{\log T}})^{2}>\frac{1}{16}. Then, similarly to Case 1, we can bound this term by O(dβTΓ)log(1+L2βTλΓ2)O(\frac{d\beta_{T}}{\Gamma})\log(1+\frac{L^{2}\beta_{T}}{\lambda\Gamma^{2}}).
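The elementary claim that (11logT)2>116(1-\frac{1}{\sqrt{\log T}})^{2}>\frac{1}{16} whenever T>8T>8 can be verified numerically; the following Python snippet is a quick sanity check of this constant (illustrative only, not part of the proof):

```python
import math

def coeff(T: float) -> float:
    """Evaluate (1 - 1/sqrt(log T))^2, the constant appearing in Case 2."""
    return (1.0 - 1.0 / math.sqrt(math.log(T))) ** 2

# The quantity is increasing in T, so checking values just above T = 8
# and some larger horizons supports the claim coeff(T) > 1/16.
for T in [8.01, 9, 10, 100, 10**6, 10**12]:
    assert coeff(T) > 1.0 / 16.0, T
```

Since coeff(T) tends to 1 as T grows, the constant 1/16 is conservative for all practical horizons.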

Summarizing the two cases,

F1\displaystyle F_{1} O(1)+TΓ+O(dβTlogTΓ)log(1+L2βTlogTλΓ2)\displaystyle\leq O(1)+T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\bigg{)}\log\bigg{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)} (143)
TΓ+O(dβTlogTΓ)log(1+L2βTlogTλΓ2).\displaystyle\leq T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\bigg{)}\log\bigg{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)}~{}. (144)

D.3 Proof of Lemma 7

Proof.

Recall that

F2=𝔼t=1TΔt𝟙{Bt,Ct,D¯t}.\displaystyle F_{2}=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},C_{t},\overline{D}_{t}\right\}~{}. (145)

From CtC_{t} and D¯t\overline{D}_{t}, we derive that:

θ,atδ<ε+θ^t1,Xt.\displaystyle\langle\theta^{*},a_{t}^{*}\rangle-\delta<\varepsilon+\langle\hat{\theta}_{t-1},X_{t}\rangle. (146)

With the choice δ=ΔtlogT,ε=(12logT)Δt\delta=\frac{\Delta_{t}}{\sqrt{\log T}},\varepsilon=(1-\frac{2}{\sqrt{\log T}})\Delta_{t}, we have

θ^t1θ,Xt>ΔtlogT.\displaystyle\langle\hat{\theta}_{t-1}-\theta^{*},X_{t}\rangle>\frac{\Delta_{t}}{\sqrt{\log T}}. (147)

Then using the definition of the event BtB_{t} in Eqn. (147) yields

XtVt112Δt2βt1logT.\displaystyle\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{\Delta_{t}^{2}}{\beta_{t-1}\log T}. (148)

Using a procedure similar to that from Eqn. (36) to Eqn. (47), we can upper bound F2F_{2} as

F2TΓ+O(dβTlogTΓ)log(1+L2βTlogTλΓ2).\displaystyle F_{2}\leq T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\bigg{)}\log\bigg{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)}~{}. (149)

D.4 Proof of Lemma 8

Proof.

From the event C¯t\overline{C}_{t}, which states that maxb𝒜tθ^t1,bθ,xtδ\max_{b\in\mathcal{A}_{t}}\langle\hat{\theta}_{t-1},b\rangle\leq\langle\theta^{*},x^{*}_{t}\rangle-\delta, the index of the best arm at time tt can be upper bounded as:

It,at(θ,xtδθ^t1,xt)2βt1xtVt112+log1βt1xtVt112.\displaystyle I_{t,a_{t}^{*}}\leq\frac{(\langle\theta^{*},x_{t}^{*}\rangle-\delta-\langle\hat{\theta}_{t-1},x_{t}^{*}\rangle)^{2}}{\beta_{t-1}\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}^{2}}. (150)

Case 1: If A^tAt\hat{A}_{t}\neq A_{t}, then we have

It,atIt,Atlog1βt1XtVt112.\displaystyle I_{t,a_{t}^{*}}\geq I_{t,A_{t}}\geq\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}. (151)

Suppose q+12δθ,xtθ^t1,xt<q+22δ\frac{q+1}{2}\delta\leq\langle\theta^{*},x_{t}^{*}\rangle-\langle\hat{\theta}_{t-1},x_{t}^{*}\rangle<\frac{q+2}{2}\delta for qq\in\mathbb{N}, then one has

log1βt1XtVt112q2δ24βt1xtVt112+log1βt1xtVt112.\displaystyle\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}\leq\frac{q^{2}\delta^{2}}{4\beta_{t-1}\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}^{2}}. (152)

On the other hand, on the event BtB_{t},

xtVt11(q+1)δ2βt1.\displaystyle\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}\geq\frac{(q+1)\delta}{2\sqrt{\beta_{t-1}}}. (153)

If XtVt112Δt2βt1\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}, then using the same procedure as that from Eqn. (36) to Eqn. (47), one has:

𝔼t=1TΔt𝟙{Bt,C¯t}𝟙{A^tAt}𝟙{XtVt112Δt2βt1}\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{C}_{t}\right\}\cdot\mathbbm{1}\left\{\hat{A}_{t}\neq A_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (154)
𝔼t=1TΔt𝟙{XtVt112Δt2βt1}\displaystyle\leq\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (155)
<TΓ+48dβTΓlog(1+8L2βTλΓ2)\displaystyle<T\Gamma+\frac{48d\beta_{T}}{\Gamma}\log\bigg{(}1+\frac{8L^{2}\beta_{T}}{\lambda\Gamma^{2}}\bigg{)} (156)
=TΓ+O(dβTΓlog(1+L2βTλΓ2)).\displaystyle=T\Gamma+O\bigg{(}\frac{d\beta_{T}}{\Gamma}\log\bigg{(}1+\frac{L^{2}\beta_{T}}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}. (157)

Otherwise, XtVt112<Δt2βt1\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}<\frac{\Delta^{2}_{t}}{\beta_{t-1}}, which implies that log1βt1XtVt112>log1Δt20\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}>\log\frac{1}{\Delta_{t}^{2}}\geq 0. Combining Eqn. (152) and Eqn. (153) then implies that

XtVt112(q+1)2δ24βt1exp(q2(q+1)2).\displaystyle\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\geq\frac{(q+1)^{2}\delta^{2}}{4\beta_{t-1}}\exp\bigg{(}-\frac{q^{2}}{(q+1)^{2}}\bigg{)}. (158)

Then, using the same procedure as that from Eqn. (78) to Eqn. (93), we have

t=1TΔt𝟙{Bt,C¯t}𝟙{XtVt112<Δt2βt1,A^tAt}\displaystyle\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{C}_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}},\hat{A}_{t}\neq A_{t}\right\} (159)
<TΓ+O(dβTlogTΓlog(1+L2βTlogTλΓ2)).\displaystyle<T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\log\Big{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big{)}\bigg{)}~{}. (160)

Case 2: A^t=At\hat{A}_{t}=A_{t}. If XtVt112Δt2βt1\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}, then using the same procedure as that from Eqn. (36) to Eqn. (47), one has:

𝔼t=1TΔt𝟙{Bt,C¯t}𝟙{A^t=At}𝟙{XtVt112Δt2βt1}\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{C}_{t}\right\}\cdot\mathbbm{1}\left\{\hat{A}_{t}=A_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (161)
𝔼t=1TΔt𝟙{XtVt112Δt2βt1}\displaystyle\leq\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (162)
<TΓ+48dβTΓlog(1+8L2βTλΓ2)\displaystyle<T\Gamma+\frac{48d\beta_{T}}{\Gamma}\log\bigg{(}1+\frac{8L^{2}\beta_{T}}{\lambda\Gamma^{2}}\bigg{)} (163)
=TΓ+O(dβTΓlog(1+L2βTλΓ2)).\displaystyle=T\Gamma+O\bigg{(}\frac{d\beta_{T}}{\Gamma}\log\bigg{(}1+\frac{L^{2}\beta_{T}}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}. (164)

Otherwise, XtVt112<Δt2βt1\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}<\frac{\Delta^{2}_{t}}{\beta_{t-1}}, which implies that log1βt1XtVt112>log1Δt20\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}>\log\frac{1}{\Delta_{t}^{2}}\geq 0.

If log1βt1XtVt112<logT\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}<\log T, then using the same procedure as that from Eqn. (152) to Eqn. (160), we have

t=1TΔt𝟙{Bt,C¯t}𝟙{XtVt112<Δt2βt1,A^t=At,log1βt1XtVt112<logTβt1}\displaystyle\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{C}_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}},\hat{A}_{t}=A_{t},\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}<\log\frac{T}{\beta_{t-1}}\right\} (165)
<TΓ+O(dβTlogTΓlog(1+L2βTlogTλΓ2)).\displaystyle<T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\log\Big{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big{)}\bigg{)}~{}. (166)

If log1βt1XtVt112logT\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}\geq\log T, then the index of AtA_{t} is It,At=logTI_{t,A_{t}}=\log T. By performing the “peeling device” with q+12δθ,xtθ^t1,xt<q+22δ\frac{q+1}{2}\delta\leq\langle\theta^{*},x_{t}^{*}\rangle-\langle\hat{\theta}_{t-1},x_{t}^{*}\rangle<\frac{q+2}{2}\delta for qq\in\mathbb{N}, we have

logTq2δ24βt1xtVt112+log1βt1xtVt112.\displaystyle\log T\leq\frac{q^{2}\delta^{2}}{4\beta_{t-1}\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}^{2}}. (167)

On the other hand, using the definition of the event BtB_{t},

xtVt11(q+1)δ2βt1.\displaystyle\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}\geq\frac{(q+1)\delta}{2\sqrt{\beta_{t-1}}}. (168)

Combining Eqn. (167) and Eqn. (168), we have

δ2exp(q22(q+1)2)(q+1)T.\displaystyle\delta\leq\frac{2\exp(\frac{q^{2}}{2(q+1)^{2}})}{(q+1)\sqrt{T}}. (169)
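To spell out the step leading to Eqn. (169): both terms on the right-hand side of Eqn. (167) are decreasing in βt1xtVt112\beta_{t-1}\lVert x_{t}^{*}\rVert_{V_{t-1}^{-1}}^{2}, so substituting the lower bound from Eqn. (168) gives

```latex
\log T \;\le\; \frac{q^{2}\delta^{2}}{4\cdot\frac{(q+1)^{2}\delta^{2}}{4}}
  \;+\; \log\frac{4}{(q+1)^{2}\delta^{2}}
  \;=\; \frac{q^{2}}{(q+1)^{2}} \;+\; \log\frac{4}{(q+1)^{2}\delta^{2}} ,
```

so that (q+1)2δ24Texp(q2(q+1)2)(q+1)^{2}\delta^{2}\leq\frac{4}{T}\exp(\frac{q^{2}}{(q+1)^{2}}); taking square roots yields Eqn. (169).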

Then with δ=ΔtlogT\delta=\frac{\Delta_{t}}{\sqrt{\log T}}, this implies that

Δt2logTexp(q22(q+1)2)(q+1)T.\displaystyle\Delta_{t}\leq\frac{2\sqrt{\log T}\exp(\frac{q^{2}}{2(q+1)^{2}})}{(q+1)\sqrt{T}}. (170)

On the other hand, from q+12δβt1xtVt11βt1Lλ\frac{q+1}{2}\delta\leq\sqrt{\beta_{t-1}}\lVert x^{*}_{t}\rVert_{V^{-1}_{t-1}}\leq\sqrt{\beta_{t-1}}\cdot\frac{L}{\sqrt{\lambda}}, we have q+12Lβt1logTλΔtq+1\leq\frac{2L\sqrt{\beta_{t-1}\log T}}{\sqrt{\lambda}\Delta_{t}}. Hence,

t=1TΔt𝟙{Bt,C¯t}𝟙{XtVt112<Δt2βt1,A^t=At,log1βt1XtVt112logT,ΔtΓ}\displaystyle\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{C}_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V_{t-1}^{-1}}<\frac{\Delta_{t}^{2}}{\beta_{t-1}},\hat{A}_{t}=A_{t},\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}\geq\log T,\Delta_{t}\geq\Gamma\right\} (171)
𝔼q=12LβTlogTλΓ1t=1TΔt𝟙{Δt2logTexp(q22(q+1)2)(q+1)T}\displaystyle\leq\mathbb{E}\sum_{q=1}^{\lfloor\frac{2L\sqrt{\beta_{T}\log T}}{\sqrt{\lambda}\Gamma}-1\rfloor}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\Delta_{t}\leq\frac{2\sqrt{\log T}\exp(\frac{q^{2}}{2(q+1)^{2}})}{(q+1)\sqrt{T}}\right\} (172)
𝔼q=12LβTlogTλΓ1t=1T2logTexp(q22(q+1)2)(q+1)T\displaystyle\leq\mathbb{E}\sum_{q=1}^{\lfloor\frac{2L\sqrt{\beta_{T}\log T}}{\sqrt{\lambda}\Gamma}-1\rfloor}\sum_{t=1}^{T}\frac{2\sqrt{\log T}\exp(\frac{q^{2}}{2(q+1)^{2}})}{(q+1)\sqrt{T}} (173)
=𝔼q=12LβTlogTλΓ12TlogTexp(q22(q+1)2)q+1\displaystyle=\mathbb{E}\sum_{q=1}^{\lfloor\frac{2L\sqrt{\beta_{T}\log T}}{\sqrt{\lambda}\Gamma}-1\rfloor}\frac{2\sqrt{T\log T}\exp(\frac{q^{2}}{2(q+1)^{2}})}{q+1} (174)
<𝔼q=12LβTlogTλΓ12eTlogTq+1\displaystyle<\mathbb{E}\sum_{q=1}^{\lfloor\frac{2L\sqrt{\beta_{T}\log T}}{\sqrt{\lambda}\Gamma}-1\rfloor}\frac{2\sqrt{e}\sqrt{T\log T}}{q+1} (175)
<2eTlogTlog(2LlogTλΓ1)\displaystyle<2\sqrt{e}\sqrt{T\log T}\log\bigg{(}\frac{2L\sqrt{\log T}}{\sqrt{\lambda}\Gamma}-1\bigg{)} (176)
O(TlogTlog(L2βTlogTλΓ2)).\displaystyle\leq O\bigg{(}\sqrt{T\log T}\log\bigg{(}\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}~{}. (177)
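The step from Eqn. (175) to Eqn. (176) uses the standard harmonic-sum estimate q=1M1q+1=HM+11log(M+1)\sum_{q=1}^{M}\frac{1}{q+1}=H_{M+1}-1\leq\log(M+1). A quick numerical sanity check of this estimate (illustrative only, not part of the proof):

```python
import math

def harmonic_tail(M: int) -> float:
    """Sum_{q=1}^{M} 1/(q+1), the q-dependent factor in Eqn. (175)."""
    return sum(1.0 / (q + 1) for q in range(1, M + 1))

# H_{M+1} - 1 <= log(M + 1) for every M >= 1, which justifies replacing
# the sum over the peeling index q by a single logarithmic factor.
for M in [1, 2, 10, 1000, 10**5]:
    assert harmonic_tail(M) <= math.log(M + 1), M
```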

Summarizing the two cases (A^tAt\hat{A}_{t}\neq A_{t} and A^t=At\hat{A}_{t}=A_{t}), we see that F3F_{3} is upper bounded by:

F3\displaystyle F_{3} <TΓ+O(dβTΓlog(1+L2βTλΓ2))+TΓ+O(dβTlogTΓlog(1+L2βTlogTλΓ2))\displaystyle<T\Gamma+O\bigg{(}\frac{d\beta_{T}}{\Gamma}\log\bigg{(}1+\frac{L^{2}\beta_{T}}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}+T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\log\Big{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big{)}\bigg{)} (178)
+TΓ+O(dβTΓlog(1+L2βTλΓ2))+TΓ+O(dβTlogTΓlog(1+L2βTlogTλΓ2))\displaystyle\qquad+T\Gamma+O\bigg{(}\frac{d\beta_{T}}{\Gamma}\log\bigg{(}1+\frac{L^{2}\beta_{T}}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}+T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\log\Big{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big{)}\bigg{)} (179)
+TΓ+O(TβTlogTlog(L2βTlogTλΓ2))\displaystyle\qquad+T\Gamma+O\bigg{(}\sqrt{T\beta_{T}\log T}\log\bigg{(}\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)}\bigg{)} (180)
5TΓ+O(dβTlogTΓlog(1+L2βTlogTλΓ2))+O(TlogTlog(L2βTlogTλΓ2)).\displaystyle\leq 5T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\log\Big{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\Big{)}\bigg{)}+O\bigg{(}\sqrt{T\log T}\log\bigg{(}\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}~{}. (181)

D.5 Proof of Lemma 9

Proof.

The proof of this case is straightforward by using Lemma 1 with the choice γ=1t2\gamma=\frac{1}{t^{2}}:

F4\displaystyle F_{4} =𝔼t=1TΔt𝟙{B¯t}\displaystyle=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B}_{t}\right\} (182)
=𝔼t=1TΔt𝟙{B¯t,Δt<Γ}+𝔼t=1TΔt𝟙{B¯t,ΔtΓ}\displaystyle=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B}_{t},\Delta_{t}<\Gamma\right\}+\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B}_{t},\Delta_{t}\geq\Gamma\right\} (183)
<TΓ+𝔼t=1Tl=1QΔt𝟙{B¯t,2l<Δt2l+1}\displaystyle<T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B}_{t},2^{-l}<\Delta_{t}\leq 2^{-l+1}\right\} (184)
TΓ+𝔼t=1Tl=1Q2l+1𝟙{B¯t}\displaystyle\leq T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot\mathbbm{1}\left\{\overline{B}_{t}\right\} (185)
TΓ+l=1Q2l+1t=1T{B¯t}\displaystyle\leq T\Gamma+\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\sum_{t=1}^{T}\mathbb{P}\left\{\overline{B}_{t}\right\} (186)
=TΓ+l=1Q2l+1π26\displaystyle=T\Gamma+\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot\frac{\pi^{2}}{6} (187)
<TΓ+(2Γ)π26\displaystyle<T\Gamma+(2-\Gamma)\cdot\frac{\pi^{2}}{6} (188)
<TΓ+π23\displaystyle<T\Gamma+\frac{\pi^{2}}{3} (189)
=TΓ+O(1).\displaystyle=T\Gamma+O(1)~{}. (190)
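The two numerical facts used above, namely t11t2=π26\sum_{t\geq 1}\frac{1}{t^{2}}=\frac{\pi^{2}}{6} (so that t{B¯t}π26\sum_{t}\mathbb{P}\{\overline{B}_{t}\}\leq\frac{\pi^{2}}{6} with the choice γ=1t2\gamma=\frac{1}{t^{2}}) and l12l+1<2\sum_{l\geq 1}2^{-l+1}<2, admit a quick numerical sanity check (illustrative only, not part of the proof):

```python
import math

# Partial sums of the Basel series converge to pi^2/6 from below:
# sum_{t=1}^{N} 1/t^2 < pi^2/6 for every finite N.
basel = sum(1.0 / t**2 for t in range(1, 10**5 + 1))
assert basel < math.pi**2 / 6

# The geometric factor: sum_{l=1}^{L} 2^{-l+1} = 2(1 - 2^{-L}) < 2,
# so multiplying by pi^2/6 gives the final pi^2/3 constant.
geo = sum(2.0 ** (-l + 1) for l in range(1, 50))
assert geo < 2.0
assert geo * math.pi**2 / 6 < math.pi**2 / 3
```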

D.6 Proof of Theorem 2

Proof.

Combining Lemmas 6, 7, 8 and 9, with the choices of γ\gamma and Γ\Gamma as in Eqn. (108), the regret of LinIMED-2 is bounded as follows:

RT\displaystyle R_{T} =F1+F2+F3+F4\displaystyle=F_{1}+F_{2}+F_{3}+F_{4} (191)
TΓ+O(dβTlogTΓ)log(1+L2βTlogTλΓ2)+TΓ+O(dβTlogTΓ)log(1+L2βTlogTλΓ2)\displaystyle\leq T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\bigg{)}\log\bigg{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)}+T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\bigg{)}\log\bigg{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)} (192)
+5TΓ+O(dβTlogTΓlog(1+L2βTlogTλΓ2))+O(TlogTlog(L2βTlogTλΓ2))\displaystyle\qquad+5T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\log\bigg{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}+O\bigg{(}\sqrt{T\log T}\log\bigg{(}\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)}\bigg{)} (193)
+TΓ+O(1)\displaystyle\qquad+T\Gamma+O(1) (194)
8TΓ+O(dβTlogTΓlog(1+L2βTlogTλΓ2))+O(TlogTlog(L2βTlogTλΓ2))\displaystyle\leq 8T\Gamma+O\bigg{(}\frac{d\beta_{T}\log T}{\Gamma}\log\bigg{(}1+\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}+O\bigg{(}\sqrt{T\log T}\log\bigg{(}\frac{L^{2}\beta_{T}\log T}{\lambda\Gamma^{2}}\bigg{)}\bigg{)} (195)
=8dTβTlogT+O(dTβTlog(1+TL2λdlogT))+O(TlogTlog(TL2λdlogT))\displaystyle=8\sqrt{dT\beta_{T}}\log T+O\bigg{(}\sqrt{dT\beta_{T}}\log\bigg{(}1+\frac{TL^{2}}{\lambda d\log T}\bigg{)}\bigg{)}+O\bigg{(}\sqrt{T\log T}\log\bigg{(}\frac{TL^{2}}{\lambda d\log T}\bigg{)}\bigg{)} (196)
=8dTlog32T+O(dTlog32T)+O(Tlog32T)\displaystyle=8d\sqrt{T}\log^{\frac{3}{2}}T+O\bigg{(}d\sqrt{T}\log^{\frac{3}{2}}T\bigg{)}+O\bigg{(}\sqrt{T}\log^{\frac{3}{2}}T\bigg{)} (197)
O(dTlog32T).\displaystyle\leq O\bigg{(}d\sqrt{T}\log^{\frac{3}{2}}T\bigg{)}~{}. (198)

Appendix E Proof of the regret bound for LinIMED-3 (Proof of Theorem 3)

First we define ata_{t}^{*} as the best arm at time step tt, i.e., at=argmaxa𝒜tθ,xt,aa_{t}^{*}=\operatorname*{arg\,max}_{a\in\mathcal{A}_{t}}\langle\theta^{*},x_{t,a}\rangle, and use xt:=xt,atx_{t}^{*}:=x_{t,a_{t}^{*}} to denote its corresponding context. Define A^t:=argmaxa𝒜tUCBt(a)\hat{A}_{t}:=\operatorname*{arg\,max}_{a\in\mathcal{A}_{t}}\mathrm{UCB}_{t}(a). Let Δt:=θ,xtθ,Xt\Delta_{t}:=\langle\theta^{*},x_{t}^{*}\rangle-\langle\theta^{*},X_{t}\rangle denote the instantaneous regret at time tt. Define the following events:

Bt:={θ^t1θVt1βt1(γ)},Dt:={Δ^t,At>ε}.\displaystyle B^{\prime}_{t}:=\big{\{}\lVert\hat{\theta}_{t-1}-\theta^{*}\rVert_{V_{t-1}}\leq\sqrt{\beta_{t-1}(\gamma)}\big{\}},\quad D^{\prime}_{t}:=\big{\{}\hat{\Delta}_{t,A_{t}}>\varepsilon\big{\}}.

where ε\varepsilon is a free parameter set to be ε=Δt3\varepsilon=\frac{\Delta_{t}}{3} in this proof sketch.

Then the expected regret RT=𝔼t=1TΔtR_{T}=\mathbb{E}\sum_{t=1}^{T}\Delta_{t} can be partitioned by events Bt,DtB^{\prime}_{t},D^{\prime}_{t} such that:

RT=𝔼t=1TΔt𝟙{Bt,Dt}=:F1+𝔼t=1TΔt𝟙{Bt,Dt¯}=:F2+𝔼t=1TΔt𝟙{Bt¯}=:F3.\displaystyle R_{T}=\underbrace{\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B^{\prime}_{t},D^{\prime}_{t}\right\}}_{=:F_{1}}+\underbrace{\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B^{\prime}_{t},\overline{D^{\prime}_{t}}\right\}}_{=:F_{2}}+\underbrace{\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B^{\prime}_{t}}\right\}}_{=:F_{3}}. (199)

For the F1F_{1} case:
From DtD^{\prime}_{t} we know AtA^tA_{t}\neq\hat{A}_{t}, therefore

It,At=Δ^t,At2βt1XtVt112+log1βt1XtVt112.\displaystyle I_{t,A_{t}}=\frac{\hat{\Delta}_{t,A_{t}}^{2}}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}~{}. (200)

From DtD^{\prime}_{t} and It,AtIt,A^tlogCmaxa𝒜tΔ^t,a2I_{t,A_{t}}\leq I_{t,\hat{A}_{t}}\leq\log\frac{C}{\max_{a\in\mathcal{A}_{t}}\hat{\Delta}^{2}_{t,a}}, we have

It,At<logCε2.\displaystyle I_{t,A_{t}}<\log\frac{C}{\varepsilon^{2}}~{}. (201)

Combining Eqn. (200) and Eqn. (201),

Δ^t,At2βt1XtVt112+log1βt1XtVt112<logCε2.\displaystyle\frac{\hat{\Delta}_{t,A_{t}}^{2}}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}+\log\frac{1}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}<\log\frac{C}{\varepsilon^{2}}~{}. (202)

Then

Δ^t,At2βt1XtVt112<log(βt1XtVt112Cε2).\displaystyle\frac{\hat{\Delta}_{t,A_{t}}^{2}}{\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}}<\log\bigg{(}\beta_{t-1}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\cdot\frac{C}{\varepsilon^{2}}\bigg{)}~{}. (203)
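The passage from Eqn. (202) to Eqn. (203) simply moves the logarithmic term to the right-hand side. As a numerical sanity check of this index manipulation (the values of the empirical gap, βt1\beta_{t-1}, XtVt112\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}, CC, and ε\varepsilon below are illustrative, not taken from the paper):

```python
import math

def imed_index(delta_hat: float, beta: float, xnorm2: float) -> float:
    """IMED-style index of Eqn. (200): squared-gap term plus log term."""
    return delta_hat**2 / (beta * xnorm2) + math.log(1.0 / (beta * xnorm2))

# If the index is below log(C / eps^2) as in Eqn. (201), then the squared-gap
# term alone is below log(beta * xnorm2 * C / eps^2), which is Eqn. (203).
delta_hat, beta, xnorm2, C, eps = 0.3, 2.0, 0.05, 30.0, 0.1
if imed_index(delta_hat, beta, xnorm2) < math.log(C / eps**2):
    lhs = delta_hat**2 / (beta * xnorm2)
    rhs = math.log(beta * xnorm2 * C / eps**2)
    assert lhs < rhs
```

The implication holds exactly by algebra; the code only illustrates it on concrete numbers.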

If XtVt112Δt2βt1\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}, then using the same procedure as that from Eqn. (36) to Eqn. (47), one has:

𝔼t=1TΔt𝟙{Bt,Dt}𝟙{XtVt112Δt2βt1}\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B^{\prime}_{t},D^{\prime}_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (204)
𝔼t=1TΔt𝟙{XtVt112Δt2βt1}\displaystyle\leq\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}\geq\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (205)
<TΓ+48dβTΓlog(1+8L2βTλΓ2)\displaystyle<T\Gamma+\frac{48d\beta_{T}}{\Gamma}\log\bigg{(}1+\frac{8L^{2}\beta_{T}}{\lambda\Gamma^{2}}\bigg{)} (206)
=TΓ+O(dβTΓlog(1+L2βTλΓ2)).\displaystyle=T\Gamma+O\bigg{(}\frac{d\beta_{T}}{\Gamma}\log\bigg{(}1+\frac{L^{2}\beta_{T}}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}. (207)

Otherwise, XtVt112<Δt2βt1\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}<\frac{\Delta^{2}_{t}}{\beta_{t-1}}, which implies that βt1XtVt112<Δt2\beta_{t-1}\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}<\Delta^{2}_{t}. Plugging this into Eqn. (203), and using the choice ε=Δt3\varepsilon=\frac{\Delta_{t}}{3} together with the event DtD^{\prime}_{t}, we have

Δt29βt1XtVt112<log(9C).\displaystyle\frac{\Delta_{t}^{2}}{9\beta_{t-1}\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}}<\log(9C)~{}. (208)

Since C1C\geq 1 is a constant,

XtVt112>Δt29βt1log(9C).\displaystyle\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}>\frac{\Delta^{2}_{t}}{9\beta_{t-1}\log(9C)}~{}. (209)

Using the same procedure as that from Eqn. (36) to Eqn. (47), one has:

𝔼t=1TΔt𝟙{Bt,Dt}𝟙{XtVt112<Δt2βt1}\displaystyle\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B^{\prime}_{t},D^{\prime}_{t}\right\}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}<\frac{\Delta^{2}_{t}}{\beta_{t-1}}\right\} (210)
𝔼t=1TΔt𝟙{XtVt112>Δt29βt1log(9C)}\displaystyle\leq\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}>\frac{\Delta^{2}_{t}}{9\beta_{t-1}\log(9C)}\right\} (211)
<TΓ+O(dβTlogCΓlog(1+L2βTlogCλΓ2)).\displaystyle<T\Gamma+O\bigg{(}\frac{d\beta_{T}\log C}{\Gamma}\log\bigg{(}1+\frac{L^{2}\beta_{T}\log C}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}~{}. (212)

Hence

F1<2TΓ+O(dβTlogCΓlog(1+L2βTlogCλΓ2)).\displaystyle F_{1}<2T\Gamma+O\bigg{(}\frac{d\beta_{T}\log C}{\Gamma}\log\bigg{(}1+\frac{L^{2}\beta_{T}\log C}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}~{}. (213)

For the F2F_{2} case: Since the event BtB^{\prime}_{t} holds,

maxa𝒜tUCBt(a)UCBt(at)=θ^t1,xt+βt1xtVt11θ,xt\displaystyle\max_{a\in\mathcal{A}_{t}}\mathrm{UCB}_{t}(a)\geq\mathrm{UCB}_{t}(a^{*}_{t})=\langle\hat{\theta}_{t-1},x^{*}_{t}\rangle+\sqrt{\beta_{t-1}}\lVert x^{*}_{t}\rVert_{V^{-1}_{t-1}}\geq\langle\theta^{*},x^{*}_{t}\rangle (214)

On the other hand, from Dt¯\overline{D^{\prime}_{t}} we have

maxa𝒜tUCBt(a)UCBt(At)+ε=θ^t1,Xt+βt1XtVt11+ε.\displaystyle\max_{a\in\mathcal{A}_{t}}\mathrm{UCB}_{t}(a)\leq\mathrm{UCB}_{t}(A_{t})+\varepsilon=\langle\hat{\theta}_{t-1},X_{t}\rangle+\sqrt{\beta_{t-1}}\lVert X_{t}\rVert_{V^{-1}_{t-1}}+\varepsilon~{}. (215)

Combining Eqn. equation 214 and Eqn. equation 215,

θ,xtθ^t1,Xt+βt1XtVt11+ε.\displaystyle\langle\theta^{*},x^{*}_{t}\rangle\leq\langle\hat{\theta}_{t-1},X_{t}\rangle+\sqrt{\beta_{t-1}}\lVert X_{t}\rVert_{V^{-1}_{t-1}}+\varepsilon~{}. (216)

Hence

Δtεθ^t1θ,Xt+βt1XtVt11.\displaystyle\Delta_{t}-\varepsilon\leq\langle\hat{\theta}_{t-1}-\theta^{*},X_{t}\rangle+\sqrt{\beta_{t-1}}\lVert X_{t}\rVert_{V^{-1}_{t-1}}~{}. (217)

Then, using ε=Δt3\varepsilon=\frac{\Delta_{t}}{3} together with the Cauchy–Schwarz-type bound \langle\hat{\theta}_{t-1}-\theta^{*},X_{t}\rangle\leq\lVert\hat{\theta}_{t-1}-\theta^{*}\rVert_{V_{t-1}}\cdot\lVert X_{t}\rVert_{V_{t-1}^{-1}}\leq\sqrt{\beta_{t-1}}\lVert X_{t}\rVert_{V_{t-1}^{-1}} on the event BtB^{\prime}_{t}, we have

23Δt2βt1XtVt11,\displaystyle\frac{2}{3}\Delta_{t}\leq 2\sqrt{\beta_{t-1}}\lVert X_{t}\rVert_{V^{-1}_{t-1}}~{}, (218)

therefore

XtVt112>Δt29βt1.\displaystyle\lVert X_{t}\rVert^{2}_{V^{-1}_{t-1}}>\frac{\Delta^{2}_{t}}{9\beta_{t-1}}~{}. (219)

Using the same procedure as that from Eqn. (36) to Eqn. (47), one has:

F2<TΓ+O(dβTΓlog(1+L2βTλΓ2)).\displaystyle F_{2}<T\Gamma+O\bigg{(}\frac{d\beta_{T}}{\Gamma}\log\bigg{(}1+\frac{L^{2}\beta_{T}}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}~{}. (220)

For the F3F_{3} case: Using Lemma 1 with the choice γ=1t2\gamma=\frac{1}{t^{2}}, we have:

F3\displaystyle F_{3} =𝔼t=1TΔt𝟙{Bt¯}\displaystyle=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B^{\prime}_{t}}\right\} (221)
=𝔼t=1TΔt𝟙{Bt¯,Δt<Γ}+𝔼t=1TΔt𝟙{Bt¯,ΔtΓ}\displaystyle=\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B^{\prime}_{t}},\Delta_{t}<\Gamma\right\}+\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B^{\prime}_{t}},\Delta_{t}\geq\Gamma\right\} (222)
<TΓ+𝔼t=1Tl=1QΔt𝟙{Bt¯,2l<Δt2l+1}\displaystyle<T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B^{\prime}_{t}},2^{-l}<\Delta_{t}\leq 2^{-l+1}\right\} (223)
TΓ+𝔼t=1Tl=1Q2l+1𝟙{Bt¯}\displaystyle\leq T\Gamma+\mathbb{E}\sum_{t=1}^{T}\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot\mathbbm{1}\left\{\overline{B^{\prime}_{t}}\right\} (224)
TΓ+l=1Q2l+1t=1T{Bt¯}\displaystyle\leq T\Gamma+\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\sum_{t=1}^{T}\mathbb{P}\left\{\overline{B^{\prime}_{t}}\right\} (225)
=TΓ+l=1Q2l+1π26\displaystyle=T\Gamma+\sum_{l=1}^{\lceil Q\rceil}2^{-l+1}\cdot\frac{\pi^{2}}{6} (226)
<TΓ+(2Γ)π26\displaystyle<T\Gamma+(2-\Gamma)\cdot\frac{\pi^{2}}{6} (227)
<TΓ+π23\displaystyle<T\Gamma+\frac{\pi^{2}}{3} (228)
=TΓ+O(1).\displaystyle=T\Gamma+O(1)~{}. (229)

E.1 Proof of Theorem 3

Proof.

Combining Eqn. (213), Eqn. (220), and Eqn. (229) with the choices γ=1t2\gamma=\frac{1}{t^{2}} and Γ=βTT\Gamma=\frac{\beta_{T}}{\sqrt{T}}, and recalling that C1C\geq 1 is a constant, the regret of LinIMED-3 is bounded as follows:

RT=F1+F2+F3\displaystyle R_{T}=F_{1}+F_{2}+F_{3} (230)
<4TΓ+O(dβTlogCΓlog(1+L2βTlogCλΓ2))+O(dβTΓlog(1+L2βTλΓ2))+O(1)\displaystyle<4T\Gamma+O\bigg{(}\frac{d\beta_{T}\log C}{\Gamma}\log\bigg{(}1+\frac{L^{2}\beta_{T}\log C}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}+O\bigg{(}\frac{d\beta_{T}}{\Gamma}\log\bigg{(}1+\frac{L^{2}\beta_{T}}{\lambda\Gamma^{2}}\bigg{)}\bigg{)}+O(1) (231)
<O(dTlogClog(1+L2TlogCλ))\displaystyle<O\bigg{(}d\sqrt{T}\log C\log\bigg{(}1+\frac{L^{2}T\log C}{\lambda}\bigg{)}\bigg{)} (232)
=O(dTlog(T)).\displaystyle=O\bigg{(}d\sqrt{T}\log(T)\bigg{)}~{}. (233)

This completes the proof. ∎

Appendix F Proof of the regret bound for SupLinIMED (Proof of Theorem 4)

Define st[logT]s_{t}\in[\lceil\log T\rceil] as the value of the index ss at which the arm is chosen at time tt. For SupLinIMED, the index of each arm other than the empirically best arm is defined as It,a=(Δ^t,astwt,ast)22log(wt,ast)I_{t,a}=\bigg{(}\frac{\hat{\Delta}^{s_{t}}_{t,a}}{w^{s_{t}}_{t,a}}\bigg{)}^{2}-2\log(w^{s_{t}}_{t,a}), whereas the index of the empirically best arm is defined as It,A^t=log(2T)(2log(wt,A^tst))I_{t,\hat{A}^{*}_{t}}=\log(2T)\wedge(-2\log(w^{s_{t}}_{t,\hat{A}^{*}_{t}})), where A^t=argmaxa𝒜^stθ^tst,xt,a\hat{A}^{*}_{t}=\operatorname*{arg\,max}_{a\in\hat{\mathcal{A}}_{s_{t}}}\langle\hat{\theta}^{s_{t}}_{t},x_{t,a}\rangle. Define the index of the best arm at time tt as at:=argmaxa[K]θ,xt,aa^{*}_{t}:=\operatorname*{arg\,max}_{a\in[K]}\langle\theta^{*},x_{t,a}\rangle.
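The two index formulas above can be summarized in code; the following Python sketch assumes only the definitions stated in this paragraph (the gap and width values used in the check are illustrative):

```python
import math

def suplinimed_index(delta_hat: float, width: float) -> float:
    """Index of a non-(empirically-best) arm: (Delta_hat / w)^2 - 2 log w."""
    return (delta_hat / width) ** 2 - 2.0 * math.log(width)

def best_arm_index(width: float, T: int) -> float:
    """Index of the empirically best arm: min(log(2T), -2 log w)."""
    return min(math.log(2 * T), -2.0 * math.log(width))

# In Step 1 of SupLinIMED the widths satisfy w <= 1 / sqrt(T); for any width
# strictly below that threshold, -2 log w exceeds log T, so the cap log(2T)
# can be active (cf. Remark 1 below).
T = 10_000
w = 0.5 / math.sqrt(T)
assert -2.0 * math.log(w) >= math.log(T)
assert best_arm_index(w, T) <= math.log(2 * T)
```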

Remark 1.

Here the upper bound we set for the index of the empirically best arm is log(2T)\log(2T), which is slightly larger than our previous logT\log T (Line 10 in the LinIMED algorithm). This is because in the first step of the SupLinIMED algorithm (or, more generally, of SupLinUCB-type algorithms), the width of each arm is less than 1T\frac{1}{\sqrt{T}}; as a result, the index of each arm is larger than logT\log T.

Let the set of time indices such that the chosen arm is from Step 1 (Lines 6–9 in Algorithm 2) be Ψ0\Psi_{0}. Then the cumulative expected regret of the SupLinIMED algorithm over the time horizon TT can be decomposed as follows:

RT=𝔼[tΨ0θ,xt,atXt]+𝔼[tΨ0θ,xt,atXt]\displaystyle R_{T}=\mathbb{E}\left[\sum_{t\in\Psi_{0}}\langle\theta^{*},x_{t,a^{*}_{t}}-X_{t}\rangle\right]+\mathbb{E}\left[\sum_{t\not\in\Psi_{0}}\langle\theta^{*},x_{t,a^{*}_{t}}-X_{t}\rangle\right] (234)

Since the index set has not changed in Step 1 (see Line 9 in Algorithm 2), the second term of the regret is the same as in the original SupLinUCB algorithm of Chu et al. (2011). For the first term, we partition it using the following events:

Bt\displaystyle B_{t} :=t[T],s[logT],a[K]{|θθ^ts,xt,a|α+1αwt,as},and\displaystyle:=\bigcap_{t\in[T],s\in[\log T],a\in[K]}\left\{|\langle\theta^{*}-\hat{\theta}_{t}^{s},x_{t,a}\rangle|\leq\frac{\alpha+1}{\alpha}w^{s}_{t,a}\right\},\qquad\mbox{and}
Dt\displaystyle D_{t} :={Δ^t,Atstε},\displaystyle:=\left\{\hat{\Delta}^{s_{t}}_{t,A_{t}}\geq\varepsilon\right\},

where α=12ln2TKγ\alpha=\sqrt{\frac{1}{2}\ln\frac{2TK}{\gamma}} as in the SupLinUCB (Chu et al., 2011). We choose γ=12t2\gamma=\frac{1}{2t^{2}} throughout. Furthermore, θ^ts\hat{\theta}_{t}^{s} is the θ^t\hat{\theta}_{t} obtained from Algorithm 3 with Ψts\Psi_{t}^{s} as the input, i.e.,

θ^ts:=(Id+τΨtsxτ,Aτxτ,AτT)1τΨtsYτ,Aτxτ,Aτ.\hat{\theta}_{t}^{s}:=\bigg{(}I_{d}+\sum_{\tau\in\Psi_{t}^{s}}x_{\tau,A_{\tau}}x_{\tau,A_{\tau}}^{T}\bigg{)}^{-1}\sum_{\tau\in\Psi^{s}_{t}}Y_{\tau,A_{\tau}}x_{\tau,A_{\tau}}.
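For concreteness, θ^ts\hat{\theta}_{t}^{s} is the ridge-regression estimate (with regularizer matching the IdI_{d} term in the display above) computed from the rewards indexed by Ψts\Psi_{t}^{s}. A minimal self-contained sketch in pure Python, with a naive Gaussian-elimination solver; the contexts and rewards below are illustrative, not from the experiments:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ridge_estimate(contexts, rewards):
    """theta_hat = (I_d + sum x x^T)^{-1} sum y x, as in the display above."""
    d = len(contexts[0])
    V = [[float(i == j) for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, y in zip(contexts, rewards):
        for i in range(d):
            b[i] += y * x[i]
            for j in range(d):
                V[i][j] += x[i] * x[j]
    return solve(V, b)

# Noiseless rewards from theta* = (1, -2): with many samples the estimate
# concentrates near theta* (the identity regularizer biases it slightly).
theta_star = (1.0, -2.0)
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)] * 200
ys = [theta_star[0] * x[0] + theta_star[1] * x[1] for x in xs]
theta_hat = ridge_estimate(xs, ys)
assert abs(theta_hat[0] - 1.0) < 0.05 and abs(theta_hat[1] + 2.0) < 0.05
```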

Define Δt:=θ,xt,atXt\Delta_{t}:=\langle\theta^{*},x_{t,a_{t}^{*}}-X_{t}\rangle as the instantaneous regret at each time step tt. In addition, choose ε=Δt3\varepsilon=\frac{\Delta_{t}}{3} in the definition of DtD_{t}. Then the first term of the expected regret in equation 234 can be partitioned by the events BtB_{t} and DtD_{t} as follows:

𝔼[tΨ0θ,xt,atXt]\displaystyle\mathbb{E}\left[\sum_{t\in\Psi_{0}}\langle\theta^{*},x_{t,a^{*}_{t}}-X_{t}\rangle\right] =𝔼[tΨ0Δt𝟙{Bt,Dt}]=:F1+𝔼[tΨ0Δt𝟙{Bt,D¯t}]=:F2+𝔼[tΨ0Δt𝟙{B¯t}]=:F3\displaystyle=\underbrace{\mathbb{E}\left[\sum_{t\in\Psi_{0}}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},D_{t}\right\}\right]}_{=:F_{1}}+\underbrace{\mathbb{E}\left[\sum_{t\in\Psi_{0}}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{D}_{t}\right\}\right]}_{=:F_{2}}+\underbrace{\mathbb{E}\left[\sum_{t\in\Psi_{0}}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B}_{t}\right\}\right]}_{=:F_{3}}

We recall that when tΨ0t\in\Psi_{0}, wt,ast1Tw^{s_{t}}_{t,a}\leq\frac{1}{\sqrt{T}} for all a𝒜^sta\in\hat{\mathcal{A}}_{s_{t}}.

To bound F1F_{1}, we note that since BtB_{t} occurs, the actual best arm satisfies at𝒜^sta^{*}_{t}\in\hat{\mathcal{A}}_{s_{t}} with probability at least 1γlog2T1-\gamma\log^{2}T by Chu et al. (2011, Lemma 5). As such,

maxa𝒜^stθ^tst,xt,aθ^tst,xt,atθ,xt,atα+1αwt,atsθ,xt,at2T\displaystyle\max_{a\in\hat{\mathcal{A}}_{s_{t}}}\langle\hat{\theta}^{s_{t}}_{t},x_{t,a}\rangle\geq\langle\hat{\theta}^{s_{t}}_{t},x_{t,a^{*}_{t}}\rangle\geq\langle\theta^{*},x_{t,a^{*}_{t}}\rangle-\frac{\alpha+1}{\alpha}w^{s}_{t,a^{*}_{t}}\geq\langle\theta^{*},x_{t,a^{*}_{t}}\rangle-\frac{2}{\sqrt{T}}

where the last inequality is from the fact that γ=12t2\gamma=\frac{1}{2t^{2}} and α=12ln2TKγ1.\alpha=\sqrt{\frac{1}{2}\ln\frac{2TK}{\gamma}}\geq 1~{}. If, on the other hand, the best arm at𝒜^sta^{*}_{t}\notin\hat{\mathcal{A}}_{s_{t}}, the corresponding regret is bounded as:

𝔼tΨ0Δt𝟙{at𝒜^st}\displaystyle\mathbb{E}\sum_{t\in\Psi_{0}}\Delta_{t}\cdot\mathbbm{1}\left\{a^{*}_{t}\notin\hat{\mathcal{A}}_{s_{t}}\right\} 𝔼t=1TΔt𝟙{at𝒜^st}\displaystyle\leq\mathbb{E}\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{a^{*}_{t}\notin\hat{\mathcal{A}}_{s_{t}}\right\} (235)
𝔼t=1T𝟙{at𝒜^st}\displaystyle\leq\mathbb{E}\sum_{t=1}^{T}\mathbbm{1}\left\{a^{*}_{t}\notin\hat{\mathcal{A}}_{s_{t}}\right\} (236)
=t=1T(at𝒜^st)\displaystyle=\sum_{t=1}^{T}\mathbb{P}(a^{*}_{t}\notin\hat{\mathcal{A}}_{s_{t}}) (237)
t=1Tγlog2T\displaystyle\leq\sum_{t=1}^{T}\gamma\log^{2}T (238)
=t=1Tlog2T2t2\displaystyle=\sum_{t=1}^{T}\frac{\log^{2}T}{2t^{2}} (239)
<π212log2T.\displaystyle<\frac{\pi^{2}}{12}\log^{2}T~{}. (240)

Case 1: If A^tAt\hat{A}_{t}^{*}\neq A_{t}, this means that the index of AtA_{t} is It,At=(Δ^t,Atst)2α2XtVt12+log1α2XtVt12I_{t,A_{t}}=\frac{(\hat{\Delta}_{t,A_{t}}^{s_{t}})^{2}}{\alpha^{2}\lVert X_{t}\rVert_{V_{t}^{-1}}^{2}}+\log\frac{1}{\alpha^{2}\lVert X_{t}\rVert_{V_{t}^{-1}}^{2}}. Using the fact that It,AtIt,A^tI_{t,A_{t}}\leq I_{t,\hat{A}^{*}_{t}} we have

(Δ^t,Atst)2α2XtVt12+log1α2XtVt122logT.\displaystyle\frac{(\hat{\Delta}_{t,A_{t}}^{s_{t}})^{2}}{\alpha^{2}\lVert X_{t}\rVert_{V_{t}^{-1}}^{2}}+\log\frac{1}{\alpha^{2}\lVert X_{t}\rVert_{V_{t}^{-1}}^{2}}\leq 2\log T~{}. (241)

Then using the definition of the event DtD_{t} and the fact that (wt,ast)2=α2XtVt121T(w_{t,a}^{s_{t}})^{2}=\alpha^{2}\lVert X_{t}\rVert_{V_{t}^{-1}}^{2}\leq\frac{1}{T}, we have

Δt29α2XtVt112logT9logTT.\displaystyle\Delta_{t}^{2}\leq 9\alpha^{2}\lVert X_{t}\rVert_{V_{t-1}^{-1}}^{2}\log T\leq\frac{9\log T}{T}.

Hence, Δt3logTT\Delta_{t}\leq\frac{3\sqrt{\log T}}{\sqrt{T}}. Therefore F1F_{1} in this case is upper bounded as follows:

𝔼[tΨ0Δt𝟙{Bt,Dt}𝟙{A^tAt}𝟙{at𝒜^st}]𝔼[tΨ03logTT]3TlogT.\displaystyle\mathbb{E}\left[\sum_{t\in\Psi_{0}}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},D_{t}\right\}\cdot\mathbbm{1}\left\{\hat{A}^{*}_{t}\neq A_{t}\right\}\cdot\mathbbm{1}\left\{a^{*}_{t}\in\hat{\mathcal{A}}_{s_{t}}\right\}\right]\leq\mathbb{E}\left[\sum_{t\in\Psi_{0}}\frac{3\sqrt{\log T}}{\sqrt{T}}\right]\leq 3\sqrt{T\log T}~{}.

Case 2: If A^t=At\hat{A}_{t}^{*}=A_{t}, then using the definition of the event BtB_{t}, we have

θ^tst,Xt=maxa𝒜^stθ^tst,xt,aθ^tst,xt,atθ,xt,at2T=θ,Xt+Δt2T\displaystyle\langle\hat{\theta}_{t}^{s_{t}},X_{t}\rangle=\max_{a\in\hat{\mathcal{A}}_{s_{t}}}\langle\hat{\theta}_{t}^{s_{t}},x_{t,a}\rangle\geq\langle\hat{\theta}_{t}^{s_{t}},x_{t,a^{*}_{t}}\rangle\geq\langle\theta^{*},x_{t,a^{*}_{t}}\rangle-\frac{2}{\sqrt{T}}=\langle\theta^{*},X_{t}\rangle+\Delta_{t}-\frac{2}{\sqrt{T}}

therefore, since the event B_{t} occurs,

\displaystyle\Delta_{t}\leq\langle\hat{\theta}_{t}^{s_{t}}-\theta^{*},X_{t}\rangle+\frac{2}{\sqrt{T}}\leq\frac{3}{\sqrt{T}}~.

Hence, F_{1} in this case is bounded by 3\sqrt{T}. Combining the two cases above,

\displaystyle F_{1}\leq 3\sqrt{T\log T}+3\sqrt{T}+\frac{\pi^{2}}{12}\log^{2}T\leq O(\sqrt{T\log T})~.

To bound F_{2}, we note from the definition of B_{t} that

\displaystyle\max_{a\in\hat{\mathcal{A}}_{s_{t}}}\langle\hat{\theta}_{t}^{s_{t}},x_{t,a}\rangle\geq\langle\hat{\theta}_{t}^{s_{t}},x_{t,a^{*}_{t}}\rangle\geq\langle\theta^{*},x_{t,a^{*}_{t}}\rangle-\frac{2}{\sqrt{T}}~;

then, on the event \overline{D}_{t},

\displaystyle\langle\theta^{*},x_{t,a^{*}_{t}}\rangle-\frac{2}{\sqrt{T}}\leq\max_{a\in\hat{\mathcal{A}}_{s_{t}}}\langle\hat{\theta}_{t}^{s_{t}},x_{t,a}\rangle<\varepsilon+\langle\hat{\theta}_{t}^{s_{t}},X_{t}\rangle=\frac{\Delta_{t}}{3}+\langle\hat{\theta}^{s_{t}}_{t},X_{t}\rangle~,

therefore

\displaystyle\Delta_{t}<\frac{3}{2}\bigg(\langle\hat{\theta}_{t}^{s_{t}}-\theta^{*},X_{t}\rangle+\frac{2}{\sqrt{T}}\bigg)\leq\frac{9}{2\sqrt{T}}~.

Hence

\displaystyle F_{2}=\mathbb{E}\left[\sum_{t\in\Psi_{0}}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{D}_{t}\right\}\cdot\mathbbm{1}\left\{a^{*}_{t}\in\hat{\mathcal{A}}_{s_{t}}\right\}\right]+\mathbb{E}\left[\sum_{t\in\Psi_{0}}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{D}_{t}\right\}\cdot\mathbbm{1}\left\{a^{*}_{t}\notin\hat{\mathcal{A}}_{s_{t}}\right\}\right]
\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}\cdot\mathbbm{1}\left\{B_{t},\overline{D}_{t}\right\}\right]+\frac{\pi^{2}}{12}\log^{2}T
\displaystyle<\mathbb{E}\left[\sum_{t=1}^{T}\frac{9}{2\sqrt{T}}\cdot\mathbbm{1}\left\{B_{t},\overline{D}_{t}\right\}\right]+\frac{\pi^{2}}{12}\log^{2}T
\displaystyle<T\cdot\frac{9}{2\sqrt{T}}+\frac{\pi^{2}}{12}\log^{2}T
\displaystyle=\frac{9}{2}\sqrt{T}+\frac{\pi^{2}}{12}\log^{2}T
\displaystyle\leq O(\sqrt{T})~.

To bound F_{3}, we use the proof of Chu et al. (2011, Lemma 1), which is restated as follows.

Lemma 10.

For any a\in[K], s\in[\lceil\log T\rceil], and t\in[T],

\mathbb{P}\left[|\langle\theta^{*}-\hat{\theta}_{t}^{s},x_{t,a}\rangle|>\frac{\alpha+1}{\alpha}w^{s}_{t,a}\right]\leq\frac{\gamma}{TK}

where \alpha=\sqrt{\frac{1}{2}\ln\frac{2TK}{\gamma}}.

Then, applying the union bound over all t\in[T], s\in[\lceil\log T\rceil], and a\in[K], we have

\displaystyle\mathbb{P}\left[\overline{B}_{t}\right]=\mathbb{P}\left[\bigcup_{t\in[T],\,s\in[\lceil\log T\rceil],\,a\in[K]}\left\{|\langle\theta^{*}-\hat{\theta}_{t}^{s},x_{t,a}\rangle|>\frac{\alpha+1}{\alpha}w^{s}_{t,a}\right\}\right]
\displaystyle\leq\sum_{t\in[T]}\sum_{s\in[\lceil\log T\rceil]}\sum_{a\in[K]}\mathbb{P}\left[|\langle\theta^{*}-\hat{\theta}_{t}^{s},x_{t,a}\rangle|>\frac{\alpha+1}{\alpha}w^{s}_{t,a}\right]
\displaystyle<\big(TK(1+\log T)\big)\cdot\frac{\gamma}{TK}
\displaystyle=\gamma(1+\log T)~.
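To illustrate the counting behind this union bound, the following sketch (function names are ours, not from the paper) evaluates the Lemma 10 constant \alpha and checks that T K \lceil\log T\rceil events of probability \gamma/(TK) each sum to at most \gamma(1+\log T):

```python
import math

# alpha as defined in Lemma 10: alpha = sqrt(0.5 * ln(2TK / gamma)).
def alpha_of(T: int, K: int, gamma: float) -> float:
    return math.sqrt(0.5 * math.log(2 * T * K / gamma))

# Union bound: T * K * ceil(log T) bad events, each of probability
# at most gamma / (T K), giving a total of gamma * ceil(log T).
def union_bound(T: int, K: int, gamma: float) -> float:
    num_events = T * K * math.ceil(math.log(T))
    return num_events * gamma / (T * K)

T, K = 1000, 10
gamma = 1 / (2 * T ** 2)  # the choice gamma = 1/(2 t^2) at t = T
assert alpha_of(T, K, gamma) > 0
assert union_bound(T, K, gamma) <= gamma * (1 + math.log(T))
```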

With the choice \gamma=\frac{1}{2t^{2}} and the assumption \Delta_{t}\leq 1,

\displaystyle F_{3}=\mathbb{E}\left[\sum_{t\in\Psi_{0}}\Delta_{t}\cdot\mathbbm{1}\left\{\overline{B}_{t}\right\}\right]
\displaystyle\leq\sum_{t=1}^{T}\mathbb{P}\left[\overline{B}_{t}\right]
\displaystyle<\sum_{t=1}^{T}\frac{1+\log T}{2t^{2}}
\displaystyle<\frac{\pi^{2}}{12}(1+\log T)
\displaystyle\leq O(\log T)~.

Hence, the first term of R_{T} in equation 234 is upper bounded by

\displaystyle\mathbb{E}\left[\sum_{t\in\Psi_{0}}\langle\theta^{*},x_{t,a^{*}_{t}}-X_{t}\rangle\right]\leq O(\sqrt{T})+O(\log T)+O(\sqrt{T\log T})\leq O(\sqrt{T\log T})~.

On the other hand, by Chu et al. (2011, Theorem 1), the second term of R_{T} in equation 234 is upper bounded as follows:

\displaystyle\mathbb{E}\left[\sum_{t\notin\Psi_{0}}\langle\theta^{*},x_{t,a^{*}_{t}}-X_{t}\rangle\right]\leq O\Big(\sqrt{dT\log^{3}(KT)}\Big)~.

Hence, the regret of our SupLinIMED algorithm is upper bounded as follows:

\displaystyle R_{T}\leq O\Big(\sqrt{dT\log^{3}(KT)}\Big)~.

This completes the proof of Theorem 4.

Appendix G Hyperparameter tuning in our empirical study

G.1 Synthetic Dataset

The tables below report the empirical results obtained while tuning the hyperparameter \alpha (the scale of the confidence width) with the horizon fixed at T=1000.

Method              \alpha values         Regret
LinUCB              0.5 / 0.55 / 0.6      7.780 / 6.695 / 6.856
LinTS               0.2 / 0.25 / 0.3      9.769 / 9.201 / 12.068
LinIMED-1           0.15 / 0.2 / 0.25     24.086 / 5.482 / 6.108
LinIMED-2           0.2 / 0.25 / 0.3      4.999 / 4.998 / 7.329
LinIMED-3 (C=30)    0.15 / 0.2 / 0.25     25.588 / 2.075 / 2.760
Table 2: Tuning \alpha when K=10, d=2
Method              \alpha values         Regret
LinUCB              0.5 / 0.55 / 0.6      7.203 / 6.832 / 7.423
LinTS               0.1 / 0.15 / 0.2      54.221 / 7.042 / 7.352
LinIMED-1           0.2 / 0.25 / 0.3      6.707 / 6.053 / 8.458
LinIMED-2           0.2 / 0.25 / 0.3      6.254 / 4.918 / 7.013
LinIMED-3 (C=30)    0.2 / 0.25 / 0.3      4.407 / 2.562 / 3.041
Table 3: Tuning \alpha when K=100, d=2
Method              \alpha values         Regret
LinUCB              0.5 / 0.55 / 0.6      7.919 / 5.679 / 7.063
LinTS               0.1 / 0.15 / 0.2      69.955 / 6.925 / 7.037
LinIMED-1           0.15 / 0.2 / 0.25     24.393 / 5.625 / 6.335
LinIMED-2           0.2 / 0.25 / 0.3      6.335 / 4.831 / 7.040
LinIMED-3 (C=30)    0.15 / 0.2 / 0.25     41.355 / 1.936 / 2.250
Table 4: Tuning \alpha when K=500, d=2
Method              \alpha values         Regret
LinUCB              0.45 / 0.5 / 0.55     9.164 / 9.094 / 14.183
LinTS               0.1 / 0.15 / 0.2      14.252 / 9.886 / 14.680
LinIMED-1           0.1 / 0.15 / 0.2      19.663 / 6.463 / 10.643
LinIMED-2           0.1 / 0.15 / 0.2      15.685 / 5.399 / 8.373
LinIMED-3 (C=30)    0.1 / 0.15 / 0.2      8.024 / 2.062 / 3.342
Table 5: Tuning \alpha when K=10, d=20
Method              \alpha values         Regret
LinUCB              0.25 / 0.3 / 0.35     7.923 / 7.085 / 10.981
LinTS               0.1 / 0.15 / 0.2      14.983 / 9.565 / 19.300
LinIMED-1           0.05 / 0.1 / 0.15     58.278 / 6.165 / 9.225
LinIMED-2           0.1 / 0.15 / 0.2      8.916 / 8.575 / 13.483
LinIMED-3 (C=30)    0.05 / 0.1 / 0.15     142.704 / 2.816 / 3.497
Table 6: Tuning \alpha when K=10, d=50

We run these algorithms on the same dataset with different choices of \alpha and select the \alpha that yields the smallest regret.
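The selection rule above can be sketched as a simple grid search. Here, run_algorithm is a hypothetical stand-in (not from the paper) for one run of a bandit algorithm on the fixed dataset that returns its cumulative regret:

```python
def tune_alpha(run_algorithm, method, alpha_grid):
    """Evaluate each candidate alpha on the same dataset and return the
    alpha with the smallest regret (sketch of the tuning protocol)."""
    results = {alpha: run_algorithm(method, alpha) for alpha in alpha_grid}
    best = min(results, key=results.get)
    return best, results[best]

# Toy usage with a fake regret function that is minimized at alpha = 0.25.
fake_regret = lambda method, alpha: abs(alpha - 0.25) * 100
best_alpha, best_regret = tune_alpha(fake_regret, "LinIMED-1", [0.15, 0.2, 0.25])
assert best_alpha == 0.25
```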

G.2 MovieLens Dataset

The tables below report the empirical results obtained while tuning the hyperparameter \alpha (the scale of the confidence width) with the horizon fixed at T=1000.

Method              \alpha values         CTR
LinUCB              0.7 / 0.75 / 0.8      0.608 / 0.675 / 0.668
LinTS               0.05 / 0.1 / 0.15     0.615 / 0.705 / 0.679
LinIMED-1           0.15 / 0.2 / 0.25     0.740 / 0.823 / 0.766
LinIMED-2           0.15 / 0.2 / 0.25     0.740 / 0.823 / 0.766
LinIMED-3 (C=30)    0.2 / 0.25 / 0.3      0.713 / 0.742 / 0.690
IDS                 0.25 / 0.3 / 0.35     0.655 / 0.728 / 0.714
Table 7: Tuning \alpha when K=20
Method              \alpha values         CTR
LinUCB              0.75 / 0.8 / 0.85     0.708 / 0.754 / 0.713
LinTS               0 / 0.05 / 0.1        0.517 / 0.711 / 0.646
LinIMED-1           0.1 / 0.15 / 0.2      0.648 / 0.668 / 0.595
LinIMED-2           0.05 / 0.1 / 0.15     0.658 / 0.668 / 0.651
LinIMED-3 (C=30)    0.05 / 0.1 / 0.15     0.697 / 0.717 / 0.649
IDS                 0.3 / 0.35 / 0.4      0.643 / 0.688 / 0.606
Table 8: Tuning \alpha when K=50
Method              \alpha values         CTR
LinUCB              0.85 / 0.9 / 0.95     0.721 / 0.754 / 0.745
LinTS               0 / 0.05 / 0.1        0.487 / 0.674 / 0.588
LinIMED-1           0.05 / 0.1 / 0.15     0.682 / 0.729 / 0.594
LinIMED-2           0.05 / 0.1 / 0.15     0.687 / 0.729 / 0.594
LinIMED-3 (C=30)    0.05 / 0.1 / 0.15     0.689 / 0.705 / 0.594
IDS                 0.3 / 0.35 / 0.4      0.684 / 0.739 / 0.695
Table 9: Tuning \alpha when K=100

We run these algorithms on the same dataset with different choices of \alpha and select the \alpha that yields the highest CTR.