Indexed Minimum Empirical Divergence-Based Algorithms
for Linear Bandits
Abstract
The Indexed Minimum Empirical Divergence (IMED) algorithm is a highly effective approach that offers a stronger theoretical guarantee of asymptotic optimality than the Kullback–Leibler Upper Confidence Bound (KL-UCB) algorithm for the multi-armed bandit problem. Additionally, it has been observed to empirically outperform UCB-based algorithms and Thompson Sampling. Despite its effectiveness, the generalization of this algorithm to contextual bandits with linear payoffs has remained elusive. In this paper, we present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms. We demonstrate that LinIMED provides an $\widetilde{O}(d\sqrt{T})$ upper regret bound, where $d$ is the dimension of the context and $T$ is the time horizon. Furthermore, extensive empirical studies reveal that LinIMED and its variants outperform widely-used linear bandit algorithms such as LinUCB and Linear Thompson Sampling in some regimes.
1 Introduction
The multi-armed bandit (MAB) problem (Lattimore & Szepesvári (2020)) is a classical topic in decision theory and reinforcement learning. Among the various subfields of bandit problems, the stochastic linear bandit is among the most popular owing to its wide applicability in large-scale, real-world applications such as personalized recommendation systems (Li et al. (2010)), online advertising, and clinical trials. In the stochastic linear bandit model, at each time step $t$, the learner has to choose one arm $A_t$ from the time-varying action set $\mathcal{A}_t$. Each arm $a \in \mathcal{A}_t$ has a corresponding context $x_{a,t} \in \mathbb{R}^d$, which is a $d$-dimensional vector. By pulling the arm $A_t$ at time step $t$, under the linear bandit setting, the learner receives the reward $Y_t$, whose expected value satisfies $\mathbb{E}[Y_t \mid x_{A_t,t}] = \langle \theta^*, x_{A_t,t} \rangle$, where $\theta^* \in \mathbb{R}^d$ is an unknown parameter. The goal of the learner is to maximize the cumulative reward over a time horizon $T$, which is equivalent to minimizing the cumulative regret, defined as $R_T := \mathbb{E}\big[\sum_{t=1}^{T} \max_{a \in \mathcal{A}_t} \langle \theta^*, x_{a,t} \rangle - \langle \theta^*, x_{A_t,t} \rangle\big]$. The learner needs to balance the trade-off between the exploration of different arms (to learn their expected rewards) and the exploitation of the arm with the highest expected reward based on the available data.
1.1 Motivation and Related Work
The $K$-armed bandit setting is a special case of the linear bandit. There exist several good algorithms for this setting, such as UCB1 (Auer et al. (2002)), Thompson Sampling (Agrawal & Goyal (2012)), and the Indexed Minimum Empirical Divergence (IMED) algorithm (Honda & Takemura (2015)). There are three main families of asymptotically optimal multi-armed bandit algorithms based on different principles (Baudry et al. (2023)). However, among these algorithms, only IMED lacks an extension to contextual bandits with linear payoffs. In the varying arm set setting of the linear bandit problem, the LinUCB algorithm of Li et al. (2010) is frequently employed in practice. It has a theoretical guarantee on the regret of the order of $\widetilde{O}(d\sqrt{T})$ when using the confidence width as in OFUL (Abbasi-Yadkori et al. (2011)). Although the SupLinUCB algorithm introduced by Chu et al. (2011) uses phases to decompose the reward dependence of each time step and achieves an $\widetilde{O}(\sqrt{dT})$ regret upper bound (the $\widetilde{O}(\cdot)$ notation omits logarithmic factors in $T$), its empirical performance falls short of both the algorithm in Li et al. (2010) and the Linear Thompson Sampling algorithm (Agrawal & Goyal (2013)), as mentioned in Lattimore & Szepesvári (2020, Chapter 22).
On the other hand, the Optimism in the Face of Uncertainty Linear (OFUL) bandit algorithm of Abbasi-Yadkori et al. (2011) achieves a regret upper bound of $\widetilde{O}(d\sqrt{T})$ through an improved analysis of the confidence bound using a martingale technique. However, it involves a bilinear optimization problem over the action set and the confidence ellipsoid when choosing the arm at each time step. This is computationally expensive, unless the confidence ellipsoid is a convex hull of a finite set.
| Algorithm | Problem-independent regret bound | Regret bound independent of $K$? | Principle that the algorithm is based on |
|---|---|---|---|
| OFUL (Abbasi-Yadkori et al. (2011)) | $\widetilde{O}(d\sqrt{T})$ | ✓ | Optimism |
| LinUCB (Li et al. (2010)) | Hard to analyze | Unknown | Optimism |
| LinTS (Agrawal & Goyal (2013)) | $\widetilde{O}(d^{3/2}\sqrt{T})$ | ✓ | Posterior sampling |
| SupLinUCB (Chu et al. (2011)) | $\widetilde{O}(\sqrt{dT})$ | ✗ | Optimism |
| LinUCB with OFUL’s confidence bound | $\widetilde{O}(d\sqrt{T})$ | ✓ | Optimism |
| Asymptotically Optimal IDS (Kirschner et al. (2021)) | $\widetilde{O}(d\sqrt{T})$ | ✓ | Information directed sampling |
| LinIMED-3 (this paper) | $\widetilde{O}(d\sqrt{T})$ | ✓ | Min. emp. divergence |
| SupLinIMED (this paper) | $\widetilde{O}(\sqrt{dT})$ | ✗ | Min. emp. divergence |

Table 1: Comparison of LinIMED and SupLinIMED with other linear bandit algorithms.
For randomized algorithms designed for the linear bandit problem, Agrawal & Goyal (2013) proposed the LinTS algorithm, which is in the spirit of Thompson Sampling (Thompson (1933)) and uses a confidence ellipsoid similar to that of LinUCB-like algorithms. This algorithm performs efficiently and achieves a regret upper bound of $\widetilde{O}\big(\min\{d^{3/2}\sqrt{T},\, d\sqrt{T\log K}\}\big)$, where $K$ is the number of arms at each time step, i.e., $|\mathcal{A}_t| = K$ for all $t$. Compared to LinUCB with OFUL’s confidence width, it has an extra multiplicative factor of $\min\{\sqrt{d}, \sqrt{\log K}\}$ in the minimax regret upper bound.
Recently, MED-like (minimum empirical divergence) algorithms have come to the fore since these randomized algorithms have the property that the probability of selecting each arm is available in closed form, which benefits downstream tasks such as offline evaluation with the inverse propensity score. Both MED in the sub-Gaussian environment and its deterministic version IMED have demonstrated superior performance over Thompson Sampling (Bian & Jun (2021), Honda & Takemura (2015)). Baudry et al. (2023) also shows that MED has a close relation to Thompson Sampling. In particular, it is argued that MED and TS can be interpreted as two variants of the same exploration strategy. Bian & Jun (2021) also shows that the probability of selecting each arm under MED in the sub-Gaussian case can be viewed as a closed-form approximation of the corresponding probability under Thompson Sampling. We take inspiration from the extension of Thompson Sampling to linear bandits and are thus motivated to extend MED-like algorithms to the linear bandit setting and prove regret bounds that are competitive vis-à-vis the state-of-the-art bounds.
Thus, this paper aims to answer the question of whether it is possible to devise an extension of the IMED algorithm for the linear bandit problem in the varying arm set setting (for both infinite and finite arm sets) with a regret upper bound of $\widetilde{O}(d\sqrt{T})$, which matches that of LinUCB with OFUL’s confidence bound, while being as efficient as LinUCB. The proposed family of algorithms, called LinIMED as well as SupLinIMED, can be viewed as generalizations of the IMED algorithm (Honda & Takemura (2015)) to the linear bandit setting. We prove that LinIMED and its variants achieve a regret upper bound of $\widetilde{O}(d\sqrt{T})$ and that they perform efficiently, no worse than LinUCB. SupLinIMED has a regret bound of $\widetilde{O}(\sqrt{dT})$, but works only for instances with finite arm sets. In our empirical study, we found that the different variants of LinIMED perform better than LinUCB and LinTS for various synthetic and real-world instances under consideration.
Compared to OFUL, LinIMED works more efficiently. Compared to SupLinUCB, our LinIMED algorithm is significantly simpler, and compared to LinUCB with OFUL’s confidence bound, our empirical performance is better. This is because in our algorithm, the exploitation term and the exploration term are decoupled, which leads to finer control when tuning the hyperparameters in the empirical study.
Compared to LinTS, our algorithm’s (specifically LinIMED-3’s) regret bound is superior by a factor of $\sqrt{d}$. Since the fixed arm setting is a special case of the finite varying arm setting, our result is more general than those of other fixed-arm linear bandit algorithms such as Spectral Eliminator (Valko et al. (2014)) and PEGOE (Lattimore & Szepesvári (2020, Chapter 22)). Finally, we observe that since the index used in LinIMED has a similar form to the index used in the Information Directed Sampling (IDS) procedure of Kirschner et al. (2021) (which is known to be asymptotically optimal but more difficult to compute), LinIMED performs significantly better on the “End of Optimism” example of Lattimore & Szepesvari (2017). We summarize the comparisons of LinIMED to other linear bandit algorithms in Table 1. We discuss comparisons to other linear bandit algorithms in Sections 3.2 and 3.3 and Appendix B.
2 Problem Statement
Notations:
For any $d$-dimensional vector $x \in \mathbb{R}^d$ and any positive definite matrix $V \in \mathbb{R}^{d\times d}$, we use $\|x\|_V$ to denote the Mahalanobis norm $\sqrt{x^\top V x}$. We use $a \wedge b$ (resp. $a \vee b$) to represent the minimum (resp. maximum) of two real numbers $a$ and $b$.
The Stochastic Linear Bandit Model:
In the stochastic linear bandit model, the learner chooses an arm $A_t$ at each round $t$ from the arm set $\mathcal{A}_t$, where we allow the cardinality of each arm set $\mathcal{A}_t$ to be potentially infinite. Each arm $a \in \mathcal{A}_t$ at time $t$ has a corresponding context (arm vector) $x_{a,t} \in \mathbb{R}^d$, which is known to the learner. After choosing arm $A_t$, the environment reveals the reward
$$ Y_t = \langle \theta^*, X_t \rangle + \eta_t \qquad\qquad (1) $$
to the learner, where $X_t := x_{A_t,t}$ is the context of the chosen arm $A_t$, $\theta^* \in \mathbb{R}^d$ is an unknown coefficient of the linear model, and $\eta_t$ is an $R$-sub-Gaussian noise conditioned on the past choices and rewards such that for any $\lambda \in \mathbb{R}$, almost surely, $\mathbb{E}\big[\exp(\lambda \eta_t) \,\big|\, A_{1:t}, \eta_{1:t-1}\big] \le \exp\big(\lambda^2 R^2 / 2\big)$.
Denote $a_t^* := \arg\max_{a \in \mathcal{A}_t} \langle \theta^*, x_{a,t} \rangle$ as the arm with the largest expected reward at time $t$. The goal is to minimize the expected cumulative regret over the horizon $T$. The (expected) cumulative regret is defined as $R_T := \mathbb{E}\big[\sum_{t=1}^{T} \langle \theta^*, x_{a_t^*,t} - X_t \rangle\big]$.
Assumption 1.
For each time $t$, we assume that $\|\theta^*\|_2 \le S$ and $\|x_{a,t}\|_2 \le L$ for some fixed $S, L > 0$. We also assume that the mean reward $\langle \theta^*, x_{a,t} \rangle$ is uniformly bounded for each arm $a$ and time $t$.
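To make the interaction protocol above concrete, the following minimal Python sketch simulates the model with Gaussian noise (one particular $R$-sub-Gaussian distribution). The dimension, the arm-generation scheme, and the use of unit-norm contexts are illustrative assumptions for this sketch only; they are not the settings used in our experiments.

```python
import numpy as np

class LinearBanditEnv:
    """Stochastic linear bandit with a time-varying arm set and Gaussian noise (a sketch)."""

    def __init__(self, theta_star, noise_std=1.0, seed=0):
        self.theta = np.asarray(theta_star, dtype=float)  # unknown parameter theta* (known only to the environment)
        self.noise_std = noise_std                         # Gaussian noise is noise_std-sub-Gaussian
        self.rng = np.random.default_rng(seed)

    def arm_set(self, num_arms=10):
        # Illustrative arm-generation scheme: unit-norm contexts drawn afresh every round,
        # so the arm set varies with time and satisfies ||x_{a,t}||_2 <= L with L = 1.
        X = self.rng.normal(size=(num_arms, self.theta.shape[0]))
        return X / np.linalg.norm(X, axis=1, keepdims=True)

    def pull(self, x):
        # Reward Y_t = <theta*, x> + eta_t with eta_t ~ N(0, noise_std^2).
        return float(x @ self.theta) + self.noise_std * self.rng.normal()

    def instant_regret(self, X, chosen_idx):
        # Instantaneous (pseudo-)regret <theta*, x*_t> - <theta*, X_t> used to track R_T.
        means = X @ self.theta
        return float(means.max() - means[chosen_idx])
```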
3 Description of LinIMED Algorithms
In the pseudocode of Algorithm 1, for each time step $t$, in Line 4, we use the improved confidence bound of $\theta^*$ as in Abbasi-Yadkori et al. (2011) to calculate the width of the confidence bound. After that, for each arm $a \in \mathcal{A}_t$, in Lines 6 and 7, the empirical gap between the highest empirical reward and the empirical reward of arm $a$ is estimated as
$$ \hat{\Delta}_{a,t} := \max_{b \in \mathcal{A}_t} \langle \hat{\theta}_t, x_{b,t} \rangle - \langle \hat{\theta}_t, x_{a,t} \rangle \qquad\qquad (2) $$
Then, in Lines 9 to 11, with the use of the confidence width, we can compute the index for the empirically best arm (for LinIMED-1 and LinIMED-2) or for the highest-UCB arm (for LinIMED-3). The different versions of LinIMED encourage different amounts of exploitation. For the other arms, the index is defined and computed as in Line 13.
Then, with all the indices of the arms calculated, in Line 16, we choose the arm $A_t$ with the minimum index (where ties are broken arbitrarily) and the agent receives its reward. Finally, in Line 18, we use ridge regression to estimate the unknown $\theta^*$ as $\hat{\theta}_t$ and update the regularized Gram matrix and the vector of context-weighted rewards. After that, the algorithm iterates to the next time step until the time horizon $T$. From the pseudocode, we observe that the only differences between the three algorithms are the way that the squared gap, which plays the role of the empirical divergence, is estimated and the index assigned to the empirically best arm. The latter point implies that we encourage the empirically best arm to be selected more often in LinIMED-2 and LinIMED-3 compared to LinIMED-1; in other words, we encourage more exploitation in LinIMED-2 and LinIMED-3. Similar to the core spirit of the IMED algorithm of Honda & Takemura (2015), the first term of our index for the LinIMED-1 algorithm, namely the squared empirical gap divided by the variance proxy of the arm, controls the exploitation, while the second term controls the exploration in our algorithm.
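For concreteness, the following Python sketch summarizes the structure of one round of a LinIMED-1-style rule: ridge regression, empirical gaps, an IMED-style index, and a rank-one update. The exact confidence width and the constants in the index are elided above, so the index form used here, $\hat{\Delta}_a^2 / (\beta \|x_a\|^2_{V^{-1}}) + \ln\big(1/(\beta \|x_a\|^2_{V^{-1}})\big)$ with a user-supplied $\beta$, should be read as an assumption illustrating the structure rather than as the exact algorithm.

```python
import numpy as np

def linimed_step(X, V, b, beta):
    """One round of a LinIMED-1-style selection rule (a structural sketch).

    X    : (K, d) array of the contexts of the arms available this round
    V, b : running ridge-regression statistics (V = lambda*I + sum x x^T, b = sum y * x)
    beta : confidence-width parameter (assumed given; its exact value is elided here)
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                                      # ridge estimate of theta*
    means = X @ theta_hat                                      # empirical rewards <theta_hat, x_a>
    gaps = means.max() - means                                 # empirical gaps Delta_hat_a
    var_proxy = beta * np.einsum('ij,jk,ik->i', X, V_inv, X)   # beta * ||x_a||^2_{V^{-1}}

    # IMED-style index: squared gap over the variance proxy plus a log exploration term.
    index = gaps ** 2 / var_proxy + np.log(1.0 / var_proxy)
    # The empirically best arm keeps only the exploration term here; LinIMED-2/3
    # use a different (more exploitative / UCB-anchored) index for this arm.
    best = int(np.argmax(means))
    index[best] = np.log(1.0 / var_proxy[best])

    return int(np.argmin(index))                               # pull the arm with the smallest index

def ridge_update(V, b, x, y):
    """Rank-one update of the ridge-regression statistics after observing reward y for context x."""
    return V + np.outer(x, x), b + y * x
```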
3.1 Description of the SupLinIMED Algorithm
Now we consider the case in which the arm set at each time $t$ is finite but still time-varying. In particular, the arm sets are of constant finite size $K$, i.e., $|\mathcal{A}_t| = K < \infty$. In the pseudocode of Algorithm 2, we apply the SupLinUCB framework (Chu et al., 2011), leveraging Algorithm 3 (in Appendix A) as a subroutine within each phase. This ensures the independence of the choice of the arm from past observations of rewards, thereby yielding a concentration inequality for the estimated reward (see Lemma 1 in Chu et al. (2011)) that places it within close proximity of the unknown expected reward in the finite arm setting. As a result, the regret improves by a factor of $\sqrt{d}$, ignoring logarithmic factors. At each time step $t$ and phase $s$, in Line 5, we utilize the BaseLinUCB algorithm as a subroutine to calculate the sample mean and confidence width, since we also need these terms to calculate the IMED-style indices of each arm. In Lines 6–9 (Case 1), if the width of each arm is less than $1/\sqrt{T}$, we choose the arm with the smallest IMED-style index. In Lines 10–12 (Case 2), the framework is the same as in SupLinUCB (Chu et al. (2011)): if the width of each arm is smaller than $2^{-s}$ but there exist arms with widths larger than $1/\sqrt{T}$, then in Line 11 the “unpromising” arms are eliminated until the width of each remaining arm is small enough to satisfy the condition in Line 6. Otherwise, if there exist arms with widths larger than $2^{-s}$, in Lines 14–15 (Case 3), we choose one such arm and record its context and reward in the next layer.
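The control flow of these three cases can be sketched as follows. The `base_estimates` callable stands in for the BaseLinUCB subroutine of Appendix A, and the thresholds $1/\sqrt{T}$ and $2^{-s}$ together with the elimination margin $2^{1-s}$ follow the standard SupLinUCB schedule; they are assumptions here insofar as the exact constants are elided above.

```python
import numpy as np

def suplinimed_round(arms, base_estimates, imed_index, T):
    """One round of the SupLinUCB-style phase loop with an IMED-style choice in Case 1 (a sketch).

    arms           : list of candidate arm identifiers (shrinks across phases)
    base_estimates : base_estimates(s, arms) -> (means, widths) computed from the phase-s data only
    imed_index     : imed_index(means, widths) -> array of IMED-style indices for `arms`
    Returns (chosen_arm, phase), where phase is None in Case 1 (no phase records the observation)
    and, in Case 3, the phase whose data set should record the new observation.
    """
    s = 1
    while True:
        means, widths = base_estimates(s, arms)
        if np.all(widths <= 1.0 / np.sqrt(T)):
            # Case 1: every arm is accurately estimated -- pick the smallest IMED-style index.
            return arms[int(np.argmin(imed_index(means, widths)))], None
        if np.all(widths <= 2.0 ** (-s)):
            # Case 2: eliminate "unpromising" arms and move to the next phase.
            keep = means >= means.max() - 2.0 ** (1 - s)
            arms = [a for a, k in zip(arms, keep) if k]
            s += 1
            continue
        # Case 3: some arm is still poorly estimated at this phase -- pull it and record
        # its context and reward in the data set of the indicated layer.
        return arms[int(np.argmax(widths))], s
```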
3.2 Relation to the IMED algorithm of Honda & Takemura (2015)
The IMED algorithm is a deterministic algorithm for the $K$-armed bandit problem. At each time step $t$, it chooses the arm with the minimum index, i.e.,
$$ A_t = \arg\min_{a \in [K]} \Big\{ N_a(t)\, D_{\inf}\big(\hat{F}_a(t), \hat{\mu}^*(t)\big) + \ln N_a(t) \Big\} \qquad\qquad (3) $$
where $N_a(t)$ is the total number of pulls of arm $a$ until time $t$ and $D_{\inf}\big(\hat{F}_a(t), \hat{\mu}^*(t)\big)$ is a divergence measure between the empirical distribution $\hat{F}_a(t)$ of arm $a$ and the highest sample mean $\hat{\mu}^*(t)$. More precisely, $D_{\inf}(F, \mu)$ is the minimum KL-divergence $D_{\mathrm{KL}}(F \,\|\, G)$ over all distributions $G$ in the admissible family whose mean is at least $\mu$. As shown in Honda & Takemura (2015), its asymptotic regret bound is even better than that of the KL-UCB algorithm (Garivier & Cappé (2011)), and it can be extended to semi-bounded support models. Also, this algorithm empirically outperforms the Thompson Sampling algorithm, as shown in Honda & Takemura (2015). However, an extension of the IMED algorithm to linear bandits with a minimax regret bound of $\widetilde{O}(d\sqrt{T})$ has not been derived. In our design of the LinIMED algorithms, we replace the optimized KL-divergence measure in IMED in Eqn. (3) with the squared gap between the sample mean of the arm and the arm with the maximum sample mean. This choice simplifies our analysis and does not adversely affect the regret bound. On the other hand, we view the term $1/N_a(t)$ as (being proportional to) the variance of the sample mean of arm $a$ at time $t$; in this spirit, we use $\|x_{a,t}\|^2_{V_{t-1}^{-1}}$ (where $V_{t-1}$ is the regularized Gram matrix) as a proxy for the variance of the sample mean $\langle \hat{\theta}_t, x_{a,t} \rangle$ of arm $a$ at time $t$. We choose the squared gap instead of the KL-divergence approximation for the index since, in the classical linear bandit setting, the noise is sub-Gaussian, and it is known that the KL-divergence between two Gaussian random variables with the same variance has a closed-form expression proportional to the squared difference of their means.
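For concreteness, the following sketch instantiates the IMED index of Eqn. (3) in the (sub-)Gaussian case, where the divergence is replaced by the squared-gap surrogate $(\hat{\mu}^* - \hat{\mu}_a)^2 / (2\sigma^2)$ discussed above; the known noise scale $\sigma$ is an assumption of this sketch.

```python
import numpy as np

def imed_choose(counts, means, sigma=1.0):
    """IMED-style arm selection for the K-armed bandit with a Gaussian surrogate divergence.

    counts : number of pulls N_a(t) of each arm (assumed all >= 1)
    means  : empirical means mu_hat_a(t) of each arm
    """
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    # Squared-gap surrogate for D_inf in the Gaussian case: (mu_hat_star - mu_hat_a)^2 / (2 sigma^2).
    d_inf = (means.max() - means) ** 2 / (2.0 * sigma ** 2)
    index = counts * d_inf + np.log(counts)   # N_a * D_inf + log N_a, as in Eqn. (3)
    return int(np.argmin(index))              # pull the arm with the minimum index
```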
3.3 Relation to Information Directed Sampling (IDS) for Linear Bandits
Information Directed Sampling (IDS), introduced by Russo & Van Roy (2014), serves as a good principle for regret minimization in linear bandits to achieve asymptotic optimality. The intuition behind IDS is to balance the information gain on the best arm against the expected reward at each time step. This goal is realized by optimizing the distribution $\pi$ of selecting each arm $a \in \mathcal{A}$ (where $\mathcal{A}$ is the fixed finite arm set) so as to minimize the information ratio:
$$ \pi_t := \arg\min_{\pi \in \mathcal{D}(\mathcal{A})} \; \frac{\big(\sum_{a \in \mathcal{A}} \pi(a)\, \hat{\Delta}_a(t)\big)^2}{\sum_{a \in \mathcal{A}} \pi(a)\, I_t(a)} \qquad\qquad (4) $$
where $\hat{\Delta}_a(t)$ is the empirical gap and $I_t(a)$ is the so-called information gain (defined later). Kirschner & Krause (2018), Kirschner et al. (2020), and Kirschner et al. (2021) apply the IDS principle to the linear bandit setting. The first two works propose both randomized and deterministic versions of IDS for linear bandits. They showed a near-optimal minimax regret bound of the order of $\widetilde{O}(d\sqrt{T})$. Kirschner et al. (2021) designed an asymptotically optimal linear bandit algorithm while retaining near-optimal minimax regret properties. Comparing these algorithms with our LinIMED algorithms, we observe that the first term of the index of non-greedy actions in our algorithms is similar to the information ratio in IDS with the estimated gap as defined in Algorithm 1. As mentioned in Kirschner & Krause (2018), when the noise is 1-sub-Gaussian, the information gain in the deterministic IDS algorithm is approximately $\log\big(1 + \|x_{a,t}\|^2_{V_{t-1}^{-1}}\big)$, which is similar to our choice $\|x_{a,t}\|^2_{V_{t-1}^{-1}}$. However, our LinIMED algorithms are different from the deterministic IDS algorithm in Kirschner & Krause (2018) since the estimated gap defined in our algorithm is different from that in deterministic IDS. Furthermore, as discussed in Kirschner et al. (2020), when the noise is 1-sub-Gaussian, the action chosen by UCB minimizes the deterministic information ratio. However, this is not the case for our algorithm since we have the second term in LinIMED-1 which balances information and optimism. Compared to IDS in Kirschner et al. (2021), their algorithm is a randomized version of the deterministic IDS algorithm, which is more computationally expensive than our algorithm since our LinIMED algorithms are fully deterministic (the support of the allocation in Kirschner et al. (2021) has size two). Their version of IDS also defines a more complicated version of the information gain to achieve asymptotic optimality. Finally, to the best of our knowledge, all these IDS algorithms are designed for linear bandits under the setting that the arm set is fixed and finite, while in our setting we assume the arm set is finite and can change over time. We discuss comparisons to other related work in Appendix B.
4 Theorem Statements
Theorem 1.
Under Assumption 1, the assumption that for all and , and the assumption that , the regret of the LinIMED-1 algorithm is upper bounded as follows:
Theorem 2.
Under Assumption 1, and the assumption that , the regret of the LinIMED-2 algorithm is upper bounded as follows:
Theorem 3.
Theorem 4.
Under Assumption 1 and the assumption that , the regret of the SupLinIMED algorithm (which is applicable to linear bandit problems with $K$ arms) is upper bounded as follows:
The upper bounds on the regret of LinIMED and its variants are all of the form $\widetilde{O}(d\sqrt{T})$, which, ignoring logarithmic terms, is the same as that of the OFUL algorithm (Abbasi-Yadkori et al. (2011)). Compared to LinTS, they have an advantage of a factor of $\sqrt{d}$. Also, these upper bounds do not depend on the number of arms $K$, which means they can be applied to linear bandit problems with a large arm set (including infinite arm sets). One observes that LinIMED-2 and LinIMED-3 do not require the additional assumption of Theorem 1 (which must hold for all arms and times) to achieve the stated regret bound. It is difficult to prove the regret bound for the LinIMED-1 algorithm without this assumption since, in our proof, we need to use this property at every time step to bound one of the terms. On the other hand, LinIMED-2 and LinIMED-3 encourage more exploitation in terms of the index of the empirically best arm at each time without adversely influencing the regret bound; this will accelerate learning on well-preprocessed datasets. The regret bound of LinIMED-3, in fact, matches that of LinUCB with OFUL’s confidence bound. In the proofs, we extensively use a technique known as the “peeling device” (Lattimore & Szepesvári, 2020, Chapter 9). This analytical technique, commonly used in the theory of bandit algorithms, involves partitioning the range of some random variable into several pieces and then decomposing the expectation (or probability) of interest over these pieces, so that the more refined range of the random variable on each piece can be used to derive the desired bounds.
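For illustration (in our notation here, not the paper's), the peeling device applied to a random variable $X$ with $0 \le X \le \bar a$ decomposes its expectation over geometric slices, so that a tail bound on each slice suffices:

$$ \mathbb{E}[X] \;=\; \sum_{l \ge 0} \mathbb{E}\Big[ X \,\mathbb{1}\Big\{ \tfrac{\bar a}{2^{\,l+1}} < X \le \tfrac{\bar a}{2^{\,l}} \Big\} \Big] \;\le\; \sum_{l \ge 0} \frac{\bar a}{2^{\,l}}\, \mathbb{P}\Big( X > \frac{\bar a}{2^{\,l+1}} \Big). $$

In our proofs, the quantity being peeled is typically $\|X_t\|^2_{V_{t-1}^{-1}}$, and Corollary 1 below supplies the required bound on the number of rounds in which it exceeds each threshold.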
Finally, Theorem 4 says that when the arm set is finite, we can use the framework of SupLinUCB (Chu et al., 2011) with our LinIMED index to achieve a regret bound of the order of $\widetilde{O}(\sqrt{dT})$, which is better than the regret bounds yielded by the family of LinIMED algorithms (ignoring logarithmic terms). The proof is provided in Appendix F.
5 Proof Sketch of Theorem 1
We choose to present the proof sketch of Theorem 1 since it contains the main ingredients for all the theorems in the preceding section. Before presenting the proof, we introduce the following lemma and corollary.
Lemma 1.
This lemma states that the true parameter $\theta^*$ lies, with high probability, in an ellipsoid centered at the ridge-regression estimate $\hat{\theta}_t$; it also specifies the width of the confidence bound.
The second is a corollary of the elliptical potential count lemma in Abbasi-Yadkori et al. (2011):
Corollary 1.
(Corollary of Lattimore & Szepesvári (2020, Exercise 19.3)) Assume that and for , for any constant , the following holds:
(6) |
We remark that this corollary is slightly stronger than the classical elliptical potential lemma since it provides an upper bound on the number of times the quantity $\|X_t\|^2_{V_{t-1}^{-1}}$ exceeds a given threshold. Equipped with this corollary, we can perform the peeling device on this quantity in our proof of the regret bound, which is a novel technique to the best of our knowledge.
Proof.
First, we define $a_t^*$ as the best arm at time step $t$, i.e., $a_t^* := \arg\max_{a \in \mathcal{A}_t} \langle \theta^*, x_{a,t} \rangle$, and we use $x_t^* := x_{a_t^*,t}$ to denote its corresponding context. Let $\Delta_t := \langle \theta^*, x_t^* - X_t \rangle$ denote the instantaneous regret at time $t$. Define the following events:
where the two free parameters appearing in these events are set to the values specified in Eqn. (11) below for this proof sketch.
Then the expected regret can be partitioned according to these events as follows:
(7) |
For , from the event and the fact that (here is where we use that for all and ), we obtain . For convenience, define as the empirically best arm at time step , where ties are broken arbitrarily, then use to denote the corresponding context of the arm . Therefore from the Cauchy–Schwarz inequality, we have . This implies that
(8) |
On the other hand, we claim that can be upper bounded as . This can be seen from the fact that . Since the event holds, we know the first term is upper bounded by , and since the largest eigenvalue of the matrix is upper bounded by and , the second term is upper bounded by . Hence, is upper bounded by . Then one can substitute this bound back into Eqn. (8), and this yields
(9) |
Furthermore, by our design of the algorithm, the index of is not larger than the index of the arm with the largest empirical reward at time . Hence,
(10) |
In the following, we set as well as another free parameter as follows:
(11) |
If , by using Corollary 1 with the choice in Eqn. (11), the upper bound of in this case is . Otherwise, using the event and the bound in Eqn. (9), we deduce that for all sufficiently large, we have . Therefore by using Corollary 1 and the “peeling device” (Lattimore & Szepesvári, 2020, Chapter 9) on such that for where and is chosen as in Eqn. (11). Now consider,
(12) | ||||
(13) | ||||
(14) | ||||
(15) | ||||
(16) | ||||
(17) |
where in Inequality (15) we used Corollary 1. Substituting the choices of and in Eqn. (11) into Eqn. (17) yields the upper bound on of the order . Hence . Other details are fleshed out in Appendix C.2.
For , since and together imply that , then using the choices of and , we have . Substituting this into the event and using the Cauchy–Schwarz inequality, we have
(18) |
Again applying the “peeling device” on and Corollary 1, we can upper bound as follows:
(19) |
Then with the choice of and as stated in Eqn. (11), the upper bound of the is also of order . More details of the calculation leading to Eqn. (19) are in Appendix C.3.
For , this is the case when the best arm at time does not perform sufficiently well so that the empirically largest reward at time is far from the highest expected reward. One observes that minimizing results in a tradeoff with respect to . On the event , we can again apply the “peeling device” on such that where . Then using the fact that , we have
(20) |
On the other hand, using the event and the Cauchy–Schwarz inequality, it holds that
(21) |
If , the regret in this case is bounded by . Otherwise, combining Eqn. (20) and Eqn. (21) implies that
(22) |
6 Empirical Studies
This section aims to justify the utility of the family of LinIMED algorithms we developed and to demonstrate their effectiveness through quantitative evaluations in simulated environments and on real-world datasets such as the MovieLens dataset. We compare our LinIMED algorithms with LinTS and LinUCB under a common parameter choice. The confidence parameter is set separately for the synthetic dataset (with a varying and finite arm set) and for the MovieLens dataset. The confidence widths for each algorithm are multiplied by a scaling factor, and we tune this factor by searching over a grid and reporting the best performance for each algorithm; see Appendix G. Both settings are of the order suggested by our proof sketch in Eqn. (11). We fix the additional parameter in LinIMED-3 throughout. The sub-Gaussian noise level is fixed throughout. We choose LinUCB and LinTS as competing algorithms since they are paradigmatic examples of deterministic and randomized contextual linear bandit algorithms, respectively. We also include IDS in our comparisons for the fixed and finite arm set settings. Finally, we show the performances of the SupLinUCB and SupLinIMED algorithms only in Figs. 1 and 2, since it is well known that they suffer a substantial performance degradation compared to established methodologies like LinUCB or LinTS (as mentioned in Lattimore & Szepesvári (2020, Chapter 22) and also seen in Figs. 1 and 2).
6.1 Experiments on a Synthetic Dataset in the Varying Arm Set Setting
We perform an empirical study in a varying arm set setting. We evaluate the performance with different dimensions $d$ and different numbers of arms $K$. We set the unknown parameter vector and the best context vector to fixed values. There are a number of suboptimal arm vectors, which are all the same (i.e., repeated) and share a common context corrupted by i.i.d. noise. Finally, there is also one “worst” arm vector; an illustrative sketch of this construction is given after this paragraph. First, we fix the dimension $d$. The results for different numbers of arms $K$ are shown in Fig. 1. Note that each experiment is repeated multiple times to obtain the mean and standard deviation of the regret. From Fig. 1, we observe that LinIMED-1 and LinIMED-2 are comparable to LinUCB and LinTS, while LinIMED-3 outperforms LinTS and LinUCB regardless of the number of arms $K$. Second, we fix $K$ and vary the dimension $d$. Each trial is again repeated multiple times, and the regret over time is shown in Fig. 2. Again, we see that LinIMED-1 and LinIMED-2 are comparable to LinUCB and LinTS, while LinIMED-3 clearly performs better than LinUCB and LinTS.
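The following sketch illustrates this construction; the specific values of the unknown parameter, the suboptimal direction, the noise scale, and the “worst” context are illustrative placeholders, since the exact values used in our experiments are elided above.

```python
import numpy as np

def make_synthetic_instance(d, num_suboptimal, noise=0.05, seed=0):
    """Build a synthetic varying-arm instance of the shape described above (illustrative values only)."""
    rng = np.random.default_rng(seed)
    theta_star = np.zeros(d)
    theta_star[0] = 1.0                      # assumed unknown parameter (placeholder)
    best = theta_star.copy()                 # best context, aligned with theta*
    subopt_base = np.zeros(d)
    subopt_base[1] = 1.0                     # a suboptimal direction (placeholder)
    eps = noise * rng.normal(size=d)         # i.i.d. noise added to the shared suboptimal context
    suboptimal = np.tile(subopt_base + eps, (num_suboptimal, 1))  # repeated identical contexts
    worst = -theta_star                      # one "worst" arm (placeholder)
    X = np.vstack([best, suboptimal, worst])
    return theta_star, X
```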
[Figure 1: Cumulative regret on the synthetic dataset for different numbers of arms $K$ (with $d$ fixed).]
[Figure 2: Cumulative regret on the synthetic dataset for different dimensions $d$ (with $K$ fixed).]
The experimental results on synthetic data demonstrate that the performances of LinIMED-1 and LinIMED-2 are largely similar but LinIMED-3 is slightly superior (corroborating our theoretical findings). More importantly, LinIMED-3 outperforms both the LinTS and LinUCB algorithms in a statistically significant manner, regardless of the number of arms or the dimension of the data.
6.2 Experiments on the “End of Optimism” instance
Algorithms based on the optimism principle, such as LinUCB and LinTS, have been shown to be not asymptotically optimal. A paradigmatic example is known as the “End of Optimism” instance (Lattimore & Szepesvari, 2017; Kirschner et al., 2021). In this two-dimensional instance with a fixed true parameter vector, there are three arms represented by their arm vectors, where the parameter $\varepsilon$ entering their definition is small. In this example, it is observed that even pulling a highly suboptimal arm (the second one) provides a lot of information about the best arm (the first one). We perform experiments with the same confidence parameter as in Section 6.1 (the noise level is as in Section 6.1 and the dimension is $d = 2$). We also include the asymptotically optimal IDS algorithm of Kirschner et al. (2021) with the parameter choice suggested therein. Each algorithm is run over several independent trials. The regrets of all competing algorithms are shown in Fig. 3 for different values of $\varepsilon$ and a fixed horizon $T$.
[Figure 3: Cumulative regret on the “End of Optimism” instance for different values of $\varepsilon$.]
From Fig. 3, we observe that the LinIMED algorithms perform much better than LinUCB and LinTS, and LinIMED-3 is comparable to IDS on this “End of Optimism” instance. In particular, LinIMED-3 performs significantly better than LinUCB and LinTS even when $\varepsilon$ is of a moderate value. We surmise that the reason behind the superior performance of our LinIMED algorithms on the “End of Optimism” instance is that the first term of our LinIMED index can be viewed as an approximate and simpler version of the information ratio that motivates the design of the IDS algorithm.
[Figure 4: Click-through rates (CTRs) on the MovieLens dataset.]
6.3 Experiments on the MovieLens Dataset
The MovieLens dataset (Cantador et al. (2011)) is a widely-used benchmark dataset for research in recommendation systems. We specifically choose to use the MovieLens 10M dataset, which contains 10 million ratings (from 0 to 5) and 100,000 tag applications applied to 10,000 movies by 72,000 users. To preprocess the dataset, we choose the best movies for consideration. At each time $t$, one random user visits the website and is recommended one of these best movies. We assume that the user clicks on the recommended movie if and only if the user’s rating of this movie is at least a fixed threshold. We implement the three versions of LinIMED, LinUCB, LinTS, and IDS on this dataset. Each trial is repeated over several runs, and the averages and standard deviations of the click-through rates (CTRs) as functions of time are reported in Fig. 4. One observes that the LinIMED variants significantly outperform LinUCB and LinTS in all settings when the time horizon is sufficiently large. LinIMED-1 and LinIMED-2 perform significantly better than IDS in some of the settings, while LinIMED-3 performs significantly better than IDS in other settings. Furthermore, by virtue of the fact that IDS is randomized, the variance of IDS is higher than that of LinIMED.
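The click-feedback simulation described above can be sketched as follows; the number of retained movies, the rating threshold, and the feature construction are placeholders since the exact values are elided above.

```python
import numpy as np

def movielens_round(ratings, user_features, movie_features, policy, rng, click_threshold=4.0):
    """One round of the click-feedback simulation on MovieLens (illustrative thresholds and features).

    ratings        : dict mapping (user, movie) -> rating in [0, 5]
    user_features  : dict mapping user -> feature vector
    movie_features : dict mapping movie -> feature vector (only the retained movies)
    policy         : policy(contexts) -> index of the recommended movie
    """
    users = list(user_features)
    user = users[int(rng.integers(len(users)))]              # a random user visits the website
    movies = list(movie_features)
    # Context of each candidate movie for this user (a simple concatenation here).
    contexts = np.array([np.concatenate([user_features[user], movie_features[m]]) for m in movies])
    rec = policy(contexts)                                   # recommend one movie
    # The user clicks iff their rating of the recommended movie reaches the threshold.
    click = float(ratings.get((user, movies[rec]), 0.0) >= click_threshold)
    return click
```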
7 Future Work
In the future, a fruitful direction of research is to further modify the LinIMED algorithm to make it also asymptotically optimal; we believe that in this case, the analysis would be more challenging, but the theoretical and empirical performances might be superior to our three LinIMED algorithms. In addition, one can generalize the family of IMED-style algorithms to generalized linear bandits or neural contextual bandits.
Acknowledgements
This work is supported by funding from a Ministry of Education Academic Research Fund (AcRF) Tier 2 grant under grant number A-8000423-00-00 and AcRF Tier 1 grants under grant numbers A-8000189-01-00 and A-8000980-00-00. This research is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-PhD-2023-08-044T-J), and is part of the programme DesCartes which is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme.
References
- Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.
- Agrawal & Goyal (2012) Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39.1–39.26. JMLR Workshop and Conference Proceedings, 2012.
- Agrawal & Goyal (2013) Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135. PMLR, 2013.
- Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
- Baudry et al. (2023) Dorian Baudry, Kazuya Suzuki, and Junya Honda. A general recipe for the analysis of randomized multi-armed bandit algorithms. arXiv preprint arXiv:2303.06058, 2023.
- Bian & Jun (2021) Jie Bian and Kwang-Sung Jun. Maillard sampling: Boltzmann exploration done optimally. arXiv preprint arXiv:2111.03290, 2021.
- Cantador et al. (2011) Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. Second workshop on information heterogeneity and fusion in recommender systems (HetRec2011). In Proceedings of the Fifth ACM Conference on Recommender Systems, pp. 387–388, 2011.
- Chu et al. (2011) Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214. JMLR Workshop and Conference Proceedings, 2011.
- Garivier & Cappé (2011) Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 359–376. JMLR Workshop and Conference Proceedings, 2011.
- Honda & Takemura (2015) Junya Honda and Akimichi Takemura. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. J. Mach. Learn. Res., 16:3721–3756, 2015.
- Kirschner & Krause (2018) Johannes Kirschner and Andreas Krause. Information directed sampling and bandits with heteroscedastic noise. In Conference On Learning Theory, pp. 358–384. PMLR, 2018.
- Kirschner et al. (2020) Johannes Kirschner, Tor Lattimore, and Andreas Krause. Information directed sampling for linear partial monitoring. In Conference on Learning Theory, pp. 2328–2369. PMLR, 2020.
- Kirschner et al. (2021) Johannes Kirschner, Tor Lattimore, Claire Vernade, and Csaba Szepesvári. Asymptotically optimal information-directed sampling. In Conference on Learning Theory, pp. 2777–2821. PMLR, 2021.
- Lattimore & Szepesvari (2017) Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pp. 728–737. PMLR, 2017.
- Lattimore & Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
- Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, 2010.
- Liu et al. (2024) Haolin Liu, Chen-Yu Wei, and Julian Zimmert. Bypassing the simulator: Near-optimal adversarial linear contextual bandits. Advances in Neural Information Processing Systems, 36, 2024.
- Russo & Van Roy (2014) Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. Advances in Neural Information Processing Systems, 27, 2014.
- Saber et al. (2021) Hassan Saber, Pierre Ménard, and Odalric-Ambrym Maillard. Indexed minimum empirical divergence for unimodal bandits. Advances in Neural Information Processing Systems, 34:7346–7356, 2021.
- Thompson (1933) William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933. ISSN 00063444. URL http://www.jstor.org/stable/2332286.
- Valko et al. (2014) Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pp. 46–54. PMLR, 2014.
Supplementary Materials for the TMLR submission
“Linear Indexed Minimum Empirical Divergence Algorithms”
Appendix A BaseLinUCB Algorithm
Here, we present the BaseLinUCB algorithm used as a subroutine in SupLinIMED (Algorithm 2).
Appendix B Comparison to other related work
Saber et al. (2021) adapts the IMED algorithm to unimodal bandits and achieves asymptotic optimality for one-dimensional exponential family distributions. In their algorithm IMED-UB, they narrow down the search region to the neighborhood of the empirically best arm and then implement the IMED algorithm for the $K$-armed bandit as in Honda & Takemura (2015). This design is inspired by the lower bound and only involves the neighboring arms of the best arm. The setting in which the algorithm of Saber et al. (2021) is applied is different from that of our proposed LinIMED algorithms, as we focus on linear bandits, not unimodal bandits.
Liu et al. (2024) propose an algorithm that achieves near-optimal regret for adversarial linear bandits with stochastic action sets in the absence of a simulator or prior knowledge of the distribution. Although their setting is different from ours, they also use a bonus term involving the lifted covariance matrix to encourage exploration. This is similar to our choice of the second term in the LinIMED-1 index.
Appendix C Proof of the regret bound for LinIMED-1 (Complete proof of Theorem 1)
Here and in the following, we drop the explicit dependence of the confidence width on the confidence parameter, which is taken to be the value specified in Eqn. (11).
C.1 Statement of Lemmas for LinIMED-1
We first state the following lemmas, which respectively provide upper bounds on the terms in the regret decomposition:
Lemma 2.
Lemma 3.
Lemma 4.
C.2 Proof of Lemma 2
Proof.
From the event and the fact that (here is where we use that for all and ), we obtain . For convenience, define as the empirically best arm at time step , where ties are broken arbitrarily, then use to denote the corresponding context of the arm . Therefore from the Cauchy–Schwarz inequality, we have . This implies that
(33) |
On the other hand, we claim that can be upper bounded as . This can be seen from the fact that . Since the event holds, we know the first term is upper bounded by , and since the maximum eigenvalue of the matrix is upper bounded by and , the second term is upper bounded by . Hence, is upper bounded by . Then one can substitute this bound back into Eqn. equation 8, and this yields
(34) |
Furthermore, by our design of the algorithm, the index of is not larger than the index of the arm with the largest empirical reward at time . Hence,
(35) |
If , by using Corollary 1 and the “peeling device” (Lattimore & Szepesvári, 2020, Chapter 9) on such that for where ,
(36) | ||||
(37) | ||||
(38) | ||||
(39) | ||||
(40) | ||||
(41) | ||||
(42) | ||||
(43) | ||||
(44) |
Then with the choice of as in Eqn. equation 11,
(45) | ||||
(46) | ||||
(47) |
Otherwise we have , then since . Substituting this into Eqn. equation 10, then using the event and the bound in equation 9, we deduce that for all sufficiently large, we have . Therefore by using Corollary 1 and the “peeling device” (Lattimore & Szepesvári, 2020, Chapter 9) on such that for where is a free parameter that we can choose. Consider,
(48) | |||
(49) | |||
(50) | |||
(51) | |||
(52) | |||
(53) | |||
(54) | |||
(55) | |||
(56) | |||
(57) |
This proves Eqn. equation 26. Then with the choice of the parameters as in Eqn. equation 11,
(58) | ||||
(59) | ||||
(60) |
Hence, we can upper bound as
(61) | ||||
(62) | ||||
(63) |
which concludes the proof. ∎
C.3 Proof of Lemma 3
Proof.
Since and together imply that , then using the choices of and , we have . Substituting this into the event and using the Cauchy–Schwarz inequality, we have
(64) |
Again applying the “peeling device” on and Corollary 1, we can upper bound as follows:
(65) | ||||
(66) | ||||
(67) | ||||
(68) | ||||
(69) | ||||
(70) | ||||
(71) | ||||
(72) |
This proves Eqn. equation 28. Hence with the choice of the parameter as in Eqn. equation 11,
(73) | ||||
(74) |
∎
C.4 Proof of Lemma 4
Proof.
For , this is the case when the best arm at time does not perform sufficiently well so that the empirically largest reward at time is far from the highest expected reward. One observes that minimizing results in a tradeoff with respect to . On the event , we can apply the “peeling device” on such that where . Then using the fact that , we have
(75) |
On the other hand, using the event and the Cauchy–Schwarz inequality, it holds that
(76) |
If , the regret in this case is bounded by (similar to the procedure to get from Eqn. equation 36 to Eqn. equation 47). Otherwise , then combining Eqn. equation 75 and Eqn. equation 76 implies that
(77) |
Notice here with , , it holds that for all ,
(78) |
Using Corollary 1, one can show that:
(79) | |||
(80) | |||
(81) | |||
(82) | |||
(83) | |||
(84) | |||
(85) | |||
(86) | |||
(87) | |||
(88) | |||
(89) | |||
(90) | |||
(91) | |||
(92) | |||
(93) |
Hence
(94) | ||||
(95) | ||||
(96) |
This proves Eqn. equation 30. With the choice of as in Eqn. equation 11,
(97) | ||||
(98) | ||||
(99) |
∎
C.5 Proof of Lemma 5
C.6 Proof of Theorem 1
Appendix D Proof of the regret bound for LinIMED-2 (Proof of Theorem 2)
We choose and as follows:
(108) |
D.1 Statement of Lemmas for LinIMED-2
We first state the following lemmas, which respectively provide upper bounds on the terms in the regret decomposition:
Lemma 6.
Under Assumption 1, and the assumption that , for the free parameter , the term for LinIMED-2 satisfies:
Lemma 7.
Under Assumption 1, and the assumption that , for the free parameter , the term for LinIMED-2 satisfies:
Lemma 8.
Under Assumption 1, and the assumption that , for the free parameter , the term for LinIMED-2 satisfies:
D.2 Proof of Lemma 6
Proof.
We first partition the analysis into the cases and as follows:
(109) | ||||
(110) |
Case 1: If , this means that the index of is . Using the fact that we have:
(111) | ||||
(112) | ||||
(113) |
Therefore
(114) |
If , using the same procedure to get from Eqn. equation 36 to Eqn. equation 47, one has:
(115) | |||
(116) | |||
(117) | |||
(118) |
Else if , this implies that . Then substituting the event into Eqn. equation 114, we obtain
(119) |
With we have , then one has
(120) |
Hence
(121) | |||
(122) |
With the choice of , when , , then performing the “peeling device” on yields
(123) | |||
(124) | |||
(125) | |||
(126) | |||
(127) | |||
(128) | |||
(129) |
Considering the event , we can upper bound the corresponding expectation as follows
(130) |
Then
(131) | |||
(132) | |||
(133) | |||
(134) | |||
(135) |
Hence
(136) | |||
(137) | |||
(138) | |||
(139) | |||
(140) |
Case 2: If , then from the event and the choice we have
(141) |
Furthermore, using the definition of the event , that implies that
(142) |
When , , then similarly, we can bound this term by
Summarizing the two cases,
(143) | ||||
(144) |
∎
D.3 Proof of Lemma 7
D.4 Proof of Lemma 8
Proof.
From the event , which is , the index of the best arm at time can be upper bounded as:
(150) |
Case 1: If , then we have
(151) |
Suppose for , then one has
(152) |
On the other hand, on the event ,
(153) |
If , using the same procedure from Eqn. equation 36 to Eqn. equation 47, one has:
(154) | |||
(155) | |||
(156) | |||
(157) |
Else if , this implies that . Then combining Eqn. equation 152 and Eqn. equation 153 implies that
(158) |
Then using the same procedure to get from Eqn. equation 78 to Eqn. equation 93, we have
(159) | |||
(160) |
Case 2: . If , using the same procedure to get from Eqn. equation 36 to Eqn. equation 47, one has:
(161) | |||
(162) | |||
(163) | |||
(164) |
Else implies that .
If , then using the same procedure to get from Eqn. equation 152 to Eqn. equation 160, we have
(165) | |||
(166) |
If , this means now the index of is , by performing the “peeling device” such that for , we have
(167) |
On the other hand, using the definition of the event ,
(168) |
On the other hand, from , we have . Hence,
(171) | |||
(172) | |||
(173) | |||
(174) | |||
(175) | |||
(176) | |||
(177) |
Summarizing the two cases ( and ), we see that is upper bounded by:
(178) | ||||
(179) | ||||
(180) | ||||
(181) |
∎
D.5 Proof of Lemma 9
Proof.
The proof of this case is straightforward by using Lemma 1 with the choice :
(182) | ||||
(183) | ||||
(184) | ||||
(185) | ||||
(186) | ||||
(187) | ||||
(188) | ||||
(189) | ||||
(190) |
∎
D.6 Proof of Theorem 2
Appendix E Proof of the regret bound for LinIMED-3 (Proof of Theorem 3)
First we define as the best arm in time step such that , and use denote its corresponding context. Define . Let denote the regret in time . Define the following events:
where is a free parameter set to be in this proof sketch.
Then the expected regret can be partitioned by events such that:
(199) |
For the case:
From we know , therefore
(200) |
From and , we have
(201) |
Combining Eqn. equation 200 and Eqn. equation 201,
(202) |
Then
(203) |
If , using the same procedure from Eqn. equation 36 to Eqn. equation 47, one has:
(204) | |||
(205) | |||
(206) | |||
(207) |
Else, this implies that ; plugging this into Eqn. (203) and with the choice of and , we have
(208) |
Since is a constant, then
(209) |
Using the same procedure from Eqn. equation 36 to Eqn. equation 47, one has:
(210) | |||
(211) | |||
(212) |
Hence
(213) |
For the case: Since the event holds,
(214) |
On the other hand, from we have
(215) |
Combining Eqn. equation 214 and Eqn. equation 215,
(216) |
Hence
(217) |
Then with and , we have
(218) |
therefore
(219) |
Using the same procedure from Eqn. equation 36 to Eqn. equation 47, one has:
(220) |
For the case:
using Lemma 1 with the choice :
(221) | ||||
(222) | ||||
(223) | ||||
(224) | ||||
(225) | ||||
(226) | ||||
(227) | ||||
(228) | ||||
(229) |
E.1 Proof of Theorem 3
Appendix F Proof of the regret bound for SupLinIMED (Proof of Theorem 4)
Define as the index of when the arm is chosen at time . For the SupLinIMED, the index of arms except the empirically best arm is defined by , whereas the index of the empirically best arm is defined by where . Define the index of the best arm at time as .
Remark 1.
Here the upper bound we set for the index of the empirically best arm is , which is slightly larger than our previous choice (Line 10 in the LinIMED algorithm), since in the first step of the SupLinIMED algorithm or, more generally, of SupLinUCB-type algorithms, the width of each arm is less than ; as a result, the index of each arm is larger than .
Let the set of time indices at which the chosen arm comes from Step 1 (Lines 6–9 in Algorithm 2) be denoted by . Then the cumulative expected regret of the SupLinIMED algorithm over the time horizon $T$ can be decomposed as follows:
(234) |
Since the index set has not changed in Step 1 (see Line 9 in Algorithm 2), the second term of the regret is the same as in the original SupLinUCB algorithm of Chu et al. (2011). For the first term, we partition it using the following events:
where as in the SupLinUCB (Chu et al., 2011). We choose throughout. Furthermore, is the obtained from Algorithm 3 with as the input, i.e.,
Define as the instantaneous regret at each time step . In addition, choose in the definition of . Then the first term of the expected regret in equation 234 can be partitioned by the events and as follows:
We recall that when , for all .
To bound , we note that since occurs, the actual best arm remains in the candidate set with high probability () by Chu et al. (2011, Lemma 5). As such,
where the last inequality is from the fact that and . Otherwise, if the best arm has in fact been eliminated, the corresponding regret in this case is bounded by:
(235) | ||||
(236) | ||||
(237) | ||||
(238) | ||||
(239) | ||||
(240) |
Case 1: If , this means that the index of is . Using the fact that we have
(241) |
Then using the definition of the event and the fact that , we have
Hence, . Therefore in this case is upper bounded as follows:
Case 2: If , then using the definition of the event , we have
therefore since event occurs,
Hence in this case is bounded as . Combining the above cases,
To bound , we note from the definition of that
then on the event ,
therefore
Hence
To bound , we use the proof of Chu et al. (2011, Lemma 1), which is restated as follows.
Lemma 10.
For any , , ,
where .
Then using the union bound, we have for all , , for all ,
With the choice and the assumption ,
Appendix G Hyperparameter tuning in our empirical study
G.1 Synthetic Dataset
The tables below report the empirical results obtained while tuning the hyperparameter (the scale of the confidence width) with the other parameters fixed.
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 0.55 | 0.6 | 0.2 | 0.25 | 0.3 | 0.15 | 0.2 | 0.25 | 0.2 | 0.25 | 0.3 | 0.15 | 0.2 | 0.25 | |
Regret | 7.780 | 6.695 | 6.856 | 9.769 | 9.201 | 12.068 | 24.086 | 5.482 | 6.108 | 4.999 | 4.998 | 7.329 | 25.588 | 2.075 | 2.760 |
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 0.55 | 0.6 | 0.1 | 0.15 | 0.2 | 0.2 | 0.25 | 0.3 | 0.2 | 0.25 | 0.3 | 0.2 | 0.25 | 0.3 | |
Regret | 7.203 | 6.832 | 7.423 | 54.221 | 7.042 | 7.352 | 6.707 | 6.053 | 8.458 | 6.254 | 4.918 | 7.013 | 4.407 | 2.562 | 3.041 |
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 0.55 | 0.6 | 0.1 | 0.15 | 0.2 | 0.15 | 0.2 | 0.25 | 0.2 | 0.25 | 0.3 | 0.15 | 0.2 | 0.25 | |
Regret | 7.919 | 5.679 | 7.063 | 69.955 | 6.925 | 7.037 | 24.393 | 5.625 | 6.335 | 6.335 | 4.831 | 7.040 | 41.355 | 1.936 | 2.250 |
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.45 | 0.5 | 0.55 | 0.1 | 0.15 | 0.2 | 0.1 | 0.15 | 0.2 | 0.1 | 0.15 | 0.2 | 0.1 | 0.15 | 0.2 | |
Regret | 9.164 | 9.094 | 14.183 | 14.252 | 9.886 | 14.680 | 19.663 | 6.463 | 10.643 | 15.685 | 5.399 | 8.373 | 8.024 | 2.062 | 3.342 |
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.25 | 0.3 | 0.35 | 0.1 | 0.15 | 0.2 | 0.05 | 0.1 | 0.15 | 0.1 | 0.15 | 0.2 | 0.05 | 0.1 | 0.15 | |
Regret | 7.923 | 7.085 | 10.981 | 14.983 | 9.565 | 19.300 | 58.278 | 6.165 | 9.225 | 8.916 | 8.575 | 13.483 | 142.704 | 2.816 | 3.497 |
We run these algorithms on the same dataset with different choices of the hyperparameter, and we choose the best value, i.e., the one with the least corresponding regret.
G.2 MovieLens Dataset
The tables below report the empirical results obtained while tuning the hyperparameter (the scale of the confidence width) with the other parameters fixed.
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | IDS | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.7 | 0.75 | 0.8 | 0.05 | 0.1 | 0.15 | 0.15 | 0.2 | 0.25 | 0.15 | 0.2 | 0.25 | 0.2 | 0.25 | 0.3 | 0.25 | 0.3 | 0.35 | |
CTR | 0.608 | 0.675 | 0.668 | 0.615 | 0.705 | 0.679 | 0.740 | 0.823 | 0.766 | 0.740 | 0.823 | 0.766 | 0.713 | 0.742 | 0.690 | 0.655 | 0.728 | 0.714 |
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | IDS | ||||||||||||
0.75 | 0.8 | 0.85 | 0 | 0.05 | 0.1 | 0.1 | 0.15 | 0.2 | 0.05 | 0.1 | 0.15 | 0.05 | 0.1 | 0.15 | 0.3 | 0.35 | 0.4 | |
CTR | 0.708 | 0.754 | 0.713 | 0.517 | 0.711 | 0.646 | 0.648 | 0.668 | 0.595 | 0.658 | 0.668 | 0.651 | 0.697 | 0.717 | 0.649 | 0.643 | 0.688 | 0.606 |
Method | LinUCB | LinTS | LinIMED-1 | LinIMED-2 | LinIMED-3 | IDS | ||||||||||||
0.85 | 0.9 | 0.95 | 0 | 0.05 | 0.1 | 0.05 | 0.1 | 0.15 | 0.05 | 0.1 | 0.15 | 0.05 | 0.1 | 0.15 | 0.3 | 0.35 | 0.4 | |
CTR | 0.721 | 0.754 | 0.745 | 0.487 | 0.674 | 0.588 | 0.682 | 0.729 | 0.594 | 0.687 | 0.729 | 0.594 | 0.689 | 0.705 | 0.594 | 0.684 | 0.739 | 0.695 |
We run these algorithms on the same dataset with different choices of the hyperparameter, and we choose the best value, i.e., the one with the largest corresponding reward (CTR).