Adversarial Bandits against Arbitrary Strategies
Abstract
We study the adversarial bandit problem against arbitrary strategies, in which $S$ is the parameter for the hardness of the problem and this parameter is not given to the agent. To handle this problem, we adopt the master-base framework using the online mirror descent method (OMD). We first provide a master-base algorithm with simple OMD, achieving $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$, in which the $T^{2/3}$ factor comes from the variance of the loss estimators. To mitigate the impact of this variance, we propose using adaptive learning rates for OMD and achieve $\tilde{O}\big(\min\{\mathbb{E}[\sqrt{SKT\rho_T}],\, S\sqrt{KT}\}\big)$, where $\rho_T$ is a variance term for the loss estimators.
1 Introduction
The bandit problem is a fundamental decision-making problem that deals with the exploration-exploitation trade-off. In this problem, an agent plays an action, an "arm", at each time step and receives loss or reward feedback for that choice. An arm might correspond to an item recommended to a user in a recommendation system. In practice, it is often necessary to account for user preferences that switch as time passes. This can be modeled by switching best arms.
In this paper, we focus on the adversarial bandit problem, where the losses of each arm at each time are arbitrarily determined. In such an environment, we allow the target strategy to be any sequence of arms instead of the single best arm in hindsight. Therefore, regret is measured by competing not with a single best arm but with any sequence of arms. We denote by $S$ the number of switches in the sequence of arms, which is referred to as the hardness Auer et al., (2002). Importantly, we target arbitrary strategies, so $S$ is not fixed in advance (in other words, the value of $S$ is not provided to the agent).
Competing with switching arms has been widely studied. In the expert setting with full information Cesa-Bianchi et al., (1997), there are several algorithms Daniely et al., (2015); Jun et al., (2017) that achieve near-optimal regret bounds for the $S$-switch regret (defined later) without information on the switch parameter $S$. However, in bandit problems, an agent cannot observe the full loss vector at each time, which makes the problem more challenging than the full-information setting. The stochastic bandit setting in which each arm's reward distribution switches over time, referred to as the non-stationary bandit problem, has been studied by Garivier and Moulines, (2008); Auer et al., (2019); Russac et al., (2019); Suk and Kpotufe, (2022). In particular, Auer et al., (2019); Suk and Kpotufe, (2022) achieved near-optimal regret without being given the switching parameter.
However, these methods cannot be applied to the adversarial setting, where losses may be determined arbitrarily. For the adversarial bandit setting, EXP3.S Auer et al., (2002) achieves $\tilde{O}(\sqrt{SKT})$ when $S$ is given and $\tilde{O}(S\sqrt{KT})$ when it is not. It is also known that the Bandit-over-Bandit (BOB) approach Cheung et al., (2019); Foster et al., (2020) handles the case when $S$ is not given, although with a looser dependence on the horizon $T$. Recently, Luo et al., (2022) studied switching adversarial linear bandits and achieved near-optimal switching regret with given $S$.
In this paper, we study the adversarial bandit problem against arbitrarily switching arms (i.e., without being given $S$). To handle this problem, we adopt the master-base framework with the online mirror descent method (OMD), which has been widely utilized for model selection problems Agarwal et al., (2017); Pacchiano et al., (2020); Luo et al., (2022). We first study a master-base algorithm with a negative entropy regularizer-based OMD and analyze its regret, showing that it achieves $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$. Nevertheless, this approach inadequately addresses the variance of the estimators because it uses a fixed learning rate throughout, resulting in a regret bound containing a term proportional to $T^{2/3}$.
Based on this analysis, we propose to use adaptive learning rates for OMD to control the variance of the loss estimators and achieve $\tilde{O}\big(\min\{\mathbb{E}[\sqrt{SKT\rho_T}],\, S\sqrt{KT}\}\big)$, where $\rho_T$ is a variance term for the loss estimators. Importantly, instead of a negative entropy regularizer, we utilize a log-barrier regularizer for the master, which allows us to control the worst case with respect to $\rho_T$. Lastly, we compare our algorithms with those proposed in earlier works, specifically Auer et al., (2002) and Cheung et al., (2019).
2 Problem statement
Here we describe our problem setting. We let $[K] = \{1,\dots,K\}$ be the set of arms and $l_t \in [0,1]^K$ be the loss vector at time $t$, in which $l_{t,a}$ is the loss value of arm $a$ at time $t$. The adversarial environment arbitrarily determines the sequence of loss vectors over the horizon $T$. At each time $t$, an agent selects an arm $a_t \in [K]$, after which it observes the partial feedback $l_{t,a_t}$. In this adversarial bandit setting, we aim to minimize the $S$-switch regret, which is defined as follows. Let $(a_1^\star,\dots,a_T^\star)$ be a sequence of actions. For a positive integer $S$, the set of sequences of actions with at most $S$ switches is defined as
$\mathcal{A}(S,T) = \big\{(a_1^\star,\dots,a_T^\star) \in [K]^T : \sum_{t=2}^{T} \mathbb{1}\{a_t^\star \neq a_{t-1}^\star\} \le S\big\}.$
Then, we define the $S$-switch regret as
$R_S(T) = \max_{(a_1^\star,\dots,a_T^\star) \in \mathcal{A}(S,T)} \mathbb{E}\Big[\sum_{t=1}^{T} l_{t,a_t} - \sum_{t=1}^{T} l_{t,a_t^\star}\Big].$
We assume that $S$ is not given to the agent (i.e., it is undetermined). In other words, we aim to design algorithms against any sequence of arms. Therefore, we need to design universal algorithms that achieve tight regret bounds for any non-fixed $S$, in which $S$ represents the hardness of the problem. It is noteworthy that this problem encompasses non-stationary stochastic bandit problems in which the switching parameter is unknown Auer et al., (2019); Chen et al., (2019).
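As a small illustration of the comparator class (not part of the algorithms), the sketch below counts the switches of a candidate sequence and checks whether it is a valid competitor for a given hardness $S$; the function names are ours.

```python
from typing import Sequence

def num_switches(actions: Sequence[int]) -> int:
    """Number of times the action changes between consecutive rounds."""
    return sum(1 for t in range(1, len(actions)) if actions[t] != actions[t - 1])

def is_valid_competitor(actions: Sequence[int], S: int) -> bool:
    """A sequence competes in the S-switch regret if it has at most S switches."""
    return num_switches(actions) <= S

# Example: this sequence switches twice, so it is a valid competitor for any S >= 2.
assert num_switches([0, 0, 1, 1, 2]) == 2
assert is_valid_competitor([0, 0, 1, 1, 2], S=2)
```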
2.1 Regret Lower Bound
We can easily obtain a regret lower bound for this problem from the well-known regret lower bound of adversarial bandits. Let $t_i$ be the time when the $i$-th switch of the best arm in hindsight happens for $i \in \{1,\dots,S\}$, and let $t_0 = 1$ and $t_{S+1} = T$. We consider the case where the $t_i$'s are equally spaced over the horizon, so that each segment has length $\Theta(T/S)$. Then, from Theorem 5.1 in Auer et al., (2002), for the best arm in hindsight over each segment of $\Theta(T/S)$ time steps, we get a regret lower bound of $\Omega(\sqrt{KT/S})$. Summing over the segments, we can obtain that the $S$-switch regret is lower bounded by $\Omega(\sqrt{SKT})$.
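A short calculation makes the segment argument explicit (constants omitted, assuming $S+1$ equal-length segments):
\[
\sum_{i=0}^{S} \Omega\Big(\sqrt{K\,(t_{i+1}-t_i)}\Big)
\;=\; (S+1)\,\Omega\Big(\sqrt{K\,\tfrac{T}{S+1}}\Big)
\;=\; \Omega\Big(\sqrt{(S+1)\,K\,T}\Big)
\;=\; \Omega\big(\sqrt{SKT}\big).
\]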
However, determining whether a tighter regret lower bound holds when $S$ is undetermined remains an unresolved challenge.
3 Algorithms and regret analysis
To handle this problem, we suggest using the online mirror descent method integrated into the master-base framework.
3.1 Master-base framework
In the master-base framework, at each time, a master algorithm selects a base and the selected base selects an arm. Since the switch parameter is not determined in advance, we suggest tuning each base algorithm with a candidate value of $S$ as follows.
Let $\mathcal{J} = \{S_1, \dots, S_M\}$ represent the set of candidate switch parameters for the bases, chosen as a geometrically increasing grid that covers the possible values of $S$ up to $T$.
Then, each base adopts one of the candidate parameters in $\mathcal{J}$ for tuning its learning rate. For simplicity, we index the candidates so that base $i$ denotes the base with candidate parameter $S_i$ when there is no confusion, and we write $M = |\mathcal{J}|$ for the number of bases. Also, let $\hat{S}$ be the largest value among the candidates that do not exceed $S$, which indicates the near-optimal candidate for $S$. Then we can observe that $\hat{S} \le S$ and that $S/\hat{S}$ is bounded by a constant determined by the grid.
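A minimal sketch of this construction, assuming (purely as an illustration) a doubling grid $\{1, 2, 4, \dots\}$; the exact grid used in the paper may differ:

```python
import math

def candidate_switch_params(T: int) -> list[int]:
    """Doubling grid of candidate switch parameters: 1, 2, 4, ..., covering values up to T."""
    return [2 ** j for j in range(math.ceil(math.log2(T)) + 1)]

def near_optimal_candidate(S: int, candidates: list[int]) -> int:
    """Largest candidate not exceeding the true (unknown) hardness S >= 1.

    With a doubling grid this candidate is within a factor of 2 of S, so the base
    tuned with it pays only a constant factor in the regret analysis.
    """
    return max(c for c in candidates if c <= S)

# Example: with T = 1000 the grid is [1, 2, 4, ..., 1024], and S = 37 maps to 32.
cands = candidate_switch_params(1000)
assert near_optimal_candidate(37, cands) == 32
```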
3.2 Online mirror descent (OMD)
Here we describe the OMD method Lattimore and Szepesvári, (2020). For a regularizer function $F$ and points $p, q$ in its domain, we define the Bregman divergence as
$D_F(p, q) = F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle.$
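For example, for the (unnormalized) negative entropy $\Phi(p) = \sum_{a} p_a \ln p_a$, which underlies the regularizer used below, the Bregman divergence is the generalized KL divergence, and it reduces to the usual KL divergence when both arguments lie on the probability simplex:
\[
D_\Phi(p, q) \;=\; \sum_{a \in [K]} p_a \ln\frac{p_a}{q_a} \;+\; \sum_{a \in [K]} (q_a - p_a),
\qquad
D_\Phi(p, q) = \mathrm{KL}(p \,\|\, q) \ \text{ for } p, q \in \Delta_K.
\]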
Let $p_t$ be the distribution for selecting an action at time $t$ and let $\Delta_K$ be the probability simplex of dimension $K$. Then, given a loss vector $l$, using online mirror descent we obtain $p_{t+1}$ as follows:
$p_{t+1} = \arg\min_{p \in \Delta_K} \; \langle p, l \rangle + D_F(p, p_t).$   (1)
The solution of (1) can be found using the following two-step procedure:
$\tilde{p}_{t+1} = \arg\min_{p \in (0,\infty)^K} \; \langle p, l \rangle + D_F(p, p_t),$   (2)
$p_{t+1} = \arg\min_{p \in \Delta_K} \; D_F(p, \tilde{p}_{t+1}).$   (3)
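For the negative entropy regularizer, the two-step procedure (2)–(3) has a well-known closed form: step (2) is a multiplicative-weights update and step (3) is a renormalization. A minimal sketch (variable names are ours):

```python
import numpy as np

def omd_negative_entropy_step(p: np.ndarray, loss: np.ndarray, eta: float) -> np.ndarray:
    """One OMD step with regularizer F(p) = (1/eta) * sum_a p_a * ln(p_a).

    Step (2): unconstrained minimizer p_tilde_a = p_a * exp(-eta * loss_a).
    Step (3): Bregman projection onto the simplex, which here is a renormalization.
    """
    p_tilde = p * np.exp(-eta * loss)   # step (2)
    return p_tilde / p_tilde.sum()      # step (3)

# Example: a large loss on arm 0 shifts probability mass toward the other arms.
p = np.array([0.5, 0.3, 0.2])
print(omd_negative_entropy_step(p, np.array([1.0, 0.0, 0.0]), eta=0.5))
```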
We use a regularizer that contains a learning rate (to be specified later). We note that, in the bandit setting, we cannot observe the full loss vector at time $t$, but only partial feedback for the selected action. Therefore, OMD must be run with an estimated loss vector.
3.3 Master-base OMD
We first provide a simple master-base OMD algorithm (Algorithm 1) with the negative entropy regularizer defined as
$F(p) = \frac{1}{\eta} \sum_{a \in [K]} p_a \ln p_a,$
where $p \in \Delta_K$, $p_a$ denotes the $a$-th entry of $p$, and $\eta$ is a learning rate. We note that the well-known EXP3 algorithm Auer et al., (2002) for adversarial bandits is also based on the negative entropy function.
In Algorithm 1, at each time, the master selects a base from its distribution over bases. Then, following its own distribution for selecting an arm, the selected base selects an arm and receives the corresponding loss. Using this loss feedback, the algorithm computes unbiased estimators of the losses incurred by each base and by each arm, respectively. Then, using OMD with these estimators, it updates the distributions for selecting a base and for selecting an arm within each base.
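The unbiased estimators can be formed by standard importance weighting, as sketched below under the assumption that the master's and the selected base's sampling probabilities are available (variable names are ours):

```python
import numpy as np

def importance_weighted_estimators(loss: float, chosen_base: int, chosen_arm: int,
                                   q: np.ndarray, p_base: np.ndarray):
    """Importance-weighted loss estimators for the master and for the selected base.

    q[i]      : master's probability of selecting base i in this round
    p_base[a] : selected base's probability of selecting arm a in this round
    Conditioned on the past, the base estimator is unbiased for the base's expected
    loss, and the arm estimator is unbiased for the arm's loss.
    """
    base_est = np.zeros(len(q))
    base_est[chosen_base] = loss / q[chosen_base]
    arm_est = np.zeros(len(p_base))
    arm_est[chosen_arm] = loss / (q[chosen_base] * p_base[chosen_arm])
    return base_est, arm_est
```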
For updating the master's distribution over bases, it uses the negative entropy regularizer with a fixed learning rate. The domain for this update is defined as a clipped probability simplex in which every base receives at least a minimum probability. By introducing this clipping, the algorithm controls the variance of the loss estimators by restricting the minimum selection probability of each base. For updating the arm distribution of each base, it also uses the negative entropy regularizer, with a learning rate depending on the candidate parameter of that base. The learning rate of base $i$ is tuned using its candidate value $S_i$ so as to control how quickly the base adapts to switching.
The domain for each base's arm distribution is also defined as a clipped probability simplex, with a minimum probability for every arm. The purpose of this clipping is to introduce some regularization that helps deal with switching best arms in hindsight, which is slightly different from the purpose of the clipping at the master level.
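As noted in Remark 1 below, restricting the update to a clipped simplex can be implemented by mixing the OMD output with the uniform distribution. A minimal sketch (whether the resulting floor is $\delta$ or $\delta/K$ depends on the exact parameterization; here every coordinate is at least $\delta/K$):

```python
import numpy as np

def clip_to_simplex(p: np.ndarray, delta: float) -> np.ndarray:
    """Mix with the uniform distribution so that every coordinate is at least delta / K.

    This keeps the vector on the probability simplex while bounding the inverse
    probabilities, and hence the variance of importance-weighted estimators.
    """
    K = len(p)
    return (1.0 - delta) * p + delta / K

# Example: an arm with probability 0 receives probability delta / K after clipping.
print(clip_to_simplex(np.array([1.0, 0.0, 0.0]), delta=0.1))
```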
Now we provide a regret bound for the algorithm in the following theorem.
Theorem 1.
For any switch number $S$, Algorithm 1 achieves a regret bound of $\tilde{O}\big(S^{1/2}K^{1/3}T^{2/3}\big)$.
Proof.
Let $t_i$ be the time when the $i$-th switch of the best arm happens, with $t_0 = 1$ and $t_{S+1} = T$. Also let $\hat{i}$ denote the base associated with the near-optimal candidate parameter $\hat{S}$. Then the $S$-switch regret can be expressed as
(4)–(5)
in which the first two terms are closely related to the regret of the master algorithm against the near-optimal base $\hat{i}$, and the remaining terms are related to the regret of the base algorithm against the best arms in hindsight. We note that the algorithm does not need to know the near-optimal base in advance; it is introduced here only for the regret analysis.
First, we provide a bound for the regret of base $\hat{i}$ against the best arms in hindsight.
Let $u_a$ denote the unit vector with 1 at index $a$ and 0 at the remaining indices. Then, we have
(7)–(10)
where the first term in the last inequality is obtained from the clipped domain and the second term is obtained from the unbiasedness of the loss estimator conditioned on the filtration. We can observe that the clipped domain controls the distance between the initial distribution of each segment and the unit vector of the best arm over that segment. Then, by solving the corresponding optimization problem, we can bound the first term for all segments.
For the second term of the last inequality in (10), we provide the following lemma.
Lemma 1 (Theorem 28.4 in Lattimore and Szepesvári, (2020)).
For any we have
In Lemma 1, the first term accounts for the initial-point diameter at the start of the segment, and the second term accounts for the divergence of the updated policy. Using the definition of the Bregman divergence, the initial-point diameter term can be shown to be bounded as follows:
(11)–(12)
Next, for the updated-policy divergence term, we have
(13)–(16)
where the first two inequalities follow from elementary bounds, and the last inequality is obtained from the clipped domain. We can observe that the clipped domain controls the variance of the estimators. Then, from (10), Lemma 1, (12), and (16), by summing over the segments, we have
(17)
Next, we provide a bound for the regret of the master against base $\hat{i}$.
Let $u_{\hat{i}}$ denote the unit vector with 1 at the index of base $\hat{i}$ and 0 at the remaining indices. Then, we have
(18)–(20)
For bounding the second term in (20), we use the following lemma.
Lemma 2 (Theorem 28.4 in Lattimore and Szepesvári, (2020)).
From Theorem 1, the regret bound of Algorithm 1 is tight with respect to $S$ compared to that of EXP3.S Auer et al., (2002), which has a linear dependency on $S$ when $S$ is not given. Therefore, when $S$ is sufficiently large, Algorithm 1 performs better than EXP3.S. Also, compared with the previous Bandit-over-Bandit (BOB) approach Cheung et al., (2019), which has a loose dependency on $T$ of order $T^{3/4}$, our algorithm has a tighter regret bound with respect to $T$. Therefore, when $T$ is sufficiently large, Algorithm 1 achieves a better regret bound than BOB.
However, the regret bound achieved by Algorithm 1 still has a $T^{2/3}$ dependence rather than $\sqrt{T}$, due to the large variance of the loss estimators caused by sampling twice at each time, once for a base and once for an arm. In the following, we provide an algorithm utilizing adaptive learning rates to control the variance of the estimators.
3.4 Master-base OMD with adaptive learning rates
Here we propose Algorithm 2, which utilizes adaptive learning rates to control the variance of the estimators. We first explain the base algorithm, for which we propose to use the negative entropy regularizer with an adaptive, time-varying learning rate.
The adaptive learning rate is optimized at each time using variance information of the loss estimators, so that the learning rate is inversely related to a variance threshold term (to be specified later). This implies that if the variance of the estimators is small, then the learning rate becomes large.
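The exact schedule is specified in Algorithm 2; purely to illustrate the inverse relationship between the variance threshold and the learning rate, one could use a rule of the following form (a hypothetical formula of ours, not the paper's):

```python
import math

def base_learning_rate(S_i: int, K: int, T: int, rho: float) -> float:
    """Illustrative (hypothetical) adaptive rate: grows with the candidate switch
    parameter S_i and shrinks as the variance threshold rho grows."""
    return math.sqrt(S_i * math.log(K * T) / (K * T * rho))
```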
For the master algorithm, we adopt the method of Corral Agarwal et al., (2017), in which, by using a log-barrier regularizer with increasing learning rates, the master introduces a negative bias term that cancels the variance term from the bases, thereby handling the worst case with respect to $\rho_T$. The log-barrier regularizer is defined as
$F(q) = \sum_{i=1}^{M} \frac{1}{\bar{\eta}_i} \ln \frac{1}{q_i},$
with learning rates $\bar{\eta}_1, \dots, \bar{\eta}_M$ for the master algorithm.
Here we describe the learning-rate update procedure for the master and the bases in Algorithm 2; the other parts are similar to Algorithm 1. The variance of the loss estimator for base $i$ scales with the inverse of the probability with which the master selects that base. If this variance proxy for base $i$ is larger than its current threshold, then the master increases the corresponding learning rate by a constant multiplicative factor and raises the threshold accordingly; the updated threshold is also used for tuning the learning rate of base $i$. Otherwise, it keeps the learning rate and the threshold the same as in the previous time step.
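A sketch of this update, assuming the standard Corral-style rule of Agarwal et al., (2017); the exact constants in Algorithm 2 may differ:

```python
import math

def update_master_rates(q_i: float, rho_i: float, eta_i: float, T: int):
    """Corral-style update for base i after the master selected it with probability q_i.

    If the variance proxy 1/q_i exceeds the threshold rho_i, raise the threshold to
    2/q_i and increase the learning rate by a constant factor; otherwise keep both.
    Returns the new (threshold, learning_rate) pair.
    """
    beta = math.exp(1.0 / math.log(T))   # constant factor > 1 used by Corral
    if 1.0 / q_i > rho_i:
        return 2.0 / q_i, beta * eta_i
    return rho_i, eta_i

# Example: a rarely selected base (small q_i) triggers the increase.
print(update_master_rates(q_i=0.01, rho_i=50.0, eta_i=0.1, T=10_000))
```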
In the following theorem, we provide a regret bound of Algorithm 2.
Theorem 2.
For any switch number $S$, Algorithm 2 achieves a regret bound of $\tilde{O}\big(\min\{\mathbb{E}[\sqrt{SKT\rho_T}],\, S\sqrt{KT}\}\big)$, where $\rho_T$ is a variance term for the loss estimators.
Proof.
Let $t_i$ be the time when the $i$-th switch of the best arm happens, with $t_0 = 1$ and $t_{S+1} = T$. Also let $\hat{i}$ denote the base associated with the near-optimal candidate parameter $\hat{S}$. Then the $S$-switch regret can be expressed as
(24)–(25)
in which the first two terms are closely related to the regret of the master algorithm against the near-optimal base $\hat{i}$, and the remaining terms are related to the regret of the base algorithm against the best arms in hindsight.
First, we provide a bound for the regret of base $\hat{i}$. From (10), we can obtain
(27)
Then, for the second term of the last inequality in (27), we provide the following lemma.
Lemma 3.
For any we can show that
Proof.
For ease of presentation, we define the negative entropy regularizer without a learning rate as
$\Phi(p) = \sum_{a \in [K]} p_a \ln p_a,$
and denote the learning rate of the base at time $t$ by $\eta_t$. From the first-order optimality condition and the definition of the Bregman divergence, we have
(28)–(29)
Also, we have
(30)–(32)
Then, we can obtain
(33)–(38)
where the first inequality is obtained from (29) and the last inequality is obtained from (32) together with the fact that the learning rates are non-decreasing.
For the second term in the last inequality in (38), we have
(39)–(45)
where the inequalities follow from elementary bounds on the exponential function and from the definition of the loss estimators.
∎
(46)–(48)
Next, we provide a bound for the regret from the master in the following lemma.
Lemma 4 (Lemma 13 in Agarwal et al., (2017)).
The negative bias term in Lemma 4 is derived from the log-barrier regularizer and the increasing learning rates. This term is critical for bounding the worst-case regret, as will be shown shortly. An additional term is obtained from considering the clipped domain. Then, putting the decomposition (24)–(25) together with Lemmas 3 and 4, we have
(49)–(54)
Then we can obtain the regret bound stated in Theorem 2, where the second term of the minimum is obtained from the worst case of the variance term $\rho_T$. This worst case can be found by maximizing the concave bound in the last equality of (54) with respect to $\rho_T$. This concludes the proof. ∎
Here we provide a regret bound comparison with other approaches. For simplicity in the comparison, we use the worst-case form of the regret bound in Theorem 2, namely $\tilde{O}(S\sqrt{KT})$.
The regret bound in Theorem 2 depends on $\rho_T$, which is closely related to the variance of the loss estimators. Even though the regret bound depends on this variance term, it is of interest that the worst-case bound is always bounded by $\tilde{O}(S\sqrt{KT})$, which implies that the regret bound of Algorithm 2 is always tighter than or equal to that of EXP3.S. Algorithm 2 also has a tight regret bound with respect to $T$. Therefore, when $T$ is sufficiently large, Algorithm 2 shows a better regret bound than BOB. We note that the value of $\rho_T$ depends on the problem instance, and further analysis of this term would be an interesting avenue for future research.
Remark 1.
For the implementation of our algorithms, we describe how to update the policy using OMD in general. Let $\hat{l}_a$ be a loss estimator for action $a$. For the negative entropy regularizer, by solving the optimization in (3), from Lattimore and Szepesvári, (2020), we have
$p_{t+1,a} = \frac{p_{t,a} \exp(-\eta \hat{l}_a)}{\sum_{b \in [K]} p_{t,b} \exp(-\eta \hat{l}_b)}.$
In the case of the log-barrier regularizer, we have $\frac{1}{p_{t+1,a}} = \frac{1}{p_{t,a}} + \eta_a(\hat{l}_a - \lambda)$, where $\lambda$ is a normalization factor ensuring that $p_{t+1}$ is a probability distribution Luo et al., (2022). Also, a clipped domain with parameter $\delta$ can be implemented by adding a uniform probability to the policy, i.e., $p_{t,a} \leftarrow (1-\delta)\, p_{t,a} + \delta/K$ for all $a \in [K]$.
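A sketch of the log-barrier update, where the normalization factor is found by bisection; this is one standard way to implement it (the paper does not prescribe a particular root-finding method), and nonnegative loss estimates are assumed:

```python
import numpy as np

def log_barrier_update(p: np.ndarray, loss_est: np.ndarray, eta: np.ndarray) -> np.ndarray:
    """Log-barrier OMD step: 1/p_new_a = 1/p_a + eta_a * (loss_est_a - lam), where the
    normalization factor lam is chosen so that the new probabilities sum to one."""
    def total(lam: float) -> float:
        return float(np.sum(1.0 / (1.0 / p + eta * (loss_est - lam))))

    lo = 0.0                                                 # total(0) <= 1 since loss_est >= 0
    hi = float(np.min(loss_est + 1.0 / (eta * p))) - 1e-12   # keeps all denominators positive
    for _ in range(100):                                     # total(lam) is increasing in lam
        mid = 0.5 * (lo + hi)
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 1.0 / (1.0 / p + eta * (loss_est - lam))

# Example with three arms and per-arm learning rates.
p = np.array([0.5, 0.3, 0.2])
print(log_barrier_update(p, np.array([1.0, 0.0, 0.0]), eta=np.array([0.1, 0.1, 0.1])))
```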
4 Conclusion
In this paper, we studied adversarial bandits against any sequence of arms under the $S$-switch regret, without being given $S$. We proposed two algorithms based on a master-base framework with the OMD method. Algorithm 1, based on simple OMD, achieves $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$. Then, by using adaptive learning rates, Algorithm 2 achieves $\tilde{O}\big(\min\{\mathbb{E}[\sqrt{SKT\rho_T}],\, S\sqrt{KT}\}\big)$. It remains an open problem to achieve the optimal $\tilde{O}(\sqrt{SKT})$ regret bound in the worst case.
5 Acknowledgment
The authors thank Joe Suk for helpful discussions.
References
- Agarwal et al., (2017) Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. (2017). Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12–38. PMLR.
- Auer et al., (2002) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77.
- Auer et al., (2019) Auer, P., Gajane, P., and Ortner, R. (2019). Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, pages 138–158.
- Cesa-Bianchi et al., (1997) Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., and Warmuth, M. K. (1997). How to use expert advice. Journal of the ACM (JACM), 44(3):427–485.
- Chen et al., (2019) Chen, Y., Lee, C.-W., Luo, H., and Wei, C.-Y. (2019). A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. In Conference on Learning Theory, pages 696–726. PMLR.
- Cheung et al., (2019) Cheung, W. C., Simchi-Levi, D., and Zhu, R. (2019). Learning to optimize under non-stationarity. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1079–1087.
- Daniely et al., (2015) Daniely, A., Gonen, A., and Shalev-Shwartz, S. (2015). Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405–1411.
- Foster et al., (2020) Foster, D. J., Krishnamurthy, A., and Luo, H. (2020). Open problem: Model selection for contextual bandits. In Conference on Learning Theory, pages 3842–3846. PMLR.
- Garivier and Moulines, (2008) Garivier, A. and Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems.
- Jun et al., (2017) Jun, K.-S., Orabona, F., Wright, S., and Willett, R. (2017). Improved strongly adaptive online learning using coin betting. In Artificial Intelligence and Statistics, pages 943–951. PMLR.
- Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
- Luo et al., (2022) Luo, H., Zhang, M., Zhao, P., and Zhou, Z.-H. (2022). Corralling a larger band of bandits: A case study on switching regret for linear bandits. arXiv preprint arXiv:2202.06151.
- Pacchiano et al., (2020) Pacchiano, A., Phan, M., Abbasi-Yadkori, Y., Rao, A., Zimmert, J., Lattimore, T., and Szepesvari, C. (2020). Model selection in contextual stochastic bandit problems. arXiv preprint arXiv:2003.01704.
- Russac et al., (2019) Russac, Y., Vernade, C., and Cappé, O. (2019). Weighted linear bandits for non-stationary environments. In Advances in Neural Information Processing Systems, pages 12017–12026.
- Suk and Kpotufe, (2022) Suk, J. and Kpotufe, S. (2022). Tracking most significant arm switches in bandits. In Conference on Learning Theory, pages 2160–2182. PMLR.