
Adversarial Bandits against Arbitrary Strategies

Jung-hun Kim KAIST junghunkim@kaist.ac.kr    Se-Young Yun KAIST yunseyoung@kaist.ac.kr
Abstract

We study the adversarial bandit problem against arbitrary strategies, in which $S$ is the parameter for the hardness of the problem and this parameter is not given to the agent. To handle this problem, we adopt the master-base framework using the online mirror descent method (OMD). We first provide a master-base algorithm with simple OMD, achieving $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$, in which $T^{2/3}$ comes from the variance of the loss estimators. To mitigate the impact of the variance, we propose using adaptive learning rates for OMD and achieve $\tilde{O}(\min\{\mathbb{E}[\sqrt{SKT\rho_{T}(h^{\dagger})}],S\sqrt{KT}\})$, where $\rho_{T}(h^{\dagger})$ is a variance term for the loss estimators.

1 Introduction

The bandit problem is a fundamental decision-making problem that deals with the exploration-exploitation trade-off. In this problem, an agent plays an action, an "arm", at each time step and receives loss or reward feedback for that choice. The arm might be the choice of an item to recommend to a user in a recommendation system. In practice, it is often necessary to account for user preferences that switch over time, which can be modeled by switching best arms.

In this paper, we focus on the adversarial bandit problem, where the losses for each arm at each time are arbitrarily determined. In such an environment, we allow the target strategy to be any sequence of arms instead of the single best arm in hindsight. Therefore, regret is measured by competing not with a single best arm but with any sequence of arms. We denote by $S$ the number of switches in the sequence of arms, which is referred to as hardness Auer et al., (2002). Importantly, we target arbitrary strategies, so $S$ is not fixed in advance (in other words, the value of $S$ is not provided to the agent).

Competing with switching arms has been widely studied. In the expert setting with full information Cesa-Bianchi et al., (1997), several algorithms Daniely et al., (2015); Jun et al., (2017) achieve a near-optimal $\tilde{\mathcal{O}}(\sqrt{ST})$ bound for the $S$-switch regret (defined later) without knowledge of the switch parameter $S$. However, in bandit problems, an agent cannot observe full information of the loss at each time, which makes the problem more challenging than the full-information setting. Stochastic bandit settings where each arm has a switching reward distribution over time, referred to as non-stationary bandit problems, have been studied by Garivier and Moulines, (2008); Auer et al., (2019); Russac et al., (2019); Suk and Kpotufe, (2022). In particular, Auer et al., (2019); Suk and Kpotufe, (2022) achieved the near-optimal regret $\tilde{O}(\sqrt{SKT})$ without being given $S$.

However, we cannot apply this method to the adversarial setting, where losses may be determined arbitrarily. For the adversarial bandit setting, EXP3.S Auer et al., (2002) achieved $\tilde{O}(\sqrt{SKT})$ with given $S$ and $\tilde{O}(S\sqrt{KT})$ without given $S$. It is also known that the Bandit-over-Bandit (BOB) approach achieved $\tilde{O}(\sqrt{SKT}+T^{3/4})$ Cheung et al., (2019); Foster et al., (2020) for the case when $S$ is not given. Recently, Luo et al., (2022) studied switching adversarial linear bandits and achieved $\tilde{O}(\sqrt{dST})$ with given $S$.

In this paper, we study adversarial bandit problems against arbitrarily switching arms (i.e., without being given $S$). To handle this problem, we adopt the master-base framework with the online mirror descent method (OMD), which has been widely utilized for model selection problems Agarwal et al., (2017); Pacchiano et al., (2020); Luo et al., (2022). We first study a master-base algorithm with OMD under a negative entropy regularizer and show that it achieves $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$. Nevertheless, this approach inadequately addresses the variance of the estimators because it uses a fixed learning rate throughout, resulting in a regret bound containing a term proportional to $T^{2/3}$.

Based on the analysis, we propose to use adaptive learning rates for OMD to control the variance of the loss estimators and achieve $\tilde{O}(\min\{\mathbb{E}[\sqrt{SKT\rho_{T}(h^{\dagger})}],S\sqrt{KT}\})$, where $\rho_{T}(h^{\dagger})$ is a variance term for the loss estimators. Importantly, instead of a negative entropy regularizer, we utilize a log-barrier regularizer, which allows us to control the worst case with respect to $\rho_{T}(h^{\dagger})$. Lastly, we compare our algorithms with those proposed in earlier works, specifically Auer et al., (2002) and Cheung et al., (2019).

2 Problem statement

Here we describe our problem setting. Let $\mathcal{A}=[K]$ be the set of arms and $l_{t}\in[0,1]^{K}$ be the loss vector at time $t$, in which $l_{t}(a)$ is the loss value of arm $a$ at time $t$. The adversarial environment is arbitrarily determined by a sequence of loss vectors $l_{1},l_{2},\dots,l_{T}\in[0,1]^{K}$ over the horizon $T$. At each time $t$, the agent selects an arm $a_{t}\in[K]$, after which it observes the partial feedback $l_{t}(a_{t})\in[0,1]$. In this adversarial bandit setting, we aim to minimize the $S$-switch regret, which is defined as follows. Let $\sigma=\{\sigma_{1},\sigma_{2},\dots,\sigma_{T}\}\in[K]^{T}$ be a sequence of actions. For a positive integer $S<T$, the set of action sequences with at most $S$ switches is defined as

B_{S}=\left\{\sigma\in[K]^{T}:\sum_{t=1}^{T-1}\mathbbm{1}\{\sigma_{t}\neq\sigma_{t+1}\}\leq S\right\}.

Then, we define the $S$-switch regret as

R_{S}(T)=\max_{\sigma\in B_{S}}\sum_{t=1}^{T}\left(\mathbb{E}[l_{t}(a_{t})]-l_{t}(\sigma_{t})\right).

We assume that $S$ is not given to the agent (i.e., it is undetermined). In other words, we aim to design algorithms that compete against any sequence of arms. Therefore, we need universal algorithms that achieve tight regret bounds for any non-fixed $S\in[T-1]$, in which $S$ represents the hardness of the problem. It is noteworthy that this problem encompasses the non-stationary stochastic bandit problems with an unknown switching parameter Auer et al., (2019); Chen et al., (2019).

2.1 Regret Lower Bound

We can easily obtain a regret lower bound for this problem from the well-known regret lower bound for adversarial bandits. Let $t_{s}$ be the time when the $s$-th switch of the best arm in hindsight happens for $s\in[S]$, with $t_{S+1}-1=T$ and $t_{0}=1$. Suppose the $t_{s}$'s are equally spaced over $[T]$. Then we have $T_{s}:=t_{s+1}-t_{s}=\Theta(T/S)$ for $s\in[0,S]$. From Theorem 5.1 in Auer et al., (2002), competing with the best arm in hindsight over $T_{s}$ time steps incurs a regret lower bound of $\Omega(\sqrt{KT_{s}})$. We can therefore obtain

R(T)=\Omega\left(\sum_{s\in[0,S]}\sqrt{KT_{s}}\right)=\Omega\left(\sqrt{SKT}\right).

However, determining the feasibility of a tighter regret lower bound under undetermined $S$ remains an unresolved challenge.

3 Algorithms and regret analysis

To handle this problem, we suggest using the online mirror descent method integrated into the master-base framework.

3.1 Master-base framework

In the master-base framework, at each time a master algorithm selects a base, and the selected base selects an arm. Since the switch value $S$ is not determined in advance, we tune each base algorithm with a candidate value of $S$ as follows.

Let $\mathcal{H}$ represent the set of candidates of the switch parameter $S\in[T-1]$ for the bases such that:

\mathcal{H}=\{T^{0},T^{\frac{1}{\lceil\log T\rceil}},T^{\frac{2}{\lceil\log T\rceil}},\dots,T\}.

Then, each base adopts one of the candidate parameters in $\mathcal{H}$ for tuning its learning rate. For simplicity, let $H=|\mathcal{H}|$, so that $H=O(\log(T))$, and let base $h$ denote the base having the candidate parameter $h\in\mathcal{H}$ when there is no confusion. Also, let $h^{\dagger}$ be the largest value $h\leq S$ among $h\in\mathcal{H}$, which indicates the near-optimal parameter for $S$. Then we can observe that

e^{-1}S\leq h^{\dagger}\leq S.
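As a small illustration, the following Python sketch (our own, with hypothetical helper names) constructs the candidate grid $\mathcal{H}$ and, for a given $S$, the near-optimal candidate $h^{\dagger}$; note that $S$ appears here only to illustrate the guarantee above and is never revealed to the agent.

import math

def candidate_grid(T):
    # Geometric grid H = {T^(i / ceil(log T)) : i = 0, ..., ceil(log T)} from Section 3.1.
    m = math.ceil(math.log(T))
    return [T ** (i / m) for i in range(m + 1)]

def near_optimal_candidate(grid, S):
    # Largest h in the grid with h <= S; it satisfies S / e <= h <= S.
    return max(h for h in grid if h <= S)

# Example with T = 10000 and (hidden) S = 37.
grid = candidate_grid(10_000)
h_dagger = near_optimal_candidate(grid, 37)
print(len(grid), round(h_dagger, 2))  # |H| = O(log T); here 37 / e <= h_dagger <= 37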

3.2 Online mirror descent (OMD)

Here we describe the OMD method Lattimore and Szepesvári, (2020). For a regularizer function $F:\mathbb{R}^{d}\rightarrow\mathbb{R}$ and $p,q\in\mathbb{R}^{d}$, we define the Bregman divergence as

D_{F}(p,q)=F(p)-F(q)-\langle\nabla F(q),p-q\rangle.

Let $p_{t}$ be the distribution for selecting an action at time $t$ and $\mathcal{P}_{d}$ be the probability simplex of dimension $d$. Then, with a loss vector $l$, using online mirror descent we can get $p_{t+1}$ as follows:

p_{t+1}=\operatorname*{arg\,min}_{p\in\mathcal{P}_{d}}\langle p,l\rangle+D_{F}(p,p_{t}). (1)

The solution of (1) can be found using the following two-step procedure:

\tilde{p}_{t+1}=\operatorname*{arg\,min}_{p\in\mathbb{R}^{d}}\langle p,l\rangle+D_{F}(p,p_{t}), (2)
p_{t+1}=\operatorname*{arg\,min}_{p\in\mathcal{P}_{d}}D_{F}(p,\tilde{p}_{t+1}). (3)

We use a regularizer $F$ that contains a learning rate (to be specified later). We note that, in the bandit setting, we cannot observe full information of the loss at time $t$, but only partial feedback based on the selected action. Therefore, it is required to use an estimated loss vector for OMD.
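As a concrete instance of (2)-(3), for the negative entropy regularizer $F_{\eta}$ used in Section 3.3 below and the unclipped simplex $\mathcal{P}_{d}$, the two steps admit the familiar exponential-weights form (a standard computation; see also Remark 1 and Lattimore and Szepesvári, (2020)):

\tilde{p}_{t+1}(i)=p_{t}(i)\exp(-\eta\,l(i)),\qquad p_{t+1}(i)=\frac{p_{t}(i)\exp(-\eta\,l(i))}{\sum_{j\in[d]}p_{t}(j)\exp(-\eta\,l(j))},\qquad i\in[d].

With a clipped simplex such as $\mathcal{P}_{d}\cap[\alpha,1]^{d}$, step (3) is no longer a plain normalization; Remark 1 describes a simple approximation that mixes the policy with the uniform distribution.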

3.3 Master-base OMD

We first provide a simple master-base OMD algorithm (Algorithm 1) with the negative entropy regularizer defined as

F_{\eta}(p)=(1/\eta)\sum_{i=1}^{d}(p(i)\log p(i)-p(i)),

where $p\in\mathbb{R}^{d}$, $p(i)$ denotes the $i$-th entry of $p$, and $\eta$ is a learning rate. We note that the well-known EXP3 algorithm Auer et al., (2002) for adversarial bandits is also based on the negative entropy function.

In Algorithm 1, at each time, the master selects a base $h_{t}$ from the distribution $p_{t}$. Then, following its arm-selection distribution $p_{t,h_{t}}$, the base $h_{t}$ selects an arm $a_{t}$ and receives the corresponding loss $l_{t}(a_{t})$. Using the loss feedback, the algorithm constructs unbiased estimators $l_{t}^{\prime}(h)$ and $l_{t,h}^{\prime\prime}(a)$ for the loss of selecting each base $h\in\mathcal{H}$ and each arm $a\in[K]$, respectively. Then, using OMD with these estimators, it updates the distributions $p_{t+1}$ and $p_{t+1,h}$ for selecting a base and for selecting an arm from base $h$, respectively.

To obtain $p_{t+1}$, it uses the negative entropy regularizer with learning rate $\eta$. The domain for updating the base-selection distribution is a clipped probability simplex $\mathcal{P}_{H}^{\alpha}=\mathcal{P}_{H}\cap[\alpha,1]^{H}$ for $\alpha>0$. By introducing $\alpha$, the algorithm controls the variance of the estimator $l_{t}^{\prime}(h)=l_{t}(a_{t})\mathbbm{1}(h=h_{t})/p_{t}(h)$ by restricting the minimum value of $p_{t}(h)$. To obtain $p_{t+1,h}$, it also uses the negative entropy regularizer, with a learning rate depending on the value of $h$ for each base. The learning rate $\eta(h)$ is tuned using the candidate value $h$ for $S$ in base $h$ to control adaptation to switching, such that

\eta(h)=h^{1/2}/(K^{1/3}T^{2/3}).

The domain for the arm-selection distribution is likewise a clipped probability simplex $\mathcal{P}_{K}^{\beta}=\mathcal{P}_{K}\cap[\beta,1]^{K}$ for $\beta>0$. The purpose of $\beta$ is to introduce some regularization in learning $p_{t+1,h}$ for dealing with switching best arms in hindsight, which is slightly different from the purpose of $\alpha$.

Now we provide a regret bound for the algorithm in the following theorem.

Algorithm 1 Master-base OMD
1:  Given: $T$, $K$, $\mathcal{H}$.
2:  Initialization: $\alpha=K^{1/3}/(T^{1/3}H^{1/2})$, $\beta=1/(KT)$, $\eta=1/\sqrt{TK}$, $\eta(h)=h^{1/2}/(K^{1/3}T^{2/3})$, $p_{1}(h)=1/H$, $p_{1,h}(a)=1/K$ for $h\in\mathcal{H}$ and $a\in[K]$.
3:  for $t=1,\dots,T$ do
4:     Select a base and an arm:
5:     Draw $h_{t}\sim$ probabilities $p_{t}(h)$ for $h\in\mathcal{H}$.
6:     Draw $a_{t,h_{t}}\sim$ probabilities $p_{t,h_{t}}(a)$ for $a\in[K]$.
7:     Pull $a_{t}=a_{t,h_{t}}$ and Receive $l_{t}(a_{t,h_{t}})\in[0,1]$.
8:     Obtain loss estimators:
9:     $l_{t}^{\prime}(h_{t})=\frac{l_{t}(a_{t,h_{t}})}{p_{t}(h_{t})}$ and $l_{t}^{\prime}(h)=0$ for $h\in\mathcal{H}\setminus\{h_{t}\}$.
10:    $l_{t,h_{t}}^{\prime\prime}(a_{t,h_{t}})=\frac{l_{t}^{\prime}(h_{t})}{p_{t,h_{t}}(a_{t,h_{t}})}$ and $l_{t,h}^{\prime\prime}(a)=0$ for $h\in\mathcal{H}\setminus\{h_{t}\}$, $a\in[K]\setminus\{a_{t,h_{t}}\}$.
11:    Update distributions:
12:    $p_{t+1}=\arg\min_{p\in\mathcal{P}_{H}^{\alpha}}\langle p,l_{t}^{\prime}\rangle+D_{F_{\eta}}(p,p_{t})$
13:    $p_{t+1,h}=\arg\min_{p\in\mathcal{P}_{K}^{\beta}}\langle p,l_{t,h}^{\prime\prime}\rangle+D_{F_{\eta(h)}}(p,p_{t,h})$ for all $h\in\mathcal{H}$
14: end for
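For intuition, the sketch below (Python; our own paraphrase, not the authors' code) implements one round of the master-base interaction of Algorithm 1, with the OMD updates on the clipped simplices approximated by exponential weights followed by uniform mixing in the spirit of Remark 1; loss_fn and all variable names are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def exp_weights_step(p, loss_est, eta, min_prob):
    # Approximate OMD step on a clipped simplex: exponential weights, normalize, then mix
    # with the uniform distribution so that every entry stays at least min_prob.
    w = p * np.exp(-eta * loss_est)
    w /= w.sum()
    d = len(p)
    return (1 - min_prob * d) * w + min_prob

def master_base_round(p, P, eta, etas, alpha, beta, loss_fn, t):
    # p    : master distribution over the H bases
    # P    : H x K matrix; row h is base h's distribution over arms
    # etas : per-base learning rates eta(h)
    H, K = P.shape
    h_t = rng.choice(H, p=p)               # master draws a base
    a_t = rng.choice(K, p=P[h_t])          # base h_t draws an arm
    loss = loss_fn(t, a_t)                 # observe l_t(a_t) in [0, 1]

    # importance-weighted loss estimators l'_t and l''_{t,h}
    l_master = np.zeros(H)
    l_master[h_t] = loss / p[h_t]
    l_base = np.zeros((H, K))
    l_base[h_t, a_t] = l_master[h_t] / P[h_t, a_t]

    # OMD-style updates for the master and for every base
    p = exp_weights_step(p, l_master, eta, alpha)
    for h in range(H):
        P[h] = exp_weights_step(P[h], l_base[h], etas[h], beta)
    return p, P, a_t, loss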
Theorem 1.

For any switch number $S\in[T-1]$, Algorithm 1 achieves a regret bound of

R_{S}(T)=\tilde{O}(S^{1/2}K^{1/3}T^{2/3}).
Proof.

Let $t_{s}$ be the time when the $s$-th switch of the best arm happens, with $t_{S+1}-1=T$ and $t_{0}=1$. Also let $t_{s+1}-t_{s}=T_{s}$. For any such $t_{s}$, $s\in[0,S]$, the $S$-switch regret can be expressed as

R_{S}(T)=\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t})\right]-\sum_{s=0}^{S}\min_{k_{s}\in[K]}\sum_{t=t_{s}}^{t_{s+1}-1}l_{t}(k_{s}) (4)
=\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h_{t}})\right]-\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h^{\dagger}})\right]+\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h^{\dagger}})\right]-\sum_{s=0}^{S}\min_{k_{s}\in[K]}\sum_{t=t_{s}}^{t_{s+1}-1}l_{t}(k_{s}), (5)

in which the first two terms correspond to the regret of the master algorithm against the near-optimal base $h^{\dagger}$, and the remaining terms correspond to the regret of base $h^{\dagger}$ against the best arms in hindsight. We note that the algorithm does not need to know $h^{\dagger}$ in advance; $h^{\dagger}$ is introduced here only for the regret analysis.

First we provide a bound for the following regret from base $h^{\dagger}$:

\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h^{\dagger}})\right]-\sum_{s=0}^{S}\min_{k_{s}\in[K]}\sum_{t=t_{s}}^{t_{s+1}-1}l_{t}(k_{s}).

Let $k_{s}^{*}=\arg\min_{k\in[K]}\sum_{t=t_{s}}^{t_{s+1}-1}l_{t}(k)$ and let $e_{j,K}$ denote the unit vector with 1 at index $j$ and 0 at the remaining $K-1$ indices. Then, we have

\sum_{t=t_{s}}^{t_{s+1}-1}\mathbb{E}\left[l_{t}(a_{t,h^{\dagger}})-l_{t}(k_{s}^{*})\right] (7)
=\sum_{t=t_{s}}^{t_{s+1}-1}\mathbb{E}\left[\langle p_{t,h^{\dagger}}-e_{k_{s}^{*},K},l_{t}\rangle\right] (8)
\leq\mathbb{E}\left[\max_{p\in\mathcal{P}_{K}^{\beta}}\sum_{t=t_{s}}^{t_{s+1}-1}\langle p-e_{k_{s}^{*},K},l_{t}\rangle+\max_{p\in\mathcal{P}_{K}^{\beta}}\sum_{t=t_{s}}^{t_{s+1}-1}\langle p_{t,h^{\dagger}}-p,l_{t}\rangle\right] (9)
\leq\beta T_{s}(K-1)+\mathbb{E}\left[\max_{p\in\mathcal{P}_{K}^{\beta}}\sum_{t=t_{s}}^{t_{s+1}-1}\langle p_{t,h^{\dagger}}-p,l_{t,h^{\dagger}}^{\prime\prime}\rangle\right], (10)

where the first term in the last inequality is obtained from the clipped domain $\mathcal{P}_{K}^{\beta}$ and the second term is obtained from the unbiasedness of the estimator $l_{t,h^{\dagger}}^{\prime\prime}$, i.e., $\mathbb{E}[l_{t,h^{\dagger}}^{\prime\prime}|\mathcal{F}_{t-1}]=\mathbb{E}[l_{t}|\mathcal{F}_{t-1}]$, where $\mathcal{F}_{t-1}$ is the filtration. We can observe that the clipped domain controls the distance between the initial distribution at $t_{s}$ and the best-arm unit vector over the time steps $[t_{s},t_{s+1}-1]$. Let

\tilde{p}_{t+1,h^{\dagger}}=\operatorname*{arg\,min}_{p\in\mathbb{R}^{K}}\langle p,l_{t,h^{\dagger}}^{\prime\prime}\rangle+D_{F_{\eta(h^{\dagger})}}(p,p_{t,h^{\dagger}}).

Then, by solving the optimization problem, we can get

\tilde{p}_{t+1,h^{\dagger}}(k)=p_{t,h^{\dagger}}(k)\exp(-\eta(h^{\dagger})l_{t,h^{\dagger}}^{\prime\prime}(k)),

for all $k\in[K]$.

For the second term of the last inequality in (10), we provide the following lemma.

Lemma 1 (Theorem 28.4 in Lattimore and Szepesvári, (2020)).

For any $p\in\mathcal{P}_{K}^{\beta}$ we have

\sum_{t=t_{s}}^{t_{s+1}-1}\langle p_{t,h^{\dagger}}-p,l_{t,h^{\dagger}}^{\prime\prime}\rangle\leq D_{F_{\eta(h^{\dagger})}}(p,p_{t_{s},h^{\dagger}})+\sum_{t=t_{s}}^{t_{s+1}-1}D_{F_{\eta(h^{\dagger})}}(p_{t,h^{\dagger}},\tilde{p}_{t+1,h^{\dagger}}).

In Lemma 1, the first term accounts for the initial-point diameter at time $t_{s}$ and the second term for the divergence of the updated policy. Using the definition of the Bregman divergence and the fact that $p_{t_{s},h^{\dagger}}(k)\geq\beta$, the initial-point diameter term can be bounded as follows:

D_{F_{\eta(h^{\dagger})}}(p,p_{t_{s},h^{\dagger}})\leq\frac{1}{\eta(h^{\dagger})}\sum_{k\in[K]}p(k)\log(1/p_{t_{s},h^{\dagger}}(k)) (11)
\leq\frac{\log(1/\beta)}{\eta(h^{\dagger})}. (12)

Next, for the updated-policy divergence term, using $\tilde{p}_{t+1,h^{\dagger}}(k)=p_{t,h^{\dagger}}(k)\exp(-\eta(h^{\dagger})l_{t,h^{\dagger}}^{\prime\prime}(k))$ for all $k\in[K]$, we have

\sum_{t=t_{s}}^{t_{s+1}-1}\mathbb{E}\left[D_{F_{\eta(h^{\dagger})}}(p_{t,h^{\dagger}},\tilde{p}_{t+1,h^{\dagger}})\right] (13)
=\sum_{t=t_{s}}^{t_{s+1}-1}\sum_{k=1}^{K}\mathbb{E}\left[\frac{1}{\eta(h^{\dagger})}p_{t,h^{\dagger}}(k)\left(\exp(-\eta(h^{\dagger})l_{t,h^{\dagger}}^{\prime\prime}(k))-1+\eta(h^{\dagger})l_{t,h^{\dagger}}^{\prime\prime}(k)\right)\right] (14)
\leq\sum_{t=t_{s}}^{t_{s+1}-1}\sum_{k=1}^{K}\mathbb{E}\left[\frac{\eta(h^{\dagger})}{2}p_{t,h^{\dagger}}(k)l_{t,h^{\dagger}}^{\prime\prime}(k)^{2}\right] (15)
\leq\sum_{t=t_{s}}^{t_{s+1}-1}\sum_{k=1}^{K}\mathbb{E}\left[\frac{\eta(h^{\dagger})}{2p_{t}(h^{\dagger})}\right]\leq\frac{\eta(h^{\dagger})KT_{s}}{2\alpha}, (16)

where the first inequality comes from $\exp(-x)\leq 1-x+x^{2}/2$ for all $x\geq 0$, the second inequality comes from $\mathbb{E}[l_{t,h^{\dagger}}^{\prime\prime}(k)^{2}\mid p_{t,h^{\dagger}}(k),p_{t}(h^{\dagger})]\leq 1/(p_{t}(h^{\dagger})p_{t,h^{\dagger}}(k))$, and the last inequality is obtained from $p_{t}(h^{\dagger})\geq\alpha$, which follows from the clipped domain. We can observe that the clipped domain controls the variance of the estimators. Then, from (10), Lemma 1, (12), and (16), by summing over $s\in[0,S]$, we have

\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h^{\dagger}})\right]-\sum_{s=0}^{S}\min_{k_{s}\in[K]}\sum_{t=t_{s}}^{t_{s+1}-1}l_{t}(k_{s})\leq\beta T(K-1)+\frac{S\log(1/\beta)}{\eta(h^{\dagger})}+\frac{\eta(h^{\dagger})KT}{2\alpha}. (17)

Next, we provide a bound for the following regret from the master:

\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h_{t}})\right]-\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h^{\dagger}})\right].

Let $\tilde{p}_{t+1}=\operatorname*{arg\,min}_{p\in\mathbb{R}^{H}}\langle p,l_{t}^{\prime}\rangle+D_{F_{\eta}}(p,p_{t})$ and let $e_{h,H}$ denote the unit vector with 1 at the index of base $h$ and 0 at the remaining $H-1$ indices. For ease of presentation, we define $\tilde{l}_{t}(h_{t})=l_{t}(a_{t,h_{t}})$ and $\tilde{l}_{t}(h^{\dagger})=l_{t}(a_{t,h^{\dagger}})$. Then, we have

\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h_{t}})-l_{t}(a_{t,h^{\dagger}})\right]=\sum_{t=1}^{T}\mathbb{E}\left[\langle p_{t}-e_{h^{\dagger},H},\tilde{l}_{t}\rangle\right] (18)
\leq\mathbb{E}\left[\max_{p\in\mathcal{P}_{H}^{\alpha}}\sum_{t=1}^{T}\langle p-e_{h^{\dagger},H},\tilde{l}_{t}\rangle+\max_{p\in\mathcal{P}_{H}^{\alpha}}\sum_{t=1}^{T}\langle p_{t}-p,\tilde{l}_{t}\rangle\right] (19)
\leq\alpha T(H-1)+\mathbb{E}\left[\max_{p\in\mathcal{P}_{H}^{\alpha}}\sum_{t=1}^{T}\langle p_{t}-p,\tilde{l}_{t}\rangle\right]. (20)

For bounding the second term in (20), we use the following lemma.

Lemma 2 (Theorem 28.4 in Lattimore and Szepesvári, (2020)).
\max_{p\in\mathcal{P}_{H}^{\alpha}}\sum_{t=1}^{T}\langle p_{t}-p,\tilde{l}_{t}\rangle\leq\max_{p\in\mathcal{P}_{H}^{\alpha}}\left(F_{\eta}(p)-F_{\eta}(p_{1})+\sum_{t=1}^{T}D_{F_{\eta}}(p_{t},\tilde{p}_{t+1})\right).

From (20) and Lemma 2, we have

\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h_{t}})\right]-\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h^{\dagger}})\right] (21)
\leq\alpha T(H-1)+\mathbb{E}\left[\max_{p\in\mathcal{P}_{H}^{\alpha}}\left(F_{\eta}(p)-F_{\eta}(p_{1})+\sum_{t=1}^{T}D_{F_{\eta}}(p_{t},\tilde{p}_{t+1})\right)\right] (22)
\leq\alpha T(H-1)+\frac{\log(H)}{\eta}+\frac{\eta TK}{2}, (23)

where the last inequality is obtained from the fact that

F_{\eta}(p)-F_{\eta}(p_{1})\leq-F_{\eta}(p_{1})\leq\frac{\log(H)}{\eta}

and

\mathbb{E}\left[\sum_{t=1}^{T}D_{F_{\eta}}(p_{t},\tilde{p}_{t+1})\right]\leq\frac{\eta TK}{2}.

Therefore, putting (5), (17), and (23) together, we have

R_{S}(T)=\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t})\right]-\sum_{s=0}^{S}\min_{1\leq k_{s}\leq K}\sum_{t=t_{s}}^{t_{s+1}-1}l_{t}(k_{s})
\leq\alpha TH+\frac{\log(H)}{\eta}+\frac{\eta TK}{2}+\beta T(K-1)+\frac{S\log(1/\beta)}{\eta(h^{\dagger})}+\frac{\eta(h^{\dagger})KT}{2\alpha}
=\tilde{O}(S^{1/2}T^{2/3}K^{1/3}),

where $\alpha=K^{1/3}/(T^{1/3}H^{1/2})$, $\beta=1/(KT)$, $\eta=1/\sqrt{TK}$, $\eta(h^{\dagger})=(h^{\dagger})^{1/2}/(K^{1/3}T^{2/3})$, $h^{\dagger}=\Theta(S)$, and $H=O(\log(T))$. This concludes the proof. ∎
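To spell out the last equality, substituting the stated parameters term by term gives (our own expansion, using $h^{\dagger}=\Theta(S)$ and $H=O(\log T)$):

\alpha TH=K^{1/3}T^{2/3}H^{1/2},\qquad\frac{\log(H)}{\eta}=\sqrt{TK}\log(H),\qquad\frac{\eta TK}{2}=\frac{\sqrt{TK}}{2},\qquad\beta T(K-1)\leq 1,
\frac{S\log(1/\beta)}{\eta(h^{\dagger})}=\frac{S\log(KT)K^{1/3}T^{2/3}}{(h^{\dagger})^{1/2}}=\tilde{O}(S^{1/2}K^{1/3}T^{2/3}),\qquad\frac{\eta(h^{\dagger})KT}{2\alpha}=\frac{(h^{\dagger})^{1/2}K^{1/3}T^{2/3}H^{1/2}}{2}=\tilde{O}(S^{1/2}K^{1/3}T^{2/3}),

and the last two terms dominate, giving $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$.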

From Theorem 1, the regret bound of Algorithm 1 is tight with respect to $S$ compared to that of EXP3.S Auer et al., (2002), which has a linear dependency on $S$. Therefore, when $S$ is large (specifically $S=\omega((T/K)^{1/3})$), Algorithm 1 performs better than EXP3.S. Also, compared with the previous bandit-over-bandit (BOB) approach Cheung et al., (2019), which has a loose $T^{3/4}$ dependency on $T$, our algorithm has a tighter regret bound with respect to $T$. Therefore, when $T$ is large (specifically $T=\omega(S^{6}K^{4})$), Algorithm 1 achieves a better regret bound than BOB.
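For completeness, the two thresholds follow from comparing the dominant terms (our own calculation):

S^{1/2}K^{1/3}T^{2/3}\leq SK^{1/2}T^{1/2}\;\Longleftrightarrow\;(T/K)^{1/3}\leq S,\qquad S^{1/2}K^{1/3}T^{2/3}\leq T^{3/4}\;\Longleftrightarrow\;S^{6}K^{4}\leq T.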

However, the regret bound achieved by Algorithm 1 still contains an $O(T^{2/3})$ term rather than $O(\sqrt{T})$, due to the large variance of the loss estimators arising from sampling twice at each time (once for a base and once for an arm). In the following, we provide an algorithm utilizing adaptive learning rates to control the variance of the estimators.

3.4 Master-base OMD with adaptive learning rates

Here we propose Algorithm 2, which utilizes adaptive learning rates to control the variance of the estimators. We first explain the base algorithm. For the base algorithm, we propose to use the negative entropy regularizer with an adaptive learning rate $\eta_{t}(h)$ such that

F_{\eta_{t}(h)}(p)=\frac{1}{\eta_{t}(h)}\sum_{i=1}^{d}(p(i)\log p(i)-p(i)).

The adaptive learning rate $\eta_{t}(h)$ is optimized using variance information for the loss estimators at each time $t$ to control the variance, such that

\eta_{t}(h)=\sqrt{h/(KT\rho_{t}(h))},

where $\rho_{t}(h)$ is a variance threshold term (to be specified later). This implies that if the variance of the estimators is small, then the learning rate becomes large.

For the master algorithm, we adopt the method of Corral Agarwal et al., (2017), in which a log-barrier regularizer with increasing learning rates introduces a negative bias term that cancels the variance term from the bases, allowing us to handle the worst case with respect to $\rho_{t}(h^{\dagger})$. The log-barrier regularizer is defined as:

F_{\bm{\xi}_{t}}(p)=-\sum_{i=1}^{d}\frac{\log p(i)}{\xi_{t}(i)}

with learning rates $\bm{\xi}_{t}$ for the master algorithm.

Here we describe the learning-rate update procedure for the master and the bases in Algorithm 2; the other parts are similar to Algorithm 1. The variance of the loss estimator $l_{t}^{\prime}(h)$ for base $h$ is governed by $1/p_{t+1}(h)$. If $1/p_{t+1}(h)$ for base $h$ is larger than a threshold $\rho_{t}(h)$, then the algorithm increases the learning rate as $\xi_{t+1}(h)=\gamma\xi_{t}(h)$ with $\gamma>1$ and updates the threshold as $\rho_{t+1}(h)=2/p_{t+1}(h)$, which is also used for tuning the learning rate $\eta_{t}(h)$. Otherwise, it keeps the learning rate and threshold the same as in the previous time step; a short sketch of this rule is given after the pseudocode below.

Algorithm 2 Master-base OMD with adaptive learning rates
  Given: $T$, $K$, $\mathcal{H}$
  Initialization: $\alpha=1/(TH)$, $\beta=1/(TK)$, $\gamma=e^{\frac{1}{\log T}}$, $\eta=\sqrt{H/T}$, $\rho_{1}(h)=2H$, $\xi_{1}(h)=\eta$, $p_{1}(h)=1/H$, $p_{1,h}(a)=1/K$ for $h\in\mathcal{H}$ and $a\in[K]$.
  for $t=1,\dots,T$ do
     Select a base and an arm:
     Draw $h_{t}\sim$ probabilities $p_{t}(h)$ for $h\in\mathcal{H}$.
     Draw $a_{t,h_{t}}\sim$ probabilities $p_{t,h_{t}}(a)$ for $a\in[K]$.
     Pull $a_{t}=a_{t,h_{t}}$ and Receive $l_{t}(a_{t,h_{t}})\in[0,1]$.
     Update loss estimators:
     $l_{t}^{\prime}(h_{t})=\frac{l_{t}(a_{t,h_{t}})}{p_{t}(h_{t})}$ and $l_{t}^{\prime}(h)=0$ for $h\in\mathcal{H}\setminus\{h_{t}\}$.
     $l_{t,h_{t}}^{\prime\prime}(a_{t,h_{t}})=\frac{l_{t}^{\prime}(h_{t})}{p_{t,h_{t}}(a_{t,h_{t}})}$ and $l_{t,h}^{\prime\prime}(a)=0$ for $h\in\mathcal{H}\setminus\{h_{t}\}$, $a\in[K]\setminus\{a_{t,h_{t}}\}$.
     Update distributions:
     $p_{t+1}=\arg\min_{p\in\mathcal{P}_{H}^{\alpha}}\langle p,l_{t}^{\prime}\rangle+D_{F_{\bm{\xi}_{t}}}(p,p_{t})$
     $p_{t+1,h}=\arg\min_{p\in\mathcal{P}_{K}^{\beta}}\langle p,l_{t,h}^{\prime\prime}\rangle+D_{F_{\eta_{t}(h)}}(p,p_{t,h})$ for $h\in\mathcal{H}$
     Update learning rates:
     For $h\in\mathcal{H}$:
        If $\frac{1}{p_{t+1}(h)}>\rho_{t}(h)$, then $\rho_{t+1}(h)=\frac{2}{p_{t+1}(h)}$, $\xi_{t+1}(h)=\gamma\xi_{t}(h)$.
        Else, $\rho_{t+1}(h)=\rho_{t}(h)$, $\xi_{t+1}(h)=\xi_{t}(h)$.
  end for
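The learning-rate update step above can be read as the following short Python sketch (our own illustration; variable names are hypothetical):

import math

def base_learning_rate(h, rho_h, K, T):
    # eta_t(h) = sqrt(h / (K * T * rho_t(h))): a smaller threshold rho_t(h) (low variance so far)
    # yields a larger learning rate for base h.
    return math.sqrt(h / (K * T * rho_h))

def update_learning_rates(p_next, rho, xi, gamma):
    # Threshold rule of Algorithm 2: if 1 / p_{t+1}(h) exceeds rho_t(h), reset the threshold
    # to 2 / p_{t+1}(h) and grow the master learning rate xi(h) by the factor gamma > 1;
    # otherwise keep both unchanged.
    for i, prob in enumerate(p_next):
        if 1.0 / prob > rho[i]:
            rho[i] = 2.0 / prob
            xi[i] = gamma * xi[i]
    return rho, xi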

In the following theorem, we provide a regret bound of Algorithm 2.

Theorem 2.

For any switch number $S\in[T-1]$, Algorithm 2 achieves a regret bound of

R_{S}(T)=\tilde{O}\left(\min\left\{\mathbb{E}\left[\sqrt{SKT\rho_{T}(h^{\dagger})}\right],S\sqrt{KT}\right\}\right).
Proof.

Let $t_{s}$ be the time when the $s$-th switch of the best arm happens, with $t_{S+1}-1=T$ and $t_{0}=1$. Also let $t_{s+1}-t_{s}=T_{s}$. For any such $t_{s}$, $s\in[0,S]$, the $S$-switch regret can be expressed as

R_{S}(T)=\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t})\right]-\sum_{s=0}^{S}\min_{k_{s}\in[K]}\sum_{t=t_{s}}^{t_{s+1}-1}l_{t}(k_{s}) (24)
=\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h_{t}})\right]-\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h^{\dagger}})\right]+\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h^{\dagger}})\right]-\sum_{s=0}^{S}\min_{k_{s}\in[K]}\sum_{t=t_{s}}^{t_{s+1}-1}l_{t}(k_{s}), (25)

in which the first two terms correspond to the regret of the master algorithm against the near-optimal base $h^{\dagger}$, and the remaining terms correspond to the regret of base $h^{\dagger}$ against the best arms in hindsight.

First we provide a bound for the following regret from base $h^{\dagger}$. From (10), we can obtain

\sum_{t=t_{s}}^{t_{s+1}-1}\mathbb{E}\left[l_{t}(a_{t,h^{\dagger}})\right]-\min_{k_{s}\in[K]}\sum_{t=t_{s}}^{t_{s+1}-1}l_{t}(k_{s})\leq\beta T_{s}K+\mathbb{E}\left[\max_{p\in\mathcal{P}_{K}^{\beta}}\sum_{t=t_{s}}^{t_{s+1}-1}\langle p_{t,h^{\dagger}}-p,l_{t,h^{\dagger}}^{\prime\prime}\rangle\right]. (27)

Then, for the second term of the last inequality in (27), we provide the following lemma.

Lemma 3.

For any $p\in\mathcal{P}_{K}^{\beta}$ we can show that

\sum_{t=t_{s}}^{t_{s+1}-1}\mathbb{E}\left[\langle p_{t,h^{\dagger}}-p,l_{t,h^{\dagger}}^{\prime\prime}\rangle\right]\leq\mathbb{E}\left[2\log(1/\beta)\sqrt{\frac{KT\rho_{T}(h^{\dagger})}{h^{\dagger}}}+\frac{T_{s}}{2}\sqrt{\frac{SK\rho_{T}(h^{\dagger})}{T}}\right].
Proof.

For ease of presentation, we define the negative entropy regularizer without a learning rate as

F(p)=\sum_{i=1}^{K}(p(i)\log p(i)-p(i))

and define the learning rate $\eta_{0}(h^{\dagger})=\infty$. From the first-order optimality condition for $p_{t+1,h^{\dagger}}$ and using the definition of the Bregman divergence,

\langle p_{t+1,h^{\dagger}}-p,l_{t}^{\prime\prime}\rangle\leq\frac{1}{\eta_{t}(h^{\dagger})}\langle p-p_{t+1,h^{\dagger}},\nabla F(p_{t+1,h^{\dagger}})-\nabla F(p_{t,h^{\dagger}})\rangle (28)
=\frac{1}{\eta_{t}(h^{\dagger})}\left(D_{F}(p,p_{t,h^{\dagger}})-D_{F}(p,p_{t+1,h^{\dagger}})-D_{F}(p_{t+1,h^{\dagger}},p_{t,h^{\dagger}})\right). (29)

Also, we have

\langle p_{t,h^{\dagger}}-p_{t+1,h^{\dagger}},l_{t}^{\prime\prime}\rangle=\frac{1}{\eta_{t}(h^{\dagger})}\langle p_{t,h^{\dagger}}-p_{t+1,h^{\dagger}},\nabla F(p_{t,h^{\dagger}})-\nabla F(\tilde{p}_{t+1,h^{\dagger}})\rangle (30)
=\frac{1}{\eta_{t}(h^{\dagger})}\left(D(p_{t+1,h^{\dagger}},p_{t,h^{\dagger}})+D(p_{t,h^{\dagger}},\tilde{p}_{t+1,h^{\dagger}})-D(p_{t+1,h^{\dagger}},\tilde{p}_{t+1,h^{\dagger}})\right) (31)
\leq\frac{1}{\eta_{t}(h^{\dagger})}\left(D(p_{t+1,h^{\dagger}},p_{t,h^{\dagger}})+D(p_{t,h^{\dagger}},\tilde{p}_{t+1,h^{\dagger}})\right). (32)

Then, we can obtain

\sum_{t=t_{s}}^{t_{s+1}-1}\langle p_{t,h^{\dagger}}-p,l_{t}^{\prime\prime}\rangle\leq\sum_{t=t_{s}}^{t_{s+1}-1}\langle p_{t,h^{\dagger}}-p_{t+1,h^{\dagger}},l_{t}^{\prime\prime}\rangle (33)
+\sum_{t=t_{s}}^{t_{s+1}-1}\frac{1}{\eta_{t}(h^{\dagger})}\left(D(p,p_{t,h^{\dagger}})-D(p,p_{t+1,h^{\dagger}})-D(p_{t+1,h^{\dagger}},p_{t,h^{\dagger}})\right) (34)
=\sum_{t=t_{s}}^{t_{s+1}-1}\langle p_{t,h^{\dagger}}-p_{t+1,h^{\dagger}},l_{t}^{\prime\prime}\rangle+\sum_{t=t_{s}+1}^{t_{s+1}-1}D_{F}(p,p_{t,h^{\dagger}})\left(\frac{1}{\eta_{t}(h^{\dagger})}-\frac{1}{\eta_{t-1}(h^{\dagger})}\right) (35)
+\frac{1}{\eta_{t_{s}}(h^{\dagger})}D(p,p_{t_{s},h^{\dagger}})-\frac{1}{\eta_{t_{s+1}-1}(h^{\dagger})}D(p,p_{t_{s+1},h^{\dagger}})-\sum_{t=t_{s}}^{t_{s+1}-1}\frac{1}{\eta_{t}(h^{\dagger})}D(p_{t+1,h^{\dagger}},p_{t,h^{\dagger}}) (36)
\leq\frac{2\log(1/\beta)}{\eta_{T}(h^{\dagger})}+\sum_{t=t_{s}}^{t_{s+1}-1}\frac{D_{F}(p_{t,h^{\dagger}},\tilde{p}_{t+1,h^{\dagger}})}{\eta_{t}(h^{\dagger})} (37)
=2\log(1/\beta)\sqrt{\frac{KT\rho_{T}(h^{\dagger})}{h^{\dagger}}}+\sum_{t=t_{s}}^{t_{s+1}-1}\frac{D_{F}(p_{t,h^{\dagger}},\tilde{p}_{t+1,h^{\dagger}})}{\eta_{t}(h^{\dagger})}, (38)

where the first inequality is obtained from (29) and the last inequality is obtained from (32), $D(p,p_{t,h^{\dagger}})\leq\log(1/\beta)$, and $\eta_{t}(h^{\dagger})\geq\eta_{T}(h^{\dagger})$, which follows from the non-decreasing $\rho_{t}(h^{\dagger})$.

For the second term in (38), using $\tilde{p}_{t+1,h^{\dagger}}(k)=p_{t,h^{\dagger}}(k)\exp(-\eta_{t}(h^{\dagger})l_{t,h^{\dagger}}^{\prime\prime}(k))$ for all $k\in[K]$, we have

\sum_{t=t_{s}}^{t_{s+1}-1}\mathbb{E}\left[\frac{D_{F}(p_{t,h^{\dagger}},\tilde{p}_{t+1,h^{\dagger}})}{\eta_{t}(h^{\dagger})}\right]=\sum_{t=t_{s}}^{t_{s+1}-1}\sum_{k=1}^{K}\mathbb{E}\left[\frac{1}{\eta_{t}(h^{\dagger})}p_{t,h^{\dagger}}(k)\left(\exp(-\eta_{t}(h^{\dagger})l_{t,h^{\dagger}}^{\prime\prime}(k))-1+\eta_{t}(h^{\dagger})l_{t,h^{\dagger}}^{\prime\prime}(k)\right)\right] (39)
\leq\sum_{t=t_{s}}^{t_{s+1}-1}\sum_{k=1}^{K}\mathbb{E}\left[\frac{\eta_{t}(h^{\dagger})}{2}p_{t,h^{\dagger}}(k)l_{t,h^{\dagger}}^{\prime\prime}(k)^{2}\right] (41)
\leq\sum_{t=t_{s}}^{t_{s+1}-1}\sum_{k=1}^{K}\mathbb{E}\left[\frac{\eta_{t}(h^{\dagger})}{2p_{t}(h^{\dagger})}\right] (42)
\leq\sum_{t=t_{s}}^{t_{s+1}-1}\sum_{k=1}^{K}\mathbb{E}\left[\frac{\eta_{t}(h^{\dagger})\rho_{t}(h^{\dagger})}{2}\right] (43)
\leq\sum_{t=t_{s}}^{t_{s+1}-1}\sum_{k=1}^{K}\mathbb{E}\left[\frac{1}{2}\sqrt{\frac{h^{\dagger}\rho_{t}(h^{\dagger})}{KT}}\right] (44)
\leq T_{s}\sqrt{\frac{h^{\dagger}K}{T}}\frac{\mathbb{E}\left[\rho_{T}(h^{\dagger})^{1/2}\right]}{2}, (45)

where the first inequality comes from $\exp(-x)\leq 1-x+x^{2}/2$ for all $x\geq 0$, the second inequality comes from $\mathbb{E}[l_{t,h^{\dagger}}^{\prime\prime}(k)^{2}\mid p_{t,h^{\dagger}}(k),p_{t}(h^{\dagger})]\leq 1/(p_{t}(h^{\dagger})p_{t,h^{\dagger}}(k))$, and the third inequality is obtained from $1/p_{t}(h^{\dagger})\leq\rho_{t}(h^{\dagger})$.

Then, from (27) and Lemma 3, we have

\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h^{\dagger}})\right]-\sum_{s=0}^{S}\min_{1\leq k_{s}\leq K}\sum_{t=t_{s}}^{t_{s+1}-1}l_{t}(k_{s}) (46)
\leq\beta T(K-1)+\mathbb{E}\left[2S\log(1/\beta)\sqrt{\frac{KT\rho_{T}(h^{\dagger})}{h^{\dagger}}}+\frac{1}{2}\sqrt{TSK\rho_{T}(h^{\dagger})}\right]. (47)

Next, we provide a bound for the regret from the master in the following lemma.

Lemma 4 (Lemma 13 in Agarwal et al., (2017)).
\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h_{t}})\right]-\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t,h^{\dagger}})\right]\leq O\left(\frac{H\log(T)}{\eta}+T\eta\right)-\mathbb{E}\left[\frac{\rho_{T}(h^{\dagger})}{40\eta\log T}\right]+\alpha T(H-1).

The negative bias term in Lemma 4 is derived from the log-barrier regularizer and the increasing learning rates $\xi_{t}(h)$. This term is critical for bounding the worst-case regret, as shown below. Also, the $H\log(T)/\eta$ term is obtained from $H\log(1/(H\alpha))/\eta$ by taking the clipped domain into account. Then, putting (25), Lemma 3, and Lemma 4 altogether, we have

R_{S}(T)=\sum_{t=1}^{T}\mathbb{E}\left[l_{t}(a_{t})\right]-\sum_{s=0}^{S}\min_{1\leq k_{s}\leq K}\sum_{t=t_{s}}^{t_{s+1}-1}l_{t}(k_{s}) (49)
\leq O\left(\frac{H\log T}{\eta}+T\eta\right)-\mathbb{E}\left[\frac{\rho_{T}(h^{\dagger})}{40\eta\log T}\right]+\alpha T(H-1)+\beta T(K-1)+\mathbb{E}\left[2S\log(1/\beta)\sqrt{\frac{KT\rho_{T}(h^{\dagger})}{h^{\dagger}}}+\frac{1}{2}\sqrt{SKT\rho_{T}(h^{\dagger})}\right] (51)
=\tilde{O}\left(\mathbb{E}\left[\sqrt{SKT\rho_{T}(h^{\dagger})}\right]\right)-\mathbb{E}\left[\frac{\rho_{T}(h^{\dagger})\sqrt{TK}}{40\sqrt{H}\log(T)}\right], (54)

where $\alpha=1/(TH)$, $\beta=1/(TK)$, $\eta=\sqrt{H/T}$, $\eta_{T}(h^{\dagger})=\sqrt{h^{\dagger}/(KT\rho_{T}(h^{\dagger}))}$, $H=O(\log(T))$, and $h^{\dagger}=\Theta(S)$. Then we can obtain

R_{S}(T)=\tilde{O}\left(\min\left\{\mathbb{E}\left[\sqrt{SKT\rho_{T}(h^{\dagger})}\right],S\sqrt{KT}\right\}\right),

where the $\tilde{O}(S\sqrt{KT})$ bound is obtained from the worst case of $\rho_{T}(h^{\dagger})$. The worst case is found by maximizing the concave bound in the last equality of (54) over the variable $\rho_{T}(h^{\dagger})>0$, which gives $\rho_{T}(h^{\dagger})=\tilde{\Theta}(S)$. This concludes the proof. ∎
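To make this maximization explicit, write the bound in (54) as $a\sqrt{\rho}-b\rho$ with $\rho=\rho_{T}(h^{\dagger})$, $a=\tilde{\Theta}(\sqrt{SKT})$, and $b=\sqrt{TK}/(40\sqrt{H}\log T)=\tilde{\Theta}(\sqrt{TK})$; then (a routine calculation added for completeness)

\max_{\rho>0}\left(a\sqrt{\rho}-b\rho\right)=\frac{a^{2}}{4b}\quad\text{at}\quad\rho=\left(\frac{a}{2b}\right)^{2}=\tilde{\Theta}(S),\qquad\frac{a^{2}}{4b}=\tilde{\Theta}\left(\frac{SKT}{\sqrt{TK}}\right)=\tilde{\Theta}(S\sqrt{KT}).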

Here we provide a regret bound comparison with other approaches. For simplicity in the comparison, we use the fact that $\mathbb{E}[\sqrt{\rho_{T}(h^{\dagger})}]\leq\sqrt{\mathbb{E}[\rho_{T}(h^{\dagger})]}$ for the regret bound in Theorem 2, such that

R_{S}(T)=\tilde{O}\left(\min\left\{\sqrt{SKT\mathbb{E}\left[\rho_{T}(h^{\dagger})\right]},S\sqrt{KT}\right\}\right).

The regret bound in Theorem 2 depends on $\rho_{T}(h^{\dagger})$, which is closely related to the variance of the loss estimators $l^{\prime}_{t}(h^{\dagger})$ for $t\in[T-1]$. Even though the regret bound depends on this variance term, it is of interest that the worst-case bound is always at most $\tilde{O}(S\sqrt{KT})$, which implies that the regret bound of Algorithm 2 is always tighter than or equal to that of EXP3.S. Algorithm 2 also has a tight $O(\sqrt{T})$ dependency on $T$. Therefore, when $T$ is large, specifically $T=\omega(S^{4}K^{2})$, Algorithm 2 shows a better regret bound compared with BOB. We note that the value of $\mathbb{E}[\rho_{T}(h^{\dagger})]$ depends on the problem instance, and further analysis of this term would be an interesting avenue for future research.
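For completeness, the threshold against BOB follows from comparing the worst-case term $S\sqrt{KT}$ with the $T^{3/4}$ term of BOB (our own calculation):

S\sqrt{KT}\leq T^{3/4}\;\Longleftrightarrow\;SK^{1/2}\leq T^{1/4}\;\Longleftrightarrow\;S^{4}K^{2}\leq T.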

Remark 1.

For the implementation of our algorithms, we describe how to update the policy $p_{t}$ using OMD in general. Let $\widehat{l}_{t}(a)$ be a loss estimator for action $a\in[d]$. For the negative entropy regularizer, by solving the optimization in (3), from Lattimore and Szepesvári, (2020), we have

p_{t+1}(a)=\frac{\exp\left(-\eta\sum_{s=1}^{t}\widehat{l}_{s}(a)\right)}{\sum_{b\in[d]}\exp\left(-\eta\sum_{s=1}^{t}\widehat{l}_{s}(b)\right)}.

In the case of the log-barrier regularizer, we have $p_{t+1}(a)=(\eta\sum_{s=1}^{t}\widehat{l}_{s}(a)+Z)^{-1}$, where $Z$ is a normalization factor ensuring a probability distribution Luo et al., (2022). Also, a clipped domain with $0<\epsilon<1$ can be implemented by adding a uniform probability to the policy such that $p_{t+1}(a)\leftarrow(1-\epsilon)p_{t+1}(a)+\epsilon/d$ for all $a\in[d]$.
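A minimal Python sketch of these closed-form updates (our own illustration, assuming cumulative loss estimates are stored in an array; the bisection used to find the log-barrier normalizer $Z$ is one simple choice, not the only one):

import numpy as np

def negentropy_policy(cum_loss, eta):
    # Exponential weights / softmax over cumulative loss estimates (shifted for numerical stability).
    w = np.exp(-eta * (cum_loss - cum_loss.min()))
    return w / w.sum()

def log_barrier_policy(cum_loss, eta, tol=1e-10):
    # p(a) = 1 / (eta * L(a) + Z), with the normalizer Z found by bisection so the entries sum to 1.
    scaled = eta * cum_loss
    lo = -scaled.min() + 1e-12          # here the entries sum to a huge value
    hi = -scaled.min() + len(cum_loss)  # here the entries sum to at most 1
    while hi - lo > tol:
        Z = 0.5 * (lo + hi)
        if np.sum(1.0 / (scaled + Z)) > 1.0:
            lo = Z
        else:
            hi = Z
    p = 1.0 / (scaled + 0.5 * (lo + hi))
    return p / p.sum()

def clip_policy(p, eps):
    # Clipped domain via uniform mixing: p <- (1 - eps) * p + eps / d.
    return (1 - eps) * p + eps / len(p)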

Remark 2.

The regret bounds of Theorems 1 and 2 also apply to non-stationary stochastic bandit problems with an unknown switching parameter, where reward distributions switch over time. This is because adversarial bandit problems encompass stochastic bandit problems.

4 Conclusion

In this paper, we studied adversarial bandits against any sequence of arms under the $S$-switch regret without being given $S$. We proposed two algorithms based on a master-base framework with the OMD method. Algorithm 1, based on simple OMD, achieves $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$. Then, by using adaptive learning rates, Algorithm 2 achieves $\tilde{O}(\min\{\mathbb{E}[\sqrt{SKT\rho_{T}(h^{\dagger})}],S\sqrt{KT}\})$. It remains an open problem to achieve the optimal regret bound in the worst case.

5 Acknowledgment

The authors thank Joe Suk for helpful discussions.

References

  • Agarwal et al., (2017) Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. (2017). Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12–38. PMLR.
  • Auer et al., (2002) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77.
  • Auer et al., (2019) Auer, P., Gajane, P., and Ortner, R. (2019). Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, pages 138–158.
  • Cesa-Bianchi et al., (1997) Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., and Warmuth, M. K. (1997). How to use expert advice. Journal of the ACM (JACM), 44(3):427–485.
  • Chen et al., (2019) Chen, Y., Lee, C.-W., Luo, H., and Wei, C.-Y. (2019). A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. In Conference on Learning Theory, pages 696–726. PMLR.
  • Cheung et al., (2019) Cheung, W. C., Simchi-Levi, D., and Zhu, R. (2019). Learning to optimize under non-stationarity. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1079–1087.
  • Daniely et al., (2015) Daniely, A., Gonen, A., and Shalev-Shwartz, S. (2015). Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405–1411.
  • Foster et al., (2020) Foster, D. J., Krishnamurthy, A., and Luo, H. (2020). Open problem: Model selection for contextual bandits. In Conference on Learning Theory, pages 3842–3846. PMLR.
  • Garivier and Moulines, (2008) Garivier, A. and Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems.
  • Jun et al., (2017) Jun, K.-S., Orabona, F., Wright, S., and Willett, R. (2017). Improved strongly adaptive online learning using coin betting. In Artificial Intelligence and Statistics, pages 943–951. PMLR.
  • Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
  • Luo et al., (2022) Luo, H., Zhang, M., Zhao, P., and Zhou, Z.-H. (2022). Corralling a larger band of bandits: A case study on switching regret for linear bandits. arXiv preprint arXiv:2202.06151.
  • Pacchiano et al., (2020) Pacchiano, A., Phan, M., Abbasi-Yadkori, Y., Rao, A., Zimmert, J., Lattimore, T., and Szepesvari, C. (2020). Model selection in contextual stochastic bandit problems. arXiv preprint arXiv:2003.01704.
  • Russac et al., (2019) Russac, Y., Vernade, C., and Cappé, O. (2019). Weighted linear bandits for non-stationary environments. In Advances in Neural Information Processing Systems, pages 12017–12026.
  • Suk and Kpotufe, (2022) Suk, J. and Kpotufe, S. (2022). Tracking most significant arm switches in bandits. In Conference on Learning Theory, pages 2160–2182. PMLR.