
An Optimistic Algorithm for Online Convex Optimization with Adversarial Constraints

\nameJordan Lekeufack \emailjordan.lekeufack@berkeley.edu
\addrDepartment of Statistics
University of California, Berkeley
\AND\nameMichael I. Jordan \emailjordan@cs.berkeley.edu
\addrDepartment of Statistics / Department of Electrical Engineering and Computer Science
University of California, Berkeley
Abstract

We study Online Convex Optimization (OCO) with adversarial constraints, where an online algorithm must make sequential decisions to minimize both convex loss functions and cumulative constraint violations. We focus on a setting where the algorithm has access to predictions of the loss and constraint functions. Our results show that we can improve the current best bounds of O(\sqrt{T}) regret and \tilde{O}(\sqrt{T}) cumulative constraint violation to O(\sqrt{{\cal E}_{T}(f)}) and \tilde{O}(\sqrt{{\cal E}_{T}(g^{+})}), respectively, where {\cal E}_{T}(f) and {\cal E}_{T}(g^{+}) are the cumulative prediction errors of the loss and constraint functions. In the worst case, where {\cal E}_{T}(f)=O(T) and {\cal E}_{T}(g^{+})=O(T) (assuming bounded gradients of the loss and constraint functions), our rates match the prior O(\sqrt{T}) results. However, when the loss and constraint predictions are accurate, our approach yields significantly smaller regret and cumulative constraint violation. Finally, we apply our framework to adversarial contextual bandits with sequential risk constraints, obtaining optimistic bounds of O(\sqrt{{\cal E}_{T}(f)}T^{1/3}) regret and O(\sqrt{{\cal E}_{T}(g^{+})}T^{1/3}) constraint violation, which improve on existing results when the prediction quality is sufficiently high.

1 Introduction

We are interested in generalizations of Online Convex Optimization (OCO) to problems in which constraints are imposed but can be violated—generalizations which are referred to as Constrained Online Convex Optimization (COCO). Recall the standard formulation of OCO (Orabona, 2019; Hazan, 2023): At each round t, a learner makes a decision x_{t}\in{\cal X}, receives a convex loss function f_{t} from the environment, and suffers the loss f_{t}(x_{t}). The goal of the learner is to minimize the cumulative loss \sum_{t=1}^{T}f_{t}(x_{t}). The COCO framework imposes an additional requirement on the learner: meeting a potentially adversarial convex constraint of the form g_{t}(x_{t})\leq 0 at every time step. The learner observes g_{t} only after selecting x_{t}, and cannot always satisfy the constraint exactly, but can hope to have a small cumulative constraint violation \sum_{t=1}^{T}\max(g_{t}(x_{t}),0). In the adversarial setting, it is not viable to seek absolute minima of the cumulative loss, and the problem is generally formulated in terms of obtaining a sublinear Static Regret—the difference between the learner's cumulative loss and the cumulative loss of a fixed oracle decision. Having sublinear regret means that, on average, we perform as well as the best action in hindsight. A stronger and more general objective is the Dynamic Regret, where the learner's performance is benchmarked against sequences of decisions, not just fixed actions. In the COCO problem, we also aim to ensure a sublinear cumulative constraint violation.

One subcategory of OCO problems is adversarial contextual bandits (Auer et al., 2002; Beygelzimer et al., 2011). In that setting, the learner receives contextual information from the environment, selects one of K available actions, and only observes the loss of the chosen action. The learner aims to minimize her cumulative loss. Sun et al. (2017) introduced sequential risk constraints in contextual bandits, where, in addition to the loss for each action, the environment generates a risk for each action. In addition to minimizing the cumulative loss, the learner wants to keep the average cumulative risk bounded by a predefined safety threshold.

Recent work in OCO has considered settings in which the adversary is predictable—i.e., not entirely arbitrary—aiming to obtain improved regret bounds (Chiang et al., 2012; Rakhlin and Sridharan, 2013a, b; Mohri and Yang, 2016; Joulani et al., 2017). These works show that the regret improves from O(\sqrt{T}) to O(\sqrt{{\cal E}_{T}(f)}), where {\cal E}_{T}(f) is a measure of the cumulative prediction error. The optimistic framework has also been studied in the COCO setting by Qiu et al. (2023), who focused on time-invariant constraints (\forall t, g_{t}:=g); time-varying constraints were studied by Anderson et al. (2022), who established bounds for specific cases (e.g., perfect loss predictions, linear constraints).

In the current paper we go beyond earlier work to consider the case of adversarial constraints. Our main contribution is the following: We present the first algorithm to solve COCO problems in which the constraints are adversarial but also predictable, achieving O(\sqrt{{\cal E}_{T}(f)}) regret and \tilde{O}(\sqrt{{\cal E}_{T}(g^{+})}) constraint violation in the general convex case. More precisely:

  1. We present a meta-algorithm that, when built on an optimistic OCO algorithm, achieves O(\sqrt{{\cal E}_{T}(f)}) regret and \tilde{O}(\sqrt{{\cal E}_{T}(g^{+})}) constraint violation, which matches the best COCO algorithm of Sinha and Vaze (2024) in the worst case.

  2. Our algorithm is computationally efficient, as it relies only on a projection onto the simpler set {\cal X}_{0} at each time step, instead of full convex optimization steps.

  3. Furthermore, the same meta-algorithm can be used to prove dynamic regret guarantees \tilde{O}(\sqrt{P_{T}{\cal E}_{T}(f)}) with similar constraint violation guarantees \tilde{O}(\sqrt{P_{T}{\cal E}_{T}(g^{+})}).

  4. Finally, we show that our method can be used to solve the adversarial contextual bandits problem with sequential risk constraints, providing O(\sqrt{{\cal E}_{T}(f)}T^{1/3}) regret and O(\sqrt{{\cal E}_{T}(g^{+})}T^{1/3}) constraint violation.

Our theoretical framework exploits state-of-the-art methods from both optimistic OCO and constrained OCO.

The rest of the paper is structured as follows: We present related work in Section 2, introduce the main assumptions and notation in Section 3, and present the meta-algorithm for the COCO problem in Section 4. We then show how the meta-algorithm yields static regret guarantees in Section 5 and dynamic regret guarantees in Section 6, and present its application to the experts setting in Section 7 and to contextual bandits in Section 8.

Reference | Complexity per round | Constraints | Loss Function | Regret | Violation
Guo et al. (2022) | Conv-OPT | Fixed | Convex | O(\sqrt{T}) | O(1)
 | | Adversarial | Convex | O(\sqrt{T}) | O(T^{3/4})
 | | Adversarial | Convex (D) | O(P_{T}\sqrt{T}) | O(T^{3/4})
Yi et al. (2023) | Conv-OPT | Adversarial | Convex | O(T^{\max(c,1-c)}) | O(T^{1-c/2})
Sinha and Vaze (2024) | Projection | Adversarial | Convex | O(\sqrt{T}) | O(\sqrt{T}\log T)
Qiu et al. (2023) | Projection | Fixed | Convex, Slater | O(\sqrt{V_{T}(f)}) | O(1)
Anderson et al. (2022) | Projection | Adversarial | Convex, Perfect predictions | O(1) | O(\sqrt{T})
Muthirayan et al. (2022) | Conv-OPT | Known | Convex | O(\sqrt{D_{T}(f)}) | O(\sqrt{T})
Sun et al. (2017) | Projection | Contextual Bandits | — | O(\sqrt{T}) | O(T^{3/4})
Ours | Projection | Adversarial | Convex | O(\sqrt{{\cal E}_{T}(f)}) | O(\sqrt{{\cal E}_{T}(g^{+})}\log T)
 | | Adversarial | Convex (D) | O(\sqrt{P_{T}{\cal E}_{T}(f)}) | O(\sqrt{P_{T}{\cal E}_{T}(g^{+})}\log T)
 | | Contextual Bandits | — | \tilde{O}(\sqrt{{\cal E}_{T}(f)}T^{1/3}) | \tilde{O}(\sqrt{{\cal E}_{T}(g^{+})}T^{1/3})
Table 1: Comparison with the most recent constrained OCO work. c\in[0,1] is a parameter of the algorithm. "Conv-OPT" refers to algorithms that perform constrained convex optimization at every round. {\cal E}_{T}(f) and {\cal E}_{T}(g^{+}) are measures of the prediction error. V_{T}(f)=\sum_{t=2}^{T}\sup_{x}||\nabla f_{t}(x)-\nabla f_{t-1}(x)||_{\star}^{2}; note that when the prediction is the previous loss, {\cal E}_{T}(f)\leq V_{T}(f). D_{T}(f):=\sum_{t=1}^{T}||\nabla f_{t}(x_{t})-M_{t}||_{\star}^{2}, where M_{t} is a guess of \nabla f_{t}(x_{t}). Since x_{t} is unknown when constructing M_{t}, bounding in terms of {\cal E}_{T}(f) provides better and more general results than using D_{T}(f); for linear \hat{f}, these quantities are equal: {\cal E}_{T}(f)=D_{T}(f). (D) refers to a dynamic regret guarantee, with P_{T}=\sum_{t=1}^{T-1}||u_{t+1}-u_{t}|| the path length of the feasible comparator sequence. For contextual bandits, K is the number of actions and M the number of experts. Note that the criterion for constraint violation in Sun et al. (2017) is strictly weaker than ours.

2 Related Work

Unconstrained OCO

The OCO problem was introduced by Zinkevich (2003), who established O(\sqrt{T}) static regret and O(\sqrt{T}(1+P_{T})) dynamic regret guarantees based on projected online gradient descent (OGD), where P_{T} is the path length of the comparator sequence. Hazan (2023) and Orabona (2019) provide overviews of the burgeoning literature that has emerged since Zinkevich's seminal work, in particular focusing on online mirror descent (OMD) as a general way to solve OCO problems. Zhang et al. (2018) later improved the dynamic regret bound to O(\sqrt{T(1+P_{T})}).

Optimistic OCO

Optimistic OCO is often formulated as a problem involving gradual variation—i.e., where \nabla f_{t} and \nabla f_{t-1} are close in some appropriate metric. Chiang et al. (2012) exploit this assumption in an optimistic version of OMD that incorporates a prediction based on the most recent gradient, and establish a regret bound of O(\sqrt{V_{T}}) where V_{T}=\sum_{t=2}^{T}\sup_{x\in{\cal X}}||\nabla f_{t}(x)-\nabla f_{t-1}(x)||_{\star}^{2}. Later works (Rakhlin and Sridharan, 2013a, b; Steinhardt and Liang, 2014; Mohri and Yang, 2016; Joulani et al., 2017; Bhaskara et al., 2020) prove that, when using a predictor M_{t} that is not necessarily the past gradient, one can achieve regret of the form O\left(\sqrt{D_{T}}\right), where D_{T}:=\sum_{t=1}^{T}||\nabla f_{t}(x_{t})-M_{t}||_{\star}^{2}. The dynamic regret case has been studied intensively (Jadbabaie et al., 2015; Scroccaro et al., 2023), with the best bound (Zhao et al., 2020, 2024) being O(\sqrt{(1+P_{T}+V_{T})(1+P_{T})}).

Constrained OCO

Constrained OCO was first studied in the context of time-invariant constraints; i.e., where g_{t}:=g for all t. In this setup one can employ projection-free algorithms, avoiding the potentially costly projection onto the set {\cal X}=\{x\in{\cal X}_{0},g(x)\leq 0\}, by allowing the learner to violate the constraints in a controlled way (Mahdavi et al., 2012; Jenatton et al., 2016; Yu and Neely, 2020). The case of time-varying constraints is significantly harder, as the constraints g_{t} are potentially adversarial. Most of the early work studying such constraints (Neely and Yu, 2017; Yi et al., 2023) accordingly incorporated an additional Slater condition: \exists\check{x}\in{\cal X},\nu>0,\forall t,\;g_{t}(\check{x})\leq-\nu. These papers establish regret guarantees that grow with \nu^{-1}, which unfortunately can be vacuous as \nu can be arbitrarily small. Hutchinson and Alizadeh (2024) studied the setting with time-varying constraints such that the constraint sets ({\cal X}_{t}:=\{x\in{\cal X}_{0},g_{t}(x)\leq 0\}) are monotone, i.e., {\cal X}_{0}\subseteq{\cal X}_{1}\subseteq\dots\subseteq{\cal X}_{T}, and established an O(\sqrt{P_{T}T}) dynamic regret when P_{T} is known beforehand. Guo et al. (2022) presented an algorithm that does not require the Slater condition and yielded improved bounds, achieving O(\sqrt{T}) static regret, O(P_{T}\sqrt{T}) dynamic regret and O(T^{3/4}) constraint violation for unknown P_{T}. However, it requires solving a convex optimization problem at each time step. In more recent work, Sinha and Vaze (2024) presented a simple and efficient algorithm that solves the problem with just a projection and obtained state-of-the-art guarantees: O(\sqrt{T}) regret and O(\sqrt{T}\log(T)) constraint violation. See Table 1 for a comparison of our results with previous work.

Optimistic COCO

Qiu et al. (2023) studied the case with gradual variations and time-invariant constraints, proving an O(\sqrt{V_{T}}) regret guarantee and an O(1) constraint violation. Muthirayan et al. (2022) tackled time-varying but known constraints with predictions, proving a regret guarantee of O(\sqrt{D_{T}}) and a cumulative constraint violation of O(\sqrt{T}). Under perfect loss predictions, Anderson et al. (2022) demonstrated an O(1) bound on regret and an O(\sqrt{T}) bound on constraint violation. We also include these results in Table 1 for comparison.

Adversarial Contextual Bandits

The adversarial contextual bandit problem was first introduced by Auer et al. (2002), who proposed EXP4, achieving optimal O(\sqrt{T}) expected regret. Wei et al. (2020) later advanced the field by incorporating predictions, achieving O(\sqrt{{\cal E}_{T}(f)}T^{1/4}) regret when {\cal E}_{T}(f) is known beforehand, an improvement over EXP4 when {\cal E}_{T}(f)=o(\sqrt{T}). For unknown {\cal E}_{T}(f), they developed an algorithm with O(\sqrt{{\cal E}_{T}(f)}T^{1/3}) expected regret. Sun et al. (2017) extended this setting to include sequential risk constraints (analogous to constrained OCO), developing a modified EXP4 that achieves O(\sqrt{T}) regret with O(T^{3/4}) total risk violation.

3 Preliminaries

3.1 Problem setup and notation

Let \mathbb{R} denote the set of real numbers, and let \mathbb{R}^{d} denote the set of d-dimensional real vectors. Let {\cal X}_{0}\subseteq\mathbb{R}^{d} denote the set of possible actions of the learner, where x\in{\cal X}_{0} is a specific action, and let ||\cdot|| be a norm defined on {\cal X}_{0}. Let the dual norm be denoted as ||\theta||_{\star}:=\max_{x,||x||=1}\langle\theta,\ x\rangle.

Online learning is a problem formulation in which the learner plays the following game with Nature. At each step tt:

  1. The learner plays an action x_{t}\in{\cal X}_{0}.

  2. Nature reveals a loss function f_{t}:{\cal X}_{0}\to\mathbb{R} and a constraint function g_{t}:{\cal X}_{0}\to\mathbb{R}. (If we have multiple constraint functions {\mathbf{g}}_{t,k}, we set g_{t}:=\max_{k}{\mathbf{g}}_{t,k}.)

  3. The learner suffers the loss f_{t}(x_{t}) and the constraint violation g_{t}(x_{t}).

In standard OCO, the loss function f_{t} is convex, and the goal of the learner is to minimize the regret with respect to an oracle action u, where:

\text{Regret}_{T}(u):=\sum_{t=1}^{T}f_{t}(x_{t})-f_{t}(u). (1)

In COCO, we generalize the OCO problem to additionally ask the learner to obtain a small cumulative constraint violation, which we denote as \text{CCV}_{T}:

\text{CCV}_{T}:=\sum_{t=1}^{T}g_{t}^{+}(x_{t})\quad\text{where}\quad g_{t}^{+}(x):=\max\{0,g_{t}(x)\}. (2)

Overall, the goal of the learner is to achieve both sublinear regret, with respect to any action u in the oracle set, and sublinear CCV. This is a challenging problem: indeed, Mannor et al. (2009) proved that no algorithm can achieve both sublinear regret and sublinear cumulative constraint violation for the oracle set {\cal X}^{\text{max}}:=\{x\in{\cal X}_{0},\sum_{t=1}^{T}g_{t}(x)\leq 0\}. However, it is possible to find algorithms that achieve sublinear regret for the smaller set {\cal X}:=\{x\in{\cal X}_{0},g_{t}(x)\leq 0,\;\forall t\in[T]\}, and this latter problem is our focus.

In addition, we assume that at the end of step t, the learner can make predictions \hat{f}_{t+1} and \hat{g}_{t+1}. More precisely, we are interested in predictions of the gradients, and, for any function h, we denote by \nabla\hat{h}_{t} the prediction of the gradient of h_{t}. We abuse notation and denote by \hat{h} the function whose gradient is \nabla\hat{h}_{t}. Moreover, we define the following prediction errors:

\varepsilon_{t}(h):=||\nabla h_{t}(x_{t})-\nabla\hat{h}_{t}(x_{t})||_{\star}^{2},\qquad{\cal E}_{t}(h):=\sum_{\tau=1}^{t}\varepsilon_{\tau}(h), (3)

where (x_{t})_{t=1\dots T} is the sequence of actions taken by the learner.
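The following minimal sketch (our own illustration, not from the paper) shows how the quantities in (3) can be tracked online, assuming the Euclidean norm so that the dual norm is also the 2-norm:

import numpy as np

def instantaneous_error(grad_true, grad_pred):
    """eps_t(h) = ||grad h_t(x_t) - grad hhat_t(x_t)||_*^2 with ||.||_* = ||.||_2."""
    return float(np.linalg.norm(grad_true - grad_pred) ** 2)

class PredictionErrorTracker:
    """Accumulates E_t(h) = sum_{tau <= t} eps_tau(h) online."""
    def __init__(self):
        self.total = 0.0
    def update(self, grad_true, grad_pred):
        self.total += instantaneous_error(grad_true, grad_pred)
        return self.total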

Additionally, for a given \beta-strongly convex function R, we define the Bregman divergence between two points:

B^{R}(x;y):=R(x)-R(y)-\langle\nabla R(y),\ x-y\rangle. (4)

Two special cases are particularly important (a small illustration in code follows the list):

  1. When R(x):=\frac{1}{2}||x||_{2}^{2}, the Bregman divergence is half the squared Euclidean distance, B^{R}(x;y)=\frac{1}{2}||y-x||_{2}^{2}, ||\cdot||=||\cdot||_{\star}=||\cdot||_{2}, and \beta=1.

  2. When R(x):=\sum_{i=1}^{d}x_{i}\log x_{i} (the negative entropy), the Bregman divergence is the KL divergence: B^{R}(x;y)=D_{\text{KL}}(x;y):=\sum_{i=1}^{d}x_{i}\log\frac{x_{i}}{y_{i}}, ||\cdot||=||\cdot||_{1}, ||\cdot||_{\star}=||\cdot||_{\infty}, and \beta=1.

3.2 Assumptions

Throughout this paper, we will use various combinations of the following assumptions.

Assumption 1 (Convex set, loss and constraints).

We make the following standard assumptions on the action set, the losses, and the constraints:

  1. {\cal X}_{0} is closed, convex and bounded with diameter D.

  2. \forall t, f_{t} is convex and differentiable.

  3. \forall t, g_{t} is convex and differentiable.

Assumption 2 (Bounded losses).

The loss functions f_{t} are bounded by F and the constraints g_{t} are bounded by G.

Assumption 3 (Feasibility).

The set {\cal X} is not empty.

Assumption 4 (Prediction Sequence Regularity).

For any t, the gradient of the loss prediction \nabla\hat{f}_{t} and the gradient of the constraint prediction \nabla\hat{g}_{t} are \hat{L}^{f}_{t}- and \hat{L}^{g}_{t}-Lipschitz, respectively. That is, \forall x,y\in{\cal X}_{0}, we have:

||\nabla\hat{f}_{t}(x)-\nabla\hat{f}_{t}(y)||_{\star} \leq \hat{L}^{f}_{t}||x-y||,
||\nabla\hat{g}_{t}(x)-\nabla\hat{g}_{t}(y)||_{\star} \leq \hat{L}^{g}_{t}||x-y||.

We abuse notation and let \hat{L}^{f}_{t}:=\max_{\tau\leq t}\hat{L}^{f}_{\tau}, and similarly for \hat{L}^{g}_{t}. Finally, denote \hat{L}^{f}:=\hat{L}^{f}_{T} and similarly for \hat{L}^{g}.

Assumptions 1, 2, and 3 are standard in COCO (Mahdavi et al., 2012; Jenatton et al., 2016; Yu and Neely, 2020; Qiu et al., 2023; Yi et al., 2023; Guo et al., 2022). Most works on OCO with predictive sequences either assume that the prediction is the previous loss function (Chiang et al., 2012; Qiu et al., 2023; D'Orazio and Huang, 2021) or that the learner only predicts a single vector M_{t} estimating \nabla f_{t}(x_{t}) (Rakhlin and Sridharan, 2013a, b; Muthirayan et al., 2022). We expand this by predicting the entire loss gradient, making a smoothness assumption that relates \nabla\hat{f}_{t}(x_{t}) to its values at nearby points. When using the latest observed function as the prediction, Assumption 4 is equivalent to assuming that the gradients \nabla f_{t} and \nabla g_{t} are Lipschitz, as in Chiang et al. (2012); Qiu et al. (2023). Moreover, Assumption 4 is automatically satisfied when the prediction is a single vector.

4 Meta-Algorithm for Optimistic COCO

Algorithm 1 Optimistic meta-algorithm for COCO
1: x_{1}\in{\cal X}_{0}, \lambda>0, Q_{0}=0, OCO algorithm {\cal A}.
2: for round t=1\dots T do
3:     Play action x_{t}, receive f_{t} and g_{t}.
4:     Compute {\cal L}_{t} defined in (5).
5:     Update Q_{t+1}=Q_{t}+g^{+}_{t}(x_{t}).
6:     Compute prediction \hat{\cal L}_{t+1} as in (6).
7:     Update x_{t+1}:={\cal A}_{t}(x_{t},{\cal L}_{1},\dots,{\cal L}_{t},\hat{\cal L}_{t+1}).
8: end for

Our meta-algorithm is inspired by Sinha and Vaze (2024). The main idea of that paper is to build a surrogate loss function {\cal L}_{t} that can be seen as a Lagrangian of the optimization problem

\min_{x\in{\cal X}_{0}}f_{t}(x)\quad\text{s.t.}\quad g_{t}(x)\leq 0.

The learner then runs AdaGrad (Duchi et al., 2011) on the surrogate, with a theoretical guarantee of bounded cumulative constraint violation (CCV) and Regret.

Our meta-algorithm is based on the use of optimistic methods, such as those presented in the subsequent sections (Sections 5, 6 and 7), which allows us to obtain stronger bounds that depend on the prediction quality. Presented in Algorithm 1, this algorithm assumes that at the end of every step t, the learner makes predictions \hat{f}_{t+1} and \hat{g}_{t+1} of the upcoming loss f_{t+1} and constraint violation g_{t+1}^{+}. (We are actually only interested in the predictions of the gradients, but for simplicity we let \hat{h} denote any function whose gradient is the prediction \nabla\hat{h}_{t}.) At each time step t, the learner forms a surrogate loss function, defined via a convex Lyapunov function \Phi:\mathbb{R}_{+}\to\mathbb{R}_{+}, where \Phi is monotonically increasing and \Phi(0)=0. Specifically:

{\cal L}_{t}(x)=f_{t}(x)+\Phi^{\prime}(Q_{t})g^{+}_{t}(x). (5)

Using the predictions \hat{f} and \hat{g}, we form a prediction \hat{\cal L}_{t+1} of the Lagrangian, where \hat{\cal L}_{t} is defined in Equation 6:

\hat{\cal L}_{t}(x)=\hat{f}_{t}(x)+\Phi^{\prime}(Q_{t})\hat{g}_{t}(x). (6)

In Sinha and Vaze (2024), the update is Q_{t}=Q_{t-1}+g^{+}_{t}(x_{t}), but using \hat{\cal L}_{t+1} at time t would require Q_{t+1} to be known at the end of step t. We instead define the following delayed update:

Q_{t+1}=Q_{t}+g^{+}_{t}(x_{t}),\quad\text{with }Q_{0}=Q_{1}=0. (7)

The learner then executes step t of algorithm {\cal A}, denoted {\cal A}_{t} in Algorithm 1. We then have the following lemma relating the regret on f, the CCV, and the regret of {\cal A} on {\cal L}.
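As a schematic sketch (not the authors' implementation), one round of Algorithm 1 can be written as follows, assuming the exponential potential \Phi(Q)=\exp(\lambda Q)-1 used in Theorem 7 and a user-supplied optimistic OCO step `A_step`:

import math

def meta_round(x_t, Q_t, f_t, g_t, f_hat_next, g_hat_next, lam, A_step):
    """One round of the optimistic meta-algorithm; all function arguments are callables,
    and A_step(x_t, L_t, L_hat_next) -> x_{t+1} is step t of the OCO oracle A."""
    Phi_prime = lambda Q: lam * math.exp(lam * Q)                   # Phi'(Q)
    g_plus = lambda z: max(g_t(z), 0.0)
    coeff_t = Phi_prime(Q_t)
    L_t = lambda z, c=coeff_t: f_t(z) + c * g_plus(z)               # surrogate loss (5)
    Q_next = Q_t + g_plus(x_t)                                      # delayed queue update (7)
    coeff_next = Phi_prime(Q_next)
    L_hat_next = lambda z, c=coeff_next: f_hat_next(z) + c * g_hat_next(z)  # prediction (6)
    x_next = A_step(x_t, L_t, L_hat_next)
    return x_next, Q_next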

Lemma 5 (Regret decomposition).

For any OCO algorithm {\cal A}, if \Phi is a Lyapunov potential function, then for any t\geq 1 and any u\in{\cal X},

\Phi(Q_{t+1})-\Phi(Q_{1})+\text{Regret}_{t}(u)\leq\text{Regret}_{t}^{\cal A}(u;\;{\cal L}_{1\dots t})+S_{t}, (8)

where S_{t}=\sum_{\tau=1}^{t}g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau})), and \text{Regret}_{t}^{\cal A}(u;\;{\cal L}_{1\dots t}) is the regret of algorithm {\cal A} run on the sequence of losses {\cal L}_{1},\dots,{\cal L}_{t}.

Proof  By convexity of \Phi, for any \tau\geq 1:

\Phi(Q_{\tau+1}) \leq \Phi(Q_{\tau})+\Phi^{\prime}(Q_{\tau+1})(Q_{\tau+1}-Q_{\tau}) = \Phi(Q_{\tau})+\Phi^{\prime}(Q_{\tau+1})\,g^{+}_{\tau}(x_{\tau}).

Let u\in{\cal X}; then by definition g^{+}_{\tau}(u)=0 for all \tau\geq 1, thus

\Phi(Q_{\tau+1})-\Phi(Q_{\tau})+(f_{\tau}(x_{\tau})-f_{\tau}(u))
\leq \Phi^{\prime}(Q_{\tau+1})g^{+}_{\tau}(x_{\tau})+(f_{\tau}(x_{\tau})-f_{\tau}(u))
\leq f_{\tau}(x_{\tau})+\Phi^{\prime}(Q_{\tau})g^{+}_{\tau}(x_{\tau})-\big(f_{\tau}(u)+\Phi^{\prime}(Q_{\tau})g^{+}_{\tau}(u)\big)+g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau}))
\leq {\cal L}_{\tau}(x_{\tau})-{\cal L}_{\tau}(u)+g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau})).

Summing \tau from 1 to t:

Φ(Qt+1)Φ(Q1)+Regrett(u)Regrett𝒜(u;1t)+St,\Phi(Q_{t+1})-\Phi(Q_{1})+\text{Regret}_{t}(u)\leq\text{Regret}_{t}^{\cal A}(u;\;{\cal L}_{1\dots t})+S_{t},

where

S_{t}=\sum_{\tau=1}^{t}g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau})).

 

In the following we assume that the underlying optimistic OCO algorithm has standard regret guarantees, which we express in terms of a functional \psi that takes as input a sequence of functions and returns a constant. For simplicity, we write \psi(h_{1\dots t}):=\psi_{t}(h). An example is \psi_{t}(h)=\hat{L}^{h}_{t}, the Lipschitz constant of \nabla\hat{h}_{t}.

With this assumption and the previous lemma, we can present our main result.

Assumption 6 (Regret of optimistic OCO).

The optimistic OCO algorithm {\cal A} has the following regret guarantee: there is a constant C\in\mathbb{R} and a sublinear functional \psi such that for any sequence of functions ({\cal L}_{t})_{t=1\dots T} and any u\in{\cal X}_{0},

\text{Regret}_{t}^{\cal A}(u;{\cal L}_{1\dots t})\leq C\left(\sqrt{{\cal E}_{t}({\cal L})}+\psi_{t}({\cal L})\right). (9)

We allow C to depend on T and on other constants of the problem, as long as they are known at the beginning of algorithm {\cal A}.

Theorem 7 (Optimistic COCO regret and CCV guarantees).

Consider the following assumptions:

  a. {\cal L}_{t} and \hat{\cal L}_{t} satisfy the assumptions of algorithm {\cal A} for all t.

  b. Assumptions 1, 2, and 3.

  c. {\cal A} satisfies Assumption 6.

  d. \Phi(Q):=\exp(\lambda Q)-1, with \lambda=\left(2C\left(\sqrt{2{\cal E}_{T}(g^{+})}+\psi_{T}(g^{+})\right)+2G\right)^{-1}.

Under these assumptions, Algorithm 1 has the following regret and CCV guarantees: \forall T\geq 1,\forall u\in{\cal X},\forall t\in[T],

\text{Regret}_{t}(u) = O\left(\sqrt{{\cal E}_{t}(f)}\right), (10)
\text{CCV}_{T} = O\left(\sqrt{{\cal E}_{T}(g^{+})}\log T\right). (11)

We present a sketch of the main ideas here, with the detailed proof deferred to Appendix A. First, using the sublinearity of the square root and the fact that QtQ_{t} is non-decreasing, we can show that:

\sqrt{{\cal E}_{t}({\cal L})}\leq\sqrt{2{\cal E}_{t}(f)}+\Phi^{\prime}(Q_{t})\sqrt{2{\cal E}_{t}(g^{+})}. (12)

Then, using (12) and the sublinearity of \psi, we can further upper bound the regret on {\cal L} in Assumption 6:

\text{Regret}_{t}^{\cal A}(u;\;{\cal L}_{1\dots t})\leq C\left(\sqrt{2{\cal E}_{T}(f)}+\psi_{t}(f)\right)+\lambda\exp(\lambda Q_{t+1})C\left(\sqrt{2{\cal E}_{t}(g^{+})}+\psi_{t}(g^{+})\right). (13)

In addition, we can upper bound S_{t} by using Assumption 2 and the monotonicity of Q_{t}:

S_{t}\leq G\lambda\exp(\lambda Q_{t+1}). (14)

We can then use (13) and (14) in Lemma 5, and after rearranging terms, we obtain

\text{Regret}_{t}(u)\leq\left(\frac{\lambda}{\lambda^{\star}}-1\right)\exp(\lambda Q_{t+1})+1+C(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f)), (15)

where \lambda^{\star}=\left(C\left(\sqrt{2{\cal E}_{T}(g^{+})}+\psi_{T}(g^{+})\right)+G\right)^{-1}. We obtain

\text{Regret}_{t}(u)\leq C\left(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f)\right)+1=O\left(\sqrt{{\cal E}_{t}(f)}\right).

To establish an upper bound on CCV, we leverage the fact that \text{Regret}_{T}(u)\geq-2FT (from Assumption 2), which when applied to (15) yields

\exp(\lambda Q_{T+1})(1-\lambda/\lambda^{\star})\leq C\left(\sqrt{2{\cal E}_{T}(f)}+\psi_{T}(f)\right)+2FT+1.

If \lambda<\lambda^{\star}, then

\text{CCV}_{T}=Q_{T+1}\leq\frac{1}{\lambda}\log\left(\frac{C(\sqrt{2{\cal E}_{T}(f)}+\psi_{T}(f))+2FT+1}{1-\lambda/\lambda^{\star}}\right).

Finally, by setting \lambda=\lambda^{\star}/2, we obtain

\text{CCV}_{T}\leq O\left(\sqrt{{\cal E}_{T}(g^{+})}\log(T)\right).
Remark 8.

As in Syrgkanis et al. (2015), we can use the doubling trick to adjust \lambda online at the cost of an additional log term. We provide details in Appendix B.

Remark 9.

If we have n constraint functions {\mathbf{g}}_{t,k} with k\in[n], we can set g_{t}:=\max_{k}{\mathbf{g}}_{t,k}. Alternatively, we can maintain multiple queues, one for each k: Q_{{t+1},k}=Q_{t,k}+{\mathbf{g}}_{t,k}(x_{t}), one \lambda_{k} for each k, and set \Phi_{k}(x)=\exp(\lambda_{k}x)-1. Finally, define:

{\cal L}(x)=f_{t}(x)+\sum_{k=1}^{n}\Phi_{k}^{\prime}(Q_{t,k}){\mathbf{g}}_{t,k}^{+}(x).

Then we can follow the exact same proof to show a regret guarantee:

\text{Regret}_{t}(u)\leq O\left(\sqrt{(n+1){\cal E}_{T}(f)}\right),

and a CCV guarantee:

\text{CCV}_{T}\leq O\left(\sqrt{(n+1){\cal E}_{T}(g^{+})}\log(T)\right).

The term in \sqrt{n+1} comes from:

\sqrt{{\cal E}_{t}({\cal L})}\leq\sqrt{(n+1){\cal E}_{t}(f)}+\sum_{k=1}^{n}\Phi_{k}^{\prime}(Q_{t,k})\sqrt{(n+1){\cal E}_{t}({\mathbf{g}}_{k}^{+})},

with {\cal E}_{T}({\mathbf{g}}_{k}^{+}) being the prediction error of the sequence {\mathbf{g}}_{t,k}^{+}.

5 Static Regret guarantees

In this section, we first introduce some of the foundational optimistic algorithms used for OCO, then show how we can achieve sublinear static regret and CCV with our meta-algorithm.

Optimistic OMD and Optimistic AdaGrad

Algorithm 2 Optimistic Online Mirror Descent (Rakhlin and Sridharan, 2013b)
1: Sequence \eta_{t}>0, x_{1}.
2: Initialize \eta_{1}.
3: for round t=1\dots T do
4:     Play action x_{t}, receive {\cal L}_{t}. Compute l_{t}=\nabla{\cal L}_{t}(x_{t}).
5:     Compute \eta_{t+1}.
6:     \tilde{x}_{t+1}:=\arg\min_{x\in{\cal X}_{0}}\langle l_{t},\ x\rangle+\frac{1}{\eta_{t}}B^{R}(x;\tilde{x}_{t}).
7:     Make prediction \hat{l}_{t+1}=\nabla\hat{\cal L}_{t+1}(\tilde{x}_{t+1}).
8:     x_{t+1}:=\arg\min_{x\in{\cal X}_{0}}\langle\hat{l}_{t+1},\ x\rangle+\frac{1}{\eta_{t+1}}B^{R}(x;\tilde{x}_{t+1}).
9: end for

This approach modifies the standard online mirror descent (OMD) algorithm introduced by Zinkevich (2003). OMD, which generalizes projected gradient descent, iteratively steps towards minimizing the most recently observed loss function. The optimistic OMD variant includes a supplementary minimization step using the predicted function, enabling faster convergence to optimality when predictions are accurate. Note that the algorithm is computationally efficient. Indeed, a mirror step x^{\star}=\arg\min_{x\in{\cal X}_{0}}\langle l,\ x\rangle+\frac{1}{\eta}B^{R}(x;z) can be computed in two steps:

  1. Compute y such that \nabla R(y)=\nabla R(z)-\eta l. In particular, if \nabla R is invertible, y=(\nabla R)^{-1}(\nabla R(z)-\eta l).

  2. Let x^{\star}=\Pi_{{\cal X}_{0},R}(y):=\arg\min_{x\in{\cal X}_{0}}B^{R}(x;y).

The following two are special cases of OMD (a short code sketch follows the list):

  1. When ||\cdot||=||\cdot||_{\star}=||\cdot||_{2} and R(x)=\frac{1}{2}||x||_{2}^{2}, this is simply projected gradient descent, x^{\star}=\Pi_{{\cal X}_{0}}\left(z-\eta\ell\right).

  2. When {\cal X}_{0}=\Delta_{d}, the d-dimensional simplex, with R being the negative entropy, x^{\star}_{i}=\frac{z_{i}}{Z}\exp(-\eta l_{i}), where Z is a normalizing factor ensuring ||x^{\star}||_{1}=1.
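A minimal numpy sketch of the two special cases above, assuming {\cal X}_{0} is the unit Euclidean ball in the first case and the probability simplex in the second (these domains are illustrative choices, not fixed by the paper):

import numpy as np

def mirror_step_euclidean(z, grad, eta):
    """R(x) = 0.5*||x||_2^2: gradient step followed by projection onto the unit ball."""
    y = z - eta * grad
    norm = np.linalg.norm(y)
    return y if norm <= 1.0 else y / norm

def mirror_step_entropic(z, grad, eta):
    """R = negative entropy on the simplex: multiplicative update then renormalization."""
    x = z * np.exp(-eta * grad)
    return x / x.sum()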

Theorem 10 establishes our algorithm's regret bounds. Our analysis extends beyond the vector-based predictions of Rakhlin and Sridharan (2013b) to handle functional predictions, incorporating techniques from Chiang et al. (2012). This extension introduces a dependence on the Lipschitz coefficient. We express our bounds in terms of \varepsilon_{t}({\cal L}) rather than ||\nabla\hat{\cal L}_{t}(\tilde{x}_{t})-\nabla{\cal L}_{t}(x_{t})||_{\star}^{2}, a crucial distinction since \varepsilon_{t}({\cal L}) vanishes with perfect predictions, while ||\nabla\hat{\cal L}_{t}(\tilde{x}_{t})-\nabla{\cal L}_{t}(x_{t})||_{\star}^{2} may not. This issue has been highlighted before by Scroccaro et al. (2023), who state their regret guarantees in terms of ||\nabla\hat{\cal L}_{t}(\tilde{x}_{t-1})-\nabla{\cal L}_{t}(x_{t-1})||_{\star}^{2}. This requires knowing the Lipschitz coefficient of \nabla{\cal L}_{t}, which is standard in OCO, but we prefer a dependency on the coefficient of \nabla\hat{\cal L}_{t}, as the learner has control over it.

Theorem 10 (Optimistic Adagrad, adapted from Rakhlin and Sridharan (2013b), Corollary 2).

Under the assumptions:

  a. Assumption 1,

  b. for any t, \nabla\hat{\cal L}_{t} is \hat{L}^{{\cal L}}_{t}-Lipschitz, where \hat{L}^{{\cal L}}_{t}\leq\hat{L}^{{\cal L}}_{t+1},

  c. for any t, \hat{L}^{{\cal L}}_{t}\leq\frac{\beta}{\eta_{t}},

  d. for any t\in[T], \eta_{t+1}\leq\eta_{t},

for any u\in{\cal X}_{0} and any t\geq 1,

\text{Regret}_{t}(u)\leq\frac{2B_{t}}{\eta_{t+1}}+\sum_{\tau=1}^{t}\frac{\eta_{\tau+1}}{\beta}\varepsilon_{\tau}({\cal L}), (16)

where B_{t}\geq\max_{\tau\in[t],x\in{\cal X}_{0}}B^{R}(x;\tilde{x}_{\tau}). If \eta_{t} is

\eta_{t}=\min\left\{\frac{\sqrt{\beta B}}{\sqrt{{\cal E}_{t-1}({\cal L})}+\sqrt{{\cal E}_{t-2}({\cal L})}},\frac{\beta}{\hat{L}^{{\cal L}}_{t}}\right\}, (17)

with B:=B_{T}, then for any t\geq 1, Algorithm 2 has regret

\text{Regret}_{t}(u)\leq 5\sqrt{\frac{B}{\beta}}\left(\sqrt{{\cal E}_{t}({\cal L})}+\sqrt{\frac{B}{\beta}}\hat{L}^{{\cal L}}_{t}\right)=O\left(\sqrt{{\cal E}_{t}({\cal L})}\vee\hat{L}^{{\cal L}}_{t}\right). (18)
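The adaptive step-size rule (17) is straightforward to compute; the following sketch assumes the caller tracks the running prediction errors {\cal E}_{t-1}({\cal L}), {\cal E}_{t-2}({\cal L}) and the constants B and \beta:

import math

def adagrad_step_size(B, beta, err_prev, err_prev2, L_hat_t):
    """eta_t = min( sqrt(beta*B)/(sqrt(E_{t-1}) + sqrt(E_{t-2})), beta/L_hat_t )."""
    denom = math.sqrt(err_prev) + math.sqrt(err_prev2)
    adaptive = math.inf if denom == 0.0 else math.sqrt(beta * B) / denom
    return min(adaptive, beta / L_hat_t)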

By using Algorithm 2 as the OCO algorithm {\cal A} in Algorithm 1, we obtain the following regret guarantee, as a direct consequence of Theorem 7 and Theorem 10:

Corollary 11 (Optimistic Adagrad COCO).

Consider the following assumptions:

  a. Assumption 4.

  b. {\cal A} is optimistic AdaGrad (Algorithm 2) with \hat{L}^{{\cal L}}_{t}=\hat{L}^{f}_{t}+\Phi^{\prime}(Q_{t})\hat{L}^{g^{+}}_{t}.

  c. \lambda and \Phi are set as in Theorem 7.

Under these assumptions, the meta-algorithm (Algorithm 1) has the following regret and constraint violation guarantees:

\text{Regret}_{T}(u)\leq O\left(\sqrt{{\cal E}_{T}(f)}\vee\hat{L}^{f}\right),
\text{CCV}_{T}\leq O\left(\left(\sqrt{{\cal E}_{T}(g^{+})}\vee\hat{L}^{g^{+}}\right)\log T\right). (19)

Alternatively, one can use optimistic Follow-the-Regularized-Leader (Rakhlin and Sridharan, 2013a; Mohri and Yang, 2016; Joulani et al., 2017) instead of Algorithm 2, which can be shown to have guarantees similar to Theorem 10.

Remark 12.

Even if g_{t} is fixed or known, we cannot achieve \text{CCV}_{T}\leq\tilde{O}(1) with this algorithm. This is because \nabla g^{+}_{t} does not satisfy Assumption 4 in the general case.

6 Dynamic Regret guarantees

Moving beyond a fixed baseline u\in{\cal X}, we can evaluate performance against a time-varying sequence \{u_{t}\}_{t=1\dots T}. Let P_{T} bound the path length: \sum_{t=1}^{T-1}||u_{t+1}-u_{t}||\leq P_{T}. Our objective is to bound the dynamic regret relative to this sequence:

\text{DynRegret}_{T}(u_{1:T}):=\sum_{t=1}^{T}f_{t}(x_{t})-\sum_{t=1}^{T}f_{t}(u_{t}). (20)
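For concreteness, the path length P_{T} of a comparator sequence can be computed as follows (a small illustration assuming the Euclidean norm; the paper's norm ||\cdot|| may differ):

import numpy as np

def path_length(comparators):
    """P_T = sum_{t=1}^{T-1} ||u_{t+1} - u_t||."""
    return float(sum(np.linalg.norm(comparators[t + 1] - comparators[t])
                     for t in range(len(comparators) - 1)))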

By utilizing Algorithm 2 with a slightly modified learning rate, we can achieve state-of-the-art dynamic regret guarantees when P_{T} is known. We will need the following additional assumption:

Assumption 13 (Lipschitz-like Bregman divergence).

\exists\gamma>0, \forall x,y,z\in{\cal X}_{0},

B^{R}(x;z)-B^{R}(y;z)\leq\gamma||x-y||.

This assumption is always satisfied if R is Lipschitz on {\cal X}_{0}, which holds in particular when R is a norm on the bounded set {\cal X}_{0}.

Theorem 14 (Dynamic Regret guarantees in OCO (Jadbabaie et al., 2015)).

Under the assumptions:

  a. Assumptions 1 and 13,

  b. for any t, \nabla\hat{\cal L}_{t} is \hat{L}^{{\cal L}}_{t}-Lipschitz, where \hat{L}^{{\cal L}}_{t}\leq\hat{L}^{{\cal L}}_{t+1},

  c. for any t, \hat{L}^{{\cal L}}_{t}\leq\frac{\beta}{\eta_{t}},

  d. for any t\in[T], \eta_{t+1}\leq\eta_{t},

for any sequence u_{1},\dots,u_{T}\in{\cal X}_{0} and any t\geq 1,

\text{DynRegret}_{t}(u_{1:t})\leq\frac{2B+\gamma P_{t}}{\eta_{t+1}}+\sum_{\tau=1}^{t}\frac{\eta_{\tau+1}}{\beta}\varepsilon_{\tau}({\cal L}), (21)

where B\geq\max_{\tau\in[T],x\in{\cal X}_{0}}B^{R}(x;\tilde{x}_{\tau}). By setting \eta_{t} as

\eta_{t}=\min\left\{\frac{\sqrt{\beta(2B+\gamma P_{T})}}{\sqrt{{\cal E}_{t-1}({\cal L})}+\sqrt{{\cal E}_{t-2}({\cal L})}},\frac{\beta}{\hat{L}^{{\cal L}}_{t}}\right\}, (22)

Algorithm 2 then has dynamic regret

\text{DynRegret}_{T}(u_{1:T})\leq 3\sqrt{\beta(2B+\gamma P_{T}){\cal E}_{T}({\cal L})}+\frac{2B+\gamma P_{T}}{\beta}\hat{L}^{{\cal L}}=O\left(\sqrt{P_{T}{\cal E}_{T}({\cal L})}+P_{T}\hat{L}^{{\cal L}}\right). (23)

We omit the proof; it combines elements from Jadbabaie et al. (2015) to add the term in P_{t}, and the proof of Theorem 10 to ensure the dependency on \varepsilon_{t}({\cal L}). We can now use this algorithm in Algorithm 1 to achieve dynamic regret and CCV guarantees in COCO. We first need the following definition:

Definition 15.

A sequence u_{1},\dots,u_{T} is admissible if \forall t, g_{t}(u_{t})\leq 0. We assume that there exists an admissible sequence.

Note that the existence of an admissible sequence is a much weaker assumption than Assumption 3.

Corollary 16 (Dynamic Regret in COCO).

Consider the following assumptions:

  a. Assumptions 4 and 13.

  b. The predictions \hat{g}_{t} are linear.

  c. {\cal A} is optimistic AdaGrad (Algorithm 2) with \hat{L}^{{\cal L}}_{t}=\hat{L}^{f}_{t} and the learning rate defined in (22).

  d. \Phi(x)=\exp(\lambda x)-1 with \lambda=\left(6\sqrt{\beta(2B+\gamma P_{T}){\cal E}_{T}(g^{+})}+2\right)^{-1}.

Under these assumptions, the meta-algorithm (Algorithm 1) has the following dynamic regret and constraint violation guarantees: for any admissible sequence u_{1},\dots,u_{T} of path length at most P_{T},

\text{DynRegret}_{T}(u_{1:T})\leq O\left(\sqrt{P_{T}{\cal E}_{T}(f)}+\hat{L}^{f}P_{T}\right),
\text{CCV}_{T}\leq O\left(\sqrt{P_{T}{\cal E}_{T}(g^{+})}\log T\right). (24)

The proof structure mirrors that of Theorem 7, but employs a version of Lemma 5 modified for dynamic regret analysis; we state this modified lemma in Appendix D. By using linear predictions for f, we can eliminate the term linear in P_{T} from the regret guarantee. When P_{T} is unknown but u_{t} is observable, we can achieve comparable dynamic regret using Algorithm 1 from Jadbabaie et al. (2015) combined with the doubling trick (Algorithm 4, Appendix B). While alternative approaches exist that do not require observing u_{t} (Scroccaro et al., 2023; Zhao et al., 2020, 2024), our doubling-trick implementation would still necessitate observability of the sequence.

7 Experts setting

In this setting, the agent has access to d experts and must form a distribution for selecting among them. She observes the loss of each expert and suffers an overall loss equal to the expected loss over the experts. Formally, we assume {\cal X}_{0}=\Delta_{d}, where d is the number of experts. At each step t, the learner selects x_{t}\in\Delta_{d}, a distribution over the experts, then observes the vector of losses \ell_{t}\in\mathbb{R}^{d} and the vector of constraints c_{t}\in\mathbb{R}^{d}. The learner then suffers the loss f_{t}(x_{t})=\langle\ell_{t},\ x_{t}\rangle and constraint g_{t}(x_{t})=\langle c_{t},\ x_{t}\rangle. Let \hat{\ell}_{t} denote the prediction of \ell_{t} and \hat{c}_{t} the prediction of c_{t}.

For the OCO case (i.e., without adversarial constraints), we could use Algorithm 2 with ||\cdot||=||\cdot||_{2}, but in the worst case B can be as large as O(d), resulting in a regret scaling as O(\sqrt{d}). We are instead able to achieve a scaling of O(\log(d)). Let ||\cdot||=||\cdot||_{1}; then ||\cdot||_{\star}=||\cdot||_{\infty}. In that case, the Bregman divergence is the KL divergence and \beta=1. However, the KL divergence is not upper bounded, as any x_{t,i} can be arbitrarily close to zero. We circumvent this problem in Algorithm 3 by introducing the mixture y_{t}=(1-\delta)\tilde{x}_{t}+\frac{\delta}{d}\bm{1}. This algorithm appears in Rakhlin and Sridharan (2013b) in the context of a two-player zero-sum game.

Algorithm 3 Optimistic Online Mirror Descent for Experts (Rakhlin and Sridharan, 2013b)
1: x_{1}\in\Delta_{d}, \delta\in(0,1).
2: Initialize \eta_{1}.
3: for round t=1\dots T do
4:     Play action x_{t}, receive l_{t}.
5:     Compute \eta_{t+1}.
6:     \tilde{x}_{{t+1},j}:=\dfrac{y_{t,j}\exp(-\eta_{t}l_{t,j})}{\sum_{i=1}^{d}y_{t,i}\exp(-\eta_{t}l_{t,i})},\quad\forall j\in[d].
7:     Construct mixture y_{t+1}=(1-\delta)\tilde{x}_{t+1}+\frac{\delta}{d}\bm{1}.
8:     Make prediction \hat{l}_{t+1}.
9:     x_{{t+1},j}:=\dfrac{y_{{t+1},j}\exp(-\eta_{t+1}\hat{l}_{{t+1},j})}{\sum_{i=1}^{d}y_{{t+1},i}\exp(-\eta_{t+1}\hat{l}_{{t+1},i})},\quad\forall j\in[d].
10: end for
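A numpy sketch of one pass through steps 6-9 of Algorithm 3, assuming the loss vector l_t, the prediction l_hat for the next round, and the step sizes eta_t, eta_next are supplied by the caller:

import numpy as np

def experts_round(y_t, l_t, l_hat_next, eta_t, eta_next, delta):
    d = y_t.shape[0]
    x_tilde = y_t * np.exp(-eta_t * l_t)                 # step 6: update on the observed loss
    x_tilde /= x_tilde.sum()
    y_next = (1 - delta) * x_tilde + delta / d           # step 7: mix with the uniform distribution
    x_next = y_next * np.exp(-eta_next * l_hat_next)     # step 9: optimistic step on the prediction
    return x_next / x_next.sum(), y_next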

7.1 Static Regret

We first present the OCO guarantee of Algorithm 3. We let {\cal L}_{t}(x):=\langle l_{t},\ x\rangle and define \hat{\cal L}_{t} similarly. Therefore, \varepsilon_{t}({\cal L})=||l_{t}-\hat{l}_{t}||_{\infty}^{2}. We have the following regret guarantee in OCO when using Algorithm 3:

Theorem 17 (Optimistic OMD with experts, Rakhlin and Sridharan (2013b)).

Under Assumption 1, setting \delta=1/T and the learning rate \eta_{t} as

\eta_{t}=\sqrt{\log(d^{2}Te)}\min\left\{\frac{1}{\sqrt{{\cal E}_{t-1}({\cal L})}+\sqrt{{\cal E}_{t-2}({\cal L})}},1\right\}, (25)

Algorithm 3 has regret

\text{Regret}_{T}(u)\leq 2\sqrt{\log(d^{2}Te)}\left(\sqrt{{\cal E}_{T}({\cal L})}+1\right)=O\left(\sqrt{{\cal E}_{T}({\cal L})\log(dT)}\right). (26)
Corollary 18 (COCO in experts setting).

For any t\in[T], let \ell_{t},c_{t}\in\mathbb{R}^{d} be such that f_{t}(x)=\langle\ell_{t},\ x\rangle and g_{t}(x)=\langle c_{t},\ x\rangle. Define \tilde{g}_{t}(x):=\langle\tilde{c}_{t},\ x\rangle where, \forall i\in[d], \tilde{c}_{t,i}:=(c_{t,i})^{+}. Assume \exists j,\forall t, c_{t,j}\leq 0. Run the meta-algorithm (Algorithm 1) with the following:

  a. l_{t}=\ell_{t}+\lambda\Phi^{\prime}(Q_{t})\tilde{c}_{t}.

  b. \hat{l}_{t}=\hat{\ell}_{t}+\lambda\Phi^{\prime}(Q_{t})\hat{c}_{t}.

  c. Algorithm 3 as the OCO algorithm {\cal A}.

Then we have

\text{Regret}_{T}(u)\leq\tilde{O}\left(\sqrt{{\cal E}_{T}(f)}\right),
\text{CCV}_{T}\leq\tilde{O}\left(\sqrt{{\cal E}_{T}(\tilde{g})}\right). (27)

Moreover, if the sequence g_{t} is fixed or known, we have the stronger guarantee:

\text{Regret}_{T}(u)\leq\tilde{O}\left(\sqrt{{\cal E}_{T}(f)}\right),
\text{CCV}_{T}\leq\tilde{O}\left(1\right). (28)

Proof  The constant-gradient requirement in the experts setting prevents us from using \nabla g^{+}_{t} in \nabla{\cal L}_{t}; therefore, we employ \tilde{g}_{t}(x) instead. Denote \varepsilon_{t}(\tilde{g})=||\tilde{c}_{t}-\hat{c}_{t}||_{\infty}^{2}. As a direct consequence of Theorem 7, with C=2\sqrt{\log(d^{2}Te)}, we have the regret guarantee and

\sum_{t=1}^{T}\tilde{g}_{t}(x_{t})\leq\tilde{O}\left(\sqrt{{\cal E}_{T}(\tilde{g})}\right).

Finally, noticing that \forall x\in\Delta_{d},

g_{t}^{+}(x)\leq\tilde{g}_{t}(x),

we prove the CCV bound. If c_{t} is known at the beginning of step t, we can use \hat{g}_{t}=\tilde{g}_{t}.

7.2 Dynamic Regret

Jadbabaie et al. (2015) show that the previous algorithm also has dynamic regret guarantees. They use a different mixing parameter (\delta=1/T^{2}) and a slightly different constant for the learning rate, and they work in the context of two-player zero-sum games.

Theorem 19.

Under Assumption 1, and assuming that for any t, \nabla\hat{\cal L}_{t} is a constant function, with \delta=1/T and the learning rate \eta_{t} defined as

\eta_{t}=\sqrt{\log(d^{2}Te)}\min\left\{\frac{\sqrt{P_{T}+2}}{\sqrt{{\cal E}_{t-1}({\cal L})}+\sqrt{{\cal E}_{t-2}({\cal L})}},1\right\}, (29)

Algorithm 3 has dynamic regret

\text{DynRegret}_{T}(u_{1:T})\leq 2\sqrt{\log(d^{2}Te)(P_{T}+2)}\left(\sqrt{{\cal E}_{T}({\cal L})}+1\right)=O\left(\sqrt{P_{T}{\cal E}_{T}({\cal L})\log(dT)}\right). (30)
Corollary 20 (Dynamic Regret in experts settings).

As before, define \tilde{g}_{t}(x):=\langle\tilde{c}_{t},\ x\rangle where, \forall i\in[d], \tilde{c}_{t,i}:=(c_{t,i})^{+}. Run the meta-algorithm (Algorithm 1) with the following:

  a. \forall t\in[T], \exists j_{t}\in[d], c_{t,j_{t}}\leq 0.

  b. Set {\cal L}_{t}(x):=\langle\ell_{t}+\Phi^{\prime}(Q_{t})\tilde{c}_{t},\ x\rangle.

  c. Set \hat{\cal L}_{t}(x):=\langle\hat{\ell}_{t}+\Phi^{\prime}(Q_{t})\hat{c}_{t},\ x\rangle.

  d. Algorithm 3 as the OCO algorithm {\cal A}, with the learning rate defined in (29).

Then, for any admissible sequence u_{1},\dots,u_{T} of path length at most P_{T},

\text{Regret}_{T}(u)\leq\tilde{O}\left(\sqrt{P_{T}{\cal E}_{T}(f)}\right),
\text{CCV}_{T}\leq\tilde{O}\left(\sqrt{P_{T}{\cal E}_{T}(\tilde{g})}\right). (31)

Moreover, if the sequence \tilde{g}_{t} is fixed or known, we have the stronger guarantee:

\text{Regret}_{T}(u)\leq\tilde{O}\left(\sqrt{P_{T}{\cal E}_{T}(f)}\right),
\text{CCV}_{T}\leq\tilde{O}\left(\sqrt{P_{T}}\right). (32)

This is a direct consequence of Theorem 14 and Theorem 19. As noted in Section 6, we can use the doubling trick when P_{T} is unknown but u_{t} is observable.

8 Adversarial Contextual Bandits with safety constraints

Denote by K the finite set of possible actions. At each timestep t:

  1. The environment generates a context s_{t}\in{\cal S}, a loss vector \ell_{t}\in[0,1]^{K} and a constraint (or risk) vector c_{t}\in[0,1]^{K}.

  2. The learner observes s_{t}, proposes a distribution p_{t}\in\Delta_{K} over the possible actions, then samples a_{t}\sim p_{t}.

  3. The environment reveals \ell_{t}[a_{t}] and c_{t}[a_{t}]. (We use h_{t,a} and h_{t}[a] interchangeably.)

To guide decisions, the learner uses a finite family \Pi:=\{\pi:{\cal S}\to\Delta_{K}\} of experts who provide context-dependent action recommendations. We denote M:=|\Pi|. Given a safety threshold \alpha\in[0,1], we define \Pi^{\star}(\alpha):=\{\pi\in\Pi,\ \forall t\in[T],\ \langle c_{t},\ \pi(s_{t})\rangle\leq\alpha\} as the subset of consistently safe experts. The learner also has access to predictions \hat{\ell}_{t} and \hat{c}_{t}. The goal of the learner is to make both the expected regret and the expected CCV as small as possible:

\text{Regret}_{T}:=\max_{\pi\in\Pi^{\star}(\alpha)}\mathbb{E}\left[\sum_{t=1}^{T}\ell_{t}[a_{t}]-\ell_{t}[\pi(s_{t})]\right],\qquad\text{CCV}_{T}:=\mathbb{E}\left[\sum_{t=1}^{T}(c_{t}[a_{t}]-\alpha)_{+}\right], (33)

where the expectation is with respect to the randomness of the learner (the selection of the actions a_{t}). Note that \text{CCV}_{T} is a strictly stronger measure than the one used in Sun et al. (2017), whose metric of safety is R_{c}:=\mathbb{E}\left[\sum_{t=1}^{T}c_{t}[a_{t}]-\alpha\right].

As in previous sections, we first need an algorithm that solves the problem without adaptive constraints. Here, we employ a modified version of the EXP4.OVAR algorithm (Wei et al., 2020), detailed in Algorithm 5. The small change we make concerns the learning rate and how it is used in the updates. In most of the bandit literature, the loss vector l_{t} is assumed to be bounded with known bounds (w.l.o.g. [0,1]^{K}). However, when we apply it to the Lagrangian function, the upper bound of l_{t} becomes dynamic, varying with time and depending on the previous actions (a_{1},\dots,a_{t-1}). We thus have to take this into account when computing the upper bound of the regret, as highlighted in Theorem 21.

Theorem 21 (Modified EXP4.OVAR regret, adapted from Wei et al. (2020)).

Let l_{t}\in[0,B_{t}] be a sequence of loss vectors, where B_{t} is non-decreasing, and l_{t} and B_{t} are chosen by the environment but may depend on a_{1},\dots,a_{t-1}. Let \hat{l}_{t}\in[0,B_{t}] be the prediction and denote {\cal E}_{T}({\cal L}):=\sum_{t=1}^{T}||l_{t}-\hat{l}_{t}||_{\infty}^{2}. Then, if \delta=\left(\frac{K}{T}\sqrt{\log(MT)}\right)^{2/3}, Algorithm 5 has regret

\text{Regret}_{T}^{\cal A}(l_{1},\dots,l_{T})\leq 6\left(\sqrt{{\cal E}_{T}({\cal L})}+\mathbb{E}[B_{T}]\right)(TK^{2}\log(MT))^{1/3}. (34)

See Algorithm 5 for the complete proof. For the problem with adversarial constraints, as in Section 4, we construct a surrogate loss vector similar to the Lagrangian:

l_{t}:=\ell_{t}+\Phi^{\prime}(Q_{t})\tilde{c}_{t},\quad\text{with}\quad\forall a\in[K],\ \tilde{c}_{t}[a]:=(c_{t}[a]-\alpha)^{+},\qquad\hat{l}_{t}:=\hat{\ell}_{t}+\Phi^{\prime}(Q_{t})\hat{c}_{t},\qquad Q_{t+1}:=Q_{t}+\tilde{c}_{t}[a_{t}],\quad\text{with}\quad Q_{0}=Q_{1}=0, (35)

and use them in the EXP4.OVAR algorithm. For consistency with previous sections, for p\in\Delta_{K} we denote f_{t}(p):=\langle\ell_{t},\ p\rangle, g_{t}(p):=\langle c_{t},\ p\rangle, and write {\cal E}_{T}(f)=\sum_{t=1}^{T}||\ell_{t}-\hat{\ell}_{t}||_{\infty}^{2} and {\cal E}_{T}(g^{+}):=\sum_{t=1}^{T}||c_{t}-\hat{c}_{t}||_{\infty}^{2}.
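A sketch of the surrogate construction (35), assuming numpy loss/risk vectors in [0,1]^{K} and the exponential potential \Phi(Q)=\exp(\lambda Q)-1 used in Theorem 23:

import math
import numpy as np

def bandit_surrogates(ell_t, c_t, ell_hat, c_hat, Q_t, a_t, alpha, lam):
    phi_prime = lam * math.exp(lam * Q_t)          # Phi'(Q_t)
    c_tilde = np.maximum(c_t - alpha, 0.0)         # c~_t[a] = (c_t[a] - alpha)^+
    l_t = ell_t + phi_prime * c_tilde              # surrogate loss vector fed to EXP4.OVAR
    l_hat = ell_hat + phi_prime * c_hat            # its optimistic prediction
    Q_next = Q_t + c_tilde[a_t]                    # delayed queue update
    return l_t, l_hat, Q_next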

First, we prove a similar regret decomposition lemma. Denote by \text{Regret}_{T}^{\cal A}(l_{1},\dots,l_{T}) the expected regret of a contextual bandit algorithm {\cal A} run with l_{1},\dots,l_{T} as loss vectors.

Lemma 22.

Assume that \forall t\in[T], \ell_{t}\in[0,1]^{K} and c_{t}\in[0,1]^{K}. Let \alpha be the safety threshold, \Phi a convex potential function, and l_{t} and Q_{t} defined as in (35). Then

\mathbb{E}\left[\Phi(Q_{T+1})\right]-\Phi(Q_{1})+\text{Regret}_{T}\leq\text{Regret}_{T}^{\cal A}(l_{1},\dots,l_{T})+\mathbb{E}\left[\Phi^{\prime}(Q_{T+1})\right]. (36)

The proof is exactly the same as that of Lemma 5, with the additional step of taking expectations. Finally, by using EXP4.OVAR on l_{t} as defined in Equation 35, we prove that we obtain bounded expected regret and CCV.

Theorem 23.

Assume:

  • The safety threshold \alpha\in(0,1) is known and the corresponding \Pi^{\star}(\alpha) is not empty.

  • \forall t\in[T], \ell_{t}\in[0,1]^{K} and c_{t}\in[0,1]^{K}.

  • We define l_{t}, \hat{l}_{t} and Q_{t} as in (35) and use them in EXP4.OVAR.

  • \Phi(x):=\exp(\lambda x)-1 with \lambda:=\left(12(TK^{2}\log(MT))^{1/3}(\sqrt{2{\cal E}_{T}(\tilde{g})}+1)+2\right)^{-1}.

Then running Algorithm 1 gives the following guarantees:

\text{Regret}_{T}\leq\tilde{O}\left(\sqrt{{\cal E}_{T}(f)}\left(TK^{2}\log(M)\right)^{1/3}\right),
\text{CCV}_{T}\leq\tilde{O}\left(\sqrt{{\cal E}_{T}(g)}\left(TK^{2}\log(M)\right)^{1/3}\right). (37)

Proof  By definition, we have for any t\in[T], l_{t}\in[0,1+\Phi^{\prime}(Q_{t})]. Thus, we have the regret guarantee of Theorem 21:

\text{Regret}_{T}^{\cal A}(l_{1},\dots,l_{T})\leq 6\left(\sqrt{{\cal E}_{T}({\cal L})}+1+\mathbb{E}[\Phi^{\prime}(Q_{T})]\right)(TK^{2}\log(MT))^{1/3}. (38)

Inserting this in Lemma 22 and using the definition of \Phi, we have

\text{Regret}_{T}\leq 6(TK^{2}\log(MT))^{1/3}\left(\sqrt{2{\cal E}_{T}(f)}+1\right)+1+\mathbb{E}[\exp(\lambda Q_{T+1})]\left(\lambda\left(6(TK^{2}\log(MT))^{1/3}(\sqrt{2{\cal E}_{T}(g)}+1)+1\right)-1\right).

The rest of the proof proceeds as in Theorem 7, after noticing that, by Jensen's inequality,

\exp(\lambda\mathbb{E}\left[\text{CCV}_{T}\right])\leq\mathbb{E}\left[\exp(\lambda Q_{T+1})\right].

 

Note that in the worst case, {\cal E}_{T}(f)=O(T) and {\cal E}_{T}(g)=O(T), the regret and CCV are of order \tilde{O}(T^{5/6}), which is worse than Sun et al. (2017): O(T^{1/2}) regret and O(T^{3/4}) CCV. However, when the predictions are slightly more accurate, {\cal E}_{T}(f)\leq O(T^{1/3}) and {\cal E}_{T}(g)\leq O(T^{5/12}), this algorithm improves on Sun et al. (2017), with the most significant improvement when {\cal E}_{T}(f)=O(1) and {\cal E}_{T}(g)=O(1), leading to a T^{1/3} rate for both regret and CCV. This is close to optimal, as Wei et al. (2020) prove that the best regret achievable by a contextual bandit algorithm with {\cal E}_{T}({\cal L})=O(1) is O(T^{1/4}). Note that this algorithm requires {\cal E}_{T}(g) (or an upper bound) to be known beforehand, as even with the doubling trick we do not directly observe \varepsilon_{t}(g) to update {\cal E}_{t}(g) online. A heuristic method using the current observation as an estimator along with the doubling trick could potentially work in practice.

9 Conclusion

This work presents pioneering optimistic algorithms for handling OCO under adversarial constraints. Beyond establishing prediction-error-dependent bounds for both regret and constraint violations, our approach remains efficient by using simple projections instead of solving a full convex optimization problem at each iteration. In future work, we are interested in the guarantees obtainable against oracle sets that are larger than {\cal X}, and in proving stronger bounds when the loss functions are strongly convex. Moreover, we conjecture that a slight alteration of the algorithm should ensure \text{CCV}\leq O(\log T) when g_{t}^{+} is fixed or perfectly known, beyond the expert setting. At this stage, the non-smoothness of g_{t}^{+} prevents us from using its own gradient as the prediction, and therefore from establishing that our algorithm attains this bound.

References

  • Anderson et al. (2022) Daron Anderson, George Iosifidis, and Douglas J Leith. Lazy Lagrangians with predictions for online learning. arXiv preprint arXiv:2201.02890, 2022.
  • Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
  • Beygelzimer et al. (2011) Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26. JMLR Workshop and Conference Proceedings, 2011.
  • Bhaskara et al. (2020) Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, and Manish Purohit. Online learning with imperfect hints. In International Conference on Machine Learning, pages 822–831. PMLR, 2020.
  • Chiang et al. (2012) Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory, pages 6–1. JMLR Workshop and Conference Proceedings, 2012.
  • D’Orazio and Huang (2021) Ryan D’Orazio and Ruitong Huang. Optimistic and adaptive Lagrangian hedging. arXiv preprint arXiv:2101.09603, 2021.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
  • Guo et al. (2022) Hengquan Guo, Xin Liu, Honghao Wei, and Lei Ying. Online convex optimization with hard constraints: towards the best of two worlds and beyond. Advances in Neural Information Processing Systems, 35:36426–36439, 2022.
  • Hazan (2023) Elad Hazan. Introduction to online convex optimization, 2023. URL https://arxiv.org/abs/1909.05207.
  • Hutchinson and Alizadeh (2024) Spencer Hutchinson and Mahnoosh Alizadeh. Safe online convex optimization with first-order feedback. In 2024 American Control Conference (ACC), pages 1–7. IEEE, 2024.
  • Jadbabaie et al. (2015) Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Artificial Intelligence and Statistics, pages 398–406. PMLR, 2015.
  • Jenatton et al. (2016) Rodolphe Jenatton, Jim Huang, and Cedric Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 402–411, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/jenatton16.html.
  • Joulani et al. (2017) Pooria Joulani, András György, and Csaba Szepesvári. A modular analysis of adaptive (non-) convex optimization: Optimism, composite objectives, and variational bounds. In International Conference on Algorithmic Learning Theory, pages 681–720. PMLR, 2017.
  • Mahdavi et al. (2012) Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: online convex optimization with long term constraints. The Journal of Machine Learning Research, 13(1):2503–2528, 2012.
  • Mannor et al. (2009) Shie Mannor, John N Tsitsiklis, and Jia Yuan Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10(3), 2009.
  • Mohri and Yang (2016) Mehryar Mohri and Scott Yang. Accelerating online convex optimization via adaptive prediction. In Artificial Intelligence and Statistics, pages 848–856. PMLR, 2016.
  • Muthirayan et al. (2022) Deepan Muthirayan, Jianjun Yuan, and Pramod P Khargonekar. Online convex optimization with long-term constraints for predictable sequences. IEEE Control Systems Letters, 7:979–984, 2022.
  • Neely and Yu (2017) Michael J. Neely and Hao Yu. Online convex optimization with time-varying constraints, 2017. URL https://arxiv.org/abs/1702.04783.
  • Orabona (2019) Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
  • Qiu et al. (2023) Shuang Qiu, Xiaohan Wei, and Mladen Kolar. Gradient-variation bound for online convex optimization with constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9534–9542, 2023.
  • Rakhlin and Sridharan (2013a) Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, pages 993–1019. PMLR, 2013a.
  • Rakhlin and Sridharan (2013b) Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences, 2013b. URL https://arxiv.org/abs/1311.1869.
  • Scroccaro et al. (2023) Pedro Zattoni Scroccaro, Arman Sharifi Kolarijani, and Peyman Mohajerin Esfahani. Adaptive composite online optimization: Predictions in static and dynamic environments. IEEE Transactions on Automatic Control, 68(5):2906–2921, 2023.
  • Sinha and Vaze (2024) Abhishek Sinha and Rahul Vaze. Optimal algorithms for online convex optimization with adversarial constraints. Advances in Neural Information Processing Systems, 37:41274–41302, 2024.
  • Steinhardt and Liang (2014) Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International conference on machine learning, pages 1593–1601. PMLR, 2014.
  • Sun et al. (2017) Wen Sun, Debadeepta Dey, and Ashish Kapoor. Safety-aware algorithms for adversarial contextual bandit. In International Conference on Machine Learning, pages 3280–3288. PMLR, 2017.
  • Syrgkanis et al. (2015) Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. Advances in Neural Information Processing Systems, 28, 2015.
  • Wei et al. (2020) Chen-Yu Wei, Haipeng Luo, and Alekh Agarwal. Taking a hint: How to leverage loss predictors in contextual bandits? ArXiv, abs/2003.01922, 2020. URL https://api.semanticscholar.org/CorpusID:211990228.
  • Yi et al. (2023) Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Yiguang Hong, Tianyou Chai, and Karl H Johansson. Distributed online convex optimization with adversarial constraints: reduced cumulative constraint violation bounds under Slater’s condition. arXiv preprint arXiv:2306.00149, 2023.
  • Yu and Neely (2020) Hao Yu and Michael J Neely. A low complexity algorithm with O(\sqrt{T}) regret and O(1) constraint violations for online convex optimization with long term constraints. Journal of Machine Learning Research, 21(1):1–24, 2020.
  • Zhang et al. (2018) Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. Advances in neural information processing systems, 31, 2018.
  • Zhao et al. (2020) Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Dynamic regret of convex and smooth functions. ArXiv, abs/2007.03479, 2020. URL https://api.semanticscholar.org/CorpusID:220381233.
  • Zhao et al. (2024) Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization. Journal of Machine Learning Research, 25(98):1–52, 2024.
  • Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pages 928–936, 2003.

Appendix A Proof of Theorem 7

Proof 

By definition of {\cal L} (5) and \hat{\cal L} (6), we obtain the following bound on the instantaneous prediction error:

εt()\displaystyle\varepsilon_{t}({\cal L}) =t(xt)^t(xt)2\displaystyle=||\nabla{{\cal L}}_{t}(x_{t})-\nabla\hat{{{\cal L}}}_{t}(x_{t})||_{\star}^{2}
2εt(f)+2Φ(Qt)2εt(g+),\displaystyle\leq 2\varepsilon_{t}(f)+2\Phi^{\prime}(Q_{t})^{2}\varepsilon_{t}(g^{+}),

where the last line uses a+b22a2+2b2||a+b||_{\star}^{2}\leq 2||a||_{\star}^{2}+2||b||_{\star}^{2}.

\sqrt{{\cal E}_{t}({\cal L})}\leq\sqrt{\sum_{\tau=1}^{t}2\varepsilon_{\tau}(f)+\sum_{\tau=1}^{t}2\Phi^{\prime}(Q_{\tau})^{2}\varepsilon_{\tau}(g^{+})}
\overset{(i)}{\leq}\sqrt{2{\cal E}_{t}(f)}+\sqrt{\sum_{\tau=1}^{t}2\Phi^{\prime}(Q_{\tau})^{2}\varepsilon_{\tau}(g^{+})}
\overset{(ii)}{\leq}\sqrt{2{\cal E}_{t}(f)}+\Phi^{\prime}(Q_{t+1})\sqrt{2{\cal E}_{t}(g^{+})}. (39)

We obtain (i) by using a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b} and (ii) by using the fact that QtQ_{t} is non-decreasing and Φ\Phi^{\prime} is a non-decreasing function. By sub-linearity of ψt\psi_{t}:

ψt(t)ψt(f)+Φ(Qt)ψt(g+)ψt(f)+Φ(Qt+1)ψt(g+).\psi_{t}({\cal L}_{t})\leq\psi_{t}(f)+\Phi^{\prime}(Q_{t})\psi_{t}(g^{+})\leq\psi_{t}(f)+\Phi^{\prime}(Q_{t+1})\psi_{t}(g^{+}). (40)

Finally, using 6, we have

\text{Regret}_{t}^{\cal A}(u;\;{\cal L}_{1\dots t})\leq C\left(\sqrt{{\cal E}_{t}({\cal L})}+\psi_{t}({\cal L})\right)
\leq C\left(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f)\right)+C\Phi^{\prime}(Q_{t+1})\left(\sqrt{2{\cal E}_{t}(g^{+})}+\psi_{t}(g^{+})\right), (41)

where the last inequality comes from using both (39) and (40). By using once again the fact that Q_{t} is non-decreasing and \Phi^{\prime} is a non-decreasing function, and knowing that g^{+}_{t} is non-negative and upper bounded by G, we can also upper bound S_{t}. Recall

St\displaystyle S_{t} :=τ=1tgτ+(xτ)(Φ(Qτ+1)Φ(Qτ))\displaystyle:=\sum_{\tau=1}^{t}g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau}))
G(Φ(Qt+1)Φ(Q1))\displaystyle\leq G(\Phi^{\prime}(Q_{{t+1}})-\Phi^{\prime}(Q_{1}))
GΦ(Qt+1).\displaystyle\leq G\Phi^{\prime}(Q_{t+1}). (42)

We can now upper bound the regret. Using Lemma 5 we have that for any u𝒳u\in{\cal X}

Φ(Qt+1)Φ(Q1)+Regrett(u)\displaystyle\Phi(Q_{t+1})-\Phi(Q_{1})+\text{Regret}_{t}(u) Regrett𝒜(u;1t)+St.\displaystyle\leq\text{Regret}_{t}^{\cal A}(u;\;{\cal L}_{1\dots t})+S_{t}.

Upper bounding the RHS using (41) and (42), we obtain

Φ(Qt+1)Φ(Q1)+Regrett(u)\displaystyle\Phi(Q_{t+1})-\Phi(Q_{1})+\text{Regret}_{t}(u)\leq CΦ(Qt+1)(2t(g+)+ψt(g+))\displaystyle C\Phi^{\prime}(Q_{t+1})(\sqrt{2{\cal E}_{t}(g^{+})}+\psi_{t}(g^{+}))
+C(2t(f)+ψt(f))\displaystyle+C(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f))
+GΦ(Qt+1).\displaystyle+G\Phi^{\prime}(Q_{t+1}).

Thus, using Φ(Q)=exp(λQ)1\Phi(Q)=\exp(\lambda Q)-1, and after rearranging the terms,

Regrett(u)(λC(2t(g+)+ψt(g+))+λG1)exp(λQt+1)+1+C(2t(f)+ψt(f)).\text{Regret}_{t}(u)\leq\left(\lambda C(\sqrt{2{\cal E}_{t}(g^{+})}+\psi_{t}(g^{+}))+\lambda G-1\right)\exp(\lambda Q_{t+1})+1+C(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f)).

Therefore, if \lambda\leq\lambda^{\star}:=\frac{1}{C(\sqrt{2{\cal E}_{t}(g^{+})}+\psi_{t}(g^{+}))+G}, the coefficient of \exp(\lambda Q_{t+1}) above is non-positive and

Regrett(u)C(2t(f)+ψt(f))+1.\text{Regret}_{t}(u)\leq C\left(\sqrt{2{\cal E}_{t}(f)}+\psi_{t}(f)\right)+1.

Note that RegretT(u)2FT\text{Regret}_{T}(u)\geq-2FT, thus:

exp(λQT+1)(1λλ)C(2T(f)+ψT(f))+2FT+1.\exp(\lambda Q_{T+1})\left(1-\frac{\lambda}{\lambda^{\star}}\right)\leq C(\sqrt{2{\cal E}_{T}(f)}+\psi_{T}(f))+2FT+1.

If \lambda<\frac{1}{C(\sqrt{2{\cal E}_{T}(g^{+})}+\psi_{T}(g^{+}))+G}, then

QT+1log(C(2T(f)+ψT(f))+2FT+11λ/λ),Q_{T+1}\leq\log\left(\frac{C(\sqrt{2{\cal E}_{T}(f)}+\psi_{T}(f))+2FT+1}{1-\lambda/\lambda^{\star}}\right),

and

CCVTQT+1λ1λlog(C(2T(f)+ψT(f))+2FT+11λ/λ).\text{CCV}_{T}\leq\frac{Q_{T+1}}{\lambda}\leq\frac{1}{\lambda}\log\left(\frac{C(\sqrt{2{\cal E}_{T}(f)}+\psi_{T}(f))+2FT+1}{1-\lambda/\lambda^{\star}}\right).

With \lambda=\frac{\lambda^{\star}}{2}=\frac{1}{2C(\sqrt{2{\cal E}_{T}(g^{+})}+\psi_{T}(g^{+}))+2G}, we have

\text{CCV}_{T}\leq\left(2C\left(\sqrt{2{\cal E}_{T}(g^{+})}+\psi_{T}(g^{+})\right)+2G\right)\log\left(2\left(C(\sqrt{2{\cal E}_{T}(f)}+\psi_{T}(f))+2FT+1\right)\right)
O(T(g+)logT).\displaystyle\leq O\left(\sqrt{{\cal E}_{T}(g^{+})}\log T\right).

 

Appendix B Doubling trick for Algorithm 1

The doubling trick methodology employed here is inspired by Jadbabaie et al. (2015). The parameter we adapt online is \lambda. Note that for all of our COCO results (Theorem 7, among others), there is a known constant c and a known function \psi such that

λ=12(μ+c),whereμ=ψ(T,PT,T(g)),\lambda=\frac{1}{2(\mu+c)},\quad\text{where}\quad\mu=\psi(T,P_{T},{\cal E}_{T}(g)),

and \psi is non-decreasing and sub-linear in each coordinate. The key idea is to apply the doubling trick to \mu, so that the condition \lambda<\lambda^{\star} holds at every timestep of an epoch except the last one. We present the algorithm in Algorithm 4. In the case of dynamic regret, we assume that the comparator sequence u_{t} is observable.

Algorithm 4 Doubling trick for Optimistic COCO
1:Function ψ\psi, real values: T1,P1,E1T_{1},P_{1},E_{1}, c>0c>0. Optimistic meta-algorithm 𝒪(λ){\cal O}(\lambda) for a given value λ\lambda.
2:Initialize: μ1=ψ(T1,P1,E1),λ1=12(μ1+c),N=1,E(N)=Δ(N)=P(N)=0;μ(N)=ψ(Δ(N),P(N),E(N))\mu_{1}=\psi\left(T_{1},P_{1},E_{1}\right),\lambda_{1}=\frac{1}{2(\mu_{1}+c)},N=1,E_{(N)}=\Delta_{(N)}=P_{(N)}=0;\mu_{(N)}=\psi(\Delta_{(N)},P_{(N)},E_{(N)}).
3:for round t=1Tt=1\dots T do
4:     if μ(N)>μN\mu_{(N)}>\mu_{N} then \triangleright Check doubling condition
5:         N=N+1N=N+1
6:         μN=2N1μ1\mu_{N}=2^{N-1}\mu_{1} and λN=12(μN+c)\lambda_{N}=\frac{1}{2(\mu_{N}+c)}
7:         E(N)=Δ(N)=P(N)=0E_{(N)}=\Delta_{(N)}=P_{(N)}=0
8:     end if
9:     Run one step of 𝒪(λN){\cal O}(\lambda_{N}) and observe ft,gt,xtf_{t},g_{t},x_{t} and utu_{t}.
10:     Update doubling parameters:
11:     Δ(N)=Δ(N)+1\Delta_{(N)}=\Delta_{(N)}+1
12:     P(N)=P(N)+utut1P_{(N)}=P_{(N)}+||u_{t}-u_{t-1}||
13:     E(N)=E(N)+εt(g+)E_{(N)}=E_{(N)}+\varepsilon_{t}(g^{+})
14:     μ(N)=ψ(Δ(N),P(N),E(N))\mu_{(N)}=\psi(\Delta_{(N)},P_{(N)},E_{(N)})
15:end for
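For concreteness, the following Python sketch captures the epoch logic of Algorithm 4. It is a minimal sketch rather than the paper's implementation: psi and c are the known quantities from the display above, and meta_algorithm.run_step is a hypothetical interface that runs one step of the optimistic meta-algorithm {\cal O}(\lambda_{N}) and reports the observed comparator movement ||u_{t}-u_{t-1}|| and the prediction error \varepsilon_{t}(g^{+}) for that round.

def doubling_trick(T, psi, c, T1, P1, E1, meta_algorithm):
    """Sketch of Algorithm 4: adapt lambda with the doubling trick on mu."""
    mu1 = psi(T1, P1, E1)                     # mu_1
    N, mu_N = 1, mu1
    lam_N = 1.0 / (2.0 * (mu_N + c))          # lambda_1
    Delta, P, E = 0, 0.0, 0.0                 # epoch counters Delta_(N), P_(N), E_(N)
    for t in range(1, T + 1):
        if psi(Delta, P, E) > mu_N:           # doubling condition: open a new epoch
            N += 1
            mu_N = (2 ** (N - 1)) * mu1
            lam_N = 1.0 / (2.0 * (mu_N + c))
            Delta, P, E = 0, 0.0, 0.0
        # One step of O(lambda_N); it reports ||u_t - u_{t-1}|| and eps_t(g^+).
        comparator_move, eps_g_plus = meta_algorithm.run_step(lam_N)
        Delta += 1                            # update the doubling parameters
        P += comparator_move
        E += eps_g_plus
    return N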
Theorem 24.

Assume that, when λ<λ\lambda<\lambda^{\star} with λ=1ψ(T,T(g),PT)+c\lambda^{*}=\frac{1}{\psi(T,{\cal E}_{T}(g),P_{T})+c}, the optimistic algorithm 𝒪(λ){\cal O}(\lambda) has guarantees:

RegretTO(ϕ(T,T(f),PT)),CCVTO(ψ(T,T(g+),PT)logT).\begin{split}\text{Regret}_{T}&\leq O\left(\phi(T,{\cal E}_{T}(f),P_{T})\right),\\ \text{CCV}_{T}&\leq O\left(\psi(T,{\cal E}_{T}(g^{+}),P_{T})\log T\right).\end{split} (43)

where RegretT\text{Regret}_{T} denotes the static or dynamic regret depending on the context, ϕ\phi and ψ\psi are monotone non-decreasing and at most polynomial in each coordinate. Then by running the doubling algorithm Algorithm 4, we have the guarantee

RegretTO~(ϕ(T,T(f),PT)),CCVTO~(ψ(T,T(g+),PT)logT).\begin{split}\text{Regret}_{T}&\leq\tilde{O}\left(\phi(T,{\cal E}_{T}(f),P_{T})\right),\\ \text{CCV}_{T}&\leq\tilde{O}\left(\psi(T,{\cal E}_{T}(g^{+}),P_{T})\log T\right).\end{split} (44)

Proof  Let N be the number of epochs and, for each epoch i\in[N], denote by T_{i} its first instant. Its last instant is therefore T_{i}^{\prime}:=T_{i+1}-1. For two instants s and t, we define the regret and CCV between them:

Regretts:=τ=tsft(xt)ft(ut),CCVts:=τ=tsgt+(xt).\begin{split}\text{Regret}_{t\to s}&:=\sum_{\tau=t}^{s}f_{t}(x_{t})-f_{t}(u_{t}),\\ \text{CCV}_{t\to s}&:=\sum_{\tau=t}^{s}g_{t}^{+}(x_{t}).\end{split}

We similarly define the quantities {\cal E}_{t\to s}(f),{\cal E}_{t\to s}(g^{+}),P_{t\to s}. Denote by \mu_{i},i=1\dots N the successive values of \mu used in the doubling process, with \lambda_{i}=\frac{1}{2(\mu_{i}+c)}. Define

Δ¯(i)\displaystyle\underline{\Delta}_{(i)} :=Δ(i)1,\displaystyle:=\Delta_{(i)}-1,
P¯(i)\displaystyle\underline{P}_{(i)} :=P(i)uTiuTi1,\displaystyle:=P_{(i)}-||u_{T^{\prime}_{i}}-u_{T^{\prime}_{i}-1}||,
E¯(i)\displaystyle\underline{E}_{(i)} :=E(i)gTi+(xTi)g^Ti+(xTi)2,\displaystyle:=E_{(i)}-||\nabla g^{+}_{T^{\prime}_{i}}(x_{T^{\prime}_{i}})-\nabla\hat{g}^{+}_{T^{\prime}_{i}}(x_{T^{\prime}_{i}})||_{\star}^{2},
μ¯(i)\displaystyle\underline{\mu}_{(i)} =ψ(Δ¯(i),P¯(i),E¯(i))\displaystyle=\psi(\underline{\Delta}_{(i)},\underline{P}_{(i)},\underline{E}_{(i)})
\underline{\lambda}_{(i)}=\frac{1}{2(\underline{\mu}_{(i)}+c)}

i.e., the values of the doubling parameters computed without the last step of the epoch. Note that when running {\cal O} with \lambda_{i} between T_{i} and T^{\prime}_{i}-1, the threshold for \lambda over those timesteps is:

λi=1ψ(Ti1Ti,Ti(Ti1)(g+),PTi(Ti1))+c=2λ¯(i).\lambda^{*}_{i}=\frac{1}{\psi\left(T^{\prime}_{i}-1-T_{i},{\cal E}_{T_{i}\to(T^{\prime}_{i}-1)}(g^{+}),P_{T_{i}\to(T_{i}^{\prime}-1)}\right)+c}=2\underline{\lambda}_{(i)}.

Moreover, since the change of epoch happens at Ti+1T_{i}^{\prime}+1, we know that

μ(i)>μi>μ¯(i).\mu_{(i)}>\mu_{i}>\underline{\mu}_{(i)}. (45)

From the second inequality, we have

λi<λ¯(i)=λi2\lambda_{i}<\underline{\lambda}_{(i)}=\frac{\lambda^{*}_{i}}{2}

Thus, from (43) there are two constants CC and CC^{\prime} such that:

RegretTi(Ti1)Cϕ((Ti1)Ti,Ti(Ti1)(f),PTi(Ti1)),CCVTiTiCψ((Ti1)Ti,Ti(Ti1)(g+),PTi(Ti1))log(TiTi).\begin{split}\text{Regret}_{T_{i}\to(T^{\prime}_{i}-1)}&\leq C\phi\left((T^{\prime}_{i}-1)-T_{i},{\cal E}_{T_{i}\to(T_{i}^{\prime}-1)}(f),P_{T_{i}\to(T_{i}^{\prime}-1)}\right),\\ \text{CCV}_{T_{i}\to T^{\prime}_{i}}&\leq C^{\prime}\psi\left((T^{\prime}_{i}-1)-T_{i},{\cal E}_{T_{i}\to(T_{i}^{\prime}-1)}(g^{+}),P_{T_{i}\to(T_{i}^{\prime}-1)}\right)\log(T^{\prime}_{i}-T_{i}).\end{split}

We will focus on regret for now, but the same methodology can be applied for CCV. First note that by monotonicity of ϕ\phi,

i[N],RegretTi(Ti1)Cϕ(T,T(f),PT).\forall i\in[N],\text{Regret}_{T_{i}\to(T^{\prime}_{i}-1)}\leq C\phi(T,{\cal E}_{T}(f),P_{T}).

Then, note that T1<TNT-1<T_{N}^{\prime}, and therefore, the constant λN\lambda_{N} satisfies the condition for bounded regret and CCV when running 𝒪{\cal O} between TNT_{N} and T1T-1. We can now split the total regret into groups:

RegretT\displaystyle\text{Regret}_{T} =t=1Tft(xt)ft(ut)\displaystyle=\sum_{t=1}^{T}f_{t}(x_{t})-f_{t}(u_{t})
=i=1N1(fTi(xTi)fTi(uTi))+i=1N1RegretTi(Ti1)+RegretTN(T1)+fT(xT)fT(uT)\displaystyle=\sum_{i=1}^{N-1}(f_{T^{\prime}_{i}}(x_{T^{\prime}_{i}})-f_{T^{\prime}_{i}}(u_{T^{\prime}_{i}}))+\sum_{i=1}^{N-1}\text{Regret}_{T_{i}\to(T^{\prime}_{i}-1)}+\text{Regret}_{T_{N}\to(T-1)}+f_{T}(x_{T})-f_{T}(u_{T})
2NF+NCϕ(T,T(f),PT)\displaystyle\leq 2NF+NC\phi(T,{\cal E}_{T}(f),P_{T})

Finally, from (45) for i=Ni=N,

\mu_{N}=\mu_{1}2^{N-1}<\mu_{(N)}\leq\psi(T,P_{T},{\cal E}_{T}(g))\Longrightarrow N\leq\log_{2}\left(\psi(T,P_{T},{\cal E}_{T}(g))/\mu_{1}\right)+1.

Since \psi is at most polynomial in each coordinate, and {\cal E}_{T}(g) and P_{T} are at most linear in T, we have N\leq O(\log_{2}T).  

Appendix C Proof of Theorem 10

Denote lt:=t(xt)l_{t}:=\nabla{{\cal L}}_{t}(x_{t}) and l^t:=^t(x~t)\hat{l}_{t}:=\nabla\hat{{{\cal L}}}_{t}(\tilde{x}_{t}). (16) in Theorem 10 is a direct consequence of the following lemma.

Lemma 25.

One step of optimistic online mirror descent satisfies:

ηtlt,xtuBR(u;x~t)BR(u;x~t+1)+ηtltl^txtx~t+1(BR(x~t+1;xt)+BR(xt;x~t)).\eta_{t}\langle l_{t},\ x_{t}-u\rangle\leq B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})+\eta_{t}||l_{t}-\hat{l}_{t}||_{\star}\cdot||x_{t}-\tilde{x}_{t+1}||-(B^{R}(\tilde{x}_{t+1};x_{t})+B^{R}(x_{t};\tilde{x}_{t})). (46)

Moreover, if ^t\nabla\hat{{{\cal L}}}_{t} is L^t\hat{L}^{{\cal L}}_{t}-smooth with L^tβηt\hat{L}^{{\cal L}}_{t}\leq\frac{\beta}{\eta_{t}},

lt,xtuBR(u;x~t)BR(u;x~t+1)ηt+BR(xt;x~t+1)(ηt+11ηt1)+ηt+1βεt().\langle l_{t},\ x_{t}-u\rangle\leq\frac{B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})}{\eta_{t}}+B^{R}(x_{t};\tilde{x}_{t+1})(\eta_{t+1}^{-1}-\eta_{t}^{-1})+\frac{\eta_{t+1}}{\beta}\varepsilon_{t}({\cal L}). (47)

We will need the following proposition to prove the lemma.

Proposition 26 (Chiang et al. (2012), proposition 18).

For any x0𝒳,ldx_{0}\in{\cal X},l\in\mathbb{R}^{d}, if x:=argminx𝒳l,x+1ηBR(x;x0)x^{\star}:=\arg\min_{x\in{\cal X}}\langle l,\ x\rangle+\frac{1}{\eta}B^{R}(x;x_{0}), then u𝒳\forall u\in{\cal X}

ηl,xu=BR(u;x0)BR(u;x)BR(x;x0).\eta\langle l,\ x^{\star}-u\rangle=B^{R}(u;x_{0})-B^{R}(u;x^{\star})-B^{R}(x^{\star};x_{0}). (48)

Proof [of Lemma 25] Let u𝒳u\in{\cal X}

ηtlt,xtu\displaystyle\eta_{t}\langle l_{t},\ x_{t}-u\rangle =ηtlt,x~t+1u+ηtltl^t,xtx~t+1+ηtl^t,xtx~t+1.\displaystyle=\langle\eta_{t}l_{t},\ \tilde{x}_{t+1}-u\rangle+\eta_{t}\langle l_{t}-\hat{l}_{t},\ x_{t}-\tilde{x}_{t+1}\rangle+\langle\eta_{t}\hat{l}_{t},\ x_{t}-\tilde{x}_{t+1}\rangle.

On one hand, using Proposition 26, the first and third terms can be rewritten respectively as:

ηtlt,x~t+1u\displaystyle\langle\eta_{t}l_{t},\ \tilde{x}_{t+1}-u\rangle =BR(u;x~t)BR(u;x~t+1)BR(x~t+1;x~t),\displaystyle=B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})-B^{R}(\tilde{x}_{t+1};\tilde{x}_{t}),
ηtl^t,xtx~t+1\displaystyle\langle\eta_{t}\hat{l}_{t},\ x_{t}-\tilde{x}_{t+1}\rangle =BR(x~t+1;x~t)BR(x~t+1;xt)BR(xt;x~t).\displaystyle=B^{R}(\tilde{x}_{t+1};\tilde{x}_{t})-B^{R}(\tilde{x}_{t+1};x_{t})-B^{R}(x_{t};\tilde{x}_{t}).

Therefore

ηtlt,x~t+1u+ηtl^t,xtx~t+1=BR(u;x~t)BR(u;x~t+1)(BR(x~t+1;xt)+BR(xt;x~t)).\langle\eta_{t}l_{t},\ \tilde{x}_{t+1}-u\rangle+\langle\eta_{t}\hat{l}_{t},\ x_{t}-\tilde{x}_{t+1}\rangle=B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})-(B^{R}(\tilde{x}_{t+1};x_{t})+B^{R}(x_{t};\tilde{x}_{t})).

On the other hand,

ltl^t,xtx~t+1xtx~t+1ltl^t.\langle l_{t}-\hat{l}_{t},\ x_{t}-\tilde{x}_{t+1}\rangle\leq||x_{t}-\tilde{x}_{t+1}||\cdot||l_{t}-\hat{l}_{t}||_{\star}.

By combining the last two inequalities, we obtain (46). To prove (47), first note that by using the fact that a,b,ρ>0,ab12ρa2+ρ2b2\forall a,b,\rho>0,ab\leq\frac{1}{2\rho}a^{2}+\frac{\rho}{2}b^{2},

xtx~t+1ltl^tηt+12βltl^t2+β2ηt+1xtx~t+12ηt+12βltl^t2+1ηt+1BR(xt;x~t+1).||x_{t}-\tilde{x}_{t+1}||\cdot||l_{t}-\hat{l}_{t}||_{\star}\leq\frac{\eta_{t+1}}{2\beta}||l_{t}-\hat{l}_{t}||_{\star}^{2}+\frac{\beta}{2\eta_{t+1}}||x_{t}-\tilde{x}_{t+1}||^{2}\leq\frac{\eta_{t+1}}{2\beta}||l_{t}-\hat{l}_{t}||_{\star}^{2}+\frac{1}{\eta_{t+1}}B^{R}(x_{t};\tilde{x}_{t+1}).

For the second part of the statement, if \nabla\hat{{{\cal L}}}_{t} is \hat{L}^{{\cal L}}_{t}-smooth:

ltl^t2\displaystyle||l_{t}-\hat{l}_{t}||_{\star}^{2} =t(xt)^t(x~t)2\displaystyle=||\nabla{{\cal L}}_{t}(x_{t})-\nabla\hat{{{\cal L}}}_{t}(\tilde{x}_{t})||_{\star}^{2}
2t(xt)^t(xt)2+2^t(xt)^t(x~t)2\displaystyle\leq 2||\nabla{{\cal L}}_{t}(x_{t})-\nabla\hat{{{\cal L}}}_{t}(x_{t})||_{\star}^{2}+2||\nabla\hat{{{\cal L}}}_{t}(x_{t})-\nabla\hat{{{\cal L}}}_{t}(\tilde{x}_{t})||_{\star}^{2}
2t(xt)^t(xt)2+2(L^t)2xtx~t2\displaystyle\leq 2||\nabla{{\cal L}}_{t}(x_{t})-\nabla\hat{{{\cal L}}}_{t}(x_{t})||_{\star}^{2}+2\left(\hat{L}^{{\cal L}}_{t}\right)^{2}||x_{t}-\tilde{x}_{t}||^{2}
2εt()+2(L^t)2βBR(xt;x~t).\displaystyle\leq 2\varepsilon_{t}({\cal L})+\frac{2\left(\hat{L}^{{\cal L}}_{t}\right)^{2}}{\beta}B^{R}(x_{t};\tilde{x}_{t}).

By inserting in (46) and dividing both sides by ηt\eta_{t}:

lt,xtu\displaystyle\langle l_{t},\ x_{t}-u\rangle BR(u;x~t)BR(u;x~t+1)ηt+ηt+1βεt()+(L^t)2ηt+1β2BR(xt;x~t)\displaystyle\leq\frac{B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})}{\eta_{t}}+\frac{\eta_{t+1}}{\beta}\varepsilon_{t}({\cal L})+\frac{\left(\hat{L}^{{\cal L}}_{t}\right)^{2}\eta_{t+1}}{\beta^{2}}B^{R}(x_{t};\tilde{x}_{t})
1ηt(BR(x~t+1;xt)+BR(xt;x~t))\displaystyle\quad\quad-\frac{1}{\eta_{t}}(B^{R}(\tilde{x}_{t+1};x_{t})+B^{R}(x_{t};\tilde{x}_{t}))
BR(u;x~t)BR(u;x~t+1)ηt+ηt+1βεt()+BR(xt;x~t)((L^t)2ηt+1β21ηt)\displaystyle\leq\frac{B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})}{\eta_{t}}+\frac{\eta_{t+1}}{\beta}\varepsilon_{t}({\cal L})+B^{R}(x_{t};\tilde{x}_{t})\left(\frac{\left(\hat{L}^{{\cal L}}_{t}\right)^{2}\eta_{t+1}}{\beta^{2}}-\frac{1}{\eta_{t}}\right)
+BR(xt;x~t+1)(ηt+11ηt1).\displaystyle\quad\quad+B^{R}(x_{t};\tilde{x}_{t+1})(\eta_{t+1}^{-1}-\eta_{t}^{-1}).

If \hat{L}^{{\cal L}}_{t}\leq\beta/\eta_{t}, then \left(\hat{L}^{{\cal L}}_{t}\right)^{2}\leq\frac{\beta^{2}}{\eta_{t}\eta_{t+1}} since \eta_{t} is non-increasing, so the third term on the RHS is non-positive and can be dropped.  
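To make the two updates analyzed in Lemma 25 concrete, here is a minimal Python sketch specialized, for illustration only, to the Euclidean Bregman divergence B^{R}(x;y)=\frac{1}{2}||x-y||^{2} and to {\cal X} being a Euclidean ball, in which case both argmins reduce to projected gradient steps; the helper names are ours, not the paper's.

import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto {x : ||x||_2 <= radius}, standing in for the set X.
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def optimistic_omd_round(x_tilde, grad_hat, observe_grad, eta):
    """One round of optimistic OMD with B^R(x; y) = 0.5 * ||x - y||^2."""
    # Play x_t = argmin_x <eta * l_hat_t, x> + B^R(x; x_tilde_t), using the prediction.
    x_t = project_ball(x_tilde - eta * grad_hat)
    # Observe the true gradient l_t at the played point.
    grad_t = observe_grad(x_t)
    # Secondary iterate x_tilde_{t+1} = argmin_x <eta * l_t, x> + B^R(x; x_tilde_t).
    x_tilde_next = project_ball(x_tilde - eta * grad_t)
    return x_t, x_tilde_next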

Proof [of Theorem 10] From (47), we have for any t1t\geq 1

lt,xtuBR(u;x~t)BR(u;x~t+1)ηt+BR(xt;x~t+1)(ηt+11ηt1)+ηt+1βεt().\langle l_{t},\ x_{t}-u\rangle\leq\frac{B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})}{\eta_{t}}+B^{R}(x_{t};\tilde{x}_{t+1})(\eta_{t+1}^{-1}-\eta_{t}^{-1})+\frac{\eta_{t+1}}{\beta}\varepsilon_{t}({\cal L}). (49)

Note that by convexity of {\cal L}_{t}, {\cal L}_{t}(x_{t})-{\cal L}_{t}(u)\leq\langle l_{t},\ x_{t}-u\rangle. Therefore, summing from 1 to T, we have

RegretT(u)\displaystyle\text{Regret}_{T}(u) t=1Tlt,xtu\displaystyle\leq\sum_{t=1}^{T}\langle l_{t},\ x_{t}-u\rangle
t=1TBR(u;x~t)BR(u;x~t+1)ηt+t=1TBR(xt;x~t+1)(ηt+11ηt1)+t=1Tηt+1βεt()\displaystyle\leq\sum_{t=1}^{T}\frac{B^{R}(u;\tilde{x}_{t})-B^{R}(u;\tilde{x}_{t+1})}{\eta_{t}}+\sum_{t=1}^{T}B^{R}(x_{t};\tilde{x}_{t+1})(\eta_{t+1}^{-1}-\eta_{t}^{-1})+\sum_{t=1}^{T}\frac{\eta_{t+1}}{\beta}\varepsilon_{t}({\cal L})
BR(u;x~1)η1+t=1T1(1ηt+11ηt)BR(u;x~t+1)+t=1T(1ηt+11ηt)BR(xt;x~t+1)+t=1Tηt+1βεt()\displaystyle\leq\frac{B^{R}(u;\tilde{x}_{1})}{\eta_{1}}+\sum_{t=1}^{T-1}\left(\frac{1}{\eta_{{t+1}}}-\frac{1}{\eta_{t}}\right)B^{R}(u;\tilde{x}_{t+1})+\sum_{t=1}^{T}\left(\frac{1}{\eta_{{t+1}}}-\frac{1}{\eta_{t}}\right)B^{R}(x_{t};\tilde{x}_{t+1})+\sum_{t=1}^{T}\frac{\eta_{t+1}}{\beta}\varepsilon_{t}({\cal L})
BηT+BηT+1+t=1Tηt+1βεt()\displaystyle\leq\frac{B}{\eta_{T}}+\frac{B}{\eta_{T+1}}+\sum_{t=1}^{T}\frac{\eta_{t+1}}{\beta}\varepsilon_{t}({\cal L})
2BηT+1+t=1Tηt+1βεt(),\displaystyle\leq\frac{2B}{\eta_{T+1}}+\sum_{t=1}^{T}\frac{\eta_{t+1}}{\beta}\varepsilon_{t}({\cal L}),

where B=maxtBR(u;xt)B=\max_{t}B^{R}(u;x_{t}).
To prove the Adagrad regret bound (18), we set

ηt:=βBmin{1t1()+t2(),1LtB},\eta_{t}:=\sqrt{\beta B}\min\left\{\frac{1}{\sqrt{{\cal E}_{t-1}({\cal L})}+\sqrt{{\cal E}_{t-2}({\cal L})}},\frac{1}{L^{{\cal L}}_{t}\sqrt{B}}\right\},

note that it is non-increasing. Moreover, we have \eta_{t}\leq\frac{\sqrt{\beta}}{L^{{\cal L}}_{t}}. Therefore,

L^tβLtL^tβηt.\hat{L}^{{\cal L}}_{t}\leq\sqrt{\beta}L^{{\cal L}}_{t}\Longrightarrow\hat{L}^{{\cal L}}_{t}\leq\frac{\beta}{\eta_{t}}.

We can apply Equation 16:

Regrett(u)2Bηt+1+τ=1tητ+1βετ().\text{Regret}_{t}(u)\leq\frac{2B}{\eta_{t+1}}+\sum_{\tau=1}^{t}\frac{\eta_{\tau+1}}{\beta}\varepsilon_{\tau}({\cal L}). (50)

Using \sqrt{{\cal E}_{t-1}({\cal L})}-\sqrt{{\cal E}_{t-2}({\cal L})}=\varepsilon_{t-1}({\cal L})/(\sqrt{{\cal E}_{t-1}({\cal L})}+\sqrt{{\cal E}_{t-2}({\cal L})}), the learning rate can be rewritten as

ηt=βBmin{t1()t2()εt1(),1LtB}.\eta_{t}=\sqrt{\beta B}\min\left\{\frac{\sqrt{{\cal E}_{t-1}({\cal L})}-\sqrt{{\cal E}_{t-2}({\cal L})}}{\varepsilon_{t-1}({\cal L})},\frac{1}{L^{{\cal L}}_{t}\sqrt{B}}\right\}. (51)

Moreover,

ηt1(βB)1max{2t1(),LtB}2(βB)1(t()+LtB).\eta_{t}^{-1}\leq\left(\sqrt{\beta B}\right)^{-1}\max\left\{2\sqrt{{\cal E}_{t-1}({\cal L})},L^{{\cal L}}_{t}\sqrt{B}\right\}\leq 2\left(\sqrt{\beta B}\right)^{-1}(\sqrt{{\cal E}_{t}({\cal L})}+L^{{\cal L}}_{t}\sqrt{B}). (52)

Using (51) and (52) in the regret upper bound (50):

\text{Regret}_{t}(u)\leq\frac{2B}{\eta_{t+1}}+\sum_{\tau=1}^{t}\frac{\eta_{\tau+1}}{\beta}\varepsilon_{\tau}({\cal L})
\leq 4\sqrt{\frac{B}{\beta}}\left(\sqrt{{\cal E}_{t}({\cal L})}+L^{{\cal L}}_{t}\sqrt{B}\right)+\sqrt{\frac{B}{\beta}}\sum_{\tau=1}^{t}\left(\sqrt{{\cal E}_{\tau}({\cal L})}-\sqrt{{\cal E}_{\tau-1}({\cal L})}\right)
\leq 4\sqrt{\frac{B}{\beta}}\left(\sqrt{{\cal E}_{t}({\cal L})}+L^{{\cal L}}_{t}\sqrt{B}\right)+\sqrt{\frac{B}{\beta}}\sqrt{{\cal E}_{t}({\cal L})}
\leq 5\sqrt{\frac{B}{\beta}}\left(\sqrt{{\cal E}_{t}({\cal L})}+L^{{\cal L}}_{t}\sqrt{B}\right).

 

Appendix D Dynamic Regret guarantee

We present here the dynamic regret decomposition lemma.

Lemma 27 (Dynamic Regret decomposition).

For any OCO algorithm 𝒜{\cal A}, if Φ\Phi is a Lyapunov potential function, we have that for any t1t\geq 1, and any admissible sequence u1,,uTu_{1},\dots,u_{T}

Φ(Qt+1)Φ(Q1)+DynRegrett(u1:t)DynRegrett𝒜(u1:t;1:t)+St,\Phi(Q_{t+1})-\Phi(Q_{1})+\text{DynRegret}_{t}(u_{1:t})\leq\text{DynRegret}_{t}^{\cal A}(u_{1:t};\;{\cal L}_{1:t})+S_{t}, (53)

where S_{t}=\sum_{\tau=1}^{t}g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau})), and \text{DynRegret}_{t}^{\cal A}(u_{1:t};\;{\cal L}_{1:t}) is the dynamic regret of the algorithm {\cal A} run on the sequence of losses {\cal L}_{1},\dots,{\cal L}_{t}.

Proof  By convexity of Φ\Phi, for any τ1\tau\geq 1:

Φ(Qτ+1)\displaystyle\Phi(Q_{\tau+1}) Φ(Qτ)+Φ(Qτ+1)(Qτ+1Qτ)\displaystyle\leq\Phi(Q_{\tau})+\Phi^{\prime}(Q_{\tau+1})\cdot(Q_{\tau+1}-Q_{\tau})
=\Phi(Q_{\tau})+\Phi^{\prime}(Q_{\tau+1})\cdot g^{+}_{\tau}(x_{\tau}).

Since the comparator sequence is admissible, we have by definition g^{+}_{\tau}(u_{\tau})=0 for all \tau\geq 1, thus

Φ(Qτ+1)Φ(Qτ)+(fτ(xτ)fτ(uτ))\displaystyle\Phi(Q_{\tau+1})-\Phi(Q_{\tau})+(f_{\tau}(x_{\tau})-f_{\tau}(u_{\tau}))
Φ(Qτ+1)gτ+(xτ)+(fτ(xτ)fτ(uτ))\displaystyle\leq\Phi^{\prime}(Q_{\tau+1})g^{+}_{\tau}(x_{\tau})+(f_{\tau}(x_{\tau})-f_{\tau}(u_{\tau}))
fτ(xτ)+Φ(Qτ)gτ+(xτ)\displaystyle\leq f_{\tau}(x_{\tau})+\Phi^{\prime}(Q_{\tau})g^{+}_{\tau}(x_{\tau})
((fτ(uτ)+Φ(Qτ)gτ+(uτ))\displaystyle\quad\quad-\big{(}(f_{\tau}(u_{\tau})+\Phi^{\prime}(Q_{\tau})g^{+}_{\tau}(u_{\tau})\big{)}
+gτ+(xτ)(Φ(Qτ+1)Φ(Qτ))\displaystyle\quad\quad+g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau}))
τ(xτ)τ(uτ)+gτ+(xτ)(Φ(Qτ+1)Φ(Qτ)).\displaystyle\leq{\cal L}_{\tau}(x_{\tau})-{\cal L}_{\tau}(u_{\tau})+g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau})).

Summing τ\tau from 11 to tt:

Φ(Qt+1)Φ(Q1)+DynRegrett(u1:t)DynRegrett𝒜(u1:t;1:t)+St,\Phi(Q_{t+1})-\Phi(Q_{1})+\text{DynRegret}_{t}(u_{1:t})\leq\text{DynRegret}_{t}^{\cal A}(u_{1:t};\;{\cal L}_{1:t})+S_{t},

where

St=τ=1tgτ+(xτ)(Φ(Qτ+1)Φ(Qτ)).S_{t}=\sum_{\tau=1}^{t}g^{+}_{\tau}(x_{\tau})(\Phi^{\prime}(Q_{\tau+1})-\Phi^{\prime}(Q_{\tau})).

 

Appendix E Contextual bandits with expert advice

Algorithm 5 Modified EXP4.OVAR
1:Exploration probability δ[0,1]\delta\in[0,1].
2:Define Δ¯Π:={xΔΠ:x[π]1MT,πΠ}\bar{\Delta}_{\Pi}:=\{x\in\Delta_{\Pi}:x[\pi]\geq\frac{1}{MT},\forall\pi\in\Pi\}.
3:Initialize E0=0E_{0}=0 and x~1[π]=1M\tilde{x}_{1}[\pi]=\frac{1}{M} for all πΠ\pi\in\Pi.
4:for round t=1Tt=1\dots T do
5:     Receive context sts_{t} and make predictions l^t\hat{l}_{t}.
6:     Update learning rate:
ηt=log(MT)min{1Et1+Et2,1}\eta_{t}=\sqrt{\log(MT)}\min\left\{\frac{1}{\sqrt{E_{t-1}}+\sqrt{E_{t-2}}},1\right\} (54)
7:     Compute
xt:=argminxΔ¯Π{ηtπΠx[π]l^t[π(st)]+DKL(x,x~t)}.x_{t}:=\arg\min_{x\in\bar{\Delta}_{\Pi}}\left\{\eta_{t}\sum_{\pi\in\Pi}x[\pi]\hat{l}_{t}[\pi(s_{t})]+D_{\text{KL}}(x,\tilde{x}_{t})\right\}. (55)
8:     Compute ptΔKp_{t}\in\Delta_{K}: pt[a]=(1δ)π:π(st)=axt[π]+δKp_{t}[a]=(1-\delta)\sum_{\pi:\pi(s_{t})=a}x_{t}[\pi]+\frac{\delta}{K}.
9:     Sample a_{t}\sim p_{t} and receive the loss l_{t}[a_{t}].
10:     Construct estimator l~t[a]=lt[a]l^t[a]pt[a]𝟙{at=a}+l^t[a]\tilde{l}_{t}[a]=\frac{l_{t}[a]-\hat{l}_{t}[a]}{p_{t}[a]}\mathbbm{1}\left\{a_{t}=a\right\}+\hat{l}_{t}[a] for all a[K]a\in[K].
11:     Update cumulative error E_{t}=E_{t-1}+\frac{(l_{t}[a_{t}]-\hat{l}_{t}[a_{t}])^{2}}{p_{t}[a_{t}]^{2}}.
12:     Update
x~t+1=argminxΔ¯Π{ηtπΠx[π]l~t[π(st)]+DKL(x,x~t)}\tilde{x}_{t+1}=\arg\min_{x\in\bar{\Delta}_{\Pi}}\left\{\eta_{t}\sum_{\pi\in\Pi}x[\pi]\tilde{l}_{t}[\pi(s_{t})]+D_{\text{KL}}(x,\tilde{x}_{t})\right\}
13:end for

First we introduce the shorthand notation: \forall l\in\mathbb{R}^{K} and x\in\Delta_{\Pi}:

l,xt:=πΠx[π]l[π(st)].{\langle l,\ x\rangle}_{t}:=\sum_{\pi\in\Pi}x[\pi]l[\pi(s_{t})].

The modified algorithm EXP4.OVAR is presented in Algorithm 5. Note that we modify the learning rate to one similar to that of Algorithm 3. Moreover, the original EXP4.OVAR uses different learning rates for the updates of x_{t} and \tilde{x}_{t+1}; we do not do so in our setting, as it would introduce a term of the form \mathbb{E}[B_{T}\cdot E_{T}] (where E_{T} is the "cumulative error"), which is not trivial to upper bound in terms of \mathbb{E}[B_{T}] and \mathbb{E}[E_{T}].
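As a complement, the following Python sketch illustrates the bandit-specific part of one round of Algorithm 5 (lines 6 and 8-11): the learning rate (54), the action distribution p_{t}, the optimistic importance-weighted estimator \tilde{l}_{t}, and the cumulative-error update. It is a simplified sketch with illustrative names; the KL mirror steps over the clipped simplex \bar{\Delta}_{\Pi} (lines 7 and 12) are left abstract.

import numpy as np

def exp4_ovar_bandit_step(x_t, policy_actions, l_hat, true_loss, delta,
                          E_prev, E_prev2, M, T, rng):
    """Bandit-specific steps of one round of Algorithm 5.
    policy_actions[i] = pi_i(s_t); x_t is the distribution over the M policies;
    l_hat and true_loss are the predicted and realized loss vectors in R^K;
    rng is a numpy Generator, e.g. np.random.default_rng()."""
    K = len(l_hat)
    # Learning rate (54): eta_t = sqrt(log(MT)) * min(1/(sqrt(E_{t-1}) + sqrt(E_{t-2})), 1).
    denom = np.sqrt(E_prev) + np.sqrt(E_prev2)
    eta_t = np.sqrt(np.log(M * T)) * min(1.0 / denom if denom > 0 else 1.0, 1.0)
    # Action distribution: p_t[a] = (1 - delta) * sum_{pi: pi(s_t) = a} x_t[pi] + delta / K.
    p_t = np.full(K, delta / K)
    np.add.at(p_t, policy_actions, (1.0 - delta) * x_t)
    p_t = p_t / p_t.sum()                      # guard against floating-point drift
    # Sample a_t ~ p_t; only the loss of the chosen action is observed.
    a_t = rng.choice(K, p=p_t)
    observed = true_loss[a_t]
    # Optimistic estimator: l_tilde[a] = (l_t[a] - l_hat[a]) / p_t[a] * 1{a = a_t} + l_hat[a].
    l_tilde = l_hat.copy()
    l_tilde[a_t] += (observed - l_hat[a_t]) / p_t[a_t]
    # Cumulative error E_t = E_{t-1} + (l_t[a_t] - l_hat[a_t])^2 / p_t[a_t]^2.
    E_t = E_prev + (observed - l_hat[a_t]) ** 2 / p_t[a_t] ** 2
    return eta_t, a_t, l_tilde, E_t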

Theorem 28 (EXP4.OVAR Regret, adapted from Wei et al. (2020)).

Let l_{t}\in[0,B_{t}] be a sequence of loss vectors, where B_{t} is non-decreasing, and l_{t} and B_{t} are chosen by the environment but may depend on a_{1},\dots,a_{t-1}. Let \hat{l}_{t}\in[0,B_{t}] be the prediction and denote {\cal E}_{T}({\cal L}):=\sum_{t=1}^{T}||l_{t}-\hat{l}_{t}||_{\infty}^{2}. Then

RegretT𝒜(l1,,lT)𝔼[BT](1+δT)+log(MT)(6K2T()δ+2).\text{Regret}_{T}^{\cal A}(l_{1},\dots,l_{T})\leq\mathbb{E}\left[B_{T}\right](1+\delta T)+\sqrt{\log(MT)}\left(6\sqrt{\frac{K^{2}{\cal E}_{T}({\cal L})}{\delta}}+2\right). (56)

Furthermore, if we set δ=(KTlog(MT))2/3\delta=\left(\frac{K}{T}\sqrt{\log(MT)}\right)^{2/3}:

RegretT𝒜(l1,,lT)(𝔼[BT]+6T())(TK2log(MT))1/3+2log(MT)+𝔼[BT]\text{Regret}_{T}^{\cal A}(l_{1},\dots,l_{T})\leq\left(\mathbb{E}\left[B_{T}\right]+6\sqrt{{\cal E}_{T}({\cal L})}\right)(TK^{2}\log(MT))^{1/3}+2\sqrt{\log(MT)}+\mathbb{E}[B_{T}] (57)

Proof  The proof follows exactly the steps in Wei et al. (2020). However, we slightly modify it to allow losses in [0,B_{t}] instead of [0,1] and losses that depend on the past, which adds an extra expectation to the computations. We first recall the relevant results from Wei et al. (2020). Let \pi^{\star}\in\Pi. Denote x^{\star}=\left(1-\frac{1}{T}\right){\mathbf{e}}_{\pi^{\star}}+\frac{1}{MT}\bm{1}\in\bar{\Delta}_{\Pi}, where {\mathbf{e}}_{\pi^{\star}} is the distribution that concentrates on \pi^{\star}. From Lemma 25, we have:

l~t,xtxtDKL(x,x~t)DKL(x,x~t+1)ηt+2ηt+1l^tl~t2+DKL(xt;x~t+1)(ηt+11ηt1).{\langle\tilde{l}_{t},\ x_{t}-x^{\star}\rangle}_{t}\leq\frac{D_{\text{KL}}(x^{\star},\tilde{x}_{t})-D_{\text{KL}}(x^{\star},\tilde{x}_{t+1})}{\eta_{t}}+2\eta_{t+1}||\hat{l}_{t}-\tilde{l}_{t}||_{\infty}^{2}+D_{\text{KL}}(x_{t};\tilde{x}_{t+1})(\eta_{t+1}^{-1}-\eta_{t}^{-1}). (58)

By replacing xx^{\star} by its expression and summing over tt, we obtain

t=1Tl~t,xtt(11T)t=1Tl~t[π(st)]1MTt=1Tl~t, 1tt=1TDKL(x,x~t)DKL(x,x~t+1)ηt+t=1TDKL(xt;x~t+1)(ηt+11ηt1)+2t=1Tηt+1l^tl~t2.\begin{split}\sum_{t=1}^{T}{\langle\tilde{l}_{t},\ x_{t}\rangle}_{t}-&\left(1-\frac{1}{T}\right)\sum_{t=1}^{T}\tilde{l}_{t}[\pi^{\star}(s_{t})]-\frac{1}{MT}\sum_{t=1}^{T}{\langle\tilde{l}_{t},\ \bm{1}\rangle}_{t}\\ &\leq\sum_{t=1}^{T}\frac{D_{\text{KL}}(x^{\star},\tilde{x}_{t})-D_{\text{KL}}(x^{\star},\tilde{x}_{t+1})}{\eta_{t}}+\sum_{t=1}^{T}D_{\text{KL}}(x_{t};\tilde{x}_{t+1})(\eta_{t+1}^{-1}-\eta_{t}^{-1})+2\sum_{t=1}^{T}\eta_{t+1}||\hat{l}_{t}-\tilde{l}_{t}||_{\infty}^{2}.\end{split} (59)

We now upper bound each of the three sums on the RHS. The first one can be rewritten as:

\sum_{t=1}^{T}\frac{D_{\text{KL}}(x^{\star},\tilde{x}_{t})-D_{\text{KL}}(x^{\star},\tilde{x}_{t+1})}{\eta_{t}}=\frac{D_{\text{KL}}(x^{\star},\tilde{x}_{1})}{\eta_{1}}+\sum_{t=2}^{T}D_{\text{KL}}(x^{\star},\tilde{x}_{t})\left(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}\right)-\frac{D_{\text{KL}}(x^{\star},\tilde{x}_{T+1})}{\eta_{T}}.

Then, note that for any xΔ¯Π,DKL(x,x)log(MT)x\in\bar{\Delta}_{\Pi},D_{\text{KL}}(x^{\star},x)\leq\log(MT) because x[π]1MTx[\pi]\geq\frac{1}{MT}. Therefore,

t=1TDKL(x,x~t)DKL(x,x~t+1)ηtlog(MT)ηTandt=1TDKL(xt;x~t+1)(ηt+11ηt1)log(MT)ηT+1.\sum_{t=1}^{T}\frac{D_{\text{KL}}(x^{\star},\tilde{x}_{t})-D_{\text{KL}}(x^{\star},\tilde{x}_{t+1})}{\eta_{t}}\leq\frac{\log(MT)}{\eta_{T}}\quad{\text{and}}\quad\sum_{t=1}^{T}D_{\text{KL}}(x_{t};\tilde{x}_{t+1})(\eta_{t+1}^{-1}-\eta_{t}^{-1})\leq\frac{\log(MT)}{\eta_{T+1}}.

For the third sum, by replacing l~\tilde{l} by its definition, we have l^tl~t2=(l^t[at]lt[at]pt[at])2||\hat{l}_{t}-\tilde{l}_{t}||_{\infty}^{2}=\left(\frac{\hat{l}_{t}[a_{t}]-l_{t}[a_{t}]}{p_{t}[a_{t}]}\right)^{2}. As in (51),

ηt+1log(MT)EtEt1l^tl~t2,\eta_{t+1}\leq\sqrt{\log(MT)}\frac{\sqrt{E_{t}}-\sqrt{E_{t-1}}}{||\hat{l}_{t}-\tilde{l}_{t}||_{\infty}^{2}},

resulting in

t=1Tηt+1l^tl~t2log(MT)t=1TEtEt1log(MT)ET,\sum_{t=1}^{T}\eta_{t+1}||\hat{l}_{t}-\tilde{l}_{t}||_{\infty}^{2}\leq\sqrt{\log(MT)}\sum_{t=1}^{T}\sqrt{E_{t}}-\sqrt{E_{t-1}}\leq\sqrt{\log(MT)}\sqrt{E_{T}},

and

log(MT)ηTlog(MT)ηT+1log(MT)(2ET+1).\frac{\log(MT)}{\eta_{T}}\leq\frac{\log(MT)}{\eta_{T+1}}\leq\sqrt{\log(MT)}\left(2\sqrt{E_{T}}+1\right).

Thus the RHS of (59) is upper bounded by: log(MT)(6ET+2)\sqrt{\log(MT)}\left(6\sqrt{E_{T}}+2\right). Note that:

ET\displaystyle E_{T} =t=1T(l^t[at]lt[at]pt[at])2,\displaystyle=\sum_{t=1}^{T}\left(\frac{\hat{l}_{t}[a_{t}]-l_{t}[a_{t}]}{p_{t}[a_{t}]}\right)^{2},
Kδt=1T(l^t[at]lt[at])2pt[at],\displaystyle\leq\frac{K}{\delta}\sum_{t=1}^{T}\frac{(\hat{l}_{t}[a_{t}]-l_{t}[a_{t}])^{2}}{p_{t}[a_{t}]}, using pt[a]δK,a[K].\displaystyle\text{using }p_{t}[a]\geq\frac{\delta}{K},\;\forall a\in[K].

Then, taking the expectation:

𝔼[ET]\displaystyle\mathbb{E}[E_{T}] Kδt=1T𝔼[(l^t[at]lt[at])2pt[at]],\displaystyle\leq\frac{K}{\delta}\sum_{t=1}^{T}\mathbb{E}\left[\frac{(\hat{l}_{t}[a_{t}]-l_{t}[a_{t}])^{2}}{p_{t}[a_{t}]}\right],
K2δt=1Tl^tlt2=K2δT(),\displaystyle\leq\frac{K^{2}}{\delta}\sum_{t=1}^{T}||\hat{l}_{t}-l_{t}||_{\infty}^{2}=\frac{K^{2}}{\delta}{\cal E}_{T}({\cal L}),

where the inequality comes from (l^t[a]lt[a])2l^lt2,a[K](\hat{l}_{t}[a]-l_{t}[a])^{2}\leq||\hat{l}-l_{t}||_{\infty}^{2},\;\forall a\in[K] and 𝔼[1pt[at]]=K\mathbb{E}\left[\frac{1}{p_{t}[a_{t}]}\right]=K. Thus, by taking the expected value in (59), we have

𝔼[t=1Tl~t,xtt(11T)t=1Tl~t[π(st)]1MTt=1Tl~t, 1t],\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}{\langle\tilde{l}_{t},\ x_{t}\rangle}_{t}-\left(1-\frac{1}{T}\right)\sum_{t=1}^{T}\tilde{l}_{t}[\pi^{\star}(s_{t})]-\frac{1}{MT}\sum_{t=1}^{T}{\langle\tilde{l}_{t},\ \bm{1}\rangle}_{t}\right],
log(MT)𝔼[6ET+2],\displaystyle\leq\sqrt{\log(MT)}\mathbb{E}\left[6\sqrt{E_{T}}+2\right],
log(MT)(6𝔼[ET]+2),\displaystyle\leq\sqrt{\log(MT)}(6\sqrt{\mathbb{E}\left[E_{T}\right]}+2), (Jensen’s inequality)
log(MT)(6K2T()δ+2).\displaystyle\leq\sqrt{\log(MT)}\left(6\sqrt{\frac{K^{2}{\cal E}_{T}({\cal L})}{\delta}}+2\right). (60)

We can now lower bound the LHS of (60).

𝔼[t=1Tl~t,xtt(11T)t=1Tl~t[π(st)]1MTt=1Tl~t, 1t],\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}{\langle\tilde{l}_{t},\ x_{t}\rangle}_{t}-\left(1-\frac{1}{T}\right)\sum_{t=1}^{T}\tilde{l}_{t}[\pi^{\star}(s_{t})]-\frac{1}{MT}\sum_{t=1}^{T}{\langle\tilde{l}_{t},\ \bm{1}\rangle}_{t}\right],
𝔼[t=1TπΠxt[π]l~t[π(st)]t=1Tl~t[π(st)]1MTt=1TπΠl~t[π(st)]],\displaystyle\geq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{\pi\in\Pi}x_{t}[\pi]\tilde{l}_{t}[\pi(s_{t})]-\sum_{t=1}^{T}\tilde{l}_{t}[\pi^{\star}(s_{t})]-\frac{1}{MT}\sum_{t=1}^{T}\sum_{\pi\in\Pi}\tilde{l}_{t}[\pi(s_{t})]\right],
(i)𝔼[t=1TπΠxt[π]lt[π(st)]t=1Tlt[π(st)]1MTt=1TπΠlt[π(st)]],\displaystyle\overset{(i)}{\geq}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{\pi\in\Pi}x_{t}[\pi]l_{t}[\pi(s_{t})]-\sum_{t=1}^{T}l_{t}[\pi^{\star}(s_{t})]-\frac{1}{MT}\sum_{t=1}^{T}\sum_{\pi\in\Pi}l_{t}[\pi(s_{t})]\right],
𝔼[t=1TπΠxt[π]lt[π(st)]t=1Tlt[π(st)]]𝔼[BT],\displaystyle\geq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{\pi\in\Pi}x_{t}[\pi]l_{t}[\pi(s_{t})]-\sum_{t=1}^{T}l_{t}[\pi^{\star}(s_{t})]\right]-\mathbb{E}\left[B_{T}\right],
(ii)𝔼[t=1Ta[K](pt[a]+δπ:π(st)=axt[π]δK)lt[a]t=1Tlt[π(st)]]𝔼[BT],\displaystyle\overset{(ii)}{\geq}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{a\in[K]}\left(p_{t}[a]+\delta\sum_{\pi:\pi(s_{t})=a}x_{t}[\pi]-\frac{\delta}{K}\right)l_{t}[a]-\sum_{t=1}^{T}l_{t}[\pi^{\star}(s_{t})]\right]-\mathbb{E}\left[B_{T}\right],
𝔼[t=1Ta[K]pt[a]lt[a]t=1Tlt[π(st)]]𝔼[BT](1+δT),\displaystyle\geq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{a\in[K]}p_{t}[a]l_{t}[a]-\sum_{t=1}^{T}l_{t}[\pi^{\star}(s_{t})]\right]-\mathbb{E}\left[B_{T}\right](1+\delta T),
=𝔼[t=1Tlt[at]t=1Tlt[π(st)]]𝔼[BT](1+δT),\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}l_{t}[a_{t}]-\sum_{t=1}^{T}l_{t}[\pi^{\star}(s_{t})]\right]-\mathbb{E}\left[B_{T}\right](1+\delta T),
=RegretT𝔼[BT](1+δT).\displaystyle=\text{Regret}_{T}-\mathbb{E}\left[B_{T}\right](1+\delta T). (61)

(i) comes from \mathbb{E}_{t-1}[\tilde{l}_{t}]=\mathbb{E}_{t-1}[l_{t}], where \mathbb{E}_{t-1} is the expectation conditional on all the information up to the end of round t-1. (ii) is a consequence of p_{t}’s definition:

a[K]pt[a]lt[a]\displaystyle\sum_{a\in[K]}p_{t}[a]l_{t}[a] =(1δ)a[K]π:π(st)=axt[π]lt[π(st)]+δKa[K]lt[a],\displaystyle=(1-\delta)\sum_{a\in[K]}\sum_{\pi:\pi(s_{t})=a}x_{t}[\pi]l_{t}[\pi(s_{t})]+\frac{\delta}{K}\sum_{a\in[K]}l_{t}[a],
\Longrightarrow\sum_{\pi\in\Pi}x_{t}[\pi]l_{t}[\pi(s_{t})]=\sum_{a\in[K]}\left(p_{t}[a]+\delta\sum_{\pi:\pi(s_{t})=a}x_{t}[\pi]-\frac{\delta}{K}\right)l_{t}[a].

We can then combine (61) and (60) to obtain (56). Finally, (57) is a straightforward consequence of (56) and the choice of \delta.