An Optimistic Algorithm for Online Convex Optimization with Adversarial Constraints
Abstract
We study Online Convex Optimization (OCO) with adversarial constraints, where an online algorithm must make sequential decisions to minimize both convex loss functions and cumulative constraint violations. We focus on a setting where the algorithm has access to predictions of the loss and constraint functions. Our results show that we can improve the current best bounds of regret and cumulative constraint violations to and , respectively, where and represent the cumulative prediction errors of the loss and constraint functions. In the worst case, where and (assuming bounded gradients of the loss and constraint functions), our rates match the prior results. However, when the loss and constraint predictions are accurate, our approach yields significantly smaller regret and cumulative constraint violations. Finally, we apply our framework to the setting of adversarial contextual bandits with sequential risk constraints, obtaining optimistic regret and constraint-violation bounds that improve on existing results when the prediction quality is sufficiently high.
1 Introduction
We are interested in generalizations of Online Convex Optimization (OCO) to problems in which constraints are imposed but can be violated—generalizations which are referred to as Constrained Online Convex Optimization (COCO). Recall the standard formulation of OCO (Orabona, 2019; Hazan, 2023): At each round, a learner makes a decision, receives a convex loss function from the environment, and suffers the corresponding loss. The goal of the learner is to minimize the cumulative loss. The COCO framework imposes an additional requirement on the learner: meeting a potentially adversarial convex constraint at every time step. The learner observes the constraint only after selecting its action, and cannot always satisfy the constraint exactly, but can hope to have a small cumulative constraint violation. In the adversarial setting, it is not viable to seek absolute minima of the cumulative loss, and the problem is generally formulated in terms of obtaining a sublinear Static Regret—the difference between the learner’s cumulative loss and the cumulative loss of a fixed oracle decision. Having sublinear regret means that, on average, we perform as well as the best action in hindsight. A stronger and more general objective is the Dynamic Regret, where the learner’s performance is benchmarked against a sequence of decisions rather than a fixed action. In the COCO problem, we also aim to ensure sublinear cumulative constraint violation.
One subcategory of OCO problems is adversarial contextual bandits (Auer et al., 2002; Beygelzimer et al., 2011). In that setting, the learner receives contextual information from the environment, selects one action among those available, and only observes the loss of the chosen action. The learner aims to minimize its cumulative loss. Sun et al. (2017) introduced sequential risk constraints in contextual bandits, where, in addition to the loss for each action, the environment generates a risk for each action. In addition to minimizing the cumulative loss, the learner wants to keep the average cumulative risk bounded by a predefined safety threshold.
Recent work in OCO has considered settings in which the adversary is predictable—i.e., not entirely arbitrary—aiming to obtain improved regret bounds (Chiang et al., 2012; Rakhlin and Sridharan, 2013a, b; Mohri and Yang, 2016; Joulani et al., 2017). These works showed that the regret can be improved in terms of a measure of the cumulative prediction error. The optimistic framework has also been studied in the COCO setting by Qiu et al. (2023), who focused on time-invariant constraints, while time-varying constraints were pursued by Anderson et al. (2022), who established bounds for specific cases (e.g., perfect loss predictions, linear constraints).
In the current paper we go beyond earlier work to consider the case of adversarial constraints. Our main contribution is the following: We present the first algorithm to solve COCO problems in which the constraints are adversarial but also predictable, achieving regret and constraint violation in the general convex case. More precisely:
1. We present a meta-algorithm that, when built on an optimistic OCO algorithm, achieves regret and constraint violation bounds matching the best COCO algorithm of Sinha and Vaze (2024) in the worst case.
2. Our algorithm is computationally efficient, as it relies only on a projection onto the simpler set at each time step, instead of a convex optimization step.
3. Furthermore, the same meta-algorithm can be used to prove dynamic regret guarantees with similar constraint violation guarantees.
4. Finally, we show that our method can be used to solve the adversarial contextual bandits problem with sequential risk constraints, providing regret and constraint violation guarantees.
Our theoretical framework exploits state-of-the-art methods from both optimistic OCO and constrained OCO.
The rest of the paper is structured as follows: We present previous work in Section 2, introduce the main assumptions and notation in Section 3, and present the meta-algorithm for the COCO problem in Section 4. We then show how the meta-algorithm yields static regret guarantees in Section 5 and dynamic regret guarantees in Section 6, and present its application to the experts setting in Section 7 and to contextual bandits in Section 8.
Table 1: Comparison of our results with prior work ((D) denotes a dynamic regret bound).
Reference | Complexity per round | Constraints | Loss Function | Regret | Violation |
Guo et al. (2022) | Conv-OPT | Fixed | Convex | ||
Adversarial | Convex | ||||
Adversarial | Convex | (D) | |||
Yi et al. (2023) | Conv-OPT | Adversarial | Convex | ||
Sinha and Vaze (2024) | Projection | Adversarial | Convex | ||
Qiu et al. (2023) | Projection | Fixed | Convex, Slater | ||
Anderson et al. (2022) | Projection | Adversarial | Convex, Perfect predictions | ||
Muthirayan et al. (2022) | Conv-OPT | Known | Convex | ||
Sun et al. (2017) | Projection | Contextual Bandits | |||
Ours | Projection | Adversarial | Convex | ||
Adversarial | Convex | (D) | |||
Contextual Bandits |
2 Related Work
Unconstrained OCO
The OCO problem was introduced by Zinkevich (2003), who established $O(\sqrt{T})$ static regret and $O(\sqrt{T}(1+P_T))$ dynamic regret guarantees based on projected online gradient descent (OGD), where $P_T$ is the path length of the comparator sequence. Hazan (2023) and Orabona (2019) provide overviews of the burgeoning literature that has emerged since Zinkevich’s seminal work, in particular focusing on online mirror descent (OMD) as a general way to solve OCO problems. Zhang et al. (2018) later improved the dynamic regret bound to $O(\sqrt{T(1+P_T)})$.
Optimistic OCO
Optimistic OCO is often formulated as a problem involving gradual variation—i.e., where consecutive loss functions are close in some appropriate metric. Chiang et al. (2012) exploit this assumption in an optimistic version of OMD that incorporates a prediction based on the most recent gradient, and establish a regret bound in terms of the gradual variation of the gradients. Subsequent works (Rakhlin and Sridharan, 2013a, b; Steinhardt and Liang, 2014; Mohri and Yang, 2016; Joulani et al., 2017; Bhaskara et al., 2020) prove that, when using a predictor that is not necessarily the past gradient, one can obtain regret bounds in terms of the cumulative prediction error. The dynamic regret case has also been studied intensively (Jadbabaie et al., 2015; Scroccaro et al., 2023), with the best known bounds given by Zhao et al. (2020, 2024).
Constrained OCO
Constrained OCO was first studied in the context of time-invariant constraints, i.e., where the constraint function is the same at every round. In this setup one can employ projection-free algorithms, avoiding the potentially costly projection onto the feasible set, by allowing the learner to violate the constraints in a controlled way (Mahdavi et al., 2012; Jenatton et al., 2016; Yu and Neely, 2020). The case of time-varying constraints is significantly harder, as the constraints are potentially adversarial. Most of the early work studying such constraints (Neely and Yu, 2017; Yi et al., 2023) accordingly incorporated an additional Slater condition. These papers establish regret guarantees that grow with the inverse of the Slater constant, which unfortunately can be vacuous as this constant can be arbitrarily small. Hutchinson and Alizadeh (2024) studied the setting with time-varying constraints whose constraint sets are monotone, and established a dynamic regret bound when the path length is known beforehand. Guo et al. (2022) presented an algorithm that does not require the Slater condition and yielded improved bounds on static regret, dynamic regret and constraint violations, even when the path length is unknown. However, it requires solving a convex optimization problem at each time step. In more recent work, Sinha and Vaze (2024) presented a simple and efficient algorithm that requires only a projection per round and obtained state-of-the-art guarantees: $O(\sqrt{T})$ regret and $\tilde{O}(\sqrt{T})$ constraint violation. See Table 1 for a comparison of our results with previous work.
Optimistic COCO
Qiu et al. (2023) studied the case with gradual variations and time-invariant constraints, proving a gradual-variation regret guarantee together with a bound on constraint violations. Muthirayan et al. (2022) tackled time-varying but known constraints with predictions, proving regret and cumulative constraint violation guarantees. Under perfect loss predictions, Anderson et al. (2022) demonstrated bounds on both regret and constraint violation. We also include these results in Table 1 for comparison.
Adversarial Contextual Bandits
The adversarial contextual bandit problem was first introduced by Auer et al. (2002), who proposed EXP4, which achieves the optimal expected regret. Wei et al. (2020) later advanced the field by incorporating loss predictions, achieving improved regret when the cumulative prediction error is known beforehand, an improvement over EXP4 when the predictions are accurate. For the case where this quantity is unknown, they developed an algorithm with a corresponding expected regret guarantee. Sun et al. (2017) extended this line of work to include sequential risk constraints (analogous to constrained OCO), developing a modified EXP4 that achieves sublinear regret with sublinear total risk violation.
3 Preliminaries
3.1 Problem setup and notation
Let denote the set of real numbers, and let denote the set of -dimensional real vectors. Let denote the set of possible actions of the learner, where is a specific action, and let be a norm defined on . Let the dual norm be denoted as .
Online learning is a problem formulation in which the learner plays the following game with Nature (a minimal simulation sketch follows the protocol). At each step:
1. The learner plays an action.
2. Nature reveals a loss function and a constraint function. (If we have multiple constraint functions, we take their maximum.)
3. The learner suffers the loss and the constraint violation.
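To make the interaction concrete, the following minimal Python sketch simulates this protocol; the `learner` interface (an `act`/`update` pair) and the accumulation of the positive part of the constraint value are illustrative choices of ours, not part of the formal setup.

```python
def run_protocol(learner, loss_fns, constraint_fns):
    """Simulate the constrained online learning protocol."""
    cum_loss, ccv = 0.0, 0.0
    for f_t, g_t in zip(loss_fns, constraint_fns):
        x_t = learner.act()              # 1. the learner plays x_t
        cum_loss += f_t(x_t)             # 3. loss suffered once f_t, g_t are revealed
        ccv += max(g_t(x_t), 0.0)        #    cumulative constraint violation
        learner.update(f_t, g_t, x_t)    # the learner may now form predictions for t+1
    return cum_loss, ccv
```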
In standard OCO, the loss function $f_t$ is convex, and the goal of the learner is to minimize the regret with respect to an oracle action $x^\star$, where:

$$\mathrm{Regret}_T(x^\star) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^\star). \qquad (1)$$

In COCO, we generalize the OCO problem to additionally ask the learner to obtain a small cumulative constraint violation (CCV), which we denote as $\mathrm{CCV}(T)$:

$$\mathrm{CCV}(T) = \sum_{t=1}^{T} \big(g_t(x_t)\big)^{+}, \qquad \text{where } (z)^{+} := \max(z, 0). \qquad (2)$$
Overall, the goal of the learner is to achieve both sublinear regret, with respect to any action in the oracle set, and sublinear CCV. This is a challenging problem; indeed, Mannor et al. (2009) proved that no algorithm can achieve both sublinear regret and sublinear cumulative constraint violation when the oracle set contains actions that are only feasible on average. However, it is possible to find algorithms that achieve sublinear regret for the smaller set of actions satisfying every constraint, and this latter problem is our focus.
In addition, we assume that at the end of step , the learner can make predictions and . More precisely, we are interested in predictions of the gradients, and, for any function , we denote by the prediction of the gradient of . We abuse notation and denote by the function whose gradient is . Moreover, we define the following prediction errors
(3) |
where is the sequence of actions taken by the learner.
Additionally, for a given strongly convex function $\psi$, we define the Bregman divergence between two points $x$ and $y$:

$$B_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y),\, x - y \rangle. \qquad (4)$$
Two special cases are particularly important (both are computed in the short sketch below):
1. When $\psi(x) = \frac{1}{2}\lVert x \rVert_2^2$, the Bregman divergence is the squared Euclidean distance, $B_\psi(x, y) = \frac{1}{2}\lVert x - y \rVert_2^2$.
2. When $\psi(x) = \sum_i x_i \log x_i$ (the negative entropy on the simplex), the Bregman divergence is the KL divergence, $B_\psi(x, y) = \sum_i x_i \log(x_i / y_i)$.
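As a concrete illustration (ours, not part of the paper), the generic Bregman divergence and both special cases can be evaluated numerically as follows:

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """Generic Bregman divergence B_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - np.dot(grad_psi(y), x - y)

# Case 1: psi(x) = 0.5 * ||x||_2^2  ->  squared Euclidean distance (here equal to 1.0)
sq = lambda x: 0.5 * np.dot(x, x)
d_euclid = bregman(sq, lambda x: x, np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Case 2: psi(p) = sum_i p_i log p_i on the simplex  ->  KL divergence KL(p || q)
neg_entropy = lambda p: np.sum(p * np.log(p))
d_kl = bregman(neg_entropy, lambda p: np.log(p) + 1.0,
               np.array([0.5, 0.5]), np.array([0.9, 0.1]))
```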
3.2 Assumptions
Throughout this paper, we will use various combinations of the following assumptions.
Assumption 1 (Convex set, loss and constraints).
We make the following standard assumptions on the decision set, the losses and the constraints:
1. The decision set is closed, convex and bounded, with a finite diameter.
2. At every round, the loss function is convex and differentiable.
3. At every round, the constraint function is convex and differentiable.
Assumption 2 (Bounded losses).
The loss functions are bounded by and the constraints are bounded by .
Assumption 3 (Feasibility).
The set is not empty.
Assumption 4 (Prediction Sequence Regularity).
For any , the gradient of the loss prediction function and the gradient of the constraint function are and Lipschitz, respectively. That is, , we have:
We abuse notation and let and similarly for . Finally, denote and similarly for .
Assumptions 1, 2 and 3 are standard in COCO (Mahdavi et al., 2012; Jenatton et al., 2016; Yu and Neely, 2020; Qiu et al., 2023; Yi et al., 2023; Guo et al., 2022). Most work on OCO with predictive sequences either assumes that the predicted function is the previous loss function (Chiang et al., 2012; Qiu et al., 2023; D’Orazio and Huang, 2021), or that the learner only predicts a single vector to estimate the next gradient (Rakhlin and Sridharan, 2013a, b; Muthirayan et al., 2022). We generalize this by predicting the entire loss gradient, under a smoothness assumption relating its values at nearby points. When using the most recently observed function as the prediction, Assumption 4 is equivalent to assuming that the gradients of the losses and constraints are Lipschitz, as in Chiang et al. (2012); Qiu et al. (2023). Moreover, Assumption 4 is automatically satisfied when the prediction is a single vector.
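For illustration, two common ways of instantiating the prediction discussed above can be written as follows; the class names and interfaces are ours and purely illustrative:

```python
import numpy as np

class LastGradientPredictor:
    """Use the most recently observed gradient function as the prediction for the next round."""
    def __init__(self, dim):
        self.last_grad_fn = lambda x: np.zeros(dim)   # before round 1, predict the zero gradient
    def predict(self, x):
        return self.last_grad_fn(x)
    def observe(self, grad_fn):
        self.last_grad_fn = grad_fn                   # reuse round t's gradient for round t+1

class ConstantVectorPredictor:
    """Predict a single fixed vector, independent of the evaluation point."""
    def __init__(self, m):
        self.m = np.asarray(m, dtype=float)
    def predict(self, x):
        return self.m          # constant in x, so the smoothness assumption holds trivially
```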
4 Meta-Algorithm for Optimistic COCO
Our meta-algorithm is inspired by Sinha and Vaze (2024). The main idea of that paper is to build a surrogate loss function that can be seen as a Lagrangian of the underlying constrained optimization problem.
The learner then runs AdaGrad (Duchi et al., 2011) on the surrogate, with theoretical guarantees of bounded cumulative constraint violation (CCV) and regret.
Our meta-algorithm is instead based on optimistic methods, such as those presented in the subsequent sections (Section 5, Section 6, Section 7), which allows us to obtain stronger bounds that depend on the prediction quality. Presented in Algorithm 1, this algorithm assumes that at the end of every step, the learner makes predictions of the upcoming loss and constraint functions. (We are actually only interested in the predictions of the gradients, but for simplicity we let the prediction denote any function whose gradient is the predicted gradient.) At each time step, the learner forms a surrogate loss function, defined via a convex, monotonically increasing Lyapunov potential function. Specifically:
(5) |
Using the predictions and , we form a prediction of the Lagrange function , where is defined in Equation 6.
(6) |
In Sinha and Vaze (2024), the update is , but using at would require to be known at the end of . We instead define the following delayed update:
(7) |
The learner then executes the step of algorithm , denoted in Algorithm 1. We then have the following lemma that relates the regret on , CCV, and the regret of on .
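To make the construction concrete, the following self-contained toy sketch combines the meta-algorithm with an optimistic projected-gradient base learner. The linear losses and constraints, the exponential potential, the use of the previous round's gradients as predictions, and the fixed step size are all illustrative assumptions of ours, not the tuning analyzed below; the two points the sketch is meant to convey are the surrogate gradient combining the loss with the scaled constraint, and the delayed update of the queue.

```python
import numpy as np

# Toy instance (assumed): linear losses f_t(x) = <c_t, x> and constraints
# g_t(x) = <a_t, x> - b on the box X = [-1, 1]^d.
rng = np.random.default_rng(0)
T, d, b = 500, 3, 0.1
c = rng.normal(size=(T, d))
a = rng.normal(size=(T, d))

project = lambda x: np.clip(x, -1.0, 1.0)          # projection onto X
Phi_prime = lambda q: np.exp(min(q, 30.0))         # assumed potential Phi(q) = exp(q) - 1 (clipped)
lam, eta = 1.0, 0.05

y = np.zeros(d)                                    # secondary mirror-descent iterate
Q = ccv = cum_loss = 0.0                           # violation queue and running statistics
for t in range(T):
    # Predictions for round t: here simply the previous round's gradients (an assumption).
    c_hat = c[t - 1] if t > 0 else np.zeros(d)
    a_hat = a[t - 1] if t > 0 else np.zeros(d)
    grad_hat = lam * c_hat + Phi_prime(Q) * a_hat  # predicted surrogate gradient
    x = project(y - eta * grad_hat)                # optimistic step: play x_t
    viol = max(a[t] @ x - b, 0.0)                  # observe g_t(x_t)^+
    active = 1.0 if a[t] @ x - b > 0 else 0.0      # subgradient of g_t(.)^+ at x_t
    grad = lam * c[t] + Phi_prime(Q) * a[t] * active
    y = project(y - eta * grad)                    # secondary step with the realized gradient
    Q += viol                                      # delayed queue update, after x_t is played
    cum_loss += c[t] @ x
    ccv += viol
print(f"cumulative loss = {cum_loss:.2f}, CCV = {ccv:.2f}")
```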
Lemma 5 (Regret decomposition).
For any OCO algorithm , if is a Lyapunov potential function, we have that for any , and any
(8) |
where , and is the regret of the algorithm running on the sequence of losses .
Proof By convexity of , for any :
Let , then by definition , thus
Summing from to :
where
In the following we make the assumption that the underlying optimistic OCO algorithm has standard regret guarantees, which we express in terms of a functional that takes as input a sequence of functions and returns a constant. For simplicity, we will use a shorthand notation for this functional. An example is the Lipschitz constant of the loss sequence.
With this assumption and the previous lemma, we can present our main result.
Assumption 6 (Regret of optimistic OCO).
The optimistic OCO algorithm has the following regret guarantee: There is a constant and a sublinear functional such that for any sequence of functions , and any
(9) |
We allow to depend on and other constants of the problem, as long as they are known at the beginning of the algorithm .
Theorem 7 (Optimistic COCO regret and CCV guarantees).
Consider the following assumptions:
a. and satisfy the assumptions of algorithm for all .
b.
c. satisfies Assumption 6.
d. , with .
Under these assumptions, Algorithm 1 has the following regret and CCV guarantees: ,
(10) | ||||
(11) |
We present a sketch of the main ideas here, with the detailed proof deferred to Appendix A. First, using the sublinearity of the square root and the fact that is non-decreasing, we can show that:
(12) |
Then, using (12) and the sublinearity of , we can further upper bound the regret on in 6:
(13) |
In addition, we can upper bound by using 2 and monotonicity:
(14) |
We can then use (13) and (14) in Lemma 5, and after rearranging terms, we obtain
(15) |
where . We obtain
To establish an upper bound on CCV, we leverage the fact that (from 2), which when applied to (15) yields
If , then
Finally, by setting , we obtain
Remark 8.
As in Syrgkanis et al. (2015), we can use the doubling trick for adjusting lambda online at the cost of an additional log term. We provide details in Appendix B.
Remark 9.
If we have constraint functions with , we can set . Alternatively, we can set multiple queues, one for each : , one for each , and set . Finally, define:
Then we can follow the exact same proof to show a regret guarantee:
and CCV guarantee:
The term in will come from:
with being the prediction error of the sequence .
5 Static Regret guarantees
In this section, we first introduce some of the foundational optimistic algorithms that have been used for OCO, then show how we can achieve sublinear static regret and CCV with our meta algorithm.
Optimistic OMD and Optimistic AdaGrad
This approach modifies standard online mirror descent (OMD), which generalizes the projected online gradient descent of Zinkevich (2003) and iteratively steps towards minimizing the most recently observed loss function. The optimistic OMD variant includes a supplementary minimization step using the predicted function, enabling faster convergence to optimality when predictions are accurate. Note that the algorithm is computationally efficient; indeed, a mirror step can be computed in two stages (a short sketch follows the examples below):
1. Compute the point in the dual space satisfying the gradient condition; in particular, if the gradient map is invertible, this point is obtained by inverting it.
2. Project the resulting point back onto the feasible set via the Bregman divergence.
The following two are special cases of OMD:
1. When the regularizer is the squared Euclidean norm, this is simply projected gradient descent.
2. When the feasible set is the simplex and the regularizer is the entropy, the update is multiplicative, followed by a normalization to ensure the iterate remains a probability distribution.
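The following compact sketch shows one round of optimistic OMD in the two special cases above; the projection routine, step size, and function names are placeholders of ours rather than the exact formulation of Algorithm 2.

```python
import numpy as np

def optimistic_pgd_round(y_prev, grad_pred, grad, eta, project):
    """Optimistic OMD with psi = 0.5*||.||_2^2, i.e. optimistic projected gradient descent.

    y_prev    : secondary iterate carried over from the previous round
    grad_pred : predicted gradient, used to choose the played point x_t
    grad      : true gradient observed after playing x_t
    """
    x_t = project(y_prev - eta * grad_pred)   # supplementary (optimistic) step
    y_t = project(y_prev - eta * grad)        # standard mirror step with the true gradient
    return x_t, y_t

def entropic_mirror_step(q_prev, grad, eta):
    """Mirror step on the simplex with the entropic regularizer (multiplicative update)."""
    w = q_prev * np.exp(-eta * grad)
    return w / w.sum()                        # normalization keeps the iterate on the simplex
```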
Theorem 10 establishes our algorithm’s regret bounds. Our analysis extends beyond the vector-based predictions of Rakhlin and Sridharan (2013b) to handle functional predictions, incorporating techniques from Chiang et al. (2012). This extension introduces a dependence on Lipschitz coefficients. We express our bounds in terms of a quantity that vanishes under perfect predictions rather than one that may not—a crucial distinction. This issue has been highlighted before by Scroccaro et al. (2023), who present their regret guarantees in terms of such a quantity. This requires knowing the Lipschitz coefficient of the loss, which is standard in OCO, but we prefer a dependence on the coefficient of the prediction, as the learner has control over it.
Theorem 10 (Optimistic Adagrad, adapted from Rakhlin and Sridharan (2013b), Corollary 2).
Under the following assumptions:
a. Assumption 1 holds,
b. for any , is -Lipschitz where ,
c. for any , ,
d. for any , ,
we have, for any and any ,
(16) |
where . If is:
(17) |
with , then for any , Algorithm 2 has regret
(18) |
By using Algorithm 2 as the OCO algorithm in Algorithm 1, we obtain the following regret guarantee as a direct consequence of Theorem 7 and Theorem 10:
Corollary 11 (Optimistic Adagrad COCO).
Consider the following assumptions:
a.
b. is optimistic Adagrad (Algorithm 2) with
c. and are set as in Theorem 7.
Under these assumptions, the meta-algorithm (1) has the following regret and constraint violation guarantees:
(19) |
Alternatively, one can use optimistic Follow-the-Regularized-Leader (Rakhlin and Sridharan, 2013a; Mohri and Yang, 2016; Joulani et al., 2017) instead of Algorithm 2, which can be shown to enjoy guarantees similar to Theorem 10.
Remark 12.
Even if the constraint sequence is fixed or known, we cannot achieve a stronger guarantee with this algorithm. This is because the corresponding function does not satisfy Assumption 4 in the general case.
6 Dynamic Regret guarantees
Moving beyond a fixed baseline, we can evaluate performance against a time-varying comparator sequence $u_1, \dots, u_T$ whose path length is bounded: $\sum_{t=2}^{T} \lVert u_t - u_{t-1} \rVert \le P_T$. Our objective is to bound the dynamic regret relative to this sequence:

$$\mathrm{DynRegret}_T = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t). \qquad (20)$$
By utilizing Algorithm 2 and slightly modifying the learning rate, we can achieve state-of-the-art dynamic regret guarantees when the path length is known. We will need the following additional assumption:
Assumption 13 (Lipschitz-like Bregman divergence).
, ,
This assumption is always satisfied if is Lipschitz on . This is true in particular when is a norm on the bounded set .
Theorem 14 (Dynamic Regret guarantees in OCO (Jadbabaie et al., 2015)).
Under the following assumptions:
a.
b. for any , is -Lipschitz where ,
c. for any , ,
d. for any , ,
we have, for any sequence and any ,
(21) |
where . By setting as
(22) |
then Algorithm 2 has dynamic regret
(23) |
where .
We omit the proof, but it combines elements from Jadbabaie et al. (2015) to add the term in and the proof of Theorem 10 to ensure the dependency on . We can now use this algorithm in Algorithm 1 to achieve dynamic regret and CCV in COCO. We first need the following definition:
Definition 15.
A sequence is admissible if . We assume that there exists an admissible sequence.
Note that the existence of an admissible sequence is a much weaker assumption than Assumption 3.
Corollary 16 (Dynamic Regret in COCO).
Consider the following assumptions:
a.
b. The predictions are linear.
c. is optimistic Adagrad (Algorithm 2) with and the learning rate defined in (22).
d. , with .
Under these assumptions, the meta-algorithm (1) has the following dynamic regret and constraint violation guarantees: for any admissible sequence of length at most
(24) |
The proof structure mirrors that of Theorem 7, but employs a modified version of Lemma 5 adapted for dynamic regret analysis. We show the modified version of Lemma 5 in Appendix D. By using linear predictions for , we can eliminate the term linear in from the regret guarantee. When is unknown but is observable, we can achieve comparable DynRegret using Algorithm 1 from Jadbabaie et al. (2015) combined with the doubling trick (Algorithm 4, Appendix B). While alternative approaches exist that don’t require observing (Scroccaro et al., 2023; Zhao et al., 2020, 2024), our doubling trick implementation would still necessitate sequence observability.
7 Experts setting
In this setting, the agent has access to experts and has to form a distribution for selecting among them. She observes the loss of each expert and suffers an overall loss which is the expected value over the experts. Formally, we assume where is the number of experts. At each step , the learner selects , a distribution over the experts, then observes the vector of losses and the vector of constraints . The learner then suffers the loss and constraint . Let denote the prediction of and the prediction of .
For the OCO case (i.e., without adversarial constraints), we could use Algorithm 2 with the Euclidean geometry, but in the worst case the resulting bound scales poorly with the number of experts. We are instead able to achieve a logarithmic dependence on the number of experts by working on the simplex with the entropic regularizer, in which case the Bregman divergence is the KL divergence. However, the KL divergence is not upper bounded, as any coordinate can be arbitrarily close to zero. We circumvent this problem in Algorithm 3 by introducing a mixture with the uniform distribution. This algorithm can be found in Rakhlin and Sridharan (2013b) in the context of a two-player zero-sum game.
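A minimal sketch of one round of this mixed optimistic update is given below; the mixing weight and the step size are placeholders of ours, and the precise schedule is the one specified in Algorithm 3, not this one.

```python
import numpy as np

def experts_round(q_prev, loss_pred, loss, eta, gamma):
    """One round of optimistic exponentiated weights with uniform mixing.

    q_prev    : secondary weight vector from the previous round (on the simplex)
    loss_pred : predicted loss vector for the current round
    loss      : true loss vector observed after playing
    """
    n = len(q_prev)
    w = q_prev * np.exp(-eta * loss_pred)
    p_t = (1.0 - gamma) * (w / w.sum()) + gamma / n   # played distribution: mix with uniform
    w = q_prev * np.exp(-eta * loss)
    q_t = w / w.sum()                                 # secondary update with the true losses
    return p_t, q_t
```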
7.1 Static Regret
We first present the OCO guarantee of Algorithm 3. We let and define similarly. Therefore, . We have the following regret guarantee in OCO when using Algorithm 3:
Theorem 17 (Optimistic OMD with experts, (Rakhlin and Sridharan, 2013b)).
Corollary 18 (COCO in experts setting).
For any , let be such that and . Define , where . Assume the required conditions hold, and run the meta-algorithm (Algorithm 1) with the following:
a.
b.
c. Use Algorithm 3 as the OCO algorithm.
Then, we have
(27) |
Moreover, if the sequence is fixed or known, we have the stronger guarantee:
(28) |
Proof The constant gradient assumption in the experts setting prevents us from using in ; therefore, we employ instead. Denote . As a direct consequence of Theorem 7, we obtain the regret guarantee, and:
Finally, noticing that
we prove the CCV bound. If is known at the beginning of , we can use .
7.2 Dynamic Regret
Jadbabaie et al. (2015) show that the previous algorithm also has dynamic regret guarantees. They use a different mixing parameter and a slightly different constant for the learning rate, and they use it in the context of two-player zero-sum games.
Theorem 19.
Under Assumption 1 and for any , is a constant function, with and the learning rate defined as
(29) |
Algorithm 3 has regret
(30) |
Corollary 20 (Dynamic Regret in experts settings).
As before, define , where . Run the meta-algorithm (Algorithm 1) with the following:
a. .
b. Set .
c. Set .
d. Use Algorithm 3 as the OCO algorithm with the learning rate defined in (29).
Then, for any admissible sequence of size ,
(31) |
Moreover, if the sequence is fixed or known, we have the stronger guarantee:
(32) |
This is a direct consequence of Theorem 14 and Theorem 19. As noted in Section 6, we can use the doubling trick when the path length is unknown but the comparator sequence is observable.
8 Adversarial Contextual Bandits with safety constraints
Denote the finite set of possible actions. At each timestep:
1. The environment generates a context, a loss vector and a constraint (or risk) vector.
2. The learner observes the context, proposes a distribution over the possible actions, and then samples an action from it.
3. The environment reveals the loss and risk of the chosen action. (We use the two notations interchangeably.)
To guide decisions, the learner uses a finite family of experts who provide context-dependent action recommendations. Given a safety threshold, we define the subset of consistently safe experts. The learner also has access to predictions of the loss and risk vectors. The goal of the learner is to make the expected regret and the expected CCV as small as possible:
(33) |
where the expectation is with respect to the randomness of the learner (selection of actions ). Note that is a strictly stronger measure than the one used in Sun et al. (2017) where their metric of safety is .
As in previous sections, we first need an algorithm that solves the problem without the adaptive constraints. Here, we employ a modified version of the EXP4.OVAR algorithm (Wei et al., 2020), detailed in Algorithm 5 in Appendix E. The small change we bring is to the learning rate and how it is used in the updates. In most of the bandits literature, the loss vector is assumed to be bounded with known bounds (w.l.o.g. in the unit range). However, when we apply the algorithm to the Lagrangian surrogate, the upper bound of the losses becomes dynamic, varying with time and depending on previous actions. We thus have to take that into account when computing the upper bound on the regret, as highlighted in Theorem 21.
Theorem 21 (Modified EXP4.OVAR Regret, adapted from Wei et al. (2020)).
Let a sequence of loss vectors, where is non-decreasing, and and are chosen by the environment but may depend on . Let be the prediction and denote . Then, under the stated condition on the learning rate, Algorithm 5 has regret
(34) |
See Appendix E for the complete proof. For the problem with adversarial constraints, as in Section 4, we construct a surrogate loss vector similar to the Lagrangian:
(35) |
and use them in the EXP4.OVAR algorithm. For consistency with previous sections, we denote for , and denote and .
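The following short sketch illustrates how such surrogate vectors might be formed and mapped to expert-level losses before an EXP4-style exponential-weights update; the particular weighting (a fixed multiplier on the loss, the potential derivative on the clipped risk, and the subtraction of the safety threshold) is an assumption of ours standing in for Equation 35, and the importance weighting of the single observed entry is omitted for brevity.

```python
import numpy as np

def surrogate_loss_vector(loss_vec, risk_vec, Q, lam, Phi_prime, delta):
    """Lagrangian-style surrogate over the K actions (illustrative form only)."""
    return lam * loss_vec + Phi_prime(Q) * np.maximum(risk_vec - delta, 0.0)

def expert_losses(surrogate_vec, advice):
    """Map action-level surrogate losses to expert-level losses.

    advice[i] is expert i's distribution over the K actions given the context,
    so expert i's loss is the expectation of the surrogate under that distribution.
    """
    return advice @ surrogate_vec

def exp_weights_step(q_prev, pred_expert_loss, expert_loss, eta):
    """Optimistic exponentiated-weights update over the experts (EXP4-style core)."""
    p_t = q_prev * np.exp(-eta * pred_expert_loss); p_t /= p_t.sum()   # play using the prediction
    q_t = q_prev * np.exp(-eta * expert_loss);      q_t /= q_t.sum()   # update with the realized loss
    return p_t, q_t
```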
First, we prove a similar regret decomposition lemma. Denote the expected regret of a contextual bandit algorithm run with the surrogate vectors as loss vectors.
Lemma 22.
Assume that , and . Let be the safety threshold, a convex potential function, and the surrogate defined as in (35). Then
(36) |
The proof is exactly the same as that of Lemma 5, with the additional step of taking the expectation. Finally, by using EXP4.OVAR on the surrogate vectors defined in Equation 35, we prove that we have bounded expected regret and CCV.
Theorem 23.
Assuming:
• The safety threshold is known and the corresponding set of safe experts is not empty.
• , and .
• We define , and as in (35) and use them in EXP4.OVAR.
• , with .
Running Algorithm 1 gives the following guarantees:
(37) |
Proof By definition, we have for any . Thus, we have the regret guarantee of Theorem 21.
(38) |
Inserting it in Lemma 22, using the definition of we have
The rest of the proof is as in Theorem 7, after noticing that, with Jensen’s inequality,
Note that in the worst case the regret and CCV are of an order worse than Sun et al. (2017). However, when the predictions are slightly more accurate, this algorithm improves on Sun et al. (2017), with the most significant improvement when the predictions are near perfect. This is close to optimal, as Wei et al. (2020) prove a lower bound on the best regret achievable by a contextual bandit algorithm with predictions. Note that this algorithm requires the cumulative prediction error (or an upper bound on it) to be known beforehand, as even with the doubling trick we do not directly observe it online. A heuristic method using the current observation as an estimator, together with the doubling trick, could potentially work in practice.
9 Conclusion
This work presents pioneering optimistic algorithms for handling OCO under adversarial constraints. Beyond establishing prediction-error-dependent bounds for both regret and constraint violations, our approach maintains efficiency by using simple projections instead of solving a complete convex optimization problem per iteration. In future work, we are interested in the guarantees obtainable against oracle sets larger than the set of always-feasible actions, and in proving stronger bounds when the loss function is strongly convex. Moreover, we conjecture that a slight alteration of the algorithm should ensure a stronger guarantee when the constraint sequence is fixed or perfectly known, beyond the expert setting. At this stage, the non-smooth gradient of the clipped constraint prevents us from using the constraint itself as the prediction, and therefore from establishing that our algorithm attains this bound.
References
- Anderson et al. (2022) Daron Anderson, George Iosifidis, and Douglas J Leith. Lazy Lagrangians with predictions for online learning. arXiv preprint arXiv:2201.02890, 2022.
- Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
- Beygelzimer et al. (2011) Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26. JMLR Workshop and Conference Proceedings, 2011.
- Bhaskara et al. (2020) Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, and Manish Purohit. Online learning with imperfect hints. In International Conference on Machine Learning, pages 822–831. PMLR, 2020.
- Chiang et al. (2012) Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory, pages 6–1. JMLR Workshop and Conference Proceedings, 2012.
- D’Orazio and Huang (2021) Ryan D’Orazio and Ruitong Huang. Optimistic and adaptive Lagrangian hedging. arXiv preprint arXiv:2101.09603, 2021.
- Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
- Guo et al. (2022) Hengquan Guo, Xin Liu, Honghao Wei, and Lei Ying. Online convex optimization with hard constraints: towards the best of two worlds and beyond. Advances in Neural Information Processing Systems, 35:36426–36439, 2022.
- Hazan (2023) Elad Hazan. Introduction to online convex optimization, 2023. URL https://arxiv.org/abs/1909.05207.
- Hutchinson and Alizadeh (2024) Spencer Hutchinson and Mahnoosh Alizadeh. Safe online convex optimization with first-order feedback. In 2024 American Control Conference (ACC), pages 1–7. IEEE, 2024.
- Jadbabaie et al. (2015) Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Artificial Intelligence and Statistics, pages 398–406. PMLR, 2015.
- Jenatton et al. (2016) Rodolphe Jenatton, Jim Huang, and Cedric Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 402–411, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/jenatton16.html.
- Joulani et al. (2017) Pooria Joulani, András György, and Csaba Szepesvári. A modular analysis of adaptive (non-) convex optimization: Optimism, composite objectives, and variational bounds. In International Conference on Algorithmic Learning Theory, pages 681–720. PMLR, 2017.
- Mahdavi et al. (2012) Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: online convex optimization with long term constraints. The Journal of Machine Learning Research, 13(1):2503–2528, 2012.
- Mannor et al. (2009) Shie Mannor, John N Tsitsiklis, and Jia Yuan Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10(3), 2009.
- Mohri and Yang (2016) Mehryar Mohri and Scott Yang. Accelerating online convex optimization via adaptive prediction. In Artificial Intelligence and Statistics, pages 848–856. PMLR, 2016.
- Muthirayan et al. (2022) Deepan Muthirayan, Jianjun Yuan, and Pramod P Khargonekar. Online convex optimization with long-term constraints for predictable sequences. IEEE Control Systems Letters, 7:979–984, 2022.
- Neely and Yu (2017) Michael J. Neely and Hao Yu. Online convex optimization with time-varying constraints, 2017. URL https://arxiv.org/abs/1702.04783.
- Orabona (2019) Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
- Qiu et al. (2023) Shuang Qiu, Xiaohan Wei, and Mladen Kolar. Gradient-variation bound for online convex optimization with constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9534–9542, 2023.
- Rakhlin and Sridharan (2013a) Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, pages 993–1019. PMLR, 2013a.
- Rakhlin and Sridharan (2013b) Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences, 2013b. URL https://arxiv.org/abs/1311.1869.
- Scroccaro et al. (2023) Pedro Zattoni Scroccaro, Arman Sharifi Kolarijani, and Peyman Mohajerin Esfahani. Adaptive composite online optimization: Predictions in static and dynamic environments. IEEE Transactions on Automatic Control, 68(5):2906–2921, 2023.
- Sinha and Vaze (2024) Abhishek Sinha and Rahul Vaze. Optimal algorithms for online convex optimization with adversarial constraints. Advances in Neural Information Processing Systems, 37:41274–41302, 2024.
- Steinhardt and Liang (2014) Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International conference on machine learning, pages 1593–1601. PMLR, 2014.
- Sun et al. (2017) Wen Sun, Debadeepta Dey, and Ashish Kapoor. Safety-aware algorithms for adversarial contextual bandit. In International Conference on Machine Learning, pages 3280–3288. PMLR, 2017.
- Syrgkanis et al. (2015) Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. Advances in Neural Information Processing Systems, 28, 2015.
- Wei et al. (2020) Chen-Yu Wei, Haipeng Luo, and Alekh Agarwal. Taking a hint: How to leverage loss predictors in contextual bandits? ArXiv, abs/2003.01922, 2020. URL https://api.semanticscholar.org/CorpusID:211990228.
- Yi et al. (2023) Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Yiguang Hong, Tianyou Chai, and Karl H Johansson. Distributed online convex optimization with adversarial constraints: reduced cumulative constraint violation bounds under Slater’s condition. arXiv preprint arXiv:2306.00149, 2023.
- Yu and Neely (2020) Hao Yu and Michael J Neely. A low complexity algorithm with O(√T) regret and O(1) constraint violations for online convex optimization with long term constraints. Journal of Machine Learning Research, 21(1):1–24, 2020.
- Zhang et al. (2018) Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. Advances in neural information processing systems, 31, 2018.
- Zhao et al. (2020) Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Dynamic regret of convex and smooth functions. ArXiv, abs/2007.03479, 2020. URL https://api.semanticscholar.org/CorpusID:220381233.
- Zhao et al. (2024) Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization. Journal of Machine Learning Research, 25(98):1–52, 2024.
- Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pages 928–936, 2003.
Appendix A Proof of Theorem 7
Proof
By definition of (5) and (6), we obtain the following instantaneous prediction error:
where the last line uses .
(39) |
We obtain (i) by using and (ii) by using the fact that is non-decreasing and is a non-decreasing function. By sub-linearity of :
(40) |
Finally, using Assumption 6, we have
(41) |
where the last inequality comes from using both (39) and (40). By using once again the fact that is non-decreasing and is a non-decreasing function, and knowing that is non-negative and upper bounded by , we can also upper bound . Recall
(42) |
We can now upper bound the regret. Using Lemma 5 we have that for any
Upper bounding the RHS using (41) and (42), we obtain
Thus, using , and after rearranging the terms,
Therefore, if ,
Note that , thus:
If , then
and
With , we have
Appendix B Doubling trick for Algorithm 1
The doubling trick methodology employed here is inspired by Jadbabaie et al. (2015). The parameter we adapt online is λ. Note that for all of the COCO results (Theorem 7 and the corollaries that follow), there is a known constant and a known function such that
and is non-decreasing and sub-linear in each coordinate. The key idea is to apply the doubling trick on the guess, so that the condition holds for every timestep of an epoch except the last one. We present the algorithm in Algorithm 4. In the case of dynamic regret, we assume that the comparator sequence is observable.
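A generic sketch of this epoch-based scheme is given below; the condition check, the doubling factor, and the restart-from-scratch behavior are the usual doubling-trick conventions and are assumptions of ours rather than a verbatim rendering of Algorithm 4.

```python
def run_with_doubling(make_algorithm, env, T, guess0=1.0):
    """Restart the base algorithm with a doubled guess whenever the current guess is exceeded."""
    guess = guess0
    algo = make_algorithm(guess)               # fresh instance tuned with the current guess
    for t in range(T):
        algo.play_round(env, t)
        if env.observed_quantity(t) > guess:   # the condition fails only at an epoch's last step
            guess *= 2.0                       # double the guess ...
            algo = make_algorithm(guess)       # ... and start a new epoch
```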
Theorem 24.
Assume that, when with , the optimistic algorithm has guarantees:
(43) |
where denotes the static or dynamic regret depending on the context, and are monotone non-decreasing and at most polynomial in each coordinate. Then by running the doubling algorithm Algorithm 4, we have the guarantee
(44) |
Proof Let the number of epochs be given and, for each epoch , denote its first instance. Its last instance is therefore the step preceding the start of the next epoch. For two instants and , we define the regret and CCV between the two instants:
We similarly define the quantities . Denote the successive values of used in the doubling process, in . Define
i.e., the values of the different doubling parameters except for the last step of the epoch. Note that when running with between and , the threshold for between those two timesteps is:
Moreover, since the change of epoch happens at , we know that
(45) |
From the second inequality, we have
Thus, from (43) there are two constants and such that:
We will focus on regret for now, but the same methodology can be applied for CCV. First note that by monotonicity of ,
Then, note that , and therefore, the constant satisfies the condition for bounded regret and CCV when running between and . We can now split the total regret into groups:
Finally, from (45) for ,
And since is at most polynomial in each coordinate, and and are at most linear in , we have .
Appendix C Proof of Theorem 10
Denote and . Equation (16) in Theorem 10 is a direct consequence of the following lemma.
Lemma 25.
One step of optimistic online mirror descent satisfies:
(46) |
Moreover, if is -smooth with ,
(47) |
We will need the following proposition to prove the lemma.
Proposition 26 (Chiang et al. (2012), proposition 18).
For any , if , then
(48) |
Proof [of Lemma 25] Let
On one hand, using Proposition 26, the left and right terms can be upper bounded respectively :
Therefore
On the other hand,
By combining the last two inequalities, we obtain (46). To prove (47), first note that by using the fact that ,
For the second part of the statement, if is -smooth:
By inserting in (46) and dividing both sides by :
If , then since is non-increasing. We can upper bound the third term of the sum on the RHS by zero.
Proof [of Theorem 10] From (47), we have for any
(49) |
Note that by convexity of , . Therefore, by taking the sum from 1 to T, we have
where .
To prove the Adagrad regret (18), where we set
note that it is non-decreasing. Moreover, we have . Therefore,
We can apply Equation 16:
(50) |
That can be rewritten as
(51) |
Moreover,
(52) |
Using (51) and (52) in the regret upper bound (50):
Appendix D Dynamic Regret guarantee
We present here the dynamic regret decomposition lemma.
Lemma 27 (Dynamic Regret decomposition).
For any OCO algorithm , if is a Lyapunov potential function, we have that for any , and any admissible sequence
(53) |
where , and is the dynamic regret of the algorithm running on the sequence of losses .
Proof By convexity of , for any :
For any , , then by definition , thus
Summing from to :
where
Appendix E Contextual bandits with expert advice
(54) |
(55) |
First we introduce the shorthand notation: and :
The modified algorithm EXP4.OVAR is presented in Algorithm 5. Note that we modify the learning rate to something similar to what we have in Algorithm 3. Moreover, in the original EXP4.OVAR, different learning rates are used for the updates of and , but we do not do so in our setting, as it would introduce a term in (where is the "cumulative error") that is not trivial to upper bound in terms of and .
Theorem 28 (EXP4.OVAR Regret, adapted from Wei et al. (2020)).
Let a sequence of loss vectors, where is non-decreasing, and and are chosen by the environment but may depend on . Let be the prediction and denote . Then, under the stated condition,
(56) |
Furthermore, if we set :
(57) |
Proof The proof follows exactly the steps in Wei et al. (2020). However, we slightly modify it to accept losses that lie in a dynamic range instead of a fixed one, and the losses have some dependency on the past, adding an extra expected value in the computation of the loss. We first recall the results from Wei et al. (2020). Let . Denote where is the distribution that concentrates on . From Lemma 25, we have:
(58) |
By replacing by its expression and summing over , we obtain
(59) |
We can upper bound the two terms on the RHS. The first sum can be rewritten as:
Then, note that for any because . Therefore,
For the third sum, by replacing by its definition, we have . As in (51),
resulting in
and
Thus the RHS of (59) is upper bounded by: . Note that:
Then, the expected value:
where the inequality comes from and . Thus, by taking the expected value in (59), we have
(Jensen’s inequality) | ||||
(60) |
We can now lower bound the LHS of (60).
(61) |
comes from the fact that is the expected value conditional on all the information up to the end of round . For , it is a consequence of its definition: