An Optimistic Algorithm for Online Convex Optimization with Adversarial Constraints
Abstract
We study Online Convex Optimization (OCO) with adversarial constraints, where an online algorithm must make sequential decisions to minimize both convex loss functions and cumulative constraint violations. We focus on a setting where the algorithm has access to predictions of the loss and constraint functions. Our results show that we can improve the current best bounds of regret and cumulative constraint violations to and , respectively, where and represent the cumulative prediction errors of the loss and constraint functions. In the worst case, where and (assuming bounded gradients of the loss and constraint functions), our rates match the prior results. However, when the loss and constraint predictions are accurate, our approach yields significantly smaller regret and cumulative constraint violations. Finally, we apply our framework to the setting of adversarial contextual bandits with sequential risk constraints, obtaining optimistic regret and constraint-violation bounds that improve on existing results when the prediction quality is sufficiently high.
1 Introduction
We are interested in generalizations of Online Convex Optimization (OCO) to problems in which constraints are imposed but can be violated—generalizations which are referred to as Constrained Online Convex Optimization (COCO). Recall the standard formulation of OCO (Orabona, 2019; Hazan, 2023): At each round, a learner makes a decision, receives a convex loss function from the environment, and suffers the corresponding loss. The goal of the learner is to minimize the cumulative loss. The COCO framework imposes an additional requirement on the learner: meeting a potentially adversarial convex constraint at every time step. The learner observes the constraint only after selecting its action, and cannot always satisfy the constraint exactly, but can hope to have a small cumulative constraint violation. In the adversarial setting, it is not viable to seek absolute minima of the cumulative loss, and the problem is generally formulated in terms of obtaining a sublinear Static Regret—the difference between the learner’s cumulative loss and the cumulative loss of a fixed oracle decision. Having sublinear regret means that, on average, we perform as well as the best action in hindsight. A stronger and more general objective is the Dynamic Regret, where the learner’s performance is benchmarked against a sequence of decisions rather than a fixed action. In the COCO problem, we also aim to ensure sublinear cumulative constraint violation.
One subcategory of OCO problems is adversarial contextual bandits (Auer et al., 2002; Beygelzimer et al., 2011). In that setting, the learner receives contextual information from the environment, selects one action among those available, and only observes the loss of the chosen action. The learner aims to minimize its cumulative loss. Sun et al. (2017) introduced sequential risk constraints in contextual bandits, where, in addition to the loss for each action, the environment generates a risk for each action. In addition to minimizing the cumulative loss, the learner wants to keep the average cumulative risk bounded by a predefined safety threshold.
Recent work in OCO has considered settings in which the adversary is predictable—i.e., not entirely arbitrary—aiming to obtain improved regret bounds (Chiang et al., 2012; Rakhlin and Sridharan, 2013a, b; Mohri and Yang, 2016; Joulani et al., 2017). These works showed that the regret can be improved in terms of a measure of the cumulative prediction error. The optimistic framework has also been studied in the COCO setting by Qiu et al. (2023), who focused on time-invariant constraints, while time-varying constraints were pursued by Anderson et al. (2022), who established bounds for specific cases (e.g., perfect loss predictions, linear constraints).
In the current paper we go beyond earlier work to consider the case of adversarial constraints. Our main contribution is the following: We present the first algorithm to solve COCO problems in which the constraints are adversarial but also predictable, achieving regret and constraint violation in the general convex case. More precisely:
1. We present a meta-algorithm that, when built on an optimistic OCO algorithm, achieves regret and constraint violation bounds matching the best COCO algorithm of Sinha and Vaze (2024) in the worst case.
2. Our algorithm is computationally efficient, as it relies only on a projection onto the simpler set at each time step, instead of a convex optimization step.
3. Furthermore, the same meta-algorithm can be used to prove dynamic regret guarantees with similar constraint violation guarantees.
4. Finally, we show that our method can be used to solve the adversarial contextual bandits problem with sequential risk constraints, providing regret and constraint violation guarantees.
Our theoretical framework exploits state-of-the-art methods from both optimistic OCO and constrained OCO.
The rest of the paper is structured as follows: We present previous work in Section 2, introduce the main assumptions and notation in Section 3, and present the meta-algorithm for the COCO problem in Section 4. We then show how the meta-algorithm yields static regret guarantees in Section 5 and dynamic regret guarantees in Section 6, and present its application to the experts setting in Section 7 and to contextual bandits in Section 8.
Table 1: Comparison of our results with prior work ((D) denotes a dynamic regret bound).
Reference | Complexity per round | Constraints | Loss Function | Regret | Violation |
Guo et al. (2022) | Conv-OPT | Fixed | Convex | ||
Adversarial | Convex | ||||
Adversarial | Convex | (D) | |||
Yi et al. (2023) | Conv-OPT | Adversarial | Convex | ||
Sinha and Vaze (2024) | Projection | Adversarial | Convex | ||
Qiu et al. (2023) | Projection | Fixed | Convex, Slater | ||
Anderson et al. (2022) | Projection | Adversarial | Convex, Perfect predictions | ||
Muthirayan et al. (2022) | Conv-OPT | Known | Convex | ||
Sun et al. (2017) | Projection | Contextual Bandits | |||
Ours | Projection | Adversarial | Convex | ||
Adversarial | Convex | (D) | |||
Contextual Bandits |
2 Related Work
Unconstrained OCO
The OCO problem was introduced by Zinkevich (2003), who established $O(\sqrt{T})$ static regret and $O(\sqrt{T}(1+P_T))$ dynamic regret guarantees based on projected online gradient descent (OGD), where $P_T$ is the path length of the comparator sequence. Hazan (2023) and Orabona (2019) provide overviews of the burgeoning literature that has emerged since Zinkevich’s seminal work, in particular focusing on online mirror descent (OMD) as a general way to solve OCO problems. Zhang et al. (2018) later improved the dynamic regret bound to $O(\sqrt{T(1+P_T)})$.
Optimistic OCO
Optimistic OCO is often formulated as a problem involving gradual variation—i.e., where consecutive loss functions are close in some appropriate metric. Chiang et al. (2012) exploit this assumption in an optimistic version of OMD that incorporates a prediction based on the most recent gradient, and establish a regret bound in terms of the gradual variation of the gradients. Subsequent works (Rakhlin and Sridharan, 2013a, b; Steinhardt and Liang, 2014; Mohri and Yang, 2016; Joulani et al., 2017; Bhaskara et al., 2020) prove that, when using a predictor that is not necessarily the past gradient, one can obtain regret bounds in terms of the cumulative prediction error. The dynamic regret case has also been studied intensively (Jadbabaie et al., 2015; Scroccaro et al., 2023), with the best known bounds given by Zhao et al. (2020, 2024).
Constrained OCO
Constrained OCO was first studied in the context of time-invariant constraints, i.e., where the constraint function is the same at every round. In this setup one can employ projection-free algorithms, avoiding the potentially costly projection onto the feasible set, by allowing the learner to violate the constraints in a controlled way (Mahdavi et al., 2012; Jenatton et al., 2016; Yu and Neely, 2020). The case of time-varying constraints is significantly harder, as the constraints are potentially adversarial. Most of the early work studying such constraints (Neely and Yu, 2017; Yi et al., 2023) accordingly incorporated an additional Slater condition. These papers establish regret guarantees that grow with the inverse of the Slater constant, which unfortunately can be vacuous as this constant can be arbitrarily small. Hutchinson and Alizadeh (2024) studied the setting with time-varying constraints whose constraint sets are monotone, and established a dynamic regret bound when the path length is known beforehand. Guo et al. (2022) presented an algorithm that does not require the Slater condition and yielded improved bounds on static regret, dynamic regret and constraint violations, even when the path length is unknown. However, it requires solving a convex optimization problem at each time step. In more recent work, Sinha and Vaze (2024) presented a simple and efficient algorithm that requires only a projection per round and obtained state-of-the-art guarantees: $O(\sqrt{T})$ regret and $\tilde{O}(\sqrt{T})$ constraint violation. See Table 1 for a comparison of our results with previous work.
Optimistic COCO
Qiu et al. (2023) studied the case with gradual variations and time-invariant constraints, proving a gradual-variation regret guarantee together with a bound on constraint violations. Muthirayan et al. (2022) tackled time-varying but known constraints with predictions, proving regret and cumulative constraint violation guarantees. Under perfect loss predictions, Anderson et al. (2022) demonstrated bounds on both regret and constraint violation. We also include these results in Table 1 for comparison.
Adversarial Contextual Bandits
The adversarial contextual bandit problem was first introduced by Auer et al. (2002), who proposed EXP4, which achieves the optimal expected regret. Wei et al. (2020) later advanced the field by incorporating loss predictions, achieving improved regret when the cumulative prediction error is known beforehand, an improvement over EXP4 when the predictions are accurate. For the case where this quantity is unknown, they developed an algorithm with a corresponding expected regret guarantee. Sun et al. (2017) extended this line of work to include sequential risk constraints (analogous to constrained OCO), developing a modified EXP4 that achieves sublinear regret with sublinear total risk violation.
3 Preliminaries
3.1 Problem setup and notation
Let denote the set of real numbers, and let denote the set of -dimensional real vectors. Let denote the set of possible actions of the learner, where is a specific action, and let be a norm defined on . Let the dual norm be denoted as .
Online learning is a problem formulation in which the learner plays the following game with Nature (a minimal simulation sketch follows the protocol). At each step:
1. The learner plays an action.
2. Nature reveals a loss function and a constraint function. (If we have multiple constraint functions, we take their maximum.)
3. The learner suffers the loss and the constraint violation.
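To make the interaction concrete, the following minimal Python sketch simulates this protocol; the `learner` interface (an `act`/`update` pair) and the accumulation of the positive part of the constraint value are illustrative choices of ours, not part of the formal setup.

```python
def run_protocol(learner, loss_fns, constraint_fns):
    """Simulate the constrained online learning protocol."""
    cum_loss, ccv = 0.0, 0.0
    for f_t, g_t in zip(loss_fns, constraint_fns):
        x_t = learner.act()              # 1. the learner plays x_t
        cum_loss += f_t(x_t)             # 3. loss suffered once f_t, g_t are revealed
        ccv += max(g_t(x_t), 0.0)        #    cumulative constraint violation
        learner.update(f_t, g_t, x_t)    # the learner may now form predictions for t+1
    return cum_loss, ccv
```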
In standard OCO, the loss function $f_t$ is convex, and the goal of the learner is to minimize the regret with respect to an oracle action $x^\star$, where:

$$\mathrm{Regret}_T(x^\star) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^\star). \qquad (1)$$

In COCO, we generalize the OCO problem to additionally ask the learner to obtain a small cumulative constraint violation (CCV), which we denote as $\mathrm{CCV}(T)$:

$$\mathrm{CCV}(T) = \sum_{t=1}^{T} \big(g_t(x_t)\big)^{+}, \qquad \text{where } (z)^{+} := \max(z, 0). \qquad (2)$$
Overall, the goal of the learner is to achieve both sublinear regret, with respect to any action in the oracle set, and sublinear CCV. This is a challenging problem; indeed, Mannor et al. (2009) proved that no algorithm can achieve both sublinear regret and sublinear cumulative constraint violation when the oracle set contains actions that are only feasible on average. However, it is possible to find algorithms that achieve sublinear regret for the smaller set of actions satisfying every constraint, and this latter problem is our focus.
In addition, we assume that at the end of step , the learner can make predictions and . More precisely, we are interested in predictions of the gradients, and, for any function , we denote by the prediction of the gradient of . We abuse notation and denote by the function whose gradient is . Moreover, we define the following prediction errors
(3) |
where is the sequence of actions taken by the learner.
Additionally, for a given strongly convex function $\psi$, we define the Bregman divergence between two points $x$ and $y$:

$$B_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y),\, x - y \rangle. \qquad (4)$$
Two special cases are particularly important (both are computed in the short sketch below):
1. When $\psi(x) = \frac{1}{2}\lVert x \rVert_2^2$, the Bregman divergence is the squared Euclidean distance, $B_\psi(x, y) = \frac{1}{2}\lVert x - y \rVert_2^2$.
2. When $\psi(x) = \sum_i x_i \log x_i$ (the negative entropy on the simplex), the Bregman divergence is the KL divergence, $B_\psi(x, y) = \sum_i x_i \log(x_i / y_i)$.
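As a concrete illustration (ours, not part of the paper), the generic Bregman divergence and both special cases can be evaluated numerically as follows:

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """Generic Bregman divergence B_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - np.dot(grad_psi(y), x - y)

# Case 1: psi(x) = 0.5 * ||x||_2^2  ->  squared Euclidean distance (here equal to 1.0)
sq = lambda x: 0.5 * np.dot(x, x)
d_euclid = bregman(sq, lambda x: x, np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Case 2: psi(p) = sum_i p_i log p_i on the simplex  ->  KL divergence KL(p || q)
neg_entropy = lambda p: np.sum(p * np.log(p))
d_kl = bregman(neg_entropy, lambda p: np.log(p) + 1.0,
               np.array([0.5, 0.5]), np.array([0.9, 0.1]))
```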
3.2 Assumptions
Throughout this paper, we will use various combinations of the following assumptions.
Assumption 1 (Convex set, loss and constraints).
We make the following standard assumptions on the decision set, the losses and the constraints:
1. The decision set is closed, convex and bounded, with a finite diameter.
2. At every round, the loss function is convex and differentiable.
3. At every round, the constraint function is convex and differentiable.
Assumption 2 (Bounded losses).
The loss functions are bounded by and the constraints are bounded by .
Assumption 3 (Feasibility).
The set is not empty.
Assumption 4 (Prediction Sequence Regularity).
For any , the gradient of the loss prediction function and the gradient of the constraint function are and Lipschitz, respectively. That is, , we have:
We abuse notation and let and similarly for . Finally, denote and similarly for .
Assumptions 1, 2 and 3 are standard in COCO (Mahdavi et al., 2012; Jenatton et al., 2016; Yu and Neely, 2020; Qiu et al., 2023; Yi et al., 2023; Guo et al., 2022). Most work on OCO with predictive sequences either assumes that the predicted function is the previous loss function (Chiang et al., 2012; Qiu et al., 2023; D’Orazio and Huang, 2021), or that the learner only predicts a single vector to estimate the next gradient (Rakhlin and Sridharan, 2013a, b; Muthirayan et al., 2022). We generalize this by predicting the entire loss gradient, under a smoothness assumption relating its values at nearby points. When using the most recently observed function as the prediction, Assumption 4 is equivalent to assuming that the gradients of the losses and constraints are Lipschitz, as in Chiang et al. (2012); Qiu et al. (2023). Moreover, Assumption 4 is automatically satisfied when the prediction is a single vector.
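For illustration, two common ways of instantiating the prediction discussed above can be written as follows; the class names and interfaces are ours and purely illustrative:

```python
import numpy as np

class LastGradientPredictor:
    """Use the most recently observed gradient function as the prediction for the next round."""
    def __init__(self, dim):
        self.last_grad_fn = lambda x: np.zeros(dim)   # before round 1, predict the zero gradient
    def predict(self, x):
        return self.last_grad_fn(x)
    def observe(self, grad_fn):
        self.last_grad_fn = grad_fn                   # reuse round t's gradient for round t+1

class ConstantVectorPredictor:
    """Predict a single fixed vector, independent of the evaluation point."""
    def __init__(self, m):
        self.m = np.asarray(m, dtype=float)
    def predict(self, x):
        return self.m          # constant in x, so the smoothness assumption holds trivially
```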
4 Meta-Algorithm for Optimistic COCO
Our meta-algorithm is inspired by Sinha and Vaze (2024). The main idea of that paper is to build a surrogate loss function that can be seen as a Lagrangian of the underlying constrained optimization problem.
The learner then runs AdaGrad (Duchi et al., 2011) on the surrogate, with theoretical guarantees of bounded cumulative constraint violation (CCV) and regret.
Our meta-algorithm is instead based on optimistic methods, such as those presented in the subsequent sections (Section 5, Section 6, Section 7), which allows us to obtain stronger bounds that depend on the prediction quality. Presented in Algorithm 1, this algorithm assumes that at the end of every step, the learner makes predictions of the upcoming loss and constraint functions. (We are actually only interested in the predictions of the gradients, but for simplicity we let the prediction denote any function whose gradient is the predicted gradient.) At each time step, the learner forms a surrogate loss function, defined via a convex, monotonically increasing Lyapunov potential function. Specifically:
(5) |
Using the predictions and , we form a prediction of the Lagrange function , where is defined in Equation 6.
(6) |
In Sinha and Vaze (2024), the update is , but using at would require to be known at the end of . We instead define the following delayed update:
(7) |
The learner then executes the step of algorithm , denoted in Algorithm 1. We then have the following lemma that relates the regret on , CCV, and the regret of on .
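To make the construction concrete, the following self-contained toy sketch combines the meta-algorithm with an optimistic projected-gradient base learner. The linear losses and constraints, the exponential potential, the use of the previous round's gradients as predictions, and the fixed step size are all illustrative assumptions of ours, not the tuning analyzed below; the two points the sketch is meant to convey are the surrogate gradient combining the loss with the scaled constraint, and the delayed update of the queue.

```python
import numpy as np

# Toy instance (assumed): linear losses f_t(x) = <c_t, x> and constraints
# g_t(x) = <a_t, x> - b on the box X = [-1, 1]^d.
rng = np.random.default_rng(0)
T, d, b = 500, 3, 0.1
c = rng.normal(size=(T, d))
a = rng.normal(size=(T, d))

project = lambda x: np.clip(x, -1.0, 1.0)          # projection onto X
Phi_prime = lambda q: np.exp(min(q, 30.0))         # assumed potential Phi(q) = exp(q) - 1 (clipped)
lam, eta = 1.0, 0.05

y = np.zeros(d)                                    # secondary mirror-descent iterate
Q = ccv = cum_loss = 0.0                           # violation queue and running statistics
for t in range(T):
    # Predictions for round t: here simply the previous round's gradients (an assumption).
    c_hat = c[t - 1] if t > 0 else np.zeros(d)
    a_hat = a[t - 1] if t > 0 else np.zeros(d)
    grad_hat = lam * c_hat + Phi_prime(Q) * a_hat  # predicted surrogate gradient
    x = project(y - eta * grad_hat)                # optimistic step: play x_t
    viol = max(a[t] @ x - b, 0.0)                  # observe g_t(x_t)^+
    active = 1.0 if a[t] @ x - b > 0 else 0.0      # subgradient of g_t(.)^+ at x_t
    grad = lam * c[t] + Phi_prime(Q) * a[t] * active
    y = project(y - eta * grad)                    # secondary step with the realized gradient
    Q += viol                                      # delayed queue update, after x_t is played
    cum_loss += c[t] @ x
    ccv += viol
print(f"cumulative loss = {cum_loss:.2f}, CCV = {ccv:.2f}")
```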
Lemma 5 (Regret decomposition).
For any OCO algorithm , if is a Lyapunov potential function, we have that for any , and any
(8) |
where , and is the regret of the algorithm running on the sequence of losses .
Proof By convexity of , for any :
Let , then by definition , thus
Summing from to :
where
In the following we make the assumption that the underlying optimistic OCO algorithm has standard regret guarantees, which we express in terms of a functional that takes as input a sequence of functions and returns a constant. For simplicity, we will use a shorthand notation for this functional. An example is the Lipschitz constant of the loss sequence.
With this assumption and the previous lemma, we can present our main result.
Assumption 6 (Regret of optimistic OCO).
The optimistic OCO algorithm has the following regret guarantee: There is a constant and a sublinear functional such that for any sequence of functions , and any
(9) |
We allow to depend on and other constants of the problem, as long as they are known at the beginning of the algorithm .
Theorem 7 (Optimistic COCO regret and CCV guarantees).
Consider the following assumptions:
a. and satisfy the assumptions of algorithm for all .
b.
c. satisfies Assumption 6.
d. , with .
Under these assumptions, Algorithm 1 has the following regret and CCV guarantees: ,
(10) | ||||
(11) |
We present a sketch of the main ideas here, with the detailed proof deferred to Appendix A. First, using the sublinearity of the square root and the fact that is non-decreasing, we can show that:
(12) |
Then, using (12) and the sublinearity of , we can further upper bound the regret on in 6:
(13) |
In addition, we can upper bound by using 2 and monotonicity:
(14) |
We can then use (13) and (14) in Lemma 5, and after rearranging terms, we obtain
(15) |
where . We obtain
To establish an upper bound on CCV, we leverage the fact that (from 2), which when applied to (15) yields
If , then
Finally, by setting , we obtain
Remark 8.
As in Syrgkanis et al. (2015), we can use the doubling trick for adjusting lambda online at the cost of an additional log term. We provide details in Appendix B.
Remark 9.
If we have constraint functions with , we can set . Alternatively, we can set multiple queues, one for each : , one for each , and set . Finally, define:
Then we can follow the exact same proof to show a regret guarantee:
and CCV guarantee:
The term in will come from:
with being the prediction error of the sequence .
5 Static Regret guarantees
In this section, we first introduce some of the foundational optimistic algorithms that have been used for OCO, then show how we can achieve sublinear static regret and CCV with our meta algorithm.
Optimistic OMD and Optimistic AdaGrad
This approach modifies standard online mirror descent (OMD), which generalizes the projected online gradient descent of Zinkevich (2003) and iteratively steps towards minimizing the most recently observed loss function. The optimistic OMD variant includes a supplementary minimization step using the predicted function, enabling faster convergence to optimality when predictions are accurate. Note that the algorithm is computationally efficient; indeed, a mirror step can be computed in two stages (a short sketch follows the examples below):
1. Compute the point in the dual space satisfying the gradient condition; in particular, if the gradient map is invertible, this point is obtained by inverting it.
2. Project the resulting point back onto the feasible set via the Bregman divergence.
The following two are special cases of OMD:
1. When the regularizer is the squared Euclidean norm, this is simply projected gradient descent.
2. When the feasible set is the simplex and the regularizer is the entropy, the update is multiplicative, followed by a normalization to ensure the iterate remains a probability distribution.
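The following compact sketch shows one round of optimistic OMD in the two special cases above; the projection routine, step size, and function names are placeholders of ours rather than the exact formulation of Algorithm 2.

```python
import numpy as np

def optimistic_pgd_round(y_prev, grad_pred, grad, eta, project):
    """Optimistic OMD with psi = 0.5*||.||_2^2, i.e. optimistic projected gradient descent.

    y_prev    : secondary iterate carried over from the previous round
    grad_pred : predicted gradient, used to choose the played point x_t
    grad      : true gradient observed after playing x_t
    """
    x_t = project(y_prev - eta * grad_pred)   # supplementary (optimistic) step
    y_t = project(y_prev - eta * grad)        # standard mirror step with the true gradient
    return x_t, y_t

def entropic_mirror_step(q_prev, grad, eta):
    """Mirror step on the simplex with the entropic regularizer (multiplicative update)."""
    w = q_prev * np.exp(-eta * grad)
    return w / w.sum()                        # normalization keeps the iterate on the simplex
```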
Theorem 10 establishes our algorithm’s regret bounds. Our analysis extends beyond the vector-based predictions of Rakhlin and Sridharan (2013b) to handle functional predictions, incorporating techniques from Chiang et al. (2012). This extension introduces a dependence on Lipschitz coefficients. We express our bounds in terms of a quantity that vanishes under perfect predictions rather than one that may not—a crucial distinction. This issue has been highlighted before by Scroccaro et al. (2023), who present their regret guarantees in terms of such a quantity. This requires knowing the Lipschitz coefficient of the loss, which is standard in OCO, but we prefer a dependence on the coefficient of the prediction, as the learner has control over it.
Theorem 10 (Optimistic Adagrad, adapted from Rakhlin and Sridharan (2013b), Corollary 2).
Under the following assumptions:
a. Assumption 1 holds,
b. for any , is -Lipschitz where ,
c. for any , ,
d. for any , ,
we have, for any and any ,
(16) |
where . If is:
(17) |
with , then for any , Algorithm 2 has regret
(18) |
By using Algorithm 2 as the OCO algorithm in Algorithm 1, we obtain the following regret guarantee as a direct consequence of Theorem 7 and Theorem 10:
Corollary 11 (Optimistic Adagrad COCO).
Consider the following assumptions:
a.
b. is optimistic Adagrad (Algorithm 2) with
c. and are set as in Theorem 7.
Under these assumptions, the meta-algorithm (1) has the following regret and constraint violation guarantees:
(19) |
Alternatively, one can use optimistic Follow-the-Regularized-Leader (Rakhlin and Sridharan, 2013a; Mohri and Yang, 2016; Joulani et al., 2017) instead of Algorithm 2, which can be shown to enjoy guarantees similar to Theorem 10.
Remark 12.
Even if the constraint sequence is fixed or known, we cannot achieve a stronger guarantee with this algorithm. This is because the corresponding function does not satisfy Assumption 4 in the general case.
6 Dynamic Regret guarantees
Moving beyond a fixed baseline, we can evaluate performance against a time-varying comparator sequence $u_1, \dots, u_T$ whose path length is bounded: $\sum_{t=2}^{T} \lVert u_t - u_{t-1} \rVert \le P_T$. Our objective is to bound the dynamic regret relative to this sequence:

$$\mathrm{DynRegret}_T = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t). \qquad (20)$$
By utilizing Algorithm 2 and slightly modifying the learning rate, we can achieve state-of-the-art dynamic regret guarantees when the path length is known. We will need the following additional assumption:
Assumption 13 (Lipschitz-like Bregman divergence).
, ,
This assumption is always satisfied if is Lipschitz on . This is true in particular when is a norm on the bounded set .
Theorem 14 (Dynamic Regret guarantees in OCO (Jadbabaie et al., 2015)).
Under the following assumptions:
a.
b. for any , is -Lipschitz where ,
c. for any , ,
d. for any , ,
we have, for any sequence and any ,
(21) |
where . By setting as
(22) |
then Algorithm 2 has dynamic regret
(23) |
where .
We omit the proof, but it combines elements from Jadbabaie et al. (2015) to add the term in and the proof of Theorem 10 to ensure the dependency on . We can now use this algorithm in Algorithm 1 to achieve dynamic regret and CCV in COCO. We first need the following definition:
Definition 15.
A sequence is admissible if . We assume that there exists an admissible sequence.
Note that the existence of an admissible sequence is a much weaker assumption than Assumption 3.
Corollary 16 (Dynamic Regret in COCO).
Consider the following assumptions:
a.
b. The predictions are linear.
c. is optimistic Adagrad (Algorithm 2) with and the learning rate defined in (22).
d. , with .
Under these assumptions, the meta-algorithm (1) has the following dynamic regret and constraint violation guarantees: for any admissible sequence of length at most
(24) |
The proof structure mirrors that of Theorem 7, but employs a modified version of Lemma 5 adapted for dynamic regret analysis. We show the modified version of Lemma 5 in Appendix D. By using linear predictions for , we can eliminate the term linear in from the regret guarantee. When is unknown but is observable, we can achieve comparable DynRegret using Algorithm 1 from Jadbabaie et al. (2015) combined with the doubling trick (Algorithm 4, Appendix B). While alternative approaches exist that don’t require observing (Scroccaro et al., 2023; Zhao et al., 2020, 2024), our doubling trick implementation would still necessitate sequence observability.
7 Experts setting
In this setting, the agent has access to experts and has to form a distribution for selecting among them. She observes the loss of each expert and suffers an overall loss which is the expected value over the experts. Formally, we assume where is the number of experts. At each step , the learner selects , a distribution over the experts, then observes the vector of losses and the vector of constraints . The learner then suffers the loss and constraint . Let denote the prediction of and the prediction of .
For the OCO case (i.e., without adversarial constraints), we could use Algorithm 2 with the Euclidean geometry, but in the worst case the resulting bound scales poorly with the number of experts. We are instead able to achieve a logarithmic dependence on the number of experts by working on the simplex with the entropic regularizer, in which case the Bregman divergence is the KL divergence. However, the KL divergence is not upper bounded, as any coordinate can be arbitrarily close to zero. We circumvent this problem in Algorithm 3 by introducing a mixture with the uniform distribution. This algorithm can be found in Rakhlin and Sridharan (2013b) in the context of a two-player zero-sum game.
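A minimal sketch of one round of this mixed optimistic update is given below; the mixing weight and the step size are placeholders of ours, and the precise schedule is the one specified in Algorithm 3, not this one.

```python
import numpy as np

def experts_round(q_prev, loss_pred, loss, eta, gamma):
    """One round of optimistic exponentiated weights with uniform mixing.

    q_prev    : secondary weight vector from the previous round (on the simplex)
    loss_pred : predicted loss vector for the current round
    loss      : true loss vector observed after playing
    """
    n = len(q_prev)
    w = q_prev * np.exp(-eta * loss_pred)
    p_t = (1.0 - gamma) * (w / w.sum()) + gamma / n   # played distribution: mix with uniform
    w = q_prev * np.exp(-eta * loss)
    q_t = w / w.sum()                                 # secondary update with the true losses
    return p_t, q_t
```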
7.1 Static Regret
We first present the OCO guarantee of Algorithm 3. We let and define similarly. Therefore, . We have the following regret guarantee in OCO when using Algorithm 3:
Theorem 17 (Optimistic OMD with experts, (Rakhlin and Sridharan, 2013b)).
Corollary 18 (COCO in experts setting).
For any , let be such that and . Define , where . Assume the required conditions hold, and run the meta-algorithm (Algorithm 1) with the following:
a.
b.
c. Use Algorithm 3 as the OCO algorithm.
Then, we have
(27) |
Moreover, if the sequence is fixed or known, we have the stronger guarantee:
(28) |
Proof The constant gradient assumption in the experts setting prevents us from using in ; therefore, we employ instead. Denote . As a direct consequence of Theorem 7, we obtain the regret guarantee, and:
Finally, noticing that
we prove the CCV bound. If is known at the beginning of , we can use .
7.2 Dynamic Regret
Jadbabaie et al. (2015) show that the previous algorithm also has dynamic regret guarantees. They use a different mixing parameter and a slightly different constant for the learning rate, and they use it in the context of two-player zero-sum games.
Theorem 19.
Under Assumption 1 and for any , is a constant function, with and the learning rate defined as
(29) |
Algorithm 3 has regret
(30) |
Corollary 20 (Dynamic Regret in experts settings).
As before, define , where . Run the meta-algorithm (Algorithm 1) with the following:
a. .
b. Set .
c. Set .
d. Use Algorithm 3 as the OCO algorithm with the learning rate defined in (29).
Then, for any admissible sequence of size ,
(31) |
Moreover, if the sequence is fixed or known, we have the stronger guarantee:
(32) |
This is a direct consequence of Theorem 14 and Theorem 19. As noted in Section 6, we can use the doubling trick when the path length is unknown but the comparator sequence is observable.
8 Adversarial Contextual Bandits with safety constraints
Denote the finite set of possible actions. At each timestep:
1. The environment generates a context, a loss vector and a constraint (or risk) vector.
2. The learner observes the context, proposes a distribution over the possible actions, and then samples an action from it.
3. The environment reveals the loss and risk of the chosen action. (We use the two notations interchangeably.)
To guide decisions, the learner uses a finite family of experts who provide context-dependent action recommendations. Given a safety threshold, we define the subset of consistently safe experts. The learner also has access to predictions of the loss and risk vectors. The goal of the learner is to make the expected regret and the expected CCV as small as possible:
(33) |
where the expectation is with respect to the randomness of the learner (selection of actions ). Note that is a strictly stronger measure than the one used in Sun et al. (2017) where their metric of safety is .
As in previous sections, we first need an algorithm that solves the problem without the adaptive constraints. Here, we employ a modified version of the EXP4.OVAR algorithm (Wei et al., 2020), detailed in Algorithm 5 in Appendix E. The small change we bring is to the learning rate and how it is used in the updates. In most of the bandits literature, the loss vector is assumed to be bounded with known bounds (w.l.o.g. in the unit range). However, when we apply the algorithm to the Lagrangian surrogate, the upper bound of the losses becomes dynamic, varying with time and depending on previous actions. We thus have to take that into account when computing the upper bound on the regret, as highlighted in Theorem 21.
Theorem 21 (Modified EXP4.OVAR Regret, adapted from Wei et al. (2020)).
Let a sequence of loss vectors, where is non-decreasing, and and are chosen by the environment but may depend on . Let be the prediction and denote . Then, under the stated condition on the learning rate, Algorithm 5 has regret
(34) |
See Appendix E for the complete proof. For the problem with adversarial constraints, as in Section 4, we construct a surrogate loss vector similar to the Lagrangian:
(35) |
and use them in the EXP4.OVAR algorithm. For consistency with previous sections, we denote for , and denote and .
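The following short sketch illustrates how such surrogate vectors might be formed and mapped to expert-level losses before an EXP4-style exponential-weights update; the particular weighting (a fixed multiplier on the loss, the potential derivative on the clipped risk, and the subtraction of the safety threshold) is an assumption of ours standing in for Equation 35, and the importance weighting of the single observed entry is omitted for brevity.

```python
import numpy as np

def surrogate_loss_vector(loss_vec, risk_vec, Q, lam, Phi_prime, delta):
    """Lagrangian-style surrogate over the K actions (illustrative form only)."""
    return lam * loss_vec + Phi_prime(Q) * np.maximum(risk_vec - delta, 0.0)

def expert_losses(surrogate_vec, advice):
    """Map action-level surrogate losses to expert-level losses.

    advice[i] is expert i's distribution over the K actions given the context,
    so expert i's loss is the expectation of the surrogate under that distribution.
    """
    return advice @ surrogate_vec

def exp_weights_step(q_prev, pred_expert_loss, expert_loss, eta):
    """Optimistic exponentiated-weights update over the experts (EXP4-style core)."""
    p_t = q_prev * np.exp(-eta * pred_expert_loss); p_t /= p_t.sum()   # play using the prediction
    q_t = q_prev * np.exp(-eta * expert_loss);      q_t /= q_t.sum()   # update with the realized loss
    return p_t, q_t
```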
First, we prove a similar regret decomposition lemma. Denote the expected regret of a contextual bandit algorithm run with the surrogate vectors as loss vectors.
Lemma 22.
Assume that , and . Let be the safety threshold, a convex potential function, and the surrogate defined as in (35). Then
(36) |
The proof is exactly the same as that of Lemma 5, with the additional step of taking the expectation. Finally, by using EXP4.OVAR on the surrogate vectors defined in Equation 35, we prove that we have bounded expected regret and CCV.
Theorem 23.
Assuming:
• The safety threshold is known and the corresponding set of safe experts is not empty.
• , and .
• We define , and as in (35) and use them in EXP4.OVAR.
• , with .
Running Algorithm 1 gives the following guarantees:
(37) |
Proof By definition, we have for any . Thus, we have the regret guarantee of Theorem 21.
(38) |
Inserting it in Lemma 22, using the definition of we have
The rest of the proof is as in Theorem 7, after noticing that, with Jensen’s inequality,
Note that in the worst case the regret and CCV are of an order worse than Sun et al. (2017). However, when the predictions are slightly more accurate, this algorithm improves on Sun et al. (2017), with the most significant improvement when the predictions are near perfect. This is close to optimal, as Wei et al. (2020) prove a lower bound on the best regret achievable by a contextual bandit algorithm with predictions. Note that this algorithm requires the cumulative prediction error (or an upper bound on it) to be known beforehand, as even with the doubling trick we do not directly observe it online. A heuristic method using the current observation as an estimator, together with the doubling trick, could potentially work in practice.
9 Conclusion
This work presents pioneering optimistic algorithms for handling OCO under adversarial constraints. Beyond establishing prediction-error-dependent bounds for both regret and constraint violations, our approach maintains efficiency by using simple projections instead of solving a complete convex optimization problem per iteration. In future work, we are interested in the guarantees obtainable against oracle sets larger than the set of always-feasible actions, and in proving stronger bounds when the loss function is strongly convex. Moreover, we conjecture that a slight alteration of the algorithm should ensure a stronger guarantee when the constraint sequence is fixed or perfectly known, beyond the expert setting. At this stage, the non-smooth gradient of the clipped constraint prevents us from using the constraint itself as the prediction, and therefore from establishing that our algorithm attains this bound.
References
- Anderson et al. (2022) Daron Anderson, George Iosifidis, and Douglas J Leith. Lazy Lagrangians with predictions for online learning. arXiv preprint arXiv:2201.02890, 2022.
- Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
- Beygelzimer et al. (2011) Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26. JMLR Workshop and Conference Proceedings, 2011.
- Bhaskara et al. (2020) Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, and Manish Purohit. Online learning with imperfect hints. In International Conference on Machine Learning, pages 822–831. PMLR, 2020.
- Chiang et al. (2012) Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory, pages 6–1. JMLR Workshop and Conference Proceedings, 2012.
- D’Orazio and Huang (2021) Ryan D’Orazio and Ruitong Huang. Optimistic and adaptive Lagrangian hedging. arXiv preprint arXiv:2101.09603, 2021.
- Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
- Guo et al. (2022) Hengquan Guo, Xin Liu, Honghao Wei, and Lei Ying. Online convex optimization with hard constraints: towards the best of two worlds and beyond. Advances in Neural Information Processing Systems, 35:36426–36439, 2022.
- Hazan (2023) Elad Hazan. Introduction to online convex optimization, 2023. URL https://arxiv.org/abs/1909.05207.
- Hutchinson and Alizadeh (2024) Spencer Hutchinson and Mahnoosh Alizadeh. Safe online convex optimization with first-order feedback. In 2024 American Control Conference (ACC), pages 1–7. IEEE, 2024.
- Jadbabaie et al. (2015) Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Artificial Intelligence and Statistics, pages 398–406. PMLR, 2015.
- Jenatton et al. (2016) Rodolphe Jenatton, Jim Huang, and Cedric Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 402–411, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/jenatton16.html.
- Joulani et al. (2017) Pooria Joulani, András György, and Csaba Szepesvári. A modular analysis of adaptive (non-) convex optimization: Optimism, composite objectives, and variational bounds. In International Conference on Algorithmic Learning Theory, pages 681–720. PMLR, 2017.
- Mahdavi et al. (2012) Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: online convex optimization with long term constraints. The Journal of Machine Learning Research, 13(1):2503–2528, 2012.
- Mannor et al. (2009) Shie Mannor, John N Tsitsiklis, and Jia Yuan Yu. Online learning with sample path constraints. Journal of Machine Learning Research, 10(3), 2009.
- Mohri and Yang (2016) Mehryar Mohri and Scott Yang. Accelerating online convex optimization via adaptive prediction. In Artificial Intelligence and Statistics, pages 848–856. PMLR, 2016.
- Muthirayan et al. (2022) Deepan Muthirayan, Jianjun Yuan, and Pramod P Khargonekar. Online convex optimization with long-term constraints for predictable sequences. IEEE Control Systems Letters, 7:979–984, 2022.
- Neely and Yu (2017) Michael J. Neely and Hao Yu. Online convex optimization with time-varying constraints, 2017. URL https://arxiv.org/abs/1702.04783.
- Orabona (2019) Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
- Qiu et al. (2023) Shuang Qiu, Xiaohan Wei, and Mladen Kolar. Gradient-variation bound for online convex optimization with constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9534–9542, 2023.
- Rakhlin and Sridharan (2013a) Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, pages 993–1019. PMLR, 2013a.
- Rakhlin and Sridharan (2013b) Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences, 2013b. URL https://arxiv.org/abs/1311.1869.
- Scroccaro et al. (2023) Pedro Zattoni Scroccaro, Arman Sharifi Kolarijani, and Peyman Mohajerin Esfahani. Adaptive composite online optimization: Predictions in static and dynamic environments. IEEE Transactions on Automatic Control, 68(5):2906–2921, 2023.
- Sinha and Vaze (2024) Abhishek Sinha and Rahul Vaze. Optimal algorithms for online convex optimization with adversarial constraints. Advances in Neural Information Processing Systems, 37:41274–41302, 2024.
- Steinhardt and Liang (2014) Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International conference on machine learning, pages 1593–1601. PMLR, 2014.
- Sun et al. (2017) Wen Sun, Debadeepta Dey, and Ashish Kapoor. Safety-aware algorithms for adversarial contextual bandit. In International Conference on Machine Learning, pages 3280–3288. PMLR, 2017.
- Syrgkanis et al. (2015) Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. Advances in Neural Information Processing Systems, 28, 2015.
- Wei et al. (2020) Chen-Yu Wei, Haipeng Luo, and Alekh Agarwal. Taking a hint: How to leverage loss predictors in contextual bandits? ArXiv, abs/2003.01922, 2020. URL https://api.semanticscholar.org/CorpusID:211990228.
- Yi et al. (2023) Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Yiguang Hong, Tianyou Chai, and Karl H Johansson. Distributed online convex optimization with adversarial constraints: reduced cumulative constraint violation bounds under Slater’s condition. arXiv preprint arXiv:2306.00149, 2023.
- Yu and Neely (2020) Hao Yu and Michael J Neely. A low complexity algorithm with O(√T) regret and O(1) constraint violations for online convex optimization with long term constraints. Journal of Machine Learning Research, 21(1):1–24, 2020.
- Zhang et al. (2018) Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. Advances in neural information processing systems, 31, 2018.
- Zhao et al. (2020) Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Dynamic regret of convex and smooth functions. ArXiv, abs/2007.03479, 2020. URL https://api.semanticscholar.org/CorpusID:220381233.
- Zhao et al. (2024) Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization. Journal of Machine Learning Research, 25(98):1–52, 2024.
- Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pages 928–936, 2003.
Appendix A Proof of Theorem 7
Proof
By definition of (5) and (6), we obtain the following instantaneous prediction error:
where the last line uses .
(39) |
We obtain (i) by using and (ii) by using the fact that is non-decreasing and is a non-decreasing function. By sub-linearity of :
(40) |
Finally, using Assumption 6, we have
(41) |
where the last inequality comes from using both (39) and (40). By using once again the fact that is non-decreasing and is a non-decreasing function, and knowing that is non-negative and upper bounded by , we can also upper bound . Recall
(42) |
We can now upper bound the regret. Using Lemma 5 we have that for any
Upper bounding the RHS using (41) and (42), we obtain
Thus, using , and after rearranging the terms,
Therefore, if ,
Note that , thus:
If , then
and
With , we have
Appendix B Doubling trick for Algorithm 1
The doubling trick methodology employed here is inspired by Jadbabaie et al. (2015). The parameter we adapt online is λ. Note that for all of the COCO results (Theorem 7 and the corollaries that follow), there is a known constant and a known function such that
and is non-decreasing and sub-linear in each coordinate. The key idea is to apply the doubling trick on the guess, so that the condition holds for every timestep of an epoch except the last one. We present the algorithm in Algorithm 4. In the case of dynamic regret, we assume that the comparator sequence is observable.
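A generic sketch of this epoch-based scheme is given below; the condition check, the doubling factor, and the restart-from-scratch behavior are the usual doubling-trick conventions and are assumptions of ours rather than a verbatim rendering of Algorithm 4.

```python
def run_with_doubling(make_algorithm, env, T, guess0=1.0):
    """Restart the base algorithm with a doubled guess whenever the current guess is exceeded."""
    guess = guess0
    algo = make_algorithm(guess)               # fresh instance tuned with the current guess
    for t in range(T):
        algo.play_round(env, t)
        if env.observed_quantity(t) > guess:   # the condition fails only at an epoch's last step
            guess *= 2.0                       # double the guess ...
            algo = make_algorithm(guess)       # ... and start a new epoch
```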
Theorem 24.
Assume that, when with , the optimistic algorithm has guarantees:
(43) |
where denotes the static or dynamic regret depending on the context, and are monotone non-decreasing and at most polynomial in each coordinate. Then by running the doubling algorithm Algorithm 4, we have the guarantee
(44) |
Proof Let the number of epochs be given and, for each epoch , denote its first instance. Its last instance is therefore the step preceding the start of the next epoch. For two instants and , we define the regret and CCV between the two instants:
We similarly define the quantities . Denote the successive values of used in the doubling process, in . Define
i.e., the values of the different doubling parameters except for the last step of the epoch. Note that when running with between and , the threshold for between those two timesteps is:
Moreover, since the change of epoch happens at , we know that
(45) |
From the second inequality, we have
Thus, from (43) there are two constants and such that:
We will focus on regret for now, but the same methodology can be applied for CCV. First note that by monotonicity of ,
Then, note that , and therefore, the constant satisfies the condition for bounded regret and CCV when running between and . We can now split the total regret into groups:
Finally, from (45) for ,
And since is at most polynomial in each coordinate, and and are at most linear in , we have .
Appendix C Proof of Theorem 10
Denote and . Equation (16) in Theorem 10 is a direct consequence of the following lemma.
Lemma 25.
One step of optimistic online mirror descent satisfies:
(46) |
Moreover, if is -smooth with ,
(47) |
We will need the following proposition to prove the lemma.
Proposition 26 (Chiang et al. (2012), proposition 18).
For any , if , then
(48) |
Proof [of Lemma 25] Let
On one hand, using Proposition 26, the left and right terms can be upper bounded respectively :
Therefore
On the other hand,
By combining the last two inequalities, we obtain (46). To prove (47), first note that by using the fact that ,
For the second part of the statement, if is -smooth:
By inserting in (46) and dividing both sides by :
If , then since is non-increasing. We can upper bound the third term of the sum on the RHS by zero.
Proof [of Theorem 10] From (47), we have for any
(49) |
Note that by convexity of , . Therefore, by taking the sum from 1 to T, we have
where .
To prove the Adagrad regret (18), where we set
note that it is non-decreasing. Moreover, we have . Therefore,
We can apply Equation 16:
(50) |
That can be rewritten as
(51) |
Moreover,
(52) |
Using (51) and (52) in the regret upper bound (50):
Appendix D Dynamic Regret guarantee
We present here the dynamic regret decomposition lemma.
Lemma 27 (Dynamic Regret decomposition).
For any OCO algorithm , if is a Lyapunov potential function, we have that for any , and any admissible sequence
(53) |
where , and is the dynamic regret of the algorithm running on the sequence of losses .
Proof By convexity of , for any :
For any , , then by definition , thus
Summing from to :
where
Appendix E Contextual bandits with expert advice
(54) |
(55) |
First we introduce the shorthand notation: and :
The modified algorithm EXP4.OVAR is presented in Algorithm 5. Note that we modify the learning rate to something similar to what we have in Algorithm 3. Moreover, in the original EXP4.OVAR, different learning rates are used for the updates of and , but we do not do so in our setting, as it would introduce a term in (where is the "cumulative error") that is not trivial to upper bound in terms of and .
Theorem 28 (EXP4.OVAR Regret, adapted from Wei et al. (2020)).
Let a sequence of loss vectors, where is non-decreasing, and and are chosen by the environment but may depend on . Let be the prediction and denote . Then, under the stated condition,
(56) |
Furthermore, if we set :
(57) |
Proof The proof follows exactly the steps in Wei et al. (2020). However, we slightly modify it to accept losses that lie in a dynamic range instead of a fixed one, and the losses have some dependency on the past, adding an extra expected value in the computation of the loss. We first recall the results from Wei et al. (2020). Let . Denote where is the distribution that concentrates on . From Lemma 25, we have:
(58) |
By replacing by its expression and summing over , we obtain
(59) |
We can upper bound the two terms on the RHS. The first sum can be rewritten as:
Then, note that for any because . Therefore,
For the third sum, by replacing by its definition, we have . As in (51),
resulting in
and
Thus the RHS of (59) is upper bounded by: . Note that:
Then, the expected value:
where the inequality comes from and . Thus, by taking the expected value in (59), we have
(Jensen’s inequality) | ||||
(60) |
We can now lower bound the LHS of (60).
(61) |
comes from the fact that is the expected value conditional on all the information up to the end of round . For , it is a consequence of its definition: