Multi-point Feedback of Bandit Convex Optimization with Hard Constraints
Abstract
This paper studies bandit convex optimization with constraints, where the learner aims to generate a sequence of decisions under partial information of loss functions such that the cumulative loss is reduced as well as the cumulative constraint violation is simultaneously reduced. We adopt the cumulative hard constraint violation as the metric of constraint violation, which is defined by . Owing to the maximum operator, a strictly feasible solution cannot cancel out the effects of violated constraints compared to the conventional metric known as long-term constraints violation. We present a penalty-based proximal gradient descent method that attains a sub-linear growth of both regret and cumulative hard constraint violation, in which the gradient is estimated with a two-point function evaluation. Precisely, our algorithm attains regret bounds and cumulative hard constraint violation bounds for convex loss functions and time-varying constraints, where is the dimensionality of the feasible region and is a user-determined parameter. We also extend the result for the case where the loss functions are strongly convex and show that both regret and constraint violation bounds can be further reduced.
1 Introduction
Bandit Convex Optimization (BCO) is a fundamental framework of sequential decision-making under uncertain environments and with limited feedback, which can be regarded as a structured repeated game between a learner and an environment (Hazan et al. 2016, Lattimore and Szepesvári 2020). In this framework, a learner is given a convex feasible region and the total number of rounds. At each round, , the learner makes decision , and then a convex loss function is revealed. The learner cannot access the loss function , but only the bandit feedback is available, i.e., the learner can only observe the value of the loss at the point she committed to, i.e., . The objective of the learner is to generate a sequence of decisions that minimizes cumulative loss under bandit feedback. The performance of the learner is evaluated in terms of regret, which is defined by
This regret measures the difference between the cumulative loss of the learner’s strategy and the minimum possible cumulative loss where the sequence of loss functions had been known in advance and the learner could choose the best fixed optimal decision in hindsight.
In many real-world scenarios, the decisions are often subject to some constraints such as budget or resources. In the context of Online Convex Optimization (OCO), where the learner has access to the complete information about the loss functions, a projection operator is typically applied in each round so that the decisions belong to constraints (Zinkevich 2003, Hazan et al. 2016). However, such a projection step is typically a computational bottleneck when the feasible region is complex.
To address the issue of the projection step, Mahdavi et al. (2012) considers online convex optimization with long-term constraints, where the learner aims to generate a sequence of decisions that the decisions satisfy constraints in the long run, instead of requiring to satisfy the constraints in all rounds. They introduce the cumulative soft constraint violation metric defined by , where is the functional constraint to be satisfied. Later, Yuan and Lamperski (2018) consideres strict notion of constraint violation reffered to as cumulative hard constraint violation, which is defined by . This metric overcomes the drawback of cumulative soft constraint violation, and it is suitable for safety-critical systems, in which the failure of constraint violation may result in catastrophic consequences.
To see that the notion of cumulative hard constraint violation is a stronger metric, let us consider the example discussed in Guo et al. (2023). Given a sequence of decisions whose constraint functions are with such that if is odd; otherwise , we have for any , however, the constraint is violated at half of rounds. On the other hand, the notion of hard constraint violation can capture the constraint violation since we have . Thus, the conventional definition of cumulative soft constraint violation cannot accurately measure the constraint violation but cumulative hard constraint violation can.
Many existing algorithms for BCO with constraints proposed in prior works typically involve projection operators as well as algorithms for OCO with constraints (Agarwal et al. 2010, Zhao et al. 2021), and are generally limited to the simple convex set. Chen et al. (2019), Garber and Kretzu (2020) consider a projection-free algorithm for BCO, but the constraint violation bound has not been reported. Some studies have extended the algorithm for OCO with soft constraints to the bandit setting (Mahdavi et al. 2012, Cao and Liu 2018), however, these algorithms cannot be directly extended to BCO with hard constraints. In other words, there has been no algorithm that can simultaneously achieve sub-linear bound both regret and cumulative hard constraints violation.
The present study focuses on the particular case of multi-point feedback of BCO with constraints, in which the loss functions are convex or strongly convex, and constraint violation is evaluated in terms of hard constraints. This kind of problem widely appears in real-world scenarios such as portfolio management problems, in which the manager has concrete constraints to be satisfied but only has access to the loss function at several points close to the decision . We present a penalty-based proximal gradient descent method which attains both regret bound and cumulative hard constraint violation bound, where is the dimensionality of the feasible region and is a user-determined parameter. Our proposed algorithm is inspired by a gradient estimation in the BCO literature (Flaxman et al. 2005, Agarwal et al. 2010) and an algorithm for OCO with hard constraints (Guo et al. 2022).
1.1 Related work
For OCO with constraints, a projection operator is generally applied to the updated variables to enforce them feasible at each round (Zinkevich 2003, Duchi et al. 2010). However, such projection is typically inefficient to implement due to the high computational effort especially when the feasible region is complex (e.g., is characterized by multiple inequalities), and efficient projection computation is limited to simple sets such as -ball or probability simplex (Duchi et al. 2008).
Instead of requiring that the decisions belong to the feasible region in all rounds, Mahdavi et al. (2012) first considers relaxing the notion of constraints by allowing them to be violated at some rounds but requiring them to be satisfied in the long run. This type of OCO is referred to as online convex optimization with long-term constraints, and the performance metric for constraint violation is defined by the cumulative violation of the decisions from the constraints for all rounds, i.e., referred to as soft constraints. Mahdavi et al. (2012) proposes a primal-dual gradient-based algorithm that attains regret bound and constraint violations and subsequent researches have been conducted to improve both bounds. Jenatton et al. (2016) extends the algorithm to achieve regret bound and constraint violation, where is a user-determined parameter. Yu and Neely (2020) proposes the drift-plus-penalty based algorithm developed for stochastic optimization in dynamic queue networks (Neely 2022), and prove the algorithm attains regret bound and constraint violation bound.
Yuan and Lamperski (2018) proposes the more strict notion of a constraint violation, which is defined by , so as not to cancel out the effect of violated constraints by the strict feasible solution. Such paradigm is later referred to as online convex optimization with hard constraints (Guo et al. 2022). In Yuan and Lamperski (2018), an algorithm that attains regret bound and constraint violation bound has proposed. Yi et al. (2021) extends the algorithm that attains regret bound and constraint violation bound, and Yi et al. (2021) also consideres the general dynamic regret bound. Guo et al. (2022) proposes an algorithm that rectifies updated variables and penalty variables and proves the algorithm attains regret bound and constraint violation for convex loss functions.
In the partial information setting, a learner is limited to accessing the loss functions and thus the learner cannot construct an algorithm by using a gradient of loss functions. Flaxman et al. (2005) considers a one-point feedback model, where only one-point function value is available, and constructed an unbiased estimator of the gradient of the loss functions. By employing the gradient estimator, they applied online gradient descent algorithm (Zinkevich 2003) and proved the algorithm attains regret bound. Another variant of the feedback model is multi-point feedback, where the learner is allowed to query multiple points of function in each round. Agarwal et al. (2010) and Nesterov and Spokoiny (2017) consideres two-point feedback model and establishes an regret bound for convex loss functions.
Reference | Bandit | Metric | Loss | Regret | Violation |
Flaxman et al. (2005) | — | convex | — | ||
Agarwal et al. (2010) | — | convex | — | ||
— | str.-convex | — | |||
Mahdavi et al. (2012) | soft | convex | |||
Guo et al. (2022) | hard | convex | |||
str.-convex | |||||
convex | |||||
This work | hard | str.-convex |
1.2 Contribution
This paper focuses on the multi-point feedback BCO with constraints, in which the constraint violation is evaluated in terms of cumulative hard constraint violation. We propose an algorithm (Algorithm 1) for the BCO and show that the proposed algorithm attains an regret bound and an cumulative hard constraint violation bound, where is a user-defined parameter (Theorem 1 and Theorem 2). By setting , the algorithm attains regret bound and constraint violation bound, which is compatible with the prior work for constrained online convex optimization with full-information (Yi et al. 2022, Guo et al. 2022). We also show both regret and constraint violation bounds are reduced to an and , respectively, when the loss functions are strongly convex (Theorem 3 and Theorem 4). The comparison of this study with prior works is summarized in Table 1.
1.3 Organization
The rest of this paper is organized as follows. In Section 2, we introduce necessary preliminaries of BCO with constraints. Section 3 presents the proposed algorithm to solve the BCO with constraints under two-point bandit feedback. In Section 4, we provide a theoretical analysis of regret bound and hard constraint violation bound for both convex and strongly convex loss functions. Finally, Section 5 concludes the present paper and addresses future work.
2 Preliminaries
2.1 Notation
For a vector , let be the -norm of , i.e., . Let be the inner product of two vectors and . Let and denote the -dimensional Euclidean ball and unit sphere, and let and denote the random variables sampled uniformly from and , respectively. For a scalar , we denote . For a Lipschitz continuous function , let be the Lipschitz constant of . We use as a shorthand for the set of positive integers . Finally, we use the notation as the conditional expectation over the condition of all randomness in the first rounds.
2.2 Assumptions
Following prior works of constrained OCO (Mahdavi et al. 2012, Guo et al. 2022), we make the following standard assumptions on feasible region, loss functions, and constraint functions.
Assumption 1 (Bounded domain).
The feasible region is a non-empty bounded closed convex set such that holds for any .
Assumption 2 (Convexity and Lipschitz continuity of loss functions).
The loss function is convex and Lipschitz continuous with Lipschitz constant on , that is, we have
for any and for any . For simplicity, we define .
Assumption 3 (Convexity and Lipschitz continuity of constraint functions).
The constraint function is convex and Lipschitz continuous with Lipschitz constant on , that is, we have
for any and for any . For simplicity, we define .
2.3 Offline constrained OCO
With the full knowledge of loss functions and constraint functions in all rounds, the offline constrained OCO is formulated as the following convex optimization problem:
(1a) | ||||
subject to | (1b) |
where is assumed to be a simple convex set (e.g., Euclidean ball, probability simplex) for which the projection onto is efficiently computable.
For the sake of simplicity of theoretical analysis, the present paper considers the case where there exists a single constraint function. By defining , this study can be easily extended to the case where multiple constraint functions, i.e., exist, because maximum of finite convex functions is also convex.
2.4 Performance metrics
Given a sequence of decisions generated by some OCO algorithm (e.g., Online Gradient Descent method (Zinkevich 2003)). Under the situation where all loss functions and constraint functions in each round are known in advance, the regret and cumulative hard constraint violation are defined as follows:
(2) | ||||
(3) |
where is the optimal solution to the offline constrained OCO formulated as Eq. (1). The objective of the learner is to generate a sequence of decisions that attains a sub-linear growth of both regret and cumulative constraint violation, that is, and .
2.5 Gradient estimator
In the partial information setting where only limited feedback is available to the learner, we follow the prior works (Flaxman et al. 2005, Agarwal et al. 2010, Zhao et al. 2021). The following result guarantees the gradient estimator with one-point feedback being an unbiased estimator.
Lemma 1.
(Zhao et al. 2021: Lemma 1) For any convex function , define its smoothed version function , where the expectation is taken over the random vector with being the unit ball, i.e., . Then, for any , we have
where the expectation is taken over the random vector with being the unit sphere centered around the origin, i.e., .
Proof.
See Flaxman et al. (2005: Lemma 2.1). ∎
Moreover, as shown in Shamir (2017: Lemma 8), for any convex function and its smoothed version , we have
(4) |
The present study considers a two-point feedback model where the learner is allowed to query two points in each round. Specifically, at round , the learner is allowed to query two points around decision , that is, and , where is a perturbation parameter and is a random unit vector sampled from unit sphere . With two points and , the gradient estimator of the function at is given by
(5) |
where is the dimensionality of the domain . As shown in Agarwal et al. (2010), is norm bounded, that is, we have , where the first inequality holds by the Lipschitz continuity of .
3 Proposed Algorithm
This section presents the proposed algorithm for solving the constrained BCO with two-point feedback. The procedure of the algorithm is shown in Algorithm 1, and this algorithm is motivated by the work in Guo et al. (2022) and the design of the algorithm is related to penalty-based proximal gradient descent method (Cheung and Lou 2017). At round , Algorithm 1 finds the decision vector by solving the following strongly convex optimization problem:
(6) |
where is the penalty variable for controlling the quality of the decision, , is the shrinkage constant, and are predetermined learning rate. Note that the optimization problem in the right-hand side (r.h.s) of Eq. (6) is strongly convex optimization due to the regularizer term, and hence the optimal solution does exist and unique. As is the case with Mahdavi et al. (2012), we optimize the r.h.s. of Eq. (6) on the domain to ensure that randomized two points around are inside the feasible region . As shown in Flaxman et al. (2005), for any and for any unit vector , it holds .
At round , where we find the decision , since we do not have the prior knowledge of the loss function to be minimized, we estimate the loss by the first-order approximation at the previous decision as . Simultaneously, we have no full information of the loss function and hence we cannot access its graient , so we estimate gradient by with two points (line 5). To prevent the constraint from being severely violated, we also introduce the rectified Lagrange multiplier associated with the functional constraint , and add the penalty term to the objective function (6), which is an approximator of the original penalty term , where is the Lagrangian multiplier associated with the constraint . We also add regularization term to stabilize the optimization problem.
We will describe more in detail the role of penalty parameter and its update rule. The penalty parameter is related to the Lagrangian multiplier (denoted by ) associated with the functional constraint , but slightly different because we have no prior knowledge of the constraint functions when making-decision. Instead, we take place the original Lagrangian multiplier with such that is an approximator of . We update the penalty parameter (line 9) as , where the first coordinate of maximum operator is the sum of the old and the rectified constraint function value ; and the second coordinate is the user-determined constant to impose a minimum penalty. This update rule for the penalty parameter prevents the decision determined by solving the problem (6) from being overly aggressive which leads to large constraint violation.
4 Theoretical Analysis
This section provides the theoretical analysis for the Algorithm 1. To facilitate the analysis, let be a function defined by
(7) |
where and is defined as Eq. (5). It is easily seen that holds, and hence we have for any . Moreover, the function defined as Eq. (7) is convex and Lipschitz continuous with Lipschitz constant on , because for any , we have
where the first inequality follows from the triangle inequality, the second inequality follows from the Cauchy-Schwarz inequality, the third inequality follows from for any Lipshitz continuous function and for any , and the last inequality follows from .
To prove Algorithm 1 attains sub-linear bound for both regret and cumulative hard constraint violation, we first show the following result which is a well-known property of a strongly convex function.
Lemma 2.
(Nesterov et al. 2018: Theorem 2.1.8) Let be a convex set. Let be a strongly convex function with modulus on , and let be an optimal solution of , that is, . Then, holds for any .
Proof.
By the definition of strong convexity of , for any , we have
(8) |
Plugging an optimal solution into in the above inequality (8), we have
where the last inequality holds by the first-order optimality condition, . ∎
The following two lemmas play an important role in proving the main theorem (Theorem 1 and Theorem 2). The first one (Lemma 3) is an inequality involving the update rule of Algorithm 1, and the second one (Lemma 4) characterizes the relationship between the current solution in Algorithm 1 and the optimal solution of the offline optimization problem formulated as Eq. (1).
Lemma 3.
(Guo et al. 2022: Lemma 5) Let be a function defined by
(9) |
where and are predetermined learning rate. Let be the optimal solution returned by Algorithm 1 where the gradient is accessible, that is, . Then, for any , we have
(10) |
Proof.
Since is a strongly convex function with modulus , we can apply Lemma 2 to . Thus, we have for any , which completes the proof. ∎
Lemma 4 (Self-bounding Property).
(Guo et al. 2022: Lemma 1) Let be a convex function satisfying Assumption 2. Let be any optimal solution to the offline constrained OCO of Eq. (1) and be the optimal solution returned by Algorithm 1. Then, we have
(11) |
where and are predetermined learning rate.
Proof.
See Guo et al. (2022: Lemma 1). ∎
We are now ready to prove the main results, which state Algorithm 1 achieves a sub-linear bound for both regret (2) and cumulative hard constraint violation (3). We first show the case where the loss functions are convex and constraint functions are fixed throughout the whole round.
4.1 Convex loss function case
Theorem 1.
Let be a sequence of decisions generated by Algorithm 1 and let be an optimal solution to the offline OCO of Eq. (1). Assume that constraint functions are fixed, that is, for any . Define and , where and . Under Assumptions 1, 2 and 3, we have
(12) | ||||
(13) |
Proof.
Part (i): Proof of Eq. (12)
Recall that the function is Lipschitz continuous with Lipschitz constant . Applying Lemma 4 to the convex function defined by Eq. (7), for an optimal solution to the offline optimization problem as Eq. (1), we have
where the last inequality follows from Assumption 1. Plugging in , we have
Since we have , by taking expectation, we have
From the inequality (4), for any optimal solution to the offline OCO as Eq. (1), we have
for any . Therefore, we have
where the second inequality follows by plugging in .
Part (ii): Proof of Eq. (13)
From Lemma 4, for any optimal solution to the offline constrained OCO as Eq. (1), we have
By the definition of , i.e., , and plugging in , we have
where the second inequality is followed by , and we plugging and . By taking summation over , we have
where the second inequality holds from Lemma 5 in Appendix A, which completes the proof. ∎
Remark 1.
By setting constant , Algorithm 1 attains regret bound. This regret bound is compatible with the prior works of unconstrained bandit convex optimization (Agarwal et al. 2010), and is compatible with the result for full-information setting (Guo et al. 2022).
For the case where the constraint functions are time-varying, we can show the following result.
Theorem 2.
Let be a sequence of decisions generated by Algorithm 1. Assume that constraint functions are time-varying. Define , and , where and . Under Assumptions 1, 2 and 3, we have
(14) |
Proof.
By the convexity of and Assumption 3, we can show is upper bounded by for any (Guo et al. 2022: Lemma 2). Applying Lemma 4 to the function defined by Eq. (7), for any , we have
By taking summation over , we have
where the last inequality holds by plugging in . Therefore, we have
where the second inequality follows from Eq. (13) in Theorem 1 and the last inequality holds by plugging in , which completes the proof. ∎
Remark 2.
By setting constant , we can obtain constraint violation bound. This bound is compatible with the result for full-information case (Guo et al. 2022).
4.2 Strongly convex loss function case
We extend the results discussed in the previous subsection to the case where the loss functions are strongly convex. We omit the proofs of the following results here since the technique of the proof is similar to that of Theorem 1 and Theorem 2. These proofs are found in Appendix B and Appendix C. To discuss the strongly convex case, we make the following assumption about loss functions.
Assumption 4 (Strong convexity of loss functions).
The loss function is Lipschitz continuous with Lipschitz constant , and strongly convex on with modulus , i.e., we have
(15) |
for any and for any . For simplicity, we define .
Under Assumption 4, the function defined as Eq. (7) is also strongly convex with modulus , namely, for any . Then, we can show the following results.
Theorem 3.
Let be a sequence of decisions generated by Algorithm 1 and let be an optimal solution to the offline OCO of Eq. (1). Assume that constraint functions are fixed, that is, for any . Define , and , where and . Under Assumptions 1, 3 and 4, we have
Theorem 4.
Let be a sequence of decisions generated by Algorithm 1. Assume that constraint functions are time-varying. Define , where and . Under Assumptions 1, 3 and 4, we have
5 Conclusion and Future Directions
This paper studies the two-point feedback of bandit convex optimization with constraints, in which the loss functions are convex or strongly convex, constraint functions are fixed or time-varying, and the constraint violation is evaluated in terms of cumulative hard constraint violation (Yuan and Lamperski 2018). We present a penalty-based proximal gradient descent algorithm with an unbiased gradient estimator and show that the algorithm attains a sub-linear growth of both regret and cumulative hard constraint violation. It would be of interest to extend this work to the case where both the loss functions and constraint functions are bandit setup as discussed in Cao and Liu (2018), and the case where only one-point bandit feedback is available to the learner. Furthermore, theoretical analysis of dynamic regret, where the comparator sequence can be chosen arbitrarily from the feasible set, would be an important direction for future work.
Acknowledgments and Disclosure of Funding
The author would like to thank Dr. Sho Takemori for making a number of valuable suggestions and advice.
References
- Hazan et al. (2016) Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
- Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
- Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pages 928–936, 2003.
- Mahdavi et al. (2012) Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: online convex optimization with long term constraints. The Journal of Machine Learning Research, 13(1):2503–2528, 2012.
- Yuan and Lamperski (2018) Jianjun Yuan and Andrew Lamperski. Online convex optimization for cumulative constraints. Advances in Neural Information Processing Systems, 31, 2018.
- Guo et al. (2023) Hengquan Guo, Zhu Qi, and Xin Liu. Rectified pessimistic-optimistic learning for stochastic continuum-armed bandit with constraints. In Learning for Dynamics and Control Conference, pages 1333–1344. PMLR, 2023.
- Agarwal et al. (2010) Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Colt, pages 28–40. Citeseer, 2010.
- Zhao et al. (2021) Peng Zhao, Guanghui Wang, Lijun Zhang, and Zhi-Hua Zhou. Bandit convex optimization in non-stationary environments. The Journal of Machine Learning Research, 22(1):5562–5606, 2021.
- Chen et al. (2019) Lin Chen, Mingrui Zhang, and Amin Karbasi. Projection-free bandit convex optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2047–2056. PMLR, 2019.
- Garber and Kretzu (2020) Dan Garber and Ben Kretzu. Improved regret bounds for projection-free bandit convex optimization. In International Conference on Artificial Intelligence and Statistics, pages 2196–2206. PMLR, 2020.
- Cao and Liu (2018) Xuanyu Cao and KJ Ray Liu. Online convex optimization with time-varying constraints and bandit feedback. IEEE Transactions on automatic control, 64(7):2665–2680, 2018.
- Flaxman et al. (2005) Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’05, page 385–394, USA, 2005. Society for Industrial and Applied Mathematics. ISBN 0898715857.
- Guo et al. (2022) Hengquan Guo, Xin Liu, Honghao Wei, and Lei Ying. Online convex optimization with hard constraints: Towards the best of two worlds and beyond. Advances in Neural Information Processing Systems, 35:36426–36439, 2022.
- Duchi et al. (2010) John C Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, volume 10, pages 14–26. Citeseer, 2010.
- Duchi et al. (2008) John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l 1-ball for learning in high dimensions. In Proceedings of the 25th international conference on Machine learning, pages 272–279, 2008.
- Jenatton et al. (2016) Rodolphe Jenatton, Jim Huang, and Cédric Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In International Conference on Machine Learning, pages 402–411. PMLR, 2016.
- Yu and Neely (2020) Hao Yu and Michael J. Neely. A Low Complexity Algorithm with Regret and Constraint Violations for Online Convex Optimization with Long Term Constraints. Journal of Machine Learning Research, 21(1):1–24, 2020.
- Neely (2022) Michael Neely. Stochastic network optimization with application to communication and queueing systems. Springer Nature, 2022.
- Yi et al. (2021) Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Tianyou Chai, and Karl Johansson. Regret and cumulative constraint violation analysis for online convex optimization with long term constraints. In International Conference on Machine Learning, pages 11998–12008. PMLR, 2021.
- Nesterov and Spokoiny (2017) Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.
- Yi et al. (2022) Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Tianyou Chai, and H Karl. Regret and cumulative constraint violation analysis for distributed online constrained convex optimization. IEEE Transactions on Automatic Control, 2022.
- Shamir (2017) Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. The Journal of Machine Learning Research, 18(1):1703–1713, 2017.
- Cheung and Lou (2017) Yiu-ming Cheung and Jian Lou. Proximal average approximated incremental gradient descent for composite penalty regularized empirical risk minimization. Machine Learning, 106:595–622, 2017.
- Nesterov et al. (2018) Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018.
Appendix A Proof of useful inequalities
To show Theorem 1, we present the following results which is similar argument of Guo et al. (2022: Lemma 6).
Lemma 5.
Let be an optimal solution to the offline constrained OCO defined as Eq. (1). Under Assumptions 1 and 2, for any feasible solution , , and , we have
Proof.
The first claim is shown as follows:
where the last inequality holds from the condition of .
The second claim is shown as follows:
where the first inequality follows from Assumption 2, the second inequality follows from Assumption 1, and the third inequality follows as is the case with the inequality .
Appendix B Proof of Theorem 3
Proof.
Similar to the argument of Lemma 3, for any strongly convex function with modulus and for any optimal solution to the offline constrained OCO as Eq. (1), we have
(16) |
Applying the above inequality (16) for the function defined by Eq. (7), we have
(17) |
Note that the function is also strongly convex with modulus under Assumption 4. Since is nonnegative, from Eq. (17), we have
By taking summation over , we have
where the second inequality holds from Assumption 1. Plugging in , we have
Similar to the proof of the convex case, since we have for any and from the inequality (4), we have
where the second inequality holds from Assumption 4 the third inequality follows by letting .