Multi-point Feedback of Bandit Convex Optimization with Hard Constraints

Yasunari Hikima
Artificial Intelligence Laboratory, Fujitsu Limited, Japan
hikima.yasunari@fujitsu.com

Abstract

This paper studies bandit convex optimization with constraints, where the learner aims to generate a sequence of decisions under partial information of loss functions such that the cumulative loss is reduced as well as the cumulative constraint violation is simultaneously reduced. We adopt the cumulative hard constraint violation as the metric of constraint violation, which is defined by $\sum_{t=1}^{T}\max\{g_{t}(\bm{x}_{t}),0\}$ . Owing to the maximum operator, a strictly feasible solution cannot cancel out the effects of violated constraints compared to the conventional metric known as long-term constraints violation. We present a penalty-based proximal gradient descent method that attains a sub-linear growth of both regret and cumulative hard constraint violation, in which the gradient is estimated with a two-point function evaluation. Precisely, our algorithm attains $O(d^{2}T^{\max\{c,1-c\}})$ regret bounds and $O(d^{2}T^{1-\frac{c}{2}})$ cumulative hard constraint violation bounds for convex loss functions and time-varying constraints, where $d$ is the dimensionality of the feasible region and $c\in[\frac{1}{2},1)$ is a user-determined parameter. We also extend the result for the case where the loss functions are strongly convex and show that both regret and constraint violation bounds can be further reduced.

1 Introduction

Bandit Convex Optimization (BCO) is a fundamental framework of sequential decision-making under uncertain environments and with limited feedback, which can be regarded as a structured repeated game between a learner and an environment (Hazan et al. 2016, Lattimore and Szepesvári 2020). In this framework, a learner is given a convex feasible region $\mathcal{X}\subseteq\mathbb{R}^{d}$ and the total number $T$ of rounds. At each round, $t=1,2,\dots,T$ , the learner makes decision $\bm{x}_{t}\in\mathcal{X}$ , and then a convex loss function $f_{t}:\mathcal{X}\to\mathbb{R}$ is revealed. The learner cannot access the loss function $f_{t}$ , but only the bandit feedback is available, i.e., the learner can only observe the value of the loss at the point she committed to, i.e., $f_{t}(\bm{x}_{t})$ . The objective of the learner is to generate a sequence of decisions $\{\bm{x}_{t}\}_{t=1}^{T}\subseteq\mathcal{X}$ that minimizes cumulative loss $\sum_{t=1}^{T}f_{t}(\bm{x}_{t})$ under bandit feedback. The performance of the learner is evaluated in terms of regret, which is defined by

R_{T}\coloneqq\sum_{t=1}^{T}f_{t}(\bm{x}_{t})-\min_{\bm{x}\in\mathcal{X}}\sum_{t=1}^{T}f_{t}(\bm{x}).

This regret measures the difference between the cumulative loss of the learner’s strategy and the minimum possible cumulative loss where the sequence of loss functions $\{f_{t}(\bm{x})\}_{t=1}^{T}$ had been known in advance and the learner could choose the best fixed optimal decision in hindsight.

In many real-world scenarios, the decisions are often subject to some constraints such as budget or resources. In the context of Online Convex Optimization (OCO), where the learner has access to the complete information about the loss functions, a projection operator is typically applied in each round so that the decisions belong to constraints (Zinkevich 2003, Hazan et al. 2016). However, such a projection step is typically a computational bottleneck when the feasible region is complex.

To address the issue of the projection step, Mahdavi et al. (2012) considers online convex optimization with long-term constraints, where the learner aims to generate a sequence of decisions that the decisions satisfy constraints in the long run, instead of requiring to satisfy the constraints in all rounds. They introduce the cumulative soft constraint violation metric defined by $V^{\text{soft}}_{T}\coloneqq\sum_{t=1}^{T}g_{t}(\bm{x}_{t})$ , where $g_{t}(\bm{x})\leq 0$ is the functional constraint to be satisfied. Later, Yuan and Lamperski (2018) consideres strict notion of constraint violation reffered to as cumulative hard constraint violation, which is defined by $V^{\text{hard}}_{T}\coloneqq\sum_{t=1}^{T}\max\{g_{t}(\bm{x}_{t}),0\}$ . This metric overcomes the drawback of cumulative soft constraint violation, and it is suitable for safety-critical systems, in which the failure of constraint violation may result in catastrophic consequences.

To see that the notion of cumulative hard constraint violation is a stronger metric, let us consider the example discussed in Guo et al. (2023). Given a sequence of decisions whose constraint functions are $\{g_{t}(\bm{x}_{t})\}_{t=1}^{T}$ with $T=1000$ such that $g_{t}(\bm{x}_{t})=-1$ if $t$ is odd; otherwise $g_{t}(\bm{x}_{t})=1$ , we have $\sum_{t=1}^{\tau}g_{t}(\bm{x}_{t})\leq 0$ for any $\tau\in\{1,2,\dots,T\}$ , however, the constraint $g_{t}(\bm{x})\leq 0$ is violated at half of rounds. On the other hand, the notion of hard constraint violation can capture the constraint violation since we have $V^{\text{hard}}_{T}=500$ . Thus, the conventional definition of cumulative soft constraint violation $V_{T}^{\text{soft}}$ cannot accurately measure the constraint violation but cumulative hard constraint violation $V_{T}^{\text{hard}}$ can.

Many existing algorithms for BCO with constraints proposed in prior works typically involve projection operators as well as algorithms for OCO with constraints (Agarwal et al. 2010, Zhao et al. 2021), and are generally limited to the simple convex set. Chen et al. (2019), Garber and Kretzu (2020) consider a projection-free algorithm for BCO, but the constraint violation bound has not been reported. Some studies have extended the algorithm for OCO with soft constraints to the bandit setting (Mahdavi et al. 2012, Cao and Liu 2018), however, these algorithms cannot be directly extended to BCO with hard constraints. In other words, there has been no algorithm that can simultaneously achieve sub-linear bound both regret and cumulative hard constraints violation.

The present study focuses on the particular case of multi-point feedback of BCO with constraints, in which the loss functions are convex or strongly convex, and constraint violation is evaluated in terms of hard constraints. This kind of problem widely appears in real-world scenarios such as portfolio management problems, in which the manager has concrete constraints to be satisfied but only has access to the loss function $f_{t}(\cdot)$ at several points close to the decision $\bm{x}_{t}$ . We present a penalty-based proximal gradient descent method which attains both $O(d^{2}T^{\max\{c,1-c\}})$ regret bound and $O(d^{2}T^{1-\frac{c}{2}})$ cumulative hard constraint violation bound, where $d$ is the dimensionality of the feasible region and $c\in[\frac{1}{2},1)$ is a user-determined parameter. Our proposed algorithm is inspired by a gradient estimation in the BCO literature (Flaxman et al. 2005, Agarwal et al. 2010) and an algorithm for OCO with hard constraints (Guo et al. 2022).

1.1 Related work

For OCO with constraints, a projection operator is generally applied to the updated variables to enforce them feasible at each round (Zinkevich 2003, Duchi et al. 2010). However, such projection is typically inefficient to implement due to the high computational effort especially when the feasible region $\mathcal{X}$ is complex (e.g., $\mathcal{X}$ is characterized by multiple inequalities), and efficient projection computation is limited to simple sets such as $\ell_{1}$ -ball or probability simplex (Duchi et al. 2008).

Instead of requiring that the decisions belong to the feasible region in all rounds, Mahdavi et al. (2012) first considers relaxing the notion of constraints by allowing them to be violated at some rounds but requiring them to be satisfied in the long run. This type of OCO is referred to as online convex optimization with long-term constraints, and the performance metric for constraint violation is defined by the cumulative violation of the decisions from the constraints for all rounds, i.e., $V^{\text{soft}}_{T}\coloneqq\sum_{t=1}^{T}g_{t}(\bm{x}_{t})$ referred to as soft constraints. Mahdavi et al. (2012) proposes a primal-dual gradient-based algorithm that attains $O(\sqrt{T})$ regret bound and $O(T^{\frac{3}{4}})$ constraint violations and subsequent researches have been conducted to improve both bounds. Jenatton et al. (2016) extends the algorithm to achieve $O(T^{\max\{c,1-c\}})$ regret bound and $O(T^{1-\frac{c}{2}})$ constraint violation, where $c\in(0,1)$ is a user-determined parameter. Yu and Neely (2020) proposes the drift-plus-penalty based algorithm developed for stochastic optimization in dynamic queue networks (Neely 2022), and prove the algorithm attains $O(\sqrt{T})$ regret bound and $O(1)$ constraint violation bound.

Yuan and Lamperski (2018) proposes the more strict notion of a constraint violation, which is defined by $V^{\text{hard}}_{T}\coloneqq\sum_{t=1}^{T}\max\{g_{t}(\bm{x}_{t}),0\}$ , so as not to cancel out the effect of violated constraints by the strict feasible solution. Such paradigm is later referred to as online convex optimization with hard constraints (Guo et al. 2022). In Yuan and Lamperski (2018), an algorithm that attains $O(T^{\max\{c,1-c\}})$ regret bound and $O(T^{1-\frac{c}{2}})$ constraint violation bound has proposed. Yi et al. (2021) extends the algorithm that attains $O(T^{\max\{c,1-c\}})$ regret bound and $O(T^{\frac{1-c}{2}})$ constraint violation bound, and Yi et al. (2021) also consideres the general dynamic regret bound. Guo et al. (2022) proposes an algorithm that rectifies updated variables and penalty variables and proves the algorithm attains $O(\sqrt{T})$ regret bound and $O(T^{\frac{3}{4}})$ constraint violation for convex loss functions.

In the partial information setting, a learner is limited to accessing the loss functions and thus the learner cannot construct an algorithm by using a gradient of loss functions. Flaxman et al. (2005) considers a one-point feedback model, where only one-point function value is available, and constructed an unbiased estimator of the gradient of the loss functions. By employing the gradient estimator, they applied online gradient descent algorithm (Zinkevich 2003) and proved the algorithm attains $O(d^{\frac{2}{3}}T^{\frac{2}{3}})$ regret bound. Another variant of the feedback model is multi-point feedback, where the learner is allowed to query multiple points of function in each round. Agarwal et al. (2010) and Nesterov and Spokoiny (2017) consideres two-point feedback model and establishes an $O(d^{2}\sqrt{T})$ regret bound for convex loss functions.

Table 1: Regret bound and cumulative constraint violation bound for bandit convex optimization with constraints. The column of “Metric” stands for the metric of constraint violation.

Reference	Bandit	Metric	Loss	Regret	Violation
Flaxman et al. (2005)	$\checkmark$	—	convex	$O(d^{\frac{2}{3}}T^{\frac{2}{3}})$	—
Agarwal et al. (2010)	$\checkmark$	—	convex	$O(d^{2}\sqrt{T})$	—
Agarwal et al. (2010)	$\checkmark$	—	str.-convex	$O(d^{2}\log T)$	—
Mahdavi et al. (2012)	$\checkmark$	soft	convex	$O(\sqrt{T})$	$O(T^{\frac{3}{4}})$
Guo et al. (2022)		hard	convex	$O(\sqrt{T})$	$O(T^{\frac{3}{4}})$
Guo et al. (2022)		hard	str.-convex	$O(\log T)$	$O(\sqrt{T(1+\log T)})$
	$\checkmark$		convex	$O(d^{2}T^{\max\{c,1-c\}})$	$O(d^{2}T^{1-\frac{c}{2}})$
This work	$\checkmark$	hard	str.-convex	$O(d^{2}\log T)$	$O(d^{2}\sqrt{T(1+\log T)})$

1.2 Contribution

This paper focuses on the multi-point feedback BCO with constraints, in which the constraint violation is evaluated in terms of cumulative hard constraint violation. We propose an algorithm (Algorithm 1) for the BCO and show that the proposed algorithm attains an $O(d^{2}T^{\max\{c,1-c\}})$ regret bound and an $O(d^{2}T^{1-\frac{c}{2}})$ cumulative hard constraint violation bound, where $c\in[\frac{1}{2},1)$ is a user-defined parameter (Theorem 1 and Theorem 2). By setting $c=\frac{1}{2}$ , the algorithm attains $O(d^{2}\sqrt{T})$ regret bound and $O(d^{2}T^{\frac{3}{4}})$ constraint violation bound, which is compatible with the prior work for constrained online convex optimization with full-information (Yi et al. 2022, Guo et al. 2022). We also show both regret and constraint violation bounds are reduced to an $O(d^{2}\log T)$ and $O(d^{2}\sqrt{T(1+\log T)})$ , respectively, when the loss functions are strongly convex (Theorem 3 and Theorem 4). The comparison of this study with prior works is summarized in Table 1.

1.3 Organization

The rest of this paper is organized as follows. In Section 2, we introduce necessary preliminaries of BCO with constraints. Section 3 presents the proposed algorithm to solve the BCO with constraints under two-point bandit feedback. In Section 4, we provide a theoretical analysis of regret bound and hard constraint violation bound for both convex and strongly convex loss functions. Finally, Section 5 concludes the present paper and addresses future work.

2 Preliminaries

2.1 Notation

For a vector $\bm{x}=(x_{1},x_{2},\dots,x_{d})^{\top}\in\mathbb{R}^{d}$ , let $\norm{\bm{x}}_{2}$ be the $\ell_{2}$ -norm of $\bm{x}$ , i.e., $\norm{x}_{2}=\sqrt{\bm{x}^{\top}\bm{x}}=\sqrt{\sum_{i=1}^{d}x_{i}^{2}}$ . Let $\left<\bm{x},\bm{y}\right>$ be the inner product of two vectors $\bm{x}$ and $\bm{y}$ . Let $\mathbb{B}^{d}$ and $\mathbb{S}^{d}$ denote the $d$ -dimensional Euclidean ball and unit sphere, and let $\bm{v}\in\mathbb{B}^{d}$ and $\bm{u}\in\mathbb{S}^{d}$ denote the random variables sampled uniformly from $\mathbb{B}^{d}$ and $\mathbb{S}^{d}$ , respectively. For a scalar $z\in\mathbb{R}$ , we denote $[z]_{+}\coloneqq\max\{z,0\}$ . For a Lipschitz continuous function $f:\mathbb{R}^{d}\to\mathbb{R}$ , let $\operatorname{lip}(f)>0$ be the Lipschitz constant of $f$ . We use $[T]$ as a shorthand for the set of positive integers $\{1,2,\dots,T\}$ . Finally, we use the notation $\mathbb{E}_{t}$ as the conditional expectation over the condition of all randomness in the first $t-1$ rounds.

2.2 Assumptions

Following prior works of constrained OCO (Mahdavi et al. 2012, Guo et al. 2022), we make the following standard assumptions on feasible region, loss functions, and constraint functions.

Assumption 1 (Bounded domain).

The feasible region $\mathcal{X}\subseteq\mathbb{R}^{d}$ is a non-empty bounded closed convex set such that $\norm{\bm{x}-\bm{y}}_{2}\leq D$ holds for any $\bm{x},\,\bm{y}\in\mathcal{X}$ .

Assumption 2 (Convexity and Lipschitz continuity of loss functions).

The loss function $f_{t}:\mathcal{X}\to\mathbb{R}$ is convex and Lipschitz continuous with Lipschitz constant $F_{t}>0$ on $\mathcal{X}$ , that is, we have

\displaystyle\absolutevalue{f_{t}(\bm{x})-f_{t}(\bm{y})}\leq F_{t}\norm{\bm{x}-\bm{y}}_{2},

for any $\bm{x},\bm{y}\in\mathcal{X}$ and for any $t\in[T]$ . For simplicity, we define $F:=\max_{t\in[T]}F_{t}$ .

Assumption 3 (Convexity and Lipschitz continuity of constraint functions).

The constraint function $g_{t}:\mathcal{X}\to\mathbb{R}$ is convex and Lipschitz continuous with Lipschitz constant $G_{t}>0$ on $\mathcal{X}$ , that is, we have

\displaystyle\absolutevalue{g_{t}(\bm{x})-g_{t}(\bm{y})}\leq G_{t}\norm{\bm{x}-\bm{y}}_{2},

for any $\bm{x},\bm{y}\in\mathcal{X}$ and for any $t\in[T]$ . For simplicity, we define $G:=\max_{t\in[T]}G_{t}$ .

2.3 Offline constrained OCO

With the full knowledge of loss functions $\{f_{t}(\bm{x})\}_{t=1}^{T}$ and constraint functions $\{g_{t}(\bm{x})\}_{t=1}^{T}$ in all rounds, the offline constrained OCO is formulated as the following convex optimization problem:


$\displaystyle\min_{\bm{x}\in\mathcal{X}}\quad$	$\displaystyle\sum_{t=1}^{T}f_{t}(\bm{x})$	(1a)
subject to	$\displaystyle g_{t}(\bm{x})\leq 0\qquad\forall t\in[T],$	(1b)

where $\mathcal{X}$ is assumed to be a simple convex set (e.g., Euclidean ball, probability simplex) for which the projection onto $\mathcal{X}$ is efficiently computable.

For the sake of simplicity of theoretical analysis, the present paper considers the case where there exists a single constraint function. By defining $g_{t}(\bm{x})\coloneqq\max_{i\in[m]}g^{(i)}_{t}(\bm{x})$ , this study can be easily extended to the case where multiple constraint functions, i.e., $g_{t}^{(i)}(\bm{x})\leq 0\,(i\in[m])$ exist, because maximum of finite convex functions is also convex.

2.4 Performance metrics

Given a sequence of decisions $\{\bm{x}_{t}\}_{t=1}^{T}\subseteq\mathcal{X}$ generated by some OCO algorithm (e.g., Online Gradient Descent method (Zinkevich 2003)). Under the situation where all loss functions $\{f_{t}(\bm{x})\}_{t=1}^{T}$ and constraint functions $\{g_{t}(\bm{x})\}_{t=1}^{T}$ in each round $t=1,2,\dots,T$ are known in advance, the regret and cumulative hard constraint violation are defined as follows:

	$\displaystyle R_{T}$	$\displaystyle\coloneqq\sum_{t=1}^{T}f_{t}(\bm{x}_{t})-\sum_{t=1}^{T}f_{t}(\bm{x}^{\star}),$		(2)
	$\displaystyle V_{T}$	$\displaystyle\coloneqq\sum_{t=1}^{T}\quantity[g_{t}(\bm{x}_{t})]_{+}=\sum_{t=1}^{T}\max\quantity{g_{t}(\bm{x}_{t}),0},$		(3)

where $\bm{x}^{\star}\in\mathcal{X}$ is the optimal solution to the offline constrained OCO formulated as Eq. (1). The objective of the learner is to generate a sequence of decisions that attains a sub-linear growth of both regret and cumulative constraint violation, that is, $\lim\sup_{T\to\infty}\frac{R_{T}}{T}\leq 0$ and $\lim\sup_{T\to\infty}\frac{V_{T}}{T}\leq 0$ .

2.5 Gradient estimator

In the partial information setting where only limited feedback is available to the learner, we follow the prior works (Flaxman et al. 2005, Agarwal et al. 2010, Zhao et al. 2021). The following result guarantees the gradient estimator with one-point feedback being an unbiased estimator.

Lemma 1.

(Zhao et al. 2021: Lemma 1) For any convex function $f:\mathcal{X}\to\mathbb{R}$ , define its smoothed version function $\widehat{f}(\bm{x})=\mathbb{E}_{\bm{v}\in\mathbb{B}^{d}}[f(\bm{x}+\delta\bm{v})]$ , where the expectation is taken over the random vector $\bm{v}\in\mathbb{B}^{d}$ with $\mathbb{B}^{d}$ being the unit ball, i.e., $\mathbb{B}^{d}\coloneqq\quantity{\bm{x}\in\mathbb{R}^{d}\mid\norm{\bm{x}}_{2}\leq 1}$ . Then, for any $\delta>0$ , we have

\displaystyle\mathbb{E}_{\bm{u}\in\mathbb{S}^{d}}\quantity[\frac{d}{\delta}f(\bm{x}+\delta\bm{u})\bm{u}]=\nabla\widehat{f}(\bm{x}),

where the expectation is taken over the random vector $\bm{s}\in\mathbb{S}^{d}$ with $\mathbb{S}^{d}$ being the unit sphere centered around the origin, i.e., $\mathbb{S}^{d}\coloneqq\quantity{\bm{x}\in\mathbb{R}^{d}\mid\norm{\bm{x}}_{2}=1}$ .

Proof.

See Flaxman et al. (2005: Lemma 2.1). ∎

Moreover, as shown in Shamir (2017: Lemma 8), for any convex function $f:\mathcal{X}\to\mathbb{R}$ and its smoothed version $\widehat{f}$ , we have

\displaystyle\sup_{\bm{x}\in\mathcal{X}}\absolutevalue{\widehat{f}(\bm{x})-f(\bm{x})}\leq\delta\operatorname{lip}(f).

(4)

The present study considers a two-point feedback model where the learner is allowed to query two points in each round. Specifically, at round $t\in[T]$ , the learner is allowed to query two points around decision $\bm{x}_{t}$ , that is, $\bm{x}_{t}+\delta\bm{u}_{t}$ and $\bm{x}_{t}-\delta\bm{u}_{t}$ , where $\delta>0$ is a perturbation parameter and $\bm{u}_{t}$ is a random unit vector sampled from unit sphere $\mathbb{S}^{d}$ . With two points $\bm{x}_{t}+\delta\bm{u}_{t}$ and $\bm{x}_{t}-\delta\bm{u}_{t}$ , the gradient estimator of the function $f_{t}$ at $\bm{x}_{t}$ is given by

\displaystyle\widetilde{\nabla}f_{t}\coloneqq\frac{d}{2\delta}\quantity[f_{t}(\bm{x}_{t}+\delta\bm{u}_{t})-f_{t}(\bm{x}_{t}-\delta\bm{u}_{t})]\bm{u}_{t},

(5)

where $d$ is the dimensionality of the domain $\mathcal{X}\subseteq\mathbb{R}^{d}$ . As shown in Agarwal et al. (2010), $\widetilde{\nabla}f_{t}$ is norm bounded, that is, we have $\|\widetilde{\nabla}f_{t}\|_{2}\leq\frac{\delta}{2\delta}\operatorname{lip}(f_{t})\norm{2\delta\bm{u}_{t}}_{2}\leq\operatorname{lip}(f_{t})d$ , where the first inequality holds by the Lipschitz continuity of $f_{t}$ .

Lemma 1 implies that the gradient estimator $\widetilde{\nabla}f_{t}$ is an unbiased estimator of $\nabla\widehat{f}_{t}(\bm{x}_{t})$ , i.e., $\mathbb{E}_{\bm{u}\in\mathbb{S}^{d}}[\widetilde{\nabla}f_{t}]=\nabla\widehat{f}_{t}(\bm{x}_{t})$ , where $\widehat{f}_{t}(\bm{x}_{t})=\mathbb{E}_{\bm{v}\in\mathbb{B}^{d}}[f_{t}(\bm{x}_{t}+\delta\bm{v})]$ is the smoothed version of original function $f_{t}$ . This property holds because the distribution of perturbation $\bm{u}_{t}$ in Eq. (5) is symmetric.

3 Proposed Algorithm

This section presents the proposed algorithm for solving the constrained BCO with two-point feedback. The procedure of the algorithm is shown in Algorithm 1, and this algorithm is motivated by the work in Guo et al. (2022) and the design of the algorithm is related to penalty-based proximal gradient descent method (Cheung and Lou 2017). At round $t\in[T]$ , Algorithm 1 finds the decision vector $\bm{x}_{t+1}$ by solving the following strongly convex optimization problem:

\displaystyle\bm{x}_{t+1}=\arg\min_{\bm{x}\in(1-\xi)\mathcal{X}}\quantity{f_{t}(\bm{x}_{t})+\widetilde{\nabla}f_{t}^{\top}(\bm{x}-\bm{x}_{t})+\lambda_{t}\widehat{g}^{+}_{t}(\bm{x})+\frac{\alpha_{t}}{2}\norm{\bm{x}-\bm{x}_{t}}^{2}_{2}},

(6)

where $\lambda_{t}$ is the penalty variable for controlling the quality of the decision, $\widehat{g}^{+}_{t}(\bm{x})\coloneqq\gamma_{t}[g_{t}(\bm{x})]_{+}$ , $\xi>0$ is the shrinkage constant, and $\alpha_{t}>0,\,\gamma_{t}>0$ are predetermined learning rate. Note that the optimization problem in the right-hand side (r.h.s) of Eq. (6) is strongly convex optimization due to the $\ell_{2}$ regularizer term, and hence the optimal solution $\bm{x}_{t+1}$ does exist and unique. As is the case with Mahdavi et al. (2012), we optimize the r.h.s. of Eq. (6) on the domain $(1-\xi)\mathcal{X}$ to ensure that randomized two points around $\bm{x}_{t}$ are inside the feasible region $\mathcal{X}$ . As shown in Flaxman et al. (2005), for any $\bm{x}\in(1-\xi)\mathcal{X}$ and for any unit vector $\bm{u}\in\mathbb{S}^{d}$ , it holds $\bm{x}\pm\delta\bm{u}\in\mathcal{X}$ .

At round $t$ , where we find the decision $\bm{x}_{t+1}\in\mathcal{X}$ , since we do not have the prior knowledge of the loss function $f_{t+1}(\bm{x})$ to be minimized, we estimate the loss by the first-order approximation at the previous decision $\bm{x}_{t}$ as $\widetilde{f}_{t+1}(\bm{x})=f_{t}(\bm{x}_{t})+\left<\nabla f_{t}(\bm{x}_{t}),\bm{x}-\bm{x}_{t}\right>$ . Simultaneously, we have no full information of the loss function $f_{t}(\bm{x})$ and hence we cannot access its graient $\nabla f_{t}(\bm{x})$ , so we estimate gradient by $\widetilde{\nabla}f_{t}$ with two points (line 5). To prevent the constraint from being severely violated, we also introduce the rectified Lagrange multiplier $\lambda_{t}$ associated with the functional constraint $g_{t}(\bm{x})\leq 0$ , and add the penalty term $\lambda_{t}\widehat{g}_{t}^{+}(\bm{x})$ to the objective function (6), which is an approximator of the original penalty term $\theta_{t}g_{t}(\bm{x})$ , where $\theta_{t}$ is the Lagrangian multiplier associated with the constraint $g_{t}(\bm{x})\leq 0$ . We also add $\ell_{2}$ regularization term $\frac{\alpha_{t}}{2}\norm{\bm{x}-\bm{x}_{t}}^{2}_{2}$ to stabilize the optimization problem.

We will describe more in detail the role of penalty parameter $\lambda_{t}$ and its update rule. The penalty parameter $\lambda_{t}$ is related to the Lagrangian multiplier (denoted by $\theta_{t}$ ) associated with the functional constraint $g_{t}(\bm{x})\leq 0$ , but slightly different because we have no prior knowledge of the constraint functions when making-decision. Instead, we take place the original Lagrangian multiplier $\theta_{t+1}$ with $\lambda_{t}$ such that $\lambda_{t}\widehat{g}^{+}_{t}(\bm{x})$ is an approximator of $\theta_{t}g_{t}(\bm{x})$ . We update the penalty parameter (line 9) as $\lambda_{t+1}=\max\{\lambda_{t}+\gamma_{t+1}[g_{t+1}(\bm{x}_{t})]_{+},\eta_{t+1}\}$ , where the first coordinate of maximum operator is the sum of the old $\lambda_{t}$ and the rectified constraint function value $\gamma_{t+1}[g_{t+1}(\bm{x}_{t})]_{+}$ ; and the second coordinate is the user-determined constant $\eta_{t+1}$ to impose a minimum penalty. This update rule for the penalty parameter prevents the decision determined by solving the problem (6) from being overly aggressive which leads to large constraint violation.

Algorithm 1 A Rectified Bandit Convex Optimization with Hard Constraints under Two-Point Bandit Feedback

0: Total number of rounds

T

, learning rates

\{\alpha_{t}\}_{t=1}^{T}\subseteq\mathbb{R}_{>0},\,\{\gamma_{t}\}_{t=1}^{T}\subseteq\mathbb{R}_{>0},\,\{\eta_{t}\}_{t=1}^{T}\subseteq\mathbb{R}_{>0}

, shrinkage paramaeter

\xi>0

, and perturbation parameter

\delta>0

1: Initialization:

\bm{x}_{1}\in\mathcal{X},\,\lambda_{1}=0

, and set

\widehat{g}^{+}_{1}(\bm{x})=\gamma_{1}[g_{1}(\bm{x})]_{+}

2: for

t=1,2,\dots,T

3: Draw unit vector

\bm{u}_{t}

from

\mathbb{S}^{d}

uniformly at random.

4: Query

f_{t}(\bm{x})

at two points

\bm{x}_{t}+\delta\bm{u}_{t}

and

\bm{x}_{t}-\delta\bm{u}_{t}

5: Compute

\widetilde{\nabla}f_{t}=\frac{d}{2\delta}\quantity[f_{t}(\bm{x}_{t}+\delta\bm{u}_{t})-f_{t}(\bm{x}_{t}-\delta\bm{u}_{t})]\bm{u}_{t}

6: Find the optimal solution

\bm{x}_{t+1}

by solving optimization problem (6).

7: Submit

\bm{x}_{t+1}

, incur loss

f_{t+1}(\bm{x}_{t+1})

and observe constraint

g_{t+1}(\bm{x})

8: Set

\widehat{g}^{+}_{t+1}(\bm{x})=\gamma_{t+1}[g_{t+1}(\bm{x})]_{+}

9: Update the penalty variable as

\lambda_{t+1}=\max\{\lambda_{t}+\gamma_{t+1}[g_{t+1}(\bm{x}_{t})]_{+},\eta_{t+1}\}

10: end for

4 Theoretical Analysis

This section provides the theoretical analysis for the Algorithm 1. To facilitate the analysis, let $h_{t}:\mathcal{X}\to\mathbb{R}$ be a function defined by

h_{t}(\bm{x})\coloneqq\widehat{f}_{t}(\bm{x})+\left<\widetilde{\nabla}f_{t}-\nabla\widehat{f}_{t}(\bm{x}_{t}),\bm{x}\right>,

(7)

where $\widehat{f}_{t}(\bm{x})=\mathbb{E}_{\bm{v}\in\mathbb{B}^{d}}[f_{t}(\bm{x}+\delta\bm{v})]$ and $\widetilde{\nabla}f_{t}$ is defined as Eq. (5). It is easily seen that $\nabla h_{t}(\bm{x}_{t})=\widetilde{\nabla}f_{t}$ holds, and hence we have $\norm{\nabla h_{t}(\bm{x})}_{2}=\|\widehat{\nabla}f_{t}\|_{2}\leq d\text{lip}(f_{t})$ for any $\bm{x}\in\mathcal{X}$ . Moreover, the function $h_{t}$ defined as Eq. (7) is convex and Lipschitz continuous with Lipschitz constant $\operatorname{lip}(h_{t})=3d\operatorname{lip}(f_{t})$ on $\mathcal{X}$ , because for any $\bm{x},\bm{y}\in\mathcal{X}$ , we have

	$\displaystyle\absolutevalue{h(\bm{x})-h(\bm{y})}$	$\displaystyle\leq\absolutevalue{\widehat{f}_{t}(\bm{x})-\widehat{f}_{t}(\bm{y})}+\absolutevalue{\left<\widetilde{\nabla}f_{t}-\nabla\widehat{f}_{t}(\bm{x}_{t}),\bm{x}-\bm{y}\right>}$
		$\displaystyle\leq\operatorname{lip}(\widehat{f}_{t})\norm{\bm{x}-\bm{y}}_{2}+\quantity(\\|\widetilde{\nabla}f_{t}\\|_{2}+\\|\nabla\widehat{f}_{t}(\bm{x}_{t})\\|_{2})\norm{\bm{x}-\bm{y}}_{2}$
		$\displaystyle\leq\operatorname{lip}(\widehat{f}_{t})\norm{\bm{x}-\bm{y}}_{2}+\quantity(\operatorname{lip}(\widehat{f}_{t})d+\operatorname{lip}(\widehat{f}_{t}))\norm{\bm{x}-\bm{y}}_{2}\leq 3d\operatorname{lip}(f_{t})\norm{\bm{x}-\bm{y}}_{2},$

where the first inequality follows from the triangle inequality, the second inequality follows from the Cauchy-Schwarz inequality, the third inequality follows from $\norm{\nabla f(\bm{x})}_{2}\leq\operatorname{lip}(f)$ for any Lipshitz continuous function $f$ and for any $\bm{x}\in\mathcal{X}$ , and the last inequality follows from $\operatorname{lip}(\widehat{f}_{t})=\operatorname{lip}(f_{t})$ .

To prove Algorithm 1 attains sub-linear bound for both regret and cumulative hard constraint violation, we first show the following result which is a well-known property of a strongly convex function.

Lemma 2.

(Nesterov et al. 2018: Theorem 2.1.8) Let $\mathcal{X}\subseteq\mathbb{R}^{d}$ be a convex set. Let $f:\mathcal{X}\to\mathbb{R}$ be a strongly convex function with modulus $\sigma$ on $\mathcal{X}$ , and let $\bm{x}^{\star}\in\mathcal{X}$ be an optimal solution of $f$ , that is, $\bm{x}^{\star}=\arg\min_{\bm{x}\in\mathcal{X}}f(\bm{x})$ . Then, $f(\bm{x})\geq f(\bm{x}^{\star})+\frac{\sigma}{2}\norm{\bm{x}-\bm{x}^{\star}}^{2}_{2}$ holds for any $\bm{x}\in\mathcal{X}$ .

Proof.

By the definition of strong convexity of $f$ , for any $\bm{x},\,\bm{y}\in\mathcal{X}$ , we have

\displaystyle f(\bm{x})\geq f(\bm{y})+\left<\nabla f(\bm{x}),\bm{x}-\bm{y}\right>+\frac{\sigma}{2}\norm{\bm{x}-\bm{y}}^{2}_{2}.

(8)

Plugging an optimal solution $\bm{x}^{\star}\in\mathcal{X}$ into $\bm{y}$ in the above inequality (8), we have

\displaystyle f(\bm{x})

\displaystyle\geq f(\bm{x}^{\star})+\left<\nabla f(\bm{x}^{\star}),\bm{x}-\bm{x}^{\star}\right>+\frac{\sigma}{2}\norm{\bm{x}-\bm{x}^{\star}}^{2}_{2}\geq f(\bm{x}^{\star})+\frac{\sigma}{2}\norm{\bm{x}-\bm{x}^{\star}}^{2}_{2},

where the last inequality holds by the first-order optimality condition, $\left<\nabla f(\bm{x}^{\star}),\bm{x}-\bm{x}^{\star}\right>\geq 0$ . ∎

The following two lemmas play an important role in proving the main theorem (Theorem 1 and Theorem 2). The first one (Lemma 3) is an inequality involving the update rule of Algorithm 1, and the second one (Lemma 4) characterizes the relationship between the current solution $\bm{x}_{t}$ in Algorithm 1 and the optimal solution of the offline optimization problem formulated as Eq. (1).

Lemma 3.

(Guo et al. 2022: Lemma 5) Let $\varphi_{t}:\mathcal{X}\to\mathbb{R}^{d}$ be a function defined by

\displaystyle\varphi_{t}(\bm{x})\coloneqq f_{t}(\bm{x}_{t})+\left<\nabla f_{t}(\bm{x}_{t}),\bm{x}-\bm{x}_{t}\right>+\lambda_{t}\widehat{g}^{+}_{t}(\bm{x})+\frac{\alpha_{t}}{2}\norm{\bm{x}-\bm{x}_{t}}^{2}_{2},

(9)

where $\widehat{g}^{+}_{t}(\bm{x})\coloneqq\gamma_{t}g_{t}(\bm{x})$ and $\alpha_{t}>0,\gamma_{t}>0$ are predetermined learning rate. Let $\bm{x}_{t+1}$ be the optimal solution returned by Algorithm 1 where the gradient $\nabla f_{t}(\bm{x})$ is accessible, that is, $\bm{x}_{t+1}=\arg\min_{\bm{x}\in\mathcal{X}}\varphi_{t}(\bm{x})$ . Then, for any $\bm{x}\in\mathcal{X}$ , we have

\displaystyle\begin{split}&f_{t}(\bm{x}_{t})+\left<\nabla f_{t}(\bm{x}_{t}),\bm{x}_{t+1}-\bm{x}_{t}\right>+\lambda_{t}\widehat{g}^{+}_{t}(\bm{x}_{t+1})+\frac{\alpha_{t}}{2}\norm{\bm{x}_{t+1}-\bm{x}_{t}}^{2}_{2}\\ &\quad\leq f_{t}(\bm{x}_{t})+\left<\nabla f_{t}(\bm{x}_{t}),\bm{x}-\bm{x}_{t}\right>+\lambda_{t}\widehat{g}^{+}_{t}(\bm{x})+\frac{\alpha_{t}}{2}\norm{\bm{x}-\bm{x}_{t}}^{2}_{2}-\frac{\alpha_{t}}{2}\norm{\bm{x}-\bm{x}_{t+1}}^{2}_{2}.\end{split}

(10)

Proof.

Since $\varphi_{t}$ is a strongly convex function with modulus $\alpha_{t}$ , we can apply Lemma 2 to $\varphi_{t}$ . Thus, we have $\varphi_{t}(\bm{x}_{t+1})\leq\varphi_{t}(\bm{x})-\frac{\alpha_{t}}{2}\norm{\bm{x}-\bm{x}_{t+1}}^{2}_{2}$ for any $\bm{x}\in\mathcal{X}$ , which completes the proof. ∎

Lemma 4 (Self-bounding Property).

(Guo et al. 2022: Lemma 1) Let $f_{t}:\mathcal{X}\to\mathbb{R}$ be a convex function satisfying Assumption 2. Let $\bm{x}^{\star}\in\mathcal{X}$ be any optimal solution to the offline constrained OCO of Eq. (1) and $\bm{x}_{t}\in\mathcal{X}$ be the optimal solution returned by Algorithm 1. Then, we have

f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\star})+\lambda_{t}\widehat{g}^{+}_{t}(\bm{x}_{t+1})\leq\frac{F_{t}^{2}}{4\alpha_{t}}+\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2},

(11)

where $\widehat{g}^{+}_{t}(\bm{x})\coloneqq\gamma_{t}g_{t}(\bm{x})$ and $\alpha_{t}>0,\gamma_{t}>0$ are predetermined learning rate.

Proof.

See Guo et al. (2022: Lemma 1). ∎

We are now ready to prove the main results, which state Algorithm 1 achieves a sub-linear bound for both regret (2) and cumulative hard constraint violation (3). We first show the case where the loss functions are convex and constraint functions are fixed throughout the whole round.

4.1 Convex loss function case

Theorem 1.

Let $\{\bm{x}_{t}\}_{t=1}^{T}$ be a sequence of decisions generated by Algorithm 1 and let $\bm{x}^{\star}\in\mathcal{X}$ be an optimal solution to the offline OCO of Eq. (1). Assume that constraint functions are fixed, that is, $g_{t}(\bm{x})=g(\bm{x})$ for any $t\in[T]$ . Define $\alpha_{t}\coloneqq t^{c},\,\gamma_{t}\coloneqq t^{c+\varepsilon},\,\eta_{t}\coloneqq t^{c}$ and $\delta\coloneqq\frac{1}{T}$ , where $c\in[\frac{1}{2},1)$ and $\varepsilon>0$ . Under Assumptions 1, 2 and 3, we have

	$\displaystyle\sum_{t=1}^{T}\quantity[f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\star})]$	$\displaystyle\leq\quantity(\frac{9F^{2}d^{2}}{4(1-c)}+\frac{D^{2}}{2}+2F)T^{\max\{c,1-c\}}=O(d^{2}T^{\max\{c,1-c\}}),$		(12)
	$\displaystyle\sum_{t=1}^{T}[g_{t}(\bm{x}_{t})]_{+}$	$\displaystyle\leq\frac{27F^{2}d^{2}}{4}+\frac{3FdD(1+\varepsilon)}{\varepsilon}+D^{2}=O(d^{2}).$		(13)

Proof.

Similar to the argumant in Flaxman et al. (2005) and Agarwal et al. (2010), letting $\bm{\xi}_{t}\coloneqq\widetilde{\nabla}f_{t}-\nabla\widehat{f}_{t}(\bm{x}_{t})$ , then we have $\mathbb{E}_{t}[\bm{\xi}_{t}]=\bm{0}$ from Lemma 1, and thus, we have $\mathbb{E}_{t}[\bm{\xi}^{\top}\bm{x}]=0$ for any fixed $\bm{x}\in\mathcal{X}$ . Therefore, for any fixed $\bm{x}\in\mathcal{X}$ , we have

\displaystyle\mathbb{E}_{t}[h_{t}(\bm{x})]

\displaystyle=\mathbb{E}_{t}\left[\widehat{f}_{t}(\bm{x})\right]+\mathbb{E}_{t}\left[\bm{\xi}_{t}^{\top}\bm{x}\right]=\widehat{f}_{t}(\bm{x}).

Part (i): Proof of Eq. (12)

Recall that the function $h_{t}$ is Lipschitz continuous with Lipschitz constant $\operatorname{lip}(h_{t})=3F_{t}d$ . Applying Lemma 4 to the convex function $h_{t}$ defined by Eq. (7), for an optimal solution $\bm{x}^{\star}$ to the offline optimization problem as Eq. (1), we have

	$\displaystyle\sum_{t=1}^{T}\quantity[h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})]$	$\displaystyle\leq\sum_{t=1}^{T}\frac{\operatorname{lip}(h_{t})^{2}}{4\alpha_{t}}+\sum_{t=1}^{T}\quantity(\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2})$
		$\displaystyle\leq\frac{9F^{2}d^{2}}{4}\sum_{t=1}^{T}\frac{1}{\alpha_{t}}+\sum_{t=1}^{T}\quantity(\frac{\alpha_{t}}{2}-\frac{\alpha_{t-1}}{2})\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\frac{\alpha_{T}}{2}\norm{\bm{x}^{\star}-\bm{x}_{T+1}}^{2}_{2}$
		$\displaystyle\leq\frac{9F^{2}d^{2}}{4}\sum_{t=1}^{T}\frac{1}{\alpha_{t}}+D^{2}\sum_{t=1}^{T}\quantity(\frac{\alpha_{t}}{2}-\frac{\alpha_{t-1}}{2}),$

where the last inequality follows from Assumption 1. Plugging in $\alpha_{t}=t^{c}$ , we have

\displaystyle\sum_{t=1}^{T}\quantity[h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})]\leq\frac{9F^{2}d^{2}}{4}\cdot\frac{T^{1-c}}{1-c}+\frac{D^{2}}{2}T^{c}=\quantity(\frac{9F^{2}d^{2}}{4(1-c)}+\frac{D^{2}}{2})T^{\max\{c,1-c\}}.

Since we have $\mathbb{E}_{t}\left[h_{t}(\bm{x})\right]=\widehat{f}(\bm{x})$ , by taking expectation, we have

\displaystyle\sum_{t=1}^{T}\quantity[\widehat{f}_{t}(\bm{x}_{t})-\widehat{f}_{t}(\bm{x}^{\star})]\leq\quantity(\frac{9F^{2}d^{2}}{4(1-c)}+\frac{D^{2}}{2})T^{\max\{c,1-c\}}.

From the inequality (4), for any optimal solution $\bm{x}^{\star}\in\mathcal{X}$ to the offline OCO as Eq. (1), we have

\displaystyle f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\star})\leq\widehat{f}_{t}(\bm{x}_{t})-\widehat{f}_{t}(\bm{x}^{\star})+2\delta F_{t},

for any $t\in[T]$ . Therefore, we have

	$\displaystyle\sum_{t=1}^{T}\quantity[f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\star})]$	$\displaystyle\leq\sum_{t=1}^{T}\quantity[\widehat{f}_{t}(\bm{x}_{t})-\widehat{f}_{t}(\bm{x}^{\star})]+\sum_{t=1}^{T}2\delta F_{t}$
		$\displaystyle\leq\quantity(\frac{9F^{2}d^{2}}{4(1-c)}+\frac{D^{2}}{2})T^{\max\{c,1-c\}}+2F$
		$\displaystyle\leq\quantity(\frac{9F^{2}d^{2}}{4(1-c)}+\frac{D^{2}}{2}+2F)T^{\max\{c,1-c\}},$

where the second inequality follows by plugging in $\delta=\frac{1}{T}$ .

Part (ii): Proof of Eq. (13)

From Lemma 4, for any optimal solution $\bm{x}^{\star}\in\mathcal{X}$ to the offline constrained OCO as Eq. (1), we have

\displaystyle\lambda_{t}\widehat{g}^{+}_{t}(\bm{x}_{t+1})\leq\frac{\operatorname{lip}(h_{t})^{2}}{4\alpha_{t}}+\absolutevalue{h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})}+\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2}.

By the definition of $\widehat{g}^{+}_{t}(\bm{x}_{t+1})$ , i.e., $\widehat{g}^{+}_{t+1}(\bm{x})=\gamma_{t}[g_{t}(\bm{x})]_{+}$ , and plugging in $\alpha_{t}=\eta_{t}=t^{c}$ , we have

	$\displaystyle[g_{t}(\bm{x}_{t+1})]_{+}$	$\displaystyle\leq\frac{9F_{t}^{2}d^{2}}{4\lambda_{t}\alpha_{t}\gamma_{t}}+\frac{\absolutevalue{h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})}}{\lambda_{t}\gamma_{t}}+\frac{\alpha_{t}}{2\lambda_{t}\gamma_{t}}\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\frac{\alpha_{t}}{2\lambda_{t}\gamma_{t}}\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2}$
		$\displaystyle\leq\frac{9F_{t}^{2}d^{2}}{4t^{3c+\varepsilon}}+\frac{\absolutevalue{h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})}}{t^{2c+\varepsilon}}+\frac{1}{t^{c+\varepsilon}}\quantity(\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2}),$

where the second inequality is followed by $\lambda_{t}\geq\eta_{t}$ , and we plugging $\alpha_{t}=\eta_{t}=t^{c}$ and $\gamma_{t}=t^{c+\varepsilon}$ . By taking summation over $t=1,2,\dots,T$ , we have

	$\displaystyle\sum_{t=1}^{T}[g_{t}(\bm{x}_{t+1})]_{+}$	$\displaystyle\leq\sum_{t=1}^{T}\frac{9F_{t}^{2}d^{2}}{4t^{3c+\varepsilon}}+\sum_{t=1}^{T}\frac{\absolutevalue{h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})}}{t^{2c+\varepsilon}}+\sum_{t=1}^{T}\frac{1}{t^{c+\varepsilon}}\quantity(\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2})$
		$\displaystyle\leq\frac{27F^{2}d^{2}}{4}+\frac{3FdD(1+\varepsilon)}{\varepsilon}+D^{2},$

where the second inequality holds from Lemma 5 in Appendix A, which completes the proof. ∎

Remark 1.

By setting constant $c=\frac{1}{2}$ , Algorithm 1 attains $O(d^{2}\sqrt{T})$ regret bound. This regret bound is compatible with the prior works of unconstrained bandit convex optimization (Agarwal et al. 2010), and is compatible with the result for full-information setting (Guo et al. 2022).

For the case where the constraint functions are time-varying, we can show the following result.

Theorem 2.

Let $\{\bm{x}_{t}\}_{t=1}^{T}$ be a sequence of decisions generated by Algorithm 1. Assume that constraint functions $g_{t}(\bm{x})$ are time-varying. Define $\alpha_{t}:=t^{c},\,\gamma_{t}:=t^{c+\varepsilon}$ , and $\eta_{t}:=t^{c}$ , where $c\in[\frac{1}{2},1)$ and $\varepsilon>0$ . Under Assumptions 1, 2 and 3, we have

\displaystyle\sum_{t=1}^{T}[g_{t}(\bm{x}_{t})]_{+}\leq\quantity(\frac{27F^{2}d^{2}+G^{2}}{4}+3FdD\quantity(8+\frac{1}{\varepsilon})+2D^{2})T^{1-\frac{c}{2}}=O(d^{2}T^{1-\frac{c}{2}}).

(14)

Proof.

By the convexity of $[g_{t}(\bm{x}_{t})]_{+}$ and Assumption 3, we can show $[g_{t}(\bm{x}_{t})]_{+}-[g_{t}(\bm{x}_{t+1})]_{+}$ is upper bounded by $[g_{t}(\bm{x}_{t})]_{+}-[g_{t}(\bm{x}_{t+1})]_{+}\leq\frac{G^{2}}{4\beta}+\beta\norm{\bm{x}_{t}-\bm{x}_{t+1}}^{2}_{2}$ for any $\beta>0$ (Guo et al. 2022: Lemma 2). Applying Lemma 4 to the function $h_{t}$ defined by Eq. (7), for any $\bm{x}^{\star}\in\mathcal{X}$ , we have

\displaystyle\norm{\bm{x}_{t}-\bm{x}_{t+1}}^{2}_{2}\leq\frac{2}{\alpha_{t}}\quantity(h_{t}(\bm{x}^{\star})-h_{t}(\bm{x}_{t}))+\frac{2}{\alpha_{t}}\left<\nabla h_{t}(\bm{x}_{t}),\bm{x}_{t}-\bm{x}_{t+1}\right>+\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2}.

By taking summation over $t=1,2,\dots,T$ , we have

	$\displaystyle\sum_{t=1}^{T}\norm{\bm{x}_{t}-\bm{x}_{t+1}}^{2}_{2}$
	$\displaystyle\quad\leq\sum_{t=1}^{T}\frac{h_{t}(\bm{x}^{\star})-h_{t}(\bm{x}_{t})}{\frac{1}{2}\alpha_{t}}+\sum_{t=1}^{T}\frac{\left<\nabla h_{t}(\bm{x}_{t}),\bm{x}_{t}-\bm{x}_{t+1}\right>}{\frac{1}{2}\alpha_{t}}+\sum_{t=1}^{T}\quantity(\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2})$
	$\displaystyle\quad\leq\sum_{t=1}^{T}\frac{2\operatorname{lip}(h_{t})D}{\frac{1}{2}\alpha_{t}}+\norm{\bm{x}^{\star}-\bm{x}_{1}}^{2}_{2}\leq\frac{12FdD}{1-c}T^{1-c}+D^{2},$

where the last inequality holds by plugging in $\alpha_{t}=t^{c}$ . Therefore, we have

	$\displaystyle\sum_{t=1}^{T}[g_{t}(\bm{x}_{t})]_{+}$	$\displaystyle\leq\sum_{t=1}^{T}[g_{t}(\bm{x}_{t+1})]_{+}+\frac{G^{2}T}{4\beta}+\beta\sum_{t=1}^{T}\norm{\bm{x}_{t}-\bm{x}_{t+1}}^{2}_{2}$
		$\displaystyle\leq\frac{27F^{2}d^{2}}{4}+\frac{3FdD(1+\varepsilon)}{\varepsilon}+D^{2}+\frac{G^{2}T}{4\beta}+\beta\quantity(\frac{12FdD}{1-c}T^{1-c}+D^{2})$
		$\displaystyle\leq\frac{27F^{2}d^{2}}{4}+\frac{3FdD(1+\varepsilon)}{\varepsilon}+D^{2}+\quantity(\frac{G^{2}}{4}+24FdD+D^{2})T^{1-\frac{c}{2}},$

where the second inequality follows from Eq. (13) in Theorem 1 and the last inequality holds by plugging in $\beta=T^{\frac{c}{2}}$ , which completes the proof. ∎

Remark 2.

By setting constant $c=\frac{1}{2}$ , we can obtain $O(d^{2}T^{\frac{3}{4}})$ constraint violation bound. This bound is compatible with the result for full-information case (Guo et al. 2022).

4.2 Strongly convex loss function case

We extend the results discussed in the previous subsection to the case where the loss functions are strongly convex. We omit the proofs of the following results here since the technique of the proof is similar to that of Theorem 1 and Theorem 2. These proofs are found in Appendix B and Appendix C. To discuss the strongly convex case, we make the following assumption about loss functions.

Assumption 4 (Strong convexity of loss functions).

The loss function $f_{t}:\mathcal{X}\to\mathbb{R}$ is Lipschitz continuous with Lipschitz constant $F_{t}$ , and strongly convex on $\mathcal{X}$ with modulus $\sigma_{t}>0$ , i.e., we have

\displaystyle f_{t}(\bm{y})\geq f_{t}(\bm{x})+\left<\nabla f_{t}(\bm{x}),\bm{y}-\bm{x}\right>+\frac{\sigma_{t}}{2}\norm{\bm{y}-\bm{x}}^{2}_{2},

(15)

for any $\bm{x},\bm{y}\in\mathcal{X}$ and for any $t\in[T]$ . For simplicity, we define $\sigma\coloneqq\max_{t\in[T]}\sigma_{t}$ .

Under Assumption 4, the function $h_{t}:\mathcal{X}\to\mathbb{R}$ defined as Eq. (7) is also strongly convex with modulus $\sigma_{t}$ , namely, $h_{t}(\bm{y})\geq h_{t}(\bm{x})+\left<\nabla h_{t}(\bm{x}),\bm{y}-\bm{x}\right>+\frac{\sigma_{t}}{2}\norm{\bm{y}-\bm{x}}^{2}_{2}$ for any $\bm{x},\bm{y}\in\mathcal{X}$ . Then, we can show the following results.

Theorem 3.

Let $\{\bm{x}_{t}\}_{t=1}^{T}$ be a sequence of decisions generated by Algorithm 1 and let $\bm{x}^{\star}\in\mathcal{X}$ be an optimal solution to the offline OCO of Eq. (1). Assume that constraint functions are fixed, that is, $g_{t}(\bm{x})=g(\bm{x})$ for any $t\in[T]$ . Define $\alpha_{t}\coloneqq\sigma t,\,\gamma_{t}\coloneqq t^{c+\varepsilon},\,\eta_{t}\coloneqq t^{c}$ , and $\delta\coloneqq\frac{1}{\delta}$ , where $c\in[\frac{1}{2},1)$ and $\varepsilon>0$ . Under Assumptions 1, 3 and 4, we have

	$\displaystyle\sum_{t=1}^{T}\quantity[f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\star})]$	$\displaystyle\leq\quantity(\frac{9F^{2}d^{2}}{4\sigma}+2F)\quantity(1+\log T)=O(d^{2}\log T),$
	$\displaystyle\sum_{t=1}^{T}[g_{t}(\bm{x}_{t})]_{+}$	$\displaystyle\leq\frac{27F^{2}d^{2}}{4\sigma}+\frac{3FdD(1+\varepsilon)}{\varepsilon}=O(d^{2}).$

Theorem 4.

Let $\{\bm{x}_{t}\}_{t=1}^{T}$ be a sequence of decisions generated by Algorithm 1. Assume that constraint functions $g_{t}(\bm{x})$ are time-varying. Define $\alpha_{t}\coloneqq\sigma t,\,\gamma_{t}\coloneqq t^{c+\varepsilon},\,\eta_{t}\coloneqq t^{c}$ , where $c\in[\frac{1}{2},1)$ and $\varepsilon>0$ . Under Assumptions 1, 3 and 4, we have

\displaystyle\sum_{t=1}^{T}[g_{t}(\bm{x}_{t})]_{+}\leq\quantity(\frac{27F^{2}d^{2}}{4\sigma}+\frac{G^{2}}{4}+3FdD\quantity(1+\frac{1}{\varepsilon}+\frac{4}{\sigma})+D^{2})\sqrt{T(1+\log T)}.

5 Conclusion and Future Directions

This paper studies the two-point feedback of bandit convex optimization with constraints, in which the loss functions are convex or strongly convex, constraint functions are fixed or time-varying, and the constraint violation is evaluated in terms of cumulative hard constraint violation (Yuan and Lamperski 2018). We present a penalty-based proximal gradient descent algorithm with an unbiased gradient estimator and show that the algorithm attains a sub-linear growth of both regret and cumulative hard constraint violation. It would be of interest to extend this work to the case where both the loss functions and constraint functions are bandit setup as discussed in Cao and Liu (2018), and the case where only one-point bandit feedback is available to the learner. Furthermore, theoretical analysis of dynamic regret, where the comparator sequence can be chosen arbitrarily from the feasible set, would be an important direction for future work.

Acknowledgments and Disclosure of Funding

The author would like to thank Dr. Sho Takemori for making a number of valuable suggestions and advice.

References

Hazan et al. (2016) Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pages 928–936, 2003.
Mahdavi et al. (2012) Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: online convex optimization with long term constraints. The Journal of Machine Learning Research, 13(1):2503–2528, 2012.
Yuan and Lamperski (2018) Jianjun Yuan and Andrew Lamperski. Online convex optimization for cumulative constraints. Advances in Neural Information Processing Systems, 31, 2018.
Guo et al. (2023) Hengquan Guo, Zhu Qi, and Xin Liu. Rectified pessimistic-optimistic learning for stochastic continuum-armed bandit with constraints. In Learning for Dynamics and Control Conference, pages 1333–1344. PMLR, 2023.
Agarwal et al. (2010) Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Colt, pages 28–40. Citeseer, 2010.
Zhao et al. (2021) Peng Zhao, Guanghui Wang, Lijun Zhang, and Zhi-Hua Zhou. Bandit convex optimization in non-stationary environments. The Journal of Machine Learning Research, 22(1):5562–5606, 2021.
Chen et al. (2019) Lin Chen, Mingrui Zhang, and Amin Karbasi. Projection-free bandit convex optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2047–2056. PMLR, 2019.
Garber and Kretzu (2020) Dan Garber and Ben Kretzu. Improved regret bounds for projection-free bandit convex optimization. In International Conference on Artificial Intelligence and Statistics, pages 2196–2206. PMLR, 2020.
Cao and Liu (2018) Xuanyu Cao and KJ Ray Liu. Online convex optimization with time-varying constraints and bandit feedback. IEEE Transactions on automatic control, 64(7):2665–2680, 2018.
Flaxman et al. (2005) Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’05, page 385–394, USA, 2005. Society for Industrial and Applied Mathematics. ISBN 0898715857.
Guo et al. (2022) Hengquan Guo, Xin Liu, Honghao Wei, and Lei Ying. Online convex optimization with hard constraints: Towards the best of two worlds and beyond. Advances in Neural Information Processing Systems, 35:36426–36439, 2022.
Duchi et al. (2010) John C Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, volume 10, pages 14–26. Citeseer, 2010.
Duchi et al. (2008) John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l 1-ball for learning in high dimensions. In Proceedings of the 25th international conference on Machine learning, pages 272–279, 2008.
Jenatton et al. (2016) Rodolphe Jenatton, Jim Huang, and Cédric Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In International Conference on Machine Learning, pages 402–411. PMLR, 2016.
Yu and Neely (2020) Hao Yu and Michael J. Neely. A Low Complexity Algorithm with $O(\sqrt{T})$ Regret and $O(1)$ Constraint Violations for Online Convex Optimization with Long Term Constraints. Journal of Machine Learning Research, 21(1):1–24, 2020.
Neely (2022) Michael Neely. Stochastic network optimization with application to communication and queueing systems. Springer Nature, 2022.
Yi et al. (2021) Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Tianyou Chai, and Karl Johansson. Regret and cumulative constraint violation analysis for online convex optimization with long term constraints. In International Conference on Machine Learning, pages 11998–12008. PMLR, 2021.
Nesterov and Spokoiny (2017) Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.
Yi et al. (2022) Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Tianyou Chai, and H Karl. Regret and cumulative constraint violation analysis for distributed online constrained convex optimization. IEEE Transactions on Automatic Control, 2022.
Shamir (2017) Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. The Journal of Machine Learning Research, 18(1):1703–1713, 2017.
Cheung and Lou (2017) Yiu-ming Cheung and Jian Lou. Proximal average approximated incremental gradient descent for composite penalty regularized empirical risk minimization. Machine Learning, 106:595–622, 2017.
Nesterov et al. (2018) Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018.

Appendix A Proof of useful inequalities

To show Theorem 1, we present the following results which is similar argument of Guo et al. (2022: Lemma 6).

Lemma 5.

Let $\bm{x}^{\star}\in\mathcal{X}$ be an optimal solution to the offline constrained OCO defined as Eq. (1). Under Assumptions 1 and 2, for any feasible solution $\bm{x}\in\mathcal{X}$ , $c\in[\frac{1}{2},1)$ , and $\varepsilon>0$ , we have

	$\displaystyle\sum_{t=1}^{T}\frac{1}{t^{3c+\varepsilon}}\leq\frac{3c+\varepsilon}{3c+\varepsilon-1}\leq 3,$
	$\displaystyle\sum_{t=1}^{T}\frac{\absolutevalue{f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\star})}}{t^{2c+\varepsilon}}\leq\frac{FD(2c+\varepsilon)}{2c+\varepsilon-1},$
	$\displaystyle\sum_{t=1}^{T}\frac{\norm{\bm{x}_{t}-\bm{x}^{\star}}^{2}_{2}-\norm{\bm{x}_{t+1}-\bm{x}^{\star}}^{2}_{2}}{t^{c+\varepsilon}}\leq D^{2}.$

Proof.

The first claim is shown as follows:

\displaystyle\sum_{t=1}^{T}\frac{1}{t^{3c+\varepsilon}}\leq 1+\int_{1}^{T}\frac{1}{t^{3c+\varepsilon}}\,\differential{t}=1+\frac{1-T^{-3c-\varepsilon+1}}{3c+\varepsilon-1}\leq\frac{3c+\varepsilon}{3c+\varepsilon-1}\leq 3,

where the last inequality holds from the condition of $\frac{1}{2}\leq c<1$ .

The second claim is shown as follows:

\displaystyle\sum_{t=1}^{T}\frac{\absolutevalue{f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\star})}}{t^{2c+\varepsilon}}

\displaystyle\leq\sum_{t=1}^{T}\frac{F_{t}\norm{\bm{x}_{t}-\bm{x}^{\star}}_{2}}{t^{2c+\varepsilon}}\leq\sum_{t=1}^{T}\frac{FD}{t^{2c+\varepsilon}}\leq\frac{FD(2c+\varepsilon)}{2c+\varepsilon-1},

where the first inequality follows from Assumption 2, the second inequality follows from Assumption 1, and the third inequality follows as is the case with the inequality $\sum_{t=1}^{T}\frac{1}{t^{3c+\varepsilon}}\leq\frac{3c+\varepsilon}{3c+\varepsilon-1}$ .

The last claim is shown as follows:

	$\displaystyle\sum_{t=1}^{T}\frac{\norm{\bm{x}_{t}-\bm{x}^{\star}}^{2}_{2}-\norm{\bm{x}_{t+1}-\bm{x}^{\star}}^{2}_{2}}{t^{c+\varepsilon}}$
	$\displaystyle\quad=\norm{\bm{x}_{1}-\bm{x}^{\star}}^{2}_{2}+\sum_{t=2}^{T}\quantity(\frac{1}{t^{c+\varepsilon}}-\frac{1}{(t-1)^{c+\varepsilon}})\norm{\bm{x}_{t}-\bm{x}^{\star}}^{2}_{2}-\frac{\norm{\bm{x}_{T+1}-\bm{x}^{\star}}^{2}_{2}}{T^{c+\varepsilon}}$
	$\displaystyle\quad\leq D^{2}+D^{2}\sum_{t=2}^{T}\quantity(\frac{1}{t^{c+\varepsilon}}-\frac{1}{(t-1)^{c+\varepsilon}})$
	$\displaystyle\quad=D^{2}+D^{2}\quantity(\frac{1}{T^{c+\varepsilon}}-1)\leq D^{2},$

where the first inequality follows from Assumption 1. ∎

Appendix B Proof of Theorem 3

Proof.

Similar to the argument of Lemma 3, for any strongly convex function $f_{t}$ with modulus $\sigma_{t}>0$ and for any optimal solution $\bm{x}^{\star}\in\mathcal{X}$ to the offline constrained OCO as Eq. (1), we have

\displaystyle f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\star})+\lambda_{t}\widehat{g}^{+}_{t}(\bm{x}_{t+1})\leq\frac{F_{t}^{2}}{4\alpha_{t}}-\frac{\sigma_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}+\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2}.

(16)

Applying the above inequality (16) for the function $h_{t}$ defined by Eq. (7), we have

\displaystyle h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})+\lambda_{t}\widehat{g}^{+}_{t}(\bm{x}_{t+1})\leq\frac{9F_{t}^{2}d^{2}}{4\alpha_{t}}-\frac{\sigma_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}+\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2}.

(17)

Note that the function $h_{t}$ is also strongly convex with modulus $\sigma_{t}$ under Assumption 4. Since $\lambda_{t}\widehat{g}^{+}_{t}(\bm{x}_{t+1})$ is nonnegative, from Eq. (17), we have

\displaystyle h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})\leq\frac{9F_{t}^{2}d^{2}}{4\alpha_{t}}-\frac{\sigma_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}+\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2}.

By taking summation over $t=1,2,\dots,T$ , we have

	$\displaystyle\sum_{t=1}^{T}[h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})]$	$\displaystyle\leq\sum_{t=1}^{T}\frac{9F_{t}^{2}d^{2}}{4\alpha_{t}}+\sum_{t=1}^{T}\quantity(\frac{\alpha_{t}}{2}-\frac{\alpha_{t-1}}{2}-\frac{\sigma_{t}}{2})\norm{\bm{x}_{t}-\bm{x}}^{2}_{2}$
		$\displaystyle\leq\frac{9F^{2}d^{2}}{4}\sum_{t=1}^{T}\frac{1}{\alpha_{t}}+D^{2}\sum_{t=1}^{T}\quantity(\frac{\alpha_{t}}{2}-\frac{\alpha_{t-1}}{2}-\frac{\sigma}{2}),$

where the second inequality holds from Assumption 1. Plugging in $\alpha_{t}=\sigma t$ , we have

\displaystyle\sum_{t=1}^{T}[h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})]\leq\frac{9F^{2}d^{2}}{4}\sum_{t=1}^{T}\frac{1}{\sigma t}\leq\frac{9F^{2}d^{2}}{4\sigma}\quantity(1+\int_{1}^{T}\frac{1}{t}\differential{t})=\frac{9F^{2}d^{2}}{4\sigma}\quantity(1+\log T).

Similar to the proof of the convex case, since we have $\mathbb{E}_{t}\left[h_{t}(\bm{x})\right]=\widehat{f}_{t}(\bm{x})$ for any $\bm{x}\in\mathcal{X}$ and from the inequality (4), we have

	$\displaystyle\sum_{t=1}^{T}[f_{t}(\bm{x}_{t})-f_{t}(\bm{x}^{\star})]$	$\displaystyle\leq\sum_{t=1}^{T}[\widehat{f}_{t}(\bm{x}_{t})-\widehat{f}_{t}(\bm{x}^{\star})]+\sum_{t=1}^{T}2\delta F_{t}$
		$\displaystyle\leq\frac{9F^{2}d^{2}}{4\sigma}\quantity(1+\log T)+2F$
		$\displaystyle\leq\quantity(\frac{9F^{2}d^{2}}{4\sigma}+2F)\quantity(1+\log T),$

where the second inequality holds from Assumption 4 the third inequality follows by letting $\delta=\frac{1}{T}$ .

Next, we show the cumulative hard constraint violation bound for fixed constraints. From Eq. (17), we have

\displaystyle\lambda_{t}\widehat{g}^{+}_{t}(\bm{x}_{t+1})\leq\frac{9F^{2}_{t}d^{2}}{4\alpha_{t}}+\absolutevalue{h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})}-\frac{\sigma_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}+\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\frac{\alpha_{t}}{2}\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2}.

By the definition of $\widehat{g}_{t}(\bm{x})=\gamma_{t}[g_{t}(\bm{x})]_{+}$ , we have

\displaystyle[g_{t}(\bm{x}_{t+1})]_{+}

\displaystyle\leq\frac{9F^{2}_{t}d^{2}}{4\alpha_{t}\lambda_{t}\gamma_{t}}+\frac{\absolutevalue{h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})}}{\lambda_{t}\gamma_{t}}-\frac{\sigma_{t}}{2\lambda_{t}\gamma_{t}}\quantity(\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}+\norm{\bm{x}^{\star}-\bm{x}_{t}}^{2}_{2}-\norm{\bm{x}^{\star}-\bm{x}_{t+1}}^{2}_{2})

By taking summation over $t=1,2,\dots,T$ , and plugging $\alpha_{t}=\sigma t,\,\gamma_{t}=t^{c}$ into the above inequality, we have the following result:

	$\displaystyle\sum_{t=1}^{T}[g_{t}(\bm{x}_{t+1})]_{+}$	$\displaystyle\leq\frac{9F^{2}d^{2}}{4\sigma}\sum_{t=1}^{T}\frac{1}{t^{3c+\varepsilon}}+\sum_{t=1}^{T}\frac{\absolutevalue{h_{t}(\bm{x}_{t})-h_{t}(\bm{x}^{\star})}}{t^{2c+\varepsilon}}$
		$\displaystyle\leq\frac{27F^{2}d^{2}}{4\sigma}+\frac{3FdD(1+\varepsilon)}{\varepsilon}=O(d^{2}),$

where the second inequality follows from Lemma 5. ∎

Appendix C Proof of Theorem 4

Proof.

Similar to the proof of Theorem 2, we can show the upper bound of $\sum_{t=1}^{T}\norm{\bm{x}_{t}-\bm{x}_{t+1}}^{2}_{2}$ as

\displaystyle\sum_{t=1}^{T}\norm{\bm{x}_{t}-\bm{x}_{t+1}}^{2}_{2}

\displaystyle\leq\sum_{t=1}^{T}\frac{2\operatorname{lip}(h_{t})D}{\frac{1}{2}\alpha_{t}}+D^{2}\leq\frac{12FdD}{\sigma}\quantity(1+\log T)+D^{2},

where the second inequality holds by plugging in $\alpha_{t}=\sigma t$ . Since we have $[g_{t}(\bm{x}_{t})]_{+}-[g_{t}(\bm{x}_{t+1})]_{+}\leq\frac{G^{2}}{4\beta}+\beta\norm{\bm{x}_{t}-\bm{x}_{t+1}}^{2}_{2}$ for any $\beta>0$ , by letting $\beta=\sqrt{\frac{T}{1+\log T}}$ , we have

	$\displaystyle\sum_{t=1}^{T}\quantity([g_{t}(\bm{x}_{t})]_{+}-[g_{t}(\bm{x}_{t+1})]_{+})$	$\displaystyle\leq\frac{G^{2}T}{4\beta}+\beta\sum_{t=1}^{T}\norm{\bm{x}_{t}-\bm{x}_{t+1}}^{2}_{2}$
		$\displaystyle\leq\frac{G^{2}T}{4\beta}+\beta\quantity(\frac{12FdD}{\sigma}\quantity(1+\log T)+D^{2})$
		$\displaystyle\leq\quantity(\frac{G^{2}}{4}+\frac{12FdD}{\sigma})\sqrt{T(1+\log T)}+D^{2}\sqrt{\frac{T}{1+\log T}}.$

Finally, by combining the result of Theorem 3, we have the following result:

	$\displaystyle\sum_{t=1}^{T}[g_{t}(\bm{x}_{t})]_{+}$	$\displaystyle\leq\sum_{t=1}^{T}[g_{t}(\bm{x}_{t+1})]_{+}+\quantity(\frac{G^{2}}{4}+\frac{12FdD}{\sigma})\sqrt{T(1+\log T)}+D^{2}\sqrt{\frac{T}{1+\log T}}$
		$\displaystyle\leq\frac{27F^{2}d^{2}}{4\sigma}+\frac{3FdD(1+\varepsilon)}{\varepsilon}+\quantity(\frac{G^{2}}{4}+\frac{12FdD}{\sigma})\sqrt{T(1+\log T)}+D^{2}\sqrt{\frac{T}{1+\log T}}$
		$\displaystyle\leq\quantity(\frac{27F^{2}d^{2}}{4\sigma}+\frac{G^{2}}{4}+3FdD\quantity(1+\frac{1}{\varepsilon}+\frac{4}{\sigma})+D^{2})\sqrt{T(1+\log T)}.$

∎